JP2022059732A - Information processing device, control method, and program

Info

Publication number: JP2022059732A
Application number: JP2020167499A
Authority: JP
Inventors: 敬己下郡山; Itsuki Shimokooriyama
Original assignee: Canon Marketing Japan Inc; Canon IT Solutions Inc
Current assignee: Canon Marketing Japan Inc; Canon IT Solutions Inc
Priority date: 2020-10-02
Filing date: 2020-10-02
Publication date: 2022-04-14

Abstract

To provide a technique for confirming and correcting a result of voice recognition in a short time. SOLUTION: The present invention is an information processing device that acquires recognition data as a voice recognition result of voice data and text data corresponding to the voice data, including: specifying means for specifying a portion of attention in the recognition data or the text data; and display control means for, when displaying a result of comparing the recognition data and the text data, identifying and displaying the specified portion. SELECTED DRAWING: Figure 3

Description

The present invention relates to a technique for correcting the recognition results of a voice recognition engine.

One way to help Deaf people obtain information on equal terms with hearing people is to display the speaker's utterances as text. Examples include closed captioning on television, summary transcription at lectures, and automatic conversion of utterances into character strings by voice recognition for display on a PC screen or the like.

In all of these methods, however, there is a considerable time lag between the moment something is actually spoken and the moment a Deaf person can read it as text. In television captioning, subtitles may appear about 10 to 15 seconds after the utterance.

Take a weather segment as an example: the explanation of "tomorrow's forecast" ends, and only after the scene has changed to "the forecast for the coming week" does the forecaster's utterance from the preceding "tomorrow's forecast" part appear as subtitles. To understand the subtitles, the viewer has to remember the previous graphic.

Likewise, subtitles for the footage of one news story may appear after the footage has already switched to the next story, which is very confusing.

To solve this problem, methods of providing subtitles as close to real time as possible are being studied.

In the technique described in Patent Document 1, the manuscripts of a news program or the like are registered in a text correction device in advance and compared with the recognition result obtained by voice recognition of the utterances of an announcer or the like. Specifically, the following methods are used.

First, on the assumption that a recognition result of N morphemes contains some misrecognition, the position of the corresponding N morphemes in the manuscript is identified. If a morpheme has been misrecognized, it is replaced with the morpheme in the manuscript and presented.

In a second method, a morpheme in the recognition result is not assumed to be wrong. The readings of the recognition-result morpheme and the manuscript morpheme are compared, and if the readings are judged to differ greatly, the morpheme output by voice recognition is treated as correct rather than as a misrecognition and is not corrected.

Patent Document 1: Japanese Unexamined Patent Publication No. 2012-128188

In the technique of Patent Document 1, when a morpheme sequence in the previously prepared manuscript has the highest similarity to the morpheme sequence of the recognition result, and that similarity exceeds a threshold specified in advance, the two are compared and a decision is made either to correct the result to match the manuscript (e.g., step S16 in FIG. 6) or not to correct it (when the result of step S28 in FIG. 10 is "NO").

In other words, whether or not to correct is decided only on the premise that the morpheme sequence of the recognition result exists in the manuscript. If the speaker says something that is not in the manuscript, the similarity is judged not to exceed the threshold ("NO" in step S15 in FIG. 6, "NO" in step S25 in FIG. 10), and the recognition result is output as is. Moreover, the similarity between the recognition result and the manuscript is calculated mechanically by comparing morpheme sequences, and every correction based on that result is made automatically with the manuscript as the reference.

Nevertheless, confirmation by a human proofreader remains important. Because real-time performance is a priority, a human does not check and correct every character string that is a subtitle candidate; instead, the proofreader handles only the important portions. Which points should be checked depends on the purpose of the subtitles and on the content of the manuscript: in some cases even a subtle nuance must not be wrong, while in others there is relatively little problem as long as numerical values and the like are conveyed accurately.

In Japanese, adjunct words (for example, particles, auxiliary verbs, and interjections added at the end of a sentence) can change the nuance considerably, so in some cases such morphemes should also be checked manually.

An object of the present invention is to provide a technique for confirming and correcting voice recognition results in a short time.

The present invention is an information processing device that acquires recognition data, which is a voice recognition result of voice data, and text data corresponding to the voice data, the device comprising: specifying means for specifying a portion of interest in the recognition data or the text data; and display control means for displaying the specified portion in an identifiable manner when displaying a result of comparing the recognition data with the text data.

According to the present invention, it is possible to provide a technique for confirming and correcting voice recognition results in a short time.

FIG. 1 is a diagram showing an example of a system configuration according to an embodiment of the present invention.
FIG. 2 is a block diagram showing an example of the hardware configuration of the voice recognition server and the information processing terminal according to the embodiment of the present invention.
FIG. 3 is a diagram showing an example of a functional configuration according to the embodiment of the present invention.
FIG. 4 is an example of a manuscript and related information stored in the manuscript storage unit according to the embodiment of the present invention.
FIG. 5 is an example of rules stored in the confirmation rule storage unit according to the embodiment of the present invention.
FIG. 6 is an example of setting items stored in the setting information storage unit according to the embodiment of the present invention.
FIG. 7 is a diagram showing an example of a flowchart of the process of displaying voice recognition results for the proofreader according to the embodiment of the present invention.
FIG. 8 is a diagram showing an example of a flowchart of the process of adding confirmation information to voice recognition results according to the embodiment of the present invention.
FIG. 9 is an example of display information consisting of voice recognition results with confirmation information added according to the embodiment of the present invention (proofreader screen).
FIG. 10 is an example of a display screen for the proofreader according to the embodiment of the present invention (proofreader screen).
FIG. 11 is an example of a screen that displays voice recognition results as subtitles according to the embodiment of the present invention (screen of a television or the like).

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

FIG. 1 is a diagram showing an example of a system configuration according to an embodiment of the present invention. The system according to the embodiment includes a voice recognition server 101 and information processing terminals 102 (102a for the speaker, 102b for the proofreader, and 102c for display).

The user inputs voice through a microphone 104 connected to the information processing terminal 102a. The information processing terminal 102a transmits the voice to the voice recognition server 101, where it is converted into a character string; the character string is sent to the information processing terminal 102b (for the proofreader), which displays it and presents it to the proofreader.

The information processing terminals 102a to 102c may each provide both voice input and character string output. As for the information processing terminals 102 used for output, a single terminal may serve as both the display terminal 102c and the proofreader terminal 102b described later, or each may be a dedicated terminal. Output is made to a display device connected to the information processing terminal 102, but a configuration using a projector or the like is also included in the system configuration according to the embodiment of the present invention. When a projector is used, there may be only one information processing terminal 102, the one for the speaker, and everyone may read the character string of the voice recognition result projected on a screen by the projector connected to that information processing terminal 102a. In that case, the speaker or another user may act as the proofreader and correct misrecognitions directly on the speaker's information processing terminal 102a.

Furthermore, the voice recognition server 101 may reside in the cloud, in which case users of this system may use the functions of the voice recognition server 101 described later as a cloud service. That is, the voice recognition unit 322 described later may be a function on another server called from the voice recognition server 101, or a service in the cloud. Configurations in which these services are used as another server or as cloud services are also included in the system configuration according to the embodiment of the present invention.

The information processing terminals 102a to 102c described in the configuration example provide both input and output, but they may be divided into input-only and output-only terminals.

The voice recognition server 101 and the information processing terminals 102a to 102c may be housed in the same enclosure. That is, software capable of voice recognition may be installed on one of the information processing terminals 102a to 102c in FIG. 1 so that it also serves as the voice recognition server 101.

FIG. 2 is a block diagram showing an example of a hardware configuration applicable to the voice recognition server 101 and the information processing terminal 102 according to the embodiment of the present invention.

As shown in FIG. 2, the information processing server 100, the recognition server 101, and the information processing terminal 102 each have a configuration in which a CPU (Central Processing Unit) 201, a RAM (Random Access Memory) 203, a ROM (Read Only Memory) 202, an input controller 205, a video controller 206, a memory controller 207, a communication I/F controller 208, and the like are connected via a system bus 204.

The CPU 201 comprehensively controls each device and controller connected to the system bus 204.

The ROM 202 or the external memory 211 stores the BIOS (Basic Input/Output System) and the OS (Operating System), which are control programs of the CPU 201, as well as the various programs, described later, that are needed to realize the functions executed by each server or each PC. Information necessary for carrying out the present invention is also stored there. The external memory may be a database.

The RAM 203 functions as the main memory, work area, and the like of the CPU 201. The CPU 201 realizes various operations by loading the programs and the like needed for processing from the ROM 202 or the external memory 211 into the RAM 203 and executing the loaded programs.

The input controller 205 controls input from a keyboard (KB) 209 and from a pointing device such as a mouse (not shown).

The video controller 206 controls display on a display device such as the display 210. The display device may be a liquid crystal display or the like. These are used by an administrator as needed.

The memory controller 207 controls access to the external memory 211, such as an external storage device (hard disk (HD)) that stores a boot program, various applications, font data, user files, edit files, various data, and the like; a flexible disk (FD); or a CompactFlash (registered trademark) memory connected via an adapter to a PCMCIA (Personal Computer Memory Card International Association) card slot.

The communication I/F controller 208 connects to and communicates with external devices via a network and executes communication control processing on the network. For example, communication using TCP/IP (Transmission Control Protocol/Internet Protocol) is possible.

The CPU 201 enables display on the display 210 by, for example, rasterizing outline fonts into a display information area in the RAM 203. The CPU 201 also enables user instructions via a mouse cursor (not shown) or the like on the display 210.

The various programs, described later, for realizing the present invention are recorded in the external memory 211 and are executed by the CPU 201 after being loaded into the RAM 203 as needed.

FIG. 3 is a diagram showing an example of a functional configuration according to the embodiment of the present invention. As described with reference to FIG. 1, the functions of the information processing terminal 102 for the speaker (102a), for the proofreader (102b), and for display (102c) may be provided on separate terminals or on a shared terminal, so they are described here without distinction.

The voice acquisition unit 311 receives the speaker's utterances as voice data from a microphone or the like built into or connected to the information processing terminal 102, and the voice data transmission unit 312 transmits the voice data to the voice recognition server 101.

The voice recognition server 101 passes the voice data received by the voice data receiving unit 321 to the voice recognition unit 322, which converts it into a character string. The confirmation information addition unit 323 then adds to the character string the confirmation information to be presented to the proofreader. In doing so, it refers to the manuscript stored in the manuscript storage unit 331 (detailed in FIG. 4) and the rules and patterns stored in the confirmation rule storage unit 332 (detailed in FIG. 5). The result with the confirmation information added is sent to the information processing terminal 102 by the confirmation information transmission unit 324.

The confirmation information receiving unit 313 of the information processing terminal 102 receives the voice recognition result with the confirmation information added, and the proofreading display unit 314 displays it on a display device connected to the information processing terminal 102. The proofreading unit 315 accepts operations by the proofreader, and the proofreading result transmission unit 316 sends the proofreading result to the voice recognition server 101.

The proofreading result reception unit 325 of the voice recognition server 101 receives the proofreading result, subtitles are generated based on the voice recognition result, the confirmation information, and the proofreading result, and the subtitle distribution unit 326 transmits the subtitles for display.

In the information processing terminal 102, the subtitle receiving unit 216 receives the subtitles transmitted by the subtitle distribution unit 326 and displays them on the display device. The subtitles need not actually be displayed on this information processing terminal 102. For example, they may be forwarded to a device that superimposes them on news or weather forecast video, or, in the case of television broadcasting, to equipment that distributes the video to homes and the like. In other words, the subtitles received by the subtitle receiving unit 216 may of course be further transmitted (distributed) from the corresponding information processing terminal 102 to their final display destination.

FIG. 4 is an example of a manuscript and related information stored in the manuscript storage unit according to the embodiment of the present invention. In this figure, manuscript example 400 is a manuscript that the speaker (an announcer or the like) is scheduled to read aloud. The text of this manuscript is divided in advance according to some criterion and stored in the manuscript storage unit 331.

The criterion for dividing the text of the manuscript is not particularly limited or prescribed in the present invention; well-known options include the number of morphemes obtained by morphological analysis and the length of the character string. In general, any length that allows a meaningful similarity to the voice recognition result to be calculated is sufficient.

The manuscript storage unit 331 stores each unit into which the text is divided as one row (content 405). The manuscript number 401 indicates which manuscript each row belongs to; the rows storing the manuscript shown as manuscript 2 (manuscript example 400) are grouped together with "2" in the manuscript number 401. In the manuscript storage unit 331 of FIG. 4, information on the manuscripts with manuscript numbers 1 to 5 is registered.

The rule number 402 is grouped in the same way. The confirmation rule for manuscript number 2 corresponds to "1" in the rule number 501 of FIG. 5, described later, and confirmation information for the proofreader is added according to the rule with that number. The time 403 is the time at which the manuscript is expected to be read aloud. The actual time may differ, but when there are several similar news items, the time is used to judge which item's manuscript is being read.

The information type 404 indicates whether the character string in the content 405 is "in manuscript" or "outside manuscript". "In manuscript" means the string was obtained by dividing manuscript example 400. "Outside manuscript" means text that is not part of the manuscript read aloud by the announcer or the like (manuscript example 400) but gives a one- or two-line summary of the news to viewers, including those other than the Deaf viewers who read the voice recognition result as subtitles; it is superimposed on the video on the screen, as with the outside-manuscript display character string 1102 in FIG. 11. In the present invention, it is distinguished from subtitles transcribed from the voice recognition result.

The state 406 indicates whether each row has already been read aloud by the announcer or the like. Three values are used, "completed", "in use", and "unused", but "completed" is not treated as a strict guarantee that the same utterance will never occur again. An announcer may, at his or her own discretion, reread a part of the manuscript that has already been read once (for example, while the connection to a reporter in the field is down). Even so, a row marked "completed" is less likely to be read aloud, and the state can serve as a reference, for example, when choosing between that row and a similar statement in another news item and thereby judging whether the news itself has moved on to a different manuscript number.
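The row layout of FIG. 4 can be pictured with a small data structure. The following is only a sketch under stated assumptions: the field names, the sample values, and the weight of 0.5 for "completed" rows are hypothetical and do not come from the publication, which says only that completed rows are less likely to be read again and must not be excluded outright.

```python
from dataclasses import dataclass
from enum import Enum


class InfoType(Enum):
    IN_MANUSCRIPT = "in manuscript"
    OUTSIDE_MANUSCRIPT = "outside manuscript"


class State(Enum):
    UNUSED = "unused"
    IN_USE = "in use"
    COMPLETED = "completed"


@dataclass
class ManuscriptRow:
    manuscript_no: int            # manuscript number 401
    rule_no: int                  # rule number 402
    time: str                     # time 403 (expected reading time)
    info_type: InfoType           # information type 404
    content: str                  # content 405
    state: State = State.UNUSED   # state 406


def state_weight(row: ManuscriptRow, completed_weight: float = 0.5) -> float:
    """Return a multiplier for a row's similarity score.

    Completed rows are down-weighted rather than excluded, because an
    announcer may reread them. The default of 0.5 is an assumed value.
    """
    return completed_weight if row.state is State.COMPLETED else 1.0


# Sample values below are made up for illustration.
done = ManuscriptRow(2, 1, "19:05", InfoType.IN_MANUSCRIPT, "sample text", State.COMPLETED)
fresh = ManuscriptRow(2, 1, "19:05", InfoType.IN_MANUSCRIPT, "sample text")
```

Multiplying a similarity score by `state_weight(row)` would keep completed rows as candidates while favoring rows that have not yet been read.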

FIG. 5 is an example of the rules stored in the confirmation rule storage unit 332 according to the embodiment of the present invention. Even when the manuscript and the voice recognition result differ, not all of the information is important. With real-time performance in mind, important portions are therefore identified within the character string of the voice recognition result, and information prompting confirmation (confirmation information) is added so that it can be presented to the proofreader. In the present embodiment, the rules are applied to the character string of the voice recognition result provided to the proofreader and confirmation information is added to it, but in actual processing the information may instead be added to the manuscript. In that case, some information can be added in advance, when the manuscript is registered in the manuscript storage unit 331, rather than at the time of voice recognition. For example, information on parts of speech and adjunct words, described later, may be added to the manuscript beforehand, and the confirmation information may be copied to the character string of the voice recognition result when the two are compared. Any method may be used as long as the information can be provided to the information processing terminal 102b (for the proofreader).

Several rules are described below, but this embodiment is only a technical example; it goes without saying that appropriate rules must be created according to the actual news content and the circumstances on site.

The rule number 501 is an identification number for linking to the rule number 402 in FIG. 4. The identification display condition 502 specifies whether the proofreader is prompted to confirm only when the corresponding portions of the manuscript and the voice recognition result do not match ("mismatch only") or even when they match ("including match"). For example, getting a person's name wrong in the news can lead to a human rights issue, so personal names might be checked by the proofreader even when the manuscript and the voice recognition result match. This is because the registered manuscripts cover multiple news items, so even a match may mean that the voice recognition result has matched a different part of the manuscripts.
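The two values of the identification display condition 502 amount to a simple decision. The sketch below assumes the condition is stored as one of the two strings quoted above and that the corresponding portions have already been aligned; the function name is hypothetical.

```python
def needs_confirmation(condition: str, manuscript_part: str, recognized_part: str) -> bool:
    """Decide whether a portion of interest is shown to the proofreader.

    condition is the identification display condition 502:
      "mismatch only"   -> prompt only when the two portions differ
      "including match" -> prompt even when they are identical
    """
    if condition == "including match":
        return True
    if condition == "mismatch only":
        return manuscript_part != recognized_part
    raise ValueError(f"unknown identification display condition: {condition!r}")
```

Under this sketch, a rule for personal names would use "including match", so that a name is highlighted even when it agrees with the manuscript.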

The applied rule 503 is associated with the rule name 504 in the rule details 500, described later. For example, the entry "numeric expression (date, time, amount)" means that the rules whose rule name 504 in the rule details 500 is "numeric expression (date)", "numeric expression (time)", or "numeric expression (amount)" are all applied to the character string.
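The shorthand in the applied rule 503 can be expanded mechanically into rule names 504. This is a minimal sketch that assumes the entry is written as a base name followed by a comma-separated list in parentheses; the function name and the exact string format are assumptions for illustration.

```python
import re
from typing import List


def expand_applied_rule(applied_rule: str) -> List[str]:
    """Expand 'base (a, b, c)' into ['base (a)', 'base (b)', 'base (c)'].

    An entry without parentheses is returned unchanged as a single name.
    """
    match = re.fullmatch(r"\s*(.+?)\s*\((.+)\)\s*", applied_rule)
    if match is None:
        return [applied_rule.strip()]
    base, variants = match.group(1), match.group(2)
    return [f"{base} ({variant.strip()})" for variant in variants.split(",")]


rule_names = expand_applied_rule("numeric expression (date, time, amount)")
```

Each resulting name can then be looked up in the rule details 500 to obtain its pattern 505.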

The pattern 505 is used to extract characteristic expressions from the character string of the manuscript or the voice recognition result by using part-of-speech information of morphemes, character patterns, and the like. Some voice recognition services assign a part of speech to each morpheme; alternatively, the resulting character string may be morphologically analyzed within the processing of this embodiment to assign parts of speech. Character patterns include, for example, character types (kanji, hiragana, katakana, numerals, etc.) and prefixes and suffixes of personal names. Prefecture names and the like may also be registered in a dictionary (not shown) and matched as character strings. Identifying such parts of speech and character patterns with regular expressions or the like is a well-known technique (for example, Japanese Unexamined Patent Publication No. 2001-125911).
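As an illustration of character-pattern matching, the sketch below uses regular expressions for the three numeric rules named in FIG. 5. The concrete patterns are hypothetical and deliberately simple; the publication does not list the actual patterns, and part-of-speech information from a morphological analyzer is not used here.

```python
import re
from typing import List, Tuple

# Hypothetical patterns 505, keyed by rule name 504.
PATTERNS = {
    "numeric expression (date)": re.compile(r"\d+月\d+日"),
    "numeric expression (time)": re.compile(r"\d+時(?:\d+分)?"),
    "numeric expression (amount)": re.compile(r"\d+(?:万|億)?円"),
}


def find_portions_of_interest(text: str, rule_names: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, rule name) for every match, ordered by position."""
    hits = []
    for name in rule_names:
        for match in PATTERNS[name].finditer(text):
            hits.append((match.start(), match.end(), name))
    return sorted(hits)


sample = "10月2日の午後3時に500万円"
portions = find_portions_of_interest(sample, list(PATTERNS))
```

The returned spans are what a proofreader screen could highlight; a dictionary lookup for prefecture names could be added as another entry that returns spans in the same form.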

Adjunct words can also be important items for the proofreader to check. A wrong negation auxiliary verb conveys exactly the opposite fact, and a wrong tense is misleading. To a lesser degree, an expression such as "～ですね" in a report on a criminal case can give a casual and inappropriate impression. In interviews or weather forecasts, on the other hand, casual expressions are not particularly inappropriate, and checking them may be unnecessary, which reduces the proofreader's burden. For adjunct words, determining the degree of potential misunderstanding from their sequence and calculating a "confirmation importance" accordingly is also a well-known technique (for example, Japanese Unexamined Patent Publication No. 2004-133714; that document refers to the "risk level" of expressions in complaints and the like, which is treated here as an importance). The adjunct words may be a single word or a sequence of several. For a sequence, the sequence of parts of speech or characters may be expressed as a pattern using a regular expression or the like, as with the patterns described above. The permissible range of this risk level may be written in the identification display condition 502 so that portions exceeding it are presented to the proofreader. It may also be specified as the adjunct word judgment threshold in FIG. 6.
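The threshold comparison for adjunct words can be sketched as follows. The importance scores and the lookup-table approach are assumptions made for illustration; the publication refers to a known technique for computing the score and does not give values.

```python
from typing import Dict, Tuple

# Hypothetical confirmation importance per adjunct-word sequence.
ADJUNCT_IMPORTANCE: Dict[Tuple[str, ...], float] = {
    ("ない",): 0.9,        # negation: a mistake reverses the meaning
    ("た",): 0.6,          # past tense: a mistake is misleading
    ("です", "ね"): 0.4,   # casual sentence ending
}


def confirmation_importance(adjuncts: Tuple[str, ...]) -> float:
    """Look up the importance of an adjunct sequence (0.0 if unknown)."""
    return ADJUNCT_IMPORTANCE.get(adjuncts, 0.0)


def should_present(adjuncts: Tuple[str, ...], threshold: float) -> bool:
    """Present the sequence to the proofreader when it exceeds the threshold."""
    return confirmation_importance(adjuncts) > threshold
```

With these assumed scores, a threshold of 0.5 (say, for an interview) would leave "ですね" unflagged, while a lower threshold of 0.3 (say, for a criminal case report) would flag it.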

図６は、本発明の実施形態に係る設定情報記憶部に記憶される設定項目の一例である。個々の設定について説明をしていく。 FIG. 6 is an example of the setting items stored in the setting information storage unit according to the embodiment of the present invention. Each setting is described below.

音声認識において一定時間入力がなければ（人の声が入力されなければ）、文章として区切れたと認識する。そのための区切り時間を指定する値が、発話区切り時間であり、例では０．５秒としている。例えばこの区切りによって音声認識結果としての１文が指定され、その文字列を図４の「内容」と比較して類似しているか否かの判定をしても良い。ただし、この場合、原稿作成者と発話者の区切りに関する認識が異なれば、文字列の位置的なずれにより、原稿記憶部３３１の特定の内容と一致しない場合がある。そのため、類似比較形態素数で指定された形態素の数、あるいは文字列の長さを設定としても良い。特許文献１においては、常に４つの形態素で１行となし、またある行の後半の２形態素と次の行の２形態素に同一のものを指定する。すなわち少しずつずらしていくが、本発明の実施の形態でも同様の方法を用いても良い。本発明の実施の形態としては、類似しているかどうかを判定するために格納する方法であれば、どのような方法でも良い。 If there is no input for a certain period of time in voice recognition (if no human voice is input), it is recognized as a sentence. The value that specifies the break time for that purpose is the utterance break time, which is 0.5 seconds in the example. For example, one sentence as a voice recognition result may be specified by this delimiter, and the character string may be compared with the "content" of FIG. 4 to determine whether or not they are similar. However, in this case, if the recognition regarding the delimiter between the manuscript creator and the speaker is different, it may not match the specific content of the manuscript storage unit 331 due to the positional deviation of the character string. Therefore, the number of morphemes specified by the similar comparison morpheme number or the length of the character string may be set. In Patent Document 1, four morphemes are always used as one line, and the same two morphemes in the latter half of one line and the next two morphemes are specified. That is, although it is shifted little by little, the same method may be used in the embodiment of the present invention. As the embodiment of the present invention, any method may be used as long as it is a storage method for determining whether or not they are similar.
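The overlapping comparison units described above (four morphemes per line, with the last two morphemes of one line repeated as the first two of the next) can be sketched in Python as follows. The function name `morpheme_windows` and its parameters are hypothetical; the morpheme list is assumed to come from a separate morphological analysis step.

```python
def morpheme_windows(morphemes, size=4, overlap=2):
    """Split a morpheme list into windows of `size` items that share
    `overlap` items with the preceding window."""
    step = size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than size")
    windows = []
    for start in range(0, len(morphemes), step):
        windows.append(morphemes[start:start + size])
        # Stop once a window has reached the end of the list.
        if start + size >= len(morphemes):
            break
    return windows
```

Because each window overlaps its neighbor, a recognition result that is shifted relative to the manuscript's line breaks can still find a window that it resembles.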

次に原稿確認範囲について説明する。例では１０行となっている。図４の原稿記憶部３３１で現在使用中の行（原稿番号２の２行目）である時に、次に発話者の音声認識結果が、１０行以上離れた行と類似している場合、原稿の位置として不適切ではないか、と判断するものである。 Next, the manuscript confirmation range will be described. In the example, it is 10 lines. When the line currently in use in the manuscript storage unit 331 of FIG. 4 is, say, the second line of manuscript number 2, and the next voice recognition result of the speaker is similar to a line 10 or more lines away, the position in the manuscript is judged to be possibly inappropriate.

次の行に進むのであれば「市内に住む女性の銀行口座から」という文に類似した認識結果が得られるはずであるが「警察の調べによりますと」に類似したとする。この場合、可能性としては次のことが考えられる。何らかの事情で現在読み上げている原稿２を最初から繰り返した場合、あるいは原稿２を飛ばして原稿３に移行し、その３行目に相当する発話をした場合、また緊急のニュースが入り、原稿２、３の何れでもない（原稿にはない）ニュースをアナウンサーが発話している場合、などである。 If the speaker were proceeding to the next line, a recognition result similar to the sentence "市内に住む女性の銀行口座から" ("from the bank account of a woman living in the city") should be obtained; suppose instead that the result is similar to "警察の調べによりますと" ("according to the police investigation"). The following possibilities can then be considered: for some reason the speaker has gone back to the beginning of manuscript 2, which is currently being read; the speaker has skipped the rest of manuscript 2, moved to manuscript 3, and spoken the utterance corresponding to its third line; or urgent news has come in and the announcer is speaking news that belongs to neither manuscript 2 nor manuscript 3 (news not in any manuscript).

この場合、いずれが正しいのかを確認するため、この音声認識結果に原稿番号を確認する情報を付与して、情報処理装置１０２ｂ（校正者用）に送付しても良い。図１０では、その状況を示している。校正者操作画面１０００に校正対象文字列１００１とともに、操作パネル１００２を表示し校正者に原稿３に切り替えるか等確認させるようにしても良い。緊急ニュース（原稿なし）の場合は「校正者は全てを確認してください」という指示が提示されても良い。また、原稿の切り替えが選択された場合には、それ以降の処理では、切り替えた後の原稿とその行を中心にして後続の処理を継続するようにしても良い。 In this case, in order to confirm which is correct, information for confirming the manuscript number may be added to the voice recognition result and sent to the information processing apparatus 102b (for the proofreader). FIG. 10 shows the situation. The operation panel 1002 may be displayed on the proofreader operation screen 1000 together with the proofreading target character string 1001 so that the proofreader can confirm whether to switch to the manuscript 3. In the case of urgent news (without manuscript), the instruction "The proofreader should check everything" may be presented. Further, when switching of the original is selected, in the subsequent processing, the subsequent processing may be continued centering on the original after switching and its line.

いずれにしても一定の基準を満たさなかった場合にはその認識結果は使用できないと言うことではなく、その認識結果をどのように利用するか校正者が判断可能なようにするものである。 In any case, it does not mean that the recognition result cannot be used if a certain standard is not met, but it allows the proofreader to judge how to use the recognition result.

また図４の原稿では原稿番号４０１に数字のみ記載していたが、図１０のように内容が分かるタイトルを付与しても良い。 Further, in the manuscript of FIG. 4, only numbers are entered in the manuscript number 401, but a title that indicates the content may be given instead, as shown in FIG. 10.

次に図７、図８のフローチャートを用いて、本願発明の実施形態に係る処理の流れを説明する。 Next, the flow of processing according to the embodiment of the present invention will be described with reference to the flowcharts of FIGS. 7 and 8.

ステップＳ７０１においては、情報処理端末１０２ａ（発話者用）に接続されたマイク１０４から、発話者が入力した音声を受け付けて、ステップＳ７０２で当該音声データを音声認識サーバ１０１に送信する。 In step S701, the voice input by the speaker is received from the microphone 104 connected to the information processing terminal 102a (for the speaker), and the voice data is transmitted to the voice recognition server 101 in step S702.

音声認識サーバ１０１は、前記情報処理端末１０２ａから送信された音声データをステップＳ７０３で受信し、ステップＳ７０４にて音声認識処理をして文字列に変換する。このとき、連続して入力された音声データの無音状態が図６の「発話区切り時間」より長い部分を検知して、無音状態の部分で音声データを区切ってから音声認識処理に渡しても良い。また、音声認識エンジンによっては無音状態で自動的に区切った結果を出力するものもありその結果に基づき、文字列を原稿の１行に対応するように区切っても良い。 The voice recognition server 101 receives the voice data transmitted from the information processing terminal 102a in step S703, and in step S704 performs voice recognition processing to convert it into a character string. At this time, portions of the continuously input voice data where silence lasts longer than the "utterance break time" of FIG. 6 may be detected, and the voice data may be split at those silent portions before being passed to the voice recognition processing. Some voice recognition engines also output results that are automatically split at silence; based on such results, the character string may be split so that each piece corresponds to one line of the manuscript.
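Splitting at silence longer than the utterance break time can be sketched in Python as follows. The sketch assumes that a separate voice activity detection step has already produced a sorted list of voiced intervals in seconds; the function name `split_by_silence` and the input format are hypothetical.

```python
def split_by_silence(voiced, break_time=0.5):
    """Group voiced intervals into utterances.

    voiced: list of (start_sec, end_sec) tuples sorted by start time.
    Two intervals belong to the same utterance unless the silence
    between them is longer than break_time.
    """
    segments = []
    for start, end in voiced:
        if segments and start - segments[-1][1] <= break_time:
            # Short pause: extend the current utterance.
            segments[-1] = (segments[-1][0], end)
        else:
            # Long silence (or first interval): start a new utterance.
            segments.append((start, end))
    return segments
```

Each returned segment would then be passed to the recognition engine as one unit corresponding to a candidate manuscript line.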

ステップＳ７０５については、図８のフローチャートで説明する。ステップＳ８０１においては、このフローチャートが何れのステップから呼び出されたかを判断する。現在の説明では、ステップＳ７０４から続いて呼び出されているので、ステップＳ８０２に進む。 Step S705 will be described with reference to the flowchart of FIG. 8. In step S801, it is determined from which step this flowchart was called. In the present description, it is called following step S704, so the process proceeds to step S802.

ステップＳ８０２においては、音声認識した結果の文字列に基づき、原稿記憶部３３１の何れの行の内容（文字列）が類似しているかをリストアップする。現時点で原稿２に着目していても原稿３など他の部分からリストアップしても良い。また、文字列ではなく形態素解析をして、形態素列として類似する行をリストアップしても良い。なお後述するが図８のフローチャートを完了した後、図７のフローチャートに戻り、校正者の判断により再び図８に戻ってくる場合がある。その場合には、ステップＳ７１０に続く処理としての呼び出しとなり、ステップＳ８０１の分岐でステップＳ８０３に進むことになるので、ステップＳ８０２が実行されるのは最初の１回だけである。即ち、原稿の中から類似の行をリストアップする処理は最初だけである。 In step S802, based on the character string as a result of voice recognition, which line content (character string) of the document storage unit 331 is similar is listed. Even if the manuscript 2 is focused at the present time, it may be listed from other parts such as the manuscript 3. Alternatively, a morphological analysis may be performed instead of a character string to list similar lines as a morpheme string. As will be described later, after completing the flowchart of FIG. 8, the flowchart of FIG. 7 may be returned and the proofreader may return to FIG. 8 again at the discretion of the proofreader. In that case, the call is made as a process following step S710, and the branch of step S801 proceeds to step S803. Therefore, step S802 is executed only once at the beginning. That is, the process of listing similar lines from the manuscript is only the first.

ステップＳ８０３においては、ステップＳ８０２にてリストアップした原稿の類似行のうち、最も類似しているものを選択し、着目する行とする。前述の通り、図８のフローチャートは繰り返し呼び出されることがあるが、２回目の実行では１回目に選択された行は候補から外されているので、常に最も類似するものを選択すればよい。最初にリストアップしたものを基準にすると、図８のフローチャートが呼び出された回数に合わせて、１位、２位、３位と次の候補行に順次着目することになる。 In step S803, among the similar lines of the manuscripts listed in step S802, the most similar lines are selected and set as the lines of interest. As described above, the flowchart of FIG. 8 may be called repeatedly, but in the second execution, the row selected in the first time is excluded from the candidates, so the most similar one should always be selected. Based on the one listed first, the first, second, and third place and the next candidate line are sequentially focused on according to the number of times the flowchart of FIG. 8 is called.
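The listing in step S802 and the repeated selection in step S803 can be sketched in Python as follows. The embodiment does not prescribe a similarity measure; `difflib.SequenceMatcher` from the standard library is used here only as a stand-in, and the class name `CandidateLines` is hypothetical.

```python
from difflib import SequenceMatcher


class CandidateLines:
    """Ranks manuscript lines by similarity once (as in S802) and hands
    out the best remaining candidate on each call (as in S803)."""

    def __init__(self, recognized, manuscript_lines):
        scored = [
            (SequenceMatcher(None, recognized, line).ratio(), index)
            for index, line in enumerate(manuscript_lines)
        ]
        # Highest similarity first; ties are broken by line order.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        self._remaining = [index for _, index in scored]

    def next_candidate(self):
        """Return the index of the most similar line not yet tried,
        or None when every candidate has been used."""
        return self._remaining.pop(0) if self._remaining else None
```

Because a selected candidate is removed from the list, each return from step S710 moves on to the next-ranked line without repeating the listing.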

ステップＳ８０４においては、音声認識結果の文字列と、原稿からリストアップした類似行の文字列との差分を抽出し、差分を示す情報を記憶する。 In step S804, the difference between the character string of the voice recognition result and the character string of the similar line listed from the manuscript is extracted, and the information indicating the difference is stored.
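A minimal Python sketch of the difference extraction in step S804 follows. It uses the standard-library `difflib.SequenceMatcher` at the character level, which is an assumption; the embodiment could equally compare morpheme sequences. The function name `extract_differences` and the dictionary keys are hypothetical.

```python
from difflib import SequenceMatcher


def extract_differences(recognized, manuscript):
    """Return the spans of `recognized` that differ from `manuscript`,
    together with the corresponding manuscript text."""
    differences = []
    matcher = SequenceMatcher(None, recognized, manuscript)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            differences.append({
                "start": i1,                      # span in the recognition result
                "end": i2,
                "recognized": recognized[i1:i2],
                "manuscript": manuscript[j1:j2],  # correct text for the balloon
            })
    return differences
```

The stored `manuscript` value is what would later be shown to the proofreader as the correct string for the differing span.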

ステップＳ８０５においては、確認ルール記憶部３３２に格納された原稿と対応するルールに従って、校正者に確認させたい部分を抽出し、確認情報を付与する。さらにステップＳ８０６においては、確認情報が付与された付属語（付属語列）の確認重要度を算出し、付属語判断閾値を超えている等を判定して必要なら確認情報を付与する。 In step S805, the portions that the proofreader should check are extracted according to the rules in the confirmation rule storage unit 332 that correspond to the manuscript, and confirmation information is attached to them. Further, in step S806, the confirmation importance of each adjunct word (adjunct word sequence) to which confirmation information has been attached is calculated, it is determined, for example, whether the importance exceeds the adjunct word determination threshold, and confirmation information is attached if necessary.

ステップＳ８０７においては、現在着目している行が、直前に完了した原稿の行から進行する範囲としては遠い位置にあるか否かを判断する。例えば、原稿２の最後にいる場合に原稿３の最初の行を類似行として着目しても進行としては自然である。これは図６の原稿確認範囲（１０行）を基準にして判断することが可能である。しかしながら原稿２の最初にいる場合に、原稿３の後方を類似行として着目した場合、誤りの可能性がある。またさらに原稿記憶部３３１の状態４０６が「未」の行に進行することは自然であるが「完」の行に進行することは自然ではない。勿論、実際に原稿を読み返すことなどもあり、正しいか否かは校正者の判断が必要であるため、ここでは範囲外であるマークを付与し、校正者が判断可能な識別子とするだけである。その識別子を付与するか否かを判断する。付与する必要があると判断する場合には、ステップＳ８０８に進む。付与する必要がないと判断した場合には、図８のフローチャートを終了する。 In step S807, it is determined whether the line currently of interest is too far from the manuscript line completed immediately before to be a plausible next position. For example, when the speaker is at the end of manuscript 2, it is natural for the first line of manuscript 3 to become the similar line of interest. This can be judged against the manuscript confirmation range (10 lines) of FIG. 6. However, when the speaker is at the beginning of manuscript 2 and a line late in manuscript 3 becomes the similar line of interest, an error is possible. Furthermore, it is natural to proceed to a line whose state 406 in the manuscript storage unit 331 is "未" (not yet read), but not to a line whose state is "完" (completed). Of course, a manuscript may actually be read again, and whether the position is correct must be judged by the proofreader; here, therefore, an out-of-range mark is merely attached as an identifier on which the proofreader can base that judgment. This step determines whether to attach that identifier. If it is determined that the identifier needs to be attached, the process proceeds to step S808. If not, the flowchart of FIG. 8 ends.
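The determination in step S807 can be sketched in Python as follows. The sketch assumes that lines are numbered consecutively across all manuscripts in the manuscript storage unit, and that a candidate whose state is "完" is flagged in the same way as a distant one; both are assumptions, and the function name `is_out_of_range` is hypothetical.

```python
def is_out_of_range(current_line, candidate_line, candidate_state, confirm_range=10):
    """Decide whether to attach the out-of-range mark to a candidate line.

    current_line / candidate_line: positions counted over the whole
    manuscript store (assumed numbering).
    candidate_state: "未" (not yet read) or "完" (completed).
    """
    too_far = abs(candidate_line - current_line) >= confirm_range
    already_read = candidate_state == "完"
    return too_far or already_read
```

The result only controls whether the mark is attached; the recognition result itself is still sent on so that the proofreader can make the final judgment.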

ステップＳ８０８において、現時点での音声認識結果は、直前に完了した原稿の行から進行する範囲としては非常に遠い位置で範囲外である旨の情報を付与する。これで図８のフローチャートを完了し、図７のフローチャートのステップＳ７０５が終わった状態に戻る。 In step S808, the voice recognition result at the present time gives information that the range is very far from the line of the document completed immediately before and is out of the range. This completes the flowchart of FIG. 8 and returns to the state in which step S705 of the flowchart of FIG. 7 is completed.

図７のフローチャートの説明に戻る。ステップＳ７０６においては、音声認識結果及び付与された確認情報を情報処理端末１０２ｂ（校正者用）に送信する。 Returning to the description of the flowchart of FIG. 7. In step S706, the voice recognition result and the given confirmation information are transmitted to the information processing terminal 102b (for the proofreader).

ステップＳ７０７においては、情報処理端末１０２ｂ（校正者用）が音声認識結果及び付与された確認情報を受信し、ステップＳ７０８において校正者が作業可能なように接続された表示装置に表示する。表示された例は図９を用いて後述する。 In step S707, the information processing terminal 102b (for the proofreader) receives the voice recognition result and the attached confirmation information, and in step S708 displays them on the connected display device so that the proofreader can work on them. A display example will be described later with reference to FIG. 9.

ステップＳ７０９においては、図９の表示情報に基づいて校正者が音声認識結果に対して行った修正、判断などの操作を受け付ける。この判断には、図１０を用いて既に説明した原稿の切り替えを認めるか否かの判断も含む。 In step S709, operations such as corrections and decisions that the proofreader performs on the voice recognition result based on the display information of FIG. 9 are accepted. These decisions include the decision, already described with reference to FIG. 10, on whether to approve switching manuscripts.

ステップＳ７１０は、ステップＳ７０９の操作の結果として原稿の切り替えをするか否かの判断が含まれている場合に、続く処理の流れを分岐させるための判定である。校正者によって原稿の切り替えを認められた場合には、ステップＳ８０３で着目している原稿内の候補の行を使用して良いことになり処理は後続のステップＳ７１１に進む（ＹＥＳの場合）。そうでなければ候補の行を選択し治す必要があるため処理を音声認識サーバ１０１のステップＳ７０５に戻す（ＮＯの場合）。ステップＳ７０５の説明は図８のフローチャートとして前述したものと同じだが、ステップＳ８０２は最初の実行時に既に類似行を原稿からリストアップしているのでスキップし（ステップＳ８０１の分岐で「Ｓ７１０の続き」となり）、類似行の候補となる行から次の順位のものに着目して処理を進めていく。 Step S710 is a determination for branching the subsequent processing when the operation in step S709 includes a decision on whether to switch manuscripts. If the proofreader approves the switch, the candidate line in the manuscript selected in step S803 may be used, and the process proceeds to step S711 (YES). Otherwise, a candidate line must be selected again, so the process returns to step S705 on the voice recognition server 101 (NO). Step S705 is as described above with the flowchart of FIG. 8, except that step S802 is skipped because the similar lines were already listed from the manuscript in the first execution (the branch at step S801 takes the "continuation of S710" path), and processing proceeds with the next-ranked candidate among the similar lines.

ステップＳ７１１においては、ステップＳ７０９において校正者が行った校正結果（最終的な表示文字列）を情報処理端末１０２ｃ（表示用）に送信する。 In step S711, the result of the proofreading performed by the proofreader in step S709 (the final display character string) is transmitted to the information processing terminal 102c (for display).

ステップＳ７１２において、情報処理端末１０２ｃ（表示用）は校正結果（最終的な表示文字列）を受信し、ステップＳ７１３において情報処理端末１０２ｃ（表示用）に接続された表示装置に表示する。 In step S712, the information processing terminal 102c (for display) receives the proofreading result (the final display character string), and in step S713 displays it on the display device connected to the information processing terminal 102c (for display).

以上で、図７、図８のフローチャートを用いて本願発明の実施形態における処理フローの一例に関する説明を完了する。 This completes the description of an example of the processing flow according to the embodiment of the present invention using the flowcharts of FIGS. 7 and 8.

図９は、本発明の実施形態に係る本発明の実施形態に係る音声認識結果、確認情報を付与した表示情報の一例である（校正者画面）。比較するために原稿２の文字列（原稿例４００）と音声認識結果の例（音声認識結果９０１）を図示している。音声認識結果９０１では、矩形で囲んだ部分が原稿例４００との差分である。 FIG. 9 is an example of display information in which confirmation information has been added to the voice recognition result according to the embodiment of the present invention (proofreader screen). For comparison, the character string of manuscript 2 (manuscript example 400) and an example of the voice recognition result (voice recognition result 901) are shown. In the voice recognition result 901, the portions enclosed in rectangles are the differences from the manuscript example 400.

校正者用の表示は校正者用表示９０２のように９０２ａ～ｅまで図４の原稿２の各行と対応する文字列が表示される。１行分の認識が完了した時点で受信、表示されるため、これらの行は全て同時に表示されているわけではなく１つずつ表示されている（図１０の校正者操作画面１０００を参照）。９０２ａ～ｅを一つずつ説明していく。 As for the display for the proofreader, a character string corresponding to each line of the manuscript 2 in FIG. 4 is displayed from 902a to e as in the display for the proofreader 902. Since it is received and displayed when the recognition of one line is completed, all of these lines are not displayed at the same time but are displayed one by one (see the proofreader operation screen 1000 in FIG. 10). 902a to e will be explained one by one.

９０２ａは、音声認識結果と原稿が一致しているため文字列のみが表示される。 In 902a, since the voice recognition result and the document match, only the character string is displayed.

９０２ｂでは、９２１と９２２の２ヵ所が原稿と音声認識結果の差分であり、それぞれ図５のルール番号１の固有名詞（人名）、数値表現（日付）が適用されているので、校正者に識別可能に強調表示され（実線の矩形）、また原稿側に含まれる正解（それぞれ「稲川」、「７月」）が吹き出しで表示されている。ここで校正者が確認した後、例えば吹き出し部分をマウスでクリックするなどの操作により、音声認識結果中の文字列を吹き出し内の文字列に置き換えることができる。あるいは直接、音声認識結果の文字列を編集しても良い。また何らかの事情により音声認識結果の文字列が正しければ置き換える必要はない。 In 902b, the two locations 921 and 922 differ between the manuscript and the voice recognition result. The proper noun (personal name) and numerical expression (date) entries of rule number 1 in FIG. 5 apply to them respectively, so they are highlighted so that the proofreader can identify them (solid rectangles), and the correct strings contained in the manuscript ("稲川" (Inagawa) and "７月" (July), respectively) are displayed in balloons. After checking, the proofreader can replace the character string in the voice recognition result with the character string in the balloon by, for example, clicking the balloon with the mouse. Alternatively, the character string of the voice recognition result may be edited directly. If for some reason the character string of the voice recognition result is correct, no replacement is needed.

９２１の固有名詞（人名）は、図５のルールでは「仮に一致していたとしても校正者に確認させる」ために確認情報が付与される。一致していた場合は吹き出しは表示されなくとも良いが、実線の矩形は表示される。 The proper noun (personal name) of 921 is given confirmation information in order to "make the proofreader confirm even if they match" in the rule of FIG. If they match, the callout does not have to be displayed, but the solid rectangle is displayed.

また、９２３の「９月」は原稿と一致しており、図５のルールにも数値表現（日付）は不一致の場合のみ校正者に提示されることになっているが、図５のルールと一致した部分のみを識別可能に表示しても良い（点線の矩形）。 In addition, "９月" (September) at 923 matches the manuscript, and under the rules of FIG. 5 a numerical expression (date) is presented to the proofreader only when it does not match; nevertheless, a portion that merely matches a rule of FIG. 5 may also be displayed in an identifiable manner (dotted rectangle).

９０２ｃの９２４は、原稿の「口座」が音声認識結果で「講座」となっているため実線の矩形で確認情報が表示されている。図５に普通名詞であっても不一致なら確認情報を付与するようルールがあるからである。重要度が低く、校正者の作業効率を重視するならば、図５のルールから削除せず、表示しなくとも良い。またその場合でも原稿と差異があると言うことで、点線の矩形を付与するなどして一段階レベルの低い確認を校正者に促しても良い。 In 902c 924, since the "account" of the manuscript is a "course" in the voice recognition result, the confirmation information is displayed by a solid rectangle. This is because there is a rule in FIG. 5 to give confirmation information if there is a discrepancy even if it is a common noun. If the importance is low and the work efficiency of the proofreader is important, it is not necessary to delete it from the rule of FIG. 5 and display it. Even in that case, since there is a difference from the original, the proofreader may be urged to confirm the proofreader at a lower level by adding a dotted rectangle.

９０２ｄは原稿と差分がないため、確認情報は表示されていない。ただし「７０万円」（数値情報（金額））に一段階レベルの低い確認情報が表示されていても良い。 Since 902d has no difference from the original, the confirmation information is not displayed. However, confirmation information at a lower level may be displayed in "700,000 yen" (numerical information (amount)).

９０２ｅには２ヵ所の差分がある。９２６「の」は図５のルールには含まれていないため表示されない。９２７の「した」は原稿では「す」であり、この文章が現在形か過去形かという時制の違いは、字幕を読む人にとって大きな誤解を生むため確認情報が実線の矩形で表示されている。 902e contains two differences. 926, "の" (the particle no), is not covered by the rules of FIG. 5 and is therefore not marked. 927, "した" (past tense), is "す" (present tense) in the manuscript; because the difference in tense between present and past would seriously mislead readers of the subtitles, confirmation information is displayed with a solid rectangle.
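The display decisions described for 902a to 902e can be summarized in the following Python sketch. The rule table and its category names are hypothetical stand-ins for the rules of FIG. 5, and the dotted frame for a matching rule-covered portion is the optional behavior mentioned for 923.

```python
# Hypothetical rule table: "always" means the item is presented even
# when it matches the manuscript; "on_mismatch" means only when it differs.
RULES = {
    "person_name": "always",
    "date": "on_mismatch",
    "common_noun": "on_mismatch",
    "tense": "on_mismatch",
}


def display_style(category, mismatch, rules=RULES):
    """Return the frame type and whether to show the manuscript text in a balloon."""
    policy = rules.get(category)
    if policy is None:
        # Not covered by any rule (e.g. the particle at 926): no marking.
        return {"frame": "none", "balloon": False}
    if mismatch or policy == "always":
        # The balloon with the manuscript text is only useful on a mismatch.
        return {"frame": "solid", "balloon": mismatch}
    # Covered by a rule but matching: optional lower-level marking.
    return {"frame": "dotted", "balloon": False}
```

Under these assumptions, a mismatching personal name yields a solid frame with a balloon, a matching date yields a dotted frame, and an uncovered particle is left unmarked.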

以上、校正者はこれらの情報を確認しながら、短時間で校正することが可能となる。以上で図９の説明を完了する。 As described above, the proofreader can proofread in a short time while checking this information. This completes the description of FIG. 9.

図１１は、本発明の実施形態に係る音声認識結果を字幕として表示する画面の一例である（テレビなどの画面）。例えばテレビのニュース番組でアナウンサーが原稿を読み上げている場面が表示されている。 FIG. 11 is an example of a screen for displaying the voice recognition result according to the embodiment of the present invention as subtitles (screen of a television or the like). For example, a scene in which an announcer is reading a manuscript is displayed on a TV news program.

音声認識結果文字列１１０１は、校正まで完了した結果を情報処理端末１０２ｂから受け取り、そのまま表示したものである。行頭に「＞＞」を付与して音声認識結果の字幕であることを分かりやすく示している。 The voice recognition result character string 1101 is the fully proofread result, received from the information processing terminal 102b and displayed as is. ">>" is prefixed to the line to make clear that it is a subtitle produced by voice recognition.

またニュースなどの画面では、本願発明のアナウンサーの音声を音声認識結果を用いた字幕とは別に、ニュースの要約などを大きくテロップとして表示することが多い。これを音声認識結果の字幕と区別するため、原稿外表示文字列１１０２と呼ぶことにする。 On news screens and the like, a summary of the news is often displayed in large type as a telop (on-screen caption), separately from the subtitles of the present invention that are produced from the voice recognition result of the announcer's speech. To distinguish this from the voice-recognition subtitles, it is referred to as the non-manuscript display character string 1102.

音声認識結果文字列１１０１と原稿外表示文字列１１０２が両方表示されると、文字が重なって非常に読みにくくなったり、また短時間で大量の文字を読むことができなかったりするなどの問題が発生する。 When both the voice recognition result character string 1101 and the non-manuscript display character string 1102 are displayed, there are problems such as the characters overlapping and becoming very difficult to read, and the large number of characters cannot be read in a short time. Occur.

そこで、音声認識結果を表示しない方法を説明する。図４の原稿記憶部３３１の原稿番号２、最後の行に「原稿外」として「稲川容疑者、業務上横領の疑い」が登録されている。原稿外というのは、図１１の原稿外表示文字列１１０２として、アナウンサーの発話とは関係なく表示されることが決まっているものを意味する。 Therefore, a method of not displaying the voice recognition result will be described. In the last line of the manuscript number 2 of the manuscript storage unit 331 of FIG. 4, "Suspect Inagawa, suspected embezzlement in business" is registered as "outside the manuscript". The term "outside the manuscript" means that the character string 1102 outside the manuscript of FIG. 11 is determined to be displayed regardless of the announcer's utterance.

この原稿外表示文字列１１０２が表示されれば、原稿内の「業務上横領の疑いがもたれています」という行に対応する音声認識結果が表示されなくとも視聴者には十分な情報が伝わると考えられる。そこで図７、図８のフローチャートに追加の説明をする。 If this non-manuscript display character string 1102 is displayed, sufficient information is transmitted to the viewer even if the voice recognition result corresponding to the line "suspicion of embezzlement in business" is not displayed in the manuscript. Conceivable. Therefore, an additional explanation will be given to the flowcharts of FIGS. 7 and 8.

図８のステップＳ８０２で音声認識結果に類似する行をリストアップした際に、その行が「原稿外」のものが上位あるいは非常に高い類似度で含まれていれば、原稿外の表示がテレビにされると判断し、当該音声認識結果に対する処理を中断する、すなわち、ステップＳ７０１に戻り、次の音声入力の処理を開始するということが考えられる。これにより対応する音声認識結果は後続の処理に送られず、存在しなかったものとなる。 When a line similar to the voice recognition result is listed in step S802 of FIG. 8, if the line "outside the manuscript" is included in the higher rank or with a very high degree of similarity, the display outside the manuscript is displayed on the television. It is conceivable that the process for the voice recognition result is interrupted, that is, the process returns to step S701 and the process for the next voice input is started. As a result, the corresponding speech recognition result is not sent to the subsequent processing and does not exist.
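The automatic suppression described above can be sketched in Python as follows. The sketch simplifies "ranked high or with very high similarity" to "the top-ranked candidate is a non-manuscript entry whose similarity reaches a threshold"; that simplification, the threshold value, and the function name `should_skip_recognition` are assumptions.

```python
def should_skip_recognition(candidates, min_similarity=0.9):
    """Decide whether to drop a recognition result because a
    non-manuscript caption already conveys it.

    candidates: list of (similarity, kind) tuples sorted by descending
    similarity, where kind is "manuscript" or "off_manuscript".
    """
    if not candidates:
        return False
    similarity, kind = candidates[0]
    return kind == "off_manuscript" and similarity >= min_similarity
```

When the function returns True, processing would return to step S701 and wait for the next voice input instead of sending the result to the proofreader.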

あるいは、ステップＳ８０２の処理で「原稿外の表示が存在する」という確認情報を付与した後で、従来の説明と同様の処理を継続し、校正者に前記「原稿外の表示が存在する」旨を表示し、校正者の判断で表示しないようにする、ということも可能である。 Alternatively, after the confirmation information that "the display outside the manuscript exists" is given in the process of step S802, the same process as the conventional description is continued, and the proofreader is informed that "the display outside the manuscript exists". It is also possible to display and not display at the discretion of the proofreader.

いずれにしても、表示しないことを自動的、あるいは校正者が簡単な操作で決定可能となり、よりリアルタイムな字幕表示が可能になるという効果を得ることが可能となる。また視聴者から見て画面にある大量の文字を読まなければならないこと自体を回避するという効果を得ることが可能となる。以上で、図１１の説明を完了する。 In either case, the decision not to display can be made automatically or by the proofreader with a simple operation, which makes closer-to-real-time subtitle display possible. It also spares viewers from having to read a large amount of text on the screen. This completes the description of FIG. 11.

以上、いくつかの実施形態について示したが、本発明は、例えば、システム、装置、方法、コンピュータプログラムもしくは記録媒体等としての実施態様をとることが可能であり、具体的には、複数の機器から構成されるシステムに適用しても良いし、また、一つの機器からなる装置に適用しても良い。 Although some embodiments have been described above, the present invention can be, for example, an embodiment as a system, an apparatus, a method, a computer program, a recording medium, or the like, and specifically, a plurality of devices. It may be applied to a system composed of, or may be applied to a device consisting of one device.

また、本発明におけるコンピュータプログラムは、図７、図８に示すフローチャートの処理方法をコンピュータが実行可能なコンピュータプログラムであり、本発明の記憶媒体は図７、図８の処理方法をコンピュータが実行可能なコンピュータプログラムが記憶されている。なお、本発明におけるコンピュータプログラムは図７、図８の各装置の処理方法ごとのコンピュータプログラムであってもよい。 Further, the computer program in the present invention is a computer program in which a computer can execute the processing methods shown in FIGS. 7 and 8, and the storage medium of the present invention can execute the processing methods in FIGS. 7 and 8. Computer programs are stored. The computer program in the present invention may be a computer program for each processing method of the devices of FIGS. 7 and 8.

以上のように、前述した実施形態の機能を実現するコンピュータプログラムを記録した記録媒体を、システムあるいは装置に供給し、そのシステムあるいは装置のコンピュータ（またはＣＰＵやＭＰＵ）が記録媒体に格納されたコンピュータプログラムを読出し実行することによっても、本発明の目的が達成されることは言うまでもない。 As described above, a computer in which a recording medium on which a computer program that realizes the functions of the above-described embodiment is recorded is supplied to the system or device, and the computer (or CPU or MPU) of the system or device is stored in the recording medium. Needless to say, the object of the present invention is achieved by reading and executing the program.

この場合、記録媒体から読み出されたコンピュータプログラム自体が本発明の新規な機能を実現することになり、そのコンピュータプログラムを記憶した記録媒体は本発明を構成することになる。 In this case, the computer program itself read from the recording medium realizes the novel function of the present invention, and the recording medium storing the computer program constitutes the present invention.

コンピュータプログラムを供給するための記録媒体としては、例えば、フレキシブルディスク、ハードディスク、光ディスク、光磁気ディスク、ＣＤ－ＲＯＭ、ＣＤ－Ｒ、ＤＶＤ－ＲＯＭ、磁気テープ、不揮発性のメモリカード、ＲＯＭ、ＥＥＰＲＯＭ、シリコンディスク、ソリッドステートドライブ等を用いることができる。 Recording media for supplying computer programs include, for example, flexible disks, hard disks, optical disks, magneto-optical disks, CD-ROMs, CD-Rs, DVD-ROMs, magnetic tapes, non-volatile memory cards, ROMs, and EEPROMs. Silicon disks, solid state drives, etc. can be used.

また、コンピュータが読み出したコンピュータプログラムを実行することにより、前述した実施形態の機能が実現されるだけでなく、そのコンピュータプログラムの指示に基づき、コンピュータ上で稼働しているＯＳ（オペレーティングシステム）等が実際の処理の一部または全部を行い、その処理によって前述した実施形態の機能が実現される場合も含まれることは言うまでもない。 Further, by executing the computer program read by the computer, not only the function of the above-described embodiment is realized, but also the OS (operating system) or the like running on the computer is realized based on the instruction of the computer program. Needless to say, there are cases where a part or all of the actual processing is performed and the processing realizes the functions of the above-described embodiment.

さらに、記録媒体から読み出されたコンピュータプログラムが、コンピュータに挿入された機能拡張ボードやコンピュータに接続された機能拡張ユニットに備わるメモリに書き込まれた後、そのコンピュータプログラムコードの指示に基づき、その機能拡張ボードや機能拡張ユニットに備わるＣＰＵ等が実際の処理の一部または全部を行い、その処理によって前述した実施形態の機能が実現される場合も含まれることは言うまでもない。 Further, the computer program read from the recording medium is written to the memory provided in the function expansion board inserted in the computer or the function expansion unit connected to the computer, and then its function is based on the instruction of the computer program code. Needless to say, there are cases where the CPU provided in the expansion board or the function expansion unit performs a part or all of the actual processing, and the processing realizes the functions of the above-described embodiment.

また、本発明は、複数の機器から構成されるシステムに適用しても、１つの機器からなる装置に適用してもよい。また、本発明は、システムあるいは装置にコンピュータプログラムを供給することによって達成される場合にも適応できることは言うまでもない。この場合、本発明を達成するためのコンピュータプログラムを格納した記録媒体を該システムあるいは装置に読み出すことによって、そのシステムあるいは装置が、本発明の効果を享受することが可能となる。 Further, the present invention may be applied to a system composed of a plurality of devices or a device composed of one device. It goes without saying that the present invention can also be applied when it is achieved by supplying a computer program to a system or an apparatus. In this case, by reading the recording medium containing the computer program for achieving the present invention into the system or device, the system or device can enjoy the effect of the present invention.

さらに、本発明を達成するためのコンピュータプログラムをネットワーク上のサーバ、データベース等から通信プログラムによりダウンロードして読み出すことによって、そのシステムあるいは装置が、本発明の効果を享受することが可能となる。なお、上述した各実施形態およびその変形例を組み合わせた構成も全て本発明に含まれるものである。 Further, by downloading and reading a computer program for achieving the present invention from a server, database, or the like on a network by a communication program, the system or device can enjoy the effect of the present invention. It should be noted that the present invention also includes all the configurations in which each of the above-described embodiments and modifications thereof are combined.

１０１音声認識サーバ
１０２情報処理端末
３１１音声取得部
３１２音声データ送信部
３１３確認情報受信部
３１４校正用表示部
３１５校正部
３１６校正結果送信部
３１７字幕受信部
３２１音声データ受信部
３２２音声認識部
３２３確認情報付与部
３２４確認情報送信部
３２５校正結果受信部
３２６字幕配布部
３３１原稿記憶部
３３２確認ルール記憶部
101 Voice recognition server
102 Information processing terminal
311 Voice acquisition unit
312 Voice data transmission unit
313 Confirmation information reception unit
314 Proofreading display unit
315 Proofreading unit
316 Proofreading result transmission unit
317 Subtitle reception unit
321 Voice data reception unit
322 Voice recognition unit
323 Confirmation information addition unit
324 Confirmation information transmission unit
325 Proofreading result reception unit
326 Subtitle distribution unit
331 Manuscript storage unit
332 Confirmation rule storage unit

Claims

An information processing apparatus that acquires recognition data, which is a voice recognition result of voice data, and text data corresponding to the voice data, the apparatus comprising:
specifying means for specifying a location of interest in the recognition data or the text data; and
display control means for displaying the specified location in an identifiable manner when displaying a result of comparing the recognition data with the text data.

The information processing apparatus according to claim 1, wherein the specifying means identifies the location according to an attribute of a character string included in the data.

The information processing apparatus according to claim 1 or 2, wherein the display control means controls the display method of the location based on the result of comparing the recognition data with the text data at the specified location.

The information processing apparatus according to any one of claims 1 to 3, wherein the display control means provides notification of the text data at the specified location when the recognition data and the text data differ at that location.

The information processing apparatus according to any one of claims 1 to 4, further comprising a receiving means for receiving correction of the recognition data at the specified location.

The information processing apparatus according to any one of claims 1 to 5, wherein the text data is text data of a manuscript of the voice data.

A control method for an information processing apparatus that acquires recognition data, which is a voice recognition result of voice data, and text data corresponding to the voice data, the method comprising:
a specifying step in which specifying means specifies a location of interest in the recognition data or the text data; and
a display control step in which display control means displays the specified location in an identifiable manner when displaying a result of comparing the recognition data with the text data.

A program executable by an information processing apparatus that acquires recognition data, which is a voice recognition result of voice data, and text data corresponding to the voice data,
the program causing the information processing apparatus to function as:
specifying means for specifying a location of interest in the recognition data or the text data; and
display control means for displaying the specified location in an identifiable manner when displaying a result of comparing the recognition data with the text data.