JP4020083B2

JP4020083B2 - Transcription text creation support system and program

Info

Publication number: JP4020083B2
Application number: JP2004037718A
Authority: JP
Inventors: 尚志斯波
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2004-02-16
Filing date: 2004-02-16
Publication date: 2007-12-12
Anticipated expiration: 2024-02-16
Also published as: JP2005228178A

Description

The present invention relates to a transcription text creation support system that supports the work of converting speech data into text, and more particularly to a transcription text creation support system for producing a final transcription text by correcting, as appropriate, the recognition result obtained by performing speech recognition processing on speech information.

Converting the speech information contained in content into text is called speech transcription, or simply transcription, and has long been practiced in various fields. For example, to realize a video search and distribution service that searches video/audio content by keyword and distributes video and audio starting from the scene that matches the search, the speech in the video/audio content is transcribed into text and used as search keywords. Speech information is also transcribed into text to serve as subtitle information for video content.

A transcription text is generally created as follows: the operator listens to the speech information of the content to be transcribed, transcribes it using an input means such as a keyboard, and then visually reviews the transcribed text to find and correct transcription errors. Because this work requires a great deal of effort, systems have been devised that make transcription more efficient by using speech recognition technology to generate text from speech information automatically to some extent (see, for example, Patent Document 1).

However, in a system that uses speech recognition to make transcription work more efficient, the operator must still listen to the actual speech information while visually detecting and correcting errors in the speech recognition result text output by the system. During repeated checking, the operator may lose track of which part was being worked on and end up moving the playback point back and forth, so that in the end a workload is still imposed.

As one solution, Patent Document 2 proposes a method that automatically finds erroneous portions by comparing the actual speech information with speech produced by speech synthesis from the recognition result text, and presents those portions to the operator. This can be expected to reduce the operator's load to some extent. In practice, however, unless the automatically flagged errors are 100% accurate, the entire target content must still be checked manually, so the problem remains that the workload is not fundamentally reduced.

On the other hand, Non-Patent Document 1 describes a transcription text support system that displays three elements on the screen: a recognition result list in which the speaker, time information, speech recognition result, and transcription text are shown on one line for each sentence (the utterance unit); an automatic speech playback window that displays the waveform of the speech information corresponding to the utterance unit highlighted in the recognition result list; and an editing window for the transcription text. The system allows the speech recognition result to be corrected interactively while the speech is played back utterance by utterance.

Patent Document 1: JP 2001-166790 A
Patent Document 2: JP 2001-134276 A
Non-Patent Document 1: "Introduction of the 'Transcription System' by Speech Recognition", [online], Advanced Media, Inc., [retrieved February 4, 2004], Internet <URL: http://www.advanced-media.co.jp/event/news/AmiVoice_Rewriter.pdf>

According to the transcription text support system described in Non-Patent Document 1, the speech recognition result of each utterance unit can be corrected interactively while the speech is played back sentence by sentence, so transcription text can be created with a certain degree of efficiency. However, because speech is played back in units of sentences, even when only some words in a sentence are in doubt, the operator must listen from the beginning of the sentence to catch the speech of the word in question, which makes it difficult to hear the word efficiently and accurately. This problem becomes more pronounced the longer the uttered sentence is. In addition, because the speech information is associated with its speech recognition result and transcription text only at the level of a whole sentence, it is unclear which part of the speech information corresponds to which part of the recognition result and transcription text, and it is not easy to confirm whether all of the speech information has been transcribed without omission.

The present invention was devised in recognition of the above technical problems, and its object is to provide a transcription text creation support system that can create the transcription text of speech information contained in content efficiently and accurately.

The first transcription text creation support system of the present invention comprises: a storage device; a display device; an input device; an audio output device; a synchronization information generation unit; and a confirmation and correction unit. The synchronization information generation unit divides the speech recognition result, obtained by performing speech recognition processing on the speech information contained in material content, into units smaller than a sentence and, for each division unit, stores in the storage device the speech recognition result of that division unit, the speech information from which it was derived, time information for identifying that speech information, and a transcription text in its initial state, in association with one another. The confirmation and correction unit displays on the screen of the display device, side by side in parallel and with their time axes aligned, the speech recognition result, an image of the underlying speech information, the time information for identifying that speech information, and the transcription text stored in the storage device for each division unit. In response to a speech playback instruction from the input device designating a division unit on the screen, it plays back the speech information of the designated division unit and outputs it from the audio output device. In response to a transcription text input operation from the input device designating a division unit on the screen, it updates the transcription text of the designated division unit in the storage device.

The second transcription text creation support system of the present invention is the first transcription text creation support system in which the division unit is a word.

The third transcription text creation support system of the present invention is the first transcription text creation support system in which the image of the speech information is an image of a speech waveform generated from the speech information.

The fourth transcription text creation support system of the present invention is the first transcription text creation support system in which the transcription text input operation from the input device designating a division unit on the screen is a copy operation that copies the recognition result into the transcription text.

The fifth transcription text creation support system of the present invention is the first transcription text creation support system in which the confirmation and correction unit, in response to a scroll instruction from the input device, scrolls in synchronization with one another the speech recognition result, the image of the underlying speech information, the time information for identifying that speech information, and the transcription text displayed on the screen of the display device for each division unit.

The first program of the present invention causes a computer provided with a storage device, a display device, an input device, and an audio output device to function as synchronization information generation means and as confirmation and correction means. The synchronization information generation means divides the speech recognition result, obtained by performing speech recognition processing on the speech information contained in material content, into units smaller than a sentence and, for each division unit, stores in the storage device the speech recognition result of that division unit, the speech information from which it was derived, time information for identifying that speech information, and a transcription text in its initial state, in association with one another. The confirmation and correction means displays on the screen of the display device, side by side in parallel and with their time axes aligned, the speech recognition result, an image of the underlying speech information, the time information for identifying that speech information, and the transcription text stored in the storage device for each division unit. In response to a speech playback instruction from the input device designating a division unit on the screen, it plays back the speech information of the designated division unit and outputs it from the audio output device. In response to a transcription text input operation from the input device designating a division unit on the screen, it updates the transcription text of the designated division unit in the storage device.

The second program of the present invention is the first program in which the division unit is a word.

According to the present invention, the transcription text of speech information contained in content can be created efficiently and accurately.

The reasons are as follows. First, because speech information can be played back in units smaller than a sentence, when part of the speech information is in doubt, only that part can be played back and heard efficiently and accurately. Second, because the speech information and its corresponding speech recognition result and transcription text are displayed in association with one another for each unit smaller than a sentence, it is clear at a glance which part of the speech recognition result and which part of the transcription text each part of the speech information corresponds to, making it easy to confirm, for example, whether all of the speech information has been transcribed without omission.

Referring to FIG. 1, the transcription text creation support system according to an embodiment of the present invention comprises a synchronization information generation unit 101, a synchronization information storage unit 102, a confirmation and correction unit 103, a display device 104, an input device 105, an audio output device 106, and an output device 107.

The synchronization information storage unit 102 is composed of a randomly accessible storage device such as a magnetic disk device and has a large number of entries 1025. Each entry 1025 consists of a time information field 1021, a speech information field 1022, a recognition result field 1023, and a transcription text field 1024. In one entry 1025, the recognition result field 1023 stores the recognition result of one division unit, obtained by dividing into units smaller than a sentence the recognition result of speech recognition processing performed on the speech information contained in the material content 111 to be transcribed. The speech information field 1022 stores the binary data of the speech information from which the recognition result in the recognition result field 1023 was derived. The time information field 1021 stores time information for identifying the speech information in the speech information field 1022 within the series of speech information contained in the material content 111. The contents of these three fields 1021, 1022, and 1023 are set automatically by machine. The remaining transcription text field 1024 is set interactively by the operator.
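The four-field entry described above can be pictured as a small record type. The following is only an illustrative sketch: the class name `SyncEntry`, the attribute names, and the choice of seconds as the time unit are assumptions made here and do not come from the specification.

```python
from dataclasses import dataclass


@dataclass
class SyncEntry:
    """One entry 1025 (hypothetical Python rendering)."""
    time_info: float      # field 1021: start time from the head of the content, in seconds (assumed unit)
    audio: bytes          # field 1022: binary speech data for this division unit
    recognized: str       # field 1023: recognition result for this division unit
    transcript: str = ""  # field 1024: filled in interactively by the operator


# The first three fields are set automatically; the transcript starts blank.
entry = SyncEntry(time_info=3.0, audio=b"\x00\x01", recognized="魚を")
```

Keeping the transcript as a defaulted field mirrors the split between the three automatically set fields and the one operator-set field.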

The synchronization information generation unit 101 receives material content 111 that contains at least speech information, performs speech recognition processing on that speech information, and divides the resulting speech recognition result into units smaller than a sentence. For each division unit it allocates one entry 1025 of the synchronization information storage unit 102, stores the recognition result of the division unit in the recognition result field 1023, stores the original speech information corresponding to that recognition result in the speech information field 1022, stores time information identifying that speech information in the time information field 1021, and leaves the transcription text field 1024 in its initial state (generally blank, although it may instead be set to the recognition result as is). A word, for example, can be used as the division unit.

For example, suppose that speech recognition is performed on a series of speech information located between playback times t1 and t3 from the head of the material content 111, and that the recognition result "魚を食べた" ("ate fish") is obtained. In this case, "魚を", which combines "魚" (common noun) and "を" (case particle), is treated as one word, and "食べた", which combines "食べ" (ichidan verb stem) and "た" (auxiliary verb), is treated as another word, and the entries 1025-1 and 1025-2 are allocated to them respectively. The recognition result "魚を" is stored in the recognition result field 1023 of entry 1025-1, and the speech information AAA corresponding to "魚を" is stored in its speech information field 1022. If the speech information AAA lies between playback times t1 and t2 from the head of the material content 111, then, for example, the time t1 is stored in the time information field 1021, and the transcription text field 1024 is left blank. Similarly, the recognition result "食べた" is stored in the recognition result field 1023 of entry 1025-2, and the speech information BBB corresponding to "食べた" is stored in its speech information field 1022. If the speech information BBB lies between playback times t2 and t3 from the head of the material content 111, the start time t2 is stored in the time information field 1021, and the transcription text field 1024 is left blank.
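The "魚を" / "食べた" example can be sketched in code as follows. This is a minimal illustration under stated assumptions: the recognizer is assumed to deliver each word with its start and end time, the speech data is assumed to be a flat byte string with a fixed number of bytes per second, and the names `SyncEntry`, `build_entries`, and `prefill` are hypothetical. The concrete times stand in for t1, t2, and t3.

```python
from dataclasses import dataclass


@dataclass
class SyncEntry:
    time_info: float      # start time of the division unit (field 1021)
    audio: bytes          # speech data of the division unit (field 1022)
    recognized: str       # recognition result (field 1023)
    transcript: str = ""  # transcription text (field 1024)


def build_entries(pcm, words, bytes_per_second, prefill=False):
    """Create one entry per recognized word.

    words is a list of (text, start_seconds, end_seconds) tuples.
    With prefill=True the transcript starts as the recognition result
    instead of blank, the alternative initial state the text allows.
    """
    entries = []
    for text, start, end in words:
        lo = int(start * bytes_per_second)
        hi = int(end * bytes_per_second)
        entries.append(SyncEntry(start, pcm[lo:hi], text, text if prefill else ""))
    return entries


pcm = bytes(range(30))  # stand-in for 3 seconds of audio at 10 bytes per second
words = [("魚を", 0.0, 1.0), ("食べた", 1.0, 3.0)]  # t1=0.0, t2=1.0, t3=3.0
entries = build_entries(pcm, words, bytes_per_second=10)
```

Only the start time of each word is stored, as in the text; the end of a unit is implied by the start time of the following entry.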

The confirmation and correction unit 103 displays a confirmation and correction screen 1046 on the display device 104, which is composed of a liquid crystal display or the like, and through this screen lets the operator edit the transcription text using the speech recognition result. On the confirmation and correction screen 1046, a time information display section 1041, a recognition result display section 1042, a speech waveform display section 1043, and a transcription text display section 1044 are arranged parallel to one another, each extending in the horizontal (X-axis) direction, and scroll buttons 1045-1 and 1045-2 are provided for scrolling the contents of the display sections 1041 to 1044. The scroll button 1045-1 scrolls toward times earlier than the current position (rewind button), and the scroll button 1045-2 scrolls toward times later than the current position (fast-forward button).

The time information display section 1041 displays, in time order, the time information stored in the time information field 1021 of each entry 1025 of the synchronization information storage unit 102. The recognition result display section 1042 displays, in time order, the recognition results stored in the recognition result field 1023 of each entry 1025. The speech waveform display section 1043 displays, in time order, simplified waveform images of the speech information stored in the speech information field 1022 of each entry 1025. The transcription text display section 1044 displays, in time order, the transcription text stored in the transcription text field 1024 of each entry 1025.

The time axes of the information displayed in these four display sections 1041 to 1044 are synchronized. Taking entry 1025-1 of the synchronization information storage unit 102 as an example, suppose that the time t1 stored in the time information field 1021 of entry 1025-1 and the time t2 stored in the time information field 1021 of the immediately following entry 1025-2 are displayed in the time information display section 1041. A guide line 1047 is drawn parallel to the Y axis from the display position of time t1, and a guide line 1048 is drawn parallel to the Y axis from the display position of time t2. The recognition result "魚を" stored in the recognition result field 1023 of entry 1025-1 is displayed in the part of the recognition result display section 1042 between guide lines 1047 and 1048. The waveform image of the speech information stored in the speech information field 1022 of entry 1025-1 is displayed in the part of the speech waveform display section 1043 between the same guide lines, and the transcription text stored in the transcription text field 1024 of entry 1025-1 is displayed in the part of the transcription text display section 1044 between the same guide lines. When the scroll button 1045-1 or 1045-2 is operated, the information displayed in the four display sections 1041 to 1044 scrolls in synchronization.
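One way to obtain the shared time axis is to derive a single pair of X coordinates per division unit and reuse it for all four rows. The sketch below assumes a linear mapping from time to pixels; the function name `column_spans` and the parameters `view_start` and `px_per_sec` are hypothetical and are not taken from the specification.

```python
def column_spans(start_times, end_time, view_start, px_per_sec):
    """Return (left_x, right_x) for each division unit.

    start_times: start time of each unit in seconds, ascending
    end_time:    end time of the last unit
    view_start:  time shown at the left edge of the screen
    The two values of a span correspond to the guide lines that bound
    a unit (such as 1047 and 1048), and every display row uses the
    same span for the same unit.
    """
    bounds = list(start_times) + [end_time]
    return [
        ((a - view_start) * px_per_sec, (b - view_start) * px_per_sec)
        for a, b in zip(bounds, bounds[1:])
    ]


spans = column_spans([1.0, 2.0], end_time=4.0, view_start=0.0, px_per_sec=100)
# Scrolling changes only view_start, so all four rows move together.
scrolled = column_spans([1.0, 2.0], end_time=4.0, view_start=1.0, px_per_sec=100)
```

Because scrolling is expressed as a change to one shared offset, the rows cannot drift out of alignment with one another.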

In addition, in response to a speech playback instruction from the input device 105 that designates a division unit on the confirmation and correction screen 1046, the confirmation and correction unit 103 plays back the speech information of the designated division unit and outputs it from the audio output device 106, which is composed of a speaker or the like. One conceivable way to issue a playback instruction that designates a division unit is to click, with the mouse, the speech waveform portion or the recognition result portion of the division unit to be played back. This allows the operator to listen to and check the original speech information for each division unit individually.
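Resolving a click to a division unit amounts to finding which time interval contains the clicked position. The following sketch assumes the click has already been converted from a screen X coordinate to a time in seconds; `unit_at` is a hypothetical helper, and the actual audio output is left out.

```python
from bisect import bisect_right


def unit_at(start_times, end_time, clicked_time):
    """Return the index of the division unit containing clicked_time, or None.

    start_times must be sorted in ascending order; unit i covers the
    half-open interval from start_times[i] up to the next start time
    (or end_time for the last unit).
    """
    if not start_times or clicked_time < start_times[0] or clicked_time >= end_time:
        return None
    return bisect_right(start_times, clicked_time) - 1


starts = [1.0, 2.0]  # two units: 1.0 to 2.0 and 2.0 to 4.0
index = unit_at(starts, 4.0, 1.5)
```

The returned index selects the entry whose stored speech data is then handed to the playback component.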

Furthermore, in response to a transcription text input operation from the input device 105 that designates a division unit on the confirmation and correction screen 1046, the confirmation and correction unit 103 displays the entered transcription text in the part of the transcription text display section 1044 corresponding to that division unit. Besides direct input from the keyboard of the input device 105, the transcription text can be entered by copying the recognition result displayed in the recognition result display section 1042 into the transcription text display section 1044. The transcription text edited in the transcription text display section 1044 in this way is written to the transcription text field 1024 of the corresponding entry in the synchronization information storage unit 102.
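The two input paths, typing and copying the recognition result, both end in the same update of the transcription text field. A minimal sketch, assuming entries are held as plain dictionaries in memory; the function names `set_transcript` and `copy_recognized` and the misrecognized word in the sample data are invented for illustration.

```python
def set_transcript(entries, index, text):
    """Keyboard input: store the typed text for the designated division unit."""
    entries[index]["transcript"] = text


def copy_recognized(entries, index):
    """Copy operation: accept the recognition result as the transcription text."""
    entries[index]["transcript"] = entries[index]["recognized"]


entries = [
    {"recognized": "魚を", "transcript": ""},
    {"recognized": "食べた", "transcript": ""},
    {"recognized": "秋の", "transcript": ""},  # invented example of a word the operator retypes
]
copy_recognized(entries, 0)       # recognition was correct, so copy it
copy_recognized(entries, 1)
set_transcript(entries, 2, "朝の")  # operator types a correction
```

The copy path saves keystrokes wherever recognition was correct, leaving typing for the units that need correction.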

The contents of the transcription text field 1024 of each entry, edited as described above and saved in the synchronization information storage unit 102, can also be read out of the synchronization information storage unit 102 sequentially in the time order of the time information field 1021, in response to a transcription text output instruction from the input device 105, and output from the output device 107, such as a printer.
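Reading the entries out in time order and concatenating their transcription text could look like the sketch below. The dictionary keys and the function name `export_transcript` are assumptions, and joining without a separator is a choice made here for Japanese text rather than something the specification states.

```python
def export_transcript(entries):
    """Concatenate transcription text in the order of the time information."""
    ordered = sorted(entries, key=lambda e: e["time"])
    return "".join(e["transcript"] for e in ordered)


# Entries may be stored in any order; the time information determines output order.
entries = [
    {"time": 2.0, "transcript": "食べた"},
    {"time": 1.0, "transcript": "魚を"},
]
text = export_transcript(entries)
```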

Next, examples of each part of the transcription text creation support system of this embodiment will be described. First, an example of the synchronization information generation unit 101 will be described in detail with reference to the drawings.

Referring to FIG. 2, one example of the synchronization information generation unit 101 comprises a material input unit 1011, a speech recognition unit 1012, a time information acquisition unit 1013, and a recording unit 1014. The material input unit 1011 receives the material content 111. The speech recognition unit 1012 performs speech recognition processing on the speech information of the material content 111 received by the material input unit 1011. The time information acquisition unit 1013, at the timing when the speech recognition unit 1012 generates the recognition result text for a given word, acquires and outputs the time information, measured from the head of the material content 111, of the speech information underlying that recognition result text. The recording unit 1014 receives the recognition result text and its underlying speech information from the speech recognition unit 1012 and the time information from the time information acquisition unit 1013, and records in the synchronization information storage unit 102 one entry 1025 holding the received time information, speech information, and recognition result together with a transcription text field set to blank.

FIG. 3 is a flowchart showing the processing flow of the synchronization information generation unit 101, and FIG. 4 is a diagram showing part of an example of the data stored in the synchronization information storage unit 102. Consider using the present invention, for example, to create the transcription text of assembly video content. First, the assembly video that constitutes the material content 111 is played back by a video deck or the like and taken in by the material input unit 1011. Taking video content into a computer is generally called capture. During capture, the video content is digitized in a format that a computer can handle, such as MPEG1, and is then temporarily stored on an external storage medium (not shown) such as a hard disk drive installed in the computer (S101). Playback time information may be added and recorded at this point. Next, the speech recognition unit 1012 reads the digital speech data from the external storage medium (not shown) in chronological order, performs speech recognition processing using existing speech recognition technology, and generates recognition result text (S102). Meanwhile, the time information acquisition unit 1013, having received the material content 111 from the material input unit 1011 in advance, associates the processing start time of the speech recognition unit 1012 with the time information of the material content 111 and monitors which part of the material content 111 is currently being processed. When the speech recognition unit 1012 recognizes speech information and notifies the time information acquisition unit 1013 of the timing at which it output the recognition result text, the time information acquisition unit 1013 acquires the time information, measured from the head of the material content 111, that matches the timing received from the speech recognition unit 1012 (S103). The output timing may be notified, for example, for each word output by the speech recognition unit 1012.

At the same time as it notifies the timing, the speech recognition unit 1012 passes the recognition result text and its underlying speech information to the recording unit 1014, and the time information acquisition unit 1013 passes the time information to the recording unit 1014. Using the received time information as the primary key, the recording unit 1014 stores the received speech information and recognition result text linked by that time information (S104). The recording unit 1014 also creates, in the synchronization information storage unit 102, a field for storing the transcription text for each item of time information.
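Storing records with the time information as primary key maps naturally onto a relational table. The sketch below uses an in-memory SQLite database purely for illustration; the specification does not name a storage technology, and the table name, column names, sample audio bytes, and the second timestamp are invented here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sync_info (
        time_info  TEXT PRIMARY KEY,  -- time from the head of the content
        audio      BLOB,              -- speech data for the division unit
        recognized TEXT,              -- recognition result text
        transcript TEXT DEFAULT ''    -- created blank, edited later by the operator
    )
    """
)


def record(conn, time_info, audio, recognized):
    """Step S104: store speech data and recognition text keyed by time."""
    conn.execute(
        "INSERT INTO sync_info (time_info, audio, recognized) VALUES (?, ?, ?)",
        (time_info, audio, recognized),
    )


record(conn, "0:00:03", b"\x01\x02", "続きまして")
record(conn, "0:00:05", b"\x03\x04", "秋の")  # timestamp chosen for illustration

rows = conn.execute(
    "SELECT time_info, recognized, transcript FROM sync_info ORDER BY time_info"
).fetchall()
```

Sorting these strings works here only because both have the same layout; a real table would more likely store a numeric offset.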

This series of steps is repeated continuously until the end of the material content 111 is reached.
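The record layout produced by steps S103 and S104 can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `Entry`, `SyncStore`, `record`, and `to_seconds` are hypothetical, and an in-memory dictionary stands in for the synchronization information storage unit 102.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


def to_seconds(time_info: str) -> int:
    """Convert "H:MM:SS" (offset from the head of the content) to seconds."""
    hours, minutes, seconds = (int(part) for part in time_info.split(":"))
    return hours * 3600 + minutes * 60 + seconds


@dataclass
class Entry:
    """One entry 1025: the fields kept for a single unit, e.g. one word."""
    time_info: str                    # field 1021, also the primary key
    audio: bytes                      # field 1022, source audio for the unit
    recognized: str                   # field 1023, recognition result text
    transcript: Optional[str] = None  # field 1024, empty in the initial state


class SyncStore:
    """Hypothetical stand-in for the synchronization information storage unit 102."""

    def __init__(self) -> None:
        self._entries: Dict[str, Entry] = {}

    def record(self, time_info: str, audio: bytes, recognized: str) -> None:
        # S104: key the audio and recognition result by time information and
        # create the still-empty transcription text field alongside them.
        self._entries[time_info] = Entry(time_info, audio, recognized)

    def get(self, time_info: str) -> Entry:
        return self._entries[time_info]

    def times(self) -> List[str]:
        # Entries in playback order, regardless of insertion order.
        return sorted(self._entries, key=to_seconds)
```

Keying every field by the same time information is what later allows the screen to line the rows up and to look up audio from a clicked word.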

Referring to FIG. 4, next to each of the time information values from "0:00:03" to "0:00:14" the storage holds the audio information associated with that time information, the recognition results 「続きまして」 ("continuing on"), 「秋の」 ("autumn's"), 「委員会で」 ("at the committee"), 「貸されました」 ("was lent"), 「起案」 ("draft"), 「について」 ("regarding"), and 「ですが」 ("but"), and the transcription text. The audio information 202 is drawn as an audio waveform image for the purpose of explanation.

Next, an example of the confirmation/correction unit 103 is described in detail with reference to the drawings.

Referring to FIG. 5, the confirmation/correction unit 103 comprises: a data reading unit 1031 that reads the contents of each entry 1025 stored in the synchronization information storage unit 102; a screen generation unit 1032 that receives the contents of each entry 1025 from the data reading unit 1031 and generates the confirmation/correction screen 1046 as shown in FIG. 1; an audio waveform generation unit 1033 that receives audio information from the screen generation unit 1032, generates from that binary-format audio information an audio waveform that is easy for the operator to read visually, and returns the waveform to the screen generation unit 1032; a display unit 1034 that displays the confirmation/correction screen 1046 generated by the screen generation unit 1032 on the display device 104; an audio playback unit 1035 that receives audio information from the data reading unit 1031 and generates an audio signal that reproduces it; an output unit 1036 that drives the audio output device 106 with the audio signal generated by the audio playback unit 1035; a data writing unit 1037 that updates the contents of the transcription text field 1024 in each entry 1025 of the synchronization information storage unit 102; an input unit 1038 that receives instructions and data from the input device 105, which consists of a keyboard and a mouse; and a control unit 1039 that receives instructions and data from the input unit 1038 and controls each of the other units.

FIG. 6 shows an example of the confirmation/correction screen 1046 in the initial state, in which the transcription text field 1024 of every entry in the synchronization information storage unit 102 is blank. Here the time information display unit 1041, the recognition result display unit 1042, the audio waveform display unit 1043, and the transcription text display unit 1044 are laid out together in the application window, and, with the time information stored in the synchronization information storage unit 102 as the key, the data belonging to the same time information are placed at the same X coordinate on the display.

FIG. 7 is a flowchart showing an example of the processing performed by the confirmation/correction unit 103 when it displays the initial-state confirmation/correction screen of FIG. 6. The screen generation unit 1032 of the confirmation/correction unit 103 uses the data reading unit 1031 to acquire from the synchronization information storage unit 102 the time information, measured from the head of the material content 111 being worked on, and uses the display unit 1034 to show it in the time information display unit 1041 in the window on the display device 104 (S201). In FIG. 6, the time information values "0:00:03", "0:00:04", ..., "0:00:14" are displayed in order.

Next, the screen generation unit 1032 uses the data reading unit 1031 to acquire from the synchronization information storage unit 102 the audio information corresponding to that time information. The audio waveform generation unit 1033 generates an audio waveform image from the audio information, which is held as digitized binary data, using a common technique such as graphic equalizer display, and the display unit 1034 shows the waveform in the audio waveform display unit 1043 in the window on the display device 104 (S202). Each audio waveform is displayed at the same X coordinate as the time information, among the values shown in the time information display unit 1041, with which it is associated in the synchronization information storage unit 102. In FIG. 6, the waveform of the corresponding audio information is displayed at the same X coordinate as each of the time information values "0:00:03", "0:00:04", ..., "0:00:14".
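One simple way to turn binary audio into something drawable is a peak envelope: split the samples into a fixed number of buckets and keep the largest magnitude in each. The sketch below is only an illustration of that idea and is not the graphic equalizer technique the text mentions; it assumes 16-bit little-endian mono PCM, and the function name `peak_envelope` is hypothetical.

```python
import struct
from typing import List


def peak_envelope(pcm: bytes, buckets: int) -> List[int]:
    """Return one peak magnitude per bucket for 16-bit little-endian mono PCM.

    Each value can be drawn as the height of one vertical bar of the waveform.
    """
    sample_count = len(pcm) // 2
    if sample_count == 0 or buckets <= 0:
        return []
    # Decode the raw bytes into signed 16-bit samples (assumed format).
    samples = struct.unpack("<%dh" % sample_count, pcm[: sample_count * 2])
    # Ceiling division so that every sample falls into some bucket.
    bucket_size = max(1, -(-sample_count // buckets))
    return [
        max(abs(value) for value in samples[start:start + bucket_size])
        for start in range(0, sample_count, bucket_size)
    ]
```

For example, four samples `0, 100, -300, 50` reduced to two buckets give the peaks `[100, 300]`.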

Next, the screen generation unit 1032 uses the data reading unit 1031 to acquire from the synchronization information storage unit 102 the recognition result text corresponding to that time information, and uses the display unit 1034 to show it in the recognition result display unit 1042 in the window on the display device 104 (S203). Each recognition result text is displayed at the same X coordinate as the time information, among the values shown in the time information display unit 1041, with which it is associated in the synchronization information storage unit 102. In FIG. 6, the recognition result text of the corresponding audio information is displayed at the same X coordinate as each of the time information values "0:00:03", "0:00:04", ..., "0:00:14".

Finally, the screen generation unit 1032 uses the data reading unit 1031 to acquire from the synchronization information storage unit 102 the transcription text corresponding to that time information, and uses the display unit 1034 to show it in the transcription text display unit 1044 in the window on the display device 104 (S204). Each transcription text is displayed at the same X coordinate as the time information, among the values shown in the time information display unit 1041, with which it is associated in the synchronization information storage unit 102. In the initial state, however, no transcription text has been stored yet, so the positions of the transcription text display unit 1044 corresponding to each time information value are displayed as blanks.
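The alignment rule used in steps S201 to S204, where every row uses the X coordinate of its time information, can be sketched as below. The uniform `pixels_per_second` scale, the `left_margin`, and the function names are assumptions made for illustration; the text only requires that items sharing a time information value share an X coordinate.

```python
from typing import Dict, List, Tuple


def to_seconds(time_info: str) -> int:
    """Convert "H:MM:SS" to seconds from the head of the content."""
    hours, minutes, seconds = (int(part) for part in time_info.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def column_x(time_info: str, head_time: str,
             pixels_per_second: int = 80, left_margin: int = 10) -> int:
    """X coordinate shared by all four rows for the entry at time_info.

    head_time is the first time information value currently on screen.
    """
    offset = to_seconds(time_info) - to_seconds(head_time)
    return left_margin + offset * pixels_per_second


def layout(entries: List[Dict[str, str]],
           head_time: str) -> List[Tuple[int, Dict[str, str]]]:
    """Pair each entry with the single X coordinate used to draw its time
    label, waveform, recognition result, and transcription text."""
    return [(column_x(entry["time"], head_time), entry) for entry in entries]
```

Because the coordinate is derived from the time information alone, the four rows cannot drift apart even when one of them (the transcription text in the initial state) is empty.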

FIG. 8 is a flowchart showing the flow of processing executed by the confirmation/correction unit 103 after the flowchart of FIG. 7 has been executed and the initial-state confirmation/correction screen 1046 of FIG. 6 has been displayed on the display device 104.

First, when the operator uses the mouse of the input device 105 to click an audio waveform in the audio waveform display unit 1043 or a recognition result text in the recognition result display unit 1042 shown on the display device 104, thereby designating the passage to listen to (YES in S301), the control unit 1039 detects this from the signal from the input unit 1038, acquires from the screen generation unit 1032 the time information corresponding to the clicked waveform or recognition result text, and notifies the audio playback unit 1035. The audio playback unit 1035 retrieves the audio information corresponding to that time information from the synchronization information storage unit 102 through the data reading unit 1031 and generates a playback signal, and the output unit 1036 drives the audio output device 106, such as a speaker, with that signal (S302). The audio information for the clicked location is thus played back as sound. Playback can be repeated as many times as needed by repeating the same step.
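The lookup from a click position to time information in step S301 can be sketched as a search over the left edges of the displayed columns. This is an assumed mechanism for illustration; the text does not say how the screen generation unit 1032 resolves the click, and `time_at` is a hypothetical name.

```python
import bisect
from typing import List, Optional


def time_at(click_x: int, column_starts: List[int],
            times: List[str]) -> Optional[str]:
    """Return the time information of the column containing click_x.

    column_starts holds the left X coordinate of each displayed column in
    ascending order, and times holds the matching time information values.
    Returns None when the click falls to the left of the first column.
    """
    index = bisect.bisect_right(column_starts, click_x) - 1
    if index < 0:
        return None
    return times[index]
```

The returned value is the key with which the audio information would then be fetched for playback in step S302.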

The flow of operation when no click on an audio waveform or recognition result occurs in step S301 is described later (C).

The operator listens to the audio output from the audio output device 106 and judges whether the recognition result text shown in the recognition result display unit 1042 is correct. The screen generation unit 1032 may be given a function that highlights the recognition result text corresponding to the audio currently being played, for example by changing its color or font. This lets the operator see at a glance which part is being worked on, which reduces errors in visual checking.

If the recognition result is correct (YES in S303) and the operator either double-clicks the relevant part of the recognition result display unit 1042 or drags and drops it onto the transcription text display unit 1044 to copy it, the control unit 1039 detects this from the input signal of the input unit 1038 and controls the screen generation unit 1032 so that the recognition result text is copied into and displayed in the blank position of the transcription text display unit 1044 (S304). FIG. 9 shows this situation. In FIG. 9, the operator has clicked the recognition result 「続きまして」 ("continuing on") corresponding to the time information "0:00:03", listened to the audio, and confirmed that the result is correct. Dragging and dropping 「続きまして」 onto the transcription text display unit 1044 at the same X coordinate adds it to the transcription text display unit 1044. The transcription text added to the transcription text display unit 1044 is then stored by the data writing unit 1037 in the synchronization information storage unit 102, synchronized by time information, as shown in FIG. 10 (S306).

If, on the other hand, the recognition result is not correct (NO in S303), the operator types the correct text from the keyboard of the input device 105 into the corresponding position of the transcription text display unit 1044 (S305). FIG. 11 shows this situation. In FIG. 11, the operator has clicked the recognition result 「秋の」 ("autumn's") corresponding to the time information "0:00:04", listened to the audio, and found the result to be wrong, and has therefore typed the correct transcription text 「先の」 ("the previous") into the transcription text display unit 1044 at the same X coordinate. Instead of a device for direct character entry such as a keyboard, the text may be entered by speech recognition using a microphone. The transcription text added to the transcription text display unit 1044 is then stored by the data writing unit 1037 in the synchronization information storage unit 102, synchronized by time information, as shown in FIG. 12 (S306).
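The accept-or-correct branch of steps S303 to S306 can be sketched as follows. The function `confirm_or_correct` and the dictionary-based store are hypothetical stand-ins for the screen operations and the data writing unit 1037; the sample words come from FIG. 9 and FIG. 11.

```python
from typing import Dict, Optional


def confirm_or_correct(store: Dict[str, Dict[str, str]], time_info: str,
                       recognition_is_correct: bool,
                       typed_text: Optional[str] = None) -> str:
    """Fill the transcription text of one entry and return the stored text."""
    entry = store[time_info]
    if recognition_is_correct:
        # S304: copy the recognition result as-is (double-click or drag and drop).
        text = entry["recognized"]
    else:
        # S305: the operator supplies the correct text.
        if not typed_text:
            raise ValueError("corrected text is required when recognition is wrong")
        text = typed_text
    # S306: write the text back under the same time information key.
    entry["transcript"] = text
    return text
```

Used on the two cases from the figures, the entry at "0:00:03" keeps 「続きまして」 and the entry at "0:00:04" receives the typed 「先の」 in place of the misrecognized 「秋の」.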

If work has been completed through the end of the target content (YES in S307), the whole task ends. If the operator is still partway through the target content (NO in S307), work continues (B). FIG. 13 is a flowchart showing an example of the processing for continued transcription work.

Referring to FIG. 13, when the operator clicks with the mouse on a position in the audio waveform display unit 1043, the recognition result display unit 1042, or the transcription text display unit 1044 shown on the display device 104, thereby designating the desired playback position (YES in S401), the control unit 1039 detects this from the signal from the input unit 1038, acquires from the screen generation unit 1032 the time information corresponding to the clicked waveform, recognition result text, or transcription text, and notifies the audio playback unit 1035. The audio playback unit 1035 retrieves the audio information corresponding to that time information from the synchronization information storage unit 102 and generates a playback signal, and the output unit 1036 drives the audio output device 106, such as a speaker, with that signal (S402). The audio information for the clicked location is thus played back as sound. Playback can be repeated as many times as needed by repeating the same step.

The flow of operation when no position is designated in step S401 is described later (C).

Next, the operator listens to the audio output from the audio output device 106 and judges whether the recognition result text shown in the recognition result display unit 1042 and the transcription text shown in the transcription text display unit 1044 are correct. The screen generation unit 1032 may be given a function that highlights the recognition result text and transcription text corresponding to the audio currently being played, for example by changing their color or font. This lets the operator see at a glance which part is being worked on, which reduces errors in visual checking.

If the transcription text is blank or wrong and the recognition result is correct (YES in S403), and the operator either double-clicks the relevant part of the recognition result display unit 1042 or drags and drops it onto the transcription text display unit 1044 to copy it, the control unit 1039 detects this from the input signal of the input unit 1038 and controls the screen generation unit 1032 so that the recognition result text is copied into and displayed at that position of the transcription text display unit 1044 (S404). If, on the other hand, the transcription text is blank or wrong and the recognition result is not correct (NO in S403), the operator types the correct text from the keyboard of the input device 105 into the corresponding position of the transcription text display unit 1044 (S405). Instead of a device for direct character entry such as a keyboard, the text may be entered by speech recognition using a microphone.

When transcription text is inserted or entered, it is passed to the data writing unit 1037 together with the time information corresponding to it. The data writing unit 1037 accesses the matching entry of the synchronization information storage unit 102 and stores the received transcription text there in association with the received time information (S406).

If the operator is still partway through the target content (NO in S407), work continues further (B).

Referring to FIG. 14, as a result of the continued work, the transcription text display unit 1044 shows, at the same X coordinate as each corresponding time information value, the entries 「続きまして」 ("continuing on"), 「先の」 ("the previous"), 「委員会で」 ("at the committee"), 「出されました」 ("was submitted"), 「議案」 ("bill"), 「について」 ("regarding"), and 「ですが」 ("but").

If work has been completed through the end of the target content (YES in S407), the whole task ends. One way to notify the confirmation/correction unit 103 that work has ended is, as shown in FIG. 14, to provide a work end button 1049 labeled "Save and exit" on the confirmation/correction screen 1046, so that the operator notifies the confirmation/correction unit 103 by pressing the button.

FIG. 15 is a flowchart showing the flow of processing executed by the confirmation/correction unit 103 when the determination in step S301 of FIG. 8 or step S401 of FIG. 13 is NO. The scroll operation of the confirmation/correction screen 1046 is described below with reference to this flowchart and the example confirmation/correction screen of FIG. 16.

Suppose the operator presses one of the scroll buttons shown in FIG. 16, display icon 1045-1 or display icon 1045-2 (YES in S501). The control unit 1039 detects this from the input signal from the input unit 1038 and instructs the screen generation unit 1032 to shift the displayed time range by a preset scroll time. If, for example, the scroll time set for display icon 1045-2 is 9 seconds, the screen generation unit 1032 obtains the first time information value currently displayed in the window, reads the time information, audio information, recognition results, and transcription text in order from the synchronization information storage unit 102 through the data reading unit 1031 so that the value 9 seconds later comes first (S502), and switches the display of each item so that it starts from that time information value (S503).

Referring to FIG. 14 and FIG. 16, the first time information value in the time information display unit 1041 in FIG. 14 is "0:00:03". Because 9 seconds are added, FIG. 16 shows, starting from the time information "0:00:12", the corresponding audio information in order, the recognition results 「起案」 ("draft"), 「について」 ("regarding"), 「ですが」 ("but"), 「大根の」 ("radish's"), 「財政基盤の」 ("of the fiscal base"), 「悪貨に」 ("to bad money"), and 「伴いまして」 ("accompanying"), and the transcription text 「議案」 ("bill"), 「について」 ("regarding"), 「ですが」 ("but"), 「昨今の」 ("recent"), 「財政基盤の」 ("of the fiscal base"), 「悪化に」 ("the deterioration"), and 「伴いまして」 ("accompanying").
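The scroll arithmetic of steps S502 and S503 reduces to adding the scroll time to the first displayed time information value and re-reading the entries from there. The sketch below assumes a 12-second visible window (FIG. 6 spans "0:00:03" to "0:00:14") and clamps backward scrolling at the head of the content; both choices, and all function names, are assumptions for illustration.

```python
from typing import List


def to_seconds(time_info: str) -> int:
    """Convert "H:MM:SS" to seconds from the head of the content."""
    hours, minutes, seconds = (int(part) for part in time_info.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def format_time(total_seconds: int) -> str:
    """Convert seconds back to the "H:MM:SS" form used on screen."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


def scrolled_head(current_head: str, scroll_seconds: int) -> str:
    """New first time value after scrolling; never earlier than 0:00:00."""
    return format_time(max(0, to_seconds(current_head) + scroll_seconds))


def visible_times(times: List[str], head: str,
                  window_seconds: int = 12) -> List[str]:
    """Time values to read from storage for the window starting at head."""
    start = to_seconds(head)
    return [t for t in times if start <= to_seconds(t) < start + window_seconds]
```

With the values from the text, `scrolled_head("0:00:03", 9)` yields "0:00:12", which is the first time information value shown in FIG. 16.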

After the display has been switched, operation moves to the continued transcription work shown in FIG. 13 (B). If there is no scroll instruction (NO in S501), the unit checks whether work has ended; if work is continuing (NO in S504), operation moves to the continued transcription work shown in FIG. 13 (B), and if work has ended (YES in S504), the task ends.

As described above, the present embodiment greatly reduces the operator's workload when creating transcription text for the audio information contained in the material content 111, and makes it possible to create accurate transcription text.

An embodiment of the present invention has been described above, but the present invention is not limited to that embodiment, and various other additions and modifications are possible. The functions of the synchronization information generation unit 101 and the confirmation/correction unit 103 in the transcription text creation support system of the present invention can of course be implemented in hardware, and they can also be implemented with a computer and a program. The program is provided recorded on a computer-readable recording medium such as a magnetic disk or semiconductor memory, is read by the computer at startup or a similar time, and, by controlling the operation of the computer, causes it to function as the synchronization information generation unit 101 and the confirmation/correction unit 103 of the embodiment described above.

FIG. 1 is a block diagram of an embodiment of the transcription text creation support system of the present invention.
FIG. 2 is a block diagram of an example of the synchronization information generation unit.
FIG. 3 is a flowchart showing the processing flow of the synchronization information generation unit.
FIG. 4 is a diagram showing part of an example of the data stored in the synchronization information storage unit.
FIG. 5 is a block diagram of an example of the confirmation/correction unit.
FIG. 6 is a diagram showing an example of the confirmation/correction screen in the initial state, in which the transcription text field of each entry in the synchronization information storage unit is blank.
FIG. 7 is a flowchart showing an example of the processing of the confirmation/correction unit when displaying the initial-state confirmation/correction screen.
FIG. 8 is a flowchart showing the flow of processing executed by the confirmation/correction unit after the initial-state confirmation/correction screen has been displayed on the display device.
FIG. 9 is a diagram showing one scene in which transcription text has been entered on the confirmation/correction screen.
FIG. 10 is a diagram showing transcription text written to the synchronization information storage unit.
FIG. 11 is a diagram showing another scene in which transcription text has been entered on the confirmation/correction screen.
FIG. 12 is a diagram showing transcription text written to the synchronization information storage unit.
FIG. 13 is a flowchart showing an example of the processing for continued transcription work executed by the confirmation/correction unit.
FIG. 14 is a diagram showing a further scene in which transcription text has been entered on the confirmation/correction screen.
FIG. 15 is a flowchart showing an example of the remaining processing of the confirmation/correction unit.
FIG. 16 is a diagram showing an example of the confirmation/correction screen after scrolling.

Explanation of symbols

101 ... synchronization information generation unit
1011 ... material input unit
1012 ... speech recognition unit
1013 ... time information acquisition unit
1014 ... recording unit
102 ... synchronization information storage unit
1021 ... time information field
1022 ... audio information field
1023 ... recognition result field
1024 ... transcription text field
1025, 1025-1, 1025-2 ... entry
103 ... confirmation/correction unit
1031 ... data reading unit
1032 ... screen generation unit
1033 ... audio waveform generation unit
1034 ... display unit
1035 ... audio playback unit
1036 ... output unit
1037 ... data writing unit
1038 ... input unit
1039 ... control unit
104 ... display device
1041 ... time information display unit
1042 ... recognition result display unit
1043 ... audio waveform display unit
1044 ... transcription text display unit
1045-1, 1045-2 ... scroll button
1046 ... confirmation/correction screen
1047, 1048 ... guide line
1049 ... work end button
105 ... input device
106 ... audio output device
107 ... output device

Claims

1. A transcription text creation support system comprising: a storage device; a display device; an input device; an audio output device; a synchronization information generation unit that divides a speech recognition result, obtained by performing speech recognition processing on audio information contained in material content, into units smaller than a sentence and, for each division unit, stores in the storage device, in association with one another, the speech recognition result of that division unit, the audio information from which it was derived, time information identifying that audio information, and transcription text in its initial state; and a confirmation/correction unit that displays on the screen of the display device, in parallel and aligned with one another on a time axis, the speech recognition result of each division unit stored in the storage device, an image of the audio information from which it was derived, the time information identifying that audio information, and the transcription text, that reproduces the audio information of a designated division unit and outputs it from the audio output device in response to an audio playback instruction in which the input device designates a division unit on the screen, and that updates the transcription text of the designated division unit in the storage device in response to a transcription text input operation in which the input device designates a division unit on the screen.

2. The transcription text creation support system according to claim 1, wherein the division unit is a word.

3. The transcription text creation support system according to claim 1, wherein the image of the audio information is an audio waveform image generated from the audio information.

4. The transcription text creation support system according to claim 1, wherein the transcription text input operation in which the input device designates a division unit on the screen is a copy operation that copies the recognition result into the transcription text.

5. The transcription text creation support system according to claim 1, wherein, in response to a scroll instruction from the input device, the confirmation/correction unit scrolls the speech recognition result of each division unit displayed on the screen of the display device, the image of the audio information from which it was derived, the time information identifying that audio information, and the transcription text in synchronization with one another.

6. A program that causes a computer having a storage device, a display device, an input device, and an audio output device to function as: synchronization information generation means that divides a speech recognition result, obtained by performing speech recognition processing on audio information contained in material content, into units smaller than a sentence and, for each division unit, stores in the storage device, in association with one another, the speech recognition result of that division unit, the audio information from which it was derived, time information identifying that audio information, and transcription text in its initial state; and confirmation/correction means that displays on the screen of the display device, in parallel and aligned with one another on a time axis, the speech recognition result of each division unit stored in the storage device, an image of the audio information from which it was derived, the time information identifying that audio information, and the transcription text, that reproduces the audio information of a designated division unit and outputs it from the audio output device in response to an audio playback instruction in which the input device designates a division unit on the screen, and that updates the transcription text of the designated division unit in the storage device in response to a transcription text input operation in which the input device designates a division unit on the screen.

7. The program according to claim 6, wherein the division unit is a word.