JP6394332B2

JP6394332B2 - Information processing apparatus, transcription support method, and transcription support program

Info

Publication number: JP6394332B2
Application number: JP2014244161A
Authority: JP
Inventors: 野田 拓也; 渡辺 一宏; 斎藤 淳哉
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2014-12-02
Filing date: 2014-12-02
Publication date: 2018-09-26
Anticipated expiration: 2034-12-02
Also published as: JP2016109735A

Description

The present invention relates to an information processing apparatus, a transcription support method, and a transcription support program.

In transcription work that converts recorded audio data into text, transcription is still mainly done by hand. Even when automatic transcription is performed using speech recognition technology or the like, the recognition accuracy is insufficient, so manual correction is required. Manual transcription requires listening to the audio several times, and the cue playback operation must be repeated each time, which lowers transcription efficiency. There is therefore a need for a technique that automates cue playback and improves transcription efficiency.

For example, there is a technique that builds a list of per-morpheme audio position information from the speech recognition result of the audio data, compares manually transcribed text against that list, and identifies the cue position in the audio data based on the position where the two match.

JP 2013-152365 A
JP 2013-25763 A
JP 2005-228178 A

However, if the speech recognition result for the recorded audio data contains many recognition errors, it does not match the text being compared, and the cue position cannot be identified.

In one aspect, an object of the present invention is to appropriately identify the cue playback position and improve document transcription efficiency.

In one aspect, an information processing apparatus includes: an audio data storage unit that stores audio data; an audio reproduction unit that reproduces the audio data from a cue playback position; a text generation unit that generates transcription-unit kana text from kana text input in correspondence with the audio data reproduced by the audio reproduction unit; a dictionary generation unit that generates, from the transcription-unit kana text obtained from the text generation unit, a dynamic recognition dictionary whose unit is one or more recognition vocabulary items; a position information extraction unit that extracts position information of the audio data corresponding to the recognition vocabulary by speech recognition using the dynamic recognition dictionary obtained by the dictionary generation unit; and a reproduction position determination unit that determines the cue playback position from the position information extracted by the position information extraction unit.

In one aspect, the present invention makes it possible to appropriately identify the cue playback position and improve document transcription efficiency.

A diagram showing an example of the functional configuration of the information processing apparatus in the first embodiment.
A diagram showing an example of a hardware configuration.
A flowchart showing an example of the transcription support process in the first embodiment.
A diagram showing a specific example of the transcription support process.
A diagram showing an example of the system configuration of the information processing system in the second embodiment.
A diagram showing an example of the system configuration of the information processing system in the third embodiment.
A diagram showing a first example of the transcription support process.
A diagram for explaining the operation in the first example of the transcription support process.
A diagram showing a second example of the transcription support process.
A diagram for explaining the operation in the second example of the transcription support process.
A diagram showing a third example of the transcription support process.
A diagram for explaining the operation in the third example of the transcription support process.
A diagram showing a fourth example of the transcription support process.
A flowchart showing an example of the fourth example of the transcription support process.
A diagram for explaining the operation in the fourth example of the transcription support process.

Embodiments will be described below with reference to the drawings.

<First Embodiment>
FIG. 1 is a diagram showing an example of the functional configuration of the information processing apparatus in the first embodiment. The information processing apparatus 10 shown in FIG. 1 includes a storage unit 11, an audio reproduction unit 12, an input unit 13, a text generation unit 14, a dictionary generation unit 15, a position information extraction unit 16, and a reproduction position determination unit 17.

The storage unit 11 stores various kinds of information needed for the transcription support process, as well as processing results and the like. The storage unit 11 includes, for example, an audio data storage unit 11a. The audio data storage unit 11a stores the audio data from which transcription text or the like is to be generated. The audio data is, for example, a recording of human speech such as an interview, conversation, meeting, lecture, address, or speech, but is not limited to these.

The audio reproduction unit 12 reproduces the audio data stored in the audio data storage unit 11a. The audio reproduction unit 12 may reproduce the audio data at normal speed, or in a low-speed or high-speed mode. The playback speed can be changed before or during playback as specified by the user.

The input unit 13 accepts input such as transcription text from a user or the like who has listened to the audio data reproduced by the audio reproduction unit 12. The input unit 13 also accepts input from the user such as settings for the information processing apparatus 10 and instructions to start or end the transcription process. The input unit 13 is, for example, a keyboard, but is not limited to this, and may be operation buttons or the like on a touch panel displayed on a screen.

The text generation unit 14 generates transcription-unit kana text from the information obtained from the input unit 13. Transcription-unit kana text is, for example, text information that the user enters through the input unit 13 by manual transcription after the recorded audio data is played back, and consists of, for example, one or more recognition vocabulary items. Kana text is text information expressed in katakana, hiragana, or the like that has not been converted into kanji or the like.

The dictionary generation unit 15 generates a dynamic recognition dictionary having one or more recognition vocabulary items from the one or more transcription-unit kana texts generated by the text generation unit 14. The dictionary generation unit 15 may also apply long-vowel conversion according to a preset long-vowel conversion rule and register the converted vocabulary (long-vowel kana text) in the dynamic recognition dictionary as well. The generated dynamic recognition dictionary may be stored in, for example, the storage unit 11.
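The long-vowel conversion rule itself is not spelled out here. As a minimal sketch, assuming a rule in which 「う」 following an o-column or u-column kana is replaced by a long-vowel mark, the Python below produces both the original and the converted form for registration. The function names, the kana tables, and the rule are illustrative assumptions, not the patent's definition; the output uses 「ー」 where the description writes the mark as 「−」.

```python
# Hypothetical rule tables (assumption): kana whose vowel is lengthened by a following "う".
O_COLUMN = set("おこそとのほもよろをごぞどぼぽょ")
U_COLUMN = set("うくすつぬふむゆるぐずづぶぷゅ")
LENGTHENABLE = O_COLUMN | U_COLUMN


def lengthen_vowels(kana: str) -> str:
    """Replace "う" after an o-column or u-column kana with the long-vowel mark."""
    out = []
    for ch in kana:
        if ch == "う" and out and out[-1] in LENGTHENABLE:
            out.append("ー")
        else:
            out.append(ch)
    return "".join(out)


def dictionary_entries(kana: str) -> list:
    """Return the forms to register: the kana text and, if different, its long-vowel form."""
    converted = lengthen_vowels(kana)
    return [kana] if converted == kana else [kana, converted]
```

Under this assumed rule, `dictionary_entries("しゅどうそうちおよび")` yields the original text plus 「しゅどーそーちおよび」, which corresponds to the long-vowel form used as an example recognition vocabulary item later in the description.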

The position information extraction unit 16 performs speech recognition using the dynamic recognition dictionary generated by the dictionary generation unit 15. It also extracts, from the audio data and the speech recognition result, the position information of the audio data corresponding to the recognition vocabulary. The speech recognition process uses the dynamic recognition dictionary to recognize a single recognition vocabulary item. The process preferably performs word spotting in units of recognition vocabulary rather than dictation, which recognizes the audio data word for word. Word spotting is, for example, a technique that picks out of the audio data the vocabulary contained in the dynamic recognition dictionary; because it does not recognize extraneous vocabulary that is absent from the dictionary, it can suppress misrecognition. The speech recognition process can use, for example, acoustic features, but is not limited to this.

The reproduction position determination unit 17 determines the cue playback position of the audio data from the position information extracted by the position information extraction unit 16. The cue playback position may be the beginning or the end of the confirmed kana text, but is not limited to these, and may be partway through the morae contained in the kana text.

The information processing apparatus 10 is, for example, a personal computer (PC), a server, a smartphone, or a tablet terminal, but is not limited to these.

In the first embodiment, the configuration described above makes it possible to appropriately identify, for example, the playback position of the recorded audio that follows the confirmed kana text, or the playback position of recorded audio to be listened to again for confirmation, and thereby to improve document transcription efficiency. For example, during the repeated re-listening involved in manual transcription, or in manual correction of automatic transcription, cue playback from the correct position becomes possible, and transcription efficiency improves.

<Hardware configuration example>
Next, an example of the hardware configuration of a computer such as the information processing apparatus 10 will be described with reference to the drawings. FIG. 2 is a diagram showing an example of a hardware configuration. In the example of FIG. 2, the information processing apparatus 10 includes an input device 21, an output device 22, a drive device 23, an auxiliary storage device 24, a main storage device 25, a central processing unit (CPU) 26, and a network connection device 27, which are connected to one another by a system bus B.

The input device 21 includes a keyboard and a pointing device such as a mouse operated by a user or the like, and an audio input device such as a microphone. It accepts input from the user or the like, such as program execution instructions, various kinds of operation information, and information for starting software.

The output device 22 includes a display or the like that shows the various windows, data, and so on needed to operate the computer main body (information processing apparatus 10) that performs the processing of this embodiment. The output device 22 can display the progress and results of program execution under a control program of the CPU 26.

Here, in this embodiment, the execution program installed in the computer main body, such as the information processing apparatus 10, is provided by, for example, a recording medium 28. The recording medium 28 can be set in the drive device 23. Based on a control signal from the CPU 26, the execution program stored in the recording medium 28 is installed from the recording medium 28 into the auxiliary storage device 24 via the drive device 23.

The auxiliary storage device 24 is, for example, a storage means such as a hard disk drive (HDD) or a solid state drive (SSD). Based on a control signal from the CPU 26, the auxiliary storage device 24 stores the execution program of this embodiment (for example, the transcription support program), control programs provided in the computer, and the like, and performs input and output as needed. Based on a control signal or the like from the CPU 26, the auxiliary storage device 24 can read necessary information from, and write it to, the stored information.

The main storage device 25 stores the execution program and the like read from the auxiliary storage device 24 by the CPU 26. The main storage device 25 is a read-only memory (ROM), a random access memory (RAM), or the like.

The CPU 26 implements each process by controlling the processing of the entire computer, such as various computations and data input and output with each hardware component, based on a control program such as an operating system (OS) and the execution program stored in the main storage device 25. Various kinds of information needed during program execution can be acquired from the auxiliary storage device 24, and execution results and the like can be stored there.

Specifically, the CPU 26 runs the program installed in the auxiliary storage device 24 based on, for example, a program execution instruction obtained from the input device 21, and thereby performs the processing corresponding to the program on the main storage device 25.

For example, by running the transcription support program, the CPU 26 performs processing such as storage of audio data and other information by the storage unit 11, audio playback by the audio reproduction unit 12, input of transcription text, execution instructions, and the like by the input unit 13, text generation by the text generation unit 14, dictionary generation by the dictionary generation unit 15, position information extraction by the position information extraction unit 16, and determination of the audio data playback position by the reproduction position determination unit 17. The processing performed by the CPU 26 is not limited to the above. The results of processing executed by the CPU 26 are stored in the auxiliary storage device 24 or the like as needed.

The network connection device 27 communicates with other external devices via a communication network such as the Internet or a local area network (LAN). Based on a control signal from the CPU 26, the network connection device 27 connects to the communication network or the like and acquires execution programs, software, setting information, and so on from an external device or the like. The network connection device 27 may also provide an external device or the like with the results obtained by running a program, or with the execution program of this embodiment itself.

The recording medium 28 is, as described above, a computer-readable recording medium on which the execution program and the like are stored. The recording medium 28 is, for example, a semiconductor memory such as a flash memory, or a portable recording medium such as a CD-ROM or DVD, but is not limited to these.

By installing the execution program (for example, the transcription support program) on the hardware configuration shown in FIG. 2, the hardware resources and the software work together to implement the transcription support process and the like of this embodiment.

<Transcription support processing>
Next, the transcription support process will be described with reference to a flowchart. FIG. 3 is a flowchart showing an example of the transcription support process in the first embodiment. In the example of FIG. 3, the input unit 13 of the information processing apparatus 10 reproduces, for example, the audio data to be transcribed that is stored in the audio data storage unit 11a, and accepts text input for the reproduced audio data (S01). Next, the text generation unit 14 determines from the accepted information whether the transcription text is in the state before kana-kanji conversion (S02). In S02, if the input unit 13 is a keyboard or the like, for example, the determination can be made based on whether a predetermined key such as the conversion key or the space key has been pressed, but the determination is not limited to this. For example, if the conversion key or the space key has not yet been pressed, the text is determined to be before kana-kanji conversion. If the conversion key or the space key has been pressed, the text is determined to be after kana-kanji conversion. Alternatively, whether the text is before kana-kanji conversion may be determined by checking whether the content after the conversion key or space key is pressed (for example, its character codes) contains kanji.
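The character-code variant of the S02 check can be sketched as follows. This is only an illustration, assuming that "contains kanji" is tested against the CJK Unified Ideographs blocks of Unicode plus the iteration mark 々; the function names and the sample converted string are hypothetical.

```python
def contains_kanji(text: str) -> bool:
    """True if any character lies in a CJK ideograph block (assumed definition of kanji)."""
    return any(
        "\u4e00" <= ch <= "\u9fff"      # CJK Unified Ideographs
        or "\u3400" <= ch <= "\u4dbf"   # CJK Unified Ideographs Extension A
        or ch == "々"                    # ideographic iteration mark
        for ch in text
    )


def is_before_kana_kanji_conversion(text: str) -> bool:
    """S02-style check: text with no kanji is treated as not yet converted."""
    return not contains_kanji(text)
```

A text that converts to kana only would still be classified as "before conversion" by this check, which is one reason the key-press test described above may be the more reliable signal in practice.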

If, in S02, the text is before kana-kanji conversion (YES in S02), the transcription text as entered in kana is acquired (S03). If the text is after kana-kanji conversion (NO in S02), the converted kana-kanji text is acquired (S04). Here, kana-kanji text means, for example, transcription text entered manually.

Next, the text generation unit 14 determines whether the kana-kanji text has been confirmed (S05). In S05, confirmation is judged by whether the kana-kanji text has already been acquired; if it has, the text is judged to be confirmed.

If the kana-kanji text has not been confirmed in S05 (NO in S05), the process returns to S01. If the kana-kanji text has been confirmed in S05 (YES in S05), the text generation unit 14 generates the kana text acquired for that transcription unit (confirmation unit) (S06).

Next, the text generation unit 14 joins a plurality of consecutively generated kana texts (S07). In S07, for example, when a transcription-unit kana text has no more than a predetermined number of morae, it is joined with past transcription-unit kana text. Alternatively, when a transcription-unit kana text has no more than the predetermined number of morae, it may be joined with past transcription-unit kana text so that the result reaches the predetermined number of morae. In S07, transcription-unit kana texts may also simply be concatenated as they arrive.
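The joining step in S07 can be illustrated with a short sketch. It assumes the usual mora-counting convention for kana (small ゃ, ゅ, ょ and small vowels do not count as separate morae, while っ, ん, and the long-vowel mark do) and assumes that a unit is joined with earlier units while it has fewer than a minimum number of morae; the function names and the default of five morae are illustrative choices, not taken from the flowchart.

```python
from typing import List

# Small kana that attach to the preceding kana and do not form a mora of their own.
SMALL_KANA = set("ゃゅょぁぃぅぇぉゎャュョァィゥェォヮ")


def count_morae(kana: str) -> int:
    """Count morae in kana text under the convention stated above."""
    return sum(1 for ch in kana if ch not in SMALL_KANA)


def build_vocabulary(units: List[str], min_morae: int = 5) -> str:
    """Join the latest unit with preceding units until min_morae is reached.

    units: confirmed transcription-unit kana texts, oldest first.
    Returns the kana text to use as the single recognition vocabulary item.
    """
    vocabulary = ""
    for unit in reversed(units):
        vocabulary = unit + vocabulary
        if count_morae(vocabulary) >= min_morae:
            break
    return vocabulary
```

With the units 「しゅどうそうち」 and 「および」, the three-mora 「および」 is too short on its own under these assumptions and is joined with the preceding unit, whereas the five-mora 「ほうほうの」 stands alone.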

Next, the dictionary generation unit 15 generates a dynamic recognition dictionary with the kana text obtained in S07 as a single recognition vocabulary item (S08). Limiting the dynamic recognition dictionary to a single recognition vocabulary item makes the recognition process both faster and more accurate. The dynamic recognition dictionary obtained in S08 may be stored in the storage unit 11.

Next, the position information extraction unit 16 uses the dynamic recognition dictionary obtained in S08 to perform speech recognition (word spotting) of the single recognition vocabulary item (S09). In S09, performing word spotting in units of recognition vocabulary, rather than dictation that recognizes the audio word for word, suppresses misrecognition.

Next, the position information extraction unit 16 acquires, from the recognition result of S09, at least one of the start of the head mora, the start of an intermediate mora, and the end of the final mora of the word-spotted range as audio data position information (S10). Here, the start of the head mora is, for example, the position immediately before the sound 「し」 (shi) is output when the recognition vocabulary item is 「しゅど−そ−ちおよび」 (shudo-so-chi oyobi). The start of the intermediate mora is, for the same vocabulary item, the position immediately before the sound 「お」 (o) is output. The end of the final mora is, for the same vocabulary item, the position immediately after the sound 「び」 (bi) is output. The position information may be acquired selectively based on a preset mode or the like, or all of the position information may be acquired regardless of the mode.
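The three kinds of position information in S10 can be modeled as a small selection function. This is a sketch only: the `SpottingResult` structure, the `CueMode` names, and the millisecond timings are hypothetical and merely stand in for whatever a word-spotting recognizer would actually return.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class CueMode(Enum):
    HEAD_MORA_START = "head"
    INTERMEDIATE_MORA_START = "intermediate"
    FINAL_MORA_END = "end"


@dataclass
class SpottingResult:
    mora_start_ms: List[int]   # start time of each mora in the spotted range
    end_ms: int                # end time of the final mora
    intermediate_index: int    # index of the mora that begins the last joined unit


def cue_position_ms(result: SpottingResult, mode: CueMode) -> int:
    """Pick the cue playback position for the requested mode."""
    if mode is CueMode.HEAD_MORA_START:
        return result.mora_start_ms[0]
    if mode is CueMode.INTERMEDIATE_MORA_START:
        return result.mora_start_ms[result.intermediate_index]
    return result.end_ms


# Hypothetical timings for the nine morae of the example vocabulary item;
# index 6 corresponds to the mora 「お」 that begins the joined unit.
example = SpottingResult(
    mora_start_ms=[0, 120, 240, 360, 480, 600, 720, 840, 960],
    end_ms=1080,
    intermediate_index=6,
)
```

Keeping all three positions in one result object matches the option of acquiring every position regardless of mode and deferring the choice to the moment the cue playback position is fixed.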

Next, the reproduction position determination unit 17 fixes the audio data position information acquired in S10 as the cue playback position (S11) and starts audio playback from that position (S12).

It is then determined whether to end the process (S13). If the process is not to be ended (NO in S13), the process returns to S01. If the process is to be ended, for example because of an end operation by the user or because the audio data has finished (YES in S13), the transcription support process ends.

As described above, in the first embodiment, the information processing apparatus 10 acquires kana text in transcription units and generates a dynamic recognition dictionary in which kana text of at least one transcription unit is a single recognition vocabulary item. The information processing apparatus 10 then performs word spotting on that single recognition vocabulary item by speech recognition and determines the cue playback position from the word-spotted audio range. In the first embodiment, word-spotting speech recognition restricted to a single, many-mora recognition vocabulary item raises recognition accuracy and thereby improves the accuracy of the cue playback position.

With the processing described above, the accuracy of fixing the cue playback position improves during manual transcription or manual correction of automatic transcription, and transcription efficiency increases. The accuracy of automatic audio playback and stopping can therefore be improved.

<Specific examples of transcription support processing>
Next, a specific example of the transcription support process described above will be described. FIG. 4 is a diagram showing a specific example of the transcription support process. FIGS. 4(A) and 4(B) each show examples of the "kana text transcription unit", the "single recognition vocabulary item of the dynamic recognition dictionary", the "cue playback position", and the "cue playback when the transcription unit is confirmed". In the examples of FIGS. 4(A) and 4(B), the transcription units of the kana text are, for example, 「しゅどうそうち」 (shudousouchi), 「および」 (oyobi), 「ほうほうの」 (houhouno), and so on, and a dynamic recognition dictionary is generated each time a kana text is entered.

In the examples of FIGS. 4(A) and 4(B), for convenience of explanation, the dynamic recognition dictionaries generated each time a kana text such as 「しゅどうそうち」, 「および」, or 「ほうほうの」 is entered are all left in place and shown together. The dynamic recognition dictionary in the first embodiment corresponds to, for example, each individual record in FIGS. 4(A) and 4(B).

FIG. 4(A) shows an example in which one recognition vocabulary item is kana text of at least a predetermined number of morae (five morae). Setting a minimum number of morae for a recognition vocabulary item improves word spotting accuracy. In the example of FIG. 4(A), the position information for the case where the end of the final mora of the word-spotted range is used as the cue playback position is shown as time elapsed since the start of audio data playback. Managing position information as time information makes it possible, for example, to play the audio data using the position of the end of a transcription unit as the cue playback position.

FIG. 4(B) shows an example in which one recognition vocabulary item is concatenated kana text. This lengthens the recognition vocabulary item in morae and improves word spotting accuracy. In the example of FIG. 4(B), the time information for the case where the start of the intermediate mora of the word-spotted range is used as the cue playback position is shown. This allows cue playback from an intermediate mora or the like, as a mode for rechecking whether the transcription is correct. The start of the intermediate mora refers to the head mora of the next vocabulary item that was joined, transcription unit by transcription unit, in the concatenated kana text. As shown in FIG. 4(B), using an intermediate position in word spotting can improve the accuracy of the cue playback position in the transcription-efficiency-oriented mode.

As shown in FIG. 4(C), suppose that, to improve the accuracy of the cue playback position, the position 900 ms from the beginning is taken as the cue playback position. If that mora is the final mora, there is no information (acoustic features) for a following mora, so the detection accuracy of the end boundary tends to degrade. When there is an intermediate mora, however, the information (acoustic features) of the intermediate mora (the following mora) is available when the cue playback position is determined, so the detection accuracy of the boundary between morae improves, and an appropriate position, for example 880 ms from the beginning, can be used as the cue playback position.

<Second Embodiment>
Next, a second embodiment will be described. The second embodiment realizes transcription support processing similar to that of the first embodiment described above, using a system configuration with a server and a client terminal.

FIG. 5 is a diagram illustrating an example system configuration of an information processing system according to the second embodiment. The information processing system shown in FIG. 5 includes a server 31 and a client terminal 32, which are connected by a communication network 33, typified by the Internet or a LAN, so that data can be transmitted and received. The communication network 33 may be wired, wireless, or a combination of both. The numbers of servers 31 and client terminals 32 are not limited to those shown; for example, a plurality of client terminals 32 may be connected to the server 31. The server 31 and the client terminal 32 each have the hardware configuration and the like of the information processing apparatus described above.

The server 31 includes a communication control unit 41, a storage unit 42, a dictionary generation unit 43, and a position information extraction unit 44. The storage unit 42 includes an audio data storage unit 42a.

The communication control unit 41 controls communication with the client terminal 32 and other external devices via the communication network 33 to transmit and receive data and the like. The audio data storage unit 42a stores audio data obtained from each client terminal 32. The dictionary generation unit 43 generates a dynamic recognition dictionary from the transcription-unit kana text obtained from the client terminal 32.

The position information extraction unit 44 extracts position information of the audio data using the dynamic recognition dictionary, and the communication control unit 41 transmits the audio data position information to the client terminal 32.

The server 31 may be, for example, a PC, or a cloud server built by cloud computing that includes one or more information processing apparatuses, but is not limited to these.

The client terminal 32 includes a communication control unit 51, an input unit 52, a text generation unit 53, a playback position determination unit 54, an audio playback unit 55, and a storage unit 56. The storage unit 56 includes an audio data storage unit 56a.

The communication control unit 51 controls communication with the server 31 and other external devices via the communication network 33 to transmit and receive data and the like. The input unit 52 accepts text data entered manually by the user, instructions to start or end processing, and the like. The text generation unit 53 generates text in transcription units. The communication control unit 51 then transmits the transcription-unit text generated by the text generation unit 53 and the audio data obtained from the audio data storage unit 56a to the server 31 via the communication network 33.

The playback position determination unit 54 determines the cue playback position from the audio data position information that corresponds to the transcription-unit text and audio data transmitted to the server 31 by the communication control unit 51. The audio playback unit 55 plays back the audio data from the cue playback position obtained by the playback position determination unit 54.

The audio data storage unit 56a stores audio data for each client terminal 32. The stored audio data is transmitted by the communication control unit 51 to the server 31 via the communication network 33.

With the configurations of the server 31 and the client terminal 32 in the second embodiment, processing similar to the transcription support processing of the first embodiment can be performed. For example, among the processes shown in FIG. 3, the processes of S08 to S10 are executed on the server 31 side and the remaining processes are executed on the client terminal 32 side. Data is exchanged between the server 31 and the client terminal 32 by the communication control units 41 and 51.

According to the second embodiment, for example, a cloud-service-type transcription support system can be provided. In addition, the processing load on the client terminal 32 side can be reduced compared with the first embodiment.

<Third Embodiment>
Next, a third embodiment will be described. As in the second embodiment, the third embodiment uses a system configuration divided into a server and a client terminal, but part of the configuration is changed. In the following description, components that are the same as in the second embodiment are given the same reference numerals, and their detailed description is omitted here.

FIG. 6 is a diagram illustrating an example system configuration of an information processing system according to the third embodiment. The information processing system 30' in the third embodiment includes a server 31' and a client terminal 32', which are connected by the communication network 33 so that data can be transmitted and received.

Compared with the second embodiment, the server 31' of the third embodiment does not have the audio data storage unit 42a of the server 31. In addition, the client terminal 32' has a feature amount extraction unit 61, which the client terminal 32 does not have.

The feature amount extraction unit 61 analyzes the audio data in units of a predetermined frame length and extracts the corresponding acoustic feature amounts. The acoustic feature amount is, for example, a feature such as the Mel Frequency Cepstrum Coefficient (MFCC), but is not limited to this. For example, the power (volume) of the input speech or the Differential Mel Frequency Cepstrum Coefficient (DMFCC) can also be used.
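As a rough illustration of frame-by-frame analysis, the following Python sketch computes the log power of each frame, which is one of the feature amounts mentioned above. The function name `frame_log_power`, the non-overlapping framing, and the decibel scaling are assumptions made for illustration only; the source does not specify them, and MFCC extraction is not shown.

```python
import math

def frame_log_power(samples, frame_len):
    """Return the log power (dB) of each non-overlapping frame.

    Hypothetical helper: the framing and scaling are illustrative assumptions.
    """
    features = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        mean_square = sum(x * x for x in frame) / frame_len
        # A small floor avoids log10(0) for silent frames.
        features.append(10.0 * math.log10(mean_square + 1e-10))
    return features
```

In a configuration like the third embodiment, per-frame values of this kind, rather than the raw samples, would be what the client sends to the server.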

The communication control unit 51 transmits the transcription-unit kana text and the acoustic feature amounts to the server 31'. The communication control unit 51 also receives the audio data position information from the server 31'.

In the server 31', the communication control unit 41 receives the transcription-unit kana text and the acoustic feature amounts from the client terminal 32'. The communication control unit 41 also transmits the audio data position information to the client terminal 32'.

The dictionary generation unit 43 generates a dynamic recognition dictionary from the transcription-unit kana text transmitted from the client terminal 32'. The position information extraction unit 44 performs speech recognition on the acoustic feature amounts, in units of the predetermined frame length, transmitted from the client terminal 32' and extracts the audio data position information. The communication control unit 41 transmits the position information extracted by the position information extraction unit 44 to the client terminal 32' via the communication network 33.

As in the second embodiment, the playback position determination unit 54 of the client terminal 32' determines the cue playback position from the audio data position information obtained from the server 31'. The audio playback unit 55 plays back the audio data from the cue playback position obtained by the playback position determination unit 54.

According to the third embodiment, for example, a cloud-service-type transcription support system can be provided. In addition, as in the second embodiment, the processing load on the client terminal side can be reduced compared with the first embodiment. The second and third embodiments described above can realize so-called distributed speech recognition (DSR) processing. In distributed speech recognition, the high-load part of speech recognition is performed on the server side and the low-load part is performed on the client terminal side. Furthermore, in the third embodiment, acoustic feature amounts rather than audio data are transmitted to the server 31', so that not only the processing load on the client terminal 32' side but also the network communication load can be reduced.

<Specific Examples of Dictionary Generation Units 15 and 43 and Playback Position Determination Units 17 and 54>
Next, specific examples of the transcription support processing using the dictionary generation units 15 and 43 and the playback position determination units 17 and 54 in the first to third embodiments described above will be described with reference to the drawings.

<First Example>
FIG. 7 is a diagram showing a first example of the transcription support processing. FIG. 7(A) shows a configuration example of the dictionary generation units 15 and 43 in the first example. In the first example, the dictionary generation units 15 and 43 each include a transcription unit combining unit 71 and a dynamic recognition dictionary generation unit 72.

The transcription unit combining unit 71 combines transcription-unit kana texts until the combined kana text has at least a predetermined number of moras. The dynamic recognition dictionary generation unit 72 generates a dynamic recognition dictionary from the kana text combined by the transcription unit combining unit 71.

FIG. 7(B) is a flowchart showing an example of the dictionary generation processing in the first example. In the example of FIG. 7(B), when a kana text (transcription unit) is input (S21), the transcription unit combining unit 71 counts the number of moras (n) of the recognition vocabulary (S22) and determines whether the number of moras (n) is equal to or greater than a predetermined number of moras (threshold) (S23).

If the number of moras is less than the predetermined number (NO in S23), the transcription unit combining unit 71 combines the text with the kana text corresponding to the most recent successful recognition result, thereby lengthening the vocabulary entry to be recognized (S24). Increasing the number of moras improves recognition accuracy. After S24, the processing returns to S22.

If the number of moras is equal to or greater than the predetermined number (YES in S23), the dynamic recognition dictionary generation unit 72 generates a long-vowel kana text (S25), generates a dynamic recognition dictionary (S26), and outputs the generated dynamic recognition dictionary (S27). In S27, the generated dynamic recognition dictionary may be stored in the storage unit 11, the storage unit 42, or the like.

FIG. 7(C) shows an example of transcription units corresponding to the first example. In the first example, kana texts are in principle combined in transcription units. For example, with a predetermined number of moras (threshold) of 5, the input "saisei" (さいせい) has 4 moras, so it is combined with the following "ichino" (いちの) to give "saiseiichino" (さいせいいちの). The result has 7 moras, which is at least 5, so no further combining is performed.
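The counting and combining loop (S22 to S24) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the names `split_mora` and `combine_units` are hypothetical, a mora is approximated as one kana character with a following small kana merged into it, and previously confirmed units are prepended starting from the most recent one.

```python
# Small kana that attach to the preceding kana and do not form their own mora.
SMALL_KANA = set("ゃゅょぁぃぅぇぉゎャュョァィゥェォヮ")

def split_mora(kana):
    """Split kana text into moras (illustrative approximation)."""
    moras = []
    for ch in kana:
        if ch in SMALL_KANA and moras:
            moras[-1] += ch
        else:
            moras.append(ch)
    return moras

def combine_units(confirmed_units, new_unit, threshold=5):
    """Prepend confirmed units until the text has at least `threshold` moras."""
    text = new_unit
    remaining = list(confirmed_units)
    while len(split_mora(text)) < threshold and remaining:
        # Combine with the most recently confirmed unit (S24), then recount (S22).
        text = remaining.pop() + text
    return text
```

With a threshold of 5, `combine_units(["あたまだし", "さいせい"], "いちの")` stops after one combination because the result already has 7 moras.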

In the first example, kana texts may instead be combined in units of a predetermined number of moras, as shown in FIG. 7(C). In this case, with a predetermined number of 5 moras, "saisei" (さいせい) alone has 4 moras and is 1 mora short, while the combined "saiseiichino" (さいせいいちの) has 7 moras and is 2 moras over. Therefore, to obtain exactly 5 moras, the last 5 moras counted back from the end of the text, "seiichino" (せいいちの), may be extracted to generate the dynamic recognition dictionary.
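Extracting a fixed number of moras from the end of the text can be sketched as below. The helper names `split_mora` and `last_n_moras` are hypothetical, and the mora splitting is the same simple approximation (one kana per mora, small kana merged into the preceding kana), not a method stated in the source.

```python
# Small kana that attach to the preceding kana and do not form their own mora.
SMALL_KANA = set("ゃゅょぁぃぅぇぉゎャュョァィゥェォヮ")

def split_mora(kana):
    """Split kana text into moras (illustrative approximation)."""
    moras = []
    for ch in kana:
        if ch in SMALL_KANA and moras:
            moras[-1] += ch
        else:
            moras.append(ch)
    return moras

def last_n_moras(kana, n):
    """Return the final n moras of the text, counted back from its end."""
    return "".join(split_mora(kana)[-n:])
```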

In the first example, as a further combining condition, all transcription units whose audio data position information has been determined may be combined, as shown in FIG. 7(C). For example, when the kana texts are "atamadashi" (あたまだし), "saisei" (さいせい), and "ichino" (いちの) and the position information of each has been determined, they may be combined into "atamadashisaiseiichino" (あたまだしさいせいいちの).

FIG. 8 is a diagram for explaining the operation in the first example of the transcription support processing. The example of FIG. 8 shows the operation for the example sentence "the setting accuracy of the cue playback position deteriorates" (頭出し再生位置の設定精度が劣化). FIGS. 8(A) to 8(C) show how the dynamic recognition dictionary is generated, and how the audio is played back, for kana texts (transcription units) input in time series.

First, in part (A) of FIG. 8, the kana text "atamadashi" (あたまだし) input for the played-back audio "atamadashi..." has at least the predetermined number of moras (5 moras), so the dynamic recognition dictionary is generated from it as is. In the example of FIG. 8(A), the audio data position information of "atamadashi" has been determined (speech recognition succeeded), so the recorded audio following the determined kana text is played back.

Next, part (B) of FIG. 8 shows an example in which the kana text "saisei" (さいせい) input for the played-back audio "saisei..." has fewer than the predetermined number of moras (5 moras), so it is combined with the kana text of the preceding transcription unit, and the dynamic recognition dictionary is generated once the result has 5 moras or more. In the first example, when the audio data position information is determined, vocabulary that has undergone long-vowel conversion according to a preset long-vowel rule (long-vowel kana text) is also registered in the dynamic recognition dictionary, as shown in FIG. 8(B). Accordingly, two entries, "atamadashisaisei" (あたまだしさいせい) and its long-vowel form (あたまだしさいせー), are generated in the dynamic recognition dictionary, as shown in FIG. 8(B).
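Registering both the plain and the long-vowel form can be sketched as follows. The source says only that a preset long-vowel rule is applied; the rule used here (replace い after an e-row kana, or う after an o-row kana, with the long-vowel mark ー) is an assumption chosen because it yields the entries described for FIG. 8. The names `long_vowel_variant` and `dictionary_entries` are hypothetical.

```python
# Hiragana whose vowel is "e" or "o" (assumed rule inputs, not from the source).
E_ROW = set("えけせてねへめれげぜでべぺ")
O_ROW = set("おこそとのほもよろごぞどぼぽょ")

def long_vowel_variant(kana):
    """Apply the assumed rule: e-row + い and o-row + う become a long vowel."""
    out = []
    prev = ""
    for ch in kana:
        if (ch == "い" and prev in E_ROW) or (ch == "う" and prev in O_ROW):
            out.append("ー")
        else:
            out.append(ch)
        prev = ch  # compare against the original character, not the output
    return "".join(out)

def dictionary_entries(kana):
    """Return the plain text plus its long-vowel form when the two differ."""
    variant = long_vowel_variant(kana)
    return [kana] if variant == kana else [kana, variant]
```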

Next, part (C) of FIG. 8 shows an example in which the kana text "ichino" (いちの) input for the played-back audio "ichino..." has fewer than the predetermined number of moras (5 moras), so it is combined with the kana text of the preceding transcription unit, and the dynamic recognition dictionary is generated once the result has 5 moras or more. In the case of FIG. 8(C) as well, recognition vocabulary to which the long-vowel rule shown in FIG. 8(B) has been applied is added to the dynamic recognition dictionary. Accordingly, two entries, "saiseiichino" (さいせいいちの) and its long-vowel form (さいせーいちの), are generated in the dynamic recognition dictionary, as shown in FIG. 8(C). Thereafter, the transcription processing continues in the same manner until the example sentence ends.

In this way, by using the dynamic recognition dictionary generated in the first example to perform word spotting on the recorded audio, the playback position of the recorded audio that follows the text confirmed by manual transcription can be identified appropriately, and document transcription efficiency can be improved.

<Second Example>
FIG. 9 is a diagram showing a second example of the transcription support processing. FIG. 9(A) shows a configuration example of the dictionary generation units 15 and 43 in the second example. In the second example, as in the first example described above, the dictionary generation units 15 and 43 each include a transcription unit combining unit 71 and a dynamic recognition dictionary generation unit 72.

In the second example, the dynamic recognition dictionary is generated using the position information of the audio data. The audio data position information may be, for example, position information for only the beginning and end of the kana text (transcription unit), or position information for the start or end position of each mora of the kana text (transcription unit).

FIG. 9(B) is a flowchart showing the dictionary generation processing in the second example. In the example of FIG. 9(B), the dictionary generation units 15 and 43 receive a kana text (S31) and receive audio data position information (S32). Next, the dictionary generation units 15 and 43 determine from the audio data position information whether to combine kana texts (S33). In S33, if the audio data position information for the kana text input in S31 is indeterminate, it is determined that the kana texts are to be combined. If the position information for the kana text has been determined, it is determined that they are not to be combined.

The processing of S33 is not limited to this. For example, in S33, kana texts may be combined when the audio data position information shows that the interval between the start and end positions of the kana text (transcription unit) is equal to or less than a predetermined value (that is, when the playback time of the audio data corresponding to the kana text is short). Alternatively, for the moras of the kana text (transcription unit), kana texts may be combined when the interval between the end position of one mora and the start position of the next mora is equal to or less than a predetermined value (that is, when the interval between moras is short).

When combining kana texts (YES in S33), the dictionary generation units 15 and 43 combine them in transcription units (S34). After S34, or when the kana texts are not combined (NO in S33), the dictionary generation units 15 and 43 generate a long-vowel kana text (S35), generate a dynamic recognition dictionary (S36), and output the dynamic recognition dictionary (S37).
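The decision in S33 can be sketched as a small predicate. This is an illustrative sketch only: the `UnitPosition` type, the use of `None` to represent indeterminate position information, and the optional `max_short_ms` parameter for the short-duration variant are assumptions, not details given in the source.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UnitPosition:
    """Start and end of a transcription unit in ms; None means indeterminate."""
    start_ms: Optional[int]
    end_ms: Optional[int]

def should_combine(pos: UnitPosition, max_short_ms: Optional[int] = None) -> bool:
    """Decide whether to combine with the preceding kana text (S33)."""
    # Basic rule: combine when the position information is indeterminate.
    if pos.start_ms is None or pos.end_ms is None:
        return True
    # Optional variant: also combine when the unit's audio is very short.
    if max_short_ms is not None and pos.end_ms - pos.start_ms <= max_short_ms:
        return True
    return False
```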

FIG. 9(C) shows an example of audio data position information (transcription unit) in the second example. In the second example, only the start position of 300 ms and the end position of 900 ms of the first mora speech in the transcription unit may be used as the audio data position information. In the second example, whether kana texts need to be combined can be determined appropriately based on whether the audio data position information corresponding to the input kana text has been determined and on audio data position information such as that shown in FIG. 9(C).

FIG. 10 is a diagram for explaining the operation in the second example of the transcription support processing. As in the first example, the example of FIG. 10 uses the example sentence "the setting accuracy of the cue playback position deteriorates" (頭出し再生位置の設定精度が劣化). As in FIG. 8, the example of FIG. 10 shows how the dynamic recognition dictionary is generated, and how the audio is played back, for kana texts (transcription units) input in time series.

First, in part (A) of FIG. 10, the kana text "atamadashi" (あたまだし) input for the played-back audio "atamadashi..." is generated as a dynamic recognition dictionary. In the example of FIG. 10(A), the audio data position information of "atamadashi" has been determined (speech recognition succeeded), so the recorded audio following the determined kana text is played back.

Next, in part (B-1) of FIG. 10, speech recognition fails for the played-back audio "saisei..." (さいせい), and the audio data position information is indeterminate. In such a case, as shown in FIG. 10(B-2), the text is combined with the kana text of the preceding transcription unit, whose audio data position information has been determined, and speech recognition is executed again. When the audio data position information is then determined (speech recognition succeeds), the dynamic recognition dictionary is generated ("atamadashisaisei" (あたまだしさいせい) and its long-vowel form (あたまだしさいせー)). Thereafter, the transcription processing continues in the same manner until the example sentence ends.

As described above, in the second example, even if there is an indeterminate portion partway through, the dynamic recognition dictionary can be generated efficiently by disregarding it and combining the text with the preceding audio.

<Third Example>
FIG. 11 is a diagram showing a third example of the transcription support processing. FIG. 11(A) shows a configuration example of the playback position determination units 17 and 54 in the third example. In the third example, the playback position determination units 17 and 54 each include a mode designation unit 81 and a playback position selection unit 82.

The mode designation unit 81 accepts designation of a playback mode, for example by user designation. Playback modes include, for example, a confirmation mode for checking the content of the kana text of a determined transcription unit.

The playback position selection unit 82 selects the playback start position of the audio data based on the mode designated by the mode designation unit 81. For example, when the confirmation mode is ON, the playback position selection unit 82 starts cue playback from the beginning of the current transcription unit. This is, for example, a mode for listening again to the transcribed portion to check for transcription mistakes, but is not limited to this.

When the confirmation mode is OFF, the playback position selection unit 82 starts cue playback from the end of the current transcription unit. This is, for example, a mode that prioritizes transcription efficiency by starting cue playback from the position immediately following the determined kana text (transcription unit). The playback modes are not limited to the examples described above.

FIG. 11(B) is a flowchart showing the third example of the transcription support processing. In the example of FIG. 11(B), the playback position determination units 17 and 54 accept designation of a playback mode (S41) and then receive audio data position information (S42).

Next, the playback position determination units 17 and 54 determine whether the position information is indeterminate (S43). If the position information is indeterminate (YES in S43), they treat the playback start position as indeterminate, output information to that effect (S44), and end the processing. In S44, no playback start position is output.

If the position information is not indeterminate (NO in S43), the playback position determination units 17 and 54 determine from the playback mode whether the confirmation mode is ON (S45). If the confirmation mode is ON (YES in S45), the playback start position is set to the beginning of the audio data corresponding to the input kana text (transcription unit) (S46). If the confirmation mode is not ON (NO in S45), the playback start position is set to the end of the audio data corresponding to the input kana text (transcription unit) (S47). After S46 or S47, the playback position determination units 17 and 54 output the playback start position (S48).
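The branch structure of S43 to S48 can be sketched as follows. The function name `select_playback_start`, the `(start_ms, end_ms)` tuple, and the use of `None` for indeterminate position information are assumptions made for illustration.

```python
from typing import Optional, Tuple

def select_playback_start(
    position: Optional[Tuple[int, int]],
    confirmation_mode: bool,
) -> Optional[int]:
    """Return the playback start position in ms, or None if indeterminate.

    `position` is (start_ms, end_ms) of the audio for the input kana text.
    """
    if position is None:
        # S43 YES / S44: position is indeterminate, so no start position is output.
        return None
    start_ms, end_ms = position
    if confirmation_mode:
        # S46: replay the current transcription unit from its beginning.
        return start_ms
    # S47: continue from the end of the current transcription unit.
    return end_ms
```

For a unit spanning 300 ms to 900 ms, the confirmation mode replays from 300 ms, while the efficiency-oriented mode continues from 900 ms.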

FIG. 12 is a diagram for explaining the operation in the third example of the transcription support processing. As in the first and second examples described above, the example of FIG. 12 uses the example sentence "the setting accuracy of the cue playback position deteriorates" (頭出し再生位置の設定精度が劣化).

The example of FIG. 12 shows how the dynamic recognition dictionary is generated, and how the audio is played back, according to the playback mode for kana texts (transcription units) input in time series.

First, part (A) of FIG. 12 shows, as an example of the playback mode, the case in which the confirmation mode described above is ON. The kana text "atamadashi" (あたまだし) input for the played-back audio "atamadashi..." is generated as a dynamic recognition dictionary. In the example of FIG. 12(A), because the confirmation mode is ON, the recorded audio is played back from the beginning of the determined "atamadashi", as shown in FIG. 12(B). This makes it possible to check the transcribed content efficiently.

また、図１２（Ｂ）の部分において、例えば再生モードの一例として、上述した確認モードがＯＦＦの場合、音声再生「あたまだしさいせい・・・」に対して音声データ位置情報が確定（音声認識成功）すると、対応する動的認識辞書（「あたまだしさいせい」、「あたまだしさいせ−」）が生成される。このとき、確認モードはＯＦＦであるため、図１２（Ｃ）に示すように、書き起こし単位の終端位置から頭出し再生を行う。これにより、次の書き起こしを迅速に行うことができる。 In FIG. 12B, when the above-described confirmation mode is OFF as one example of the playback mode, once the audio data position information for the audio playback “atamadashisaisei...” is determined (speech recognition succeeds), the corresponding dynamic recognition dictionary (“atamadashisaisei”, “atamadashisaise-”) is generated. Because the confirmation mode is OFF, cue playback starts from the end position of the transcription unit, as shown in FIG. 12C. This allows the next transcription to be performed quickly.

図１２（Ｃ）の例では、音声再生「いちのせってい・・・」に対して入力された仮名テキスト「いちの」に対して所定モーラ数等からモーラ数を調整し、音声データ位置情報が確定後に動的認識辞書（「さいせいいちの」、「さいせーいちの」が生成される。以下、上述した処理と同様の手順で例文が終了するまで書き起こし処理が行われる。 In the example of FIG. 12C, for the kana text “ichino” input for the audio playback “ichinosettei...”, the number of moras is adjusted based on the predetermined number of moras and the like, and after the audio data position information is determined, the dynamic recognition dictionary (“saiseiichino”, “saise-ichino”) is generated. Thereafter, transcription continues by the same procedure until the example sentence ends.

上述したように、第３実施例では、再生モードに応じてユーザの目的にあった再生位置から再生することができるため、文書書き起こし効率を高めることができる。 As described above, in the third embodiment, it is possible to reproduce from a reproduction position suited to the user's purpose in accordance with the reproduction mode, so that the document transcription efficiency can be increased.
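The mode-dependent choice of playback start position (S43 to S48) can be illustrated with a minimal Python sketch. The names `UnitSpan` and `select_playback_start`, and the use of seconds as the position unit, are assumptions made for illustration and do not appear in the source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UnitSpan:
    """Audio span matched to one transcription unit (seconds; an assumed unit)."""
    start: float
    end: float


def select_playback_start(span: Optional[UnitSpan],
                          confirmation_mode: bool) -> Optional[float]:
    """Return the playback start position, or None when it is undetermined.

    span is None     -> position information undetermined (YES in S43)
    confirmation ON  -> beginning of the unit, to re-listen and check (S46)
    confirmation OFF -> end of the unit, to continue transcribing (S47)
    """
    if span is None:
        return None
    return span.start if confirmation_mode else span.end
```

For example, with a unit matched to 1.0 s through 2.5 s, confirmation mode ON would give 1.0 and confirmation mode OFF would give 2.5.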

＜第４実施例＞ <Fourth embodiment>
図１３は、書き起こし支援処理の第４実施例を示す図である。第４実施例では、上述した第１〜第３の実施例を部分的に組み合わせた例を示している。 FIG. 13 is a diagram showing a fourth embodiment of the transcription support process. The fourth embodiment shows an example in which the first to third embodiments described above are partially combined.

図１３の例では、２つの辞書生成部１５−１，１５−２と、位置情報抽出部１６と、再生位置決定部１７とを有する。辞書生成部１５−１は、動的認識辞書生成部７２−１を有する。辞書生成部１５−２は、書き起こし単位結合部７１と、動的認識辞書生成部７２−２とを有する。再生位置決定部１７は、モード指定部８１と、再生位置選択部８２とを有する。なお、各構成については、上述した各実施例にて説明しているため、ここでの具体的な説明は省略する。 In the example of FIG. 13, two dictionary generation units 15-1 and 15-2, a position information extraction unit 16, and a reproduction position determination unit 17 are included. The dictionary generation unit 15-1 includes a dynamic recognition dictionary generation unit 72-1. The dictionary generation unit 15-2 includes a transcription unit combining unit 71 and a dynamic recognition dictionary generation unit 72-2. The playback position determination unit 17 includes a mode specification unit 81 and a playback position selection unit 82. Since each configuration has been described in each of the above-described embodiments, a specific description thereof is omitted here.

第４実施例では、複数の辞書生成部１５−１，１５−２が、それぞれ異なる条件で動的認識辞書（辞書Ａ，Ｂ）を生成し、生成した動的認識辞書を用いて音声データの位置情報を抽出する。上述した異なる条件とは、例えばモーラ数を基準にしてもよく、音声データ位置情報の確定、不確定等を基準した条件であり、上述した第１〜第３実施例で示した辞書生成の条件であるが、これに限定されるものではない。 In the fourth embodiment, the dictionary generation units 15-1 and 15-2 generate dynamic recognition dictionaries (dictionaries A and B) under mutually different conditions, and the position information of the audio data is extracted using the generated dictionaries. The different conditions may be based on, for example, the number of moras, or on whether the audio data position information is determined or undetermined. They are the dictionary generation conditions shown in the first to third embodiments described above, but are not limited to these.

また、第４実施例は、モード指定部８１において、確認モードの指定を受け付け、受け付けた内容に基づいて音声データの再生位置を選択し、再生開始位置を出力する。次に、図１３に示す構成に対応する書き起こし支援処理についてフローチャートを用いて説明する。 In the fourth embodiment, the mode designation unit 81 accepts the designation of the confirmation mode, selects the reproduction position of the audio data based on the accepted content, and outputs the reproduction start position. Next, the transcription support process corresponding to the configuration shown in FIG. 13 will be described using a flowchart.

図１４は、書き起こし支援処理の第４実施例を示す一例のフローチャートである。図１４の例において、辞書生成部１５−１は、仮名テキスト（書き起こし単位）の入力を受け付けると（Ｓ５１）、長音化仮名テキストを生成し（Ｓ５２）、動的認識辞書（辞書Ａ）を生成する（Ｓ５３）。 FIG. 14 is a flowchart of an example showing a fourth embodiment of the transcription support process. In the example of FIG. 14, upon receiving input of kana text (transcription unit) (S 51), the dictionary generation unit 15-1 generates a prolonged kana text (S 52) and creates a dynamic recognition dictionary (dictionary A). Generate (S53).
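As an illustration of generating prolonged-sound kana text and a dynamic recognition dictionary (S52 and S53), the following Python sketch produces the long-vowel variant seen in the examples (「さいせい」 and 「さいせー」). The conversion rule used here (an え-row kana followed by い, or an お-row kana followed by う, is replaced by a long-vowel mark) is an assumption inferred from those examples, not a rule stated in this section, and the function names are hypothetical.

```python
# Assumed rule: え-row kana + い, or お-row kana + う, becomes a long vowel "ー".
E_ROW = set("えけせてねへめれげぜでべぺ")
O_ROW = set("おこそとのほもよろごぞどぼぽょ")


def prolong(kana: str) -> str:
    """Return the prolonged-sound form of a hiragana string."""
    out = []
    for ch in kana:
        prev = out[-1] if out else ""
        if (ch == "い" and prev in E_ROW) or (ch == "う" and prev in O_ROW):
            out.append("ー")  # replace the vowel with a long-vowel mark
        else:
            out.append(ch)
    return "".join(out)


def build_dynamic_dictionary(kana: str) -> list:
    """Recognition vocabulary: the kana text plus its prolonged variant, if different."""
    variant = prolong(kana)
    return [kana] if variant == kana else [kana, variant]
```

Under this assumed rule, `build_dynamic_dictionary("あたまだしさいせい")` yields the two entries 「あたまだしさいせい」 and 「あたまだしさいせー」, while 「あたまだし」 has no prolonged variant and yields a single entry.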

次に、位置情報抽出部１６は、辞書Ａを用いて音声認識処理を行い（Ｓ５４）、認識が成功したか否かを判断する（Ｓ５５）。認識が成功した場合（Ｓ５５において、ＹＥＳ）、位置情報抽出部１６は、音声データ位置情報を抽出する（Ｓ５６）。認識に成功していない場合（Ｓ５６において、ＮＯ）、位置情報抽出部１６は、音声データ位置情報を不確定とする（Ｓ５７）。 Next, the position information extraction unit 16 performs speech recognition using dictionary A (S54) and determines whether the recognition succeeded (S55). If the recognition succeeded (YES in S55), the position information extraction unit 16 extracts the audio data position information (S56). If the recognition did not succeed (NO in S55), the position information extraction unit 16 sets the audio data position information as undetermined (S57).

次に、辞書生成部１５−２は、音声データ位置情報があるか否かを判断し（Ｓ５８）、音声データ位置情報がある場合（Ｓ５８において、ＹＥＳ）、仮名テキストを書き起こし単位で結合する（Ｓ５９）。また、Ｓ５９の処理後、又は、Ｓ５８において、音声データ位置情報がない場合（Ｓ５８において、ＮＯ）、辞書生成部１５−２は、長音化仮名テキストを生成する（Ｓ６０）。 Next, the dictionary generation unit 15-2 determines whether there is audio data position information (S58). If there is (YES in S58), it combines the kana text in transcription units (S59). After S59, or if there is no audio data position information (NO in S58), the dictionary generation unit 15-2 generates prolonged-sound kana text (S60).

次に、辞書生成部１５−２は、動的認識辞書（辞書Ｂ）を生成する（Ｓ６１）。次に、位置情報抽出部１６は、辞書Ｂを用いて音声認識処理を行い（Ｓ６２）、認識に成功したか否かを判断する（Ｓ６３）。認識が成功した場合（Ｓ６３において、ＹＥＳ）、位置情報抽出部１６は、音声データ位置情報を抽出する（Ｓ６４）。また、認識が成功しなかった場合（Ｓ６３において、ＮＯ）、音声データ位置情報を不確定とする（Ｓ６５）。 Next, the dictionary generation unit 15-2 generates a dynamic recognition dictionary (dictionary B) (S61). The position information extraction unit 16 then performs speech recognition using dictionary B (S62) and determines whether the recognition succeeded (S63). If the recognition succeeded (YES in S63), the position information extraction unit 16 extracts the audio data position information (S64). If the recognition did not succeed (NO in S63), the audio data position information is set as undetermined (S65).

Ｓ６４又はＳ６５の処理後、位置情報が不確定か否かを判断し（Ｓ６６）、不確定である場合（Ｓ６６において、ＹＥＳ）、再生開始位置を不確定として（Ｓ６７）、その旨の情報を出力又は何も出力せずに処理を終了する（Ｓ６７）。 After S64 or S65, it is determined whether the position information is undetermined (S66). If it is undetermined (YES in S66), the playback start position is set as undetermined (S67), and the process ends after either outputting information to that effect or outputting nothing (S67).

また、位置情報が不確定でない場合（Ｓ６６において、ＮＯ）、再生位置決定部１７は、ユーザによりモード指定部８１で指定された確認モードがＯＮであるか否かを判断する（Ｓ６８）。 If the position information is not indefinite (NO in S66), the playback position determination unit 17 determines whether or not the confirmation mode designated by the user in the mode designation unit 81 is ON (S68).

再生位置決定部１７は、確認モードがＯＮの場合（Ｓ６８において、ＹＥＳ）、再生開始位置を入力した仮名テキスト（書き起こし単位）に対応する音声データの先頭に位置付ける（Ｓ６９）。また、再生モードがＯＮでない場合（Ｓ６８において、ＮＯ）、再生開始位置を入力した仮名テキスト（書き起こし単位）に対応する音声データの終端に位置付ける（Ｓ７０）。再生位置決定部１７は、Ｓ６９，Ｓ７０の処理後、再生開始位置を出力する（Ｓ７１）。 When the confirmation mode is ON (YES in S68), the playback position determination unit 17 sets the playback start position to the beginning of the audio data corresponding to the input kana text (transcription unit) (S69). When the confirmation mode is not ON (NO in S68), it sets the playback start position to the end of the audio data corresponding to the input kana text (transcription unit) (S70). After S69 or S70, the playback position determination unit 17 outputs the playback start position (S71).

図１５は、書き起こし支援処理の第４実施例における動作を説明するための図である。図１５の例では、上述した第１〜第３実施例と同様に、例文として「頭出し再生位置の設定精度が劣化」を用いる。 FIG. 15 is a diagram for explaining the operation in the fourth embodiment of the transcription support process. In the example of FIG. 15, as in the first to third embodiments described above, the example sentence “the setting accuracy of the cue playback position deteriorates” (read “atamadashi saisei ichi no settei seido ga rekka”) is used.

図１５の例では、時系列で入力される仮名テキスト（書き起こし単位）に対して、再生モードに応じて、どのように動的認識辞書が生成され、またどのように音声が再生されていくかを示している。第４実施例では、音声認識の失敗時に動的認識辞書を異なる条件で再生成して再度認識を行う。 The example of FIG. 15 shows, for kana text (transcription units) input in time series, how the dynamic recognition dictionary is generated and how the audio is played back according to the playback mode. In the fourth embodiment, when speech recognition fails, the dynamic recognition dictionary is regenerated under a different condition and recognition is performed again.

まず、図１５（Ａ）の部分では、音声再生「あたまだし・・・」に対して入力された仮名テキスト「あたま」が、動的認識辞書として生成される。また、図１５（Ａ）の例では、「あたま」の音声データ位置情報が確定したため（音声認識成功）、確定した仮名テキストに続く録音音声が再生される。 First, in FIG. 15A, the kana text “atama” input for the audio playback “atamadashi...” is generated as the dynamic recognition dictionary. In the example of FIG. 15A, because the audio data position information of “atama” has been determined (speech recognition succeeded), the recorded audio following the confirmed kana text is played back.

次に、図１５（Ｂ−１）の部分では、音声再生「だしさいせい・・・」に対して音声認識が失敗し、音声データ位置情報が不確定となっている。この場合、動的認識辞書を異なる条件で再生成を行う。図１５（Ｂ−２）の部分では、動的認識辞書として「あたまだし」が生成されている。 Next, in FIG. 15 (B-1), speech recognition fails for the audio playback “dashisaisei...”, and the audio data position information is undetermined. In this case, the dynamic recognition dictionary is regenerated under a different condition. In FIG. 15 (B-2), “atamadashi” is generated as the dynamic recognition dictionary.

次に、再生成された動的認識辞書を用いて、再度音声認識処理を実行し、音声データ位置情報が確定（音声認識成功）した場合に、図１５（Ｃ）に示すように、音声再生「さいせい・・・」に対して続けて処理を実行することができる。以下、上述した処理と同様の手順で例文が終了するまで書き起こし処理が行われる。 Next, when the voice recognition process is executed again using the regenerated dynamic recognition dictionary and the voice data position information is determined (successful voice recognition), as shown in FIG. The process can be continuously executed for “Saisei ...”. Thereafter, the transcription process is performed until the example sentence is completed in the same procedure as described above.

上述したように、第４実施例では、音声認識が失敗したとしても、そのとき用いた動的認識辞書を再生成して再度音声認識処理を実行することができるため、作業が中断せずに効率的に動的認識辞書を生成することができる。なお、第４実施例では、２つの辞書生成部１５−１，１５−２を用いたが、音声認識が成功するまで、繰り返し動的認識辞書を再生成してもよい。その場合は、３以上の辞書生成部を有してもよい。 As described above, in the fourth embodiment, even if speech recognition fails, the dynamic recognition dictionary used at that time can be regenerated and speech recognition can be executed again, so the dynamic recognition dictionary can be generated efficiently without interrupting the work. Although the fourth embodiment uses two dictionary generation units 15-1 and 15-2, the dynamic recognition dictionary may be regenerated repeatedly until speech recognition succeeds. In that case, three or more dictionary generation units may be provided.
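The retry behavior described for the fourth embodiment can be sketched in Python as follows. This is only an illustration under stated assumptions: the "different condition" is modeled as joining the current transcription unit with progressively more preceding units, the recognizer is passed in as a callable, and `toy_recognize` is a hypothetical stand-in that matches kana substrings in place of real speech recognition. None of these names come from the source.

```python
from typing import Callable, List, Optional, Tuple

Span = Tuple[int, int]  # (start, end) position in the audio; the unit is an assumption


def locate_with_retry(units: List[str],
                      recognize: Callable[[List[str]], Optional[Span]],
                      max_units: int = 3) -> Optional[Span]:
    """Try the latest unit alone, then rejoin it with more preceding units.

    Each failed attempt regenerates the dictionary under a different condition
    (here: a longer vocabulary) and runs recognition again.
    """
    for n in range(1, min(max_units, len(units)) + 1):
        vocabulary = "".join(units[-n:])   # regenerate the dictionary entry
        span = recognize([vocabulary])     # recognition attempt
        if span is not None:
            return span                    # position information determined
    return None                            # position information undetermined


AUDIO = "あたまだしさいせい"  # toy "audio" represented as kana


def toy_recognize(dictionary: List[str]) -> Optional[Span]:
    """Stand-in recognizer: fails on entries shorter than 3 kana."""
    for word in dictionary:
        idx = AUDIO.find(word)
        if len(word) >= 3 and idx >= 0:
            return (idx, idx + len(word))
    return None
```

With the units 「あたま」 and 「だし」, the first attempt on 「だし」 alone fails in this toy setup, and the second attempt on the joined 「あたまだし」 succeeds, which mirrors the sequence shown for FIG. 15 (B-1) and (B-2).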

上述したように、本実施形態によれば、頭出し再生位置を適切に特定でき、文書書き起こし効率を高めることができる。例えば、確定した仮名テキストに続く録音音声の再生位置や、確認のために聴き直しする際の録音音声の再生位置等を適切に特定でき、文書書き起こし効率を高めることができる。例えば、本実施形態では、録音音声の手動書き起こしに際し、確定した仮名テキスト続く録音音声の繰り返し再生位置を適切に求めるために、確定した書き起こし単位の仮名テキストから１以上の認識語彙を単位とした動的認識辞書を生成し、生成された動的認識辞書を利用して録音音声のワードスポッティングをして手動書き起こしで確定したテキストに続く録音音声の再生位置を特定する。また、本実施形態では、ワードスポッティング対象の１認識語彙が短いモーラ数のとき、ワードスポッティング精度を高くできないため、１認識語彙が短い場合は直前に成功した認識結果を含める形でワードスポッティング対象の語彙のモーラ数を所定以上長くする。したがって、モーラ数が多ければ精度が上がる。また、ワードスポッティングのモーラ数を増やしても位置特定ができない場合はさらに直前の認識結果を含めてもよい。 As described above, according to the present embodiment, the cue playback position can be specified appropriately, and document transcription efficiency can be improved. For example, the playback position of the recorded audio that follows the confirmed kana text, and the playback position used when listening again for confirmation, can be specified appropriately. In the present embodiment, to appropriately obtain the repeated playback position of the recorded audio following the confirmed kana text during manual transcription, a dynamic recognition dictionary in units of one or more recognition vocabularies is generated from the kana text of the confirmed transcription unit, and word spotting is performed on the recorded audio using that dictionary to specify the playback position of the recorded audio following the text confirmed by manual transcription. In addition, word spotting accuracy cannot be made high when a recognition vocabulary targeted for word spotting has a small number of moras. Therefore, when a recognition vocabulary is short, the immediately preceding successful recognition result is included so that the number of moras of the target vocabulary becomes at least a predetermined length, and accuracy improves because the number of moras is larger. Furthermore, if the position cannot be specified even after the number of moras for word spotting is increased, recognition results from further back may also be included.
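The mora-based lengthening of a short recognition vocabulary can be sketched in Python as below. The mora count used here (every kana counts as one mora except small ゃ, ゅ, ょ and small vowels, which attach to the preceding kana) is a common simplification adopted as an assumption, and the threshold of 5 moras and the function names are illustrative values not taken from the source.

```python
from typing import List

# Small kana that merge with the preceding kana into a single mora (assumed rule).
SMALL_KANA = set("ゃゅょぁぃぅぇぉゎ")


def count_mora(kana: str) -> int:
    """Approximate mora count of a hiragana string."""
    return sum(1 for ch in kana if ch not in SMALL_KANA)


def extend_vocabulary(current: str, recognized: List[str], min_mora: int = 5) -> str:
    """Prepend the most recent successfully recognized units until the
    vocabulary reaches min_mora (or no earlier units remain)."""
    vocabulary = current
    for previous in reversed(recognized):
        if count_mora(vocabulary) >= min_mora:
            break
        vocabulary = previous + vocabulary
    return vocabulary
```

For the example sentence, `extend_vocabulary("いちの", ["あたまだし", "さいせい"])` joins the 3-mora 「いちの」 with the preceding 「さいせい」 to give 「さいせいいちの」, matching the dictionary entry shown for FIG. 12C.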

本実施形態によれば、例えば手動書き起こし、及び自動書き起こし修正部分の手動書き起こしにおける、録音音声の自動再生、自動停止、話速変換機能等の制御に適用することができる。 According to the present embodiment, for example, it can be applied to control of automatic reproduction of recorded sound, automatic stop, speech speed conversion function, etc. in manual transcription and manual transcription of an automatic transcription correction portion.

以上、実施例について詳述したが、特定の実施例に限定されるものではなく、特許請求の範囲に記載された範囲内において、種々の変形及び変更が可能である。また、上述した各実施例の一部又は全部を組み合わせることも可能である。 Although the embodiments have been described in detail above, the invention is not limited to the specific embodiments, and various modifications and changes can be made within the scope described in the claims. Moreover, it is also possible to combine a part or all of each Example mentioned above.

なお、以上の実施例に関し、更に以下の付記を開示する。
（付記１）
音声データを格納する音声データ格納部と、
頭出し再生位置から前記音声データを再生する音声再生部と、
音声再生部により再生された前記音声データに対応して入力された仮名テキストから書き起こし単位の仮名テキストを生成するテキスト生成部と、
前記テキスト生成部から得られる書き起こし単位の仮名テキストから、１以上の認識語彙を単位とした動的認識辞書を生成する辞書生成部と、
前記辞書生成部により得られる前記動的認識辞書を用いた音声認識により、前記認識語彙に対応する前記音声データの位置情報を抽出する位置情報抽出部と、
前記位置情報抽出部により抽出した位置情報から、前記頭出し再生位置を決定する再生位置決定部とを有することを特徴とする情報処理装置。
（付記２）
前記辞書生成部は、
前記認識語彙が、所定のモーラ数以上でない場合に、直前に成功した認識結果に対応する認識語彙と結合して、認識語彙のモーラ数を長くすることを特徴とする付記１に記載の情報処理装置。
（付記３）
前記辞書生成部は、
前記認識語彙に対する音声認識ができなかった場合に、直前の認識語彙と結合することを特徴とすることを特徴とする付記１又は２に記載の情報処理装置。
（付記４）
前記辞書生成部は、
前記テキスト生成部から得られる前記書き起こし単位の仮名テキストと、前記位置情報抽出部により得られる前記音声データの位置情報とを用いて前記動的認識辞書を生成することを特徴とする付記１乃至３の何れか１項に記載の情報処理装置。
（付記５）
前記辞書生成部は、
前記認識語彙に対する前記音声データの位置情報が確定していない場合に、直前の認識語彙と結合することを特徴とする付記４に記載の情報処理装置。
（付記６）
前記頭出し再生位置を決定するためのモードの指定を受け付けるモード指定部と、
前記モード指定部により指定された確認モードに応じて、現在の書き起こし単位の先頭又は終端位置を再生位置として選択する再生位置選択部とを有することを特徴とする付記１乃至５の何れか１項に記載の情報処理装置。
（付記７）
前記辞書生成部は、
前記音声認識が失敗した場合に、前記音声認識に用いた動的認識辞書を異なる条件で再生成し、再生成した動的認識辞書を用いて、再度音声認識させることを特徴とする付記１乃至６の何れか１項に記載の情報処理装置。
（付記８）
前記音声データを所定フレーム単位で解析して音響的特徴量を抽出する特徴量抽出部を有し、
前記位置情報抽出部は、
前記特徴量抽出部により得られる音響的特徴量と、前記動的認識辞書とを用いた音声認識により、前記認識語彙に対応する前記音声データの位置情報を抽出することを特徴とする付記１乃至７の何れか１項に記載の情報処理装置。
（付記９）
情報処理装置が、
頭出し再生位置から音声データを再生し、
再生した前記音声データに対応して入力された仮名テキストから書き起こし単位の仮名テキストを生成し、
生成した前記書き起こし単位の仮名テキストから、１以上の認識語彙を単位とした動的認識辞書を生成し、
生成した前記動的認識辞書を用いた音声認識により、前記認識語彙に対応する前記音声データの位置情報を抽出し、
抽出した前記音声データの位置情報から、前記頭出し再生位置を決定する、ことを特徴とする書き起こし支援方法。
（付記１０）
頭出し再生位置から音声データを再生し、
再生した前記音声データに対応して入力された仮名テキストから書き起こし単位の仮名テキストを生成し、
生成した前記書き起こし単位の仮名テキストから、１以上の認識語彙を単位とした動的認識辞書を生成し、
生成した前記動的認識辞書を用いた音声認識により、前記認識語彙に対応する前記音声データの位置情報を抽出し、
抽出した前記音声データの位置情報から、前記頭出し再生位置を決定する、処理をコンピュータに実行させる書き起こし支援プログラム。
The following appendices are further disclosed with respect to the above embodiments.
(Appendix 1)
An information processing apparatus comprising:
an audio data storage unit that stores audio data;
an audio playback unit that plays back the audio data from a cue playback position;
a text generation unit that generates kana text in transcription units from kana text input in correspondence with the audio data played back by the audio playback unit;
a dictionary generation unit that generates, from the kana text in transcription units obtained from the text generation unit, a dynamic recognition dictionary in units of one or more recognition vocabularies;
a position information extraction unit that extracts position information of the audio data corresponding to the recognition vocabulary by speech recognition using the dynamic recognition dictionary obtained by the dictionary generation unit; and
a playback position determination unit that determines the cue playback position from the position information extracted by the position information extraction unit.
(Appendix 2)
The information processing apparatus according to Appendix 1, wherein the dictionary generation unit,
when the recognition vocabulary does not have at least a predetermined number of moras, combines it with the recognition vocabulary corresponding to the immediately preceding successful recognition result, thereby increasing the number of moras of the recognition vocabulary.
(Appendix 3)
The information processing apparatus according to Appendix 1 or 2, wherein the dictionary generation unit,
when speech recognition of the recognition vocabulary could not be performed, combines it with the immediately preceding recognition vocabulary.
(Appendix 4)
The information processing apparatus according to any one of Appendices 1 to 3, wherein the dictionary generation unit
generates the dynamic recognition dictionary using the kana text of the transcription unit obtained from the text generation unit and the position information of the audio data obtained by the position information extraction unit.
(Appendix 5)
The information processing apparatus according to Appendix 4, wherein the dictionary generation unit,
when the position information of the audio data for the recognition vocabulary has not been determined, combines the recognition vocabulary with the immediately preceding recognition vocabulary.
(Appendix 6)
The information processing apparatus according to any one of Appendices 1 to 5, further comprising:
a mode designation unit that accepts designation of a mode for determining the cue playback position; and
a playback position selection unit that selects the start or end position of the current transcription unit as the playback position in accordance with the confirmation mode designated by the mode designation unit.
(Appendix 7)
The information processing apparatus according to any one of Appendices 1 to 6, wherein the dictionary generation unit,
when the speech recognition fails, regenerates the dynamic recognition dictionary used for the speech recognition under a different condition and causes speech recognition to be performed again using the regenerated dynamic recognition dictionary.
(Appendix 8)
The information processing apparatus according to any one of Appendices 1 to 7, further comprising
a feature amount extraction unit that analyzes the audio data in units of predetermined frames and extracts acoustic feature amounts,
wherein the position information extraction unit
extracts the position information of the audio data corresponding to the recognition vocabulary by speech recognition using the acoustic feature amounts obtained by the feature amount extraction unit and the dynamic recognition dictionary.
(Appendix 9)
A transcription support method in which an information processing apparatus:
plays back audio data from a cue playback position;
generates kana text in transcription units from kana text input in correspondence with the played-back audio data;
generates, from the generated kana text in transcription units, a dynamic recognition dictionary in units of one or more recognition vocabularies;
extracts position information of the audio data corresponding to the recognition vocabulary by speech recognition using the generated dynamic recognition dictionary; and
determines the cue playback position from the extracted position information of the audio data.
(Appendix 10)
A transcription support program that causes a computer to execute processing comprising:
playing back audio data from a cue playback position;
generating kana text in transcription units from kana text input in correspondence with the played-back audio data;
generating, from the generated kana text in transcription units, a dynamic recognition dictionary in units of one or more recognition vocabularies;
extracting position information of the audio data corresponding to the recognition vocabulary by speech recognition using the generated dynamic recognition dictionary; and
determining the cue playback position from the extracted position information of the audio data.

１０情報処理装置
１１，４２，５６記憶部
１１ａ，４２ａ，５６ａ音声データ格納部
１２，５５音声再生部
１３，５２入力部
１４，５３テキスト生成部
１５，４３辞書生成部
１６，４４位置情報抽出部
１７，５４再生位置決定部
２１入力装置
２２出力装置
２３ドライブ装置
２４補助記憶装置
２５主記憶装置
２６ＣＰＵ
２７ネットワーク接続装置
２８記録媒体
３０情報処理システム
３１サーバ
３２クライアント端末
４１，５１通信制御部
６１特徴量抽出部
７１書き起こし単位結合部
７２動的認識辞書生成部
８１モード指定部
８２再生位置選択部
DESCRIPTION OF SYMBOLS
10 Information processing apparatus
11, 42, 56 Storage unit
11a, 42a, 56a Audio data storage unit
12, 55 Audio playback unit
13, 52 Input unit
14, 53 Text generation unit
15, 43 Dictionary generation unit
16, 44 Position information extraction unit
17, 54 Playback position determination unit
21 Input device
22 Output device
23 Drive device
24 Auxiliary storage device
25 Main storage device
26 CPU
27 Network connection device
28 Recording medium
30 Information processing system
31 Server
32 Client terminal
41, 51 Communication control unit
61 Feature amount extraction unit
71 Transcription unit combining unit
72 Dynamic recognition dictionary generation unit
81 Mode designation unit
82 Playback position selection unit

Claims

1. An information processing apparatus comprising:
an audio data storage unit that stores audio data;
an audio playback unit that plays back the audio data from a cue playback position;
a text generation unit that generates kana text in transcription units from kana text input in correspondence with the audio data played back by the audio playback unit;
a dictionary generation unit that generates, from the kana text in transcription units obtained from the text generation unit, a dynamic recognition dictionary in units of one or more recognition vocabularies;
a position information extraction unit that extracts position information of the audio data corresponding to the recognition vocabulary by speech recognition using the dynamic recognition dictionary obtained by the dictionary generation unit; and
a playback position determination unit that determines the cue playback position from the position information extracted by the position information extraction unit.

2. The information processing apparatus according to claim 1, wherein the dictionary generation unit,
when the recognition vocabulary does not have at least a predetermined number of moras, combines it with the recognition vocabulary corresponding to the immediately preceding successful recognition result, thereby increasing the number of moras of the recognition vocabulary.

3. The information processing apparatus according to claim 1 or 2, wherein the dictionary generation unit
generates the dynamic recognition dictionary using the kana text of the transcription unit obtained from the text generation unit and the position information of the audio data obtained by the position information extraction unit.

4. The information processing apparatus according to any one of claims 1 to 3, further comprising:
a mode designation unit that accepts designation of a mode for determining the cue playback position; and
a playback position selection unit that selects the start or end position of the current transcription unit as the playback position in accordance with the confirmation mode designated by the mode designation unit.

5. The information processing apparatus according to any one of claims 1 to 4, wherein the dictionary generation unit,
when the speech recognition fails, regenerates the dynamic recognition dictionary used for the speech recognition under a different condition and causes speech recognition to be performed again using the regenerated dynamic recognition dictionary.

6. The information processing apparatus according to any one of claims 1 to 5, further comprising
a feature amount extraction unit that analyzes the audio data in units of predetermined frames and extracts acoustic feature amounts,
wherein the position information extraction unit
extracts the position information of the audio data corresponding to the recognition vocabulary by speech recognition using the acoustic feature amounts obtained by the feature amount extraction unit and the dynamic recognition dictionary.

7. A transcription support method in which an information processing apparatus:
plays back audio data from a cue playback position;
generates kana text in transcription units from kana text input in correspondence with the played-back audio data;
generates, from the generated kana text in transcription units, a dynamic recognition dictionary in units of one or more recognition vocabularies;
extracts position information of the audio data corresponding to the recognition vocabulary by speech recognition using the generated dynamic recognition dictionary; and
determines the cue playback position from the extracted position information of the audio data.

8. A transcription support program that causes a computer to execute processing comprising:
playing back audio data from a cue playback position;
generating kana text in transcription units from kana text input in correspondence with the played-back audio data;
generating, from the generated kana text in transcription units, a dynamic recognition dictionary in units of one or more recognition vocabularies;
extracting position information of the audio data corresponding to the recognition vocabulary by speech recognition using the generated dynamic recognition dictionary; and
determining the cue playback position from the extracted position information of the audio data.