JP2006084965A

JP2006084965A - Voice data collecting device and program

Info

Publication number: JP2006084965A
Application number: JP2004271527A
Authority: JP
Inventors: Gruhn Rainer; ライナー・グルーン; Satoru Nakamura; 哲中村
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2004-09-17
Filing date: 2004-09-17
Publication date: 2006-03-30

Abstract

PROBLEM TO BE SOLVED: To construct an enriched voice corpus in a short time at a low cost.

SOLUTION: A voice data collecting device includes a display section (112) which displays a text on a display device; a recording section (114) which starts sampling of voice signals from a microphone when a recording start is instructed while the text is being displayed and stores uttered voice data in a memory; a waveform display section (120) which displays voice waveforms based on the recorded uttered voice data when a recording completion instructing signal is received; and a preservation section (128) which stores the uttered voice data stored in the memory by relating the data to the text when a preservation instructing signal is received.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

この発明は、所定のテキストの読上音声を録音するための装置に関し、特に、音声コーパスのためのテキストを読上げた発話音声データを効率よく収集するための装置に関する。 The present invention relates to an apparatus for recording a read-out voice of a predetermined text, and more particularly to an apparatus for efficiently collecting utterance voice data obtained by reading a text for a voice corpus.

音声認識、音声合成などの音声関連技術においては、最近では統計的な手法が主流となっている。統計的な手法では、音声コーパスが重要である。統計的手法では、音声コーパスに含まれる音声データの量が多いほど、信頼性の高い処理が可能になる。そこで、できるだけ大きな音声コーパスを効率よく構築することが必要になる。 In speech-related technologies such as speech recognition and speech synthesis, statistical methods have become mainstream recently. In statistical methods, the speech corpus is important. In the statistical method, as the amount of speech data included in the speech corpus increases, processing with higher reliability becomes possible. Therefore, it is necessary to efficiently construct as large a speech corpus as possible.

Conventionally, a speech corpus is constructed as follows. First, under the supervision of an expert, a speaker reads a series of texts aloud in succession, and the speech is recorded. The expert then manually splits the audio data into individual utterances and saves each one as a separate file. Each saved file is then aligned with the text of the corresponding utterance. Based on the alignment result, information such as phoneme labels is attached to the audio data in a computer-readable format.

When speech is digitized and used as base data for statistical methods, noise in the speech must be avoided. In addition, when there is a large amount of text to be uttered, recording may extend over a long period, and the recording level may change depending on the condition of the recording equipment and the physical condition of the speaker. As a result, the quality of the speech accumulated in the speech corpus may vary, making it unsuitable as base data for statistical methods. Furthermore, if the speaker misreads the text to be uttered, errors enter the speech corpus itself, which the statistical methods take as their premise.

For this reason, an expert has conventionally had to observe the speaker's recording session carefully and, while watching the display of a measuring instrument, check whether noise has entered the speech, whether the speech level has deviated from the appropriate range, and whether the speaker has misread the text. If the recording was not made properly, the speaker was made to repeat the recording from the beginning of that utterance.

However, this method places a heavy burden on the supervisor during recording. As a result, only one speaker can be recorded at a time, and collecting speech data from multiple speakers requires either more supervisors or a longer recording period. It is therefore difficult to construct a rich speech corpus in a short time and at low cost.

それゆえに本発明の一つの目的は、充実した音声コーパスを、短時間で低コストに構築可能とする音声データ収集装置およびそのためのプログラムを提供することである。 Therefore, one object of the present invention is to provide an audio data collection apparatus and a program therefor that can build a complete audio corpus in a short time and at a low cost.

本発明の別の目的は、監督者の負担を軽減し、充実した音声コーパスを、短時間で低コストに構築可能とする音声データ収集装置およびそのためのプログラムを提供することである。 Another object of the present invention is to provide an audio data collection device and a program therefor that can reduce the burden on the supervisor and can construct a complete audio corpus in a short time at low cost.

A voice data collection device according to a first aspect of the present invention is connected to a display device, a predetermined input device operable by a user, and a microphone, and collects voice data of utterances corresponding to predetermined texts. The device includes: text display means for displaying a text to be uttered on the display device; voice recording means for, in response to receiving a predetermined recording start instruction signal while the text to be uttered is displayed on the display device, starting to sample the voice signal from the microphone and storing the sampled utterance voice data in a first storage device; waveform display means for, in response to a predetermined recording end instruction signal, generating a voice waveform based on the utterance voice data stored in the first storage device and displaying it on the display device; and storage means for, in response to receiving a predetermined save instruction signal from the input device, storing the utterance voice data held in the first storage device in a second storage device in association with the predetermined text.

表示装置上に発話対象のテキストが表示される。ユーザがそのテキストを見ながら発話すると、その音声がサンプリングされ録音される。さらに、録音された発話の音声波形が画面上に表示される。ユーザはこの波形を見て録音状態を確認できる。したがって、この装置を用いれば、テキストに対する発話音声をユーザの操作により良好な形で収集することができる。監督者による監督は最低限でよい。 The text to be uttered is displayed on the display device. When the user speaks while watching the text, the voice is sampled and recorded. Furthermore, the voice waveform of the recorded utterance is displayed on the screen. The user can confirm the recording state by viewing this waveform. Therefore, by using this apparatus, it is possible to collect speech utterances for texts in a favorable form by user operations. Supervision by the supervisor is minimal.

Preferably, the voice data collection device further includes voice re-recording means for, in response to receiving the recording start instruction signal while the voice waveform is displayed on the display device, starting to sample the voice signal from the microphone and replacing the utterance voice data stored in the first storage device with the newly sampled utterance voice data.

音声波形が表示されているときに録音開始指示信号を発生させることで、既に一度録音されている発話音声データを新たな発話音声データで置換できる。好ましい録音が得られるまで繰返し同じテキストに対する発話の録音を行なうことができる。その結果、テキストに対し、良好に録音された発話音声データを容易に収集できる。 By generating the recording start instruction signal when the voice waveform is displayed, the voice data already recorded can be replaced with new voice data. It is possible to record utterances for the same text repeatedly until a favorable recording is obtained. As a result, it is possible to easily collect well-recorded speech voice data for text.

More preferably, the voice data collection device is further connected to a speaker, and further includes reproduction means for, in response to receiving a predetermined reproduction instruction signal while the voice waveform is displayed on the display device, reproducing the utterance voice from the utterance voice data stored in the first storage device and supplying it to the speaker.

録音された音声波形が再生手段により再生される。ユーザはこの再生音声により、録音が良好に行なえたか否かを容易に判定できる。 The recorded voice waveform is reproduced by the reproducing means. The user can easily determine whether or not recording has been successfully performed using the reproduced voice.

More preferably, the voice data collection device further includes level determination means for determining whether the utterance voice data stored in the first storage device is within a predetermined signal level range. The waveform display means includes means for, in response to the recording end instruction signal, generating a voice waveform based on the utterance voice data stored in the first storage device and displaying it on the display device together with level determination information that visually indicates, according to the result from the level determination means, whether the signal level is within the predetermined signal level range.

録音された波形が適正レベルにあるか否かが判定され、その結果が視覚的に表示される。ユーザは録音レベルが適正かどうかを判定でき、必要であれば録音をし直すことができる。その結果、収集される発話音声データは適正なレベルのものとなり、発話音声データの品質が向上する。 It is determined whether the recorded waveform is at an appropriate level, and the result is visually displayed. The user can determine whether the recording level is appropriate and can re-record if necessary. As a result, the collected utterance voice data has an appropriate level, and the quality of the utterance voice data is improved.

The voice data collection device may further include utterance portion detection means for detecting the utterance portion of the utterance voice data stored in the first storage device. The waveform display means may include means for, in response to the predetermined recording end instruction signal, generating a voice waveform based on the utterance voice data stored in the first storage device and displaying it on the display device together with an utterance portion marker that visually indicates, according to the detection result from the utterance portion detection means, the utterance portion of the voice waveform.

発話部分とそうでない領域とが分けられて音声波形とともに表示される。例えばノイズが誤って発話として認識されたり、本来は発話であるはずの部分が発話部分として検出されなかったりした場合にも容易にそれらの誤りを確認できる。必要に応じて録音をし直すことも可能になり、収集される音声データの品質が向上する。 The speech part and the non-speech area are separated and displayed together with the speech waveform. For example, even when noise is mistakenly recognized as an utterance or a portion that should originally be an utterance is not detected as an utterance portion, those errors can be easily confirmed. It becomes possible to re-record as necessary, and the quality of the collected voice data is improved.

好ましくは、音声データ収集装置は、入力装置から与えられる発話部分マーカの位置の変更指示に応答して、発話部分検出手段により検出された発話部分を当該変更指示にしたがって変更するための発話部分変更手段をさらに含む。 Preferably, the voice data collection device changes the utterance part for changing the utterance part detected by the utterance part detection means in response to the instruction to change the position of the utterance part marker given from the input device. Means are further included.

発話部分が誤って検出された場合に、それを訂正できる。発話部分の切出、テキストとのアライメントなどにおける誤りを防止できる。 If the utterance is detected in error, it can be corrected. It is possible to prevent errors in the extraction of utterances and alignment with text.

More preferably, the voice data collection device further includes alignment means for aligning, in a predetermined speech unit, the utterance voice data stored in the first storage device with the text to be uttered that is displayed on the display device, and for generating alignment data indicating the result. The storage means includes means for, in response to receiving the save instruction signal from the input device, storing the utterance voice data held in the first storage device and the alignment data in the second storage device in association with the predetermined text.

テキストと発話音声データとの間で所定の音声単位でのアライメントを自動的に行ない、アライメントデータを保存できる。この装置だけでアライメント済みの発話音声データを作成できる。音響モデルの学習などにこの発話音声データとアライメントデータとを利用できる。 It is possible to automatically align the text and the speech voice data in a predetermined voice unit and save the alignment data. Aligned speech data can be created with this device alone. The speech data and alignment data can be used for learning an acoustic model.

The storage means may include means for, in response to receiving the save instruction signal from the input device, storing the utterance voice data held in the first storage device in the second storage device in association with the predetermined text, and then causing the display device to display the next text to be uttered.

発話対象のテキストを一つずつ処理し、それらに対する発話音声データがそれぞれ第２の記憶装置に記憶される。従来のように全文を一度に録音し、それを後に手作業で分離していく必要はない。 The texts to be uttered are processed one by one, and utterance voice data for them is stored in the second storage device. There is no need to record the whole sentence at once and separate it later manually by conventional methods.

本発明の第２の局面に係る音声データ収集プログラムは、コンピュータにより実行されると、上記したいずれかの音声データ収集装置として当該コンピュータを動作させるものである。 The sound data collection program according to the second aspect of the present invention, when executed by a computer, causes the computer to operate as one of the sound data collection devices described above.

[Configuration]
FIG. 1 is a block diagram showing the configuration of a voice data collection system 30 that includes a voice data collection device 42 according to an embodiment of the present invention. Referring to FIG. 1, the voice data collection system 30 includes: a text storage device 40 that stores utterance text files, each recording a plurality of texts to be read aloud by a speaker; the voice data collection device 42 according to the present embodiment, which reads a text file from the text storage device 40 and presents the texts to the speaker one utterance at a time, thereby creating a voice data file for each sentence; and a voice data file storage device 44 that stores the voice data files output by the voice data collection device 42.

後述するように音声データ収集システム３０はコンピュータにより実現可能であるが、その場合にはテキスト記憶装置４０と音声データファイル記憶装置４４とはハードディスクなどの不揮発性記憶装置により実現される。両者が物理的に同一の記憶装置により実現されてもよい。 As will be described later, the voice data collection system 30 can be realized by a computer. In this case, the text storage device 40 and the voice data file storage device 44 are realized by a nonvolatile storage device such as a hard disk. Both may be realized by the physically same storage device.

また、発話テキストファイルとしては、本実施の形態ではプレーンテキストファイルを用いる。一つのファイルは複数の発話テキストを含む。各発話テキストの間は改行コードで分離されている。テキストの読出時には、改行に遭遇するまでファイルからテキストを読出すことで、ファイル中のテキストを順に一つずつ取出すことができる。 As the utterance text file, a plain text file is used in the present embodiment. One file contains a plurality of utterance texts. Each utterance text is separated by a line feed code. When reading text, the text in the file can be taken out one by one by reading the text from the file until a line break is encountered.
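As an illustration only, the line-by-line reading described above could be sketched as follows in Python. The function name `iter_utterances` and the decision to skip blank lines are assumptions that the embodiment does not state.

```python
import io
from typing import Iterator, TextIO


def iter_utterances(stream: TextIO) -> Iterator[str]:
    """Yield utterance texts one at a time, reading up to each line break."""
    for line in stream:
        text = line.rstrip("\r\n")
        if text:  # assumption: blank lines are not treated as utterances
            yield text


# Usage with an in-memory stand-in for an utterance text file.
sample = io.StringIO("Good morning.\nHow are you?\n")
utterances = list(iter_utterances(sample))
```

Because the function is a generator, the next utterance text is only read when the caller asks for it, which matches the one-at-a-time presentation to the speaker.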

The voice data collection system 30 further includes: a monitor 46, connected to the voice data collection device 42, which the device uses to present to the speaker the text to be uttered next and the voice waveform of the recording result; a microphone 48, connected to the voice data collection device 42, which the speaker uses when recording speech; a speaker 50, connected to the voice data collection device 42, which the device uses to play back the recording; and an input device 52, such as a mouse and keyboard, which the speaker uses to give instructions to the voice data collection device 42.

FIG. 2 shows a more detailed configuration of the voice data collection device 42 as a functional block diagram. Referring to FIG. 2, the voice data collection device 42 includes: a bus 72; a sequence control unit 70, connected to the input device 52, which controls the operation sequence of the functional units described below according to instructions given by the user via the input device 52; a memory 76 and a memory 86, both connected to the bus 72; and a load module 74, connected to the bus 72, which reads a text file designated by the user from the text storage device 40 and loads it into the memory 76 via the bus 72.

音声データ収集装置４２はさらに、シーケンス制御部７０からの指示にしたがい、音声データ収集装置４２の各機能部により生成された情報をモニタ４６に出力するための表示部７８と、シーケンス制御部７０を介して与えられるユーザの指示にしたがい、マイクロフォン４８からの音声を録音し、対応するテキストとともにメモリ８６に格納するための録音部８０とを含む。これらはいずれもバス７２に接続されている。 The voice data collection device 42 further includes a display unit 78 for outputting information generated by each functional unit of the voice data collection device 42 to the monitor 46 according to an instruction from the sequence control unit 70, and a sequence control unit 70. And a recording unit 80 for recording the voice from the microphone 48 and storing it in the memory 86 together with the corresponding text. These are all connected to the bus 72.

The voice data collection device 42 further includes: a volume check unit 82, connected to the recording unit 80 and the memory 86, which detects the speech portion of the voice signal that the recording unit 80 records and stores in the memory 86, and determines whether the volume of the speech stays within a predetermined range; and a Viterbi alignment unit 84, connected to the recording unit 80, the volume check unit 82, and the bus 72, which performs Viterbi alignment between the speech recorded by the recording unit 80 and the corresponding text, associating the speech with the phonemes of the text stored in the memory 76 to create alignment data.
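The embodiment does not describe the acoustic model or the scoring used by the Viterbi alignment unit 84. Purely as a sketch of the general technique, the following Python code aligns frames to an ordered phoneme sequence, assuming that a matrix of per-frame log-likelihoods is already available. The function name `viterbi_align` and the input layout are hypothetical.

```python
from typing import List


def viterbi_align(loglik: List[List[float]]) -> List[int]:
    """Monotonic (left-to-right) Viterbi alignment.

    loglik[t][k] is the assumed log-likelihood of frame t under the k-th
    phoneme of the text. The returned list gives, for each frame, the index
    of the phoneme it is assigned to. The path starts at phoneme 0, ends at
    the last phoneme, and at each frame either stays or advances by one.
    """
    num_frames, num_phones = len(loglik), len(loglik[0])
    if num_frames < num_phones:
        raise ValueError("need at least one frame per phoneme")
    neg_inf = float("-inf")
    score = [[neg_inf] * num_phones for _ in range(num_frames)]
    back = [[0] * num_phones for _ in range(num_frames)]
    score[0][0] = loglik[0][0]
    for t in range(1, num_frames):
        for k in range(num_phones):
            stay = score[t - 1][k]
            move = score[t - 1][k - 1] if k > 0 else neg_inf
            if move > stay:
                best, back[t][k] = move, k - 1
            else:
                best, back[t][k] = stay, k
            if best > neg_inf:
                score[t][k] = best + loglik[t][k]
    # Trace back from the last phoneme at the last frame.
    path = [num_phones - 1]
    for t in range(num_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Four frames, two phonemes: the first two frames fit phoneme 0 best.
example_path = viterbi_align([[0, -5], [0, -5], [-5, 0], [-5, 0]])
```

The per-frame phoneme indices returned here correspond to the kind of frame-level labels that the embodiment stores as alignment data.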

The voice data collection device 42 further includes: an adjustment unit 88 which, in response to a correction input for the utterance portion markers (described later) given by the user via the input device 52 and the sequence control unit 70, corrects the information indicating the utterance position within the voice information stored in the memory 86; a reproduction unit 90 which, in response to a playback instruction from the sequence control unit 70, reads the voice data from the memory 86, converts it into an analog voice signal, and supplies it to the speaker 50; and a storage processing unit 92 which, in response to an instruction from the sequence control unit 70 to store the voice data, writes the voice data, the corresponding text data, the alignment data, and other related data held in the memory 86 to the voice data file storage device 44.

図３は、シーケンス制御部７０により実現される、音声データ収集装置４２の動作シーケンスを示すフローチャートである。シーケンス制御部７０は、この図に示されるフローチャートにしたがって音声データ収集装置４２の動作ステータスが変化するように音声データ収集装置４２の各部を制御する。この図により、音声データ収集装置４２の動作も説明できる。 FIG. 3 is a flowchart showing an operation sequence of the audio data collection device 42 realized by the sequence control unit 70. The sequence control unit 70 controls each part of the voice data collection device 42 so that the operation status of the voice data collection device 42 changes according to the flowchart shown in this figure. The operation of the voice data collection device 42 can also be explained with this figure.

図３および図２を参照して、このシーケンスによれば、まずステップ１１０でロードモジュール７４を制御して発話テキストファイルをテキスト記憶装置４０から読出し、メモリ７６にロードする。ステップ１１２で、次の発話テキスト（ロード直後には先頭の発話テキスト）をメモリ７６から読出し、表示部７８を制御してモニタ４６に表示させる。 With reference to FIGS. 3 and 2, according to this sequence, first, in step 110, the load module 74 is controlled to read the utterance text file from the text storage device 40 and load it into the memory 76. In step 112, the next utterance text (first utterance text immediately after loading) is read from the memory 76, and the display unit 78 is controlled and displayed on the monitor 46.

続いてユーザからの録音指示に応答してステップ１１４に進み、録音部８０を制御して、マイクロフォン４８から電気信号の形で与えられる音声信号をサンプリングさせ、所定のデータ形式でメモリ８６に記録させる。このサンプリングは、３０ミリ秒のフレーム長で、かつ１０ミリ秒ごとにフレーム位置をずらしながら行なう。 Subsequently, in response to the recording instruction from the user, the process proceeds to step 114 where the recording unit 80 is controlled to sample the audio signal given in the form of an electric signal from the microphone 48 and record it in the memory 86 in a predetermined data format. . This sampling is performed with a frame length of 30 milliseconds and shifting the frame position every 10 milliseconds.
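The framing described above (30 ms frames advanced in 10 ms steps) can be illustrated with the following Python sketch. The sampling rate of 16 kHz is an assumption, since the embodiment does not state one, and `frame_bounds` is a hypothetical helper name.

```python
from typing import List, Tuple


def frame_bounds(num_samples: int, sample_rate: int = 16000,
                 frame_ms: int = 30, shift_ms: int = 10) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of each complete frame."""
    frame_len = sample_rate * frame_ms // 1000   # 480 samples at 16 kHz
    shift = sample_rate * shift_ms // 1000       # 160 samples at 16 kHz
    return [(start, start + frame_len)
            for start in range(0, num_samples - frame_len + 1, shift)]


# One second of audio at the assumed 16 kHz rate.
bounds = frame_bounds(16000)
```

With these values, adjacent frames overlap by 20 ms, so each sample in the interior of the signal falls into three frames.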

録音が終了すると、ステップ１１６において、メモリ８６に格納された録音データを調べ、発話レベルが適正レベルを超えているか否かをボリュームチェック部８２により判定する。またボリュームチェック部８２により、録音データのうちで発話部分がどこかを音声波形の振幅の大きさによって判定する。ステップ１１８において、録音データと、対応する発話テキストとの対応付けをビタビアライメント部８４によって行ない、そのアライメント情報をラベルとして音声データの各フレームに付してメモリ８６に格納する。 When the recording is completed, the recorded data stored in the memory 86 is checked in step 116, and the volume check unit 82 determines whether or not the utterance level exceeds the appropriate level. Further, the volume check unit 82 determines where the utterance portion is in the recorded data based on the amplitude of the voice waveform. In step 118, the Viterbi alignment unit 84 associates the recorded data with the corresponding utterance text, and the alignment information is attached to each frame of the audio data as a label and stored in the memory 86.
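A minimal sketch of the two checks in step 116 is given below, assuming integer sample values. The thresholds and the function names are hypothetical; the embodiment states only that the level is compared with an appropriate level and that the utterance portion is determined from the amplitude of the waveform.

```python
from typing import Optional, Sequence, Tuple


def exceeds_level(samples: Sequence[int], max_level: int) -> bool:
    """True if any sample's absolute amplitude is above the allowed level."""
    return any(abs(s) > max_level for s in samples)


def detect_utterance(samples: Sequence[int],
                     threshold: int) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the span whose amplitude reaches the threshold.

    end is exclusive. Returns None when no sample reaches the threshold.
    """
    active = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not active:
        return None
    return active[0], active[-1] + 1


recorded = [0, 1, 0, 50, -80, 60, 2, 0]
region = detect_utterance(recorded, threshold=10)
too_loud = exceeds_level(recorded, max_level=70)
```

A purely amplitude-based detector of this kind can mistake loud noise for speech, which is why the embodiment lets the user inspect and correct the markers.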

次に、ステップ１２０で、表示部７８を制御し、メモリ８６に記憶されている音声データの波形を、対応するテキストとともにモニタ４６に表示させる。この表示時、ボリュームチェックで音声のレベルが過大な個所が検出されたときには、適正レベルで音声波形をクリップするとともに、適正レベルを示す枠を赤色でモニタ４６に表示させる。また、音声波形のうち、ボリュームチェック部８２が検出した発話領域については、それを示すマーカをモニタ４６に表示させる。この後、ユーザからの指示待ちになる。 Next, in step 120, the display unit 78 is controlled to display the waveform of the audio data stored in the memory 86 on the monitor 46 together with the corresponding text. At this time, if a part with an excessive audio level is detected by the volume check, the audio waveform is clipped at an appropriate level and a frame indicating the appropriate level is displayed on the monitor 46 in red. In addition, a marker indicating the utterance area detected by the volume check unit 82 in the voice waveform is displayed on the monitor 46. After that, it waits for an instruction from the user.
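The clipping of the displayed waveform at the appropriate level could look like the following sketch. The function name `clip_for_display` and the returned flag, which a drawing routine might use to decide whether to show the red frame, are assumptions.

```python
from typing import List, Sequence, Tuple


def clip_for_display(samples: Sequence[int],
                     limit: int) -> Tuple[List[int], bool]:
    """Clip samples to [-limit, limit] and report whether any were clipped."""
    clipped = [max(-limit, min(limit, s)) for s in samples]
    overloaded = any(abs(s) > limit for s in samples)
    return clipped, overloaded


display_wave, show_red_frame = clip_for_display([5, 120, -130], limit=100)
```

Only the displayed copy is clipped in this sketch; the recorded data itself is left unchanged so that the user can decide whether to record again.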

なお、図２および図３には図示していないが、音声データ収集装置４２は再生モードと発話領域の訂正モードとの二つの動作モードの切替が可能である。再生モードではマーカが付された部分に関し、再生部９０による音声再生が行なえる。発話領域の訂正モードでは、マーカ位置を訂正することにより、発話領域を訂正することができる。 Although not shown in FIGS. 2 and 3, the audio data collection device 42 can switch between two operation modes: a reproduction mode and a speech area correction mode. In the playback mode, audio playback by the playback unit 90 can be performed on the part with the marker. In the utterance area correction mode, the utterance area can be corrected by correcting the marker position.

ステップ１２２において、ユーザからの入力がどのようなものであるかを判定する。もしも動作モードが再生モードで、マーカの入力が行なわれるとステップ１２４に進み、音声波形のうち、指定されたマーカ部分を再生部９０およびスピーカ５０を用いて再生する。この後ステップ１２０に戻り、波形表示をしてユーザの入力を待つ。 In step 122, it is determined what the input from the user is. If the operation mode is the reproduction mode and a marker is input, the process proceeds to step 124 where the designated marker portion of the audio waveform is reproduced using the reproduction unit 90 and the speaker 50. Thereafter, the process returns to step 120 to display the waveform and wait for user input.

ステップ１２２において、もしも動作モードが発話領域の訂正モードでマーカの入力が行なわれると、ステップ１２６に進み、調整部８８を用いてメモリ８６に格納されている音声データを修正して、指定された発話領域に一致させる。この後、ステップ１２４に進む。以後の処理は前述したとおりである。 In step 122, if the operation mode is the speech area correction mode and the marker is input, the process proceeds to step 126, where the voice data stored in the memory 86 is corrected using the adjustment unit 88 and designated. Match to the utterance area. Thereafter, the process proceeds to step 124. Subsequent processing is as described above.

If the save (SAVE & NEXT) button, described later, is pressed in step 122, the voice data, corresponding text, alignment data, labels, and so on held in the memory 86 are written together to the voice data file storage device 44. Then, in step 130, it is determined whether all utterance texts in the utterance text file loaded in the memory 76 have been processed. If so, the process ends. If not, the process returns to step 112, where the next utterance text is read from the memory 76 and displayed. The above processing is repeated for each utterance text.
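The embodiment does not specify the file format used when the data is written out. As one possible sketch, the following Python code stores the audio as a 16-bit mono WAV file and the text and alignment data as a JSON file beside it. The file naming, the JSON layout, and the 16 kHz rate are all assumptions.

```python
import json
import tempfile
import wave
from pathlib import Path
from typing import List, Optional, Sequence, Tuple


def save_utterance(out_dir: str, index: int, text: str,
                   samples: Sequence[int], sample_rate: int = 16000,
                   alignment: Optional[List[int]] = None) -> Tuple[Path, Path]:
    """Write one utterance's audio and its associated text to out_dir."""
    base = Path(out_dir) / f"utt{index:04d}"
    wav_path = base.with_suffix(".wav")
    with wave.open(str(wav_path), "wb") as w:
        w.setnchannels(1)           # mono
        w.setsampwidth(2)           # 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(b"".join(
            int(s).to_bytes(2, "little", signed=True) for s in samples))
    meta_path = base.with_suffix(".json")
    meta_path.write_text(
        json.dumps({"text": text, "alignment": alignment or []},
                   ensure_ascii=False),
        encoding="utf-8")
    return wav_path, meta_path


# Usage: save one short utterance into a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    wav_path, meta_path = save_utterance(tmp, 1, "Good morning.", [0, 1000, -1000])
    with wave.open(str(wav_path), "rb") as w:
        saved_frames = w.getnframes()
    saved_text = json.loads(meta_path.read_text(encoding="utf-8"))["text"]
```

Writing one pair of files per utterance reflects the point made in the description that no later manual splitting of a long recording is needed.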

ステップ１２２で「録音」ボタンが押されたと判定されると、ステップ１１４に戻り、再度同じ発話テキストを用いた録音を繰返す。 If it is determined in step 122 that the “record” button has been pressed, the process returns to step 114 to repeat recording using the same utterance text again.

The above is the operation sequence of the units of the voice data collection device 42 as realized by the sequence control unit 70. The general operation of the voice data collection device 42 should also be clear from the above description.

[Operation]
The operation of this device is as described with reference to FIG. 3. The details are therefore not repeated here.

[Realization by computer]
-Control structure of computer program-
The voice data collection device 42 described above can be realized by computer hardware and a computer program executed on that computer. FIGS. 4 and 5 show the schematic control structure of such a program in flowchart form. The program uses a GUI (graphical user interface). Accordingly, when the user operates a GUI component (an object such as a button or menu item) displayed on the monitor 46, the program code (method) defined in advance for that operation on that component is executed. The mechanism that invokes the corresponding method in response to a user operation is realized through cooperation among the OS (operating system), the user program, modules that are installed on the computer separately from the OS and the user program and are called dynamically at run time, and, in some cases, a virtual machine environment running on the OS.

図４を参照して、まずステップ１５０でファイルオープンのダイアログを表示する。ここでは、ＯＳが用意したファイルオープンのダイアログを呼出せばよい。テキスト属性のファイルのみをダイアログで表示するように、いわゆるフィルタ処理をしてもよい。フィルタ処理は多くのＯＳで提供されている機能である。 Referring to FIG. 4, first, in step 150, a file open dialog is displayed. Here, a file open dialog prepared by the OS may be called. So-called filter processing may be performed so that only text attribute files are displayed in the dialog. Filter processing is a function provided by many OSs.

続いてステップ１５２では、ファイルオープンダイアログでユーザがファイルのオープンをキャンセルし、処理の終了を選択したか否かを判定する。終了が選択されていればプログラムを終了させる。それ以外の場合、すなわちファイルが選択された場合にはステップ１５４に進む。 Subsequently, in step 152, it is determined whether or not the user cancels the opening of the file in the file open dialog and selects the end of the process. If exit is selected, the program is terminated. In other cases, that is, when a file is selected, the process proceeds to step 154.

ステップ１５４では、ダイアログで指定されたファイルをメモリ７６にロードする。ステップ１５６では、メモリ７６にロードしたファイルから、発話テキストの読出を試みる。ステップ１５８では、ステップ１５６の処理の結果、ファイルの末尾を示すＥＯＦ（ＥｎｄＯｆＦｉｌｅ）マークを読出したか否かを判定する。ＥＯＦを読出した場合には処理対象の発話テキストがなくなったということであるから処理を終了する。発話テキストの読出に成功した場合、ステップ１６０に進む。 In step 154, the file specified in the dialog is loaded into the memory 76. In step 156, an attempt is made to read the utterance text from the file loaded in the memory 76. In step 158, as a result of the processing in step 156, it is determined whether an EOF (End Of File) mark indicating the end of the file has been read. When the EOF is read, it means that there is no utterance text to be processed, and the process is terminated. If the utterance text has been successfully read, the process proceeds to step 160.

ステップ１６０では、保存ボタンを不能化し、操作できないようにする。また録音ボタンを可能化し、ユーザが録音の指示を行なうことができるようにする。さらに、音声データ収集装置４２の動作モードを再生モードに設定する。 In step 160, the save button is disabled so that it cannot be operated. In addition, a recording button is made available so that the user can instruct recording. Further, the operation mode of the audio data collection device 42 is set to the reproduction mode.

この後、ステップ１６２で録音の初期画面を表示する。この画面では、読出した発話テキストを表示し、ユーザの操作を待つ。 Thereafter, in step 162, an initial recording screen is displayed. On this screen, the read utterance text is displayed and a user operation is waited for.

Next, referring to FIG. 5, step 164 determines whether an event has occurred. Here, an event is a notification delivered to this program by the OS or the like when the user operates one of the enabled GUI components or when some module issues a message. The events assumed here are operation of the record (START) button, operation of the record end (STOP) button, a mode switching operation, a marker input operation, and operation of the save (SAVE & NEXT) button. Control branches to steps 170, 180, 200, 202, and 210, respectively, according to the event.
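The branch at step 164 amounts to a dispatch table from event type to handler. The sketch below is a hypothetical illustration; the names `Event`, `ENTRY_STEP`, and `dispatch` do not appear in the source, and the handlers are stand-ins that only record which event arrived.

```python
from enum import Enum, auto
from typing import Callable, Dict, List


class Event(Enum):
    START = auto()         # record button
    STOP = auto()          # record end button
    MODE_SWITCH = auto()   # playback/correction mode switch
    MARKER_INPUT = auto()  # marker moved on the waveform
    SAVE_NEXT = auto()     # SAVE & NEXT button


# Entry step of each branch, as listed in the description.
ENTRY_STEP: Dict[Event, int] = {
    Event.START: 170,
    Event.STOP: 180,
    Event.MODE_SWITCH: 200,
    Event.MARKER_INPUT: 202,
    Event.SAVE_NEXT: 210,
}


def dispatch(event: Event, handlers: Dict[Event, Callable[[], None]]) -> int:
    """Run the handler registered for the event and return its entry step."""
    handlers[event]()
    return ENTRY_STEP[event]


# Stand-in handlers that just log the event name.
log: List[str] = []
handlers = {ev: (lambda name=ev.name: log.append(name)) for ev in Event}
steps = [dispatch(ev, handlers) for ev in (Event.START, Event.STOP, Event.SAVE_NEXT)]
```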

In step 170, the record button on the screen is toggled to the record end button. That is, the button's label is changed to that of the record end button and its function is switched from starting a recording to ending one. As a result, the record button is disabled and the record end button is enabled.

Recording then starts in step 172, and control returns to step 164. Specifically, recording is performed by calling, through an API (Application Programming Interface), a recording function provided by the OS or the like.

When the record end button is operated, it is toggled back to the record button in step 180, and recording is ended in step 182. That is, the function invoked through the recording API in step 172 is stopped through another API.

In step 184, the level of the recorded voice data is examined to detect the utterance region, and a label indicating the utterance region is attached to the voice data. Then, in step 186, the level of the voice data is examined to determine whether any part of it falls outside the proper level range. If so, a volume check label is attached to the corresponding frames of the voice data.
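A minimal sketch of level-based processing in the spirit of steps 184 and 186 is given below. The source does not specify the detection method, so the peak-per-frame measure, the thresholds, and all function names are assumptions chosen for illustration.

```python
from typing import List, Optional, Tuple


def frame_peaks(samples: List[float], frame_len: int) -> List[float]:
    """Peak absolute amplitude of each consecutive frame."""
    return [
        max(abs(s) for s in samples[i:i + frame_len])
        for i in range(0, len(samples), frame_len)
    ]


def detect_utterance(peaks: List[float], threshold: float) -> Optional[Tuple[int, int]]:
    """Step 184 (assumed rule): first and last frame whose peak reaches the threshold."""
    active = [i for i, p in enumerate(peaks) if p >= threshold]
    if not active:
        return None
    return active[0], active[-1]


def volume_check(peaks: List[float], limit: float) -> List[int]:
    """Step 186 (assumed rule): frames containing a sample outside [-limit, +limit]."""
    return [i for i, p in enumerate(peaks) if p > limit]


# Four frames of four samples each: silence, speech, overloaded speech, silence.
samples = [0.0] * 4 + [0.3, -0.4, 0.5, -0.2] + [0.95, -0.2, 0.1, 0.3] + [0.0] * 4
peaks = frame_peaks(samples, frame_len=4)
region = detect_utterance(peaks, threshold=0.1)
flagged = volume_check(peaks, limit=0.9)
```

Here `region` plays the role of the utterance-region label and `flagged` the role of the volume check labels.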

In step 188, Viterbi alignment is performed between the text corresponding to this utterance and the voice data to determine which part of the voice data corresponds to which phoneme of the text. Based on the result, each frame of the voice data is labeled with its corresponding phoneme. This alignment requires, among other things, an acoustic model trained on speakers' speech, but a known algorithm can be used for the alignment itself. Moreover, because the text corresponding to the voice data is known in this embodiment, the alignment is easier still.
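Because the phoneme sequence is known from the text, the alignment reduces to finding the best monotonic path through a frames-by-phonemes score table. The sketch below shows that dynamic program under simplifying assumptions: one state per phoneme, no transition probabilities, and made-up log-likelihoods in place of scores from a trained acoustic model. It is not the patent's implementation.

```python
from typing import List

NEG_INF = float("-inf")


def viterbi_align(loglik: List[List[float]]) -> List[int]:
    """Align frames to a known phoneme sequence.

    loglik[t][j] is the log-likelihood of frame t under the j-th phoneme of
    the text. The path starts in phoneme 0, ends in the last phoneme, and at
    each frame either stays in the same phoneme or moves on to the next one.
    Returns the phoneme index chosen for each frame.
    """
    n_frames, n_phones = len(loglik), len(loglik[0])
    if n_frames < n_phones:
        raise ValueError("need at least one frame per phoneme")
    score = [[NEG_INF] * n_phones for _ in range(n_frames)]
    came_from = [[0] * n_phones for _ in range(n_frames)]
    score[0][0] = loglik[0][0]
    for t in range(1, n_frames):
        for j in range(n_phones):
            stay = score[t - 1][j]
            advance = score[t - 1][j - 1] if j > 0 else NEG_INF
            if advance > stay:
                best, came_from[t][j] = advance, j - 1
            else:
                best, came_from[t][j] = stay, j
            if best > NEG_INF:
                score[t][j] = best + loglik[t][j]
    # Trace back from the last phoneme at the last frame.
    path = [n_phones - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(came_from[t][path[-1]])
    path.reverse()
    return path


# Five frames, three phonemes; the values are invented for the example.
loglik = [
    [-0.1, -3.0, -3.0],
    [-0.2, -2.5, -3.0],
    [-2.0, -0.3, -2.5],
    [-2.5, -0.4, -1.5],
    [-3.0, -2.0, -0.2],
]
frame_labels = viterbi_align(loglik)
```

Each entry of `frame_labels` is the per-frame phoneme label that step 188 attaches to the voice data.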

In step 190, a waveform visualizing the recorded voice data is displayed on monitor 46. The rest of the screen is refreshed at the same time, and a marker indicating the utterance region is displayed. If the volume check found a part that falls outside the proper level, the whole waveform is clipped with the proper level as its upper or lower limit, and a rectangle indicating the proper level is drawn around the waveform. Control then returns to step 164.
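The display-side clipping of step 190 can be illustrated with a few lines of code. This sketch assumes a symmetric proper level `limit` and normalized sample values; the function names are hypothetical.

```python
from typing import List


def clip_for_display(samples: List[float], limit: float) -> List[float]:
    """Clamp every sample to [-limit, +limit] before drawing the waveform."""
    return [max(-limit, min(limit, s)) for s in samples]


def needs_level_box(samples: List[float], limit: float) -> bool:
    """True when any sample is outside the proper level, so the box is drawn."""
    return any(abs(s) > limit for s in samples)


recorded = [0.2, 1.3, -1.1, 0.5]
shown = clip_for_display(recorded, limit=1.0)
draw_box = needs_level_box(recorded, limit=1.0)
```

Only the displayed copy is clipped here; the recorded data itself is left unchanged.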

If step 164 determines that the event is a mode switching event, the operation mode is toggled in step 200: from playback mode to correction mode, or from correction mode to playback mode. Control then returns to step 164.

If step 164 determines that the event is a marker input, step 202 determines whether the operation mode of the voice data collection device 42 is playback mode. In playback mode, control goes directly to step 206. Otherwise, the marker indicating the utterance region is corrected in step 204 before control proceeds to step 206. In step 206, the part of the voice data indicated by the marker, that is, the utterance region, is played back, and control returns to step 164.
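The interplay of the mode switch (step 200) and marker input (steps 202 to 206) can be sketched as a small state holder. The class and field names below are hypothetical, and appending to `played` stands in for actual audio playback.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Region = Tuple[float, float]  # (start_sec, end_sec)


@dataclass
class Session:
    mode: str = "playback"             # "playback" or "correction"
    utterance: Region = (0.0, 0.0)     # current utterance-region marker
    played: List[Region] = field(default_factory=list)

    def toggle_mode(self) -> None:
        """Step 200: switch between playback and correction mode."""
        self.mode = "correction" if self.mode == "playback" else "playback"

    def marker_input(self, region: Region) -> None:
        """Steps 202 to 206: correct the marker if needed, then play it."""
        if self.mode != "playback":    # step 202
            self.utterance = region    # step 204
        self.played.append(region)     # step 206 (stand-in for playback)


session = Session(utterance=(1.0, 2.0))
session.marker_input((1.2, 1.8))       # playback mode: marker is not changed
after_playback = session.utterance
session.toggle_mode()
session.marker_input((0.9, 2.1))       # correction mode: marker is updated
after_correction = session.utterance
```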

If step 164 determines that the save button was operated, the text and the aligned, labeled voice data are stored in the voice data file storage device 44 in step 210. Control then returns to step 156 of FIG. 4, where the next text is read and the processing from step 158 onward is repeated.

The above is the general control structure of the computer program that implements the voice data collection device 42 of this embodiment on a computer.

-Computer hardware-
FIG. 6 shows the external appearance of a computer system 330 that implements the voice data collection system 30 of this embodiment, and FIG. 7 shows the internal configuration of the computer system 330.

Referring to FIG. 6, the computer system 330 includes a computer 340 having an FD (flexible disk) drive 352 and a CD-ROM (compact disc read-only memory) drive 350, a keyboard 346, a mouse 348, a monitor 342, a microphone 370, and a speaker 372. Of these, the keyboard 346 and the mouse 348 correspond to the input device 52 shown in FIGS. 1 and 2. The monitor 342, the microphone 370, and the speaker 372 correspond to the monitor 46, the microphone 48, and the speaker 50 shown in FIGS. 1 and 2, respectively.

Referring to FIG. 7, in addition to the FD drive 352 and the CD-ROM drive 350, the computer 340 includes a CPU (central processing unit) 356; a bus 366 connected to the CPU 356, the FD drive 352, and the CD-ROM drive 350; a read-only memory (ROM) 358 that stores a boot-up program and the like; and a random access memory (RAM) 360 that is connected to the bus 366 and stores program instructions, a system program, working data, and the like. The computer system 330 further includes a printer (not shown).

Although not shown here, the computer 340 may further include a network adapter board that provides a connection to a local area network (LAN).

The computer program that causes the computer system 330 to operate as the voice data collection system 30 (and the voice data collection device 42) is stored on a CD-ROM 362 or an FD 364 inserted into the CD-ROM drive 350 or the FD drive 352, and is then transferred to a hard disk 354. Alternatively, the program may be sent to the computer 340 over a network (not shown) and stored on the hard disk 354. The program is loaded into the RAM 360 when it is executed. The program may also be loaded into the RAM 360 directly from the CD-ROM 362, from the FD 364, or over the network.

This program includes a plurality of instructions that cause the computer 340 to operate as the voice data collection device 42 of this embodiment. Some of the basic functions needed for this are provided by the OS running on the computer 340, by third-party programs, or by modules of various toolkits installed on the computer 340. The program therefore need not include every function required to implement the system and method of this embodiment. It need only include those instructions that implement the voice data collection device 42 described above by calling the appropriate functions or "tools" in a controlled manner so that the desired result is obtained. The operation of the computer system 330 is well known and is not repeated here.

-GUI screen-
FIG. 8 shows the GUI screen 180 displayed by the voice data collection system 30 in this embodiment. Referring to FIG. 8, the GUI screen 180 includes a file menu 190 placed in the menu area, a save button 192 operated to save a file, a level meter 194 that shows the recording level, and a record/record end button 196. The record/record end button 196 is toggled by the program so that it acts as the record button when recording is possible and as the record end button during recording. Clicking the file menu 190 displays a pull-down menu for switching the operation mode, containing items for both playback mode and correction mode. Selecting one of them switches the operation mode of the voice data collection device 42 between playback mode and correction mode.

The GUI screen 180 further includes an ID (identifier) display area 198 for the utterance text being processed, a display area 200 for the utterance text being processed, a display area 202 for the name of the destination file for the utterance text being processed, a waveform display area 204 that shows a waveform 206 of the recorded voice data against the time axis together with its level, and a storage region marker 208 that indicates the part of the voice data to be saved as utterance data. The storage region marker 208 covers the utterance region 210 detected by the volume check unit 82 and silent regions 212 of a predetermined length before and after it. In this embodiment, each of the two silent regions 212 is set to half the length of the utterance region 210.
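The extent of the storage region follows directly from the half-length rule. In the sketch below, the function name is hypothetical and the clamping to the bounds of the recording is an added assumption; the source states only the lengths of the silent regions.

```python
from typing import Tuple


def storage_region(utt_start: float, utt_end: float, total_len: float) -> Tuple[float, float]:
    """Utterance region plus a silent region of half its length on each side.

    Times are in seconds. Clamping to [0, total_len] is an assumption for
    recordings with too little leading or trailing silence.
    """
    pad = (utt_end - utt_start) / 2
    return max(0.0, utt_start - pad), min(total_len, utt_end + pad)


full = storage_region(2.0, 4.0, total_len=10.0)     # 2 s utterance, 1 s pads
clamped = storage_region(0.5, 2.5, total_len=3.0)   # pads cut at both ends
```

For a 2-second utterance from 2.0 s to 4.0 s, the saved span runs from 1.0 s to 5.0 s.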

In this embodiment, the destination file name is formed by appending the ID of the utterance text to a predetermined character string (which the operator can specify; "speechfile_demo" in FIG. 8) and then appending the sampling rate as an extension.
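As a small illustration of this naming rule, the sketch below joins the three parts in the stated order. The source fixes only the order (prefix, utterance ID, sampling rate as extension); the underscore separator, the rate written in Hz, and the sample ID `u001` are assumptions.

```python
def destination_filename(prefix: str, utterance_id: str, sampling_rate_hz: int) -> str:
    """Build "<prefix>_<id>.<rate>"; the separators are assumed, not from the source."""
    return f"{prefix}_{utterance_id}.{sampling_rate_hz}"


name = destination_filename("speechfile_demo", "u001", 16000)
```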

The state of the GUI screen 180 in each phase of operation is described below. FIG. 9 is an example of the screen displayed in step 162 of FIG. 4. The file menu 190 is enabled, and the record/record end button 196 is set to the record button. The ID display area 198, the display area 200, and the destination file name display area 202 show, respectively, the ID of the utterance text currently being processed, the utterance text itself, and the name of the file in which it will be saved.

FIG. 10 is an example of the GUI screen 180 during recording. During recording, the record/record end button 196 acts as the record end button and does not function as the record button. The level meter 194 shows the recording level in real time.

As shown in FIG. 9, when recording is possible the record/record end button 196 is the record button and does not function as the record end button. As shown in FIG. 10, during recording it is the record end button and does not function as the record button. The program can therefore assume that each of these buttons is operated only at an appropriate time.

FIG. 11 is an example of the GUI screen 180 when recording has ended. The record/record end button 196 is the record button again. Moving the marker of the utterance region 210 on this screen has an effect that depends on the operation mode: in playback mode, the audio of the marked region is played back; in correction mode, the marker of the utterance region 210 is itself updated to the entered range and the audio of that region is played back.

FIG. 12 is an example of the GUI screen 180 when the recording contains a portion 220 whose level is too high. As shown in FIG. 12, when the voice level is too high, the waveform is clipped at the proper level and a rectangle 222 indicating the proper level range is displayed in red. The operator can therefore see at once that the recording level was not appropriate and can press the record/record end button 196 again to redo the recording.

FIG. 13 is an example of the GUI screen 180 when noise 232 is present apart from the uttered speech 230. When noise has been recorded, it is displayed clearly in this way, so the user can easily decide whether to record again. If the noise lies outside the storage region, it is also easy to see that re-recording is unnecessary, which saves time and effort during recording.

As described above, the voice data collection system 30 and the voice data collection device 42 of this embodiment allow a recording session to proceed interactively with the speaker while the utterance text is displayed. Because the waveform is displayed together with the automatically detected utterance region, a wrong detection is easy to correct. Because the audio can be played back immediately, either per utterance or per specified region within an utterance, misreadings are also easy to confirm. The waveform is displayed so that noise can be spotted at once, and in a form that shows immediately whether the recording level was appropriate. Whether a recording was made properly can therefore be judged on the spot, and good-quality voice data can be obtained.

With the voice data collection system 30 and the voice data collection device 42 described above, the speaker can easily judge whether a recording was made properly, so a supervisor does not need to watch the recording at all times. The supervisor's burden is greatly reduced. This makes it possible, for example, for a single supervisor to oversee recordings by several speakers on several devices at once, saving both the cost and the time of building a speech corpus.

The embodiment disclosed here is merely illustrative, and the present invention is not limited to it. The scope of the present invention is indicated by the claims, read in light of the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording of the claims.

FIG. 1 is a block diagram showing the schematic configuration of a voice data collection system 30 according to one embodiment of the present invention.
FIG. 2 is a block diagram showing the schematic configuration of the voice data collection device 42.
FIG. 3 is a flowchart showing the processing sequence implemented by the sequence control unit 70.
FIG. 4 is a flowchart of the program that implements the voice data collection device 42.
FIG. 5 is a flowchart of the program that implements the voice data collection device 42.
FIG. 6 is an external view of the computer system 330 that implements the voice data collection system 30.
FIG. 7 is a block diagram of the computer system 330 shown in FIG. 6.
FIG. 8 is a diagram showing the layout of the GUI screen 180.
FIG. 9 is a diagram showing an example of the display screen before recording starts.
FIG. 10 is a diagram showing an example of the display screen during recording.
FIG. 11 is a diagram showing an example of the display screen after recording ends.
FIG. 12 is a diagram showing an example of the display screen when the recording level was inappropriate.
FIG. 13 is a diagram showing an example of the display screen when noise is present.

Explanation of symbols

30: voice data collection system; 40: text storage device; 42: voice data collection device; 44: voice data file storage device; 46: monitor; 48: microphone; 50: speaker; 52: input device; 70: sequence control unit; 72: bus; 74: load module; 76: memory; 78: display unit; 80: recording unit; 82: volume check unit; 84: Viterbi alignment unit; 86: memory; 88: adjustment unit; 90: playback unit; 92: storage processing unit

Claims

A voice data collection device for collecting voice data of an utterance corresponding to a predetermined text, the device being connected to a display device, a predetermined input device operable by a user, and a microphone, and comprising:
text display means for displaying the text to be uttered on the display device;
voice recording means for, in response to receiving a predetermined recording start instruction signal while the text to be uttered is displayed on the display device, starting sampling of the voice signal from the microphone and storing the sampled utterance voice data in a first storage device;
waveform display means for, in response to a predetermined recording end instruction signal, generating a voice waveform based on the utterance voice data stored in the first storage device and displaying it on the display device; and
storage means for, in response to receiving a predetermined save instruction signal from the input device, storing the utterance voice data stored in the first storage device in a second storage device in association with the predetermined text.

The voice data collection device according to claim 1, further comprising voice re-recording means for, in response to receiving the recording start instruction signal while a voice waveform is displayed on the display device, starting sampling of the voice signal from the microphone and replacing the utterance voice data stored in the first storage device with the newly sampled utterance voice data.

The voice data collection device according to claim 1, wherein the voice data collection device is further connected to a speaker,
the device further comprising playback means for, in response to receiving a predetermined playback instruction signal while a voice waveform is displayed on the display device, reproducing voice from the utterance voice data stored in the first storage device and supplying it to the speaker.

The voice data collection device according to any one of claims 1 to 3, further comprising level determination means for determining whether the utterance voice data stored in the first storage device is within a predetermined signal level range,
wherein the waveform display means includes means for, in response to the recording end instruction signal, generating a voice waveform based on the utterance voice data stored in the first storage device and displaying it on the display device together with level determination information that visually indicates, according to the determination result of the level determination means, whether the signal level is within the predetermined signal level range.

The voice data collection device according to any one of claims 1 to 3, further comprising utterance portion detection means for detecting an utterance portion of the utterance voice data stored in the first storage device,
wherein the waveform display means includes means for, in response to the predetermined recording end instruction signal, generating a voice waveform based on the utterance voice data stored in the first storage device and displaying it on the display device together with an utterance portion marker that visually indicates, according to the detection result of the utterance portion detection means, the utterance portion of the voice waveform.

The voice data collection device according to claim 5, further comprising utterance portion changing means for, in response to an instruction given from the input device to change the position of the utterance portion marker, changing the utterance portion detected by the utterance portion detection means in accordance with the change instruction.

The voice data collection device according to claim 1, further comprising alignment means for aligning, in a predetermined speech unit, the utterance voice data stored in the first storage device with the text to be uttered that is displayed on the display device, and generating alignment data indicating the result,
wherein the storage means includes means for, in response to receiving the save instruction signal from the input device, storing the utterance voice data stored in the first storage device and the alignment data in the second storage device in association with the predetermined text.

The voice data collection device according to claim 1, wherein the storage means includes means for, in response to receiving the save instruction signal from the input device, storing the utterance voice data stored in the first storage device in the second storage device in association with the predetermined text and causing the display device to display the text to be uttered next.

A voice data collection program that, when executed by a computer, causes the computer to operate as the voice data collection device according to any one of claims 1 to 8.