JP6631186B2

JP6631186B2 - Speech creation device, method and program, speech database creation device

Info

Publication number: JP6631186B2
Application number: JP2015225047A
Authority: JP
Inventors: 淳一郎副島
Original assignee: Casio Computer Co Ltd
Current assignee: Casio Computer Co Ltd
Priority date: 2015-11-17
Filing date: 2015-11-17
Publication date: 2020-01-15
Anticipated expiration: 2035-11-17
Also published as: JP2017090856A

Description

The present invention relates to a speech creation device, method, and program that create accent information to be attached to phoneme segment data, and to a speech database creation device.

When a speech database for speech synthesis is created, the portion of the learning speech data that corresponds to each segment, where a segment spans one phoneme, is cut out as phoneme segment data and registered in the speech database with the phoneme label of that segment as the key. To improve synthesis quality, accent information is attached to the registered phoneme segment data. First, morphological analysis is performed on the text data corresponding to the speech data, and accent information is assigned to each phoneme obtained as a result. Next, for each segment, the phoneme of the segment is associated with a phoneme obtained by the morphological analysis, the accent information assigned to that phoneme is acquired, and it is registered in the speech database together with the phoneme segment data of the segment.

The learning speech data from which the phoneme segment data is derived is not necessarily uttered with the accent implied by the linguistic information obtained from the corresponding text data. For this reason, techniques are known that correct accent information generated from the linguistic information of the text data on the basis of fundamental frequency information obtained from the utterance in the speech data (for example, the techniques described in Patent Documents 1 and 2).

Patent Document 1: JP 2012-189703 A
Patent Document 2: JP-A-6-337691

As described above, attaching accent information to phoneme segment data requires associating the phoneme of each segment obtained from the speech data with the phonemes obtained by morphological analysis of the text data corresponding to that speech data.

In an actual utterance, the speaker pauses for breath, and at those moments the speech data is silent or nearly silent. Such a moment is hereinafter called a "blank timing". When the speech data is divided into segments, a blank timing is detected as a segment whose phoneme indicates silence. A blank timing generally coincides with a morpheme boundary, and a punctuation mark is often present at that position, but there are cases where no punctuation mark is detected there. Moreover, when a blank timing is caused by hesitation, it may fall, for example, at a phoneme in the middle of a single morpheme in the linguistic information. In these cases, the phoneme of each segment cannot be associated properly with the phonemes obtained by morphological analysis.

The conventional techniques described above, however, do not take the occurrence of blank timings into account, so the correspondence can shift and correct accent information may not be obtained for each item of phoneme segment data. Because the correspondence between the position of the accent information obtained from the linguistic information and the position of the fundamental frequency obtained from the utterance in the speech data also shifts, correction of the accent information based on the fundamental frequency may fail as well.

An object of the present invention is to make it possible to obtain correct accent information by taking the blank timings of the speech data into account.

One example aspect includes a processing unit that executes: an acquisition process that acquires, from input speech data, first position data indicating at least one of an accent position and a break position; a comparison process that compares the first position data acquired from the speech data with second position data, which indicates at least one of the position of an accent assigned to morpheme data containing a plurality of morphemes generated from text data corresponding to the speech data and the break positions between those morphemes; and a process that, when the comparison process finds that the first and second position data do not match, assigns the first position data to the morpheme data in place of the second position data. In addition, when the comparison process finds that the break position indicated by the first position data matches the break position indicated by the second position data, the processing unit adds comma (読点) information to the morpheme data at the position indicated by the second position data.

According to the present invention, correct accent information can be obtained by taking the blank timings of the speech data into account.

A block diagram of an embodiment of the speech information creation device according to the present invention.
A diagram showing an example data structure of the segment data.
A diagram showing a specific example of the segment data.
A diagram showing an example data structure of the morpheme data.
A diagram showing a specific example of the morpheme data.
A diagram showing an example data structure of the morpheme label data.
A diagram showing a specific example of the morpheme label data.
A diagram showing an example data structure of the learning label data.
A diagram showing an example data structure of the fundamental frequency data.
A diagram showing an example hardware configuration of a computer that can implement the speech information creation device as software processing.
A flowchart showing an example of the data adding process.
A flowchart (part 1) showing a detailed example of the association process.
A flowchart (part 2) showing a detailed example of the association process.
A diagram showing an example of the morpheme data after adjustment.
A flowchart showing a detailed example of the accent correction process.

An embodiment of the present invention is described in detail below with reference to the drawings. In speech synthesis, the sequence of phoneme segment data that best matches the target phoneme sequence is retrieved from a speech database, and the retrieved phoneme segment data are concatenated to synthesize speech data. The present embodiment concerns a technique for creating the accent information that is attached to phoneme segment data when that data is generated for registration in a speech database for speech synthesis.

Here, "speech synthesis" is the process of synthesizing, from input text data representing a sentence, speech data with which a computer can produce speech similar to that of a human reading the text aloud. "Phoneme segment data" is time-domain speech waveform data representing one phoneme. A "phoneme" is the smallest linguistic unit that distinguishes meaning. For example, the word 「あらゆる」 (arayuru) decomposes into the phonemes "a", "r", "a", "y", "u", "r", "u".

In speech synthesis, morphological analysis is first performed on the input text data with reference to a morpheme dictionary, yielding morpheme data. A "morpheme" is the smallest linguistic unit that carries meaning by itself. For example, given text data reading 「あらゆる現実をすべて自分のほうへねじ曲げたのだ。」 (roughly, "bent every reality entirely toward oneself"), morphological analysis decomposes it into the 13 morphemes 「あらゆる」「現実」「を」「すべて」「自分」「の」「ほう」「へ」「ねじ曲げ」「た」「の」「だ」「。」. In the next stage, the morpheme data is decomposed further to generate data consisting of a sequence of phonemes (hereinafter called "morpheme label data" in this embodiment). For example, from the morpheme 「あらゆる」, one item of morpheme label data is generated for each of the phonemes "a", "r", "a", "y", "u", "r", "u", each holding the corresponding phoneme label. In the following stage, the speech database described above is searched with the phoneme label of each item of morpheme label data, retrieving phoneme segment data to which the same phoneme label is attached. In the final stage, the phoneme segment data retrieved for the items of morpheme label data are concatenated to synthesize speech data corresponding to the input text.

FIG. 1 is a block diagram of an embodiment of a speech information creation device 100 according to the present invention, which creates the accent information attached to phoneme segment data when that data is generated for registration in a speech database for speech synthesis. The speech information creation device 100 includes a speech analysis unit 101, a language analysis unit 102, and a data adding unit 103, and ultimately outputs learning label data 130. The speech analysis unit 101 includes a speech recognition unit 110, an acoustic model 111, and a fundamental frequency analysis unit 112. The language analysis unit 102 includes a morphological analysis unit 120 and a morpheme dictionary 121.

The speech recognition unit 110 performs speech recognition on the input time-domain learning speech data 113 with reference to the acoustic model 111, thereby dividing the speech data 113 into segments, each of which is a speech interval over which a single phoneme continues, and outputs segment data 114 as the result.

FIG. 2 shows an example data structure of the segment data 114. As shown in the figure, for the total of L phonemes obtained by the speech recognition of the speech recognition unit 110, each item of segment data segment[0], segment[1], ..., segment[L-1] holds the variables seg_id, phone, start, end, prev, and next. seg_id holds the segment ID (identifier). phone holds the phoneme label. start holds the start time (as a sample number). end holds the end time (as a sample number). prev holds a pointer to the immediately preceding segment data 114, and next holds a pointer to the immediately following segment data 114. If the current segment data 114 is, for example, segment[1], prev holds the start address of segment[0] and next holds the start address of segment[2]. If the current segment data 114 is the first item, segment[0], prev holds NULL, an undefined value. If it is the last item, segment[L-1], next holds NULL. Because the items of segment data 114 are linked through prev and next, a new item can be inserted at any position in the sequence even if it is physically appended at, for example, the end of the memory area.
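The prev/next linkage described for the segment data can be sketched as a doubly linked list. The following Python sketch is only an illustration under stated assumptions: the class `Segment` and the helpers `link`, `insert_after`, and `phones` are hypothetical names, object references stand in for the pointers, and `None` stands in for NULL.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Segment:
    """One item of segment data (field names follow FIG. 2)."""
    seg_id: int
    phone: str   # phoneme label; "#" marks silence
    start: int   # start time as a sample number
    end: int     # end time as a sample number
    # References stand in for the prev/next pointers; None plays the role of NULL.
    prev: Optional["Segment"] = field(default=None, repr=False, compare=False)
    next: Optional["Segment"] = field(default=None, repr=False, compare=False)


def link(segments: List[Segment]) -> Optional[Segment]:
    """Chain the segments in list order and return the head (or None)."""
    for a, b in zip(segments, segments[1:]):
        a.next, b.prev = b, a
    return segments[0] if segments else None


def insert_after(node: Segment, new: Segment) -> None:
    """Splice `new` into the chain right after `node`."""
    new.prev, new.next = node, node.next
    if node.next is not None:
        node.next.prev = new
    node.next = new


def phones(head: Optional[Segment]) -> List[str]:
    """Walk the chain from `head` and collect the phoneme labels."""
    out = []
    while head is not None:
        out.append(head.phone)
        head = head.next
    return out
```

With this layout a new segment can be appended to the underlying storage and still appear in the middle of the logical sequence, because traversal follows `next` rather than storage order.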

FIG. 3 shows a specific example of the segment data. For example, in the segment data 114 with seg_id=3, phone holds the phoneme label "a", start holds the start time "730" (sample number), end holds the end time "810" (sample number), prev holds "2" as the seg_id of the preceding segment data 114, and next holds "4" as the seg_id of the following segment data 114. In FIG. 3, phone holds "#" in the first segment data 114 (seg_id=0) and in the seventeenth (seg_id=16). The "#" indicates that the speech data 113 was judged to be silent, that is, zero or close to zero, over the interval of the corresponding segment.

When data is registered in the speech database (not shown), the waveform data corresponding to each sample interval delimited by start and end of the segment data 114 is cut out of the speech data 113 and used as phoneme segment data. The phoneme segment data is then registered in the speech database with the phoneme label of that segment as the search key.
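The registration step above can be sketched as slicing the waveform by sample interval and filing each slice under its phoneme label. This is a minimal sketch, not the patent's implementation: the function name `register_units` is hypothetical, the database is modeled as an in-memory dictionary, and `end` is assumed to be exclusive, which the text does not specify.

```python
from collections import defaultdict


def register_units(samples, segments, database=None):
    """Cut one waveform slice per segment and file it under its phoneme label.

    samples  : sequence of waveform samples (the speech data)
    segments : iterable of (phone, start, end) with sample-number bounds;
               `end` is treated as exclusive (an assumption of this sketch)
    database : dict-like mapping phoneme label -> list of slices
    """
    if database is None:
        database = defaultdict(list)
    for phone, start, end in segments:
        database[phone].append(samples[start:end])
    return database
```

Several slices can share one key, which is why each label maps to a list; choosing among them is what the accent information is later used for.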

During speech synthesis, simply retrieving from the speech database the phoneme segment data that matches each phoneme label of the morpheme label data does not yield high-quality synthesized speech. Ideally, the best quality would be obtained if phoneme segment data generated from a human actually reading the input text aloud were registered in the speech database and retrieved. Because the storage capacity of the speech database is limited, however, phoneme segment data cannot be registered for every possible text. Accent information is therefore attached to the phoneme segment data when it is registered in the speech database. During synthesis, accent information is likewise generated in the morphological analysis of the text to be synthesized and attached to each item of morpheme label data. Then, when the phoneme segment data for each phoneme label is retrieved from the speech database, the phoneme segment data whose accent information is closest to that attached to the morpheme label data is extracted and concatenated. This makes it possible to select from the speech database phoneme segment data pronounced with an accent close to that of the text being synthesized, which improves the quality of the synthesized speech.

Specifically, in the embodiment shown in FIG. 1, the morphological analysis unit 120 of the language analysis unit 102 first receives utterance text 122, which is the text data corresponding to the learning speech data 113, and performs morphological analysis on it with reference to the morpheme dictionary 121. As a result, the morphological analysis unit 120 outputs morpheme data 123 and morpheme label data 124. The morphological analysis unit 120 detects an accent position for each accent block determined from the morphemes indicated by the morpheme data 123. Here, "accent" means the relative pitch of the fundamental frequency, or the relative strength of the signal, of the phonemes within a word or word combination (a span uttered as a single phrase), and that word or word combination is called an "accent block". The "accent position" is the position of the mora with the relatively highest fundamental frequency, or with the relatively strongest signal, within the accent block. A "mora" is a phonological unit of sound having a certain temporal length, and is a convenient unit for expressing the accent position within an accent block.

Next, the morphological analysis unit 120 adds to each item of morpheme label data 124 accent information indicating the positional relationship between the phoneme indicated by that item and the accent position of the accent block to which the phoneme belongs. The accent information is expressed, for example, by the number of moras from the accent position of the accent block to the mora containing the phoneme (negative before the accent position, positive after it), together with a mora number indicating which mora of the accent block, counted from the first, contains the phoneme.

FIG. 4 shows an example data structure of the morpheme data 123. As shown in the figure, for the total of M morphemes obtained by the morphological analysis of the morphological analysis unit 120, each item of morpheme data morph[0], morph[1], ..., morph[M-1] holds the variables morph_id, original, read, pronounce, group, accent, prev, and next. morph_id holds the morpheme ID. original holds the morpheme as written. read holds the reading of the written form. pronounce holds the pronunciation. group holds the part-of-speech information. accent holds the accent information. prev holds a pointer to the preceding morpheme data 123 (NULL at the head), and next holds a pointer to the following morpheme data 123 (NULL at the tail). As with the segment data 114, linking the items of morpheme data 123 through prev and next allows a new item to be inserted at any position in the sequence even if it is physically appended at, for example, the end of the memory area.

FIG. 5 shows a specific example of the morpheme data 123. For example, given learning utterance text 122 reading 「あらゆる現実をすべて自分のほうへねじ曲げたのだ。」, morphological analysis decomposes it into the 13 morphemes 「あらゆる」「現実」「を」「すべて」「自分」「の」「ほう」「へ」「ねじ曲げ」「た」「の」「だ」「。」, and the morpheme data 123 shown in FIG. 5 is generated, in which the written form, reading, and pronunciation of each morpheme are registered in original, read, and pronounce. For example, in the first morpheme data 123 (morph_id=0), original holds 「あらゆる」 as the written morpheme, read holds 「アラユル」 as the reading, pronounce holds 「アラユル」 as the pronunciation, and group holds 「連体詞」 (adnominal) as the part of speech. accent holds "3" as the accent information, which indicates that the accent position is at the third mora of this morpheme. Moras are discussed further below. prev holds "NULL" in place of a preceding morph_id, indicating that nothing precedes this item, and next holds "1" as the morph_id of the following morpheme data 123.

FIG. 6 shows an example data structure of the morpheme label data 124. As shown in the figure, each of the total of N items of morpheme label data morph_label[0], morph_label[1], ..., morph_label[N-1], determined from the morpheme data 123 obtained by the morphological analysis of the morphological analysis unit 120, holds the variables mlabel_id, phone, accent, mola, morph_id, prev, and next. mlabel_id holds the morpheme label ID. phone holds the phoneme label. accent holds the number of moras to the accent position. mola holds the mora number. morph_id holds the morpheme ID (see FIG. 4) of the morpheme data 123 to which the phoneme indicated by the phoneme label in phone belongs. prev holds a pointer to the preceding morpheme label data 124 (NULL at the head), and next holds a pointer to the following morpheme label data 124 (NULL at the tail). The meanings of prev and next are the same as for the segment data 114 (FIG. 2) and the morpheme data 123 (FIG. 4).

FIG. 7 shows a specific example of the morpheme label data 124. From the morpheme data 123 with morph_id=0, 1, 2, 3 shown in FIG. 5, which correspond to the four consecutive morphemes 「あらゆる」「現実」「を」「すべて」, morphological analysis extracts the 21 phonemes "a" "r" "a" "y" "u" "r" "u" "g" "e" "N" "j" "i" "ts" "u" "o" "s" "u" "b" "e" "t" "e", and 21 items of morpheme label data 124 are generated with the corresponding phoneme labels registered in phone. For processing convenience, an item holding the phoneme label "#", which indicates silence, is registered as the first morpheme label data 124 (mlabel_id=0).

For the morphemes 「あらゆる」「現実」「を」「すべて」 from which these items of morpheme label data 124 were generated, morphological analysis determines, for example, the accent blocks 「あらゆる」「現実」「を」「すべて」, which have the same boundaries as the morphemes. Taking the accent block 「あらゆる」 as an example, morphological analysis recognizes four moras in it: "a", "ra", "yu", and "ru". It then detects, for example, the third mora "yu" (「ゆ」) as the accent position within this accent block. In this example, the accent position is therefore "3".

Furthermore, in the items of morpheme label data 124 with mlabel_id=1, 2, 3, 4, 5, 6, 7, which hold the phoneme labels for "a", "r", "a", "y", "u", "r", "u", morphological analysis stores in each mola variable the mora number of the mora that contains the phoneme within the accent block 「あらゆる」: "1", "2", "2", "3", "3", "4", "4". That is, in the morpheme label data 124 with mlabel_id=1 and phone="a", the phoneme "a" belongs to the first of the four moras "a", "ra", "yu", "ru", so mola holds the mora number "1". In the item with mlabel_id=2 and phone="r", the phoneme "r" belongs to the second mora "ra", so mola holds "2". Likewise, in the item with mlabel_id=3 and phone="a", the phoneme "a" belongs to the second mora "ra", so mola holds "2". The same applies to the remaining phonemes.

In addition, in the same items of morpheme label data 124 (mlabel_id=1, 2, 3, 4, 5, 6, 7), morphological analysis registers in each accent variable the difference between the mora number held in mola ("1", "2", "2", "3", "3", "4", "4") and the accent position "3" detected in the accent block: 1-3=-2, 2-3=-1, 2-3=-1, 3-3=0, 3-3=0, 4-3=1, and 4-3=1.
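The mola and accent values of the worked example follow mechanically from the mora segmentation and the accent position. The sketch below reproduces that arithmetic; the function name `label_accent_block` and the representation of moras as lists of phoneme labels are assumptions made for illustration, and the mora segmentation itself is taken as given rather than computed.

```python
def label_accent_block(moras, accent_pos):
    """Assign a mora number and an accent offset to every phoneme of one accent block.

    moras      : list of moras, each a list of phoneme labels, in spoken order
    accent_pos : 1-based number of the mora that carries the accent
    Returns one dict per phoneme with keys "phone", "mola", and "accent".
    """
    labels = []
    for mora_no, mora in enumerate(moras, start=1):
        for phone in mora:
            labels.append({
                "phone": phone,
                "mola": mora_no,                # mora number within the block
                "accent": mora_no - accent_pos,  # negative before the accent, positive after
            })
    return labels


# The accent block 「あらゆる」 with moras "a", "ra", "yu", "ru" and accent position 3.
arayuru = label_accent_block([["a"], ["r", "a"], ["y", "u"], ["r", "u"]], accent_pos=3)
```

For this input the mola values are 1, 2, 2, 3, 3, 4, 4 and the accent values are -2, -1, -1, 0, 0, 1, 1, matching the figures quoted in the text.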

Once the morphological analysis unit 120 in the language analysis unit 102 of FIG. 1 has produced the morpheme label data 124, in which the phoneme sequence including accent information is registered, the data adding unit 103 establishes the correspondence between the phoneme sequence of the segment data 114 generated by the speech recognition unit 110 in the speech analysis unit 101 and the phoneme sequence of the morpheme label data 124. For example, for the segment data 114 illustrated in FIG. 3 and the morpheme label data 124 illustrated in FIG. 7, the data adding unit 103 checks, item by item from the head of each, whether the phoneme label in phone of the segment data 114 matches the phoneme label in phone of the morpheme label data 124. If the phoneme labels match over the whole of the segment data 114 and the morpheme label data 124, the data adding unit 103 adds to the contents of the segment data 114 (data structure of FIG. 2) the accent information registered in the morpheme label data 124 (data structure of FIG. 6), namely the number of moras to the accent position and the mora number within the accent block, and outputs the result as learning label data 130.
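The all-match case described above can be sketched as a pairwise comparison of the two phoneme sequences. This sketch covers only that simple case: the function name `attach_accent` and the dictionary representation are hypothetical, and returning `None` on any mismatch is a placeholder of this sketch, not the handling the patent goes on to describe for sequences that differ.

```python
def attach_accent(segments, morph_labels):
    """Pair segments with morpheme labels in order and copy the accent information.

    segments     : list of dicts with at least "phone", "start", "end"
    morph_labels : list of dicts with at least "phone", "accent", "mola"
    Returns the merged records when every phoneme label matches, else None.
    """
    if len(segments) != len(morph_labels):
        return None
    merged = []
    for seg, lab in zip(segments, morph_labels):
        if seg["phone"] != lab["phone"]:
            return None  # sequences diverge; not handled in this sketch
        merged.append({**seg, "accent": lab["accent"], "mola": lab["mola"]})
    return merged
```

A single silence segment produced by a breath pause is enough to make the lengths differ, which is the situation the blank-timing discussion earlier in the document is concerned with.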

図８は、学習用ラベルデータ１３０のデータ構成例を示す図である。学習用ラベルデータ１３０の個数は、基本的には図２に例示されるセグメントデータ１１４のL個と同じである。L個のセグメントデータsegment[i](0≦i≦L)に対応するL個の学習用ラベルデータtrain_label[i](0≦i≦L)はそれぞれ、tlabel_id、phone、start、end、accent、mola、pitch、vowel、prev、nextの各変数データを保持する。tlabel_idは、学習用ラベルＩＤである。phone、start、endはそれぞれ、セグメントデータsegment[i]のphone、start、end（図２参照）からコピーされる。accent、molaは、セグメントデータsegment[i]に対応付けられた形態素ラベルデータmorph_label[j] (0≦j≦N)のaccent,mola（図６参照）からコピーされる。vowelは、phoneが示す音素ラベルが母音であるか否かを示すフラグであり、母音であれば「1」、子音であれば「0」がセットされる。prevは、１つ手前の学習用ラベルデータ１３０へのポインタ、nextは、１つ後ろの学習用ラベルデータ１３０へのポインタを保持する。pitchについては、後述する。 FIG. 8 is a diagram illustrating a data configuration example of the label data for learning 130. The number of the learning label data 130 is basically the same as the L number of the segment data 114 illustrated in FIG. L learning label data train_label [i] (0 ≦ i ≦ L) corresponding to L segment data segment [i] (0 ≦ i ≦ L) are tlabel_id, phone, start, end, accent, Holds variable data of mola, pitch, vowel, prev, next. tlabel_id is a label ID for learning. The phone, start, and end are respectively copied from phone, start, and end (see FIG. 2) of the segment data segment [i]. The accent and mola are copied from the accent and mola (see FIG. 6) of the morpheme label data morph_label [j] (0 ≦ j ≦ N) associated with the segment data segment [i]. vowel is a flag indicating whether or not the phoneme label indicated by the phone is a vowel; "1" is set for a vowel and "0" is set for a consonant. prev holds a pointer to the immediately preceding learning label data 130, and next holds a pointer to the immediately succeeding learning label data 130. The pitch will be described later.
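The data configuration of FIG. 8 may be pictured, for example, as the following sketch. The class name TrainLabel, the helper make_train_labels, the dictionary-based inputs, and the vowel set are assumptions introduced for illustration; the sketch further assumes that the segment data and the morpheme label data have already been associated one-to-one.

```python
from dataclasses import dataclass, field
from typing import Optional

VOWELS = {"a", "i", "u", "e", "o"}  # assumption: vowel labels of the phoneme set


@dataclass
class TrainLabel:
    tlabel_id: int
    phone: str
    start: int
    end: int
    accent: int
    mola: int
    vowel: int                      # 1 for a vowel, 0 otherwise
    pitch: Optional[float] = None   # filled in later by the accent correction
    prev: Optional["TrainLabel"] = field(default=None, repr=False, compare=False)
    next: Optional["TrainLabel"] = field(default=None, repr=False, compare=False)


def make_train_labels(segments, morph_labels):
    """Copy phone/start/end from each segment and accent/mola from the
    morpheme label associated with it, and link the records via prev/next."""
    labels = []
    for i, (seg, ml) in enumerate(zip(segments, morph_labels)):
        label = TrainLabel(i, seg["phone"], seg["start"], seg["end"],
                           ml["accent"], ml["mola"],
                           1 if seg["phone"] in VOWELS else 0)
        if labels:
            label.prev = labels[-1]
            labels[-1].next = label
        labels.append(label)
    return labels
```

In this sketch the prev and next fields are object references rather than addresses, which plays the same role as the pointers described for FIG. 8.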

このようにして得られる学習用ラベルデータ１３０を使って、図１の音声データ１１３からその学習用ラベルデータ１３０のstart及びend（図８参照）に対応する区間のデータが切り出されて音素片データとされ、この音素片データが、図８の学習用ラベルデータ１３０のphoneに登録されている音素ラベルと、accent及びmolaに登録されているアクセント情報とともに、音声データベースに登録される。音声合成時には、テキストデータに対する形態素解析時にもアクセントブロックごとにアクセント位置が検出され、合成目標の音素列中の音素ごとに、当該音素が属するアクセントブロックのアクセント位置との位置関係を示す値が取得される。そして、当該音素ラベルに対応する音素片データを音声データベースから検索するときに、各音素に対して取得された上記位置関係を示す値に最も近い値を有するアクセント情報が付与されている音素片データが抽出される。これにより、テキスト文章データと同じアクセント状態で発音された音素片データを選択することが可能となり、合成音声データの品質を高めることが可能となる。 Using the learning label data 130 thus obtained, the data of the section corresponding to start and end (see FIG. 8) of each learning label data 130 is cut out from the voice data 113 of FIG. 1 as phoneme segment data, and this phoneme segment data is registered in the speech database together with the phoneme label registered in phone of the learning label data 130 of FIG. 8 and the accent information registered in accent and mola. At the time of speech synthesis, an accent position is likewise detected for each accent block during morphological analysis of the text data, and, for each phoneme in the target phoneme string, a value indicating its positional relationship to the accent position of the accent block to which it belongs is obtained. Then, when phoneme segment data corresponding to a phoneme label is retrieved from the speech database, the phoneme segment data whose accent information is closest to the value obtained for that phoneme is extracted. This makes it possible to select phoneme segment data pronounced in the same accent state as the text data, and thus to improve the quality of the synthesized speech data.
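The retrieval at synthesis time can be sketched as follows. The function name select_unit, the list-of-dictionaries database, and the use of the accent value alone as the distance measure are assumptions for illustration; the embodiment does not specify how the closest value is computed.

```python
def select_unit(database, phone, target_accent):
    """Among the registered units with the requested phoneme label, return
    the one whose accent value is closest to the target value."""
    candidates = [unit for unit in database if unit["phone"] == phone]
    if not candidates:
        return None  # no unit registered for this phoneme label
    return min(candidates, key=lambda unit: abs(unit["accent"] - target_accent))


# Hypothetical database entries: phoneme label plus accent offset.
database = [
    {"id": 1, "phone": "a", "accent": -2},
    {"id": 2, "phone": "a", "accent": 0},
    {"id": 3, "phone": "r", "accent": -1},
]
print(select_unit(database, "a", 1)["id"])  # 2
```

When two candidates are equally close, this sketch simply returns the one registered first.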

ここで、「発明が解決しようとする課題」の項で説明したように、音声データ１１３の実際の発話においては息継ぎが発生し、そのタイミングは音声データ１１３の値が無音値又は無音値に近い値をとる空白タイミングとなる。そして、この音声データ１１３がセグメント分割された場合には、上記空白タイミングは無音を示す音素ラベル「#」を有するセグメントデータ１１４に対応付けられる。例えば、図３に例示される先頭のseg_id=0又は17番目のseg_id=16の各セグメントデータ１１４のphone変数に保持されている音素ラベル「#」は、そのセグメントデータ１１４に対応するセグメントが空白セグメントであることを示している。一方、この空白タイミングは一般的には言語情報の区切りに対応し、その場合には、形態素解析処理が実行された場合に上記空白タイミングは句読点の形態素として検出される。そして、そのような句読点の形態素から形態素ラベルデータ１２４が抽出された場合には、その形態素ラベルデータ１２４のphone変数（図６参照）には空白タイミングを示す音素ラベル「#」が登録される。このような状態で、データ付与部１０３が、セグメントデータ１１４の音素列と形態素ラベルデータ１２４の音素列との対応付けを行った場合には、音素ラベル「#」同士がうまくマッチングして、正しい対応関係が生成される。 Here, as described in the section “Problems to be Solved by the Invention”, breaths occur in the actual utterance of the voice data 113, and each breath is a blank timing at which the value of the voice data 113 is at or near the silence value. When the voice data 113 is divided into segments, such a blank timing is associated with segment data 114 having the phoneme label “#”, which indicates silence. For example, the phoneme label “#” held in the phone variable of the first segment data 114 (seg_id=0) and of the 17th segment data 114 (seg_id=16) illustrated in FIG. 3 indicates that the corresponding segment is a blank segment. A blank timing generally corresponds to a break in the linguistic information, and in that case the morphological analysis process detects it as a punctuation morpheme. When morpheme label data 124 is extracted from such a punctuation morpheme, the phoneme label “#” indicating a blank timing is registered in its phone variable (see FIG. 6). In this state, when the data adding unit 103 associates the phoneme string of the segment data 114 with the phoneme string of the morpheme label data 124, the “#” phoneme labels match each other and a correct correspondence is generated.

一方、音声データの実際の発話では、息継ぎにより無音タイミングが発生したタイミングで言語情報上句読点が検出されない場合もある。この場合には、セグメントデータ１１４には空白タイミングを示す音素ラベル「#」が登録されるが、その空白タイミングに対応する句読点の形態素は出力されず、phone変数に空白タイミングを示す音素ラベル「#」が登録された形態素ラベルデータ１２４も出力されないことになる。例えば、図３のセグメントデータ１１４の例において、音素列「arayurugeNjitsuo」（表記：「あらゆる現実を」）と音素列「subete」（表記：「すべて」）の間で、音声データ１１３の発音において息継ぎが発生したことにより、seg_id=15、音素ラベルphone=「o」のセグメントデータ１１４と、seg_id=17、音素ラベルphone=「s」のセグメントデータ１１４の間に、seg_id=16、音素ラベルphone=「#」の空白セグメントのセグメントデータ１１４が生成されている。一方、図５の形態素データ１２３の例においては、形態素「を」と形態素「すべて」の間には読点は検出されておらず、従って、そこから抽出された図７の形態素ラベルデータ１２４の例においても、mlabel_id=15、音素ラベルphone=「o」の形態素ラベルデータ１２４と、mlabel_id=16、音素ラベルphone=「s」の形態素ラベルデータ１２４の間には、音素ラベルphone=「#」の空白セグメントの形態素ラベルデータ１２４は生成されていない。このような状態で、データ付与部１０３がもし、上記空白セグメントの存在を考慮せずに図３に例示されるセグメントデータ１１４の音素列と図７に例示される形態素ラベルデータ１２４の音素列との対応付けを行った場合には、seg_id=15、音素ラベルphone=「o」のセグメントデータ１１４と、mlabel_id=15、音素ラベルphone=「o」の形態素ラベルデータ１２４とのマッチングが行われた後に、seg_id=16、音素ラベルphone=「#」の空白セグメントのセグメントデータ１１４と、mlabel_id=16、音素ラベルphone=「s」の形態素ラベルデータ１２４との比較が行われることになり、両者の音素ラベルが一致せずに対応関係が成立しなくなって、この音声データ１１３に基づく音素片データを音声データベースの登録用に採用できないという結果になってしまう。 On the other hand, in an actual utterance of voice data, there are cases where no punctuation is detected in the linguistic information at a timing where silence occurs because of a breath. In this case, the phoneme label “#” indicating a blank timing is registered in the segment data 114, but no punctuation morpheme corresponding to that blank timing is output, and therefore no morpheme label data 124 with the phoneme label “#” registered in its phone variable is output either. For example, in the segment data 114 of FIG. 3, a breath occurred in the pronunciation of the voice data 113 between the phoneme string “arayurugeNjitsuo” (written “あらゆる現実を”) and the phoneme string “subete” (written “すべて”), so that blank segment data 114 with seg_id=16, phone=“#” has been generated between the segment data 114 with seg_id=15, phone=“o” and the segment data 114 with seg_id=17, phone=“s”. In the example of the morpheme data 123 of FIG. 5, on the other hand, no comma is detected between the morpheme “を” and the morpheme “すべて”. Consequently, in the example of the morpheme label data 124 of FIG. 7 extracted from it, no morpheme label data 124 with phone=“#” is generated between the morpheme label data 124 with mlabel_id=15, phone=“o” and the morpheme label data 124 with mlabel_id=16, phone=“s”. If, in this state, the data adding unit 103 associated the phoneme string of the segment data 114 illustrated in FIG. 3 with the phoneme string of the morpheme label data 124 illustrated in FIG. 7 without taking the blank segment into account, then after the segment data 114 with seg_id=15, phone=“o” had been matched with the morpheme label data 124 with mlabel_id=15, phone=“o”, the blank segment data 114 with seg_id=16, phone=“#” would be compared with the morpheme label data 124 with mlabel_id=16, phone=“s”. The two phoneme labels would not match, the correspondence would fail, and the phoneme segment data based on this voice data 113 could not be adopted for registration in the speech database.
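The failure described above can be reproduced with a simple position-by-position comparison, as in the following sketch. The function name first_mismatch is hypothetical, and the division of “geNjitsuo” into the labels g, e, N, j, i, ts, u, o is an assumption chosen so that the indices agree with the seg_id and mlabel_id values quoted in the text.

```python
def first_mismatch(seg_phones, morph_phones):
    """Compare the two label strings position by position and return the
    index of the first differing position, or None if none differs."""
    for index, (seg, morph) in enumerate(zip(seg_phones, morph_phones)):
        if seg != morph:
            return index
    return None


# Segment side: "#" at seg_id=0 (head) and at seg_id=16 (the breath).
SEG = ["#", "a", "r", "a", "y", "u", "r", "u", "g", "e", "N",
       "j", "i", "ts", "u", "o", "#", "s", "u", "b", "e", "t", "e"]
# Morpheme label side: the same string without a "#" at the breath.
MORPH = [p for i, p in enumerate(SEG) if i != 16]

print(first_mismatch(SEG, MORPH))  # 16
```

The comparison succeeds up to index 15 (“o”) and fails at index 16, where “#” on the segment side meets “s” on the morpheme label side.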

そこで、本実施形態では、データ付与部１０３は、空白セグメントと形態素の区切りとの位置関係を判別しながら、セグメントデータ１１４の音素列と形態素ラベルデータ１２４の音素列との対応関係を生成し、対応関係にずれが生じないように制御を行う。この処理の詳細については、図１２及び図１３のフローチャートを用いて後述する。 Therefore, in the present embodiment, the data adding unit 103 generates the correspondence between the phoneme string of the segment data 114 and the phoneme string of the morpheme label data 124 while determining the positional relationship between blank segments and morpheme breaks, and performs control so that the correspondence does not shift. Details of this processing will be described later with reference to the flowcharts of FIGS. 12 and 13.

ここで、音素片データのもととなる音声データ１１３は必ずしも対応する発話テキスト１２２から得られる形態素データ１２３に基づくアクセント位置で発話されているとは限らない。そこで、本実施形態では、データ付与部１０３が、上述のように正しく生成したセグメントデータ１１４の音素列と形態素ラベルデータ１２４の音素列との対応関係に基づいて、形態素ラベルデータ１２４から得られるアクセントブロックごとの新たなアクセント位置を、音声データ１１３の発話に基づいて抽出される基本周波数が最も高くなるセグメントに対応する位置として算出し直す。そして、データ付与部１０３は、アクセントブロックに属するセグメントに対応する学習用ラベルデータ１３０のアクセント情報を、上述の新たに算出されたアクセント位置に基づいて修正する。 Here, the voice data 113 on which the phoneme segment data is based is not necessarily uttered with the accent positions derived from the morpheme data 123 obtained from the corresponding utterance text 122. Therefore, in the present embodiment, based on the correspondence between the phoneme string of the segment data 114 and the phoneme string of the morpheme label data 124 correctly generated as described above, the data adding unit 103 recalculates the accent position of each accent block obtained from the morpheme label data 124 as the position corresponding to the segment at which the fundamental frequency extracted from the utterance of the voice data 113 is highest. The data adding unit 103 then corrects the accent information of the learning label data 130 corresponding to the segments belonging to the accent block based on the newly calculated accent position.

具体的にはまず、図１の音声解析部１０１内の基本周波数解析部１１２が、所定のフレーム周期（例えば256ミリ秒）ごとに音声データ１１３の基本（ピッチ）周波数を抽出し、基本周波数データ１１５を出力する。 More specifically, first, the fundamental frequency analysis unit 112 in the speech analysis unit 101 of FIG. 1 extracts the fundamental (pitch) frequency of the voice data 113 at every predetermined frame period (for example, 256 milliseconds) and outputs the fundamental frequency data 115.

図９は、基本周波数データ１１５のデータ構成例を示す図である。音声データ１１３に対する所定のフレーム周期ごとの基本周波数の解析の結果得られるK個の基本周波数データpitch[i](0≦i≦K-1)はそれぞれ、time,pitch, prev、nextの各変数データを保持する。timeは、現在のフレーム周期に対応する時刻（現在のフレーム周期の先頭、中央、又は末尾のサンプル番号）を保持する。pitchは、解析の結果得られた基本周波数[Hz]（ヘルツ）を保持する。prevは、１つ手前の基本周波数データ１１５へのポインタ、nextは、１つ後ろの基本周波数データ１１５へのポインタを保持する。 FIG. 9 is a diagram illustrating a data configuration example of the fundamental frequency data 115. The K fundamental frequency data pitch [i] (0 ≦ i ≦ K−1) obtained as a result of analyzing the fundamental frequency of the audio data 113 for each predetermined frame period are time, pitch, prev, and next variables, respectively. Retain data. time holds the time corresponding to the current frame period (the sample number at the beginning, center, or end of the current frame period). The pitch holds the fundamental frequency [Hz] (Hertz) obtained as a result of the analysis. prev holds a pointer to the immediately preceding fundamental frequency data 115, and next holds a pointer to the immediately succeeding fundamental frequency data 115.

上述のようにして基本周波数解析部１１２により生成された基本周波数データ１１５を用いて、データ付与部１０３が、アクセントブロックごとに新たなアクセント位置を生成し、アクセントブロックに属するセグメントに対応する学習用ラベルデータ１３０のアクセント情報を、上述の新たに算出されたアクセント位置に基づいて修正する。この処理の詳細については、図１５のフローチャートを用いて後述する。 Using the fundamental frequency data 115 generated by the fundamental frequency analysis unit 112 as described above, the data adding unit 103 generates a new accent position for each accent block and corrects the accent information of the learning label data 130 corresponding to the segments belonging to that accent block based on the newly calculated accent position. Details of this processing will be described later with reference to the flowchart of FIG. 15.
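One way to picture this correction is the following sketch, given here only as an illustration and not as the processing of FIG. 15. The function name reestimate_accent, the dictionary-based labels, the (time, f0) frame tuples, and the use of the per-segment maximum as the pitch measure are all assumptions.

```python
def reestimate_accent(block, frames):
    """block: learning labels of one accent block, each a dict with
    start, end, mola, and accent.  frames: list of (time, f0_hz) pairs.
    The mora of the segment with the highest fundamental frequency becomes
    the new accent position, and accent is rewritten as mola minus it."""
    best_mola, best_f0 = None, float("-inf")
    for label in block:
        f0s = [f0 for t, f0 in frames if label["start"] <= t < label["end"]]
        if f0s and max(f0s) > best_f0:
            best_f0, best_mola = max(f0s), label["mola"]
    if best_mola is None:
        return None  # no pitch frame falls inside the block: leave as is
    for label in block:
        label["accent"] = label["mola"] - best_mola
    return best_mola


block = [
    {"start": 0, "end": 100, "mola": 1, "accent": -2},
    {"start": 100, "end": 200, "mola": 2, "accent": -1},
    {"start": 200, "end": 300, "mola": 3, "accent": 0},
]
frames = [(50, 180.0), (150, 240.0), (250, 200.0)]
print(reestimate_accent(block, frames))  # 2
```

In this example the text-derived accent position was mora 3, but the highest fundamental frequency falls in the segment of mora 2, so the accent values become -1, 0, and 1.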

図１０は、図１の音声情報作成装置１００の音声解析部１０１、言語解析部１０２、及びデータ付与部１０３の機能をソフトウェア処理として実現できるコンピュータのハードウェア構成例を示す図である。図１０に示されるコンピュータは、ＣＰＵ１００１、ＲＯＭ（リードオンリーメモリ：読出し専用メモリ）１００２、ＲＡＭ（ランダムアクセスメモリ）１００３、入力装置１００４、出力装置１００５、外部記憶装置１００６、可搬記録媒体１０１０が挿入される可搬記録媒体駆動装置１００７、及び通信インタフェース１００８を有し、これらがバス１００９によって相互に接続された構成を有する。同図に示される構成は上記システムを実現できるコンピュータの一例であり、そのようなコンピュータはこの構成に限定されるものではない。 FIG. 10 is a diagram illustrating an example of a hardware configuration of a computer capable of realizing the functions of the voice analysis unit 101, the language analysis unit 102, and the data addition unit 103 of the voice information creation device 100 in FIG. 1 as software processing. In the computer shown in FIG. 10, a CPU 1001, a ROM (read only memory) 1002, a RAM (random access memory) 1003, an input device 1004, an output device 1005, an external storage device 1006, and a portable recording medium 1010 are inserted. And a communication interface 1008, which are interconnected by a bus 1009. The configuration shown in the figure is an example of a computer that can realize the above system, and such a computer is not limited to this configuration.

ＲＯＭ１００２は、図１の音声解析部１０１の機能を実現する音声解析処理プログラム、図１の言語解析部１０２の機能を実現する言語解析処理プログラム、及び図１のデータ付与部１０３の機能を実現するデータ付与処理プログラムを含む各プログラムを記憶するメモリである。ＲＡＭ１００３は、各プログラムの実行時に、ＲＯＭ１００２に記憶されているプログラム又はデータを一時的に格納するメモリである。 The ROM 1002 is a memory that stores programs including a voice analysis processing program for realizing the function of the voice analysis unit 101 in FIG. 1, a language analysis processing program for realizing the function of the language analysis unit 102 in FIG. 1, and a data adding processing program for realizing the function of the data adding unit 103 in FIG. 1. The RAM 1003 is a memory that temporarily stores a program or data stored in the ROM 1002 when each program is executed.

外部記憶装置１００６は、例えばＳＳＤ（ソリッドステートドライブ）記憶装置又はハードディスク記憶装置であり、図１の音声データ１１３、発話テキスト１２２、セグメントデータ１１４、基本周波数データ１１５、形態素データ１２３、形態素ラベルデータ１２４、学習用ラベルデータ１３０、及び図１には特には図示しない音声データベース等の保存に用いられる。 The external storage device 1006 is, for example, an SSD (solid state drive) storage device or a hard disk storage device, and includes the voice data 113, the uttered text 122, the segment data 114, the fundamental frequency data 115, the morpheme data 123, and the morpheme label data 124 in FIG. , Learning label data 130, and a voice database not shown in FIG.

ＣＰＵ１００１は、各プログラムを、ＲＯＭ１００２からＲＡＭ１００３に読み出して実行することにより、当該コンピュータ全体の制御を行う。 The CPU 1001 controls the entire computer by reading each program from the ROM 1002 to the RAM 1003 and executing the program.

入力装置１００４は、ユーザによるキーボードやマウス等による入力操作を検出し、その検出結果をＣＰＵ１００１に通知する。 The input device 1004 detects an input operation by a user using a keyboard, a mouse, or the like, and notifies the CPU 1001 of the detection result.

出力装置１００５は、ＣＰＵ１００１の制御によって送られてくるデータを表示装置や印刷装置に出力する。 The output device 1005 outputs data transmitted under the control of the CPU 1001 to a display device or a printing device.

可搬記録媒体駆動装置１００７は、光ディスクやＳＤＲＡＭ、コンパクトフラッシュ等の可搬記録媒体１０１０を収容するもので、外部記憶装置１００６の補助の役割を有する。 The portable recording medium driving device 1007 accommodates a portable recording medium 1010 such as an optical disk, an SDRAM, and a compact flash, and has an auxiliary role of the external storage device 1006.

通信インターフェース１００８は、例えばＬＡＮ（ローカルエリアネットワーク）又はＷＡＮ（ワイドエリアネットワーク）の通信回線を接続するための装置である。 The communication interface 1008 is a device for connecting, for example, a LAN (local area network) or WAN (wide area network) communication line.

本実施形態によるシステムは、図１の音声解析部１０１の機能を実現する音声解析処理プログラム、図１の言語解析部１０２の機能を実現する言語解析処理プログラム、及び図１のデータ付与部１０３の機能を実現するデータ付与処理プログラムを、ＲＯＭ１００２からＲＡＭ１００３に読み出してＣＰＵ１００１が実行することで実現される。そのプログラムは、例えば外部記憶装置１００６や可搬記録媒体１０１０に記録して配布してもよく、或いは通信インタフェース１００８によりネットワークから取得できるようにしてもよい。 The system according to the present embodiment is realized by the CPU 1001 reading from the ROM 1002 into the RAM 1003, and executing, the voice analysis processing program for realizing the function of the voice analysis unit 101 in FIG. 1, the language analysis processing program for realizing the function of the language analysis unit 102 in FIG. 1, and the data adding processing program for realizing the function of the data adding unit 103 in FIG. 1. The programs may be recorded on and distributed via, for example, the external storage device 1006 or the portable recording medium 1010, or may be made obtainable from a network through the communication interface 1008.

ＣＰＵ１００１が音声解析処理プログラム及び言語解析処理プログラムをそれぞれ実行することにより実現する図１の音声解析部１０１及び言語解析部１０２の各機能は、図１から図９を用いて上述した通りである。この結果、図１０のＲＡＭ１００３に、図２のデータ構成例のセグメントデータ１１４（具体例は図３）、図４のデータ構成例の形態素データ１２３（具体例は図５）、図６のデータ構成例の形態素ラベルデータ１２４（具体例は図７）、及び図９のデータ構成例の基本周波数データ１１５が生成される。 The functions of the voice analysis unit 101 and the language analysis unit 102 in FIG. 1 realized by the CPU 1001 executing the voice analysis processing program and the language analysis processing program are as described above with reference to FIGS. 1 to 9. As a result, in the RAM 1003 of FIG. 10, the segment data 114 of the data configuration example of FIG. 2 (specific example is FIG. 3), the morphological data 123 of the data configuration example of FIG. 4 (specific example of FIG. 5), and the data configuration of FIG. The example morpheme label data 124 (a specific example is shown in FIG. 7) and the fundamental frequency data 115 of the data configuration example in FIG. 9 are generated.

図１１は、図１のデータ付与部１０３の機能を実現するデータ付与処理の例を示すフローチャートである。このデータ付与処理は、ＣＰＵ１００１が、ＲＡＭ１００３をワークメモリとして使用しながら、ＲＯＭ１００２に記憶されたデータ付与処理プログラムを読み出して実行する処理である。 FIG. 11 is a flowchart illustrating an example of a data providing process for realizing the function of the data providing unit 103 in FIG. This data adding process is a process in which the CPU 1001 reads and executes a data adding process program stored in the ROM 1002 while using the RAM 1003 as a work memory.

データ付与処理において、ＣＰＵ１００１はまず、対応付け処理を実行する（ステップＳ１１０１）。 In the data adding process, the CPU 1001 first executes an associating process (step S1101).

次に、ＣＰＵ１００１は、上記対応付け処理の結果、エラーが検出されたか否か（ＲＡＭ１００３上の後述するエラーフラグがオンであるか否か）を判定する（ステップＳ１１０２）。そして、エラーが検出されなかった場合（ステップＳ１１０２の判定がＮＯの場合）には、ＣＰＵ１００１は、アクセント修正処理を実行する（ステップＳ１１０３）。この結果、ＣＰＵ１００１は、入力された音声データ１１３及び発話テキスト１２２に対応する学習用ラベルデータ１３０（共に図１参照）を出力し、データ付与処理を終了する。エラーが検出された場合（ステップＳ１１０２の判定がＹＥＳの場合）には、ＣＰＵ１００１は、入力された音声データ１１３及び発話テキスト１２２は学習用には採用せず、学習用ラベルデータ１３０は出力せずに、データ付与処理を終了する。 Next, the CPU 1001 determines whether or not an error has been detected as a result of the association processing (whether or not an error flag described later on the RAM 1003 is on) (step S1102). If no error is detected (NO in step S1102), CPU 1001 executes accent correction processing (step S1103). As a result, the CPU 1001 outputs the learning label data 130 (see FIG. 1) corresponding to the input voice data 113 and the utterance text 122, and ends the data adding process. If an error is detected (if the determination in step S1102 is YES), the CPU 1001 does not use the input voice data 113 and utterance text 122 for learning, and does not output the learning label data 130. Then, the data adding process is ended.
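The control flow of FIG. 11 described above can be summarized by the following sketch. The function name data_adding_process and the two callables passed in are hypothetical stand-ins for the association process and the accent correction process, and the error flag is modeled as a returned boolean rather than a flag on the RAM.

```python
def data_adding_process(segments, morph_labels, associate, correct_accent):
    """Run the association first; adopt the utterance only if it succeeds."""
    labels, error = associate(segments, morph_labels)  # step S1101
    if error:                                          # step S1102: YES
        return None          # utterance not adopted, no learning labels output
    correct_accent(labels)                             # step S1103
    return labels            # learning label data to be output
```

Returning None in the error case mirrors the behavior in which the input voice data and utterance text are simply not used for learning.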

図１２及び図１３は、図１１のステップＳ１１０１の対応付け処理の詳細例を示すフローチャートである。この対応付け処理では、前述したように、空白セグメントと形態素の区切りとの位置関係が判別されながら、それぞれＲＡＭ１００３上に生成されている、セグメントデータ１１４の音素列（図２、図３参照）と形態素ラベルデータ１２４（図６、図７）の音素列との適切な対応関係が生成される。 FIGS. 12 and 13 are flowcharts showing a detailed example of the associating process in step S1101 of FIG. In this associating process, as described above, while the positional relationship between the blank segment and the morpheme break is determined, the phoneme string of the segment data 114 (see FIGS. 2 and 3) generated on the RAM 1003 is determined. An appropriate correspondence with the phoneme sequence of the morpheme label data 124 (FIGS. 6 and 7) is generated.

対応付け処理において、ＣＰＵ１００１はまず、セグメントデータ１１４の集合の先頭と形態素ラベルデータ１２４の集合の先頭から順次、セグメントデータ１１４の側の音素ラベルと形態素ラベルデータ１２４の側の音素ラベルとが一致するか否かを判定する処理（ステップＳ１２０５）と、セグメントデータ１１４の側の音素ラベルが空白セグメントに対応する「#」であるか否かを判定する処理（ステップＳ１２０４）と、ステップＳ１２０４の判定がＹＥＳの場合に形態素ラベルデータ１２４の側の音素ラベルが句読点に対応する「#」であるか否かを判定する処理（ステップＳ１２０８）を繰り返し実行する。 In the association process, the CPU 1001 repeatedly executes the following determinations, proceeding sequentially from the head of the set of segment data 114 and the head of the set of morpheme label data 124: a process of determining whether the phoneme label on the segment data 114 side matches the phoneme label on the morpheme label data 124 side (step S1205); a process of determining whether the phoneme label on the segment data 114 side is “#”, corresponding to a blank segment (step S1204); and, when the determination in step S1204 is YES, a process of determining whether the phoneme label on the morpheme label data 124 side is “#”, corresponding to punctuation (step S1208).

この繰返し処理を制御するために、ＣＰＵ１００１は、ＲＡＭ１００３上の変数seg及びmorphに、それぞれＲＡＭ１００３上のセグメントデータ１１４の先頭アドレスとＲＡＭ１００３上の形態素ラベルデータ１２４の先頭アドレスをセットする（ステップＳ１２０１）。図２のセグメントデータ１１４のデータ構成例では、変数segにsegment[0]のアドレスがセットされ、図６の形態素ラベルデータ１２４のデータ構成例では、変数morphにmorph_label[0]のアドレスがセットされる。そして、ＣＰＵ１００１は、変数seg及び変数morphともに末端のデータを超えていないと判定される間（ステップＳ１２０２の判定がＹＥＳの場合）、以下に示すステップＳ１２０３からＳ１２１５までの一連の処理を繰り返し実行する。 To control this repetition, the CPU 1001 sets the start address of the segment data 114 on the RAM 1003 and the start address of the morpheme label data 124 on the RAM 1003 in the variables seg and morph on the RAM 1003, respectively (step S1201). In the data configuration example of the segment data 114 in FIG. 2, the address of segment[0] is set in the variable seg, and in the data configuration example of the morpheme label data 124 in FIG. 6, the address of morph_label[0] is set in the variable morph. Then, while it is determined that neither the variable seg nor the variable morph has passed the last data (YES in step S1202), the CPU 1001 repeatedly executes the series of processing from step S1203 to step S1215 described below.

まず、ＣＰＵ１００１は、それぞれＲＡＭ１００３上に記憶される、セグメントデータ１１４の側のインクリメントフラグと、形態素ラベルデータ１２４の側のインクリメントフラグの双方をオフにする（ステップＳ１２０３）。 First, the CPU 1001 turns off both the increment flag on the side of the segment data 114 and the increment flag on the side of the morpheme label data 124 stored in the RAM 1003 (step S1203).

次に、ＣＰＵ１００１は、セグメントデータ１１４の側の音素ラベル（図２のphone変数の値）が、空白セグメントに対応する「#」であるか否かを判定する（ステップＳ１２０４）。 Next, the CPU 1001 determines whether or not the phoneme label (the value of the phone variable in FIG. 2) on the side of the segment data 114 is “#” corresponding to a blank segment (step S1204).

ステップＳ１２０４の判定がＮＯの場合、ＣＰＵ１００１は、セグメントデータ１１４の側の音素ラベルと形態素ラベルデータ１２４の側の音素ラベルの双方の音素ラベルが一致するか否かを判定する（ステップＳ１２０５）。 If the determination in step S1204 is NO, the CPU 1001 determines whether the phoneme labels of the phoneme label on the segment data 114 side and the phoneme label on the morpheme label data 124 side match (step S1205).

セグメントデータ１１４の側の音素ラベルは空白セグメントを示す「#」でなく（ステップＳ１２０４の判定がＮＯで）、かつセグメントデータ１１４の側の音素ラベルと形態素ラベルデータ１２４の側の音素ラベルの双方の音素ラベルが一致しないことによりステップＳ１２０５の判定もＮＯになる場合には、そもそも音声データ１１３と発話テキスト１２２とがうまく対応付いていないことになる。従って、このような場合には、エラーを発生させて入力された音声データ１１３及び発話テキスト１２２は学習用には採用せず、学習用ラベルデータ１３０は出力しないようにすべきである。そこで、ステップＳ１２０４の判定がＮＯで、かつステップＳ１２０５の判定もＮＯの場合には、ＣＰＵ１００１は、ＲＡＭ１００３上に記憶されるエラーフラグをオンにセットし（ステップＳ１２０７）、その後、図１２及び図１３のフローチャートで示される図１１のステップＳ１１０１の対応付け処理を終了する。この結果、図１１のステップＳ１１０２の判定がＹＥＳとなって、ステップＳ１１０３のアクセント修正処理は実行されないことになり、学習用ラベルデータ１３０は生成されずにデータ付与処理が終了する。 If the phoneme label on the segment data 114 side is not “#” indicating a blank segment (NO in step S1204) and the phoneme label on the segment data 114 side does not match the phoneme label on the morpheme label data 124 side, so that the determination in step S1205 is also NO, the voice data 113 and the utterance text 122 do not correspond to each other in the first place. In such a case, therefore, an error should be raised so that the input voice data 113 and utterance text 122 are not adopted for learning and the learning label data 130 is not output. Accordingly, if the determination in step S1204 is NO and the determination in step S1205 is also NO, the CPU 1001 sets the error flag stored in the RAM 1003 to ON (step S1207) and then ends the association process of step S1101 of FIG. 11, shown in the flowcharts of FIGS. 12 and 13. As a result, the determination in step S1102 of FIG. 11 becomes YES, the accent correction processing of step S1103 is not executed, and the data adding process ends without generating the learning label data 130.

ステップＳ１２０５の判定がＹＥＳの場合には、ＣＰＵ１００１は、ＲＡＭ１００３上のセグメントデータ１１４の側のインクリメントフラグと形態素ラベルデータ１２４の側のインクリメントフラグの双方のインクリメントフラグをオンにセットする（ステップＳ１２０６）。 If the determination in step S1205 is YES, the CPU 1001 turns on both the increment flag on the segment data 114 side and the increment flag on the morpheme label data 124 side in the RAM 1003 (step S1206).

その後、ＣＰＵ１００１は、セグメントデータ１１４の側のインクリメントフラグがオンであるか否かを判定するが（ステップＳ１２１２）、ステップＳ１２０６の処理によりステップＳ１２１２の判定はＹＥＳとなる。この結果、ＣＰＵ１００１は、変数segに、次のセグメントデータ１１４のアドレスをセットする（ステップＳ１２１３）。具体的には、ＣＰＵ１００１は、現在のセグメントデータ１１４のnext変数（図２参照）の値を、変数segにセットする。 Thereafter, the CPU 1001 determines whether or not the increment flag on the side of the segment data 114 is ON (step S1212), and the determination in step S1212 is YES through the processing in step S1206. As a result, the CPU 1001 sets the address of the next segment data 114 in the variable seg (step S1213). Specifically, the CPU 1001 sets the value of the next variable (see FIG. 2) of the current segment data 114 to the variable seg.

続いて、ＣＰＵ１００１は、形態素ラベルデータ１２４の側のインクリメントフラグがオンであるか否かを判定するが（ステップＳ１２１４）、ステップＳ１２０６の処理によりステップＳ１２１４の判定はＹＥＳとなる。この結果、ＣＰＵ１００１は、変数morphに、次の形態素ラベルデータ１２４のアドレスをセットする（ステップＳ１２１５）。具体的には、ＣＰＵ１００１は、現在の形態素ラベルデータ１２４のnext変数（図６参照）の値を、変数morphにセットする。 Subsequently, the CPU 1001 determines whether the increment flag on the morpheme label data 124 side is ON (step S1214); because of the processing in step S1206, the determination in step S1214 is YES. As a result, the CPU 1001 sets the address of the next morpheme label data 124 in the variable morph (step S1215). Specifically, the CPU 1001 sets the value of the next variable (see FIG. 6) of the current morpheme label data 124 in the variable morph.

その後、ＣＰＵ１００１は、ステップＳ１２０２の処理に移行して、次の繰返し処理に移る。 After that, the CPU 1001 proceeds to the process of step S1202, and proceeds to the next repetitive process.

上述のように、セグメントデータ１１４の側の音素ラベルと形態素ラベルデータ１２４の側の音素ラベルの双方の音素ラベルが一致すると、変数segと変数morphの双方がインクリメントされることにより、次のセグメントデータ１１４の音素と次の形態素ラベルデータ１２４の音素同士の対応付け処理に移行する。 As described above, when both the phoneme labels of the phoneme label on the side of the segment data 114 and the phoneme label on the side of the morpheme label data 124 match, both the variable seg and the variable morph are incremented, so that the next segment data The process proceeds to a process of associating the phoneme 114 with the phoneme of the next morpheme label data 124.

前述したステップＳ１２０４の判定がＹＥＳになると、ＣＰＵ１００１は、形態素ラベルデータ１２４の側の音素ラベル（図６のphone）が句読点に対応する「#」であるか否かを判定する（ステップＳ１２０８）。 If the determination in step S1204 is YES, the CPU 1001 determines whether the phoneme label (phone in FIG. 6) on the side of the morpheme label data 124 is “#” corresponding to punctuation (step S1208).

ステップＳ１２０８の判定がＹＥＳの場合には、ＣＰＵ１００１は、ＲＡＭ１００３上のセグメントデータ１１４の側のインクリメントフラグと形態素ラベルデータ１２４の側のインクリメントフラグの双方のインクリメントフラグをオンにセットする（ステップＳ１２０９）。 If the determination in step S1208 is YES, the CPU 1001 sets both the increment flag on the segment data 114 side and the increment flag on the morpheme label data 124 side in the RAM 1003 to ON (step S1209).

その後、ＣＰＵ１００１は、前述したステップＳ１２１２の判定処理に移行するが、ステップＳ１２０９の処理によりステップＳ１２１２の判定はＹＥＳとなる。この結果、ＣＰＵ１００１は、前述したステップＳ１２１３の処理に移行して、変数segに、次のセグメントデータ１１４のアドレスをセットする。 After that, the CPU 1001 proceeds to the above-described determination processing in step S1212, but the determination in step S1212 becomes YES through the processing in step S1209. As a result, the CPU 1001 proceeds to the process of step S1213 described above, and sets the address of the next segment data 114 in the variable seg.

続いて、ＣＰＵ１００１は、前述したステップＳ１２１４の判定処理に移行するが、ステップＳ１２０９の処理によりステップＳ１２１４の判定はＹＥＳとなる。この結果、ＣＰＵ１００１は、前述したステップＳ１２１５の処理に移行して、変数morphに、次の形態素ラベルデータ１２４のアドレスをセットする。 Subsequently, the CPU 1001 proceeds to the determination processing of step S1214 described above; because of the processing in step S1209, the determination in step S1214 is YES. As a result, the CPU 1001 proceeds to the processing of step S1215 described above and sets the address of the next morpheme label data 124 in the variable morph.

上述のように、セグメントデータ１１４の側の音素ラベルが空白セグメントに対応する「#」で、かつ形態素ラベルデータ１２４の側の音素ラベルが句読点に対応する「#」であって両者が対応付く場合には、変数segと変数morphの双方がインクリメントされることにより、次のセグメントデータ１１４の音素と次の形態素ラベルデータ１２４の音素同士の対応付け処理に移行する。 As described above, when the phoneme label on the side of the segment data 114 is “#” corresponding to the blank segment, and the phoneme label on the side of the morpheme label data 124 is “#” corresponding to the punctuation mark, and both correspond. Then, both the variable seg and the variable morph are incremented, and the process shifts to the process of associating the phoneme of the next segment data 114 with the phoneme of the next morpheme label data 124.

ステップＳ１２０８の判定がＮＯの場合には、ＣＰＵ１００１は、現在の形態素ラベルデータの前に「#」の形態素ラベルデータを挿入するために、ＲＡＭ１００３上に得られている形態素データ１２３において、現在の形態素ラベルデータ１２４の音素が含まれる形態素データ１２３の次に、読点を示す形態素データ１２３を挿入する（ステップＳ１２１０）。具体的には、ＣＰＵ１００１はまず、変数morphが示すＲＡＭ１００３上の現在の形態素ラベルデータ１２４の変数prevによって示されるmorph_id変数（図６参照）の値を参照することにより、ＲＡＭ１００３上の形態素データ１２３からそのmorph_id変数の値と同じ値をmorph_id変数（図４）に保持するデータを検索する。次に、ＣＰＵ１００１は、ＲＡＭ１００３上の形態素データ１２３の末尾に、新たな形態素データ１２３のエントリ（図４参照）を生成する。そして、ＣＰＵ１００１は、この末尾に新たに生成されたエントリにおいて、morph_id変数に新たな値を付与し、original変数、read変数、及びpronounce変数に読点を表す値「、」をそれぞれ格納し、group変数に「記号」を格納し、accent変数にはアクセント情報「0」を格納する。更に、ＣＰＵ１００１は、この末尾に新たに生成されたエントリにおいて、prev変数及びnext変数には、上記検索された形態素データ１２３のprev変数の値及びmorph_id変数の値をそれぞれ格納する。最後に、ＣＰＵ１００１は、上記検索された形態素データ１２３のprev変数が示す形態素データ１２３のnext変数に上記末尾に新たに生成されたエントリのmorph_id変数の値を格納し、その後に、上記検索された形態素データ１２３のprev変数の値も上記末尾に新たに生成されたエントリのmorph_id変数の値に変更する。 If the determination in step S1208 is NO, the CPU 1001, in order to insert morpheme label data with “#” before the current morpheme label data, inserts morpheme data 123 representing a comma into the morpheme data 123 held on the RAM 1003, next to the morpheme data 123 that contains the phoneme of the current morpheme label data 124 (step S1210). Specifically, the CPU 1001 first refers to the value of the morph_id variable (see FIG. 6) indicated by the variable prev of the current morpheme label data 124 on the RAM 1003 pointed to by the variable morph, and searches the morpheme data 123 on the RAM 1003 for the entry that holds the same value in its morph_id variable (FIG. 4). Next, the CPU 1001 generates a new morpheme data 123 entry (see FIG. 4) at the end of the morpheme data 123 on the RAM 1003. Then, in the newly generated entry at the end, the CPU 1001 assigns a new value to the morph_id variable, stores the value “、” representing a comma in each of the original, read, and pronounce variables, stores “記号” (symbol) in the group variable, and stores the accent information “0” in the accent variable. Further, in this newly generated entry, the CPU 1001 stores the value of the prev variable and the value of the morph_id variable of the retrieved morpheme data 123 in the prev variable and the next variable, respectively. Finally, the CPU 1001 stores the value of the morph_id variable of the newly generated entry in the next variable of the morpheme data 123 indicated by the prev variable of the retrieved morpheme data 123, and then also changes the value of the prev variable of the retrieved morpheme data 123 to the value of the morph_id variable of the newly generated entry.

For example, consider the specific example of the segment data 114 in FIG. 3, the specific example of the morpheme data 123 in FIG. 5, and the specific example of the morpheme label data 124 in FIG. 7, at the point where the values of the variables seg and morph are both "16". In step S1204, the phoneme label of the segment data 114 whose seg_id variable in FIG. 6 is "16" is determined to be "#", so the determination result is YES. Further, in step S1208, the phoneme label of the morpheme label data 124 whose mlabel_id in FIG. 7 is "16" is determined to be "s", so the determination result is NO. As a result, by executing step S1210, the CPU 1001 refers to the value "3" of the morph_id variable of the morpheme label data 124 whose mlabel_id in FIG. 7 is "16", and searches the morpheme data 123 in FIG. 5 for the morpheme "すべて" ("all") whose morph_id variable is "3". Next, the CPU 1001 generates a new entry of morpheme data 123 (see FIG. 4) at the end of the morpheme data 123 in FIG. 5. FIG. 14 shows a specific example of the morpheme data 123 after the entry has been added. In the last entry of FIG. 14, the CPU 1001 assigns the new value "13" to the morph_id variable, stores the value "、" representing a comma in each of the original variable, the read variable, and the pronounce variable, stores "記号" (symbol) in the group variable, and stores the accent information "0" in the accent variable. Further, the CPU 1001 sets the value "2" of the prev variable and the value "3" of the morph_id variable of the entry of the morpheme "すべて", retrieved from the morpheme data 123 in FIG. 5 with morph_id "3", in the prev variable and the next variable, respectively, of the last entry in FIG. 14. Finally, the CPU 1001 stores the value "13" of the morph_id variable of the last entry in FIG. 14 in the next variable of the entry of the morpheme "を", whose morph_id variable in FIG. 5 is "2" and which is indicated by the value "2" of the prev variable of the entry of the morpheme "すべて"; it also changes the value of the prev variable of the entry of the morpheme "すべて" to the value "13" of the morph_id variable of the last entry in FIG. 14. This yields the morpheme data 123 shown in FIG. 14. Consequently, the morpheme "を" with morph_id "2" is followed, through the value "13" of its next variable, by the entry of the comma "、" with morph_id "13", and that entry is followed in turn, through the value "3" of its next variable, by the entry of the morpheme "すべて" with morph_id "3".
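The comma insertion of step S1210 amounts to splicing a new node into a doubly linked list whose links are morph_id values. The following is a minimal sketch of that splice, not the patented implementation: the `Morpheme` dataclass, the dictionary keyed by morph_id, the function name `insert_comma_before`, and the rule of taking the largest existing ID plus one as the new ID are all assumptions made for illustration. Only the field names come from the description.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Morpheme:
    """One entry of morpheme data; field names follow the description."""
    morph_id: int
    original: str
    read: str
    pronounce: str
    group: str
    accent: int
    prev: Optional[int]   # morph_id of the preceding entry, None at the head
    next: Optional[int]   # morph_id of the following entry, None at the tail


def insert_comma_before(table: Dict[int, Morpheme], target_id: int) -> int:
    """Append a comma entry and link it in front of table[target_id]."""
    target = table[target_id]
    new_id = max(table) + 1  # assumption: the new ID is the largest ID plus one
    # The new entry takes over the target's old predecessor and points at the target.
    table[new_id] = Morpheme(new_id, "、", "、", "、", "記号", 0,
                             prev=target.prev, next=target.morph_id)
    if target.prev is not None:
        # The old predecessor now points at the comma entry.
        table[target.prev].next = new_id
    # The target's predecessor becomes the comma entry.
    target.prev = new_id
    return new_id
```

With twelve existing entries and a target whose morph_id is 3, this sketch produces a new entry with morph_id 13 linked between entries 2 and 3, which mirrors the example of FIG. 14.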

In the above example, in the segment data 114 of FIG. 3, a breath was taken during the utterance of the speech data 113 between the phoneme string "arayurugeNjitsuo" (written "あらゆる現実を", "every reality") and the phoneme string "subete" (written "すべて", "all"). As a result, segment data 114 of a blank segment with seg_id=16 and phoneme label phone="#" has been generated between the segment data 114 with seg_id=15 and phoneme label phone="o" and the segment data 114 with seg_id=17 and phoneme label phone="s". In the morpheme data 123 of FIG. 5, on the other hand, no comma is detected between the morpheme "を" and the morpheme "すべて". In such a case, in the present embodiment, the CPU 1001 can insert the comma "、" between the morpheme "を" and the morpheme "すべて" in the morpheme data 123 of FIG. 5 by the operation described above.

After the processing of step S1210 in FIG. 12 described above, the CPU 1001 turns on the increment flag for the segment data 114 side in the RAM 1003 (step S1211). The increment flag for the morpheme label data 124 side in the RAM 1003, on the other hand, remains off, as set in step S1203.

Thereafter, the CPU 1001 proceeds to the determination processing of step S1212 described earlier; because of the processing in step S1211, the determination in step S1212 is YES. As a result, the CPU 1001 proceeds to the processing of step S1213 described earlier and sets the address of the next segment data 114 in the variable seg.

Subsequently, the CPU 1001 proceeds to the determination processing of step S1214 described earlier; because of the processing in step S1203, the determination in step S1214 is NO. As a result, the CPU 1001 skips the processing of step S1215 described earlier and leaves the value of the variable morph at the address of the current morpheme data 123.

As described above, when the phoneme label on the segment data 114 side is "#", corresponding to a blank segment, and the phoneme label on the morpheme label data 124 side is not "#", corresponding to a punctuation mark, so that a punctuation mark is additionally inserted into the morpheme data 123, only the variable seg is incremented. Processing then moves on to associating the phoneme of the next segment data 114 with the phoneme of the current morpheme label data 124.

In the specific example of the segment data 114 in FIG. 3, the specific examples of the morpheme data 123 in FIGS. 5 and 14, and the specific example of the morpheme label data 124 in FIG. 7, when the values of the variables seg and morph are both "16", it is determined in step S1204 that the phoneme label of the segment data 114 whose seg_id variable in FIG. 6 is "16" is "#", so the determination result is YES. The determination in step S1208 is then NO, and in step S1210 a punctuation mark is inserted into the morpheme data 123 as shown in FIG. 14. Processing then proceeds through step S1211, step S1212 (YES), step S1213, and step S1214 (NO), and returns to step S1202, so that only the value of the variable seg is incremented, to the value "17" indicated by the next variable of the segment data 114 whose seg_id variable in FIG. 3 is "16". As a result, the next comparison is made between the phoneme label "s" of the segment data 114 whose seg_id variable in FIG. 3 is "17", retrieved by the updated value "17" of the variable seg, and the phoneme label "s" of the morpheme label data 124 whose mlabel_id variable in FIG. 7 is "16", retrieved by the unchanged value "16" of the variable morph. The two are thus associated correctly.

When, as a result of the repeated processing above, it is determined that either the variable seg or the variable morph has passed the last data item (the determination in step S1202 is NO), the CPU 1001 executes the processing from step S1216 in FIG. 13 onward. First, the CPU 1001 concatenates all the morpheme notations stored in the original variable of each morpheme data 123 obtained in the RAM 1003 by the processing so far, for example as shown in FIG. 14, into text data. It then executes the morphological analysis processing (corresponding to the morphological analysis unit 120 in FIG. 1) again on that text data and generates the corresponding morpheme data 123 and morpheme label data 124 in the RAM 1003 (step S1216).

Thereafter, starting from the head of the set of segment data 114 and the head of the set of morpheme label data 124 newly generated in the RAM 1003, the CPU 1001 repeatedly executes, in order, the processing of determining again whether the phoneme label on the segment data 114 side matches the phoneme label on the morpheme label data 124 side (step S1220).

Specifically, the CPU 1001 sets the start address of the segment data 114 in the RAM 1003 and the start address of the morpheme label data 124 newly generated in the RAM 1003 in the variables seg and morph in the RAM 1003, respectively (step S1217).

Next, in order to generate the learning label data 130 in the RAM 1003, the CPU 1001 sets the address of the first free area of the learning label data 130 in the RAM 1003 in the variable tr in the RAM 1003 (step S1218).

Then, while it is determined that neither the variable seg nor the variable morph has passed the last data item (the determination in step S1219 is YES), the CPU 1001 executes the following series of processes from step S1220 to step S1224.

First, the CPU 1001 determines whether the phoneme label on the segment data 114 side and the phoneme label on the morpheme label data 124 side match (step S1220).

Through the processing of the flowchart of FIG. 12 described above, commas have been additionally inserted into the morpheme data 123 of FIG. 14 in correspondence with the blank-timing segment data 114 that was generated when a breath or the like occurred, during the utterance of the speech data 113, at a timing corresponding to a morpheme boundary. Therefore, when the morphological analysis processing of step S1216 in FIG. 13 is executed again on the text data obtained by recombining such morpheme data 123, the phoneme sequence of the regenerated morpheme label data 124 corresponds well to the phoneme sequence of the segment data 114.

On the other hand, when a blank timing in the utterance of the speech data 113 is caused by a hesitation, the blank timing may fall not at a morpheme boundary but at a phoneme in the middle of a morpheme. In this case, the blank-timing segment data 114 is detected by the determination processing of step S1204 in FIG. 12, so that the determination result is YES, and the determination result of the subsequent step S1208 is NO. Consequently, in step S1210, morpheme data 123 is forcibly inserted at a morpheme boundary position for a blank timing that actually occurred at a mid-morpheme phoneme where there is no boundary. If the morphological analysis processing of step S1216 in FIG. 13 is executed again in this state, the phoneme sequence of the regenerated morpheme label data 124 does not correspond well to the phoneme sequence of the segment data 114. Speech data 113 that contains hesitations is inherently unsuited to the creation of phoneme segment data. In such a case, therefore, an error should be raised, the input speech data 113 and utterance text 122 should not be adopted for learning, and the learning label data 130 should not be output.

Accordingly, when the determination in step S1220 is NO because the phoneme label on the segment data 114 side and the phoneme label on the morpheme label data 124 side do not match, the CPU 1001 sets the error flag stored in the RAM 1003 to on (step S1224) and then ends the association processing of step S1101 in FIG. 11, which is shown in the flowcharts of FIGS. 12 and 13. As a result, the determination in step S1102 of FIG. 11 is YES, the accent correction processing of step S1103 is not executed, and the data assignment processing ends without the learning label data 130 being generated.

When the determination in step S1220 is YES because the phoneme label on the segment data 114 side and the phoneme label on the morpheme label data 124 side match, the CPU 1001 writes the contents of the segment data 114 indicated by the variable seg and the contents of the morpheme label data 124 indicated by the variable morph into the new area of the learning label data 130 (see FIG. 8) in the RAM 1003 indicated by the variable tr in the RAM 1003. Specifically, the CPU 1001 copies into the new area the contents of the phone, start, and end variables of the segment data 114 (see FIG. 2) in the RAM 1003 indicated by the variable seg, and the contents of the accent and mola variables of the morpheme label data 124 (see FIG. 5) in the RAM 1003 indicated by the variable morph. In the new area, when the phoneme label in the phone variable indicates a vowel phoneme, the CPU 1001 sets the value "1", indicating a vowel, in the vowel variable (see FIG. 8); when it indicates a consonant phoneme, the CPU 1001 sets the value "0", indicating a consonant, in the vowel variable. The CPU 1001 also sets a new morpheme label ID value in the tlabel_id variable of the new area. Further, in the new area, the CPU 1001 sets the morpheme label ID value of the immediately preceding learning label data 130 (a NULL value for the first entry) in the prev variable, and sets the morpheme label ID value of the immediately following learning label data 130 (a NULL value for the last entry) in the next variable. The content of the pitch variable is set in the accent correction processing described later.

After step S1221, the CPU 1001 sets the addresses of the next segment data 114 and the next morpheme label data 124 in the variables seg and morph in the RAM 1003, respectively (step S1222). Specifically, the CPU 1001 sets the value of the next variable of the current segment data 114 (see FIG. 2) and the value of the next variable of the current morpheme label data 124 (see FIG. 6) in the variables seg and morph, respectively.

Further, the CPU 1001 sets the address of the next free area of the learning label data 130 in the RAM 1003 in the variable tr in the RAM 1003 (step S1223).

Thereafter, the CPU 1001 returns to the processing of step S1219 in FIG. 13 and moves on to the next iteration.

When, as a result of the repeated processing above, it is determined that either the variable seg or the variable morph has passed the last data item (the determination in step S1219 is NO), the CPU 1001 ends the association processing of step S1101 in FIG. 11, which is shown in the flowcharts of FIGS. 12 and 13. In this case, the learning label data 130 has been generated without an error, so the determination in step S1102 of FIG. 11 is NO and the accent correction processing of step S1103 in FIG. 11 is executed.
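The second pass of steps S1219 to S1224 can be sketched as a strict pairwise comparison that either produces learning labels or reports an error. This is an illustrative sketch under stated assumptions, not the implementation of the embodiment: Python lists stand in for the linked entries in the RAM 1003 (so the prev and next variables are implied by list order), the function name `build_training_labels` and the class names are hypothetical, a return value of `None` stands in for the error flag, and the vowel set `VOWELS` is an assumption because the description does not list the vowel labels.

```python
from dataclasses import dataclass
from typing import List, Optional

VOWELS = {"a", "i", "u", "e", "o"}  # assumption: vowel labels are not listed in the text


@dataclass
class Segment:
    phone: str
    start: float
    end: float


@dataclass
class MorphLabel:
    phone: str
    accent: int
    mola: int


@dataclass
class TrainLabel:
    tlabel_id: int
    phone: str
    start: float
    end: float
    accent: int
    mola: int
    vowel: int                      # 1 for a vowel, 0 for a consonant
    pitch: Optional[float] = None   # filled in later by accent correction


def build_training_labels(segs: List[Segment],
                          labels: List[MorphLabel]) -> Optional[List[TrainLabel]]:
    """Pair segments with morpheme labels; return None on the first mismatch."""
    out: List[TrainLabel] = []
    # zip stops when either sequence runs out, like the check in step S1219.
    for i, (seg, lab) in enumerate(zip(segs, labels)):
        if seg.phone != lab.phone:
            # Step S1220 is NO: corresponds to setting the error flag (S1224).
            return None
        out.append(TrainLabel(i, seg.phone, seg.start, seg.end,
                              lab.accent, lab.mola,
                              1 if seg.phone in VOWELS else 0))
    return out
```

A caller would skip accent correction and discard the utterance when the function returns `None`, which corresponds to the YES branch of step S1102.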

FIG. 15 is a flowchart showing a detailed example of the accent correction processing of step S1103 in FIG. 11. The accent correction processing uses the learning label data 130 generated in the RAM 1003 by the association processing of step S1101 in FIG. 11 (FIGS. 12 and 13) and the fundamental frequency data 115 generated in the RAM 1003 by the fundamental frequency analysis processing corresponding to the function of the fundamental frequency analysis unit 112 in the speech analysis unit 101 of FIG. 1. A new accent position is generated for each accent block, and the accent information of the learning label data 130 corresponding to the segments belonging to that accent block is corrected on the basis of the newly calculated accent position.

First, for each learning label data 130 (see FIG. 8) in the RAM 1003, the CPU 1001 calculates the average value of the fundamental frequency within the segment interval determined by the values of the start variable and the end variable of that learning label data 130 (step S1501). Specifically, from the fundamental frequency data 115 (see FIG. 9) in the RAM 1003, the CPU 1001 extracts each fundamental frequency data 115 whose time variable falls within that interval, extracts the fundamental frequencies held in the pitch variables of those fundamental frequency data 115, and calculates their average value. The CPU 1001 then holds the calculated average value in the pitch variable (see FIG. 8) of the learning label data 130 as the fundamental frequency of the segment interval corresponding to that learning label data 130.
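The averaging of step S1501 can be sketched as follows. This is a minimal illustration, not the embodiment itself: the function name `mean_f0` and the representation of the fundamental frequency data as `(time, pitch)` pairs are hypothetical, and two details that the description leaves open are filled in as assumptions, namely that the interval is treated as half-open (`start <= time < end`) and that `None` is returned when no frame falls inside the interval.

```python
from typing import List, Optional, Tuple


def mean_f0(f0: List[Tuple[float, float]],
            start: float, end: float) -> Optional[float]:
    """Average the pitch of all (time, pitch) frames inside [start, end)."""
    # Keep only frames whose time stamp falls inside the segment interval.
    values = [pitch for time, pitch in f0 if start <= time < end]
    if not values:
        return None  # assumption: no frame in the interval means no pitch value
    return sum(values) / len(values)
```

The result would be stored in the pitch variable of the corresponding learning label data, once per segment.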

Next, in step S1502, the CPU 1001 sets the start address of the learning label data 130 generated in the RAM 1003 in the variable tr in the RAM 1003. Then, while setting the next address of the learning label data 130 in the variable tr in turn in step S1507, the CPU 1001 repeatedly executes the series of processes from step S1504 to step S1506 until it determines in step S1503 that the address in the variable tr has passed the end of the learning label data 130.

In this repeated processing, the CPU 1001 first determines whether the segment corresponding to the learning label data 130 indicated by the variable tr is a break between accent blocks (step S1504). Specifically, when the mora number indicated by the mola variable (see FIG. 8) in the learning label data 130 indicated by the variable tr is smaller than the mora number indicated by the mola variable in the immediately preceding learning label data 130, which is indicated by the prev variable in that learning label data 130, the CPU 1001 determines that the accent block to which the segment of the immediately preceding learning label data 130 belongs has ended, and the determination in step S1504 is YES. Alternatively, when the phone variable of the learning label data 130 indicated by the variable tr holds the phoneme label "#" indicating a blank segment, the CPU 1001 determines that the accent block to which the segment of the immediately preceding learning label data 130, indicated by the prev variable in that learning label data 130, belongs has ended, and the determination in step S1504 is YES.

When the determination in step S1504 is YES, the CPU 1001 considers the range of the accent block to which the segment of the immediately preceding learning label data 130 belongs. Among the learning label data 130 whose segment intervals, determined by the values of the start and end variables (see FIG. 8), belong to that accent block, the CPU 1001 determines the phoneme position (for example, the value of the start variable) of the learning label data 130 whose pitch variable (see FIG. 8) holds the highest fundamental frequency as the new accent position (mora number) in that accent block. When the value of the vowel variable (see FIG. 8) of that learning label data 130 is "0", that is, when the phoneme of the segment is a consonant, the phoneme position of the learning label data 130 of the vowel phoneme before or after it is taken as the new accent position (step S1505).

After step S1505, for the learning label data 130 whose segment intervals, determined by the values of the start and end variables (see FIG. 8), belong to the accent block, the CPU 1001 corrects the value of the accent variable (see FIG. 8) to the difference between the mora number held in the mola variable (see FIG. 8) and the new accent position calculated in step S1505 (a negative value before the accent position and a positive value after it) (step S1506).
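Steps S1504 to S1506 can be sketched as a single pass that collects labels into accent blocks and rewrites the accent variable of each block relative to its highest-pitch mora. The sketch below is illustrative only and rests on several assumptions that the description does not state: a list replaces the linked entries, the names `Label`, `accent_position`, and `correct_accents` are hypothetical, a blank segment "#" is treated as belonging to no block, a consonant peak is moved to the following vowel if there is one and otherwise to the preceding vowel, and the last block is closed when the data ends.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Label:
    phone: str
    mola: int               # mora number within the accent block
    vowel: int              # 1 for a vowel, 0 for a consonant
    pitch: Optional[float]  # average fundamental frequency of the segment
    accent: int = 0


def accent_position(block: List[Label]) -> Optional[int]:
    """Return the mora number of the highest-pitch segment in the block."""
    voiced = [i for i, lab in enumerate(block) if lab.pitch is not None]
    if not voiced:
        return None
    k = max(voiced, key=lambda i: block[i].pitch)
    if not block[k].vowel:
        # Consonant peak: use an adjacent vowel (following one first, an assumption).
        for j in (k + 1, k - 1):
            if 0 <= j < len(block) and block[j].vowel:
                k = j
                break
    return block[k].mola


def correct_accents(labels: List[Label]) -> None:
    """Rewrite accent as (mora number - accent position) per accent block."""
    block: List[Label] = []

    def flush() -> None:
        pos = accent_position(block)
        if pos is not None:
            for lab in block:
                lab.accent = lab.mola - pos  # negative before, positive after
        block.clear()

    for lab in labels:
        if lab.phone == "#":
            flush()          # a blank segment ends the current block
            continue
        if block and lab.mola < block[-1].mola:
            flush()          # the mora number dropped, so a new block begins
        block.append(lab)
    flush()                  # assumption: close the final block at end of data
```

Segments before the accent position receive negative values and segments after it receive positive values, as in step S1506. Replacing `pitch` with a per-segment signal strength would give the variant mentioned later in the description.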

Thereafter, the CPU 1001 proceeds to the processing of step S1507.

When the determination in step S1504 is NO, the CPU 1001 skips the processing of steps S1505 and S1506 and proceeds to the processing of step S1507.

In step S1507, the CPU 1001 sets the next address of the learning label data 130 in the variable tr. Specifically, the CPU 1001 stores in the variable tr the value of the next variable of the learning label data 130 (see FIG. 8) in the RAM 1003 indicated by the variable tr before the change. Thereafter, the CPU 1001 returns to the processing of step S1503.

When, after the repeated processing above, it is determined that the address in the variable tr has passed the end of the learning label data 130 and the determination in step S1503 is therefore YES, the CPU 1001 ends the accent correction processing of step S1103 in FIG. 11, which is shown in the flowchart of FIG. 15. It then outputs the learning label data 130 generated in the RAM 1003 to the external storage device 1006 or the like as the final learning label data 130 for creating a speech database of phoneme segment data.

As described above, in the present embodiment, taking the blank timings of the speech data into account makes it possible to obtain correct accent information, and also makes it possible to correct the accent information properly on the basis of the fundamental frequency obtained from the speech data.

In the embodiment described above, the accent information of each segment in an accent block is corrected according to the position of the segment at which the fundamental frequency is highest within that accent block. Alternatively, the accent information of each segment in an accent block may be corrected according to the position of the segment at which the signal strength is greatest within that accent block.

The embodiment described above concerns the accent information creation device 100 for creating the learning label data 130 used to create a speech database of phoneme segment data. Alternatively, the invention may be implemented as a speech database creation device that includes a database registration unit in addition to the speech analysis unit 101, the language analysis unit 102, and the data assignment unit 103 shown in FIG. 1. For each segment, this database registration unit executes a process of cutting out the portion of the speech data corresponding to the segment as phoneme segment data, and a process of registering in the speech database the phoneme segment data and the accent information that can be extracted from the learning label data 130 of the data assignment unit 103, together with the phoneme label representing the phoneme of the segment.

以上の実施形態に関して、更に以下の付記を開示する。
（付記１）
入力される音声データからアクセント位置及び区切り位置の少なくとも一方を示す第１位置データを取得する取得処理と、前記音声データに対応するテキストデータから生成された複数の形態素を含む形態素データに付与されているアクセントの位置及び前記複数の形態素間の区切り位置の少なくとも一方を示す第２位置データと、前記音声データから取得された第１位置データとを比較する比較処理と、前記比較処理にて前記第１及び第２の位置データが一致していない場合には、前記前記形態素データに対して前記第２位置データに代えて前記第１位置データを付与する処理を実行する処理部を備えた音声情報作成装置。
（付記２）
前記音声情報作成装置はさらに、
前記テキストデータに対して形態素解析処理を実行することにより、前記複数の形態素を含む形態素データを生成する形態素解析部を有する、付記１に記載の音声情報作成装置。
（付記３）
前記処理部は、前記取得処理として、前記音声データの無音区間の位置を前記第１位置データの区切り位置として取得する処理を実行する、付記１または２に記載の音声情報作成装置。
（付記４）
前記処理部はさらに、前記比較処理において、前記第１位置データが示す区切り位置と第２の位置データが示す区切り位置と一致する場合は、前記形態素データに対して、前記第２の位置データが示す位置に、読点の情報を付与する、付記１乃至３のいずれかに記載の音声情報作成装置。
（付記５）
前記処理部は、前記取得処理として、前記複数の形態素それぞれに対応する前記音声データの区間内で、前記音声データの基本周波数を判別する処理と、前記音声データの区間内で前記基本周波数が最も高い位置を、前記第１位置データのアクセント位置として取得する処理を実行する、付記１乃至４のいずれかに記載の音声情報作成装置。
（付記６）
前記処理部は、前記取得処理として、前記複数の形態素それぞれに対応する前記音声データの区間内で、前記音声データの信号強度を判別する処理と、前記音声データの区間内で前記信号強度が最も高い位置を、前記第１位置データのアクセント位置として取得する処理を実行する、付記１乃至４のいずれかに記載の音声情報作成装置。
（付記７）
処理部を備えた音声情報作成装置に用いられる音声情報作成方法であって、前記処理部が、
入力される音声データからアクセント位置及び区切り位置の少なくとも一方を示す第１位置データを取得し、
前記音声データに対応するテキストデータから生成された複数の形態素を含む形態素データに付与されているアクセントの位置及び前記複数の形態素間の区切り位置の少なくとも一方を示す第２位置データと、前記音声データから取得された第１位置データとを比較し、
前記第１及び第２の位置データが一致していない場合には、前記前記形態素データに対して前記第２位置データに代えて前記第１位置データを付与する、音声情報作成方法。
（付記８）
音声情報作成装置として用いられるコンピュータに、
入力される音声データからアクセント位置及び区切り位置の少なくとも一方を示す第１位置データを取得するステップと、
前記音声データに対応するテキストデータから生成された複数の形態素を含む形態素データに付与されているアクセントの位置及び前記複数の形態素間の区切り位置の少なくとも一方を示す第２位置データと、前記音声データから取得された第１位置データとを比較するステップと、
前記第１及び第２の位置データが一致していない場合には、前記前記形態素データに対して前記第２位置データに代えて前記第１位置データを付与するステップと、
を実行させるプログラム。
（付記９）
付記１乃至６のいずれかに記載の音声情報作成装置と、
前記音声データから音素片データを切り出す処理と、前記音素片データ、前記音素片データを表わす音素ラベル、及び前記音声情報作成装置により前記音素片データに対応する形態素に付与されたアクセント情報を音声データベースに登録する処理と、を実行する登録処理部と、
を備えた音声データベース作成装置。 Regarding the above embodiments, the following supplementary notes are further disclosed.
(Appendix 1)
An acquisition process of acquiring first position data indicating at least one of an accent position and a delimiter position from input voice data; and a process of obtaining first position data added to morpheme data including a plurality of morphemes generated from text data corresponding to the voice data. A second position data indicating at least one of a position of an accent and a break position between the plurality of morphemes, and a first position data obtained from the audio data; When the first and second position data do not match, the audio information includes a processing unit that executes a process of adding the first position data to the morphological data instead of the second position data. Creation device.
(Appendix 2)
The audio information creation device further includes:
2. The speech information creating device according to claim 1, further comprising a morphological analysis unit configured to generate morphological data including the plurality of morphemes by performing a morphological analysis process on the text data.
(Appendix 3)
3. The audio information creating apparatus according to claim 1, wherein the processing unit executes, as the acquisition process, a process of acquiring a position of a silent section of the audio data as a break position of the first position data.
(Appendix 4)
The processing unit may further include, in the comparison processing, when the break position indicated by the first position data matches the break position indicated by the second position data, the second position data is compared with the morphological data. 4. The audio information creation device according to any one of supplementary notes 1 to 3, wherein reading point information is added to the indicated position.
(Appendix 5)
The processing unit determines the fundamental frequency of the audio data in the section of the audio data corresponding to each of the plurality of morphemes as the acquisition processing. The audio information creation device according to any one of supplementary notes 1 to 4, wherein a process of acquiring a high position as an accent position of the first position data is performed.
(Appendix 6)
The processing unit determines the signal strength of the audio data in a section of the audio data corresponding to each of the plurality of morphemes as the acquisition processing. The audio information creation device according to any one of supplementary notes 1 to 4, wherein a process of acquiring a high position as an accent position of the first position data is performed.
(Appendix 7)
A speech information creation method used in a speech information creation device including a processing unit, the method comprising, by the processing unit:
acquiring, from input speech data, first position data indicating at least one of an accent position and a break position;
comparing the first position data acquired from the speech data with second position data that is attached to morpheme data including a plurality of morphemes generated from text data corresponding to the speech data and that indicates at least one of an accent position and a break position between the plurality of morphemes; and
when the first and second position data do not match, attaching the first position data to the morpheme data in place of the second position data.
(Appendix 8)
A program that causes a computer used as a speech information creation device to execute:
a step of acquiring, from input speech data, first position data indicating at least one of an accent position and a break position;
a step of comparing the first position data acquired from the speech data with second position data that is attached to morpheme data including a plurality of morphemes generated from text data corresponding to the speech data and that indicates at least one of an accent position and a break position between the plurality of morphemes; and
a step of, when the first and second position data do not match, attaching the first position data to the morpheme data in place of the second position data.
(Appendix 9)
A speech database creation device comprising:
the speech information creation device according to any one of supplementary notes 1 to 6; and
a registration processing unit that executes a process of cutting out phoneme segment data from the speech data and a process of registering, in a speech database, the phoneme segment data, a phoneme label representing the phoneme segment data, and the accent information attached by the speech information creation device to the morpheme corresponding to the phoneme segment data.
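The registration step can be sketched as follows, keyed by phoneme label as in the background description. The in-memory dictionary, the `Segment` and `Entry` records, and the integer encoding of the accent information are assumptions made for illustration; the actual database format is not specified here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


@dataclass
class Segment:
    """Hypothetical aligned segment: a phoneme label and its sample range."""
    label: str
    start: int                 # index of the first sample
    end: int                   # one past the last sample
    accent: Optional[int]      # accent information of the corresponding morpheme


@dataclass
class Entry:
    """Hypothetical database entry: one phoneme segment and its accent information."""
    samples: List[float]
    accent: Optional[int]


def register_segments(speech: Sequence[float],
                      segments: List[Segment],
                      db: Dict[str, List[Entry]]) -> None:
    """Cut each segment out of the speech data and register it under its
    phoneme label together with the accent information."""
    for seg in segments:
        piece = list(speech[seg.start:seg.end])
        db.setdefault(seg.label, []).append(Entry(piece, seg.accent))
```

Several segments may share a phoneme label, so each key holds a list of entries rather than a single one.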

REFERENCE SIGNS LIST
100 Accent information creation device
101 Voice analysis unit
102 Language analysis unit
103 Data addition unit
110 Voice recognition unit
111 Acoustic model
112 Fundamental frequency analysis unit
120 Morphological analysis unit
121 Morphological dictionary
122 Uttered text
1001 CPU
1002 ROM (read-only memory)
1003 RAM (random access memory)
1004 Input device
1005 Output device
1006 External storage device
1007 Portable recording medium drive device
1008 Communication interface
1009 Bus
1010 Portable recording medium

Claims

A speech information creation device comprising a processing unit that executes: an acquisition process of acquiring, from input speech data, first position data indicating at least one of an accent position and a break position; a comparison process of comparing the first position data acquired from the speech data with second position data that is attached to morpheme data including a plurality of morphemes generated from text data corresponding to the speech data and that indicates at least one of an accent position and a break position between the plurality of morphemes; and an adding process of, when the first and second position data do not match, attaching the first position data to the morpheme data in place of the second position data,
wherein, when the comparison process finds that the break position indicated by the first position data matches the break position indicated by the second position data, the processing unit further adds reading point information to the morpheme data at the position indicated by the second position data.

The speech information creation device according to claim 1, further comprising a morphological analysis unit that generates the morpheme data including the plurality of morphemes by performing morphological analysis processing on the text data.

The speech information creation device according to claim 1, wherein the processing unit executes, as the acquisition process, a process of acquiring the position of a silent section of the speech data as the break position of the first position data.

The speech information creation device according to any one of claims 1 to 3, wherein the processing unit executes, as the acquisition process, a process of acquiring, as the accent position of the first position data, a position at which the fundamental frequency of the speech data is high within the section of the speech data corresponding to each of the plurality of morphemes.

The speech information creation device according to any one of claims 1 to 3, wherein the processing unit executes, as the acquisition process, a process of acquiring, as the accent position of the first position data, a position at which the signal strength of the speech data is high within the section of the speech data corresponding to each of the plurality of morphemes.

A speech information creation method used in a speech information creation device including a processing unit, the method comprising, by the processing unit:
acquiring, from input speech data, first position data indicating at least one of an accent position and a break position;
comparing the first position data acquired from the speech data with second position data that is attached to morpheme data including a plurality of morphemes generated from text data corresponding to the speech data and that indicates at least one of an accent position and a break position between the plurality of morphemes;
when the first and second position data do not match, attaching the first position data to the morpheme data in place of the second position data; and
when the break position indicated by the first position data matches the break position indicated by the second position data, adding reading point information to the morpheme data at the position indicated by the second position data.

A program that causes a computer used as a speech information creation device to execute:
a step of acquiring, from input speech data, first position data indicating at least one of an accent position and a break position;
a step of comparing the first position data acquired from the speech data with second position data that is attached to morpheme data including a plurality of morphemes generated from text data corresponding to the speech data and that indicates at least one of an accent position and a break position between the plurality of morphemes;
a step of, when the first and second position data do not match, attaching the first position data to the morpheme data in place of the second position data; and
a step of, when the break position indicated by the first position data matches the break position indicated by the second position data, adding reading point information to the morpheme data at the position indicated by the second position data.

A speech database creation device comprising:
the speech information creation device according to any one of claims 1 to 5; and
a registration processing unit that executes a process of cutting out phoneme segment data from the speech data and a process of registering, in a speech database, the phoneme segment data, a phoneme label representing the phoneme segment data, and the accent information attached by the speech information creation device to the morpheme corresponding to the phoneme segment data.