JP2017090856A

JP2017090856A - Voice generation device, method, program, and voice database generation device

Info

Publication number: JP2017090856A
Application number: JP2015225047A
Authority: JP
Inventors: 淳一郎副島; Junichiro Soejima
Original assignee: Casio Computer Co Ltd
Current assignee: Casio Computer Co Ltd
Priority date: 2015-11-17
Filing date: 2015-11-17
Publication date: 2017-05-25
Anticipated expiration: 2035-11-17
Also published as: JP6631186B2

Abstract

PROBLEM TO BE SOLVED: To provide a technique for generating the accent information added to phoneme piece data, in which correct accent information is obtained by taking the blank timing of speech data into consideration. SOLUTION: In an association process that associates the phoneme string of segment data with that of morpheme label data, it is determined, sequentially from the head of the segment data and the head of the morpheme label data, whether the phoneme label on the segment data side matches the phoneme label on the morpheme label data side (S1205). When the phoneme label of the segment data is "#", corresponding to a blank segment (YES in S1204), and the phoneme label on the morpheme label data side is not "#", corresponding to a punctuation mark (NO in S1208), morpheme data for a comma is inserted next to the current morpheme data, so that the association is carried out appropriately. SELECTED DRAWING: Figure 12

Description

The present invention relates to a speech creation device, method, and program for creating accent information to be assigned to phoneme piece data, and to a speech database creation device.

When a speech database for speech synthesis is created, the portion of the speech data for learning in each segment interval, with the phoneme as the unit, is cut out as phoneme piece data and registered in the speech database with the phoneme label corresponding to that segment as the key. To improve synthesis quality, accent information is added to the registered phoneme piece data. First, morphological analysis is executed on the text data corresponding to the speech data, and accent information is assigned to each resulting phoneme. Next, for each segment, the phoneme of the segment is associated with a phoneme obtained by the morphological analysis, the accent information assigned to that phoneme is acquired, and it is registered in the speech database together with the phoneme piece data of the segment.

However, the speech data for learning from which the phoneme piece data is derived is not necessarily uttered with the accent implied by the linguistic information obtained from the corresponding text data. For this reason, techniques are known that correct accent information generated from the linguistic information of the text data on the basis of fundamental frequency information obtained from the utterance in the speech data (for example, the techniques described in Patent Documents 1 and 2).

JP 2012-189703 A
JP-A-6-337691

As described above, assigning accent information to phoneme piece data requires associating the phonemes of the segments obtained from the speech data with the phonemes obtained by morphological analysis of the text data corresponding to that speech data.

In an actual utterance, the speaker pauses for breath, and at the timing of each breath the value of the speech data is silent or nearly so. This timing is hereinafter called the "blank timing". When the speech data is divided into segments, a blank timing is detected as a segment whose phoneme indicates silence. A blank timing generally corresponds to a morpheme boundary, and a punctuation mark is often present at that position, but there are cases in which no punctuation mark is detected there. Furthermore, when a blank timing is caused by a hesitation, it may fall, in terms of the linguistic information, on a phoneme in the middle of a single morpheme, for example. In these cases, the phonemes of the segments cannot be associated properly with the phonemes obtained by morphological analysis.

The conventional techniques described above, however, do not take the occurrence of blank timings into account, so the correspondence may shift and correct accent information may not be obtained for each piece of phoneme piece data. In addition, the correspondence between the positions of the accent information obtained from the linguistic information and the positions of the fundamental frequency obtained from the utterance in the speech data also shifts, so correcting the accent information on the basis of the fundamental frequency may fail as well.

An object of the present invention is to make it possible to obtain correct accent information by taking the blank timing of speech data into consideration.

One example aspect includes a processing unit that executes: an acquisition process of acquiring, from input speech data, first position data indicating at least one of an accent position and a break position; a comparison process of comparing the first position data acquired from the speech data with second position data indicating at least one of the position of an accent assigned to morpheme data, which contains a plurality of morphemes generated from text data corresponding to the speech data, and the break positions between the plurality of morphemes; and, when the comparison process finds that the first and second position data do not match, a process of assigning the first position data to the morpheme data in place of the second position data.

According to the present invention, correct accent information can be obtained by taking the blank timing of speech data into consideration.

A block diagram of an embodiment of a speech information creation device according to the present invention.
A diagram showing an example data structure of segment data.
A diagram showing a specific example of segment data.
A diagram showing an example data structure of morpheme data.
A diagram showing a specific example of morpheme data.
A diagram showing an example data structure of morpheme label data.
A diagram showing a specific example of morpheme label data.
A diagram showing an example data structure of learning label data.
A diagram showing an example data structure of fundamental frequency data.
A diagram showing an example hardware configuration of a computer that can implement the speech information creation device as software processing.
A flowchart showing an example of the data adding process.
A flowchart (part 1) showing a detailed example of the association process.
A flowchart (part 2) showing a detailed example of the association process.
A diagram showing an example of morpheme data after adjustment.
A flowchart showing a detailed example of the accent correction process.

Embodiments for carrying out the present invention are described in detail below with reference to the drawings. In speech synthesis, the sequence of phoneme piece data that best matches the target phoneme string is retrieved from a speech database, and speech data is synthesized by concatenating those pieces. The present embodiment relates to a technique for creating the accent information assigned to phoneme piece data when generating the phoneme piece data to be registered in the speech database for speech synthesis.

Here, "speech synthesis" refers to processing that, given text data representing a sentence, synthesizes speech data with which a computer can utter speech similar to that of a person reading the text aloud. "Phoneme piece data" refers to time-domain speech waveform data representing one phoneme. A "phoneme" is the smallest linguistic unit that produces a difference in meaning. For example, the word "あらゆる" (arayuru, "every") decomposes into "a", "r", "a", "y", "u", "r", "u", each of which forms a phoneme.
Speech synthesis first obtains morpheme data by executing morphological analysis on the input text data with reference to a morpheme dictionary. A "morpheme" is the smallest linguistic unit that itself carries meaning. For example, given text data reading "あらゆる現実をすべて自分のほうへねじ曲げたのだ。" (arayuru genjitsu o subete jibun no hō e nejimageta no da), morphological analysis decomposes it into, for example, the 13 morphemes "あらゆる", "現実", "を", "すべて", "自分", "の", "ほう", "へ", "ねじ曲げ", "た", "の", "だ", and "。". In the next stage, the morpheme data is decomposed further to generate data consisting of a sequence of phonemes (hereinafter called "morpheme label data" in this embodiment). For example, for the single morpheme "あらゆる" obtained from the above text data, one piece of morpheme label data is generated for each of its phonemes, carrying the phoneme labels corresponding to "a", "r", "a", "y", "u", "r", "u". In the stage after that, the speech database described above is searched with the phoneme label of each piece of morpheme label data, retrieving phoneme piece data assigned the same phoneme label. In the final stage, the phoneme piece data extracted for each piece of morpheme label data is concatenated to synthesize the speech data corresponding to the input text.

FIG. 1 is a block diagram of an embodiment of a speech information creation device 100 according to the present invention, which creates the accent information assigned to phoneme piece data when generating the phoneme piece data to be registered in a speech database for speech synthesis. The speech information creation device 100 includes a speech analysis unit 101, a language analysis unit 102, and a data adding unit 103, and ultimately outputs learning label data 130. The speech analysis unit 101 includes a speech recognition unit 110, an acoustic model 111, and a fundamental frequency analysis unit 112. The language analysis unit 102 includes a morphological analysis unit 120 and a morpheme dictionary 121.

The speech recognition unit 110 executes speech recognition on the input time-domain speech data 113 for learning, with reference to the acoustic model 111. It thereby divides the speech data 113 into segments, each of which is a speech interval over which a single phoneme continues, and outputs segment data 114 as the result.

FIG. 2 shows an example data structure of the segment data 114. As shown, for each of the L phonemes in total obtained by the speech recognition of the speech recognition unit 110, the segment data segment[0], segment[1], ..., segment[L-1] each hold the variables seg_id, phone, start, end, prev, and next. seg_id holds the segment ID (identifier). phone holds the phoneme label. start holds the start time (sample number). end holds the end time (sample number). prev holds a pointer to the preceding segment data 114, and next holds a pointer to the following segment data 114. If the current segment data 114 is segment[1], for example, prev holds the start address of segment[0] and next holds the start address of segment[2]. If the current segment data 114 is the head item segment[0], prev holds NULL, an undefined value. If it is the tail item segment[L-1], next holds NULL. Because the items of segment data 114 are linked through prev and next, a new item can be inserted logically at any position even if it is physically appended at, for example, the last address in memory.
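
The linked structure of FIG. 2 can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the field names follow FIG. 2, while the helper functions `link`, `insert_after`, and `phones` and all sample times are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Segment:
    seg_id: int
    phone: str   # phoneme label; "#" marks silence
    start: int   # start time (sample number)
    end: int     # end time (sample number)
    # prev/next are excluded from repr and comparison to avoid walking the links.
    prev: Optional["Segment"] = field(default=None, repr=False, compare=False)
    next: Optional["Segment"] = field(default=None, repr=False, compare=False)


def link(segments: List[Segment]) -> Optional[Segment]:
    """Chain the segments in list order and return the head (prev is None)."""
    for a, b in zip(segments, segments[1:]):
        a.next, b.prev = b, a
    return segments[0] if segments else None


def insert_after(node: Segment, new: Segment) -> None:
    """Logically insert `new` right after `node`, wherever `new` is stored."""
    new.prev, new.next = node, node.next
    if node.next is not None:
        node.next.prev = new
    node.next = new


def phones(head: Optional[Segment]) -> List[str]:
    """Collect phoneme labels by following the next pointers."""
    out = []
    while head is not None:
        out.append(head.phone)
        head = head.next
    return out


# Hypothetical sample values, used only to exercise the links.
segs = [Segment(0, "#", 0, 500), Segment(1, "a", 500, 600), Segment(2, "r", 600, 730)]
head = link(segs)
insert_after(segs[1], Segment(99, "#", 600, 600))
print(phones(head))
```

Because ordering lives in the links rather than in storage order, the item with seg_id 99 appears between "a" and "r" even though it was created last.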

FIG. 3 shows a specific example of the segment data. For example, in the segment data 114 with seg_id=3, phone holds "a" as the phoneme label, start holds "730" (sample number) as the start time, end holds "810" (sample number) as the end time, prev holds "2" as the seg_id of the preceding segment data 114, and next holds "4" as the seg_id of the following segment data 114. In FIG. 3, phone of the first segment data 114 (seg_id=0) and of the 17th segment data 114 (seg_id=16) holds "#". This "#" indicates that, in the segment interval corresponding to that segment data 114, the speech data 113 was judged to be silent, that is, its value is zero or close to zero.

At registration in a speech database (not specifically shown), for each sample interval indicated by start and end of the segment data 114 described above, the waveform data corresponding to that interval is cut out of the speech data 113 as phoneme piece data. The phoneme piece data is then registered in the speech database with the phoneme label representing the phoneme of that segment as the search key.
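
A minimal sketch of this registration step is given below. It assumes the speech data is a plain sequence of samples, models the database as a dictionary keyed by phoneme label, and treats end as exclusive. None of these choices is stated in the source, and `register_pieces` is a hypothetical name.

```python
from collections import defaultdict


def register_pieces(speech, segments, database=None):
    """Cut speech[start:end] for each segment and file it under its phoneme label.

    `segments` is an iterable of (phone, start, end) tuples. Treating `end` as
    exclusive is an assumption of this sketch.
    """
    database = defaultdict(list) if database is None else database
    for phone, start, end in segments:
        database[phone].append(speech[start:end])
    return database


# Hypothetical ten-sample "waveform" and three segments.
speech = list(range(10))
db = register_pieces(speech, [("#", 0, 2), ("a", 2, 6), ("r", 6, 10)])
print(db["a"])
```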

At synthesis time, simply retrieving from the speech database the phoneme piece data corresponding to each phoneme label, for each piece of morpheme label data, does not yield high-quality synthesized speech data. Ideally, if phoneme piece data generated from the speech of a person actually reading the input text aloud were registered in and retrieved from the speech database, synthesized speech data of the best quality would be obtained. However, the storage capacity of the speech database is limited, so phoneme piece data for every possible text cannot be registered. Therefore, accent information is assigned to the phoneme piece data when it is registered in the speech database. At synthesis time, accent information is likewise generated during morphological analysis of the text data to be synthesized and is assigned to each piece of morpheme label data. Then, when the phoneme piece data corresponding to each phoneme label is retrieved from the speech database, the phoneme piece data whose accent information is close in value to that assigned to the morpheme label data is extracted and concatenated. This makes it possible to select from the speech database phoneme piece data uttered in an accent state close to that of the text data to be synthesized, and thus to improve the quality of the synthesized speech data.

Specifically, in the present embodiment shown in FIG. 1, the morphological analysis unit 120 of the language analysis unit 102 first receives the utterance text 122, which is the text data corresponding to the speech data 113 for learning, and executes morphological analysis on it with reference to the morpheme dictionary 121. As a result, the morphological analysis unit 120 outputs morpheme data 123 and morpheme label data 124. The morphological analysis unit 120 detects an accent position for each accent block determined from the morphemes indicated by the morpheme data 123. Here, "accent" refers to the relative height of the fundamental frequency, or the relative strength of the signal intensity, of the phonemes within a word or word combination (a span uttered as a single phrase), and that word or word combination is called an "accent block". The "accent position" is the position of the mora with the relatively highest fundamental frequency, or the mora with the relatively strongest signal intensity, within the accent block. A "mora" is, in phonology, a segmental unit of sound with a fixed temporal length, and is a convenient unit for expressing the position of an accent within an accent block.
Next, for each piece of morpheme label data 124, the morphological analysis unit 120 adds to that data accent information indicating the positional relationship between the phoneme it represents and the accent position of the accent block to which the phoneme belongs. The accent information is expressed, for example, by the number of morae from the accent position of the accent block to the mora containing the phoneme (negative before the accent position, positive after it), together with a mora number indicating which mora, counted from the first mora of the accent block, contains the phoneme.

FIG. 4 shows an example data structure of the morpheme data 123. As shown, for each of the M morphemes in total obtained by the morphological analysis of the morphological analysis unit 120, the morpheme data morph[0], morph[1], ..., morph[M-1] each hold the variables morph_id, original, read, pronounce, group, accent, prev, and next. morph_id holds the morpheme ID. original holds the written form of the morpheme. read holds the reading of the written form. pronounce holds the pronunciation. group holds the part-of-speech information. accent holds the accent information. prev holds a pointer to the preceding morpheme data 123 (NULL at the head). next holds a pointer to the following morpheme data 123 (NULL at the tail). As with the segment data 114, because the items of morpheme data 123 are linked through prev and next, a new item can be inserted logically at any position even if it is physically appended at, for example, the last address in memory.

FIG. 5 shows a specific example of the morpheme data 123. For example, given utterance text 122 for learning that reads "あらゆる現実をすべて自分のほうへねじ曲げたのだ。", morphological analysis decomposes it into, for example, the 13 morphemes "あらゆる", "現実", "を", "すべて", "自分", "の", "ほう", "へ", "ねじ曲げ", "た", "の", "だ", and "。", and generates the morpheme data 123 shown in FIG. 5, in which the written form, reading, and pronunciation of each morpheme are registered in original, read, and pronounce. For example, in the first morpheme data 123 (morph_id=0), original holds "あらゆる" as the written form, read holds "アラユル" (arayuru) as the reading, pronounce holds "アラユル" as the pronunciation, and group holds "連体詞" (adnominal) as the part-of-speech information. accent holds "3" as the accent information, indicating that the accent position is at the third mora of this morpheme. Morae are discussed further below. prev holds "NULL" in place of the morph_id of the preceding morpheme data 123, indicating that nothing precedes it. next holds "1" as the morph_id of the following morpheme data 123.

FIG. 6 shows an example data structure of the morpheme label data 124. As shown, the N pieces of morpheme label data in total, morph_label[0], morph_label[1], ..., morph_label[N-1], which are determined from the morpheme data 123 obtained by the morphological analysis of the morphological analysis unit 120, each hold the variables mlabel_id, phone, accent, mola, morph_id, prev, and next. mlabel_id holds the morpheme label ID. phone holds the phoneme label. accent holds the number of morae to the accent position. mola holds the mora number. morph_id holds the morpheme ID (see FIG. 4) of the morpheme data 123 to which the phoneme indicated by the phoneme label in phone belongs. prev holds a pointer to the preceding morpheme label data 124 (NULL at the head). next holds a pointer to the following morpheme label data 124 (NULL at the tail). prev and next have the same meaning as for the segment data 114 (FIG. 2) and the morpheme data 123 (FIG. 4).

FIG. 7 shows a specific example of the morpheme label data 124. From the morpheme data 123 with morph_id=0, 1, 2, 3 shown in FIG. 5, which correspond to the four consecutive morphemes "あらゆる" (arayuru), "現実" (genjitsu), "を" (o), and "すべて" (subete), morphological analysis extracts the 21 phonemes "a", "r", "a", "y", "u", "r", "u", "g", "e", "N", "j", "i", "ts", "u", "o", "s", "u", "b", "e", "t", "e", and generates 21 pieces of morpheme label data 124 with the corresponding phoneme labels registered in phone. For convenience of processing, data with the phoneme label "#", indicating silence, is registered as the first morpheme label data 124 (mlabel_id=0).

Here, for the morphemes "あらゆる", "現実", "を", and "すべて" from which this morpheme label data 124 was generated, morphological analysis determines, for example, the accent blocks "あらゆる", "現実", "を", and "すべて", whose boundaries coincide with the morpheme boundaries. Taking the accent block "あらゆる" as an example, morphological analysis recognizes four morae within it: "a", "ra", "yu", and "ru". It then detects, for example, the third mora "yu" ("ゆ") as the accent position in this accent block. That is, in this example the accent position is "3".

Furthermore, in the morpheme label data 124 with mlabel_id=1, 2, 3, 4, 5, 6, 7, which carry the phoneme labels for the phonemes "a", "r", "a", "y", "u", "r", "u", morphological analysis stores in each mola variable the mora number, "1", "2", "2", "3", "3", "4", "4", of the mora that contains that phoneme within the accent block "あらゆる". That is, in the morpheme label data 124 with mlabel_id=1 and phone="a", the phoneme "a" is contained in the first of the four morae "a", "ra", "yu", "ru", so mola holds the mora number "1". In the morpheme label data 124 with mlabel_id=2 and phone="r", the phoneme "r" is contained in the second mora "ra", so mola holds the mora number "2". Likewise, in the morpheme label data 124 with mlabel_id=3 and phone="a", the phoneme "a" is contained in the second mora "ra", so mola holds the mora number "2". The same applies to the other phonemes.

In addition, in the same morpheme label data 124 with mlabel_id=1, 2, 3, 4, 5, 6, 7, morphological analysis registers in each accent variable the difference between the mora number held in the mola variable ("1", "2", "2", "3", "3", "4", "4") and the accent position "3" detected in the accent block: 1-3=-2, 2-3=-1, 2-3=-1, 3-3=0, 3-3=0, 4-3=1, and 4-3=1, respectively.
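
The mola and accent values described for the accent block "あらゆる" can be reproduced with a short sketch. It assumes the morae are already available as lists of phoneme labels, which the source attributes to morphological analysis. The function name `label_accent` and the dictionary representation are illustrative only.

```python
def label_accent(morae, accent_position):
    """Return one label per phoneme with its mora number and accent offset.

    `morae` is a list of morae, each a list of phoneme labels.
    `accent_position` is the 1-based mora number of the accent.
    accent = mola - accent_position, so it is negative before the accent
    position and positive after it.
    """
    labels = []
    for mola, mora_phones in enumerate(morae, start=1):
        for phone in mora_phones:
            labels.append(
                {"phone": phone, "mola": mola, "accent": mola - accent_position}
            )
    return labels


# Accent block "arayuru": morae a / ra / yu / ru, accent on the third mora.
ARAYURU = [["a"], ["r", "a"], ["y", "u"], ["r", "u"]]
labels = label_accent(ARAYURU, accent_position=3)
print([(l["phone"], l["mola"], l["accent"]) for l in labels])
```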

Once the morphological analysis unit 120 in the language analysis unit 102 of FIG. 1 has obtained the morpheme label data 124, in which a phoneme string including accent information is registered, the data adding unit 103 establishes the correspondence between the phoneme string of the segment data 114 generated by the speech recognition unit 110 in the speech analysis unit 101 and the phoneme string of the morpheme label data 124. For example, for the segment data 114 illustrated in FIG. 3 and the morpheme label data 124 illustrated in FIG. 7, the data adding unit 103 checks, sequentially from the head of each, whether the phoneme label in phone of the segment data 114 matches the phoneme label in phone of the morpheme label data 124. If the phoneme labels match over the whole of the segment data 114 and the morpheme label data 124, the data adding unit 103 adds to the contents of the segment data 114 (data structure of FIG. 2) the accent information registered in the morpheme label data 124 (data structure of FIG. 6), namely the number of morae to the accent position and the mora number within the accent block, and outputs the result as learning label data 130.
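
The all-match case described here can be sketched as below. The sketch covers only the situation in which both phoneme strings agree from head to tail and returns None otherwise. It does not attempt the handling of blank segments. Dictionaries stand in for the linked records, and `attach_accent` is a hypothetical name.

```python
def attach_accent(segments, morph_labels):
    """Copy accent and mola onto each segment when every phoneme label matches.

    Returns the list of learning labels, or None on any length or label mismatch.
    """
    if len(segments) != len(morph_labels):
        return None
    out = []
    for seg, lab in zip(segments, morph_labels):
        if seg["phone"] != lab["phone"]:
            return None  # correspondence broken; no labels are produced
        out.append({**seg, "accent": lab["accent"], "mola": lab["mola"]})
    return out


# Hypothetical two-phoneme example; the times are invented.
segments = [
    {"phone": "r", "start": 600, "end": 730},
    {"phone": "a", "start": 730, "end": 810},
]
morph_labels = [
    {"phone": "r", "accent": -1, "mola": 2},
    {"phone": "a", "accent": -1, "mola": 2},
]
train_labels = attach_accent(segments, morph_labels)
mismatch = attach_accent(segments, [{"phone": "y", "accent": 0, "mola": 3}] * 2)
print(train_labels, mismatch)
```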

FIG. 8 is a diagram illustrating a data configuration example of the learning label data 130. The number of items of learning label data 130 is basically the same as the L items of segment data 114 illustrated in FIG. 2. Each of the L items of learning label data train_label[i] (0≦i≦L), corresponding to the L items of segment data segment[i] (0≦i≦L), holds the variables tlabel_id, phone, start, end, accent, mola, pitch, vowel, prev, and next. tlabel_id is a learning label ID. phone, start, and end are copied from phone, start, and end (see FIG. 2) of the segment data segment[i]. accent and mola are copied from accent and mola (see FIG. 6) of the morpheme label data morph_label[j] (0≦j≦N) associated with the segment data segment[i]. vowel is a flag indicating whether the phoneme label held in phone is a vowel; it is set to "1" for a vowel and "0" for a consonant. prev holds a pointer to the preceding item of learning label data 130, and next holds a pointer to the following item of learning label data 130. pitch is described later.
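As a rough illustration of this record, the following Python sketch builds learning labels from already-aligned segment and morpheme label entries. The field names follow FIG. 8 as described in the text; the dictionary-based inputs, the helper `build_train_labels`, and the vowel set `VOWELS` are assumptions made for this sketch, not part of the specification.

```python
from dataclasses import dataclass, field
from typing import Optional

VOWELS = {"a", "i", "u", "e", "o"}  # assumed vowel set, for illustration only


@dataclass
class TrainLabel:
    tlabel_id: int
    phone: str
    start: int
    end: int
    accent: int
    mola: int
    vowel: int
    pitch: Optional[float] = None  # filled in later, per the text
    prev: Optional["TrainLabel"] = field(default=None, repr=False, compare=False)
    next: Optional["TrainLabel"] = field(default=None, repr=False, compare=False)


def build_train_labels(segments, morph_labels):
    """Copy phone/start/end from each segment and accent/mola from its paired label."""
    labels = []
    for i, (seg, morph) in enumerate(zip(segments, morph_labels)):
        label = TrainLabel(
            tlabel_id=i,
            phone=seg["phone"], start=seg["start"], end=seg["end"],
            accent=morph["accent"], mola=morph["mola"],
            vowel=1 if seg["phone"] in VOWELS else 0,
        )
        if labels:  # link into a doubly linked list via prev/next
            label.prev = labels[-1]
            labels[-1].next = label
        labels.append(label)
    return labels


# Hypothetical start/end sample numbers; accent and mola follow the "ar..." example.
example_segments = [
    {"phone": "a", "start": 0, "end": 800},
    {"phone": "r", "start": 800, "end": 1200},
]
example_morph_labels = [
    {"accent": -2, "mola": 1},
    {"accent": -1, "mola": 2},
]
example_labels = build_train_labels(example_segments, example_morph_labels)
```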

Using the learning label data 130 obtained in this way, the data of the section corresponding to start and end (see FIG. 8) of each item of learning label data 130 is cut out from the speech data 113 of FIG. 1 as phoneme piece data, and this phoneme piece data is registered in the speech database together with the phoneme label registered in phone of the learning label data 130 of FIG. 8 and the accent information registered in accent and mola. At speech synthesis time, an accent position is likewise detected for each accent block during morphological analysis of the text data, and for each phoneme in the target phoneme string, a value indicating its positional relationship to the accent position of the accent block to which the phoneme belongs is obtained. Then, when the phoneme piece data corresponding to a phoneme label is retrieved from the speech database, the phoneme piece data whose accent information has the value closest to the positional-relationship value obtained for that phoneme is extracted. This makes it possible to select phoneme piece data that was pronounced in the same accent state as the text data, and thus to improve the quality of the synthesized speech data.
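The selection rule at synthesis time, "same phoneme label, closest accent value", can be sketched as follows. This is a minimal illustration under stated assumptions: the `Unit` record, the in-memory list used as a database, and the function `select_unit` are hypothetical, and an actual unit-selection synthesizer would typically weigh further costs as well.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    unit_id: int
    phone: str
    accent: int  # mora number minus accent position, as registered in the database


def select_unit(database, phone, target_accent):
    """Return the unit with the given phoneme label whose accent value is closest."""
    candidates = [unit for unit in database if unit.phone == phone]
    if not candidates:
        return None
    return min(candidates, key=lambda unit: abs(unit.accent - target_accent))


database = [
    Unit(0, "a", -2),
    Unit(1, "a", 0),
    Unit(2, "a", 1),
    Unit(3, "r", -1),
]
chosen = select_unit(database, "a", target_accent=2)
```

When two candidates are equally close, `min` returns the first one in the list; the text does not say how such ties should be broken.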

As explained in the section "Problems to Be Solved by the Invention", breath pauses occur in the actual utterance of the speech data 113, and each such pause is a blank timing at which the value of the speech data 113 is silent or close to silent. When the speech data 113 is divided into segments, the blank timing is associated with segment data 114 having the phoneme label "#", which indicates silence. For example, the phoneme label "#" held in the phone variable of the first item (seg_id=0) or the 17th item (seg_id=16) of segment data 114 illustrated in FIG. 3 indicates that the corresponding segment is a blank segment. On the other hand, a blank timing generally corresponds to a break in the linguistic information, in which case the morphological analysis process detects it as a punctuation-mark morpheme. When morpheme label data 124 is extracted from such a punctuation-mark morpheme, the phoneme label "#" indicating a blank timing is registered in its phone variable (see FIG. 6). In this situation, when the data adding unit 103 associates the phoneme string of the segment data 114 with the phoneme string of the morpheme label data 124, the "#" phoneme labels match each other and a correct correspondence is generated.

In the actual utterance of the speech data, however, there are also cases in which no punctuation mark is detected in the linguistic information at the timing of a silence caused by a breath pause. In this case, the phoneme label "#" indicating a blank timing is registered in the segment data 114, but no punctuation-mark morpheme corresponding to that blank timing is output, and therefore no morpheme label data 124 with the phoneme label "#" in its phone variable is output either. For example, in the segment data 114 of FIG. 3, a breath pause occurred in the speech data 113 between the phoneme string "arayurugeNjitsuo" (written "あらゆる現実を") and the phoneme string "subete" (written "すべて"), so that segment data 114 of a blank segment with seg_id=16 and phoneme label phone="#" was generated between the segment data 114 with seg_id=15 and phone="o" and the segment data 114 with seg_id=17 and phone="s". In the morpheme data 123 of FIG. 5, on the other hand, no comma is detected between the morpheme "を" and the morpheme "すべて". Consequently, in the morpheme label data 124 of FIG. 7 extracted from it, no morpheme label data 124 with phone="#" exists between the morpheme label data 124 with mlabel_id=15 and phone="o" and the morpheme label data 124 with mlabel_id=16 and phone="s". If, in this situation, the data adding unit 103 were to associate the phoneme string of the segment data 114 illustrated in FIG. 3 with the phoneme string of the morpheme label data 124 illustrated in FIG. 7 without taking the blank segment into account, then after matching the segment data 114 with seg_id=15 and phone="o" against the morpheme label data 124 with mlabel_id=15 and phone="o", it would compare the blank segment data 114 with seg_id=16 and phone="#" against the morpheme label data 124 with mlabel_id=16 and phone="s". The two phoneme labels would not match, the correspondence would fail, and as a result the phoneme piece data based on this speech data 113 could not be adopted for registration in the speech database.

Therefore, in this embodiment, the data adding unit 103 generates the correspondence between the phoneme string of the segment data 114 and the phoneme string of the morpheme label data 124 while determining the positional relationship between blank segments and morpheme boundaries, and performs control so that the correspondence does not become misaligned. Details of this process are described later with reference to the flowcharts of FIGS. 12 and 13.

Note that the speech data 113 from which the phoneme piece data is derived is not necessarily uttered with the accent positions given by the morpheme data 123 obtained from the corresponding utterance text 122. Therefore, in this embodiment, based on the correspondence correctly generated as described above between the phoneme string of the segment data 114 and the phoneme string of the morpheme label data 124, the data adding unit 103 recalculates a new accent position for each accent block obtained from the morpheme label data 124, as the position corresponding to the segment at which the fundamental frequency extracted from the utterance of the speech data 113 is highest. The data adding unit 103 then corrects the accent information of the learning label data 130 corresponding to the segments belonging to the accent block on the basis of the newly calculated accent position.

Specifically, the fundamental frequency analysis unit 112 in the speech analysis unit 101 of FIG. 1 first extracts the fundamental (pitch) frequency of the speech data 113 at every predetermined frame period (for example, 256 milliseconds) and outputs fundamental frequency data 115.

FIG. 9 is a diagram illustrating a data configuration example of the fundamental frequency data 115. Each of the K items of fundamental frequency data pitch[i] (0≦i≦K-1), obtained by analyzing the fundamental frequency of the speech data 113 at every predetermined frame period, holds the variables time, pitch, prev, and next. time holds the time corresponding to the current frame period (the sample number at the beginning, center, or end of the current frame period). pitch holds the fundamental frequency in hertz [Hz] obtained by the analysis. prev holds a pointer to the preceding item of fundamental frequency data 115, and next holds a pointer to the following item of fundamental frequency data 115.

Using the fundamental frequency data 115 generated by the fundamental frequency analysis unit 112 as described above, the data adding unit 103 generates a new accent position for each accent block and corrects the accent information of the learning label data 130 corresponding to the segments belonging to the accent block on the basis of the newly calculated accent position. Details of this process are described later with reference to the flowchart of FIG. 15.
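The idea of moving the accent position to the segment with the highest fundamental frequency can be sketched as below. The actual procedure is the one described with FIG. 15; this sketch makes several simplifying assumptions that are not taken from the specification: labels are plain dictionaries, every segment of the block is considered, a segment's pitch is the maximum over the frames whose time falls in [start, end), and the function name `reestimate_accent` is hypothetical.

```python
def reestimate_accent(block_labels, frames):
    """Recompute accent values for one accent block from fundamental frequency.

    block_labels: list of dicts with "start", "end", "mola" for one accent block.
    frames: list of (time, pitch_hz) pairs, as in the fundamental frequency data.
    Returns the new accent position (a mora number).
    """
    def peak_pitch(label):
        values = [hz for t, hz in frames if label["start"] <= t < label["end"]]
        return max(values, default=0.0)  # no frame in the segment: treat as 0 Hz

    highest = max(block_labels, key=peak_pitch)
    new_position = highest["mola"]
    for label in block_labels:
        # Same definition as before: mora number minus accent position.
        label["accent"] = label["mola"] - new_position
    return new_position


block = [
    {"start": 0, "end": 10, "mola": 1},
    {"start": 10, "end": 20, "mola": 2},
    {"start": 20, "end": 30, "mola": 3},
    {"start": 30, "end": 40, "mola": 4},
]
pitch_frames = [(0, 110.0), (10, 120.0), (20, 150.0), (30, 130.0)]
new_accent_position = reestimate_accent(block, pitch_frames)
```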

FIG. 10 is a diagram illustrating a hardware configuration example of a computer that can implement, as software processing, the functions of the speech analysis unit 101, the language analysis unit 102, and the data adding unit 103 of the speech information creation apparatus 100 of FIG. 1. The computer shown in FIG. 10 has a CPU 1001, a ROM (read-only memory) 1002, a RAM (random access memory) 1003, an input device 1004, an output device 1005, an external storage device 1006, a portable recording medium driving device 1007 into which a portable recording medium 1010 is inserted, and a communication interface 1008, all connected to one another by a bus 1009. The configuration shown in the figure is one example of a computer that can implement the above system, and such a computer is not limited to this configuration.

The ROM 1002 is a memory that stores programs including a speech analysis processing program that implements the function of the speech analysis unit 101 of FIG. 1, a language analysis processing program that implements the function of the language analysis unit 102 of FIG. 1, and a data addition processing program that implements the function of the data adding unit 103 of FIG. 1. The RAM 1003 is a memory that temporarily stores the programs or data stored in the ROM 1002 when each program is executed.

The external storage device 1006 is, for example, an SSD (solid state drive) storage device or a hard disk storage device, and is used to store the speech data 113, utterance text 122, segment data 114, fundamental frequency data 115, morpheme data 123, morpheme label data 124, and learning label data 130 of FIG. 1, as well as the speech database and the like, which are not specifically shown in FIG. 1.

The CPU 1001 controls the entire computer by reading each program from the ROM 1002 into the RAM 1003 and executing it.

The input device 1004 detects input operations performed by the user with a keyboard, mouse, or the like, and notifies the CPU 1001 of the detection result.

The output device 1005 outputs data sent under the control of the CPU 1001 to a display device or a printing device.

The portable recording medium driving device 1007 accommodates a portable recording medium 1010 such as an optical disk, SDRAM, or CompactFlash, and serves in an auxiliary role to the external storage device 1006.

The communication interface 1008 is a device for connecting to a communication line such as a LAN (local area network) or a WAN (wide area network).

The system according to this embodiment is implemented by the CPU 1001 reading, from the ROM 1002 into the RAM 1003, and executing the speech analysis processing program that implements the function of the speech analysis unit 101 of FIG. 1, the language analysis processing program that implements the function of the language analysis unit 102 of FIG. 1, and the data addition processing program that implements the function of the data adding unit 103 of FIG. 1. These programs may, for example, be recorded on and distributed via the external storage device 1006 or the portable recording medium 1010, or may be made obtainable from a network through the communication interface 1008.

The functions of the speech analysis unit 101 and the language analysis unit 102 of FIG. 1, which the CPU 1001 implements by executing the speech analysis processing program and the language analysis processing program respectively, are as described above with reference to FIGS. 1 to 9. As a result, the following are generated in the RAM 1003 of FIG. 10: the segment data 114 with the data configuration of FIG. 2 (specific example in FIG. 3), the morpheme data 123 with the data configuration of FIG. 4 (specific example in FIG. 5), the morpheme label data 124 with the data configuration of FIG. 6 (specific example in FIG. 7), and the fundamental frequency data 115 with the data configuration of FIG. 9.

FIG. 11 is a flowchart illustrating an example of the data addition process that implements the function of the data adding unit 103 of FIG. 1. In this data addition process, the CPU 1001 reads and executes the data addition processing program stored in the ROM 1002 while using the RAM 1003 as work memory.

In the data addition process, the CPU 1001 first executes the association process (step S1101).

Next, the CPU 1001 determines whether an error was detected as a result of the association process, that is, whether the error flag in the RAM 1003 (described later) is on (step S1102). If no error was detected (the determination in step S1102 is NO), the CPU 1001 executes the accent correction process (step S1103). As a result, the CPU 1001 outputs the learning label data 130 corresponding to the input speech data 113 and utterance text 122 (see FIG. 1 for each) and ends the data addition process. If an error was detected (the determination in step S1102 is YES), the CPU 1001 does not adopt the input speech data 113 and utterance text 122 for learning, outputs no learning label data 130, and ends the data addition process.
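The control flow of FIG. 11 amounts to "associate, then correct accents only if association succeeded". A minimal Python sketch is given below; the function `data_addition` and the two callables passed to it are hypothetical stand-ins for the association process and the accent correction process, and returning `None` is merely this sketch's way of expressing that no learning label data is output.

```python
def data_addition(segments, morph_labels, associate, correct_accents):
    """Run association, then accent correction unless an error was flagged."""
    pairs, error = associate(segments, morph_labels)  # step S1101
    if error:                                         # step S1102: YES
        return None                                   # utterance is not adopted
    return correct_accents(pairs)                     # step S1103


# Stand-in callables, for illustration only.
def always_ok(segments, morph_labels):
    return list(zip(segments, morph_labels)), False


def always_fail(segments, morph_labels):
    return [], True


def identity(pairs):
    return pairs


accepted = data_addition(["a", "r"], ["a", "r"], always_ok, identity)
rejected = data_addition(["a", "r"], ["a", "s"], always_fail, identity)
```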

FIGS. 12 and 13 are flowcharts illustrating a detailed example of the association process in step S1101 of FIG. 11. In this association process, as described above, an appropriate correspondence is generated between the phoneme string of the segment data 114 (see FIGS. 2 and 3) and the phoneme string of the morpheme label data 124 (see FIGS. 6 and 7), each of which has been generated in the RAM 1003, while the positional relationship between blank segments and morpheme boundaries is determined.

In the association process, the CPU 1001 repeatedly executes the following, proceeding sequentially from the head of the set of segment data 114 and the head of the set of morpheme label data 124: a process of determining whether the phoneme label on the segment data 114 side matches the phoneme label on the morpheme label data 124 side (step S1205); a process of determining whether the phoneme label on the segment data 114 side is "#", corresponding to a blank segment (step S1204); and, when the determination in step S1204 is YES, a process of determining whether the phoneme label on the morpheme label data 124 side is "#", corresponding to a punctuation mark (step S1208).

To control this iteration, the CPU 1001 sets the variables seg and morph in the RAM 1003 to the head address of the segment data 114 in the RAM 1003 and the head address of the morpheme label data 124 in the RAM 1003, respectively (step S1201). In the data configuration example of the segment data 114 in FIG. 2, the address of segment[0] is set in the variable seg, and in the data configuration example of the morpheme label data 124 in FIG. 6, the address of morph_label[0] is set in the variable morph. Then, while it is determined that neither the variable seg nor the variable morph has gone past the last item of data (the determination in step S1202 is YES), the CPU 1001 repeatedly executes the series of processes from step S1203 to step S1215 described below.

First, the CPU 1001 turns off both the increment flag on the segment data 114 side and the increment flag on the morpheme label data 124 side, each of which is stored in the RAM 1003 (step S1203).

Next, the CPU 1001 determines whether the phoneme label on the segment data 114 side (the value of the phone variable in FIG. 2) is "#", corresponding to a blank segment (step S1204).

If the determination in step S1204 is NO, the CPU 1001 determines whether the phoneme label on the segment data 114 side and the phoneme label on the morpheme label data 124 side match (step S1205).

If the phoneme label on the segment data 114 side is not "#" indicating a blank segment (the determination in step S1204 is NO) and the determination in step S1205 is also NO because the phoneme label on the segment data 114 side and the phoneme label on the morpheme label data 124 side do not match, then the speech data 113 and the utterance text 122 do not correspond properly in the first place. In such a case, an error should be raised, the input speech data 113 and utterance text 122 should not be adopted for learning, and no learning label data 130 should be output. Accordingly, when the determination in step S1204 is NO and the determination in step S1205 is also NO, the CPU 1001 sets the error flag stored in the RAM 1003 to on (step S1207) and then ends the association process of step S1101 of FIG. 11 shown in the flowcharts of FIGS. 12 and 13. As a result, the determination in step S1102 of FIG. 11 becomes YES, the accent correction process of step S1103 is not executed, and the data addition process ends without generating learning label data 130.

If the determination in step S1205 is YES, the CPU 1001 sets both the increment flag on the segment data 114 side and the increment flag on the morpheme label data 124 side in the RAM 1003 to on (step S1206).

Thereafter, the CPU 1001 determines whether the increment flag on the segment data 114 side is on (step S1212); because of the processing in step S1206, the determination in step S1212 is YES. As a result, the CPU 1001 sets the address of the next item of segment data 114 in the variable seg (step S1213). Specifically, the CPU 1001 sets the value of the next variable (see FIG. 2) of the current segment data 114 in the variable seg.

Subsequently, the CPU 1001 determines whether the increment flag on the morpheme label data 124 side is on (step S1214); because of the processing in step S1206, the determination in step S1214 is YES. As a result, the CPU 1001 sets the address of the next item of morpheme label data 124 in the variable morph (step S1215). Specifically, the CPU 1001 sets the value of the next variable (see FIG. 6) of the current morpheme label data 124 in the variable morph.

Thereafter, the CPU 1001 returns to step S1202 and proceeds to the next iteration.

As described above, when the phoneme label on the segment data 114 side and the phoneme label on the morpheme label data 124 side match, both the variable seg and the variable morph are advanced, and processing moves on to associating the phoneme of the next segment data 114 with the phoneme of the next morpheme label data 124.

If the determination in step S1204 described above is YES, the CPU 1001 determines whether the phoneme label on the morpheme label data 124 side (phone in FIG. 6) is "#", corresponding to a punctuation mark (step S1208).

If the determination in step S1208 is YES, the CPU 1001 sets both the increment flag on the segment data 114 side and the increment flag on the morpheme label data 124 side in the RAM 1003 to on (step S1209).

Thereafter, the CPU 1001 proceeds to the determination of step S1212 described above; because of the processing in step S1209, the determination in step S1212 is YES. As a result, the CPU 1001 proceeds to step S1213 described above and sets the address of the next item of segment data 114 in the variable seg.

Subsequently, the CPU 1001 proceeds to the determination of step S1214 described above; because of the processing in step S1209, the determination in step S1214 is YES. As a result, the CPU 1001 proceeds to step S1215 described above and sets the address of the next item of morpheme label data 124 in the variable morph.

As described above, when the phoneme label on the segment data 114 side is "#", corresponding to a blank segment, and the phoneme label on the morpheme label data 124 side is "#", corresponding to a punctuation mark, so that the two correspond, both the variable seg and the variable morph are advanced, and processing moves on to associating the phoneme of the next segment data 114 with the phoneme of the next morpheme label data 124.
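The branches of the association loop can be sketched over plain lists of phoneme labels as follows. This is a simplified illustration, not the flowchart itself: the increment flags and linked-list pointers are replaced by list indices, the function name `associate` is hypothetical, and the remaining branch (a blank segment with no "#" on the morpheme label side, step S1208 NO) is handled here by directly inserting a "#" label, which is an assumption standing in for the comma-morpheme insertion of step S1210. What should happen when one list ends before the other is not stated in this passage, so the sketch simply stops.

```python
def associate(seg_phones, morph_phones):
    """Align segment phoneme labels with morpheme-label phoneme labels.

    Returns (pairs, aligned_morph_phones, error), where pairs are index pairs
    (segment index, morpheme label index) into the returned aligned list.
    """
    morph = list(morph_phones)  # work on a copy; "#" labels may be inserted
    pairs = []
    i = j = 0
    while i < len(seg_phones) and j < len(morph):    # cf. step S1202
        if seg_phones[i] != "#":                     # cf. step S1204: NO
            if seg_phones[i] != morph[j]:            # cf. step S1205: NO
                return pairs, morph, True            # cf. step S1207: error
        elif morph[j] != "#":                        # cf. step S1208: NO
            morph.insert(j, "#")                     # assumed stand-in for S1210
        pairs.append((i, j))
        i += 1                                       # cf. step S1213
        j += 1                                       # cf. step S1215
    return pairs, morph, False


# A breath pause ("#") between "o" and "s" with no comma in the text.
segments = ["#", "a", "o", "#", "s", "u"]
morph_labels = ["#", "a", "o", "s", "u"]
pairs, aligned, error = associate(segments, morph_labels)
```

With these inputs the "#" at segment index 3 causes a "#" label to be inserted at the same index on the morpheme label side, after which the remaining phonemes line up one to one.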

If the determination in step S1208 is NO, the CPU 1001, in order to insert "#" morpheme label data before the current morpheme label data, inserts morpheme data 123 representing a comma into the morpheme data 123 held in the RAM 1003, immediately before the morpheme data 123 that contains the phoneme of the current morpheme label data 124 (step S1210). Specifically, the CPU 1001 first refers to the value of the morph_id variable (see FIG. 6) of the current morpheme label data 124 in the RAM 1003 indicated by the variable morph, and searches the morpheme data 123 in the RAM 1003 for the entry whose morph_id variable (FIG. 4) holds the same value. Next, the CPU 1001 creates a new entry of morpheme data 123 (see FIG. 4) at the end of the morpheme data 123 in the RAM 1003. In this newly created entry, the CPU 1001 assigns a new value to the morph_id variable, stores the value "、" representing a comma in each of the original, read, and pronounce variables, stores "symbol" in the group variable, and stores the accent information "0" in the accent variable. Furthermore, in the newly created entry, the CPU 1001 stores in the prev and next variables the value of the prev variable and the value of the morph_id variable of the found morpheme data 123, respectively. Finally, the CPU 1001 stores the morph_id value of the newly created entry in the next variable of the morpheme data 123 indicated by the prev variable of the found morpheme data 123, and then also changes the prev variable of the found morpheme data 123 to the morph_id value of the newly created entry.

For example, in the specific examples of the segment data 114 in FIG. 3, the morpheme data 123 in FIG. 5, and the morpheme label data 124 in FIG. 7, when the variables seg and morph both have the value “16”, step S1204 determines that the phoneme label of the segment data 114 whose seg_id variable in FIG. 6 is “16” is “#”, so the determination result is YES. Step S1208 then determines that the phoneme label of the morpheme label data 124 whose mlabel_id in FIG. 7 is “16” is “s”, so the determination result is NO. As a result, by executing step S1210, the CPU 1001 refers to the value “3” of the morph_id variable of that morpheme label data 124 and searches the morpheme data 123 of FIG. 5 for the morpheme “すべて” (“all”), whose morph_id variable is “3”. Next, the CPU 1001 generates a new morpheme data 123 entry (see FIG. 4) at the end of the morpheme data 123 of FIG. 5. FIG. 14 shows a specific example of the morpheme data 123 after the entry is added. In the last entry of FIG. 14, the CPU 1001 assigns the new value “13” to the morph_id variable, stores the value “、” representing a comma in each of the original, read, and pronounce variables, stores “symbol” in the group variable, and stores the accent information “0” in the accent variable. Further, the CPU 1001 sets the prev value “2” and the morph_id value “3” of the retrieved entry for the morpheme “すべて” in the prev variable and the next variable, respectively, of the last entry of FIG. 14. Finally, the CPU 1001 stores the morph_id value “13” of the last entry of FIG. 14 in the next variable of the entry for the morpheme “を” (“o”), whose morph_id in FIG. 5 is “2” as indicated by the prev value “2” of the entry for “すべて”, and also changes the prev variable of the entry for “すべて” to “13”. This yields the morpheme data 123 shown in FIG. 14. Consequently, following the next value “13” of the entry for “を” (morph_id “2”) leads to the comma entry “、” (morph_id “13”), and following the next value “3” of that entry leads to the entry for “すべて” (morph_id “3”).
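The pointer updates of step S1210 amount to inserting a node into a doubly linked list whose links are morph_id values. The following Python fragment is a minimal sketch under stated assumptions, not the implementation of the embodiment: the `Morpheme` class, the dictionary keyed by morph_id, the sequential ID allocation, and the helper names `insert_comma_before` and `in_order` are all hypothetical, and the three-entry list in the usage note is toy data rather than the contents of FIG. 5.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Morpheme:
    """One morpheme data 123 entry; field names follow the description."""
    morph_id: int
    original: str
    read: str
    pronounce: str
    group: str
    accent: int
    prev: Optional[int]   # morph_id of the preceding entry, None at the head
    next: Optional[int]   # morph_id of the following entry, None at the tail


def insert_comma_before(morphemes: Dict[int, Morpheme], target_id: int) -> int:
    """Append a comma entry and link it in just before `target_id` (step S1210)."""
    target = morphemes[target_id]
    new_id = max(morphemes) + 1            # assumption: IDs are allocated sequentially
    morphemes[new_id] = Morpheme(
        morph_id=new_id, original="、", read="、", pronounce="、",
        group="記号", accent=0,             # "記号" means "symbol"
        prev=target.prev, next=target.morph_id,
    )
    if target.prev is not None:
        morphemes[target.prev].next = new_id   # predecessor now points at the comma
    target.prev = new_id                       # target now points back at the comma
    return new_id


def in_order(morphemes: Dict[int, Morpheme], head_id: int) -> List[str]:
    """Follow the next links from `head_id` and collect the surface forms."""
    out: List[str] = []
    cur: Optional[int] = head_id
    while cur is not None:
        out.append(morphemes[cur].original)
        cur = morphemes[cur].next
    return out
```

With a toy list of three entries (IDs 1 to 3) in which entry 2 is “を” and entry 3 is “すべて”, calling `insert_comma_before(morphemes, 3)` appends entry 4 and relinks the list so that traversal visits the comma between “を” and “すべて”, even though the comma is physically stored last.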

In the above example, in the segment data 114 of FIG. 3, a breath was taken in the utterance of the speech data 113 between the phoneme string “arayurugeNjitsuo” (written “あらゆる現実を”, “every reality”) and the phoneme string “subete” (written “すべて”, “all”). As a result, segment data 114 for a blank segment with seg_id=16 and phoneme label phone=“#” was generated between the segment data 114 with seg_id=15, phone=“o” and the segment data 114 with seg_id=17, phone=“s”. In the morpheme data 123 of FIG. 5, on the other hand, no comma is detected between the morpheme “を” and the morpheme “すべて”. In such a case, in this embodiment, the CPU 1001 can insert a comma “、” between the morpheme “を” and the morpheme “すべて” in the morpheme data 123 of FIG. 5 through the operation described above.

After the process of step S1210 in FIG. 12 described above, the CPU 1001 turns on the increment flag for the segment data 114 side in the RAM 1003 (step S1211). The increment flag for the morpheme label data 124 side in the RAM 1003, on the other hand, remains off as set in step S1203.

The CPU 1001 then proceeds to the determination of step S1212 described above; because of the process of step S1211, this determination is YES. The CPU 1001 therefore proceeds to step S1213 described above and sets the address of the next segment data 114 in the variable seg.

Subsequently, the CPU 1001 proceeds to the determination of step S1214 described above; because of the process of step S1203, this determination is NO. The CPU 1001 therefore skips step S1215 described above and leaves the variable morph at the address of the current morpheme data 123.

As described above, when the phoneme label on the segment data 114 side is “#”, corresponding to a blank segment, the phoneme label on the morpheme label data 124 side is not “#”, corresponding to a punctuation mark, and a punctuation mark has therefore been inserted into the morpheme data 123, only the variable seg is incremented. Processing then moves on to associating the phoneme of the next segment data 114 with the phoneme of the current morpheme label data 124.

In the specific examples of the segment data 114 in FIG. 3, the morpheme data 123 in FIGS. 5 and 14, and the morpheme label data 124 in FIG. 7, when the variables seg and morph both have the value “16”, step S1204 determines that the phoneme label of the segment data 114 whose seg_id variable in FIG. 6 is “16” is “#” (YES), the determination of step S1208 is then NO, and in step S1210 a punctuation mark is inserted into the morpheme data 123 as shown in FIG. 14. Processing then proceeds through step S1211, step S1212 (YES), step S1213, and step S1214 (NO) back to step S1202, so that only the variable seg is incremented, to the value “17” indicated by the next variable of the segment data 114 whose seg_id variable in FIG. 3 is “16”. As a result, the next comparison is between the phoneme label “s” of the segment data 114 whose seg_id variable in FIG. 3 is “17”, retrieved using the updated value “17” of the variable seg, and the phoneme label “s” of the morpheme label data 124 whose mlabel_id variable in FIG. 7 is “16”, retrieved using the unchanged value “16” of the variable morph. The two are thus successfully associated.
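The effect of the two increment flags can be illustrated with a short Python sketch. It covers only the branch discussed here (S1204 YES, S1208 NO) faithfully; the treatment of every other branch, namely advancing both pointers, is an assumption made to keep the loop runnable, since those branches are not described in this part of the text. The function name `find_missing_commas` and the list-based data layout are hypothetical.

```python
from typing import List, Tuple

BLANK = "#"   # phoneme label of a blank segment or a punctuation mark


def find_missing_commas(seg_labels: List[str],
                        morph_labels: List[Tuple[str, int]]) -> List[int]:
    """Return the morph_ids before which a comma would be inserted.

    seg_labels   -- phoneme labels of the segment data, in order
    morph_labels -- (phoneme label, morph_id) pairs of the morpheme label data
    """
    insert_before: List[int] = []
    seg = morph = 0
    while seg < len(seg_labels) and morph < len(morph_labels):   # S1202
        inc_seg = inc_morph = False                              # S1203
        morph_phone, morph_id = morph_labels[morph]
        if seg_labels[seg] == BLANK and morph_phone != BLANK:    # S1204 YES, S1208 NO
            insert_before.append(morph_id)                       # S1210
            inc_seg = True                                       # S1211
        else:
            # Assumption: every other branch advances both pointers.
            inc_seg = inc_morph = True
        if inc_seg:                                              # S1212 -> S1213
            seg += 1
        if inc_morph:                                            # S1214 -> S1215
            morph += 1
    return insert_before
```

For the segment labels `["o", "#", "s", "u"]` and the morpheme labels `[("o", 2), ("s", 3), ("u", 3)]`, the blank segment is met while the morpheme side points at “s”, so morph_id 3 is recorded and only the segment pointer advances; the next comparison is then “s” against “s”.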

When, as a result of the above iteration, either the variable seg or the variable morph is determined to have passed the last data item (the determination of step S1202 is NO), the CPU 1001 executes the processing from step S1216 of FIG. 13 onward. First, the CPU 1001 concatenates all the morpheme notations stored in the original variables of the morpheme data 123 obtained in the RAM 1003 by the processing so far (for example, those shown in FIG. 14) into text data, executes the morphological analysis process (corresponding to the morphological analysis unit 120 in FIG. 1) on that text data again, and generates the corresponding morpheme data 123 and morpheme label data 124 in the RAM 1003 (step S1216).
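Because an inserted comma entry is stored at the end of the morpheme data 123 but linked into the middle, the concatenation in step S1216 presumably has to follow the next links rather than the storage order. The sketch below illustrates that reading; the traversal order is an assumption, the function name `rebuild_text` and the dictionary layout are hypothetical, and the morphological analysis itself is not shown.

```python
from typing import Dict, List, Optional


def rebuild_text(morphemes: Dict[int, dict], head_id: int) -> str:
    """Join the `original` fields in linked order, starting at `head_id`."""
    parts: List[str] = []
    cur: Optional[int] = head_id
    while cur is not None:
        entry = morphemes[cur]
        parts.append(entry["original"])
        cur = entry["next"]        # None marks the last entry
    return "".join(parts)
```

The resulting string is what would be passed to the morphological analyzer a second time.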

The CPU 1001 then repeatedly executes a process of determining again, sequentially from the head of the set of segment data 114 and the head of the set of morpheme label data 124 newly generated in the RAM 1003, whether the phoneme label on the segment data 114 side matches the phoneme label on the morpheme label data 124 side (step S1220).

Specifically, the CPU 1001 sets the variables seg and morph in the RAM 1003 to the head address of the segment data 114 in the RAM 1003 and the head address of the morpheme label data 124 newly generated in the RAM 1003, respectively (step S1217).

Next, in order to generate the learning label data 130 in the RAM 1003, the CPU 1001 sets the variable tr in the RAM 1003 to the address of the first free area for the learning label data 130 in the RAM 1003 (step S1218).

While it is determined that neither the variable seg nor the variable morph has passed the last data item (the determination of step S1219 is YES), the CPU 1001 executes the following series of processes from step S1220 to step S1224.

First, the CPU 1001 determines whether the phoneme label on the segment data 114 side matches the phoneme label on the morpheme label data 124 side (step S1220).

Through the processing of the flowchart of FIG. 12 described above, commas have been additionally inserted into the morpheme data 123 of FIG. 14 in correspondence with the blank-timing segment data 114 generated when a breath or the like occurs in the utterance of the speech data 113 at a timing corresponding to a morpheme boundary. Therefore, when the morphological analysis process of step S1216 in FIG. 13 is executed again on the text data obtained by rejoining such morpheme data 123, the phoneme string of the regenerated morpheme label data 124 corresponds well to the phoneme string of the segment data 114.

On the other hand, when a blank timing in the utterance of the speech data 113 is caused by a hesitation, the blank timing may fall not on a morpheme boundary but on a phoneme in the middle of a morpheme. In this case, the blank-timing segment data 114 is detected by the determination of step S1204 in FIG. 12 (YES), the determination of the subsequent step S1208 is NO, and in step S1210 morpheme data 123 is forcibly inserted at a morpheme boundary even though the blank timing occurred at a mid-morpheme phoneme with no boundary. If the morphological analysis process of step S1216 in FIG. 13 is executed again in this state, the phoneme string of the regenerated morpheme label data 124 will not correspond properly to the phoneme string of the segment data 114. Speech data 113 that contains hesitations is unsuitable for creating phoneme piece data in the first place, so in such a case an error should be raised, the input speech data 113 and utterance text 122 should not be adopted for learning, and the learning label data 130 should not be output.

Therefore, when the determination of step S1220 is NO because the phoneme label on the segment data 114 side and the phoneme label on the morpheme label data 124 side do not match, the CPU 1001 sets the error flag stored in the RAM 1003 to on (step S1224) and then ends the association process of step S1101 in FIG. 11, which is shown in the flowcharts of FIGS. 12 and 13. As a result, the determination of step S1102 in FIG. 11 is YES, the accent correction process of step S1103 is not executed, and the data assignment process ends without the learning label data 130 being generated.

When the determination of step S1220 is YES because the phoneme label on the segment data 114 side and the phoneme label on the morpheme label data 124 side match, the CPU 1001 writes the contents of the segment data 114 indicated by the variable seg and the contents of the morpheme label data 124 indicated by the variable morph into the new area of the learning label data 130 (see FIG. 8) in the RAM 1003 indicated by the variable tr (step S1221). Specifically, the CPU 1001 copies into the new area the contents of the phone, start, and end variables of the segment data 114 (see FIG. 2) in the RAM 1003 indicated by the variable seg, and the contents of the accent and mola variables of the morpheme label data 124 (see FIG. 5) in the RAM 1003 indicated by the variable morph. In the new area, the CPU 1001 also sets the vowel variable (see FIG. 8) to the value “1”, indicating a vowel, when the phoneme label in the phone variable indicates a vowel phoneme, and to the value “0”, indicating a consonant, when it indicates a consonant phoneme. The CPU 1001 further sets the tlabel_id variable in the new area to a new morpheme label ID value. In addition, in the new area, the CPU 1001 sets the prev variable to the morpheme label ID value of the immediately preceding learning label data 130 (a NULL value for the first entry) and sets the next variable to the morpheme label ID value of the immediately following learning label data 130 (a NULL value for the last entry). The contents of the pitch variable are set in the accent correction process described later.

After step S1221, the CPU 1001 sets the variables seg and morph in the RAM 1003 to the addresses of the next segment data 114 and the next morpheme label data 124, respectively (step S1222). Specifically, the CPU 1001 sets the value of the next variable of the current segment data 114 (see FIG. 2) and the value of the next variable of the current morpheme label data 124 (see FIG. 6) in the variables seg and morph, respectively.

Further, the CPU 1001 sets the variable tr in the RAM 1003 to the address of the next free area for the learning label data 130 in the RAM 1003 (step S1223).

The CPU 1001 then returns to step S1219 in FIG. 13 and moves on to the next iteration.

When, as a result of the above iteration, either the variable seg or the variable morph is determined to have passed the last data item (the determination of step S1219 is NO), the CPU 1001 ends the association process of step S1101 in FIG. 11, which is shown in the flowcharts of FIGS. 12 and 13. In this case, the learning label data 130 has been generated without an error, so the determination of step S1102 in FIG. 11 is NO and the accent correction process of step S1103 in FIG. 11 is executed.
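The second pass (steps S1217 to S1224) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the embodiment itself: lists of dictionaries stand in for the linked records in the RAM 1003, returning `None` stands in for the error flag, label IDs are assumed to start at 1, and the vowel label set `VOWELS` is an assumption because the text does not enumerate the vowel labels. The function name `build_learning_labels` is hypothetical.

```python
from typing import Dict, List, Optional

VOWELS = {"a", "i", "u", "e", "o"}   # assumption: not listed in the description


def build_learning_labels(segments: List[Dict],
                          morph_labels: List[Dict]) -> Optional[List[Dict]]:
    """Return learning label records, or None if any phoneme pair mismatches."""
    labels: List[Dict] = []
    for seg, ml in zip(segments, morph_labels):   # S1219: stop when either side ends
        if seg["phone"] != ml["phone"]:           # S1220 is NO
            return None                           # S1224: error, nothing is output
        labels.append({                           # S1221
            "tlabel_id": len(labels) + 1,
            "phone": seg["phone"], "start": seg["start"], "end": seg["end"],
            "accent": ml["accent"], "mola": ml["mola"],
            "vowel": 1 if seg["phone"] in VOWELS else 0,
            "pitch": None,                        # set later by accent correction
            "prev": None, "next": None,
        })
    for i, label in enumerate(labels):            # link neighbours by tlabel_id
        if i > 0:
            label["prev"] = labels[i - 1]["tlabel_id"]
        if i + 1 < len(labels):
            label["next"] = labels[i + 1]["tlabel_id"]
    return labels
```

A caller would treat a `None` result the same way the flowchart treats the error flag: skip accent correction and discard the utterance for learning purposes.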

FIG. 15 is a flowchart showing a detailed example of the accent correction process of step S1103 in FIG. 11. The accent correction process uses the learning label data 130 generated in the RAM 1003 by the association process of step S1101 in FIG. 11 (FIGS. 12 and 13) and the fundamental frequency data 115 generated in the RAM 1003 by the fundamental frequency analysis process corresponding to the function of the fundamental frequency analysis unit 112 in the speech analysis unit 101 of FIG. 1. It generates a new accent position for each accent block and corrects the accent information of the learning label data 130 corresponding to the segments belonging to that accent block on the basis of the newly calculated accent position.

First, for each learning label data 130 in the RAM 1003 (see FIG. 8), the CPU 1001 calculates the average fundamental frequency within the segment interval determined by the values of the start and end variables of that learning label data 130 (step S1501). Specifically, from the fundamental frequency data 115 in the RAM 1003 (see FIG. 9), the CPU 1001 extracts each item whose time variable falls within the interval, reads the fundamental frequency held in the pitch variable of each extracted item, and calculates the average of those fundamental frequencies. The CPU 1001 then stores the calculated average in the pitch variable (see FIG. 8) of the learning label data 130 as the fundamental frequency of the segment interval corresponding to that learning label data 130.
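The averaging of step S1501 can be sketched in a few lines of Python. The half-open interval `start <= time < end` and the `None` result for an interval with no frames are assumptions, since the text does not say how boundary frames or empty intervals are handled; the function name `mean_pitch` and the `(time, pitch)` tuple layout are hypothetical.

```python
from typing import List, Optional, Tuple


def mean_pitch(f0_frames: List[Tuple[float, float]],
               start: float, end: float) -> Optional[float]:
    """Average the pitch of all frames whose time lies in [start, end)."""
    values = [pitch for time, pitch in f0_frames if start <= time < end]
    if not values:
        return None          # assumption: no frames means no pitch estimate
    return sum(values) / len(values)
```

For example, with frames at 0.00, 0.01, and 0.02 s holding 100, 110, and 120 Hz, a segment spanning 0.00 to 0.03 s averages those three values.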

Next, in step S1502, the CPU 1001 sets the variable tr in the RAM 1003 to the head address of the learning label data 130 generated in the RAM 1003. Then, while setting the variable tr to the next address of the learning label data 130 in step S1507 on each pass, the CPU 1001 repeatedly executes the series of processes from step S1504 to step S1506 until it determines in step S1503 that the address in the variable tr has passed the end of the learning label data 130.

In this iteration, the CPU 1001 first determines whether the segment corresponding to the learning label data 130 indicated by the variable tr is an accent block boundary (step S1504). Specifically, when the mora number held in the mola variable of the learning label data 130 indicated by the variable tr (see FIG. 8) is smaller than the mora number held in the mola variable of the immediately preceding learning label data 130, indicated by the prev variable of the current learning label data 130, the CPU 1001 determines that the accent block to which the segment of the immediately preceding learning label data 130 belongs has ended, and the determination of step S1504 is YES. Alternatively, when the phone variable of the learning label data 130 indicated by the variable tr holds the phoneme label “#”, indicating a blank segment, the CPU 1001 determines that the accent block to which the segment of the immediately preceding learning label data 130, indicated by the prev variable, belongs has ended, and the determination of step S1504 is YES.
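The two boundary conditions of step S1504 can be expressed as a small predicate. The sketch below is illustrative only: the dictionary layout, the function names `is_block_boundary` and `block_end_indices`, and the choice to report no boundary at the very first label (which has no predecessor) are assumptions.

```python
from typing import Dict, List, Optional


def is_block_boundary(prev_label: Optional[Dict], label: Dict) -> bool:
    """True if the accent block containing `prev_label` ends at `label`."""
    if prev_label is None:
        return False                     # assumption: nothing to close at the head
    if label["phone"] == "#":            # blank segment closes the previous block
        return True
    return label["mola"] < prev_label["mola"]   # mora number restarted


def block_end_indices(labels: List[Dict]) -> List[int]:
    """Indices i at which the block containing labels[i - 1] ends."""
    return [i for i in range(1, len(labels))
            if is_block_boundary(labels[i - 1], labels[i])]
```

A drop in the mora number, for instance from 2 back to 1, or a “#” label each marks the end of the preceding block.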

When the determination of step S1504 is YES, the CPU 1001 considers, within the range of the accent block to which the segment of the immediately preceding learning label data 130 belongs, the learning label data 130 whose segment interval, determined by the values of the start and end variables (see FIG. 8), belongs to that accent block. Among these, it determines the phoneme position (for example, the value of the start variable) of the learning label data 130 whose pitch variable (see FIG. 8) holds the highest fundamental frequency as the new accent position (mora number) of that accent block. If the value of the vowel variable (see FIG. 8) of that learning label data 130 is “0”, that is, if the phoneme of the segment is a consonant, the phoneme position of the learning label data 130 of the vowel phoneme before or after it is used as the new accent position (step S1505).

After step S1505, for each learning label data 130 whose segment interval, determined by the values of the start and end variables (see FIG. 8), belongs to the accent block, the CPU 1001 corrects the value of the accent variable (see FIG. 8) to the difference between the mora number held in the mola variable (see FIG. 8) and the new accent position calculated in step S1505; the difference is negative before the accent position and positive after it (step S1506).
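Steps S1505 and S1506 can be sketched together for a single accent block. The fragment below assumes that the accent position is expressed as the mora number of the chosen label, that a consonant peak is moved to the following vowel if there is one and otherwise to the preceding vowel (the text only says “before or after”), and that labels without a pitch value are ignored when searching for the peak. The function name `correct_accent` and the dictionary layout are hypothetical.

```python
from typing import Dict, List, Optional


def correct_accent(block: List[Dict]) -> Optional[int]:
    """Rewrite `accent` in place for one accent block; return the accent mora."""
    voiced = [i for i, label in enumerate(block) if label["pitch"] is not None]
    if not voiced:
        return None                                   # nothing to base a decision on
    peak = max(voiced, key=lambda i: block[i]["pitch"])   # S1505: highest pitch
    if block[peak]["vowel"] == 0:                     # consonant: move to a vowel
        neighbours = [i for i in (peak + 1, peak - 1)
                      if 0 <= i < len(block) and block[i]["vowel"] == 1]
        if neighbours:
            peak = neighbours[0]                      # assumption: prefer the next vowel
    accent_mora = block[peak]["mola"]
    for label in block:                               # S1506: signed mora distance
        label["accent"] = label["mola"] - accent_mora
    return accent_mora
```

For a three-mora block whose pitch peaks in the second mora, the labels of the first mora receive −1, those of the second mora 0, and those of the third mora +1.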

The CPU 1001 then proceeds to step S1507.

When the determination of step S1504 is NO, the CPU 1001 skips steps S1505 and S1506 and proceeds to step S1507.

In step S1507, the CPU 1001 sets the variable tr to the next address of the learning label data 130. Specifically, the CPU 1001 stores in the variable tr the value of the next variable of the learning label data 130 (see FIG. 8) in the RAM 1003 that the variable tr indicated before the change. The CPU 1001 then returns to step S1503.

When, after the above iteration, the address in the variable tr is determined to have passed the end of the learning label data 130 and the determination of step S1503 is therefore YES, the CPU 1001 ends the accent correction process of step S1103 in FIG. 11, which is shown in the flowchart of FIG. 15, and outputs the learning label data 130 generated in the RAM 1003 to the external storage device 1006 or the like as the final learning label data 130 for creating a speech database of phoneme piece data.

As described above, in this embodiment, correct accent information can be obtained by taking the blank timing of the speech data into consideration, and the accent information can also be corrected properly using the fundamental frequency obtained from the speech data.

In the embodiment described above, the accent information of each segment in an accent block is corrected according to the position of the segment with the highest fundamental frequency in that accent block. Alternatively, the accent information of each segment in an accent block may be corrected according to the position of the segment with the maximum signal strength in that accent block.

The embodiment described above concerns the accent information creation device 100 for creating the learning label data 130 used to create a speech database of phoneme piece data. Alternatively, the invention may be implemented as a speech database creation device that includes a database registration unit in addition to the speech analysis unit 101, the language analysis unit 102, and the data assignment unit 103 shown in FIG. 1. For each segment, this database registration unit cuts out the portion of the speech data corresponding to the segment as phoneme piece data, and registers the phoneme piece data and the accent information that the data assignment unit 103 can extract from the learning label data 130 in the speech database, together with the phoneme label representing the phoneme of the segment.

Regarding the above embodiment, the following additional notes are disclosed.
(Appendix 1)
A speech information creation device comprising a processing unit that executes: an acquisition process of acquiring, from input speech data, first position data indicating at least one of an accent position and a break position; a comparison process of comparing the first position data acquired from the speech data with second position data indicating at least one of a position of an accent assigned to morpheme data that includes a plurality of morphemes generated from text data corresponding to the speech data and a break position between the plurality of morphemes; and a process of assigning, when the comparison process finds that the first and second position data do not match, the first position data to the morpheme data in place of the second position data.
(Appendix 2)
The speech information creation device according to appendix 1, further comprising a morphological analysis unit that generates the morpheme data including the plurality of morphemes by executing a morphological analysis process on the text data.
(Appendix 3)
The speech information creation device according to appendix 1 or 2, wherein the processing unit executes, as the acquisition process, a process of acquiring the position of a silent section of the speech data as the break position of the first position data.
(Appendix 4)
The speech information creation device according to any one of appendices 1 to 3, wherein, in the comparison process, when the break position indicated by the first position data matches the break position indicated by the second position data, the processing unit further assigns comma information to the morpheme data at the position indicated by the second position data.
(Appendix 5)
The speech information creation device according to any one of appendices 1 to 4, wherein the processing unit executes, as the acquisition process, a process of determining the fundamental frequency of the speech data within the section of the speech data corresponding to each of the plurality of morphemes, and a process of acquiring the position at which the fundamental frequency is highest within that section as the accent position of the first position data.
(Appendix 6)
The speech information creation device according to any one of appendices 1 to 4, wherein the processing unit executes, as the acquisition process, a process of determining the signal strength of the speech data within the section of the speech data corresponding to each of the plurality of morphemes, and a process of acquiring the position at which the signal strength is highest within that section as the accent position of the first position data.
(Appendix 7)
A voice information creation method used in a voice information creation device including a processing unit, wherein the processing unit:
acquires first position data indicating at least one of an accent position and a delimiter position from input voice data;
compares second position data, which indicates at least one of a position of an accent assigned to morpheme data including a plurality of morphemes generated from text data corresponding to the voice data and a delimiter position between the plurality of morphemes, with the first position data acquired from the voice data; and
when the first and second position data do not match, assigns the first position data to the morpheme data in place of the second position data.
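The comparison and assignment steps of this method can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described in this document: the `Morpheme` record, its field names, and the one-to-one pairing of morphemes with voice-derived values are all hypothetical.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Morpheme:
    surface: str
    accent: Optional[int]  # accent position from text analysis (second position data)
    break_after: bool      # delimiter after this morpheme, from text analysis


def apply_voice_positions(
    morphemes: Sequence[Morpheme],
    voice_accents: Sequence[Optional[int]],
    voice_breaks: Sequence[bool],
) -> List[Morpheme]:
    """Compare text-derived positions with voice-derived positions
    (first position data) and, where they differ, assign the
    voice-derived values to the morpheme instead."""
    if not (len(morphemes) == len(voice_accents) == len(voice_breaks)):
        raise ValueError("inputs must have one entry per morpheme")
    out: List[Morpheme] = []
    for m, acc, brk in zip(morphemes, voice_accents, voice_breaks):
        if m.accent != acc or m.break_after != brk:
            m = replace(m, accent=acc, break_after=brk)
        out.append(m)
    return out
```

Morphemes whose text-derived positions already agree with the voice data are passed through unchanged.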
(Appendix 8)
A program that causes a computer used as a voice information creation device to execute:
a step of acquiring first position data indicating at least one of an accent position and a delimiter position from input voice data;
a step of comparing second position data, which indicates at least one of a position of an accent assigned to morpheme data including a plurality of morphemes generated from text data corresponding to the voice data and a delimiter position between the plurality of morphemes, with the first position data acquired from the voice data; and
a step of assigning, when the first and second position data do not match, the first position data to the morpheme data in place of the second position data.
(Appendix 9)
A voice database creation device comprising:
the voice information creation device according to any one of appendices 1 to 6; and
a registration processing unit that executes a process of extracting phoneme piece data from the voice data and registering, in a voice database, the phoneme piece data, a phoneme label representing the phoneme piece data, and the accent information assigned by the voice information creation device to the morpheme corresponding to the phoneme piece data.

DESCRIPTION OF SYMBOLS
100 Accent information creation device
101 Voice analysis unit
102 Language analysis unit
103 Data assignment unit
110 Voice recognition unit
111 Acoustic model
112 Fundamental frequency analysis unit
120 Morphological analysis unit
121 Morpheme dictionary
122 Utterance text
1001 CPU
1002 ROM (read-only memory)
1003 RAM (random access memory)
1004 Input device
1005 Output device
1006 External storage device
1007 Portable recording medium drive device
1008 Communication interface
1009 Bus
1010 Portable recording medium

Claims

A voice information creation device comprising a processing unit that executes: an acquisition process of acquiring first position data indicating at least one of an accent position and a delimiter position from input voice data; a comparison process of comparing second position data, which indicates at least one of a position of an accent assigned to morpheme data including a plurality of morphemes generated from text data corresponding to the voice data and a delimiter position between the plurality of morphemes, with the first position data acquired from the voice data; and a process of assigning, when the first and second position data do not match, the first position data to the morpheme data in place of the second position data.

The voice information creation device according to claim 1, further comprising a morphological analysis unit that generates the morpheme data including the plurality of morphemes by executing a morphological analysis process on the text data.

The voice information creation device according to claim 1, wherein the processing unit executes, as the acquisition process, a process of acquiring a position of a silent section of the voice data as a delimiter position of the first position data.

The voice information creation device according to claim 1, wherein, in the comparison process, when the delimiter position indicated by the first position data coincides with the delimiter position indicated by the second position data, the processing unit further adds reading point (comma) information to the morpheme data at the position indicated by the second position data.

The voice information creation device according to claim 1, wherein the processing unit executes, as the acquisition process, a process of determining a fundamental frequency of the voice data within the section of the voice data corresponding to each of the plurality of morphemes, and a process of acquiring the position at which the fundamental frequency is highest within that section as an accent position of the first position data.

The voice information creation device according to claim 1, wherein the processing unit executes, as the acquisition process, a process of determining a signal strength of the voice data within the section of the voice data corresponding to each of the plurality of morphemes, and a process of acquiring the position at which the signal strength is highest within that section as an accent position of the first position data.

A voice information creation method used in a voice information creation device including a processing unit, wherein the processing unit:
acquires first position data indicating at least one of an accent position and a delimiter position from input voice data;
compares second position data, which indicates at least one of a position of an accent assigned to morpheme data including a plurality of morphemes generated from text data corresponding to the voice data and a delimiter position between the plurality of morphemes, with the first position data acquired from the voice data; and
when the first and second position data do not match, assigns the first position data to the morpheme data in place of the second position data.

A program that causes a computer used as a voice information creation device to execute:
a step of acquiring first position data indicating at least one of an accent position and a delimiter position from input voice data;
a step of comparing second position data, which indicates at least one of a position of an accent assigned to morpheme data including a plurality of morphemes generated from text data corresponding to the voice data and a delimiter position between the plurality of morphemes, with the first position data acquired from the voice data; and
a step of assigning, when the first and second position data do not match, the first position data to the morpheme data in place of the second position data.

A voice database creation device comprising:
the voice information creation device according to any one of claims 1 to 6; and
a registration processing unit that executes a process of extracting phoneme piece data from the voice data and registering, in a voice database, the phoneme piece data, a phoneme label representing the phoneme piece data, and the accent information assigned by the voice information creation device to the morpheme corresponding to the phoneme piece data.