JP6756607B2

JP6756607B2 - Accent type judgment device and program

Info

Publication number: JP6756607B2
Application number: JP2016252329A
Authority: JP
Inventors: 清山　信正; 今井　篤; 都木　徹
Original assignee: Japan Broadcasting Corp; NHK Engineering System Inc
Current assignee: Japan Broadcasting Corp; NHK Engineering System Inc
Priority date: 2016-12-27
Filing date: 2016-12-27
Publication date: 2020-09-16
Anticipated expiration: 2036-12-27
Also published as: JP2018106012A

Description

本発明は、音声波形データから抽出した基本周波数に基づいて、音声のアクセント型を判定する装置及びプログラムに関する。 The present invention relates to a device and a program for determining a voice accent type based on a fundamental frequency extracted from voice waveform data.

従来、統計的な情報を用いて、任意のテキストに対する音声を合成するテキスト音声合成技術が知られている。テキスト音声合成を実現するためには、事前準備として、音声波形データ、その発話内容のテキスト、読み方、アクセント等の情報を多量に用意しておく必要がある。 Conventionally, there is known a text-to-speech synthesis technique for synthesizing speech for arbitrary text using statistical information. In order to realize text-to-speech synthesis, it is necessary to prepare a large amount of information such as voice waveform data, text of the utterance content, reading, accent, etc. as a preliminary preparation.

多量の音声波形データに対し、読み方及びアクセント等を正しく付与する事前準備を行うためには、アナウンサー等の、アクセントを聞き分けることのできる専門家による作業が必要となる。しかし、このような事前準備を行うには、コストがかかるという問題があった。 In order to make advance preparations for correctly assigning readings and accents to a large amount of voice waveform data, work by an announcer or other specialist who can distinguish accents is required. However, there is a problem that it is costly to perform such advance preparation.

この問題を解決するために、音声波形データに対してアクセント型を自動的に判定する手法が提案されている（例えば非特許文献１を参照）。 In order to solve this problem, a method for automatically determining the accent type for voice waveform data has been proposed (see, for example, Non-Patent Document 1).

この非特許文献１に記載された手法は、音声波形データを構成するアクセント句内において、アクセント句を構成するモーラ（子音＋母音、母音、促音または撥音）を単位として、各モーラを代表するピッチを求める。そして、隣接するモーラのピッチの差分値により、（モーラ数−１）次元の特徴ベクトルを構成し、同じアクセント型を持つアクセント句の特徴ベクトルを用いて、アクセント型毎に（モーラ数−１）次元の正規分布を求めることにより、モデルを学習する。 In the method described in Non-Patent Document 1, within each accent phrase of the voice waveform data, a pitch representing each mora is obtained, taking as the unit the moras (consonant + vowel, vowel, geminate consonant (sokuon), or moraic nasal (hatsuon)) that make up the accent phrase. A feature vector of (number of moras − 1) dimensions is then formed from the pitch differences between adjacent moras, and a model is trained by estimating, for each accent type, a (number of moras − 1)-dimensional normal distribution from the feature vectors of accent phrases having that accent type.

そして、この手法は、判定対象のアクセント句について（モーラ数−１）次元の特徴ベクトルを算出し、学習したモデルを用いて、判定対象のアクセント句の特徴ベクトルに対応するアクセント型を判定する。 This method then calculates a (number of moras − 1)-dimensional feature vector for the accent phrase to be judged and uses the trained model to determine the accent type corresponding to that feature vector.

また、音声波形データのアクセントの種類を判定する手法として、テキスト音声合成を支援することを目的としたものも提案されている（例えば特許文献１を参照）。 Further, as a method for determining the type of accent of voice waveform data, a method for supporting text-speech synthesis has also been proposed (see, for example, Patent Document 1).

この特許文献１に記載された手法は、所定の語句に対応する音声波形データを複数のアクセント句に分割し、判定対象のアクセント句について、アクセント句を構成するモーラを単位とし、ピッチの時間変化の傾き及び最終ピッチを算出する。そして、ピッチの時間変化の傾き及び最終ピッチを用いて、モーラ単位でＨ（High）、Ｌ（Low）、Ｕ（Up）及びＤ（Down）の４種類の評価関数の値を算出する。さらに、アクセント句を構成するモーラ全体の評価関数の値を加算し、その加算値が最大となる組み合わせを求めて、アクセントの種類を判定する。 In the method described in Patent Document 1, voice waveform data corresponding to a given phrase is divided into a plurality of accent phrases, and, for the accent phrase to be judged, the slope of the pitch over time and the final pitch are calculated for each mora of the accent phrase. Using this slope and final pitch, the values of four evaluation functions, H (High), L (Low), U (Up), and D (Down), are calculated for each mora. The evaluation function values are then summed over all moras of the accent phrase, the combination that maximizes the sum is found, and the type of accent is determined.

特許第４１２９９８９号公報 Japanese Patent No. 4129989

石井カルロス寿憲、峯松信明、広瀬啓吉、“ピッチ知覚を考慮した日本語連続音声のアクセント型判定”、電子情報通信学会技術研究報告、Vol.101、No.270、23-30、2001 Carlos Toshinori Ishii, Nobuaki Minematsu, Keikichi Hirose, "Accent Type Determination of Japanese Continuous Speech Considering Pitch Perception", IEICE Technical Report, Vol. 101, No. 270, pp. 23-30, 2001

しかしながら、非特許文献１の手法では、隣接するモーラのピッチの差分値を用いるから、（モーラ数−１）種類のアクセント型しか判定できない。また、特許文献１の手法は、アクセント句についてモーラ単位のＨ、Ｌ、Ｕ及びＤの評価関数を組み合わせ、その組み合わせに応じてアクセントの種類を判断するものであり、アクセント型を判定するものではない。 However, because the method of Non-Patent Document 1 uses the pitch differences between adjacent moras, it can determine only (number of moras − 1) accent types. The method of Patent Document 1 combines the per-mora evaluation functions H, L, U, and D for an accent phrase and judges the type of accent from that combination; it does not determine the accent type.

さらに、非特許文献１及び特許文献１のいずれの手法においても、音声波形データを構成する個々のアクセント句のみのピッチに基づいて、アクセント型を判定したり、アクセントの種類を判定したりしている。 Furthermore, both the method of Non-Patent Document 1 and that of Patent Document 1 determine the accent type, or the type of accent, based only on the pitch of the individual accent phrases constituting the voice waveform data.

アクセント句のアクセント型を正しく判定するためには、アクセント句の区間よりも長いテキストの文章全体またはフレーズ（文節）単位におけるピッチの変動を考慮する必要がある。非特許文献１及び特許文献１の手法のように、個々のアクセント句のみに着目したのでは、必ずしも精度の高いアクセント型を得ることができるとは限らない。 To determine the accent type of an accent phrase correctly, it is necessary to take into account the pitch variation over the entire sentence of the text, or over a phrase (bunsetsu) unit, which is longer than the accent phrase section. Focusing only on individual accent phrases, as the methods of Non-Patent Document 1 and Patent Document 1 do, does not necessarily yield a highly accurate accent type.

このように、従来の手法では、アクセント型の判定精度が低下するという問題があった。判定した音声のアクセント型は、例えば音声合成の事前学習の際に、音響モデルの一部のデータ（韻律データ）として格納される。そして、任意のテキストに対する音声を合成するテキスト音声合成の際に、この音響モデルが用いられる。アクセント型の判定精度が低下すると、当該アクセント型が学習された音響モデルを用いてテキスト音声合成したときに、音声のアクセントが不自然となり、精度の高い音声合成を実現することができなくなる。 As described above, the conventional method has a problem that the accuracy of the accent type determination is lowered. The determined speech accent type is stored as a part of the acoustic model data (prosody data) at the time of pre-learning of speech synthesis, for example. Then, this acoustic model is used in text-to-speech synthesis for synthesizing speech for arbitrary text. If the determination accuracy of the accent type is lowered, the accent of the voice becomes unnatural when the text voice is synthesized using the acoustic model in which the accent type is learned, and it becomes impossible to realize the voice synthesis with high accuracy.

このため、アクセント型を精度高く判定するために、アクセント句の区間よりも長いテキストの文章全体またはフレーズ単位の音声波形データ全体を考慮する新たな手法が所望されていた。 Therefore, in order to determine the accent type with high accuracy, a new method that considers the entire sentence of the text longer than the interval of the accent phrase or the entire voice waveform data of each phrase has been desired.

そこで、本発明は前記課題を解決するためになされたものであり、その目的は、音声波形データからアクセント型を精度高く判定することが可能なアクセント型判定装置及びプログラムを提供することにある。 Therefore, the present invention has been made to solve the above problems, and an object of the present invention is to provide an accent type determination device and a program capable of accurately determining an accent type from voice waveform data.

前記課題を解決するために、請求項１のアクセント型判定装置は、音声波形データから複数のアクセント句を切り出し、前記アクセント句のアクセント型を判定するアクセント型判定装置において、前記音声波形データからフレーム毎のピッチを算出する音声波形ピッチ算出部と、前記音声波形データから複数のアクセント句を切り出し、前記アクセント句から複数のモーラを切り出し、前記音声波形ピッチ算出部により算出された前記フレーム毎のピッチから、モーラ毎のピッチ代表値を算出し、前記音声波形データの前記アクセント句に含まれる前記モーラ毎のピッチ代表値を出力する第１のピッチ代表値算出部と、前記音声波形データに対応するテキストについて、当該テキストに含まれる文節、アクセント句、モーラ及びアクセント位置の情報を含む文脈依存音素ラベルを作成する音素ラベル作成部と、前記音素ラベル作成部により作成された前記文脈依存音素ラベルに基づいて、前記テキストから複数のアクセント句を切り出し、前記アクセント句から複数のモーラを切り出し、前記アクセント句について、全てのアクセント型を表現するように、当該アクセント句に含まれる前記モーラの数に基づいてアクセント位置を変更し、複数のアクセント型文脈依存音素ラベルを作成するアクセント型ラベル作成部と、前記アクセント型ラベル作成部により作成された前記アクセント型文脈依存音素ラベル毎に、フレーム毎のピッチを、前記テキストの文章全体または文節におけるピッチの変動が反映されたピッチ推定値として求めるピッチ推定部と、前記ピッチ推定部により求めた前記フレーム毎のピッチ推定値から、モーラ毎のピッチ代表値を算出し、前記アクセント型文脈依存音素ラベル毎に、前記テキストの前記アクセント句に含まれる前記モーラ毎のピッチ代表値を出力する第２のピッチ代表値算出部と、前記第１のピッチ代表値算出部により出力された前記音声波形データの前記アクセント句に含まれる前記モーラ毎のピッチ代表値と、前記第２のピッチ代表値算出部により前記アクセント型文脈依存音素ラベル毎に出力された前記テキストの前記アクセント句に含まれる前記モーラ毎のピッチ代表値との間の距離をそれぞれ算出し、前記アクセント型文脈依存音素ラベル毎に、前記距離を出力する距離算出部と、前記距離算出部により前記アクセント型文脈依存音素ラベル毎に出力された前記距離が最小となる前記アクセント型文脈依存音素ラベルを特定し、当該アクセント型文脈依存音素ラベルのアクセント型を、前記アクセント句のアクセント型として判定するアクセント型判定部と、を備えたことを特徴とする。 In order to solve the above problem, the accent type determination device of claim 1 is an accent type determination device that cuts out a plurality of accent phrases from voice waveform data and determines the accent type of each accent phrase, comprising: a voice waveform pitch calculation unit that calculates a pitch for each frame from the voice waveform data; a first pitch representative value calculation unit that cuts out a plurality of accent phrases from the voice waveform data, cuts out a plurality of moras from each accent phrase, calculates a pitch representative value for each mora from the per-frame pitch calculated by the voice waveform pitch calculation unit, and outputs the pitch representative value of each mora included in the accent phrase of the voice waveform data; a phoneme label creation unit that creates, for the text corresponding to the voice waveform data, a context-dependent phoneme label containing information on the bunsetsu (phrases), accent phrases, moras, and accent positions included in the text; an accent type label creation unit that, based on the context-dependent phoneme label created by the phoneme label creation unit, cuts out a plurality of accent phrases from the text, cuts out a plurality of moras from each accent phrase, and, for the accent phrase, changes the accent position based on the number of moras in the accent phrase so as to represent all accent types, thereby creating a plurality of accent-type context-dependent phoneme labels; a pitch estimation unit that obtains, for each accent-type context-dependent phoneme label created by the accent type label creation unit, the pitch for each frame as a pitch estimate reflecting the pitch variation over the entire sentence or a bunsetsu of the text; a second pitch representative value calculation unit that calculates a pitch representative value for each mora from the per-frame pitch estimates obtained by the pitch estimation unit and outputs, for each accent-type context-dependent phoneme label, the pitch representative value of each mora included in the accent phrase of the text; a distance calculation unit that calculates the distance between the per-mora pitch representative values of the accent phrase of the voice waveform data output by the first pitch representative value calculation unit and the per-mora pitch representative values of the accent phrase of the text output by the second pitch representative value calculation unit for each accent-type context-dependent phoneme label, and outputs the distance for each accent-type context-dependent phoneme label; and an accent type determination unit that identifies the accent-type context-dependent phoneme label for which the distance output by the distance calculation unit is smallest and determines the accent type of that label to be the accent type of the accent phrase.

また、請求項２のアクセント型判定装置は、請求項１に記載のアクセント型判定装置において、前記第１のピッチ代表値算出部が、前記モーラに含まれる前記フレーム毎のピッチの値を所定式に近似し、前記所定式における最終フレームのピッチの値を、当該モーラのピッチ代表値として求め、前記第２のピッチ代表値算出部が、前記モーラに含まれる前記フレーム毎のピッチ推定値を所定式に近似し、前記所定式における最終フレームのピッチ推定値の値を、当該モーラのピッチ代表値として求める、ことを特徴とする。 Further, the accent type determination device of claim 2 is the accent type determination device according to claim 1, wherein the first pitch representative value calculation unit approximates the per-frame pitch values included in the mora by a predetermined equation and takes the pitch value of the final frame given by that equation as the pitch representative value of the mora, and the second pitch representative value calculation unit approximates the per-frame pitch estimates included in the mora by a predetermined equation and takes the pitch estimate of the final frame given by that equation as the pitch representative value of the mora.

また、請求項３のアクセント型判定装置は、請求項１または２に記載のアクセント型判定装置において、前記ピッチ推定部が、予め学習された韻律モデルを用いて、前記ピッチ推定値を求める、ことを特徴とする。 Further, the accent type determination device of claim 3 is the accent type determination device according to claim 1 or 2, wherein the pitch estimation unit obtains the pitch estimate using a prosody model trained in advance.

また、請求項４のアクセント型判定装置は、請求項３に記載のアクセント型判定装置において、前記韻律モデルを、前記音声波形データの話者と同一の話者が発した音声を用いて学習されたモデルとする、ことを特徴とする。 Further, the accent type determination device of claim 4 is the accent type determination device according to claim 3, wherein the prosody model is a model trained using speech uttered by the same speaker as the speaker of the voice waveform data.

さらに、請求項５のプログラムは、コンピュータを、請求項１から４までのいずれか一項に記載のアクセント型判定装置として機能させることを特徴とする。 Further, the program of claim 5 is characterized in that the computer functions as the accent type determination device according to any one of claims 1 to 4.

以上のように、本発明によれば、音声波形データからアクセント型を精度高く判定することが可能となる。 As described above, according to the present invention, it is possible to accurately determine the accent type from the voice waveform data.

本発明の実施形態によるアクセント型判定装置の構成例を示すブロック図である。 FIG. 1 is a block diagram showing a configuration example of the accent type determination device according to an embodiment of the present invention.

アクセント型判定装置の処理例を示すフローチャートである。 FIG. 2 is a flowchart showing a processing example of the accent type determination device.

音声波形データのピッチ、モーラ、アクセント句、テキスト、フレーム、モーラのピッチ代表値を説明する図である。 FIG. 3 is a diagram explaining the pitch of the voice waveform data, moras, accent phrases, text, frames, and the pitch representative value of a mora.

以下、本発明を実施するための形態について図面を用いて詳細に説明する。図１は、本発明の実施形態によるアクセント型判定装置の構成例を示すブロック図であり、図２は、アクセント型判定装置の処理例を示すフローチャートである。また、図３は、音声波形データのピッチ、モーラ、アクセント句、テキスト、フレーム、モーラのピッチ代表値を説明する図である。図３において、音声波形データのピッチの曲線は、音声波形データに対応するテキストの区間における息継ぎを含む大方の傾向を示している。 Hereinafter, embodiments for carrying out the present invention will be described in detail with reference to the drawings. FIG. 1 is a block diagram showing a configuration example of an accent type determination device according to an embodiment of the present invention, and FIG. 2 is a flowchart showing a processing example of the accent type determination device. FIG. 3 is a diagram explaining the pitch of the voice waveform data, moras, accent phrases, text, frames, and the pitch representative value of a mora. In FIG. 3, the pitch curve of the voice waveform data shows the general trend, including breath pauses, over the section of text corresponding to the voice waveform data.

図１を参照して、このアクセント型判定装置１は、音声テキスト解析部１０、音声波形ピッチ算出部１１、ピッチ代表値算出部１２、テキスト解析部（音素ラベル作成部）１３、アクセント型ラベル作成部１４、韻律モデル１５、ピッチ推定部１６、ピッチ代表値算出部１７、距離算出部１８及びアクセント型判定部１９を備えている。音声波形データは、標本化周波数を１６ｋＨｚ、変換ビット数を１６ビットとして標本化されており、音声波形データに対応する書き起こしテキストが存在するものとする。 With reference to FIG. 1, the accent type determination device 1 includes a voice text analysis unit 10, a voice waveform pitch calculation unit 11, a pitch representative value calculation unit 12, a text analysis unit (phoneme label creation unit) 13, and an accent type label creation. It includes a unit 14, a prosody model 15, a pitch estimation unit 16, a pitch representative value calculation unit 17, a distance calculation unit 18, and an accent type determination unit 19. It is assumed that the voice waveform data is sampled with a sampling frequency of 16 kHz and a conversion bit number of 16 bits, and there is a transcribed text corresponding to the voice waveform data.

図１及び図２を参照して、アクセント型判定装置１は、音声波形データ及び当該音声波形データに対応する書き起こしテキストを入力する（ステップＳ２０１）。 With reference to FIGS. 1 and 2, the accent type determination device 1 inputs the voice waveform data and the transcript text corresponding to the voice waveform data (step S201).

音声テキスト解析部１０は、音声波形データ及びテキストを入力する。そして、音声テキスト解析部１０は、テキストを解析することで、テキストに対しアクセント句の区切り位置を設定する（ステップＳ２０２）。アクセント句の区切り位置を設定する技術は既知であるから、ここでは説明を省略する。 The voice text analysis unit 10 inputs voice waveform data and text. Then, the voice text analysis unit 10 sets the delimiter position of the accent phrase with respect to the text by analyzing the text (step S202). Since the technique for setting the delimiter position of the accent phrase is known, the description thereof is omitted here.

音声テキスト解析部１０は、音声波形データ及びテキストを解析することで、音素セグメンテーションを行い、音素の区切り位置を設定する（ステップＳ２０３）。 The voice text analysis unit 10 analyzes the voice waveform data and the text to perform phoneme segmentation and sets the phoneme delimiter position (step S203).

音素セグメンテーションの処理は既知であり、例えば強制アライメント（Forced Alignments）の技術が用いられる。強制アライメントの技術の詳細については、以下のＵＲＬを参照されたい。 The processing of phoneme segmentation is known, and, for example, the technique of forced alignment is used. For details on forced alignment, refer to the following URL.
“The HTK Book（for HTK Version 3.4）”，インターネット＜ＵＲＬ：http://htk.eng.cam.ac.uk/＞ "The HTK Book (for HTK Version 3.4)", Internet <URL: http://htk.eng.cam.ac.uk/>

音声テキスト解析部１０は、アクセント句の区切り位置に関するデータ及び音素の区切り位置に関するデータをピッチ代表値算出部１２に出力する。 The voice text analysis unit 10 outputs data regarding the accent phrase delimiter position and data regarding the phoneme delimiter position to the pitch representative value calculation unit 12.

音声波形ピッチ算出部１１は、音声波形データを入力し、音声波形データから５ｍｓ単位のフレーム毎に基本周波数を抽出し（ステップＳ２０４）、基本周波数の対数を算出する（ステップＳ２０５）。以下、基本周波数の対数をピッチという。 The voice waveform pitch calculation unit 11 inputs voice waveform data, extracts a fundamental frequency from the voice waveform data for each frame in units of 5 ms (step S204), and calculates the logarithm of the fundamental frequency (step S205). Hereinafter, the logarithm of the fundamental frequency is referred to as pitch.
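
The conversion in step S205 can be illustrated with a minimal Python sketch. It assumes the fundamental frequency has already been extracted per 5 ms frame, that unvoiced frames are marked with a value of 0 or less, and that the natural logarithm is used; the description does not state the base of the logarithm or the unvoiced marker, and the name `log_pitch` is hypothetical.

```python
import math

FRAME_SHIFT_MS = 5  # frame interval stated in the description


def log_pitch(f0_hz):
    """Convert per-frame F0 values in Hz to log-F0 ("pitch").

    Assumption: unvoiced frames are marked with F0 <= 0 and are
    returned as None so that a later step can interpolate them.
    """
    return [math.log(f) if f > 0 else None for f in f0_hz]
```

For example, `log_pitch([0.0, 120.0])` yields `None` for the unvoiced first frame and `math.log(120.0)` for the second.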

フレーム毎の基本周波数を抽出する処理は既知であり、例えば音声信号処理の技術（pitchコマンド）が用いられる。この技術の詳細については、以下のＵＲＬを参照されたい。 The process of extracting the fundamental frequency for each frame is known, and, for example, a speech signal processing tool (the pitch command) is used. For details of this technique, refer to the following URL.
“REFERENCE MANUAL for Speech Signal Processing Toolkit Ver. 3.9”，インターネット＜ＵＲＬ：http://sp-tk.sourceforge.net/＞ "REFERENCE MANUAL for Speech Signal Processing Toolkit Ver. 3.9", Internet <URL: http://sp-tk.sourceforge.net/>

音声波形ピッチ算出部１１は、音声波形データにおけるフレーム毎のピッチをピッチ代表値算出部１２に出力する。 The voice waveform pitch calculation unit 11 outputs the pitch of each frame in the voice waveform data to the pitch representative value calculation unit 12.

ピッチ代表値算出部１２は、音声波形ピッチ算出部１１から音声波形データにおけるフレーム毎のピッチを入力すると共に、音声テキスト解析部１０からアクセント句の区切り位置に関するデータ及び音素の区切り位置に関するデータを入力する。 The pitch representative value calculation unit 12 receives the per-frame pitch of the voice waveform data from the voice waveform pitch calculation unit 11, and receives the data on the accent phrase delimiter positions and the data on the phoneme delimiter positions from the voice text analysis unit 10.

ピッチ代表値算出部１２は、アクセント句の区切り位置に関するデータに基づいて、音声波形データを構成する複数のアクセント句を特定する（切り出す）。そして、ピッチ代表値算出部１２は、音素の区切り位置に関するデータに基づいて、アクセント句を構成する複数のモーラを特定する（切り出す）（ステップＳ２０６）。 The pitch representative value calculation unit 12 identifies (cuts out) a plurality of accent phrases constituting the voice waveform data based on the data relating to the delimiter positions of the accent phrases. Then, the pitch representative value calculation unit 12 identifies (cuts out) a plurality of moras constituting the accent phrase based on the data relating to the phoneme delimiter positions (step S206).

図３に示すように、アクセント句は、アクセント句の区切り位置（矢印の先端位置）で区切られた区間（矢印の区間）となる。また、音素の区切り位置から特定されたモーラは、その矢印の先端位置で区切られた区間（矢印の区間）となる。 As shown in FIG. 3, the accent phrase is a section (arrow section) separated by the delimiter position (tip position of the arrow) of the accent phrase. Further, the mora specified from the phoneme dividing position is a section (arrow section) separated by the tip position of the arrow.

ピッチ代表値算出部１２は、モーラの区間（モーラ区間）におけるフレーム毎のピッチから、モーラ毎のピッチ代表値を算出する（ステップＳ２０７）。これにより、音声波形データを構成するアクセント句のそれぞれについて、当該アクセント句を構成するモーラのピッチ代表値が算出される。 The pitch representative value calculation unit 12 calculates the pitch representative value for each mora from the pitch for each frame in the mora section (mora section) (step S207). As a result, the pitch representative value of the mora constituting the accent phrase is calculated for each of the accent phrases constituting the voice waveform data.

具体的には、ピッチ代表値算出部１２は、アクセント句を構成するモーラを切り出し、モーラの開始時間から終了時間までのモーラ区間に含まれるフレーム毎のピッチを特定する。ピッチ代表値算出部１２は、ピッチが存在しない無声区間については、隣接する有声区間におけるフレーム毎のピッチを補間することで、補間した値をピッチとして扱う。また、ピッチ代表値算出部１２は、無声区間の部分が文頭、文末または文内のポーズ（間）に隣接する場合、文章全体の有声区間におけるフレーム毎のピッチから算出した平均値を、補間すべき無声区間の端点の値に内挿する。図３に示すように、モーラは、複数のフレームにより構成され、モーラ区間に含まれるフレーム毎のピッチが特定される。 Specifically, the pitch representative value calculation unit 12 cuts out the moras constituting the accent phrase and identifies the pitch of each frame included in the mora section from the start time to the end time of the mora. For unvoiced sections, where no pitch exists, the unit interpolates from the per-frame pitch of the adjacent voiced sections and treats the interpolated values as the pitch. When an unvoiced section is adjacent to the beginning of the sentence, the end of the sentence, or a pause within the sentence, the unit uses the average of the per-frame pitch over the voiced sections of the entire sentence as the value at the endpoint of the unvoiced section to be interpolated. As shown in FIG. 3, a mora consists of a plurality of frames, and the pitch of each frame included in the mora section is identified.
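
The handling of unvoiced sections described above can be sketched as follows. This is only an illustration under stated assumptions: the interpolation is taken to be linear, unvoiced frames are represented by `None`, and the sentence-wide mean is placed on a virtual frame just outside the contour. The description specifies none of these details, the name `fill_unvoiced` is hypothetical, and the sketch covers only gaps at the start and end of the sentence, not pauses inside it.

```python
def fill_unvoiced(pitch):
    """Fill unvoiced frames (None) in a per-frame log-F0 contour.

    Interior gaps are linearly interpolated between the neighbouring
    voiced frames. For gaps touching the start or end of the contour,
    the mean over all voiced frames serves as the missing endpoint
    (assumption: it sits one frame outside the contour).
    """
    voiced = [p for p in pitch if p is not None]
    if not voiced:
        raise ValueError("contour has no voiced frames")
    mean = sum(voiced) / len(voiced)

    # Anchor points: virtual mean-valued frames at both ends plus
    # every voiced frame in between.
    anchors = [(-1, mean)]
    anchors += [(i, p) for i, p in enumerate(pitch) if p is not None]
    anchors.append((len(pitch), mean))

    filled = list(pitch)
    for (i0, p0), (i1, p1) in zip(anchors, anchors[1:]):
        for i in range(i0 + 1, i1):
            t = (i - i0) / (i1 - i0)
            filled[i] = p0 + t * (p1 - p0)
    return filled
```

Voiced frames are returned unchanged; only the `None` entries are replaced.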

次に、ピッチ代表値算出部１２は、モーラ区間に含まれるフレーム毎のピッチの値を、最小二乗法により１次関数の式として近似し、１次関数の式における最終フレームのピッチの値を、当該モーラのピッチ代表値として設定する。そして、ピッチ代表値算出部１２は、アクセント句の区間（アクセント句区間）における各モーラのピッチ代表値を距離算出部１８に出力する。 Next, the pitch representative value calculation unit 12 approximates the pitch value for each frame included in the mora section as a linear function formula by the least squares method, and calculates the pitch value of the final frame in the linear function formula. , Set as the pitch representative value of the mora. Then, the pitch representative value calculation unit 12 outputs the pitch representative value of each mora in the accent phrase section (accent phrase section) to the distance calculation unit 18.

図３に示すように、モーラ区間のフレーム数をＮ、開始時間のフレーム番号をｘ₁、ピッチの値をｙ₁、終了時間のフレーム番号をｘ_N、ピッチの値をｙ_N、最小二乗法により近似した１次関数の式をｆ（ｘ）＝ａｘ＋ｂとする。ａは１次関数ｆ（ｘ）の傾き、ｂはその切片である。 As shown in FIG. 3, let N be the number of frames in the mora section, x_1 and y_1 the frame number and pitch value at the start time, x_N and y_N the frame number and pitch value at the end time, and f(x) = ax + b the linear function fitted by the least squares method, where a is the slope of f(x) and b is its intercept.

１次関数ｆ（ｘ）の傾きａ及び切片ｂは、以下の式で表される。 The slope a and the intercept b of the linear function f(x) are given by the following equations.
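
For reference, the standard closed-form least-squares solution for a line f(x) = ax + b fitted to N points is shown below. The notation (x_i, y_i) for the frame number and pitch value of the i-th frame in the mora section is an assumption that extends x_1, y_1, x_N, and y_N from the text; the form printed in the patent may be arranged differently.

```latex
a = \frac{N \sum_{i=1}^{N} x_i y_i - \sum_{i=1}^{N} x_i \sum_{i=1}^{N} y_i}
         {N \sum_{i=1}^{N} x_i^{2} - \left( \sum_{i=1}^{N} x_i \right)^{2}},
\qquad
b = \frac{1}{N} \left( \sum_{i=1}^{N} y_i - a \sum_{i=1}^{N} x_i \right)
```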

このように、ピッチ代表値算出部１２は、近似した１次関数の式ｆ（ｘ）のうち最終フレームｘ_Nのピッチの値ｆ（ｘ_N）＝ｙ_Nを当該モーラのピッチ代表値とし、アクセント句区間における各モーラのピッチ代表値Ｓ_mを算出する。ｍはモーラの番号である。 In this way, the pitch representative value calculation unit 12 takes the value f(x_N) = y_N of the fitted linear function f(x) at the final frame x_N as the pitch representative value of the mora, and calculates the pitch representative value S_m of each mora in the accent phrase section, where m is the mora number.
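
The per-mora computation can be sketched in Python as follows. It fits f(x) = ax + b by ordinary least squares and evaluates the fitted line at the final frame. The function name `mora_pitch_representative` and the handling of a single-frame mora are assumptions made for illustration; the description does not cover that case.

```python
def mora_pitch_representative(frames, pitch):
    """Return f(x_N) for the least-squares line through (frames, pitch).

    frames: frame numbers x_1..x_N of one mora section
    pitch:  per-frame log-F0 values y_1..y_N (already interpolated)
    """
    n = len(frames)
    if n == 0 or n != len(pitch):
        raise ValueError("frames and pitch must be non-empty, same length")
    if n == 1:
        # Assumption: with a single frame no line can be fitted,
        # so the frame's own pitch is used.
        return pitch[0]

    sx = sum(frames)
    sy = sum(pitch)
    sxx = sum(x * x for x in frames)
    sxy = sum(x * y for x, y in zip(frames, pitch))

    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope
    b = (sy - a * sx) / n                          # intercept
    return a * frames[-1] + b
```

Applying this function to every mora of an accent phrase gives the sequence of representative values S_m that is passed on to the distance calculation.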

テキスト解析部１３は、音声波形データに対応するテキストを入力し、テキストを解析することで、テキストに対する文脈依存音素ラベルを作成する（ステップＳ２０８）。文脈依存音素ラベルは、テキスト全体の文脈に依存した、音素を単位としたラベルである。文脈依存音素ラベルは、例えば、テキスト全体に含まれる文節の位置、数及び番号、文節に含まれるアクセント句の位置、数及び番号、アクセント句に含まれるモーラの位置、数及び番号、アクセント句内のアクセントの位置等の、テキスト全体の文脈を特定するための情報が含まれる。 The text analysis unit 13 receives the text corresponding to the voice waveform data and analyzes it to create context-sensitive phoneme labels for the text (step S208). A context-sensitive phoneme label is a phoneme-level label that depends on the context of the entire text. It contains information for identifying that context, for example: the position, count, and index of the bunsetsu (phrases) in the entire text; the position, count, and index of the accent phrases in each bunsetsu; the position, count, and index of the moras in each accent phrase; and the position of the accent within each accent phrase.

文脈依存音素ラベルの作成処理は既知である。詳細については、例えば以下のＵＲＬで提供される日本語音声合成システムの技術を参照されたい。 The process of creating context-sensitive phoneme labels is known. For details, refer to, for example, the Japanese speech synthesis system provided at the following URL.
“OPEN JTalk”，インターネット＜ＵＲＬ：http://open-jtalk.sourceforge.net/＞ "OPEN JTalk", Internet <URL: http://open-jtalk.sourceforge.net/>

テキスト解析部１３は、テキストの文脈依存音素ラベルをアクセント型ラベル作成部１４に出力する。 The text analysis unit 13 outputs the context-sensitive phoneme label of the text to the accent type label creation unit 14.

アクセント型ラベル作成部１４は、テキスト解析部１３からテキストの文脈依存音素ラベルを入力する。そして、アクセント型ラベル作成部１４は、文脈依存音素ラベルに含まれる情報に基づいて、テキストを構成する複数のアクセント句の区切り位置を特定し、テキストを構成する複数のアクセント句を特定する（切り出す）。アクセント型ラベル作成部１４により特定されたアクセント句は、ピッチ代表値算出部１２によりピッチ代表値が算出される前に特定されたアクセント句と同じである。 The accent type label creation unit 14 receives the context-sensitive phoneme labels of the text from the text analysis unit 13. Based on the information contained in the context-sensitive phoneme labels, it identifies the delimiter positions of the accent phrases constituting the text and thereby identifies (cuts out) those accent phrases. The accent phrases identified by the accent type label creation unit 14 are the same as those identified by the pitch representative value calculation unit 12 before it calculates the pitch representative values.

アクセント型ラベル作成部１４は、文脈依存音素ラベルに含まれる情報に基づいて、アクセント句を構成する複数のモーラを特定する（切り出す）。そして、アクセント型ラベル作成部１４は、アクセント句について、全てのアクセント型を表現するように、アクセント核の位置（アクセントのあるモーラの位置）を変更した文脈依存音素ラベル（アクセント型文脈依存音素ラベル）を作成する（ステップＳ２０９）。アクセント型ラベル作成部１４により作成される文脈依存音素ラベルの数は、アクセント句を構成するモーラの数により決定される。テキストの文章がＬ個のアクセント句で構成され、ｌ（エル）番目のアクセント句のモーラの数がＭ_lの場合、Ｍ_l種類の異なる文脈依存音素ラベルが作成される。ここで、ｌは１以上、Ｌ以下の整数であり、Ｍ_lは１以上の整数である。 Based on the information contained in the context-sensitive phoneme labels, the accent type label creation unit 14 identifies (cuts out) the moras constituting each accent phrase. Then, for each accent phrase, it creates context-sensitive phoneme labels (accent-type context-dependent phoneme labels) in which the position of the accent nucleus (the position of the accented mora) is changed so as to represent all accent types (step S209). The number of context-sensitive phoneme labels created by the accent type label creation unit 14 is determined by the number of moras in the accent phrase. If the sentence of the text consists of L accent phrases and the l-th accent phrase has M_l moras, M_l different context-sensitive phoneme labels are created. Here, l is an integer from 1 to L, and M_l is an integer of 1 or more.

アクセント型ラベル作成部１４は、ｌ番目のアクセント句に対して、Ｍ_l個の文脈依存音素ラベル（アクセント型毎の文脈依存音素ラベル）をピッチ推定部１６に出力する。 The accent type label creation unit 14 outputs M _l context-sensitive phoneme labels (context-dependent phoneme labels for each accent type) to the pitch estimation unit 16 for the l-th accent phrase.

ピッチ推定部１６は、アクセント型ラベル作成部１４から、ｌ番目のアクセント句に対して、Ｍ_l個の文脈依存音素ラベルを入力する。そして、ピッチ推定部１６は、Ｍ_l個の文脈依存音素ラベルのそれぞれについて、予め学習された韻律モデル１５を用いて、モーラ区間におけるフレーム毎のピッチを推定し、ピッチ推定値を求める（ステップＳ２１０）。 The pitch estimation unit 16 receives, from the accent type label creation unit 14, the M_l context-sensitive phoneme labels for the l-th accent phrase. Then, for each of the M_l context-sensitive phoneme labels, it estimates the pitch of each frame in each mora section using the prosody model 15 trained in advance, and obtains the pitch estimates (step S210).

文脈依存音素ラベルについてピッチを推定する処理は既知であり、詳細については、例えば前記のＵＲＬで提供される日本語音声合成システムの技術、または以下のＵＲＬで提供される音声合成技術を参照されたい。 The process of estimating the pitch for context-sensitive phoneme labels is known. For details, refer to, for example, the Japanese speech synthesis system provided at the above URL or the speech synthesis technology provided at the following URL.
“HTS”，インターネット＜ＵＲＬ：http://hts-engine.sourceforge.net/＞ "HTS", Internet <URL: http://hts-engine.sourceforge.net/>

ピッチ推定部１６は、Ｍ_l個の文脈依存音素ラベル（アクセント型毎の文脈依存音素ラベル）のそれぞれについて、アクセント句毎に、モーラ区間におけるフレーム毎のピッチ推定値をピッチ代表値算出部１７に出力する。 For each of the M_l context-sensitive phoneme labels (one context-dependent phoneme label per accent type), the pitch estimation unit 16 outputs, for each accent phrase, the per-frame pitch estimates in each mora section to the pitch representative value calculation unit 17.

韻律モデル１５は、図示しないモデル学習部により予め作成されたＤＢ（データベース）である。モデル学習部は、様々なテキストの文章またはフレーズ等の音声波形データ及び当該音声波形データに対応するテキストに対して文脈依存音素ラベルを作成し、当該音声波形データからピッチを算出し、これに対応する文脈依存音素ラベルを用いて、ピッチを推定できるように学習したモデルを韻律モデル１５に格納する。 The prosody model 15 is a DB (database) created in advance by a model learning unit (not shown). The model learning unit creates context-dependent phoneme labels from voice waveform data of sentences, phrases, and the like of various texts and from the texts corresponding to that voice waveform data, calculates the pitch from the voice waveform data, and, using the pitch together with the corresponding context-dependent phoneme labels, stores in the prosody model 15 a model trained so that pitch can be estimated.

したがって、この韻律モデル１５を用いて算出されたピッチ推定値は、テキストの文章全体またはフレーズ等の区間における（アクセント句よりも長い区間における）ピッチの変動が反映された値となる。つまり、ピッチ推定値は、テキストの文章全体またはフレーズ等の中で、アクセント位置が反映された値となる。 Therefore, the pitch estimation value calculated using the prosody model 15 is a value that reflects the fluctuation of the pitch (in the section longer than the accent phrase) in the whole sentence of the text or in the section such as the phrase. That is, the pitch estimation value is a value that reflects the accent position in the entire text sentence or phrase.

ここで、韻律モデル１５は、当該アクセント型判定装置１によりアクセント型が判定される音声波形データの話者と同一の話者が発した音声を学習用データとして、当該話者の音声波形データを用いて学習されたＤＢであることが望ましい。これにより、アクセント型判定装置１において、音声波形データからアクセント型を一層精度高く判定することができる。 Here, the prosody model 15 is preferably a DB trained on the voice waveform data of the same speaker as the speaker of the voice waveform data whose accent type is determined by the accent type determination device 1, with speech uttered by that speaker used as the training data. This allows the accent type determination device 1 to determine the accent type from the voice waveform data with even higher accuracy.

ピッチ代表値算出部１７は、ピッチ推定部１６から、Ｍ_l個の文脈依存音素ラベルのそれぞれについて、アクセント句毎に、モーラ区間におけるフレーム毎のピッチ推定値を入力する。そして、ピッチ代表値算出部１７は、ピッチ代表値算出部１２と同様の処理により、モーラ区間におけるフレーム毎のピッチ推定値からモーラ毎のピッチ代表値を算出する（ステップＳ２１１）。これにより、Ｍ_l個の文脈依存音素ラベルのそれぞれについて、音声波形データに対応するテキストを構成するそれぞれのアクセント句について、当該アクセント句を構成するモーラのピッチ代表値が算出される。 The pitch representative value calculation unit 17 inputs the pitch estimation value for each frame in the mora section for each accent phrase for each of the M _l context-sensitive phoneme labels from the pitch estimation unit 16. Then, the pitch representative value calculation unit 17 calculates the pitch representative value for each mora from the pitch estimation value for each frame in the mora section by the same processing as the pitch representative value calculation unit 12 (step S211). As a result, for each of the M _l context-sensitive phoneme labels, for each accent phrase that constitutes the text corresponding to the speech waveform data, the pitch representative value of the mora that constitutes the accent phrase is calculated.

Specifically, the pitch representative value calculation unit 17 cuts out the moras constituting the accent phrase based on the context-dependent phoneme label, and identifies the pitch of each frame included in the mora section from the start time to the end time of the mora. For unvoiced sections where no pitch exists, the pitch representative value calculation unit 17 performs interpolation and similar processing in the same manner as the pitch representative value calculation unit 12. The pitch representative value calculation unit 17 then approximates the per-frame pitch values in the mora section by a linear function using the least squares method, and sets the value of that linear function at the final frame as the pitch representative value of the mora.
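The least-squares step described above can be sketched as follows. This is a minimal illustration, not the device's actual implementation: the function name `mora_pitch_representative` is hypothetical, the frame index is assumed to serve as the time axis, and the input is assumed to already contain one pitch value per frame (that is, any interpolation over unvoiced frames has already been applied).

```python
def mora_pitch_representative(pitches):
    """Fit pitch = slope * t + intercept over frame indices t = 0..N-1
    by ordinary least squares and return the fitted value at the final frame.

    `pitches` is a sequence of per-frame pitch values for one mora section
    (assumed gap-free; the unit of pitch is left unspecified here).
    """
    n = len(pitches)
    if n == 0:
        raise ValueError("mora section contains no frames")
    if n == 1:
        # A single frame cannot define a slope; use the value itself.
        return float(pitches[0])

    mean_t = (n - 1) / 2.0
    mean_p = sum(pitches) / n
    # Closed-form least-squares estimates for a straight line.
    sxx = sum((t - mean_t) ** 2 for t in range(n))
    sxy = sum((t - mean_t) * (p - mean_p) for t, p in enumerate(pitches))
    slope = sxy / sxx
    intercept = mean_p - slope * mean_t

    # Representative value: the fitted line evaluated at the last frame.
    return slope * (n - 1) + intercept
```

Evaluating the fitted line rather than taking the raw final-frame value smooths out frame-level jitter while still describing where the pitch ends up at the mora boundary.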

For the n-th of the M_l context-dependent phoneme labels in the l-th accent phrase, let T_{l,n,m} denote the pitch representative value of the m-th mora.

For each of the M_l context-dependent phoneme labels (one context-dependent phoneme label per accent type), the pitch representative value calculation unit 17 outputs the pitch representative value of each mora in the accent phrase section to the distance calculation unit 18.

The distance calculation unit 18 receives, from the pitch representative value calculation unit 12, the pitch representative value of each mora in the accent phrase section of the voice waveform data. The distance calculation unit 18 also receives, from the pitch representative value calculation unit 17, the pitch representative value of each mora in the accent phrase section for each of the M_l context-dependent phoneme labels (one context-dependent phoneme label per accent type).

The distance calculation unit 18 calculates the distance between the pitch representative values of the moras in the l-th accent phrase section of the voice waveform data and the pitch representative values of the moras in the accent phrase section for each of the M_l context-dependent phoneme labels (step S212).

As described above, let S_{l,m} be the pitch representative value of the m-th mora in the l-th accent phrase section of the voice waveform data, and let T_{l,n,m} be the pitch representative value of the m-th mora in the accent phrase for the n-th of the M_l context-dependent phoneme labels in the l-th accent phrase. The distance C_{l,n} between the two is calculated by the following formula.

Here, for an accent phrase consisting of M_l moras, the number of accent types is M_l, so n = 1, ..., M_l and m = 1, ..., M_l.
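As an illustration of the distance calculation in step S212, the sketch below compares the observed representative values S_{l,m} with one candidate's values T_{l,n,m}. The distance formula itself does not appear in this text, so the metric used here (sum of squared differences over the moras) is purely an assumption chosen for illustration, and the function names are hypothetical.

```python
def accent_distance(observed, candidate):
    """Distance between observed per-mora pitch values (S_{l,m}) and one
    candidate label's per-mora pitch values (T_{l,n,m}).

    ASSUMPTION: sum of squared differences. The actual formula used by
    the device is not reproduced in the text.
    """
    if len(observed) != len(candidate):
        raise ValueError("both sequences must cover the same moras")
    return sum((s - t) ** 2 for s, t in zip(observed, candidate))


def all_distances(observed, candidates):
    """Return [C_{l,1}, ..., C_{l,M_l}] for one accent phrase, where
    `candidates` holds one per-mora sequence per accent-type label."""
    return [accent_distance(observed, c) for c in candidates]
```

Whatever metric is actually used, the structure is the same: one scalar per candidate label, computed over the same M_l moras.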

For each accent phrase, the distance calculation unit 18 outputs the distances (C_{l,1}, ..., C_{l,M_l}) for the M_l context-dependent phoneme labels (one context-dependent phoneme label per accent type) to the accent type determination unit 19.

For each accent phrase, the accent type determination unit 19 receives the distances (C_{l,1}, ..., C_{l,M_l}) for the M_l context-dependent phoneme labels. The accent type determination unit 19 then identifies, among the M_l context-dependent phoneme labels, the label with the smallest distance, and determines the accent type of the identified label (step S213). The accent type determination unit 19 outputs the determined accent type as the accent type of the accent phrase.

The accent type n_{lmin} that minimizes the distance in the l-th accent phrase is calculated by the following formula.

n_{lmin} = \arg\min_{n = 1, ..., M_l} C_{l,n}
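The selection in step S213 amounts to taking the index of the smallest distance. A minimal sketch follows; the function name is hypothetical, and the tie-breaking rule (the first minimum wins) is an assumption, since the text does not say how ties are handled.

```python
def select_accent_type(distances):
    """Return the 1-based index n of the smallest distance C_{l,n}.

    `distances` is [C_{l,1}, ..., C_{l,M_l}] for one accent phrase.
    ASSUMPTION: on ties, the lowest index is returned.
    """
    if not distances:
        raise ValueError("at least one candidate distance is required")
    best = min(range(len(distances)), key=lambda i: distances[i])
    return best + 1  # n runs over 1..M_l
```

For example, with distances `[3.0, 0.5, 2.0]` the second candidate label is selected.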

As described above, in the accent type determination device 1 according to the embodiment of the present invention, the pitch representative value calculation unit 12 approximates the per-frame pitch values in each mora section of the voice waveform data by a linear function using the least squares method, and sets the value of that linear function at the final frame as the pitch representative value of the mora. As a result, the pitch representative values of the moras constituting each accent phrase are calculated for the voice waveform data.

Based on the context-dependent phoneme label of the text generated by the text analysis unit 13, the accent type label creation unit 14 creates, for each accent phrase, M_l context-dependent phoneme labels, one per accent type, by changing the position of the accent nucleus.
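The candidate generation performed by the accent type label creation unit 14 can be pictured with the following sketch. The `AccentPhraseLabel` structure is a hypothetical stand-in that keeps only the moras and an accent nucleus position; a real context-dependent phoneme label carries far more information. The mapping from the accent type index n to a nucleus position (here simply n = 1, ..., M_l) is likewise an assumption made for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AccentPhraseLabel:
    """Hypothetical, heavily simplified label for one accent phrase."""
    moras: tuple          # the moras of the accent phrase, in order
    accent_nucleus: int   # ASSUMED encoding of the accent type, 1..M_l


def candidate_labels(label):
    """Return M_l copies of `label`, one per accent type, differing only
    in the position of the accent nucleus."""
    m = len(label.moras)
    return [replace(label, accent_nucleus=n) for n in range(1, m + 1)]
```

Each candidate is later passed through pitch estimation, so the only thing that varies between candidates is the accent position encoded in the label.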

For each of the M_l context-dependent phoneme labels, the pitch estimation unit 16 estimates the per-frame pitch in each mora section using the prosody model 15 and obtains pitch estimates. By the same processing as the pitch representative value calculation unit 12, the pitch representative value calculation unit 17 calculates the pitch representative value of each mora from the per-frame pitch estimates in the mora section. As a result, for each of the M_l context-dependent phoneme labels, the pitch representative values of the moras constituting each accent phrase are calculated for the text corresponding to the voice waveform data.

The distance calculation unit 18 calculates the distance between the pitch representative values of the moras in the accent phrase section of the voice waveform data, calculated by the pitch representative value calculation unit 12, and the pitch representative values of the moras in the accent phrase section for each of the M_l context-dependent phoneme labels, calculated by the pitch representative value calculation unit 17. For each accent phrase, the accent type determination unit 19 determines the accent type whose distance is smallest among the M_l context-dependent phoneme labels, and outputs the determined accent type as the accent type of the accent phrase.

In the prior art, the accent type of an accent phrase is determined based only on the pitch of that individual accent phrase in the voice waveform data. In contrast, in the embodiment of the present invention, the pitch estimation unit 16 calculates pitch estimates for the context-dependent phoneme label of each accent type based on the prosody model 15, which is trained using various texts and the corresponding voice waveform data, and the accent type determination unit 19 determines the accent type using the distances calculated from these pitch estimates. The pitch estimates take into account pitch variation over a section such as a sentence or phrase of the text, that is, over a section longer than an accent phrase. The determined accent type therefore also reflects the entire sentence or a phrase, which is longer than the accent phrase. Consequently, the accent type can be determined from the voice waveform data with high accuracy.

Furthermore, the determined accent type is stored as part of the data (prosody data) of an acoustic model, for example during pre-training for speech synthesis, and this acoustic model is used during text-to-speech synthesis. Therefore, when the accent type determination accuracy improves, text-to-speech synthesis using an acoustic model trained on those accent types produces speech with natural accents, and highly accurate speech synthesis can be realized.

An ordinary computer can be used as the hardware configuration of the accent type determination device 1 according to the embodiment of the present invention. The accent type determination device 1 is constituted by a computer including a CPU, a volatile storage medium such as RAM, a non-volatile storage medium such as ROM, an interface, and the like. The functions of the voice text analysis unit 10, the voice waveform pitch calculation unit 11, the pitch representative value calculation unit 12, the text analysis unit 13, the accent type label creation unit 14, the prosody model 15, the pitch estimation unit 16, the pitch representative value calculation unit 17, the distance calculation unit 18, and the accent type determination unit 19 provided in the accent type determination device 1 are each realized by causing the CPU to execute a program describing these functions. These programs are stored in the storage medium and are read out and executed by the CPU. These programs can also be stored and distributed on a storage medium such as a magnetic disk (floppy (registered trademark) disk, hard disk, etc.), an optical disk (CD-ROM, DVD, etc.), or a semiconductor memory, and can be transmitted and received via a network.

Although the present invention has been described above with reference to an embodiment, the present invention is not limited to that embodiment and can be modified in various ways without departing from its technical idea. For example, in the above embodiment, the pitch estimation unit 16 of the accent type determination device 1 uses the prosody model 15 to calculate, for each of the M_l context-dependent phoneme labels, pitch estimates that take into account pitch variation over the entire sentence or a phrase of the text. However, the present invention does not necessarily require the prosody model 15. The pitch estimation unit 16 may instead calculate pitch estimates that take into account pitch variation over the entire sentence or a phrase of the text by a known method, without using the prosody model 15.

Also, in the above embodiment, the pitch representative value calculation unit 12 approximates the per-frame pitch values in the mora section by a linear function using the least squares method, and sets the pitch value of the final frame in that linear function as the pitch representative value of the mora. Alternatively, the pitch representative value calculation unit 12 may approximate the per-frame pitch values in the mora section by a linear function using a method other than the least squares method. The pitch representative value calculation unit 12 may also approximate the per-frame pitch values in the mora section by a predetermined formula other than a linear function, using the least squares method or a similar method. Further, the pitch representative value calculation unit 12 may set the pitch value of a frame other than the final frame in the predetermined formula, such as a linear function, as the pitch representative value of the mora. The same applies to the pitch representative value calculation unit 17.

1 Accent type determination device
10 Voice text analysis unit
11 Voice waveform pitch calculation unit
12 Pitch representative value calculation unit
13 Text analysis unit
14 Accent type label creation unit
15 Prosody model
16 Pitch estimation unit
17 Pitch representative value calculation unit
18 Distance calculation unit
19 Accent type determination unit

Claims

An accent type determination device that cuts out a plurality of accent phrases from voice waveform data and determines the accent type of each accent phrase, the device comprising:
a voice waveform pitch calculation unit that calculates a pitch for each frame from the voice waveform data;
a first pitch representative value calculation unit that cuts out a plurality of accent phrases from the voice waveform data, cuts out a plurality of moras from each accent phrase, calculates a pitch representative value for each mora from the per-frame pitch calculated by the voice waveform pitch calculation unit, and outputs the pitch representative value of each mora included in the accent phrase of the voice waveform data;
a phoneme label creation unit that creates, for the text corresponding to the voice waveform data, a context-dependent phoneme label including information on the phrases, accent phrases, moras, and accent positions contained in the text;
an accent type label creation unit that, based on the context-dependent phoneme label created by the phoneme label creation unit, cuts out a plurality of accent phrases from the text, cuts out a plurality of moras from each accent phrase, and creates a plurality of accent-type context-dependent phoneme labels by changing the accent position based on the number of moras included in the accent phrase so that all accent types are represented for the accent phrase;
a pitch estimation unit that obtains, for each accent-type context-dependent phoneme label created by the accent type label creation unit, a per-frame pitch as a pitch estimate reflecting pitch variation over the entire sentence or a phrase of the text;
a second pitch representative value calculation unit that calculates a pitch representative value for each mora from the per-frame pitch estimates obtained by the pitch estimation unit, and outputs, for each accent-type context-dependent phoneme label, the pitch representative value of each mora included in the accent phrase of the text;
a distance calculation unit that calculates the distance between the pitch representative values of the moras included in the accent phrase of the voice waveform data, output by the first pitch representative value calculation unit, and the pitch representative values of the moras included in the accent phrase of the text, output by the second pitch representative value calculation unit for each accent-type context-dependent phoneme label, and outputs the distance for each accent-type context-dependent phoneme label; and
an accent type determination unit that identifies the accent-type context-dependent phoneme label for which the distance output by the distance calculation unit is smallest, and determines the accent type of that accent-type context-dependent phoneme label as the accent type of the accent phrase.

The accent type determination device according to claim 1, wherein
the first pitch representative value calculation unit approximates the per-frame pitch values included in the mora by a predetermined formula and obtains the pitch value of the final frame in the predetermined formula as the pitch representative value of the mora, and
the second pitch representative value calculation unit approximates the per-frame pitch estimates included in the mora by a predetermined formula and obtains the pitch estimate of the final frame in the predetermined formula as the pitch representative value of the mora.

The accent type determination device according to claim 1 or 2, wherein
the pitch estimation unit obtains the pitch estimate using a prosody model trained in advance.

The accent type determination device according to claim 3, wherein
the prosody model is a model trained using speech uttered by the same speaker as the speaker of the voice waveform data.

A program for causing a computer to function as the accent type determination device according to any one of claims 1 to 4.