JP2011191528A

JP2011191528A - Rhythm creation device and rhythm creation method

Info

Publication number: JP2011191528A
Application number: JP2010057661A
Authority: JP
Inventors: Takahiro Otsuka; 貴弘大塚; Satoshi Furuta; 訓古田; Tadashi Yamaura; 正山浦; Hirohisa Tazaki; 裕久田崎
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2010-03-15
Filing date: 2010-03-15
Publication date: 2011-09-29
Anticipated expiration: 2030-03-15
Also published as: JP5393546B2

Abstract

PROBLEM TO BE SOLVED: To provide a rhythm creation device and a rhythm creation method for creating rhythm information that yields a stable and natural rhythm.
SOLUTION: Detailed rhythm information 103 similar to representative rhythm information 102 is selected from a detailed rhythm information storage section 2, which stores in advance detailed rhythm information 103 indicating each of a plurality of rhythm feature patterns, by referring to the representative rhythm information 102 created from input language information 101 and to the detailed rhythm information 103.
COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a prosody creation device and a prosody creation method that create prosody information for synthesized speech, for example when synthesized speech is created from text input to a computer.

Text-to-speech synthesis systems that mechanically create a speech waveform from arbitrary text have been developed. In general, a text-to-speech synthesis system consists of a language processing unit, a prosody generation unit, and a speech waveform generation unit. When the system creates a speech waveform, the prosody generation unit performs the processing that governs the naturalness of the intonation, rhythm, and volume (power) of the speech.
To generate natural intonation, a method has been proposed that uses a large number of pitch patterns extracted from natural speech as they are (see, for example, Patent Document 1). In this method, pitch patterns extracted from natural speech are stored in a prosody database, and the single pitch pattern that best matches the linguistic information of the input text is selected from the database to generate the pitch pattern.
Patent Document 2, on the other hand, discloses a method that selects a plurality of pitch patterns from a prosody database storing a large number of pitch patterns extracted from natural speech, based on linguistic information obtained by analyzing the text for each prosody control unit of the text to be synthesized, and generates a new pitch pattern from the selected patterns.

Patent Document 1: JP 2002-297175 A
Patent Document 2: JP 2006-309162 A

The conventional techniques have the following problem: when there are multiple pitch patterns with the same linguistic information as the input text, an appropriate pitch pattern cannot be selected, and the intonation becomes unnatural. For example, when pitch patterns are created by analyzing human speech, the characteristics of the voice fluctuate with the recording time or the speaker's condition, so pitch patterns with different voice heights can become selection candidates even though their linguistic information is identical. In that case, an appropriate pitch pattern cannot be selected on the basis of linguistic information alone.

Furthermore, when a pitch pattern is selected based on the distance between the linguistic information of the input text and the linguistic information in the prosody database, and the database contains no linguistic information similar to that of the input text, the distance between linguistic information does not adequately express the perceived impression of the prosody. An unexpected pitch pattern is then selected, and the intonation becomes unnatural.

The present invention has been made to solve the problems described above, and its object is to obtain a prosody creation device and a prosody creation method that create prosody information yielding a stable and natural prosody.

The prosody creation device according to the present invention comprises: a representative prosody information creation unit that creates, from input language information, representative prosody information indicating a pattern of prosodic features corresponding to that language information; a detailed prosody information storage unit that stores in advance detailed prosody information, each item indicating one of a plurality of prosodic feature patterns; and a detailed prosody information selection unit that refers to the representative prosody information created by the representative prosody information creation unit and the detailed prosody information stored in the detailed prosody information storage unit, and selects from the storage unit detailed prosody information similar to the representative prosody information.

According to the present invention, detailed prosody information similar to the representative prosody information is selected from a detailed prosody information storage unit, which stores in advance detailed prosody information indicating each of a plurality of prosodic feature patterns, by referring to the representative prosody information created from the input language information and to the detailed prosody information.
Because detailed prosody information similar to the representative prosody information is selected in this way, detailed prosody information that deviates greatly from the prosodic feature pattern indicated by the representative prosody information is unlikely to be chosen, even when there are multiple patterns with the same language information. The effect is a stable prosody, that is, one with no sense of discontinuity across the phonemes, words, and sentences being synthesized.

FIG. 1 is a block diagram showing the configuration of the prosody creation device according to Embodiment 1 of the present invention.
FIG. 2 is a diagram showing an example of the contents stored in the detailed prosody information storage unit.
FIG. 3 is a diagram showing an example of input language information.
FIG. 4 is a diagram showing representative prosody information created from the input language information in FIG. 3.
FIG. 5 is a diagram showing an example of output prosody information.
FIG. 6 is a diagram explaining the process of creating output prosody information by taking a weighted average of representative prosody information and detailed prosody information at a predetermined weighting ratio.

Embodiment 1.
FIG. 1 is a block diagram showing the configuration of the prosody creation device according to Embodiment 1 of the present invention. In FIG. 1, the prosody creation device of Embodiment 1 comprises a representative prosody information creation unit 1, a detailed prosody information storage unit 2, a detailed prosody information selection unit 3, and a mixed prosody information creation unit 4. The representative prosody information creation unit 1 is the component that receives the input language information 101 supplied to the prosody creation device, and it creates representative prosody information 102 by referring to the input language information 101. The detailed prosody information storage unit 2 is a storage unit that stores a plurality of items of detailed prosody information 103.

The detailed prosody information selection unit 3 is the component that selects detailed prosody information 103. Referring to the representative prosody information 102, it selects from the detailed prosody information storage unit 2 the detailed prosody information 103 that is similar to the representative prosody information 102 and outputs it as detailed prosody information 104. The mixed prosody information creation unit 4 is the component that creates prosody information in which the representative prosody information 102 and the detailed prosody information 104 are mixed. It obtains the similarity between the prosodic feature patterns of the representative prosody information 102 and the detailed prosody information 104 along the time series and, according to that similarity, sets either the representative prosody information 102 or the detailed prosody information 104 as the output prosody information 105 to be output for each part of the time series.

The detailed prosody information storage unit 2 stores, as detailed prosody information 103, a plurality of items of sound information together with the pattern of prosodic features corresponding to each. Here, a pattern of prosodic features is obtained, for example, by pitch analysis of the pitch (voice height), which is a prosodic feature of human speech, with four pitch values per mora arranged in time order.
FIG. 2 shows an example of the contents stored in the detailed prosody information storage unit: sound information 201 and pitch patterns 202 for two utterances of "chuushi shimasu" (中止します, "I will cancel") and one utterance of "uridasareru" (売り出される, "is put on sale"). The sound information 201 of the utterance "chuushi shimasu" is "chu-u-shi-shi-ma-su" (ちゅうしします), and its pattern 202 consists of 24 pitch points (6 morae × 4) arranged in time order. As FIG. 2 shows, human utterances are never exactly identical even for the same language information, so pitch pattern 203 and pitch pattern 204 differ in shape.
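The stored entries described above can be pictured as a small record type. The following is a minimal sketch, not the patent's implementation: the class name `DetailedProsody`, the field names, the use of Hz as the pitch unit, and the pitch values themselves are all assumptions made for illustration. Only the layout (sound information plus four time-ordered pitch points per mora) comes from the description.

```python
from dataclasses import dataclass
from typing import List

POINTS_PER_MORA = 4  # the embodiment uses four pitch values per mora


@dataclass
class DetailedProsody:
    """One stored entry: a mora sequence plus a time-ordered pitch pattern."""
    morae: List[str]    # sound information, one string per mora
    pitch: List[float]  # pitch values (Hz assumed), POINTS_PER_MORA per mora

    def __post_init__(self) -> None:
        expected = len(self.morae) * POINTS_PER_MORA
        if len(self.pitch) != expected:
            raise ValueError(f"expected {expected} pitch points, got {len(self.pitch)}")

    def mora_slice(self, k: int) -> List[float]:
        """Return the pitch points that belong to the k-th mora (0-based)."""
        return self.pitch[k * POINTS_PER_MORA:(k + 1) * POINTS_PER_MORA]


# Hypothetical entry for "chu-u-shi-shi-ma-su": 6 morae x 4 = 24 points.
# The falling line of values is invented; real entries come from pitch analysis.
entry = DetailedProsody(
    morae=["chu", "u", "shi", "shi", "ma", "su"],
    pitch=[200.0 - 2.0 * i for i in range(24)],
)
```

Storing the pattern as a flat list keeps the time ordering explicit, while `mora_slice` recovers the per-mora grouping when sound information has to be compared mora by mora.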

The representative prosody information creation unit 1, the detailed prosody information selection unit 3, and the mixed prosody information creation unit 4 can be realized on a computer as concrete means in which hardware and software cooperate, by having the computer execute a prosody creation program in accordance with the gist of the present invention.
The detailed prosody information storage unit 2 is built on a storage device of that computer, such as a hard disk device or an external storage medium.

Next, the operation will be described.
The operating principle of the prosody creation device of Embodiment 1 is described in detail below. For the configuration of the device, refer to FIG. 1.
First, the input language information 101 consists of sound information, accent information, position information, and the like. The sound information, accent information, and position information are obtained by analyzing a sentence written in mixed kanji and kana with a known morphological analysis technique.
FIG. 3 shows an example of input language information: the language information obtained by morphological analysis of the sentence "chizu o hyouji shimasu" (地図を表示します, "display the map"). In FIG. 3, the sound information 301 of "chizu o" (地図を) is "chi-zu-o" (ちずを), its accent type 302 is "type 1", and its position information 303 is "1st". The sound information 301 of "hyouji shimasu" (表示します) is "hyo-o-ji-shi-ma-su" (ひょおじします), its accent type 302 is "type 5", and its position information 303 is "2nd".

The representative prosody information creation unit 1 creates the representative prosody information 102 from the input language information 101. The representative prosody information 102 is a pattern of prosodic features (pitch, power, or rhythm).
For example, a pitch pattern can be created with the well-known point pitch model. Point pitch patterns are described on pages 168-169 of Reference 1.
(Reference 1) Yukinori Takubo, Kikuo Maekawa, Haruo Kubozono, Kiyoshi Honda, Katsuhiko Shirai, and Seiichi Nakagawa, "Iwanami Lecture Series: Science of Language 2, Speech", Iwanami Shoten.

The point pitch model builds on the tendency of pitch to fall over the course of a whole sentence. It represents this basic declination pattern as a straight line and the accent component added on top of it as a trapezoid, determines the pitch at the center point of each mora (a unit roughly equivalent to a syllable), and obtains the pitch pattern by interpolating between those center-point pitches.
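The construction just described (declining straight line, accent component on top, interpolation between mora centers) can be sketched as follows. This is an illustration under stated assumptions rather than the model of Reference 1: the function names, the Hz values, the linear interpolation, and the simplification of the trapezoid to a flat raise on the "high" morae are all assumptions. The rule used to decide which morae are high (type 1: first mora only; type 0: second mora onward; type n: second through n-th mora) is the usual description of Tokyo Japanese accent types and is not taken from the patent.

```python
from typing import List


def point_pitch_targets(n_morae: int, accent_type: int,
                        start_hz: float = 220.0, end_hz: float = 180.0,
                        accent_hz: float = 40.0) -> List[float]:
    """Pitch at each mora center: declining baseline plus an accent raise."""
    targets = []
    for k in range(n_morae):
        frac = k / (n_morae - 1) if n_morae > 1 else 0.0
        base = start_hz + (end_hz - start_hz) * frac  # basic declination line
        if accent_type == 1:
            high = (k == 0)
        else:
            high = k >= 1 and (accent_type == 0 or k < accent_type)
        targets.append(base + (accent_hz if high else 0.0))
    return targets


def interpolate(targets: List[float], points_per_mora: int = 4) -> List[float]:
    """Linearly interpolate between mora-center pitches, holding the ends flat."""
    n = len(targets)
    out = []
    for k in range(n):
        for j in range(points_per_mora):
            # Position in mora units, measured from the first mora center.
            x = k + (j + 0.5) / points_per_mora - 0.5
            x = min(max(x, 0.0), float(n - 1))
            lo = min(int(x), max(n - 2, 0))
            hi = min(lo + 1, n - 1)
            w = x - lo
            out.append((1.0 - w) * targets[lo] + w * targets[hi])
    return out


# "chi-zu-o": 3 morae, accent type 1.
targets = point_pitch_targets(3, 1)
pattern = interpolate(targets)
```

With the default values, `targets` is `[260.0, 200.0, 180.0]`, and `pattern` holds 12 points that start at 260.0 and end at 180.0.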

FIG. 4 shows the representative prosody information created from the input language information in FIG. 3. In FIG. 4, the horizontal axis is time and the vertical axis is voice height (pitch). Sound information 401 is shown along the horizontal axis, and the basic declination pattern 402 of the point pitch model is drawn as a broken line. The black dots are the pitches 403 at the mora center points, and interpolating between the pitches 403 gives the pitch pattern 404.
In the representative prosody information 102 created from the input language information of FIG. 3, for example, the start and end points of the basic declination pattern 402 are determined from the maximum value of the position information 303 ("2nd" in the case of FIG. 3), and the accent component (the shape of the trapezoid) is determined from the number of sounds in the sound information 401 (3 for "chi-zu-o") and the accent information (type 1 for "chi-zu-o").

The detailed prosody information selection unit 3 compares the representative prosody information 102 obtained as described above with the patterns stored in the detailed prosody information storage unit 2 (detailed prosody information 103), selects a pattern similar to the representative prosody information 102, and takes it as the detailed prosody information 104.
For example, it calculates the similarity between the representative prosody information 102 and each pattern stored in the detailed prosody information storage unit 2, and takes the detailed prosody information 103 with a high similarity as the detailed prosody information 104.
The similarity can be calculated by referring to the input language information 101, the representative prosody information 102, the detailed prosody information 103, and the language information corresponding to the detailed prosody information 103, using, for example, the distance d given by equation (1).
In equation (1), i denotes time and w(i) denotes the weight at each time i. F0t(i) is the pitch of the representative prosody information 102 at time i, and F0s(i) is the pitch at time i of the pitch pattern of the detailed prosody information 103.

[Equation (1)]
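Equation (1) was an image in the original publication and is not reproduced in this text. The description later calls it a weighted Euclidean distance, so a form consistent with the symbols defined above would be the following. Whether the original includes the square root or any normalization is not recoverable from the text, so treat this as an assumed reconstruction.

```latex
d = \sqrt{\sum_{i} w(i)\,\bigl(F0_t(i) - F0_s(i)\bigr)^{2}} \tag{1}
```

Here the sum runs over the time points i of the pattern, and a smaller d corresponds to a higher similarity.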

In equation (1), the weight w(i) is set to a small value (at least 0 and less than 1) if the sound information of the input language information 101 at time i is the same as the sound information (language information) at time i of the detailed prosody information 103 stored in the detailed prosody information storage unit 2, and to a large value (1 or more) if they differ. In other words, when the sound information matches, the distance d is computed to be smaller (the similarity larger), which makes the pattern more likely to be selected. Conversely, when the sound information does not match, the distance d is computed to be larger (the similarity smaller), which makes the pattern less likely to be selected.
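The weighting rule and the selection step can be sketched together. This is a minimal sketch under assumptions, not the patent's implementation: the function names, the concrete weight values 0.5 and 1.0 (chosen only to satisfy "less than 1" and "1 or more"), the square-root form of the distance, and the requirement that all patterns have the same length are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def make_weights(input_morae: Sequence[str], cand_morae: Sequence[str],
                 points_per_mora: int = 4,
                 same: float = 0.5, diff: float = 1.0) -> List[float]:
    """w(i): small where the morae match, large where they differ."""
    w: List[float] = []
    for a, b in zip(input_morae, cand_morae):
        w.extend([same if a == b else diff] * points_per_mora)
    return w


def distance(f0_t: Sequence[float], f0_s: Sequence[float],
             w: Sequence[float]) -> float:
    """Weighted Euclidean distance between two pitch patterns."""
    return math.sqrt(sum(wi * (t - s) ** 2 for wi, t, s in zip(w, f0_t, f0_s)))


def select_detailed(rep_pitch: Sequence[float], input_morae: Sequence[str],
                    candidates: Sequence[Tuple[Sequence[str], Sequence[float]]]) -> int:
    """Return the index of the candidate with the smallest distance d."""
    dists = [distance(rep_pitch, pitch, make_weights(input_morae, morae))
             for morae, pitch in candidates]
    return dists.index(min(dists))


# Invented two-mora example: candidate 0 matches the sounds but is 10 Hz off,
# candidate 1 has different sounds but is only 5 Hz off.
rep = [100.0] * 8
cands = [(["shi", "ma"], [110.0] * 8), (["ka", "ta"], [105.0] * 8)]
best = select_detailed(rep, ["shi", "ma"], cands)
```

In this example the matching sounds halve the squared error of candidate 0 (d = 20.0), but candidate 1 is still closer (d is about 14.1), so `best` is 1. The weights bias the choice toward matching sounds without overriding a clearly better pitch match.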

The mixed prosody information creation unit 4 creates the output prosody information 105 by referring to the representative prosody information 102 created by the representative prosody information creation unit 1 and the detailed prosody information 104 selected by the detailed prosody information selection unit 3. Specifically, it calculates the similarity between the representative prosody information 102 and the detailed prosody information 104, and according to that similarity sets either the representative prosody information 102 or the detailed prosody information 104 as the output prosody information 105.
The similarity can be calculated using, for example, the distance d given by equation (1).
In this case, F0s(i) in equation (1) is the pitch at time i of the pitch pattern of the detailed prosody information 104 selected by the detailed prosody information selection unit 3.

In this case too, the weight w(i) is set to a small value (at least 0 and less than 1) if the sound information of the input language information 101 at time i is the same as the sound information of the detailed prosody information 104 at time i, and to a large value (1 or more) if they differ. That is, the distance d is made smaller when the sound information is the same and larger when it differs.
In a section where the distance d is at or above a certain value, the representative prosody information 102 is used as the output prosody information 105; in a section where the distance d is below that value, the detailed prosody information 104 is used as the output prosody information 105.
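The section-by-section switch can be sketched as below. This is an illustration under assumptions: the function name, the representation of sections as half-open index ranges, the threshold value, and the square-root form of the distance are not given in the source.

```python
import math
from typing import List, Sequence, Tuple


def mix_by_threshold(rep: Sequence[float], det: Sequence[float],
                     w: Sequence[float],
                     sections: Sequence[Tuple[int, int]],
                     threshold: float) -> List[float]:
    """Per section, output the representative pattern if d >= threshold,
    otherwise the selected detailed pattern."""
    out: List[float] = []
    for start, end in sections:
        d = math.sqrt(sum(w[i] * (rep[i] - det[i]) ** 2 for i in range(start, end)))
        out.extend(rep[start:end] if d >= threshold else det[start:end])
    return out


# Invented example with two sections of two points each.
rep = [100.0, 100.0, 100.0, 100.0]
det = [130.0, 130.0, 101.0, 101.0]
w = [1.0, 1.0, 1.0, 1.0]
mixed = mix_by_threshold(rep, det, w, sections=[(0, 2), (2, 4)], threshold=10.0)
```

The first section is far from the representative pattern (d is about 42.4), so the representative values are kept; the second is close (d is about 1.4), so the detailed values are used, giving `[100.0, 100.0, 101.0, 101.0]`.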

FIG. 5 shows an example of output prosody information. In FIG. 5, when the distance d for the sound information "chi-zu-o" is at or above the certain value (low similarity), the part of the output prosody information 105 corresponding to "chi-zu-o" is the representative prosody information 102 (the pitch pattern corresponding to "chi-zu-o" in FIG. 4). For the part of the output prosody information 105 corresponding to the sound information "hyo-u-ji-shi-ma-su" (ひょうじします), the distance d is below the certain value (high similarity), so the detailed prosody information 104 is output.
FIG. 5 shows the output prosody information 105 when the detailed prosody information 104 is the pitch pattern 404.
In this way, the mixed prosody information creation unit 4 outputs, as the output prosody information 105, mixed prosody information in which the representative prosody information 102 and the detailed prosody information 104 are mixed according to the similarity between them.

The description above covers the case where the input language information 101 is obtained by analyzing a sentence in mixed kanji and kana, but the input language information 101 may instead be created by hand.
The description also shows the representative prosody information 102 being created from input language information 101 consisting of sound information, accent information, position information, and the like, but the invention is not limited to this.
For example, the representative prosody information 102 may be created from sound information alone. That is, the start and end points of the basic declination pattern may be determined from the total number of sounds in the sound information, and the accent component (an average trapezoidal shape that does not depend on accent type) may be determined from the number of sounds in each item of sound information.

A pitch pattern was used as the example of representative prosody information 102, but a power pattern representing the loudness of the voice or a rhythm pattern representing the length of each sound may be used instead.
In particular, where the representative prosody information 102 is a pitch pattern, it was shown being created with the point pitch model; alternatively, the pitch for each mora may be given by hand and the pitch pattern created by interpolating between those values.
Furthermore, feature values (pitch, power, length of a sound) estimated from language information with a statistical learning method (such as a quantification theory type I model or a regression tree model) may be set for each mora, and the pattern obtained by interpolating them may be used as the representative prosody information 102.

The description above shows the detailed prosody information storage unit 2 storing the sound information corresponding to each pattern, but the sound information may be omitted. That is, the detailed prosody information storage unit 2 may store only the patterns of prosodic features (pitch patterns, power patterns, rhythm patterns).
The patterns stored in the detailed prosody information storage unit 2 were described as having four feature values per mora, but there may be more or fewer. The number per mora need not be fixed and may vary with, for example, the length of the sound.

The description above uses patterns about one bunsetsu (phrase unit) long, but the patterns may be longer (for example, the length of a phrase or a sentence) or shorter (for example, the length of a phoneme or a morpheme).

The input language information 101 was described using Japanese (and morae), but the invention is also applicable to other languages, such as English or Chinese. For languages that have no morae, control may be performed in units equivalent to morae or in syllable units.

In addition, the pitch patterns stored in the detailed prosody information storage unit 2 (detailed prosody information 103) were described as being obtained by pitch analysis of human speech. Alternatively, pitch patterns may be created by hand with reference to the pitch obtained from pitch analysis of human speech and then stored in the detailed prosody information storage unit 2.

The description above shows the detailed prosody information selection unit 3 calculating the similarity between the representative prosody information 102 and the detailed prosody information 103 with the weighted Euclidean distance of equation (1), but any known distance measure may be used instead.
For example, the calculation may use the p-th order mean norm distance given by equation (2) or the maximum norm distance given by equation (3).

p-th order mean norm (for example, p = 1, 3, 4):
[Equation (2)]

Maximum norm:
[Equation (3)]
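Equations (2) and (3) were images in the original publication and are not reproduced in this text. Using the same symbols as equation (1), forms consistent with the names "p-th order mean norm" and "maximum norm" would be the following. The number of time points N, the 1/N normalization, and the presence of the weight w(i) inside each expression are assumptions.

```latex
d_p = \Bigl(\frac{1}{N}\sum_{i=1}^{N} w(i)\,\bigl|F0_t(i) - F0_s(i)\bigr|^{p}\Bigr)^{1/p} \tag{2}

d_\infty = \max_{1 \le i \le N} \; w(i)\,\bigl|F0_t(i) - F0_s(i)\bigr| \tag{3}
```

Under these forms, larger p places more emphasis on the largest pointwise differences, and the maximum norm is the limiting case that considers only the single largest difference.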

Alternatively, the correlation coefficient between the representative prosody information 102 and the detailed prosody information 103 may be calculated as the similarity, and the detailed prosody information 103 with the largest correlation coefficient may be taken as the selection result (detailed prosody information 104).
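Selection by correlation coefficient can be sketched as follows. The source does not say which correlation coefficient is meant, so the Pearson coefficient used here is an assumption, as are the function names and the example values.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length patterns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        return 0.0  # a flat pattern has no defined correlation; treat as 0
    return cov / (sx * sy)


def select_by_correlation(rep: Sequence[float],
                          candidates: Sequence[Sequence[float]]) -> int:
    """Return the index of the candidate most correlated with rep."""
    scores: List[float] = [pearson(rep, c) for c in candidates]
    return scores.index(max(scores))


# Invented example: candidate 1 has the same shape as rep at a higher register.
rep = [200.0, 220.0, 210.0, 190.0]
cands = [[190.0, 180.0, 200.0, 210.0], [250.0, 270.0, 260.0, 240.0]]
best = select_by_correlation(rep, cands)
```

Unlike a distance, correlation ignores a constant offset in voice height, so candidate 1 is chosen even though its absolute pitch is 50 Hz higher. That trade-off may or may not be desirable depending on how register differences between recordings should be treated.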

The description above shows the mixed prosody information creation unit 4 calculating the similarity between the representative prosody information 102 and the detailed prosody information 104 with the weighted Euclidean distance of equation (1), but any known distance measure may be used instead.
For example, the calculation may use the p-th order mean norm distance given by equation (2) or the maximum norm distance given by equation (3).

Alternatively, the correlation coefficient between the representative prosody information 102 and the detailed prosody information 104 may be calculated as the similarity, and the detailed prosody information 104 with the largest correlation coefficient may be taken as the selection result (output prosody information 105).

In the description above, the detailed prosody information selection unit 3 treats the weight w(i) as a variable value, but when the detailed prosody information storage unit 2 holds no sound information, the weight may be a constant (for example, 1).

The detailed prosody information selection unit 3 was described as varying the weight w(i) according to differences in sound information, but it may instead vary the weight according to differences in position information. In that case, if the position information is the same, the weight w(i) is set to a small value (at least 0 and less than 1) for all times i, and if the position information differs, the weight w(i) is set to a large value (1 or more) for all times i.
The weight w(i) may also be set by taking differences in sound information and differences in position information into account together.

In the description so far, the detailed prosody information selection unit 3 calculates the similarity between the representative prosody information 102 and each pattern stored in the detailed prosody information storage unit 2 (detailed prosody information 103) and takes the closest one (smallest distance d) as the detailed prosody information 104, but the invention is not limited to this.
For example, a representative value (mode, median, or mean) may be obtained from the plurality of items of detailed prosody information 103 whose distance from the representative prosody information 102 is at or below a certain value, and this representative value may be used as the detailed prosody information 104.
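Taking a representative value over several nearby candidates can be sketched as a pointwise median. This is one possible reading, stated as an assumption: the source does not say whether the mode, median, or mean is taken point by point, and the function names, the unweighted distance, and the threshold are invented for illustration.

```python
import math
from statistics import median
from typing import List, Sequence


def pointwise_median(patterns: Sequence[Sequence[float]]) -> List[float]:
    """Median across patterns at each time point (patterns of equal length)."""
    return [median(values) for values in zip(*patterns)]


def representative_of_close(rep: Sequence[float],
                            candidates: Sequence[Sequence[float]],
                            threshold: float) -> List[float]:
    """Pointwise median of all candidates whose distance to rep is <= threshold."""
    close = [c for c in candidates
             if math.sqrt(sum((a - b) ** 2 for a, b in zip(rep, c))) <= threshold]
    if not close:
        raise ValueError("no candidate within the threshold")
    return pointwise_median(close)


# Invented example: the third candidate is far away and is excluded.
rep = [100.0, 100.0]
cands = [[101.0, 99.0], [103.0, 102.0], [150.0, 150.0], [102.0, 98.0]]
result = representative_of_close(rep, cands, threshold=5.0)
```

The three candidates within the threshold give the pointwise median `[102.0, 99.0]`. A median is less sensitive than a mean to one unusual recording among the close candidates.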

Likewise, the mixed prosodic information creation unit 4 calculates the distance between the representative prosodic information 102 and the detailed prosodic information 104 and, depending on this similarity, uses either the representative prosodic information 102 or the detailed prosodic information 104 as the output prosodic information 105; however, the invention is not limited to this.
For example, a weighting ratio may be set according to the difference in sound information between the input language information 101 and the detailed prosodic information 104, and the weighted average of the representative prosodic information 102 and the detailed prosodic information 104 at this weighting ratio may be used as the output prosodic information 105.
FIG. 6 is a diagram explaining the process of creating output prosodic information by taking a weighted average of representative prosodic information and detailed prosodic information at a predetermined weighting ratio. In FIG. 6, the output prosodic information 604 is created by taking a weighted average of the representative prosodic information 601 and the detailed prosodic information 602 at the weighting ratio 603.
The weighting ratio 603 in FIG. 6 is set so that the share of the representative prosodic information 601 is large in the first-half time section and the share of the detailed prosodic information 602 is large in the second-half time section.
Comparing the sound information "hyouji shimasu" and "chuushi shimasu" sound by sound, the first halves differ, while the second half, "shimasu", is identical sound for sound.
Thus, by setting a large weighting ratio for the second half of the detailed prosodic information 602, the pattern of the detailed prosodic information 602 can be reproduced in the second half of the output prosodic information 604.
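The weighted averaging of FIG. 6 can be sketched as follows. The function name `mix_prosody`, the sample pitch values, and the particular ratio values are assumptions made for illustration; the only property taken from the text is that the ratio favors the representative pattern in the first half and the detailed pattern in the second half.

```python
def mix_prosody(representative, detailed, ratio):
    """Weighted average of two patterns, time point by time point.

    ratio[i] is the share given to the detailed pattern at time i (0.0 to 1.0);
    the representative pattern receives the remaining share 1.0 - ratio[i].
    """
    if not (len(representative) == len(detailed) == len(ratio)):
        raise ValueError("all three sequences must have the same length")
    return [(1.0 - w) * r + w * d for r, d, w in zip(representative, detailed, ratio)]


# Hypothetical pitch values at six times.
representative = [5.0, 5.2, 5.4, 5.0, 4.6, 4.2]
detailed = [5.5, 5.6, 5.6, 5.2, 4.9, 4.3]

# First half: the sounds differ, so the representative pattern dominates.
# Second half ("shimasu"): the sounds match, so the detailed pattern dominates.
ratio = [0.0, 0.0, 0.5, 1.0, 1.0, 1.0]

mixed = mix_prosody(representative, detailed, ratio)
```

In this sketch the second half of `mixed` reproduces the detailed pattern, while the first half stays on the representative pattern, with one intermediate point easing the transition.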

As described above, according to Embodiment 1, detailed prosodic information 103 similar to the representative prosodic information 102 is selected from the detailed prosodic information storage unit 2, which stores in advance detailed prosodic information 103 each indicating one of a plurality of prosodic feature patterns, by referring to the representative prosodic information 102 created from the input language information 101 and to the detailed prosodic information 103.
With this configuration, detailed prosodic information 103 similar to the prosodic feature pattern indicated by the representative prosodic information 102 is selected as the detailed prosodic information 104 that makes up the output prosodic information 105. Therefore, even when there are multiple patterns for the same language information, detailed prosodic information 104 that deviates greatly from the prosodic feature pattern of the representative prosodic information 102 is unlikely to be selected. As a result, a stable prosody (one with no sense of discontinuity across the phonemes, words, and sentences to be synthesized) is obtained.
In addition, because the output prosodic information 105 is detailed prosodic information, that is, a prosodic feature pattern obtained by analyzing human speech, a prosodic pattern close to the prosodic features of human speech can be generated, and as a result a natural prosody is obtained.
For example, when the patterns (detailed prosodic information 103) stored in the detailed prosodic information storage unit 2 are created by analyzing human speech, the speech features fluctuate with the recording time and the condition of the speaker. Even if detailed prosodic information with different speech features (for pitch, a different height, for example) becomes a selection candidate, selecting with reference to the representative prosodic information 102 makes it unlikely that detailed prosodic information 103 deviating greatly from the prosodic feature pattern indicated by the representative prosodic information 102 is chosen, so a stable prosody is obtained.

Furthermore, even when there is no similar language information and the distance between pieces of language information cannot be defined well, selection based on the representative prosodic information 102 makes greatly deviating detailed prosodic information unlikely to be chosen, and the output prosodic information 105 is a pattern (detailed prosodic information) obtained by analyzing human speech. A stable and natural prosody is therefore obtained.

Further, according to Embodiment 1, the detailed prosodic information storage unit 2 stores the detailed prosodic information 103 and the language information corresponding to it. When the detailed prosodic information selection unit 3 calculates the similarity between the representative prosodic information 102 and the detailed prosodic information 103 with reference to the input language information 101, the representative prosodic information 102, the detailed prosodic information 103, and the language information corresponding to it, the similarity is calculated so that it is larger when the input language information 101 matches the language information corresponding to the detailed prosodic information 103 than when they do not match.
By taking the input language information 101 into account in the similarity calculation in this way, detailed prosodic information 104 that is similar both to the input language information 101 and to the representative prosodic information 102 is selected. This makes it easier to obtain prosodic information containing the prosodic features specific to the input language information 101 (pattern changes caused by phonemes, differences in pattern shape at the beginning, middle, and end of a sentence, and so on), so stable and natural prosodic information is obtained.

Further, according to Embodiment 1, the mixed prosodic information creation unit 4 is provided, which creates, as the output prosodic information 105, mixed prosodic information in which the prosodic feature patterns of the representative prosodic information 102 and the detailed prosodic information 104 are mixed so that sections where the two are not similar take the prosodic feature pattern of the representative prosodic information 102 and sections where the two are similar take the prosodic feature pattern of the detailed prosodic information 104. Consequently, when the selected detailed prosodic information 104 is not similar to the representative prosodic information 102, the representative prosodic information 102 is used as the output prosodic information 105, so detailed prosodic information 104 that deviates greatly from the prosodic feature pattern of the representative prosodic information 102 is no longer selected, and stable prosodic information is obtained.
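The section-by-section mixing described above can be sketched as follows. The sketch assumes that sections are given as index ranges, that "similar" means a mean squared difference at or below a threshold, and that both patterns are sampled at the same times; the name `mix_by_section`, the threshold value, and the sample data are hypothetical.

```python
def mix_by_section(representative, detailed, sections, threshold):
    """Choose, per section, the detailed pattern if it is similar, else the representative one.

    `sections` is a list of (start, end) index pairs with `end` exclusive.
    Similarity is judged by the mean squared difference within the section
    (an assumed measure; smaller means more similar).
    """
    output = []
    for start, end in sections:
        rep = representative[start:end]
        det = detailed[start:end]
        mean_sq = sum((r - d) ** 2 for r, d in zip(rep, det)) / max(len(rep), 1)
        output.extend(det if mean_sq <= threshold else rep)
    return output


representative = [1.0, 1.0, 1.0, 1.0]
detailed = [5.0, 5.0, 1.1, 0.9]  # first section deviates strongly, second is close
sections = [(0, 2), (2, 4)]

output = mix_by_section(representative, detailed, sections, threshold=0.1)
```

Here the first section falls back to the representative pattern because the detailed pattern deviates strongly, while the second section keeps the detailed pattern.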

In Embodiment 1, consider the case where two or more pieces of detailed prosodic information 104 are concatenated to form the output prosodic information 105, one of the selected pieces of detailed prosodic information 104 is not similar to the representative prosodic information 102, and the output prosodic information 105 in that section is therefore the representative prosodic information 102. In this case, the other piece of detailed prosodic information 104 has been selected as one that is similar (close in distance) to the representative prosodic information 102.
For this reason, the continuity of the connection between the representative prosodic information 102 and the detailed prosodic information 104 is high. In other words, the output prosodic information 105 obtained by concatenating the detailed prosodic information 104 and the representative prosodic information 102 is highly continuous, and stable prosodic information is obtained.

Also, according to Embodiment 1, the mixed prosodic information creation unit 4 is provided, which creates, as the output prosodic information 105, mixed prosodic information obtained by taking a weighted average of the prosodic feature patterns of the representative prosodic information 102 and the detailed prosodic information 104 at a predetermined weighting ratio.
With this configuration, the selection of detailed prosodic information 104 that deviates greatly from the prosodic feature pattern of the representative prosodic information 102 is suppressed, and the output prosodic information 105 is close to a pattern (detailed prosodic information) obtained by analyzing human speech, so stable and natural prosodic information is obtained.
In particular, when only a few patterns are stored in the detailed prosodic information storage unit 2, greatly deviating detailed prosodic information may be selected. In this case, obtaining the output prosodic information as a weighted average of the representative prosodic information and the detailed prosodic information suppresses the selection of greatly deviating detailed prosodic information, and stable prosodic information is obtained.

Furthermore, according to Embodiment 1, a weighting ratio is set according to the difference in sound information between the input language information 101 and the detailed prosodic information 104, and the mixed prosodic information created by taking a weighted average of the representative prosodic information 102 and the detailed prosodic information 104 at this weighting ratio is output as the output prosodic information 105. In this way, prosodic information whose sound information is similar is reflected in the output prosodic information 105, and natural prosodic information is obtained.

DESCRIPTION OF SYMBOLS: 1 representative prosodic information creation unit; 2 detailed prosodic information storage unit; 3 detailed prosodic information selection unit; 4 mixed prosodic information creation unit; 101 input language information; 102, 601 representative prosodic information; 103, 104, 602 detailed prosodic information; 105, 604 output prosodic information; 201, 301, 401 sound information; 202 pattern; 203, 204, 404 pitch pattern; 302 accent type; 303 position information; 402 basic gradient pattern; 403 pitch; 603 weighting ratio.

Claims

A prosody creation device comprising:
a representative prosodic information creation unit that creates, from input language information, representative prosodic information indicating a prosodic feature pattern corresponding to the language information;
a detailed prosodic information storage unit that stores in advance detailed prosodic information each indicating one of a plurality of prosodic feature patterns; and
a detailed prosodic information selection unit that refers to the representative prosodic information created by the representative prosodic information creation unit and the detailed prosodic information stored in the detailed prosodic information storage unit, and selects, from the detailed prosodic information storage unit, detailed prosodic information similar to the representative prosodic information.

The prosody creation device, wherein the detailed prosodic information storage unit stores the detailed prosodic information and the language information corresponding thereto, and
the detailed prosodic information selection unit, when calculating the similarity between the representative prosodic information and the detailed prosodic information with reference to the input language information, the representative prosodic information, the detailed prosodic information, and the language information corresponding thereto, calculates the similarity so that it is larger when the input language information matches the language information corresponding to the detailed prosodic information than when they do not match.

The prosody creation device according to claim 1 or 2, further comprising a mixed prosodic information creation unit that creates, as the prosodic information to be output, mixed prosodic information in which the prosodic feature patterns of the representative prosodic information created by the representative prosodic information creation unit and the detailed prosodic information selected by the detailed prosodic information selection unit are mixed so that sections where the two are not similar take the prosodic feature pattern of the representative prosodic information and sections where the two are similar take the prosodic feature pattern of the detailed prosodic information.

The prosody creation device according to claim 1, further comprising a mixed prosodic information creation unit that creates, as the prosodic information to be output, mixed prosodic information obtained by taking a weighted average, at a predetermined weighting ratio, of the prosodic feature patterns of the representative prosodic information created by the representative prosodic information creation unit and the detailed prosodic information selected by the detailed prosodic information selection unit.

The prosody creation device, wherein the mixed prosodic information creation unit sets the weighting ratio according to the difference in sound information between the input language information and the detailed prosodic information selected by the detailed prosodic information selection unit.

A prosody creation method for creating prosodic information by a prosody creation device, the method comprising:
a representative prosodic information creation step of creating, from input language information, representative prosodic information indicating a prosodic feature pattern corresponding to the language information; and
a detailed prosodic information selection step of selecting, from a detailed prosodic information storage unit that stores in advance detailed prosodic information each indicating one of a plurality of prosodic feature patterns, detailed prosodic information similar to the representative prosodic information, with reference to the representative prosodic information created in the representative prosodic information creation step and the detailed prosodic information.

The prosody creation method according to claim 6, wherein the detailed prosodic information and the language information corresponding thereto are stored in the detailed prosodic information storage unit, and
in the detailed prosodic information selection step, when the similarity between the representative prosodic information and the detailed prosodic information is calculated with reference to the input language information, the representative prosodic information, the detailed prosodic information, and the language information corresponding thereto, the similarity is calculated so that it is larger when the input language information matches the language information corresponding to the detailed prosodic information than when they do not match.

The prosody creation method according to claim 6 or 7, further comprising a mixed prosodic information creation step of creating, as the prosodic information to be output, mixed prosodic information in which the prosodic feature patterns of the representative prosodic information created in the representative prosodic information creation step and the detailed prosodic information selected in the detailed prosodic information selection step are mixed so that sections where the two are not similar take the prosodic feature pattern of the representative prosodic information and sections where the two are similar take the prosodic feature pattern of the detailed prosodic information.

The prosody creation method according to claim 6 or 7, further comprising a mixed prosodic information creation step of creating, as the prosodic information to be output, mixed prosodic information obtained by taking a weighted average, at a predetermined weighting ratio, of the prosodic feature patterns of the representative prosodic information created in the representative prosodic information creation step and the detailed prosodic information selected in the detailed prosodic information selection step.

The prosody creation method, wherein, in the mixed prosodic information creation step, the weighting ratio is set according to the difference in sound information between the input language information and the detailed prosodic information selected in the detailed prosodic information selection step.