JP3854593B2 - Speech synthesis apparatus, cost calculation apparatus therefor, and computer program

Info

Publication number: JP3854593B2
Application number: JP2003322553A
Authority: JP
Inventors: 信行西澤; 智基戸田; 恒河井
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2003-09-16
Filing date: 2003-09-16
Publication date: 2006-12-06
Anticipated expiration: 2023-09-16
Also published as: JP2005091551A

Description

The present invention relates to speech synthesis technology, and more particularly to speech synthesis technology that synthesizes speech close to natural utterances by selecting speech units from a speech unit database and concatenating them.

As a technique for realizing an interface between humans and machines (a man-machine interface), the exchange of simple information by inputting acoustic signals to a machine, or outputting them from it, has long been in use. In recent years, with advances in computer technology and the like, the information exchanged between machines and humans has grown in both volume and sophistication. Accordingly, man-machine interfaces that use acoustic signals are also required to convey more sophisticated information.

Among speech-based information transmission technologies, speech recognition, which converts human utterances into linguistic information that a machine can process, has been actively studied as a means of conveying information from humans to machines, and is increasingly used. Meanwhile, as a means of conveying information from machines to humans, speech synthesis, which synthesizes and outputs speech close to human utterances based on linguistic information such as the text data to be conveyed, has also been studied and is now used in a variety of machines.

Speech synthesis technology is required to:
(1) synthesize speech signals from which humans can accurately understand the linguistic information;
(2) synthesize speech signals that sound like natural utterances to humans; and
(3) synthesize speech signals from arbitrary linguistic information.

In these respects, speech synthesis that uses recorded utterances is a technology capable of higher-quality synthesis. In this approach, actual utterances are recorded and stored in a database; suitable portions are selected from the recorded speech data in accordance with a synthesis target, and a continuous speech signal is synthesized by concatenating them.

FIG. 5 is a block diagram of a conventional, typical speech synthesis system that uses such a technique. Referring to FIG. 5, conventional speech synthesis system 40 includes a unit database 42, which stores in advance segments of recorded natural human speech (hereinafter "speech units"), and a unit selection section 44 for selecting speech units from unit database 42 in accordance with a synthesis target 62, concatenating them, and outputting an output unit sequence 64.

Speech synthesis system 40 further includes a cost calculation section 46 for calculating a value called a "cost," which serves as the criterion by which unit selection section 44 selects speech units, based on physical features of the already selected speech unit and of the speech units stored in unit database 42.

Synthesis target 62 is prepared as follows. Input text 60, the linguistic information of the speech to be synthesized, is subjected to language processing such as morphological analysis and dependency analysis and converted into phoneme symbols, accent symbols, and the like. Then, based on the result of the language processing, data representing physical features is created for each predetermined unit, such as a phoneme (the basic unit of speech; in Japanese, roughly the speech corresponding to one letter when written in the Latin alphabet).

FIG. 6 shows an example of the structure of synthesis target 62. Referring to FIG. 6, synthesis target 62 includes per-phoneme synthesis targets 82, 84, ..., which unit selection section 44 uses to select a speech unit for each phoneme. Each of per-phoneme synthesis targets 82, 84, ... includes a phoneme symbol identifying the phoneme and information indicating physical features of the phoneme, such as a prosody command, the phoneme duration, and per-phoneme spectral information.

Unit database 42 of FIG. 5 stores speech units in advance, together with data representing their physical features.

Given synthesis target 62, unit selection section 44 selects from unit database 42 several speech units that match the phoneme specified by synthesis target 62. The selected speech units are candidates for use in synthesis. Unit selection section 44 supplies cost calculation section 46 with values indicating the physical features of each candidate speech unit, together with values indicating the physical features of the already selected speech unit.

Based on the physical features supplied, cost calculation section 46 calculates a "cost" for each candidate speech unit and supplies it to unit selection section 44. The cost is a measure of whether a speech unit is appropriate as the unit to be connected to the preceding speech unit. Ideally the cost would be 0, but this is usually difficult to achieve.

Unit selection section 44 determines the speech units suitable for synthesis by finding the unit sequence that minimizes the total cost. In this way, the speech units corresponding to the speech specified by synthesis target 62 are extracted. The extracted speech units, which make up output unit sequence 64, are concatenated to produce the waveform of the synthesized speech corresponding to synthesis target 62.
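The search for the minimum-total-cost unit sequence can be illustrated with a small dynamic-programming sketch. This is not taken from the patent: the function name `select_units`, the three-argument `cost` callback, and the numeric toy features are all assumptions made for illustration, and the search strategy shown is simply one common way to minimize a sum of costs that depend on the previous unit.

```python
def select_units(targets, candidates, cost):
    """Pick one unit per target so that the summed cost is minimal.

    targets:    list of per-phoneme targets t_i
    candidates: candidates[i] is the non-empty list of units available for t_i
    cost:       cost(prev_unit, unit, target) -> number; prev_unit is None at i == 0
    (All three names are hypothetical, chosen for this sketch.)
    """
    if not targets:
        return [], 0
    # best[j] = (lowest total cost of any path ending in candidates[i][j], that path)
    best = [(cost(None, u, targets[0]), [u]) for u in candidates[0]]
    for target, cands in zip(targets[1:], candidates[1:]):
        step = []
        for u in cands:
            # Extend the cheapest predecessor path with candidate u.
            total, path = min(
                ((c + cost(p[-1], u, target), p) for c, p in best),
                key=lambda entry: entry[0],
            )
            step.append((total, path + [u]))
        best = step
    total, path = min(best, key=lambda entry: entry[0])
    return path, total


# Toy example: units and targets are plain numbers (hypothetical features).
def toy_cost(prev, unit, target):
    join = 0 if prev is None else abs(unit - prev)  # mismatch with previous unit
    return abs(unit - target) + join                # plus mismatch with target


path, total = select_units([1, 2, 3], [[0, 1], [2, 5], [3, 9]], toy_cost)
```

Because each cost depends only on the current candidate and its immediate predecessor, keeping the cheapest path per candidate at each position is enough to find the global minimum.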

To obtain high-quality speech with this kind of synthesis, two things are required: unit database 42 must contain speech units for which the cost is sufficiently small, and the cost calculated by cost calculation section 46 must agree well with human perception.

To satisfy the former, current speech synthesis technology sometimes uses a large speech corpus containing tens of hours of recorded speech as unit database 42. As unit database 42 grows, unit selection section 44 shown in FIG. 5 has more options when selecting speech units. Unit selection section 44 can therefore choose speech units suitable for concatenation from among these many options, and the quality of the synthesized speech is more likely to improve.

As a technique for satisfying the latter, Non-Patent Document 1, listed below, proposes a cost calculation method that uses sub-cost functions. In that technique, a sub-cost function describing the relationship to perceptual characteristics is defined for each physical quantity used in unit selection, and the cost function calculated by cost calculation section 46 is expressed as a linear sum of the sub-cost functions. From synthesis target $t_i$ (where $i$ is the position of this speech unit in the synthesis target) and speech unit $u_{i-1}$ selected by unit selection section 44 in the previous selection step, cost calculation section 46 calculates the cost $C(u_i, t_i)$ of a candidate unit by the following equation.
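The equation is not reproduced in this text. Based on the five weights and sub-costs named in the surrounding description, the linear sum presumably takes the following form; the arguments of the individual sub-cost functions are omitted because they are not stated here.

```latex
C(u_i, t_i) = w_{pro}\,C_{pro} + w_{typ}\,C_{typ} + w_{env}\,C_{env}
            + w_{spec}\,C_{spec} + w_{F0}\,C_{F0}
```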

In this equation, $w_{pro}$, $w_{typ}$, $w_{env}$, $w_{spec}$, and $w_{F0}$ are the weights corresponding to sub-costs $C_{pro}$, $C_{typ}$, $C_{env}$, $C_{spec}$, and $C_{F0}$, respectively. In the technique of Non-Patent Document 1, these weights are, for example, constants estimated by multiple correlation analysis from the scores of subjective evaluation experiments on each cost.

Speech synthesized by concatenating speech units selected on the basis of such costs is expected to be relatively natural and free of a so-called "machine-like" quality, because the units are selected using a measure that takes human speech perception into account.

Non-Patent Document 1: Tomoki Toda, Hisashi Kawai, Minoru Tsuzaki, Kiyohiro Shikano, "Unit Selection Based on Phoneme Units and Diphone Units in Concatenative Japanese Text-to-Speech Synthesis" (in Japanese), IEICE Transactions, Vol. J85-D-II, No. 12, pp. 1760-1770, Dec. 2002.

The cost function in the technique of Non-Patent Document 1 is calculated as a linear sum of sub-cost functions. This is expected to make it possible to synthesize more natural speech that agrees well with perceptual characteristics. However, even with the cost calculation method of Non-Patent Document 1, and even with the largest unit database currently available, there are cases in which a speech unit with an error large enough to affect perception must be selected. As a result, the quality of the synthesized speech is inadequate.

The reason is thought to be as follows. The technique of Non-Patent Document 1 selects speech units based on the total of the sub-costs. However, in particular cases a given sub-cost may affect perception less than the other sub-costs do. In such a case, even if that sub-cost is especially small, large values of the other sub-costs will strongly affect perception and the quality of the synthesized speech will suffer. Conversely, a sub-cost whose effect on perception is especially large in a particular case needs to be made especially small relative to the other sub-costs.

The technique of Non-Patent Document 1 does not recognize this problem with sub-costs. Consequently, when that technique is applied as is, there remains a risk that the quality of the synthesized speech will be inadequate.

Accordingly, an object of the present invention is to provide a speech synthesis apparatus that synthesizes speech by selecting speech units according to cost, and that synthesizes high-quality speech giving a more natural perceptual impression.

Another object of the present invention is to provide a speech synthesis apparatus capable of selecting speech units so that the quality of the synthesized speech is high and no sense of unnaturalness arises.

Yet another object of the present invention is to provide a speech synthesis apparatus that, in selecting speech units, reduces the perceptual effect of the junctions between speech units and thereby improves the overall quality of the synthesized speech.

A speech synthesis apparatus according to a first aspect of the present invention synthesizes speech using speech units selected from a predetermined speech unit database in accordance with a speech synthesis target. The speech synthesis target is described by predetermined feature information of the speech. The apparatus includes: detection means for detecting, based on the predetermined feature information, locations in the speech synthesis target at which the predetermined feature information satisfies a predetermined condition; cost calculation means for performing cost calculation based on the synthesis target, using mutually different predetermined functions for the locations at which the detection means has detected that the predetermined acoustic feature satisfies the condition and for the other locations; and unit selection means for selecting from the speech unit database speech units for which the cost calculated by the cost calculation means satisfies a predetermined condition.

Preferably, the predetermined feature information includes information on the prosody of the speech to be synthesized, and the detection means includes prosodic condition detection means for detecting locations in the speech synthesis target at which the prosody information satisfies a predetermined condition.

More preferably, the prosody information includes information on the accent of the speech to be synthesized, and the prosodic condition detection means includes accent change interval detection means for detecting intervals in the speech synthesis target in which the accent changes.

The cost calculation means may include: a plurality of sub-cost function calculation means for calculating, in accordance with the speech synthesis target, the values of a plurality of sub-cost functions defined respectively for a plurality of kinds of speech feature information including the predetermined feature information; first preparation means for preparing a first constant group; second preparation means for preparing a second constant group different from the first constant group; selection means for selecting, according to the detection result of the detection means, either the first constant group or the second constant group; and means for calculating the cost as a linear sum of the sub-costs calculated by the plurality of sub-cost function calculation means, using the selected first or second constant group as coefficients. The constants corresponding to the predetermined feature information in the first constant group and in the second constant group differ from each other.
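A minimal sketch of this arrangement follows, assuming five sub-costs keyed by the labels used elsewhere in this description (`pro`, `typ`, `env`, `spec`, `F0`). The numeric weights are placeholders chosen only so that the two constant groups differ in the prosody coefficient; they are not values from the patent, and the function name `total_cost` is hypothetical.

```python
SUB_COSTS = ("pro", "typ", "env", "spec", "F0")

# Placeholder coefficients (assumed for illustration, not from the patent).
# The two groups differ only in the constant for the prosody sub-cost.
SECOND_GROUP = {"pro": 1.0, "typ": 1.0, "env": 1.0, "spec": 1.0, "F0": 1.0}
FIRST_GROUP = {**SECOND_GROUP, "pro": 2.0}


def total_cost(sub_costs, condition_detected):
    """Linear sum of sub-costs, with coefficients chosen by the detection result."""
    weights = FIRST_GROUP if condition_detected else SECOND_GROUP
    return sum(weights[name] * sub_costs[name] for name in SUB_COSTS)


example = {"pro": 0.5, "typ": 0.0, "env": 0.25, "spec": 0.25, "F0": 0.0}
cost_elsewhere = total_cost(example, condition_detected=False)
cost_at_detected_location = total_cost(example, condition_detected=True)
```

With these placeholder weights the same candidate costs more at a detected location, because its prosodic mismatch is counted twice as heavily there.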

Preferably, the selection means selects the first constant group in intervals that the accent change interval detection means has detected as intervals in which the accent of the speech to be synthesized changes, and selects the second constant group in all other intervals.

For example, the value of the constant corresponding to the predetermined feature information in the first constant group is larger than the value of the corresponding constant in the second constant group.

Alternatively, for example, the value of the constant corresponding to the predetermined feature information in the second constant group is smaller than the value of the corresponding constant in the first constant group.

More preferably, the unit selection means includes means for selecting speech units from the speech unit database so that the cost calculated by the cost calculation means is minimized.

The speech synthesized by this speech synthesis apparatus may be Japanese speech.

More preferably, the prosody information includes accent pitch information on the pitch height of the accent of the speech to be synthesized, and the accent change interval detection means includes means for detecting, as an accent change interval, the interval consisting of the vowel portion of the mora immediately before the point at which the accent pitch changes and the mora immediately after that point.

The speech synthesis apparatus may further include the speech unit database.

A computer program according to a second aspect of the present invention, when executed by a computer, causes the computer to operate as any of the speech synthesis apparatuses according to the first aspect of the present invention.

A cost calculation apparatus according to a third aspect of the present invention calculates the cost for selecting speech units in a speech synthesis apparatus that synthesizes speech using speech units selected from a predetermined speech unit database in accordance with a speech synthesis target. The speech synthesis target is described by predetermined feature information of the speech. The cost calculation apparatus includes: detection means for detecting locations in the speech synthesis target at which the predetermined feature information satisfies a predetermined condition; and cost calculation means for performing cost calculation based on the synthesis target, using mutually different predetermined functions for the locations at which the detection means has detected that the predetermined feature information satisfies the condition and for the other locations.

If the quality of synthesized speech has deteriorated because units with small costs were selected in perceptually unimportant portions while units with large costs were selected in perceptually important portions, the calculation of the cost function needs to be improved. One way to do so is to vary the cost function over time. By tolerating errors in unimportant portions and selecting units so that the cost in important portions is correspondingly smaller, the quality of the sentence as a whole improves. For example, focusing on the accent of a sentence, changes in prosodic features are thought to affect perception more strongly where the accent changes. Giving the sub-cost corresponding to prosodic features more weight than the other sub-costs at such points should therefore improve the impression given by the synthesized speech. The present embodiment focuses on this relationship between accent changes and prosodic features.

To this end, information on the accent (accent information) is included in the synthesis target. In speech synthesis for text-to-speech conversion, linguistic analysis is performed beforehand, so including accent information in the synthesis target poses no difficulty. Accent information for individual words can be obtained by consulting a general accent dictionary or the like.

An embodiment in which the present invention is applied to Japanese speech synthesis is described below with reference to the drawings. In this specification, "accent" refers to accent as found in Japanese and similar languages. That is, unlike the stress accent common in Indo-European languages, this "accent" is a pitch accent produced by changing the fundamental frequency of the speech. Needless to say, however, the following description makes it possible to treat phonetic and linguistic features other than pitch accent in the same way.

An outline of Japanese accent is given with reference to FIG. 1. Referring to FIG. 1, the pronunciation of word 70, 「取りまとめる」 (torimatomeru), written in katakana and in romaji, is as shown in phonetic notation 72. Horizontal line 74 above the portion 「リマトメ」 (ri-ma-to-me) of phonetic notation 72 is an accent mark indicating the accent of word 70. Accent mark 74 indicates that, when word 70 is pronounced, the 「リマトメ」 portion of phonetic notation 72 is pronounced at high pitch. Conversely, the portions without the horizontal line are pronounced at low pitch.

At start point 76 of horizontal line 74, the pitch of the voice rises. At end point 78 of horizontal line 74, the pitch falls. Representing this change in pitch schematically gives phoneme symbol string 80. When word 70 is pronounced, the prosody changes greatly between 「ト」 (to) and 「リ」 (ri) in the 「トリ」 (/t//o//r//i/) portion, and between 「メ」 (me) and 「ル」 (ru) in the 「メル」 (/m//e//r//u/) portion. In Japanese, accent is formed by such rises and falls in the fundamental frequency of speech.

In the following description, a portion where the pitch of the voice rises sharply is called an "(accent) rise," and a portion where the pitch falls is called an "(accent) fall."

In Japanese accent, these rises and falls occur at the boundaries between consecutive moras. A mora is a unit for measuring the phonological length of a word. The time needed to pronounce one mora is normally about the time needed to pronounce one pair consisting of a consonant and a short vowel. However, the sound pronounced in one mora is not limited to a consonant plus a short vowel. The second half of a long vowel (written 「ー」 in katakana), the second part of a diphthong (for example, 「イ」 in 「関西（カンサイ）」, Kansai), the geminate consonant (for example, 「ッ」 in 「発達（ハッタツ）」, hattatsu), and the moraic nasal (for example, 「ン」 in 「関西（カンサイ）」) each form an independent mora as well.
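The mora rules just described can be illustrated with a short sketch that splits a katakana string into moras. The function name `split_moras` and the set of small kana are assumptions made for this illustration; the sketch handles only plain katakana input and is not part of the patent.

```python
# Small kana that combine with the preceding character into a single mora
# (for example キ + ョ -> キョ). The geminate mark ッ is deliberately excluded:
# like ー and ン, it counts as an independent mora.
SMALL_KANA = set("ァィゥェォャュョヮ")


def split_moras(katakana):
    """Split a katakana string into a list of moras."""
    moras = []
    for ch in katakana:
        if ch in SMALL_KANA and moras:
            moras[-1] += ch   # attach to the previous character
        else:
            moras.append(ch)  # ordinary kana, ー, ッ, ン: one mora each
    return moras


kansai = split_moras("カンサイ")
hattatsu = split_moras("ハッタツ")
word_70 = split_moras("トリマトメル")
```

Under these rules 「カンサイ」 and 「ハッタツ」 each have four moras, and word 70 has six, so the accent rise and fall of FIG. 1 fall on the first and fifth mora boundaries.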

FIG. 2 is a block diagram of the functional configuration of the speech synthesis system according to the present embodiment. Referring to FIG. 2, speech synthesis system 90 includes a unit database 42 similar to that of the conventional technique shown in FIG. 5.

Speech synthesis system 90 further includes, in place of unit selection section 44 shown in FIG. 5, a unit selection section 120 for selecting speech units suitable for synthesis from unit database 42, based on synthesis target 102, which includes accent information as described above, and on a cost calculated using different functions depending on this accent information, as described later.

Speech synthesis system 90 further includes, in place of cost calculation section 46 shown in FIG. 5, a cost calculation section 122 for calculating the cost of each candidate speech unit from the physical features supplied by unit selection section 120, using different functions depending on the accent information. Specifically, cost calculation section 122 calculates a plurality of sub-costs in the same way as in Non-Patent Document 1, and calculates the cost as their linear sum using different weight sets selected according to the accent information.

Speech synthesis system 90 further includes a weight selection section 100 for selecting, based on synthesis target 102 including the accent information, one of two preset weight sets for cost calculation section 122 to use in calculating the linear sum of the sub-costs, and supplying it to cost calculation section 122.

Weight selection section 100 has a function of supplying unit selection section 120 with a completion signal indicating that the change of the weights used by cost calculation section 122 is complete. Unit selection section 120 has a function of starting the selection of speech units in response to this completion signal from weight selection section 100.

重み選択部１００は、第１の重みセットＷ_Aを保持する第１の重みセット保持部１０４と、第１の重みセットＷ_Aと異なる第２の重みセットＷ_Bを保持する第２の重みセット保持部１０６とを含む。第１の重みセットＷ_A及び第２の重みセットＷ_Bは、コスト計算部１２２がコストを計算する際の各サブコストに対応する以下の式に示す重みを含む。 Weight selection section 100, first and weight set holding unit 104, a second weight set for holding the second weight set W _B different from the first weight set W _A which holds the first weight set W _A Holding part 106. The first weight set W _A and the second weight set W _B comprises a weight shown in the following formula cost calculation unit 122 is corresponding to each sub-cost in calculating the cost.

これら重みはいずれも定数である。これらの重みセットのうち、音素の適合性に関するサブコストの重みｗ_typ、音素環境代替に関するサブコストの重みｗ_env、スペクトルの不連続に関するサブコストの重みｗ_spec、及び基本周波数Ｆ₀に関するサブコストの重みｗ_F0は、非特許文献１でのコスト計算に用いられている重みと同様である。また、第２の重みセットＷ_Bの韻律に関するサブコストの重みｗ_proは、非特許文献１でのコスト計算に用いられるものと同様である。

These weights are all constants. Of these weight sets, sub-cost weight w _typ for phoneme suitability, sub-cost weight w _env for phoneme environment substitution, sub-cost weight w _spec for spectrum discontinuity, and sub-cost weight w _F0 for fundamental frequency F _0. Is the same as the weight used for the cost calculation in Non-Patent Document 1. The weight w _pro metrical about subcosts of the second weight set W _B is the same as that used in the cost calculation in the non-patent document 1.

本実施の形態では、韻律に関するサブコストＣ_proに対する第１の重みセットＷ_Aにおける重みｗ_proAと、第２の重みセットＷ_Bにおける重みｗ_proとは、次の数式に示す関係となる。 In the present embodiment, the weight w_proA in the first weight set W_A and the weight w_pro in the second weight set W_B, both applied to the prosody sub-cost C_pro, have the relationship shown in the following expression.
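The expression itself is not reproduced in this text. Since the description goes on to say that the first weight set emphasizes the prosody sub-cost, and the claims require the prosody constant of the first constant group to be larger than that of the second, the relationship is presumably:

```latex
w_{proA} > w_{pro}
```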

すなわち、第１の重みセットＷ_Aを用いる場合には韻律に関するサブコストが重視され、第２の重みセットＷ_Bを用いる場合には韻律に関するサブコストは特に重視はされない（通常と同じ）。第１の重みセットＷ_Aは、アクセントの影響により韻律的な特徴変化の大きな時間的区間（以下、この時間的区間を「韻律変化区間」と呼ぶ。）の音声素片を選択する際に用いられる重みである。第２の重みセットＷ_Bは、それ以外の時間的区間（以下、この区間を「平坦区間」と呼ぶ。）の音声素片を選択する際のコスト計算に用いられる。

That is, when the first weight set W_A is used, the prosody sub-cost is emphasized; when the second weight set W_B is used, the prosody sub-cost receives no special emphasis (it is weighted as usual). The first weight set W_A is used when selecting speech units in a time section where the prosodic features change greatly under the influence of the accent (hereinafter called a "prosody change section"). The second weight set W_B is used for the cost calculation when selecting speech units in the remaining time sections (hereinafter called "flat sections").

重み選択部１００はさらに、合成目標１０２に基づいて、韻律変化区間を検出するための韻律変化区間検出部１０８と、韻律変化区間検出部１０８による検出結果に応答して、第１の重みセットＷ_A及び第２の重みセットＷ_Bのいずれかを選択してコスト計算部１２２に与えるための選択部１１０とを含む。 The weight selection unit 100 further includes a prosody change section detection unit 108 that detects prosody change sections based on the synthesis target 102, and a selection unit 110 that, in response to the detection result from the prosody change section detection unit 108, selects either the first weight set W_A or the second weight set W_B and supplies it to the cost calculation unit 122.

アクセント情報が変化する部分周囲の区間は、アクセントの影響により韻律的な特徴変化の大きな区間となる。韻律変化区間検出部１０８は、
（１）アクセント情報が変化する部分の直前の母音部、並びに、
（２）アクセント情報が変化する部分の直後の子音部及び母音部（子音部がある場合）、又は母音部（子音部がない場合）、
を韻律変化区間として検出する。 The section around a point where the accent information changes is a section in which the prosodic features change greatly under the influence of the accent. The prosody change section detection unit 108 detects the following as a prosody change section:
(1) the vowel part immediately before the point where the accent information changes, and
(2) the consonant part and vowel part immediately after that point (if there is a consonant part), or the vowel part alone (if there is no consonant part).
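The detection rule in items (1) and (2) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name, the five-vowel inventory, and the romanized sample input (an assumed rendering of the twelve-phoneme word used in the FIG. 4 example, with "L"/"H" labels per phoneme) are all assumptions.

```python
# Assumed vowel inventory; a real system would use its own phoneme set.
VOWELS = {"a", "i", "u", "e", "o"}


def detect_change_sections(phonemes, accents):
    """Return one boolean per phoneme: True = prosody change section.

    phonemes: list of phoneme symbols, e.g. ["t", "o", "r", "i"]
    accents:  list of "H"/"L" accent labels, one per phoneme
    """
    in_change = [False] * len(phonemes)
    for i in range(1, len(phonemes)):
        if accents[i - 1] == accents[i]:
            continue  # no accent change at this boundary
        # (1) the vowel part immediately before the change
        if phonemes[i - 1] in VOWELS:
            in_change[i - 1] = True
        # (2) the consonant part (if any) and the vowel part right after it
        j = i
        if phonemes[j] not in VOWELS:
            in_change[j] = True
            j += 1
        if j < len(phonemes) and phonemes[j] in VOWELS:
            in_change[j] = True
    return in_change


# Hypothetical input: rise between /o/ and /r/, fall between /e/ and /r/.
phonemes = "t o r i m a t o m e r u".split()
accents = "L L H H H H H H H H L L".split()
flags = detect_change_sections(phonemes, accents)
```

With this input, the flagged phonemes form two runs (around the rise and around the fall), and everything else is treated as a flat section.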

図３に、合成目標１０２の構成の一例を示す。図３を参照して、合成目標１０２は、図６に示す従来の技術における合成目標６２と同様に、入力テキスト６０をもとに作成され、素片選択部１２０が音声素片を選択する際に用いる音素ごとの合成目標１４２，１４４，…を含む。 FIG. 3 shows an example of the configuration of the synthesis target 102. Referring to FIG. 3, the synthesis target 102, like the synthesis target 62 of the prior art shown in FIG. 6, is created from the input text 60 and includes per-phoneme synthesis targets 142, 144, ... that the segment selection unit 120 uses when selecting speech units.

音素ごとの合成目標１４２，１４４，…は、図６に示す音素ごとの合成目標８２，８４，…と同様に、音素を特定するための音素記号と、音素の韻律指令、音素の持続時間（音韻継続時間）、音素ごとのスペクトル情報など、当該音素の物理的特徴を示す情報とを含む。音素ごとの合成目標１４２，１４４，…はさらに、当該音素が、アクセントにより高く発音されるか、低く発音されるかを示すアクセント情報１４６，１４８，…を含む。アクセント情報１４６，１４８，…内の「Ｈ」は、当該音素が高く発音されることを示す。アクセント情報１４６，１４８，…内の「Ｌ」は、当該音素が低く発音されることを示す。隣接する音素についてのアクセント情報が互いに他と異なる場合、その境界部分にアクセントの立上がり又は立下がりがあることになる。 Like the per-phoneme synthesis targets 82, 84, ... shown in FIG. 6, each of the per-phoneme synthesis targets 142, 144, ... includes a phoneme symbol identifying the phoneme, together with information indicating the physical characteristics of that phoneme, such as its prosody command, its duration (phoneme duration), and its spectral information. Each per-phoneme synthesis target 142, 144, ... further includes accent information 146, 148, ... indicating whether the accent causes the phoneme to be pronounced high or low. "H" in the accent information 146, 148, ... indicates that the phoneme is pronounced high; "L" indicates that it is pronounced low. When the accent information of two adjacent phonemes differs, an accent rise or fall lies at the boundary between them.
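As a rough illustration of the per-phoneme synthesis target described above, the record and the rise/fall rule might look like the sketch below. The class name, field names, and units are hypothetical, the prosody command is reduced to a single target F₀ value, and the spectral information is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class PhonemeTarget:
    """Hypothetical per-phoneme synthesis target (spectral data omitted)."""
    symbol: str         # phoneme symbol
    f0_hz: float        # stand-in for the prosody command
    duration_ms: float  # phoneme duration
    accent: str         # "H" (pronounced high) or "L" (pronounced low)


def accent_boundaries(targets):
    """List (index, kind) for each boundary where the accent label changes.

    index is the position of the phoneme just after the boundary;
    kind is "rise" for L -> H and "fall" for H -> L.
    """
    result = []
    for i in range(1, len(targets)):
        prev, cur = targets[i - 1].accent, targets[i].accent
        if prev != cur:
            kind = "rise" if (prev, cur) == ("L", "H") else "fall"
            result.append((i, kind))
    return result


# Illustrative values only.
example = [
    PhonemeTarget("t", 120.0, 60.0, "L"),
    PhonemeTarget("o", 120.0, 90.0, "L"),
    PhonemeTarget("r", 160.0, 40.0, "H"),
    PhonemeTarget("i", 160.0, 80.0, "H"),
]
boundaries = accent_boundaries(example)
```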

韻律変化区間の音声素片についてコストを計算する場合、韻律に関するサブコストＣ_proには第１の重みセットＷ_Aに保持された重みｗ_proAが乗算される。従ってコスト計算部１２２により計算されるコスト関数は以下の式となる。 When the cost is calculated for a speech unit in a prosody change section, the prosody sub-cost C_pro is multiplied by the weight w_proA held in the first weight set W_A. The cost function calculated by the cost calculation unit 122 is therefore given by the following expression.

平坦区間の音声素片についてコストを計算する場合、韻律に関するサブコストＣ_proに、第２の重みセットＷ_Bに保持された重みｗ_proが乗算される。従ってコスト関数は、非特許文献１に記載のコスト関数と同様の以下の式となる。

When the cost is calculated for a speech unit in a flat section, the prosody sub-cost C_pro is multiplied by the weight w_pro held in the second weight set W_B. The cost function is therefore given by the following expression, which is the same as the cost function described in Non-Patent Document 1.

即ちコスト計算部１２２は、韻律変化区間内の音声を合成する際の候補となる音声素片についてコストを算出する場合と、平坦区間内の音声を合成する際の候補となる音声素片についてコストを算出する場合とで、それぞれ異なるコスト関数によってコストの計算を行なうこととなる。

That is, the cost calculation unit 122 uses different cost functions depending on whether it is calculating the cost of a candidate speech unit for synthesizing speech within a prosody change section or of a candidate speech unit for synthesizing speech within a flat section.
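The two cost expressions referred to above are not reproduced in this text. Given that the cost is described as a linear sum of sub-costs, they can plausibly be reconstructed as below. The labels C_A and C_B and the sub-cost names C_typ, C_env, C_spec and C_F0 are assumptions modeled on C_pro; any normalization used in Non-Patent Document 1 is not shown here.

```latex
\begin{aligned}
C_A &= w_{typ}\,C_{typ} + w_{env}\,C_{env} + w_{spec}\,C_{spec} + w_{F0}\,C_{F0} + w_{proA}\,C_{pro}
      && \text{(prosody change section)} \\
C_B &= w_{typ}\,C_{typ} + w_{env}\,C_{env} + w_{spec}\,C_{spec} + w_{F0}\,C_{F0} + w_{pro}\,C_{pro}
      && \text{(flat section)}
\end{aligned}
```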

音声合成システム９０は以下のように動作する。図２を参照して、前もって入力テキスト６０から合成目標１０２が作成されているものとする。この際、音素ごとにアクセント情報が合成目標１０２内に作成される。重み選択部１００の韻律変化区間検出部１０８及び素片選択部１２０に、この合成目標１０２が与えられる。素片選択部１２０は、与えられた合成目標１０２を一時記憶する。 The speech synthesis system 90 operates as follows. Referring to FIG. 2, it is assumed that a synthesis target 102 has been created from input text 60 in advance. At this time, accent information is created in the synthesis target 102 for each phoneme. The synthesis target 102 is given to the prosody change section detection unit 108 and the segment selection unit 120 of the weight selection unit 100. The segment selection unit 120 temporarily stores the given synthesis target 102.

韻律変化区間検出部１０８は、以下のようにして、韻律変化区間及び平坦区間を検出する。すなわち、韻律変化区間検出部１０８は、合成目標１０２のアクセント情報１４６，１４８，…（図３参照）を参照し、隣接する２つの音素についての韻律変化区間及び平坦区間を検出する。韻律変化区間検出部１０８は、各音素についての検出結果を選択部１１０に対し与える。韻律変化区間検出部１０８は同時に、素片選択部１２０に対し区間の検出が完了したことを示す完了信号を与える。 The prosody change section detector 108 detects the prosody change section and the flat section as follows. That is, the prosody change section detection unit 108 refers to the accent information 146, 148,... (See FIG. 3) of the synthesis target 102, and detects a prosody change section and a flat section for two adjacent phonemes. The prosodic change section detection unit 108 gives the selection unit 110 a detection result for each phoneme. At the same time, the prosodic change section detection unit 108 gives a completion signal indicating that the detection of the section is completed to the segment selection unit 120.

図２を参照して、選択部１１０は、与えられた検出結果が韻律変化区間を表わすものである場合には第１の重みセットＷ_Aを、平坦区間である場合には第２の重みセットＷ_Bを、それぞれコスト計算部１２２に与える。これにより、重み選択部１００からコスト計算部１２２に与えられる韻律に関するサブコストＣ_proの重みは、図４に示すように、韻律変化区間２１８及び２２０ではｗ_proA、平坦区間２２２及び２２４ではｗ_proになる。 Referring to FIG. 2, the selection unit 110 supplies the cost calculation unit 122 with the first weight set W_A when the given detection result indicates a prosody change section, and with the second weight set W_B when it indicates a flat section. As a result, as shown in FIG. 4, the weight of the prosody sub-cost C_pro supplied from the weight selection unit 100 to the cost calculation unit 122 is w_proA in the prosody change sections 218 and 220, and w_pro in the flat sections 222 and 224.
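The switching performed by the selection unit 110 amounts to a per-phoneme lookup, sketched below. The numeric weight values and the function name are hypothetical; the description only requires that the prosody weight in W_A exceed the one in W_B.

```python
# Hypothetical weight values. The "pro" entry of W_A plays the role of
# w_proA; the "pro" entry of W_B plays the role of w_pro.
W_A = {"typ": 1.0, "env": 1.0, "spec": 1.0, "F0": 1.0, "pro": 2.0}
W_B = {"typ": 1.0, "env": 1.0, "spec": 1.0, "F0": 1.0, "pro": 1.0}


def select_weight_set(is_change_section):
    """Return W_A for a prosody change section, W_B for a flat section."""
    return W_A if is_change_section else W_B


# Detection result for five phonemes: flat, change, change, change, flat.
detection = [False, True, True, True, False]
weights_per_phoneme = [select_weight_set(flag) for flag in detection]
pro_weights = [w["pro"] for w in weights_per_phoneme]
```

Here `pro_weights` follows the step pattern described for FIG. 4: the larger value inside the change section and the usual value elsewhere.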

図２を参照して、完了信号が与えられると、素片選択部１２０は記憶していた１音素分の合成目標１０２を読出す。素片選択部１２０は、読出した合成目標１０２をもとに、素片データベース４２より候補となる音声素片を抽出し、抽出した音声素片の音響的特徴と、それまでに選択されていた音素の音響的特徴とを示す情報をコスト計算部１２２に与える。 Referring to FIG. 2, when the completion signal is given, the segment selection unit 120 reads out one phoneme's worth of the stored synthesis target 102. Based on the synthesis target 102 read out, the segment selection unit 120 extracts candidate speech units from the segment database 42 and supplies the cost calculation unit 122 with information indicating the acoustic features of the extracted speech units and the acoustic features of the phonemes selected so far.

コスト計算部１２２は、与えられた音声素片について、選択部１１０を介して与えられる重みセットを用いて、コストの計算を行なう。韻律変化区間の音声素片についてコストを計算する場合、コスト計算部１２２は、以下の式によりコストを算出する。 The cost calculation unit 122 calculates a cost for the given speech unit using the weight set given through the selection unit 110. When calculating the cost for the speech segment in the prosodic change section, the cost calculation unit 122 calculates the cost by the following formula.

平坦区間の音声素片についてコストを計算する場合、コスト計算部１２２は、以下の式によりコストを算出する。

When calculating the cost for the speech segment in the flat section, the cost calculation unit 122 calculates the cost by the following formula.
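The linear-sum cost calculation for the two kinds of section can be sketched as follows. The weight values, sub-cost values, and names are illustrative assumptions; only the structure (one weight per sub-cost, with the prosody weight differing between the two sets) comes from the description.

```python
# Hypothetical weight sets; only the prosody weight differs.
W_A = {"typ": 1.0, "env": 1.0, "spec": 1.0, "F0": 1.0, "pro": 2.0}
W_B = {"typ": 1.0, "env": 1.0, "spec": 1.0, "F0": 1.0, "pro": 1.0}


def total_cost(sub_costs, weights):
    """Weighted linear sum of the sub-costs for one candidate unit."""
    return sum(weights[name] * sub_costs[name] for name in weights)


# Illustrative sub-cost values for a single candidate speech unit.
sub_costs = {"typ": 0.0, "env": 0.25, "spec": 0.5, "F0": 0.125, "pro": 0.5}

cost_in_change_section = total_cost(sub_costs, W_A)
cost_in_flat_section = total_cost(sub_costs, W_B)
```

The same candidate is penalized more for its prosody mismatch when it is evaluated for a prosody change section than when it is evaluated for a flat section.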

コスト計算部１２２は、このようにして算出したコストを素片選択部１２０に与える。素片選択部１２０は、与えられたコストをもとに、図５に示す従来の技術における素片選択部４４と同様に、選択された音素列のコストの総和が最小となるような音声素片を選択し出力する。

The cost calculation unit 122 supplies the cost calculated in this way to the segment selection unit 120. Based on the supplied cost, the segment selection unit 120, like the segment selection unit 44 of the prior art shown in FIG. 5, selects and outputs the speech units that minimize the total cost of the selected phoneme sequence.

以上の動作を繰返すことにより、素片選択部１２０からは、コストの総和が最小となるような出力素片系列６４が出力される。 By repeating the above operation, the segment selection unit 120 outputs an output segment sequence 64 that minimizes the total cost.
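This passage does not state which search procedure is used to minimize the total cost. One common choice for this kind of problem is dynamic programming over the candidate lattice, sketched below as an assumption. In this toy, each "unit" is just an F₀ value, and the cost keeps only a prosody term and an F₀ discontinuity term; all names and numbers are hypothetical.

```python
def select_units(candidates, weights, local_cost):
    """Pick one unit per position so that the summed cost is minimal.

    candidates[i]: candidate units for target position i
    weights[i]:    weight set in effect at position i
    local_cost(prev_unit, unit, i, w): weighted cost of choosing `unit`
        at position i after `prev_unit` (prev_unit is None at i == 0)
    Returns (minimum total cost, selected unit sequence).
    """
    # best holds, for each candidate at the current position, the
    # cheapest (cost, path) ending in that candidate.
    best = [(local_cost(None, u, 0, weights[0]), [u]) for u in candidates[0]]
    for i in range(1, len(candidates)):
        new_best = []
        for u in candidates[i]:
            cost, path = min(
                ((c + local_cost(p[-1], u, i, weights[i]), p) for c, p in best),
                key=lambda t: t[0],
            )
            new_best.append((cost, path + [u]))
        best = new_best
    return min(best, key=lambda t: t[0])


# Toy data: target F0 per position and candidate units given as F0 values.
targets = [100.0, 150.0, 100.0]
candidates = [[100.0], [110.0, 150.0], [100.0]]


def local_cost(prev, unit, i, w):
    cost = w["pro"] * abs(unit - targets[i])    # prosody mismatch
    if prev is not None:
        cost += w["F0"] * abs(unit - prev)      # F0 discontinuity
    return cost


flat = {"pro": 1.0, "F0": 1.0}
emphasized = {"pro": 3.0, "F0": 1.0}

uniform = select_units(candidates, [flat, flat, flat], local_cost)
switched = select_units(candidates, [flat, emphasized, flat], local_cost)
```

With uniform weights the search prefers the smoother candidate (110.0) at the middle position; with the prosody weight raised there, it switches to the candidate that matches the target F₀ (150.0).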

図４に、合成目標１０２と、図２に示す韻律変化区間検出部１０８が検出する韻律変化区間及び平坦区間と、韻律に関するサブコストＣ_proの重みとの関係を、概略的に示す。図１に示す「取りまとめる」という語７０の音声を合成する場合、語７０のアクセントは、図１に示す発音表記７２上のアクセント記号によって表わされる。図４を参照して、この語に対応する合成目標１０２は、このアクセント記号をもとに作成されたアクセント情報２００を含む。 FIG. 4 schematically shows the relationship among the synthesis target 102, the prosody change sections and flat sections detected by the prosody change section detection unit 108 shown in FIG. 2, and the weight of the prosody sub-cost C_pro. When speech is synthesized for the word 70 「取りまとめる」 shown in FIG. 1, the accent of the word 70 is represented by the accent marks on the phonetic notation 72 shown in FIG. 1. Referring to FIG. 4, the synthesis target 102 corresponding to this word includes accent information 200 created from these accent marks.

韻律変化区間検出部１０８は、合成目標１０２のアクセント情報２００を参照し、隣接する２つの音素についてのアクセント情報が変化する部分を検出する。例えば、音素「／ｏ／」と「／ｒ／」とが隣接する部分２０２では、音素「／ｏ／」のアクセント情報が「Ｌ」であるのに対し、これに隣接する音素「／ｒ／」のアクセント情報は「Ｈ」である。よって、音素「／ｏ／」と「／ｒ／」とが隣接する部分２０２には、アクセントの立上がりがあることになる。また同様に、音素「／ｅ／」と音素「／ｒ／」とが隣接する部分２０４には、アクセントの立下がりがあることになる。 The prosody change section detection unit 108 refers to the accent information 200 of the synthesis target 102 and detects the points where the accent information of two adjacent phonemes changes. For example, at the point 202 where the phonemes /o/ and /r/ adjoin, the accent information of /o/ is "L", whereas that of the adjacent /r/ is "H". The point 202 therefore carries an accent rise. Likewise, the point 204 where the phonemes /e/ and /r/ adjoin carries an accent fall.

図４に示す例では、音素「／ｏ／」２０６及び「／ｅ／」２１２が、アクセント情報の変化する部分の直前の母音部に該当する。また、音素「／ｒ／」２０８及び「／ｉ／」２１０、並びに音素「／ｒ／」２１４及び「／ｉ／」２１６が、アクセント情報の変化する部分の直後の子音部及び母音部に該当する。これらの音素を含む区間が韻律変化区間２１８及び２２０となる。それ以外の区間は平坦区間２２２及び２２４となる。 In the example shown in FIG. 4, the phonemes /o/ 206 and /e/ 212 are the vowel parts immediately before the points where the accent information changes. The phonemes /r/ 208 and /i/ 210, and the phonemes /r/ 214 and /i/ 216, are the consonant and vowel parts immediately after those points. The sections containing these phonemes are the prosody change sections 218 and 220. The remaining sections are the flat sections 222 and 224.

本実施の形態では、アクセントにより韻律が大きく変化する部分を検出し、検出した部分の音素を選択する際に、韻律に関するサブコストの重みを大きい値に設定する。これにより、検出された部分の音声素片についての韻律に関するサブコストは、重みが増大した分だけ、他のサブコストに対し相対的に重く評価される。そのため、この音声素片についてのコストは、韻律に関するサブコストをより強く反映した値になる。このようなコストを用いて音声素片を選択すると、韻律について知覚に与える影響が大きい区間では韻律に関するサブコストが通常より小さくなるような音声素片の選択が行なわれることになる。よって、そうした区間では、韻律に関して誤差の少ない素片選択が可能となり、知覚に与える影響が少なくなる。その結果、合成音声を人間が聞いたときに違和感を感ずることが少なくなる。 In the present embodiment, when a portion where the prosody changes greatly due to an accent is detected and the phoneme of the detected portion is selected, the weight of the sub-cost related to the prosody is set to a large value. As a result, the sub-cost related to the prosody for the detected speech segment of the detected portion is evaluated to be relatively heavier than the other sub-costs as the weight is increased. Therefore, the cost for this speech segment is a value that more strongly reflects the sub-cost related to prosody. When a speech unit is selected using such a cost, a speech unit is selected such that the sub-cost related to the prosody becomes smaller than usual in a section where the influence on the perception of the prosody is large. Therefore, in such a section, it is possible to select a segment with less error regarding the prosody, and the influence on perception is reduced. As a result, it is less likely that the person feels uncomfortable when the synthesized speech is heard by a human.

また、本実施の形態では、アクセントによる韻律の変化が、知覚的に大きな影響を及ぼす区間以外の区間で、従来通りのコストを用いて音声素片の選択が行なわれる。そのため、この区間について合成された音声は、従来の技術において期待できる品質を有することとなる。このように、目標となる音声の特徴に応じてコスト計算の方法を変更することにより、合成された音声は、全体として品質が向上することが期待できる。 In the present embodiment, the speech segment is selected using the conventional cost in a section other than the section where the change in the prosody due to the accent has a significant perceptual effect. Therefore, the voice synthesized for this section has quality that can be expected in the conventional technology. Thus, by changing the cost calculation method according to the target voice characteristics, the synthesized voice can be expected to improve in quality as a whole.

なお、以上のブロック図形式で説明した各機能部は、いずれもコンピュータハードウェア及び当該コンピュータ上で実行されるプログラムにより実現できる。このコンピュータとしては、音声を扱う設備を持ったものであれば、汎用のハードウェアを有するものを用いることができる。また、上で説明した装置の各機能ブロックは、この明細書の記載に基づき、当業者であればプログラムで実現することができる。そうしたプログラムもまた１つのデータであり、記憶媒体に記憶させて流通させることができる。 Each functional unit described in the above block diagram format can be realized by computer hardware and a program executed on the computer. As this computer, a computer having general-purpose hardware can be used as long as it has equipment for handling sound. Further, each functional block of the apparatus described above can be realized by a program by those skilled in the art based on the description in this specification. Such a program is also a piece of data and can be stored in a storage medium and distributed.

また、上記した実施の形態では、韻律変化区間において、韻律に関するサブコストを大きな値に設定し、コスト計算を行なった。しかし、本発明は、このような実施の形態には限定されない。例えば、韻律変化区間では韻律に関するサブコストの値を上記したｗ_proとし、平坦区間では、韻律に関するサブコストの重みをｗ_proより小さな値に設定するようにしてもよい。 In the above-described embodiment, the cost calculation is performed by setting the sub-cost related to the prosody to a large value in the prosody change section. However, the present invention is not limited to such an embodiment. For example, the value of the sub cost related to the prosody may be set to w _pro described above in the prosodic change section, and the weight of the sub cost related to the prosody may be set to a value smaller than w _pro in the flat section.

このようにすると、この部分の音声素片についての韻律に関するサブコストは、重みが減少した分だけ、他のサブコストと比べて相対的に軽く評価される。平坦区間内の音声素片についてのコストは、韻律に関するサブコストの値がより弱く反映したものとなる。このようなコストを用いて音声素片を選択すると、平坦区間内において韻律に関する知覚的な誤差が許容される形で音声素片の選択が行なわれることとなる。これにより、平坦区間と韻律変化区間との境界において基本周波数Ｆ₀の不連続を抑制しつつ、韻律変化区間において韻律に関するサブコストが大きくなることを間接的に回避することが可能になる。 In this way, the sub-cost related to the prosody for the speech segment of this portion is evaluated relatively lightly as compared with the other sub-costs by the amount of the reduced weight. The cost for the speech segment in the flat section reflects the value of the sub-cost related to the prosody more weakly. When a speech unit is selected using such a cost, the speech unit is selected in such a manner that a perceptual error relating to the prosody is allowed in a flat section. As a result, it is possible to indirectly avoid an increase in the sub-cost related to the prosody in the prosodic change section while suppressing the discontinuity of the fundamental frequency F ₀ at the boundary between the flat section and the prosodic change section.

これは以下の理由による。即ち、基本周波数Ｆ₀の不連続性は、合成音声の知覚に大きな影響を及ぼすと考えられる。そこで、韻律の平坦区間でも韻律変化区間でも、基本周波数Ｆ₀の不連続に関するサブコストを同様に重く評価することにより、基本周波数Ｆ₀の不連続が小さくなるような素片選択をしている。しかし、その結果、特に韻律変化区間において韻律に関するサブコストが大きな音声素片が選択されてしまう場合があったと考えられる。 This is for the following reason. Discontinuity of the fundamental frequency F₀ is considered to have a large influence on the perception of synthesized speech. For that reason, the sub-cost for F₀ discontinuity is weighted equally heavily in both flat sections and prosody change sections, so that units are selected to keep the F₀ discontinuity small. As a result, however, speech units with a large prosody sub-cost are thought to have sometimes been selected, particularly in prosody change sections.

韻律が合成音声の知覚に与える影響は、韻律の変化が少ない平坦区間では小さく、逆に韻律変化区間では大きい。そこで、韻律に関する知覚的な影響が少ない平坦区間内で、韻律に関する知覚的な誤差を許容すると、平坦区間内において選択されうる音声素片の数は増える。それに伴い、この平坦区間に隣接する個所で接続される音声素片の組合せが増える。そのため、基本周波数Ｆ₀の不連続に関するサブコストを重く評価しても、この個所において選択されうる音声素片の数は増加することになる。このように選択されうる音声素片が増加すると、それらの中に、韻律に関するサブコストが小さな音声素片が存在する可能性が高くなる。 The influence of the prosody on the perception of the synthesized speech is small in the flat section where the prosody change is small and conversely large in the prosody change section. Therefore, if a perceptual error related to prosody is allowed in a flat section where the perceptual influence on prosody is small, the number of speech segments that can be selected in the flat section increases. Along with this, the number of combinations of speech units connected at locations adjacent to the flat section increases. For this reason, even if the sub-cost related to the discontinuity of the fundamental frequency F ₀ is heavily evaluated, the number of speech segments that can be selected at this point increases. If the number of speech units that can be selected in this way increases, there is a high possibility that speech units having a small sub-cost related to the prosody are present.

よって、平坦区間に隣接する区間であり、かつ韻律による知覚的な影響が大きな区間である韻律変化区間において、韻律に関するサブコストが小さな音声素片が選択される可能性が高くなる。その結果、韻律変化区間において、韻律に関するサブコストが大きくなることを回避することが可能になる。 Therefore, in the prosodic change section that is a section adjacent to the flat section and has a large perceptual influence due to the prosody, there is a high possibility that a speech unit having a small sub-cost related to the prosody is selected. As a result, it is possible to avoid an increase in the sub-cost related to the prosody in the prosodic change section.

さらに、平坦区間内の音声素片についてのコストは、韻律に関するサブコストの値をより弱く反映したものとなる。このようなコストを用いて音声素片を選択すると、韻律に関するサブコスト以外のサブコストが相対的に強く反映された形で、音声素片の選択が行なわれるようになる。これにより、この区間でのその他の知覚的要因に関する合成音声の品質が向上することも期待できる。 Furthermore, the cost for the speech segment in the flat section reflects the value of the sub-cost related to the prosody more weakly. When a speech unit is selected using such a cost, the speech unit is selected in a form that relatively reflects sub-costs other than the sub-cost related to prosody. Thereby, it can also be expected that the quality of the synthesized speech related to other perceptual factors in this section is improved.

さらには、韻律変化区間では、韻律に関するサブコストの重みをｗ_proより大きな値に設定し、平坦区間ではｗ_proより小さな値に設定するようにしてもよい。このようにした場合でも、韻律変化区間では韻律に関するサブコストを他のサブコストより重視し、平坦区間では他のサブコストを韻律に関するサブコストより重視することとなり、アクセントの変化のある区間で知覚的な影響が少なくなるような素片選択を行なうことができる。 Furthermore, in the prosodic change section, the weight of the sub-cost related to the prosody may be set to a value larger than w _pro , and may be set to a value smaller than w _pro in the flat section. Even in this case, in the prosodic change section, the prosody sub-cost is more important than the other sub-costs, and in the flat section, other sub-costs are more important than the prosody sub-cost, and there is a perceptual effect in the section where the accent changes. It is possible to perform segment selection so as to reduce the number of segments.
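The three weighting variants discussed so far (raising the weight in prosody change sections, lowering it in flat sections, or both) differ only in how the two prosody weights are derived from w_pro. A small sketch follows; the helper name and the scale factors are hypothetical.

```python
def make_pro_weights(w_pro, change_scale=1.0, flat_scale=1.0):
    """Return (change-section weight, flat-section weight) for C_pro.

    The scale factors are illustrative. In every variant described, the
    change-section weight ends up larger than the flat-section weight.
    """
    if change_scale <= flat_scale:
        raise ValueError("change-section weight must exceed flat-section weight")
    return w_pro * change_scale, w_pro * flat_scale


w_pro = 1.0
raise_in_change = make_pro_weights(w_pro, change_scale=2.0)   # main embodiment
lower_in_flat = make_pro_weights(w_pro, flat_scale=0.5)       # first variation
both = make_pro_weights(w_pro, change_scale=2.0, flat_scale=0.5)
```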

なお、上記した実施の形態では、図２に示す重み選択部１００は、韻律に関するサブコストの重みのみを変更するものであった。しかし、本発明はこのようなものには限定されない。その他のサブコストの重みについても変化させることができる。このようにすると、より多くの音響的特徴についてより詳細に知覚に与える影響を少なくするようなコストの最適化を行なうことができる。 In the above-described embodiment, the weight selection unit 100 shown in FIG. 2 changes only the sub-cost weight related to the prosody. However, the present invention is not limited to this. Other sub-cost weights can also be changed. In this way, it is possible to optimize the cost so as to reduce the influence on the perception in more detail for more acoustic features.

また、上記した実施の形態では、コスト計算に用いる重みセットを２種類用意し、図２に示すコスト計算部１２２がコストを計算する際に、重みセットを切替えた。しかし本発明は、このような実施の形態には限定されない。重み選択部１００が、合成目標に応じて何らかの関数によって重みを算出するようにしてもよい。例えば、合成目標１０２に含まれる、基本周波数Ｆ₀に関する情報と、音素継続時間に関する情報とに基づいて、韻律に関するサブコストの重みｗ_proの変更量を算出するようにしてもよい。韻律変化区間検出部１０８による韻律変化区間及び平坦区間の検出結果に応じて、コスト計算部１２２でのコスト計算の関数そのものを全く別のものとすることも可能である。またそうした処理が韻律の変化（アクセントの変化）に着目したもののみに限定されるわけではなく、他の音響的特徴の変化に着目するようにしてもよいことはいうまでもない。 In the embodiment described above, two types of weight sets used for cost calculation are prepared, and the weight set is switched when the cost calculation unit 122 shown in FIG. 2 calculates the cost. However, the present invention is not limited to such an embodiment. The weight selection unit 100 may calculate the weight by some function according to the synthesis target. For example, the change amount of the sub-cost weight w _pro related to the prosody may be calculated based on the information related to the fundamental frequency F ₀ and the information related to the phoneme duration included in the synthesis target 102. Depending on the detection result of the prosody change section and the flat section by the prosody change section detection unit 108, the cost calculation function itself in the cost calculation unit 122 may be completely different. Further, such processing is not limited to only focusing on prosody changes (accent changes), but it goes without saying that other acoustic feature changes may be focused.
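The description leaves the weight-computing function open. Purely as an illustration of the idea, the sketch below raises the prosody weight in proportion to the slope of the target F₀ across a phoneme. The rule, the `gain` parameter, and all names are hypothetical and are not taken from the source.

```python
def pro_weight(f0_prev_hz, f0_cur_hz, duration_ms, w_pro=1.0, gain=0.5):
    """Hypothetical rule: a steeper target F0 slope gives a larger weight.

    f0_prev_hz, f0_cur_hz: target F0 of the previous and current phoneme
    duration_ms:           target duration of the current phoneme (> 0)
    """
    slope = abs(f0_cur_hz - f0_prev_hz) / duration_ms  # Hz per millisecond
    return w_pro * (1.0 + gain * slope)


flat_weight = pro_weight(100.0, 100.0, 80.0)    # no F0 movement
moving_weight = pro_weight(100.0, 140.0, 80.0)  # 40 Hz change over 80 ms
```

A continuous rule of this kind replaces the two-way switch between weight sets with a weight that varies smoothly with the target prosody.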

なお、アクセントの変化に着目する場合、合成目標１０２に含まれるアクセント情報の形態は問わない。アクセントによって発話音声に大きな変化がある個所を特定することができる情報であれば、どのような形式の情報であってもよい。それはアクセント以外の音響的特徴に着目する場合も同様である。 Note that when attention is focused on the change in accent, the form of the accent information included in the synthesis target 102 does not matter. Any type of information may be used as long as the information can identify a portion where the utterance is greatly changed by the accent. The same applies when focusing on acoustic features other than accents.

今回開示された実施の形態は単に例示であって、本発明が上記した実施の形態のみに制限されるわけではない。本発明の範囲は、発明の詳細な説明の記載を参酌した上で、特許請求の範囲の各請求項によって示され、そこに記載された文言と均等の意味及び範囲内でのすべての変更を含む。 The embodiment disclosed here is merely an example, and the present invention is not limited to the embodiment described above. The scope of the present invention is indicated by each claim in the claims, read in light of the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording of the claims.

FIG. 1: 日本語音声におけるアクセントの特徴を示す概略図である。 A schematic diagram showing the features of accent in Japanese speech.
FIG. 2: 本発明の一実施の形態に係る音声合成システムの機能的構成を示すブロック図である。 A block diagram showing the functional configuration of a speech synthesis system according to an embodiment of the present invention.
FIG. 3: 本発明の一実施の形態に係る合成目標の一例を示す図である。 A diagram showing an example of a synthesis target according to an embodiment of the present invention.
FIG. 4: 合成目標と、韻律変化区間及び平坦区間と、韻律に関するサブコストＣ_proの重みとの関係を示す概略図である。 A schematic diagram showing the relationship among the synthesis target, the prosody change sections and flat sections, and the weight of the prosody sub-cost C_pro.
FIG. 5: 一般的な音声合成システムの構成を示すブロック図である。 A block diagram showing the configuration of a general speech synthesis system.
FIG. 6: 本発明の背景技術に係る合成目標の構成を示す図である。 A diagram showing the configuration of a synthesis target according to the background art of the present invention.

Explanation of symbols

４０、９０音声合成システム、４２素片データベース、４４、１２０素片選択部、４６、１２２コスト計算部、６２、１０２合成目標、１００重み選択部、１０４、１０６重みセット保持部、１０８韻律変化区間検出部、１１０選択部、１４６、１４８アクセント情報 40, 90: speech synthesis system; 42: segment database; 44, 120: segment selection unit; 46, 122: cost calculation unit; 62, 102: synthesis target; 100: weight selection unit; 104, 106: weight set holding unit; 108: prosody change section detection unit; 110: selection unit; 146, 148: accent information

Claims

A speech synthesizer that synthesizes speech using speech units selected from a predetermined speech unit database according to a speech synthesis target, wherein the synthesis target is described by a plurality of types of speech feature information including predetermined feature information of speech,
Detecting means for detecting a portion of the speech synthesis target in which the predetermined feature information satisfies a predetermined condition;
Cost calculation means for performing cost calculation based on the synthesis target and each speech unit included in the speech unit database, using predetermined functions that differ between a portion where the detecting means has detected that the predetermined feature information satisfies the condition and the other portions;
A speech synthesizer, comprising: a speech segment selection unit for selecting from the speech segment database a speech unit that minimizes the cost calculated by the cost calculation unit.

The predetermined feature information includes accent information relating to an accent of a voice to be synthesized,
The condition includes an accent condition that the accent information of the voice that is the synthesis target changes,
The speech synthesis apparatus according to claim 1, wherein the detection unit includes an accent condition satisfaction section detection unit for detecting a section in which the accent information satisfies the accent condition among the speech synthesis targets.

The cost calculated by the cost calculation includes a plurality of sub-costs calculated by a plurality of sub-cost functions defined so as to describe a relationship with the perceptual characteristics with respect to the feature information of the plurality of sounds.
The cost calculation means includes
Sub-cost calculating means for calculating the values of the plurality of sub-cost functions according to the speech synthesis target and the feature value of each speech unit included in the speech unit database;
First preparation means for preparing a first group of constants;
Second preparation means for preparing a second constant group different from the first constant group;
In the section where the accent condition of the speech to be synthesized is detected as a section that satisfies the accent condition by the accent condition satisfaction section detecting means, the first constant group is selected, and in other sections Selecting means for selecting the second constant group ;
Means for calculating the cost by a linear sum of the sub-costs calculated by the sub-cost calculating means, using the first or second constant group selected by the selecting means as coefficients,
The speech synthesizer according to claim 2, wherein the value of the constant corresponding to the sub-cost related to the prosody included in the first constant group is larger than the value of the corresponding constant included in the second constant group.

The cost calculated by the cost calculation includes a plurality of sub-costs calculated by a plurality of sub-cost functions defined so as to describe a relationship with the perceptual characteristics with respect to the feature information of the plurality of sounds.
The cost calculation means includes
Sub-cost calculating means for calculating the values of the plurality of sub-cost functions according to the speech synthesis target and the feature value of each speech unit included in the speech unit database;
First preparation means for preparing a first group of constants;
Second preparation means for preparing a second constant group different from the first constant group;
In the section where the accent condition of the speech to be synthesized is detected as a section that satisfies the accent condition by the accent condition satisfaction section detecting means, the first constant group is selected, and in other sections Selecting means for selecting the second constant group;
Means for calculating the cost by a linear sum of the sub-costs calculated by the sub-cost calculation means, using the first or second constant group selected by the selection means as a coefficient,
The value of a constant corresponding to a sub-cost other than the sub-cost related to prosody included in the first constant group is smaller than a value of a corresponding constant included in the second constant group. The speech synthesizer described.

The speech synthesizer according to claim 3 or claim 4, wherein the accent condition satisfaction section detecting means includes means for detecting, as a section that satisfies the accent condition, a section composed of the vowel part of the mora immediately before a point where the accent information of the speech that is the synthesis target changes and the mora immediately after that point.

The speech synthesizer according to any one of claims 1 to 5 , further including the speech segment database.

A computer-executable computer program that, when executed by a computer, causes the computer to operate as the speech synthesizer according to any one of claims 1 to 6 .

A cost calculation device for calculating costs for selecting speech units in a speech synthesizer that synthesizes speech using speech units selected from a predetermined speech unit database according to a speech synthesis target, wherein the synthesis target is described by a plurality of types of speech feature information including predetermined feature information of speech,
Detection means for detecting, based on the predetermined feature information, a portion of the speech synthesis target in which the predetermined feature information satisfies a predetermined condition;
A cost calculation device comprising: cost calculation means for performing cost calculation based on the synthesis target and each phoneme included in the speech unit database, using predetermined functions that differ between a portion where the detection means has detected that the predetermined feature information satisfies the condition and the other portions.