JP2006133559A

JP2006133559A - Recording/editing and text-to-speech combined speech synthesizer, recording/editing and text-to-speech combined speech synthesis program, and recording medium

Info

Publication number: JP2006133559A
Application number: JP2004323229A
Authority: JP
Inventors: Akihiro Yoshida; 明弘吉田; Hideyuki Mizuno; 秀之水野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2004-11-08
Filing date: 2004-11-08
Publication date: 2006-05-25
Anticipated expiration: 2024-11-08
Also published as: JP4414864B2

Abstract

PROBLEM TO BE SOLVED: To combine recorded speech and speech synthesized from text with high quality.
SOLUTION: The speech synthesizer comprises: a recorded speech database that stores multiple recordings with the same content but different tones; a text analysis unit that identifies, within the input text, the recorded-speech-compatible text for which recorded speech stored in the database can be used; a recorded speech search unit that determines the speech tone component for the whole target sentence by a statistical method, selects recorded speech on the basis of that tone component and the match of the preceding and following phoneme environment, and gives priority to recorded speech that is continuous for as long as possible; a prosody generation unit that, for text other than the identified text, generates prosody consistent with the tone component, fundamental frequency, speech rate, and power of the selected recorded speech; and a speech segment search unit that generates synthesized speech by text-to-speech synthesis using the generated prosody.
COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to a recording/editing and text-to-speech combined speech synthesizer, a program, and a recording medium on which the program is recorded. In speech synthesis that creates speech by combining fixed parts, which use recorded speech, with variable parts, which are handled by synthesized speech, the invention prepares multiple recordings for use in the fixed parts and selects the appropriate recording so as to obtain the most natural synthesized speech.

Recording/editing speech synthesis is one of the speech synthesis technologies that generate desired speech from an input sentence (text) to be uttered by a machine or the like. This method is currently in common use for announcements on station platforms, and works by joining recorded speech in units of semantically coherent length, such as phrases.
In this method, noise may occur at the joins between recordings because of acoustic factors such as a gap in the fundamental frequency. Techniques for avoiding this are described in (Patent Document 1) and (Patent Document 2).

Because the methods disclosed in these documents are essentially based on recorded speech, they can produce high-quality speech. However, the speech must be re-recorded every time the content to be uttered changes, so maintenance costs are high.
There are various other speech synthesis methods, such as waveform-concatenation speech synthesis (Patent Document 3) and parametric speech synthesis. Technical improvements have brought the quality of synthesized speech to a level that withstands practical use, but it still falls short of recorded natural speech.
A further approach is to combine recording/editing speech synthesis with waveform-concatenation speech synthesis, which can synthesize arbitrary text. This approach is effective where the utterance content is limited to some extent, and creates synthesized speech by mixing fixed parts that use recorded speech with variable parts that use synthesized speech. In this approach, if the speech of each fixed part is recorded in advance and defined as a single continuous utterance, the sentences that can be output as synthesized speech are severely restricted, and versatility is lost.

One way to avoid this is to define the fixed parts in smaller units. The system then checks whether any part of the text to be synthesized matches recorded speech, selects the recorded speech where there is a match, and uses synthesized speech where there is none. Recorded speech can thus be used flexibly according to the input text, which increases versatility.
In this approach combining recording/editing synthesis with speech synthesis, phrases that appear frequently in a task with somewhat limited utterance content are recorded in advance. Because recorded speech is then used at a high rate during synthesis, the resulting speech is of higher quality than speech created by speech synthesis alone.

When speech is output by joining fixed and variable parts in this way, differences in sound quality between the two and mismatches in acoustic features at the joins make the overall quality of the synthesized speech seem low.
In other words, to create higher-quality synthesized speech, the quality gap between the fixed parts made of recorded speech and the variable parts made of synthesized speech must be closed. To solve this problem, attempts have long been made, in speech synthesis that mixes fixed parts using recorded speech with variable parts using synthesized speech, to raise the overall quality by modifying the acoustic features of the synthesized speech to match those of the recorded speech, thereby reducing the various mismatches between the two as far as possible.
Patent documents cited, in order of listing: JP-A-7-56597; Japanese Patent No. 3035318; Japanese Patent No. 2761552

However, when two recordings separated by an intervening stretch of synthesized speech are poorly matched, for example when the difference in pitch (that is, fundamental frequency) between them is too large, the result not only sounds unnatural but also adversely affects the synthesized speech, which is processed to match the recordings on both sides. The synthesized speech as a whole therefore ends up unnatural.
An object of the present invention is to propose a recording/editing and text-to-speech combined speech synthesizer that can create high-quality synthesized speech in a speech synthesis method using both recording/editing synthesis and speech synthesis.

The recording/editing and text-to-speech combined speech synthesizer according to the present invention comprises:
a recorded speech database that uses speech fragments consisting of at least one accent phrase as the selectable unit and stores multiple recordings with the same utterance content but different tones;
a text analysis unit that identifies, within the input text, the recorded-speech-compatible text for which recorded speech stored in the recorded speech database can be used;
a recorded speech search unit that, as the condition for selecting among the multiple candidate recordings usable for the identified recorded-speech-compatible text, extracts speech tone components from the recorded speech database, determines the speech tone component of the whole target sentence by a statistical method, selects recorded speech on the basis of that tone component and the match of the preceding and following phoneme environment, and gives priority to recorded speech that is continuous for as long as possible;
a prosody generation unit that, for text other than the identified recorded-speech-compatible text, generates prosody consistent with the speech tone component, fundamental frequency, speech rate, and power of the selected recorded speech;
a speech segment search unit that generates synthesized speech by text-to-speech synthesis from the generated prosody; and
a synthesis processing unit that applies synthesis processing as needed to generate smooth synthesized speech.

According to the present invention, in a recording/editing and text-to-speech combined speech synthesizer that uses both recording/editing speech synthesis and text-to-speech synthesis, which can create speech from arbitrary text, high-quality synthesized speech is obtained by selecting the optimal speech data from multiple candidates when choosing recorded speech, taking into account the consistency of the speech tone component across the whole sentence and the match of the preceding and following phoneme environment with adjacent speech segments.

The recording/editing and text-to-speech combined speech synthesizer according to the present invention could be built in hardware, but the best mode is an embodiment in which the recording/editing and text-to-speech combined speech synthesis program according to the present invention is installed on a computer, and the computer is made to function as the synthesizer.
To make a computer function as the recording/editing and text-to-speech combined speech synthesizer, the following are built on the computer:
a recorded speech database that uses speech fragments consisting of at least one accent phrase as the selectable unit and stores multiple recordings with the same utterance content but different tones;
a text analysis unit that identifies, within the input text, the recorded-speech-compatible text for which recorded speech stored in the recorded speech database can be used;
a recorded speech search unit that, as the condition for selecting among the multiple candidate recordings usable for the identified recorded-speech-compatible text, extracts speech tone components from the recorded speech database, determines the speech tone component of the whole target sentence by a statistical method, selects recorded speech on the basis of that tone component and the match of the preceding and following phoneme environment, and gives priority to recorded speech that is continuous for as long as possible;
a prosody generation unit that, for text other than the identified recorded-speech-compatible text, generates prosody consistent with the speech tone component, fundamental frequency, speech rate, and power of the selected recorded speech;
a speech segment search unit that generates synthesized speech by text-to-speech synthesis from the generated prosody; and
a synthesis processing unit that applies synthesis processing as needed to generate smooth synthesized speech.
This configuration provides the distinctive effect of the present invention, namely that high-quality synthesized speech can be obtained.

FIG. 1 shows the basic configuration of the present invention. The basic flow is as follows.
The text analysis unit 12 determines which parts of the input text 11 are available as recorded speech in the recorded speech database 14. It routes the recorded-speech-compatible text, for which output speech is selected from the recordings, and the non-compatible text, for which synthesized speech is created by text-to-speech synthesis, to their respective processing flows. For the non-compatible text it also performs morphological analysis and assigns the reading, accent, tone-coupling type, and so on.
The recorded speech search unit 13 selects recorded speech for the recorded-speech-compatible text. Its criteria are the speech tone component, which is either extracted from the recorded speech database 14 by the speech tone component extraction unit 15 or supplied manually, the continuity of the recorded speech, and the match of the preceding and following phoneme environment in the input text.
To synthesize the non-compatible text, the speech feature extraction unit 17 extracts the speech tone component, fundamental frequency, speech rate, power, and so on from the recorded speech selected by the recorded speech search unit 13. The prosody generation unit 18 generates prosody from the extracted features, and the speech segment search unit 19 selects speech segments from the general-purpose speech database 20 based on the generated prosody.
Finally, the synthesis processing unit 21 smoothly joins the speech selected by the recorded speech search unit 13 and the speech segment search unit 19 to create the synthesized speech data 22.
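The routing of input text into the two processing flows can be illustrated with a minimal Python sketch. It assumes the input has already been divided into accent phrases and that the database can be queried as a set of phrase tuples; the greedy longest-match strategy and the names `split_text`, `recorded`, and `synth` are illustrative assumptions, not something the source specifies.

```python
from typing import List, Set, Tuple

Span = Tuple[str, List[str]]  # ("recorded" | "synth", accent phrases)


def split_text(phrases: List[str], recorded: Set[Tuple[str, ...]]) -> List[Span]:
    """Split accent phrases into recorded-speech and synthesized spans.

    Assumption: a greedy longest match stands in for the preference for
    recordings that are continuous for as long as possible.
    """
    spans: List[Span] = []
    i = 0
    while i < len(phrases):
        # Look for the longest run starting at i that exists as one recording.
        match_len = 0
        for j in range(len(phrases), i, -1):
            if tuple(phrases[i:j]) in recorded:
                match_len = j - i
                break
        if match_len:
            spans.append(("recorded", phrases[i:i + match_len]))
            i += match_len
        elif spans and spans[-1][0] == "synth":
            # Merge consecutive unmatched phrases into one synthesized span.
            spans[-1][1].append(phrases[i])
            i += 1
        else:
            spans.append(("synth", [phrases[i]]))
            i += 1
    return spans


# Hypothetical example data: each tuple is one recording in the database.
example_db = {
    ("the train",),
    ("the train", "bound for"),
    ("will arrive",),
    ("will arrive", "shortly"),
}
example = split_text(
    ["the train", "bound for", "Yokosuka", "will arrive", "shortly"], example_db
)
```

In this example the two-phrase recordings are preferred over their one-phrase counterparts, and only the place name is left for text-to-speech synthesis.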

The processing of the speech tone component extraction unit 15 and the recorded speech search unit 13, which characterize the present invention, is described in detail below.
The speech tone component extraction unit 15 sets the speech tone component for the whole sentence of the text to be synthesized. It first extracts the speech tone components that the speech in the recorded speech database 14 actually has. Extraction is performed for each accent phrase. The procedure is explained using FIG. 2, in which the horizontal axis is time and the vertical axis is fundamental frequency.
The fundamental frequency F1 at the start of the accent phrase and the fundamental frequency F2 at its end are connected by a straight line N. If the frequency pattern PT from the midpoint of the accent phrase onward lies on or below line N, a tangent J to that part of PT is drawn through the start point. The slope of this tangent J is used as the speech tone component.
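The tangent construction can be sketched in Python for a sampled fundamental-frequency contour. This is one possible reading under stated assumptions: the tangent J is approximated as the line from the start point that touches the second half of the contour from below (the minimum-slope line to any sample after the midpoint), and the fallback to the slope of line N when the contour is not below N is a choice the source does not specify. The function name `tone_slope` is hypothetical.

```python
from typing import Sequence


def tone_slope(times: Sequence[float], f0: Sequence[float]) -> float:
    """Approximate the slope of tangent J for one accent phrase.

    times: sample times in seconds, strictly increasing.
    f0:    fundamental frequency in Hz at each sample time.
    """
    if len(times) != len(f0) or len(times) < 3:
        raise ValueError("need at least three matching samples")

    t_start, f_start = times[0], f0[0]   # F1 at the start of the phrase
    t_end, f_end = times[-1], f0[-1]     # F2 at the end of the phrase
    chord = (f_end - f_start) / (t_end - t_start)  # slope of line N

    midpoint = (t_start + t_end) / 2
    tail = [(t, f) for t, f in zip(times, f0) if t > midpoint]

    # Check that the pattern after the midpoint lies on or below line N.
    below = all(f <= f_start + chord * (t - t_start) + 1e-9 for t, f in tail)
    if not below:
        # Assumption: the source does not cover this case; fall back to N.
        return chord

    # Line through the start point touching the tail from below:
    # the smallest slope among lines from the start to each tail sample.
    return min((f - f_start) / (t - t_start) for t, f in tail)


# Hypothetical contour: rises, then dips below the chord before the end.
example_slope = tone_slope([0.0, 1.0, 2.0, 3.0, 4.0],
                           [200.0, 220.0, 180.0, 150.0, 160.0])
```

Here line N has a slope of -10 Hz/s, while the tangent through the start point touches the contour at the 150 Hz sample, giving a steeper slope of about -16.7 Hz/s.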

Next, the speech tone components extracted from the recorded speech database 14 are analyzed statistically to set the speech tone component of the text to be synthesized. As a first step, normal distributions of the start frequency, the end frequency, and the slope of the speech tone component are created for each identical accent phrase. The start frequency and slope of the line segment representing the speech tone component are then set so that the sum of the probabilities given by these normal distributions is maximized. This yields a reference speech tone component.
Using a speech tone component created in this way as the reference allows the recorded speech search unit 13 to find many suitable recorded speech candidates that are close to the reference, which makes it possible to create high-quality synthesized speech.
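A minimal Python sketch of this statistical step follows. It is an interpretation, not the source's procedure: it assumes one straight reference line over the sentence, uses probability densities as the "probabilities", and searches a caller-supplied grid of start frequencies and slopes. `PhraseStats`, `fit`, `line_score`, and `best_line` are hypothetical names; `statistics.NormalDist` is from the Python standard library.

```python
from dataclasses import dataclass
from statistics import NormalDist
from typing import List, Sequence, Tuple


@dataclass
class PhraseStats:
    """Normal distributions for one accent phrase position (assumed layout)."""
    t_start: float        # start time of the phrase within the sentence (s)
    t_end: float          # end time of the phrase within the sentence (s)
    start_f0: NormalDist  # distribution of start frequencies (Hz)
    end_f0: NormalDist    # distribution of end frequencies (Hz)
    slope: NormalDist     # distribution of tone-component slopes (Hz/s)


def fit(values: Sequence[float]) -> NormalDist:
    """Fit a normal distribution; needs two or more non-identical samples."""
    return NormalDist.from_samples(values)


def line_score(start: float, slope: float, phrases: List[PhraseStats]) -> float:
    """Sum of densities for a reference line f(t) = start + slope * t."""
    total = 0.0
    for p in phrases:
        total += p.start_f0.pdf(start + slope * p.t_start)
        total += p.end_f0.pdf(start + slope * p.t_end)
        total += p.slope.pdf(slope)
    return total


def best_line(phrases: List[PhraseStats],
              starts: Sequence[float],
              slopes: Sequence[float]) -> Tuple[float, float]:
    """Grid search for the (start frequency, slope) with the highest score."""
    return max(((s, k) for s in starts for k in slopes),
               key=lambda sk: line_score(sk[0], sk[1], phrases))


# Hypothetical statistics for a two-phrase sentence.
example_phrases = [
    PhraseStats(0.0, 1.0, NormalDist(200, 10), NormalDist(190, 10), NormalDist(-10, 2)),
    PhraseStats(1.0, 2.0, NormalDist(190, 10), NormalDist(180, 10), NormalDist(-10, 2)),
]
example_line = best_line(example_phrases, [180.0, 200.0, 220.0], [-20.0, -10.0, 0.0])
```

A grid search keeps the sketch dependency-free; a numerical optimizer could replace it if finer resolution were needed.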

The recorded speech search unit 13 selects appropriate recorded speech so that the synthesized speech finally obtained is of high quality. It first searches the recorded speech database 14 for the longest possible recording that matches the recorded-speech-compatible text. In this method one accent phrase is normally treated as one speech fragment, but if a continuous recording containing two or more accent phrases exists, that recording is treated as a single speech fragment.
Next, the speech tone component data 16 of each recorded speech candidate is extracted by the method used in the speech tone component extraction unit 15. How closely it matches the reference speech tone component is obtained as a probability for each speech fragment from the group of normal distributions in FIG. 3, and the average probability per speech fragment is calculated.
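The average-probability calculation can be sketched as follows. The source does not say how a probability is read off a normal distribution, so this sketch assumes the density normalized by its peak value, which gives 1.0 at the mean and falls toward 0 away from it. The names `match_probability` and `average_probability` are hypothetical.

```python
import math
from statistics import NormalDist
from typing import Sequence


def match_probability(value: float, dist: NormalDist) -> float:
    """Closeness of a candidate's tone slope to the reference, in (0, 1].

    Assumption: density divided by the peak density, exp(-z^2 / 2).
    """
    z = (value - dist.mean) / dist.stdev
    return math.exp(-0.5 * z * z)


def average_probability(slopes: Sequence[float],
                        dists: Sequence[NormalDist]) -> float:
    """Mean match probability over the speech fragments of one combination."""
    if len(slopes) != len(dists) or not slopes:
        raise ValueError("need one distribution per fragment")
    total = sum(match_probability(s, d) for s, d in zip(slopes, dists))
    return total / len(slopes)


# Hypothetical reference distributions for two fragments.
example_dists = [NormalDist(-10.0, 2.0), NormalDist(-5.0, 1.0)]
perfect = average_probability([-10.0, -5.0], example_dists)
one_sigma_off = average_probability([-10.0, -6.0], example_dists)
```

Averaging per fragment keeps combinations with different numbers of fragments comparable, which matters because a long continuous recording counts as a single fragment.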

In addition, the degree to which the preceding and following phoneme environment of each speech fragment matches that in the input text is calculated as a score. For example, the score is 1 if the phonemes match exactly, 0.5 if only the phoneme class (such as vowel or plosive) matches, and 0 otherwise. The combination of recordings with the highest integrated score, defined as the weighted sum of the probability from the speech tone component and the score from the preceding and following phoneme environment, is then selected.
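The scoring rule and the weighted integration can be sketched in Python. The 1 / 0.5 / 0 values come from the description; the phoneme class table, the equal weights, the averaging of the preceding and following scores, and the names `phoneme_score`, `Candidate`, `integrated_score`, and `select` are illustrative assumptions. Because each fragment is scored independently in this sketch, the best combination is simply the best candidate for each fragment.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple

# Assumed, deliberately small phoneme class table.
PHONEME_CLASS: Dict[str, str] = {
    "a": "vowel", "i": "vowel", "u": "vowel", "e": "vowel", "o": "vowel",
    "p": "plosive", "t": "plosive", "k": "plosive",
    "b": "plosive", "d": "plosive", "g": "plosive",
}


def phoneme_score(recorded: str, target: str) -> float:
    """1 for an exact match, 0.5 for a class match, 0 otherwise."""
    if recorded == target:
        return 1.0
    cls = PHONEME_CLASS.get(recorded)
    if cls is not None and cls == PHONEME_CLASS.get(target):
        return 0.5
    return 0.0


@dataclass(frozen=True)
class Candidate:
    name: str
    tone_probability: float  # match to the reference tone component, 0..1
    prev_phoneme: str        # phoneme before the fragment when it was recorded
    next_phoneme: str        # phoneme after the fragment when it was recorded


def integrated_score(c: Candidate, target_prev: str, target_next: str,
                     w_tone: float = 0.5, w_phone: float = 0.5) -> float:
    """Weighted sum of tone probability and phoneme-environment score."""
    env = (phoneme_score(c.prev_phoneme, target_prev)
           + phoneme_score(c.next_phoneme, target_next)) / 2
    return w_tone * c.tone_probability + w_phone * env


def select(slots: Sequence[Tuple[List[Candidate], str, str]]) -> List[Candidate]:
    """Pick the highest-scoring candidate for each fragment slot."""
    return [max(cands, key=lambda c: integrated_score(c, prev, nxt))
            for cands, prev, nxt in slots]


# Hypothetical candidates for one slot whose neighbours are "t" and "a".
example_cands = [
    Candidate("take_1", 0.9, "k", "s"),
    Candidate("take_2", 0.6, "t", "a"),
]
chosen = select([(example_cands, "t", "a")])
```

With equal weights, `take_2` wins despite its lower tone probability, because both of its neighbouring phonemes match the target context exactly.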
The problems that arise when the present invention is not applied, in a speech synthesis method that uses both recording/editing speech synthesis, which creates desired speech by joining pre-recorded speech data, and text-to-speech synthesis, which can create arbitrary speech from text, are explained with reference to FIGS. 4 and 5.

If the database does not hold multiple recordings with the same utterance content, or if recorded speech is selected without considering non-adjacent recordings, there is a high risk of a large gap in continuity, as shown in FIG. 4. If the prosody of the synthesized speech is then modified to fill this gap, as shown in FIG. 5, the prosody of the synthesized speech becomes unnatural and quality degrades.
By considering the consistency of the speech tone component across the whole sentence, the continuity of the recorded speech, and the match of the preceding and following phoneme environment with adjacent speech segments when selecting recorded speech, the present invention is an effective means of avoiding the quality degradation shown in these examples.

The recording/editing and text-to-speech combined speech synthesizer described above is realized mainly by installing the recording/editing and text-to-speech combined speech synthesis program according to the present invention on a computer and making the computer function as the synthesizer. The program is written in a computer-readable programming language and recorded on a computer-readable recording medium such as a magnetic disk, a CD-ROM, or a semiconductor memory. It can be installed on a computer from such a recording medium or through a communication line.

Because the recording/editing and text-to-speech combined speech synthesizer according to the present invention produces speech of good quality, it can be used in various automatic voice guidance devices, automatic response devices, and the like.

FIG. 1 is a block diagram for explaining one embodiment of the present invention.
FIG. 2 is a graph for explaining the operation of the embodiment shown in FIG. 1.
FIG. 3 is a graph similar to FIG. 2.
FIG. 4 is a graph for explaining the problems of recording/editing and text-to-speech combined speech synthesis that does not implement the present invention.
FIG. 5 is a graph similar to FIG. 4.

Explanation of symbols

11 Input text
12 Text analysis unit
13 Recorded speech search unit
15 Speech tone component extraction unit
16 Speech tone component data
17 Speech feature extraction unit
18 Prosody generation unit
19 Speech segment search unit
20 General-purpose speech database
21 Synthesis processing unit
22 Synthesized speech data

Claims

A recording/editing and text-to-speech combined speech synthesizer that uses both recording/editing speech synthesis, which creates desired speech by joining pre-recorded speech data, and text-to-speech synthesis, which can create arbitrary speech from text, the synthesizer comprising:
a recorded speech database that uses speech fragments consisting of at least one accent phrase as the selectable unit and stores multiple recordings with the same utterance content but different tones;
a text analysis unit that identifies, within the input text, the recorded-speech-compatible text for which recorded speech stored in the recorded speech database can be used;
a recorded speech search unit that, as the condition for selecting among the multiple candidate recordings usable for the identified recorded-speech-compatible text, extracts speech tone components from the recorded speech database, determines the speech tone component of the whole target sentence by a statistical method, selects recorded speech on the basis of that tone component and the match of the preceding and following phoneme environment, and gives priority to recorded speech that is continuous for as long as possible;
a prosody generation unit that, for text other than the identified recorded-speech-compatible text, generates prosody consistent with the speech tone component, fundamental frequency, speech rate, and power of the selected recorded speech;
a speech segment search unit that generates synthesized speech by text-to-speech synthesis from the generated prosody; and
a synthesis processing unit that applies synthesis processing as needed to generate smooth synthesized speech.

A recording/editing and text-to-speech combined speech synthesis program, written in a computer-readable programming language, that causes a computer to function as the recording/editing and text-to-speech combined speech synthesizer according to claim 1.

A computer-readable recording medium on which the recording/editing and text-to-speech combined speech synthesis program according to claim 2 is recorded.