JP2586040B2 - Voice editing and synthesis device

Info

Publication number: JP2586040B2
Application number: JP62099407A
Authority: JP
Inventors: 勝信伏木田
Original assignee: Nippon Electric Co Ltd
Current assignee: NEC Corp
Priority date: 1987-04-21
Filing date: 1987-04-21
Publication date: 1997-02-26
Anticipated expiration: 2012-02-26
Also published as: JPS63264800A

Description

[Detailed Description of the Invention] (Field of Industrial Application) The present invention relates to a speech editing and synthesis device for use in a voice response system.

(Prior Art) A known method of voice response stores the speech waveforms of words, sentences, and the like uttered by a human, and then edits and synthesizes those waveforms. Also known is a voice response system that synthesizes arbitrary speech by editing and synthesizing relatively short speech segments such as CV and VC (where C denotes a consonant and V a vowel) according to a character string given as input. This system is described in, among others, the paper titled "Arbitrary word synthesis method by pitch-synchronous interpolation of CV and VC waveforms" in the Speech Study Group materials published by the Acoustical Society of Japan in 1982 (document number S82-06 (1982-4)).

(Problems to Be Solved by the Invention) In the former method, the natural speech units being edited are relatively long, so the edited and synthesized speech has good sound quality, but the kinds of sentences that can be synthesized are limited. The latter method can synthesize arbitrary sentences, but the speech segments being edited are short and the influence of coarticulation is not sufficiently taken into account, so the synthesized sound quality is relatively poor.

An object of the present invention is to provide a speech editing and synthesis device that takes the influence of coarticulation into account as far as possible and can generate relatively high-quality speech for arbitrary sentences.

(Means for Solving the Problems) The invention of the present application comprises: a speech data memory that stores in advance speech data of words and the like, together with a phoneme name sequence representing each item of speech data and syllable boundary data; means for matching a syllable name sequence given as input against the syllable name sequences corresponding to the words and the like (including their partial sequences) and selecting the longest-matching partial syllable name sequence of the speech data; and means for generating the desired speech by using the syllable boundary data, in accordance with the syllable name sequence selected by the selecting means, to cut the required speech data out of the stored speech data and to edit and synthesize it.
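As a rough illustration of how syllable boundary data lets a stored word be cut at syllable granularity, here is a minimal Python sketch. The `WordEntry` layout, the use of sample indices as boundary data, and the plain concatenation in `concatenate` are all assumptions made for this example; the patent does not specify a storage format, and the editing and synthesis step it describes may involve more than joining pieces end to end.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WordEntry:
    """One stored word (hypothetical layout, not taken from the patent)."""
    syllables: List[str]   # syllable (CV/VC unit) names, in order
    boundaries: List[int]  # len(syllables) + 1 indices; unit k spans boundaries[k]:boundaries[k + 1]
    samples: List[float]   # speech data, e.g. waveform samples


def cut_out(entry: WordEntry, start: int, length: int) -> List[float]:
    """Return the speech data covering entry.syllables[start:start + length]."""
    begin = entry.boundaries[start]
    end = entry.boundaries[start + length]
    return entry.samples[begin:end]


def concatenate(pieces: List[List[float]]) -> List[float]:
    """Join cut-out pieces end to end (a stand-in for real editing and synthesis)."""
    out: List[float] = []
    for piece in pieces:
        out.extend(piece)
    return out


# Toy data: three units with boundaries at samples 0, 2, 5 and 6.
entry = WordEntry(["a", "b", "c"], [0, 2, 5, 6], [0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
middle_and_last = cut_out(entry, start=1, length=2)  # samples 2..5
```

Storing one more boundary than there are syllables keeps `cut_out` free of special cases for the first and last unit of a word.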

(Principle of the Invention) The way feature parameters such as the frequency spectrum of a syllable vary within continuously uttered speech, such as words and sentences, is known to differ greatly from the way they vary in a syllable uttered in isolation, because the syllable is influenced by the syllables before and after it. This phenomenon is called coarticulation. Consider a rule-based speech synthesis system that cuts multiple unit speech segments out of natural speech in advance and synthesizes arbitrary speech by editing those units. To improve synthesized sound quality by fully accounting for coarticulation, such a system needs units that are as long as possible (that is, units containing many syllables). With long units, however, the number of syllable combinations becomes enormous, so cutting the units out of natural speech becomes difficult and the speech synthesis system itself grows large. One way to output relatively high-quality synthesized speech for the size of the synthesizer is therefore to add word speech data containing frequently used syllable sequences, together with segmentation data marking the syllable boundaries. As an example, Fig. 2 shows the sequence of syllable names (here CV and VC units are called syllables; C denotes a consonant and V a vowel) and the boundary data (segmentation data) for the word /yamazaki/. To make this method more effective, the syllable sequence contained in the sentence to be synthesized must be expressed using the longest possible syllable sequences contained in the stored words, which can be achieved with the so-called longest-match search method.

The speech data may be speech waveforms or parameters extracted from speech waveforms, such as formant parameters. A method for synthesizing arbitrary speech from speech waveforms corresponding to syllables (CV, VC) is described in detail in, for example, the paper cited above. A method for synthesizing arbitrary speech from formant parameters or the like corresponding to syllables is described in detail in, for example, the paper titled "Formant, CV-VC type synthesis by rule" in the Speech Study Group materials published by the Acoustical Society of Japan in 1985 (document number S85-31 (1985-7)). Their description is therefore omitted here.

(Embodiment) An embodiment of the present invention will now be described in detail with reference to the drawings.

Fig. 1 is a block diagram showing one embodiment of the present invention. In this embodiment, a character string 110 representing the sentence to be synthesized is first supplied via character string input terminal 11, converted by syllable name sequence conversion unit 1 into a syllable name sequence 101, and input to optimal syllable sequence selection unit 2. Optimal syllable sequence selection unit 2 refers to the syllable name sequences corresponding to the word speech data, which are stored in syllable name sequence storage unit A in storage unit 3. It decomposes syllable name sequence 101 into partial syllable name sequences 102 of the word speech data and outputs them as address data for syllable boundary data storage unit B in storage unit 3.

In this decomposition, let the syllable name sequence converted from the input character string be (S_1, S_2, ..., S_i, S_{i+1}, ..., S_n, ..., S_N), where S_i, ..., S_n are syllable names and N is the number of syllables in the input character string. Starting at S_1 and working from left to right, the unit first searches for the longest partial syllable name sequence contained in the word speech data. If S_1, ..., S_i turns out to be the longest partial syllable name sequence, the same operation is repeated starting from S_{i+1}, so that the longest partial syllable name sequences are found one after another.

Speech data storage unit C in storage unit 3 then sequentially outputs to editing and synthesis circuit 4, in accordance with the partial syllable name sequences of the word speech data, the speech data 103c of the portions of the word speech data that correspond to those partial syllable name sequences. Editing and synthesis circuit 4 edits and synthesizes the speech data 103c output from storage unit 3, generates a synthesized waveform, and outputs it via synthesized waveform output terminal 12.
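The left-to-right decomposition described above can be sketched in Python as a greedy longest-match search. This is only an illustration under stated assumptions: the lexicon contents, the CV/VC segmentations, the `(word, start, length)` result format, the tie-breaking rule (first match found wins), and the `KeyError` raised when no stored word contains a syllable are all hypothetical, since the patent does not address them.

```python
from typing import Dict, List, Optional, Tuple

Match = Tuple[str, int, int]  # (word id, start syllable index in that word, number of syllables)

# Hypothetical stored words with assumed CV/VC segmentations.
LEXICON: Dict[str, List[str]] = {
    "yamazaki": ["ya", "am", "ma", "az", "za", "ak", "ki"],
    "sakura": ["sa", "ak", "ku", "ur", "ra"],
}


def longest_match_at(target: List[str], pos: int,
                     lexicon: Dict[str, List[str]]) -> Optional[Match]:
    """Find the longest run target[pos:pos + k] that occurs contiguously in any stored word."""
    best: Optional[Match] = None
    for word, units in lexicon.items():
        for start in range(len(units)):
            k = 0
            while (pos + k < len(target) and start + k < len(units)
                   and target[pos + k] == units[start + k]):
                k += 1
            if k > 0 and (best is None or k > best[2]):
                best = (word, start, k)
    return best


def decompose(target: List[str], lexicon: Dict[str, List[str]]) -> List[Match]:
    """Greedy left-to-right decomposition into longest partial syllable sequences."""
    result: List[Match] = []
    pos = 0
    while pos < len(target):
        match = longest_match_at(target, pos, lexicon)
        if match is None:
            # Assumption: an uncovered syllable is treated as an error.
            raise KeyError(f"no stored word contains syllable {target[pos]!r}")
        result.append(match)
        pos += match[2]  # continue from S_{i+1}
    return result


TARGET = ["ya", "am", "ma", "ak", "ku"]
print(decompose(TARGET, LEXICON))  # [('yamazaki', 0, 3), ('sakura', 1, 2)]
```

Each `(word, start, length)` triple plays the role of the address data: combined with the stored syllable boundaries of that word, it identifies the stretch of speech data to pass on for editing and synthesis. The brute-force scan is chosen for readability; a suffix-based index would serve a large word inventory better.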

The above description does not take into account whether the syllables in a syllable name sequence are accented. However, accent also affects the frequency spectrum pattern of a syllable, so it is clear that adding accent information to the search for partial syllable name sequences yields syllable sequences that give comparatively good synthesized sound quality.
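One simple way to picture an accent-aware search is to attach an accent flag to each syllable name and require both to agree during matching. The sketch below is an assumption-laden illustration: the `(name, accented)` encoding and the `use_accent` switch are hypothetical, and the patent does not say how accent information would be represented or compared.

```python
from typing import List, Tuple

Unit = Tuple[str, bool]  # (syllable name, accented?) -- hypothetical encoding


def common_prefix_len(a: List[Unit], b: List[Unit], use_accent: bool) -> int:
    """Count leading units that agree in name and, if use_accent is set, in accent."""
    n = 0
    for (name_a, acc_a), (name_b, acc_b) in zip(a, b):
        if name_a != name_b or (use_accent and acc_a != acc_b):
            break
        n += 1
    return n


wanted = [("ma", True), ("az", False)]
stored = [("ma", True), ("az", True)]
print(common_prefix_len(wanted, stored, use_accent=False))  # 2
print(common_prefix_len(wanted, stored, use_accent=True))   # 1
```

With accent checking enabled, the stored sequence matches only as far as its accent pattern agrees, so a longest-match search built on this comparison would favor units recorded with the same accent.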

(Effect of the Invention) As described above, the present invention makes it possible to use relatively long syllable sequences as unit speech. Adopting the speech editing and synthesis device of the present invention therefore captures more of the effect of coarticulation and allows relatively high-quality synthesized speech to be generated for arbitrary input.

[Brief description of the drawings]

Fig. 1 is a block diagram showing one embodiment of the present invention, and Fig. 2 is a diagram showing an example of a syllable name sequence and boundary data. In Fig. 1, 1 denotes the syllable name sequence conversion unit, 2 the optimal syllable sequence selection unit, 3 the storage unit, 4 the editing and synthesis circuit, 11 the character string input terminal, and 12 the synthesized waveform output terminal. Within storage unit 3, A denotes the storage unit for syllable name sequences corresponding to speech data of words and the like, B the storage unit for syllable boundary data corresponding to those syllable name sequences within the speech data, and C the storage unit for the speech data. In Fig. 2, the horizontal axis represents time, the vertical axis represents the average amplitude value, parentheses ( ) enclose syllable names, and the solid line represents the average amplitude pattern of the word speech /yamazaki/.

Claims

(57) [Claims]

1. A speech editing and synthesis device comprising: a speech data memory that stores in advance speech data of words and the like, together with a phoneme name sequence representing each item of speech data and syllable boundary data; means for matching a syllable name sequence given as input against the syllable name sequences corresponding to the words and the like, and selecting the longest-matching partial syllable name sequence of the speech data; and means for generating desired speech by using the syllable boundary data, in accordance with the syllable name sequence selected by the selecting means, to cut the required speech data out of the speech data and to edit and synthesize it.

2. The speech editing and synthesis device according to claim 1, wherein the phoneme name sequence stored in the speech data memory includes partial sequences.