JPS61112199A - Voice synthesization by carrier modulation

Info

Publication number: JPS61112199A
Application number: JP59158900A
Authority: JP
Inventors: 若林　昭夫
Original assignee: Individual
Current assignee: Individual
Priority date: 1984-07-31
Filing date: 1984-07-31
Publication date: 1986-05-30

Abstract

(57)【要約】本公報は電子出願前の出願データであるた
め要約のデータは記録されません。

(57) [Abstract] This publication contains application data filed before electronic filing was introduced, so no abstract data is recorded.

Description

【発明の詳細な説明】この発明は、光が周波数個有の色を持つように音波す周
波数個有の音を持ち、しかも母音の原音であることを出
願人が発見したことにより可能となったもので、素片合
成、線型予測、正弦波重畳等の波形またはスペクトル再
成を目指した従来の方法とは本質的に異なるものであり
、極めて自然に近い音質の母音を合成することができる
画期的方法である。

[Detailed Description of the Invention] This invention was made possible by the applicant's discovery that, just as light has a color specific to its frequency, a sound wave has a sound specific to its frequency, and that this sound is the original sound of a vowel. It differs fundamentally from conventional methods that aim to reproduce a waveform or a spectrum, such as segment synthesis, linear prediction, and superposition of sine waves, and it is a groundbreaking method that can synthesize vowels of very nearly natural sound quality.

本発明はハード、ソフトのいづれでも実施可能である。

The present invention can be implemented in either hardware or software.

原理および実施例について説明する。低周波発振器の出
力を聴（と３００Ｈｚ程度ではブー、１ｋＨｚ程度では
パー、２ｋＨｚ以上ではピーという音に聴こえる。これ
らが破裂音であるのは耳に到達した時に立ち上がるステ
ップ関数で変調されているからで、十分縁やかな立ち上
がりにすれば母音となる。試みにｆ＝１００Ｈｚのｃｏ
！１波で１００％変調をすると母音に聴こえることがわ
かる。このとき変調された波形は第１図のごとく。

The principle and embodiments are explained below. Listening to the output of a low-frequency oscillator, one hears a sound like "boo" at about 300 Hz, "paa" at about 1 kHz, and "pee" at 2 kHz and above. These are heard as plosives because the tone is modulated by a step function that rises at the moment it reaches the ear; if the rise is made sufficiently gentle, the sound becomes a vowel. As a trial, applying 100% modulation with a cosine wave of f = 100 Hz shows that the tone is then heard as a vowel. The modulated waveform in this case is shown in Fig. 1.

ｖ　（ｔ）＝　（１＋ｃｏｓ　（２ｓｆｔ））ｓｉｎ　
（２πｎｆｔ）で表わされるが、ｎを３以上の素数に選
ぶと英語母音に対応するものを得ることができる。基本
波以外には高調波関係にない周波数が具なる音となるこ
とは、共鳴ということを考慮すれば予測し得る。勿論、
これらの前後の周波数帯も同じ音を呈し、境界は明確で
はなく中間的母音を経て移り変わる。また基本波が２０
０Ｈｚならば第２．３．４高調波がｕ、ｏ、αで必ずし
も素数倍とはならない、すなわち、母音は音波の波形で
はなく周波数特有の音である。我々は光の周波数に対し
色という感覚または概念を持つように音波の周波数に対
して母音という聴覚または概念を持つのである。そして
、前記のｎの値に対して下記のように母音を対応させる
ことができ、更に光の七色に対応させた調合も簡単では
ないが可能である。

It is expressed as $v(t) = (1 + \cos(2\pi f t))\,\sin(2\pi n f t)$, and if n is chosen to be a prime number of 3 or more, sounds corresponding to the English vowels are obtained. That frequencies with no harmonic relationship to one another, other than through the fundamental, should give different sounds can be predicted by considering resonance. Of course, the frequency bands on either side of each of these frequencies present the same sound; the boundaries are not sharp, and one vowel passes into the next through intermediate vowels. Moreover, if the fundamental is 200 Hz, the 2nd, 3rd, and 4th harmonics are u, o, and α, so the multiples are not necessarily prime. In other words, a vowel is not a waveform of the sound wave but a sound specific to a frequency. Just as we have the sensation or concept of color for the frequency of light, we have the auditory sensation or concept of vowel for the frequency of a sound wave. The vowels can be assigned to the values of n as shown below, and a blend corresponding to the seven colors of light is also possible, though not simple.

ｎ＝３　５　７　１１　１３　１７　１９　２３ｕｏａ
　　Δ　ａ　　ａ！　ｅｉ赤橙黄緑　青藍紫本発明はこれら新事実の出願人による発見に基くもので
従来法とは異なる全く新規のものである。

n      =  3    5    7    11   13   17   19   23
vowel  =  u    o    α    ʌ    a    æ    e    i
colors listed with the table: red, orange, yellow, green, blue, indigo, violet

The present invention is based on the applicant's discovery of these new facts and is entirely new and different from conventional methods.
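
The modulated tone v(t) defined above can be tried numerically. The following is a minimal sketch, not the applicant's implementation: the function name `modulated_tone` is hypothetical, and the 40 kHz sample rate is an assumption borrowed from the 400 data values per 10 ms pitch period mentioned later in the description.

```python
import math

def modulated_tone(n, f=100.0, sample_rate=40000, duration=0.02):
    """Return samples of v(t) = (1 + cos(2*pi*f*t)) * sin(2*pi*n*f*t)."""
    count = round(sample_rate * duration)
    samples = []
    for k in range(count):
        t = k / sample_rate
        envelope = 1.0 + math.cos(2.0 * math.pi * f * t)  # 100% modulation at f
        samples.append(envelope * math.sin(2.0 * math.pi * n * f * t))
    return samples

# n = 7 is one of the prime multiples listed in the table (assumed example value).
tone = modulated_tone(n=7)
```

Because the envelope 1 + cos(2πft) reaches zero once per period of f, the tone fades in and out smoothly instead of starting with a step, which is the property the text associates with a vowel-like rather than plosive sound.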

これら周波数特有の音である母音はｎが１１以上になる
と周波数が高すぎることと、振動数の２乗および振幅の
２乗に比例するエネルギーが大きくなり声の大きさを示
す声帯振動のピッチ周波数成分が相対的に弱くなるため
音声が小さくなることにより実用にならない０本発明は
これらを人間の音声に近い母音に改良するための方法で
あり以下に実施例で説明する。

These vowels, being sounds specific to their frequencies, are not practical when n is 11 or more, for two reasons. First, the frequency is too high. Second, the energy, which is proportional to the square of the frequency and the square of the amplitude, becomes large, so the pitch-frequency component of the vocal-cord vibration, which determines loudness, becomes relatively weak and the voice sounds quieter. The present invention is a method for improving these sounds into vowels close to the human voice, and it is explained below by way of embodiments.

前者はピッチ周基の鋸歯状波で変調すれば耳に聴こえる
見掛の周波数を下げることができるが。

The former problem can be addressed by modulating with a sawtooth wave of the pitch period, which lowers the apparent frequency heard by the ear.

後者は振幅を小さくする必要があり、そ九には鋸歯状波
の振幅も小さくしなければならないがこれはピッチ成分
の減少であるので母音成分に定常値を加えた１＋ａ−ｓ
ｉｎ　（２ｇｎｆｔ）、Ｏｖａ≦１を変調することによ
りピッチ成分の減少を防ぐ、また鋸歯状波も第２高調波
成分の発生を避けるため定常値を加え正の振幅のみとす
る。上記では母音周波を搬送波としたが、このような場
合にはどちらを搬送波と考えても同じで、情報を担うも
のが搬送波であるという考えに従えば鋸歯状波を搬送波
と考えるのが妥当である０周波数の関係は電波と逆にな
るが、電波が搬送波に極めて大きなエネルギーを持たせ
情報自体のエネルギーは僅かであるのに対し、搬送波の
周波数を低くするとそのエネルギーは小さく情報自体の
エネルギーが大きくなるので効率が良くなる。生物は効
率の良い方法を採ると考えられるので鋸歯状波を搬送波
と考えるのである。得られた母音への波形を第２図に示
す、１ピッチＬｏｏｍｓを１バイトのデーター４００バ
イトで構成してあり必要な時間これを繰り返し出力する
０図に見られるように、このような変調では重畳とは著
しく異なり、搬送波の振幅減少と共に変調波成分の振幅
も減少する特徴を持つが、これは母音が声帯振動の基本
周波数の高調波ならその振幅は基本波振動の変位に比例
するであろうという推測と一致する。これらの母音は笛
の音のようであり、特にｎ＝１１以上ではまだ良質とは
いえないがその改良について述べる前に搬送波に鋸歯状
波を用いる理由を詳しく説明する。

The latter problem requires the amplitude to be reduced, and with it the amplitude of the sawtooth wave, but that would reduce the pitch component. This is avoided by modulating with the vowel component plus a constant, $1 + a\sin(2\pi n f t)$ with $0 < a \le 1$. The sawtooth wave is likewise given a constant offset so that its amplitude is positive only, which avoids generating a second-harmonic component. Above, the vowel frequency was treated as the carrier, but in a case like this either wave may be regarded as the carrier; on the view that the carrier is what carries the information, it is reasonable to regard the sawtooth wave as the carrier. The frequency relationship is then the reverse of that in radio. Radio puts a very large energy into the carrier and very little into the information itself, whereas lowering the carrier frequency makes the carrier energy small and the energy of the information large, which is more efficient. Living things can be expected to adopt efficient methods, which is why the sawtooth wave is regarded as the carrier. The waveform of the resulting vowel is shown in Fig. 2. One pitch period of 10 ms is made up of 400 one-byte data values, and this is output repeatedly for the required time. As the figure shows, modulation of this kind differs markedly from superposition in that the amplitude of the modulating-wave component falls together with the amplitude of the carrier. This agrees with the conjecture that, if a vowel is a harmonic of the fundamental frequency of the vocal-cord vibration, its amplitude should be proportional to the displacement of the fundamental vibration. These vowels sound like a whistle and, particularly for n = 11 and above, are not yet of good quality. Before the improvement is described, the reason for using a sawtooth wave as the carrier is explained in detail.
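
One pitch period of such a vowel can be sketched as a table of 400 byte values, as the text describes. The details below are assumptions made for illustration: the carrier is taken as a falling ramp from 1 to 0 (the text only requires a positive sawtooth), the scaling to 0..255 is arbitrary, and the name `vowel_pitch_period` is hypothetical.

```python
import math

SAMPLES_PER_PITCH = 400  # from the text: 400 one-byte values per 10 ms pitch period

def vowel_pitch_period(n, a=1.0):
    """One pitch period of a positive sawtooth carrier times (1 + a*sin(2*pi*n*x))."""
    if not 0.0 < a <= 1.0:
        raise ValueError("a must satisfy 0 < a <= 1")
    out = []
    for k in range(SAMPLES_PER_PITCH):
        x = k / SAMPLES_PER_PITCH        # position within the period, 0 <= x < 1
        carrier = 1.0 - x                # assumed falling ramp, always positive
        vowel = 1.0 + a * math.sin(2.0 * math.pi * n * x)
        value = carrier * vowel          # lies in [0, 1 + a]
        out.append(round(255 * value / (1.0 + a)))
    return bytes(out)

pattern = vowel_pitch_period(n=11)
```

Multiplying rather than adding the two waves is what makes the vowel-frequency ripple shrink as the carrier ramp falls, which is the feature the text contrasts with superposition.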

声帯振動の基本周波数を１００Ｈｚとすると第３０高調
波程度迄の高調波が含まれていると考えられるがその総
和について検討する。これら高調波の振幅が全て等しく
位相が０の正弦波であるとすると、その総和は基本周期
で繰り返すインパルスを微分したものに近づく、第３図
（ａ）に第１０高調波迄の和を１０で割ったものを示す
、これは母音アである。この場合は最高周波数とその半
分の周波数の母音の混合である０等振幅の和が白色雑音
となるためには位相がランダムでなければならない０位
相がπ／２だとインパルス列、中間位相では中間的な波
形となる。光のように連続的な周波数を含むと位相の一
致したちの同志がインパルス列を形過し粒子性を与える
と考えら九る。

If the fundamental frequency of the vocal-cord vibration is taken to be 100 Hz, harmonics up to about the 30th can be assumed to be present, and their sum is examined here. If these harmonics are all sine waves of equal amplitude and zero phase, their sum approaches the derivative of an impulse repeating at the fundamental period. Fig. 3 (a) shows the sum up to the 10th harmonic divided by 10; this is the vowel "a". In this case the sound is a mixture of the vowels of the highest frequency and of half that frequency. For a sum of equal amplitudes to become white noise, the phases must be random. With a phase of π/2 the sum is an impulse train, and with intermediate phases it is an intermediate waveform. When continuous frequencies are present, as in light, components whose phases coincide can be thought to form an impulse train and so give rise to particle-like behavior.
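
The two phase cases described above can be compared directly. This is a minimal sketch with hypothetical names (`harmonic_sum`, `one_period`); time is measured in units of the fundamental period, and the sum is divided by the number of harmonics as in Fig. 3 (a).

```python
import math

def harmonic_sum(x, n_harmonics=10, phase=0.0):
    """Average of equal-amplitude harmonics sin(2*pi*n*x + phase), n = 1..N."""
    total = sum(math.sin(2.0 * math.pi * n * x + phase)
                for n in range(1, n_harmonics + 1))
    return total / n_harmonics

def one_period(n_harmonics=10, phase=0.0, points=400):
    """Sample one fundamental period at evenly spaced points."""
    return [harmonic_sum(k / points, n_harmonics, phase) for k in range(points)]

sine_phase = one_period(phase=0.0)            # odd-symmetric pulse around each period start
cosine_phase = one_period(phase=math.pi / 2)  # narrow pulse at each period start
```

With phase 0 the waveform is antisymmetric about the start of each period, as expected for the derivative of an impulse; with phase π/2 all harmonics peak together at the start of the period and form a pulse train.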

実験的にはこのような母音を作れるが声帯振動の全高調
波が同一振幅ということはエネルギーの点で無理である
。そこで、エネルギー一定すなわち第ｎ高調波の振幅が
基本波の１　／　ｎで位相０の正弦波の総和を求めると
鋸歯状波となる。これは良質な母音つであるが他の周波
数で変調するとその周波数の母音となり単に母音の見か
けの周波数を下げる働きをする。これが鋸歯状波を搬送
波に用いる理由である。実際には第５高調波程度迄加え
れば鋸歯状波の代用として十分である。第３図（ｂ）に
この波形を示す、声帯振動により鋸歯状波が生じ声道の
共鳴により特定の周波数成分が増強されると考えれば声
道理論だが、＃！！歯状波にはその成分は一様強度で含
まれるので重畳となり、第２図のような振幅の強弱は現
れない、変調作用が含まれなければならない、声帯自身
の高調波振動が基本振動の時々刻々の変位に比例すると
考えるのは物理的にも妥当でありこれが声帯の変調作用
である。この場合、高調波の総和は鋸歯状波が丸みを帯
びたものとなる。これらの理由でピッチ周期の鋸歯状波
または類似波を搬送波とし母音周波による共振的変調作
用で合成するのが本法の特徴の１つである。そして発声
が呼気のみにより起るので搬送波に定常値を加え正の波
形とする。また共振の位相特性により、共振周波数成分
は位相をπ／２遅らすのがよい。

Such a vowel can be made experimentally, but for all the harmonics of the vocal-cord vibration to have the same amplitude is unrealistic in terms of energy. If instead the energy is held constant, that is, the n-th harmonic has 1/n the amplitude of the fundamental, the sum of the zero-phase sine waves is a sawtooth wave. This is a good-quality vowel u, but when it is modulated by another frequency it becomes the vowel of that frequency, and the sawtooth merely serves to lower the apparent frequency of the vowel. This is the reason for using a sawtooth wave as the carrier. In practice, adding harmonics up to about the 5th is enough to stand in for a sawtooth wave; this waveform is shown in Fig. 3 (b). To suppose that the vocal-cord vibration produces a sawtooth wave and that resonance of the vocal tract strengthens particular frequency components is the vocal-tract theory. But the sawtooth wave contains those components at uniform strength, so the result is a superposition, and the rise and fall in amplitude seen in Fig. 2 does not appear; a modulating action must be involved. It is physically reasonable to suppose that the harmonic vibrations of the vocal cords themselves are proportional to the instantaneous displacement of the fundamental vibration, and this is the modulating action of the vocal cords. In that case the sum of the harmonics is a sawtooth wave with rounded corners. For these reasons, one feature of this method is that it uses a sawtooth wave of the pitch period, or a similar wave, as the carrier and synthesizes by a resonance-like modulating action at the vowel frequency. Because phonation occurs only on exhalation, a constant is added to the carrier to make the waveform positive. In addition, in view of the phase characteristic of resonance, the resonance-frequency component should be delayed in phase by π/2.
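
The statement that zero-phase sine harmonics with amplitudes 1/n sum to a sawtooth can be checked numerically. The sketch below uses hypothetical names; the closed form π(1/2 − x) is the standard Fourier-series limit for 0 < x < 1, with x measured in units of the pitch period.

```python
import math

def sawtooth_partial_sum(x, n_harmonics=5):
    """Sum of sin(2*pi*n*x)/n for n = 1..N; x is in units of the pitch period."""
    return sum(math.sin(2.0 * math.pi * n * x) / n
               for n in range(1, n_harmonics + 1))

def ideal_sawtooth(x):
    """Limit of the sum for 0 < x < 1: a ramp falling from pi/2 to -pi/2."""
    return math.pi * (0.5 - x)

five_term_error = abs(sawtooth_partial_sum(0.25, 5) - ideal_sawtooth(0.25))
```

At a quarter of the period the five-harmonic sum already lies within about 0.1 of the ideal ramp, which is consistent with the text's remark that harmonics up to about the 5th are enough as a substitute. To obtain the positive carrier the text asks for, a constant offset (for example π/2) would be added to the sum.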

次に前記の合成母音の音質改良法につき説明する。方法
は極めて簡単で、下表に示す組合せと割合で鼻音を混合
することである。結果の母音のパターンを第４図に示す
０割合は以下のように決める。αのエネルギーを１とし
は望これと等しくなるようにｆａ／ｆで原母音の振幅を
決める。へ以上の高周波母音にこの割合でα、ｏ、ｕを
混入するとエネルギーは２倍になるので両者に０．７を
掛は調整する。

Next, the method for improving the sound quality of the synthesized vowels is described. The method is extremely simple: nasal sounds are mixed in the combinations and proportions shown in the table below. The resulting vowel patterns are shown in Fig. 4. The proportions are determined as follows. Taking the energy of α as 1, the amplitude of each original vowel is set to f_α/f so that its energy is roughly equal to this. When α, o, or u is mixed at this proportion into the high-frequency vowels from ʌ upward, the energy doubles, so both components are multiplied by 0.7 to compensate.

ｕｏ　　　（１／に５ＧＮｅ　　　ｉ原母音　２　１．４　　１　．５．．４．．３　．２　
　．２混　入Δ＝、３　ａ＋＝、ｌ　　−ａ＝、７　　
ｏ＝１　　ｕ＝１．６ｕ、ｏに混入するのは第３高調波
でｑのエネルギーに換算し約ｉｌｌを逆位相で加える。

vowel            u    o    α    ʌ    a    æ    e    i
original vowel   2    1.4  1    .5   .4   .3   .2   .2
mixed in         entries as printed: Δ = .3, a+ = .1, α = .7, o = 1, u = 1.6 (column alignment not recoverable from the source)

What is mixed into u and o is the third harmonic, added in opposite phase; the amount, stated in the source as a fraction in terms of energy, is illegible.
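
The amplitude rule stated before the table can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: α is taken to be n = 7 (its position in the n table), the 0.7 factor is applied from n = 11 upward, and the function names are hypothetical.

```python
ALPHA_N = 7        # assumed: α is the third entry of the n table, i.e. n = 7
HIGH_N_START = 11  # assumed: vowels from this n upward receive a mixed-in low vowel

def original_vowel_amplitude(n):
    """Amplitude f_alpha/f = ALPHA_N/n, scaled by 0.7 for the high vowels."""
    amplitude = ALPHA_N / n
    return 0.7 * amplitude if n >= HIGH_N_START else amplitude

def mixed_in_amplitude(n_low):
    """Amplitude of a low vowel (u, o, or α) when mixed into a high vowel."""
    return 0.7 * ALPHA_N / n_low

def relative_energy(n, amplitude):
    """Energy proportional to (frequency * amplitude)^2, normalised so α has energy 1."""
    return (n * amplitude / ALPHA_N) ** 2
```

Under these assumptions the low vowels each have relative energy 1, and a high vowel plus its mixed-in component has 2 × 0.49 = 0.98, which is why a factor of 0.7 restores roughly unit energy. The computed mixed-in amplitudes round to 0.7, 1.0, and 1.6 for α, o, and u, matching the values legible in the table; the "original vowel" row is reproduced only approximately (for example 7/3 ≈ 2.3 against the printed 2).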

これはピークを潰す歪を与えて音質を改良し周波数を高
めるものである。混合割合は厳密を要するものではなく
好みによりかなり変り得るが位相は重要で。

Adding the third harmonic in opposite phase applies a distortion that flattens the peaks, which improves the sound quality and raises the frequency. The mixing proportions need not be exact and can vary considerably according to preference, but the phase is important.

１−ａｃｏｓ（２＋ｃｎｆｔ）　　＋ｂｃｏｓ（２ｇｍ
ｆｔ）、　　　ｎ＜ｍ、　　ａ、ｂ＞０による混合波で
搬送波を変調する。

The carrier wave is modulated by the mixed wave $1 - a\cos(2\pi n f t) + b\cos(2\pi m f t)$, with $n < m$ and $a, b > 0$.
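
The effect of the sign convention in this mixed wave can be seen by evaluating it at a peak of the main component. The sketch below is illustrative only: the function name is hypothetical, and the values n = 3, m = 9, a = 1, b = 0.3 are example choices for a third harmonic, not figures confirmed by the source.

```python
import math

def mixed_modulator(x, n, m, a, b):
    """1 - a*cos(2*pi*n*x) + b*cos(2*pi*m*x), with x in units of the pitch period."""
    if not (n < m and a > 0 and b > 0):
        raise ValueError("requires n < m and a, b > 0")
    return (1.0
            - a * math.cos(2.0 * math.pi * n * x)
            + b * math.cos(2.0 * math.pi * m * x))

# Peak of the main component -a*cos(...) for n = 3 occurs at x = 1/6.
peak_with_third = mixed_modulator(1 / 6, n=3, m=9, a=1.0, b=0.3)
```

At x = 1/6 the main term contributes +a while the third-harmonic term contributes −b, so the peak is 1 + a − b rather than 1 + a. This is the peak flattening that the opposite-phase third harmonic is meant to provide.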

またｎ；１７以上の母音に０を混入すれば母音ｅとなる
が、Ｕを混入す九ば母音ｉとなり、これら高周波母音の
周波数を下げる働きをもする。その結果母音ｉをＵとｅ
の中間母音と晃なすことができ、全母音を環状に並べて
表示することができる。これは青色と長波長の、赤色を
混合すると青色より短波長であるはずの紫色を得るが中
間色と見なして環状の色配合図を作るのと同じである。

Also, mixing o into a vowel with n = 17 or higher yields the vowel e, whereas mixing in u yields the vowel i; the mixing also serves to lower the frequency of these high-frequency vowels. As a result, the vowel i can be regarded as intermediate between u and e, and all the vowels can be displayed arranged in a ring. This is the same as the construction of a color circle: mixing blue with long-wavelength red gives violet, which ought to have a shorter wavelength than blue, yet violet is treated as an intermediate color so that the hues close into a ring.

第４図の各パターンを必要な時間だけ繰り返し出力すれ
ば必要な長さの極めて良質の母音を発声できる。尚、こ
れらを組み合せて単語を発声する場合、明瞭度を良くす
るには母音間の休止時間を十分長くする必要があり、こ
れが短かすぎるとお経のようになってしまう、調音結合
はこれにより生じる。早口の場合も発声時間を短かくシ
、休止時間を十分に保てば明瞭度は失なわれない。

By repeatedly outputting each pattern in Fig. 4 for the required time, a vowel of extremely good quality and of the required length can be uttered. When these are combined to utter a word, the pauses between vowels must be made sufficiently long for good intelligibility; if they are too short, the result sounds like sutra chanting, and this is what gives rise to coarticulation. Even in rapid speech, intelligibility is not lost if the voiced time is shortened and the pauses are kept long enough.
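
The repetition-and-pause scheme described above amounts to simple concatenation. The following is a minimal sketch with a hypothetical function name; the pause is represented by zero-valued samples, which is an assumption, since the text does not say how silence is encoded.

```python
def utterance(patterns, pitches_per_vowel, pause_pitches):
    """Concatenate vowels, each repeated for a number of pitch periods,
    separated by silent pitch periods."""
    out = []
    for index, pattern in enumerate(patterns):
        if index > 0:
            # Silence between vowels, measured in whole pitch periods.
            out.extend([0] * (len(pattern) * pause_pitches))
        out.extend(list(pattern) * pitches_per_vowel)
    return out

# Toy two-sample "patterns" stand in for the 400-value pitch patterns.
demo = utterance([[1, 2], [3, 4]], pitches_per_vowel=3, pause_pitches=2)
```

Shortening `pitches_per_vowel` while keeping `pause_pitches` fixed corresponds to the text's advice for rapid speech.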

人間の発声する母音は手書き文字に相当するもので、ピ
ッチ周期やフォルマント周波数の変動は人間なるが故の
ばらつきである。こＮで合成した母音はこれらを持たな
いが波形および音質は人間の音声に極めて近いものであ
る。

The vowels uttered by humans are comparable to handwritten characters: the fluctuations in pitch period and formant frequency are variations that arise because the speaker is human. The vowels synthesized here lack these variations, but their waveform and sound quality are extremely close to those of the human voice.

以上のごとく本発明は全く独創的な考えに基くものであ
り、その簡単さと実用性および合成音声の特性において
従来法に比べ著しく優れたものである。

As described above, the present invention rests on an entirely original idea and is markedly superior to conventional methods in its simplicity, its practicality, and the characteristics of the synthesized speech.

[Brief Description of the Drawings]

第１図は母音周波数の正弦波を１００　Ｈｚの余弦波で
１００％変調した波形。第２図は鋸歯状波を母音周波で変調した波形。第３図は高調波の総和の波形。第４図は本法による合成母音の波形。特許化原人　　若　林　昭　大第１図（Δ）第　２　図　　（八）第３図（ｂ）第４ｒＩＩ手続補正書昭和６０年１０月３１日２　発明の名称ハンソウハ　ヘンチ四つ　　　　　オン七イコ°つ七イ
ホウ搬送波変調による音声合成法３　補正をする者事件との関係　　　　特許出願人カ　ナカ”ワケンヒラツカシタ“イカンチロウ住所　　
神奈川県平塚市代官町２１−１７〒２５４　　　置　０
４６３−２３−５４２２ワカ　バヤシ　アキ　　才６　添付書類の目録（１）　　図　　　面　　　　　　　　　　　　　　　
１通補正の内容１　特許請求の範囲を下記に補正する。光が周波数個有の色を持つように音波も周波数個有の音
を持ち、しかも母音の原音であることに着目し、その音
質を改良することによる音声合成法で、ピッチ周期の鋸
歯状波または類似波を搬送波とし９Ｍ母音周波数の正弦
波または余弦波、あ２　発明の詳細な説明を下記に補正
する。イ）　　２頁１９行の１０字目上り３頁１行の９字目迄
の下記に示す４４字を削除する。また基本波が２００　Ｈｚならば第２．３．４高調波が
Ｕ、Ｏ，ａで必ずしも素数倍とはならない。口）　９頁１行より１７行迄を下記と入れ替える。第４図の各パターンを必要な時間だけ繰り返し出力すれ
ば必要な長さの極めて良質の母音を発声できる。尚、こ
れらを組み合せて単語を発声する場合、明瞭度を良くす
るには母音間の休止区間を十分長くする必要があり、こ
れが短かすぎるとお経のようになってしまう、調音結合
はこれにより生じる。これを防ぎ休止区間を短かくする
ためには、母音毎にスタートマークとストップマークが
なければならない、これは鋸歯状搬送波の振幅を音声区
間を１．母音間の休止区間をＯとした矩形波の一次遅れ
の包絡線を形成するように増減すればよい、スタートマ
ークの先頭のスタートピッチは休止区間と区別でき、か
つ十分小さい振幅でなければならない、尚、前記調合表
の各母音の振幅は単に割合を決めるためのものであり、
調合後の振幅をはシ一定にそろえたものを各母音の基準
振幅とする。次に子音の合成について述べる。従来、子音は雑音によ
り構成されるといわ九ているが、雑音は個人差を与える
要素ではあるけれども人間の発声器官の構造上やむなく
生ずるもので子音の本質には関係がない、これを除去し
た後の周期成分に子音の本質があるのであり、それは母
音の組合せであるということができる。光の周波数に対
する視覚による概念が色であり種々の周波数の混合によ
り各種の色を生ずるように音波の周波数も聴覚により母
音および子音という概念を与える。たゾ。音声は純粋のスペクトル音が用いられることはなく、少
くともピッチ周期波を混ぜたもので混合色に相当し、母
音と子音の間には子音が多少複雑な混合であるものが多
いという違いがあるにすぎない６色の混合には直接、光
や絵の具を混合する方法と各色の細かい点を混ぜ視覚に
より混合する方法があるが、音声にもこれに相当する２
つの方法がある。振幅の加算または振幅変調による混合
が前者に相当し、前記８母音はこれにより合成されてい
る。多くの子音は後者により合成されるが。このとき時間軸上に時分割で各周波数を配置し聴覚によ
り混合する。以下に各子音について具体的に述べる。１）　ワ行の合成前記ｕ、ｏの合成で主周波のピークを潰す位相で第３高
調波を加えたのは飽和歪を与えたことを意味する。従っ
て、ＵよりＯの方が飽和歪が少なく、αは殆どこれがな
くてよいことは、この順に口の開は方が大きくなること
を示し、更に、これは口を大きく開けると共鳴周波数が
高くなることを示す、そこで、ワの発声は口を窄めた状
態から大きく開けた状態に倒ることを考えれば、主周波
数はＵからαの状態に連続的に変化しなければならない
、実際にこれは観測されるのであり、この周波数推移は
かなり荒い周波数離散化により実現可能で、ｕｏαの３
母音を２，３ピツチづつこの順に並べるだけでかなり良
好なワを合成できる。実際には発声の長さに応じてαのピッチ数を増加し全体
として前記スタート、ストップマークを構成する。更に
２周波数の推移する前部の振幅は最終母音の最高振幅よ
り小さクシ、ピッチ数も少なくするのがよい、これは音
節においてアクセントは母音にあることに相当する。こ
れらを合成記号式で第５図（１）により表わす０発音記
号と区別するため母音を構成する基準振幅の１ピツチの
波形を英大文字で表わし、右肩に繰返しピッチ数、右下
に振幅比を付けた記号を連結類に並べ合成法を示したも
のである。このように日本語の子音は発声の初期に周波
数推移を伴い最終母音に倒るもので、いわゆる音節を構
成しているのである。ワ行の他の音は順にｕｏｉ、ｕｏｕ、ｕｏｅ。ｕｏである０合成記号式はワの最終母音を替えたもので
ある。ワとヲ以外は母音Ｏを含まなくてもよいが７行と
の区別は明瞭でなくなる。２）　ヤ行の合成Ｕと主周波数の同じものにｉがあり、Ｏと同じものにｅ
があることよりｉｅαという推移が考えられるがこれは
ヤになる。たゾ、ワとの区別を明瞭にするためには母音
田を加え１ａａａαとする。また、ユ＝ｉａａａｏｕ、ヨ＝　ｉ　ｅ　ａ！ｏである
。これらの合成記号式を第５図（２）に示す、マンセル
の色相環に例えた前記の母音の並びにおいて周波数の低
い側からαに向って推移するものがワであり、逆に高い
側からαに向うものがヤなのであって、ａ！を加えるこ
とによりこの違いがはっきりするのでワとヤの区別が明
瞭になるのである。３）　マ行の合成ｍの合成は２００　Ｈｚに６００　Ｈｚを逆相（ｃｏｓ
波の場合）に加え１００％程度の３次の飽和歪を与えた
ものを鋸歯状波で変調する。　２００　Ｈｚ自体の音は
Ｕであり、′同じＵの音を持つ３００　Ｈｚでも可能で
１ｍは口を閉じることにより３次歪を大きくしたＵと考
えられるので母音である１日本語のマ行はワ行のＵをｍ
で置換えたものである。たゾ、ピッチ敷は５〜６ピツチ
必栗であり、０は少なくする０合成記号式を第５図（３
）に示す。４）　す行の合成ｎは２００　Ｈｚに４００　Ｈｚを同程度加え２次歪を
起こしたものを鋸歯状波で変調すればよい、母音ａ。田はαに舌の作用の加ったものと考えられ、αの第２高
調波の近辺にあることより推定されるように舌の作用は
２次歪を与える。しかし、実際の発声においてもそうで
あろが、これだけではｍとの区別は明瞭でない、ｎの前
後に母音等が付くとき舌が上顎に着く音、離れる音がｎ
の先頭と末尾のピッチ：こ重畳することによりｍとの区
別は明瞭になる。これは３　ｋＨｚあたりの周波数成分
の強い混合波で、ｎのピッチ波形の前半に重畳して第６
図のような波形を示す、この混合波の詳細は後に述べる
が、ｎの合成記号式はこれらを含めて第５図（４）で表
わす。５）　ラ行の合成ｒは前記の舌の離れる音を示す周波数成分が第７図のよ
うなきれいな包絡線を形成したもので１ピツチで構成さ
れる。従って、う行はこれに直接ア行の各母音を接続す
ればよい、うの合成記号式を第５図（５）に示す。６）へ行の合成ワの合成音ｕｏａからＯを除き休止ピッチを１ピッチ置
く程度の括れをつけるとハになる。共鳴周波数を連続的
にしか変化できない人間の発声器官でこのような周波数
の飛躍を可能にするのが破裂である０口を窄め内圧を高
かくして溜めた呼気を一気に吐くと、−塊の呼気が出た
後に一時呼気の停止が起り、その間に口の形は０を通過
し、やがて呼気が正常に戻ったときはアの状態になって
いるというのが破裂の機構である。その結果、子音と母
音の２つの部分よりなる音節の形となる。ハの合成記号式を第５図（６）に示す、このハ行は昔の
ファ、フイ、・・・と表記した口を窄めて発声するもの
であり、前の音との間の休止区間が短いとハ→ワ、ヒ→
イ、フ４つ等の推移が起る。現代流のハ行は子音部が異
なるがそれについては後に述べる。ハ行の子音部を替え
ることにより以下の各種の子音を生ずる。７）　パ行の合成ハ行の破裂が強くなったものがバ行であると考えられる
が、破裂が強いとは時間に対する周波数の飛躍量が大き
いことである。従って、前記ハ行の接頭部Ｕのピッチ数
を減らしスタートピッチよりすぐに休止ピッチに入るこ
とによりパ行を合成できる。バの合成記号式を第５図（
７）に示す、スタートピッチとしては口を閉じた状態を
示すｍを用いるのがよく、実際の音声では前の音との間
に休止区間はなく上記ｍが詰っていて発声の速度に応じ
てそのピッチ数が変化する。８）　　ｋ、　ｈ、　ｔ、ａｈ、　ｔｓ、　ｓの合成こ
れら子音の特徴は白色雑音により構成されることである
。白色雑音は雑音モドキであって雑音ではない、スペク
トル的には一様振幅のｓｉｎ波の混合とみることができ
るが、従来この方法で合成した前例はなく乱数により模
擬されるのでこの名がある。出原人はこれを下記の方法
でスペクトルに基いて合成することに世界で初めて成功
した。ｙ＝ΣＡ　　５ｉｎ（２ｎπｆｔ　＋　２ｙｃ　Ｒ）ｎ
　　　　　　　　　　　　　　　　　　　　　　　　　
ｎＲはＯと１の間の一様乱数列のｎ番目の値である。す
なわち、ｙは位相をランダムに規定した全高調波の総和
である。Ａ　が一定のときいわゆる白色雑音となる。こ
の方法は任意のスペクトル分布の雑音モドキを合成でき
る画期的方法である。たゾし、二九は基本波の周期を持つ周期関数であるので
それが好ましくない場合は全長を基本周期とする必要が
ある。本項目の各子音は基本波１００　Ｈｚに対してｎの最高
値を概略下表の値としたものである。ｋ　　ｈ　　ｒ／ｎ　　　ｔ　　ｃｈ　　ｔｓ　　ｓｎ
＝　　１７　２３　３１　４７　６１　７１　８３ｒお
よびｎの舌の離着の音もこの部類に属す、振幅Ａｎは一
定よりも３　ｄｂ１０ｃｔａｖ＠程度の高域強調とする
のがよい０例としてｋおよびＳの合成波形の１ピツチ分
を第８図に示す、実際の音声は呼気通路を狭め高域強調
または発声するもので、舌が上顎に着く位置が口の奥よ
り先になる程、また、呼気通路が狭い種間波数は高くな
り、内圧が高くなることによる整圧作用でピッチ周期は
明確でなくなりピッチ数も多くなる。　ｈ、　ｃｈ、、
ｔｓ、　ｓはこれに相当し、ピッチ周期をぼかすために
子音部会長を基本周期にとって合成するのがよい、ピッ
チ周期を基本にすると、ピッチ周期が明確に現れるため
グラインダで金属を削るような音を発生する。これら子音は緩やかに増加し、最高値に達した後２〜３
ピツチで減衰する波形で変調され、同時に最高周波数も
低い方へ推移する。従って、ｔは上記の減衰部分のみか
らなり２ピツチ程で構成されるが、　ｃｈはｔを含み、
　ｔｓはｃｈを含み、Ｓはｔｓを含む、には２〜３ピツ
チで構成され、ピッチ周期は比較的明確である。これら
子音にア行の各母音を接続すれば力、す、り、ハ行を合
成できる。９）　バ行の合成バ行の合成には搬送波としてピッチ周期のｓｉｎ波とそ
の第２高調波を同相同振幅で加えた第９図（ａ）　　を
用いるのがよい、この絶対値により７００　Ｈｚのｓｉ
ｎ波を変調した波形２ピツチを同図（ｂ）に示す、これ
はアのピッチ波形をずらして重ねたような波形を成し、
実際の音声バの波形と極めてよく似ており、これのみで
前記パの合成法によりバを合成できる。たゾ、同様にし
て作ったブの１ピツチを先頭につけるとバの音が強くな
る。こＮで用いた搬送波に、２５程度の定常値を加えたもの
の絶対値により７００　Ｈｚを変調すればバとなる。前
記パ行の合成はこれを用いる方が実際に近い波形となる
。また、これはピッチ強度の増加をゆるやかにすれば母
音アとなる。これは前記の母音と異なり、零レベルの上
下に振動する実際の音声波形により近いものである。こ
の搬送波も高周波成分を除去した鋸歯状波とみることが
できる。１０）　　ｇｔ　ｄｔ　ｚｙｎｇの合成ｇ＝ｋＢｕ、ｄ
＝ｔＢｕ、ｚ＝ｓＢｕ、ｎｇ＝ｎｋＢｕである。たゾし
、２の合成におけるＳはピッチ数を少くしなければなら
ず、　ｔｓを用いると考えてもよい。Ｂｕは既に述べた
ように子音部と母音部に分けることができない、このピ
ッチ数が多くなると母音Ｕが付いていると聴きとられる
１日本語の濁音はこれら英語濁音または清音にバ行の各
音を続ければよい。１１）よう音の合成よう音の合成はローマ字的にイ段の各子音部にヤ行を継
げばよい、たゾし、ｉｅａ：は普通には各１ピツチで構
成する。１２）ンの合成ンはｎの発声区間の中はどで急にピッチ振幅を、２５以
下に落し、以後この状態を維持する。従って次に母音が
くるとその母音はす行に変化するという現象が起る。以上９日本語母子音の合成例において述べた各条件は標
準的なもので、その許容範囲は極めて広い、これは誰で
も発声ができ人により様々な音色を持つことを可能とし
ている反面、従来行われている数理的解析によるこれら
標準的条件の同定を困難にしている。従って、従来の音
声合成はモデルとなった一人間の音声を再現するにすぎ
ないのに対し１本合成法は個々の人間とは関係のない音
声を合成できるもので、前者を手書き文字の再現に例え
るなら本法は活字文字の生成に相当するものである。も
ちろん２周波数の偏移や雑音の混入により様々に音色を
変えることもできるので特定の人間の音声を模擬するこ
ともできる。この目的には音声を本法に基いてローマ字
的に母子前に分け、更に可能な限り細分化したものを組
合せ再現する高圧縮素片合成も可能である。更に２本合
成法は実施例で明らかなように音声認識法としても有効
である。尚２本合成法は英語音声をその綴りに基いて合
成することができる。以上のごとく本発明は音声が周波数個有の音の音圧的お
よび時分割的組合せであるという全く独創的な考えに基
くものであり、その簡単さと実用性および合成音声の特
性において従来法に比べ著しく優れたものである。３　図面の簡単な説明に下記第５図より第９回連の説明
を追加する。第５図は各子音の合成法を示す記号式。第６図はｎのピッチ波形に舌が上顎に離着する音の重畳
した波形。第７図はｒの波形。第８図はｋとＳの一部の波形。第９図（ａ）はパ行の搬送波の波形。（ｂ）　は７００　Ｈｚを変調したバの波形。

Fig. 1 shows the waveform of a vowel-frequency sine wave 100% modulated by a 100 Hz cosine wave. Fig. 2 shows the waveform of a sawtooth wave modulated by a vowel frequency. Fig. 3 shows the waveforms of sums of harmonics. Fig. 4 shows the waveforms of vowels synthesized by this method.

Patent applicant: Akio Wakabayashi

[Figures 1 to 4 appear here in the original.]

Procedural Amendment, October 31, 1985

2. Title of the invention: Speech synthesis method by carrier modulation
3. Person making the amendment: the patent applicant, Akio Wakabayashi, 21-17 Daikan-cho, Hiratsuka City, Kanagawa Prefecture 254, tel. 0463-23-5422
6. List of attached documents: (1) Drawings, 1 copy

Contents of the amendment

1. The claims are amended as follows.

A speech synthesis method which, noting that just as light has a color specific to its frequency a sound wave has a sound specific to its frequency, and that this sound is the original sound of a vowel, improves the quality of that sound, wherein a sawtooth wave of the pitch period, or a similar wave, is used as the carrier and a sine or cosine wave of the vowel frequency [...]

2. The detailed description of the invention is amended as follows.

(a) The following 44 characters, from the 10th character of line 19 on page 2 to the 9th character of line 1 on page 3, are deleted: "Moreover, if the fundamental is 200 Hz, the 2nd, 3rd, and 4th harmonics are u, o, and α, so the multiples are not necessarily prime."

(b) Lines 1 to 17 on page 9 are replaced with the following.

By repeatedly outputting each pattern in Fig. 4 for the required time, a vowel of extremely good quality and of the required length can be uttered. When these are combined to utter a word, the pause intervals between vowels must be made sufficiently long for good intelligibility; if they are too short, the result sounds like sutra chanting, and this is what gives rise to coarticulation. To prevent this while shortening the pause intervals, each vowel must have a start mark and a stop mark. These are obtained by raising and lowering the amplitude of the sawtooth carrier so that it follows the first-order-lag envelope of a rectangular wave that is 1 in the speech intervals and 0 in the pause intervals between vowels. The start pitch at the head of the start mark must be distinguishable from the pause interval and yet have a sufficiently small amplitude. Note that the amplitudes listed for each vowel in the mixing table above serve only to fix the proportions.
The amplitudes after mixing are made roughly equal, and the result is taken as the reference amplitude of each vowel.

Next, the synthesis of consonants is described. Consonants have conventionally been said to consist of noise. Noise is indeed a factor that produces differences between individual speakers, but it arises unavoidably from the structure of the human vocal organs and has nothing to do with the essence of a consonant. That essence lies in the periodic component that remains once the noise is removed, and this component can be described as a combination of vowels. Just as color is the visual concept attached to the frequency of light, and mixing various frequencies produces various colors, the frequency of a sound wave gives rise through hearing to the concepts of vowel and consonant. Speech, however, never uses a pure spectral sound: it always has at least a pitch-period wave mixed in, and so corresponds to a mixed color. The only difference between vowels and consonants is that many consonants are somewhat more complex mixtures. Colors can be mixed either directly, by mixing light or paint, or by placing fine dots of each color side by side so that the eye mixes them, and speech has two corresponding methods. Mixing by adding amplitudes or by amplitude modulation corresponds to the former, and the eight vowels above are synthesized in this way. Many consonants are synthesized by the latter: the individual frequencies are arranged along the time axis by time division and mixed by the ear. Each consonant is described concretely below.

1) Synthesis of the wa row

Adding the third harmonic to u and o in a phase that flattens the peaks of the main frequency, as described above, amounts to applying saturation distortion. That o needs less saturation distortion than u, and α almost none, indicates that the mouth opens wider in this order, which in turn indicates that the resonance frequency rises as the mouth opens wider. Since uttering wa takes the mouth from a puckered state to a wide-open state, the main frequency must change continuously from that of u to that of α, and this is in fact observed. The frequency transition can be realized with a fairly coarse frequency discretization: simply lining up the three vowels u, o, α in this order, two or three pitch periods each, synthesizes a fairly good wa. In practice the number of pitch periods of α is increased according to the length of the utterance, and the whole is shaped by the start and stop marks described above. In addition, the leading part, where the frequency is in transition, should have a smaller amplitude than the peak amplitude of the final vowel and fewer pitch periods; this corresponds to the accent of a syllable falling on its vowel. These are expressed as a synthesis formula in Fig. 5 (1). To distinguish them from phonetic symbols, the one-pitch waveform of reference amplitude that makes up a vowel is written as a capital letter, with the number of repeated pitch periods as a superscript and the amplitude ratio as a subscript, and the symbols are lined up in the order in which they are concatenated. Thus a Japanese consonant begins with a frequency transition and arrives at the final vowel, forming what is called a syllable. The other sounds of the wa row are, in order, u o i, u o u, u o e, and u o; their synthesis formulas are that of wa with the final vowel changed. Except in wa and wo, the vowel o may be omitted, but the distinction from the a row then becomes unclear.

2) Synthesis of the ya row

Since i has the same main frequency as u, and e the same as o, the transition i e α suggests itself, and it does yield ya. To make the distinction from wa clear, however, the vowel æ is added, giving i e æ α. Likewise yu = i e æ o u and yo = i e æ o. These synthesis formulas are shown in Fig. 5 (2). In the ring of vowels described above, likened to Munsell's hue circle, wa is what moves toward α from the low-frequency side and ya is what moves toward α from the high-frequency side. Adding æ makes this difference plain, so that wa and ya are clearly distinguished.

3) Synthesis of the ma row

m is synthesized by adding 600 Hz to 200 Hz in opposite phase (for cosine waves), which gives third-order saturation distortion of about 100%, and modulating the result with a sawtooth wave. The sound of 200 Hz by itself is u, and 300 Hz, which has the same u sound, can also be used. Since m can be regarded as a u whose third-order distortion has been increased by closing the mouth, it is a vowel. The Japanese ma row is the wa row with u replaced by m. However, 5 to 6 pitch periods are needed, and o should be reduced. The synthesis formula is shown in Fig. 5 (3).

4) Synthesis of the na row

n is obtained by adding 400 Hz to 200 Hz at about the same level, which produces second-order distortion, and modulating the result with a sawtooth wave. The vowels a and æ can be regarded as α with the action of the tongue added, and, as their position near the second harmonic of α suggests, the action of the tongue produces second-order distortion. As in actual speech, however, this alone does not distinguish n clearly from m. When a vowel or other sound precedes or follows n, the sounds of the tongue touching and leaving the upper jaw are superimposed on the first and last pitch periods of n, and this makes the distinction from m clear. This sound is a mixed wave with strong frequency components around 3 kHz; superimposed on the first half of the pitch waveform of n, it gives a waveform like that in Fig. 6. The mixed wave is described in detail later; the synthesis formula for n, including these sounds, is shown in Fig. 5 (4).

5) Synthesis of the ra row

r consists of a single pitch period in which the frequency components of the tongue-release sound described above form a clean envelope like that in Fig. 7. The ra row is therefore obtained by connecting each vowel of the a row directly to it. The synthesis formula for ra is shown in Fig. 5 (5).

6) Synthesis of the ha row

Removing o from the synthesized wa, u o α, and inserting a constriction of about one silent pitch period yields ha. For the human vocal organs, which can change their resonance frequency only continuously, it is plosion that makes such a jump in frequency possible. When the mouth is puckered, the internal pressure raised, and the stored breath released all at once, a single puff of breath is followed by a momentary halt in the airflow; during this halt the mouth shape passes through o, and by the time the airflow returns to normal the mouth is in the state for a. This is the mechanism of plosion. The result has the form of a syllable made up of two parts, a consonant and a vowel. The synthesis formula for ha is shown in Fig. 5 (6). This ha row is the one formerly written fa, fi, and so on, uttered with the mouth puckered; if the pause before it is short, shifts such as ha → wa, hi → i, and fu → u occur. The modern ha row has a different consonant part, which is described later. Changing the consonant part of the ha row produces the various consonants below.

7) Synthesis of the pa row

The pa row can be regarded as the ha row with a stronger plosion, and a stronger plosion means a larger jump in frequency per unit time. The pa row can therefore be synthesized by reducing the number of pitch periods of the leading u of the ha row, so that the silent pitch follows immediately after the start pitch. The synthesis formula for pa is shown in Fig. 5 (7). It is best to use m, which represents the closed mouth, as the start pitch. In actual speech there is no pause after the preceding sound; this m fills the gap, and its number of pitch periods varies with the speed of speech.

8) Synthesis of k, h, t, ch, ts, and s

These consonants are characterized by being composed of white noise. White noise is a noise-like signal rather than true noise. Spectrally it can be regarded as a mixture of sine waves of uniform amplitude, but because it has never before been synthesized that way and has instead been simulated with random numbers, it bears this name. The applicant is the first in the world to succeed in synthesizing it from its spectrum, by the following method:

$y = \sum_n A_n \sin(2 n \pi f t + 2 \pi R_n)$
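
The random-phase harmonic sum in this formula can be sketched as follows, with each R_n drawn as a uniform random number between 0 and 1. The function name, the seeded generator, and the 400-sample period are assumptions made for illustration; by default every A_n is 1, which is the flat case the text calls white noise.

```python
import math
import random

def noise_like_period(n_max, samples=400, seed=0, amplitudes=None):
    """One fundamental period of y = sum_n A_n * sin(2*pi*n*x + 2*pi*R_n), n = 1..n_max."""
    rng = random.Random(seed)                      # seeded so the result is repeatable
    phases = [2.0 * math.pi * rng.random() for _ in range(n_max)]
    if amplitudes is None:
        amplitudes = [1.0] * n_max                 # constant A_n: the flat-spectrum case
    if len(amplitudes) != n_max:
        raise ValueError("need one amplitude per harmonic")
    return [
        sum(a * math.sin(2.0 * math.pi * n * k / samples + p)
            for n, (a, p) in enumerate(zip(amplitudes, phases), start=1))
        for k in range(samples)
    ]

k_like = noise_like_period(n_max=17)
```

Because every term is a harmonic of the fundamental, the output repeats exactly once per fundamental period. Passing a different `amplitudes` list shapes the spectrum, and generating a single long period instead of repeating a short one avoids an audible pitch.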

nR is the nth value of a uniform random number sequence between O and 1. That is, y is the sum of all harmonics whose phases are randomly defined. When A is constant, it becomes so-called white noise. This method is an innovative method that can synthesize noise variations with arbitrary spectral distributions. However, since 29 is a periodic function with the period of the fundamental wave, if this is not desirable, it is necessary to set the total length as the fundamental period. For each consonant in this item, the maximum value of n for a fundamental wave of 100 Hz is approximately the value shown in the table below. k h r/n t ch ts sn
= 17 23 31 47 61 71 83 The sound of tongue release and release in r and n also belongs to this category.It is better to emphasize the high frequency range of about 3 db10ctav@ rather than keeping the amplitude An constant.For example, the synthesis of k and S One pitch of the waveform is shown in Figure 8.Actual speech is produced by narrowing the exhalation passageway and emphasizing the high frequencies. The narrow interspecies wave number becomes high, and due to the pressure regulating effect due to the high internal pressure, the pitch period becomes less clear and the number of pitches increases. h, ch,,
ts and s correspond to this, and in order to blur the pitch period it is best to synthesize them taking the length of the consonant section as the fundamental period. If the pitch period is used as the fundamental period instead, the pitch period appears clearly and the result sounds like metal being scraped with a grinder. These consonants are modulated with a waveform that increases gradually and, after reaching its highest value, attenuates over 2 to 3 pitches; at the same time the highest frequency also shifts toward the lower side. t consists only of the attenuation part described above and is about 2 pitches long, whereas ch includes t, ts includes ch, and s includes ts; it is composed of 2 to 3 pitches, and the pitch period is relatively clear. By connecting each vowel of the A line to these consonants, the ka, sa, ta, and ha lines can be synthesized.

9) Synthesis of the B line
For the synthesis of the B line it is best to use, as the carrier wave, the pitch-period sine wave together with its second harmonic of the same phase and amplitude, as shown in Fig. 9 (a). The 2-pitch waveform obtained by modulating a sine wave is shown in Fig. 9 (b). It closely resembles the waveform of ba in actual speech, and with this alone ba can be synthesized by the synthesis method described above. In addition, if one pitch of bu made in the same way is added at the beginning, the ba sound becomes stronger. If 700 Hz is modulated by the absolute value of the carrier wave used in N plus a steady value of about 25, B is obtained. If this is used to synthesize the PA lines mentioned above, a waveform closer to the actual one is obtained. Also, if the increase in pitch intensity is made more gradual, this becomes the vowel A. Unlike the vowels described earlier, it oscillates above and below the zero level and so more closely resembles an actual speech waveform. This carrier wave can also be regarded as a sawtooth wave with its high-frequency components removed.

10) Synthesis of g, d, z, ng
g = kBu, d
= tBu, z = sBu, ng = nkBu. However, in the synthesis of z the number of pitches of s must be reduced, and ts may be regarded as being used instead. As mentioned above, Bu cannot be divided into a consonant part and a vowel part, and when its number of pitches increases it is heard as containing the vowel U. Japanese voiced sounds are obtained simply by following these English voiced sounds, or the clear sounds, with each sound of the B line.

11) Synthesis of yōon (contracted sounds)
To synthesize yōon, simply follow the consonant part of each I-stage sound with the Y line, as in the romanized spelling; i, e, and a are usually composed of one pitch each.

12) During the utterance section of n, the pitch amplitude drops suddenly to below 25 and remains in that state. Therefore, when a vowel follows, a phenomenon occurs in which the vowel changes to a sub line.

The conditions given in the above synthesis examples for Japanese vowels and consonants are standard values, and the permissible range is extremely wide. This is what makes it difficult to identify these standard conditions through mathematical analysis. Thus, whereas conventional speech synthesis merely reproduces the voice of one person taken as a model, this synthesis method can synthesize speech that is independent of any individual. If the former is likened to reproducing handwritten characters, this method corresponds to generating printed characters. Of course, the tone can be changed in various ways by shifting two frequencies or adding noise, so it is also possible to simulate a specific person's voice. For this purpose, highly compressed segment synthesis is also possible: based on this method, speech is divided into vowel and consonant parts as in the romanized spelling, subdivided as finely as possible, and the parts are then combined and reproduced. Furthermore, as is clear from the examples, this synthesis method is also effective as a speech recognition method.
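The random-phase harmonic sum of item 8) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the applicant's implementation: the function names, the 20 kHz sample rate, the fixed seed, and the way the roughly 3 dB/octave emphasis is applied to A_n are all hypothetical choices. Only the formula, the 100 Hz fundamental, and the table of maximum n come from the text.

```python
import math
import random

# Approximate highest harmonic n per consonant for a 100 Hz fundamental
# (values from the table in item 8).
MAX_HARMONIC = {"k": 17, "h": 23, "r/n": 31, "t": 47, "ch": 61, "ts": 71, "s": 83}


def harmonic_amplitudes(n_max, tilt_db_per_octave=0.0):
    """Return A_1..A_n_max. A tilt of 0 gives the constant-A (white) case.

    Applying the emphasis as a gain of tilt * log2(n) dB is an assumption.
    """
    return [10.0 ** (tilt_db_per_octave * math.log2(n) / 20.0)
            for n in range(1, n_max + 1)]


def random_phase_noise(f0, n_max, num_samples, sample_rate=20000,
                       tilt_db_per_octave=0.0, seed=0):
    """Sample y(t) = sum_n A_n * sin(2*pi*n*f0*t + 2*pi*R_n)."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    # R_n is uniform in [0, 1), so each phase 2*pi*R_n is uniform in [0, 2*pi).
    phases = [2.0 * math.pi * rng.random() for _ in range(n_max)]
    amps = harmonic_amplitudes(n_max, tilt_db_per_octave)
    samples = []
    for i in range(num_samples):
        t = i / sample_rate
        samples.append(sum(
            a * math.sin(2.0 * math.pi * n * f0 * t + p)
            for n, (a, p) in enumerate(zip(amps, phases), start=1)))
    return samples


# Two fundamental periods (2 x 200 samples) of an s-like noise
# with high-frequency emphasis.
noise_s = random_phase_noise(100.0, MAX_HARMONIC["s"], 400,
                             tilt_db_per_octave=3.0)
```

Because every component is a harmonic of f0, the output repeats every 1/f0 seconds. This is the periodicity the text warns about, and it is why the text recommends taking the whole consonant section as the fundamental period when a clear pitch is not wanted.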
Note that this synthesis method can also synthesize English speech from its spelling.

As described above, the present invention is based on the completely original idea that speech is a combination, in sound pressure and in time division, of sounds with individual frequencies, and it is significantly superior to conventional methods in simplicity and practicality as well as in the characteristics of the synthesized speech.

3. The following explanations of Figs. 5 through 9 are added to the Brief Description of the Drawings.

Fig. 5 is a symbolic formula showing how each consonant is synthesized.
Fig. 6 shows a waveform in which the sound of the tongue touching and leaving the upper jaw is superimposed on the n pitch waveform.
Fig. 7 shows the waveform of r.
Fig. 8 shows part of the waveforms of k and s.
Fig. 9 (a) shows the waveform of the carrier wave for the PA line, and (b) shows the waveform modulated at 700 Hz.
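The carrier of Fig. 9 (a) and the 700 Hz modulated waveform of Fig. 9 (b) can be approximated as follows. This is a minimal sketch, assuming that "modulation" means multiplying the 700 Hz sine wave by the carrier; the 100 Hz pitch, the 8 kHz sample rate, and the function names are hypothetical and do not come from the specification.

```python
import math


def b_line_carrier(t, pitch_hz):
    """Pitch-period sine plus its second harmonic, equal phase and amplitude."""
    return (math.sin(2.0 * math.pi * pitch_hz * t)
            + math.sin(2.0 * math.pi * 2.0 * pitch_hz * t))


def modulated_tone(pitch_hz=100.0, tone_hz=700.0, num_pitches=2,
                   sample_rate=8000):
    """Multiply a tone_hz sine by the carrier over num_pitches pitch periods."""
    num_samples = int(round(num_pitches * sample_rate / pitch_hz))
    out = []
    for i in range(num_samples):
        t = i / sample_rate
        out.append(b_line_carrier(t, pitch_hz)
                   * math.sin(2.0 * math.pi * tone_hz * t))
    return out


# Two pitch periods, as in Fig. 9 (b): 2 * 8000 / 100 = 160 samples.
waveform = modulated_tone()
```

The carrier takes both signs within one pitch period, so the product oscillates above and below the zero level.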

Claims

[Claims] A speech synthesis method based on the observation that, just as light has a color specific to its frequency, sound waves also have a sound specific to their frequency, and that this is the original sound of a vowel; in this method, to improve sound quality, a sawtooth wave or similar wave having the pitch period is used as the carrier wave and is modulated with a sine wave or cosine wave of the vowel's original-sound frequency, or with a mixture or modulated wave of two or three such waves.
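As an informal illustration of the claimed combination, the sketch below multiplies a pitch-period sawtooth by a sine wave at a vowel frequency. Everything specific in it is an assumption made for illustration: the falling-ramp shape of the sawtooth, the 100 Hz pitch, the 700 Hz vowel frequency, the 8 kHz sample rate, and the reading of "modulation" as simple multiplication.

```python
import math


def sawtooth(t, pitch_hz):
    """Falling ramp from 1 toward 0 in each pitch period (assumed shape)."""
    return 1.0 - (t * pitch_hz) % 1.0


def synthesize_vowel(vowel_hz, pitch_hz=100.0, num_pitches=4,
                     sample_rate=8000):
    """Pitch-period sawtooth carrier multiplied by a sine at vowel_hz."""
    num_samples = int(round(num_pitches * sample_rate / pitch_hz))
    return [sawtooth(i / sample_rate, pitch_hz)
            * math.sin(2.0 * math.pi * vowel_hz * i / sample_rate)
            for i in range(num_samples)]


# Four pitch periods with an illustrative 700 Hz vowel frequency.
vowel = synthesize_vowel(700.0)
```

A mixture of two or three frequencies, as the claim also allows, could be sketched by summing several such sine terms before multiplying by the sawtooth.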