JP3187241B2

JP3187241B2 - Speech speed converter

Info

Publication number: JP3187241B2
Application number: JP06725094A
Authority: JP
Inventors: 篤今井; 徹都木; 章中村; 信正清山; 栄一宮坂
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 1994-04-05
Filing date: 1994-04-05
Publication date: 2001-07-11
Anticipated expiration: 2016-07-11
Also published as: JPH07281690A

Abstract

PURPOSE: To eliminate the perceived variation in the speech speed conversion effect caused by differences in the length of individual voiced sections. CONSTITUTION: A voiced section shorter than a certain length is expanded by a higher magnification, chosen according to the section length, so that the perceived speech speed conversion effect matches the desired magnification. For example, a short voiced sound with a duration of less than 150 ms is given an expansion magnification corresponding to its length along a magnification function g(w), as shown in the figure. When the method is applied to a reference magnification that changes the speech speed from "slow" to "fast" within a section uttered in one breath, a voiced section that is shorter than 150 ms and appears within 450 ms of the start of that section is given an expansion magnification corresponding to its length along the function g(w) shown in the figure, regardless of its appearance time. For a voiced sound exceeding 150 ms, or when the elapsed time exceeds 450 ms, the conventional expansion magnification f(t) is applied.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to a speech speed conversion device. In particular, it relates to a real-time speech speed conversion device that improves ease of listening when speech speed conversion is used as a listening aid in hearing-assistance devices for hearing-impaired or elderly people, general language learning devices, radios, tape recorders, telephones, and the like, and that effectively absorbs the lag between video and audio that arises when the audio output of a television, video tape recorder, video disc player, or the like is speech-speed converted.

[0002]

[Prior Art] For converting speech speed, there exist a method that expands voiced sections at a uniform magnification (Akira Nakamura et al., "High-quality real-time speech speed conversion system," Acoustical Society of Japan 1992 Spring Meeting, 2-6-1, pp. 329-330 (1992-3)) and a speech speed conversion method that varies the magnification as a function of the elapsed time from the start of the utterance (Ryu Ikezawa et al., "A method for absorbing the time expansion accompanying speech speed conversion," Acoustical Society of Japan 1992 Spring Meeting, 2-6-2, pp. 331-332 (1992-3)). In both, however, the expansion magnification is determined uniquely as a function of elapsed time, independently of the length of each voiced section. The converted speech therefore cannot be said to give the same degree of perceived "slowness" in every voiced section, and the perceived effect may vary. Until now there has been no speech speed conversion technique that solves this and converts speech in a way that sounds stable and natural.

[0003]

[Problem to Be Solved by the Invention] Consider speech speed conversion that aims to make the input speech "slow" by separating silent, unvoiced, and voiced sections and expanding the voiced sections while leaving the lengths of the silent and unvoiced sections unchanged. It is known that, if the multiple voiced sections in the speech are expanded at a uniform magnification, the degree of perceived "slowness" differs according to the length of each voiced section (Atsushi Imai et al., "Real-time absorption method of time expansion accompanying speech speed conversion," Acoustical Society of Japan 1993 Autumn Meeting, 1-9-10, pp. 361-362 (1993-10)).

[0004] Speech can contain, in close succession, relatively long voiced sections exceeding 300 ms, such as chains of different vowels or long vowels, and relatively short voiced sections under 100 ms, which are common for vowels sandwiched between unvoiced or silent sections. When speech containing both kinds is expanded at one constant magnification and listened to, a long voiced section gains a large absolute amount of expansion time per section and sounds markedly slower, whereas a short voiced section gains only a small absolute amount and in some cases hardly sounds slower at all.

[0005] For example, when sections of 350 ms and 80 ms are uniformly expanded 1.5 times by the conventional method, they become 525 ms and 120 ms. The absolute increase for the former is 175 ms, whereas the latter is expanded by only 40 ms, and this shows up as a difference in the perceived effect. Consequently, when voiced sections of such varied lengths occur at multiple places in a stretch of input speech, the result is unstable speech whose speed seems unsettled, and in some cases this is quite noticeable.
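The arithmetic in this example can be reproduced with a few lines of Python. This is only an illustrative sketch; the function name `uniform_expansion` is hypothetical and does not come from the patent.

```python
def uniform_expansion(durations_ms, ratio):
    """Return (original, expanded, absolute increase) in ms for each voiced section."""
    return [(d, d * ratio, d * ratio - d) for d in durations_ms]


# The two section lengths and the 1.5x magnification quoted in the text.
rows = uniform_expansion([350, 80], 1.5)
for original, expanded, gain in rows:
    print(f"{original} ms -> {expanded:.0f} ms (+{gain:.0f} ms)")
```

The same ratio yields very different absolute gains, which is the imbalance the invention sets out to correct.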

[0006] In addition, the previously proposed method for absorbing the time expansion accompanying speech speed conversion (Ryu Ikezawa et al., "A method for absorbing the time expansion accompanying speech speed conversion," Acoustical Society of Japan 1992 Spring Meeting, 2-6-2, pp. 331-332 (1992-3)) sets a high expansion magnification for voiced sections at the start of a section uttered in one breath (a phrase) and then gradually increases the speech speed. This achieves both an overall impression of "slowness" in the converted speech and absorption of the overall time expansion. However, for speech in which short voiced sections appear one after another near the start of the phrase, the impression of "slowness" is not obtained even with a relatively high magnification, for the reason given above. As a result only the fast latter half stands out, and the expected effect may not be obtained.

[0007] The problem described above is illustrated with a more concrete case.

[0008] (1) The predicted length of a section uttered in one breath (a phrase) is fixed at 2000 ms, and the expansion magnification r is decreased monotonically from r_s (r_s > 1) to r_e (r_e < 1) along the curve shown in FIG. 1.

[0009] (2) Beyond 2000 ms, the magnification is corrected as appropriate according to changes in the pitch frequency.

[0010] This method was introduced into a real-time speech speed conversion system and used to convert a large number of news utterances. For some phrases the expected effect, in particular the impression of "slowness" near the start of the phrase, was not obtained. FIG. 2 shows the distribution along the time axis of the voiced section lengths within the phrase for one phrase where the method was particularly effective ((a) in the figure) and for two phrases where no particular effect was perceived ((b) and (c) in the figure).

[0011] The tendencies represented by these three examples are as follows.

[0012] (1) When there are several relatively long voiced sections exceeding 150 ms within the first 450 ms to 500 ms of the sentence, the effect is large even with an expansion magnification of r = 1.4.

[0013] (2) When relatively short voiced sections of 150 ms or less are present at the start of the phrase, the effect is small even with r = 2.0.

[0014] Examination of other phrases showed the same tendency.

[0015] The present invention was made in view of the problems described above. Its object is to provide a speech speed conversion device that, when performing speech speed conversion by expanding voiced sections, eliminates the perceived variation in the conversion effect caused by differences in the voiced section lengths of the input speech, and yields a natural and stable speech speed conversion effect for any input speech.

[0016]

[Means for Solving the Problem] To achieve the above object, the present invention is a speech speed conversion device that, when performing a conversion that separates the silent, unvoiced, and voiced sections of input speech and expands the voiced sections so as to slow the rate of utterance (speech speed) while preserving the voice pitch, successively detects the duration of each voiced section and multiplies that duration by a reference magnification that is either uniform or varies smoothly with elapsed time, thereby obtaining a perceived effect corresponding to that magnification. The device is characterized by comprising: determination means for determining whether the duration of the voiced section to be converted is at most a predetermined length; and calculation means that, according to the result of the determination means, multiplies by the reference magnification at the appearance time of the voiced section when its duration exceeds the predetermined length, and multiplies by an expansion magnification higher than the reference magnification, chosen according to the duration of the voiced section, when its duration is at most the predetermined length.

[0017] Preferably, for a short voiced section of 150 ms or less, which corresponds to the predetermined length, the calculation means multiplies by an expansion magnification corresponding to the duration of that section along a magnification function that provides a higher expansion magnification than the reference magnification, regardless of the appearance time of the section. For a voiced section exceeding 150 ms, it multiplies the duration of the section by the reference magnification.

[0018] Preferably, the reference magnification that varies smoothly with elapsed time is a magnification function that, taking a section uttered in one breath as the unit, sets a slow speech speed at the start of the section and gradually increases the speech speed toward its end. In that case, for a voiced section that appears within a fixed time from the start of the section, preferably within about 450 ms, and whose length falls short of the predetermined length, preferably about 150 ms, the calculation means always applies an expansion magnification corresponding to the duration of that section along a magnification function that provides a higher expansion magnification than the reference magnification, regardless of the appearance time of the section. For a voiced section exceeding 150 ms, and when the elapsed time exceeds 450 ms, it multiplies the duration of the section by the reference magnification.

[0019] The values of 150 ms and 450 ms given above are specific examples of preferred values, and the present invention is not limited to them.

[0020] Preferably, the predetermined length is the maximum duration of a voiced section for which the "slowness" of the converted speech can no longer be perceived when a practical value is set as the reference magnification. For voiced sections at or below this maximum duration, a new magnification function g(w) with the duration w as its variable is introduced, and the expansion magnification is given according to it. The magnification given by this function is higher than the reference magnification and, in particular, is higher for shorter voiced sections. The maximum and minimum amplification values of the magnification given by this function are not fixed; each is varied in proportion to the value of the reference magnification function f(t).

[0021]

[Operation] The present invention focuses on the length of the voiced section, which affects the degree of the speech speed conversion effect. To obtain a conversion effect that sounds natural and stable, a new function is applied to voiced sections at or below a certain length; it gives a higher expansion magnification the shorter the section. This makes it possible to convert a wide variety of input speech to a desired speech speed naturally and with a stable effect. In particular, when the invention is applied to the previously proposed method for absorbing the time expansion caused by speech speed conversion (Ryu Ikezawa et al., "A study of real-time absorption of time expansion in speech speed conversion," Acoustical Society of Japan 1992 Autumn Meeting, 2-9-2, pp. 349-350 (1993-10)), the drawback that the "slowness" near the start of the utterance was perceptually unstable is eliminated, and stable, more effective converted speech can be obtained.

[0022]

[Embodiment] An embodiment of the present invention is described in detail below with reference to the drawings.

[0023] The embodiment described here applies the present invention to the "real-time speech speed conversion method" (Ryu Ikezawa et al., "A study of real-time absorption of time expansion in speech speed conversion," Acoustical Society of Japan 1992 Autumn Meeting, 2-9-2, pp. 349-350 (1993-10)). In that method, within a section predicted to be uttered in one breath, a speech speed "slower" than that of the original speech is set at the start of the section, and the speech speed is increased toward the end according to a fixed rule. The method absorbs time expansion in a speech speed conversion device that operates in real time, and it particularly requires an impression of "slowness" at the start of each phrase. Applying the technique of the present invention to it can therefore be said to be particularly effective.

[0024] FIG. 3 outlines the operation of an embodiment of the present invention. A short voiced sound of 150 ms or less that appears within 450 ms of the beginning of the sentence is given an expansion magnification corresponding to its length along the magnification function g(w) shown in FIG. 3, regardless of its appearance time. For a voiced sound exceeding 150 ms, and when the elapsed time exceeds 450 ms, the conventional expansion magnification curve f(t) (FIG. 1) is applied.
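The selection rule in this paragraph can be sketched as follows. This is a minimal illustration, not the patented implementation: the name `select_ratio` and the constant names are hypothetical, `f` and `g` are passed in as callables because their actual shapes are given only in FIG. 1 and FIG. 3, and the treatment of the exact 150 ms and 450 ms boundaries (inclusive here) is an assumption.

```python
W1_MS = 150.0           # longest voiced section treated as "short" (from the text)
HEAD_WINDOW_MS = 450.0  # window from the phrase start in which g(w) applies (from the text)


def select_ratio(t_ms, w_ms, f, g):
    """Pick the expansion ratio for a voiced section.

    t_ms: appearance time of the section, measured from the phrase start.
    w_ms: length of the section.
    f:    reference ratio as a function of appearance time.
    g:    ratio for short sections as a function of section length.
    """
    # Boundary handling is assumed inclusive; the text says "150 ms or less"
    # and "within 450 ms".
    if w_ms <= W1_MS and t_ms <= HEAD_WINDOW_MS:
        return g(w_ms)
    return f(t_ms)


# Hypothetical constant stand-ins, used only to show which branch is taken.
print(select_ratio(100.0, 80.0, lambda t: 1.4, lambda w: 2.0))   # short and early: g
print(select_ratio(100.0, 300.0, lambda t: 1.4, lambda w: 2.0))  # long: f
```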

[0025] FIGS. 4 to 7 show an embodiment of the present invention in more detail.

[0026] FIG. 4 is a block diagram showing the overall circuit configuration of an embodiment of the present invention.

[0027] The real-time speech speed conversion device shown in FIG. 4 comprises an audio input circuit 1, a CPU (central processing unit) circuit 2, a PROM (programmable ROM) circuit 3, an input buffer circuit 4, a processing buffer circuit 5, a file circuit 6, an audio output circuit 7, and a bus 8. The audio input circuit 1 captures the speech to be converted (the original speech). In real-time processing, the device detects changes in the voice pitch (pitch frequency) of the original speech and, based on the detection result, varies the speech speed by the rule of slowing it where the pitch is high and speeding it up where the pitch is low. In this way it converts the original speech into speech that is easy to listen to while preserving the utterance time of the original speech.

[0028] The audio input circuit 1 is a circuit of general configuration for inputting the original speech. It comprises, for example, a microphone, a tone control circuit, an A/D (analog/digital) converter, an audio storage and playback circuit, an audio recording medium (for example, IC memory, a hard disk, a floppy disk, or a VTR (video tape recorder)), and an interface circuit. It captures the speech to be converted, converts it into a digital audio signal, and supplies the converted digital audio signal to the input buffer circuit 4 frame by frame on instruction from the CPU circuit 2.

[0029] The input buffer circuit 4 consists of RAM (random access memory) or the like of the required capacity and is used as a work area of the CPU circuit 2. It captures and stores the audio signal output from the audio input circuit 1 and transfers the stored audio signal to the processing buffer circuit 5 on instruction from the CPU circuit 2.

[0030] The processing buffer circuit 5 consists of RAM or the like of the required capacity and is used as a work area of the CPU circuit 2. It captures and stores the audio signal output from the input buffer circuit 4 and transfers the stored audio signal to the file circuit 6 and elsewhere on instruction from the CPU circuit 2.

[0031] The file circuit 6 consists of, in addition to RAM, an audio recording medium such as IC memory or a floppy disk. It is a memory that stores the audio signal whose voiced sections have been expanded according to the present invention, signals whose silent sections have been shortened, and the like. When a processed audio signal is output from the processing buffer circuit 5, the file circuit 6 captures and stores it, and thereafter supplies the stored audio signal to the audio output circuit 7 on instruction from the CPU circuit 2.

[0032] The audio output circuit 7 is a circuit of general configuration for outputting the audio signal in the file circuit 6 to the outside. It comprises, for example, an interface circuit, a D/A (digital/analog) converter, a speaker, and a recording device (or broadcasting equipment). When an audio signal is output from the file circuit 6, it captures the signal and outputs it externally while converting it into sound.

[0033] The CPU circuit 2 consists of a one-chip microcomputer or the like. It controls the entire device and performs various data processing based on the programs shown in FIGS. 5 and 6, which are stored in the PROM circuit 3.

[0034] The PROM circuit 3 is used as the storage location for the programs that define the operation of the CPU circuit 2 and for constant data used in various processing. In response to read commands from the CPU circuit 2, it reads out the stored programs and constant data and supplies them to the CPU circuit 2.

[0035] Next, the operation of an embodiment of the present invention is described with reference to FIGS. 5 and 6.

[0036] FIGS. 5 and 6 are flowcharts showing the flow of processing. FIG. 6 shows the details of the voiced section processing routine of ST9 in FIG. 5.

[0037] For the purposes of this description, a breathing interval in the audio signal is called a "pause," a section uttered in one breath is called a "phrase," and the average duration of a phrase is called the "predicted phrase length." They are defined as follows.

[0038] Pause: among the sections judged to be silent, a silent section whose length is Th1 or more (Th1 = 200 ms in this embodiment). Th denotes a threshold value.

[0039] Phrase: the section between one pause and the next. The start point of this section is denoted Ph_st.

[0040] Predicted phrase length: the average duration of a phrase, denoted T (in ms). (T = 2000 ms in this embodiment.) In addition, f(t) and g(w) in FIG. 6 are functions that determine the expansion magnification of voiced sections, and they have the following properties.

[0041] f(t): a magnification function used to absorb the time expansion accompanying speech speed conversion. It is a monotonically decreasing function that determines the magnification for the appearance time t (0 ≤ t ≤ T) of a voiced section within the predicted phrase length.

[0042] Let r_s be the predetermined magnification at t = 0 and r_e the predetermined magnification at t = T (r_s ≥ r_e). Then f(t) satisfies r_s ≥ f(t) ≥ r_e for 0 ≤ t ≤ T.

[0043] g(w): a magnification function for expanding a voiced section shorter than a fixed length W_1 (W_1 = 150 ms in this embodiment) at a magnification higher than the reference magnification determined by f(t), according to the section length w. It is a monotonically decreasing function that determines the magnification for the voiced section length w (0 < w ≤ W_1).

[0044] Here, for a voiced section [t_k, t_k + W_1] (where w_k < W_1) that satisfies the condition for applying g(w), the relation f(t_k) ≤ g(w_k) always holds by the definition of g(w).
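The properties stated for f(t) and g(w) can be captured in a small sketch. The patent specifies only monotonicity and bounds, so the linear shapes below, the default values `r_s=1.4` and `r_e=0.9`, and the `boost` parameter are all assumptions made for illustration; the real curves are those of FIG. 1 and FIG. 3. Here `g` takes the reference value f(t_k) as `base` so that its maximum and minimum scale with f(t) and g(w_k) never falls below f(t_k).

```python
T_MS = 2000.0   # predicted phrase length T (from the embodiment)
W1_MS = 150.0   # W_1 (from the embodiment)


def f(t_ms, r_s=1.4, r_e=0.9):
    """Reference ratio: falls from r_s at t = 0 to r_e at t = T.

    The linear shape and the default r_s, r_e are assumptions.
    """
    t = min(max(t_ms, 0.0), T_MS)
    return r_s + (r_e - r_s) * t / T_MS


def g(w_ms, base, boost=1.5):
    """Short-section ratio: falls from nearly boost * base (w -> 0) to base (w = W_1).

    base is the reference ratio f(t_k); boost >= 1 is a hypothetical tuning value.
    """
    if not 0.0 < w_ms <= W1_MS:
        raise ValueError("g(w) is only defined for 0 < w <= W_1")
    return base * (boost - (boost - 1.0) * w_ms / W1_MS)


base = f(100.0)
print(base, g(80.0, base))  # g(w_k) is never below f(t_k)
```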

[0045] Next, the processing procedure of FIG. 5 is described. ST denotes a step.

[0046] (ST0) First, the maximum magnification r_s and the minimum magnification r_e of f(t) are set.

[0047] (ST0-1) Next, the frame number i is set to 0.

[0048] (ST0-2) Then i is incremented to i + 1.

[0049] (ST1) The input speech captured by the audio input circuit 1 is divided into fixed-length portions called frames, and the result is stored in the input buffer circuit 4.

[0050] In this embodiment, frames are cut out with a Hamming window of width 6.66 ms that is shifted by 3.3 ms at a time, and stored.
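The framing step can be sketched in plain Python as below. The frame width and shift are the values from the text; the sampling rate is not stated in this section, so `rate_hz` is a parameter the caller must supply, and the function names `hamming` and `frames` are hypothetical.

```python
import math


def hamming(n):
    """Standard Hamming window of n samples."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]


def frames(samples, rate_hz, width_ms=6.66, shift_ms=3.3):
    """Cut Hamming-windowed frames of width_ms, advancing by shift_ms each time."""
    width = max(1, round(rate_hz * width_ms / 1000.0))
    shift = max(1, round(rate_hz * shift_ms / 1000.0))
    window = hamming(width)
    out = []
    # Only full-width frames are produced; a trailing partial frame is dropped.
    for start in range(0, len(samples) - width + 1, shift):
        chunk = samples[start:start + width]
        out.append([s * w for s, w in zip(chunk, window)])
    return out
```

Because the shift is about half the width, consecutive frames overlap by roughly 50 percent.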

[0051] (ST2) The input audio signal is processed frame by frame by a method such as the autocorrelation method or the zero-crossing method to judge whether each frame is voiced, unvoiced, or silent. Input sounds other than voiced and unvoiced sounds uttered by a person (for example, low-level noise and background sounds) are in principle classified as silent.
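One simple way to realize a three-way decision of this kind is to combine frame energy with the zero-crossing rate. The sketch below is an assumption-laden simplification: the patent names the autocorrelation and zero-crossing methods but gives no details, and the thresholds `silence_energy` and `voiced_zcr` are hypothetical values that would need tuning on real audio.

```python
def classify_frame(frame, silence_energy=1e-4, voiced_zcr=0.25):
    """Label a frame 'silent', 'voiced', or 'unvoiced'.

    Low energy is treated as silence; otherwise a low zero-crossing rate is
    taken to indicate voiced speech and a high rate unvoiced speech.
    Both thresholds are hypothetical.
    """
    n = len(frame)
    energy = sum(x * x for x in frame) / n
    if energy < silence_energy:
        return "silent"
    # Count sign changes between neighbouring samples.
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0.0) != (b >= 0.0))
    zcr = crossings / (n - 1) if n > 1 else 0.0
    return "voiced" if zcr < voiced_zcr else "unvoiced"
```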

[0052] (ST3) It is determined whether the voiced/unvoiced/silent result for the i-th frame (the current result) is the same as the result for the (i-1)-th frame (the previous result). If the two results are the same, processing returns to (ST0-2); if not, it proceeds to (ST4). When i = 1, however, processing returns to (ST0-2).

[0053] In this embodiment, to minimize the processing delay of the system as a whole, the voiced, unvoiced, and silent sections are not each processed in one batch over their whole length. Instead they are divided into sections that are as short as possible (in this embodiment, voiced sections are divided into 150 ms pieces) and then processed.

[0054] (ST4) The speech section up to frame i-1 that has been judged to be of a single type (voiced, unvoiced, or silent) is transferred from the input buffer circuit 4 to the processing buffer circuit 5 and stored.
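Steps ST3 and ST4 amount to run-length grouping of per-frame labels: a section ends whenever the label changes. An offline sketch of that grouping follows; the function name `sections` is hypothetical, the 3.3 ms default is the frame shift from the embodiment, and the 150 ms splitting of long voiced sections used to limit delay is deliberately left out.

```python
from itertools import groupby


def sections(labels, shift_ms=3.3):
    """Collapse per-frame labels into (label, start_ms, duration_ms) runs."""
    out = []
    index = 0
    for label, run in groupby(labels):
        count = len(list(run))
        out.append((label, index * shift_ms, count * shift_ms))
        index += count
    return out


print(sections(["silent", "silent", "voiced", "voiced", "voiced", "unvoiced"], shift_ms=1.0))
```

The device itself works incrementally, frame by frame, rather than on a complete label list as this sketch does.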

[0055] (ST5) It is determined whether the speech section stored in the processing buffer circuit 5 is silent, unvoiced, or voiced. For a silent section, processing proceeds to (ST6); for an unvoiced section, to (ST11); for a voiced section, to (ST9).

[0056] (ST6) It is determined whether the silent section is a pause. If it is, processing proceeds to (ST6-1); if not, it jumps to (ST8). When the real-time speech speed conversion device of FIG. 4 is started up, however, a pause is deemed to have occurred, and processing always proceeds to (ST6-1).

[0057] (ST6-1) The variable k, which represents the number of a voiced section appearing after the pause, is initialized to 1.

[0058] (ST7) The length of the pause is examined and, depending on that length, the pause is shortened as appropriate by a preset algorithm (Ryu Ikezawa et al., "A method for absorbing the time expansion accompanying speech speed conversion," 1992 Speech Research Meeting, pp. 49-56) to an extent that does not sound unnatural.

【００５９】 In this embodiment, any silent section longer than 862 ms is uniformly shortened to 862 ms (Ryu Ikezawa et al., "A Method for Absorbing the Time Expansion Accompanying Speech Speed Conversion," Spring Meeting of the Acoustical Society of Japan, 1992, 2-6-2, pp. 331-332 (1992-3)). If the silence continues after 862 ms of silence has elapsed, the subsequent silent data is discarded and the device waits for the start point of the next phrase.
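The uniform cap described in this paragraph can be illustrated with a minimal sketch. It covers only the 862 ms limit stated here, not the cited shortening algorithm of (ST7), whose details are not given in this text. The function and constant names are hypothetical.

```python
# Minimal sketch of the uniform cap on silent sections.
# MAX_SILENCE_MS is the 862 ms value given for the embodiment;
# shorten_silence is a hypothetical name chosen for illustration.
MAX_SILENCE_MS = 862.0


def shorten_silence(length_ms: float, max_ms: float = MAX_SILENCE_MS) -> float:
    """Return the silence length kept after capping; the excess is discarded."""
    if length_ms < 0:
        raise ValueError("length_ms must be non-negative")
    return min(length_ms, max_ms)
```

For example, a 1500 ms silence would be kept as 862 ms, while a 300 ms silence would be left unchanged.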

【００６０】 (ST8) The processed silent-section signal held in the processing buffer circuit 5 is transferred to the file circuit 6 and stored there, after which the processing buffer circuit 5 is cleared. The process then moves to (ST12).

【００６１】 (ST12) It is determined whether the speech signal has been processed to its end. If so, the processing routine of (ST9) is terminated; if not, the process returns to (ST0-2).

【００６２】 (ST9) The section judged in (ST5) to be a voiced section is subjected to the voiced section processing shown in FIG. 6 and described later. The origin of the time axis used in processing this section is defined as V_st. The start time of the k-th voiced section within the phrase is written t_k, and its section length w_k.

【００６３】 (ST9-1) The variable k is incremented to k + 1.

【００６４】 (ST10) The speech-speed-converted speech data in the processing buffer circuit 5 is stored in the memory of the file circuit 6, and the processing buffer circuit 5 is cleared. The process then moves to (ST12).

【００６５】 (ST11) If the section to be processed was judged in (ST5) to be unvoiced, the speech signal of this unvoiced section is transferred from the processing buffer circuit 5 to the file circuit 6 and stored there, after which the processing buffer circuit 5 is cleared. The process then moves to (ST12).
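How the voiced-section number k evolves across the main routine can be sketched as follows. This is an illustration only: the tuple representation of sections and the function name are hypothetical, and starting with k = 1 is an assumption based on the statement that start-up is treated as a pause.

```python
from typing import Iterable, List, Tuple


def voiced_indices(sections: Iterable[Tuple[str, bool]]) -> List[int]:
    """Return the value of k seen by each voiced section, in order.

    Each item is (kind, is_pause), where kind is "silent", "unvoiced" or
    "voiced"; is_pause is only meaningful for silent sections.
    """
    k = 1  # assumption: start-up counts as a pause, so k starts at 1
    seen: List[int] = []
    for kind, is_pause in sections:
        if kind == "silent":
            if is_pause:
                k = 1        # (ST6-1): reset after a pause section
        elif kind == "voiced":
            seen.append(k)   # (ST9): voiced section processed with current k
            k += 1           # (ST9-1): increment
        elif kind != "unvoiced":
            raise ValueError(f"unknown section kind: {kind!r}")
    return seen
```

In this sketch a silent section that is not a pause leaves k untouched, so only a pause starts a new count.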

【００６６】 Next, the (ST9) voiced section processing routine of FIG. 6 is described in detail.

【００６７】 (ST14) First, the pitch of the voiced section is extracted.

【００６８】 (ST15) Next, it is determined whether k = 1. If k = 1, that is, if this is the first voiced section appearing after a pause section, the process moves to (ST15-1); otherwise it moves to (ST15-2).

【００６９】 (ST15-1) The time t₁ is assigned to the variable V_st, which indicates the origin of the time axis used in processing this voiced section. The process then moves to (ST16).

【００７０】 (ST15-2) It is determined whether k is 3 or less, that is, whether k is 2 or 3. If k is 2 or 3, the process moves to (ST16); if k is 4 or more, it jumps to (ST17).

【００７１】 (ST16) The maximum pitch frequency in the k-th voiced section is defined as P_k. For k = 1, 2, 3, the value of P_k is saved.

【００７２】 (ST16-1) It is determined whether k = 3. If k = 3, the process moves to the next step (ST16-2); otherwise, that is, if k = 1 or 2, it jumps to (ST17).

【００７３】 (ST16-2) The largest of the values P₁, P₂, P₃ of the three voiced sections is taken as Pitch_max. The process then moves to (ST17).
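Steps (ST15-2) through (ST16-2) amount to remembering the maximum pitch frequency of the first three voiced sections and taking their maximum once the third is available. A minimal sketch, with a hypothetical function name and a plain dictionary standing in for whatever storage the device uses:

```python
from typing import Dict, Optional


def update_pitch_max(k: int, p_k: float, saved: Dict[int, float]) -> Optional[float]:
    """Save P_k for k = 1, 2, 3 and return Pitch_max when k == 3, else None."""
    if 1 <= k <= 3:
        saved[k] = p_k               # (ST16): keep P_1, P_2, P_3
    if k == 3:
        return max(saved.values())   # (ST16-2): Pitch_max = max(P_1, P_2, P_3)
    return None                      # k = 1, 2 or k >= 4: nothing new to report
```

The caller would keep the returned Pitch_max for the later comparison against P_k.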

【００７４】 (ST17) It is determined whether t_k lies within the interval [V_st, V_st + T]. If it does, the process moves to (ST17-1); otherwise it moves to (ST12). (In this embodiment, T = 2000 ms, as stated earlier.)
(ST17-1) It is determined whether V_st > t₁.

【００７５】 When V_st > t₁, the section is usually close to the end of the utterance and of low semantic importance, so in this embodiment g(w) is not applied and the process moves directly from (ST17-1) to (ST19). Otherwise the process moves on to (ST18).

【００７６】 (ST18) Let T₁ be the length of time needed to make the perceived "slowness" produced by the conversion effective at the start of a phrase. According to experimental results (Atsushi Imai et al., "A Real-Time Method for Absorbing the Time Expansion Accompanying Speech Speed Conversion," Autumn Meeting of the Acoustical Society of Japan, 1993, 1-9-10, pp. 361-362 (1993-10)), T₁ is desirably about 1/4 of T; in this embodiment T₁ = 450 ms.

【００７７】 In this processing block, it is determined whether the end time t_k + w_k of the k-th voiced section lies within the interval [V_st, V_st + T₁]. If it does, the process moves to the next step (ST18-1); otherwise it moves to (ST19).

【００７８】 (ST18-1) It is determined whether the length w_k of the k-th voiced section and the preset section length W₁ satisfy w_k ≤ W₁. If so, the process moves to (ST20); if not, it moves to (ST19).

【００７９】 In speech speed conversion by expansion of voiced sections, the shorter the section, the smaller the conversion effect. W₁ is an experimentally derived critical voiced section length below which the perceived speech speed conversion effect is hardly noticeable when the input speech is converted at a uniform magnification of about 1.3; in this embodiment W₁ = 150 ms.
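The branch conditions of (ST17) through (ST18-1) can be collected into one decision function. The sketch below is only an illustration: the function name and the convention of returning the next step label as a string are hypothetical, while T = 2000 ms, T₁ = 450 ms, and W₁ = 150 ms are the embodiment's values. All times are in milliseconds.

```python
T_MS = 2000.0   # T: predicted phrase length used in the embodiment
T1_MS = 450.0   # T_1: window at the start of the phrase
W1_MS = 150.0   # W_1: critical voiced section length


def select_step(t_k: float, w_k: float, v_st: float, t_1: float,
                T: float = T_MS, T1: float = T1_MS, W1: float = W1_MS) -> str:
    """Return the label of the step reached from (ST17) for one voiced section."""
    if not (v_st <= t_k <= v_st + T):           # (ST17): start time outside the phrase window
        return "ST12"
    if v_st > t_1:                               # (ST17-1): origin was moved, g(w) not applied
        return "ST19"
    if not (v_st <= t_k + w_k <= v_st + T1):     # (ST18): section ends after the first T_1
        return "ST19"
    if w_k <= W1:                                # (ST18-1): short section, use g(w)
        return "ST20"
    return "ST19"                                # long section, use f(t)
```

With V_st = t₁ = 0, a 120 ms section starting at 100 ms would be routed to (ST20), while a 200 ms section starting at the same time would be routed to (ST19).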

【００８０】 (ST19) The voiced section is expanded by applying the preset magnification function f(t). This f(t) is a monotonically decreasing function; in this embodiment a cosine function of the form of equation (1) below was used to vary the magnification from r_s to r_e.

【００８１】 (See the curve in FIG. 7.)

【００８２】[0082]

[Equation 1]
$$ f(t) = r_e + 0.5\,(r_s - r_e)\left\{\cos\frac{\pi\,(t - V_{\mathrm{st}})}{T} + 1.0\right\} \tag{1} $$
where V_st ≤ t ≤ V_st + T. In this embodiment, r_s and r_e were set arbitrarily within the ranges 1.0 ≤ r_s ≤ 1.6 and 0.7 ≤ r_e < 1.0. The process then returns to the main routine of FIG. 5.
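Equation (1) can be evaluated directly. The sketch below is a plain transcription of the formula; the function name and the default values r_s = 1.3 and r_e = 0.9 are illustrative assumptions chosen from within the stated ranges, not values given in the text.

```python
import math


def f(t: float, v_st: float = 0.0, T: float = 2000.0,
      r_s: float = 1.3, r_e: float = 0.9) -> float:
    """Reference magnification of equation (1), valid for v_st <= t <= v_st + T (ms)."""
    if not (v_st <= t <= v_st + T):
        raise ValueError("t must lie in [v_st, v_st + T]")
    # Half-cosine that starts at r_s (t = v_st) and falls to r_e (t = v_st + T).
    return r_e + 0.5 * (r_s - r_e) * (math.cos(math.pi * (t - v_st) / T) + 1.0)
```

At t = V_st the cosine term is 1, so f equals r_s; at t = V_st + T it is -1, so f equals r_e, which matches the stated change of magnification from r_s to r_e.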

【００８３】 (ST20) Regardless of the time elapsed since V_st, the voiced section is expanded by applying the magnification determined by g(w_k) for its section length w_k.

【００８４】 The magnification function g(w) used in this embodiment is the linear function given by equation (2) below, which varies the magnification from g(0) to g(W₁). The process then returns to the main routine of FIG. 5.

【００８５】 (See the straight line at the right-hand corner of FIG. 7.)

【００８６】[0086]

[Equation 2]
$$ g(w) = -\frac{\left(r_s^{2} - f(W_1)\right) w}{W_1} + r_s^{2} \tag{2} $$
where V_st = 0 is assumed, so that g(W₁) = f(W₁).
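Equation (2) can be checked numerically together with equation (1). The sketch below takes V_st = 0 as stated; the function names and the default values r_s = 1.3 and r_e = 0.9 are illustrative assumptions, while T = 2000 ms and W₁ = 150 ms are the embodiment's values.

```python
import math


def f(t: float, T: float = 2000.0, r_s: float = 1.3, r_e: float = 0.9) -> float:
    """Equation (1) with V_st = 0."""
    return r_e + 0.5 * (r_s - r_e) * (math.cos(math.pi * t / T) + 1.0)


def g(w: float, W1: float = 150.0, T: float = 2000.0,
      r_s: float = 1.3, r_e: float = 0.9) -> float:
    """Equation (2): linear in w, from r_s**2 at w = 0 down to f(W1) at w = W1."""
    if not (0.0 <= w <= W1):
        raise ValueError("w must lie in [0, W1]")
    f_w1 = f(W1, T, r_s, r_e)
    return -(r_s ** 2 - f_w1) * w / W1 + r_s ** 2
```

With these sample values g(0) is about 1.69 and g(150) coincides with f(150), so the two functions join continuously at w = W₁, and shorter sections receive a higher magnification.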

【００８７】 (ST21) If the maximum pitch frequency P_k of the voiced section being processed satisfies the condition of equation (3) below, the process moves to (ST22); if not, it moves to (ST23).

【００８８】[0088]

[Equation 3]
$$ P_k > \mathrm{Pitch\_max} \times Th2 \tag{3} $$
In this embodiment, Th2 = 0.7.

【００８９】 (ST22) The time t_k is assigned to the variable V_st.

【００９０】 (ST22-1) The value (r_s - Th3) is assigned to the variable r_s.

【００９１】 As a result, f(t) now varies the magnification from (r_s - Th3) to r_e. In this embodiment, Th3 was set to 0.1. The process then returns to (ST17) above.

【００９２】 (ST23) The voiced section is expanded at the expansion magnification r_e. That is, the speech speed is left in its fastest state. The (ST9) voiced section processing routine then ends and the process returns to the main routine of FIG. 5.
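The pitch test of (ST21) and the updates of (ST22) and (ST22-1) can be sketched as a single state update. The function name and the returned tuple are hypothetical; Th2 = 0.7 and Th3 = 0.1 are the embodiment's values.

```python
from typing import Tuple

TH2 = 0.7  # threshold ratio of equation (3)
TH3 = 0.1  # amount by which r_s is lowered at each reset


def pitch_reset(p_k: float, pitch_max: float, t_k: float, v_st: float,
                r_s: float, th2: float = TH2, th3: float = TH3
                ) -> Tuple[float, float, bool]:
    """Return (new V_st, new r_s, reset_done) for one voiced section."""
    if p_k > pitch_max * th2:          # (ST21): equation (3) holds
        return t_k, r_s - th3, True    # (ST22) V_st <- t_k, (ST22-1) r_s <- r_s - Th3
    return v_st, r_s, False            # (ST23): keep state; section is expanded at r_e
```

When the returned flag is true, the caller would re-enter (ST17) with the new origin, so that f(t) restarts from the lowered r_s.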

【００９３】[0093]

[Effects of the Invention] As described above, the present invention concerns a method that separates the silent, unvoiced, and voiced sections of input speech and slows the speaking rate (speech speed) by expanding the voiced sections. When all voiced sections are converted at a fixed magnification, the perceived speech speed conversion effect varies with the length of each voiced section. To eliminate this variation, voiced sections shorter than a certain value are expanded at a higher magnification that depends on their length, so that the perceived effect corresponds to the desired magnification. A natural and stable speech speed conversion effect is therefore obtained for any uttered speech. That is, according to the present invention, speech can be converted stably and naturally to the speech speed the listener desires.

[Brief description of the drawings]

FIG. 1 is a graph showing the magnification function of the conventional method.

FIG. 2 is a timing chart showing the distribution on the time axis of the voiced section lengths within one phrase when the conventional method is applied.

FIG. 3 is a graph showing the magnification function of an embodiment of the present invention.

FIG. 4 is a block diagram showing an example circuit configuration of the real-time speech speed conversion device of an embodiment of the present invention.

FIG. 5 is a main flowchart showing an example of the operation of the real-time speech speed conversion device shown in FIG. 4.

FIG. 6 is a flowchart showing the details of the voiced section processing routine shown in FIG. 5.

FIG. 7 is a timing chart showing an example of operation when the functions f(t) and g(w) are applied to the real-time speech speed conversion device shown in FIG. 4.

[Explanation of symbols]

1 Audio input circuit
2 CPU circuit
3 PROM circuit
4 Input buffer circuit
5 Processing buffer circuit
6 File circuit
7 Audio output circuit
8 Bus
f(t) Magnification function used to absorb the time expansion accompanying speech speed conversion
g(w) Magnification function for expanding a voiced section shorter than the fixed section length W₁, according to its section length w, at a magnification higher than the reference magnification determined by f(t)
r_s Predetermined maximum magnification
r_e Predetermined minimum magnification
T Predicted phrase length (average duration of a phrase)
Ph_st Start point of a phrase (the section between one pause and the next)
V_st Origin of the time axis in voiced section processing
P_k Maximum pitch frequency of the k-th voiced section
Pitch_max Maximum of P₁, P₂, P₃ for the first three voiced sections
W₁ Preset section length
w_k Length of the k-th voiced section
i Frame number
k Voiced section number
t_k Start time of the k-th voiced section

Continuation of the front page
(72) Inventor: Nobumasa Seiyama (清山信正), Broadcasting Technology Research Laboratories, Japan Broadcasting Corporation, 1-10-11 Kinuta, Setagaya-ku, Tokyo
(72) Inventor: Eiichi Miyasaka (宮坂栄一), Broadcasting Technology Research Laboratories, Japan Broadcasting Corporation, 1-10-11 Kinuta, Setagaya-ku, Tokyo
(56) References: JP-A-1-93795 (JP, A); JP-A-5-257490 (JP, A); JP-A-5-80796 (JP, A); JP-A-4-367898 (JP, A); JP-A-63-234299 (JP, A); JP-A-6-337696 (JP, A)
(58) Fields searched (Int. Cl.⁷, DB name): G10L 21/04

Claims

(57) [Claims]

1. A speech speed conversion device which, when performing a conversion that separates the silent, unvoiced, and voiced sections of input speech and expands the voiced sections so as to slow the speaking rate (speech speed) while maintaining the voice pitch, sequentially detects the time length of each voiced section and multiplies the time length of each voiced section by a reference magnification that is either a uniform value or changes smoothly with elapsed time, thereby obtaining an audible effect corresponding to that magnification, the device comprising: determination means for determining whether the time length of a voiced section to be converted is equal to or less than a predetermined length; and calculating means which, according to the determination result of the determination means, multiplies the time length of the voiced section to be converted by the reference magnification at the appearance time of that voiced section when the time length exceeds the predetermined length, and, for a voiced section equal to or shorter than the predetermined length, multiplies the time length by an expansion magnification higher than the reference magnification in accordance with the time length of that voiced section.

2. The speech speed conversion device according to claim 1, wherein the calculating means multiplies a short voiced section of 150 ms or less, corresponding to the predetermined length, by an expansion magnification corresponding to the time length of the voiced section along a magnification function that gives a higher expansion magnification than the reference magnification, regardless of the appearance time of the voiced section, and multiplies the time length of a voiced section exceeding 150 ms by the reference magnification.

3. The speech speed conversion device according to claim 1, wherein, when the reference magnification is a magnification function that changes smoothly with elapsed time and that, taking a section uttered in one breath as the unit, sets a slow speech speed at the start point of the section and gradually makes the speech speed faster toward its end, the calculating means multiplies a short voiced section of 150 ms or less, corresponding to the predetermined length, that appears within 450 ms of the start time of the section by an expansion magnification corresponding to the time length of the voiced section along a magnification function that gives a higher expansion magnification than the reference magnification, regardless of the appearance time of the voiced section, and multiplies the time length of the voiced section by the reference magnification when the voiced section exceeds 150 ms or the elapsed time exceeds 450 ms.

4. The speech speed conversion device according to claim 1, wherein the predetermined length is the maximum time length of a voiced section for which the "slowness" of the converted speech cannot be perceived audibly when a practical value is set as the reference magnification; for voiced sections of that length or less, a new magnification function g(w) with the time length w as its variable is introduced and the expansion magnification is given according to that function; the magnification given by the function is higher than the reference magnification and, in particular, is higher the shorter the voiced section; and the maximum and minimum values of the magnification given by the function are not fixed but can each be varied in proportion to the value of the reference magnification function f(t).