JP3249567B2

JP3249567B2 - Method and apparatus for converting speech speed

Info

Publication number: JP3249567B2
Application number: JP05178792A
Authority: JP
Inventors: 龍池沢; 章中村; 栄一宮坂
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 1992-03-10
Filing date: 1992-03-10
Publication date: 2002-01-21
Anticipated expiration: 2017-01-21
Also published as: JPH05257490A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to a speech speed conversion method and apparatus suited to speech listening by hearing-impaired persons, the elderly, and the like.

[0002]

[Summary of the Invention] The present invention relates to a speech speed conversion method and apparatus suited to speech listening by hearing-impaired persons, the elderly, and the like. When the speed at which the listened-to speech is uttered (the speech speed) is reduced, the silent sections between sentences are shortened to the minimum that causes no perceptual unnaturalness, and the speech speed is varied according to a fixed rule. The speech is thereby converted into good speech that is slow and easy to hear as a whole, while the utterance time is kept equal to that of the original speech.

[0003]

[Prior Art] Technology for converting speech speed while preserving quality is itself still under development, and no technology that takes into account the "deviation" from real time (the available time frame) has yet been developed.

[0004]

[Problems to Be Solved by the Invention] Uniformly slowing only the speech speed can make speech much easier to hear, particularly for the elderly and the hearing impaired, but this operation inevitably lengthens the utterance time. In broadcasts, recorded-reading cassettes, and the like, however, the speech before expansion is uttered so as to fit within a predetermined time, so expanding such speech may cause it to no longer fit within that time limit. Moreover, where audio and video are provided in synchronization, as in television, expanding only the audio produces a temporal "deviation" from the video, which may adversely affect comprehension.

[0005] An object of the present invention is to solve the problems arising from the temporal "deviation" described above by providing a speech speed conversion method and apparatus that moderately slows the semantically important portions of the uttered speech and, conversely, speeds up the other portions, thereby converting the speech into speech that is slow and easy to hear as a whole without substantially lengthening the utterance time.

[0006]

[Means for Solving the Problems] To achieve the above object, the speech speed conversion method of claim 1 is characterized in that, when the speed at which the listened-to speech is uttered (hereinafter, the speech speed) is reduced, the speech speed is slowed where the pitch is high and increased where the pitch is low, in accordance with changes in the pitch (fundamental frequency) of the speech.

[0007] To achieve the above object, the speech speed conversion method of claim 2 is characterized in that, when the speech speed is reduced, the section from one voice onset to the next voice onset is taken as a unit, a slow speech speed is set at the start point of this section, and the speech speed is gradually increased toward its end point, following the rough change in the fundamental frequency of the speech.

[0008] Here, when the speech speed is reduced, the silent sections between sentences may additionally be shortened to as short a time as possible within a range, determined in advance by experiment, that causes no perceptual unnaturalness. This shortest time may be in the range from 862 ms to approximately 1000 ms.

[0009] To achieve the above object, the speech speed conversion apparatus of claim 5 is characterized by comprising: speech classification means for classifying a speech signal as voiced, unvoiced, or silent; silent-section determination means for determining whether a silent section identified by the speech classification means is a silent section between sentences; silent-section shortening means for shortening the silent section, when the silent-section determination means determines it to be a silent section between sentences, to as short a time as possible within a range, determined in advance by experiment, that causes no perceptual unnaturalness; voiced-section determination means for determining whether a voiced section identified by the speech classification means is one at which a voice onset begins; and voiced-section expansion means for performing, when the voiced-section determination means determines that a voice onset begins, speech speed conversion processing that takes the section from that voice onset to the next voice onset as a unit, sets a slow speech speed at the start point of the section, and gradually increases the speech speed toward its end point following the rough change in the fundamental frequency of the speech.

[0010]

[Operation] The present invention is characterized in that, when the speed at which the listened-to speech is uttered (the speech speed) is reduced, the speech speed is slowed where the pitch is high and increased where the pitch is low, in accordance with changes in the pitch (fundamental frequency) of the speech. The present invention is also characterized in that, when the speech speed is reduced, the section from one voice onset to the next voice onset is taken as a unit, a slow speech speed is set at the start point of this section, and the speech speed is gradually increased toward its end point, following the rough change in the fundamental frequency of the speech.

[0011] Furthermore, the present invention focuses on the silent sections between sentences and shortens them to as short a time as possible within a range, determined in advance by experiment, that causes no perceptual unnaturalness. As one example, this shortest time is taken to be in the range from 862 ms to approximately 1000 ms.

[0012] Therefore, according to the present invention, slow, easy-to-hear speech matching the listener's wishes can be listened to within the real-time frame, without the utterance time being lengthened.

[0013]

[Embodiments] Embodiments of the present invention are described in detail below with reference to the drawings.

[0014] (1) Apparatus configuration
FIG. 1 shows the apparatus configuration of an embodiment of the present invention. The audio input circuit 1 is a circuit of ordinary configuration for inputting a speech signal and includes, as needed, for example a microphone, a tone control circuit, an analog-to-digital converter, a speech storage and playback (recording) circuit, a speech storage medium (for example, IC memory, a hard disk, a floppy disk, or a VTR), an interface circuit, and the like. The CPU (central processing unit) 2 governs control of the whole apparatus, computation, and the like; for example, a known one-chip microcomputer or personal computer can be used. The program memory (PROM) 3 stores in advance the control procedure (program) according to the present invention that the CPU 2 executes, as shown in FIG. 2, together with tables, constants, and the like.

[0015] The input buffer 4 and the processing buffer 5 are allocated in a RAM (random access memory), not shown, that the CPU 2 uses as a work area. The digital speech signal input from the audio input circuit 1 is temporarily stored in the input buffer 4 sequentially in units of frames, described later, and the speech signal stored in the input buffer 4 is then temporarily stored in the processing buffer 5 segment by segment, as described later. The file 6 is a memory that stores the speech signal after the voiced-section expansion and silent-section shortening according to the present invention; besides the RAM mentioned above, a speech storage medium such as IC memory or a floppy disk can be used.

[0016] The audio output circuit 7 is a circuit of ordinary configuration for outputting the speech signal in the file 6 to the outside and includes, as needed, for example an interface circuit, a digital-to-analog converter, a loudspeaker, a recording device (or broadcasting equipment), and the like. It is of course also possible to implement the entire procedure shown in FIG. 2, described later, in hardware using known techniques and to configure the apparatus as a dedicated machine.

[0017] (2) Operation example
FIG. 2 shows the operating procedure of an embodiment of the present invention. In this embodiment, when the speed at which the listened-to speech is uttered (the speech speed) is reduced, the silent sections are shortened to the minimum that causes no perceptual unnaturalness. In addition, the embodiment relies on two observations: the semantically important portions of uttered speech are usually where the pitch (fundamental frequency) is high, and the pitch is usually high at a voice onset. The section from one voice onset to the next is therefore taken as a unit, a slow speech speed is set at the start point of this section, and the speech speed is gradually increased toward the end point, following the rough change in the fundamental frequency of the speech.

[0018] Step S1: First, the input speech signal from the audio input circuit 1 is cut into fixed-length portions called frames and stored in the input buffer 4. In this embodiment, the frame length is, for example, 3.3 ms.

[0019] Step S2: Each frame is classified as voiced, unvoiced, or silent. As one example, the known autocorrelation method and zero-crossing method can be applied for this classification; other methods may of course be used. Input sounds other than the voiced and unvoiced sounds uttered by a person (for example, low-level noise and background sounds) are in principle treated as silence.
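The classification in step S2 might be sketched as follows. This is only an illustration under stated assumptions, not the procedure of the embodiment: the function names, the three thresholds (`silence_rms`, `voiced_peak`, `unvoiced_zcr`), and the 60 to 400 Hz pitch search range are all hypothetical, and the analysis window passed in is assumed to be longer than one 3.3 ms frame so that it can contain at least one pitch period.

```python
import math

def zero_crossing_rate(window):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(window, window[1:]) if (a >= 0) != (b >= 0))
    return crossings / max(len(window) - 1, 1)

def normalized_autocorr_peak(window, min_lag, max_lag):
    """Largest autocorrelation value, normalized by the energy, over the lag range."""
    energy = sum(x * x for x in window)
    if energy == 0.0:
        return 0.0
    best = 0.0
    for lag in range(min_lag, min(max_lag, len(window) - 1) + 1):
        r = sum(window[i] * window[i + lag] for i in range(len(window) - lag))
        best = max(best, r / energy)
    return best

def classify_frame(window, sample_rate,
                   silence_rms=0.01, voiced_peak=0.5, unvoiced_zcr=0.25):
    """Label an analysis window as 'voiced', 'unvoiced', or 'silent'.

    All thresholds are illustrative assumptions, not values from the text.
    """
    rms = math.sqrt(sum(x * x for x in window) / len(window))
    if rms < silence_rms:
        return "silent"
    # Search for periodicity at lags corresponding to 400 Hz down to 60 Hz (assumed range).
    min_lag = int(sample_rate / 400)
    max_lag = int(sample_rate / 60)
    if normalized_autocorr_peak(window, min_lag, max_lag) >= voiced_peak:
        return "voiced"
    if zero_crossing_rate(window) >= unvoiced_zcr:
        return "unvoiced"
    # Anything else (for example, low-frequency background sound) is treated as silence.
    return "silent"
```

A strongly periodic window is labelled voiced, an aperiodic window with many zero crossings is labelled unvoiced, and everything else falls through to silent, which mirrors the rule that non-speech input is in principle treated as silence.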

[0020] Step S3: If the current and previous frames are of the same type, the process returns to step S1; if they differ, for example when the type changes from voiced to unvoiced, the process proceeds to the subsequent stage. In this way, speech of a single type (one section) accumulates in the input buffer 4.
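The effect of steps S1 to S3, accumulating frames until the type changes, can be illustrated offline by collapsing a list of per-frame labels into sections. This is a minimal sketch: the function name and the tuple layout are assumptions made for illustration, and the embodiment itself works incrementally through the input buffer rather than on a complete list.

```python
from itertools import groupby

FRAME_MS = 3.3  # example frame length given in the text

def frames_to_sections(labels, frame_ms=FRAME_MS):
    """Collapse per-frame labels into (label, start_ms, end_ms) sections."""
    sections = []
    position = 0  # index of the first frame of the current run
    for label, run in groupby(labels):
        count = sum(1 for _ in run)
        sections.append((label, position * frame_ms, (position + count) * frame_ms))
        position += count
    return sections
```

Each returned tuple corresponds to one silent, unvoiced, or voiced section of the kind that the later steps dispatch on.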

[0021] Step S4: The threshold values Th1, Th2, and Th3 described later are set from the average number of morae uttered per second. A mora corresponds to the length of one syllable containing a short vowel; in Japanese it corresponds to roughly one kana character (two characters in the case of a contracted sound, or yōon). The processing of step S4 may be performed only at the initial stage, or at predetermined intervals.

[0022] Step S5: A section that begins with an unvoiced or silent portion and ends with a voiced portion is taken as one block (B_n: n = 1, 2, ...). Within this block, the signal is broadly divided, according to the classification of step S2, into three kinds of section: a silent section (a_n), an unvoiced section (b_n), and a voiced section (c_n). Each section is sent to the corresponding processing path described below. The time of the boundary between b_1 and c_1 is denoted t_{1,s}, and the first voice onset is denoted α_1 (see FIG. 3).

[0023] Step S6: As shown in FIG. 3, the time interval T_n between the start point of the n-th voiced section (t_{n,s}) and the end point of the immediately preceding voiced section (t_{n-1,e}) is calculated:

T_n = t_{n,s} - t_{n-1,e}

[0024] Step S7: T_n is compared with the threshold value Th1 used to detect a voice onset. If T_n exceeds Th1, the time t_{n,s} is judged to be a voice onset α_m (see FIG. 3), and the process proceeds to step S8. If there is no preceding voiced section at the start of this processing, the process jumps to step S11, described later.
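Steps S6 and S7 can be sketched on a list of voiced sections as follows. The function names and the `(start_ms, end_ms)` tuple layout are assumptions for illustration; the gap computed in the loop is T_n = t_{n,s} - t_{n-1,e}, and the first voiced section is treated as the first voice onset, as in the description of α_1.

```python
def find_voice_onsets(voiced_sections, th1_ms):
    """Return the indices of voiced sections that begin a voice onset.

    voiced_sections: list of (start_ms, end_ms) tuples in time order.
    """
    onsets = [0] if voiced_sections else []  # the first voiced section is the first onset
    for n in range(1, len(voiced_sections)):
        gap = voiced_sections[n][0] - voiced_sections[n - 1][1]  # T_n
        if gap > th1_ms:
            onsets.append(n)
    return onsets

def split_into_segments(voiced_sections, th1_ms):
    """Group voiced sections into segments, each running from one onset up to the next."""
    onsets = find_voice_onsets(voiced_sections, th1_ms)
    bounds = onsets + [len(voiced_sections)]
    return [voiced_sections[bounds[i]:bounds[i + 1]] for i in range(len(onsets))]
```

With Th1 = 350 ms, the value used in the experiment reported later in the text, only pauses longer than 350 ms between voiced sections start a new segment.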

[0025] Step S8: The range from the preceding voice onset α_{m-1} to the end point t_{n-1,e} of the preceding voiced section is taken as one segment. In the example of FIG. 3, if T_5 = t_{6,s} - t_{5,e} > Th1, the time t_{6,s} is the voice onset α_2 and the interval (t_{5,e} - t_{1,s}) forms one segment. The segment stored so far in the processing buffer 5 through the processing of steps S11, S12, and S15 is then expanded as follows. The expansion factor r_s of the voiced-section length at the start point of the segment is set to a predetermined value in the range 1 ≤ r_s ≤ 2. The expansion factor is made gradually smaller toward the end point of the segment, so that the expansion factor r_e of the voiced-section length at the end point satisfies 0.7 ≤ r_e ≤ 1. FIG. 4 shows an example of how the expansion factors of the voiced sections belonging to segment 1 of FIG. 3 are obtained. The voiced section c_1 at the segment start point is expanded to c_1' = r_s · c_1, and c_2 becomes c_2' = r_2 · c_2. The voiced section c_5 at the segment end point becomes c_5' = r_e · c_5; since r_e ≤ 1, it is in practice shortened. The silent sections a_n and unvoiced sections b_n are not processed and remain unchanged.

[0026] That is, since the speech at a voice onset (the first half of a unit) is generally often semantically important, moderately slowing the speech speed there as described above improves ease of listening. The speech speed is varied using a suitable function f(t). In this embodiment, as one example, a cosine function as shown in FIG. 4 is used. In this case, f(t) is expressed by the following equation (1).

[0027]

[Equation 1]
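Equation (1) is not reproduced in this text, so the following is only an illustrative sketch of a cosine-shaped transition from r_s to r_e, not the patent's formula. It assumes a raised-cosine interpolation, evaluates the factor at the start time of each voiced section, and takes the start of the last voiced section as the point where r_e is reached, so that the first section receives r_s and the last receives r_e as described for c_1 and c_5. The default values 1.2 and 0.92 are the factors reported in the experimental example; the function names are hypothetical.

```python
import math

def expansion_factor(t, t_start, t_end, r_s=1.2, r_e=0.92):
    """Raised-cosine interpolation from r_s at t_start to r_e at t_end (assumed form)."""
    if t_end <= t_start:
        return r_s
    x = min(max((t - t_start) / (t_end - t_start), 0.0), 1.0)
    return r_e + (r_s - r_e) * 0.5 * (1.0 + math.cos(math.pi * x))

def stretched_lengths(voiced_sections, r_s=1.2, r_e=0.92):
    """Return the new length of each voiced section (start_ms, end_ms) in one segment."""
    if not voiced_sections:
        return []
    t_start = voiced_sections[0][0]
    t_end = voiced_sections[-1][0]  # assumption: r_e applies from the last section's start
    return [
        (end - start) * expansion_factor(start, t_start, t_end, r_s, r_e)
        for start, end in voiced_sections
    ]
```

Because the factor falls below 1 toward the end of the segment, the later voiced sections are shortened, which is what offsets the time added at the onset.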

[0028] Step S9: The speech data whose speech speed was converted in step S8 is written to the file 6.

[0029] Step S10: The processing buffer 5 is cleared.

[0030] Step S11: The process proceeds to this step when T_n ≤ Th1 in step S7, or after step S10 has been performed. When the determination in step S7 is negative, the voiced section is judged to lie within the current unit and is stored in the processing buffer 5. When the process has passed through step S10, the voiced section at the start of the voice onset is stored in the processing buffer 5. The input buffer 4 is then cleared for processing of the next speech data, and unless an instruction to end this processing has been issued (step S16), the process returns to step S1.

[0031] Step S12: Unvoiced sections are always transferred from the input buffer 4 to the processing buffer 5 and stored there. The input buffer 4 is then cleared, and the process returns to step S1 via step S16.

[0032] Step S13: When the section is a silent section, its length is compared with the threshold value Th2 used to detect a break between sentences (a full stop). If the silent section exceeds Th2, it is judged to be a break between sentences (a full stop) and the process proceeds to step S14; otherwise the process jumps to step S15.

[0033] Step S14: A silent section judged to be a sentence break is shortened by the following procedure.

[0034] To shorten the section to the minimum that causes no perceptual unnaturalness, the length of the shortened silent section is set to the threshold value Th3. Let a_n be the length of the silent section, d_n the length of the portion to be deleted, and e_n the length of the silent section after deletion. Then, as shown in FIG. 5(B),

e_n = a_n - d_n    (2)

Because a misspecified silence range at analysis time may cause even an unvoiced portion to be classified as part of a long silence, d_n is deleted not from the beginning of a_n but from the center of a_n, as shown in FIG. 5(A). In addition, a taper of several milliseconds is applied at both ends of d_n for smoothing, which prevents click noise. Since silence here includes sounds other than human speech, as noted above, this smoothing is useful.

[0035] In equation (2), the value of e_n may be set as a variable value in the range e_n ≥ Th3. However, if e_n is set to a constant value close to Th3 (for example, 862 ms) to simplify processing, then from equation (2) d_n becomes a variable value that depends on a_n. The process then proceeds to step S15.
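The shortening in steps S13 and S14 can be sketched on a list of samples as below. This is a minimal illustration under assumptions: the function name, the linear fade shape, and the 5 ms taper length are hypothetical ("several milliseconds" in the text), and the taper is applied to the retained samples on either side of the cut, which is one reading of tapering both ends of d_n. The defaults of 1000 ms for Th2 and 862 ms for e_n come from the experimental example.

```python
def shorten_silence(samples, sample_rate, th2_ms=1000.0, target_ms=862.0, taper_ms=5.0):
    """Shorten a silent section longer than th2_ms to target_ms by cutting its centre."""
    total = len(samples)                          # a_n, in samples
    duration_ms = 1000.0 * total / sample_rate
    keep = int(round(target_ms * sample_rate / 1000.0))  # e_n, in samples
    if duration_ms <= th2_ms or total <= keep:
        return list(samples)                      # not a sentence break: leave unchanged
    cut = total - keep                            # d_n = a_n - e_n
    cut_start = (total - cut) // 2                # centre the deleted span inside a_n
    left = list(samples[:cut_start])
    right = list(samples[cut_start + cut:])
    taper = min(int(taper_ms * sample_rate / 1000.0), len(left), len(right))
    for i in range(taper):
        gain = i / taper
        left[len(left) - 1 - i] *= gain           # fade out toward the cut
        right[i] *= gain                          # fade in after the cut
    return left + right
```

Cutting from the centre leaves the edges of the silent section untouched, so an unvoiced sound that was misclassified as the beginning or end of a long silence is not deleted.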

[0036] Step S15: The silent section is stored in the processing buffer 5. The input buffer 4 is cleared, and the process returns to step S1 via step S16.

[0037] Step S16: When no more speech signal data remains in the audio input circuit 1, or when a command to stop the work has been issued, this processing routine ends and control returns to the main standby routine or the like.

[0038] (3) Experimental example
In an experiment with this embodiment, the method was applied to a news passage of 136 seconds. The average speech speed was 9.6 morae per second, and on this basis the thresholds were set to Th1 = 350 ms and Th2 = Th3 = 1000 ms. Psychological experiments then yielded two findings. For speech speed control, high ratings for naturalness and intelligibility were obtained when the speech speed at the start point of a unit (the expansion factor of the voiced-section length) was 1.0 to 1.3 times that of the original speech and the speech speed at the end point was 0.9 to 1.0 times. For silent-section shortening, no perceptual unnaturalness arises as long as the shortened silent section (e_n) is at least 862 ms long.

[0039] Based on these results, it was confirmed that by varying the speech speed from a slow factor of 1.2 to a fast factor of 0.92 and shortening the long silent sections (the pauses between sentences) to 1200 ms, the utterance times of the original speech and the converted speech coincide and good speed-converted speech is obtained.
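The time balance behind this result can be checked with a short calculation: time added by expanding voiced sections must be offset by time removed from voiced sections with a factor below 1 and from long pauses. The sketch below is only an illustration; the function name is hypothetical, and the input durations used with it are made-up values, not data from the experiment.

```python
def duration_change_ms(voiced_ms, factors, pauses_ms, th2_ms, pause_target_ms):
    """Net change in total duration after conversion (positive means longer).

    voiced_ms: lengths of the voiced sections; factors: their expansion factors.
    pauses_ms: lengths of the silent sections; only those longer than th2_ms
    (and longer than the target) are shortened to pause_target_ms.
    """
    added = sum(length * (r - 1.0) for length, r in zip(voiced_ms, factors))
    removed = sum(
        pause - pause_target_ms
        for pause in pauses_ms
        if pause > th2_ms and pause > pause_target_ms
    )
    return added - removed
```

A result near zero means the converted speech occupies the same time as the original, which is the condition the experiment reports as satisfied.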

[0040] (4) Other embodiments
During the processing of step S8 (see FIG. 2) in the above embodiment, high sound quality can be maintained by processing the speech so that its pitch does not change even when the speech speed changes. A suitable method for this is, for example, the speech signal processing method disclosed in Japanese Patent Application No. H3-245960, "Speech-speed-controlled hearing aid method and apparatus".

[0041] In the above embodiment, the expansion factors r_s and r_e of the voiced-section length, the silent-section length e_n after deletion, and the like were predetermined constants, but they may instead be variable values that the user can set to desired values from a dial, a keyboard, or the like. This makes it easier, for example, to match a viewer's preferences or to perform editing that fits the speech exactly within a broadcast time slot.

[0042] Instead of the voiced-section expansion processing of the above embodiment, the pitch (fundamental frequency) of the speech may be detected directly by a known pitch extraction method, and the processing may slow the speech speed where the pitch is high and increase it where the pitch is low, in accordance with changes in the pitch.
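One way to picture this variant is a mapping from the detected fundamental frequency to an expansion factor. The text does not specify the mapping, so the linear form below, the function name, and the `f0_low` and `f0_high` bounds are assumptions for illustration; the default factors 1.2 and 0.92 are those of the experimental example.

```python
def pitch_to_factor(f0_hz, f0_low, f0_high, r_slow=1.2, r_fast=0.92):
    """Map a fundamental frequency to an expansion factor (assumed linear mapping).

    High pitch gives r_slow (expanded, slower speech); low pitch gives r_fast.
    """
    if f0_high <= f0_low:
        return 1.0  # degenerate range: leave the speech speed unchanged
    x = min(max((f0_hz - f0_low) / (f0_high - f0_low), 0.0), 1.0)
    return r_fast + (r_slow - r_fast) * x
```

In practice `f0_low` and `f0_high` would have to be chosen per speaker, for example from the observed pitch range of the utterance.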

[0043]

[Effects of the Invention] As described above, according to the present invention, when the speed at which the listened-to speech is uttered (the speech speed) is reduced, the speech speed is slowed where the pitch is high and increased where the pitch is low, in accordance with changes in the pitch (fundamental frequency) of the speech. In addition, the section from one voice onset to the next is taken as a unit, a slow speech speed is set at the start point of this section, and the speech speed is gradually increased toward its end point, following the rough change in the fundamental frequency of the speech. Furthermore, the silent sections between sentences are shortened to as short a time as possible within a range, determined in advance by experiment, that causes no perceptual unnaturalness. The speech can therefore be converted into good speech that is slow and easy to hear as a whole, while the utterance time is kept equal to that of the original speech.

[Brief description of the drawings]

[FIG. 1] A block diagram showing the apparatus configuration of an embodiment of the present invention.

[FIG. 2] A flowchart showing the processing of an embodiment of the present invention.

[FIG. 3] A diagram showing the segmentation of speech data based on the processing of an embodiment of the present invention.

[FIG. 4] A timing chart showing the change in speech speed in an embodiment of the present invention.

[FIG. 5] A waveform diagram showing the original waveform (A) and the waveform (B) in which a long silent section between sentences has been shortened, based on the processing of an embodiment of the present invention.

[Explanation of symbols]

1 Audio input circuit
2 CPU
3 PROM
4 Input buffer
5 Processing buffer
6 File
7 Audio output circuit
a_n Silent section
b_n Unvoiced section
c_n Voiced section
c_n' Expanded voiced section
B_n Section beginning with an unvoiced or silent portion and ending with a voiced portion
Th1 Threshold value for detecting a voice onset
Th2 Threshold value for detecting a break between sentences (full stop)
r_s Expansion factor of the voiced-section length at the start point
r_e Expansion factor of the voiced-section length at the end point

Continuation of front page

(56) References cited: JP H01-93795 A (JP, A); IEICE Technical Report SP92-56, "A method for absorbing the time expansion accompanying speech speed conversion"

(58) Fields searched (Int. Cl.7, DB name): G10L 21/04

Claims

(57) [Claims]

1. A speech speed conversion method characterized in that, when the speed at which listened-to speech is uttered (hereinafter, the speech speed) is reduced, the speech speed is slowed where the pitch is high and increased where the pitch is low, in accordance with changes in the pitch (fundamental frequency) of the speech.

2. A speech speed conversion method characterized in that, when the speech speed is reduced, the section from one voice onset to the next voice onset is taken as a unit, a slow speech speed is set at the start point of this section, and the speech speed is gradually increased toward its end point, following the rough change in the fundamental frequency of the speech.

3. The speech speed conversion method according to claim 1 or 2, characterized in that, when the speech speed is reduced, the silent sections between sentences are further shortened to as short a time as possible within a range, determined in advance by experiment, that causes no perceptual unnaturalness.

4. The speech speed conversion method according to claim 3, characterized in that the time that is as short as possible within the range, determined in advance by experiment, that causes no perceptual unnaturalness is in the range from 862 ms to approximately 1000 ms.

5. A speech speed conversion apparatus characterized by comprising: speech classification means for classifying a speech signal as voiced, unvoiced, or silent; silent-section determination means for determining whether a silent section identified by the speech classification means is a silent section between sentences; silent-section shortening means for shortening the silent section, when the silent-section determination means determines it to be a silent section between sentences, to as short a time as possible within a range, determined in advance by experiment, that causes no perceptual unnaturalness; voiced-section determination means for determining whether a voiced section identified by the speech classification means is one at which a voice onset begins; and voiced-section expansion means for performing, when the voiced-section determination means determines that a voice onset begins, speech speed conversion processing that takes the section from that voice onset to the next voice onset as a unit, sets a slow speech speed at the start point of the section, and gradually increases the speech speed toward its end point following the rough change in the fundamental frequency of the speech.