JPH09152889A - Speech speed transformer

Info

Publication number: JPH09152889A
Application number: JP7310242A
Authority: JP
Inventors: Koji Tanaka; 浩司田中; Masayuki Iida; 正幸飯田; Masanori Miyatake; 正典宮武
Original assignee: Sanyo Electric Co Ltd
Current assignee: Sanyo Electric Co Ltd
Priority date: 1995-11-29
Filing date: 1995-11-29
Publication date: 1997-06-10

Abstract

PROBLEM TO BE SOLVED: To obtain a speech speed transformer capable of producing output voice with little deterioration by providing specific speech speed converting means, specific similarity calculating means, and specific control means. SOLUTION: A section discrimination unit 42 classifies an input voice based on the similarity of waveforms in the vicinity of the input voice signal and sends the classification result to a signal processing unit 43. The input voice signal consists of voice sections and non-voice sections, and a voice section contains vowel sections, consonant sections, and transition sections from vowel to consonant or from consonant to vowel. When the section discrimination unit 42 judges a section of the input voice signal to be a transition section, a time-axis compression/expansion unit 51 compresses the signal at a reference compression rate α determined by the playback speed multiplier. The input voice signal of a vowel section is compressed at a compression rate αs smaller than the reference compression rate α, and the input voice signal of a consonant section at a compression rate αs larger than α. That is, during high-speed playback of a VTR or the like, the degree of compression is made large when the waveform similarity calculated in these sections is large, and small when the similarity is small.

Description

Detailed Description of the Invention

[0001]

[Technical field of the invention] The present invention relates to a speech speed conversion device that converts the speech speed of a voice signal. Such a device is used, for example, in audio playback equipment that provides fast or slow listening for sources accompanied by video, such as laser discs, VTRs, TVs, videophones, and digital video discs; in radios, telephones, and hearing aids with a hearing-assistance function that convert a voice signal into slow, easy-to-hear speech for hearing-impaired listeners; in English learning devices that convert English spoken at native speed into slow, easy-to-hear speech; and in tape recorders, stereo systems, CD players, MD players, voice guidance systems, and the like that provide fast or slow listening.

[0002]

[Prior art] Techniques that convert the speech speed of a voice signal by applying time-axis compression/expansion to the input voice signal have already been developed. Input voice is divided into voice sections and non-voice sections. A voice section is further divided into vowel sections, consonant sections, and transition sections between a vowel section and a consonant section. The voice signal in a vowel section is periodic, so time-axis compression/expansion introduces little processing distortion and the reproduced voice is rarely degraded. The voice signal in a consonant section, however, has a waveform close to white noise, so time-axis compression/expansion introduces considerable processing distortion and degrades the reproduced voice.

[0003]

[Problem to be solved by the invention] An object of the present invention is to provide a speech speed conversion device that produces output voice with little deterioration.

[0004]

[Means for solving the problem] The speech speed conversion device according to the present invention comprises: speech speed conversion means including time-axis compression/expansion means for performing time-axis compression/expansion processing on an input voice signal; similarity calculation means for calculating the similarity of waveforms in the vicinity of the input voice signal; and control means for controlling the speech speed conversion means based on the calculated similarity. The device may also be provided with determination means for determining whether the input voice signal belongs to a silent section, and with means for deleting the voice signal of any silent section whose duration is equal to or longer than a predetermined length.

[0005] As the control means, for example, one is used that controls the speech speed conversion means so that the smaller the waveform similarity of the input voice signal, the smaller the degree of compression or expansion by the time-axis compression/expansion means.

[0006] Alternatively, as the control means, for example, one is used that controls the speech speed conversion means so that the larger the waveform similarity of the input voice signal, the larger the degree of compression or expansion by the time-axis compression/expansion means.

[0007] Alternatively, as the control means, for example, one is used that classifies the input voice signal, based on the calculated similarity, into vowel sections, consonant sections, transition sections between a vowel section and a consonant section, and noise sections, and that controls the speech speed conversion means so that input voice in a vowel section is compressed or expanded to a large degree, input voice in a consonant section is compressed or expanded to a small degree, input voice in a transition section is compressed or expanded by the time-axis compression/expansion means to a medium degree, and input voice in a noise section is deleted.

[0008] Alternatively, as the control means, for example, one is used that classifies the input voice signal, based on the calculated similarity, into vowel sections, consonant sections, and transition sections between a vowel section and a consonant section, and that controls the speech speed conversion means so that, within a classified vowel section, consonant section, or transition section, the degree of compression or expansion of that section is large when the similarity is large and small when the similarity is small.

[0009] The smaller the similarity of waveforms in the vicinity of the input voice, the less periodic the input voice waveform is considered to be and the closer it is to white noise. Therefore, deterioration of the reproduced voice can be prevented by making the degree of compression or expansion by the time-axis compression/expansion means smaller for input voice whose neighboring similarity is smaller.

[0010]

[Embodiments of the invention] Embodiments of the present invention are described below with reference to the drawings.

[0011] FIG. 1 shows the configuration of the speech speed conversion device.

[0012] The speech speed conversion device comprises a voice signal input unit 41, a section discrimination unit 42, a signal processing unit 43, a voice memory 44, and a voice signal output unit 46. The signal processing unit 43 includes a time-axis compression/expansion unit 51, a deletion unit 52, and the like. The section discrimination unit 42 includes a waveform similarity calculation unit 61, a duration calculation unit 62, and a section classification unit 63.

[0013] The voice signal input unit 41 includes, for example, an amplification unit, an A/D conversion unit, and a frame memory. A signal input to the voice signal input unit 41 is amplified, converted into a digital signal, and stored in the frame memory. The output of the voice signal input unit 41 is sent to the section discrimination unit 42 and the signal processing unit 43. This embodiment shows the case where an analog signal is input to the speech speed conversion device, but a digital signal read from an IC memory or the like may be input instead. In that case, the voice signal input unit 41 does not need an A/D conversion unit.

[0014] The section discrimination unit 42 classifies the input voice based on the similarity of waveforms in the vicinity of the input voice signal and sends the classification result to the signal processing unit 43. The input voice signal consists of voice sections and non-voice sections. Voice sections comprise vowel sections, consonant sections, and transition sections from vowel to consonant or from consonant to vowel. Non-voice sections comprise noise sections, which are either stationary or non-stationary noise sections, and silent sections.

[0015] In a vowel section the voice signal waveform is periodic, so the similarity of neighboring waveforms is large. In a consonant section the voice signal waveform is close to white noise, so the similarity of neighboring waveforms is small. During a transition from consonant to vowel or from vowel to consonant, the similarity of neighboring waveforms lies between the similarity for a vowel section and that for a consonant section.

[0016] In a stationary noise section the signal waveform is periodic, so the similarity of neighboring waveforms is large, but that state lasts longer than in a vowel section. In a non-stationary noise section the similarity of neighboring waveforms is small, but that state lasts longer than in a consonant section.

[0017] The waveform similarity calculation unit 61 first calculates the similarity of neighboring waveforms and classifies it into three levels: large, medium, and small. The duration calculation unit 62 calculates how long the determined similarity state has continued. Based on the classification result from the waveform similarity calculation unit 61 and the duration calculated by the duration calculation unit 62, the section classification unit 63 classifies the input voice signal into stationary noise sections, non-stationary noise sections, vowel sections, transition sections, and consonant sections, and sends the classification result to the signal processing unit 43.

[0018] That is, when the similarity is medium, the input voice signal is determined to be a transition section. When the similarity is large and that state has lasted no longer than a predetermined time, the input voice signal is determined to be a vowel section. When the similarity is large and that state has lasted longer than the predetermined time, the input voice signal is determined to be a stationary noise section. When the similarity is small and that state has lasted no longer than the predetermined time, the input voice signal is determined to be a consonant section. When the similarity is small and that state has lasted longer than the predetermined time, the input voice signal is determined to be a non-stationary noise section.
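The decision rules of the section classification unit 63 can be sketched as a small lookup. This is a minimal illustration, not the patented implementation: the function name, the string labels, and the 0.3 s threshold are assumptions introduced here, and the text does not say whether the same predetermined time applies to the large and small similarity cases.

```python
from enum import Enum


class Section(Enum):
    VOWEL = "vowel"
    CONSONANT = "consonant"
    TRANSITION = "transition"
    STATIONARY_NOISE = "stationary noise"
    NONSTATIONARY_NOISE = "non-stationary noise"


def classify_section(similarity_level: str, duration_s: float,
                     max_speech_duration_s: float = 0.3) -> Section:
    """Map a similarity level and the duration of that state to a section type.

    similarity_level is "large", "medium", or "small".
    max_speech_duration_s stands in for the "predetermined time";
    its default value is a hypothetical placeholder.
    """
    if similarity_level == "medium":
        return Section.TRANSITION
    short = duration_s <= max_speech_duration_s
    if similarity_level == "large":
        return Section.VOWEL if short else Section.STATIONARY_NOISE
    if similarity_level == "small":
        return Section.CONSONANT if short else Section.NONSTATIONARY_NOISE
    raise ValueError(f"unknown similarity level: {similarity_level!r}")
```

A long-lasting state is treated as noise in both branches, which is the only use the text makes of the duration.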

[0019] The following methods can be used to calculate the similarity of neighboring waveforms.

[0020] (1) Calculating the amplitude difference between waveforms, or the square of that difference. In this case, the larger the difference, the smaller the similarity.

[0021] (2) Calculating the average of the amplitude differences between waveforms, or the square of that average. In this case, the larger the average difference, the smaller the similarity.

[0022] (3) Calculating the cross-correlation value between waveforms. In this case, the larger the cross-correlation value, the larger the similarity. The cross-correlation function R(τ) between a signal waveform x(t) and a signal waveform y(t) is calculated, for example, according to Equation 1 below.

[0023]

(Equation 1)

[0024] (4) Calculating, by frequency analysis, the difference between the spectra of the waveforms, or the square of that difference. In this case, the larger the difference, the smaller the similarity.

[0025] (5) Calculating, by frequency analysis, the difference in signal energy (power) between the waveforms, or the square of that difference. In this case, the larger the difference, the smaller the similarity.
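Two of the time-domain measures listed above can be sketched as follows. This is an illustration under stated assumptions: the frames are taken to be two adjacent, equal-length sample lists, the cross-correlation is evaluated at zero lag only, and it is normalized to the range [-1, 1], which the text does not require. The function names are hypothetical.

```python
import math


def mean_squared_difference(x, y):
    """Mean of squared amplitude differences, in the spirit of methods (1)/(2).

    A larger value means a SMALLER similarity.
    """
    if len(x) != len(y) or not x:
        raise ValueError("frames must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def normalized_cross_correlation(x, y):
    """Zero-lag cross-correlation, in the spirit of method (3).

    A larger value means a LARGER similarity. Normalizing by the
    frame energies is an assumption added for this sketch.
    """
    if len(x) != len(y) or not x:
        raise ValueError("frames must be non-empty and of equal length")
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den if den else 0.0
```

For a periodic signal, two frames one pitch period apart are identical, so the first measure is 0 and the second is 1.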

[0026] The signal processing unit 43 performs the following processing.

[0027] (A) First, the case is described in which the speech speed conversion device is used in a playback device such as a VTR and slows down the speech of audio played back at high speed.

[0028] (1) An input voice signal determined by the section discrimination unit 42 to be a stationary noise section or a non-stationary noise section is deleted by the deletion unit 52.

[0029] (2) An input voice signal determined by the section discrimination unit 42 to be a transition section is compressed by the time-axis compression/expansion unit 51 at a reference compression rate αo determined by the playback speed multiplier n. The reference compression rate αo is set to 1/n or more. If the input voice signal is compressed at a rate of 1/2 during double-speed VTR playback, every two pitch periods are thinned to one pitch period and the output speech runs at twice the standard speech speed. The pitch, however, remains unchanged.

[0030] When the playback speed is double speed, the reference compression rate αo is set to a value of 1/2 or more, for example 2/3. That is, every three pitch periods are thinned to two pitch periods. In this case the output speech speed is 3/2 times the standard speech speed. The pitch, however, remains unchanged.
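The arithmetic behind these examples is that scaling the duration by a rate multiplies the speech speed by the reciprocal of that rate. A minimal sketch with exact fractions follows; the function names are hypothetical, and the reciprocal relation is inferred here from the two worked examples in the text (1/2 gives 2x, 2/3 gives 3/2x).

```python
from fractions import Fraction


def output_speech_speed(time_scale: Fraction) -> Fraction:
    """Speech speed relative to standard after the duration is scaled by time_scale."""
    return 1 / time_scale


def min_reference_compression_rate(n: int) -> Fraction:
    """Lower bound 1/n on the reference compression rate for playback multiplier n."""
    return Fraction(1, n)


# Double-speed playback: the reference rate must be at least 1/2.
assert min_reference_compression_rate(2) == Fraction(1, 2)
# Rate 1/2 -> speech at 2x standard; rate 2/3 -> speech at 3/2x standard.
assert output_speech_speed(Fraction(1, 2)) == 2
assert output_speech_speed(Fraction(2, 3)) == Fraction(3, 2)
```

A rate above 1 lengthens the signal, so the same relation gives a speech speed below the standard speed.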

[0031] (3) An input voice signal determined by the section discrimination unit 42 to be a vowel section is compressed by the time-axis compression/expansion unit 51 at a compression rate αs smaller than the reference compression rate αo. That is, input voice determined to be a vowel section is compressed to a greater degree than input voice determined to be a transition section. Consequently, the output speech for a vowel section is faster than the output speech for a transition section.

[0032] (4) An input voice signal determined by the section discrimination unit 42 to be a consonant section is compressed by the time-axis compression/expansion unit 51 at a compression rate αs larger than the reference compression rate αo. That is, input voice determined to be a consonant section is compressed to a lesser degree than input voice determined to be a transition section. Consequently, the output speech for a consonant section is slower than the output speech for a transition section.

[0033] (B) Next, the case is described in which the speech speed conversion device is used in a hearing-assistance device and slows down the speech of audio played back at standard speed.

[0034] (1) An input voice signal determined by the section discrimination unit 42 to be a stationary noise section or a non-stationary noise section is deleted by the deletion unit 52.

[0035] (2) An input voice signal determined by the section discrimination unit 42 to be a transition section is expanded by the time-axis compression/expansion unit 51 at a reference expansion rate βo. The reference expansion rate βo is set to, for example, 3/2. That is, every two pitch periods are expanded to three pitch periods. In this case the output speech speed is 2/3 times the standard speech speed.

[0036] (3) An input voice signal determined by the section discrimination unit 42 to be a vowel section is expanded by the time-axis compression/expansion unit 51 at an expansion rate β1 larger than the reference expansion rate βo. β1 is set to, for example, 2/1. That is, one pitch period is expanded to two pitch periods, and the output speech speed in this case is 1/2 times the standard speech speed. In other words, input voice determined to be a vowel section is expanded to a greater degree than input voice determined to be a transition section. Consequently, the output speech for a vowel section is slower than the output speech for a transition section.

[0037] (4) Input voice determined by the section discrimination unit 42 to be a consonant section is expanded by the time-axis compression/expansion unit 51 at an expansion rate β2 smaller than the reference expansion rate βo. That is, input voice determined to be a consonant section is expanded to a lesser degree than input voice determined to be a transition section. Consequently, the output speech for a consonant section is faster than the output speech for a transition section.
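The per-section rules of cases (A) and (B) can be summarized as a table lookup. This is a sketch under assumptions: only αo = 2/3, βo = 3/2, and β1 = 2 appear as examples in the text, so the vowel and consonant compression rates and β2 below are hypothetical values chosen merely to respect the stated ordering. The mode names and function name are also introduced here.

```python
from fractions import Fraction

# Time-scale factors (output duration / input duration) per section.
# Values marked "hypothetical" are not given in the text.
COMPRESSION = {
    "vowel": Fraction(1, 2),       # hypothetical: smaller than alpha_o
    "transition": Fraction(2, 3),  # alpha_o example from the text
    "consonant": Fraction(4, 5),   # hypothetical: larger than alpha_o
}
EXPANSION = {
    "vowel": Fraction(2, 1),       # beta_1 example from the text
    "transition": Fraction(3, 2),  # beta_o example from the text
    "consonant": Fraction(5, 4),   # hypothetical beta_2: smaller than beta_o
}
NOISE = {"stationary noise", "non-stationary noise"}


def time_scale_for(section: str, mode: str):
    """Return the time-scale factor for a section, or None if it is deleted.

    mode is "fast_playback" (case A) or "hearing_aid" (case B).
    """
    if section in NOISE:
        return None  # noise sections are deleted in both cases
    if mode == "fast_playback":
        return COMPRESSION[section]
    if mode == "hearing_aid":
        return EXPANSION[section]
    raise ValueError(f"unknown mode: {mode!r}")
```

In both modes the vowel section receives the strongest modification and the consonant section the weakest, since periodic waveforms tolerate time-axis processing best.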

[0038] The examples of high-speed playback on a VTR or the like and of a hearing-assistance device have been described above. In the embodiment above, the degree of compression or expansion is constant within each classified vowel section, consonant section, and transition section between a vowel section and a consonant section. However, within a vowel section, a consonant section, or a transition section, the degree of compression or expansion may instead be varied based on the waveform similarity within that section.

[0039] That is, during high-speed playback of a VTR or the like, the degree of compression is increased if the waveform similarity calculated within the section is large and decreased if the similarity is small. In hearing-assistance use, the degree of expansion is increased if the similarity calculated within the section is large and decreased if the similarity is small.
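One way to realize this within-section variation is to interpolate the time-scale factor between a weak and a strong setting. The text only requires that the degree of modification grow with similarity; the linear mapping, the similarity range of 0 to 1, and the function name below are assumptions made for this sketch.

```python
def rate_from_similarity(similarity: float,
                         rate_at_low: float,
                         rate_at_high: float) -> float:
    """Linearly interpolate a time-scale factor from a similarity in [0, 1].

    rate_at_low is used at similarity 0 (weak compression or expansion),
    rate_at_high at similarity 1 (strong compression or expansion).
    Out-of-range similarities are clamped.
    """
    s = min(max(similarity, 0.0), 1.0)
    return rate_at_low + s * (rate_at_high - rate_at_low)


# High-speed playback: more similar -> smaller factor (stronger compression).
print(rate_from_similarity(0.5, 0.75, 0.5))   # 0.625
# Hearing assistance: more similar -> larger factor (stronger expansion).
print(rate_from_similarity(0.5, 1.25, 2.0))   # 1.625
```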

[0040] Time-axis compression/expansion methods that can be used in the time-axis compression/expansion unit 51 include, for example, the pointer interval control overlap-and-add method (Pointer Interval Control Overlap and Add: PICOLA) and the TDHS (Time Domain Harmonic Scaling) method.

[0041] A method of compressing an input signal (the voice data input to the time-axis compression/expansion unit 51) at a compression rate of 2/3 using PICOLA is briefly described with reference to FIG. 2. First, the pitch period is extracted from the input signal. Let the extracted pitch period be Tp. Waveform A is multiplied by a weight that falls linearly from 1 to 0 (weighting function K1) to produce waveform A'. Waveform B is multiplied by a weight that rises from 0 to 1 (weighting function K2) to produce waveform B'.

[0042] Waveforms A' and B' are then added together to produce a waveform A'*B' of length Tp. The weights are applied to preserve continuity at the junctions before and after waveform A'*B'. Next, the pointer is moved by 3Tp, a length determined by the compression rate, and the same operation is repeated. As a result, the two waveforms A'*B' and C are obtained from the three waveforms A, B, and C. In this way a signal of three pitch periods is compressed into a signal of two pitch periods.
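The overlap-add step described above can be sketched as follows. This is a simplified illustration, not the full PICOLA algorithm: pitch extraction is omitted and a constant, known pitch period in samples is assumed, and the function and parameter names are introduced here. With `copy_periods=1` each block of three pitch periods (A, B, C) becomes two (A'*B', C), which is the 2/3 compression of the text.

```python
def picola_compress(signal, pitch_period, copy_periods=1):
    """Compress a signal by cross-fading adjacent pitch periods.

    For every block of (2 + copy_periods) pitch periods, the first two
    periods A and B are cross-faded into one period, and the remaining
    copy_periods periods are copied unchanged. The resulting rate is
    (1 + copy_periods) / (2 + copy_periods).
    """
    tp = pitch_period
    block = (2 + copy_periods) * tp
    out = []
    pos = 0  # the "pointer" into the input signal
    while pos + block <= len(signal):
        a = signal[pos:pos + tp]
        b = signal[pos + tp:pos + 2 * tp]
        for i in range(tp):
            k2 = i / tp        # weight K2 rises from 0 towards 1
            k1 = 1.0 - k2      # weight K1 falls from 1 towards 0
            out.append(k1 * a[i] + k2 * b[i])
        out.extend(signal[pos + 2 * tp:pos + block])  # waveform C, unchanged
        pos += block           # move the pointer by 3*Tp when copy_periods == 1
    out.extend(signal[pos:])   # leftover tail shorter than one block
    return out


# A signal with a pitch period of 4 samples, 6 periods long.
periodic = [0, 1, 0, -1] * 6
shorter = picola_compress(periodic, pitch_period=4)
print(len(periodic), len(shorter))  # 24 16
```

Because the cross-fade starts at A and ends at B, the result joins smoothly to the preceding output and to C, and a perfectly periodic input is reproduced with the same period, so the pitch is preserved.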

[0043] The output of the signal processing unit 43 is temporarily stored in the voice memory 44 and then sent to the voice signal output unit 46 for output. The voice signal output unit 46 includes a D/A conversion unit. The digital signal sent from the voice memory 44 to the voice signal output unit 46 is converted into an analog signal and output from the voice signal output unit 46. This embodiment shows the case where the speech speed conversion device outputs the voice signal as an analog signal, but the device may instead output the voice signal as a digital signal. In that case, the voice signal output unit 46 does not need a D/A conversion unit.

[0044] When the input and output signals of the speech speed conversion device are both analog, the sampling frequency of the D/A conversion unit in the voice signal output unit 46 is set to the standard sampling frequency f_SO, and the sampling frequency of the A/D conversion unit in the voice signal input unit 41 is set to n·f_SO, where n is the playback speed multiplier. Therefore, even during high-speed playback, the output voice keeps its original pitch.

[0045] When the input and output signals of the speech speed conversion device are both digital, the rate at which data is input to the voice signal input unit 41 is set to n times the rate at which data is output from the voice signal output unit 46, where n is the playback speed multiplier. Therefore, even during high-speed playback, the output voice keeps its original pitch.

[0046] It may also be determined whether a silent section exists within a non-voice section, and the voice signal of any silent section whose duration is equal to or longer than a predetermined length may be deleted. This has the advantage of limiting the growth of data accumulated in the voice memory and reducing memory overflow.

[0047] Whether a section is silent is determined by whether the average power of a predetermined number of voice data samples from the voice signal input unit 41 is smaller than a given threshold. That is, if the average power is smaller than the given threshold, the section is determined to be silent.

[0048] More specifically, the average power value P of a predetermined number of voice data samples from the voice signal input unit 41 is calculated. With the amplitudes of the sampled voice data denoted i_0, i_1, ..., i_{N-1} (N being the predetermined number of samples), the average power value P is calculated according to Equation 2 below.

[0049]

(Equation 2)
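The silence test and the deletion of long silent sections can be sketched as below. Equation 2 itself is not reproduced in this text, so the sketch assumes the usual mean-square definition of average power over the N amplitudes; the threshold, the frame-based representation, and the function names are likewise assumptions.

```python
def average_power(samples):
    """Mean of squared amplitudes (assumed form of the average power P)."""
    return sum(s * s for s in samples) / len(samples)


def is_silent(samples, threshold):
    """A frame is silent when its average power is below the threshold."""
    return average_power(samples) < threshold


def drop_long_silences(frames, threshold, min_run):
    """Delete runs of at least min_run consecutive silent frames.

    Shorter silent runs are kept, since the text deletes only silent
    sections whose duration reaches the predetermined length.
    """
    kept, run = [], []
    for frame in frames:
        if is_silent(frame, threshold):
            run.append(frame)
            continue
        if len(run) < min_run:
            kept.extend(run)
        run = []
        kept.append(frame)
    if len(run) < min_run:
        kept.extend(run)
    return kept
```

Dropping the long runs before storage is what keeps the amount of data accumulated in the voice memory from growing.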

[0050] A silent section whose duration is shorter than the predetermined length may be speech-speed converted after its similarity has been calculated, or may be speech-speed converted at a fixed, predetermined compression/expansion rate.

[0051]

[Effects of the invention] According to the present invention, output voice with little deterioration is obtained.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the schematic configuration of the speech speed conversion device.

FIG. 2 is a schematic diagram for explaining the time-axis compression/expansion method using PICOLA.

[Explanation of symbols]

41 voice signal input unit
42 section discrimination unit
43 signal processing unit
44 voice memory
46 voice signal output unit
51 time-axis compression/expansion unit
52 deletion unit
61 waveform similarity calculation unit
62 duration calculation unit
63 section classification unit

Claims

[Claims]

1. A speech speed conversion device comprising: speech speed conversion means including time-axis compression/expansion means for performing time-axis compression/expansion processing on an input voice signal; similarity calculation means for calculating the similarity of waveforms in the vicinity of the input voice signal; and control means for controlling the speech speed conversion means based on the calculated similarity.

2. The speech speed conversion device according to claim 1, wherein the control means controls the speech speed conversion means such that the smaller the waveform similarity of the input voice signal, the smaller the degree of compression or expansion by the time-axis compression/expansion means.

3. The speech speed conversion device according to claim 1, wherein the control means controls the speech speed conversion means such that the larger the waveform similarity of the input voice signal, the larger the degree of compression or expansion by the time-axis compression/expansion means.

4. The speech speed conversion device according to claim 1, wherein the control means classifies the input voice signal, based on the calculated similarity, into a vowel section, a consonant section, a transition section between a vowel section and a consonant section, and a noise section, and controls the speech speed conversion means such that input voice in the vowel section is compressed or expanded to a large degree, input voice in the consonant section is compressed or expanded to a small degree, input voice in the transition section is compressed or expanded by the time-axis compression/expansion means to a medium degree, and input voice in the noise section is deleted.

5. The control means classifies the input voice signal into a vowel section, a consonant section, and a transition section between the vowel section and the consonant section based on the calculated similarity, and controls the speech speed conversion means such that, within the classified vowel section, consonant section, or transition section between the vowel section and the consonant section, the degree of compression or expansion of the section is large when the similarity is large and small when the similarity is small.
The speech speed conversion device described in.

6. The speech speed conversion device according to any one of claims 1, 3, 4, and 5, further comprising: determination means for determining whether or not the input voice signal is voice in a silent section; and means for deleting the voice signal of a silent section whose duration is equal to or longer than a predetermined length.