JP4313724B2

JP4313724B2 - Audio reproduction speed adjustment method, audio reproduction speed adjustment program, and recording medium storing the same

Info

Publication number: JP4313724B2
Application number: JP2004147803A
Authority: JP
Inventors: 敏高橋
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2004-05-18
Filing date: 2004-05-18
Publication date: 2009-08-12
Anticipated expiration: 2024-05-18
Also published as: JP2005331588A

Abstract

PROBLEM TO BE SOLVED: To provide a method and a program that adjust the reproduction speed of recorded speech without losing the naturalness of the original speech or the individuality of the speaker, and to provide a recording medium storing the program.

SOLUTION: The audio reproduction speed adjustment method includes a speech input step 22; a spectral feature vector sequence calculation step 23 that converts the speech signal into a spectral feature vector sequence; a spectral feature change vector sequence calculation step 24 that calculates spectral feature change vectors from the spectral feature vectors; and a spectral change amount calculation step 25 that calculates a spectral change amount from the spectral feature change vectors within a fixed time window while moving the time window along the time axis from the beginning to the end of the utterance. The reproduction speed is adjusted by cutting out and reproducing the input speech waveform corresponding to sections in which the spectral change amount exceeds a threshold, and by changing the threshold.

COPYRIGHT: (C)2006,JPO&NCIPI

Description

The present invention relates to an audio reproduction speed adjustment method, an audio reproduction speed adjustment program, and a recording medium storing the program. In particular, it relates to a method, program, and recording medium that, when reproducing recorded speech, adjust the reproduction speed by signal processing while maintaining the naturalness of the speech.

Conventional audio reproduction speed adjustment methods, that is, methods for fast listening and slow listening to reproduced speech, include the following.
As a first method, the sampling rate used when reproducing a recorded speech signal is set to a value different from the rate used at recording (see Patent Document 1). For example, if speech data recorded at 8000 samples per second is reproduced at 16000 samples per second, the reproduction speed is doubled. In the case of analog tape recording, a faster reproduction speed is obtained by running the tape faster at playback than at recording. However, this method changes the pitch frequency, which determines the perceived pitch of the reproduced voice. If the playback sampling rate is raised or lowered greatly relative to the recording rate, the naturalness of the speech is lost to the point that it becomes difficult even to identify the speaker.
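The arithmetic behind the first method can be sketched as follows. This is only an illustration: the function name `playback_effect`, the 10-second clip length, and the 120 Hz pitch value are assumptions chosen for the example and do not come from the patent.

```python
def playback_effect(recorded_rate_hz: float, playback_rate_hz: float,
                    num_samples: int, pitch_hz: float) -> dict:
    """Effect of replaying the same samples at a different sampling rate."""
    speed = playback_rate_hz / recorded_rate_hz  # speed-up factor
    return {
        "speed_factor": speed,
        "original_duration_s": num_samples / recorded_rate_hz,
        "playback_duration_s": num_samples / playback_rate_hz,
        # Every frequency in the signal, including the pitch, scales with speed.
        "perceived_pitch_hz": pitch_hz * speed,
    }


# Hypothetical clip: 80000 samples recorded at 8000 samples/s, pitch 120 Hz.
result = playback_effect(8000, 16000, num_samples=80000, pitch_hz=120.0)
print(result)
```

Under these assumptions the clip plays twice as fast, and the pitch is doubled along with it, which is the drawback the text describes.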

As a second method, silent sections between utterances are detected, and the speech is reproduced while skipping these silent sections (see Patent Document 2). This second method does not impair the naturalness of the speech, but it cannot reproduce the speech portions themselves at a higher speed.
As a third method, portions of the speech with low power are skipped and only portions with high power are reproduced. However, this third method has the problem that low-power consonants are dropped, making the reproduced speech hard to understand.
Patent Document 1: JP 05-316597 A
Patent Document 2: JP 05-080796 A

The present invention calculates a spectral change amount that indicates how much the speech spectrum is changing at each time of the input speech, and uses this spectral change amount to decide which sections to reproduce, which sections to skip, and which sections to lengthen. With this configuration, the invention provides an audio reproduction speed adjustment method, an audio reproduction speed adjustment program, and a recording medium storing the program, which adjust the reproduction speed of recorded speech without losing the naturalness of the original speech or the individuality of the speaker.

Claim 1: An audio reproduction speed adjustment method is configured that has a speech input step of inputting a speech signal; a spectral feature vector sequence calculation step of converting the speech signal into a spectral feature vector sequence; a spectral feature change vector sequence calculation step of calculating spectral feature change vectors from the spectral feature vectors; and a spectral change amount calculation step of calculating a spectral change amount from the spectral feature change vectors within a fixed time window while moving the time window along the time axis from the beginning to the end of the utterance. The method cuts out the input speech waveform corresponding to sections in which the spectral change amount exceeds a threshold, deletes the input speech waveform corresponding to sections in which the spectral change amount does not exceed the threshold, reproduces the result, and adjusts the audio reproduction speed by changing the threshold.

Claim 2: An audio reproduction speed adjustment method is configured that has a speech input step of inputting a speech signal; a spectral feature vector sequence calculation step of converting the speech signal into a spectral feature vector sequence; a spectral feature change vector sequence calculation step of calculating spectral feature change vectors from the spectral feature vectors; and a spectral change amount calculation step of calculating a spectral change amount from the spectral feature change vectors within a fixed time window while moving the time window along the time axis from the beginning to the end of the utterance. In sections in which the spectral change amount is smaller than a threshold, the method repeats a part of the corresponding input speech waveform; in sections in which the spectral change amount is not smaller than the threshold, it keeps the waveform as it is; it then reproduces the result and adjusts the audio reproduction speed by changing the threshold.

Claim 3: In the audio reproduction speed adjustment method according to either claim 1 or claim 2, an audio reproduction speed adjustment method is configured in which the spectral change amount is a dynamic measure.
Claim 4: An audio reproduction speed adjustment program is configured that instructs a computer to input a speech signal; convert the speech signal into a spectral feature vector sequence; calculate spectral feature change vectors from the spectral feature vectors; calculate a spectral change amount from the spectral feature change vectors within a fixed time window while moving the time window along the time axis from the beginning to the end of the utterance; cut out the input speech waveform corresponding to sections in which the spectral change amount exceeds a threshold; delete the input speech waveform corresponding to sections in which the spectral change amount does not exceed the threshold; reproduce the result; and adjust the audio reproduction speed by changing the threshold.

Claim 5: An audio reproduction speed adjustment program is configured that instructs a computer to input a speech signal; convert the speech signal into a spectral feature vector sequence; calculate spectral feature change vectors from the spectral feature vectors; calculate a spectral change amount from the spectral feature change vectors within a fixed time window while moving the time window along the time axis from the beginning to the end of the utterance; repeat a part of the corresponding input speech waveform in sections in which the spectral change amount is smaller than a threshold; keep the waveform as it is in sections in which the spectral change amount is not smaller than the threshold; reproduce the result; and adjust the audio reproduction speed by changing the threshold.

Claim 6: A recording medium is configured that stores the audio reproduction speed adjustment program according to either claim 4 or claim 5.

According to the present invention, the sections in which speech is deleted and the sections in which speech is inserted are determined by focusing on the fact that the spectral change amount is related to human speech perception. Fast listening and slow listening therefore do not cause a phoneme to be misheard as a different phoneme, and do not impair naturalness. Unlike compressing or expanding the time axis of the speech waveform, the speech is reproduced with its pitch frequency preserved, so the individuality of the speaker is not impaired either. Furthermore, even if stationary noise is superimposed on the speech, the spectral change of stationary noise is very small compared with that of speech, so the spectral change of the speech can still be observed and the detection of spectral change sections is not affected. FIG. 6 illustrates this; it shows the observation result when "baNgumiaNnai" is uttered in the presence of stationary noise. Although low-level stationary noise is present near the time axis, the spectral change amount of the noise itself is small, and it can be seen that the spectral change amount becomes large only at phoneme boundaries.

Each vowel and consonant phoneme has typical spectral features. Humans basically listen to speech by perceiving these spectral differences. However, detailed speech perception experiments have reported that humans are very sensitive to the regions near phoneme boundaries, where the spectrum changes rapidly, and that they perceive phonemes on the basis of this spectral change (reference: Furui, "Role of spectral change information in speech perception", Acoustical Society of Japan, Technical Committee on Hearing, H85-6, 1985). The present invention makes use of this human perceptual characteristic. That is, the spectral feature vectors of the original speech are calculated, and the spectral change amount is calculated from these feature vectors. To make the reproduction speed faster than that of the original speech, the speech waveform corresponding to portions with a large spectral change amount is kept, and the speech waveform corresponding to portions with a small spectral change is deleted before reproduction. Portions with a large spectral change amount are mainly phoneme boundaries and plosive sections. Portions with a small spectral change amount are the steady parts of vowels and of consonants such as fricatives, and silent sections. Silent sections also show no spectral change, so they are skipped by this method.
How much the reproduction speed is increased is controlled by setting a threshold on the spectral change amount and reproducing the speech waveform of the sections at or above the threshold. If the threshold is raised, only the portions with more intense spectral change are reproduced, and the reproduction speed increases. If the threshold is lowered, sections with relatively small spectral change are reproduced as well, so the reproduction speed approaches that of the original speech.
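The relation between the threshold and the reproduction speed can be illustrated with a small sketch. The per-frame values in `change` are made up for the example, and the helper names `kept_fraction` and `speed_factor` are hypothetical; the sketch only assumes that frames at or above the threshold are reproduced and all others are dropped.

```python
from typing import List


def kept_fraction(change: List[float], threshold: float) -> float:
    """Fraction of frames whose spectral change amount is at or above the threshold."""
    kept = sum(1 for d in change if d >= threshold)
    return kept / len(change)


def speed_factor(change: List[float], threshold: float) -> float:
    """Approximate speed-up: original duration divided by reproduced duration."""
    frac = kept_fraction(change, threshold)
    return float("inf") if frac == 0 else 1.0 / frac


# Hypothetical spectral change amounts, one per frame, with two peaks
# standing in for phoneme boundaries.
change = [0.1, 0.2, 2.5, 3.0, 0.3, 0.1, 0.1, 1.8, 2.2, 0.2]

for th in (0.0, 1.0, 2.0, 2.4):
    print(f"threshold={th}: kept={kept_fraction(change, th):.1f}, "
          f"speed x{speed_factor(change, th):.2f}")
```

With these toy values, a threshold of 0 keeps every frame (original speed), while raising the threshold keeps fewer frames and so shortens the reproduced speech.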

On the other hand, to make the reproduction speed slower than that of the original speech, the duration of sections with a small spectral change amount is lengthened. A section is lengthened by inserting a representative waveform of that section. If the waveform were simply stretched along the time axis, the pitch frequency of the speech would change, just as when a tape is played back slowly, and the naturalness of the speech would be lost. Sections with a large spectral change amount are very important for phoneme perception, so they are kept and reproduced as they are, without any processing of the waveform.
As described above, the present invention adjusts the reproduction speed by deleting and adding speech waveforms on the basis of human speech perception characteristics, so the audio reproduction speed can be adjusted without impairing the naturalness of the speech or the identity of the speaker.

The best mode for carrying out the invention will now be described in detail with reference to FIG. 1.
In FIG. 1, the speech waveform, the spectrum, and the spectral change amount are displayed in that order from the bottom. The utterance shown is "baNgumiaNnai" spoken by an adult male. The solid vertical lines indicate phoneme boundaries assigned by visual inspection. As the spectrum makes clear, each phoneme has its own spectral pattern; consequently, the spectral change amount is large at the phoneme boundaries. FIG. 4 and FIG. 5 show the waveform, spectrum, and spectral change amount when the same word "baNgumiaNnai" is spoken slowly. Comparing FIG. 4 and FIG. 5 with FIG. 1 shows that the steady parts of the spectrum last longer when the word is spoken slowly. The present invention realizes fast listening by finding steady parts with little spectral change from the time pattern of the spectral change amount and deleting the corresponding waveform. Conversely, it realizes slow listening by lengthening the steady parts of the normal utterance of FIG. 1. In either case, the phoneme boundaries with large spectral change near the solid vertical lines are preserved.

A first embodiment will be described with reference to FIG. 2. The first embodiment performs fast listening, in which the reproduction speed is made faster than that of the original speech.
In a speech input step 22, a speech signal is input through an acoustic-to-electric transducer such as a microphone. In a spectral feature vector sequence calculation step 23, the input speech signal is cut out with a 30 ms time window and subjected to spectral analysis. The spectral analysis may use either linear prediction (LPC) based on an all-pole model or the FFT. The time window is then moved with a shift width of 10 ms, and spectral analysis is performed on the speech signal along the time axis. As a result, a spectral feature vector representing the shape of the spectrum is calculated every 10 ms. For example, the LPC cepstrum, FFT cepstrum, LPC spectrum, or FFT spectrum, or the mel spectrum or mel cepstrum obtained by expressing the frequency axis of these spectra on a logarithmic scale, is calculated as the spectral feature vector. Next, in a spectral feature change vector sequence calculation step 24, a new 90 ms time window is applied to the spectral feature vector sequence, and the change vector of the spectral feature vectors within that window is calculated. For example, the first-order linear regression coefficients of the spectral feature vector sequence within the 90 ms window are used. This gives the slope of the change pattern of the feature vectors within the 90 ms time window. When the spectral change is large, the absolute value of the regression coefficient is also large. The regression coefficient is calculated independently for each dimension of the spectral feature vector. Instead of the regression coefficients, the following difference value, which requires less computation, can be used.

$$\Delta x_i(t) = x_i(t + \Delta t) - x_i(t - \Delta t)$$

Here, $x_i(t)$ denotes the i-th dimension of the spectral feature vector at time t, and $\Delta t$ is half the width of the time window over which the change is calculated. This difference value also represents the slope of the spectral change. The spectral change amount is calculated from the regression coefficient vector or from the difference vector. As the spectral change amount, for example, the dynamic measure calculated by the following formula is used (reference: Sagayama and Itakura, "Individuality information contained in the dynamic measure of speech", Proceedings of the Acoustical Society of Japan Spring Meeting, 1979, 3-2-7, pp. 589-590).

$$D(t) = \sum_{i=1}^{P} \left(\Delta x_i(t)\right)^2$$

Here, P denotes the number of dimensions of the vector. This value can be regarded as a scalar quantity indicating the spectral change centered on time t. It is calculated in a spectral change amount calculation step 25, which in this specific example is a dynamic measure calculation step. Because the spectral change is intense at phoneme boundaries, the dynamic measure peaks near phoneme boundaries. As described above, sections in which the dynamic measure is large, that is, sections with intense spectral change, are important for human speech perception, so these sections are located in a spectral change section detection step 26. A threshold for the dynamic measure is given to the spectral change section detection step 26, and sections at or above the threshold are detected. If the threshold is set high, the detected sections are those with more intense spectral change, so the proportion of the whole that is reproduced decreases and fast listening becomes faster. If the threshold is set low, the speed becomes closer to that of the original speech. The input speech is held in an input speech buffer 27. The speech waveform is read from this buffer, the waveform corresponding to the sections detected in the spectral change section detection step 26 is cut out in a speech cutting step 28 and sent to a speech reproduction step 29, and the speech is reproduced, which completes the process.
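The steps from the spectral feature vector sequence to the cut-out waveform can be sketched as below. This is a minimal illustration under stated assumptions, not the patent's implementation: the feature vectors are taken as already computed (one per 10 ms shift), frame t is mapped to samples `[t * hop, (t + 1) * hop)`, frames beyond the ends of the sequence are replaced by the first or last frame, and all function names are hypothetical. `HALF = 4` corresponds to a 90 ms window at a 10 ms shift (nine frames), and `HOP = 80` to a 10 ms shift at an assumed 8000 samples per second.

```python
from typing import List, Tuple

Vector = List[float]


def frame_at(frames: List[Vector], idx: int) -> Vector:
    """Return frames[idx], repeating the edge frames outside the sequence (assumption)."""
    return frames[min(max(idx, 0), len(frames) - 1)]


def delta_regression(frames: List[Vector], t: int, half: int) -> Vector:
    """First-order regression slope per dimension over frames t-half .. t+half (half >= 1)."""
    offsets = range(-half, half + 1)
    denom = sum(k * k for k in offsets)
    return [sum(k * frame_at(frames, t + k)[i] for k in offsets) / denom
            for i in range(len(frames[0]))]


def delta_difference(frames: List[Vector], t: int, half: int) -> Vector:
    """Simpler alternative: x_i(t + dt) - x_i(t - dt) for every dimension i."""
    ahead, behind = frame_at(frames, t + half), frame_at(frames, t - half)
    return [a - b for a, b in zip(ahead, behind)]


def dynamic_measure(delta: Vector) -> float:
    """D(t): sum over dimensions of the squared change components."""
    return sum(d * d for d in delta)


def change_intervals(measure: List[float], threshold: float) -> List[Tuple[int, int]]:
    """Half-open frame ranges [start, end) where the measure is at or above the threshold."""
    intervals, start = [], None
    for t, d in enumerate(measure):
        if d >= threshold and start is None:
            start = t
        elif d < threshold and start is not None:
            intervals.append((start, t))
            start = None
    if start is not None:
        intervals.append((start, len(measure)))
    return intervals


def cut_waveform(samples: List[float], intervals: List[Tuple[int, int]], hop: int) -> List[float]:
    """Concatenate only the samples that belong to the detected intervals."""
    out: List[float] = []
    for start, end in intervals:
        out.extend(samples[start * hop:end * hop])
    return out


HOP, HALF = 80, 4
# Toy 1-dimensional features: a step between two steady "phonemes" at frame 10.
frames = [[0.0]] * 10 + [[1.0]] * 10
samples = list(range(len(frames) * HOP))  # stand-in waveform

measure = [dynamic_measure(delta_difference(frames, t, HALF)) for t in range(len(frames))]
intervals = change_intervals(measure, threshold=0.5)
fast = cut_waveform(samples, intervals, HOP)
print(intervals, len(fast), "of", len(samples), "samples kept")
```

In this toy input only the frames around the step at frame 10 are kept, while the steady frames on either side are dropped. Swapping `delta_difference` for `delta_regression` gives the regression-based variant; the threshold then needs to be rescaled because the slope values are smaller.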

In the above description, specific values were given for the time window widths, but this is only to make the description easier to follow; the time window widths are not limited to these values.
A second embodiment will be described with reference to FIG. 3. The second embodiment performs slow listening, in which the reproduction speed is made slower than that of the original speech.
The key point of the second embodiment is that the waveform of sections with small spectral change is repeated. The processing from speech input up to a spectral change amount calculation step 35, which calculates the spectral change amount (dynamic measure), is the same as in the first embodiment. In the second embodiment, a spectral change section detection step 36 finds sections in which the dynamic measure is small, that is, sections with small spectral change. The input speech is held in an input speech buffer 37. The input speech waveform is read from this buffer, and in each section detected in the spectral change section detection step 36, a representative waveform located near the center of the section is cut out and repeated at the center of the section. The number of repetitions is determined, for example, so as to be proportional to the length of the detected section. This operation is performed in a speech waveform repetition step 38. The processed speech waveform is reproduced in a speech reproduction step 39, which completes the process. If speech in a section with large spectral change were processed, for example by stretching it or inserting speech into it, speech perception would be affected, and the speech could be heard as a different sound or lose its naturalness. Because the second embodiment processes only sections with small spectral change, the naturalness of the reproduced speech is not impaired. In the above description, specific values were given for the time window widths, but this is only to make the description easier to follow; the time window widths are not limited to these values.
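The repetition of a representative waveform in low-change sections can be sketched as follows. The names `low_change_intervals` and `stretch_waveform` and the parameters `chunk` and `frames_per_repeat` are hypothetical, and the dynamic measure is assumed to be already computed with one value per `hop` samples. The chunk length here is arbitrary; a practical implementation would probably choose a whole number of pitch periods to avoid audible discontinuities, which this sketch does not attempt.

```python
from typing import List, Tuple


def low_change_intervals(measure: List[float], threshold: float) -> List[Tuple[int, int]]:
    """Half-open frame ranges [start, end) where the measure is below the threshold."""
    intervals, start = [], None
    for t, d in enumerate(measure):
        if d < threshold and start is None:
            start = t
        elif d >= threshold and start is not None:
            intervals.append((start, t))
            start = None
    if start is not None:
        intervals.append((start, len(measure)))
    return intervals


def stretch_waveform(samples: List[float], measure: List[float], threshold: float,
                     hop: int, chunk: int, frames_per_repeat: int) -> List[float]:
    """Lengthen low-change sections by repeating a chunk taken near each section center."""
    out: List[float] = []
    pos = 0  # next input sample not yet copied to the output
    for start, end in low_change_intervals(measure, threshold):
        center = (start + end) * hop // 2
        c0 = max(start * hop, center - chunk // 2)
        c1 = min(end * hop, c0 + chunk)
        representative = samples[c0:c1]
        # Number of extra copies grows with the section length (integer division).
        repeats = (end - start) // frames_per_repeat
        out.extend(samples[pos:c1])           # original signal up to the chunk end
        out.extend(representative * repeats)  # inserted copies of the chunk
        pos = c1
    out.extend(samples[pos:])                 # remainder, including high-change parts
    return out


HOP = 10
# Toy dynamic measure: frames 4 and 5 stand in for a phoneme boundary.
measure = [0.0] * 4 + [5.0] * 2 + [0.0] * 6
samples = list(range(len(measure) * HOP))  # stand-in waveform

slow = stretch_waveform(samples, measure, threshold=1.0,
                        hop=HOP, chunk=10, frames_per_repeat=2)
print(len(samples), "->", len(slow), "samples")
```

In this toy case the two steady sections gain 2 and 3 extra copies of a 10-sample chunk, and the samples of the high-change frames pass through unchanged.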

FIG. 1 is a diagram explaining the principle of the present invention.
FIG. 2 is a diagram explaining the first embodiment.
FIG. 3 is a diagram explaining the second embodiment.
FIG. 4 is a diagram explaining the principle of the present invention.
FIG. 5 is a diagram explaining the principle of the present invention.
FIG. 6 is a diagram showing the observation result when "baNgumiaNnai" is uttered in the presence of stationary noise.

Explanation of symbols

22, 32: Speech input step
23, 33: Spectral feature vector sequence calculation step
24, 34: Spectral feature change vector sequence calculation step
25, 35: Spectral change amount calculation step
26, 36: Spectral change section detection step
27, 37: Input speech buffer
28: Speech cutting step
38: Speech waveform repetition step
29, 39: Speech reproduction step

Claims

Claim 1: An audio reproduction speed adjustment method comprising: a speech input step of inputting a speech signal; a spectral feature vector sequence calculation step of converting the speech signal into a spectral feature vector sequence; a spectral feature change vector sequence calculation step of calculating spectral feature change vectors from the spectral feature vectors; and a spectral change amount calculation step of calculating a spectral change amount from the spectral feature change vectors within a fixed time window while moving the time window along the time axis from the beginning to the end of the utterance, wherein the input speech waveform corresponding to sections in which the spectral change amount exceeds a threshold is cut out, the input speech waveform corresponding to sections in which the spectral change amount does not exceed the threshold is deleted, the result is reproduced, and the audio reproduction speed is adjusted by changing the threshold.

Claim 2: An audio reproduction speed adjustment method comprising: a speech input step of inputting a speech signal; a spectral feature vector sequence calculation step of converting the speech signal into a spectral feature vector sequence; a spectral feature change vector sequence calculation step of calculating spectral feature change vectors from the spectral feature vectors; and a spectral change amount calculation step of calculating a spectral change amount from the spectral feature change vectors within a fixed time window while moving the time window along the time axis from the beginning to the end of the utterance, wherein a part of the corresponding input speech waveform is repeated in sections in which the spectral change amount is smaller than a threshold, the waveform is kept as it is in sections in which the spectral change amount is not smaller than the threshold, the result is reproduced, and the audio reproduction speed is adjusted by changing the threshold.

Claim 3: The audio reproduction speed adjustment method according to either claim 1 or claim 2, wherein the spectral change amount is a dynamic measure.

Claim 4: An audio reproduction speed adjustment program that instructs a computer to: input a speech signal; convert the speech signal into a spectral feature vector sequence; calculate spectral feature change vectors from the spectral feature vectors; calculate a spectral change amount from the spectral feature change vectors within a fixed time window while moving the time window along the time axis from the beginning to the end of the utterance; cut out the input speech waveform corresponding to sections in which the spectral change amount exceeds a threshold; delete the input speech waveform corresponding to sections in which the spectral change amount does not exceed the threshold; reproduce the result; and adjust the audio reproduction speed by changing the threshold.

Claim 5: An audio reproduction speed adjustment program that instructs a computer to: input a speech signal; convert the speech signal into a spectral feature vector sequence; calculate spectral feature change vectors from the spectral feature vectors; calculate a spectral change amount from the spectral feature change vectors within a fixed time window while moving the time window along the time axis from the beginning to the end of the utterance; repeat a part of the corresponding input speech waveform in sections in which the spectral change amount is smaller than a threshold; keep the waveform as it is in sections in which the spectral change amount is not smaller than the threshold; reproduce the result; and adjust the audio reproduction speed by changing the threshold.

Claim 6: A recording medium storing the audio reproduction speed adjustment program according to claim 4.