JP4313740B2

JP4313740B2 - Reverberation removal method, program, and recording medium

Info

Publication number: JP4313740B2
Application number: JP2004245622A
Authority: JP
Inventors: 智広中谷; 慶介木下; 正人三好
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2004-08-25
Filing date: 2004-08-25
Publication date: 2009-08-12
Anticipated expiration: 2024-08-25
Also published as: JP2006064866A

Abstract

PROBLEM TO BE SOLVED: To provide a dereverberation method that applies a time warping (time expansion/compression) technique to harmonic structure sound extraction, so that the harmonic structure sound is obtained more accurately than in the prior example and dereverberation becomes more accurate overall, and to provide a device, a program, and a recording medium for implementing the method.

SOLUTION: Disclosed is a dereverberation method in which an input speech signal x(t) containing reverberation is subjected to fundamental frequency estimation by a fundamental frequency estimation unit 1, fundamental frequency time derivative estimation by a fundamental frequency time derivative estimation unit 2, time warping of the signal waveform by a signal waveform time warping unit 3, harmonic structure sound extraction by a harmonic structure sound extraction unit 4, restoration of the time warping by a signal waveform time warping restoration unit 5, inverse transfer function estimation by an inverse transfer function estimation unit 6, and inverse transfer function application by an inverse transfer function application unit 7. Also disclosed are the device, program, and recording medium for implementing the method.

COPYRIGHT: (C)2006,JPO&NCIPI

Description

The present invention relates to a dereverberation method, a program, and a recording medium. In particular, it relates to a dereverberation method, program, and recording medium in which a time warping technique is used in the harmonic structure sound extraction processing of dereverberation (the removal of reverberation from a speech signal containing reverberation), so that the harmonic structure sound is obtained accurately and accurate dereverberation is achieved overall.

A prior example of a dereverberation method is described with reference to FIG. 7 (see reference [1]).

The dereverberation apparatus of FIG. 7 processes a speech signal x(t) containing reverberation, input from a sound collection device 8, in four steps: fundamental frequency estimation by a fundamental frequency estimation unit 1, harmonic structure sound extraction by a harmonic structure sound extraction unit 4, inverse transfer function estimation by an inverse transfer function estimation unit 6, and inverse transfer function application by an inverse transfer function application unit 7. The harmonic structure sound extracted by the harmonic structure sound extraction unit 4 is regarded as a signal that approximates the direct sound of the speech, and the inverse transfer function estimation unit 6 estimates the inverse transfer function from this signal and the observed speech signal x(t). The inverse transfer function application unit 7 then convolves this inverse transfer function with the observed reverberant speech signal to remove the reverberation.

In general, when a speech signal is picked up in a reverberant environment, it is observed as the original speech signal with reverberation superimposed on it. This makes it difficult to extract the properties of the original speech signal and lowers the intelligibility of the speech itself. Dereverberation removes the superimposed reverberation, which makes the original properties of the speech easier to extract and restores its intelligibility. Used as a component of other speech signal processing methods and apparatuses, it improves their overall performance. Speech signal processing technologies whose performance improves when dereverberation is used as a component include the following.

1. Speech recognition methods and apparatuses that use dereverberation as preprocessing.
2. Communication methods and apparatuses, such as video conferencing methods and apparatuses, that improve speech intelligibility by dereverberation.
3. Playback methods and apparatuses that improve the intelligibility of recorded speech by removing the reverberation contained in a recording of a lecture.
4. Hearing aids that improve ease of listening by removing reverberation.
5. Music information processing methods and apparatuses that remove reverberation from music sung by a person, played on an instrument, or played through a loudspeaker, in order to search for or transcribe the music.
6. Machine control interfaces that pass commands to a machine in response to a human voice, and dialogue apparatuses between machines and humans.

In the prior example of dereverberation described above (see reference [1]), the harmonic structure sound extraction unit 4 extracts the harmonic structure sound on the assumption that the fundamental frequency of a speech signal cut out over a short time interval is constant within that interval. In reality, the fundamental frequency of speech is not constant even within a short interval. Because of this assumption, the prior example could not raise the extraction accuracy of the harmonic structure sound beyond a certain level. The approximation of the direct sound was therefore poor, and the inverse transfer function could not be estimated precisely. As a result, the dereverberation performance achievable by the prior example was limited: because it rested on an inaccurate assumption about the fundamental frequency, it could not deliver dereverberation above a certain level of performance.

Separately, time warping has been studied as a way of improving the extraction accuracy of harmonic structure sounds. Time warping deforms a waveform by stretching and compressing only its time axis, without changing the amplitude of the speech signal. By controlling the stretching and compression of the time axis to follow the rise and fall of the fundamental frequency, a speech signal with a constant fundamental frequency can be obtained. This is explained with reference to FIG. 8. FIG. 8(a) shows a speech waveform before time warping, and FIG. 8(b) shows the waveform after time warping. FIG. 8(c) shows the spectrogram of the signal in FIG. 8(a), and FIG. 8(d) shows the spectrogram of the signal in FIG. 8(b).

In the waveform before time warping, shown in FIGS. 8(a) and 8(c), the repetition interval of the same waveform shape becomes shorter as time passes. This indicates that the fundamental frequency rises over time. In contrast, for the waveform after time warping, shown in FIGS. 8(b) and 8(d), the time axis of the first half of the signal is, for example, compressed and the time axis of the second half is stretched, which yields a signal whose fundamental frequency is approximately constant.

The present invention focuses on applying this known time warping technique before the harmonic structure sound extraction of the prior example, so that harmonic structure sound extraction is performed on a signal whose fundamental frequency is approximately constant.

That is, the present invention uses this known time warping technique specifically in the harmonic structure sound extraction processing of dereverberation, which removes reverberation from a speech signal containing reverberation. The harmonic structure sound is thereby obtained more accurately than in the prior example, and dereverberation becomes more accurate overall. The invention thus provides a dereverberation method, a program, and a recording medium that resolve the problem described above.

Claim 1: The method comprises the following three dereverberation processing steps.

A first-stage dereverberation processing step consisting of:
a first-stage fundamental frequency estimation step of performing fundamental frequency estimation on the input speech signal;
a first-stage fundamental frequency time derivative estimation step of estimating the time derivative of the fundamental frequency obtained in the first-stage fundamental frequency estimation step;
a first-stage signal waveform time warping step of making the fundamental frequency of the speech signal constant, based on the speech signal, the fundamental frequency obtained in the first-stage fundamental frequency estimation step, and the time derivative obtained in the first-stage fundamental frequency time derivative estimation step;
a first-stage harmonic structure sound extraction step of extracting the harmonic structure sound from the time-warped signal obtained in the first-stage signal waveform time warping step;
a first-stage signal waveform time warping restoration step of applying time warping restoration to the harmonic structure sound obtained in the first-stage harmonic structure sound extraction step, to obtain a harmonic structure sound with the same fundamental frequency as before time warping;
a first-stage inverse transfer function estimation step of dividing the speech signal and the harmonic structure sound obtained in the first-stage restoration step into time frames whose length differs from that used in harmonic structure sound extraction, analyzing them, and estimating a first-stage inverse transfer function; and
a first-stage inverse transfer function application step of applying the first-stage inverse transfer function to the speech signal to obtain a first-stage dereverberated signal.

A second-stage dereverberation processing step consisting of:
a second-stage fundamental frequency estimation step of performing fundamental frequency estimation on the first-stage dereverberated signal;
a second-stage fundamental frequency time derivative estimation step of estimating the time derivative of the fundamental frequency obtained in the second-stage fundamental frequency estimation step;
a second-stage signal waveform time warping step of making the fundamental frequency of the speech signal constant, based on the speech signal, the fundamental frequency obtained in the second-stage fundamental frequency estimation step, and the time derivative obtained in the second-stage fundamental frequency time derivative estimation step;
a second-stage harmonic structure sound extraction step of extracting the harmonic structure sound from the time-warped signal obtained in the second-stage signal waveform time warping step;
a second-stage signal waveform time warping restoration step of applying time warping restoration to the harmonic structure sound obtained in the second-stage harmonic structure sound extraction step, to obtain a harmonic structure sound with the same fundamental frequency as before time warping;
a second-stage inverse transfer function estimation step of dividing the speech signal and the harmonic structure sound obtained in the second-stage restoration step into time frames whose length differs from that used in harmonic structure sound extraction, analyzing them, and estimating a second-stage inverse transfer function; and
a second-stage inverse transfer function application step of applying the second-stage inverse transfer function to the speech signal to obtain a second-stage dereverberated signal.

A third-stage dereverberation processing step consisting of:
a third-stage fundamental frequency estimation step of performing fundamental frequency estimation on the second-stage dereverberated signal;
a third-stage fundamental frequency time derivative estimation step of estimating the time derivative of the fundamental frequency obtained in the third-stage fundamental frequency estimation step;
a third-stage signal waveform time warping step of making the fundamental frequency of the second-stage dereverberated signal constant, based on the second-stage dereverberated signal, the fundamental frequency obtained in the third-stage fundamental frequency estimation step, and the time derivative obtained in the third-stage fundamental frequency time derivative estimation step;
a third-stage harmonic structure sound extraction step of extracting the harmonic structure sound from the time-warped signal obtained in the third-stage signal waveform time warping step;
a third-stage signal waveform time warping restoration step of applying time warping restoration to the harmonic structure sound obtained in the third-stage harmonic structure sound extraction step, to obtain a harmonic structure sound with the same fundamental frequency as before time warping;
a third-stage inverse transfer function estimation step of dividing the second-stage dereverberated signal and the harmonic structure sound obtained in the third-stage restoration step into time frames whose length differs from that used in harmonic structure sound extraction, analyzing them, and estimating a third-stage inverse transfer function; and
a third-stage inverse transfer function application step of applying the third-stage inverse transfer function to the second-stage dereverberated signal to obtain a third-stage dereverberated signal.

Claim 2: A program for causing a computer to execute each step of the dereverberation method according to claim 1.

Claim 3: A recording medium on which the program according to claim 2 is recorded.

As described above, the present invention introduces time warping of the speech signal into the extraction of the harmonic structure sound. In the time-warped waveform, the time axis of the first half of the signal is, for example, compressed and that of the second half stretched, so that a signal with an approximately constant fundamental frequency is obtained. Applying harmonic structure sound extraction to this constant-fundamental-frequency signal allows the harmonic structure sound to be extracted accurately. The harmonic structure sound extracted at this point, however, has a constant fundamental frequency. To return it to the harmonic structure sound contained in the original speech signal, it is subjected to the inverse of the time warping applied at the start. This converts it into a harmonic structure sound with the same fundamental frequency variation as the original speech signal.

By using time warping in harmonic structure sound extraction, the present invention obtains the harmonic structure sound more accurately than the prior example and, as a result, performs more accurate dereverberation overall.

The present invention introduces time warping of the speech signal into the extraction of the harmonic structure sound. By using time warping to control the stretching and compression of the time axis in step with the rise and fall of the fundamental frequency, a speech signal with a constant fundamental frequency is obtained. Using this known technique specifically in the harmonic structure sound extraction processing of dereverberation yields the harmonic structure sound more accurately than the prior example and therefore more accurate dereverberation overall.

To improve the accuracy of the time warping, the present invention also uses dereverberation itself as preprocessing. Obtaining the fundamental frequency and its time derivative from a signal that has already been dereverberated once removes the influence of reverberation, so these values can be determined more accurately. This improves the accuracy of the time warping and further improves dereverberation performance.
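
The data flow of the three stages recited in claim 1 can be sketched as follows. This is only an illustration: `dereverb_stage` is a hypothetical callable standing for one complete pass (fundamental frequency estimation, time warping, harmonic extraction, restoration, inverse transfer function estimation and application) and is not a name from the patent. As the claim is worded, the second stage takes its fundamental frequency from the first-stage output but still warps and filters the original input, whereas the third stage works entirely on the second-stage output.

```python
def three_stage_dereverb(x, dereverb_stage):
    """Chain three dereverberation passes as described in claim 1.

    dereverb_stage(f0_source, target) is a hypothetical callable that
    estimates F0 (and its derivative) from `f0_source`, then warps,
    analyzes, and inverse-filters `target`, returning the result.
    """
    # Stage 1: F0 is estimated from the observed signal, which is also processed.
    y1 = dereverb_stage(f0_source=x, target=x)
    # Stage 2: F0 comes from the stage-1 output; the original x is processed again.
    y2 = dereverb_stage(f0_source=y1, target=x)
    # Stage 3: both F0 estimation and processing use the stage-2 output.
    y3 = dereverb_stage(f0_source=y2, target=y2)
    return y3
```
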

The best mode for carrying out the invention is described with reference to Embodiment 1, shown in FIG. 1.

The speech signal $x(t)$ is a digital signal containing reverberation, collected by the sound collection device 8, where $t = 0, 1, \ldots$ is the index of each sample and the sampling frequency is $f_s$ Hz. When $x(t)$ is input to the dereverberation apparatus of FIG. 1, the fundamental frequency estimation unit 1 first performs fundamental frequency estimation. This processing divides $x(t)$ into short signal sections (frames) of, for example, about 40 milliseconds, called analysis windows, and estimates both the fundamental frequency of each frame and which frames contain harmonic structure (harmonic structure sections). Many methods can be used for these estimates, including the cepstrum method (see references [2] and [3]) and the noise-robust estimation method described in the prior-art patent [1]. In what follows, the frames used in this analysis are numbered $l$ ($l = 0, 1, 2, \ldots$), the sample index of the center time of frame $l$ is written $t_l$, and the fundamental frequency of each frame is written $\dot{\theta}_l$ (Hz).
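
As a rough illustration of the framing and per-frame estimation described above, the sketch below splits a signal into 40 ms frames and estimates the fundamental frequency of each frame. It uses a plain autocorrelation peak search instead of the cepstrum method or the method of patent [1]. The 10 ms hop, the 60 to 400 Hz search range, and the 0.5 threshold for deciding that a frame is a harmonic structure section are all assumptions made for this example.

```python
import math

def frame_signal(x, fs, frame_ms=40.0, hop_ms=10.0):
    """Split x into overlapping frames; return (center sample indices t_l, frames)."""
    n = int(round(fs * frame_ms / 1000.0))
    hop = int(round(fs * hop_ms / 1000.0))
    centers, frames = [], []
    for start in range(0, len(x) - n + 1, hop):
        frames.append(x[start:start + n])
        centers.append(start + n // 2)
    return centers, frames

def estimate_f0_autocorr(frame, fs, fmin=60.0, fmax=400.0, voiced_thresh=0.5):
    """Return F0 in Hz, or None if the frame is not judged to be harmonic."""
    n = len(frame)
    mean = sum(frame) / n
    v = [s - mean for s in frame]
    r0 = sum(s * s for s in v)
    if r0 == 0.0:
        return None  # silent frame
    lag_min = max(1, int(fs / fmax))
    lag_max = min(n - 1, int(fs / fmin))
    best_lag, best_r = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        # Normalized autocorrelation at this lag.
        r = sum(v[i] * v[i + lag] for i in range(n - lag)) / r0
        if r > best_r:
            best_lag, best_r = lag, r
    if best_lag is None or best_r < voiced_thresh:
        return None
    return fs / best_lag
```

The lag resolution limits the frequency resolution here; a practical estimator would refine the peak or use one of the methods cited in the text.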

Next, reference numeral 2 denotes the fundamental frequency time derivative estimation unit. From the fundamental frequency obtained for each frame, it calculates the time derivative $\ddot{\theta}_l$. To obtain this derivative robustly even under reverberation, the time series of fundamental frequency values $\dot{\theta}_m$ ($l-p < m < l+p$) in the frames around frame $l$ is approximated by, for example, a quadratic function, and the time derivative of that function at time $t_l$ is taken as the approximate value. Specifically, it can be calculated, for example, as follows.

Here, $\Delta l$ is the frame period (seconds), and $p$ is a parameter that determines the range of local time frames considered in the approximation.
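
The expression itself is not reproduced in this text. One way to realize the description, offered only as a sketch, is a least-squares quadratic fit over the frames $l-p < m < l+p$. For such a symmetric window the derivative of the fitted quadratic at the center reduces to a weighted sum of the neighboring values. The function name and the requirement that every frame in the window has a valid fundamental frequency are assumptions of this example.

```python
def f0_time_derivative(f0, l, p, frame_period):
    """Derivative (Hz/s) at frame l of a least-squares quadratic fit to f0.

    Uses frames m with l-p < m < l+p, as in the text. For a symmetric
    window the fitted slope at the center is sum(k*f0[l+k]) / sum(k*k),
    in Hz per frame, which is then divided by the frame period.
    """
    if p < 2:
        raise ValueError("p must be at least 2 so the window has neighbors")
    offsets = range(-(p - 1), p)
    if l + offsets[0] < 0 or l + offsets[-1] >= len(f0):
        raise ValueError("window extends beyond the available frames")
    num = sum(k * f0[l + k] for k in offsets)
    den = sum(k * k for k in offsets)
    return num / (den * frame_period)
```
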
Next, reference numeral 3 denotes the signal waveform time warping unit. FIG. 2 shows the flow of time warping of the signal waveform and the flow of its restoration. Based on the estimated fundamental frequency, the signal waveform time warping unit 3 stretches and compresses the time axis frame by frame so that the fundamental frequency of each frame becomes constant. To do so, it first determines a time warping function. When a given frame has been judged to be a harmonic structure section, the time warping function $\tau = W_l(t)$ for that frame and its inverse $t = W_l^{-1}(\tau)$ can be determined, for example, as follows.
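
The patent's own expressions for $W_l$ and $W_l^{-1}$ are not reproduced in this text. The following is one form consistent with the description, derived under the assumption that the fundamental frequency varies linearly within the frame and that $t$ and $\tau$ are sample indices. It should be read as a sketch, not as the patent's equations. The symbols $\tau_l$ and $\dot{\varphi}_l$ denote the warped-time frame center and the constant fundamental frequency after warping.

```latex
% Assumed linear model of the fundamental frequency inside frame l
\dot{\theta}(t) \approx \dot{\theta}_l + \ddot{\theta}_l \,\frac{t - t_l}{f_s}

% Equal accumulated phase before and after warping gives the warping function
W_l(t) = \tau_l + \frac{1}{\dot{\varphi}_l}
         \left[ \dot{\theta}_l\,(t - t_l)
              + \frac{\ddot{\theta}_l}{2 f_s}\,(t - t_l)^2 \right]

% Inverse, written in a form that remains valid when \ddot{\theta}_l = 0
W_l^{-1}(\tau) = t_l +
  \frac{2\,\dot{\varphi}_l\,(\tau - \tau_l)}
       {\dot{\theta}_l + \sqrt{\dot{\theta}_l^{\,2}
        + 2\,\ddot{\theta}_l\,\dot{\varphi}_l\,(\tau - \tau_l)/f_s}}
```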

Here, $\tau$, $\tau_l$, and $\dot{\varphi}_l$ denote, respectively, the time index of the signal after time warping, the index of the center time of frame $l$, and the fundamental frequency at $\tau_l$. $\tau_l$ and $\dot{\varphi}_l$ are parameters that may be set to arbitrary values, for example $\tau_l = 0$ and $\dot{\varphi}_l = \dot{\theta}_l$. Using this time warping function, the relationship between the speech signal $x(t)$ and the time-warped signal $xw_l(\tau)$ is expressed as follows.

Here, $T_0$ denotes the frame length of the signal before time warping. Equation (5) yields the time series of the time-warped signal $xw_l(\tau)$: for each time index $\tau$, $xw_l(\tau)$ takes the same value as $x(W_l^{-1}(\tau))$, the value of the signal at the pre-warping time index $W_l^{-1}(\tau)$. In general, however, $W_l^{-1}(\tau)$ is not an integer and may not coincide with any sample index of the discrete digital signal. The value of $x(W_l^{-1}(\tau))$ must therefore be obtained by interpolating the sample values at neighboring times. Any interpolation method generally known in digital signal processing may be used, for example interpolation by upsampling, spline interpolation, or quadratic or cubic interpolation.
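
The resampling step can be sketched as below. The warping function used here assumes a fundamental frequency that varies linearly within the frame, which is an assumption of this example and not necessarily the patent's own formula. Linear interpolation stands in for the interpolation methods listed in the text, and all function and parameter names are illustrative.

```python
import math

def warp(t, t_l, tau_l, f0, df0, phi, fs):
    """tau = W_l(t): warped index for sample index t (linear-F0 assumption)."""
    u = t - t_l
    return tau_l + (f0 * u + 0.5 * (df0 / fs) * u * u) / phi

def inverse_warp(tau, t_l, tau_l, f0, df0, phi, fs):
    """t = W_l^{-1}(tau): generally a non-integer sample position."""
    v = tau - tau_l
    disc = f0 * f0 + 2.0 * (df0 / fs) * phi * v
    return t_l + 2.0 * phi * v / (f0 + math.sqrt(disc))

def interp_linear(x, t):
    """Value of x at fractional index t by linear interpolation."""
    i = max(0, min(int(math.floor(t)), len(x) - 2))
    frac = t - i
    return (1.0 - frac) * x[i] + frac * x[i + 1]

def warp_frame(x, t_l, f0, df0, fs, frame_len, phi=None, tau_l=0):
    """Return (taus, values) with values[i] = x(W_l^{-1}(taus[i]))."""
    phi = f0 if phi is None else phi      # one of the choices the text allows
    half = frame_len / 2.0
    lo = math.ceil(warp(t_l - half, t_l, tau_l, f0, df0, phi, fs))
    hi = math.floor(warp(t_l + half, t_l, tau_l, f0, df0, phi, fs))
    taus, values = [], []
    for tau in range(lo, hi + 1):
        t = inverse_warp(tau, t_l, tau_l, f0, df0, phi, fs)
        if abs(t - t_l) < half:           # keep only positions inside frame l
            taus.append(tau)
            values.append(interp_linear(x, t))
    return taus, values
```

Applied to a synthetic chirp whose frequency rises linearly, the returned samples follow a sinusoid of constant frequency `phi`, up to the interpolation error.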

The signal $xw_l(\tau)$ obtained in this way is expected to have an almost constant fundamental frequency. The harmonic structure sound extraction unit 4 therefore takes the signal $xw_l(\tau)$ from the signal waveform time warping unit 3 as input and extracts its harmonic structure sound accurately. For example, the harmonic structure sound $\hat{x}w_l(\tau)$ can be extracted with a comb filter as follows.

Here, $g_l(t)$ denotes a time analysis window, for which a Hann (Hanning) window or any other function commonly used in signal processing may be used, and "$*$" denotes convolution. Equation (6) is meaningful only in the neighborhood of the time range belonging to frame $l$, that is, $|W_l^{-1}(\tau) - t_l| < T_0/2$, and its value need not be calculated at other times.
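
Equation (6) is not reproduced in this text. As an illustration of the idea only, the sketch below implements a simple comb filter as a convolution with a Hann-weighted impulse train spaced at the pitch period. It assumes that the warped signal has a constant period equal to an integer number of samples, which can be arranged because $\dot{\varphi}_l$ may be chosen freely. The names `comb_filter`, `period`, and `span` are illustrative.

```python
import math

def hann(n, length):
    """Hann window centered at 0 with support |n| < length/2."""
    if abs(n) >= length / 2.0:
        return 0.0
    return 0.5 * (1.0 + math.cos(2.0 * math.pi * n / length))

def comb_filter(xw, period, span):
    """Average xw over samples spaced by whole pitch periods.

    Components that repeat every `period` samples pass with unit gain;
    other components are attenuated. Output near the edges of xw is
    attenuated because part of the impulse train falls outside it.
    """
    kmax = int(span // (2 * period))
    taps = [(k * period, hann(k * period, span)) for k in range(-kmax, kmax + 1)]
    taps = [(off, w) for off, w in taps if w > 0.0]
    norm = sum(w for _, w in taps)
    out = []
    for n in range(len(xw)):
        acc = 0.0
        for off, w in taps:
            j = n - off
            if 0 <= j < len(xw):
                acc += w * xw[j]
        out.append(acc / norm)
    return out
```
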
Next, reference numeral 5 denotes the signal waveform time warping restoration unit. Using the relationship of equation (4), it applies time warping restoration to the harmonic structure sound $\hat{x}w_l(\tau)$ obtained above, as follows, to obtain a harmonic structure sound $\hat{x}_l(t)$ with the same fundamental frequency as before time warping (see FIG. 2(b)).
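
The restoration expression is not reproduced in this text. Given that the warped signal is defined by reading $x$ at $W_l^{-1}(\tau)$, the restoration presumably reads the extracted signal back at $W_l(t)$. The form below is an assumption based on that symmetry, not a quotation of the patent.

```latex
% Assumed restoration: evaluate the extracted warped signal at tau = W_l(t)
\hat{x}_l(t) = \hat{x}w_l\bigl(W_l(t)\bigr),
\qquad |t - t_l| < \frac{T_0}{2}
```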

As with equation (5), calculating the above expression requires interpolation of the digital signal.

In the time warping restoration processing, the signals $\hat{x}_l(t)$ obtained for the individual frames are joined in time to obtain a signal $\hat{x}(t)$ that contains only the harmonic structure sound of the speech signal $x(t)$. For this, the method known as overlap-add synthesis can be used, for example, as follows.

$$\hat{x}(t) = \sum_l g_2(t - t_l)\,\hat{x}_l(t) \qquad (8)$$
Here, $g_2(t)$ denotes a time analysis window; a function commonly used in signal processing, such as a Hanning window, may be used.
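The overlap-add synthesis of Equation (8) can be illustrated with a minimal sketch. It assumes integer sample indices, a Hann window for $g_2$, and frames stored on their own support; the function names `hann` and `overlap_add` and the frame layout are illustrative assumptions, not part of the disclosure.

```python
import math


def hann(n, length):
    """Periodic Hann window value at sample n (0 <= n < length)."""
    return 0.5 - 0.5 * math.cos(2.0 * math.pi * n / length)


def overlap_add(frames, centers, total_len):
    """Sketch of Equation (8): x_hat(t) = sum_l g2(t - t_l) * x_hat_l(t).

    Assumption: frames[l][i] holds x_hat_l at absolute sample
    centers[l] - len(frames[l]) // 2 + i, so the window peaks at t_l.
    """
    out = [0.0] * total_len
    for frame, center in zip(frames, centers):
        length = len(frame)
        start = center - length // 2
        for i, value in enumerate(frame):
            t = start + i
            if 0 <= t < total_len:
                # Window the frame and accumulate it at its position.
                out[t] += hann(i, length) * value
    return out
```

With a hop of half the frame length, the Hann windows sum to one wherever two frames overlap, so frames that agree in their overlap are reproduced without amplitude modulation.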
Next, 6 is the inverse transfer function estimation unit. In the inverse transfer function estimation processing performed by unit 6, the audio signal $x(t)$ and the harmonic structure sound $\hat{x}(t)$ obtained by the signal waveform time expansion/contraction restoration unit 5 are divided into time frames whose length differs from that used in the harmonic structure sound extraction processing, and the analysis proceeds frame by frame. To distinguish these frames from those of the harmonic structure sound extraction processing, the time frame index is written $L$ ($= 0, 1, 2, \ldots$). For each time frame cut out from each pair of $x(t)$ and $\hat{x}(t)$, an initial estimate $W_L(\omega)$ of the inverse transfer function is calculated by the following equation.

Here, $\mathrm{DFT}(\cdot)$ denotes the short-time discrete Fourier transform at sample index $t_L$, and $T_l$ denotes the frame length. Next, the inverse transfer function $W(\omega)$ for dereverberation is obtained by averaging the initial estimates obtained in this way over the different time frames.

In the calculation of Equation (13), instead of simply taking the average, weighting each frame by the amplitude spectrum $|\hat{X}_L(\omega)|$ gives a more accurate approximation of the inverse transfer function.

This is because the weighting emphasizes the influence of the dominant harmonic components while suppressing the influence of noise components. The same effect can be obtained by using the power spectrum $|\hat{X}_L(\omega)|^2$ or the like as the weight instead of the amplitude spectrum.
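The effect of the weighting can be illustrated with a small sketch. Equations (12) to (14) are not reproduced in this text, so the sketch makes two labeled assumptions: the per-frame initial estimate is taken to be the ratio $\hat{X}_L(\omega) / X_L(\omega)$, and the weighted average is normalized by the sum of the weights. The function name `average_inverse_tf` is hypothetical.

```python
def average_inverse_tf(x_frames, xhat_frames, weighting="none"):
    """Average per-frame inverse transfer function estimates.

    x_frames[L][k] and xhat_frames[L][k] are complex DFT bins of the
    observed signal and of the harmonic structure sound in frame L.
    weighting: "none", "amplitude" (|X_hat|) or "power" (|X_hat|^2).
    """
    n_bins = len(x_frames[0])
    result = []
    for k in range(n_bins):
        num = 0j
        den = 0.0
        for X, Xh in zip(x_frames, xhat_frames):
            if X[k] == 0:
                continue  # skip bins where the ratio is undefined
            w_init = Xh[k] / X[k]  # ASSUMED form of the initial estimate
            if weighting == "amplitude":
                weight = abs(Xh[k])
            elif weighting == "power":
                weight = abs(Xh[k]) ** 2
            else:
                weight = 1.0
            num += weight * w_init
            den += weight
        result.append(num / den if den > 0 else 0j)
    return result
```

Frames in which the harmonic component is strong receive a larger weight, so a frame dominated by noise pulls the average less than it would under a plain mean.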
Finally, 7 is the inverse transfer function application unit. In the inverse transfer function application processing performed by unit 7, the inverse discrete Fourier transform ($\mathrm{IDFT}(\cdot)$) is applied to the inverse transfer function $W(\omega)$ obtained above to convert it back to a time-domain inverse filter $w(t)$, which is then convolved with the audio signal $x(t)$ to obtain the dereverberated signal $y(t)$.

$$w(t) = \mathrm{IDFT}(T_l, W(\omega)) \qquad (15)$$
$$y(t) = w(t) * x(t) \qquad (16)$$
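Equations (15) and (16) can be sketched in a few lines. The sketch assumes that $W(\omega)$ is given as a list of complex DFT bins with conjugate symmetry, so that the inverse filter is real; the direct-form transforms below are chosen for readability rather than speed, and the function names are illustrative.

```python
import cmath


def idft(spectrum):
    """Inverse discrete Fourier transform of a list of complex bins."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def convolve(w, x):
    """Full linear convolution of two real sequences."""
    y = [0.0] * (len(w) + len(x) - 1)
    for i, wi in enumerate(w):
        for j, xj in enumerate(x):
            y[i + j] += wi * xj
    return y


def apply_inverse_filter(W, x):
    """Equation (15): w = IDFT(W); Equation (16): y = w * x."""
    # Assumption: W is conjugate-symmetric, so the imaginary part of
    # the IDFT is numerical noise and can be discarded.
    w = [c.real for c in idft(W)]
    return convolve(w, x)
```

As a usage example, for a toy reverberant impulse `x = [1, 0.5, 0.25, 0.125]` and the spectrum of the filter `[1, -0.5, 0, 0]`, the output is an impulse followed by near-zero samples, apart from a small truncation term at the end.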
Furthermore, as shown in FIG. 3, dereverberation can also be configured so that essentially the same processing as described above is applied in three stages, with the dereverberation performance improving progressively at each stage. The key points of each stage are summarized below.
1. First stage: the harmonic structure section, the fundamental frequency, its time derivative, and the harmonic structure sound are all estimated from the audio signal $x(t)$. Each estimate may therefore contain substantial errors caused by reverberation.
2. Second stage: the harmonic structure section, the fundamental frequency, and its time derivative are estimated from the signal dereverberated in the preceding stage, and only the harmonic structure sound is estimated from the audio signal $x(t)$. Because the influence of reverberation on the estimation of the harmonic structure section, the fundamental frequency, and its time derivative is reduced, their estimation accuracy improves. The estimation accuracy of the harmonic structure component, which is estimated on the basis of those values, improves as well.
3. Third stage: all of the above values are estimated from the signal dereverberated in the preceding stage. Because the estimation accuracy of the harmonic structure sound also improves, more effective dereverberation can be expected.
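The data flow of the three stages can be sketched as follows. The four callables stand in for the processing units described above and are hypothetical placeholders; `estimate_f0` is assumed to bundle the estimation of the harmonic structure section, the fundamental frequency, and its time derivative, and `extract_harmonic` is assumed to bundle time expansion/contraction, extraction, and restoration.

```python
def three_stage_dereverberation(x, estimate_f0, extract_harmonic,
                                estimate_inverse, apply_filter):
    """Sketch of the three-stage configuration (all callables are placeholders)."""
    # Stage 1: every quantity is estimated from the observed signal x.
    f0 = estimate_f0(x)
    harmonic = extract_harmonic(x, f0)
    y1 = apply_filter(estimate_inverse(x, harmonic), x)

    # Stage 2: f0 information comes from the stage-1 output,
    # but the harmonic structure sound is still extracted from x.
    f0 = estimate_f0(y1)
    harmonic = extract_harmonic(x, f0)
    y2 = apply_filter(estimate_inverse(x, harmonic), x)

    # Stage 3: every quantity is estimated from the stage-2 output,
    # and the resulting inverse filter is applied to that output.
    f0 = estimate_f0(y2)
    harmonic = extract_harmonic(y2, f0)
    y3 = apply_filter(estimate_inverse(y2, harmonic), y2)
    return y1, y2, y3
```

Passing the units in as callables keeps the sketch independent of any particular estimator and makes the difference between the stages visible: only the inputs change, not the processing.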

The second and third stages need not each be applied only once; applying them repeatedly can improve the dereverberation performance further.
Sinusoidal synthesis is another way, in place of Equations (6) and (7), to extract the harmonic structure sound from the observed audio signal that has undergone time expansion/contraction processing. With this method, the harmonic structure sound contained in the signal before time expansion/contraction can be estimated directly from the time-expanded/contracted signal, so the harmonic structure extraction processing and the time expansion/contraction restoration processing can be carried out together. Let $Xw_l(\omega)$ be the short-time discrete Fourier transform of $xw_l(\tau)$. The amplitude $A_{k,l}$ and phase $p_{k,l}$ of the $k$-th harmonic component of the time-expanded/contracted signal can then be extracted as follows.

Here, $[\cdot]$ denotes the operation of converting a continuous frequency to the center frequency of the nearest discrete Fourier transform bin. From these values, the harmonic structure sound contained in the signal before time expansion/contraction can be extracted as follows.
$$\hat{x}_l(t) = \sum_k A_{k,l} \cos\!\left([2\pi k \dot{\phi}_l]\, W_l(t) + p_{k,l}\right) \qquad (20)$$
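The synthesis of Equation (20) can be sketched as below. The sketch assumes that the constant fundamental frequency of the time-expanded/contracted signal is supplied in hertz as `f0_warped`, that the amplitudes and phases have already been extracted, and that `warp` implements $W_l(t)$; these parameter names and the unit convention are assumptions made for illustration.

```python
import math


def nearest_bin_frequency(freq_hz, sample_rate, n_fft):
    """The [.] operation: snap a frequency to the nearest DFT bin centre."""
    bin_width = sample_rate / n_fft
    return round(freq_hz / bin_width) * bin_width


def synthesize_harmonics(amps, phases, f0_warped, warp, times, sample_rate, n_fft):
    """Sketch of Equation (20): sum of cosines evaluated on the warped time axis.

    amps[k-1], phases[k-1]: amplitude and phase of the k-th harmonic.
    warp: callable t -> W_l(t), the time expansion/contraction function.
    """
    out = []
    for t in times:
        tau = warp(t)  # position on the time-expanded/contracted axis
        value = 0.0
        for k, (a, p) in enumerate(zip(amps, phases), start=1):
            fk = nearest_bin_frequency(k * f0_warped, sample_rate, n_fft)
            value += a * math.cos(2.0 * math.pi * fk * tau + p)
        out.append(value)
    return out
```

Because the cosines are evaluated at $W_l(t)$ rather than at $t$, the synthesized signal follows the original, time-varying fundamental frequency even though each harmonic has a fixed frequency on the warped axis.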
The time expansion/contraction function given by Equations (2) and (3) is explained further here. First, let $\theta(t)$ denote the phase of the frequency component corresponding to the fundamental frequency of the harmonic structure (the fundamental component) in the observed audio signal before time expansion/contraction, and let $\phi(\tau)$ denote the phase of the fundamental component of the signal after time expansion/contraction. From Equation (4), the following relation holds.

$$\theta(t) = \phi(W_l(t)) \quad \text{for } |t - t_l| < T/2 \qquad (21)$$
In addition, because the time expansion/contraction processing defines $W_l(t)$ as the function that makes $\dot{\phi}(\tau)$ constant, the following relation holds.

Furthermore, to simplify the computation of the time expansion/contraction processing, it is useful to assume that the time derivative of the fundamental frequency of the original signal is constant within a short time frame. This is expressed as follows.

Here, $\ddot{\theta}_l$ denotes the time derivative of the fundamental frequency at time index $t_l$. Equations (2) and (3) can be derived by finding the $W_l(t)$ that satisfies Equations (21), (22), and (23).
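The three conditions can be checked numerically with a small sketch. It assumes a fundamental frequency that varies linearly within the frame and chooses the constant frequency of the warped signal equal to the fundamental frequency at the frame center; the resulting quadratic `warp` is one function that satisfies the conditions under these assumptions and may differ in normalization from Equations (2) and (3).

```python
import math


def make_warp(t_l, f0_l, f0_dot_l):
    """Return W_l(t) for a frame with linearly varying fundamental frequency.

    f0_l: fundamental frequency (Hz) at the frame centre t_l.
    f0_dot_l: its time derivative (Hz/s), assumed constant in the frame.
    """
    def warp(t):
        d = t - t_l
        return t_l + d + 0.5 * (f0_dot_l / f0_l) * d * d
    return warp


def phase_original(t, t_l, f0_l, f0_dot_l):
    """theta(t): phase of the fundamental component before warping."""
    d = t - t_l
    return 2.0 * math.pi * (f0_l * d + 0.5 * f0_dot_l * d * d)


def phase_warped(tau, t_l, f0_l):
    """phi(tau): phase after warping, with constant frequency f0_l."""
    return 2.0 * math.pi * f0_l * (tau - t_l)
```

For any `t` in the frame, `phase_original(t, ...)` equals `phase_warped(warp(t), ...)`, which is the relation of Equation (21) with a constant frequency on the warped axis.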
Next, Embodiment 2 of the dereverberation method is described, like Embodiment 1, with reference to FIG. 1. Embodiment 2 differs from Embodiment 1 only in the calculation used to obtain the estimate of the inverse transfer function.

In Embodiment 2, the inverse transfer function $W(\omega)$ is determined as the function that minimizes the error between $X_L(\omega)$ and $\hat{X}_L(\omega)$. For example, if the least-squares error criterion is used as the error measure, $W(\omega)$ can be determined as follows.

This equation can be solved analytically, and $W(\omega)$ is obtained as follows.
$$W(\omega) = \frac{E(\hat{X}_L(\omega)\,\hat{X}_L^*(\omega))}{E(X_L(\omega)\,\hat{X}_L^*(\omega))} \qquad (26)$$
Embodiment 2 can therefore be configured by replacing the calculation of Equation (12) in Embodiment 1 with the above equation.
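Equation (26) can be sketched directly, with the expectation $E(\cdot)$ approximated by an average over the available frames. The function name and the list-of-frames layout are illustrative assumptions.

```python
def least_squares_inverse_tf(x_frames, xhat_frames):
    """Sketch of Equation (26), evaluated independently for each DFT bin.

    x_frames[L][k], xhat_frames[L][k]: complex DFT bins of the observed
    signal and of the harmonic structure sound in frame L.
    """
    n_frames = len(x_frames)
    n_bins = len(x_frames[0])
    W = []
    for k in range(n_bins):
        # Numerator: E(X_hat * conj(X_hat)), approximated by a frame average.
        num = sum(Xh[k] * Xh[k].conjugate() for Xh in xhat_frames) / n_frames
        # Denominator: E(X * conj(X_hat)).
        den = sum(
            X[k] * Xh[k].conjugate() for X, Xh in zip(x_frames, xhat_frames)
        ) / n_frames
        W.append(num / den if den != 0 else 0j)
    return W
```

As a consistency check, if every frame satisfies $\hat{X}_L(\omega) = H(\omega) X_L(\omega)$ for a fixed $H(\omega)$, the expression reduces to $W(\omega) = H(\omega)$.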
A weighted average like that of Equation (14) can also be introduced into Embodiment 2. To do so, the following formula may be used in place of Equation (26).

The effects of the embodiments described above are explained with reference to the energy decay curves of the impulse responses and the dereverberated speech waveforms and spectrograms shown in FIGS. 4 to 6. The task used in the evaluation experiment was dereverberation of reverberant word utterances. Utterances of 5,240 words by one male and one female speaker each were taken from the ATR word database as source signals. Four room impulse responses measured in a reverberant room (reverberation times: 0.1, 0.2, 0.5, and 1.0 seconds) were prepared. The reverberant observed speech signals were synthesized by convolving the word utterances with the room impulse responses. The inverse filter for dereverberation was estimated using either all of the male word utterances or all of the female word utterances.

FIGS. 4 and 5 show the energy decay curves of the room impulse responses and of the impulse responses after dereverberation processing, for the different reverberation times. FIG. 4 is for the male voice and FIG. 5 is for the female voice. The decay curves were calculated by the Schroeder method.
FIGS. 4 and 5 show that, at all reverberation times and for both male and female voices, the present invention reduces the reverberation energy more effectively than the conventional example. FIG. 6 shows the waveforms and spectrograms of a signal without reverberation, a signal with reverberation (reverberation time: 1.0 second), and the signal dereverberated by the present invention. FIG. 6 shows that the present invention effectively restores the temporal and spectral structure of the signal without reverberation.
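For readers unfamiliar with the Schroeder method mentioned above, the energy decay curve is obtained by backward integration of the squared impulse response. The following minimal sketch assumes a sampled impulse response and normalizes the curve to 0 dB at time zero; it illustrates the general technique and is not the evaluation code used for FIGS. 4 and 5.

```python
import math


def energy_decay_curve_db(h):
    """Schroeder backward integration of an impulse response h, in dB.

    edc[n] = 10 * log10( sum_{m >= n} h[m]^2 / sum_{m >= 0} h[m]^2 )
    """
    tail = 0.0
    edc = [0.0] * len(h)
    # Accumulate the remaining energy from the end of the response backwards.
    for n in range(len(h) - 1, -1, -1):
        tail += h[n] * h[n]
        edc[n] = tail
    total = edc[0] if edc else 0.0
    if total == 0.0:
        raise ValueError("impulse response has no energy")
    return [10.0 * math.log10(e / total) if e > 0 else float("-inf") for e in edc]
```

Applied to the impulse response of the room and to the combined response of room plus inverse filter, a curve that falls faster indicates that less reverberant energy remains.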

References
[1] Japanese Patent Application No. 2003-060025: Acoustic signal dereverberation method and apparatus, acoustic signal dereverberation program, and recording medium on which the program is recorded.
[2] Japanese Patent Application No. 2002-062513: Dominance extraction device and fundamental frequency extraction device, methods therefor, programs therefor, and recording medium on which the programs are recorded.
[3] Japanese Patent Application No. 2002-274525: Harmonic structure section estimation method and apparatus, harmonic structure section estimation program and recording medium on which the program is recorded, threshold determination method and apparatus for harmonic structure section estimation, and threshold determination program for harmonic structure section estimation and recording medium on which the program is recorded.

FIG. 1 is a block diagram illustrating an embodiment.
FIG. 2 is a diagram showing the flow of time expansion/contraction of a signal waveform and the flow of time expansion/contraction restoration of a signal waveform.
FIG. 3 is a block diagram illustrating another embodiment.
FIG. 4 is a diagram showing the energy decay curves (male voice) of the room impulse responses and of the impulse responses after dereverberation processing for different reverberation times.
FIG. 5 is a diagram showing the energy decay curves (female voice) of the room impulse responses and of the impulse responses after dereverberation processing for different reverberation times.
FIG. 6 is a diagram showing the waveforms and spectrograms of a signal without reverberation, a signal with reverberation (reverberation time: 1.0 second), and the dereverberated signal.
FIG. 7 is a block diagram illustrating a conventional example.
FIG. 8 is a diagram illustrating time expansion/contraction processing.

Explanation of symbols

1 Fundamental frequency estimation unit
2 Fundamental frequency time derivative estimation unit
3 Signal waveform time expansion/contraction unit
4 Harmonic structure sound extraction unit
5 Signal waveform time expansion/contraction restoration unit
6 Inverse transfer function estimation unit
7 Inverse transfer function application unit
8 Sound pickup device

Claims

A dereverberation method characterized by comprising:
a first-stage dereverberation processing step comprising:
a first-stage fundamental frequency estimation step of performing fundamental frequency estimation processing on an input audio signal;
a first-stage fundamental frequency time derivative estimation step of estimating the time derivative of the fundamental frequency on the basis of the fundamental frequency obtained in the first-stage fundamental frequency estimation step;
a first-stage signal waveform time expansion/contraction step of making the fundamental frequency of the audio signal constant on the basis of the audio signal, the fundamental frequency obtained in the first-stage fundamental frequency estimation step, and the fundamental frequency time derivative obtained in the first-stage fundamental frequency time derivative estimation step;
a first-stage harmonic structure sound extraction step of extracting the harmonic structure sound on the basis of the time-expanded/contracted signal obtained in the first-stage signal waveform time expansion/contraction step;
a first-stage signal waveform time expansion/contraction restoration step of applying signal waveform time expansion/contraction restoration processing to the harmonic structure sound obtained in the first-stage harmonic structure sound extraction step to obtain a harmonic structure sound having the same fundamental frequency as before the time expansion/contraction;
a first-stage inverse transfer function estimation step of dividing the audio signal and the harmonic structure sound obtained in the first-stage signal waveform time expansion/contraction restoration step into time frames whose length differs from that used in the harmonic structure sound extraction processing, proceeding with the analysis, and estimating a first-stage inverse transfer function; and
a first-stage inverse transfer function application step of applying the first-stage inverse transfer function obtained in the first-stage inverse transfer function estimation step to the audio signal to obtain a first-stage dereverberated signal;
a second-stage dereverberation processing step comprising:
a second-stage fundamental frequency estimation step of performing fundamental frequency estimation processing on the first-stage dereverberated signal;
a second-stage fundamental frequency time derivative estimation step of estimating the time derivative of the fundamental frequency on the basis of the fundamental frequency obtained in the second-stage fundamental frequency estimation step;
a second-stage signal waveform time expansion/contraction step of making the fundamental frequency of the audio signal constant on the basis of the audio signal, the fundamental frequency obtained in the second-stage fundamental frequency estimation step, and the fundamental frequency time derivative obtained in the second-stage fundamental frequency time derivative estimation step;
a second-stage harmonic structure sound extraction step of extracting the harmonic structure sound on the basis of the time-expanded/contracted signal obtained in the second-stage signal waveform time expansion/contraction step;
a second-stage signal waveform time expansion/contraction restoration step of applying signal waveform time expansion/contraction restoration processing to the harmonic structure sound obtained in the second-stage harmonic structure sound extraction step to obtain a harmonic structure sound having the same fundamental frequency as before the time expansion/contraction;
a second-stage inverse transfer function estimation step of dividing the audio signal and the harmonic structure sound obtained in the second-stage signal waveform time expansion/contraction restoration step into time frames whose length differs from that used in the harmonic structure sound extraction processing, proceeding with the analysis, and estimating a second-stage inverse transfer function; and
a second-stage inverse transfer function application step of applying the second-stage inverse transfer function obtained in the second-stage inverse transfer function estimation step to the audio signal to obtain a second-stage dereverberated signal; and
a third-stage dereverberation processing step comprising:
a third-stage fundamental frequency estimation step of performing fundamental frequency estimation processing on the second-stage dereverberated signal;
a third-stage fundamental frequency time derivative estimation step of estimating the time derivative of the fundamental frequency on the basis of the fundamental frequency obtained in the third-stage fundamental frequency estimation step;
a third-stage signal waveform time expansion/contraction step of making the fundamental frequency of the second-stage dereverberated signal constant on the basis of the second-stage dereverberated signal, the fundamental frequency obtained in the third-stage fundamental frequency estimation step, and the fundamental frequency time derivative obtained in the third-stage fundamental frequency time derivative estimation step;
a third-stage harmonic structure sound extraction step of extracting the harmonic structure sound on the basis of the time-expanded/contracted signal obtained in the third-stage signal waveform time expansion/contraction step;
a third-stage signal waveform time expansion/contraction restoration step of applying signal waveform time expansion/contraction restoration processing to the harmonic structure sound obtained in the third-stage harmonic structure sound extraction step to obtain a harmonic structure sound having the same fundamental frequency as before the time expansion/contraction;
a third-stage inverse transfer function estimation step of dividing the second-stage dereverberated signal and the harmonic structure sound obtained in the third-stage signal waveform time expansion/contraction restoration step into time frames whose length differs from that used in the harmonic structure sound extraction processing, proceeding with the analysis, and estimating a third-stage inverse transfer function; and
a third-stage inverse transfer function application step of applying the third-stage inverse transfer function obtained in the third-stage inverse transfer function estimation step to the second-stage dereverberated signal to obtain a third-stage dereverberated signal.

A program for causing a computer to execute each step of the dereverberation method according to claim 1.

A recording medium on which the program according to claim 2 is recorded.