JP5507596B2

JP5507596B2 - Speech enhancement

Info

Publication number: JP5507596B2
Application number: JP2012040093A
Authority: JP
Inventors: ブラウン、シー・フィリップ
Original assignee: ドルビーラボラトリーズライセンシングコーポレイション
Priority date: 2007-09-12
Filing date: 2012-02-27
Publication date: 2014-05-28
Anticipated expiration: 2028-09-10
Also published as: CN101960516A; JP2012110049A; CN101960516B; EP2191467B1; JP2010539792A; EP2191467A1; US8891778B2; ATE514163T1; US20100179808A1; WO2009035615A1

Abstract

A method for enhancing speech includes extracting a center channel of an audio signal, flattening the spectrum of the center channel, and mixing the flattened speech channel with the audio signal, thereby enhancing any speech in the audio signal. Also disclosed are a method for extracting a center channel of sound from an audio signal with multiple channels, a method for flattening the spectrum of an audio signal, and a method for detecting speech in an audio signal. Also disclosed is a speech enhancer that includes a center-channel extractor, a spectral flattener, a speech-confidence generator, and a mixer for mixing the flattened speech channel with the original audio signal in proportion to the confidence of having detected speech, thereby enhancing any speech in the audio signal.

Description

This specification describes methods and apparatus for extracting a center channel of sound from an audio signal with multiple channels, for flattening the spectrum of an audio signal, for detecting speech in an audio signal, and for enhancing speech. A method for extracting a center channel of sound from an audio signal with multiple channels includes multiplying (1) a first channel of the audio signal, less a proportion α of a candidate center channel, by (2) the conjugate of a second channel of the audio signal, less the proportion α of the candidate center channel; approximately minimizing α; and forming the extracted center channel by multiplying the candidate center channel by the approximately minimized α.

A method for flattening the spectrum of an audio signal may include separating a presumed speech channel into perceptual bands, determining which of the perceptual bands has the most energy, and increasing the gain of the perceptual bands having less energy, thereby flattening the spectrum of any speech in the audio signal. The increasing may include raising the gain of the lower-energy perceptual bands up to a maximum.

A method for detecting speech in an audio signal may include measuring spectral fluctuations in a candidate center channel of the audio signal, measuring spectral fluctuations of the audio signal less the candidate center channel, and comparing these spectral fluctuations, thereby detecting speech in the audio signal.

A method for enhancing speech may include extracting a center channel of an audio signal, flattening the spectrum of the center channel, and mixing the flattened speech channel with the audio signal, thereby enhancing any speech in the audio signal. The method may further include generating a confidence in detecting speech in the center channel, and the mixing may include mixing the flattened speech channel with the audio signal in proportion to the confidence of having detected speech. The confidence may vary from a lowest possible probability to a highest possible probability, and the generating may further include limiting the generated confidence to a value higher than the lowest possible probability and lower than the highest possible probability. The extracting may include extracting the center channel of the audio signal using the method described above. The flattening may include flattening the spectrum of the center channel using the method described above. The generating may include generating a confidence in detecting speech in the center channel using the method described above.

The extracting may include extracting the center channel of the audio signal using the method described above, the flattening may include flattening the spectrum of the center channel using the method described above, and the generating may include generating a confidence in detecting speech in the center channel using the method described above.

This specification also teaches a computer-readable recording medium storing a computer program for executing any of the above methods, as well as a computer system that includes a CPU, the recording medium, and a bus coupling the CPU and the recording medium.

FIG. 1 is a functional block diagram of a speech enhancer according to one embodiment of the invention.
FIG. 2 depicts a suitable set of filters with a spacing of 1 ERB, resulting in a total of 40 bands.
FIG. 3 describes the mixing process according to one embodiment of the invention.
FIG. 4 illustrates a computer system according to one embodiment of the invention.

FIG. 1 is a functional block diagram of a speech enhancer according to one embodiment of the invention. The speech enhancer 1 includes an input signal 17, discrete Fourier transformers 10a, 10b, a center channel extractor 11, a spectral flattener 12, a voice activity detector 13, variable gain amplifiers 15, 15c, inverse discrete Fourier transformers 18a, 18b, and an output signal 18. The input signal 17 consists of left and right channels 17a, 17b, respectively, and the output signal 18 similarly consists of left and right channels 18a, 18b, respectively.

Each discrete Fourier transformer 18 receives as input the left or right channel 17a, 17b of the input signal 17 and produces as output the transform 19a, 19b. The center channel extractor 11 receives the transforms 19 and produces as output the phantom center channel C 20. The spectral flattener 12 receives the phantom center channel C 20 as input and produces the shaped center channel 24 as output, while the voice activity detector 13 receives the same input C 20 and produces as output the control signal 22 for the variable gain amplifiers 14a and 14c on the one hand and the control signal 21 for the variable gain amplifier 14b on the other.

Amplifier 14a receives the left channel transform 19a as its input and the output control signal 22 of the voice activity detector 13 as its control signal. Similarly, amplifier 14c receives the right channel transform 19b as its input and the voice activity detector output control signal 22 as its control signal. Amplifier 14b receives the spectrally shaped center channel 24 from the spectral flattener 12 as its input and the voice activity detector control signal 21 as its control signal.

Mixer 15a receives the gain-adjusted left transform 23a output from amplifier 14 and the gain-adjusted, spectrally shaped center channel 25, and produces signal 26a as output. Similarly, mixer 15b receives the gain-adjusted right transform 23b from amplifier 14c and the gain-adjusted, spectrally shaped center channel 25, and produces signal 26b as output.

The inverse transformers 18a, 18b receive the respective signals 26a, 26b and produce the derived left and right channel signals L' 18a and R' 18b, respectively.

The operation of the speech enhancer 1 is described in more detail below. The processes of center channel extraction, spectral flattening, voice activity detection, and mixing are described in turn, first in overview and then in more detail, according to one embodiment.

Center channel extraction

The following assumptions are made:
(1) The signal of interest 17 contains speech.
(2) In the case of a multi-channel signal (i.e., left and right, or stereo), the speech is center panned.
(3) The true panned center consists of a proportion alpha (α) of the source left and right signals.
(4) The result of subtracting that proportion is a pair of orthogonal signals.

Operating under these assumptions, the center channel extractor 11 extracts the center-panned content C 20 from the stereo signal 17. For center-panned content, identical portions of both the left and right channels contain that content. The center-panned content is extracted by removing the identical portions from both the left and right channels.

For the remaining left and right signals, one may compute LR* (where * denotes the conjugate), either over a frame of blocks or with a method that updates continuously as each new block enters, and adjust the proportion α until that quantity is sufficiently close to zero, that is, until LR* = 0 approximately holds.

Spectral flattening

Auditory filters separate the speech in the presumed speech channel into perceptual bands. The band with the most energy is determined for each block of data. The spectral shape of the speech channel for that block is then modified to compensate for the lower energy in the remaining bands; that is, the spectrum is flattened. Bands with lower energy have their gains increased, up to some maximum. In one embodiment, all bands may share a maximum gain. In an alternative embodiment, each band may have its own maximum gain. (In the undesirable event that all bands have the same energy, the spectrum is already flat. One may consider the spectral shaping as not occurring, or as being achieved with identity functions.)

Spectral flattening occurs regardless of the channel content. Non-speech may be processed, but it is not used later in the system. Because non-speech has a very different spectrum from speech, the flattening for non-speech is generally not the same as for speech.

Voice activity detector

Once the presumed speech has been isolated to a single channel, it is analyzed for speech content (does it contain speech?). The content is analyzed independently of the spectral flattening. Speech content is determined by measuring spectral fluctuations in adjacent frames of data. (Each frame may consist of many blocks of data, but a frame is typically 2, 4, or 8 blocks at a 48 kHz sample rate.)

Where the speech channel is extracted from stereo, the residual stereo signal may assist with the speech analysis. This concept applies more generally to adjacent channels in any multi-channel source.

Mixing

When speech is deemed present, the flattened speech channel is mixed with the original signal in some proportion related to the confidence that the speech channel indeed contains speech. In general, when the confidence is high, more of the flattened speech channel is used. When the confidence is low, less of the flattened speech channel is used.

The processes of center channel extraction, spectral flattening, voice activity detection, and mixing are now described in turn in more detail, according to one embodiment.

Phantom center and surround channel extraction from a two-channel source

With speech enhancement, it is desirable to extract, process, and reinsert only the center-panned audio. In a stereo mix, speech is most often panned to the center.

The extraction of center-panned audio (a phantom center channel) from a two-channel mix is now described. A mathematical proof forms the first part. The second part applies the proof to a real-world stereo signal to derive the phantom center.

When the phantom center is subtracted from the original stereo, a stereo signal with orthogonal channels remains. A similar method derives a phantom surround channel from surround-panned audio.

Center channel extraction: mathematical proof

Given a two-channel signal, the channels may be separated into left (L) and right (R). Each of the left and right channels contains unique information as well as common information. The common information may be represented as C (center panned), and the unique information as L̃ and R̃ for left only and right only, respectively.

The term "unique" implies that L̃ and R̃ are orthogonal to each other.

Separating L and R into real and imaginary parts,

where Lr is the real part of L, Li is the imaginary part of L, and similarly for R. Now assume that the orthogonal pair (L̃ and R̃) is formed from the non-orthogonal pair (L and R) by subtracting the center-panned C from L and R.

Now let C = αĈ, where Ĉ is the presumed center channel and α is a scaling factor. Then

Substituting Equations (6) and (7) into Equation (3),

Equation (8) takes the form of a quadratic equation,

whose roots are found as follows.

Now, taking Ĉ in Equations (6) and (7) to be

and separating it into real and imaginary parts,

then, in the quadratic equation (9),

Substituting Equations (14), (15), and (16) into Equation (10) and solving for α,
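The numbered equations were images in the original publication and are absent from this text. The following is a reconstruction from the surrounding definitions rather than the patent's verbatim formulas. It assumes the presumed center is $\hat{C} = L + R$ (by analogy with the later statement that the presumed surround is $L - R$), writes $\tilde{L}, \tilde{R}$ for the channel-unique parts, and expresses orthogonality as a vanishing real inner product.

```latex
\begin{align*}
\tilde{L} &= L - \alpha\hat{C}, \qquad \tilde{R} = R - \alpha\hat{C}, \qquad \hat{C} = L + R \\
0 &= \operatorname{Re}\{\tilde{L}\tilde{R}^{*}\}
   = \alpha^{2}|\hat{C}|^{2} - \alpha\,\operatorname{Re}\{(L+R)\hat{C}^{*}\} + \operatorname{Re}\{LR^{*}\} \\
  &= |L+R|^{2}\,(\alpha^{2} - \alpha) + \operatorname{Re}\{LR^{*}\} \\
\alpha &= \frac{1}{2}\left(1 \pm \sqrt{1 - \frac{4\operatorname{Re}\{LR^{*}\}}{|L+R|^{2}}}\right)
        = \frac{1}{2}\left(1 \pm \frac{|L-R|}{|L+R|}\right)
\end{align*}
```

The last step uses the identity $|L+R|^{2} - 4\operatorname{Re}\{LR^{*}\} = |L-R|^{2}$, which is consistent with the later description of computing the per-bin sum and difference of left and right and squaring their real and imaginary parts.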

The negative root is chosen for the solution for α, and α is limited to the range {0, 0.5} to avoid confusion with surround-panned information (although these values are not critical to the invention). The phantom center channel equation then becomes

where


(The min{} and max{} functions limit α to the range {0, 0.5}, although these values are not critical to the invention.)

The phantom surround channel can be derived similarly, as follows.

where S is the surround-panned audio in the original stereo pair (L, R), and the presumed surround Ŝ is assumed to be (L − R). Here too, the negative root is chosen for the solution for β, and β is limited to the range {0, 0.5} to avoid confusion with center-panned information (although these values are not critical to the invention).

Now that C and S have been derived, they can be removed from the original stereo pair (L and R) to form four channels of audio from the two original channels. That is,

where L' is the derived left channel, C is the derived center channel, R' is the derived right channel, and S is the derived surround channel.
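The two-to-four channel derivation can be sketched for a single frequency bin as follows. This is an illustration under stated assumptions, not the patent's verbatim equations (which are missing from this text): the presumed center is taken as L + R, the presumed surround as L − R, the closed-form negative roots follow from the orthogonality condition, and the recombination L = L' + C + S, R = R' + C − S is assumed. The function name `upmix_bin` and the guard `eps` are hypothetical.

```python
def upmix_bin(left, right, eps=1e-12):
    """Split one complex frequency bin of a stereo pair into (L', C, R', S)."""
    total = left + right   # presumed center, assumed to be L + R
    diff = left - right    # presumed surround, assumed to be L - R
    # Negative root of each quadratic, limited to [0, 0.5] as in the text.
    alpha = min(max(0.5 * (1.0 - abs(diff) / (abs(total) + eps)), 0.0), 0.5)
    beta = min(max(0.5 * (1.0 - abs(total) / (abs(diff) + eps)), 0.0), 0.5)
    center = alpha * total
    surround = beta * diff
    # Assumed recombination: L = L' + C + S and R = R' + C - S.
    left_out = left - center - surround
    right_out = right - center + surround
    return left_out, center, right_out, surround


# Example: a bin whose channels are mostly in phase (center dominated).
l_out, c, r_out, s = upmix_bin(1 + 0.2j, 0.6 - 0.1j)
```

With these closed forms, α is nonzero only when |L − R| < |L + R| and β only when |L + R| < |L − R|, so at most one of the two phantom channels is active in any bin.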

Center channel extraction: application

As stated above, the primary concern for the speech enhancement method is the extraction of the center channel. In this section, the technique described above is applied to a complex frequency-domain representation of the audio signal.

The first step in extracting the phantom center channel is to perform a DFT on a block of audio samples and obtain the resulting transform coefficients. The block size of the DFT depends on the sampling rate. For example, at a sampling rate f_s of 48 kHz, a block size of N = 512 samples is possible. A windowing function w[n], such as a Hamming window, weights the block of samples before the transform is applied:

where n is an integer and N is the number of samples in a block.

The DFT coefficients are computed with Equation (25) as follows.

where x[n, c] is sample number n in channel c of block m, j is the imaginary unit (j² = −1), and X_m[k, c] is transform coefficient k in channel c for the samples of block m. Note that the number of channels is three: left, right, and phantom center (in the case of x[n, c], only left and right). In the equations below, the left channel is designated c = 1, the phantom center channel c = 2 (not yet derived), and the right channel c = 3. The fast Fourier transform (FFT) can be used to compute the DFT efficiently.
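The windowing and transform step can be sketched with the Python standard library alone. The window formula below is the common symmetric Hamming definition, which is an assumption, since the source names a Hamming window only as one example and its own window equation is missing from this text. The function names are hypothetical, and the direct O(N²) DFT is written out for clarity; an FFT routine yields the same coefficients much faster.

```python
import cmath
import math


def hamming(num_samples):
    """Symmetric Hamming window w[n] for n = 0 .. N-1 (assumed definition)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_samples - 1))
            for n in range(num_samples)]


def windowed_dft(block):
    """Weight one block of real samples by w[n] and return its N DFT coefficients."""
    n_total = len(block)
    window = hamming(n_total)
    weighted = [window[n] * block[n] for n in range(n_total)]
    return [sum(weighted[n] * cmath.exp(-2j * math.pi * k * n / n_total)
                for n in range(n_total))
            for k in range(n_total)]


# Example: one block of a tone that falls exactly on bin 5 of a 64-point DFT.
N = 64
tone = [math.cos(2.0 * math.pi * 5 * n / N) for n in range(N)]
X = windowed_dft(tone)
```

In the stereo case, the same function would be applied once to the left block and once to the right block to obtain the two transforms per block.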

The sum and the difference of left and right are computed on a per-frequency-bin basis. The real and imaginary parts are grouped and squared. Each bin is then smoothed across blocks before α is computed. This smoothing reduces audible artifacts, which occur when the power in a bin changes too rapidly between blocks of data. The smoothing may be performed with, for example, a leaky integrator, a non-linear smoother, a linear but multi-pole low-pass smoother, or a more elaborate smoother.

where Re{} is the real part, Im{} is the imaginary part, and λ₁ is the leaky integrator coefficient. The leaky integrator has a low-pass filtering effect, and a typical value for λ₁ is 0.9. The extraction coefficient α for block m is then derived using Equation (19) as follows.

The phantom center channel for block m is then derived using Equation (18) as follows.
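A block-by-block version of the extraction, with per-bin smoothing of the sum and difference powers, might look like the sketch below. The leaky integrator form (previous value weighted by λ₁, new value by 1 − λ₁) and the closed-form α are assumptions consistent with the description, since the patent's equations are missing from this text; the class name `CenterExtractor` is hypothetical.

```python
import math


class CenterExtractor:
    """Per-bin phantom center extraction with leaky-integrated powers."""

    def __init__(self, num_bins, lam=0.9):
        self.lam = lam                      # leaky integrator coefficient
        self.sum_pow = [0.0] * num_bins     # smoothed |L + R|^2 per bin
        self.diff_pow = [0.0] * num_bins    # smoothed |L - R|^2 per bin

    def process(self, left_bins, right_bins):
        """Return the phantom center bins for one block of DFT coefficients."""
        center = []
        for k, (l, r) in enumerate(zip(left_bins, right_bins)):
            s, d = l + r, l - r
            new_sum = s.real ** 2 + s.imag ** 2
            new_diff = d.real ** 2 + d.imag ** 2
            self.sum_pow[k] = self.lam * self.sum_pow[k] + (1.0 - self.lam) * new_sum
            self.diff_pow[k] = self.lam * self.diff_pow[k] + (1.0 - self.lam) * new_diff
            if self.sum_pow[k] > 0.0:
                alpha = 0.5 * (1.0 - math.sqrt(self.diff_pow[k] / self.sum_pow[k]))
            else:
                alpha = 0.0                 # nothing in phase: no center content
            alpha = min(max(alpha, 0.0), 0.5)
            center.append(alpha * s)
        return center


# Example: identical left and right bins are entirely center content.
extractor = CenterExtractor(num_bins=2)
center_bins = extractor.process([1 + 0j, 2j], [1 + 0j, 2j])
```

One extractor instance is kept alive across blocks so that the smoothed powers carry over, which is what suppresses block-to-block jumps in α.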

Spectral flattening

An embodiment of the spectral flattening of the invention is now described. Assuming a single channel that is predominantly speech, the speech signal is transformed into the frequency domain by the discrete Fourier transform (DFT) or a related transform. The magnitude spectrum is converted into a power spectrum by squaring the transform frequency bins.

The frequency bins are then grouped into bands, possibly on a critical-band or auditory-filter scale. Dividing the speech signal into critical bands mimics the human auditory system, specifically the cochlea. These filters exhibit an approximately rounded exponential shape and are spaced uniformly on the equivalent rectangular bandwidth (ERB) scale. The ERB scale is simply a measure used in psychoacoustics that approximates the bandwidth and spacing of auditory filters. FIG. 2 depicts a suitable set of filters with a spacing of 1 ERB, resulting in a total of 40 bands. Banding the audio data also helps to eliminate audible artifacts, which can occur when processing on a per-bin basis. The critical band power is then smoothed over time, that is, across adjacent blocks.
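To make the ERB spacing concrete, the sketch below places 40 band centers at 1 ERB intervals. It assumes the widely used Glasberg and Moore approximation of the ERB-number scale; the source text only says "ERB scale" and does not state which formula or which filter shapes it uses, so the function names and the formula here are illustrative assumptions.

```python
import math


def hz_to_erb_number(freq_hz):
    """ERB-number (in ERB units) of a frequency in Hz (Glasberg-Moore approximation)."""
    return 21.4 * math.log10(1.0 + 0.00437 * freq_hz)


def erb_number_to_hz(erb_number):
    """Inverse of hz_to_erb_number."""
    return (10.0 ** (erb_number / 21.4) - 1.0) / 0.00437


def erb_center_frequencies(num_bands=40, spacing=1.0):
    """Center frequencies in Hz of bands spaced uniformly on the ERB scale."""
    return [erb_number_to_hz(spacing * (p + 1)) for p in range(num_bands)]


centers = erb_center_frequencies()
```

Under this formula the 40th center lands below the 24 kHz Nyquist frequency of a 48 kHz system, so a 40-band set spans roughly the full audio range.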

The maximum power among the smoothed critical bands is found, and corresponding gains are calculated for the remaining (non-maximum) bands to bring their power closer to the maximum power. The gain compensation is similar to the compressive (non-linear) nature of the basilar membrane. These gains are limited to a maximum to avoid saturation. To apply the gains to the original signal, they must be converted back to the DFT format. Therefore, the per-band power gains are first converted back to per-bin power gains, and the per-bin power gains are then converted to magnitude gains by taking the square root of each bin. The original signal transform bins can then be multiplied by the calculated per-bin magnitude gains. The spectrally flattened signal is then transformed from the frequency domain back into the time domain. In the case of the phantom center, it is first mixed with the original signal before being returned to the time domain. FIG. 3 describes this process.

The spectral flattening system described above does not take into account the nature of the input signal. If a non-speech signal were flattened, the perceived change in timbre could be severe. To avoid processing non-speech signals, the method described above can be coupled with the voice activity detector 13. When the voice activity detector 13 indicates the presence of speech, the flattened speech is used.

Assume that the signal to be flattened has already been transformed into the frequency domain as described above. For simplicity, the channel notation used above is omitted. The DFT coefficients are converted to power and then from the DFT domain into critical bands.

where H[k, p] are the P critical band filters.

The power in each band is then smoothed across blocks, similar to the temporal integration that occurs at the cortical level of the brain. The smoothing may be performed with, for example, a leaky integrator, a non-linear smoother, a linear but multi-pole low-pass smoother, or a more elaborate smoother. This smoothing also helps to eliminate transient behavior, which can cause the gains to fluctuate too rapidly between blocks and produce audible pumping. The peak power is then found as follows.

where E_m[p] is the smoothed critical band power, λ₂ is the leaky integrator coefficient, and E_max is the peak power. The leaky integrator has a low-pass filtering effect, and a typical value for λ₂ is 0.9.

Next, the per-band power gain is found, with the maximum gain limited to avoid overcompensation:

where G_m[p] is the power gain to be applied to each band, G_max is the maximum permitted power gain, and γ determines the degree of spectral flattening. In practice, γ is close to 1. G_max depends on the dynamic range (or headroom) of the system performing the processing, as well as on any other global limits on the amount of gain specified. A typical value for G_max is 20 dB.

Next, the per-band power gains are converted to per-bin power gains, and the square root is taken to obtain per-bin magnitude gains.

where Y_m[k] is the per-bin magnitude gain.
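The gain computation described in this section can be sketched as below. Several details are assumptions, because the source equations are missing from this text: rectangular, non-overlapping bands stand in for the critical band filters H[k, p]; the per-band power gain is taken as (E_max / E[p])^γ capped at G_max; and the 20 dB limit is interpreted as a power ratio of 100. The function name `flattening_gains` and its parameters are hypothetical.

```python
import math


def flattening_gains(bin_power, band_edges, prev_smoothed=None,
                     lam=0.9, gamma=1.0, g_max_db=20.0):
    """Return (per-bin magnitude gains, smoothed band powers) for one block."""
    num_bands = len(band_edges) - 1
    # Band power: sum of the bin powers inside each (rectangular) band.
    band_power = [sum(bin_power[band_edges[p]:band_edges[p + 1]])
                  for p in range(num_bands)]
    # Leaky integration across blocks; the first block is used as is.
    if prev_smoothed is None:
        smoothed = band_power
    else:
        smoothed = [lam * old + (1.0 - lam) * new
                    for old, new in zip(prev_smoothed, band_power)]
    e_max = max(smoothed)
    g_max = 10.0 ** (g_max_db / 10.0)       # dB limit as a power ratio
    amp_gain = [1.0] * len(bin_power)
    if e_max <= 0.0:
        return amp_gain, smoothed           # silent block: leave untouched
    for p, energy in enumerate(smoothed):
        power_gain = g_max if energy <= 0.0 else min((e_max / energy) ** gamma, g_max)
        for k in range(band_edges[p], band_edges[p + 1]):
            amp_gain[k] = math.sqrt(power_gain)   # power gain -> magnitude gain
    return amp_gain, smoothed


# Example: three two-bin bands with powers 8, 2, and 0.0002.
gains, smoothed = flattening_gains([4.0, 4.0, 1.0, 1.0, 0.0001, 0.0001], [0, 2, 4, 6])
```

In the example, the strongest band is left alone, the middle band is raised to match it, and the weakest band is held at the 20 dB cap rather than being boosted by the full ratio.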

The magnitude gains are then modified based on the voice activity detector outputs 21, 22. A method for voice activity detection according to one embodiment of the invention is described below.

Voice activity detection

Spectral flux measures the speed at which the power spectrum of a signal changes, comparing the power spectrum between adjacent frames of audio. (A frame is multiple blocks of audio data.) Spectral flux serves as an indicator for voice activity detection or for speech-versus-other decisions in audio classification. Often, additional indicators are used, and the results are pooled to decide whether or not the audio is indeed speech.

In general, the spectral flux of speech is somewhat higher than that of music. That is, the music spectrum tends to be more stable between frames than the speech spectrum.

In the case of stereo, where a center channel is extracted, the DFT coefficients are first split into center audio and side audio (the original stereo minus the phantom center). This differs from traditional mid/side stereo processing: traditional mid/side processing is typically (L + R)/2 and (L − R)/2, whereas the center/side processing here is C and L + R − 2C.

With the signal transformed into the frequency domain as described above, the DFT coefficients are converted to power and then from the DFT domain into the critical band domain. The critical band power is then used to compute the spectral flux of both the center and the side.

where X_m[p] is the critical band version of the phantom center, S_m[p] is the critical band version of the residual signal (the sum of left and right less the center), and H[k, p] are the P critical band filters described above.

Two frame buffers (for the center and side magnitudes) are formed from the preceding 2J blocks of data.

The next step computes a weight W for the center channel from the average power of the current and previous frames. This is done over a limited range of bands.

The range of bands is limited to the primary bandwidth of speech, approximately 100 to 8000 Hz. The unweighted spectral flux for both the center and the side is then computed as follows.

where F_x(m) is the unweighted spectral flux of the center and F_s(m) is the unweighted spectral flux of the side.
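A generic spectral flux computation over two adjacent frames might be sketched as follows. The source equations are missing from this text, so the specific choices here are assumptions: each frame is the mean band power over J blocks, the flux is the sum of squared differences between the current and previous frames, and the weight is the sum of the two-frame average power over the selected bands. The function names and the `band_lo` / `band_hi` parameters are hypothetical; the same routine would be run once on the center bands and once on the side bands.

```python
def frame_mean(blocks):
    """Average band power over the blocks that make up one frame."""
    num_bands = len(blocks[0])
    return [sum(block[p] for block in blocks) / len(blocks)
            for p in range(num_bands)]


def spectral_flux(band_blocks, J, band_lo=0, band_hi=None):
    """Return (unweighted flux, weight) from the most recent 2J blocks.

    band_blocks is a list of per-block band powers, oldest first.
    band_lo / band_hi select the bands covering the speech bandwidth.
    """
    if len(band_blocks) < 2 * J:
        raise ValueError("need at least 2J blocks of band powers")
    previous = frame_mean(band_blocks[-2 * J:-J])   # older J blocks
    current = frame_mean(band_blocks[-J:])          # newer J blocks
    hi = len(current) if band_hi is None else band_hi
    flux = sum((current[p] - previous[p]) ** 2 for p in range(band_lo, hi))
    weight = sum(0.5 * (current[p] + previous[p]) for p in range(band_lo, hi))
    return flux, weight


# Example: a spectrum that changes shape between frames has nonzero flux.
changing = [[1.0, 2.0, 3.0]] * 2 + [[3.0, 2.0, 1.0]] * 2
flux, weight = spectral_flux(changing, J=2)
```

A stationary spectrum gives zero flux regardless of its level, which is why a separate power-based weight is needed to tell quiet passages from stable ones.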

A biased estimate of the spectral flux is then computed as follows.

If this condition holds, then

otherwise,

Here F_Tot(m) is the total flux estimate and W_min is the minimum permitted weight. W_min depends on the dynamic range, but a typical value is W_min = −60 dB.

The final smoothed value of the spectral flux is calculated by low-pass filtering F_Tot(m) with a simple first-order IIR low-pass filter. The filter depends on the sample rate of the signal and on the block size. In one embodiment, it can be defined as a first-order low-pass filter with a normalized cutoff of 0.025 * f_s for f_s = 48 kHz, where f_s is the sample rate of the digital system.
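A first-order IIR low-pass filter of this kind can be sketched as a one-pole smoother. The smoothing coefficient `alpha` below is a hypothetical parameter chosen for illustration. The text does not give the mapping from the stated normalized cutoff to a filter coefficient, so none is assumed here.

```python
def one_pole_lowpass(values, alpha, initial=0.0):
    """First-order IIR smoother: y[m] = y[m-1] + alpha * (x[m] - y[m-1]).

    alpha is in (0, 1]; smaller values smooth more heavily.
    """
    smoothed = []
    state = initial
    for x in values:
        state += alpha * (x - state)
        smoothed.append(state)
    return smoothed


# Hypothetical sequence of per-block flux estimates.
raw_flux = [0.0, 1.0, 1.0, 1.0, 0.0]
smooth_flux = one_pole_lowpass(raw_flux, alpha=0.5)
```

The smoothed sequence rises gradually toward 1 and decays gradually afterward, which avoids abrupt block-to-block jumps in the speech confidence.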

F_Tot(m) is then clipped to the following range, that is,

so that

(The min{} and max{} functions limit F_Tot(m) to the range {0, 1} in this embodiment.)

Mixing

The flattened center channel is mixed with the original audio signal based on the output of the voice activity detector.

The per-bin magnitude gain Y_m[k] for spectral flattening (as shown above) is applied to the temporary center channel X_m[k, 2] (as derived above).

The voice activity detector 13 sets F_Tot(m) = 1 when it detects speech and F_Tot(m) = 0 when it detects non-speech. Values between 0 and 1 are also possible and arise when the voice activity detector 13 makes a soft decision on the presence of speech.

For the left channel,

Similarly, for the right channel,

In practice, F_Tot is limited to a narrower range of values. For example,

preserves a small amount of both the flattened signal and the original signal in the final mix.
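The confidence-controlled mix can be illustrated with a linear crossfade. This is only one plausible form, assumed here because the mixing equations are not reproduced in this text. The limits `low=0.1` and `high=0.9`, along with all function and variable names, are hypothetical.

```python
def clip(value, low, high):
    """Limit value to the closed interval [low, high]."""
    return max(low, min(high, value))


def mix_channel(original, flattened_center, confidence, low=0.1, high=0.9):
    """Crossfade one channel's DFT bins toward the flattened center.

    confidence plays the role of F_Tot(m); clipping it to [low, high]
    keeps a little of both signals in the output at all times.
    """
    f = clip(confidence, low, high)
    return [f * c + (1.0 - f) * x for x, c in zip(original, flattened_center)]


left_bins = [1.0 + 0j, 0.0 + 2.0j]
flat_center = [2.0 + 0j, 0.0 + 0j]

mixed_speech = mix_channel(left_bins, flat_center, confidence=1.0)
mixed_music = mix_channel(left_bins, flat_center, confidence=0.0)
```

The right channel would be processed the same way with its own original bins. Even at a confidence of 1.0 the output retains a fraction of the original signal, and at 0.0 it retains a fraction of the flattened center.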

The per-bin magnitude gains are then applied to the original input signal, which is converted back to the time domain via the inverse DFT.

where

is the enhanced version of x, the original stereo input signal.
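Applying per-bin gains and returning to the time domain can be sketched with a naive DFT pair. This is an illustrative sketch only: it processes a single block without the windowing or overlap that a practical system would likely need, and the function names are hypothetical.

```python
import cmath


def dft(x):
    """Naive forward DFT of a sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse DFT, including the 1/n normalization."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def apply_gain(block, gains):
    """Scale each DFT bin by a real gain, then return to the time domain.

    For a real-valued result the gains should be symmetric,
    i.e. gains[k] == gains[n - k] for k = 1 .. n - 1.
    """
    shaped = [g * c for g, c in zip(gains, dft(block))]
    return [v.real for v in idft(shaped)]


block = [1.0, 0.0, -1.0, 0.0]
unchanged = apply_gain(block, [1.0, 1.0, 1.0, 1.0])
halved = apply_gain(block, [0.5, 0.5, 0.5, 0.5])
```

Unit gains reproduce the input block up to rounding error, which is a quick sanity check on the transform pair.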

FIG. 4 shows a computer 4 according to an embodiment of the present invention. The computer 4 includes a memory 41, a CPU 42, and a bus 43. The bus 43 communicatively couples the memory 41 and the CPU 42. The memory 41 stores a computer program for executing any of the methods described above.

Several embodiments of the invention have been described. Nevertheless, those skilled in the art will understand how to make various modifications to the described embodiments without departing from the spirit and scope of the invention. For example, although the description uses a discrete Fourier transform, those skilled in the art will recognize various alternative methods of transforming from the time domain to the frequency domain and back.

Prior art

Schaub, A. and P., "Spectral sharpening for speech enhancement noise reduction", Proc. ICASSP 1991, Toronto, Canada, May 1991, pp. 993-996.

Sondhi, "New methods of pitch extraction", Audio and Electroacoustics, IEEE Transactions, June 1968, Volume 16, Issue 2, pp. 262-266.

Villchur, E., "Signal Processing to Improve Speech Intelligibility for the Hearing Impaired", 99th Audio Engineering Society Convention, September 1995.

Thomas, I. and Niederjohn, R., "Preprocessing of Speech for Added Intelligibility in High Ambient Noise", 34th Audio Engineering Society Convention, March 1968.

Moore, B. et al., "A Model for the Prediction of Thresholds, Loudness, and Partial Loudness", J. Audio Eng. Soc., Vol. 45, No. 4, April 1997.

Moore, B. and Oxenham, A., "Psychoacoustic consequences of compression in the peripheral auditory system", The Journal of the Acoustical Society of America, December 2002, Volume 112, Issue 6, pp. 2962-2966.

Prior art: spectral flattening

U.S. patents

U.S. Patent No. 6732073 B1, titled "Spectral enhancement of acoustic signals to provide improved recognition of speech"

U.S. Patent No. 0993480 B1, titled "Voice intelligibility enhancement system"

U.S. Patent No. 2006/026320 A1, titled "Apparatus and method for noise reduction and speech enhancement with microphones and loudspeakers"

U.S. Patent No. 07191122, titled "Speech compression system and method"

U.S. Patent No. 2007/0094017, titled "Frequency domain format enhancement"

International patents

WO 2004/013840 A1, titled "Digital Signal Processing Techniques For Improving Audio Clarity And Intelligibility"

WO 2003/015082, titled "Sound Intelligibility Enhancement Using A Psychoacoustic Model And An Oversampled Filterbank"

Papers

Sallberg, B. et al.; "Analog Circuit Implementation for Speech Enhancement Purposes Signals"; Systems and Computers, 2004. Conference Record of the Thirty-Eighth Asilomar Conference.

Magotra, N. and Sirivara, S.; "Real-time digital speech processing strategies for the hearing impaired"; Acoustics, Speech, and Signal Processing, 1997. ICASSP-97, 1997, pages 1211-1214, vol. 2.

Walker, G., Byrne, D., and Dillon, H.; "The effects of multichannel compression/expansion amplification on the intelligibility of nonsense syllables in noise"; The Journal of the Acoustical Society of America, September 1984, Volume 76, Issue 3, pp. 746-757.

Prior art: center extraction

Adobe Audition has a vocal/instrument extraction function
http://www.adobeforums.eom/cgi-bin/webx/.3bc3a3e5

"Center cut" for winamp
http://www.hvdrogenaudio.org/forums/lofiversion/index.php/tl7450.html

Prior art: spectral flux

Vinton, M. and Robinson, C.; "Automated Speech/Other Discrimination for Loudness Monitoring", AES 118th Convention, 2005.

Scheirer, E. and Slaney, M., "Construction and evaluation of a robust multifeature speech/music discriminator", IEEE Transactions on Acoustics, Speech, and Signal Processing (ICASSP'97), 1997, pp. 1331-1334.

Claims

A method for enhancing speech, comprising:
extracting a center channel of an audio signal;
generating a confidence level for speech detection in the center channel;
flattening the spectrum of the center channel; and
enhancing any speech in the audio signal by mixing the flattened speech channel with the audio signal in proportion to the confidence of having detected speech.

The method of claim 1, wherein
the confidence varies from a lowest possible probability to a highest possible probability, and
the generating further comprises
limiting the generated confidence to a value that is higher than the lowest possible probability and lower than the highest possible probability.

The method of claim 1, wherein extracting the center channel of the audio signal comprises:
obtaining a temporary center channel from the sum of a first channel of the audio signal and a second channel of the audio signal;
calculating the product of (1) the first channel of the audio signal less a proportion α of the temporary center channel and (2) the conjugate of the second channel of the audio signal less the proportion α of the temporary center channel;
obtaining an extraction coefficient from the value of α that minimizes this product; and
obtaining the extracted center channel by multiplying the temporary center channel by the extraction coefficient.

The method of claim 3, wherein flattening the spectrum of the center channel comprises:
separating the center channel into perceptual bands;
determining which of the perceptual bands has the most energy; and
increasing the gain of the perceptual bands having less energy.

A computer-readable recording medium storing a computer program for executing the method according to claim 1.

A computer system comprising:
a CPU;
the recording medium according to claim 5; and
a bus coupling the CPU and the recording medium.

A speech enhancer comprising:
a center-channel extractor for extracting a center channel of an audio signal;
a spectral flattener for flattening the spectrum of the center channel;
a speech-confidence generator for generating a confidence level for speech detection in the center channel; and
a mixer that enhances any speech in the audio signal by mixing the flattened speech channel with the original audio signal in proportion to the confidence of having detected speech.