JP2000099096A - Component separation method of voice signal, and voice encoding method using this method

Info

Publication number: JP2000099096A
Application number: JP10265253A
Authority: JP
Inventors: Masahiro Oshikiri; 正浩押切; Kimio Miseki; 公生三関
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1998-09-18
Filing date: 1998-09-18
Publication date: 2000-04-07

Abstract

PROBLEM TO BE SOLVED: To provide a method for reliably separating an input voice signal, for each prescribed time unit, into a first component consisting mainly of voice and a second component consisting mainly of background noise. SOLUTION: A component separating section 100 applying this method has: noise suppression processing sections 101 and 102, which apply to the input voice signal a first noise suppression process with a relatively small suppression amount and a second noise suppression process with a relatively large suppression amount and output voice component extraction signals S1 and S2, respectively; a subtracter 103, which subtracts the signal S2 from the input voice signal to obtain a noise component extraction signal S3; a state deciding section 104, which decides from the signals S2 and S3, for each prescribed time unit, whether the input voice signal is in a first state in which the voice component is dominant, a second state in which the background noise component is dominant, or a third state that is neither of these; and a selecting section 105, which selects the first and second components from among the signals S1, S2, S3 and an all-zero signal S4 according to the decision of the state deciding section 104.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a speech signal component separation method for separating an input speech signal, for each predetermined time unit, into a first component consisting mainly of speech and a second component consisting mainly of background noise. It also relates to a speech encoding method that uses this separation method to compress and encode a speech signal, including its background noise, with high efficiency and in a form as close as possible to the original sound.

[0002]

2. Description of the Related Art

Conventional low bit rate speech coding aims to encode speech signals efficiently and uses coding methods that incorporate a model of the speech production process. Among such methods, those based on the CELP scheme have become especially widespread in recent years. With a CELP-based speech coding method, a speech signal captured in an environment with almost no background noise matches the coding model, so it can be encoded efficiently with relatively little degradation in sound quality.

[0003]

However, when a CELP-based speech coding method is applied to a speech signal captured under a high level of background noise, it is known that the background noise in the reproduced output signal changes markedly in character and becomes very unstable and unpleasant to hear. This tendency is especially pronounced at coding bit rates of 8 kbps and below.

[0004]

To mitigate this problem, methods have been proposed that try to reduce the quality degradation in background noise intervals by performing CELP coding with a more noise-like excitation signal in time intervals judged to be background noise. Such methods improve the sound quality of background noise intervals somewhat. However, they still rely on a speech production model in which speech is synthesized by passing an excitation signal through a synthesis filter, so the noise still tends to sound different from the background noise of the original, and the improvement is small.

[0005]

Problems to Be Solved by the Invention

As described above, when a conventional speech encoding method encodes a speech signal captured under a high level of background noise, the background noise in the reproduced output signal changes markedly in character and becomes very unstable and unpleasant to hear.

[0006]

One conceivable way to solve this problem is to separate the input speech signal into a component consisting mainly of speech and a component consisting mainly of background noise, and to encode each with a method based on a different model suited to the nature of that component. This requires the input speech signal to be separated correctly into the speech-dominant and noise-dominant components, but no effective technique for such component separation has been found so far.

[0007]

The present invention has been made in view of these points. Its object is to provide a speech signal component separation method that reliably separates an input speech signal, for each predetermined time unit, into a first component consisting mainly of speech and a second component consisting mainly of background noise, and a speech encoding method that uses this component separation method and is suited to realizing low-rate speech coding capable of reproducing speech, including background noise, in a form as close as possible to the original sound.

[0008]

Means for Solving the Problems

To solve the problems described above, the speech signal component separation method according to the present invention, when separating an input speech signal for each predetermined time unit into a first component consisting mainly of speech and a second component consisting mainly of background noise, has a separation mode in which the first component is the signal obtained by applying to the input speech signal a first noise suppression process with a relatively small amount of suppression, and the second component is the signal obtained by subtracting from the input speech signal the result of applying to it a second noise suppression process with a relatively large amount of suppression. This separation mode is effective when the input speech signal contains large amounts of both the speech component and the noise component.

[0009]

Applying the first noise suppression process, which has a relatively small amount of suppression, to the input speech signal yields a speech component with little loss of information. The decoded speech component obtained by encoding and decoding this speech component is perceptibly degraded in the portions where noise is mixed in, but because no speech information has been lost, the speech itself is decoded with high quality. On the other hand, the noise component obtained with the second noise suppression process, which has a relatively large amount of suppression (that is, the input speech signal minus the output of that process), represents the noise component accurately even though some speech is mixed in, and a good-quality decoded noise component can be obtained by applying noise encoding and decoding to this signal. Therefore, when the speech component and the noise component separated in this way are combined, the unnatural sound of the residual noise in the decoded speech component produced by speech encoding and decoding is masked by the decoded noise component produced by noise encoding and decoding and becomes inaudible, so that high-quality decoded speech is finally obtained.

[0010]

In a more specific speech signal component separation method according to the present invention, three signals are generated: a first signal obtained by applying to the input speech signal a first noise suppression process with a relatively small amount of suppression, a second signal obtained by applying to the input speech signal a second noise suppression process with a relatively large amount of suppression, and a third signal obtained by subtracting the second signal from the input speech signal. Then, for each predetermined time unit, it is determined whether the input speech signal is in a first state dominated by the speech component, a second state dominated by the background noise component, or a third state, other than the first and second states, that contains both the speech component and the background noise component. When the first state is determined, the input speech signal is output unchanged as the first component and a predetermined signal is output as the second component. When the second state is determined, the predetermined signal is output as the first component and the input speech signal is output as the second component. When the third state is determined, the first signal is output as the first component and the third signal is output as the second component. The third state corresponds to the separation mode described above. The predetermined signal is, for example, an all-zero signal.
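The selection rule of paragraph [0010] can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `State` and `select_components` are hypothetical, frames are plain Python lists of samples, and the predetermined signal is assumed to be the all-zero signal given as an example in the text.

```python
from enum import Enum


class State(Enum):
    SPEECH = 1  # first state: speech component dominant
    NOISE = 2   # second state: background noise component dominant
    MIXED = 3   # third state: both components present (separation mode)


def select_components(state, x, s1, s3):
    """Return (first_component, second_component) for one frame.

    x  : input speech signal frame
    s1 : first signal (output of the weak noise suppression)
    s3 : third signal (input minus output of the strong noise suppression)
    """
    zeros = [0.0] * len(x)  # predetermined signal, assumed to be all-zero
    if state is State.SPEECH:
        return list(x), zeros
    if state is State.NOISE:
        return zeros, list(x)
    return list(s1), list(s3)
```

In this sketch the second signal is not needed by the selector itself; it only enters through the third signal.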

[0011]

Alternatively, when the input speech signal is determined to be in the first state, the input speech signal may be output unchanged as the first component and the predetermined signal output as the second component; when it is determined to be in the second state, the second signal may be output as the first component and the third signal as the second component; and when it is determined to be in the third state, the first signal may be output as the first component and the third signal as the second component.

[0012]

Furthermore, the process of determining whether the input speech signal is in the first state, the second state, or the third state preferably includes a process of examining the magnitude of the pitch periodicity of the noise component extraction signal.

[0013]

With this component separation method, when the input speech signal is in the first state dominated by the speech component, the input speech signal is output unchanged, without noise suppression, as the speech-dominant first component and is encoded by an encoding method suited to the speech component. This avoids the degradation of the decoded speech signal that unnecessary noise suppression of the speech component would cause.

[0014]

The magnitude of the pitch periodicity of the noise component extraction signal is also examined. When it is below a certain threshold, the input speech signal is judged to be in the second state dominated by the background noise component, and the signal is output as the noise-dominant second component and encoded by an encoding method suited to the background noise component. When the magnitude of the pitch periodicity is at or above the threshold, the input speech signal is regarded as being in the first state dominated by the speech component, and the input speech signal is output as the first component and encoded by an encoding method suited to the speech component. Thus, when the noise component extraction signal has strong pitch periodicity, it is encoded by a method suited to the speech component, which avoids the unpleasant noise that would arise if it were encoded by a method suited to the noise-dominant second component.
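The periodicity test described above can be illustrated with a short sketch. The text does not say how the pitch periodicity is measured, so the peak of the normalized autocorrelation is assumed here as one common measure. The function names, the threshold of 0.5, and the lag range of 20 to 147 samples are all hypothetical values chosen for illustration.

```python
import math


def pitch_periodicity(frame, min_lag, max_lag):
    """Peak normalized autocorrelation of `frame` over lags min_lag..max_lag."""
    best = 0.0
    n = len(frame)
    for lag in range(min_lag, max_lag + 1):
        a = frame[lag:]       # frame shifted by `lag` samples
        b = frame[:n - lag]   # unshifted part of the same length
        num = sum(p * q for p, q in zip(a, b))
        den = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
        if den > 0.0:
            best = max(best, num / den)
    return best


def choose_coder(s3, threshold=0.5, min_lag=20, max_lag=147):
    """Route a frame based on the periodicity of the noise extraction signal.

    Returns "speech" when the periodicity is at or above the threshold
    (treated as the first state) and "noise" otherwise (second state).
    """
    if pitch_periodicity(s3, min_lag, max_lag) >= threshold:
        return "speech"
    return "noise"
```

A frame must be longer than `max_lag` for every lag to be evaluated; shorter frames simply skip the lags that leave no overlap.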

[0015]

That is, an encoder based on an encoding method suited to the noise-dominant second component generally cannot represent periodicity efficiently, so it may generate unpleasant noise when a periodic signal is input. In such a case, the unpleasant noise can be prevented by encoding the input speech signal with an encoder that can represent periodicity correctly, based on an encoding method suited to the speech-dominant first component.

[0016]

The speech encoding method according to the present invention separates the input speech signal, for each predetermined time unit, into a first component consisting mainly of speech and a second component consisting mainly of background noise using the component separation method described above. It then selects the bit allocation for each component from among a plurality of predetermined bit allocation candidates, based either on the first and second components or on the state determined during the component separation. Under this bit allocation, it encodes the first and second components with different predetermined encoding methods, and it outputs the encoded data of the first and second components together with the bit allocation information as transmission encoded data.

[0017]

As noted above, when CELP coding encodes a speech signal captured under a high level of background noise, the background noise in the reproduced speech signal changes markedly in character and becomes very unstable and unpleasant to hear. This is because background noise follows a model entirely different from that of the speech signals CELP handles well. Background noise should therefore be encoded by a method suited to it.

[0018]

In the speech encoding method according to the present invention, the input speech signal is separated, for each predetermined time unit, into a first component consisting mainly of speech and a second component consisting mainly of background noise, and each is encoded by a method based on a different model suited to the nature of that component. This improves the efficiency of the encoding as a whole. In addition, a bit allocation that allows each component to be encoded more efficiently is selected from the predetermined candidates based on the first and second components, and the two components are encoded under it, so the input speech signal can be encoded with high efficiency while the overall bit rate is kept low.

[0019]

In the speech decoding method according to the present invention, a speech signal is reproduced from the transmission encoded data obtained by the encoding described above, as follows. The input transmission encoded data is separated into the encoded data of the speech-dominant first component, the encoded data of the noise-dominant second component, and the bit allocation information for the encoded data of the two components. The bit allocation information is decoded to obtain the bit allocation of the encoded data of the first and second components. Under this bit allocation, the encoded data of the first and second components is decoded to reproduce the two components, and the reproduced first and second components are combined to generate the final output speech signal.

[0020]

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Before specific embodiments of the present invention are described, the basic idea of the speech signal component separation method according to the present invention is explained. In the speech encoding and decoding described later, if the speech component and the noise component cannot be separated accurately, the decoded speech suffers large perceptual degradation. However, it is generally difficult for a configuration that relies on noise suppression to always separate the components accurately. This problem of separating the speech and noise components by noise suppression is explained with reference to FIG. 1.

[0021]

FIG. 1(a) shows the relationship between the speech component information and the noise component information contained in an input speech signal. The horizontal axis represents time, and the vertical axis represents the amount of information of each component contained in the input speech signal. The noise component appears almost uniformly at all times, and the speech component is buried in the noise component where its information is small (here, at its beginning and end).

[0022]

Separating the components of this input speech signal by means of noise suppression gives the result shown in FIGS. 1(b) and 1(c). FIG. 1(b) shows the speech component after separation, and FIG. 1(c) shows the noise component. In FIG. 1(b), the noise component has been removed cleanly with almost none mixed in, but part of the speech component has been lost. In FIG. 1(c), the noise component has been separated, but the speech lost from FIG. 1(b) is mixed into it. If the speech component is encoded in this situation, the input to the speech encoder has already been degraded by the loss of information, so the quality of the decoded speech obtained after encoding is reduced.

[0023]

One conceivable countermeasure is to reduce the amount of suppression in the noise suppression process so that less speech information is lost, but this raises another problem, which is explained with reference to FIG. 2. Like FIG. 1(a), FIG. 2(a) shows the relationship between the speech component information and the noise component information contained in the input speech signal. FIG. 2(b) shows the speech component, and FIG. 2(c) the noise component, obtained when the input speech signal of FIG. 2(a) is separated using a noise suppression process with a small amount of suppression. In FIG. 2(b), because the amount of suppression is small, less of the speech component is lost, but more of the noise component is mixed in. When such a speech component is encoded, the decoded noise changes markedly in character and becomes very unstable and unpleasant to hear.

[0024]

The present invention therefore additionally uses a second noise suppression process whose amount of suppression is larger than that of the first noise suppression process, and provides a component separation method that avoids this loss of quality. The principle of the speech signal component separation method according to the present invention is explained below with reference to FIG. 3. Applying the noise suppression process with a relatively small amount of suppression (the first noise suppression process) to the input speech signal shown in FIG. 3(a) yields a speech component with little loss of information, as shown in FIG. 3(b). The decoded speech component obtained by encoding and decoding this speech component is perceptibly degraded in the portions where noise is mixed in, but because no speech information has been lost, the speech itself is decoded with high quality.

[0025]

On the other hand, the noise component obtained by applying the noise suppression process with a relatively large amount of suppression (the second noise suppression process) to the input speech signal shown in FIG. 3(a) represents the noise component accurately, as shown in FIG. 3(c), even though some speech is mixed in. A good-quality decoded noise component can be obtained by applying noise encoding and decoding to this signal.

[0026]

Therefore, when the speech component and the noise component separated as in FIGS. 3(b) and 3(c) are combined, the unnatural sound of the residual noise in the decoded speech component produced by speech encoding and decoding is masked by the decoded noise component produced by noise encoding and decoding and becomes inaudible, so that high-quality decoded speech is finally obtained.

[0027]

More specific embodiments of the speech signal component separation method according to the present invention, based on the principle described above, are explained below.

(First Embodiment)

FIG. 4 shows the configuration of a component separation unit 100 to which the speech signal component separation method according to the first embodiment of the present invention is applied. The component separation unit 100 separates the input speech signal, for each predetermined time unit, into a first component consisting mainly of speech and a second component consisting mainly of background noise. It consists of a first noise suppression processing unit 101, a second noise suppression processing unit 102, a subtractor 103, a state determination unit 104, and a selection unit 105.

[0028]

The first noise suppression processing unit 101 applies a noise suppression process with a relatively small amount of suppression to the input speech signal and outputs a first signal S1. The second noise suppression processing unit 102 applies a noise suppression process with a relatively large amount of suppression to the input speech signal and outputs a second signal S2. The subtractor 103 subtracts the second signal S2 from the input speech signal and outputs a third signal S3.

[0029]

In this example, the state determination unit 104 determines from the second signal S2 and the third signal S3, for each predetermined time unit, whether the input speech signal is in a first state dominated by the speech component, a second state dominated by the background noise component, or a third state, other than the first and second states, that contains both the speech component and the background noise component.

[0030]

As shown in FIG. 5, the state determination unit 104 consists of an SNR determination unit 111, a periodicity determination unit 112, an estimated energy determination unit 113, and an overall determination unit 114, which makes an overall determination from the results of the determination units 111, 112, and 113 and outputs the state determination result for the input speech signal. The determination result of the state determination unit 104 is supplied to the selection unit 105.
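The passage names the sub-units of the state determination unit 104 but does not give their decision rules, so the following is only a hedged sketch of how an overall determination might combine an SNR measure with a periodicity flag. The SNR definition (energy of S2 over energy of S3, in dB), the thresholds `high_db` and `low_db`, and the combination rule are all assumptions; the estimated energy determination is not modeled because this passage does not describe it.

```python
import math


def energy(frame):
    """Sum of squared samples of one frame."""
    return sum(v * v for v in frame)


def snr_db(s2, s3, eps=1e-12):
    """Assumed SNR measure: energy of S2 relative to energy of S3, in dB."""
    return 10.0 * math.log10((energy(s2) + eps) / (energy(s3) + eps))


def decide_state(s2, s3, periodic, high_db=20.0, low_db=0.0):
    """Return 1, 2 or 3 for the first, second or third state.

    periodic : True when the noise component extraction signal S3 shows
               strong pitch periodicity (hypothetical flag from unit 112).
    """
    snr = snr_db(s2, s3)
    if snr >= high_db:
        return 1                      # speech clearly dominant
    if snr <= low_db:
        return 1 if periodic else 2   # periodic "noise" is treated as speech
    return 3                          # both components present
```

The `eps` term only guards against taking the logarithm of zero for silent frames.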

[0031]

When the state determination unit 104 determines that the input speech signal is in the first state, the selection unit 105 outputs the input speech signal unchanged as the first component and outputs a predetermined signal S4 as the second component. When the input speech signal is determined to be in the second state, the selection unit 105 outputs either the predetermined signal S4 or the second signal S2 as the first component and outputs either the input speech signal or the third signal S3 as the second component. When the input speech signal is determined to be in the third state, the selection unit 105 outputs the first signal S1 as the first component and the third signal S3 as the second component.

[0032]

The state determination result from the state determination unit 104 in the component separation unit 100 is also input to a bit allocation selection unit 120 and a multiplexing unit 150, both described later. Because the state determination unit 104 determines which of the first, second, and third states the input speech signal is in, the bit allocation selection unit 120 allocates more bits to the speech encoding unit 130 when the input speech signal is in the speech-dominant first state, and more bits to the noise encoding unit 140 when it is in the noise-dominant second state. When the input speech signal is in the third state, which contains large amounts of both speech and background noise, bits are allocated to the speech encoding unit 130 and the noise encoding unit 140 in proportion to the amounts of information in the speech and noise components.
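The state-driven bit allocation can be sketched as a lookup in a table of predetermined candidates. The table below is purely hypothetical: the patent text gives no bit counts, and the figure of 78 payload bits per frame and the individual splits are invented here only to show the shape of the mechanism.

```python
# Hypothetical candidate table: state -> (speech coder bits, noise coder bits).
# All numbers are illustrative assumptions, not values from the patent.
BIT_ALLOCATION_CANDIDATES = {
    1: (78, 0),   # first state: speech dominant
    2: (0, 78),   # second state: background noise dominant
    3: (56, 22),  # third state: both components carry information
}


def select_bit_allocation(state):
    """Return (speech_bits, noise_bits) for the determined state."""
    try:
        return BIT_ALLOCATION_CANDIDATES[state]
    except KeyError:
        raise ValueError(f"unknown state: {state!r}")
```

Keeping the total constant across candidates, as in this table, gives a fixed frame size, which is one possible design choice rather than something the text requires.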

[0033]

As a result, the determination result of the state determination unit 104 also indicates the bit allocation chosen by the bit allocation selection unit 120. It can therefore be multiplexed by the multiplexing unit 150, as a bit allocation code, with the encoded data from the speech encoding unit 130 and the noise encoding unit 140.

【００３４】[0034] Next, the processing procedure of the component separation unit 100 is described with reference to the flowchart shown in FIG. 6. This processing is performed for each frame of the input speech signal. Existing techniques can be used for the first noise suppression processing unit 101 and the second noise suppression processing unit 102 in the component separation unit 100; here, as explained earlier, the case where the spectral subtraction method is used is described.

【００３５】[0035] First, the first noise suppression processing unit 101 applies noise suppression with a relatively small amount of suppression to the input speech signal to generate a first signal S1 (hereinafter, the first speech component extraction signal). The second noise suppression processing unit 102 applies noise suppression with a relatively large amount of suppression to the input speech signal to generate a second signal S2 (hereinafter, the second speech component extraction signal). The subtractor 103 then subtracts the second speech component extraction signal S2 from the input speech signal to generate a third signal S3 (hereinafter, the noise component extraction signal) (step S1001).

【００３６】[0036] Next, from the second speech component extraction signal S2 and the noise component extraction signal S3, the state determination unit 104 determines the state of the input speech signal for each predetermined time unit, that is, for each frame (step S1002). The state is one of the following: a first state in which the speech component is dominant, a second state in which the background noise component is dominant, and a third state, other than the first and second states, in which both the speech component and the background noise component are present.

【００３７】[0037] Based on this determination result, the selection unit 105 selects the speech-dominant component, which is the first component, and the background-noise-dominant component, which is the second component. When the input speech signal is in the first state, the input speech signal is output unchanged as the speech-dominant component, and a predetermined signal, for example the all-zero signal S4, is output as the background-noise-dominant component (step S1003). When the input speech signal is in the second state, a predetermined signal, for example the all-zero signal S4, is output as the speech-dominant component, and the input speech signal is output unchanged as the background-noise-dominant component (step S1004). When the input speech signal is in the third state, the first speech component extraction signal S1 is output as the speech-dominant component, and the noise component extraction signal S3 is output as the background-noise-dominant component (step S1005). The speech-dominant and background-noise-dominant components output in this way are encoded by the speech encoding unit and the noise encoding unit, respectively, as described later.
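The selection in steps S1003 to S1005 can be sketched as a small function. This is only an illustration: the names `State` and `select_components`, and the use of Python lists for frames, are assumptions and do not come from the patent text.

```python
from enum import Enum


class State(Enum):
    SPEECH = 1  # first state: speech component dominant
    NOISE = 2   # second state: background noise component dominant
    MIXED = 3   # third state: both components present


def select_components(state, x, s1, s3):
    """Return (speech_dominant, noise_dominant) for one frame.

    x  : input speech frame
    s1 : first speech component extraction signal (small suppression)
    s3 : noise component extraction signal (input minus S2)
    """
    zeros = [0.0] * len(x)  # plays the role of the all-zero signal S4
    if state is State.SPEECH:   # step S1003
        return list(x), zeros
    if state is State.NOISE:    # step S1004
        return zeros, list(x)
    return list(s1), list(s3)   # step S1005
```

In the first and second states the whole input frame goes to a single encoder, so the suppressed signals are only used when both components have to be coded.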

【００３８】[0038] Thus, according to this embodiment, when the input speech signal is in the third state, in which both the speech component and the background noise component are present, the first speech component extraction signal S1 from the first noise suppression processing unit 101, which has a relatively small amount of suppression, is output and encoded by the speech encoding unit using an encoding method suited to speech. Quality degradation caused by loss of speech component information can therefore be avoided.

【００３９】[0039] The flowchart of FIG. 7 shows another processing procedure of the component separation unit 100. It is the same as FIG. 6 except that step S1004 in FIG. 6 is replaced by step S1006. In step S1006, when the input speech signal is in the second state, the second speech component extraction signal S2 is output as the speech-dominant component and the noise component extraction signal S3 is output as the background-noise-dominant component. The result is almost the same with this procedure.

【００４０】[0040] Next, the processing procedure of the state determination unit 104 shown in FIG. 5 is described with reference to the flowchart shown in FIG. 8. This processing is also performed for each frame of the input speech signal. First, the estimated energy $E_{est}$ of the noise component extraction signal S3 is calculated by the following equation (1) (step S1101).

$$E_{est}(m) = \alpha \cdot E_{no} + (1 - \alpha) \cdot E_{est}(m-1) \qquad (1)$$

Here, $\alpha$ is an update coefficient and $m$ is the frame number of the input speech signal. $E_{no}$ is calculated according to equation (3).
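Equation (1) is a first-order recursive average. A minimal sketch follows; the function name and the default value of `alpha` are assumptions, since the text does not give a value for the update coefficient.

```python
def update_noise_energy(e_est_prev, e_no, alpha=0.1):
    """Equation (1): E_est(m) = alpha * E_no + (1 - alpha) * E_est(m - 1).

    e_est_prev : estimate from the previous frame, E_est(m - 1)
    e_no       : energy of the noise component extraction signal S3 in the current frame
    alpha      : update coefficient (0.1 is an illustrative value only)
    """
    return alpha * e_no + (1.0 - alpha) * e_est_prev
```

A small `alpha` makes the estimate follow the noise energy slowly, and `alpha = 1` makes it equal to the current frame's energy.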

【００４１】[0041] Next, the SNR (S/N ratio) of the second speech component extraction signal S2 and the noise component extraction signal S3, that is, the energy ratio of the two signals S2 and S3, is calculated as follows (step S1102). First, the energy $E_{sp}$ of the second speech component extraction signal S2 and the energy $E_{no}$ of the noise component extraction signal S3 are calculated by the following equations (2) and (3).

【００４２】[0042]

【数１】 (Equation 1)

【００４３】[0043] Here, sp(n) denotes the second speech component extraction signal S2, no(n) denotes the noise component extraction signal S3, and FLN denotes the frame length (= 160). Because the calculated energy $E_{sp}$ of the speech component extraction signal S2 and the calculated energy $E_{no}$ of the noise component extraction signal S3 are values in the log domain, their difference $E_{sp} - E_{no}$ represents the SNR, which is the energy ratio of the signals S2 and S3.
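Equations (2) and (3) are not reproduced in this text, so the following sketch assumes a common form for a log-domain frame energy, 10·log10 of the sum of squared samples. That form, the `eps` guard against log(0), and the function names are assumptions; only FLN = 160 and the fact that the SNR is a difference of log-domain energies come from the description.

```python
import math

FLN = 160  # frame length stated in the text


def log_energy(frame, eps=1e-12):
    """Assumed form: 10 * log10(sum of squares). eps avoids log(0) on silent frames."""
    return 10.0 * math.log10(sum(v * v for v in frame) + eps)


def snr(sp, no):
    """SNR as the difference of log-domain energies, E_sp - E_no."""
    return log_energy(sp) - log_energy(no)
```

Under this assumed form, a frame whose S2 amplitude is ten times that of S3 gives an SNR of about 20 dB.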

【００４４】[0044] Next, this SNR ($= E_{sp} - E_{no}$) is compared with a threshold TH1 as in the following equation (4) (step S1103).

$$\mathrm{SNR} \; (= E_{sp} - E_{no}) \ge TH1 \qquad (4)$$

If equation (4) is satisfied, the input speech signal is treated as a candidate for the speech-dominant component. If equation (4) is not satisfied, pitch analysis is performed on the noise component extraction signal S3, and the magnitude $G_{no}$ of the pitch periodicity is calculated by the following equations (5) and (6) (step S1104).

【００４５】[0045]

【数２】 (Equation 2)

【００４６】[0046] Here, TMIN and TMAX denote the minimum and maximum values of the pitch period, respectively, and NANA denotes the pitch analysis length.

【００４７】[0047] Next, the magnitude $G_{no}$ of the pitch periodicity calculated in this way is compared with a threshold TH2 as in the following equation (7) (step S1105).

$$G_{no} \ge TH2 \qquad (7)$$

If equation (7) is satisfied, the input speech signal is regarded as a signal dominated by the speech component and the first state is selected (step S1107). If equation (7) is not satisfied, the input speech signal is regarded as a signal dominated by the noise component and the second state is selected (step S1108).

【００４８】[0048] On the other hand, if equation (4) is satisfied in step S1103 and the input speech signal is a candidate for the speech-dominant component, then in step S1106 the estimated energy $E_{est}$ of the noise component extraction signal S3 calculated by equation (1) is compared with a threshold TH3 as in the following equation (8).

$$E_{est}(m) \ge TH3 \qquad (8)$$

If equation (8) is satisfied, the input speech signal has a relatively large speech component and also a relatively large background noise component, that is, it contains both, and the third state is selected (step S1109). If equation (8) is not satisfied, the background noise component in the input speech signal can be judged negligible, and the first state is selected (step S1107).
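The decision flow of steps S1103 to S1109 can be sketched as follows. The function name and argument order are assumptions, and the thresholds are passed in because the text gives no values for TH1, TH2, or TH3. In the flowchart the pitch analysis runs only when equation (4) fails; here `g_no` is simply passed in for brevity.

```python
def decide_state(snr, g_no, e_est, th1, th2, th3):
    """Return 1, 2, or 3 for the first, second, or third state.

    snr   : E_sp - E_no, log-domain energy difference of S2 and S3
    g_no  : pitch periodicity magnitude of S3
    e_est : smoothed noise energy estimate from equation (1)
    """
    if snr >= th1:                       # equation (4): speech candidate
        return 3 if e_est >= th3 else 1  # equation (8): noise also large?
    return 1 if g_no >= th2 else 2       # equation (7): periodic residual counts as speech
```

A frame with low SNR but strongly periodic S3 is still routed to the first state, which keeps periodic signals away from the noise encoder.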

【００４９】[0049] Thus, in determining the state of the input speech signal, steps S1104 to S1105 obtain the magnitude $G_{no}$ of the pitch periodicity of the noise component extraction signal S3 and check whether it is equal to or greater than the threshold TH2. When $G_{no}$ is equal to or greater than TH2, the first state is selected instead of the second state, so that the signal is treated in the same way as a speech-dominant component and encoded by the speech encoding unit. This avoids the problem of unpleasant noise being generated when a noise component extraction signal S3 with strong periodicity is encoded by the noise encoding unit, which cannot represent periodicity efficiently. The thresholds TH1, TH2, and TH3 may be predetermined fixed values, or they may be adaptively determined time-varying values, which can be expected to improve performance further.

【００５０】[0050] Although the state determination unit 104 has been described here as using the second speech component extraction signal S2 and the noise component extraction signal S3, the same processing can also be performed using the combination of the first speech component extraction signal S1 and the noise component extraction signal S3, the combination of the first speech component extraction signal S1 and the signal obtained by subtracting S1 from the input signal, or the combination of the second speech component extraction signal S2 and the signal obtained by subtracting the first speech component extraction signal S1 from the input speech signal.

【００５１】[0051] (Second Embodiment) FIG. 9 shows the configuration of a component separation unit 100 to which a speech signal component separation method according to a second embodiment of the present invention is applied. This component separation unit 100 comprises a frequency domain conversion unit 201, a noise spectrum estimation unit 202, a first gain vector calculation unit 203, a second gain vector calculation unit 204, multipliers 205 and 206, time domain conversion units 207 and 208, a switch 209, an update unit 210, a subtractor 103, a state determination unit 104, and a selection unit 105. Components in FIG. 9 that are the same as those in FIG. 4 are given the same reference numerals, and their detailed description is omitted.

【００５２】[0052] As in the first embodiment, this embodiment includes two noise suppression processes with different amounts of suppression; its distinguishing feature is that part of these two processes is shared in order to reduce processing complexity. The processing procedure of the component separation unit 100 is described below with reference to the flowcharts shown in FIGS. 10 and 11. This processing is performed for each frame of the input speech signal. Although this embodiment is described for the case where the spectral subtraction method is used for noise suppression, it can also be implemented with another noise suppression method.

【００５３】[0053] First, the frequency domain conversion unit 201 converts the input speech signal, which is a time-domain signal, into the frequency domain to obtain spectral coefficients (step S2101). Next, based on the estimated background noise spectrum that the noise spectrum estimation unit 202 derives from past background noise, the first gain vector calculation unit 203, which has a relatively small amount of suppression, obtains a first gain vector, and the second gain vector calculation unit 204, which has a relatively large amount of suppression, obtains a second gain vector (step S2102). The first and second gain vectors here each denote a set of gains by which the corresponding spectral coefficients of the input speech signal are multiplied, which is why the term vector is used. In other words, step S2102 obtains a gain for each spectral coefficient of the input speech signal. Alternatively, instead of a gain per spectral coefficient, the spectrum may be divided into bands of predetermined width and a suppression gain obtained for each band. In that case, as many gains as there are bands are calculated.

【００５４】[0054] Next, the multiplier 205 multiplies the spectral coefficients of the input speech signal by the corresponding gains of the first gain vector (step S2103), and the time domain conversion unit 207 converts the result back to the time domain to generate the first speech component extraction signal S1 (step S2104). Similarly, the multiplier 206 multiplies the spectral coefficients of the input speech signal by the corresponding gains of the second gain vector (step S2105), and the time domain conversion unit 208 converts the result back to the time domain to generate the second speech component extraction signal S2 (step S2106).
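Steps S2101 to S2106, together with the subtraction that yields S3, can be sketched in pure Python. Several things here are assumptions rather than the patent's method: the gain rule `max(floor, 1 - beta * N_k / |X_k|^2)`, the values of `beta1` and `beta2`, the naive O(n²) DFT (used only to keep the example self-contained), and all function names. The point illustrated is that one forward transform is shared by both suppression paths.

```python
import cmath


def dft(x):
    """Naive forward DFT, shared by both suppression paths (step S2101)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    """Naive inverse DFT; returns the real part as the time-domain frame."""
    n = len(spec)
    return [(sum(spec[k] * cmath.exp(2j * cmath.pi * k * t / n)
                 for k in range(n)) / n).real for t in range(n)]


def gain_vector(spec, noise_power, beta, floor=0.0):
    """One gain per spectral coefficient; a larger beta suppresses more (assumed rule)."""
    gains = []
    for xk, nk in zip(spec, noise_power):
        p = abs(xk) ** 2
        g = 1.0 - beta * nk / p if p > 0.0 else floor
        gains.append(max(floor, g))
    return gains


def separate(x, noise_power, beta1=1.0, beta2=2.0):
    """Return (S1, S2, S3) for one frame x given a noise power estimate per bin."""
    spec = dft(x)
    g1 = gain_vector(spec, noise_power, beta1)    # small suppression (unit 203)
    g2 = gain_vector(spec, noise_power, beta2)    # large suppression (unit 204)
    s1 = idft([g * c for g, c in zip(g1, spec)])  # steps S2103 and S2104
    s2 = idft([g * c for g, c in zip(g2, spec)])  # steps S2105 and S2106
    s3 = [a - b for a, b in zip(x, s2)]           # subtractor 103
    return s1, s2, s3
```

A real implementation would use an FFT and overlapping windows; the sketch omits both to stay short.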

【００５５】[0055] Alternatively, the speech component signal may be extracted by subtracting the estimated noise spectrum from the spectral coefficients of the input speech signal. In this case, the first gain vector calculation unit 203 and the second gain vector calculation unit 204 each have the function of calculating the values to be subtracted, and the multipliers 205 and 206 are replaced with subtractors.

【００５６】[0056] Next, the subtractor 103 subtracts the second speech component extraction signal S2 from the input speech signal to generate the noise component extraction signal S3 (step S2107).

【００５７】[0057] Next, from the second speech component extraction signal S2 and the noise component extraction signal S3, the state determination unit 104 determines the state of the input speech signal for each predetermined time unit, that is, for each frame (step S2108). The state is one of the following: a first state in which the speech component is dominant, a second state in which the background noise component is dominant, and a third state in which both the speech component and the background noise component are present.

【００５８】[0058] Based on this determination result, the selection unit 105 selects the speech-dominant component and the background-noise-dominant component. When the input speech signal is in the first state, the input speech signal is output unchanged as the speech-dominant component, and the all-zero signal S4 is output as the background-noise-dominant component (step S2109). When the input speech signal is in the second state, the all-zero signal S4 is output as the speech-dominant component, and the input speech signal is output unchanged as the background-noise-dominant component (step S2110). When the input speech signal is in the third state, the first speech component extraction signal S1 is output as the speech-dominant component, and the noise component extraction signal S3 is output as the background-noise-dominant component (step S2111). The speech-dominant and background-noise-dominant components output in this way are encoded by a speech encoding unit and a noise encoding unit (not shown), respectively.

【００５９】[0059] Next, in preparation for processing the input speech signal of the next frame, it is determined whether to update the noise spectrum estimate (step S2113). The method of updating the background noise estimate is not directly related to the essence of the present invention, and an existing method may be applied. For example, since the noise spectrum estimate should always reflect the most recent spectral shape of the background noise, the switch 209 is turned on to activate the update unit 210 when the second or third state is selected, that is, when a background-noise-dominant component can be regarded as present in the current frame. When the first state is selected, in which no background-noise-dominant component is present, the switch 209 is turned off so that the update unit 210 is not activated. The update unit 210 operates only when the switch 209 is on, and updates the background noise estimate so that it reflects the spectral coefficients obtained in the current frame (step S2114).
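The gating by switch 209 can be sketched as below. The gating condition (update only in the second and third states) follows the description; the exponential smoothing rule and the `rate` value are assumptions, since the text leaves the update method to existing techniques.

```python
def maybe_update_noise(noise_est, frame_power, state, rate=0.1):
    """Return the noise power estimate to use for the next frame.

    noise_est   : current noise power estimate per spectral coefficient
    frame_power : power spectrum of the current frame
    state       : 1, 2, or 3 as decided by the state determination unit
    rate        : smoothing factor (illustrative value only)
    """
    switch_on = state in (2, 3)  # switch 209: noise is considered present
    if not switch_on:
        return list(noise_est)   # first state: keep the estimate unchanged
    return [(1.0 - rate) * n + rate * p for n, p in zip(noise_est, frame_power)]
```

Skipping the update in the first state keeps clean speech frames from leaking into the noise estimate.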

【００６０】[0060] Thus, according to this embodiment, processing complexity is lower than in the first embodiment, and when the input speech signal is in the third state, in which both the speech component and the background noise component are present, the speech component extraction signal from the noise suppression process with the small amount of suppression is output and encoded by the speech encoding unit using an encoding method suited to speech. Quality degradation caused by loss of speech component information can therefore be avoided.

【００６１】[0061] The flowchart shown in FIG. 12 shows another processing procedure of the component separation unit 100 of FIG. 9. It is the same as the procedure of FIGS. 10 and 11 except that step S2110 in FIG. 11 is replaced by step S2112. In step S2112, when the input speech signal is in the second state, the second speech component extraction signal S2 is output as the speech-dominant component and the noise component extraction signal S3 is output as the background-noise-dominant component. The result is almost the same with this procedure.

【００６２】[0062] (Third Embodiment) Next, as a third embodiment of the present invention, a speech encoding/decoding method that uses the speech signal component separation method according to the present invention is described.

【００６３】[0063] [Encoding Side] FIG. 13 shows the configuration of a speech encoding apparatus to which a speech encoding method according to the third embodiment of the present invention is applied. This speech encoding apparatus comprises a component separation unit 100, a bit allocation selection unit 120, a speech encoding unit 130, a noise encoding unit 140, and a multiplexing unit 150.

【００６４】[0064] The component separation unit 100 analyzes the input speech signal for each predetermined time unit and separates it into a first component dominated by speech (the speech-dominant component) and a second component dominated by the remaining background noise (the background-noise-dominant component). Its configuration is as described in the first embodiment.

【００６５】[0065] Based on the determination result of the state determination unit 104 in the component separation unit 100, the bit allocation selection unit 120 selects the number of encoding bits to allocate to each of the speech encoding unit 130 and the noise encoding unit 140 from predetermined bit allocation combinations, and outputs the bit allocation information to the speech encoding unit 130 and the noise encoding unit 140. At the same time, information on the determination result of the state determination unit 104 is output to the multiplexing unit 150 as transmission information.

【００６６】[0066] The bit allocation is preferably selected according to the determination result of the state determination unit 104, but it is not limited to this. For example, to obtain more stable sound quality, it is also effective to decide the bit allocation in combination with a mechanism that monitors past changes in the bit allocation and makes abrupt changes less likely. Table 1 below shows one example of the bit allocation combinations prepared in the bit allocation selection unit 120 and the codes that represent them.

【００６７】[0067]

【表１】 [Table 1]

【００６８】[0068] According to Table 1, when mode "0" is selected, 78 bits per frame are allocated to the speech encoding unit 130 and no bits are allocated to the noise encoding unit 140. A 2-bit bit allocation code is sent in addition, so the total number of bits required to encode the input speech signal is 80. The mode "0" bit allocation is preferably selected for frames in which there is almost no background-noise-dominant component compared with the speech-dominant component. This increases the number of bits allocated to the speech encoding unit and improves the quality of the reproduced speech.

【００６９】[0069] When mode "1" is selected, no bits are allocated to the speech encoding unit 130 and 78 bits are allocated to the noise encoding unit 140. A 2-bit bit allocation code is sent in addition, so the total number of bits required to encode the input speech signal is 80. The mode "1" bit allocation is preferably selected for frames in which the speech-dominant component is negligible compared with the background-noise-dominant component.

【００７０】[0070] When mode "2" is selected, 78 − Y bits are allocated to the speech encoding unit 130 and Y bits are allocated to the noise encoding unit 140. Here, Y is a sufficiently small positive integer. Y = 8 is used in this description, but Y is not limited to this value. In mode "2", a 2-bit bit allocation code is again sent in addition, so the total number of bits required to encode the input speech signal is 80.
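The three allocations described for modes "0", "1", and "2" can be written out as a small lookup. The constants come from the description (80 bits per frame, a 2-bit allocation code, Y = 8); the function name and the tuple layout are assumptions made for illustration.

```python
FRAME_BITS = 80      # total bits per frame
MODE_CODE_BITS = 2   # bits used to send the bit allocation code
Y = 8                # noise bits in mode 2 (value used in the description)


def allocation(mode, y=Y):
    """Return (speech_bits, noise_bits) for mode 0, 1, or 2."""
    payload = FRAME_BITS - MODE_CODE_BITS  # 78 bits left for the two encoders
    table = {
        0: (payload, 0),      # speech only
        1: (0, payload),      # background noise only
        2: (payload - y, y),  # mostly speech, a few bits for noise
    }
    return table[mode]
```

In every mode the two allocations plus the 2-bit code add up to 80 bits, which is what keeps the bit rate fixed.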

【００７１】[0071] For frames in which the speech-dominant component and the background-noise-dominant component coexist, a bit allocation such as that of mode "2" is preferably selected. In this case, because the speech-dominant component is clearly more important perceptually, the number of bits given to the noise encoding unit 140 is made very small, and the number of bits given to the speech encoding unit 130 is increased by the amount saved, so that the speech-dominant component can be encoded accurately. The key point is then how efficiently the noise encoding unit 140 can encode the background-noise-dominant component with few bits; for a concrete implementation, the technique described in Japanese Patent Application No. 10-128876, for example, may be used.

【００７２】[0072] In this way, speech and background noise can each be encoded efficiently by its own encoding unit, and speech accompanied by natural background noise can be reproduced. For speech coding, a frame length of about 10 to 30 ms is appropriate. In this example, the total number of bits per frame is fixed at 80 for all three bit allocation combinations. Fixing the total number of bits per frame in this way allows encoding at a fixed bit rate regardless of the input speech.
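As a worked example of the fixed bit rate, 80 bits per frame and the 10 to 30 ms frame lengths mentioned above give the following rates. The helper name is an assumption; the arithmetic is just bits per frame divided by frame duration.

```python
def bitrate_bps(bits_per_frame, frame_ms):
    """Bit rate in bits per second for a fixed number of bits per frame."""
    return bits_per_frame * 1000.0 / frame_ms


# 80 bits per frame at 10, 20, and 30 ms frame lengths
rates = [bitrate_bps(80, ms) for ms in (10, 20, 30)]
```

With 80 bits per frame this gives 8 kbit/s at 10 ms and 4 kbit/s at 20 ms.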

【００７３】[0073] Although an example with three bit allocations has been shown here, the present invention can clearly also be applied to configurations that use other bit allocations.

【００７４】[0074] The speech encoding unit 130 receives the speech-dominant component from the component separation unit 100 and encodes it by speech coding that reflects the characteristics of speech signals. Any encoding method may of course be used for the speech encoding unit 130 as long as it can encode speech signals efficiently; here, as one example of a method that can provide more natural speech, the CELP method is used. CELP normally performs encoding in the time domain and is characterized by encoding the excitation signal so that the distortion of the synthesized waveform in the time domain is small.

【００７５】[0075] The noise encoding unit 140 receives the background-noise-dominant component from the component separation unit 100 and is configured to encode noise appropriately. In general, the spectrum of a background noise signal varies more slowly over time than that of a speech signal. In addition, the phase information of its waveform is close to random, and phase information is not very important to the human ear. To encode such a background noise component efficiently, a method such as transform coding, which transforms the signal from the time domain into a transform domain and encodes the transform coefficients or parameters extracted from them, is more efficient than waveform coding that minimizes waveform distortion, such as CELP. In particular, transforming into the frequency domain and encoding with the characteristics of human hearing taken into account can raise the coding efficiency further.

[0076] Next, the processing procedure of the speech encoding method of this embodiment is described with reference to the flowchart of FIG. 14. First, the input speech signal is taken in for each predetermined time unit (step S100), and the component separation unit 100 analyzes it and separates it into a speech-dominant component and a background-noise-dominant component (step S101).

[0077] Next, based on the determination result of the state determination unit 104 in the component separation unit 100, the bit allocation selection unit 120 selects the numbers of bits to be allocated to the speech encoding unit 130 and the noise encoding unit 140 from among predetermined combinations of bit allocations, and outputs information on that result to the speech encoding unit 130 and the noise encoding unit 140 (step S102).

[0078] The speech encoding unit 130 and the noise encoding unit 140 then perform encoding according to the respective bit allocations selected by the bit allocation selection unit 120 (step S103). Specifically, the speech encoding unit 130 receives the speech-dominant component from the component separation unit 100, encodes it with the number of bits allocated to the speech encoding unit 130, and obtains encoded data for the speech-dominant component.

[0079] Meanwhile, the noise encoding unit 140 receives the background-noise-dominant component from the component separation unit 100, encodes it with the number of bits allocated to the noise encoding unit 140, and obtains encoded data for the background-noise-dominant component.

[0080] Next, the multiplexing unit 150 multiplexes the encoded data from the encoding units 130 and 140 with the information on the bit allocation to the encoding units 130 and 140, and outputs the result to the transmission path as transmission encoded data (step S104). This completes the encoding performed within the predetermined time interval. It is then determined whether to continue encoding for the next time interval or to stop (step S105).
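The per-frame flow of steps S102 to S104 can be sketched roughly as follows. This is only an illustration under stated assumptions: the allocation table values, the state names, the 2-bit allocation index, and the stub encoder are all hypothetical and do not come from this specification.

```python
# Hypothetical per-frame encoder flow following steps S102-S104.
# The bit budgets and the stub encoder are assumptions for illustration.

# Predetermined combinations of (speech bits, noise bits); values are made up.
BIT_ALLOCATIONS = {
    "speech": (80, 0),   # first state: speech dominant
    "noise": (0, 80),    # second state: background noise dominant
    "mixed": (64, 16),   # third state: both present
}
MODE_INDEX = {"speech": 0, "noise": 1, "mixed": 2}


def select_bit_allocation(state):
    """Step S102: pick one of the predetermined allocations from the state."""
    return BIT_ALLOCATIONS[state]


def stub_encode(component, n_bits):
    """Stand-in for encoder 130 or 140: returns a bit string of length n_bits."""
    if n_bits == 0:
        return ""
    energy = int(sum(x * x for x in component)) % (1 << n_bits)
    return format(energy, "0{}b".format(n_bits))


def encode_frame(speech_part, noise_part, state):
    """Steps S102-S104 for one frame that has already been separated (S101)."""
    speech_bits, noise_bits = select_bit_allocation(state)
    speech_code = stub_encode(speech_part, speech_bits)   # S103, unit 130
    noise_code = stub_encode(noise_part, noise_bits)      # S103, unit 140
    mode = format(MODE_INDEX[state], "02b")               # allocation info
    return mode + speech_code + noise_code                # S104, multiplexing


frame = encode_frame([1.0, -2.0, 3.0], [0.1, 0.2, -0.1], "mixed")
```

Because every entry of the hypothetical table sums to the same total, the multiplexed frame has a fixed length regardless of the selected state, which is one plausible reason for choosing among predetermined combinations.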

[0081] In the above embodiment, when the component separation unit 100 separates the input speech signal into the speech-dominant component and the background-noise-dominant component, poor separation performance degrades the quality of the decoded speech signal. For example, if the input speech signal is passed through a noise canceller for noise suppression in order to perform this separation, the unnecessary noise suppression distorts the speech-dominant component even when the input speech signal contains no background noise, and feeding this distorted component to the speech encoding unit 130 degrades the quality of the decoded speech signal.

[0082] Background noise normally has no periodicity, but background noise such as people talking in the distance does. In such a case, if periodicity remains in the background-noise-dominant component obtained after separation and this component is input to the noise encoding unit 140 as is, unpleasant noise may be generated for background-noise-dominant components with strong pitch periodicity (such as an interfering talker or babble noise), because the noise encoding unit 140, unlike the speech encoding unit 130, is structurally unable to represent periodicity (the pitch period) efficiently.

[0083] In contrast, if the component separation unit 100 uses the speech signal component separation method described in the first embodiment, the components can be separated accurately without these problems.

[0084] [Specific Example of Speech Encoding Apparatus] FIG. 15 shows a specific example of the speech encoding apparatus in which the CELP scheme is used for the speech encoding unit 130 and transform coding is used for the noise encoding unit 140. As a model of the speech production process, the CELP scheme associates the vocal cord signal with an excitation signal, represents the spectral envelope characteristic of the vocal tract with a synthesis filter, feeds the excitation signal into the synthesis filter, and represents the speech signal by the output of the synthesis filter. Its distinguishing feature is that the excitation signal is encoded so that the waveform distortion between the speech signal submitted to CELP encoding and the speech reproduced after encoding is perceptually small.

[0085] The speech encoding unit 130 receives the speech-dominant component from the component separation unit 100 and encodes it so that waveform distortion in the time domain is reduced. Each encoding step within the encoding unit 130 is performed under a predetermined assignment of bits that depends on the bit allocation from the bit allocation selection unit 120. By making the sum of the bits used by the individual encoding sections within the encoding unit 130 equal to the bit allocation given to the encoding unit 130 by the bit allocation selection unit 120, the performance of the speech encoding unit 130 can be exploited to the fullest. The same applies to the encoding unit 140.

[0086] The CELP encoding described here uses a spectral envelope codebook search unit 311, an adaptive codebook search unit 312, a noise codebook search unit 313, and a gain codebook search unit 314. The codebook indices found by the codebook search units 311 to 314 are input to an encoded data output unit 315, which outputs them to the multiplexing unit 150 as speech encoded data.

[0087] Next, the functions of the codebook search units 311 to 314 in the speech encoding unit 130 are described. The spectral envelope codebook search unit 311 receives the speech-dominant component from the component separation unit 100 frame by frame, searches a spectral envelope codebook prepared in advance, selects the codebook index that better represents the spectral envelope of the input signal, and outputs this index to the encoded data output unit 315. The CELP scheme normally uses LSP (Line Spectrum Pair) parameters to encode the spectral envelope, but the parameters are not limited to these; any other parameters that can represent the spectral envelope are also usable.

[0088] The adaptive codebook search unit 312 is used to represent the component of the excitation that repeats at the pitch period. In the CELP scheme, a predetermined length of the previously encoded excitation signal is stored as an adaptive codebook, and both the speech encoder and the speech decoder hold it, so that a signal repeating at a specified pitch period can be drawn from the adaptive codebook. Because the output signal of the adaptive codebook corresponds one-to-one to a pitch period, the pitch period can be used as the adaptive codebook index. With this structure, the adaptive codebook search unit 312 evaluates, at a perceptually weighted level, the distortion between the target speech signal and the signal synthesized by passing the codebook output through the synthesis filter, and searches for the pitch period index that makes this distortion small. It then outputs the index found to the encoded data output unit 315.
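The idea of drawing a pitch-periodic signal from the past excitation and searching for the best lag can be sketched as follows. This is a deliberately simplified illustration: it compares signals directly in the excitation domain and omits the synthesis filter and perceptual weighting that the text describes, and the function names and lag range are hypothetical.

```python
# Simplified adaptive-codebook (pitch lag) search.
# Assumption: distortion is measured directly on the excitation, without the
# synthesis filter and perceptual weighting that a real CELP coder applies.

def adaptive_vector(past_excitation, lag, length):
    """Repeat the last `lag` samples of the past excitation to fill a frame."""
    segment = past_excitation[-lag:]
    return [segment[i % lag] for i in range(length)]


def search_pitch_lag(past_excitation, target, min_lag, max_lag):
    """Return the lag whose gain-scaled adaptive vector best matches the target."""
    best_lag, best_score = min_lag, -1.0
    for lag in range(min_lag, max_lag + 1):
        candidate = adaptive_vector(past_excitation, lag, len(target))
        corr = sum(t * c for t, c in zip(target, candidate))
        energy = sum(c * c for c in candidate)
        # With the optimal gain, the error is |target|^2 - corr^2 / energy,
        # so maximizing corr^2 / energy minimizes the squared error.
        score = corr * corr / energy if energy > 0.0 else 0.0
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# A pulse train with period 5 serves as a toy past excitation.
past = [1.0 if n % 5 == 0 else 0.0 for n in range(40)]
target = [1.0 if n % 5 == 0 else 0.0 for n in range(10)]
lag = search_pitch_lag(past, target, min_lag=3, max_lag=12)
```

Since the decoder holds the same past excitation, transmitting only the lag is enough for it to rebuild the same adaptive vector.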

[0089] The noise codebook search unit 313 is used to represent the noise-like component of the excitation. In the CELP scheme, the noise component of the excitation is represented using a noise codebook, from which various noise signals can be drawn according to a specified noise index. With this structure, the noise codebook search unit 313 evaluates, at a perceptually weighted level, the distortion between the synthesized speech signal reproduced from the codebook output and the speech signal that is the target for the noise codebook search unit 313, and searches for the noise index that makes this distortion small. It then outputs the noise index found to the encoded data output unit 315.

[0090] The gain codebook search unit 314 is used to represent the gain component of the excitation. In the CELP scheme, the gain codebook search unit 314 encodes two gains: the gain applied to the pitch component and the gain applied to the noise component. In the codebook search, the distortion between the target speech signal and the synthesized speech signal reproduced using gain candidates drawn from the codebook is evaluated at a perceptually weighted level, and the gain index that makes this distortion small is searched for. The gain index found is output to the encoded data output unit 315, which outputs the encoded data to the multiplexing unit 150.
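A gain codebook search of this kind can be sketched as an exhaustive loop over candidate gain pairs. The codebook entries below are made-up values, and the error is a plain squared error rather than the perceptually weighted distortion on the synthesized signal that the text describes, so this is an assumption-laden sketch rather than the method of the specification.

```python
# Simplified gain codebook search over (pitch gain, noise gain) pairs.
# Assumptions: the codebook entries are made-up values, and the error is a
# plain squared error with no synthesis filter or perceptual weighting.

GAIN_CODEBOOK = [(0.0, 0.5), (0.5, 0.5), (1.0, 0.25), (1.0, 1.0)]


def search_gain_index(target, pitch_vec, noise_vec, codebook=GAIN_CODEBOOK):
    """Return the index of the gain pair giving the smallest squared error."""
    best_index, best_error = 0, float("inf")
    for index, (gp, gn) in enumerate(codebook):
        error = sum(
            (t - (gp * p + gn * n)) ** 2
            for t, p, n in zip(target, pitch_vec, noise_vec)
        )
        if error < best_error:
            best_index, best_error = index, error
    return best_index


pitch_vec = [1.0, 0.0, -1.0, 0.0]
noise_vec = [0.4, -0.4, 0.4, -0.4]
target = [1.0 * p + 0.25 * n for p, n in zip(pitch_vec, noise_vec)]
gain_index = search_gain_index(target, pitch_vec, noise_vec)
```

Quantizing the two gains jointly as one codebook entry means a single index carries both values.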

[0091] Next, a detailed configuration example of the noise encoding unit 140, which receives the background-noise-dominant component from the component separation unit 100 and encodes it, is described with reference to FIG. 15.

[0092] The noise encoding unit 140 differs greatly from the encoding performed by the speech encoding unit 130 described above in that it receives the background-noise-dominant component from the component separation unit 100, obtains the transform coefficients of this component using a predetermined transform, and encodes so that the distortion of parameters in the transform domain is reduced. Parameters in the transform domain can be represented in various ways; the method described here, as an example, divides the background noise component into bands in the transform domain, obtains for each band a parameter representing that band, quantizes the parameter with a predetermined quantizer, and sends the resulting index.

[0093] First, the transform coefficient calculation unit 321 obtains the transform coefficients of the background-noise-dominant component using a predetermined transform. For example, a discrete Fourier transform or an FFT (fast Fourier transform) can be used. Next, the band division unit 322 divides the frequency axis into predetermined bands, and the first band encoding unit 323, the second band encoding unit 324, ..., and the m-th band encoding unit 325 quantize the parameters of the m bands individually, using numbers of quantization bits that follow the bit allocation from the noise encoding bit allocation unit 320. For 8 kHz sampling, the number of bands m is preferably about 4 to 16.

[0094] The parameter used here can be the spectral amplitude or the power spectrum obtained from the transform coefficients, averaged within each band. The indices representing the quantized parameter values of the bands are input to the encoded data output unit 326, which outputs them to the multiplexing unit 150 as encoded data.
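The band-wise noise parameterization can be sketched as follows: transform a frame, average the power spectrum within each band, and quantize each band value to an index. Equal-width bands, the uniform dB quantizer, and its floor and step values are assumptions chosen for the example; the specification only states that a predetermined quantizer is used.

```python
import cmath
import math


def power_spectrum(frame):
    """Power spectrum of a real frame via a direct DFT (bins 0..N/2-1)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame)
        )
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def band_parameters(spectrum, num_bands):
    """Average the power spectrum inside each of num_bands equal-width bands."""
    width = len(spectrum) // num_bands
    return [
        sum(spectrum[b * width:(b + 1) * width]) / width for b in range(num_bands)
    ]


def quantize_band(power, num_bits, floor_db=0.0, step_db=6.0):
    """Uniformly quantize a band power in dB; returns an index in [0, 2^bits)."""
    level_db = 10.0 * math.log10(max(power, 1e-12))
    index = int(round((level_db - floor_db) / step_db))
    return min(max(index, 0), (1 << num_bits) - 1)


# Toy frame: a cosine on DFT bin 2 of a 32-sample frame.
frame = [math.cos(2.0 * math.pi * 2 * i / 32) for i in range(32)]
params = band_parameters(power_spectrum(frame), num_bands=4)
indices = [quantize_band(p, num_bits=3) for p in params]
```

Only the per-band indices would be transmitted, so phase information is discarded, in line with the observation that phase matters little for background noise.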

[0095] This embodiment has described a configuration in which the background-noise-dominant component is generated by subtracting the output of the noise suppression processing unit with the large suppression amount from the input speech signal and is then input to the noise encoding unit 140. Alternatively, the estimated background noise spectrum held in the component separation unit 100 may be processed and supplied directly to the noise encoding unit 140, where the spectrum is divided into bands and encoded band by band. This is almost equivalent to the above embodiment in signal processing terms, so the processing result is unchanged, and it has the additional advantage of reducing the amount of processing because a redundant transform is omitted.

[0096] [Decoding Side] FIG. 16 shows the configuration of a speech decoding apparatus to which the speech decoding method according to this embodiment is applied. The speech decoding apparatus comprises a demultiplexing unit 160, a bit allocation decoding unit 170, a speech decoding unit 180, a noise decoding unit 190, and a combining unit 195.

[0097] The demultiplexing unit 160 receives the transmission encoded data sent from the speech encoding apparatus of FIG. 13 for each predetermined time unit as described above, separates it into bit allocation information, encoded data for input to the speech decoding unit 180, and encoded data for input to the noise decoding unit 190, and outputs them.

[0098] The bit allocation decoding unit 170 decodes the bit allocation information and outputs the numbers of bits allocated to the speech decoding unit 180 and the noise decoding unit 190, chosen from among combinations of bit allocations defined by the same mechanism as on the encoding side.
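The demultiplexing and bit allocation decoding described above can be sketched as a table lookup shared with the encoder. The frame layout (a 2-bit allocation index followed by the speech bits and then the noise bits) and the table values are hypothetical; the specification only requires that the decoder use the same set of predetermined combinations as the encoder.

```python
# Hypothetical frame layout: 2-bit allocation index, speech bits, noise bits.
# The table values are made up; encoder and decoder must share the same table.
ALLOCATION_TABLE = [(80, 0), (0, 80), (64, 16)]


def demultiplex(frame_bits):
    """Split a frame into the allocation index and the two encoded payloads."""
    mode = int(frame_bits[:2], 2)                      # bit allocation info
    speech_bits, noise_bits = ALLOCATION_TABLE[mode]   # allocation decoding
    speech_code = frame_bits[2:2 + speech_bits]
    noise_code = frame_bits[2 + speech_bits:2 + speech_bits + noise_bits]
    return mode, speech_code, noise_code


frame_bits = "10" + "1" * 64 + "0" * 16
mode, speech_code, noise_code = demultiplex(frame_bits)
```

Because the allocation index is decoded first, the decoder knows where the speech payload ends and the noise payload begins without any extra length fields.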

[0099] The speech decoding unit 180 decodes the encoded data based on the bit allocation from the bit allocation decoding unit 170, generates a reproduced signal for the speech-dominant component, and outputs it to the combining unit 195.

[0100] The noise decoding unit 190 decodes the encoded data based on the bit allocation from the bit allocation decoding unit 170, generates a reproduced signal for the background-noise-dominant component, and outputs it to the combining unit 195.

[0101] The combining unit 195 combines the reproduced signal of the speech-dominant component decoded by the speech decoding unit 180 with the reproduced signal of the noise-dominant component decoded by the noise decoding unit 190 to generate the final output speech signal.

[0102] Next, the processing procedure of the speech decoding method of this embodiment is described with reference to the flowchart of FIG. 17. First, the incoming transmission encoded data is taken in for each predetermined time unit (step S200), and the demultiplexing unit 160 separates it into bit allocation information, encoded data for input to the speech decoding unit 180, and encoded data for input to the noise decoding unit 190, and outputs them (step S201).

[0103] Next, the bit allocation decoding unit 170 decodes the bit allocation information and outputs the numbers of bits allocated to the speech decoding unit 180 and the noise decoding unit 190, chosen from among combinations of bit allocations defined by the same mechanism as in the speech encoding apparatus (step S202). Then, based on the bit allocation from the bit allocation decoding unit 170, the speech decoding unit 180 and the noise decoding unit 190 generate their respective reproduced signals from their respective encoded data and output them to the combining unit 195 (step S203).

[0104] Next, the combining unit 195 combines the reproduced speech-dominant component with the reproduced noise-dominant component (step S204), and the final speech signal is generated and output (step S205).

[0105] [Specific Example of Speech Decoding Apparatus] FIG. 18 shows a specific example of a speech decoding apparatus corresponding to the configuration of the speech encoding apparatus of FIG. 15. The demultiplexing unit 160 separates the transmission encoded data sent from the speech encoding apparatus of FIG. 15 for each predetermined time unit and outputs the bit allocation information; the spectral envelope index, adaptive index, noise index, and gain index, which are the encoded data for input to the speech decoding unit 180; and the per-band quantization indices, which are the encoded data for input to the noise decoding unit 190. The bit allocation decoding unit 170 decodes the bit allocation information and outputs the numbers of bits allocated to the speech decoding unit 180 and the noise decoding unit 190, chosen from among combinations of bit allocations defined by the same mechanism as in encoding.

[0106] The speech decoding unit 180 decodes the encoded data based on the bit allocation from the bit allocation decoding unit 170, generates a reproduced signal for the speech-dominant component, and outputs it to the combining unit 195. Specifically, the spectral envelope decoding unit 414 reproduces the spectral envelope information from the spectral envelope index and a spectral envelope codebook prepared in advance, and sends it to the synthesis filter 416. The adaptive excitation decoding unit 411 receives the adaptive index, draws from the adaptive codebook the signal that repeats at the corresponding pitch period, and outputs it to the excitation reproduction unit 415.

[0107] The noise excitation decoding unit 412 receives the noise index, draws the corresponding noise signal from the noise codebook, and outputs it to the excitation reproduction unit 415.

[0108] The gain decoding unit 413 receives the gain index, draws the two corresponding gains, namely the gain applied to the pitch component and the gain applied to the noise component, from the gain codebook, and outputs them to the excitation reproduction unit 415.

[0109] The excitation reproduction unit 415 reproduces the excitation signal (vector) $E_x$ according to the following equation (9), using the signal (vector) $E_p$ repeating at the pitch period from the adaptive excitation decoding unit 411, the noise signal (vector) $E_n$ from the noise excitation decoding unit 412, and the two gains $G_p$ and $G_n$ from the gain decoding unit 413.

$$E_x = G_p \cdot E_p + G_n \cdot E_n \tag{9}$$

The synthesis filter 416 sets the parameters of the synthesis filter for synthesizing speech using the spectral envelope information, and generates a synthesized speech signal from the excitation signal input from the excitation reproduction unit 415. The post filter 417 then shapes the coding distortion contained in the synthesized speech signal so that it is easier to listen to, and outputs the result to the combining unit 195.
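Equation (9) and the subsequent synthesis filtering can be sketched as follows. The excitation mix follows the equation directly; the synthesis filter is modeled here as a generic all-pole filter with directly supplied coefficients, which is an assumption, since the specification only says that its parameters are set from the spectral envelope information. The post filter is omitted.

```python
def reconstruct_excitation(ep, en, gp, gn):
    """Equation (9): Ex = Gp * Ep + Gn * En, element by element."""
    return [gp * p + gn * n for p, n in zip(ep, en)]


def synthesis_filter(excitation, lpc):
    """All-pole filter: y[n] = x[n] - sum_k lpc[k] * y[n - 1 - k].

    `lpc` holds a_1..a_p of A(z) = 1 + a_1 z^-1 + ...; sign conventions
    differ between codecs, so this form is an assumption.
    """
    output = []
    for n, x in enumerate(excitation):
        acc = x
        for k, a in enumerate(lpc):
            if n - 1 - k >= 0:
                acc -= a * output[n - 1 - k]
        output.append(acc)
    return output


ex = reconstruct_excitation([1.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0], 0.5, 0.25)
speech = synthesis_filter(ex, lpc=[-0.5])
```

In this toy case the excitation is `[0.5, 0.5, 0.0, 0.0]`, and the one-pole filter lets each output sample decay by half into the next, giving `[0.5, 0.75, 0.375, 0.1875]`.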

[0110] Next, the noise decoding unit 190 in FIG. 18 is described. Based on the bit allocation from the bit allocation decoding unit 170, the noise decoding unit 190 receives the encoded data it requires, decodes it to generate a reproduced signal of the background-noise-dominant component, and outputs it to the combining unit 195. Specifically, the noise data separation unit 420 separates the encoded data into the quantization indices of the individual bands; the first band decoding unit 421, the second band decoding unit 422, ..., and the m-th band decoding unit 423 decode the parameters of their respective bands; and the inverse transform unit 424 uses the decoded parameters to perform the inverse of the transform performed on the encoding side, generating a reproduced signal of the background-noise-dominant component. This reproduced signal is sent to the combining unit 195.

[0111] The combining unit 195 combines the speech-dominant component shaped by the post filter with the reproduced background-noise-dominant component so that the reproduced signals connect smoothly between adjacent frames, and the result is the final output of the decoder as the output speech signal.
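One possible way to combine the two reproduced components and join frames smoothly is a sample-wise sum followed by a short linear cross-fade at the frame boundary. The specification does not state how the smooth connection is achieved, so the cross-fade, the overlap length, and the function names below are purely illustrative assumptions.

```python
def combine_frames(speech, noise):
    """Add the decoded speech-dominant and noise-dominant frames sample-wise."""
    return [s + n for s, n in zip(speech, noise)]


def crossfade_join(previous_tail, current, overlap):
    """Blend the first `overlap` samples of `current` with `previous_tail`.

    `previous_tail` is assumed to hold `overlap` samples that extend the
    previous frame past its boundary (for example from an overlapping window).
    """
    joined = list(current)
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)          # weight rises toward the new frame
        joined[i] = (1.0 - w) * previous_tail[i] + w * current[i]
    return joined


mixed = combine_frames([1.0, 1.0, 1.0, 1.0], [0.5, 0.5, 0.5, 0.5])
smoothed = crossfade_join([0.0, 0.0, 0.0], mixed, overlap=3)
```

Here the combined frame ramps from the silent previous tail up to its full level over three samples instead of jumping at the boundary.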

[0112] (Fourth Embodiment) FIG. 19 shows the configuration of a speech encoding apparatus according to a fourth embodiment of the present invention, which uses the component separation unit 100 described in the second embodiment. The speech encoding apparatus of this embodiment replaces the component separation unit 100 of the third embodiment with the component separation unit 100 described in the second embodiment. Because its configuration and operation are clear from the second and third embodiments, parts identical to those of the second and third embodiments are given the same reference numerals and a detailed description is omitted.

[0113] The speech signal component separation method according to the present invention is also applicable to uses other than speech coding; it can be applied widely wherever the two components are processed differently, for example when the speech-dominant component and the background-noise-dominant component are recorded separately.

[0114]

[Effects of the Invention] As described above, the present invention can provide a speech signal component separation method capable of accurately separating an input speech signal into a speech-dominant component and a background-noise-dominant component.

[0115] The present invention can also provide a low-bit-rate speech encoding method that uses this component separation method to reproduce speech, including its background noise, in a form as close as possible to the original sound.

[Brief description of the drawings]

FIG. 1 is a diagram for explaining the problem that arises when noise suppression with a large suppression amount is used to separate the components of a speech signal.

FIG. 2 is a diagram for explaining the problem that arises when noise suppression with a small suppression amount is used to separate the components of a speech signal.

FIG. 3 is a diagram for explaining the principle of speech signal component separation according to the present invention.

FIG. 4 is a block diagram showing the configuration of a component separation unit according to the first embodiment of the present invention.

FIG. 5 is a block diagram showing the configuration of the state determination unit in FIG. 1.

FIG. 6 is a flowchart showing an example of the processing procedure of the component separation unit according to the first embodiment.

FIG. 7 is a flowchart showing another example of the processing procedure of the component separation unit according to the first embodiment.

FIG. 8 is a flowchart showing the processing procedure of the state determination unit of FIG. 5.

FIG. 9 is a block diagram showing the configuration of a component separation unit according to the second embodiment of the present invention.

FIG. 10 is a block diagram showing part of the processing procedure of the component separation unit according to the second embodiment.

FIG. 11 is a block diagram showing another part of the processing procedure of the component separation unit according to the second embodiment.

FIG. 12 is a block diagram showing another example of the processing procedure of the component separation unit according to the second embodiment.

FIG. 13 is a block diagram showing the schematic configuration of a speech encoding apparatus according to the third embodiment of the present invention.

FIG. 14 is a flowchart showing the processing procedure of the speech encoding apparatus according to the third embodiment.

FIG. 15 is a block diagram showing a more detailed configuration of the speech encoding apparatus according to the third embodiment.

FIG. 16 is a block diagram showing the schematic configuration of a speech decoding apparatus according to the third embodiment.

FIG. 17 is a flowchart showing the processing procedure of the speech decoding method according to the third embodiment.

FIG. 18 is a block diagram showing a more detailed configuration of the speech decoding apparatus according to the third embodiment.

FIG. 19 is a block diagram showing the configuration of a speech encoding apparatus according to the fourth embodiment of the present invention.

[Explanation of symbols]

100: component separation unit
101: first noise suppression processing unit
102: second noise suppression processing unit
103: subtractor
104: state determination unit
105: selection unit
111: SNR determination unit
112: periodicity determination unit
113: estimated energy determination unit
120: bit allocation selection unit
130: speech encoding unit
140: noise encoding unit
150: multiplexing unit
160: demultiplexing unit
170: bit allocation decoding unit
180: speech decoding unit
190: noise decoding unit
195: combining unit

Claims

[Claims]

1. A method for separating components of an audio signal, in which an input audio signal is separated, for each predetermined time unit, into a first component mainly composed of voice and a second component mainly composed of background noise, the method having a separation mode in which a signal obtained by applying first noise suppression processing with a relatively small suppression amount to the input audio signal is used as the first component, and a signal obtained by subtracting, from the input audio signal, a signal obtained by applying second noise suppression processing with a relatively large suppression amount to the input audio signal is used as the second component.
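The separation mode of claim 1 can be sketched as follows. This is only an illustration under stated assumptions: the claim does not specify the noise suppression algorithm, so `soft_threshold` is a hypothetical time-domain stand-in whose threshold plays the role of the "suppression amount", and the names `separate_frame`, `weak_amount`, and `strong_amount` are introduced here for the example.

```python
import math
from typing import List, Sequence, Tuple


def soft_threshold(frame: Sequence[float], amount: float) -> List[float]:
    """Hypothetical placeholder suppressor (NOT the patent's algorithm).

    Shrinks every sample toward zero by `amount`; a larger `amount`
    removes more of the signal, i.e. suppresses more strongly.
    """
    return [math.copysign(max(abs(v) - amount, 0.0), v) for v in frame]


def separate_frame(
    frame: Sequence[float], weak_amount: float, strong_amount: float
) -> Tuple[List[float], List[float]]:
    """Return (first component, second component) for one time unit."""
    if weak_amount > strong_amount:
        raise ValueError("first suppression must be weaker than the second")
    # First component: input after the weaker (first) noise suppression.
    first = soft_threshold(frame, weak_amount)
    # Second component: input minus the strongly suppressed signal.
    strongly_suppressed = soft_threshold(frame, strong_amount)
    second = [x - s for x, s in zip(frame, strongly_suppressed)]
    return first, second


frame = [1.0, -0.2, 0.05, -2.0]
first, second = separate_frame(frame, weak_amount=0.1, strong_amount=0.5)
```

Because the two components come from differently suppressed versions of the same input, they do not in general sum back to the input; the weak suppression keeps the voice component close to the original, while the residual of the strong suppression captures the background noise.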

2. A method for separating components of an audio signal, in which an input audio signal is separated, for each predetermined time unit, into a first component mainly composed of voice and a second component mainly composed of background noise, the method comprising: generating a first signal obtained by applying first noise suppression processing with a relatively small suppression amount to the input audio signal, a second signal obtained by applying second noise suppression processing with a relatively large suppression amount to the input audio signal, and a third signal obtained by subtracting the second signal from the input audio signal; determining, for each predetermined time unit, whether the input audio signal is in a first state in which a voice component is dominant, a second state in which a background noise component is dominant, or a third state, other than the first and second states, in which both a voice component and a background noise component are present; when the input audio signal is determined to be in the first state, outputting the input audio signal as it is as the first component and outputting a predetermined signal as the second component; when the input audio signal is determined to be in the second state, outputting a predetermined signal as the first component and outputting the input audio signal as the second component; and when the input audio signal is determined to be in the third state, outputting the first signal as the first component and outputting the third signal as the second component.

3. A method for separating components of an audio signal, in which an input audio signal is separated, for each predetermined time unit, into a first component mainly composed of voice and a second component mainly composed of background noise, the method comprising: generating a first signal obtained by applying first noise suppression processing with a relatively small suppression amount to the input audio signal, a second signal obtained by applying second noise suppression processing with a relatively large suppression amount to the input audio signal, and a third signal obtained by subtracting the second signal from the input audio signal; determining, for each predetermined time unit, whether the input audio signal is in a first state in which a voice component is dominant, a second state in which a background noise component is dominant, or a third state, other than the first and second states, in which both a voice component and a background noise component are present; when the input audio signal is determined to be in the first state, outputting the input audio signal as it is as the first component and outputting a predetermined signal as the second component; when the input audio signal is determined to be in the second state, outputting the second signal as the first component and outputting the third signal as the second component; and when the input audio signal is determined to be in the third state, outputting the first signal as the first component and outputting the third signal as the second component.
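The per-state selection rules of claims 2 and 3 can be sketched as follows. The sketch assumes that the "predetermined signal" is an all-zero signal, as the abstract describes for signal S4; the names `State`, `select_components`, and `claim3_variant` are hypothetical and exist only for this example.

```python
from enum import Enum
from typing import List, Sequence, Tuple


class State(Enum):
    VOICE = 1   # first state: voice component is dominant
    NOISE = 2   # second state: background noise component is dominant
    MIXED = 3   # third state: both components are present


def select_components(
    state: State,
    x: Sequence[float],    # input audio signal for this time unit
    s1: Sequence[float],   # first signal (weak suppression)
    s2: Sequence[float],   # second signal (strong suppression)
    s3: Sequence[float],   # third signal (x minus s2)
    claim3_variant: bool = False,
) -> Tuple[List[float], List[float]]:
    """Return (first component, second component) for one time unit."""
    zeros = [0.0] * len(x)  # assumed form of the "predetermined signal"
    if state is State.VOICE:
        return list(x), zeros
    if state is State.NOISE:
        if claim3_variant:
            # Claim 3: strongly suppressed signal and its residual.
            return list(s2), list(s3)
        # Claim 2: the whole input is treated as background noise.
        return zeros, list(x)
    # Third state: same outputs in both claims.
    return list(s1), list(s3)


x, s1, s2, s3 = [1.0, 2.0], [0.9, 1.9], [0.5, 1.5], [0.5, 0.5]
voice_out = select_components(State.VOICE, x, s1, s2, s3)
noise_out = select_components(State.NOISE, x, s1, s2, s3)
noise_out_claim3 = select_components(State.NOISE, x, s1, s2, s3, claim3_variant=True)
mixed_out = select_components(State.MIXED, x, s1, s2, s3)
```

The two claims differ only in the second state: claim 2 passes the input through unchanged as the second component, whereas claim 3 still splits it using the strongly suppressed signal.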

4. The method for separating components of an audio signal according to claim 2, wherein the process of determining which of the first state, the second state, and the third state the input audio signal is in includes a step of checking the magnitude of the pitch periodicity of the third signal.
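One common way to measure the magnitude of pitch periodicity is the peak of the normalized autocorrelation over a range of candidate pitch lags. The sketch below illustrates that measure only; the claim does not say which measure is used, and the lag range and `threshold` defaults are assumptions chosen for the example (roughly the pitch range of speech sampled at 8 kHz).

```python
import math
from typing import Sequence


def pitch_periodicity(frame: Sequence[float], min_lag: int, max_lag: int) -> float:
    """Peak normalized autocorrelation over lags in [min_lag, max_lag].

    Values near 1.0 indicate a strongly periodic frame; values near 0.0
    indicate little periodicity. Returns 0.0 for a silent frame.
    """
    best = 0.0
    last_lag = min(max_lag, len(frame) - 1)
    for lag in range(min_lag, last_lag + 1):
        current = frame[lag:]
        delayed = frame[: len(frame) - lag]
        num = sum(a * b for a, b in zip(current, delayed))
        den = math.sqrt(
            sum(a * a for a in current) * sum(b * b for b in delayed)
        )
        if den > 0.0:
            best = max(best, num / den)
    return best


def is_strongly_periodic(
    frame: Sequence[float],
    min_lag: int = 20,      # assumed lower pitch lag
    max_lag: int = 147,     # assumed upper pitch lag
    threshold: float = 0.7,  # assumed decision threshold
) -> bool:
    return pitch_periodicity(frame, min_lag, max_lag) >= threshold


# A sine wave with a period of 40 samples is strongly periodic.
sine = [math.sin(2.0 * math.pi * n / 40.0) for n in range(160)]
sine_score = pitch_periodicity(sine, 20, 80)
```

A state decision could then compare this score against a threshold together with other cues; how the cues are combined is not stated in the claim and is left open here.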

5. A speech encoding method in which an input audio signal is separated, for each predetermined time unit, into a first component mainly composed of voice and a second component mainly composed of background noise; a bit allocation for each component is selected from a plurality of predetermined bit allocation candidates, based on the first and second components or based on the result of the state determination made during the component separation; under this bit allocation, the first component and the second component are encoded by respective, mutually different predetermined encoding methods; and the encoded data of the first and second components and the bit allocation information are output as transmission encoded data, wherein the component separation has a separation mode in which a signal obtained by applying first noise suppression processing with a relatively small suppression amount to the input audio signal is used as the first component, and a signal obtained by subtracting, from the input audio signal, a signal obtained by applying second noise suppression processing with a relatively large suppression amount to the input audio signal is used as the second component.
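Selecting a bit allocation from predetermined candidates based on the two components can be sketched as follows. Everything specific here is an assumption made for illustration: the candidate table `CANDIDATES`, its bit counts, and the energy-share selection rule are hypothetical and are not taken from the claim, which only requires that one of several predetermined candidates be chosen.

```python
from typing import List, Sequence, Tuple

# Hypothetical candidates: (bits for first component, bits for second component).
CANDIDATES: List[Tuple[int, int]] = [(80, 0), (60, 20), (0, 80)]


def energy(signal: Sequence[float]) -> float:
    return sum(v * v for v in signal)


def select_bit_allocation(
    first: Sequence[float],
    second: Sequence[float],
    candidates: Sequence[Tuple[int, int]] = tuple(CANDIDATES),
) -> Tuple[int, Tuple[int, int]]:
    """Return (candidate index, allocation).

    Assumed rule: pick the candidate whose share of bits given to the
    first component is closest to that component's share of the energy.
    The index is what would be sent as the bit allocation information.
    """
    e1, e2 = energy(first), energy(second)
    total = e1 + e2
    share = e1 / total if total > 0.0 else 0.5

    def distance(alloc: Tuple[int, int]) -> float:
        bits = alloc[0] + alloc[1]
        alloc_share = alloc[0] / bits if bits > 0 else 0.5
        return abs(alloc_share - share)

    index = min(range(len(candidates)), key=lambda i: distance(candidates[i]))
    return index, candidates[index]


voice_only = select_bit_allocation([1.0, 1.0], [0.0, 0.0])
noise_only = select_bit_allocation([0.0, 0.0], [1.0, 1.0])
mixed = select_bit_allocation([3.0, 0.0], [1.0, 1.0])
```

Sending only the candidate index keeps the side information small, since the decoder is assumed to hold the same candidate table.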

6. A speech encoding method in which an input audio signal is separated, for each predetermined time unit, into a first component mainly composed of voice and a second component mainly composed of background noise; a bit allocation for each component is selected from a plurality of predetermined bit allocation candidates, based on the first and second components or based on the result of the state determination made during the component separation; under this bit allocation, the first component and the second component are encoded by respective, mutually different predetermined encoding methods; and the encoded data of the first and second components and the bit allocation information are output as transmission encoded data, wherein the component separation comprises: generating a first signal obtained by applying first noise suppression processing with a relatively small suppression amount to the input audio signal, a second signal obtained by applying second noise suppression processing with a relatively large suppression amount to the input audio signal, and a third signal obtained by subtracting the second signal from the input audio signal; determining, for each predetermined time unit, whether the input audio signal is in a first state in which a voice component is dominant, a second state in which a background noise component is dominant, or a third state, other than the first and second states, in which both a voice component and a background noise component are present; when the input audio signal is determined to be in the first state, outputting the input audio signal as it is as the first component and outputting a predetermined signal as the second component; when the input audio signal is determined to be in the second state, outputting a predetermined signal as the first component and outputting the input audio signal as the second component; and when the input audio signal is determined to be in the third state, outputting the first signal as the first component and outputting the third signal as the second component.

7. A speech encoding method in which an input audio signal is separated, for each predetermined time unit, into a first component mainly composed of voice and a second component mainly composed of background noise; a bit allocation for each component is selected from a plurality of predetermined bit allocation candidates, based on the first and second components or based on the result of the state determination made during the component separation; under this bit allocation, the first component and the second component are encoded by respective, mutually different predetermined encoding methods; and the encoded data of the first and second components and the bit allocation information are output as transmission encoded data, wherein the component separation comprises: generating a first signal obtained by applying first noise suppression processing with a relatively small suppression amount to the input audio signal, a second signal obtained by applying second noise suppression processing with a relatively large suppression amount to the input audio signal, and a third signal obtained by subtracting the second signal from the input audio signal; determining, for each predetermined time unit, whether the input audio signal is in a first state in which a voice component is dominant, a second state in which a background noise component is dominant, or a third state, other than the first and second states, in which both a voice component and a background noise component are present; when the input audio signal is determined to be in the first state, outputting the input audio signal as it is as the first component and outputting a predetermined signal as the second component; when the input audio signal is determined to be in the second state, outputting the second signal as the first component and outputting the third signal as the second component; and when the input audio signal is determined to be in the third state, outputting the first signal as the first component and outputting the third signal as the second component.