JP2020190606A

JP2020190606A - Sound noise removal device and program

Info

Publication number: JP2020190606A
Application number: JP2019095104A
Authority: JP
Inventors: 知美小倉; Tomomi Ogura; 岳大杉本; Takehiro Sugimoto
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2019-05-21
Filing date: 2019-05-21
Publication date: 2020-11-26
Anticipated expiration: 2039-05-21
Also published as: JP7316093B2

Abstract

To provide a sound noise removal device and program capable of reproducing a high-quality sound signal by removing noise, including musical noise, from a sound signal after compression and expansion. SOLUTION: A sound noise removal device according to the present invention comprises: a signal period division part 111 which receives a sound signal after compression and expansion and divides it into signals of predetermined signal periods; a band division part 112 which generates band-specific time waveforms in each signal period; a noise learning detection part 113 which uses machine learning to detect noise including musical noise in each band; a noise band determination part 114 which determines a maximum value and a minimum value of the noise band in each signal period; a noise correction part 115 which corrects the waveform of a band having noise through linear prediction from the waveforms of other bands; a band composition part 116 which performs band composition of the corrected band-specific time waveforms in a signal period in which noise is included; and a signal composition part 117 which generates a sound signal in which the corrected signals of the noisy signal periods and the signals of the noise-free signal periods are concatenated in time series. SELECTED DRAWING: Figure 1

Description

The present invention relates to an audio noise removal device and a program for removing noise, including musical noise, from an audio signal after compression and decompression.

Lossy compression coding is sometimes applied when an audio signal is transmitted or recorded. When the compression-coded audio signal is decompressed and decoded, coding degradation can produce a characteristic noise known as musical noise, which degrades the subjective sound quality. Musical noise is a type of noise characterized by the distribution of energy concentration in the audio signal changing irregularly from one signal period to the next, where the signal periods are delimited at predetermined time intervals.

In particular, the latest audio coding technologies can dynamically change the bit allocation for each channel at certain time intervals (see, for example, Non-Patent Document 1). As a result, musical noise may occur at a certain time in a certain channel, such as a rear channel. When watching television at home, the viewer receives the transmitted compressed audio but has no access to the original sound, and therefore cannot know how the audio has been degraded. Under these conditions, listening to a high-quality audio signal calls for a technique that estimates, on the home side, whether the audio has been degraded and corrects it.

As a technique for detecting clip noise, which is one type of noise, a technique has been disclosed that performs quadrature detection and detects noise when the intensity of the clip noise exceeds a threshold (see, for example, Patent Document 1). This technique is intended for detecting clip noise and uses only the intensity of the clip noise as its evaluation index.

As a technique for suppressing noise, there is a technique that uses spectral subtraction, in which an estimated noise amplitude spectrum is subtracted (see, for example, Patent Document 2).

As a technique for suppressing musical noise, a method based on expansion and contraction (dilation and erosion) processing of a spectrogram image has been proposed (see, for example, Non-Patent Document 2).

Patent Document 1: Japanese Unexamined Patent Publication No. 2012-230160
Patent Document 2: International Publication No. 99/50825

Non-Patent Document 1: ISO/IEC 23008-3, "High Efficiency Coding and Media Delivery in Heterogeneous Environments Part 3: 3D Audio"
Non-Patent Document 2: Ryo Yamaguchi, Yutaka Kaneko, "Study on Improving Musical Noise in Noise Suppression Signal Processing", Proceedings of the Acoustical Society of Japan Research Presentation, March 2004

As described above, when lossy compression coding is applied to an audio signal for transmission or recording and the compression-coded audio signal is then decompressed and decoded, coding degradation can produce a characteristic noise known as musical noise, which degrades the subjective sound quality. Therefore, to reproduce the compressed and decompressed audio signal with good sound quality, it is desirable to detect and correct musical noise in that signal.

As disclosed in Patent Document 1, there is a technique for detecting clip noise, which is one type of noise. However, this technique does not treat the phase information in the time waveform of the audio signal as an evaluation index. Consequently, when the phase information is degraded, for example, the intensity of the clip noise may not increase, and it is difficult to identify musical noise caused by coding degradation from the clip noise intensity alone.

As disclosed in Patent Document 2, there is also a technique for suppressing noise including musical noise. In general, however, spectral subtraction requires the noise to be estimated from silent sections. Audio signals broadcast on television, for example, contain few silent sections, so it is difficult to use spectral subtraction to estimate musical noise that changes from moment to moment across a wide variety of audio signals, from human voices to music. Moreover, spectral subtraction is said to be insufficiently effective when the noise is non-stationary or when its power spectrum cannot be estimated; it is used for speech enhancement under white noise but is not suited to suppressing musical noise.

On the other hand, as disclosed in Non-Patent Document 2, there is a technique for suppressing musical noise by expansion and contraction processing of a spectrogram image. However, this processing does not correct the phase information in the time waveform of the audio signal, so it is not sufficiently effective against musical noise in which the phase information is also degraded.

Accordingly, in view of the above problems, an object of the present invention is to provide an audio noise removal device and a program that remove noise, including musical noise, from an audio signal after compression and decompression and thereby make it possible to reproduce a high-quality audio signal.

The audio noise removal device of the present invention uses machine learning to detect noise including musical noise in an arbitrary audio signal after compression and decompression. Detection is performed for each signal period delimited at predetermined time intervals, and within each signal period for each band obtained by band division at predetermined frequency intervals. The machine learning is trained in advance, using an original audio signal for learning and the audio signal obtained by compressing and decompressing that original, to identify from the energy or the envelope shape of the band-divided time waveform of each predetermined signal period whether a band contains noise including musical noise. The device detects the presence or absence of noise for each signal period. For a signal period determined to contain noise, it corrects the time waveform of each noisy band by linear prediction from the time waveforms of the noise-free bands in the same signal period, and band-synthesizes the time waveforms of all bands to form a corrected signal for that signal period. Finally, the device combines the signals of the signal periods determined to contain noise, after correction, with the signals of the signal periods determined to be noise-free, and generates and outputs a noise-suppressed audio signal.

That is, the audio noise removal device of the present invention is an audio noise removal device that removes noise from an audio signal after compression and decompression, comprising: a signal period division unit that receives the audio signal after compression and decompression and divides it into signals of signal periods delimited at predetermined time intervals; a band division unit that generates, for each signal period, band-specific time waveforms by band division with a band division number at predetermined frequency intervals; a noise learning detection unit that, for each signal period, detects noise including musical noise in each band using machine learning and causes the signal period division unit to route the signals into signals of noise-free signal periods and signals of noisy signal periods; a noise band determination unit that, based on the noise band information from the noise learning detection unit, determines the minimum value and the maximum value of the noise band for each signal period band-divided by a predetermined band division number having a minimum band and a maximum band; a noise correction unit that, based on the band division number, the minimum band and maximum band of the band division, and the minimum value and maximum value of the noise band, corrects the waveform of each noisy band in a noisy signal period by linear prediction from the waveforms of the noise-free bands in the same signal period, and generates corrected band-specific time waveforms for each noisy signal period; a band synthesis unit that band-synthesizes the corrected band-specific time waveforms for each noisy signal period to form a corrected signal for each noisy signal period; and a signal synthesis unit that generates an audio signal in which noise including musical noise is suppressed by concatenating, in time series, the corrected signals of the noisy signal periods and the signals of the noise-free signal periods.

In the audio noise removal device of the present invention, the noise learning detection unit is configured as an LSTM (Long Short-Term Memory) network and is trained in advance, using an original audio signal for learning and the audio signal obtained by compressing and decompressing that original, to identify from the energy or the envelope shape of the band-divided time waveform of each predetermined signal period whether a band contains noise including musical noise.

In the audio noise removal device of the present invention, the noise correction unit corrects the time waveform of the noisy band in a noisy signal period as follows. When the minimum value of the noise band is higher than a predetermined frequency, it performs a first linear prediction using the signal waveforms of bands lower than the minimum value of the noise band. When the minimum value of the noise band is equal to or lower than the predetermined frequency and the maximum value of the noise band is lower than the predetermined frequency, it performs a second linear prediction using the signal waveforms of bands higher than the maximum value of the noise band. When the minimum value of the noise band is equal to or lower than the predetermined frequency and the maximum value of the noise band is equal to or higher than the predetermined frequency, it takes a weighted average of the band-specific time waveforms obtained by the first linear prediction and the band-specific time waveforms obtained by the second linear prediction.
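The three-way case split above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation disclosed in the patent: the function names, the string labels, and the default weight of 0.5 are hypothetical, and the linear predictions themselves are not shown.

```python
def choose_prediction(f_min, f_max, f_threshold):
    """Pick the correction strategy from the noise band limits.

    f_min, f_max: minimum and maximum of the noise band.
    f_threshold: the "predetermined frequency" (its value is an assumption here).
    """
    if f_min > f_threshold:
        # Noise lies entirely above the threshold: predict from lower bands.
        return "first"
    if f_max < f_threshold:
        # Noise lies entirely below the threshold: predict from higher bands.
        return "second"
    # Noise band straddles the threshold: combine both predictions.
    return "weighted_average"


def weighted_average(wave_first, wave_second, weight_first=0.5):
    """Blend two predicted waveforms sample by sample (weight is hypothetical)."""
    return [weight_first * a + (1.0 - weight_first) * b
            for a, b in zip(wave_first, wave_second)]


# Usage with arbitrary example frequencies in Hz.
print(choose_prediction(6000, 9000, 4000))   # first
print(choose_prediction(1000, 3000, 4000))   # second
print(choose_prediction(3000, 5000, 4000))   # weighted_average
```

Because the second and third conditions both require the minimum value to be at or below the threshold, testing `f_min > f_threshold` first lets the remaining two cases be separated by `f_max` alone.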

The program of the present invention is configured as a program for causing a computer to function as the audio noise removal device of the present invention.

According to the present invention, noise including musical noise caused by coding degradation can be automatically detected and corrected in an arbitrary audio signal after compression and decompression, so that a noise-suppressed audio signal with good sound quality can be obtained.

FIG. 1 is a block diagram showing the schematic configuration of an audio noise removal device according to an embodiment of the present invention.
FIG. 2 is a diagram schematically illustrating the time waveforms obtained by band division in the band division unit of the audio noise removal device according to an embodiment of the present invention.
FIG. 3 is a block diagram conceptually showing the pre-learning and the noise learning detection processing in the noise learning detection unit of the audio noise removal device according to an embodiment of the present invention.
FIG. 4 is a diagram conceptually showing an outline of the LSTM learning processing in the noise learning detection unit of the audio noise removal device according to an embodiment of the present invention.
FIG. 5 is a diagram showing the parameters obtained by band division in the noise band determination unit of the audio noise removal device according to an embodiment of the present invention.
FIG. 6 is a flowchart showing the noise correction processing in the noise correction unit of the audio noise removal device according to an embodiment of the present invention.
FIG. 7 (a) to (d) are diagrams showing spectrograms of, respectively, the original sound, the degraded compressed and decompressed sound, the compressed and decompressed sound after noise removal based on Non-Patent Document 2, and the compressed and decompressed sound after noise removal according to the present invention.

An audio noise removal device 1 according to an embodiment of the present invention is described below with reference to the drawings.

[Overall configuration]
FIG. 1 is a block diagram showing the schematic configuration of the audio noise removal device 1 according to an embodiment of the present invention. The audio noise removal device 1 comprises a noise removal processing unit 11 and a storage unit 12. The noise removal processing unit 11 includes a signal period division unit 111, a band division unit 112, a noise learning detection unit 113, a noise band determination unit 114, a noise correction unit 115, a band synthesis unit 116, and a signal synthesis unit 117. The audio noise removal device 1 can be implemented by a computer: the program according to the present invention is stored in the storage unit 12, and the noise removal processing unit 11 functions when the central processing unit (CPU) of the computer (including a DSP microcomputer in home audio equipment such as an AV amplifier) executes the program. The storage unit 12 includes a signal processing memory 121, used for temporary storage of data in each signal processing step of the noise removal processing unit 11 and for delay adjustment of that data, and a machine learning database (DB) 122 used in the processing of the noise learning detection unit 113.

The signal period division unit 111 receives the audio signal that has been lossy-compression coded and then decompressed and decoded, cuts it out at fixed intervals to divide it into signals of signal periods delimited at predetermined time intervals, and first outputs the signal of each signal period to the band division unit 112.
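The cutting-out step of the signal period division unit can be pictured with a short sketch. The text does not say whether periods overlap or how a trailing partial period is handled, so non-overlapping periods and a shorter final period are assumptions, and `split_into_periods` and `period_len` are hypothetical names.

```python
def split_into_periods(samples, period_len):
    """Cut a sample sequence into consecutive, non-overlapping signal periods.

    The final period may be shorter than period_len (an assumption; the
    source does not specify padding or overlap).
    """
    if period_len <= 0:
        raise ValueError("period_len must be positive")
    return [samples[i:i + period_len]
            for i in range(0, len(samples), period_len)]


# Usage: ten samples cut into periods of four samples each.
periods = split_into_periods(list(range(10)), 4)
print(periods)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```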

For the signal of each signal period received from the signal period division unit 111, the band division unit 112 generates band-specific time waveforms by band division with a band division number N at predetermined frequency intervals, and outputs them to the noise learning detection unit 113.

The noise learning detection unit 113 receives the band-specific time waveforms of each signal period from the band division unit 112 and, for each signal period, uses machine learning to detect noise including musical noise in each band. For a noise-free signal period, the noise learning detection unit 113 notifies the signal period division unit 111 that the signal contains no noise; for a noisy signal period, it notifies the signal period division unit 111 that the signal contains noise and attaches the corresponding noise band information.

As described in detail later, the noise learning detection unit 113 is configured as an LSTM (Long Short-Term Memory) network based on time waveforms, which can handle the phase information of each frequency, and determines the presence or absence of noise in a trained state based on network parameters pre-learned with reference to the machine learning DB 122. The machine learning of this embodiment is trained in advance, using an original audio signal for learning and the audio signal obtained by compressing and decompressing that original, to identify from the energy or the envelope shape of the band-divided time waveform of each predetermined signal period whether a band contains noise including musical noise.

On receiving the notification from the noise learning detection unit 113, the signal period division unit 111 routes the signals into signals of noise-free signal periods and signals of noisy signal periods. It outputs the signals of noise-free signal periods to the signal synthesis unit 117, and outputs the signals of noisy signal periods, together with their noise band information, to the noise band determination unit 114.

The signal period division unit 111 performs delay adjustment by temporarily storing in the signal processing memory 121 as much of the input audio signal after compression and decompression as is needed to cover the processing time of the noise removal processing unit 11. Specifically, the signal period division unit 111 temporarily stores in the signal processing memory 121 the signal of each signal period in association with the noise presence information obtained through the processing of the noise learning detection unit 113 and, for noisy signal periods, the noise band information. Consequently, although this embodiment is described with the processing units downstream of the signal period division unit 111 processing each signal sequentially, those units may instead be configured to read the signals, the noise band information, and other data needed for each process from the signal processing memory 121 as appropriate.

The noise band determination unit 114 receives from the signal period division unit 111 the signal of each noisy signal period together with its noise band information, and generates band-specific time waveforms for each such signal period by band division with a band division number M at predetermined frequency intervals. For each signal period, it determines the minimum value f_min and the maximum value f_max of the noise band from the noise band information. It then extracts the band division number M, the minimum band f_1 and the maximum band f_M of the band division, and the minimum value f_min and the maximum value f_max of the noise band as correction noise band information, and outputs this information together with the band-specific time waveforms to the noise correction unit 115.
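As a rough sketch of the determination of f_min and f_max, suppose the noise band information is available as one Boolean flag per band, ordered from f_1 to f_M. That representation, the 1-based band indices, and the name `noise_band_range` are assumptions made for illustration; the source does not specify the data format.

```python
def noise_band_range(noisy_flags):
    """Return (f_min, f_max) as 1-based band indices of the noisy bands.

    noisy_flags[i] is True when band f_(i+1) was flagged as noisy.
    Returns None when no band is flagged (a noise-free signal period).
    """
    noisy = [index for index, flag in enumerate(noisy_flags, start=1) if flag]
    if not noisy:
        return None
    return noisy[0], noisy[-1]


# Usage: M = 6 bands, with bands 2, 3 and 5 flagged as noisy.
flags = [False, True, True, False, True, False]
band_range = noise_band_range(flags)
print(band_range)  # (2, 5)
```

Note that band 4 in this example is not flagged yet lies between f_min and f_max; how such a band is treated is not stated in this part of the description.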

In this embodiment, the band division unit 112 divides the signal into N bands for the "noise detection" performed by the noise learning detection unit 113, and the noise band determination unit 114 divides it into M bands for the "noise correction" performed by the noise correction unit 115; N and M may be equal or different. When N = M, the noise band determination unit 114 may omit its own band division and instead receive, for the signal periods found to contain noise by the noise learning detection unit 113, the band-specific signal waveforms from the band division unit 112, and determine the minimum value f_min and the maximum value f_max of the noise band from them.

The noise correction unit 115 receives from the noise band determination unit 114 the band-specific time waveforms of each noisy signal period and the correction noise band information (the band division number M, the minimum band f_1 and the maximum band f_M of the band division, and the minimum value f_min and the maximum value f_max of the noise band). Based on this correction noise band information, it corrects the waveform of each noisy band in the noisy signal period by linear prediction from the waveforms of the noise-free bands in the same signal period, generates corrected band-specific time waveforms for each noisy signal period, and outputs them to the band synthesis unit 116.

The band synthesis unit 116 receives the corrected band-specific time waveforms of each noisy signal period from the noise correction unit 115, band-synthesizes the time waveforms of all bands for each signal period to form a corrected signal for each noisy signal period, and outputs it to the signal synthesis unit 117.

The signal synthesis unit 117 concatenates, in time series, the corrected signals of the noisy signal periods obtained from the band synthesis unit 116 and the signals of the noise-free signal periods obtained from the signal period division unit 111, and thereby generates and outputs an audio signal in which noise including musical noise is suppressed.

Alternatively, the signal synthesis unit 117 may be configured to generate the audio signal in which noise including musical noise is suppressed by replacing the corresponding signal periods of the compressed and decompressed audio signal input to the signal period division unit 111 with the corrected signals of the noisy signal periods obtained from the band synthesis unit 116.
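Both synthesis variants amount to emitting each signal period in time order, using the corrected version wherever one exists. The sketch below assumes periods are identified by their position in the sequence and that corrected periods arrive in a dictionary keyed by that position; `synthesize`, `periods`, and `corrected` are hypothetical names.

```python
def synthesize(periods, corrected):
    """Concatenate signal periods in time order.

    periods: list of original (compressed and decompressed) signal periods.
    corrected: dict mapping a period index to its corrected samples; indices
               absent from the dict are treated as noise-free periods.
    """
    output = []
    for index, period in enumerate(periods):
        output.extend(corrected.get(index, period))
    return output


# Usage: the middle period was judged noisy and has been corrected.
restored = synthesize([[1, 2], [3, 4], [5, 6]], {1: [0, 0]})
print(restored)  # [1, 2, 0, 0, 5, 6]
```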

The band division unit 112, the noise learning detection unit 113, the noise band determination unit 114, and the noise correction unit 115 are described below in order and in more detail.

[Band division unit]
FIG. 2 is a diagram schematically illustrating the time waveforms obtained by band division in the band division unit 112 of the audio noise removal device 1 according to an embodiment of the present invention. For each signal period, the band division unit 112 performs band division with the band division number N so that the downstream noise learning detection unit 113 can perform "noise detection".

As shown in FIG. 2, for the signal in the signal period at a certain time t_n of the audio signal after compression and decompression, the band division unit 112 generates, with a predetermined band division number N, a time waveform for each band f_n (f_n = f_1 to f_N), where f_1 is the minimum band and f_N is the maximum band of the band division, and outputs the time waveforms to the noise learning detection unit 113.
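One way to obtain band-specific time waveforms is sketched below. The source does not name a filter bank, so this example swaps in a simple DFT mask: each band keeps a contiguous group of frequency bins (plus their mirror bins) and is transformed back to the time domain. The equal-width bin grouping and all function names are assumptions, and the naive O(n²) transform is only suitable for short illustrative periods.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft_real(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def split_into_bands(period, num_bands):
    """Split one signal period into num_bands band-limited time waveforms."""
    if num_bands < 1:
        raise ValueError("num_bands must be at least 1")
    n = len(period)
    spectrum = dft(period)
    half = n // 2  # index of the highest non-negative frequency bin
    bands = []
    for b in range(num_bands):
        lo = b * (half + 1) // num_bands
        hi = (b + 1) * (half + 1) // num_bands  # exclusive upper bin
        masked = [0j] * n
        for k in range(lo, hi):
            masked[k] = spectrum[k]
            if k > 0 and n - k != k:
                masked[n - k] = spectrum[n - k]  # keep the mirror bin
        bands.append(idft_real(masked))
    return bands


# Usage: a 16-sample cosine at DFT bin 5, split into N = 4 bands.
tone = [math.cos(2 * math.pi * 5 * t / 16) for t in range(16)]
bands = split_into_bands(tone, 4)
```

Because every DFT bin is assigned to exactly one band, the band waveforms add back up to the original period, which is the property a later band synthesis step relies on.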

[Noise learning detection unit]
FIG. 3 is a block diagram conceptually showing the pre-learning and the noise learning detection processing in the noise learning detection unit 113 of the audio noise removal device 1 according to an embodiment of the present invention. FIG. 4 is a diagram conceptually showing an outline of the LSTM learning processing in the noise learning detection unit 113. The noise learning detection unit 113 is configured as an LSTM network based on time waveforms, which can handle phase information. Using machine learning that has been trained on the basis of network parameters pre-learned with reference to the machine learning DB 122, it receives the band-specific time waveforms of the signal period at a certain time t_n from the band division unit 112 and detects, for each band-specific time waveform, the presence or absence of noise including musical noise.

The LSTM learning unit 1132, configured as the hidden layer, uses at least one LSTM block of the LSTM network, and an LSTM block can handle information from different times (that is, a time waveform). Accordingly, although the examples shown in FIGS. 3 and 4 are described with the energy of each band as the input layer of the LSTM network, the envelope shape of the time waveform of each band itself, as shown in FIG. 2 (the values sampled on the envelope, arranged as a feature vector), may instead be used as the input layer of the LSTM network.

雑音学習検出部１１３は、帯域分割数Ｎ分のエネルギー変換部１１３１，１１３１’と、帯域分割数Ｎ分の評価値算出部１１３２ａを有するＬＳＴＭ学習部１１３２と、帯域分割数Ｎ分の帯域別雑音判定部１１３３と、を備える。ここで、雑音学習検出部１１３について、事前学習時と、雑音学習検出処理とを区別して順に説明する。 The noise learning detection unit 113 includes energy conversion units 1131 and 1131' for the number of band divisions N, an LSTM learning unit 1132 having evaluation value calculation units 1132a for the number of band divisions N, and band-specific noise determination units 1133 for the number of band divisions N. The noise learning detection unit 113 is described below, first for pre-learning and then for the noise learning detection processing.

（事前学習時） (During pre-learning)
エネルギー変換部１１３１’は、ＬＳＴＭネットワークにおける入力層として、事前学習用に用いられ、事前学習用に非圧縮の原音の音声信号の或る時刻ｔ_ｎの信号期間における帯域ｆ_ｎ（ｆ_ｎ＝ｆ_１〜ｆ_Ｎ）毎の時間波形を入力し、その帯域ｆ_ｎ（ｆ_ｎ＝ｆ_１〜ｆ_Ｎ）毎のエネルギー（時刻ｔ_ｎの信号期間内の単位時間毎の信号振幅の二乗の積分値）を算出し、ＬＳＴＭ学習部１１３２に出力する。 The energy conversion unit 1131' serves as an input layer of the LSTM network and is used only for pre-learning. It receives the time waveform for each band f_n (f_n = f_1 to f_N) in the signal period at a certain time t_n of the uncompressed original audio signal, calculates the energy of each band f_n (the integral of the squared signal amplitude per unit time within the signal period at time t_n), and outputs it to the LSTM learning unit 1132.
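The band energy described above can be sketched in Python as follows. This is a minimal sketch, assuming a discrete-time waveform so that the integral becomes a sum of squared samples scaled by the sample interval; the names `band_energy`, `band_energies`, and `dt` are hypothetical and do not appear in the patent.

```python
def band_energy(waveform, dt=1.0):
    """Energy of one band in one signal period.

    Assumption: the integral of the squared amplitude is approximated by a
    sum of squared samples multiplied by the sample interval dt.
    """
    return sum(x * x for x in waveform) * dt


def band_energies(band_waveforms, dt=1.0):
    """Energies for all N bands of one signal period (one value per band)."""
    return [band_energy(w, dt) for w in band_waveforms]


# Example: three bands, four samples each.
period = [
    [1.0, -2.0, 3.0, 0.0],
    [0.5, 0.5, 0.5, 0.5],
    [0.0, 0.0, 0.0, 0.0],
]
energies = band_energies(period)
```

The same function would serve both energy conversion units, since they differ only in whether the input is the original sound or the compressed and decompressed sound.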

エネルギー変換部１１３１は、ＬＳＴＭネットワークにおける入力層として、事前学習時には、当該原音に対し圧縮伸長後の音声信号の対応する時刻ｔ_ｎの信号期間における帯域ｆ_ｎ（ｆ_ｎ＝ｆ_１〜ｆ_Ｎ）毎の時間波形を入力し、その帯域ｆ_ｎ（ｆ_ｎ＝ｆ_１〜ｆ_Ｎ）毎のエネルギーを算出し、ＬＳＴＭ学習部１１３２に出力する。 As an input layer of the LSTM network, during pre-learning the energy conversion unit 1131 receives the time waveform for each band f_n (f_n = f_1 to f_N) in the corresponding signal period at time t_n of the audio signal obtained by compressing and decompressing the original sound, calculates the energy of each band f_n, and outputs it to the LSTM learning unit 1132.

ただし、任意の圧縮伸長後の音声信号に関する雑音の有無の検出時には、圧縮伸長後の音声信号の対応する時刻ｔ_ｎの信号期間における帯域ｆ_ｎ（ｆ_ｎ＝ｆ_１〜ｆ_Ｎ）毎の時間波形を入力し、その帯域ｆ_ｎ（ｆ_ｎ＝ｆ_１〜ｆ_Ｎ）毎のエネルギーを算出し、ＬＳＴＭ学習部１１３２に出力する。 However, when detecting the presence or absence of noise related to an arbitrary compressed / decompressed audio signal, a time waveform for each band f_n (f_n = f_1 to f_N) in the signal period of the corresponding time t_n of the compressed / decompressed audio signal is input. The energy for each band f_n (f_n = f_1 to f_N) is calculated and output to the LSTM learning unit 1132.

ＬＳＴＭ学習部１１３２は、ＬＳＴＭネットワークにおける隠れ層（ＬＳＴＭ層）として構成され、帯域分割数Ｎ分の評価値算出部１１３２ａを有しており、評価値算出部１１３２ａの各々は、エネルギー変換部１１３１’から得られる原音の音声信号に関する帯域ｆ_ｎのエネルギーと、エネルギー変換部１１３１から得られる当該原音に対する圧縮伸長後の音声信号に関する帯域ｆ_ｎのエネルギーに基づきミュージカルノイズを含む雑音を有する帯域であるか否かを識別するよう帯域別に事前学習する。ＬＳＴＭ学習部１１３２は、多数の原音を用いて事前学習し、この事前学習の結果として得られるネットワークパラメータは、機械学習用ＤＢ１２２に対し参照可能に格納される。 The LSTM learning unit 1132 is configured as a hidden layer (LSTM layer) of the LSTM network and has evaluation value calculation units 1132a for the number of band divisions N. Each evaluation value calculation unit 1132a is trained in advance, band by band, to identify whether or not a band has noise including musical noise, based on the energy of the band f_n of the original audio signal obtained from the energy conversion unit 1131' and the energy of the band f_n of the compressed and decompressed audio signal for that original sound obtained from the energy conversion unit 1131. The LSTM learning unit 1132 is pre-trained using a large number of original sounds, and the network parameters obtained as a result are stored in the machine learning DB 122 so that they can be referenced.

尚、事前学習における教示データとして、以下に例示する主観評価及び客観評価の技法を利用することができる。
［主観評価１］
ITU-R BS.1116-3 “Methods for the subjective assessment of small impairments in audio systems”
［主観評価２（ＭＵＳＨＲＡ）］
ITU-R BS.1534-3 “Method for the subjective assessment of intermediate quality level of audio systems”
［客観評価１（ＰＥＡＱ）］
ITU-R Rec. BS.1387-1 “Method of objective measurements of perceived audio quality”
［客観評価２（ＰＥＳＱ）］
ITU-T Rec. P.862 “Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs”
As the teacher data in the pre-learning, the subjective and objective evaluation techniques listed below can be used.
[Subjective evaluation 1]
ITU-R BS.1116-3 “Methods for the subjective assessment of small impairments in audio systems”
[Subjective evaluation 2 (MUSHRA)]
ITU-R BS.1534-3 “Method for the subjective assessment of intermediate quality level of audio systems”
[Objective evaluation 1 (PEAQ)]
ITU-R Rec. BS.1387-1 “Method of objective measurements of perceived audio quality”
[Objective evaluation 2 (PESQ)]
ITU-T Rec. P.862 “Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs”

（雑音学習検出処理） (Noise learning detection processing)
雑音学習検出処理時では、エネルギー変換部１１３１’は使用せず、エネルギー変換部１１３１のみを入力層として使用する。 During the noise learning detection processing, the energy conversion unit 1131' is not used, and only the energy conversion unit 1131 is used as the input layer.

この雑音学習検出処理時では、エネルギー変換部１１３１は、雑音の有無が未知である任意の圧縮伸長後の音声信号の或る時刻ｔ_ｎの信号期間における帯域ｆ_ｎ（ｆ_ｎ＝ｆ_１〜ｆ_Ｎ）毎の時間波形を入力し、その帯域ｆ_ｎ（ｆ_ｎ＝ｆ_１〜ｆ_Ｎ）毎のエネルギーを算出し、ＬＳＴＭ学習部１１３２に出力する。 During this noise learning detection processing, the energy conversion unit 1131 receives the time waveform for each band f_n (f_n = f_1 to f_N) in the signal period at a certain time t_n of an arbitrary compressed and decompressed audio signal for which the presence or absence of noise is unknown, calculates the energy of each band f_n, and outputs it to the LSTM learning unit 1132.

ＬＳＴＭ学習部１１３２は、機械学習用ＤＢ１２２から読み出した事前学習済みのネットワークパラメータで帯域分割数Ｎ分の評価値算出部１１３２ａがモデル化され、エネルギー変換部１１３１から雑音の有無が未知の帯域ｆ_ｎ毎のエネルギーを入力すると、学習済みのネットワークパラメータに基づき或る時刻ｔ_ｎの信号期間における帯域ｆ_ｎ毎の雑音の有無に関する評価値を算出し、それぞれ帯域分割数Ｎ分の帯域別雑音判定部１１３３に出力する。 In the LSTM learning unit 1132, the evaluation value calculation units 1132a for the number of band divisions N are modeled with the pre-trained network parameters read from the machine learning DB 122. When the energy of each band f_n, for which the presence or absence of noise is unknown, is input from the energy conversion unit 1131, the LSTM learning unit 1132 calculates, based on the trained network parameters, an evaluation value regarding the presence or absence of noise for each band f_n in the signal period at a certain time t_n, and outputs each value to the corresponding one of the band-specific noise determination units 1133 for the number of band divisions N.

帯域分割数Ｎ分の帯域別雑音判定部１１３３の各々は、ＬＳＴＭネットワークにおける出力層として構成され、ＬＳＴＭ学習部１１３２から得られる帯域ｆ_ｎ毎の雑音の有無に関する評価値を所定の閾値と比較して帯域ｆ_ｎ毎に雑音の有無を判定する。そして、帯域分割数Ｎ分の帯域別雑音判定部１１３３の各々は、雑音無しの信号期間の信号については雑音無しの旨を信号期間分割部１１１に通知し、雑音有りの信号期間の信号についてはその雑音帯域情報付きで、雑音有りの旨を信号期間分割部１１１に通知する。 Each of the band-specific noise determination units 1133 for the number of band divisions N is configured as an output layer of the LSTM network, compares the evaluation value regarding the presence or absence of noise for each band f_n obtained from the LSTM learning unit 1132 with a predetermined threshold value, and determines the presence or absence of noise for each band f_n. For a signal of a noise-free signal period, each band-specific noise determination unit 1133 notifies the signal period division unit 111 that there is no noise; for a signal of a noisy signal period, it notifies the signal period division unit 111 that there is noise, together with the noise band information.
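The threshold comparison and the notification to the signal period division unit can be sketched in Python as follows. This is a minimal sketch under stated assumptions: a band is treated as noisy when its evaluation value is below the threshold (following the PEAQ-based example in this description), and the function names and the dictionary layout of the notification are hypothetical.

```python
def judge_bands(scores, threshold):
    """Return one flag per band: True if the band is judged to contain noise.

    Assumption: a lower evaluation value means worse quality, so a score
    below the threshold is treated as noise.
    """
    return [score < threshold for score in scores]


def summarize_period(flags):
    """Build the notification for one signal period.

    Bands are numbered from 1; 'noise_bands' plays the role of the noise
    band information attached to a noisy signal period.
    """
    noise_bands = [n for n, flag in enumerate(flags, start=1) if flag]
    return {"noisy": bool(noise_bands), "noise_bands": noise_bands}


flags = judge_bands([0.9, 0.2, 0.4, 0.8], threshold=0.5)
notice = summarize_period(flags)
```

A signal period is reported as noisy as soon as any single band is flagged, which matches the branching into noise-free and noisy signal periods described above.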

例えば、雑音学習検出部１１３は、事前学習時に、圧縮伸長された音声信号と非圧縮の音声信号を帯域毎に比較して音質の客観評価を行うＰＥＡＱの結果を教示とし、ＰＥＡＱによる評価値が当該所定の閾値よりも小さい場合、ミュージカルノイズを含む雑音と判別することができる。ここで当該所定の閾値を小さくするほど、より劣化が大きい雑音であると判断することになる。そして、多数の原音と圧縮伸長された音声信号を用いて事前学習させることで、ミュージカルノイズを検出できるようになる。 For example, during pre-learning the noise learning detection unit 113 can use as teacher data the results of PEAQ, which objectively evaluates sound quality by comparing the compressed and decompressed audio signal with the uncompressed audio signal for each band, and can judge a band to contain noise including musical noise when the PEAQ evaluation value is smaller than the predetermined threshold value. The smaller the predetermined threshold value is set, the more severe the degradation must be for a band to be judged as noise. By pre-training with a large number of original sounds and their compressed and decompressed audio signals, musical noise can then be detected.

〔雑音帯域判別部〕 [Noise band discrimination unit]
図５は、本発明による一実施形態の音声雑音除去装置１の雑音帯域判別部１１４において帯域分割したときに得られる補正用雑音帯域情報を示す図である。上述したように、雑音帯域判別部１１４は、信号期間分割部１１１から雑音帯域情報付きで雑音有りの或る時刻ｔ_ｎの信号期間の信号を入力すると、帯域分割数Ｍで帯域分割した帯域別の時間波形を生成し雑音補正部１１５に出力する。更に、雑音帯域判別部１１４は、当該雑音帯域情報に基づいて時刻ｔ_ｎの信号期間の雑音帯域の最小値ｆ_ｍｉｎ、及び雑音帯域の最大値ｆ_ｍａｘを判別し、図５に示すように、帯域分割数Ｍ、帯域分割の最小帯域ｆ_１及び最大帯域ｆ_Ｍ、並びに、雑音帯域の最小値ｆ_ｍｉｎ、及び雑音帯域の最大値ｆ_ｍａｘの情報を補正用雑音帯域情報として雑音補正部１１５に出力する。 FIG. 5 is a diagram showing the correction noise band information obtained when the noise band discrimination unit 114 of the voice noise removing device 1 according to one embodiment of the present invention performs band division. As described above, when the noise band discrimination unit 114 receives from the signal period division unit 111 a noisy signal of the signal period at a certain time t_n, together with its noise band information, it generates band-specific time waveforms divided into M bands and outputs them to the noise correction unit 115. Further, based on the noise band information, the noise band discrimination unit 114 determines the minimum value f_min and the maximum value f_max of the noise band in the signal period at time t_n and, as shown in FIG. 5, outputs the number of band divisions M, the minimum band f_1 and the maximum band f_M of the band division, and the minimum value f_min and the maximum value f_max of the noise band to the noise correction unit 115 as correction noise band information.
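Determining f_min and f_max from per-band noise flags can be sketched in Python as follows. This is a minimal sketch, assuming the noise flags are already expressed on the M-band grid used for correction (the description does not specify how flags on the N-band detection grid are mapped to it); the function name `noise_band_range` is hypothetical.

```python
def noise_band_range(noisy_flags):
    """Return (f_min, f_max) as 1-based band indices, or None if no band is noisy.

    noisy_flags[k] is True when band k+1 was identified as noise, like the
    filled squares in FIG. 5 for one signal period.
    """
    noisy = [n for n, flag in enumerate(noisy_flags, start=1) if flag]
    if not noisy:
        return None
    return noisy[0], noisy[-1]


# Bands 2, 3 and 5 of a 6-band period are flagged as noise.
band_range = noise_band_range([False, True, True, False, True, False])
```

Note that the range spans any unflagged band lying between two flagged ones (band 4 here), so such a band would also be replaced when every band from f_min to f_max is corrected.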

例えば、雑音帯域判別部１１４は、ＭＰ３でも用いられている帯域分割技法であるＰＱＭＦ（例えば、ISO/IEC 11172-3 “Coding of moving pictures and associated audio for digital storage media at up to about 1,5 Mbit/s Part 3:Audio”参照）を用いてＭ＝３２個に帯域分割することができるし、他の帯域通過フィルタを用いてもよい。 For example, the noise band discrimination unit 114 can divide the signal into M = 32 bands using PQMF, a band division technique also used in MP3 (see, for example, ISO/IEC 11172-3 “Coding of moving pictures and associated audio for digital storage media at up to about 1,5 Mbit/s Part 3: Audio”); other band-pass filters may also be used.

尚、非特許文献２に開示されるようなスペクトログラムを用いて雑音除去処理を行う際には、信号の振幅と位相情報を補正しないと原信号に戻すことは不可能であるが、ＰＱＭＦの帯域分割法を用いた場合は各時刻における帯域分割された時間波形を補正すれば原理的に元の信号に戻すことは可能である。図５は、ＰＱＭＦを用いて帯域分割した結果を示すものであり、或る時刻ｔ_ｎの信号期間における雑音と識別された帯域を“■”で表している。 When noise removal processing is performed using a spectrogram as disclosed in Non-Patent Document 2, the original signal cannot be restored unless both the amplitude and the phase information of the signal are corrected. When the PQMF band division method is used, however, the original signal can in principle be restored by correcting the band-divided time waveforms at each time. FIG. 5 shows the result of band division using PQMF, in which the bands identified as noise in the signal period at a certain time t_n are indicated by “■”.

〔雑音補正部〕 [Noise correction unit]
図６は、本発明による一実施形態の音声雑音除去装置１の雑音補正部１１５における雑音補正処理を示すフローチャートである。 FIG. 6 is a flowchart showing the noise correction processing in the noise correction unit 115 of the voice noise removing device 1 according to one embodiment of the present invention.

雑音補正部１１５は、雑音帯域判別部１１４から、或る雑音有りの時刻ｔ_ｎの信号期間における帯域分割数Ｍ、帯域分割の最小帯域ｆ_１及び最大帯域ｆ_Ｍ、雑音帯域情報ｆ_ｍｉｎ，ｆ_ｍａｘの補正用帯域情報とともに、帯域別の時間波形を入力する（ステップＳ１）。 The noise correction unit 115 receives from the noise band discrimination unit 114 the band-specific time waveforms for a noisy signal period at time t_n, together with the correction band information consisting of the number of band divisions M, the minimum band f_1 and the maximum band f_M of the band division, and the noise band information f_min and f_max (step S1).

続いて、雑音補正部１１５は、帯域分割数Ｍに対し予め定めた周波数（本例ではＭ／２）を基準に、ｆ_ｍｉｎ＞Ｍ／２を満たすか否かを判定する（ステップＳ２）。 Subsequently, the noise correction unit 115 determines whether or not f_min> M / 2 is satisfied based on a predetermined frequency (M / 2 in this example) with respect to the band division number M (step S2).

ｆ_ｍｉｎ＞Ｍ／２を満たす場合（ステップＳ２：Ｙｅｓ）、雑音補正部１１５は、ｆ_１〜“ｆ_ｍｉｎ−１”までの帯域を用いて、ｆ_ｍｉｎ〜ｆ_ｍａｘまでの帯域をそれぞれ帯域別にｐ次の線形予測により補正して、帯域合成部１１６に出力する（ステップＳ３）。 When f_min > M/2 is satisfied (step S2: Yes), the noise correction unit 115 uses the bands from f_1 to “f_min−1” to correct each of the bands from f_min to f_max by p-th order linear prediction, and outputs the result to the band synthesis unit 116 (step S3).

例えば、ｐ＜ｆ_ｍｉｎ−２としてもよいが、ここではｐ＝ｆ_ｍｉｎ−２とする。
そして、
ｆ_ｎ’＝−Σａ［i]×｛ｆ_ｎ−i ｝（Σは、i＝１〜ｐの総和）
として、
目的関数Ｊ＝Σ（ｆ_ｎ−ｆ_ｎ’）^２が、最小となるように線形予測係数ａを求める。
この求めた線形予測係数ａを用いて、
ｆ_ｍｉｎの信号＝−Σａ［i]×｛ｆ_ｍｉｎ−i ｝（Σは、i＝１〜ｐの総和）
として補正する。
このようにして、帯域ｆ_ｍｉｎ〜ｆ_ｍａｘの各信号を補正する。 For example, p < f_min−2 may be used, but here p = f_min−2.
Then, with the prediction defined as
f_n' = −Σ a[i] × f_{n−i} (Σ is the sum over i = 1 to p),
the linear prediction coefficients a are obtained so that the objective function J = Σ (f_n − f_n')² is minimized.
Using the obtained linear prediction coefficients a, the signal is corrected as
signal of f_min = −Σ a[i] × f_{min−i} (Σ is the sum over i = 1 to p).
In this way, each signal in the bands f_min to f_max is corrected.
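The prediction from lower bands can be sketched in Python as follows. This is a sketch under stated assumptions rather than the patented implementation: each band is a list of time samples, the prediction acts across band indices at the same sample position, and the sum in J is taken over all time samples and over every clean band below f_min that has p clean neighbours beneath it. The function names are hypothetical, and the least-squares fit is solved with plain Gaussian elimination.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [b[k]] for k, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("normal equations are singular; try a smaller p")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def fit_coefficients(bands, f_min, p):
    """Least-squares fit of a[1..p] on the clean bands below f_min (1-based)."""
    R = [[0.0] * p for _ in range(p)]
    r = [0.0] * p
    for n in range(p + 1, f_min):  # clean target bands n = p+1 .. f_min-1
        target = bands[n - 1]
        lower = [bands[n - 1 - i] for i in range(1, p + 1)]
        for t in range(len(target)):
            for i in range(p):
                r[i] += lower[i][t] * target[t]
                for j in range(p):
                    R[i][j] += lower[i][t] * lower[j][t]
    # Setting dJ/da = 0 with f_n' = -sum(a[i] * f_{n-i}) gives R a = -r.
    return solve(R, [-v for v in r])


def correct_from_below(bands, f_min, f_max, p):
    """Replace bands f_min..f_max by predictions from the p bands below each."""
    a = fit_coefficients(bands, f_min, p)
    fixed = [list(w) for w in bands]
    for n in range(f_min, f_max + 1):
        lower = [fixed[n - 1 - i] for i in range(1, p + 1)]
        fixed[n - 1] = [-sum(a[i] * lower[i][t] for i in range(p))
                        for t in range(len(lower[0]))]
    return fixed
```

With p = f_min−2 as in the text, the fit uses a single target band (f_min−1); a smaller p gives more target bands and a better-conditioned fit. Bands above f_min are predicted from already corrected bands, which is one possible reading of correcting each band in turn.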

一方、ｆ_ｍｉｎ＞Ｍ／２を満たさない場合（ステップＳ２：Ｎｏ）、雑音補正部１１５は、ｆ_ｍａｘ＜Ｍ／２を満たすか否かを判定する（ステップＳ４）。 On the other hand, when f_min> M / 2 is not satisfied (step S2: No), the noise correction unit 115 determines whether or not f_max <M / 2 is satisfied (step S4).

ｆ_ｍａｘ＜Ｍ／２を満たす場合（ステップＳ４：Ｙｅｓ）、雑音補正部１１５は、“ｆ_ｍａｘ＋１”〜ｆ_Ｍまでの時間波形を用いて、ｆ_ｍｉｎ〜ｆ_ｍａｘまでの帯域別時間波形をそれぞれ帯域別にｐ次の線形予測により補正して、帯域合成部１１６に出力する（ステップＳ５）。 When f_max < M/2 is satisfied (step S4: Yes), the noise correction unit 115 uses the time waveforms from “f_max+1” to f_M to correct each of the band-specific time waveforms from f_min to f_max by p-th order linear prediction, and outputs the result to the band synthesis unit 116 (step S5).

例えば、ｐ＜ｆ_Ｍ−ｆ_ｍａｘ−１としてもよいが、
ここではｐ＝ｆ_Ｍ−ｆ_ｍａｘ−１とする。
そして、
ｆ_ｎ’＝−Σａ［i]×｛ｆ_ｎ＋i ｝（Σは、i＝１〜ｐの総和）
として、
目的関数Ｊ＝Σ（ｆ_ｎ−ｆ_ｎ’）^２が、最小となるように線形予測係数ａを求める。
この求めた線形予測係数ａを用いて、
ｆ_ｍａｘの信号＝−Σａ［i]×｛ｆ_ｍａｘ＋i ｝（Σは、i＝１〜ｐの総和）
として補正する。
このようにして、帯域ｆ_ｍｉｎ〜ｆ_ｍａｘの各信号（時間波形）を補正する。 For example, p < f_M−f_max−1 may be used, but here p = f_M−f_max−1.
Then, with the prediction defined as
f_n' = −Σ a[i] × f_{n+i} (Σ is the sum over i = 1 to p),
the linear prediction coefficients a are obtained so that the objective function J = Σ (f_n − f_n')² is minimized.
Using the obtained linear prediction coefficients a, the signal is corrected as
signal of f_max = −Σ a[i] × f_{max+i} (Σ is the sum over i = 1 to p).
In this way, each signal (time waveform) in the bands f_min to f_max is corrected.

一方、ｆ_ｍａｘ＜Ｍ／２を満たさない場合（ステップＳ４：Ｎｏ）、雑音補正部１１５は、ｆ_１〜“ｆ_ｍｉｎ−１”までの時間波形を用いてｆ_ｍｉｎ〜ｆ_ｍａｘまでの帯域をそれぞれ帯域別にｐ次の線形予測により補正した信号波形と、“ｆ_ｍａｘ＋１”〜ｆ_Ｍまでの時間波形を用いてｆ_ｍｉｎ〜ｆ_ｍａｘまでの帯域をそれぞれ帯域別にｐ次の線形予測により補正した信号波形とを加重平均して補正して、帯域合成部１１６に出力する（ステップＳ６）。 On the other hand, when f_max < M/2 is not satisfied (step S4: No), the noise correction unit 115 computes two corrected signal waveforms for each of the bands from f_min to f_max: one obtained by p-th order linear prediction from the time waveforms of f_1 to “f_min−1”, and one obtained by p-th order linear prediction from the time waveforms of “f_max+1” to f_M. It then corrects each band with the weighted average of the two and outputs the result to the band synthesis unit 116 (step S6).
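The branching of steps S2, S4, and S6 can be sketched in Python as follows. This is a minimal sketch, assuming f_min and f_max are 1-based band indices compared directly against M/2 as in this example; the function name and the returned labels are hypothetical.

```python
def choose_prediction_mode(f_min, f_max, num_bands):
    """Select which clean bands are used to predict the noisy bands.

    Returns "lower" (step S3: predict from bands below f_min),
    "upper" (step S5: predict from bands above f_max), or
    "both"  (step S6: weighted average of the two predictions).
    """
    half = num_bands / 2
    if f_min > half:   # step S2: Yes
        return "lower"
    if f_max < half:   # step S4: Yes
        return "upper"
    return "both"      # step S4: No


mode_high = choose_prediction_mode(20, 25, 32)
mode_low = choose_prediction_mode(3, 10, 32)
mode_mid = choose_prediction_mode(10, 20, 32)
```

The intent of the split is that prediction is always driven from the side with more clean bands available, and a noise band straddling the midpoint uses both sides.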

例えば、ｆ_ｍｉｎがＭ／２以下である場合でｆ_ｍａｘがＭ／２以上の場合、帯域ｆ_１〜ｆ_ｍｉｎ−１を用いて上記と同様にｐ次の線形予測を行ってｆｍ_ｍｉｎの信号（時間波形）を求め、且つ帯域ｆ_ｍａｘ＋１〜ｆ_Ｍを用いて上記と同様にｐ次の線形予測を行ってｆｍ_ｍａｘの信号（時間波形）を求める。
そして、
帯域ｆ_ｍｉｎ〜ｆ_ｍａｘの各信号（時間波形）＝
｛(f_max−fm)×fm_min＋(fm−f_min)×fm_max}／(f_max−f_min)
となる加重平均を行って補正する。 For example, when f_min is M/2 or less and f_max is M/2 or more, p-th order linear prediction is performed in the same manner as above using the bands f_1 to f_min−1 to obtain the signal (time waveform) fm_min, and p-th order linear prediction is likewise performed using the bands f_max+1 to f_M to obtain the signal (time waveform) fm_max.
Then, the signal (time waveform) of each band fm in the range f_min to f_max is corrected by the weighted average
{(f_max − fm) × fm_min + (fm − f_min) × fm_max} / (f_max − f_min).
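The weighted average of the two predictions can be sketched in Python as follows. This is a minimal sketch; the function name is hypothetical, and the handling of the case f_min = f_max (a plain average, to avoid division by zero) is an assumption that the description does not address.

```python
def blend_predictions(fm, f_min, f_max, from_below, from_above):
    """Weighted average for band fm (f_min <= fm <= f_max).

    from_below: waveform predicted from bands f_1 .. f_min-1 (fm_min)
    from_above: waveform predicted from bands f_max+1 .. f_M (fm_max)
    """
    if f_max == f_min:
        w_below = 0.5  # assumption: equal weights when the noise band is a single band
    else:
        w_below = (f_max - fm) / (f_max - f_min)
    w_above = 1.0 - w_below  # equals (fm - f_min) / (f_max - f_min)
    return [w_below * lo + w_above * hi for lo, hi in zip(from_below, from_above)]


low_pred = [1.0, 2.0]
high_pred = [3.0, 4.0]
at_bottom = blend_predictions(10, 10, 14, low_pred, high_pred)
at_middle = blend_predictions(12, 10, 14, low_pred, high_pred)
at_top = blend_predictions(14, 10, 14, low_pred, high_pred)
```

The weights make a band close to f_min follow the prediction from the lower bands and a band close to f_max follow the prediction from the upper bands, with a linear crossfade in between.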

そして、帯域合成部１１６は、雑音補正部１１５から雑音有りの信号期間毎に補正後の帯域別時間波形を入力し、信号期間毎に全帯域の時間波形を帯域合成して、当該雑音有りの信号期間毎の補正後信号を形成し、これにより位相情報も劣化させることがあるミュージカルノイズについても補正できるようになる。 The band synthesis unit 116 then receives the corrected band-specific time waveforms for each noisy signal period from the noise correction unit 115, band-synthesizes the time waveforms of all bands for each signal period, and forms a corrected signal for each noisy signal period. This makes it possible to correct even musical noise, which may also degrade the phase information.

一般的に音声信号の周波数成分には相関が高い場合が多く、線形予測による補正でノイズが軽減されることが予想される。 In general, the frequency components of voice signals are often highly correlated, and it is expected that noise will be reduced by correction by linear prediction.

〔従来技術と本発明に係る処置との比較〕 [Comparison between the prior art and the processing according to the present invention]
図７（ａ）は或る原音のスペクトログラムを示す図であり、図７（ｂ）は劣化した圧縮伸長音のスペクトログラムを示す図である。図７（ｂ）を参照するに、特に４ＫＨｚ付近から低域にかけて非可逆圧縮符号化処理により情報が欠落し伸長復号後の音声信号においてミュージカルノイズが発生していることが分かる。 FIG. 7(a) is a diagram showing the spectrogram of a certain original sound, and FIG. 7(b) is a diagram showing the spectrogram of the degraded compressed and decompressed sound. Referring to FIG. 7(b), it can be seen that information is lost by the lossy compression coding, particularly from around 4 kHz down to the low-frequency range, and that musical noise occurs in the audio signal after decompression and decoding.

また、図７（ｃ）は非特許文献２に基づく雑音除去処理（収縮・膨張処理）後の圧縮伸長音のスペクトログラムを示す図である。図７（ｃ）においてよく見るとエッジがスムーズになっているが、情報の欠落部分ははっきりと分かり、依然としてミュージカルノイズが発生していることが分かる。 FIG. 7(c) is a diagram showing the spectrogram of the compressed and decompressed sound after the noise removal processing (erosion and dilation processing) based on Non-Patent Document 2. On close inspection of FIG. 7(c), the edges are smoother, but the missing portions of information remain clearly visible, showing that musical noise is still present.

そして、図７（ｄ）は本発明に係る雑音除去処理後の圧縮伸長音に関するスペクトログラムを示す図である。図７（ｄ）を参照して理解されるように、はっきりと見えていた情報の欠落部分はスムーズになり、顕著なミュージカルノイズを軽減できることが確認された。従って、本発明に係る雑音除去処理は、従来技術と比較しても音声符号化によって生じる劣化を補正するのに優れていることが分かる。また、本発明に係る雑音除去処理は、ミュージカルノイズに限らず、原理的にも理解されるように、クリップノイズも抑圧することができる。 FIG. 7(d) is a diagram showing the spectrogram of the compressed and decompressed sound after the noise removal processing according to the present invention. As can be understood from FIG. 7(d), the previously clearly visible missing portions of information are smoothed, confirming that prominent musical noise can be reduced. The noise removal processing according to the present invention is therefore superior to the prior art in correcting the degradation caused by audio coding. Moreover, as can be understood from its principle, the noise removal processing according to the present invention can suppress not only musical noise but also clip noise.

以上の実施形態における音声雑音除去装置１は、コンピュータにより構成することができ、音声雑音除去装置１の各処理部を機能させるためのプログラムを好適に用いることができる。具体的には、音声雑音除去装置１の各処理部を制御するための制御部をコンピュータ内の中央演算処理装置（ＣＰＵ）で構成でき、且つ、各処理部を動作させるのに必要となるプログラムを構成することができる。即ち、そのようなコンピュータに、ＣＰＵによって該プログラムを実行させることにより、音声雑音除去装置１の各処理部の有する機能を実現させることができる。更に、音声雑音除去装置１の各処理部の有する機能を実現させるためのプログラムを、前述の記憶部１２の所定の領域に格納させることができる。そのような記憶部は、コンピュータ内部のＲＡＭ又はＲＯＭなどで構成させることができ、或いは又、外部記憶装置（例えば、ハードディスク）で構成させることもできる。更に、そのようなコンピュータに、音声雑音除去装置１の各処理部として機能させるためのプログラムは、コンピュータ読取り可能な記録媒体に記録することができる。また、音声雑音除去装置１の各処理部をハードウェア又はソフトウェアの一部として構成させ、各々を組み合わせて実現させることもできる。 The voice noise removing device 1 in the above embodiment can be configured by a computer, and a program for causing each processing unit of the voice noise removing device 1 to function can suitably be used. Specifically, a control unit for controlling each processing unit of the voice noise removing device 1 can be implemented by a central processing unit (CPU) in the computer, together with the program required to operate each processing unit. That is, by causing the CPU of such a computer to execute the program, the functions of each processing unit of the voice noise removing device 1 can be realized. Further, the program for realizing the functions of each processing unit of the voice noise removing device 1 can be stored in a predetermined area of the storage unit 12 described above. Such a storage unit can be configured by a RAM or ROM inside the computer, or by an external storage device (for example, a hard disk). The program for causing such a computer to function as each processing unit of the voice noise removing device 1 can also be recorded on a computer-readable recording medium. Each processing unit of the voice noise removing device 1 may also be configured partly in hardware and partly in software, and realized by combining the two.

以上、特定の実施形態の例を挙げて本発明を説明したが、本発明は前述の実施形態の例に限定されるものではなく、その技術思想を逸脱しない範囲で種々変形可能である。従って、本発明に係る音声雑音除去装置１は、上述した実施形態の例に限定されるものではなく、特許請求の範囲の記載によってのみ制限される。 Although the present invention has been described above with reference to examples of specific embodiments, the present invention is not limited to those examples, and various modifications can be made without departing from its technical idea. Therefore, the voice noise removing device 1 according to the present invention is not limited to the examples of the above-described embodiment and is limited only by the description of the claims.

本発明によれば、任意の圧縮伸長後の音声信号に対し符号化劣化によって生じたミュージカルノイズを含む雑音を自動検出し補正することができるので、雑音の抑圧を要する音声信号の信号処理の用途に有用である。 According to the present invention, noise including musical noise caused by coding degradation can be automatically detected and corrected in an arbitrary compressed and decompressed audio signal, so the invention is useful for audio signal processing applications that require noise suppression.

１音声雑音除去装置
１１雑音除去処理部
１２記憶部
１１１信号期間分割部
１１２帯域分割部
１１３雑音学習検出部
１１４雑音帯域判別部
１１５雑音補正部
１１６帯域合成部
１１７信号合成部
１２１信号処理用メモリ
１２２機械学習用データベース（ＤＢ）
１１３１エネルギー変換部
１１３１’ エネルギー変換部
１１３２ＬＳＴＭ学習部
１１３２ａ評価値算出部
１１３３帯域別雑音判定部
1 Voice noise removing device
11 Noise removal processing unit
12 Storage unit
111 Signal period division unit
112 Band division unit
113 Noise learning detection unit
114 Noise band discrimination unit
115 Noise correction unit
116 Band synthesis unit
117 Signal synthesis unit
121 Signal processing memory
122 Machine learning database (DB)
1131 Energy conversion unit
1131' Energy conversion unit
1132 LSTM learning unit
1132a Evaluation value calculation unit
1133 Band-specific noise determination unit

Claims

A voice noise removing device that removes noise in a voice signal after compression and decompression, the device comprising:
a signal period division unit that receives the compressed and decompressed audio signal and divides it into signals for each signal period delimited at predetermined time intervals;
a band division unit that generates, for each signal period, a time waveform for each band obtained by dividing at predetermined frequency intervals by a number of band divisions;
a noise learning detection unit that detects, for each signal period, noise including musical noise for each band using machine learning, and causes the signal period division unit to branch the signals into signals of noise-free signal periods and signals of noisy signal periods;
a noise band discrimination unit that, based on noise band information from the noise learning detection unit, determines the minimum value of the noise band and the maximum value of the noise band for each signal period divided by a predetermined number of band divisions having a minimum band and a maximum band;
a noise correction unit that, based on the number of band divisions, the minimum band and the maximum band of the band division, and the minimum value and the maximum value of the noise band, corrects the waveform of each band having the noise in a noisy signal period by linear prediction from the waveforms of bands without the noise in that signal period, and generates corrected band-specific time waveforms for each noisy signal period;
a band synthesis unit that band-synthesizes the corrected band-specific time waveforms for each noisy signal period to form a corrected signal for each noisy signal period; and
a signal synthesis unit that generates an audio signal in which noise including musical noise is suppressed by concatenating, in chronological order, the corrected signals of the noisy signal periods and the signals of the noise-free signal periods.

The voice noise removing device according to claim 1, wherein the noise learning detection unit is composed of an LSTM (Long Short-Term Memory) network and is trained in advance to identify, for each predetermined signal period, whether or not a band has noise including musical noise, based on the energy or the envelope shape of the band-divided time waveforms of an original audio signal for learning and of the audio signal obtained by compressing and decompressing that original sound.

The voice noise removing device according to claim 1 or 2, wherein the noise correction unit corrects the time waveform of each band having the noise in a noisy signal period by:
performing a first linear prediction using the signal waveforms in bands lower than the minimum value of the noise band when the minimum value of the noise band is higher than a predetermined frequency;
performing a second linear prediction using the signal waveforms in bands higher than the maximum value of the noise band when the minimum value of the noise band is equal to or lower than the predetermined frequency and the maximum value of the noise band is lower than the predetermined frequency; and
taking a weighted average of the band-specific time waveform obtained by the first linear prediction and the band-specific time waveform obtained by the second linear prediction when the minimum value of the noise band is equal to or lower than the predetermined frequency and the maximum value of the noise band is equal to or higher than the predetermined frequency.

A program for causing a computer to function as the voice noise removing device according to any one of claims 1 to 3.