WO2017104876A1

WO2017104876A1 - Noise removal device and method therefor

Info

Publication number: WO2017104876A1
Application number: PCT/KR2015/013970
Authority: WO
Inventors: 이석필; 서지훈; 한혁수
Original assignee: 상명대학교 서울산학협력단
Priority date: 2015-12-18
Filing date: 2015-12-18
Publication date: 2017-06-22
Also published as: KR101741141B1

Abstract

The present invention relates to a voice signal processing method, wherein a noise removal method, according to one aspect of the present invention, comprises the steps of: receiving, as input, a mixed signal comprising a voice signal and a noise signal; obtaining the noise signal by using a section in the mixed signal that does not include the voice signal; obtaining an a posteriori signal-to-noise ratio by using the noise signal and the mixed signal; estimating an a priori signal-to-noise ratio of a current frame by using the a posteriori signal-to-noise ratio, a noise signal of a previous frame and an a priori signal-to-noise ratio of the previous frame; calculating a weighted value by using the estimated a priori signal-to-noise ratio; calculating a filter value per each frequency by using the calculated weighted value; and by multiplying the calculated filter value by the mixed signal, obtaining the estimated voice signal which has been improved.

Description

Noise canceller and method

The present invention relates to signal processing for speech enhancement, and more particularly, to a signal processing method and apparatus for improving the clarity of speech by removing wind noise included in the speech.

As smartphones become more widespread, voice recognition technology is being used in various ways. Apple's Siri and Google's Google Now are typical smartphone services using voice recognition.

When the surroundings are quiet, the recognition rate of a voice recognition service is high, and in an ordinary call the other party's voice can be heard well. However, when the surroundings are noisy and wind is mixed with the user's voice as it enters the smartphone, the recognition rate of the voice recognition service may be lowered and the other party's voice may not be heard well.

When the wind noise is mixed, the prior art attempts to reduce the wind noise by simply cutting out a specific band of a signal by using a low pass filter (LPF) or a high pass filter (HPF).

Republic of Korea Application No. 10-2005-0120682 relates to a method for automatically removing wind noise according to its level. In that invention, the mixed signal is filtered with a low pass filter, its level is measured, a control signal is generated according to the measured level, and the wind noise is removed through a high pass filter.

However, such simple filtering does not improve the speech recognition rate, because it causes loss not only in the wind noise but also in the user's voice band.

An object of the present invention is to provide a device and method for obtaining a filter coefficient using the a priori signal-to-noise ratio and the a posteriori signal-to-noise ratio, and for removing wind noise using that filter coefficient.

The object of the present invention is not limited to the above-mentioned object, and other objects that are not mentioned will be clearly understood by those skilled in the art from the following description.

A noise reduction method according to one aspect of the present invention for achieving the above object comprises: receiving a mixed signal including a voice signal and a noise signal; obtaining the noise signal using a section of the mixed signal in which the voice signal is absent; obtaining an a posteriori signal-to-noise ratio using the noise signal and the mixed signal; estimating an a priori signal-to-noise ratio of the current frame using the a posteriori signal-to-noise ratio, the noise signal of the previous frame, and the a priori signal-to-noise ratio of the previous frame; calculating a weight value using the estimated a priori signal-to-noise ratio; calculating a filter value for each frequency using the calculated weight value; and multiplying the calculated filter value by the mixed signal to obtain the improved estimated speech signal.

According to another aspect of the present invention, there is provided a noise removal device comprising: an input unit configured to receive a mixed signal including a voice signal and a noise signal; a frequency signal converter configured to convert the mixed signal into a frequency domain signal; an operation unit configured to obtain the noise signal using a section of the mixed signal without the voice signal, obtain an a posteriori signal-to-noise ratio using the noise signal and the mixed signal, estimate an a priori signal-to-noise ratio of the current frame using the a posteriori signal-to-noise ratio, the noise signal of the previous frame, and the a priori signal-to-noise ratio of the previous frame, calculate a weight value using the estimated a priori signal-to-noise ratio, and calculate a filter value for each frequency using the calculated weight value; a filter unit configured to obtain an enhanced voice signal by multiplying the calculated filter value by the mixed signal; and a time domain signal converter configured to convert the enhanced voice signal into a time domain signal.

According to the present invention, a filter formed from the a priori signal-to-noise ratio and the a posteriori signal-to-noise ratio is applied to a signal mixed with wind noise, which provides improved speech enhancement and thereby increases the speech recognition rate and speech intelligibility.

FIG. 1 is a flowchart of a noise removing method according to an embodiment of the present invention.

FIG. 2 is a structural diagram showing the signal flow of the noise removing method according to an embodiment of the present invention.

FIG. 3 is a structural diagram of a noise removing device according to another embodiment of the present invention.

FIG. 4 is a structural diagram of a computer device in which a noise removing method according to another embodiment of the present invention is implemented.

Advantages and features of the present invention, and methods for achieving them, will be apparent with reference to the embodiments described below in detail together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various forms. The embodiments are provided to fully convey the scope of the invention to those skilled in the art, and the present invention is defined only by the scope of the claims. Meanwhile, the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the invention. In this specification, the singular also includes the plural unless specifically stated otherwise. As used herein, “comprises” and/or “comprising” does not exclude the presence or addition of one or more other components, steps, operations, and/or elements in addition to the stated component, step, operation, and/or element.

Hereinafter, exemplary embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 shows a flowchart of a noise removing method according to an embodiment of the present invention.

In order to remove the noise for improving the voice signal, the mixed signal is first received (S110).

Since the input mixed signal is usually a time domain signal, an FFT (Fast Fourier Transform) operation is performed to convert it into a frequency domain signal. The frequency domain signal produced by the FFT operation is composed of a magnitude signal and a phase signal. In the present invention, the calculation is performed only on the magnitude signal, so the phase signal is passed to the output side without modification.
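As a minimal sketch of this step, the following Python code transforms one frame and separates the result into magnitude and phase. It uses a naive DFT from the standard library in place of an FFT so that it stays self-contained; the function names `dft` and `split_magnitude_phase` and the 8-sample test frame are illustrative assumptions, not part of the patent.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform of one time-domain frame (O(n^2))."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def split_magnitude_phase(spectrum):
    """Separate a complex spectrum into per-bin magnitude and phase lists."""
    magnitude = [abs(c) for c in spectrum]
    phase = [cmath.phase(c) for c in spectrum]
    return magnitude, phase


# One cycle of a cosine over 8 samples: the energy lands in bins 1 and 7.
frame = [math.cos(2 * math.pi * t / 8) for t in range(8)]
magnitude, phase = split_magnitude_phase(dft(frame))
```

Only `magnitude` would be processed further; `phase` is kept aside unchanged for the inverse transform.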

In order to obtain the a priori SNR, a noise signal, a mixed signal, and the a posteriori SNR are required. Since only the mixed signal is input, the noise signal and the a posteriori SNR are estimated from the mixed signal.

First, a noise signal is obtained by using an interval of the mixed signal that contains no speech. A human voice is not always present in the mixed signal. In particular, a short section immediately after the mixed signal input begins is assumed not to contain a human voice, so it is assumed that only a noise signal exists in that section.
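One way to sketch this assumption in Python is to average the per-bin power of the first few frames and treat the result as the noise estimate. The averaging of squared magnitudes is consistent with the "average value of the squared magnitude of the noise signal" recited in the claims, but the function name `estimate_noise_power` and the parameter `noise_frame_count` are hypothetical: the description only says that a short initial section is used and does not state its length.

```python
def estimate_noise_power(power_frames, noise_frame_count):
    """Average per-bin power over the first `noise_frame_count` frames.

    `power_frames` is a list of frames, each a list of squared magnitudes.
    The leading frames are ASSUMED to be speech-free.
    """
    if not 0 < noise_frame_count <= len(power_frames):
        raise ValueError("noise_frame_count must be in 1..len(power_frames)")
    leading = power_frames[:noise_frame_count]
    bins = len(leading[0])
    return [sum(f[k] for f in leading) / noise_frame_count for k in range(bins)]


# Three frames of two-bin power spectra; the first two are assumed speech-free.
power_frames = [[1.0, 4.0], [3.0, 2.0], [50.0, 60.0]]
noise_power = estimate_noise_power(power_frames, noise_frame_count=2)
```

If speech does begin inside the assumed noise-only section, the estimate is biased upward, so the section length would need to be chosen conservatively.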

After the noise signal is obtained, the a posteriori signal-to-noise ratio is obtained from the noise signal and the mixed signal, as in Equation 1 below (S120).

[Equation 1: a posteriori signal-to-noise ratio]

Equation 1 gives the a posteriori signal-to-noise ratio at the p-th frame and the k-th frequency index, where Y(p, k) and N(p, k) represent the mixed signal and the noise signal at the p-th frame and the k-th frequency index, respectively. The noise signal uses the value assumed in the previous step.
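The ratio can be sketched per frequency bin as follows. The claims describe it as the magnitude of the mixed signal divided by the magnitude of the noise signal; whether Equation 1 uses plain or squared magnitudes cannot be recovered from the text, so this sketch simply divides whichever quantity the caller passes. The function name `a_posteriori_snr` and the `floor` guard against division by zero are assumptions.

```python
def a_posteriori_snr(mixed, noise, floor=1e-12):
    """Per-bin ratio of mixed-signal size to noise size for one frame.

    `mixed` and `noise` must use the same convention (both magnitudes
    or both squared magnitudes). `floor` is a hypothetical safeguard.
    """
    if len(mixed) != len(noise):
        raise ValueError("mixed and noise must have the same number of bins")
    return [y / max(n, floor) for y, n in zip(mixed, noise)]


gamma = a_posteriori_snr([4.0, 1.0, 9.0], [2.0, 2.0, 3.0])
```

A value above 1 in a bin indicates that the mixed signal exceeds the noise estimate there, which is what the later a priori estimate builds on.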

The a priori signal-to-noise ratio is then calculated from the a posteriori signal-to-noise ratio (S130), as in Equation 2.

The estimated speech signal in Equation 2 is the signal obtained by removing the noise signal from the mixed signal. Before the calculation according to the present invention starts, the estimated speech signal is initialized to 0; the speech signal of each frame is then estimated and used to calculate the a priori signal-to-noise ratio of the next frame.

α is a preset coefficient that adjusts, in estimating the voice signal, the relative influence of the estimated voice signal and noise signal of the previous frame and of the a posteriori signal-to-noise ratio accumulated from the first frame to the previous frame.

That is, α is a value between 0 and 1: the closer it is to 1, the greater the influence of the value of the previous frame, and the closer it is to 0, the greater the influence of the value accumulated from the first frame to the previous frame.
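The following is a hedged sketch of one reading of this estimate, built from the term-by-term wording of the claims: α times the previous frame's squared speech estimate divided by the average noise power, plus (1 − α) times the larger of 0 and the a posteriori ratio minus 1. Treating the second term as the current frame's rectified a posteriori ratio, as in the widely used decision-directed estimator, is an assumption; the function name `a_priori_snr` and the `floor` guard are hypothetical.

```python
def a_priori_snr(prev_speech_mag, noise_power, gamma, alpha, floor=1e-12):
    """Assumed decision-directed form, evaluated per frequency bin:

    xi = alpha * |S_prev|^2 / noise_power + (1 - alpha) * max(gamma - 1, 0)
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    return [
        alpha * (s * s) / max(n, floor) + (1.0 - alpha) * max(g - 1.0, 0.0)
        for s, n, g in zip(prev_speech_mag, noise_power, gamma)
    ]


# First frame: the speech estimate is still 0, so only the second term acts.
xi_first = a_priori_snr([0.0, 0.0], [1.0, 1.0], [3.0, 0.5], alpha=0.5)

# Later frame: the previous speech estimate now contributes through alpha.
xi_next = a_priori_snr([2.0, 0.0], [1.0, 2.0], [3.0, 0.5], alpha=0.5)
```

The `max(..., 0)` keeps the estimate non-negative in bins where the mixed signal falls below the noise estimate.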

Once the a priori signal-to-noise ratio has been obtained, the weight value is calculated from it (S140) using Equation 3.

μ is a weighting parameter. If the a priori signal-to-noise ratio is large, the voice signal is large, so the weight value should be large. Conversely, if the a priori signal-to-noise ratio is small, the voice signal is small relative to the noise signal, so the weight value should also be small.

Once the weight value and the a priori signal-to-noise ratio are obtained, the filter value H(p, k) used for noise reduction can be obtained from the two values (S150), as shown in Equation 4.

Y(p, k) represents the mixed signal, and the estimated speech signal obtained as described above is used to find the a priori signal-to-noise ratio in the next frame.
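The feedback described here, in which each frame's speech estimate feeds the next frame's a priori ratio, can be sketched as a loop over frames. Equations 3 and 4 are not recoverable from the text, so the gain below is a plainly labeled stand-in (a Wiener-type gain with a constant weight), not the patent's filter; the power-ratio form of the a posteriori ratio, the function names, and the `floor` guard are likewise assumptions. Only the multiplication of the filter value by the mixed signal and the zero initialization come from the description.

```python
def wiener_like_gain(xi, weight=1.0):
    """Stand-in gain weight*xi / (1 + weight*xi); NOT the patent's Equation 4."""
    return [weight * x / (1.0 + weight * x) for x in xi]


def enhance_frames(mixed_mag_frames, noise_power, alpha, floor=1e-12):
    """Per-frame loop: SNRs -> gain -> speech estimate, fed to the next frame."""
    noise = [max(n, floor) for n in noise_power]
    prev_speech = [0.0] * len(noise)  # estimate starts at 0, per the description
    enhanced = []
    for mixed in mixed_mag_frames:
        # Assumed power-ratio form of the a posteriori SNR.
        gamma = [(y * y) / n for y, n in zip(mixed, noise)]
        # Assumed decision-directed form of the a priori SNR.
        xi = [
            alpha * (s * s) / n + (1.0 - alpha) * max(g - 1.0, 0.0)
            for s, n, g in zip(prev_speech, noise, gamma)
        ]
        gain = wiener_like_gain(xi)
        prev_speech = [h * y for h, y in zip(gain, mixed)]  # S_hat = H * Y
        enhanced.append(prev_speech)
    return enhanced


# One frequency bin, unit noise power: a quiet frame followed by a louder one.
enhanced = enhance_frames([[1.0], [3.0]], noise_power=[1.0], alpha=0.5)
```

In this toy input the first frame sits at the noise level and is fully suppressed, while the second frame is attenuated to 0.8 of its magnitude.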

FIG. 2 shows the signal flow from the input of a mixed signal containing a noise signal, through filtering, to the output of a signal in which the noise signal is attenuated.

Finally, since the estimated speech signal is a magnitude signal, it is combined with the phase signal, which was passed through unmodified, and converted into a time domain signal by the IFFT (Inverse Fast Fourier Transform) to provide a signal from which the noise is removed.
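A minimal sketch of this recombination, assuming a naive inverse DFT in place of an IFFT so that only the standard library is needed. The function name `idft_from_polar` and the 4-bin example are illustrative, not taken from the patent.

```python
import cmath
import math


def idft_from_polar(magnitude, phase):
    """Rebuild a real time-domain frame from per-bin magnitude and phase."""
    n = len(magnitude)
    # Recombine the (filtered) magnitude with the untouched phase.
    spectrum = [cmath.rect(m, p) for m, p in zip(magnitude, phase)]
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


# A unit impulse at t = 0 has magnitude 1 and phase 0 in every bin.
# Scaling every magnitude by a gain of 0.5 should halve the impulse.
gain = 0.5
frame = idft_from_polar([gain * 1.0] * 4, [0.0] * 4)
```

Taking the real part assumes the spectrum keeps the conjugate symmetry of a real signal, which holds when the same real gain is applied to mirrored bins.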

When the noise is removed by estimating the a priori signal-to-noise ratio, the noise reduction effect is superior to that of a conventional LPF.

FIG. 3 is a structural diagram of a noise removing device according to another embodiment of the present invention.

The input unit 310 receives a mixed signal in which a voice signal and a noise signal are mixed. The input unit may be composed of a microphone or the like, or may receive input in the form of a file, such as a voice file or a video file, and take only its audio signal as the mixed signal.

In the present invention, since the signal is processed in the frequency domain, the frequency signal converter 320 converts the received signal into a frequency signal through a method such as an FFT. The frequency signal transformation may use methods such as Discrete Fourier Transform (DFT), Discrete Cosine Transform (DCT), and Filterbank as well as FFT.

The operation unit 330 calculates a filter value for noise removal from the input signal.

The a posteriori signal-to-noise ratio is first obtained from the input mixed signal and the noise signal, as shown in Equation 1 above. Although the speech signal and the noise signal cannot be distinguished within the mixed signal, it is assumed that no voice signal exists in the initial part of the input signal, and the a posteriori signal-to-noise ratio is calculated by treating the signal in that section as the noise signal.

The a priori signal-to-noise ratio is obtained by using the calculated a posteriori signal-to-noise ratio, the average value of the speech signal estimated from the previous frame, and the previously obtained noise signal. In this process, a proportional coefficient may be used to adjust how strongly the estimated value of the previous frame and the history value of earlier frames, including the previous frame, each affect the a priori signal-to-noise ratio.

Increasing the share of the previous frame value makes the estimate sensitive to changes between frames, but the resulting frequent changes can be unpleasant for the user. Increasing the share of the history value suppresses abrupt changes and yields a natural-sounding voice signal, but the estimate cannot respond quickly to a signal that changes rapidly in time. The optimal value between the two can therefore be determined by experiment.

Once the a priori signal-to-noise ratio has been calculated using Equation 2, the weight value can be obtained. When the a priori signal-to-noise ratio is large, the speech signal is expected to be large, so the weight value is increased; when it is small, the weight value is reduced to reduce the influence of the noise signal. The weight value is obtained by Equation 3.

Finally, the filter value is obtained using the weight value and the a priori signal-to-noise ratio.

The filter unit 340 multiplies the thus obtained filter value by the mixed signal to obtain a signal from which the noise is removed.

Since the signal from which the noise has been removed by the filter unit 340 is a frequency domain signal, the time signal converter 350 converts it into a time domain signal and provides it to the output unit, so that the user can finally hear the signal with the noise removed.

The time signal converter 350 may convert a frequency domain signal into a time domain signal using a method such as IFFT, IDFT (Inverse DFT), IDCT (Inverse DCT), Inverse Filterbank, or the like.
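Because the description allows several transform pairs, one way to picture the device is as a pipeline whose converters are interchangeable. The sketch below wires the units of FIG. 3 as plain callables; the class name `NoiseRemovalPipeline`, its field names, and the use of real-valued coefficients (as a DCT would produce) are assumptions made for illustration, and the identity transform and constant gain in the example are placeholders rather than a real filter.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Frame = List[float]


@dataclass
class NoiseRemovalPipeline:
    """Hypothetical wiring of the FIG. 3 units as interchangeable callables."""

    to_frequency: Callable[[Sequence[float]], Frame]  # frequency signal converter 320
    compute_filter: Callable[[Frame], Frame]          # operation unit 330
    to_time: Callable[[Frame], Frame]                 # time signal converter 350

    def process(self, mixed: Sequence[float]) -> Frame:
        coefficients = self.to_frequency(mixed)
        gains = self.compute_filter(coefficients)
        # Filter unit 340: multiply each coefficient by its filter value.
        filtered = [h * c for h, c in zip(gains, coefficients)]
        return self.to_time(filtered)


# Placeholder stages: identity transforms and a constant gain of 0.5.
pipeline = NoiseRemovalPipeline(
    to_frequency=list,
    compute_filter=lambda coeffs: [0.5] * len(coeffs),
    to_time=list,
)
output = pipeline.process([2.0, 4.0])
```

Swapping the transform then only means passing a different pair of `to_frequency` and `to_time` callables, provided the two are inverses of each other.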

Meanwhile, the noise reduction method according to an embodiment of the present invention may be implemented in a computer system or recorded on a recording medium. As shown in FIG. 4, the computer system may include at least one processor 421, a memory 423, a user input device 426, a data communication bus 422, a user output device 427, and a storage 428. Each of these components communicates data via the data communication bus 422.

The computer system can further include a network interface 429 coupled to the network. The processor 421 may be a central processing unit (CPU) or a semiconductor device that processes instructions stored in the memory 423 and / or the storage 428.

The memory 423 and the storage 428 may include various types of volatile or nonvolatile storage media. For example, the memory 423 may include a ROM 424 and a RAM 425.

Therefore, the noise reduction method according to the embodiment of the present invention can be implemented as a computer executable method. When the noise reduction method according to an embodiment of the present invention is performed in a computer device, computer readable instructions may perform the noise reduction method according to the present invention.

Meanwhile, the noise canceling method according to the present invention described above may be embodied as computer readable code on a computer readable recording medium. Computer readable recording media include all kinds of recording media storing data that can be read by a computer system. Examples include a read only memory (ROM), a random access memory (RAM), a magnetic tape, a magnetic disk, a flash memory, and an optical data storage device. The computer readable recording medium can also be distributed over computer systems connected through a computer network, and stored and executed as readable code in a distributed fashion.

The configuration of the present invention has been described above in detail with reference to the accompanying drawings. These are merely examples, and those skilled in the art to which the present invention pertains can of course make various modifications and changes within the scope of the technical idea of the present invention. Therefore, the protection scope of the present invention should not be limited to the above-described embodiments but should be defined by the following claims.

Claims

1. A noise reduction method comprising:

receiving a mixed signal including a voice signal and a noise signal;

obtaining the noise signal using a section of the mixed signal in which the voice signal is absent;

obtaining an a posteriori signal-to-noise ratio using the noise signal and the mixed signal;

estimating an a priori signal-to-noise ratio of the current frame using the a posteriori signal-to-noise ratio, the noise signal of the previous frame, and the a priori signal-to-noise ratio of the previous frame;

calculating a weight value using the estimated a priori signal-to-noise ratio;

calculating a filter value for each frequency using the calculated weight value; and

multiplying the calculated filter value by the mixed signal to obtain an improved estimated speech signal.

2. The method of claim 1, wherein the a posteriori signal-to-noise ratio is calculated by dividing the magnitude of the mixed signal by the magnitude of the noise signal.

3. The method of claim 1, wherein the a priori signal-to-noise ratio of the current frame is obtained by adding:

a value obtained by dividing the squared magnitude of the estimated speech signal of the previous frame by an average value of the squared magnitude of the noise signal and multiplying the result by a predetermined proportional coefficient; and

a value obtained by multiplying the larger of 0 and the value obtained by subtracting 1 from the a posteriori signal-to-noise ratio by the value obtained by subtracting the predetermined proportional coefficient from 1, summed from the first frame to the previous frame.

4. The method of claim 1, wherein the weight value is obtained by dividing the square root of the sum of the square of the a priori signal-to-noise ratio of the current frame and the absolute value of the a priori signal-to-noise ratio of the current frame by the absolute value of the a priori signal-to-noise ratio of the current frame.

5. The method of claim 1, wherein the filter value is obtained by dividing the value obtained by multiplying the a priori signal-to-noise ratio by the weight value by the value obtained by multiplying the a priori signal-to-noise ratio by the weight value.

6. A noise reduction device comprising at least one processor, wherein the processor implements:

an input unit configured to receive a mixed signal including a voice signal and a noise signal;

a frequency signal converter configured to convert the mixed signal into a frequency domain signal;

an operation unit configured to obtain the noise signal using a section of the mixed signal without the voice signal, obtain an a posteriori signal-to-noise ratio using the noise signal and the mixed signal, estimate an a priori signal-to-noise ratio of the current frame using the a posteriori signal-to-noise ratio, the noise signal of the previous frame, and the a priori signal-to-noise ratio of the previous frame, calculate a weight value using the estimated a priori signal-to-noise ratio, and calculate a filter value for each frequency using the calculated weight value;

a filter unit configured to obtain an enhanced voice signal by multiplying the calculated filter value by the mixed signal; and

a time domain signal converter configured to convert the enhanced voice signal into a time domain signal.

7. The device of claim 6, wherein the operation unit obtains the a posteriori signal-to-noise ratio by dividing the magnitude of the mixed signal by the magnitude of the noise signal.

8. The device of claim 6, wherein the operation unit obtains the a priori signal-to-noise ratio by adding:

a value obtained by dividing the squared magnitude of the estimated speech signal of the previous frame by an average value of the squared magnitude of the noise signal and multiplying the result by a predetermined proportional coefficient; and

a value obtained by multiplying the larger of 0 and the value obtained by subtracting 1 from the a posteriori signal-to-noise ratio by the value obtained by subtracting the predetermined proportional coefficient from 1, summed from the first frame to the previous frame.

9. The device of claim 6, wherein the operation unit obtains the weight value by dividing the square root of the sum of the square of the a priori signal-to-noise ratio of the current frame and the absolute value of the a priori signal-to-noise ratio of the current frame by the absolute value of the a priori signal-to-noise ratio of the current frame.

10. The device of claim 6, wherein the operation unit obtains the filter value by dividing the value obtained by multiplying the a priori signal-to-noise ratio by the weight value by the value obtained by multiplying the a priori signal-to-noise ratio by the weight value.