CN114333884B - Voice noise reduction method based on combination of microphone array and wake-up word

Info

Publication number: CN114333884B
Application number: CN202011061741.2A
Authority: CN (China)
Prior art keywords: noise, covariance, wake, voice, stage
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN114333884A
Inventors: 孙静新, 邱东升
Assignee: Beijing Ingenic Semiconductor Co Ltd
Priority and filing date: 2020-09-30
Publication of CN114333884A: 2022-04-12
Application granted; publication of CN114333884B: 2024-05-03


Classifications

    • Y: General tagging of new technological developments; general tagging of cross-sectional technologies spanning over several sections of the IPC; technical subjects covered by former USPC cross-reference art collections [XRACs] and digests
    • Y02: Technologies or applications for mitigation or adaptation against climate change
    • Y02D: Climate change mitigation technologies in information and communication technologies [ICT], i.e. information and communication technologies aiming at the reduction of their own energy use
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/70: Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Circuit For Audible Band Transducer
  • Soundproofing, Sound Blocking, And Sound Damping

Abstract

The invention provides a voice noise reduction method based on a microphone array combined with a wake-up word. On top of the echo cancellation, DOA estimation and beamforming applied to the multi-channel audio received by the microphone array, a post-filtering noise reduction operation is added: noise estimation is combined with the position of the voice wake-up word, and after voice wake-up, speech noise and music noise other than the desired speech are further suppressed, which improves the processing capability of the whole microphone-array-based voice front end. The post-filtering is divided into two stages according to the wake-up state of the wake-up word: the stage before wake-up (the non-wake stage), and the stage after wake-up (the wake stage), which lasts until the speech recognition result returns or until a certain time has elapsed, this time being, for example, the average duration a person needs to finish speaking the sentence to be recognized. Different noise estimates are used in the two stages, and the noise reduction uses a masking effect, so that human-voice noise and music noise are suppressed during the recognition stage after wake-up.

Description

Voice noise reduction method based on combination of microphone array and wake-up word
Technical Field
The invention relates to the technical field of audio processing, in particular to a voice noise reduction method based on combination of a microphone array and wake-up words.
Background
With the continuous development of artificial intelligence and speech recognition, voice wake-up and recognition appear more and more in daily life, for example in smart speakers and in-vehicle voice systems, and the application scenarios are increasingly diverse. Ambient noise and the sound emitted by the device itself are unavoidable in these applications and degrade speech recognition, so the speech provided to the recognition system must first be processed; this is speech front-end processing.
Speech front-end processing mainly uses a microphone array to pick up sound and applies a series of operations to the captured multi-channel signals, such as echo cancellation, sound source localization, beamforming and noise reduction, so that the speech signal from the desired direction is enhanced while noise from undesired directions is suppressed, thereby improving speech recognition.
Noise reduction and noise estimation in current microphone-array-based speech front ends are mainly aimed at stationary environmental noise, such as kitchen noise (microwave ovens, range hoods and the like) and white noise. Noise estimation mainly relies on VAD: frames in which no speech is detected are treated as noise, and the corresponding noise reduction processing then suppresses that noise.
In a multi-speaker scene, for example when a family is chatting or songs are being played, after the device is woken up by the wake-up word, DOA is used for localization and a beam is formed; speech from directions other than the woken-up direction of the desired signal is not well suppressed by beamforming, and it cannot be suppressed further in the subsequent noise reduction, so it reaches the speech recognition system and degrades the recognition result.
Existing microphone-array-based speech front-end processing therefore has no obvious effect in suppressing human voices and songs that are not the desired speech signal, yet their impact on speech recognition is far greater than that of stationary environmental noise.
Furthermore, technical terms commonly used in the art include:
microphone array: a system consisting of a number of acoustic sensors (typically microphones) for sampling and processing the spatial characteristics of the sound field.
Echo cancellation: (Acoustic Echo Cancellation, AEC) for canceling the sound emitted by the device itself.
Direction of arrival: (Direction of Arrival, DOA) used to determine the direction of a sound source.
Beamforming: (Beamforming) forming a main beam in a particular direction to receive the useful desired signal while forming ultra-low sidelobes to cancel noise and interference signals.
Linear constraint minimum variance: (Linearly Constrained Minimum Variance, LCMV) a beamforming algorithm.
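For reference, under the usual narrowband array model the LCMV weights take the standard closed form below; this is textbook background, not a formula taken from the patent:

$$\mathbf{w}_{\mathrm{LCMV}} = \mathbf{R}^{-1}\mathbf{C}\left(\mathbf{C}^{H}\mathbf{R}^{-1}\mathbf{C}\right)^{-1}\mathbf{f},$$

where $\mathbf{R}$ is the covariance matrix of the array inputs, $\mathbf{C}$ the constraint matrix, and $\mathbf{f}$ the desired response vector.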
Noise reduction: suppressing sound from sources other than the desired speech signal.
Wake-up word: a keyword used to wake the device by voice, such as "Xiaodu" (小度) or "Xiao Ai" (小爱).
Toeplitz matrix: also called a T-shaped matrix for short. The elements on the main diagonal of a Toeplitz matrix are equal, and the elements on each line parallel to the main diagonal are also equal; the elements are symmetric about the anti-diagonal, i.e. a T-shaped matrix is persymmetric. Simple T-shaped matrices include the forward shift matrix and the backward shift matrix. In Matlab, the function that generates a Toeplitz matrix is toeplitz(x, y): it builds a Toeplitz matrix with x as the first column and y as the first row, where x and y are vectors that need not be of equal length.
Let $T = [t_{ij}] \in \mathbb{C}^{n \times n}$. If $t_{ij} = t_{j-i}$ for $i, j = 1, 2, \dots, n$, i.e.

$$T = \begin{pmatrix} t_0 & t_1 & \cdots & t_{n-1} \\ t_{-1} & t_0 & \cdots & t_{n-2} \\ \vdots & \vdots & \ddots & \vdots \\ t_{-(n-1)} & t_{-(n-2)} & \cdots & t_0 \end{pmatrix},$$

then $T$ is called a Toeplitz matrix.
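A small sketch of the construction in Python; scipy.linalg.toeplitz(c, r) plays the role of Matlab's toeplitz(x, y), with c as the first column and r as the first row:

```python
import numpy as np
from scipy.linalg import toeplitz

c = np.array([1, 2, 3, 4])        # first column
r = np.array([1, 7, 8, 9, 10])    # first row (its first element is taken from c)
T = toeplitz(c, r)
print(T)
# [[ 1  7  8  9 10]
#  [ 2  1  7  8  9]
#  [ 3  2  1  7  8]
#  [ 4  3  2  1  7]]
```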
Discrete cosine transform: (Discrete Cosine Transform, DCT) a mathematical operation closely related to the Fourier transform. In a Fourier series expansion, if the expanded function is a real even function, the series contains only cosine terms; discretizing those cosine terms yields the discrete cosine transform.
Bartlett window: (Bartlett window) a triangular window whose coefficients rise linearly from zero to one at the center and fall back to zero; a common form is $w(n) = 1 - \left|\frac{2n}{N-1} - 1\right|$ for $n = 0, \dots, N-1$.
Voice activity detection: (Voice Activity Detection, VAD) used to detect whether the current audio signal contains speech, i.e. to judge the input signal, distinguish the speech signal from the various background noise signals, and apply different processing to the two kinds of signal.
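A deliberately minimal, energy-based VAD sketch; the patent does not specify which VAD algorithm it uses, and the threshold ratio here is an assumption:

```python
import numpy as np

def simple_energy_vad(frame: np.ndarray, noise_floor: float, ratio: float = 3.0) -> bool:
    """Return True if the frame likely contains speech (illustrative only)."""
    frame_energy = float(np.mean(frame ** 2))
    return frame_energy > ratio * noise_floor
```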
Laplace transform: an integral transform commonly used in engineering mathematics. The Laplace transform is a linear transform that converts a function of a real variable $t$ ($t \geq 0$) into a function of a complex variable $s$. Written symbolically:

$$F(s) = \int_0^{\infty} f(t)\, e^{-st}\, dt.$$

This is the Laplace transform: given a function of $t$, it produces a function of $s$.
Disclosure of Invention
In order to solve the above problems in the prior art, the object of the present invention is to provide a wake-up-word-based noise suppression method for scenarios with speech-like noise, such as several people chatting or songs being played, in which microphone-array-based speech front-end processing suppresses that noise poorly.
On top of the echo cancellation, DOA estimation and beamforming applied to the multi-channel audio received by the microphone array, the method adds a post-filtering noise reduction step: noise estimation is combined with the position of the voice wake-up word, and after voice wake-up, speech noise and music noise other than the desired speech are further suppressed, improving the processing capability of the whole microphone-array-based voice front end.
After the microphone array collects the multi-channel audio and the data are pre-processed (pre-emphasis, framing, windowing and so on), echo cancellation is performed; then, with the speech angle determined by DOA and the desired speech signal in the target direction enhanced by beamforming, which also gives a preliminary suppression of audio from other angles, the beamforming output is post-processed with noise estimation and noise reduction that take the voice wake-up word into account.
The post-filtering is divided into two stages according to the wake-up state of the wake-up word: the stage before wake-up, and the stage after wake-up, which lasts until the speech recognition result returns or until a certain time has elapsed, this time being, for example, the average duration a person needs to finish speaking the sentence to be recognized. Below, the first is called the non-wake stage and the second the wake stage. Different noise estimates are used in the two stages, and the noise reduction uses a masking effect to suppress music noise to a certain extent, so that human-voice noise and music noise are suppressed during the recognition stage after wake-up.
Specifically, the invention provides a voice noise reduction method based on a microphone array combined with wake-up words, which comprises the following steps:
S1, framing and windowing the single channel of audio data output after echo cancellation (AEC), direction-of-arrival (DOA) estimation and beamforming;
S2, covariance calculation:
S2.1, calculating the circular convolution of the whole frame of data;
S2.2, taking the last L values of the convolution result to form a Toeplitz matrix; this matrix is the covariance of the data, where L is the subframe length;
S3, determining initial values: the processing is divided into a non-wake stage and a wake stage, and initial values are determined separately for the noise covariance and noise power spectral density of the non-wake stage and for those of the wake stage;
S4, judging whether the system is in the wake stage:
S4.1, if in the non-wake stage, go to S4.1.1;
S4.1.1, perform VAD on the data;
if the frame is judged to be noise, update the noise covariance matrix and the noise power spectral density;
if the frame is judged to contain speech, do not update the noise covariance matrix or power spectral density, and keep the previously estimated noise;
S4.1.2, in the non-wake stage, treat the audio data of this stage as the noise of the wake stage: update the wake-stage noise covariance and noise power spectral density and store them; this requires a storage space longer than the wake-up word, used to hold the wake-stage noise covariance and noise power spectrum computed in this step;
S4.1.3, compute the covariance of the current frame, subtract the noise covariance from it to obtain the covariance of the speech signal, and go to S5;
S4.2, if in the stage after wake-up while waiting for the recognition result, i.e. the wake stage, go to S4.2.1;
S4.2.1, after wake-up, step back from the current position of the storage space by the maximum wake-up-word length, take out the noise covariance and power spectral density at that storage position and use them as the noise covariance and power spectrum of this stage, and compute the covariance of the speech signal for this stage;
S5, perform eigenvalue decomposition on the speech covariance obtained in S4.1.3 or S4.2.1, and apply the Laplace transform and the transformation from the frequency domain to the eigenvalue domain;
S6, to remove music noise and other residual non-speech noise, further compute a mask using critical bandwidths and the masking effect, compute weights from the mask and the result of S5, and compute the final noise-reduced data.
In summary, the advantages of applying the method of the application are as follows: it effectively improves the processing capability of the whole microphone-array-based voice front end; it improves the speech recognition result; and the method is simple.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and, together with the description, serve to explain the application.
Fig. 1 is a schematic block flow diagram of the method of the present invention.
Detailed Description
In order that the technical content and advantages of the present invention may be more clearly understood, a further detailed description of the present invention will now be made with reference to the accompanying drawings.
As shown in fig. 1, a voice noise reduction method based on a microphone array combined with a wake-up word includes the following steps:
S1, framing and windowing the single channel of audio data output after echo cancellation (AEC), direction-of-arrival (DOA) estimation and beamforming; a data length of 2-4 ms is chosen as the subframe length L, and x denotes the whole frame of data;
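A minimal sketch of step S1 follows. The sample rate, whole-frame length, hop size and window type are all assumptions for illustration; the patent only fixes the subframe length L at 2-4 ms.

```python
import numpy as np

FS = 16000                  # assumed sample rate (Hz)
L = FS * 4 // 1000          # subframe length: 4 ms -> 64 samples (the text allows 2-4 ms)
FRAME = FS * 32 // 1000     # whole-frame length, assumed 32 ms
HOP = FRAME // 2            # 50% overlap, assumed

def frame_and_window(x: np.ndarray) -> np.ndarray:
    """Split the beamformed channel x into overlapping, windowed whole frames (step S1)."""
    win = np.hanning(FRAME)  # window type is not specified in the patent; Hann is assumed
    n_frames = 1 + (len(x) - FRAME) // HOP
    return np.stack([x[i * HOP : i * HOP + FRAME] * win for i in range(n_frames)])

frames = frame_and_window(np.random.randn(FS))   # one second of dummy audio -> shape (61, 512)
```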
S2, covariance calculation:
S2.1, calculating the circular convolution of the whole frame of data: Cx = xcorr(x, L-1, 'biased');
S2.2, taking the last L values of the convolution result to form a Toeplitz matrix; this matrix is the covariance of the data, where L is the subframe length;
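A hedged Python equivalent of S2, assuming that Matlab's xcorr(x, L-1, 'biased') is the biased autocorrelation up to lag L-1 and that "the last L values" are the non-negative lags 0..L-1:

```python
import numpy as np
from scipy.linalg import toeplitz

def biased_autocorr(x: np.ndarray, maxlag: int) -> np.ndarray:
    """Biased autocorrelation of x for lags 0..maxlag, i.e. xcorr(x, maxlag, 'biased')."""
    n = len(x)
    full = np.correlate(x, x, mode="full") / n   # 'biased' scaling: divide by len(x)
    return full[n - 1 : n + maxlag]              # keep the non-negative lags 0..maxlag

def frame_covariance(x: np.ndarray, L: int) -> np.ndarray:
    """S2: L x L Toeplitz covariance of the frame built from its last L autocorrelation values."""
    r = biased_autocorr(x, L - 1)                # the 'last L' values of the xcorr output
    return toeplitz(r)                           # symmetric Toeplitz matrix with first column r
```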
S3, determining initial values:
The invention computes and maintains the noise covariance and power spectral density of the two stages (non-wake and wake) separately; these are referred to as the non-wake-stage noise covariance and noise power spectral density and the wake-stage noise covariance and noise power spectral density;
a) Initial value of the noise covariance: computed from the first frame of data using the covariance calculation described above;
b) Initial value of the noise power spectral density: obtained by applying a window (a Gabor window) to the result of the circular convolution and performing a DCT operation;
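A sketch of the initialization, reusing biased_autocorr from the sketch after step S2.2. The window is ambiguous in the translation ("Gabor window" here, while a Bartlett window is defined in the terminology section), so the Bartlett window is used purely as an assumption, as is the DCT normalization:

```python
import numpy as np
from scipy.fft import dct
from scipy.linalg import toeplitz

def init_noise_stats(first_frame: np.ndarray, L: int):
    """S3: illustrative initial values of the noise covariance and noise power spectral density."""
    r = biased_autocorr(first_frame, L - 1)          # helper from the sketch after step S2.2
    noise_cov = toeplitz(r)                          # a) covariance of the first frame
    win = np.bartlett(L)                             # assumed window choice (see lead-in)
    noise_psd = dct(r * win, type=2, norm="ortho")   # b) windowed autocorrelation followed by a DCT
    return noise_cov, noise_psd
```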
S4, judging whether the system is in the wake stage:
S4.1, if in the non-wake stage, go to S4.1.1;
S4.1.1, perform VAD on the data;
if the frame is judged to be noise, update the noise covariance matrix by combining the covariance computed from the current frame with the previous noise covariance using a forgetting factor (as sketched below), and update the noise power spectral density;
if the frame is judged to contain speech, do not update the noise covariance matrix or power spectral density, and keep the previously estimated noise;
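A minimal sketch of the S4.1.1 update; the forgetting-factor value is an assumption, and the same smoothing is applied to the covariance and the power spectral density:

```python
ALPHA = 0.95   # forgetting factor, assumed value

def update_noise(noise_cov, noise_psd, frame_cov, frame_psd, vad_says_noise):
    """S4.1.1: blend the current frame's statistics into the noise estimate only on noise frames."""
    if vad_says_noise:
        noise_cov = ALPHA * noise_cov + (1.0 - ALPHA) * frame_cov
        noise_psd = ALPHA * noise_psd + (1.0 - ALPHA) * frame_psd
    # on speech frames the previous noise statistics are kept unchanged
    return noise_cov, noise_psd
```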
S4.1.2, in the non-wake stage, treat the audio data of this stage as the noise of the wake stage: update the wake-stage noise covariance and noise power spectral density and store them; this requires a storage space longer than the wake-up word, used to hold the wake-stage noise covariance and noise power spectrum computed in this step (see the sketch after step S4.2.1);
S4.1.3, compute the covariance of the current frame, subtract the noise covariance from it to obtain the covariance of the speech signal, and go to S5;
S4.2, if in the stage after wake-up while waiting for the recognition result, i.e. the wake stage, go to S4.2.1;
S4.2.1, after wake-up, the wake-up word itself belongs to the desired signal, so the covariance and power spectral density computed over the wake-up word cannot serve as noise statistics; instead, the noise covariance and power spectral density from before the wake-up word are taken out of the wake-stage storage maintained above: step back from the current position of the storage space by the maximum wake-up-word length, take the noise covariance and power spectral density at that position as the noise covariance and power spectrum of this stage, and compute the covariance of the speech signal for this stage;
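A sketch of the storage of S4.1.2 and the read-back of S4.2.1, assuming per-frame statistics are kept in a ring buffer slightly longer than the wake word; the buffer sizes are assumptions:

```python
from collections import deque

WAKE_WORD_FRAMES = 100                                  # assumed maximum wake-word length, in frames
noise_history = deque(maxlen=WAKE_WORD_FRAMES + 20)     # "longer than the wake-up word"

def store_wake_stage_noise(noise_cov, noise_psd):
    """S4.1.2: before wake-up, every frame's statistics are stored as wake-stage noise."""
    noise_history.append((noise_cov, noise_psd))

def fetch_pre_wake_noise():
    """S4.2.1: step back by the wake-word length so the wake word itself is not used as noise."""
    idx = max(0, len(noise_history) - 1 - WAKE_WORD_FRAMES)
    return noise_history[idx]
```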
S5, perform eigenvalue decomposition on the speech covariance obtained in S4.1.3 or S4.2.1, and apply the Laplace transform and the transformation from the frequency domain to the eigenvalue domain;
S6, to remove music noise and other residual non-speech noise, further compute a mask using critical bandwidths and the masking effect, compute weights from the mask and the result of S5, and compute the final noise-reduced data.
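The patent describes S5 and S6 only at this level of detail. The sketch below shows one conventional eigenvalue-domain (subspace) weighting that is consistent with that description; the Wiener-style gain, the small regularizer and the mask_floor input (standing in for the critical-band masking computation, which is not shown) are all assumptions, not the patent's exact formulas.

```python
import numpy as np

def eigen_domain_denoise(noisy_cov, noise_cov, subframe, mask_floor):
    """S5/S6 sketch: Wiener-like gains in the eigenvalue domain, floored by a masking-derived value.

    noisy_cov  -- L x L Toeplitz covariance of the current frame (S2)
    noise_cov  -- current noise covariance estimate (S4)
    subframe   -- one subframe of L samples to be denoised
    mask_floor -- per-eigenchannel gain floor derived from critical-band masking (assumed input)
    """
    speech_cov = noisy_cov - noise_cov                      # S4.1.3 / S4.2.1
    eigvals, eigvecs = np.linalg.eigh(speech_cov)           # transform to the eigenvalue domain
    eigvals = np.maximum(eigvals, 0.0)                      # clip negative estimates
    noise_diag = np.diag(eigvecs.T @ noise_cov @ eigvecs)   # noise power per eigenchannel
    gains = eigvals / (eigvals + noise_diag + 1e-12)        # Wiener-like weights
    gains = np.maximum(gains, mask_floor)                   # masking keeps the weights from over-suppressing
    return eigvecs @ (gains * (eigvecs.T @ subframe))       # weight and transform back
```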
The wake stage lasts from wake-up until the speech recognition result returns, or until a certain time has elapsed after wake-up, this time being the average duration a person needs to finish speaking the sentence to be recognized.
The above description covers only preferred embodiments of the present invention and is not intended to limit it; those skilled in the art can make various modifications and variations to the embodiments of the present invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (7)

1. A voice noise reduction method based on a microphone array combined with a wake-up word, characterized by comprising the following steps:
S1, framing and windowing the single channel of audio data output after echo cancellation (AEC), direction-of-arrival (DOA) estimation and beamforming;
S2, covariance calculation:
S2.1, calculating the circular convolution of the whole frame of data;
S2.2, taking the last L values of the convolution result to form a Toeplitz matrix; this matrix is the covariance of the data, where L is the subframe length;
S3, determining initial values: the processing is divided into a non-wake stage and a wake stage, and initial values are determined separately for the noise covariance and noise power spectral density of the non-wake stage and for those of the wake stage;
S4, judging whether the system is in the wake stage:
S4.1, if in the non-wake stage, going to S4.1.1;
S4.1.1, carrying out voice activity detection on the data;
if the frame is judged to be noise, updating the noise covariance matrix and updating the noise power spectral density;
if the frame is judged to contain speech, not updating the noise covariance matrix or power spectral density, and keeping the previously estimated noise;
S4.1.2, in the non-wake stage, treating the audio data of this stage as the noise of the wake stage: updating the wake-stage noise covariance and noise power spectral density and storing them, which requires a storage space longer than the wake-up word, used to hold the wake-stage noise covariance and noise power spectrum computed in this step;
S4.1.3, computing the covariance of the current frame, subtracting the noise covariance from it to obtain the covariance of the speech signal, and going to S5;
S4.2, if in the stage after wake-up while waiting for the recognition result, i.e. the wake stage, going to S4.2.1;
S4.2.1, after wake-up, stepping back from the current position of the storage space by the maximum wake-up-word length, taking out the noise covariance and power spectral density at that storage position and using them as the noise covariance and power spectrum of this stage, and computing the covariance of the speech signal for this stage;
S5, performing eigenvalue decomposition on the speech covariance obtained in S4.1.3 or S4.2.1, and applying the Laplace transform and the transformation from the frequency domain to the eigenvalue domain;
S6, in order to remove music noise and other residual non-speech noise, further computing a mask using critical bandwidths and the masking effect, computing weights from the mask and the result of S5, and computing the final noise-reduced data.
2. The voice noise reduction method based on a microphone array combined with a wake-up word according to claim 1, wherein in step S1 a data length of 2-4 ms is selected as the subframe length L, and x denotes the whole frame of data.
3. The voice noise reduction method based on a microphone array combined with a wake-up word according to claim 2, wherein the circular convolution of the whole frame of data in step S2 is denoted: Cx = xcorr(x, L-1, 'biased').
4. The voice noise reduction method based on a microphone array combined with a wake-up word according to claim 3, wherein the initial values in step S3 are determined as follows:
a) initial value of the noise covariance: computed from the first frame of data using the covariance calculation of step S2;
b) initial value of the noise power spectral density: obtained by applying a window (a Gabor window) to the result of the circular convolution and performing a DCT operation.
5. The method according to claim 1, wherein in step S4.1.1 the noise covariance is computed from the covariance of the current frame and the previous noise covariance using a forgetting factor.
6. The method according to claim 1, wherein in step S4.2.1, because the wake-up word belongs to the desired signal, the covariance and power spectral density computed from the wake-up word cannot be used as the noise covariance and power spectral density, and the noise covariance and power spectral density from before the wake-up word are therefore taken out of the wake-stage noise covariance and power spectrum storage maintained from step S3 above.
7. The voice noise reduction method based on a microphone array combined with a wake-up word according to claim 1, wherein the wake stage lasts from wake-up until a speech recognition result returns or until a certain time has elapsed after wake-up, the certain time being the average duration a person needs to finish speaking the sentence to be recognized.
Application CN202011061741.2A, priority date 2020-09-30, filing date 2020-09-30: Voice noise reduction method based on combination of microphone array and wake-up word. Status: Active. Granted as CN114333884B (en).

Priority Applications (1)

Application number: CN202011061741.2A / Priority date: 2020-09-30 / Filing date: 2020-09-30 / Title: Voice noise reduction method based on combination of microphone array and wake-up word

Applications Claiming Priority (1)

Application number: CN202011061741.2A / Priority date: 2020-09-30 / Filing date: 2020-09-30 / Title: Voice noise reduction method based on combination of microphone array and wake-up word
Publications (2)

Publication number / Publication date
CN114333884A (en) / 2022-04-12
CN114333884B (en) / 2024-05-03

Family

Family ID: 81010630

Family Applications (1)

Application number: CN202011061741.2A / Title: Voice noise reduction method based on combination of microphone array and wake-up word / Priority date: 2020-09-30 / Filing date: 2020-09-30 / Status: Active

Country Status (1)

Country: CN / Publication: CN114333884B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108122563A (en) * 2017-12-19 2018-06-05 北京声智科技有限公司 Improve voice wake-up rate and the method for correcting DOA
CN108538305A (en) * 2018-04-20 2018-09-14 百度在线网络技术(北京)有限公司 Audio recognition method, device, equipment and computer readable storage medium
CN109949810A (en) * 2019-03-28 2019-06-28 华为技术有限公司 A kind of voice awakening method, device, equipment and medium
US10667045B1 (en) * 2018-12-28 2020-05-26 Ubtech Robotics Corp Ltd Robot and auto data processing method thereof


Also Published As

Publication number Publication date
CN114333884A (en) 2022-04-12


Legal Events

Code / Description
PB01 / Publication
SE01 / Entry into force of request for substantive examination
GR01 / Patent grant