CN114333884B - Voice noise reduction method based on combination of microphone array and wake-up word

Info

Publication number: CN114333884B
Application number: CN202011061741.2A
Authority: CN (China)
Prior art keywords: noise, covariance, wake, voice, stage
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN114333884A
Inventors: 孙静新, 邱东升
Assignee: Beijing Ingenic Semiconductor Co Ltd
Priority and filing date: 2020-09-30
Publication of CN114333884A: 2022-04-12
Application granted; publication of CN114333884B: 2024-05-03


Classifications

    • Y: General tagging of new technological developments; general tagging of cross-sectional technologies spanning over several sections of the IPC; technical subjects covered by former USPC cross-reference art collections [XRACs] and digests
    • Y02: Technologies or applications for mitigation or adaptation against climate change
    • Y02D: Climate change mitigation technologies in information and communication technologies [ICT], i.e. information and communication technologies aiming at the reduction of their own energy use
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/70: Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Circuit For Audible Band Transducer
  • Soundproofing, Sound Blocking, And Sound Damping

Abstract

The invention provides a voice noise reduction method based on a microphone array combined with a wake-up word. On top of the echo cancellation, DOA estimation and beamforming applied to the multi-channel audio received by the microphone array, a post-filtering noise reduction operation is added: noise estimation is combined with the position of the voice wake-up word, and after voice wake-up, speech noise and music noise other than the desired speech are further suppressed, which improves the processing capability of the whole microphone-array-based voice front end. The post-filtering is divided into two stages according to the wake-up state of the wake-up word: the stage before wake-up (the non-wake stage), and the stage after wake-up (the wake stage), which lasts until the speech recognition result returns or until a certain time has elapsed, this time being, for example, the average duration a person needs to finish speaking the sentence to be recognized. Different noise estimates are used in the two stages, and the noise reduction uses a masking effect, so that human-voice noise and music noise are suppressed during the recognition stage after wake-up.

Description

Voice noise reduction method based on combination of microphone array and wake-up word
Technical Field
The invention relates to the technical field of audio processing, in particular to a voice noise reduction method based on combination of a microphone array and wake-up words.
Background
With the continuous development of artificial intelligence and speech recognition, voice wake-up and recognition appear more and more in daily life, for example in smart speakers and in-vehicle voice systems, and the application scenarios are increasingly diverse. Ambient noise and the sound emitted by the device itself are unavoidable in these applications and degrade speech recognition, so the speech provided to the recognition system must first be processed; this is speech front-end processing.
Speech front-end processing mainly uses a microphone array to pick up sound and applies a series of operations to the captured multi-channel signals, such as echo cancellation, sound source localization, beamforming and noise reduction, so that the speech signal from the desired direction is enhanced while noise from undesired directions is suppressed, thereby improving speech recognition.
Noise reduction and noise estimation in current microphone-array-based speech front ends are mainly aimed at stationary environmental noise, such as kitchen noise (microwave ovens, range hoods and the like) and white noise. Noise estimation mainly relies on VAD: frames in which no speech is detected are treated as noise, and the corresponding noise reduction processing then suppresses that noise.
In a multi-speaker scene, for example when a family is chatting or songs are being played, after the device is woken up by the wake-up word, DOA is used for localization and a beam is formed; speech from directions other than the woken-up direction of the desired signal is not well suppressed by beamforming, and it cannot be suppressed further in the subsequent noise reduction, so it reaches the speech recognition system and degrades the recognition result.
Existing microphone-array-based speech front-end processing therefore has no obvious effect in suppressing human voices and songs that are not the desired speech signal, yet their impact on speech recognition is far greater than that of stationary environmental noise.
Furthermore, technical terms commonly used in the art include:
microphone array: a system consisting of a number of acoustic sensors (typically microphones) for sampling and processing the spatial characteristics of the sound field.
Echo cancellation: (Acoustic Echo Cancellation, AEC) for canceling the sound emitted by the device itself.
Direction of arrival: (Direction of Arrival, DOA) used to determine the direction of a sound source.
Beamforming: (Beamforming) forming a main beam in a particular direction to receive the useful desired signal while forming ultra-low sidelobes to cancel noise and interference signals.
Linear constraint minimum variance: (Linearly Constrained Minimum Variance, LCMV) a beamforming algorithm.
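For reference, under the usual narrowband array model the LCMV weights take the standard closed form below; this is textbook background, not a formula taken from the patent:

$$\mathbf{w}_{\mathrm{LCMV}} = \mathbf{R}^{-1}\mathbf{C}\left(\mathbf{C}^{H}\mathbf{R}^{-1}\mathbf{C}\right)^{-1}\mathbf{f},$$

where $\mathbf{R}$ is the covariance matrix of the array inputs, $\mathbf{C}$ the constraint matrix, and $\mathbf{f}$ the desired response vector.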
Noise reduction: suppressing sound from sources other than the desired speech signal.
Wake-up word: a keyword used to wake the device by voice, such as "Xiaodu" (小度) or "Xiao Ai" (小爱).
Toeplitz matrix: also called a T-shaped matrix for short. The elements on the main diagonal of a Toeplitz matrix are equal, and the elements on each line parallel to the main diagonal are also equal; the elements are symmetric about the anti-diagonal, i.e. a T-shaped matrix is persymmetric. Simple T-shaped matrices include the forward shift matrix and the backward shift matrix. In Matlab, the function that generates a Toeplitz matrix is toeplitz(x, y): it builds a Toeplitz matrix with x as the first column and y as the first row, where x and y are vectors that need not be of equal length.
Let $T = [t_{ij}] \in \mathbb{C}^{n \times n}$. If $t_{ij} = t_{j-i}$ for $i, j = 1, 2, \dots, n$, i.e.

$$T = \begin{pmatrix} t_0 & t_1 & \cdots & t_{n-1} \\ t_{-1} & t_0 & \cdots & t_{n-2} \\ \vdots & \vdots & \ddots & \vdots \\ t_{-(n-1)} & t_{-(n-2)} & \cdots & t_0 \end{pmatrix},$$

then $T$ is called a Toeplitz matrix.
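A small sketch of the construction in Python; scipy.linalg.toeplitz(c, r) plays the role of Matlab's toeplitz(x, y), with c as the first column and r as the first row:

```python
import numpy as np
from scipy.linalg import toeplitz

c = np.array([1, 2, 3, 4])        # first column
r = np.array([1, 7, 8, 9, 10])    # first row (its first element is taken from c)
T = toeplitz(c, r)
print(T)
# [[ 1  7  8  9 10]
#  [ 2  1  7  8  9]
#  [ 3  2  1  7  8]
#  [ 4  3  2  1  7]]
```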
Discrete cosine transform: (Discrete Cosine Transform, DCT) a mathematical operation closely related to the Fourier transform. In a Fourier series expansion, if the expanded function is a real even function, the series contains only cosine terms; discretizing those cosine terms yields the discrete cosine transform.
Bartlett window: (Bartlett window) a triangular window whose coefficients rise linearly from zero to one at the center and fall back to zero; a common form is $w(n) = 1 - \left|\frac{2n}{N-1} - 1\right|$ for $n = 0, \dots, N-1$.
Voice activity detection: (Voice Activity Detection, VAD) used to detect whether the current audio signal contains speech, i.e. to judge the input signal, distinguish the speech signal from the various background noise signals, and apply different processing to the two kinds of signal.
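A deliberately minimal, energy-based VAD sketch; the patent does not specify which VAD algorithm it uses, and the threshold ratio here is an assumption:

```python
import numpy as np

def simple_energy_vad(frame: np.ndarray, noise_floor: float, ratio: float = 3.0) -> bool:
    """Return True if the frame likely contains speech (illustrative only)."""
    frame_energy = float(np.mean(frame ** 2))
    return frame_energy > ratio * noise_floor
```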
Laplace transform: an integral transform commonly used in engineering mathematics. The Laplace transform is a linear transform that converts a function of a real variable $t$ ($t \geq 0$) into a function of a complex variable $s$. Written symbolically:

$$F(s) = \int_0^{\infty} f(t)\, e^{-st}\, dt.$$

This is the Laplace transform: given a function of $t$, it produces a function of $s$.
Disclosure of Invention
In order to solve the above problems in the prior art, the object of the present invention is to provide a wake-up-word-based noise suppression method for scenarios with speech-like noise, such as several people chatting or songs being played, in which microphone-array-based speech front-end processing suppresses that noise poorly.
On top of the echo cancellation, DOA estimation and beamforming applied to the multi-channel audio received by the microphone array, the method adds a post-filtering noise reduction step: noise estimation is combined with the position of the voice wake-up word, and after voice wake-up, speech noise and music noise other than the desired speech are further suppressed, improving the processing capability of the whole microphone-array-based voice front end.
After the microphone array collects the multi-channel audio and the data are pre-processed (pre-emphasis, framing, windowing and so on), echo cancellation is performed; then, with the speech angle determined by DOA and the desired speech signal in the target direction enhanced by beamforming, which also gives a preliminary suppression of audio from other angles, the beamforming output is post-processed with noise estimation and noise reduction that take the voice wake-up word into account.
The post-filtering is divided into two stages according to the wake-up state of the wake-up word: the stage before wake-up, and the stage after wake-up, which lasts until the speech recognition result returns or until a certain time has elapsed, this time being, for example, the average duration a person needs to finish speaking the sentence to be recognized. Below, the first is called the non-wake stage and the second the wake stage. Different noise estimates are used in the two stages, and the noise reduction uses a masking effect to suppress music noise to a certain extent, so that human-voice noise and music noise are suppressed during the recognition stage after wake-up.
Specifically, the invention provides a voice noise reduction method based on a microphone array combined with wake-up words, which comprises the following steps:
S1, framing and windowing the single channel of audio data output after echo cancellation (AEC), direction-of-arrival (DOA) estimation and beamforming;
S2, covariance calculation:
S2.1, calculating the circular convolution of the whole frame of data;
S2.2, taking the last L values of the convolution result to form a Toeplitz matrix; this matrix is the covariance of the data, where L is the subframe length;
S3, determining initial values: the processing is divided into a non-wake stage and a wake stage, and initial values are determined separately for the noise covariance and noise power spectral density of the non-wake stage and for those of the wake stage;
S4, judging whether the system is in the wake stage:
S4.1, if in the non-wake stage, go to S4.1.1;
S4.1.1, perform VAD on the data;
if the frame is judged to be noise, update the noise covariance matrix and the noise power spectral density;
if the frame is judged to contain speech, do not update the noise covariance matrix or power spectral density, and keep the previously estimated noise;
S4.1.2, in the non-wake stage, treat the audio data of this stage as the noise of the wake stage: update the wake-stage noise covariance and noise power spectral density and store them; this requires a storage space longer than the wake-up word, used to hold the wake-stage noise covariance and noise power spectrum computed in this step;
S4.1.3, compute the covariance of the current frame, subtract the noise covariance from it to obtain the covariance of the speech signal, and go to S5;
S4.2, if in the stage after wake-up while waiting for the recognition result, i.e. the wake stage, go to S4.2.1;
S4.2.1, after wake-up, step back from the current position of the storage space by the maximum wake-up-word length, take out the noise covariance and power spectral density at that storage position and use them as the noise covariance and power spectrum of this stage, and compute the covariance of the speech signal for this stage;
S5, perform eigenvalue decomposition on the speech covariance obtained in S4.1.3 or S4.2.1, and apply the Laplace transform and the transformation from the frequency domain to the eigenvalue domain;
S6, to remove music noise and other residual non-speech noise, further compute a mask using critical bandwidths and the masking effect, compute weights from the mask and the result of S5, and compute the final noise-reduced data.
In summary, the advantages of applying the method of the application are as follows: it effectively improves the processing capability of the whole microphone-array-based voice front end; it improves the speech recognition result; and the method is simple.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and, together with the description, serve to explain the application.
Fig. 1 is a schematic block flow diagram of the method of the present invention.
Detailed Description
In order that the technical content and advantages of the present invention may be more clearly understood, a further detailed description of the present invention will now be made with reference to the accompanying drawings.
As shown in fig. 1, a voice noise reduction method based on a microphone array combined with a wake-up word includes the following steps:
S1, framing and windowing the single channel of audio data output after echo cancellation (AEC), direction-of-arrival (DOA) estimation and beamforming; a data length of 2-4 ms is chosen as the subframe length L, and x denotes the whole frame of data;
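A minimal sketch of step S1 follows. The sample rate, whole-frame length, hop size and window type are all assumptions for illustration; the patent only fixes the subframe length L at 2-4 ms.

```python
import numpy as np

FS = 16000                  # assumed sample rate (Hz)
L = FS * 4 // 1000          # subframe length: 4 ms -> 64 samples (the text allows 2-4 ms)
FRAME = FS * 32 // 1000     # whole-frame length, assumed 32 ms
HOP = FRAME // 2            # 50% overlap, assumed

def frame_and_window(x: np.ndarray) -> np.ndarray:
    """Split the beamformed channel x into overlapping, windowed whole frames (step S1)."""
    win = np.hanning(FRAME)  # window type is not specified in the patent; Hann is assumed
    n_frames = 1 + (len(x) - FRAME) // HOP
    return np.stack([x[i * HOP : i * HOP + FRAME] * win for i in range(n_frames)])

frames = frame_and_window(np.random.randn(FS))   # one second of dummy audio -> shape (61, 512)
```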
S2, covariance calculation:
S2.1, calculating the circular convolution of the whole frame of data: Cx = xcorr(x, L-1, 'biased');
S2.2, taking the last L values of the convolution result to form a Toeplitz matrix; this matrix is the covariance of the data, where L is the subframe length;
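A hedged Python equivalent of S2, assuming that Matlab's xcorr(x, L-1, 'biased') is the biased autocorrelation up to lag L-1 and that "the last L values" are the non-negative lags 0..L-1:

```python
import numpy as np
from scipy.linalg import toeplitz

def biased_autocorr(x: np.ndarray, maxlag: int) -> np.ndarray:
    """Biased autocorrelation of x for lags 0..maxlag, i.e. xcorr(x, maxlag, 'biased')."""
    n = len(x)
    full = np.correlate(x, x, mode="full") / n   # 'biased' scaling: divide by len(x)
    return full[n - 1 : n + maxlag]              # keep the non-negative lags 0..maxlag

def frame_covariance(x: np.ndarray, L: int) -> np.ndarray:
    """S2: L x L Toeplitz covariance of the frame built from its last L autocorrelation values."""
    r = biased_autocorr(x, L - 1)                # the 'last L' values of the xcorr output
    return toeplitz(r)                           # symmetric Toeplitz matrix with first column r
```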
S3, determining initial values:
The invention computes and maintains the noise covariance and power spectral density of the two stages (non-wake and wake) separately; these are referred to as the non-wake-stage noise covariance and noise power spectral density and the wake-stage noise covariance and noise power spectral density;
a) Initial value of the noise covariance: computed from the first frame of data using the covariance calculation described above;
b) Initial value of the noise power spectral density: obtained by applying a window (a Gabor window) to the result of the circular convolution and performing a DCT operation;
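A sketch of the initialization, reusing biased_autocorr from the sketch after step S2.2. The window is ambiguous in the translation ("Gabor window" here, while a Bartlett window is defined in the terminology section), so the Bartlett window is used purely as an assumption, as is the DCT normalization:

```python
import numpy as np
from scipy.fft import dct
from scipy.linalg import toeplitz

def init_noise_stats(first_frame: np.ndarray, L: int):
    """S3: illustrative initial values of the noise covariance and noise power spectral density."""
    r = biased_autocorr(first_frame, L - 1)          # helper from the sketch after step S2.2
    noise_cov = toeplitz(r)                          # a) covariance of the first frame
    win = np.bartlett(L)                             # assumed window choice (see lead-in)
    noise_psd = dct(r * win, type=2, norm="ortho")   # b) windowed autocorrelation followed by a DCT
    return noise_cov, noise_psd
```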
S4, judging whether the system is in the wake stage:
S4.1, if in the non-wake stage, go to S4.1.1;
S4.1.1, perform VAD on the data;
if the frame is judged to be noise, update the noise covariance matrix by combining the covariance computed from the current frame with the previous noise covariance using a forgetting factor (as sketched below), and update the noise power spectral density;
if the frame is judged to contain speech, do not update the noise covariance matrix or power spectral density, and keep the previously estimated noise;
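A minimal sketch of the S4.1.1 update; the forgetting-factor value is an assumption, and the same smoothing is applied to the covariance and the power spectral density:

```python
ALPHA = 0.95   # forgetting factor, assumed value

def update_noise(noise_cov, noise_psd, frame_cov, frame_psd, vad_says_noise):
    """S4.1.1: blend the current frame's statistics into the noise estimate only on noise frames."""
    if vad_says_noise:
        noise_cov = ALPHA * noise_cov + (1.0 - ALPHA) * frame_cov
        noise_psd = ALPHA * noise_psd + (1.0 - ALPHA) * frame_psd
    # on speech frames the previous noise statistics are kept unchanged
    return noise_cov, noise_psd
```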
S4.1.2, in the non-wake stage, treat the audio data of this stage as the noise of the wake stage: update the wake-stage noise covariance and noise power spectral density and store them; this requires a storage space longer than the wake-up word, used to hold the wake-stage noise covariance and noise power spectrum computed in this step (see the sketch after step S4.2.1);
S4.1.3, compute the covariance of the current frame, subtract the noise covariance from it to obtain the covariance of the speech signal, and go to S5;
S4.2, if in the stage after wake-up while waiting for the recognition result, i.e. the wake stage, go to S4.2.1;
S4.2.1, after wake-up, the wake-up word itself belongs to the desired signal, so the covariance and power spectral density computed over the wake-up word cannot serve as noise statistics; instead, the noise covariance and power spectral density from before the wake-up word are taken out of the wake-stage storage maintained above: step back from the current position of the storage space by the maximum wake-up-word length, take the noise covariance and power spectral density at that position as the noise covariance and power spectrum of this stage, and compute the covariance of the speech signal for this stage;
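A sketch of the storage of S4.1.2 and the read-back of S4.2.1, assuming per-frame statistics are kept in a ring buffer slightly longer than the wake word; the buffer sizes are assumptions:

```python
from collections import deque

WAKE_WORD_FRAMES = 100                                  # assumed maximum wake-word length, in frames
noise_history = deque(maxlen=WAKE_WORD_FRAMES + 20)     # "longer than the wake-up word"

def store_wake_stage_noise(noise_cov, noise_psd):
    """S4.1.2: before wake-up, every frame's statistics are stored as wake-stage noise."""
    noise_history.append((noise_cov, noise_psd))

def fetch_pre_wake_noise():
    """S4.2.1: step back by the wake-word length so the wake word itself is not used as noise."""
    idx = max(0, len(noise_history) - 1 - WAKE_WORD_FRAMES)
    return noise_history[idx]
```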
S5, perform eigenvalue decomposition on the speech covariance obtained in S4.1.3 or S4.2.1, and apply the Laplace transform and the transformation from the frequency domain to the eigenvalue domain;
S6, to remove music noise and other residual non-speech noise, further compute a mask using critical bandwidths and the masking effect, compute weights from the mask and the result of S5, and compute the final noise-reduced data.
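The patent describes S5 and S6 only at this level of detail. The sketch below shows one conventional eigenvalue-domain (subspace) weighting that is consistent with that description; the Wiener-style gain, the small regularizer and the mask_floor input (standing in for the critical-band masking computation, which is not shown) are all assumptions, not the patent's exact formulas.

```python
import numpy as np

def eigen_domain_denoise(noisy_cov, noise_cov, subframe, mask_floor):
    """S5/S6 sketch: Wiener-like gains in the eigenvalue domain, floored by a masking-derived value.

    noisy_cov  -- L x L Toeplitz covariance of the current frame (S2)
    noise_cov  -- current noise covariance estimate (S4)
    subframe   -- one subframe of L samples to be denoised
    mask_floor -- per-eigenchannel gain floor derived from critical-band masking (assumed input)
    """
    speech_cov = noisy_cov - noise_cov                      # S4.1.3 / S4.2.1
    eigvals, eigvecs = np.linalg.eigh(speech_cov)           # transform to the eigenvalue domain
    eigvals = np.maximum(eigvals, 0.0)                      # clip negative estimates
    noise_diag = np.diag(eigvecs.T @ noise_cov @ eigvecs)   # noise power per eigenchannel
    gains = eigvals / (eigvals + noise_diag + 1e-12)        # Wiener-like weights
    gains = np.maximum(gains, mask_floor)                   # masking keeps the weights from over-suppressing
    return eigvecs @ (gains * (eigvecs.T @ subframe))       # weight and transform back
```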
The wake stage lasts from wake-up until the speech recognition result returns, or until a certain time has elapsed after wake-up, this time being the average duration a person needs to finish speaking the sentence to be recognized.
The above description covers only preferred embodiments of the present invention and is not intended to limit it; those skilled in the art can make various modifications and variations to the embodiments of the present invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (7)

1. A voice noise reduction method based on a microphone array combined with a wake-up word, characterized by comprising the following steps:
S1, framing and windowing the single channel of audio data output after echo cancellation (AEC), direction-of-arrival (DOA) estimation and beamforming;
S2, covariance calculation:
S2.1, calculating the circular convolution of the whole frame of data;
S2.2, taking the last L values of the convolution result to form a Toeplitz matrix; this matrix is the covariance of the data, where L is the subframe length;
S3, determining initial values: the processing is divided into a non-wake stage and a wake stage, and initial values are determined separately for the noise covariance and noise power spectral density of the non-wake stage and for those of the wake stage;
S4, judging whether the system is in the wake stage:
S4.1, if in the non-wake stage, going to S4.1.1;
S4.1.1, carrying out voice activity detection on the data;
if the frame is judged to be noise, updating the noise covariance matrix and updating the noise power spectral density;
if the frame is judged to contain speech, not updating the noise covariance matrix or power spectral density, and keeping the previously estimated noise;
S4.1.2, in the non-wake stage, treating the audio data of this stage as the noise of the wake stage: updating the wake-stage noise covariance and noise power spectral density and storing them, which requires a storage space longer than the wake-up word, used to hold the wake-stage noise covariance and noise power spectrum computed in this step;
S4.1.3, computing the covariance of the current frame, subtracting the noise covariance from it to obtain the covariance of the speech signal, and going to S5;
S4.2, if in the stage after wake-up while waiting for the recognition result, i.e. the wake stage, going to S4.2.1;
S4.2.1, after wake-up, stepping back from the current position of the storage space by the maximum wake-up-word length, taking out the noise covariance and power spectral density at that storage position and using them as the noise covariance and power spectrum of this stage, and computing the covariance of the speech signal for this stage;
S5, performing eigenvalue decomposition on the speech covariance obtained in S4.1.3 or S4.2.1, and applying the Laplace transform and the transformation from the frequency domain to the eigenvalue domain;
S6, in order to remove music noise and other residual non-speech noise, further computing a mask using critical bandwidths and the masking effect, computing weights from the mask and the result of S5, and computing the final noise-reduced data.
2. The voice noise reduction method based on a microphone array combined with a wake-up word according to claim 1, wherein in step S1 a data length of 2-4 ms is selected as the subframe length L, and x denotes the whole frame of data.
3. The voice noise reduction method based on a microphone array combined with a wake-up word according to claim 2, wherein the circular convolution of the whole frame of data in step S2 is denoted: Cx = xcorr(x, L-1, 'biased').
4. The voice noise reduction method based on a microphone array combined with a wake-up word according to claim 3, wherein the initial values in step S3 are determined as follows:
a) initial value of the noise covariance: computed from the first frame of data using the covariance calculation of step S2;
b) initial value of the noise power spectral density: obtained by applying a window (a Gabor window) to the result of the circular convolution and performing a DCT operation.
5. The method according to claim 1, wherein in step S4.1.1 the noise covariance is computed from the covariance of the current frame and the previous noise covariance using a forgetting factor.
6. The method according to claim 1, wherein in step S4.2.1, because the wake-up word belongs to the desired signal, the covariance and power spectral density computed from the wake-up word cannot be used as the noise covariance and power spectral density, and the noise covariance and power spectral density from before the wake-up word are therefore taken out of the wake-stage noise covariance and power spectrum storage maintained from step S3 above.
7. The voice noise reduction method based on a microphone array combined with a wake-up word according to claim 1, wherein the wake stage lasts from wake-up until a speech recognition result returns or until a certain time has elapsed after wake-up, the certain time being the average duration a person needs to finish speaking the sentence to be recognized.
Application CN202011061741.2A, priority date 2020-09-30, filing date 2020-09-30: Voice noise reduction method based on combination of microphone array and wake-up word. Status: Active. Granted as CN114333884B (en).

Priority Applications (1)

Application number: CN202011061741.2A / Priority date: 2020-09-30 / Filing date: 2020-09-30 / Title: Voice noise reduction method based on combination of microphone array and wake-up word

Applications Claiming Priority (1)

Application number: CN202011061741.2A / Priority date: 2020-09-30 / Filing date: 2020-09-30 / Title: Voice noise reduction method based on combination of microphone array and wake-up word
Publications (2)

Publication number / Publication date
CN114333884A (en) / 2022-04-12
CN114333884B (en) / 2024-05-03

Family

Family ID: 81010630

Family Applications (1)

Application number: CN202011061741.2A / Title: Voice noise reduction method based on combination of microphone array and wake-up word / Priority date: 2020-09-30 / Filing date: 2020-09-30 / Status: Active

Country Status (1)

Country: CN / Publication: CN114333884B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108122563A (en) * 2017-12-19 2018-06-05 北京声智科技有限公司 Improve voice wake-up rate and the method for correcting DOA
CN108538305A (en) * 2018-04-20 2018-09-14 百度在线网络技术(北京)有限公司 Audio recognition method, device, equipment and computer readable storage medium
CN109949810A (en) * 2019-03-28 2019-06-28 华为技术有限公司 A kind of voice awakening method, device, equipment and medium
US10667045B1 (en) * 2018-12-28 2020-05-26 Ubtech Robotics Corp Ltd Robot and auto data processing method thereof


Also Published As

Publication number Publication date
CN114333884A (en) 2022-04-12


Legal Events

Code / Description
PB01 / Publication
SE01 / Entry into force of request for substantive examination
GR01 / Patent grant