CN113314135A

CN113314135A - Sound signal identification method and device

Info

Publication number: CN113314135A
Application number: CN202110572969.6A
Authority: CN
Inventors: 何琪琪; 侯海宁
Original assignee: Beijing Xiaomi Mobile Software Co Ltd; Beijing Xiaomi Pinecone Electronic Co Ltd
Current assignee: Beijing Xiaomi Mobile Software Co Ltd; Beijing Xiaomi Pinecone Electronic Co Ltd
Priority date: 2021-05-25
Filing date: 2021-05-25
Publication date: 2021-08-27
Anticipated expiration: 2041-05-25
Also published as: CN113314135B

Abstract

The disclosure relates to a method and a device for recognizing a sound signal. The intelligent voice interaction technology is related to, and the problems of low sound source positioning accuracy and poor voice recognition quality under the scene of strong interference and low signal-to-noise ratio are solved. The method comprises the following steps: acquiring original observation data acquired by at least two acquisition points for at least two sound sources respectively; performing first-stage noise reduction processing on the original observation data to obtain posterior domain estimation data; obtaining signal components of each sound source according to the posterior domain estimation data; respectively carrying out secondary noise reduction processing on the signal components of the sound sources to obtain beam forming estimation signals; and obtaining time domain sound source signals of the sound sources according to the beam forming estimation signals. The technical scheme provided by the disclosure is suitable for an intelligent voice system, and realizes high-efficiency and high-interference-resistance voice signal identification.

Description

Sound signal identification method and device

Technical Field

The present disclosure relates to intelligent voice interaction technologies, and in particular, to a method and an apparatus for recognizing a voice signal.

Background

In the times of Internet of things and AI, intelligent voice is taken as one of artificial intelligence core technologies, so that the man-machine interaction mode is enriched, and the use convenience of intelligent products is greatly improved.

The intelligent product equipment mainly adopts a microphone array formed by a plurality of microphones for pickup, and applies a microphone beam forming technology or a blind source separation technology to inhibit environmental interference and improve the processing quality of voice signals so as to improve the voice recognition rate in a real environment.

Microphone beam forming technology needs to estimate sound source direction, and in addition, in order to endow stronger intelligence and perception, the intelligent device can be equipped with the pilot lamp in general, when interacting with the user with the pilot lamp accuracy point to the user rather than the interference, lets the user feel in the face-to-face conversation with intelligent device, strengthens user's interactive experience. Based on this, in an environment where there are interfering sound sources, it is important to accurately estimate the direction of the user (i.e., the sound source).

Generally, a sound source direction-finding algorithm directly uses data acquired by a microphone, and performs direction-finding estimation by using an algorithm such as a Phase-transformed weighted sound source-Phase Transform (SRP-PHAT) with controllable Response Power. However, the algorithm depends on the signal-to-noise ratio of the signal, the accuracy is not high enough under the condition of low signal-to-noise ratio, the direction of an interference sound source is easy to be found, an effective sound source cannot be accurately positioned, and the accuracy of voice recognition is further influenced.

Disclosure of Invention

To overcome the problems in the related art, the present disclosure provides a method and an apparatus for recognizing a voice signal. After posterior domain data is obtained through noise reduction, signal components are formed and further noise reduction is carried out through beam forming, and high-quality voice recognition is achieved.

According to a first aspect of the embodiments of the present disclosure, there is provided a sound signal identification method, including:

acquiring original observation data acquired by at least two acquisition points for at least two sound sources respectively;

performing first-stage noise reduction processing on the original observation data to obtain posterior domain estimation data;

obtaining signal components of each sound source according to the posterior domain estimation data;

respectively carrying out secondary noise reduction processing on the signal components of the sound sources to obtain beam forming estimation signals;

and obtaining time domain sound source signals of the sound sources according to the beam forming estimation signals.

Further, the step of performing a first-stage noise reduction process on the original observation data to obtain posterior domain estimation data includes:

initializing a separation matrix of each frequency point and a weighted covariance matrix of each sound source at each frequency point, wherein the number of rows and columns of the separation matrix is the number of the sound sources;

obtaining time domain signals at each acquisition point, and constructing an observation signal matrix according to frequency domain signals corresponding to the time domain signals;

solving prior frequency domain estimation of each sound source of the current frame according to the separation matrix of the previous frame and the observation signal matrix;

updating the weighted covariance matrix according to the prior frequency domain estimate;

updating the separation matrix according to the updated weighted covariance matrix;

correcting the updated separation matrix;

and separating the original observation data according to the corrected separation matrix, and obtaining the posterior domain estimation data of each sound source through separation.

Further, the step of obtaining the prior frequency domain estimation of each sound source of the current frame according to the separation matrix of the previous frame and the observation signal matrix comprises:

and separating the observation signal matrix according to the separation matrix of the previous frame to obtain the prior frequency domain estimation of each sound source of the current frame.

Further, the step of updating the weighted covariance matrix based on the a priori frequency domain estimates comprises:

and updating the weighted covariance matrix according to the observation signal matrix and the conjugate transpose matrix of the observation signal matrix.

Further, the step of updating the separation matrix according to the updated weighted covariance matrix includes:

calculating the eigenvectors of each sound source according to the weighted covariance matrix of each sound source;

respectively updating the separation matrix of each sound source according to the characteristic vector;

and updating the separation matrix to be a conjugate transpose matrix after the separation matrixes of the sound sources are combined.

Further, the step of correcting the updated separation matrix includes:

and according to the diagonal matrix of the inverse matrix of the separation matrix, carrying out spectrum amplitude deblurring processing on the separation matrix so as to correct the spectrum distortion of the separation matrix.

Further, the step of obtaining the signal component of each sound source according to the a posteriori domain estimation data comprises:

and obtaining signal components of each sound source in each channel of the beam forming filter according to the posterior domain estimation data, wherein the signal components comprise an expected signal and a residual noise signal.

Further, the step of performing a second-stage noise reduction processing on the signal components of each sound source to obtain a beamforming estimation signal includes:

determining input signals of the sound sources according to the signal components of the sound sources;

respectively carrying out the processing of the steps one to two on the input signals of the sound sources frame by frame:

step one, detecting whether the current frame is a noise frame or a non-noise frame,

step two, under the condition that the current frame is a noise frame, estimating the autocorrelation matrix of the expected signal of the current frame and the autocorrelation matrix of the residual noise signal according to the autocorrelation matrix of the expected signal of the previous frame and the autocorrelation matrix of the residual noise signal,

estimating an autocorrelation matrix of the desired signal of a current frame according to an autocorrelation matrix of the desired signal of a previous frame in the case that the current frame is a non-noise frame;

obtaining a weight vector of each sound source according to the final autocorrelation matrix of the expected signal and the autocorrelation matrix of the residual noise signal of each sound source;

and obtaining a beam forming estimation signal of each sound source according to the conjugate transpose matrix of the input signal of each sound source and the weight vector.

Further, the step of obtaining the time-domain sound source signal of each sound source according to the beamforming estimation signal includes:

and performing short-time Fourier inverse transformation on the beam forming estimation signals of each sound source, and overlapping and adding to obtain time-domain sound source signals of each sound source.

According to a second aspect of the embodiments of the present disclosure, there is provided a sound signal identification method, including:

the original data acquisition module is used for acquiring original observation data acquired by at least two acquisition points for at least two sound sources respectively;

the first noise reduction module is used for carrying out first-stage noise reduction processing on the original observation data to obtain posterior domain estimation data;

the signal mapping module is used for obtaining the signal components of each sound source according to the posterior domain estimation data;

the second noise reduction module is used for respectively carrying out secondary noise reduction processing on the signal components of the sound sources to obtain beam forming estimation signals;

and the enhanced signal output module is used for obtaining the time domain sound source signal of each sound source according to the beam forming estimation signal.

Further, the first noise reduction module includes:

the matrix initialization submodule is used for initializing a separation matrix of each frequency point and a weighted covariance matrix of each sound source at each frequency point, and the number of rows and the number of columns of the separation matrix are the number of the sound sources;

the observation signal matrix construction submodule is used for solving time domain signals at each acquisition point and constructing an observation signal matrix according to the frequency domain signals corresponding to the time domain signals;

the prior frequency domain solving submodule is used for solving prior frequency domain estimation of each sound source of the current frame according to the separation matrix of the previous frame and the observation signal matrix;

a covariance matrix update submodule for updating the weighted covariance matrix according to the prior frequency domain estimate;

a separation matrix updating submodule for updating the separation matrix according to the updated weighted covariance matrix;

a correction submodule for correcting the updated separation matrix;

and the posterior domain solving submodule is used for separating the original observation data according to the corrected separation matrix and obtaining posterior domain estimation data of each sound source through separation.

Further, the prior frequency domain obtaining submodule is configured to separate the observed signal matrix according to the separation matrix of the previous frame, so as to obtain a prior frequency domain estimate of each sound source of the current frame.

Further, the covariance matrix update sub-module is configured to update the weighted covariance matrix according to the observed signal matrix and a conjugate transpose matrix of the observed signal matrix.

Further, the separation matrix update sub-module includes:

the eigenvalue operator module is used for calculating the eigenvector of each sound source according to the weighted covariance matrix of each sound source;

the first updating submodule is used for respectively updating the separation matrix of each sound source according to the characteristic vector;

and the second updating submodule is used for updating the separation matrix into a conjugate transpose matrix after the separation matrixes of the sound sources are combined.

Further, the syndrome module is configured to perform spectrum amplitude deblurring processing on the separation matrix according to a diagonal matrix of an inverse matrix of the separation matrix, so as to correct the spectrum distortion of the separation matrix.

Further, the signal mapping module is configured to obtain, according to the a posteriori domain estimation data, signal components of each sound source in each channel of the beamforming filter, where the signal components include an expected signal and a residual noise signal.

Further, the second noise reduction module includes:

an input signal construction submodule for determining an input signal for each of the sound sources based on the signal components of each of the sound sources;

an autocorrelation matrix estimation submodule, configured to perform the processing in steps one to two on the input signals of the sound sources frame by frame respectively:

a weight vector calculation submodule configured to obtain a weight vector of each sound source according to a final autocorrelation matrix of the desired signal and an autocorrelation matrix of the residual noise signal of each sound source;

and the beam forming estimation signal acquisition sub-module is used for obtaining the beam forming estimation signal of each sound source according to the conjugate transpose matrix of the input signal of each sound source and the weight vector.

Further, the enhanced signal output module is configured to perform short-time fourier inverse transformation on the beamforming estimation signals of each sound source, and then overlap and add the beamforming estimation signals to obtain a time-domain sound source signal of each sound source.

According to a third aspect of embodiments of the present disclosure, there is provided a computer apparatus comprising:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to:

According to a fourth aspect of embodiments of the present disclosure, there is provided a non-transitory computer-readable storage medium having instructions therein, which when executed by a processor of a mobile terminal, enable the mobile terminal to perform a sound signal recognition method, the method comprising:

The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects: acquiring original observation data acquired by at least two acquisition points for at least two sound sources respectively, performing primary noise reduction processing on the original observation data to obtain posterior domain estimation data, acquiring posterior domain estimation data according to the posterior domain estimation data, acquiring signal components of each sound source according to the posterior domain estimation data, performing secondary noise reduction processing on the signal components of each sound source respectively to obtain a beam forming estimation signal, and finally acquiring time domain sound source signals of each sound source according to the beam forming estimation signal. After the original observation data is subjected to blind source separation to obtain posterior domain estimation data, the signal-to-noise ratio is further improved through beam enhancement to highlight signals, the problems of low sound source positioning accuracy and poor voice recognition quality under the scene of strong interference and low signal-to-noise ratio are solved, and high-efficiency and high-interference-resistance voice signal recognition is realized.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.

Fig. 1 is a flow chart illustrating a method of sound signal recognition according to an exemplary embodiment.

Fig. 2 is a flow chart illustrating yet another method of sound signal identification according to an example embodiment.

FIG. 3 is a schematic diagram of a two microphone acquisition point reception scenario.

Fig. 4 is a schematic diagram illustrating a blind source separation principle according to an exemplary embodiment.

Fig. 5 is a schematic diagram illustrating a principle of a blind source separation and beamforming combined application according to an example embodiment.

Fig. 6 is a flow chart illustrating yet another sound signal identification method according to an example embodiment.

Fig. 7 is a flowchart illustrating yet another sound signal recognition method according to an example embodiment.

Fig. 8 is a block diagram illustrating a voice signal recognition apparatus according to an exemplary embodiment.

Fig. 9 is a schematic diagram illustrating a structure of a first noise reduction module 802 according to an exemplary embodiment.

Fig. 10 is a schematic structural diagram illustrating the separation matrix update submodule 905 according to an exemplary embodiment.

FIG. 11 is a schematic diagram illustrating a structure of a second noise reduction module 804 according to an exemplary embodiment.

Fig. 12 is a block diagram illustrating an apparatus (a general structure of a mobile terminal) according to an example embodiment.

Fig. 13 is a block diagram showing an apparatus (general structure of a server) according to an exemplary embodiment.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.

Generally, a sound source direction-finding algorithm directly utilizes data acquired by a microphone and performs direction-finding estimation by using algorithms such as microphone array sound source localization (SRP-PHAT). However, the algorithm depends on the signal-to-noise ratio of the signal, the accuracy is not high enough under the condition of low signal-to-noise ratio, and the effective sound source cannot be accurately positioned in the direction of each interference sound source.

In order to solve the above problem, embodiments of the present disclosure provide a sound signal identification method and apparatus. The acquired original data are subjected to noise reduction processing to obtain posterior domain data, then, the noise reduction processing is performed for the second time aiming at residual noise in the posterior domain data to further improve the signal-to-noise ratio, a final time domain sound source signal is obtained, the influence of an interference sound source is eliminated, the problem of low sound source positioning accuracy rate under the scene of strong interference and low signal-to-noise ratio is solved, and high-efficiency and high-interference-resistance voice signal recognition is realized.

An exemplary embodiment of the present disclosure provides a sound signal identification method, where a process of acquiring a sound source signal identification result is completed by the method shown in fig. 1, and the method includes:

step 101, acquiring original observation data acquired by at least two acquisition points for at least two sound sources respectively.

In this embodiment, the collection point may be a microphone. For example, there may be multiple microphones disposed on the same device, the multiple microphones making up a microphone array.

In this step, data acquisition is performed at each acquisition point, and the acquired data may be from multiple sound sources. The plurality of sound sources may include a target effective sound source and may also include an interfering sound source.

The acquisition point acquires original observation data of at least two sound sources.

And 102, performing primary noise reduction processing on the original observation data to obtain posterior domain estimation data.

In this step, the acquired original observation data is subjected to a first-stage noise reduction process to eliminate noise influence generated by an interference sound source and the like.

The original observation data can be subjected to first-stage noise reduction treatment based on a blind source separation technology to obtain the posterior domain estimation data after noise reduction.

The blind source separation technology based on independent vector analysis does not need prior information such as source signals, transmission channels and the like, decomposes received observation signals into approximate expected signals and noise signals through an optimization algorithm according to the principle of statistical independence, and peels off the noise signals, so that the signal-to-noise ratio of the signals is improved. Meanwhile, the approximate expected signal obtained through blind source separation still retains the spatial information of the microphone array and the sound source.

And 103, obtaining signal components of each sound source according to the posterior domain estimation data.

In the disclosure, after the posterior domain estimation data is obtained, in order to further eliminate the residual noise in the posterior domain estimation data to obtain a signal with a higher signal-to-noise ratio, the posterior domain estimation data is processed, and the noise is filtered by an adaptive filter.

And step 104, respectively carrying out secondary noise reduction processing on the signal components of the sound sources to obtain beam forming estimation signals.

In this step, for the noise interference residue in the posterior domain estimation data obtained in step 103, a beam forming technique is used to perform a second-stage noise reduction process in order to further improve the sound signal quality. The method can enhance the sound source signal and inhibit other direction signals (signals possibly interfering with the sound source signal), thereby further improving the signal-to-noise ratio of the sound source signal, and further positioning and identifying the sound source can be carried out on the basis to obtain a more accurate result.

In this step, a second-stage noise reduction process may be performed by using a beam forming based on a Parameterized Multichannel Wiener Filter (PMWF).

And 105, obtaining time domain sound source signals of the sound sources according to the beam forming estimation signals.

In the step, according to the wave beam forming estimation signal, a time domain signal with enhanced signal-to-noise ratio after the wave beam separation processing is obtained through short-time inverse Fourier transform (ISTFT) and overlap addition, and compared with posterior domain estimation data, the time domain signal has smaller noise, can truly and accurately reflect the sound signal emitted by the sound source, and realizes accurate and efficient sound signal identification.

An exemplary embodiment of the present disclosure further provides a sound signal identification method, which performs noise reduction processing on original observation data based on blind source separation to obtain posterior domain estimation data, where a specific flow is shown in fig. 2, and includes:

step 201, initializing a separation matrix of each frequency point and a weighted covariance matrix of each sound source at each frequency point.

In this step, the number of rows and columns of the separation matrix is the number of sound sources, and the weighted covariance matrix is a 0 matrix.

In this embodiment, a scene with two microphones as acquisition points is taken as an example. As shown in fig. 3, smart speaker a has two microphones: mic1 and mic 2; there are two sound sources in the space around smart speaker a: s1 and s 2. The signals from both sources can be picked up by both microphones. The signals of the two sound sources are mixed together in each microphone.

The main idea of blind source separation is to find a separation matrix

W11, W12, W21, W22 are elements of the separation matrix W, representing that W is composed of these four elements.

The signals x that can be picked up from the two microphones according to the separation matrix₁、x₂Separating out sound source s₁、s₂The implementation principle is shown in fig. 4. Raw observations x collected from mic1 by separation matrix W₁[k]To separate out the posterior domain estimated signal component y to the sound source s1₁₁[k]And estimating the signal component y for the posterior domain of the sound source s2₂₁[k]Raw observation data x collected from mic2₂[k]To separate out the posterior domain estimated signal component y to the sound source s1₁₂[k]And estimating the signal component y for the posterior domain of the sound source s2₂₂[k]And further obtaining posterior domain estimation data y₁[k]And y₂[k]。y₁[k]Is according to y₁₁[k]And y₁₂[k]The obtained posterior domain data of the sound source s 1; y is₂[k]Is according to y₂₁[k]And y₂₂[k]The obtained posterior domain data of the sound source s 2.

In this step, an initialization separation matrix is defined according to expression (1):

the initialized separation matrix is a unit matrix, K being 1. K is a variable representing a frequency point, which represents that the current calculation is the K-th frequency point, wherein K is 1, K represents that the value variable of K is 1,2,3, … K, and K represents the total number of frequency points.

The total number of frequency points K is Nfft/2+1, and Nfft is the frame length of each subframe in the sound system of the smart sound box a.

In this step, the separation matrix of each frequency point is initialized according to expression (1):

the separation matrix is a unit matrix; k1., K.

w₁(k, τ) and w₂(k, τ) are respectively represented as a column element of W, and τ represents the current frame number.

And initializing weighted covariance matrix V of each sound source at each frequency point according to expression (2)_p(k) Zero matrix:

wherein, K is 1, K represents the kth frequency point; p is the microphone number; p is 1, 2.

Step 202, obtaining time domain signals at each acquisition point, and constructing an observation signal matrix according to frequency domain signals corresponding to the time domain signals.

To be provided with

Representing the time domain signal of the nth frame of the p microphone. p is 1, 2. 1, 2.., Nfft.

Windowing the frame obtained according to Nfft according to expression (3), and then obtaining a corresponding frequency domain signal X by fourier transform (FFT)_p(k,n)：

The observed signal matrix is as shown in expression (4) and is:

X(k,n)＝[X₁(k,n),X₂(k,n)]^T (4)

wherein K is 1.

And 203, solving prior frequency domain estimation of each sound source of the current frame according to the separation matrix of the previous frame and the observation signal matrix.

In this step, the observed signal matrix is first separated according to the separation matrix of the previous frame to obtain the prior frequency domain estimation of each sound source of the current frame.

In particular, to the scenario shown in fig. 3, let Y (k, n) be [ Y ═ Y₁(k,n),Y₂(k,n)]^TK is 1, K, wherein Y₁(k,n),Y₂(k, n) are each a sound source s₁And s₂A priori frequency domain estimates at time bins (k, n). As shown in expression (5), it is obtained by separating the observation matrix X (k, n) by using the separation matrix w (k):

Y(k,n)＝W(k)X(k,n)，k＝1,..,K (5)

then the frequency domain estimation of the p sound source in the entire frequency band of the n frame is as shown in expression (6):

and 204, updating the weighted covariance matrix according to the prior frequency domain estimation.

In this step, the weighted covariance matrix is updated according to the observed signal matrix and the conjugate transpose matrix of the observed signal matrix.

For the application scenario shown in FIG. 3, the weighted covariance matrix V is updated according to expression (7)_p(k,n)。

Where α is a weighting coefficient, and can be obtained according to expression (8):

defining a contrast function according to expression (10)

A multi-dimensional super-gaussian prior probability distribution model based on the whole frequency band is represented for the p-th sound source. In a general algorithm, it can be calculated according to expression (11)

exp denotes an exponential function.

Accordingly, the comparison function is shown in expression (12):

the weighting coefficient can be obtained by expressions (13) and (14):

and step 205, updating the separation matrix according to the updated weighted covariance matrix.

In this step, the eigenvectors of each sound source are calculated according to the weighted covariance matrix of each sound source, and then the separation matrix of each sound source is updated according to the eigenvectors.

For the application scenario shown in fig. 3, the equation shown in expression (15) may be solved, and the eigenvalue of the sound source s1 is obtained according to expression (16), and the eigenvector of the sound source s1 is obtained according to expression (17).

V₂(k,n)e_p(k,n)＝λ_p(k,n)V₁(k,n)e_p(k,n)，p＝1,2 (15)

Characteristic value of sound source s 1:

wherein, Tr represents the trace of the matrix, det represents the determinant of the matrix, and H can be obtained according to the expression (20).

Feature vector of sound source s 1:

the eigenvalue of the sound source s1 is obtained according to expression (18), and the eigenvector of the sound source s2 is obtained according to expression (19).

Characteristic value of sound source s 1:

feature vector of sound source s 2:

wherein the content of the first and second substances,

H₁₁、H₁₂、H₂₁、H₂₂respectively, elements in the matrix H (k, n).

Then, a separation matrix w (k) ═ w for each frequency point is calculated according to expression (21)₁(k),w₂(k)]^H：

Wherein, K is 1., K, ()^HIndicating the conjugate transpose in parentheses.

And step 206, correcting the updated separation matrix.

Therefore, in this step, according to the diagonal matrix of the inverse matrix of the separation matrix, the spectrum amplitude deblurring processing is performed on the separation matrix to correct the spectrum distortion of the separation matrix.

For example, W [ k ] may be corrected using expression (22).

W[k]＝diag(W^-1[k])W[k] (22)

Wherein, W^-1[k]Is W [ k ]]The inverse matrix of (d), diag (·) represents a diagonal matrix made up of principal diagonal elements.

And step 207, separating the original observation data according to the corrected separation matrix, and obtaining posterior domain estimation data of each sound source through separation.

Separating the original microphone signals by using W (k) to obtain posterior frequency domain estimation of the sound source signals:

Y(k,n)＝[Y₁(k,n),Y₂(k,n)]^T＝W(k)X(k,n) (23)

an exemplary embodiment of the present disclosure further provides a sound signal identification method, where for an a posteriori frequency domain estimation signal obtained by separating from an observation signal, in a scene with a large background noise and a strong reverberation, a second-stage noise reduction process may be performed on a noise residue in the signal to obtain a sound signal with a higher signal-to-noise ratio.

Firstly, according to the posterior domain estimation data, the signal components of each sound source are obtained. Signal components of each sound source in each channel of the beamforming filter, including the desired signal and the residual noise signal, may be derived from the a posteriori domain estimation data.

Taking the blind source separation technique and the beamforming technique combined enhancement system shown in fig. 5 as an example, a beamforming filter of PMWF is adopted, and there are two component channels, i.e., a first channel and a second channel. In the application scenario shown in FIG. 3, X is used_p(k, n) as input to the blind source separation of the first stage of noise reduction processing (where the frequency domain signal corresponding to the sound source s1 is abbreviated as X)₁[k]Wherein the frequency domain signal corresponding to the sound source s2 is abbreviated as X₂[k]). Separating the blind source result Y₁(k, n) (abbreviated as Y in FIG. 5)₁) Viewed as a sound source s₁Component in the first channel, Y₂(k, n) (abbreviated as Y in FIG. 5)₂) Viewed as a sound source s₂Component in the second channel. The composition of the other sound source in the two channels is solved based on the observed signal matrix X (k, n) and the a priori frequency domain estimate Y (k, n) according to expression (24):

at this time, a sound source s is obtained₁Signal component Y in two channels₁(k, n) and

(abbreviated as in FIG. 5

) And a sound source s₂Signal component Y in two channels₂(k, n) and

(abbreviated as in FIG. 5

). Then, according to the signal components, PMWF filtering is carried out to obtain the estimation result of higher signal-to-noise ratio of each sound source.

An exemplary embodiment of the present disclosure further provides a sound signal identification method, where after signal components of each channel of each sound source in a beamforming filter are obtained, a second-stage noise reduction process is performed on the signal components of each sound source, so as to obtain a beamforming estimation signal. The specific process is shown in fig. 6, and includes:

step 601, determining the input signal of each sound source according to the signal component of each sound source.

In this step, the input signal is determined based on the desired signal and the residual noise signal component in the signal component.

Taking the application scenario shown in FIG. 3 as an example, for sound source s1, Y is given₁(k,n)、

As input to the PMWF. For any frequency point, the input signal is composed of a desired signal and a residual noise signal.

Expression (25) shows the input signal Y of the first channel at any frequency point of the sound source s1₁(k) The constitution of (1):

Y₁(k)＝S₁(k)+V₁(k),k＝1,...,K (25)

wherein S is₁(k) For an input signal Y₁(k) Of the desired signal, V₁(k) For an input signal Y₁(k) The residual noise signal of (2).

Expression (26) shows the input signal of the second channel at any frequency point of the sound source s1

The constitution of (1):

wherein the content of the first and second substances,

for input signals

The desired signal of (a) is determined,

for input signals

The residual noise signal of (2).

Expressed in vector form:

where Y (k) is the input signal, S (k) is the desired signal, and V (k) is the residual noise signal.

For sound source s2, Y₂(k,n)，

Expression (2) shows the input signal of the first channel at the frequency point of the sound source s2

The constitution of (1):

wherein the content of the first and second substances,

for input signals

The desired signal of (a) is determined,

for input signals

The residual noise signal of (2).

Expression (31) shows the second channel input signal at any frequency point of sound source s1

The constitution of (1):

Y₂(k)＝S₂(k)+V₂(k)，k＝1,…,K (31)

wherein S is₂(k) For an input signal Y₂(k) Of the desired signal, V₂(k) For an input signal Y₂(k) The residual noise signal of (2).

Expressed in vector form:

Step 602, estimating an autocorrelation matrix of the desired signal and an autocorrelation matrix of the residual noise signal.

In this step, the autocorrelation matrix of the desired signal and/or the residual noise signal is selected to be estimated according to whether the current frame is a noise frame or a non-noise frame, and the frame-by-frame processing is performed until the whole input signal is processed. As shown in fig. 7, the method includes:

respectively performing the following processing on the input signals of the sound sources frame by frame:

step 701, detecting that the current frame is a noise frame or a non-noise frame.

In this step, the noise is further identified by detecting the silent period in the observed signal data. The current frame can be detected as a noise frame or a non-noise frame by any Voice Activity Detection (VAD) technique.

It is checked whether the current frame is a noise frame, step 702 is entered for a noise frame, otherwise step 703 is entered.

Step 702, in case that the current frame is a noise frame, estimating an autocorrelation matrix of the desired signal and an autocorrelation matrix of the residual noise signal of the current frame according to an autocorrelation matrix of the desired signal and an autocorrelation matrix of the residual noise signal of a previous frame.

Under the condition that the signals s (k) and v (k) satisfy uncorrelated conditions, the respective autocorrelation matrices can be obtained according to expressions (35) to (38), specifically as follows:

autocorrelation matrix of input signal:

autocorrelation matrix of desired signal:

autocorrelation matrix of the residual noise signal:

the relationship between the autocorrelation matrices of the input signal, the desired signal and the residual noise signal:

in the general case of the above-mentioned,

the estimation of (b) is only performed on noise frames.

According to

And

de-estimation

As shown in expression (39).

At time n

And

according to the time of n-1

And

de-estimation, i.e. of the current frame

And

it needs to be estimated from the previous frame as shown in expressions (40), (41).

Wherein λ is₁And λ₂Is a forgetting factor.

Step 703, under the condition that the current frame is a non-noise frame, estimating the autocorrelation matrix of the expected signal of the current frame according to the autocorrelation matrix of the expected signal of the previous frame.

In the application scenario shown in fig. 3, the autocorrelation matrix of the desired signal of the current frame may be estimated according to expression (34).

Step 603, obtaining a weight vector of each sound source according to the final autocorrelation matrix of the desired signal and the autocorrelation matrix of the residual noise signal of each sound source.

In this step, it can be derived from the frequency domain parameterized multi-channel wiener filtering, and the weight vector is calculated according to expression (42):

wherein the content of the first and second substances,

length M, corresponding to reference microphone 1, and the rest 0. Beta is an adjustable parameter, the higher the beta value is set, the better the noise removal effect can be obtained, and the lower the beta value is set, the higher the restoration degree of the voice signal is.

Step 604, obtaining a beamforming estimation signal of each sound source according to the conjugate transpose matrix of the input signal of each sound source and the weight vector.

In this step, a beamforming estimation signal of each sound source is obtained according to expression (43):

Z_p(k)＝Y(k)^HH(k)，p＝1,2 (43)

as shown in FIG. 3Under the application scene, the beam forming estimation signal Z of the sound source s1 can be obtained₁(k) And the beamforming estimation signal Z of the sound source s2₂(k) In that respect And further denoising the posterior domain estimation data obtained by blind source separation through beam forming.

An exemplary embodiment of the present disclosure further provides a sound signal identification method, which obtains a time-domain sound source signal of each of the sound sources according to a beamforming estimation signal. The beam forming estimated signals of each sound source can be subjected to short-time inverse Fourier transform and then overlapped and added to obtain time-domain sound source signals of each sound source.

In the application scenario shown in fig. 3, the time-domain sound source signal can be calculated by expression (44):

an exemplary embodiment of the present disclosure also provides a sound signal recognition apparatus, the structure of which is shown in fig. 8, including:

an original data acquiring module 801, configured to acquire original observation data acquired by at least two acquisition points for at least two sound sources respectively;

a first denoising module 802, configured to perform a first-stage denoising process on the original observation data to obtain posterior domain estimation data;

a signal mapping module 803, configured to obtain signal components of each sound source according to the posterior domain estimation data;

a second denoising module 804, configured to perform second-stage denoising processing on the signal components of the sound sources respectively to obtain beamforming estimation signals;

an enhanced signal output module 805, configured to obtain a time-domain sound source signal of each sound source according to the beamforming estimation signal.

Further, the structure of the first noise reduction module 802 is shown in fig. 9, and includes:

a matrix initialization submodule 901, configured to initialize a separation matrix of each frequency point and a weighted covariance matrix of each sound source at each frequency point, where the number of rows and columns of the separation matrix is the number of sound sources;

an observed signal matrix construction submodule 902, configured to obtain a time domain signal at each acquisition point, and construct an observed signal matrix according to a frequency domain signal corresponding to the time domain signal;

a priori frequency domain obtaining submodule 903, configured to obtain a priori frequency domain estimate of each sound source of the current frame according to the separation matrix of the previous frame and the observed signal matrix;

a covariance matrix update sub-module 904 for updating the weighted covariance matrix according to the prior frequency domain estimate;

a separation matrix updating submodule 905 configured to update the separation matrix according to the updated weighted covariance matrix;

a syndrome module 906 for correcting the updated separation matrix;

the posterior domain obtaining submodule 907 is configured to separate the original observation data according to the corrected separation matrix, and obtain posterior domain estimation data of each sound source through separation.

Further, the prior frequency domain obtaining sub-module 903 is configured to separate the observed signal matrix according to the separation matrix of the previous frame, so as to obtain a prior frequency domain estimate of each sound source of the current frame.

Further, the covariance matrix update sub-module 904 is configured to update the weighted covariance matrix according to the observed signal matrix and a conjugate transpose matrix of the observed signal matrix.

Further, the structure of the separation matrix update sub-module 905 is shown in fig. 10, and includes:

an eigenvalue operator module 1001 for calculating an eigenvector of each sound source according to the weighted covariance matrix of each sound source;

and an updating sub-module 1002, configured to update the separation matrix of each sound source according to the feature vector.

Further, the syndrome module 906 is configured to perform spectrum amplitude deblurring processing on the separation matrix according to a diagonal matrix of an inverse matrix of the separation matrix, so as to correct the spectrum distortion of the separation matrix.

Further, the signal mapping module 803 is configured to obtain, according to the a posteriori domain estimation data, signal components of each sound source in each channel of the beamforming filter, where the signal components include an expected signal and a residual noise signal.

Further, the structure of the second noise reduction module 804 is shown in fig. 11, and includes:

an input signal construction sub-module 1101 for determining an input signal for each of the sound sources from the signal components of each of the sound sources;

an autocorrelation matrix estimation sub-module 1102, configured to perform the following processing on the input signals of the sound sources frame by frame respectively as follows:

a weight vector calculation sub-module 1103, configured to obtain a weight vector of each sound source according to the final autocorrelation matrix of the desired signal and the autocorrelation matrix of the residual noise signal of each sound source;

and a beamforming estimation signal obtaining sub-module 1104, configured to obtain a beamforming estimation signal of each sound source according to the conjugate transpose matrix of the input signal of each sound source and the weight vector.

The device can be integrated in the intelligent terminal equipment or the remote operation processing platform, and part of the functional modules can be integrated in the intelligent terminal equipment and part of the functional modules can be integrated in the remote operation processing platform, and the corresponding functions can be realized by the intelligent terminal equipment and/or the remote operation processing platform.

With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

Fig. 12 is a block diagram illustrating an apparatus 1200 for acoustic signal recognition according to an example embodiment. For example, the apparatus 1200 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.

Referring to fig. 12, the apparatus 1200 may include one or more of the following components: a processing component 1202, a memory 1204, a power component 1206, a multimedia component 1208, an audio component 1210, an input/output (I/O) interface 1212, a sensor component 1214, and a communications component 1216.

The processing component 1202 generally controls overall operation of the apparatus 1200, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 1202 may include one or more processors 1220 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 1202 can include one or more modules that facilitate interaction between the processing component 1202 and other components. For example, the processing component 1202 can include a multimedia module to facilitate interaction between the multimedia component 1208 and the processing component 1202.

The memory 1204 is configured to store various types of data to support operation at the device 1200. Examples of such data include instructions for any application or method operating on the device 1200, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 1204 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.

A power component 1206 provides power to the various components of the device 1200. Power components 1206 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for apparatus 1200.

The multimedia components 1208 include a screen that provides an output interface between the device 1200 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 1208 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the device 1200 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.

Audio component 1210 is configured to output and/or input audio signals. For example, audio component 1210 includes a Microphone (MIC) configured to receive external audio signals when apparatus 1200 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 1204 or transmitted via the communication component 1216. In some embodiments, audio assembly 1210 further includes a speaker for outputting audio signals.

The I/O interface 1212 provides an interface between the processing component 1202 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.

The sensor assembly 1214 includes one or more sensors for providing various aspects of state assessment for the apparatus 1200. For example, the sensor assembly 1214 may detect an open/closed state of the device 1200, the relative positioning of the components, such as a display and keypad of the apparatus 1200, the sensor assembly 1214 may also detect a change in the position of the apparatus 1200 or a component of the apparatus 1200, the presence or absence of user contact with the apparatus 1200, an orientation or acceleration/deceleration of the apparatus 1200, and a change in the temperature of the apparatus 1200. The sensor assembly 1214 may include a proximity sensor configured to detect the presence of a nearby object in the absence of any physical contact. The sensor assembly 1214 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 1214 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communications component 1216 is configured to facilitate communications between the apparatus 1200 and other devices in a wired or wireless manner. The apparatus 1200 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 1216 receives the broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communications component 1216 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the apparatus 1200 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.

In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as memory 1204 comprising instructions, executable by processor 1220 of apparatus 1200 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

A non-transitory computer readable storage medium having instructions therein, which when executed by a processor of a mobile terminal, enable the mobile terminal to perform a sound signal recognition method, the method comprising:

Fig. 13 is a block diagram illustrating an apparatus 1300 for sound signal recognition according to an example embodiment. For example, the apparatus 1300 may be provided as a server. Referring to fig. 13, apparatus 1300 includes a processing component 1322, which further includes one or more processors, and memory resources, represented by memory 1332, for storing instructions, such as application programs, that may be executed by processing component 1322. The application programs stored in memory 1332 may include one or more modules that each correspond to a set of instructions. Further, processing component 1322 is configured to execute instructions to perform the methods described above.

The apparatus 1300 may also include a power component 1326 configured to perform power management for the apparatus 1300, a wired or wireless network interface 1350 configured to connect the apparatus 1300 to a network, and an input-output (I/O) interface 1358. The apparatus 1300 may operate based on an operating system stored in the memory 1332, such as Windows Server, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, or the like.

The invention provides a sound signal identification method and a device, which are used for acquiring original observation data acquired by at least two acquisition points for at least two sound sources respectively, then carrying out primary noise reduction processing on the original observation data to obtain posterior domain estimation data, then obtaining posterior domain estimation data according to the posterior domain estimation data, then obtaining signal components of each sound source according to the posterior domain estimation data, carrying out secondary noise reduction processing on the signal components of each sound source respectively to obtain a beam forming estimation signal, and finally obtaining a time domain sound source signal of each sound source according to the beam forming estimation signal. After the original observation data is subjected to blind source separation to obtain posterior domain estimation data, the signal-to-noise ratio is further improved through beam enhancement to highlight signals, the problems of low sound source positioning accuracy and poor voice recognition quality under the scene of strong interference and low signal-to-noise ratio are solved, and high-efficiency and high-interference-resistance voice signal recognition is realized.

Through a combined enhancement scheme based on blind source separation and beam forming, the blind source separation based on independent vector analysis and the beam forming based on multi-channel wiener filtering are jointly processed, residual noise is further removed, the noise reduction performance of an algorithm is greatly improved, the voice quality is improved, and the influence of noise on sound signal identification is reduced.

The technical scheme provided by the disclosure can be used for a device using a microphone as sound collecting point equipment, and can be used for a sound collecting point array formed by two or more collecting points, such as a microphone array and the like.

Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.

It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims

1. A method for recognizing a sound signal, comprising:

2. The method according to claim 1, wherein the step of performing a first-stage noise reduction process on the raw observation data to obtain a posterior domain estimation data comprises:

correcting the updated separation matrix;

3. The method of claim 2, wherein the step of obtaining the a priori frequency domain estimates of the sound sources of the current frame according to the separation matrix of the previous frame and the observation signal matrix comprises:

4. The method of claim 2, wherein the step of updating the weighted covariance matrix based on the a priori frequency domain estimates comprises:

5. The sound signal identification method of claim 2, wherein the step of updating the separation matrix according to the updated weighted covariance matrix comprises:

and respectively updating the separation matrix of each sound source according to the characteristic vector.

6. The sound signal identification method according to claim 2, wherein the step of correcting the updated separation matrix includes:

7. The sound signal identification method according to claim 1, wherein the step of obtaining the signal component of each sound source from the a posteriori domain estimation data comprises:

8. The sound signal identification method of claim 7, wherein the step of performing a second stage of noise reduction processing on the signal components of each sound source to obtain the beamforming estimation signal comprises:

9. The method according to claim 8, wherein the step of obtaining the time-domain sound source signal of each sound source according to the beamforming estimation signal comprises:

10. An apparatus for recognizing a sound signal, comprising:

11. The apparatus according to claim 10, wherein the first noise reduction module comprises:

a correction submodule for correcting the updated separation matrix;

12. The sound signal identification device of claim 11,

and the prior frequency domain solving submodule is used for separating the observation signal matrix according to the separation matrix of the previous frame to obtain the prior frequency domain estimation of each sound source of the current frame.

13. The sound signal identification device of claim 11,

and the covariance matrix updating submodule is used for updating the weighted covariance matrix according to the observation signal matrix and the conjugate transpose matrix of the observation signal matrix.

14. The apparatus according to claim 11, wherein the separation matrix update submodule comprises:

and the updating submodule is used for respectively updating the separation matrix of each sound source according to the characteristic vector.

15. The sound signal identification device of claim 11,

and the correction submodule is used for carrying out spectrum amplitude deblurring processing on the separation matrix according to the diagonal matrix of the inverse matrix of the separation matrix so as to correct the spectrum distortion of the separation matrix.

16. The sound signal identification device of claim 10,

and the signal mapping module is used for obtaining signal components of each sound source in each channel of the beam forming filter according to the posterior domain estimation data, wherein the signal components comprise expected signals and residual noise signals.

17. The apparatus for recognizing a sound signal according to claim 16, wherein the second noise reduction module comprises:

18. The sound signal identification device of claim 17,

and the enhanced signal output module is used for performing short-time Fourier inverse transformation on the beam forming estimation signals of each sound source and then overlapping and adding the beam forming estimation signals to obtain time domain sound source signals of each sound source.

19. A computer device, comprising:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to:

20. A non-transitory computer readable storage medium having instructions therein, which when executed by a processor of a mobile terminal, enable the mobile terminal to perform a sound signal recognition method, the method comprising:

according to the posterior domain estimation data, respectively carrying out secondary noise reduction processing on the signal components of the sound sources to obtain beam forming estimation signals;