CN113314135B - Voice signal identification method and device - Google Patents


Info

Publication number
CN113314135B
Authority
CN
China
Prior art keywords
signal
matrix
sound source
signals
sound
Prior art date
Legal status
Active
Application number
CN202110572969.6A
Other languages
Chinese (zh)
Other versions
CN113314135A
Inventor
何琪琪
侯海宁
Current Assignee
Beijing Xiaomi Mobile Software Co Ltd
Beijing Xiaomi Pinecone Electronic Co Ltd
Original Assignee
Beijing Xiaomi Mobile Software Co Ltd
Beijing Xiaomi Pinecone Electronic Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Xiaomi Mobile Software Co Ltd, Beijing Xiaomi Pinecone Electronic Co Ltd
Priority to CN202110572969.6A
Publication of CN113314135A
Application granted
Publication of CN113314135B
Legal status: Active
Anticipated expiration


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0216 - Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161 - Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 - Microphone arrays; Beamforming

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The disclosure relates to a voice signal recognition method and device, which relate to intelligent voice interaction technology and solve the problems of low sound source localization accuracy and poor voice recognition quality in scenes with strong interference and a low signal-to-noise ratio. The method comprises the following steps: acquiring original observation data collected by at least two acquisition points for at least two sound sources respectively; performing first-stage noise reduction processing on the original observation data to obtain posterior domain estimation data; obtaining signal components of each sound source according to the posterior domain estimation data; performing second-stage noise reduction processing on the signal components of each sound source respectively to obtain beamforming estimated signals; and obtaining time domain sound source signals of each sound source according to the beamforming estimated signals. The technical scheme provided by the disclosure is suitable for intelligent voice systems and realizes efficient, highly interference-resistant voice signal recognition.

Description

Voice signal identification method and device
Technical Field
The disclosure relates to intelligent voice interaction technology, and in particular relates to a voice signal recognition method and device.
Background
In the era of the Internet of Things and AI, intelligent voice, as one of the core technologies of artificial intelligence, has enriched the modes of human-machine interaction and greatly improved the convenience of intelligent products.
Intelligent products typically pick up sound with a microphone array formed by a plurality of microphones, and apply microphone beamforming or blind source separation technology to suppress environmental interference, thereby improving the quality of voice signal processing and the voice recognition rate in real environments.
In addition, in order to endow devices with stronger intelligence and perceptibility, intelligent devices are generally provided with an indicator light. When the user interacts with the device, the indicator light should point accurately at the user rather than at an interference source, so that the user feels as if in a face-to-face conversation with the device, enhancing the interaction experience. For this reason, in an environment where interfering sound sources exist, it is important to accurately estimate the direction of the user (i.e., the sound source).
Sound source direction-finding algorithms generally work directly on the data collected by the microphones, using algorithms such as the steered response power with phase transform (SRP-PHAT), which is based on phase-transform-weighted steerable response power, to estimate the direction. However, such algorithms depend on the signal-to-noise ratio of the signal: under low signal-to-noise-ratio conditions the accuracy is not high enough, the estimate is easily drawn towards interfering sound sources, the effective sound source cannot be located accurately, and the accuracy of voice recognition is further affected.
Disclosure of Invention
In order to overcome the problems in the related art, the present disclosure provides a voice signal recognition method and apparatus. Posterior domain data are obtained through a first stage of noise reduction, signal components are then formed, and the noise is further reduced through beamforming, so that high-quality voice recognition is realized.
According to a first aspect of an embodiment of the present disclosure, there is provided a sound signal recognition method, including:
Acquiring original observation data acquired by at least two acquisition points for at least two sound sources respectively;
Performing first-stage noise reduction treatment on the original observed data to obtain posterior domain estimation data;
obtaining signal components of each sound source according to the posterior domain estimation data;
performing second-stage noise reduction processing on the signal components of each sound source respectively to obtain beam forming estimated signals;
and obtaining time domain sound source signals of each sound source according to the beam forming estimated signals.
Further, the step of performing a first-stage noise reduction process on the original observation data to obtain posterior domain estimation data includes:
initializing a separation matrix of each frequency point and a weighted covariance matrix of each sound source at each frequency point, wherein the number of rows and columns of the separation matrix are the number of the sound sources;
solving time domain signals at each acquisition point, and constructing an observation signal matrix according to frequency domain signals corresponding to the time domain signals;
According to the separation matrix of the previous frame and the observation signal matrix, solving the prior frequency domain estimation of each sound source of the current frame;
updating the weighted covariance matrix according to the prior frequency domain estimation;
updating the separation matrix according to the updated weighted covariance matrix;
Correcting the updated separation matrix;
and separating the original observed data according to the corrected separation matrix to obtain the posterior domain estimated data of each sound source.
Further, the step of obtaining the prior frequency domain estimation of each sound source of the current frame according to the separation matrix of the previous frame and the observation signal matrix comprises the following steps:
And separating the observation signal matrix according to the separation matrix of the previous frame to obtain prior frequency domain estimation of each sound source of the current frame.
Further, the step of updating the weighted covariance matrix based on the prior frequency-domain estimate comprises:
and updating the weighted covariance matrix according to the observation signal matrix and the conjugate transpose matrix of the observation signal matrix.
Further, the step of updating the separation matrix according to the updated weighted covariance matrix includes:
calculating the feature vector of each sound source according to the weighted covariance matrix of each sound source;
Respectively updating the separation matrix of each sound source according to the characteristic vector;
And updating the separation matrix into a conjugate transpose matrix of the combined separation matrix of each sound source.
Further, the step of correcting the updated separation matrix includes:
And according to the diagonal matrix of the inverse matrix of the separation matrix, performing spectrum amplitude deblurring processing on the separation matrix so as to correct spectrum distortion of the separation matrix.
Further, the step of obtaining the signal components of each sound source according to the posterior domain estimation data includes:
And obtaining signal components of each sound source in each channel of the beam forming filter according to the posterior domain estimation data, wherein the signal components comprise a desired signal and a residual noise signal.
Further, the step of performing a second-stage noise reduction process on the signal components of each sound source to obtain a beam forming estimated signal includes:
Determining the input signal of each sound source according to the signal component of each sound source;
and respectively carrying out the processing from the first step to the second step on the input signals of the sound sources frame by frame:
step one, detecting that the current frame is a noise frame or a non-noise frame,
Estimating an autocorrelation matrix of the desired signal and an autocorrelation matrix of the residual noise signal of the current frame based on the autocorrelation matrix of the desired signal and the autocorrelation matrix of the residual noise signal of the previous frame in the case where the current frame is a noise frame,
Estimating an autocorrelation matrix of the expected signal of the current frame according to the autocorrelation matrix of the expected signal of the previous frame under the condition that the current frame is a non-noise frame;
Obtaining weight vectors of the sound sources according to the final autocorrelation matrixes of the expected signals and the autocorrelation matrixes of the residual noise signals of the sound sources;
And obtaining beam forming estimated signals of each sound source according to the conjugate transpose matrix and the weight vector of the input signals of each sound source.
Further, the step of obtaining a time domain sound source signal of each sound source according to the beam forming estimation signal includes:
And carrying out inverse short-time Fourier transform on the beam forming estimated signals of each sound source, and then carrying out overlap-add to obtain time domain sound source signals of each sound source.
According to a second aspect of the embodiments of the present disclosure, there is provided a sound signal recognition apparatus, including:
the original data acquisition module is used for acquiring original observation data acquired by at least two acquisition points for at least two sound sources respectively;
the first noise reduction module is used for carrying out first-stage noise reduction processing on the original observed data to obtain posterior domain estimated data;
the signal mapping module is used for obtaining the signal components of each sound source according to the posterior domain estimation data;
The second noise reduction module is used for respectively carrying out second-stage noise reduction processing on the signal components of each sound source to obtain a beam forming estimated signal;
and the enhanced signal output module is used for obtaining the time domain sound source signals of each sound source according to the beam forming estimated signals.
Further, the first noise reduction module includes:
the matrix initialization sub-module is used for initializing a separation matrix of each frequency point and a weighted covariance matrix of each sound source at each frequency point, wherein the number of rows and the number of columns of the separation matrix are the number of the sound sources;
The observation signal matrix construction submodule is used for obtaining time domain signals at each acquisition point and constructing an observation signal matrix according to frequency domain signals corresponding to the time domain signals;
The priori frequency domain solving sub-module is used for solving the priori frequency domain estimation of each sound source of the current frame according to the separation matrix of the previous frame and the observation signal matrix;
a covariance matrix updating sub-module for updating the weighted covariance matrix according to the prior frequency domain estimate;
a separation matrix updating sub-module, configured to update the separation matrix according to the updated weighted covariance matrix;
a correction sub-module for correcting the updated separation matrix;
and the posterior domain obtaining sub-module is used for separating the original observed data according to the corrected separation matrix to obtain the posterior domain estimated data of each sound source.
Further, the priori frequency domain obtaining sub-module is configured to separate the observation signal matrix according to a separation matrix of a previous frame, so as to obtain a priori frequency domain estimate of each sound source of the current frame.
Further, the covariance matrix updating sub-module is configured to update the weighted covariance matrix according to the observation signal matrix and a conjugate transpose matrix of the observation signal matrix.
Further, the split matrix updating submodule includes:
The characteristic value calculation submodule is used for calculating characteristic vectors of the sound sources according to the weighted covariance matrix of the sound sources;
the first updating sub-module is used for respectively updating the separation matrixes of the sound sources according to the characteristic vectors;
And the second updating sub-module is used for updating the separation matrix into a conjugate transpose matrix after the separation matrix of each sound source is combined.
Further, the correction submodule is configured to perform spectral amplitude defuzzification processing on the separation matrix according to a diagonal matrix of an inverse matrix of the separation matrix, so as to correct spectral distortion of the separation matrix.
Further, the signal mapping module is configured to obtain, according to the posterior domain estimation data, signal components of each sound source in each channel of the beamforming filter, where the signal components include a desired signal and a residual noise signal.
Further, the second noise reduction module includes:
an input signal construction sub-module, configured to determine an input signal of each sound source according to a signal component of each sound source;
the autocorrelation matrix estimation submodule is used for respectively carrying out the processing from the step one to the step two on the input signals of each sound source frame by frame:
step one, detecting that the current frame is a noise frame or a non-noise frame,
Estimating an autocorrelation matrix of the desired signal and an autocorrelation matrix of the residual noise signal of the current frame based on the autocorrelation matrix of the desired signal and the autocorrelation matrix of the residual noise signal of the previous frame in the case where the current frame is a noise frame,
Estimating an autocorrelation matrix of the expected signal of the current frame according to the autocorrelation matrix of the expected signal of the previous frame under the condition that the current frame is a non-noise frame;
The weight vector calculation operator module is used for obtaining the weight vector of each sound source according to the autocorrelation matrix of the final expected signal of each sound source and the autocorrelation matrix of the residual noise signal;
And the beam forming estimation signal acquisition sub-module is used for acquiring the beam forming estimation signals of the sound sources according to the conjugate transpose matrix and the weight vector of the input signals of the sound sources.
Further, the enhanced signal output module is configured to perform short-time inverse fourier transform and overlap-add on the beamforming estimation signals of each sound source to obtain time domain sound source signals of each sound source.
According to a third aspect of embodiments of the present disclosure, there is provided a computer apparatus comprising:
A processor;
A memory for storing processor-executable instructions;
wherein the processor is configured to:
Acquiring original observation data acquired by at least two acquisition points for at least two sound sources respectively;
Performing first-stage noise reduction treatment on the original observed data to obtain posterior domain estimation data;
obtaining signal components of each sound source according to the posterior domain estimation data;
performing second-stage noise reduction processing on the signal components of each sound source respectively to obtain beam forming estimated signals;
and obtaining time domain sound source signals of each sound source according to the beam forming estimated signals.
According to a fourth aspect of embodiments of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon instructions which, when executed by a processor of a mobile terminal, cause the mobile terminal to perform a sound signal recognition method, the method comprising:
Acquiring original observation data acquired by at least two acquisition points for at least two sound sources respectively;
Performing first-stage noise reduction treatment on the original observed data to obtain posterior domain estimation data;
obtaining signal components of each sound source according to the posterior domain estimation data;
performing second-stage noise reduction processing on the signal components of each sound source respectively to obtain beam forming estimated signals;
and obtaining time domain sound source signals of each sound source according to the beam forming estimated signals.
The technical scheme provided by the embodiments of the present disclosure may have the following beneficial effects: original observation data collected by at least two acquisition points for at least two sound sources are acquired; first-stage noise reduction is performed on the original observation data to obtain posterior domain estimation data; signal components of each sound source are obtained according to the posterior domain estimation data; second-stage noise reduction is performed on the signal components of each sound source respectively to obtain beamforming estimated signals; and finally time domain sound source signals of each sound source are obtained according to the beamforming estimated signals. After blind source separation is performed on the original observation data to obtain posterior domain estimation data, the signal-to-noise ratio is further improved through beam enhancement to highlight the signals, the problems of low sound source localization accuracy and poor voice recognition quality in scenes with strong interference and a low signal-to-noise ratio are solved, and efficient, highly interference-resistant voice signal recognition is realized.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
Fig. 1 is a flowchart illustrating a voice signal recognition method according to an exemplary embodiment.
Fig. 2 is a flowchart illustrating yet another voice signal recognition method according to an exemplary embodiment.
Fig. 3 is a schematic view of a sound pickup scene with two microphone acquisition points.
Fig. 4 is a schematic diagram illustrating a blind source separation principle according to an exemplary embodiment.
Fig. 5 is a schematic diagram illustrating a blind source separation and beamforming combination application in accordance with an exemplary embodiment.
Fig. 6 is a flowchart illustrating yet another voice signal recognition method according to an exemplary embodiment.
Fig. 7 is a flowchart illustrating yet another voice signal recognition method according to an exemplary embodiment.
Fig. 8 is a block diagram illustrating a voice signal recognition apparatus according to an exemplary embodiment.
Fig. 9 is a schematic diagram illustrating a structure of a first noise reduction module 802 according to an exemplary embodiment.
Fig. 10 is a schematic diagram showing the structure of a split matrix update sub-module 905 according to an exemplary embodiment.
Fig. 11 is a schematic structural diagram of a second noise reduction module 804, according to an example embodiment.
Fig. 12 is a block diagram of an apparatus (general structure of a mobile terminal) according to an exemplary embodiment.
Fig. 13 is a block diagram (general structure of a server) of an apparatus according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the invention. Rather, they are merely examples of apparatus and methods consistent with aspects of the invention as detailed in the accompanying claims.
Sound source direction-finding algorithms generally work directly on the data collected by the microphones and use microphone array sound source localization algorithms such as SRP-PHAT for direction estimation. However, such algorithms depend on the signal-to-noise ratio of the signal: under low signal-to-noise-ratio conditions the accuracy is not high enough, the estimated direction is very easily drawn towards interfering sound sources, and the effective sound source cannot be accurately located.
In order to solve the above problems, embodiments of the present disclosure provide a sound signal recognition method and apparatus. Noise reduction is first performed on the collected original data to obtain posterior domain data; noise reduction is then performed again on the residual noise in the posterior domain data to further improve the signal-to-noise ratio and obtain the final time domain sound source signals. The influence of interfering sound sources is thereby eliminated, the problem of low sound source localization accuracy in scenes with strong interference and a low signal-to-noise ratio is solved, and efficient, highly interference-resistant voice signal recognition is realized.
An exemplary embodiment of the present disclosure provides a sound signal recognition method, with which a flow of obtaining a sound source signal recognition result is shown in fig. 1, including:
step 101, acquiring original observation data acquired by at least two acquisition points for at least two sound sources respectively.
In this embodiment, the collection point may be a microphone. For example, it may be a plurality of microphones disposed on the same device, the plurality of microphones constituting a microphone array.
In this step, data is collected at each collection point, and the collected data sources may be a plurality of sound sources. The plurality of sound sources may include a target effective sound source and may also include an interfering sound source.
The acquisition point acquires the original observation data of at least two sound sources.
And 102, performing first-stage noise reduction processing on the original observed data to obtain posterior domain estimated data.
In the step, the first-stage noise reduction processing is performed on the acquired original observation data so as to eliminate noise influence generated by an interference sound source and the like.
The first-stage noise reduction processing can be performed on the original observed data based on the blind source separation technology so as to obtain noise-reduced posterior domain estimation data.
Blind source separation based on independent vector analysis requires no prior information such as the source signals or the transmission channels: according to the principle of statistical independence, an optimization algorithm decomposes the received observation signal into an approximate desired signal and a noise signal, and the noise signal is stripped off, thereby improving the signal-to-noise ratio. Meanwhile, the approximate desired signal obtained through blind source separation still retains the spatial information of the microphone array and the sound sources.
And step 103, obtaining the signal components of each sound source according to the posterior domain estimation data.
In the present disclosure, after the posterior domain estimation data is obtained, in order to further reject the noise remaining in the posterior domain estimation data to obtain a signal with a higher signal-to-noise ratio, the posterior domain estimation data is processed, and the noise is filtered by an adaptive filter.
And 104, respectively performing second-stage noise reduction processing on the signal components of each sound source to obtain beam forming estimated signals.
In this step, for the residual noise interference in the posterior domain estimation data obtained in step 103, a beamforming technique is used to perform second-stage noise reduction in order to further improve the sound signal quality. Beamforming can enhance the sound source signal and suppress signals from other directions (signals that may interfere with the sound source signal), thereby further improving the signal-to-noise ratio of the sound source signal; sound source localization and recognition can then be carried out on this basis to obtain more accurate results.
In this step, the second-stage noise reduction processing can be performed by adopting beamforming based on parameterized multichannel Wiener filtering (Parameterized Multichannel Wiener Filter, PMWF for short).
And 105, obtaining time domain sound source signals of each sound source according to the beamforming estimated signals.
In this step, according to the beamforming estimated signals, the time domain signals with enhanced signal-to-noise ratio after the separation and beamforming processing are obtained through inverse short-time Fourier transform (ISTFT) and overlap-add. Compared with the posterior domain estimated data, these time domain signals contain less noise, reflect the sound signals emitted by the sound sources more truly and accurately, and thus realize accurate and efficient sound signal recognition.
An exemplary embodiment of the present disclosure further provides a method for identifying a sound signal, which performs noise reduction processing on original observation data based on blind source separation to obtain posterior domain estimation data, where a specific flow is shown in fig. 2, and includes:
step 201, initializing a separation matrix of each frequency point and a weighted covariance matrix of each sound source at each frequency point.
In this step, the number of rows and columns of the separation matrix is the number of sound sources, and the weighted covariance matrix is a zero matrix.
In the present embodiment, a scene in which two microphones are taken as the acquisition points is taken as an example. As shown in fig. 3, the smart speaker a has two microphones: mic1 and mic2; there are two sound sources in the space around the intelligent sound box a: s1 and s2. The signals from both sources can be picked up by both microphones. The signals of the two sound sources in each microphone will be aliased together.
The main idea of blind source separation is to find the separation matrix W = [[w11, w12], [w21, w22]], where w11, w12, w21 and w22 are the four elements of which W is composed.
According to the separation matrix, the sound sources s1 and s2 can be separated from the signals x1 and x2 collected by the two microphones; the implementation principle is shown in Fig. 4. The posterior domain estimated signal component y11[k] of sound source s1 and the posterior domain estimated signal component y21[k] of sound source s2 are separated from the original observed data x1[k] acquired by mic1 through the separation matrix W, and the posterior domain estimated signal component y12[k] of sound source s1 and the posterior domain estimated signal component y22[k] of sound source s2 are separated from the original observed data x2[k] acquired by mic2, from which the posterior domain estimation data y1[k] and y2[k] are obtained. y1[k] is the posterior domain data of sound source s1 obtained from y11[k] and y12[k]; y2[k] is the posterior domain data of sound source s2 obtained from y21[k] and y22[k].
In this step, the separation matrix of each frequency point is initialized according to expression (1); the initialized separation matrix is a unit matrix:
W(k, τ) = [[1, 0], [0, 1]], k = 1, ..., K   (1)
Where k is a variable representing the frequency point currently being calculated, k = 1, ..., K; the total number of frequency points K = Nfft/2 + 1, and Nfft is the frame length of each frame in the sound system of the intelligent sound box A. w1(k, τ) and w2(k, τ) each represent a column of elements of W, and τ represents the current frame number.
The weighted covariance matrix Vp(k) of each sound source at each frequency point is initialized to a zero matrix according to expression (2):
Vp(k) = [[0, 0], [0, 0]], k = 1, ..., K; p = 1, 2   (2)
Where k = 1, ..., K represents the k-th frequency point, and p is the microphone number, p = 1, 2.
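As an illustration of this initialization (expressions (1) and (2)) for the two-microphone, two-source case, a minimal numpy sketch is given below; the frame length n_fft and all variable names are assumptions chosen for illustration, not symbols from the patent.

```python
import numpy as np

n_fft = 512                   # assumed frame length Nfft of the sound system
K = n_fft // 2 + 1            # total number of frequency points, K = Nfft/2 + 1
n_src = 2                     # two sound sources picked up by two microphones

# Expression (1): one 2x2 separation matrix per frequency point, initialized to the unit matrix.
W = np.tile(np.eye(n_src, dtype=np.complex128), (K, 1, 1))    # shape (K, 2, 2)

# Expression (2): one weighted covariance matrix per source and per frequency point, initialized to zero.
V = np.zeros((n_src, K, n_src, n_src), dtype=np.complex128)   # V[p, k] is the 2x2 matrix Vp(k)
```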
Step 202, obtaining time domain signals at each acquisition point, and constructing an observation signal matrix according to frequency domain signals corresponding to the time domain signals.
Let xp(m, n) denote the m-th sample of the time domain signal of the n-th frame of the p-th microphone, p = 1, 2; m = 1, 2, ..., Nfft.
According to expression (3), after windowing the Nfft-point frame, the corresponding frequency domain signal Xp(k, n) is obtained by Fourier transform (FFT).
The observation signal matrix is given by expression (4):
X(k, n) = [X1(k, n), X2(k, n)]^T   (4)
Where k = 1, ..., K.
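A minimal sketch of how one frame of the observation signal matrix X(k, n) of expression (4) could be built from the two time-domain frames, i.e. windowing followed by an FFT as described for expression (3); the Hann window and the function name are assumptions.

```python
import numpy as np

def observation_matrix(frame_mic1, frame_mic2):
    """Build X(k, n) = [X1(k, n), X2(k, n)]^T for one frame (assumed Hann window)."""
    n_fft = len(frame_mic1)
    win = np.hanning(n_fft)
    X1 = np.fft.rfft(frame_mic1 * win)      # K = n_fft//2 + 1 frequency points
    X2 = np.fft.rfft(frame_mic2 * win)
    return np.stack([X1, X2], axis=-1)      # shape (K, 2); row k is [X1(k, n), X2(k, n)]

# Example with random samples standing in for one 512-point frame from each microphone:
# X = observation_matrix(np.random.randn(512), np.random.randn(512))
```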
And 203, according to the separation matrix of the previous frame and the observation signal matrix, solving the prior frequency domain estimation of each sound source of the current frame.
In the step, firstly, the observed signal matrix is separated according to the separation matrix of the previous frame, and the prior frequency domain estimation of each sound source of the current frame is obtained.
Specifically for the scenario shown in Fig. 3, let Y(k, n) = [Y1(k, n), Y2(k, n)]^T, k = 1, ..., K, where Y1(k, n) and Y2(k, n) are the prior frequency domain estimates of sound sources s1 and s2, respectively, at the time-frequency point (k, n). As shown in expression (5), it is obtained by applying the separation matrix W(k) to the observation matrix X(k, n):
Y(k, n) = W(k)X(k, n), k = 1, ..., K   (5)
The frequency domain estimation of the p-th sound source over the entire frequency band of the n-th frame is shown in expression (6):
Yp(n) = [Yp(1, n), Yp(2, n), ..., Yp(K, n)]^T   (6)
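Applying expression (5) with the separation matrices of the previous frame is then a bin-by-bin matrix-vector product; a sketch under the array shapes assumed in the snippets above:

```python
import numpy as np

def prior_estimate(W, X):
    """Expression (5): Y(k, n) = W(k) X(k, n) for every frequency point k.

    W : (K, 2, 2) separation matrices of the previous frame
    X : (K, 2)    observation signal matrix of the current frame
    Returns Y of shape (K, 2), where Y[:, p] is the prior frequency domain estimate of source p+1.
    """
    return np.einsum('kij,kj->ki', W, X)
```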
and step 204, updating the weighted covariance matrix according to the prior frequency domain estimation.
In this step, the weighted covariance matrix is updated according to the observation signal matrix and the conjugate transpose matrix of the observation signal matrix.
For the application scenario shown in Fig. 3, the weighted covariance matrix Vp(k, n) is updated according to expression (7), where α is a weighting coefficient that can be obtained according to expression (8).
A contrast function is defined according to expression (10) based on a multi-dimensional super-Gaussian prior probability density distribution model of the p-th sound source over the whole frequency band. In general algorithms, this prior probability density can be calculated according to expression (11), where exp denotes the exponential function. Correspondingly, the contrast function is given by expression (12), and the weighting coefficients can be obtained from expressions (13) and (14).
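Expressions (7)-(14) themselves are not shown above, so the snippet below is only a hedged sketch of a commonly used online update with the same shape as the description: a full-band super-Gaussian prior yields a per-source weight that scales the outer product X(k, n)X^H(k, n) inside a smoothed covariance update. The smoothing constant alpha and the 1/r weighting are assumptions and may differ from the patent's exact expressions.

```python
import numpy as np

def update_weighted_covariance(V_p, X, Y_p, alpha=0.96, eps=1e-6):
    """Hedged sketch of the weighted covariance update for one source p (cf. expression (7)).

    V_p : (K, 2, 2) weighted covariance matrices of source p from the previous frame
    X   : (K, 2)    observation signal matrix of the current frame
    Y_p : (K,)      prior frequency domain estimate of source p over the whole band
    """
    r_p = np.sqrt(np.sum(np.abs(Y_p) ** 2)) + eps     # full-band amplitude of source p
    phi = 1.0 / r_p                                   # weight from a spherical super-Gaussian prior (assumed)
    outer = np.einsum('ki,kj->kij', X, X.conj())      # X(k, n) X^H(k, n) for every frequency point
    return alpha * V_p + (1.0 - alpha) * phi * outer
```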
step 205, updating the separation matrix according to the updated weighted covariance matrix.
In this step, feature vectors of the respective sound sources are calculated according to the weighted covariance matrix of the respective sound sources, and then the separation matrix of the respective sound sources is updated according to the feature vectors.
For the application scenario shown in Fig. 3, the equation shown in expression (15) may be solved:
V2(k, n)ep(k, n) = λp(k, n)V1(k, n)ep(k, n), p = 1, 2   (15)
The eigenvalue of sound source s1 is obtained according to expression (16) and its eigenvector according to expression (17), where Tr represents the trace of a matrix, det represents the determinant of a matrix, and H(k, n) can be found according to expression (20). Similarly, the eigenvalue of sound source s2 is obtained according to expression (18) and its eigenvector according to expression (19). H11, H12, H21 and H22 denote the elements of the matrix H(k, n).
Then, the separation matrix W(k) = [w1(k), w2(k)]^H of each frequency point is calculated according to expression (21), where k = 1, ..., K and (·)^H represents the conjugate transpose of the quantity in brackets.
And 206, correcting the updated separation matrix.
In this step, according to the diagonal matrix of the inverse matrix of the separation matrix, the separation matrix is subjected to a spectral amplitude deblurring process to correct the spectral distortion of the separation matrix.
For example, W[k] may be corrected using expression (22):
W[k] = diag(W^(-1)[k]) W[k]   (22)
Wherein W^(-1)[k] is the inverse matrix of W[k], and diag(·) represents the diagonal matrix formed by the principal diagonal elements.
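Expression (22) is a direct per-bin rescaling; a short numpy sketch for the 2x2 separation matrices is given below.

```python
import numpy as np

def correct_separation_matrix(W):
    """Expression (22): W[k] <- diag(W^{-1}[k]) W[k], removing the spectral amplitude ambiguity."""
    W_corrected = np.empty_like(W)
    for k in range(W.shape[0]):
        W_inv = np.linalg.inv(W[k])
        W_corrected[k] = np.diag(np.diag(W_inv)) @ W[k]   # diagonal matrix of principal diagonal elements
    return W_corrected
```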
And step 207, separating the original observation data according to the corrected separation matrix, and separating the obtained posterior domain estimation data of each sound source.
Separating the original microphone signal by using W (k) to obtain posterior frequency domain estimation of the sound source signal:
Y(k, n) = [Y1(k, n), Y2(k, n)]^T = W(k)X(k, n)   (23)
in another exemplary embodiment of the present disclosure, for a posterior frequency domain estimated signal separated from an observed signal, in a scene where background noise is large and reverberation is strong, a second-stage noise reduction process may be performed on noise residuals in the signal to obtain a sound signal with a higher signal-to-noise ratio.
Firstly, according to the posterior domain estimation data, signal components of each sound source are obtained. Signal components of each sound source in each channel of the beam forming filter can be obtained according to the posterior domain estimation data, wherein the signal components comprise a desired signal and a residual noise signal.
Taking the joint enhancement system of blind source separation and beamforming shown in Fig. 5 as an example, a PMWF beamforming filter is used, and there are two component channels, namely a first channel and a second channel. In the application scenario shown in Fig. 3, Xp(k, n) is taken as the input of the first-stage noise reduction, blind source separation (the frequency domain signal corresponding to sound source s1 is abbreviated as X1[k], and the frequency domain signal corresponding to sound source s2 is abbreviated as X2[k]). The blind source separation result Y1(k, n) (abbreviated as Y1 in Fig. 5) is regarded as the component of sound source s1 in the first channel, and Y2(k, n) (abbreviated as Y2 in Fig. 5) is regarded as the component of sound source s2 in the second channel. According to expression (24), the component of each sound source in the other channel is found from the observation signal matrix X(k, n) and the prior frequency domain estimate Y(k, n).
At this time, the signal components of sound source s1 in the two channels, namely Y1(k, n) and the component given by expression (24), and the signal components of sound source s2 in the two channels, namely Y2(k, n) and the component given by expression (24), are obtained. PMWF filtering can then be performed on these signal components, so that an estimation result with a higher signal-to-noise ratio is obtained for each sound source.
An exemplary embodiment of the present disclosure further provides a sound signal identifying method, after obtaining signal components of each channel of each sound source in a beamforming filter, performing second-stage noise reduction processing on the signal components of each sound source, to obtain a beamforming estimation signal. The specific flow is shown in fig. 6, and includes:
And 601, determining the input signal of each sound source according to the signal component of each sound source.
In this step, the input signal is determined based on the desired signal and the residual noise signal in the signal component.
Taking the application scenario shown in Fig. 3 as an example, for sound source s1, Y1(k, n) and its component in the second channel are taken as the input of the PMWF. For any frequency point, the input signal is composed of the desired signal and the residual noise signal.
Expression (25) shows the composition of the first channel input signal Y1(k) at any frequency point of sound source s1:
Y1(k) = S1(k) + V1(k), k = 1, ..., K   (25)
Wherein S1(k) is the desired signal of the input signal Y1(k), and V1(k) is the residual noise signal of the input signal Y1(k).
Expression (26) similarly shows the composition of the second channel input signal of sound source s1 at any frequency point, which likewise consists of a desired signal and a residual noise signal.
Expressed in vector form:
Y(k) = S(k) + V(k)
Where Y(k) is the input signal vector, S(k) is the desired signal vector, and V(k) is the residual noise signal vector.
For sound source s2, Y2(k, n) and its component in the first channel are taken as the input of the PMWF. For any frequency point, the input signal is composed of the desired signal and the residual noise signal.
The composition of the first channel input signal of sound source s2 at any frequency point is analogous: it consists of a desired signal and a residual noise signal.
Expression (31) shows the composition of the second channel input signal Y2(k) at any frequency point of sound source s2:
Y2(k) = S2(k) + V2(k), k = 1, ..., K   (31)
Wherein S2(k) is the desired signal of the input signal Y2(k), and V2(k) is the residual noise signal of the input signal Y2(k).
Expressed in vector form:
Y(k) = S(k) + V(k)
Where Y(k) is the input signal vector, S(k) is the desired signal vector, and V(k) is the residual noise signal vector.
Step 602, estimating an autocorrelation matrix of the desired signal and an autocorrelation matrix of the residual noise signal.
In this step, according to whether the current frame is a noise frame or a non-noise frame, the autocorrelation matrix of the desired signal and/or the residual noise signal is selected to be estimated, and the frame-by-frame processing is performed until the whole input signal is processed. As shown in fig. 7, the method comprises:
the input signals of the sound sources are respectively processed frame by frame as follows:
step 701, detecting that the current frame is a noise frame or a non-noise frame.
In this step, noise is further identified by detecting a mute period in the observation signal data. The current frame may be detected as a noisy frame or a non-noisy frame by any voice activity detection (Voice Activity Detection, VAD for short) technique.
It is detected whether the current frame is a noise frame, and step 702 is performed on the noise frame, otherwise step 703 is performed.
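The patent leaves the choice of detector open (any VAD technique may be used). Purely as an illustration of the noise/non-noise decision that drives steps 702 and 703, a deliberately simple energy-threshold rule is sketched below; the threshold, the running noise floor and all names are assumptions, not the patent's own detector.

```python
import numpy as np

def is_noise_frame(Y_frame, noise_floor, threshold=2.0):
    """Toy energy-based VAD: treat the frame as noise when its energy stays close
    to a running noise-floor estimate (all parameters are illustrative assumptions)."""
    energy = np.mean(np.abs(Y_frame) ** 2)
    return energy < threshold * noise_floor
```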
Step 702, estimating an autocorrelation matrix of the desired signal and an autocorrelation matrix of the residual noise signal of the current frame according to the autocorrelation matrix of the desired signal and the autocorrelation matrix of the residual noise signal of the previous frame in the case that the current frame is a noise frame.
Under the condition that the signals S(k) and V(k) are uncorrelated, the respective autocorrelation matrices can be obtained according to expressions (35)-(38), specifically as follows:
Autocorrelation matrix of the input signal:
Φyy(k) = E[Y(k)Y^H(k)]   (35)
Autocorrelation matrix of the desired signal:
Φss(k) = E[S(k)S^H(k)]   (36)
Autocorrelation matrix of the residual noise signal:
Φvv(k) = E[V(k)V^H(k)]   (37)
Relationship between the autocorrelation matrices of the input signal, the desired signal and the residual noise signal:
Φyy(k) = Φss(k) + Φvv(k)   (38)
In general, the estimation of the residual noise autocorrelation matrix Φvv(k) is performed only on noise frames.
The autocorrelation matrix of the desired signal is then estimated from Φyy(k) and Φvv(k) as shown in expression (39):
Φss(k) = Φyy(k) - Φvv(k)   (39)
The matrices Φyy(k) and Φvv(k) at time n are estimated recursively from their values at time n-1, i.e. the matrices of the current frame are estimated from those of the previous frame, as shown in expressions (40) and (41), where λ1 and λ2 are forgetting coefficients.
Step 703, estimating an autocorrelation matrix of the desired signal of the current frame according to the autocorrelation matrix of the desired signal of the previous frame in the case that the current frame is a non-noise frame.
In the application scenario shown in fig. 3, the autocorrelation matrix of the desired signal of the current frame may be estimated according to expression (34).
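Since expressions (34) and (40)-(41) themselves are not shown above, the snippet below is only a hedged sketch of the recursive estimation they describe: first-order smoothing with forgetting coefficients λ1 and λ2, the residual noise autocorrelation updated on noise frames only, and the desired-signal autocorrelation obtained by subtraction as in expression (39). The exact update form is an assumption.

```python
import numpy as np

def update_autocorrelations(phi_yy, phi_vv, Y_k, is_noise, lam1=0.9, lam2=0.9):
    """Hedged sketch of the per-bin autocorrelation estimates feeding the PMWF stage.

    phi_yy, phi_vv : (2, 2) previous-frame autocorrelation matrices of the input
                     signal and of the residual noise signal at one frequency point
    Y_k            : (2,)   PMWF input vector Y(k) of the current frame
    is_noise       : bool   VAD decision for the current frame
    """
    outer = np.outer(Y_k, Y_k.conj())                  # Y(k) Y^H(k)
    phi_yy = lam1 * phi_yy + (1.0 - lam1) * outer      # input autocorrelation (assumed recursive form)
    if is_noise:
        phi_vv = lam2 * phi_vv + (1.0 - lam2) * outer  # residual noise autocorrelation, noise frames only
    phi_ss = phi_yy - phi_vv                           # expression (39): desired-signal autocorrelation
    return phi_yy, phi_vv, phi_ss
```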
And 603, obtaining weight vectors of the sound sources according to the autocorrelation matrixes of the final expected signals and the autocorrelation matrixes of the residual noise signals of the sound sources.
In this step, the weight vector is calculated from expression (42), as derived from frequency-domain parameterized multichannel Wiener filtering. Expression (42) involves a selection vector of length M whose element corresponding to reference microphone 1 is 1 and whose remaining elements are all 0, and a parameter β: the higher the β value, the better the noise removal effect; the lower the β value, the higher the fidelity of the restored voice signal.
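Expression (42) itself is not shown above; as an assumption, the sketch below uses the standard frequency-domain parameterized multichannel Wiener filter with trade-off parameter β and a selection vector picking reference microphone 1, which matches the surrounding description but may differ in detail from the patent's formula.

```python
import numpy as np

def pmwf_weights(phi_ss, phi_vv, beta=1.0, ref_mic=0):
    """Hedged sketch of a standard PMWF weight vector (assumed form of expression (42)).

    phi_ss, phi_vv : (M, M) autocorrelation matrices of the desired signal and residual noise
    beta           : trade-off parameter (larger -> stronger noise removal, smaller -> higher speech fidelity)
    ref_mic        : index of the reference channel picked by the length-M selection vector
    """
    M = phi_ss.shape[0]
    u = np.zeros(M)
    u[ref_mic] = 1.0                           # 1 at the reference microphone, 0 elsewhere
    A = np.linalg.solve(phi_vv, phi_ss)        # Phi_vv^{-1} Phi_ss
    return (A @ u) / (beta + np.trace(A))      # parameterized multichannel Wiener filter weights
```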
Step 604, obtaining a beam forming estimation signal of each sound source according to the conjugate transpose matrix and the weight vector of the input signal of each sound source.
In this step, the beam forming estimation signals of the respective sound sources are obtained according to expression (43):
Zp(k) = Y(k)^H H(k), p = 1, 2   (43)
Where H(k) is the weight vector obtained from expression (42) and (·)^H denotes the conjugate transpose.
In the application scenario shown in Fig. 3, the beamforming estimation signal Z1(k) of sound source s1 and the beamforming estimation signal Z2(k) of sound source s2 can be obtained. In this way, the posterior domain estimation data obtained by blind source separation are subjected to further noise reduction through beamforming.
An exemplary embodiment of the present disclosure further provides a sound signal recognition method, which obtains the time domain sound source signal of each sound source according to the beamforming estimation signals. The beamforming estimation signals of each sound source are subjected to an inverse short-time Fourier transform and then overlap-add to obtain the time domain sound source signals of each sound source.
In the application scenario shown in Fig. 3, the time domain sound source signals can be calculated by expression (44).
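A minimal sketch of the inverse short-time Fourier transform followed by overlap-add described for expression (44); the 50% overlap, the synthesis window and the function name are assumptions.

```python
import numpy as np

def istft_overlap_add(Z_frames, n_fft=512, hop=256):
    """Reconstruct a time domain signal from per-frame beamforming estimates Zp(k, n).

    Z_frames : (n_frames, K) complex spectra with K = n_fft // 2 + 1
    Assumes 50% overlap and a Hann synthesis window (illustrative choices).
    """
    win = np.hanning(n_fft)
    n_frames = Z_frames.shape[0]
    out = np.zeros(hop * (n_frames - 1) + n_fft)
    for n in range(n_frames):
        frame = np.fft.irfft(Z_frames[n], n=n_fft) * win   # inverse STFT of one frame
        out[n * hop : n * hop + n_fft] += frame            # overlap-add
    return out
```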
an exemplary embodiment of the present disclosure further provides a voice signal recognition apparatus, having a structure as shown in fig. 8, including:
the original data acquisition module 801 is configured to acquire original observation data acquired by at least two acquisition points for at least two sound sources respectively;
A first noise reduction module 802, configured to perform a first level noise reduction process on the original observation data to obtain posterior domain estimation data;
A signal mapping module 803, configured to obtain signal components of each sound source according to the posterior domain estimation data;
A second noise reduction module 804, configured to perform second-stage noise reduction processing on the signal components of each sound source, to obtain a beam forming estimated signal;
And the enhanced signal output module 805 is configured to obtain a time domain sound source signal of each sound source according to the beamforming estimation signal.
Further, as shown in fig. 9, the structure of the first noise reduction module 802 includes:
the matrix initialization submodule 901 is used for initializing a separation matrix of each frequency point and a weighted covariance matrix of each sound source at each frequency point, wherein the number of rows and the number of columns of the separation matrix are the number of the sound sources;
An observation signal matrix construction sub-module 902, configured to calculate time domain signals at each acquisition point, and construct an observation signal matrix according to frequency domain signals corresponding to the time domain signals;
The priori frequency domain solving sub-module 903 is configured to solve a priori frequency domain estimation of each sound source in the current frame according to the separation matrix of the previous frame and the observation signal matrix;
A covariance matrix update sub-module 904 for updating the weighted covariance matrix based on the prior frequency domain estimate;
a separation matrix updating sub-module 905, configured to update the separation matrix according to the updated weighted covariance matrix;
a correction sub-module 906 for correcting the updated separation matrix;
The posterior domain obtaining sub-module 907 is configured to separate the original observation data according to the corrected separation matrix to obtain the posterior domain estimated data of each sound source.
Further, the priori frequency domain obtaining sub-module 903 is configured to separate the observation signal matrix according to the separation matrix of the previous frame, so as to obtain a priori frequency domain estimate of each sound source of the current frame.
Further, the covariance matrix updating sub-module 904 is configured to update the weighted covariance matrix according to the observation signal matrix and a conjugate transpose matrix of the observation signal matrix.
Further, the structure of the split matrix updating sub-module 905 is shown in fig. 10, and includes:
A eigenvalue calculation submodule 1001, configured to calculate eigenvectors of each sound source according to the weighted covariance matrix of each sound source;
And an updating submodule 1002, configured to update the separation matrix of each sound source according to the feature vector.
Further, the correction submodule 906 is configured to perform a spectral amplitude deblurring process on the separation matrix according to a diagonal matrix of an inverse matrix of the separation matrix, so as to correct spectral distortion of the separation matrix.
Further, the signal mapping module 803 is configured to obtain, according to the posterior domain estimation data, signal components of each sound source in each channel of the beamforming filter, where the signal components include a desired signal and a residual noise signal.
Further, as shown in fig. 11, the structure of the second noise reduction module 804 includes:
an input signal construction sub-module 1101, configured to determine an input signal of each of the sound sources according to a signal component of each of the sound sources;
an autocorrelation matrix estimation submodule 1102, configured to perform processing according to steps one to two on the input signals of the respective sound sources, respectively:
step one, detecting that the current frame is a noise frame or a non-noise frame,
Estimating an autocorrelation matrix of the desired signal and an autocorrelation matrix of the residual noise signal of the current frame based on the autocorrelation matrix of the desired signal and the autocorrelation matrix of the residual noise signal of the previous frame in the case where the current frame is a noise frame,
Estimating an autocorrelation matrix of the expected signal of the current frame according to the autocorrelation matrix of the expected signal of the previous frame under the condition that the current frame is a non-noise frame;
a weight vector calculation operator module 1103, configured to obtain a weight vector of each sound source according to the autocorrelation matrix of the final desired signal and the autocorrelation matrix of the residual noise signal of each sound source;
and the beam forming estimated signal obtaining sub-module 1104 is configured to obtain a beam forming estimated signal of each sound source according to the conjugate transpose matrix and the weight vector of the input signal of each sound source.
Further, the enhanced signal output module is configured to perform short-time inverse fourier transform and overlap-add on the beamforming estimation signals of each sound source to obtain time domain sound source signals of each sound source.
The device can be integrated in intelligent terminal equipment or a remote operation processing platform, or part of functional modules can be integrated in the intelligent terminal equipment and part of functional modules are integrated in the remote operation processing platform, and corresponding functions are realized by the intelligent terminal equipment and/or the remote operation processing platform.
The specific manner in which the various modules perform the operations in the apparatus of the above embodiments has been described in detail in connection with the embodiments of the method, and will not be described in detail herein.
Fig. 12 is a block diagram illustrating an apparatus 1200 for voice signal recognition according to an exemplary embodiment. For example, apparatus 1200 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, or the like.
Referring to fig. 12, apparatus 1200 may include one or more of the following components: a processing component 1202, a memory 1204, a power component 1206, a multimedia component 1208, an audio component 1210, an input/output (I/O) interface 1212, a sensor component 1214, and a communications component 1216.
The processing component 1202 generally controls overall operation of the apparatus 1200, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 1202 may include one or more processors 1220 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 1202 may include one or more modules that facilitate interactions between the processing component 1202 and other components. For example, the processing component 1202 may include a multimedia module to facilitate interaction between the multimedia component 1208 and the processing component 1202.
Memory 1204 is configured to store various types of data to support operations at device 1200. Examples of such data include instructions for any application or method operating on the apparatus 1200, contact data, phonebook data, messages, pictures, videos, and the like. The memory 1204 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
Power component 1206 provides power to the various components of device 1200. The power components 1206 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the apparatus 1200.
The multimedia component 1208 includes a screen between the device 1200 and the user that provides an output interface. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 1208 includes a front camera and/or a rear camera. The front camera and/or the rear camera may receive external multimedia data when the device 1200 is in an operational mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component 1210 is configured to output and/or input audio signals. For example, the audio component 1210 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 1200 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 1204 or transmitted via the communications component 1216. In some embodiments, audio assembly 1210 further includes a speaker for outputting audio signals.
The I/O interface 1212 provides an interface between the processing component 1202 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 1214 includes one or more sensors for providing status assessments of various aspects of the apparatus 1200. For example, the sensor assembly 1214 may detect the on/off state of the apparatus 1200 and the relative positioning of components, such as the display and keypad of the apparatus 1200. The sensor assembly 1214 may also detect a change in position of the apparatus 1200 or of a component of the apparatus 1200, the presence or absence of user contact with the apparatus 1200, the orientation or acceleration/deceleration of the apparatus 1200, and a change in temperature of the apparatus 1200. The sensor assembly 1214 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 1214 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 1214 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communications component 1216 is configured to facilitate wired or wireless communication between the apparatus 1200 and other devices. The apparatus 1200 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communications component 1216 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communications component 1216 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 1200 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.
In an exemplary embodiment, a non-transitory computer-readable storage medium is also provided, such as the memory 1204, comprising instructions executable by the processor 1220 of the apparatus 1200 to perform the above-described method. For example, the non-transitory computer-readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer-readable storage medium is provided, wherein instructions in the storage medium, when executed by a processor of a mobile terminal, cause the mobile terminal to perform a method of voice signal recognition, the method comprising:
acquiring original observation data acquired by at least two acquisition points for at least two sound sources respectively;
performing first-stage noise reduction processing on the original observation data to obtain posterior domain estimation data;
obtaining signal components of each sound source according to the posterior domain estimation data;
performing second-stage noise reduction processing on the signal components of each sound source respectively to obtain beamforming estimated signals;
and obtaining time domain sound source signals of the sound sources according to the beamforming estimated signals. A minimal sketch of this two-stage pipeline is given below.
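As a non-limiting illustration of how the above steps can be chained together, the following sketch (Python with NumPy/SciPy) outlines the data flow only; the helpers blind_source_separation, map_to_components and beamform are hypothetical placeholders for the two processing stages and are not the claimed implementation.

```python
import numpy as np
from scipy.signal import stft, istft

def recognize_sources(x_time, fs, n_sources=2, nperseg=512):
    """x_time: (n_mics, n_samples) raw observation data from at least two
    acquisition points; returns one time-domain signal per sound source."""
    # Frequency-domain observation signals: shape (n_mics, n_freqs, n_frames).
    _, _, X = stft(x_time, fs=fs, nperseg=nperseg)

    # Stage 1: first-stage noise reduction by blind source separation
    # (e.g. independent vector analysis), giving posterior-domain estimates.
    Y = blind_source_separation(X, n_sources)            # hypothetical helper

    # Map the posterior-domain estimates to per-source signal components
    # (desired signal plus residual noise) for the beamforming filter.
    components = map_to_components(Y, X)                  # hypothetical helper

    # Stage 2: second-stage noise reduction by per-source beamforming
    # (e.g. a multichannel Wiener filter), one estimate per sound source.
    S_hat = [beamform(components[k]) for k in range(n_sources)]  # hypothetical

    # Back to the time domain with an inverse STFT (overlap-add).
    return [istft(S, fs=fs, nperseg=nperseg)[1] for S in S_hat]
```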
Fig. 13 is a block diagram illustrating an apparatus 1300 for voice signal recognition according to an example embodiment. For example, apparatus 1300 may be provided as a server. Referring to fig. 13, apparatus 1300 includes a processing component 1322 that further includes one or more processors and memory resources represented by memory 1332 for storing instructions, such as application programs, executable by processing component 1322. The applications stored in memory 1332 may include one or more modules each corresponding to a set of instructions. Further, processing component 1322 is configured to execute instructions to perform the methods described above.
The apparatus 1300 may also include a power component 1326 configured to perform power management of the apparatus 1300, a wired or wireless network interface 1350 configured to connect the apparatus 1300 to a network, and an input/output (I/O) interface 1358. The apparatus 1300 may operate based on an operating system stored in the memory 1332, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
The disclosure provides a voice signal recognition method and device. Original observation data acquired by at least two acquisition points for at least two sound sources are obtained respectively; first-stage noise reduction processing is then performed on the original observation data to obtain posterior domain estimation data; signal components of each sound source are obtained according to the posterior domain estimation data; second-stage noise reduction processing is performed on the signal components of each sound source respectively to obtain beamforming estimated signals; and finally, time domain sound source signals of each sound source are obtained according to the beamforming estimated signals. After blind source separation is performed on the original observation data to obtain the posterior domain estimation data, the signal-to-noise ratio is further improved through beam enhancement to highlight the desired signals, which solves the problems of low sound source positioning accuracy and poor voice recognition quality in scenes with strong interference and a low signal-to-noise ratio, and realizes efficient and interference-robust voice signal recognition.
Through the joint enhancement scheme based on blind source separation and beamforming, blind source separation based on independent vector analysis and beamforming based on multichannel Wiener filtering are processed jointly, residual noise is further removed, the noise reduction performance of the algorithm is greatly improved, voice quality is improved, and the influence of noise on voice signal recognition is reduced.
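For reference, the weight vector of a parameterized multichannel Wiener filter (PMWF) at a single frequency bin is commonly written as w = R_n^{-1} R_s u / (β + tr(R_n^{-1} R_s)), where β trades off noise reduction against speech distortion (β = 0 gives an MVDR-like solution, β = 1 the multichannel Wiener filter) and u selects a reference channel. The sketch below follows this standard rank-1 formulation; the value of β and the reference channel are illustrative assumptions, not values prescribed by the disclosure.

```python
import numpy as np

def pmwf_weights(R_s, R_n, beta=1.0, ref_ch=0):
    """R_s: autocorrelation matrix of the desired signal, R_n: autocorrelation
    matrix of the residual noise, both (n_ch, n_ch) complex Hermitian."""
    A = np.linalg.solve(R_n, R_s)                  # R_n^{-1} R_s
    u = np.zeros(R_s.shape[0]); u[ref_ch] = 1.0    # reference-channel selector
    w = (A @ u) / (beta + np.trace(A).real)        # rank-1 PMWF weight vector
    return w

# Beamformed estimate for one time-frequency point: y = w^H x,
# e.g. y = np.conj(w) @ x
```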
The technical scheme provided by the disclosure can be used for devices that take microphones as sound collection point equipment, and for a sound collection point array formed by two or more collection points, such as a microphone array.
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the invention is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (20)

1. A method of voice signal recognition, comprising:
acquiring original observation data acquired by at least two acquisition points for at least two sound sources respectively;
performing first-stage noise reduction processing on the original observation data based on blind source separation to obtain posterior domain estimation data;
obtaining signal components of each sound source according to the posterior domain estimation data;
performing second-stage noise reduction processing on the signal components of each sound source respectively through PMWF filtering to obtain beamforming estimated signals;
and obtaining time domain sound source signals of the sound sources according to the beamforming estimated signals.
2. The method of claim 1, wherein the step of performing first-stage noise reduction processing on the original observation data to obtain the posterior domain estimation data comprises:
initializing a separation matrix of each frequency point and a weighted covariance matrix of each sound source at each frequency point, wherein the number of rows and the number of columns of the separation matrix are both the number of the sound sources;
obtaining time domain signals at each acquisition point, and constructing an observation signal matrix according to frequency domain signals corresponding to the time domain signals;
obtaining a prior frequency domain estimation of each sound source of the current frame according to the separation matrix of the previous frame and the observation signal matrix;
updating the weighted covariance matrix according to the prior frequency domain estimation;
updating the separation matrix according to the updated weighted covariance matrix;
correcting the updated separation matrix;
and separating the original observation data according to the corrected separation matrix to obtain the posterior domain estimation data of each sound source.
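The following sketch illustrates one way the per-frame, per-frequency update recited in claim 2 can look in code. It uses the widely known auxiliary-function (AuxIVA-style) iterative-projection update for the separation filters, which is only one concrete realization of the update step and differs from the eigenvector-based variant recited in claim 5; the smoothing factor, the contrast weight 1/r and the interfaces are assumptions made for illustration. The spectral-amplitude correction of claim 6 is shown separately after that claim.

```python
import numpy as np

def iva_update_bin(W, V, x, r, alpha=0.96):
    """One frequency bin, one frame.
    W: (n_src, n_src) separation matrix from the previous frame (the number
       of microphones is assumed equal to the number of sound sources).
    V: list of (n_src, n_src) weighted covariance matrices, one per source.
    x: (n_src,) observation signal vector of the current frame at this bin.
    r: (n_src,) full-band magnitudes of the prior source estimates."""
    n_src = W.shape[0]
    y_prior = W @ x                              # prior frequency-domain estimate
    for k in range(n_src):
        phi = 1.0 / max(r[k], 1e-6)              # contrast-function weight
        # recursive update of the weighted covariance matrix of source k
        V[k] = alpha * V[k] + (1.0 - alpha) * phi * np.outer(x, x.conj())
        # iterative-projection update of the k-th separation filter
        e_k = np.zeros(n_src); e_k[k] = 1.0
        w_k = np.linalg.solve(W @ V[k], e_k)
        w_k /= np.sqrt((w_k.conj() @ V[k] @ w_k).real)
        W[k, :] = w_k.conj()
    y_post = W @ x                               # separated (posterior) estimate
    return W, V, y_prior, y_post
```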
3. The method of claim 2, wherein the step of obtaining the prior frequency domain estimation of each sound source of the current frame according to the separation matrix of the previous frame and the observation signal matrix comprises:
separating the observation signal matrix according to the separation matrix of the previous frame to obtain the prior frequency domain estimation of each sound source of the current frame.
4. The method of claim 2, wherein updating the weighted covariance matrix according to the prior frequency domain estimation comprises:
updating the weighted covariance matrix according to the observation signal matrix and the conjugate transpose matrix of the observation signal matrix.
5. The voice signal recognition method of claim 2, wherein the step of updating the separation matrix according to the updated weighted covariance matrix comprises:
calculating an eigenvector of each sound source according to the weighted covariance matrix of each sound source;
and updating the separation matrix of each sound source according to the corresponding eigenvector.
6. The voice signal recognition method of claim 2, wherein the step of correcting the updated separation matrix comprises:
performing spectral amplitude deblurring processing on the separation matrix according to the diagonal matrix of the inverse matrix of the separation matrix, so as to correct spectral distortion of the separation matrix.
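A minimal sketch of this correction step, under the common minimal-distortion-principle reading of "spectral amplitude deblurring": each separation filter is rescaled by the corresponding diagonal entry of the inverse separation matrix, which removes the per-frequency scaling ambiguity left by blind source separation. The reading itself is an assumption for illustration.

```python
import numpy as np

def rescale_separation_matrix(W):
    # Scale each separation filter by the corresponding diagonal entry of the
    # inverse separation matrix (minimal distortion principle), removing the
    # per-frequency amplitude ambiguity of blind source separation.
    return np.diag(np.diag(np.linalg.inv(W))) @ W
```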
7. The voice signal recognition method according to claim 1, wherein the step of obtaining the signal components of the respective sound sources according to the posterior domain estimation data comprises:
obtaining the signal components of each sound source in each channel of the beamforming filter according to the posterior domain estimation data, wherein the signal components comprise a desired signal and a residual noise signal.
8. The voice signal recognition method of claim 7, wherein the step of performing second-stage noise reduction processing on the signal components of each of the sound sources respectively to obtain the beamforming estimated signals comprises:
determining an input signal of each sound source according to the signal components of each sound source;
and performing the following step one and step two on the input signal of each sound source frame by frame:
step one, detecting whether the current frame is a noise frame or a non-noise frame;
step two, in the case where the current frame is a noise frame, estimating the autocorrelation matrix of the desired signal and the autocorrelation matrix of the residual noise signal of the current frame according to the autocorrelation matrix of the desired signal and the autocorrelation matrix of the residual noise signal of the previous frame;
and in the case where the current frame is a non-noise frame, estimating the autocorrelation matrix of the desired signal of the current frame according to the autocorrelation matrix of the desired signal of the previous frame;
obtaining a weight vector of each sound source according to the final autocorrelation matrix of the desired signal and the autocorrelation matrix of the residual noise signal of each sound source;
and obtaining the beamforming estimated signal of each sound source according to the conjugate transpose matrix of the input signal of each sound source and the weight vector.
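As a rough, per-frequency-bin illustration of step one, step two and the subsequent weight computation in claim 8, the sketch below recursively updates the two autocorrelation matrices depending on a noise/non-noise decision and then reuses the pmwf_weights helper from the earlier sketch. The VAD decision itself, the smoothing factors, and the interpretation that the desired-signal statistics are simply carried over in noise frames are assumptions, not details fixed by the claim.

```python
import numpy as np

def beamform_frame(x, R_s, R_n, is_noise_frame, a_s=0.92, a_n=0.98, ref_ch=0):
    """x: (n_ch,) input signal of one sound source at one frequency bin,
    built from its per-channel signal components (desired + residual noise).
    R_s, R_n: autocorrelation matrices of the desired signal and of the
    residual noise signal carried over from the previous frame."""
    xx = np.outer(x, x.conj())
    if is_noise_frame:
        # noise frame: refresh the residual-noise statistics; the desired-signal
        # statistics of the current frame are taken over from the previous frame
        R_n = a_n * R_n + (1.0 - a_n) * xx
    else:
        # non-noise frame: refresh the desired-signal statistics
        R_s = a_s * R_s + (1.0 - a_s) * xx
    w = pmwf_weights(R_s, R_n, beta=1.0, ref_ch=ref_ch)   # see earlier sketch
    y = np.conj(w) @ x                                     # beamformed estimate
    return y, R_s, R_n
```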
9. The voice signal recognition method of claim 8, wherein the step of obtaining the time domain sound source signal of each of the sound sources according to the beamforming estimated signal comprises:
performing an inverse short-time Fourier transform on the beamforming estimated signal of each sound source, and then performing overlap-add to obtain the time domain sound source signal of each sound source.
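Claim 9 corresponds to a standard inverse STFT with overlap-add reconstruction. A minimal sketch, assuming SciPy's istft and illustrative window parameters:

```python
from scipy.signal import istft

def to_time_domain(S_hat, fs=16000, nperseg=512, noverlap=256):
    """S_hat: list of (n_freqs, n_frames) beamforming estimated signals, one
    per sound source; returns one time-domain sound source signal each."""
    signals = []
    for S in S_hat:
        # istft performs the per-frame inverse FFT and the overlap-add
        # reconstruction of the time-domain waveform.
        _, s = istft(S, fs=fs, nperseg=nperseg, noverlap=noverlap)
        signals.append(s)
    return signals
```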
10. A voice signal recognition apparatus, comprising:
an original data acquisition module, used for acquiring original observation data acquired by at least two acquisition points for at least two sound sources respectively;
a first noise reduction module, used for performing first-stage noise reduction processing on the original observation data based on blind source separation to obtain posterior domain estimation data;
a signal mapping module, used for obtaining signal components of each sound source according to the posterior domain estimation data;
a second noise reduction module, used for performing second-stage noise reduction processing on the signal components of each sound source respectively through PMWF filtering to obtain beamforming estimated signals;
and an enhanced signal output module, used for obtaining time domain sound source signals of the sound sources according to the beamforming estimated signals.
11. The voice signal recognition apparatus of claim 10, wherein the first noise reduction module comprises:
a matrix initialization sub-module, used for initializing a separation matrix of each frequency point and a weighted covariance matrix of each sound source at each frequency point, wherein the number of rows and the number of columns of the separation matrix are both the number of the sound sources;
an observation signal matrix construction sub-module, used for obtaining time domain signals at each acquisition point and constructing an observation signal matrix according to frequency domain signals corresponding to the time domain signals;
a prior frequency domain solving sub-module, used for obtaining the prior frequency domain estimation of each sound source of the current frame according to the separation matrix of the previous frame and the observation signal matrix;
a covariance matrix updating sub-module, used for updating the weighted covariance matrix according to the prior frequency domain estimation;
a separation matrix updating sub-module, used for updating the separation matrix according to the updated weighted covariance matrix;
a correction sub-module, used for correcting the updated separation matrix;
and a posterior domain solving sub-module, used for separating the original observation data according to the corrected separation matrix to obtain the posterior domain estimation data of each sound source.
12. The voice signal recognition apparatus of claim 11, wherein,
the prior frequency domain solving sub-module is used for separating the observation signal matrix according to the separation matrix of the previous frame to obtain the prior frequency domain estimation of each sound source of the current frame.
13. The voice signal recognition apparatus of claim 11, wherein,
the covariance matrix updating sub-module is used for updating the weighted covariance matrix according to the observation signal matrix and the conjugate transpose matrix of the observation signal matrix.
14. The voice signal recognition apparatus of claim 11, wherein the separation matrix updating sub-module comprises:
an eigenvector calculation sub-module, used for calculating an eigenvector of each sound source according to the weighted covariance matrix of each sound source;
and an updating sub-module, used for updating the separation matrix of each sound source according to the corresponding eigenvector.
15. The voice signal recognition apparatus of claim 11, wherein,
the correction sub-module is used for performing spectral amplitude deblurring processing on the separation matrix according to the diagonal matrix of the inverse matrix of the separation matrix, so as to correct spectral distortion of the separation matrix.
16. The voice signal recognition apparatus of claim 10, wherein,
the signal mapping module is used for obtaining the signal components of each sound source in each channel of the beamforming filter according to the posterior domain estimation data, wherein the signal components comprise a desired signal and a residual noise signal.
17. The voice signal recognition apparatus of claim 16, wherein the second noise reduction module comprises:
an input signal construction sub-module, configured to determine an input signal of each sound source according to a signal component of each sound source;
an autocorrelation matrix estimation sub-module, used for performing the following step one and step two on the input signal of each sound source frame by frame:
step one, detecting whether the current frame is a noise frame or a non-noise frame;
step two, in the case where the current frame is a noise frame, estimating the autocorrelation matrix of the desired signal and the autocorrelation matrix of the residual noise signal of the current frame according to the autocorrelation matrix of the desired signal and the autocorrelation matrix of the residual noise signal of the previous frame;
and in the case where the current frame is a non-noise frame, estimating the autocorrelation matrix of the desired signal of the current frame according to the autocorrelation matrix of the desired signal of the previous frame;
a weight vector calculation sub-module, used for obtaining the weight vector of each sound source according to the final autocorrelation matrix of the desired signal and the autocorrelation matrix of the residual noise signal of each sound source;
and a beamforming estimated signal acquisition sub-module, used for obtaining the beamforming estimated signal of each sound source according to the conjugate transpose matrix of the input signal of each sound source and the weight vector.
18. The voice signal recognition apparatus of claim 17, wherein,
the enhanced signal output module is used for performing an inverse short-time Fourier transform on the beamforming estimated signal of each sound source and then performing overlap-add to obtain the time domain sound source signal of each sound source.
19. A computer apparatus, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
acquire original observation data acquired by at least two acquisition points for at least two sound sources respectively;
perform first-stage noise reduction processing on the original observation data based on blind source separation to obtain posterior domain estimation data;
obtain signal components of each sound source according to the posterior domain estimation data;
perform second-stage noise reduction processing on the signal components of each sound source respectively through PMWF filtering to obtain beamforming estimated signals;
and obtain time domain sound source signals of the sound sources according to the beamforming estimated signals.
20. A non-transitory computer-readable storage medium, wherein instructions in the storage medium, when executed by a processor of a mobile terminal, cause the mobile terminal to perform a method of voice signal recognition, the method comprising:
acquiring original observation data acquired by at least two acquisition points for at least two sound sources respectively;
performing first-stage noise reduction processing on the original observation data based on blind source separation to obtain posterior domain estimation data;
obtaining signal components of each sound source according to the posterior domain estimation data;
performing second-stage noise reduction processing on the signal components of each sound source respectively through PMWF filtering to obtain beamforming estimated signals;
and obtaining time domain sound source signals of the sound sources according to the beamforming estimated signals.
CN202110572969.6A 2021-05-25 2021-05-25 Voice signal identification method and device Active CN113314135B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110572969.6A CN113314135B (en) 2021-05-25 2021-05-25 Voice signal identification method and device


Publications (2)

Publication Number Publication Date
CN113314135A CN113314135A (en) 2021-08-27
CN113314135B true CN113314135B (en) 2024-04-26

Family

ID=77374659

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110572969.6A Active CN113314135B (en) 2021-05-25 2021-05-25 Voice signal identification method and device

Country Status (1)

Country Link
CN (1) CN113314135B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113506582A (en) * 2021-05-25 2021-10-15 北京小米移动软件有限公司 Sound signal identification method, device and system
CN113628633A (en) * 2021-10-14 2021-11-09 辰风策划(深圳)有限公司 Noise reduction method for multi-channel information transmission of enterprise multi-party meeting
CN115862685B (en) * 2023-02-27 2023-09-15 全时云商务服务股份有限公司 Real-time voice activity detection method and device and electronic equipment

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105609112A (en) * 2016-01-15 2016-05-25 苏州宾果智能科技有限公司 Sound source positioning method and apparatus and time delay estimation method and apparatus
KR20160116440A (en) * 2015-03-30 2016-10-10 한국전자통신연구원 SNR Extimation Apparatus and Method of Voice Recognition System
CN106504763A (en) * 2015-12-22 2017-03-15 电子科技大学 Based on blind source separating and the microphone array multiple target sound enhancement method of spectrum-subtraction
CN106952653A (en) * 2017-03-15 2017-07-14 科大讯飞股份有限公司 Noise remove method, device and terminal device
CN108877827A (en) * 2017-05-15 2018-11-23 福州瑞芯微电子股份有限公司 Voice-enhanced interaction method and system, storage medium and electronic equipment
CN108986838A (en) * 2018-09-18 2018-12-11 东北大学 A kind of adaptive voice separation method based on auditory localization
CN111009256A (en) * 2019-12-17 2020-04-14 北京小米智能科技有限公司 Audio signal processing method and device, terminal and storage medium
CN111128221A (en) * 2019-12-17 2020-05-08 北京小米智能科技有限公司 Audio signal processing method and device, terminal and storage medium
CN111489753A (en) * 2020-06-24 2020-08-04 深圳市友杰智新科技有限公司 Anti-noise sound source positioning method and device and computer equipment
CN111681665A (en) * 2020-05-20 2020-09-18 浙江大华技术股份有限公司 Omnidirectional noise reduction method, equipment and storage medium
CN111863012A (en) * 2020-07-31 2020-10-30 北京小米松果电子有限公司 Audio signal processing method and device, terminal and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8098842B2 (en) * 2007-03-29 2012-01-17 Microsoft Corp. Enhanced beamforming for arrays of directional microphones


Also Published As

Publication number Publication date
CN113314135A (en) 2021-08-27

Similar Documents

Publication Publication Date Title
CN113314135B (en) Voice signal identification method and device
CN111009256B (en) Audio signal processing method and device, terminal and storage medium
CN111128221B (en) Audio signal processing method and device, terminal and storage medium
CN110808063A (en) Voice processing method and device for processing voice
US20200184985A1 (en) Multi-stream target-speech detection and channel fusion
CN111009257B (en) Audio signal processing method, device, terminal and storage medium
CN111402917B (en) Audio signal processing method and device and storage medium
CN111429933B (en) Audio signal processing method and device and storage medium
CN111179960B (en) Audio signal processing method and device and storage medium
EP4254408A1 (en) Speech processing method and apparatus, and apparatus for processing speech
CN113506582A (en) Sound signal identification method, device and system
CN113053406A (en) Sound signal identification method and device
CN113223553B (en) Method, apparatus and medium for separating voice signal
EP3929920B1 (en) Method and device for processing audio signal, and storage medium
CN111667842B (en) Audio signal processing method and device
CN113488066A (en) Audio signal processing method, audio signal processing apparatus, and storage medium
CN113362848B (en) Audio signal processing method, device and storage medium
CN113223548B (en) Sound source positioning method and device
CN112863537B (en) Audio signal processing method, device and storage medium
CN113223543B (en) Speech enhancement method, device and storage medium
CN113113036B (en) Audio signal processing method and device, terminal and storage medium
EP4113515A1 (en) Sound processing method, electronic device and storage medium
CN113362847A (en) Audio signal processing method and device and storage medium
CN117880732A (en) Spatial audio recording method, device and storage medium
CN117121104A (en) Estimating an optimized mask for processing acquired sound data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant