CN113314135A - Sound signal identification method and device - Google Patents

Sound signal identification method and device

Info

Publication number
CN113314135A
CN113314135A (application CN202110572969.6A)
Authority
CN
China
Prior art keywords
signal
matrix
sound source
sound
estimation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110572969.6A
Other languages
Chinese (zh)
Other versions
CN113314135B (en)
Inventor
何琪琪
侯海宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xiaomi Mobile Software Co Ltd
Beijing Xiaomi Pinecone Electronic Co Ltd
Original Assignee
Beijing Xiaomi Mobile Software Co Ltd
Beijing Xiaomi Pinecone Electronic Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xiaomi Mobile Software Co Ltd, Beijing Xiaomi Pinecone Electronic Co Ltd filed Critical Beijing Xiaomi Mobile Software Co Ltd
Priority to CN202110572969.6A priority Critical patent/CN113314135B/en
Publication of CN113314135A publication Critical patent/CN113314135A/en
Application granted granted Critical
Publication of CN113314135B publication Critical patent/CN113314135B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0216 - Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161 - Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 - Microphone arrays; Beamforming

Abstract

The disclosure relates to a sound signal recognition method and device in the field of intelligent voice interaction, and addresses the problems of low sound source localization accuracy and poor speech recognition quality in scenes with strong interference and a low signal-to-noise ratio. The method comprises the following steps: acquiring original observation data collected by at least two acquisition points for at least two sound sources; performing a first-stage noise reduction process on the original observation data to obtain posterior domain estimation data; obtaining the signal components of each sound source from the posterior domain estimation data; performing a second-stage noise reduction process on the signal components of each sound source to obtain beamforming estimation signals; and obtaining the time-domain sound source signal of each sound source from the beamforming estimation signals. The technical solution provided by the disclosure is suitable for intelligent voice systems and realizes efficient, highly interference-resistant voice signal recognition.

Description

Sound signal identification method and device
Technical Field
The present disclosure relates to intelligent voice interaction technologies, and in particular, to a sound signal recognition method and apparatus.
Background
In the era of the Internet of Things and AI, intelligent voice, as one of the core technologies of artificial intelligence, has enriched the modes of human-computer interaction and greatly improved the convenience of intelligent products.
Intelligent devices mainly use a microphone array composed of multiple microphones for sound pickup, and apply microphone beamforming or blind source separation technology to suppress environmental interference and improve the processing quality of the speech signal, thereby improving the speech recognition rate in real environments.
Microphone beamforming requires estimating the sound source direction. In addition, to provide stronger intelligence and perception, an intelligent device is usually equipped with an indicator light; when the indicator light accurately points to the user rather than to the interference during interaction, the user feels as if conversing face to face with the device, which enhances the interactive experience. Therefore, in an environment with interfering sound sources, accurately estimating the direction of the user (i.e., the sound source) is important.
Generally, sound source direction-finding algorithms directly use the data collected by the microphones and perform direction estimation with algorithms such as Steered Response Power with Phase Transform (SRP-PHAT). However, such algorithms depend on the signal-to-noise ratio of the signal: under low signal-to-noise-ratio conditions the accuracy is not high enough, the direction of an interfering sound source is easily found instead, the effective sound source cannot be accurately localized, and the accuracy of speech recognition is further affected.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides a sound signal recognition method and apparatus. After posterior domain data are obtained through a first noise reduction, signal components are formed and further noise reduction is performed through beamforming, achieving high-quality speech recognition.
According to a first aspect of the embodiments of the present disclosure, there is provided a sound signal identification method, including:
acquiring original observation data acquired by at least two acquisition points for at least two sound sources respectively;
performing first-stage noise reduction processing on the original observation data to obtain posterior domain estimation data;
obtaining signal components of each sound source according to the posterior domain estimation data;
respectively carrying out secondary noise reduction processing on the signal components of the sound sources to obtain beam forming estimation signals;
and obtaining time domain sound source signals of the sound sources according to the beam forming estimation signals.
Further, the step of performing a first-stage noise reduction process on the original observation data to obtain posterior domain estimation data includes:
initializing a separation matrix of each frequency point and a weighted covariance matrix of each sound source at each frequency point, wherein the number of rows and columns of the separation matrix is the number of the sound sources;
obtaining time domain signals at each acquisition point, and constructing an observation signal matrix according to frequency domain signals corresponding to the time domain signals;
solving prior frequency domain estimation of each sound source of the current frame according to the separation matrix of the previous frame and the observation signal matrix;
updating the weighted covariance matrix according to the prior frequency domain estimate;
updating the separation matrix according to the updated weighted covariance matrix;
correcting the updated separation matrix;
and separating the original observation data according to the corrected separation matrix, and obtaining the posterior domain estimation data of each sound source through separation.
Further, the step of obtaining the prior frequency domain estimation of each sound source of the current frame according to the separation matrix of the previous frame and the observation signal matrix comprises:
and separating the observation signal matrix according to the separation matrix of the previous frame to obtain the prior frequency domain estimation of each sound source of the current frame.
Further, the step of updating the weighted covariance matrix based on the a priori frequency domain estimates comprises:
and updating the weighted covariance matrix according to the observation signal matrix and the conjugate transpose matrix of the observation signal matrix.
Further, the step of updating the separation matrix according to the updated weighted covariance matrix includes:
calculating the eigenvectors of each sound source according to the weighted covariance matrix of each sound source;
respectively updating the separation matrix of each sound source according to the characteristic vector;
and updating the separation matrix to be a conjugate transpose matrix after the separation matrixes of the sound sources are combined.
Further, the step of correcting the updated separation matrix includes:
and according to the diagonal matrix of the inverse matrix of the separation matrix, carrying out spectrum amplitude deblurring processing on the separation matrix so as to correct the spectrum distortion of the separation matrix.
Further, the step of obtaining the signal component of each sound source according to the a posteriori domain estimation data comprises:
and obtaining signal components of each sound source in each channel of the beam forming filter according to the posterior domain estimation data, wherein the signal components comprise an expected signal and a residual noise signal.
Further, the step of performing a second-stage noise reduction processing on the signal components of each sound source to obtain a beamforming estimation signal includes:
determining input signals of the sound sources according to the signal components of the sound sources;
respectively carrying out the processing of the steps one to two on the input signals of the sound sources frame by frame:
step one, detecting whether the current frame is a noise frame or a non-noise frame,
step two, under the condition that the current frame is a noise frame, estimating the autocorrelation matrix of the expected signal of the current frame and the autocorrelation matrix of the residual noise signal according to the autocorrelation matrix of the expected signal of the previous frame and the autocorrelation matrix of the residual noise signal,
estimating an autocorrelation matrix of the desired signal of a current frame according to an autocorrelation matrix of the desired signal of a previous frame in the case that the current frame is a non-noise frame;
obtaining a weight vector of each sound source according to the final autocorrelation matrix of the expected signal and the autocorrelation matrix of the residual noise signal of each sound source;
and obtaining a beam forming estimation signal of each sound source according to the conjugate transpose matrix of the input signal of each sound source and the weight vector.
Further, the step of obtaining the time-domain sound source signal of each sound source according to the beamforming estimation signal includes:
and performing short-time Fourier inverse transformation on the beam forming estimation signals of each sound source, and overlapping and adding to obtain time-domain sound source signals of each sound source.
According to a second aspect of the embodiments of the present disclosure, there is provided a sound signal identification apparatus, including:
the original data acquisition module is used for acquiring original observation data acquired by at least two acquisition points for at least two sound sources respectively;
the first noise reduction module is used for carrying out first-stage noise reduction processing on the original observation data to obtain posterior domain estimation data;
the signal mapping module is used for obtaining the signal components of each sound source according to the posterior domain estimation data;
the second noise reduction module is used for respectively carrying out secondary noise reduction processing on the signal components of the sound sources to obtain beam forming estimation signals;
and the enhanced signal output module is used for obtaining the time domain sound source signal of each sound source according to the beam forming estimation signal.
Further, the first noise reduction module includes:
the matrix initialization submodule is used for initializing a separation matrix of each frequency point and a weighted covariance matrix of each sound source at each frequency point, and the number of rows and the number of columns of the separation matrix are the number of the sound sources;
the observation signal matrix construction submodule is used for solving time domain signals at each acquisition point and constructing an observation signal matrix according to the frequency domain signals corresponding to the time domain signals;
the prior frequency domain solving submodule is used for solving prior frequency domain estimation of each sound source of the current frame according to the separation matrix of the previous frame and the observation signal matrix;
a covariance matrix update submodule for updating the weighted covariance matrix according to the prior frequency domain estimate;
a separation matrix updating submodule for updating the separation matrix according to the updated weighted covariance matrix;
a correction submodule for correcting the updated separation matrix;
and the posterior domain solving submodule is used for separating the original observation data according to the corrected separation matrix and obtaining posterior domain estimation data of each sound source through separation.
Further, the prior frequency domain obtaining submodule is configured to separate the observed signal matrix according to the separation matrix of the previous frame, so as to obtain a prior frequency domain estimate of each sound source of the current frame.
Further, the covariance matrix update sub-module is configured to update the weighted covariance matrix according to the observed signal matrix and a conjugate transpose matrix of the observed signal matrix.
Further, the separation matrix update sub-module includes:
the eigenvalue operator module is used for calculating the eigenvector of each sound source according to the weighted covariance matrix of each sound source;
the first updating submodule is used for respectively updating the separation matrix of each sound source according to the characteristic vector;
and the second updating submodule is used for updating the separation matrix into a conjugate transpose matrix after the separation matrixes of the sound sources are combined.
Further, the correction submodule is configured to perform spectrum amplitude deblurring processing on the separation matrix according to a diagonal matrix of an inverse matrix of the separation matrix, so as to correct the spectrum distortion of the separation matrix.
Further, the signal mapping module is configured to obtain, according to the a posteriori domain estimation data, signal components of each sound source in each channel of the beamforming filter, where the signal components include an expected signal and a residual noise signal.
Further, the second noise reduction module includes:
an input signal construction submodule for determining an input signal for each of the sound sources based on the signal components of each of the sound sources;
an autocorrelation matrix estimation submodule, configured to perform the processing in steps one to two on the input signals of the sound sources frame by frame respectively:
step one, detecting whether the current frame is a noise frame or a non-noise frame,
step two, under the condition that the current frame is a noise frame, estimating the autocorrelation matrix of the expected signal of the current frame and the autocorrelation matrix of the residual noise signal according to the autocorrelation matrix of the expected signal of the previous frame and the autocorrelation matrix of the residual noise signal,
estimating an autocorrelation matrix of the desired signal of a current frame according to an autocorrelation matrix of the desired signal of a previous frame in the case that the current frame is a non-noise frame;
a weight vector calculation submodule configured to obtain a weight vector of each sound source according to a final autocorrelation matrix of the desired signal and an autocorrelation matrix of the residual noise signal of each sound source;
and the beam forming estimation signal acquisition sub-module is used for obtaining the beam forming estimation signal of each sound source according to the conjugate transpose matrix of the input signal of each sound source and the weight vector.
Further, the enhanced signal output module is configured to perform short-time fourier inverse transformation on the beamforming estimation signals of each sound source, and then overlap and add the beamforming estimation signals to obtain a time-domain sound source signal of each sound source.
According to a third aspect of embodiments of the present disclosure, there is provided a computer apparatus comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
acquiring original observation data acquired by at least two acquisition points for at least two sound sources respectively;
performing first-stage noise reduction processing on the original observation data to obtain posterior domain estimation data;
obtaining signal components of each sound source according to the posterior domain estimation data;
respectively carrying out secondary noise reduction processing on the signal components of the sound sources to obtain beam forming estimation signals;
and obtaining time domain sound source signals of the sound sources according to the beam forming estimation signals.
According to a fourth aspect of embodiments of the present disclosure, there is provided a non-transitory computer-readable storage medium having instructions therein, which when executed by a processor of a mobile terminal, enable the mobile terminal to perform a sound signal recognition method, the method comprising:
acquiring original observation data acquired by at least two acquisition points for at least two sound sources respectively;
performing first-stage noise reduction processing on the original observation data to obtain posterior domain estimation data;
obtaining signal components of each sound source according to the posterior domain estimation data;
respectively carrying out secondary noise reduction processing on the signal components of the sound sources to obtain beam forming estimation signals;
and obtaining time domain sound source signals of the sound sources according to the beam forming estimation signals.
The technical solution provided by the embodiments of the present disclosure can have the following beneficial effects: original observation data collected by at least two acquisition points for at least two sound sources are acquired; a first-stage noise reduction process is performed on the original observation data to obtain posterior domain estimation data; the signal components of each sound source are obtained from the posterior domain estimation data; a second-stage noise reduction process is performed on the signal components of each sound source to obtain beamforming estimation signals; and finally the time-domain sound source signal of each sound source is obtained from the beamforming estimation signals. After the original observation data are subjected to blind source separation to obtain the posterior domain estimation data, the signal-to-noise ratio is further improved through beam enhancement to highlight the signals. This solves the problems of low sound source localization accuracy and poor speech recognition quality in scenes with strong interference and a low signal-to-noise ratio, and realizes efficient, highly interference-resistant voice signal recognition.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
Fig. 1 is a flow chart illustrating a method of sound signal recognition according to an exemplary embodiment.
Fig. 2 is a flow chart illustrating yet another method of sound signal identification according to an example embodiment.
FIG. 3 is a schematic diagram of a two microphone acquisition point reception scenario.
Fig. 4 is a schematic diagram illustrating a blind source separation principle according to an exemplary embodiment.
Fig. 5 is a schematic diagram illustrating a principle of a blind source separation and beamforming combined application according to an example embodiment.
Fig. 6 is a flow chart illustrating yet another sound signal identification method according to an example embodiment.
Fig. 7 is a flowchart illustrating yet another sound signal recognition method according to an example embodiment.
Fig. 8 is a block diagram illustrating a voice signal recognition apparatus according to an exemplary embodiment.
Fig. 9 is a schematic diagram illustrating a structure of a first noise reduction module 802 according to an exemplary embodiment.
Fig. 10 is a schematic structural diagram illustrating the separation matrix update submodule 905 according to an exemplary embodiment.
FIG. 11 is a schematic diagram illustrating a structure of a second noise reduction module 804 according to an exemplary embodiment.
Fig. 12 is a block diagram illustrating an apparatus (a general structure of a mobile terminal) according to an example embodiment.
Fig. 13 is a block diagram showing an apparatus (general structure of a server) according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
Generally, a sound source direction-finding algorithm directly uses the data collected by the microphones and performs direction estimation with algorithms such as Steered Response Power with Phase Transform (SRP-PHAT). However, such algorithms depend on the signal-to-noise ratio of the signal; under low signal-to-noise-ratio conditions the accuracy is not high enough, and with interfering sound sources in various directions the effective sound source cannot be accurately localized.
In order to solve the above problems, embodiments of the present disclosure provide a sound signal recognition method and apparatus. The collected original data are subjected to a noise reduction process to obtain posterior domain data; a second noise reduction process is then performed on the residual noise in the posterior domain data to further improve the signal-to-noise ratio and obtain the final time-domain sound source signal. This eliminates the influence of interfering sound sources, solves the problem of low sound source localization accuracy in scenes with strong interference and a low signal-to-noise ratio, and realizes efficient, highly interference-resistant voice signal recognition.
An exemplary embodiment of the present disclosure provides a sound signal identification method, where a process of acquiring a sound source signal identification result is completed by the method shown in fig. 1, and the method includes:
Step 101, acquiring original observation data collected by at least two acquisition points for at least two sound sources respectively.
In this embodiment, the collection point may be a microphone. For example, there may be multiple microphones disposed on the same device, the multiple microphones making up a microphone array.
In this step, data acquisition is performed at each acquisition point, and the acquired data may be from multiple sound sources. The plurality of sound sources may include a target effective sound source and may also include an interfering sound source.
The acquisition point acquires original observation data of at least two sound sources.
Step 102, performing a first-stage noise reduction process on the original observation data to obtain posterior domain estimation data.
In this step, the acquired original observation data is subjected to a first-stage noise reduction process to eliminate noise influence generated by an interference sound source and the like.
The original observation data can be subjected to first-stage noise reduction treatment based on a blind source separation technology to obtain the posterior domain estimation data after noise reduction.
Blind source separation based on independent vector analysis requires no prior information about the source signals or the transmission channels; according to the principle of statistical independence, an optimization algorithm decomposes the received observation signals into approximate desired signals and noise signals and strips off the noise signals, thereby improving the signal-to-noise ratio. Meanwhile, the approximate desired signals obtained through blind source separation still retain the spatial information between the microphone array and the sound sources.
Step 103, obtaining signal components of each sound source according to the posterior domain estimation data.
In the present disclosure, after the posterior domain estimation data are obtained, the posterior domain estimation data are further processed and the noise is filtered out by an adaptive filter, in order to further eliminate the residual noise and obtain signals with a higher signal-to-noise ratio.
Step 104, respectively performing a second-stage noise reduction process on the signal components of the sound sources to obtain beamforming estimation signals.
In this step, for the residual noise interference in the posterior domain estimation data obtained in step 103, a beamforming technique is used to perform a second-stage noise reduction process in order to further improve the sound signal quality. Beamforming can enhance the sound source signal and suppress signals from other directions (signals that may interfere with the sound source signal), further improving the signal-to-noise ratio of the sound source signal, so that sound source localization and recognition can then be carried out on this basis to obtain more accurate results.
In this step, the second-stage noise reduction process may be performed using beamforming based on a Parameterized Multichannel Wiener Filter (PMWF).
Step 105, obtaining time-domain sound source signals of the sound sources according to the beamforming estimation signals.
In this step, the beamforming estimation signals are transformed by the inverse short-time Fourier transform (ISTFT) and overlap-add to obtain time-domain signals with an enhanced signal-to-noise ratio after the beam separation processing. Compared with the posterior domain estimation data, these time-domain signals contain less noise and can truly and accurately reflect the sound emitted by each source, realizing accurate and efficient sound signal recognition.
An exemplary embodiment of the present disclosure further provides a sound signal identification method, which performs noise reduction processing on original observation data based on blind source separation to obtain posterior domain estimation data, where a specific flow is shown in fig. 2, and includes:
step 201, initializing a separation matrix of each frequency point and a weighted covariance matrix of each sound source at each frequency point.
In this step, the number of rows and columns of the separation matrix is the number of sound sources, and the weighted covariance matrix is initialized as a zero matrix.
In this embodiment, a scene with two microphones as acquisition points is taken as an example. As shown in fig. 3, smart speaker A has two microphones, mic1 and mic2, and there are two sound sources, s1 and s2, in the space around smart speaker A. The signals from both sources can be picked up by both microphones, so the signals of the two sound sources are mixed together in each microphone.
The main idea of blind source separation is to find a separation matrix
W = [[W11, W12], [W21, W22]]
where W11, W12, W21 and W22 are the elements of the separation matrix W.
According to the separation matrix, the sound sources s1 and s2 can be separated from the signals x1 and x2 picked up by the two microphones; the implementation principle is shown in fig. 4. Through the separation matrix W, the posterior domain estimated signal component y11[k] of sound source s1 and the posterior domain estimated signal component y21[k] of sound source s2 are separated from the original observation data x1[k] collected by mic1, and the posterior domain estimated signal component y12[k] of sound source s1 and the posterior domain estimated signal component y22[k] of sound source s2 are separated from the original observation data x2[k] collected by mic2, from which the posterior domain estimation data y1[k] and y2[k] are obtained. y1[k] is the posterior domain data of sound source s1 obtained from y11[k] and y12[k]; y2[k] is the posterior domain data of sound source s2 obtained from y21[k] and y22[k].
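As a purely illustrative aid (not part of the disclosed method), the per-frequency-bin separation of fig. 4 can be sketched in a few lines of numpy. The variable names x1, x2 and W and the random test values are assumptions made only for this example.

```python
import numpy as np

# Hypothetical separation at one time-frequency bin for the two-microphone scenario of fig. 3/4.
# x1, x2: complex STFT values of mic1 and mic2 at one bin (random stand-ins here).
rng = np.random.default_rng(0)
x1, x2 = rng.standard_normal(2) + 1j * rng.standard_normal(2)
W = np.array([[0.9, -0.3], [-0.2, 1.1]], dtype=complex)   # example separation matrix W

# y[0] combines the components attributed to sound source s1 (y11 and y12 in the text),
# y[1] combines those attributed to sound source s2 (y21 and y22).
y = W @ np.array([x1, x2])
print("posterior-domain estimates y1[k], y2[k]:", y)
```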
In this step, the separation matrix of each frequency point is initialized according to expression (1):
W(k) = [[1, 0], [0, 1]],  k = 1, ..., K    (1)
The initialized separation matrix is an identity matrix. k is a variable representing the frequency point, indicating that the current calculation concerns the k-th frequency point, where k takes the values 1, 2, 3, ..., K and K is the total number of frequency points. The total number of frequency points K is Nfft/2 + 1, where Nfft is the frame length of each subframe in the sound system of smart speaker A.
w1(k, τ) and w2(k, τ) denote the column elements of W, and τ represents the current frame number.
The weighted covariance matrix Vp(k) of each sound source at each frequency point is initialized as a zero matrix according to expression (2):
Vp(k) = [[0, 0], [0, 0]],  k = 1, ..., K;  p = 1, 2    (2)
where k = 1, ..., K denotes the k-th frequency point and p is the microphone number, p = 1, 2.
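For illustration, the following is a minimal numpy sketch of the initialization in step 201, assuming two microphones/sound sources. The array shapes, variable names and the value of Nfft are assumptions of this example, not specified by the disclosure.

```python
import numpy as np

n_fft = 512                      # assumed frame length Nfft
K = n_fft // 2 + 1               # total number of frequency points, K = Nfft/2 + 1
P = 2                            # number of sound sources / microphones

# Expression (1): one separation matrix per frequency point, initialized to the identity.
W = np.tile(np.eye(P, dtype=complex), (K, 1, 1))          # shape (K, P, P)

# Expression (2): weighted covariance matrix of each source at each frequency point,
# initialized to the zero matrix.
V = np.zeros((P, K, P, P), dtype=complex)                 # V[p, k] is a P x P zero matrix
```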
Step 202, obtaining time domain signals at each acquisition point, and constructing an observation signal matrix according to frequency domain signals corresponding to the time domain signals.
Let xp(n, m) denote the time-domain signal of the n-th frame of the p-th microphone, where p = 1, 2 and m = 1, 2, ..., Nfft. Each frame obtained according to Nfft is windowed and then transformed by the Fourier transform (FFT) to obtain the corresponding frequency-domain signal Xp(k, n), as shown in expression (3):
Xp(k, n) = FFT( win(m)·xp(n, m) ),  m = 1, ..., Nfft    (3)
where win(m) denotes the analysis window.
The observation signal matrix is then constructed as shown in expression (4):
X(k, n) = [X1(k, n), X2(k, n)]^T    (4)
where k = 1, ..., K.
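A minimal sketch of step 202 under common assumptions (a Hann analysis window, 50% overlap, one-sided FFT); the disclosure does not fix these parameters, so they are illustrative only.

```python
import numpy as np

def observation_matrix(mics, n_fft=512, hop=256):
    """Build X(k, n) = [X1(k,n), X2(k,n)]^T from time-domain microphone signals.

    mics: array of shape (P, T). Returns an array of shape (K, N, P) so that
    X[k, n] is the observation signal matrix of expression (4) at bin (k, n).
    """
    win = np.hanning(n_fft)                               # assumed analysis window
    P, T = mics.shape
    n_frames = 1 + (T - n_fft) // hop
    K = n_fft // 2 + 1
    X = np.empty((K, n_frames, P), dtype=complex)
    for n in range(n_frames):
        frame = mics[:, n * hop:n * hop + n_fft] * win    # windowing, expression (3)
        X[:, n, :] = np.fft.rfft(frame, axis=1).T         # FFT of each microphone frame
    return X

# Example with random data standing in for two microphone recordings.
x = np.random.randn(2, 16000)
X = observation_matrix(x)
print(X.shape)   # (K, N, 2)
```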
Step 203, solving the prior frequency domain estimation of each sound source of the current frame according to the separation matrix of the previous frame and the observation signal matrix.
In this step, the observed signal matrix is first separated according to the separation matrix of the previous frame to obtain the prior frequency domain estimation of each sound source of the current frame.
Specifically, for the scenario shown in fig. 3, let Y(k, n) = [Y1(k, n), Y2(k, n)]^T, k = 1, ..., K, where Y1(k, n) and Y2(k, n) are the prior frequency-domain estimates of sound sources s1 and s2 at the time-frequency bin (k, n). As shown in expression (5), they are obtained by separating the observation matrix X(k, n) with the separation matrix W(k):
Y(k, n) = W(k)X(k, n),  k = 1, ..., K    (5)
The frequency-domain estimate of the p-th sound source over the entire frequency band of the n-th frame is then as shown in expression (6):
Yp(n) = [Yp(1, n), Yp(2, n), ..., Yp(K, n)]^T    (6)
and 204, updating the weighted covariance matrix according to the prior frequency domain estimation.
In this step, the weighted covariance matrix is updated according to the observed signal matrix and the conjugate transpose matrix of the observed signal matrix.
For the application scenario shown in fig. 3, the weighted covariance matrix Vp(k, n) is updated according to expression (7):
Vp(k, n) = α·Vp(k, n-1) + (1-α)·φp(n)·X(k, n)X(k, n)^H    (7)
where α is a forgetting (smoothing) factor and φp(n) is a weighting coefficient, which can be obtained according to expressions (8) and (9):
rp(n) = sqrt( Σ_k |Yp(k, n)|^2 )    (8)
φp(n) = G'(rp(n)) / rp(n)    (9)
A contrast function G(Yp(n)) is defined according to expression (10):
G(Yp(n)) = -log p(Yp(n))    (10)
where p(Yp(n)) represents a multidimensional super-Gaussian prior probability distribution model of the p-th sound source over the entire frequency band. In a typical algorithm it can be calculated according to expression (11):
p(Yp(n)) ∝ exp( -sqrt( Σ_k |Yp(k, n)|^2 ) )    (11)
where exp denotes the exponential function.
Accordingly, the contrast function is as shown in expression (12):
G(Yp(n)) = sqrt( Σ_k |Yp(k, n)|^2 ) = rp(n)    (12)
so that the weighting coefficient can be obtained from expressions (13) and (14):
G'(rp(n)) = 1    (13)
φp(n) = 1 / sqrt( Σ_k |Yp(k, n)|^2 )    (14)
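The following is a minimal numpy sketch of one frame of the covariance update in step 204, assuming the standard online auxiliary-function IVA form with a spherical super-Gaussian (Laplacian) prior, which is how expressions (7) to (14) are reconstructed above; the forgetting factor value and the variable names are assumptions of this example.

```python
import numpy as np

def update_weighted_covariance(V, X_frame, Y_frame, alpha=0.96, eps=1e-8):
    """One-frame update of the weighted covariance matrices V[p, k] (expression (7)).

    V:       (P, K, P, P) current weighted covariance matrices
    X_frame: (K, P) observation signal X(k, n) of the current frame
    Y_frame: (P, K) prior frequency-domain estimates Yp(k, n) of the current frame
    alpha:   assumed forgetting factor; eps avoids division by zero.
    """
    P, K = Y_frame.shape
    for p in range(P):
        # Expressions (8), (9) and (12)-(14): with the spherical Laplacian prior the
        # weighting coefficient is phi_p(n) = 1 / sqrt(sum_k |Yp(k, n)|^2).
        r_p = np.sqrt(np.sum(np.abs(Y_frame[p]) ** 2)) + eps
        phi_p = 1.0 / r_p
        for k in range(K):
            xxH = np.outer(X_frame[k], X_frame[k].conj())          # X(k,n) X(k,n)^H
            V[p, k] = alpha * V[p, k] + (1.0 - alpha) * phi_p * xxH
    return V
```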
and step 205, updating the separation matrix according to the updated weighted covariance matrix.
In this step, the eigenvectors of each sound source are calculated according to the weighted covariance matrix of each sound source, and then the separation matrix of each sound source is updated according to the eigenvectors.
For the application scenario shown in fig. 3, the equation shown in expression (15) may be solved, and the eigenvalue of the sound source s1 is obtained according to expression (16), and the eigenvector of the sound source s1 is obtained according to expression (17).
V2(k,n)ep(k,n)=λp(k,n)V1(k,n)ep(k,n),p=1,2 (15)
Eigenvalue of sound source s1:
λ1(k, n) = ( Tr(H(k, n)) - sqrt( Tr(H(k, n))^2 - 4·det(H(k, n)) ) ) / 2    (16)
where Tr denotes the trace of a matrix, det denotes the determinant of a matrix, and H can be obtained according to expression (20).
Eigenvector of sound source s1:
e1(k, n) = [ H12(k, n), λ1(k, n) - H11(k, n) ]^T    (17)
The eigenvalue of the sound source s2 is obtained according to expression (18), and the eigenvector of the sound source s2 according to expression (19).
Eigenvalue of sound source s2:
λ2(k, n) = ( Tr(H(k, n)) + sqrt( Tr(H(k, n))^2 - 4·det(H(k, n)) ) ) / 2    (18)
Eigenvector of sound source s2:
e2(k, n) = [ H12(k, n), λ2(k, n) - H11(k, n) ]^T    (19)
where
H(k, n) = V1(k, n)^(-1)·V2(k, n)    (20)
and H11, H12, H21, H22 are the elements of the matrix H(k, n).
Then, the separation matrix W(k) = [w1(k), w2(k)]^H of each frequency point is calculated according to expression (21):
wp(k) = ep(k, n) / sqrt( ep(k, n)^H·Vp(k, n)·ep(k, n) ),  p = 1, 2    (21)
where k = 1, ..., K and (·)^H denotes the conjugate transpose.
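A sketch of the two-source eigenvector-based update of step 205, following the reconstruction of expressions (15) to (21) above; which closed-form root is assigned to which source, and the numerical regularization, are assumptions of this example.

```python
import numpy as np

def update_separation_matrix(V1_k, V2_k, eps=1e-12):
    """Update W(k) for one frequency point from the two weighted covariance
    matrices V1(k, n) and V2(k, n) (expressions (15)-(21), two-source case)."""
    H = np.linalg.inv(V1_k) @ V2_k                         # expression (20): H = V1^-1 V2
    tr, det = np.trace(H), np.linalg.det(H)
    disc = np.sqrt(tr ** 2 - 4.0 * det)
    lam = [(tr - disc) / 2.0, (tr + disc) / 2.0]           # expressions (16), (18)
    w = []
    for lam_p, V_p in zip(lam, (V1_k, V2_k)):
        e_p = np.array([H[0, 1], lam_p - H[0, 0]])         # expressions (17), (19)
        # Expression (21): normalize by sqrt(e^H V_p e) to obtain w_p(k).
        scale = np.sqrt(np.real(e_p.conj() @ V_p @ e_p)) + eps
        w.append(e_p / scale)
    return np.vstack(w).conj()                             # W(k) = [w1(k), w2(k)]^H

# Example with random positive-definite matrices standing in for V1(k,n), V2(k,n).
rng = np.random.default_rng(1)
A = rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))
B = rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))
V1, V2 = A @ A.conj().T + np.eye(2), B @ B.conj().T + np.eye(2)
print(update_separation_matrix(V1, V2))
```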
Step 206, correcting the updated separation matrix.
In this step, according to the diagonal matrix of the inverse matrix of the separation matrix, spectrum amplitude deblurring processing is performed on the separation matrix to correct its spectral distortion.
For example, W[k] may be corrected using expression (22):
W[k] = diag(W^(-1)[k])·W[k]    (22)
where W^(-1)[k] is the inverse matrix of W[k] and diag(·) denotes the diagonal matrix formed by the principal diagonal elements.
Step 207, separating the original observation data according to the corrected separation matrix, and obtaining the posterior domain estimation data of each sound source through separation.
The original microphone signals are separated using W(k) to obtain the posterior frequency-domain estimates of the sound source signals:
Y(k, n) = [Y1(k, n), Y2(k, n)]^T = W(k)X(k, n)    (23)
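A short sketch combining the correction of expression (22) with the separation of expression (23), assuming W is stored as one 2x2 complex matrix per frequency point as in the earlier sketches.

```python
import numpy as np

def correct_and_separate(W, X):
    """Apply expression (22) (rescaling with diag(W^-1)) and expression (23)
    (posterior frequency-domain estimation Y = W X).

    W: (K, P, P) separation matrices, X: (K, N, P) observation matrices.
    Returns Y with shape (K, N, P); Y[k, n, p] is the posterior estimate of source p.
    """
    Y = np.empty_like(X)
    for k in range(W.shape[0]):
        D = np.diag(np.diag(np.linalg.inv(W[k])))          # diag(W^-1[k])
        Wk = D @ W[k]                                      # expression (22)
        Y[k] = (Wk @ X[k].T).T                             # expression (23), all frames at once
    return Y
```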
an exemplary embodiment of the present disclosure further provides a sound signal identification method, where for an a posteriori frequency domain estimation signal obtained by separating from an observation signal, in a scene with a large background noise and a strong reverberation, a second-stage noise reduction process may be performed on a noise residue in the signal to obtain a sound signal with a higher signal-to-noise ratio.
Firstly, according to the posterior domain estimation data, the signal components of each sound source are obtained. Signal components of each sound source in each channel of the beamforming filter, including the desired signal and the residual noise signal, may be derived from the a posteriori domain estimation data.
Taking the combined blind source separation and beamforming enhancement system shown in fig. 5 as an example, a PMWF beamforming filter with two component channels, a first channel and a second channel, is adopted. In the application scenario shown in fig. 3, Xp(k, n) is used as the input of the blind source separation in the first-stage noise reduction processing (abbreviated as X1[k] and X2[k] in fig. 5). The blind source separation result Y1(k, n) (abbreviated as Y1 in fig. 5) is regarded as the component of sound source s1 in the first channel, and Y2(k, n) (abbreviated as Y2 in fig. 5) as the component of sound source s2 in the second channel. The component of each sound source in the other channel is solved from the observation signal matrix X(k, n) and the prior frequency-domain estimate Y(k, n) according to expression (24), yielding the component of s1 in the second channel, denoted Ỹ1(k, n), and the component of s2 in the first channel, denoted Ỹ2(k, n). At this point, the signal components Y1(k, n) and Ỹ1(k, n) of sound source s1 in the two channels and the signal components Y2(k, n) and Ỹ2(k, n) of sound source s2 in the two channels are obtained. PMWF filtering is then performed on these signal components to obtain an estimate of each sound source with a higher signal-to-noise ratio.
An exemplary embodiment of the present disclosure further provides a sound signal identification method, where after signal components of each channel of each sound source in a beamforming filter are obtained, a second-stage noise reduction process is performed on the signal components of each sound source, so as to obtain a beamforming estimation signal. The specific process is shown in fig. 6, and includes:
step 601, determining the input signal of each sound source according to the signal component of each sound source.
In this step, the input signal is determined based on the desired signal and the residual noise signal component in the signal component.
Taking the application scenario shown in fig. 3 as an example, for sound source s1, Y1(k, n) and Ỹ1(k, n) are used as the input of the PMWF. For any frequency point, the input signal is composed of a desired signal and a residual noise signal.
Expression (25) shows the composition of the input signal Y1(k) of the first channel at any frequency point of sound source s1:
Y1(k) = S1(k) + V1(k),  k = 1, ..., K    (25)
where S1(k) is the desired signal of the input signal Y1(k) and V1(k) is its residual noise signal.
Expression (26) shows the composition of the input signal Ỹ1(k) of the second channel at any frequency point of sound source s1:
Ỹ1(k) = S̃1(k) + Ṽ1(k),  k = 1, ..., K    (26)
where S̃1(k) is the desired signal of the input signal Ỹ1(k) and Ṽ1(k) is its residual noise signal.
Expressed in vector form:
Y(k) = [Y1(k), Ỹ1(k)]^T    (27)
S(k) = [S1(k), S̃1(k)]^T    (28)
V(k) = [V1(k), Ṽ1(k)]^T    (29)
where Y(k) is the input signal, S(k) is the desired signal, and V(k) is the residual noise signal.
For sound source s2, Y2(k, n) and Ỹ2(k, n) are used as the input of the PMWF. For any frequency point, the input signal is composed of a desired signal and a residual noise signal.
Expression (30) shows the composition of the input signal Ỹ2(k) of the first channel at any frequency point of sound source s2:
Ỹ2(k) = S̃2(k) + Ṽ2(k),  k = 1, ..., K    (30)
where S̃2(k) is the desired signal of the input signal Ỹ2(k) and Ṽ2(k) is its residual noise signal.
Expression (31) shows the composition of the input signal Y2(k) of the second channel at any frequency point of sound source s2:
Y2(k) = S2(k) + V2(k),  k = 1, ..., K    (31)
where S2(k) is the desired signal of the input signal Y2(k) and V2(k) is its residual noise signal.
Expressed in vector form:
Y(k) = [Ỹ2(k), Y2(k)]^T    (32)
S(k) = [S̃2(k), S2(k)]^T    (33)
V(k) = [Ṽ2(k), V2(k)]^T    (34)
where Y(k) is the input signal, S(k) is the desired signal, and V(k) is the residual noise signal.
Step 602, estimating an autocorrelation matrix of the desired signal and an autocorrelation matrix of the residual noise signal.
In this step, the autocorrelation matrix of the desired signal and/or the residual noise signal is selected to be estimated according to whether the current frame is a noise frame or a non-noise frame, and the frame-by-frame processing is performed until the whole input signal is processed. As shown in fig. 7, the method includes:
respectively performing the following processing on the input signals of the sound sources frame by frame:
step 701, detecting that the current frame is a noise frame or a non-noise frame.
In this step, the noise is further identified by detecting the silent period in the observed signal data. The current frame can be detected as a noise frame or a non-noise frame by any Voice Activity Detection (VAD) technique.
It is checked whether the current frame is a noise frame, step 702 is entered for a noise frame, otherwise step 703 is entered.
Step 702, in case that the current frame is a noise frame, estimating an autocorrelation matrix of the desired signal and an autocorrelation matrix of the residual noise signal of the current frame according to an autocorrelation matrix of the desired signal and an autocorrelation matrix of the residual noise signal of a previous frame.
Under the condition that the signals S(k) and V(k) are uncorrelated, the respective autocorrelation matrices can be obtained according to expressions (35) to (38), specifically as follows:
Autocorrelation matrix of the input signal:
Φyy(k) = E[ Y(k)Y(k)^H ]    (35)
Autocorrelation matrix of the desired signal:
Φss(k) = E[ S(k)S(k)^H ]    (36)
Autocorrelation matrix of the residual noise signal:
Φvv(k) = E[ V(k)V(k)^H ]    (37)
Relationship between the autocorrelation matrices of the input signal, the desired signal and the residual noise signal:
Φyy(k) = Φss(k) + Φvv(k)    (38)
In general, Φvv(k) is estimated only on noise frames. Φss(k) is then estimated from Φyy(k) and Φvv(k), as shown in expression (39):
Φss(k) = Φyy(k) - Φvv(k)    (39)
Φyy(k) and Φvv(k) at time n are estimated from their values at time n-1, i.e., the autocorrelation matrices of the current frame are estimated from those of the previous frame, as shown in expressions (40) and (41):
Φyy(k, n) = λ1·Φyy(k, n-1) + (1-λ1)·Y(k, n)Y(k, n)^H    (40)
Φvv(k, n) = λ2·Φvv(k, n-1) + (1-λ2)·Y(k, n)Y(k, n)^H    (41)
where λ1 and λ2 are forgetting factors.
Step 703, under the condition that the current frame is a non-noise frame, estimating the autocorrelation matrix of the expected signal of the current frame according to the autocorrelation matrix of the expected signal of the previous frame.
In the application scenario shown in fig. 3, the autocorrelation matrix of the desired signal of the current frame may be estimated according to expression (34).
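A sketch of steps 701 to 703 for one sound source, assuming a simple energy-threshold VAD as a stand-in for "any Voice Activity Detection technique" and the recursive updates reconstructed in expressions (39) to (41); the threshold and forgetting factor values are illustrative assumptions.

```python
import numpy as np

def estimate_autocorrelations(Y, lam1=0.95, lam2=0.98, vad_threshold=1e-3):
    """Frame-by-frame estimation of Phi_yy, Phi_vv and Phi_ss for one sound source.

    Y: (K, N, 2) PMWF input per time-frequency bin, i.e. [Yp(k,n), Y~p(k,n)].
    Returns Phi_yy, Phi_vv, Phi_ss, each of shape (K, 2, 2), after the last frame.
    """
    K, N, C = Y.shape
    Phi_yy = np.zeros((K, C, C), dtype=complex)
    Phi_vv = np.zeros((K, C, C), dtype=complex)
    for n in range(N):
        # Step 701: crude energy-threshold VAD standing in for any VAD technique.
        is_noise_frame = np.mean(np.abs(Y[:, n, 0]) ** 2) < vad_threshold
        for k in range(K):
            yyH = np.outer(Y[k, n], Y[k, n].conj())
            # Expression (40): recursive update of the input-signal autocorrelation.
            Phi_yy[k] = lam1 * Phi_yy[k] + (1 - lam1) * yyH
            if is_noise_frame:
                # Step 702 / expression (41): the residual-noise autocorrelation is
                # updated from the previous frame only on noise frames.
                Phi_vv[k] = lam2 * Phi_vv[k] + (1 - lam2) * yyH
    # Expression (39): desired-signal autocorrelation from the final estimates.
    Phi_ss = Phi_yy - Phi_vv
    return Phi_yy, Phi_vv, Phi_ss
```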
Step 603, obtaining a weight vector of each sound source according to the final autocorrelation matrix of the desired signal and the autocorrelation matrix of the residual noise signal of each sound source.
In this step, the weight vector can be derived from frequency-domain parameterized multichannel Wiener filtering and is calculated according to expression (42):
H(k) = ( Φvv(k)^(-1)·Φss(k) / ( β + Tr( Φvv(k)^(-1)·Φss(k) ) ) )·u1    (42)
where u1 = [1, 0, ..., 0]^T is a selection vector of length M whose element corresponding to reference microphone 1 is 1 and whose remaining elements are 0. β is an adjustable parameter: the higher β is set, the better the noise removal effect, and the lower β is set, the higher the degree of restoration of the speech signal.
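A sketch of the weight-vector computation of expression (42) and the filtering of expression (43), following the PMWF reconstruction above with trade-off parameter β and reference channel 1; the regularization term is an assumption added for numerical stability.

```python
import numpy as np

def pmwf_weights(Phi_vv_k, Phi_ss_k, beta=1.0, eps=1e-9):
    """Weight vector H(k) of expression (42) for one frequency point."""
    C = Phi_vv_k.shape[0]
    G = np.linalg.inv(Phi_vv_k + eps * np.eye(C)) @ Phi_ss_k   # Phi_vv^-1 Phi_ss
    u1 = np.zeros(C, dtype=complex)
    u1[0] = 1.0                                                # reference channel 1
    return (G @ u1) / (beta + np.trace(G))

def beamform(Y_k, H_k):
    """Expression (43): Zp(k) = Y(k)^H H(k) for one frequency point."""
    return Y_k.conj() @ H_k
```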
Step 604, obtaining a beamforming estimation signal of each sound source according to the conjugate transpose matrix of the input signal of each sound source and the weight vector.
In this step, the beamforming estimation signal of each sound source is obtained according to expression (43):
Zp(k) = Y(k)^H·H(k),  p = 1, 2    (43)
In the application scenario shown in fig. 3, the beamforming estimation signal Z1(k) of sound source s1 and the beamforming estimation signal Z2(k) of sound source s2 are thus obtained, and the posterior domain estimation data obtained by blind source separation are further denoised through beamforming.
An exemplary embodiment of the present disclosure further provides a sound signal identification method, which obtains a time-domain sound source signal of each of the sound sources according to a beamforming estimation signal. The beam forming estimated signals of each sound source can be subjected to short-time inverse Fourier transform and then overlapped and added to obtain time-domain sound source signals of each sound source.
In the application scenario shown in fig. 3, the time-domain sound source signal can be calculated by expression (44), i.e., the beamforming estimation signal of each sound source is transformed back by the inverse short-time Fourier transform and the resulting frames are overlap-added:
zp = OverlapAdd( ISTFT( Zp(k, n) ) ),  p = 1, 2    (44)
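A sketch of expression (44), i.e., inverse STFT of the beamforming estimation signals followed by overlap-add, assuming the same Hann-window/50%-overlap analysis as the earlier sketch; normalization by the window sum is omitted here for brevity.

```python
import numpy as np

def istft_overlap_add(Z, n_fft=512, hop=256):
    """Expression (44): time-domain signal from a beamforming estimate Z of shape (K, N).

    Assumes the frames were analysed with a Hann window at 50% overlap, so the
    overlapping analysis windows sum to an approximately constant value and a plain
    overlap-add of the inverse FFTs reconstructs the time-domain sound source signal.
    """
    K, N = Z.shape
    out = np.zeros(hop * (N - 1) + n_fft)
    for n in range(N):
        frame = np.fft.irfft(Z[:, n], n=n_fft)              # inverse short-time Fourier transform
        out[n * hop:n * hop + n_fft] += frame                # overlap-add
    return out

# Example: random spectra standing in for the beamforming estimation signal Z1(k, n).
Z1 = np.random.randn(257, 10) + 1j * np.random.randn(257, 10)
z1 = istft_overlap_add(Z1)
print(z1.shape)
```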
an exemplary embodiment of the present disclosure also provides a sound signal recognition apparatus, the structure of which is shown in fig. 8, including:
an original data acquiring module 801, configured to acquire original observation data acquired by at least two acquisition points for at least two sound sources respectively;
a first denoising module 802, configured to perform a first-stage denoising process on the original observation data to obtain posterior domain estimation data;
a signal mapping module 803, configured to obtain signal components of each sound source according to the posterior domain estimation data;
a second denoising module 804, configured to perform second-stage denoising processing on the signal components of the sound sources respectively to obtain beamforming estimation signals;
an enhanced signal output module 805, configured to obtain a time-domain sound source signal of each sound source according to the beamforming estimation signal.
Further, the structure of the first noise reduction module 802 is shown in fig. 9, and includes:
a matrix initialization submodule 901, configured to initialize a separation matrix of each frequency point and a weighted covariance matrix of each sound source at each frequency point, where the number of rows and columns of the separation matrix is the number of sound sources;
an observed signal matrix construction submodule 902, configured to obtain a time domain signal at each acquisition point, and construct an observed signal matrix according to a frequency domain signal corresponding to the time domain signal;
a priori frequency domain obtaining submodule 903, configured to obtain a priori frequency domain estimate of each sound source of the current frame according to the separation matrix of the previous frame and the observed signal matrix;
a covariance matrix update sub-module 904 for updating the weighted covariance matrix according to the prior frequency domain estimate;
a separation matrix updating submodule 905 configured to update the separation matrix according to the updated weighted covariance matrix;
a correction submodule 906 for correcting the updated separation matrix;
the posterior domain obtaining submodule 907 is configured to separate the original observation data according to the corrected separation matrix, and obtain posterior domain estimation data of each sound source through separation.
Further, the prior frequency domain obtaining sub-module 903 is configured to separate the observed signal matrix according to the separation matrix of the previous frame, so as to obtain a prior frequency domain estimate of each sound source of the current frame.
Further, the covariance matrix update sub-module 904 is configured to update the weighted covariance matrix according to the observed signal matrix and a conjugate transpose matrix of the observed signal matrix.
Further, the structure of the separation matrix update sub-module 905 is shown in fig. 10, and includes:
an eigenvalue operator module 1001 for calculating an eigenvector of each sound source according to the weighted covariance matrix of each sound source;
and an updating sub-module 1002, configured to update the separation matrix of each sound source according to the feature vector.
Further, the correction submodule 906 is configured to perform spectrum amplitude deblurring processing on the separation matrix according to a diagonal matrix of an inverse matrix of the separation matrix, so as to correct the spectrum distortion of the separation matrix.
Further, the signal mapping module 803 is configured to obtain, according to the a posteriori domain estimation data, signal components of each sound source in each channel of the beamforming filter, where the signal components include an expected signal and a residual noise signal.
Further, the structure of the second noise reduction module 804 is shown in fig. 11, and includes:
an input signal construction sub-module 1101 for determining an input signal for each of the sound sources from the signal components of each of the sound sources;
an autocorrelation matrix estimation sub-module 1102, configured to perform the processing of steps one and two on the input signals of the sound sources frame by frame:
step one, detecting whether the current frame is a noise frame or a non-noise frame,
step two, under the condition that the current frame is a noise frame, estimating the autocorrelation matrix of the expected signal of the current frame and the autocorrelation matrix of the residual noise signal according to the autocorrelation matrix of the expected signal of the previous frame and the autocorrelation matrix of the residual noise signal,
estimating an autocorrelation matrix of the desired signal of a current frame according to an autocorrelation matrix of the desired signal of a previous frame in the case that the current frame is a non-noise frame;
a weight vector calculation sub-module 1103, configured to obtain a weight vector of each sound source according to the final autocorrelation matrix of the desired signal and the autocorrelation matrix of the residual noise signal of each sound source;
and a beamforming estimation signal obtaining sub-module 1104, configured to obtain a beamforming estimation signal of each sound source according to the conjugate transpose matrix of the input signal of each sound source and the weight vector.
Further, the enhanced signal output module is configured to perform short-time fourier inverse transformation on the beamforming estimation signals of each sound source, and then overlap and add the beamforming estimation signals to obtain a time-domain sound source signal of each sound source.
The apparatus may be integrated in an intelligent terminal device or in a remote computing platform, or some of its functional modules may be integrated in the intelligent terminal device and others in the remote computing platform, with the corresponding functions implemented by the intelligent terminal device and/or the remote computing platform.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 12 is a block diagram illustrating an apparatus 1200 for acoustic signal recognition according to an example embodiment. For example, the apparatus 1200 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 12, the apparatus 1200 may include one or more of the following components: a processing component 1202, a memory 1204, a power component 1206, a multimedia component 1208, an audio component 1210, an input/output (I/O) interface 1212, a sensor component 1214, and a communications component 1216.
The processing component 1202 generally controls overall operation of the apparatus 1200, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 1202 may include one or more processors 1220 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 1202 can include one or more modules that facilitate interaction between the processing component 1202 and other components. For example, the processing component 1202 can include a multimedia module to facilitate interaction between the multimedia component 1208 and the processing component 1202.
The memory 1204 is configured to store various types of data to support operation at the device 1200. Examples of such data include instructions for any application or method operating on the device 1200, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 1204 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
A power component 1206 provides power to the various components of the device 1200. Power components 1206 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for apparatus 1200.
The multimedia component 1208 includes a screen that provides an output interface between the device 1200 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensors may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 1208 includes a front camera and/or a rear camera. The front camera and/or the rear camera may receive external multimedia data when the device 1200 is in an operating mode, such as a shooting mode or a video mode. Each of the front camera and the rear camera may be a fixed optical lens system or have focusing and optical zoom capability.
The audio component 1210 is configured to output and/or input audio signals. For example, the audio component 1210 includes a microphone (MIC) configured to receive external audio signals when the apparatus 1200 is in an operational mode, such as a call mode, a recording mode, or a voice recognition mode. The received audio signals may further be stored in the memory 1204 or transmitted via the communication component 1216. In some embodiments, the audio component 1210 further includes a speaker for outputting audio signals.
The I/O interface 1212 provides an interface between the processing component 1202 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 1214 includes one or more sensors for providing various aspects of state assessment for the apparatus 1200. For example, the sensor assembly 1214 may detect the open/closed state of the device 1200 and the relative positioning of components, such as the display and keypad of the apparatus 1200. The sensor assembly 1214 may also detect a change in position of the apparatus 1200 or of a component of the apparatus 1200, the presence or absence of user contact with the apparatus 1200, the orientation or acceleration/deceleration of the apparatus 1200, and a change in the temperature of the apparatus 1200. The sensor assembly 1214 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 1214 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 1214 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 1216 is configured to facilitate wired or wireless communication between the apparatus 1200 and other devices. The apparatus 1200 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 1216 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 1216 further includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 1200 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as memory 1204 comprising instructions, executable by processor 1220 of apparatus 1200 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer readable storage medium having instructions therein, which when executed by a processor of a mobile terminal, enable the mobile terminal to perform a sound signal recognition method, the method comprising:
acquiring original observation data acquired by at least two acquisition points for at least two sound sources respectively;
performing first-stage noise reduction processing on the original observation data to obtain posterior domain estimation data;
obtaining signal components of each sound source according to the posterior domain estimation data;
respectively carrying out secondary noise reduction processing on the signal components of the sound sources to obtain beam forming estimation signals;
and obtaining time domain sound source signals of the sound sources according to the beam forming estimation signals.
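The five steps recited above can be chained as shown in the following hypothetical end-to-end skeleton. It is a sketch under stated assumptions, not the claimed method: the helper functions first_stage_iva (an IVA-style sketch of which appears after the summary paragraphs further below) and second_stage_beamforming (sketched in the device embodiment above), the way the per-source signal components are assembled, and the toy noise-frame detector are all assumptions introduced only to make the data flow concrete.

```python
import numpy as np
from scipy.signal import stft, istft

def recognize_sources(mics, fs=16000, n_fft=512, hop=256):
    """Hypothetical end-to-end skeleton of the five steps above.

    mics : (M, N) array of time-domain signals from M >= 2 acquisition points;
           the number of sound sources is assumed to equal M.
    """
    # original observation data -> frequency domain, X has shape (M, F, T)
    _, _, X = stft(mics, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)

    # steps 1-2: first-stage noise reduction (blind source separation),
    # giving the posterior-domain estimation data, shape (F, T, M)
    Y_post = first_stage_iva(np.transpose(X, (1, 2, 0)))      # sketched further below

    outputs = []
    for k in range(Y_post.shape[-1]):                         # one pass per sound source
        # step 3: assemble the per-source signal components (desired + residual noise).
        # Pairing source k with the sum of the other separated sources as a
        # residual-noise reference is an assumption, not the disclosure's exact mapping.
        desired = Y_post[..., k]
        residual = Y_post.sum(axis=-1) - desired
        components = np.stack([desired, residual], axis=-1)   # (F, T, 2)

        # step 4: second-stage noise reduction (beamforming) per frequency bin
        Z = np.zeros_like(desired)
        for f in range(components.shape[0]):
            frames = components[f]                            # (T, 2)
            # toy noise-frame detector based on the separated source's magnitude
            vad = np.abs(frames[:, 0]) < np.median(np.abs(frames[:, 0]))
            Z[f] = second_stage_beamforming(frames, vad)      # sketched above

        # step 5: inverse STFT with overlap-add -> time-domain sound source signal
        _, z = istft(Z, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
        outputs.append(z)

    return np.stack(outputs)
```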
Fig. 13 is a block diagram illustrating an apparatus 1300 for sound signal recognition according to an example embodiment. For example, the apparatus 1300 may be provided as a server. Referring to fig. 13, apparatus 1300 includes a processing component 1322, which further includes one or more processors, and memory resources, represented by memory 1332, for storing instructions, such as application programs, that may be executed by processing component 1322. The application programs stored in memory 1332 may include one or more modules that each correspond to a set of instructions. Further, processing component 1322 is configured to execute instructions to perform the methods described above.
The apparatus 1300 may also include a power component 1326 configured to perform power management for the apparatus 1300, a wired or wireless network interface 1350 configured to connect the apparatus 1300 to a network, and an input/output (I/O) interface 1358. The apparatus 1300 may operate based on an operating system stored in the memory 1332, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
The invention provides a sound signal identification method and device: original observation data collected by at least two acquisition points for at least two sound sources are acquired; first-stage noise reduction processing is performed on the original observation data to obtain posterior domain estimation data; signal components of each sound source are obtained according to the posterior domain estimation data; second-stage noise reduction processing is performed on the signal components of each sound source respectively to obtain beamforming estimation signals; and finally the time-domain sound source signal of each sound source is obtained according to the beamforming estimation signals. After blind source separation is performed on the original observation data to obtain the posterior domain estimation data, the signal-to-noise ratio is further improved through beam enhancement to highlight the desired signals, so that the problems of low sound source localization accuracy and poor speech recognition quality in strong-interference, low signal-to-noise-ratio scenarios are solved, and efficient, highly interference-resistant sound signal recognition is achieved.
Through a joint enhancement scheme based on blind source separation and beamforming, blind source separation based on independent vector analysis and beamforming based on multi-channel Wiener filtering are processed jointly, so that residual noise is further removed, the noise reduction performance of the algorithm is greatly improved, speech quality is improved, and the influence of noise on sound signal recognition is reduced.
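As an illustration of the first stage, the following is a minimal online AuxIVA-style sketch of blind source separation based on independent vector analysis, including a spectral-amplitude de-blurring correction that rescales the separation matrix by the diagonal of its inverse. The function name first_stage_iva, the smoothing factor beta, and the auxiliary-function update rule are assumptions; the disclosure's eigenvector-based update and exact correction may differ.

```python
import numpy as np

def first_stage_iva(Xf, beta=0.97):
    """Hypothetical online AuxIVA-style sketch of first-stage noise reduction.

    Xf : (F, T, M) complex STFT of the observation signals
         (F frequency bins, T frames, M acquisition points = assumed number of sources).
    Returns posterior-domain estimates of the same shape.
    """
    F, T, M = Xf.shape
    W = np.tile(np.eye(M, dtype=complex), (F, 1, 1))             # separation matrix per frequency bin
    V = np.tile(np.eye(M, dtype=complex), (F, M, 1, 1))          # weighted covariance per bin and source
    Y = np.zeros_like(Xf)

    for t in range(T):
        # prior frequency-domain estimate using the previous frame's separation matrix
        y_prior = np.einsum('fij,fj->fi', W, Xf[:, t])           # (F, M)
        r = np.sqrt(np.sum(np.abs(y_prior) ** 2, axis=0)) + 1e-9   # per-source envelope over frequency

        for f in range(F):
            x = Xf[f, t][:, None]                                # (M, 1) observation at bin f
            for k in range(M):
                # update the weighted covariance matrix of source k
                V[f, k] = beta * V[f, k] + (1 - beta) * (x @ x.conj().T) / r[k]
                # update row k of the separation matrix (auxiliary-function rule)
                wk = np.linalg.solve(W[f] @ V[f, k], np.eye(M, dtype=complex)[:, k])
                wk = wk / np.sqrt((wk.conj() @ V[f, k] @ wk).real + 1e-12)
                W[f, k, :] = wk.conj()

            # correction: spectral-amplitude de-blurring via diag(inv(W)) @ W
            W[f] = np.diag(np.diag(np.linalg.inv(W[f]))) @ W[f]

        # posterior-domain estimation data from the corrected separation matrix
        Y[:, t] = np.einsum('fij,fj->fi', W, Xf[:, t])

    return Y
```

Each frame is first separated with the previous separation matrix (the prior frequency-domain estimate), the weighted covariance matrices are refreshed, the separation matrix is updated and corrected, and the corrected matrix then produces the posterior-domain estimate.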
The technical solution provided by the present disclosure can be used in devices that employ microphones as sound collection point equipment, and with sound collection point arrays formed by two or more collection points, such as microphone arrays.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (20)

1. A method for recognizing a sound signal, comprising:
acquiring original observation data acquired by at least two acquisition points for at least two sound sources respectively;
performing first-stage noise reduction processing on the original observation data to obtain posterior domain estimation data;
obtaining signal components of each sound source according to the posterior domain estimation data;
respectively carrying out secondary noise reduction processing on the signal components of the sound sources to obtain beam forming estimation signals;
and obtaining time domain sound source signals of the sound sources according to the beam forming estimation signals.
2. The method according to claim 1, wherein the step of performing first-stage noise reduction processing on the original observation data to obtain the posterior domain estimation data comprises:
initializing a separation matrix of each frequency point and a weighted covariance matrix of each sound source at each frequency point, wherein the number of rows and columns of the separation matrix is the number of the sound sources;
obtaining time domain signals at each acquisition point, and constructing an observation signal matrix according to frequency domain signals corresponding to the time domain signals;
solving prior frequency domain estimation of each sound source of the current frame according to the separation matrix of the previous frame and the observation signal matrix;
updating the weighted covariance matrix according to the prior frequency domain estimate;
updating the separation matrix according to the updated weighted covariance matrix;
correcting the updated separation matrix;
and separating the original observation data according to the corrected separation matrix, and obtaining the posterior domain estimation data of each sound source through separation.
3. The method of claim 2, wherein the step of obtaining the a priori frequency domain estimates of the sound sources of the current frame according to the separation matrix of the previous frame and the observation signal matrix comprises:
and separating the observation signal matrix according to the separation matrix of the previous frame to obtain the prior frequency domain estimation of each sound source of the current frame.
4. The method of claim 2, wherein the step of updating the weighted covariance matrix based on the a priori frequency domain estimates comprises:
and updating the weighted covariance matrix according to the observation signal matrix and the conjugate transpose matrix of the observation signal matrix.
5. The sound signal identification method of claim 2, wherein the step of updating the separation matrix according to the updated weighted covariance matrix comprises:
calculating the eigenvectors of each sound source according to the weighted covariance matrix of each sound source;
and respectively updating the separation matrix of each sound source according to the characteristic vector.
6. The sound signal identification method according to claim 2, wherein the step of correcting the updated separation matrix includes:
and according to the diagonal matrix of the inverse matrix of the separation matrix, carrying out spectrum amplitude deblurring processing on the separation matrix so as to correct the spectrum distortion of the separation matrix.
7. The sound signal identification method according to claim 1, wherein the step of obtaining the signal component of each sound source from the a posteriori domain estimation data comprises:
and obtaining signal components of each sound source in each channel of the beam forming filter according to the posterior domain estimation data, wherein the signal components comprise a desired signal and a residual noise signal.
8. The sound signal identification method of claim 7, wherein the step of performing a second stage of noise reduction processing on the signal components of each sound source to obtain the beamforming estimation signal comprises:
determining input signals of the sound sources according to the signal components of the sound sources;
respectively performing the following steps one and two on the input signals of the sound sources, frame by frame:
step one, detecting whether the current frame is a noise frame or a non-noise frame;
step two, in the case that the current frame is a noise frame, estimating the autocorrelation matrix of the desired signal and the autocorrelation matrix of the residual noise signal of the current frame according to the autocorrelation matrix of the desired signal and the autocorrelation matrix of the residual noise signal of the previous frame,
and in the case that the current frame is a non-noise frame, estimating the autocorrelation matrix of the desired signal of the current frame according to the autocorrelation matrix of the desired signal of the previous frame;
obtaining a weight vector of each sound source according to the final autocorrelation matrix of the desired signal and the autocorrelation matrix of the residual noise signal of each sound source;
and obtaining a beam forming estimation signal of each sound source according to the conjugate transpose matrix of the input signal of each sound source and the weight vector.
9. The method according to claim 8, wherein the step of obtaining the time-domain sound source signal of each sound source according to the beamforming estimation signal comprises:
and performing short-time Fourier inverse transformation on the beam forming estimation signals of each sound source, and overlapping and adding to obtain time-domain sound source signals of each sound source.
10. An apparatus for recognizing a sound signal, comprising:
the original data acquisition module is used for acquiring original observation data acquired by at least two acquisition points for at least two sound sources respectively;
the first noise reduction module is used for carrying out first-stage noise reduction processing on the original observation data to obtain posterior domain estimation data;
the signal mapping module is used for obtaining the signal components of each sound source according to the posterior domain estimation data;
the second noise reduction module is used for respectively carrying out secondary noise reduction processing on the signal components of the sound sources to obtain beam forming estimation signals;
and the enhanced signal output module is used for obtaining the time domain sound source signal of each sound source according to the beam forming estimation signal.
11. The apparatus according to claim 10, wherein the first noise reduction module comprises:
the matrix initialization submodule is used for initializing a separation matrix of each frequency point and a weighted covariance matrix of each sound source at each frequency point, and the number of rows and the number of columns of the separation matrix are the number of the sound sources;
the observation signal matrix construction submodule is used for obtaining time domain signals at each acquisition point and constructing an observation signal matrix according to the frequency domain signals corresponding to the time domain signals;
the prior frequency domain solving submodule is used for solving prior frequency domain estimation of each sound source of the current frame according to the separation matrix of the previous frame and the observation signal matrix;
a covariance matrix update submodule for updating the weighted covariance matrix according to the prior frequency domain estimate;
a separation matrix updating submodule for updating the separation matrix according to the updated weighted covariance matrix;
a correction submodule for correcting the updated separation matrix;
and the posterior domain solving submodule is used for separating the original observation data according to the corrected separation matrix and obtaining posterior domain estimation data of each sound source through separation.
12. The sound signal identification device of claim 11,
and the prior frequency domain solving submodule is used for separating the observation signal matrix according to the separation matrix of the previous frame to obtain the prior frequency domain estimation of each sound source of the current frame.
13. The sound signal identification device of claim 11,
and the covariance matrix updating submodule is used for updating the weighted covariance matrix according to the observation signal matrix and the conjugate transpose matrix of the observation signal matrix.
14. The apparatus according to claim 11, wherein the separation matrix update submodule comprises:
the eigenvalue operation submodule is used for calculating the eigenvectors of each sound source according to the weighted covariance matrix of each sound source;
and the updating submodule is used for respectively updating the separation matrix of each sound source according to the characteristic vector.
15. The sound signal identification device of claim 11,
and the correction submodule is used for carrying out spectrum amplitude deblurring processing on the separation matrix according to the diagonal matrix of the inverse matrix of the separation matrix so as to correct the spectrum distortion of the separation matrix.
16. The sound signal identification device of claim 10,
and the signal mapping module is used for obtaining signal components of each sound source in each channel of the beam forming filter according to the posterior domain estimation data, wherein the signal components comprise desired signals and residual noise signals.
17. The apparatus for recognizing a sound signal according to claim 16, wherein the second noise reduction module comprises:
an input signal construction submodule for determining an input signal for each of the sound sources based on the signal components of each of the sound sources;
an autocorrelation matrix estimation submodule, configured to perform the following steps one and two on the input signals of the sound sources, frame by frame respectively:
step one, detecting whether the current frame is a noise frame or a non-noise frame;
step two, in the case that the current frame is a noise frame, estimating the autocorrelation matrix of the desired signal and the autocorrelation matrix of the residual noise signal of the current frame according to the autocorrelation matrix of the desired signal and the autocorrelation matrix of the residual noise signal of the previous frame,
and in the case that the current frame is a non-noise frame, estimating the autocorrelation matrix of the desired signal of the current frame according to the autocorrelation matrix of the desired signal of the previous frame;
a weight vector calculation submodule configured to obtain a weight vector of each sound source according to a final autocorrelation matrix of the desired signal and an autocorrelation matrix of the residual noise signal of each sound source;
and the beam forming estimation signal acquisition sub-module is used for obtaining the beam forming estimation signal of each sound source according to the conjugate transpose matrix of the input signal of each sound source and the weight vector.
18. The sound signal identification device of claim 17,
and the enhanced signal output module is used for performing short-time Fourier inverse transformation on the beam forming estimation signals of each sound source and then overlapping and adding the beam forming estimation signals to obtain time domain sound source signals of each sound source.
19. A computer device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
acquiring original observation data acquired by at least two acquisition points for at least two sound sources respectively;
performing first-stage noise reduction processing on the original observation data to obtain posterior domain estimation data;
obtaining signal components of each sound source according to the posterior domain estimation data;
respectively carrying out secondary noise reduction processing on the signal components of the sound sources to obtain beam forming estimation signals;
and obtaining time domain sound source signals of the sound sources according to the beam forming estimation signals.
20. A non-transitory computer readable storage medium having instructions therein, which when executed by a processor of a mobile terminal, enable the mobile terminal to perform a sound signal recognition method, the method comprising:
acquiring original observation data acquired by at least two acquisition points for at least two sound sources respectively;
performing first-stage noise reduction processing on the original observation data to obtain posterior domain estimation data;
obtaining signal components of each sound source according to the posterior domain estimation data;
respectively carrying out secondary noise reduction processing on the signal components of the sound sources to obtain beam forming estimation signals;
and obtaining time domain sound source signals of the sound sources according to the beam forming estimation signals.
CN202110572969.6A 2021-05-25 2021-05-25 Voice signal identification method and device Active CN113314135B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110572969.6A CN113314135B (en) 2021-05-25 2021-05-25 Voice signal identification method and device

Publications (2)

Publication Number Publication Date
CN113314135A true CN113314135A (en) 2021-08-27
CN113314135B CN113314135B (en) 2024-04-26

Family

ID=77374659

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110572969.6A Active CN113314135B (en) 2021-05-25 2021-05-25 Voice signal identification method and device

Country Status (1)

Country Link
CN (1) CN113314135B (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080240463A1 (en) * 2007-03-29 2008-10-02 Microsoft Corporation Enhanced Beamforming for Arrays of Directional Microphones
KR20160116440A (en) * 2015-03-30 2016-10-10 한국전자통신연구원 SNR Extimation Apparatus and Method of Voice Recognition System
CN106504763A (en) * 2015-12-22 2017-03-15 电子科技大学 Based on blind source separating and the microphone array multiple target sound enhancement method of spectrum-subtraction
CN105609112A (en) * 2016-01-15 2016-05-25 苏州宾果智能科技有限公司 Sound source positioning method and apparatus and time delay estimation method and apparatus
CN106952653A (en) * 2017-03-15 2017-07-14 科大讯飞股份有限公司 Noise remove method, device and terminal device
CN108877827A (en) * 2017-05-15 2018-11-23 福州瑞芯微电子股份有限公司 Voice-enhanced interaction method and system, storage medium and electronic equipment
CN108986838A (en) * 2018-09-18 2018-12-11 东北大学 A kind of adaptive voice separation method based on auditory localization
CN111009256A (en) * 2019-12-17 2020-04-14 北京小米智能科技有限公司 Audio signal processing method and device, terminal and storage medium
CN111128221A (en) * 2019-12-17 2020-05-08 北京小米智能科技有限公司 Audio signal processing method and device, terminal and storage medium
CN111681665A (en) * 2020-05-20 2020-09-18 浙江大华技术股份有限公司 Omnidirectional noise reduction method, equipment and storage medium
CN111489753A (en) * 2020-06-24 2020-08-04 深圳市友杰智新科技有限公司 Anti-noise sound source positioning method and device and computer equipment
CN111863012A (en) * 2020-07-31 2020-10-30 北京小米松果电子有限公司 Audio signal processing method and device, terminal and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113506582A (en) * 2021-05-25 2021-10-15 北京小米移动软件有限公司 Sound signal identification method, device and system
CN113628633A (en) * 2021-10-14 2021-11-09 辰风策划(深圳)有限公司 Noise reduction method for multi-channel information transmission of enterprise multi-party meeting
CN115862685A (en) * 2023-02-27 2023-03-28 全时云商务服务股份有限公司 Real-time voice activity detection method and device and electronic equipment
CN115862685B (en) * 2023-02-27 2023-09-15 全时云商务服务股份有限公司 Real-time voice activity detection method and device and electronic equipment

Also Published As

Publication number Publication date
CN113314135B (en) 2024-04-26

Similar Documents

Publication Publication Date Title
CN113314135B (en) Voice signal identification method and device
US11284190B2 (en) Method and device for processing audio signal with frequency-domain estimation, and non-transitory computer-readable storage medium
CN111128221B (en) Audio signal processing method and device, terminal and storage medium
CN110808063A (en) Voice processing method and device for processing voice
US11206483B2 (en) Audio signal processing method and device, terminal and storage medium
KR102497549B1 (en) Audio signal processing method and device, and storage medium
CN111179960B (en) Audio signal processing method and device and storage medium
CN110133594B (en) Sound source positioning method and device for sound source positioning
EP4254408A1 (en) Speech processing method and apparatus, and apparatus for processing speech
CN113506582A (en) Sound signal identification method, device and system
CN113053406A (en) Sound signal identification method and device
EP3929920B1 (en) Method and device for processing audio signal, and storage medium
CN110459236B (en) Noise estimation method, apparatus and storage medium for audio signal
CN113223553B (en) Method, apparatus and medium for separating voice signal
CN113488066A (en) Audio signal processing method, audio signal processing apparatus, and storage medium
CN111667842B (en) Audio signal processing method and device
CN113223548B (en) Sound source positioning method and device
CN112863537A (en) Audio signal processing method and device and storage medium
CN113362848B (en) Audio signal processing method, device and storage medium
CN113223543B (en) Speech enhancement method, device and storage medium
CN113362847A (en) Audio signal processing method and device and storage medium
CN114724578A (en) Audio signal processing method and device and storage medium
CN117880732A (en) Spatial audio recording method, device and storage medium
CN117121104A (en) Estimating an optimized mask for processing acquired sound data
CN113299301A (en) Voice processing method and device for voice processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant