CN112863537A - Audio signal processing method and device and storage medium - Google Patents

Audio signal processing method and device and storage medium

Info

Publication number
CN112863537A
Authority
CN
China
Prior art keywords
matrix
determining
time domain
separation
audio time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110001599.0A
Other languages
Chinese (zh)
Inventor
侯海宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xiaomi Pinecone Electronic Co Ltd
Original Assignee
Beijing Xiaomi Pinecone Electronic Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xiaomi Pinecone Electronic Co Ltd filed Critical Beijing Xiaomi Pinecone Electronic Co Ltd
Priority to CN202110001599.0A priority Critical patent/CN112863537A/en
Publication of CN112863537A publication Critical patent/CN112863537A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0272: Voice signal separating

Abstract

Disclosed herein are an audio signal processing method, apparatus, and storage medium. The method includes: when the separation characteristic of a separation matrix meets a set condition, using the separation matrix to separate the audio signals of different sound sources from the corresponding frame of audio time domain signal; and when the separation characteristic of the separation matrix does not meet the set condition, updating the separation matrix of the current frame of audio time domain signal according to the separation matrix of the previous frame of audio time domain signal, and separating the audio signals of different sound sources from the current frame using the updated separation matrix. The method and apparatus ensure the separation effect of the separation matrix on blind source signals, improve the robustness and stability of the algorithm, improve separation performance, reduce the degree of speech damage after separation, and improve recognition performance.

Description

Audio signal processing method and device and storage medium
Technical Field
The present disclosure relates to the field of mobile terminal data processing technologies, and in particular, to an audio signal processing method, an audio signal processing apparatus, and a storage medium.
Background
In the era of the Internet of Things and AI, intelligent voice, as one of the core technologies of artificial intelligence, can effectively improve the mode of human-machine interaction and greatly improve the convenience of smart products.
At present, the sound collection equipment of smart products mostly adopts a microphone array, and microphone beamforming technology is applied to improve the processing quality of speech signals and thus the speech recognition rate in real environments.
The blind source separation technology utilizes independence among different sound source signals to separate sound sources, so that a target signal and a noise source signal are separated, and the signal-to-noise ratio of the signals is improved.
How to improve the performance of the blind source separation technology is a technical problem to be solved.
Disclosure of Invention
To overcome the problems in the related art, an audio signal processing method, apparatus, and storage medium are provided.
According to a first aspect of embodiments herein, there is provided a method of audio signal processing, the method comprising:
acquiring aliasing audio signals of at least two sound sources collected by at least two microphones;
performing frame division processing on the aliasing audio signals to obtain multi-frame audio time domain signals;
determining a separation matrix of each frame of audio time domain signals;
judging whether the separation characteristic of the separation matrix of each frame of audio time domain signal meets a set condition or not;
when the separation characteristic of the separation matrix meets a set condition, the separation matrix is used for separating different sound source audio signals from corresponding frame audio time domain signals; and when the separation characteristic of the separation matrix does not meet the set condition, updating the separation matrix of the current frame of audio time domain signals according to the separation matrix of the previous frame of audio time domain signals, and separating different sound source audio signals of the current frame of audio time domain signals by using the updated separation matrix.
In an embodiment, the updating the separation matrix of the current frame audio time domain signal according to the separation matrix of the previous frame audio time domain signal includes one of:
taking the separation matrix of the previous frame of audio time domain signals as the separation matrix of the current frame of audio time domain signals;
and determining a product matrix of the separation matrix and the coefficient matrix of the previous frame of audio time domain signals, and taking the product matrix as the separation matrix of the current frame of audio time domain signals.
In one embodiment, the determining whether the separation characteristic of the separation matrix of each frame of audio time-domain signal meets a predetermined condition includes:
determining an auxiliary matrix corresponding to the separation matrix by using an inversion formula;
determining a product matrix of the separation matrix and the auxiliary matrix;
determining a first difference value between the product matrix and an identity matrix, and determining a second difference value between the product matrix and a transposed matrix of the identity matrix;
and when the first difference value is smaller than or equal to a first threshold value or the second difference value is smaller than or equal to a second threshold value, determining that the separation characteristic of the separation matrix meets a set condition.
In an embodiment, the determining the auxiliary matrix corresponding to the separation matrix by using an inverse formula includes:
determining the adjugate matrix of the separation matrix, and determining the determinant of the separation matrix;
determining the ratio of the adjugate matrix to the determinant;
and taking the ratio result as an auxiliary matrix corresponding to the separation matrix.
In one embodiment, the determining of the first difference value between the product matrix and the identity matrix includes:
determining, for each element of the product matrix on the main diagonal, the absolute value of its difference from 1,
determining the absolute value of each element of the product matrix outside the main diagonal;
determining the sum of these absolute values;
and taking the sum as the first difference value between the product matrix and the identity matrix.
In one embodiment, the determining of the second difference value between the product matrix and the transposed matrix of the identity matrix includes:
determining, for each element of the product matrix on the secondary diagonal, the absolute value of its difference from 1,
determining the absolute value of each element of the product matrix outside the secondary diagonal;
determining the sum of these absolute values;
and taking the sum as the second difference value between the product matrix and the transposed matrix of the identity matrix.
In an embodiment, the method further comprises:
determining first difference values corresponding to a plurality of historical frame audio time domain signals before a current frame audio time domain signal, determining a first coefficient according to the first difference values corresponding to the plurality of historical frame audio time domain signals, and determining that the first threshold is the product of a first fixed value and the first coefficient;
determining second difference values corresponding to a plurality of historical frame audio time domain signals before the current frame audio time domain signal, determining a second coefficient according to the second difference values corresponding to the plurality of historical frame audio time domain signals, and determining that the second threshold is the product of a second fixed value and the second coefficient.
In one embodiment, the determining of the first coefficient and the second coefficient according to the difference values corresponding to the plurality of historical frame audio time domain signals includes:
determining a difference value between a first difference value corresponding to each historical frame audio time domain signal and a first fixed value, determining an average value of the difference values corresponding to each historical frame audio time domain signal, and determining a first coefficient according to the average value, wherein the average value is positively correlated with the first coefficient;
determining a difference value between a second difference value corresponding to each historical frame audio time domain signal and a second fixed value, determining an average value of the difference values corresponding to each historical frame audio time domain signal, and determining a second coefficient according to the average value, wherein the average value is positively correlated with the second coefficient.
According to a second aspect of embodiments herein, there is provided an audio signal processing apparatus for use in a mobile terminal comprising at least two microphones, the apparatus comprising:
an acquisition module configured to acquire aliased audio signals of at least two sound sources acquired by the at least two microphones;
a framing module configured to perform framing processing on the aliasing audio signal to obtain a multi-frame audio time domain signal;
a first determining module configured to determine a separation matrix for each frame of the audio time domain signal;
the judging module is configured to judge whether the separation characteristic of the separation matrix of each frame of audio time domain signal meets a set condition;
the processing module is configured to use the separation matrix to separate different sound source audio signals from corresponding frame audio time domain signals when the separation characteristic of the separation matrix meets a set condition; and when the separation characteristic of the separation matrix does not meet the set condition, updating the separation matrix of the current frame of audio time domain signals according to the separation matrix of the previous frame of audio time domain signals, and separating different sound source audio signals of the current frame of audio time domain signals by using the updated separation matrix.
In an embodiment, the processing module is further configured to update the separation matrix of the audio time-domain signal of the current frame according to the separation matrix of the audio time-domain signal of the previous frame by using one of the following methods:
taking the separation matrix of the previous frame of audio time domain signals as the separation matrix of the current frame of audio time domain signals;
and determining a product matrix of the separation matrix and the coefficient matrix of the previous frame of audio time domain signals, and taking the product matrix as the separation matrix of the current frame of audio time domain signals.
In one embodiment, the determining module includes:
a second determination module configured to determine an auxiliary matrix corresponding to the separation matrix using an inversion formula;
a third determination module configured to determine a product matrix of the separation matrix and the auxiliary matrix;
a fourth determining module configured to determine a first difference value between the product matrix and an identity matrix, and determine a second difference value between the product matrix and a transposed matrix of the identity matrix;
a fifth determining module configured to determine that the separation characteristic of the separation matrix meets the set condition when the first difference value is less than or equal to a first threshold or the second difference value is less than or equal to a second threshold.
In an embodiment, the second determining module is configured to determine the auxiliary matrix corresponding to the separation matrix by using an inversion formula by using the following method:
determining the adjugate matrix of the separation matrix, and determining the determinant of the separation matrix;
determining the ratio of the adjugate matrix to the determinant;
and taking the ratio result as an auxiliary matrix corresponding to the separation matrix.
In an embodiment, the fourth determining module is further configured to determine the first difference value between the product matrix and the identity matrix by:
determining, for each element of the product matrix on the main diagonal, the absolute value of its difference from 1,
determining the absolute value of each element of the product matrix outside the main diagonal;
determining the sum of these absolute values;
and taking the sum as the first difference value between the product matrix and the identity matrix.
In an embodiment, the fourth determining module is further configured to determine the second difference value between the product matrix and the transposed matrix of the identity matrix by:
determining, for each element of the product matrix on the secondary diagonal, the absolute value of its difference from 1, and determining the absolute value of each element of the product matrix outside the secondary diagonal;
determining the sum of these absolute values;
and taking the sum as the second difference value between the product matrix and the transposed matrix of the identity matrix.
In one embodiment, the apparatus further comprises:
a sixth determining module, configured to determine first difference values corresponding to a plurality of historical frame audio time domain signals before a current frame audio time domain signal, determine a first coefficient according to the first difference values corresponding to the plurality of historical frame audio time domain signals, and determine that the first threshold is the product of a first fixed value and the first coefficient;
a seventh determining module, configured to determine second difference values corresponding to a plurality of historical frame audio time domain signals before the current frame audio time domain signal, determine a second coefficient according to the second difference values corresponding to the plurality of historical frame audio time domain signals, and determine that the second threshold is the product of a second fixed value and the second coefficient.
In an embodiment, the sixth determining module is further configured to determine the first coefficient according to the first difference values corresponding to the plurality of historical frame audio time domain signals by:
determining a difference value between a first difference value corresponding to each historical frame audio time domain signal and a first fixed value, determining an average value of the difference values corresponding to each historical frame audio time domain signal, and determining a first coefficient according to the average value, wherein the average value is positively correlated with the first coefficient;
the seventh determining module is further configured to determine the second coefficient according to the second difference values corresponding to the plurality of historical frame audio time domain signals by:
determining a difference value between a second difference value corresponding to each historical frame audio time domain signal and a second fixed value, determining an average value of the difference values corresponding to each historical frame audio time domain signal, and determining a second coefficient according to the average value, wherein the average value is positively correlated with the second coefficient.
According to a third aspect of embodiments herein, there is provided an audio signal processing apparatus comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute executable instructions in the memory to implement the steps of the method.
According to a fourth aspect of embodiments herein, there is provided a non-transitory computer readable storage medium having stored thereon executable instructions that, when executed by a processor, implement the steps of the method.
The technical solutions provided by the embodiments herein may include the following beneficial effects: after the separation matrix of each frame of audio time domain signal is calculated, the separation characteristic of the separation matrix is judged, the separation matrix of the current frame of audio time domain signal is used when the separation characteristic of the separation matrix meets the set condition, and the separation matrix of the previous frame of audio time domain signal is used when the separation characteristic of the separation matrix does not meet the set condition, so that the separation effect of the separation matrix on the blind source signal is ensured, the robustness and the stability of the algorithm are improved, the separation performance is improved, the voice damage degree after separation is reduced, and the recognition performance is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a flow diagram illustrating an audio signal processing method according to an exemplary embodiment;
FIG. 2 is a block diagram of an audio signal processing apparatus according to an exemplary embodiment;
fig. 3 is a block diagram illustrating an audio signal processing apparatus according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects herein, as detailed in the appended claims.
In the prior art, among blind source separation algorithms, the auxiliary-function-based independent vector analysis (AuxIVA) algorithm has good performance: it iteratively solves for an optimal separation matrix using an auxiliary-function technique and uses the separation matrix to separate the signals of different sound sources in an audio signal, achieving fast convergence and good separation performance. However, the conventional algorithm imposes no constraint on the separation matrix. When the separation characteristic of the iteratively obtained separation matrix used to separate the sound source signals is poor, the stability of the algorithm is damaged and the separation performance deteriorates.
The embodiment of the disclosure provides an audio signal processing method applied to a terminal, where the terminal is an electronic device integrated with two or more microphones. For example, the terminal may be a mobile phone, a notebook computer, a tablet computer, a vehicle-mounted terminal, a computer, or a server; or the terminal may be a device connected to a plurality of microphones.
Referring to fig. 1, fig. 1 is a flowchart illustrating an audio signal processing method according to an exemplary embodiment. As shown in fig. 1, the method includes:
step S11, acquiring aliasing audio signals of at least two sound sources collected by at least two microphones.
Step S12, performing framing processing on the aliasing audio signal to obtain a multi-frame audio time domain signal.
In step S13, a separation matrix for each frame of audio time domain signal is determined.
Step S14, determine whether the separation characteristic of the separation matrix of each frame of audio time domain signal meets the set condition.
Step S15, when the separation characteristic of the separation matrix meets the set condition, the separation matrix is used to separate the audio time domain signals of different sound sources from the corresponding frame of audio time domain signals; and when the separation characteristic of the separation matrix does not meet the set condition, updating the separation matrix of the current frame of audio time domain signals according to the separation matrix of the previous frame of audio time domain signals, and separating different sound source audio signals of the current frame of audio time domain signals by using the updated separation matrix.
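The per-frame control flow of steps S13 to S15 can be sketched as follows. This is an illustrative Python sketch, not part of the patent; `compute_separation_matrix`, `separation_ok`, and `apply_separation` are hypothetical placeholders for the computations described in the embodiments below:

```python
import numpy as np

def process_frames(frames, compute_separation_matrix, separation_ok, apply_separation):
    """Steps S13-S15: per frame, keep the newly computed separation matrix only
    if its separation characteristic meets the set condition; otherwise fall
    back to the previous frame's separation matrix."""
    w_prev = None
    outputs = []
    for frame in frames:
        w = compute_separation_matrix(frame)             # step S13
        if not separation_ok(w) and w_prev is not None:  # step S14
            w = w_prev                                   # step S15 fallback
        outputs.append(apply_separation(w, frame))       # separate the sources
        w_prev = w
    return outputs
```

Here the fallback simply reuses the previous frame's matrix; the alternative of multiplying it by a coefficient matrix is discussed in a later embodiment.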
In this embodiment, the number of microphones is 2 or more, and the number of sound sources is 2 or more. The number of sound sources and the number of microphones are generally the same, and in some embodiments, the number of sound sources and the number of microphones may be different.
In an application scenario, 2 microphones are provided, namely microphone 1 and microphone 2; the number of sound sources is 2, and the sound sources are respectively a sound source 1 and a sound source 2. The aliasing audio signals collected by the microphone 1 are the aliasing audio signals of the sound source 1 and the sound source 2, and the aliasing audio signals collected by the microphone 2 are the aliasing audio signals of the sound source 1 and the sound source 2.
In another application scenario, the number of the microphones is 3, namely the microphone 1, the microphone 2 and the microphone 3; the number of the sound sources is 3, namely a sound source 1, a sound source 2 and a sound source 3; the aliased audio signals collected by the microphone 1, the microphone 2, and the microphone 3 are the aliased audio signals of the sound source 1, the sound source 2, and the sound source 3.
When the number of sound sources is greater than 2, the case is generally treated as if there were 2 sound sources; that is, the audio signal of one sound source is taken as the target audio signal, and the audio signals of the other sound sources are treated as interference signals.
When the number of microphones is greater than 2, in sound source separation, signals collected by a plurality of microphones are subjected to redundancy removal processing (or dimension reduction processing), and aliasing audio signals corresponding to the 2 microphones are obtained.
In this embodiment, after the separation matrix of each frame of audio time domain signal is determined, the separation characteristic of the separation matrix is determined, the separation matrix of the current frame of audio time domain signal is used when the separation characteristic of the separation matrix meets the set condition, and the separation matrix of the previous frame of audio time domain signal is used when the separation characteristic of the separation matrix does not meet the set condition, so that the separation effect of the separation matrix on the blind source signal is ensured, the robustness and stability of the algorithm are improved, the separation performance is improved, the voice damage degree after separation is reduced, and the recognition performance is improved.
An embodiment of the present disclosure provides an audio signal processing method, including the method shown in fig. 1, and: in step S14, the updating the separation matrix of the current frame audio time domain signal according to the separation matrix of the previous frame audio time domain signal includes one of the following:
firstly, taking the separation matrix of the previous frame of audio time domain signal as the separation matrix of the current frame of audio time domain signal;
secondly, determining the product matrix of the separation matrix of the previous frame of audio time domain signal and a coefficient matrix, and taking the product matrix as the separation matrix of the current frame of audio time domain signal.
In this embodiment, when the separation characteristic of the separation matrix of the current frame does not meet the set condition, the separation matrix of the audio time domain signal of the current frame is abandoned, and the separation matrix of the audio time domain signal of the previous frame is used as the separation matrix of the audio time domain signal of the current frame, or the separation matrix of the audio time domain signal of the previous frame is modified and then used as the separation matrix of the audio time domain signal of the current frame, so that a better separation effect is obtained than that of the separation matrix of the audio time domain signal of the current frame.
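The two fallback options above can be sketched minimally as follows. This is an illustrative sketch, not part of the patent; the coefficient matrix `coeff` is an assumed parameter, since the text specifies neither how it is chosen nor the multiplication order (left-multiplication is assumed here):

```python
import numpy as np

def fallback_matrix(w_prev, coeff=None):
    """When the current frame's separation matrix fails the check, either
    reuse the previous frame's separation matrix (option 1) or take its
    product with a coefficient matrix (option 2)."""
    if coeff is None:
        return w_prev.copy()   # option 1: reuse the previous matrix as-is
    return coeff @ w_prev      # option 2: product with the coefficient matrix
```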
An embodiment of the present disclosure provides an audio signal processing method, including the method shown in fig. 1, and:
the determining whether the separation characteristic of the separation matrix of each frame of audio time domain signal meets a set condition includes:
step 1, determining an auxiliary matrix corresponding to the separation matrix by using an inversion formula;
step 2, determining a product matrix of the separation matrix and the auxiliary matrix;
step 3, determining a first difference value between the product matrix and an identity matrix, and determining a second difference value between the product matrix and a transposed matrix of the identity matrix;
and 4, when the first difference value is smaller than or equal to a first threshold value or the second difference value is smaller than or equal to a second threshold value, determining that the separation characteristic of the separation matrix meets a set condition. And when the first difference value is larger than a first threshold value and the second difference value is larger than a second threshold value, determining that the separation characteristic of the separation matrix does not meet a set condition.
In one embodiment, the determining the auxiliary matrix corresponding to the separation matrix in step 1 by using an inversion formula includes:
determining the adjugate matrix of the separation matrix, and determining the determinant of the separation matrix;
determining the ratio of the adjugate matrix to the determinant;
and taking the ratio result as an auxiliary matrix corresponding to the separation matrix.
For example: the separation matrix is Wtmp(k, n), where k = 1, …, K denotes the index of the frequency point, K is the number of frequency points, K = Nfft/2 + 1, Nfft is the system frame length, and n denotes the frame number.
In the case where Wtmp(k, n) is a 2 x 2 matrix with elements

Wtmp(k, n) = [ w11(k, n)  w12(k, n) ]
             [ w21(k, n)  w22(k, n) ]

the auxiliary matrix invWtmp(k, n) determined using the inversion formula is the adjugate divided by the determinant:

invWtmp(k, n) = 1 / (w11(k, n)w22(k, n) - w12(k, n)w21(k, n)) * [  w22(k, n)  -w12(k, n) ]
                                                                [ -w21(k, n)   w11(k, n) ]
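The 2 x 2 inversion formula can be sketched as follows (illustrative Python, not part of the patent; it follows the standard adjugate-over-determinant formula):

```python
import numpy as np

def aux_matrix_2x2(w):
    """Auxiliary matrix of a 2x2 separation matrix Wtmp(k, n), computed via
    the inversion formula: inv(w) = adj(w) / det(w)."""
    det = w[0, 0] * w[1, 1] - w[0, 1] * w[1, 0]
    adj = np.array([[ w[1, 1], -w[0, 1]],
                    [-w[1, 0],  w[0, 0]]])  # adjugate of w
    return adj / det
```

Note that when the determinant is close to zero the result is numerically unreliable, which is exactly the situation the separation-characteristic check below is meant to detect.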
in one embodiment, the determining the first difference value between the product matrix and the identity matrix in step 3 includes:
determining the absolute value of the difference of 1 from each element of the product matrix located on the main diagonal,
determining an absolute value of each element of the product matrix that is outside of a main diagonal;
determining the sum of the absolute values;
and taking the sum as a first difference value of the product matrix and the unit matrix.
For example: the product matrix is Wdot(k, n), and its first difference value from the identity matrix is amp1_Wdot(k, n):
amp1_Wdot(k, n) = abs(Wdot(1, 1, k, n) - 1) + abs(Wdot(1, 2, k, n)) + abs(Wdot(2, 1, k, n)) + abs(Wdot(2, 2, k, n) - 1)
In one embodiment, determining the second difference value between the product matrix and the transposed matrix of the identity matrix (for a 2 x 2 matrix, the matrix with ones on the secondary diagonal and zeros elsewhere) comprises:
determining, for each element of the product matrix on the secondary diagonal, the absolute value of its difference from 1,
determining the absolute value of each element of the product matrix outside the secondary diagonal;
determining the sum of these absolute values;
and taking the sum as the second difference value.
For example: the product matrix is Wdot(k, n), and its second difference value from the transposed matrix of the identity matrix is amp2_Wdot(k, n):
amp2_Wdot(k, n) = abs(Wdot(1, 2, k, n) - 1) + abs(Wdot(1, 1, k, n)) + abs(Wdot(2, 2, k, n)) + abs(Wdot(2, 1, k, n) - 1)
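Both difference values amp1_Wdot and amp2_Wdot can be computed for a single 2 x 2 product matrix as in the following sketch (illustrative Python with 0-based indexing; not part of the patent):

```python
import numpy as np

def difference_values(wdot):
    """First and second difference values of a 2x2 product matrix Wdot(k, n)
    against the identity matrix and against the matrix with ones on the
    secondary diagonal, matching amp1_Wdot and amp2_Wdot above."""
    amp1 = (abs(wdot[0, 0] - 1) + abs(wdot[0, 1])
            + abs(wdot[1, 0]) + abs(wdot[1, 1] - 1))
    amp2 = (abs(wdot[0, 1] - 1) + abs(wdot[0, 0])
            + abs(wdot[1, 1]) + abs(wdot[1, 0] - 1))
    return amp1, amp2
```

A product matrix equal to the identity gives amp1 = 0, while a product matrix equal to the exchanged (anti-diagonal) identity gives amp2 = 0, so either case passes the set condition.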
In this embodiment, the auxiliary matrix corresponding to the separation matrix is calculated using the inversion formula. When the separation performance of the separation matrix is good, the product matrix of the separation matrix and its auxiliary matrix is close to the identity matrix or to the transposed (anti-diagonal) matrix; when the separation performance is poor, the gap between the product matrix and either of these matrices is large. The method of this embodiment can therefore effectively judge the separation characteristic of the separation matrix.
The embodiment of the present disclosure provides an audio signal processing method, which includes the method shown in the previous embodiment, and in which the first threshold and the second threshold take their values in one of the following ways:
in a first mode, the first threshold and the second threshold are fixed values, and the first threshold and the second threshold are the same or different. For example: the first threshold and the second threshold are both 1e-2, or the first threshold is 1e-2 and the second threshold is 1 e-3.
In a second mode, the first threshold and the second threshold are adjustable dynamic values.
For example:
for the first threshold: determining first difference values corresponding to a plurality of historical frames before a current frame, determining a first coefficient according to the first difference values corresponding to the plurality of historical frames, and determining that the first threshold is the product of a first fixed value and the first coefficient.
For the second threshold: and determining second difference values corresponding to a plurality of historical frames before the current frame, determining a second coefficient according to the second difference values corresponding to the plurality of historical frames, and determining that the second threshold is the product of a second fixed value and the second coefficient.
In one embodiment, determining the first coefficient according to the first gap values corresponding to the plurality of historical frames includes: determining a difference value between the first gap value corresponding to each historical frame and a first fixed value, determining an average value of the difference values corresponding to the historical frames, and determining the first coefficient according to the average value, wherein the average value is positively correlated with the first coefficient.
Determining a second coefficient according to the second difference values corresponding to the plurality of historical frames, including: determining a difference value between a second difference value corresponding to each historical frame and a second fixed value, determining an average value of the difference values corresponding to each historical frame, and determining a second coefficient according to the average value, wherein the average value is positively correlated with the second coefficient.
In this embodiment, the first threshold and the second threshold of the current frame are adjusted according to the first difference values and the second difference values corresponding to a plurality of historical frames before the current frame, so that the first threshold and the second threshold closely track the historical sound source separation condition and the overall separation effect is better.
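The second-mode threshold update can be sketched as follows. The embodiment only states that the coefficient is positively correlated with the average excess of the historical gap values over the fixed value, so the linear mapping and the `scale` parameter below are assumptions for illustration:

```python
import numpy as np

def dynamic_threshold(history_gaps, fixed_value, scale=1.0):
    # difference between each historical frame's gap value and the fixed value
    diffs = np.asarray(history_gaps) - fixed_value
    mean_diff = float(np.mean(diffs))
    # positively correlated mapping (assumed linear form, floored so the
    # coefficient never shrinks the threshold below the fixed value)
    coeff = 1.0 + scale * max(mean_diff, 0.0)
    # threshold = fixed value * coefficient
    return fixed_value * coeff
```

A history of larger gap values yields a looser threshold; when the history stays at or below the fixed value, the fixed value itself is used.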
The following is a detailed description with a specific embodiment.
In this embodiment, there are two sound sources, and a loudspeaker box is provided with two microphones. Each microphone receives aliased sound data of the two sound sources, and the data of the two sound sources are separated according to the aliased sound data received by the two microphones.
Step 1, setting parameter values.
Step 1.1, setting the frame length of the system as Nfft, and the number of frequency points as K, where K is Nfft/2+ 1.
Step 1.2, setting an initial value of a separation matrix corresponding to each frequency point according to a formula (1):
Figure BDA0002881548410000121
wherein the content of the first and second substances,
Figure BDA0002881548410000122
k is an identity matrix, and K is 1.
Where H represents the conjugate transpose.
w1(k, 0) is the initial value matrix of the separation matrix of the first sound source, and w2(k, 0) is the initial value matrix of the separation matrix of the second sound source. The 0 in w1(k, 0) and w2(k, 0) denotes the 0th frame: after the sound data is subjected to framing processing, the 1st frame data, the 2nd frame data, and so on are obtained. In the subsequent calculation, each current frame of data is processed with the separation matrix of the previous frame, so the value representing the frame number in the initial value matrix is set to 0 in order to conveniently process the 1st frame data.
Step 1.4, setting an initial value of the weighted covariance matrix Vi(k) corresponding to each frequency point according to formula (2):
Vi(k, 0) = 0 (the 2 × 2 zero matrix), k = 1, …, K; i = 1, 2 (2)
where k denotes the index of the frequency point and i denotes the index of the sound source.
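Steps 1.1 to 1.4 can be sketched as follows (Nfft = 512 is an arbitrary example value, not specified by the embodiment):

```python
import numpy as np

Nfft = 512                 # system frame length (example value)
K = Nfft // 2 + 1          # number of frequency points, K = Nfft/2 + 1
num_src = 2                # two sound sources / two microphones

# formula (1): separation matrix initialized to the identity per frequency point
W = np.tile(np.eye(num_src, dtype=complex), (K, 1, 1))       # shape (K, 2, 2)

# formula (2): weighted covariance matrices initialized to zero per source and bin
V = np.zeros((num_src, K, num_src, num_src), dtype=complex)  # shape (2, K, 2, 2)
```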
And step 2, determining frequency domain data.
Framing processing is performed on the aliasing sound data collected by each microphone to obtain the frames of the sound signal collected by each microphone.
Let xp(m, n) denote the discrete time domain sequence of the nth frame of the pth microphone, p = 1, 2; m = 1, …, Nfft. A windowed Nfft-point FFT is performed on xp(m, n) according to formula (3) to obtain the corresponding frequency domain signal Xp(k, n):
Xp(k, n) = Σm=1..Nfft xp(m, n) e^(−j2π(k−1)(m−1)/Nfft), k = 1, …, K (3)
According to X of each microphonep(k, n) constructing an observation signal matrix as follows:
X(k,n)=[X1(k,n),X2(k,n)]T
wherein K is 1., K; t denotes transposition.
Step 3, determining the prior frequency domain estimates.
And determining the prior frequency domain estimation of all sound source signals in the current frame by using the separation matrix W (k, n-1) of the previous frame and the observation signal matrix according to the formula (4).
Y(k,n)=W(k,n-1)X(k,n) (4)
where k = 1, …, K.
Let Y(k, n) = [Y1(k, n), Y2(k, n)]T, k = 1, …, K.
Y1(k, n) and Y2(k, n) are the two elements of Y(k, n), and are determined from Y(k, n); Y1(k, n) and Y2(k, n) are the estimated values of sound sources s1 and s2 at the time-frequency point (k, n), respectively.
The frequency domain estimate of each sound source over the entire frequency band of the current frame is determined as:
Yi(n) = [Yi(1, n), Yi(2, n), …, Yi(K, n)]T (5)
where i = 1, 2.
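Step 3 (formula (4)) amounts to one matrix-vector product per frequency point, which can be sketched as:

```python
import numpy as np

def prior_estimates(W_prev, X_frame):
    """Formula (4): Y(k, n) = W(k, n-1) X(k, n) for every bin k.
    W_prev: (K, 2, 2) separation matrices of the previous frame;
    X_frame: (K, 2) observation vectors of the current frame."""
    Y = np.einsum('kij,kj->ki', W_prev, X_frame)
    # column i of Y is the full-band estimate of source i across all bins
    return Y                                        # shape (K, 2)
```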
Step 4, updating the corresponding weighted covariance matrix Vi(k, n) according to the frequency domain estimate of each sound source over the entire frequency band of the current frame.
The weighted covariance matrix of each sound source at the (k, n)th time-frequency point is updated according to formula (6):
Vi(k, n) = βVi(k, n − 1) + (1 − β)φi(n)X(k, n)XH(k, n) (6)
where β is a weighting coefficient, e.g., β = 0.98, and the weighting quantity φi(n) is determined by formula (7):
φi(n) = G′(ri(n)) / ri(n) (7)
where
ri(n) = sqrt( Σk=1..K |Yi(k, n)|² ) (8)
and G(ri(n)) is the comparison function, determined by formula (9):
G(ri(n)) = −log p(Yi(n)) (9)
where p(Yi(n)) represents the multi-dimensional super-Gaussian prior probability distribution model of the ith sound source based on the whole frequency band. In the general case, p(Yi(n)) is determined according to formula (10):
p(Yi(n)) = exp( −sqrt( Σk=1..K |Yi(k, n)|² ) ) (10)
At this time, G(ri(n)) = ri(n), so that G′(ri(n)) = 1 and therefore
φi(n) = 1 / ri(n)
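Under the prior of formula (10), the weight reduces to φi(n) = 1/ri(n), and the step-4 update of formula (6) can be sketched as follows (the small `eps` guard against division by zero is an assumption):

```python
import numpy as np

def update_weighted_covariance(V_prev, Y_frame, X_frame, beta=0.98, eps=1e-12):
    """Formula (6): Vi(k, n) = beta*Vi(k, n-1) + (1-beta)*phi_i(n)*X X^H,
    with phi_i(n) = 1/ri(n) and ri(n) the full-band magnitude of source i.
    V_prev: (2, K, 2, 2); Y_frame: (K, 2); X_frame: (K, 2)."""
    V_new = np.empty_like(V_prev)
    for i in range(V_prev.shape[0]):
        # ri(n): square root of the summed power of source i over all bins
        r_i = np.sqrt(np.sum(np.abs(Y_frame[:, i]) ** 2)) + eps
        phi_i = 1.0 / r_i
        for k in range(X_frame.shape[0]):
            xxh = np.outer(X_frame[k], X_frame[k].conj())   # X(k,n) X^H(k,n)
            V_new[i, k] = beta * V_prev[i, k] + (1 - beta) * phi_i * xxh
    return V_new
```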
Step 5, solving the eigenvalue problem.
Solving the eigenvalues according to equation (11):
V2(k,n)ei(k,n)=λi(k,n)V1(k,n)ei(k,n) (11)
where i = 1, 2. Solving yields:
λ1(k, n) = ( tr(H(k, n)) + sqrt( tr²(H(k, n)) − 4det(H(k, n)) ) ) / 2 (12)
λ2(k, n) = ( tr(H(k, n)) − sqrt( tr²(H(k, n)) − 4det(H(k, n)) ) ) / 2 (13)
e1(k, n) = [H12(k, n), λ1(k, n) − H11(k, n)]T (14)
e2(k, n) = [H12(k, n), λ2(k, n) − H11(k, n)]T (15)
where tr is the trace function, tr(A) being the sum of the elements on the main diagonal of the matrix A; det(A) is the determinant of the matrix A; λ1 and λ2 are the eigenvalues, and e1 and e2 are the corresponding eigenvectors. Here
H(k, n) = V1(k, n)⁻¹V2(k, n) (16)
where H22(k, n) denotes the element in row 2, column 2 of the H(k, n) matrix, H12(k, n) denotes the element in row 1, column 2 of the H(k, n) matrix, and H11(k, n) denotes the element in row 1, column 1 of the H(k, n) matrix.
Step 6, determining a temporary separation matrix of all sound sources at each frequency point in the current frame from the eigenvectors:
Wtmp(k, n) = [w1(k, n), w2(k, n)]H, k = 1, …, K (17)
wi(k, n) = ei(k, n) / sqrt( eiH(k, n)Vi(k, n)ei(k, n) ), i = 1, 2 (18)
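The generalized eigenvalue problem of formula (11) has a closed-form 2 × 2 solution via H(k, n) = V1(k, n)⁻¹V2(k, n), which can be checked numerically. The eigenvector parametrization [H12, λ − H11]T used below is one valid form (it assumes H12 ≠ 0):

```python
import numpy as np

def gevd_2x2(V1, V2):
    """Solve V2 e = lambda V1 e for a 2x2 pair via H = inv(V1) V2."""
    H = np.linalg.inv(V1) @ V2
    tr = np.trace(H)
    det = np.linalg.det(H)
    # eigenvalues from the characteristic polynomial of H
    disc = np.sqrt(tr * tr - 4 * det + 0j)
    lam1, lam2 = (tr + disc) / 2, (tr - disc) / 2
    # one eigenvector parametrization per eigenvalue
    e1 = np.array([H[0, 1], lam1 - H[0, 0]])
    e2 = np.array([H[0, 1], lam2 - H[0, 0]])
    return (lam1, e1), (lam2, e2)
```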
In the conventional algorithm, the separation characteristic of the obtained temporary separation matrix Wtmp(k, n) is not examined, and Wtmp(k, n) is directly used as the separation matrix W(k, n) of the current frame, that is, W(k, n) = Wtmp(k, n). In an actual scenario, however, when the separation characteristic of Wtmp(k, n) is poor, directly assigning it to W(k, n) for subsequent separation may destroy the stability of the algorithm and cause the separation performance to deteriorate. In view of this, the present application proposes to judge the separation characteristic of Wtmp(k, n) and to assign it as the separation matrix of the current frame only when the separation characteristic satisfies a set condition, i.e., W(k, n) = Wtmp(k, n); when the set condition is not met, the separation matrix of the previous frame, or a correction matrix of the separation matrix of the previous frame, is used as the separation matrix of the current frame. This improves the robustness of the algorithm, ensures stable convergence, and improves the voice quality.
Step 7, judging whether the separation characteristic of Wtmp(k, n) meets the set condition.
First, the auxiliary matrix corresponding to Wtmp(k, n) is determined by using an inversion formula.
For example, in the case where Wtmp(k, n) is a 2 × 2 matrix
Wtmp(k, n) = [w11(k, n), w12(k, n); w21(k, n), w22(k, n)]
the auxiliary matrix invWtmp(k, n) determined by using the inversion formula is:
invWtmp(k, n) = (1 / det(Wtmp(k, n))) [w22(k, n), −w12(k, n); −w21(k, n), w11(k, n)] (19)
where det(Wtmp(k, n)) denotes the determinant of Wtmp(k, n).
In practical computation, the value of det(Wtmp(k, n)) must not be 0; if the determinant of Wtmp(k, n) is 0, a correction value is automatically added so that the corrected value of det(Wtmp(k, n)) is not 0.
The product of the two is calculated to obtain Wdot(k, n):
Wdot(k, n) = Wtmp(k, n)invWtmp(k, n) (20)
The first gap value amp1_Wdot(k, n) between Wdot(k, n) and the identity matrix is calculated as:
amp1_Wdot(k,n)=abs(Wdot(1,1,k,n)-1)+abs(Wdot(1,2,k,n))+abs(Wdot(2,1,k,n))+abs(Wdot(2,2,k,n)-1) (21)
The second gap value amp2_Wdot(k, n) between Wdot(k, n) and the transposed matrix of the identity matrix is calculated as:
amp2_Wdot(k,n)=abs(Wdot(1,2,k,n)-1)+abs(Wdot(1,1,k,n))+abs(Wdot(2,2,k,n))+abs(Wdot(2,1,k,n)-1) (22)
If amp1_Wdot(k, n) ≤ TH or amp2_Wdot(k, n) ≤ TH is satisfied, the separation matrix of the current frame is determined to be Wtmp(k, n), that is, W(k, n) = Wtmp(k, n); if neither amp1_Wdot(k, n) ≤ TH nor amp2_Wdot(k, n) ≤ TH is satisfied, the separation matrix of the current frame is updated to the separation matrix of the previous frame, that is, W(k, n) = W(k, n − 1).
Where TH is a threshold, e.g., 1e-2.
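Step 7 for a single 2 × 2 frequency point can be transcribed directly: invert Wtmp(k, n) via the adjugate-over-determinant formula, compute the gap values of formulas (21) and (22), and fall back to the previous frame's matrix when both exceed TH. The magnitude of the correction value added when the determinant is 0 is an assumption:

```python
import numpy as np

def select_separation_matrix(W_tmp, W_prev, th=1e-2):
    """Return Wtmp(k, n) when a gap value is within TH, else W(k, n-1)."""
    det = W_tmp[0, 0] * W_tmp[1, 1] - W_tmp[0, 1] * W_tmp[1, 0]
    if abs(det) == 0.0:
        det = 1e-6                          # correction value (assumed magnitude)
    adj = np.array([[W_tmp[1, 1], -W_tmp[0, 1]],
                    [-W_tmp[1, 0], W_tmp[0, 0]]])
    inv_w = adj / det                       # auxiliary matrix via inversion formula
    w_dot = W_tmp @ inv_w                   # product matrix
    amp1 = (abs(w_dot[0, 0] - 1) + abs(w_dot[0, 1]) +
            abs(w_dot[1, 0]) + abs(w_dot[1, 1] - 1))   # gap to identity, (21)
    amp2 = (abs(w_dot[0, 1] - 1) + abs(w_dot[0, 0]) +
            abs(w_dot[1, 1]) + abs(w_dot[1, 0] - 1))   # gap to transposed identity, (22)
    return W_tmp if (amp1 <= th or amp2 <= th) else W_prev
```

A well-conditioned Wtmp yields a product matrix at the identity and is kept; a singular Wtmp triggers the determinant correction, produces large gap values, and the previous frame's matrix is retained.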
Step 8, separating the aliasing audio signals by using the obtained W (k, n) to obtain the posterior frequency domain estimation of the sound source signals:
Y(k,n)=[Y1(k,n),Y2(k,n)]T=W(k,n)X(k,n) (23)
Step 9, performing IFFT and overlap-add on Yi(k, n), k = 1, …, K, to obtain the separated time domain sound source signal si(m, n):
si(m, n) = IFFT(Yi(k, n)) (24)
where i = 1, 2; m = 1, …, Nfft.
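Step 9 can be sketched as a per-frame IFFT with overlap-add; the 50% hop matches the analysis sketch in step 2 and is an assumption:

```python
import numpy as np

def istft_overlap_add(Y, nfft=512, hop=256):
    """IFFT each frame of one source's spectrum Y (n_frames, K)
    and overlap-add the frames into a time domain signal."""
    n_frames = Y.shape[0]
    out = np.zeros(hop * (n_frames - 1) + nfft)
    for n in range(n_frames):
        out[n * hop:n * hop + nfft] += np.fft.irfft(Y[n], n=nfft)
    return out
```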
The embodiment of the disclosure provides an audio signal processing device, which is applied to a terminal, wherein the terminal is an electronic device integrated with two or more microphones. For example, the terminal may be a mobile phone, a notebook, a tablet computer, a vehicle-mounted terminal, a computer, a server, or the like; or the terminal is a device connected to a plurality of microphones.
Referring to fig. 2, fig. 2 is a block diagram illustrating an audio signal processing apparatus according to an exemplary embodiment. The device is applied to a mobile terminal which comprises at least two microphones. As shown in fig. 2, the apparatus includes:
an acquisition module 201 configured to acquire aliased audio signals of at least two sound sources acquired by the at least two microphones;
a framing module 202, configured to perform framing processing on the aliasing audio signal to obtain a multi-frame audio time domain signal;
a first determining module 203 configured to determine a separation matrix of each frame of the audio time domain signal;
a determining module 204 configured to determine whether a separation characteristic of a separation matrix of each frame of audio time domain signal meets a set condition;
the processing module 205 is configured to, when the separation characteristic of the separation matrix meets a set condition, separate audio signals of different sound sources from corresponding frames of audio time-domain signals by using the determined separation matrix; and when the separation characteristic of the separation matrix does not meet the set condition, determining the separation matrix of the current frame of audio time domain signals according to the separation matrix of the previous frame of audio time domain signals, and separating the audio time domain signals of different sound sources of the current frame of audio time domain signals by using the updated separation matrix.
An embodiment of the present disclosure provides an audio signal processing apparatus, which includes the apparatus shown in fig. 2, and:
the processing module 205 is further configured to update the separation matrix of the current frame audio time domain signal according to the separation matrix of the previous frame audio time domain signal by using one of the following methods:
taking the separation matrix of the previous frame of audio time domain signals as the separation matrix of the current frame of audio time domain signals;
and determining a product matrix of the separation matrix and the coefficient matrix of the previous frame of audio time domain signals, and taking the product matrix as the separation matrix of the current frame of audio time domain signals.
An embodiment of the present disclosure provides an audio signal processing apparatus, which includes the apparatus shown in fig. 2, and:
the determining module 204 includes:
a second determination module configured to determine an auxiliary matrix corresponding to the separation matrix using an inversion formula;
a third determination module configured to determine a product matrix of the separation matrix and the auxiliary matrix;
a fourth determining module configured to determine a first gap value between the product matrix and an identity matrix, and determine a second gap value between the product matrix and a transpose of the identity matrix;
a fifth determination module configured to determine that the separation characteristic of the separation matrix meets a set condition when the first gap value is less than or equal to a first threshold or the second gap value is less than or equal to a second threshold.
In an embodiment, the second determining module is further configured to determine the auxiliary matrix corresponding to the separation matrix by using an inversion formula by using the following method:
determining an adjoint matrix of the separation matrix, and determining a determinant of the separation matrix;
determining a ratio result of the adjoint to the determinant;
and taking the ratio result as an auxiliary matrix corresponding to the separation matrix.
In an embodiment, the fourth determining module is further configured to determine the first gap value between the product matrix and the identity matrix by:
determining the absolute value of the difference value of each element positioned on the main diagonal in the product matrix and 1, and determining the absolute value of each element positioned outside the main diagonal in the product matrix;
determining the sum of the absolute values;
and taking the sum as the first gap value between the product matrix and the identity matrix.
In an embodiment, the fourth determining module is further configured to determine the second gap value between the product matrix and the transpose matrix of the identity matrix by using the following method, including:
determining the absolute value of the difference value between each element positioned on the secondary diagonal and 1 in the product matrix, and determining the absolute value of each element positioned outside the secondary diagonal in the product matrix;
determining the sum of the absolute values;
and taking the sum as the second gap value between the product matrix and the transposed matrix of the identity matrix.
In one embodiment, the apparatus further comprises:
a sixth determining module configured to determine first gap values corresponding to a plurality of historical frames before a current frame, determine a first coefficient according to the first gap values corresponding to the plurality of historical frames, and determine that the first threshold is a product of a first fixed value and the first coefficient;
a seventh determining module configured to determine second gap values corresponding to a plurality of historical frames before the current frame, determine a second coefficient according to the second gap values corresponding to the plurality of historical frames, and determine that the second threshold is a product of a second fixed value and the second coefficient.
In an embodiment, the sixth determining module is further configured to determine the first coefficient according to the first gap value corresponding to the plurality of historical frames by using the following method:
determining a difference value between a first difference value corresponding to each historical frame and a first fixed value, determining an average value of the difference values corresponding to each historical frame, and determining a first coefficient according to the average value, wherein the average value is positively correlated with the first coefficient;
the seventh determining module is further configured to determine a second coefficient according to the second gap values corresponding to the plurality of historical frames using the following method:
determining a difference value between a second difference value corresponding to each historical frame and a second fixed value, determining an average value of the difference values corresponding to each historical frame, and determining a second coefficient according to the average value, wherein the average value is positively correlated with the second coefficient.
An embodiment of the present disclosure provides an audio signal processing apparatus, including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute executable instructions in the memory to implement the steps of the method.
A non-transitory computer readable storage medium having stored thereon executable instructions that, when executed by a processor, perform the steps of the method is provided in embodiments of the present disclosure.
Fig. 3 is a block diagram illustrating an audio signal processing apparatus 300 according to an exemplary embodiment. For example, the apparatus 300 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 3, the apparatus 300 may include one or more of the following components: processing component 302, memory 304, power component 306, multimedia component 308, audio component 310, input/output (I/O) interface 312, sensor component 314, and communication component 316.
The processing component 302 generally controls overall operation of the device 300, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 302 may include one or more processors 320 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 302 can include one or more modules that facilitate interaction between the processing component 302 and other components. For example, the processing component 302 may include a multimedia module to facilitate interaction between the multimedia component 308 and the processing component 302.
The memory 304 is configured to store various types of data to support operations at the device 300. Examples of such data include instructions for any application or method operating on device 300, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 304 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 306 provides power to the various components of the device 300. The power components 306 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 300.
The multimedia component 308 includes a screen that provides an output interface between the device 300 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 308 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the device 300 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 310 is configured to output and/or input audio signals. For example, audio component 310 includes a Microphone (MIC) configured to receive external audio signals when apparatus 300 is in an operating mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 304 or transmitted via the communication component 316. In some embodiments, audio component 310 also includes a speaker for outputting audio signals.
The I/O interface 312 provides an interface between the processing component 302 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 314 includes one or more sensors for providing various aspects of status assessment for the device 300. For example, sensor assembly 314 may detect an open/closed state of device 300, the relative positioning of components, such as a display and keypad of apparatus 300, the change in position of apparatus 300 or a component of apparatus 300, the presence or absence of user contact with apparatus 300, the orientation or acceleration/deceleration of apparatus 300, and the change in temperature of apparatus 300. Sensor assembly 314 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 314 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 314 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 316 is configured to facilitate wired or wireless communication between the apparatus 300 and other devices. The device 300 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 316 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 316 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 300 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 304 comprising instructions, executable by the processor 320 of the apparatus 300 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Other embodiments of the invention herein will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles herein and including such departures from the present disclosure as come within known or customary practice in the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is to be limited only by the terms of the appended claims.

Claims (18)

1. An audio signal processing method, comprising:
acquiring aliasing audio signals of at least two sound sources collected by at least two microphones;
performing frame division processing on the aliasing audio signals to obtain multi-frame audio time domain signals;
determining a separation matrix of each frame of audio time domain signals;
judging whether the separation characteristic of the separation matrix of each frame of audio time domain signal meets a set condition or not;
when the separation characteristic of the separation matrix meets a set condition, the separation matrix is used for separating different sound source audio signals from corresponding frame audio time domain signals; and when the separation characteristic of the separation matrix does not meet the set condition, updating the separation matrix of the current frame of audio time domain signals according to the separation matrix of the previous frame of audio time domain signals, and separating different sound source audio signals of the current frame of audio time domain signals by using the updated separation matrix.
2. The method of claim 1,
the updating of the separation matrix of the current frame audio time domain signal according to the separation matrix of the previous frame audio time domain signal includes one of:
taking the separation matrix of the previous frame of audio time domain signals as the separation matrix of the current frame of audio time domain signals;
and determining a product matrix of the separation matrix and the coefficient matrix of the previous frame of audio time domain signals, and taking the product matrix as the separation matrix of the current frame of audio time domain signals.
3. The method of claim 1,
the determining whether the separation characteristic of the separation matrix of each frame of audio time domain signal meets a set condition includes:
determining an auxiliary matrix corresponding to the separation matrix by using an inversion formula;
determining a product matrix of the separation matrix and the auxiliary matrix;
determining a first difference value between the product matrix and an identity matrix, and determining a second difference value between the product matrix and a transposed matrix of the identity matrix;
and when the first difference value is smaller than or equal to a first threshold value or the second difference value is smaller than or equal to a second threshold value, determining that the separation characteristic of the separation matrix meets a set condition.
4. The method of claim 3,
the determining an auxiliary matrix corresponding to the separation matrix by using an inversion formula includes:
determining an adjoint matrix of the separation matrix, and determining a determinant of the separation matrix;
determining a ratio result of the adjoint to the determinant;
and taking the ratio result as an auxiliary matrix corresponding to the separation matrix.
5. The method of claim 3,
the determining a first difference value between the product matrix and the identity matrix comprises:
determining the absolute value of the difference of 1 from each element of the product matrix located on the main diagonal,
determining an absolute value of each element of the product matrix that is outside of a main diagonal;
determining the sum of the absolute values;
and taking the sum as a first difference value of the product matrix and the unit matrix.
6. The method of claim 3,
the determining a second difference value between the product matrix and a transposed matrix of an identity matrix includes:
determining the absolute value of the difference of 1 for each element of the product matrix located on the secondary diagonal,
determining an absolute value of each element of the product matrix that is outside a secondary diagonal;
determining the sum of the absolute values;
and taking the sum as a second difference value between the product matrix and the transposed matrix of the identity matrix.
7. The method of claim 3,
the method further comprises the following steps:
determining first difference values corresponding to a plurality of historical frame audio time domain signals before a current frame audio time domain signal, determining a first coefficient according to the first difference values corresponding to the plurality of historical frame audio time domain signals, and determining that the first threshold is the product of a first fixed value and the first coefficient;
determining second difference values corresponding to a plurality of historical frame audio time domain signals before the current frame audio time domain signal, determining a second coefficient according to the second difference values corresponding to the plurality of historical frame audio time domain signals, and determining that the second threshold is a product of a second fixed value and the second coefficient.
8. The method of claim 7,
the determining a first coefficient according to a first gap value corresponding to a plurality of historical frame audio time domain signals includes:
determining a difference value between a first difference value corresponding to each historical frame audio time domain signal and a first fixed value, determining an average value of the difference values corresponding to each historical frame audio time domain signal, and determining a first coefficient according to the average value, wherein the average value is positively correlated with the first coefficient;
determining a difference value between a second difference value corresponding to each historical frame audio time domain signal and a second fixed value, determining an average value of the difference values corresponding to each historical frame audio time domain signal, and determining a second coefficient according to the average value, wherein the average value is positively correlated with the second coefficient.
9. An audio signal processing apparatus applied to a mobile terminal, the mobile terminal including at least two microphones, comprising:
an acquisition module configured to acquire aliased audio signals of at least two sound sources acquired by the at least two microphones;
a framing module configured to perform framing processing on the aliased audio signal to obtain a multi-frame audio time domain signal;
a first determining module configured to determine a separation matrix for each frame of the audio time domain signal;
a judging module configured to judge whether a separation characteristic of the separation matrix of each frame of the audio time domain signal meets a set condition;
a processing module configured to, when the separation characteristic of the separation matrix meets the set condition, separate audio signals of different sound sources from the corresponding frame of the audio time domain signal by using the separation matrix; and, when the separation characteristic of the separation matrix does not meet the set condition, update the separation matrix of the current frame audio time domain signal according to the separation matrix of the previous frame audio time domain signal, and separate audio signals of different sound sources from the current frame audio time domain signal by using the updated separation matrix.
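The interplay of the modules in claim 9 amounts to a per-frame loop with a fallback to the previous frame's separation matrix. A minimal sketch, in which `estimate_separation` and `separation_ok` are hypothetical stand-ins for the first determining module and the judging module:

```python
import numpy as np

def process_frames(frames, n_sources, estimate_separation, separation_ok):
    """Separate each frame; fall back to the previous frame's separation
    matrix whenever the current one fails the set condition (claim 9)."""
    prev_w = np.eye(n_sources)           # initial separation matrix
    outputs = []
    for x in frames:                     # x: (n_sources, n_samples) frame
        w = estimate_separation(x)       # first determining module
        if not separation_ok(w):         # judging module: check set condition
            w = prev_w                   # reuse previous frame's matrix
        outputs.append(w @ x)            # separate the sound sources
        prev_w = w
    return outputs
```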
10. The apparatus of claim 9,
the processing module is further configured to update the separation matrix of the current frame audio time domain signal according to the separation matrix of the previous frame audio time domain signal in one of the following ways:
taking the separation matrix of the previous frame audio time domain signal as the separation matrix of the current frame audio time domain signal;
or determining a product matrix of the separation matrix of the previous frame audio time domain signal and a coefficient matrix, and taking the product matrix as the separation matrix of the current frame audio time domain signal.
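The two update options of claim 10 reduce to a copy or a matrix product. A small sketch; the coefficient matrix is not specified by the claim, so any square matrix of matching size serves for illustration:

```python
import numpy as np

def update_separation(prev_w, coeff_matrix=None):
    """Update the current frame's separation matrix from the previous
    frame's (claim 10): reuse it, or multiply by a coefficient matrix."""
    if coeff_matrix is None:
        return prev_w.copy()             # option 1: reuse the previous matrix
    return prev_w @ coeff_matrix         # option 2: product with a coefficient matrix
```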
11. The apparatus of claim 9,
the judging module comprises:
a second determination module configured to determine an auxiliary matrix corresponding to the separation matrix using an inversion formula;
a third determination module configured to determine a product matrix of the separation matrix and the auxiliary matrix;
a fourth determining module configured to determine a first gap value between the product matrix and an identity matrix, and determine a second gap value between the product matrix and a transpose of the identity matrix;
a fifth determination module configured to determine that the separation characteristic of the separation matrix meets a set condition when the first gap value is less than or equal to a first threshold or the second gap value is less than or equal to a second threshold.
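Claim 11's set condition can be read as: invert the separation matrix, form the product with the original, and accept the frame when the product is close enough to either the identity matrix or the exchange (anti-diagonal) matrix. A hedged sketch, using `np.linalg.inv` in place of the adjugate formula of claim 12 and elementwise absolute sums as the gap measures:

```python
import numpy as np

def separation_ok(w, thr1, thr2):
    """Set condition of claim 11: the product of the separation matrix and
    its computed inverse should be close to the identity matrix, or to the
    exchange matrix when the sources come out in swapped order."""
    aux = np.linalg.inv(w)                         # auxiliary matrix
    p = w @ aux                                    # product matrix
    n = p.shape[0]
    g1 = np.abs(p - np.eye(n)).sum()               # first gap value (vs identity)
    g2 = np.abs(p - np.fliplr(np.eye(n))).sum()    # second gap value (vs exchange)
    return bool(g1 <= thr1 or g2 <= thr2)
```

In exact arithmetic the product is always the identity; the check is meaningful precisely because an ill-conditioned separation matrix makes the numerically computed inverse inaccurate.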
12. The apparatus of claim 11,
the second determining module is further configured to determine the auxiliary matrix corresponding to the separation matrix by using the inversion formula as follows:
determining an adjugate matrix of the separation matrix, and determining a determinant of the separation matrix;
determining a ratio of the adjugate matrix to the determinant;
and taking the ratio as the auxiliary matrix corresponding to the separation matrix.
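The inversion formula of claim 12 is the classical adjugate (adjoint) construction, W⁻¹ = adj(W)/det(W). A direct cofactor-expansion sketch, suitable for the small matrices that arise with a few microphones:

```python
import numpy as np

def adjugate(w):
    """Adjugate (classical adjoint): cofactor C_ij placed at position (j, i)."""
    n = w.shape[0]
    adj = np.empty_like(w, dtype=float)
    for i in range(n):
        for j in range(n):
            # Minor: delete row i and column j, then take its determinant
            minor = np.delete(np.delete(w, i, axis=0), j, axis=1)
            adj[j, i] = ((-1) ** (i + j)) * np.linalg.det(minor)
    return adj

def auxiliary_matrix(w):
    """Inversion formula of claim 12: adj(W) / det(W)."""
    return adjugate(w) / np.linalg.det(w)
```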
13. The apparatus of claim 11,
the fourth determining module is further configured to determine a first gap value between the product matrix and an identity matrix by:
determining, for each element of the product matrix located on the main diagonal, the absolute value of its difference from 1;
determining the absolute value of each element of the product matrix located outside the main diagonal;
determining the sum of the absolute values;
and taking the sum as the first gap value between the product matrix and the identity matrix.
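The first gap value of claim 13 measures how far the product matrix is from the identity: main-diagonal entries are compared against 1 and off-diagonal entries against 0, and the absolute deviations are summed. A sketch:

```python
import numpy as np

def first_gap_value(product):
    """Sum of |p_ii - 1| on the main diagonal plus |p_ij| elsewhere (claim 13)."""
    n = product.shape[0]
    diag = np.eye(n, dtype=bool)                   # main-diagonal mask
    gap = np.abs(product[diag] - 1.0).sum()        # diagonal entries vs 1
    gap += np.abs(product[~diag]).sum()            # off-diagonal entries vs 0
    return float(gap)
```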
14. The apparatus of claim 11,
the fourth determining module is further configured to determine a second gap value between the product matrix and a transposed matrix of an identity matrix by using a method comprising:
determining, for each element of the product matrix located on the secondary diagonal, the absolute value of its difference from 1;
determining the absolute value of each element of the product matrix located outside the secondary diagonal;
determining the sum of the absolute values;
and taking the sum as the second gap value between the product matrix and the transposed matrix of the identity matrix.
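The second gap value of claim 14 applies the same measure against the exchange matrix (what the claims call the "transposed matrix of the identity matrix"), whose ones lie on the secondary diagonal; this covers separated sources emerging in swapped order. A sketch:

```python
import numpy as np

def second_gap_value(product):
    """Same measure as the first gap value, but entries on the secondary
    diagonal are compared to 1 (claim 14)."""
    n = product.shape[0]
    anti = np.fliplr(np.eye(n, dtype=bool))        # secondary-diagonal mask
    gap = np.abs(product[anti] - 1.0).sum()        # anti-diagonal entries vs 1
    gap += np.abs(product[~anti]).sum()            # remaining entries vs 0
    return float(gap)
```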
15. The apparatus of claim 11,
the apparatus further comprises:
a sixth determining module configured to determine first gap values corresponding to a plurality of historical frame audio time domain signals before a current frame audio time domain signal, determine a first coefficient according to the first gap values corresponding to the plurality of historical frame audio time domain signals, and determine that the first threshold is a product of a first fixed value and the first coefficient;
a seventh determining module configured to determine second gap values corresponding to a plurality of historical frame audio time domain signals before the current frame audio time domain signal, determine a second coefficient according to the second gap values corresponding to the plurality of historical frame audio time domain signals, and determine that the second threshold is a product of a second fixed value and the second coefficient.
16. The apparatus of claim 15,
the sixth determining module is further configured to determine the first coefficient according to the first gap values corresponding to the plurality of historical frame audio time domain signals by:
determining, for each historical frame audio time domain signal, a difference between its first gap value and the first fixed value, determining an average of these differences, and determining the first coefficient according to the average, wherein the average is positively correlated with the first coefficient;
the seventh determining module is further configured to determine the second coefficient according to the second gap values corresponding to the plurality of historical frame audio time domain signals by:
determining, for each historical frame audio time domain signal, a difference between its second gap value and the second fixed value, determining an average of these differences, and determining the second coefficient according to the average, wherein the average is positively correlated with the second coefficient.
17. An audio signal processing apparatus, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute executable instructions in the memory to implement the steps of the method of any one of claims 1 to 8.
18. A non-transitory computer readable storage medium having stored thereon executable instructions, wherein the executable instructions, when executed by a processor, implement the steps of the method of any one of claims 1 to 8.
CN202110001599.0A 2021-01-04 2021-01-04 Audio signal processing method and device and storage medium Pending CN112863537A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110001599.0A CN112863537A (en) 2021-01-04 2021-01-04 Audio signal processing method and device and storage medium


Publications (1)

Publication Number Publication Date
CN112863537A 2021-05-28

Family

ID=76001133

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110001599.0A Pending CN112863537A (en) 2021-01-04 2021-01-04 Audio signal processing method and device and storage medium

Country Status (1)

Country Link
CN (1) CN112863537A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113362841A (en) * 2021-06-10 2021-09-07 北京小米移动软件有限公司 Audio signal processing method, apparatus and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1605440A1 (en) * 2004-06-11 2005-12-14 Mist Technologies Method for signal source separation from a mixture signal
JP2010049083A (en) * 2008-08-22 2010-03-04 Nippon Telegr & Teleph Corp <Ntt> Sound signal enhancement device and method therefore, program and recording medium
GB201602382D0 (en) * 2016-02-10 2016-03-23 Cedar Audio Ltd Acoustic source seperation systems
US20180277140A1 (en) * 2017-03-21 2018-09-27 Kabushiki Kaisha Toshiba Signal processing system, signal processing method and storage medium
CN110428852A (en) * 2019-08-09 2019-11-08 南京人工智能高等研究院有限公司 Speech separating method, device, medium and equipment
CN111009256A (en) * 2019-12-17 2020-04-14 北京小米智能科技有限公司 Audio signal processing method and device, terminal and storage medium
CN111179960A (en) * 2020-03-06 2020-05-19 北京松果电子有限公司 Audio signal processing method and device and storage medium
CN111724801A (en) * 2020-06-22 2020-09-29 北京小米松果电子有限公司 Audio signal processing method and device and storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SHANG XIAOHUI ET AL.: "Blind Source Separation Order Adjustment Algorithm Based on Cyclic Iteration of the Optimal Separation Matrix", Military Communications Technology, vol. 32, no. 2, pages 1 - 4 *


Similar Documents

Publication Publication Date Title
EP3839951B1 (en) Method and device for processing audio signal, terminal and storage medium
CN111128221B (en) Audio signal processing method and device, terminal and storage medium
CN111009257B (en) Audio signal processing method, device, terminal and storage medium
CN111179960B (en) Audio signal processing method and device and storage medium
CN110970046B (en) Audio data processing method and device, electronic equipment and storage medium
CN111402917B (en) Audio signal processing method and device and storage medium
CN111986693A (en) Audio signal processing method and device, terminal equipment and storage medium
CN112863537A (en) Audio signal processing method and device and storage medium
CN112447184A (en) Voice signal processing method and device, electronic equipment and storage medium
CN112201267A (en) Audio processing method and device, electronic equipment and storage medium
CN110148424B (en) Voice processing method and device, electronic equipment and storage medium
CN111724801A (en) Audio signal processing method and device and storage medium
CN113223553B (en) Method, apparatus and medium for separating voice signal
CN114724578A (en) Audio signal processing method and device and storage medium
CN113362848B (en) Audio signal processing method, device and storage medium
CN113362841B (en) Audio signal processing method, device and storage medium
CN113314135B (en) Voice signal identification method and device
CN111429934B (en) Audio signal processing method and device and storage medium
CN115767346A (en) Earphone wind noise processing method and device and storage medium
CN113345456A (en) Echo separation method, device and storage medium
CN113314135A (en) Sound signal identification method and device
CN115589566A (en) Audio focusing method and device, storage medium and electronic equipment
CN113255375A (en) Translation model quantization method and device
CN111986688A (en) Method, device and medium for improving speech definition
CN113223543A (en) Speech enhancement method, apparatus and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination