CN110428852B - Voice separation method, device, medium and equipment
Publication number: CN110428852B (application number CN201910735350.5A)
Authority: CN (China)
Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Classifications
- G10L21/0272: Voice signal separating (G: Physics; G10: Musical instruments, acoustics; G10L: Speech analysis or synthesis, speech recognition, speech or audio coding or decoding; G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility; G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation)
Abstract
A voice separation method, apparatus, medium, and device are disclosed. The method comprises the following steps: acquiring a time-frequency domain mixed signal to be processed; performing voice separation processing on the mixed signal and obtaining its desired signal variance from the result of the separation; performing dereverberation processing on the mixed signal according to that desired signal variance to obtain a dereverberated time-frequency domain mixed signal; and obtaining the time-frequency domain signal of each sound source from the dereverberated mixed signal. The technical solution provided by the disclosure facilitates online voice separation with miniature dual microphones in highly reverberant environments, improving the accuracy of voice separation while preserving its real-time performance.
Description
Technical Field
The present disclosure relates to voice processing technologies, and in particular to a voice separation method, a voice separation apparatus, a storage medium, and an electronic device.
Background
Voice separation techniques extract the original sound source signals from a mixed signal of a plurality of sound sources, thereby enhancing a desired signal. Voice separation is currently used in applications such as smart home systems, video conference systems, and speech recognition systems.
The performance of existing speech separation algorithms generally degrades considerably in reverberant environments. How to maintain the performance of a voice separation algorithm under reverberation without greatly increasing the amount of computation, so that the real-time performance of voice separation is preserved, is a technical problem of wide concern.
Disclosure of Invention
Embodiments of the present disclosure are proposed to solve the above technical problem, and provide a voice separation method and apparatus, a storage medium, and an electronic device.
According to an aspect of the embodiments of the present disclosure, there is provided a speech separation method, including: acquiring a time-frequency domain mixed signal to be processed; performing voice separation processing on the mixed signal; obtaining the desired signal variance of the mixed signal according to the result of the voice separation processing; performing dereverberation processing on the mixed signal according to the desired signal variance to obtain a dereverberated time-frequency domain mixed signal; and obtaining the time-frequency domain signal of each sound source from the dereverberated time-frequency domain mixed signal.
According to another aspect of the embodiments of the present disclosure, there is provided a voice separation apparatus, including: a mixed signal acquisition module for acquiring a time-frequency domain mixed signal to be processed; a signal variance acquisition module for performing voice separation processing on the mixed signal acquired by the mixed signal acquisition module and obtaining its desired signal variance according to the result of the voice separation processing; a dereverberation processing module for performing dereverberation processing on the mixed signal according to the desired signal variance obtained by the signal variance acquisition module, to obtain a dereverberated time-frequency domain mixed signal; and a sound source separation module for obtaining the time-frequency domain signal of each sound source from the dereverberated mixed signal obtained by the dereverberation processing module.
According to still another aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium storing a computer program for executing the above-mentioned voice separation method.
According to still another aspect of the embodiments of the present disclosure, there is provided an electronic device, including: a processor; and a memory for storing processor-executable instructions; the processor being configured to read the executable instructions from the memory and execute them to implement the voice separation method described above.
According to the voice separation method and apparatus provided by the embodiments of the present disclosure, voice separation processing is first used to obtain the desired signal variance of the time-frequency domain mixed signal to be processed, and dereverberation is then performed using that variance, which improves the dereverberation result while avoiding an increase in its computational cost; the time-frequency domain signals of the sound sources are then obtained from the dereverberated mixed signal, which guarantees the accuracy of the finally obtained per-source signals. The technical solution of the present disclosure therefore facilitates online voice separation with miniature dual microphones in highly reverberant environments, improving the accuracy of voice separation while ensuring its real-time performance.
The technical solution of the present disclosure is further described in detail by the accompanying drawings and examples.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.
The present disclosure may be more clearly understood from the following detailed description, taken with reference to the accompanying drawings, in which:
FIG. 1 is a schematic view of a scenario to which the present disclosure is applicable;
FIG. 2 is a schematic flowchart of one embodiment of a speech separation method according to the present disclosure;
FIG. 3 is a schematic flowchart of one embodiment of acquiring a time-frequency domain mixed signal to be processed according to the present disclosure;
FIG. 4 is a schematic diagram of one embodiment of a miniature dual-microphone system according to the present disclosure;
FIG. 5 is a schematic flowchart of one embodiment of performing voice separation processing on a time-frequency domain mixed signal to be processed according to the present disclosure;
FIG. 6 is a schematic flowchart of another embodiment of a speech separation method according to the present disclosure;
FIG. 7 is a schematic structural diagram of an embodiment of a speech separation apparatus according to the present disclosure;
FIG. 8 is a block diagram of an electronic device provided in an exemplary embodiment of the present disclosure.
Detailed Description
Example embodiments according to the present disclosure will be described in detail below with reference to the accompanying drawings. It is to be understood that the described embodiments are merely a subset of the embodiments of the present disclosure, and that the present disclosure is not limited to the example embodiments described herein.
It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
It will be understood by those skilled in the art that terms such as "first" and "second" in the embodiments of the present disclosure are used merely to distinguish one element from another, and imply neither any particular technical meaning nor any necessary logical order between them.
It is also understood that in embodiments of the present disclosure, "a plurality" may refer to two or more than two and "at least one" may refer to one, two or more than two.
It is also to be understood that any reference to any component, data, or structure in the embodiments of the disclosure, may be generally understood as one or more, unless explicitly defined otherwise or stated otherwise.
In addition, the term "and/or" in the present disclosure merely describes an association between objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A alone, both A and B, or B alone. Further, the character "/" in the present disclosure generally indicates an "or" relationship between the preceding and following objects.
It should also be understood that the description of the various embodiments of the present disclosure emphasizes the differences between the various embodiments, and the same or similar parts may be referred to each other, so that the descriptions thereof are omitted for brevity.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
Embodiments of the present disclosure may be implemented in electronic devices such as terminal devices, computer systems, servers, etc., which are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known terminal devices, computing systems, environments, and/or configurations that may be suitable for use with an electronic device, such as a terminal device, computer system, or server, include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, distributed cloud computing environments that include any of the above, and the like.
Electronic devices such as terminal devices, computer systems, servers, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be implemented in a distributed cloud computing environment. In a distributed cloud computing environment, tasks may be performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
Summary of the disclosure
In the course of implementing the present disclosure, the inventors found that, in a high-reverberation environment, in order to improve the accuracy of the per-source time-frequency domain signals obtained by speech separation, dereverberation processing is usually performed on the time-frequency domain signals first, and speech separation processing is then performed on the dereverberated result. The dereverberation typically relies on calculations involving a spatial correlation matrix, and this entails a large amount of computation, which hinders online speech separation in high-reverberation environments. If the dereverberation processing can instead be performed using the desired signal variance, the amount of computation can be reduced to a large extent, so that online voice separation under high reverberation can be achieved while the accuracy of the separated per-source time-frequency domain signals is improved.
Exemplary application scenario
The voice separation techniques provided by the present disclosure can be widely applied to tasks such as on-site conferences, teleconferences, and voice interaction.
One example is shown in FIG. 1. Two microphones may be provided in a portable translation device 100 (e.g., a smart mobile phone), forming a miniature dual-microphone system. The translation device 100 is used to implement bilingual translation.
During a conversation between a user 101 and a user 102 in a highly reverberant environment, the user 101 sets the translation device 100 to an operating state for bi-directional translation between a first language (e.g., Chinese) and a second language (e.g., English).
The translation device 100 acquires external audio signals in real time through its built-in miniature dual-microphone system to obtain a sound-source mixed signal, in which the sound source signal of the current speaker is mixed with background noise. The background noise may include reverberation noise caused by reflections of the sound source signal. The translation device 100 may use the voice separation technique provided by the present disclosure to separate the current speaker's sound source signal from the currently obtained mixed signal, thereby avoiding the influence of background noise on subsequent speech recognition, improving the clarity of the current speaker's voice, and in turn improving the accuracy of speech recognition.
Thereafter, the translation device 100 may perform speech recognition on the separated sound source signal of the current speaker and, based on the recognition result, determine the language used by the current speaker and the content of the speech.
Finally, the translation device 100 can convert the content of the current speaker's speech into the other language and output it, for example by displaying the converted text on its display screen, or by playing the converted speech through its speaker.
By repeating the operations of collecting audio signals, separating the sound source signal, performing speech recognition, and performing language conversion, a continuous conversation between the user 101 and the user 102 can be supported.
Exemplary method
FIG. 2 is a flowchart of an example of a speech separation method of the present disclosure. As shown in FIG. 2, the method of this embodiment includes steps S200, S201, S202, and S203, each described below.
S200, acquiring a time-frequency domain mixed signal to be processed.
The time-frequency domain mixed signal to be processed in the present disclosure may also be referred to as the time-frequency domain mixed signal to be separated. It is a time-frequency domain signal formed from a plurality of initial sound sources, which may include desired initial sound sources and undesired initial sound sources. A desired initial sound source may be, for example, a speaker in a conference or a participant in a conversation. An undesired initial sound source may be a noise source or unwanted interfering speech.
The time-frequency domain mixed signal in the present disclosure may refer to a signal containing both frequency-domain and time-domain information; such a signal generally describes how the frequency-domain components of various random signals vary over time.
S201, performing voice separation processing on the time-frequency domain mixed signal to be processed, and obtaining the desired signal variance of the mixed signal according to the result of the voice separation processing.
The voice separation processing in the present disclosure may refer to obtaining the desired time-frequency domain signal of each sound source from the time-frequency domain mixed signal to be processed. Accordingly, the result of the voice separation processing may refer to the desired time-frequency domain signals of the sound sources separated from the mixed signal.
The desired signal of the time-frequency domain mixed signal to be processed may refer to the mixed signal with its reverberation component removed.
In the present disclosure, the desired signal variance of the time-frequency domain mixed signal to be processed may be determined from the sum of the desired-signal variances of the time-frequency domain signals of the respective sound sources in the mixed signal.
S202, performing dereverberation processing on the time-frequency domain mixed signal to be processed according to its desired signal variance, to obtain a dereverberated time-frequency domain mixed signal.
The dereverberation processing in the present disclosure may refer to the processing used to remove the late reverberation component from the time-frequency domain mixed signal to be processed. Reverberation is caused by multipath reflections of the speech signal on its way from the sound source to the microphone: early reflections (e.g., from walls) arriving at the microphone form the early reverberation, and later reflections arriving at the microphone form the late reverberation. The present disclosure may regard the desired signal of the mixed signal as the sum of the early reverberation and the direct sound that travels from the sound source straight to the microphone.
The dereverberated time-frequency domain mixed signal in the present disclosure may refer to the mixture of the dereverberated time-frequency domain signals of the plurality of sound sources.
S203, acquiring the time-frequency domain signal of each sound source according to the dereverberated time-frequency domain mixed signal.
The time-frequency domain signal of each sound source in the present disclosure may refer to the time-frequency domain signals of the plurality of sound sources separated from the dereverberated time-frequency domain mixed signal.
In this method, voice separation processing is first used to obtain the desired signal variance of the time-frequency domain mixed signal to be processed, and dereverberation is then performed using that variance, which improves the dereverberation result while reducing its computational cost; the time-frequency domain signals of the sound sources are then obtained from the dereverberated mixed signal, which ensures the accuracy of the finally obtained per-source signals. The present disclosure therefore facilitates online voice separation with miniature dual microphones in highly reverberant environments, greatly improving the accuracy of voice separation while ensuring its real-time performance.
In an alternative example, the process of acquiring a time-frequency domain mixed signal to be processed in S200 of the present disclosure is shown in fig. 3.
In fig. 3, S300, a time-frequency domain mixed signal obtained by signal acquisition is acquired.
Optionally, the present disclosure may perform signal acquisition with a miniature dual-microphone system to obtain the time-frequency domain mixed signal. A miniature dual-microphone system here refers to an audio acquisition system formed by two closely spaced microphones, whose spacing is typically much smaller than the signal wavelength. For example, a miniature dual-microphone system may be provided in devices such as smartphones and handheld translators to acquire the time-frequency domain mixed signal.
Alternatively, an example of a miniature dual-microphone system is shown in fig. 4: it comprises a microphone 1 and a microphone 2 separated by a distance δ, and θ denotes the sound source direction.
In an alternative example, the present disclosure may apply a discrete short-time Fourier transform to the time-domain signals collected by the microphones to obtain the time-frequency domain mixed signal. The window function of the transform may be a Hanning window of length 64 ms (milliseconds), with a 75% overlap between adjacent time-domain frames; that is, two consecutive 64 ms frames to be transformed share the same 48 ms of time-domain signal.
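As an illustration of this analysis front end, the following sketch computes such an STFT with SciPy; the 16 kHz sampling rate and the helper name to_time_frequency are assumptions for illustration, not values fixed by the patent.

import numpy as np
from scipy.signal import stft

fs = 16000                    # assumed sampling rate (Hz)
frame_len = int(0.064 * fs)   # 64 ms Hanning window -> 1024 samples
hop = frame_len // 4          # 75% overlap -> 48 ms shared between frames

def to_time_frequency(x_mic):
    """x_mic: (n_mics, n_samples) time-domain signals from the two mics.
    Returns X: (n_mics, n_bands, n_frames) complex STFT coefficients."""
    _, _, X = stft(x_mic, fs=fs, window='hann',
                   nperseg=frame_len, noverlap=frame_len - hop)
    return X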
S301, acquiring the beam output signal of each frequency band according to the time-frequency domain mixed signal.
Alternatively, the present disclosure may form the beam output signals of each frequency band in S301 using a DMA (Differential Microphone Array) technique. An example is as follows:
first, DMA beam filters for respective frequency bands of the time-frequency domain mixed signal are acquired.
Optionally, the present disclosure may obtain the DMA beam filter of each frequency band of the time-frequency domain mixed signal using formula (1), in which F(k) denotes the pair of cardioid-directivity DMA beam filters for frequency band k, the first column being the filter for the beam pointing to 0° and the second column the filter for the beam pointing to 180°; i denotes the imaginary unit; f_s denotes the sampling frequency; δ denotes the spacing of the two microphones; and c_0 denotes the speed of sound in air.
Next, a beam output signal of each frequency band is obtained from the time-frequency domain mixed signal and the DMA beam filter of each frequency band.
Optionally, the present disclosure may obtain the beam output signal of each frequency band using the following formula (2):
x_b(k, l) = F(k) x(k, l)    formula (2)
In formula (2), x_b(k, l) denotes the beam output signal of the l-th frame of frequency band k. For a microphone system with two output channels, such as a miniature dual-microphone system, x_b(k, l) may be written as x_b(k, l) = [x_b,1(k, l), x_b,2(k, l)]^T, where x_b,1(k, l) denotes the output signal of the l-th frame of frequency band k corresponding to the first beam, x_b,2(k, l) denotes that corresponding to the second beam, and [x_b,1(k, l), x_b,2(k, l)]^T denotes the transpose of [x_b,1(k, l), x_b,2(k, l)]. F(k) denotes the DMA beam filter of frequency band k; for a two-channel microphone system (e.g., a miniature dual-microphone system), F(k) comprises two columns, a first column f_1 and a second column f_2, each of length 2. x(k, l) denotes the l-th frame of frequency band k of the time-frequency domain mixed signal.
As illustrated in fig. 4, the l-th frame of frequency band k of the mixed signal from microphone 1 is multiplied by f_1, the l-th frame of frequency band k of the mixed signal from microphone 2 is multiplied by f_2, and the two products are added to form x_b(k, l).
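As a concrete illustration of formulas (1) and (2), the sketch below builds a pair of first-order cardioid DMA filters and applies them per band. Since formula (1) itself is not reproduced in the text, the standard delay-and-subtract cardioid form, the microphone spacing value, and the row-per-beam storage convention are assumptions.

import numpy as np

def dma_filters(n_bands, fs=16000, delta=0.01, c0=343.0):
    """Return F of shape (n_bands, 2, 2); row 0 is the 0-degree beam,
    row 1 the 180-degree beam. (The patent stores the beams as columns
    of F(k); rows are used here so that formula (2) is a plain
    matrix-vector product per band.)"""
    F = np.zeros((n_bands, 2, 2), dtype=complex)
    tau = delta / c0                                    # inter-mic delay (s)
    for k in range(n_bands):
        w = 2.0 * np.pi * k * fs / (2 * max(n_bands - 1, 1))  # band angular freq
        # first-order differential pair: delay one microphone, then subtract
        F[k, 0, :] = [1.0, -np.exp(-1j * w * tau)]      # cardioid toward 0 deg
        F[k, 1, :] = [-np.exp(-1j * w * tau), 1.0]      # cardioid toward 180 deg
    return F

def beam_outputs(X, F):
    """X: (2, n_bands, n_frames) STFT of the two microphones.
    Returns x_b: (2, n_bands, n_frames), one channel per beam,
    i.e. x_b(k, l) = F(k) x(k, l) for every band k and frame l."""
    return np.einsum('kbm,mkl->bkl', F, X)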
S302, forming the to-be-processed time-frequency domain mixed signal of the corresponding frequency band based on the beam output signal of each frequency band.
Optionally, the present disclosure may directly use the beam output signal of each frequency band as the to-be-processed time-frequency domain mixed signal of that band, or may first apply further processing to the beam output signal of each band and use the result as the to-be-processed mixed signal.
By forming per-band beam output signals with beamforming and using them to form the to-be-processed time-frequency domain mixed signals, the present disclosure can suppress interference such as noise, interfering speech, and reverberation in the mixed signal. Further, forming the beams with DMA techniques gives a small microphone array (such as a miniature dual-microphone system) good directivity that is largely independent of frequency, which helps improve voice separation performance in highly reverberant environments.
In an alternative example, an example of performing the speech separation processing on the time-frequency domain mixed signal to be processed in S201 of the present disclosure may be as shown in fig. 5.
In fig. 5, S500, the a priori desired signal of the sound-source mixed signal is acquired for the beam output signal of each frequency band.
Optionally, the present disclosure may employ a dereverberation technique to obtain the desired signals of the sound-source mixed signals of the per-band beam output signals. To distinguish it from the dereverberation processing in S202, the dereverberation in this step may be referred to as dereverberation pre-processing.
Optionally, the present disclosure may obtain the desired signal of each sound-source mixed signal in S500 using an RLS-based MCLP (Recursive-Least-Squares-based Multi-Channel Linear Prediction) algorithm. One example may include steps 1 and 2 below.
Step 1: obtaining the dereverberation prediction matrix.
Optionally, for the first frame of the beam output signal of each frequency band, the present disclosure may obtain the dereverberation prediction matrix by random initialization; for a non-first frame, the prediction matrix of the current frame may be calculated from that of the previous frame. For example, the present disclosure may initialize the dereverberation prediction matrix for the beam output signal of each channel, calculate the desired signal of the first frame with that matrix, and update the prediction matrix of the first frame of each frequency band using formula (4) below.
Step 2: calculating the a priori desired signal of each frequency band of each sound-source mixed signal from the obtained dereverberation prediction matrix and the per-band beam output signals. The present disclosure may calculate the a priori desired signal using the following formula (3):
d̂(k, l) = x_b(k, l) - G^H(k, l-1) x̃_b(k, l)    formula (3)
In formula (3), d̂(k, l) denotes the a priori desired signal of the l-th frame of frequency band k of the sound-source mixed signal; x_b(k, l) denotes the beam output signal of the l-th frame of frequency band k; G(k, l-1) denotes the dereverberation prediction matrix of the (l-1)-th frame of frequency band k, which may be updated with formula (4) below, and G^H(k, l-1) is its conjugate transpose; x̃_b(k, l) denotes the delayed, stacked beam output signals, given by formula (5) below.
G(k, l) = G(k, l-1) + k(k, l) d̂^H(k, l)    formula (4)
In formula (4), G(k, l) denotes the dereverberation prediction matrix of the l-th frame of frequency band k; G(k, l-1) denotes that of the (l-1)-th frame; k(k, l) denotes the gain vector of the l-th frame of frequency band k, which may be expressed by formula (6) below; and d̂^H(k, l) is the conjugate transpose of d̂(k, l), the a priori desired signal of the l-th frame of frequency band k of the sound-source mixed signal.
x̃_b(k, l) = [x_b^T(k, l-D), …, x_b^T(k, l-D-L_g+1)]^T    formula (5)
In formula (5), D denotes the prediction delay; L_g denotes the prediction order, a preset known value; x_b^T(k, l-D) denotes the transpose of x_b(k, l-D), the beam output signal of the (l-D)-th frame of frequency band k; x_b^T(k, l-D-L_g+1) denotes the transpose of x_b(k, l-D-L_g+1), the beam output signal of the (l-D-L_g+1)-th frame of frequency band k; and [·]^T denotes transposition.
k(k, l) = (λ σ^2(k, l))^{-1} Ψ^{-1}(k, l-1) x̃_b(k, l) / (1 + (λ σ^2(k, l))^{-1} x̃_b^H(k, l) Ψ^{-1}(k, l-1) x̃_b(k, l))    formula (6)
In formula (6), σ^2(k, l) denotes the desired signal variance of the l-th frame of frequency band k, which can be calculated from the time-frequency domain signals of the sound sources, for example with formula (7) below; λ denotes the forgetting factor, a preset known value, and λ^{-1} its reciprocal; Ψ(k, l-1) denotes the variance-weighted covariance matrix of the (l-1)-th frame of frequency band k, and Ψ^{-1}(k, l-1) its inverse. In particular, for the first frame of the beam output signal of each band, Ψ^{-1}(k, l-1) may be initialized with a weighted identity matrix; for a non-first frame, it may be updated with formula (8) below. x̃_b(k, l) denotes the delayed, stacked beam output signals of formula (5), and x̃_b^H(k, l) its conjugate transpose.
σ^2(k, l) = (1/M) ‖y(k, l)‖^2    formula (7)
In formula (7), M denotes the number of microphones; y(k, l) denotes the time-frequency domain signals of the sound sources in the l-th frame of frequency band k, stacked as a vector; and ‖·‖^2 denotes the square of the 2-norm.
Ψ^{-1}(k, l) = λ^{-1} (I - k(k, l) x̃_b^H(k, l)) Ψ^{-1}(k, l-1)    formula (8)
In formula (8), λ denotes the forgetting factor, typically a preset known value, and λ^{-1} its reciprocal; I is the identity matrix; k(k, l) denotes the gain vector of formula (6); and x̃_b^H(k, l) is the conjugate transpose of the delayed, stacked beam output signals of formula (5).
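To make the recursion of formulas (3) through (8) concrete, the following sketch processes one frequency band frame by frame. It assumes the standard RLS-MCLP forms reconstructed above; the class name, dimensions, and parameter values (delay D = 2, order L_g = 8, forgetting factor λ = 0.99) are illustrative choices, while the 0.1-weighted identity initialization of Ψ^{-1} follows S603 below.

import numpy as np

class RLSMCLPBand:
    """RLS-based MCLP dereverberation for one frequency band k,
    following formulas (3)-(8); a sketch under stated assumptions."""
    def __init__(self, n_ch=2, delay=2, order=8, lam=0.99):
        self.D, self.Lg, self.lam = delay, order, lam
        n = n_ch * order
        self.G = np.zeros((n, n_ch), dtype=complex)    # prediction matrix G(k,l)
        self.Psi_inv = np.eye(n, dtype=complex) / 0.1  # 0.1-weighted identity init
        self.frames = [np.zeros(n_ch, dtype=complex)] * (delay + order)

    def step(self, x_b, sigma2):
        """x_b: beam output x_b(k,l), shape (n_ch,); sigma2: desired
        signal variance sigma^2(k,l) of formula (7). Returns d(k,l)."""
        self.frames = self.frames[1:] + [x_b]          # frames[-1] is frame l
        # formula (5): stack x_b(k,l-D) ... x_b(k,l-D-Lg+1), most recent first
        x_tld = np.concatenate(self.frames[-(self.D + self.Lg):-self.D][::-1])
        # formula (3): a priori desired signal
        d = x_b - self.G.conj().T @ x_tld
        # formula (6): gain vector
        z = self.Psi_inv @ x_tld / (self.lam * sigma2)
        k_gain = z / (1.0 + np.vdot(x_tld, z))
        # formula (8): update of the inverse weighted covariance
        self.Psi_inv = (self.Psi_inv
                        - np.outer(k_gain, x_tld.conj()) @ self.Psi_inv) / self.lam
        # formula (4): update of the prediction matrix
        self.G = self.G + np.outer(k_gain, d.conj())
        return d

In the full flow, S603 below uses the returned a priori signal d̂(k, l), while S606 recomputes the desired signal with the updated matrix per formula (20); a second evaluation of x_b - G^H x̃_b after the update gives that value.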
S501, obtaining the time-frequency domain signals of each frequency band of each sound source from the a priori desired signals of the sound-source mixed signals of the per-band beam output signals, using IVA (Independent Vector Analysis) based on an NMF (Non-negative Matrix Factorization) model.
Optionally, the NMF-model-based IVA approach (NMF-IVA) in the present disclosure may refer to minimizing the cost function of the NMF-IVA algorithm. The cost function is minimized iteratively to obtain the separation matrix of each frequency band; the separation matrix is then applied to the desired signal of the mixed signal to obtain the time-frequency signal of each sound source. That is, the time-frequency domain signal of each frequency band of each sound source may be obtained using the following formula (9):
y_m(k, l) = e_m^H W^{-1}(k, l) E_m W(k, l) d(k, l)    formula (9)
In formula (9), y_m(k, l) denotes the separated time-frequency domain signal of the l-th frame of frequency band k of the m-th sound source; W(k, l) denotes the separation matrix of the l-th frame of frequency band k, given by formula (10) below, and W^{-1}(k, l) is its inverse; E_m denotes the M×M matrix whose m-th diagonal element is 1 and whose remaining elements are 0, M being the number of microphones; d(k, l) denotes the desired signal of the l-th frame of frequency band k of the sound-source mixed signal, which can be calculated by formula (20) below; e_m denotes the unit vector whose m-th element is 1, and e_m^H its conjugate transpose.
In formula (10), w_m(k, l) denotes the conjugate transpose of the m-th row of W(k, l); w_m^H(k, l) denotes the conjugate transpose of w_m(k, l); and U_m(k, l) denotes the variance-weighted spatial correlation matrix of the desired signal for the m-th microphone in frequency band k:
w_m(k, l) = w_m(k, l) (w_m^H(k, l) U_m(k, l) w_m(k, l))^{-1/2}    formula (10)
The w_m(k, l) on the right-hand side of formula (10) is obtained by the following formula (10-1):
w_m(k, l) = [W(k, l-1) U_m(k, l)]^{-1} e_m    formula (10-1)
In formula (10-1), W(k, l-1) denotes the separation matrix of the (l-1)-th frame of frequency band k. For the first frame of the beam output signal of each band, W(k, l-1) may be initialized with an identity matrix; for a non-first frame, W(k, l-1) is the separation matrix finally obtained for the (l-1)-th frame. Optionally, U_m(k, l) in the present disclosure may be calculated by exponential moving average, for example using the following formula (11):
in the above formula (11), α represents a smoothing factor, which is a known value set in advance, and for example, α may be 0.98; d (k, l) represents the desired signal of the l-th frame of the frequency band k in each sound source mix signal;dH(k, l) is a conjugate transpose of d (k, l), and d (k, l) can be obtained by calculation of the following formula (20); gamma raykl,mThe following formula (12) can be used.
γ_{kl,m} = Σ_{j=1}^{J} t_{kj,m} v_{jl,m}    formula (12)
In formula (12), j ranges over 1, …, J, where J denotes the number of bases of the non-negative matrix factorization of the sound source; t_{kj,m} denotes the j-th basis of the non-negative matrix factorization of frequency band k of the m-th sound source; v_{jl,m} denotes the j-th activation value of the non-negative matrix factorization of frequency band k of the m-th sound source. For the first frame of the beam output signal of each band, t_{kj,m} and v_{jl,m} may be set by initialization, for example with random numbers; for a non-first frame, t_{kj,m} and v_{jl,m} may be updated using the following formulas (13) and (14).
in the above equations (13) and (14), K is in the range of 1-K, and K represents the number of frequency bands;the a priori time-frequency domain signal of the l frame representing the frequency band k of the m sound source obtained by separation can be obtained by calculation using the following formula (17); j 'has a value range of J' 1.. and J, J represents the number of bases of the non-negative matrix factorization of the sound source; t is takj,mAnd tbkj,mAll intermediate variables, 0 initialization may be used for the first frame in the beam output signal for each frequency band, for eachThe non-first frame in the beam output signal of the frequency band can be obtained by calculation using the following formula (15) and formula (16).
In addition, it should be noted that equation (14) may be executed in a loop for multiple times, and the number of loop executions may be preset, for example, may be executed in a loop for 100 times, and so on.
ta_{kj,m} = α ta_{kj,m} + (1 - α) v_{jl,m} (Σ_{j'} t_{kj',m} v_{j'l,m})^{-1}    formula (15)
tb_{kj,m} = α tb_{kj,m} + (1 - α) v_{jl,m} |ŷ_m(k, l)|^2 (Σ_{j'} t_{kj',m} v_{j'l,m})^{-2}    formula (16)
In formulas (15) and (16), α denotes a smoothing factor, a preset known value; ŷ_m(k, l) denotes the separated a priori time-frequency domain signal of the l-th frame of frequency band k of the m-th sound source, obtainable by formula (17) below; j' ranges over 1, …, J, where J denotes the number of bases of the non-negative matrix factorization of the sound source.
ŷ_m(k, l) = w_m^H(k, l-1) d(k, l)    formula (17)
In formula (17), w_m^H(k, l-1) denotes the conjugate transpose of w_m(k, l-1), where w_m(k, l-1) is the conjugate transpose of the m-th row of W(k, l-1), the separation matrix of the (l-1)-th frame of frequency band k; d(k, l) denotes the desired signal of the l-th frame of frequency band k of the sound-source mixed signal, obtainable by formula (20) below.
In the prior-art NMF-model-based IVA, U_m(k, l) is generally expressed by the following formula (18):
U_m(k, l) = (1/L) Σ_{l=1}^{L} d(k, l) d^H(k, l) / γ_{kl,m}    formula (18)
In formula (18), L denotes the number of frames of frequency band k, and l ranges from 1 to L; d(k, l) denotes the desired signal of the l-th frame of frequency band k of the sound-source mixed signal, obtainable by formula (20) below, and d^H(k, l) is its conjugate transpose; γ_{kl,m} is as in formula (12) above. In the prior art, however, t_{kj,m} in formula (12) is generally expressed by the following formula (19):
t_{kj,m} = t_{kj,m} · [Σ_{l=1}^{L} v_{jl,m} |ŷ_m(k, l)|^2 γ_{kl,m}^{-2}] / [Σ_{l=1}^{L} v_{jl,m} γ_{kl,m}^{-1}]    formula (19)
In formula (19), ŷ_m(k, l) denotes the separated a priori time-frequency domain signal of the l-th frame of frequency band k of the m-th sound source; l ranges from 1 to L, where L denotes the number of frames of frequency band k; j' ranges over 1, …, J, where J denotes the number of bases of the non-negative matrix factorization of the sound source.
Comparing formulas (11) and (13) of the present disclosure with prior-art formulas (18) and (19), it can be seen that formulas (18) and (19) must sum over the whole time period in the time and frequency domains and therefore cannot support online speech separation, whereas formulas (11) and (13) are computed by exponential moving average, which facilitates online voice separation and thus improves the real-time performance of the voice separation technique.
By obtaining the per-band time-frequency domain signals of each sound source with NMF-model-based IVA, the present disclosure effectively separates the time-frequency domain signals of the sound sources. Moreover, obtaining the desired signals of the sound-source mixed signals through dereverberation pre-processing improves the accuracy of those desired signals and hence of the voice separation processing. In addition, when the desired signals are obtained with the RLS-based MCLP algorithm, whose computational cost is small, combining NMF-model-based IVA with RLS-based MCLP facilitates online dereverberation and online voice separation.
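For concreteness, the per-frame NMF-IVA update of formulas (10) through (17) can be sketched as follows, operating on all frequency bands of one frame. The online AuxIVA-style form (exponential-moving-average covariances, multiplicative NMF updates) follows the reconstruction above; the function name, array shapes, and the eps regularizer are assumptions for illustration.

import numpy as np

def nmf_iva_frame(d, W_prev, U, T, V_l, ta, tb, alpha=0.98, eps=1e-12):
    """One online NMF-IVA frame; a sketch. The m-th row of W is w_m^H.
    d: (K, M) desired signals d(k,l); W_prev: (K, M, M) matrices of frame l-1;
    U: (M, K, M, M) EMA covariances; T: (M, K, J) bases t_{kj,m};
    V_l: (M, J) activations v_{jl,m}; ta, tb: (M, K, J) EMA accumulators."""
    K, M = d.shape
    W = W_prev.copy()
    # formula (17): prior separated signals y^_m(k,l) = w_m^H(k,l-1) d(k,l)
    y_pri = np.einsum('kmc,kc->km', W_prev, d)            # (K, M)
    p = np.abs(y_pri.T) ** 2                              # power, (M, K)
    for m in range(M):
        gam = T[m] @ V_l[m] + eps                         # formula (12), (K,)
        # formulas (15)-(16): EMA accumulators; formula (13): basis update
        ta[m] = alpha * ta[m] + (1 - alpha) * V_l[m][None, :] / gam[:, None]
        tb[m] = alpha * tb[m] + (1 - alpha) * V_l[m][None, :] * (p[m] / gam**2)[:, None]
        T[m] *= tb[m] / (ta[m] + eps)
        # formula (14): activation update (the patent repeats this, e.g. 100 times)
        gam = T[m] @ V_l[m] + eps
        num = (T[m] * (p[m] / gam**2)[:, None]).sum(axis=0)
        den = (T[m] / gam[:, None]).sum(axis=0) + eps
        V_l[m] *= num / den
        gam = T[m] @ V_l[m] + eps
        for k in range(K):
            # formula (11): EMA of the variance-weighted covariance
            U[m, k] = alpha * U[m, k] + (1 - alpha) * np.outer(d[k], d[k].conj()) / gam[k]
            # formula (10-1): w_m = [W(k,l-1) U_m(k,l)]^{-1} e_m
            w = np.linalg.solve(W[k] @ U[m, k], np.eye(M)[:, m])
            # formula (10): normalization by (w_m^H U_m w_m)^{-1/2}
            w /= np.sqrt(np.real(w.conj() @ U[m, k] @ w)) + eps
            W[k, m, :] = w.conj()                         # store w_m^H as row m
    return W, U, T, V_l, ta, tb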
In an alternative example, one implementation of S202 in the present disclosure may be: and performing dereverberation processing on the time-frequency domain mixed signal to be processed according to the expected signal variance of the time-frequency domain mixed signal to be processed by using an RLS-based MCLP algorithm to obtain the time-frequency domain mixed signal subjected to dereverberation.
Optionally, the present disclosure may calculate the desired signal of each frequency band of each sound-source mixed signal using the following formula (20):
d(k, l) = x_b(k, l) - G^H(k, l) x̃_b(k, l)    formula (20)
In formula (20), d(k, l) denotes the desired signal of the l-th frame of frequency band k of the sound-source mixed signal; x_b(k, l) denotes the beam output signal of the l-th frame of frequency band k; G(k, l) denotes the dereverberation prediction matrix of the l-th frame of frequency band k, updated by formula (4) above, and G^H(k, l) is its conjugate transpose; x̃_b(k, l) denotes the delayed, stacked beam output signals of formula (5) above. The gain vector k(k, l) in formula (4) is given by formula (6) above; σ^2(k, l) in formula (6) is the desired signal variance of the to-be-processed time-frequency domain mixed signal obtained in the preceding step; and Ψ^{-1}(k, l) in formula (6) is updated by formula (8) above.
Performing dereverberation with the RLS-based MCLP algorithm according to the desired signal variance of the to-be-processed mixed signal improves the dereverberation result while reducing its computational cost; this ensures the accuracy of the finally obtained per-source time-frequency domain signals and greatly improves the real-time performance of voice separation.
In an alternative example, in S203 of the present disclosure, the time-frequency domain signals of each sound source may be obtained from the dereverberated time-frequency domain mixed signal in the NMF-IVA manner.
Optionally, the present disclosure may obtain the time-frequency domain signal of each sound source using formula (9) above. Likewise, W(k, l) in formula (9) may be updated using formula (10); U_m(k, l) in formula (10) may be calculated by formula (11); γ_{kl,m} in formula (11) is given by formula (12); t_{kj,m} and v_{jl,m} in formula (12) may be updated using formulas (13) and (14); and ŷ_m(k, l) in formulas (14) and (16) is obtained by formula (17).
Optionally, the present disclosure may apply a short-time inverse Fourier transform to the finally separated time-frequency domain signals of each sound source to obtain the separated time-domain signal of each sound source.
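Continuing the front-end sketch given earlier, synthesis mirrors the assumed analysis settings (Hanning window, 64 ms frames, 75% overlap); the helper name is again an assumption.

from scipy.signal import istft

def to_time_domain(Y, fs=16000, frame_len=1024, hop=256):
    """Y: (n_sources, n_bands, n_frames) separated STFT signals.
    Returns the (n_sources, n_samples) time-domain signals."""
    _, y = istft(Y, fs=fs, window='hann',
                 nperseg=frame_len, noverlap=frame_len - hop)
    return y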
According to the method and the device, the time-frequency domain signals of the sound sources in each frequency band are obtained by adopting an NMF-IVA mode, and the time-frequency domain signals of the sound sources can be effectively separated.
In an alternative example, an example of the speech separation method of the present disclosure is shown in fig. 6.
In fig. 6, S600, a discrete short-time Fourier transform is applied to the time-domain signals collected by the microphones to obtain the time-frequency domain mixed signal.
S601, forming the beam output signal of each frequency band using the DMA method: for example, the DMA beam filters of the respective frequency bands of the time-frequency domain mixed signal are acquired, and the beam output signals of the respective bands are then obtained from the mixed signal and those filters. The present disclosure may use the beam output signal of each frequency band as the time-frequency domain mixed signal to be processed, and may perform the following steps for each frequency band (for example, frequency band k) separately. This step sets l = 1.
S602, judging whether l is greater than L (l ranges from 1 to L, where L denotes the number of frames contained in the current frequency band); if so, go to S609; if not, go to S603.
S603, performing dereverberation pre-processing on the l-th frame of frequency band k of the sound-source mixed signal in the beam output signal using the RLS-based MCLP algorithm, thereby obtaining the a priori desired signal of the l-th frame of frequency band k of the sound-source mixed signal.
For the first frame (i.e., l = 1) of frequency band k, Ψ^{-1}(k, l-1) in the RLS-based MCLP algorithm may be initialized as 0.1 × I (a scaled identity matrix), and t_{kj,m} and v_{jl,m} in the NMF-IVA algorithm may be initialized with random numbers. For a non-first frame (i.e., l > 1) of frequency band k, the present disclosure may obtain Ψ^{-1}(k, l), t_{kj,m}, and v_{jl,m} using formulas (8), (13), and (14).
The present disclosure may calculate the a priori desired signal d̂(k, l) of the l-th frame of frequency band k of the sound-source mixed signal using formula (3) above.
S604, performing voice separation processing on the l-th frame of frequency band k of the sound-source mixed signal using the NMF-IVA algorithm according to the a priori desired signal d̂(k, l), to obtain the time-frequency domain signal of the l-th frame of frequency band k of each sound source.
Optionally, the present disclosure first calculates the a priori time-frequency domain signal ŷ_m(k, l) of the l-th frame of frequency band k of the m-th sound source using formula (17) above; formula (14) may then be executed repeatedly (e.g., 100 times), with calculations according to formulas (13), (12), (11), (10), and (10-1); thereafter, the time-frequency domain signal y_m(k, l) of the l-th frame of frequency band k of the m-th sound source is updated using formula (9).
S605, calculating the desired signal variance of the l-th frame of frequency band k from the time-frequency domain signals of the sound sources in that frame, for example using formula (7) above.
S606, performing dereverberation processing on the l-th frame of frequency band k of the sound-source mixed signal using the RLS-based MCLP algorithm to obtain the desired signal of that frame, for example calculating d(k, l) by formula (20) above.
S607, performing voice separation processing on the l-th frame of frequency band k using the NMF-IVA algorithm according to the obtained desired signal d(k, l), to obtain the time-frequency domain signal of the l-th frame of frequency band k of each sound source.
Optionally, as in S604, the present disclosure calculates ŷ_m(k, l) using formula (17), executes formula (14) repeatedly (e.g., 100 times) with calculations according to formulas (13), (12), (11), (10), and (10-1), and then updates y_m(k, l) using formula (9).
S608, applying a short-time inverse Fourier transform to the separated time-frequency domain signal of the l-th frame of frequency band k of each sound source to obtain the time-domain signals of the sound sources; then setting l = l + 1 and returning to S602.
S609, ending the voice separation process.
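Putting the steps S600 through S609 together, the online loop can be outlined as below, reusing the sketches given earlier. For brevity this outline collapses the two per-frame passes (S604 plus S607) into a single separation pass and feeds each frame's variance into the next frame's RLS step, so it illustrates the data flow rather than being a faithful implementation; the patent performs separation twice per frame, recomputing the variance in between.

import numpy as np

def separate_online(x_mic, J=8, alpha=0.98):
    """x_mic: (2, n_samples) two-microphone recording. A data-flow sketch."""
    X = to_time_frequency(x_mic)                       # S600: STFT
    F = dma_filters(n_bands=X.shape[1])                # S601: DMA filters
    Xb = beam_outputs(X, F)                            # beam outputs (M, K, L)
    M, K, L = Xb.shape
    derev = [RLSMCLPBand(n_ch=M) for _ in range(K)]
    W = np.tile(np.eye(M, dtype=complex), (K, 1, 1))   # identity init of W
    U = np.tile(np.eye(M, dtype=complex), (M, K, 1, 1))
    rng = np.random.default_rng(0)
    T = rng.random((M, K, J))                          # random init of bases
    ta = np.zeros((M, K, J)); tb = np.zeros((M, K, J))
    Y = np.zeros((M, K, L), dtype=complex)
    sigma2 = np.ones(K)                                # previous frame's variance
    for l in range(L):                                 # S602: frame loop
        V_l = rng.random((M, J))                       # random init of activations
        # S603/S606: RLS-MCLP dereverberation driven by the variance
        d = np.stack([derev[k].step(Xb[:, k, l], sigma2[k]) for k in range(K)])
        # S604/S607: NMF-IVA separation on the desired signals
        W, U, T, V_l, ta, tb = nmf_iva_frame(d, W, U, T, V_l, ta, tb, alpha)
        y = np.einsum('kmc,kc->km', W, d).T            # separated frame (M, K)
        sigma2 = (np.abs(y) ** 2).sum(axis=0) / M      # S605: formula (7)
        Y[:, :, l] = y
    return to_time_domain(Y)                           # S608: inverse STFT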
Exemplary devices
Fig. 7 is a schematic structural diagram of an embodiment of a speech separation apparatus according to the present disclosure. The apparatus of this embodiment may be used to implement the method embodiments of the present disclosure described above. As shown in fig. 7, the apparatus of this embodiment includes: an acquire mixed signal module 700, an acquire signal variance module 701, a dereverberation processing module 702, and a sound source separation module 703.
The mixed signal acquiring module 700 is configured to acquire a time-frequency-domain mixed signal to be processed.
Optionally, the mixed signal acquiring module 700 in the present disclosure may include: a first sub-module 7001, a second sub-module 7002, and a third sub-module 7003. The first sub-module 7001 is used for acquiring a time-frequency domain mixed signal obtained based on signal acquisition. The second submodule 7002 is configured to obtain a beam output signal of each frequency band according to the time-frequency domain mixed signal obtained by the first submodule 7001. For example, the second sub-module 7002 may first acquire a differential microphone array DMA beam filter for each frequency band of the time-frequency domain mixed signal, and then the second sub-module 7002 may obtain a beam output signal for each frequency band from the time-frequency domain mixed signal and the differential microphone array beam filter for each frequency band. The third submodule 7003 is configured to form a to-be-processed time-frequency domain mixed signal of a corresponding frequency band based on the beam output signal of each frequency band obtained by the second submodule 7002. For example, the third sub-module 7003 may directly use the beam output signal of each frequency band as the to-be-processed time-frequency domain mixed signal of each frequency band. The third sub-module 7003 may also perform corresponding processing on the beam output signals of each frequency band to obtain a to-be-processed time-frequency domain mixed signal of each frequency band.
The signal variance obtaining module 701 is configured to perform voice separation processing on the time-frequency domain mixed signal to be processed obtained by the mixed signal obtaining module 700, and obtain an expected signal variance of the time-frequency domain mixed signal to be processed according to a result of the voice separation processing.
Optionally, the obtain signal variance module 701 may include a fourth sub-module 7011 and a fifth sub-module 7012. In the case that the third submodule 7003 directly uses the beam output signals of each frequency band as the to-be-processed time-frequency domain mixed signals of each frequency band, the fourth submodule 7011 therein is configured to obtain the a priori desired signals of each sound source mixed signal of the beam output signals of each frequency band obtained by the second submodule 7002. For example, the fourth sub-module 7011 may first obtain a prediction matrix for dereverberation, and then the fourth sub-module 7011 may calculate an a priori desired signal for each frequency band of each sound source mix signal according to the prediction matrix for dereverberation and the beam output signals for each frequency band obtained by the second sub-module 7002. The prediction matrix is obtained by adopting a random initialization mode aiming at a first frame in the beam output signals of each frequency band; and for a non-first frame in the beam output signals of each frequency band, the prediction matrix is calculated by using the prediction matrix of the frame before the non-first frame. The fifth sub-module 7012 is configured to obtain time-frequency domain signals of each frequency band of each sound source according to the priori desired signals of each sound source mixed signal of the beam output signal of each frequency band obtained by the fourth sub-module 7011 and an independent vector analysis manner based on a non-negative matrix decomposition model.
Optionally, the signal variance obtaining module 701 may use an exponential moving average calculation to determine the variance-weighted spatial correlation matrix of the desired signal and the bases of the non-negative matrix factorization of each sound source within the independent vector analysis based on the non-negative matrix factorization model.
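A hedged sketch of how such exponential moving averages can be maintained online; the forgetting factor `alpha` and the split of the NMF multiplicative update into numerator/denominator statistics are assumptions of this sketch.

```python
import numpy as np

def ema_statistics(R_prev, x, r, num_prev, den_prev, num_t, den_t, alpha=0.98):
    """One online update of the separation statistics for one band.

    R_prev : (M, M) previous variance-weighted spatial correlation matrix.
    x      : (M,) current-frame mixed vector; r : scalar variance estimate.
    num_*/den_* : accumulated and current-frame numerator/denominator
    statistics of the NMF multiplicative update for the basis matrix.
    """
    # exponentially weighted averages replace batch sums over all frames
    R = alpha * R_prev + (1.0 - alpha) * np.outer(x, x.conj()) / max(r, 1e-10)
    num = alpha * num_prev + (1.0 - alpha) * num_t
    den = alpha * den_prev + (1.0 - alpha) * den_t
    # the NMF basis would then be refreshed elementwise as T <- T * num / den
    return R, num, den
```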
The dereverberation processing module 702 is configured to perform dereverberation processing on the time-frequency domain mixed signal to be processed according to the expected signal variance obtained by the signal variance obtaining module 701, so as to obtain a dereverberated time-frequency domain mixed signal.
Optionally, the dereverberation processing module 702 may perform dereverberation processing on the time-frequency domain mixed signal to be processed according to the expected signal variance of the time-frequency domain mixed signal to be processed by using an online multi-channel linear prediction method based on recursive least squares, so as to obtain the time-frequency domain mixed signal after dereverberation.
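For illustration, one frame of a per-band recursive-least-squares update for online multichannel linear prediction might look as follows; the forgetting factor `lam` and the initialization of `G` and `P` are assumptions, and this is a standard adaptive formulation rather than the patent's exact recursion.

```python
import numpy as np

def rls_mclp_step(x, x_bar, G, P, r, lam=0.99):
    """One RLS frame update of one frequency band.

    x     : (M,)   current mixed observation.
    x_bar : (KM,)  stacked delayed observations of the same band.
    G     : (KM, M) prediction matrix; P : (KM, KM) inverse correlation.
    r     : scalar expected signal variance for this band and frame.
    Returns the dereverberated frame and the updated (G, P).
    """
    k = (P @ x_bar) / (lam * r + x_bar.conj() @ P @ x_bar)  # Kalman gain
    d = x - G.conj().T @ x_bar           # prediction residual = desired signal
    G = G + np.outer(k, d.conj())        # recursive prediction-matrix update
    P = (P - np.outer(k, x_bar.conj() @ P)) / lam
    return d, G, P
```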
The sound source separation module 703 is configured to obtain a time-frequency domain signal of each sound source according to the time-frequency domain mixed signal after dereverberation obtained by the dereverberation processing module 702.
Optionally, the sound source separation module 703 may obtain the time-frequency domain signal of each sound source from the dereverberated time-frequency domain mixed signal using independent vector analysis based on a non-negative matrix factorization model.
Optionally, the sound source separation module 703 may likewise use an exponential moving average calculation to determine the variance-weighted spatial correlation matrix of the desired signal and the bases of the non-negative matrix factorization of each sound source within this independent vector analysis.
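As a hedged sketch of the separation stage, an ILRMA-like formulation combines an NMF variance model with an iterative-projection update of the demixing matrix; the shapes, the determined (square) demixing assumption, and the update rule below are a standard formulation, not necessarily the patent's exact rules.

```python
import numpy as np

def nmf_variance(T_basis, V_act):
    """NMF variance model of one source: (F, K) @ (K, L) -> (F, L)."""
    return T_basis @ V_act

def ip_update(W, X, r, n):
    """One iterative-projection update of demixing row n for one band.

    W : (M, M) demixing matrix (determined case), X : (L, M) frames,
    r : (L,) modeled variance of source n at this band.
    """
    L = X.shape[0]
    U = (X.T * (1.0 / r)) @ X.conj() / L          # variance-weighted covariance
    e = np.zeros(W.shape[0])
    e[n] = 1.0
    w = np.linalg.solve(W @ U, e)                 # w = (W U)^{-1} e_n
    w = w / np.sqrt(np.real(w.conj() @ U @ w))    # scale normalization
    W[n] = w.conj()                               # row n extracts source n
    return W
```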
Exemplary electronic device
An electronic device according to an embodiment of the present disclosure is described below with reference to FIG. 8. FIG. 8 shows a block diagram of an electronic device in accordance with an embodiment of the disclosure. As shown in FIG. 8, the electronic device 81 includes one or more processors 811 and a memory 812.
The processor 811 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capability and/or instruction execution capability, and may control other components in the electronic device 81 to perform desired functions.
Memory 812 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM) and/or cache memory. The nonvolatile memory may include, for example, Read Only Memory (ROM), a hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 811 to implement the speech separation methods of the various embodiments of the present disclosure described above and/or other desired functions. Various contents such as an input signal, a signal component, and a noise component may also be stored in the computer-readable storage medium.
In one example, the electronic device 81 may further include an input device 813, an output device 814, and the like, interconnected by a bus system and/or another form of connection mechanism (not shown). The input device 813 may include, for example, a keyboard, a mouse, and the like. The output device 814 may output various information to the outside and may include, for example, a display, speakers, a printer, and a communication network and remote output devices connected thereto.
Of course, for simplicity, only some of the components of the electronic device 81 relevant to the present disclosure are shown in FIG. 8, and components such as buses, input/output interfaces, and the like are omitted. In addition, the electronic device 81 may include any other suitable components, depending on the particular application.
Exemplary computer program product and computer-readable storage medium
In addition to the above-described methods and apparatus, embodiments of the present disclosure may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in a speech separation method according to various embodiments of the present disclosure described in the "exemplary methods" section of this specification above.
The computer program product may write program code for carrying out operations of embodiments of the present disclosure in any combination of one or more programming languages, including object-oriented programming languages such as Java or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's computing device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform steps in a speech separation method according to various embodiments of the present disclosure described in the "exemplary methods" section above of this specification.
The computer-readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present disclosure in conjunction with specific embodiments. However, the advantages, effects, and the like mentioned in the present disclosure are merely examples, not limitations, and should not be considered essential to the various embodiments of the present disclosure. Furthermore, the specific details disclosed above are for the purpose of illustration and description only and are not intended to be limiting, since the disclosure is not limited to the specific details so described.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts in the embodiments are referred to each other. For the system embodiment, since it basically corresponds to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The block diagrams of the devices, apparatuses, and systems referred to in this disclosure are given only as illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. As those skilled in the art will appreciate, these devices, apparatuses, and systems may be connected, arranged, and configured in any manner. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The word "or" as used herein means, and is used interchangeably with, the word "and/or," unless the context clearly dictates otherwise. The word "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to."
The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
It is also noted that in the devices, apparatuses, and methods of the present disclosure, each component or step can be decomposed and/or recombined. These decompositions and/or recombinations are to be considered equivalents of the present disclosure.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the disclosure to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.
Claims (12)
1. A method of speech separation comprising:
acquiring a time-frequency domain mixed signal to be processed;
carrying out voice separation processing on the time-frequency domain mixed signal to be processed, and obtaining an expected signal variance of the time-frequency domain mixed signal to be processed according to a voice separation processing result;
according to the expected signal variance of the time-frequency domain mixed signal to be processed, performing dereverberation processing on the time-frequency domain mixed signal to be processed to obtain a dereverberated time-frequency domain mixed signal;
and acquiring the time-frequency domain signal of each sound source according to the time-frequency domain mixed signal after the reverberation is removed.
2. The method of claim 1, wherein the obtaining the time-frequency domain mixed signal to be processed comprises:
acquiring a time-frequency domain mixed signal obtained based on signal acquisition;
acquiring a beam output signal of each frequency band according to the time-frequency domain mixed signal;
and forming the time-frequency domain mixed signal to be processed of the corresponding frequency band based on the beam output signals of the frequency bands.
3. The method of claim 2, wherein the obtaining the beam output signal of each frequency band according to the time-frequency domain mixed signal comprises:
acquiring a differential microphone array (DMA) beam filter of each frequency band of the time-frequency domain mixed signal;
and obtaining the beam output signal of each frequency band according to the time-frequency domain mixed signal and the differential microphone array beam filter of each frequency band.
4. The method according to claim 2, wherein the performing the voice separation processing on the time-frequency domain mixed signal to be processed comprises:
acquiring a priori expected signals of each sound source mixed signal of the beam output signals of each frequency band;
and obtaining the time-frequency domain signals of each frequency band of each sound source according to the a priori expected signals of each sound source mixed signal of the beam output signals of each frequency band and an Independent Vector Analysis (IVA) mode based on a non-Negative Matrix Factorization (NMF) model.
5. The method of claim 4, wherein the acquiring the a priori expected signals of each sound source mixed signal of the beam output signals of each frequency band comprises:
obtaining a prediction matrix for dereverberation;
calculating the a priori expected signals of each frequency band of each sound source mixed signal according to the prediction matrix for dereverberation and the beam output signals of each frequency band;
wherein, for a first frame in the beam output signals of each frequency band, the prediction matrix is obtained by adopting a random initialization mode; and for a non-first frame in the beam output signals of each frequency band, the prediction matrix is calculated by using the prediction matrix of a frame previous to the non-first frame.
6. The method according to any one of claims 1 to 5, wherein the performing dereverberation processing on the time-frequency domain mixed signal to be processed according to the expected signal variance of the time-frequency domain mixed signal to be processed to obtain a dereverberated time-frequency domain mixed signal comprises:
and performing dereverberation processing on the time-frequency domain mixed signal to be processed according to the expected signal variance of the time-frequency domain mixed signal to be processed by using an online multi-channel linear prediction method based on recursive least squares to obtain the time-frequency domain mixed signal after dereverberation.
7. The method according to any one of claims 1 to 5, wherein the obtaining a time-frequency domain signal of each sound source according to the dereverberated time-frequency domain mixed signal comprises:
and obtaining the time-frequency domain signals of each sound source according to the time-frequency domain mixed signals after dereverberation and an Independent Vector Analysis (IVA) mode based on a non-Negative Matrix Factorization (NMF) model.
8. The method of claim 4, wherein the method further comprises:
and determining, by using an exponential moving average calculation mode, a variance-weighted expected signal spatial correlation matrix and bases of the non-negative matrix factorization of the sound source in the Independent Vector Analysis (IVA) mode based on the non-Negative Matrix Factorization (NMF) model.
9. The method of claim 7, wherein the method further comprises:
and determining, by using an exponential moving average calculation mode, a variance-weighted expected signal spatial correlation matrix and bases of the non-negative matrix factorization of the sound source in the Independent Vector Analysis (IVA) mode based on the non-Negative Matrix Factorization (NMF) model.
10. A speech separation apparatus comprising:
the mixed signal acquisition module is used for acquiring a time-frequency domain mixed signal to be processed;
the signal variance acquiring module is used for carrying out voice separation processing on the time-frequency domain mixed signal to be processed, which is acquired by the mixed signal acquiring module, and acquiring an expected signal variance of the time-frequency domain mixed signal to be processed according to the result of the voice separation processing;
the dereverberation processing module is used for carrying out dereverberation processing on the time-frequency domain mixed signal to be processed according to the expected signal variance obtained by the signal variance obtaining module to obtain a dereverberated time-frequency domain mixed signal;
and the sound source separation module is used for obtaining the time-frequency domain signals of each sound source according to the time-frequency domain mixed signals after the reverberation is removed, which are obtained by the reverberation removal processing module.
11. A computer-readable storage medium, the storage medium storing a computer program for performing the method of any one of claims 1 to 9.
12. An electronic device, the electronic device comprising:
a processor;
a memory for storing the processor-executable instructions;
the processor is configured to read the executable instructions from the memory and execute the instructions to implement the method of any one of claims 1-9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910735350.5A CN110428852B (en) | 2019-08-09 | 2019-08-09 | Voice separation method, device, medium and equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910735350.5A CN110428852B (en) | 2019-08-09 | 2019-08-09 | Voice separation method, device, medium and equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110428852A CN110428852A (en) | 2019-11-08 |
CN110428852B true CN110428852B (en) | 2021-07-16 |
Family
ID=68415318
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910735350.5A Active CN110428852B (en) | 2019-08-09 | 2019-08-09 | Voice separation method, device, medium and equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110428852B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112820312B (en) * | 2019-11-18 | 2023-03-21 | 北京声智科技有限公司 | Voice separation method and device and electronic equipment |
EP4107723A4 (en) * | 2020-02-21 | 2023-08-23 | Harman International Industries, Incorporated | Method and system to improve voice separation by eliminating overlap |
CN111863018A (en) * | 2020-07-21 | 2020-10-30 | 上海汽车集团股份有限公司 | Directional pickup method under double microphones and related device |
CN112863537B (en) * | 2021-01-04 | 2024-06-04 | 北京小米松果电子有限公司 | Audio signal processing method, device and storage medium |
CN113223543B (en) * | 2021-06-10 | 2023-04-28 | 北京小米移动软件有限公司 | Speech enhancement method, device and storage medium |
CN114333876B (en) * | 2021-11-25 | 2024-02-09 | 腾讯科技(深圳)有限公司 | Signal processing method and device |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9997170B2 (en) * | 2014-10-07 | 2018-06-12 | Samsung Electronics Co., Ltd. | Electronic device and reverberation removal method therefor |
US20160189730A1 (en) * | 2014-12-30 | 2016-06-30 | Iflytek Co., Ltd. | Speech separation method and system |
CN106057210A (en) * | 2016-07-01 | 2016-10-26 | 山东大学 | Quick speech blind source separation method based on frequency point selection under binaural distance |
US20180350379A1 (en) * | 2017-06-02 | 2018-12-06 | Apple Inc. | Multi-Channel Speech Signal Enhancement for Robust Voice Trigger Detection and Automatic Speech Recognition |
CN109994120A (en) * | 2017-12-29 | 2019-07-09 | 福州瑞芯微电子股份有限公司 | Sound enhancement method, system, speaker and storage medium based on diamylose |
CN109523999A (en) * | 2018-12-26 | 2019-03-26 | 中国科学院声学研究所 | A kind of front end processing method and system promoting far field speech recognition |
Non-Patent Citations (4)
Title |
---|
"RLS-Based Adaptive Dereverberation Tracing Abrupt Position Change of Target Speaker";T.Xiang;《2018 IEEE 10th Sensor Array and Multichannel Signal Processing Workshop (SAM)》;20180830;全文 * |
"Frequency Domain Trinicon-Based Blind Source Separation Method with Multi-Source Activity Detection for Sparsely Mixed Signals";Z. Wang 等;《 2018 IEEE 10th Sensor Array and Multichannel Signal Processing Workshop (SAM)》;20180830;全文 * |
"Multichannel Audio Source Separation With Probabilistic Reverberation Priors";S. Leglaive 等;《IEEE/ACM Transactions on Audio, Speech, and Language Processing》;20161004;第24卷(第12期);全文 * |
"语音分离技术的研究与实现";庞宇;《中国优秀硕士学位论文全文数据库(信息科技辑)》;20180415;全文 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110428852B (en) | Voice separation method, device, medium and equipment | |
US8880395B2 (en) | Source separation by independent component analysis in conjunction with source direction information | |
US10123113B2 (en) | Selective audio source enhancement | |
CN106233382B (en) | A kind of signal processing apparatus that several input audio signals are carried out with dereverberation | |
Delcroix et al. | Precise dereverberation using multichannel linear prediction | |
US10192568B2 (en) | Audio source separation with linear combination and orthogonality characteristics for spatial parameters | |
US20130294611A1 (en) | Source separation by independent component analysis in conjuction with optimization of acoustic echo cancellation | |
US20130294608A1 (en) | Source separation by independent component analysis with moving constraint | |
CN112185411B (en) | Voice separation method, device, medium and electronic equipment | |
WO2010092568A1 (en) | Multiple microphone based directional sound filter | |
Krueger et al. | Model-based feature enhancement for reverberant speech recognition | |
CN112349292B (en) | Signal separation method and device, computer readable storage medium and electronic equipment | |
Mitsufuji et al. | Multichannel blind source separation based on non-negative tensor factorization in wavenumber domain | |
EP3624117A1 (en) | Method, apparatus for blind signal seperating and electronic device | |
Ullah et al. | Single channel speech dereverberation and separation using RPCA and SNMF | |
Ni et al. | WPD++: An improved neural beamformer for simultaneous speech separation and dereverberation | |
Lee et al. | Improved mask-based neural beamforming for multichannel speech enhancement by snapshot matching masking | |
WO2021007902A1 (en) | Voice filtering method and apparatus, medium, and electronic device | |
US20230306980A1 (en) | Method and System for Audio Signal Enhancement with Reduced Latency | |
Takeda et al. | ICA-based efficient blind dereverberation and echo cancellation method for barge-in-able robot audition | |
Kamo et al. | Regularized fast multichannel nonnegative matrix factorization with ILRMA-based prior distribution of joint-diagonalization process | |
Wuth et al. | A unified beamforming and source separation model for static and dynamic human-robot interaction | |
Radfar et al. | Monaural speech separation based on gain adapted minimum mean square error estimation | |
Chen et al. | A multichannel learning-based approach for sound source separation in reverberant environments | |
Murata et al. | Reverberation-robust underdetermined source separation with non-negative tensor double deconvolution |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||