CN113870893A - Multi-channel double-speaker separation method and system - Google Patents

Multi-channel double-speaker separation method and system

Info

Publication number
CN113870893A
CN113870893A (application CN202111134595.6A)
Authority
CN
China
Prior art keywords
speaker
audio
sound source
target
source position
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111134595.6A
Other languages
Chinese (zh)
Inventor
张鹏远
杨弋
陈航艇
颜永红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Original Assignee
Institute of Acoustics CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS filed Critical Institute of Acoustics CAS
Priority to CN202111134595.6A priority Critical patent/CN113870893A/en
Publication of CN113870893A publication Critical patent/CN113870893A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 — Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 — Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0272 — Voice signal separating
    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique

Abstract

The application relates to a multi-channel double-speaker separation method and system, wherein the method comprises the following steps: processing the mixed voice audio to obtain the frequency spectrum of each frame of audio; obtaining estimated frame-level Cartesian coordinates and corresponding weights according to each frame of audio and a sound source position estimation network; obtaining a first logarithmic energy spectrum and a first sine-cosine inter-channel phase difference according to the frequency spectrum of each frame of audio; obtaining the Cartesian coordinate estimation of a target speaker in the mixed voice audio according to the estimated frame-level Cartesian coordinates and the corresponding weights; obtaining a first angle characteristic according to the Cartesian coordinates of the target speaker; obtaining a first target speaker mask and a first interfering speaker mask according to the first logarithmic energy spectrum, the first sine-cosine inter-channel phase difference, the first angle characteristic and a speaker masking estimation network; and obtaining the separated voices of the two speakers based on the first target speaker mask, the first interfering speaker mask and the mixed voice audio.

Description

Multi-channel double-speaker separation method and system
Technical Field
The embodiments of the present application relate to the field of speech separation, and in particular to a multi-channel double-speaker separation method and system.
Background
The goal of speech separation is to separate the different speakers in mixed speech audio containing reverberation and noise, so as to obtain clean speech for each individual speaker. As a front end for technologies such as speech recognition and speaker diarization, speech separation is widely applied in environments such as classrooms and conferences.
Deep clustering is a classical speech separation method. It obtains the separated speech of the target speaker by learning the target speaker's ideal binary mask on the mixed speech audio. During training, each time-frequency unit is mapped to an embedding vector, and time-frequency units whose embeddings are close to each other are clustered together. However, the performance of deep clustering is very limited under varying acoustic environments.
In recent years, speech separation models based on deep neural networks have developed rapidly, and their performance far exceeds that of traditional methods. However, most experimental studies still separate speech from fully overlapped mixed speech audio and neglect environments dominated by a single speaker, such as conferences. Research shows that the speaker overlap proportion in a conference environment is generally no higher than 20%, so robustness still needs to be improved for speech separation at different low overlap proportions. On the other hand, for mixed speech audio with different speaker overlap proportions, the specific position of the target speaker within the speech cannot be known in advance, and the input to neural network training can only be the entire utterance. In this case, if average pooling is used, interfering speaker speech and silent frames can severely affect the estimation of the target speaker's position information, thereby degrading the performance of speech separation.
Disclosure of Invention
The embodiments of the present application aim to reduce the bias in target speaker position estimation for mixed speech audio with a low speaker overlap proportion, and to improve the robustness and performance of speech separation.
In a first aspect, an embodiment of the present application provides a multi-channel double-speaker separation method, including: performing framing, windowing and Fourier transform processing on the mixed voice audio to obtain the frequency spectrum of each frame of audio; the mixed voice audio comprises mixed voice audio with different speaker overlapping proportions; obtaining estimated frame-level Cartesian coordinates and corresponding weights according to the frequency spectrum of each frame of audio and a sound source position estimation network; obtaining a first logarithmic energy spectrum and a first sine-cosine inter-channel phase difference according to the frequency spectrum of each frame of audio; obtaining a Cartesian coordinate estimation of a target speaker in the mixed voice audio according to the estimated frame-level Cartesian coordinates and corresponding weights, wherein the Cartesian coordinate estimation of the target speaker indicates a weighted sound source position estimation of the target speaker; obtaining a first angle characteristic according to the Cartesian coordinates of the target speaker; obtaining a first target speaker mask and a first interfering speaker mask according to the first logarithmic energy spectrum, the first sine-cosine inter-channel phase difference, the first angle characteristic and a speaker masking estimation network; and obtaining the voice of the target speaker and the voice of the interfering speaker based on the first target speaker mask, the first interfering speaker mask and the mixed voice audio.
In one possible embodiment, the method further comprises: determining a training set of mixed voice audio, and determining training voice audio and a label based on the training set of mixed voice audio; the label comprises a sound source position vector, a second target speaker voice and a second interfering speaker voice; training the sound source position estimation network according to the training voice audio; training the speaker masking estimation network; and jointly training the sound source position estimation network and the speaker masking estimation network to obtain the trained sound source position estimation network and the trained speaker masking estimation network.
In one possible embodiment, the label includes a sound source position vector, and the training of the sound source position estimation network according to the training speech audio includes: performing framing, windowing and Fourier transform processing on the training voice audio to obtain a frequency spectrum of the training voice audio, the frequency spectrum comprising a real part and an imaginary part; taking the data obtained by splicing the real part and the imaginary part as the input of the sound source position estimation network and the sound source position vector estimation as the output, calculating the value of a first loss function, wherein the first loss function is the mean square error of the sound source position; and training with the goal of keeping the value of the first loss function within a first threshold, to obtain the trained sound source position estimation network and a corresponding weight vector; the sound source position estimation network comprises 3 convolution modules, 2 bidirectional long short-term memory layers and 2 fully connected layers.
In one possible embodiment, training the speaker masking estimation network comprises: determining a second angle characteristic, a second logarithmic energy spectrum and a second sine-cosine inter-channel phase difference according to the training voice audio and the sound source position vector; taking the second angle characteristic, the second logarithmic energy spectrum and the second sine-cosine inter-channel phase difference as input and the second target speaker mask and the second interfering speaker mask as output, calculating the product of the second target speaker mask and the training voice audio to obtain an estimated speaker voice signal; calculating the product of the second interfering speaker mask and the training voice audio to obtain an estimated interfering speaker voice signal; calculating the value of a second loss function, wherein the value of the second loss function is a logarithmic value of the loss ratio of the estimated voice signals to the target voice signals; the estimated voice signals comprise the estimated speaker voice signal and the estimated interfering speaker voice signal; the target voice signals comprise the second target speaker voice and the second interfering speaker voice; and training with the goal of keeping the value of the second loss function within a second threshold, to obtain the trained speaker masking estimation network, wherein the speaker masking estimation network comprises a 3-layer bidirectional long short-term memory network and 2 independent fully connected layers.
In one possible embodiment, the label includes a sound source position vector, a second target speaker voice and a second interfering speaker voice, and jointly training the sound source position estimation network and the speaker masking estimation network includes: combining the sound source position estimation network and the speaker masking estimation network, setting the loss function as the sum of the mean square error of the sound source position estimate and the scale-invariant signal-to-loss ratio error, wherein the scale-invariant signal-to-loss ratio measures the speaker separation result, and jointly fine-tuning the two networks; wherein the estimated voice signals are the products of the second target speaker mask and the second interfering speaker mask, respectively, multiplied by the training voice audio, and the target voice signals are the second target speaker voice and the second interfering speaker voice.
In one possible embodiment, the obtaining the cartesian coordinate estimation of the target speaker in the mixed speech audio according to the estimated frame-level cartesian coordinates and the corresponding weights includes: and performing softmax operation according to the estimated frame level Cartesian coordinates and corresponding weights, and calculating weighted sound source position estimation to obtain Cartesian coordinates of the target speaker in the mixed voice audio.
In one possible embodiment, the deriving the first angular characteristic from cartesian coordinates of the target speaker comprises: determining a guide vector of the target speaker according to the difference value of the Cartesian coordinates of the target speaker and the Cartesian coordinates of the microphone topological structure; the Cartesian coordinates of the microphone topological structure are obtained based on a coordinate system of the microphone array; and calculating according to the guide vector of the target speaker and the frequency spectrum of the mixed voice audio of the M channels to obtain angle characteristics, wherein the value of M is a natural number.
In a second aspect, embodiments of the present application provide a multi-channel dual speaker separation system, the system including: the signal processing module is used for performing framing, windowing and Fourier transform processing on the mixed voice audio to obtain the frequency spectrum of each frame; the mixed voice audio comprises at least two speaker voices; the characteristic extraction module is used for obtaining estimated frame-level Cartesian coordinates and corresponding weights according to the frequency spectrum of each frame of audio and the sound source position estimation network, and for obtaining a first logarithmic energy spectrum and a first sine-cosine inter-channel phase difference according to the frequency spectrum of each frame of audio; a sound source position weighting processing module, configured to obtain a Cartesian coordinate estimation of a target speaker in the mixed speech audio according to the estimated frame-level Cartesian coordinates and the corresponding weights, where the Cartesian coordinate estimation of the target speaker indicates a weighted sound source position estimation of the target speaker; the angle characteristic calculation module is used for obtaining a first angle characteristic according to the Cartesian coordinates of the target speaker; the speaker separation module is used for obtaining a first target speaker mask and a first interfering speaker mask according to the first logarithmic energy spectrum, the first sine-cosine inter-channel phase difference, the first angle characteristic and the speaker masking estimation network; and obtaining the voice of the target speaker and the voice of the interfering speaker based on the first target speaker mask, the first interfering speaker mask and the mixed voice audio.
In a third aspect, an embodiment of the present application provides an electronic device, including a memory and a processor; the processor is configured to execute the computer executable instructions stored in the memory, and the processor executes the computer executable instructions to perform the method according to any one of the first aspect.
In a fourth aspect, the present application provides a storage medium, which includes a readable storage medium and a computer program stored in the readable storage medium, where the computer program is used to implement the method in any one of the first aspect.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments disclosed in this specification, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings described below are only some embodiments disclosed in this specification, and that those skilled in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a flowchart of a multi-channel dual speaker separation method according to an embodiment of the present disclosure;
fig. 2 is a flowchart illustrating a neural network module training process in a multi-channel dual speaker separation method according to an embodiment of the present disclosure.
FIG. 3 is a block diagram of a multi-channel dual speaker separation system according to an embodiment of the present application;
fig. 4 is a schematic view of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions of the embodiments of the present application are described in further detail below with reference to the accompanying drawings and embodiments.
Angle features play an important role in a speaker separation system. Extracting angle features requires knowing the position information of the target speaker, which in most cases is unknown. Efficient estimation of this position information is therefore necessary.
To this end, an embodiment of the present application proposes a multi-channel double-speaker separation method that separates the two speakers' voices based on weighted sound source position estimation for mixed speech audio with a low speaker overlap proportion. The method includes steps S11 to S17.
S11, obtaining mixed voice audio, wherein the mixed voice audio comprises two speaker voices; performing framing, windowing and Fourier transform processing on the mixed voice audio to obtain the frequency spectrum of each frame of audio of the mixed voice audio; the mixed speech audio is a mixed speech audio that includes a low proportion of overlap of speakers.
And S12, obtaining estimated frame level Cartesian coordinates and corresponding weights according to the frequency spectrum of each frame of audio and the sound source position estimation network.
S13, according to the estimated frame level Cartesian coordinates and the corresponding weight, carrying out weighted calculation to obtain Cartesian coordinate estimation of the target speaker in the mixed voice audio, wherein the Cartesian coordinate estimation is weighted sound source position estimation.
S14, extracting an angle feature according to the Cartesian coordinates of the target speaker, and recording the angle feature as a first angle feature.
And S15, extracting the logarithmic energy spectrum and the sine-cosine inter-channel phase difference according to the frequency spectrum of the mixed voice audio. The logarithmic energy spectrum is computed from the spectrum of the microphone at the central point and represents the standard spectral characteristics of the speech signal; the inter-channel phase difference captures the small differences in propagation from the sound source to the different microphones and is therefore useful spatial information. The inter-channel phase difference is calculated as follows:
φ_{t,f} = ∠Y^1_{t,f} − ∠Y^2_{t,f}    (1)

In formula (1), Y^1_{t,f} and Y^2_{t,f} denote the Fourier transform values of the mixed voice audio at time t and frequency unit f for microphone 1 and microphone 2, respectively, ∠ denotes the phase angle, and φ_{t,f} is the phase difference between microphone 1 and microphone 2 at time t and frequency unit f.
The logarithmic energy spectrum and the sine-cosine inter-channel phase difference can be respectively recorded as a first logarithmic energy spectrum and a first sine-cosine inter-channel phase difference.
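A small sketch of these two features, assuming the multi-channel STFT is already available as a complex array of shape (channels, frames, bins); the microphone pairing follows the 4-channel example given later, i.e. pairs (0,1), (0,2) and (0,3), and the function names are illustrative.

```python
import numpy as np

def log_energy_spectrum(ref_spec, eps=1e-8):
    """First log energy spectrum, computed from the reference (centre-point) microphone."""
    return np.log(np.abs(ref_spec) ** 2 + eps)

def sin_cos_ipd(specs, pairs=((0, 1), (0, 2), (0, 3))):
    """First sine-cosine inter-channel phase difference, cf. formula (1).

    specs: complex STFT of shape (channels, frames, bins).
    Returns an array of shape (frames, 2 * len(pairs) * bins) with cos(phi)
    and sin(phi) stacked along the last axis for every microphone pair.
    """
    feats = []
    for i, j in pairs:
        phi = np.angle(specs[i]) - np.angle(specs[j])  # phase difference per (t, f)
        feats.extend([np.cos(phi), np.sin(phi)])
    return np.concatenate(feats, axis=-1)
```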
S16, calculating according to the angle characteristics, the logarithm energy spectrum, the phase difference between sine and cosine channels and the speaker masking estimation network to obtain a target speaker masking estimation and an interference speaker masking estimation; the targeted speaker masking estimate may be denoted as a first targeted speaker masking and the interfering speaker masking estimate may be denoted as a first interfering speaker masking.
And S17, performing voice separation based on the target speaker masking estimation, the interference speaker masking estimation and the mixed voice audio to obtain the target speaker voice and the interference speaker voice.
In the method for separating the multi-channel double speakers, a neural network module comprises a sound source position estimation neural network and a speaker masking estimation neural network; the sound source position estimation neural network and the speaker masking estimation neural network are trained respectively and then jointly, and the training process comprises the following steps of S21-S24.
S21, determining training voice audio and labels based on the mixed voice audio training set; a training speech spectrum is obtained.
In one possible implementation, a mixed speech audio training set is established, and the labels record the sound source position vector of the training voice audio, the speaker voice and the interfering speaker voice; the speaker voice and the interfering speaker voice can be denoted as the second target speaker voice and the second interfering speaker voice. The training voice audio and labels are determined, and framing, windowing and Fourier transform processing are performed on the training voice audio to obtain the training voice frequency spectrum; the training voice frequency spectrum includes a real part and an imaginary part.
And S22, training the sound source position estimation network according to the training voice frequency spectrum.
In one possible implementation mode, data obtained after splicing a real part and an imaginary part of a training voice frequency spectrum are used as input of a sound source position estimation network, sound source position vector estimation is used as output, and mean square error of a sound source position is calculated; and training by taking the value of the mean square error within a set threshold value as a target to obtain a trained sound source position estimation network and a corresponding weight vector.
The mean square error of the sound source position can be noted as the first loss function. The mean square error of the sound source position can be obtained by calculating the mean square error of the sound source position vector estimation output by the network and the sound source position vector recorded by the label. The set threshold value is defined as a first threshold value.
The input of the sound source position estimation network is the concatenation of the real and imaginary parts of the frequency spectrum obtained by the Fourier transform of the training voice audio. The sound source position estimation network comprises 3 convolution modules, 2 bidirectional long short-term memory layers and 2 fully connected layers. Each convolution module comprises a two-dimensional convolutional layer, a linear rectification function, a batch normalization layer and a max pooling layer. The number of output channels of the two-dimensional convolutional layer is 64, and the convolution kernel size, stride and padding are (3,3), (1,1) and (1,1), respectively. The window sizes of the max pooling layers are (8,1), (4,1) and (4,1), respectively. The number of hidden nodes of the bidirectional long short-term memory layers is 64. The numbers of output nodes of the fully connected layers are 429 and 4, respectively. The final output of the network has dimensions B × T × 4, where B denotes the batch size and T denotes the number of speech frames. The output is the estimated frame-level sound source position vector and the corresponding weight vector.
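A PyTorch sketch with this layout is shown below. The input arrangement (real and imaginary parts of the multi-channel spectra stacked as image channels), the 256-bin frequency axis, and the exact placement of the 429-node layer are assumptions made so that the example is self-contained and runnable.

```python
import torch
import torch.nn as nn

class SourcePositionNet(nn.Module):
    """Sketch of the sound source position estimation network.

    Input: real and imaginary parts of the multi-channel spectra, spliced
    along the channel axis, shape (batch, 2 * mics, freq_bins, frames).
    Output: (batch, frames, 4) -- three Cartesian coordinates plus one
    weight per frame.
    """

    def __init__(self, in_channels=8, freq_bins=256):
        super().__init__()
        pools = [(8, 1), (4, 1), (4, 1)]
        convs, ch = [], in_channels
        for pool in pools:
            convs += [
                nn.Conv2d(ch, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)),
                nn.ReLU(),
                nn.BatchNorm2d(64),
                nn.MaxPool2d(kernel_size=pool),
            ]
            ch = 64
        self.conv = nn.Sequential(*convs)
        reduced_bins = freq_bins // (8 * 4 * 4)               # frequency axis pooled by 128
        self.blstm = nn.LSTM(64 * reduced_bins, 64, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.fc1 = nn.Linear(2 * 64, 429)
        self.fc2 = nn.Linear(429, 4)

    def forward(self, x):
        h = self.conv(x)                                      # (B, 64, F', T)
        h = h.permute(0, 3, 1, 2).flatten(2)                  # (B, T, 64 * F')
        h, _ = self.blstm(h)
        return self.fc2(torch.relu(self.fc1(h)))              # (B, T, 4)
```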
And S23, training the speaker mask estimation network.
In a feasible implementation, the second angle characteristic can be determined from the sound source position vector recorded by the label, and the second logarithmic energy spectrum and the second sine-cosine inter-channel phase difference are obtained from the training voice frequency spectrum. Taking the second angle characteristic, the second logarithmic energy spectrum and the second sine-cosine inter-channel phase difference as input and the speaker mask estimate and the interfering speaker mask estimate as output, the estimated voice signals are calculated, where the estimated voice signals are the products of the speaker mask estimate and the interfering speaker mask estimate, each multiplied by the training voice audio. A scale-invariant signal-to-loss ratio error is calculated during training, and the trained speaker masking estimation network is obtained through training, where the speaker masking estimation network comprises a 3-layer bidirectional long short-term memory network and 2 independent fully connected layers.
The scale-invariant signal-to-loss ratio L of the estimated speech signal ŝ is the logarithm of the ratio of the scaled target speech signal to the estimation error:

L = 10 · log₁₀( ‖αs‖² / ‖ŝ − αs‖² )    (2)

In formula (2), s is the target speech signal; it is a known value, namely the two separate voices recorded by the label that make up the training voice audio, and it is also the training target of the network. ŝ denotes the estimated speech signal obtained by multiplying the estimated mask with the input training voice audio (i.e., the input speech of the network). α is a scale factor; to keep the signal-to-loss ratio scale-invariant, the scale factor α is:

α = ŝᵀs / ‖s‖²    (3)

loss = −|L_tgt − L_est|    (4)

In formula (4), L_tgt is the scale-invariant signal-to-loss ratio of the target speech signal, L_est is the scale-invariant signal-to-loss ratio of the estimated speech signal, and loss is the scale-invariant signal-to-loss ratio error.
The speaker mask estimate may be denoted as the second target speaker mask, the interfering speaker mask estimate may be denoted as the second interfering speaker mask, and the scale-invariant signal-to-loss ratio error may be denoted as the second loss function.

In one possible implementation, the product of the second target speaker mask and the training voice audio is calculated to obtain the estimated speaker voice signal; the product of the second interfering speaker mask and the training voice audio is calculated to obtain the estimated interfering speaker voice signal; the scale-invariant signal-to-loss ratio error is then calculated, where the estimated voice signals comprise the estimated speaker voice signal and the estimated interfering speaker voice signal, and the target voice signals comprise the second target speaker voice and the second interfering speaker voice. Training with the goal of keeping the value of the second loss function within the second threshold yields the trained speaker masking estimation network.
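The scale-invariant signal-to-loss ratio defined by formulas (2) and (3) is what the separation literature usually calls SI-SNR. Below is a minimal PyTorch sketch of these two formulas and of a training objective built from them; for simplicity it uses the common negative-SI-SNR form averaged over the two speakers rather than the exact combination in formula (4), and all function names are illustrative.

```python
import torch

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-noise ratio, formulas (2)-(3).

    estimate, target: tensors of shape (batch, samples), assumed zero-mean.
    """
    alpha = (estimate * target).sum(-1, keepdim=True) / (target.pow(2).sum(-1, keepdim=True) + eps)
    projection = alpha * target                        # scaled target, formula (3)
    noise = estimate - projection
    ratio = projection.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps)
    return 10 * torch.log10(ratio + eps)               # formula (2), in dB

def separation_loss(est_tgt, est_int, ref_tgt, ref_int):
    """Negative SI-SNR averaged over the target and interfering speakers."""
    return -(si_snr(est_tgt, ref_tgt) + si_snr(est_int, ref_int)).mean() / 2
```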
And S24, training the sound source position estimation network and the speaker masking estimation network in a combined mode, and finely adjusting the two networks to enable the loss function value to be minimum.
In one possible implementation, the trained sound source position estimation network and the trained speaker masking estimation network are combined, the loss function is set to the sum of the mean square error of the sound source position and the scale-invariant signal-to-loss ratio error, and the two networks are fine-tuned until the value of the loss function is minimized. This loss function may be denoted as the third loss function.
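As a sketch, the third loss function is simply the sum of the two earlier terms. The snippet below reuses separation_loss() from the previous sketch; the unweighted sum is an assumption, since the text does not specify relative weights for the two terms.

```python
import torch.nn.functional as F

def joint_loss(pos_est, pos_ref, est_tgt, est_int, ref_tgt, ref_int):
    """Third loss function: sound source position MSE plus the separation error."""
    position_mse = F.mse_loss(pos_est, pos_ref)        # first loss function (position MSE)
    return position_mse + separation_loss(est_tgt, est_int, ref_tgt, ref_int)
```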
After training, an ideal sound source position estimation network and a speaker masking estimation network are obtained, and the ideal sound source position estimation network and the speaker masking estimation network can be used for multi-channel double-speaker voice separation. The following describes a multi-channel dual speaker voice separation method proposed in the embodiments of the present application with an embodiment.
Example 1
Based on the ideal sound source position estimation network and the speaker masking estimation network, the embodiment of the application provides a multi-channel double-speaker separation method, which comprises the following steps:
and S31, obtaining mixed voice audio comprising the voices of the two speakers, and performing framing, windowing and Fourier transform processing on the mixed voice audio to obtain the frequency spectrum of each frame of audio.
In one possible embodiment, the following steps S311 to S313 are included:
S311, framing the mixed voice audio to be separated, with a frame length of 25 milliseconds and a frame shift of 6.25 milliseconds;
S312, windowing each frame, with a Hamming window as the window function;
S313, performing a 512-point Fourier transform on each frame of audio to obtain the frequency spectrum of each frame of audio.
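A rough sketch of steps S311 to S313 follows; the 16 kHz sampling rate is an assumption (with it, 25 ms and 6.25 ms correspond to 400 and 100 samples, and a 512-point FFT yields the 257 frequency bins mentioned later in this embodiment).

```python
import numpy as np

def stft_frames(audio, sample_rate=16000, frame_ms=25.0, hop_ms=6.25, n_fft=512):
    """Frame, window (Hamming) and Fourier-transform one channel of audio.

    Returns an array of shape (num_frames, n_fft // 2 + 1) holding the
    one-sided spectrum of every frame.
    """
    frame_len = int(sample_rate * frame_ms / 1000)    # 400 samples at 16 kHz
    hop_len = int(sample_rate * hop_ms / 1000)        # 100 samples at 16 kHz
    window = np.hamming(frame_len)

    audio = np.asarray(audio, dtype=np.float64)
    if len(audio) < frame_len:                        # pad very short inputs
        audio = np.pad(audio, (0, frame_len - len(audio)))

    num_frames = 1 + (len(audio) - frame_len) // hop_len
    spectra = np.empty((num_frames, n_fft // 2 + 1), dtype=np.complex64)
    for t in range(num_frames):
        frame = audio[t * hop_len : t * hop_len + frame_len] * window
        spectra[t] = np.fft.rfft(frame, n=n_fft)      # zero-padded to 512 points
    return spectra
```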
S32, inputting the frequency spectrum of each frame of audio into the trained sound source position estimation network, whose output is the estimated frame-level Cartesian coordinates and corresponding weights (p̂_t, w_t), where p̂_t denotes the estimated Cartesian coordinates of the audio at time t and w_t is the corresponding weight.
S33, obtaining the estimated Cartesian coordinates of the target speaker in the input mixed voice audio according to the estimated frame-level Cartesian coordinates p̂_t and the corresponding weights w_t.

In one possible implementation, a softmax operation is performed on the estimated frame-level position information and the corresponding weights to compute the weighted sound source position estimate:

p̂ = Σ_{t=1..T} ( e^{w_t} / Σ_{τ=1..T} e^{w_τ} ) · p̂_t    (5)

In formula (5), T is the number of frames of the mixed voice audio, e is the base of the natural logarithm, and p̂ is the estimated Cartesian coordinate of the target speaker.
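Formula (5) is a softmax-weighted average of the frame-level coordinates. A small sketch, assuming the network outputs are available as tensors:

```python
import torch

def weighted_position(frame_coords, frame_weights):
    """Weighted sound source position estimate, formula (5).

    frame_coords:  (frames, 3) estimated Cartesian coordinates per frame.
    frame_weights: (frames,)   corresponding scalar weights.
    Returns the (3,) Cartesian coordinate estimate of the target speaker.
    """
    attention = torch.softmax(frame_weights, dim=0)    # e^{w_t} / sum_tau e^{w_tau}
    return (attention.unsqueeze(-1) * frame_coords).sum(dim=0)
```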
And S34, calculating an angle characteristic according to the estimated Cartesian coordinates of the target speaker, wherein the angle characteristic is marked as a first angle characteristic.
In one possible embodiment, the angle feature is calculated from the estimated Cartesian coordinates of the target speaker according to formula (6), which combines the steering vector of the target speaker with the spectra of the M channels. In formula (6), M is the number of channels, Y^m_{t,f} denotes the spectrum of the mixed voice audio at time t and frequency unit f in channel m, and d^m_f denotes the steering vector of the target speaker for channel m at frequency unit f; its value is taken as the difference between the estimated Cartesian coordinates of the target speaker and the Cartesian coordinates of the microphone topology. The microphone array is fixed and is a known quantity during both training and testing. A coordinate system can be established with microphone 0 as the origin and the line connecting microphone 0 and microphone 1 as the x-axis, from which the Cartesian coordinates of each microphone are obtained.
Illustratively, the number of channels M is 4, including channel 0, channel 1, channel 2, and channel 3.
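The exact expression of formula (6) appears only as an image in the source; the sketch below shows one common way such an angle feature is realised, converting the estimated speaker position and the known microphone coordinates into per-channel steering phases and comparing them with the observed phases. The cosine-similarity form, the speed-of-sound constant and the 16 kHz sampling rate are illustrative assumptions rather than the patent's definition.

```python
import numpy as np

SOUND_SPEED = 343.0  # m/s, assumed

def angle_feature(specs, mic_coords, speaker_coords, sample_rate=16000, n_fft=512):
    """One possible angle feature built from a steering vector (cf. formula (6)).

    specs:          complex STFT, shape (channels, frames, bins)
    mic_coords:     (channels, 3) Cartesian microphone coordinates
    speaker_coords: (3,) estimated Cartesian coordinates of the target speaker
    Returns (frames, bins): cosine similarity between the phase implied by the
    steering vector and the observed inter-channel phase, summed over channels.
    """
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)            # (bins,)
    dists = np.linalg.norm(speaker_coords - mic_coords, axis=1)    # (channels,)
    delays = (dists - dists[0]) / SOUND_SPEED                      # delays relative to mic 0
    steer_phase = 2 * np.pi * freqs[None, :] * delays[:, None]     # (channels, bins)

    observed_ipd = np.angle(specs) - np.angle(specs[0:1])          # IPD w.r.t. mic 0
    return np.cos(observed_ipd - steer_phase[:, None, :]).sum(axis=0)
```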
And S35, calculating the logarithmic energy spectrum and the sine-cosine inter-channel phase difference according to the mixed speech spectrum. The logarithmic energy spectrum is denoted as the first logarithmic energy spectrum, and the sine-cosine inter-channel phase difference as the first sine-cosine inter-channel phase difference.
It should be understood that the Fourier transform is applied to the entire speech, and the mixed speech spectrum is the spectrum of each successive frame of the speech audio.
In one possible implementation, a 257-dimensional log energy spectrum and a sine-cosine inter-channel phase difference can be obtained by calculation based on the mixed speech spectrum.
Illustratively, with 4 channels (channel 0, channel 1, channel 2 and channel 3), the channel pairs used for the inter-channel phase difference are (0,1), (0,2) and (0,3).
And S36, inputting the first angle characteristic, the first logarithmic energy spectrum and the phase difference between the first sine and cosine channels into the trained speaker masking estimation network to obtain the target speaker masking estimation and the interference speaker masking estimation. The speaker masking estimation represents a vector with the same dimension as the input mixed voice audio, wherein each element is between 0 and 1, and represents the component proportion occupied by the speaker in the input mixed voice audio at each time frequency point. Here, 2 speaker masking estimates for the targeted speaker masking estimate and the interfering speaker masking estimate may be obtained.
In one possible embodiment, the angle feature obtained in step S34 and the logarithmic energy spectrum and sine-cosine inter-channel phase difference obtained in step S35 are input into the trained speaker masking estimation network. The speaker masking estimation network comprises a 3-layer bidirectional long short-term memory network and 2 independent fully connected layers. The number of hidden nodes of the bidirectional long short-term memory network is 512, and the dropout ratio is 0.4. The number of output nodes of each fully connected layer is 256. The 2 separate fully connected layers output the target speaker mask estimate and the interfering speaker mask estimate, respectively.
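A PyTorch sketch with this layout; the input feature dimension and the sigmoid output non-linearity (which keeps every mask element between 0 and 1, as described above) are assumptions.

```python
import torch
import torch.nn as nn

class SpeakerMaskNet(nn.Module):
    """Sketch of the speaker masking estimation network.

    Three bidirectional LSTM layers (hidden size 512, dropout 0.4) followed
    by two independent fully connected heads that output the target speaker
    mask and the interfering speaker mask.  The input dimension depends on
    how the log energy spectrum, sine-cosine IPDs and angle feature are
    concatenated, so it is left as a parameter; the 256 output nodes per
    head follow the figure quoted in the text.
    """

    def __init__(self, in_dim, n_out=256):
        super().__init__()
        self.blstm = nn.LSTM(in_dim, 512, num_layers=3, dropout=0.4,
                             batch_first=True, bidirectional=True)
        self.target_head = nn.Linear(2 * 512, n_out)
        self.interf_head = nn.Linear(2 * 512, n_out)

    def forward(self, feats):
        h, _ = self.blstm(feats)                            # (batch, frames, 1024)
        target_mask = torch.sigmoid(self.target_head(h))    # values in (0, 1)
        interf_mask = torch.sigmoid(self.interf_head(h))
        return target_mask, interf_mask
```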
S37, two separated voices are obtained based on the target speaker masking estimation, the interference speaker masking estimation and the mixed voice audio.
In one possible implementation, the target speaker mask estimate and the interfering speaker mask estimate obtained in step S36 are multiplied element by element with the spectrum of the mixed voice audio to obtain the separated signals of the two speakers in the frequency domain. An inverse Fourier transform is then applied to the two separated signals to obtain the finally estimated separated voices of the two speakers in the time domain.
It should be understood that the experimental objective is speaker separation, and when the mixed speech includes two speakers such as a target speaker and an interfering speaker, it is necessary to obtain the masking of the two speakers separately, and multiply the masking with the original input mixed speech signal to obtain the separated speech of the estimated two speakers in the time domain.
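A minimal sketch of step S37, assuming a 16 kHz sampling rate and the 25 ms / 6.25 ms analysis frames used earlier; the overlap-add inverse transform below omits window compensation for brevity, so it is an illustration rather than a perfectly reconstructing inverse.

```python
import numpy as np

def apply_masks(mixture_spec, target_mask, interf_mask):
    """Element-wise masking of the reference-channel mixture spectrum (step S37)."""
    return mixture_spec * target_mask, mixture_spec * interf_mask

def istft(spec, frame_len=400, hop_len=100, n_fft=512):
    """Overlap-add inverse transform of (frames, bins) spectra back to the time domain."""
    num_frames = spec.shape[0]
    out = np.zeros(hop_len * (num_frames - 1) + frame_len)
    for t in range(num_frames):
        frame = np.fft.irfft(spec[t], n=n_fft)[:frame_len]
        out[t * hop_len : t * hop_len + frame_len] += frame
    return out
```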
The system of the embodiments of the present application is based on weighted sound source position estimation, which reduces the influence of interfering speakers and silent frames in the mixed voice audio on the sound source position estimate; joint training of the weighted sound source position estimation and the multi-channel double-speaker separation strengthens the robustness of double-speaker separation at different overlap proportions and improves the performance of the speech separation system.
The method and system of the embodiments of the present application target two speakers. They can be extended to more speakers and remain applicable, but the number of output nodes of the neural network must be modified accordingly, and performance may decrease.
As shown in fig. 3, an embodiment of the present application provides a multi-channel dual speaker separation system, which includes: the system comprises a signal processing module 41, a feature extraction module 42, a neural network module 43, a sound source position weighting processing module 44, an angle feature calculation module 45 and a speaker separation module 46.
Wherein, the signal processing module 41 obtains a mixed voice audio, which includes two speaker voices; and performing framing, windowing and Fourier transform processing on the mixed voice audio to obtain the frequency spectrum of each frame of audio. The mixed speech audio is a time domain signal.
The feature extraction module 42 inputs the frequency spectrum of each frame of audio into the trained sound source position estimation network, which outputs estimated frame-level Cartesian coordinates and corresponding weight vectors; it also obtains the first logarithmic energy spectrum and the first sine-cosine inter-channel phase difference from the frequency spectrum of the mixed voice audio.
The sound source position weighting processing module 44 performs a softmax operation on the frame-level Cartesian coordinates and corresponding weights output by the sound source position estimation network to obtain the estimated Cartesian coordinates of the target speaker; these estimated Cartesian coordinates indicate the sound source position information.
The angle characteristic calculation module 45 converts the estimated Cartesian coordinates of the target speaker into the first angle characteristic.
The speaker separation module 46 obtains the first target speaker mask and the first interfering speaker mask according to the first angle characteristic, the first logarithmic energy spectrum, the first sine-cosine inter-channel phase difference and the speaker masking estimation network, and obtains the separated voices based on the first target speaker mask, the first interfering speaker mask and the mixed voice audio.
The multi-channel double-speaker voice separation method and system of the present application have better robustness for separating speech with different low speaker overlap proportions. The method separates the speakers using angle features, which it extracts by estimating the position information of the target speaker. For mixed voice audio with different speaker overlap proportions, the specific position of the target speaker within the speech cannot be known in advance; the weighted estimation effectively avoids the degradation of separation performance that occurs when average pooling in a conventional neural network lets interfering speaker speech and silent frames severely bias the target speaker's position information. The multi-channel double-speaker voice separation method provided by the embodiments of the present application therefore greatly improves speech separation performance.
As shown in fig. 4, an embodiment of the present application provides an electronic device 1100, which includes a processor 1101 and a memory 1102; the processor 1101 is configured to execute the computer executable instructions stored in the memory 1102, and the processor 1101 executes the computer executable instructions to perform the method according to any of the embodiments described above.
The embodiment of the present application provides a storage medium 1103, which includes a readable storage medium and a computer program stored in the readable storage medium, where the computer program is used to implement the method described in any of the above embodiments.
It will be further appreciated by those of ordinary skill in the art that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the components and steps of the examples have been described above in general terms of their functionality. Whether these functions are performed in hardware or software depends on the particular application and design constraints of the technical solution. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementations should not be considered beyond the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, a software module executed by a processor, or a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the embodiments of the present application in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present application and are not intended to limit the scope of the embodiments of the present application, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the embodiments of the present application should be included in the scope of the embodiments of the present application.

Claims (10)

1. A multi-channel dual speaker separation method, the method comprising:
performing framing, windowing and Fourier transform processing on the mixed voice audio to obtain the frequency spectrum of each frame of audio; the mixed voice audio comprises mixed voice audio with different speaker overlapping proportions;
obtaining estimated frame level Cartesian coordinates and corresponding weights according to the frequency spectrum of each frame of audio and the sound source position estimation network;
obtaining a first logarithmic energy spectrum and a phase difference between first sine and cosine channels according to the frequency spectrum of each frame of audio;
obtaining a Cartesian coordinate estimation of a target speaker in the mixed voice audio according to the estimated frame-level Cartesian coordinates and corresponding weights, wherein the Cartesian coordinate estimation of the target speaker indicates a weighted sound source position estimation of the target speaker;
obtaining a first angle characteristic according to the Cartesian coordinates of the target speaker;
obtaining a first target speaker mask and a first interference speaker mask according to the first logarithmic energy spectrum, the first sine and cosine inter-channel phase difference, the first angle characteristic and the speaker mask estimation network;
and obtaining the voice of the target speaker and the voice of the interfering speaker based on the first target speaker mask, the first interfering speaker mask and the mixed voice audio.
2. The method of claim 1, further comprising:
determining a training set of mixed voice audio, and determining training voice audio and a label based on the training set of mixed voice audio; the tag comprises a sound source position vector, a second target speaker voice and a second interfering speaker voice;
training the sound source position estimation network according to the training voice audio;
training the speaker masking estimation network;
and jointly training the sound source position estimation network and the speaker masking estimation network to obtain the trained sound source position estimation network and the trained speaker masking estimation network.
3. The method of claim 2, wherein training the sound source position estimation network based on the training speech audio comprises:
performing framing, windowing and Fourier transform processing on the training voice audio to obtain a frequency spectrum of the training voice audio; the spectrum of the training speech audio comprises a real part and an imaginary part;
calculating a value of a first loss function by taking the data after splicing the real part and the imaginary part as the input of the sound source position estimation network and taking the sound source position vector estimation as the output, wherein the first loss function is the mean square error of the sound source position;
training by taking the value of the first loss function within a first threshold value as a target to obtain the trained sound source position estimation network and a corresponding weight vector; the sound source position estimation network comprises 3 convolution modules, 2 bidirectional long short-term memory layers and 2 fully connected layers.
4. The method of claim 2, wherein training the speaker masking estimate network comprises:
determining a second angle characteristic, a second logarithmic energy spectrum and a second phase difference between sine and cosine channels according to the training voice audio and the sound source position vector;
taking the second angle characteristic, the second logarithmic energy spectrum and the phase difference between the second sine channel and the second cosine channel as input and the second target speaker mask and the second interfering speaker mask as output, calculating the product of the second target speaker mask and the training voice audio to obtain an estimated speaker voice signal;
calculating the product of the second interference speaker masking and the training voice audio to obtain an estimated interference speaker voice signal;
calculating the value of a second loss function, wherein the value of the second loss function is a logarithmic value of the loss ratio of the estimated voice signal to the target voice signal; the estimated voice signal comprises an estimated speaker voice signal and an estimated interference speaker voice signal; the target speech signal comprises a second target speaker speech and a second interfering speaker speech;
training by taking the value of the second loss function within a second threshold value as a target to obtain the trained speaker masking estimation network, wherein the speaker masking estimation network comprises a 3-layer bidirectional long short-term memory network and 2 independent fully connected layers.
5. The method of claim 2, wherein the jointly training the sound source position estimation network and the speaker masking estimation network comprises:
combining the sound source position estimation network and the speaker masking estimation network, and calculating a value of a third loss function, wherein the value of the third loss function is the sum of the value of the first loss function and the value of the second loss function; the value of the first loss function is the mean square error value of the sound source position; the value of the second loss function is a logarithmic value of the loss ratio of the estimated voice signal to the target voice signal;
and fine-tuning the sound source position estimation network and the speaker masking estimation network by taking the minimum value of the third loss function as a target to obtain the trained sound source position estimation network and the trained speaker masking estimation network.
6. The method of claim 1, wherein deriving a cartesian coordinate estimate of a targeted speaker in the mixed speech audio based on the estimated frame-level cartesian coordinates and corresponding weights comprises:
and performing softmax operation according to the estimated frame level Cartesian coordinates and corresponding weights, and calculating weighted sound source position estimation to obtain Cartesian coordinates of the target speaker in the mixed voice audio.
7. The method as claimed in claim 1, wherein said deriving a first angular feature from cartesian coordinates of the target speaker comprises:
determining a guide vector of the target speaker according to the difference value of the Cartesian coordinates of the target speaker and the Cartesian coordinates of the microphone topological structure; the Cartesian coordinates of the microphone topological structure are obtained based on a coordinate system of the microphone array;
and calculating to obtain a first angle characteristic according to the guide vector of the target speaker and the frequency spectrum of the mixed voice audio of the M channels, wherein the value of M is a natural number.
8. A multi-channel dual speaker separation system, the system comprising:
the signal processing module is used for performing framing, windowing and Fourier transform processing on the mixed voice audio to obtain the frequency spectrum of each frame; the mixed voice audio comprises at least two speaker voices;
the characteristic extraction module is used for obtaining estimated frame-level Cartesian coordinates and corresponding weights according to the frequency spectrum of each frame of audio and the sound source position estimation network; obtaining a first logarithmic energy spectrum and a phase difference between first sine and cosine channels according to the frequency spectrum of each frame of audio;
a sound source position weighting processing module, configured to obtain a cartesian coordinate estimation of a target speaker in the mixed speech audio according to the estimated frame-level cartesian coordinates and the corresponding weights, where the cartesian coordinate estimation of the target speaker indicates a weighted sound source position estimation of the target speaker;
the angle characteristic calculation module is used for obtaining a first angle characteristic according to the Cartesian coordinates of the target speaker;
the speaker separation module is used for obtaining a first target speaker mask and a first interference speaker mask according to the first logarithmic energy spectrum, the first sine and cosine inter-channel phase difference, the first angle characteristic and the speaker mask estimation network; and obtaining the voice of the target speaker and the voice of the interfering speaker based on the first target speaker mask, the first interfering speaker mask and the mixed voice audio.
9. An electronic device comprising a memory and a processor; the processor is used for executing the computer execution instructions stored by the memory, and the processor executes the computer execution instructions to execute the method of any one of claims 1-7.
10. A storage medium comprising a readable storage medium and a computer program stored in the readable storage medium, the computer program being for implementing the method of any one of claims 1 to 7.
CN202111134595.6A 2021-09-27 2021-09-27 Multi-channel double-speaker separation method and system Pending CN113870893A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111134595.6A CN113870893A (en) 2021-09-27 2021-09-27 Multi-channel double-speaker separation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111134595.6A CN113870893A (en) 2021-09-27 2021-09-27 Multi-channel double-speaker separation method and system

Publications (1)

Publication Number Publication Date
CN113870893A true CN113870893A (en) 2021-12-31

Family

ID=78991114

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111134595.6A Pending CN113870893A (en) 2021-09-27 2021-09-27 Multi-channel double-speaker separation method and system

Country Status (1)

Country Link
CN (1) CN113870893A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114446316A (en) * 2022-01-27 2022-05-06 腾讯科技(深圳)有限公司 Audio separation method, and training method, device and equipment of audio separation model
CN114446316B (en) * 2022-01-27 2024-03-12 腾讯科技(深圳)有限公司 Audio separation method, training method, device and equipment of audio separation model
WO2024055751A1 (en) * 2022-09-13 2024-03-21 腾讯科技(深圳)有限公司 Audio data processing method and apparatus, device, storage medium, and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination