CN113782047B - Voice separation method, device, equipment and storage medium - Google Patents
- Publication number
- CN113782047B (application CN202111040658.1A / CN202111040658A)
- Authority
- CN
- China
- Prior art keywords
- channel
- signal
- angle deviation
- time domain
- noise reduction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01S—RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
- G01S3/00—Direction-finders for determining the direction from which infrasonic, sonic, ultrasonic, or electromagnetic waves, or particle emission, not having a directional significance, are being received
- G01S3/02—Direction-finders for determining the direction from which infrasonic, sonic, ultrasonic, or electromagnetic waves, or particle emission, not having a directional significance, are being received using radio waves
- G01S3/14—Systems for determining direction or deviation from predetermined direction
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0224—Processing in the time domain
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L2021/02161—Number of inputs available containing the signal or the noise to be suppressed
- G10L2021/02166—Microphone arrays; Beamforming
Abstract
The invention relates to a voice separation method, device, equipment and storage medium. The method comprises: separating a time-domain mixed voice signal to obtain a time-domain signal of a first channel and a time-domain signal of a second channel; selecting, in order of signal energy from high to low, the two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the time-domain signal of the first channel and taking their mode to obtain azimuth estimation information of the first channel, and selecting the two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the time-domain signal of the second channel and taking their mode to obtain azimuth estimation information of the second channel; calculating the pitch angle deviation and the azimuth angle deviation of the first channel from the azimuth estimation information of the first channel, and the pitch angle deviation and the azimuth angle deviation of the second channel from the azimuth estimation information of the second channel; and comparing the deviations of the first channel and the second channel, and determining the target sound source corresponding to each channel according to the comparison result.
Description
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a speech separation method, apparatus, device, and storage medium.
Background
In recent years, with the rapid development of speech recognition technology, there is an urgent demand for real-time speech separation in multi-channel speech recognition scenarios. For example, in one-to-one education, the voice of the student and the voice of the teacher need to be separated.
In the related art, blind source separation is generally adopted to separate mixed voices. However, the order of the output channels corresponding to the voice signals obtained by blind source separation is uncertain, so the user must further determine which voice signal corresponds to each channel, which reduces voice separation efficiency.
Disclosure of Invention
The invention provides a voice separation method, device, equipment and storage medium, to solve the technical problem in the prior art that the order of the output channels corresponding to the voice signals obtained by blind source separation is uncertain, so that the user must further determine the voice signal corresponding to each channel, which reduces voice separation efficiency.
The technical scheme for solving the technical problems is as follows:
a method of speech separation comprising:
performing short-time Fourier transform on the time-domain mixed voice signal received by the microphone array to obtain a time-frequency domain mixed voice signal;
separating the mixed voice signals of the time-frequency domain to obtain a separation signal of a first channel and a separation signal of a second channel;
respectively carrying out short-time inverse Fourier transform on the separated signals of the first channel and the separated signals of the second channel to obtain a time domain signal of the first channel and a time domain signal of the second channel;
selecting, in order of signal energy from high to low, the two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the time-domain signal of the first channel, and taking their mode to obtain azimuth estimation information of the first channel; and selecting the two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the time-domain signal of the second channel, and taking their mode to obtain azimuth estimation information of the second channel;
calculating pitch angle deviation of the first channel and azimuth angle deviation of the first channel according to the azimuth estimation information of the first channel, and calculating pitch angle deviation of the second channel and azimuth angle deviation of the second channel according to the azimuth estimation information of the second channel;
if the pitch angle deviation of the first channel is not greater than the pitch angle deviation of the second channel and/or the azimuth angle deviation of the first channel is not greater than the azimuth angle deviation of the second channel, determining that the first channel is the voice information of the first target sound source and the second channel is the voice information of the second target sound source;
if the pitch angle deviation of the first channel is larger than the pitch angle deviation of the second channel, and the azimuth angle deviation of the first channel is larger than the azimuth angle deviation of the second channel, determining that the first channel is the voice information of the second target sound source, and the second channel is the voice information of the first target sound source.
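The front end of the method (steps of transforming, separating, and transforming back) can be sketched as follows. This is a minimal illustration, not the patented implementation: the patent does not name a specific blind source separation algorithm, so the demixing step is left as an identity placeholder, and the 16 kHz rate and 512-sample frame are assumptions of this sketch.

```python
import numpy as np
from scipy.signal import stft, istft

def separate(mix, fs=16000, nper=512):
    # Short-time Fourier transform of the two-microphone time-domain
    # mixture -> time-frequency signal x(t, k).
    _, _, X = stft(mix, fs=fs, nperseg=nper, axis=-1)  # (2, freq, frames)
    # Blind source separation.  The patent does not name an algorithm; a
    # per-frequency demixing matrix (e.g. estimated by IVA) would be
    # applied here.  Identity placeholder = no actual separation.
    Y = X
    # Short-time inverse Fourier transform back to the per-channel
    # time-domain signals (first channel and second channel).
    _, y = istft(Y, fs=fs, nperseg=nper)
    return y  # shape (2, samples)

rng = np.random.default_rng(0)
mix = rng.standard_normal((2, 16000))  # stand-in for a 1 s two-mic recording
out = separate(mix)
```

With the identity placeholder the round trip reconstructs the input, which is a useful sanity check that the STFT parameters satisfy the overlap-add constraint.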
Further, in the above voice separation method, before performing short-time inverse fourier transform on the separation signal of the first channel and the separation signal of the second channel to obtain the time domain signal of the first channel and the time domain signal of the second channel, the method further includes:
processing the separation signal of the first channel and the separation signal of the second channel through an adaptive filtering algorithm to obtain a primary noise reduction signal of the first channel;
performing energy comparison between the primary noise reduction signal of the first channel and the time-domain mixed voice signal, and processing the higher-energy voice signal and the time-domain mixed voice signal through an adaptive filtering algorithm and a nonlinear noise reduction algorithm to obtain a primary noise reduction signal of the second channel;
correspondingly, performing short-time inverse fourier transform on the separation signal of the first channel and the separation signal of the second channel respectively to obtain a time domain signal of the first channel and a time domain signal of the second channel, including:
and respectively carrying out short-time inverse Fourier transform on the primary noise reduction signal of the first channel and the primary noise reduction signal of the second channel to obtain a time domain signal of the first channel and a time domain signal of the second channel.
Further, in the above voice separation method, before performing short-time inverse fourier transform on the primary noise reduction signal of the first channel and the primary noise reduction signal of the second channel to obtain the time domain signal of the first channel and the time domain signal of the second channel, the method further includes:
the primary noise reduction signal of the first channel and the primary noise reduction signal of the second channel are respectively subjected to single-channel noise reduction to eliminate background noise, so that a final noise reduction signal of the first channel and a final noise reduction signal of the second channel are obtained;
correspondingly, performing short-time inverse fourier transform on the primary noise reduction signal of the first channel and the primary noise reduction signal of the second channel respectively to obtain a time domain signal of the first channel and a time domain signal of the second channel, including:
and respectively carrying out short-time inverse Fourier transform on the final noise reduction signal of the first channel and the final noise reduction signal of the second channel to obtain a time domain signal of the first channel and a time domain signal of the second channel.
Further, the above voice separation method further includes:
and updating the weight of a filter corresponding to the adaptive filtering algorithm when the pitch angle deviation is larger than the angle deviation threshold of the pitch angle or the azimuth angle deviation is larger than the angle deviation threshold of the azimuth angle.
Further, the above voice separation method further includes:
and when the pitch angle deviation is smaller than or equal to the angle deviation threshold of the pitch angle, and the azimuth angle deviation is smaller than or equal to the angle deviation threshold of the azimuth angle, maintaining the weight of the filter corresponding to the adaptive filtering algorithm unchanged.
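The two weight-update rules above amount to a small gating function, sketched below; the threshold values in degrees are illustrative, since the patent leaves the angle deviation thresholds unspecified.

```python
def update_filter_weights(pitch_dev, azim_dev, w_old, w_new,
                          pitch_thresh=10.0, azim_thresh=10.0):
    # Thresholds (degrees) are illustrative; the patent does not fix them.
    if pitch_dev > pitch_thresh or azim_dev > azim_thresh:
        return w_new  # a deviation is too large: update the filter weights
    return w_old      # both deviations within threshold: keep the weights
```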
Further, in the above voice separation method, the adaptive filtering algorithm is any one of the least mean square (LMS) algorithm, the normalized least mean square (NLMS) algorithm, and the recursive least squares (RLS) algorithm.
The invention also provides a voice separation device, which comprises:
the first transformation module is used for carrying out short-time Fourier transformation on the time-domain mixed voice signal received by the microphone array to obtain a time-frequency domain mixed voice signal;
the separation module is used for separating the time-frequency domain mixed voice signals to obtain a separation signal of a first channel and a separation signal of a second channel;
the second transformation module is used for respectively carrying out short-time inverse Fourier transformation on the separation signal of the first channel and the separation signal of the second channel to obtain a time domain signal of the first channel and a time domain signal of the second channel;
the azimuth estimation module is used for selecting, in order of signal energy from high to low, the two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the time-domain signal of the first channel and taking their mode to obtain azimuth estimation information of the first channel, and selecting the two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the time-domain signal of the second channel and taking their mode to obtain azimuth estimation information of the second channel;
the deviation estimation module is used for calculating the pitch angle deviation of the first channel and the azimuth angle deviation of the first channel according to the azimuth estimation information of the first channel, and calculating the pitch angle deviation of the second channel and the azimuth angle deviation of the second channel according to the azimuth estimation information of the second channel;
the determining module is used for determining that the first channel is the voice information of the first target sound source and the second channel is the voice information of the second target sound source if the pitch angle deviation of the first channel is not greater than the pitch angle deviation of the second channel and/or the azimuth angle deviation of the first channel is not greater than the azimuth angle deviation of the second channel; if the pitch angle deviation of the first channel is larger than the pitch angle deviation of the second channel, and the azimuth angle deviation of the first channel is larger than the azimuth angle deviation of the second channel, determining that the first channel is the voice information of the second target sound source, and the second channel is the voice information of the first target sound source.
Further, in the above voice separation apparatus, the separation module is further configured to:
processing the separation signal of the first channel and the separation signal of the second channel through an adaptive filtering algorithm to obtain a primary noise reduction signal of the first channel;
performing energy comparison between the primary noise reduction signal of the first channel and the time-domain mixed voice signal, and processing the higher-energy voice signal and the time-domain mixed voice signal through an adaptive filtering algorithm and a nonlinear noise reduction algorithm to obtain a primary noise reduction signal of the second channel;
correspondingly, the second transformation module is further configured to perform short-time inverse fourier transform on the primary noise reduction signal of the first channel and the primary noise reduction signal of the second channel, so as to obtain a time domain signal of the first channel and a time domain signal of the second channel.
The invention also provides a voice separation apparatus comprising: a processor and a memory;
the processor is configured to execute a program of the speech separation method stored in the memory, so as to implement any one of the above-described speech separation methods.
The present invention also provides a storage medium storing one or more programs that when executed implement any of the above-described methods of speech separation.
The beneficial effects of the invention are as follows:
the method comprises the steps of performing voice separation on a time-domain mixed voice signal, obtaining a time-domain signal of a first channel and a time-domain signal of a second channel, collecting energy judgment, selecting two-dimensional arrival azimuth estimation corresponding to the time-domain signal of the first channel with a specified frame number, and obtaining azimuth estimation information of the first channel, and selecting two-dimensional arrival azimuth estimation information corresponding to the time-domain signal of the second channel with a specified frame number, and obtaining azimuth estimation of the second channel; then, according to the azimuth estimation information of the first channel, calculating the pitch angle deviation of the first channel and the azimuth deviation of the first channel, and according to the azimuth estimation information of the second channel, calculating the pitch angle deviation of the second channel and the azimuth deviation of the second channel; if the pitch angle deviation of the first channel is not greater than the pitch angle deviation of the second channel, and/or the azimuth angle deviation of the first channel is not greater than the azimuth angle deviation of the second channel, determining that the first channel is the voice information of the first target sound source, and the second channel is the voice information of the second target sound source; if the pitch angle deviation of the first channel is larger than the pitch angle deviation of the second channel, and the azimuth angle deviation of the first channel is larger than the azimuth angle deviation of the second channel, determining that the first channel is the voice information of the second target sound source, and the second channel is the voice information of the first target sound source. 
Therefore, the voice signals are output according to the determined channel sequence, so that the user is prevented from further determining the voice signals corresponding to each channel, and the voice separation efficiency is improved.
Drawings
FIG. 1 is a flow chart of an embodiment of a method for speech separation according to the present invention;
FIG. 2 is a schematic diagram of a microphone array according to the present invention;
FIG. 3 is a schematic diagram of a voice separation apparatus according to an embodiment of the present invention;
fig. 4 is a schematic structural view of the voice separation apparatus of the present invention.
Detailed Description
The principles and features of the present invention are described below with reference to the drawings. The examples are provided only to illustrate the invention and are not to be construed as limiting its scope.
Fig. 1 is a flowchart of an embodiment of a voice separation method according to the present invention, as shown in fig. 1, the voice separation method of the present embodiment may specifically include the following steps:
100. performing short-time Fourier transform on the time-domain mixed voice signal received by the microphone array to obtain a time-frequency domain mixed voice signal;
Fig. 2 is a schematic diagram of a microphone array according to the present invention. As shown in fig. 2, an angle error threshold of the pitch angle and an angle error threshold of the azimuth angle may be set. The pitch angle θ of the first sound source signal in the time-domain mixed speech signal received by the microphone array may be, for example, 30 degrees, and its azimuth angle φ, for example, 60 degrees. The second sound source signal in the time-domain mixed speech signal received by the microphone array may come from any direction.
In a specific implementation process, the microphone array may receive a time-domain mixed speech signal. Because the speech signal is short-time stationary, it is generally transformed into the short-time frequency domain for analysis; therefore a short-time Fourier transform is performed on the time-domain mixed speech signal to obtain a time-frequency domain mixed speech signal, which may be expressed as x(t, k), where t denotes the frame index and k the frequency bin.
101. Separating the mixed voice signals of the time-frequency domain to obtain a separation signal of a first channel and a separation signal of a second channel;
in a specific implementation process, a blind source separation algorithm may be used to separate the time-frequency domain mixed speech signal, so as to obtain a separation signal of the first channel and a separation signal of the second channel. For the specific separation method, reference may be made to the related art, and will not be described herein.
102. Respectively carrying out short-time inverse Fourier transform on the separated signals of the first channel and the separated signals of the second channel to obtain a time domain signal of the first channel and a time domain signal of the second channel;
in a specific implementation process, the separation signal of the first channel and the separation signal of the second channel may be respectively subjected to short-time inverse fourier transform, so as to obtain a time domain signal of the first channel and a time domain signal of the second channel.
103. Selecting, in order of signal energy from high to low, the two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the time-domain signal of the first channel, and taking their mode to obtain azimuth estimation information of the first channel; and selecting the two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the time-domain signal of the second channel, and taking their mode to obtain azimuth estimation information of the second channel;
in a specific implementation process, the pitch angle of each frame of each channel can be obtained through two-dimensional direction of arrival estimationAnd azimuth->The ability of the speech signal for each frame can also be derived from the calculation of the speech signal energy. Wherein, the languageThe energy of the sound signal is calculated as +.>E i Representing speech signal energy, x i (t) represents a time domain signal of each channel of the current frame, and N represents the number of frames.
In a specific implementation process, the two-dimensional direction-of-arrival estimates corresponding to the time-domain signal of the first channel for the top 30% of frames may be selected and their mode taken to obtain the azimuth estimation information of the first channel, and the two-dimensional direction-of-arrival estimates corresponding to the time-domain signal of the second channel for the specified number of frames may be selected and their mode taken to obtain the azimuth estimation information of the second channel.
Specifically, after the two-dimensional direction-of-arrival estimates (pitch angle and azimuth angle) of all frames are obtained, the energies of all frames are sorted from high to low, and the pitch angles and azimuth angles of the 30% of frames with the highest energy are selected, yielding an array of pitch angles and an array of azimuth angles. Three angular regions, such as 0-50, 50-100 and 100-180 degrees, can be set in advance; the mode is taken by checking into which region the values in the array fall most frequently. For example, if values in 0-50 occur most often in the pitch angle array, the mode of the pitch angle of this channel is taken as a value in 0-50.
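The top-30% selection and region-wise mode described above might be sketched as follows. The 0-50/50-100/100-180 degree regions come from the patent's example; returning the bin centre as the representative angle is an implementation choice of this sketch, since the text allows "any value" in the modal region.

```python
import numpy as np

def modal_angle(angles, energies, top_frac=0.30,
                bins=((0, 50), (50, 100), (100, 180))):
    # Sort frames by energy, high to low, and keep the top 30%.
    order = np.argsort(energies)[::-1]
    keep = max(1, int(len(angles) * top_frac))
    top = np.asarray(angles)[order[:keep]]
    # Count how many of the selected angles fall in each preset region.
    counts = [np.sum((top >= lo) & (top < hi)) for lo, hi in bins]
    lo, hi = bins[int(np.argmax(counts))]
    return (lo + hi) / 2.0  # representative value in the modal region
```

The same function is applied once to the pitch angle array and once to the azimuth angle array of each channel.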
104. Calculating pitch angle deviation of the first channel and azimuth angle deviation of the first channel according to the azimuth estimation information of the first channel, and calculating pitch angle deviation of the second channel and azimuth angle deviation of the second channel according to the azimuth estimation information of the second channel;
in one implementation, the position estimate information for the first channel may be recorded asThe azimuth estimation information of the second channel can be denoted +.>The pitch angle deviation of the first channel is +.>The azimuthal deviation of the first channel is +.>Wherein θ represents a reference pitch angle, < >>Representing the reference azimuth angle.
105. Detecting whether the pitch angle deviation of a first channel is larger than that of a second channel, and whether the azimuth angle deviation of the first channel is larger than that of the second channel; if yes, go to step 106, if no, go to step 107;
106. determining the first channel as voice information of a second target sound source, wherein the second channel is the voice information of the first target sound source;
If the pitch angle deviation of the first channel is larger than the pitch angle deviation of the second channel, and the azimuth angle deviation of the first channel is larger than the azimuth angle deviation of the second channel, the first channel is determined to be the voice information of the second target sound source, and the second channel is determined to be the voice information of the first target sound source.
107. And determining the first channel as the voice information of the first target sound source, and determining the second channel as the voice information of the second target sound source.
If the pitch angle deviation of the first channel is not greater than the pitch angle deviation of the second channel, and/or the azimuth angle deviation of the first channel is not greater than the azimuth angle deviation of the second channel, determining that the first channel is the voice information of the first target sound source, and the second channel is the voice information of the second target sound source.
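Steps 105-107 amount to a swap decision on the channel order; a minimal sketch:

```python
def assign_sources(dev1, dev2):
    # dev1, dev2: (pitch deviation, azimuth deviation) of channels 1, 2.
    p1, a1 = dev1
    p2, a2 = dev2
    if p1 > p2 and a1 > a2:
        # Channel 1 deviates more on both angles from the reference
        # direction, so the channels carry the opposite sources (step 106).
        return ("target source 2", "target source 1")
    # Otherwise keep the channel order (step 107).
    return ("target source 1", "target source 2")
```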
According to the voice separation method of this embodiment, voice separation is performed on the time-domain mixed voice signal to obtain the time-domain signal of the first channel and the time-domain signal of the second channel; then, combined with energy judgment, the two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the time-domain signal of the first channel are selected and their mode is taken to obtain the azimuth estimation information of the first channel, and the two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the time-domain signal of the second channel are selected and their mode is taken to obtain the azimuth estimation information of the second channel. Next, the pitch angle deviation and the azimuth angle deviation of the first channel are calculated from the azimuth estimation information of the first channel, and the pitch angle deviation and the azimuth angle deviation of the second channel are calculated from the azimuth estimation information of the second channel. If the pitch angle deviation of the first channel is not greater than that of the second channel, and/or the azimuth angle deviation of the first channel is not greater than that of the second channel, the first channel is determined to be the voice information of the first target sound source and the second channel the voice information of the second target sound source; if the pitch angle deviation of the first channel is greater than that of the second channel and the azimuth angle deviation of the first channel is greater than that of the second channel, the first channel is determined to be the voice information of the second target sound source and the second channel the voice information of the first target sound source.
Therefore, the voice signals are output according to the determined channel sequence, so that the user is prevented from further determining the voice signals corresponding to each channel, and the voice separation efficiency is improved.
In a specific implementation process, before the step 102 "performing short-time inverse fourier transform on the separation signal of the first channel and the separation signal of the second channel to obtain the time domain signal of the first channel and the time domain signal of the second channel" in the foregoing embodiment, the following steps may be further performed:
(1) Processing the separation signal of the first channel and the separation signal of the second channel through an adaptive filtering algorithm to obtain a primary noise reduction signal of the first channel;
(2) Performing energy comparison between the primary noise reduction signal of the first channel and the time-domain mixed voice signal, and processing the higher-energy voice signal and the time-domain mixed voice signal through an adaptive filtering algorithm and a nonlinear noise reduction algorithm to obtain a primary noise reduction signal of the second channel;
specifically, after the primary noise reduction signal of the first channel is obtained, energy comparison can be performed between the primary noise reduction signal of the first channel and the mixed voice signal of the time domain, and a voice signal with high energy is selected. If the energy of the primary noise reduction signal of the first channel is higher than the energy of the mixed voice signal of the time domain, the primary noise reduction signal of the first channel is used as the voice signal with high energy, and if the energy of the primary noise reduction signal of the first channel is lower than the energy of the mixed voice signal of the time domain, the mixed voice signal of the time domain is used as the voice signal with high energy. And taking the time domain mixed voice signal as a reference, and filtering by a self-adaptive filtering algorithm to obtain a primary noise reduction signal of the second channel. The self-adaptive filtering algorithm is any one of a least mean square algorithm LMS, an NLMS algorithm and a least square method RLS.
Correspondingly, performing short-time inverse fourier transform on the separation signal of the first channel and the separation signal of the second channel respectively to obtain a time domain signal of the first channel and a time domain signal of the second channel, including: and respectively carrying out short-time inverse Fourier transform on the primary noise reduction signal of the first channel and the primary noise reduction signal of the second channel to obtain a time domain signal of the first channel and a time domain signal of the second channel.
In a specific implementation process, before "performing short-time inverse fourier transform on the primary noise reduction signal of the first channel and the primary noise reduction signal of the second channel respectively to obtain a time domain signal of the first channel and a time domain signal of the second channel", the following steps may be further performed:
(11) And respectively removing background noise from the primary noise reduction signal of the first channel and the primary noise reduction signal of the second channel through single-channel noise reduction to obtain a final noise reduction signal of the first channel and a final noise reduction signal of the second channel.
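The patent does not name the single-channel noise reduction algorithm. As one hedged possibility, a magnitude spectral-subtraction sketch (the function name and the assumption that the leading frames contain noise only are ours):

```python
import numpy as np

def spectral_subtract(signal, frame=256, hop=128, noise_frames=5, floor=0.02):
    """Single-channel noise reduction by magnitude spectral subtraction.
    The noise spectrum is estimated from the first `noise_frames` frames,
    assumed to contain background noise only."""
    win = np.hanning(frame)
    n_frames = 1 + (len(signal) - frame) // hop
    specs = [np.fft.rfft(win * signal[i * hop:i * hop + frame]) for i in range(n_frames)]
    noise_mag = np.mean([np.abs(s) for s in specs[:noise_frames]], axis=0)
    out = np.zeros(len(signal))
    for i, s in enumerate(specs):
        # Subtract the noise magnitude estimate, keeping a small spectral floor
        mag = np.maximum(np.abs(s) - noise_mag, floor * np.abs(s))
        frame_t = np.fft.irfft(mag * np.exp(1j * np.angle(s)), frame)
        out[i * hop:i * hop + frame] += win * frame_t  # overlap-add synthesis
    return out
```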
Correspondingly, the short-time inverse fourier transform is respectively performed on the separation signal of the first channel and the separation signal of the second channel, so as to obtain a time domain signal of the first channel and a time domain signal of the second channel, which comprises the following steps: and respectively carrying out short-time inverse Fourier transform on the final noise reduction signal of the first channel and the final noise reduction signal of the second channel to obtain a time domain signal of the first channel and a time domain signal of the second channel.
In this embodiment, energy comparison and adaptive filtering are combined to further denoise the separated voice signal of each channel, so that the separated voice is cleaner.
In a specific implementation process, after step 104 "calculating the pitch angle deviation of the first channel and the azimuth angle deviation of the first channel according to the azimuth estimation information of the first channel, and calculating the pitch angle deviation of the second channel and the azimuth angle deviation of the second channel according to the azimuth estimation information of the second channel", the following steps may be further performed: when the pitch angle deviation is greater than the angle deviation threshold of the pitch angle, or the azimuth angle deviation is greater than the angle deviation threshold of the azimuth angle, the weights of the filter corresponding to the adaptive filtering algorithm are updated; when the pitch angle deviation is less than or equal to the angle deviation threshold of the pitch angle, and the azimuth angle deviation is less than or equal to the angle deviation threshold of the azimuth angle, the weights of the filter are kept unchanged.
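A minimal sketch of this threshold-gated weight update, with hypothetical threshold values since the patent does not fix them:

```python
import numpy as np

class GatedFilterWeights:
    """Holds adaptive-filter weights and replaces them only when the
    estimated direction has drifted past either angle-deviation threshold.
    Threshold defaults are illustrative assumptions, not from the patent."""
    def __init__(self, order, pitch_thresh=5.0, azim_thresh=5.0):
        self.w = np.zeros(order)
        self.pitch_thresh = pitch_thresh
        self.azim_thresh = azim_thresh

    def maybe_update(self, new_weights, pitch_dev, azim_dev):
        if pitch_dev > self.pitch_thresh or azim_dev > self.azim_thresh:
            self.w = np.asarray(new_weights, dtype=float)  # source moved: re-adapt
            return True
        return False  # both deviations within thresholds: keep weights unchanged
```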
In a specific implementation process, the historical weights of the filter corresponding to the adaptive filtering algorithm can be fitted to obtain a weight-update fitting function, so that before the filter is used its weights can be set from this function. After the number of updates reaches a preset count m, the actually computed weights of the m-th update are obtained with the threshold-based update method described above. If the error of the fitted weights relative to the computed weights is within a preset range, the fitting function continues to be used to set the weights from the m-th to the 2m-th update; otherwise the weights are updated whenever the pitch angle deviation exceeds the angle deviation threshold of the pitch angle or the azimuth angle deviation exceeds the angle deviation threshold of the azimuth angle, until n such updates have been performed, after which the fitting function is refitted from the n computed weight values. In this way, repeated calculation of the pitch angle deviation between the pitch angle in the time-frequency domain mixed voice signal and the target orientation, and of the azimuth angle deviation between the azimuth angle and the target orientation, can be avoided, improving both efficiency and accuracy.
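One way the fitting of historical filter weights could be realized, assuming a low-order polynomial fit per weight (the patent says only "fitting", so the functional form is our assumption):

```python
import numpy as np

def fit_weight_schedule(history, degree=2):
    """Fit each filter weight's history over update steps with a low-order
    polynomial, giving a predictor w(step) usable in place of full updates."""
    history = np.asarray(history, dtype=float)   # shape: (steps, filter order)
    steps = np.arange(len(history))
    coeffs = [np.polyfit(steps, history[:, k], degree) for k in range(history.shape[1])]
    return lambda step: np.array([np.polyval(c, step) for c in coeffs])

def within_error(predicted, actual, tol=1e-2):
    """Check whether the fitted prediction still tracks the actually computed
    weights; if not, fall back to threshold-triggered updates and refit."""
    return np.max(np.abs(predicted - actual)) <= tol
```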
It should be noted that the method of the embodiment of the present invention may be performed by a single device, for example a computer or a server. The method of this embodiment can also be applied in a distributed scenario, where it is completed by a plurality of devices cooperating with each other. In such a distributed scenario, each device may perform only one or more steps of the method of the embodiment of the present invention, and the devices interact with each other to complete the method.
Fig. 3 is a schematic structural diagram of an embodiment of the voice separation apparatus according to the present invention, as shown in fig. 3, the voice separation apparatus according to the present embodiment may include a first transformation module 20, a separation module 21, a second transformation module 22, an orientation estimation module 23, a deviation estimation module 24, and a determination module 25.
a first transformation module 20, configured to perform Fourier transform on the time domain mixed voice signal received by the microphone array to obtain a time-frequency domain mixed voice signal;
a separation module 21, configured to separate the time-frequency domain mixed speech signal to obtain a separation signal of the first channel and a separation signal of the second channel;
a second transform module 22, configured to perform short-time inverse fourier transform on the separated signal of the first channel and the separated signal of the second channel, to obtain a time domain signal of the first channel and a time domain signal of the second channel;
an azimuth estimation module 23, configured to select, in descending order of signal energy, the two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the time domain signal of the first channel and take their mode to obtain azimuth estimation information of the first channel, and to select the two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the time domain signal of the second channel and take their mode to obtain azimuth estimation information of the second channel;
a deviation estimating module 24, configured to calculate a pitch angle deviation of the first channel and an azimuth angle deviation of the first channel according to the azimuth estimation information of the first channel, and calculate a pitch angle deviation of the second channel and an azimuth angle deviation of the second channel according to the azimuth estimation information of the second channel;
a determining module 25, configured to determine that the first channel is the voice information of the first target sound source and the second channel is the voice information of the second target sound source if the pitch angle deviation of the first channel is not greater than that of the second channel and/or the azimuth angle deviation of the first channel is not greater than that of the second channel; and to determine that the first channel is the voice information of the second target sound source and the second channel is the voice information of the first target sound source if both the pitch angle deviation and the azimuth angle deviation of the first channel are greater than those of the second channel.
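Under the assumption that each frame yields a (pitch, azimuth) direction-of-arrival estimate, the mode-based orientation estimation and the deviation-based channel assignment described above could be sketched as follows (names and return values are illustrative):

```python
import numpy as np
from collections import Counter

def channel_orientation(doa_per_frame, energies, n_frames=50):
    """Pick the `n_frames` highest-energy frames and take the mode of their
    (pitch, azimuth) direction-of-arrival estimates as the channel's
    orientation estimate."""
    top = np.argsort(energies)[::-1][:n_frames]
    picked = [tuple(doa_per_frame[i]) for i in top]
    return Counter(picked).most_common(1)[0][0]

def assign_channels(dev1, dev2):
    """dev = (pitch_deviation, azimuth_deviation) from the target orientation.
    Per the patent's rule, swap the assignment only when channel 1 exceeds
    channel 2 in BOTH deviations."""
    if dev1[0] > dev2[0] and dev1[1] > dev2[1]:
        return {"channel1": "target2", "channel2": "target1"}
    return {"channel1": "target1", "channel2": "target2"}
```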
In a specific implementation, the separation module 21 is further configured to:
processing the separation signal of the first channel and the separation signal of the second channel through an adaptive filtering algorithm to obtain a primary noise reduction signal of the first channel;
and performing energy comparison between the primary noise reduction signal of the first channel and the time domain mixed voice signal, and processing the higher-energy signal and the time domain mixed voice signal through an adaptive filtering algorithm and a nonlinear noise reduction algorithm to obtain the primary noise reduction signal of the second channel. The adaptive filtering algorithm is any one of the least mean squares (LMS) algorithm, the normalized LMS (NLMS) algorithm, and the recursive least squares (RLS) algorithm.
Correspondingly, the second transform module 22 is further configured to perform short-time inverse fourier transform on the primary noise reduction signal of the first channel and the primary noise reduction signal of the second channel, so as to obtain a time domain signal of the first channel and a time domain signal of the second channel.
In a specific implementation, the separation module 21 is further configured to: the primary noise reduction signal of the first channel and the primary noise reduction signal of the second channel are respectively subjected to single-channel noise reduction to eliminate background noise, so that a final noise reduction signal of the first channel and a final noise reduction signal of the second channel are obtained;
correspondingly, the second transform module 22 is further configured to perform short-time inverse fourier transform on the final noise reduction signal of the first channel and the final noise reduction signal of the second channel, so as to obtain a time domain signal of the first channel and a time domain signal of the second channel.
In a specific implementation, the deviation estimation module 24 is further configured to update the weights of the filter corresponding to the adaptive filtering algorithm when the pitch angle deviation is greater than the angle deviation threshold of the pitch angle, or the azimuth angle deviation is greater than the angle deviation threshold of the azimuth angle; and to keep the weights of the filter unchanged when the pitch angle deviation is less than or equal to the angle deviation threshold of the pitch angle and the azimuth angle deviation is less than or equal to the angle deviation threshold of the azimuth angle.
The device of the foregoing embodiment is configured to implement the corresponding method of the foregoing embodiment; for specific implementations, reference may be made to the method described in the foregoing embodiment and the related descriptions in the method embodiment. The device has the beneficial effects of the corresponding method embodiment, which are not repeated here.
Fig. 4 is a schematic structural diagram of a voice separation apparatus according to the present invention. As shown in fig. 4, the voice separation apparatus of this embodiment may include a processor 1010 and a memory 1020. As will be appreciated by those skilled in the art, the device may also include an input/output interface 1030, a communication interface 1040, and a bus 1050, where the processor 1010, the memory 1020, the input/output interface 1030, and the communication interface 1040 communicate with one another within the device via the bus 1050.
The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute relevant programs to implement the technical solutions provided in the embodiments of the present disclosure.
The memory 1020 may be implemented in the form of ROM (Read-Only Memory), RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs; when the embodiments of the present specification are implemented in software or firmware, the associated program code is stored in the memory 1020 and executed by the processor 1010.
The input/output interface 1030 is used to connect an input/output module (not shown) for inputting and outputting information. The input/output module may be configured as a component within the device or may be external to the device to provide corresponding functionality. Input devices may include a keyboard, mouse, touch screen, microphone, and various types of sensors; output devices may include a display, speaker, vibrator, and indicator lights.
Communication interface 1040 is used to connect communication modules (not shown) to enable communication interactions of the present device with other devices. The communication module may implement communication through a wired manner (such as USB, network cable, etc.), or may implement communication through a wireless manner (such as mobile network, WIFI, bluetooth, etc.).
Bus 1050 includes a path for transferring information between components of the device (e.g., processor 1010, memory 1020, input/output interface 1030, and communication interface 1040).
It should be noted that although the above-described device only shows processor 1010, memory 1020, input/output interface 1030, communication interface 1040, and bus 1050, in an implementation, the device may include other components necessary to achieve proper operation. Furthermore, it will be understood by those skilled in the art that the above-described apparatus may include only the components necessary to implement the embodiments of the present description, and not all the components shown in the drawings.
In one specific implementation, the processor 1010 is configured to execute a program for speech separation stored in the memory 1020 to implement the speech separation method of the above embodiment.
The present invention also provides a storage medium storing one or more programs which when executed implement the speech separation method of the above embodiments.
The computer readable media of the present embodiments include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
Those of ordinary skill in the art will appreciate that the discussion of any of the embodiments above is merely exemplary and is not intended to suggest that the scope of the disclosure, including the claims, is limited to these examples. The technical features of the above embodiments, or of different embodiments, may also be combined within the idea of the invention; the steps may be implemented in any order; and there are many other variations of the different aspects of the invention as described above, which are not provided in detail for the sake of brevity.
Additionally, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown within the provided figures, in order to simplify the illustration and discussion, and so as not to obscure the invention. Furthermore, the devices may be shown in block diagram form in order to avoid obscuring the invention, and also in view of the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the present invention is to be implemented (i.e., such specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the invention, it should be apparent to one skilled in the art that the invention can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative in nature and not as restrictive.
While the invention has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of those embodiments will be apparent to those skilled in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the embodiments discussed.
The present invention is not limited to the above embodiments, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the present invention, and these modifications and substitutions are intended to be included in the scope of the present invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.
Claims (10)
1. A method of speech separation comprising:
performing Fourier transform on a time domain mixed voice signal received by a microphone array to obtain a time-frequency domain mixed voice signal;
separating the mixed voice signals of the time-frequency domain to obtain a separation signal of a first channel and a separation signal of a second channel;
respectively carrying out short-time inverse Fourier transform on the separated signals of the first channel and the separated signals of the second channel to obtain a time domain signal of the first channel and a time domain signal of the second channel;
selecting, in descending order of signal energy, two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the time domain signal of the first channel and taking their mode to obtain azimuth estimation information of the first channel, and selecting two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the time domain signal of the second channel and taking their mode to obtain azimuth estimation information of the second channel;
calculating pitch angle deviation of the first channel and azimuth angle deviation of the first channel according to the azimuth estimation information of the first channel, and calculating pitch angle deviation of the second channel and azimuth angle deviation of the second channel according to the azimuth estimation information of the second channel;
if the pitch angle deviation of the first channel is not greater than the pitch angle deviation of the second channel and/or the azimuth angle deviation of the first channel is not greater than the azimuth angle deviation of the second channel, determining that the first channel is the voice information of the first target sound source and the second channel is the voice information of the second target sound source;
if the pitch angle deviation of the first channel is larger than the pitch angle deviation of the second channel, and the azimuth angle deviation of the first channel is larger than the azimuth angle deviation of the second channel, determining that the first channel is the voice information of the second target sound source, and the second channel is the voice information of the first target sound source.
2. The method according to claim 1, wherein before the step of performing short-time inverse Fourier transform on the separation signal of the first channel and the separation signal of the second channel respectively to obtain a time domain signal of the first channel and a time domain signal of the second channel, the method further comprises:
processing the separation signal of the first channel and the separation signal of the second channel through an adaptive filtering algorithm to obtain a primary noise reduction signal of the first channel;
performing energy comparison between the primary noise reduction signal of the first channel and the time domain mixed voice signal, and processing the higher-energy signal and the time domain mixed voice signal through an adaptive filtering algorithm and a nonlinear noise reduction algorithm to obtain a primary noise reduction signal of the second channel;
correspondingly, performing short-time inverse fourier transform on the separation signal of the first channel and the separation signal of the second channel respectively to obtain a time domain signal of the first channel and a time domain signal of the second channel, including:
and respectively carrying out short-time inverse Fourier transform on the primary noise reduction signal of the first channel and the primary noise reduction signal of the second channel to obtain a time domain signal of the first channel and a time domain signal of the second channel.
3. The method of claim 2, wherein before performing short-time inverse Fourier transform on the primary noise reduction signal of the first channel and the primary noise reduction signal of the second channel respectively to obtain a time domain signal of the first channel and a time domain signal of the second channel, the method further comprises:
the primary noise reduction signal of the first channel and the primary noise reduction signal of the second channel are respectively subjected to single-channel noise reduction to eliminate background noise, so that a final noise reduction signal of the first channel and a final noise reduction signal of the second channel are obtained;
correspondingly, performing short-time inverse fourier transform on the primary noise reduction signal of the first channel and the primary noise reduction signal of the second channel respectively to obtain a time domain signal of the first channel and a time domain signal of the second channel, including:
and respectively carrying out short-time inverse Fourier transform on the final noise reduction signal of the first channel and the final noise reduction signal of the second channel to obtain a time domain signal of the first channel and a time domain signal of the second channel.
4. The voice separation method of claim 2, further comprising:
and updating the weight of a filter corresponding to the adaptive filtering algorithm when the pitch angle deviation is larger than the angle deviation threshold of the pitch angle or the azimuth angle deviation is larger than the angle deviation threshold of the azimuth angle.
5. The voice separation method of claim 4, further comprising:
and when the pitch angle deviation is smaller than or equal to the angle deviation threshold value, and the azimuth angle deviation is smaller than or equal to the angle deviation threshold value of the azimuth angle, maintaining the weight of the filter corresponding to the adaptive filtering algorithm unchanged.
6. The method of claim 2, wherein the adaptive filtering algorithm is any one of a least mean squares algorithm LMS, an NLMS algorithm, and a recursive least squares algorithm RLS.
7. A speech separation device, comprising:
the first transformation module is used for carrying out Fourier transformation on the time domain mixed voice signal received by the microphone array to obtain a time-frequency domain mixed voice signal;
the separation module is used for separating the time-frequency domain mixed voice signals to obtain a separation signal of a first channel and a separation signal of a second channel;
the second transformation module is used for respectively carrying out short-time inverse Fourier transformation on the separation signal of the first channel and the separation signal of the second channel to obtain a time domain signal of the first channel and a time domain signal of the second channel;
the azimuth estimation module is used for selecting, in descending order of signal energy, two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the time domain signal of the first channel and taking their mode to obtain azimuth estimation information of the first channel, and selecting two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the time domain signal of the second channel and taking their mode to obtain azimuth estimation information of the second channel;
the deviation estimation module is used for calculating the pitch angle deviation of the first channel and the azimuth angle deviation of the first channel according to the azimuth estimation information of the first channel, and calculating the pitch angle deviation of the second channel and the azimuth angle deviation of the second channel according to the azimuth estimation information of the second channel;
the determining module is used for determining that the first channel is the voice information of the first target sound source and the second channel is the voice information of the second target sound source if the pitch angle deviation of the first channel is not greater than the pitch angle deviation of the second channel and/or the azimuth angle deviation of the first channel is not greater than the azimuth angle deviation of the second channel; if the pitch angle deviation of the first channel is larger than the pitch angle deviation of the second channel, and the azimuth angle deviation of the first channel is larger than the azimuth angle deviation of the second channel, determining that the first channel is the voice information of the second target sound source, and the second channel is the voice information of the first target sound source.
8. The speech separation device of claim 7 wherein the separation module is further configured to:
processing the separation signal of the first channel and the separation signal of the second channel through an adaptive filtering algorithm to obtain a primary noise reduction signal of the first channel;
performing energy comparison between the primary noise reduction signal of the first channel and the time domain mixed voice signal, and processing the higher-energy signal and the time domain mixed voice signal through an adaptive filtering algorithm and a nonlinear noise reduction algorithm to obtain a primary noise reduction signal of the second channel;
correspondingly, the second transformation module is further configured to perform short-time inverse fourier transform on the primary noise reduction signal of the first channel and the primary noise reduction signal of the second channel, so as to obtain a time domain signal of the first channel and a time domain signal of the second channel.
9. A speech separation apparatus, comprising: a processor and a memory;
the processor is configured to execute a program of the speech separation method stored in the memory to implement the speech separation method of any one of claims 1 to 6.
10. A storage medium storing one or more programs which when executed implement the speech separation method of any of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111040658.1A CN113782047B (en) | 2021-09-06 | 2021-09-06 | Voice separation method, device, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113782047A CN113782047A (en) | 2021-12-10 |
CN113782047B true CN113782047B (en) | 2024-03-08 |
Family
ID=78841275
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111040658.1A Active CN113782047B (en) | 2021-09-06 | 2021-09-06 | Voice separation method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113782047B (en) |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103308889A (en) * | 2013-05-13 | 2013-09-18 | 辽宁工业大学 | Passive sound source two-dimensional DOA (direction of arrival) estimation method under complex environment |
CN106373589A (en) * | 2016-09-14 | 2017-02-01 | 东南大学 | Binaural mixed voice separation method based on iteration structure |
CN106847301A (en) * | 2017-01-03 | 2017-06-13 | 东南大学 | A binaural speech separation method based on compressed sensing and attitude information |
CN107346664A (en) * | 2017-06-22 | 2017-11-14 | 河海大学常州校区 | A binaural speech separation method based on critical bands |
KR20180079975A (en) * | 2017-01-03 | 2018-07-11 | 한국전자통신연구원 | Sound source separation method using spatial position of the sound source and non-negative matrix factorization and apparatus performing the method |
WO2020042708A1 (en) * | 2018-08-31 | 2020-03-05 | 大象声科(深圳)科技有限公司 | Time-frequency masking and deep neural network-based sound source direction estimation method |
CN110931036A (en) * | 2019-12-07 | 2020-03-27 | 杭州国芯科技股份有限公司 | Microphone array beam forming method |
CN113053406A (en) * | 2021-05-08 | 2021-06-29 | 北京小米移动软件有限公司 | Sound signal identification method and device |
CN113050035A (en) * | 2021-03-12 | 2021-06-29 | 云知声智能科技股份有限公司 | Two-dimensional directional pickup method and device |
US11064294B1 (en) * | 2020-01-10 | 2021-07-13 | Synaptics Incorporated | Multiple-source tracking and voice activity detections for planar microphone arrays |
CN113225441A (en) * | 2021-07-09 | 2021-08-06 | 北京中电慧声科技有限公司 | Conference telephone system |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP6030032B2 (en) * | 2013-08-30 | 2016-11-24 | 本田技研工業株式会社 | Sound processing apparatus, sound processing method, and sound processing program |
KR20170101629A (en) * | 2016-02-29 | 2017-09-06 | 한국전자통신연구원 | Apparatus and method for providing multilingual audio service based on stereo audio signal |
KR102617476B1 (en) * | 2016-02-29 | 2023-12-26 | 한국전자통신연구원 | Apparatus and method for synthesizing separated sound source |
Non-Patent Citations (2)
Title |
---|
Research on near-field high-resolution localization and identification of underwater noise sources based on vector arrays; Shi Jie; China Doctoral Dissertations Full-text Database, Engineering Science & Technology II; 2011-02-15; C028-12 * |
Research on speech enhancement and separation methods based on microphone arrays; Li Wanlong; China Master's Theses Full-text Database, Information Science & Technology; 2009-01-15; I136-92 * |
Also Published As
Publication number | Publication date |
---|---|
CN113782047A (en) | 2021-12-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP7434137B2 (en) | Speech recognition method, device, equipment and computer readable storage medium | |
CN109074816B (en) | Far field automatic speech recognition preprocessing | |
US10382849B2 (en) | Spatial audio processing apparatus | |
CN108242234B (en) | Speech recognition model generation method, speech recognition model generation device, storage medium, and electronic device | |
CN102625946B (en) | Systems, methods, apparatus, and computer-readable media for dereverberation of multichannel signal | |
CN103426435B (en) | Source separation by independent component analysis with moving constraint | |
EP3526979B1 (en) | Method and apparatus for output signal equalization between microphones | |
CN108922553B (en) | Direction-of-arrival estimation method and system for sound box equipment | |
JP2014085673A (en) | Method for intelligently controlling volume of electronic equipment, and mounting equipment | |
US20200342891A1 (en) | Systems and methods for audio signal processing using spectral-spatial mask estimation | |
CN111031463B (en) | Microphone array performance evaluation method, device, equipment and medium | |
CN112492207B (en) | Method and device for controlling camera to rotate based on sound source positioning | |
US20180172502A1 (en) | Estimation of reverberant energy component from active audio source | |
CN113053365B (en) | Voice separation method, device, equipment and storage medium | |
CN112951263B (en) | Speech enhancement method, apparatus, device and storage medium | |
CN113470685B (en) | Training method and device for voice enhancement model and voice enhancement method and device | |
CN110890099B (en) | Sound signal processing method, device and storage medium | |
CN113782047B (en) | Voice separation method, device, equipment and storage medium | |
GB2510650A (en) | Sound source separation based on a Binary Activation model | |
CN107919136B (en) | Digital voice sampling frequency estimation method based on Gaussian mixture model | |
JP6343771B2 (en) | Head related transfer function modeling apparatus, method and program thereof | |
US20230116052A1 (en) | Array geometry agnostic multi-channel personalized speech enhancement | |
CN116106826A (en) | Sound source positioning method, related device and medium | |
CN113555031A (en) | Training method and device of voice enhancement model and voice enhancement method and device | |
CN109378012B (en) | Noise reduction method and system for recording audio by single-channel voice equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||