CN113421582B - Microphone voice enhancement method and device, terminal and storage medium - Google Patents

Microphone voice enhancement method and device, terminal and storage medium

Info

Publication number
CN113421582B
Authority
CN
China
Prior art keywords
signal
target
sub
target signal
filter coefficient
Prior art date
Legal status
Active
Application number
CN202110687473.3A
Other languages
Chinese (zh)
Other versions
CN113421582A (en)
Inventor
纪伟
罗本彪
潘思伟
董斐
Current Assignee
Spreadtrum Communications Tianjin Co Ltd
Original Assignee
Spreadtrum Communications Tianjin Co Ltd
Priority date
Filing date
Publication date
Application filed by Spreadtrum Communications Tianjin Co Ltd filed Critical Spreadtrum Communications Tianjin Co Ltd
Priority to CN202110687473.3A
Publication of CN113421582A
Application granted
Publication of CN113421582B
Active legal status
Anticipated expiration

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L21/0224: Processing in the time domain
    • G10L21/0232: Processing in the frequency domain
    • G10L2021/02161: Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02165: Two microphones, one receiving mainly the noise signal and the other one mainly the speech signal

Abstract

The application provides a microphone speech enhancement method and apparatus, a terminal, and a storage medium, which can acquire a first target signal and a second target signal; acquire a first filter coefficient and a second filter coefficient; perform spatial filtering on the first target signal and the second target signal based on the first filter coefficient and the second filter coefficient, and determine a third target signal and a fourth target signal based on the result of the spatial filtering; perform noise reduction filtering processing on the third target signal and the fourth target signal to obtain an enhanced speech signal; and perform an inverse transform from the frequency domain to the time domain on the enhanced speech signal to obtain a corresponding target time domain signal, synthesizing the target time domain signal into the target speech signal. By weighting the signals, the speech components contained in the two processed output signals have a sufficiently large energy difference; the two signals are then further processed, so that the final output signal achieves better noise suppression and speech restoration effects.

Description

Microphone voice enhancement method and device, terminal and storage medium
Technical Field
The present invention relates to the field of communications technologies, and in particular, to a microphone speech enhancement method and apparatus, a terminal, and a storage medium.
Background
The dual-microphone configuration is now widely applied in voice communication terminals such as smartphones. Compared with a single-microphone configuration, it can exploit the speech energy difference between the two microphones during handheld calls to perform noise suppression, achieving better noise suppression and speech quality restoration. However, during hands-free calls the speech energies at the two microphones are close or even nearly equal, and a dual-microphone noise reduction method based on the energy difference cannot achieve a good speech enhancement effect.
Disclosure of Invention
The embodiments of the present application provide a microphone speech enhancement method and apparatus, a terminal, and a storage medium. The signals collected by the two microphones are weighted based on spatial filtering, so that the speech components of the two output signals have a sufficiently large energy difference. The two signals are then processed by an energy-difference-based dual-microphone noise suppression module, and the final output signal can achieve better noise suppression and speech restoration effects.
In a first aspect, an embodiment of the present application provides a microphone speech enhancement method, where the method includes: acquiring a first target signal and a second target signal, wherein the first target signal and the second target signal are frequency domain signals obtained by signal processing of noise-containing voice signals received by a first microphone and a second microphone respectively; acquiring a first filter coefficient and a second filter coefficient, wherein the first filter coefficient and the second filter coefficient are respectively filter coefficients matched with the current receiving state information of the first microphone and the second microphone; spatially filtering the first target signal and the second target signal based on the first filter coefficient and the second filter coefficient, and determining a third target signal and a fourth target signal based on a result of the spatial filtering; carrying out noise reduction filtering processing on the third target signal and the fourth target signal to obtain an enhanced voice signal; and performing inverse transformation from the frequency domain to the time domain on the enhanced voice signal to obtain a corresponding target time domain signal, and synthesizing the target time domain signal into a target voice signal.
Further, the acquiring the first target signal and the second target signal includes: framing a first noisy speech signal received by a first microphone and a second noisy speech signal received by a second microphone respectively, so as to divide the first noisy speech signal and the second noisy speech signal into m frames of time domain sub-signals respectively; respectively carrying out time domain-to-frequency domain conversion operation on each frame of time domain sub-signals in the m frames of time domain sub-signals to correspondingly obtain the first target signal and the second target signal; wherein the first target signal comprises m frames of frequency domain sub-signals and the second target signal comprises m frames of frequency domain sub-signals.
Further, the obtaining the first filter coefficient and the second filter coefficient includes: determining a first filter coefficient according to the first receiving state information, and determining a second filter coefficient according to the second receiving state information; the first receiving state information is receiving state information of the first microphone, the second receiving state information is receiving state information of the second microphone, and the receiving state information includes a distance between the first microphone and the second microphone, a voice arrival direction angle and a sampling frequency.
Further, the spatially filtering the first target signal and the second target signal based on the first filter coefficient and the second filter coefficient, and determining a third target signal and a fourth target signal based on a result of the spatial filtering includes: performing spatial filtering on each frame of frequency domain sub-signals in the first target signal and the second target signal based on the first filter coefficient and the second filter coefficient, and determining a third target signal and a fourth target signal based on results of a plurality of rounds of the spatial filtering; wherein spatially filtering one of the same frame frequency domain sub-signals of the first target signal and the second target signal comprises: performing voice enhancement processing on the l-th frame frequency domain sub-signal in the first target signal according to the first filter coefficient to obtain a first factor, performing voice enhancement processing on the l-th frame frequency domain sub-signal in the second target signal according to the second filter coefficient to obtain a second factor, determining a third factor according to the difference between the first factor and the second factor, determining a fourth factor according to the sum of the first factor and the second factor, performing voice enhancement on the third factor according to a first adaptive filter coefficient to obtain a fifth factor, determining a third target sub-signal of the current round of spatial filtering according to the third factor, and determining a fourth target sub-signal of the current round of spatial filtering according to the difference between the fourth factor and the fifth factor; wherein l takes values in (1, 2, 3, …, m), the third target signal includes the third target sub-signal determined by each round of spatial filtering, and the fourth target signal includes the fourth target sub-signal determined by each round of spatial filtering.
Further, the determining a third target sub-signal of the current round of spatial filtering according to the third factor includes determining the third target sub-signal according to the following formula:

U(k) = α₁(k)Y₁(k) - α₂(k)Y₂(k)

wherein U(k) represents the third target sub-signal of the current round of spatial filtering, α₁(k) represents the first filter coefficient, α₂(k) represents the second filter coefficient, Y₁(k) represents the l-th frame frequency domain sub-signal in the first target signal corresponding to the current round of spatial filtering, and Y₂(k) represents the l-th frame frequency domain sub-signal in the second target signal corresponding to the current round of spatial filtering;

determining a fourth target sub-signal of the current round of spatial filtering according to the difference of the fourth factor and the fifth factor includes determining the fourth target sub-signal according to the following formula:

V(k) = α₁(k)Y₁(k) + α₂(k)Y₂(k) - H₀(k)U(k)

wherein V(k) denotes the fourth target sub-signal of the current round of spatial filtering, and H₀(k) represents the first adaptive filter coefficient.
Further, the performing noise reduction filtering processing on the third target signal and the fourth target signal to obtain an enhanced voice signal includes: respectively performing multiple rounds of noise reduction filtering processing on each frame of target sub-signals with the same frame number in the third target signal and the fourth target signal to obtain the enhanced voice signal; the process of performing one round of noise reduction filtering processing on the l-th frame target sub-signals in the third target signal and the fourth target signal includes: performing crosstalk elimination processing on the third target sub-signal of the l-th frame in the third target signal through a second adaptive filter coefficient, so as to determine a reference voice sub-signal of the current round of noise reduction filtering processing based on the crosstalk elimination result; performing crosstalk elimination processing on the fourth target sub-signal of the l-th frame in the fourth target signal through a third adaptive filter coefficient, so as to determine a reference noise sub-signal of the current round of noise reduction filtering processing based on the crosstalk elimination result; determining the gain of the current round of noise reduction filtering processing according to the reference voice sub-signal and the reference noise sub-signal of the current round; and determining an enhanced voice sub-signal of the current round of noise reduction filtering processing according to the gain and the reference voice sub-signal of the current round; wherein the enhanced voice signal comprises the enhanced voice sub-signal determined by each round of noise reduction filtering processing.
Further, the inverse transform from the frequency domain to the time domain of the enhanced speech signal to obtain a corresponding target time domain signal, and synthesizing the target time domain signal into a target speech signal includes: and respectively carrying out inverse transformation from the frequency domain to the time domain on each enhanced voice sub-signal to obtain a corresponding number of target time domain sub-signals, and carrying out overlapping and addition on the corresponding number of target time domain sub-signals to synthesize the target voice signal.
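The overlap-add synthesis described in this claim can be sketched as follows. This is an illustrative reconstruction, not code from the patent; the frame length and hop size are assumed values.

```python
import numpy as np

def overlap_add(frames_freq, frame_len=256, hop=128):
    """Inverse-transform each enhanced frequency domain sub-signal and
    overlap-add the resulting target time domain sub-signals into one
    target speech signal. frame_len and hop are illustrative choices."""
    out = np.zeros(hop * (len(frames_freq) - 1) + frame_len)
    for l, F in enumerate(frames_freq):
        # Inverse FFT gives the l-th target time domain sub-signal; the
        # overlapping segments of adjacent frames are summed.
        out[l * hop : l * hop + frame_len] += np.fft.irfft(F, n=frame_len)
    return out
```

With a matching analysis stage, a synthesis window would normally be applied before summation; it is omitted here to keep the sketch minimal.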
In a second aspect, an embodiment of the present application further provides a microphone speech enhancement apparatus, where the apparatus includes: a spatial filtering module to perform the following operations: acquiring a first target signal and a second target signal, wherein the first target signal and the second target signal are frequency domain signals obtained by signal processing of noise-containing voice signals received by a first microphone and a second microphone respectively; acquiring a first filter coefficient and a second filter coefficient, wherein the first filter coefficient and the second filter coefficient are respectively filter coefficients matched with the current receiving state information of the first microphone and the second microphone; and spatially filtering the first target signal and the second target signal based on the first filter coefficient and the second filter coefficient, and determining a third target signal and a fourth target signal based on a result of the spatial filtering; the microphone noise reduction module is used for carrying out noise reduction filtering processing on the third target signal and the fourth target signal to obtain an enhanced voice signal; and the signal synthesis module is used for carrying out inverse transformation from frequency domain to time domain on the enhanced voice signal to obtain a corresponding target time domain signal and synthesizing the target time domain signal into a target voice signal.
In a third aspect, an embodiment of the present application further provides a microphone speech enhancement apparatus, where the apparatus includes: a processor and a memory for storing at least one instruction which is loaded and executed by the processor to implement the method of microphone speech enhancement provided by the first aspect.
In one embodiment, the microphone voice enhancement device provided by the second aspect may be a chip.
In a fourth aspect, a further embodiment of the present application further provides a chip, where the chip is connected to a memory, or the chip is integrated with a memory (for example, the microphone speech enhancement apparatus provided in the third aspect), and when a program or an instruction stored in the memory is executed, the microphone speech enhancement method provided in the first aspect is implemented.
In a fifth aspect, an embodiment of the present application further provides a terminal, where the terminal may include a terminal body and the microphone voice enhancement device provided in the third aspect.
In a sixth aspect, a further embodiment of the present application further provides a terminal, where the terminal may include a terminal body and the chip provided in the fourth aspect.
In a seventh aspect, an embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the microphone speech enhancement method provided in the first aspect.
By the technical scheme, the noise-containing voice signals received by the first microphone and the second microphone can be processed to obtain frequency domain signals, the first target signal and the second target signal are spatially filtered through the obtained first filter coefficient and the obtained second filter coefficient, the third target signal and the fourth target signal are determined based on the spatial filtering result, the noise reduction filtering processing is performed on the third target signal and the fourth target signal to obtain enhanced voice signals, the enhanced voice signals are subjected to inverse transformation from the frequency domain to the time domain to obtain corresponding target time domain signals, and the target time domain signals are synthesized into target voice signals. The signals collected by the two microphones are weighted based on spatial filtering, and the voice components in the two paths of output signals obtained after processing have enough energy difference. And the two paths of signals are processed by using a double-microphone noise suppression module based on energy difference, and finally the obtained output signal can have better noise suppression and voice restoration effects.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a flowchart of a microphone speech enhancement method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart diagram illustrating a spatial filtering operation according to yet another embodiment of the present application;
FIG. 3 is a schematic flow chart illustrating a dual-microphone denoising filtering operation according to yet another embodiment of the present application;
fig. 4 is a schematic structural diagram of a microphone speech enhancement device according to still another embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.
When people use terminal equipment for voice communication or voice recording in daily life, clean speech is often polluted by surrounding environmental noise, which seriously degrades the quality of the speech received by the device. Terminal equipment with an integrated noise suppression module can reduce the noise level in the speech and improve its quality. Noise suppression performance has gradually improved from the early single-microphone configuration to the now widely used dual-microphone configuration.
The mainstream dual-microphone noise suppression methods are implemented based on the difference in speech energy received by the two microphones, which usually needs to exceed about 6 dB. For example, during a handheld mobile-phone call, the energy difference between the main microphone close to the mouth and the secondary microphone far from the mouth meets the requirement of dual-microphone noise suppression. However, during a hands-free call in, for example, an in-vehicle scenario, the energy difference between the main and secondary microphones is small or even zero, and the traditional dual-microphone noise suppression method cannot work properly. One workaround is to switch to single-microphone noise suppression, which yields a less than ideal speech enhancement effect; moreover, this obviously uses the signal of only one microphone, so the advantages of the dual-microphone configuration are lost.
In order to overcome the foregoing technical problems, an embodiment of the present application provides a microphone speech enhancement method, where signals collected by two microphones are weighted based on spatial filtering, and speech components contained in two output signals obtained after processing have a sufficiently large energy difference. And then the two paths of signals are processed by using a double-microphone noise suppression module based on energy difference, and finally the obtained output signal can have better noise suppression and voice restoration effects.
Fig. 1 is a flowchart of a microphone speech enhancement method according to an embodiment of the present application, and as shown in fig. 1, the microphone speech enhancement method includes the following steps:
step 101: a first target signal and a second target signal are acquired.
Step 102: and acquiring a first filter coefficient and a second filter coefficient, performing spatial filtering on the first target signal and the second target signal based on the first filter coefficient and the second filter coefficient, and determining a third target signal and a fourth target signal based on a spatial filtering result.
Step 103: and carrying out noise reduction filtering processing on the third target signal and the fourth target signal to obtain an enhanced voice signal.
Step 104: and carrying out inverse transformation from a frequency domain to a time domain on the enhanced voice signal to obtain a corresponding target time domain signal, and synthesizing the target time domain signal into a target voice signal.
In the specific implementation of step 101, the noisy speech signals received by the two microphones (the first microphone and the second microphone) are the input signal y₁(n) (i.e., the first noisy speech signal) and the input signal y₂(n) (i.e., the second noisy speech signal), shown as signal 10-1 and signal 10-2 in fig. 1, and the speech energies of y₁(n) and y₂(n) are close. Specifically, the input signals may be represented as yᵢ(n) = sᵢ(n) + nᵢ(n), where i = 1, 2, yᵢ(n) represents the speech signal contaminated by noise (the noisy speech signal mentioned above), sᵢ(n) represents the clean speech signal, and nᵢ(n) represents the noise interference signal. Further, the input signals y₁(n) and y₂(n) may be framed and transformed from the time domain to the frequency domain. Specifically, each input signal may be divided into m frames, where each frame of the framed time domain signal may be represented as yᵢ(l, n); each frame yᵢ(l, n) is then transformed to the frequency domain to obtain a corresponding number (m frames) of frequency domain sub-signals Yᵢ(l, k), where l is the frame number, n is the time domain sample index, and k is the frequency bin index. This yields the first target signal 11-1 and the second target signal 11-2, each of which includes m frames of frequency domain sub-signals and which serve as the input signals of the spatial filtering module.
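A minimal sketch of this framing and time-to-frequency transform is given below; the frame length, hop size and Hann window are illustrative assumptions, not values specified by the patent.

```python
import numpy as np

def frame_and_transform(y, frame_len=256, hop=128):
    """Split a time domain signal y(n) into m overlapping frames y(l, n)
    and transform each frame to the frequency domain, yielding the
    per-frame sub-signals Y(l, k): rows are frame number l, columns
    are frequency bin k."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(y) - frame_len) // hop
    frames = np.stack([y[l * hop : l * hop + frame_len] * window
                       for l in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

# Two synthetic noisy microphone signals standing in for y1(n), y2(n).
rng = np.random.default_rng(0)
y1 = rng.standard_normal(2048)
y2 = rng.standard_normal(2048)
Y1 = frame_and_transform(y1)  # first target signal (m frames)
Y2 = frame_and_transform(y2)  # second target signal (m frames)
```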
In a specific implementation of step 102, the first filter coefficient α₁(k) may be determined according to the reception state information of the first microphone (the first reception state information), and the second filter coefficient α₂(k) may be determined according to the reception state information of the second microphone (the second reception state information). The reception state information of each microphone may include the distance d between the first microphone and the second microphone, the speech direction-of-arrival angle φ, and the sampling frequency f_k.
In one embodiment, the first filter coefficient α₁(k) and the second filter coefficient α₂(k) are calculated as follows:

[equation image for α₁(k) in the original publication]

[equation image for α₂(k) in the original publication]

wherein θ_k is calculated as follows:

[equation image for θ_k in the original publication]
where c is the speed of sound propagation in air.
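Since the α₁(k), α₂(k) and θ_k formulas are preserved only as images in this text, they are not reproduced here. The sketch below computes θ_k using the standard inter-microphone phase delay 2π f_k d cos(φ) / c, which is an assumption consistent with the stated parameters (spacing d, arrival angle φ, bin frequency f_k, and the speed of sound c), not a formula taken from the patent.

```python
import numpy as np

def theta_k(f_k, d=0.1, phi=np.pi / 2, c=343.0):
    """Assumed phase-delay term theta_k = 2*pi*f_k*d*cos(phi)/c.

    f_k : frequency of bin k in Hz
    d   : microphone spacing in metres (illustrative default)
    phi : speech direction-of-arrival angle in radians
    c   : speed of sound propagation in air, m/s
    """
    return 2.0 * np.pi * f_k * d * np.cos(phi) / c
```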
As shown in fig. 2, after the first filter coefficient α₁(k) (21-1) and the second filter coefficient α₂(k) (21-2) are obtained, each frame frequency domain sub-signal of the first target signal 11-1 and the second target signal 11-2 is spatially filtered based on the first filter coefficient α₁(k) (21-1) and the second filter coefficient α₂(k) (21-2), and the third target signal and the fourth target signal are determined based on the results of a plurality of rounds of the spatial filtering.
Wherein the process of spatially filtering one of the same frame frequency domain sub-signals of the first target signal 11-1 and the second target signal 11-2 comprises:
according to the first filter coefficient alpha 1 (k) 21-1 to the ith frame frequency domain in the first target signal 11-1Signal Y 1 (k) Performing speech enhancement processing to obtain a first factor according to the second filter coefficient alpha 2 (k) 21-2 pair of the l-th frame frequency domain subsignal Y in the second target signal 11-2 2 (k) Performing voice enhancement processing to obtain a second factor, determining a third factor according to the difference between the first factor and the second factor, determining a fourth factor according to the sum of the first factor and the second factor, performing voice enhancement on the third factor according to a first adaptive filter coefficient to obtain a fifth factor, determining a third target sub-signal of the current round of spatial filtering according to the third factor, and determining a fourth target sub-signal of the current round of spatial filtering according to the difference between the fourth factor and the fifth factor;
wherein l takes values in (1, 2, 3, …, m), the third target signal 12-1 includes the third target sub-signal determined by each round of spatial filtering, and the fourth target signal 12-2 includes the fourth target sub-signal determined by each round of spatial filtering.
Specifically, the third target sub-signal of the current round of spatial filtering may be calculated by the following formula:

U(k) = α₁(k)Y₁(k) - α₂(k)Y₂(k)

wherein U(k) denotes the third target sub-signal of the current round of spatial filtering, α₁(k) represents the first filter coefficient, α₂(k) represents the second filter coefficient, Y₁(k) represents the l-th frame frequency domain sub-signal in the first target signal corresponding to the current round of spatial filtering, and Y₂(k) represents the l-th frame frequency domain sub-signal in the second target signal corresponding to the current round of spatial filtering.

The fourth target sub-signal of the current round of spatial filtering may be calculated by the following formula:

V(k) = α₁(k)Y₁(k) + α₂(k)Y₂(k) - H₀(k)U(k)

wherein V(k) denotes the fourth target sub-signal of the current round of spatial filtering, and H₀(k) represents the first adaptive filter coefficient of the adaptive filter 1. Through this speech enhancement operation, the energy difference between the output signals can be increased, and leakage of the speech signal from the output signal V(k) into the output signal U(k) can be reduced.
Furthermore, through multiple rounds of the above spatial filtering operation, spatial filtering is performed on each frame of frequency domain sub-signals in the first target signal 11-1 and the second target signal 11-2, so as to obtain the third target sub-signal and the fourth target sub-signal determined after each round of spatial filtering. The third target signal 12-1 comprises the third target sub-signal determined by each round of spatial filtering, and the fourth target signal 12-2 comprises the fourth target sub-signal determined by each round of spatial filtering.
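The two formulas above can be sketched per frame as plain array operations. This is a hedged illustration; the adaptation rule for H₀(k) is not specified in this excerpt and is therefore not shown.

```python
import numpy as np

def spatial_filter(Y1, Y2, alpha1, alpha2, H0):
    """One round of spatial filtering for a single frame.

    Y1, Y2  : l-th frame frequency domain sub-signals (arrays over bin k)
    alpha1,
    alpha2  : first and second filter coefficients alpha1(k), alpha2(k)
    H0      : first adaptive filter coefficients H0(k)
    Returns the third and fourth target sub-signals U(k), V(k)."""
    U = alpha1 * Y1 - alpha2 * Y2            # U(k) = a1(k)Y1(k) - a2(k)Y2(k)
    V = alpha1 * Y1 + alpha2 * Y2 - H0 * U   # V(k) = a1(k)Y1(k) + a2(k)Y2(k) - H0(k)U(k)
    return U, V
```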
In a specific implementation of step 103, multiple rounds of noise reduction filtering processing may be performed on each frame of target sub-signals with the same frame number in the third target signal 12-1 and the fourth target signal 12-2, respectively, to obtain the enhanced speech signal.
The process of performing one round of noise reduction filtering processing on the l-th frame target sub-signals in the third target signal 12-1 and the fourth target signal 12-2 includes:
and performing crosstalk elimination processing on the third target sub-signal of the l frame in the third target signal through a second adaptive filter coefficient so as to determine a reference voice sub-signal subjected to noise reduction filtering processing in the current round based on a crosstalk elimination processing result. And performing crosstalk elimination processing on the fourth target sub-signal of the ith frame in the fourth target signal through a third adaptive filter coefficient so as to determine a reference noise sub-signal of the noise reduction filtering processing of the current round based on a crosstalk elimination processing result. And determining the gain of the noise reduction filtering processing of the current round according to the reference voice signal of the noise reduction filtering processing of the current round and the reference noise signal of the noise reduction filtering processing of the current round. And determining the enhanced voice sub-signal of the noise reduction filtering processing according to the gain of the noise reduction filtering processing of the current round and the reference voice sub-signal of the noise reduction filtering processing of the current round. Wherein the enhanced speech signal 13 comprises the enhanced speech sub-signal determined for each round of noise reduction filtering.
Specifically, as shown in fig. 3, the third target signal 12-1 and the fourth target signal 12-2 are input to the noise reduction module and subjected to crosstalk elimination processing by the adaptive filter 2 (31-1) and the adaptive filter 3 (31-2), respectively, to obtain a reference speech signal 32-1 and a reference noise signal 32-2.
In one embodiment, the third target sub-signal of the l-th frame in the third target signal 12-1 and the fourth target sub-signal of the l-th frame in the fourth target signal 12-2 may be subjected to crosstalk elimination processing in the following manner:
E1(k) = U(k)·e^(−j2πkD/N) − H1(k)V(k)
where E1(k) represents the reference speech sub-signal obtained after crosstalk elimination of the l-th frame third target sub-signal, H1(k) represents the second adaptive filter coefficient, D represents the delay in samples, and N represents the total number of frequency-domain points.
E2(k) = V(k)·e^(−j2πkD/N) − H2(k)U(k)
where E2(k) represents the reference noise sub-signal obtained after crosstalk elimination of the l-th frame fourth target sub-signal, H2(k) represents the third adaptive filter coefficient, D represents the delay in samples, and N represents the total number of frequency-domain points.
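The two crosstalk elimination equations above can be sketched as a single numpy routine. The function name and the per-bin array layout are illustrative; the update rule for the adaptive coefficients H1 and H2 is not given in this excerpt and is therefore omitted.

```python
import numpy as np

def crosstalk_cancel(U, V, H1, H2, D):
    """Crosstalk elimination for one frame of third/fourth target
    sub-signals U, V (complex arrays over N frequency bins).

    E1(k) = U(k)e^{-j2*pi*k*D/N} - H1(k)V(k)  -> reference speech sub-signal
    E2(k) = V(k)e^{-j2*pi*k*D/N} - H2(k)U(k)  -> reference noise sub-signal
    """
    N = len(U)
    k = np.arange(N)
    delay = np.exp(-2j * np.pi * k * D / N)  # D-sample delay applied in frequency domain
    E1 = U * delay - H1 * V
    E2 = V * delay - H2 * U
    return E1, E2
```

With D = 0 and zero filter coefficients, E1 and E2 reduce to U and V unchanged, which is a convenient sanity check.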
Further, the enhanced speech sub-signal of the current round of noise reduction filtering may be determined from the reference speech sub-signal and the reference noise sub-signal obtained after crosstalk elimination of the l-th frame third target sub-signal and the l-th frame fourth target sub-signal; that is, the corresponding gain G(k) is determined based on the l-th frame reference speech sub-signal and the l-th frame reference noise sub-signal.
Then, through multiple rounds of the noise reduction filtering processing, crosstalk elimination is performed on the third target sub-signal and the fourth target sub-signal of every frame, and the corresponding gain G(k) is determined for each round.
Further, the corresponding enhanced speech sub-signal may be calculated based on the l-th frame reference speech sub-signal obtained from each round of noise reduction filtering and the gain G(k) obtained in the corresponding round. That is, each round of noise reduction filtering may determine an enhanced speech sub-signal (frame l):

[equation image BDA0003125280440000071: the l-th frame enhanced speech sub-signal expressed in terms of the gain G(k) and the reference speech sub-signal]

wherein

[equation image BDA0003125280440000072: the definition of the gain G(k)]

The enhanced speech signal 13 may be determined after multiple rounds of noise reduction filtering, and comprises the enhanced speech sub-signals determined in each round of noise reduction filtering.
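The exact form of G(k) survives only as an equation image in this record, so the sketch below assumes a Wiener-style ratio of reference speech power to total power purely for illustration; only the final step, scaling the reference speech sub-signal by G(k), is stated in the text. The function names and the regularization floor are assumptions.

```python
import numpy as np

def noise_reduction_gain(E1, E2, floor=1e-12):
    """Per-bin gain from reference speech E1 and reference noise E2.
    A Wiener-style power ratio is assumed here; the patent's actual
    G(k) is given only as an equation image."""
    S = np.abs(E1) ** 2
    Nn = np.abs(E2) ** 2
    return S / (S + Nn + floor)  # close to 1 where speech dominates, 0 where noise dominates

def enhanced_sub_signal(E1, G):
    """Enhanced speech sub-signal of the current round: G(k) * E1(k)."""
    return G * E1
```

Bins dominated by the reference noise are attenuated toward zero, while speech-dominated bins pass nearly unchanged.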
In a specific implementation of step 104, the inverse frequency-domain-to-time-domain transform may be performed on each enhanced speech sub-signal in the enhanced speech signal 13 to obtain a corresponding number of target time-domain sub-signals, and the target time-domain sub-signals are overlapped and added to synthesize the target speech signal 14.
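The inverse transform and overlap-add synthesis of step 104 can be sketched as follows. The hop size and the use of a plain inverse FFT are assumptions; the excerpt only specifies an inverse frequency-to-time transform followed by overlap and addition.

```python
import numpy as np

def overlap_add(frames, hop):
    """Synthesize a time-domain signal from a list of enhanced
    frequency-domain sub-signals via inverse FFT plus overlap-add.
    `hop` is the frame advance in samples (e.g. N // 2 for 50%
    overlap); the patent does not fix this value."""
    time_frames = [np.fft.ifft(F).real for F in frames]  # per-frame inverse transform
    N = len(time_frames[0])
    out = np.zeros(hop * (len(time_frames) - 1) + N)
    for i, f in enumerate(time_frames):
        out[i * hop : i * hop + N] += f  # overlap and add into the output buffer
    return out
```

A single frame round-trips exactly: transforming a short signal and synthesizing it back recovers the original samples.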
Yet another embodiment of the present application further provides a microphone voice enhancement apparatus, including:
a spatial filtering module to perform the following operations:
acquiring a first target signal and a second target signal, wherein the first target signal and the second target signal are frequency domain signals obtained by signal processing of noise-containing voice signals received by a first microphone and a second microphone respectively;
acquiring a first filter coefficient and a second filter coefficient, wherein the first filter coefficient and the second filter coefficient are respectively filter coefficients matched with the current receiving state information of the first microphone and the second microphone; and
spatially filtering the first target signal and the second target signal based on the first filter coefficient and the second filter coefficient, and determining a third target signal and a fourth target signal based on a result of the spatial filtering;
the microphone noise reduction module is used for carrying out noise reduction filtering processing on the third target signal and the fourth target signal to obtain an enhanced voice signal;
and the signal synthesis module is used for carrying out inverse transformation from frequency domain to time domain on the enhanced voice signal to obtain a corresponding target time domain signal and synthesizing the target time domain signal into a target voice signal.
In an embodiment, the microphone speech enhancement apparatus may further include a signal processing module. The signal processing module may frame a first noisy speech signal received by the first microphone and a second noisy speech signal received by the second microphone, dividing each into m frames of time-domain sub-signals; and may perform a time-domain-to-frequency-domain transform on each of the m frames of time-domain sub-signals to obtain the first target signal and the second target signal correspondingly; wherein the first target signal comprises m frames of frequency-domain sub-signals and the second target signal comprises m frames of frequency-domain sub-signals.
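The signal processing module's framing and time-to-frequency transform can be sketched as follows. The analysis window and the frame overlap are assumptions; the text only specifies framing into m time-domain sub-signals followed by a time-to-frequency transform per frame.

```python
import numpy as np

def analyze(x, frame_len, hop):
    """Split a noisy time-domain signal x into m overlapping frames
    and transform each to the frequency domain. Returns an (m, frame_len)
    complex array: one row per frame of frequency-domain sub-signal."""
    win = np.hanning(frame_len)               # window choice is an assumption
    m = 1 + (len(x) - frame_len) // hop       # number of complete frames
    return np.stack([np.fft.fft(x[i * hop : i * hop + frame_len] * win)
                     for i in range(m)])
```

Each microphone's signal is analyzed the same way, producing the two m-frame target signals the spatial filtering step consumes.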
Fig. 4 is a schematic structural diagram of a microphone speech enhancement apparatus according to still another embodiment of the present application, and as shown in fig. 4, the microphone speech enhancement apparatus may include a processor 401 and a memory 402, where the memory 402 is used to store at least one instruction, and the instruction is loaded by the processor 401 and executed to implement the microphone speech enhancement method according to the embodiment shown in fig. 1. In one embodiment, the microphone voice enhancement device provided by the second aspect may be a chip.
Still another embodiment of the present application further provides a chip, where the chip is connected to a memory, or the chip is integrated with a memory (such as the microphone speech enhancement apparatus provided in the embodiment shown in fig. 4), and when a program or an instruction stored in the memory is executed, the microphone speech enhancement method provided in the embodiment shown in fig. 1 is implemented.
Still another embodiment of the present application further provides a terminal, where the terminal includes a terminal body and the microphone voice enhancement device provided in the embodiment shown in fig. 4.
Still another embodiment of the present application provides a terminal, which includes a terminal body and the above chip connectable to a memory.
Embodiments of the present application further provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the microphone speech enhancement method provided in the embodiment shown in fig. 1.
It should be noted that the terminal according to the embodiment of the present invention may include, but is not limited to, a Personal Computer (PC), a Personal Digital Assistant (PDA), a wireless handheld device, a Tablet Computer (Tablet Computer), a mobile phone, an MP3 player, an MP4 player, and the like.
It should be understood that the application may be an application program (native app) installed on the terminal, or may also be a web page program (webApp) of a browser on the terminal, which is not limited in this embodiment of the present invention.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one type of logical functional division, and other divisions may be realized in practice, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) or a Processor (Processor) to execute some steps of the methods according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and should not be taken as limiting the scope of the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and these modifications or substitutions do not depart from the spirit of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for microphone speech enhancement, the method comprising:
acquiring a first target signal and a second target signal, wherein the first target signal and the second target signal are frequency domain signals obtained by signal processing of noise-containing voice signals received by a first microphone and a second microphone respectively;
acquiring a first filter coefficient and a second filter coefficient, wherein the first filter coefficient and the second filter coefficient are respectively filter coefficients matched with the current receiving state information of the first microphone and the second microphone;
spatially filtering the first target signal and the second target signal based on the first filter coefficient and the second filter coefficient, and determining a third target signal and a fourth target signal based on a result of the spatial filtering;
carrying out noise reduction filtering processing on the third target signal and the fourth target signal to obtain an enhanced voice signal; and
carrying out inverse transformation from a frequency domain to a time domain on the enhanced voice signal to obtain a corresponding target time domain signal, and synthesizing the target time domain signal into a target voice signal;
wherein the spatially filtering the first target signal and the second target signal based on the first filter coefficient and the second filter coefficient, and determining a third target signal and a fourth target signal based on a result of the spatial filtering includes:
performing spatial filtering on each frame of frequency domain sub-signals in the first target signal and the second target signal based on the first filter coefficient and the second filter coefficient, and determining a third target signal and a fourth target signal based on the results of multiple rounds of the spatial filtering;
wherein the process of spatially filtering one of the same frame frequency domain sub-signals of the first target signal and the second target signal comprises:
performing voice enhancement processing on the l-th frame frequency domain sub-signal in the first target signal according to the first filter coefficient to obtain a first factor, performing voice enhancement processing on the l-th frame frequency domain sub-signal in the second target signal according to the second filter coefficient to obtain a second factor, determining a third factor according to the difference between the first factor and the second factor, determining a fourth factor according to the sum of the first factor and the second factor, performing voice enhancement on the third factor according to a first adaptive filter coefficient to obtain a fifth factor, determining a third target sub-signal of the current round of spatial filtering according to the third factor, and determining a fourth target sub-signal of the current round of spatial filtering according to the difference between the fourth factor and the fifth factor;
wherein l has a value range of (1, 2, 3, …, m), the third target signal includes the third target sub-signal determined by each round of spatial filtering, and the fourth target signal includes the fourth target sub-signal determined by each round of spatial filtering.
2. The method of claim 1, wherein the acquiring the first target signal and the second target signal comprises:
framing a first noisy speech signal received by a first microphone and a second noisy speech signal received by a second microphone respectively, so as to divide the first noisy speech signal and the second noisy speech signal into m frames of time domain sub-signals respectively;
respectively carrying out time domain-to-frequency domain conversion operation on each frame of time domain sub-signals in the m frames of time domain sub-signals to correspondingly obtain the first target signal and the second target signal;
wherein the first target signal comprises m frames of frequency domain sub-signals and the second target signal comprises m frames of frequency domain sub-signals.
3. The method of claim 1, wherein obtaining the first filter coefficient and the second filter coefficient comprises:
determining a first filter coefficient according to the first receiving state information, and determining a second filter coefficient according to the second receiving state information;
the first receiving state information is receiving state information of the first microphone, the second receiving state information is receiving state information of the second microphone, and the receiving state information includes a distance between the first microphone and the second microphone, a voice arrival direction angle and a sampling frequency.
4. The method of claim 1, wherein determining the third target sub-signal for the current round of spatial filtering according to the third factor comprises determining the third target sub-signal according to the following equation:
U(k) = α1(k)Y1(k) − α2(k)Y2(k)
wherein U(k) represents the third target sub-signal of the current round of spatial filtering, α1(k) represents the first filter coefficient, α2(k) represents the second filter coefficient, Y1(k) represents the l-th frame frequency-domain sub-signal in the first target signal corresponding to the current round of spatial filtering, and Y2(k) represents the l-th frame frequency-domain sub-signal in the second target signal corresponding to the current round of spatial filtering;
determining a fourth target sub-signal of the present round of spatial filtering according to a difference between the fourth factor and the fifth factor comprises determining the fourth target sub-signal according to the following formula:
V(k) = α1(k)Y1(k) + α2(k)Y2(k) − H0(k)U(k)
wherein V(k) represents the fourth target sub-signal of the current round of spatial filtering, and H0(k) represents the first adaptive filter coefficient.
5. The method of claim 1, wherein the performing noise reduction filtering processing on the third target signal and the fourth target signal to obtain an enhanced speech signal comprises:
respectively carrying out multi-round noise reduction filtering processing on each frame of target sub-signals with the same frame number in the third target signal and the fourth target signal to obtain enhanced voice signals;
wherein, the process of performing multiple rounds of noise reduction filtering processing on the l frame target sub-signal in the third target signal and the fourth target signal comprises:
performing crosstalk elimination processing on the l-th frame third target sub-signal in the third target signal through a second adaptive filter coefficient to determine a reference voice sub-signal of the current round of noise reduction filtering processing based on a crosstalk elimination processing result;
performing crosstalk elimination processing on the l-th frame fourth target sub-signal in the fourth target signal through a third adaptive filter coefficient to determine a reference noise sub-signal of the current round of noise reduction filtering processing based on a crosstalk elimination processing result;
determining the gain of the noise reduction filtering processing of the current round according to the reference voice sub-signal of the noise reduction filtering processing of the current round and the reference noise sub-signal of the noise reduction filtering processing of the current round; and
determining an enhanced voice sub-signal of the noise reduction filtering processing according to the gain of the noise reduction filtering processing of the current round and the reference voice sub-signal of the noise reduction filtering processing of the current round;
wherein the enhanced speech signal comprises the enhanced speech sub-signal determined for each round of noise reduction filtering processing.
6. The method according to claim 5, wherein said inverse frequency-domain to time-domain transforming the enhanced speech signal to obtain a corresponding target time-domain signal, and synthesizing the target time-domain signal into a target speech signal comprises:
and respectively carrying out inverse transformation from the frequency domain to the time domain on each enhanced voice sub-signal to obtain a corresponding number of target time domain sub-signals, and carrying out overlapping and addition on the corresponding number of target time domain sub-signals to synthesize the target voice signal.
7. An apparatus for microphone speech enhancement, the apparatus comprising:
a spatial filtering module to perform the following operations:
acquiring a first target signal and a second target signal, wherein the first target signal and the second target signal are frequency domain signals obtained by signal processing of noise-containing voice signals received by a first microphone and a second microphone respectively;
acquiring a first filter coefficient and a second filter coefficient, wherein the first filter coefficient and the second filter coefficient are respectively filter coefficients matched with the current receiving state information of the first microphone and the second microphone; and
spatially filtering the first target signal and the second target signal based on the first filter coefficient and the second filter coefficient, and determining a third target signal and a fourth target signal based on a result of the spatial filtering;
the microphone noise reduction module is used for carrying out noise reduction filtering processing on the third target signal and the fourth target signal to obtain an enhanced voice signal;
the signal synthesis module is used for carrying out inverse transformation from a frequency domain to a time domain on the enhanced voice signal to obtain a corresponding target time domain signal and synthesizing the target time domain signal into a target voice signal;
wherein the spatially filtering the first target signal and the second target signal based on the first filter coefficient and the second filter coefficient, and determining a third target signal and a fourth target signal based on a result of the spatial filtering comprises:
performing spatial filtering on each frame of frequency domain sub-signals in the first target signal and the second target signal based on the first filter coefficient and the second filter coefficient, and determining a third target signal and a fourth target signal based on results of a plurality of rounds of the spatial filtering;
wherein spatially filtering one of the same frame frequency domain sub-signals of the first target signal and the second target signal comprises:
performing voice enhancement processing on the l frame frequency domain sub-signal in the first target signal according to the first filter coefficient to obtain a first factor, performing voice enhancement processing on the l frame frequency domain sub-signal in the second target signal according to the second filter coefficient to obtain a second factor, determining a third factor according to the difference between the first factor and the second factor, determining a fourth factor according to the sum of the first factor and the second factor, performing voice enhancement on the third factor according to a first adaptive filter coefficient to obtain a fifth factor, determining a third target sub-signal of the current round of spatial filtering according to the third factor, and determining a fourth target sub-signal of the current round of spatial filtering according to the difference between the fourth factor and the fifth factor;
wherein l has a value range of (1, 2, 3, …, m), the third target signal comprises the third target sub-signal determined by each round of spatial filtering, and the fourth target signal comprises the fourth target sub-signal determined by each round of spatial filtering.
8. An apparatus for microphone speech enhancement, the apparatus comprising:
a processor and a memory for storing at least one instruction which when loaded and executed by the processor is to implement the microphone speech enhancement method of any of claims 1-6.
9. A terminal, characterized in that the terminal comprises a microphone speech enhancement device according to claim 8.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method for microphone speech enhancement according to any one of claims 1-6.
CN202110687473.3A 2021-06-21 2021-06-21 Microphone voice enhancement method and device, terminal and storage medium Active CN113421582B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110687473.3A CN113421582B (en) 2021-06-21 2021-06-21 Microphone voice enhancement method and device, terminal and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110687473.3A CN113421582B (en) 2021-06-21 2021-06-21 Microphone voice enhancement method and device, terminal and storage medium

Publications (2)

Publication Number Publication Date
CN113421582A CN113421582A (en) 2021-09-21
CN113421582B true CN113421582B (en) 2022-11-04

Family

ID=77789690

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110687473.3A Active CN113421582B (en) 2021-06-21 2021-06-21 Microphone voice enhancement method and device, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN113421582B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102347027A (en) * 2011-07-07 2012-02-08 瑞声声学科技(深圳)有限公司 Double-microphone speech enhancer and speech enhancement method thereof
CN110767247A (en) * 2019-10-29 2020-02-07 支付宝(杭州)信息技术有限公司 Voice signal processing method, sound acquisition device and electronic equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102111697B (en) * 2009-12-28 2015-03-25 歌尔声学股份有限公司 Method and device for controlling noise reduction of microphone array
CN108564963B (en) * 2018-04-23 2019-10-18 百度在线网络技术(北京)有限公司 Method and apparatus for enhancing voice

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102347027A (en) * 2011-07-07 2012-02-08 瑞声声学科技(深圳)有限公司 Double-microphone speech enhancer and speech enhancement method thereof
CN110767247A (en) * 2019-10-29 2020-02-07 支付宝(杭州)信息技术有限公司 Voice signal processing method, sound acquisition device and electronic equipment

Also Published As

Publication number Publication date
CN113421582A (en) 2021-09-21

Similar Documents

Publication Publication Date Title
CN109727604B (en) Frequency domain echo cancellation method for speech recognition front end and computer storage medium
CN102306496B (en) Noise elimination method, device and system of multi-microphone array
JP5221117B2 (en) Low complexity echo compensation
US9818424B2 (en) Method and apparatus for suppression of unwanted audio signals
JP4210521B2 (en) Noise reduction method and apparatus
CN105575397B (en) Voice noise reduction method and voice acquisition equipment
CN111341336B (en) Echo cancellation method, device, terminal equipment and medium
CN106463106B (en) Wind noise reduction for audio reception
US20140025374A1 (en) Speech enhancement to improve speech intelligibility and automatic speech recognition
CN108447496B (en) Speech enhancement method and device based on microphone array
JP2014502074A (en) Echo suppression including modeling of late reverberation components
CN104835503A (en) Improved GSC self-adaptive speech enhancement method
CN110782914B (en) Signal processing method and device, terminal equipment and storage medium
CN110556125B (en) Feature extraction method and device based on voice signal and computer storage medium
CN113409804A (en) Multichannel frequency domain speech enhancement algorithm based on variable-span generalized subspace
CN113421582B (en) Microphone voice enhancement method and device, terminal and storage medium
Shamsa et al. Noise reduction using multi-channel FIR warped Wiener filter
Kothapally et al. Joint neural aec and beamforming with double-talk detection
CN113593599A (en) Method for removing noise signal in voice signal
CN112489669A (en) Audio signal processing method, device, equipment and medium
CN111968667A (en) Double-microphone voice noise reduction device and noise reduction method thereof
CN112653799B (en) Echo cancellation method for space voice equipment of space station
WO2021131346A1 (en) Sound pick-up device, sound pick-up method and sound pick-up program
Yermeche et al. A calibrated subband beamforming algorithm for speech enhancement
CN114550739A (en) Speech enhancement method, related device and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant