CN113689869B - Speech enhancement method, electronic device, and computer-readable storage medium - Google Patents

Info

Publication number
CN113689869B
Authority
CN
China
Prior art keywords
matrix
voice
enhanced
signal covariance
frame
Prior art date
Legal status (assumed; not a legal conclusion)
Active
Application number
CN202110846654.6A
Other languages
Chinese (zh)
Other versions
CN113689869A (en)
Inventor
陈庭威
黄景标
林聚财
殷俊
Current Assignee (the listed assignees may be inaccurate)
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date (assumed; not a legal conclusion)
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd filed Critical Zhejiang Dahua Technology Co Ltd
Priority to CN202110846654.6A priority Critical patent/CN113689869B/en
Publication of CN113689869A publication Critical patent/CN113689869A/en
Application granted granted Critical
Publication of CN113689869B publication Critical patent/CN113689869B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0264 Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses a speech enhancement method, an electronic device, and a computer-readable storage medium. The speech enhancement method includes: acquiring speech to be enhanced; determining an inverse of the signal covariance matrix of the speech to be enhanced, based on the speech to be enhanced; determining a target-signal covariance matrix of the target speech using a mask matrix corresponding to the target speech within the speech to be enhanced; and performing speech enhancement on the speech to be enhanced through the inverse of the signal covariance matrix and the target-signal covariance matrix. In this way, the invention can enhance the speech to be enhanced and improve the speech enhancement effect.

Description

Speech enhancement method, electronic device, and computer-readable storage medium
Technical Field
The present invention relates to the field of speech processing, and in particular to a speech enhancement method, an electronic device, and a computer-readable storage medium.
Background
In fields such as teleconferencing and artificial intelligence, speech transmission plays an important role. In practical scenarios, however, the target speech signal is often disturbed by various noises or background sounds, so speech enhancement is needed to improve the intelligibility of the target speech.
Conventional speech enhancement is often performed with beamforming, which requires the azimuth (direction of arrival) of the target speech signal to be estimated in advance; a beamformer then filters out signals arriving from directions other than that of the target speech to achieve enhancement.
In practice, however, it is difficult to obtain the azimuth of the target speech signal accurately, so the speech enhancement effect is poor.
Disclosure of Invention
The invention provides a speech enhancement method, an electronic device, and a computer-readable storage medium to improve the speech enhancement effect.
In order to solve the above technical problems, the present invention provides a speech enhancement method, including: acquiring voice to be enhanced; determining an inverse matrix of a signal covariance matrix of the voice to be enhanced based on the voice to be enhanced; determining a target signal covariance matrix of the target voice by using a mask matrix corresponding to the target voice in the voice to be enhanced; and performing voice enhancement on the voice to be enhanced through the inverse matrix of the signal covariance matrix and the target signal covariance matrix.
Wherein the step of determining an inverse of the signal covariance matrix of the speech to be enhanced based on the speech to be enhanced includes: transforming the speech to be enhanced to obtain a matrix corresponding to the current frame of the speech to be enhanced; acquiring an inverse of the signal covariance matrix of the initial frame of the speech to be enhanced; and obtaining the inverse of the signal covariance matrix of the current frame based on a first recurrence relation, using the matrix of the current frame, its conjugate transpose, and the inverse of the signal covariance matrix of the initial frame. The first recurrence relation characterizes the correspondence between the inverse of the signal covariance matrix of the current frame and that of the previous frame.
The first recurrence relation is obtained by constructing a first correspondence from the matrix of the current frame of the speech to be enhanced and its conjugate transpose, and then performing an inversion operation on the first correspondence.
The step of determining the target-signal covariance matrix of the target speech using the mask matrix corresponding to the target speech in the speech to be enhanced includes: obtaining the probability that target speech is present in the current frame of the speech to be enhanced, using the matrix corresponding to the speech to be enhanced; acquiring a mask matrix of the initial frame; obtaining the mask matrix of the current frame of the speech signal to be enhanced using the mask matrix of the initial frame and the probability; and obtaining the target-signal covariance matrix of the current frame of the target speech using the mask matrix of the current frame, the matrix corresponding to the current frame, and its conjugate transpose.
Wherein the step of obtaining the mask matrix of the current frame of the speech signal to be enhanced using the mask matrix of the initial frame and the probability includes: obtaining the mask matrix of the current frame based on a second recurrence relation, using the probability and the mask matrix of the initial frame, where the second recurrence relation characterizes the correspondence between the mask matrix of the current frame and that of the previous frame. The step of obtaining the target-signal covariance matrix of the target speech using the mask matrix of the current frame, the matrix corresponding to the current frame, and the conjugate transpose includes: obtaining the target-signal covariance matrix of the current frame based on a third recurrence relation, using the mask matrix of the current frame, the matrix corresponding to the current frame, its conjugate transpose, and the target-signal covariance matrix of the initial frame, where the third recurrence relation characterizes the correspondence between the target-signal covariance matrix of the current frame and that of the previous frame.
The third recurrence relation is obtained by constructing a second correspondence from the matrix of the current frame of the speech to be enhanced, its conjugate transpose, the target-signal covariance matrix, and the mask matrix of the current frame, and then transforming the second correspondence.
Wherein the step of acquiring the mask matrix of the initial frame includes: acquiring an identity matrix, a random matrix with values in the range 0-1, or a probability matrix following a normal distribution, and determining the acquired matrix as the mask matrix of the initial frame.
The step of performing voice enhancement on the voice to be enhanced through the inverse matrix of the signal covariance matrix and the target signal covariance matrix comprises the following steps of: calculating to obtain the beam former coefficient of the current frame through the inverse matrix of the signal covariance matrix of the current frame and the target signal covariance matrix of the current frame; the beamformer coefficients are multiplied with the speech of the current frame to be enhanced to enhance the speech of the current frame to be enhanced.
The step of obtaining the voice to be enhanced comprises the following steps: acquiring an initial voice in a time domain form; and windowing, framing and Fourier transforming the initial voice in sequence to obtain the voice to be enhanced in the time-frequency domain signal form.
The step of performing voice enhancement on the voice to be enhanced through the inverse matrix of the signal covariance matrix and the target signal covariance matrix further comprises the following steps: and carrying out inverse Fourier transform on the voice after voice enhancement to obtain a voice signal in a time domain form after voice enhancement.
In order to solve the technical problem, the present invention further provides an electronic device, including: the memory and the processor are coupled to each other, and the processor is configured to execute the program instructions stored in the memory to implement the speech enhancement method of any one of the above.
To solve the above technical problem, the present invention also provides a computer-readable storage medium storing program data that can be executed to implement the speech enhancement method as any one of the above.
The beneficial effects of the invention are as follows: compared with the prior art, the speech enhancement method of the invention first determines the inverse of the signal covariance matrix of the speech to be enhanced, then determines the target-signal covariance matrix of the target speech using the mask matrix corresponding to the target speech in the speech to be enhanced, and finally performs speech enhancement on the speech to be enhanced through the inverse of the signal covariance matrix and the target-signal covariance matrix. In this way, enhancement does not rely on prior estimation of the azimuth of the target speech, which improves the speech enhancement effect.
Drawings
FIG. 1 is a flow chart of an embodiment of a speech enhancement method according to the present invention;
FIG. 2 is a flow chart of another embodiment of a speech enhancement method according to the present invention;
FIG. 3 is a schematic diagram of an embodiment of an electronic device according to the present invention;
Fig. 4 is a schematic structural diagram of an embodiment of a computer readable storage medium provided by the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1, fig. 1 is a flowchart illustrating an embodiment of a voice enhancement method according to the present invention.
Step S11: and obtaining the voice to be enhanced.
The speech to be enhanced is acquired. It may be captured by a voice receiver or played by a voice player. Voice receivers include wired microphones, wireless microphones, telephone receivers, and the like. Voice players include players of smart devices, telephone players, and the like.
In a specific application scenario, multiple channels of speech to be enhanced may be obtained through multiple microphones. In another specific application scenario, the voice to be enhanced of a single channel may also be obtained through a single handset.
Step S12: an inverse of a signal covariance matrix of the speech to be enhanced is determined based on the speech to be enhanced.
An inverse of a signal covariance matrix of the speech to be enhanced is determined based on the speech to be enhanced. Wherein each element in the covariance matrix is the covariance between the individual vector elements.
In a specific application scenario, a matrix of the to-be-enhanced voice can be obtained based on the to-be-enhanced voice, then the matrix is subjected to matrix transformation to obtain a signal covariance matrix of the to-be-enhanced voice, and inversion operation is performed on the signal covariance matrix to obtain an inverse matrix of the signal covariance matrix of the to-be-enhanced voice.
In another specific application scenario, the adjugate matrix of the signal covariance matrix of the speech to be enhanced can be determined based on the speech to be enhanced, and the inverse matrix is then solved from the adjugate matrix, so as to obtain the inverse of the signal covariance matrix of the speech to be enhanced.
The method of calculating the inverse of the signal covariance matrix of the speech to be enhanced is not limited herein.
Step S13: and determining a target signal covariance matrix of the target voice by using a mask matrix corresponding to the target voice in the voice to be enhanced.
The target-signal covariance matrix of the target speech is determined using the mask matrix corresponding to the target speech in the speech to be enhanced. The target speech is the speech that needs to be enhanced within the speech to be enhanced. In a specific application scenario, when the speech to be enhanced is the audio received by a conference recording microphone, the target speech is the voice of the conference speaker, and the remaining sounds are background sound.
The mask matrix is used to mask the speech to be enhanced so as to suppress the background sound and highlight the target speech. In a specific application scenario, the probability that target speech is present can be estimated for each element of the matrix of the speech to be enhanced (the larger the estimated value, the more likely target speech is present) so as to obtain the mask matrix corresponding to the target speech in the speech to be enhanced. In other application scenarios, the matrix of the speech to be enhanced can instead be filtered by a deep neural network to obtain the mask matrix corresponding to the target speech in the speech to be enhanced.
And the target signal covariance matrix is the signal covariance matrix corresponding to the target voice. The target voice is highlighted through the mask matrix corresponding to the target voice, so that the target signal covariance matrix of the target voice is determined, and the accuracy and reliability of the target signal covariance matrix can be improved.
Step S14: and performing voice enhancement on the voice to be enhanced through the inverse matrix of the signal covariance matrix and the target signal covariance matrix.
The inverse matrix of the signal covariance matrix and the target signal covariance matrix obtained through the steps are used for carrying out voice enhancement on the voice to be enhanced.
In a specific application scenario, the beamformer coefficients can be calculated from the inverse of the signal covariance matrix and the target-signal covariance matrix, and the speech to be enhanced is then enhanced using these coefficients. The coefficients are MVDR (Minimum Variance Distortionless Response) beamformer coefficients. Processing the speech to be enhanced with MVDR coefficients minimizes the variance of the output subject to the constraint that the clean (target) speech signal passes through without distortion, and thus minimizes the background-sound signal.
In another specific application scenario, the inverse matrix of the signal covariance matrix and the target signal covariance matrix can also be directly combined with the matrix of the speech to be enhanced to perform speech enhancement on the speech to be enhanced.
In another specific application scenario, speech enhancement may also be performed on the speech to be enhanced by means of machine learning, based on the inverse of the signal covariance matrix and the target-signal covariance matrix. The specific manner of enhancement is not limited herein.
By the method, the voice enhancement method of the embodiment determines the inverse matrix of the signal covariance matrix of the voice to be enhanced firstly, then determines the target signal covariance matrix of the target voice by using the mask matrix corresponding to the target voice in the voice to be enhanced, and finally enhances the voice of the voice to be enhanced by using the inverse matrix of the signal covariance matrix and the target signal covariance matrix.
Referring to fig. 2, fig. 2 is a flow chart of another embodiment of the voice enhancement method according to the present invention.
Step S21: the method comprises the steps of obtaining initial voice in a time domain form, and sequentially windowing, framing and Fourier transforming the initial voice to obtain voice to be enhanced in a time-frequency domain signal form.
Initial speech in time-domain form is acquired, and windowing, framing, and the Fourier transform are then applied to obtain the speech to be enhanced in time-frequency form. In a specific application scenario, multi-channel time-domain initial speech can be acquired through a plurality of microphones, and windowing, framing, and the fast Fourier transform (FFT) are then performed in sequence to obtain the speech to be enhanced as a time-frequency domain signal. The speech to be enhanced in time-frequency form comprises multiple frames.
In other embodiments, the speech to be enhanced in the form of a time-frequency domain signal may also be directly acquired, for example: and acquiring the voice to be enhanced in the form of a time-frequency domain signal output by a processor or other processing equipment.
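As a concrete illustration of the windowing, framing, and FFT pipeline described above, the following is a minimal NumPy sketch (not part of the disclosed embodiment; the Hann window, frame length 512, and hop size 256 are illustrative choices):

```python
import numpy as np

def stft(x, frame_len=512, hop=256):
    """Window, frame, and FFT a time-domain signal into a
    time-frequency representation of shape (n_frames, n_bins)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)  # n_bins = frame_len // 2 + 1

# Example: a two-microphone input becomes per-channel time-frequency signals.
fs = 16000
t = np.arange(fs) / fs
mics = [np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 440 * t + 0.1)]
Y = np.stack([stft(m) for m in mics])  # shape: (J, n_frames, n_bins)
print(Y.shape)
```

Stacking per-channel spectra this way yields, for each frame t and frequency f, the J×1 observation vector used in the formulas below.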
Step S22: transforming the voice to be enhanced to obtain a matrix corresponding to the current frame of the voice to be enhanced, obtaining an inverse matrix of a signal covariance matrix of the initial frame of the voice to be enhanced, and obtaining the inverse matrix of the signal covariance matrix of the current frame of the voice to be enhanced based on a first recurrence relation by utilizing the matrix of the current frame, the conjugate transpose matrix of the matrix and the inverse matrix of the signal covariance matrix of the initial frame.
The speech to be enhanced in time-frequency form is transformed to obtain the matrix corresponding to its current frame. In a specific application scenario, the enhancement in this embodiment may be applied to each frame in real time as the microphones capture the speech, or to each frame in sequence after the complete speech has been captured.
In this embodiment, the matrix corresponding to the current frame of the speech to be enhanced in the form of the time-frequency domain signal may be expressed as:
y(f,t) = [y_{1,f,t}, y_{2,f,t}, ..., y_{J,f,t}]^T
where y(f,t) is the matrix corresponding to the current frame in time-frequency form (y_{f,t} in the subsequent formulas denotes the same quantity), i.e. a J×1 observation vector at time t (the t-th frame) and frequency f. J is the number of microphones, so y_{1,f,t}, y_{2,f,t}, ..., y_{J,f,t} are the speech signals of the J microphones. T denotes the matrix transpose; t is the current (or any) frame index and f the current (or any) frequency.
An inverse of the signal covariance matrix of the initial frame of the speech to be enhanced is obtained, the initial frame being the frame at time t = 0. The inverse of the signal covariance matrix of the current frame is then obtained based on the first recurrence relation, using the matrix of the current frame, its conjugate transpose, and the inverse of the signal covariance matrix of the initial frame. The first recurrence relation characterizes the correspondence between the inverse of the signal covariance matrix of the current frame and that of the previous frame.
Specifically, the first recurrence relation is as follows:
Y_{f,t}^{-1} = Y_{f,t-1}^{-1} - (Y_{f,t-1}^{-1} y_{f,t} y_{f,t}^H Y_{f,t-1}^{-1}) / (1 + y_{f,t}^H Y_{f,t-1}^{-1} y_{f,t})      (1)
where Y_{f,t}^{-1} is the inverse of the signal covariance matrix of the t-th frame of the speech to be enhanced, Y_{f,t-1}^{-1} is the inverse of the signal covariance matrix of the (t-1)-th frame, y_{f,t} is the matrix of the t-th frame of the speech to be enhanced, and y_{f,t}^H is its conjugate transpose. t is the current frame index; when the speech to be enhanced has s frames in total, t may take values in (0, 1, 2, ..., s), set according to the frame currently processed.
The first recurrence relation links the inverse of the signal covariance matrix of the current frame to that of the previous frame, i.e. the inverses of every two adjacent frames. Therefore, once the inverse of the signal covariance matrix of the initial frame of the speech to be enhanced is obtained, substituting it into formula (1) yields the inverse for the first frame; substituting that result into formula (1) again yields the inverse for the second frame; and so on. In this way, the inverses of the signal covariance matrices of all frames of the speech to be enhanced can be obtained from the matrix of each frame, its conjugate transpose, and the inverse of the signal covariance matrix of the initial frame.
With this method, obtaining the inverses of the signal covariance matrices of all frames of the speech to be enhanced requires only the matrix of each frame, its conjugate transpose, and the inverse of the signal covariance matrix of the initial frame; the rest follows by calculation from the first recurrence relation, without performing a matrix inversion for each frame of the speech to be enhanced in turn.
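The recursion can be checked numerically. The sketch below (an illustrative NumPy implementation, not the patent's code) applies the rank-1 inverse update of formula (1) frame by frame on synthetic observations and confirms that it matches direct inversion of the accumulated covariance:

```python
import numpy as np

def update_inverse(P, y):
    """Rank-1 inverse update, formula (1): given P = inv(Y_{t-1}) and the
    current observation vector y, return inv(Y_{t-1} + y y^H) without
    performing a fresh matrix inversion."""
    Py = P @ y
    return P - np.outer(Py, y.conj() @ P) / (1.0 + np.real(y.conj() @ Py))

rng = np.random.default_rng(0)
J = 4                              # number of microphones
Y = np.eye(J, dtype=complex)       # initial-frame signal covariance (identity)
P = np.linalg.inv(Y)               # its inverse

for _ in range(50):                # 50 frames of synthetic observations
    y = rng.standard_normal(J) + 1j * rng.standard_normal(J)
    Y = Y + np.outer(y, y.conj())  # first correspondence, formula (2)
    P = update_inverse(P, y)       # first recurrence, formula (1)

print(np.allclose(P, np.linalg.inv(Y)))  # True: recursion == direct inversion
```

The recursive update costs O(J^2) per frame instead of the O(J^3) of a fresh inversion, which is the computational saving the embodiment describes.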
The first recurrence relation is obtained by constructing a first correspondence from the matrix of the current frame of the speech to be enhanced and its conjugate transpose, and then performing an inversion operation on the first correspondence. Specifically, the first correspondence between the signal covariance matrix of the current frame and that of the previous frame is:
Y_{f,t} = Y_{f,t-1} + y_{f,t} y_{f,t}^H      (2)
where Y_{f,t} is the signal covariance matrix of the t-th frame of the speech to be enhanced and Y_{f,t-1} is the signal covariance matrix of the (t-1)-th frame. Performing the inversion operation on the first correspondence, i.e. formula (2), yields the first recurrence relation, i.e. formula (1).
The first correspondence describes the iterative update between the signal covariance matrices of adjacent frames of the speech to be enhanced. Performing the inversion operation on it once yields the first recurrence relation, from which the inverse of the signal covariance matrix of each frame is then obtained recursively, starting from the inverse for the initial frame.
The inverse of the signal covariance matrix of the initial frame may be a simple matrix such as an identity matrix, a random matrix with values in the range 0-1, or a probability matrix following a normal distribution, so that after it is substituted into the first recurrence relation the amount of recursive computation is reduced and computational efficiency is improved.
Step S23: obtaining the probability of target voice existing in the current frame of the voice to be enhanced by utilizing the matrix corresponding to the voice to be enhanced, obtaining a mask matrix of an initial frame, obtaining the mask matrix of the current frame of the voice signal to be enhanced by utilizing the mask matrix and the probability of the initial frame, and obtaining the target signal covariance matrix of the current frame of the target voice by utilizing the mask matrix of the current frame, the matrix corresponding to the current frame and the conjugate transpose matrix.
Obtaining the probability of the target voice existing in the current frame of the voice to be enhanced by utilizing the matrix corresponding to the voice to be enhanced, obtaining the mask matrix of the initial frame, obtaining the mask matrix of the current frame of the voice signal to be enhanced by utilizing the mask matrix of the initial frame and the probability, and obtaining the target signal covariance matrix of the current frame of the target voice by utilizing the mask matrix of the current frame, the matrix corresponding to the current frame and the conjugate transpose matrix.
In a specific application scenario, the probability that target speech is present in the current frame of the speech to be enhanced may be calculated as:
p(y_{f,t}) = (1 / (π^J det(φ_{f,t} Y_{s,f,t-1}))) · e^( -y_{f,t}^H (φ_{f,t} Y_{s,f,t-1})^{-1} y_{f,t} )      (3)
with φ_{f,t} = tr(y_{f,t} y_{f,t}^H Y_{s,f,t-1}^{-1}) / J,
where p(y_{f,t}) is the probability that target speech is present in the t-th frame of the speech to be enhanced, e is the natural constant, J is the number of microphones, tr() denotes the trace of a matrix, and Y_{s,f,t-1} is the target-signal covariance matrix, which may be obtained recursively by referring to formula (5).
After the probability that target speech is present in the current frame of the speech to be enhanced is obtained, the mask matrix of the current frame is obtained from this probability and the mask matrix of the initial frame based on the second recurrence relation. The second recurrence relation is as follows:
λ_{f,t} = α λ_{f,t-1} + β p(y_{f,t})      (4)
where λ_{f,t} is the mask matrix of the t-th frame and λ_{f,t-1} is the mask matrix of the (t-1)-th frame, and the hyperparameters α and β satisfy α + β = 1.
The second recurrence relation characterizes the correspondence between the mask matrix of the current frame and that of the previous frame. Therefore, after the mask matrix of the initial frame is obtained, substituting it into the second recurrence relation yields the mask matrix of the first frame; substituting that in again yields the mask matrix of the second frame; and, by analogy, the mask matrices of all frames of the speech signal to be enhanced are obtained.
The mask matrix of the initial frame may be a simple matrix with elements between 0 and 1, such as an identity matrix, a random matrix with values in the range 0-1, or a probability matrix following a normal distribution, so that after it is substituted into the second recurrence relation the amount of recursive computation is reduced and computational efficiency is improved.
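The mask smoothing of the second recurrence relation can be sketched as follows (an illustrative NumPy sketch; the per-bin speech-presence probabilities here are stand-in random values, not the patent's estimator, and α = 0.9, β = 0.1 are example hyperparameters):

```python
import numpy as np

alpha, beta = 0.9, 0.1           # hyperparameters satisfying alpha + beta == 1
n_bins = 257
mask = np.full(n_bins, 0.5)      # initial-frame mask with elements in [0, 1]

def update_mask(mask_prev, p_target):
    """Second recurrence: smooth the previous frame's mask toward the
    current speech-presence probability, per frequency bin."""
    return alpha * mask_prev + beta * p_target

rng = np.random.default_rng(1)
for _ in range(100):
    p = rng.uniform(0.0, 1.0, n_bins)   # stand-in presence probabilities
    mask = update_mask(mask, p)

print(mask.min() >= 0.0 and mask.max() <= 1.0)  # True: masks stay in [0, 1]
```

Because each update is a convex combination (α + β = 1) of quantities in [0, 1], the mask remains a valid soft mask at every frame.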
After the mask matrix of all frames of the voice signal to be enhanced is obtained, the mask matrix of the current frame, the matrix corresponding to the current frame and the conjugate transpose matrix are utilized to obtain the target signal covariance matrix of the target voice. Specifically, a mask matrix of the current frame, a matrix corresponding to the current frame, a conjugate transpose matrix and a target signal covariance matrix of the initial frame are utilized to obtain a target signal covariance matrix of the current frame by utilizing a third recurrence relation, wherein the third recurrence relation is specifically as follows:
where Φ_{x,t} is the target signal covariance matrix of the t-th frame, Φ_{x,t-1} is the target signal covariance matrix of the (t-1)-th frame, and M_t is the mask matrix of the t-th frame. The third recurrence relation characterizes the correspondence between the target signal covariance matrix of the current frame and that of the previous frame. Substituting the target signal covariance matrix of the initial frame into the third recurrence relation yields the target signal covariance matrices of all frames in sequence.
The third recurrence relation is obtained by constructing a second correspondence from the matrix of the current frame of the speech to be enhanced, its conjugate transpose, the target signal covariance matrix, and the mask matrix of the current frame, and then transforming that second correspondence.
The second correspondence is specifically as follows:
where M_t is the mask matrix of the t-th frame; masking the matrix of the t-th frame of the speech to be enhanced and its conjugate transpose with the mask matrix of the t-th frame, over all frames, yields the target signal covariance matrix Φ_x of the entire target speech.
After the target signal covariance matrix of the target speech is obtained, the target signal covariance matrix of the initial frame can be obtained by transforming formula (6); substituting the target signal covariance matrix of the initial frame into formula (5) then completes the calculation of the target signal covariance matrices of all frames.
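The third recurrence relation is likewise given only as an image. The sketch below assumes one plausible form: accumulating the masked outer product of the current frame's matrix and its conjugate transpose onto the previous target signal covariance matrix. The scalar mask and all variable names are illustrative assumptions.

```python
import numpy as np

def update_target_cov(phi_x_prev, y_t, mask_t):
    # One step of the third recurrence relation (assumed form):
    # add the masked outer product y_t y_t^H of the current frame
    # to the previous target signal covariance matrix.
    return phi_x_prev + mask_t * np.outer(y_t, y_t.conj())

M = 3                                     # number of microphone channels
phi_x = np.zeros((M, M), dtype=complex)   # initial-frame target covariance
rng = np.random.default_rng(0)
for _ in range(5):                        # recurse frame by frame
    y_t = rng.standard_normal(M) + 1j * rng.standard_normal(M)
    phi_x = update_target_cov(phi_x, y_t, mask_t=0.7)
```

Each update adds a Hermitian positive semi-definite term, so the accumulated target signal covariance matrix remains Hermitian and positive semi-definite throughout the recursion.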
Step S24: calculating the beamformer coefficient of the current frame from the inverse of the signal covariance matrix of the current frame and the target signal covariance matrix of the current frame, and multiplying the beamformer coefficient with the speech to be enhanced of the current frame to enhance it.
After the inverses of the signal covariance matrices of all frames and the target signal covariance matrices of all frames are obtained, the beamformer coefficient of each frame is calculated from the inverse of the signal covariance matrix and the target signal covariance matrix of that frame.
In a specific application scenario, the beamformer coefficient of the current frame is calculated from the inverse of the signal covariance matrix of the current frame and the target signal covariance matrix of the current frame as follows:
where w_{f,t} is the beamformer coefficient of the t-th frame, also called the MVDR filter coefficient; tr(·) denotes the trace of a matrix; and d is an M×1 vector whose elements, in this embodiment, may each be 1 or 0.
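This description matches the widely used mask-based MVDR solution w_{f,t} = (Φ_y^{-1} Φ_x / tr(Φ_y^{-1} Φ_x)) d; the sketch below assumes that form. The variable names and the diagonal test matrices are illustrative, not taken from the patent.

```python
import numpy as np

def mvdr_coefficients(phi_y_inv, phi_x, d):
    # w = (phi_y^{-1} phi_x / tr(phi_y^{-1} phi_x)) d, where tr() is the
    # matrix trace and d is an M x 1 vector of 0s and 1s selecting the
    # reference channel.
    num = phi_y_inv @ phi_x
    return (num / np.trace(num)) @ d

M = 3
phi_y = 2.0 * np.eye(M)            # hypothetical noisy-signal covariance
phi_x = np.eye(M)                  # hypothetical target-signal covariance
d = np.array([1.0, 0.0, 0.0])      # select the first microphone
w = mvdr_coefficients(np.linalg.inv(phi_y), phi_x, d)
enhanced = w.conj() @ np.array([1.0, 2.0, 3.0])   # w^H y for one frame
```

The trace normalization makes the filter's response to the reference channel distortionless, which is the MVDR constraint the text refers to.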
The beamformer coefficients w_{f,t} of the individual frames are assembled into the overall beamformer coefficient matrix W_f, and the matrix W_f is then converted to obtain the conjugate transpose W_f^H of the overall beamformer coefficients.
Multiplying the beamformer coefficients with the speech to be enhanced of each frame enhances that frame of the speech to be enhanced.
In a specific application scenario, taking the current frame as an example, the specific calculation process is as follows:
where the speech to be enhanced y_{f,t} of the t-th frame is multiplied by the conjugate transpose W_f^H of the overall beamformer coefficients to yield the enhanced speech of the t-th frame. In this embodiment, the t-th frame is the current frame.
Enhancing each frame of the speech to be enhanced via formula (7) achieves speech enhancement of the entire speech to be enhanced.
After the entire speech to be enhanced has been enhanced, an inverse Fourier transform is applied to the enhanced speech to obtain a time-domain speech signal, so that the enhanced time-domain signal can be used conveniently, for example to match the input format of an audio player.
Through the above steps, the correspondence between the mask matrix of the current frame and the mask matrix of the previous frame, and the correspondence between the inverse of the signal covariance matrix of the current frame and that of the previous frame, are first constructed. The mask matrix and the inverse signal covariance matrix of the initial frame are then acquired, and the mask matrices and inverse signal covariance matrices of all frames are obtained in sequence by recursion through these correspondences. The beamformer coefficients are then obtained by matrix calculation, and finally the speech to be enhanced is enhanced frame by frame through the beamformer coefficients. In this way, the signal covariance matrix of each frame need not be inverted separately; the inversion is performed only once, on the signal covariance matrix of the initial frame, which greatly reduces the amount and complexity of computing the inverse signal covariance matrices. Likewise, the mask matrix of each frame need not be computed independently, further reducing the amount and complexity of computation. Therefore, the speech enhancement method of this embodiment can greatly reduce the amount and complexity of computation, improve computational efficiency, reduce calculation errors, and improve the speech enhancement effect.
Based on the same inventive concept, the present invention also provides an electronic device capable of implementing the speech enhancement method of any of the above embodiments. Referring to fig. 3, fig. 3 is a schematic structural diagram of an embodiment of the electronic device provided by the present invention; the electronic device includes a processor 31 and a memory 32.
The processor 31 is configured to execute program instructions stored in the memory 32 to implement the steps of any of the speech enhancement method embodiments described above. In one particular implementation scenario, the electronic device may include, but is not limited to, a microcomputer and a server; the electronic device may also include mobile devices such as a notebook computer and a tablet computer, which are not limited herein.
In particular, the processor 31 is adapted to control itself and the memory 32 to implement the steps of any of the speech enhancement method embodiments described above. The processor 31 may also be referred to as a CPU (Central Processing Unit). The processor 31 may be an integrated circuit chip with signal processing capabilities. The processor 31 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. In addition, the processor 31 may be implemented jointly by integrated circuit chips.
Through this scheme, speech enhancement of the speech to be enhanced can be realized.
Based on the same inventive concept, the present invention also provides a computer-readable storage medium. Referring to fig. 4, fig. 4 is a schematic structural diagram of an embodiment of the computer-readable storage medium provided by the present invention. At least one piece of program data 41 is stored in the computer-readable storage medium 40, the program data 41 being for implementing any of the methods described above. In one embodiment, the computer-readable storage medium 40 includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.
In the several embodiments provided in the present invention, it should be understood that the disclosed method and apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the invention, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium.
The foregoing description is only of embodiments of the present invention, and is not intended to limit the scope of the invention, and all equivalent structures or equivalent processes using the descriptions and the drawings of the present invention or directly or indirectly applied to other related technical fields are included in the scope of the present invention.

Claims (12)

1. A method of speech enhancement, the method comprising:
Acquiring voice to be enhanced;
determining an inverse matrix of a signal covariance matrix of the voice to be enhanced based on the voice to be enhanced;
determining a target signal covariance matrix of the target voice by using a mask matrix corresponding to the target voice in the voice to be enhanced;
And carrying out voice enhancement on the voice to be enhanced through the inverse matrix of the signal covariance matrix and the target signal covariance matrix.
2. The method of claim 1, wherein the step of determining an inverse of a signal covariance matrix of the speech to be enhanced based on the speech to be enhanced comprises:
transforming the voice to be enhanced to obtain a matrix corresponding to the current frame of the voice to be enhanced; and
Acquiring an inverse matrix of a signal covariance matrix of the initial frame of the voice to be enhanced;
Obtaining an inverse matrix of the signal covariance matrix of the current frame of the voice to be enhanced based on a first recurrence relation by using the matrix of the current frame, a conjugate transpose matrix of the matrix and an inverse matrix of the signal covariance matrix of the initial frame; the first recurrence relation represents the corresponding relation between the inverse matrix of the signal covariance matrix of the current frame and the inverse matrix of the signal covariance matrix of the previous frame;
And obtaining the inverse matrix of the signal covariance matrix of all frames of the voice to be enhanced based on the first recurrence relation by utilizing the matrix of each frame, the conjugate transpose matrix of each frame matrix and the inverse matrix of the signal covariance matrix of the initial frame, so as to obtain the inverse matrix of the signal covariance matrix of the voice to be enhanced.
3. The method for speech enhancement according to claim 2, wherein,
The first recurrence relation is obtained by constructing a first corresponding relation through a matrix of the current frame of the voice to be enhanced and a conjugate transpose matrix, and then performing inverse operation on the first corresponding relation.
4. The method according to claim 1, wherein the step of determining the target signal covariance matrix of the target voice using the mask matrix corresponding to the target voice in the voices to be enhanced comprises:
obtaining the probability of target voice existing in the current frame of the voice to be enhanced by utilizing the matrix corresponding to the voice to be enhanced;
acquiring a mask matrix of an initial frame;
obtaining a mask matrix of the current frame of the voice to be enhanced by utilizing the mask matrix of the initial frame and the probability;
obtaining a target signal covariance matrix of the current frame via a third recurrence relation, using the mask matrix of the current frame, the matrix corresponding to the current frame, the conjugate transpose matrix, and the target signal covariance matrix of the initial frame; the third recurrence relation represents the corresponding relation between the target signal covariance matrix of the current frame and the target signal covariance matrix of the previous frame;
Substituting the target signal covariance matrix of the initial frame into the third recurrence relation, and sequentially recursing the target signal covariance matrices of all frames to obtain the target signal covariance matrix of the target voice.
5. The method of claim 4, wherein the step of deriving the mask matrix of the current frame of the speech to be enhanced using the mask matrix of the initial frame and the probability comprises:
Obtaining a mask matrix of the current frame based on a second recurrence relation by using the probability and the mask matrix of the initial frame; wherein the second recurrence relation characterizes a correspondence between the mask matrix of the current frame and the mask matrix of the previous frame.
6. The method for speech enhancement according to claim 5, wherein,
The third recurrence relation is obtained by constructing a second corresponding relation through the matrix of the current frame of the voice to be enhanced, the conjugate transpose matrix, the target signal covariance matrix and the mask matrix of the current frame, and transforming the second corresponding relation.
7. The method according to any one of claims 4 to 6, wherein the step of acquiring a mask matrix of the initial frame comprises:
Acquiring an identity matrix, a random matrix with a value range of 0-1, or a probability matrix obeying a normal distribution;
and determining the identity matrix, the random matrix with the value range of 0-1, or the probability matrix obeying a normal distribution as the mask matrix of the initial frame.
8. The method according to claim 1, wherein the step of speech-enhancing the speech to be enhanced by the inverse of the signal covariance matrix and the target signal covariance matrix comprises:
calculating to obtain a beam former coefficient of the current frame through an inverse matrix of the signal covariance matrix of the current frame and the target signal covariance matrix of the current frame;
The beamformer coefficients are multiplied with the speech of the current frame to be enhanced to enhance the speech of the current frame to be enhanced.
9. The method of claim 1, wherein the step of obtaining the speech to be enhanced comprises:
Acquiring an initial voice in a time domain form;
and windowing, framing and Fourier transforming the initial voice in sequence to obtain the voice to be enhanced in a time-frequency domain signal form.
10. The method according to claim 9, wherein after the step of performing speech enhancement on the speech to be enhanced by the inverse of the signal covariance matrix and the target signal covariance matrix, further comprising:
and carrying out inverse Fourier transform on the voice after voice enhancement to obtain a voice signal in a time domain form after voice enhancement.
11. An electronic device, the electronic device comprising: a memory and a processor coupled to each other for executing program instructions stored in the memory to implement the speech enhancement method of any one of claims 1 to 10.
12. A computer readable storage medium, characterized in that the computer readable storage medium stores program data executable to implement the speech enhancement method according to any of claims 1-10.
CN202110846654.6A 2021-07-26 2021-07-26 Speech enhancement method, electronic device, and computer-readable storage medium Active CN113689869B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110846654.6A CN113689869B (en) 2021-07-26 2021-07-26 Speech enhancement method, electronic device, and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110846654.6A CN113689869B (en) 2021-07-26 2021-07-26 Speech enhancement method, electronic device, and computer-readable storage medium

Publications (2)

Publication Number Publication Date
CN113689869A CN113689869A (en) 2021-11-23
CN113689869B true CN113689869B (en) 2024-08-16

Family

ID=78577913

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110846654.6A Active CN113689869B (en) 2021-07-26 2021-07-26 Speech enhancement method, electronic device, and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN113689869B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7392180B1 (en) * 1998-01-09 2008-06-24 At&T Corp. System and method of coding sound signals using sound enhancement
CN112799017A (en) * 2021-04-07 2021-05-14 浙江华创视讯科技有限公司 Sound source positioning method, sound source positioning device, storage medium and electronic equipment

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5124014B2 (en) * 2008-03-06 2013-01-23 日本電信電話株式会社 Signal enhancement apparatus, method, program and recording medium
KR20100065078A (en) * 2008-12-05 2010-06-15 한국전자통신연구원 Method for estimating channel under environment in which interference exists in wireless communication system and apparatus thereof
CN102568493B (en) * 2012-02-24 2013-09-04 大连理工大学 A Method of Underdetermined Blind Separation Based on Maximum Diagonal Ratio of Matrix
EP2701145B1 (en) * 2012-08-24 2016-10-12 Retune DSP ApS Noise estimation for use with noise reduction and echo cancellation in personal communication
CN103077719B (en) * 2012-12-27 2015-01-07 安徽科大讯飞信息科技股份有限公司 Method for quickly processing total space factor based on matrix off-line precomputation
US10431240B2 (en) * 2015-01-23 2019-10-01 Samsung Electronics Co., Ltd Speech enhancement method and system
CN108269582B (en) * 2018-01-24 2021-06-01 厦门美图之家科技有限公司 Directional pickup method based on double-microphone array and computing equipment
CN108806712B (en) * 2018-04-27 2020-08-18 深圳市沃特沃德股份有限公司 Method and apparatus for reducing frequency domain processing
CN109036452A (en) * 2018-09-05 2018-12-18 北京邮电大学 A kind of voice information processing method, device, electronic equipment and storage medium
CN110148420A (en) * 2019-06-30 2019-08-20 桂林电子科技大学 A kind of audio recognition method suitable under noise circumstance
CN110600050B (en) * 2019-09-12 2022-04-15 深圳市华创技术有限公司 Microphone array voice enhancement method and system based on deep neural network
CN111599375B (en) * 2020-04-26 2023-03-21 云知声智能科技股份有限公司 Whitening method and device for multi-channel voice in voice interaction
CN112420068B (en) * 2020-10-23 2022-05-03 四川长虹电器股份有限公司 Quick self-adaptive beam forming method based on Mel frequency scale frequency division

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7392180B1 (en) * 1998-01-09 2008-06-24 At&T Corp. System and method of coding sound signals using sound enhancement
CN112799017A (en) * 2021-04-07 2021-05-14 浙江华创视讯科技有限公司 Sound source positioning method, sound source positioning device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN113689869A (en) 2021-11-23

Similar Documents

Publication Publication Date Title
CN112017681B (en) Method and system for enhancing directional voice
CN111508519B (en) Method and device for enhancing voice of audio signal
Zhang et al. Multi-channel multi-frame ADL-MVDR for target speech separation
CN110610718B (en) Method and device for extracting expected sound source voice signal
US10839820B2 (en) Voice processing method, apparatus, device and storage medium
CN114203163A (en) Audio signal processing method and device
CN112565981B (en) Howling suppression method, device, hearing aid and storage medium
CN113870893B (en) Multichannel double-speaker separation method and system
CN111402917A (en) Audio signal processing method and device and storage medium
US20240296856A1 (en) Audio data processing method and apparatus, device, storage medium, and program product
CN114242104B (en) Speech noise reduction method, device, equipment and storage medium
CN113689870B (en) Multichannel voice enhancement method and device, terminal and readable storage medium thereof
CN112951261B (en) Sound source positioning method and device and voice equipment
CN111613211B (en) Method and device for processing specific word voice
CN112802490A (en) Beam forming method and device based on microphone array
CN112802487A (en) Echo processing method, device and system
CN113689869B (en) Speech enhancement method, electronic device, and computer-readable storage medium
CN113782043A (en) Voice acquisition method and device, electronic equipment and computer readable storage medium
CN118298840B (en) A speech noise reduction method, electronic device and medium
CN115662394B (en) Voice extraction method and device, storage medium and electronic device
Yang et al. Interference-Controlled Maximum Noise Reduction Beamformer Based on Deep-Learned Interference Manifold
CN117373474A (en) Voice processing method, device, equipment and storage medium
CN112533120B (en) Beam forming method and device based on dynamic compression of noisy speech signal magnitude spectrum
CN117121104A (en) Estimating optimized masks for processing acquired sound data
Ukai et al. Multistage SIMO-model-based blind source separation combining frequency-domain ICA and time-domain ICA

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant