WO2019227589A1

WO2019227589A1 - Speech enhancement method and apparatus, computer device, and storage medium

Info

Publication number: WO2019227589A1
Application number: PCT/CN2018/094410
Authority: WO
Inventors: 涂宏
Original assignee: 平安科技（深圳）有限公司
Priority date: 2018-05-29
Filing date: 2018-07-04
Publication date: 2019-12-05
Also published as: CN108682429A

Abstract

Disclosed are a speech enhancement method and apparatus, a computer device, and a storage medium. The speech enhancement method comprises: transforming original speech information to obtain a digital speech signal; decomposing the digital speech signal by using an EEMD algorithm to obtain a first signal component; performing a correlation calculation on the digital speech signal and the first signal component by using a correlation calculation formula to obtain a first correlation coefficient; selecting a first signal component having the first correlation coefficient greater than a preset threshold as a second signal component; and integrating the second signal component to obtain target speech information. According to the speech enhancement method, when the speech enhancement is performed, the speech signal can be effectively denoised to obtain a pure speech signal, so that the accuracy of voiceprint recognition by using the pure speech signal is higher.

Description

Voice enhancement method, device, computer equipment and storage medium

This patent application is based on a Chinese invention patent application filed on May 29, 2018 with the application number 201810528846.0, entitled "Voice Enhancement Method, Device, Computer Equipment, and Storage Medium", and claims priority.

Technical field

The present application relates to the technical field of speech signal processing, and in particular, to a speech enhancement method, device, computer device, and storage medium.

Background technique

With the widespread use of speech recognition technology, the demand for speech signal processing technology has also expanded. At present, in the process of speech recognition or voiceprint recognition, the speech signals collected by the front-end equipment are generally noisy, containing both noise from the background environment and noise generated by the front-end equipment during recording. Such noisy speech signals reduce the accuracy of speech recognition. Therefore, it is necessary to perform speech enhancement processing on the speech signal (that is, noise reduction processing) so as to extract as pure a speech signal as possible and make speech recognition more accurate. At present, the accuracy of the speech signal extracted after speech enhancement processing is not high, which is not conducive to subsequent speech recognition.

Summary of the Invention

Based on this, it is necessary to address the above technical problems. Embodiments of the present application provide a method, a device, a computer device, and a storage medium for voice enhancement.

A speech enhancement method includes:

Converting original voice information to obtain a digital voice signal;

Decomposing the digital voice signal by using an EEMD algorithm to obtain a first signal component;

Performing a correlation calculation on the digital voice signal and the first signal component by using a correlation calculation formula to obtain a first correlation coefficient;

Selecting a first signal component whose first correlation coefficient is greater than a preset threshold as a second signal component;

Performing integration processing on the second signal component to obtain target voice information.

A voice enhancement device includes:

A digital voice signal acquisition module, configured to convert original voice information to obtain a digital voice signal;

A first signal component acquisition module, configured to decompose the digital voice signal by using an EEMD algorithm to acquire a first signal component;

A first correlation coefficient acquisition module, configured to perform a correlation calculation on the digital voice signal and the first signal component by using a correlation calculation formula to obtain a first correlation coefficient;

A second signal component acquisition module, configured to select, as the second signal component, a first signal component whose first correlation coefficient is greater than a preset threshold;

A target voice information acquisition module, configured to perform integration processing on the second signal component to acquire target voice information.

A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When the processor executes the computer-readable instructions, the following steps are implemented:

Convert the original voice information to obtain digital voice signals;

One or more non-volatile readable storage media storing computer-readable instructions, which when executed by one or more processors, cause the one or more processors to perform the following steps:

Convert the original voice information to obtain digital voice signals;

Details of one or more embodiments of the present application are set forth in the accompanying drawings and description below, and other features and advantages of the present application will become apparent from the description, the drawings, and the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

In order to explain the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are just some embodiments of the application. For those of ordinary skill in the art, other drawings can be obtained based on these drawings without creative effort.

FIG. 1 is an application environment diagram of a speech enhancement method in an embodiment of the present application;

FIG. 2 is a flowchart of a speech enhancement method according to an embodiment of the present application;

FIG. 3 is a specific flowchart of step S20 in FIG. 2;

FIG. 4 is a specific flowchart of step S22 in FIG. 3;

FIG. 5 is another flowchart of a speech enhancement method according to an embodiment of the present application;

FIG. 6 is a schematic diagram of a speech enhancement device according to an embodiment of the present application;

FIG. 7 is a schematic diagram of a computer device in an embodiment of the present application.

Detailed Description

In the following, the technical solutions in the embodiments of the present application will be clearly and completely described with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are part of the embodiments of the present application, but not all of the embodiments. Based on the embodiments in this application, all other embodiments obtained by a person of ordinary skill in the art without creative efforts shall fall within the protection scope of this application.

The speech enhancement method provided in this application can be applied in the application environment shown in FIG. 1, where a computer device communicates with a server through a network. Computer devices can be, but are not limited to, various personal computers, laptops, smartphones, tablets, and portable wearable devices. The server can be implemented as a stand-alone server.

The speech enhancement method can be applied to computer equipment configured by financial institutions such as banks, securities, insurance, or other institutions, and is used to perform speech enhancement on voice data before voiceprint recognition to improve recognition accuracy.

In one embodiment, as shown in FIG. 2, the speech enhancement method is applied to the server in FIG. 1 as an example for description, and includes the following steps:

S10: Convert the original voice information to obtain a digital voice signal.

The original voice information is the voice information of the speaker collected by the recording module (such as a microphone) of the front-end device. The original voice information may be voice information in wav, mp3, or other formats. Digital voice signals refer to discrete digital signals obtained by converting original voice information. Since computer equipment cannot directly process the original voice information, it can only process binary data, so the original voice information needs to be converted into digital voice signals.

Specifically, the server receives the original voice information sent by the front-end device and reads it by using a function for reading audio files provided by a Python module to obtain the digital voice signal. For example, the function for reading an audio file may be wave.open(file, "rb"), where file is the original voice information and "rb" denotes the read operation. The one-dimensional array of sample values read from the audio file is the digital voice signal. A Python module is a module containing a large number of encapsulated functions written in Python, an object-oriented, interpreted programming language. In this embodiment, a function for reading an audio file in the Python module is used to read the original voice information directly and obtain the digital voice signal, which is simple to implement.
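The reading step can be illustrated with a minimal sketch. It assumes a 16-bit mono PCM WAV file, and the function name read_wav_samples is hypothetical. Python's standard wave module reads WAV files only, so other formats such as mp3 would need to be converted first.

```python
import struct
import wave


def read_wav_samples(path):
    """Return (samples, sample_rate) for a 16-bit mono PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        # Assumption of this sketch: 2 bytes per sample, one channel.
        if wav_file.getsampwidth() != 2 or wav_file.getnchannels() != 1:
            raise ValueError("this sketch only handles 16-bit mono PCM")
        sample_rate = wav_file.getframerate()
        n_frames = wav_file.getnframes()
        raw = wav_file.readframes(n_frames)
    # WAV PCM data is little-endian; "h" is a signed 16-bit integer.
    samples = list(struct.unpack("<%dh" % n_frames, raw))
    return samples, sample_rate
```

The returned list of integers plays the role of the one-dimensional digital voice signal described above.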

S20: Decompose the digital voice signal by using the EEMD algorithm to obtain a first signal component.

The first signal component refers to an IMF (Intrinsic Mode Function) component obtained by decomposing the digital voice signal with the EEMD algorithm. The EEMD (Ensemble Empirical Mode Decomposition) algorithm is a noise-assisted data analysis algorithm that can effectively resolve the modal aliasing phenomenon, so that the decomposition result (the first signal components) clearly reflects the oscillation of the digital voice signal at different time scales or at different frequencies. Modal aliasing refers to the phenomenon that different modal components cannot be effectively separated according to the time scale, so that different modes appear in one modal component.

Because the digital voice signal is non-stationary, the EEMD algorithm is used to decompose it so that the resulting first signal components are more stationary, which helps suppress noise interference and makes the speech signal more accurate. Specifically, the server uses the EEMD algorithm to decompose the digital voice signal to obtain N (N is a positive integer) first signal components, and each first signal component represents the oscillation of the digital voice signal at a different time scale or at a different frequency.

S30: Perform a correlation calculation on the digital voice signal and the first signal component by using a correlation calculation formula to obtain a first correlation coefficient.

The first correlation coefficient is a calculation result obtained by performing correlation calculation on the digital voice signal and the first signal component. The first correlation coefficient may reflect the degree of correlation between the digital voice signal and the first signal component, and may also reflect the degree to which the first signal component contains an effective amount of information (voice information) in the digital voice signal.

Specifically, the correlation calculation formula is

$$r = \frac{\mathrm{Cov}(x, y)}{\sqrt{\mathrm{Var}[x]\,\mathrm{Var}[y]}}$$

where x is the digital voice signal, y is the first signal component, Cov(x, y) is the covariance of x and y, Var[x] is the variance of x, Var[y] is the variance of y, and r is the first correlation coefficient. Cov(x, y) is calculated as

$$\mathrm{Cov}(x, y) = \frac{1}{n}\sum_{j=1}^{n}\bigl(x_j - E(x)\bigr)\bigl(y_j - E(y)\bigr)$$

The calculation formula of Var[x] is $\mathrm{Var}[x] = E(x^2) - E^2(x)$, and the calculation formula of Var[y] is $\mathrm{Var}[y] = E(y^2) - E^2(y)$. Here, E(x) represents the expectation of the digital voice signal, E(y) represents the expectation of the first signal component, n represents the number of points of the first signal component on the time scale, $x_j$ represents the value of the digital voice signal at the j-th point on the time scale, and $y_j$ represents the value of the first signal component at the same point. In this embodiment, the first correlation coefficient may be a real number between 0 and 1. The closer the first correlation coefficient is to 1, the greater the correlation between the digital speech signal and the first signal component; conversely, the closer the first correlation coefficient is to 0, the smaller the correlation between the digital speech signal and the first signal component.
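As an illustration, the correlation calculation can be sketched in plain Python using the covariance and variance definitions given above. The function names are hypothetical, and the sketch assumes population statistics (division by n). It returns the signed coefficient, which in general can also be negative, whereas this embodiment describes values between 0 and 1.

```python
import math


def mean(values):
    """Arithmetic mean, used here as the expectation E(.)."""
    return sum(values) / len(values)


def correlation_coefficient(x, y):
    """r = Cov(x, y) / sqrt(Var[x] * Var[y]) for two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    ex, ey = mean(x), mean(y)
    cov = mean([(a - ex) * (b - ey) for a, b in zip(x, y)])
    var_x = mean([a * a for a in x]) - ex * ex  # Var[x] = E(x^2) - E^2(x)
    var_y = mean([b * b for b in y]) - ey * ey  # Var[y] = E(y^2) - E^2(y)
    if var_x <= 0 or var_y <= 0:
        # Assumption of this sketch: a constant sequence is treated as uncorrelated.
        return 0.0
    return cov / math.sqrt(var_x * var_y)
```

Here x would be the digital voice signal and y one first signal component of the same length.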

S40: Select a first signal component whose first correlation coefficient is greater than a preset threshold as the second signal component.

The preset threshold is a threshold defined in advance for screening the first signal component. The second signal component is a signal component obtained by performing a filtering operation on the first signal component by using a preset threshold.

Since the first correlation coefficient is a real number between 0 and 1, the preset threshold is also a real number between 0 and 1. If the first correlation coefficient is greater than the preset threshold, the correlation between the first signal component and the digital voice signal is large, and the first signal component contains a large amount of the effective information of the digital voice signal. If the first correlation coefficient is not greater than the preset threshold, the correlation between the first signal component and the digital voice signal is small, the first signal component contains little of the effective information of the digital voice signal, and it may be treated as noise by default. In this embodiment, the first signal components are screened so that those highly correlated with the digital voice signal are kept as second signal components, which reduces noise interference and further improves the accuracy of the voice signal. In addition, this screening method is simple to implement and can improve the efficiency of speech enhancement processing.
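The screening rule can be sketched as a simple filter. The function name select_components is hypothetical, and the threshold value in the usage note is purely illustrative, since the source does not state a concrete value.

```python
def select_components(components, coefficients, threshold):
    """Keep the components whose correlation coefficient exceeds the threshold."""
    if len(components) != len(coefficients):
        raise ValueError("one coefficient is required per component")
    # Strictly greater than: a coefficient equal to the threshold is discarded.
    return [c for c, r in zip(components, coefficients) if r > threshold]
```

For example, select_components(first_components, first_coefficients, 0.5) would return the second signal components for an assumed threshold of 0.5.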

S50: Perform integrated processing on the second signal component to obtain target voice information.

The target voice information is the relatively pure voice information obtained from the original voice information after integration processing. Integration processing is processing that restores signal components to voice information.

Specifically, the server uses the formula

$$Z = \frac{1}{N}\sum_{i=1}^{N} S_i^{2}$$

(N is a positive integer) to perform integration processing on the second signal components to obtain the target voice information, where $S_i$ represents the i-th second signal component, N represents the total number of second signal components, and Z represents the target voice information. That is, when the server performs integration processing on the second signal components, it first squares each second signal component and then averages the results to obtain the target voice information.
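The following sketch follows the verbal description above (square each second signal component, then average over the components) point by point on the time scale. The function name integrate_components is hypothetical, and the sketch assumes all components have the same length.

```python
def integrate_components(components):
    """Z[t] = (1/N) * sum over components of component[t] ** 2."""
    if not components:
        raise ValueError("at least one component is required")
    n = len(components)
    length = len(components[0])
    if any(len(c) != length for c in components):
        raise ValueError("all components must have the same length")
    return [sum(c[t] ** 2 for c in components) / n for t in range(length)]
```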

In this embodiment, a function for reading an audio file in the Python module is first used to read the original voice information directly and obtain the digital voice signal, so that acquiring the digital voice signal is simple and the efficiency of voice enhancement can be improved. Then, the EEMD algorithm is used to decompose the digital voice signal to obtain the first signal components, and the correlation calculation formula is used to calculate the correlation between the digital voice signal and each first signal component to obtain the first correlation coefficient. The first signal components whose first correlation coefficient is greater than the preset threshold, that is, those more strongly correlated with the digital speech signal, are selected as the second signal components to reduce noise interference and achieve the purpose of speech enhancement. Finally, the second signal components are integrated to obtain target speech information with higher accuracy. The speech enhancement method is simple to implement, can improve the processing efficiency of speech enhancement, and ensures that the accuracy of the acquired target speech information is high.

In an embodiment, as shown in FIG. 3, in step S20, the EEMD algorithm is used to decompose the digital voice signal to obtain a first signal component, which specifically includes the following steps:

S21: Add different normally distributed white noise sequences to the digital voice signal to obtain a voice signal to be processed.

The speech signal to be processed is the digital speech signal to which different normally distributed white noise sequences have been added. In this embodiment, the normally distributed white noise sequence refers to a Gaussian white noise sequence. Noise is called Gaussian white noise when its instantaneous value obeys a Gaussian distribution and its power spectral density is uniformly distributed. Here, the distribution of the instantaneous value is described by its probability density function, and the Gaussian distribution is the normal distribution.

In this embodiment, a different normally distributed white noise sequence is added to the digital voice signal each time to obtain the speech signals to be processed, so that the white noise is evenly distributed over the time-frequency space of the entire digital voice signal. That is, when a normally distributed white noise sequence is added to the digital voice signal, signal regions at different time scales are automatically mapped to the appropriate time scales associated with the white noise, which effectively resolves the modal aliasing phenomenon. In addition, the zero-mean property of Gaussian white noise is used so that the real digital voice signal is preserved, which improves the accuracy of voice enhancement.
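Step S21 can be sketched as follows, using zero-mean Gaussian noise from Python's standard random module. The function name, the number of noisy copies, the noise standard deviation, and the optional seed are all assumptions of this sketch; the source does not specify them.

```python
import random


def make_noisy_copies(signal, n_trials, noise_std, seed=None):
    """Return n_trials copies of signal, each with its own Gaussian white noise."""
    rng = random.Random(seed)  # a seed makes the sketch reproducible
    return [
        [sample + rng.gauss(0.0, noise_std) for sample in signal]
        for _ in range(n_trials)
    ]
```

Each returned list corresponds to one speech signal to be processed. Because the noise has zero mean, it tends to cancel when the results of many copies are averaged.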

S22: Perform EMD decomposition on the speech signal to be processed to obtain an intermediate signal component corresponding to the speech signal to be processed.

The intermediate signal component is an IMF component obtained by performing EMD decomposition on each to-be-processed voice signal. The EMD (Empirical Mode Decomposition) method performs signal decomposition based on the local time-scale characteristics of the signal. Specifically, EMD decomposition is performed on each to-be-processed voice signal to obtain the intermediate signal components corresponding to it. Because white noise has been added, the modal aliasing that easily occurs during decomposition can be effectively avoided, which makes the EMD decomposition more accurate and further improves the accuracy of speech enhancement.

S23: Perform an average operation on the intermediate signal components to obtain a first signal component.

Specifically, the server performs an averaging operation on the intermediate signal components corresponding to each to-be-processed voice signal to obtain the first signal components, using the mean calculation formula

$$M_j(t) = \frac{1}{N}\sum_{i=1}^{N} M_{ij}(t)$$

where $M_j(t)$ is the j-th first signal component, $M_{ij}(t)$ is the j-th intermediate signal component of the i-th to-be-processed voice signal, N is the number of intermediate signal components averaged for each first signal component, t is the time scale, and i is the subscript of the intermediate signal component.
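The averaging in step S23 can be sketched as below. The nested-list layout imfs_per_trial[i][j][t] (noisy copy i, intermediate signal component j, time index t) and the function name are assumptions of this sketch. Since EMD may yield a different number of components for different noisy copies, the sketch also assumes that only the components present in every copy are averaged.

```python
def average_imfs(imfs_per_trial):
    """Average the j-th intermediate component over all noisy copies."""
    n_trials = len(imfs_per_trial)
    # Assumption: truncate to the smallest component count across copies.
    n_imfs = min(len(imfs) for imfs in imfs_per_trial)
    length = len(imfs_per_trial[0][0])
    averaged = []
    for j in range(n_imfs):
        averaged.append([
            sum(imfs_per_trial[i][j][t] for i in range(n_trials)) / n_trials
            for t in range(length)
        ])
    return averaged
```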

In this embodiment, different normally distributed white noise sequences are added to the digital voice signal to obtain the to-be-processed voice signals, so that the white noise is evenly distributed over the time-frequency space of the entire digital voice signal. This helps resolve the modal aliasing phenomenon, and the zero-mean property of Gaussian white noise allows the real digital speech signal to be preserved, which improves the accuracy of speech enhancement. Then, EMD decomposition is performed on each speech signal to be processed to obtain the corresponding intermediate signal components. Because different normally distributed white noise sequences have been added to the digital speech signal, the deficiency of EMD decomposition (that is, the modal aliasing phenomenon) is overcome, and the accuracy of EMD decomposition is therefore improved. Finally, an averaging operation is performed on the intermediate signal components to obtain the first signal components. The calculation process is simple and can improve the processing efficiency of the voice information.

In an embodiment, as shown in FIG. 4, in step S22, the EMD decomposition of the speech signal to be processed to obtain an intermediate signal component corresponding to the speech signal to be processed specifically includes the following steps:

S221: Obtain the local extreme points of the speech signal to be processed. The local extreme points include maximum points and minimum points.

The speech signal to be processed includes a plurality of local extreme points, where a local extreme point is an extreme point of the speech signal to be processed within some time range of the entire time domain. The local extreme points include local maximum points and local minimum points. Specifically, the function formed by the speech signal to be processed in each time range is differentiated, and the points at which the derivative is 0 are the local extreme points. For example, let the speech signal to be processed be x(t), t ∈ T, where T is the entire time domain; when x'(t) = 0, the value of x(t) at the corresponding t is a local extreme point.
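For a sampled signal the derivative is not available directly, so a common discrete substitute is to compare each sample with its two neighbours. The sketch below takes that approach, which is an assumption rather than the method stated in the source. The function name is hypothetical, and flat plateaus and the two end points are ignored for simplicity.

```python
def find_local_extrema(signal):
    """Return (maxima_indices, minima_indices) of a sampled signal."""
    maxima, minima = [], []
    for i in range(1, len(signal) - 1):
        prev, cur, nxt = signal[i - 1], signal[i], signal[i + 1]
        if cur > prev and cur > nxt:
            maxima.append(i)  # strictly higher than both neighbours
        elif cur < prev and cur < nxt:
            minima.append(i)  # strictly lower than both neighbours
    return maxima, minima
```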

S222: Construct an upper envelope based on the maximum points of all local extreme points, and construct a lower envelope based on the minimum points of all local extreme points.

The envelope refers to the curve, corresponding to the low-frequency modulation signal, that is obtained by connecting the peak points of a high-frequency amplitude-modulated (AM) signal. A high-frequency AM signal is a signal whose amplitude changes according to the low-frequency modulation signal, and the low-frequency modulation signal is a low-frequency signal converted from the original information. The upper envelope is a smooth curve obtained by fitting all the maximum points with a spline function, and the lower envelope is a smooth curve obtained by fitting all the minimum points with a spline function. A spline function usually refers to a piecewise-defined polynomial parametric curve. Fitting the maximum points or the minimum points with a spline function has the advantages of simple construction, convenient use, and accurate fitting. Specifically, the upper envelope can be obtained by fitting all the maximum points with the built-in spline function in Matlab, and the lower envelope can be obtained by fitting all the minimum points with the same function. Drawing the envelopes makes the curve of the speech signal to be processed in the time domain smoother and clearer. Matlab is application software for numerical computation.
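To illustrate the envelope fitting without Matlab, the sketch below implements a natural cubic spline in plain Python and evaluates it at every sample index. The natural boundary conditions (zero second derivative at the end knots) are an assumption of this sketch, and the Matlab function mentioned above may use different end conditions. Function names are hypothetical, and at least three extreme points are required.

```python
import bisect


def natural_cubic_spline(xs, ys):
    """Return a function evaluating the natural cubic spline through (xs, ys)."""
    n = len(xs)
    if n < 3:
        raise ValueError("need at least three knots")
    h = [xs[i + 1] - xs[i] for i in range(n - 1)]
    # Tridiagonal system for the interior second derivatives m[1..n-2].
    diag = [2.0 * (h[i - 1] + h[i]) for i in range(1, n - 1)]
    rhs = [6.0 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])
           for i in range(1, n - 1)]
    for k in range(1, n - 2):  # forward elimination (Thomas algorithm)
        w = h[k] / diag[k - 1]
        diag[k] -= w * h[k]
        rhs[k] -= w * rhs[k - 1]
    m = [0.0] * n  # natural spline: m[0] = m[n-1] = 0
    for k in range(n - 3, -1, -1):  # back substitution
        m[k + 1] = (rhs[k] - h[k + 1] * m[k + 2]) / diag[k]

    def evaluate(x):
        # Pick the segment containing x; end segments are used to extrapolate.
        i = min(max(bisect.bisect_right(xs, x) - 1, 0), n - 2)
        a, b = xs[i + 1] - x, x - xs[i]
        return ((m[i] * a ** 3 + m[i + 1] * b ** 3) / (6.0 * h[i])
                + (ys[i] / h[i] - m[i] * h[i] / 6.0) * a
                + (ys[i + 1] / h[i] - m[i + 1] * h[i] / 6.0) * b)

    return evaluate


def spline_envelope(signal, knot_indices):
    """Fit a spline through signal[knot_indices] and sample it at every index."""
    spline = natural_cubic_spline([float(i) for i in knot_indices],
                                  [signal[i] for i in knot_indices])
    return [spline(float(t)) for t in range(len(signal))]
```

Passing the indices of the maximum points gives an upper envelope, and passing the indices of the minimum points gives a lower envelope.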

S223: Obtain an average value corresponding to the upper and lower envelopes based on the upper and lower envelopes.

Specifically, the formula

$$P = \frac{s_1(t) + s_2(t)}{2}$$

is used to calculate the mean value corresponding to the upper and lower envelopes, where P is the mean value, $s_1(t)$ represents the upper envelope as it varies with time t, and $s_2(t)$ represents the lower envelope as it varies with time t. In this embodiment, the corresponding mean value is obtained based on the upper envelope and the lower envelope, which provides technical support for the subsequent screening of the initial signal components.

S224: Obtain an initial signal component based on the speech signal to be processed and the average value. If the initial signal component meets a preset condition, the initial signal component is an intermediate signal component.

The preset condition is a condition set in advance for filtering signal components. There are two preset conditions: first, the number of extreme points of the signal and the number of zero crossings are equal or differ by at most one; second, the mean of the upper and lower envelopes is zero. The number of extreme points is the total number of local maxima and local minima. In this embodiment, only an initial signal component that meets both preset conditions can be used as an intermediate signal component. This process can effectively decompose the noisy voice signal to obtain a purer voice signal and achieve the purpose of voice enhancement.
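The two preset conditions can be checked as in the sketch below. Function names are hypothetical. Because an envelope mean computed numerically is rarely exactly zero, the sketch assumes a small tolerance parameter, which is not part of the source.

```python
def count_zero_crossings(signal):
    """Number of sign changes between consecutive samples."""
    return sum(1 for a, b in zip(signal, signal[1:]) if a * b < 0)


def count_extrema(signal):
    """Number of strict local maxima plus strict local minima."""
    return sum(
        1
        for prev, cur, nxt in zip(signal, signal[1:], signal[2:])
        if (cur > prev and cur > nxt) or (cur < prev and cur < nxt)
    )


def is_intermediate_component(candidate, envelope_mean, tolerance):
    """Check both preset conditions for an initial signal component."""
    counts_ok = abs(count_extrema(candidate) - count_zero_crossings(candidate)) <= 1
    mean_ok = all(abs(value) <= tolerance for value in envelope_mean)
    return counts_ok and mean_ok
```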

Specifically, the formula $h_0(t) = s(t) - m_0(t)$ is used to process the speech signal to be processed and the mean value to obtain the initial signal component, where $h_0(t)$ is the initial signal component, $s(t)$ is the speech signal to be processed, $m_0(t)$ is the mean value, and t is the time scale. If the initial signal component meets the preset condition, it is used as the first intermediate signal component. If it does not, the initial signal component is used as the new speech signal to be processed (that is, $h_0(t)$ is taken as $s(t)$), and steps S221 to S223 are repeated until the first intermediate signal component satisfying the preset condition is obtained. Then, set $r_1(t) = s(t) - c_1(t)$, where $r_1(t)$ is the new speech signal to be processed and $c_1(t)$ is the first intermediate signal component, and repeat steps S221 to S224 to obtain the second intermediate signal component. The above steps are repeated until the obtained initial signal component is a monotonic signal or its value is smaller than a first threshold, at which point the loop ends. The first threshold is a predefined threshold for stopping the loop. Finally, N intermediate signal components are obtained after multiple cycles, and the speech signal to be processed can be expressed as

$$s(t) = \sum_{k=1}^{N} c_k(t) + r_N(t)$$

where $c_k(t)$ is the k-th intermediate signal component, and $r_N(t)$ is the remaining signal, that is, the initial signal component that is monotonic or whose value is less than the first threshold.
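One pass of the procedure above (take the mean of the two envelopes, then subtract it from the signal) can be sketched as follows. The function names are hypothetical, and the envelopes are assumed to be already sampled at the same time points as the signal.

```python
def envelope_mean(upper, lower):
    """Point-by-point mean of the upper and lower envelopes."""
    return [(u + l) / 2.0 for u, l in zip(upper, lower)]


def sift_once(signal, upper, lower):
    """Initial signal component: the signal minus the mean of its envelopes."""
    mean_curve = envelope_mean(upper, lower)
    return [s - m for s, m in zip(signal, mean_curve)]


def residual(signal, component):
    """New signal to be processed after removing an extracted component."""
    return [s - c for s, c in zip(signal, component)]
```

In a full decomposition, sift_once would be repeated until the preset conditions hold, and residual would then supply the signal for extracting the next intermediate signal component.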

In this embodiment, the local extreme points of the speech signal to be processed are obtained, including local maximum points and local minimum points, so that an upper envelope is constructed based on the maximum points and a lower envelope is constructed based on the minimum points, which makes the curve of the speech signal to be processed in the time domain smoother and clearer. Then, the mean value corresponding to the upper and lower envelopes is obtained, and the initial signal component is obtained based on the speech signal to be processed and the mean value. If the initial signal component meets the preset conditions, it is taken as an intermediate signal component, which makes the signal more stationary. If the initial signal component does not meet the preset conditions, it is used as a new voice signal to be processed, and the loop is repeated on the new voice signal to be processed until N intermediate signal components are obtained. This decomposition process can effectively decompose the noisy voice signal to obtain a relatively pure voice signal and achieve the purpose of voice enhancement.

In an embodiment, as shown in FIG. 5, the voice enhancement method further includes the following steps:

S411: Decompose the second signal component by using the EEMD algorithm to obtain a secondary decomposition signal component.

In order to achieve finer noise reduction and a better speech enhancement effect, so that speech recognition is more accurate, in this embodiment the EEMD algorithm is used to perform a secondary decomposition on the second signal component to obtain the secondary decomposition signal components. The process of decomposing the second signal component with the EEMD algorithm is the same as step S20, and details are not described herein again.

S412: Perform a correlation calculation on the digital speech signal and the secondary decomposition signal component to obtain a second correlation coefficient.

The second correlation coefficient is obtained by performing a correlation calculation on the digital voice signal and the secondary decomposition signal component, and reflects the degree of correlation between them. Specifically, the correlation calculation formula is

$$r_2 = \frac{\mathrm{Cov}(a, b)}{\sqrt{\mathrm{Var}[a]\,\mathrm{Var}[b]}}$$

where a is the digital voice signal, b is the secondary decomposition signal component, Cov(a, b) is the covariance of a and b, Var[a] is the variance of a, Var[b] is the variance of b, and $r_2$ is the second correlation coefficient. The calculation formulas of the covariance and the variance are the same as those in step S30. To avoid repetition, details are not described herein again. In this embodiment, the second correlation coefficient is calculated so that the secondary decomposition signal components more strongly correlated with the digital voice signal can be selected, which allows finer noise reduction of the digital voice signal and improves the accuracy of the speech signal.

S413: Select a secondary decomposition signal component whose second correlation coefficient is greater than a preset threshold as the updated second signal component.

The preset threshold is a threshold defined in advance for screening the secondary decomposition signal components. The preset threshold is the same as the preset threshold in step S40.

Specifically, the second correlation coefficient is a real number between 0 and 1. If the second correlation coefficient is greater than the preset threshold, the correlation between the secondary decomposition signal component and the digital voice signal is large, and the signal component contains a large amount of the effective information of the digital voice signal. If the second correlation coefficient is not greater than the preset threshold, the correlation between the secondary decomposition signal component and the digital voice signal is small, the signal component contains little effective information, and it may be treated as noise by default. In this embodiment, the secondary decomposition signal components are screened so that those more strongly correlated with the digital voice signal are kept as the updated second signal components, which reduces noise interference and further improves the accuracy of the voice signal. In addition, this screening method is simple to implement and improves the efficiency of speech enhancement.

In this embodiment, the EEMD algorithm is first used to decompose the second signal component to obtain the secondary decomposition signal components, and a correlation calculation is then performed on the digital voice signal and each secondary decomposition signal component to obtain the second correlation coefficient. The secondary decomposition signal components whose second correlation coefficient is greater than the preset threshold are selected as the updated second signal components, which are subsequently integrated to obtain the target speech information. This process performs finer noise reduction on the speech signal and yields purer speech information, making voiceprint recognition more accurate.
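The second pass (steps S411 to S413) can be sketched as a small higher-order function. The decomposition and correlation routines are passed in as callables because their concrete implementations are outside this sketch. All names here are hypothetical.

```python
def refine_components(signal, second_components, decompose, correlate, threshold):
    """Decompose each second signal component again and keep correlated parts."""
    refined = []
    for component in second_components:
        # decompose: callable returning a list of sub-components (for example, EEMD).
        for part in decompose(component):
            # correlate: callable returning the second correlation coefficient.
            if correlate(signal, part) > threshold:
                refined.append(part)
    return refined
```

The returned list plays the role of the updated second signal components that are then passed to the integration step.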

It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.

In one embodiment, FIG. 6 shows a schematic diagram of a speech enhancement device corresponding to the speech enhancement method in the above embodiment. As shown in FIG. 6, the voice enhancement device includes a digital voice signal acquisition module 10, a first signal component acquisition module 20, a first correlation coefficient acquisition module 30, a second signal component acquisition module 40, and a target voice information acquisition module 50. The detailed description of each function module is as follows:

The digital voice signal acquisition module 10 is configured to convert the original voice information to obtain a digital voice signal.

The first signal component acquisition module 20 is configured to decompose a digital voice signal by using an EEMD algorithm to acquire a first signal component.

A first correlation coefficient acquisition module 30 is configured to perform a correlation calculation on a digital voice signal and a first signal component by using a correlation calculation formula to obtain a first correlation coefficient.

The second signal component acquisition module 40 is configured to select, as the second signal component, a first signal component whose first correlation coefficient is greater than a preset threshold.

The target voice information acquisition module 50 is configured to perform integrated processing on the second signal component to acquire target voice information.
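As a rough sketch of how modules 20 to 50 fit together, the function below chains decomposition, correlation, screening and integration. It assumes the signal has already been digitized (module 10), and `enhance`, `decompose` and `correlate` are hypothetical names: the decomposition and correlation routines are supplied by the caller rather than implemented here.

```python
def enhance(signal, decompose, correlate, threshold):
    """Sketch of modules 20-50 applied to an already digitized signal."""
    # Module 20: decompose the digital voice signal into first signal components.
    first = decompose(signal)
    # Module 30: one correlation coefficient per first signal component.
    coeffs = [correlate(signal, c) for c in first]
    # Module 40: keep components whose coefficient exceeds the preset threshold.
    second = [c for c, r in zip(first, coeffs) if r > threshold]
    if not second:
        return [0.0] * len(signal)  # assumption: nothing kept gives silence
    # Module 50: integrate the kept components by sample-wise summation.
    return [sum(values) for values in zip(*second)]
```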

Specifically, the first signal component acquisition module 20 is configured to include a to-be-processed voice signal acquisition unit 21, an intermediate signal component acquisition unit 22, and a first signal component acquisition unit 23.

The to-be-processed voice signal obtaining unit 21 is configured to add different normally distributed white noise sequences to the digital voice signal to obtain the to-be-processed voice signal.

The intermediate signal component obtaining unit 22 is configured to perform EMD decomposition on the speech signal to be processed, and obtain an intermediate signal component corresponding to the speech signal to be processed.

The first signal component acquiring unit 23 is configured to perform an averaging operation on the intermediate signal component to acquire a first signal component.
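The three units above (add white noise, decompose each noisy copy with EMD, average the results) can be sketched as below. This is an illustration under stated assumptions: `eemd`, `trials`, `noise_std` and `seed` are hypothetical names, the number of noise sequences and the noise amplitude are placeholder values not taken from the application, and the EMD routine is passed in as the callable `decompose`.

```python
import random


def eemd(signal, decompose, trials=50, noise_std=0.2, seed=0):
    """Average the components obtained from several noise-perturbed copies."""
    rng = random.Random(seed)
    n = len(signal)
    sums = []  # sums[k][t]: running sum of the k-th component at sample t
    for _ in range(trials):
        # Unit 21: add a fresh normally distributed white noise sequence.
        noisy = [s + rng.gauss(0.0, noise_std) for s in signal]
        # Unit 22: EMD of the noisy copy gives intermediate signal components.
        for k, component in enumerate(decompose(noisy)):
            if k == len(sums):
                sums.append([0.0] * n)
            for t in range(n):
                sums[k][t] += component[t]
    # Unit 23: average over trials. A trial that produced fewer components
    # contributes zero to the missing ones (a simplification of this sketch).
    return [[v / trials for v in row] for row in sums]
```

Because the added noise sequences are independent and zero-mean, their contribution tends to cancel in the average while the signal content remains.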

Specifically, the intermediate signal component acquisition unit 22 includes a local extreme point acquisition subunit 221, an envelope construction subunit 222, a mean acquisition subunit 223, and an intermediate signal component acquisition subunit 224.

The local extreme point acquisition subunit 221 is configured to acquire a local extreme point of a speech signal to be processed, and each local extreme point includes a maximum point and a minimum point.

The envelope construction sub-unit 222 is configured to construct an upper envelope based on a local maximum point among all local extreme points, and a lower envelope based on a local minimum point among all local extreme points.

The average value obtaining subunit 223 is configured to obtain an average value corresponding to the upper envelope line and the lower envelope line based on the upper envelope line and the lower envelope line.

The intermediate signal component acquisition subunit 224 is configured to obtain an initial signal component based on the speech signal to be processed and the average value. If the initial signal component meets a preset condition, the initial signal component is an intermediate signal component.
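A single pass through subunits 221 to 224 can be sketched as follows. The function names are hypothetical, and the envelopes are drawn by linear interpolation purely to keep the sketch dependency-free; cubic splines are the more common choice in EMD. The check of the preset condition on the initial signal component is left out.

```python
from bisect import bisect_right


def local_extrema(x):
    """Return (maxima, minima): indices of strict local maxima and minima."""
    maxima, minima = [], []
    for i in range(1, len(x) - 1):
        if x[i - 1] < x[i] > x[i + 1]:
            maxima.append(i)
        elif x[i - 1] > x[i] < x[i + 1]:
            minima.append(i)
    return maxima, minima


def envelope(x, idx):
    """Piecewise-linear envelope through (i, x[i]) for i in idx.

    Outside the first and last extremum the envelope is held constant.
    """
    env = []
    for k in range(len(x)):
        if k <= idx[0]:
            env.append(float(x[idx[0]]))
        elif k >= idx[-1]:
            env.append(float(x[idx[-1]]))
        else:
            j = bisect_right(idx, k)  # idx[j - 1] <= k < idx[j]
            a, b = idx[j - 1], idx[j]
            t = (k - a) / (b - a)
            env.append(x[a] + t * (x[b] - x[a]))
    return env


def sift_once(x):
    """One sifting step: subtract the mean of the upper and lower envelopes."""
    maxima, minima = local_extrema(x)
    if not maxima or not minima:
        return None  # too few extrema to build both envelopes
    upper = envelope(x, maxima)
    lower = envelope(x, minima)
    mean = [(u + l) / 2 for u, l in zip(upper, lower)]
    return [v - m for v, m in zip(x, mean)]
```

In a full EMD, this step would be repeated on its own output until the preset condition holds, and the accepted component would then be subtracted from the signal before extracting the next one.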

Specifically, the correlation calculation formula is

$$r = \frac{\mathrm{Cov}(x, y)}{\sqrt{\mathrm{Var}[x]\,\mathrm{Var}[y]}}$$

where x is the digital voice signal, y is the first signal component, Cov(x, y) is the covariance of x and y, Var[x] is the variance of x, Var[y] is the variance of y, and r is the first correlation coefficient.
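A correlation coefficient defined from the covariance of x and y and the variances of x and y can be computed as in the sketch below, which assumes the usual form of covariance divided by the square root of the product of the variances. The function name `pearson` and the handling of constant sequences are assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """r = Cov(x, y) / sqrt(Var[x] * Var[y]) for equal-length sequences."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("x and y must be non-empty and of equal length")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    var_x = sum((a - mx) ** 2 for a in x) / n
    var_y = sum((b - my) ** 2 for b in y) / n
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant sequence as uncorrelated
    return cov / sqrt(var_x * var_y)
```

In general this coefficient can also be negative. A negative value falls below any positive preset threshold, so such a component would be screened out in the same way as a weakly correlated one.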

Specifically, the voice enhancement device further includes a binary decomposition signal component acquisition unit 411, a second correlation coefficient acquisition unit 412, and a second signal component update unit 413.

The binary decomposition signal component obtaining unit 411 is configured to decompose the second signal component by using the EEMD algorithm to obtain a binary decomposition signal component.

A second correlation coefficient acquisition unit 412 is configured to perform correlation calculation on the digital speech signal and the binary decomposition signal component to obtain a second correlation coefficient.

The second signal component updating unit 413 is configured to select a binary decomposition signal component whose second correlation coefficient is greater than a preset threshold as the updated second signal component.

Specifically, the target voice information acquisition module 50 uses the formula

$$Z = \sum_{n=1}^{N} S_n$$

(N is a positive integer) to perform integration processing on the second signal components to obtain the target voice information, where S_n represents the n-th second signal component, N represents the total number of second signal components, and Z represents the target voice information.

For the specific limitations of the speech enhancement device, refer to the foregoing limitations on the speech enhancement method; details are not repeated here. Each module in the above voice enhancement device may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in hardware form in, or be independent of, the processor of the computer device, or may be stored in software form in the memory of the computer device, so that the processor can call and execute the operations corresponding to the above modules.

In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure diagram may be as shown in FIG. 7. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device is used to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for the operation of the operating system and computer-readable instructions in a non-volatile storage medium. The database of the computer device is used to store data generated or obtained during the execution of the speech enhancement method, such as a target speech signal. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer-readable instructions are executed by a processor to implement a speech enhancement method.

In one embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When executing the computer-readable instructions, the processor implements the following steps: converting the original voice information to obtain a digital voice signal; decomposing the digital voice signal by using the EEMD algorithm to obtain a first signal component; performing a correlation calculation on the digital voice signal and the first signal component by using a correlation calculation formula to obtain a first correlation coefficient; selecting a first signal component whose first correlation coefficient is greater than a preset threshold as the second signal component; and performing integrated processing on the second signal component to obtain target voice information.

Specifically, the correlation calculation formula is

$$r = \frac{\mathrm{Cov}(x, y)}{\sqrt{\mathrm{Var}[x]\,\mathrm{Var}[y]}}$$

where x is the digital voice signal, y is the first signal component, Cov(x, y) is the covariance of x and y, Var[x] is the variance of x, Var[y] is the variance of y, and r is the first correlation coefficient.

In one embodiment, when the processor executes the computer-readable instructions, the following steps are further implemented: adding different normally distributed white noise sequences to the digital voice signal to obtain a voice signal to be processed; performing EMD decomposition on the voice signal to be processed to obtain an intermediate signal component corresponding to the voice signal to be processed; and performing an averaging operation on the intermediate signal component to obtain the first signal component.

In one embodiment, when the processor executes the computer-readable instructions, the following steps are further implemented: obtaining local extreme points of the speech signal to be processed, the local extreme points including local maximum points and local minimum points; constructing an upper envelope based on the local maximum points among all local extreme points, and a lower envelope based on the local minimum points among all local extreme points; obtaining, based on the upper envelope and the lower envelope, the average value corresponding to the upper envelope and the lower envelope; and obtaining an initial signal component based on the speech signal to be processed and the average value, wherein if the initial signal component meets a preset condition, the initial signal component is an intermediate signal component.

In one embodiment, when the processor executes the computer-readable instructions, the following steps are further implemented: decomposing the second signal component by using the EEMD algorithm to obtain a binary decomposition signal component; performing a correlation calculation on the digital voice signal and the binary decomposition signal component to obtain a second correlation coefficient; and selecting a binary decomposition signal component whose second correlation coefficient is greater than a preset threshold as the updated second signal component.

In one embodiment, when the processor executes the computer-readable instructions, the following steps are further implemented: using the formula

$$Z = \sum_{n=1}^{N} S_n$$

(N is a positive integer) to perform integration processing on the second signal components to obtain the target voice information, where S_n represents the n-th second signal component, N represents the total number of second signal components, and Z represents the target voice information.

In one embodiment, one or more non-volatile readable storage media storing computer-readable instructions are provided. When the computer-readable instructions are executed by one or more processors, the one or more processors implement the following steps: converting the original voice information to obtain a digital voice signal; decomposing the digital voice signal by using the EEMD algorithm to obtain a first signal component; performing a correlation calculation on the digital voice signal and the first signal component by using a correlation calculation formula to obtain a first correlation coefficient; selecting a first signal component whose first correlation coefficient is greater than a preset threshold as the second signal component; and performing integrated processing on the second signal component to obtain target voice information.

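Specifically, the correlation calculation formula is

$$r = \frac{\mathrm{Cov}(x, y)}{\sqrt{\mathrm{Var}[x]\,\mathrm{Var}[y]}}$$

where x is the digital voice signal, y is the first signal component, Cov(x, y) is the covariance of x and y, Var[x] is the variance of x, Var[y] is the variance of y, and r is the first correlation coefficient.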

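In one embodiment, when the computer-readable instructions are executed by one or more processors, the one or more processors further implement the following steps: adding different normally distributed white noise sequences to the digital voice signal to obtain a voice signal to be processed; performing EMD decomposition on the voice signal to be processed to obtain an intermediate signal component corresponding to the voice signal to be processed; and performing an averaging operation on the intermediate signal component to obtain the first signal component.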

In one embodiment, when the computer-readable instructions are executed by one or more processors, the one or more processors further implement the following steps: obtaining local extreme points of the speech signal to be processed, the local extreme points including local maximum points and local minimum points; constructing an upper envelope based on the local maximum points among all local extreme points, and a lower envelope based on the local minimum points among all local extreme points; obtaining, based on the upper envelope and the lower envelope, the average value of the upper envelope and the lower envelope; and obtaining an initial signal component based on the speech signal to be processed and the average value, wherein if the initial signal component meets a preset condition, the initial signal component is an intermediate signal component.

In one embodiment, when the computer-readable instructions are executed by one or more processors, the one or more processors further implement the following steps: decomposing the second signal component by using the EEMD algorithm to obtain a binary decomposition signal component; performing a correlation calculation on the digital speech signal and the binary decomposition signal component to obtain a second correlation coefficient; and selecting a binary decomposition signal component whose second correlation coefficient is greater than a preset threshold as the updated second signal component.

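In one embodiment, when the computer-readable instructions are executed by one or more processors, the one or more processors further implement the following steps: using the formula

$$Z = \sum_{n=1}^{N} S_n$$

(N is a positive integer) to perform integration processing on the second signal components to obtain the target voice information, where S_n represents the n-th second signal component, N represents the total number of second signal components, and Z represents the target voice information.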

Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by computer-readable instructions instructing the related hardware. The computer-readable instructions can be stored in a non-volatile computer-readable storage medium, and when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Those skilled in the art can clearly understand that, for convenience and brevity of description, the above division into functional units and modules is used only as an example. In practical applications, the above functions can be assigned to different functional units and modules as required; that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above.

The above embodiments are only used to describe the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that they can still modify the technical solutions described in the foregoing embodiments, or make equivalent replacements for some of the technical features. Such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application, and shall be included within the protection scope of this application.

Claims

A speech enhancement method, comprising:

Converting original voice information to obtain a digital voice signal;

Decomposing the digital voice signal by using an EEMD algorithm to obtain a first signal component;

Performing a correlation calculation on the digital voice signal and the first signal component by using a correlation calculation formula to obtain a first correlation coefficient;

Selecting a first signal component whose first correlation coefficient is greater than a preset threshold as the second signal component;

Performing integration processing on the second signal component to obtain target voice information.
The method of claim 1, wherein the step of decomposing the digital voice signal by using an EEMD algorithm to obtain a first signal component comprises:

Adding different normally distributed white noise sequences to the digital voice signal to obtain a voice signal to be processed;

Performing EMD decomposition on the speech signal to be processed to obtain an intermediate signal component corresponding to the speech signal to be processed;

Performing an averaging operation on the intermediate signal component to obtain the first signal component.
The speech enhancement method according to claim 2, wherein the performing EMD decomposition on the speech signal to be processed to obtain an intermediate signal component corresponding to the speech signal to be processed comprises:

Obtaining local extreme points of the speech signal to be processed, each local extreme point including a maximum point and a minimum point;

Constructing an upper envelope based on the maximum points among all local extreme points and a lower envelope based on the minimum points among all local extreme points;

Obtaining an average value corresponding to the upper envelope and the lower envelope based on the upper envelope and the lower envelope;

An initial signal component is obtained based on the speech signal to be processed and the average value. If the initial signal component meets a preset condition, the initial signal component is an intermediate signal component.
The speech enhancement method according to claim 1, wherein the correlation calculation formula is

$$r = \frac{\mathrm{Cov}(x, y)}{\sqrt{\mathrm{Var}[x]\,\mathrm{Var}[y]}}$$

where x is the digital voice signal, y is the first signal component, Cov(x, y) is the covariance of x and y, Var[x] is the variance of x, Var[y] is the variance of y, and r is the first correlation coefficient.
The speech enhancement method according to claim 1, wherein after the step of selecting the first signal component whose first correlation coefficient is greater than a preset threshold as the second signal component, the speech enhancement method further comprises:

Decomposing the second signal component by using the EEMD algorithm to obtain a binary decomposition signal component;

Performing a correlation calculation on the digital speech signal and the binary decomposition signal component to obtain a second correlation coefficient;

Selecting a binary decomposition signal component whose second correlation coefficient is greater than a preset threshold as the updated second signal component.
The speech enhancement method according to claim 1, wherein the performing integrated processing on the second signal component to obtain target speech information comprises:

Using the formula

$$Z = \sum_{n=1}^{N} S_n$$

to perform integration processing on the second signal components to obtain the target voice information; wherein S_n represents the n-th second signal component, N is a positive integer representing the total number of second signal components, and Z represents the target voice information.
A speech enhancement device, comprising:

A digital voice signal acquisition module, configured to convert original voice information to obtain a digital voice signal;

A first signal component acquisition module, configured to decompose the digital voice signal by using an EEMD algorithm to acquire a first signal component;

A first correlation coefficient acquisition module, configured to perform a correlation calculation on the digital voice signal and the first signal component by using a correlation calculation formula to obtain a first correlation coefficient;

A second signal component acquisition module, configured to select, as the second signal component, a first signal component whose first correlation coefficient is greater than a preset threshold;

A target voice information acquisition module, configured to perform integrated processing on the second signal component to acquire target voice information.
The voice enhancement device according to claim 7, wherein the voice enhancement device further comprises:

A binary decomposition signal component obtaining unit, configured to decompose the second signal component by using an EEMD algorithm to obtain a binary decomposition signal component;

A second correlation coefficient obtaining unit, configured to perform a correlation calculation on the digital speech signal and the binary decomposition signal component to obtain a second correlation coefficient;

A second signal component updating unit, configured to select a binary decomposition signal component whose second correlation coefficient is greater than a preset threshold as the updated second signal component.
A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:

Converting original voice information to obtain a digital voice signal;

Decomposing the digital voice signal by using an EEMD algorithm to obtain a first signal component;

Performing a correlation calculation on the digital voice signal and the first signal component by using a correlation calculation formula to obtain a first correlation coefficient;

Selecting a first signal component whose first correlation coefficient is greater than a preset threshold as the second signal component;

Performing integration processing on the second signal component to obtain target voice information.
The computer device according to claim 9, wherein the using the EEMD algorithm to decompose the digital voice signal to obtain a first signal component comprises:

Adding different normally distributed white noise sequences to the digital voice signal to obtain a voice signal to be processed;

Performing EMD decomposition on the speech signal to be processed to obtain an intermediate signal component corresponding to the speech signal to be processed;

Performing an averaging operation on the intermediate signal component to obtain the first signal component.
The computer device according to claim 10, wherein performing EMD decomposition on the speech signal to be processed to obtain an intermediate signal component corresponding to the speech signal to be processed comprises:

Obtaining local extreme points of the speech signal to be processed, each local extreme point including a maximum point and a minimum point;

Constructing an upper envelope based on the maximum points among all local extreme points and a lower envelope based on the minimum points among all local extreme points;

Obtaining an average value corresponding to the upper envelope and the lower envelope based on the upper envelope and the lower envelope;

An initial signal component is obtained based on the speech signal to be processed and the average value. If the initial signal component meets a preset condition, the initial signal component is an intermediate signal component.
The computer device according to claim 9, wherein the correlation calculation formula is

$$r = \frac{\mathrm{Cov}(x, y)}{\sqrt{\mathrm{Var}[x]\,\mathrm{Var}[y]}}$$

where x is the digital voice signal, y is the first signal component, Cov(x, y) is the covariance of x and y, Var[x] is the variance of x, Var[y] is the variance of y, and r is the first correlation coefficient.
The computer device according to claim 9, wherein after the step of selecting the first signal component whose first correlation coefficient is greater than a preset threshold as the second signal component, the processor further implements the following steps when executing the computer-readable instructions:

Decomposing the second signal component by using the EEMD algorithm to obtain a binary decomposition signal component;

Performing a correlation calculation on the digital speech signal and the binary decomposition signal component to obtain a second correlation coefficient;

Selecting a binary decomposition signal component whose second correlation coefficient is greater than a preset threshold as the updated second signal component.
The computer device according to claim 9, wherein the performing integrated processing on the second signal component to obtain target voice information comprises:

Using the formula

$$Z = \sum_{n=1}^{N} S_n$$

to perform integration processing on the second signal components to obtain the target voice information; wherein S_n represents the n-th second signal component, N is a positive integer representing the total number of second signal components, and Z represents the target voice information.
One or more non-volatile readable storage media storing computer-readable instructions, wherein when the computer-readable instructions are executed by one or more processors, the one or more processors are caused to execute the following steps:

Converting original voice information to obtain a digital voice signal;

Decomposing the digital voice signal by using an EEMD algorithm to obtain a first signal component;

Performing a correlation calculation on the digital voice signal and the first signal component by using a correlation calculation formula to obtain a first correlation coefficient;

Selecting a first signal component whose first correlation coefficient is greater than a preset threshold as the second signal component;

Performing integration processing on the second signal component to obtain target voice information.
The non-volatile readable storage medium according to claim 15, wherein the step of decomposing the digital voice signal by using an EEMD algorithm to obtain a first signal component comprises:

Adding different normally distributed white noise sequences to the digital voice signal to obtain a voice signal to be processed;

Performing EMD decomposition on the speech signal to be processed to obtain an intermediate signal component corresponding to the speech signal to be processed;

Performing an averaging operation on the intermediate signal component to obtain the first signal component.
The non-volatile readable storage medium according to claim 16, wherein the EMD decomposition of the speech signal to be processed to obtain an intermediate signal component corresponding to the speech signal to be processed comprises:

Obtaining local extreme points of the speech signal to be processed, each local extreme point including a maximum point and a minimum point;

Constructing an upper envelope based on the maximum points among all local extreme points and a lower envelope based on the minimum points among all local extreme points;

Obtaining an average value corresponding to the upper envelope and the lower envelope based on the upper envelope and the lower envelope;

An initial signal component is obtained based on the speech signal to be processed and the average value. If the initial signal component meets a preset condition, the initial signal component is an intermediate signal component.
The non-volatile readable storage medium according to claim 15, wherein the correlation calculation formula is

$$r = \frac{\mathrm{Cov}(x, y)}{\sqrt{\mathrm{Var}[x]\,\mathrm{Var}[y]}}$$

where x is the digital voice signal, y is the first signal component, Cov(x, y) is the covariance of x and y, Var[x] is the variance of x, Var[y] is the variance of y, and r is the first correlation coefficient.
The non-volatile readable storage medium according to claim 15, wherein after the step of selecting the first signal component whose first correlation coefficient is greater than a preset threshold as the second signal component, when the computer-readable instructions are executed by one or more processors, the one or more processors further perform the following steps:

Decomposing the second signal component by using the EEMD algorithm to obtain a binary decomposition signal component;

Performing a correlation calculation on the digital speech signal and the binary decomposition signal component to obtain a second correlation coefficient;

Selecting a binary decomposition signal component whose second correlation coefficient is greater than a preset threshold as the updated second signal component.
The non-volatile readable storage medium according to claim 15, wherein the performing integrated processing on the second signal component to obtain target voice information comprises:

Using the formula

$$Z = \sum_{n=1}^{N} S_n$$

to perform integration processing on the second signal components to obtain the target voice information; wherein S_n represents the n-th second signal component, N is a positive integer representing the total number of second signal components, and Z represents the target voice information.