WO2019227588A1

WO2019227588A1 - Voice enhancement method and apparatus, and computer device and storage medium

Info

Publication number: WO2019227588A1
Application number: PCT/CN2018/094409
Authority: WO
Inventors: 涂宏
Original assignee: 平安科技（深圳）有限公司
Priority date: 2018-05-29
Filing date: 2018-07-04
Publication date: 2019-12-05
Also published as: CN108831494B; CN108831494A

Abstract

A voice enhancement method and apparatus, and a computer device and a storage medium. The voice enhancement method comprises: converting original voice information to obtain a digital voice signal (S10); obtaining a Hankel matrix on the basis of the digital voice signal (S20); performing a singular value decomposition operation on the Hankel matrix to obtain at least two singular values (S30); performing an inverse singular value decomposition operation on the at least two singular values to obtain a target voice signal (S40); and performing restoration processing on the target voice signal to obtain target voice information (S50). The voice enhancement method can effectively suppress noise interference, thereby improving the recognition accuracy of the target voice information during voice recognition.

Description

Voice enhancement method, device, computer equipment and storage medium

This patent application is based on a Chinese invention patent application filed on May 29, 2018 with the application number 201810529510.6 and entitled "Voice Enhancement Method, Device, Computer Equipment, and Storage Medium", and claims its priority.

Technical field

The present application relates to the field of signal processing, and in particular, to a method, a device, a computer device, and a storage medium for voice enhancement.

Background

With the widespread use of speech recognition technology, the demand for speech signal processing technology has also grown. At present, the voice signals collected on computer equipment include both the voice information corresponding to the speaker's voice, which is the valid information, and noise information other than the speaker's voice. If the speech signals collected by the computer equipment are recognized directly, the noise information lowers the accuracy of speech recognition. Therefore, the collected speech signals need to be enhanced (that is, noise reduction is performed on them) to extract speech signals that are as pure as possible and make speech recognition more accurate. With current speech enhancement processing, the accuracy of the extracted speech signal is not high, which is not conducive to subsequent speech recognition.

Summary of the Invention

Based on this, it is necessary to provide a speech enhancement method, device, computer equipment, and storage medium that can improve the accuracy of the speech signal after speech enhancement processing, in response to the above technical problems.

A speech enhancement method includes:

Converting the original voice information to obtain a digital voice signal;

Obtaining a Hankel matrix based on the digital speech signal;

Performing singular value decomposition operation processing on the Hankel matrix to obtain at least two singular values;

Performing an inverse singular value decomposition operation on at least two of the singular values to obtain a target speech signal;

Performing restoration processing on the target voice signal to obtain target voice information.

A voice enhancement device includes:

A digital voice signal acquisition module, configured to convert original voice information to obtain a digital voice signal;

A Hankel matrix acquisition module, configured to acquire a Hankel matrix based on the digital speech signal;

A singular value acquisition module, configured to perform singular value decomposition operation processing on the Hankel matrix to obtain at least two singular values;

A target voice signal acquisition module, configured to perform an inverse singular value decomposition operation on at least two of the singular values to obtain a target voice signal;

A target voice information acquisition module, configured to perform restoration processing on the target voice signal to acquire target voice information.

A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When the processor executes the computer-readable instructions, the following steps are implemented:

Converting the original voice information to obtain a digital voice signal;

Obtaining a Hankel matrix based on the digital speech signal;

One or more non-volatile readable storage media storing computer-readable instructions, which when executed by one or more processors, cause the one or more processors to perform the following steps:

Converting the original voice information to obtain a digital voice signal;

Obtaining a Hankel matrix based on the digital speech signal;

Details of one or more embodiments of the present application are set forth in the accompanying drawings and description below. Other features and advantages of the application will become apparent from the description, the drawings, and the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

In order to explain the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments of the application will be briefly introduced below. Obviously, the drawings in the following description are just some embodiments of the application. For those of ordinary skill in the art, other drawings can be obtained based on these drawings without paying creative labor.

FIG. 1 is an application environment diagram of a speech enhancement method according to an embodiment of the present application;

FIG. 2 is a flowchart of a speech enhancement method according to an embodiment of the present application;

FIG. 3 is a specific flowchart of step S30 in FIG. 2;

FIG. 4 is a specific flowchart of step S40 in FIG. 2;

FIG. 5 is a specific flowchart of step S411 in FIG. 4;

FIG. 6 is a specific flowchart of step S40 in FIG. 2;

FIG. 7 is a schematic diagram of a speech enhancement device according to an embodiment of the present application;

FIG. 8 is a schematic diagram of a computer device according to an embodiment of the present application.

Detailed Description

In the following, the technical solutions in the embodiments of the present application will be clearly and completely described with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are part of the embodiments of the present application, but not all of the embodiments. Based on the embodiments in this application, all other embodiments obtained by a person of ordinary skill in the art without creative efforts shall fall within the protection scope of this application.

The voice enhancement method provided by the embodiment of the present application may be applied in the application environment shown in FIG. 1, where a computer device communicates with a server through a network. Computer devices can be, but are not limited to, various personal computers, laptops, smartphones, tablets, and portable wearable devices. The server can be implemented as a stand-alone server.

The speech enhancement method can be specifically applied to computer equipment configured by financial institutions such as banks, securities, and insurance, or other institutions, and is used to enhance speech signals during speech recognition to improve the accuracy of recognition.

In one embodiment, as shown in FIG. 2, the speech enhancement method is described by taking its application to the server in FIG. 1 as an example, and includes the following steps:

S10: Convert the original voice information to obtain a digital voice signal.

The original voice information is the voice information of a speaker collected by a recording module (such as a microphone) in a computer device. The original voice information may be in wav, mp3, or another format. The digital voice signal is the discrete digital signal obtained by converting the original voice information. Since computer equipment can only process binary data and cannot process the original voice information directly, the original voice information needs to be converted into a digital voice signal.

Specifically, the server receives the original voice information sent by the computer device and reads it with a Python function for reading audio files to obtain the digital voice signal. For example, the function may be wave.open(file, "rb"), where file is the original voice information and "rb" opens the file for reading in binary mode. The one-dimensional array read from the audio file is the digital voice signal. A Python module is a collection of encapsulated functions written in Python, an object-oriented, interpreted programming language. In this embodiment, reading the original voice information directly with such a function is a simple way to obtain the digital voice signal.

In summary, the digital voice signal is one-dimensional digital information obtained by converting the original voice information, namely the one-dimensional digital signal obtained by reading the original voice information directly with the audio-reading function of the Python module.
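As an illustration of step S10, the following is a minimal sketch that reads a WAV file with Python's standard wave module. It assumes 16-bit PCM audio and keeps only the first channel; the function name read_digital_voice_signal is hypothetical and is not taken from the application.

```python
import struct
import wave


def read_digital_voice_signal(path):
    """Read a 16-bit PCM WAV file and return (samples, sampling_rate).

    Assumption: the original voice information is stored as 16-bit PCM.
    """
    with wave.open(path, "rb") as wav_file:
        n_channels = wav_file.getnchannels()
        sample_width = wav_file.getsampwidth()
        sampling_rate = wav_file.getframerate()
        raw = wav_file.readframes(wav_file.getnframes())
    if sample_width != 2:
        raise ValueError("this sketch only handles 16-bit PCM audio")
    # Interpret the raw bytes as little-endian signed 16-bit integers.
    values = struct.unpack("<%dh" % (len(raw) // 2), raw)
    # Keep the first channel only so the result is a one-dimensional signal.
    return list(values[::n_channels]), sampling_rate
```

The sampling rate is returned together with the samples because the restoration in step S50 needs it again.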

S20: Obtaining a Hankel matrix based on a digital voice signal.

The digital voice signal is the one-dimensional digital signal obtained by converting the original voice information. A Hankel matrix is a matrix in which the elements on each subdiagonal are equal.

Specifically, the Hankel matrix has the following representation. Assuming that the digital speech signal (a one-dimensional digital signal sequence) is x(i) with length N, i = 1, 2, 3, ..., N, then

$$H = \begin{bmatrix} x(1) & x(2) & \cdots & x(n) \\ x(2) & x(3) & \cdots & x(n+1) \\ \vdots & \vdots & \ddots & \vdots \\ x(N-n+1) & x(N-n+2) & \cdots & x(N) \end{bmatrix}$$

where n is the number of elements in each row of the matrix. Each row of the Hankel matrix after the first is formed by shifting the elements of the previous row one position to the left, so that the elements on each subdiagonal of the Hankel matrix are equal, that is, each element is equal to the adjacent element to its lower left. A subdiagonal runs from the upper right to the lower left.

In this embodiment, the elements of the first column and of the last row of the Hankel matrix are defined in advance to determine the numbers of rows and columns. The Hankel matrix is constructed from these two parameters, which provides technical support for the subsequent singular value decomposition. Understandably, the first element of the last row is the same as the last element of the first column. For example, if the first column of a given matrix is A = (1, 2, 3, 4) and the last row is B = (4, 4.5, 5.5), then the Hankel matrix constructed from these two parameters is

$$H = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 3 & 4 \\ 3 & 4 & 4.5 \\ 4 & 4.5 & 5.5 \end{bmatrix}$$
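The construction in step S20 can be sketched in plain Python as follows. The helper names are hypothetical, and the sketch assumes that each row is a window of the signal shifted by one sample, which matches the row-shifting description above.

```python
def hankel_from_signal(x, n_cols):
    """Build a Hankel matrix whose row j is the signal window starting at sample j."""
    n_rows = len(x) - n_cols + 1
    if n_rows < 1:
        raise ValueError("n_cols must not exceed the signal length")
    return [list(x[j:j + n_cols]) for j in range(n_rows)]


def hankel_from_column_and_row(first_col, last_row):
    """Build a Hankel matrix from its first column and its last row."""
    if first_col[-1] != last_row[0]:
        raise ValueError("last element of first_col must equal first element of last_row")
    # The underlying sequence is the first column followed by the rest of the last row.
    sequence = list(first_col) + list(last_row[1:])
    return hankel_from_signal(sequence, len(last_row))
```

With the first column (1, 2, 3, 4) and the last row (4, 4.5, 5.5) from the example above, hankel_from_column_and_row returns a matrix with four rows and three columns.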

S30: Perform singular value decomposition operation processing on the Hankel matrix to obtain at least two singular values.

Singular value decomposition (SVD) is an important matrix factorization in linear algebra. It can effectively reduce the dimension of a large amount of data, which reduces the amount of calculation and saves operation time. Specifically, the server performs singular value decomposition on the Hankel matrix to obtain two unitary matrices and one positive semi-definite diagonal matrix. The values on the diagonal of the positive semi-definite diagonal matrix are the singular values. There are generally N singular values (N > 2), arranged from largest to smallest. A singular value represents important information hidden in the matrix, and its importance is positively related to its size. Understandably, the larger a singular value is, the more effective information of the digital voice signal it contains; conversely, the smaller a singular value is, the less effective information it contains and the more noise it is considered to contain. By performing singular value decomposition on the Hankel matrix to obtain at least two singular values, the server can directly observe how much effective information each singular value contains, which facilitates noise reduction.

Specifically, the singular value decomposition operation can be expressed by the singular value decomposition formula H = UDV^*, where U and V are two unitary matrices and D is a positive semi-definite diagonal matrix. A unitary matrix is a matrix whose n column vectors are mutually orthogonal unit vectors, that is, the conjugate transpose of a unitary matrix is equal to its inverse matrix. Let A be an n-th order square matrix over a number field. If there is another n-th order matrix B over the same number field such that AB = BA = E (E is the identity matrix, that is, the n-th order square matrix whose elements on the diagonal from the upper left corner to the lower right corner are all 1 and whose other elements are all 0), then B is called the inverse matrix of A. Conjugate transpose means that, after the matrix is transposed, every element is replaced with its complex conjugate. Two complex numbers are conjugate when their real parts are equal and their imaginary parts are opposite numbers. For example, for z = a + bi (a, b ∈ R), the complex conjugate of z is z′ = a - bi. A positive semi-definite diagonal matrix is a matrix that is both positive semi-definite and diagonal. A positive semi-definite matrix is an n-th order square matrix A with X′AX ≥ 0 for any non-zero vector X (X′ denotes the transpose of X). A diagonal matrix is a matrix whose elements off the main diagonal (the diagonal from the upper left corner to the lower right corner) are all zero.

In an embodiment, as shown in FIG. 3, in step S30, the singular value decomposition operation processing is performed on the Hankel matrix to obtain at least two singular values, and the specific steps include the following steps:

S31: Calculate the transpose matrix of the Hankel matrix.

The transposed matrix of the Hankel matrix is the matrix obtained by mirroring all elements of the Hankel matrix about the 45-degree ray that starts at the element in the first row and first column and runs down and to the right. That is, for a Hankel matrix A with elements a_ij, the element in row i and column j of the transposed matrix A^T is a_ji.

Obtaining the transposed matrix of the Hankel matrix provides technical support for the subsequent acquisition of eigenvalues.

S32: Obtain at least two eigenvalues based on the product of the Hankel matrix and the transposed matrix.

Specifically, let A be the Hankel matrix and A^T its transposed matrix. The formulas B = AA^T and B′ = A^T A are used to calculate the matrices B and B′ corresponding to the products of the Hankel matrix and the transposed matrix, and at least two eigenvalues are then obtained from Bx = mx. Here, if B is a square matrix of order n and there exist a real number m and a non-zero n-dimensional column vector x such that Bx = mx holds, then m is called an eigenvalue of B. An eigenvalue reflects the scaling factor of the matrix transformation, and scaling the matrix in this way serves the purpose of data dimensionality reduction.

Specifically, given the Hankel matrix A and its transposed matrix A^T, obtaining at least two eigenvalues based on the product of the Hankel matrix and the transposed matrix includes the following process:

(1) Use the formulas B = AA^T and B′ = A^T A to calculate the matrix B and the matrix B′ corresponding to the products of the Hankel matrix and the transposed matrix.

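(2) A matrix determinant is used to process the matrix B and the matrix B′ to obtain at least two eigenvalues. The calculation formula of the matrix determinant is

$$D = \sum (-1)^{\tau(k_1 k_2 \cdots k_n)} a_{1k_1} a_{2k_2} \cdots a_{nk_n}$$

where Σ denotes the sum over all permutations k₁k₂…k_n of 1, 2, …, n, τ denotes the inversion number of the permutation k₁k₂…k_n, a_ij is the element in row i and column j of the matrix, and D is called the determinant of the matrix. The inversion number is calculated as

$$\tau(k_1 k_2 \cdots k_n) = \sum_{i=1}^{n} t_i$$

where t_i is the number of elements that precede k_i in the permutation and are larger than k_i.

Taking B′ as an example, solving the characteristic equation |B′ - λE| = 0 given by the matrix determinant of B′ yields the eigenvalues λ₁ = 3 and λ₂ = 1.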

(3) The at least two eigenvalues λ_i are substituted into the formulas Bu_i = λ_i u_i and B′v_i = λ_i v_i to obtain the eigenvector corresponding to each eigenvalue, where u_i is the eigenvector of the matrix B corresponding to λ_i and v_i is the eigenvector of the matrix B′ corresponding to λ_i. The server obtains the eigenvalues and eigenvectors based on the product of the Hankel matrix and the transposed matrix to achieve the purpose of data dimensionality reduction.
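The determinant defined as a sum over permutations weighted by the inversion number can be sketched directly in Python. This is an illustration only: the function names are hypothetical, and the 2×2 matrix in the usage note is an assumed B′ chosen because its eigenvalues are 3 and 1, not a matrix taken from the application.

```python
from itertools import permutations


def inversion_number(perm):
    """Count the pairs (i, j) with i < j whose entries are out of order."""
    return sum(
        1
        for i in range(len(perm))
        for j in range(i + 1, len(perm))
        if perm[i] > perm[j]
    )


def determinant(matrix):
    """Determinant as the signed sum over all permutations of column indices."""
    n = len(matrix)
    total = 0
    for perm in permutations(range(n)):
        term = (-1) ** inversion_number(perm)
        for row, col in enumerate(perm):
            term *= matrix[row][col]
        total += term
    return total


def characteristic_value(matrix, lam):
    """Evaluate det(matrix - lam * E); it is zero when lam is an eigenvalue."""
    n = len(matrix)
    shifted = [
        [matrix[i][j] - (lam if i == j else 0) for j in range(n)]
        for i in range(n)
    ]
    return determinant(shifted)
```

For the assumed matrix [[2, 1], [1, 2]], characteristic_value is zero at lam = 3 and at lam = 1. The permutation sum grows factorially with the matrix order, so it is practical only for small matrices.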

S33: Operate at least two eigenvalues according to a preset calculation method to obtain at least two singular values.

The preset calculation method is a predefined method for calculating singular values from eigenvalues. The preset calculation method includes taking the square root of each eigenvalue with the formula

$$\sigma_i = \sqrt{\lambda_i}$$

or calculating the singular values with the formula Av_i = σ_i u_i.

Specifically, the server uses the formula σ_i = √λ_i to take the square root of the at least two eigenvalues and thereby obtain at least two singular values, where σ_i is a singular value and λ_i is an eigenvalue. Obtaining the singular values by a square root operation on the eigenvalues is simple to calculate and improves efficiency.

Alternatively, the server uses the formula Av_i = σ_i u_i to obtain the at least two singular values, where u_i is the eigenvector corresponding to the eigenvalue λ_i of the matrix B and v_i is the eigenvector corresponding to the eigenvalue λ_i of the matrix B′.

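Finally, based on the singular values σ_i, the eigenvectors u_i, and the eigenvectors v_i, the singular value decomposition of the Hankel matrix is expressed as H = UDV^*, where the columns of U are the eigenvectors u_i, the columns of V are the eigenvectors v_i, and the diagonal elements of D are the singular values σ_i.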

In this embodiment, the transposed matrix of the Hankel matrix is calculated first, so that at least two eigenvalues can be obtained based on the product of the Hankel matrix and the transposed matrix. The eigenvalues scale the matrix obtained from that product, which achieves the purpose of reducing the dimension of the data. Finally, a square root operation is performed on the at least two eigenvalues to obtain at least two singular values. This method for obtaining the singular values is simple to calculate and easy to implement.
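Steps S31 to S33 can be traced on a small case with the following sketch. It is limited to matrices with two columns, so that B′ = A^T A is a 2×2 symmetric matrix with closed-form eigenvalues. The function names and the 3×2 Hankel matrix in the usage note are assumptions for illustration; the matrix was chosen because it gives the eigenvalues 3 and 1.

```python
import math


def transpose(a):
    """Return the transposed matrix A^T."""
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def eigenvalues_symmetric_2x2(m):
    """Closed-form eigenvalues of a symmetric 2x2 matrix, largest first."""
    (a, b), (_, d) = m
    mean = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + b * b)
    return [mean + radius, mean - radius]


def singular_values_two_columns(a):
    """Singular values of a matrix with two columns via B' = A^T A."""
    b_prime = matmul(transpose(a), a)
    # sigma_i is the square root of lambda_i; clamp tiny negative rounding errors.
    return [math.sqrt(max(lam, 0.0)) for lam in eigenvalues_symmetric_2x2(b_prime)]
```

For the assumed Hankel matrix A = [[0, 1], [1, 1], [1, 0]], B′ is [[2, 1], [1, 2]], its eigenvalues are 3 and 1, and the singular values are √3 and 1. Larger matrices need a general eigenvalue or SVD routine from a numerical library.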

S40: Perform inverse singular value decomposition operation on at least two singular values to obtain a target speech signal.

The inverse singular value decomposition operation restores the singular values into a positive semi-definite diagonal matrix and multiplies that matrix by the two unitary matrices obtained from the previous singular value decomposition operation to obtain the target speech signal. The target speech signal is the denoised speech signal obtained by performing singular value decomposition on the digital speech signal. Specifically, the server performs the inverse singular value decomposition operation on the at least two singular values to obtain the corresponding voice signal (that is, the target voice signal), so as to achieve the purpose of voice enhancement.

In an embodiment, as shown in FIG. 4, in step S40, the singular value decomposition inverse operation is performed on at least two singular values to obtain a target voice signal, which specifically includes the following steps:

S411: Perform the inverse singular value decomposition operation on each of the at least two singular values separately to obtain an original signal component corresponding to each singular value.

The original signal component is the signal component obtained by performing the inverse singular value decomposition operation on one singular value. Specifically, each singular value is restored into a positive semi-definite diagonal matrix (the position of the singular value in the matrix is unchanged), and that matrix is multiplied by the two unitary matrices obtained from the previous singular value decomposition operation to obtain the original signal component corresponding to that singular value.

S412: Perform correlation calculation between the original signal component and the digital voice signal to obtain a correlation coefficient.

The correlation coefficient is the result of a correlation calculation between the digital voice signal and an original signal component. It reflects the degree of correlation between the digital voice signal and the original signal component, and therefore how much effective information the signal component contains.

Specifically, the correlation calculation formula is

$$r = \frac{\mathrm{Cov}(x, y)}{\sqrt{\mathrm{Var}[x]\,\mathrm{Var}[y]}}$$

where x is the original signal component, y is the digital voice signal, Cov(x, y) is the covariance of x and y, Var[x] is the variance of x, Var[y] is the variance of y, and r is the correlation coefficient.

Cov(x, y) is calculated as

$$\mathrm{Cov}(x, y) = \frac{1}{n} \sum_{j=1}^{n} \bigl(x_j - E(x)\bigr)\bigl(y_j - E(y)\bigr)$$

Var[x] is calculated as Var[x] = E(x²) - E²(x), and Var[y] is calculated as Var[y] = E(y²) - E²(y). Here E(x) is the average value of the original signal component, E(y) is the average value of the digital speech signal, n is the number of samples of the original signal component, y_j is the j-th sample of the digital speech signal on the time scale, and x_j is the j-th sample of the original signal component on the same time scale.
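The correlation calculation of step S412 can be sketched as follows. The sketch assumes the population form of the covariance (the mean of the products of deviations), which is consistent with Var[x] = E(x²) - E²(x); the function names are hypothetical.

```python
import math


def mean(values):
    """Average value E(.) of a sequence of samples."""
    return sum(values) / len(values)


def correlation_coefficient(x, y):
    """Correlation r between an original signal component x and the signal y."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of samples")
    ex, ey = mean(x), mean(y)
    # Cov(x, y): mean of the products of deviations from the averages.
    cov = sum((xj - ex) * (yj - ey) for xj, yj in zip(x, y)) / len(x)
    # Var[x] = E(x^2) - E(x)^2, and likewise for y.
    var_x = mean([v * v for v in x]) - ex ** 2
    var_y = mean([v * v for v in y]) - ey ** 2
    return cov / math.sqrt(var_x * var_y)
```

A constant component has zero variance and makes the denominator zero, so a practical implementation would need to treat that case separately.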

S413: Select an original signal component whose correlation coefficient is greater than a preset threshold as a target signal component.

Wherein, the preset threshold is a predefined threshold for filtering the original signal components. The target signal component is an original signal component obtained by performing a filtering operation on the original signal component using a preset threshold.

Since the absolute value of the correlation coefficient does not exceed 1, the preset threshold is selected as a real number between 0 and 1. If the correlation coefficient is greater than the preset threshold, the original signal component is strongly correlated with the digital voice signal and contains a large amount of its effective information. If the correlation coefficient is not greater than the preset threshold, the correlation between the original signal component and the digital voice signal is small, the original signal component contains little effective information, and it can be regarded as noise by default. In this embodiment, the original signal components are filtered so that those with greater correlation with the digital speech signal are taken as the target signal components, which reduces noise interference and achieves the purpose of speech enhancement. In addition, this method for screening original signal components is simple to implement and improves the efficiency of speech enhancement.

S414: Perform linear superposition processing on the target signal components to obtain a target voice signal.

Specifically, the server linearly superimposes the N acquired target signal components by the formula W = x₁ + x₂ + … + x_N to obtain the target voice signal, where W is the target voice signal and x_i is the i-th target signal component.

In this embodiment, the server first performs the inverse singular value decomposition operation on each singular value to obtain the corresponding original signal component, and then performs the correlation calculation between each original signal component and the digital voice signal to obtain the correlation coefficient. The correlation coefficient reflects the degree of correlation between the digital speech signal and the original signal component, and therefore how much effective information the signal component contains. The server then screens the original signal components and takes those with greater correlation with the digital speech signal as the target signal components, which reduces noise interference at a finer granularity and achieves the purpose of speech enhancement. Finally, the target signal components are linearly superimposed to obtain the target speech signal. This process of obtaining the target speech signal is simple to calculate, easy to implement, and improves the processing efficiency of speech enhancement.
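Steps S413 and S414 can be sketched as a filter followed by an element-wise sum. The sketch assumes that each original signal component is already available as a one-dimensional list of samples with a precomputed correlation coefficient; the function name and the sample values are hypothetical.

```python
def select_and_superpose(components, coefficients, preset_threshold):
    """Keep components whose correlation exceeds the threshold and add them up."""
    target_components = [
        component
        for component, r in zip(components, coefficients)
        if r > preset_threshold
    ]
    if not target_components:
        return []
    # Linear superposition W = x_1 + x_2 + ... sample by sample.
    return [sum(samples) for samples in zip(*target_components)]
```

For example, with coefficients 0.9, 0.6, and 0.1 and a preset threshold of 0.5, only the first two components contribute to the target voice signal.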

In an embodiment, as shown in FIG. 5, in step S411, the inverse singular value decomposition operation is performed on each of the at least two singular values separately to obtain an original signal component corresponding to each singular value, which specifically includes the following steps:

S4111: Obtain a singular value matrix based on the singular values.

The singular value matrix is the matrix obtained by restoring one singular value into the positive semi-definite diagonal matrix. Specifically, the server restores each singular value into the positive semi-definite diagonal matrix to obtain a singular value matrix, keeping the singular value at its original diagonal position and setting all other elements to 0. The singular value matrices can be expressed by the following formula

$$D_1 = \mathrm{diag}(\sigma_1, 0, \ldots, 0), \quad D_2 = \mathrm{diag}(0, \sigma_2, \ldots, 0), \quad \ldots, \quad D_n = \mathrm{diag}(0, 0, \ldots, \sigma_n)$$

where D_n represents the singular value matrix corresponding to the n-th singular value.

S4112: Obtain an original signal component corresponding to each singular value based on the singular value matrix.

Specifically, each singular value matrix is operated on according to the following formula to obtain the original signal component corresponding to each singular value:

$$H = U D V^*$$

where U and V^* are the two unitary matrices, D is the singular value matrix corresponding to each singular value, that is, D₁, D₂, …, D_n, and H is the original signal component corresponding to each singular value. U_ik is the matrix corresponding to the i-th eigenvector calculated by the formula Bu_i = λ_i u_i, and V_ik is the matrix corresponding to the i-th eigenvector calculated by the formula B′v_i = λ_i v_i.

In this embodiment, each singular value is first restored into the positive semi-definite diagonal matrix to obtain a singular value matrix. The singular value matrix corresponding to each singular value is then multiplied by the two unitary matrices obtained from the singular value decomposition operation to obtain the original signal component corresponding to each singular value, which provides technical support for the subsequent filtering of the original signal components to obtain the target signal components.
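Steps S4111 and S4112 can be sketched for real-valued matrices, where the conjugate transpose reduces to the ordinary transpose. The sketch assumes U, V, and the singular values are already available from step S30; the function names are hypothetical.

```python
def transpose(a):
    """Return the transposed matrix."""
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def singular_value_matrix(singular_values, index, n_rows, n_cols):
    """Matrix D_index: one singular value at its diagonal position, zeros elsewhere."""
    d = [[0.0] * n_cols for _ in range(n_rows)]
    d[index][index] = singular_values[index]
    return d


def original_component(u, singular_values, v, index):
    """Original signal component U * D_index * V^T for one singular value."""
    d_i = singular_value_matrix(singular_values, index, len(u), len(v))
    return matmul(matmul(u, d_i), transpose(v))
```

Adding the components for all singular values recovers the decomposed matrix, which is why discarding some components removes only their contribution.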

In an embodiment, as shown in FIG. 6, in step S40, the inverse singular value decomposition operation is performed on the at least two singular values to obtain the target voice signal, which specifically includes the following steps:

S421: Calculate the sum of at least two singular values, multiply the sum by a preset threshold, and obtain a corresponding evaluation threshold. The preset threshold is a positive number not greater than 1.

The preset threshold is a threshold defined in advance for calculating the evaluation threshold; it is a positive number not greater than 1. The evaluation threshold is the threshold used for screening singular values. Specifically, the sum of all singular values is calculated and then multiplied by the preset threshold to obtain the evaluation threshold. That is, the calculation formula of the evaluation threshold is

$$P = T \sum_{i} \sigma_i$$

where T is the preset threshold, P is the evaluation threshold, and σ_i is a singular value.

S422: Perform linear superposition of the at least two singular values in order from largest to smallest to obtain a superposition sum value. If the superposition sum value is greater than the evaluation threshold, obtain the N singular values corresponding to the superposition sum value, where N is a positive integer.

Specifically, the singular values are arranged in descending order, so they are added linearly from largest to smallest to obtain the superposition sum value. Understandably, the addition continues until the sum of the N superimposed singular values is greater than the evaluation threshold; the superposition then stops, and those N singular values are obtained, where N is a positive integer. The larger a singular value is, the more effective information of the digital voice signal it contains; conversely, the smaller a singular value is, the less effective information it contains, and it is considered to contain mainly noise. Therefore, the server adds the singular values linearly in descending order until the sum of the N singular values is greater than the evaluation threshold, and removes the remaining M singular values to reduce noise interference. This screening process does not need to perform the inverse operation on each singular value and then carry out correlation analysis. The required singular values are screened directly against the evaluation threshold, which is simple to operate and improves efficiency.
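Steps S421 and S422 can be sketched as a cumulative sum that stops once it exceeds the evaluation threshold. The function name and the sample values are hypothetical.

```python
def select_leading_singular_values(singular_values, preset_threshold):
    """Return the N largest singular values whose sum first exceeds P = T * sum."""
    if not 0 < preset_threshold <= 1:
        raise ValueError("preset_threshold must be positive and not greater than 1")
    ordered = sorted(singular_values, reverse=True)
    evaluation_threshold = preset_threshold * sum(ordered)
    kept = []
    running_sum = 0.0
    for sigma in ordered:
        kept.append(sigma)
        running_sum += sigma
        # Stop as soon as the superposition sum exceeds the evaluation threshold.
        if running_sum > evaluation_threshold:
            break
    return kept
```

With singular values 5, 3, 1, 1 and a preset threshold of 0.75, the evaluation threshold is 7.5, so the first two singular values are kept and the last two are removed. With a preset threshold of exactly 1 the sum never exceeds the evaluation threshold, and this sketch keeps every singular value.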

S423: Perform batch reconstruction on the N singular values to obtain a target voice signal.

Batch reconstruction is a method of restoring the N singular values together to obtain the target voice signal.

Specifically, batch reconstruction of the N singular values to obtain the target speech signal is implemented as follows. The selected N singular values are retained, with size and position unchanged, in the original positive semi-definite diagonal matrix D obtained by the singular value decomposition operation, and the remaining singular values (that is, the singular values representing noise) are set to 0 with their positions unchanged, which gives the target positive semi-definite diagonal matrix M containing the selected N singular values. The target matrix M is then substituted into the above singular value decomposition formula with U and V unchanged to obtain a new Hankel matrix H′ = UMV^*. Expanding the new Hankel matrix H′ according to the property of the Hankel matrix (that is, the elements on each subdiagonal are equal) gives the denoised speech signal, that is, the target speech signal in this embodiment.

In summary, in this embodiment, the inverse singular value decomposition includes inverse decomposition of each singular value or batch reconstruction of singular values to obtain a target speech signal.

In this embodiment, the sum of at least two singular values is calculated and multiplied by a preset threshold to obtain an evaluation threshold. The at least two singular values are then linearly added in order from large to small until the sum of the N superimposed singular values is greater than the evaluation threshold, at which point the superposition stops, the N singular values are obtained, and the remaining M singular values are removed to reduce noise interference and achieve speech enhancement. Finally, batch reconstruction is performed on the N singular values to obtain the target speech signal. In this process, the selected N singular values are restored directly in the original positive semi-definite diagonal matrix D obtained by the singular value decomposition operation, and the result is multiplied by the two unitary matrices obtained by the singular value decomposition operation to obtain the target speech signal. Obtaining the target speech signal by batch reconstruction improves the efficiency of acquiring the target speech signal and, in turn, the processing efficiency of speech enhancement.

S50: Perform restoration processing on the target voice signal to obtain target voice information.

The target voice information is the voice information obtained by restoring the target voice signal into the required audio format. Further, the server can restore the target speech signal, which is in matrix form, as follows: first expand the Hankel matrix along its subdiagonal (anti-diagonal) elements to obtain the noise-reduced one-dimensional digital signal, and then combine the sampling frequency parameter with the one-dimensional digital signal to obtain the target voice information. The sampling frequency, also called the sampling speed or sampling rate, defines the number of samples extracted per second from a continuous signal to form a discrete signal, and is expressed in hertz (Hz).

In this embodiment, a function for reading audio files in a Python module is used to read the original voice information directly and obtain the sampling frequency parameter. Python modules also provide functions for generating audio files in different formats; calling such a function with the sampling frequency parameter and the one-dimensional digital signal generates the target voice information in the required format. For example, the wave module, which reads and writes wav format files, can be called to process the acquired sampling frequency parameter and one-dimensional digital signal and generate an audio file (that is, the target voice information) in wav format.
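The restoration step can be sketched with the `wave` module of the Python standard library. The function names `read_sample_rate` and `write_wav` are hypothetical, and the choice of mono, 16-bit PCM output with samples scaled from the range [-1, 1] is an assumption made for this sketch; the embodiment does not specify a sample format.

```python
import wave
import numpy as np

def read_sample_rate(path):
    """Read the sampling frequency (Hz) from an existing wav file."""
    with wave.open(path, "rb") as f:
        return f.getframerate()

def write_wav(path, samples, sample_rate):
    """Write a 1-D float signal in [-1, 1] as a mono 16-bit PCM wav file."""
    # Clip to the valid range, then scale to signed 16-bit integers.
    clipped = np.clip(np.asarray(samples, dtype=float), -1.0, 1.0)
    pcm = (clipped * 32767).astype("<i2")  # little-endian int16
    with wave.open(path, "wb") as f:
        f.setnchannels(1)            # mono (assumption)
        f.setsampwidth(2)            # 2 bytes per sample = 16-bit
        f.setframerate(sample_rate)  # sampling frequency parameter
        f.writeframes(pcm.tobytes())
```

A typical use would read the sampling frequency from the original recording with `read_sample_rate` and pass it, together with the denoised one-dimensional signal, to `write_wav`.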

In this embodiment, the original voice information is first converted to obtain a digital voice signal, and the digital voice signal is constructed into a Hankel matrix, which is then subjected to singular value decomposition operation processing to obtain at least two singular values. The singular values represent the important information implied in the matrix, and the importance is positively related to the size of the singular value, so the amount of effective information contained in each singular value can be observed directly from the singular values obtained. Then, the server performs an inverse singular value decomposition operation on the at least two singular values to obtain the speech signal corresponding to the singular values, that is, the target speech signal, so as to suppress noise interference and implement speech enhancement. Finally, the target voice signal is restored to obtain the audio file in the required format, that is, the target voice information. The restoration process can directly call a function in a Python module, so the operation is simple.

It should be understood that the size of the sequence numbers of the steps in the above embodiments does not mean the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.

In one embodiment, FIG. 7 shows a schematic diagram of a speech enhancement device corresponding to the speech enhancement method in the above embodiment. As shown in FIG. 7, the voice enhancement device includes a digital voice signal acquisition module 10, a Hankel matrix acquisition module 20, a singular value acquisition module 30, a target voice signal acquisition module 40, and a target voice information acquisition module 50. The detailed description of each functional module is as follows:

The digital voice signal acquisition module 10 is configured to convert the original voice information to obtain a digital voice signal.

The Hankel matrix obtaining module 20 is configured to obtain a Hankel matrix based on a digital voice signal.

The singular value acquisition module 30 is configured to perform singular value decomposition operation processing on the Hankel matrix to obtain at least two singular values.

The target voice signal acquisition module 40 is configured to perform an inverse singular value decomposition operation on at least two singular values to obtain a target voice signal.

The target voice information acquisition module 50 is configured to perform restoration processing on the target voice signal to acquire the target voice information.

Specifically, the singular value acquisition module 30 includes a transposed matrix calculation unit 31, an eigenvalue acquisition unit 32, and a singular value acquisition unit 33.

The transposed matrix calculation unit 31 is configured to calculate a transposed matrix of the Hankel matrix.

An eigenvalue obtaining unit 32 is configured to obtain at least two eigenvalues based on a product of a Hankel matrix and a transposed matrix.

The singular value obtaining unit 33 is configured to perform an operation on at least two eigenvalues according to a preset calculation method to obtain at least two singular values.
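The work of units 31 to 33 can be illustrated with a short Python sketch. It assumes that the "preset calculation method" is taking the square root of each eigenvalue, which is the standard relationship between the eigenvalues of $H H^{T}$ and the singular values of H; the function name `singular_values_via_eigen` is hypothetical.

```python
import numpy as np

def singular_values_via_eigen(H):
    """Singular values of H from the eigenvalues of H @ H.T, in descending order."""
    H = np.asarray(H, dtype=float)
    # Unit 31: transposed matrix. Unit 32: eigenvalues of the product H H^T.
    gram = H @ H.T
    eigenvalues = np.linalg.eigvalsh(gram)  # symmetric input, ascending order
    # Rounding can produce tiny negative eigenvalues; clamp them to zero.
    eigenvalues = np.clip(eigenvalues, 0.0, None)
    # Unit 33 (assumed method): singular value = square root of the eigenvalue.
    return np.sqrt(eigenvalues)[::-1]
```

When H has more rows than columns, $H H^{T}$ has additional zero eigenvalues, so the result contains trailing zeros beyond the min(rows, columns) singular values of H.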

Specifically, the target speech signal acquisition module 40 includes an original signal component acquisition unit 411, a correlation coefficient acquisition unit 412, a target signal component acquisition unit 413, and a target speech signal acquisition unit 414.

The original signal component acquiring unit 411 is configured to perform singular value decomposition and inverse operation processing on at least two singular values, respectively, to obtain an original signal component corresponding to each singular value.

A correlation coefficient acquisition unit 412 is configured to perform correlation calculation between the original signal component and the digital voice signal to obtain a correlation coefficient.

The target signal component acquiring unit 413 is configured to select an original signal component whose correlation coefficient is greater than a preset threshold as a target signal component.

The target voice signal acquisition unit 414 is configured to perform linear superposition processing on the target signal components to acquire a target voice signal.

Specifically, the original signal component acquisition unit 411 includes a singular value matrix acquisition subunit 4111 and an original signal component acquisition subunit 4112.

The singular value matrix obtaining subunit 4111 is configured to obtain a singular value matrix based on the singular value.

The original signal component acquisition subunit 4112 is configured to obtain an original signal component corresponding to each singular value based on the eigenvalue and the singular value matrix.
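The cooperation of units 411 to 414 can be sketched in Python as follows. Several points are assumptions made for illustration: the function name `correlation_filtered_signal`, the `rows` parameter and the default `preset_threshold` are hypothetical; each original signal component is formed from the rank-one term $\sigma_i u_i v_i^{*}$ and expanded to one dimension by anti-diagonal averaging; and the correlation coefficient is computed as the Pearson coefficient via `numpy.corrcoef`.

```python
import numpy as np

def correlation_filtered_signal(x, rows, preset_threshold=0.1):
    """Sum the per-singular-value components whose correlation with x exceeds the threshold."""
    x = np.asarray(x, dtype=float)
    cols = len(x) - rows + 1
    idx = np.arange(rows)[:, None] + np.arange(cols)[None, :]
    H = x[idx]  # Hankel matrix, H[i, j] = x[i + j]
    U, s, Vh = np.linalg.svd(H, full_matrices=False)
    counts = np.bincount(idx.ravel(), minlength=len(x))
    target = np.zeros(len(x))
    kept = []
    for i in range(len(s)):
        # Unit 411: matrix holding only the i-th singular value, then inverse SVD.
        comp_matrix = s[i] * np.outer(U[:, i], Vh[i])
        # Expand to a 1-D original signal component by anti-diagonal averaging.
        sums = np.bincount(idx.ravel(), weights=comp_matrix.ravel(), minlength=len(x))
        component = sums / counts
        if np.ptp(component) == 0:
            continue  # constant component: correlation is undefined
        # Unit 412: correlation coefficient between the component and the signal.
        r = np.corrcoef(component, x)[0, 1]
        # Unit 413: keep components above the preset threshold.
        if r > preset_threshold:
            target += component  # Unit 414: linear superposition
            kept.append(i)
    return target, kept
```

Because the expansion is linear, superimposing every component reproduces the input signal exactly; raising `preset_threshold` discards the components that resemble the input least.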

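Specifically, the correlation calculation formula is $r = \dfrac{\mathrm{Cov}(x, y)}{\sqrt{\mathrm{Var}[x]\,\mathrm{Var}[y]}}$, where x is the original signal component, y is the digital voice signal, Cov(x, y) is the covariance of x and y, Var[x] is the variance of x, Var[y] is the variance of y, and r is the correlation coefficient.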

Specifically, the target voice signal acquisition module 40 includes an evaluation threshold acquisition unit 421, an N-term singular value acquisition unit 422, and a target voice signal acquisition unit 423.

The evaluation threshold obtaining unit 421 is configured to calculate a sum of at least two singular values, and multiply the sum with a preset threshold to obtain a corresponding evaluation threshold. The preset threshold is a positive number not greater than 1.

The N-term singular value obtaining unit 422 is configured to linearly superimpose at least two singular values in order from large to small to obtain a superposition sum value and, if the superposition sum value is greater than the evaluation threshold, to obtain the N singular values corresponding to the superposition sum value, where N is a positive integer.

The target voice signal acquisition unit 423 is configured to perform batch reconstruction on the N singular values to acquire a target voice signal.

For the specific limitation of the speech enhancement device, refer to the foregoing limitation on the speech enhancement method, and details are not described herein again. Each module in the above voice enhancement device may be implemented in whole or in part by software, hardware, and a combination thereof. The above-mentioned modules may be embedded in the hardware in or independent of the processor in the computer device, or may be stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.

In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure diagram may be as shown in FIG. 8. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device is used to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions stored in the non-volatile storage medium. The database of the computer device is used to store data generated or obtained during the execution of the speech enhancement method, such as target speech information. The network interface of the computer device is used to communicate with an external terminal through a network connection. When the computer-readable instructions are executed by one or more processors, the one or more processors implement a speech enhancement method.

In one embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored on the memory and executable on the processor. The processor executes the computer-readable instructions to implement the following steps: converting the original speech information to obtain a digital speech signal; obtaining a Hankel matrix based on the digital speech signal; performing singular value decomposition operation processing on the Hankel matrix to obtain at least two singular values; performing an inverse singular value decomposition operation on the at least two singular values to obtain the target voice signal; and performing restoration processing on the target voice signal to obtain the target voice information.

In an embodiment, when the processor executes the computer-readable instructions, the following steps are further implemented: calculating a transposed matrix of the Hankel matrix; obtaining at least two eigenvalues based on a product of the Hankel matrix and the transposed matrix; and operating on the at least two eigenvalues according to a preset calculation method to obtain at least two singular values.

In an embodiment, when the processor executes the computer-readable instructions, the processor further implements the following steps: performing inverse singular value decomposition operation processing on at least two singular values, respectively, to obtain an original signal component corresponding to each singular value; performing correlation calculation between the original signal component and the digital speech signal to obtain a correlation coefficient; selecting an original signal component with a correlation coefficient greater than a preset threshold as a target signal component; and performing linear superposition processing on the target signal components to obtain the target speech signal.

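Specifically, the correlation calculation formula is $r = \dfrac{\mathrm{Cov}(x, y)}{\sqrt{\mathrm{Var}[x]\,\mathrm{Var}[y]}}$, where x is the original signal component, y is the digital voice signal, and r is the correlation coefficient.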

In one embodiment, when the processor executes the computer-readable instructions, the following steps are further implemented: obtaining a singular value matrix based on the singular values; and obtaining an original signal component corresponding to each singular value based on the eigenvalues and the singular value matrix.

In an embodiment, when the processor executes the computer-readable instructions, the processor further implements the following steps: calculating a sum of at least two singular values, and multiplying the sum by a preset threshold to obtain a corresponding evaluation threshold, wherein the preset threshold is a positive number not greater than 1; linearly superimposing the at least two singular values in order from large to small to obtain a superposition sum value and, if the superposition sum value is greater than the evaluation threshold, obtaining the N singular values corresponding to the superposition sum value, where N is a positive integer; and performing batch reconstruction on the N singular values to obtain the target speech signal.

In one embodiment, one or more non-volatile readable storage media storing computer-readable instructions are provided. When the computer-readable instructions are executed by one or more processors, the one or more processors implement the following steps: converting the original speech information to obtain a digital speech signal; obtaining a Hankel matrix based on the digital speech signal; performing singular value decomposition operation processing on the Hankel matrix to obtain at least two singular values; performing an inverse singular value decomposition operation on the at least two singular values to obtain a target voice signal; and performing restoration processing on the target voice signal to obtain target voice information.

In one embodiment, when the computer-readable instructions are executed by one or more processors, the one or more processors further implement the following steps: calculating a transposed matrix of the Hankel matrix; obtaining at least two eigenvalues based on a product of the Hankel matrix and the transposed matrix; and operating on the at least two eigenvalues according to a preset calculation method to obtain at least two singular values.

In one embodiment, when the computer-readable instructions are executed by one or more processors, the one or more processors further implement the following steps: performing inverse singular value decomposition operation processing on at least two singular values, respectively, to obtain the original signal component corresponding to each singular value; performing correlation calculation between the original signal component and the digital voice signal to obtain a correlation coefficient; selecting an original signal component with a correlation coefficient greater than a preset threshold as a target signal component; and performing linear superposition processing on the target signal components to obtain the target speech signal.

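Specifically, the correlation calculation formula is $r = \dfrac{\mathrm{Cov}(x, y)}{\sqrt{\mathrm{Var}[x]\,\mathrm{Var}[y]}}$, where x is the original signal component, y is the digital voice signal, and r is the correlation coefficient.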

In one embodiment, when the computer-readable instructions are executed by one or more processors, the one or more processors further implement the following steps: obtaining a singular value matrix based on the singular values; and obtaining the original signal component corresponding to each singular value based on the eigenvalues and the singular value matrix.

In one embodiment, when the computer-readable instructions are executed by one or more processors, the one or more processors further implement the following steps: calculating a sum of at least two singular values, and multiplying the sum by a preset threshold to obtain a corresponding evaluation threshold, wherein the preset threshold is a positive number not greater than 1; linearly superimposing the at least two singular values in order from large to small to obtain a superposition sum value and, if the superposition sum value is greater than the evaluation threshold, obtaining the N singular values corresponding to the superposition sum value, where N is a positive integer; and performing batch reconstruction on the N singular values to obtain the target speech signal.

Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by computer-readable instructions that instruct related hardware. The computer-readable instructions can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Those skilled in the art can clearly understand that, for convenience and brevity of description, only the above division of functional units and modules is used as an example. In practical applications, the above functions can be assigned to different functional units and modules as needed, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above.

The above-mentioned embodiments are only used to describe the technical solutions of the present application and are not intended to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that they can still modify the technical solutions described in the foregoing embodiments, or equivalently replace some of the technical features. Such modifications or replacements do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the application, and should be included within the scope of this application.

Claims

A speech enhancement method, comprising:

Converting original voice information to obtain a digital voice signal;

Obtaining a Hankel matrix based on the digital speech signal;

Performing singular value decomposition operation processing on the Hankel matrix to obtain at least two singular values;

Performing an inverse singular value decomposition operation on at least two of the singular values to obtain a target speech signal;

Performing restoration processing on the target voice signal to obtain target voice information.
The speech enhancement method according to claim 1, wherein the performing singular value decomposition operation processing on the Hankel matrix to obtain at least two singular values comprises:

Calculating a transposed matrix of the Hankel matrix;

Obtaining at least two eigenvalues based on a product of the Hankel matrix and the transposed matrix;

Operating on at least two of the eigenvalues according to a preset calculation method to obtain at least two of the singular values.
The method of claim 2, wherein the step of performing inverse singular value decomposition on at least two of the singular values to obtain a target speech signal comprises:

Performing singular value decomposition and inverse operation processing on at least two of the singular values, respectively, to obtain an original signal component corresponding to each of the singular values;

Performing a correlation calculation between the original signal component and the digital voice signal to obtain a correlation coefficient;

Selecting the original signal component whose correlation coefficient is greater than a preset threshold as a target signal component;

Performing linear superposition processing on the target signal component to obtain a target speech signal.
The speech enhancement method according to claim 3, wherein said performing singular value decomposition inverse operation processing on at least two of said singular values respectively to obtain an original signal component corresponding to each of said singular values comprises:

Obtaining a singular value matrix based on the singular value;

Based on the eigenvalues and the singular value matrix, an original signal component corresponding to each singular value is obtained.
The speech enhancement method according to claim 3, wherein the correlation calculation formula is
$r = \dfrac{\mathrm{Cov}(x, y)}{\sqrt{\mathrm{Var}[x]\,\mathrm{Var}[y]}}$
where x is the original signal component, y is the digital speech signal, Cov(x, y) is the covariance of x and y, Var[x] is the variance of x, Var[y] is the variance of y, and r is the correlation coefficient.
The speech enhancement method according to claim 1, wherein the performing a singular value decomposition inverse operation on at least two of the singular values to obtain a target speech signal comprises:

Calculating a sum of at least two singular values, and multiplying the sum with a preset threshold to obtain a corresponding evaluation threshold; wherein the preset threshold is a positive number not greater than 1;

Linearly superimposing at least two of the singular values in order from large to small to obtain a superposition sum value and, if the superposition sum value is greater than the evaluation threshold, obtaining the N singular values corresponding to the superposition sum value, where N is a positive integer;

Performing batch reconstruction on the N singular values to obtain the target speech signal.
A speech enhancement device, comprising:

A digital voice signal acquisition module, configured to convert original voice information to obtain a digital voice signal;

A Hankel matrix acquisition module, configured to acquire a Hankel matrix based on the digital speech signal;

A singular value acquisition module, configured to perform singular value decomposition operation processing on the Hankel matrix to obtain at least two singular values;

A target voice signal acquisition module, configured to perform an inverse singular value decomposition operation on at least two of the singular values to obtain a target voice signal;

A target voice information acquisition module, configured to perform restoration processing on the target voice signal to acquire target voice information.
The speech enhancement device according to claim 7, wherein the singular value acquisition module comprises:

A transpose matrix calculation unit, configured to calculate the transpose matrix of the Hankel matrix;

An eigenvalue obtaining unit, configured to obtain at least two eigenvalues based on a product of the Hankel matrix and the transposed matrix;

A singular value obtaining unit, configured to operate on at least two of the eigenvalues according to a preset calculation method to obtain at least two of the singular values.
A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, characterized in that the processor implements the following steps when executing the computer-readable instructions:

Converting original voice information to obtain a digital voice signal;

Obtaining a Hankel matrix based on the digital speech signal;

Performing singular value decomposition operation processing on the Hankel matrix to obtain at least two singular values;

Performing an inverse singular value decomposition operation on at least two of the singular values to obtain a target speech signal;

Performing restoration processing on the target voice signal to obtain target voice information.
The computer device according to claim 9, wherein the performing singular value decomposition operation processing on the Hankel matrix to obtain at least two singular values comprises:

Calculating a transposed matrix of the Hankel matrix;

Obtaining at least two eigenvalues based on a product of the Hankel matrix and the transposed matrix;

Operating on at least two of the eigenvalues according to a preset calculation method to obtain at least two of the singular values.
The computer device according to claim 10, wherein said performing a singular value decomposition inverse operation on at least two of said singular values to obtain a target speech signal comprises:

Performing singular value decomposition and inverse operation processing on at least two of the singular values, respectively, to obtain an original signal component corresponding to each of the singular values;

Performing a correlation calculation between the original signal component and the digital voice signal to obtain a correlation coefficient;

Selecting the original signal component whose correlation coefficient is greater than a preset threshold as a target signal component;

Performing linear superposition processing on the target signal component to obtain a target speech signal.
The computer device according to claim 11, wherein the performing singular value decomposition and inverse operation processing on at least two of the singular values respectively to obtain an original signal component corresponding to each of the singular values comprises:

Obtaining a singular value matrix based on the singular value;

Based on the eigenvalues and the singular value matrix, an original signal component corresponding to each singular value is obtained.
The computer device according to claim 11, wherein the correlation calculation formula is
$r = \dfrac{\mathrm{Cov}(x, y)}{\sqrt{\mathrm{Var}[x]\,\mathrm{Var}[y]}}$
where x is the original signal component, y is the digital voice signal, Cov(x, y) is the covariance of x and y, Var[x] is the variance of x, Var[y] is the variance of y, and r is the correlation coefficient.
The computer device according to claim 9, wherein said performing a singular value decomposition inverse operation on at least two of said singular values to obtain a target speech signal comprises:

Calculating a sum of at least two singular values, and multiplying the sum with a preset threshold to obtain a corresponding evaluation threshold; wherein the preset threshold is a positive number not greater than 1;

Linearly superimposing at least two of the singular values in order from large to small to obtain a superposition sum value and, if the superposition sum value is greater than the evaluation threshold, obtaining the N singular values corresponding to the superposition sum value, where N is a positive integer;

Performing batch reconstruction on the N singular values to obtain the target speech signal.
One or more non-volatile readable storage media storing computer-readable instructions, characterized in that when the computer-readable instructions are executed by one or more processors, the one or more processors are caused to execute the following steps:

Converting original voice information to obtain a digital voice signal;

Obtaining a Hankel matrix based on the digital speech signal;

Performing singular value decomposition operation processing on the Hankel matrix to obtain at least two singular values;

Performing an inverse singular value decomposition operation on at least two of the singular values to obtain a target speech signal;

Performing restoration processing on the target voice signal to obtain target voice information.
The non-volatile readable storage medium according to claim 15, wherein the performing singular value decomposition operation processing on the Hankel matrix to obtain at least two singular values comprises:

Calculating a transposed matrix of the Hankel matrix;

Obtaining at least two eigenvalues based on a product of the Hankel matrix and the transposed matrix;

Operating on at least two of the eigenvalues according to a preset calculation method to obtain at least two of the singular values.
The non-volatile readable storage medium according to claim 16, wherein said performing a singular value decomposition inverse operation on at least two of said singular values to obtain a target speech signal comprises:

Performing singular value decomposition and inverse operation processing on at least two of the singular values, respectively, to obtain an original signal component corresponding to each of the singular values;

Performing a correlation calculation between the original signal component and the digital voice signal to obtain a correlation coefficient;

Selecting the original signal component whose correlation coefficient is greater than a preset threshold as a target signal component;

Performing linear superposition processing on the target signal component to obtain a target speech signal.
The non-volatile readable storage medium according to claim 17, wherein the performing singular value decomposition inverse operation processing on at least two of the singular values respectively to obtain an original signal component corresponding to each of the singular values comprises:

Obtaining a singular value matrix based on the singular value;

Based on the eigenvalues and the singular value matrix, an original signal component corresponding to each singular value is obtained.
The non-volatile readable storage medium according to claim 17, wherein the correlation calculation formula is
$r = \dfrac{\mathrm{Cov}(x, y)}{\sqrt{\mathrm{Var}[x]\,\mathrm{Var}[y]}}$
where x is the original signal component, y is the digital voice signal, Cov(x, y) is the covariance of x and y, Var[x] is the variance of x, Var[y] is the variance of y, and r is the correlation coefficient.
The non-volatile readable storage medium according to claim 15, wherein the performing a singular value decomposition inverse operation on at least two of the singular values to obtain a target voice signal comprises:

Calculating a sum of at least two singular values, and multiplying the sum with a preset threshold to obtain a corresponding evaluation threshold; wherein the preset threshold is a positive number not greater than 1;

Linearly superimposing at least two of the singular values in order from large to small to obtain a superposition sum value and, if the superposition sum value is greater than the evaluation threshold, obtaining the N singular values corresponding to the superposition sum value, where N is a positive integer;

Performing batch reconstruction on the N singular values to obtain the target speech signal.