CN110164461B - Voice signal processing method and device, electronic equipment and storage medium

Info

Publication number
CN110164461B
Authority
CN
China
Prior art keywords
filter, voice signal, pole, original, information corresponding
Legal status
Active
Application number
CN201910611481.2A
Other languages
Chinese (zh)
Other versions
CN110164461A (en)
Inventor
王天宝 (Wang Tianbao)
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910611481.2A
Publication of CN110164461A
Application granted
Publication of CN110164461B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 Changing voice quality, e.g. pitch or formants
    • G10L21/007 Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L21/013 Adapting to target pitch
    • G10L2021/0135 Voice conversion or morphing
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/07 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail characterised by the inclusion of specific contents
    • H04L51/10 Multimedia information
    • H04L51/18 Commands or executable codes

Abstract

Embodiments of the present application provide a voice signal processing method and apparatus, an electronic device, and a storage medium. The method comprises the following steps: acquiring an original voice signal; performing linear prediction analysis on the original voice signal to determine an original excitation corresponding to the original voice signal and a first filter; adjusting at least one of pole angle information corresponding to the first filter and pole amplitude information corresponding to the first filter to obtain an adjusted first filter; and determining a target voice signal based on the original excitation corresponding to the original voice signal and the adjusted first filter. The embodiments of the application thereby adjust at least one of the formant frequency and the formant sharpness corresponding to the original voice signal to obtain the target voice signal, realizing voice conversion of the voice input by the user and improving user experience.

Description

Voice signal processing method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of signal processing technologies, and in particular, to a method and apparatus for processing a speech signal, an electronic device, and a storage medium.
Background
With the development of mobile communication, various applications have emerged, including applications with communication functions. Through such applications, users can interact by voice: information input by a user in voice form is sent to the opposite-end user, realizing information interaction.
In the process of information interaction between users by voice, in order to add interest to the interaction, the voice information input by the user can be subjected to sound-changing processing before being sent to the opposite terminal, so that the voice information received by the opposite terminal differs from the voice information input by the user.
How to change the voice input by the user has therefore become a key issue.
Disclosure of Invention
The application provides a voice signal processing method and apparatus, an electronic device, and a storage medium, which can solve at least one of the above technical problems. The technical scheme is as follows:
in a first aspect, a method for processing a speech signal is provided, the method comprising:
acquiring an original voice signal;
performing linear prediction analysis on the original voice signal to determine original excitation corresponding to the original voice signal and a first filter;
adjusting at least one of pole angle information corresponding to the first filter and pole amplitude information corresponding to the first filter to obtain an adjusted first filter;
determining the target speech signal based on the original excitation corresponding to the original speech signal and the adjusted first filter.
In one possible implementation, performing linear prediction analysis on an original speech signal to determine an original excitation corresponding to the original speech signal and a first filter, including:
performing linear prediction analysis on the original voice signal to determine prediction error information corresponding to the original voice signal;
based on the prediction error information corresponding to the original speech signal, an original excitation corresponding to the original speech signal and a first filter are determined.
In another possible implementation, determining the original excitation corresponding to the original speech signal and the first filter based on the prediction error information corresponding to the original speech signal includes:
determining a second filter based on prediction error information corresponding to the original voice signal, wherein the second filter is a filter corresponding to linear prediction analysis;
determining an original excitation corresponding to the original speech signal based on the original speech signal and the second filter, and determining the first filter based on the second filter.
In another possible implementation manner, adjusting at least one of pole angle information corresponding to the first filter and pole amplitude information corresponding to the first filter to obtain an adjusted first filter includes:
and if the pole angle information corresponding to the first filter meets the preset condition, adjusting at least one of the pole angle information and the pole amplitude information corresponding to the first filter according to a preset mode.
In another possible implementation manner, the pole angle information corresponding to the first filter includes: a pole angle value corresponding to at least one pole;
adjusting the pole angle information corresponding to the first filter according to a preset manner includes at least one of the following:
increasing a pole angle value corresponding to at least one pole by a first preset threshold;
and reducing the pole angle value corresponding to the at least one pole by a second preset threshold value.
In another possible implementation, the pole amplitude information corresponding to the first filter includes: a pole amplitude value corresponding to the at least one pole;
adjusting the pole amplitude information corresponding to the first filter according to a preset manner includes:
and adjusting the pole amplitude value corresponding to the at least one pole according to a preset multiple.
In another possible implementation, before the obtaining of the original speech signal, the method further includes:
acquiring a voice signal input by a user;
and denoising the voice signal input by the user, and taking the denoised voice signal as an original voice signal.
In a second aspect, there is provided a speech signal processing apparatus comprising:
the first acquisition module is used for acquiring an original voice signal;
the first determining module is used for carrying out linear prediction analysis on the original voice signal and determining original excitation corresponding to the original voice signal and a first filter;
the adjusting module is used for adjusting at least one of pole angle information corresponding to the first filter and pole amplitude information corresponding to the first filter to obtain an adjusted first filter;
and the second determining module is used for determining the target voice signal based on the original excitation corresponding to the original voice signal and the adjusted first filter.
In one possible implementation, the first determining module includes a first determining unit and a second determining unit, where,
the first determining unit is used for carrying out linear prediction analysis on the original voice signal and determining prediction error information corresponding to the original voice signal;
And the second determining unit is used for determining the original excitation corresponding to the original voice signal and the first filter based on the prediction error information corresponding to the original voice signal.
In another possible implementation manner, the second determining unit is specifically configured to determine, based on prediction error information corresponding to the original speech signal, a second filter, where the second filter is a filter corresponding to linear prediction analysis;
the second determining unit is specifically further configured to determine an original excitation corresponding to the original speech signal based on the original speech signal and the second filter, and determine the first filter based on the second filter.
In another possible implementation manner, the adjusting module is specifically configured to adjust at least one of pole angle information and pole amplitude information corresponding to the first filter according to a preset manner when the pole angle information corresponding to the first filter meets a preset condition.
In another possible implementation manner, the pole angle information corresponding to the first filter includes: a pole angle value corresponding to at least one pole; the adjustment module comprises: at least one of an increasing unit and a decreasing unit, wherein,
an increasing unit, configured to increase a pole angle value corresponding to at least one pole by a first preset threshold;
And the reducing unit is used for reducing the pole angle value corresponding to the at least one pole by a second preset threshold value.
In another possible implementation, the pole amplitude information corresponding to the first filter includes: a pole amplitude value corresponding to the at least one pole;
the adjusting module is specifically configured to adjust a pole amplitude value corresponding to at least one pole according to a preset multiple.
In another possible implementation manner, the voice signal processing apparatus further includes: a second acquisition module and a denoising processing module, wherein,
the second acquisition module is used for acquiring a voice signal input by a user;
the denoising processing module is used for denoising the voice signal input by the user and taking the denoised voice signal as an original voice signal.
In a third aspect, an electronic device is provided, the electronic device comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications being configured to perform the operations corresponding to the voice signal processing method shown in the first aspect or any possible implementation of the first aspect.
In a fourth aspect, a computer readable storage medium is provided, on which a computer program is stored, which program, when being executed by a processor, implements the speech signal processing method according to the first aspect or any of the possible implementations of the first aspect.
The technical scheme provided by the embodiment of the application has the beneficial effects that:
Compared with the prior art, the voice signal processing method and apparatus, electronic device, and storage medium of the present application determine the original excitation and the first filter corresponding to the original voice signal through linear prediction analysis of the original voice signal, obtain an adjusted first filter by adjusting at least one of the pole angle information and the pole amplitude information corresponding to the first filter, and then determine the target voice signal using the original excitation corresponding to the original voice signal and the adjusted first filter. In this way, at least one of the formant frequency and the formant sharpness corresponding to the original voice signal can be adjusted to obtain the target voice signal, realizing voice conversion of the voice input by the user and improving user experience.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings that are required to be used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a flow chart of a voice signal processing method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a sound-variable display interface according to an embodiment of the present application;
FIG. 3 is a flow chart of converting an original voice signal into a cold voice according to an embodiment of the application;
fig. 4 is a schematic diagram of a spectrogram corresponding to an original speech signal according to an embodiment of the present application;
FIG. 5 is a schematic diagram of the spectrogram after the voice is changed into a cold voice according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a voice signal processing device according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an electronic device for processing a voice signal according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the application.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as will be understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein includes all or any combination of one or more of the associated listed items.
First, several terms related to the present application are described and explained:
QR decomposition method: decomposing a matrix into the product of an orthogonal matrix Q and an upper triangular matrix R. Specifically, the QR decomposition method is the most effective and widely applied method for obtaining all eigenvalues of a general matrix: the general matrix is first transformed into a Hessenberg matrix through orthogonal similarity transformations, and the QR method is then applied to obtain the eigenvalues and eigenvectors;
Monic polynomial: a polynomial whose leading coefficient is 1.
The following describes the technical scheme of the present application and how the technical scheme of the present application solves the above technical problems in detail with specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
The embodiment of the application provides a voice signal processing method, as shown in fig. 1, which comprises the following steps:
step S101, an original speech signal is acquired.
For the embodiment of the application, the original voice signal in the preset time period is obtained. For example, the preset time period is 25 ms, 20 ms, or 15 ms.
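By way of a non-limiting illustration, extracting such a frame might be sketched as follows in Python; the 16 kHz sampling rate and the helper name are assumptions of this sketch, not part of the embodiment:

    import numpy as np

    def get_frame(signal: np.ndarray, start: int, fs: int = 16000,
                  frame_ms: int = 20) -> np.ndarray:
        # Extract one analysis frame of frame_ms milliseconds starting at sample
        # `start`; fs = 16000 and frame_ms = 20 are illustrative assumptions.
        frame_len = fs * frame_ms // 1000   # e.g. 320 samples for 20 ms at 16 kHz
        return signal[start:start + frame_len]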
Step S102, linear prediction analysis is carried out on the original voice signal, and original excitation corresponding to the original voice signal and a first filter are determined.
For the embodiment of the application, linear predictive analysis (Linear Prediction Analysis, LPA) is a technical means of performing speech signal analysis by treating the signal as the output of a model and describing the signal with model parameters. In the embodiment of the application, the linear prediction analysis of the original voice signal is mainly realized by means of a linear prediction error filter.
Specifically, step S102 may specifically include: the original excitation corresponding to the original speech signal and the first filter are determined using the original speech signal and the linear prediction error filter.
Step S103, at least one of pole angle information corresponding to the first filter and pole amplitude information corresponding to the first filter is adjusted, and the adjusted first filter is obtained.
For the embodiment of the application, when sound passes through a resonant cavity, it is filtered by the cavity, so that the energy at different frequencies in the frequency domain is redistributed: one part is reinforced by the resonance of the cavity, and another part is attenuated. Because the energy distribution is uneven, the reinforced parts appear as peaks, like the peaks of a roller coaster, and are called formants. Formants are closely related to voice timbre. In the embodiment of the application, at least one of the pole angle information and the pole amplitude information corresponding to the first filter is adjusted, thereby adjusting the pole positions of the first filter and hence the first filter itself, so that after the original excitation corresponding to the original voice signal passes through the adjusted first filter, the formants corresponding to the original voice are modified and the sound-changing effect is realized.
The specific manner of adjusting at least one of the pole angle information corresponding to the first filter and the pole amplitude information corresponding to the first filter is described in the following embodiments, which are not described herein again.
Step S104, determining a target voice signal based on the original excitation corresponding to the original voice signal and the adjusted first filter.
For the embodiment of the application, the target voice signal is obtained by filtering the original excitation corresponding to the original voice signal through the adjusted first filter.
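By way of a non-limiting sketch (not the implementation of the embodiment), assuming the adjusted first filter is an all-pole filter described by adjusted linear prediction coefficients alpha_adj (as derived later in the description), this synthesis step could look as follows; scipy's lfilter performs the recursive filtering:

    import numpy as np
    from scipy.signal import lfilter

    def synthesize(excitation: np.ndarray, alpha_adj: np.ndarray) -> np.ndarray:
        # Filter the original excitation through the adjusted all-pole filter
        # H'(z) = 1 / (1 - sum_i alpha'_i z^-i) to obtain the target speech frame.
        a = np.concatenate(([1.0], -alpha_adj))   # denominator [1, -a1', ..., -ap']
        return lfilter([1.0], a, excitation)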
Compared with the prior art, the embodiment of the application determines the original excitation and the first filter corresponding to the original voice signal by performing linear prediction analysis on the original voice signal, obtains the adjusted first filter by adjusting at least one of the pole angle information and the pole amplitude information corresponding to the first filter, and then determines the target voice signal using the original excitation corresponding to the original voice signal and the adjusted first filter. That is, at least one of the formant frequency and the formant sharpness corresponding to the original voice signal can be adjusted to obtain the target voice signal, realizing voice conversion of the voice input by the user and improving user experience.
In one possible implementation manner of the embodiment of the present application, step S102 may specifically include: performing linear prediction analysis on the original voice signal to determine prediction error information corresponding to the original voice signal; based on the prediction error information corresponding to the original speech signal, an original excitation corresponding to the original speech signal and a first filter are determined.
For the embodiment of the application, the linear prediction error filter is utilized to carry out linear prediction analysis on the original voice signal, and the prediction error information corresponding to the original voice signal is determined.
For the embodiment of the present application, the transfer function of the linear prediction error filter can be expressed by the following formula (1-1):

A(z) = 1 - Σ_{i=1}^{p} α_i z^{-i}   (1-1)

where A(z) represents the transfer function of the linear prediction error filter, z is a complex variable, p represents the order of the linear prediction error filter, α_i represents the i-th coefficient of the linear prediction error filter, i is the index of the coefficient, and Σ is the summation symbol.
For the embodiment of the application, the original voice signal is input into the linear prediction error filter to obtain the prediction error information corresponding to the original voice signal; that is, the original voice signal s(n) is filtered by the filter of formula (1-1) to obtain the prediction error information.

The prediction error information can be represented by the following formula (1-2):

e(n) = s(n) - Σ_{i=1}^{p} α_i s(n-i)   (1-2)

where e(n) represents the prediction error information, n is the time index in units of sampling periods, s(n) represents the original speech signal at the n-th sampling period, p represents the order of the linear prediction error filter, α_i represents the i-th coefficient of the linear prediction error filter, n-i denotes the i-th sampling period before the n-th sampling period, and s(n-i) represents the original speech signal i sampling periods before the n-th sampling period. Letting ŝ(n) = Σ_{i=1}^{p} α_i s(n-i), ŝ(n) is the predicted value of s(n), where "^" is the prediction symbol and Σ is the summation symbol.
For the embodiment of the application, the specific implementation of determining the original excitation corresponding to the original voice signal and the first filter based on the prediction error information corresponding to the original voice signal is as follows:
in another possible implementation manner of the embodiment of the present application, determining, based on prediction error information corresponding to an original speech signal, an original excitation corresponding to the original speech signal and the first filter may specifically include: determining a second filter based on prediction error information corresponding to the original speech signal; the method comprises determining an original excitation corresponding to the original speech signal based on the original speech signal and the second filter, and determining the first filter based on the second filter.
The second filter is a filter corresponding to linear prediction analysis.
For the embodiment of the application, the prediction error information can be represented by formula (1-2), and formula (1-2) is processed to calculate the coefficients {α_i}, i = 1, 2, …, p, of the linear prediction error filter. The second filter is then determined based on the coefficients {α_i}, i = 1, 2, …, p, of the linear prediction error filter; the original excitation corresponding to the original speech signal is determined based on the original speech signal and the second filter, and the first filter is determined based on the second filter.

How formula (1-2) is processed to calculate the coefficients {α_i}, i = 1, 2, …, p, of the linear prediction error filter is shown below:
for the embodiment of the present application, the coefficients of the linear prediction error filter are calculated by minimizing the prediction error information e (n) under a certain criterion based on the above formula (1-2). In the embodiment of the application, the prediction error information E (n) is minimized under a certain criterion, and the mean square error value E [ E ] of the prediction error information can be adopted 2 (n)]Minimum mean square error value E [ E ] of prediction error information 2 (n)]Can be represented by the following formula (1-3):
wherein E [ E ] 2 (n)]Mean square error value representing prediction error information, also called mathematical expectation value of the prediction error information, "E" is a mathematical expectation symbol, E (n) represents the prediction error information, n represents a time parameter in units of sampling period, E 2 (n) represents the square of the prediction error information, s (n) represents the original speech signal of the nth sampling period, p represents the order of the linear prediction error filter, α i The i-th coefficient of the linear prediction error filter is represented, n-i represents the i-th sampling period before the n-th sampling period, s (n-i) represents the original speech signal of the i-th sampling period before the n-th sampling period, and 'sigma' is the sum symbol.
For the embodiment of the application, on the one hand, formula (1-3) is differentiated with respect to each coefficient and the derivative is set to 0, namely:

Let ∂E[e²(n)]/∂α_j = 0, 1 ≤ j ≤ p, which gives the following formula (1-4):

E[e(n) s(n-j)] = 0, 1 ≤ j ≤ p   (1-4)

where ∂/∂α_j denotes partial differentiation, E[e²(n)] represents the mean square error of the prediction error information, "E" is the expectation operator, e(n) represents the prediction error information, n is the time index in units of sampling periods, α_j represents the j-th coefficient of the linear prediction error filter, p represents the order of the linear prediction error filter, and s(n-j) represents the original speech signal j sampling periods before the n-th sampling period.
Substituting formula (1-2) into formula (1-4) yields the following formula (1-5):

r(j) = Σ_{i=1}^{p} α_i r(j-i), 1 ≤ j ≤ p   (1-5)

where "E" is the expectation operator, s(n) represents the original speech signal at the n-th sampling period, n is the time index in units of sampling periods, s(n-j) represents the original speech signal j sampling periods before the n-th sampling period, p represents the order of the linear prediction error filter, α_i represents the i-th coefficient of the linear prediction error filter, s(n-i) represents the original speech signal i sampling periods before the n-th sampling period, r(j) = E[s(n) s(n-j)] is the j-th value of the autocorrelation function of s(n), r(j-i) is the (j-i)-th value of the autocorrelation function of s(n), and Σ is the summation symbol.

Formula (1-5) is a system of linear equations known as the Yule-Walker equations.
For embodiments of the present application, on the other hand, the minimum of the mean square error of the prediction error information, i.e. the minimum of E[e²(n)], is calculated. In the embodiment of the present application, the minimum of the mean square error of the prediction error information can be expressed by the following formula (1-6):

E_p = r(0) - Σ_{i=1}^{p} α_i r(i)   (1-6)

where E_p represents the minimum of the mean square error of the prediction error information, E[e²(n)] represents the mean square error of the prediction error information, also called its mathematical expectation, "E" is the expectation operator, e(n) represents the prediction error information, n is the time index in units of sampling periods, s(n) represents the original speech signal at the n-th sampling period, p represents the order of the linear prediction error filter, α_i represents the i-th coefficient of the linear prediction error filter, r(0) is the 0-th value of the autocorrelation function of s(n), r(i) is the i-th value of the autocorrelation function of s(n), and Σ is the summation symbol.
For the embodiment of the application, on the one hand, differentiating formula (1-3) and setting the derivative to 0 gives formula (1-4), and substituting formula (1-2) into formula (1-4) gives formula (1-5); on the other hand, minimizing E[e²(n)] in formula (1-3) gives formula (1-6). In the embodiment of the application, the solution expressions corresponding to the linear prediction analysis are obtained from formula (1-5) and formula (1-6), and can be represented by the following formula (1-7):

r(j) = Σ_{i=1}^{p} α_i r(j-i), 1 ≤ j ≤ p
E_p = r(0) - Σ_{i=1}^{p} α_i r(i)   (1-7)

where r(j) is the j-th value of the autocorrelation function of s(n), p is the order of the linear prediction error filter, α_i is the i-th coefficient of the linear prediction error filter, r(j-i) is the (j-i)-th value of the autocorrelation function of s(n), r(0) is the 0-th value, r(i) is the i-th value, E_p is the minimum of the mean square error of the prediction error information, and Σ is the summation symbol.
For the embodiment of the application, the coefficients {α_i}, i = 1, 2, …, p, of the linear prediction error filter can be obtained by solving formula (1-7). In the embodiment of the application, the key to solving formula (1-7) is evaluating r(j), and evaluating r(j) involves an ensemble average. In the embodiment of the application, the targeted signal is a speech signal; in general, a speech signal is considered short-time stationary, that is, over a short interval the random signal corresponding to the speech signal is regarded as a stationary random process that is ergodic over its states, so the ensemble average equals the time average. Accordingly, when solving for r(j), the time average (1/N) Σ_n s(n) s(n-j) can be used to estimate it.

For the embodiment of the application, estimating r(j) by the time average involves the constant factor 1/N. Since this factor does not affect the solution of formula (1-7) for the coefficients, it can be dropped; and since n cannot be summed to infinity in practice, a sufficiently large value N is preset to estimate r(j), as follows:
Assuming that the original speech signal s(n) is 0 outside the range 0 ≤ n ≤ N, the estimated value of r(j) can be expressed by the following formula (1-8):

r(j) = Σ_{n=j}^{N} s(n) s(n-j), 0 ≤ j ≤ p   (1-8)

where r(j) represents the j-th value of the autocorrelation function of s(n), p represents the order of the linear prediction error filter, n is the time index in units of sampling periods, N is the preset value, s(n) represents the original speech signal at the n-th sampling period, s(n-j) represents the original speech signal j sampling periods before the n-th sampling period, and Σ is the summation symbol.
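A minimal sketch of the estimate of formula (1-8), written directly from the definition (the helper name and the use of the frame length as N are assumptions of this sketch):

    import numpy as np

    def autocorr(s: np.ndarray, p: int) -> np.ndarray:
        # Return r(0), r(1), ..., r(p) per formula (1-8): r(j) = sum_n s(n) s(n-j),
        # with N taken as the frame length.
        N = len(s)
        return np.array([np.dot(s[j:], s[:N - j]) for j in range(p + 1)])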
For the embodiment of the present application, using the even-function property r(j) = r(-j) retained by formula (1-8), formula (1-7) can be rewritten as the following formula (1-9):

⎡ r(0)    r(1)    …  r(p-1) ⎤ ⎡ α_1 ⎤   ⎡ r(1) ⎤
⎢ r(1)    r(0)    …  r(p-2) ⎥ ⎢ α_2 ⎥ = ⎢ r(2) ⎥
⎢  ⋮       ⋮      ⋱    ⋮    ⎥ ⎢  ⋮  ⎥   ⎢  ⋮   ⎥
⎣ r(p-1)  r(p-2)  …  r(0)   ⎦ ⎣ α_p ⎦   ⎣ r(p) ⎦   (1-9)

where r(0), r(1), r(2), …, r(p-2), r(p-1), r(p) are the 0-th through p-th values of the autocorrelation function of s(n), α_1, α_2, …, α_p are the coefficients of the linear prediction error filter, and the minimum mean square error E_p of the prediction error information is then given by formula (1-6).
For the embodiment of the application, the coefficient matrix in formula (1-9) is a Toeplitz matrix, and the Levinson-Durbin algorithm can be used to solve formula (1-9) efficiently, yielding the coefficients {α_i}, i = 1, 2, …, p, of the linear prediction error filter.
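A minimal sketch of the Levinson-Durbin recursion for formula (1-9); r is the vector of autocorrelation values r(0), …, r(p) from formula (1-8), and the sketch assumes a nondegenerate frame (the prediction error energy stays positive):

    import numpy as np

    def levinson_durbin(r: np.ndarray, p: int):
        # Solve the Toeplitz system (1-9) for the coefficients alpha_1, ..., alpha_p.
        # Returns (alpha, E_p), where E_p is the minimum mean square error of (1-6).
        alpha = np.zeros(p)
        E = r[0]
        for m in range(1, p + 1):
            k = (r[m] - np.dot(alpha[:m - 1], r[m - 1:0:-1])) / E  # reflection coeff.
            prev = alpha[:m - 1].copy()
            alpha[:m - 1] = prev - k * prev[::-1]   # alpha_i <- alpha_i - k*alpha_{m-i}
            alpha[m - 1] = k
            E *= (1.0 - k * k)
        return alpha, E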
For the embodiment of the application, the obtained coefficients {α_i}, i = 1, 2, …, p, of the linear prediction error filter are substituted into formula (1-1) to determine the second filter, which can be expressed by formula (1-1).
For the embodiment of the application, the original voice signal is s(n); s(n) is input to the determined second filter for whitening, and the original excitation corresponding to the original voice signal is obtained, which can be represented by formula (1-2).
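A sketch of this whitening step: since the second filter A(z) of formula (1-1) is a finite impulse response filter, passing s(n) through it yields exactly e(n) of formula (1-2) (the helper name is assumed):

    import numpy as np
    from scipy.signal import lfilter

    def excitation(s: np.ndarray, alpha: np.ndarray) -> np.ndarray:
        # Whiten s(n) with A(z) = 1 - sum_i alpha_i z^-i to obtain e(n), formula (1-2).
        a_z = np.concatenate(([1.0], -alpha))   # FIR coefficients of A(z)
        return lfilter(a_z, [1.0], s)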
For the embodiment of the application, the transfer function of the first filter is the reciprocal of the transfer function of the second filter. In an embodiment of the present application, the first filter can be represented by the following formula (1-10):

H(z) = 1 / A(z) = 1 / (1 - Σ_{i=1}^{p} α_i z^{-i})   (1-10)

where H(z) represents the transfer function of the first filter, z is a complex variable, p represents the order of the linear prediction error filter, α_i represents the i-th coefficient of the linear prediction error filter, i is the index of the coefficient, and Σ is the summation symbol.
Another possible implementation manner of the embodiment of the present application may further include, before step S103: and determining a pole corresponding to the first filter, and determining pole angle information corresponding to the first filter and pole amplitude information corresponding to the first filter based on the pole corresponding to the first filter.
Specifically, the first filter can be represented by formula (1-10), and the poles of the first filter can be obtained by performing the corresponding calculation on formula (1-10); any pole comprises pole angle information and pole amplitude information. In the embodiment of the application, all roots that make the denominator of formula (1-10) equal to 0 are solved for. In the embodiment of the application, the QR decomposition method is adopted to solve for all these roots, obtaining all poles corresponding to the first filter, and then the pole angle values and pole amplitude values corresponding to all poles of the first filter.
Specifically, all poles corresponding to the first filter are obtained through formulas (1-10), and then pole angle values and pole amplitude values corresponding to all poles of the first filter are obtained as follows:
For the embodiment of the present application, solving by the QR decomposition method for all the roots that make the denominator of formula (1-10) equal to 0 may specifically include:
First, the denominator of formula (1-10) is written as a monic polynomial of degree n and set equal to 0. The monic polynomial equation can be expressed by the following formula (1-11):
Q_n(x) = x^n + b_{n-1} x^{n-1} + … + b_1 x + b_0 = 0   (1-11)

where Q_n(x) denotes a polynomial of degree n in x, n is the degree of the variable x, x is the variable, and b_0, b_1, …, b_{n-1} are the coefficients of Q_n(x).
For the embodiment of the application, formula (1-11) can be regarded as the characteristic equation of a real matrix, so solving all roots of formula (1-11) can be converted into solving all eigenvalues of that real matrix. In the embodiment of the present application, rewriting formula (1-11) gives the real matrix (the companion matrix of Q_n(x)) shown in the following formula (1-12):

    ⎡ 0  0  …  0  -b_0     ⎤
    ⎢ 1  0  …  0  -b_1     ⎥
B = ⎢ 0  1  …  0  -b_2     ⎥   (1-12)
    ⎢ ⋮  ⋮  ⋱  ⋮   ⋮       ⎥
    ⎣ 0  0  …  1  -b_{n-1} ⎦

where B represents the real matrix and b_0, b_1, …, b_{n-1} are elements of B.
For the embodiment of the present application, the matrix of formula (1-12) is an upper Hessenberg matrix, and all eigenvalues of the real matrix B can be obtained directly by QR decomposition, which is not described in detail here. In the embodiment of the application, obtaining all eigenvalues of the real matrix B yields all poles corresponding to the first filter, and any pole comprises a pole angle value and a pole amplitude value.
For example, pole 1 can be represented by the following formula (1-13):

z_1 = r_1 e^{jω_1}   (1-13)

where z_1 denotes pole 1, r_1 denotes the pole amplitude information corresponding to pole 1, and ω_1 denotes the pole angle information corresponding to pole 1.
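As an illustrative note, numpy's roots function performs this same computation (it builds the companion matrix of the monic polynomial, as in formula (1-12), and obtains its eigenvalues by QR iteration), so extracting the poles and their polar form (1-13) can be sketched as follows (helper name assumed):

    import numpy as np

    def poles_polar(alpha: np.ndarray):
        # Poles of H(z) are the roots of A(z) = 1 - sum_i alpha_i z^-i (formula 1-10);
        # multiplying A(z) by z^p gives the monic polynomial z^p - a1 z^(p-1) - ... - ap.
        a_z = np.concatenate(([1.0], -alpha))
        poles = np.roots(a_z)                   # companion-matrix eigenvalues
        return np.abs(poles), np.angle(poles)   # amplitudes r_k, angles omega_k (rad)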
Another possible implementation manner of the embodiment of the present application, S103 may specifically include: and if the pole angle information corresponding to the first filter meets the preset condition, adjusting at least one of the pole angle information and the pole amplitude information corresponding to the first filter according to a preset mode.
For the embodiment of the present application, the pole angle information corresponding to the first filter may specifically include: a pole angle value corresponding to at least one pole; the pole amplitude information corresponding to the first filter may specifically include: and a pole amplitude value corresponding to the at least one pole.
Further, the pole angle information corresponding to the first filter meeting the preset condition may include: whether each pole angle value among the pole angle values corresponding to the at least one pole of the first filter meets the preset condition. In the embodiment of the application, for every pole meeting the preset condition, at least one of the corresponding pole angle value and pole amplitude value is adjusted according to a preset manner.
For the embodiment of the present application, when the pole angle value corresponding to any pole belongs to [-a, a], at least one of the pole angle information and the pole amplitude information corresponding to the first filter may be adjusted according to a preset manner.
For the embodiment of the application, when the pole angle value corresponding to any pole belongs to [-a, a], only the pole angle value corresponding to that pole may be adjusted, leaving the pole amplitude value unadjusted.
Taking the first pole z_1 = r_1 e^{jω_1} as an example, only ω_1 may be adjusted while r_1 remains unchanged.
According to a possible implementation manner of the embodiment of the present application, adjusting the pole angle information corresponding to the first filter according to a preset manner includes at least one of: increasing the pole angle value corresponding to the at least one pole by a first preset threshold, and decreasing the pole angle value corresponding to the at least one pole by a second preset threshold.
For the embodiment of the present application, the first preset threshold value and the second preset threshold value may be the same or different. The embodiment of the application is not limited.
For the embodiment of the application, when the poles corresponding to the first filter comprise at least two poles, the pole angle values of those poles whose pole angle value belongs to (0, a) are increased by a first preset threshold, and the pole angle values of those poles whose pole angle value belongs to [-a, 0) are decreased by a second preset threshold.
For the embodiment of the application, when the pole corresponding to the first filter comprises one pole, if the pole angle value of the pole belongs to (0, a), the pole angle value of the pole is increased by a first preset threshold value, and if the pole angle value of the pole belongs to [ -a, 0), the pole angle value of the pole is decreased by a second preset threshold value.
Taking the first pole z_1 = r_1 e^{jω_1} as an example, when ω_1 belongs to (0, 3], ω_1 is increased by X; when ω_1 belongs to [-3, 0), ω_1 is decreased by X, while r_1 may remain unchanged.

Here X ∈ [0.07, 0.11].
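A sketch of this angle adjustment with a = 3 and a shift X in [0.07, 0.11], operating on the complex poles returned above (a conjugate pole pair at ±ω stays a conjugate pair, since +ω is increased by X while -ω is decreased by X); the default X = 0.09 is an illustrative assumption:

    import numpy as np

    def shift_angles(poles: np.ndarray, X: float = 0.09, a: float = 3.0) -> np.ndarray:
        # Increase by X the angles in (0, a], decrease by X the angles in [-a, 0);
        # pole amplitudes are left unchanged.
        r, w = np.abs(poles), np.angle(poles)
        w = np.where((w > 0) & (w <= a), w + X, w)
        w = np.where((w >= -a) & (w < 0), w - X, w)
        return r * np.exp(1j * w)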
For the embodiment of the application, when the pole angle value corresponding to any pole belongs to [-a, a], the pole angle value and the pole amplitude value corresponding to that pole may also be adjusted simultaneously.
The adjustment manner of the pole angle value corresponding to the pole is described above, and is not described herein.
Another possible implementation manner of the embodiment of the present application, according to a preset method, adjusts pole amplitude information corresponding to the first filter, including: and adjusting the pole amplitude value corresponding to the at least one pole according to a preset multiple.
For the embodiment of the application, the pole amplitude values corresponding to all poles whose pole angle value belongs to [-a, a] are each adjusted to Y times their original value.

Taking the first pole z_1 = r_1 e^{jω_1} as an example, when ω_1 ∈ [-3, 3], r_1 may be adjusted to Y times its original value.

Here Y ∈ [0.8, 1.2].
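A sketch of the amplitude adjustment with Y in [0.8, 1.2]: a Y below 1 moves poles away from the unit circle, broadening the formants (smaller formant sharpness), while a Y above 1 sharpens them. The stability clamp and the default Y = 0.9 are assumptions of this sketch, not part of the embodiment:

    import numpy as np

    def scale_amplitudes(poles: np.ndarray, Y: float = 0.9, a: float = 3.0) -> np.ndarray:
        # Scale by Y the amplitudes of poles whose angle lies in [-a, a]; the clamp
        # keeps poles inside the unit circle so the all-pole filter stays stable.
        r, w = np.abs(poles), np.angle(poles)
        mask = (w >= -a) & (w <= a)
        r = np.where(mask, np.minimum(r * Y, 0.999), r)
        return r * np.exp(1j * w)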
For the embodiment of the application, when the pole angle value corresponding to any pole belongs to [-a, a], only the pole amplitude value corresponding to that pole may be adjusted, leaving the pole angle value unadjusted. In the embodiment of the present application, the pole amplitude value is adjusted as described above, and details are not repeated here.
For the embodiment of the application, at least one of the pole angle information and the pole amplitude information corresponding to the first filter is adjusted, namely at least one of the formant frequency corresponding to the original voice signal and the formant sharpness corresponding to the original voice signal is adjusted.
For example, if the target voice signal is the signal corresponding to a cold voice, the spectrogram corresponding to the original voice signal is shown in fig. 4 and the spectrogram after the voice is changed into the cold voice is shown in fig. 5. In fig. 4 and fig. 5, the horizontal axis represents time and the vertical axis represents frequency; the brightness of a point in the spectrogram indicates the amplitude of the corresponding frequency component: the brighter the point, the larger the amplitude, and the darker the point, the smaller the amplitude. When the amplitude of the frequency component at a point is larger than that at the surrounding points, the point is a formant. In the embodiment of the present application, the formants in region 1 of fig. 4 are concentrated between frequencies 0 and 1000, while the formants in region 2 of fig. 5 are concentrated between frequencies 1000 and 3000; comparing fig. 4 and fig. 5 thus shows that the formant frequencies corresponding to the adjusted voice signal are shifted up (increased). The luminance contrast in fig. 4 and fig. 5 characterizes formant sharpness: in fig. 4, a larger luminance contrast at the formants indicates sharper formants (greater formant sharpness); in fig. 5, a smaller luminance contrast indicates flatter formants (smaller formant sharpness). Comparing fig. 4 and fig. 5 thus shows that the formant sharpness corresponding to the adjusted speech signal is reduced.
For the embodiment of the application, adjusting at least one of the pole angle information and the pole amplitude information corresponding to the first filter adjusts at least one of the formant frequency and the formant sharpness corresponding to the original voice signal, so that the target voice obtained through the adjustment differs from the original voice, realizing the sound-changing effect and further improving user experience.
Another possible implementation manner of the embodiment of the present application, S101 may further include: acquiring a voice signal input by a user; and denoising the voice signal input by the user, and taking the denoised voice signal as an original voice signal.
For the embodiment of the application, the voice signal input by the user can comprise the voice signal input by the user in real time, and also can comprise the voice signal stored locally. The embodiment of the application is not limited.
For the embodiment of the application, the acquired voice signal input by the user can be subjected to denoising processing in the actual application process, the voice signal after denoising processing is used as an original voice signal, and the voice signal input by the user can not be subjected to denoising processing, namely the voice signal input by the user is used as the original voice signal. The embodiment of the application is not limited.
For the embodiment of the application, the denoising processing can be performed on the voice signal input by the user in a plurality of feasible denoising modes. For example, a speech signal input by a user is input into a trained noise separation neural network model to be subjected to denoising processing.
For the embodiments of the present application, the above embodiments may be executed by a terminal device, or may be executed by a server, or may be executed partly by a terminal device, or partly by a server. The embodiment of the application is not limited.
The above embodiments describe in detail how the original voice signal is processed to obtain the target voice signal (the voice signal after sound-changing). The following describes a specific implementation of the present application in combination with a specific application scenario (converting the original voice signal into a cold voice signal), as follows:
If the voice input by the user is to be changed into a cold voice, an original voice signal is obtained (the original voice signal may be a voice signal obtained by denoising the voice input by the user, or the voice signal corresponding to the voice input by the user directly). The original voice signal is then subjected to sound-changing processing to obtain a sound-changed voice signal, which is encoded and sent to the opposite terminal through the Internet; the opposite terminal decodes the received information and plays it, i.e., the played voice is the cold voice, as shown in fig. 3.
For the embodiment of the present application, a user may trigger any one of the target objects ("askew fruit", "cold" and "trapped animal") displayed in the operation interface, as shown in fig. 2, and if the target voice signal is a voice signal corresponding to the cold sound, the user may trigger the "cold" object in the operation interface shown in fig. 2.
For the embodiment of the application, the voice signal processing method comprises the following steps: determining the corresponding original excitation and the first filter from the original voice signal; increasing by X (X ∈ [0.07, 0.11]) the pole angle values of all poles of the first filter whose pole angle value ∈ (0, 3]; decreasing by X (X ∈ [0.07, 0.11]) the pole angle values of all poles whose pole angle value ∈ [-3, 0); and/or adjusting the pole amplitude values of all poles whose pole angle value ∈ [-3, 3] to 0.8 to 1.2 times their original values, thereby obtaining the adjusted first filter; and performing the sound-changing processing on the original excitation through the adjusted first filter.
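Putting the steps of this scenario together, a minimal end-to-end sketch for one analysis frame is given below; the helper functions are the ones sketched earlier in this description, and the order p = 16 and the parameter defaults are illustrative assumptions:

    import numpy as np

    def cold_voice_frame(s: np.ndarray, p: int = 16,
                         X: float = 0.09, Y: float = 0.9) -> np.ndarray:
        # One frame of the sound-changing pipeline: LPC analysis, pole adjustment,
        # resynthesis, reusing autocorr, levinson_durbin, excitation, shift_angles,
        # scale_amplitudes and synthesize from the sketches above.
        r = autocorr(s, p)                    # formula (1-8)
        alpha, _ = levinson_durbin(r, p)      # solve formula (1-9)
        e = excitation(s, alpha)              # whitening, formula (1-2)
        poles = np.roots(np.concatenate(([1.0], -alpha)))
        poles = scale_amplitudes(shift_angles(poles, X), Y)
        # Rebuild the denominator from the adjusted poles; conjugate symmetry is
        # preserved by the adjustments, so the imaginary residue is numerical noise.
        a_adj = np.real(np.poly(poles))       # [1, -a1', ..., -ap']
        return synthesize(e, -a_adj[1:])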
The above-mentioned method steps specifically describe the speech signal processing method, and the speech signal processing device is described below in terms of virtual modules or virtual units, specifically as follows:
An embodiment of the present application provides a speech signal processing apparatus, as shown in fig. 6, the speech signal processing apparatus 60 includes: a first acquisition module 601, a first determination module 602, an adjustment module 603, and a second determination module 604, wherein,
the first obtaining module 601 is configured to obtain an original voice signal.
The first determining module 602 is configured to perform linear prediction analysis on the original speech signal, and determine an original excitation corresponding to the original speech signal and a first filter.
The adjusting module 603 is configured to adjust at least one of pole angle information corresponding to the first filter and pole amplitude information corresponding to the first filter, to obtain an adjusted first filter.
The second determining module 604 is configured to determine the target speech signal based on the original excitation corresponding to the original speech signal and the adjusted first filter.
In one possible implementation manner of the embodiment of the present application, the first determining module 602 may specifically include a first determining unit and a second determining unit, where,
the first determining unit is used for performing linear prediction analysis on the original voice signal and determining prediction error information corresponding to the original voice signal.
And the second determining unit is used for determining the original excitation corresponding to the original voice signal and the first filter based on the prediction error information corresponding to the original voice signal.
In another possible implementation manner of the embodiment of the present application, the second determining unit is specifically configured to determine, based on prediction error information corresponding to the original speech signal, a second filter, where the second filter is a filter corresponding to linear prediction analysis.
The second determining unit is specifically further configured to determine an original excitation corresponding to the original speech signal based on the original speech signal and the second filter, and determine the first filter based on the second filter.
In another possible implementation manner of the embodiment of the present application, the adjusting module 603 is specifically configured to adjust at least one of pole angle information and pole amplitude information corresponding to the first filter according to a preset manner when the pole angle information corresponding to the first filter meets a preset condition.
In another possible implementation manner of the embodiment of the present application, the pole angle information corresponding to the first filter includes: a pole angle value corresponding to at least one pole; the adjustment module 603 includes at least one of an increase unit and a decrease unit, wherein,
and the increasing unit is used for increasing the pole angle value corresponding to the at least one pole by a first preset threshold value.
And the reducing unit is used for reducing the pole angle value corresponding to the at least one pole by a second preset threshold value.
In another possible implementation manner of the embodiment of the present application, the pole amplitude information corresponding to the first filter includes: and a pole amplitude value corresponding to the at least one pole.
The adjusting module 603 is specifically configured to adjust a pole amplitude value corresponding to at least one pole according to a preset multiple.
In another possible implementation manner of the embodiment of the present application, the voice signal processing apparatus 60 further includes: the second acquisition module and the denoising processing module:
and the second acquisition module is used for acquiring the voice signal input by the user.
The denoising processing module is used for denoising the voice signal input by the user and taking the denoised voice signal as the original voice signal.
In the embodiment of the present application, the first acquisition module and the second acquisition module may be the same acquisition module or two different acquisition modules; the embodiment of the present application is not limited in this respect.
The voice signal processing device provided in the embodiment of the present application can perform the operations corresponding to the voice signal processing method shown in the foregoing method embodiment; the implementation principle is similar and is not repeated here.
Compared with the prior art, the present application performs linear prediction analysis on the original voice signal to determine the original excitation corresponding to the original voice signal and the first filter, adjusts at least one of the pole angle information and the pole amplitude information corresponding to the first filter to obtain the adjusted first filter, and then determines the target voice signal using the original excitation corresponding to the original voice signal and the adjusted first filter. In this way, at least one of the formant frequency and the formant sharpness corresponding to the original voice signal can be adjusted to obtain the target voice signal, thereby converting the voice input by the user and improving the user experience.
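Continuing the hypothetical helpers sketched above, the resynthesis carried out by the second determining module 604 amounts to driving the adjusted all-pole filter with the unmodified excitation:

```python
from scipy.signal import lfilter

a_adj = adjust_poles(a)                     # pole-edited "first filter" A'(z)
target = lfilter([1.0], a_adj, excitation)  # target speech for this frame
```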
The speech signal processing device of the present application has been described above from the viewpoint of virtual modules or virtual units; an electronic device is described below from the viewpoint of a physical device.
An embodiment of the present application provides an electronic device. As shown in fig. 7, the electronic device 4000 includes a processor 4001 and a memory 4003, where the processor 4001 is coupled to the memory 4003, for example via a bus 4002. Optionally, the electronic device 4000 may further include a transceiver 4004. It should be noted that, in practical applications, the number of transceivers 4004 is not limited to one, and the structure of the electronic device 4000 does not constitute a limitation on the embodiment of the present application.
The processor 4001 may be a CPU, a general-purpose processor, a DSP, an ASIC, an FPGA or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof, and may implement or execute the various exemplary logic blocks, modules, and circuits described in connection with this disclosure. The processor 4001 may also be a combination that implements computing functionality, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
The bus 4002 may include a path for transferring information between the aforementioned components. The bus 4002 may be a PCI bus, an EISA bus, or the like, and can be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 7, but this does not mean that there is only one bus or one type of bus.
The memory 4003 may be, but is not limited to, a ROM or another type of static storage device that can store static information and instructions, a RAM or another type of dynamic storage device that can store information and instructions, an EEPROM, a CD-ROM or other optical disc storage (including compact disc, laser disc, optical disc, digital versatile disc, Blu-ray disc, etc.), magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
The memory 4003 is used to store the application program code for executing the solution of the present application, and execution is controlled by the processor 4001. The processor 4001 executes the application program code stored in the memory 4003 to implement what is shown in any of the foregoing method embodiments.
An embodiment of the present application provides an electronic device including one or more processors, a memory, and one or more applications, where the one or more applications are stored in the memory and configured to be executed by the one or more processors so as to perform the operations corresponding to the voice signal processing method shown in the foregoing method embodiment or any possible implementation thereof. Compared with the prior art, the present application determines the original excitation corresponding to the original voice signal and the first filter through linear prediction analysis, adjusts at least one of the pole angle information and the pole amplitude information corresponding to the first filter to obtain the adjusted first filter, and then determines the target voice signal using the original excitation corresponding to the original voice signal and the adjusted first filter. At least one of the formant frequency and the formant sharpness corresponding to the original voice signal can thus be adjusted to obtain the target voice signal, so the voice input by the user can be changed and the user experience improved.
The above describes a speech signal processing electronic device from the viewpoint of a physical device; a computer-readable storage medium is described below from the viewpoint of the storage medium.
An embodiment of the present application provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the voice signal processing method shown in the foregoing method embodiment or any possible implementation thereof is realized. Compared with the prior art, the original excitation corresponding to the original voice signal and the first filter are determined by performing linear prediction analysis on the original voice signal, at least one of the pole angle information and the pole amplitude information corresponding to the first filter is adjusted to obtain the adjusted first filter, and the target voice signal is then determined using the original excitation corresponding to the original voice signal and the adjusted first filter. At least one of the formant frequency and the formant sharpness corresponding to the original voice signal can thus be adjusted to obtain the target voice signal, so the voice input by the user can be changed and the user experience improved.
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the order of the steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include a plurality of sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and their order of execution is not necessarily sequential; they may be performed in turn or in alternation with other steps, or with at least a portion of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present application. It should be noted that those skilled in the art can make modifications and adaptations without departing from the principles of the present application, and such modifications and adaptations shall also be regarded as falling within the scope of protection of the present application.

Claims (8)

1. A method of processing a speech signal, comprising:
acquiring an original voice signal;
performing linear prediction analysis on the original voice signal to determine prediction error information corresponding to the original voice signal;
determining a second filter based on prediction error information corresponding to the original voice signal, wherein the second filter is a filter corresponding to the linear prediction analysis;
determining an original excitation corresponding to the original voice signal based on the original voice signal and the second filter, and determining a first filter based on the second filter;
adjusting at least one of pole angle information corresponding to the first filter and pole amplitude information corresponding to the first filter to obtain an adjusted first filter;
determining a target voice signal based on the original excitation corresponding to the original voice signal and the adjusted first filter;
wherein the determining the second filter based on the prediction error information corresponding to the original speech signal includes:
and carrying out mean square error value minimization processing on the prediction error information, calculating to obtain coefficients of a linear prediction error filter, and determining a second filter based on the coefficients of the linear prediction error filter.
2. The method of claim 1, wherein the adjusting at least one of pole angle information corresponding to the first filter and pole amplitude information corresponding to the first filter to obtain an adjusted first filter comprises:
and if the pole angle information corresponding to the first filter meets the preset condition, adjusting at least one of the pole angle information and the pole amplitude information corresponding to the first filter according to a preset mode.
3. The method of claim 2, wherein the pole angle information corresponding to the first filter comprises: a pole angle value corresponding to at least one pole;
wherein the adjusting the pole angle information corresponding to the first filter in the preset manner comprises at least one of the following:
increasing a pole angle value corresponding to the at least one pole by a first preset threshold;
reducing the pole angle value corresponding to the at least one pole by a second preset threshold.
4. The method of claim 3, wherein the pole amplitude information corresponding to the first filter comprises: a pole amplitude value corresponding to the at least one pole;
wherein the adjusting the pole amplitude information corresponding to the first filter in the preset manner comprises:
adjusting the pole amplitude value corresponding to the at least one pole by a preset multiple.
5. The method of claim 1, wherein the acquiring an original voice signal comprises:
acquiring a voice signal input by a user;
denoising the voice signal input by the user, and taking the denoised voice signal as the original voice signal.
6. A speech signal processing apparatus, comprising:
the first acquisition module is used for acquiring an original voice signal;
the first determining module is used for carrying out linear prediction analysis on the original voice signal and determining prediction error information corresponding to the original voice signal; determining a second filter based on prediction error information corresponding to the original voice signal, wherein the second filter is a filter corresponding to the linear prediction analysis; determining an original excitation corresponding to the original voice signal based on the original voice signal and the second filter, and determining a first filter based on the second filter;
the adjusting module is used for adjusting at least one of pole angle information corresponding to the first filter and pole amplitude information corresponding to the first filter to obtain an adjusted first filter;
The second determining module is used for determining a target voice signal based on the original excitation corresponding to the original voice signal and the adjusted first filter;
the first determining module is configured to, when determining the second filter based on prediction error information corresponding to the original speech signal:
carrying out mean square error minimization processing on the prediction error information, calculating coefficients of a linear prediction error filter, and determining the second filter based on the coefficients of the linear prediction error filter.
7. An electronic device, comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications being configured to perform the voice signal processing method according to any one of claims 1 to 5.
8. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the voice signal processing method of any one of claims 1 to 5.
CN201910611481.2A 2019-07-08 2019-07-08 Voice signal processing method and device, electronic equipment and storage medium Active CN110164461B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910611481.2A CN110164461B (en) 2019-07-08 2019-07-08 Voice signal processing method and device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN110164461A 2019-08-23
CN110164461B 2023-12-15

Family

ID=67637855

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910611481.2A Active CN110164461B (en) 2019-07-08 2019-07-08 Voice signal processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110164461B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110415718B (en) * 2019-09-05 2020-11-03 腾讯科技(深圳)有限公司 Signal generation method, and voice recognition method and device based on artificial intelligence
CN111431855B (en) * 2020-02-26 2022-06-03 宁波吉利罗佑发动机零部件有限公司 Analysis method, device, equipment and medium for vehicle CAN signal
CN113395577A (en) * 2020-09-10 2021-09-14 腾讯科技(深圳)有限公司 Sound changing playing method and device, storage medium and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6047254A (en) * 1996-05-15 2000-04-04 Advanced Micro Devices, Inc. System and method for determining a first formant analysis filter and prefiltering a speech signal for improved pitch estimation
CN102779527A (en) * 2012-08-07 2012-11-14 无锡成电科大科技发展有限公司 Speech enhancement method on basis of enhancement of formants of window function
CN105304092A (en) * 2015-09-18 2016-02-03 深圳市海派通讯科技有限公司 Real-time voice changing method based on intelligent terminal
CN105654941A (en) * 2016-01-20 2016-06-08 华南理工大学 Voice change method and device based on specific target person voice change ratio parameter

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9728200B2 (en) * 2013-01-29 2017-08-08 Qualcomm Incorporated Systems, methods, apparatus, and computer-readable media for adaptive formant sharpening in linear prediction coding


Also Published As

Publication number Publication date
CN110164461A (en) 2019-08-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant