CN109215635B - Broadband voice frequency spectrum gradient characteristic parameter reconstruction method for voice definition enhancement - Google Patents


Info

Publication number: CN109215635B (application published as CN109215635A)
Application number: CN201811249506.0A
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 胡瑞敏, 李罡, 张锐, 王晓晨
Applicant and current assignee: Wuhan University (WHU)
Priority: CN201811249506.0A
Legal status: Active (application granted)

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The invention provides a wideband speech spectral tilt feature parameter reconstruction method for speech intelligibility enhancement, comprising a training phase and a use phase of a spectral tilt reconstruction network based on a recurrent neural network. The training phase establishes a speech data set and preprocesses the speech data in it; inputs the preprocessed narrowband speech data, applies a short-time Fourier transform to obtain the narrowband speech spectrum, and takes the logarithm of the spectrum to obtain the log-magnitude spectrum; inputs the preprocessed wideband speech data, extracts the all-pole model parameters of the wideband speech signal's spectral tilt, and converts them into line spectrum pair parameters; and trains the spectral tilt reconstruction network. The use phase applies the trained network to reconstruct the all-pole model parameters of the wideband speech spectral tilt. The invention reconstructs the spectral tilt parameters of the wideband speech signal from the narrowband speech signal, is applicable to all speech intelligibility enhancement systems based on spectral tilt features, and can adapt to multi-language, multi-mode speech signals.

Description

Broadband voice frequency spectrum gradient characteristic parameter reconstruction method for voice definition enhancement
Technical Field
The invention provides a wideband speech spectral tilt feature parameter reconstruction method for speech intelligibility enhancement. It relates to the technical field of speech signal processing and communication, is applicable to all speech intelligibility enhancement systems based on spectral tilt features, and can adapt to multi-language and multi-mode speech signals.
Background
Since the beginning of the 21st century, mobile communication technology has developed rapidly and mobile communication devices such as mobile phones have become ubiquitous. This convenience lets people hold real-time voice calls anytime and anywhere; it also means that calls inevitably take place in noisy environments such as stations, restaurants, and factories, where the ambient noise seriously degrades voice call quality.
A voice call can be briefly divided into two stages (as shown in fig. 1). In the speaking stage, the speaker talks into the mobile phone; the phone's microphone captures the speech signal, the phone encodes it, and the encoded signal is finally sent into the communication channel as the uplink signal. In the listening stage, the phone receives the downlink signal sent by the communication network from the channel, decodes it to regenerate the speech signal, and plays it back; the listener's ear receives the played signal, completing the transmission of one piece of voice information. From the perspective of the listener, receiving the downlink signal and listening to the speech content is called the near end, while generating the speech signal and transmitting the uplink signal, still from the listener's perspective, is called the far end.
In far-end signal processing, researchers have gradually developed speech enhancement technology to suppress the environmental noise in the speech signal captured by the microphone. In speech enhancement, on the one hand, software algorithms filter out energy other than the speech signal according to a series of features such as its time-frequency, acoustic, and linguistic characteristics, and reconstruct speech features for the filtered signal whose components are partially missing; on the other hand, with hardware support, several dedicated microphones are installed on the phone to capture ambient sound, and the speech signal and the noise signal captured by the noise microphone are combined through spectral subtraction or an adaptive filtering system. Through this series of combined software and hardware measures, speech enhancement technology can filter out the noise components in the microphone signal almost completely while keeping speech distortion very small.
In near-end signal processing, to suppress ambient noise during listening, researchers first devised noise cancellation strategies: a microphone captures the environmental noise, and a sound wave with opposite phase but the same frequency and amplitude is emitted to interfere with the noise and cancel it, reducing the energy of the ambient noise. Active noise-cancelling headphones are a typical product based on this strategy: the headphones filter out part of the noise in advance through physical isolation, and cancel the remaining noise by adding an anti-phase signal to the played-back signal. However, in listening modes that lack the physical isolation of headphones, the ear is directly exposed to high-energy environmental noise; together with a series of problems such as room reverberation and the difficulty of keeping the loudspeaker aligned with the ear, the anti-noise effect is greatly reduced.
Since the noise cancellation strategy fails in the handset-receiver listening mode, researchers also proposed near-end listening enhancement technology to ensure that the speech signal received by the listener is clear enough. Based on perceptual acoustics, linguistics, and signal processing methods, it strengthens the robustness of the speech signal by improving its perceptual intelligibility, so that under the same noise conditions the speech is easier for the listener to understand. Because its goal is to improve the intelligibility of the speech signal, it is also called speech intelligibility enhancement or speech clarity enhancement technology.
Traditional speech intelligibility enhancement methods fall mainly into two categories: rule-based methods and metric-based methods. Rule-based methods ignore the surrounding environmental noise and adjust the time-frequency characteristics of the speech signal only according to fixed speech-feature rules, so the intelligibility improvement varies greatly across environments and the algorithms lack robustness. Metric-based methods compare the speech signal with the environmental noise through specific measurement indexes and dynamically adjust the gain of the speech signal; they improve speech intelligibility markedly, but to a large extent they damage the naturalness and comfort of the speech.
In noisy scenes, a speaker under the pressure of noise spontaneously changes his or her vocal production to overcome the influence of the surrounding noise, and this change markedly improves the listener's perceived intelligibility. This noise-countering speech production mechanism is called the Lombard effect, and the resulting noise-robust speech is called Lombard speech. Research shows that the spectral tilt of Lombard speech differs substantially in detail from that of ordinary speech of the corresponding sentences: the spectral tilt of Lombard speech is flatter overall. The spectral tilt feature thus effectively reflects the difference between Lombard speech and ordinary speech, and the spectral tilt parameter can serve as a key parameter for improving speech intelligibility.
In a data-driven speech intelligibility enhancement system, Lombard speech recorded under different noise scenes and the ordinary speech signals recorded in a corresponding quiet environment are used as training data to fit a Lombard-based speech intelligibility enhancement system: the spectral tilt of Lombard speech is mapped from the spectral tilt of the ordinary speech signal, yielding Lombard speech with noise-robust characteristics.
Such data-driven approaches can train the mapping model with machine learning algorithms such as Gaussian process regression, Gaussian mixture models, and deep neural networks. The mapping model places high accuracy requirements on the input spectral tilt information. However, because narrowband signals in the actual speech communication environment lack part of the acoustic characteristics, spectral tilt parameters calculated directly from the narrowband signal carry larger errors than those of the wideband speech signal; the speech intelligibility enhancement system therefore cannot obtain accurate spectral tilt information, and the enhancement effect is severely degraded. The present invention provides a wideband speech spectral tilt feature parameter reconstruction method for speech intelligibility enhancement.
Disclosure of Invention
The invention provides a wideband speech spectral tilt feature parameter reconstruction method for speech intelligibility enhancement. It solves the following problem: because narrowband speech signals lack part of the acoustic characteristics, spectral tilt parameters calculated directly from them deviate substantially from those of the wideband speech signal, so the speech intelligibility enhancement system cannot obtain accurate spectral tilt information and the enhancement effect is severely degraded.
The technical scheme of the invention provides a wideband speech spectral tilt feature parameter reconstruction method for speech intelligibility enhancement, comprising a training phase and a use phase of a spectral tilt reconstruction network based on a recurrent neural network.
The training phase of the spectral tilt reconstruction network comprises the following steps.
Step S11, narrowband speech data with a low sampling rate are obtained by downsampling wideband speech data with a high sampling rate; a speech data set is established and divided proportionally into a training set, a test set, and a validation set; and the speech data in the data set are preprocessed, the preprocessing comprising framing and windowing.
Step S12, the preprocessed narrowband speech training set is input; a short-time Fourier transform yields the narrowband speech spectrum, and taking the logarithm of the spectrum magnitude gives the log-magnitude spectrum, which serves as the input of the spectral tilt reconstruction network.
Step S13, the preprocessed wideband speech training set is input; the all-pole model parameters of the wideband speech signal's spectral tilt are extracted and converted into line spectrum pair parameters, which serve as the output of the spectral tilt reconstruction network.
Step S14, the spectral tilt reconstruction network is trained; the perceptual root mean square deviation (PRMSD) is defined as the evaluation metric for testing the network's performance, with the validation set used as the evaluation standard for each evaluation; the optimal reconstruction network parameter model is tuned, and the final effect is verified on the test set.
In the use phase of the spectral tilt reconstruction network, the trained neural network is applied to the frame-by-frame processing of real-time signals in actual communication.
Step S21, narrowband speech is input frame by frame in real time, and its log-magnitude spectrum parameters are extracted.
Step S22, the narrowband log-magnitude spectrum parameters are input frame by frame, and the all-pole model parameters of the wideband speech spectral tilt are reconstructed by combining the spectral tilt reconstruction network with the parameter conversion.
Moreover, the wideband and narrowband speech material both include ordinary speech and anti-noise speech.
Moreover, in step S12 the number of points of the short-time Fourier transform is N, and the training input of the spectral tilt reconstruction network is calculated as:

x_i(k) = log | Σ_{n=0}^{N−1} s_i(n) · Win(n) · e^{−j2πnk/N} |,  k = 1, 2, …, C

where s_i(n) is the i-th frame of the narrowband speech signal, n is the sample index within the speech frame, x_i(k) is the value of the log-magnitude spectrum of the i-th frame at frequency bin k, and Win is the time-domain window function. The number of points in the log-magnitude spectrum of each speech frame is

C = N/2 + 1

The log-magnitude spectrum x_i = [x_i(1), x_i(2), …, x_i(C)] of each framed narrowband signal in the speech data set is calculated by the formula above and stored row by row in a matrix X, the input matrix of the spectral tilt reconstruction network, where M is the number of rows of X.
Moreover, in step S13, from the i-th frame wideband speech signal s_i(n) the parameters

a_i = [a_i(1), a_i(2), …, a_i(P)]

are calculated; these are the all-pole model parameters of the spectral tilt of the i-th frame wideband speech signal, and P is the order of the all-pole model.
Moreover, the line spectrum pair parameters described in step S13 are an equivalent form of the all-pole model parameters, with stronger robustness.
Moreover, the evaluation method adopted in step S14 uses the speech data of the validation set and the test set, and is calculated as follows. Let

ŷ_i = [ŷ_i(1), …, ŷ_i(P)]

be the estimate of the spectral tilt all-pole model parameters of the i-th frame speech signal and y_i(n) the true values. The corresponding spectral tilt envelopes are

Ŷ_i(k) = −20 log₁₀ | 1 + Σ_{p=1}^{P} ŷ_i(p) e^{−j2πpk/N} |
Y_i(k) = −20 log₁₀ | 1 + Σ_{p=1}^{P} y_i(p) e^{−j2πpk/N} |

where Ŷ_i(k) is the estimate of the spectral tilt of the i-th frame speech signal and Y_i(k) its true value. Ŷ_i(k) and Y_i(k) are divided into L sub-bands using the same sub-band division; Ŷ_i^j(k) denotes the estimated spectral tilt in the j-th sub-band of the i-th frame, Y_i^j(k) its true value, D_j the length of the j-th sub-band, and b_j the perceptual coefficient used to compute the perceptual RMS deviation of the j-th sub-band. Then

PR_i = sqrt( (1/L) Σ_{j=1}^{L} (b_j / D_j) Σ_{k ∈ sub-band j} ( Ŷ_i^j(k) − Y_i^j(k) )² )

where PR_i is the perceptual root mean square deviation (PRMSD) of the spectral tilt of the i-th frame speech signal.
Moreover, the number of nodes of the input layer of the optimal reconstruction network parameter model in step S14 is C = N/2 + 1, the same as the number of points of the log-magnitude spectrum parameters of each narrowband speech frame in step S12.
Moreover, in step S14, the excitation function used by the hidden layers of the optimal network parameter model is a Sigmoid, Tanh, or Linear function; the hidden-layer node configurations include [N/4, N/8], [N/8, N/16], and [N/4, N/8, N/16]; and the optimal time step of each hidden layer is determined by parameter tuning.
Moreover, in step S14, the number of output-layer nodes of the optimal reconstruction network is P, the same as the order of the all-pole model of the speech spectral tilt.
Moreover, in the use phase of the spectral tilt reconstruction network, the method of extracting the narrowband log-magnitude spectrum parameters in step S21 is the same as in step S12 of the training phase; the parameter conversion in step S22 of the use phase converts the line spectrum pair parameters of the wideband speech spectral tilt reconstructed by the network into all-pole model parameters.
The invention reconstructs the spectral tilt information of wideband speech from the log-magnitude spectrum information of narrowband speech. The reconstructed spectral tilt information is applicable to all speech intelligibility enhancement systems based on spectral tilt, can adapt to multi-language and multi-mode speech signals, and improves the extensibility and practicality of speech intelligibility enhancement systems.
Drawings
Fig. 1 is a schematic diagram of a voice communication flow in a noise scene according to an embodiment of the present invention;
FIG. 2 is a block diagram of a speech intelligibility enhancement system based on spectral tilt characteristics according to an embodiment of the present invention;
Fig. 3 is a flowchart of a wideband speech spectral tilt feature parameter reconstruction method for speech intelligibility enhancement according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in further detail below with reference to the drawings. It should be understood that the embodiments described here are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person skilled in the art from these embodiments without creative effort fall within the protection scope of the present application.
The invention is suitable for the speech intelligibility enhancement system of a real-time speech communication system, which enhances speech intelligibility based on the speaker's noise-countering speech production mechanism (the Lombard effect) and a natural speech generation model.
The invention is further illustrated by the following figures and examples, which are not to be construed as limiting the invention.
In view of the problems in the prior art, this embodiment provides a method for reconstructing wideband speech spectral tilt feature parameters from narrowband speech, suitable for speech intelligibility enhancement systems based on spectral tilt features; a block diagram of such a system is shown in fig. 2.
The implementation of this embodiment comprises a training phase and a use phase of a spectral tilt reconstruction network based on a recurrent neural network (RNN), as shown in fig. 3.
Training phase: the narrowband log-magnitude spectrum parameters and the line spectrum pair parameters of the wideband speech spectral tilt are extracted from the training set and used respectively as the input and output for training the spectral tilt reconstruction network; the network is trained and the optimal parameter model is tuned. Use phase: the narrowband log-magnitude spectrum parameters are input into the spectral tilt reconstruction network frame by frame, the line spectrum pair parameters of the wideband speech spectral tilt are reconstructed, and the all-pole model parameters of the wideband speech spectral tilt are generated.
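The use phase ends by converting the reconstructed line spectrum pair parameters back into all-pole model parameters. The patent does not spell out this inverse conversion, so the following is a generic LSF-to-LPC reconstruction sketch (valid for even model order P; the function name is illustrative, not from the patent):

```python
# Generic sketch: rebuild all-pole coefficients [1, a(1), ..., a(P)] from
# the line-spectrum-pair angles of K'(z) and Q'(z). Assumes even P, so the
# trivial roots divided out during the forward conversion were z = -1 (K)
# and z = +1 (Q). Not the patent's specific procedure.
import numpy as np

def lsp_to_lpc(bp, bq):
    """bp, bq: LSP angles (radians) of K'(z) and Q'(z); returns [1, a1..aP]."""
    # Each angle w contributes a conjugate unit-circle root pair,
    # i.e. the quadratic factor 1 - 2*cos(w) z^-1 + z^-2.
    k = np.array([1.0])
    for w in bp:
        k = np.convolve(k, [1.0, -2.0 * np.cos(w), 1.0])
    q = np.array([1.0])
    for w in bq:
        q = np.convolve(q, [1.0, -2.0 * np.cos(w), 1.0])
    k = np.convolve(k, [1.0, 1.0])    # restore the trivial root at z = -1
    q = np.convolve(q, [1.0, -1.0])   # restore the trivial root at z = +1
    a = 0.5 * (k + q)                 # A(z) = (K(z) + Q(z)) / 2
    return a[:-1]                     # drop the zero z^-(P+1) coefficient

# illustrative P = 2 example with one angle per polynomial
a_rec = lsp_to_lpc(np.array([np.pi / 3]), np.array([2 * np.pi / 3]))
```

Since A(z) = (K(z) + Q(z)) / 2 by construction of the symmetric/antisymmetric split, this recovers the all-pole polynomial exactly when the angles are exact.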
The training phase of the spectral tilt reconstruction network is implemented in the following steps:
Step S11: a speech data set is established and divided proportionally into a training set, a test set, and a validation set; the speech data in the data set are framed and windowed as preprocessing;
Step S12: the preprocessed narrowband speech training set is input; a short-time Fourier transform yields the narrowband speech spectrum, and taking the logarithm of the spectrum magnitude gives the log-magnitude spectrum, used as the input of the spectral tilt reconstruction network;
Step S13: the preprocessed wideband speech training set is input; the all-pole model parameters of the wideband speech signal's spectral tilt are extracted and converted into line spectrum pair parameters, used as the output of the spectral tilt reconstruction network;
Step S14: the spectral tilt reconstruction network is trained; the Perceptual Root-Mean-Square Deviation (PRMSD) is defined as the evaluation metric for testing the network's performance; the validation set is used as the evaluation standard for each evaluation to tune the optimal reconstruction network parameter model, and the final effect is verified on the test set.
Specifically, the detailed process of step S11 is as follows. Narrowband speech data with a low sampling rate are obtained by downsampling wideband speech data with a high sampling rate, and a speech data set is established; the sampling rate of wideband speech data is typically 16000 Hz, 48000 Hz, etc., and that of narrowband speech data is typically 8000 Hz, 6000 Hz, etc.
In this embodiment, the sampling rate of the wideband speech data is 16000 Hz and that of the narrowband speech data is 8000 Hz, and the corresponding narrowband and wideband speech data both include ordinary speech and anti-noise speech with the same text content. The narrowband and wideband speech inputs in fig. 3 both come from the speech data set established in step S11. The speech data set is divided into a training set, a validation set, and a test set in proportions of 85%, 7.5%, and 7.5%; the narrowband and wideband speech data in the training and test sets are framed, and in this embodiment windowed with a Hamming window.
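The downsampling and proportional split of step S11 can be sketched as follows (a minimal sketch under the embodiment's numbers; the shuffling, seed, and function names are illustrative assumptions, not from the patent):

```python
# Sketch of step S11: wideband speech at 16 kHz is downsampled to 8 kHz
# narrowband speech, and the utterance list is split 85% / 7.5% / 7.5%
# into training / validation / test sets.
import numpy as np
from scipy.signal import resample_poly

def downsample_wideband(wb, fs_in=16000, fs_out=8000):
    """Polyphase resampling from the wideband to the narrowband rate."""
    return resample_poly(wb, fs_out, fs_in)

def split_dataset(utterance_ids, seed=0):
    """Shuffle and split utterance ids 85 / 7.5 / 7.5 as in the embodiment."""
    rng = np.random.default_rng(seed)
    ids = list(utterance_ids)
    rng.shuffle(ids)
    n = len(ids)
    n_train = int(round(0.85 * n))
    n_val = int(round(0.075 * n))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

wb = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of 16 kHz audio
nb = downsample_wideband(wb)
train, val, test = split_dataset(range(200))
```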
The wideband and narrowband speech data include both ordinary speech and anti-noise speech (Lombard speech).
Lombard speech is the noise-robust speech a person produces in a noisy environment: under the pressure of the surrounding noise, the speaker spontaneously changes his or her vocal production, and the resulting Lombard speech is clearer than ordinary speech. Preferably, the narrowband and wideband speech data are framed with the following settings: each speech frame is 20 milliseconds long, and each frame overlaps the previous frame by 50%.
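The framing and windowing just described can be sketched as follows (20 ms frames are 160 samples at the 8 kHz narrowband rate; the Hamming window follows this embodiment, and the helper name is illustrative):

```python
# Minimal framing/windowing sketch for the preprocessing of step S11:
# 20 ms frames (160 samples at 8 kHz) with 50% overlap, each frame
# multiplied by a Hamming window.
import numpy as np

def frame_signal(x, frame_len=160, overlap=0.5):
    hop = int(frame_len * (1 - overlap))          # 80-sample hop for 50%
    n_frames = 1 + (len(x) - frame_len) // hop
    win = np.hamming(frame_len)
    frames = np.stack([x[i * hop:i * hop + frame_len] * win
                       for i in range(n_frames)])
    return frames                                  # shape (n_frames, frame_len)

x = np.random.default_rng(0).standard_normal(8000)  # 1 s of 8 kHz signal
frames = frame_signal(x)
```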
Specifically, step S12 corresponds to the module for calculating the network input in the training phase in fig. 3, and its detailed process is as follows. Each narrowband speech frame obtained in step S11 is input and transformed with an N-point short-time Fourier transform, where N may be 1024, 512, 256, etc.; in this embodiment N is preferably 512. The log-magnitude spectrum of each narrowband speech frame is then calculated as:

x_i(k) = log | Σ_{n=0}^{N−1} s_i(n) · Win(n) · e^{−j2πnk/N} |,  k = 1, 2, …, C

where s_i(n) is the i-th frame of the narrowband speech signal, n is the sample index within the frame (the frame length is 160), x_i(k) is the value of the log-magnitude spectrum of the i-th narrowband speech frame at bin k, M is the total number of input training frames, and Win is the window applied to each speech frame; this embodiment uses a Hanning window, with the Hamming window and sine window as alternatives. The number of points in the log-magnitude spectrum of each speech frame is

C = N/2 + 1

and the value of C in this embodiment is 257.
The log-magnitude spectrum x_i = [x_i(1), x_i(2), …, x_i(C)] of each framed narrowband signal in the training set is calculated by the formula above and stored row by row in the matrix X, the input matrix of the spectral tilt reconstruction network; M, the number of rows of X, is the total number of input training frames (all framed narrowband speech data in the training sets).
In this embodiment, the 257-point log-magnitude spectrum parameters of each narrowband speech frame serve as the training input of the spectral tilt reconstruction network. The input matrix X of the network is the M × C matrix whose i-th row is x_i:

X = [x_1; x_2; …; x_M]
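The computation of the input matrix X can be sketched as follows (a small floor inside the logarithm, an assumption not stated in the patent, avoids log(0) on silent frames):

```python
# Sketch of the network input of step S12: each windowed narrowband frame
# is zero-padded to an N = 512 point FFT, and the log of the magnitude
# spectrum is kept on the first C = N/2 + 1 = 257 bins; frames are stacked
# row by row into the input matrix X.
import numpy as np

def log_magnitude_matrix(frames, n_fft=512, eps=1e-10):
    spec = np.fft.rfft(frames, n=n_fft, axis=1)   # (M, N/2 + 1) bins
    return np.log10(np.abs(spec) + eps)           # log-magnitude spectrum

rng = np.random.default_rng(0)
frames = rng.standard_normal((99, 160))           # M = 99 windowed frames
X = log_magnitude_matrix(frames)
```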
specifically, step S13 corresponds to the module for calculating the network output in the training phase in fig. 3, and the detailed process is as follows: inputting each frame of wideband speech signal obtained in step S11, and calculating all-pole model parameters of speech spectrum gradient parameters, where the formula of the all-pole model parameter calculation method used in this embodiment is as follows:
ai=f(si(n))
si(n) is the i-th frame wideband speech signal, ai=[ai(1),ai(2)…,ai(P)]And (4) all-pole model parameters of the ith frame broadband voice signal spectrum gradient. P is the order of the all-pole model parameter, ai(1),ai(2)…,ai(P) is the all-pole model parameter values of orders 1,2, …, P, where P is 20 in this embodiment. All-pole model parameter aiThere are a variety of calculation methods, f(s)i(n)) represents the all-pole model parameter aiAccording to aiThe calculation method of (2) is set accordingly. For example, a linear prediction algorithm or other linear prediction algorithm based on specific perceptual weighting may be used.
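The text leaves the exact all-pole analysis f(s_i(n)) open, so the following is a generic sketch of one common choice, the autocorrelation-method linear prediction solved by the Levinson-Durbin recursion, with P = 20 as in this embodiment (function name illustrative):

```python
# Generic autocorrelation-method LPC via the Levinson-Durbin recursion.
# Returns [1, a(1), ..., a(P)] for A(z) = 1 + sum_p a(p) z^-p.
import numpy as np

def lpc(frame, order=20):
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                  # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)            # residual prediction error
    return a

# impulse input: the prediction-error filter stays trivial
a_imp = lpc(np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]), order=4)

# first-order AR process x[n] = 0.9 x[n-1] + e[n]: a(1) should be near -0.9
rng = np.random.default_rng(1)
e = rng.standard_normal(20000)
x = np.zeros_like(e)
for n in range(1, len(x)):
    x[n] = 0.9 * x[n - 1] + e[n]
a_ar = lpc(x, order=1)
```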
The all-pole model parameters of the wideband speech spectral tilt are then converted into line spectrum pair parameters. Line spectrum pair parameters are an equivalent form of the all-pole model parameters; they are more robust and are widely used in the field of speech signal processing.
Further, the specific process of the parameter conversion is as follows. The all-pole model parameters of the spectral tilt of the i-th frame wideband speech are written in z-domain form:

A_i(z) = 1 + Σ_{p=1}^{P} a_i(p) z^{−p}

Define K_i(z) and Q_i(z), the symmetric and antisymmetric polynomials of order P + 1:

K_i(z) = A_i(z) + z^{−(P+1)} A_i(z^{−1})
Q_i(z) = A_i(z) − z^{−(P+1)} A_i(z^{−1})

For even P, K_i(z) and Q_i(z) contain trivial roots at z = −1 and z = +1 respectively; dividing these out gives the z-domain form of the line spectrum pair of the i-th frame's wideband spectral tilt, the two polynomials K_i′(z) and Q_i′(z):

K_i′(z) = K_i(z) / (1 + z^{−1})
Q_i′(z) = Q_i(z) / (1 − z^{−1})

The parameters corresponding to K_i′(z) and Q_i′(z) are

bp_i = [bp_i(1), …, bp_i(P/2)]  and  bq_i = [bq_i(1), …, bq_i(P/2)]

so the line spectrum pair parameter of the wideband speech spectral tilt of the i-th frame is b_i = [bp_i, bq_i], and the line spectrum pair parameters of each frame serve as the training output of the spectral tilt reconstruction network. The output matrix Y of the network is the M × P matrix whose i-th row is b_i:

Y = [b_1; b_2; …; b_M]
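The conversion above can be sketched in code: K(z) and Q(z) are formed from the all-pole coefficients, the trivial roots are divided out, and the angles of the remaining unit-circle roots are the line spectrum pair parameters (a generic sketch for even P; the function name is illustrative):

```python
# Sketch of the LPC-to-LSP conversion of step S13 for even model order P.
import numpy as np

def lpc_to_lsp(a):
    """a = [1, a(1), ..., a(P)] -> sorted LSP angles (radians) of K', Q'."""
    a_ext = np.concatenate([a, [0.0]])
    k = a_ext + a_ext[::-1]               # K(z) = A(z) + z^-(P+1) A(1/z)
    q = a_ext - a_ext[::-1]               # Q(z) = A(z) - z^-(P+1) A(1/z)
    k_r, _ = np.polydiv(k, [1.0, 1.0])    # divide out the root at z = -1
    q_r, _ = np.polydiv(q, [1.0, -1.0])   # divide out the root at z = +1
    bp = np.sort(np.angle(np.roots(k_r)))
    bq = np.sort(np.angle(np.roots(q_r)))
    # keep upper-half-plane angles; conjugate pairs carry no extra info
    return bp[bp > 0], bq[bq > 0]

# stable 2nd-order example: A(z) = 1 - 0.9 z^-1 + 0.2 z^-2
bp, bq = lpc_to_lsp(np.array([1.0, -0.9, 0.2]))
```

For a stable A(z), the roots of K′ and Q′ lie on the unit circle and their angles interlace, which is what makes the representation robust to quantization.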
specifically, step S14 corresponds to a module of the training spectrum gradient reconstruction network in the training stage in fig. 3, and the detailed process is as follows: training the spectrum inclination reconstruction network, defining the perception root mean square deviation as an evaluation method, testing the performance of the spectrum inclination network by using the voice data in the test set and the evaluation method, and debugging an optimal reconstruction network parameter model.
The perceptual root mean square deviation used by the evaluation method is computed as

PR_i = Σ_{j=1}^{L} b_j · sqrt( (1/D_j) · Σ_{k=1}^{D_j} ( Ŷ_i^j(k) − Y_i^j(k) )² )

where ŷ_i(n) is the estimated value and y_i(n) the true value of the all-pole model parameters of the i-th frame speech signal spectral tilt; Ŷ_i(k), the spectral tilt computed from the all-pole model with parameters ŷ_i(n), is the estimated spectral tilt of the i-th frame speech signal, and Y_i(k) is the true spectral tilt of the i-th frame speech signal; Ŷ_i(k) and Y_i(k) are each divided into L sub-bands with the same sub-band division method; Ŷ_i^j(k) represents the estimated and Y_i^j(k) the true spectral tilt of the j-th sub-band of the i-th frame speech signal; D_j denotes the length of the j-th sub-band; and b_j denotes the perceptual coefficient with which the perceptual root mean square deviation of the j-th sub-band is weighted. PR_i denotes the perceptual root mean square deviation (PRMSD) of the spectral tilt of the i-th frame speech signal.
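Under the definitions above, the per-frame perceptual deviation can be sketched as a perceptually weighted combination of per-sub-band RMS errors. Since the original text names the quantities (sub-bands, lengths D_j, weights b_j) but the exact combining formula is reconstructed here, the weight-then-sum form below is an assumption:

```python
import numpy as np

def prmsd(Y_est, Y_true, band_edges, b):
    """Perceptual root-mean-square deviation between an estimated and a
    true spectral-tilt curve for one frame.  band_edges holds L+1 bin
    indices delimiting the L sub-bands; b holds the L perceptual
    weights b_j.  Each sub-band contributes its RMS error, weighted
    by its perceptual coefficient."""
    total = 0.0
    for j in range(len(b)):
        lo, hi = band_edges[j], band_edges[j + 1]
        d = Y_est[lo:hi] - Y_true[lo:hi]          # D_j = hi - lo bins
        total += b[j] * np.sqrt(np.mean(d ** 2))  # weighted sub-band RMS
    return total
```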
The number of input-layer nodes of the optimal spectral tilt reconstruction network is C, the same as the number of log-magnitude spectrum parameters of each frame of the narrowband speech signal in step S12.
In specific implementations, the excitation functions usable by the hidden layers of the optimal network parameter model include the Sigmoid function, the Tanh function, and the Linear function; the hidden-layer node parameters may be [N/4, N/4, N/8, N/8], [N/8, N/8, N/16, N/16], [N/4, N/4, N/8, N/16], [N/4, N/8, N/16], or [N/4, N/8, N/16, N/16]; and the optimal time step of each hidden layer is determined by parameter tuning.
In this embodiment, the excitation function used by the hidden layers is the Tanh function, the excitation function used by the output layer is the Linear function, and the hidden-layer node parameters are [N/8, N/16]. The number of output-layer nodes of the optimal reconstruction network described in step S14 is P, the same as the order of the all-pole model parameters of the speech spectral tilt; considering algorithm complexity, the value of P is generally no greater than 20.
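The network shape just described (C input values per time step, Tanh recurrent hidden layers, a linear output layer of P nodes) can be sketched as a plain forward pass. The parameter layout and the simple-RNN recurrence below are illustrative assumptions; the patent specifies a recurrent neural network but not a particular recurrent cell:

```python
import numpy as np

def rnn_forward(X, params):
    """Forward pass of a reconstruction network of the shape described
    above: Tanh recurrent hidden layers followed by a linear output layer.
    X has shape (T, C): T time steps of C-point log-magnitude spectra.
    params["hidden"] holds (Wx, Wh, b) per hidden layer; params["out"]
    holds (Wo, bo) for the output layer."""
    for Wx, Wh, b in params["hidden"]:
        H = np.zeros((X.shape[0], Wh.shape[0]))
        h = np.zeros(Wh.shape[0])
        for t in range(X.shape[0]):
            h = np.tanh(X[t] @ Wx + h @ Wh + b)  # Tanh excitation
            H[t] = h
        X = H                                     # feed the next layer
    Wo, bo = params["out"]
    return X @ Wo + bo                            # linear output layer
```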
In this embodiment the optimal hidden-layer time step is determined by parameter tuning, with the following procedure: reconstruction networks with different time steps are trained on the same reconstruction network structure, each trained network is tested with the speech data of the validation set, and the perceptual root mean square deviation of each is computed; the time step used by the reconstruction network with the smallest perceptual root mean square deviation is the optimal hidden-layer time step. In this embodiment each hidden-layer time step is 6.
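The tuning procedure above amounts to a one-dimensional grid search over candidate time steps, keeping the candidate whose trained network minimizes validation-set PRMSD. A schematic version, with placeholder callables `train_fn` and `eval_prmsd` standing in for the real training and evaluation routines, is:

```python
def pick_time_step(candidates, train_fn, eval_prmsd):
    """Train one spectral-tilt reconstruction network per candidate
    hidden-layer time step and return the step whose trained network
    scores the lowest perceptual RMS deviation on the validation set."""
    scores = {step: eval_prmsd(train_fn(step)) for step in candidates}
    return min(scores, key=scores.get)
```

For this embodiment the candidates would include 6, the value the tuning selected.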
After training of the spectral tilt reconstruction network is complete, the network can be put into the use stage: it is embedded at the end of the decoder of a speech communication system as a post-processing technique, and it processes the real-time speech signals of actual communication frame by frame.
The use stage of the spectral tilt reconstruction network is implemented in the following steps:
Step S21: narrowband speech is input frame by frame in real time, and the log-magnitude spectrum parameters of the narrowband speech are extracted.
Step S22: the log-magnitude spectrum parameters of the narrowband speech are input frame by frame, and the all-pole model parameters of the wideband speech spectral tilt are reconstructed by combining the spectral tilt reconstruction network and the parameter conversion.
Specifically, step S21 corresponds to the "extract narrowband speech features" module in Fig. 3 and is implemented as follows: one frame of the narrowband speech signal is input in real time, and the C-point narrowband speech log-magnitude spectrum parameters are extracted in the same way as in step S12 of the training stage of the spectral tilt reconstruction network.
Step S22 is implemented as follows: the C-point narrowband speech log-magnitude spectrum parameters extracted in step S21 are input to the trained optimal spectral tilt reconstruction network, which reconstructs the P-order linear spectrum pair parameters of the wideband speech spectral tilt; finally the obtained P-order linear spectrum pair parameters are converted into P-order all-pole model parameters, i.e., the wideband speech spectral tilt characteristic parameters reconstructed from the narrowband speech.
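The per-frame feature extraction of step S21 can be sketched as below. The Hann window, the base-10 logarithm, and the small floor constant are assumptions for illustration, since the text only specifies a window function and "logarithmizing" the magnitude spectrum:

```python
import numpy as np

def log_magnitude_spectrum(frame, N=512):
    """C = N/2 + 1 log-magnitude spectrum points of one narrowband speech
    frame: the per-frame input of the spectral tilt reconstruction network."""
    win = np.hanning(len(frame))             # assumed window function
    spec = np.fft.rfft(frame * win, n=N)     # N-point short-time Fourier transform
    return np.log10(np.abs(spec) + 1e-12)    # floor avoids log of zero
```

In the use stage, each such C-point vector is fed to the trained network, whose P-order linear spectrum pair output is then converted back to all-pole model parameters.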
In summary, the present invention provides a method for reconstructing the spectral tilt of a wideband speech signal from a narrowband speech signal. The method is robust, can be applied to any speech intelligibility enhancement system based on spectral tilt characteristics, and is suitable for multi-language and multi-mode speech signals. In a specific implementation, computer software can be adopted to automate the process.
The above description covers only a preferred embodiment of the present invention, and the present invention is not limited to this embodiment. Those skilled in the art should understand that any simple modification, equivalent change or refinement made to the above embodiment based on the technical core of the present invention falls within the protection scope claimed by the technical scheme of the present invention.

Claims (9)

1. A wideband speech spectral tilt characteristic parameter reconstruction method for speech definition enhancement, characterized by comprising a training stage and a use stage of a spectral tilt reconstruction network based on a recurrent neural network,
the training phase of the spectral tilt reconstruction network comprises the following steps,
step S11, acquiring narrowband speech data with a low sampling rate by down-sampling wideband speech data with a high sampling rate and establishing a speech data set, wherein the narrowband speech data and wideband speech data in the speech data set both comprise ordinary speech and anti-noise speech with the same text content; dividing the speech data set into a training set and a test set in proportion; and preprocessing the speech data in the speech data set, the preprocessing comprising framing and windowing;
step S12, inputting the preprocessed narrowband speech data of the training set, performing a short-time Fourier transform to obtain the narrowband speech spectrum, and taking the log-magnitude spectrum obtained by logarithmizing the spectrum information as the input of the spectral tilt reconstruction network;
step S13, inputting the preprocessed wideband speech data of the training set, extracting the all-pole model parameters of the wideband speech signal spectral tilt, converting them into linear spectrum pair parameters, and taking the linear spectrum pair parameters as the output of the spectral tilt reconstruction network;
step S14, training the spectral tilt reconstruction network, defining the perceptual root mean square deviation PRMSD as the evaluation method to test the performance of the spectral tilt network, using the validation set as the evaluation standard for each evaluation, tuning an optimal reconstruction network parameter model, and verifying the final effect on the test set;
the use stage of the spectral tilt reconstruction network puts the trained neural network into real-time frame-by-frame speech processing of actual communication, and comprises the following steps,
step S21, inputting narrowband speech frame by frame in real time and extracting the log-magnitude spectrum parameters of the narrowband speech;
and step S22, inputting the log-magnitude spectrum parameters of the narrowband speech frame by frame, and reconstructing the all-pole model parameters of the wideband speech spectral tilt by combining the spectral tilt reconstruction network and the parameter conversion.
2. The wideband speech spectral tilt characteristic parameter reconstruction method for speech intelligibility enhancement according to claim 1, characterized in that: in step S12, the number of short-time Fourier transform points is N, and the training input of the spectral tilt reconstruction network is calculated as

x_i(k) = lg | Σ_{n=1}^{N} s_i(n)·Win(n)·e^(-j2πkn/N) |,  k = 1, 2, …, C

X = [x_1; x_2; …; x_M]

s_i(n) represents the i-th frame narrowband speech signal, n is the sample index within the speech frame, x_i(k) represents the value of the log-magnitude spectrum of the i-th frame speech signal, k is the frequency-domain index, and Win represents the window function in the time domain; the number of points of the log-magnitude spectrum of each frame of the speech signal is

C = N/2 + 1

and x_i = [x_i(1), x_i(2), …, x_i(C)]; each framed signal of the narrowband speech data in the speech data set is computed according to the first formula to obtain its log-magnitude spectrum, which is stored row by row into the matrix X, where X represents the input matrix of the spectral tilt reconstruction network and M is the number of rows of X.
3. The wideband speech spectral tilt characteristic parameter reconstruction method for speech intelligibility enhancement according to claim 2, characterized in that: in step S13, a_i = [a_i(1), a_i(2), …, a_i(P)], the all-pole model parameters of the spectral tilt of the i-th frame wideband speech signal, are calculated from the i-th frame wideband speech signal s_i(n), where a_i(1), a_i(2), …, a_i(P) are respectively the all-pole model parameter values of orders 1, 2, …, P, and P is the order of the all-pole model parameters.
4. The wideband speech spectral tilt characteristic parameter reconstruction method for speech intelligibility enhancement according to claim 1, characterized in that: the linear spectrum pair parameters described in step S13 are an equivalent form of the all-pole model parameters, and the linear spectrum pair parameters have stronger robustness.
5. The wideband speech spectral tilt characteristic parameter reconstruction method for speech intelligibility enhancement according to claim 2, characterized in that: the evaluation method adopted in step S14 uses the speech data of the validation set and the test set, and is computed as

PR_i = Σ_{j=1}^{L} b_j · sqrt( (1/D_j) · Σ_{k=1}^{D_j} ( Ŷ_i^j(k) − Y_i^j(k) )² )

where ŷ_i(n) is the estimated value and y_i(n) the true value of the all-pole model parameters of the i-th frame speech signal spectral tilt; Ŷ_i(k), obtained from the all-pole model with parameters ŷ_i(n), is the estimated spectral tilt of the i-th frame speech signal, and Y_i(k) is the true spectral tilt of the i-th frame speech signal; Ŷ_i(k) and Y_i(k) are each divided into L sub-bands with the same sub-band division method; Ŷ_i^j(k) represents the estimated and Y_i^j(k) the true spectral tilt of the j-th sub-band of the i-th frame speech signal; D_j denotes the length of the j-th sub-band; b_j represents the perceptual coefficient used in computing the perceptual root mean square deviation of the j-th sub-band; and PR_i denotes the perceptual root mean square deviation PRMSD of the spectral tilt of the i-th frame speech signal.
6. The wideband speech spectral tilt characteristic parameter reconstruction method for speech intelligibility enhancement according to claim 2, characterized in that: the number of input-layer nodes of the optimal reconstruction network parameter model in step S14 is

C = N/2 + 1

the same as the number of points of the log-magnitude spectrum parameters of each frame of the narrowband speech signal in step S12.
7. The wideband speech spectral tilt characteristic parameter reconstruction method for speech intelligibility enhancement according to claim 2, characterized in that: in step S14, the excitation function used by the hidden layers of the optimal reconstruction network parameter model is the Sigmoid function, the Tanh function or the Linear function; the hidden-layer node parameters are [N/4, N/4, N/8, N/8], [N/8, N/8, N/16, N/16], [N/4, N/4, N/8, N/16], [N/4, N/8, N/16] or [N/4, N/8, N/16, N/16]; and the optimal time step of each hidden layer is determined by parameter tuning.
8. The wideband speech spectral tilt characteristic parameter reconstruction method for speech intelligibility enhancement according to claim 1, characterized in that: in step S14, the number of output-layer nodes of the optimal reconstruction network parameter model is P, the same as the order of the all-pole model parameters of the speech spectral tilt.
9. The wideband speech spectral tilt characteristic parameter reconstruction method for speech intelligibility enhancement according to any one of claims 1 to 8, characterized in that:
the method for extracting the narrowband speech log-magnitude spectrum parameters in step S21 of the use stage of the spectral tilt reconstruction network is the same as that of step S12 of the training stage of the spectral tilt reconstruction network;
in the use stage of the spectral tilt reconstruction network, the parameter conversion in step S22 converts the linear spectrum pair parameters of the wideband speech spectral tilt reconstructed by the spectral tilt reconstruction network into all-pole model parameters.
CN201811249506.0A 2018-10-25 2018-10-25 Broadband voice frequency spectrum gradient characteristic parameter reconstruction method for voice definition enhancement Active CN109215635B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811249506.0A CN109215635B (en) 2018-10-25 2018-10-25 Broadband voice frequency spectrum gradient characteristic parameter reconstruction method for voice definition enhancement


Publications (2)

Publication Number Publication Date
CN109215635A CN109215635A (en) 2019-01-15
CN109215635B (en) 2020-08-07

Family

ID=64996332





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant