CN109215635B - Broadband voice frequency spectrum gradient characteristic parameter reconstruction method for voice definition enhancement - Google Patents


Info

Publication number: CN109215635B (application published as CN109215635A)
Application number: CN201811249506.0A
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 胡瑞敏, 李罡, 张锐, 王晓晨
Applicant and current assignee: Wuhan University (WHU)
Priority: CN201811249506.0A
Legal status: Active (application granted)

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The invention provides a wideband speech spectral tilt feature parameter reconstruction method for speech intelligibility enhancement, comprising a training phase and a use phase of a spectral tilt reconstruction network based on a recurrent neural network. The training phase establishes a speech data set and preprocesses the speech data in it; inputs the preprocessed narrowband speech data, applies a short-time Fourier transform to obtain the narrowband speech spectrum, and takes the logarithm of the spectrum to obtain the log-magnitude spectrum; inputs the preprocessed wideband speech data, extracts the all-pole model parameters of the wideband speech signal's spectral tilt, and converts them into line spectrum pair parameters; and trains the spectral tilt reconstruction network. The use phase applies the trained network to reconstruct the all-pole model parameters of the wideband speech spectral tilt. The invention reconstructs the spectral tilt parameters of the wideband speech signal from the narrowband speech signal, is applicable to all speech intelligibility enhancement systems based on spectral tilt features, and can adapt to multi-language, multi-mode speech signals.

Description

Broadband voice frequency spectrum gradient characteristic parameter reconstruction method for voice definition enhancement
Technical Field
The invention provides a wideband speech spectral tilt feature parameter reconstruction method for speech intelligibility enhancement. It relates to the technical field of speech signal processing and communication, is applicable to all speech intelligibility enhancement systems based on spectral tilt features, and can adapt to multi-language and multi-mode speech signals.
Background
Since the beginning of the 21st century, mobile communication technology has developed rapidly and mobile communication devices such as mobile phones have become ubiquitous. This convenience lets people hold real-time voice calls anytime and anywhere; it also means that calls inevitably take place in noisy environments such as stations, restaurants, and factories, where the ambient noise seriously degrades voice call quality.
A voice call can be briefly divided into two stages (as shown in fig. 1). In the speaking stage, the speaker talks into the mobile phone; the phone's microphone captures the speech signal, the phone encodes it, and the encoded signal is finally sent into the communication channel as the uplink signal. In the listening stage, the phone receives the downlink signal sent by the communication network from the channel, decodes it to regenerate the speech signal, and plays it back; the listener's ear receives the played signal, completing the transmission of one piece of voice information. From the perspective of the listener, receiving the downlink signal and listening to the speech content is called the near end, while generating the speech signal and transmitting the uplink signal, still from the listener's perspective, is called the far end.
In far-end signal processing, researchers have gradually developed speech enhancement technology to suppress the environmental noise in the speech signal captured by the microphone. In speech enhancement, on the one hand, software algorithms filter out energy other than the speech signal according to a series of features such as its time-frequency, acoustic, and linguistic characteristics, and reconstruct speech features for the filtered signal whose components are partially missing; on the other hand, with hardware support, several dedicated microphones are installed on the phone to capture ambient sound, and the speech signal and the noise signal captured by the noise microphone are combined through spectral subtraction or an adaptive filtering system. Through this series of combined software and hardware measures, speech enhancement technology can filter out the noise components in the microphone signal almost completely while keeping speech distortion very small.
In near-end signal processing, to suppress ambient noise during listening, researchers first devised noise cancellation strategies: a microphone captures the environmental noise, and a sound wave with opposite phase but the same frequency and amplitude is emitted to interfere with the noise and cancel it, reducing the energy of the ambient noise. Active noise-cancelling headphones are a typical product based on this strategy: the headphones filter out part of the noise in advance through physical isolation, and cancel the remaining noise by adding an anti-phase signal to the played-back signal. However, in listening modes that lack the physical isolation of headphones, the ear is directly exposed to high-energy environmental noise; together with a series of problems such as room reverberation and the difficulty of keeping the loudspeaker aligned with the ear, the anti-noise effect is greatly reduced.
Since the noise cancellation strategy fails in the handset-receiver listening mode, researchers also proposed near-end listening enhancement technology to ensure that the speech signal received by the listener is clear enough. Based on perceptual acoustics, linguistics, and signal processing methods, it strengthens the robustness of the speech signal by improving its perceptual intelligibility, so that under the same noise conditions the speech is easier for the listener to understand. Because its goal is to improve the intelligibility of the speech signal, it is also called speech intelligibility enhancement or speech clarity enhancement technology.
Traditional speech intelligibility enhancement methods fall mainly into two categories: rule-based methods and metric-based methods. Rule-based methods ignore the surrounding environmental noise and adjust the time-frequency characteristics of the speech signal only according to fixed speech-feature rules, so the intelligibility improvement varies greatly across environments and the algorithms lack robustness. Metric-based methods compare the speech signal with the environmental noise through specific measurement indexes and dynamically adjust the gain of the speech signal; they improve speech intelligibility markedly, but to a large extent they damage the naturalness and comfort of the speech.
In noisy scenes, a speaker under the pressure of noise spontaneously changes his or her vocal production to overcome the influence of the surrounding noise, and this change markedly improves the listener's perceived intelligibility. This noise-countering speech production mechanism is called the Lombard effect, and the resulting noise-robust speech is called Lombard speech. Research shows that the spectral tilt of Lombard speech differs substantially in detail from that of ordinary speech of the corresponding sentences: the spectral tilt of Lombard speech is flatter overall. The spectral tilt feature thus effectively reflects the difference between Lombard speech and ordinary speech, and the spectral tilt parameter can serve as a key parameter for improving speech intelligibility.
In a data-driven speech intelligibility enhancement system, Lombard speech recorded under different noise scenes and the ordinary speech signals recorded in a corresponding quiet environment are used as training data to fit a Lombard-based speech intelligibility enhancement system: the spectral tilt of Lombard speech is mapped from the spectral tilt of the ordinary speech signal, yielding Lombard speech with noise-robust characteristics.
Such data-driven approaches can train the mapping model with machine learning algorithms such as Gaussian process regression, Gaussian mixture models, and deep neural networks. The mapping model places high accuracy requirements on the input spectral tilt information. However, because narrowband signals in the actual speech communication environment lack part of the acoustic characteristics, spectral tilt parameters calculated directly from the narrowband signal carry larger errors than those of the wideband speech signal; the speech intelligibility enhancement system therefore cannot obtain accurate spectral tilt information, and the enhancement effect is severely degraded. The present invention provides a wideband speech spectral tilt feature parameter reconstruction method for speech intelligibility enhancement.
Disclosure of Invention
The invention provides a wideband speech spectral tilt feature parameter reconstruction method for speech intelligibility enhancement. It solves the following problem: because narrowband speech signals lack part of the acoustic characteristics, spectral tilt parameters calculated directly from them deviate substantially from those of the wideband speech signal, so the speech intelligibility enhancement system cannot obtain accurate spectral tilt information and the enhancement effect is severely degraded.
The technical scheme of the invention provides a wideband speech spectral tilt feature parameter reconstruction method for speech intelligibility enhancement, comprising a training phase and a use phase of a spectral tilt reconstruction network based on a recurrent neural network.
The training phase of the spectral tilt reconstruction network comprises the following steps.
Step S11, narrowband speech data with a low sampling rate are obtained by downsampling wideband speech data with a high sampling rate; a speech data set is established and divided proportionally into a training set, a test set, and a validation set; and the speech data in the data set are preprocessed, the preprocessing comprising framing and windowing.
Step S12, the preprocessed narrowband speech training set is input; a short-time Fourier transform yields the narrowband speech spectrum, and taking the logarithm of the spectrum magnitude gives the log-magnitude spectrum, which serves as the input of the spectral tilt reconstruction network.
Step S13, the preprocessed wideband speech training set is input; the all-pole model parameters of the wideband speech signal's spectral tilt are extracted and converted into line spectrum pair parameters, which serve as the output of the spectral tilt reconstruction network.
Step S14, the spectral tilt reconstruction network is trained; the perceptual root mean square deviation (PRMSD) is defined as the evaluation metric for testing the network's performance, with the validation set used as the evaluation standard for each evaluation; the optimal reconstruction network parameter model is tuned, and the final effect is verified on the test set.
In the use phase of the spectral tilt reconstruction network, the trained neural network is applied to the frame-by-frame processing of real-time signals in actual communication.
Step S21, narrowband speech is input frame by frame in real time, and its log-magnitude spectrum parameters are extracted.
Step S22, the narrowband log-magnitude spectrum parameters are input frame by frame, and the all-pole model parameters of the wideband speech spectral tilt are reconstructed by combining the spectral tilt reconstruction network with the parameter conversion.
Moreover, the wideband and narrowband speech material both include ordinary speech and anti-noise speech.
Moreover, in step S12 the number of points of the short-time Fourier transform is N, and the training input of the spectral tilt reconstruction network is calculated as:

x_i(k) = log | Σ_{n=0}^{N−1} s_i(n) · Win(n) · e^{−j2πnk/N} |,  k = 1, 2, …, C

where s_i(n) is the i-th frame of the narrowband speech signal, n is the sample index within the speech frame, x_i(k) is the value of the log-magnitude spectrum of the i-th frame at frequency bin k, and Win is the time-domain window function. The number of points in the log-magnitude spectrum of each speech frame is

C = N/2 + 1

The log-magnitude spectrum x_i = [x_i(1), x_i(2), …, x_i(C)] of each framed narrowband signal in the speech data set is calculated by the formula above and stored row by row in a matrix X, the input matrix of the spectral tilt reconstruction network, where M is the number of rows of X.
Moreover, in step S13, from the i-th frame wideband speech signal s_i(n) the parameters

a_i = [a_i(1), a_i(2), …, a_i(P)]

are calculated; these are the all-pole model parameters of the spectral tilt of the i-th frame wideband speech signal, and P is the order of the all-pole model.
Moreover, the line spectrum pair parameters described in step S13 are an equivalent form of the all-pole model parameters, with stronger robustness.
Moreover, the evaluation method adopted in step S14 uses the speech data of the validation set and the test set, and is calculated as follows. Let

ŷ_i = [ŷ_i(1), …, ŷ_i(P)]

be the estimate of the spectral tilt all-pole model parameters of the i-th frame speech signal and y_i(n) the true values. The corresponding spectral tilt envelopes are

Ŷ_i(k) = −20 log₁₀ | 1 + Σ_{p=1}^{P} ŷ_i(p) e^{−j2πpk/N} |
Y_i(k) = −20 log₁₀ | 1 + Σ_{p=1}^{P} y_i(p) e^{−j2πpk/N} |

where Ŷ_i(k) is the estimate of the spectral tilt of the i-th frame speech signal and Y_i(k) its true value. Ŷ_i(k) and Y_i(k) are divided into L sub-bands using the same sub-band division; Ŷ_i^j(k) denotes the estimated spectral tilt in the j-th sub-band of the i-th frame, Y_i^j(k) its true value, D_j the length of the j-th sub-band, and b_j the perceptual coefficient used to compute the perceptual RMS deviation of the j-th sub-band. Then

PR_i = sqrt( (1/L) Σ_{j=1}^{L} (b_j / D_j) Σ_{k ∈ sub-band j} ( Ŷ_i^j(k) − Y_i^j(k) )² )

where PR_i is the perceptual root mean square deviation (PRMSD) of the spectral tilt of the i-th frame speech signal.
Moreover, the number of nodes of the input layer of the optimal reconstruction network parameter model in step S14 is C = N/2 + 1, the same as the number of points of the log-magnitude spectrum parameters of each narrowband speech frame in step S12.
Moreover, in step S14, the excitation function used by the hidden layers of the optimal network parameter model is a Sigmoid, Tanh, or Linear function; the hidden-layer node configurations include [N/4, N/8], [N/8, N/16], and [N/4, N/8, N/16]; and the optimal time step of each hidden layer is determined by parameter tuning.
Moreover, in step S14, the number of output-layer nodes of the optimal reconstruction network is P, the same as the order of the all-pole model of the speech spectral tilt.
Moreover, in the use phase of the spectral tilt reconstruction network, the method of extracting the narrowband log-magnitude spectrum parameters in step S21 is the same as in step S12 of the training phase; the parameter conversion in step S22 of the use phase converts the line spectrum pair parameters of the wideband speech spectral tilt reconstructed by the network into all-pole model parameters.
The invention reconstructs the spectral tilt information of wideband speech from the log-magnitude spectrum information of narrowband speech. The reconstructed spectral tilt information is applicable to all speech intelligibility enhancement systems based on spectral tilt, can adapt to multi-language and multi-mode speech signals, and improves the extensibility and practicality of speech intelligibility enhancement systems.
Drawings
Fig. 1 is a schematic diagram of a voice communication flow in a noise scene according to an embodiment of the present invention;
FIG. 2 is a block diagram of a speech intelligibility enhancement system based on spectral tilt characteristics according to an embodiment of the present invention;
Fig. 3 is a flowchart of a wideband speech spectral tilt feature parameter reconstruction method for speech intelligibility enhancement according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in further detail below with reference to the drawings. It should be understood that the embodiments described here are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person skilled in the art from these embodiments without creative effort fall within the protection scope of the present application.
The invention is suitable for the speech intelligibility enhancement system of a real-time speech communication system, which enhances speech intelligibility based on the speaker's noise-countering speech production mechanism (the Lombard effect) and a natural speech generation model.
The invention is further illustrated by the following figures and examples, which are not to be construed as limiting the invention.
In view of the problems in the prior art, this embodiment provides a method for reconstructing wideband speech spectral tilt feature parameters from narrowband speech, suitable for speech intelligibility enhancement systems based on spectral tilt features; a block diagram of such a system is shown in fig. 2.
The implementation of this embodiment comprises a training phase and a use phase of a spectral tilt reconstruction network based on a recurrent neural network (RNN), as shown in fig. 3.
Training phase: the narrowband log-magnitude spectrum parameters and the line spectrum pair parameters of the wideband speech spectral tilt are extracted from the training set and used respectively as the input and output for training the spectral tilt reconstruction network; the network is trained and the optimal parameter model is tuned. Use phase: the narrowband log-magnitude spectrum parameters are input into the spectral tilt reconstruction network frame by frame, the line spectrum pair parameters of the wideband speech spectral tilt are reconstructed, and the all-pole model parameters of the wideband speech spectral tilt are generated.
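The use phase ends by converting the reconstructed line spectrum pair parameters back into all-pole model parameters. The patent does not spell out this inverse conversion, so the following is a generic LSF-to-LPC reconstruction sketch (valid for even model order P; the function name is illustrative, not from the patent):

```python
# Generic sketch: rebuild all-pole coefficients [1, a(1), ..., a(P)] from
# the line-spectrum-pair angles of K'(z) and Q'(z). Assumes even P, so the
# trivial roots divided out during the forward conversion were z = -1 (K)
# and z = +1 (Q). Not the patent's specific procedure.
import numpy as np

def lsp_to_lpc(bp, bq):
    """bp, bq: LSP angles (radians) of K'(z) and Q'(z); returns [1, a1..aP]."""
    # Each angle w contributes a conjugate unit-circle root pair,
    # i.e. the quadratic factor 1 - 2*cos(w) z^-1 + z^-2.
    k = np.array([1.0])
    for w in bp:
        k = np.convolve(k, [1.0, -2.0 * np.cos(w), 1.0])
    q = np.array([1.0])
    for w in bq:
        q = np.convolve(q, [1.0, -2.0 * np.cos(w), 1.0])
    k = np.convolve(k, [1.0, 1.0])    # restore the trivial root at z = -1
    q = np.convolve(q, [1.0, -1.0])   # restore the trivial root at z = +1
    a = 0.5 * (k + q)                 # A(z) = (K(z) + Q(z)) / 2
    return a[:-1]                     # drop the zero z^-(P+1) coefficient

# illustrative P = 2 example with one angle per polynomial
a_rec = lsp_to_lpc(np.array([np.pi / 3]), np.array([2 * np.pi / 3]))
```

Since A(z) = (K(z) + Q(z)) / 2 by construction of the symmetric/antisymmetric split, this recovers the all-pole polynomial exactly when the angles are exact.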
The training phase of the spectral tilt reconstruction network is implemented in the following steps:
Step S11: a speech data set is established and divided proportionally into a training set, a test set, and a validation set; the speech data in the data set are framed and windowed as preprocessing;
Step S12: the preprocessed narrowband speech training set is input; a short-time Fourier transform yields the narrowband speech spectrum, and taking the logarithm of the spectrum magnitude gives the log-magnitude spectrum, used as the input of the spectral tilt reconstruction network;
Step S13: the preprocessed wideband speech training set is input; the all-pole model parameters of the wideband speech signal's spectral tilt are extracted and converted into line spectrum pair parameters, used as the output of the spectral tilt reconstruction network;
Step S14: the spectral tilt reconstruction network is trained; the Perceptual Root-Mean-Square Deviation (PRMSD) is defined as the evaluation metric for testing the network's performance; the validation set is used as the evaluation standard for each evaluation to tune the optimal reconstruction network parameter model, and the final effect is verified on the test set.
Specifically, the detailed process of step S11 is as follows. Narrowband speech data with a low sampling rate are obtained by downsampling wideband speech data with a high sampling rate, and a speech data set is established; the sampling rate of wideband speech data is typically 16000 Hz, 48000 Hz, etc., and that of narrowband speech data is typically 8000 Hz, 6000 Hz, etc.
In this embodiment, the sampling rate of the wideband speech data is 16000 Hz and that of the narrowband speech data is 8000 Hz, and the corresponding narrowband and wideband speech data both include ordinary speech and anti-noise speech with the same text content. The narrowband and wideband speech inputs in fig. 3 both come from the speech data set established in step S11. The speech data set is divided into a training set, a validation set, and a test set in proportions of 85%, 7.5%, and 7.5%; the narrowband and wideband speech data in the training and test sets are framed, and in this embodiment windowed with a Hamming window.
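The downsampling and proportional split of step S11 can be sketched as follows (a minimal sketch under the embodiment's numbers; the shuffling, seed, and function names are illustrative assumptions, not from the patent):

```python
# Sketch of step S11: wideband speech at 16 kHz is downsampled to 8 kHz
# narrowband speech, and the utterance list is split 85% / 7.5% / 7.5%
# into training / validation / test sets.
import numpy as np
from scipy.signal import resample_poly

def downsample_wideband(wb, fs_in=16000, fs_out=8000):
    """Polyphase resampling from the wideband to the narrowband rate."""
    return resample_poly(wb, fs_out, fs_in)

def split_dataset(utterance_ids, seed=0):
    """Shuffle and split utterance ids 85 / 7.5 / 7.5 as in the embodiment."""
    rng = np.random.default_rng(seed)
    ids = list(utterance_ids)
    rng.shuffle(ids)
    n = len(ids)
    n_train = int(round(0.85 * n))
    n_val = int(round(0.075 * n))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

wb = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of 16 kHz audio
nb = downsample_wideband(wb)
train, val, test = split_dataset(range(200))
```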
The wideband and narrowband speech data include both ordinary speech and anti-noise speech (Lombard speech).
Lombard speech is the noise-robust speech a person produces in a noisy environment: under the pressure of the surrounding noise, the speaker spontaneously changes his or her vocal production, and the resulting Lombard speech is clearer than ordinary speech. Preferably, the narrowband and wideband speech data are framed with the following settings: each speech frame is 20 milliseconds long, and each frame overlaps the previous frame by 50%.
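The framing and windowing just described can be sketched as follows (20 ms frames are 160 samples at the 8 kHz narrowband rate; the Hamming window follows this embodiment, and the helper name is illustrative):

```python
# Minimal framing/windowing sketch for the preprocessing of step S11:
# 20 ms frames (160 samples at 8 kHz) with 50% overlap, each frame
# multiplied by a Hamming window.
import numpy as np

def frame_signal(x, frame_len=160, overlap=0.5):
    hop = int(frame_len * (1 - overlap))          # 80-sample hop for 50%
    n_frames = 1 + (len(x) - frame_len) // hop
    win = np.hamming(frame_len)
    frames = np.stack([x[i * hop:i * hop + frame_len] * win
                       for i in range(n_frames)])
    return frames                                  # shape (n_frames, frame_len)

x = np.random.default_rng(0).standard_normal(8000)  # 1 s of 8 kHz signal
frames = frame_signal(x)
```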
Specifically, step S12 corresponds to the module for calculating the network input in the training phase in fig. 3, and its detailed process is as follows. Each narrowband speech frame obtained in step S11 is input and transformed with an N-point short-time Fourier transform, where N may be 1024, 512, 256, etc.; in this embodiment N is preferably 512. The log-magnitude spectrum of each narrowband speech frame is then calculated as:

x_i(k) = log | Σ_{n=0}^{N−1} s_i(n) · Win(n) · e^{−j2πnk/N} |,  k = 1, 2, …, C

where s_i(n) is the i-th frame of the narrowband speech signal, n is the sample index within the frame (the frame length is 160), x_i(k) is the value of the log-magnitude spectrum of the i-th narrowband speech frame at bin k, M is the total number of input training frames, and Win is the window applied to each speech frame; this embodiment uses a Hanning window, with the Hamming window and sine window as alternatives. The number of points in the log-magnitude spectrum of each speech frame is

C = N/2 + 1

and the value of C in this embodiment is 257.
The log-magnitude spectrum x_i = [x_i(1), x_i(2), …, x_i(C)] of each framed narrowband signal in the training set is calculated by the formula above and stored row by row in the matrix X, the input matrix of the spectral tilt reconstruction network; M, the number of rows of X, is the total number of input training frames (all framed narrowband speech data in the training sets).
In this embodiment, the 257-point log-magnitude spectrum parameters of each narrowband speech frame serve as the training input of the spectral tilt reconstruction network. The input matrix X of the network is the M × C matrix whose i-th row is x_i:

X = [x_1; x_2; …; x_M]
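The computation of the input matrix X can be sketched as follows (a small floor inside the logarithm, an assumption not stated in the patent, avoids log(0) on silent frames):

```python
# Sketch of the network input of step S12: each windowed narrowband frame
# is zero-padded to an N = 512 point FFT, and the log of the magnitude
# spectrum is kept on the first C = N/2 + 1 = 257 bins; frames are stacked
# row by row into the input matrix X.
import numpy as np

def log_magnitude_matrix(frames, n_fft=512, eps=1e-10):
    spec = np.fft.rfft(frames, n=n_fft, axis=1)   # (M, N/2 + 1) bins
    return np.log10(np.abs(spec) + eps)           # log-magnitude spectrum

rng = np.random.default_rng(0)
frames = rng.standard_normal((99, 160))           # M = 99 windowed frames
X = log_magnitude_matrix(frames)
```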
specifically, step S13 corresponds to the module for calculating the network output in the training phase in fig. 3, and the detailed process is as follows: inputting each frame of wideband speech signal obtained in step S11, and calculating all-pole model parameters of speech spectrum gradient parameters, where the formula of the all-pole model parameter calculation method used in this embodiment is as follows:
ai=f(si(n))
si(n) is the i-th frame wideband speech signal, ai=[ai(1),ai(2)…,ai(P)]And (4) all-pole model parameters of the ith frame broadband voice signal spectrum gradient. P is the order of the all-pole model parameter, ai(1),ai(2)…,ai(P) is the all-pole model parameter values of orders 1,2, …, P, where P is 20 in this embodiment. All-pole model parameter aiThere are a variety of calculation methods, f(s)i(n)) represents the all-pole model parameter aiAccording to aiThe calculation method of (2) is set accordingly. For example, a linear prediction algorithm or other linear prediction algorithm based on specific perceptual weighting may be used.
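The text leaves the exact all-pole analysis f(s_i(n)) open, so the following is a generic sketch of one common choice, the autocorrelation-method linear prediction solved by the Levinson-Durbin recursion, with P = 20 as in this embodiment (function name illustrative):

```python
# Generic autocorrelation-method LPC via the Levinson-Durbin recursion.
# Returns [1, a(1), ..., a(P)] for A(z) = 1 + sum_p a(p) z^-p.
import numpy as np

def lpc(frame, order=20):
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                  # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)            # residual prediction error
    return a

# impulse input: the prediction-error filter stays trivial
a_imp = lpc(np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]), order=4)

# first-order AR process x[n] = 0.9 x[n-1] + e[n]: a(1) should be near -0.9
rng = np.random.default_rng(1)
e = rng.standard_normal(20000)
x = np.zeros_like(e)
for n in range(1, len(x)):
    x[n] = 0.9 * x[n - 1] + e[n]
a_ar = lpc(x, order=1)
```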
The all-pole model parameters of the wideband speech spectral tilt are then converted into line spectrum pair parameters. Line spectrum pair parameters are an equivalent form of the all-pole model parameters; they are more robust and are widely used in the field of speech signal processing.
Further, the specific process of the parameter conversion is as follows. The all-pole model parameters of the spectral tilt of the i-th frame wideband speech are written in z-domain form:

A_i(z) = 1 + Σ_{p=1}^{P} a_i(p) z^{−p}

Define K_i(z) and Q_i(z), the symmetric and antisymmetric polynomials of order P + 1:

K_i(z) = A_i(z) + z^{−(P+1)} A_i(z^{−1})
Q_i(z) = A_i(z) − z^{−(P+1)} A_i(z^{−1})

For even P, K_i(z) and Q_i(z) contain trivial roots at z = −1 and z = +1 respectively; dividing these out gives the z-domain form of the line spectrum pair of the i-th frame's wideband spectral tilt, the two polynomials K_i′(z) and Q_i′(z):

K_i′(z) = K_i(z) / (1 + z^{−1})
Q_i′(z) = Q_i(z) / (1 − z^{−1})

The parameters corresponding to K_i′(z) and Q_i′(z) are

bp_i = [bp_i(1), …, bp_i(P/2)]  and  bq_i = [bq_i(1), …, bq_i(P/2)]

so the line spectrum pair parameter of the wideband speech spectral tilt of the i-th frame is b_i = [bp_i, bq_i], and the line spectrum pair parameters of each frame serve as the training output of the spectral tilt reconstruction network. The output matrix Y of the network is the M × P matrix whose i-th row is b_i:

Y = [b_1; b_2; …; b_M]
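The conversion above can be sketched in code: K(z) and Q(z) are formed from the all-pole coefficients, the trivial roots are divided out, and the angles of the remaining unit-circle roots are the line spectrum pair parameters (a generic sketch for even P; the function name is illustrative):

```python
# Sketch of the LPC-to-LSP conversion of step S13 for even model order P.
import numpy as np

def lpc_to_lsp(a):
    """a = [1, a(1), ..., a(P)] -> sorted LSP angles (radians) of K', Q'."""
    a_ext = np.concatenate([a, [0.0]])
    k = a_ext + a_ext[::-1]               # K(z) = A(z) + z^-(P+1) A(1/z)
    q = a_ext - a_ext[::-1]               # Q(z) = A(z) - z^-(P+1) A(1/z)
    k_r, _ = np.polydiv(k, [1.0, 1.0])    # divide out the root at z = -1
    q_r, _ = np.polydiv(q, [1.0, -1.0])   # divide out the root at z = +1
    bp = np.sort(np.angle(np.roots(k_r)))
    bq = np.sort(np.angle(np.roots(q_r)))
    # keep upper-half-plane angles; conjugate pairs carry no extra info
    return bp[bp > 0], bq[bq > 0]

# stable 2nd-order example: A(z) = 1 - 0.9 z^-1 + 0.2 z^-2
bp, bq = lpc_to_lsp(np.array([1.0, -0.9, 0.2]))
```

For a stable A(z), the roots of K′ and Q′ lie on the unit circle and their angles interlace, which is what makes the representation robust to quantization.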
specifically, step S14 corresponds to a module of the training spectrum gradient reconstruction network in the training stage in fig. 3, and the detailed process is as follows: training the spectrum inclination reconstruction network, defining the perception root mean square deviation as an evaluation method, testing the performance of the spectrum inclination network by using the voice data in the test set and the evaluation method, and debugging an optimal reconstruction network parameter model.
The perceptual root mean square deviation used by the evaluation method is computed as

PR_i = Σ_{j=1}^{L} b_j · sqrt( (1/D_j) · Σ_{k=1}^{D_j} ( Ŷ_i^j(k) − Y_i^j(k) )² )

where ŷ_i(n) is the estimated value and y_i(n) the true value of the all-pole model parameters of the i-th frame speech signal spectral tilt; Ŷ_i(k), the spectral tilt computed from the all-pole model with parameters ŷ_i(n), is the estimated spectral tilt of the i-th frame speech signal, and Y_i(k) is the true spectral tilt of the i-th frame speech signal; Ŷ_i(k) and Y_i(k) are each divided into L sub-bands with the same sub-band division method; Ŷ_i^j(k) represents the estimated and Y_i^j(k) the true spectral tilt of the j-th sub-band of the i-th frame speech signal; D_j denotes the length of the j-th sub-band; and b_j denotes the perceptual coefficient with which the perceptual root mean square deviation of the j-th sub-band is weighted. PR_i denotes the perceptual root mean square deviation (PRMSD) of the spectral tilt of the i-th frame speech signal.
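Under the definitions above, the per-frame perceptual deviation can be sketched as a perceptually weighted combination of per-sub-band RMS errors. Since the original text names the quantities (sub-bands, lengths D_j, weights b_j) but the exact combining formula is reconstructed here, the weight-then-sum form below is an assumption:

```python
import numpy as np

def prmsd(Y_est, Y_true, band_edges, b):
    """Perceptual root-mean-square deviation between an estimated and a
    true spectral-tilt curve for one frame.  band_edges holds L+1 bin
    indices delimiting the L sub-bands; b holds the L perceptual
    weights b_j.  Each sub-band contributes its RMS error, weighted
    by its perceptual coefficient."""
    total = 0.0
    for j in range(len(b)):
        lo, hi = band_edges[j], band_edges[j + 1]
        d = Y_est[lo:hi] - Y_true[lo:hi]          # D_j = hi - lo bins
        total += b[j] * np.sqrt(np.mean(d ** 2))  # weighted sub-band RMS
    return total
```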
The number of input-layer nodes of the optimal spectral tilt reconstruction network is C, the same as the number of log-magnitude spectrum parameters of each frame of the narrowband speech signal in step S12.
In specific implementations, the excitation functions usable by the hidden layers of the optimal network parameter model include the Sigmoid function, the Tanh function, and the Linear function; the hidden-layer node parameters may be [N/4, N/4, N/8, N/8], [N/8, N/8, N/16, N/16], [N/4, N/4, N/8, N/16], [N/4, N/8, N/16], or [N/4, N/8, N/16, N/16]; and the optimal time step of each hidden layer is determined by parameter tuning.
In this embodiment, the excitation function used by the hidden layers is the Tanh function, the excitation function used by the output layer is the Linear function, and the hidden-layer node parameters are [N/8, N/16]. The number of output-layer nodes of the optimal reconstruction network described in step S14 is P, the same as the order of the all-pole model parameters of the speech spectral tilt; considering algorithm complexity, the value of P is generally no greater than 20.
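The network shape just described (C input values per time step, Tanh recurrent hidden layers, a linear output layer of P nodes) can be sketched as a plain forward pass. The parameter layout and the simple-RNN recurrence below are illustrative assumptions; the patent specifies a recurrent neural network but not a particular recurrent cell:

```python
import numpy as np

def rnn_forward(X, params):
    """Forward pass of a reconstruction network of the shape described
    above: Tanh recurrent hidden layers followed by a linear output layer.
    X has shape (T, C): T time steps of C-point log-magnitude spectra.
    params["hidden"] holds (Wx, Wh, b) per hidden layer; params["out"]
    holds (Wo, bo) for the output layer."""
    for Wx, Wh, b in params["hidden"]:
        H = np.zeros((X.shape[0], Wh.shape[0]))
        h = np.zeros(Wh.shape[0])
        for t in range(X.shape[0]):
            h = np.tanh(X[t] @ Wx + h @ Wh + b)  # Tanh excitation
            H[t] = h
        X = H                                     # feed the next layer
    Wo, bo = params["out"]
    return X @ Wo + bo                            # linear output layer
```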
In this embodiment the optimal hidden-layer time step is determined by parameter tuning, with the following procedure: reconstruction networks with different time steps are trained on the same reconstruction network structure, each trained network is tested with the speech data of the validation set, and the perceptual root mean square deviation of each is computed; the time step used by the reconstruction network with the smallest perceptual root mean square deviation is the optimal hidden-layer time step. In this embodiment each hidden-layer time step is 6.
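The tuning procedure above amounts to a one-dimensional grid search over candidate time steps, keeping the candidate whose trained network minimizes validation-set PRMSD. A schematic version, with placeholder callables `train_fn` and `eval_prmsd` standing in for the real training and evaluation routines, is:

```python
def pick_time_step(candidates, train_fn, eval_prmsd):
    """Train one spectral-tilt reconstruction network per candidate
    hidden-layer time step and return the step whose trained network
    scores the lowest perceptual RMS deviation on the validation set."""
    scores = {step: eval_prmsd(train_fn(step)) for step in candidates}
    return min(scores, key=scores.get)
```

For this embodiment the candidates would include 6, the value the tuning selected.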
After training of the spectral tilt reconstruction network is complete, the network can be put into the use stage: it is embedded at the end of the decoder of a speech communication system as a post-processing technique, and it processes the real-time speech signals of actual communication frame by frame.
The use stage of the spectral tilt reconstruction network is implemented in the following steps:
Step S21: narrowband speech is input frame by frame in real time, and the log-magnitude spectrum parameters of the narrowband speech are extracted.
Step S22: the log-magnitude spectrum parameters of the narrowband speech are input frame by frame, and the all-pole model parameters of the wideband speech spectral tilt are reconstructed by combining the spectral tilt reconstruction network and the parameter conversion.
Specifically, step S21 corresponds to the "extract narrowband speech features" module in Fig. 3 and is implemented as follows: one frame of the narrowband speech signal is input in real time, and the C-point narrowband speech log-magnitude spectrum parameters are extracted in the same way as in step S12 of the training stage of the spectral tilt reconstruction network.
Step S22 is implemented as follows: the C-point narrowband speech log-magnitude spectrum parameters extracted in step S21 are input to the trained optimal spectral tilt reconstruction network, which reconstructs the P-order linear spectrum pair parameters of the wideband speech spectral tilt; finally the obtained P-order linear spectrum pair parameters are converted into P-order all-pole model parameters, i.e., the wideband speech spectral tilt characteristic parameters reconstructed from the narrowband speech.
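The per-frame feature extraction of step S21 can be sketched as below. The Hann window, the base-10 logarithm, and the small floor constant are assumptions for illustration, since the text only specifies a window function and "logarithmizing" the magnitude spectrum:

```python
import numpy as np

def log_magnitude_spectrum(frame, N=512):
    """C = N/2 + 1 log-magnitude spectrum points of one narrowband speech
    frame: the per-frame input of the spectral tilt reconstruction network."""
    win = np.hanning(len(frame))             # assumed window function
    spec = np.fft.rfft(frame * win, n=N)     # N-point short-time Fourier transform
    return np.log10(np.abs(spec) + 1e-12)    # floor avoids log of zero
```

In the use stage, each such C-point vector is fed to the trained network, whose P-order linear spectrum pair output is then converted back to all-pole model parameters.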
In summary, the present invention provides a method for reconstructing the spectral tilt of a wideband speech signal from a narrowband speech signal. The method is robust, can be applied to any speech intelligibility enhancement system based on spectral tilt characteristics, and is suitable for multi-language and multi-mode speech signals. In a specific implementation, computer software can be adopted to automate the process.
The above description covers only a preferred embodiment of the present invention, and the present invention is not limited to this embodiment. Those skilled in the art should understand that any simple modification, equivalent change or refinement made to the above embodiment based on the technical core of the present invention falls within the protection scope claimed by the technical scheme of the present invention.

Claims (9)

1. A wideband speech spectral tilt characteristic parameter reconstruction method for speech definition enhancement, characterized by comprising a training stage and a use stage of a spectral tilt reconstruction network based on a recurrent neural network,
the training phase of the spectral tilt reconstruction network comprises the following steps,
step S11, acquiring narrowband speech data with a low sampling rate by down-sampling wideband speech data with a high sampling rate and establishing a speech data set, wherein the narrowband speech data and wideband speech data in the speech data set both comprise ordinary speech and anti-noise speech with the same text content; dividing the speech data set into a training set and a test set in proportion; and preprocessing the speech data in the speech data set, the preprocessing comprising framing and windowing;
step S12, inputting the preprocessed narrowband speech data of the training set, performing a short-time Fourier transform to obtain the narrowband speech spectrum, and taking the log-magnitude spectrum obtained by logarithmizing the spectrum information as the input of the spectral tilt reconstruction network;
step S13, inputting the preprocessed wideband speech data of the training set, extracting the all-pole model parameters of the wideband speech signal spectral tilt, converting them into linear spectrum pair parameters, and taking the linear spectrum pair parameters as the output of the spectral tilt reconstruction network;
step S14, training the spectral tilt reconstruction network, defining the perceptual root mean square deviation PRMSD as the evaluation method to test the performance of the spectral tilt network, using the validation set as the evaluation standard for each evaluation, tuning an optimal reconstruction network parameter model, and verifying the final effect on the test set;
the use stage of the spectral tilt reconstruction network puts the trained neural network into real-time frame-by-frame speech processing of actual communication, and comprises the following steps,
step S21, inputting narrowband speech frame by frame in real time and extracting the log-magnitude spectrum parameters of the narrowband speech;
and step S22, inputting the log-magnitude spectrum parameters of the narrowband speech frame by frame, and reconstructing the all-pole model parameters of the wideband speech spectral tilt by combining the spectral tilt reconstruction network and the parameter conversion.
2. The wideband speech spectral tilt characteristic parameter reconstruction method for speech intelligibility enhancement according to claim 1, characterized in that: in step S12, the number of short-time Fourier transform points is N, and the training input of the spectral tilt reconstruction network is calculated as

x_i(k) = lg | Σ_{n=1}^{N} s_i(n)·Win(n)·e^(-j2πkn/N) |,  k = 1, 2, …, C

X = [x_1; x_2; …; x_M]

s_i(n) represents the i-th frame narrowband speech signal, n is the sample index within the speech frame, x_i(k) represents the value of the log-magnitude spectrum of the i-th frame speech signal, k is the frequency-domain index, and Win represents the window function in the time domain; the number of points of the log-magnitude spectrum of each frame of the speech signal is

C = N/2 + 1

and x_i = [x_i(1), x_i(2), …, x_i(C)]; each framed signal of the narrowband speech data in the speech data set is computed according to the first formula to obtain its log-magnitude spectrum, which is stored row by row into the matrix X, where X represents the input matrix of the spectral tilt reconstruction network and M is the number of rows of X.
3. The wideband speech spectral tilt characteristic parameter reconstruction method for speech intelligibility enhancement according to claim 2, characterized in that: in step S13, a_i = [a_i(1), a_i(2), …, a_i(P)], the all-pole model parameters of the spectral tilt of the i-th frame wideband speech signal, are calculated from the i-th frame wideband speech signal s_i(n), where a_i(1), a_i(2), …, a_i(P) are respectively the all-pole model parameter values of orders 1, 2, …, P, and P is the order of the all-pole model parameters.
4. The wideband speech spectral tilt characteristic parameter reconstruction method for speech intelligibility enhancement according to claim 1, characterized in that: the linear spectrum pair parameters described in step S13 are an equivalent form of the all-pole model parameters, and the linear spectrum pair parameters have stronger robustness.
5. The wideband speech spectral tilt characteristic parameter reconstruction method for speech intelligibility enhancement according to claim 2, characterized in that: the evaluation method adopted in step S14 uses the speech data of the validation set and the test set, and is computed as

PR_i = Σ_{j=1}^{L} b_j · sqrt( (1/D_j) · Σ_{k=1}^{D_j} ( Ŷ_i^j(k) − Y_i^j(k) )² )

where ŷ_i(n) is the estimated value and y_i(n) the true value of the all-pole model parameters of the i-th frame speech signal spectral tilt; Ŷ_i(k), obtained from the all-pole model with parameters ŷ_i(n), is the estimated spectral tilt of the i-th frame speech signal, and Y_i(k) is the true spectral tilt of the i-th frame speech signal; Ŷ_i(k) and Y_i(k) are each divided into L sub-bands with the same sub-band division method; Ŷ_i^j(k) represents the estimated and Y_i^j(k) the true spectral tilt of the j-th sub-band of the i-th frame speech signal; D_j denotes the length of the j-th sub-band; b_j represents the perceptual coefficient used in computing the perceptual root mean square deviation of the j-th sub-band; and PR_i denotes the perceptual root mean square deviation PRMSD of the spectral tilt of the i-th frame speech signal.
6. The wideband speech spectral tilt characteristic parameter reconstruction method for speech intelligibility enhancement according to claim 2, characterized in that: the number of input-layer nodes of the optimal reconstruction network parameter model in step S14 is

C = N/2 + 1

the same as the number of points of the log-magnitude spectrum parameters of each frame of the narrowband speech signal in step S12.
7. The wideband speech spectral tilt characteristic parameter reconstruction method for speech intelligibility enhancement according to claim 2, characterized in that: in step S14, the excitation function used by the hidden layers of the optimal reconstruction network parameter model is the Sigmoid function, the Tanh function or the Linear function; the hidden-layer node parameters are [N/4, N/4, N/8, N/8], [N/8, N/8, N/16, N/16], [N/4, N/4, N/8, N/16], [N/4, N/8, N/16] or [N/4, N/8, N/16, N/16]; and the optimal time step of each hidden layer is determined by parameter tuning.
8. The wideband speech spectral tilt characteristic parameter reconstruction method for speech intelligibility enhancement according to claim 1, characterized in that: in step S14, the number of output-layer nodes of the optimal reconstruction network parameter model is P, the same as the order of the all-pole model parameters of the speech spectral tilt.
9. The wideband speech spectral tilt characteristic parameter reconstruction method for speech intelligibility enhancement according to any one of claims 1 to 8, characterized in that:
the method for extracting the narrowband speech log-magnitude spectrum parameters in step S21 of the use stage of the spectral tilt reconstruction network is the same as that of step S12 of the training stage of the spectral tilt reconstruction network;
in the use stage of the spectral tilt reconstruction network, the parameter conversion in step S22 converts the linear spectrum pair parameters of the wideband speech spectral tilt reconstructed by the spectral tilt reconstruction network into all-pole model parameters.
CN201811249506.0A 2018-10-25 2018-10-25 Broadband voice frequency spectrum gradient characteristic parameter reconstruction method for voice definition enhancement Active CN109215635B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811249506.0A CN109215635B (en) 2018-10-25 2018-10-25 Broadband voice frequency spectrum gradient characteristic parameter reconstruction method for voice definition enhancement


Publications (2)

Publication Number Publication Date
CN109215635A CN109215635A (en) 2019-01-15
CN109215635B (en) 2020-08-07

Family

ID=64996332





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant