CN109215635A - Method for reconstructing wideband speech spectral tilt feature parameters for speech intelligibility enhancement - Google Patents


Publication number
CN109215635A
Authority
CN
China
Prior art keywords
spectral tilt
tilt degree
parameter
voice
frame
Legal status
Granted
Application number
CN201811249506.0A
Other languages
Chinese (zh)
Other versions
CN109215635B (en)
Inventor
胡瑞敏
李罡
张锐
王晓晨
Current Assignee
Wuhan University WHU
Original Assignee
Wuhan University WHU
Application filed by Wuhan University (WHU)
Priority to CN201811249506.0A
Publication of CN109215635A
Application granted
Publication of CN109215635B
Legal status: Active


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The present invention provides a method for reconstructing wideband speech spectral tilt feature parameters for speech intelligibility enhancement, comprising a training stage and a usage stage of a spectral tilt reconstruction network based on a recurrent neural network. In the training stage, a speech dataset is established and its speech data are preprocessed; the preprocessed narrowband speech data are input and a short-time Fourier transform is applied to obtain the narrowband speech spectrum, whose logarithm gives the log-magnitude spectrum; the preprocessed wideband speech data are input, and the all-pole model parameters of the wideband speech spectral tilt are extracted and converted to line spectral pair parameters; the spectral tilt reconstruction network is then trained and used to reconstruct the all-pole model parameters of the wideband speech spectral tilt. The invention reconstructs the wideband speech spectral tilt parameters from the narrowband speech signal, is applicable to all speech intelligibility enhancement systems based on spectral tilt features, and can be adapted to multilingual, multi-modal speech signals.

Description

Method for reconstructing wideband speech spectral tilt feature parameters for speech intelligibility enhancement
Technical field
The present invention provides a method for reconstructing wideband speech spectral tilt feature parameters for speech intelligibility enhancement. It relates to the fields of speech signal processing and communication technology, is applicable to all speech intelligibility enhancement systems based on spectral tilt features, and can be adapted to multilingual, multi-modal speech signals.
Background art
Since the start of the 21st century, mobile communication technology has developed rapidly and mobile communication devices such as mobile phones have become widespread. Thanks to the convenience of mobile phones, people can hold real-time voice conversations anywhere and at any time. As a consequence, calls are inevitably made in noisy environments such as stations, restaurants, and factories, where the noise seriously degrades call quality.
A voice communication process can be roughly divided into two stages (as shown in Fig. 1). The first is the speaking stage: the speaker talks into the phone, the phone's microphone captures the speech signal, the signal is encoded, and it is finally sent into the communication channel as an uplink signal. The second is the listening stage: the phone receives from the channel the downlink signal delivered by the communication network, decodes it to regenerate the speech signal, and plays back the decoded speech; once the listener's ear receives the played speech, the transfer of one piece of speech information is complete. From the listener's point of view, the process of receiving the downlink signal and listening to the speech content is called the near end; the process in which the speech is produced and the uplink signal is sent is, again from the listener's point of view, called the far end.
For far-end signal processing, researchers have developed speech enhancement techniques to suppress the background noise in the speech signal captured by the microphone. On the one hand, software algorithms exploit characteristics of the speech signal such as its time-frequency, acoustic, and linguistic properties to filter out energy that does not belong to the speech, and reconstruct speech features for the speech components missing from the filtered signal. On the other hand, hardware assists: several dedicated microphones are installed on the phone to capture ambient sound, and the speech signal and the noise signal captured by the noise microphones are combined through spectral subtraction or an adaptive filtering system. Through this combination of software and hardware, speech enhancement can remove most of the noise from the captured speech signal while keeping speech distortion very small.
For near-end signal processing, the first approach researchers considered for suppressing background noise during listening was noise cancellation: a microphone captures the ambient noise, and a sound wave with the same frequency and amplitude but opposite phase is emitted so that it interferes destructively with the noise and reduces the ambient noise energy. Active noise-cancelling headphones are a typical product of this strategy: the headphones first block part of the noise by physical isolation, and the residual noise is cancelled by an inverted signal added to the signal played in the headphones. In handset (earpiece) listening mode, however, there is no physical isolation; the ear is directly exposed to high-energy ambient noise, and, together with problems such as environmental reverberation and the difficulty of keeping the earpiece aligned with the ear, the noise cancellation effect drops sharply.
Since noise cancellation fails in handset listening mode, researchers have proposed near-end listening enhancement to ensure that the speech received by the listener is sufficiently clear. Based on perceptual acoustics, linguistics, and signal processing, it improves the perceived intelligibility of the speech signal and thereby its robustness, so that under the same noise conditions the speech is more easily understood by the listener. Because its goal is to improve the intelligibility of the speech signal, it is also called speech clarity enhancement or speech intelligibility enhancement.
Conventional speech intelligibility enhancement methods fall into two classes: rule-based methods and metric-based methods. Rule-based methods ignore the surrounding noise and modify the time-frequency characteristics of the speech signal according to fixed adjustment rules; the intelligibility gain of such methods varies considerably across environments, so their robustness is poor. Metric-based methods compare the speech signal with the ambient noise in real time using a specific metric and dynamically adjust the gain of the speech signal; they improve intelligibility more noticeably, but they severely damage the naturalness and listening comfort of the speech.
Data-driven speech intelligibility enhancement is a new class of method that improves intelligibility using the speaker's noise-countering production mechanism and a natural speech production model. In noisy conditions, a speaker under pressure from the noise spontaneously changes the way they speak to overcome the background noise, and this change significantly improves the clarity perceived by the listener. This mechanism is called the Lombard effect, and the resulting noise-robust speech is called Lombard speech. Studies show that the spectral tilt of Lombard speech differs greatly in its details from the spectral tilt of normal speech for the same sentence, and that the spectral tilt of Lombard speech is flatter overall. The spectral tilt parameters effectively reflect the difference between Lombard and normal speech, so they serve as the key parameters for improving speech intelligibility.
In a data-driven speech intelligibility enhancement system, Lombard speech recorded in different scenarios and the corresponding normal speech recorded in a quiet environment are used as training data to fit a Lombard-based enhancement system, which maps the spectral tilt of a normal speech signal to the spectral tilt of Lombard speech and thereby produces noise-robust Lombard speech. The algorithm block diagram of the system is shown in Fig. 2, and the process is as follows. Narrowband normal speech is input and its spectral tilt is extracted; the spectral tilt reconstruction network reconstructs the wideband speech spectral tilt feature parameters A(z); A(z) is fed to the spectral tilt mapping model, which maps it to the wideband noise-robust (Lombard) speech spectral tilt feature parameters A′(z), where z is the complex variable of the z-domain. A filter then replaces the spectral tilt of the narrowband normal speech with the mapped wideband noise-robust spectral tilt. To keep the total energy of the speech signal unchanged before and after processing, gain control is applied to the filtered speech signal, and finally the noise-robust speech is output.
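The gain-control step at the end of this pipeline can be illustrated with a minimal sketch. It assumes the simplest reading of "constant total energy": the filtered frame is scaled by one gain so that its energy equals that of the input frame. The function name `match_energy` and the sample values are hypothetical and not taken from the patent.

```python
import math


def match_energy(processed, reference):
    """Scale `processed` so its total energy equals that of `reference`."""
    e_ref = sum(v * v for v in reference)
    e_proc = sum(v * v for v in processed)
    if e_proc == 0.0:
        # A silent frame cannot be rescaled; return it unchanged.
        return list(processed)
    gain = math.sqrt(e_ref / e_proc)
    return [gain * v for v in processed]


# Hypothetical frame before filtering and after spectral tilt replacement.
original = [0.1, -0.2, 0.3, -0.1]
filtered = [0.4, -0.1, 0.5, 0.2]
out = match_energy(filtered, original)
```

A single broadband gain leaves the shape of the modified spectrum untouched, which is why it is a natural choice when only the overall level has to be restored.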
Data-driven algorithms can train the mapping model with machine learning methods such as Gaussian process regression, Gaussian mixture models, and deep neural networks. The mapping model requires highly accurate spectral tilt information at its input. In a real communication environment, however, the narrowband signal lacks part of the acoustic features, so spectral tilt parameters computed directly from the narrowband signal deviate considerably from those of the wideband speech signal; the enhancement system cannot obtain accurate spectral tilt information and its enhancement performance degrades sharply. The present invention proposes a method for reconstructing wideband speech spectral tilt feature parameters for speech intelligibility enhancement; the reconstructed feature parameters can be applied to all speech intelligibility enhancement systems based on spectral tilt parameters.
Summary of the invention
By providing a method for reconstructing wideband speech spectral tilt feature parameters for speech intelligibility enhancement, the present invention solves the following problem: because a narrowband speech signal lacks acoustic features, the spectral tilt parameters computed directly from it deviate considerably from those of the wideband speech signal, so the enhancement system cannot obtain accurate spectral tilt information and its enhancement performance degrades sharply.
The technical solution of the invention provides a method for reconstructing wideband speech spectral tilt feature parameters for speech intelligibility enhancement, comprising a training stage and a usage stage of a spectral tilt reconstruction network based on a recurrent neural network.
The training stage of the spectral tilt reconstruction network comprises the following steps:
Step S11: downsample high-sampling-rate wideband speech data to obtain low-sampling-rate narrowband speech data and establish a speech dataset; divide the speech data proportionally into a training set, a validation set, and a test set; and preprocess the speech data in the dataset, the preprocessing comprising framing and windowing;
Step S12: input the preprocessed narrowband speech training set, apply a short-time Fourier transform to obtain the narrowband speech spectrum, and take the logarithm of the spectrum to obtain the log-magnitude spectrum, which serves as the input of the spectral tilt reconstruction network;
Step S13: input the preprocessed wideband speech training set, extract the all-pole model parameters of the wideband speech spectral tilt, and convert them to line spectral pair parameters, which serve as the output of the spectral tilt reconstruction network;
Step S14: train the spectral tilt reconstruction network; define the perceptual root-mean-square deviation (PRMSD) as the evaluation method for testing the network's performance, using the validation set as the evaluation criterion in each assessment; tune the optimal reconstruction network parameter model; and verify the final result on the test set.
In the usage stage of the spectral tilt reconstruction network, the trained neural network is deployed in an actual communication system and processes the speech signal frame by frame in real time:
Step S21: input narrowband speech frame by frame in real time and extract its log-magnitude spectrum parameters;
Step S22: input the narrowband log-magnitude spectrum parameters frame by frame, and reconstruct the all-pole model parameters of the wideband speech spectral tilt using the spectral tilt reconstruction network followed by parameter conversion.
Furthermore, both the wideband and the narrowband speech data include normal speech and noise-robust speech.
Furthermore, in step S12 the short-time Fourier transform has $N$ points, and the training input of the spectral tilt reconstruction network is computed by the following formula:
where $s_i(n)$ denotes the i-th frame of the narrowband speech signal, $n$ is the sample index bounded by the frame length, $x_i(k)$ denotes the value of the log-magnitude spectrum of the i-th frame, $k$ is the frequency-domain index, and $W_{in}$ denotes a time-domain window function. The number of points of the log-magnitude spectrum of each frame is $C = N/2 + 1$, and $x_i = [x_i(1), x_i(2), \dots, x_i(C)]$ is the log-magnitude spectrum of the i-th frame. For every frame of the framed narrowband speech data in the dataset, the log-magnitude spectrum is computed with the formula above and stored row by row in the matrix $X$; $X$ is the input matrix of the spectral tilt reconstruction network and $M$ is its number of rows.
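Under the usual definition of a windowed $N$-point discrete Fourier transform, the log-magnitude spectrum described here could be written as follows. This is a sketch under stated assumptions, not necessarily the patent's exact expression: the natural logarithm is assumed, $n$ is taken as the sample index, the frame is zero-padded to $N$ points, and the bin index $k$ is counted from 1 so that the $C = N/2 + 1$ non-redundant bins are $k = 1, \dots, C$.

```latex
x_i(k) = \log\left| \sum_{n=0}^{N-1} s_i(n)\, W_{in}(n)\, e^{-j 2\pi (k-1) n / N} \right|,
\qquad k = 1, 2, \dots, C, \quad C = \frac{N}{2} + 1
```

With $N = 512$ this gives $C = 257$, which matches the value used in the embodiment.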
Furthermore, in step S13 the following is computed from the i-th frame of the wideband speech signal $s_i(n)$:
$$a_i = f(s_i(n))$$
which yields $a_i = [a_i(1), a_i(2), \dots, a_i(P)]$, the all-pole model parameters of the spectral tilt of the i-th wideband frame, where $P$ is the order of the all-pole model.
Furthermore, the line spectral pair parameters in step S13 are an equivalent representation of the all-pole model parameters, and they are more robust.
Furthermore, the evaluation method of step S14 uses the speech data of the validation set and the test set, and is computed by the following formula:
where $\hat{y}_i(n)$ is the estimated value of the spectral tilt all-pole model parameters of the i-th frame, $y_i(n)$ is their true value, $\hat{Y}_i(k)$ is the estimated spectral tilt of the i-th frame, and $Y_i(k)$ is the true spectral tilt of the i-th frame. $\hat{Y}_i(k)$ and $Y_i(k)$ are each divided into $L$ subbands by the same subband division method; $\hat{Y}_i^{j}(k)$ denotes the estimated spectral tilt of the j-th subband of the i-th frame, $Y_i^{j}(k)$ denotes the true spectral tilt of the j-th subband of the i-th frame, $D_j$ denotes the length of the j-th subband, $b_j$ denotes the perceptual coefficient used for the perceptual root-mean-square deviation of the j-th subband, and $PR_i$ denotes the perceptual root-mean-square deviation (PRMSD) of the spectral tilt of the i-th frame.
Furthermore, the number of input-layer nodes of the optimal reconstruction network parameter model in step S14 is $C$, equal to the number of log-magnitude spectrum points of each narrowband frame in step S12.
Furthermore, in step S14 the activation function of the hidden layers of the optimal network parameter model is the sigmoid, tanh, or linear function; the hidden-layer node counts are [N/4, N/4, N/8, N/8], [N/8, N/8, N/16, N/16], [N/4, N/4, N/8, N/16], [N/4, N/8, N/8, N/16], or [N/4, N/8, N/16, N/16]; and the optimal time step of each hidden layer is determined by parameter testing.
Furthermore, in step S14 the number of output-layer nodes of the optimal reconstruction network is $P$, equal to the order of the all-pole model parameters of the speech spectral tilt.
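As an illustration, the candidate layer layouts above can be enumerated for a concrete transform size. The sketch assumes N = 512 and P = 20 (the values of the embodiment) and takes C = N/2 + 1, the number of non-redundant bins of an N-point transform of a real signal, which is an assumption here. All variable names are hypothetical.

```python
N = 512          # STFT size (value used in the embodiment)
C = N // 2 + 1   # assumed number of log-magnitude spectrum points per frame
P = 20           # all-pole model order (value used in the embodiment)

# Candidate hidden-layer node counts listed in the text.
hidden_candidates = [
    [N // 4, N // 4, N // 8, N // 8],
    [N // 8, N // 8, N // 16, N // 16],
    [N // 4, N // 4, N // 8, N // 16],
    [N // 4, N // 8, N // 8, N // 16],
    [N // 4, N // 8, N // 16, N // 16],
]

# Full layouts: input layer, four hidden layers, output layer.
layer_sizes = [[C, *hidden, P] for hidden in hidden_candidates]

for sizes in layer_sizes:
    print(sizes)
```

Each layout narrows from 257 inputs to 20 outputs; which candidate and which activation function work best is left to tuning on the validation set.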
Furthermore, the method for extracting the narrowband log-magnitude spectrum parameters in step S21 of the usage stage is the same as in step S12 of the training stage; the parameter conversion in step S22 of the usage stage converts the line spectral pair parameters of the wideband speech spectral tilt reconstructed by the network into all-pole model parameters.
The invention reconstructs wideband speech spectral tilt information from the log-magnitude spectrum of narrowband speech. This spectral tilt information is applicable to all speech intelligibility enhancement systems based on spectral tilt, can be adapted to multilingual, multi-modal speech signals, and improves the extensibility and practicality of speech intelligibility enhancement systems.
Brief description of the drawings
Fig. 1 is a schematic diagram of the voice communication process in a noisy scenario according to an embodiment of the invention;
Fig. 2 is a block diagram of the speech intelligibility enhancement system based on spectral tilt features according to an embodiment of the invention;
Fig. 3 is a flowchart of the method for reconstructing wideband speech spectral tilt feature parameters for speech intelligibility enhancement according to an embodiment of the invention.
Detailed description of the embodiments
The embodiments of the invention are described in further detail below with reference to the accompanying drawings. Evidently, the embodiments described here are only some, not all, of the embodiments of the invention. Any embodiment obtained by a person skilled in the art on the basis of these embodiments without creative effort falls within the scope of protection of this application.
The invention applies to speech intelligibility enhancement systems in real-time voice communication systems; such a system improves speech clarity based on the speaker's noise-countering production mechanism (the Lombard effect) and a natural speech production model. The invention provides a method for recovering speech feature parameters in a speech intelligibility enhancement system, namely "a method for reconstructing wideband speech spectral tilt parameters from narrowband speech".
The invention is further described below with reference to the drawings and embodiments, which are not to be taken as limiting the invention.
In view of the problems in the prior art, the embodiment proposes a method for reconstructing wideband speech spectral tilt feature parameters from narrowband speech. It is suitable for speech intelligibility enhancement systems based on spectral tilt features; the system block diagram is shown in Fig. 2.
The embodiment comprises a training stage and a usage stage of a spectral tilt reconstruction network based on a recurrent neural network (RNN), as shown in Fig. 3.
Training stage: the narrowband log-magnitude spectrum parameters and the wideband spectral tilt line spectral pair parameters are extracted from the training set and used as the input and output, respectively, for training the spectral tilt reconstruction network; the network is trained and the optimal parameter model is tuned. Usage stage: narrowband log-magnitude spectrum parameters are input frame by frame into the spectral tilt reconstruction network, which reconstructs the line spectral pair parameters of the wideband speech spectral tilt, from which the all-pole model parameters of the wideband speech spectral tilt are generated.
The training stage of the spectral tilt reconstruction network comprises the following implementation steps:
Step S11: establish a speech dataset, divide the speech data proportionally into a training set, a validation set, and a test set, and preprocess the speech data in the dataset by framing, windowing with a Hamming window, and so on;
Step S12: input the preprocessed narrowband speech training set, apply a short-time Fourier transform to obtain the narrowband speech spectrum, and take the logarithm of the spectrum to obtain the log-magnitude spectrum, which serves as the input of the spectral tilt reconstruction network;
Step S13: input the preprocessed wideband speech training set, extract the all-pole model parameters of the wideband speech spectral tilt, and convert them to line spectral pair parameters, which serve as the output of the spectral tilt reconstruction network;
Step S14: train the spectral tilt reconstruction network; define the perceptual root-mean-square deviation (PRMSD) as the evaluation method for testing the network's performance, using the validation set as the evaluation criterion in each assessment; tune the optimal reconstruction network parameter model; and verify the final result on the test set.
Specifically, step S11 proceeds as follows: the high-sampling-rate wideband speech data are downsampled to obtain low-sampling-rate narrowband speech data, and the speech dataset is established. The sampling rate of wideband speech data is typically 16000 Hz, 48000 Hz, etc., and that of narrowband speech data is typically 8000 Hz, 6000 Hz, etc.
In this embodiment the wideband sampling rate is 16000 Hz and the narrowband sampling rate is 8000 Hz; the corresponding narrowband and wideband speech data both include normal speech and noise-robust speech with the same text content. The narrowband and wideband speech inputs in Fig. 3 all come from the dataset established in step S11. The dataset is divided into a training set, a validation set, and a test set in the proportions 85%, 7.5%, and 7.5%; the narrowband and wideband speech data in the training and test sets are framed, and in this embodiment windowing uses a Hamming window.
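The 85% / 7.5% / 7.5% division can be sketched as follows. The sketch assumes that the split is made per utterance, so that a narrowband recording and its wideband counterpart always land in the same subset, and that a seeded shuffle is acceptable; the patent does not specify either point. `split_dataset` and the utterance IDs are hypothetical.

```python
import random


def split_dataset(items, ratios=(0.85, 0.075, 0.075), seed=0):
    """Shuffle utterance IDs and split them into train/validation/test."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    n_train = round(len(items) * ratios[0])
    n_val = round(len(items) * ratios[1])
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]      # remainder goes to the test set
    return train, val, test


# 200 hypothetical utterance IDs; each ID stands for one narrowband/wideband pair.
train, val, test = split_dataset(range(200))
print(len(train), len(val), len(test))
```

Splitting by utterance ID rather than by frame avoids frames of the same sentence leaking between the training and evaluation subsets.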
The wideband and narrowband speech data include normal speech and noise-robust speech (Lombard speech).
Lombard speech is the noise-robust speech that people produce in a noisy environment when, under pressure from the ambient noise, they spontaneously change the way they speak. Lombard speech is more intelligible than normal speech. Preferably, the narrowband and wideband speech data are framed with the following settings: each frame is 20 ms long and overlaps the previous frame by 50%. Because the narrowband and wideband sampling rates differ, so do their frame lengths in samples: in this embodiment each narrowband frame has 160 samples and each wideband frame has 320 samples.
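A short sketch of this framing follows. It only cuts the signal into overlapping frames; windowing is left to the feature extraction step. At 8000 Hz a 20 ms frame contains 160 samples, and at 16000 Hz it contains 320. The function name `frame_signal` and the silent test signals are hypothetical.

```python
def frame_signal(signal, sample_rate, frame_ms=20, overlap=0.5):
    """Cut a signal into fixed-duration frames with fractional overlap."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(frame_len * (1 - overlap))
    frames = [signal[start:start + frame_len]
              for start in range(0, len(signal) - frame_len + 1, hop)]
    return frame_len, hop, frames


# One second of (silent) audio at each sampling rate.
nb_len, nb_hop, nb_frames = frame_signal([0.0] * 8000, 8000)
wb_len, wb_hop, wb_frames = frame_signal([0.0] * 16000, 16000)
print(nb_len, wb_len, len(nb_frames), len(wb_frames))
```

Because the frame duration and overlap are the same at both rates, the narrowband and wideband versions of an utterance yield the same number of frames, so input and target frames pair up one to one.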
Specifically, step S12 corresponds to the "compute network input" module of the training stage in Fig. 3 and proceeds as follows. Each narrowband frame obtained in step S11 undergoes an $N$-point short-time Fourier transform; possible values of $N$ include 1024, 512, and 256, and $N = 512$ is preferred in this embodiment. The log-magnitude spectrum of each narrowband frame is then computed according to the following formula:
where $s_i(n)$ denotes the i-th frame of the narrowband speech signal, $n$ is the sample index bounded by the frame length, which is 160, $x_i(k)$ denotes the value of the log-magnitude spectrum of the i-th narrowband frame, $k$ is the frequency-domain index, $M$ is the total number of frames of the input training samples, and $W_{in}$ denotes a time-domain window function. This embodiment windows each frame with a Hanning window; alternative window functions include the Hamming window and the sine window. The number of log-magnitude spectrum points taken per frame is $C = N/2 + 1$, and $C = 257$ in this embodiment.
$x_i = [x_i(1), x_i(2), \dots, x_i(C)]$ is the log-magnitude spectrum of the i-th frame. For every frame of the framed narrowband speech data in the dataset, the log-magnitude spectrum is computed with the formula above and stored row by row in the matrix $X$. $X$ is the input matrix of the spectral tilt reconstruction network and $M$ is its number of rows, that is, the total number of frames of the input training samples (all framed narrowband speech data in the training set).
In this embodiment the 257 log-magnitude spectrum parameters of each narrowband frame form the training input of the spectral tilt reconstruction network. The input matrix $X$ of the network is:
$$X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_M \end{bmatrix} = \begin{bmatrix} x_1(1) & x_1(2) & \cdots & x_1(C) \\ x_2(1) & x_2(2) & \cdots & x_2(C) \\ \vdots & \vdots & \ddots & \vdots \\ x_M(1) & x_M(2) & \cdots & x_M(C) \end{bmatrix}$$
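The feature extraction of step S12 can be sketched with NumPy as below. Several details are assumptions rather than statements of the patent: the 160-sample frame is zero-padded to N = 512 points, the natural logarithm is used, and a small floor `eps` guards against taking the logarithm of zero. The function name and the random test frames are hypothetical.

```python
import numpy as np


def log_magnitude_spectrum(frame, n_fft=512, eps=1e-10):
    """Hanning-window a frame and return its log-magnitude spectrum."""
    window = np.hanning(len(frame))
    # rfft zero-pads to n_fft and keeps the n_fft // 2 + 1 non-redundant bins.
    spectrum = np.fft.rfft(frame * window, n=n_fft)
    return np.log(np.abs(spectrum) + eps)


# Five hypothetical narrowband frames of 160 samples each.
rng = np.random.default_rng(0)
frames = rng.standard_normal((5, 160))

# Stack one feature vector per row, as in the input matrix X.
X = np.stack([log_magnitude_spectrum(f) for f in frames])
print(X.shape)  # (5, 257)
```

The same function would be reused unchanged in the usage stage, since step S21 extracts its features exactly as step S12 does.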
Specifically, step S13 corresponds to the "compute network output" module of the training stage in Fig. 3 and proceeds as follows: for each wideband frame obtained in step S11, the all-pole model parameters of the speech spectral tilt are computed. In this embodiment the all-pole model parameters are computed by:
$$a_i = f(s_i(n))$$
where $s_i(n)$ is the i-th frame of the wideband speech signal and $a_i = [a_i(1), a_i(2), \dots, a_i(P)]$ are the all-pole model parameters of its spectral tilt. $P$ is the order of the all-pole model, $a_i(1), a_i(2), \dots, a_i(P)$ are the parameter values of order 1, 2, ..., P, and $P = 20$ in this embodiment. There are several ways to compute the all-pole model parameters $a_i$; $f(s_i(n))$ denotes the function that computes them and is set according to the chosen method. For example, the linear prediction algorithm or other linear prediction algorithms based on a specific perceptual weighting can be used.
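One possible choice of $f$ is plain linear prediction by the autocorrelation method with the Levinson-Durbin recursion, sketched below. This is an illustrative choice, not necessarily the variant used by the inventors. The sketch assumes the sign convention $A(z) = 1 + \sum_k a(k) z^{-k}$, and the function name `lpc_autocorrelation` is hypothetical.

```python
import numpy as np


def lpc_autocorrelation(frame, order):
    """Return ([1, a(1), ..., a(P)], residual energy) via Levinson-Durbin."""
    frame = np.asarray(frame, dtype=float)
    # Autocorrelation at lags 0..order.
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                  for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for m in range(1, order + 1):
        # Reflection coefficient for stage m.
        acc = r[m] + np.dot(a[1:m], r[m - 1:0:-1])
        k = -acc / err
        a_prev = a.copy()
        for j in range(1, m):
            a[j] = a_prev[j] + k * a_prev[m - j]
        a[m] = k
        err *= 1.0 - k * k
    return a, err


# Hypothetical 320-sample wideband frame, Hamming-windowed, order P = 20.
rng = np.random.default_rng(0)
frame = rng.standard_normal(320) * np.hamming(320)
a_i, residual = lpc_autocorrelation(frame, 20)
print(a_i.shape)  # (21,)
```

The autocorrelation method is a common default because it yields a stable all-pole filter, which the conversion to line spectral pairs relies on.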
Then linear spectral is converted to parameter by the all-pole modeling parameter of broadband voice spectral tilt degree.Linear spectral It is the equivalent form of all-pole modeling parameter to parameter, linear spectral has stronger robustness to parameter, at voice signal Reason field is widely applied.
Further, the detailed process of parameter conversion are as follows: by the all-pole modeling parameter of the i-th frame broadband voice spectral tilt degree It is converted to the domain z form, the domain z form are as follows:
Define Ki(z) and Qi(z) the symmetric and anti-symmetric multinomial of the two P+1 ranks:
Ki(z)=Ai(z)+z-(P+1)Ai(z-1)
Qi(z)=Ai(z)-z-(P+1)Ai(z-1)
The domain the Z form of the linear spectral pair of i-th frame broadband voice spectral tilt degree is Ki' (z) and Qi' (z) two multinomial Formula:
Acquire Ki' (z) and QiThe corresponding parameter of ' (z) isWith
The LSP parameters of the spectral tilt of the i-th wideband speech frame are $b_i = [bp_i, bq_i]$. The LSP parameters of each wideband speech frame serve as the training output of the spectral tilt reconstruction network. The output matrix Y of the spectral tilt reconstruction network is:
Specifically, step S14 corresponds to the training-stage module in FIG. 3 that trains the spectral tilt reconstruction network. In detail, the spectral tilt reconstruction network is trained, the perceptual root-mean-square deviation (PRMSD) is defined as the evaluation method, the performance of the network is tested using the speech material in the test set together with this evaluation method, and the optimal reconstruction network parameter model is obtained by tuning.
The perceptual root-mean-square deviation is calculated as:
Here $\hat{y}_i(n)$ is the estimated value of the all-pole model parameters of the spectral tilt of the i-th speech frame and $y_i(n)$ is their true value; $\hat{Y}_i(k)$ is the estimated spectral tilt of the i-th speech frame and $Y_i(k)$ is its true value. $\hat{Y}_i(k)$ and $Y_i(k)$ are each divided into L subbands using the same subband division method: $\hat{Y}_i^j(k)$ denotes the estimated spectral tilt of the j-th subband of the i-th frame, $Y_i^j(k)$ denotes the true spectral tilt of that subband, $D_j$ is the length of the j-th subband, and $b_j$ is the perceptual coefficient used when calculating the perceptual root-mean-square deviation of the j-th subband. $PR_i$ denotes the perceptual root-mean-square deviation (PRMSD) of the spectral tilt of the i-th speech frame.
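The formula itself is not reproduced in this text, so the following Python sketch only illustrates how the quantities defined above could fit together. It assumes, purely for illustration, that the PRMSD of a frame is a weighted sum over subbands of the root-mean-square deviation between the estimated and true spectral tilt, with the subband length as the normaliser and the perceptual coefficient as the weight. The function name, the argument layout, and this exact form are assumptions and may differ from the patent's formula.

```python
import math


def perceptual_rmsd(estimate, truth, band_edges, weights):
    """Weighted sum of per-subband RMS deviations for one frame.

    estimate, truth : spectral tilt values over frequency bins k
    band_edges      : L+1 bin indices delimiting the L subbands
    weights         : perceptual coefficient b_j of each subband
    The combination rule is an assumption of this sketch.
    """
    total = 0.0
    for j, weight in enumerate(weights):
        start, stop = band_edges[j], band_edges[j + 1]
        length = stop - start                         # D_j
        squared = sum((estimate[k] - truth[k]) ** 2
                      for k in range(start, stop))
        total += weight * math.sqrt(squared / length)
    return total
```

With two subbands of two bins each, deviations of 1 and 3, and weights 1.0 and 0.5, this form gives 1.0 * 1 + 0.5 * 3 = 2.5.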
The number of input-layer nodes of the optimal spectral tilt reconstruction network is C, the same as the number of points of the log-magnitude spectrum parameters of each narrowband speech frame in step S12.
In a specific implementation, the activation functions that may be used in the hidden layers of the optimal network parameter model include the Sigmoid, Tanh, and Linear functions. The hidden-layer node counts may be [N/4, N/4, N/8, N/8], [N/8, N/8, N/16, N/16], [N/4, N/4, N/8, N/16], [N/4, N/8, N/8, N/16], or [N/4, N/8, N/16, N/16]. The optimal time step of each hidden layer is determined by parameter tuning.
In this embodiment the hidden layers use the Tanh activation function and the output layer uses the Linear function. The hidden-layer node counts are [N/8, N/8, N/16, N/16], and the number of output-layer nodes of the optimal reconstruction network described in step S14 is P, the same as the order of the all-pole model parameters of the speech spectral tilt. Considering algorithm complexity, P is generally less than or equal to 20.
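The layer sizes of the embodiment can be written down as a small configuration sketch in Python. The class and function names are hypothetical, and the values 512 and 257 in the usage note are example numbers only; the text does not state N or C numerically, so both are passed in by the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReconstructionNetSpec:
    """Shape description of the reconstruction network (hypothetical type)."""
    input_nodes: int          # C, points of the log-magnitude spectrum
    hidden_nodes: tuple       # node count of each hidden layer
    hidden_activation: str
    output_nodes: int         # P, order of the all-pole model
    output_activation: str
    time_steps: int


def embodiment_spec(n_fft, input_points, order=20, time_steps=6):
    """Build the [N/8, N/8, N/16, N/16] layout described for the embodiment."""
    hidden = (n_fft // 8, n_fft // 8, n_fft // 16, n_fft // 16)
    return ReconstructionNetSpec(
        input_nodes=input_points,
        hidden_nodes=hidden,
        hidden_activation="tanh",
        output_nodes=order,
        output_activation="linear",
        time_steps=time_steps,
    )
```

For example, `embodiment_spec(512, 257)` describes hidden layers of 64, 64, 32, and 32 nodes with a 20-node linear output layer.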
In the embodiment, the optimal time step of the hidden layers is determined by parameter tuning. Specifically, using the reconstruction network structure described above, reconstruction networks with different time steps are trained separately; each trained network is tested on the speech data in the validation set and its perceptual root-mean-square deviation is calculated; the time step used by the network with the smallest perceptual root-mean-square deviation is taken as the optimal hidden-layer time step. In this embodiment the time step of each hidden layer is 6.
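The tuning loop described here amounts to a small search over candidate time steps. The Python sketch below assumes a caller-supplied function, here called `validation_prmsd`, that trains a network with a given time step and returns its PRMSD on the validation set; that function and the candidate list are hypothetical.

```python
def select_time_step(candidates, validation_prmsd):
    """Return (best_time_step, scores) for the candidate time steps.

    validation_prmsd(t) is assumed to train a network with time step t
    and return its PRMSD on the validation set (hypothetical callable).
    """
    scores = {t: validation_prmsd(t) for t in candidates}
    best = min(scores, key=scores.get)    # smallest deviation wins
    return best, scores
```

Keeping the full score table alongside the winner makes it easy to check how sharply the deviation depends on the time step.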
After training of the spectral tilt reconstruction network is complete, it can be put into use. In the usage stage the network is embedded at the decoder end of a voice communication system as a post-processing technique, where it processes the real-time speech signal of the actual communication frame by frame.
The usage stage of the spectral tilt reconstruction network is implemented in the following steps:
Step S21: narrowband speech is input frame by frame in real time, and the log-magnitude spectrum parameters of the narrowband speech are extracted.
Step S22: the wideband speech log-magnitude spectrum parameters are input frame by frame, and the all-pole model parameters of the wideband speech spectral tilt are reconstructed by combining the spectral tilt reconstruction network with parameter conversion.
Specifically, step S21 corresponds to the module in FIG. 3 that extracts narrowband speech features. It is implemented as follows: one frame of the narrowband speech signal is input in real time, and its C-point narrowband log-magnitude spectrum parameters are extracted using the same method as step S12 of the training stage of the spectral tilt reconstruction network.
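A minimal Python sketch of extracting a log-magnitude spectrum from one frame is given below. Several details are assumptions made for the example, since the extraction formula is not reproduced in this text: a Hamming window, a natural logarithm of the magnitude, a small floor to avoid log(0), and keeping the N/2 + 1 non-redundant bins of an N-point transform. The function name is hypothetical, and a direct DFT is used only to keep the example free of third-party libraries.

```python
import cmath
import math


def log_magnitude_spectrum(frame, n_fft, floor=1e-10):
    """Log-magnitude spectrum of one frame (n_fft >= len(frame) > 1).

    Window type, log base, floor and bin count are assumptions.
    """
    n = len(frame)
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * t / (n - 1))
              for t in range(n)]
    windowed = [s * w for s, w in zip(frame, window)]
    spectrum = []
    for k in range(n_fft // 2 + 1):
        # Direct DFT of bin k; an FFT would be used in practice.
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n_fft)
                  for t, x in enumerate(windowed))
        spectrum.append(math.log(max(abs(acc), floor)))
    return spectrum
```

For a constant frame of eight ones, bin 0 equals the logarithm of the window sum, which is 3.86 for an eight-point Hamming window.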
Step S22 is implemented as follows: the C-point narrowband log-magnitude spectrum parameters extracted in step S21 are input to the trained optimal spectral tilt reconstruction network, which reconstructs the P-th-order line spectral pair parameters of the wideband speech spectral tilt; finally, the resulting P-th-order line spectral pair parameters are converted into P-th-order all-pole model parameters, which are the wideband speech spectral tilt characteristic parameters reconstructed from the narrowband speech.
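The final conversion from line spectral pair parameters back to all-pole parameters can be sketched in Python as follows. The sketch assumes an even order P, that the two halves of the LSP vector are root angles in (0, pi) of the symmetric and antisymmetric polynomials, and the convention A(z) = 1 + sum of a(k) z^-k; these conventions and the function names are assumptions, not details confirmed by the text.

```python
import math


def _poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists in z^-1."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, x in enumerate(p):
        for j, y in enumerate(q):
            out[i + j] += x * y
    return out


def lsp_to_all_pole(bp, bq):
    """Rebuild [a(1), ..., a(P)] from LSP angles (even P assumed).

    bp: angles of the symmetric polynomial, bq: of the antisymmetric one.
    """
    k_poly = [1.0, 1.0]                   # trivial root at z = -1
    for w in bp:
        k_poly = _poly_mul(k_poly, [1.0, -2.0 * math.cos(w), 1.0])
    q_poly = [1.0, -1.0]                  # trivial root at z = +1
    for w in bq:
        q_poly = _poly_mul(q_poly, [1.0, -2.0 * math.cos(w), 1.0])
    a_full = [0.5 * (x + y) for x, y in zip(k_poly, q_poly)]
    return a_full[1:-1]                   # drop leading 1 and padded zero
```

Because A(z) is half the sum of the two polynomials, the angles arccos(0.85) and arccos(0.05) map back to the second-order parameters [-0.9, 0.2].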
In summary, the present invention provides a method for reconstructing the spectral tilt of a wideband speech signal from a narrowband speech signal. The method is robust, can be applied in any speech intelligibility enhancement system based on spectral tilt features, and is suitable for speech signals in multiple languages and multiple modes. In a specific implementation, computer software technology can be used to run the process automatically.
The above describes only preferred embodiments of the present invention, and the invention is not limited in form by these embodiments. Those skilled in the art should understand that any simple modification, equivalent variation, or alteration made to the above embodiments on the basis of the technical core of the invention falls within the scope of the technical solution claimed by the invention.

Claims (10)

1. A wideband speech spectral tilt characteristic parameter reconstruction method for speech intelligibility enhancement, characterized in that it comprises a training stage and a usage stage of a spectral tilt reconstruction network based on a recurrent neural network,
the training stage of the spectral tilt reconstruction network comprising the following steps,
Step S11: narrowband speech data at a low sampling rate are obtained by downsampling wideband speech data at a high sampling rate, a speech data set is built and divided proportionally into a training set, a test set, and a validation set, and the speech data in the data set are preprocessed, the preprocessing including framing and windowing;
Step S12: the preprocessed narrowband speech training set is input, a short-time Fourier transform is applied to obtain the narrowband speech spectrum, and the spectrum information is converted to the logarithmic domain to obtain the log-magnitude spectrum, which serves as the input of the spectral tilt reconstruction network;
Step S13: the preprocessed wideband speech training set is input, and the all-pole model parameters of the wideband speech spectral tilt are extracted and converted into line spectral pair parameters, which serve as the output of the spectral tilt reconstruction network;
Step S14: the spectral tilt reconstruction network is trained, the perceptual root-mean-square deviation PRMSD is defined as the evaluation method for testing the performance of the network, each evaluation uses the validation set as the evaluation criterion, the optimal reconstruction network parameter model is obtained by tuning, and the final effect is verified on the test set;
in the usage stage of the spectral tilt reconstruction network, the trained neural network is used for frame-by-frame processing of real-time speech in actual communication, comprising the following steps,
Step S21: narrowband speech is input frame by frame in real time, and the log-magnitude spectrum parameters of the narrowband speech are extracted;
Step S22: the wideband speech log-magnitude spectrum parameters are input frame by frame, and the all-pole model parameters of the wideband speech spectral tilt are reconstructed by combining the spectral tilt reconstruction network with parameter conversion.
2. The wideband speech spectral tilt characteristic parameter reconstruction method for speech intelligibility enhancement according to claim 1, characterized in that: the wideband and narrowband speech data include normal speech and anti-noise speech.
3. The wideband speech spectral tilt characteristic parameter reconstruction method for speech intelligibility enhancement according to claim 1, characterized in that: in step S12, the number of points of the short-time Fourier transform is N, and the training input of the spectral tilt reconstruction network is calculated as:
where $S_i(n)$ denotes the i-th narrowband speech frame, n is the speech signal frame length, $x_i(k)$ denotes the value of the log-magnitude spectrum of the i-th speech frame, k is the symbol conventionally used for the frequency-domain variable, and Win denotes a time-domain window function; the number of points of the log-magnitude spectrum of each speech frame is C, and $x_i = [x_i(1), x_i(2), \ldots, x_i(C)]$ is the log-magnitude spectrum of the i-th speech frame; for each frame of the framed narrowband speech data in the speech data set, the log-magnitude spectrum is calculated according to the first formula above and stored row by row in the matrix X, where X denotes the input matrix of the spectral tilt reconstruction network and M is the number of rows of X.
4. The wideband speech spectral tilt characteristic parameter reconstruction method for speech intelligibility enhancement according to claim 1, characterized in that: in step S13, a calculation is performed on the i-th wideband speech frame $s_i(n)$,
obtaining $a_i = [a_i(1), a_i(2), \ldots, a_i(P)]$, the all-pole model parameters of the spectral tilt of the i-th wideband speech frame, where P is the order of the all-pole model parameters.
5. The wideband speech spectral tilt characteristic parameter reconstruction method for speech intelligibility enhancement according to claim 1, characterized in that: the line spectral pair parameters described in step S13 are an equivalent form of the all-pole model parameters, and the line spectral pair parameters are more robust.
6. The wideband speech spectral tilt characteristic parameter reconstruction method for speech intelligibility enhancement according to claim 1, characterized in that: the evaluation method used in step S14 uses the speech data of the validation set and the test set, and is calculated as:
where $\hat{y}_i(n)$ is the estimated value of the all-pole model parameters of the spectral tilt of the i-th speech frame, $y_i(n)$ is their true value, $\hat{Y}_i(k)$ is the estimated spectral tilt of the i-th speech frame, and $Y_i(k)$ is its true value; $\hat{Y}_i(k)$ and $Y_i(k)$ are each divided into L subbands using the same subband division method, $\hat{Y}_i^j(k)$ denotes the estimated spectral tilt of the j-th subband of the i-th speech frame, $Y_i^j(k)$ denotes the true spectral tilt of the j-th subband of the i-th speech frame, $D_j$ denotes the length of the j-th subband, $b_j$ denotes the perceptual coefficient used to calculate the perceptual root-mean-square deviation of the j-th subband, and $PR_i$ denotes the perceptual root-mean-square deviation PRMSD of the spectral tilt of the i-th speech frame.
7. The wideband speech spectral tilt characteristic parameter reconstruction method for speech intelligibility enhancement according to claim 1, characterized in that: the number of input-layer nodes of the optimal reconstruction network parameter model described in step S14 is C, the same as the number of points of the log-magnitude spectrum parameters of each narrowband speech frame in step S12.
8. The wideband speech spectral tilt characteristic parameter reconstruction method for speech intelligibility enhancement according to claim 1, characterized in that: in step S14, the activation function used by the hidden layers of the optimal network parameter model is the Sigmoid function, the Tanh function, or the Linear function; the hidden-layer node counts are [N/4, N/4, N/8, N/8], [N/8, N/8, N/16, N/16], [N/4, N/4, N/8, N/16], [N/4, N/8, N/8, N/16], or [N/4, N/8, N/16, N/16]; and the optimal time step of each hidden layer is determined by parameter tuning.
9. The wideband speech spectral tilt characteristic parameter reconstruction method for speech intelligibility enhancement according to claim 1, characterized in that: in step S14, the number of output-layer nodes of the optimal reconstruction network is P, the same as the order of the all-pole model parameters of the speech spectral tilt.
10. The wideband speech spectral tilt characteristic parameter reconstruction method for speech intelligibility enhancement according to claim 1, 2, 3, 4, 5, 6, 7, 8, or 9, characterized in that:
the method of extracting the narrowband speech log-magnitude spectrum parameters in step S21 of the usage stage of the spectral tilt reconstruction network is the same as that of step S12 of the training stage of the spectral tilt reconstruction network;
the parameter conversion in step S22 of the usage stage of the spectral tilt reconstruction network converts the line spectral pair parameters of the wideband speech spectral tilt reconstructed by the spectral tilt reconstruction network into all-pole model parameters.
CN201811249506.0A 2018-10-25 2018-10-25 Broadband voice frequency spectrum gradient characteristic parameter reconstruction method for voice definition enhancement Active CN109215635B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811249506.0A CN109215635B (en) 2018-10-25 2018-10-25 Broadband voice frequency spectrum gradient characteristic parameter reconstruction method for voice definition enhancement

Publications (2)

Publication Number Publication Date
CN109215635A true CN109215635A (en) 2019-01-15
CN109215635B CN109215635B (en) 2020-08-07

Family

ID=64996332

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811249506.0A Active CN109215635B (en) 2018-10-25 2018-10-25 Broadband voice frequency spectrum gradient characteristic parameter reconstruction method for voice definition enhancement

Country Status (1)

Country Link
CN (1) CN109215635B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5185848A (en) * 1988-12-14 1993-02-09 Hitachi, Ltd. Noise reduction system using neural network
US20060003328A1 (en) * 2002-03-25 2006-01-05 Grossberg Michael D Method and system for enhancing data quality
CN105070293A (en) * 2015-08-31 2015-11-18 武汉大学 Audio bandwidth extension coding and decoding method and device based on deep neutral network
CN107705801A (en) * 2016-08-05 2018-02-16 中国科学院自动化研究所 The training method and Speech bandwidth extension method of Speech bandwidth extension model
CN106710604A (en) * 2016-12-07 2017-05-24 天津大学 Formant enhancement apparatus and method for improving speech intelligibility

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIN JIANG .ETC: "Nonlinear Prediction with Deep Recurrent Neural Networks for Non-Blind Audio Bandwidth Extension", 《CHINA COMMUNICATIONS》 *
郭雷勇等: "用于隐马尔可夫模型语音带宽扩展的激励分段扩展方法", 《计算机应用》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110085245A (en) * 2019-04-09 2019-08-02 武汉大学 A kind of speech intelligibility Enhancement Method based on acoustic feature conversion
CN110085245B (en) * 2019-04-09 2021-06-15 武汉大学 Voice definition enhancing method based on acoustic feature conversion
CN110322891A (en) * 2019-07-03 2019-10-11 南方科技大学 A kind of processing method of voice signal, device, terminal and storage medium
CN110322891B (en) * 2019-07-03 2021-12-10 南方科技大学 Voice signal processing method and device, terminal and storage medium

Also Published As

Publication number Publication date
CN109215635B (en) 2020-08-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant