CN109215635A - Method for reconstructing wideband speech spectral tilt feature parameters for speech intelligibility enhancement - Google Patents
- Publication number
- CN109215635A (application number CN201811249506.0A)
- Authority
- CN
- China
- Prior art keywords
- spectral tilt
- tilt degree
- parameter
- voice
- frame
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Artificial Intelligence (AREA)
- Quality & Reliability (AREA)
- Evolutionary Computation (AREA)
- Telephonic Communication Services (AREA)
Abstract
The present invention provides a method for reconstructing wideband speech spectral tilt feature parameters for speech intelligibility enhancement, comprising a training stage and a use stage of a spectral tilt reconstruction network based on a recurrent neural network. In the training stage, a speech data set is established and its speech data are preprocessed; the preprocessed narrowband speech data are input, a short-time Fourier transform is applied to obtain the narrowband speech spectrum, and the logarithm of the spectrum is taken to obtain the log-magnitude spectrum; the preprocessed wideband speech data are input, and the all-pole model parameters of the wideband speech spectral tilt are extracted and converted to line spectral pair parameters. The spectral tilt reconstruction network is trained and then used to reconstruct the all-pole model parameters of the wideband speech spectral tilt. The invention reconstructs the wideband speech spectral tilt parameters from the narrowband speech signal, is applicable to all speech intelligibility enhancement systems based on spectral tilt features, and can be adapted to multilingual, multimodal speech signals.
Description
Technical field
The present invention provides a method for reconstructing wideband speech spectral tilt feature parameters for speech intelligibility enhancement. It relates to the fields of speech signal processing and communication technology, is applicable to all speech intelligibility enhancement systems based on spectral tilt features, and can be adapted to multilingual, multimodal speech signals.
Background art
Since the beginning of the 21st century, mobile communication technology has developed rapidly, and mobile communication devices such as mobile phones have spread quickly. Thanks to the convenience of mobile phones, people can hold real-time voice calls anywhere and at any time. As a result, calls are inevitably made in noisy environments such as stations, restaurants, and factories, where the ambient noise seriously degrades call quality.
A voice call can be roughly divided into two stages (as shown in Fig. 1). The first is the speaking stage: the speaker talks into the phone, the phone's microphone captures the speech signal, and the signal is encoded and sent into the communication channel as an uplink signal. The second is the listening stage: the phone receives the downlink signal delivered by the communication network, decodes it to regenerate the speech signal, and plays back the decoded speech; once the listener's ear receives the played speech, one transfer of speech information is complete. From the listener's point of view, the process of receiving the downlink signal and listening to the speech is called the near end, and the process in which the speech is produced and the uplink signal is sent is called the far end.
For far-end signal processing, researchers have developed speech enhancement techniques to suppress the ambient noise in the speech signal captured by the microphone. On the software side, speech enhancement uses the time-frequency, acoustic, and linguistic characteristics of speech to filter out energy that does not belong to the speech signal, and reconstructs the speech components missing from the filtered signal. On the hardware side, several dedicated microphones are installed on the phone to capture ambient sound, and the speech signal and the noise picked up by the noise microphones are combined by spectral subtraction or in an adaptive filtering system. With this combination of software and hardware, speech enhancement can remove the noise in the captured speech signal fairly thoroughly while keeping speech distortion very small.
For near-end signal processing, the first strategy researchers considered for suppressing ambient noise during listening was noise cancellation: a microphone captures the ambient noise, and a sound wave with the same frequency and amplitude but opposite phase is emitted so that it interferes destructively with the noise and reduces the ambient noise energy. Active noise-cancelling headphones are the typical product of this strategy: the headphones first block part of the noise by physical isolation, and the residual noise is cancelled by an inverted signal added to the signal played in the headphones. In earpiece listening mode, however, there is no such physical isolation. The ear is directly exposed to high-energy ambient noise, and together with problems such as environmental reverberation and the difficulty of keeping the earpiece aligned with the ear, the noise-cancelling effect drops sharply.
Because noise cancellation fails in earpiece listening mode, researchers proposed near-end listening enhancement to ensure that the speech the listener receives is clear enough. Based on perceptual acoustics, linguistics, and signal processing, it raises the perceived intelligibility of the speech signal and makes the signal more robust, so that the listener understands it more easily under the same noise conditions. Because its goal is to improve the intelligibility of the speech signal, it is also called speech clarity enhancement or speech intelligibility enhancement.
Conventional speech intelligibility enhancement methods fall into two classes: rule-based methods and metric-based methods. Rule-based methods ignore the surrounding noise and modify the time-frequency characteristics of the speech signal only according to fixed adjustment rules; their intelligibility gain varies widely across environments, so their robustness is poor. Metric-based methods compare the speech signal with the current ambient noise using a specific metric and adjust the gain of the speech signal dynamically; they improve intelligibility more noticeably, but they considerably degrade the naturalness and listening comfort of the speech.
Data-driven speech intelligibility enhancement is a newer approach that improves intelligibility by exploiting the speaker's noise-countering production mechanism together with a natural speech generation model. In noisy conditions, speakers under pressure from the noise spontaneously change their speaking style to overcome the ambient noise, and this change markedly improves intelligibility for the listener. This noise-countering production mechanism is called the Lombard effect, and the resulting noise-robust speech is called Lombard speech. Studies show that the spectral tilt of Lombard speech differs greatly in detail from that of normal speech for the same sentence, and that it is also flatter overall. The difference in spectral tilt effectively captures what distinguishes Lombard speech from normal speech, so spectral tilt parameters are used as the key parameters for improving speech intelligibility.
In a data-driven speech intelligibility enhancement system, Lombard speech recorded in different scenarios and the corresponding normal speech recorded in quiet conditions serve as training data for fitting a Lombard-based enhancement system. The system maps the spectral tilt of a normal speech signal to the spectral tilt of Lombard speech and thereby obtains noise-robust Lombard speech. The algorithm block diagram of the system is shown in Fig. 2, and the procedure is as follows. Narrowband normal speech is input and its spectral tilt is extracted; the spectral tilt reconstruction network reconstructs the wideband speech spectral tilt feature parameters $A(z)$; $A(z)$ is input to the spectral tilt mapping model, which maps out the spectral tilt feature parameters $A'(z)$ of wideband noise-robust speech (Lombard speech), where $z$ is the complex variable of the z-domain. A filter then replaces the spectral tilt of the narrowband normal speech with the mapped wideband noise-robust spectral tilt. To keep the total energy of the speech signal unchanged before and after processing, gain control is applied to the filtered speech signal, and finally the noise-robust speech is output.
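The gain-control step described above can be illustrated with a minimal sketch. It assumes a single gain per signal chosen so that the processed signal has the same total energy as the original; the source does not specify the gain rule, and the function name `match_energy` and the test signals are hypothetical.

```python
import numpy as np

def match_energy(processed, reference, eps=1e-12):
    """Scale `processed` so that its total energy equals that of `reference`.

    Assumption: one global gain per signal; the source does not state the rule.
    """
    processed = np.asarray(processed, dtype=float)
    reference = np.asarray(reference, dtype=float)
    e_ref = np.sum(reference ** 2)       # energy before processing
    e_proc = np.sum(processed ** 2)      # energy after filtering
    gain = np.sqrt(e_ref / (e_proc + eps))
    return gain * processed

# Illustrative data: a "filtered" signal that is louder than the original.
rng = np.random.default_rng(0)
original = rng.standard_normal(320)
filtered = 2.5 * original + 0.1 * rng.standard_normal(320)
out = match_energy(filtered, original)
```

A per-frame gain with smoothing across frames would be an equally plausible design for real-time use; the global gain is used here only because it is the simplest form that satisfies the constant-energy requirement.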
Data-driven algorithms can train the mapping model with machine learning methods such as Gaussian process regression, Gaussian mixture models, and deep neural networks. The mapping model demands highly accurate spectral tilt information at its input. In real voice communication, however, narrowband signals lack part of the acoustic features, so spectral tilt parameters computed directly from the narrowband signal deviate considerably from those of the wideband speech signal. The enhancement system therefore cannot obtain accurate spectral tilt information, and its enhancement performance drops sharply. The present invention proposes a method for reconstructing wideband speech spectral tilt feature parameters for speech intelligibility enhancement; the reconstructed feature parameters can be used in all speech intelligibility enhancement systems based on spectral tilt parameters.
Summary of the invention
By providing a method for reconstructing wideband speech spectral tilt feature parameters for speech intelligibility enhancement, the present invention solves the following problem: because narrowband speech signals lack part of the acoustic features, spectral tilt parameters computed directly from them deviate considerably from those of the wideband speech signal, so the speech intelligibility enhancement system cannot obtain accurate spectral tilt information and its enhancement performance drops sharply.
The technical solution of the present invention provides a method for reconstructing wideband speech spectral tilt feature parameters for speech intelligibility enhancement, comprising a training stage and a use stage of a spectral tilt reconstruction network based on a recurrent neural network.
The training stage of the spectral tilt reconstruction network comprises the following steps:
Step S11, obtain low-sampling-rate narrowband speech data by downsampling high-sampling-rate wideband speech data, establish a speech data set, divide it proportionally into a training set, a validation set, and a test set, and preprocess the speech data in the data set; the preprocessing includes framing and windowing;
Step S12, input the preprocessed narrowband speech data of the training set, apply a short-time Fourier transform to obtain the narrowband speech spectrum, and take the logarithm of the spectrum to obtain the log-magnitude spectrum, which serves as the input of the spectral tilt reconstruction network;
Step S13, input the preprocessed wideband speech data of the training set, extract the all-pole model parameters of the wideband speech spectral tilt, and convert them to line spectral pair parameters, which serve as the output of the spectral tilt reconstruction network;
Step S14, train the spectral tilt reconstruction network; define the perceptual root-mean-square deviation (PRMSD) as the evaluation method for testing the performance of the network; use the validation set as the evaluation criterion in each evaluation; tune the network to obtain the optimal reconstruction network parameter model; and verify the final performance on the test set.
In the use stage of the spectral tilt reconstruction network, the trained neural network is put into actual communication and processes the speech signal frame by frame in real time:
Step S21, input narrowband speech frame by frame in real time and extract the log-magnitude spectrum parameters of the narrowband speech;
Step S22, input the narrowband speech log-magnitude spectrum parameters frame by frame and, using the spectral tilt reconstruction network together with parameter conversion, reconstruct the all-pole model parameters of the wideband speech spectral tilt.
Moreover, the wideband and narrowband speech data include normal speech and noise-robust speech.
Moreover, in step S12 the short-time Fourier transform has $N$ points, and the training input of the spectral tilt reconstruction network is calculated with the following formula:
$s_i(n)$ denotes the i-th frame of the narrowband speech signal, $n$ is the speech signal frame length, $x_i(k)$ denotes the value of the log-magnitude spectrum of the i-th frame, $k$ is the frequency-domain variable, and $W_{in}$ denotes a time-domain window function.
The log-magnitude spectrum of each speech frame has $C = N/2 + 1$ points, and $x_i = [x_i(1), x_i(2), \ldots, x_i(C)]$ is the log-magnitude spectrum of the i-th frame. For every frame of the framed narrowband speech data in the speech data set, the log-magnitude spectrum is calculated with the first formula above and stored row by row in the matrix $X$, where $X$ is the input matrix of the spectral tilt reconstruction network and $M$ is the number of rows of $X$.
Moreover, in step S13 the calculation on the i-th frame of the wideband speech signal $s_i(n)$ yields $a_i = [a_i(1), a_i(2), \ldots, a_i(P)]$, the all-pole model parameters of the spectral tilt of the i-th wideband speech frame, where $P$ is the order of the all-pole model.
Moreover, the line spectral pair parameters in step S13 are an equivalent form of the all-pole model parameters and are more robust.
Moreover, the evaluation method used in step S14 operates on the speech data of the validation set and the test set, and its calculation formula is as follows:
$\hat{y}_i(n)$ is the estimated value of the all-pole model parameters of the spectral tilt of the i-th speech frame, and $y_i(n)$ is their true value; $\hat{Y}_i(k)$ is the estimated spectral tilt of the i-th speech frame, and $Y_i(k)$ is its true value. $\hat{Y}_i(k)$ and $Y_i(k)$ are each divided into $L$ subbands using the same subband division method, giving the estimated and the true spectral tilt of the j-th subband of the i-th speech frame. $D_j$ denotes the length of the j-th subband, $b_j$ denotes the perceptual coefficient used in calculating the perceptual root-mean-square deviation of the j-th subband, and $PR_i$ denotes the perceptual root-mean-square deviation (PRMSD) of the spectral tilt of the i-th speech frame.
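The PRMSD formula itself is not reproduced in this text, so the following is only one plausible reading of the quantities defined above, not the patent's formula: a root-mean-square deviation is computed in each of the L subbands (over its D_j bins) and the subband values are combined with the perceptual coefficients b_j. The weighted-sum form, the function name `prmsd_frame`, the band edges, and the weights are all assumptions made for illustration.

```python
import numpy as np

def prmsd_frame(y_est, y_true, band_edges, weights):
    """Hypothetical per-frame PRMSD: weighted sum of per-subband RMS deviations.

    ASSUMPTION: the combination rule is a guess; the source formula is missing.
    band_edges has L + 1 entries; subband j covers bins band_edges[j]:band_edges[j+1].
    """
    y_est = np.asarray(y_est, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    total = 0.0
    for j, b_j in enumerate(weights):
        lo, hi = band_edges[j], band_edges[j + 1]
        diff = y_est[lo:hi] - y_true[lo:hi]
        total += b_j * np.sqrt(np.mean(diff ** 2))  # mean over D_j = hi - lo bins
    return total

# Illustrative data: an estimate that is off by exactly 1 in every bin.
y_true = np.linspace(0.0, -20.0, 257)
y_est = y_true + 1.0
band_edges = [0, 64, 128, 257]   # L = 3 hypothetical subbands
weights = [0.5, 0.3, 0.2]        # hypothetical perceptual coefficients b_j
pr = prmsd_frame(y_est, y_true, band_edges, weights)
```

Averaging such per-frame values over the validation set would give a single score for comparing candidate network configurations.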
Moreover, the number of input-layer nodes of the optimal reconstruction network parameter model in step S14 is $N/2 + 1$, the same as the number of points of the log-magnitude spectrum parameters of each narrowband speech frame in step S12.
Moreover, in step S14 the activation function used by the hidden layers of the optimal network parameter model is the Sigmoid, Tanh, or Linear function; the hidden-layer node configuration is [N/4, N/4, N/8, N/8], [N/8, N/8, N/16, N/16], [N/4, N/4, N/8, N/16], [N/4, N/8, N/8, N/16], or [N/4, N/8, N/16, N/16]; and the optimal time step of each hidden layer is determined by parameter testing.
Moreover, in step S14 the number of output-layer nodes of the optimal reconstruction network is $P$, the same as the order of the all-pole model parameters of the speech spectral tilt.
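The layer dimensions listed above can be made concrete with a small forward-pass sketch. It assumes plain Elman-style recurrent layers with Tanh activation, a linear output layer, N = 512, P = 20, the first listed hidden configuration, and random untrained weights; the source only states that the network is a recurrent neural network, so the cell type and every name below are hypothetical.

```python
import numpy as np

def init_rnn(layer_sizes, seed=0):
    """Random weights for stacked recurrent layers plus a linear output layer."""
    rng = np.random.default_rng(seed)
    params = []
    # layer_sizes = [input, hidden..., output]; pair up consecutive hidden layers.
    for n_in, n_out in zip(layer_sizes[:-2], layer_sizes[1:-1]):
        params.append((rng.standard_normal((n_in, n_out)) * 0.05,   # input weights
                       rng.standard_normal((n_out, n_out)) * 0.05,  # recurrent weights
                       np.zeros(n_out)))                            # bias
    w_out = rng.standard_normal((layer_sizes[-2], layer_sizes[-1])) * 0.05
    return params, w_out

def forward(x_seq, params, w_out):
    """Run a (T, input_dim) sequence through the stacked layers, frame by frame."""
    h_seq = x_seq
    for w_in, w_rec, b in params:
        h = np.zeros(w_rec.shape[0])
        outputs = []
        for x_t in h_seq:
            h = np.tanh(x_t @ w_in + h @ w_rec + b)  # Elman update (assumed cell type)
            outputs.append(h)
        h_seq = np.stack(outputs)
    return h_seq @ w_out  # linear output: P values per frame

N, P = 512, 20
sizes = [N // 2 + 1, N // 4, N // 4, N // 8, N // 8, P]
params, w_out = init_rnn(sizes)
lsp_pred = forward(np.zeros((10, sizes[0])), params, w_out)  # 10 frames in, 10 x P out
```

With N = 512 this gives 257 input nodes, hidden layers of 128, 128, 64, and 64 nodes, and 20 output nodes.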
Moreover, the method for extracting the narrowband speech log-magnitude spectrum parameters in step S21 of the use stage is the same as in step S12 of the training stage; the parameter conversion in step S22 of the use stage converts the line spectral pair parameters of the wideband speech spectral tilt reconstructed by the network into all-pole model parameters.
The present invention reconstructs wideband speech spectral tilt information from the log-magnitude spectrum of narrowband speech. This spectral tilt information is applicable to all speech intelligibility enhancement systems based on spectral tilt, can be adapted to multilingual, multimodal speech signals, and improves the extensibility and practicality of speech intelligibility enhancement systems.
Brief description of the drawings
Fig. 1 is a schematic diagram of the voice communication flow in a noisy scenario according to an embodiment of the present invention;
Fig. 2 is a block diagram of the speech intelligibility enhancement system based on spectral tilt features according to an embodiment of the present invention;
Fig. 3 is a flowchart of the method for reconstructing wideband speech spectral tilt feature parameters for speech intelligibility enhancement according to an embodiment of the present invention.
Detailed description of the embodiments
The embodiments of the present invention are described in further detail below with reference to the accompanying drawings. Clearly, the embodiments described here are only some, not all, of the embodiments of the invention. All other embodiments that a person skilled in the art obtains from these embodiments without creative effort fall within the scope of protection of the present application.
The present invention applies to the speech intelligibility enhancement system of a real-time voice communication system. That system improves speech intelligibility based on the speaker's noise-countering production mechanism (the Lombard effect) and a natural speech generation model. The invention provides a method for recovering speech feature parameters in such a system, namely a method for reconstructing wideband speech spectral tilt parameters from narrowband speech.
The invention is further described below with reference to the drawings and embodiments, which do not limit the invention.
To address the problems of the prior art, the embodiment proposes a method for reconstructing wideband speech spectral tilt feature parameters from narrowband speech, applicable to speech intelligibility enhancement systems based on spectral tilt features; the system block diagram is shown in Fig. 2.
The embodiment comprises a training stage and a use stage of a spectral tilt reconstruction network based on a recurrent neural network (Recurrent Neural Network, RNN), as shown in Fig. 3.
Training stage: the narrowband speech log-magnitude spectrum parameters and the wideband speech line spectral pair parameters are extracted from the training set and used as the input and output, respectively, for training the spectral tilt reconstruction network; the network is trained and tuned to obtain the optimal parameter model. Use stage: the narrowband speech log-magnitude spectrum parameters are fed frame by frame into the spectral tilt reconstruction network, which reconstructs the line spectral pair parameters of the wideband speech spectral tilt; these are then converted into the all-pole model parameters of the wideband speech spectral tilt.
The training stage of the spectral tilt reconstruction network comprises the following implementation steps:
Step S11: establish a speech data set, divide it proportionally into a training set, a validation set, and a test set, and preprocess the speech data in the data set by framing, windowing with a Hamming window, and so on;
Step S12: input the preprocessed narrowband speech data of the training set, apply a short-time Fourier transform to obtain the narrowband speech spectrum, and take the logarithm of the spectrum to obtain the log-magnitude spectrum, which serves as the input of the spectral tilt reconstruction network;
Step S13: input the preprocessed wideband speech data of the training set, extract the all-pole model parameters of the wideband speech spectral tilt, and convert them to line spectral pair parameters, which serve as the output of the spectral tilt reconstruction network;
Step S14: train the spectral tilt reconstruction network; define the perceptual root-mean-square deviation (Perceptual Root-Mean-Square Deviation, PRMSD) as the evaluation method for testing the performance of the network; use the validation set as the evaluation criterion in each evaluation; tune the network to obtain the optimal reconstruction network parameter model; and verify the final performance on the test set.
Specifically, step S11 proceeds as follows: the high-sampling-rate wideband speech data are downsampled to obtain low-sampling-rate narrowband speech data, and the speech data set is established. The sampling rate of the wideband speech data is typically 16000 Hz, 48000 Hz, or similar, and that of the narrowband speech data is typically 8000 Hz, 6000 Hz, or similar.
In this embodiment the sampling rate of the wideband speech data is 16000 Hz and that of the narrowband speech data is 8000 Hz; the corresponding narrowband and wideband speech data include normal speech and noise-robust speech with the same text content.
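The downsampling from 16000 Hz to 8000 Hz can be sketched as follows. The source does not name a resampling method, so this example assumes a windowed-sinc low-pass filter followed by keeping every second sample; the function name `downsample_by_2`, the filter length, and the two-tone test signal are hypothetical.

```python
import numpy as np

def downsample_by_2(x, num_taps=101):
    """Low-pass at a quarter of the input rate, then keep every second sample.

    Assumption: windowed-sinc FIR design; the source does not specify the filter.
    """
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = 0.5 * np.sinc(0.5 * n) * np.hamming(num_taps)  # cutoff = fs/4, the new Nyquist
    h /= np.sum(h)                                      # unity gain at DC
    y = np.convolve(x, h, mode="same")                  # anti-aliasing filter
    return y[::2]                                       # decimate by 2

fs_wide = 16000
t = np.arange(fs_wide) / fs_wide                        # one second of signal
# 1 kHz survives downsampling; 6 kHz lies above the new 4 kHz Nyquist limit.
wideband = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 6000 * t)
narrowband = downsample_by_2(wideband)
```

Without the low-pass filter, the 6 kHz component would alias to 2 kHz in the 8000 Hz signal, which is why the filter is applied before decimation.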
The narrowband and wideband speech input in Fig. 3 both come from the speech data set established in step S11. The data set is divided into a training set, a validation set, and a test set in proportions of 85%, 7.5%, and 7.5%. The narrowband and wideband speech data in the training and test sets are framed, and in this embodiment a Hamming window is used for windowing.
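The 85% / 7.5% / 7.5% division can be sketched as a random split over utterance indices. The source does not say whether the split is random or at what granularity it is made, so the shuffling, the seed, and the function name `split_dataset` are assumptions.

```python
import numpy as np

def split_dataset(num_utterances, ratios=(0.85, 0.075, 0.075), seed=0):
    """Return index arrays for the training, validation, and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(num_utterances)          # assumed: random assignment
    n_train = int(round(ratios[0] * num_utterances))
    n_val = int(round(ratios[1] * num_utterances))
    train = order[:n_train]
    val = order[n_train:n_train + n_val]
    test = order[n_train + n_val:]                   # remainder goes to the test set
    return train, val, test

train_idx, val_idx, test_idx = split_dataset(1000)
```

Splitting by utterance rather than by frame keeps overlapping frames of the same utterance from appearing in both the training and the evaluation sets.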
The wideband and narrowband speech data include normal speech and noise-robust speech (Lombard speech).
Lombard speech is the noise-robust speech that people produce in a noisy environment when, under pressure from the ambient noise, they spontaneously change their speaking style. Lombard speech is more intelligible than normal speech. Preferably, the narrowband and wideband speech data are framed as follows: each speech frame is 20 milliseconds long and overlaps the previous frame by 50%. Because the narrowband and wideband speech have different sampling rates, their frame lengths in samples differ: in this embodiment the frame lengths of the narrowband and wideband speech signals are 160 and 320 samples, respectively.
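The framing settings can be checked with a short sketch: at 20 ms per frame, an 8000 Hz signal gives 160 samples per frame and a 16000 Hz signal gives 320, and a 50% overlap means the hop is half the frame length. The function name `frame_signal` and the one-second zero signals are hypothetical, and the sketch assumes the signal is at least one frame long.

```python
import numpy as np

def frame_signal(x, fs, frame_ms=20, overlap=0.5):
    """Split a 1-D signal into overlapping frames (one frame per row)."""
    frame_len = int(fs * frame_ms / 1000)        # samples per frame
    hop = int(frame_len * (1 - overlap))         # frame shift in samples
    num_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(num_frames)[:, None]
    return x[idx]

narrow = np.zeros(8000)    # one second at 8000 Hz
wide = np.zeros(16000)     # one second at 16000 Hz
narrow_frames = frame_signal(narrow, 8000)
wide_frames = frame_signal(wide, 16000)
```

Because both signals cover the same time span with the same frame duration and overlap, they yield the same number of frames, so each narrowband frame has a time-aligned wideband frame to serve as its training target.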
Specifically, step S12 corresponds to the "compute network input" module of the training stage in Fig. 3 and proceeds as follows: each narrowband speech frame obtained in step S11 is input and an N-point short-time Fourier transform is applied. Possible values of $N$ are 1024, 512, 256, and so on; in this embodiment $N$ is preferably 512. The log-magnitude spectrum of each narrowband speech frame is then calculated with the following formula:
$s_i(n)$ denotes the i-th frame of the narrowband speech signal; $n$ is the speech signal frame length, with a value of 160; $x_i(k)$ denotes the value of the log-magnitude spectrum of the i-th narrowband speech frame; $k$ is the frequency-domain variable; $M$ is the total number of frames of the input training samples; and $W_{in}$ denotes a time-domain window function. This embodiment applies a Hann window to each speech frame; the Hamming window and the sine window are alternatives. The number of points of the log-magnitude spectrum taken for each speech frame is $C = N/2 + 1$, so $C$ is 257 in this embodiment.
The number of log-magnitude spectrum points of each speech frame is $C = N/2 + 1$, and $x_i = [x_i(1), x_i(2), \dots, x_i(C)]$ is the log-magnitude spectrum of the i-th speech frame. For every frame of the framed narrowband speech data in the speech data set, the log-magnitude spectrum is calculated according to the first formula above and stored row by row in the matrix X. X is the input matrix of the spectral tilt reconstruction network, and M is the number of rows of X, that is, the total number of frames of the input training samples (the framed narrowband speech data in all training sets).
In the present embodiment, the 257 log-magnitude spectrum parameters of each narrowband speech frame serve as the training input of the spectral tilt reconstruction network. The input matrix X of the spectral tilt reconstruction network is:
$$X = \begin{bmatrix} x_1(1) & x_1(2) & \cdots & x_1(C) \\ x_2(1) & x_2(2) & \cdots & x_2(C) \\ \vdots & \vdots & \ddots & \vdots \\ x_M(1) & x_M(2) & \cdots & x_M(C) \end{bmatrix}$$
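The feature extraction of step S12 can be sketched as follows. This is a minimal illustration in standard-library Python, not the embodiment's implementation: the function names, the natural logarithm, the small constant `eps`, and the direct (non-fast) DFT are all assumptions made for readability.

```python
import cmath
import math


def hann_window(length):
    """Symmetric Hann (Hanning) window of the given length."""
    if length < 2:
        return [1.0] * length
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def log_magnitude_spectrum(frame, n_fft=512, eps=1e-10):
    """Return the C = n_fft // 2 + 1 log-magnitude values of one windowed frame."""
    windowed = [s * w for s, w in zip(frame, hann_window(len(frame)))]
    spectrum = []
    for k in range(n_fft // 2 + 1):
        # Direct DFT; samples beyond the frame are zero-padding and add nothing.
        acc = sum(v * cmath.exp(-2j * math.pi * k * n / n_fft)
                  for n, v in enumerate(windowed))
        spectrum.append(math.log(abs(acc) + eps))  # eps (assumption) avoids log(0)
    return spectrum


def build_input_matrix(frames, n_fft=512):
    """Stack the per-frame spectra row by row, giving an M x C matrix X."""
    return [log_magnitude_spectrum(frame, n_fft) for frame in frames]
```

With 160-sample frames and `n_fft=512`, each row of the returned matrix has 257 entries, matching the value of C stated for the embodiment. A production implementation would normally use a fast Fourier transform rather than the direct sum shown here.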
Specifically, step S13 corresponds to the training-stage module of Fig. 3 that computes the network output. In detail: for each wideband speech frame obtained in step S11, the all-pole model parameters of the speech spectral tilt are calculated. The all-pole model parameters are calculated in the present embodiment as:
$a_i = f(s_i(n))$
where $s_i(n)$ is the i-th wideband speech frame and $a_i = [a_i(1), a_i(2), \dots, a_i(P)]$ are the all-pole model parameters of the spectral tilt of the i-th wideband speech frame. P is the order of the all-pole model, and $a_i(1), a_i(2), \dots, a_i(P)$ are the all-pole model parameter values of orders 1, 2, ..., P, respectively; P = 20 in the present embodiment. The all-pole model parameters $a_i$ can be calculated in several ways; $f(s_i(n))$ denotes the function that calculates $a_i$ and is set according to the chosen calculation method. For example, the linear prediction algorithm, or other linear prediction algorithms based on a specific perceptual weighting, can be used.
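As one possible choice of $f$, the plain linear prediction algorithm mentioned above can be sketched with the autocorrelation method and the Levinson-Durbin recursion. The sign convention $A(z) = 1 + \sum_p a(p) z^{-p}$, the tiny regularisation of the zero-lag term, and the function names are assumptions of this sketch; the text does not state which variant the embodiment uses.

```python
def autocorrelation(frame, max_lag):
    """Biased autocorrelation r[0..max_lag] of one (already windowed) frame."""
    n = len(frame)
    return [sum(frame[t] * frame[t + lag] for t in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Return [a(1), ..., a(order)] for A(z) = 1 + sum_p a(p) z^-p."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        if err <= 0.0:          # silent frame: leave remaining coefficients at zero
            break
        acc = r[m] + sum(a[j] * r[m - j] for j in range(1, m))
        k = -acc / err          # reflection coefficient of stage m
        prev = a[:]
        for j in range(1, m):
            a[j] = prev[j] + k * prev[m - j]
        a[m] = k
        err *= 1.0 - k * k      # prediction error shrinks at every stage
    return a[1:]


def all_pole_parameters(frame, order=20):
    """One hypothetical f(s_i(n)): order-P linear prediction of a wideband frame."""
    r = autocorrelation(frame, order)
    r[0] *= 1.0 + 1e-9          # assumption: slight regularisation for stability
    return levinson_durbin(r, order)
```

Calling `all_pole_parameters(frame, order=20)` on a 320-sample wideband frame returns the 20 parameters $a_i(1), \dots, a_i(20)$ under the stated convention.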
The all-pole model parameters of the wideband speech spectral tilt are then converted to line spectral pair (LSP) parameters. Line spectral pair parameters are an equivalent representation of the all-pole model parameters; they are more robust and are widely used in the field of speech signal processing.
Further, the parameter conversion proceeds as follows. The all-pole model parameters of the spectral tilt of the i-th wideband speech frame are first converted to their z-domain form $A_i(z)$. Two polynomials of order P+1, the symmetric $K_i(z)$ and the antisymmetric $Q_i(z)$, are then defined:
$K_i(z) = A_i(z) + z^{-(P+1)} A_i(z^{-1})$
$Q_i(z) = A_i(z) - z^{-(P+1)} A_i(z^{-1})$
The z-domain form of the line spectral pair of the spectral tilt of the i-th wideband speech frame consists of two polynomials, $K_i'(z)$ and $Q_i'(z)$. The parameters corresponding to $K_i'(z)$ and $Q_i'(z)$ are obtained as $b_{pi}$ and $b_{qi}$, respectively.
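The expressions for $A_i(z)$, $K_i'(z)$ and $Q_i'(z)$ are not reproduced in this text. Under the common convention shown in the first line below (the sign in front of the sum is an assumption) and for even P, as in this embodiment, the standard forms are:

```latex
A_i(z) = 1 + \sum_{p=1}^{P} a_i(p)\, z^{-p}

K_i'(z) = \frac{K_i(z)}{1 + z^{-1}}, \qquad
Q_i'(z) = \frac{Q_i(z)}{1 - z^{-1}}
```

For even P, $K_i(z)$ always has a root at $z = -1$ and $Q_i(z)$ always has a root at $z = 1$. Dividing these out leaves two polynomials of order P whose roots, when $A_i(z)$ is minimum-phase, lie on the unit circle in conjugate pairs; the angles of those roots are the usual line spectral pair parameters.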
The line spectral pair parameters of the spectral tilt of the i-th wideband speech frame are $b_i = [b_{pi}, b_{qi}]$. The line spectral pair parameters of the spectral tilt of each wideband speech frame serve as the training output of the spectral tilt reconstruction network. The output matrix Y of the spectral tilt reconstruction network is:
$$Y = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_M \end{bmatrix}$$
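The conversion from all-pole parameters to line spectral pair parameters can be sketched as below. The sketch assumes the convention $A(z) = 1 + \sum_p a(p) z^{-p}$, an even order P, and that the LSP parameters are the root angles of $K'(z)$ and $Q'(z)$ in $(0, \pi)$; the grid-plus-bisection root search and all function names are illustrative choices, not taken from the patent.

```python
import math


def lsp_polynomials(a):
    """Return the coefficient lists of K'(z) and Q'(z) in powers of z^-1."""
    if len(a) % 2:
        raise ValueError("this sketch assumes an even order P")
    fwd = [1.0] + list(a) + [0.0]          # A(z), padded to degree P+1
    rev = fwd[::-1]                        # z^-(P+1) * A(1/z)
    k_poly = [f + r for f, r in zip(fwd, rev)]
    q_poly = [f - r for f, r in zip(fwd, rev)]
    k_red, q_red, ck, cq = [], [], 0.0, 0.0
    for kc, qc in zip(k_poly[:-1], q_poly[:-1]):
        ck = kc - ck                       # synthetic division by (1 + z^-1)
        cq = qc + cq                       # synthetic division by (1 - z^-1)
        k_red.append(ck)
        q_red.append(cq)
    return k_red, q_red


def _real_response(coeffs, omega):
    """Value of a symmetric polynomial on the unit circle, linear phase removed."""
    half = (len(coeffs) - 1) // 2
    return coeffs[half] + sum(2.0 * coeffs[m] * math.cos((half - m) * omega)
                              for m in range(half))


def _roots_on_circle(coeffs, grid=2048):
    """Root angles in (0, pi); roots closer together than pi/grid may be missed."""
    roots = []
    prev_w, prev_v = 0.0, _real_response(coeffs, 0.0)
    for i in range(1, grid + 1):
        w = math.pi * i / grid
        v = _real_response(coeffs, w)
        if v == 0.0 and i < grid:
            roots.append(w)
        elif prev_v * v < 0.0:
            lo, hi, f_lo = prev_w, w, prev_v
            for _ in range(50):            # bisection inside the bracketing interval
                mid = 0.5 * (lo + hi)
                f_mid = _real_response(coeffs, mid)
                if f_lo * f_mid <= 0.0:
                    hi = mid
                else:
                    lo, f_lo = mid, f_mid
            roots.append(0.5 * (lo + hi))
        prev_w, prev_v = w, v
    return roots


def lpc_to_lsp(a):
    """Return (angles of K' roots, angles of Q' roots) for a = [a(1..P)]."""
    k_red, q_red = lsp_polynomials(a)
    return _roots_on_circle(k_red), _roots_on_circle(q_red)
```

For a minimum-phase model of order P = 20, each returned list holds P/2 = 10 angles, and the two lists interlace. How the two lists map onto $b_{pi}$ and $b_{qi}$ is an assumption here.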
Specifically, step S14 corresponds to the training-stage module of Fig. 3 that trains the spectral tilt reconstruction network. In detail: the spectral tilt reconstruction network is trained, the perceptual root-mean-square deviation is defined as the evaluation method, the performance of the spectral tilt network is tested using the speech material in the test set together with this evaluation method, and the optimal reconstruction network parameter model is obtained by tuning.
In the calculation formula of the perceptual root-mean-square deviation used as the evaluation method, $\hat{y}_i(n)$ is the estimated value of the all-pole model parameters of the spectral tilt of the i-th speech frame, and $y_i(n)$ is their true value; $\hat{Y}_i(k)$ is the estimated value of the spectral tilt of the i-th speech frame, and $Y_i(k)$ is its true value. $\hat{Y}_i(k)$ and $Y_i(k)$ are each divided into L subbands using the same subband division method: $\hat{Y}_i^j(k)$ denotes the estimated spectral tilt of the j-th subband of the i-th speech frame, and $Y_i^j(k)$ denotes the true spectral tilt of the j-th subband of the i-th speech frame; $D_j$ denotes the length of the j-th subband; and $b_j$ denotes the perceptual coefficient used when calculating the perceptual root-mean-square deviation of the j-th subband. $PR_i$ denotes the perceptual root-mean-square deviation (PRMSD) of the spectral tilt of the i-th speech frame.
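The PRMSD expression itself is not reproduced in this text. As a minimal sketch under stated assumptions, the code below takes one plausible reading of the definitions above: a root-mean-square deviation per subband, weighted by $b_j$ and summed over the L subbands. The aggregation rule, the helper `envelope_db`, the dB scale, and all function and parameter names are hypothetical and should not be read as the patented formula.

```python
import cmath
import math


def envelope_db(a, n_bins=257):
    """Hypothetical helper: dB envelope of 1/A(e^{jw}) on n_bins points in [0, pi].

    Assumes A(z) = 1 + sum_p a(p) z^-p with a = [a(1), ..., a(P)].
    """
    env = []
    for k in range(n_bins):
        w = math.pi * k / (n_bins - 1)
        denom = 1.0 + sum(c * cmath.exp(-1j * w * (p + 1))
                          for p, c in enumerate(a))
        env.append(-20.0 * math.log10(max(abs(denom), 1e-12)))
    return env


def prmsd(est_env, true_env, band_edges, weights):
    """Weighted sum of per-subband RMS deviations (assumed form of PR_i).

    band_edges: list of (start, stop) bin indices, one pair per subband.
    weights:    list of perceptual coefficients b_j, one per subband.
    """
    total = 0.0
    for (start, stop), b_j in zip(band_edges, weights):
        d_j = stop - start                      # subband length D_j
        mse = sum((est_env[k] - true_env[k]) ** 2
                  for k in range(start, stop)) / d_j
        total += b_j * math.sqrt(mse)
    return total
```

Under this reading, a frame whose estimated envelope equals the true envelope scores 0, and larger perceptual coefficients make errors in the corresponding subbands count for more.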
The number of input-layer nodes of the optimal spectral tilt reconstruction network is C, the same as the number of log-magnitude spectrum points of each narrowband speech frame in step S12.
In a specific implementation, the activation functions that can be used in the hidden layers of the optimal network parameter model include the Sigmoid function, the Tanh function, the Linear function, and so on. The hidden-layer node configuration may be [N/4, N/4, N/8, N/8], [N/8, N/8, N/16, N/16], [N/4, N/4, N/8, N/16], [N/4, N/8, N/8, N/16] or [N/4, N/8, N/16, N/16], and the optimal time step of each hidden layer is determined by parameter tuning.
In the present embodiment the hidden layers use the Tanh function as the activation function and the output layer uses the Linear function. The hidden-layer node counts are [N/8, N/8, N/16, N/16], and the number of output-layer nodes of the optimal reconstruction network described in step S14 is P, equal to the order of the all-pole model of the speech spectral tilt. Taking algorithm complexity into account, the value of P is generally less than or equal to 20.
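As a worked example of the sizes above, with N = 512 and P = 20 the embodiment's configuration resolves to 257 input nodes, hidden layers of 64, 64, 32 and 32 nodes, and 20 output nodes. The small helper below only performs this arithmetic; its name and dictionary keys are illustrative assumptions and it does not build a network.

```python
def network_layout(n_fft=512, divisors=(8, 8, 16, 16), order=20):
    """Resolve the layer sizes described in the text for a given N and P."""
    return {
        "input_nodes": n_fft // 2 + 1,                 # C = N/2 + 1
        "hidden_nodes": [n_fft // d for d in divisors],  # [N/8, N/8, N/16, N/16]
        "hidden_activation": "tanh",
        "output_nodes": order,                         # P
        "output_activation": "linear",
    }
```

Passing a different `divisors` tuple, for example `(4, 4, 8, 8)`, gives the sizes for the other candidate configurations listed above.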
In the embodiment, the optimal time step of the hidden layers is determined by parameter tuning. The specific tuning procedure is as follows: using the reconstruction network structure described above, reconstruction networks with different time steps are trained separately; each trained network is then tested with the speech data in the validation set, and the perceptual root-mean-square deviation of the reconstruction network for each time step is calculated. The time step used by the reconstruction network with the smallest perceptual root-mean-square deviation is the optimal hidden-layer time step. In the present embodiment the time step of each hidden layer is 6.
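The tuning procedure amounts to a search over candidate time steps for the smallest validation PRMSD. A minimal sketch follows; `train` and `mean_prmsd` are hypothetical callables standing in for the actual training routine and the validation-set evaluation, neither of which is specified in code form in the text.

```python
def select_time_step(candidates, train, mean_prmsd):
    """Return (best_time_step, best_score) over the candidate time steps.

    train(step)       -> a trained reconstruction network (hypothetical callable)
    mean_prmsd(model) -> mean PRMSD of that network on the validation set
    """
    best_step, best_score = None, float("inf")
    for step in candidates:
        model = train(step)
        score = mean_prmsd(model)
        if score < best_score:          # keep the smallest deviation seen so far
            best_step, best_score = step, score
    return best_step, best_score
```

The set of candidate time steps is left to the caller, since the text states only the selected value and not the range that was searched.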
After training of the spectral tilt reconstruction network is complete, the network can enter the use stage. In the use stage the network is embedded at the decoder end of a voice communication system as a post-processing technique, where it can process the real-time speech signal of an actual communication frame by frame.
The use stage of the spectral tilt reconstruction network is implemented in the following steps:
Step S21: input narrowband speech frame by frame in real time and extract the log-magnitude spectrum parameters of the narrowband speech.
Step S22: input the wideband speech log-magnitude spectrum parameters frame by frame, and reconstruct the all-pole model parameters of the wideband speech spectral tilt by means of the spectral tilt reconstruction network and parameter conversion.
Specifically, step S21 corresponds to the module of Fig. 3 that extracts the narrowband speech features. It is implemented as follows: one narrowband speech frame is input in real time, and its C-point narrowband log-magnitude spectrum parameters are extracted using the same method as step S12 of the training stage of the spectral tilt reconstruction network.
Step S22 is implemented as follows: the C-point narrowband log-magnitude spectrum parameters extracted in step S21 are input to the trained optimal spectral tilt reconstruction network, which reconstructs the P-th order line spectral pair parameters of the wideband speech spectral tilt; finally, the P-th order line spectral pair parameters are converted to P-th order all-pole model parameters, giving the wideband speech spectral tilt characteristic parameters reconstructed from the narrowband speech.
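The use stage for a single frame can be sketched as below, including the final conversion from line spectral pair parameters back to all-pole parameters. The sketch assumes an even order P, the convention $A(z) = 1 + \sum_p a(p) z^{-p}$, and a network that returns the root angles of $K'(z)$ and $Q'(z)$ as two lists; `extract_features` and `network` are hypothetical callables standing in for step S21 and the trained network.

```python
import math


def _poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists in powers of z^-1."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def lsp_to_lpc(k_angles, q_angles):
    """Rebuild [a(1), ..., a(P)] from the root angles of K'(z) and Q'(z)."""
    k_poly = [1.0, 1.0]     # trivial factor (1 + z^-1) of K(z)
    q_poly = [1.0, -1.0]    # trivial factor (1 - z^-1) of Q(z)
    for w in k_angles:      # each conjugate root pair gives 1 - 2cos(w) z^-1 + z^-2
        k_poly = _poly_mul(k_poly, [1.0, -2.0 * math.cos(w), 1.0])
    for w in q_angles:
        q_poly = _poly_mul(q_poly, [1.0, -2.0 * math.cos(w), 1.0])
    a_full = [0.5 * (k + q) for k, q in zip(k_poly, q_poly)]  # A = (K + Q) / 2
    return a_full[1:-1]     # drop the leading 1 and the cancelled z^-(P+1) term


def reconstruct_frame(narrowband_frame, extract_features, network):
    """One pass of the use stage for a single narrowband frame."""
    features = extract_features(narrowband_frame)   # step S21
    k_angles, q_angles = network(features)          # step S22: network output
    return lsp_to_lpc(k_angles, q_angles)           # step S22: parameter conversion
```

In a real-time decoder this function would be called once per incoming frame, with `extract_features` being the same log-magnitude extraction used in training.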
In summary, the present invention provides a method for reconstructing the spectral tilt of a wideband speech signal from a narrowband speech signal. The method is robust, can be applied in any speech intelligibility enhancement system based on spectral tilt features, and is suitable for multilingual, multi-modal speech signals. In a specific implementation, the procedure can be run automatically using computer software.
The above is only a preferred embodiment of the present invention, and the invention is not limited in form by the above embodiment. Those skilled in the art will appreciate that any simple modification, equivalent change or refinement of the above embodiment made with reference to the technical core of the invention falls within the scope of the technical solution claimed by the invention.
Claims (10)
1. A wideband speech spectral tilt characteristic parameter reconstruction method for speech intelligibility enhancement, characterized in that:
it comprises a training stage and a use stage of a spectral tilt reconstruction network based on a recurrent neural network,
the training stage of the spectral tilt reconstruction network comprises the following steps,
Step S11, obtaining narrowband speech data of a low sampling rate by down-sampling wideband speech data of a high sampling rate, establishing a speech data set, dividing the speech data into a training set, a test set and a validation set in proportion, and pre-processing the speech data in the data set, the pre-processing comprising framing and windowing;
Step S12, inputting the pre-processed narrowband speech training set, performing a short-time Fourier transform to obtain the narrowband speech spectrum, and taking the logarithm of the spectrum information to obtain the log-magnitude spectrum as the input of the spectral tilt reconstruction network;
Step S13, inputting the pre-processed wideband speech training set, extracting the all-pole model parameters of the wideband speech signal spectral tilt, and converting them to line spectral pair parameters as the output of the spectral tilt reconstruction network;
Step S14, training the spectral tilt reconstruction network, defining the perceptual root-mean-square deviation PRMSD as the evaluation method for testing the performance of the spectral tilt network, using the validation set as the evaluation criterion in each evaluation, obtaining the optimal reconstruction network parameter model by tuning, and verifying the final effect on the test set;
in the use stage of the spectral tilt reconstruction network, the trained neural network is applied to the frame-by-frame processing of real-time speech in practical communication, comprising the following steps,
Step S21, inputting narrowband speech frame by frame in real time and extracting the log-magnitude spectrum parameters of the narrowband speech;
Step S22, inputting the wideband speech log-magnitude spectrum parameters frame by frame, and reconstructing the all-pole model parameters of the wideband speech spectral tilt by means of the spectral tilt reconstruction network and parameter conversion.
2. The wideband speech spectral tilt characteristic parameter reconstruction method for speech intelligibility enhancement according to claim 1, characterized in that: the wideband and narrowband speech data include normal speech and anti-noise speech.
3. The wideband speech spectral tilt characteristic parameter reconstruction method for speech intelligibility enhancement according to claim 1, characterized in that: in step S12, the number of points of the short-time Fourier transform is N, and in the calculation formula of the training input of the spectral tilt reconstruction network, $s_i(n)$ denotes the i-th narrowband speech frame, n ranges over the speech frame length, $x_i(k)$ denotes the value of the log-magnitude spectrum of the i-th speech frame, k is the frequency-domain variable, and $W_{in}$ denotes a time-domain window function; the number of log-magnitude spectrum points of each speech frame is $C = N/2 + 1$, and $x_i = [x_i(1), x_i(2), \dots, x_i(C)]$ is the log-magnitude spectrum of the i-th speech frame; for every frame of the framed narrowband speech data in the speech data set, the log-magnitude spectrum is calculated according to the first formula above and stored row by row in the matrix X, where X is the input matrix of the spectral tilt reconstruction network and M is the number of rows of X.
4. The wideband speech spectral tilt characteristic parameter reconstruction method for speech intelligibility enhancement according to claim 1, characterized in that: in step S13, a calculation is performed on the i-th wideband speech frame $s_i(n)$ to obtain $a_i = [a_i(1), a_i(2), \dots, a_i(P)]$, the all-pole model parameters of the spectral tilt of the i-th wideband speech frame, where P is the order of the all-pole model.
5. The wideband speech spectral tilt characteristic parameter reconstruction method for speech intelligibility enhancement according to claim 1, characterized in that: the line spectral pair parameters described in step S13 are an equivalent representation of the all-pole model parameters, and the line spectral pair parameters are more robust.
6. The wideband speech spectral tilt characteristic parameter reconstruction method for speech intelligibility enhancement according to claim 1, characterized in that: the evaluation method used in step S14 uses the speech data of the validation set and the test set, and in its calculation formula, $\hat{y}_i(n)$ is the estimated value of the all-pole model parameters of the spectral tilt of the i-th speech frame, $y_i(n)$ is the true value of the all-pole model parameters of the spectral tilt of the i-th speech frame, $\hat{Y}_i(k)$ is the estimated value of the spectral tilt of the i-th speech frame, and $Y_i(k)$ is the true value of the spectral tilt of the i-th speech frame; $\hat{Y}_i(k)$ and $Y_i(k)$ are each divided into L subbands using the same subband division method, $\hat{Y}_i^j(k)$ denotes the estimated spectral tilt of the j-th subband of the i-th speech frame, $Y_i^j(k)$ denotes the true spectral tilt of the j-th subband of the i-th speech frame, $D_j$ denotes the length of the j-th subband, $b_j$ denotes the perceptual coefficient used when calculating the perceptual root-mean-square deviation of the j-th subband, and $PR_i$ denotes the perceptual root-mean-square deviation PRMSD of the spectral tilt of the i-th speech frame.
7. The wideband speech spectral tilt characteristic parameter reconstruction method for speech intelligibility enhancement according to claim 1, characterized in that: the number of input-layer nodes of the optimal reconstruction network parameter model described in step S14 is $C = N/2 + 1$, the same as the number of log-magnitude spectrum points of each narrowband speech frame in step S12.
8. The wideband speech spectral tilt characteristic parameter reconstruction method for speech intelligibility enhancement according to claim 1, characterized in that: in step S14, the activation function used by the hidden layers of the optimal network parameter model is the Sigmoid function, the Tanh function or the Linear function; the hidden-layer node configuration is [N/4, N/4, N/8, N/8], [N/8, N/8, N/16, N/16], [N/4, N/4, N/8, N/16], [N/4, N/8, N/8, N/16] or [N/4, N/8, N/16, N/16]; and the optimal time step of each hidden layer is determined by parameter tuning.
9. The wideband speech spectral tilt characteristic parameter reconstruction method for speech intelligibility enhancement according to claim 1, characterized in that: in step S14, the number of output-layer nodes of the optimal reconstruction network is P, equal to the order of the all-pole model of the speech spectral tilt.
10. The wideband speech spectral tilt characteristic parameter reconstruction method for speech intelligibility enhancement according to claim 1 or 2 or 3 or 4 or 5 or 6 or 7 or 8 or 9, characterized in that:
the method of extracting the narrowband speech log-magnitude spectrum parameters in step S21 of the use stage of the spectral tilt reconstruction network is the same as in step S12 of the training stage of the spectral tilt reconstruction network;
the parameter conversion in step S22 of the use stage of the spectral tilt reconstruction network converts the line spectral pair parameters of the wideband speech spectral tilt reconstructed by the spectral tilt reconstruction network into all-pole model parameters.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811249506.0A CN109215635B (en) | 2018-10-25 | 2018-10-25 | Broadband voice frequency spectrum gradient characteristic parameter reconstruction method for voice definition enhancement |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109215635A (en) | 2019-01-15 |
CN109215635B (en) | 2020-08-07 |
Family
ID=64996332
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||