CN106024010A - Speech signal dynamic characteristic extraction method based on formant curves - Google Patents
- Publication number
- CN106024010A (application CN201610340935.3A; granted as CN106024010B)
- Authority
- CN
- China
- Prior art keywords
- formant
- curve
- voice signal
- frame
- formant curve
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/15—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being formant information
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/21—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
Abstract
The invention provides a method for extracting dynamic characteristics of speech signals based on formant curves, belonging to the technical field of Chinese speech signal dynamic characteristic extraction. The method comprises the following steps: acquiring speech signals; preprocessing the speech signals; extracting the formant frequency characteristics of the speech signals; combining, in order from the first frame to the last frame, the first formant frequency values of all preprocessed frames to obtain a first formant curve, and obtaining a second, third and fourth formant curve in the same manner; applying a fast Fourier transform to each formant curve to obtain its linear spectrum; obtaining an energy spectrum from the linear spectrum; obtaining the log energy from the energy spectrum; and applying a discrete cosine transform to the log energy. Compared with existing methods, the dynamic characteristics extracted by this method possess temporal correlation, revealing the close association between preceding, following and adjacent parts of the speech signal, and thereby improving speech recognition performance.
Description
Technical field
The invention belongs to the technical field of Chinese speech signal dynamic feature extraction, and specifically relates to a method for extracting dynamic features of speech signals based on formant curves.
Background art
Research on speech recognition in China began in the 1950s, but did not start to develop rapidly until the 1970s. The Chinese Academy of Sciences, Tsinghua University, Peking University and many other research units are engaged in developing Chinese speech recognition systems, and research on large-vocabulary continuous speech recognition is already close to the top international level. In China's Eighth Five-Year Plan for national economic and social development and in the "863" program, research on Chinese speech recognition received strong support, and the National 863 "Intelligent Computer" expert group established projects specifically for speech recognition research. At the same time, owing to China's growing international status and its critical position in the economy and the market, Chinese speech recognition has received ever more attention from foreign research institutions and companies: IBM, Microsoft, Apple, Motorola, Intel, L&H and other companies have successively set up research institutions in China and invested in the development of Chinese speech recognition systems, strongly promoting research on Mandarin speech recognition.
Nevertheless, truly free human-machine communication remains far off. Existing commercial systems still have problems, such as unsatisfactory recognition rates and robustness in noisy environments.
The most fundamental and most important step in speech recognition is the extraction of speech signal feature parameters. As early as the 1940s, R. K. Potter et al. proposed the concept of "Visible Speech", pointing out that the spectrogram has strong descriptive power for speech signals, and attempted speech recognition using spectrographic information, which formed the earliest speech features. By the 1950s it had been recognized that, to identify a speech signal, parameters reflecting the characteristics of speech must be extracted from the waveform; this not only reduces the number of templates, the computation and the storage, but also filters out useless redundancy in the speech signal. Features such as amplitude, short-time frame energy, short-time frame zero-crossing rate and short-time autocorrelation coefficients then appeared. As recognition technology developed, people found that the stability and discriminative power of time-domain feature parameters were not good, and began to use frequency-domain parameters as speech features, such as pitch period, formant frequency, linear prediction coefficients (LPC), line spectrum pairs (LSP) and cepstral coefficients. At present the most widely used feature parameters are the Mel-frequency cepstral coefficients (MFCC), which are based on a model of human hearing. However, once these parameters are applied in a noisy environment, their performance declines sharply.
Moreover, all the feature parameters mentioned above reflect the static characteristics of speech. The dynamic characteristics of a speech signal are feature parameters extracted from several adjacent frames of speech; they can be obtained, for example, as the differential and acceleration parameters of the static features. But differential and acceleration parameters cannot mine the dynamic information sufficiently, so they still cannot reflect the dynamic characteristics of speech signals well.
Summary of the invention
In view of the deficiencies of the prior art, the present invention proposes a method for extracting dynamic features of speech signals based on formant curves, with the aims of broadening applications, improving speech recognition performance, quickly and effectively grasping the dynamic characteristics of the signal, and making existing speech recognition technology applicable in strong-noise environments.
A method for extracting dynamic features of speech signals based on formant curves comprises the following steps:
Step 1: acquiring a speech signal;
Step 2: preprocessing the speech signal, including pre-emphasis, framing with windowing, and endpoint detection;
Step 3: using a method based on the Hilbert-Huang transform, estimating the formant frequency features of the preprocessed speech signal to obtain the first formant feature value, second formant feature value, third formant feature value and fourth formant feature value of every frame;
Step 4: forming the formant curves, specifically:
in order from the first frame to the last frame, combining the first formant feature values of every preprocessed frame of the speech signal to obtain the first formant curve;
in order from the first frame to the last frame, combining the second formant feature values of every preprocessed frame to obtain the second formant curve;
in order from the first frame to the last frame, combining the third formant feature values of every preprocessed frame to obtain the third formant curve;
in order from the first frame to the last frame, combining the fourth formant feature values of every preprocessed frame to obtain the fourth formant curve;
Step 5: applying the fast Fourier transform to the obtained first, second, third and fourth formant curves to obtain the linear spectrum of each formant curve;
Step 6: obtaining the energy spectrum of each formant curve from its linear spectrum;
Step 7: obtaining the log energy of each formant curve from its energy spectrum;
Step 8: applying the discrete cosine transform to the log energy to reach the cepstral domain, i.e. obtaining the dynamic feature coefficients of the speech signal.
The preprocessing of the speech signal in step 2 includes pre-emphasis, framing with windowing, and endpoint detection, wherein:
the pre-emphasis is realized by a first-order digital pre-emphasis filter whose coefficient ranges from 0.93 to 0.97;
the framing and windowing divides the signal into frames of length 256 samples and applies a Hamming window to every frame;
the endpoint detection uses the short-time energy-zero product method.
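As a rough, hypothetical sketch of the short-time energy-zero product test described above (the function names and threshold handling are illustrative additions, not taken from the patent):

```python
import numpy as np

def energy_zero_product(frames):
    """Per-frame product of short-time energy and zero-crossing count.

    frames: 2-D array, one windowed frame per row. Speech frames yield a
    large product; silence yields a value near zero.
    """
    energy = np.sum(frames ** 2, axis=1)
    # Count sign changes between consecutive samples in each frame
    zcr = np.sum(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return energy * zcr

def detect_endpoints(frames, threshold):
    """Indices of frames whose energy-zero product exceeds the threshold."""
    return np.where(energy_zero_product(frames) > threshold)[0]
```

Multiplying energy by the zero-crossing count raises the contrast between speech and silence compared with either measure alone, which is the rationale behind the energy-zero product method.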
In step 5, the fast Fourier transform is applied to the obtained first, second, third and fourth formant curves to obtain the linear spectrum of each formant curve.
The specific formula is as follows:

Xi(k) = Σ_{n=0}^{N-1} xi(n)e^(-j2πnk/N)

where Xi(k) denotes the linear spectrum of the i-th formant curve after the fast Fourier transform; i = 1, 2, 3, 4; k = 0, 1, 2, ..., N-1, where N is the number of frames of the speech signal; xi(n) denotes the i-th formant curve; j is the imaginary unit and e is the base of the natural logarithm.
In step 8, the discrete cosine transform is applied to the above log energy to reach the cepstral domain, i.e. to obtain the dynamic feature parameters of the speech signal.
The specific formula is as follows:

Ci(t) = Σ_{k=0}^{N-1} Li(k)cos(πt(2k+1)/(2N)), t = 1, 2, ..., T

where Ci(t) denotes the dynamic feature coefficients of the i-th formant curve; i = 1, 2, 3, 4; T denotes the set number of cepstral coefficients, ranging from 12 to 16; Li(k) denotes the log energy of the i-th formant curve; k = 0, 1, 2, ..., N-1, where N is the number of frames of the speech signal.
The invention has the following advantages:
1. The speech signal dynamic feature coefficients obtained by the invention are mainly applicable to computer dictation machines and to speech information query and service systems combined with the telephone network or the Internet; they can also be applied in miniaturized, portable speech products, such as voice dialing on wireless telephones, voice control of automotive equipment, intelligent toys, and household remote controls.
2. The invention extracts dynamic features of the speech signal; these possess temporal correlation and reveal the close association between preceding, following and adjacent parts of the speech signal, greatly improving speech recognition performance compared with the traditional MFCC method.
3. The invention estimates the formant frequency features of the preprocessed speech with a method based on the Hilbert-Huang transform, in which empirical mode decomposition (EMD) decomposes the signal into a set of intrinsic mode function (IMF) components of different scales; each IMF component obtained by the decomposition represents one frequency component, and these components effectively highlight the local characteristics and detailed variations of the signal, which helps to grasp its dynamic characteristics quickly and effectively.
4. The formant curves constructed by the invention possess temporal correlation and reveal the close association between preceding, following and adjacent parts of the speech signal; this property makes it possible to apply speech recognition technology in strong-noise environments.
Brief description of the drawings
Fig. 1 is a flowchart of the formant-curve-based method for extracting dynamic features of speech signals in an embodiment of the present invention;
Fig. 2 compares the recognition performance curves of the parameters under white noise in an embodiment of the present invention;
Fig. 3 compares the recognition performance curves of the parameters under pink noise in an embodiment of the present invention;
Fig. 4 compares the recognition performance curves of the parameters under street noise in an embodiment of the present invention;
Fig. 5 compares the recognition performance curves of the parameters under tank noise in an embodiment of the present invention.
Detailed description of the invention
An embodiment of the present invention is described further below with reference to the accompanying drawings.
A method for extracting dynamic features of speech signals based on formant curves, whose flow is shown in Fig. 1, comprises the following steps:
Step 1: acquire the speech signal.
In this embodiment, speech data are input through a microphone and sampled and quantized by a processing unit such as a computer, single-chip microcomputer or DSP chip, at a sampling frequency of 11.025 kHz with 16-bit quantization precision, to obtain the corresponding speech signal; this embodiment uses a computer as the processing unit.
Step 2: preprocess the speech signal, including pre-emphasis, framing with windowing, and endpoint detection.
In this embodiment, the pre-emphasis is realized by a first-order digital pre-emphasis filter whose coefficient may range from 0.93 to 0.97 and is set to 0.9375 here; the framing and windowing divides the signal into frames of 256 samples and applies a Hamming window to every frame; and the endpoint detection uses the short-time energy-zero product method.
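The pre-emphasis, framing and Hamming windowing of step 2 can be sketched in a few lines. This is an illustrative sketch rather than the patent's implementation; in particular the hop size of 128 samples is an assumption, since the patent specifies only the frame length of 256:

```python
import numpy as np

def preprocess(signal, alpha=0.9375, frame_len=256, hop=128):
    """Pre-emphasis with H(z) = 1 - alpha*z^{-1}, then framing and windowing.

    alpha = 0.9375 matches the embodiment; hop (frame shift) is an assumed
    value not given in the patent.
    """
    # First-order pre-emphasis: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    window = np.hamming(frame_len)
    num_frames = 1 + (len(emphasized) - frame_len) // hop
    # Split into overlapping frames and apply the Hamming window to each
    frames = np.stack([emphasized[i * hop : i * hop + frame_len] * window
                       for i in range(num_frames)])
    return frames
```

Endpoint detection (the short-time energy-zero product) would then discard non-speech frames before formant estimation.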
Step 3: using a method based on the Hilbert-Huang transform, estimate the formant frequency features of the preprocessed speech signal, obtaining the first formant feature value F1, second formant feature value F2, third formant feature value F3 and fourth formant feature value F4 of every frame.
In this embodiment, each formant frequency of the speech signal is first roughly estimated with the fast Fourier transform (FFT) and used to determine the parameters of a corresponding band-pass filter; the speech signal is filtered with those parameters; empirical mode decomposition (EMD) of the filtered signal yields a family of intrinsic mode functions (IMFs); the IMF containing the formant frequency is selected by the energy-maximum principle; and the instantaneous frequency and Hilbert spectrum of that IMF are computed to obtain the formant frequency parameters of the speech signal.
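The last part of step 3, computing the instantaneous frequency of the selected IMF, can be sketched with the Hilbert transform. This is a minimal illustration assuming the IMF has already been obtained from an EMD implementation (the patent does not name one); it works on any narrow-band signal:

```python
import numpy as np
from scipy.signal import hilbert

def instantaneous_frequency(imf, fs):
    """Instantaneous frequency (Hz) of a narrow-band component.

    The unwrapped phase of the analytic signal is differentiated; for an
    IMF that contains one formant, its typical value estimates that
    formant's frequency.
    """
    analytic = hilbert(imf)                # analytic signal via Hilbert transform
    phase = np.unwrap(np.angle(analytic))  # continuous instantaneous phase
    return np.diff(phase) * fs / (2.0 * np.pi)
```

Taking a robust statistic of the result (e.g. the median over the frame) suppresses the edge effects of the Hilbert transform.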
Step 4: form the formant curves, specifically:
In this embodiment, in order from the first frame to the last frame, the first formant frequency values F1 of every preprocessed frame of the speech signal are combined into the first formant curve x1(n), n = 0, 1, 2, ..., N-1, where N is the number of frames of the speech signal; likewise, the second formant frequency values F2 are combined into the second formant curve x2(n), the third formant frequency values F3 into the third formant curve x3(n), and the fourth formant frequency values F4 into the fourth formant curve x4(n).
Step 5: apply the fast Fourier transform to the obtained first, second, third and fourth formant curves to obtain the linear spectrum of each formant curve.
In this embodiment, the specific formula is as follows:

Xi(k) = Σ_{n=0}^{N-1} xi(n)e^(-j2πnk/N) (2)

where Xi(k) denotes the linear spectrum of the i-th formant curve after the fast Fourier transform; i = 1, 2, 3, 4; k = 0, 1, 2, ..., N-1, where N is the number of frames of the speech signal; xi(n) denotes the i-th formant curve; j is the imaginary unit and e is the base of the natural logarithm (approximately 2.718).
Step 6: obtain the energy spectrum of each formant curve from its linear spectrum.
In this embodiment, the squared modulus of the linear spectrum Xi(k) gives the corresponding energy spectrum Si(k):

Si(k) = |Xi(k)|² (3)

where Si(k) denotes the energy spectrum of the i-th formant curve.
Step 7: obtain the log energy of each formant curve from its energy spectrum.
In this embodiment, to make the result more robust to noise, the logarithm of the energy spectrum Si(k) is taken to obtain the log energy Li(k):

Li(k) = log(Si(k)) (4)

where Li(k) is the log energy of the i-th formant curve.
Step 8: apply the discrete cosine transform to the log energy to reach the cepstral domain, i.e. obtain the dynamic feature coefficients of the speech signal. The specific formula is as follows:

Ci(t) = Σ_{k=0}^{N-1} Li(k)cos(πt(2k+1)/(2N)), t = 1, 2, ..., T (5)

where Ci(t) denotes the dynamic feature coefficients of the i-th formant curve; i = 1, 2, 3, 4; T denotes the set number of cepstral coefficients, ranging from 12 to 16; this embodiment takes T = 12.
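Steps 5 through 8 amount to a short spectral pipeline applied to each formant curve. A sketch follows, under two stated assumptions: a DCT-II with orthonormal scaling stands in for the patent's cosine transform, and a small epsilon guards the logarithm against zero energies (neither detail is specified in the patent):

```python
import numpy as np
from scipy.fft import dct

def formant_dynamic_features(formant_track, num_coeffs=12):
    """FFT -> energy spectrum -> log -> DCT for one formant curve.

    formant_track: one formant frequency value per frame (steps 3-4).
    Returns num_coeffs cepstral-domain dynamic feature coefficients.
    """
    X = np.fft.fft(formant_track)      # step 5: linear spectrum Xi(k)
    S = np.abs(X) ** 2                 # step 6: energy spectrum Si(k)
    L = np.log(S + 1e-12)              # step 7: log energy Li(k)
    C = dct(L, type=2, norm='ortho')   # step 8: transform to the cepstral domain
    return C[1 : num_coeffs + 1]       # keep T = 12 coefficients, dropping C(0)
```

Dropping the zeroth coefficient mirrors common cepstral practice (it tracks overall log energy) and matches the patent's indexing of t from 1.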
In this embodiment, 50 typical Chinese words are used for testing. Since a recognition system is easily affected by factors such as environmental noise, channel variation and speaker variation, the training set uses speech recorded in a quiet environment while the test set uses noisy data.
To verify the robustness of the feature parameters to speaker variation, the training data were recorded in two sessions by 50 speakers in total, each pronouncing every word once, giving 5000 utterances; the test data were likewise recorded in two sessions by 30 speakers, each pronouncing every word once, giving 3000 utterances. To verify robustness to channel variation, a different microphone was used for each recording. To verify robustness to environmental noise, four kinds of noise (white noise, pink noise, street noise and tank noise) were artificially added to every test utterance, forming noisy speech signals at signal-to-noise ratios of 15 dB, 10 dB, 5 dB, 0 dB and -5 dB.
In this embodiment, a wavelet neural network improved by a genetic algorithm is used as the classifier; the network input layer has 48 neurons, the output layer has 50 neurons, and the number of hidden-layer nodes is determined by the genetic algorithm.
In this embodiment, Figs. 2, 3, 4 and 5 compare the recognition performance of the MFCC method (run under the same conditions as this embodiment) with that of the embodiment's method under white noise, pink noise, street noise and tank noise interference, respectively. It can be seen that when the signal-to-noise ratio is low, the recognition rate of the embodiment's method is much higher than that of the MFCC method.
Claims (4)
1. A method for extracting dynamic features of speech signals based on formant curves, characterized by comprising the following steps:
Step 1: acquiring a speech signal;
Step 2: preprocessing the speech signal, including pre-emphasis, framing with windowing, and endpoint detection;
Step 3: using a method based on the Hilbert-Huang transform, estimating the formant frequency features of the preprocessed speech signal to obtain the first formant feature value, second formant feature value, third formant feature value and fourth formant feature value of every frame;
Step 4: forming the formant curves, specifically:
in order from the first frame to the last frame, combining the first formant feature values of every preprocessed frame of the speech signal to obtain the first formant curve;
in order from the first frame to the last frame, combining the second formant feature values of every preprocessed frame to obtain the second formant curve;
in order from the first frame to the last frame, combining the third formant feature values of every preprocessed frame to obtain the third formant curve;
in order from the first frame to the last frame, combining the fourth formant feature values of every preprocessed frame to obtain the fourth formant curve;
Step 5: applying the fast Fourier transform to the obtained first, second, third and fourth formant curves to obtain the linear spectrum of each formant curve;
Step 6: obtaining the energy spectrum of each formant curve from its linear spectrum;
Step 7: obtaining the log energy of each formant curve from its energy spectrum;
Step 8: applying the discrete cosine transform to the log energy to reach the cepstral domain, i.e. obtaining the dynamic feature parameters of the speech signal.
2. The method for extracting dynamic features of speech signals based on formant curves according to claim 1, characterized in that the preprocessing of the speech signal in step 2 includes pre-emphasis, framing with windowing, and endpoint detection, wherein:
the pre-emphasis is realized by a first-order digital pre-emphasis filter whose coefficient ranges from 0.93 to 0.97;
the framing and windowing divides the signal into frames of length 256 samples and applies a Hamming window to every frame;
the endpoint detection uses the short-time energy-zero product method.
3. The method for extracting dynamic features of speech signals based on formant curves according to claim 1, characterized in that in step 5 the fast Fourier transform is applied to the obtained first, second, third and fourth formant curves to obtain the linear spectrum of each formant curve, according to the formula:

Xi(k) = Σ_{n=0}^{N-1} xi(n)e^(-j2πnk/N)

where Xi(k) denotes the linear spectrum of the i-th formant curve after the fast Fourier transform; i = 1, 2, 3, 4; k = 0, 1, 2, ..., N-1, where N is the number of frames of the speech signal; xi(n) denotes the i-th formant curve, n = 0, 1, 2, ..., N-1; j is the imaginary unit and e is the base of the natural logarithm.
4. The method for extracting dynamic features of speech signals based on formant curves according to claim 1, characterized in that in step 8 the discrete cosine transform is applied to the above log energy to reach the cepstral domain, i.e. to obtain the dynamic feature parameters of the speech signal, according to the formula:

Ci(t) = Σ_{k=0}^{N-1} Li(k)cos(πt(2k+1)/(2N)), t = 1, 2, ..., T

where Ci(t) denotes the dynamic feature coefficients of the i-th formant curve; i = 1, 2, 3, 4; T denotes the set number of cepstral coefficients, ranging from 12 to 16; Li(k) denotes the log energy of the i-th formant curve; k = 0, 1, 2, ..., N-1, where N is the number of frames of the speech signal.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610340935.3A CN106024010B (en) | 2016-05-19 | 2016-05-19 | A kind of voice signal dynamic feature extraction method based on formant curve |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610340935.3A CN106024010B (en) | 2016-05-19 | 2016-05-19 | A kind of voice signal dynamic feature extraction method based on formant curve |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106024010A true CN106024010A (en) | 2016-10-12 |
CN106024010B CN106024010B (en) | 2019-08-20 |
Family
ID=57095695
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610340935.3A Expired - Fee Related CN106024010B (en) | 2016-05-19 | 2016-05-19 | A kind of voice signal dynamic feature extraction method based on formant curve |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106024010B (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106596002A (en) * | 2016-12-14 | 2017-04-26 | 东南大学 | High-speed railway steel truss arch bridge vehicle-bridge resonance curve measuring method |
CN108053842A (en) * | 2017-12-13 | 2018-05-18 | 电子科技大学 | Shortwave sound end detecting method based on image identification |
CN109410971A (en) * | 2018-11-13 | 2019-03-01 | 无锡冰河计算机科技发展有限公司 | A kind of method and apparatus for beautifying sound |
CN110135291A (en) * | 2019-04-29 | 2019-08-16 | 西北工业大学 | A kind of method for parameter estimation of Low SNR signal |
CN110663080A (en) * | 2017-02-13 | 2020-01-07 | 法国国家科研中心 | Method and apparatus for dynamically modifying the timbre of speech by frequency shifting of spectral envelope formants |
CN111726728A (en) * | 2020-06-30 | 2020-09-29 | 联想(北京)有限公司 | Resonance suppression method and device |
CN111899724A (en) * | 2020-08-06 | 2020-11-06 | 中国人民解放军空军预警学院 | Voice feature coefficient extraction method based on Hilbert-Huang transform and related equipment |
CN112966528A (en) * | 2021-03-01 | 2021-06-15 | 郑州铁路职业技术学院 | English voice translation fuzzy matching system |
CN114598565A (en) * | 2022-05-10 | 2022-06-07 | 深圳市发掘科技有限公司 | Kitchen electrical equipment remote control system and method and computer equipment |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101067929A (en) * | 2007-06-05 | 2007-11-07 | 南京大学 | Method for enhancing and extracting phonetic resonance hump trace utilizing formant |
CN102231281A (en) * | 2011-07-18 | 2011-11-02 | 渤海大学 | Voice visualization method based on integration characteristic and neural network |
CN102820037A (en) * | 2012-07-21 | 2012-12-12 | 渤海大学 | Chinese initial and final visualization method based on combination feature |
CN102855408A (en) * | 2012-09-18 | 2013-01-02 | 福州大学 | ICA (independent component analysis)-based EMD (empirical mode decomposition) improvement process IMF (intrinsic mode function) judgment method |
CN103021405A (en) * | 2012-12-05 | 2013-04-03 | 渤海大学 | Voice signal dynamic feature extraction method based on MUSIC and modulation spectrum filter |
CN104835507A (en) * | 2015-03-30 | 2015-08-12 | 渤海大学 | Serial-parallel combined multi-mode emotion information fusion and identification method |
- 2016-05-19: application CN201610340935.3A filed; granted as CN106024010B; status: Expired - Fee Related
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101067929A (en) * | 2007-06-05 | 2007-11-07 | 南京大学 | Method for enhancing and extracting phonetic resonance hump trace utilizing formant |
CN102231281A (en) * | 2011-07-18 | 2011-11-02 | 渤海大学 | Voice visualization method based on integration characteristic and neural network |
CN102820037A (en) * | 2012-07-21 | 2012-12-12 | 渤海大学 | Chinese initial and final visualization method based on combination feature |
CN102855408A (en) * | 2012-09-18 | 2013-01-02 | 福州大学 | ICA (independent component analysis)-based EMD (empirical mode decomposition) improvement process IMF (intrinsic mode function) judgment method |
CN103021405A (en) * | 2012-12-05 | 2013-04-03 | 渤海大学 | Voice signal dynamic feature extraction method based on MUSIC and modulation spectrum filter |
CN104835507A (en) * | 2015-03-30 | 2015-08-12 | 渤海大学 | Serial-parallel combined multi-mode emotion information fusion and identification method |
Non-Patent Citations (4)
Title |
---|
乐莎莎: "Research on cough sound recognition based on HHT", China Master's Theses Full-text Database, Information Science and Technology Series * |
王洪海: "Research on automatic language identification based on acoustic features", China Master's Theses Full-text Database, Information Science and Technology Series * |
莫家玲: "Research on speech feature parameter extraction based on invariant-set multiwavelets", China Master's Theses Full-text Database, Information Science and Technology Series * |
顾亚强: "Research on key technologies of speaker-independent speech recognition", China Master's Theses Full-text Database, Information Science and Technology Series * |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106596002A (en) * | 2016-12-14 | 2017-04-26 | 东南大学 | High-speed railway steel truss arch bridge vehicle-bridge resonance curve measuring method |
CN110663080A (en) * | 2017-02-13 | 2020-01-07 | 法国国家科研中心 | Method and apparatus for dynamically modifying the timbre of speech by frequency shifting of spectral envelope formants |
CN108053842A (en) * | 2017-12-13 | 2018-05-18 | 电子科技大学 | Short-wave voice endpoint detection method based on image recognition |
CN108053842B (en) * | 2017-12-13 | 2021-09-14 | 电子科技大学 | Short-wave voice endpoint detection method based on image recognition |
CN109410971A (en) * | 2018-11-13 | 2019-03-01 | 无锡冰河计算机科技发展有限公司 | Method and device for beautifying sound |
CN109410971B (en) * | 2018-11-13 | 2021-08-31 | 无锡冰河计算机科技发展有限公司 | Method and device for beautifying sound |
CN110135291A (en) * | 2019-04-29 | 2019-08-16 | 西北工业大学 | Parameter estimation method for low signal-to-noise ratio signals |
CN110135291B (en) * | 2019-04-29 | 2023-03-24 | 西北工业大学 | Parameter estimation method for low signal-to-noise ratio signals |
CN111726728A (en) * | 2020-06-30 | 2020-09-29 | 联想(北京)有限公司 | Resonance suppression method and device |
CN111899724A (en) * | 2020-08-06 | 2020-11-06 | 中国人民解放军空军预警学院 | Speech feature coefficient extraction method based on the Hilbert-Huang transform, and related device |
CN112966528A (en) * | 2021-03-01 | 2021-06-15 | 郑州铁路职业技术学院 | English speech translation fuzzy matching system |
CN112966528B (en) * | 2021-03-01 | 2023-09-19 | 郑州铁路职业技术学院 | English speech translation fuzzy matching system |
CN114598565A (en) * | 2022-05-10 | 2022-06-07 | 深圳市发掘科技有限公司 | Remote control system and method for kitchen electrical equipment, and computer device |
Also Published As
Publication number | Publication date |
---|---|
CN106024010B (en) | 2019-08-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106024010B (en) | Speech signal dynamic feature extraction method based on formant curves | |
CN103236260B (en) | Speech recognition system | |
CN102509547B (en) | Method and system for voiceprint recognition based on vector quantization | |
CN103345923B (en) | Short-utterance speaker recognition method based on sparse representation | |
CN103310789B (en) | Sound event recognition method based on improved parallel model combination | |
CN104900229A (en) | Method for extracting mixed feature parameters of speech signals | |
CN102968990B (en) | Speaker identification method and system | |
CN102568476B (en) | Voice conversion method based on self-organizing feature map network clustering and radial basis function network | |
CN103065629A (en) | Speech recognition system for a humanoid robot | |
CN101226743A (en) | Speaker recognition method based on neutral-to-emotional voiceprint model conversion | |
CN104123933A (en) | Voice conversion method based on adaptive non-parallel training | |
CN104183245A (en) | Method and device for recommending music stars with timbres similar to those of singers | |
CN113012720B (en) | Depression detection method based on multi-feature speech fusion with spectral-subtraction noise reduction | |
CN108597505A (en) | Speech recognition method, device and terminal device | |
CN110136709A (en) | Speech recognition method and speech-recognition-based video conferencing system | |
CN111192598A (en) | Speech enhancement method based on a skip-connection deep neural network | |
CN109036458A (en) | Multilingual scene analysis method based on audio feature parameters | |
CN103021405A (en) | Voice signal dynamic feature extraction method based on MUSIC and modulation spectrum filter | |
CN106531174A (en) | Animal sound recognition method based on wavelet packet decomposition and spectrogram features | |
CN105679312A (en) | Voiceprint feature processing method for speaker identification in noisy environments | |
CN102237083A (en) | Portable interpretation system based on the WinCE platform and speech recognition method thereof | |
CN100543840C (en) | Speaker recognition method based on emotion transfer rules and speech correction | |
CN109192196A (en) | Noise-robust audio feature selection method for SVM classifiers | |
CN106373559A (en) | Robust feature extraction method based on logarithmic-spectrum noise-to-signal weighting | |
CN110728991A (en) | Improved recording equipment identification algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 2019-08-20; Termination date: 2020-05-19 |