CN108564965B - An anti-noise speech recognition system - Google Patents

An anti-noise speech recognition system

Info

Publication number
CN108564965B
CN108564965B
Authority
CN
China
Prior art keywords
speech
cfcc
signal
feature
speech signal
Prior art date
Legal status
Active
Application number
CN201810311359.9A
Other languages
Chinese (zh)
Other versions
CN108564965A (en)
Inventor
Xue Peiyun
Shi Yanyan
Bai Jing
Guo Qianyan
Current Assignee
Taiyuan University of Technology
Original Assignee
Taiyuan University of Technology
Priority date
Filing date
Publication date
Application filed by Taiyuan University of Technology
Priority to CN201810311359.9A
Publication of CN108564965A
Application granted
Publication of CN108564965B
Legal status: Active
Anticipated expiration


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/45: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of analysis window
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters
    • G10L25/24: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, the extracted parameters being the cepstrum
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

The invention relates to the technical field of speech recognition. An anti-noise speech recognition system: the speech signal is windowed and framed and then discrete-Fourier-transformed to obtain its magnitude and phase angle; the power spectrum of the estimated signal is obtained by spectral subtraction; the signal is reconstructed with the phase-angle information from before spectral subtraction to give the spectrally subtracted speech sequence; from the new speech sequence, cochlear filter cepstral coefficient (CFCC) features and their first-order difference ΔCFCC are extracted with a nonlinear power function that simulates the auditory characteristics of the human ear, and the features are mixed by a dimension-screening method; the fused features are normalized to obtain the training-set labels and the test-set labels; the normalized training set is reduced in dimension with PCA and fed into an SVM model to obtain the recognition accuracy.

Description

Anti-noise speech recognition system
Technical Field
The invention relates to the technical field of speech recognition.
Background
With the rapid development of information technology, human-computer interaction has received more and more attention, and speech recognition has become a key technology of human-computer interaction and a research focus in the field. Speech recognition is the technology by which a computer converts speech signals into corresponding text or commands by extracting and analyzing the semantic information in human speech; it is widely applied in fields such as industry, household appliances, communications, automotive electronics, medical care, home services and consumer electronics.
However, speech signals are particularly susceptible to noise: every link from acquisition through transmission to playback may be affected by it. Spectral subtraction is one of the speech enhancement techniques and is simple to operate and easy to implement.
At present the most mainstream characteristic parameter in speech recognition is the Mel frequency cepstral coefficient (MFCC). MFCC features are extracted on the basis of the Fourier transform, yet the Fourier transform is really only suitable for processing stationary signals. The auditory transform, a newer method for processing non-stationary speech signals, makes up for this deficiency of the Fourier transform and offers low harmonic distortion and good spectral smoothness. Cochlear filter cepstral coefficients (CFCC) were first proposed in 2011 by Dr. Peter Li of Bell Labs and applied to speaker recognition; CFCC was the first feature to use the auditory transform. Although many scholars have studied CFCC features, the traditional CFCC extraction method does not exploit the human auditory characteristic that a nonlinear power function, derived from the saturation relation between the firing rate of neuron action potentials and sound intensity, approximates the auditory-nerve intensity curve; a nonlinear power function that simulates this characteristic of human hearing is therefore adopted here to extract new CFCC features.
A complete speech signal contains both frequency information and energy information. The Teager energy operator (TEO) is a nonlinear difference operator that can eliminate the influence of zero-mean noise and enhance speech; used for feature extraction it reflects the energy variation of the speech signal well, suppresses noise, and gives good results in speech recognition.
The support vector machine (SVM) is a machine learning technique based on the principle of structural risk minimization. It handles classification problems involving small samples, nonlinearity and high dimensionality well, generalizes well, and is widely applied to pattern recognition and classification estimation; owing to its excellent classification ability and good generalization performance it has become a common classification model in speech recognition.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: how to improve the speech recognition effect.
The technical scheme adopted by the invention is as follows: an anti-noise speech recognition system, comprising the steps of:
Step one: window and frame the speech signal s(n), then apply the discrete Fourier transform to obtain the magnitude and phase angle of the speech signal.
The speech signal s(n) is windowed with a Hamming window w(n):
w(n) = 0.54 - 0.46·cos(2πn/(N-1)), 0 <= n <= N-1
The speech signal s(n) is multiplied by the window function w(n) to form the windowed speech signal x(n):
x(n) = s(n)*w(n)
The windowed speech signal x(n) is divided into frames and written x_n(t), where n is the frame index, t is the time index within the frame, and N is the frame length.
The discrete Fourier transform is applied to each framed speech signal x_n(t):
X(n,k) = Σ_{t=0}^{N-1} x_n(t)·e^{-j·2πkt/N}, k = 0, 1, ..., N-1
where j is the imaginary unit and k is the harmonic component index. The short-time amplitude spectrum of the windowed speech signal is then estimated as |X(n,k)|, and the phase angle is
θ(n,k) = arctan( Im[X(n,k)] / Re[X(n,k)] )
|X(n,k)| is taken as the magnitude of the speech signal and θ(n,k) as its phase angle.
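For illustration, a minimal MATLAB sketch of this step is given below; the frame length N, the frame shift and the input speech vector s are assumed values that the text does not fix.
% Minimal sketch of step one: Hamming windowing, framing and per-frame DFT.
N   = 256;                                  % frame length (assumed)
hop = 128;                                  % frame shift (assumed)
w   = hamming(N);                           % Hamming window w(n)
frames = buffer(s, N, N - hop, 'nodelay');  % one frame per column of the speech vector s
X = fft(frames .* repmat(w, 1, size(frames, 2)), N);   % DFT of every windowed frame
mag   = abs(X);                             % |X(n,k)|, short-time amplitude spectrum
phase = angle(X);                           % phase angle of the speech signal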
Step two: compute the average energy of the noise segment, and obtain the power spectrum of the estimated signal by spectral subtraction.
The duration of the leading noise-only segment is IS and its corresponding number of frames is NIS; the average energy of the noise segment is
D(k) = (1/NIS)·Σ_{n=1}^{NIS} |X(n,k)|^2
The power spectrum of the estimated signal is obtained by the following spectral subtraction:
|X̂(n,k)|^2 = |X(n,k)|^2 - a1·D(k)  if |X(n,k)|^2 >= a1·D(k),  and  |X̂(n,k)|^2 = b1·D(k)  otherwise
where a1 and b1 are two constants: a1 is the over-subtraction factor and b1 is the gain compensation factor.
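A minimal MATLAB sketch of the over-subtraction rule follows; the number of noise-only frames NIS and the factors a1 and b1 are illustrative assumptions, and mag is the amplitude spectrum from the sketch of step one.
% Sketch of step two: average noise power from the leading noise-only frames,
% then over-subtraction with a gain-compensation floor.
NIS = 6;                                    % number of leading noise-only frames (assumed)
a1  = 3;                                    % over-subtraction factor (assumed)
b1  = 0.01;                                 % gain compensation factor (assumed)
D   = mean(mag(:, 1:NIS).^2, 2);            % average noise power spectrum D(k)
nFrames = size(mag, 2);
Psub   = mag.^2 - a1 * repmat(D, 1, nFrames);   % subtract the scaled noise power
Pfloor = b1 * repmat(D, 1, nFrames);
Phat   = max(Psub, Pfloor);                 % estimated power spectrum after spectral subtraction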
Step three: reconstruct the signal with the phase angle information from before spectral subtraction, obtaining the spectrally subtracted speech sequence.
The spectrally subtracted power spectrum |X̂(n,k)|^2 is combined with the pre-subtraction phase angle θ(n,k), and an IFFT brings the frequency domain back to the time domain, giving the spectrally subtracted speech sequence x̂(n):
x̂_n(t) = IFFT( sqrt(|X̂(n,k)|^2)·e^{j·θ(n,k)} )
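A minimal MATLAB sketch of the reconstruction follows, reusing Phat, phase, N, hop and nFrames from the previous sketches; the plain overlap-add synthesis is an assumption about how the frames are recombined.
% Sketch of step three: amplitude from the subtracted power spectrum, original phase,
% inverse DFT frame by frame, then overlap-add back to a time-domain sequence.
Xhat = sqrt(Phat) .* exp(1i * phase);       % |Xhat(n,k)| combined with the pre-subtraction phase
framesHat = real(ifft(Xhat, N));            % back to the time domain, one frame per column
xhat = zeros((nFrames - 1) * hop + N, 1);
for m = 1:nFrames                           % simple overlap-add (assumed synthesis method)
    idx = (m - 1) * hop + (1:N);
    xhat(idx) = xhat(idx) + framesHat(:, m);
end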
Step four: from the spectrally subtracted speech sequence x̂(n), extract the cochlear filter cepstral coefficient (CFCC) features and their first-order difference ΔCFCC with a nonlinear power function that simulates the auditory characteristics of the human ear, and mix the features with a dimension-screening method.
The auditory transform simulates the auditory mechanism of the human ear: the cochlear filter function is used as a new wavelet basis function, and the filtering is realized through the wavelet transform.
After the auditory transform, the output of the spectrally subtracted speech sequence x̂(t) over a given frequency band is:
T(a2,b2) = ∫ x̂(t)·ψ_{a2,b2}(t) dt
where ψ_{a2,b2}(t) is the cochlear filter function, whose expression is:
ψ_{a2,b2}(t) = (1/sqrt(a2))·((t - b2)/a2)^α · exp(-2π·f_L·β·(t - b2)/a2) · cos(2π·f_L·(t - b2)/a2 + θ) · u(t - b2)
In the above formula α > 0 and β > 0, and the values of α and β determine the frequency-domain shape and width of the cochlear filter function; u(t) is the unit step function, b2 is a real number varying with time, a2 is the scale variable and θ is the initial phase. In general a2 is determined by the center frequency f_c of the filter in the bank and the lowest center frequency f_L:
a2 = f_L / f_c
α and β are chosen empirically; here β = 0.2.
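The MATLAB sketch below builds one cochlear filter of the kind described above and applies the auditory transform by convolution; the values of α and θ, the sampling rate, the center frequencies and the 30 ms filter support are illustrative assumptions, and xhat is the spectrally subtracted sequence from the sketch of step three.
% Sketch of one band of the auditory transform; alpha, theta, fs, fL and fc are assumed.
alpha = 3;  beta = 0.2;  theta = 0;         % shape parameters (alpha and theta are assumptions)
fs = 16000;  fL = 50;  fc = 1000;           % sampling rate, lowest and current center frequency (assumed)
a2 = fL / fc;                               % scale variable a2 = fL / fc
t  = (0 : round(0.03 * fs) - 1)' / fs;      % 30 ms support for the impulse response (assumed)
psi = (t / a2).^alpha .* exp(-2*pi*fL*beta .* (t / a2)) ...
      .* cos(2*pi*fL .* (t / a2) + theta) / sqrt(a2);   % cochlear filter (u(t) = 1 for t >= 0)
T = conv(xhat, psi, 'same');                % output of this frequency band, T(a2, b2)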
The inner hair cells of the human cochlea convert the speech signal output by the auditory transform into an electrical signal that the human brain can analyze:
h(a2,b2) = [T(a2,b2)]^2
According to the auditory characteristics of the human ear, the duration of the auditory nerve's response to a sound becomes shorter as the frequency increases, which shows that the human ear is more sensitive to high-frequency transient components; the time smoothing window of a cochlear filter with a higher center frequency therefore needs to be shortened appropriately. Different window lengths are chosen for different frequency bands, and the mean value of the hair-cell function of the i-th band can be expressed as:
S(i,w) = (1/d)·Σ_{b2=(w-1)·L+1}^{(w-1)·L+d} h(i,b2)
where d = max{3.5·τ_p, 20 ms} is the smoothing window length of the i-th band, τ_p = 1/f_c is the time length (period) of the center frequency of the p-th filter band, L = d/2 is the frame shift, and w is the window index.
The hair-cell output then undergoes a loudness transformation through a nonlinear power function, converting the energy value into a perceived loudness; the perceived loudness of the i-th band can be expressed as:
y(i,w) = [S(i,w)]^0.101
Finally the resulting features are decorrelated with a discrete cosine transform to obtain the CFCC feature parameters:
CFCC(n1) = Σ_{i=1}^{M} y(i,w)·cos( π·n1·(2i - 1) / (2M) )
where n1 is the order of the CFCC feature and M is the number of channels of the cochlear filter.
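A MATLAB sketch of the hair-cell smoothing, the 0.101 power-law loudness and the DCT decorrelation follows; H (an assumed M-by-numSamples matrix of squared band outputs h(i,·)), the assumed center frequencies fc_list and the use of a common 10 ms frame shift for all bands (instead of the band-dependent L = d/2 above) are simplifications made only for illustration, and fs is reused from the previous sketch.
% Sketch of the CFCC computation from the hair-cell outputs.
M = 24;  numCFCC = 16;                      % number of channels (assumed) / CFCC order (16 as in the text)
fc_list = logspace(log10(200), log10(6000), M);   % center frequencies of the bank (assumed)
shift = round(0.010 * fs);                  % common frame shift of 10 ms (simplifying assumption)
nWin = floor((size(H, 2) - round(0.020 * fs)) / shift) + 1;
y = zeros(M, nWin);                         % perceived loudness per band and window
for i = 1:M
    d = max(round(3.5 * fs / fc_list(i)), round(0.020 * fs));   % d = max{3.5*tau, 20 ms} in samples
    for w = 1:nWin
        seg = H(i, (w - 1) * shift + 1 : min((w - 1) * shift + d, size(H, 2)));
        y(i, w) = mean(seg) ^ 0.101;        % S(i,w) mapped to perceived loudness y(i,w)
    end
end
C = dct(y);                                 % DCT down the channel dimension decorrelates the bands
cfcc = C(1:numCFCC, :);                     % keep the first 16 CFCC coefficients per window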
After the CFCC parameters are extracted, the first-order difference coefficients are computed:
d_x(n1) = ( Σ_{k'=1}^{k} k'·( CFCC_{x+k'}(n1) - CFCC_{x-k'}(n1) ) ) / ( 2·Σ_{k'=1}^{k} k'^2 )
where d_x(n1) denotes the n1-th order coefficient of the first-order difference CFCC parameters of the x-th frame of the speech signal, and k is a constant, generally taken as k = 2.
After the 16th-order CFCC and ΔCFCC have each been extracted, dimension screening is carried out on the features, and the parts that best characterize the speech are selected for feature mixing.
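A MATLAB sketch of the first-order difference follows, using the common regression form of the delta coefficients; because the exact normalization is in an unreadable figure, the denominator and the simple concatenation used here in place of the dimension screening are assumptions.
% Sketch of the first-order difference (Delta-CFCC) with a +/-2 frame window (k = 2).
K = 2;
nFr = size(cfcc, 2);
dcfcc = zeros(size(cfcc));
denom = 2 * sum((1:K).^2);                  % assumed regression normalization
for x = 1:nFr
    acc = zeros(size(cfcc, 1), 1);
    for kk = 1:K
        prev = cfcc(:, max(x - kk, 1));     % clamp at the utterance boundaries
        nxt  = cfcc(:, min(x + kk, nFr));
        acc  = acc + kk * (nxt - prev);
    end
    dcfcc(:, x) = acc / denom;
end
mixed = [cfcc; dcfcc];                      % plain concatenation as a stand-in for the dimension screening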
Step five: add the TEOCC on the basis of the CFCC + ΔCFCC features to form the fused feature.
For each frame of the speech signal x(n), its TEO energy is computed:
ψ[x(n)] = x(n)^2 - x(n+1)·x(n-1)
The TEO energy is normalized and its logarithm is taken; finally a DCT is applied to obtain the one-dimensional TEOCC.
The one-dimensional TEOCC feature is added as the last dimension of the mixed feature vector.
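A MATLAB sketch of the one-dimensional TEOCC follows; the exact normalization and the choice of which DCT coefficient to keep are assumptions, since the corresponding figure is not recoverable, and mixed and frames come from the earlier sketches.
% Sketch of step five: Teager energy per sample, normalized, logged, DCT'd,
% and one coefficient appended as the last dimension of the mixed feature vector.
xfr = frames(:, 1);                                   % one speech frame (illustrative)
teo = xfr(2:end-1).^2 - xfr(3:end) .* xfr(1:end-2);   % psi[x(n)] = x(n)^2 - x(n+1)*x(n-1)
teo = abs(teo) / max(abs(teo) + eps);                 % normalize (assumed scaling)
c   = dct(log(teo + eps));                            % logarithm followed by DCT
teocc = c(1);                                         % keep a single (one-dimensional) coefficient
featureVec = [mixed(:, 1); teocc];                    % append as the last dimension (assumed layout)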
Step six: apply data normalization to the fused features to form a normalized training set and a normalized test set, and label the two sets to obtain the training-set labels and the test-set labels.
Let y_i be any data sample in the feature training set or the feature test set; after normalization, the corresponding sample in the normalized training set or test set is:
y_i' = (y_i - y_min) / (y_max - y_min)
where y_min and y_max are the minimum and maximum of y_i, respectively.
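A MATLAB sketch of the min-max normalization follows; the layout (one sample per row in the assumed matrices trainFeat and testFeat) and the reuse of the training-set extrema for the test set are assumptions.
% Sketch of step six: per-dimension min-max normalization of the fused features.
ymin = min(trainFeat, [], 1);                         % minimum of each feature dimension
ymax = max(trainFeat, [], 1);                         % maximum of each feature dimension
span = ymax - ymin;  span(span == 0) = 1;             % guard against constant dimensions
trainNorm = (trainFeat - repmat(ymin, size(trainFeat, 1), 1)) ./ repmat(span, size(trainFeat, 1), 1);
testNorm  = (testFeat  - repmat(ymin, size(testFeat, 1), 1))  ./ repmat(span, size(testFeat, 1), 1);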
Step seven: reduce the dimensionality of the normalized training set with PCA and feed it into an SVM model to obtain the recognition accuracy.
The dimension-reduced speech features are divided into a training set train_data and a test set test_data, to which the training-set labels train_label and the test-set labels test_label are added respectively; the training set is input to the support vector machine (SVM) to build the model:
model = svmtrain(train_label, train_data)
The established model is then used to test the test set, giving the recognition accuracy:
accuracy = svmpredict(test_label, test_data, model)
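A MATLAB sketch of this final stage follows, using princomp (the PCA routine available in MATLAB R2011a) and the libsvm MATLAB interface svmtrain/svmpredict; the 95% variance cut-off and the RBF-kernel options are assumptions, and trainNorm, testNorm, train_label and test_label are taken from the preceding steps.
% Sketch of step seven: PCA dimensionality reduction followed by SVM training and testing.
[coeff, score, latent] = princomp(trainNorm);            % PCA on the normalized training set
nDim = find(cumsum(latent) / sum(latent) >= 0.95, 1);    % keep 95% of the variance (assumed)
train_data = score(:, 1:nDim);
mu = mean(trainNorm, 1);
test_data  = (testNorm - repmat(mu, size(testNorm, 1), 1)) * coeff(:, 1:nDim);
model = svmtrain(train_label, train_data, '-s 0 -t 2');  % C-SVC with an RBF kernel (assumed options)
[~, accuracy, ~] = svmpredict(test_label, test_data, model);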
The beneficial effects of the invention are as follows: by introducing spectral subtraction at the front end of feature extraction, the influence of noise on the speech signal is reduced; a nonlinear power function simulating the auditory characteristics of the human ear is used to extract the CFCC and its first-order difference coefficients; the TEOCC, which represents the energy of the speech signal, is added on this basis to form the fused feature; principal component analysis is applied to the fused feature for feature selection, and the SVM model built on the selected features is applied to the speech recognition system, giving higher recognition accuracy, stronger robustness and faster recognition.
Detailed Description
In the invention, a Windows 7 system is used as the program development environment and MATLAB R2011a as the development platform. In this embodiment, for 10 isolated words at a signal-to-noise ratio of 0 dB, the 270 speech samples produced by 9 speakers pronouncing each word three times are used as the training set, and the 210 speech samples from 7 speakers for the same vocabulary at the same signal-to-noise ratio are used as the test set.
Steps one to seven are then carried out exactly as described in the Disclosure of the Invention above: spectral subtraction at the front end of feature extraction, extraction of the CFCC and ΔCFCC features with the nonlinear power function and their mixing by dimension screening, addition of the one-dimensional TEOCC, data normalization, PCA dimensionality reduction, and SVM model building and testing on the training and test sets defined above.
The resulting accuracy is the classification accuracy on the test-set samples; for this embodiment the speech recognition accuracy on the test set is 88.10%.

Claims (1)

1. An anti-noise speech recognition system, characterized in that it proceeds according to the following steps:

Step one: window and frame the speech signal s(n), then apply the discrete Fourier transform to obtain the magnitude and phase angle of the speech signal.
The speech signal s(n) is windowed with a Hamming window w(n):
w(n) = 0.54 - 0.46·cos(2πn/(N-1)), 0 <= n <= N-1
The speech signal s(n) is multiplied by the window function w(n) to give the windowed speech signal x(n) = s(n)*w(n); after framing, the speech signal is written x_n(t), where n is the frame index, t is the time index of frame synchronization and N is the frame length. The discrete Fourier transform of each framed signal x_n(t) is
X(n,k) = Σ_{t=0}^{N-1} x_n(t)·e^{-j·2πkt/N}, k = 0, 1, ..., N-1
where j is the imaginary unit; the short-time amplitude spectrum |X(n,k)| is the magnitude of the speech signal and θ(n,k) is its phase angle.

Step two: compute the average energy of the noise segment and obtain the power spectrum of the estimated signal by spectral subtraction.
The duration of the noise segment is IS and its corresponding number of frames is NIS; the average energy of the noise segment is
D(k) = (1/NIS)·Σ_{n=1}^{NIS} |X(n,k)|^2
and the power spectrum of the estimated signal is obtained by the spectral subtraction
|X̂(n,k)|^2 = |X(n,k)|^2 - a1·D(k) when |X(n,k)|^2 >= a1·D(k), and |X̂(n,k)|^2 = b1·D(k) otherwise,
where a1 and b1 are two constants, a1 being the over-subtraction factor and b1 the gain compensation factor.

Step three: reconstruct the signal with the phase angle information from before spectral subtraction to obtain the spectrally subtracted speech sequence.
The subtracted power spectrum |X̂(n,k)|^2 is combined with the pre-subtraction phase angle θ(n,k), and an IFFT restores the frequency domain to the time domain, giving the spectrally subtracted speech sequence x̂(n).

Step four: from the spectrally subtracted speech sequence x̂(n), extract the cochlear filter cepstral coefficient features CFCC and their first-order difference ΔCFCC with a nonlinear power function that simulates the auditory characteristics of the human ear, and mix the features with a dimension-screening method.
The auditory transform simulates the auditory mechanism of the human ear, taking the cochlear filter function as a new wavelet basis function and realizing the filtering through the wavelet transform; the output T(a2,b2) of x̂(n) over a given frequency band is obtained with the cochlear filter function, whose parameters α and β determine its frequency-domain shape and width, u(t) is the unit step function, b2 is a real number varying with time, a2 is the scale variable determined by the center frequency f_c and the lowest center frequency f_L of the filter bank as a2 = f_L/f_c, θ is the initial phase, and β takes the empirical value β = 0.2.
The inner hair cells of the human cochlea convert the speech signal T(a2,b2) output by the auditory transform into an electrical signal h(a2,b2) = [T(a2,b2)]^2 that the human brain can analyze. According to the auditory characteristics of the human ear, the duration of the auditory nerve's response to a sound becomes shorter as the frequency increases, showing that the human ear is more sensitive to high-frequency transient components, so the time smoothing window of a cochlear filter with a higher center frequency is shortened appropriately; with different window lengths for different bands, the mean value of the hair-cell function of the i-th band is S(i,w), where d = max{3.5·τ_p, 20 ms} is the smoothing window length of the i-th band, τ_p = 1/f_c is the time length of the center frequency of the p-th filter band, L = d/2 is the frame shift and w is the number of windows.
The hair-cell output completes a loudness transformation through the nonlinear power function, turning the energy value into the perceived loudness y(i,w) = [S(i,w)]^0.101 of the i-th band; finally a discrete cosine transform decorrelates the result to give the CFCC feature parameters, where n1 is the order of the CFCC feature and M is the number of cochlear filter channels. After the CFCC parameters are extracted, the first-order difference coefficients are computed, d_x(n1) denoting the n1-th order coefficient of the first-order difference CFCC parameters of the x-th frame, with the constant k = 2. After the 16th-order CFCC and ΔCFCC have been extracted, dimension screening is performed on the features and the parts that best characterize the speech are selected for feature mixing.

Step five: on the basis of the CFCC + ΔCFCC features, add the TEOCC to form the fused feature.
For each frame of the speech signal x(n) the TEO energy ψ[x(n)] = x(n)^2 - x(n+1)·x(n-1) is computed, normalized and its logarithm taken; finally a DCT yields the one-dimensional TEOCC, which is added as the last dimension of the mixed feature vector.

Step six: apply data normalization to the fused features to form a normalized training set and a normalized test set, and label the two sets to obtain the training-set labels and the test-set labels.
For any data sample y_i in the feature training set or feature test set, the corresponding normalized sample is y_i' = (y_i - y_min)/(y_max - y_min), where y_min and y_max are the minimum and maximum of y_i respectively.

Step seven: reduce the dimensionality of the normalized training set with PCA and feed it into the SVM model to obtain the recognition accuracy.
The dimension-reduced speech features are divided into a training set train_data and a test set test_data, to which the labels train_label and test_label are added respectively; the training set is input to the SVM to build the model, model = svmtrain(train_label, train_data), and the established model is used to test the test set to obtain the recognition accuracy, accuracy = svmpredict(test_label, test_data, model).
CN201810311359.9A 2018-04-09 2018-04-09 An anti-noise speech recognition system Active CN108564965B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810311359.9A CN108564965B (en) 2018-04-09 2018-04-09 An anti-noise speech recognition system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810311359.9A CN108564965B (en) 2018-04-09 2018-04-09 An anti-noise speech recognition system

Publications (2)

Publication Number Publication Date
CN108564965A CN108564965A (en) 2018-09-21
CN108564965B true CN108564965B (en) 2021-08-24

Family

ID=63534360

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810311359.9A Active CN108564965B (en) 2018-04-09 2018-04-09 An anti-noise speech recognition system

Country Status (1)

Country Link
CN (1) CN108564965B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109256127B (en) * 2018-11-15 2021-02-19 江南大学 A Robust Speech Feature Extraction Method Based on Nonlinear Power Transform Gammachirp Filter
CN110808059A (en) * 2019-10-10 2020-02-18 天津大学 Speech noise reduction method based on spectral subtraction and wavelet transform
CN111142084B (en) * 2019-12-11 2023-04-07 中国电子科技集团公司第四十一研究所 Micro terahertz spectrum identification and detection algorithm
CN113205823A (en) * 2021-04-12 2021-08-03 广东技术师范大学 Lung sound signal endpoint detection method, system and storage medium
CN113325752B (en) * 2021-05-12 2022-06-14 北京戴纳实验科技有限公司 Equipment management system
CN114422313B (en) * 2021-12-22 2023-08-01 西安电子科技大学 A frame detection method

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100789084B1 (en) * 2006-11-21 2007-12-26 한양대학교 산학협력단 Sound Quality Improvement Method by Overweight Gain of Nonlinear Structure in Wavelet Packet Domain
JP2012032648A (en) * 2010-07-30 2012-02-16 Sony Corp Mechanical noise reduction device, mechanical noise reduction method, program and imaging apparatus
CN102456351A (en) * 2010-10-14 2012-05-16 清华大学 Voice enhancement system
CN103985390A (en) * 2014-05-20 2014-08-13 北京安慧音通科技有限责任公司 Method for extracting phonetic feature parameters based on gammatone relevant images
CN107248414A (en) * 2017-05-23 2017-10-13 清华大学 A kind of sound enhancement method and device based on multiframe frequency spectrum and Non-negative Matrix Factorization
CN107845390A (en) * 2017-09-21 2018-03-27 太原理工大学 A kind of Emotional speech recognition system based on PCNN sound spectrograph Fusion Features

Also Published As

Publication number Publication date
CN108564965A (en) 2018-09-21

Similar Documents

Publication Publication Date Title
CN108564965B (en) An anti-noise speech recognition system
Ancilin et al. Improved speech emotion recognition with Mel frequency magnitude coefficient
CN108597496B (en) Voice generation method and device based on generation type countermeasure network
Li et al. An auditory-based feature extraction algorithm for robust speaker identification under mismatched conditions
CN109256127B (en) A Robust Speech Feature Extraction Method Based on Nonlinear Power Transform Gammachirp Filter
WO2020034628A1 (en) Accent identification method and device, computer device, and storage medium
CN113012720B (en) Depression detection method by multi-voice feature fusion under spectral subtraction noise reduction
CN108198545B (en) A Speech Recognition Method Based on Wavelet Transform
Rammo et al. Detecting the speaker language using CNN deep learning algorithm
CN108305639B (en) Speech emotion recognition method, computer-readable storage medium, and terminal
CN110931023B (en) Gender identification method, system, mobile terminal and storage medium
CN108682432B (en) Voice emotion recognition device
CN108198566B (en) Information processing method and device, electronic device and storage medium
CN102664010A (en) Robust speaker distinguishing method based on multifactor frequency displacement invariant feature
Waghmare et al. Emotion recognition system from artificial marathi speech using MFCC and LDA techniques
CN109065073A (en) Speech-emotion recognition method based on depth S VM network model
CN105679321B (en) Voice recognition method, device and terminal
CN111508504B (en) Speaker recognition method based on auditory center perception mechanism
CN113421584A (en) Audio noise reduction method and device, computer equipment and storage medium
CN103559893B (en) One is target gammachirp cepstrum coefficient aural signature extracting method under water
Cheng et al. DNN-based speech enhancement with self-attention on feature dimension
CN115346561A (en) Method and system for evaluating and predicting depression based on speech features
CN112863517B (en) Speech Recognition Method Based on Convergence Rate of Perceptual Spectrum
Zouhir et al. A bio-inspired feature extraction for robust speech recognition
Islam et al. Noise-robust text-dependent speaker identification using cochlear models

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
CB03 Change of inventor or designer information

Inventor after: Xue Peiyun

Inventor after: Shi Yanyan

Inventor after: Bai Jing

Inventor after: Guo Qianyan

Inventor before: Bai Jing

Inventor before: Shi Yanyan

Inventor before: Xue Peiyun

Inventor before: Guo Qianyan

GR01 Patent grant
GR01 Patent grant