CN108564965B - Anti-noise voice recognition system - Google Patents

Anti-noise voice recognition system

Info

Publication number
CN108564965B
Authority
CN
China
Prior art keywords
cfcc
signal
training set
speech signal
auditory
Prior art date
Legal status
Active
Application number
CN201810311359.9A
Other languages
Chinese (zh)
Other versions
CN108564965A (en)
Inventor
Xue Peiyun
Shi Yanyan
Bai Jing
Guo Qianyan
Current Assignee
Taiyuan University of Technology
Original Assignee
Taiyuan University of Technology
Priority date
Filing date
Publication date
Application filed by Taiyuan University of Technology
Priority to CN201810311359.9A
Publication of CN108564965A
Application granted
Publication of CN108564965B


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/45: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of analysis window
    • G10L15/063: Speech recognition; creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L21/0208: Processing of the speech or voice signal to produce another audible or non-audible signal; speech enhancement, e.g. noise reduction or echo cancellation; noise filtering
    • G10L25/24: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique

Abstract

The invention relates to the technical field of voice recognition. An anti-noise voice recognition system: the voice signal is windowed and framed and then subjected to a discrete Fourier transform to obtain its amplitude and phase angle; the power spectrum of the estimated signal is obtained by a spectral-subtraction operation; the signal is reconstructed with the phase-angle information from before spectral subtraction to obtain the spectrally subtracted speech sequence; for this new speech sequence, a nonlinear power function is used to simulate the auditory characteristics of the human ear and extract the cochlear filter cepstral coefficient features CFCC and their first-order difference ΔCFCC, and the features are mixed by a dimension-screening method; the fused features are normalized to obtain a training-set label and a test-set label; the normalized training set is reduced in dimensionality with PCA (principal component analysis) and fed into an SVM (support vector machine) model to obtain the recognition accuracy.

Description

Anti-noise voice recognition system
Technical Field
The invention relates to the technical field of voice recognition.
Background
With the rapid development of information technology, human-computer interaction receives more and more attention; speech recognition has become a key technology of human-computer interaction and a research focus in the field. Speech recognition is the technology by which a computer converts speech signals into corresponding text or commands by extracting and analyzing the semantic information in human speech, and it is widely applied in fields such as industry, household appliances, communications, automotive electronics, medical care, home services and consumer electronics.
However, speech signals are particularly susceptible to noise, and every link from acquisition through transmission to playback may be affected by it. Spectral subtraction is one of the speech-enhancement techniques; it is simple to compute and easy to implement.
Currently the most mainstream feature parameter in speech recognition is the Mel-frequency cepstral coefficient (MFCC). MFCC features are extracted on the basis of the Fourier transform, yet the Fourier transform is really only suited to processing stationary signals. The auditory transformation, as a new method for processing non-stationary speech signals, makes up for this deficiency of the Fourier transform and offers low harmonic distortion and good spectral smoothness. Cochlear filter cepstral coefficients (CFCC) were first proposed in 2011 by Dr. Peter Li of Bell Labs and applied to speaker recognition; they were the first feature to use the auditory transformation. Although many scholars have studied CFCC features, the traditional CFCC extraction method does not exploit the fact that a nonlinear power function, derived from the saturation relation between the firing rate of neuron action potentials and sound intensity, can approximate the auditory neuron-intensity curve; therefore a nonlinear power function that simulates this auditory characteristic of the human ear is adopted here to extract new CFCC features.
A complete speech signal contains both frequency information and energy information. The Teager energy operator, a nonlinear difference operator, can eliminate the influence of zero-mean noise and has a speech-enhancement effect; used for feature extraction it better reflects the energy changes of the speech signal, suppresses noise while enhancing the speech, and gives good results in speech recognition.
The support vector machine (SVM) is a machine-learning technique based on the principle of structural risk minimization. It handles classification problems involving small samples, nonlinearity and high dimensionality well, generalizes well, and is widely applied to pattern recognition and classification estimation; owing to its excellent classification ability and good generalization performance it has become a commonly used classification model in speech recognition.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: how to improve the speech recognition effect.
The technical scheme adopted by the invention is as follows: an anti-noise speech recognition system, comprising the steps of:
step one, performing windowing and framing on voice signals s (n), and then performing discrete Fourier transform to obtain the amplitude and phase angle of the voice signals
Windowing the speech signal s (n), the window function used being the hamming window w (n):
Figure BDA0001622437570000011
multiplying the speech signal s (n) by a window function w (n) to form a windowed speech signal x (n)
x(n)=s(n)*w(n)
The windowed speech signal x (n) is subjected to framing processing, and then the speech signal x (n) is expressed as xn(t), wherein N is a frame number, t is a frame synchronization time number, and N is a frame length;
for the framed speech signal xn(t) performing a discrete fourier transform:
Figure BDA0001622437570000021
where j denotes a complex number, e is a constant, pi is a constant, and the harmonic component number k is 0, 1., N-1, then the short-time amplitude spectrum of the windowed speech signal X (N) is estimated to be | X (N, k) |, and the phase angle is:
Figure BDA0001622437570000022
the value of | X (n, k) | is expressed as the amplitude of the voice signal,
Figure BDA0001622437570000023
the value of (d) is expressed as the phase angle of the speech signal;
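As an illustration of step one, a minimal MATLAB sketch is given below (MATLAB being the platform named later in the embodiment); the frame length N = 256, the 50% frame shift and the variable name s for the input samples are assumptions for the sketch, not values fixed by the invention:

    % Step one sketch: framing, Hamming windowing and DFT of every frame.
    % s is assumed to hold the input noisy speech samples; N and the 50% shift are illustrative.
    s = s(:);                                   % ensure a column vector
    N = 256; shift = N/2;                       % frame length and frame shift (assumed)
    w = 0.54 - 0.46*cos(2*pi*(0:N-1)'/(N-1));   % Hamming window w(n)
    numFrames = floor((length(s) - N)/shift) + 1;
    X = zeros(N, numFrames);
    for n = 1:numFrames
        frame = s((n-1)*shift + (1:N)) .* w;    % windowed frame x_n(t)
        X(:, n) = fft(frame);                   % discrete Fourier transform X(n,k)
    end
    mag   = abs(X);                             % short-time amplitude spectrum |X(n,k)|
    phase = angle(X);                           % phase angle of every bin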
Step two, calculate the average energy of the noise segment and obtain the power spectrum of the estimated signal by spectral subtraction.
The duration of the noise segment is IS and its corresponding number of frames is NIS; the average energy of the noise segment is:
D(k) = (1/NIS)·Σ_{n=1}^{NIS} |X(n,k)|²
The power spectrum of the estimated signal is obtained by the following spectral-subtraction operation:
|X̂(n,k)|² = |X(n,k)|² - a1·D(k), if |X(n,k)|² ≥ a1·D(k); |X̂(n,k)|² = b1·D(k), otherwise
where a1 and b1 are two constants, a1 being the over-subtraction factor and b1 the gain compensation factor;
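A corresponding sketch of the spectral-subtraction step, continuing from the previous sketch; the number of leading noise-only frames NIS and the values a1 = 4 and b1 = 0.001 are illustrative assumptions only:

    % Step two sketch: noise power estimate from leading noise-only frames and over-subtraction.
    % NIS, a1 and b1 are illustrative values; mag comes from the step-one sketch.
    NIS = 10; a1 = 4; b1 = 0.001;
    D    = mean(mag(:, 1:NIS).^2, 2);           % average noise power spectrum D(k)
    Drep = repmat(D, 1, size(mag, 2));
    P    = mag.^2 - a1*Drep;                    % over-subtraction of the scaled noise estimate
    P(P < 0) = b1*Drep(P < 0);                  % gain-compensation floor b1*D(k) where subtraction over-shoots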
Step three, reconstruct the signal using the phase-angle information from before spectral subtraction to obtain the spectrally subtracted speech sequence.
The spectrally subtracted power spectrum |X̂(n,k)|² is combined with the phase angle ∠X(n,k) from before spectral subtraction and an IFFT is performed to return from the frequency domain to the time domain, giving the spectrally subtracted speech sequence x̂_n(t):
x̂_n(t) = IFFT[ |X̂(n,k)|·e^(j·∠X(n,k)) ]
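The reconstruction of step three might look as follows; overlap-add resynthesis is one common choice and is assumed here, as are the variables P, phase, N, shift and numFrames from the sketches above:

    % Step three sketch: enhanced magnitude + original phase, IFFT per frame, overlap-add.
    Y    = sqrt(P) .* exp(1i*phase);            % enhanced spectrum with the pre-subtraction phase
    xHat = zeros((numFrames-1)*shift + N, 1);
    for n = 1:numFrames
        idx = (n-1)*shift + (1:N);
        xHat(idx) = xHat(idx) + real(ifft(Y(:, n)));   % back to the time domain, overlap-added
    end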
Step four, for the spectrally subtracted speech sequence x̂_n(t), simulate the auditory characteristics of the human ear with a nonlinear power function to extract the cochlear filter cepstral coefficient feature CFCC and its first-order difference ΔCFCC, and mix the features with a dimension-screening method.
The auditory transformation simulates the auditory mechanism of the human ear: taking the cochlear filter function as a new wavelet basis function, filtering is realized by means of the wavelet transform.
The output of the spectrally subtracted speech sequence x̂(t) over a given frequency band after the auditory transformation is:
T(a2,b2) = ∫ x̂(t)·ψ_(a2,b2)(t) dt
where ψ_(a2,b2)(t) is the cochlear filter function, whose expression is:
ψ_(a2,b2)(t) = (1/√a2)·((t-b2)/a2)^α·exp(-2π·f_L·β·(t-b2)/a2)·cos(2π·f_L·(t-b2)/a2 + θ)·u(t-b2)
In the above formula α > 0 and β > 0, where the values of α and β determine the frequency-domain shape and width of the cochlear filter function, u(t) is the unit step function, b2 is a real number varying with time, a2 is the scale variable, and θ is the initial phase. In general a2 can be derived from the center frequency f_c of the filter bank and the lowest center frequency f_L:
a2 = f_L / f_c
where α is generally taken from an empirical value range, and β is generally an empirical value, here β = 0.2;
The inner hair cell of the human ear cochlea converts the voice signal output by auditory transformation into an electric signal analyzable by the human brain:
h(a2,b2) = [T(a2,b2)]²
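A sketch of the auditory transformation and hair-cell stage for a single frequency band is given below; it assumes the filter form written above, α = 3, a 30 ms filter support, a 16 kHz sampling rate and example center frequencies, none of which are prescribed by the invention, and it implements the transform as FIR filtering of the enhanced speech xHat from the step-three sketch:

    % Auditory-transform and hair-cell sketch for one band (illustrative values only).
    fs = 16000; fL = 200; fc = 1000;            % assumed sampling rate, lowest and current center frequency
    alpha = 3; beta = 0.2; theta = 0;           % alpha is an assumed empirical value
    a2  = fL/fc;                                % scale variable a2 = fL/fc
    t   = (0:round(0.03*fs)-1)'/fs;             % 30 ms filter support (assumed)
    u   = t/a2;
    psi = (1/sqrt(a2)) * u.^alpha .* exp(-2*pi*fL*beta*u) .* cos(2*pi*fL*u + theta);
    T   = conv(xHat, psi, 'same');              % auditory transform implemented as FIR filtering
    h   = T.^2;                                 % inner-hair-cell output h(a2, b2)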
According to the auditory characteristics of the human ear, the duration of the auditory nerve's response to a sound shortens gradually as the frequency increases, indicating that the ear is more sensitive to high-frequency transient components; therefore the time-smoothing window length of a cochlear filter with a higher center frequency should be shortened appropriately. Different window lengths are chosen for different frequency bands, and the mean value of the hair-cell function of the i-th frequency band can be expressed as:
S(i,w) = (1/d)·Σ_{b2=(w-1)·L+1}^{(w-1)·L+d} h(i,b2)
where d = max{3.5·τ_p, 20 ms} is the smoothing window length of the i-th band, τ_p is the period of the center frequency of the p-th filter band, τ_p = 1/f_c, L is the frame shift, L = d/2, and w is the window number;
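A sketch of this band-dependent smoothing, assuming h_i holds the hair-cell output of band i and reusing fc and fs from the previous sketch:

    % Band-dependent smoothing sketch: window-averaged hair-cell output of band i.
    d = round(max(3.5/fc, 0.020)*fs);           % smoothing window length, max{3.5*tau_p, 20 ms}, in samples
    L = round(d/2);                             % window shift L = d/2
    W = floor((length(h_i) - d)/L) + 1;         % number of analysis windows
    S_i = zeros(1, W);
    for w = 1:W
        S_i(w) = mean(h_i((w-1)*L + (1:d)));    % mean hair-cell value in window w
    end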
the output of the hair cells completes loudness transformation through a nonlinear power function, the loudness is changed from an energy value to a perceived loudness, and the perceived loudness of the ith frequency band can be expressed as:
y(i,w) = [S(i,w)]^0.101
finally, the obtained characteristics are decorrelated by using discrete cosine transform to obtain CFCC characteristic parameters:
CFCC(n1) = √(2/M)·Σ_{i=1}^{M} y(i,w)·cos(π·n1·(2i-1)/(2M))
wherein n1 is the order of the CFCC characteristic, and M is the channel number of the cochlear filter;
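The loudness power law and the DCT decorrelation could be sketched as follows; S is assumed to be an M-by-W matrix assembled from the per-band outputs above, and the DCT-II normalization shown is an assumption, since the original gives the formula only as an image:

    % Loudness power law and DCT decorrelation into CFCC.
    % S is assumed to be an M-by-W matrix whose i-th row is S(i,w) for band i.
    M = size(S, 1); Wn = size(S, 2); nCoeff = 16;
    y = S.^0.101;                               % perceived loudness y(i,w)
    CFCC = zeros(nCoeff, Wn);
    for n1 = 1:nCoeff
        basis = cos(pi*n1*((1:M)' - 0.5)/M);    % DCT-II basis over the M channels
        CFCC(n1, :) = sqrt(2/M) * sum(y .* repmat(basis, 1, Wn), 1);
    end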
after extracting the CFCC parameters, calculating a first-order difference coefficient:
d_x(n1) = Σ_{i=1}^{k} i·[CFCC_{x+i}(n1) - CFCC_{x-i}(n1)] / (2·Σ_{i=1}^{k} i²)
d_x(n1) denotes the n1-th order coefficient of the first-order difference CFCC parameters of the x-th speech frame, where k is a constant, generally taken as 2;
After 16 orders of CFCC and ΔCFCC are extracted respectively, dimension screening is carried out on the features and the dimensions that best represent the speech characteristics are selected for feature mixing;
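A sketch of the first-order difference and the dimension mixing; the regression-style delta with k = 2 and the particular dimensions kept (12 CFCC and 8 ΔCFCC) are assumptions for illustration, since the patent does not list the screened dimensions:

    % First-order difference (regression delta, k = 2) and mixing of screened dimensions.
    k = 2; Wn = size(CFCC, 2);
    dCFCC = zeros(size(CFCC));
    for x = 1:Wn
        num = zeros(size(CFCC, 1), 1);
        for i = 1:k
            num = num + i*(CFCC(:, min(x+i, Wn)) - CFCC(:, max(x-i, 1)));  % clamp at the ends
        end
        dCFCC(:, x) = num / (2*sum((1:k).^2));
    end
    mixed = [CFCC(1:12, :); dCFCC(1:8, :)];     % mixed CFCC + delta-CFCC feature matrix (assumed split)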
step five, adding TEOCC on the basis of CFCC + delta CFCC characteristics to form fusion characteristics;
for each frame of speech signal x (n), its TEO energy is calculated:
ψ[x(n)] = x(n)² - x(n+1)·x(n-1)
The TEO energy is then normalized and its logarithm is taken, giving the log Teager energy of each frame;
finally, performing DCT transformation to obtain one-dimensional TEOCC;
adding one-dimensional TEOCC characteristics into the last dimension of the mixed characteristic vector;
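Step five might be sketched as follows; frames is assumed to be the matrix of enhanced, framed speech, and the normalization/DCT of the one-dimensional coefficient is reduced here to a per-frame log energy:

    % TEOCC sketch: Teager energy per sample, averaged and log-compressed to one value per frame.
    % frames is assumed to be an N-by-numFrames matrix of enhanced, framed speech.
    teo = frames(2:end-1, :).^2 - frames(3:end, :).*frames(1:end-2, :);   % psi[x(n)]
    teo = max(teo, 0);                          % guard against small negative values before the log
    TEOCC = log(mean(teo, 1) + eps);            % one-dimensional TEOCC of every frame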
step six, performing data normalization processing on the fusion characteristics to form a normalized training set and a normalized test set, and labeling the two sets respectively to obtain a training set label and a test set label;
Let y_i be any data sample in the feature training set or the feature test set; after normalization, the corresponding sample in the normalized training set or test set is:
y_i' = (y_i - y_min) / (y_max - y_min)
where y_min and y_max are the minimum and maximum values of y_i respectively.
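A sketch of the min-max normalization of step six; feats is assumed to be a dimensions-by-samples feature matrix, and in practice the training-set minima and maxima would be reused for the test set:

    % Min-max normalization sketch; feats is a dimensions-by-samples matrix of fused features.
    fmin = min(feats, [], 2);
    fmax = max(feats, [], 2);
    normFeats = (feats - repmat(fmin, 1, size(feats, 2))) ./ ...
                repmat(fmax - fmin + eps, 1, size(feats, 2));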
Step seven, reduce the dimensionality of the normalized training set with PCA and feed the reduced data into an SVM model to obtain the recognition accuracy.
The dimensionality-reduced speech features are divided into a training set train_data and a test set test_data, the training-set label train_label and test-set label test_label are added respectively, and the training set is fed into the SVM (support vector machine) to build a model:
model = svmtrain(train_label, train_data)
The test set is then classified with the established model to obtain the recognition accuracy "accuracy":
accuracy = svmpredict(test_label, test_data, model)
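Step seven could be sketched as below; the PCA is computed directly from the covariance of the training set, the 95% retained-variance threshold is an assumed choice, and svmtrain/svmpredict are the libsvm MATLAB interface functions already named in the text:

    % PCA dimensionality reduction (95% retained variance assumed) followed by libsvm training/testing.
    % train_data and test_data are samples-by-dimensions matrices of normalized features.
    mu = mean(train_data, 1);
    Xc = train_data - repmat(mu, size(train_data, 1), 1);
    [V, E] = eig(cov(Xc));                      % principal axes of the training set
    [ev, order] = sort(diag(E), 'descend');
    V = V(:, order);
    nComp = find(cumsum(ev)/sum(ev) >= 0.95, 1);
    train_pca = Xc * V(:, 1:nComp);
    test_pca  = (test_data - repmat(mu, size(test_data, 1), 1)) * V(:, 1:nComp);
    model = svmtrain(train_label, train_pca);   % libsvm MATLAB interface, as in the text
    [~, acc, ~] = svmpredict(test_label, test_pca, model);   % acc(1) is the recognition accuracy in %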
The invention has the following beneficial effects: by introducing spectral subtraction at the front end of feature extraction, the influence of noise on the speech signal is reduced; a nonlinear power function is adopted to simulate the auditory characteristics of the human ear when extracting the CFCC and its first-order difference coefficients; on this basis the TEOCC, which represents the energy of the speech signal, is added to form the fusion features; feature selection is performed on the fusion features with principal component analysis; and the SVM model built on the selected features is applied to the speech recognition system, giving higher recognition accuracy, stronger robustness and a higher recognition speed.
Detailed Description
In the invention, a Windows 7 system is used as the program-development environment and MATLAB R2011a as the development platform. In this embodiment, 10 isolated words at a signal-to-noise ratio of 0 dB are used: the 270 voice samples produced by 9 speakers pronouncing each word three times form the training set, and the 210 samples of 7 further speakers for the same vocabulary and signal-to-noise ratio form the test set.
Steps one to seven are then carried out exactly as set forth above in the Disclosure of the Invention, using the training and test sets described.
The resulting accuracy value is the classification accuracy on the test-set samples; the speech recognition accuracy obtained for this test set is 88.10%.

Claims (1)

1. An anti-noise speech recognition system, characterized by comprising the following steps:
step one, window and frame the speech signal s(n), then perform a discrete Fourier transform to obtain the amplitude and phase angle of the speech signal:
the speech signal s(n) is windowed, the window function used being the Hamming window w(n):
w(n) = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1; w(n) = 0 otherwise
the speech signal s(n) is multiplied by the window function w(n) to form the windowed speech signal x(n):
x(n) = s(n)·w(n)
the windowed speech signal x(n) is framed and the framed signal is written as x_n(t), where n is the frame number, t is the time index within a frame, and N is the frame length;
the framed speech signal x_n(t) is subjected to a discrete Fourier transform:
X(n,k) = Σ_{t=0}^{N-1} x_n(t)·e^(-j2πkt/N)
where j is the imaginary unit and the harmonic component index k = 0, 1, 2, ..., N-1; the short-time amplitude spectrum estimate of the windowed speech signal, that is, the amplitude of the speech signal, is |X(n,k)|, and
∠X(n,k) = arctan( Im[X(n,k)] / Re[X(n,k)] )
represents the phase angle of the speech signal;
step two, calculate the average energy of the noise segment and obtain the power spectrum of the estimated signal by spectral subtraction:
the duration of the noise segment is IS and its corresponding number of frames is NIS; the average energy of the noise segment is:
D(k) = (1/NIS)·Σ_{n=1}^{NIS} |X(n,k)|²
the power spectrum of the estimated signal is obtained by the following spectral-subtraction operation:
|X̂(n,k)|² = |X(n,k)|² - a1·D(k), if |X(n,k)|² ≥ a1·D(k); |X̂(n,k)|² = b1·D(k), otherwise
where a1 and b1 are two constants, a1 being the over-subtraction factor and b1 the gain compensation factor;
step three, reconstruct the signal using the phase-angle information from before spectral subtraction to obtain the spectrally subtracted speech sequence:
the spectrally subtracted power spectrum |X̂(n,k)|² is combined with the phase angle ∠X(n,k) from before spectral subtraction and an IFFT is performed to return from the frequency domain to the time domain, giving the spectrally subtracted speech sequence x̂_n(t):
x̂_n(t) = IFFT[ |X̂(n,k)|·e^(j·∠X(n,k)) ]
step four, for the spectrally subtracted speech sequence x̂_n(t), simulate the auditory characteristics of the human ear with a nonlinear power function to extract the cochlear filter cepstral coefficient feature CFCC and its first-order difference ΔCFCC, and mix the features with a dimension-screening method:
the auditory transformation simulates the auditory mechanism of the human ear: taking the cochlear filter function as a new wavelet basis function, filtering is realized by means of the wavelet transform;
the output of the spectrally subtracted speech sequence x̂(t) over a given frequency band after the auditory transformation is:
T(a2,b2) = ∫ x̂(t)·ψ_(a2,b2)(t) dt
where ψ_(a2,b2)(t) is the cochlear filter function, whose expression is:
ψ_(a2,b2)(t) = (1/√a2)·((t-b2)/a2)^α·exp(-2π·f_L·β·(t-b2)/a2)·cos(2π·f_L·(t-b2)/a2 + θ)·u(t-b2)
in the above formula the values of α and β determine the frequency-domain shape and width of the cochlear filter function, u(t) is the unit step function, b2 is a real number varying with time, a2 is the scale variable, and θ is the initial phase;
a2 is determined by the center frequency f_c of the filter bank and the lowest center frequency f_L:
a2 = f_L / f_c
where α is taken from an empirical value range and β is taken as an empirical value, β = 0.2;
The inner hair cell of the human ear cochlea converts the voice signal output by auditory transformation into an electric signal analyzable by the human brain:
h(a2,b2) = [T(a2,b2)]²
where h(a2,b2) is the electrical signal analyzable by the human brain and T(a2,b2) is the speech signal output by the auditory transformation; according to the auditory characteristics of the human ear, the duration of the auditory nerve's response to a sound shortens gradually as the frequency increases, indicating that the ear is more sensitive to high-frequency transient components, so the time-smoothing window length of a cochlear filter with a higher center frequency should be shortened appropriately; different window lengths are chosen for different frequency bands, and the mean value of the hair-cell function of the i-th frequency band can be expressed as:
S(i,w) = (1/d)·Σ_{b2=(w-1)·L+1}^{(w-1)·L+d} h(i,b2)
where d = max{3.5·τ_p, 20 ms} is the smoothing window length of the i-th band, τ_p is the period of the center frequency of the p-th filter band, τ_p = 1/f_c, L is the frame shift, L = d/2, and w is the window number;
the output of the hair cells completes a loudness transformation through a nonlinear power function, turning an energy value into a perceived loudness; the perceived loudness of the i-th frequency band can be expressed as:
y(i,w) = [S(i,w)]^0.101
finally, the obtained characteristics are decorrelated by using discrete cosine transform to obtain CFCC characteristic parameters:
CFCC(n1) = √(2/M)·Σ_{i=1}^{M} y(i,w)·cos(π·n1·(2i-1)/(2M))
wherein n1 is the order of the CFCC characteristic, and M is the channel number of the cochlear filter;
after extracting the CFCC parameters, calculating a first-order difference coefficient:
d_x(n1) = Σ_{i=1}^{k} i·[CFCC_{x+i}(n1) - CFCC_{x-i}(n1)] / (2·Σ_{i=1}^{k} i²)
d_x(n1) denotes the n1-th order coefficient of the first-order difference CFCC parameters of the x-th speech frame, where k is a constant, taken as 2;
after 16 orders of CFCC and ΔCFCC are extracted respectively, dimension screening is carried out on the features and the dimensions that best represent the speech characteristics are selected for feature mixing;
step five, adding TEOCC on the basis of CFCC + delta CFCC characteristics to form fusion characteristics;
for each frame of speech signal x (n), its TEO energy is calculated:
ψ[x(n)] = x(n)² - x(n+1)·x(n-1)
the TEO energy is then normalized and its logarithm is taken, giving the log Teager energy of each frame;
finally, performing DCT transformation to obtain one-dimensional TEOCC;
adding one-dimensional TEOCC characteristics into the last dimension of the mixed characteristic vector;
step six, performing data normalization processing on the fusion characteristics to form a normalized training set and a normalized test set, and labeling the two sets respectively to obtain a training set label and a test set label;
let y_i be any data sample in the feature training set or the feature test set; after normalization, the corresponding sample in the normalized training set or test set is:
y_i' = (y_i - y_min) / (y_max - y_min)
where y_min and y_max are the minimum and maximum values of y_i respectively;
step seven, adopting PCA to reduce the dimension of the normalized training set, and bringing the reduced dimension into an SVM model to obtain the recognition accuracy
Dividing the voice features after dimensionality reduction into two parts of a training set train _ data and a test set test _ data, respectively adding a training set label train _ label and a test set label test _ label, and inputting the training set into an SVM (support vector machine) to establish a model:
model=svmtrain(train_label,train_data)
the test set is then classified with the established model to obtain the recognition accuracy "accuracy":
accuracy = svmpredict(test_label, test_data, model).
CN201810311359.9A 2018-04-09 2018-04-09 Anti-noise voice recognition system Active CN108564965B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810311359.9A CN108564965B (en) 2018-04-09 2018-04-09 Anti-noise voice recognition system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810311359.9A CN108564965B (en) 2018-04-09 2018-04-09 Anti-noise voice recognition system

Publications (2)

Publication Number Publication Date
CN108564965A CN108564965A (en) 2018-09-21
CN108564965B true CN108564965B (en) 2021-08-24

Family

ID=63534360

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810311359.9A Active CN108564965B (en) 2018-04-09 2018-04-09 Anti-noise voice recognition system

Country Status (1)

Country Link
CN (1) CN108564965B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109256127B (en) * 2018-11-15 2021-02-19 江南大学 Robust voice feature extraction method based on nonlinear power transformation Gamma chirp filter
CN110808059A (en) * 2019-10-10 2020-02-18 天津大学 Speech noise reduction method based on spectral subtraction and wavelet transform
CN111142084B (en) * 2019-12-11 2023-04-07 中国电子科技集团公司第四十一研究所 Micro terahertz spectrum identification and detection algorithm
CN113205823A (en) * 2021-04-12 2021-08-03 广东技术师范大学 Lung sound signal endpoint detection method, system and storage medium
CN113325752B (en) * 2021-05-12 2022-06-14 北京戴纳实验科技有限公司 Equipment management system
CN114422313B (en) * 2021-12-22 2023-08-01 西安电子科技大学 Frame detection method

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100789084B1 (en) * 2006-11-21 2007-12-26 한양대학교 산학협력단 Speech enhancement method by overweighting gain with nonlinear structure in wavelet packet transform
JP2012032648A (en) * 2010-07-30 2012-02-16 Sony Corp Mechanical noise reduction device, mechanical noise reduction method, program and imaging apparatus
CN102456351A (en) * 2010-10-14 2012-05-16 清华大学 Voice enhancement system
CN103985390A (en) * 2014-05-20 2014-08-13 北京安慧音通科技有限责任公司 Method for extracting phonetic feature parameters based on gammatone relevant images
CN107248414A (en) * 2017-05-23 2017-10-13 清华大学 A kind of sound enhancement method and device based on multiframe frequency spectrum and Non-negative Matrix Factorization
CN107845390A (en) * 2017-09-21 2018-03-27 太原理工大学 A kind of Emotional speech recognition system based on PCNN sound spectrograph Fusion Features

Also Published As

Publication number Publication date
CN108564965A (en) 2018-09-21

Similar Documents

Publication Publication Date Title
CN108564965B (en) Anti-noise voice recognition system
Ancilin et al. Improved speech emotion recognition with Mel frequency magnitude coefficient
Li et al. An auditory-based feature extraction algorithm for robust speaker identification under mismatched conditions
CN109256127B (en) Robust voice feature extraction method based on nonlinear power transformation Gamma chirp filter
CN112006697B (en) Voice signal-based gradient lifting decision tree depression degree recognition system
CN102968990B (en) Speaker identifying method and system
WO2020034628A1 (en) Accent identification method and device, computer device, and storage medium
CN102664010B (en) Robust speaker distinguishing method based on multifactor frequency displacement invariant feature
CN113012720B (en) Depression detection method by multi-voice feature fusion under spectral subtraction noise reduction
CN108198545B (en) Speech recognition method based on wavelet transformation
CN111785285A (en) Voiceprint recognition method for home multi-feature parameter fusion
CN104778948B (en) A kind of anti-noise audio recognition method based on bending cepstrum feature
CN110931023B (en) Gender identification method, system, mobile terminal and storage medium
CN110970036A (en) Voiceprint recognition method and device, computer storage medium and electronic equipment
CN107274887A (en) Speaker's Further Feature Extraction method based on fusion feature MGFCC
CN111508504B (en) Speaker recognition method based on auditory center perception mechanism
CN108461081A (en) Method, apparatus, equipment and the storage medium of voice control
CN105679321B (en) Voice recognition method, device and terminal
CN111489763B (en) GMM model-based speaker recognition self-adaption method in complex environment
US6701291B2 (en) Automatic speech recognition with psychoacoustically-based feature extraction, using easily-tunable single-shape filters along logarithmic-frequency axis
Hasan et al. Preprocessing of continuous bengali speech for feature extraction
CN103557925B (en) Underwater target gammatone discrete wavelet coefficient auditory feature extraction method
CN113421584A (en) Audio noise reduction method and device, computer equipment and storage medium
CN113539243A (en) Training method of voice classification model, voice classification method and related device
Zouhir et al. A bio-inspired feature extraction for robust speech recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Xue Peiyun

Inventor after: Shi Yanyan

Inventor after: Bai Jing

Inventor after: Guo Qianyan

Inventor before: Bai Jing

Inventor before: Shi Yanyan

Inventor before: Xue Peiyun

Inventor before: Guo Qianyan

GR01 Patent grant