CN110265049A - A kind of audio recognition method and speech recognition system - Google Patents
A kind of audio recognition method and speech recognition system Download PDFInfo
- Publication number
- CN110265049A CN110265049A CN201910444597.1A CN201910444597A CN110265049A CN 110265049 A CN110265049 A CN 110265049A CN 201910444597 A CN201910444597 A CN 201910444597A CN 110265049 A CN110265049 A CN 110265049A
- Authority
- CN
- China
- Prior art keywords
- voice data
- point
- voice
- short
- dynamic time
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 43
- 238000000605 extraction Methods 0.000 claims abstract description 19
- 238000001514 detection method Methods 0.000 claims description 42
- 238000012549 training Methods 0.000 claims description 35
- 238000012545 processing Methods 0.000 claims description 18
- 230000006870 function Effects 0.000 claims description 15
- 238000007781 pre-processing Methods 0.000 claims description 6
- 239000000284 extract Substances 0.000 claims description 5
- 230000008859 change Effects 0.000 claims description 3
- 238000010586 diagram Methods 0.000 description 10
- 230000000694 effects Effects 0.000 description 6
- 230000008569 process Effects 0.000 description 5
- 238000004891 communication Methods 0.000 description 3
- 238000005516 engineering process Methods 0.000 description 3
- 238000002474 experimental method Methods 0.000 description 2
- 238000012805 post-processing Methods 0.000 description 2
- 238000002203 pretreatment Methods 0.000 description 2
- 238000001228 spectrum Methods 0.000 description 2
- 238000012360 testing method Methods 0.000 description 2
- 230000008901 benefit Effects 0.000 description 1
- 238000006243 chemical reaction Methods 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 235000013399 edible fruits Nutrition 0.000 description 1
- 230000005713 exacerbation Effects 0.000 description 1
- 238000009432 framing Methods 0.000 description 1
- 230000003993 interaction Effects 0.000 description 1
- 230000002452 interceptive effect Effects 0.000 description 1
- 230000005855 radiation Effects 0.000 description 1
- 238000010183 spectrum analysis Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
- G10L15/05—Word boundary detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/45—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Artificial Intelligence (AREA)
- Quality & Reliability (AREA)
- Complex Calculations (AREA)
Abstract
The present invention provides a kind of audio recognition method, and the voice data in the voice data feature of extraction and template library is carried out similarity mode, wherein carries out dynamic time warping to the voice data feature of extraction, comprising: define cost function Φ [(ni, mi)]=(ni‑1, mi‑1), indicate dynamic time warping path current point (ni,mi) previous mesh point (ni‑1,mi‑1), the point on dynamic time warping path meets following constraint: the starting point (1,1) on a, dynamic time warping path, terminating point (N, M);B, the vertex opposite as a certain parallelogram using the starting point and ending point, the point on dynamic time warping path are entirely fallen in the parallelogram;Cost function described in continuous iteration, matches the voice data feature extracted with the voice data in template library.A kind of audio recognition method provided by the invention and speech recognition system, can effectively improve the accuracy of characteristic matching.
Description
Technical field
The present invention relates to technical field of voice recognition, in particular to a kind of audio recognition method and speech recognition system.
Background technique
With the development of computer technology and information technology, interactive voice has become the necessary means of human-computer interaction,
It under this situation, is communicated with people with how allowing Computerized intelligent, makes man-machine communication is more convenient to become modern computer
One important subject of science.
After complete speech recognition system includes pretreatment, characteristic parameter extraction, similar comparison and identification successfully
Continuous step.In pretreatment, framing, end-point detection etc. are all important processing means, and pre-processing to voice signal can
The recognition success rate of increasing system;End-point detection technology can extract the phonological component in signal, reduce system-computed amount
Meanwhile improving whole accuracy of identification.Meanwhile speech recognition system is compared using characteristic value come recognition command, it will be in frequency domain
Voice signal handled after take its coefficient to indicate the segment signal characteristic value.
However in the prior art, to endpoint Detection and Extraction voice data part, effect is poor, it is difficult to extract effective voice
Data cause speech recognition mistake occur.
Dynamic time warping is the important method of one kind in Feature Correspondence Algorithm, in the prior art, the mistake of speech recognition
Cheng Zhong, the time span of every section of voice signal will not keep identical, and the opposite duration of various pieces is also inside each word
The characteristic vector for changing at random, therefore using in the prior art carries out the comparison of similitude, and effect is often poor.
Therefore, in order to solve the above problem occurred in the prior art, a kind of audio recognition method and speech recognition are needed
System, the accuracy of Lai Tigao speech recognition.
Summary of the invention
One aspect of the present invention is to provide a kind of audio recognition method, which comprises
Voice signal is obtained, voice data is converted to, pretreatment is carried out to voice data and rejects uncorrelated noise;
Linear prediction residue error is extracted in voice data after rejecting uncorrelated noise, obtains the features of voice data
Point;
Voice data in the voice data feature of extraction and template library is subjected to similarity mode, wherein to extraction
Voice data feature carries out dynamic time warping, comprising:
Define cost function Φ [(ni, mi)]=(ni-1, mi-1), indicate dynamic time warping path current point (ni,mi)
Previous mesh point (ni-1,mi-1),
Point on dynamic time warping path meets following constraint:
A, the starting point (1,1) on dynamic time warping path, terminating point (N, M);
B, the vertex opposite as a certain parallelogram using the starting point and ending point, on dynamic time warping path
Point entirely fall in the parallelogram;
Cost function described in continuous iteration, matches the voice data feature extracted with the voice data in template library.
Preferably, the pretreatment includes that voice data is sampled and quantified, the processing of voice data preemphasis, voice
Data adding window, sub-frame processing,
And uncorrelated noise is rejected by voice data end-point detection.
Preferably, the voice data end-point detection judges voice data endpoint by short-time energy, and short-time energy passes through
Following method statement:
Wherein, o (j) is frame voice data, and j is frame number, and g (t) is the one-dimensional short-time energy of logarithmetics, and I window is long, and n is
When first frame voice data in front window.
Preferably, the voice data end-point detection judges voice data endpoint by short-time energy, by short-time energy with
End-point detection filter convolution:
Wherein, H (i) be end-point detection filter, g (t) be logarithmetics one-dimensional short-time energy, F (t) be short-time energy with
The voice data obtained after end-point detection filter convolution,
When F (t) is less than a certain threshold value, then using the voice data as uncorrelated noise rejecting.
Preferably, training obtains the voice data of the template library by the following method:
Training voice is obtained, trained voice data is converted to, pretreatment is carried out to training voice data and rejects uncorrelated noise;
Linear prediction residue error is extracted in training voice data after rejecting uncorrelated noise, obtains training voice data
Characteristic;
The characteristic of obtained training voice is sent into template library.
Another aspect of the present invention is to provide a kind of speech recognition system, the system comprises:
Preprocessing module, is converted to voice data for will acquire voice signal, carries out pretreatment rejecting to voice data
Uncorrelated noise;
Characteristic extracting module is obtained for extracting linear prediction residue error in the voice data after rejecting uncorrelated noise
To the characteristic of voice data;
Template library, for storing the characteristic of trained voice;
Speech recognition module, the voice data in voice data feature and template library for that will extract carry out similarity
Match, wherein dynamic time warping is carried out to the voice data feature of extraction, comprising:
Define cost function Φ [(ni, mi)]=(ni-1, mi-1), indicate dynamic time warping path current point (ni,mi)
Previous mesh point (ni-1,mi-1),
Point on dynamic time warping path meets following constraint:
A, the starting point (1,1) on dynamic time warping path, terminating point (N, M);
B, the vertex opposite as a certain parallelogram using the starting point and ending point, on dynamic time warping path
Point entirely fall in the parallelogram;
Cost function described in continuous iteration, matches the voice data feature extracted with the voice data in template library.
Preferably, the pretreatment includes that voice data is sampled and quantified, the processing of voice data preemphasis, voice
Data adding window, sub-frame processing,
And uncorrelated noise is rejected by voice data end-point detection.
Preferably, the voice data end-point detection judges voice data endpoint by short-time energy, and short-time energy passes through
Following method statement:
Wherein, o (j) is frame voice data, and j is frame number, and g (t) is the one-dimensional short-time energy of logarithmetics, and I window is long, and n is
When first frame voice data in front window.
Preferably, the voice data end-point detection judges voice data endpoint by short-time energy, by short-time energy with
End-point detection filter convolution:
Wherein, H (i) be end-point detection filter, g (t) be logarithmetics one-dimensional short-time energy, F (t) be short-time energy with
The voice data obtained after end-point detection filter convolution,
When F (t) is less than a certain threshold value, then using the voice data as uncorrelated noise rejecting.
Preferably, training obtains the voice data of the template library by the following method:
Training voice is obtained, trained voice data is converted to, pretreatment is carried out to training voice data and rejects uncorrelated noise;
Linear prediction residue error is extracted in training voice data after rejecting uncorrelated noise, obtains training voice data
Characteristic;
The characteristic of obtained training voice is sent into template library.
A kind of audio recognition method provided by the invention and speech recognition system, the voice data after rejecting uncorrelated noise
Middle extraction linear prediction residue error, as the characteristic value of voice data, when extracting eigenmatrix progress dynamic after the pre-treatment
Between it is regular during, using the calibration for re-starting the time to characteristic parameter sequence pattern, can effectively improve characteristic matching
Accuracy.
A kind of audio recognition method provided by the invention and speech recognition system, identification knot is accurate, calculates quickly, using model
It encloses extensively, there is extremely wide application prospect in fields such as communication, automatic speech recognitions.
It should be appreciated that aforementioned description substantially and subsequent detailed description are exemplary illustration and explanation, it should not
As the limitation to the claimed content of the present invention.
Detailed description of the invention
With reference to the attached drawing of accompanying, the more purposes of the present invention, function and advantage are by the as follows of embodiment through the invention
Description is illustrated, in which:
Fig. 1 diagrammatically illustrates a kind of flow diagram of audio recognition method of the present invention.
Fig. 2 shows the voice signal time domain waveform schematic diagrams sampled in one embodiment of the invention.
Fig. 3 shows the schematic diagram of short-time energy waveform in this one embodiment of the invention.
Fig. 4 is shown rejects the voice data after uncorrelated noise by end-point detection in one embodiment of the invention.
Fig. 5 shows the schematic diagram in dynamic time warping path in one embodiment of the invention.
Fig. 6 shows a kind of structural block diagram of speech recognition system of the present invention.
Specific embodiment
By reference to exemplary embodiment, the purpose of the present invention and function and the side for realizing these purposes and function
Method will be illustrated.However, the present invention is not limited to exemplary embodiment as disclosed below;Can by different form come
It is realized.The essence of specification is only to aid in those skilled in the relevant arts' Integrated Understanding detail of the invention.
Hereinafter, the embodiment of the present invention will be described with reference to the drawings.In the accompanying drawings, identical appended drawing reference represents identical
Or similar component or same or like step.
A kind of audio recognition method provided by the invention is illustrated below by specific embodiment, as shown in Figure 1
A kind of flow diagram of audio recognition method of the present invention, embodiment according to the present invention, a kind of audio recognition method include with
Lower method and step:
Step S101, voice data pre-processes.
According to an embodiment of the invention, by taking automatic Pilot identifies as an example, acquisition voice signal (such as: turn left), to acquisition
Voice signal after AD conversion, be converted to voice data.
Voice data is pre-processed, uncorrelated noise is rejected.Uncorrelated noise refers to unrelated with voice signal is obtained
Data, such as the noise of running car, ambient noise etc..
The voice signal time domain waveform schematic diagram sampled in one embodiment of the invention as shown in Figure 2.Due to voice signal
Energisation mode, mouth and nose radiation and its non-stationary property and recording audio duration problem, one section of voice can not directly carry out spy
Sign is extracted and matching, is pre-processed before feature extraction.
According to the embodiment that this hair is invented, preprocessing process includes: that voice data is sampled and quantified, and voice data is pre-
Exacerbation processing, voice data adding window, sub-frame processing, and uncorrelated noise is rejected by voice data end-point detection.
The processing of voice data preemphasis is being carried out to the voice signal of acquisition, voice data adding window, sub-frame processing guarantee letter
Keep the statistical property of signal more obvious while number stationarity, the short-time characteristic of prominent voice signal increases the identification of system
Success rate.
The purpose of preemphasis is to improve the high frequency section of voice signal, so that the frequency spectrum of signal is become flat, is maintained at low frequency
Into the entire frequency band of high frequency, frequency spectrum can be sought with same signal-to-noise ratio, in order to spectrum analysis or Parameter analysis.According to this hair
Bright embodiment, preemphasis processing are realized by following digital filter:
H (z)=1-uz-1, 0.93 < u < 0.97,
Wherein, u is filter coefficient, and z is input signal.
Since voice signal belongs to typical non-stationary signal, statistical property in entire speech period be it is unknowable,
But it is smoothly, therefore, to obtain one section of signal for seamlessly transitting and having stronger autocorrelation in 10ms -30ms,
Just need to carry out before carrying out feature extraction adding window, sub-frame processing come achieve the purpose that analyze any moment signal.To voice into
It needs to remove the uncorrelated noise in signal after row sub-frame processing.
In preprocessing process, rejecting uncorrelated noise by voice data end-point detection, (unessential information and background are made an uproar
Sound).
According to an embodiment of the invention, voice data end-point detection judges voice data endpoint by short-time energy, in short-term
Energy is stated by the following method:
Wherein, o (j) is frame voice data, and j is frame number, and g (t) is the one-dimensional short-time energy of logarithmetics, and I window is long, and n is
When first frame voice data in front window.Fig. 3 shows the schematic diagram of short-time energy waveform in this one embodiment of the invention.
Voice data end-point detection judges voice data endpoint by short-time energy, and short-time energy and end-point detection are filtered
Device convolution:
Wherein, H (i) be end-point detection filter, g (t) be logarithmetics one-dimensional short-time energy, F (t) be short-time energy with
The voice data obtained after end-point detection filter convolution,
When F (t) is less than a certain threshold value, then using the voice data as uncorrelated noise rejecting.The present invention one as shown in Figure 4
The voice data after uncorrelated noise is rejected in embodiment by end-point detection.
Step S102, voice data characteristic extracts.
According to an embodiment of the invention, linear prediction residue error is extracted from the voice data after rejecting uncorrelated noise,
Obtain the characteristic of voice data.
After the present invention obtains the all-pole modeling of voice by linear prediction analysis, local derviation is asked it to obtain linear prediction system
Number.In some embodiments, in order to obtain better recognition effect, certain post-processing is usually carried out to linear cepstrum coefficient,
Such as to each component of cepstrum coefficient multiplied by weighting coefficient appropriate, or ask on the basis of current cepstrum coefficient single order,
Second differnce etc..
Step S103, voice data matches.
According to an embodiment of the invention, the voice data in the voice data feature of extraction and template library is carried out similarity
Matching, wherein dynamic time warping is carried out to the voice data feature of extraction, comprising:
Define cost function Φ [(ni, mi)]=(ni-1, mi-1), indicate dynamic time warping path current point (ni,mi)
Previous mesh point (ni-1,mi-1)。
Point on dynamic time warping path meets following constraint:
A, the starting point (1,1) on dynamic time warping path, terminating point (N, M);
B, the vertex opposite as a certain parallelogram using the starting point and ending point, on dynamic time warping path
Point entirely fall in the parallelogram.
The schematic diagram in dynamic time warping path, dynamic time warping path in one embodiment of the invention as shown in Figure 5
On starting point A (1,1), terminating point B (N, M), the point on dynamic time warping path entirely falls in ACBD in parallelogram
In (parallelogram of solid line in Fig. 5).In some embodiments, it is AC ' that the point on dynamic time warping path, which is entirely fallen in,
In BD ' (parallelogram of dotted line in Fig. 5).It should be understood that as long as the point met on dynamic time warping path of the invention is full
The beam condition of foot.
During speech recognition, when user is trained/identifies, even if saying in an identical manner as far as possible every time same
Vocabulary, but the length of its duration can also change at random, and the opposite duration of various pieces is also random inside each word
Variation.The present invention effectively solves to carry out similarity system design using eigenmatrix by the matched mode of above-mentioned voice data
Matching degree it is poor, identification error is big, the bad good a series of problems of recognition effect.
Step S104, speech recognition is completed.
According to an embodiment of the invention, constantly iteration cost function Φ [(ni, mi)]=(ni-1, mi-1), make the voice extracted
Data characteristics is matched with the voice data in template library.In matching process, the voice data feature of extraction and the language in template library
Sound data characteristics compares resulting value minimum, then the voice data feature taken at this time is as the output result after identification.
According to an embodiment of the invention, training obtains the voice data of template library by the following method:
Training voice is obtained, trained voice data is converted to, pretreatment is carried out to training voice data and rejects uncorrelated noise.
Linear prediction residue error is extracted in training voice data after rejecting uncorrelated noise, obtains training voice data
Characteristic.
The characteristic of obtained training voice is sent into template library.
In embodiment, is tested by tested speech identification, show audio recognition method provided by the invention, Neng Gouyou
Effect improves the accuracy rate of identification.
The comparing result of the characteristic of one: seven tested speech of table and template voice.
The comparing result of the characteristic of training voice in one or seven tested speech of table and template library,
The first seven in table is classified as template voice, last column " test " is to enroll other speakers to say same test voice
Voice.The criterion of system is that it is most to terminate to the smallest tested speech of resulting value that selection, which carries out aspect ratio with template voice,
Fruit.
As can be seen that due to not needing regular operation when tested speech and self-contrast, therefore its distance is 0, is removed identical
Order.
Experiment shows that system can identify the voice most like with a certain tested speech." pen is had chosen in experiment intentionally
Note is originally " (three words) and " lamp is opened " (four words), it include tested speech comparison of two words with other, it can be seen that although
There are particular cases, but " notebook ", " lamp is opened " and the reduced value of other tested speech are usually larger.
A kind of structural block diagram of speech recognition system of the present invention as shown in Figure 6, according to an embodiment of the invention, a kind of voice
Identifying system includes:
Preprocessing module 101 is converted to voice data for will acquire voice signal, carries out pretreatment to voice data and picks
Except uncorrelated noise.
Pretreatment includes that voice data is sampled and quantified, and voice data preemphasis processing, voice data adding window is divided
Frame processing, and uncorrelated noise is rejected by voice data end-point detection.
Voice data end-point detection judges voice data endpoint by short-time energy, short-time energy table by the following method
It states:
Wherein, o (j) is frame voice data, and j is frame number, and g (t) is the one-dimensional short-time energy of logarithmetics, and I window is long, and n is
When first frame voice data in front window.
Voice data end-point detection judges voice data endpoint by short-time energy, and short-time energy and end-point detection are filtered
Device convolution:
Wherein, H (i) be end-point detection filter, g (t) be logarithmetics one-dimensional short-time energy, F (t) be short-time energy with
The voice data obtained after end-point detection filter convolution,
When F (t) is less than a certain threshold value, then using the voice data as uncorrelated noise rejecting.
Characteristic extracting module 102, for extracting linear prediction residue error in the voice data after rejecting uncorrelated noise,
Obtain the characteristic of voice data.
After the present invention obtains the all-pole modeling of voice by linear prediction analysis, local derviation is asked it to obtain linear prediction system
Number.In some embodiments, in order to obtain better recognition effect, certain post-processing is usually carried out to linear cepstrum coefficient,
Such as to each component of cepstrum coefficient multiplied by weighting coefficient appropriate, or ask on the basis of current cepstrum coefficient single order,
Second differnce etc..
Template library 103, for storing the characteristic of trained voice.
Training obtains the voice data of template library by the following method:
Training voice is obtained, trained voice data is converted to, pretreatment is carried out to training voice data and rejects uncorrelated noise;
Linear prediction residue error is extracted in training voice data after rejecting uncorrelated noise, obtains training voice data
Characteristic;
The characteristic of obtained training voice is sent into template library.
Speech recognition module 104, the voice data feature for that will extract are similar to the voice data progress in template library
Degree matching, wherein dynamic time warping is carried out to the voice data feature of extraction, comprising:
Define cost function Φ [(ni, mi)]=(ni-1, mi-1), indicate dynamic time warping path current point (ni,mi)
Previous mesh point (ni-1,mi-1),
Point on dynamic time warping path meets following constraint:
A, the starting point (1,1) on dynamic time warping path, terminating point (N, M);
B, the vertex opposite as a certain parallelogram using the starting point and ending point, on dynamic time warping path
Point entirely fall in the parallelogram;
Cost function described in continuous iteration, matches the voice data feature extracted with the voice data in template library.
In matching process, the voice data feature of extraction and the voice data Characteristic Contrast resulting value minimum in template library,
The voice data feature then taken at this time is as the output result after identification.
A kind of audio recognition method provided by the invention and speech recognition system, the voice data after rejecting uncorrelated noise
Middle extraction linear prediction residue error, as the characteristic value of voice data, when extracting eigenmatrix progress dynamic after the pre-treatment
Between it is regular during, using the calibration for re-starting the time to characteristic parameter sequence pattern, can effectively improve characteristic matching
Accuracy.
A kind of audio recognition method provided by the invention and speech recognition system, identification knot is accurate, calculates quickly, using model
It encloses extensively, there is extremely wide application prospect in fields such as communication, automatic speech recognitions.
In conjunction with the explanation and practice of the invention disclosed here, the other embodiment of the present invention is for those skilled in the art
It all will be readily apparent and understand.Illustrate and embodiment is regarded only as being exemplary, true scope of the invention and purport are equal
It is defined in the claims.
Claims (10)
1. a kind of audio recognition method, which is characterized in that the described method includes:
Voice signal is obtained, voice data is converted to, pretreatment is carried out to voice data and rejects uncorrelated noise;
Linear prediction residue error is extracted in voice data after rejecting uncorrelated noise, obtains the characteristic of voice data;
Voice data in the voice data feature of extraction and template library is subjected to similarity mode, wherein to the voice of extraction
Data characteristics carries out dynamic time warping, comprising:
Define cost function Φ [(ni, mi)]=(ni-1, mi-1), indicate dynamic time warping path current point (ni,mi) it is previous
Mesh point (ni-1,mi-1),
Point on dynamic time warping path meets following constraint:
A, the starting point (1,1) on dynamic time warping path, terminating point (N, M);
B, the vertex opposite as a certain parallelogram using the starting point and ending point, the point on dynamic time warping path
It entirely falls in the parallelogram;
Cost function described in continuous iteration, matches the voice data feature extracted with the voice data in template library.
2. the method according to claim 1, wherein the pretreatment includes that voice data is sampled and measured
Change, the processing of voice data preemphasis, voice data adding window, sub-frame processing,
And uncorrelated noise is rejected by voice data end-point detection.
3. according to the method described in claim 2, it is characterized in that, the voice data end-point detection is judged by short-time energy
Voice data endpoint, short-time energy are stated by the following method:
Wherein, o (j) is frame voice data, and j is frame number, and g (t) is the one-dimensional short-time energy of logarithmetics, and I window is long, and n is current
First frame voice data in window.
4. according to the method described in claim 3, it is characterized in that, the voice data end-point detection is judged by short-time energy
Voice data endpoint, by short-time energy and end-point detection filter convolution:
Wherein, H (i) is end-point detection filter, and g (t) is the one-dimensional short-time energy of logarithmetics, and F (t) is short-time energy and endpoint
The voice data obtained after Fault detection filter convolution,
When F (t) is less than a certain threshold value, then using the voice data as uncorrelated noise rejecting.
5. the method according to claim 1, wherein the voice data of the template library is trained by the following method
It obtains:
Training voice is obtained, trained voice data is converted to, pretreatment is carried out to training voice data and rejects uncorrelated noise;
Linear prediction residue error is extracted in training voice data after rejecting uncorrelated noise, obtains the spy for training voice data
Levy part;
The characteristic of obtained training voice is sent into template library.
6. a kind of speech recognition system, which is characterized in that the system comprises:
Preprocessing module is converted to voice data for will acquire voice signal, and it is unrelated to carry out pretreatment rejecting to voice data
Noise;
Characteristic extracting module obtains language for extracting linear prediction residue error in the voice data after rejecting uncorrelated noise
The characteristic of sound data;
Template library, for storing the characteristic of trained voice;
Speech recognition module, the voice data in voice data feature and template library for that will extract carry out similarity mode,
Wherein, dynamic time warping is carried out to the voice data feature of extraction, comprising:
Define cost function Φ [(ni, mi)]=(ni-1, mi-1), indicate dynamic time warping path current point (ni,mi) it is previous
Mesh point (ni-1,mi-1),
Point on dynamic time warping path meets following constraint:
A, the starting point (1,1) on dynamic time warping path, terminating point (N, M);
B, the vertex opposite as a certain parallelogram using the starting point and ending point, the point on dynamic time warping path
It entirely falls in the parallelogram;
Cost function described in continuous iteration, matches the voice data feature extracted with the voice data in template library.
7. system according to claim 6, which is characterized in that the pretreatment includes that voice data is sampled and measured
Change, the processing of voice data preemphasis, voice data adding window, sub-frame processing,
And uncorrelated noise is rejected by voice data end-point detection.
8. system according to claim 7, which is characterized in that the voice data end-point detection is judged by short-time energy
Voice data endpoint, short-time energy are stated by the following method:
Wherein, o (j) is frame voice data, and j is frame number, and g (t) is the one-dimensional short-time energy of logarithmetics, and I window is long, and n is current
First frame voice data in window.
9. system according to claim 8, which is characterized in that the voice data end-point detection is judged by short-time energy
Voice data endpoint, by short-time energy and end-point detection filter convolution:
Wherein, H (i) is end-point detection filter, and g (t) is the one-dimensional short-time energy of logarithmetics, and F (t) is short-time energy and endpoint
The voice data obtained after Fault detection filter convolution,
When F (t) is less than a certain threshold value, then using the voice data as uncorrelated noise rejecting.
10. system according to claim 6, which is characterized in that the voice data of the template library is instructed by the following method
It gets:
Training voice is obtained, trained voice data is converted to, pretreatment is carried out to training voice data and rejects uncorrelated noise;
Linear prediction residue error is extracted in training voice data after rejecting uncorrelated noise, obtains the spy for training voice data
Levy part;
The characteristic of obtained training voice is sent into template library.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910444597.1A CN110265049A (en) | 2019-05-27 | 2019-05-27 | A kind of audio recognition method and speech recognition system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910444597.1A CN110265049A (en) | 2019-05-27 | 2019-05-27 | A kind of audio recognition method and speech recognition system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110265049A true CN110265049A (en) | 2019-09-20 |
Family
ID=67915432
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910444597.1A Pending CN110265049A (en) | 2019-05-27 | 2019-05-27 | A kind of audio recognition method and speech recognition system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110265049A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110931022A (en) * | 2019-11-19 | 2020-03-27 | 天津大学 | Voiceprint identification method based on high-frequency and low-frequency dynamic and static characteristics |
CN112965381A (en) * | 2021-02-09 | 2021-06-15 | 重庆高开清芯智联网络科技有限公司 | Method for establishing cooperative intelligent self-adaptive decision model |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1197952A1 (en) * | 2000-10-18 | 2002-04-17 | Thales | Coding method of the prosody for a very low bit rate speech encoder |
CN102982803A (en) * | 2012-12-11 | 2013-03-20 | 华南师范大学 | Isolated word speech recognition method based on HRSF and improved DTW algorithm |
CN104865973A (en) * | 2015-04-13 | 2015-08-26 | 天津工业大学 | Method for controlling window-cleaning robot through voice |
CN106874185A (en) * | 2016-12-27 | 2017-06-20 | 中车株洲电力机车研究所有限公司 | A kind of automated testing method driven based on voiced keyword and system |
-
2019
- 2019-05-27 CN CN201910444597.1A patent/CN110265049A/en active Pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1197952A1 (en) * | 2000-10-18 | 2002-04-17 | Thales | Coding method of the prosody for a very low bit rate speech encoder |
CN102982803A (en) * | 2012-12-11 | 2013-03-20 | 华南师范大学 | Isolated word speech recognition method based on HRSF and improved DTW algorithm |
CN104865973A (en) * | 2015-04-13 | 2015-08-26 | 天津工业大学 | Method for controlling window-cleaning robot through voice |
CN106874185A (en) * | 2016-12-27 | 2017-06-20 | 中车株洲电力机车研究所有限公司 | A kind of automated testing method driven based on voiced keyword and system |
Non-Patent Citations (4)
Title |
---|
何晖: "《现代信号检测技术与评估理论的应用与研究》", 31 August 2018 * |
吴艳艳: "孤立语音识别的关键技术研究", 《中国优秀硕士学位论文全文数据库信息科技辑》 * |
张仁志等: "基于短时能量的语音端点检测算法研究", 《电声技术》 * |
王炳锡等: "《实用语音识别基础》", 31 January 2005 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110931022A (en) * | 2019-11-19 | 2020-03-27 | 天津大学 | Voiceprint identification method based on high-frequency and low-frequency dynamic and static characteristics |
CN110931022B (en) * | 2019-11-19 | 2023-09-15 | 天津大学 | Voiceprint recognition method based on high-low frequency dynamic and static characteristics |
CN112965381A (en) * | 2021-02-09 | 2021-06-15 | 重庆高开清芯智联网络科技有限公司 | Method for establishing cooperative intelligent self-adaptive decision model |
CN112965381B (en) * | 2021-02-09 | 2022-11-11 | 重庆高开清芯智联网络科技有限公司 | Method for establishing cooperative intelligent self-adaptive decision model |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Togneri et al. | An overview of speaker identification: Accuracy and robustness issues | |
US8566088B2 (en) | System and method for automatic speech to text conversion | |
Wang et al. | Exploring monaural features for classification-based speech segregation | |
Chu et al. | SAFE: A statistical approach to F0 estimation under clean and noisy conditions | |
Hansen et al. | Automatic voice onset time detection for unvoiced stops (/p/,/t/,/k/) with application to accent classification | |
CN110232933B (en) | Audio detection method and device, storage medium and electronic equipment | |
CN104021789A (en) | Self-adaption endpoint detection method using short-time time-frequency value | |
CN103646649A (en) | High-efficiency voice detecting method | |
KR20130133858A (en) | Speech syllable/vowel/phone boundary detection using auditory attention cues | |
CN101136199A (en) | Voice data processing method and equipment | |
CN108305639B (en) | Speech emotion recognition method, computer-readable storage medium and terminal | |
CN107564543B (en) | Voice feature extraction method with high emotion distinguishing degree | |
CN111599344B (en) | Language identification method based on splicing characteristics | |
CN108682432B (en) | Speech emotion recognition device | |
Deshmukh et al. | Speech based emotion recognition using machine learning | |
CN110265049A (en) | A kind of audio recognition method and speech recognition system | |
Poorna et al. | Emotion recognition using multi-parameter speech feature classification | |
Sun et al. | A novel convolutional neural network voiceprint recognition method based on improved pooling method and dropout idea | |
CN110136746B (en) | Method for identifying mobile phone source in additive noise environment based on fusion features | |
CN110415707B (en) | Speaker recognition method based on voice feature fusion and GMM | |
CN116597853A (en) | Audio denoising method | |
CN111091816B (en) | Data processing system and method based on voice evaluation | |
Li et al. | Tibetan voice activity detection based on one-dimensional convolutional neural network | |
Nosek et al. | Synthesized speech detection based on spectrogram and convolutional neural networks | |
Heckmann et al. | A closer look on hierarchical spectro-temporal features (HIST). |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20190920 |
|
RJ01 | Rejection of invention patent application after publication |