CN110265049A - Speech recognition method and speech recognition system - Google Patents

Speech recognition method and speech recognition system

Info

Publication number
CN110265049A
Authority
CN
China
Prior art keywords
voice data
point
voice
short
dynamic time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910444597.1A
Other languages
Chinese (zh)
Inventor
林孝康
傅嵩
葛宛营
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Gaokai Core Technology Development Co Ltd
Original Assignee
Chongqing Gaokai Core Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Gaokai Core Technology Development Co Ltd
Priority to CN201910444597.1A
Publication of CN110265049A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/04 Segmentation; Word boundary detection
    • G10L 15/05 Word boundary detection
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/24 Speech or voice analysis techniques in which the extracted parameters are the cepstrum
    • G10L 25/45 Speech or voice analysis techniques characterised by the type of analysis window

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Complex Calculations (AREA)

Abstract

The present invention provides a speech recognition method in which the extracted feature of the speech data is matched for similarity against the speech data in a template library, with dynamic time warping applied to the extracted speech-data feature, comprising: defining a cost function Φ[(n_i, m_i)] = (n_{i-1}, m_{i-1}), denoting the grid point (n_{i-1}, m_{i-1}) that precedes the current point (n_i, m_i) on the dynamic time warping path, wherein the points on the path satisfy the following constraints: a. the dynamic time warping path starts at point (1, 1) and ends at point (N, M); b. with the start point and end point as opposite vertices of a parallelogram, the points on the dynamic time warping path fall entirely inside that parallelogram; and continuously iterating the cost function to match the extracted speech-data feature against the speech data in the template library. The speech recognition method and speech recognition system provided by the invention can effectively improve the accuracy of feature matching.

Description

Speech recognition method and speech recognition system
Technical field
The present invention relates to the technical field of speech recognition, and in particular to a speech recognition method and a speech recognition system.
Background technique
With the development of computer and information technology, voice interaction has become an essential means of human-computer interaction. Against this background, how to let computers communicate intelligently with people and make human-machine communication more convenient has become an important research subject of modern computer science.
A complete speech recognition system comprises preprocessing, feature-parameter extraction, similarity comparison, recognition, and subsequent steps. In preprocessing, framing and endpoint detection are important processing steps; preprocessing the speech signal can raise the recognition success rate of the system. Endpoint detection extracts the speech portion of the signal, reducing the system's computational load while improving overall recognition accuracy. The speech recognition system recognizes commands by comparing feature values: the speech signal is processed in the frequency domain and its coefficients are taken to represent the feature values of that segment of signal.
In the prior art, however, endpoint detection extracts the speech portion of the data poorly, making it difficult to extract valid speech data and causing speech recognition errors.
Dynamic time warping is an important class of feature-matching algorithm. In the speech recognition process of the prior art, the time span of each segment of the speech signal does not stay the same, and the relative durations of the parts inside each word also vary randomly; consequently, comparing similarity with the feature vectors of the prior art often performs poorly.
Therefore, to solve the above problems in the prior art, a speech recognition method and speech recognition system are needed to improve the accuracy of speech recognition.
Summary of the invention
One aspect of the present invention provides a speech recognition method, the method comprising:
acquiring a speech signal, converting it to speech data, and preprocessing the speech data to reject unrelated noise;
extracting linear prediction cepstral coefficients from the speech data after the unrelated noise is rejected, to obtain the feature of the speech data;
matching the extracted speech-data feature for similarity against the speech data in a template library, wherein dynamic time warping is applied to the extracted speech-data feature, comprising:
defining a cost function Φ[(n_i, m_i)] = (n_{i-1}, m_{i-1}), denoting the grid point (n_{i-1}, m_{i-1}) that precedes the current point (n_i, m_i) on the dynamic time warping path,
wherein the points on the dynamic time warping path satisfy the following constraints:
a. the dynamic time warping path starts at point (1, 1) and ends at point (N, M);
b. with the start point and end point as opposite vertices of a parallelogram, the points on the dynamic time warping path fall entirely inside that parallelogram;
continuously iterating the cost function to match the extracted speech-data feature against the speech data in the template library.
Preferably, the preprocessing includes sampling and quantizing the speech data, pre-emphasis processing of the speech data, windowing and framing of the speech data,
and rejecting unrelated noise by speech-data endpoint detection.
Preferably, the speech-data endpoint detection judges the speech-data endpoints by short-time energy, the short-time energy being stated as follows:
wherein o(j) is the speech data of frame j, j is the frame number, g(t) is the logarithmic one-dimensional short-time energy, I is the window length, and n is the first frame of speech data in the current window.
Preferably, the speech-data endpoint detection judges the speech-data endpoints by short-time energy, convolving the short-time energy with an endpoint-detection filter:
wherein H(i) is the endpoint-detection filter, g(t) is the logarithmic one-dimensional short-time energy, and F(t) is the speech data obtained after convolving the short-time energy with the endpoint-detection filter;
when F(t) is below a certain threshold, the corresponding speech data is rejected as unrelated noise.
Preferably, the speech data of the template library is obtained by training as follows:
acquiring training speech, converting it to training speech data, and preprocessing the training speech data to reject unrelated noise;
extracting linear prediction cepstral coefficients from the training speech data after the unrelated noise is rejected, to obtain the feature of the training speech data;
feeding the obtained feature of the training speech into the template library.
Another aspect of the present invention provides a speech recognition system, the system comprising:
a preprocessing module for converting an acquired speech signal to speech data and preprocessing the speech data to reject unrelated noise;
a feature extraction module for extracting linear prediction cepstral coefficients from the speech data after the unrelated noise is rejected, to obtain the feature of the speech data;
a template library for storing the features of training speech;
a speech recognition module for matching the extracted speech-data feature for similarity against the speech data in the template library, wherein dynamic time warping is applied to the extracted speech-data feature, comprising:
defining a cost function Φ[(n_i, m_i)] = (n_{i-1}, m_{i-1}), denoting the grid point (n_{i-1}, m_{i-1}) that precedes the current point (n_i, m_i) on the dynamic time warping path,
wherein the points on the dynamic time warping path satisfy the following constraints:
a. the dynamic time warping path starts at point (1, 1) and ends at point (N, M);
b. with the start point and end point as opposite vertices of a parallelogram, the points on the dynamic time warping path fall entirely inside that parallelogram;
continuously iterating the cost function to match the extracted speech-data feature against the speech data in the template library.
Preferably, the preprocessing includes sampling and quantizing the speech data, pre-emphasis processing of the speech data, windowing and framing of the speech data,
and rejecting unrelated noise by speech-data endpoint detection.
Preferably, the speech-data endpoint detection judges the speech-data endpoints by short-time energy, the short-time energy being stated as follows:
wherein o(j) is the speech data of frame j, j is the frame number, g(t) is the logarithmic one-dimensional short-time energy, I is the window length, and n is the first frame of speech data in the current window.
Preferably, the speech-data endpoint detection judges the speech-data endpoints by short-time energy, convolving the short-time energy with an endpoint-detection filter:
wherein H(i) is the endpoint-detection filter, g(t) is the logarithmic one-dimensional short-time energy, and F(t) is the speech data obtained after convolving the short-time energy with the endpoint-detection filter;
when F(t) is below a certain threshold, the corresponding speech data is rejected as unrelated noise.
Preferably, the speech data of the template library is obtained by training as follows:
acquiring training speech, converting it to training speech data, and preprocessing the training speech data to reject unrelated noise;
extracting linear prediction cepstral coefficients from the training speech data after the unrelated noise is rejected, to obtain the feature of the training speech data;
feeding the obtained feature of the training speech into the template library.
With the speech recognition method and speech recognition system provided by the invention, linear prediction cepstral coefficients are extracted from the speech data after unrelated noise is rejected and used as the feature values of the speech data; during the dynamic time warping applied to the feature matrix extracted after preprocessing, the time axis of the feature-parameter sequence pattern is recalibrated, which can effectively improve the accuracy of feature matching.
The speech recognition method and speech recognition system provided by the invention give accurate recognition results, compute quickly, and have a wide range of application, with extremely broad application prospects in fields such as communication and automatic speech recognition.
It should be appreciated that both the foregoing general description and the following detailed description are exemplary illustration and explanation, and should not be taken as limiting the claimed content of the present invention.
Brief description of the drawings
Further objects, functions, and advantages of the present invention will be illustrated by the following description of embodiments of the invention with reference to the accompanying drawings, in which:
Fig. 1 schematically shows a flow diagram of a speech recognition method of the present invention.
Fig. 2 shows a schematic time-domain waveform of the speech signal sampled in an embodiment of the invention.
Fig. 3 shows a schematic diagram of the short-time energy waveform in an embodiment of the invention.
Fig. 4 shows the speech data after unrelated noise is rejected by endpoint detection in an embodiment of the invention.
Fig. 5 shows a schematic diagram of the dynamic time warping path in an embodiment of the invention.
Fig. 6 shows a structural block diagram of a speech recognition system of the present invention.
Detailed description of the embodiments
The objects and functions of the present invention, and the methods for realizing them, will be illustrated by reference to exemplary embodiments. However, the present invention is not limited to the exemplary embodiments disclosed below; it can be realized in different forms. The essence of the specification is merely to aid those skilled in the relevant art in comprehensively understanding the specific details of the invention.
Hereinafter, embodiments of the present invention will be described with reference to the drawings. In the drawings, identical reference numerals represent identical or similar components or identical or similar steps.
A speech recognition method provided by the invention is illustrated below through a specific embodiment. Fig. 1 shows a flow diagram of a speech recognition method of the present invention. According to an embodiment of the present invention, the speech recognition method includes the following steps:
Step S101: speech data preprocessing.
According to an embodiment of the invention, taking automatic-driving recognition as an example, a speech signal is acquired (for example: "turn left"); after A/D conversion, the acquired speech signal is converted to speech data.
The speech data is preprocessed to reject unrelated noise. Unrelated noise refers to data unrelated to the acquired speech signal, such as the noise of a running car, ambient noise, and the like.
Fig. 2 shows a schematic time-domain waveform of the speech signal sampled in an embodiment of the invention. Because of the excitation mode of the speech signal, mouth and nose radiation, its non-stationary nature, and the duration of the recorded audio, a segment of speech cannot be fed directly to feature extraction and matching; it is preprocessed before feature extraction.
According to an embodiment of the invention, the preprocessing procedure includes: sampling and quantizing the speech data, pre-emphasis processing of the speech data, windowing and framing of the speech data, and rejecting unrelated noise by speech-data endpoint detection.
Applying pre-emphasis, windowing, and framing to the acquired speech signal makes the statistical properties of the signal more evident while ensuring its stationarity, highlights the short-time characteristics of the speech signal, and increases the recognition success rate of the system.
The purpose of pre-emphasis is to boost the high-frequency part of the speech signal and flatten its spectrum, keeping the same signal-to-noise ratio across the entire band from low to high frequency, so that the spectrum can be used for spectral analysis or parameter analysis. According to an embodiment of the invention, pre-emphasis is realized by the following digital filter:
H(z) = 1 - u·z⁻¹, 0.93 < u < 0.97,
wherein u is the filter coefficient and z is the variable of the z-transform.
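For illustration, a minimal sketch of this pre-emphasis filter in Python; in the time domain, H(z) = 1 - u·z⁻¹ is y[t] = x[t] - u·x[t-1]. The function name and the NumPy dependency are assumptions, and u = 0.95 is one value inside the stated range:

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, u: float = 0.95) -> np.ndarray:
    """Apply H(z) = 1 - u*z^-1, i.e. y[t] = x[t] - u*x[t-1], boosting the
    high-frequency part of the speech signal and flattening its spectrum."""
    return np.append(signal[0], signal[1:] - u * signal[:-1])
```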
Since the speech signal is a typical non-stationary signal, its statistical properties over the entire speech period are unknowable, but it is smooth over 10 ms to 30 ms. Therefore, to obtain a segment of signal that transitions smoothly and has strong autocorrelation, windowing and framing must be performed before feature extraction, so that the signal at any moment can be analyzed. After framing the speech, the unrelated noise in the signal needs to be removed.
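A minimal framing-and-windowing sketch consistent with the 10 ms to 30 ms quasi-stationarity described above. The concrete frame length of 25 ms, shift of 10 ms, and the Hamming window are assumptions; the patent only states that windowing and framing are performed:

```python
import numpy as np

def frame_signal(signal: np.ndarray, fs: int,
                 frame_ms: float = 25.0, shift_ms: float = 10.0) -> np.ndarray:
    """Split the signal into overlapping short-time frames and apply a
    Hamming window, so each frame can be treated as quasi-stationary."""
    frame_len = int(fs * frame_ms / 1000)
    frame_shift = int(fs * shift_ms / 1000)
    if len(signal) < frame_len:
        raise ValueError("signal shorter than one frame")
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    window = np.hamming(frame_len)
    return np.stack([signal[i * frame_shift:i * frame_shift + frame_len] * window
                     for i in range(n_frames)])
```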
In the preprocessing procedure, unrelated noise (unimportant information and background noise) is rejected by speech-data endpoint detection.
According to an embodiment of the invention, speech-data endpoint detection judges the speech-data endpoints by short-time energy, the short-time energy being stated as follows:
wherein o(j) is the speech data of frame j, j is the frame number, g(t) is the logarithmic one-dimensional short-time energy, I is the window length, and n is the first frame of speech data in the current window. Fig. 3 shows a schematic diagram of the short-time energy waveform in an embodiment of the invention.
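The formula itself appears only as an image in the original and is not reproduced in this text. A plausible reconstruction consistent with the stated definitions, namely the logarithm of the energy summed over the I frames of the window whose first frame is n, would be:

```latex
g(t) = \log \sum_{j=n}^{n+I-1} \lVert o(j) \rVert^{2}
```

where \lVert o(j) \rVert^{2} is the energy of frame j; the exact form, including the base of the logarithm, is an assumption.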
The speech-data endpoint detection judges the speech-data endpoints by short-time energy, convolving the short-time energy with an endpoint-detection filter:
wherein H(i) is the endpoint-detection filter, g(t) is the logarithmic one-dimensional short-time energy, and F(t) is the speech data obtained after convolving the short-time energy with the endpoint-detection filter.
When F(t) is below a certain threshold, the corresponding speech data is rejected as unrelated noise. Fig. 4 shows the speech data after unrelated noise is rejected by endpoint detection in an embodiment of the invention.
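A minimal sketch of this endpoint-detection step in Python. The filter H and the threshold are not specified in the text (the convolution formula is an image in the original), so a short averaging filter and a fixed threshold are assumed purely for illustration:

```python
import numpy as np

def log_short_time_energy(frames: np.ndarray, I: int = 5) -> np.ndarray:
    """g(t): logarithm of the summed energy of the I frames in the window
    starting at frame n = t (assumed form of the omitted formula)."""
    frame_energy = np.sum(frames ** 2, axis=1)
    return np.array([np.log(frame_energy[n:n + I].sum() + 1e-12)
                     for n in range(len(frame_energy) - I + 1)])

def endpoint_detect(frames: np.ndarray, I: int = 5,
                    threshold: float = -5.0) -> np.ndarray:
    """Convolve g(t) with an endpoint-detection filter H and keep only the
    frames whose response reaches the threshold."""
    g = log_short_time_energy(frames, I)
    H = np.ones(3) / 3.0                     # assumed filter H(i)
    F = np.convolve(g, H, mode="same")       # F(t) = sum_i H(i) g(t - i)
    keep = F >= threshold                    # F(t) < threshold -> noise
    return frames[:len(keep)][keep]
```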
Step S102: speech-data feature extraction.
According to an embodiment of the invention, linear prediction cepstral coefficients are extracted from the speech data after the unrelated noise is rejected, obtaining the feature of the speech data.
In the present invention, after the all-pole model of the speech is obtained by linear prediction analysis, the linear prediction cepstral coefficients are derived from it. In some embodiments, to obtain a better recognition effect, some post-processing is usually applied to the linear cepstral coefficients, such as multiplying each component of the cepstral coefficients by an appropriate weighting coefficient, or computing first- and second-order differences on the basis of the current cepstral coefficients.
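As an illustration of this step, a minimal LPCC sketch in Python: LPC via the Levinson-Durbin recursion, followed by the standard LPC-to-cepstrum recursion. The patent does not give its exact derivation or post-processing, so the order of 12 and the recursion form below are assumptions:

```python
import numpy as np

def lpc(frame: np.ndarray, order: int) -> np.ndarray:
    """Predictor coefficients of the all-pole model via Levinson-Durbin
    on the frame's autocorrelation sequence (assumes a non-silent frame)."""
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return -a[1:]          # alpha_k with x[t] ~ sum_k alpha_k x[t-k]

def lpcc(frame: np.ndarray, order: int = 12) -> np.ndarray:
    """LPC cepstral coefficients via the recursion
    c_m = alpha_m + sum_{k=1}^{m-1} (k/m) c_k alpha_{m-k}."""
    alpha = lpc(frame, order)
    c = np.zeros(order + 1)
    for m in range(1, order + 1):
        c[m] = alpha[m - 1] + sum((k / m) * c[k] * alpha[m - k - 1]
                                  for k in range(1, m))
    return c[1:]
```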
Step S103: speech-data matching.
According to an embodiment of the invention, the extracted speech-data feature is matched for similarity against the speech data in the template library, wherein dynamic time warping is applied to the extracted speech-data feature, comprising:
defining a cost function Φ[(n_i, m_i)] = (n_{i-1}, m_{i-1}), denoting the grid point (n_{i-1}, m_{i-1}) that precedes the current point (n_i, m_i) on the dynamic time warping path.
The points on the dynamic time warping path satisfy the following constraints:
a. the dynamic time warping path starts at point (1, 1) and ends at point (N, M);
b. with the start point and end point as opposite vertices of a parallelogram, the points on the dynamic time warping path fall entirely inside that parallelogram.
Fig. 5 shows a schematic diagram of the dynamic time warping path in an embodiment of the invention: the path starts at A(1, 1), ends at B(N, M), and its points fall entirely inside the parallelogram ACBD (the solid-line parallelogram in Fig. 5). In some embodiments, the points on the dynamic time warping path fall entirely inside AC'BD' (the dotted-line parallelogram in Fig. 5). It should be understood that it suffices that the points on the dynamic time warping path satisfy the constraint conditions of the invention.
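By way of illustration only, a minimal sketch in Python of dynamic time warping under such a global parallelogram constraint. The patent does not specify the slopes of the parallelogram's sides, the local step set, or the frame distance; the classic slopes of 1/2 and 2, the three-step set, and the Euclidean distance used below are assumptions:

```python
import numpy as np

def dtw_distance(X: np.ndarray, Y: np.ndarray) -> float:
    """DTW distance between feature matrices X (N x d) and Y (M x d).

    The warping path starts at (1, 1), ends at (N, M), and every point on
    it must lie inside a parallelogram whose opposite vertices are those
    two points (side slopes of 1/2 and 2 are assumed here).
    """
    N, M = len(X), len(Y)
    D = np.full((N + 1, M + 1), np.inf)      # 1-based DP grid
    D[1, 1] = np.linalg.norm(X[0] - Y[0])

    def inside(n: int, m: int) -> bool:
        # Between the slope-2 and slope-1/2 lines through (1, 1) and (N, M).
        return (0.5 * (n - 1) <= m - 1 <= 2 * (n - 1) and
                0.5 * (N - n) <= M - m <= 2 * (N - n))

    for n in range(1, N + 1):
        for m in range(1, M + 1):
            if (n, m) == (1, 1) or not inside(n, m):
                continue
            prev = min(D[n - 1, m], D[n, m - 1], D[n - 1, m - 1])
            if np.isfinite(prev):
                D[n, m] = prev + np.linalg.norm(X[n - 1] - Y[m - 1])
    return float(D[N, M])
```

If the lengths N and M differ by more than a factor of two, no path satisfies the constraint and the function returns infinity, which simply means the template cannot match.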
During speech recognition, even if the user says the same word in as similar a manner as possible each time during training/recognition, the length of its duration still varies randomly, and the relative durations of the parts inside each word also vary randomly. Through the above speech-data matching scheme, the present invention effectively solves a series of problems of similarity comparison with feature matrices: poor matching degree, large recognition error, and poor recognition effect.
Step S104: speech recognition is completed.
According to an embodiment of the invention, the cost function Φ[(n_i, m_i)] = (n_{i-1}, m_{i-1}) is iterated continuously to match the extracted speech-data feature against the speech data in the template library. In the matching process, the template whose speech-data feature yields the smallest comparison value against the extracted feature is taken, and its speech-data feature at that point is the output result of the recognition.
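This decision rule can be sketched as a minimum-distance search over the template library; the dictionary layout and function names are assumptions carried over from the earlier sketches:

```python
def recognize(features, template_library):
    """Return the label of the template whose stored feature matrix has
    the smallest DTW distance to the extracted feature matrix."""
    return min(template_library,
               key=lambda label: dtw_distance(features, template_library[label]))
```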
According to an embodiment of the invention, the speech data of the template library is obtained by training as follows, as sketched in code after this list:
acquiring training speech, converting it to training speech data, and preprocessing the training speech data to reject unrelated noise;
extracting linear prediction cepstral coefficients from the training speech data after the unrelated noise is rejected, to obtain the feature of the training speech data;
feeding the obtained feature of the training speech into the template library.
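Chaining the steps above, building the template library could look as follows; every helper (pre_emphasis, frame_signal, endpoint_detect, lpcc) is one of the hypothetical sketches given earlier, not code from the patent:

```python
import numpy as np

def build_template_library(training_utterances: dict, fs: int) -> dict:
    """training_utterances maps a command label (e.g. 'turn left') to its
    raw training signal. Each signal runs through the same pipeline as
    recognition: pre-emphasis, framing, endpoint detection, LPCC."""
    library = {}
    for label, signal in training_utterances.items():
        frames = frame_signal(pre_emphasis(signal), fs)
        speech_frames = endpoint_detect(frames)
        library[label] = np.array([lpcc(f) for f in speech_frames])
    return library
```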
In an embodiment, tests with test-speech recognition show that the speech recognition method provided by the invention can effectively improve the recognition accuracy.
Table 1: comparison results between the features of seven test utterances and the training speech in the template library. (The table itself is an image in the original and is not reproduced here.)
The first seven columns of the table are template utterances; the last column, "test", is the same test utterance recorded from other speakers. The system's decision criterion is to take, as the final result, the utterance whose feature comparison with the template speech yields the smallest value.
It can be seen that, since no warping operation is needed when a test utterance is compared with itself, its distance is 0; identical commands are excluded.
Experiments show that the system can identify the utterance most similar to a given test utterance. The experiment deliberately chose "notebook" (a three-character command) and "turn the light on" (a four-character command), including comparisons of these two words with the other test utterances. It can be seen that, although there are particular cases, the comparison values of "notebook" and "turn the light on" against the other test utterances are usually large.
Fig. 6 shows a structural block diagram of a speech recognition system of the present invention. According to an embodiment of the invention, a speech recognition system includes:
a preprocessing module 101 for converting an acquired speech signal to speech data and preprocessing the speech data to reject unrelated noise.
The preprocessing includes sampling and quantizing the speech data, pre-emphasis processing of the speech data, windowing and framing of the speech data, and rejecting unrelated noise by speech-data endpoint detection.
Speech-data endpoint detection judges the speech-data endpoints by short-time energy, the short-time energy being stated as follows:
wherein o(j) is the speech data of frame j, j is the frame number, g(t) is the logarithmic one-dimensional short-time energy, I is the window length, and n is the first frame of speech data in the current window.
Speech-data endpoint detection judges the speech-data endpoints by short-time energy, convolving the short-time energy with an endpoint-detection filter:
wherein H(i) is the endpoint-detection filter, g(t) is the logarithmic one-dimensional short-time energy, and F(t) is the speech data obtained after convolving the short-time energy with the endpoint-detection filter.
When F(t) is below a certain threshold, the corresponding speech data is rejected as unrelated noise.
a feature extraction module 102 for extracting linear prediction cepstral coefficients from the speech data after the unrelated noise is rejected, obtaining the feature of the speech data.
In the present invention, after the all-pole model of the speech is obtained by linear prediction analysis, the linear prediction cepstral coefficients are derived from it. In some embodiments, to obtain a better recognition effect, some post-processing is usually applied to the linear cepstral coefficients, such as multiplying each component of the cepstral coefficients by an appropriate weighting coefficient, or computing first- and second-order differences on the basis of the current cepstral coefficients.
a template library 103 for storing the features of training speech.
The speech data of the template library is obtained by training as follows:
acquiring training speech, converting it to training speech data, and preprocessing the training speech data to reject unrelated noise;
extracting linear prediction cepstral coefficients from the training speech data after the unrelated noise is rejected, to obtain the feature of the training speech data;
feeding the obtained feature of the training speech into the template library.
a speech recognition module 104 for matching the extracted speech-data feature for similarity against the speech data in the template library, wherein dynamic time warping is applied to the extracted speech-data feature, comprising:
defining a cost function Φ[(n_i, m_i)] = (n_{i-1}, m_{i-1}), denoting the grid point (n_{i-1}, m_{i-1}) that precedes the current point (n_i, m_i) on the dynamic time warping path,
wherein the points on the dynamic time warping path satisfy the following constraints:
a. the dynamic time warping path starts at point (1, 1) and ends at point (N, M);
b. with the start point and end point as opposite vertices of a parallelogram, the points on the dynamic time warping path fall entirely inside that parallelogram;
continuously iterating the cost function to match the extracted speech-data feature against the speech data in the template library.
In the matching process, the template whose speech-data feature in the template library yields the smallest comparison value against the extracted speech-data feature is taken, and its speech-data feature at that point is the output result of the recognition.
With the speech recognition method and speech recognition system provided by the invention, linear prediction cepstral coefficients are extracted from the speech data after unrelated noise is rejected and used as the feature values of the speech data; during the dynamic time warping applied to the feature matrix extracted after preprocessing, the time axis of the feature-parameter sequence pattern is recalibrated, which can effectively improve the accuracy of feature matching.
The speech recognition method and speech recognition system provided by the invention give accurate recognition results, compute quickly, and have a wide range of application, with extremely broad application prospects in fields such as communication and automatic speech recognition.
From the description and practice of the invention disclosed herein, other embodiments of the present invention will be readily apparent to and understood by those skilled in the art. The description and embodiments are to be regarded as exemplary only; the true scope and spirit of the invention are defined by the claims.

Claims (10)

1. A speech recognition method, characterized in that the method comprises:
acquiring a speech signal, converting it to speech data, and preprocessing the speech data to reject unrelated noise;
extracting linear prediction cepstral coefficients from the speech data after the unrelated noise is rejected, to obtain the feature of the speech data;
matching the extracted speech-data feature for similarity against the speech data in a template library, wherein dynamic time warping is applied to the extracted speech-data feature, comprising:
defining a cost function Φ[(n_i, m_i)] = (n_{i-1}, m_{i-1}), denoting the grid point (n_{i-1}, m_{i-1}) that precedes the current point (n_i, m_i) on the dynamic time warping path,
wherein the points on the dynamic time warping path satisfy the following constraints:
a. the dynamic time warping path starts at point (1, 1) and ends at point (N, M);
b. with the start point and end point as opposite vertices of a parallelogram, the points on the dynamic time warping path fall entirely inside that parallelogram;
continuously iterating the cost function to match the extracted speech-data feature against the speech data in the template library.
2. The method according to claim 1, characterized in that the preprocessing includes sampling and quantizing the speech data, pre-emphasis processing of the speech data, windowing and framing of the speech data,
and rejecting unrelated noise by speech-data endpoint detection.
3. The method according to claim 2, characterized in that the speech-data endpoint detection judges the speech-data endpoints by short-time energy, the short-time energy being stated as follows:
wherein o(j) is the speech data of frame j, j is the frame number, g(t) is the logarithmic one-dimensional short-time energy, I is the window length, and n is the first frame of speech data in the current window.
4. The method according to claim 3, characterized in that the speech-data endpoint detection judges the speech-data endpoints by short-time energy, convolving the short-time energy with an endpoint-detection filter:
wherein H(i) is the endpoint-detection filter, g(t) is the logarithmic one-dimensional short-time energy, and F(t) is the speech data obtained after convolving the short-time energy with the endpoint-detection filter;
when F(t) is below a certain threshold, the corresponding speech data is rejected as unrelated noise.
5. The method according to claim 1, characterized in that the speech data of the template library is obtained by training as follows:
acquiring training speech, converting it to training speech data, and preprocessing the training speech data to reject unrelated noise;
extracting linear prediction cepstral coefficients from the training speech data after the unrelated noise is rejected, to obtain the feature of the training speech data;
feeding the obtained feature of the training speech into the template library.
6. A speech recognition system, characterized in that the system comprises:
a preprocessing module for converting an acquired speech signal to speech data and preprocessing the speech data to reject unrelated noise;
a feature extraction module for extracting linear prediction cepstral coefficients from the speech data after the unrelated noise is rejected, to obtain the feature of the speech data;
a template library for storing the features of training speech;
a speech recognition module for matching the extracted speech-data feature for similarity against the speech data in the template library, wherein dynamic time warping is applied to the extracted speech-data feature, comprising:
defining a cost function Φ[(n_i, m_i)] = (n_{i-1}, m_{i-1}), denoting the grid point (n_{i-1}, m_{i-1}) that precedes the current point (n_i, m_i) on the dynamic time warping path,
wherein the points on the dynamic time warping path satisfy the following constraints:
a. the dynamic time warping path starts at point (1, 1) and ends at point (N, M);
b. with the start point and end point as opposite vertices of a parallelogram, the points on the dynamic time warping path fall entirely inside that parallelogram;
continuously iterating the cost function to match the extracted speech-data feature against the speech data in the template library.
7. The system according to claim 6, characterized in that the preprocessing includes sampling and quantizing the speech data, pre-emphasis processing of the speech data, windowing and framing of the speech data,
and rejecting unrelated noise by speech-data endpoint detection.
8. The system according to claim 7, characterized in that the speech-data endpoint detection judges the speech-data endpoints by short-time energy, the short-time energy being stated as follows:
wherein o(j) is the speech data of frame j, j is the frame number, g(t) is the logarithmic one-dimensional short-time energy, I is the window length, and n is the first frame of speech data in the current window.
9. The system according to claim 8, characterized in that the speech-data endpoint detection judges the speech-data endpoints by short-time energy, convolving the short-time energy with an endpoint-detection filter:
wherein H(i) is the endpoint-detection filter, g(t) is the logarithmic one-dimensional short-time energy, and F(t) is the speech data obtained after convolving the short-time energy with the endpoint-detection filter;
when F(t) is below a certain threshold, the corresponding speech data is rejected as unrelated noise.
10. The system according to claim 6, characterized in that the speech data of the template library is obtained by training as follows:
acquiring training speech, converting it to training speech data, and preprocessing the training speech data to reject unrelated noise;
extracting linear prediction cepstral coefficients from the training speech data after the unrelated noise is rejected, to obtain the feature of the training speech data;
feeding the obtained feature of the training speech into the template library.
CN201910444597.1A 2019-05-27 2019-05-27 Speech recognition method and speech recognition system Pending CN110265049A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910444597.1A CN110265049A (en) Speech recognition method and speech recognition system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910444597.1A CN110265049A (en) Speech recognition method and speech recognition system

Publications (1)

Publication Number Publication Date
CN110265049A 2019-09-20

Family

ID=67915432

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910444597.1A Pending CN110265049A (en) Speech recognition method and speech recognition system

Country Status (1)

Country Link
CN (1) CN110265049A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1197952A1 (en) * 2000-10-18 2002-04-17 Thales Coding method of the prosody for a very low bit rate speech encoder
CN102982803A (en) * 2012-12-11 2013-03-20 华南师范大学 Isolated word speech recognition method based on HRSF and improved DTW algorithm
CN104865973A (en) * 2015-04-13 2015-08-26 天津工业大学 Method for controlling window-cleaning robot through voice
CN106874185A (en) * 2016-12-27 2017-06-20 中车株洲电力机车研究所有限公司 A kind of automated testing method driven based on voiced keyword and system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
HE Hui: "Application and Research of Modern Signal Detection Technology and Evaluation Theory", 31 August 2018 *
WU Yanyan: "Research on Key Technologies of Isolated-Word Speech Recognition", China Master's Theses Full-text Database, Information Science and Technology *
ZHANG Renzhi et al.: "Research on Speech Endpoint Detection Algorithm Based on Short-Time Energy", Audio Engineering *
WANG Bingxi et al.: "Fundamentals of Practical Speech Recognition", 31 January 2005 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110931022A (en) * 2019-11-19 2020-03-27 天津大学 Voiceprint identification method based on high-frequency and low-frequency dynamic and static characteristics
CN110931022B (en) * 2019-11-19 2023-09-15 天津大学 Voiceprint recognition method based on high-low frequency dynamic and static characteristics
CN112965381A (en) * 2021-02-09 2021-06-15 重庆高开清芯智联网络科技有限公司 Method for establishing cooperative intelligent self-adaptive decision model
CN112965381B (en) * 2021-02-09 2022-11-11 重庆高开清芯智联网络科技有限公司 Method for establishing cooperative intelligent self-adaptive decision model

Similar Documents

Publication Publication Date Title
Togneri et al. An overview of speaker identification: Accuracy and robustness issues
US8566088B2 (en) System and method for automatic speech to text conversion
Wang et al. Exploring monaural features for classification-based speech segregation
Chu et al. SAFE: A statistical approach to F0 estimation under clean and noisy conditions
Hansen et al. Automatic voice onset time detection for unvoiced stops (/p/,/t/,/k/) with application to accent classification
CN110232933B (en) Audio detection method and device, storage medium and electronic equipment
CN104021789A (en) Self-adaption endpoint detection method using short-time time-frequency value
CN103646649A (en) High-efficiency voice detecting method
KR20130133858A (en) Speech syllable/vowel/phone boundary detection using auditory attention cues
CN101136199A (en) Voice data processing method and equipment
CN108305639B (en) Speech emotion recognition method, computer-readable storage medium and terminal
CN107564543B (en) Voice feature extraction method with high emotion distinguishing degree
CN111599344B (en) Language identification method based on splicing characteristics
CN108682432B (en) Speech emotion recognition device
Deshmukh et al. Speech based emotion recognition using machine learning
CN110265049A (en) A kind of audio recognition method and speech recognition system
Poorna et al. Emotion recognition using multi-parameter speech feature classification
Sun et al. A novel convolutional neural network voiceprint recognition method based on improved pooling method and dropout idea
CN110136746B (en) Method for identifying mobile phone source in additive noise environment based on fusion features
CN110415707B (en) Speaker recognition method based on voice feature fusion and GMM
CN116597853A (en) Audio denoising method
CN111091816B (en) Data processing system and method based on voice evaluation
Li et al. Tibetan voice activity detection based on one-dimensional convolutional neural network
Nosek et al. Synthesized speech detection based on spectrogram and convolutional neural networks
Heckmann et al. A closer look on hierarchical spectro-temporal features (HIST).

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20190920