CN110265049A - Speech recognition method and speech recognition system - Google Patents
- Publication number
- CN110265049A (application number CN201910444597.1A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G10L15/05 — Speech recognition; Segmentation; Word boundary detection
- G10L15/063 — Speech recognition; Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/22 — Speech recognition; Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L21/0208 — Speech enhancement, e.g. noise reduction or echo cancellation; Noise filtering
- G10L25/24 — Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
- G10L25/45 — Speech or voice analysis techniques characterised by the type of analysis window
Abstract
The present invention provides a speech recognition method in which extracted voice-data features are matched for similarity against the voice data in a template library, with dynamic time warping applied to the extracted features, comprising: defining a cost function Φ[(nᵢ, mᵢ)] = (nᵢ₋₁, mᵢ₋₁), which denotes the previous grid point (nᵢ₋₁, mᵢ₋₁) of the current point (nᵢ, mᵢ) on the dynamic time warping path; constraining the points on the warping path so that (a) the path starts at (1, 1) and ends at (N, M), and (b) taking the start point and end point as opposite vertices of a parallelogram, every point on the path falls inside that parallelogram; and iterating the cost function continuously until the extracted features are matched with the voice data in the template library. The speech recognition method and speech recognition system provided by the invention can effectively improve the accuracy of feature matching.
Description
Technical field
The present invention relates to the technical field of speech recognition, and in particular to a speech recognition method and a speech recognition system.
Background art
With the development of computer and information technology, voice interaction has become an essential means of human-computer interaction. Against this background, enabling computers to communicate intelligently with people, and making human-machine communication more convenient, has become an important subject of modern computer science.
A complete speech recognition system comprises pre-processing, feature-parameter extraction, similarity comparison, and recognition. In pre-processing, framing and endpoint detection are both important operations: pre-processing the speech signal increases the recognition success rate of the system, and endpoint detection extracts the speech portion of the signal, reducing the computational load while improving overall recognition accuracy. A speech recognition system recognises commands by comparing feature values, which are coefficients representing a segment of the signal after the speech signal has been processed in the frequency domain.

In the prior art, however, endpoint detection extracts the voice-data portion poorly; it is difficult to extract valid voice data, which causes speech recognition errors.
Dynamic time warping is an important class of feature-matching algorithm. In the prior art, during speech recognition the duration of each segment of the speech signal does not stay constant, and the relative durations of the parts within each word also vary randomly; consequently, comparing similarity with prior-art feature vectors often performs poorly.

Therefore, to solve the above problems in the prior art, a speech recognition method and speech recognition system are needed to improve the accuracy of speech recognition.
Summary of the invention
One aspect of the present invention provides a speech recognition method, the method comprising:

acquiring a voice signal, converting it to voice data, and pre-processing the voice data to reject irrelevant noise;

extracting linear prediction cepstral coefficients from the voice data after irrelevant noise has been rejected, to obtain the feature data of the voice data;

matching the extracted voice-data features for similarity against the voice data in a template library, wherein dynamic time warping is applied to the extracted features, comprising:

defining a cost function Φ[(nᵢ, mᵢ)] = (nᵢ₋₁, mᵢ₋₁), which denotes the previous grid point (nᵢ₋₁, mᵢ₋₁) of the current point (nᵢ, mᵢ) on the dynamic time warping path;

the points on the dynamic time warping path satisfying the following constraints:

a. the path starts at (1, 1) and ends at (N, M);

b. taking the start point and end point as opposite vertices of a parallelogram, every point on the path falls inside that parallelogram;

iterating the cost function continuously until the extracted voice-data features are matched with the voice data in the template library.
Preferably, the pre-processing includes sampling and quantising the voice data, pre-emphasis, windowing and framing, and rejecting irrelevant noise by voice-data endpoint detection.

Preferably, the voice-data endpoint detection judges the voice-data endpoints by short-time energy, the short-time energy being expressed by the following formula:

where o(j) is the j-th frame of voice data, j is the frame number, g(t) is the logarithmic one-dimensional short-time energy, I is the window length, and n is the first frame of voice data in the current window.

Preferably, the voice-data endpoint detection judges the voice-data endpoints by short-time energy, convolving the short-time energy with an endpoint-detection filter:

where H(i) is the endpoint-detection filter, g(t) is the logarithmic one-dimensional short-time energy, and F(t) is the voice data obtained after convolving the short-time energy with the endpoint-detection filter.

When F(t) is below a certain threshold, the corresponding voice data is rejected as irrelevant noise.
Preferably, the voice data of the template library is obtained by training as follows:

acquiring training speech, converting it to training voice data, and pre-processing the training voice data to reject irrelevant noise;

extracting linear prediction cepstral coefficients from the training voice data after irrelevant noise has been rejected, to obtain the feature data of the training voice data;

feeding the feature data of the training speech into the template library.
Another aspect of the present invention provides a speech recognition system, the system comprising:

a pre-processing module, for converting an acquired voice signal to voice data and pre-processing the voice data to reject irrelevant noise;

a feature-extraction module, for extracting linear prediction cepstral coefficients from the voice data after irrelevant noise has been rejected, to obtain the feature data of the voice data;

a template library, for storing the feature data of training speech;

a speech recognition module, for matching the extracted voice-data features for similarity against the voice data in the template library, wherein dynamic time warping is applied to the extracted features, comprising:

defining a cost function Φ[(nᵢ, mᵢ)] = (nᵢ₋₁, mᵢ₋₁), which denotes the previous grid point (nᵢ₋₁, mᵢ₋₁) of the current point (nᵢ, mᵢ) on the dynamic time warping path;

the points on the dynamic time warping path satisfying the following constraints:

a. the path starts at (1, 1) and ends at (N, M);

b. taking the start point and end point as opposite vertices of a parallelogram, every point on the path falls inside that parallelogram;

iterating the cost function continuously until the extracted voice-data features are matched with the voice data in the template library.
Preferably, the pre-processing includes sampling and quantising the voice data, pre-emphasis, windowing and framing, and rejecting irrelevant noise by voice-data endpoint detection.

Preferably, the voice-data endpoint detection judges the voice-data endpoints by short-time energy, the short-time energy being expressed by the following formula:

where o(j) is the j-th frame of voice data, j is the frame number, g(t) is the logarithmic one-dimensional short-time energy, I is the window length, and n is the first frame of voice data in the current window.

Preferably, the voice-data endpoint detection judges the voice-data endpoints by short-time energy, convolving the short-time energy with an endpoint-detection filter:

where H(i) is the endpoint-detection filter, g(t) is the logarithmic one-dimensional short-time energy, and F(t) is the voice data obtained after convolving the short-time energy with the endpoint-detection filter.

When F(t) is below a certain threshold, the corresponding voice data is rejected as irrelevant noise.

Preferably, the voice data of the template library is obtained by training as follows:

acquiring training speech, converting it to training voice data, and pre-processing the training voice data to reject irrelevant noise;

extracting linear prediction cepstral coefficients from the training voice data after irrelevant noise has been rejected, to obtain the feature data of the training voice data;

feeding the feature data of the training speech into the template library.
In the speech recognition method and speech recognition system provided by the invention, linear prediction cepstral coefficients are extracted from the voice data after irrelevant noise has been rejected and serve as the feature values of the voice data. During the dynamic time warping applied to the feature matrix extracted after pre-processing, the time axis of the feature-parameter sequences is re-calibrated, which effectively improves the accuracy of feature matching.

The speech recognition method and speech recognition system provided by the invention recognise accurately, compute quickly, and have a wide range of applications, with extremely broad prospects in fields such as communication and automatic speech recognition.
It should be appreciated that the foregoing general description and the following detailed description are exemplary illustrations and explanations, and should not be taken as limiting the claimed content of the present invention.
Brief description of the drawings
The purposes, functions, and advantages of the present invention will be illustrated by the following description of embodiments with reference to the accompanying drawings, in which:

Fig. 1 schematically shows a flow diagram of a speech recognition method of the present invention.

Fig. 2 shows a time-domain waveform of a sampled voice signal in one embodiment of the invention.

Fig. 3 shows the short-time energy waveform in one embodiment of the invention.

Fig. 4 shows the voice data after irrelevant noise has been rejected by endpoint detection in one embodiment of the invention.

Fig. 5 shows the dynamic time warping path in one embodiment of the invention.

Fig. 6 shows a structural block diagram of a speech recognition system of the present invention.
Detailed description of embodiments
The purposes and functions of the present invention, and the methods for achieving them, will be illustrated by reference to exemplary embodiments. However, the present invention is not limited to the exemplary embodiments disclosed below; it can be realised in different forms. The essence of the specification is merely to help those skilled in the relevant arts comprehensively understand the specific details of the invention.

Hereinafter, embodiments of the present invention will be described with reference to the drawings. In the drawings, identical reference numerals represent identical or similar components, or identical or similar steps.
A speech recognition method provided by the invention is illustrated below through a specific embodiment. Fig. 1 shows a flow diagram of a speech recognition method of the present invention; according to an embodiment of the invention, the method comprises the following steps:

Step S101: pre-process the voice data.

According to an embodiment of the invention, taking automatic-driving recognition as an example, a voice signal is acquired (e.g. "turn left"), and the acquired voice signal is converted to voice data after A/D conversion.

The voice data is pre-processed and irrelevant noise is rejected. Irrelevant noise refers to data unrelated to the acquired voice signal, such as the noise of a running car, ambient noise, and the like.
Fig. 2 shows the time-domain waveform of a voice signal sampled in one embodiment of the invention. Owing to the excitation mode of the voice signal, mouth and nose radiation, its non-stationary nature, and the duration of the recording, a segment of speech cannot be used for feature extraction and matching directly; it must be pre-processed before feature extraction.

According to an embodiment of the invention, the pre-processing includes sampling and quantising the voice data, pre-emphasis, windowing and framing, and rejecting irrelevant noise by voice-data endpoint detection.

Applying pre-emphasis, windowing, and framing to the acquired voice signal makes the statistical properties of the signal more apparent while preserving its stationarity, highlights the short-time characteristics of the voice signal, and increases the recognition success rate of the system.
The purpose of pre-emphasis is to boost the high-frequency part of the voice signal, flattening its spectrum so that the spectrum can be computed with the same signal-to-noise ratio across the entire band from low to high frequency, which facilitates spectral or parametric analysis. According to an embodiment of the invention, pre-emphasis is realised by the following digital filter:

H(z) = 1 − u·z⁻¹, 0.93 < u < 0.97,

where u is the filter coefficient and z is the z-transform variable of the input signal.
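As an illustration only (not part of the patent), the filter H(z) = 1 − u·z⁻¹ corresponds in the time domain to the difference equation y[n] = x[n] − u·x[n−1]. A minimal sketch in Python/NumPy, assuming a coefficient u = 0.95 as a midpoint of the stated range:

```python
import numpy as np

def preemphasis(x, u=0.95):
    """Time-domain form of H(z) = 1 - u*z^-1: y[n] = x[n] - u*x[n-1].

    The range 0.93 < u < 0.97 follows the text; u = 0.95 is an
    assumed midpoint, not a value given in the patent.
    """
    x = np.asarray(x, dtype=float)
    return np.append(x[0], x[1:] - u * x[:-1])
```

Boosting the high band this way flattens the spectral tilt of voiced speech before windowing and framing.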
Since the voice signal is a typically non-stationary signal, its statistical properties over an entire utterance are unknowable, but over 10 ms to 30 ms it is smooth. Therefore, to obtain a segment of signal that transitions smoothly and has strong autocorrelation, windowing and framing must be carried out before feature extraction so that the signal at any moment can be analysed. After the speech has been framed, the irrelevant noise in the signal needs to be removed.

In the pre-processing, irrelevant noise (unimportant information and background noise) is rejected by voice-data endpoint detection.
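The windowing and framing step can be sketched as follows. Frame and hop lengths of 25 ms and 10 ms at a 16 kHz sampling rate (400 and 160 samples) are common choices, not values specified by the patent, and the Hamming window is likewise an assumption:

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Cut a 1-D signal into overlapping frames and apply a Hamming
    window to each, yielding an (n_frames, frame_len) array."""
    x = np.asarray(x, dtype=float)
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    frames = np.stack([x[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    return frames * np.hamming(frame_len)
```

The window tapers each frame's edges so that adjacent frames transition smoothly, which is what gives each short segment the quasi-stationary behaviour the analysis relies on.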
According to an embodiment of the invention, the voice-data endpoint detection judges the voice-data endpoints by short-time energy, the short-time energy being expressed by the following formula:

where o(j) is the j-th frame of voice data, j is the frame number, g(t) is the logarithmic one-dimensional short-time energy, I is the window length, and n is the first frame of voice data in the current window. Fig. 3 shows the short-time energy waveform in one embodiment of the invention.
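The formula itself appears only as an image in the source and is not reproduced above. Based on the variable definitions (o(j) the frame data, I the window length, g(t) logarithmic), a conventional log short-time energy would look like the following sketch; this is an illustrative reconstruction, not the patent's exact expression:

```python
import numpy as np

def log_short_time_energy(frames, eps=1e-10):
    """Per-frame logarithmic short-time energy:
    g = log10(sum of squared samples within the frame).

    `frames` is an (n_frames, frame_len) array from a framing step;
    eps avoids log(0) on perfectly silent frames."""
    frames = np.asarray(frames, dtype=float)
    return np.log10(np.sum(frames ** 2, axis=1) + eps)
```

Speech frames show markedly higher g(t) than silence, which is the contrast Fig. 3 illustrates.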
The voice-data endpoint detection judges the voice-data endpoints by short-time energy, convolving the short-time energy with an endpoint-detection filter:

where H(i) is the endpoint-detection filter, g(t) is the logarithmic one-dimensional short-time energy, and F(t) is the voice data obtained after convolving the short-time energy with the endpoint-detection filter.

When F(t) is below a certain threshold, the corresponding voice data is rejected as irrelevant noise. Fig. 4 shows the voice data after irrelevant noise has been rejected by endpoint detection in one embodiment of the invention.
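The convolution-and-threshold step can be sketched as below. The filter taps H(i) and the threshold are not specified in the text; a short moving-average kernel is used here purely as a placeholder:

```python
import numpy as np

def endpoint_mask(g, h, threshold):
    """Convolve the log short-time energy g with an endpoint-detection
    filter h and keep the frames whose response F(t) reaches the
    threshold; frames below it are rejected as irrelevant noise."""
    f = np.convolve(np.asarray(g, dtype=float),
                    np.asarray(h, dtype=float), mode="same")
    return f >= threshold
```

A call such as `frames[endpoint_mask(g, np.ones(5) / 5, thr)]` would then drop the non-speech frames before feature extraction; smoothing g with the filter suppresses isolated energy spikes that would otherwise be mistaken for speech onsets.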
Step S102: extract the voice-data feature data.

According to an embodiment of the invention, linear prediction cepstral coefficients are extracted from the voice data after irrelevant noise has been rejected, to obtain the feature data of the voice data.

In the present invention, after the all-pole model of the speech is obtained by linear prediction analysis, the linear prediction cepstral coefficients are derived from it. In some embodiments, to obtain a better recognition effect, some post-processing is usually applied to the cepstral coefficients, such as multiplying each component of the cepstral coefficients by an appropriate weighting coefficient, or computing first- and second-order differences on the basis of the current cepstral coefficients.
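As a non-authoritative sketch of one common way to compute linear prediction cepstral coefficients (the patent does not reproduce its exact derivation): solve the autocorrelation normal equations with the Levinson-Durbin recursion to obtain the all-pole coefficients, then convert them with the standard recursion cₙ = −aₙ − Σₖ₌₁ⁿ⁻¹ (k/n)·cₖ·aₙ₋ₖ:

```python
import numpy as np

def lpc(frame, order):
    """All-pole (LPC) coefficients a = [1, a1, ..., a_order] via the
    Levinson-Durbin recursion on the frame's autocorrelation."""
    frame = np.asarray(frame, dtype=float)
    r = np.array([frame[: len(frame) - k] @ frame[k:]
                  for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0], err = 1.0, r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for this recursion step.
        k = -(r[i] + a[1:i] @ r[i - 1:0:-1]) / err
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a

def lpcc(a, n_ceps):
    """Cepstral coefficients of the all-pole model 1/A(z)."""
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a[n] if n < len(a) else 0.0
        for k in range(1, n):
            if 0 < n - k < len(a):
                acc += (k / n) * c[k] * a[n - k]
        c[n] = -acc
    return c[1:]
```

The weighting ("liftering") and delta/delta-delta differences mentioned above would then be applied per frame to the `lpcc` output.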
Step S103: match the voice data.

According to an embodiment of the invention, the extracted voice-data features are matched for similarity against the voice data in the template library, with dynamic time warping applied to the extracted features, comprising:

defining a cost function Φ[(nᵢ, mᵢ)] = (nᵢ₋₁, mᵢ₋₁), which denotes the previous grid point (nᵢ₋₁, mᵢ₋₁) of the current point (nᵢ, mᵢ) on the dynamic time warping path.

The points on the dynamic time warping path satisfy the following constraints:

a. the path starts at (1, 1) and ends at (N, M);

b. taking the start point and end point as opposite vertices of a parallelogram, every point on the path falls inside that parallelogram.

Fig. 5 shows the dynamic time warping path in one embodiment of the invention: the path starts at A (1, 1) and ends at B (N, M), and every point on it falls inside the parallelogram ACBD (the solid-line parallelogram in Fig. 5). In some embodiments, the points on the path fall inside AC′BD′ (the dashed-line parallelogram in Fig. 5). It should be understood that any path whose points satisfy the constraint conditions of the invention suffices.
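A minimal sketch of dynamic time warping under a parallelogram-shaped global constraint, in the spirit of the Itakura parallelogram (the slope bound of 2 is an assumed value, not one given in the patent). The predecessor relation in the dynamic programme plays the role of the cost function Φ[(nᵢ, mᵢ)] = (nᵢ₋₁, mᵢ₋₁): each cell's best predecessor is the previous grid point on the optimal path.

```python
import numpy as np

def in_parallelogram(i, j, n, m, slope=2.0):
    """True if grid point (i, j) lies inside the parallelogram whose
    opposite vertices are the start (0, 0) and the end (n-1, m-1)."""
    ok_start = j <= slope * i + 0.5 and i <= slope * j + 0.5
    ri, rj = (n - 1) - i, (m - 1) - j
    ok_end = rj <= slope * ri + 0.5 and ri <= slope * rj + 0.5
    return ok_start and ok_end

def dtw_distance(X, Y):
    """DTW distance between feature sequences X (N, d) and Y (M, d),
    restricted to warping paths inside the parallelogram."""
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    n, m = len(X), len(Y)
    D = np.full((n, m), np.inf)
    D[0, 0] = np.linalg.norm(X[0] - Y[0])
    for i in range(n):
        for j in range(m):
            if (i, j) == (0, 0) or not in_parallelogram(i, j, n, m):
                continue
            prev = min(D[i - 1, j] if i else np.inf,
                       D[i, j - 1] if j else np.inf,
                       D[i - 1, j - 1] if i and j else np.inf)
            D[i, j] = np.linalg.norm(X[i] - Y[j]) + prev
    return D[-1, -1]
```

The constraint prunes cells far from the diagonal, so grossly implausible alignments are never evaluated, which both speeds up the search and prevents degenerate warps.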
During speech recognition, when a user trains or performs recognition, even if the same word is spoken in the same manner as far as possible each time, its duration still varies randomly, and the relative durations of the parts within each word also vary randomly. Through the matching scheme described above, the present invention effectively solves a series of problems that arise when similarity is compared using feature matrices directly: poor matching quality, large recognition error, and poor recognition results.

Step S104: complete the speech recognition.

According to an embodiment of the invention, the cost function Φ[(nᵢ, mᵢ)] = (nᵢ₋₁, mᵢ₋₁) is iterated continuously so that the extracted voice-data features are matched with the voice data in the template library. In the matching, the template whose voice-data features yield the smallest comparison value against the extracted features is selected, and its voice-data features are output as the recognition result.
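Step S104 reduces to a nearest-template search: the extracted features are compared with every template, and the label with the smallest warping distance wins. A schematic sketch (the `distance` argument would be the DTW comparison; the labels below are hypothetical examples, not ones from the patent):

```python
def recognize(features, templates, distance):
    """Return the label of the template whose comparison value against
    the extracted feature sequence is smallest.

    templates: mapping label -> stored feature sequence.
    distance:  callable (features, template) -> comparison value,
               e.g. a DTW distance.
    """
    return min(templates,
               key=lambda label: distance(features, templates[label]))
```

For instance, `recognize(feats, {"turn left": t1, "turn right": t2}, dtw_distance)` would return whichever command template warps onto `feats` most cheaply.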
According to an embodiment of the invention, the voice data of the template library is obtained by training as follows: training speech is acquired and converted to training voice data, which is pre-processed to reject irrelevant noise; linear prediction cepstral coefficients are extracted from the training voice data after irrelevant noise has been rejected, to obtain the feature data of the training voice data; and the feature data of the training speech is fed into the template library.
In an embodiment, tests of test-speech recognition show that the speech recognition method provided by the invention can effectively improve recognition accuracy.

Table 1: comparison of the feature data of seven test utterances with the template speech.

In Table 1, the feature data of seven test utterances is compared with the training speech in the template library. The first seven columns of the table are the template speech; the last column, "test", is the same test utterance spoken by another speaker. The system's criterion is to select as the result the template speech whose feature comparison with the test utterance yields the smallest value.

It can be seen that when a test utterance is compared with itself no warping is needed, so its distance is 0, apart from identical commands.

The experiments show that the system can identify the speech most similar to a given test utterance. "Notebook" (three characters) and "turn on the lamp" (four characters) were deliberately chosen for the experiment so that words of differing length would be compared with the other test utterances. It can be seen that, although there are particular cases, the comparison values of "notebook" and "turn on the lamp" against the other test utterances are usually larger.
Fig. 6 shows a structural block diagram of a speech recognition system of the present invention. According to an embodiment of the invention, a speech recognition system comprises:

a pre-processing module 101, for converting an acquired voice signal to voice data and pre-processing the voice data to reject irrelevant noise.

The pre-processing includes sampling and quantising the voice data, pre-emphasis, windowing and framing, and rejecting irrelevant noise by voice-data endpoint detection.

The voice-data endpoint detection judges the voice-data endpoints by short-time energy, the short-time energy being expressed by the following formula:

where o(j) is the j-th frame of voice data, j is the frame number, g(t) is the logarithmic one-dimensional short-time energy, I is the window length, and n is the first frame of voice data in the current window.

The voice-data endpoint detection judges the voice-data endpoints by short-time energy, convolving the short-time energy with an endpoint-detection filter:

where H(i) is the endpoint-detection filter, g(t) is the logarithmic one-dimensional short-time energy, and F(t) is the voice data obtained after convolving the short-time energy with the endpoint-detection filter.

When F(t) is below a certain threshold, the corresponding voice data is rejected as irrelevant noise.
a feature-extraction module 102, for extracting linear prediction cepstral coefficients from the voice data after irrelevant noise has been rejected, to obtain the feature data of the voice data.

After the all-pole model of the speech is obtained by linear prediction analysis, the linear prediction cepstral coefficients are derived from it. In some embodiments, to obtain a better recognition effect, some post-processing is usually applied to the cepstral coefficients, such as multiplying each component of the cepstral coefficients by an appropriate weighting coefficient, or computing first- and second-order differences on the basis of the current cepstral coefficients.

a template library 103, for storing the feature data of training speech.

The voice data of the template library is obtained by training as follows: training speech is acquired and converted to training voice data, which is pre-processed to reject irrelevant noise; linear prediction cepstral coefficients are extracted from the training voice data after irrelevant noise has been rejected, to obtain the feature data of the training voice data; and the feature data of the training speech is fed into the template library.
a speech recognition module 104, for matching the extracted voice-data features for similarity against the voice data in the template library, wherein dynamic time warping is applied to the extracted features, comprising:

defining a cost function Φ[(nᵢ, mᵢ)] = (nᵢ₋₁, mᵢ₋₁), which denotes the previous grid point (nᵢ₋₁, mᵢ₋₁) of the current point (nᵢ, mᵢ) on the dynamic time warping path;

the points on the dynamic time warping path satisfying the following constraints:

a. the path starts at (1, 1) and ends at (N, M);

b. taking the start point and end point as opposite vertices of a parallelogram, every point on the path falls inside that parallelogram;

iterating the cost function continuously until the extracted voice-data features are matched with the voice data in the template library.

In the matching, the template whose voice-data features yield the smallest comparison value against the extracted features is selected, and its voice-data features are output as the recognition result.
In the speech recognition method and speech recognition system provided by the invention, linear prediction cepstral coefficients are extracted from the voice data after irrelevant noise has been rejected and serve as the feature values of the voice data. During the dynamic time warping applied to the feature matrix extracted after pre-processing, the time axis of the feature-parameter sequences is re-calibrated, which effectively improves the accuracy of feature matching.

The speech recognition method and speech recognition system provided by the invention recognise accurately, compute quickly, and have a wide range of applications, with extremely broad prospects in fields such as communication and automatic speech recognition.
From the description and practice of the invention disclosed herein, other embodiments of the present invention will be readily apparent to those skilled in the art. The description and embodiments are to be regarded as exemplary only; the true scope and spirit of the invention are defined by the claims.
Claims (10)
1. A speech recognition method, characterised in that the method comprises:

acquiring a voice signal, converting it to voice data, and pre-processing the voice data to reject irrelevant noise;

extracting linear prediction cepstral coefficients from the voice data after irrelevant noise has been rejected, to obtain the feature data of the voice data;

matching the extracted voice-data features for similarity against the voice data in a template library, wherein dynamic time warping is applied to the extracted features, comprising:

defining a cost function Φ[(nᵢ, mᵢ)] = (nᵢ₋₁, mᵢ₋₁), which denotes the previous grid point (nᵢ₋₁, mᵢ₋₁) of the current point (nᵢ, mᵢ) on the dynamic time warping path;

the points on the dynamic time warping path satisfying the following constraints:

a. the path starts at (1, 1) and ends at (N, M);

b. taking the start point and end point as opposite vertices of a parallelogram, every point on the path falls inside that parallelogram;

iterating the cost function continuously until the extracted voice-data features are matched with the voice data in the template library.
2. The method according to claim 1, characterized in that the preprocessing comprises sampling and quantizing the voice data, pre-emphasis processing of the voice data, windowing and framing of the voice data,
and rejecting uncorrelated noise through voice data endpoint detection.
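As a rough sketch of the preprocessing chain in this claim (sampling and quantization are assumed already done by the ADC; the pre-emphasis coefficient 0.97, the Hamming window, and the frame sizes are conventional choices, not taken from the patent):

```python
import numpy as np

def preprocess(x, frame_len=256, hop=128, alpha=0.97):
    """Pre-emphasis, then windowed framing of a sampled voice signal x."""
    y = np.append(x[0], x[1:] - alpha * x[:-1])    # pre-emphasis: y[n] = x[n] - a*x[n-1]
    win = np.hamming(frame_len)                    # windowing
    n = 1 + (len(y) - frame_len) // hop            # number of overlapping frames
    return np.stack([y[i * hop:i * hop + frame_len] * win for i in range(n)])

frames = preprocess(np.sin(0.05 * np.arange(2048)))
print(frames.shape)  # → (15, 256)
```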
3. The method according to claim 2, characterized in that the voice data endpoint detection judges voice data endpoints by short-time energy, the short-time energy being given by:
g(t) = log Σ_{j=n}^{n+I-1} o(j)^2
wherein o(j) is a frame of voice data, j is the frame index, g(t) is the logarithmized one-dimensional short-time energy, I is the window length, and n is the first frame of voice data in the current window.
4. The method according to claim 3, characterized in that the voice data endpoint detection judges voice data endpoints by short-time energy, by convolving the short-time energy with an endpoint detection filter:
F(t) = Σ_i H(i) · g(t - i)
wherein H(i) is the endpoint detection filter, g(t) is the logarithmized one-dimensional short-time energy, and F(t) is the voice data obtained after convolving the short-time energy with the endpoint detection filter;
when F(t) is less than a certain threshold, the corresponding voice data is rejected as uncorrelated noise.
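The energy formulas in claims 3 and 4 appear as images in the original publication and are not reproduced in this text. Under the common reading that g(t) is the logarithm of a windowed sum of squared samples and F(t) is the convolution of g(t) with a smoothing filter H, the detection can be sketched as follows; the filter taps, window length, and threshold are illustrative assumptions:

```python
import numpy as np

def log_short_time_energy(x, I):
    """g(t) = log of the sum of squared samples in a window of length I."""
    return np.array([np.log(np.sum(x[t:t + I] ** 2) + 1e-12)  # epsilon keeps log finite
                     for t in range(len(x) - I + 1)])

def detect_speech(x, I=8, H=None, threshold=-6.0):
    """Convolve g(t) with an endpoint-detection filter H and threshold F(t)."""
    if H is None:
        H = np.ones(5) / 5.0                  # illustrative smoothing taps
    g = log_short_time_energy(x, I)
    F = np.convolve(g, H, mode="same")        # F(t): energy * filter
    return F > threshold                      # False -> rejected as uncorrelated noise

# silence | tone burst | silence
sig = np.concatenate([np.zeros(40), np.sin(np.linspace(0, 20, 60)), np.zeros(40)])
mask = detect_speech(sig)
```

The mask is False over the silent stretches (low F(t)) and True inside the tone burst, which is the rejection behaviour the claim describes.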
5. The method according to claim 1, characterized in that the voice data of the template library is obtained by training as follows:
acquiring training voice, converting it into training voice data, and preprocessing the training voice data to reject uncorrelated noise;
extracting linear prediction cepstral coefficients from the training voice data after the uncorrelated noise is rejected, to obtain characteristic data of the training voice data;
and storing the obtained characteristic data of the training voice in the template library.
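The linear-prediction feature named in claims 1 and 5 (rendered here as linear prediction cepstral coefficients, a common reading of the machine translation) can be sketched under standard assumptions: LPC coefficients from Levinson-Durbin on the frame autocorrelation, then the usual LPC-to-cepstrum recursion. The order p = 12 and all helper names are our own choices:

```python
import numpy as np

def lpcc(frame, p=12, q=12):
    """Linear prediction cepstral coefficients c[1..q] of one windowed frame."""
    # biased autocorrelation r[0..p]
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:]) for k in range(p + 1)])
    a = np.zeros(p + 1)          # prediction coefficients a[1..p]
    e = r[0]                     # prediction error energy
    for i in range(1, p + 1):    # Levinson-Durbin recursion
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / e
        a_new = a.copy()
        a_new[i] = k
        a_new[1:i] = a[1:i] - k * a[i - 1:0:-1]
        a = a_new
        e *= (1 - k * k)
    # LPC-to-cepstrum: c[n] = a[n] + sum_{k=1}^{n-1} (k/n) c[k] a[n-k]
    c = np.zeros(q + 1)
    for n in range(1, q + 1):
        acc = a[n] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k] * a[n - k]
        c[n] = acc
    return c[1:]

rng = np.random.default_rng(1)
frame = np.hamming(256) * (np.sin(0.3 * np.arange(256)) + 0.1 * rng.normal(size=256))
feats = lpcc(frame)
print(feats.shape)  # → (12,)
```

One such coefficient vector per frame would be stored in the template library as the characteristic data of a training utterance.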
6. A speech recognition system, characterized in that the system comprises:
a preprocessing module for acquiring a voice signal, converting it into voice data, and preprocessing the voice data to reject uncorrelated noise;
a feature extraction module for extracting linear prediction cepstral coefficients from the voice data after the uncorrelated noise is rejected, to obtain characteristic data of the voice data;
a template library for storing the characteristic data of the training voice;
a speech recognition module for performing similarity matching between the extracted voice data features and the voice data in the template library, wherein performing dynamic time warping on the extracted voice data features comprises:
defining a cost function Φ[(n_i, m_i)] = (n_{i-1}, m_{i-1}), denoting the grid point (n_{i-1}, m_{i-1}) that precedes the current point (n_i, m_i) on the dynamic time warping path,
wherein the points on the dynamic time warping path satisfy the following constraints:
A. the dynamic time warping path starts at the point (1, 1) and ends at the point (N, M);
B. taking the starting point and the ending point as opposite vertices of a parallelogram, all points on the dynamic time warping path fall within that parallelogram;
and iterating the cost function continuously to match the extracted voice data features against the voice data in the template library.
7. The system according to claim 6, characterized in that the preprocessing comprises sampling and quantizing the voice data, pre-emphasis processing of the voice data, windowing and framing of the voice data,
and rejecting uncorrelated noise through voice data endpoint detection.
8. The system according to claim 7, characterized in that the voice data endpoint detection judges voice data endpoints by short-time energy, the short-time energy being given by:
g(t) = log Σ_{j=n}^{n+I-1} o(j)^2
wherein o(j) is a frame of voice data, j is the frame index, g(t) is the logarithmized one-dimensional short-time energy, I is the window length, and n is the first frame of voice data in the current window.
9. The system according to claim 8, characterized in that the voice data endpoint detection judges voice data endpoints by short-time energy, by convolving the short-time energy with an endpoint detection filter:
F(t) = Σ_i H(i) · g(t - i)
wherein H(i) is the endpoint detection filter, g(t) is the logarithmized one-dimensional short-time energy, and F(t) is the voice data obtained after convolving the short-time energy with the endpoint detection filter;
when F(t) is less than a certain threshold, the corresponding voice data is rejected as uncorrelated noise.
10. The system according to claim 6, characterized in that the voice data of the template library is obtained by training as follows:
acquiring training voice, converting it into training voice data, and preprocessing the training voice data to reject uncorrelated noise;
extracting linear prediction cepstral coefficients from the training voice data after the uncorrelated noise is rejected, to obtain characteristic data of the training voice data;
and storing the obtained characteristic data of the training voice in the template library.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910444597.1A CN110265049A (en) | 2019-05-27 | 2019-05-27 | A kind of audio recognition method and speech recognition system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110265049A true CN110265049A (en) | 2019-09-20 |
Family
ID=67915432
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910444597.1A Pending CN110265049A (en) | 2019-05-27 | 2019-05-27 | A kind of audio recognition method and speech recognition system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110265049A (en) |
2019-05-27 CN CN201910444597.1A patent/CN110265049A/en active Pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1197952A1 (en) * | 2000-10-18 | 2002-04-17 | Thales | Coding method of the prosody for a very low bit rate speech encoder |
CN102982803A (en) * | 2012-12-11 | 2013-03-20 | 华南师范大学 | Isolated word speech recognition method based on HRSF and improved DTW algorithm |
CN104865973A (en) * | 2015-04-13 | 2015-08-26 | 天津工业大学 | Method for controlling window-cleaning robot through voice |
CN106874185A (en) * | 2016-12-27 | 2017-06-20 | 中车株洲电力机车研究所有限公司 | A kind of automated testing method driven based on voiced keyword and system |
Non-Patent Citations (4)
Title |
---|
何晖 [He Hui]: "《现代信号检测技术与评估理论的应用与研究》" [Application and Research of Modern Signal Detection Technology and Evaluation Theory], 31 August 2018 *
吴艳艳 [Wu Yanyan]: "孤立语音识别的关键技术研究" [Research on Key Technologies of Isolated-Word Speech Recognition], 《中国优秀硕士学位论文全文数据库信息科技辑》 [China Masters' Theses Full-Text Database, Information Science and Technology] *
张仁志等 [Zhang Renzhi et al.]: "基于短时能量的语音端点检测算法研究" [Research on a Speech Endpoint Detection Algorithm Based on Short-Time Energy], 《电声技术》 [Audio Engineering] *
王炳锡等 [Wang Bingxi et al.]: "《实用语音识别基础》" [Fundamentals of Practical Speech Recognition], 31 January 2005 *
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110931022A (en) * | 2019-11-19 | 2020-03-27 | 天津大学 | Voiceprint identification method based on high-frequency and low-frequency dynamic and static characteristics |
CN110931022B (en) * | 2019-11-19 | 2023-09-15 | 天津大学 | Voiceprint recognition method based on high-low frequency dynamic and static characteristics |
CN112965381A (en) * | 2021-02-09 | 2021-06-15 | 重庆高开清芯智联网络科技有限公司 | Method for establishing cooperative intelligent self-adaptive decision model |
CN112965381B (en) * | 2021-02-09 | 2022-11-11 | 重庆高开清芯智联网络科技有限公司 | Method for establishing cooperative intelligent self-adaptive decision model |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Togneri et al. | An overview of speaker identification: Accuracy and robustness issues | |
US8566088B2 (en) | System and method for automatic speech to text conversion | |
Wang et al. | Exploring monaural features for classification-based speech segregation | |
Chu et al. | SAFE: A statistical approach to F0 estimation under clean and noisy conditions | |
Hansen et al. | Automatic voice onset time detection for unvoiced stops (/p/,/t/,/k/) with application to accent classification | |
CN110232933B (en) | Audio detection method and device, storage medium and electronic equipment | |
CN104021789A (en) | Self-adaption endpoint detection method using short-time time-frequency value | |
CN103646649A (en) | High-efficiency voice detecting method | |
KR20130133858A (en) | Speech syllable/vowel/phone boundary detection using auditory attention cues | |
CN101136199A (en) | Voice data processing method and equipment | |
CN108305639B (en) | Speech emotion recognition method, computer-readable storage medium and terminal | |
CN107564543B (en) | Voice feature extraction method with high emotion distinguishing degree | |
CN111599344B (en) | Language identification method based on splicing characteristics | |
CN108682432B (en) | Speech emotion recognition device | |
Deshmukh et al. | Speech based emotion recognition using machine learning | |
CN110265049A (en) | A kind of audio recognition method and speech recognition system | |
Poorna et al. | Emotion recognition using multi-parameter speech feature classification | |
Sun et al. | A novel convolutional neural network voiceprint recognition method based on improved pooling method and dropout idea | |
CN110136746B (en) | Method for identifying mobile phone source in additive noise environment based on fusion features | |
CN110415707B (en) | Speaker recognition method based on voice feature fusion and GMM | |
CN116597853A (en) | Audio denoising method | |
CN111091816B (en) | Data processing system and method based on voice evaluation | |
Li et al. | Tibetan voice activity detection based on one-dimensional convolutional neural network | |
Nosek et al. | Synthesized speech detection based on spectrogram and convolutional neural networks | |
Heckmann et al. | A closer look on hierarchical spectro-temporal features (HIST). |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20190920 |