CN1991976A - Phoneme based voice recognition method and system - Google Patents

Phoneme based voice recognition method and system

Info

Publication number
CN1991976A
CN1991976A
Authority
CN
China
Prior art keywords
signal
phoneme
sound
frequency
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2005101214992A
Other languages
Chinese (zh)
Inventor
潘建强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CNA2005101214992A priority Critical patent/CN1991976A/en
Publication of CN1991976A publication Critical patent/CN1991976A/en
Pending legal-status Critical Current

Landscapes

  • Auxiliary Devices For Music (AREA)

Abstract

A phoneme-based speech recognition method and system comprising: A) converting an analog speech signal into a digital speech signal; B) detecting the short-time zero-crossing rate: if it is below a preset value the signal is judged to be voiced and given voiced preprocessing, and if it is above the preset value the signal is judged to be unvoiced and given unvoiced preprocessing; C) applying a spectral transform to the preprocessed data and extracting features; D) analyzing the feature data; E) outputting a phoneme sequence according to the analysis result. The method and system apply different processing to unvoiced and voiced sounds; in particular, voiced phonemes are modeled on the spectrum of a single pitch period, which overcomes the shortcomings of existing voice-input recognition systems and offers high recognition efficiency, high accuracy, and high stability.

Description

Phoneme-based speech recognition method and system
Technical field
The present invention relates to speech recognition technology in the field of computing, and in particular to a phoneme-based speech recognition method and system.
Background technology
The fast Fourier transform (FFT) of a sequence is one of the most important tools for analyzing and processing discrete-time signals. If the signal is a finite-length sequence, its spectrum can be obtained by applying the FFT to the sequence directly. For an analog signal, spectrum analysis with the FFT first requires sampling the signal to turn it into a discrete signal. By the sampling theorem, the sampling frequency fs must be greater than twice the highest frequency of the signal. From the relationship between digital and analog frequency, when an N-point FFT is used for spectrum analysis, the analog frequency resolution is:
ΔF = fs / N    (1)
Therefore, to guarantee a specified frequency resolution ΔF, the number of points N used for the FFT must satisfy
N ≥ fs / ΔF    (2)
When a radix-2 FFT algorithm is used, N must also be an integer power of 2. The frequency represented by each spectral line is
f_k = fs × k / N,  k = 0, 1, 2, 3, ..., N/2    (3)
Equation (2) shows that, for a fixed sampling frequency, a high frequency resolution requires a sufficiently large number of FFT points N. In continuous speech, however, some phonemes, such as the compound vowels of Chinese, contain transitional sounds lasting only a few milliseconds. Applying a spectral transform directly to a few milliseconds of signal yields very low frequency resolution, the speech recognition features built from such spectral data are imprecise, and the recognition result becomes unreliable.
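As a quick illustration of equations (1) to (3), the following sketch (an editorial example, not part of the patent) computes the frequency resolution of an N-point FFT and the number of points needed for a target resolution; the 8000 Hz sampling rate is the value used in the effect analysis later in the description:

```python
import numpy as np

fs = 8000          # sampling frequency in Hz (value used in the patent's effect analysis)

# Equation (1): frequency resolution of an N-point FFT
def freq_resolution(n_points: int) -> float:
    return fs / n_points

# Equation (2): minimum number of FFT points for a target resolution,
# rounded up to the next power of two for a radix-2 FFT
def fft_points_for(delta_f: float) -> int:
    n_min = int(np.ceil(fs / delta_f))
    return 1 << (n_min - 1).bit_length()

# Equation (3): frequencies represented by the spectral lines k = 0..N/2
def line_frequencies(n_points: int) -> np.ndarray:
    k = np.arange(n_points // 2 + 1)
    return fs * k / n_points

print(freq_resolution(25))     # 320.0 Hz -- a few milliseconds of signal
print(fft_points_for(10.0))    # 1024 points needed for roughly 8 Hz resolution
```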
A speech signal is non-stationary, and a plain FFT cannot reflect how it changes over time. The short-time Fourier transform (STFT) is therefore widely used: the waveform under a sliding window is Fourier-transformed frame by frame to obtain a spectrogram. Depending on the window length, spectrograms are divided into narrowband and wideband spectrograms. For a narrowband spectrogram the window is usually longer than two pitch periods; it has good frequency resolution and can resolve the individual harmonic lines, but a long window spanning several periods prevents the spectrogram from showing how the frequencies change over time, and when the spectrum of the windowed signal changes strongly the spectrogram becomes cluttered and impossible to interpret. For a wideband spectrogram the window is usually shorter than one pitch period; shortening the window broadens the spectral resolution of the STFT, so the harmonic structure is smeared and only the spectral envelope can be approximated. Moreover, because the window is shorter than a pitch period, the computed spectrum suffers from leakage, and the resulting spectrogram is inaccurate.
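For reference, a narrowband and a wideband spectrogram along these lines could be computed as follows (an editorial sketch using SciPy; the sampling rate, pitch, and window lengths are assumed values chosen only to show the window-length trade-off):

```python
import numpy as np
from scipy.signal import stft

fs = 8000                                  # sampling rate in Hz (assumed)
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 120 * t)            # stand-in for a voiced signal with a 120 Hz pitch

pitch_period = fs // 120                   # samples per pitch period (about 66)

# Narrowband spectrogram: window longer than two pitch periods
f_nb, t_nb, s_nb = stft(x, fs=fs, nperseg=4 * pitch_period)

# Wideband spectrogram: window shorter than one pitch period
f_wb, t_wb, s_wb = stft(x, fs=fs, nperseg=pitch_period // 2)

print(f_nb[1] - f_nb[0], f_wb[1] - f_wb[0])   # frequency spacing: fine vs. coarse
```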
Because the frequency, phase, and amplitude of a speech signal are all unstable in the time domain, and the Fourier transform itself has no time resolution, the transform hides the variation of the signal spectrum, even though spectral change is an inevitable consequence of changing speech. The short-time Fourier transform does provide some time resolution, but its effect on speech spectrum analysis is limited because neither its frequency nor its time resolution is high enough; the spectrogram is of limited use and can even convey misleading information, which is one reason the production and perception of speech are still not fully understood. Although the spectrogram has been called "visible speech", only professionals with long training can read the meaning of speech from a spectrogram, and even then not with full accuracy. A spectrogram is not required during the recognition process itself, but during the development of a speech recognition system an intuitive spectrogram is helpful for building correct reference templates of the recognition features.
With the development of computers and the steady progress of signal processing technology, the performance of voice-input and speech recognition products keeps improving and their range of application keeps widening; nevertheless, because several key technical problems remain unsolved, existing speech recognition products still suffer from various shortcomings.
Chinese invention patent application No. 97111623.7 discloses a speech recognition computer module and a phoneme-based method for transforming digital speech signals. Its phoneme feature extraction method is: the digital speech signal is divided into an arbitrary number of segments, each phoneme is divided into an arbitrary number of fragments, and each phoneme or phoneme fragment is assigned a phoneme feature vector describing its characteristics; speech is then recognized by comparing the similarity between the speech-signal segments and the digitized pronunciation feature vectors. This scheme processes unvoiced and voiced sounds in the same way, and the basis for dividing the signal segments and phoneme fragments is unclear, so it cannot achieve a satisfactory recognition rate. In addition, Chinese invention patent application No. 200410058687.0 by International Business Machines Corporation (IBM) introduces a speech recognition system that models the posterior probability of the speech units relevant to recognition with a log-linear model. The posterior model gives the probability of a speech unit given the observed speech features and the parameters of the posterior model, and the posterior model can be determined from the probabilities of word-sequence hypotheses given a plurality of speech features. Continuous speech recognition systems based on this technology have the following shortcomings: 1) they require very standard pronunciation from the speaker; 2) they require a very quiet recognition environment with little background noise; 3) the recognition rate for phonemes, isolated words, and words is low, or they cannot be recognized at all; 4) the recognition rate depends on the topic, that is, on the content of the template library, and words not present in the templates cannot be recognized; 5) the system must build a very large recognition template library; and 6) repeatability is poor: when the same sentence cannot be recognized correctly, the recognition results differ from one attempt to the next. These defects show that the modeled templates adapt poorly and that the extracted recognition features are unstable. Although such products have been on the market for many years, they have never been widely adopted, let alone become universal.
Existing continuous speech recognition systems recognize unvoiced and voiced sounds together and sample the speech signal in fixed-length frames to extract recognition features. Because there is no guarantee that each frame covers a single phoneme, the extracted recognition features are unstable and the recognition results are unsatisfactory.
Summary of the invention
The technical problem to be solved by the present invention is to provide a speech recognition method and system that overcome the above shortcomings of the prior art, with advantages such as low demands on the recognition environment, low demands on the speaker, a high recognition rate, the ability to recognize both isolated words and continuous speech, and reproducible recognition results.
The above technical problem is solved by the present invention as follows. A phoneme-based speech recognition method is constructed, characterized in that it comprises the following steps:
A) converting an analog speech signal into a digital speech signal;
B) detecting the short-time zero-crossing rate of the digital speech signal; if the short-time zero-crossing rate is below a set value the signal is judged to be voiced and voiced preprocessing is performed, and if it is above the set value the signal is judged to be unvoiced and unvoiced preprocessing is performed (see the sketch after this list);
C) applying a spectral transform to the preprocessed data and extracting features;
D) analyzing the extracted feature data;
E) outputting a phoneme sequence according to the analysis result.
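As a minimal sketch of the voiced/unvoiced decision in step B, the following example (an editorial illustration rather than the patent's implementation; the frame length and the zero-crossing threshold are assumed values) computes a short-time zero-crossing rate and applies the threshold test:

```python
import numpy as np

def short_time_zero_crossing_rate(frame: np.ndarray) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.sign(frame)
    signs[signs == 0] = 1                      # treat exact zeros as positive
    return float(np.mean(signs[1:] != signs[:-1]))

def classify_frame(frame: np.ndarray, zcr_threshold: float = 0.25) -> str:
    """Step B: below the threshold -> voiced, above -> unvoiced."""
    zcr = short_time_zero_crossing_rate(frame)
    return "voiced" if zcr < zcr_threshold else "unvoiced"

# Example: a 120 Hz sinusoid (voiced-like) vs. white noise (unvoiced-like), fs = 8000 Hz
fs = 8000
t = np.arange(200) / fs
print(classify_frame(np.sin(2 * np.pi * 120 * t)))   # "voiced"
print(classify_frame(np.random.randn(200)))          # "unvoiced"
```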
In the phoneme-based speech recognition method according to the present invention, the voiced-sound preprocessing comprises the following steps:
F1) measuring the frequency and amplitude of the pitch signal;
F2) decomposing the voiced signal, in order, by a segmentation device into mutually independent segments whose length equals one pitch period; if the fundamental period of the signal spans N samples, the signal is cut into segments of N consecutive samples each, so that the amplitude at the start and end of each segment is zero or as close to zero as possible;
F3) periodically repeating the segment data in the time domain by means of a time-domain extension device, turning the single-period signal into a multi-period signal (a sketch of steps F2 and F3 follows this list).
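A minimal sketch of steps F2 and F3, assuming the pitch period in samples is already known from step F1 (an editorial illustration; the cut points here fall at fixed multiples of the period rather than at detected zero crossings, and the 1024-point target length is taken from the effect analysis later in the description):

```python
import numpy as np

def segment_by_pitch_period(voiced: np.ndarray, period: int) -> list:
    """Step F2: cut the voiced signal into consecutive, non-overlapping
    segments of one pitch period (period = N samples)."""
    n_segments = len(voiced) // period
    return [voiced[i * period:(i + 1) * period] for i in range(n_segments)]

def time_domain_extend(segment: np.ndarray, n_fft: int = 1024) -> np.ndarray:
    """Step F3: periodically repeat a single-period segment in the time
    domain until it fills n_fft samples, turning it into a multi-period signal."""
    repeats = int(np.ceil(n_fft / len(segment)))
    return np.tile(segment, repeats)[:n_fft]

# Example: one 25-sample pitch period extended to 1024 points before the FFT
period = 25
voiced = np.sin(2 * np.pi * np.arange(4 * period) / period)
segments = segment_by_pitch_period(voiced, period)
spectrum = np.abs(np.fft.rfft(time_domain_extend(segments[0])))
print(len(segments), spectrum.shape)          # 4 segments, 513 spectral lines
```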
In the phoneme-based speech recognition method according to the present invention, the unvoiced-sound preprocessing comprises the following steps:
G1) setting the start and end amplitudes of the unvoiced sound;
G2) detecting the start and end points of plosives;
G3) detecting the duration of the unvoiced sound.
In the phoneme-based speech recognition method according to the present invention, step C) comprises extracting at least one or two of the following speech recognition features: spectral features, spectral-change features, cepstrum, linear prediction coefficients, formants, and phoneme duration.
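As an illustration of two of the listed features, the following sketch computes a magnitude spectrum and a real cepstrum from one preprocessed frame (an editorial example; the frame length and FFT size are assumptions, not values fixed by the patent):

```python
import numpy as np

def magnitude_spectrum(frame: np.ndarray, n_fft: int = 1024) -> np.ndarray:
    """Spectral feature: magnitude of the FFT of the (extended) frame."""
    return np.abs(np.fft.rfft(frame, n=n_fft))

def real_cepstrum(frame: np.ndarray, n_fft: int = 1024) -> np.ndarray:
    """Cepstral feature: inverse FFT of the log magnitude spectrum."""
    log_mag = np.log(magnitude_spectrum(frame, n_fft) + 1e-12)   # avoid log(0)
    return np.fft.irfft(log_mag)

fs = 8000
frame = np.sin(2 * np.pi * 120 * np.arange(1024) / fs)           # stand-in voiced frame
print(magnitude_spectrum(frame).shape, real_cepstrum(frame).shape)  # (513,) (1024,)
```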
In the phoneme-based speech recognition method according to the present invention, step D) comprises the following steps:
D1) summarizing the amplitude distribution of each frequency component over time, its pattern of variation, and its significance in speech;
D2) grouping spectra that have identical or similar features to form individual character templates;
D3) setting the similarity value for template comparison according to the system requirements, a high similarity value being suitable for speaker identification and command input, and a low similarity value being used for speech-to-text conversion.
In the phoneme-based speech recognition method according to the present invention, step E) comprises comparing the recognition features with the phoneme templates of the specified language or dialect in the speech database and determining the phoneme name.
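A minimal sketch of such a template comparison (an editorial example; the use of normalized correlation as the similarity measure, the 0.9 threshold, and the toy two-phoneme template library are all assumptions, since the patent does not prescribe a particular similarity measure):

```python
import numpy as np

def similarity(feature: np.ndarray, template: np.ndarray) -> float:
    """Normalized correlation between a feature vector and a template (1.0 = identical shape)."""
    a = feature / (np.linalg.norm(feature) + 1e-12)
    b = template / (np.linalg.norm(template) + 1e-12)
    return float(np.dot(a, b))

def match_phoneme(feature, templates, threshold=0.9):
    """Return the name of the best-matching phoneme template above the
    similarity threshold, or None if no template is close enough."""
    best_name, best_score = None, threshold
    for name, template in templates.items():
        score = similarity(feature, template)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name

# Hypothetical two-entry template library for the vowels [a] and [i]
templates = {"a": np.array([1.0, 0.8, 0.3, 0.1]), "i": np.array([0.2, 0.3, 0.9, 1.0])}
print(match_phoneme(np.array([0.9, 0.7, 0.35, 0.1]), templates))   # "a"
```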
In the phoneme-based speech recognition method according to the present invention, step F1), measuring the frequency and amplitude of the pitch signal, is implemented with one of the following fundamental frequency extraction methods: the autocorrelation method, the linear prediction method, the cepstrum method, comb-filter-based pitch estimation, or pitch estimation based on a harmonic sinusoidal model.
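A minimal sketch of the first of these options, autocorrelation-based pitch estimation (an editorial example; the 60-400 Hz search range is taken from the voiced-sound discussion later in the description, and the test tone is synthetic):

```python
import numpy as np

def estimate_pitch_autocorr(frame: np.ndarray, fs: int,
                            f_min: float = 60.0, f_max: float = 400.0):
    """Autocorrelation pitch estimation: the lag of the strongest
    autocorrelation peak inside the allowed pitch range gives the period."""
    frame = frame - np.mean(frame)
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(fs / f_max)                 # shortest allowed period in samples
    lag_max = int(fs / f_min)                 # longest allowed period in samples
    lag = lag_min + int(np.argmax(acf[lag_min:lag_max + 1]))
    frequency = fs / lag                      # fundamental frequency in Hz
    amplitude = float(np.max(np.abs(frame)))  # peak amplitude of the frame
    return frequency, amplitude, lag          # lag = pitch period N in samples

fs = 8000
t = np.arange(1024) / fs
frequency, amplitude, period = estimate_pitch_autocorr(np.sin(2 * np.pi * 125 * t), fs)
print(round(frequency), period)               # roughly 125 Hz, 64 samples
```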
In the phoneme-based speech recognition method according to the present invention, after step E), outputting a phoneme sequence according to the analysis result, the method further comprises a step of converting the phoneme sequence into text or instructions.
Another technical problem of the present invention is solved as follows. A phoneme-based speech recognition system is constructed, comprising:
a speech input device for converting an analog speech signal into a digital speech signal;
a voiced/unvoiced recognition device for detecting the short-time zero-crossing rate of the digital speech signal provided by the speech input device, outputting the digital speech signal to a voiced-sound preprocessing device when the detected short-time zero-crossing rate is below a set value, and outputting it to an unvoiced-sound preprocessing device when the detected short-time zero-crossing rate is above the set value;
a feature extraction device for applying a spectral transform to the data provided by the unvoiced-sound preprocessing device and the voiced-sound preprocessing device, analyzing the transform results, and extracting features;
a feature analysis device for analyzing the feature data extracted by the feature extraction device;
a speech storage device and a phoneme sequence output device for retrieving the speech storage device according to the analysis result output by the feature analysis device and outputting a phoneme sequence; and a device for converting the phonemes into text or instructions.
In the phoneme-based speech recognition system provided by the invention, the voiced-sound preprocessing device comprises the following modules: a module for measuring the frequency and amplitude of the pitch signal; a segmentation module for decomposing the voiced signal, in order, into mutually independent segments whose length equals one pitch period, i.e. segments of N consecutive samples where the fundamental period spans N samples, with the amplitude at each segment's start and end point zero or as close to zero as possible; and a time-domain extension module for periodically repeating the segment data in the time domain, turning the single-period signal into a multi-period signal. The unvoiced-sound preprocessing device comprises the following modules: a module for setting the start and end amplitudes of the unvoiced sound; a module for detecting the start and end points of plosives; and a module for detecting the duration of the unvoiced sound.
Implementing the speech recognition method and system provided by the invention makes it possible to process unvoiced and voiced sounds differently according to the characteristics of the speech signal, and in particular to model voiced phonemes with the spectral features of a single pitch period. This overcomes the shortcomings of existing voice-input recognition systems and offers high recognition efficiency, high accuracy, and high stability.
Description of drawings
Fig. 1 is a logic block diagram of an embodiment of the phoneme-based speech recognition system according to the present invention;
Fig. 2 is a logic diagram of the voiced-sound preprocessing device of the present invention;
Fig. 3 is a schematic flow chart of the phoneme-based speech recognition method of the present invention;
Fig. 4A is a schematic diagram of signal segmentation,
in which: S1 - speech signal; S2 - pitch signal;
ST1-ST4 - segment signals; T1-T4 - pitch periods;
Fig. 4B is the waveform of the time-domain extension of segment ST1;
Fig. 4C is the waveform of the time-domain extension of segment ST2;
Fig. 4D is the waveform of the time-domain extension of segment ST3;
Fig. 4E is the waveform of the time-domain extension of segment ST4;
Fig. 5 is the spectrum of the time-domain extension of segment ST1;
Fig. 6 is the spectrum of the time-domain extension of segment ST2;
Fig. 7 is the spectrum of the time-domain extension of segment ST3;
Fig. 8 is the spectrum of the time-domain extension of segment ST4;
Fig. 9 is the waveform of the falling-tone Chinese vowel [a] spoken by a middle-aged male;
Fig. 10 is the time-domain-extension spectrogram of the falling-tone vowel [a] spoken by a middle-aged male;
Fig. 11 is the narrowband spectrogram of the falling-tone vowel [a] spoken by a middle-aged male.
Embodiment
According to the present invention, in a speech signal the phoneme is the elementary unit that human hearing can distinguish. Depending on whether the vocal cords vibrate during pronunciation, phonemes can be divided into voiced and unvoiced sounds. When an unvoiced sound is produced the vocal cords do not vibrate, and the low-frequency range of its spectrum below 400 Hz contains no frequency at which energy is concentrated; in other words, an unvoiced sound has no fundamental frequency, its waveform resembles noise, its stability and periodicity are very poor, and its short-time zero-crossing rate is high. In contrast, when a voiced sound is produced the vocal cords vibrate, the volume is larger than that of an unvoiced sound and it carries farther; in everyday conversation the spectrum of a voiced sound has energy concentrated in the low-frequency range of 60-400 Hz, and the lowest such frequency is called the fundamental frequency, or pitch. In singing the fundamental frequency may exceed 400 Hz. The short-time zero-crossing rate of voiced sounds is generally lower than that of unvoiced sounds.
Because the short-time zero-crossing rates of unvoiced and voiced sounds differ, the two are easy to distinguish; moreover, the amplitude of unvoiced sounds is generally lower than that of voiced sounds, and for most unvoiced sounds the duration affects their pronunciation and meaning. Therefore, to recognize speech signals effectively and reduce the number of template comparisons, it is necessary to first separate unvoiced from voiced sounds and to process each according to its characteristics.
Fig. 1 shows an embodiment of the phoneme-based speech recognition system of the present invention; the functions of the various parts in the figure can be implemented in software and/or hardware. In it:
The speech input device 107 converts sound waves into an analog electrical signal and converts the analog signal into a digital signal. The voiced/unvoiced recognition device 101 detects the short-time zero-crossing rate of the speech signal; if the rate is below the set value the signal is judged to be voiced and is output to the voiced-sound processing device 102 for voiced preprocessing, otherwise it is judged to be unvoiced and is output to the unvoiced-sound processing device 103 for unvoiced preprocessing. The feature extraction device 104 extracts a plurality of speech recognition features, including the spectrum, cepstrum, linear prediction coefficients, formants, and duration, among which the spectral features and the phoneme duration are the most important. The feature analysis device 105 summarizes the amplitude distribution of each frequency component over time, its pattern of variation, and its significance in speech; groups spectra with identical or similar features to form individual character templates; sets the similarity value for template comparison according to the system requirements, a high similarity value being suitable for speaker identification and command input and a low similarity value for speech-to-text conversion; and finally determines the phoneme name by comparing the recognition features with the phoneme templates of the specified language or dialect. The speech storage device 106 stores, in the form of a database, phoneme templates and related data for multiple languages and dialects and for different sexes and age groups, and also provides storage space for users' personalized templates. The phoneme sequence output device 108 sends the recognition result to a system that converts the phoneme sequence into text or instructions. For example, the Chinese full-pinyin and double-pinyin input methods can convert phonemes into text, and other languages such as Japanese and Korean have similar input methods that transform phonemes into text. In fact, as long as the correspondence between phonemes and characters, letters, or words is established, phoneme-to-text conversion can be realized in any language.
As shown in Fig. 2, the voiced-sound processing device 102 of Fig. 1 consists of a fundamental frequency analysis device 201 for measuring the frequency and amplitude of the pitch signal, a signal segmentation device 202 for segmenting the voiced signal, a time-domain extension device 203 for periodically repeating the segment data in the time domain, and a buffer module 204. In operation, the voiced signal passes through the fundamental frequency analysis device 201, which determines the frequency and amplitude of the pitch signal. The fundamental frequency is obtained with one of the following extraction methods: the autocorrelation method, the linear prediction (LPC) method, the cepstrum method, comb-filter-based pitch estimation, or pitch estimation based on a harmonic sinusoidal model. In the segmentation device 202 the voiced signal is decomposed, in order, into mutually independent segments whose length equals one pitch period; if the fundamental period spans N samples, each segment consists of N consecutive samples, and the amplitude at each segment's start and end point is zero or as close to zero as possible. The segmented signal is then periodically repeated, segment by segment, in the time domain by the time-domain extension device 203, turning the single-period signal into a multi-period signal.
The function of the unvoiced-sound preprocessing device 103 in Fig. 1 includes setting the start and end amplitude parameters of the unvoiced sound, detecting the start and end points of plosives, and detecting the duration of the unvoiced sound.
Fig. 3 shows the flow chart of the phoneme-based speech recognition method of the present invention. As shown in the figure, the flow starts at step 301 and proceeds to step 302, where an unknown digitized discrete speech signal is input. Next, in step 303, the short-time zero-crossing rate of the speech signal is detected; if the rate is below the set value the signal is judged to be voiced, otherwise unvoiced. A signal judged to be voiced enters step 304 for voiced preprocessing. In step 304 the fundamental frequency of the voiced signal is measured with one of the following extraction methods: the autocorrelation method, the linear prediction (LPC) method, the cepstrum method, comb-filter-based pitch estimation, or pitch estimation based on a harmonic sinusoidal model. Still in step 304, the voiced signal of known fundamental frequency is decomposed, in order, into mutually independent segments whose length equals one pitch period, i.e. segments of N consecutive samples where the fundamental period spans N samples, with the amplitude at each segment's start and end point zero or as close to zero as possible; finally, the segmented signal is extended in the time domain, segment by segment, into a periodic signal. An unvoiced signal is sent to step 305, where its start and end amplitudes, the start and end points of plosives, and the unvoiced duration are detected, after which it enters step 306. In step 306 a number of speech recognition features are extracted, chiefly the spectral features, spectral-change features, and phoneme duration. In step 307 the extracted recognition features, together with the recognition feature data detected in steps 304 and 305, are compared with the templates and the phoneme sequence of the speech signal is determined. The process then advances to step 308, where the phoneme sequence is output.
Finally, the process advances to step 309 and ends.
The above description of the invention is for illustrative purposes and is not intended to limit the invention to the specific forms described. Modifications and variations are unavoidable in implementation; the embodiments disclosed herein therefore serve only to better explain the principles of the invention, so that those of ordinary skill in the art can make various modifications for their specific engineering requirements and put the invention to the best possible use.
Effect analysis
Signal S1 shown in Fig. 4A is the original signal waveform, and signal S2 is the fundamental waveform obtained by narrowband low-frequency filtering and amplifying signal S1. Using the correspondence between S1 and S2, the signal is cut at zero crossings, one fundamental period per segment, so that S1 is divided by the pitch periods T1, T2, T3, T4, ... into segments ST1, ST2, ST3, ST4, ...; the four segments together contain N = 100 samples. The sampling frequency is 8000 Hz, so by equation (1) the frequency resolution of 100 samples is 80 Hz. However, these 100 samples cover four pitch periods whose signals all differ from one another, so to understand the spectral changes accurately and completely, the spectrum should be computed per fundamental period. Segment ST2 contains 25 samples; if the FFT were applied directly to these 25 samples the frequency resolution would be as coarse as 320 Hz, which clearly cannot meet the needs of spectrum analysis. Segments ST1, ST2, ST3, and ST4 are therefore extended in the time domain to obtain the periodic signals shown in Figs. 4B, 4C, 4D, and 4E. A 1024-point FFT is applied to each extended signal, and the resulting spectra are shown in Figs. 5, 6, 7, and 8. The spectral resolution is now 7.8 Hz, a roughly 40-fold improvement over the original; comparing the spectral parameters across the figures reveals the similarities and differences between the spectra of the individual periods. This shows that by extending the signal of a single fundamental period in the time domain, a high-precision short-time spectrum of the signal can be obtained, and the spectrogram produced by this method has high frequency resolution.
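The figures quoted in this example can be checked with a short calculation (an editorial sketch reproducing the arithmetic above):

```python
fs = 8000                       # sampling frequency in Hz

print(fs / 100)                 # 80.0  Hz  -- resolution of the 100-sample block
print(fs / 25)                  # 320.0 Hz  -- resolution of segment ST2 (25 samples) alone
print(fs / 1024)                # 7.8125 Hz -- resolution after extension to a 1024-point FFT
print((fs / 25) / (fs / 1024))  # about 41  -- roughly the 40-fold improvement cited
```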
In speech the spectrum of voiced sounds varies greatly; even two adjacent pitch periods have somewhat different spectra, particularly in the higher harmonics, so a conventional FFT cannot yield an accurate voiced-sound spectrum. Computing the voiced-sound spectrum by time-domain extension over one fundamental period has many advantages: 1) because the sample used to compute the spectrum is very small, only one pitch period, the time localization of the spectral transform is improved; 2) the time-domain extension increases the number of FFT points, greatly improving the frequency resolution of the transform; 3) interference with phoneme recognition from changes in the amplitude and duration of the voiced sound is excluded; 4) interference from adjacent phonemes and adjacent periods is excluded, ensuring the purity of the spectrum and thus raising its reliability; 5) only the signal of a single fundamental period is needed to compute an accurate voiced-sound spectrum, and the phoneme name can be determined from the spectral features and the signal duration without reference to adjacent phonemes, which allows fast recognition with results independent of context. If a voiced sound lasts several pitch periods, spectrum analysis and phoneme discrimination are performed for every pitch period, so the recognition of one voiced phoneme is completed through repeated decisions, which further improves the reliability of recognition.
The signals of different phonemes interfere with the spectra of adjacent and even non-adjacent phonemes, so phonemes must be isolated from one another to prevent mutual interference and to guarantee that the samples of each spectral transform belong to a single phoneme. Because unvoiced and voiced sounds are kept separate, interference from the high-energy voiced signal with the unvoiced-signal spectrum is eliminated, and with the unvoiced duration added as one of the recognition features, the recognition features of unvoiced phonemes are distinct and the recognition results are reliable.
Fig. 9 shows the waveform of the falling-tone Chinese pinyin vowel [a] spoken by a middle-aged male, and Fig. 10 shows the spectrogram obtained with fundamental-period time-domain extension. Fig. 10 shows that this speech signal consists of the fundamental and its harmonics; variation of the fundamental frequency causes the harmonic frequencies to vary, and the higher the harmonic order the larger the variation; the harmonic amplitudes below 1350 Hz are higher; and the intensity of the fundamental varies little. Fig. 11 is the narrowband spectrogram of the same signal. Comparing the two figures clearly shows the superiority of the time-domain-extension spectrogram.
In continuous speech the duration of a single phoneme is so short that the human ear cannot identify it in isolation and can only recognize it by listening continuously, which is why human speech recognition relies heavily on context. A computer, however, has computing speed far beyond the human brain: using the Fourier transform, computing the accurate voiced-sound spectrum by time-domain extension, separating unvoiced from voiced sounds, and computing the unvoiced spectrum and duration, it can fully and accurately recognize a single phoneme in continuous speech without reference to the probability of the phoneme's occurrence. Because the number of phonemes is small, unvoiced and voiced sounds cannot be confused, and most phonemes have highly significant spectral differences, mistakes do not occur during template comparison; only a small number of voiced phonemes have similar spectra, and once the accurate spectrum of the signal has been found even these are easy to distinguish. A phoneme-based speech recognition system is therefore highly reliable and requires only a small template library, which greatly reduces the cost of the recognition system and greatly improves recognition accuracy.

Claims (10)

1. A phoneme-based speech recognition method, characterized in that it comprises the following steps:
A) converting an analog speech signal into a digital speech signal;
B) detecting the short-time zero-crossing rate of the digital speech signal; if the short-time zero-crossing rate is below a set value the signal is judged to be voiced and voiced preprocessing is performed, and if it is above the set value the signal is judged to be unvoiced and unvoiced preprocessing is performed;
C) applying a spectral transform to the preprocessed data and extracting features;
D) analyzing the extracted feature data;
E) outputting a phoneme sequence according to the analysis result.
2. The method according to claim 1, characterized in that the voiced-sound preprocessing comprises the following steps:
F1) measuring the frequency and amplitude of the pitch signal;
F2) decomposing the voiced signal, in order, by a segmentation device into mutually independent segments whose length equals one pitch period; if the fundamental period of the signal spans N samples, the signal is cut into segments of N consecutive samples each, so that the amplitude at the start and end of each segment is zero or as close to zero as possible;
F3) periodically repeating the segment data in the time domain by means of a time-domain extension device, turning the single-period signal into a multi-period signal.
3. The method according to claim 1, characterized in that the unvoiced-sound preprocessing comprises the following steps:
G1) setting the start and end amplitudes of the unvoiced sound;
G2) detecting the start and end points of plosives;
G3) detecting the duration of the unvoiced sound.
4. The method according to claim 1, characterized in that step C) comprises extracting at least one or two of the following speech recognition features: spectral features, spectral-change features, cepstrum, linear prediction coefficients, formants, and phoneme duration.
5. The method according to claim 1, characterized in that step D) comprises the following steps:
D1) summarizing the amplitude distribution of each frequency component over time, its pattern of variation, and its significance in speech;
D2) grouping spectra that have identical or similar features to form individual character templates;
D3) setting the similarity value for template comparison according to the system requirements, a high similarity value being suitable for speaker identification and command input, and a low similarity value being used for speech-to-text conversion.
6. The method according to claim 1, characterized in that step E) comprises comparing the recognition features with the phoneme templates of the specified language or dialect in the speech database and determining the phoneme name.
7. The method according to claim 1, characterized in that step F1), measuring the frequency and amplitude of the pitch signal, is implemented with one of the following fundamental frequency extraction methods: the autocorrelation method, the linear prediction method, the cepstrum method, comb-filter-based pitch estimation, or pitch estimation based on a harmonic sinusoidal model.
8. The method according to claim 1, characterized in that after step E), outputting a phoneme sequence according to the analysis result, the method further comprises a step of converting the phoneme sequence into text or instructions.
9. A phoneme-based speech recognition system, characterized in that it comprises:
a speech input device for converting an analog speech signal into a digital speech signal;
a voiced/unvoiced recognition device for detecting the short-time zero-crossing rate of the digital speech signal provided by the speech input device, outputting the digital speech signal to a voiced-sound preprocessing device when the detected short-time zero-crossing rate is below a set value, and outputting it to an unvoiced-sound preprocessing device when the detected short-time zero-crossing rate is above the set value;
a feature extraction device for applying a spectral transform to the data provided by the unvoiced-sound preprocessing device and the voiced-sound preprocessing device and extracting features;
a feature analysis device for analyzing the feature data extracted by the feature extraction device;
a speech storage device and a phoneme sequence output device for retrieving the speech storage device according to the analysis result output by the feature analysis device and outputting a phoneme sequence; and a device for converting the phoneme sequence into text or instructions.
10. The system according to claim 9, characterized in that the voiced-sound preprocessing device comprises the following modules: a module for measuring the frequency and amplitude of the pitch signal; a data segmentation module for decomposing the voiced signal, in order, into mutually independent segments whose length equals one pitch period, i.e. segments of N consecutive samples where the fundamental period spans N samples, with the amplitude at each segment's start and end point zero or as close to zero as possible; and a time-domain extension module for periodically repeating the segment data in the time domain, turning the single-period signal into a multi-period signal; and that the unvoiced-sound preprocessing device comprises the following modules: a module for setting the start and end amplitudes of the unvoiced sound; a module for detecting the start and end points of plosives; and a module for detecting the duration of the unvoiced sound.
CNA2005101214992A 2005-12-31 2005-12-31 Phoneme based voice recognition method and system Pending CN1991976A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2005101214992A CN1991976A (en) 2005-12-31 2005-12-31 Phoneme based voice recognition method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNA2005101214992A CN1991976A (en) 2005-12-31 2005-12-31 Phoneme based voice recognition method and system

Publications (1)

Publication Number Publication Date
CN1991976A true CN1991976A (en) 2007-07-04

Family

ID=38214188

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2005101214992A Pending CN1991976A (en) 2005-12-31 2005-12-31 Phoneme based voice recognition method and system

Country Status (1)

Country Link
CN (1) CN1991976A (en)

Cited By (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101859578B (en) * 2009-04-08 2011-08-31 陈伟江 Method for manufacturing and processing voice products
CN102623006A (en) * 2011-01-27 2012-08-01 通用汽车有限责任公司 Mapping obstruent speech energy to lower frequencies
CN103383845B (en) * 2013-07-08 2017-03-22 上海泰亿格康复医疗科技股份有限公司 Multi-dimensional dysarthria measuring system and method based on real-time vocal tract shape correction
CN103383845A (en) * 2013-07-08 2013-11-06 上海昭鸣投资管理有限责任公司 Multi-dimensional dysarthria measuring system and method based on real-time vocal tract shape correction
CN104346127A (en) * 2013-08-02 2015-02-11 腾讯科技(深圳)有限公司 Realization method, realization device and terminal for voice input
CN104346127B (en) * 2013-08-02 2018-05-22 腾讯科技(深圳)有限公司 Implementation method, device and the terminal of phonetic entry
CN105529025A (en) * 2014-09-28 2016-04-27 联想(北京)有限公司 Voice operation input method and electronic device
CN104735461A (en) * 2015-03-31 2015-06-24 北京奇艺世纪科技有限公司 Method and device for replacing voice keyword advertisement in video
CN104735461B (en) * 2015-03-31 2018-11-02 北京奇艺世纪科技有限公司 The replacing options and device of voice AdWords in video
CN106297769A (en) * 2015-05-27 2017-01-04 国家计算机网络与信息安全管理中心 A kind of distinctive feature extracting method being applied to languages identification
CN106297769B (en) * 2015-05-27 2019-07-09 国家计算机网络与信息安全管理中心 A kind of distinctive feature extracting method applied to languages identification
CN107221325A (en) * 2016-03-22 2017-09-29 华硕电脑股份有限公司 Aeoplotropism keyword verification method and the electronic installation using this method
CN107221325B (en) * 2016-03-22 2020-02-28 华硕电脑股份有限公司 Directional keyword verification method and electronic device using same
CN107992812A (en) * 2017-11-27 2018-05-04 北京搜狗科技发展有限公司 A kind of lip reading recognition methods and device
CN108172229A (en) * 2017-12-12 2018-06-15 天津津航计算技术研究所 A kind of authentication based on speech recognition and the method reliably manipulated
CN110111767A (en) * 2018-01-31 2019-08-09 通用汽车环球科技运作有限责任公司 Multi-language voice auxiliary is supported
CN108447506A (en) * 2018-03-06 2018-08-24 深圳市沃特沃德股份有限公司 Method of speech processing and voice processing apparatus
CN113873944A (en) * 2019-05-23 2021-12-31 新田恒雄 Speech association recognition device, wearing tool, speech association recognition method, and program
WO2020253109A1 (en) * 2019-06-19 2020-12-24 深圳壹账通智能科技有限公司 Resource information recommendation method and apparatus based on speech recognition, and terminal and medium
CN110379427A (en) * 2019-06-19 2019-10-25 深圳壹账通智能科技有限公司 Resource information recommended method, device, terminal and medium based on speech recognition
CN112492453A (en) * 2019-09-12 2021-03-12 深圳市德晟达电子科技有限公司 Automatic detection method for audio interface
CN111063371A (en) * 2019-12-21 2020-04-24 华南理工大学 Speech spectrum time difference-based speech syllable number estimation method
CN111063371B (en) * 2019-12-21 2023-04-21 华南理工大学 Speech syllable number estimation method based on spectrogram time difference
CN111133508A (en) * 2019-12-24 2020-05-08 广州国音智能科技有限公司 Method and device for selecting comparison phonemes
CN111276156A (en) * 2020-01-20 2020-06-12 深圳市数字星河科技有限公司 Real-time voice stream monitoring method
CN111951783B (en) * 2020-08-12 2023-08-18 北京工业大学 Speaker recognition method based on phoneme filtering
CN111951783A (en) * 2020-08-12 2020-11-17 北京工业大学 Speaker recognition method based on phoneme filtering
CN112541077A (en) * 2020-11-26 2021-03-23 深圳供电局有限公司 Processing method and system for power grid user service evaluation
CN112541077B (en) * 2020-11-26 2023-11-17 深圳供电局有限公司 Processing method and system for power grid user service evaluation
CN112634874B (en) * 2020-12-24 2022-09-23 江西台德智慧科技有限公司 Automatic tuning terminal equipment based on artificial intelligence
CN112634874A (en) * 2020-12-24 2021-04-09 江西台德智慧科技有限公司 Automatic tuning terminal equipment based on artificial intelligence
CN112969130A (en) * 2020-12-31 2021-06-15 维沃移动通信有限公司 Audio signal processing method and device and electronic equipment
CN113113052A (en) * 2021-04-08 2021-07-13 深圳市品索科技有限公司 Voice fundamental tone recognition device of discrete points and computer storage medium
CN113113052B (en) * 2021-04-08 2024-04-05 深圳市品索科技有限公司 Discrete point voice fundamental tone recognition device and computer storage medium
CN113517901A (en) * 2021-04-13 2021-10-19 深圳市太美亚电子科技有限公司 Intelligent bracelet of solar intelligent household system and control method thereof
CN113838169A (en) * 2021-07-07 2021-12-24 西北工业大学 Text-driven virtual human micro-expression method

Similar Documents

Publication Publication Date Title
CN1991976A (en) Phoneme based voice recognition method and system
US11056097B2 (en) Method and system for generating advanced feature discrimination vectors for use in speech recognition
JP3162994B2 (en) Method for recognizing speech words and system for recognizing speech words
Gerhard Pitch extraction and fundamental frequency: History and current techniques
Durrieu et al. A musically motivated mid-level representation for pitch estimation and musical audio source separation
CN1169115C (en) Prosodic databases holding fundamental frequency templates for use in speech synthesis
Deshmukh et al. Use of temporal information: Detection of periodicity, aperiodicity, and pitch in speech
CN1248190C (en) Fast frequency-domain pitch estimation
CN101136199B (en) Voice data processing method and equipment
CN1167045C (en) Speech recongition method and device
Sukhostat et al. A comparative analysis of pitch detection methods under the influence of different noise conditions
CN104616663A (en) Music separation method of MFCC (Mel Frequency Cepstrum Coefficient)-multi-repetition model in combination with HPSS (Harmonic/Percussive Sound Separation)
CN105825852A (en) Oral English reading test scoring method
Lin et al. Automatic estimation of voice onset time for word-initial stops by applying random forest to onset detection
CN1300049A (en) Method and apparatus for identifying speech sound of chinese language common speech
CN110516102B (en) Lyric time stamp generation method based on spectrogram recognition
CN103297590B (en) A kind of method and system realizing equipment unblock based on audio frequency
Lounnas et al. CLIASR: a combined automatic speech recognition and language identification system
CN107564543A (en) A kind of Speech Feature Extraction of high touch discrimination
Sethu et al. Empirical mode decomposition based weighted frequency feature for speech-based emotion classification
US6470311B1 (en) Method and apparatus for determining pitch synchronous frames
Deiv et al. Automatic gender identification for hindi speech recognition
Tripathi et al. Robust vowel region detection method for multimode speech
Chenchen et al. Main melody extraction using the auditory scene analysis for the humming music retrieval
RU111944U1 (en) DEVICE FOR PHONETIC ANALYSIS AND SPEECH RECOGNITION

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication