CN105374352A - Voice activation method and system - Google Patents
- Publication number: CN105374352A (application CN201410418850.3A)
- Authority
- CN
- China
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The invention relates to a voice activation method. The method comprises the steps of: building an acoustic model and, on top of it, a decoding network space; selecting, according to the noise environment grade, a matching voice activity detection (VAD) configuration and cutting the input speech stream into speech segments; extracting acoustic features from each segment; decoding the features in the decoding network space to obtain the recognized phonemes; choosing, from the measures that can characterize the reliability of a pronunciation unit, several measures as confidence measures of the recognized phonemes and computing them; and applying a two-stage decision, comprising a pre-decision and a second decision, to these confidence measures and outputting the final recognition result. The method overcomes the drawbacks of key-press device activation, achieves a good activation effect, and makes speech recognition devices more convenient to use.
Description
Technical field
The invention belongs to the technical field of speech recognition; specifically, it relates to a voice activation method and system.
Background technology
Current speech recognition technology suffers a serious drop in accuracy under noise and natural spontaneous speech, so in daily life interaction based on continuous speech recognition is hard to realize. The common workaround is to switch the recognition device on with a key press, so that the user records speech in a relatively quiet state, guaranteeing good recognition performance and completing the human-machine interaction.
Key-press activation, however, is inconvenient in several ways. First, it requires the user to be within arm's reach of the device, which is difficult for users who are far away or have limited mobility. Second, in the dark the user may not be able to find the button. Third, it is unsuitable when both hands are occupied, for example while driving. In short, the dependence on key-press activation limits the promotion and application of speech recognition technology.
Voice activation techniques offer a way to overcome these drawbacks and advance human-machine interaction. Reference [1] ("Wake-Up-Word Speech Recognition", Veton Kepuska (2011), in Speech Technologies, Ivo Ipsic (Ed.), ISBN 978-953-307-996-7) describes a voice activation algorithm built on a speech recognition framework. That algorithm does not account for the varied ambient noise of real deployments: it performs well in a relatively quiet laboratory environment, but its activation performance may degrade severely in environments with strong background noise. Moreover, [1] relies solely on a classifier for the confidence decision, so its accuracy depends entirely on the classifier's training samples; poorly chosen training samples directly hurt activation performance.
Summary of the invention
The object of the invention is to overcome the drawbacks of key-press activation of speech recognition devices by providing voice activation as a brand-new device start-up mode, making speech recognition devices more convenient to use.
To this end, the invention provides a voice activation method, comprising:
Building an acoustic model, and building a decoding network space on top of the acoustic model;
Selecting, according to the noise environment grade, a matching voice activity detection (VAD) configuration and cutting the input speech stream into speech segments; extracting acoustic features from each segment; decoding the features in the decoding network space to obtain the recognized phonemes; choosing, from the measures that can characterize the reliability of a pronunciation unit, several measures as confidence measures of the recognized phonemes and computing them; applying a two-stage decision, comprising a pre-decision and a second decision, to these confidence measures, and outputting the final recognition result.
In the above scheme, building the decoding network space comprises: connecting the garbage phonemes of the phone set in parallel into a looping garbage-phoneme sub-network; stringing the phonemes of the specified activation word in sequence into an activation-word phone string; attaching the garbage-phoneme sub-network at the head and tail of the phone string; and adding a direct arc between the head and tail sub-networks that bypasses the phone string.
In the above scheme, the noise environment grades are: high-noise, medium-noise, and quiet; the grade is determined from the sound pressure level of the ambient noise.
In the above scheme, the confidence measures of the recognized phonemes comprise: the normalized phoneme duration, the time-normalized phoneme log-likelihood, the phoneme log posterior probability, the number of states whose duration is a single frame, the minimum syllable duration, and the total duration of the recognized speech.
In the above scheme, the pre-decision comprises:
If the second-smallest phoneme log posterior probability over all recognized phonemes is below a first threshold, the input is immediately rejected as a non-activation word; if the number of phonemes whose log posterior probability is below -1 exceeds a second threshold, it is rejected; if the number of states whose duration is a single frame exceeds a third threshold, it is rejected; if the minimum syllable duration is at or below a fourth threshold, it is rejected; and if the total duration of the recognized speech is shorter than 6 frames, or longer than 15 frames, per recognized phoneme, it is rejected. The first, second, third, and fourth thresholds are preferably obtained from experience and statistics.
In the above scheme, the second decision is implemented with a classifier, which is a linear classifier, a Gaussian mixture model classifier, or a support vector machine classifier.
In addition, the invention provides a voice activation system, comprising:
A VAD module, which selects the VAD configuration matching the noise environment grade and cuts the collected continuous speech stream into speech segments;
A feature extraction module, which extracts acoustic features from the speech segments;
An acoustic model, which describes the feature distribution of each pronunciation unit in acoustic space;
A decoder module, which builds the decoding network space on top of the acoustic model, runs Viterbi decoding on the segment features, and finds the best phoneme path in the network as the recognized path; all non-garbage phonemes on that path are the recognized phonemes;
A confidence computation module, which chooses, from the measures that can characterize the reliability of a pronunciation unit, several measures as confidence measures of the recognized phonemes and computes them;
A two-stage decision module, which applies a pre-decision and a second decision to the confidence measures and outputs the final recognition result.
The advantages of the invention are:
1. By grading the noise environment, the system is robust in noisy conditions;
2. The purpose-built decoding network space removes the adverse effect that real-world background noise has on recognition performance;
3. The two-stage decision on the recognition result drives the error rate down to a minimum, giving an excellent activation effect;
4. The system has broad application prospects in interactive smart home appliances, wearables, and similar devices.
Accompanying drawing explanation
Fig. 1 is a schematic of how the decoding network space of the invention is constructed;
Fig. 2 is a block diagram of the voice activation system of the invention.
Embodiment
The invention is described in further detail below with reference to the drawings and a specific embodiment.
The voice activation method provided by the invention comprises the following steps:
Step 1) Build the acoustic model.
The phone set comprises 65 toneless Chinese phonemes, 15 garbage phonemes (fillers), the sil phoneme for silence, and the sp phoneme for short pauses. Each phoneme is expanded by context into triphones, and each triphone is a sequence of three states. The 15 garbage phonemes are obtained statistically: according to the confusability and correlation between phonemes, all phonemes are clustered into similarity classes, and each similarity class serves as one garbage phoneme.
States that share the same central phoneme and position but differ in context are clustered with a decision tree, yielding 3970 states, i.e. 3970 units, each described by a Gaussian mixture model (GMM) with 8 components. The acoustic model is formed from the phone set and these 3970 units.
Step 2) Build the decoding network space on top of the acoustic model.
Referring to Fig. 1, the decoding network space is built as follows: the 15 garbage phonemes from step 1) are connected in parallel into a looping garbage-phoneme sub-network; the phonemes of the specified activation word are strung in sequence into an activation-word phone string; the garbage-phoneme sub-network is attached at the head and tail of the phone string; and a direct arc between the head and tail sub-networks bypasses the phone string.
This decoding network space can accurately force-align five classes of speech segments: the activation word alone, the activation word preceded by garbage speech, the activation word followed by garbage speech, the activation word with garbage speech on both sides, and pure garbage speech. These five classes cover all possible inputs.
For example, with the specified activation word "ni hao kong tiao" ("hello air-conditioner"), the concatenated phone string is "n-i-h-ao-k-ong-t-iao".
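As an illustration, the topology of Fig. 1 can be sketched as a small directed graph. This is a hypothetical sketch: the node names, filler labels, and arc-set representation are invented for illustration and are not taken from the patent.

```python
# Sketch of the decoding-network topology: a looping garbage-phoneme
# sub-network before and after the activation-word phone string, plus a
# direct arc that lets pure-garbage utterances skip the activation word.
GARBAGE = [f"filler{i}" for i in range(15)]                 # 15 garbage phonemes
WAKE_PHONES = ["n", "i", "h", "ao", "k", "ong", "t", "iao"]  # "ni hao kong tiao"

def build_network(wake_phones, garbage):
    """Return a set of directed arcs (src, dst) over named nodes."""
    arcs = set()
    # head garbage sub-network: enter, loop among fillers, exit to the wake word
    for g in garbage:
        arcs.add(("START", f"head:{g}"))
        arcs.add((f"head:{g}", "WAKE_ENTRY"))
        for g2 in garbage:                 # loop: any filler may follow any filler
            arcs.add((f"head:{g}", f"head:{g2}"))
    # the activation-word phone string, connected in sequence
    prev = "WAKE_ENTRY"
    for p in wake_phones:
        arcs.add((prev, f"wake:{p}"))
        prev = f"wake:{p}"
    arcs.add((prev, "WAKE_EXIT"))
    # tail garbage sub-network, mirroring the head
    for g in garbage:
        arcs.add(("WAKE_EXIT", f"tail:{g}"))
        arcs.add((f"tail:{g}", "END"))
        for g2 in garbage:
            arcs.add((f"tail:{g}", f"tail:{g2}"))
    # direct bridge so the head and tail sub-networks bypass the wake word
    arcs.add(("WAKE_ENTRY", "WAKE_EXIT"))
    return arcs

net = build_network(WAKE_PHONES, GARBAGE)
```

The bypass arc is what lets the same network force-align all five segment classes, from the activation word alone down to pure garbage speech.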
Step 3) Select, according to the noise environment grade, a matching VAD (voice activity detection) configuration and cut the input speech stream into speech segments; extract acoustic features from each segment; decode the features in the decoding network space to obtain the recognized phonemes; choose, from the measures that can characterize the reliability of a pronunciation unit, several measures as confidence measures of the recognized phonemes and compute them; apply a two-stage decision, comprising a pre-decision and a second decision, and output the final recognition result.
In the above scheme, step 3) further comprises:
Step 301) Select, according to the noise environment grade, a matching VAD configuration and cut the input speech stream into speech segments.
The noise environment is divided into three grades: high-noise, medium-noise, and quiet. The grade is determined from the sound pressure level of the ambient noise, computed as:
Lp = 20 * lg(p / p0)
where Lp is the sound pressure level in decibels, p is the sound pressure, and p0 is the reference sound pressure, p0 = 2 × 10^-5 Pa in air.
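The sound pressure level formula translates directly into code; a minimal sketch, with an illustrative example pressure:

```python
import math

def sound_pressure_level(p, p0=2e-5):
    """Lp = 20 * lg(p / p0) in decibels; p0 = 2e-5 Pa is the reference pressure in air."""
    return 20 * math.log10(p / p0)

# A sound pressure of 0.02 Pa corresponds to 60 dB SPL.
level = sound_pressure_level(0.02)
```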
The VAD configuration matching the noise environment grade is then used to cut the continuous input stream into short speech segments. The goal of the cutting is to break the stream at the speaker's pauses, i.e. to keep one stretch of continuous speech inside a single segment as far as possible. Grade-specific VAD configurations keep the cutting insensitive to fluctuations in the ambient noise, yielding accurate segments and reducing truncation of complete utterances.
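The idea can be sketched with a minimal energy-based segmenter. The patent only states that different VAD parameters are chosen per noise grade; the per-grade thresholds and the hangover mechanism below are illustrative assumptions, not the patent's actual parameters.

```python
# Minimal energy-based VAD sketch: frames above a threshold are speech, and up
# to `hangover` silent frames are tolerated before a segment is closed, so a
# brief pause does not split one utterance into fragments.
def segment(frame_energies, threshold, hangover=3):
    """Cut per-frame energies into half-open (start, end) speech segments."""
    segments, start, silence = [], None, 0
    for i, e in enumerate(frame_energies):
        if e >= threshold:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence > hangover:
                segments.append((start, i - silence + 1))
                start, silence = None, 0
    if start is not None:
        segments.append((start, len(frame_energies) - silence))
    return segments

# A noisier grade would get a higher threshold so fluctuations of the noise
# floor are not mistaken for speech (values are illustrative).
THRESHOLDS = {"quiet": 0.1, "medium": 0.3, "high": 0.6}
```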
Step 302) Extract acoustic features from each speech segment.
Speech is captured at an 8 kHz sampling rate and framed with a 25 ms window and a 10 ms shift. Twelve PLP (perceptual linear prediction) coefficients plus one energy term are extracted as the static features, and first- and second-order differences are appended to give a 39-dimensional dynamic feature vector. HLDA (heteroscedastic linear discriminant analysis) is applied to the static and dynamic features to improve their discriminability.
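The frame and dimension bookkeeping implied by these settings can be checked with a short sketch; the constants follow the numbers quoted above.

```python
# Front-end bookkeeping: 8 kHz sampling, 25 ms window, 10 ms shift,
# 12 PLP coefficients + 1 energy, expanded with deltas to 39 dimensions.
SAMPLE_RATE = 8000
WIN = int(0.025 * SAMPLE_RATE)    # 200 samples per 25 ms window
HOP = int(0.010 * SAMPLE_RATE)    # 80 samples per 10 ms shift

def num_frames(n_samples):
    """Number of full analysis windows in a signal of n_samples."""
    if n_samples < WIN:
        return 0
    return 1 + (n_samples - WIN) // HOP

STATIC_DIM = 12 + 1               # 12 PLP coefficients + 1 energy term
FEATURE_DIM = STATIC_DIM * 3      # static + delta + delta-delta = 39
```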
Step 303) Decode the features in the decoding network space to obtain the recognized phonemes.
Decoding uses the Viterbi algorithm to find the best phoneme path in the decoding network space as the recognized path; all phonemes on the best path other than fillers are the recognized phonemes. If all recognized phonemes are fillers, the input is immediately rejected as a non-activation word and the method proceeds to step 305-3); otherwise it proceeds to step 304).
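Viterbi decoding itself can be sketched on a toy model. This two-state example is illustrative only; the actual decoder runs over triphone GMM states in the network of Fig. 1.

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Most likely state path for `obs`, using log-probabilities throughout."""
    V = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            # best predecessor for state s at time t
            prev = max(states, key=lambda q: V[t - 1][q] + log_trans[q][s])
            V[t][s] = V[t - 1][prev] + log_trans[prev][s] + log_emit[s][obs[t]]
            back[t][s] = prev
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Toy two-state model: state "A" mostly emits 'a', state "B" mostly emits 'b'.
lg = math.log
states = ["A", "B"]
log_start = {"A": lg(0.5), "B": lg(0.5)}
log_trans = {"A": {"A": lg(0.8), "B": lg(0.2)},
             "B": {"A": lg(0.2), "B": lg(0.8)}}
log_emit = {"A": {"a": lg(0.9), "b": lg(0.1)},
            "B": {"a": lg(0.1), "b": lg(0.9)}}
path = viterbi(["a", "a", "b"], states, log_start, log_trans, log_emit)
```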
Step 304) Compute the confidence measures of the recognized phonemes.
The confidence measures are: the normalized phoneme duration, the time-normalized phoneme log-likelihood, the phoneme log posterior probability, the number of states whose duration is a single frame, the minimum syllable duration, and the total duration of the recognized speech.
The normalized phoneme duration is computed as:
dur_NOR(p_i) = dur(p_i) / ((1/S) * Σ_{j=1..S} dur(p_j))
where p_i is the i-th recognized phoneme, dur_NOR(p_i) is its normalized duration, dur(p_i) is its duration, and S is the total number of recognized phonemes.
The time-normalized phoneme log-likelihood is computed as:
LL_nor(p_i) = ln P(O|p_i) / dur(p_i)
where LL_nor(p_i) is the time-normalized log-likelihood of the i-th recognized phoneme and P(O|p_i) is its likelihood; ln P(O|p_i) is available directly from the decoding result.
The phoneme log posterior probability is computed as:
GOP(p_i) = ln( P(O|p_i) / Σ_{q∈Q} P(O|q) )
where GOP(p_i) is the log posterior probability of the i-th recognized phoneme, Σ_{q∈Q} P(O|q) is the sum of likelihoods over all phonemes, and Q is the phone set from step 1).
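The three scored measures can be sketched in code. Since the patent's original equation images did not survive, the exact normalizations below are reconstructions from the variable definitions, not the patent's verbatim formulas.

```python
import math

def normalized_duration(durations, i):
    """dur_NOR(p_i): duration of phoneme i relative to the mean phoneme duration."""
    mean = sum(durations) / len(durations)
    return durations[i] / mean

def time_normalized_loglik(log_likelihood, duration):
    """LL_nor(p_i): log-likelihood divided by duration, a per-frame score."""
    return log_likelihood / duration

def gop(loglik_i, logliks_all_phones):
    """GOP(p_i): log of phoneme i's likelihood over the summed likelihoods
    of every phoneme in the phone set Q."""
    total = sum(math.exp(ll) for ll in logliks_all_phones)
    return loglik_i - math.log(total)
```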
Step 305) Apply the two-stage decision to the confidence measures of the recognized phonemes and output the final recognition result. This comprises:
Step 305-1) Apply the pre-decision to the confidence measures; if the result is a non-activation word, proceed to 305-3); otherwise proceed to 305-2).
The pre-decision comprises:
If the second-smallest phoneme log posterior probability over all recognized phonemes is below the first threshold, the input is immediately rejected as a non-activation word; the first threshold is preferably obtained from experience and statistics, and is -4.0 in this embodiment.
If the number of phonemes whose log posterior probability is below -1 exceeds the second threshold, the input is rejected; the second threshold is likewise empirical, and is 4 in this embodiment.
If the number of states whose duration is a single frame exceeds the third threshold, the input is rejected; the third threshold is likewise empirical, and is 12 in this embodiment.
If the minimum syllable duration is at or below the fourth threshold, the input is rejected; the fourth threshold is likewise empirical, and is 6 in this embodiment.
If the total duration of the recognized speech is shorter than 6 frames, or longer than 15 frames, per recognized phoneme, the input is rejected.
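The five rejection rules combine into a single pre-decision function. The default thresholds are the embodiment's quoted values (-4.0, 4, 12, 6); the argument names are illustrative.

```python
def pre_judge(gops, one_frame_states, min_syllable_frames, total_frames, n_phones,
              t1=-4.0, t2=4, t3=12, t4=6):
    """Return True if the utterance is rejected outright as a non-activation word."""
    if sorted(gops)[1] < t1:                     # second-smallest posterior too low
        return True
    if sum(1 for g in gops if g < -1) > t2:      # too many weak phonemes
        return True
    if one_frame_states > t3:                    # too many single-frame states
        return True
    if min_syllable_frames <= t4:                # shortest syllable too short
        return True
    if not (6 * n_phones <= total_frames <= 15 * n_phones):  # implausible length
        return True
    return False
```

Anything the pre-decision does not reject is passed on to the second-stage classifier.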
Step 305-2) Apply the second decision to the confidence vectors of the recognized utterances that the pre-decision did not reject.
The confidence vector of a recognized utterance is formed by concatenating, in phoneme order, the confidence measures of each phoneme; its dimension is the number of recognized phonemes times the number of confidence measures.
Taking the recognized utterance "ni hao kong tiao" as an example: each Chinese character consists of two phonemes, an initial and a final, so the confidence vector of "ni hao kong tiao" has 6 * 8 = 48 dimensions.
The second decision is implemented with a classifier: a linear classifier, a Gaussian mixture model classifier, or a support vector machine (SVM) classifier. This embodiment uses an SVM classifier.
Before the second decision, the SVM classifier is first trained on equal numbers of positive and negative samples; positive samples are speech segments containing the specified activation word, and negative samples are segments that do not.
The second decision then feeds the confidence vector of the recognized utterance into the SVM classifier, whose output is 1 or 2, where 1 denotes the activation word and 2 a non-activation word.
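A sketch of the second stage using scikit-learn's SVC stands in for the embodiment's SVM. The synthetic Gaussian "confidence vectors" below are placeholders for real 48-dimensional vectors from the confidence module; everything except the 48-dim shape and the 1/2 labelling is an assumption.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
DIM = 48                                       # 8 phonemes x 6 confidence measures
X_pos = rng.normal(loc=1.0, size=(50, DIM))    # stand-ins for activation-word clips
X_neg = rng.normal(loc=-1.0, size=(50, DIM))   # stand-ins for other clips
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 50 + [2] * 50)              # 1 = activation word, 2 = not

clf = SVC(kernel="rbf").fit(X, y)              # equal positive/negative samples
decision = clf.predict(rng.normal(loc=1.0, size=(1, DIM)))[0]
```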
Step 305-3) Output the final recognition result.
Referring to Fig. 2, the invention also provides a voice activation system, comprising:
A VAD (voice activity detection) module, which selects the VAD configuration matching the noise environment grade and cuts the collected continuous speech stream into speech segments;
A feature extraction module, which extracts acoustic features from the speech segments;
An acoustic model, which describes the feature distribution of each pronunciation unit in acoustic space;
A decoder module, which builds the decoding network space on top of the acoustic model, runs Viterbi decoding on the segment features, and finds the best phoneme path in the network as the recognized path; all non-garbage phonemes on that path are the recognized phonemes;
A confidence computation module, which chooses several measures that characterize the reliability of a pronunciation unit as confidence measures of the recognized phonemes and computes them;
A two-stage decision module, which applies a pre-decision and a second decision to the confidence measures and outputs the final recognition result.
In this embodiment the specified activation word is "ni hao kong tiao". The acoustic model is trained on 150 hours of read speech recorded in a relatively quiet environment. To evaluate the activation rate in real scenes, read speech from 10 speakers was recorded, each uttering the activation word 20 times, in four scene types: quiet, echo, noise, and echo plus noise. To evaluate false alarms in the same four scenes, 24 hours of speech from 10 speakers were recorded. The performance of the voice activation system is shown in Table 1:
Table 1

| | Quiet | Echo | Noise | Echo + noise |
|---|---|---|---|---|
| Activation rate | 91.3% | 89.5% | 80.2% | 75.1% |
| False alarms | 0 | 1/hour | 2/hour | 2.6/hour |
Claims (7)
1. A voice activation method, comprising:
Building an acoustic model, and building a decoding network space on top of the acoustic model;
Selecting, according to the noise environment grade, a matching voice activity detection (VAD) configuration and cutting the input speech stream into speech segments; extracting acoustic features from each segment; decoding the features in the decoding network space to obtain the recognized phonemes; choosing, from the measures that can characterize the reliability of a pronunciation unit, several measures as confidence measures of the recognized phonemes and computing them; applying a two-stage decision, comprising a pre-decision and a second decision, to these confidence measures, and outputting the final recognition result.
2. The voice activation method of claim 1, wherein building the decoding network space comprises: connecting the garbage phonemes of the phone set in parallel into a looping garbage-phoneme sub-network; stringing the phonemes of the specified activation word in sequence into an activation-word phone string; attaching the garbage-phoneme sub-network at the head and tail of the phone string; and adding a direct arc between the head and tail sub-networks that bypasses the phone string.
3. The voice activation method of claim 1, wherein the noise environment grades are: high-noise, medium-noise, and quiet; the grade is determined from the sound pressure level of the ambient noise.
4. The voice activation method of claim 1, wherein the confidence measures of the recognized phonemes comprise: the normalized phoneme duration, the time-normalized phoneme log-likelihood, the phoneme log posterior probability, the number of states whose duration is a single frame, the minimum syllable duration, and the total duration of the recognized speech.
5. The voice activation method of claim 4, wherein the pre-decision comprises:
If the second-smallest phoneme log posterior probability over all recognized phonemes is below a first threshold, immediately rejecting the input as a non-activation word; if the number of phonemes whose log posterior probability is below -1 exceeds a second threshold, rejecting it; if the number of states whose duration is a single frame exceeds a third threshold, rejecting it; if the minimum syllable duration is at or below a fourth threshold, rejecting it; and if the total duration of the recognized speech is shorter than 6 frames, or longer than 15 frames, per recognized phoneme, rejecting it; the first, second, third, and fourth thresholds being preferably obtained from experience and statistics.
6. The voice activation method of claim 1, wherein the second decision is implemented with a classifier, the classifier being a linear classifier, a Gaussian mixture model classifier, or a support vector machine classifier.
7. A voice activation system, comprising:
A VAD module, which selects the VAD configuration matching the noise environment grade and cuts the collected continuous speech stream into speech segments;
A feature extraction module, which extracts acoustic features from the speech segments;
An acoustic model, which describes the feature distribution of each pronunciation unit in acoustic space;
A decoder module, which builds the decoding network space on top of the acoustic model, runs Viterbi decoding on the segment features, and finds the best phoneme path in the network as the recognized path, all non-garbage phonemes on that path being the recognized phonemes;
A confidence computation module, which chooses several measures that characterize the reliability of a pronunciation unit as confidence measures of the recognized phonemes and computes them;
A two-stage decision module, which applies a pre-decision and a second decision to the confidence measures and outputs the final recognition result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410418850.3A CN105374352B (en) | 2014-08-22 | 2014-08-22 | Voice activation method and system
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410418850.3A CN105374352B (en) | 2014-08-22 | 2014-08-22 | Voice activation method and system
Publications (2)
Publication Number | Publication Date |
---|---|
CN105374352A true CN105374352A (en) | 2016-03-02 |
CN105374352B CN105374352B (en) | 2019-06-18 |
Family
ID=55376483
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410418850.3A Expired - Fee Related CN105374352B (en) | 2014-08-22 | 2014-08-22 | Voice activation method and system
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105374352B (en) |
2014-08-22: Application CN201410418850.3A filed in China; granted as CN105374352B; current status: not active (Expired - Fee Related)
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101739869A (en) * | 2008-11-19 | 2010-06-16 | 中国科学院自动化研究所 | Priori knowledge-based pronunciation evaluation and diagnosis system |
CN102044243A (en) * | 2009-10-15 | 2011-05-04 | 华为技术有限公司 | Method and device for voice activity detection (VAD) and encoder |
CN102982811A (en) * | 2012-11-24 | 2013-03-20 | 安徽科大讯飞信息科技股份有限公司 | Voice endpoint detection method based on real-time decoding |
US20140214416A1 (en) * | 2013-01-30 | 2014-07-31 | Tencent Technology (Shenzhen) Company Limited | Method and system for recognizing speech commands |
CN103810996A (en) * | 2014-02-21 | 2014-05-21 | 北京凌声芯语音科技有限公司 | Processing method, device and system for voice to be tested |
Non-Patent Citations (1)
Title |
---|
Veton Kepuska: "Wake-Up-Word Speech Recognition", Speech Technologies *
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107767863A (en) * | 2016-08-22 | 2018-03-06 | 科大讯飞股份有限公司 | Voice wake-up method, system and intelligent terminal |
CN107767861A (en) * | 2016-08-22 | 2018-03-06 | 科大讯飞股份有限公司 | Voice wake-up method, system and intelligent terminal |
CN107919116B (en) * | 2016-10-11 | 2019-09-13 | 芋头科技(杭州)有限公司 | Voice activation detection method and device |
CN107919116A (en) * | 2016-10-11 | 2018-04-17 | 芋头科技(杭州)有限公司 | Voice activation detection method and device |
WO2018068649A1 (en) * | 2016-10-11 | 2018-04-19 | 芋头科技(杭州)有限公司 | Method and device for detecting voice activation |
CN106484147A (en) * | 2016-10-21 | 2017-03-08 | 深圳仝安技术有限公司 | Method and device for a new voice activation system |
CN108231089B (en) * | 2016-12-09 | 2020-11-03 | 百度在线网络技术(北京)有限公司 | Speech processing method and device based on artificial intelligence |
CN108231089A (en) * | 2016-12-09 | 2018-06-29 | 百度在线网络技术(北京)有限公司 | Speech processing method and device based on artificial intelligence |
CN108281137A (en) * | 2017-01-03 | 2018-07-13 | 中国科学院声学研究所 | Universal voice wake-up recognition method and system under a whole-phoneme framework |
CN110111779A (en) * | 2018-01-29 | 2019-08-09 | 阿里巴巴集团控股有限公司 | Grammar model generation method and device, and speech recognition method and device |
CN110111779B (en) * | 2018-01-29 | 2023-12-26 | 阿里巴巴集团控股有限公司 | Grammar model generation method and device and voice recognition method and device |
CN108831459A (en) * | 2018-05-30 | 2018-11-16 | 出门问问信息科技有限公司 | Speech recognition method and device |
CN110364142B (en) * | 2019-06-28 | 2022-03-25 | 腾讯科技(深圳)有限公司 | Speech phoneme recognition method and device, storage medium and electronic device |
CN110364142A (en) * | 2019-06-28 | 2019-10-22 | 腾讯科技(深圳)有限公司 | Speech phoneme recognition method and device, storage medium and electronic device |
CN110992929A (en) * | 2019-11-26 | 2020-04-10 | 苏宁云计算有限公司 | Voice keyword detection method, device and system based on neural network |
CN111009234A (en) * | 2019-12-25 | 2020-04-14 | 上海忆益信息科技有限公司 | Voice conversion method, device and equipment |
CN111429901A (en) * | 2020-03-16 | 2020-07-17 | 云知声智能科技股份有限公司 | IoT chip-oriented multi-stage voice intelligent awakening method and system |
CN111653276A (en) * | 2020-06-22 | 2020-09-11 | 四川长虹电器股份有限公司 | Voice awakening system and method |
CN112652306A (en) * | 2020-12-29 | 2021-04-13 | 珠海市杰理科技股份有限公司 | Voice wake-up method and device, computer equipment and storage medium |
CN112652306B (en) * | 2020-12-29 | 2023-10-03 | 珠海市杰理科技股份有限公司 | Voice wakeup method, voice wakeup device, computer equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN105374352B (en) | 2019-06-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105374352A (en) | Voice activation method and system | |
CN105529028B (en) | Speech analysis method and apparatus | |
CN110364143B (en) | Voice awakening method and device and intelligent electronic equipment | |
CN103928023B (en) | Speech assessment method and system | |
CN101930735B (en) | Speech emotion recognition equipment and speech emotion recognition method | |
CN102800314B (en) | English sentence recognizing and evaluating system with feedback guidance and method | |
CN103971678B (en) | Keyword spotting method and apparatus | |
CN108922541B (en) | Multi-dimensional characteristic parameter voiceprint recognition method based on DTW and GMM models | |
CN101118745B (en) | Confidence degree quick acquiring method in speech identification system | |
US20170154640A1 (en) | Method and electronic device for voice recognition based on dynamic voice model selection | |
CN103177733B (en) | Method and system for evaluating the pronunciation quality of Standard Chinese rhotacized (erhua) syllables | |
CN108281137A (en) | Universal voice wake-up recognition method and system under a whole-phoneme framework | |
CN105632486A (en) | Voice wake-up method and device of intelligent hardware | |
CN104050965A (en) | English phonetic pronunciation quality evaluation system with emotion recognition function and method thereof | |
Ferrer et al. | A prosody-based approach to end-of-utterance detection that does not require speech recognition | |
CN107329996A (en) | Chatbot system and chat method based on a fuzzy neural network | |
CN104464724A (en) | Speaker recognition method for deliberately disguised voices | |
CN101751919A (en) | Automatic stress detection method for spoken Chinese | |
Levitan et al. | Combining Acoustic-Prosodic, Lexical, and Phonotactic Features for Automatic Deception Detection. | |
CN101645269A (en) | Language recognition system and method | |
CN106782508A (en) | Speech audio segmentation method and device | |
CN106548775A (en) | Speech recognition method and system | |
CN106875943A (en) | Speech recognition system for big data analysis | |
CN102237083A (en) | Portable interpretation system based on WinCE platform and language recognition method thereof | |
CN110019741A (en) | Answer matching method, device and equipment for a question-answering system, and readable storage medium | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20190618 |