CN102999161A

CN102999161A - Implementation method and application of voice awakening module

Info

Publication number: CN102999161A
Application number: CN2012104551752A
Authority: CN
Inventors: 操文祥; 王海坤; 康怀茂; 钱勇; 谢信珍; 黄海兵
Original assignee: iFlytek Co Ltd
Current assignee: Science And Technology University Information Flying South China Institute Of Artificial Intelligence (guangzhou) Co Ltd
Priority date: 2012-11-13
Filing date: 2012-11-13
Publication date: 2013-03-27
Anticipated expiration: 2032-11-13
Also published as: CN102999161B

Abstract

The invention discloses an implementation method and application of a voice awakening module. The implementation method comprises the steps of voice input (1), voice awakening algorithm (2) and awakening execution (3), wherein the voice awakening algorithm (2) is implemented mainly through acoustic feature extraction (4), awakening word detection (5), awakening word confirmation (6), construction of an awakening word detection network (7), training of an acoustic model (8) and construction of an awakening word confirmation network (9). The invention has the advantages that, even in a noisy environment and regardless of whether music is playing, the voice awakening function can be triggered by the spoken awakening word with good recognition performance; the method can also be ported to a general-purpose ARM or DSP processor and applied in vehicle-mounted and household-appliance fields.

Description

Implementation method and application of a voice wake-up module

Technical field

The invention discloses an implementation method and application of a voice wake-up module. Specifically, the user speaks a predetermined voice wake-up word to trigger the system to carry out the user's next operation. The method can be applied in fields that require voice wake-up, such as vehicle-mounted systems and household appliances.

Background technology

The present invention relates to a previously filed and published invention patent application, publication number CN102645977A, filed on 2012.03.26, inventors Yin Jianhong, Wang Zhong and Zhou Yanhuang, entitled "A vehicle-mounted voice wake-up human-machine interaction system and method", which is incorporated herein by reference. That invention implements vehicle-mounted voice wake-up as follows: a speech library, a vehicle noise library, a speech engine and related information are stored in a preset flash memory; the voice command input through the microphone is compared, via the master controller MCU, with the voice-command information stored in memory to perform speech recognition; and the voice-command information determined after matching is used as an execution instruction to control the vehicle-mounted functional unit modules and realize the corresponding functions. The data stored in flash in that invention are all fixed. In a vehicle, however, driving speed, road conditions, weather, and whether the air conditioner is on or the windows are open all change the engine and tyre noise, so the vehicle noise library changes; differences in the music played in the car and in the speaker likewise change the speech library that would need to be referenced. That invention is therefore only suitable for voice wake-up in fixed scenarios. The present invention instead trains an acoustic model on recordings of different speakers collected in a variety of scenarios, and constructs a wake-up word detection network and a confirmation network, so that it adapts to a wider range of scenarios while achieving good wake-up performance.

Summary of the invention

The object of the invention is to overcome the deficiencies of the prior art by providing an implementation method for a voice wake-up system in which, even in a noisy environment and regardless of whether music is playing, the voice wake-up function can be triggered by a spoken wake-up word with good wake-up performance. The invention also provides applications of the voice wake-up system, including applications in vehicle-mounted and household-appliance fields.

The invention is realized by the following technical solution. An implementation method of a voice wake-up module comprises the steps of voice input 1, voice wake-up algorithm 2 and wake-up execution 3. The voice wake-up algorithm 2 obtains the speech signal from voice input 1, performs wake-up processing on it, and outputs the result to wake-up execution 3, thereby completing the wake-up operation.

The voice wake-up algorithm 2 is realized through acoustic feature extraction 4, wake-up word detection 5, wake-up word confirmation 6, construction of a wake-up word detection network 7, training of an acoustic model 8 and construction of a wake-up word confirmation network 9. The specific process is as follows:

Step 1, acoustic feature extraction 4: obtain the speech signal from voice input 1 and extract features that are discriminative and based on the characteristics of human hearing. The MFCC (Mel-Frequency Cepstrum Coefficient) features used in speech recognition are usually chosen as the acoustic features.

Step 2, wake-up word detection 5: using the extracted acoustic features and the trained acoustic model 8, compute acoustic scores on the wake-up word detection network 7. If the best-scoring path contains the wake-up word to be detected, the wake-up word is deemed detected and processing proceeds to step 3; otherwise processing returns to step 1 and acoustic feature extraction 4 is performed again.

Step 3, wake-up word confirmation 6: using the extracted acoustic features and the trained acoustic model 8, confirm the wake-up word on the wake-up word confirmation network 9 to obtain a final confirmation score. To judge whether the detected wake-up word is a real wake-up word, the final confirmation score is compared with a preset threshold. If the final confirmation score is greater than or equal to the threshold, the wake-up word is deemed real, voice wake-up succeeds, and the result is output to wake-up execution 3, completing the voice wake-up operation. If the final confirmation score is less than the threshold, the wake-up word is deemed false and processing returns to step 1 to perform acoustic feature extraction 4 again.
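The three steps above form a simple loop: extract, detect, confirm, and either wake or go back to extraction. The following is a minimal sketch of that control flow only, not the patented implementation. The function name `run_wake_loop` and the callables `extract`, `detect` and `confirm` are hypothetical placeholders for steps 4, 5 and 6; the patent does not define any programming interface.

```python
from typing import Callable, Iterable, List, Optional, Sequence

# Hypothetical layout: one feature vector per frame.
Features = List[List[float]]


def run_wake_loop(
    segments: Iterable[Sequence[float]],
    extract: Callable[[Sequence[float]], Features],
    detect: Callable[[Features], bool],
    confirm: Callable[[Features], float],
    threshold: float,
) -> Optional[int]:
    """Return the index of the first audio segment that wakes the system, else None."""
    for index, audio in enumerate(segments):
        features = extract(audio)      # step 1: acoustic feature extraction (4)
        if not detect(features):       # step 2: wake-up word detection (5)
            continue                   # wake-up word not on the best path: back to step 1
        score = confirm(features)      # step 3: wake-up word confirmation (6)
        if score >= threshold:         # real wake-up word
            return index               # result goes to wake-up execution (3)
        # false wake-up word: fall through and return to step 1
    return None
```

Passing the three stages in as callables keeps the loop independent of how features, networks and scores are actually computed.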

The training of the acoustic model 8 is divided into two parts: the phoneme acoustic models and the garbage models (Garbage models). The phoneme acoustic models are trained with the acoustic-model training methods of conventional speech recognition: a database is selected and the models are obtained under the MLE (Maximum Likelihood Estimation) criterion and the MPE (Minimum Phone Error) discriminative training criterion. The Garbage models absorb speech unrelated to the wake-up word. They use the same database as the phoneme models: the similarity between phoneme models is computed, the phonemes are divided into 20 classes, the training data of all phonemes in each class are merged, and the corresponding Garbage model is trained under the MLE criterion, yielding 20 classes of Garbage models.

The wake-up word detection network 7 is implemented by computing the best-scoring path, using the following formula:

W = \underset{W}{\arg\max} P(W) P(X | W) \qquad (2)

where X is the acoustic feature vector extracted from the input speech and W is the best-scoring word sequence. The conditional probability P(X|W) is the acoustic model score, computed with the trained acoustic model 8. The prior probability P(W) is the language model score, understood here as the Penalty added to the different acoustic models. P(X) is the total probability, which is a constant once the acoustic model and the wake-up word detection network are fixed.

The wake-up word confirmation network (9) is implemented as follows:

a. Decode the detected wake-up word down to the phoneme level and record all the scores (Score_{phone1}, Score_{phone2}, ..., Score_{phoneN}), where N is the total number of phonemes in the wake-up word;

Score_{phone1}, Score_{phone2}, ..., Score_{phoneN} are the decoding scores of the individual phonemes in the wake-up word; the subscript identifies each of the N phonemes.

b. Using the same features as in wake-up word detection, obtain the corresponding acoustic scores at frame level (Score_{frame1}, Score_{frame2}, ..., Score_{frameM}), where M is the total duration of the features in frames;

c. Compute the confirmation score of each phoneme of the wake-up word as follows:

CM_{phonei} = \frac{Score_{phonei} - \sum_{k = K_{istart}}^{K_{iend}} Score_{framek}}{K_{iend} - K_{istart}} \qquad (3)

where K_{istart} and K_{iend} are the start time and end time of the i-th phoneme, respectively;

CM_{phonei} is the confirmation score of the i-th phoneme (the subscript phonei denotes the i-th phoneme), Score_{phonei} is the decoding score of the i-th phoneme defined above, and Score_{framek} is the score of the k-th frame obtained by decoding with the wake-up word confirmation network.

d. Compute the final confirmation score of the wake-up word as follows:

CM_{word} = \frac{1}{N} \sum_{i = 1}^{N} CM_{phonei} \qquad (4)

The method of the invention can be ported to a general-purpose ARM or DSP processor and applied in vehicle-mounted and household-appliance fields.

A vehicle-mounted voice wake-up system, characterized by comprising: a microprocessor, a voice wake-up module, an audio conversion device, a recording device, an audio processing device and a public address system, wherein the voice wake-up module runs in the microprocessor. The specific process is as follows:

Step 1: the microprocessor is connected to the audio processing device and controls it to output audio; the audio processing device is connected to the public address system, which amplifies the audio to be played and drives the loudspeakers, completing the audio playback operation;

Step 2: the recording device is connected to the audio conversion device; when the user speaks the voice wake-up word, the recording device captures the speech and passes it to the audio conversion device for conversion, completing the voice capture operation;

Step 3: the audio conversion device converts the speech captured by the recording device and passes the converted data to the microprocessor, which runs the voice wake-up module described in claim 1, completing the audio data conversion operation;

Step 4: the microprocessor is connected to the audio conversion device and runs the voice wake-up module on the speech data it supplies. If the voice wake-up word is correctly recognized, the microprocessor controls the audio processing device to play a voice prompt tone, completing the vehicle-mounted voice wake-up and prompt-tone playback operation; if recognition fails, the voice capture operation of step 2 continues.

Compared with the prior art, the invention has the following advantages:

(1) The invention uses the user's spoken wake-up word as the trigger source and adds wake-up word detection and wake-up word confirmation, so that even in a noisy environment and regardless of whether music is playing, the voice wake-up function can be triggered by the spoken wake-up word with good wake-up performance. The user does not need to use their hands: a voice command alone quickly wakes the system for the next interaction.

(2) The invention is inexpensive to implement and its code is easy to port, giving it good practical value.

(3) The invention can be widely used in vehicle-mounted and household-appliance fields, as well as in any other field where voice wake-up is needed during audio playback. In a vehicle without this system, a user who wants to start the recognition function while driving has to press a button by hand and pause the music currently playing, which creates a safety hazard while driving and gives a poor user experience.

(4) The value of the invention is that, once the system is in use, the voice wake-up function can be triggered by speaking the agreed wake-up word without first pausing audio playback; in actual testing, the correct recognition and wake-up rate reached more than 90%. In the household-appliance field, for example, a user who is watching television and wants to turn on speech recognition can also do so with the voice wake-up word, making voice interaction more convenient and user-friendly.

(5) The voice wake-up function of the invention is implemented entirely as a software algorithm and can easily be ported to general-purpose processors such as ARM or DSP.

Description of drawings

Fig. 1 is a schematic block diagram of the implementation of the invention;

Fig. 2 is a schematic block diagram of the wake-up word detection network constructed by the invention;

Fig. 3 is a schematic block diagram of the wake-up word confirmation network constructed by the invention;

Fig. 4 is a schematic diagram of an implementation of the invention in the automotive field.

Embodiment

As shown in Fig. 1, the voice wake-up module of the invention is realized through the steps of voice input 1, voice wake-up algorithm 2 and wake-up execution 3.

The voice wake-up algorithm 2 is accomplished mainly through acoustic feature extraction 4, wake-up word detection 5, wake-up word confirmation 6, construction of the wake-up word detection network 7, training of the acoustic model 8 and construction of the wake-up word confirmation network 9. The specific process is:

(1) Training the acoustic model 8: the training of the acoustic model is divided into two parts, the phoneme acoustic models and the garbage models (Garbage models). The phoneme acoustic models are trained with the acoustic-model training methods of conventional speech recognition: a suitable database is selected and the models are obtained under the MLE (Maximum Likelihood Estimation) criterion and the MPE (Minimum Phone Error) discriminative training criterion. The Garbage models absorb speech unrelated to the wake-up word. They use the same database as the phoneme models: the similarity between phoneme models is computed, the phonemes are divided into 20 classes, the training data of all phonemes in each class are merged, and the corresponding Garbage model is trained under the MLE criterion, yielding 20 classes of Garbage models. Because they are trained on the merged data of clustered phonemes, the Garbage models serve two purposes: in the wake-up word detection network they absorb speech other than the wake-up word, and in the wake-up word confirmation network they are used to compute the confirmation-network scores.
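The patent states that phonemes are grouped into 20 classes by model similarity but does not name the similarity measure or the clustering algorithm. The sketch below illustrates one possible way to do the grouping, under the assumption that a pairwise distance matrix between phoneme models is already available; it uses plain average-linkage agglomerative clustering, which is a substitute chosen for illustration. The function name `cluster_phonemes` and its parameters are hypothetical.

```python
from typing import List, Sequence


def cluster_phonemes(
    names: Sequence[str],
    distance: Sequence[Sequence[float]],
    n_classes: int,
) -> List[List[str]]:
    """Merge the two closest clusters (average linkage) until n_classes remain."""
    if n_classes < 1:
        raise ValueError("n_classes must be at least 1")
    # Start with every phoneme in its own cluster (stored as indices into names).
    clusters: List[List[int]] = [[i] for i in range(len(names))]

    def linkage(a: List[int], b: List[int]) -> float:
        # Average pairwise distance between the members of two clusters.
        return sum(distance[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_classes:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = linkage(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [[names[i] for i in sorted(c)] for c in clusters]
```

With the patent's setting, `n_classes` would be 20, and the training data of the phonemes in each returned group would then be merged to train one Garbage model under the MLE criterion.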

(2) Acoustic feature extraction 4: obtain the speech signal from voice input 1 and extract features that are reasonably discriminative and based on the characteristics of human hearing. The MFCC (Mel-Frequency Cepstrum Coefficient) features used in speech recognition are generally chosen.
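MFCC features reflect human hearing through the mel frequency scale, on which a bank of triangular filters is spaced evenly. The patent gives no MFCC parameters, so the sketch below only shows the commonly used mel conversion formula and how filter centre frequencies follow from it; the number of filters and the frequency range are assumptions chosen by the caller, and the function names are hypothetical.

```python
import math
from typing import List


def hz_to_mel(hz: float) -> float:
    """Commonly used mel-scale mapping: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_centre_frequencies(n_filters: int, low_hz: float, high_hz: float) -> List[float]:
    """Centre frequencies (Hz) of n_filters filters spaced evenly on the mel scale."""
    low, high = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high - low) / (n_filters + 1)
    return [mel_to_hz(low + step * (i + 1)) for i in range(n_filters)]
```

Even spacing in mel gives narrow spacing at low frequencies and wide spacing at high frequencies, which is the hearing-based property the text refers to.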

(3) Wake-up word detection 5: using the extracted acoustic features and the acoustic model 8, compute acoustic scores on the wake-up word detection network 7. If the best-scoring path contains the wake-up word to be detected, the wake-up word is detected and processing moves to the next step; otherwise acoustic features are extracted again. The network must ensure that the wake-up word is detected normally while invalid speech is effectively absorbed. The wake-up word detection network therefore consists mainly of the wake-up word selected by the user and the Garbage models, as shown in Fig. 2. In speech recognition this kind of network is also called a recognition network; because the wake-up detection network has a very simple structure, it can be built by a simple program or by hand. Because real operating environments are complex, the received wake-up speech is often contaminated by noise. The score of the corresponding acoustic features on the phoneme acoustic models then drops considerably, whereas the Garbage models, trained on the merged data of many phonemes, are not very precise in the first place, so the score of the features on the Garbage models drops only by a limited amount. The wake-up speech is then wrongly absorbed by the Garbage models, and the system's wake-up rate falls.

To prevent this, when decoding on the wake-up word detection network, a penalty (Penalty) is applied to the decoding score of the arcs on which the Garbage models lie, so that they cannot compete on equal terms with the phoneme acoustic models and noise-contaminated wake-up speech can still be detected normally. The size of the penalty has to be tuned experimentally for each wake-up word.

The wake-up word detection network 7 is implemented by computing the best-scoring path.

The best-scoring path is obtained with the classical Bayes formula:

W = \underset{W}{\arg\max} P(W | X) = \underset{W}{\arg\max} \frac{P(W) P(X | W)}{P(X)} \qquad (1)

In this formula, X is the acoustic feature vector extracted from the input speech and W is the best-scoring word sequence. The conditional probability P(X|W) is the acoustic model score, which can be computed with the trained phoneme acoustic models and garbage models. The prior probability P(W) is the language model score, which can be understood here as the Penalty added to the different acoustic models. P(X) is the total probability, which is a constant once the acoustic model and the wake-up word detection network are fixed, so formula (1) can be written as:

W = \underset{W}{\arg\max} P(W) P(X | W) \qquad (2)
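Formula (2) is usually evaluated in the log domain, where the product becomes a sum and the Penalty on the Garbage arcs acts as a negative log prior. The sketch below shows that selection for a handful of candidate paths. It is an illustration under assumptions: the function name `best_path`, the dictionary layout and the example numbers are all hypothetical, and a real decoder would search the network rather than compare precomputed path scores.

```python
from typing import Dict, Tuple


def best_path(
    acoustic_log_scores: Dict[str, float],
    log_priors: Dict[str, float],
) -> Tuple[str, float]:
    """Pick argmax over W of log P(W) + log P(X|W); P(X) is constant and dropped."""
    totals = {
        path: score + log_priors.get(path, 0.0)  # paths without an entry get no penalty
        for path, score in acoustic_log_scores.items()
    }
    winner = max(totals, key=totals.get)
    return winner, totals[winner]


# Made-up scores for a noisy wake-up utterance: the Garbage path scores
# slightly better acoustically than the wake-up word path.
acoustic = {"wake_word": -105.0, "garbage": -100.0}

without_penalty = best_path(acoustic, {})
with_penalty = best_path(acoustic, {"garbage": -10.0})
```

In this made-up case the Garbage path wins without a penalty and the wake-up word path wins once the penalty is applied, which is the behaviour the penalty is meant to produce.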

(4) Wake-up word confirmation 6: because the acoustic model itself is imprecise and real operating environments are complex, a wake-up word obtained by wake-up word detection is not necessarily a real wake-up word. To reduce false wake-ups caused by non-wake-up speech and the problems they may cause later, the detected wake-up word needs further confirmation. The invention constructs the wake-up word confirmation network 9 as shown in Fig. 3. Like the wake-up word detection network, the confirmation network is a recognition network in the speech-recognition sense; it contains only the Garbage models and can be built by a simple program or by hand.

The main steps of wake-up word confirmation are as follows:

A) Decode the wake-up word obtained by wake-up word detection down to the phoneme level and record all its scores (Score_{phone1}, Score_{phone2}, ..., Score_{phoneN}), where N is the total number of phonemes in the wake-up word.

B) Using the same features as in wake-up word detection, obtain the corresponding acoustic scores on the wake-up word confirmation network at frame level (Score_{frame1}, Score_{frame2}, ..., Score_{frameM}), where M is the total duration of the features in frames.

C) Compute the confirmation score of each phoneme of the wake-up word as follows:

CM_{phonei} = \frac{Score_{phonei} - \sum_{k = K_{istart}}^{K_{iend}} Score_{framek}}{K_{iend} - K_{istart}} \qquad (3)

where K_{istart} and K_{iend} are the start time and end time of the i-th phoneme, respectively.

D) Compute the final confirmation score of the wake-up word as follows:

CM_{word} = \frac{1}{N} \sum_{i = 1}^{N} CM_{phonei} \qquad (4)

E) Judge whether the wake-up word is a real wake-up word by comparing its final confirmation score with a preset threshold: if the confirmation score CM_{word} is greater than the threshold T, the wake-up word is deemed real and wake-up succeeds; if CM_{word} is less than the threshold T, the wake-up word is deemed false and acoustic feature extraction is performed again.
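Steps C) to E) can be sketched as three small functions. This is an illustration of formulas (3) and (4) as written, not the patented code: the sum runs over frames K_istart to K_iend inclusive and the divisor is K_iend - K_istart, exactly as in formula (3). The function names, the zero-based frame indexing and the example numbers are assumptions. The threshold test uses "greater than or equal to", following the summary and claim 1; step E) above words it as "greater than".

```python
from typing import List, Sequence, Tuple


def phoneme_confidence(
    phone_score: float, frame_scores: Sequence[float], start: int, end: int
) -> float:
    """Formula (3): (Score_phone - sum of frame scores from start to end) / (end - start)."""
    if end <= start:
        raise ValueError("end frame must be later than start frame")
    background = sum(frame_scores[start:end + 1])  # confirmation-network frame scores
    return (phone_score - background) / (end - start)


def word_confidence(
    phone_scores: Sequence[float],
    frame_scores: Sequence[float],
    boundaries: Sequence[Tuple[int, int]],
) -> float:
    """Formula (4): mean of the per-phoneme confirmation scores."""
    cms: List[float] = [
        phoneme_confidence(score, frame_scores, start, end)
        for score, (start, end) in zip(phone_scores, boundaries)
    ]
    return sum(cms) / len(cms)


def is_real_wake_word(cm_word: float, threshold: float) -> bool:
    """Threshold decision on the final confirmation score."""
    return cm_word >= threshold


# Made-up example: two phonemes over six frames.
frames = [-2.0] * 6
cm = word_confidence([-4.0, -5.0], frames, [(0, 2), (3, 5)])
```

In the example, the first phoneme scores (-4 - (-6)) / 2 = 1.0 and the second (-5 - (-6)) / 2 = 0.5, so `cm` is 0.75.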

The voice wake-up function is realized by the work described above; the result is finally passed to wake-up execution 3, which carries out the wake-up operation.

Fig. 4 shows a schematic diagram of an implementation of the invention in the automotive field. The vehicle-mounted voice wake-up system comprises: a microprocessor 11, preferably an ARM9 processor, though not limited to it, in which the voice wake-up module runs; an audio conversion device 12, preferably a WM8731, though not limited to it; a recording device 13, preferably a cost-effective electret microphone, though not limited to it; an audio processing device 14, preferably a TDA7419, though not limited to it; a public address system 15, which uses the power amplifier TDA7388 and the car's own four loudspeakers (front left, rear left, front right, rear right), though it is not limited to this amplifier and these loudspeakers; and the voice wake-up command word, preferably "automobile language point", though not limited to this wake-up word.

The implementation mainly comprises the steps of audio playback, voice data capture, audio data conversion, voice wake-up and prompt-tone playback, as follows:

First, when the user listens to music through the system while driving, the music may be audio supplied by the playback module of the ARM9 microprocessor or another source connected to the TDA7419 audio processor, such as radio, TV, DVD or line-in. All music is first processed by the audio processor and then played through the vehicle loudspeakers driven by the TDA7388 power amplifier, completing audio playback;

Second, when the user speaks the specific voice wake-up word "automobile language point", the speaking volume should be kept at a normal conversational level: if the voice is too quiet, the electret microphone cannot pick up the speech signal, and if it is too loud, the recording clips; either can cause wake-up to fail. The microphone signal containing the wake-up word is converted from analog to digital in the WM8731 audio converter, completing speech signal capture;

Third, the voice capture module of the ARM9 microprocessor controls the WM8731 audio converter over the I2C bus to perform analog-to-digital conversion, converting the microphone's analog signal into a digital signal that is returned to the microprocessor over the I2S bus, completing audio data conversion;

Fourth, the microprocessor extracts acoustic features from the user's microphone signal and, using the trained acoustic model, passes them through the wake-up word detection network and the wake-up word confirmation network to realize the voice wake-up function. It then plays a prompt tone through the audio processor, completing the whole voice wake-up and prompt-tone playback operation.
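The second step above notes that speech that is too quiet or that clips can make wake-up fail. The patent describes no check for this; the sketch below is a hypothetical diagnostic that classifies a block of signed 16-bit PCM samples before it is passed to the wake-up algorithm. The function name, the labels and both threshold values are assumptions and would need tuning for a real microphone and gain setting.

```python
from typing import Sequence

FULL_SCALE = 32767  # largest positive value of a signed 16-bit sample


def check_recording_level(
    samples: Sequence[int],
    quiet_peak: int = 500,      # assumed floor: peaks below this count as too quiet
    clip_ratio: float = 0.01,   # assumed limit on the fraction of full-scale samples
) -> str:
    """Return 'clipped', 'too_quiet' or 'ok' for a block of 16-bit PCM samples."""
    if not samples:
        return "too_quiet"
    peak = max(abs(s) for s in samples)
    clipped = sum(1 for s in samples if abs(s) >= FULL_SCALE)
    if clipped / len(samples) > clip_ratio:
        return "clipped"
    if peak < quiet_peak:
        return "too_quiet"
    return "ok"
```

A system could use such a result to prompt the user to speak at a normal level, although the patent itself only states the requirement and not how to enforce it.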

The above is the preferred embodiment of the invention; the user can likewise turn on speech recognition with the specific voice wake-up word when no music is playing or when not driving.

Parts of the invention not described in detail belong to techniques well known in the art. The above embodiment does not limit the invention in any form; all technical solutions obtained by equivalent replacement or equivalent transformation fall within the scope of protection of the invention.

Claims

1. An implementation method of a voice wake-up module, characterized by comprising the steps of voice input (1), voice wake-up algorithm (2) and wake-up execution (3), wherein the voice wake-up algorithm (2) obtains the speech signal from voice input (1), performs wake-up processing on it, and outputs the result to wake-up execution (3), thereby completing the wake-up operation;

The voice wake-up algorithm (2) is realized through acoustic feature extraction (4), wake-up word detection (5), wake-up word confirmation (6), construction of a wake-up word detection network (7), training of an acoustic model (8) and construction of a wake-up word confirmation network (9), and the specific process is as follows:

Step 1, acoustic feature extraction (4): obtain the speech signal from voice input (1) and extract features that are discriminative and based on the characteristics of human hearing, the MFCC (Mel-Frequency Cepstrum Coefficient) features used in speech recognition usually being chosen as the acoustic features;

Step 2, wake-up word detection (5): using the extracted acoustic features and the trained acoustic model (8), compute acoustic scores on the wake-up word detection network (7); if the path with the best acoustic score contains the wake-up word to be detected, the wake-up word is deemed detected and processing proceeds to step 3; otherwise processing returns to step 1 and acoustic feature extraction (4) is performed again;

Step 3, wake-up word confirmation (6): using the extracted acoustic features and the trained acoustic model (8), confirm the wake-up word on the wake-up word confirmation network (9) to obtain a final confirmation score; to judge whether the detected wake-up word is a real wake-up word, the final confirmation score of the wake-up word is compared with a preset threshold; if the final confirmation score is greater than or equal to the threshold, the wake-up word is deemed real, voice wake-up succeeds, and the result is output to wake-up execution (3), thereby completing the voice wake-up operation; if the final confirmation score is less than the threshold, the wake-up word is deemed false and processing returns to step 1 to perform acoustic feature extraction (4) again.

2. The implementation method of a voice wake-up module according to claim 1, characterized in that the training of the acoustic model (8) is divided into two parts, the phoneme acoustic models and the garbage models (Garbage models); the phoneme acoustic models are trained with the acoustic-model training methods of conventional speech recognition, a database being selected and the models being obtained under the MLE (Maximum Likelihood Estimation) criterion and the MPE (Minimum Phone Error) discriminative training criterion; the Garbage models absorb speech unrelated to the wake-up word and use the same database as the phoneme models: the similarity between phoneme models is computed, the phonemes are divided into 20 classes, the training data of all phonemes in each class are merged, and the corresponding Garbage model is trained under the MLE criterion, yielding 20 classes of Garbage models.

3. The implementation method of a voice wake-up module according to claim 1, characterized in that the wake-up word detection network (7) is implemented by computing the best-scoring path, using the following formula:

W = \underset{W}{\arg\max} P(W) P(X | W)

where X is the acoustic feature vector extracted from the input speech and W is the best-scoring word sequence; the conditional probability P(X|W) is the acoustic model score, computed with the trained acoustic model (8); the prior probability P(W) is the language model score, namely the Penalty added to the different acoustic models; and P(X) is the total probability, which is a constant once the acoustic model and the wake-up word detection network are fixed.

4. The implementation method of a voice wake-up module according to claim 1, characterized in that the wake-up word confirmation network (9) is implemented as follows:

a. Decode the detected wake-up word down to the phoneme level and record all the scores (Score_{phone1}, Score_{phone2}, ..., Score_{phoneN}), where N is the total number of phonemes in the wake-up word and Score_{phone1}, Score_{phone2}, ..., Score_{phoneN} are the decoding scores of the individual phonemes in the wake-up word, the subscript identifying each of the N phonemes;

CM_{phonei} = \frac{Score_{phonei} - \sum_{k = K_{istart}}^{K_{iend}} Score_{framek}}{K_{iend} - K_{istart}}

CM_{phonei} is the confirmation score of the i-th phoneme (the subscript phonei denotes the i-th phoneme), Score_{phonei} is the decoding score of the i-th phoneme defined above, and Score_{framek} is the score of the k-th frame obtained by decoding with the wake-up word confirmation network;

CM_{word} = \frac{1}{N} \sum_{i = 1}^{N} CM_{phonei}.

5. The implementation method of a voice wake-up module according to claim 1, characterized in that the method can be ported to and run on a general-purpose ARM or DSP processor, and is applied in vehicle-mounted and household-appliance fields.

6. A vehicle-mounted voice wake-up system, characterized by comprising: a microprocessor, the voice wake-up module described in claim 1, an audio conversion device, a recording device, an audio processing device and a public address system, the voice wake-up module running in the microprocessor, and the specific process being as follows:

Step 3: the audio conversion device converts the speech captured by the recording device and passes the converted data to the microprocessor, which runs the voice wake-up module, completing the audio data conversion operation;