CN102999161B

CN102999161B - Implementation method and application of a voice wake-up module

Info

Publication number: CN102999161B
Application number: CN201210455175.2A
Authority: CN
Inventors: 操文祥; 王海坤; 康怀茂; 钱勇; 谢信珍; 黄海兵
Original assignee: iFlytek Co Ltd
Current assignee: Science And Technology University Information Flying South China Institute Of Artificial Intelligence (guangzhou) Co Ltd
Priority date: 2012-11-13
Filing date: 2012-11-13
Publication date: 2016-03-02
Anticipated expiration: 2032-11-13
Also published as: CN102999161A

Abstract

An implementation method and application of a voice wake-up module, comprising: voice input (1), a voice wake-up algorithm (2) and wake-up execution (3). The voice wake-up algorithm (2) is realized mainly through acoustic feature extraction (4), wake-up word detection (5), wake-up word confirmation (6), construction of a wake-up word detection network (7), training of an acoustic model (8) and construction of a wake-up word confirmation network (9). Even in a noisy environment, and regardless of whether music is playing, the voice wake-up function can be triggered by speaking the voice wake-up word, with good recognition and wake-up performance. The implementation method can be ported to run on general-purpose ARM or DSP processors and applied in the automotive and household appliance fields.

Description

Implementation method and application of a voice wake-up module

Technical field

The invention discloses an implementation method and application of a voice wake-up module, and specifically relates to a method in which the user speaks a predetermined voice wake-up word to trigger the system to perform the user's next operation. It can be applied in fields that require voice wake-up, such as automotive systems and household appliances.

Background art

The present invention relates to a published invention patent application, publication number CN102645977A, filed 2012-03-26, inventors Yin Jianhong, Wang Zhong and Zhou Yanhuang, titled "A vehicle-mounted voice wake-up human-machine interaction system and method", which is incorporated herein by reference. The vehicle-mounted voice wake-up of that invention works as follows: a speech library, a vehicle noise library, a speech engine and similar information are stored in advance in flash memory; the voice command input through the microphone is compared, via the master controller MCU, with the voice command information stored in memory to perform speech recognition; and the voice command information determined by the matching is used as an execution instruction to control the vehicle control function modules and realize the corresponding functions. The data stored in flash in that invention are all fixed. In a vehicle, however, driving speed, road conditions, weather, and whether the air conditioner is on or the windows are open all change the vehicle noise, such as engine noise and tire noise; the music played in the car differs; and differences between speakers change the speech library that should be referenced. That invention is therefore only suitable for realizing voice wake-up in fixed scenarios. The present invention, by contrast, collects recordings of different speakers in many kinds of scenarios to train an acoustic model, and constructs a wake-up word detection network and a confirmation network, so that it adapts to a wider range of scenarios while also achieving a good voice wake-up effect.

Summary of the invention

The object of the invention is to overcome the deficiencies of the prior art by providing an implementation method for a voice wake-up system in which, even in a noisy environment and regardless of whether music is playing, the voice wake-up function can be triggered by a voice wake-up word with a good wake-up effect. The invention also provides applications of the voice wake-up system, including applications in the automotive and household appliance fields.

The present invention is achieved by the following technical solution: an implementation method of a voice wake-up module comprises the steps of voice input 1, voice wake-up algorithm 2 and wake-up execution 3. The voice wake-up algorithm 2 obtains the voice signal from voice input 1, performs voice wake-up processing, and outputs the result to wake-up execution 3, thereby completing the wake-up operation;

The voice wake-up algorithm 2 is realized by acoustic feature extraction 4, wake-up word detection 5, wake-up word confirmation 6, construction of the wake-up word detection network 7, training of the acoustic model 8, and construction of the wake-up word confirmation network 9. The specific implementation process is as follows:

First step, acoustic feature extraction 4: the voice signal is obtained through voice input 1, and features that are discriminative and based on the characteristics of human hearing are extracted; the MFCC (Mel-Frequency Cepstral Coefficient) features used in speech recognition are usually chosen as the acoustic features;

Second step, wake-up word detection 5: using the trained acoustic model 8, acoustic scores are computed for the extracted acoustic features on the wake-up word detection network 7. If the best-scoring path contains the wake-up word to be detected, the wake-up word is deemed detected and processing proceeds to the third step; otherwise processing returns to the first step and acoustic feature extraction 4 is performed again;

Third step, wake-up word confirmation 6: using the trained acoustic model 8, the extracted acoustic features are verified on the wake-up word confirmation network 9, yielding the final confirmation score of the wake-up word. To judge whether the detected wake-up word is genuine, this final confirmation score is compared with a preset threshold. If the final confirmation score is greater than or equal to the threshold, the wake-up word is considered genuine, the voice wake-up succeeds, and the result is output to wake-up execution 3, completing the voice wake-up operation. If the final confirmation score is less than the threshold, the wake-up word is considered false, and processing returns to the first step to perform acoustic feature extraction 4 again.
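
The three steps above form a loop, which can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of the invention: the function name `wake_loop` and the callables `extract`, `detect` and `confirm` are hypothetical stand-ins for steps 4, 5 and 6, and the input is assumed to arrive as a sequence of audio segments.

```python
from typing import Callable, Iterable, Optional, Sequence

Features = Sequence[float]  # placeholder type for one segment's acoustic features


def wake_loop(
    segments: Iterable[Sequence[float]],
    extract: Callable[[Sequence[float]], Features],
    detect: Callable[[Features], bool],
    confirm: Callable[[Features], float],
    threshold: float,
) -> Optional[int]:
    """Return the index of the first segment that wakes the system, else None.

    extract/detect/confirm are hypothetical stand-ins for acoustic feature
    extraction (4), wake-up word detection (5) and wake-up word confirmation (6).
    """
    for index, segment in enumerate(segments):
        features = extract(segment)          # first step: feature extraction
        if not detect(features):             # second step: wake-up word on best path?
            continue                         # no: go back to the first step
        if confirm(features) >= threshold:   # third step: final confirmation score
            return index                     # genuine wake-up word: wake up
        # false wake-up word: fall through and go back to the first step
    return None
```

Passing the three stages in as callables keeps the control flow separate from any particular feature extractor or decoder, which are outside the scope of this sketch.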

The training of the acoustic model 8 has two parts: the phoneme acoustic models and the garbage models (Garbage models). The phoneme acoustic models are trained with the acoustic model training method of conventional speech recognition: a database is chosen, and the models are obtained under the MLE (Maximum Likelihood Estimation) criterion and the MPE (Minimum Phone Error) discriminative training criterion. The Garbage models absorb speech other than the wake-up word. They use the same database as the phoneme models: the similarity between the phoneme models is computed, the phonemes are divided into 20 classes, all the training data corresponding to each class of phonemes are merged, and the corresponding Garbage model is trained under the MLE criterion, giving 20 classes of Garbage models.
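
The grouping of phonemes into classes by model similarity can be sketched as follows. The text does not say which similarity measure or clustering procedure is used, so everything specific here is an assumption made for illustration: each phoneme model is summarised by a single vector, similarity is Euclidean distance, and clusters are merged greedily by average linkage. The name `cluster_phonemes` is hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cluster_phonemes(
    models: Dict[str, Sequence[float]], num_classes: int
) -> List[List[str]]:
    """Greedily merge the two closest clusters until num_classes remain.

    ASSUMPTION: each phoneme model is summarised by one vector and similarity
    is Euclidean distance between vectors; the source specifies neither.
    """
    clusters: List[List[str]] = [[name] for name in sorted(models)]

    def distance(a: List[str], b: List[str]) -> float:
        # Average-linkage distance between two clusters of phonemes.
        total = sum(math.dist(models[p], models[q]) for p in a for q in b)
        return total / (len(a) * len(b))

    while len(clusters) > max(num_classes, 1):
        # Find the closest pair of clusters and merge them.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: distance(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

With `num_classes=20`, each returned group would correspond to one Garbage class: the training data of its phonemes would be merged and one Garbage model trained on it under the MLE criterion.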

The wake-up word detection network 7 is implemented by computing the best-scoring path; the formula for the best-scoring path is:

W = \underset{W}{\arg\max} \; P(W) \, P(X \mid W) \qquad (2)

where X denotes the acoustic feature vectors extracted from the input speech, and W denotes the best-scoring word sequence. The conditional probability P(X|W) is the acoustic model score, computed by the trained acoustic model 8. The prior probability P(W) is the language model score, understood here as the penalty (Penalty) added for the different acoustic models. P(X) is the total probability, which is a constant once the acoustic model and the wake-up word detection network are fixed.
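
Formula (2) can be illustrated in the log domain, where the product becomes a sum and the constant P(X) drops out. This is a sketch under assumptions, not the decoder of the invention: the candidate paths are taken as given, and log P(W) is modelled as a fixed penalty per Garbage arc, a choice the source does not specify. The names `Path` and `best_path` are hypothetical.

```python
from typing import Dict, NamedTuple


class Path(NamedTuple):
    acoustic_logp: float   # log P(X|W) from the acoustic model
    garbage_arcs: int      # number of Garbage-model arcs on the path
    has_wake_word: bool    # True if the path contains the wake-up word


def best_path(paths: Dict[str, Path], penalty: float) -> str:
    """Return the name of the path maximising log P(W) + log P(X|W).

    ASSUMPTION: log P(W) is modelled as -penalty per Garbage arc; how the
    penalty is applied in practice is not stated in the source.
    """
    def total(name: str) -> float:
        path = paths[name]
        return path.acoustic_logp - penalty * path.garbage_arcs

    return max(paths, key=total)
```

For example, if a noisy utterance scores -105 on the wake-up word path and -100 on an all-Garbage path with three arcs, a penalty of 0 selects the Garbage path, while a penalty of 2 per arc lowers it to -106 and the wake-up word path is selected. Detection succeeds when the selected path has `has_wake_word` set.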

The wake-up word confirmation network (9) is implemented as follows:

A. The detected wake-up word is decoded to the phoneme level, and all the scores are recorded (Score_{phone1}, Score_{phone2}, ..., Score_{phoneN}), where N is the total number of phonemes in the wake-up word;

Score_{phone1}, Score_{phone2}, ..., Score_{phoneN} denote the decoding scores of the individual phonemes in the wake-up word, the subscript identifying each of the N phonemes.

B. Using the same features as wake-up word detection, the corresponding acoustic scores are obtained at frame level (Score_{frame1}, Score_{frame2}, ..., Score_{frameM}), where M is the total duration of the features, in frames;

C. The confirmation score of each phoneme of the wake-up word is calculated as follows:

{CM}_{phonei} = \frac{{Score}_{phonei} - \sum_{k = K_{istart}}^{K_{iend}} {Score}_{framek}}{K_{iend} - K_{istart}} \qquad (3)

where K_{istart} and K_{iend} are the start time and end time of the i-th phoneme, respectively;

CM_{phonei} denotes the confirmation score of the i-th phoneme, the subscript phonei denoting the i-th phoneme; Score_{phonei} is the decoding score of the i-th phoneme described above; and Score_{framek} denotes the score of the k-th frame obtained by decoding on the wake-up word confirmation network.

D. The final confirmation score of the wake-up word is calculated as follows:

{CM}_{word} = \frac{1}{N} \sum_{i = 1}^{N} {CM}_{phonei} \qquad (4)
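
Steps A to D, together with the threshold comparison of the third step, can be sketched as follows. The arithmetic follows formulas (3) and (4) as written; the function names, the 0-based frame indexing and the treatment of the end frame as inclusive in the sum are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def phoneme_confidences(
    phone_scores: Sequence[float],
    frame_scores: Sequence[float],
    boundaries: Sequence[Tuple[int, int]],
) -> List[float]:
    """Formula (3): one confirmation score per phoneme.

    phone_scores[i] is Score_phonei, frame_scores[k] is Score_framek from the
    confirmation network, boundaries[i] is (K_istart, K_iend) in frames.
    ASSUMPTION: frames are 0-indexed and K_iend is included in the sum.
    """
    confidences = []
    for score, (start, end) in zip(phone_scores, boundaries):
        if end <= start:
            raise ValueError("each phoneme must span more than one frame")
        garbage_sum = sum(frame_scores[start : end + 1])
        confidences.append((score - garbage_sum) / (end - start))
    return confidences


def word_confidence(confidences: Sequence[float]) -> float:
    """Formula (4): mean of the per-phoneme confirmation scores."""
    return sum(confidences) / len(confidences)


def is_wake_word(cm_word: float, threshold: float) -> bool:
    """Accept when the final confirmation score reaches the preset threshold."""
    return cm_word >= threshold
```

Note that formula (3) divides by K_{iend} - K_{istart} rather than by the number of frames summed; the sketch keeps that denominator unchanged.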

The method of the present invention can be ported to general-purpose ARM or DSP processors and applied in the automotive and household appliance fields.

A vehicle-mounted voice wake-up system, characterized by comprising: a microprocessor, a voice wake-up module, an audio conversion device, a recording device, an audio processing device and a public address system, wherein the voice wake-up module runs on the microprocessor. The specific implementation process is as follows:

First step: the microprocessor is connected to the audio processing device and controls it to output audio; the audio processing device is connected to the public address system, which power-amplifies the audio to be played and drives the loudspeakers, completing the audio playback operation;

Second step: the recording device is connected to the audio conversion device; when the user speaks the voice wake-up word, the recording device captures the speech and passes it to the audio conversion device for conversion, completing the voice collection operation;

Third step: the audio conversion device converts the voice information captured by the recording device and passes the converted data to the microprocessor, which performs the computation of the voice wake-up module described in claim 1, completing the voice data conversion operation;

Fourth step: the microprocessor is connected to the audio conversion device and performs the computation of the voice wake-up module on the voice information input by the audio conversion device. If the voice wake-up information is correctly recognized, it controls the audio processing device to play a voice prompt tone, completing the vehicle-mounted voice wake-up and prompt tone playback operation; if recognition fails, the voice collection operation of the second step continues.

Compared with the prior art, the present invention has the following advantages:

(1) The present invention uses the user's voice wake-up word as the trigger source and adds wake-up word detection and wake-up word confirmation. Even in a noisy environment, and regardless of whether music is playing, the voice wake-up function can be triggered by the voice wake-up word, with a good wake-up effect. The user also does not need to use their hands: the wake-up function is triggered quickly by voice command alone, and the next interactive operation can proceed.

(2) The present invention is inexpensive to implement and its code is easy to port, giving it good practical value.

(3) The present invention can be widely applied in the automotive and household appliance fields, and also in any other field where voice wake-up is needed while audio is playing. In a vehicle, before this system is used, a user who wants to start the recognition function while driving has to press a button by hand and pause the music currently playing, which creates a safety hazard while driving and gives a poor user experience.

(4) The value brought by the present invention is that, with this system, the voice wake-up function is triggered by speaking the agreed voice wake-up word, without first pausing audio playback; actual testing has verified that the correct wake-up rate can exceed 90%. In other fields such as household appliances, a user who is watching television and wants to turn on speech recognition can likewise do so with the voice wake-up word, making voice interaction more convenient and more user-friendly.

(5) The voice wake-up function of the present invention is realized entirely by software algorithms and can easily be ported to general-purpose processors such as ARM or DSP.

Description of the drawings

Fig. 1 is a schematic block diagram of the implementation of the present invention;

Fig. 2 is a schematic block diagram of the wake-up word detection network constructed by the present invention;

Fig. 3 is a schematic block diagram of the wake-up word confirmation network constructed by the present invention;

Fig. 4 is a schematic diagram of a specific implementation of the present invention in the automotive field.

Detailed description of the embodiments

As shown in Fig. 1, the voice wake-up module of the present invention is realized by the steps of voice input 1, voice wake-up algorithm 2 and wake-up execution 3.

The voice wake-up algorithm 2 is realized mainly by acoustic feature extraction 4, wake-up word detection 5, wake-up word confirmation 6, construction of the wake-up word detection network 7, training of the acoustic model 8, and construction of the wake-up word confirmation network 9. The specific implementation process is:

(1) Training of the acoustic model 8: the training of the acoustic model has two parts, the phoneme acoustic models and the garbage models (Garbage models). The phoneme acoustic models are trained with the acoustic model training method of conventional speech recognition: a suitable database is chosen, and the models are obtained under the MLE (Maximum Likelihood Estimation) criterion and the MPE (Minimum Phone Error) discriminative training criterion. The Garbage models absorb speech other than the wake-up word. They use the same database as the phoneme models: the similarity between the phoneme models is computed, the phonemes are divided into 20 classes, all the training data corresponding to each class of phonemes are merged, and the corresponding Garbage model is trained under the MLE criterion, giving 20 classes of Garbage models. Because the Garbage models are trained on the merged training data of clustered phonemes, they serve two purposes: in the wake-up word detection network they absorb speech other than the wake-up word, and in the wake-up word confirmation network they are used to compute the score of the confirmation network.

(2) Acoustic feature extraction 4: the voice signal is obtained through voice input 1, and features that are reasonably discriminative and based on the characteristics of human hearing are extracted; the MFCC (Mel-Frequency Cepstral Coefficient) features used in speech recognition are generally chosen.

(3) Wake-up word detection 5: using the acoustic model 8, acoustic scores are computed for the extracted acoustic features on the wake-up word detection network 7. If the best-scoring path contains the wake-up word to be detected, the wake-up word is detected and processing proceeds to the next step; otherwise acoustic features are extracted again. The network must ensure both that the wake-up word is detected normally and that invalid speech is effectively absorbed. The wake-up word detection network therefore consists mainly of the wake-up word chosen by the user and the Garbage models, as shown in Fig. 2. In speech recognition such a network is also called a recognition network; because the structure of the wake-up word detection network is very simple, it can be built by a simple program or by hand. Because real usage environments are complex, the received wake-up speech is often contaminated by noise. The score of the corresponding acoustic features on the phoneme acoustic models then drops considerably, whereas the Garbage models, being trained on the merged data of many phonemes, are not very precise to begin with, so the score of the acoustic features on the Garbage models drops only to a limited extent. The wake-up speech is then mistakenly absorbed by the Garbage models, and the system's wake-up rate falls.

To prevent this, when decoding on the wake-up word detection network, a penalty (Penalty) is applied to the decoding score of the arcs on which the Garbage models lie, so that they do not compete on equal terms with the phoneme acoustic models; this ensures that wake-up speech contaminated by noise can still be detected normally. The size of the penalty needs to be tuned experimentally for each wake-up word.

The wake-up word detection network 7 is implemented by computing the best-scoring path.

The best-scoring path is obtained with the classical Bayes formula, as follows:

W = \underset{W}{\arg\max} \; P(W \mid X) = \underset{W}{\arg\max} \; \frac{P(W) \, P(X \mid W)}{P(X)} \qquad (1)

In the above formula, X denotes the acoustic feature vectors extracted from the input speech, and W denotes the best-scoring word sequence. The conditional probability P(X|W) is the acoustic model score, which can be computed from the trained phoneme acoustic models and garbage models. The prior probability P(W) is the language model score, which can be understood here as the penalty (Penalty) added for the different acoustic models. P(X) is the total probability, which is a constant once the acoustic model and the wake-up word detection network are fixed, so formula (1) can be written as:

W = \underset{W}{\arg\max} \; P(W) \, P(X \mid W) \qquad (2)

(4) Wake-up word confirmation 6: because the acoustic model is inherently imprecise and real usage environments are complex, a wake-up word obtained by wake-up word detection is not necessarily a genuine wake-up word. To reduce false wake-ups caused by non-wake-up speech and the problems that may follow, the detected wake-up word needs further confirmation. The present invention builds the wake-up word confirmation network 9 as shown in Fig. 3. Like the wake-up word detection network, the confirmation network is a recognition network in the speech recognition sense; it contains only Garbage models and can be built by a simple program or by hand.

The main steps of wake-up word confirmation are as follows:

a) The wake-up word obtained by wake-up word detection is decoded to the phoneme level, and all its scores are recorded (Score_{phone1}, Score_{phone2}, ..., Score_{phoneN}), where N is the total number of phonemes in the wake-up word.

b) Using the same features as wake-up word detection, the corresponding acoustic scores are obtained on the wake-up word confirmation network at frame level (Score_{frame1}, Score_{frame2}, ..., Score_{frameM}), where M is the total duration of the features, in frames.

c) The confirmation score of each phoneme of the wake-up word is calculated as follows:

{CM}_{phonei} = \frac{{Score}_{phonei} - \sum_{k = K_{istart}}^{K_{iend}} {Score}_{framek}}{K_{iend} - K_{istart}} \qquad (3)

where K_{istart} and K_{iend} are the start time and end time of the i-th phoneme, respectively.

d) The final confirmation score of the wake-up word is calculated as follows:

{CM}_{word} = \frac{1}{N} \sum_{i = 1}^{N} {CM}_{phonei} \qquad (4)

e) To judge whether the wake-up word is genuine, the final confirmation score of the wake-up word is compared with a preset threshold: if the confirmation score CM_{word} is greater than the threshold T, the wake-up word is considered genuine and the wake-up succeeds; if CM_{word} is less than the threshold T, the wake-up word is considered false, and acoustic feature extraction is performed again.

The voice wake-up function is realized by the above steps, and the result is finally fed back to wake-up execution 3, which performs the wake-up operation.

Fig. 4 gives a schematic diagram of a specific implementation of the present invention in the automotive field. The vehicle-mounted voice wake-up system comprises: a microprocessor 11, preferably an ARM9 processor, though not limited to this microprocessor, with the voice wake-up module running on the microprocessor 11; an audio conversion device 12, preferably a WM8731, though not limited to this device; a recording device 13, preferably a cost-effective electret microphone, though not limited to this device; an audio processing device 14, preferably a TDA7419, though not limited to this device; and a public address system 15, which uses a TDA7388 power amplifier and the car's own four loudspeaker units (front left, rear left, front right, rear right), though not limited to this power amplifier and these loudspeaker units. The voice wake-up command word is preferably "automobile language point", though not limited to this wake-up word.

The implementation principle mainly comprises the steps of audio playback, voice data collection, voice data conversion, and voice wake-up with prompt tone playback, as follows:

First, when the user listens to music with this system while driving, the music may be audio provided by the playback module of the ARM9 microprocessor, or another source such as radio/TV/DVD/line-in connected to the TDA7419 audio processor. All music played first undergoes sound-effect processing in the audio processor and is then amplified by the TDA7388 power amplifier to drive the car loudspeakers, completing audio playback;

Second, when the user speaks the specific voice wake-up word "automobile language point", the speaking volume should be kept at a normal conversational level: if it is too quiet, the electret microphone fails to pick up the voice signal, and if it is too loud, the recording clips; either causes the wake-up function to fail. The microphone signal containing the voice wake-up word undergoes analog-to-digital conversion in the WM8731 audio converter, completing voice signal collection;

Third, the voice collection module of the ARM9 microprocessor controls the WM8731 audio converter over the IIC bus to perform the analog-to-digital conversion, converting the analog microphone signal into a digital signal that is returned to the microprocessor over the IIS bus, completing voice data conversion;

Fourth, the microprocessor, working with the trained acoustic model, extracts the user's acoustic features from the microphone signal input and passes them through the wake-up word detection network and the wake-up word confirmation network to realize the voice wake-up function. At the same time, the audio processor plays the prompt tone signal, completing the whole voice wake-up and prompt tone playback operation.

The above is a preferred embodiment of the present invention; when no music is playing or the vehicle is not being driven, the user can likewise turn on speech recognition with the specific voice wake-up word.

Parts of the present invention not described in detail belong to techniques well known in the art. The above embodiment does not limit the present invention in any way; all technical solutions obtained by equivalent replacement or equivalent transformation fall within the scope of protection of the present invention.

Claims

1. An implementation method of a voice wake-up module, characterized by comprising the steps of voice input (1), voice wake-up algorithm (2) and wake-up execution (3), wherein the voice wake-up algorithm (2) obtains the voice signal from voice input (1), performs voice wake-up processing, and outputs the result to wake-up execution (3), thereby completing the wake-up operation;

The voice wake-up algorithm (2) is realized by acoustic feature extraction (4), wake-up word detection (5), wake-up word confirmation (6), construction of the wake-up word detection network (7), training of the acoustic model (8), and construction of the wake-up word confirmation network (9); the specific implementation process is as follows:

First step, acoustic feature extraction (4): the voice signal is obtained through voice input (1), features that are discriminative and based on the characteristics of human hearing are extracted, and the Mel-frequency cepstral coefficient features used in speech recognition are chosen as the acoustic features;

Second step, wake-up word detection (5): using the trained acoustic model (8), acoustic scores are computed for the extracted acoustic features on the wake-up word detection network (7); if the path with the best acoustic score contains the wake-up word to be detected, the wake-up word is deemed detected and processing proceeds to the third step; otherwise processing returns to the first step and acoustic feature extraction (4) is performed again;

Third step, wake-up word confirmation (6): using the trained acoustic model (8), the extracted acoustic features are verified on the wake-up word confirmation network (9), yielding the final confirmation score of the wake-up word; to judge whether the detected wake-up word is genuine, this final confirmation score is compared with a preset threshold; if the final confirmation score is greater than or equal to the threshold, the wake-up word is considered genuine, the voice wake-up succeeds, and the result is output to wake-up execution (3), completing the voice wake-up operation; if the final confirmation score is less than the threshold, the wake-up word is considered false, and processing returns to the first step to perform acoustic feature extraction (4) again;

The wake-up word detection network (7) is implemented by computing the path with the best acoustic score; the formula for the path with the best acoustic score is:

W = \underset{W}{\arg\max} \; P(W) \, P(X \mid W)

where X denotes the acoustic feature vectors extracted from the input speech and W denotes the best-scoring word sequence; the conditional probability P(X|W) is the acoustic model score, computed by the trained acoustic model (8); the prior probability P(W) is the language model score, being the penalty (Penalty) added for the different acoustic models; and P(X) is the total probability;

A. the detected wake-up word is decoded to the phoneme level, and all the scores Score_{phone1}, Score_{phone2}, ..., Score_{phoneN} are recorded, where N is the total number of phonemes in the wake-up word,

Score_{phone1}, Score_{phone2}, ..., Score_{phoneN} denote the decoding scores of the individual phonemes in the wake-up word, the subscript identifying each of the N phonemes;

B. using the same features as wake-up word detection, the corresponding acoustic scores Score_{frame1}, Score_{frame2}, ..., Score_{frameM} are obtained at frame level, where M is the total duration of the features, in frames;

C. the acoustic score of each phoneme of the wake-up word is calculated as follows:

{CM}_{phonei} = \frac{{Score}_{phonei} - \sum_{k = K_{istart}}^{K_{iend}} {Score}_{framek}}{K_{iend} - K_{istart}}

CM_{phonei} denotes the confirmation score of the i-th phoneme, the subscript phonei denoting the i-th phoneme; Score_{phonei} denotes the decoding score of the i-th phoneme; and Score_{framek} denotes the score of the k-th frame obtained by decoding on the wake-up word confirmation network;

{CM}_{word} = \frac{1}{N} \sum_{i = 1}^{N} {CM}_{phonei} .

2. The implementation method of a voice wake-up module according to claim 1, characterized in that: the training of the acoustic model (8) has two parts, the phoneme acoustic models and the garbage models (i.e. Garbage models); the phoneme acoustic models are trained with the acoustic model training method of conventional speech recognition, in which a database is chosen and the models are obtained under the maximum likelihood estimation criterion and the minimum phone error discriminative training criterion; the Garbage models absorb speech other than the wake-up word and use the same database as the phoneme models: the similarity between the phoneme models is computed, the phonemes are divided into 20 classes, all the training data corresponding to each class of phonemes are merged, and the corresponding Garbage model is trained under the maximum likelihood estimation criterion, giving 20 classes of Garbage models.

3. The implementation method of a voice wake-up module according to claim 1, characterized in that: the method can be ported to run on general-purpose ARM or DSP processors and is applied in the automotive and household appliance fields.

4. A vehicle-mounted voice wake-up system, characterized by comprising: a microprocessor, the voice wake-up module of claim 1, an audio conversion device, a recording device, an audio processing device and a public address system, wherein the voice wake-up module runs on the microprocessor; the specific implementation process is as follows:

Third step: the audio conversion device converts the voice information captured by the recording device and passes the converted data to the microprocessor, which performs the computation of the voice wake-up module, completing the voice data conversion operation;

Fourth step: the microprocessor is connected to the audio conversion device and performs the computation of the voice wake-up module on the voice information input by the audio conversion device; if the voice wake-up information is correctly recognized, it controls the audio processing device to play a voice prompt tone, completing the voice wake-up and prompt tone playback operation; if recognition fails, the voice collection operation of the second step continues.