CN106448663A

CN106448663A - Voice wakeup method and voice interaction device

Info

Publication number: CN106448663A
Application number: CN201610901706.4A
Authority: CN
Inventors: 杨香斌
Original assignee: Hisense Group Co Ltd
Current assignee: Hisense Group Co Ltd
Priority date: 2016-10-17
Filing date: 2016-10-17
Publication date: 2017-02-22
Anticipated expiration: 2036-10-17
Also published as: CN106448663B

Abstract

The present invention provides a voice wake-up method and a voice interaction device. The method includes the following steps: a voice input signal is received; a first similarity between the voice input signal and a preset wake-up voice signal is determined according to a first acoustic model, and whether the first similarity exceeds a first preset threshold is judged; if the first similarity exceeds the first preset threshold, a second similarity between the voice input signal and the preset wake-up voice signal is determined according to a second acoustic model, and whether the second similarity exceeds a second preset threshold is judged; and if the second similarity exceeds the second preset threshold, a voice interaction function is woken up, wherein the accuracy of the second acoustic model is higher than that of the first acoustic model. The voice wake-up method and the voice interaction device provided by the embodiments of the invention have the advantages of low power consumption and a low false wake-up rate.

Description

Voice wake-up method and voice interaction device

Technical field

The embodiments of the present invention relate to the technical field of speech recognition, and more particularly to a voice wake-up method and a voice interaction device.

Background art

With the rapid development of speech recognition technology, voice interaction is used in more and more scenarios. Smart TVs, in-vehicle systems, smart homes, and intelligent robots are all major application scenarios for voice interaction. At the same time, as human-computer interaction places ever higher demands on user experience, the distance of voice interaction is no longer limited to near-field speech (within 50 cm). With multi-microphone technology, far-field voice interaction at a distance of 3 to 5 meters can now be achieved.

However, far-field voice interaction also raises a problem: when to start capturing audio and begin recognition. Current technical solutions take two forms. In the first, a low-power chip continuously captures sound through the microphone array, performs the corresponding signal processing (signal enhancement, noise suppression, echo cancellation), and then performs speech recognition to judge whether the user has said the wake-up word; if so, it notifies the main module, which starts capturing audio and performing speech recognition. In the second, the front-end module performs only signal processing, while the main module captures audio continuously and performs speech recognition to judge whether the user has said the wake-up word. Both approaches have drawbacks. In the former, because the front-end processing module must be low-power, its recognition performance is relatively low and its false wake-up rate is correspondingly high. In the latter, the main chip module must run at full speed all the time, so power consumption is relatively high, and because the demands on the main chip module are higher, the cost of the solution is also higher. At present there is no solution that balances power consumption and false wake-up rate.

Summary of the invention

The embodiments of the present invention provide a voice wake-up method and a voice interaction device, to solve the problem that the prior art cannot balance power consumption and false wake-up rate.

A first aspect of the embodiments of the present invention provides a voice wake-up method, the method including:

receiving a voice input signal;

determining, according to a first acoustic model, a first similarity between the voice input signal and a preset wake-up voice signal, and judging whether the first similarity exceeds a first preset threshold;

if so, determining, according to a second acoustic model, a second similarity between the voice input signal and the preset wake-up voice signal, and judging whether the second similarity exceeds a second preset threshold, wherein the accuracy of the second acoustic model is higher than that of the first acoustic model; and

if so, waking up a voice interaction function.
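The steps of the first aspect can be sketched as a two-stage cascade. The following is a minimal illustration only, not the implementation of the invention: the function name `two_stage_wake`, the `Scorer` type, and the convention that each acoustic model is a callable returning a similarity score are all assumptions introduced here.

```python
from typing import Callable, Sequence

# Hypothetical convention: an acoustic model is any callable that maps a
# signal to a similarity score against the preset wake-up voice signal.
Scorer = Callable[[Sequence[float]], float]


def two_stage_wake(
    signal: Sequence[float],
    first_model: Scorer,
    second_model: Scorer,
    first_threshold: float,
    second_threshold: float,
) -> bool:
    """Return True if the voice interaction function should be woken up."""
    # Stage 1: cheap, lower-accuracy model.
    first_similarity = first_model(signal)
    if first_similarity <= first_threshold:
        # First threshold not exceeded: end the wake-up operation without
        # ever running the expensive model.
        return False
    # Stage 2: higher-accuracy model, run only after stage 1 passes.
    second_similarity = second_model(signal)
    return second_similarity > second_threshold
```

The point of the structure is that `second_model` is never called unless the first similarity exceeds the first threshold, which is where the power saving described in this document comes from.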

A second aspect of the embodiments of the present invention provides a voice interaction device, the device including:

a receiver module, configured to receive a voice input signal;

a first determining module, configured to determine, according to a first acoustic model, a first similarity between the voice input signal and a preset wake-up voice signal, and to judge whether the first similarity exceeds a first preset threshold;

a second determining module, configured to, when the first similarity exceeds the first preset threshold, determine, according to a second acoustic model, a second similarity between the voice input signal and the preset wake-up voice signal, and to judge whether the second similarity exceeds a second preset threshold, wherein the accuracy of the second acoustic model is higher than that of the first acoustic model; and

a wake module, configured to wake up a voice interaction function when the second similarity exceeds the second preset threshold.

In the embodiments of the present invention, preliminary voice wake-up recognition is first performed on the voice input signal by the lower-accuracy first acoustic model. When the similarity between the voice input signal and the preset wake-up voice signal is found to exceed the first preset threshold, a second voice wake-up recognition is performed on the voice input signal by the higher-accuracy second acoustic model, and whether to wake up the voice interaction function is determined according to the result of the second recognition. Because the first recognition uses the lower-accuracy acoustic model, its power consumption is low. Only when the first recognition is passed, that is, when the similarity between the voice input signal and the preset wake-up voice signal exceeds the first preset threshold, is the higher-accuracy second acoustic model enabled for the second wake-up recognition. Combining the lower-accuracy and higher-accuracy acoustic models in this way avoids the low wake-up recognition accuracy and high false wake-up rate of using the lower-accuracy model alone, and also avoids the high power consumption of using the higher-accuracy model alone, thereby achieving both low power consumption and a low false wake-up rate.

Brief description of the drawings

To describe the technical solutions of the embodiments of the present invention or of the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.

Fig. 1 is a schematic flowchart of a voice wake-up method provided by an embodiment of the present invention;

Fig. 2 is an architecture diagram of a voice interaction device provided by an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a voice interaction device provided by an embodiment of the present invention.

Detailed description

The technical solutions of the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

The terms "comprising" and "having" in the description and claims of the present invention, and any variants thereof, are intended to cover a non-exclusive inclusion. For example, a process comprising a series of steps, or a device comprising a series of structures, is not necessarily limited to the steps or structures expressly listed, but may include other steps or structures that are not expressly listed or that are inherent to such a process or device.

Fig. 1 is a schematic flowchart of a voice wake-up method provided by an embodiment of the present invention. The method may be executed by a voice interaction device having a voice interaction function, such as a smart TV, an in-vehicle system, a smart home device, or an intelligent robot. As shown in Fig. 1, the method provided by this embodiment includes the following steps:

Step S101: receive a voice input signal.

In practical applications, the voice interaction device may receive a voice signal input by a user or a terminal device through a microphone array arranged on the device and, after receiving the voice signal, use delay compensation to guarantee the integrity of the received voice signal, so that the wake-up judgment is not affected by a missing portion of the voice signal.

Further, after the complete voice signal is obtained, it is preprocessed to obtain what this embodiment calls the "voice input signal". Specifically, the preprocessing includes at least noise suppression, echo cancellation, and speech enhancement of the voice signal. These processes are similar to speech processing in the prior art and are not described again here.

Step S102: according to the first acoustic model, determine the first similarity between the voice input signal and the preset wake-up voice signal, and judge whether the first similarity exceeds the first preset threshold; if not, end this wake-up operation; if so, execute step S103.

The first preset threshold may be customized by the user according to actual requirements, or set by default by the terminal device; the embodiments of the present invention do not limit this.

Specifically, the voice wake-up method provided by this embodiment includes two judgment processes, of which the first may be executed by a DSP module. In the first judgment process, a feature signal is first extracted from the voice input signal obtained in step S101. For example, the feature signal may be obtained by extracting the Mel-frequency cepstral coefficients (MFCCs) of the voice input signal. This process is the same as in the prior art and is not described again here.

Further, in practical applications, a simple acoustic model may be built into the DSP module. This acoustic model decodes the feature signal obtained above, and a maximum likelihood ratio calculation is used to judge the similarity between the feature signal and the wake-up voice signal. The basic principle is to compare each feature point in the feature signal with the corresponding feature point of the preset wake-up voice signal in the acoustic model, and then combine all points into a single maximum likelihood value. The formula is:

[formula missing from the source text]

where x_i is the sample value of the i-th feature point in the feature signal, μ is the corresponding value in the model, and θ is the maximum likelihood value to be computed. The similarity between the current voice input signal and the preset wake-up voice signal is calculated from this maximum likelihood value. When the calculated similarity exceeds the first preset threshold, the second wake-up judgment is started; otherwise the wake-up operation ends. In this embodiment, the process by which the DSP module performs the first wake-up judgment on the voice input signal is similar to the prior art and is not described again here.
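Because the formula itself is not available in this text, the following is only one plausible way to turn per-point comparisons of x_i against μ into a single score. It assumes, purely for illustration, that each feature point is modelled by a Gaussian with mean μ_i and a shared standard deviation `sigma`; the function name `gaussian_similarity` and the `sigma` parameter are hypothetical and do not come from the source. The score is the geometric mean of the per-point likelihood ratios p(x_i | μ_i) / p(μ_i | μ_i), so it lies in (0, 1] and equals 1 for a perfect match.

```python
import math
from typing import Sequence


def gaussian_similarity(
    x: Sequence[float], mu: Sequence[float], sigma: float = 1.0
) -> float:
    """Static, point-by-point similarity in (0, 1] under an ASSUMED Gaussian model.

    x     -- feature points of the input signal
    mu    -- corresponding model values for the wake-up voice signal
    sigma -- assumed shared standard deviation (hypothetical parameter)
    """
    if len(x) != len(mu) or not x:
        raise ValueError("x and mu must be non-empty and of equal length")
    # Mean squared deviation between input points and model values.
    mean_sq = sum((xi - mi) ** 2 for xi, mi in zip(x, mu)) / len(x)
    # Geometric mean of per-point Gaussian likelihood ratios.
    return math.exp(-mean_sq / (2.0 * sigma ** 2))
```

A score of this kind can be compared directly with a first preset threshold such as 0.7. Note that each point is scored independently of its neighbours, which is what makes the first judgment a static calculation.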

It should be noted that, because the first wake-up judgment uses a relatively simple acoustic model, the demands on the DSP module are low and the power consumption of the DSP module is low.

Of course, the above is only an example and not the sole limitation of the present invention. For example, in practical applications the similarity of two speech segments may also be calculated using a grouped-window DTW (dynamic time warping) method; its biggest problem, however, is that differences in speaking style can seriously affect the speech recognition rate.
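For reference, the DTW alternative mentioned above can be sketched as follows. The source does not say what kind of window is meant, so this sketch assumes a simple band constraint (a Sakoe-Chiba style window) on one-dimensional feature sequences; the function name `dtw_distance` and the `window` parameter are illustrative assumptions. A smaller distance means the two segments are more alike.

```python
import math
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float], window: int) -> float:
    """Dynamic time warping distance between two 1-D sequences.

    window limits how far the alignment may stray from the diagonal
    (an ASSUMED band constraint; the source does not define its window).
    """
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("sequences must be non-empty")
    # The band must be at least as wide as the length difference,
    # otherwise no alignment path can reach the final cell.
    w = max(window, abs(n - m))
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, match.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

DTW absorbs differences in speaking rate by stretching the time axis, but it compares raw feature values, which is consistent with the remark above that differences in speaking style degrade the result.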

Step S103: according to the second acoustic model, determine the second similarity between the voice input signal and the preset wake-up voice signal, and judge whether the second similarity exceeds the second preset threshold; if so, wake up the voice interaction function; otherwise, do not wake up. The accuracy of the second acoustic model is higher than that of the first acoustic model.

In this embodiment, the second wake-up judgment may be executed by a master chip processing module. After the first wake-up judgment, if the similarity between the voice input signal and the preset wake-up voice signal exceeds the first preset threshold, the master chip processing module is activated. The master chip processing module then obtains the above feature signal from the DSP module and, according to its built-in higher-accuracy acoustic model (that is, the second acoustic model) and the obtained feature signal, determines the second similarity between the voice input signal and the preset wake-up voice signal. After the second similarity is obtained, it is compared with the second preset threshold; when the second similarity exceeds the second preset threshold, the voice interaction function is woken up; otherwise it is not.

It should be noted that, as long as the DSP module has not determined that the similarity between the voice input signal and the preset wake-up voice signal exceeds the first preset threshold, the master chip processing module is in an inactive state, that is, in a low-power working state or a sleep state. When the DSP module determines that the similarity exceeds the first preset threshold, the DSP module sends the feature signal corresponding to the voice signal to the master chip processing module, thereby activating it.
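The activation behaviour described above can be modelled with a small sketch. The class names `MasterChip` and `DspFrontEnd`, the `ChipState` enumeration, and the method names are hypothetical, introduced only to illustrate the hand-off; real hardware would use interrupts and a data interface rather than method calls.

```python
from enum import Enum
from typing import Callable, List, Optional, Sequence


class ChipState(Enum):
    SLEEP = "sleep"    # inactive: low-power or sleep state
    ACTIVE = "active"  # triggered by the DSP module


class MasterChip:
    """Stand-in for the master chip processing module (hypothetical API)."""

    def __init__(self) -> None:
        self.state = ChipState.SLEEP
        self.features: Optional[List[float]] = None

    def activate(self, features: Sequence[float]) -> None:
        # Receiving the feature signal is what activates the module.
        self.features = list(features)
        self.state = ChipState.ACTIVE


class DspFrontEnd:
    """Stand-in for the DSP module running the first wake-up judgment."""

    def __init__(
        self,
        master: MasterChip,
        score: Callable[[Sequence[float]], float],
        first_threshold: float,
    ) -> None:
        self.master = master
        self.score = score
        self.first_threshold = first_threshold

    def on_features(self, features: Sequence[float]) -> bool:
        """Return True if the master chip was triggered for this input."""
        if self.score(features) > self.first_threshold:
            self.master.activate(features)
            return True
        return False  # master chip stays in its current (inactive) state
```

Passing the already extracted features to `activate` mirrors the description above: the master chip reuses the DSP module's feature signal instead of recomputing it.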

Specifically, in this embodiment the method of the second wake-up judgment differs from that of the first. The difference is as follows: the second wake-up judgment uses a complex similarity decoding algorithm, such as Viterbi, a dynamic programming algorithm that can account for the relationship between preceding and following states of the voice signal content, whereas the first wake-up judgment is a static similarity calculation that only computes the maximum likelihood value of each sampling point. The two acoustic models also differ: the one in the DSP module is a very simple acoustic model that is easy to compute, while the one in the master chip processing module is a more complex acoustic model with higher precision.

As an example, assume the wake-up word in the wake-up voice is "Vidaa, Vidaa". The calculation in the DSP module can be thought of as decomposing this speech segment into 256 sampling points and then using the maximum likelihood algorithm to compare, across these 256 points, the match probability between the values in the acoustic model and the captured voice input signal. This is a static calculation method. For instance, as long as this probability reaches 70%, the user is considered to have possibly said "Vidaa Vidaa".

The second wake-up judgment then starts. The master chip processing module feeds the voice input signal and the wake-up voice signal into a trained high-accuracy, highly robust HMM acoustic model and uses the Viterbi algorithm to calculate the similarity between the voice input signal and the wake-up voice signal. This is a dynamic programming algorithm that computes, for each point in the voice signal, the transition probability to the preceding and following pronunciation units. When a person speaks, the pronunciation of each word is continuous, which is determined by the vocal cords; the pronunciation characteristics of each pinyin syllable or phoneme therefore determine the transition probabilities between adjacent points. This part of the computation is relatively expensive and highly accurate. Therefore, if the similarity calculated by Viterbi exceeds the second preset threshold (for example, 90%), the user is considered to have really said "Vidaa Vidaa". Of course, the above is only an example and not the sole limitation of the present invention.
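The Viterbi decoding referred to above can be sketched for a small discrete HMM. This is a generic textbook formulation in the log domain, not the trained acoustic model of the embodiment: the function name `viterbi`, the dictionary-based model layout, and the use of discrete observation symbols are assumptions made for illustration. How the resulting path score is mapped to a percentage such as 90% is not specified in the source and is left out.

```python
import math
from typing import Dict, List, Sequence, Tuple

LogTable = Dict[str, Dict[str, float]]


def viterbi(
    obs: Sequence[str],
    states: Sequence[str],
    log_start: Dict[str, float],
    log_trans: LogTable,
    log_emit: LogTable,
) -> Tuple[float, List[str]]:
    """Return (log-probability, state path) of the most likely state sequence."""
    if not obs:
        raise ValueError("obs must be non-empty")
    # best[s]: log-probability of the best path that ends in state s.
    best = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    back: List[Dict[str, str]] = []
    for o in obs[1:]:
        prev, best, pointers = best, {}, {}
        for s in states:
            # Choose the predecessor that maximises path score plus the
            # transition into s; this is the dynamic programming step.
            score, arg = max((prev[r] + log_trans[r][s], r) for r in states)
            best[s] = score + log_emit[s][o]
            pointers[s] = arg
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return best[last], path
```

Unlike a point-by-point score, every step here depends on the transition from the preceding state, which is why the second judgment is both more accurate and more expensive than the first.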

It should be noted that, in this embodiment, the purpose of the second wake-up recognition is to recognize the voice input signal more accurately and avoid false wake-ups. Therefore, in practical applications, the second preset threshold should be set greater than or equal to the first preset threshold.

In this embodiment, preliminary voice wake-up recognition is first performed on the voice input signal by the lower-accuracy first acoustic model. When the similarity between the voice input signal and the preset wake-up voice signal is found to exceed the first preset threshold, a second voice wake-up recognition is performed on the voice input signal by the higher-accuracy second acoustic model, and whether to wake up the voice interaction function is determined according to the result of the second recognition. Because the first recognition uses the lower-accuracy acoustic model, its power consumption is low. Only when the first recognition is passed, that is, when the similarity between the voice input signal and the preset wake-up voice signal exceeds the first preset threshold, is the higher-accuracy second acoustic model enabled for the second wake-up recognition. Combining the lower-accuracy and higher-accuracy acoustic models in this way avoids the low wake-up recognition accuracy and high false wake-up rate of using the lower-accuracy model alone, and also avoids the high power consumption of using the higher-accuracy model alone, thereby achieving both low power consumption and a low false wake-up rate.

Fig. 2 is an architecture diagram of a voice interaction device provided by an embodiment of the present invention. As shown in Fig. 2, the voice interaction device includes a DSP module and a master chip processing module. A relatively simple acoustic model (that is, the lower-accuracy acoustic model) is built into the DSP module, and an acoustic model with higher accuracy and robustness is built into the master chip processing module. When the master chip processing module has not been triggered by the DSP module, it is in a low-power working state or a sleep state; preferably, it is in a sleep state, which minimizes the power consumption of the master chip.

In practical applications, after the microphone array receives a voice input signal, the DSP module uses endpoint detection (voice activity detection, VAD for short) to determine whether a voice signal is being input, for example using existing algorithms based on short-time energy and short-time zero-crossing rate. The application of these algorithms in this embodiment is the same as in the prior art and is not described again here. After endpoint detection is complete, delay compensation is needed to guarantee that the voice input signal is complete. Before signal processing is performed on the voice input signal, the segment must be saved in full so that it can be sent to a cloud server for recognition. Signal processing includes at least noise suppression, echo cancellation, and speech enhancement. In practical applications, noise suppression may be performed on the basis of a combination of multiple filters. Echo cancellation and speech enhancement are performed as in the prior art and are not described again here.
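Endpoint detection based on short-time energy and short-time zero-crossing rate can be sketched as below. This is a simplified illustration of the general technique, not the specific algorithm of the embodiment: the function names, the non-overlapping framing, and the three thresholds `energy_high`, `energy_low`, and `zcr_min` are assumptions chosen for clarity.

```python
from typing import List, Sequence


def short_time_energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)


def zero_crossing_rate(frame: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def detect_speech_frames(
    samples: Sequence[float],
    frame_len: int,
    energy_high: float,
    energy_low: float,
    zcr_min: float,
) -> List[bool]:
    """Flag each non-overlapping frame as speech (True) or non-speech (False)."""
    if frame_len < 2:
        raise ValueError("frame_len must be at least 2")
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = short_time_energy(frame)
        zcr = zero_crossing_rate(frame)
        # Voiced speech: high energy. Unvoiced consonants: moderate energy
        # but many zero crossings.
        loud = energy > energy_high
        noisy_consonant = energy > energy_low and zcr > zcr_min
        flags.append(loud or noisy_consonant)
    return flags
```

Both measures need only additions, multiplications, and comparisons per sample, which fits the low-power role that this document assigns to the DSP module.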

Further, after the above signal processing is complete, a feature signal is first extracted from the voice input signal. The extracted feature signal is then decoded according to the simple acoustic model in the DSP module, and the similarity between the feature signal and the preset wake-up voice signal is calculated. When the calculated similarity exceeds the first preset threshold, the master chip processing module is triggered to perform a further wake-up judgment; otherwise this wake-up operation is exited. It should be noted that, because the DSP module only makes a preliminary wake-up judgment using a simple acoustic model, it only needs to operate in a low-power working environment.

Further, when the master chip processing module is triggered, it can obtain, through the data interface between it and the DSP module, the feature signal that the DSP module obtained during the first wake-up judgment, and perform the second wake-up recognition on the voice input signal according to its built-in higher-accuracy acoustic model and that feature signal. The method by which the master chip processing module performs the second wake-up recognition is the same as the second wake-up recognition method in the embodiment shown in Fig. 1 and is not described again here.

The architecture shown in Fig. 2 uses the fast, low-power front-end DSP module to perform preliminary wake-up recognition on the voice input signal, and also uses the DSP module's computing resources to perform feature extraction once, which saves computing resources for the second wake-up recognition in the master chip processing module. Until it receives the trigger signal from the DSP module, the master chip processing module runs in a low-power mode throughout; once triggered, it uses its own large storage and computing resources, together with the feature signal sent by the DSP module, to perform wake-up recognition on the voice input signal quickly and efficiently. The whole architecture can therefore balance low power consumption and high accuracy.

Fig. 3 is a schematic structural diagram of a voice interaction device provided by an embodiment of the present invention. As shown in Fig. 3, the device provided by this embodiment includes:

a receiver module 11, configured to receive a voice input signal;

a first determining module 12, configured to determine, according to a first acoustic model, a first similarity between the voice input signal and a preset wake-up voice signal, and to judge whether the first similarity exceeds a first preset threshold;

a second determining module 13, configured to, when the first similarity exceeds the first preset threshold, determine, according to a second acoustic model, a second similarity between the voice input signal and the preset wake-up voice signal, and to judge whether the second similarity exceeds a second preset threshold, wherein the accuracy of the second acoustic model is higher than that of the first acoustic model; and

a wake module 14, configured to wake up a voice interaction function when the second similarity exceeds the second preset threshold.

The second preset threshold is greater than or equal to the first preset threshold.

The first determining module 12 includes:

an acquisition submodule 121, configured to extract a feature signal from the voice input signal; and

a first determining submodule 122, configured to determine, according to the first acoustic model and the feature signal, a first maximum likelihood value between the feature signal and the preset wake-up voice signal; and

to determine, according to the first maximum likelihood value, the first similarity between the voice input signal and the preset wake-up voice signal.

The second determining module 13 includes:

a second determining submodule 131, configured to:

determine, according to the second acoustic model, a first transition probability between a pronunciation unit in the feature signal and the pronunciation unit before or after it, and a second transition probability between the corresponding pronunciation unit in the wake-up voice signal and the pronunciation unit before or after it; and

determine, according to the first transition probability and the second transition probability, the second similarity between the feature signal and the wake-up voice signal.

The voice interaction device provided by this embodiment can be used to execute the method shown in Fig. 1. Its specific manner of execution and beneficial effects are similar to those of the embodiment shown in Fig. 1 and are not described again here.

Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A voice wake-up method, characterized by comprising:

receiving a voice input signal;

determining, according to a first acoustic model, a first similarity between the voice input signal and a preset wake-up voice signal, and judging whether the first similarity exceeds a first preset threshold;

if so, determining, according to a second acoustic model, a second similarity between the voice input signal and the preset wake-up voice signal, and judging whether the second similarity exceeds a second preset threshold, wherein the accuracy of the second acoustic model is higher than that of the first acoustic model; and

if so, waking up a voice interaction function.

2. The method according to claim 1, characterized in that the second preset threshold is greater than the first preset threshold.

3. The method according to claim 2, characterized in that determining, according to the first acoustic model, the first similarity between the voice input signal and the preset wake-up voice signal comprises:

extracting a feature signal from the voice input signal;

determining, according to the first acoustic model and the feature signal, a first maximum likelihood value between the feature signal and the preset wake-up voice signal; and

determining, according to the first maximum likelihood value, the first similarity between the voice input signal and the preset wake-up voice signal.

4. The method according to claim 3, characterized in that, when the first similarity exceeds the first preset threshold, determining, according to the second acoustic model, the second similarity between the voice input signal and the preset wake-up voice signal comprises:

determining, according to the second acoustic model, a first transition probability between a pronunciation unit in the feature signal and the pronunciation unit before or after it, and a second transition probability between the corresponding pronunciation unit in the wake-up voice signal and the pronunciation unit before or after it; and

determining, according to the first transition probability and the second transition probability, the second similarity between the feature signal and the wake-up voice signal.

5. The method according to any one of claims 1 to 4, characterized in that the first acoustic model is arranged in a DSP module and the second acoustic model is arranged in a master chip processing module.

6. A voice interaction device, characterized by comprising:

a receiver module, configured to receive a voice input signal;

a first determining module, configured to determine, according to a first acoustic model, a first similarity between the voice input signal and a preset wake-up voice signal, and to judge whether the first similarity exceeds a first preset threshold;

a second determining module, configured to, when the first similarity exceeds the first preset threshold, determine, according to a second acoustic model, a second similarity between the voice input signal and the preset wake-up voice signal, and to judge whether the second similarity exceeds a second preset threshold, wherein the accuracy of the second acoustic model is higher than that of the first acoustic model;

7. The device according to claim 6, characterized in that the second preset threshold is greater than the first preset threshold.

8. The device according to claim 7, characterized in that the first determining module comprises:

an acquisition submodule, configured to extract a feature signal from the voice input signal;

a first determining submodule, configured to determine, according to the first acoustic model and the feature signal, a first maximum likelihood value between the feature signal and the preset wake-up voice signal;

9. The device according to claim 8, characterized in that the second determining module comprises:

a second determining submodule, configured to determine, according to the second acoustic model, a first transition probability between a pronunciation unit in the feature signal and the pronunciation unit before and/or after it, and a second transition probability between the corresponding pronunciation unit in the wake-up voice signal and the pronunciation unit before and/or after it;

10. The device according to any one of claims 6 to 9, characterized in that the first acoustic model is arranged in a DSP module and the second acoustic model is arranged in a master chip processing module.