CN107622770A

CN107622770A - Voice wake-up method and device

Info

Publication number: CN107622770A
Application number: CN201710922732.XA
Authority: CN
Inventors: 孙杨; 谢波
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Baidu Online Network Technology Beijing Co Ltd; Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2017-09-30
Filing date: 2017-09-30
Publication date: 2018-01-23
Anticipated expiration: 2037-09-30
Also published as: CN107622770B

Abstract

The present invention proposes a voice wake-up method and device. When the local first acoustic model determines that the similarity between the detected wake-up voice and the preset wake-up word signal is neither high nor low, the wake-up voice is recognized again by a second acoustic model on a cloud server. This avoids, as far as possible, the terminal device being falsely woken up or failing to wake up when it should, and improves the user experience. In addition, when the first acoustic model determines that the similarity between the wake-up voice and the preset wake-up word signal is relatively high or relatively low, the terminal device itself decides whether to perform the wake-up operation, without sending the wake-up voice to the cloud server for recognition, which improves the efficiency with which the terminal device performs the wake-up operation.

Description

Voice wake-up method and device

Technical field

The present invention relates to the technical field of intelligent human-machine interaction, and in particular to a voice wake-up method and device.

Background technology

Artificial intelligence (AI) is a new technical science that studies and develops theories, methods, technologies and application systems for simulating, extending and expanding human intelligence. Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a manner similar to human intelligence. Research in this field includes robotics, speech recognition, image recognition, natural language processing and expert systems.

With the development of speech recognition technology, more and more intelligent terminals are equipped with a voice wake-up function. The user speaks a segment of voice to the intelligent terminal, and the intelligent terminal uses a built-in algorithm to judge whether the input voice contains the wake-up word; if it does, the intelligent terminal is switched from the sleep state to the wake-up state.

However, the user may be in different scenes. For example, when the user is attending a concert, the scene is relatively noisy, the voice received by the intelligent terminal contains more noise, and the intelligent terminal may be falsely woken up, which affects the user experience.

Summary of the invention

The present invention aims to solve, at least to some extent, one of the technical problems in the related art.

To this end, a first object of the present invention is to propose a voice wake-up method. When the local first acoustic model determines that the similarity between the detected wake-up voice and the preset wake-up word signal is neither high nor low, the method allows the wake-up voice to be recognized again by the second acoustic model of a cloud server, which avoids, as far as possible, the terminal device being falsely woken up or failing to wake up when it should, and improves the user experience.

A second object of the present invention is to propose a voice wake-up device.

A third object of the present invention is to propose a computer device.

A fourth object of the present invention is to propose a computer program product.

A fifth object of the present invention is to propose a non-transitory computer-readable storage medium.

To achieve the above objects, an embodiment of the first aspect of the present invention proposes a voice wake-up method, including:

Detecting a wake-up voice input to a terminal device and the current scene in which the terminal device is located;

Obtaining a first threshold and a second threshold according to the current scene and a correspondence between scenes and thresholds, wherein the first threshold is greater than the second threshold;

Analyzing the acoustic features of the wake-up voice according to a first acoustic model to obtain a first similarity between the wake-up voice and a preset wake-up word signal;

Judging whether the first similarity is greater than the second threshold and less than the first threshold;

If the judgment result is yes, sending the wake-up voice to a cloud server, so that the cloud server judges a second similarity between the wake-up voice and the preset wake-up word signal according to a second acoustic model and, if the second similarity is greater than the first threshold, generates a wake-up instruction for waking up the terminal device; wherein the recognition accuracy of the second acoustic model is greater than the recognition accuracy of the first acoustic model;

Receiving the wake-up instruction and performing the operation of waking up the terminal device.

In the method as described above, generating the wake-up instruction for waking up the terminal device if the second similarity is greater than the first threshold includes:

Analyzing the acoustic features of the wake-up voice according to the second acoustic model to obtain a pronunciation sequence corresponding to the wake-up voice;

Analyzing the pronunciation sequence corresponding to the wake-up voice according to a language model to obtain a text sequence corresponding to the wake-up voice;

Matching the text sequence corresponding to the wake-up voice against the text sequence corresponding to the preset wake-up word signal;

If the match succeeds, generating the wake-up instruction for waking up the terminal device.

In the method as described above, analyzing the acoustic features of the wake-up voice according to the first acoustic model to obtain the first similarity between the wake-up voice and the preset wake-up word signal includes:

Determining, according to the acoustic features of the wake-up voice and the first acoustic model, the feature similarities between the acoustic features of the wake-up voice and the acoustic features of the preset wake-up word signal;

Determining the first similarity between the wake-up voice and the preset wake-up word signal according to the feature similarities.

In the method as described above, detecting the current scene in which the terminal device is located includes:

Detecting the current location of the terminal device, and determining the current scene in which the terminal device is located according to the current location;

Or, detecting the scene voice of the terminal device, performing corpus analysis on the scene voice to obtain the corpus set of the scene voice, determining the scene corresponding to the corpus set, and taking the scene corresponding to the corpus set as the current scene in which the terminal device is located.

The method as described above further includes:

If the first similarity is greater than the first threshold, performing the operation of waking up the terminal device;

Or, if the first similarity is less than the second threshold, not performing the operation of waking up the terminal device.

To achieve the above objects, an embodiment of the second aspect of the present invention proposes a voice wake-up device, including:

A first detection module, configured to detect a wake-up voice input to a terminal device;

A second detection module, configured to detect the current scene in which the terminal device is located;

A threshold module, configured to obtain a first threshold and a second threshold according to the current scene and a correspondence between scenes and thresholds, wherein the first threshold is greater than the second threshold;

An analysis module, configured to analyze the acoustic features of the wake-up voice according to a first acoustic model to obtain a first similarity between the wake-up voice and a preset wake-up word signal;

A judgment module, configured to judge whether the first similarity is greater than the second threshold and less than the first threshold and, if the judgment result is yes, to trigger a sending module;

The sending module, configured to send the wake-up voice to a cloud server, so that the cloud server judges a second similarity between the wake-up voice and the preset wake-up word signal according to a second acoustic model and, if the second similarity is greater than the first threshold, generates a wake-up instruction for waking up the terminal device; wherein the recognition accuracy of the second acoustic model is greater than the recognition accuracy of the first acoustic model;

A first execution module, configured to receive the wake-up instruction and perform the operation of waking up the terminal device.

In the device as described above, the cloud server includes a wake-up instruction generation module;

The wake-up instruction generation module is specifically configured to:

In the device as described above, the analysis module is specifically configured to:

In the device as described above, the second detection module is specifically configured to:

Or, the second detection module is specifically configured to: detect the scene voice of the terminal device, perform corpus analysis on the scene voice to obtain the corpus set of the scene voice, determine the scene corresponding to the corpus set, and take the scene corresponding to the corpus set as the current scene in which the terminal device is located.

The device as described above further includes a second execution module and a third execution module;

If the judgment result of the judgment module is that the first similarity is greater than the first threshold, the second execution module is triggered; wherein the second execution module is configured to perform the operation of waking up the terminal device;

Or, if the judgment result of the judgment module is that the first similarity is less than the second threshold, the third execution module is triggered; wherein the third execution module is configured not to perform the operation of waking up the terminal device.

To achieve the above objects, an embodiment of the third aspect of the present invention proposes a computer device, including a memory and a processor, wherein the processor reads executable program code stored in the memory and runs a program corresponding to the executable program code, so as to implement the voice wake-up method described in the embodiment of the first aspect of the present invention.

To achieve the above objects, an embodiment of the fourth aspect of the present invention proposes a computer program product; when instructions in the computer program product are executed by a processor, the voice wake-up method described in the embodiment of the first aspect is performed.

To achieve the above objects, an embodiment of the fifth aspect of the present invention proposes a non-transitory computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the voice wake-up method described in the embodiment of the first aspect is implemented.

Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from the following description or be learned through practice of the present invention.

Brief description of the drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:

Fig. 1 is a schematic flowchart of the voice wake-up method proposed by one embodiment of the present invention;

Fig. 2 is a schematic flowchart of the voice wake-up method proposed by another embodiment of the present invention;

Fig. 3 is a schematic structural diagram of the voice wake-up device proposed by one embodiment of the present invention;

Fig. 4 is a schematic structural diagram of the voice wake-up device proposed by another embodiment of the present invention;

Fig. 5 is a block diagram of an exemplary computer device suitable for implementing embodiments of the present invention.

Detailed description of the embodiments

Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the present invention and should not be construed as limiting it.

The voice wake-up method and device of the embodiments of the present invention are described below with reference to the accompanying drawings.

Fig. 1 is a schematic flowchart of the voice wake-up method proposed by one embodiment of the present invention. The method is executed by a voice wake-up device, which may be implemented in hardware and/or software and may also be integrated into a terminal device.

As shown in Fig. 1, the voice wake-up method proposed by this embodiment includes the following steps:

S101, detect the wake-up voice input to the terminal device and the current scene in which the terminal device is located.

For example, when a user speaks a segment of voice to the terminal device, such as "Xiaodu Xiaodu", the segment includes the wake-up word "Xiaodu", which is either set by the user or provided by default, so the voice spoken by the user is a wake-up voice. The terminal device may receive the wake-up voice input by the user through a configured voice detection device such as a receiver.

Specifically, the user may be in different scenes. For example, when the user is attending a concert, the scene is relatively noisy, the voice received by the intelligent terminal contains more noise, and the intelligent terminal may be falsely woken up, which affects the user experience. It is therefore necessary to detect the current scene in which the terminal device is located and to wake up the terminal device adaptively according to the scene, so as to avoid, as far as possible, false wake-ups or failures to wake up when the device should wake. It should be noted that the detected actual scene may be subdivided, for example into quiet scenes and noisy scenes. The probability that the terminal device is falsely woken up in a quiet scene is lower than the probability that it is falsely woken up in a noisy scene.

In one possible implementation, the current scene in which the terminal device is located is detected as follows: the current location of the terminal device is detected, and the current scene in which the terminal device is located is determined according to the current location. For example, the terminal device is configured with a positioning module such as GPS (Global Positioning System). If the positioning module detects that the current location of the terminal device is a KTV (Karaoke Television) entertainment venue, the current scene in which the terminal device is located is determined to be a noisy scene. As another example, if the positioning module detects that the current location of the terminal device is a library, the current scene in which the terminal device is located is determined to be a quiet scene.

In another possible implementation, the current scene in which the terminal device is located is detected as follows: the scene voice of the terminal device is detected, corpus analysis is performed on the scene voice to obtain the corpus set of the scene voice, the scene corresponding to the corpus set is determined, and the scene corresponding to the corpus set is taken as the current scene in which the terminal device is located.

For example, the scene voice can be understood as the detected sound of the environment surrounding the terminal device. The scene voice may be detected before the wake-up voice is detected, after the wake-up voice is detected, or at the same time as the wake-up voice; no specific limitation is imposed here.

For example, the scene voice detected in a library contains specific corpus items such as borrowing books and returning books; the scene voice detected in a KTV entertainment venue contains specific corpus items such as singer names, song titles and requests for one more song. In this embodiment, corpus analysis is performed on the detected scene voice from multiple angles such as semantics, speech and context to obtain all corpus items of the scene voice, and all the corpus items form the corpus set. Optionally, the terminal device is configured with a scene model capable of deep learning on the corpus corresponding to different scenes; by inputting the corpus set into the scene model for deep learning, the scene corresponding to the corpus set can be obtained. In this embodiment, the scene corresponding to the corpus set is taken as the current scene in which the terminal device is located. Optionally, the scene corresponding to the corpus set is subdivided into quiet scenes and noisy scenes, so that the current scene in which the terminal device is located can be determined to be a quiet scene or a noisy scene.
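As a non-limiting illustration, the mapping from a corpus set to a scene might be sketched as follows. The embodiment describes a scene model trained by deep learning; the keyword-overlap scoring below is a much simpler stand-in used only to show the data flow, and the table `SCENE_KEYWORDS`, its entries and the function name `classify_scene` are hypothetical.

```python
# Hypothetical keyword sets per scene; a real system would use a trained scene model.
SCENE_KEYWORDS = {
    "quiet": {"borrow", "return", "book", "library"},
    "noisy": {"singer", "song", "encore", "microphone"},
}


def classify_scene(corpus_set, default="quiet"):
    """Return the scene whose keyword set overlaps most with the corpus set.

    Falls back to `default` when no keyword of any scene is present.
    """
    corpus = set(corpus_set)
    best_scene, best_overlap = default, 0
    for scene, keywords in SCENE_KEYWORDS.items():
        overlap = len(keywords & corpus)
        if overlap > best_overlap:
            best_scene, best_overlap = scene, overlap
    return best_scene
```

Choosing a default scene for an empty or unrecognized corpus set is a design decision of this sketch, not something the embodiment specifies.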

It should be pointed out that the ways of detecting the current scene in which the terminal device is located are not limited to the examples given above.

S102, obtain a first threshold and a second threshold according to the current scene and the correspondence between scenes and thresholds, wherein the first threshold is greater than the second threshold.

Specifically, the first threshold and the second threshold may be set by the user, or may be set by the manufacturer before the terminal device leaves the factory; no specific limitation is imposed here. In this embodiment, different first thresholds and second thresholds are set for different scenes. For example, the first threshold corresponding to a noisy scene is higher than the first threshold corresponding to a quiet scene, and the second threshold corresponding to a noisy scene is higher than the second threshold corresponding to a quiet scene. The first threshold and the second threshold are thus adjusted adaptively according to the scene, which avoids, as far as possible, the false wake-ups or failures to wake up that a fixed first threshold or second threshold would cause, and improves the user's experience of the terminal device. More specifically, the correspondence between scenes and thresholds is configured in advance, and the first threshold and the second threshold can be obtained accurately according to the current scene and this correspondence.
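As a non-limiting illustration, the preconfigured correspondence between scenes and thresholds can be sketched as a lookup table. The scene names, the numeric values and the names `SCENE_THRESHOLDS` and `get_thresholds` are assumptions made for the example; the embodiment does not specify concrete values.

```python
# Hypothetical scene-to-threshold table: (first_threshold, second_threshold).
# The numeric values are illustrative only; the embodiment does not specify them.
SCENE_THRESHOLDS = {
    "quiet": (0.80, 0.40),
    "noisy": (0.90, 0.55),
}


def get_thresholds(scene):
    """Look up (first_threshold, second_threshold) for the current scene."""
    first, second = SCENE_THRESHOLDS[scene]
    # The method requires the first threshold to exceed the second threshold.
    if first <= second:
        raise ValueError("first threshold must be greater than second threshold")
    return first, second
```

In this sketch both thresholds for the noisy scene are higher than those for the quiet scene, matching the relationship described above.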

For example, the similarity between the wake-up voice and the preset wake-up word signal serves as the basis for setting the first threshold and the second threshold. Specifically, if the similarity between the wake-up voice and the preset wake-up word signal is higher than the first threshold, the wake-up voice can be considered to match the preset wake-up word signal; if the similarity is lower than the second threshold, the wake-up voice can be considered not to match the preset wake-up word signal; if the similarity lies between the first threshold and the second threshold, the degree of match between the wake-up voice and the preset wake-up word signal is neither high nor low, and in this case it is necessary to further confirm whether the wake-up voice matches a preset wake-up word signal such as "Xiaodu Xiaodu".

S103, analyze the acoustic features of the wake-up voice according to the first acoustic model to obtain the first similarity between the wake-up voice and the preset wake-up word signal.

Specifically, the acoustic model is one of the most important parts of a speech recognition system. Analysis by an acoustic model can yield the pronunciation sequence corresponding to the input voice, and can also yield the similarity between the input voice and a preset voice. For details of acoustic models, refer to the prior art; they are not repeated here.

In this embodiment, voice endpoint detection may be used to separate the silent part of the detected wake-up voice from the actual wake-up voice part. Acoustic features are then extracted from the actual wake-up voice part, and the extracted acoustic features are input to the first acoustic model for analysis to obtain the first similarity between the wake-up voice and the preset wake-up word signal. Optionally, the first acoustic model is built on a hidden Markov model (HMM).
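As a non-limiting illustration, separating the silent part from the actual wake-up voice part can be sketched with a simple frame-energy rule. The embodiment does not state which endpoint detection technique is used; the energy threshold approach, the function name `trim_silence` and the default values of `frame_len` and `energy_threshold` are assumptions made for the example.

```python
def trim_silence(samples, frame_len=160, energy_threshold=0.01):
    """Return the samples between the first and last active frame.

    A frame is treated as active when its mean squared amplitude exceeds
    energy_threshold. Returns an empty list if no frame is active.
    """
    # Split the signal into consecutive, non-overlapping frames.
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    active = [
        idx for idx, frame in enumerate(frames)
        if frame and sum(x * x for x in frame) / len(frame) > energy_threshold
    ]
    if not active:
        return []
    start = active[0] * frame_len
    end = min((active[-1] + 1) * frame_len, len(samples))
    return samples[start:end]
```

The trimmed samples would then be passed to acoustic feature extraction; that step is outside the scope of this sketch.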

In one possible implementation, step S103 is implemented as follows: the feature similarities between the acoustic features of the wake-up voice and the acoustic features of the preset wake-up word signal are determined according to the acoustic features of the wake-up voice and the first acoustic model; the first similarity between the wake-up voice and the preset wake-up word signal is then determined according to the feature similarities.

For example, the wake-up voice has multiple different acoustic features and, correspondingly, the preset wake-up word signal has multiple different acoustic features. The first acoustic model may first analyze the feature similarity between each acoustic feature of the wake-up voice and the corresponding acoustic feature of the preset wake-up word signal, and then perform statistical analysis on the obtained feature similarities. For example, the maximum likelihood principle may be used to perform statistical analysis on the obtained feature similarities to obtain the maximum likelihood value between the acoustic features of the wake-up voice and the acoustic features of the preset wake-up word signal, and the obtained maximum likelihood value is taken as the first similarity between the wake-up voice and the preset wake-up word signal.
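As a non-limiting illustration, the two-stage structure (one similarity per acoustic feature, then a single aggregate value) can be sketched as follows. This is not the maximum likelihood analysis described above: cosine similarity and an arithmetic mean are substituted purely to show the shape of the computation, and the function names `cosine_similarity` and `first_similarity` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def first_similarity(voice_features, wake_word_features):
    """Aggregate per-feature similarities into one score.

    Each argument is a list of feature vectors; vector i of the wake-up voice
    is compared with vector i of the preset wake-up word signal.
    """
    if not voice_features or len(voice_features) != len(wake_word_features):
        raise ValueError("feature lists must be non-empty and of equal length")
    sims = [cosine_similarity(v, w) for v, w in zip(voice_features, wake_word_features)]
    # Stand-in aggregation (mean); the embodiment describes a maximum likelihood statistic.
    return sum(sims) / len(sims)
```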

S104, judge whether the first similarity is greater than the second threshold and less than the first threshold.

Specifically, when the first similarity is greater than the second threshold and less than the first threshold, the similarity between the detected wake-up voice and the preset wake-up word signal is neither high nor low, and in this case it is necessary to further confirm whether the wake-up voice matches a preset wake-up word signal such as "Xiaodu Xiaodu".

S105, if the judgment result is yes, send the wake-up voice to the cloud server, so that the cloud server judges the second similarity between the wake-up voice and the preset wake-up word signal according to the second acoustic model and, if the second similarity is greater than the first threshold, generates a wake-up instruction for waking up the terminal device; wherein the recognition accuracy of the second acoustic model is greater than the recognition accuracy of the first acoustic model.

In this embodiment, the first acoustic model is configured locally, that is, in the terminal device, whereas the second acoustic model is configured on the cloud server. The cloud server has powerful data processing capability; for example, the cloud server can build a second acoustic model with higher recognition accuracy by mining more related data for deep learning. In this embodiment, the recognition accuracy of the second acoustic model is greater than that of the first acoustic model, so when the local first acoustic model determines that the similarity between the detected wake-up voice and the preset wake-up word signal is neither high nor low, the wake-up voice can be recognized again by the second acoustic model of the cloud server.

If the second acoustic model of the cloud server judges that the second similarity between the wake-up voice and the preset wake-up word signal is greater than the first threshold, the wake-up voice can be considered to match the preset wake-up word signal. Taking the preset wake-up word signal "Xiaodu Xiaodu" as an example, a recognition result of "match" indicates that the user has spoken the wake-up voice "Xiaodu Xiaodu", and the operation of waking up the terminal device can then be performed. Specifically, in this embodiment, if the second similarity is greater than the first threshold, the wake-up instruction for waking up the terminal device is generated; if the second similarity is less than the first threshold, the wake-up instruction for waking up the terminal device is not generated.

In one possible implementation, generating the wake-up instruction for waking up the terminal device if the second similarity is greater than the first threshold is implemented as follows:

S1, analyze the acoustic features of the wake-up voice according to the second acoustic model to obtain the pronunciation sequence corresponding to the wake-up voice.

In this embodiment, the pronunciation sequence that best matches the wake-up voice can be determined by the second acoustic model.

S2, analyze the pronunciation sequence corresponding to the wake-up voice according to the language model to obtain the text sequence corresponding to the wake-up voice.

Specifically, the language model is one of the most important parts of a speech recognition system. The language model can yield the text sequence corresponding to the input voice, that is, convert the input voice into text. Optionally, the language model is an N-Gram model.

In this embodiment, after the pronunciation sequence that best matches the wake-up voice has been determined by the second acoustic model, the text sequence that best matches the wake-up voice can be determined by the language model.
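As a non-limiting illustration of the optional N-Gram language model, the sketch below trains a bigram model with add-one smoothing and uses it to pick the most probable text sequence. It assumes that candidate text sequences for the pronunciation sequence are already available; the training sentences, the romanized tokens and the function names `train_bigram`, `log_prob` and `best_text` are hypothetical.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count context words and bigrams over tokenized sentences."""
    contexts, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + list(words) + ["</s>"]
        contexts.update(padded[:-1])
        bigrams.update(zip(padded[:-1], padded[1:]))
    vocab = {w for words in sentences for w in words} | {"</s>"}
    return contexts, bigrams, len(vocab)


def log_prob(words, model):
    """Log-probability of a token sequence under the add-one smoothed bigram model."""
    contexts, bigrams, vocab_size = model
    padded = ["<s>"] + list(words) + ["</s>"]
    return sum(
        math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size))
        for prev, cur in zip(padded[:-1], padded[1:])
    )


def best_text(candidates, model):
    """Return the candidate text sequence with the highest log-probability."""
    return max(candidates, key=lambda words: log_prob(words, model))
```

A usage example under these assumptions: after `model = train_bigram([["xiaodu", "xiaodu"], ["xiaodu", "xiaodu"], ["small", "degree"]])`, calling `best_text([["xiaodu", "xiaodu"], ["degree", "small"]], model)` selects the sequence that the training data makes more likely.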

S3, match the text sequence corresponding to the wake-up voice against the text sequence corresponding to the preset wake-up word signal.

S4, if the match succeeds, generate the wake-up instruction for waking up the terminal device.

In this embodiment, the second acoustic model is used to make a preliminary judgment on the similarity between the acoustic features of the wake-up voice and those of the preset wake-up word signal; the language model is then used to match the text sequence corresponding to the wake-up voice against the text sequence corresponding to the preset wake-up word signal. Matching is thus performed twice, from the two angles of speech and text, which makes the voice wake-up method more accurate and reliable.
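As a non-limiting illustration, the two checks performed on the cloud server (second similarity against the first threshold, then text matching) can be sketched as a single predicate. The second similarity and the recognized text are assumed to have been produced already by the second acoustic model and the language model; the function names `normalize` and `cloud_should_wake` and the whitespace- and case-insensitive comparison are hypothetical choices.

```python
def normalize(text):
    """Lower-case the text and remove all whitespace (a hypothetical matching rule)."""
    return "".join(text.lower().split())


def cloud_should_wake(second_similarity, first_threshold, recognized_text, wake_word_text):
    """Return True when the cloud server should generate a wake-up instruction."""
    # Acoustic check: the second similarity must exceed the first threshold.
    if second_similarity <= first_threshold:
        return False
    # Text check: the recognized text must match the wake-up word text.
    return normalize(recognized_text) == normalize(wake_word_text)
```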

S106, receive the wake-up instruction and perform the operation of waking up the terminal device.

The voice wake-up method provided by this embodiment of the present invention includes: detecting the wake-up voice input to the terminal device and the current scene in which the terminal device is located; obtaining a first threshold and a second threshold according to the current scene and the correspondence between scenes and thresholds, wherein the first threshold is greater than the second threshold; analyzing the acoustic features of the wake-up voice according to the first acoustic model to obtain the first similarity between the wake-up voice and the preset wake-up word signal; judging whether the first similarity is greater than the second threshold and less than the first threshold; if the judgment result is yes, sending the wake-up voice to the cloud server, so that the cloud server judges the second similarity between the wake-up voice and the preset wake-up word signal according to the second acoustic model and, if the second similarity is greater than the first threshold, generates a wake-up instruction for waking up the terminal device, wherein the recognition accuracy of the second acoustic model is greater than that of the first acoustic model; and receiving the wake-up instruction and performing the operation of waking up the terminal device. When the local first acoustic model determines that the similarity between the detected wake-up voice and the preset wake-up word signal is neither high nor low, the method allows the wake-up voice to be recognized again by the second acoustic model of the cloud server, which avoids, as far as possible, the terminal device being falsely woken up or failing to wake up when it should, and improves the user experience.

Fig. 2 is a schematic flowchart of the voice wake-up method proposed by another embodiment of the present invention. On the basis of the above embodiment, if the first similarity is greater than the first threshold, the operation of waking up the terminal device is performed; or, if the first similarity is less than the second threshold, the operation of waking up the terminal device is not performed.

As shown in Fig. 2, the voice wake-up method proposed by this embodiment includes the following steps:

S201, detect the wake-up voice input to the terminal device and the current scene in which the terminal device is located, then perform step S202.

S202, obtain a first threshold and a second threshold according to the current scene and the correspondence between scenes and thresholds, wherein the first threshold is greater than the second threshold, then perform step S203.

S203, analyze the acoustic features of the wake-up voice according to the first acoustic model to obtain the first similarity between the wake-up voice and the preset wake-up word signal, then perform step S204.

S204, judge whether the first similarity is greater than the second threshold and less than the first threshold, then perform one of step S205, step S207 and step S208.

S205, if the judgment result is yes, send the wake-up voice to the cloud server, so that the cloud server judges the second similarity between the wake-up voice and the preset wake-up word signal according to the second acoustic model and, if the second similarity is greater than the first threshold, generates a wake-up instruction for waking up the terminal device; wherein the recognition accuracy of the second acoustic model is greater than the recognition accuracy of the first acoustic model; then perform step S206.

S206, receive the wake-up instruction and perform the operation of waking up the terminal device.

It should be noted that the implementation of step S201, S202, S203, S204, S205, S206 in the present embodiment It is identical with the implementation of step S101, S102, S103, S104, S105, S106 in above-described embodiment respectively, herein no longer Repeat.

S207, if the first similarity is greater than the first threshold, perform the operation of waking up the terminal device.

Specifically, when the local first acoustic model determines that the first similarity is greater than the first threshold, the wake-up voice is considered to match the preset wake-up word signal. Taking the preset wake-up word signal "小度小度" ("Xiaodu Xiaodu") as an example, a matching recognition result indicates that the user has spoken the wake-up voice "Xiaodu Xiaodu", so the operation of waking up the terminal device can be performed.

S208, if the first similarity is less than the second threshold, do not perform the operation of waking up the terminal device.

Specifically, when the local first acoustic model determines that the first similarity is less than the second threshold, the wake-up voice is considered not to match the preset wake-up word signal. Taking the preset wake-up word signal "小度小度" ("Xiaodu Xiaodu") as an example, a non-matching recognition result indicates that the user has not spoken the wake-up voice "Xiaodu Xiaodu", so the operation of waking up the terminal device is not performed.

In the voice awakening method provided by the embodiment of the present invention, when the local first acoustic model determines that the first similarity is greater than the first threshold, the operation of waking up the terminal device is performed; when the local first acoustic model determines that the first similarity is less than the second threshold, the operation of waking up the terminal device is not performed. That is, when the first acoustic model finds the similarity between the wake-up voice and the preset wake-up word signal to be clearly high or clearly low, the terminal device itself decides whether to perform the wake-up operation without sending the wake-up voice to the cloud server for recognition, which improves the efficiency with which the terminal device performs the wake-up operation.
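The three-way decision of steps S204 to S208 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the names `Decision`, `decide_locally`, `should_wake`, and the `cloud_similarity` callback are hypothetical, and the handling of a similarity exactly equal to a threshold is an assumption, since the text only covers the strict inequalities.

```python
from enum import Enum
from typing import Callable


class Decision(Enum):
    WAKE = "wake"            # S207: the terminal wakes up on its own
    IGNORE = "ignore"        # S208: the terminal does not wake up
    ASK_CLOUD = "ask_cloud"  # S205: the wake-up voice goes to the cloud server


def decide_locally(first_similarity: float,
                   first_threshold: float,
                   second_threshold: float) -> Decision:
    """Classify the first similarity against the two scene thresholds."""
    if first_threshold <= second_threshold:
        raise ValueError("the first threshold must exceed the second threshold")
    if first_similarity > first_threshold:
        return Decision.WAKE
    if first_similarity < second_threshold:
        return Decision.IGNORE
    # Assumption: values equal to a threshold are also deferred to the cloud.
    return Decision.ASK_CLOUD


def should_wake(first_similarity: float,
                first_threshold: float,
                second_threshold: float,
                cloud_similarity: Callable[[], float]) -> bool:
    """Return True if the terminal device should be woken up.

    cloud_similarity is a hypothetical callback standing in for the round
    trip to the cloud server; it is only invoked in the ambiguous case.
    """
    decision = decide_locally(first_similarity, first_threshold, second_threshold)
    if decision is Decision.WAKE:
        return True
    if decision is Decision.IGNORE:
        return False
    # S205/S206: the cloud generates a wake-up instruction only if the
    # second similarity exceeds the first threshold.
    return cloud_similarity() > first_threshold
```

Passing the cloud call as a callback keeps the clear-cut cases entirely local, which mirrors the efficiency argument made above: the network round trip happens only when the first similarity falls between the two thresholds.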

Fig. 3 is a schematic structural diagram of a voice awakening device according to an embodiment of the present invention. The device may be implemented in hardware and/or software, and may be integrated into a terminal device to perform the voice awakening method.

As shown in Fig. 3, the voice awakening device provided by this embodiment includes:

a first detection module 01, configured to detect a wake-up voice input to a terminal device;

a second detection module 02, configured to detect the current scene of the terminal device;

a threshold module 03, configured to obtain a first threshold and a second threshold according to the current scene and the correspondence between scenes and thresholds, wherein the first threshold is greater than the second threshold;

an analysis module 04, configured to analyze the acoustic features of the wake-up voice according to a first acoustic model to obtain a first similarity between the wake-up voice and a preset wake-up word signal;

a judgment module 05, configured to judge whether the first similarity is greater than the second threshold and less than the first threshold and, if the judgment result is yes, to trigger the sending module;

a sending module 06, configured to send the wake-up voice to a cloud server, so that the cloud server determines, according to a second acoustic model, a second similarity between the wake-up voice and the preset wake-up word signal and, if the second similarity is greater than the first threshold, generates a wake-up instruction for waking up the terminal device; wherein the recognition accuracy of the second acoustic model is higher than that of the first acoustic model;

a first execution module 07, configured to receive the wake-up instruction and perform the operation of waking up the terminal device.

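Further, the cloud server includes a wake-up instruction generation module.

The wake-up instruction generation module is specifically configured to: analyze the acoustic features of the wake-up voice according to the second acoustic model to obtain a pronunciation sequence corresponding to the wake-up voice; analyze the pronunciation sequence according to a language model to obtain a text sequence corresponding to the wake-up voice; match the text sequence corresponding to the wake-up voice against the text sequence corresponding to the preset wake-up word signal; and, if the match succeeds, generate the wake-up instruction for waking up the terminal device.

Further, the analysis module 04 is specifically configured to: determine, according to the acoustic features of the wake-up voice and the first acoustic model, the feature similarities between the acoustic features of the wake-up voice and the acoustic features of the preset wake-up word signal; and determine the first similarity between the wake-up voice and the preset wake-up word signal according to the feature similarities.

Further, the second detection module 02 is specifically configured to: detect the current location of the terminal device and determine the current scene of the terminal device according to the current location.

Alternatively, the second detection module 02 is specifically configured to: detect the scene voice of the terminal device, perform keyword retrieval on the scene voice to obtain a corpus set of the scene voice, determine the scene corresponding to the corpus set, and take that scene as the current scene of the terminal device.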

As for the device in this embodiment, the specific manner in which each module performs its operation has been described in detail in the embodiments of the method and is not elaborated here.
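The scene detection and threshold lookup performed by the second detection module and the threshold module might look like the sketch below. Everything concrete in it is an assumption made for illustration: the scene names, the keyword sets, the threshold values, the default scene, and the idea of choosing the scene by keyword overlap are not specified in the source, and the scene voice is assumed to have already been converted into words.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical keyword sets per scene; the source does not list any.
SCENE_KEYWORDS: Dict[str, Set[str]] = {
    "car": {"navigation", "traffic", "fuel"},
    "home": {"television", "dinner", "sofa"},
}

# Hypothetical scene -> (first_threshold, second_threshold) correspondence.
SCENE_THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "car": (0.85, 0.50),
    "home": (0.80, 0.40),
}

DEFAULT_SCENE = "home"  # assumed fallback when no keyword is found


def detect_scene(scene_words: Iterable[str]) -> str:
    """Pick the scene whose keyword set overlaps most with the retrieved words."""
    corpus = {word.lower() for word in scene_words}
    best_scene, best_hits = DEFAULT_SCENE, 0
    for scene, keywords in SCENE_KEYWORDS.items():
        hits = len(corpus & keywords)
        if hits > best_hits:
            best_scene, best_hits = scene, hits
    return best_scene


def thresholds_for(scene: str) -> Tuple[float, float]:
    """Look up the two thresholds and enforce first > second."""
    first_threshold, second_threshold = SCENE_THRESHOLDS[scene]
    if first_threshold <= second_threshold:
        raise ValueError("the first threshold must exceed the second threshold")
    return first_threshold, second_threshold
```

For example, `thresholds_for(detect_scene(["Navigation", "to", "fuel", "station"]))` selects the hypothetical "car" scene and returns its pair of thresholds.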

The voice awakening device provided by the embodiment of the present invention includes: a first detection module, configured to detect a wake-up voice input to a terminal device; a second detection module, configured to detect the current scene of the terminal device; a threshold module, configured to obtain a first threshold and a second threshold according to the current scene and the correspondence between scenes and thresholds, wherein the first threshold is greater than the second threshold; an analysis module, configured to analyze the acoustic features of the wake-up voice according to a first acoustic model to obtain a first similarity between the wake-up voice and a preset wake-up word signal; a judgment module, configured to judge whether the first similarity is greater than the second threshold and less than the first threshold and, if the judgment result is yes, to trigger the sending module; a sending module, configured to send the wake-up voice to a cloud server, so that the cloud server determines, according to a second acoustic model, a second similarity between the wake-up voice and the preset wake-up word signal and, if the second similarity is greater than the first threshold, generates a wake-up instruction for waking up the terminal device, wherein the recognition accuracy of the second acoustic model is higher than that of the first acoustic model; and a first execution module, configured to receive the wake-up instruction and perform the operation of waking up the terminal device. When the local first acoustic model finds the similarity between the detected wake-up voice and the preset wake-up word signal to be neither high nor low, the device has the wake-up voice recognized again by the second acoustic model of the cloud server. This avoids, as far as possible, the terminal device being woken up by mistake or failing to wake up when it should, and improves the user experience.
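How the analysis module could turn per-feature similarities into a single first similarity is illustrated below. This is only a sketch under stated assumptions: cosine similarity stands in for whatever scoring the first acoustic model actually produces, the features are assumed to be already aligned frame by frame, and the arithmetic mean is an assumed way of combining the feature similarities. The function names are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two feature vectors; 0.0 if either vector is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def first_similarity(voice_features: Sequence[Sequence[float]],
                     wake_word_features: Sequence[Sequence[float]]) -> float:
    """Average the per-frame feature similarities into one score.

    Assumes both sequences are non-empty and already aligned frame by frame.
    """
    if not voice_features or len(voice_features) != len(wake_word_features):
        raise ValueError("feature sequences must be non-empty and equally long")
    similarities = [
        cosine_similarity(voice, reference)
        for voice, reference in zip(voice_features, wake_word_features)
    ]
    return sum(similarities) / len(similarities)
```

The resulting score is the value the judgment module would compare against the second and first thresholds.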

Fig. 4 is a schematic structural diagram of a voice awakening device according to an embodiment of the present invention. On the basis of the above embodiment, the voice awakening device further includes a second execution module and a third execution module.

As shown in Fig. 4, the voice awakening device provided by this embodiment includes:

a judgment module 05, configured to judge whether the first similarity is greater than the second threshold and less than the first threshold, and to trigger the sending module if the judgment result is yes; or to trigger the second execution module if the judgment result is that the first similarity is greater than the first threshold; or to trigger the third execution module if the judgment result is that the first similarity is less than the second threshold;

Further, the cloud server includes a wake-up instruction generation module;

The wake-up instruction generation module is specifically configured to:

Further, the analysis module 04 is specifically configured to:

Further, the second detection module 02 is specifically configured to:

a second execution module 08, configured to perform the operation of waking up the terminal device;

a third execution module 09, configured not to perform the operation of waking up the terminal device.

In the voice awakening device provided by the embodiment of the present invention, when the local first acoustic model determines that the first similarity is greater than the first threshold, the operation of waking up the terminal device is performed; when the local first acoustic model determines that the first similarity is less than the second threshold, the operation of waking up the terminal device is not performed. That is, when the first acoustic model finds the similarity between the wake-up voice and the preset wake-up word signal to be clearly high or clearly low, the terminal device itself decides whether to perform the wake-up operation without sending the wake-up voice to the cloud server for recognition, which improves the efficiency with which the terminal device performs the wake-up operation.
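For the ambiguous case that is handed to the cloud server, the wake-up instruction generation module recited in claims 2 and 7 (acoustic features to pronunciation sequence, pronunciation sequence to text sequence, then a text match) can be sketched as below. The two models are passed in as plain callables and the toy versions exist only to make the example runnable; the function names, the toy wake-up phrase, the exact-equality match, and the shape of the returned instruction are all assumptions rather than details from the source.

```python
from typing import Callable, List, Optional, Sequence

Features = Sequence[Sequence[float]]


def generate_wake_instruction(
    acoustic_features: Features,
    second_acoustic_model: Callable[[Features], List[str]],
    language_model: Callable[[List[str]], str],
    wake_word_text: str,
) -> Optional[dict]:
    """Return a wake-up instruction if the recognized text matches, else None."""
    # Step 1: acoustic features -> pronunciation sequence.
    pronunciation_sequence = second_acoustic_model(acoustic_features)
    # Step 2: pronunciation sequence -> text sequence.
    text_sequence = language_model(pronunciation_sequence)
    # Step 3: match against the text of the preset wake-up word
    # (exact equality is an assumption made for this sketch).
    if text_sequence == wake_word_text:
        return {"type": "wake_up"}
    return None


def toy_acoustic_model(features: Features) -> List[str]:
    """Toy stand-in: four frames map to the four syllables of a fake phrase."""
    return ["hel", "lo", "de", "vice"] if len(features) == 4 else ["?"]


def toy_language_model(pronunciations: List[str]) -> str:
    """Toy stand-in: concatenate the pronunciation units into text."""
    return "".join(pronunciations)
```

With these stand-ins, four feature frames yield the text "hellodevice" and therefore a wake-up instruction, while any other input yields `None`, in which case the terminal device receives no instruction and stays asleep.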

Fig. 5 shows a block diagram of an exemplary computer device 20 suitable for implementing embodiments of the present invention. The computer device 20 shown in Fig. 5 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present invention.

As shown in Fig. 5, the computer device 20 takes the form of a general-purpose computing device. The components of the computer device 20 may include, but are not limited to: one or more processors or processing units 21, a system memory 22, and a bus 23 connecting the different system components (including the system memory 22 and the processing unit 21).

The bus 23 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, these architectures include, but are not limited to, the Industry Standard Architecture (hereinafter: ISA) bus, the Micro Channel Architecture (hereinafter: MCA) bus, the enhanced ISA bus, the Video Electronics Standards Association (hereinafter: VESA) local bus, and the Peripheral Component Interconnection (hereinafter: PCI) bus.

The computer device 20 typically includes a variety of computer-system-readable media. These media may be any available media that can be accessed by the computer device 20, including volatile and non-volatile media, and removable and non-removable media.

The system memory 22 may include computer-system-readable media in the form of volatile memory, such as a random access memory (Random Access Memory; hereinafter: RAM) 30 and/or a cache memory 32. The computer device 20 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, a storage system 34 may be used to read from and write to a non-removable, non-volatile magnetic medium (not shown in Fig. 5, commonly referred to as a "hard disk drive"). Although not shown in Fig. 5, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (such as a "floppy disk") may be provided, as well as an optical disc drive for reading from and writing to a removable non-volatile optical disc (such as a compact disc read-only memory (Compact Disc Read Only Memory; hereinafter: CD-ROM), a digital versatile disc read-only memory (Digital Video Disc Read Only Memory; hereinafter: DVD-ROM), or other optical media). In these cases, each drive may be connected to the bus 23 through one or more data media interfaces. The memory 22 may include at least one program product having a set of (for example, at least one) program modules configured to perform the functions of the embodiments of the present invention.

A program/utility 40 having a set of (at least one) program modules 42 may be stored in, for example, the memory 22. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment. The program modules 42 generally perform the functions and/or methods of the embodiments described in the present invention.

The computer device 20 may also communicate with one or more external devices 50 (such as a keyboard, a pointing device, a display 60, etc.), with one or more devices that enable a user to interact with the computer device 20, and/or with any device (such as a network card, a modem, etc.) that enables the computer device 20 to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 24. The computer device 20 may also communicate with one or more networks (such as a local area network (Local Area Network; hereinafter: LAN), a wide area network (Wide Area Network; hereinafter: WAN), and/or a public network such as the Internet) through a network adapter 25. As shown in the figure, the network adapter 25 communicates with the other modules of the computer device 20 through the bus 23. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the computer device 20, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.

The processing unit 21 executes various functional applications and data processing by running programs stored in the system memory 22, for example implementing the voice awakening method shown in Figs. 1 and 2.

Any combination of one or more computer-readable media may be used. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (Read Only Memory; hereinafter: ROM), an erasable programmable read-only memory (Erasable Programmable Read Only Memory; hereinafter: EPROM) or flash memory, an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program, which may be used by or in connection with an instruction execution system, apparatus, or device.

A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, which carries computer-readable program code. Such a propagated data signal may take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and it can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code contained on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wire, optical cable, RF, etc., or any suitable combination of the above.

Computer program code for performing the operations of the present invention may be written in one or more programming languages or a combination thereof. The programming languages include object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (Local Area Network; hereinafter: LAN) or a wide area network (Wide Area Network; hereinafter: WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

To implement the above embodiments, the present invention further proposes a computer program product; when the instructions in the computer program product are executed by a processor, the voice awakening method described in the foregoing embodiments is performed.

To implement the above embodiments, the present invention further proposes a non-transitory computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the voice awakening method described in the foregoing embodiments is implemented.

In the description of this specification, a description referring to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, provided they do not conflict, those skilled in the art may combine the different embodiments or examples described in this specification, and the features of those embodiments or examples.

In addition, the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "multiple" means at least two, for example two or three, unless otherwise specifically defined.

Any process or method described in a flow chart or otherwise described herein may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing the steps of a custom logic function or process. The scope of the preferred embodiments of the present invention also includes other implementations in which functions may be performed out of the order shown or discussed, including substantially simultaneously or in the reverse order, depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.

The logic and/or steps represented in a flow chart or otherwise described herein may, for example, be regarded as an ordered list of executable instructions for implementing logic functions, and may be embodied in any computer-readable medium for use by, or in combination with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate, or transmit a program for use by, or in combination with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). The computer-readable medium may even be paper or another suitable medium on which the program is printed, because the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in a computer memory.

It should be understood that the various parts of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented in software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented by any one or a combination of the following techniques known in the art: a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.

Those of ordinary skill in the art will understand that all or part of the steps of the methods in the above embodiments may be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one of the steps of the method embodiments or a combination thereof.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist physically on its own, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and is sold or used as an independent product, it may also be stored in a computer-readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.

Although embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are exemplary and are not to be construed as limiting the present invention; those of ordinary skill in the art may change, modify, replace, and vary the above embodiments within the scope of the present invention.

Claims

1. A voice awakening method, characterized by comprising:

detecting a wake-up voice input to a terminal device and the current scene of the terminal device;

obtaining a first threshold and a second threshold according to the current scene and the correspondence between scenes and thresholds, wherein the first threshold is greater than the second threshold;

analyzing the acoustic features of the wake-up voice according to a first acoustic model to obtain a first similarity between the wake-up voice and a preset wake-up word signal;

judging whether the first similarity is greater than the second threshold and less than the first threshold;

if the judgment result is yes, sending the wake-up voice to a cloud server, so that the cloud server determines, according to a second acoustic model, a second similarity between the wake-up voice and the preset wake-up word signal and, if the second similarity is greater than the first threshold, generates a wake-up instruction for waking up the terminal device; wherein the recognition accuracy of the second acoustic model is higher than that of the first acoustic model;

receiving the wake-up instruction and performing the operation of waking up the terminal device.

2. The method according to claim 1, characterized in that generating the wake-up instruction for waking up the terminal device if the second similarity is greater than the first threshold comprises:

analyzing the acoustic features of the wake-up voice according to the second acoustic model to obtain a pronunciation sequence corresponding to the wake-up voice;

analyzing the pronunciation sequence corresponding to the wake-up voice according to a language model to obtain a text sequence corresponding to the wake-up voice;

matching the text sequence corresponding to the wake-up voice against the text sequence corresponding to the preset wake-up word signal;

if the match succeeds, generating the wake-up instruction for waking up the terminal device.

3. The method according to claim 1, characterized in that analyzing the acoustic features of the wake-up voice according to the first acoustic model to obtain the first similarity between the wake-up voice and the preset wake-up word signal comprises:

determining, according to the acoustic features of the wake-up voice and the first acoustic model, the feature similarities between the acoustic features of the wake-up voice and the acoustic features of the preset wake-up word signal;

determining the first similarity between the wake-up voice and the preset wake-up word signal according to the feature similarities.

4. The method according to claim 1, characterized in that detecting the current scene of the terminal device comprises:

detecting the current location of the terminal device and determining the current scene of the terminal device according to the current location;

or detecting the scene voice of the terminal device, performing keyword retrieval on the scene voice to obtain a corpus set of the scene voice, determining the scene corresponding to the corpus set, and taking the scene corresponding to the corpus set as the current scene of the terminal device.

5. The method according to claim 1, characterized by further comprising:

if the first similarity is greater than the first threshold, performing the operation of waking up the terminal device;

or, if the first similarity is less than the second threshold, not performing the operation of waking up the terminal device.

6. A voice awakening device, characterized by comprising:

a first detection module, configured to detect a wake-up voice input to a terminal device;

a second detection module, configured to detect the current scene of the terminal device;

a threshold module, configured to obtain a first threshold and a second threshold according to the current scene and the correspondence between scenes and thresholds, wherein the first threshold is greater than the second threshold;

an analysis module, configured to analyze the acoustic features of the wake-up voice according to a first acoustic model to obtain a first similarity between the wake-up voice and a preset wake-up word signal;

a judgment module, configured to judge whether the first similarity is greater than the second threshold and less than the first threshold and, if the judgment result is yes, to trigger the sending module;

a sending module, configured to send the wake-up voice to a cloud server, so that the cloud server determines, according to a second acoustic model, a second similarity between the wake-up voice and the preset wake-up word signal and, if the second similarity is greater than the first threshold, generates a wake-up instruction for waking up the terminal device; wherein the recognition accuracy of the second acoustic model is higher than that of the first acoustic model;

a first execution module, configured to receive the wake-up instruction and perform the operation of waking up the terminal device.

7. The device according to claim 6, characterized in that the cloud server includes a wake-up instruction generation module;

the wake-up instruction generation module is specifically configured to:

analyze the acoustic features of the wake-up voice according to the second acoustic model to obtain a pronunciation sequence corresponding to the wake-up voice;

analyze the pronunciation sequence corresponding to the wake-up voice according to a language model to obtain a text sequence corresponding to the wake-up voice;

match the text sequence corresponding to the wake-up voice against the text sequence corresponding to the preset wake-up word signal;

if the match succeeds, generate the wake-up instruction for waking up the terminal device.

8. The device according to claim 6, characterized in that the analysis module is specifically configured to:

determine, according to the acoustic features of the wake-up voice and the first acoustic model, the feature similarities between the acoustic features of the wake-up voice and the acoustic features of the preset wake-up word signal;

determine the first similarity between the wake-up voice and the preset wake-up word signal according to the feature similarities.

9. The device according to claim 6, characterized in that the second detection module is specifically configured to:

detect the current location of the terminal device and determine the current scene of the terminal device according to the current location;

or the second detection module is specifically configured to: detect the scene voice of the terminal device, perform keyword retrieval on the scene voice to obtain a corpus set of the scene voice, determine the scene corresponding to the corpus set, and take the scene corresponding to the corpus set as the current scene of the terminal device.

10. The device according to claim 6, characterized by further comprising: a second execution module and a third execution module;

if the judgment result of the judgment module is that the first similarity is greater than the first threshold, the second execution module is triggered; wherein the second execution module is configured to perform the operation of waking up the terminal device;

or, if the judgment result of the judgment module is that the first similarity is less than the second threshold, the third execution module is triggered; wherein the third execution module is configured not to perform the operation of waking up the terminal device.

11. A computer device, characterized by comprising: a processor and a memory;

wherein the processor runs a program corresponding to executable program code stored in the memory by reading the executable program code, so as to implement the voice awakening method according to any one of claims 1-5.

12. A computer program product, wherein, when the instructions in the computer program product are executed by a processor, the voice awakening method according to any one of claims 1-5 is performed.

13. A non-transitory computer-readable storage medium on which a computer program is stored, characterized in that, when the computer program is executed by a processor, the voice awakening method according to any one of claims 1-5 is implemented.