CN109273007A - Voice awakening method and device - Google Patents
- Publication number: CN109273007A
- Application number: CN201811184504.8A
- Authority
- CN
- China
- Prior art keywords
- word
- wake
- acoustic feature
- network
- waking
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- User Interface Of Digital Computer (AREA)
- Telephonic Communication Services (AREA)
Abstract
An embodiment of the present invention provides a voice wake-up method and device, belonging to the technical field of speech recognition. The method comprises: obtaining the acoustic features of a wake word from voice data; and inputting the acoustic features into a wake decision network, which outputs a wake decision result indicating whether the wake-up succeeded. The wake decision network is trained on sample acoustic features and is used to perform a confidence decision on the wake word. Because the wake decision network can make the wake decision for any user-defined wake word, without relying on a fixed preset threshold, the wake-up success rate can be improved and the wake-up process applies to a wider range of scenarios.
Description
Technical field
Embodiments of the present invention relate to the technical field of speech recognition, and in particular to a voice wake-up method and device.
Background
With the development of smart homes, voice wake-up functions are increasingly common. Voice wake-up works mainly by interpreting a user's voice data in order to wake an intelligent terminal. Currently, when voice wake-up is implemented, the acoustic likelihood scores of the wake-word path and of the filler path are computed separately during wake-word recognition; if the acoustic likelihood ratio exceeds a fixed preset threshold, the wake-word recognition result is confirmed as credible and the intelligent terminal is woken. Because the preset threshold is fixed, if the wake phrase changes, the threshold may no longer suit the decision process for the current wake phrase, which reduces the wake-up success rate.
Summary of the invention
To solve the above problems, embodiments of the present invention provide a voice wake-up method and device that overcome, or at least partially solve, those problems.
According to a first aspect of the embodiments of the present invention, a voice wake-up method is provided, comprising:
obtaining the acoustic features of a wake word from voice data; and
inputting the acoustic features into a wake decision network and outputting a wake decision result, wherein the wake decision result indicates whether the wake-up succeeded, and the wake decision network is trained on sample acoustic features and is used to perform a confidence decision on the wake word.
In the method provided by this embodiment, the acoustic features of the wake word are obtained from voice data and input into the wake decision network, which outputs a wake decision result indicating whether the wake-up succeeded. Because the wake decision network can make the wake decision for any user-defined wake word, without relying on a fixed preset threshold, the wake-up success rate can be improved and the wake-up process applies to a wider range of scenarios.
According to a second aspect of the embodiments of the present invention, a voice wake-up device is provided, comprising:
an obtaining module, configured to obtain the acoustic features of a wake word from voice data; and
an output module, configured to input the acoustic features into a wake decision network and output a wake decision result, wherein the wake decision result indicates whether the wake-up succeeded, and the wake decision network is trained on sample acoustic features and is used to perform a confidence decision on the wake word.
According to a third aspect of the embodiments of the present invention, an electronic device is provided, comprising:
at least one processor; and
at least one memory communicatively connected to the processor, wherein:
the memory stores program instructions executable by the processor, and the processor, by calling the program instructions, is able to perform the voice wake-up method provided by any possible implementation of the first aspect.
According to a fourth aspect of the present invention, a non-transient computer-readable storage medium is provided. The non-transient computer-readable storage medium stores computer instructions that cause a computer to perform the voice wake-up method provided by any possible implementation of the first aspect.
It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and do not limit the embodiments of the present invention.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a voice wake-up method provided in an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a wake-word recognition network provided in an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a wake decision network provided in an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a voice wake-up device provided in an embodiment of the present invention;
Fig. 5 is a block diagram of an electronic device provided in an embodiment of the present invention.
Detailed description
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. Based on these embodiments, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
With the development of smart homes, voice wake-up functions are increasingly common. Voice wake-up works mainly by interpreting a user's voice data in order to wake an intelligent terminal. Currently, when voice wake-up is implemented, the acoustic likelihood scores of the wake word and of the non-wake word are computed separately during wake-word recognition; the ratio between the two gives the acoustic likelihood ratio of the wake word. If this ratio exceeds a fixed preset threshold, the wake-word recognition result is confirmed as credible and the intelligent terminal is woken. Because the preset threshold is fixed, if the wake phrase changes, the threshold may no longer suit the decision process for the current wake phrase, which reduces the wake-up success rate.
For example, if the current preset wake phrase is "ding-dong ding-dong", comparing the acoustic likelihood ratio of the wake word "ding-dong ding-dong" against the preset threshold can accurately determine whether the wake phrase spoken by the user should wake the intelligent terminal. But if the user defines a new wake phrase, such as "hello, Xiaofei", the fixed preset threshold may not suit the user-defined wake phrase, so the wake-up success rate may drop.
In view of the above, embodiments of the present invention provide a voice wake-up method. Note that the method can be applied to intelligent terminals with a wake-up function, such as smart speakers, wearable devices, or smart appliances. Referring to Fig. 1, the method includes but is not limited to the following steps.
101. Obtain the acoustic features of the wake word from voice data.
Before step 101, the voice data can be input into a wake-word recognition network to recognize the wake word. Specifically, the wake-word recognition network can be a Keyword-and-Filler network based on a hidden Markov model; as shown in Fig. 2, the network contains both keyword paths and filler paths. The filler path represents the non-wake-word path: vocabulary other than the wake word is covered by the filler path. While the wake word is being recognized, its acoustic features can be extracted at the same time.
102. Input the acoustic features into a wake decision network and output a wake decision result, wherein the wake decision result indicates whether the wake-up succeeded, and the wake decision network is trained on sample acoustic features and is used to perform a confidence decision on the wake word.
Before step 102, the wake decision network can be obtained by training. Specifically, an initial decision network can be trained on positive-example sample acoustic features and negative-example sample acoustic features to obtain the wake decision network. The positive-example sample acoustic features are acoustic features of wake words that should succeed in waking; the negative-example sample acoustic features are acoustic features of non-wake words that should fail to wake. Based on the above, as an optional embodiment, the wake decision network can be an encoder-decoder model. Correspondingly, the initial decision network may include an encoder side and a decoder side, and the wake word may include multiple sub-words. The encoder side can use a sequence-modeling network such as a recurrent neural network or a long short-term memory network: the acoustic feature sequence of each sub-word is input into the modeling network to obtain a characterization vector for each sub-word; the characterization vectors are then fed into an attention mechanism to obtain the influence weight of each sub-word's characterization vector on the wake word, from which a characterization vector for the whole word is obtained; the decision is then made from this whole-word vector.
The structure of the wake decision network is shown in Fig. 3. On the encoder side, x1...xT are the acoustic feature vectors of the sub-words; h1...hT are the characterization vectors obtained by passing those feature vectors through the sequence-modeling network; a(t,1)...a(t,T) are the weights of the characterization vectors obtained through the attention mechanism; and ct is the characterization vector of the whole word. On the decoder side, y(t-1) denotes the previous decoding result and yt the final decoding result, i.e., the indication of whether the wake-up succeeded; s(t-1) denotes the intermediate state of the previous decoding step, and st the intermediate state of the current decoding step. Note that because the wake decision network uses an encoder-decoder model, it can accommodate wake words of different lengths.
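The attention pooling that produces the whole-word characterization vector ct from the sub-word vectors h1...hT can be sketched numerically as follows. This is a minimal sketch under simplifying assumptions: in the real network the encoder and the attention scoring are learned, whereas here the characterization vectors and the query vector (standing in for the decoder state s(t-1)) are made-up fixed numbers, and the attention score is a plain dot product.

```python
# Hypothetical sketch of attention pooling over sub-word characterization
# vectors: score each h_i against a query, normalize with softmax to get
# the weights a(t,1)..a(t,T), then form the weighted sum c_t.
import math

def softmax(xs):
    m = max(xs)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_pool(h, query):
    # h: list of T characterization vectors; query: stand-in for s(t-1).
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in h]
    weights = softmax(scores)        # a(t,1) .. a(t,T)
    dim = len(h[0])
    c = [sum(w * vec[d] for w, vec in zip(weights, h)) for d in range(dim)]
    return weights, c                # c is the whole-word vector c_t

h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # T = 3 sub-words (fabricated)
weights, c = attention_pool(h, query=[1.0, 0.0])
assert abs(sum(weights) - 1.0) < 1e-9      # attention weights sum to 1
```

In the full model, c_t would be consumed by the decoder to emit yt, the wake/no-wake decision.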
In the method provided by this embodiment, the acoustic features of the wake word are obtained from voice data and input into the wake decision network, which outputs a wake decision result indicating whether the wake-up succeeded. Because the wake decision network can make the wake decision for any user-defined wake word, without relying on a fixed preset threshold, the wake-up success rate can be improved and the wake-up process applies to a wider range of scenarios.
Based on the above, as an optional embodiment, the acoustic features include at least one of the following five kinds of information: the wake-word scores of the sub-words in the wake word; the non-wake-word scores of the sub-words; the frame counts corresponding to the sub-words; the score distributions of the acoustic features corresponding to the sub-words; and the insertion (embedding) features of the sub-words.
The five kinds of information may be obtained before the acoustic features are input into the wake decision network to output the wake decision result, or they may be obtained at the same time the wake word is recognized; the present invention does not specifically limit this. For ease of understanding, the five kinds of information are illustrated below, taking the sub-word to be a phoneme.
(1) The wake-word scores of the sub-words in the wake word.
Specifically, the acoustic score of each sub-word can be obtained from recognition decoding, and the wake-word scores of the sub-words are concatenated into a sequence of one-dimensional scalars, i.e., the wake-word score sequence of the sub-words. For example, suppose the wake word is "hello, Xiaofei", which contains 4 speech units; if each speech unit is represented by 2 phonemes, the whole wake word has 8 phonemes, and a sub-word acoustic score sequence of length 8 — the wake-word scores of the sub-words — can be obtained.
(2) The non-wake-word scores of the sub-words in the wake word.
Similarly, the filler acoustic score of each sub-word can be obtained from recognition decoding, and the non-wake-word scores of the sub-words are concatenated into a sequence of one-dimensional scalars, i.e., the non-wake-word score sequence of the sub-words. For "hello, Xiaofei", a non-wake-word acoustic score sequence of length 8 — the non-wake-word scores of the sub-words — can likewise be obtained.
(3) The frame counts corresponding to the sub-words in the wake word.
Because each sub-word occupies a certain duration, and duration can be expressed in frames, the frame count corresponding to each sub-word in the wake word can be determined. For "hello, Xiaofei", again with each speech unit comprising two sub-words, there are 8 sub-words in total; after the number of frames occupied by each sub-word is determined, a frame-count sequence of length 8 is obtained.
(4) The score distributions of the acoustic features corresponding to the sub-words in the wake word.
As described above, the acoustic scores of the sub-words can be obtained before the acoustic features are input into the wake decision network to output the wake decision result. Based on the above, as an optional embodiment, the way the score distributions are obtained is not specifically limited by this embodiment; it includes but is not limited to: determining the probability that the acoustic features corresponding to any sub-word belong to each example phoneme, and taking those probability values as the score distribution of the acoustic features corresponding to that sub-word.
Here, the example phonemes are all phonemes that can currently be enumerated. Taking Chinese as an example, the example phonemes can be the initials and finals, for a total of 83 example phonemes. For any sub-word in the wake word, the acoustic features corresponding to that sub-word may belong to one of the 83 example phonemes, so the likelihood that the sub-word's acoustic features belong to each example phoneme can be determined and expressed as a probability value. For any sub-word, "belong to" means that the sub-word's acoustic features match the example phoneme.
(5) The insertion (embedding) features of the sub-words in the wake word.
Because a sub-word is a phoneme, the wake word consists of multiple sub-words, each sub-word is determinate, and the example phonemes are enumerable, the example phoneme among the 83 that each sub-word in the wake word matches can be determined. Based on the above, the insertion feature indicates this matching relationship. Specifically, the insertion feature can be a one-hot vector: the matching relationship between each sub-word in the wake word and all the example phonemes is expressed by a one-hot vector. For example, if the first sub-word in the wake word matches the second example phoneme, the one-hot vector corresponding to this first sub-word is (0, 1, 0, ..., 0), with 81 zeros after the 1. Similarly, if the second sub-word matches the first example phoneme, its one-hot vector is (1, 0, 0, ..., 0), with 82 zeros after the 1.
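The one-hot insertion feature just described can be sketched directly; the indices follow the example in the text, and the vector length of 83 is the stated number of example phonemes for Chinese.

```python
# Sketch of the one-hot insertion feature: each sub-word maps to a vector
# of length 83 with a single 1 at the index of its matching example phoneme.
NUM_EXAMPLE_PHONEMES = 83

def one_hot(index: int, size: int = NUM_EXAMPLE_PHONEMES):
    vec = [0] * size
    vec[index] = 1
    return vec

first = one_hot(1)    # matches the 2nd example phoneme: (0, 1, 0, ..., 0)
second = one_hot(0)   # matches the 1st example phoneme: (1, 0, 0, ..., 0)
assert first[:3] == [0, 1, 0] and first[2:] == [0] * 81
assert second[0] == 1 and second[1:] == [0] * 82
```

Concatenating one such vector per sub-word yields the insertion feature of the whole wake word.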
Note that the insertion features of the sub-words can be obtained by the wake-word recognition network during the preceding wake-up recognition process; that is, before this embodiment is executed, wake-up recognition can also be performed by the wake-word recognition network using the decision procedure based on the acoustic likelihood ratio. While wake-up recognition is performed by the recognition network, the insertion features of the sub-words can be obtained at the same time. For example, if the voice data is "hello, Xiaofei", then after it passes through the wake-word path corresponding to "hello, Xiaofei" in the recognition network, the resulting path confidence will be higher than the confidences obtained via other wake-word paths, and the insertion features of the sub-words can be obtained from this.
In the method provided by this embodiment, at least one of the five kinds of information is input into the wake decision network, which outputs the wake decision result. Because the wake decision network can make the wake decision for any user-defined wake word, without relying on a fixed preset threshold, the wake-up success rate can be improved and the wake-up process applies to a wider range of scenarios.
Based on the above, as an optional embodiment, the way the probability that the acoustic features corresponding to any sub-word belong to each example phoneme is determined is not specifically limited by this embodiment; it includes but is not limited to: computing, for each frame in the sub-word, the probability that the frame belongs to each example phoneme, obtaining a probability value sequence for each frame; and normalizing the per-frame probability value sequences according to the total number of frames in the sub-word, obtaining the probability that the acoustic features corresponding to the sub-word belong to each example phoneme.
For example, suppose a sub-word comprises two frames. For the acoustic features of the first frame, the probability that they belong to each example phoneme can be determined; with 83 example phonemes, 83 probability values are obtained for the first frame, written A = (0.01, 0.02, ..., 0.3). Similarly, for the acoustic features of the second frame, the probabilities can be written B = (0.03, 0.04, ..., 0.1). Because the sub-word comprises 2 frames in total, the per-frame probability value sequences can be normalized, i.e., the two sequences are added and averaged, yielding the probability that the sub-word's acoustic features belong to each example phoneme: C = (0.02, 0.03, ..., 0.2).
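The normalization step can be sketched with the A, B, and C values from this example; only the first, second, and last of the 83 entries are shown, and the values are the illustrative ones from the text.

```python
# Sketch of frame-level normalization: per-frame probability vectors over
# the example phonemes are added element-wise and divided by the sub-word's
# total frame count.
def normalize_frames(frame_probs):
    n = len(frame_probs)
    dim = len(frame_probs[0])
    return [sum(f[d] for f in frame_probs) / n for d in range(dim)]

A = [0.01, 0.02, 0.3]   # first frame (first, second, and last of 83 values)
B = [0.03, 0.04, 0.1]   # second frame
C = normalize_frames([A, B])
assert all(abs(c - e) < 1e-9 for c, e in zip(C, [0.02, 0.03, 0.2]))
```

The same function applies unchanged to sub-words spanning any number of frames.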
In the method provided by this embodiment, the probability that each frame of acoustic features in a sub-word belongs to each example phoneme is computed to obtain a probability value sequence for each frame; the per-frame sequences are then normalized according to the sub-word's total frame count to obtain the probability that the sub-word's acoustic features belong to each example phoneme. Because the wake decision network can make the wake decision for any user-defined wake word, without relying on a fixed preset threshold, the wake-up success rate can be improved and the wake-up process applies to a wider range of scenarios.
Based on the above, as an optional embodiment, a sub-word is a state, a monophone, a triphone, or a syllable. Each speech unit in the wake word may include multiple phonemes, and each phoneme may include multiple states; alternatively, each speech unit may include multiple triphones, and each triphone may include multiple states.
Based on the above, as an optional embodiment, before the acoustic features are input into the wake decision network to output the wake decision result, whether the wake word meets a preset condition can also be determined. The way this is judged is not specifically limited by this embodiment; it includes but is not limited to: obtaining the acoustic likelihood ratio of the wake word and judging whether the ratio meets the preset condition. Correspondingly, the way the acoustic features are input into the wake decision network to output the wake decision result is not specifically limited either; it includes but is not limited to: if the acoustic likelihood ratio meets the preset condition, inputting the acoustic features into the wake decision network and outputting the wake decision result.
Based on the above, as an optional embodiment, the preset condition is that the acoustic likelihood ratio is less than a preset threshold, or lies in a preset threshold interval.
When the preset condition is that the acoustic likelihood ratio is less than the preset threshold, then in actual implementation, if the ratio is less than the threshold, the wake decision process is executed; if the ratio is greater than the threshold, a successful wake-up can be determined directly. Of course, the preset condition can also be that the acoustic likelihood ratio is greater than the preset threshold; in that case, if the ratio is greater than the threshold, the wake decision process is executed, and if the ratio is less than the threshold, a wake-up failure can be determined directly. When the preset condition is that the ratio is greater than the threshold, this amounts to implementing wake-up through two confidence passes.
When the preset condition is that the acoustic likelihood ratio lies in a preset interval (a, b): if the actually computed acoustic likelihood ratio of the wake word falls within (a, b), the wake decision process is executed; if it falls below the interval, a wake-up failure can be determined; if it falls above the interval, a successful wake-up can be determined.
In the method provided by this embodiment, the acoustic likelihood ratio of the wake word is obtained and judged against the preset condition; if the condition is met, the acoustic features are input into the wake decision network, which outputs the wake decision result. Because the acoustic likelihood ratio of the wake word is determined first and the wake decision is executed afterwards, this two-pass decision process can improve both the wake-up success rate and precision.
Note that all the optional embodiments above can be combined in any way to form optional embodiments of the present invention, which will not be repeated here.
Based on the above, an embodiment of the present invention provides a voice wake-up device for performing the voice wake-up method in the above method embodiments. Referring to Fig. 4, the device includes an obtaining module 401 and an output module 402, wherein:
the obtaining module 401 is configured to obtain the acoustic features of the wake word from voice data; and
the output module 402 is configured to input the acoustic features into the wake decision network and output the wake decision result, wherein the wake decision result indicates whether the wake-up succeeded, and the wake decision network is trained on sample acoustic features.
Based on the above, as an optional embodiment, the acoustic features include at least one of the following five kinds of information: the wake-word scores of the sub-words in the wake word; the non-wake-word scores of the sub-words; the frame counts corresponding to the sub-words; the score distributions of the acoustic features corresponding to the sub-words; and the insertion features of the sub-words.
Based on the above, as an optional embodiment, the acoustic features include the score distributions of the acoustic features corresponding to the sub-words; correspondingly, the device further includes:
a determining module, configured to determine, for any sub-word in the wake word, the probability that the acoustic features corresponding to the sub-word belong to each example phoneme, and to take those probability values as the score distribution of the acoustic features corresponding to the sub-word.
Based on the above, as an optional embodiment, the determining module is configured to compute, for each frame of acoustic features in the sub-word, the probability that the frame belongs to each example phoneme, obtaining a probability value sequence for each frame; and to normalize the per-frame probability value sequences according to the total number of frames in the sub-word, obtaining the probability that the acoustic features corresponding to the sub-word belong to each example phoneme.
Based on the above, as an optional embodiment, a sub-word is a state, a monophone, a triphone, or a syllable.
Based on the above, as an optional embodiment, the device further includes:
a judgment module, configured to obtain the acoustic likelihood ratio of the wake word and judge whether it meets the preset condition; correspondingly, the output module 402 is configured to, when the acoustic likelihood ratio meets the preset condition, input the acoustic features into the wake decision network and output the wake decision result.
Based on the content of the above embodiment, as an optional embodiment, the preset condition is that the acoustic likelihood ratio is less than a preset threshold or falls within a preset threshold interval.
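A minimal sketch of this first-stage gate, under the assumption that "meets the preset condition" means either falling below a threshold or lying inside an interval; the threshold and interval values used below are invented for illustration:

```python
from typing import Optional, Tuple

def passes_gate(likelihood_ratio: float,
                threshold: Optional[float] = None,
                interval: Optional[Tuple[float, float]] = None) -> bool:
    """Return True if the wake word's acoustic likelihood ratio meets the
    preset condition, i.e. only then are the acoustic features forwarded
    to the wake-up decision network."""
    if threshold is not None and likelihood_ratio < threshold:
        return True
    if interval is not None:
        lo, hi = interval
        return lo <= likelihood_ratio <= hi
    return False
```

Gating in this way keeps the (more expensive) decision network out of the path for utterances whose first-stage score already rules them in or out.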
Based on the content of the above embodiment, as an optional embodiment, the wake-up decision network is an encoder-decoder model.
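A toy sketch of an encoder-decoder style decision: the encoder summarizes the per-sub-word feature vectors into one context vector, and a decoder maps that context to a wake probability. This is a deliberately tiny stand-in (mean-pooling plus a logistic layer) for the trained encoder-decoder model the patent describes; all weights and dimensions are made up:

```python
import math
from typing import List

def encode(subword_features: List[List[float]]) -> List[float]:
    """Encoder stand-in: mean-pool the sub-word feature vectors into one
    fixed-size context vector."""
    dim = len(subword_features[0])
    n = len(subword_features)
    return [sum(f[i] for f in subword_features) / n for i in range(dim)]

def decode(context: List[float], weights: List[float], bias: float) -> float:
    """Decoder stand-in: a single logistic layer producing P(wake-up succeeds)."""
    z = sum(w * c for w, c in zip(weights, context)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Two sub-words, each with a 3-dimensional (illustrative) feature vector.
features = [[0.9, 0.1, 0.8], [0.7, 0.2, 0.6]]
ctx = encode(features)
p_wake = decode(ctx, weights=[1.0, -1.0, 1.0], bias=0.0)
woken = p_wake > 0.5
```

A real implementation would learn the encoder and decoder jointly from sample acoustic features rather than use fixed weights, which is what lets the same network handle arbitrary user-defined wake words without a per-word threshold.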
The device provided in this embodiment of the present invention obtains the acoustic features of the wake word in voice data, inputs the acoustic features into the wake-up decision network, and outputs a wake-up decision result that indicates whether the wake-up is successful. Because a wake-up decision can be made through the decision network for any user-defined wake word, without relying on a fixed preset threshold, the wake-up success rate can be improved and the wake-up process is applicable to a wider range of scenarios.
Fig. 5 illustrates a schematic diagram of the physical structure of an electronic device. As shown in Fig. 5, the electronic device may include a processor 510, a communications interface 520, a memory 530, and a communication bus 540, where the processor 510, the communications interface 520, and the memory 530 communicate with one another through the communication bus 540. The processor 510 can call logical instructions in the memory 530 to execute the following method: obtaining the acoustic features of the wake word in voice data; and inputting the acoustic features into the wake-up decision network and outputting a wake-up decision result, where the decision result indicates whether the wake-up is successful, the wake-up decision network is obtained by training on sample acoustic features, and the wake-up decision network is used to perform a confidence judgment on the wake word.
In addition, the logical instructions in the memory 530 may be implemented in the form of software functional units and, when sold or used as an independent product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, an electronic device, a network device, or the like) to execute all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
An embodiment of the present invention further provides a non-transient computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the methods provided by the embodiments described above, for example: obtaining the acoustic features of the wake word in voice data; and inputting the acoustic features into the wake-up decision network and outputting a wake-up decision result, where the decision result indicates whether the wake-up is successful, the wake-up decision network is obtained by training on sample acoustic features, and the wake-up decision network is used to perform a confidence judgment on the wake word.
The apparatus embodiments described above are merely exemplary. The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the solution without creative effort.
Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus a necessary general hardware platform, and certainly can also be implemented by hardware. Based on this understanding, the above technical solution, in essence, or the part contributing to the prior art, can be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or in certain parts of the embodiments.
Finally, it should be noted that the above embodiments are merely intended to illustrate, rather than to limit, the technical solutions of the present invention. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some of the technical features can be replaced by equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (11)
1. A voice wake-up method, characterized by comprising:
obtaining acoustic features of a wake word in voice data;
inputting the acoustic features into a wake-up decision network, and outputting a wake-up decision result, wherein the wake-up decision result is used to indicate whether the wake-up is successful, the wake-up decision network is obtained by training on sample acoustic features, and the wake-up decision network is used to perform a confidence judgment on the wake word.
2. The method according to claim 1, characterized in that the acoustic features include at least one of the following five kinds of information: the wake-word score of a sub-word in the wake word, the non-wake-word score of the sub-word in the wake word, the number of frames corresponding to the sub-word in the wake word, the score distribution of the acoustic features corresponding to the sub-word in the wake word, and the embedding feature of the sub-word in the wake word.
3. The method according to claim 2, characterized in that the acoustic features include the score distribution of the acoustic features corresponding to the sub-word in the wake word; correspondingly, before the inputting of the acoustic features into the wake-up decision network and the outputting of the wake-up decision result, the method further comprises:
for any sub-word in the wake word, determining the probability that the acoustic features corresponding to that sub-word belong to each example phoneme, and using these probabilities as the score distribution of the acoustic features corresponding to that sub-word.
4. The method according to claim 3, characterized in that the determining of the probability that the acoustic features corresponding to the sub-word belong to each example phoneme comprises:
calculating the probability that each frame of acoustic features in the sub-word belongs to each example phoneme, obtaining a probability sequence corresponding to each frame;
normalizing the per-frame probability sequences according to the total number of frames contained in the sub-word, obtaining the probability that the acoustic features corresponding to the sub-word belong to each example phoneme.
5. The method according to any one of claims 2 to 4, characterized in that the sub-word is a state, a monophone, a triphone, or a syllable.
6. The method according to claim 1, characterized in that before the inputting of the acoustic features into the wake-up decision network and the outputting of the wake-up decision result, the method further comprises:
obtaining the acoustic likelihood ratio of the wake word, and judging whether the acoustic likelihood ratio meets a preset condition;
correspondingly, the inputting of the acoustic features into the wake-up decision network and the outputting of the wake-up decision result comprises:
if the acoustic likelihood ratio meets the preset condition, inputting the acoustic features into the wake-up decision network, and outputting the wake-up decision result.
7. The method according to claim 6, characterized in that the preset condition is that the acoustic likelihood ratio is less than a preset threshold or falls within a preset threshold interval.
8. The method according to claim 1, characterized in that the wake-up decision network is an encoder-decoder model.
9. A voice wake-up device, characterized by comprising:
an obtaining module, configured to obtain acoustic features of a wake word in voice data;
an output module, configured to input the acoustic features into a wake-up decision network and output a wake-up decision result, wherein the wake-up decision result is used to indicate whether the wake-up is successful, the wake-up decision network is obtained by training on sample acoustic features, and the wake-up decision network is used to perform a confidence judgment on the wake word.
10. An electronic device, characterized by comprising:
at least one processor; and
at least one memory communicatively connected to the processor, wherein:
the memory stores program instructions executable by the processor, and the processor calls the program instructions to execute the method according to any one of claims 1 to 8.
11. A non-transient computer-readable storage medium, characterized in that the non-transient computer-readable storage medium stores computer instructions, and the computer instructions cause the computer to execute the method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811184504.8A CN109273007B (en) | 2018-10-11 | 2018-10-11 | Voice wake-up method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109273007A true CN109273007A (en) | 2019-01-25 |
CN109273007B CN109273007B (en) | 2022-05-17 |
Family
ID=65196523
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811184504.8A Active CN109273007B (en) | 2018-10-11 | 2018-10-11 | Voice wake-up method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109273007B (en) |
Citations (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140214416A1 (en) * | 2013-01-30 | 2014-07-31 | Tencent Technology (Shenzhen) Company Limited | Method and system for recognizing speech commands |
CN105096939A (en) * | 2015-07-08 | 2015-11-25 | 百度在线网络技术(北京)有限公司 | Voice wake-up method and device |
CN105575395A (en) * | 2014-10-14 | 2016-05-11 | 中兴通讯股份有限公司 | Voice wake-up method and apparatus, terminal, and processing method thereof |
CN106910496A (en) * | 2017-02-28 | 2017-06-30 | 广东美的制冷设备有限公司 | Intelligent electrical appliance control and device |
CN107123417A (en) * | 2017-05-16 | 2017-09-01 | 上海交通大学 | Optimization method and system are waken up based on the customized voice that distinctive is trained |
CN107134279A (en) * | 2017-06-30 | 2017-09-05 | 百度在线网络技术(北京)有限公司 | A kind of voice awakening method, device, terminal and storage medium |
CN107369439A (en) * | 2017-07-31 | 2017-11-21 | 北京捷通华声科技股份有限公司 | A kind of voice awakening method and device |
CN107644638A (en) * | 2017-10-17 | 2018-01-30 | 北京智能管家科技有限公司 | Audio recognition method, device, terminal and computer-readable recording medium |
CN107767861A (en) * | 2016-08-22 | 2018-03-06 | 科大讯飞股份有限公司 | voice awakening method, system and intelligent terminal |
CN107871506A (en) * | 2017-11-15 | 2018-04-03 | 北京云知声信息技术有限公司 | The awakening method and device of speech identifying function |
CN107871499A (en) * | 2017-10-27 | 2018-04-03 | 珠海市杰理科技股份有限公司 | Audio recognition method, system, computer equipment and computer-readable recording medium |
CN108198548A (en) * | 2018-01-25 | 2018-06-22 | 苏州奇梦者网络科技有限公司 | A kind of voice awakening method and its system |
CN108281137A (en) * | 2017-01-03 | 2018-07-13 | 中国科学院声学研究所 | A kind of universal phonetic under whole tone element frame wakes up recognition methods and system |
CN108335696A (en) * | 2018-02-09 | 2018-07-27 | 百度在线网络技术(北京)有限公司 | Voice awakening method and device |
CN108564941A (en) * | 2018-03-22 | 2018-09-21 | 腾讯科技(深圳)有限公司 | Audio recognition method, device, equipment and storage medium |
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111862963B (en) * | 2019-04-12 | 2024-05-10 | 阿里巴巴集团控股有限公司 | Voice wakeup method, device and equipment |
CN111862963A (en) * | 2019-04-12 | 2020-10-30 | 阿里巴巴集团控股有限公司 | Voice wake-up method, device and equipment |
CN110310628A (en) * | 2019-06-27 | 2019-10-08 | 百度在线网络技术(北京)有限公司 | Wake up optimization method, device, equipment and the storage medium of model |
CN110473536A (en) * | 2019-08-20 | 2019-11-19 | 北京声智科技有限公司 | A kind of awakening method, device and smart machine |
CN110473539B (en) * | 2019-08-28 | 2021-11-09 | 思必驰科技股份有限公司 | Method and device for improving voice awakening performance |
CN110473539A (en) * | 2019-08-28 | 2019-11-19 | 苏州思必驰信息科技有限公司 | Promote the method and apparatus that voice wakes up performance |
CN111243604A (en) * | 2020-01-13 | 2020-06-05 | 苏州思必驰信息科技有限公司 | Training method for speaker recognition neural network model supporting multiple awakening words, speaker recognition method and system |
CN111489737A (en) * | 2020-04-13 | 2020-08-04 | 深圳市友杰智新科技有限公司 | Voice command recognition method and device, storage medium and computer equipment |
CN111489737B (en) * | 2020-04-13 | 2020-11-10 | 深圳市友杰智新科技有限公司 | Voice command recognition method and device, storage medium and computer equipment |
CN111696562B (en) * | 2020-04-29 | 2022-08-19 | 华为技术有限公司 | Voice wake-up method, device and storage medium |
CN111696562A (en) * | 2020-04-29 | 2020-09-22 | 华为技术有限公司 | Voice wake-up method, device and storage medium |
CN112164395A (en) * | 2020-09-18 | 2021-01-01 | 北京百度网讯科技有限公司 | Vehicle-mounted voice starting method and device, electronic equipment and storage medium |
CN112669818A (en) * | 2020-12-08 | 2021-04-16 | 北京地平线机器人技术研发有限公司 | Voice wake-up method and device, readable storage medium and electronic equipment |
CN112669818B (en) * | 2020-12-08 | 2022-12-02 | 北京地平线机器人技术研发有限公司 | Voice wake-up method and device, readable storage medium and electronic equipment |
CN113327618A (en) * | 2021-05-17 | 2021-08-31 | 西安讯飞超脑信息科技有限公司 | Voiceprint distinguishing method and device, computer equipment and storage medium |
CN113327617A (en) * | 2021-05-17 | 2021-08-31 | 西安讯飞超脑信息科技有限公司 | Voiceprint distinguishing method and device, computer equipment and storage medium |
CN113327617B (en) * | 2021-05-17 | 2024-04-19 | 西安讯飞超脑信息科技有限公司 | Voiceprint discrimination method, voiceprint discrimination device, computer device and storage medium |
CN113327618B (en) * | 2021-05-17 | 2024-04-19 | 西安讯飞超脑信息科技有限公司 | Voiceprint discrimination method, voiceprint discrimination device, computer device and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN109273007B (en) | 2022-05-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109273007A (en) | Voice awakening method and device | |
CN110838289B (en) | Wake-up word detection method, device, equipment and medium based on artificial intelligence | |
CN109036384B (en) | Audio recognition method and device | |
CN108831439B (en) | Voice recognition method, device, equipment and system | |
CN107767863B (en) | Voice awakening method and system and intelligent terminal | |
CN108615525B (en) | Voice recognition method and device | |
CN108899013B (en) | Voice search method and device and voice recognition system | |
CN110534099A (en) | Voice wakes up processing method, device, storage medium and electronic equipment | |
CN110428820B (en) | Chinese and English mixed speech recognition method and device | |
CN108564940A (en) | Audio recognition method, server and computer readable storage medium | |
CN106940998A (en) | A kind of execution method and device of setting operation | |
CN110085261A (en) | A kind of pronunciation correction method, apparatus, equipment and computer readable storage medium | |
CN110570873B (en) | Voiceprint wake-up method and device, computer equipment and storage medium | |
CN105845133A (en) | Voice signal processing method and apparatus | |
CN110298463A (en) | Meeting room preordering method, device, equipment and storage medium based on speech recognition | |
WO2022010471A1 (en) | Identification and utilization of misrecognitions in automatic speech recognition | |
CN113096647B (en) | Voice model training method and device and electronic equipment | |
CN109741735A (en) | The acquisition methods and device of a kind of modeling method, acoustic model | |
CN109074809B (en) | Information processing apparatus, information processing method, and computer-readable storage medium | |
CN110853669A (en) | Audio identification method, device and equipment | |
CN112669818B (en) | Voice wake-up method and device, readable storage medium and electronic equipment | |
CN114171002A (en) | Voice recognition method and device, electronic equipment and storage medium | |
CN111179941B (en) | Intelligent device awakening method, registration method and device | |
CN111640423B (en) | Word boundary estimation method and device and electronic equipment | |
CN113409768A (en) | Pronunciation detection method, pronunciation detection device and computer readable medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
TA01 | Transfer of patent application right | Effective date of registration: 2019-06-17. Applicant after: Xi'an Xunfei Super Brain Information Technology Co., Ltd., Yunhui Valley D Block 101, No. 156 Tiangu Eighth Road, Software New City, Xi'an High-tech Zone, Xi'an, Shaanxi Province, 710000. Applicant before: iFlytek Co., Ltd., 666 Wangjiang West Road, Hefei High-tech Development Zone, Anhui, 230088. |
GR01 | Patent grant | ||