CN108182937A

CN108182937A - Keyword recognition method, device, equipment and storage medium

Info

Publication number: CN108182937A
Application number: CN201810045490.5A
Authority: CN
Inventors: Ye Qingfeng (叶青峰); Huang Meiyu (黄美玉)
Original assignee: Chumen Wenwen Information Technology Co Ltd
Current assignee: Mobvoi Innovation Technology Co Ltd
Priority date: 2018-01-17
Filing date: 2018-01-17
Publication date: 2018-06-19
Anticipated expiration: 2038-01-17
Also published as: CN108182937B

Abstract

An embodiment of the invention discloses a keyword recognition method, device, equipment and storage medium. The method includes: obtaining voice information input by a user; inputting the voice information into a pre-trained syllable classification model and obtaining the syllable recognition result that the model outputs and that is associated with a preset keyword; obtaining a confidence score of the voice information according to the syllable recognition result; and, if the confidence score exceeds a preset threshold, determining that the keyword has been recognized in the voice information. The technical solution of the invention makes the choice of custom keywords more flexible and simplifies the process of building a keyword spotting system.

Description

Keyword recognition method, device, equipment and storage medium

Technical field

The embodiments of the present invention relate to speech technology, and in particular to a keyword recognition method, device, equipment and storage medium.

Background

As speech technology matures and becomes widespread, voice control has become one of the most convenient, fastest and most attractive modes of human-machine interaction.

In the robotics field, keyword recognition usually serves as the first step in switching on a robot or triggering it to perform an operation, and as the basis for holding a dialogue with it. In other words, the purpose of keyword recognition is to let the robot accurately receive the expected signal so that it responds promptly when the user needs it.

The prior art mainly detects keywords by decomposing the acquired voice information into acoustic segments, and this decomposition requires an acoustic-segment classifier trained for the specific keyword. For example, building a detection system with reasonable performance for the keyword "W" requires collecting tens of hours of recordings containing "W"; switching the system to the keyword "V" requires retraining it. This limits the user's flexibility in choosing a custom keyword.

Summary of the invention

Embodiments of the present invention provide a keyword recognition method, device, equipment and storage medium that make the choice of custom keywords more flexible and simplify the process of building a keyword spotting system.

In a first aspect, an embodiment of the present invention provides a keyword recognition method, including:

obtaining voice information input by a user;

inputting the voice information into a pre-trained syllable classification model, and obtaining a syllable recognition result that is output by the syllable classification model and associated with a preset keyword;

obtaining a confidence score of the voice information according to the syllable recognition result;

if the confidence score exceeds a preset threshold, determining that the keyword has been recognized in the voice information.

In a second aspect, an embodiment of the present invention further provides a keyword recognition device, including:

a voice acquisition module, configured to obtain voice information input by a user;

a syllable recognition module, configured to input the voice information into a pre-trained syllable classification model and obtain a syllable recognition result that is output by the model and associated with a preset keyword;

a confidence acquisition module, configured to obtain a confidence score of the voice information according to the syllable recognition result;

a keyword determination module, configured to determine that the keyword has been recognized in the voice information if the confidence score exceeds a preset threshold.

In a third aspect, an embodiment of the present invention further provides equipment, including:

one or more processors;

a memory, configured to store one or more programs;

wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the keyword recognition method according to any embodiment of the present invention.

In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the keyword recognition method according to any embodiment of the present invention.

In the embodiments of the present invention, voice information input by a user is obtained and input into a pre-trained syllable classification model; the syllable recognition result that the model outputs for a preset keyword is obtained; a confidence score of the voice information is computed from that result; and if the confidence score exceeds a preset threshold, the keyword is determined to have been recognized in the voice information. Because the number of syllables in a language is finite, this approach avoids the prior-art problem that detecting keywords by acoustic-segment decomposition limits the user's freedom to choose custom keywords. It makes the choice of custom keywords more flexible and simplifies the process of building a keyword spotting system.

Description of the drawings

Fig. 1 is a flowchart of a keyword recognition method provided by Embodiment 1 of the present invention;

Fig. 2 is a flowchart of a keyword recognition method provided by Embodiment 2 of the present invention;

Fig. 3 is a structural diagram of a keyword recognition device provided by Embodiment 3 of the present invention;

Fig. 4 is a structural diagram of equipment provided by Embodiment 4 of the present invention.

Detailed description of the embodiments

The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and do not limit it. Note also that, for ease of description, the drawings show only the parts related to the present invention rather than the entire structure.

Embodiment one

Fig. 1 is a flowchart of a keyword recognition method provided by Embodiment 1 of the present invention. The method is applicable to keyword recognition and may be performed by a keyword recognition device, which may be implemented in hardware and/or software and may generally be integrated into a robot or any equipment that has a keyword recognition function. The method specifically includes the following steps:

S110: obtain voice information input by a user.

Optionally, the voice information may be obtained through a voice collection apparatus built into the equipment, or through another voice capture device. The voice information may be a character, a word or a sentence spoken by the user, or a segment of speech captured from ambient sound. For example, the ambient sound may be monitored in real time: when the captured sound intensity exceeds a preset intensity, recording of the voice information starts at the current moment and continues until no sound above the preset intensity has been captured for a set period of time.
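The intensity-triggered capture described above can be sketched as a simple energy-based endpoint detector. This is a minimal sketch, not the patent's implementation: the function name `capture_segment`, the 10 ms analysis frame, the -30 dBFS start threshold and the 0.5 s hang-over time are all hypothetical choices, and the samples are assumed to be floats in [-1, 1].

```python
import numpy as np


def capture_segment(samples, sr=16000, frame_ms=10, threshold_db=-30.0, hangover_s=0.5):
    """Energy-based capture with hypothetical parameters.

    Starts at the first frame louder than threshold_db and stops once the level
    has stayed below it for hangover_s seconds. Returns (start, end) sample
    indices, or None if no frame exceeds the threshold.
    """
    frame = int(sr * frame_ms / 1000)
    n_frames = len(samples) // frame
    if n_frames == 0:
        return None
    frames = np.asarray(samples[: n_frames * frame], dtype=float).reshape(n_frames, frame)
    # RMS level of each frame in dB relative to full scale (samples assumed in [-1, 1])
    level_db = 20.0 * np.log10(np.sqrt(np.mean(frames ** 2, axis=1)) + 1e-12)
    loud = level_db > threshold_db
    if not loud.any():
        return None
    start = int(np.argmax(loud))               # first loud frame
    hang = int(hangover_s * 1000 / frame_ms)   # consecutive quiet frames needed to stop
    end, quiet = n_frames, 0
    for i in range(start, n_frames):
        quiet = 0 if loud[i] else quiet + 1
        if quiet >= hang:
            end = i - quiet + 1                # first frame of the quiet run
            break
    return start * frame, end * frame
```

A fixed dB threshold keeps the sketch short; a deployed system would more likely adapt the threshold to the measured background level.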

Preferably, before the voice information input by the user is obtained, the method further includes:

obtaining standard keyword samples with different syllable information, where each standard keyword sample includes the voice information of a standard keyword and the standard syllable decomposition result corresponding to that keyword;

training on the standard keyword samples with a set neural network algorithm to obtain the syllable classification model.

For example, the syllable classification model used in this embodiment can perform syllable decomposition and syllable classification on voice information. Taking Chinese as an example, a Chinese syllable consists of an initial and a final; there are about 400 toneless combinations in total and more than 1,300 toned combinations. Therefore, for any language that can be decomposed into syllables, the syllable classification model only has to be trained on a finite set of syllables, so exhaustive training is possible. Standard keyword samples with different syllable information can also be obtained by voice recording. The syllable information may be the combination of the initial and final corresponding to the standard keyword, or the combination of its initial, final and tone. Optionally, the voice information in a standard keyword sample may be collected sample audio, and the standard syllable decomposition result may be the input decomposition of the standard keyword into its corresponding initials, finals and tones.

Optionally, the collected standard keyword samples may be trained with a set neural network algorithm. A neural network, that is, an artificial neural network, consists of many neurons joined by adjustable connection weights and features large-scale parallel processing, distributed information storage, and good self-organizing and self-learning capabilities. Preferably, a deep neural network is trained on the standard keyword samples to obtain the syllable classification model needed for syllable decomposition and classification of voice information.
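To make the training step concrete, here is a minimal NumPy sketch of a frame-level syllable classifier: one ReLU hidden layer, a softmax output over syllable classes, cross-entropy loss and full-batch gradient descent. It stands in for the "set neural network algorithm" only as an illustration; the patent does not specify the architecture, and the names `train_syllable_mlp` and `predict_proba`, the layer size, learning rate and epoch count are all assumptions. Inputs are feature frames and targets are integer syllable-class indices.

```python
import numpy as np


def train_syllable_mlp(X, y, n_classes, hidden=64, lr=0.1, epochs=500, seed=0):
    """Train a one-hidden-layer softmax classifier mapping frames to syllable classes."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W1 = rng.normal(0.0, 1.0 / np.sqrt(d), (d, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), (hidden, n_classes))
    b2 = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                      # one-hot targets
    losses = []
    for _ in range(epochs):
        H = np.maximum(X @ W1 + b1, 0.0)          # ReLU hidden layer
        logits = H @ W2 + b2
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)         # softmax posteriors
        losses.append(-np.mean(np.log(P[np.arange(len(y)), y] + 1e-12)))
        G = (P - Y) / len(y)                      # gradient of the loss w.r.t. logits
        gW2, gb2 = H.T @ G, G.sum(axis=0)
        dH = (G @ W2.T) * (H > 0)                 # back-propagate through ReLU
        gW1, gb1 = X.T @ dH, dH.sum(axis=0)
        W1 -= lr * gW1
        b1 -= lr * gb1
        W2 -= lr * gW2
        b2 -= lr * gb2
    return (W1, b1, W2, b2), losses


def predict_proba(params, X):
    """Per-frame posterior probabilities over syllable classes."""
    W1, b1, W2, b2 = params
    logits = np.maximum(X @ W1 + b1, 0.0) @ W2 + b2
    logits -= logits.max(axis=1, keepdims=True)
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)
```

A production system would use a deep-learning framework, mini-batches and many more layers; the point here is only the mapping from labeled frames to per-frame syllable posteriors.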

S120: input the voice information into the pre-trained syllable classification model, and obtain the syllable recognition result that is output by the model and associated with the preset keyword.

The preset keyword may be a keyword that the equipment expects to receive; optionally, it may be a single character or a word, without limitation. The acquired voice information is input into the pre-trained syllable classification model, which performs syllable decomposition and classification on it to yield a syllable recognition result associated with the preset keyword. Optionally, this result may be the probability that the decomposed and classified voice information contains the syllables corresponding to the preset keyword.

For example, the syllable classification model first decomposes the voice information into syllables. The utterance "请立即开始" ("please start immediately", pinyin qing li ji kai shi) decomposes into the ten-unit sequence of initials and finals "q" "ing" "l" "i" "j" "i" "k" "ai" "sh" "i". The model then classifies this sequence: when the preset keyword is "开始" ("start", syllables "k" "ai" "sh" "i"), it computes, for each unit of the decomposed sequence, the probability that the unit is one of the four units "k", "ai", "sh" and "i".
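The initial/final decomposition in the example can be reproduced with a small pinyin splitter. This sketch assumes the utterance is already available as space-separated toneless pinyin (converting Chinese characters to pinyin is not shown), and the initial list, which treats "y" and "w" as initials as many speech systems do, is an assumption; `split_syllable` and `to_units` are hypothetical names. It only builds the label sequence; the acoustic model itself is a separate matter.

```python
# Two-letter initials come first so that "shi" splits as "sh" + "i", not "s" + "hi".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")


def split_syllable(syl):
    """Split one toneless pinyin syllable into (initial, final); initial may be ''."""
    for ini in INITIALS:
        if syl.startswith(ini) and len(syl) > len(ini):
            return ini, syl[len(ini):]
    return "", syl  # zero-initial syllable such as "ai" or "er"


def to_units(pinyin_text):
    """Flatten space-separated pinyin into a sequence of initials and finals."""
    units = []
    for syl in pinyin_text.split():
        ini, fin = split_syllable(syl)
        units.extend(u for u in (ini, fin) if u)
    return units
```

For instance, `to_units("qing li ji kai shi")` gives the ten units listed above, and `to_units("kai shi")` gives the keyword units `["k", "ai", "sh", "i"]`.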

Because the preset keyword can itself be decomposed and classified by the same syllable classification model, one model serves for recognizing different keywords and no new classification model has to be built for a new keyword; this makes the choice of custom keywords more flexible. Moreover, once trained, the syllable classification model applies generally to syllable decomposition and classification of any voice information, which also simplifies the process of building the keyword spotting system.

S130: obtain the confidence score of the voice information according to the syllable recognition result.

Optionally, the syllable recognition result may be processed numerically to compute the confidence score of the voice information. Specifically, the confidence of the voice information is scored according to the probabilities, in the syllable recognition result, of all the syllables corresponding to the preset keyword. The confidence score measures the overall probability that the voice information contains the preset keyword: if that probability is high, the score is high; if it is low, the score is low. Because the syllable classification model has a certain error rate, deriving a confidence score from the syllable recognition result grades the reliability of the recognition result, and this second check on its correctness improves the accuracy and reliability of keyword recognition.

S140: if the confidence score exceeds the preset threshold, determine that the keyword has been recognized in the voice information.

For example, if the confidence score exceeds the preset threshold, it is highly credible that the voice information contains the keyword, and the preset keyword is determined to have been recognized. Conversely, if the confidence score does not exceed the preset threshold, it is not credible that the voice information contains the keyword, and the preset keyword is determined not to have been recognized. The preset threshold may be set according to actual requirements or to historical empirical values, without limitation here.

In the technical solution of this embodiment, voice information input by the user is obtained and input into a pre-trained syllable classification model; the syllable recognition result that the model outputs for the preset keyword is obtained; the confidence score of the voice information is computed from that result; and if the score exceeds the preset threshold, the keyword is determined to have been recognized in the voice information. By exploiting the finite number of syllables, the solution avoids the prior-art problem that acoustic-segment decomposition limits the user's freedom to choose custom keywords, making the choice of custom keywords more flexible and simplifying the process of building the keyword spotting system.

Embodiment two

Fig. 2 is a flowchart of a keyword recognition method provided by Embodiment 2 of the present invention. This embodiment builds on the above embodiment and provides a preferred keyword recognition method; specifically, it further refines the step of inputting the voice information into the pre-trained syllable classification model and obtaining the syllable recognition result that the model outputs for the preset keyword. The method specifically includes the following steps:

S210: obtain voice information input by the user.

S220: when a keyword setting instruction is detected, obtain a keyword input by the user.

The keyword setting instruction makes the equipment accept a keyword setting. It may be sent from the equipment's remote control or generated when the user presses a keyword setting button, without limitation. Optionally, the user may input the keyword by voice; correspondingly, the equipment may obtain the keyword through its voice collection apparatus.

S230: adjust the syllable parameters in the syllable classification model according to the syllable information corresponding to the keyword.

Optionally, the syllable information corresponding to the keyword may be obtained by inputting the acquired keyword into the syllable classification model for syllable decomposition, or by receiving syllable information entered manually by the user. The syllable information may include the specific syllables that make up the keyword, for example the initials, finals and tones of the corresponding Chinese characters.

Optionally, the syllable parameters in the syllable classification model may be the key parameters for judging whether the acquired voice information contains the keyword; the syllable information corresponding to the keyword may be used directly as these syllable parameters.

Because the total number of syllable types is fixed, any keyword can be broken down into a combination of some of these finitely many syllable types, which makes the syllable classification model general to any character or word. Adjusting the syllable parameters in the model according to the keyword's syllable information therefore allows a keyword to be set or replaced flexibly without retraining a classification model: setting or replacing a keyword only requires adjusting the syllable parameters in the model.
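The patent does not say how the syllable parameters are represented. One possible reading, sketched here purely as an assumption: the model outputs posteriors over the whole syllable inventory, and "adjusting the syllable parameters" amounts to selecting the columns for the keyword's syllables and pooling every other column into an "other" class, which yields the vector form used in S270. The function `keyword_posteriors` and the toy inventory are hypothetical.

```python
import numpy as np


def keyword_posteriors(frame_post, inventory, keyword_units):
    """Map (frames, |inventory|) posteriors to [P(u1|x), ..., P(un|x), P(other|x)]."""
    frame_post = np.asarray(frame_post, dtype=float)
    index = {u: i for i, u in enumerate(inventory)}
    cols = [index[u] for u in keyword_units]      # the keyword's "syllable parameters"
    kw = frame_post[:, cols]
    mask = np.ones(len(inventory), dtype=bool)
    mask[sorted(set(cols))] = False               # every non-keyword unit ...
    other = frame_post[:, mask].sum(axis=1, keepdims=True)  # ... is pooled into "other"
    return np.hstack([kw, other])
```

Under this reading, switching the keyword from "kai shi" to another word only changes the `keyword_units` argument; the trained model is untouched.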

S240: extract acoustic features from the voice information at a preset sampling frequency to obtain an acoustic feature frame group.

The purpose of acoustic feature extraction is to obtain robust, representative features from the acquired voice information. Optionally, the voice information may be passed through a Mel filter bank with 25 ms wide frames and a 10 ms frame shift to extract 23-dimensional feature vectors; the preset sampling rate is preferably 16,000 Hz. For example, for a 5-second speech signal the extracted acoustic features are 500 frames, each of which is a 23-dimensional vector. Specifically, the acoustic feature frame group may contain the multiple extracted frames of acoustic features.
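The feature extraction above (16 kHz audio, 25 ms frames, 10 ms shift, 23 Mel filters) can be sketched in NumPy as log Mel filter-bank energies. The Hamming window, 512-point FFT, the mel formula 2595·log10(1 + f/700), the log floor and the function names are standard choices assumed here, not details given by the patent.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_mels=23, n_fft=512, sr=16000):
    """Triangular filters equally spaced on the mel scale from 0 Hz to sr/2."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):    # rising edge
            fb[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):   # falling edge
            fb[m - 1, k] = (right - k) / (right - center)
    return fb


def log_mel_features(signal, sr=16000, win_ms=25, hop_ms=10, n_mels=23, n_fft=512):
    """Return a (frames, n_mels) array of log Mel filter-bank energies."""
    win = int(sr * win_ms / 1000)   # 400 samples at 16 kHz
    hop = int(sr * hop_ms / 1000)   # 160 samples at 16 kHz
    if len(signal) < win:
        raise ValueError("signal shorter than one analysis window")
    n = 1 + (len(signal) - win) // hop  # no edge padding
    idx = np.arange(win)[None, :] + hop * np.arange(n)[:, None]
    frames = np.asarray(signal, dtype=float)[idx] * np.hamming(win)
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2 / n_fft
    return np.log(power @ mel_filterbank(n_mels, n_fft, sr).T + 1e-10)
```

Without edge padding, a 5 s signal gives 1 + (80000 - 400) // 160 = 498 frames; implementations that pad the signal edges produce 5 s / 10 ms = 500 frames, matching the figure quoted above.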

S250: input each frame of acoustic features in the acoustic feature frame group into the pre-trained syllable classification model.

Optionally, a single syllable classification model may be provided, into which the frames are input one by one in frame order; alternatively, multiple syllable classification models may be provided and the frames input simultaneously, which shortens the syllable decomposition and classification time. Inputting only the frames of the acoustic feature frame group into the syllable classification model has the advantage of saving decomposition and classification time, because syllable decomposition and classification are performed only on the most characteristic acoustic features of the voice information.

Preferably, before each frame of acoustic features in the acoustic feature frame group is input into the pre-trained syllable classification model, the method further includes:

performing toneless processing on each frame of acoustic features in the acoustic feature frame group.

Specifically, toneless processing removes tone from each frame of acoustic features. Taking Chinese as an example, the syllable classification model then only needs to decompose each frame into initials and finals. The purpose of toneless processing is to strip tone from the sounds in the acoustic feature frames, which simplifies the work of the syllable classification model, makes the acoustic features more robust, and prevents keywords from going unrecognized because some characters are pronounced with different tones in dialects.
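The text describes toneless processing as removing tone from the acoustic features. A simple way to obtain the same effect on the model's output side, sketched here purely as an assumption rather than the patent's procedure, is to collapse toned syllable classes (written in numbered pinyin such as "ai1" to "ai4", with 5 for the neutral tone) into one toneless class by summing their posteriors. The function name `merge_tones` is hypothetical.

```python
import re
from collections import defaultdict


def merge_tones(posteriors):
    """Collapse toned labels such as 'ai1'..'ai4' into toneless 'ai' by summing posteriors."""
    merged = defaultdict(float)
    for label, p in posteriors.items():
        merged[re.sub(r"[1-5]$", "", label)] += p  # strip a trailing tone digit
    return dict(merged)
```

Summing is appropriate because the toned classes are mutually exclusive, so their union's probability is the sum of the parts.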

S260: obtain the syllable recognition result that the syllable classification model outputs for each frame of acoustic features and that matches the syllable parameters.

Optionally, in the output of the syllable classification model, each frame of acoustic features corresponds to one syllable recognition result, which matches the syllable parameters in the model. Because the syllable parameters are adjusted according to the configured keyword, the syllable recognition result is also related to that keyword. Optionally, the syllable recognition result indicates how similar each syllable decomposed from each acoustic feature frame is to the syllables of the configured keyword, and this similarity can then be used to judge whether the frame contains a given syllable of the keyword.

S270: obtain the confidence score of the voice information according to the syllable recognition result.

Preferably, the syllable recognition result is a group of posterior probabilities in vector form.

As a concrete example, when the configured keyword is "开始" ("start"), with syllables "k" "ai" "sh" "i", inputting each frame of acoustic features into the syllable classification model produces a posterior probability vector [P(k|x), P(ai|x), P(sh|x), P(i|x), P(other|x)], where x may be the syllable decomposed from the input acoustic feature frame.

Correspondingly, obtaining the confidence score of the voice information according to the syllable recognition result includes:

performing moving-average smoothing on the vector-form posterior probability group;

extracting the sequence maxima from the smoothed result;

computing the geometric mean of the sequence maxima, and using this geometric mean as the confidence score of the voice information.
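The three steps can be written out as formulas. The notation is an assumption for illustration, following the posterior-smoothing scheme commonly used in DNN keyword spotting: p_{ij} is the posterior of the i-th keyword syllable at frame j, w_s is the smoothing window length in frames, T is the number of frames, and n is the number of keyword syllables (the "other" class is excluded). Taking the maximum over the whole utterance rather than over a sliding window is also an assumption.

```latex
p'_{ij} = \frac{1}{j - h_s + 1} \sum_{k = h_s}^{j} p_{ik},
\qquad h_s = \max(1,\; j - w_s + 1)

m_i = \max_{1 \le j \le T} p'_{ij},
\qquad
\text{confidence} = \Bigl( \prod_{i=1}^{n} m_i \Bigr)^{1/n}
```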

The moving-average method is a simple smoothing and forecasting technique. Its basic idea is to step through a time series item by item and successively compute the average over a window containing a fixed number of items, so as to reflect the long-term trend. For example, after the vector-form posterior probability group is smoothed with a moving average, the sequence maxima are extracted from the smoothed result and their geometric mean is computed; this geometric mean serves as the confidence score of the acquired voice information.
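The smoothing, maximum and geometric-mean steps, followed by the threshold test of S280, can be sketched as below. The input is assumed to be a (frames, n + 1) array whose last column is P(other|x); the 30-frame window, the 0.5 threshold, the small epsilon inside the logarithm and all function names are hypothetical choices for illustration.

```python
import numpy as np


def moving_average(post, w):
    """Trailing moving average: row j averages frames max(0, j - w + 1) .. j."""
    post = np.asarray(post, dtype=float)
    csum = np.cumsum(post, axis=0)
    out = np.empty_like(post)
    for j in range(len(post)):
        lo = max(0, j - w + 1)
        window_sum = (csum[j] - csum[lo - 1]) if lo > 0 else csum[j]
        out[j] = window_sum / (j - lo + 1)
    return out


def confidence(post, w_smooth=30):
    """Geometric mean of the per-syllable maxima of the smoothed posteriors."""
    smoothed = moving_average(post, w_smooth)[:, :-1]  # drop the "other" column
    maxima = smoothed.max(axis=0)                      # sequence maximum per syllable
    return float(np.exp(np.mean(np.log(maxima + 1e-12))))


def keyword_detected(post, threshold=0.5, w_smooth=30):
    """S280: the keyword is recognized if the confidence exceeds the threshold."""
    return confidence(post, w_smooth) > threshold
```

The geometric mean means that every keyword syllable must reach a high smoothed posterior somewhere in the utterance; a single syllable that never appears pulls the score toward zero, whereas an arithmetic mean would let the other syllables compensate.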

S280: if the confidence score exceeds the preset threshold, determine that the keyword has been recognized in the voice information.

In the technical solution of this embodiment, voice information input by the user is obtained; the syllable parameters in the syllable classification model are adjusted according to the syllable information of a keyword input by the user; acoustic features are extracted from the voice information at a preset sampling frequency; each frame of the resulting acoustic feature frame group is input into the pre-trained syllable classification model; the output syllable recognition result is obtained; and the confidence score of the voice information is computed from that result. If the confidence score exceeds the preset threshold, the keyword is determined to have been recognized in the voice information. By exploiting the finite number of syllables, the solution avoids the prior-art problem that acoustic-segment decomposition limits the user's freedom to choose custom keywords; it makes the choice of custom keywords more flexible, simplifies the process of building the keyword spotting system, and shortens keyword recognition time.

On the basis of the above embodiments, keyword recognition may be applied to waking up the equipment; preferably, the keyword is a wake word;

after it is determined that the keyword has been recognized in the voice information, the method further includes: performing an equipment wake-up operation.

Optionally, the equipment wake-up operation may include turning on the display screen, or entering dialogue mode, for example proactively greeting the user, without limitation. Keyword recognition can of course also be applied to voice control of the equipment. In that case the keyword is a command word: after it is determined that the keyword has been recognized in the voice information, a corresponding instruction may be generated and the task corresponding to that instruction executed.
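The routing described above, where a wake word triggers the wake-up operation and a command word triggers its corresponding task, can be sketched as follows. The class `KeywordActions` and the example keywords are hypothetical; the actual wake-up operations (turning on the screen, greeting the user) are represented only by a state flag and return strings.

```python
from typing import Callable, Dict


class KeywordActions:
    """Routes a recognized keyword to an equipment action (hypothetical sketch)."""

    def __init__(self, wake_word: str, commands: Dict[str, Callable[[], str]]):
        self.wake_word = wake_word
        self.commands = dict(commands)  # command word -> task to execute
        self.awake = False

    def on_detected(self, keyword: str) -> str:
        if keyword == self.wake_word:
            self.awake = True  # stands in for e.g. turning on the display
            return "woken"
        task = self.commands.get(keyword)
        return task() if task is not None else "ignored"
```

Keeping the table of command words separate from the recognizer mirrors the patent's point that keywords can be changed without touching the syllable classification model.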

Embodiment three

Fig. 3 is a structural diagram of a keyword recognition device provided by Embodiment 3 of the present invention. Referring to Fig. 3, the keyword recognition device includes a voice acquisition module 310, a syllable recognition module 320, a confidence acquisition module 330 and a keyword determination module 340, each of which is described below.

The voice acquisition module 310 is configured to obtain voice information input by a user;

the syllable recognition module 320 is configured to input the voice information into a pre-trained syllable classification model and obtain a syllable recognition result that is output by the model and associated with a preset keyword;

the confidence acquisition module 330 is configured to obtain a confidence score of the voice information according to the syllable recognition result;

the keyword determination module 340 is configured to determine that the keyword has been recognized in the voice information if the confidence score exceeds a preset threshold.
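The four modules of Fig. 3 can be wired together as a small pipeline. This is a hypothetical sketch: each module is modeled as an injected callable, and the class name `KeywordRecognizer`, its field names and the default threshold of 0.5 are illustrative assumptions, not part of the patent.

```python
from dataclasses import dataclass
from typing import Callable

import numpy as np


@dataclass
class KeywordRecognizer:
    acquire: Callable[[], np.ndarray]              # voice acquisition module 310
    recognize: Callable[[np.ndarray], np.ndarray]  # syllable recognition module 320
    score: Callable[[np.ndarray], float]           # confidence acquisition module 330
    threshold: float = 0.5                         # used by keyword determination module 340

    def run(self) -> bool:
        audio = self.acquire()
        posteriors = self.recognize(audio)
        return bool(self.score(posteriors) > self.threshold)
```

Injecting the modules as callables lets each one be replaced independently, for example swapping the syllable model without touching capture or scoring.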

With the keyword recognition device provided by this embodiment, voice information input by the user is obtained and input into a pre-trained syllable classification model; the syllable recognition result that the model outputs for the preset keyword is obtained; the confidence score of the voice information is computed from that result; and if the score exceeds the preset threshold, the keyword is determined to have been recognized in the voice information. By exploiting the finite number of syllables, the device avoids the prior-art problem that acoustic-segment decomposition limits the user's freedom to choose custom keywords, making the choice of custom keywords more flexible and simplifying the process of building the keyword spotting system.

Optionally, the device further includes:

a keyword input module, configured to obtain a keyword input by the user when a keyword setting instruction is detected, before the voice information is input into the pre-trained syllable classification model and the syllable recognition result associated with the preset keyword is obtained from the model;

a parameter adjustment module, configured to adjust the syllable parameters in the syllable classification model according to the syllable information corresponding to the keyword.

Optionally, the syllable recognition module 320 may include:

a feature extraction submodule, configured to extract acoustic features from the voice information at a preset sampling frequency to obtain an acoustic feature frame group;

a feature input submodule, configured to input each frame of acoustic features in the acoustic feature frame group into the pre-trained syllable classification model;

a result acquisition submodule, configured to obtain the syllable recognition result that the syllable classification model outputs for each frame of acoustic features and that matches the syllable parameters.

Optionally, the syllable recognition module 320 may further include:

a toneless processing submodule, configured to perform toneless processing on each frame of acoustic features in the acoustic feature frame group before the frames are input into the pre-trained syllable classification model.

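Optionally, the syllable recognition result is a group of posterior probabilities in vector form;

the confidence acquisition module 330 may specifically be configured to:

perform moving-average smoothing on the vector-form posterior probability group;

extract the sequence maxima from the smoothed result;

compute the geometric mean of the sequence maxima and use it as the confidence score of the voice information.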

Optionally, the device further includes:

a sample acquisition module, configured to obtain, before the voice information input by the user is obtained, standard keyword samples with different syllable information, where each standard keyword sample includes the voice information of a standard keyword and the standard syllable decomposition result corresponding to that keyword;

a model training module, configured to train on the standard keyword samples with a set neural network algorithm to obtain the syllable classification model.

Optionally, the keyword is a wake word;

the device further includes: an equipment wake-up module, configured to perform an equipment wake-up operation after it is determined that the keyword has been recognized in the voice information.

The above product can perform the method provided by any embodiment of the present invention and has the functional modules and beneficial effects corresponding to performing that method.

Embodiment four

Fig. 4 is a structural diagram of equipment provided by Embodiment 4 of the present invention. As shown in Fig. 4, this embodiment provides equipment including a processor 41 and a memory 42. The equipment may have one or more processors; Fig. 4 takes one processor 41 as an example. The processor 41 and the memory 42 may be connected by a bus or in other ways; Fig. 4 takes a bus connection as an example.

In this embodiment, the keyword recognition device provided by the above embodiment is integrated into the processor 41 of the equipment. In addition, the memory 42 of the equipment, as a computer-readable storage medium, may store one or more programs. These may be software programs, computer-executable programs and modules, such as the program instructions/modules corresponding to the keyword recognition method in the embodiments of the present invention (for example, the modules of the keyword recognition device shown in Fig. 3: the voice acquisition module 310, the syllable recognition module 320, the confidence acquisition module 330 and the keyword determination module 340). By running the software programs, instructions and modules stored in the memory 42, the processor 41 executes the various functional applications and data processing of the equipment, that is, implements the keyword recognition method of the above method embodiments.

The memory 42 may include a program storage area and a data storage area: the program storage area may store the operating system and the application programs required by at least one function, and the data storage area may store data created through use of the equipment. In addition, the memory 42 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device or other non-volatile solid-state storage device. In some examples, the memory 42 may further include memory located remotely from the processor 41 and connected to the equipment through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks and combinations thereof.

Further, when the one or more programs included in the device are executed by the one or more processors 41, the programs perform the following operations:

Obtaining voice information input by a user; inputting the voice information into a pre-trained syllable classification model, and obtaining the syllable recognition result, associated with a preset keyword, that the syllable classification model outputs; obtaining a confidence score of the voice information according to the syllable recognition result; and, if the confidence score is greater than a preset threshold, determining that the keyword is recognized in the voice information.
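The four-step flow above, together with the module split named for Fig. 3 (modules 310 to 340), can be sketched as a small Python class. This is a minimal illustration under assumptions: the class name `KeywordRecognizer`, the callable signatures, and the stub model and toy scoring function are all hypothetical. The patent does not say how the modules pass data to each other or how the syllable classification model is implemented.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Frame = Sequence[float]          # one acoustic feature frame (hypothetical representation)
Posteriors = List[List[float]]   # per-frame posterior vectors over the keyword's syllables


@dataclass
class KeywordRecognizer:
    """Hypothetical wiring of modules 310-340; names are illustrative only."""
    acquire_voice: Callable[[], List[Frame]]                  # voice acquisition module 310
    classify_syllables: Callable[[List[Frame]], Posteriors]   # syllable identification module 320
    score: Callable[[Posteriors], float]                      # confidence acquisition module 330
    threshold: float                                          # preset threshold used by module 340

    def detect(self) -> bool:
        frames = self.acquire_voice()
        posteriors = self.classify_syllables(frames)
        confidence = self.score(posteriors)
        # Keyword determination module 340: report the keyword only if the
        # confidence score is strictly greater than the preset threshold.
        return confidence > self.threshold


def fake_voice() -> List[Frame]:
    # Three dummy 4-dimensional feature frames stand in for captured audio.
    return [[0.0] * 4 for _ in range(3)]


def fake_classifier(frames: List[Frame]) -> Posteriors:
    # Stub model: pretends to emit posteriors over the keyword's two syllables.
    table = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
    return table[: len(frames)]


def peak_mean_score(posteriors: Posteriors) -> float:
    # Toy score: average over syllables of each syllable's peak posterior.
    n = len(posteriors[0])
    return sum(max(row[i] for row in posteriors) for i in range(n)) / n


recognizer = KeywordRecognizer(fake_voice, fake_classifier, peak_mean_score, threshold=0.7)
print(recognizer.detect())  # peaks 0.9 and 0.8 average to about 0.85 > 0.7, so True
```

Passing each module in as a callable keeps the decision logic independent of any particular acoustic front end or model, which mirrors the patent's statement that the module division is functional rather than fixed.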

Embodiment five

Embodiment Five of the present invention further provides a computer-readable storage medium on which a computer program is stored. When executed by a keyword recognition device, the program implements the keyword recognition method provided in Embodiment One of the present invention. The method includes: obtaining voice information input by a user; inputting the voice information into a pre-trained syllable classification model, and obtaining the syllable recognition result, associated with a preset keyword, that the syllable classification model outputs; obtaining a confidence score of the voice information according to the syllable recognition result; and, if the confidence score is greater than a preset threshold, determining that the keyword is recognized in the voice information.

Of course, for the computer-readable storage medium provided by the embodiments of the present invention, the stored computer program, when executed, is not limited to the method operations described above; it may also perform related operations in the keyword recognition method provided by any embodiment of the present invention.

From the above description of the embodiments, those skilled in the art can clearly understand that the present invention may be implemented by software plus the necessary general-purpose hardware, or of course by hardware alone, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as a computer floppy disk, read-only memory (ROM), random access memory (RAM), flash memory (FLASH), hard disk, or optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention.

It is worth noting that, in the above embodiment of the keyword recognition device, the units and modules it includes are divided only according to functional logic; the division is not limited to the one above, as long as the corresponding functions can be implemented. In addition, the specific names of the functional units are only for ease of distinguishing them from one another and are not intended to limit the protection scope of the present invention.

Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments, and substitutions can be made without departing from the protection scope of the present invention. Therefore, although the present invention has been described in further detail through the above embodiments, it is not limited to them and may include other equivalent embodiments without departing from the inventive concept; the scope of the present invention is determined by the scope of the appended claims.

Claims

1. A keyword recognition method, characterized by comprising:

Obtaining voice information input by a user;

Inputting the voice information into a pre-trained syllable classification model, and obtaining the syllable recognition result, associated with a preset keyword, that the syllable classification model outputs;

Obtaining a confidence score of the voice information according to the syllable recognition result;

If the confidence score is greater than a preset threshold, determining that the keyword is recognized in the voice information.

2. The method according to claim 1, characterized in that, before inputting the voice information into the pre-trained syllable classification model and obtaining the syllable recognition result, associated with the preset keyword, that the syllable classification model outputs, the method further comprises:

When a keyword setting instruction is detected, obtaining a keyword input by the user;

Adjusting a syllable parameter in the syllable classification model according to the syllable information corresponding to the keyword.

3. The method according to claim 2, characterized in that inputting the voice information into the pre-trained syllable classification model and obtaining the syllable recognition result, associated with the preset keyword, that the syllable classification model outputs comprises:

Performing acoustic feature extraction on the voice information at a preset sampling frequency to obtain an acoustic feature frame group;

Inputting each acoustic feature frame in the acoustic feature frame group separately into the pre-trained syllable classification model;

Obtaining the syllable recognition result, output by the syllable classification model, that corresponds to each acoustic feature frame and matches the syllable parameter.

4. The method according to claim 3, characterized in that, before each acoustic feature frame in the acoustic feature frame group is separately input into the pre-trained syllable classification model, the method further comprises:

Performing toneless syllable fusion processing on each acoustic feature frame in the acoustic feature frame group.

5. The method according to claim 3, characterized in that the syllable recognition result is a group of posterior probabilities in vector form;

Obtaining the confidence score of the voice information according to the syllable recognition result comprises:

Performing moving-average processing on the group of posterior probabilities in vector form;

Extracting the sequence maximum values from the moving-average result;

Calculating the geometric mean of the sequence maximum values, and using the geometric mean as the confidence score of the voice information.

6. The method according to claim 1, characterized in that, before obtaining the voice information input by the user, the method further comprises:

Obtaining standard keyword samples having different syllable information, wherein each standard keyword sample includes voice information of a standard keyword and a standard syllable decomposition result corresponding to the standard keyword;

Training on the standard keyword samples using a preset neural network algorithm to obtain the syllable classification model.

7. The method according to any one of claims 1-6, characterized in that the keyword is a wake-up word;

After determining that the keyword is recognized in the voice information, the method further comprises: performing a device wake-up operation.

8. A keyword recognition device, characterized by comprising:

A voice acquisition module, configured to obtain voice information input by a user;

A syllable identification module, configured to input the voice information into a pre-trained syllable classification model and obtain the syllable recognition result, associated with a preset keyword, that the syllable classification model outputs;

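A confidence acquisition module, configured to obtain a confidence score of the voice information according to the syllable recognition result;

A keyword determination module, configured to determine, if the confidence score is greater than a preset threshold, that the keyword is recognized in the voice information.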

9. A device, characterized in that the device comprises:

One or more processors;

A memory, configured to store one or more programs;

When the one or more programs are executed by the one or more processors, the one or more processors implement the keyword recognition method according to any one of claims 1-7.

10. A computer-readable storage medium having a computer program stored thereon, characterized in that, when executed by a processor, the program implements the keyword recognition method according to any one of claims 1-7.
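Claim 5 describes the confidence computation only at a high level. The sketch below shows one plausible reading in plain Python, under stated assumptions: the moving average is a trailing average over `window` frames (the window size is a hypothetical parameter), the "sequence maximum" is taken per syllable over all frames, and the score is the geometric mean of those per-syllable maxima. The function names and the threshold value are illustrative, not taken from the patent.

```python
import math
from typing import List, Sequence


def smooth_posteriors(posteriors: Sequence[Sequence[float]], window: int) -> List[List[float]]:
    """Trailing moving average of each syllable's posterior over `window` frames (assumed form)."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    for j in range(len(posteriors)):
        start = max(0, j - window + 1)           # shorter window at the start of the utterance
        span = posteriors[start : j + 1]
        n_syll = len(posteriors[j])
        smoothed.append([sum(row[i] for row in span) / len(span) for i in range(n_syll)])
    return smoothed


def confidence_score(posteriors: Sequence[Sequence[float]], window: int = 3) -> float:
    """Geometric mean of per-syllable maxima of the smoothed posteriors (assumed reading of claim 5)."""
    if not posteriors:
        return 0.0
    smoothed = smooth_posteriors(posteriors, window)
    n_syll = len(smoothed[0])
    # Sequence maximum: for each syllable, its largest smoothed posterior over all frames.
    peaks = [max(frame[i] for frame in smoothed) for i in range(n_syll)]
    # Geometric mean computed in log space; any zero peak makes the score zero.
    if min(peaks) <= 0.0:
        return 0.0
    return math.exp(sum(math.log(p) for p in peaks) / n_syll)


# Four frames of posteriors over a two-syllable keyword.
example = [
    [1.0, 0.0],
    [1.0, 0.0],
    [0.0, 0.25],
    [0.0, 0.25],
]
score = confidence_score(example, window=2)
print(round(score, 6))  # peaks are 1.0 and 0.25, geometric mean 0.5
print(score > 0.4)      # True for a hypothetical preset threshold of 0.4
```

Because the geometric mean multiplies the per-syllable peaks, a single syllable that never scores well pulls the whole confidence down, so the keyword is accepted only when every syllable was observed. Taking each maximum over the whole utterance ignores syllable order; a practical detector might score within a sliding window instead, which the claim does not spell out.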