WO2017071182A1 - Voice wake-up method, apparatus and system - Google Patents

Voice wake-up method, apparatus and system (一种语音唤醒方法、装置及系统)

Info

Publication number
WO2017071182A1
WO2017071182A1 · PCT/CN2016/082401 · CN 2016082401 W
Authority
WO
WIPO (PCT)
Prior art keywords
voice
keyword
instruction word
phoneme
current input
Prior art date
Application number
PCT/CN2016/082401
Other languages
English (en)
French (fr)
Inventor
王育军
Original Assignee
乐视控股(北京)有限公司
乐视致新电子科技(天津)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 乐视控股(北京)有限公司 and 乐视致新电子科技(天津)有限公司
Priority to EP16739388.3A priority Critical patent/EP3179475A4/en
Priority to RU2016135447A priority patent/RU2016135447A/ru
Priority to US15/223,799 priority patent/US20170116994A1/en
Publication of WO2017071182A1 publication Critical patent/WO2017071182A1/zh

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/14 - Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 15/142 - Hidden Markov Models [HMMs]
    • G10L 15/144 - Training of HMMs
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 2015/088 - Word spotting
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 - Execution procedure of a spoken command
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/225 - Feedback of the input speech

Definitions

  • the embodiments of the present invention relate to the field of voice recognition technologies, and in particular, to a voice wakeup method, apparatus, and system.
  • The voice interaction system of a smart device carries out the user's instructions by recognizing the user's voice.
  • Conventionally, the user activates voice input manually, for example by pressing a record button, to perform voice interaction.
  • To simulate how people call out to each other at the beginning of an interaction, the voice wake-up function was designed.
  • The existing voice wake-up mode mainly works as follows: before voice interaction with the smart device, the user first needs to say a wake-up word, which can be preset in the smart device.
  • The wake-up module of the voice interaction system detects the voice, extracts voice features, and determines whether they match the voice features of the preset wake-up word; if so, it wakes the recognition module, which then performs voice recognition and semantic analysis on subsequently input voice commands.
  • the user wants to use the TV's voice interaction system to instruct the TV to switch to the sports channel.
  • the user needs to say a wake-up word, such as "Hello TV", and the wake-up module activates the recognition module after detecting the wake-up word.
  • The recognition module then begins to detect voice commands. When the user says "watch the sports channel", the recognition module recognizes the voice command and switches to the sports channel accordingly. After the instruction has been recognized, the recognition module is closed and no longer works; if the user wants to issue another instruction, the wake-up word must be said again to wake up the recognition module.
  • The embodiments of the present invention provide a voice wake-up method and apparatus, which are used to solve the waste of system resources and poor user experience caused by the voice wake-up mode of prior-art voice interaction systems.
  • the embodiment of the invention provides a voice wake-up method, including:
  • When there is an instruction word in the current input voice, the speech recognizer is woken up, and the corresponding operation indicated by the instruction word is executed according to the instruction word.
  • the embodiment of the invention provides a voice wake-up device, including:
  • An extracting unit configured to perform voice feature extraction on the acquired current input voice
  • An instruction word determining unit, configured to determine, according to the extracted voice features and a pre-built keyword detection model, whether an instruction word exists in the current input voice, where the keywords in the keyword detection model include at least a preset instruction word;
  • the first waking unit is configured to wake up the speech recognizer when the instruction word exists in the current input speech, and execute a corresponding operation indicated by the instruction word according to the instruction word.
  • the embodiment of the invention provides a voice wake-up system, comprising: a keyword detection module and a voice recognizer, wherein:
  • the keyword detecting module is configured to perform voice feature extraction on the acquired current input voice; and according to the extracted voice feature, determine whether there is an instruction word in the current input voice according to a pre-built keyword detection model,
  • the keywords in the keyword detection model include at least a preset instruction word; when there is an instruction word in the current input speech, the keyword detecting module wakes up the speech recognizer and transmits the current input speech to it;
  • the speech recognizer is configured to perform semantic analysis on the current input speech to obtain its semantics, determine whether those semantics match the set instruction semantics, and if so execute the corresponding operation indicated by the instruction word.
  • The beneficial effects of the voice wake-up method and apparatus provided by the embodiments of the present invention include: the speech recognizer is woken up directly after an instruction word is detected in the input voice, and the corresponding operation is performed according to that instruction word; there is no need to first wake the speech recognizer upon a wake-up word and then re-detect whether a new input voice contains an instruction word, which saves resources.
  • FIG. 1 is a flowchart of a method for waking up a voice according to an embodiment of the present invention
  • FIG. 2 is a schematic structural diagram of a keyword detection model as a hidden Markov model according to an embodiment of the present invention
  • FIG. 3 is a flowchart of a method for waking up a voice according to Embodiment 1 of the present invention
  • FIG. 4 is a schematic structural diagram of a voice wake-up apparatus according to Embodiment 2 of the present invention.
  • FIG. 5 is a schematic structural diagram of a voice wake-up system according to Embodiment 3 of the present invention.
  • the embodiment of the invention provides a voice wake-up method, as shown in FIG. 1 , including:
  • Step 101 Perform voice feature extraction on the acquired current input voice.
  • Step 102 Determine, according to the extracted voice feature, whether the instruction word exists in the current input voice according to the pre-built keyword detection model, and the keyword in the keyword detection model includes at least a preset instruction word.
  • Step 103 When there is an instruction word in the current input voice, the speech recognizer is awakened, and the corresponding operation indicated by the instruction word is executed according to the instruction word.
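The three steps above can be sketched as a minimal keyword-triggered wake-up loop. This is an illustrative sketch only: `extract_features`, `detect_keyword`, and `SpeechRecognizer` are hypothetical placeholders (a text transcript stands in for real acoustic features), not names from the patent.

```python
# Minimal sketch of steps 101-103: feature extraction, keyword detection,
# and conditional wake-up. All names are illustrative placeholders.

INSTRUCTION_WORDS = {"watch sports channel", "navigate to", "play"}

def extract_features(voice):
    # Stand-in for real spectral/cepstral feature extraction (step 101):
    # here we simply tokenize a transcript.
    return voice.lower().split()

def detect_keyword(features, keywords):
    # Stand-in for the keyword detection model (step 102): report the
    # first keyword that appears verbatim in the feature stream.
    text = " ".join(features)
    for kw in keywords:
        if kw in text:
            return kw
    return None

class SpeechRecognizer:
    def __init__(self):
        self.awake = False

    def wake(self):
        self.awake = True

    def execute(self, instruction):
        return "executing: " + instruction

def handle_input(voice, recognizer):
    features = extract_features(voice)                  # step 101
    word = detect_keyword(features, INSTRUCTION_WORDS)  # step 102
    if word is not None:                                # step 103
        recognizer.wake()
        return recognizer.execute(word)
    return None
```

Note that the recognizer is only woken when an instruction word is actually present; otherwise the input is ignored and the detector keeps listening.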
  • the voice wake-up method in the embodiment of the present invention can be applied to a smart device with a voice interaction function, such as a television, a mobile phone, a computer, a smart refrigerator, and the like.
  • the speech feature can be a spectral or cepstral coefficient.
  • the keyword in the keyword detection model may be a preset instruction word, and the instruction word is a phrase for instructing the smart device to perform a specific operation.
  • For example, the instruction word may be "watch the sports channel", "navigate to", "play", and so on.
  • the current input speech can be detected by the keyword detection model.
  • the keyword detection model is first constructed, and the specific method for constructing the keyword detection model is as follows:
  • Keywords are preset; a keyword can be a wake-up word or an instruction word.
  • The wake-up word is a phrase used to wake up the speech recognizer.
  • The wake-up word is usually a phrase with voiced initials, i.e. a phrase containing Chinese characters starting with m, n, l, r, etc.; because voiced initials involve vocal-cord vibration, they can be better distinguished from environmental noise and give better noise immunity.
  • the wake-up word can be set to “Hello Music” or “ ⁇ ”.
  • the instruction word is a phrase used to instruct the smart device to perform the corresponding operation.
  • The instruction word typically reflects a function specific to the smart device: for example, "navigate to" is highly relevant to devices with navigation capabilities (such as cars), while "play" is usually highly relevant to multimedia-enabled devices (such as televisions and mobile phones).
  • the instruction word can directly reflect the user's intention.
  • the speech feature may be a spectrum or a cepstral coefficient, etc., and a frame of speech feature vector may be extracted from the signal of the input speech every 10 milliseconds.
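The framing just described (one feature vector roughly every 10 milliseconds) can be sketched as follows. This is a minimal stand-in assuming a 16 kHz waveform, using a log-magnitude spectrum in place of real cepstral/MFCC features; `frame_features` and its parameters are illustrative, not from the patent.

```python
import numpy as np

def frame_features(signal, sample_rate=16000, frame_ms=25, hop_ms=10, n_coeffs=13):
    """Slice a waveform into overlapping frames and compute a simple
    log-magnitude-spectrum vector per frame (a stand-in for real
    cepstral features). One vector is produced every hop_ms
    milliseconds, matching the 10 ms step described above."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    feats = np.empty((n_frames, n_coeffs))
    window = np.hamming(frame_len)
    for i in range(n_frames):
        frame = signal[i * hop_len : i * hop_len + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame))
        feats[i] = np.log(spectrum[:n_coeffs] + 1e-8)  # avoid log(0)
    return feats
```

With these defaults, one second of 16 kHz audio yields 98 feature frames (25 ms windows stepped every 10 ms).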
  • Keywords usually differ across application scenarios, which requires pre-building a keyword detection model for each application scenario.
  • Building a keyword detection model is essentially constructing an acoustic model, and the acoustic model can take many forms, such as a hidden Markov model or a neural network model.
  • Here the keyword detection model is represented by a hidden Markov model. As shown in FIG. 2, each keyword can be expanded into a hidden Markov chain in the hidden Markov model, that is, a keyword state chain; each node on a chain corresponds to the acoustic parameters of one state of a keyword phoneme.
  • Each keyword state chain is additionally given a short silence state and an empty tail state that identifies the keyword type.
  • The empty tail state identifies whether the hidden Markov chain represents a wake-up word or an instruction word, shown as the black dot node at the end of each chain in the figure.
  • A node can jump forward, indicating a change of the utterance state, such as a change in the shape of the vocal tract; it can also self-loop, indicating that the utterance is momentarily unchanged, such as the relatively stable portion of a vowel.
  • The phonemes other than the keyword phonemes constitute a junk-word state chain; the tail of the junk-word state chain also has an empty tail state, indicating that the hidden Markov chain is a junk word.
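The chain topology just described (silence state, three states per phoneme, typed empty tail state, junk-word chain) can be sketched as plain data. The phoneme inventory and lexicon below are hypothetical examples, not from the patent.

```python
# Sketch of the chain topology described above: each keyword expands into a
# state chain of its phoneme states, preceded by a short silence state and
# terminated by an empty tail state labelled with the keyword type ("wake"
# or "instruction"); all remaining phonemes form a junk-word chain.

def build_chains(keywords, lexicon, all_phonemes):
    # keywords: {phrase: "wake" | "instruction"}; lexicon: {phrase: [phonemes]}
    chains = {}
    used = set()
    for word, kind in keywords.items():
        phonemes = lexicon[word]
        used.update(phonemes)
        # three states per phoneme, as described for the HMM
        states = ["sil"] + [f"{p}_{s}" for p in phonemes for s in (1, 2, 3)]
        states.append(f"tail:{kind}")  # empty tail state identifying the type
        chains[word] = states
    junk = [p for p in all_phonemes if p not in used]
    chains["<junk>"] = (
        ["sil"] + [f"{p}_{s}" for p in junk for s in (1, 2, 3)] + ["tail:junk"]
    )
    return chains
```

The tail label is what later lets the decoder tell whether the best-matching chain was a wake-up word, an instruction word, or junk.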
  • the keyword detection model can be constructed in the following two ways:
  • In the first way, the acoustic parameter samples corresponding to each phoneme are extracted from the corpus. From the perspective of sound quality, the smallest unit of speech is the phoneme. Phonemes can be divided into two categories, vowels and consonants: 10 vowels and 22 consonants, 32 phonemes in total. In the hidden Markov model, three states are usually set for one phoneme according to its phonetic features; each state reflects a sound characteristic of the phoneme, for example the changing shape of the vocal tract while the phoneme is produced.
  • The corpus stores voice text and the speech corresponding to that text. The voice text can be content from different fields, and the corresponding speech can be recordings of the voice text read by different people.
  • the acoustic parameter samples corresponding to each phoneme are extracted in the corpus, and the acoustic parameters are parameters that represent the state of the phoneme.
  • For example, when the acoustic parameter samples corresponding to phoneme a are extracted, with the three states of a being b, c, and d and n samples extracted per state, the samples corresponding to state b are b1, b2, …, bn, those for state c are c1, c2, …, cn, and those for state d are d1, d2, …, dn.
  • the acoustic parameter samples corresponding to each phoneme are trained to obtain an acoustic model.
  • the acoustic model is the correspondence between phonemes and corresponding acoustic parameters.
  • For training, prior-art methods for hidden Markov models can be used; alternatively, a neural network can be trained with the back-propagation method to learn the weight of each neuron, thereby determining the neural network model.
  • The input of the neural network model is a phoneme, and the output is the acoustic parameters corresponding to that phoneme.
  • the acoustic model is the correspondence between each phoneme of 32 phonemes and the acoustic parameters of the phoneme.
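The neural-network option can be illustrated with a one-layer network trained by gradient descent, a minimal stand-in for back-propagation in a deeper network, mapping a one-hot phoneme index to its acoustic parameters. The tiny three-phoneme setup is an assumption for illustration; the text above describes 32 phonemes.

```python
import numpy as np

def train_acoustic_net(targets, lr=0.5, epochs=500):
    # targets: (n_phonemes, n_params) array of desired acoustic parameters.
    # A single linear layer with one-hot inputs, trained by gradient descent
    # (a stand-in for back-propagation in a deeper network).
    n_phonemes, n_params = targets.shape
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(n_phonemes, n_params))
    X = np.eye(n_phonemes)  # one-hot input per phoneme
    for _ in range(epochs):
        pred = X @ W
        grad = 2 * X.T @ (pred - targets) / n_phonemes  # d(MSE)/dW
        W -= lr * grad
    return W

def acoustic_params(W, phoneme_index):
    # The learned model maps a phoneme (row index) to its acoustic parameters.
    return W[phoneme_index]
```

After training, each row of `W` is the acoustic-parameter vector for one phoneme, i.e. the phoneme-to-parameters correspondence the acoustic model represents.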
  • the keyword phonemes corresponding to the keywords are searched in the pronunciation dictionary.
  • the pronunciation dictionary is used to save the phonemes included in the phrase.
  • the acoustic parameters corresponding to the keyword phonemes in the acoustic model are constructed as a keyword detection model.
  • the keywords are determined according to different application scenarios, and the keyword phonemes corresponding to the keywords are searched in the pronunciation dictionary.
  • the acoustic parameter samples corresponding to the keyword phonemes are extracted in the corpus.
  • the acoustic parameter samples corresponding to the keyword phonemes are trained to obtain a keyword detection model.
  • The training algorithm used in the second way is the same as that used in the first, and is not described again here.
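The two construction ways can be contrasted in a small sketch, with "training" abbreviated to averaging the acoustic-parameter samples per phoneme; the corpus and dictionary contents are illustrative placeholders, not from the patent.

```python
# Sketch of the two construction paths described above. "Training" is
# abbreviated to averaging acoustic-parameter samples per phoneme; the
# corpus, pronunciation dictionary, and samples are illustrative.

def train(samples_by_phoneme):
    # Placeholder training: the model maps each phoneme to the mean
    # of its acoustic-parameter samples.
    return {p: sum(s) / len(s) for p, s in samples_by_phoneme.items()}

def build_model_way1(corpus, pronunciation_dict, keywords):
    acoustic_model = train(corpus)  # train on ALL phonemes first
    keyword_phonemes = {p for w in keywords for p in pronunciation_dict[w]}
    # ...then select only the keyword phonemes' parameters
    return {p: acoustic_model[p] for p in keyword_phonemes}

def build_model_way2(corpus, pronunciation_dict, keywords):
    keyword_phonemes = {p for w in keywords for p in pronunciation_dict[w]}
    # Train only on the keyword phonemes' samples
    return train({p: corpus[p] for p in keyword_phonemes})
```

Under this simplification both paths yield the same detection model; the second simply avoids training parameters for phonemes the keywords never use.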
  • FIG. 3 is a flowchart of a method for waking up a voice according to Embodiment 1 of the present invention, which specifically includes the following processing steps:
  • Step 301 The smart device performs voice feature extraction on the current input voice.
  • the smart device with the voice interaction function monitors whether there is voice input.
  • the keyword detection module in the smart device is used to detect keywords in the currently input voice.
  • the existing acoustic model can be used to perform feature extraction on the current input speech.
  • the speech feature may be a spectrum or a cepstral coefficient.
  • the keyword detection module may detect a keyword in the input voice by using a keyword detection model.
  • Take as an example the case where the keyword detection model is a hidden Markov model.
  • The hidden Markov model can determine the start and end of speech through the silence state nodes, thereby delimiting the current input speech.
  • Step 302: Acquire the acoustic model, and for the extracted speech features perform keyword confirmation on each hidden Markov chain in the hidden Markov model, obtaining a score for each chain.
  • The extracted speech features are compared with the states of each hidden Markov chain to obtain a score for the chain; the score represents the similarity between the phrase in the current input speech and each keyword.
  • Step 303 Confirm whether the phrase corresponding to the Hidden Markov chain with the highest score is a preset instruction word, if yes, proceed to step 304, and if no, proceed to step 312.
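Steps 302-303 can be sketched as scoring each chain and checking the type of the best-scoring one. A real system would run Viterbi decoding over the HMM states; here `chain_score` is a placeholder similarity measure (negative mean squared distance after crude resampling), which is an assumption for illustration.

```python
# Sketch of steps 302-303: score each keyword chain against the extracted
# feature sequence, then check whether the best-scoring chain is an
# instruction word, a wake-up word, or junk.

def chain_score(features, chain_params):
    # Placeholder for Viterbi decoding: negative mean squared distance
    # between the feature sequence and the chain's resampled state params.
    n = len(features)
    resampled = [chain_params[int(i * len(chain_params) / n)] for i in range(n)]
    return -sum((f - r) ** 2 for f, r in zip(features, resampled)) / n

def best_keyword(features, chains):
    # chains: {phrase: (state_parameter_sequence, kind)}
    scores = {w: chain_score(features, params) for w, (params, _) in chains.items()}
    best = max(scores, key=scores.get)
    return best, chains[best][1], scores[best]
```

The returned `kind` drives the branch in step 303: "instruction" leads to waking the recognizer (step 304), "wake" and "junk" to the other branches (step 312).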
  • Step 304: Wake up the speech recognizer.
  • the speech recognizer is generally deployed on a server in the cloud.
  • Step 305 Send the current input voice to the voice recognizer.
  • Step 306 The speech recognizer performs semantic analysis on the current input speech to obtain semantics of the current input speech.
  • Since the instruction word is detected from the current input speech, its presence does not necessarily mean that the user is speaking a voice instruction: the current input voice may happen to contain the instruction word while the user's intention is something else. For example, the user saying "Huludao Channel" contains a pronunciation similar to "navigate to", but the user's real intention is not to navigate to a destination.
  • the method in the prior art may be used. For example, a method based on template matching or a method based on sequence labeling may be adopted, and the specific processing manner is not described in detail herein.
  • Step 307 The speech recognizer determines whether the semantics of the current input speech match the set instruction semantics, and if yes, proceeds to step 308, and if no, proceeds to step 310.
  • the set instruction semantics is a plurality of semantic phrases set according to the application scenario, for example, including “instruction words” + “place nouns”.
  • For example, the set instruction semantics is "navigate to" + place noun, where the place noun may be Beijing, Zhongguancun in Haidian District, Xitucheng, and so on. The determined semantics of the current input speech are compared with each set instruction semantics; if a match with the current input speech is found, the matching succeeds and the process proceeds to step 308; if no match is found, the matching fails and the process proceeds to step 310.
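The template match described here ("navigate to" + place noun) can be sketched as follows; the template list and place-noun vocabulary are illustrative placeholders, not from the patent.

```python
# Sketch of step 307: match the parsed semantics of the input against set
# instruction templates such as "navigate to" + <place noun>. The
# templates and the place-noun vocabulary are illustrative assumptions.

PLACE_NOUNS = {"beijing", "zhongguancun", "xitucheng"}
TEMPLATES = [("navigate to", PLACE_NOUNS)]

def semantics_match(utterance):
    text = utterance.lower().strip()
    for instruction, slot_vocab in TEMPLATES:
        if text.startswith(instruction):
            slot = text[len(instruction):].strip()
            if slot in slot_vocab:
                return True   # leads to step 308: matching success
    return False              # leads to step 310: matching failure
```

This is how "Huludao Channel" is rejected even though it sounds like "navigate to": the parsed semantics fill no valid template slot, so no operation is triggered.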
  • Step 308 The speech recognizer sends a matching success message to the smart device.
  • Step 309 The smart device executes a corresponding operation indicated by the instruction word according to the instruction word.
  • For example, the user says "watch the sports channel", and when the smart TV receives the matching success message sent by the speech recognizer, it switches directly to the sports channel.
  • In the prior art, by contrast, the user first needs to say the wake-up word (for example, "Hi Lele"), and only after the speech recognizer has been woken up does the user say the instruction "watch the sports channel".
  • Step 310 The speech recognizer sends a matching failure message to the smart device.
  • Step 311 After receiving the matching failure message, the smart device does not perform any processing.
  • Step 312: Determine whether the phrase corresponding to the hidden Markov chain with the highest score is the wake-up word or a junk word. If it is the wake-up word, proceed to step 313; if it is a junk word, proceed to step 314.
  • Step 313: Wake up the voice recognizer. After speaking the wake-up word, the user usually continues with an instruction word, so the smart device continues keyword detection to determine whether there is an instruction word in the current input voice. The specific detection method is the same as steps 301-311 above and is not described again.
  • Step 314: When the phrase corresponding to the highest-scoring hidden Markov chain is a junk word, it is determined that no keyword is contained in the current input voice.
  • The keyword detection model then returns to its detection entry point and continues monitoring the input voice.
  • In Embodiment 1 of the present invention, since the speech recognizer is woken up directly after an instruction word is detected and the corresponding operation is performed according to that instruction word, there is no need to wake the speech recognizer upon a wake-up word and then re-detect whether a new input voice contains an instruction word. This saves resources, and the user no longer has to say the wake-up word before every instruction word, which improves the user experience.
  • Corresponding to the voice wake-up method of the above embodiment, Embodiment 2 of the present invention further provides a voice wake-up apparatus, shown in FIG. 4, which specifically includes:
  • the extracting unit 401 is configured to perform voice feature extraction on the current input voice.
  • the existing acoustic model can be used to perform feature extraction on the current input speech.
  • the speech feature may be a spectrum or a cepstral coefficient.
  • the current input speech can be detected by a pre-built keyword detection model.
  • the instruction word determining unit 402 is configured to determine, according to the extracted voice feature, whether the instruction word exists in the current input voice according to the pre-built keyword detection model, and the keyword in the keyword detection model includes at least a preset instruction word.
  • the voice wake-up device detects keywords in the input voice.
  • Keywords are preset; a keyword can be a wake-up word or an instruction word.
  • The wake-up word is a phrase used to wake up the speech recognizer.
  • The wake-up word is usually a phrase with voiced initials, i.e. a phrase containing Chinese characters starting with m, n, l, r, etc.; because voiced initials involve vocal-cord vibration, they can be better distinguished from environmental noise and give better noise immunity.
  • the wake-up word can be set to “Hello Music” or “ ⁇ ”.
  • The instruction word is a phrase used to instruct the smart device to perform a corresponding operation, and typically reflects a function specific to the smart device: for example, "navigate to" is highly relevant to devices with navigation capabilities (such as cars), while "play" is usually highly relevant to multimedia-enabled devices (such as televisions and mobile phones). The instruction word directly reflects the user's intent.
  • the speech feature may be a spectrum or a cepstral coefficient, etc., and a frame of speech feature vector may be extracted from the signal of the input speech every 10 milliseconds.
  • the first waking unit 403 is configured to wake up the speech recognizer when the instruction word exists in the current input speech, and execute a corresponding operation indicated by the instruction word according to the instruction word.
  • For example, the user says "watch the sports channel", and when the smart TV receives the matching success message sent by the speech recognizer, it switches directly to the sports channel.
  • In the prior art, by contrast, the user first needs to say the wake-up word (for example, "Hi Lele"), and only after the speech recognizer has been woken up does the user say the instruction "watch the sports channel".
  • the above device further includes:
  • the obtaining unit 404 is configured to obtain a matching success message that matches the semantics of the current input speech with the instruction semantics.
  • The matching success message is sent by the speech recognizer after it performs semantic recognition on the input speech, obtains the semantics of the input speech, and successfully matches those semantics against the set instruction semantics.
  • Since the instruction word is detected from the current input speech, its presence does not necessarily mean that the user is speaking a voice instruction: the current input voice may happen to contain the instruction word while the user's intention is something else. For example, the user saying "Huludao Channel" contains a pronunciation similar to "navigate to", but the user's real intention is not to navigate to a destination.
  • the set instruction semantics is a plurality of semantic phrases set according to an application scenario, for example, including "instruction words” + "place nouns”.
  • For example, the set instruction semantics is "navigate to" + place noun, where the place noun may be Beijing, Zhongguancun in Haidian District, Xitucheng, and so on. The determined semantics of the current input speech are compared with each set instruction semantics; if a match with the current input speech is found, the matching succeeds; if no match is found, the matching fails.
  • The instruction word determining unit 402 is specifically configured to: extract, for each phoneme in the speech, the corresponding acoustic parameter samples from the corpus, where the corpus stores voice text and the speech corresponding to that text; train the acoustic parameter samples corresponding to each phoneme according to a preset training algorithm to obtain an acoustic model, the acoustic model being the correspondence between phonemes and their acoustic parameters; search the pronunciation dictionary for the keyword phonemes corresponding to the keywords; and construct the keyword phonemes and their acoustic parameters in the acoustic model into a keyword detection model, where the pronunciation dictionary stores the phonemes contained in each phrase.
  • Alternatively, the instruction word determining unit 402 is specifically configured to: search the pronunciation dictionary for the keyword phonemes corresponding to the keywords, where the pronunciation dictionary stores the phonemes contained in each phrase; extract the acoustic parameter samples corresponding to the keyword phonemes from the corpus, where the corpus stores voice text and the corresponding speech; and train the acoustic parameter samples corresponding to the keyword phonemes according to the preset training algorithm to obtain the keyword detection model.
  • Keywords usually differ across application scenarios, which requires pre-building a keyword detection model for each application scenario.
  • Building a keyword detection model is essentially constructing an acoustic model, and the acoustic model can take many forms, such as a hidden Markov model or a neural network model.
  • Take as an example the case where the keyword detection model is expressed by a hidden Markov model. As shown in FIG. 2, each keyword can be expanded into a hidden Markov chain in the hidden Markov model, that is, a keyword state chain; each node on a chain corresponds to the acoustic parameters of one state of a keyword phoneme.
  • Each keyword state chain is additionally given a short silence state and an empty tail state that identifies the keyword type.
  • The empty tail state identifies whether the hidden Markov chain represents a wake-up word or an instruction word, shown as the black dot node at the end of each chain in the figure.
  • A node can jump forward, indicating a change of the utterance state, such as a change in the shape of the vocal tract; it can also self-loop, indicating that the utterance is momentarily unchanged, such as the relatively stable portion of a vowel.
  • The phonemes other than the keyword phonemes constitute a junk-word state chain; the tail of the junk-word state chain also has an empty tail state, indicating that the hidden Markov chain is a junk word.
  • The hidden Markov model can determine the start and end of speech through the silence state nodes, thereby delimiting the current input speech.
  • the keyword detection model can be constructed in the following two ways:
  • In the first way, the acoustic parameter samples corresponding to each phoneme are extracted from the corpus. From the perspective of sound quality, the smallest unit of speech is the phoneme. Phonemes can be divided into two categories, vowels and consonants: 10 vowels and 22 consonants, 32 phonemes in total. In the hidden Markov model, three states are usually set for one phoneme according to its phonetic features; each state reflects a sound characteristic of the phoneme, for example the changing shape of the vocal tract while the phoneme is produced.
  • The corpus stores voice text and the speech corresponding to that text. The voice text can be content from different fields, and the corresponding speech can be recordings of the voice text read by different people.
  • the acoustic parameter is the parameter that characterizes the phoneme state.
  • For example, when the acoustic parameter samples corresponding to phoneme a are extracted, with the three states of a being b, c, and d and n samples extracted per state, the samples corresponding to state b are b1, b2, …, bn, those for state c are c1, c2, …, cn, and those for state d are d1, d2, …, dn.
  • the acoustic parameter samples corresponding to each phoneme are trained to obtain an acoustic model.
  • the acoustic model is the correspondence between phonemes and corresponding acoustic parameters.
  • For training, prior-art methods for hidden Markov models can be used; alternatively, a neural network can be trained with the back-propagation method to learn the weight of each neuron, thereby determining the neural network model.
  • The input of the neural network model is a phoneme, and the output is the acoustic parameters corresponding to that phoneme.
  • the acoustic model is the correspondence between each phoneme of 32 phonemes and the acoustic parameters of the phoneme.
  • the keyword phonemes corresponding to the keywords are searched in the pronunciation dictionary.
  • the pronunciation dictionary is used to save the phonemes included in the phrase.
  • the acoustic parameters corresponding to the keyword phonemes in the acoustic model are constructed as a keyword detection model.
  • the keywords are determined according to different application scenarios, and the keyword phonemes corresponding to the keywords are searched in the pronunciation dictionary.
  • the acoustic parameter samples corresponding to the keyword phonemes are extracted in the corpus.
  • the acoustic parameter samples corresponding to the keyword phonemes are trained to obtain a keyword detection model.
  • The training algorithm used in the second way is the same as that used in the first, and is not described again here.
  • the instruction word determining unit 402 is specifically configured to: using acoustic model evaluation, perform instruction word confirmation on each hidden Markov chain in the hidden Markov model to obtain the instruction word confirmation score of the chain; and confirm whether the phrase of the hidden Markov chain with the highest instruction word confirmation score is a preset instruction word.
  • the instruction word determining unit 402 uses existing acoustic model evaluation to compare the extracted speech features with the states of each hidden Markov chain, obtaining the score of the chain; the score represents the similarity between the phrase in the input speech and each keyword, and the higher the score, the higher the similarity.
  • the keyword in the keyword detection model further includes a preset wake-up word
  • the above device further includes:
  • the second wake-up unit 405 is configured to wake up the voice recognizer when it is determined, from the extracted voice features and according to the pre-built keyword detection model, that a wake-up word is present in the input voice.
  • a related function module can be implemented by a hardware processor.
  • since the speech recognizer is woken up directly after an instruction word is detected and the corresponding operation is performed according to the instruction word, there is no need to wake the speech recognizer after detecting a wake-up word and then detect again whether an instruction word is present in new input voice, which saves resources; and for the user, there is no need to say the wake-up word before every instruction word, which improves the user experience.
  • corresponding to the voice wake-up method of the above embodiments of the present invention, Embodiment 3 of the present invention further provides a voice wake-up system, whose structure is shown in FIG. 5 and which includes a keyword detection module 501 and a speech recognizer 502, wherein:
  • the keyword detection module 501 is configured to perform voice feature extraction on the acquired current input voice; determine, from the extracted voice features and according to the pre-built keyword detection model, whether an instruction word is present in the current input voice, the keyword detection model including at least instruction word detection; and when an instruction word is present in the current input voice, wake up the speech recognizer and send the current input voice to the speech recognizer.
  • the current input speech can be detected by a pre-built keyword detection model.
  • the pre-built keyword detection model is constructed in the following way:
  • the keyword is preset, and the keyword can be an awakening word or an instruction word.
  • the wake-up word is the phrase used to wake up the speech recognizer; wake-up words usually use phrases with many voiced initials, for example phrases containing Chinese characters beginning with initials such as m, n, l, and r, because voiced initials involve vocal-cord vibration, are easier to distinguish from environmental noise, and are thus more noise-robust; for example, the wake-up word can be set to "你好乐乐" ("Hello Lele") or "嗨乐乐" ("Hi Lele").
  • the instruction word is a phrase used to instruct the smart device to perform a corresponding operation; instruction words reflect a function specific to the device, e.g. "navigate to" is highly relevant to devices with a navigation function (such as cars), while "play" is usually highly relevant to multimedia-enabled devices (such as televisions and mobile phones), and instruction words directly reflect the user's intent.
  • the speech feature may be a spectrum or a cepstral coefficient, etc., and a frame of speech feature vector may be extracted from the signal of the input speech every 10 milliseconds.
  • the keywords are usually different, which requires pre-building a keyword detection model for different application scenarios.
  • building a keyword detection model means building an acoustic model; the acoustic model can be expressed in many ways, for example as a hidden Markov model or a neural network model.
  • the keyword detection model is expressed as a hidden Markov model as an example. As shown in FIG. 2, each keyword can be expanded into a hidden Markov chain in the hidden Markov model, i.e. a keyword state chain, where each node on a chain corresponds to the acoustic parameters of one state of a keyword phoneme.
  • the two end nodes of each keyword state chain are respectively set as a short silence state and an empty tail state identifying the keyword type; the empty tail state marks whether the hidden Markov chain represents a wake-up word or an instruction word, shown as the black-dot node of each chain in FIG. 2.
  • a node can jump forward, indicating a change in the articulation state, such as a change of mouth shape for a vowel; it can also self-loop, indicating that articulation temporarily stays unchanged, such as the relatively stable articulation phase of a vowel.
  • combinations of phonemes other than the keyword phonemes form a garbage word state chain; the garbage word state chain also ends with an empty tail state, indicating that that hidden Markov chain is a garbage word.
  • the keyword detection model can be constructed in the following two ways:
  • for every phoneme in speech, the acoustic parameter samples corresponding to each phoneme are extracted from the corpus. In terms of sound quality, the smallest unit of speech is the phoneme. Phonemes fall into two classes, vowels and consonants, comprising 10 vowels and 22 consonants, 32 phonemes in total. In the hidden Markov model, three states are usually set for one phoneme according to its phonetic characteristics; each state reflects an acoustic property of the phoneme, for example the change in the shape of the vocal tract when the phoneme is produced. The corpus stores voice texts and the voice corresponding to each voice file; the voice texts can be content from different fields, and the voice corresponding to a voice file can be recordings of different people reading the voice text.
  • the acoustic parameter samples corresponding to each phoneme are extracted in the corpus, and the acoustic parameters are parameters that represent the state of the phoneme.
  • for example, to extract the acoustic parameter samples corresponding to phoneme a, let the three states of a be b, c, and d, and extract n samples for each state; then the samples corresponding to state b are b1, b2, ..., bn, the samples corresponding to state c are c1, c2, ..., cn, and the samples corresponding to state d are d1, d2, ..., dn.
  • the acoustic parameter samples corresponding to each phoneme are trained to obtain an acoustic model.
  • the acoustic model is the correspondence between phonemes and corresponding acoustic parameters.
  • the prior-art combination of a hidden Markov model and a neural network can also be used: the weight of each neuron is trained by back-propagation to determine a neural network model, whose input is a phoneme and whose output is the acoustic parameters corresponding to that phoneme. The acoustic model is the correspondence between each of the 32 phonemes and its acoustic parameters.
  • the keyword phonemes corresponding to the keywords are searched in the pronunciation dictionary.
  • the pronunciation dictionary is used to save the phonemes included in the phrase.
  • the acoustic parameters corresponding to the keyword phonemes in the acoustic model are constructed as a keyword detection model.
  • the keywords are determined according to different application scenarios, and the keyword phonemes corresponding to the keywords are searched in the pronunciation dictionary.
  • the acoustic parameter samples corresponding to the keyword phonemes are extracted in the corpus.
  • the acoustic parameter samples corresponding to the keyword phonemes are trained to obtain a keyword detection model.
  • the training algorithm used in the second way is the same as that used in the first way, and is not described in detail here.
  • using acoustic model evaluation, the keyword detection module 501 can perform keyword confirmation on each hidden Markov chain in the hidden Markov model for the extracted speech features, obtaining the score of the chain. The score characterizes the similarity between the phrase in the input speech and each keyword; the higher the score, the higher the similarity. The module confirms whether the phrase corresponding to the highest-scoring hidden Markov chain is a preset instruction word; if it is, the speech recognizer is woken up and the input speech is sent to the speech recognizer 502.
  • the speech recognizer 502 is configured to perform semantic parsing on the current input speech to obtain its semantics; determine that the semantics of the current input speech match the set instruction semantics; and issue, according to the instruction word, a command to perform the corresponding operation indicated by the instruction word.
  • since the instruction word is detected from the input voice, it does not necessarily indicate that the user is speaking a voice instruction; the input voice may just happen to contain the instruction word while the user's intent is something else. For example, the user saying "Huludao channel" contains a pronunciation similar to "navigate to", but the user's real intent is not to navigate to a destination. Therefore, semantic parsing is performed for the detected instruction word.
  • in the voice wake-up system shown in FIG. 5 provided by Embodiment 3 of the present invention, the further functions of the keyword detection module 501 and the voice recognizer 502 correspond to the respective processing steps in the flows shown in FIG. 1 and FIG. 3, and are not described again here.
  • the solution provided by the embodiments of the present invention includes: performing voice feature extraction on the acquired current input voice; determining, from the extracted voice features and according to a pre-built keyword detection model, whether an instruction word is present in the current input voice, the keywords in the keyword detection model including at least preset instruction words; and when an instruction word is present in the current input voice, waking up the speech recognizer and performing the corresponding operation according to the instruction word.
  • since the voice recognizer is woken up directly after an instruction word is detected in the current input voice and the corresponding operation is performed according to the instruction word, there is no need to wake the voice recognizer after detecting a wake-up word and then detect again whether an instruction word is present in new input voice, which saves resources. And for the user, there is no need to say the wake-up word before every instruction word, which improves the user experience.
  • the storage medium includes media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.
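Putting the bullets above together, the detect-then-wake flow can be sketched as follows. This is a minimal illustration only; the function and the "instruction"/"wake"/"garbage" labels are hypothetical stand-ins for the behavior of modules 501 and 502, not the patent's actual implementation:

```python
def handle_utterance(keyword_type, semantics_match, wake_recognizer, execute):
    """Dispatch on the best-scoring chain's keyword type:
    - instruction word: wake the recognizer, and execute only if its
      semantic parse matches the set instruction semantics;
    - wake-up word: wake the recognizer and keep listening;
    - garbage word: do nothing and return to the detection entry."""
    if keyword_type == "instruction":
        wake_recognizer()
        if semantics_match():
            execute()
            return "executed"
        return "ignored"      # match failure: the device does nothing
    if keyword_type == "wake":
        wake_recognizer()
        return "listening"    # the user will usually say an instruction next
    return "detecting"        # garbage word: continue keyword detection
```

The key point of the scheme is visible in the first branch: detecting an instruction word both wakes the recognizer and carries the command, so no separate wake-up utterance is needed.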


Abstract

A voice wake-up method, apparatus and system. The method includes: performing voice feature extraction on the acquired current input voice (101); determining, from the extracted voice features and according to a pre-built keyword detection model, whether an instruction word is present in the current input voice, where the keywords in the keyword detection model include at least preset instruction words (102); and when an instruction word is present in the current input voice, waking up the voice recognizer and performing the corresponding operation according to the instruction word (103). Because the voice recognizer is woken up directly once an instruction word is detected in the input voice, and the corresponding operation is performed according to the instruction word, there is no need to wake the voice recognizer after a wake-up word is detected and then detect again whether an instruction word is present in new input voice, which saves resources. And for the user, there is no need to say a wake-up word before every instruction word, which improves the user experience.

Description

Voice wake-up method, apparatus and system
This application claims priority to Chinese patent application No. 201510702094.1, filed with the Chinese Patent Office on October 26, 2015 and entitled "Voice wake-up method, apparatus and system", the entire contents of which are incorporated herein by reference.
Technical field
Embodiments of the present invention relate to the technical field of speech recognition, and in particular to a voice wake-up method, apparatus and system.
Background
With the development of speech technology, many smart devices can interact with users by voice. The voice interaction system of a smart device recognizes the user's speech and carries out the user's instructions. In traditional voice interaction, the user usually has to activate voice manually, for example by pressing a record button, before voice interaction is possible. To let users switch to voice more smoothly, imitating the way people call out to each other at the start of a conversation, a voice wake-up function was designed.
At present, the existing voice wake-up approach is mainly as follows: before interacting with a smart device by voice, the user first needs to say a wake-up word, which can be preset for the smart device. The wake-up module of the voice interaction system detects the speech, extracts voice features, and determines whether the extracted voice features match the voice features of the preset wake-up word; if they match, the recognition module is woken up and performs speech recognition and semantic parsing on subsequently input voice instructions. For example, suppose the user wants to use a television's voice interaction system to switch the TV to the sports channel. First the user needs to say the wake-up word, for example "Hello TV". After the wake-up module detects the wake-up word, it activates the recognition module, which starts to detect voice instructions. The user then says "watch the sports channel"; the recognition module recognizes the voice instruction and switches the channel to the sports channel accordingly. After the instruction is recognized, the recognition module shuts down and no longer works; if the user wants to issue another instruction, the wake-up word must be said again to wake the recognition module.
In the above existing voice wake-up approach, the user must perform a voice wake-up before every instruction, i.e. say the wake-up word first and then the voice of the instruction. As a result, after the voice interaction system completes one instruction operation, keyword detection must be restarted, which wastes system resources; and for the user, having to say the wake-up word once before every instruction makes voice wake-up cumbersome and the user experience poor.
Summary of the invention
Embodiments of the present invention provide a voice wake-up method and apparatus, to solve the problems of wasted system resources and poor user experience caused by voice wake-up of voice interaction systems in the prior art.
An embodiment of the present invention provides a voice wake-up method, including:
performing voice feature extraction on the acquired current input voice;
determining, from the extracted voice features and according to a pre-built keyword detection model, whether an instruction word is present in the current input voice, where the keywords in the keyword detection model include at least preset instruction words;
when an instruction word is present in the current input voice, waking up a voice recognizer and performing the corresponding operation indicated by the instruction word.
An embodiment of the present invention provides a voice wake-up apparatus, including:
an extraction unit, configured to perform voice feature extraction on the acquired current input voice;
an instruction word determining unit, configured to determine, from the extracted voice features and according to a pre-built keyword detection model, whether an instruction word is present in the current input voice, where the keywords in the keyword detection model include at least preset instruction words;
a first wake-up unit, configured to, when an instruction word is present in the current input voice, wake up a voice recognizer and perform the corresponding operation indicated by the instruction word.
An embodiment of the present invention provides a voice wake-up system, including a keyword detection module and a voice recognizer, where:
the keyword detection module is configured to perform voice feature extraction on the acquired current input voice; determine, from the extracted voice features and according to a pre-built keyword detection model, whether an instruction word is present in the current input voice, where the keyword detection model includes at least instruction word detection; and when an instruction word is present in the current input voice, wake up the voice recognizer and send the current input voice to the voice recognizer;
the voice recognizer is configured to perform semantic parsing on the current input voice to obtain the semantics of the current input voice; determine that the semantics of the current input voice match set instruction semantics; and issue, according to the instruction word, a command to perform the corresponding operation indicated by the instruction word.
Beneficial effects of the voice wake-up method and apparatus provided by the embodiments of the present invention include: because the voice recognizer is woken up directly once an instruction word is detected in the input voice, and the corresponding operation is performed according to the instruction word, there is no need to wake the voice recognizer after a wake-up word is detected and then detect again whether an instruction word is present in new input voice, which saves resources.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart of a voice wake-up method in an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a keyword detection model expressed as a hidden Markov model in an embodiment of the present invention;
FIG. 3 is a flowchart of a voice wake-up method in Embodiment 1 of the present invention;
FIG. 4 is a schematic structural diagram of a voice wake-up apparatus in Embodiment 2 of the present invention;
FIG. 5 is a schematic structural diagram of a voice wake-up system in Embodiment 3 of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
An embodiment of the present invention provides a voice wake-up method, which, as shown in FIG. 1, includes:
Step 101: perform voice feature extraction on the acquired current input voice.
Step 102: determine, from the extracted voice features and according to a pre-built keyword detection model, whether an instruction word is present in the current input voice, where the keywords in the keyword detection model include at least preset instruction words.
Step 103: when an instruction word is present in the current input voice, wake up a voice recognizer and perform the corresponding operation indicated by the instruction word.
The voice wake-up method in the embodiments of the present invention can be applied to smart devices with a voice interaction function, such as televisions, mobile phones, computers and smart refrigerators. The voice features may be spectral or cepstral coefficients. The keywords in the keyword detection model may be preset instruction words; an instruction word is a phrase that instructs the smart device to perform a specific operation, e.g. "watch the sports channel", "navigate to" or "play". The current input voice can be detected with the keyword detection model.
In the embodiments of the present invention, before detecting whether an instruction word is present in the input voice, the keyword detection model must first be built, as follows:
In general, a user who wants to use the voice interaction function can say a preset keyword, which may be a wake-up word or an instruction word. A wake-up word is a phrase used to wake up the voice recognizer; wake-up words usually use phrases with many voiced initials, for example phrases containing Chinese characters beginning with initials such as m, n, l and r, because voiced initials involve vocal-cord vibration, are easier to distinguish from environmental noise, and are thus more noise-robust. For example, the wake-up word can be set to "你好乐乐" ("Hello Lele") or "嗨乐乐" ("Hi Lele"). An instruction word is a phrase used to instruct the smart device to perform a corresponding operation; it reflects a function specific to the device, e.g. "navigate to" is highly relevant to devices with a navigation function (such as cars), while "play" is usually highly relevant to devices with multimedia functions (such as televisions and mobile phones). Instruction words directly reflect the user's intent. The voice features may be spectral or cepstral coefficients, and one frame of voice feature vector can be extracted from the signal of the input voice every 10 milliseconds.
When a user says a keyword, it may be a wake-up word or an instruction word. In different application scenarios the keywords are usually different, so a keyword detection model must be pre-built for each application scenario. Building a keyword detection model means building an acoustic model; the acoustic model can be expressed in many ways, for example as a hidden Markov model or a neural network model. In the embodiments of the present invention, the keyword detection model is expressed as a hidden Markov model as an example. As shown in FIG. 2, each keyword can be expanded in the hidden Markov model into one hidden Markov chain, i.e. a keyword state chain, where each node on a chain corresponds to the acoustic parameters of one state of a keyword phoneme. The two end nodes of each keyword state chain are respectively set as a short silence state and an empty tail state identifying the keyword type; the empty tail state marks whether the hidden Markov chain represents a wake-up word or an instruction word, shown as the black-dot node of each chain in FIG. 2. A node can jump forward, indicating a change in the articulation state, such as a change of mouth shape for a vowel; it can also self-loop, indicating that articulation temporarily stays unchanged, such as the relatively stable articulation phase of a vowel. Each chain begins with a silence state node. In the hidden Markov state chains, combinations of phonemes other than the keyword phonemes form garbage word state chains; a garbage word state chain also ends with an empty tail state, indicating that that hidden Markov chain is a garbage word.
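The keyword state chain layout just described (a silence node, states per phoneme, an empty tail state tagging the keyword type, with self-loops and forward jumps) can be sketched as a plain data structure. This is illustrative only; the three-states-per-phoneme convention follows the text, while the state-naming scheme is a hypothetical choice:

```python
def build_keyword_chain(phonemes, keyword_type):
    """Expand a keyword's phonemes into a left-to-right state chain:
    [silence] + 3 states per phoneme + [empty tail tagged with the type]."""
    states = ["sil"]
    for p in phonemes:
        states += [f"{p}_1", f"{p}_2", f"{p}_3"]
    # the empty tail state identifies wake-up word vs. instruction word
    states.append(f"tail:{keyword_type}")
    # each state may self-loop (articulation unchanged) or jump forward
    # to the next state (articulation changes); the tail only self-loops
    transitions = {i: [i, i + 1] for i in range(len(states) - 1)}
    transitions[len(states) - 1] = [len(states) - 1]
    return states, transitions
```

A garbage word chain would be built the same way from non-keyword phonemes, with its tail tagged accordingly.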
Taking a hidden Markov model as the expression of the keyword detection model, the keyword detection model can be built in either of the following two ways:
First way:
For every phoneme in speech, extract the acoustic parameter samples corresponding to each phoneme from a corpus. In terms of sound quality, the smallest unit of speech is the phoneme. Phonemes fall into two classes, vowels and consonants, comprising 10 vowels and 22 consonants, 32 phonemes in total. In a hidden Markov model, three states are usually set for one phoneme according to its phonetic characteristics; each state reflects an acoustic property of the phoneme, for example the change in the shape of the vocal tract when the phoneme is produced. The corpus stores voice texts and the voice corresponding to each voice file; the voice texts can be content from different fields, and the voice corresponding to a voice file can be recordings of different people reading the voice text. Since different voice texts may contain the same phonemes, the acoustic parameter samples corresponding to each phoneme are extracted from the corpus; an acoustic parameter is a parameter characterizing a phoneme state. For example, to extract the acoustic parameter samples corresponding to phoneme a, let the three states of a be b, c and d, and extract n samples for each state; then the samples corresponding to state b are b1, b2, ..., bn, the samples corresponding to state c are c1, c2, ..., cn, and the samples corresponding to state d are d1, d2, ..., dn.
Train the acoustic parameter samples corresponding to each phoneme according to a preset training algorithm to obtain an acoustic model. The acoustic model is the correspondence between phonemes and their corresponding acoustic parameters. The preset training algorithm may be an arithmetic-mean algorithm, e.g. taking the arithmetic mean of the samples of the three states b, c, d of phoneme a: b′ = (b1 + b2 + ... + bn)/n, c′ = (c1 + c2 + ... + cn)/n, d′ = (d1 + d2 + ... + dn)/n, where b′, c′, d′ are the acoustic parameters corresponding to phoneme a. Alternatively, the variance of the samples of each of the three states b, c, d of phoneme a can be computed and used as the acoustic parameters corresponding to phoneme a. Further, the prior-art combination of a hidden Markov model and a neural network can be used: the weight of each neuron is trained by back-propagation to determine a neural network model, whose input is a phoneme and whose output is the acoustic parameters corresponding to that phoneme. The acoustic model is thus the correspondence between each of the 32 phonemes and its acoustic parameters.
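The arithmetic-mean training just described (b′ = (b1 + ... + bn)/n per state, or alternatively the per-state variance) can be sketched as follows. A minimal sketch: the nested-dict sample layout is an assumed representation, not prescribed by the text:

```python
def train_acoustic_model(samples_per_phoneme, use_variance=False):
    """samples_per_phoneme maps phoneme -> {state: [sample values]}.
    Returns phoneme -> {state: acoustic parameter}, where the parameter is
    the per-state arithmetic mean (or the per-state variance, if requested)."""
    model = {}
    for phoneme, states in samples_per_phoneme.items():
        model[phoneme] = {}
        for state, samples in states.items():
            n = len(samples)
            mean = sum(samples) / n
            if use_variance:
                model[phoneme][state] = sum((x - mean) ** 2 for x in samples) / n
            else:
                model[phoneme][state] = mean
    return model
```

For phoneme a with states b, c, d, the default branch reproduces b′, c′, d′ exactly as in the formulas above.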
After the keywords are determined according to the different application scenarios, the keyword phonemes corresponding to each keyword are looked up in a pronunciation dictionary. The pronunciation dictionary stores the phonemes that phrases comprise. After the keyword phonemes are determined, the acoustic parameters in the acoustic model corresponding to the keyword phonemes are built into the keyword detection model.
Second way:
In this way there is no need to determine the corresponding acoustic parameters for every phoneme; only the acoustic parameters corresponding to the keyword phonemes need to be determined.
Determine the keywords according to the different application scenarios, and look up the keyword phonemes corresponding to each keyword in the pronunciation dictionary.
Extract the acoustic parameter samples corresponding to the keyword phonemes from the corpus.
Train the acoustic parameter samples corresponding to the keyword phonemes according to the preset training algorithm to obtain the keyword detection model. The training algorithm used here is the same as that used in the first way and is not described in detail again.
The method and apparatus provided by the present invention and the corresponding system are described in detail below with specific embodiments and with reference to the accompanying drawings.
Embodiment 1:
FIG. 3 is a flowchart of the voice wake-up method provided by Embodiment 1 of the present invention, which specifically includes the following processing steps:
Step 301: the smart device performs voice feature extraction on the current input voice.
In this embodiment of the present invention, a smart device with a voice interaction function listens for voice input. A keyword detection module in the smart device is used to detect keywords in the current input voice.
In this step, existing acoustic model evaluation can be used to extract features from the current input voice. The voice features may be spectral or cepstral coefficients. The keyword detection module can detect keywords in the input voice with a keyword detection model; in this embodiment of the present invention, the keyword detection model is a hidden Markov model as an example. The hidden Markov model can determine the start and end of speech through the silence state nodes, thereby determining the current input voice.
Step 302: using acoustic model evaluation, perform keyword confirmation on each hidden Markov chain in the hidden Markov model for the extracted voice features, and obtain the score of each hidden Markov chain.
In this step, the extracted voice features are compared with the states of each hidden Markov chain to obtain the score of the chain. The score characterizes the similarity between the phrase in the current input voice and each keyword: the higher the score, the higher the similarity.
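The per-chain scoring just described (comparing the extracted feature frames against the chain's states, so that a higher score means higher similarity) can be sketched as a Viterbi pass over a left-to-right chain. This is a minimal sketch under stated assumptions: 1-D features and a negative squared distance as a stand-in for the emission log-likelihood, since the text does not fix a particular distance measure:

```python
def score_chain(frames, state_params):
    """Viterbi score of 1-D feature frames against a left-to-right chain.
    Each state may self-loop or advance by one; the emission
    'log-likelihood' is -(frame - state_param)^2. Higher = more similar."""
    neg_inf = float("-inf")
    n_states = len(state_params)
    # best[j] = best score of any path ending in state j after this frame
    best = [neg_inf] * n_states
    best[0] = -(frames[0] - state_params[0]) ** 2  # must start in state 0
    for frame in frames[1:]:
        new = [neg_inf] * n_states
        for j in range(n_states):
            stay = best[j]
            advance = best[j - 1] if j > 0 else neg_inf
            prev = max(stay, advance)
            if prev != neg_inf:
                new[j] = prev - (frame - state_params[j]) ** 2
        best = new
    return best[-1]  # the path must end in the final state
```

Scoring every keyword chain and every garbage chain this way and taking the argmax yields the best-matching phrase, mirroring step 303 below.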
Step 303: confirm whether the phrase corresponding to the highest-scoring hidden Markov chain is a preset instruction word; if so, go to step 304; if not, go to step 312.
In this step, whether the phrase corresponding to the highest-scoring hidden Markov chain is a preset instruction word can be determined from the empty tail state of the chain.
Step 304: wake up the voice recognizer.
In this embodiment of the present invention, the voice recognizer is generally deployed on a server in the cloud.
Step 305: send the current input voice to the voice recognizer.
Step 306: the voice recognizer performs semantic parsing on the current input voice to obtain the semantics of the current input voice.
When an instruction word is detected in the current input voice, the instruction word does not necessarily mean that what the user said is a voice instruction; the current input voice may just happen to contain the instruction word while the user's intent is something else. For example, when the user says "葫芦岛航道" ("Huludao channel"), it contains a pronunciation similar to "导航到" ("navigate to"), but the user's real intent is not to navigate to a destination. The semantic parsing of the current input voice can use prior-art methods, for example template-matching-based methods or sequence-labeling-based methods; the specific processing is not described in detail here.
Step 307: the voice recognizer determines whether the semantics of the current input voice match the set instruction semantics; if so, go to step 308; if not, go to step 310.
In this step, the set instruction semantics are set according to the application scenario and contain multiple semantic phrases, for example "instruction word" + "place noun". For instance, for a navigator used for its navigation function, the set instruction semantics are "navigate to" + "place noun", where the place noun may be Beijing, Zhongguancun in Haidian District, Xitucheng, etc. The determined semantics of the current input voice are compared with each set instruction semantics; if a match with the current input voice is found, the match succeeds and step 308 follows; if no match is found, the match fails and step 310 follows.
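The instruction-semantics matching just described ("instruction word" + "place noun") can be sketched with a toy template matcher. This is illustrative only: the template table and the place-noun list are hypothetical stand-ins, and a real system would use template matching or sequence labeling over the parsed semantics as the text notes:

```python
INSTRUCTION_TEMPLATES = {
    # instruction word -> category of noun expected to follow it
    "navigate to": "place",
}
NOUNS = {
    "place": {"Beijing", "Zhongguancun, Haidian District", "Xitucheng"},
}

def match_instruction(parsed_instruction, parsed_argument):
    """Return True when (instruction word, argument) fits a set template,
    e.g. 'navigate to' + a known place noun; otherwise the match fails."""
    category = INSTRUCTION_TEMPLATES.get(parsed_instruction)
    return category is not None and parsed_argument in NOUNS[category]
```

A successful match corresponds to step 308 (send the match-success message); a failed match corresponds to step 310.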
Step 308: the voice recognizer sends a match-success message to the smart device.
Step 309: the smart device performs the corresponding operation indicated by the instruction word.
In this step, if the smart device is a television and the user says "watch the sports channel", the smart TV switches directly to the sports channel upon receiving the match-success message sent by the semantic recognizer. In the prior art, by contrast, the user first has to say the wake-up word (e.g. "你好乐乐"), and only after the voice recognizer is woken can the user say the instruction word "watch the sports channel".
Step 310: the voice recognizer sends a match-failure message to the smart device.
Step 311: after receiving the match-failure message, the smart device does nothing.
Step 312: confirm whether the phrase corresponding to the highest-scoring hidden Markov chain is a wake-up word or a garbage word; if a wake-up word, go to step 313; if a garbage word, go to step 314.
Step 313: wake up the voice recognizer.
In this step, if the smart device detects a wake-up word in the current input voice, it wakes up the voice recognizer. After saying the wake-up word, the user usually goes on to say an instruction word; the smart device continues keyword detection to determine whether an instruction word is present in the current input voice. The specific detection is the same as steps 301-311 above and is not described in detail again.
Step 314: when the phrase corresponding to the highest-scoring hidden Markov chain is a garbage word, determine that the current input voice contains no keyword.
Further, when it is determined that the current input voice contains no keyword, the keyword detection model returns to the detection entry and continues to detect input voice.
With the method provided by Embodiment 1 of the present invention, because the voice recognizer is woken up directly once an instruction word is detected in the input voice and the corresponding operation is performed according to the instruction word, there is no need to wake the voice recognizer after a wake-up word is detected and then detect again whether an instruction word is present in new input voice, which saves resources; and for the user, there is no need to say the wake-up word before every instruction word, which improves the user experience.
Embodiment 2:
Based on the same inventive concept and corresponding to the voice wake-up method provided by the above embodiments of the present invention, Embodiment 2 of the present invention further provides a voice wake-up apparatus, whose structure is shown in FIG. 4 and which specifically includes:
An extraction unit 401, configured to perform voice feature extraction on the current input voice.
Specifically, existing acoustic model evaluation can be used to extract features from the current input voice. The voice features may be spectral or cepstral coefficients. The current input voice can be detected with a pre-built keyword detection model.
An instruction word determining unit 402, configured to determine, from the extracted voice features and according to the pre-built keyword detection model, whether an instruction word is present in the current input voice, where the keywords in the keyword detection model include at least preset instruction words.
In this embodiment of the present invention, the voice wake-up apparatus detects keywords in the input voice. In general, a user who wants to use the voice interaction function can say a preset keyword, which may be a wake-up word or an instruction word. A wake-up word is a phrase used to wake up the voice recognizer; wake-up words usually use phrases with many voiced initials, for example phrases containing Chinese characters beginning with initials such as m, n, l and r, because voiced initials involve vocal-cord vibration, are easier to distinguish from environmental noise, and are thus more noise-robust. For example, the wake-up word can be set to "你好乐乐" ("Hello Lele") or "嗨乐乐" ("Hi Lele"). An instruction word is a phrase used to instruct the smart device to perform a corresponding operation; it reflects a function specific to the device, e.g. "navigate to" is highly relevant to devices with a navigation function (such as cars), while "play" is usually highly relevant to devices with multimedia functions (such as televisions and mobile phones). Instruction words directly reflect the user's intent. The voice features may be spectral or cepstral coefficients, and one frame of voice feature vector can be extracted from the signal of the input voice every 10 milliseconds.
A first wake-up unit 403, configured to, when an instruction word is present in the current input voice, wake up the voice recognizer and perform the corresponding operation indicated by the instruction word.
Taking a television that includes the voice wake-up apparatus as an example, the user says "watch the sports channel", and the smart TV switches directly to the sports channel upon receiving the match-success message sent by the semantic recognizer. In the prior art, by contrast, the user first has to say the wake-up word (e.g. "你好乐乐"), and only after the voice recognizer is woken can the user say the instruction word "watch the sports channel".
Further, the above apparatus also includes:
An acquisition unit 404, configured to acquire a match-success message for matching the semantics of the current input voice with instruction semantics, the match-success message being sent by the voice recognizer after it performs semantic parsing on the input voice to obtain the semantics of the input voice and successfully matches the semantics of the input voice with set instruction semantics.
When an instruction word is detected in the current input voice, the instruction word does not necessarily mean that what the user said is a voice instruction; the current input voice may just happen to contain the instruction word while the user's intent is something else. For example, when the user says "葫芦岛航道" ("Huludao channel"), it contains a pronunciation similar to "导航到" ("navigate to"), but the user's real intent is not to navigate to a destination. The set instruction semantics are set according to the application scenario and contain multiple semantic phrases, for example "instruction word" + "place noun". For instance, for a navigator used for its navigation function, the set instruction semantics are "navigate to" + "place noun", where the place noun may be Beijing, Zhongguancun in Haidian District, Xitucheng, etc. The determined semantics of the current input voice are compared with each set instruction semantics; if a match with the current input voice is found, the match succeeds; if no match is found, the match fails.
Further, the instruction word determining unit 402 is specifically configured to: for every phoneme in speech, extract the acoustic parameter samples corresponding to each phoneme from a corpus, the corpus storing voice texts and the voice corresponding to the voice texts; train the acoustic parameter samples corresponding to each phoneme according to a preset training algorithm to obtain an acoustic model, the acoustic model being the correspondence between phonemes and their corresponding acoustic parameters; and look up the keyword phonemes corresponding to the keywords in a pronunciation dictionary and build the keyword phonemes and the corresponding acoustic parameters in the acoustic model into the keyword detection model, the pronunciation dictionary storing the phonemes that phrases comprise.
Further, the instruction word determining unit 402 is specifically configured to: look up the keyword phonemes corresponding to the keywords in the pronunciation dictionary, the pronunciation dictionary storing the phonemes that phrases comprise; extract the acoustic parameter samples corresponding to the keyword phonemes from the corpus, the corpus storing the voice corresponding to voice texts; and train the acoustic parameter samples corresponding to the keyword phonemes according to the preset training algorithm to obtain the keyword detection model.
When a user says a keyword, it may be a wake-up word or an instruction word. In different application scenarios the keywords are usually different, so a keyword detection model must be pre-built for each application scenario. Building a keyword detection model means building an acoustic model; the acoustic model can be expressed in many ways, for example as a hidden Markov model or a neural network model. In the embodiments of the present invention, the keyword detection model is expressed as a hidden Markov model as an example. As shown in FIG. 2, each keyword can be expanded in the hidden Markov model into one hidden Markov chain, i.e. a keyword state chain, where each node on a chain corresponds to the acoustic parameters of one state of a keyword phoneme. The two end nodes of each keyword state chain are respectively set as a short silence state and an empty tail state identifying the keyword type; the empty tail state marks whether the hidden Markov chain represents a wake-up word or an instruction word, shown as the black-dot node of each chain in FIG. 2. A node can jump forward, indicating a change in the articulation state, such as a change of mouth shape for a vowel; it can also self-loop, indicating that articulation temporarily stays unchanged, such as the relatively stable articulation phase of a vowel. Each chain begins with a silence state node. In the hidden Markov state chains, combinations of phonemes other than the keyword phonemes form garbage word state chains; a garbage word state chain also ends with an empty tail state, indicating that that hidden Markov chain is a garbage word. The hidden Markov model can determine the start and end of speech through the silence state nodes, thereby determining the current input voice.
Taking a hidden Markov model as the expression of the keyword detection model, the keyword detection model can be built in either of the following two ways:
First way:
For every phoneme in speech, extract the acoustic parameter samples corresponding to each phoneme from a corpus. In terms of sound quality, the smallest unit of speech is the phoneme. Phonemes fall into two classes, vowels and consonants, comprising 10 vowels and 22 consonants, 32 phonemes in total. In a hidden Markov model, three states are usually set for one phoneme according to its phonetic characteristics; each state reflects an acoustic property of the phoneme, for example the change in the shape of the vocal tract when the phoneme is produced. The corpus stores voice texts and the voice corresponding to each voice file; the voice texts can be content from different fields, and the voice corresponding to a voice file can be recordings of different people reading the voice text. Since different voice texts may contain the same phonemes, the acoustic parameter samples corresponding to each phoneme are extracted from the corpus; an acoustic parameter is a parameter characterizing a phoneme state. For example, to extract the acoustic parameter samples corresponding to phoneme a, let the three states of a be b, c and d, and extract n samples for each state; then the samples corresponding to state b are b1, b2, ..., bn, the samples corresponding to state c are c1, c2, ..., cn, and the samples corresponding to state d are d1, d2, ..., dn.
Train the acoustic parameter samples corresponding to each phoneme according to a preset training algorithm to obtain an acoustic model. The acoustic model is the correspondence between phonemes and their corresponding acoustic parameters. The preset training algorithm may be an arithmetic-mean algorithm, e.g. taking the arithmetic mean of the samples of the three states b, c, d of phoneme a: b′ = (b1 + b2 + ... + bn)/n, c′ = (c1 + c2 + ... + cn)/n, d′ = (d1 + d2 + ... + dn)/n, where b′, c′, d′ are the acoustic parameters corresponding to phoneme a. Alternatively, the variance of the samples of each of the three states b, c, d of phoneme a can be computed and used as the acoustic parameters corresponding to phoneme a. Further, the prior-art combination of a hidden Markov model and a neural network can be used: the weight of each neuron is trained by back-propagation to determine a neural network model, whose input is a phoneme and whose output is the acoustic parameters corresponding to that phoneme. The acoustic model is thus the correspondence between each of the 32 phonemes and its acoustic parameters.
After the keywords are determined according to the different application scenarios, the keyword phonemes corresponding to each keyword are looked up in a pronunciation dictionary. The pronunciation dictionary stores the phonemes that phrases comprise. After the keyword phonemes are determined, the acoustic parameters in the acoustic model corresponding to the keyword phonemes are built into the keyword detection model.
Second way:
In this way there is no need to determine the corresponding acoustic parameters for every phoneme; only the acoustic parameters corresponding to the keyword phonemes need to be determined.
Determine the keywords according to the different application scenarios, and look up the keyword phonemes corresponding to each keyword in the pronunciation dictionary.
Extract the acoustic parameter samples corresponding to the keyword phonemes from the corpus.
Train the acoustic parameter samples corresponding to the keyword phonemes according to the preset training algorithm to obtain the keyword detection model. The training algorithm used here is the same as that used in the first way and is not described in detail again.
The instruction word determining unit 402 is specifically configured to: using acoustic model evaluation, perform instruction word confirmation on each hidden Markov chain in the hidden Markov model for the extracted voice features, to obtain the instruction word confirmation score of the chain; and confirm whether the phrase of the hidden Markov chain with the highest instruction word confirmation score is a preset instruction word.
The instruction word determining unit 402 uses existing acoustic model evaluation to compare the extracted voice features with the states of each hidden Markov chain to obtain the score of the chain; the score characterizes the similarity between the phrase in the input voice and each keyword, and the higher the score, the higher the similarity.
Further, the keywords in the keyword detection model also include a preset wake-up word.
Further, the above apparatus also includes:
A second wake-up unit 405, configured to wake up the voice recognizer when it is determined, from the extracted voice features and according to the pre-built keyword detection model, that a wake-up word is present in the input voice.
The functions of the above units may correspond to the respective processing steps in the flows shown in FIG. 1 or FIG. 3, and are not described again here.
In the embodiments of the present invention, the related function modules may be implemented by a hardware processor.
With the apparatus provided by Embodiment 2 of the present invention, because the voice recognizer is woken up directly once an instruction word is detected in the input voice and the corresponding operation is performed according to the instruction word, there is no need to wake the voice recognizer after a wake-up word is detected and then detect again whether an instruction word is present in new input voice, which saves resources; and for the user, there is no need to say the wake-up word before every instruction word, which improves the user experience.
Embodiment 3:
Based on the same inventive concept and corresponding to the voice wake-up method provided by the above embodiments of the present invention, Embodiment 3 of the present invention further provides a voice wake-up system, whose structure is shown in FIG. 5 and which includes a keyword detection module 501 and a voice recognizer 502, where:
The keyword detection module 501 is configured to perform voice feature extraction on the acquired current input voice; determine, from the extracted voice features and according to a pre-built keyword detection model, whether an instruction word is present in the current input voice, where the keyword detection model includes at least instruction word detection; and when an instruction word is present in the current input voice, wake up the voice recognizer and send the current input voice to the voice recognizer.
The current input voice can be detected with the pre-built keyword detection model.
The pre-built keyword detection model is specifically built as follows:
In general, a user who wants to use the voice interaction function can say a preset keyword, which may be a wake-up word or an instruction word. A wake-up word is a phrase used to wake up the voice recognizer; wake-up words usually use phrases with many voiced initials, for example phrases containing Chinese characters beginning with initials such as m, n, l and r, because voiced initials involve vocal-cord vibration, are easier to distinguish from environmental noise, and are thus more noise-robust. For example, the wake-up word can be set to "你好乐乐" ("Hello Lele") or "嗨乐乐" ("Hi Lele"). An instruction word is a phrase used to instruct the smart device to perform a corresponding operation; it reflects a function specific to the device, e.g. "navigate to" is highly relevant to devices with a navigation function (such as cars), while "play" is usually highly relevant to devices with multimedia functions (such as televisions and mobile phones). Instruction words directly reflect the user's intent. The voice features may be spectral or cepstral coefficients, and one frame of voice feature vector can be extracted from the signal of the input voice every 10 milliseconds.
When a user says a keyword, it may be a wake-up word or an instruction word. In different application scenarios the keywords are usually different, so a keyword detection model must be pre-built for each application scenario. Building a keyword detection model means building an acoustic model; the acoustic model can be expressed in many ways, for example as a hidden Markov model or a neural network model. In the embodiments of the present invention, the keyword detection model is expressed as a hidden Markov model as an example. As shown in FIG. 2, each keyword can be expanded in the hidden Markov model into one hidden Markov chain, i.e. a keyword state chain, where each node on a chain corresponds to the acoustic parameters of one state of a keyword phoneme. The two end nodes of each keyword state chain are respectively set as a short silence state and an empty tail state identifying the keyword type; the empty tail state marks whether the hidden Markov chain represents a wake-up word or an instruction word, shown as the black-dot node of each chain in FIG. 2. A node can jump forward, indicating a change in the articulation state, such as a change of mouth shape for a vowel; it can also self-loop, indicating that articulation temporarily stays unchanged, such as the relatively stable articulation phase of a vowel. Each chain begins with a silence state node. In the hidden Markov state chains, combinations of phonemes other than the keyword phonemes form garbage word state chains; a garbage word state chain also ends with an empty tail state, indicating that that hidden Markov chain is a garbage word.
Taking a hidden Markov model as the expression of the keyword detection model, the keyword detection model can be built in either of the following two ways:
First way:
For every phoneme in speech, extract the acoustic parameter samples corresponding to each phoneme from a corpus. In terms of sound quality, the smallest unit of speech is the phoneme. Phonemes fall into two classes, vowels and consonants, comprising 10 vowels and 22 consonants, 32 phonemes in total. In a hidden Markov model, three states are usually set for one phoneme according to its phonetic characteristics; each state reflects an acoustic property of the phoneme, for example the change in the shape of the vocal tract when the phoneme is produced. The corpus stores voice texts and the voice corresponding to each voice file; the voice texts can be content from different fields, and the voice corresponding to a voice file can be recordings of different people reading the voice text. Since different voice texts may contain the same phonemes, the acoustic parameter samples corresponding to each phoneme are extracted from the corpus; an acoustic parameter is a parameter characterizing a phoneme state. For example, to extract the acoustic parameter samples corresponding to phoneme a, let the three states of a be b, c and d, and extract n samples for each state; then the samples corresponding to state b are b1, b2, ..., bn, the samples corresponding to state c are c1, c2, ..., cn, and the samples corresponding to state d are d1, d2, ..., dn.
Train the acoustic parameter samples corresponding to each phoneme according to a preset training algorithm to obtain an acoustic model. The acoustic model is the correspondence between phonemes and their corresponding acoustic parameters. The preset training algorithm may be an arithmetic-mean algorithm, e.g. taking the arithmetic mean of the samples of the three states b, c, d of phoneme a: b′ = (b1 + b2 + ... + bn)/n, c′ = (c1 + c2 + ... + cn)/n, d′ = (d1 + d2 + ... + dn)/n, where b′, c′, d′ are the acoustic parameters corresponding to phoneme a. Alternatively, the variance of the samples of each of the three states b, c, d of phoneme a can be computed and used as the acoustic parameters corresponding to phoneme a. Further, the prior-art combination of a hidden Markov model and a neural network can be used: the weight of each neuron is trained by back-propagation to determine a neural network model, whose input is a phoneme and whose output is the acoustic parameters corresponding to that phoneme. The acoustic model is thus the correspondence between each of the 32 phonemes and its acoustic parameters.
After the keywords are determined according to the different application scenarios, the keyword phonemes corresponding to each keyword are looked up in a pronunciation dictionary. The pronunciation dictionary stores the phonemes that phrases comprise. After the keyword phonemes are determined, the acoustic parameters in the acoustic model corresponding to the keyword phonemes are built into the keyword detection model.
Second way:
In this way there is no need to determine the corresponding acoustic parameters for every phoneme; only the acoustic parameters corresponding to the keyword phonemes need to be determined.
Determine the keywords according to the different application scenarios, and look up the keyword phonemes corresponding to each keyword in the pronunciation dictionary.
Extract the acoustic parameter samples corresponding to the keyword phonemes from the corpus.
Train the acoustic parameter samples corresponding to the keyword phonemes according to the preset training algorithm to obtain the keyword detection model. The training algorithm used here is the same as that used in the first way and is not described in detail again.
Using acoustic model evaluation, the keyword detection module 501 can perform keyword confirmation on each hidden Markov chain in the hidden Markov model for the extracted voice features to obtain the score of the chain. The score characterizes the similarity between the phrase in the input voice and each keyword; the higher the score, the higher the similarity. The module confirms whether the phrase corresponding to the highest-scoring hidden Markov chain is a preset instruction word; specifically, this can be determined from the empty tail state of the chain. If the phrase corresponding to the highest-scoring hidden Markov chain is a preset instruction word, the voice recognizer is woken up and the input voice is sent to the voice recognizer 502.
The voice recognizer 502 is configured to perform semantic parsing on the current input voice to obtain the semantics of the current input voice; determine that the semantics of the current input voice match set instruction semantics; and issue, according to the instruction word, a command to perform the corresponding operation indicated by the instruction word.
When an instruction word is detected in the input voice, the instruction word does not necessarily mean that what the user said is a voice instruction; the input voice may just happen to contain the instruction word while the user's intent is something else. For example, when the user says "葫芦岛航道" ("Huludao channel"), it contains a pronunciation similar to "导航到" ("navigate to"), but the user's real intent is not to navigate to a destination. Therefore, semantic parsing is performed for the detected instruction word.
In the voice wake-up system shown in FIG. 5 provided by Embodiment 3 of the present invention, the further functions of the included keyword detection module 501 and voice recognizer 502 may correspond to the respective processing steps in the flows shown in FIG. 1 and FIG. 3, and are not described again here.
In summary, the solution provided by the embodiments of the present invention includes: performing voice feature extraction on the acquired current input voice; determining, from the extracted voice features and according to a pre-built keyword detection model, whether an instruction word is present in the current input voice, where the keywords in the keyword detection model include at least preset instruction words; and when an instruction word is present in the current input voice, waking up the voice recognizer and performing the corresponding operation according to the instruction word. With the solution provided by the embodiments of the present invention, because the voice recognizer is woken up directly once an instruction word is detected in the current input voice and the corresponding operation is performed according to the instruction word, there is no need to wake the voice recognizer after a wake-up word is detected and then detect again whether an instruction word is present in new input voice, which saves resources. And for the user, there is no need to say the wake-up word before every instruction word, which improves the user experience.
For the apparatus in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method, and will not be elaborated here.
A person of ordinary skill in the art can understand that all or part of the steps of the above method embodiments can be implemented by hardware related to program instructions. The aforementioned program can be stored in a computer-readable storage medium; when executed, the program performs the steps of the above method embodiments. The aforementioned storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk or an optical disk.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some or all of their technical features can be equivalently replaced; such modifications or replacements do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (13)

  1. A voice wake-up method, characterized by comprising:
    performing voice feature extraction on the acquired current input voice;
    determining, from the extracted voice features and according to a pre-built keyword detection model, whether an instruction word is present in the current input voice, wherein the keywords in the keyword detection model comprise at least preset instruction words;
    when an instruction word is present in the current input voice, waking up a voice recognizer and performing the corresponding operation indicated by the instruction word.
  2. The method according to claim 1, characterized in that, before performing the corresponding operation indicated by the instruction word, the method further comprises:
    acquiring a match-success message for matching the semantics of the current input voice with instruction semantics, the match-success message being sent by the voice recognizer after the voice recognizer performs semantic parsing on the input voice to obtain the semantics of the current input voice and successfully matches the semantics of the current input voice with set instruction semantics.
  3. The method according to claim 1, characterized in that building the keyword detection model specifically comprises:
    for every phoneme in speech, extracting the acoustic parameter samples corresponding to each phoneme from a corpus, the corpus storing voice texts and the voice corresponding to the voice texts;
    training the acoustic parameter samples corresponding to each phoneme according to a preset training algorithm to obtain an acoustic model, the acoustic model being the correspondence between phonemes and their corresponding acoustic parameters;
    looking up the keyword phonemes corresponding to the keywords in a pronunciation dictionary, and building the keyword phonemes and the corresponding acoustic parameters in the acoustic model into the keyword detection model, the pronunciation dictionary storing the phonemes that phrases comprise.
  4. The method according to claim 1, characterized in that building the keyword detection model specifically comprises:
    looking up the keyword phonemes corresponding to the keywords in a pronunciation dictionary, the pronunciation dictionary storing the phonemes that phrases comprise;
    extracting the acoustic parameter samples corresponding to the keyword phonemes from a corpus, the corpus storing the voice corresponding to voice texts;
    training the acoustic parameter samples corresponding to the keyword phonemes according to a preset training algorithm to obtain the keyword detection model.
  5. The method according to claim 1, characterized in that the keyword detection model is a hidden Markov chain model;
    determining, from the extracted voice features and according to the pre-built keyword detection model, whether an instruction word is present in the input voice specifically comprises:
    using acoustic model evaluation, performing instruction word confirmation on each hidden Markov chain in the hidden Markov model for the extracted voice features, to obtain the instruction word confirmation score of the hidden Markov chain;
    confirming whether the phrase corresponding to the hidden Markov chain with the highest instruction word confirmation score is a preset instruction word.
  6. The method according to claim 1, characterized in that the keywords in the keyword detection model further comprise a preset wake-up word;
    the method further comprises:
    waking up the voice recognizer when it is determined, from the extracted voice features and according to the pre-built keyword detection model, that a wake-up word is present in the input voice.
  7. A voice wake-up apparatus, characterized by comprising:
    an extraction unit, configured to perform voice feature extraction on the acquired current input voice;
    an instruction word determining unit, configured to determine, from the extracted voice features and according to a pre-built keyword detection model, whether an instruction word is present in the current input voice, wherein the keywords in the keyword detection model comprise at least preset instruction words;
    a first wake-up unit, configured to, when an instruction word is present in the current input voice, wake up a voice recognizer and perform the corresponding operation indicated by the instruction word.
  8. The apparatus according to claim 7, characterized by further comprising:
    an acquisition unit, configured to acquire a match-success message for matching the semantics of the current input voice with instruction semantics, the match-success message being sent by the voice recognizer after the voice recognizer performs semantic parsing on the input voice to obtain the semantics of the current input voice and successfully matches the semantics of the current input voice with set instruction semantics.
  9. The apparatus according to claim 7, characterized in that the instruction word determining unit is specifically configured to: for every phoneme in speech, extract the acoustic parameter samples corresponding to each phoneme from a corpus, the corpus storing voice texts and the voice corresponding to the voice texts; train the acoustic parameter samples corresponding to each phoneme according to a preset training algorithm to obtain an acoustic model, the acoustic model being the correspondence between phonemes and their corresponding acoustic parameters; and look up the keyword phonemes corresponding to the keywords in a pronunciation dictionary and build the keyword phonemes and the corresponding acoustic parameters in the acoustic model into the keyword detection model, the pronunciation dictionary storing the phonemes that phrases comprise.
  10. The apparatus according to claim 7, characterized in that the instruction word determining unit is specifically configured to: look up the keyword phonemes corresponding to the keywords in a pronunciation dictionary, the pronunciation dictionary storing the phonemes that phrases comprise; extract the acoustic parameter samples corresponding to the keyword phonemes from a corpus, the corpus storing the voice corresponding to voice texts; and train the acoustic parameter samples corresponding to the keyword phonemes according to a preset training algorithm to obtain the keyword detection model.
  11. The apparatus according to claim 7, characterized in that the keyword detection model is a hidden Markov chain model;
    the instruction word determining unit is specifically configured to: using acoustic model evaluation, perform instruction word confirmation on each hidden Markov chain in the hidden Markov model for the extracted voice features, to obtain the instruction word confirmation score of the hidden Markov chain; and confirm whether the phrase of the hidden Markov chain with the highest instruction word confirmation score is a preset instruction word.
  12. The apparatus according to claim 7, characterized in that the keywords in the keyword detection model further comprise a preset wake-up word;
    the apparatus further comprises:
    a second wake-up unit, configured to wake up the voice recognizer when it is determined, from the extracted voice features and according to the pre-built keyword detection model, that a wake-up word is present in the input voice.
  13. A voice wake-up system, characterized by comprising a keyword detection module and a voice recognizer, wherein:
    the keyword detection module is configured to perform voice feature extraction on the acquired current input voice; determine, from the extracted voice features and according to a pre-built keyword detection model, whether an instruction word is present in the current input voice, the keyword detection model comprising at least instruction word detection; and when an instruction word is present in the current input voice, wake up the voice recognizer and send the current input voice to the voice recognizer;
    the voice recognizer is configured to perform semantic parsing on the current input voice to obtain the semantics of the current input voice; determine that the semantics of the current input voice match set instruction semantics; and issue, according to the instruction word, a command to perform the corresponding operation indicated by the instruction word.
PCT/CN2016/082401 2015-10-26 2016-05-17 一种语音唤醒方法、装置及系统 WO2017071182A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP16739388.3A EP3179475A4 (en) 2015-10-26 2016-05-17 Voice wakeup method, apparatus and system
RU2016135447A RU2016135447A (ru) 2015-10-26 2016-05-17 Способ, устройство и система для пробуждения голосом
US15/223,799 US20170116994A1 (en) 2015-10-26 2016-07-29 Voice-awaking method, electronic device and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510702094.1A CN105654943A (zh) 2015-10-26 2015-10-26 一种语音唤醒方法、装置及系统
CN201510702094.1 2015-10-26

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/223,799 Continuation US20170116994A1 (en) 2015-10-26 2016-07-29 Voice-awaking method, electronic device and storage medium

Publications (1)

Publication Number Publication Date
WO2017071182A1 true WO2017071182A1 (zh) 2017-05-04

Family

ID=56482004

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/082401 WO2017071182A1 (zh) 2015-10-26 2016-05-17 一种语音唤醒方法、装置及系统

Country Status (4)

Country Link
EP (1) EP3179475A4 (zh)
CN (1) CN105654943A (zh)
RU (1) RU2016135447A (zh)
WO (1) WO2017071182A1 (zh)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109003611A (zh) * 2018-09-29 2018-12-14 百度在线网络技术(北京)有限公司 用于车辆语音控制的方法、装置、设备和介质
CN110225386A (zh) * 2019-05-09 2019-09-10 青岛海信电器股份有限公司 一种显示控制方法、显示设备
CN110415691A (zh) * 2018-04-28 2019-11-05 青岛海尔多媒体有限公司 基于语音识别的控制方法、装置及计算机可读存储介质
CN111128134A (zh) * 2018-10-11 2020-05-08 阿里巴巴集团控股有限公司 声学模型训练方法和语音唤醒方法、装置及电子设备
CN111429915A (zh) * 2020-03-31 2020-07-17 国家电网有限公司华东分部 一种基于语音识别的调度系统及调度方法
CN112331229A (zh) * 2020-10-23 2021-02-05 网易有道信息技术(北京)有限公司 语音检测方法、装置、介质和计算设备
CN115331670A (zh) * 2022-08-09 2022-11-11 深圳市麦驰信息技术有限公司 一种家用电器用离线语音遥控器

Families Citing this family (62)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106328137A (zh) * 2016-08-19 2017-01-11 镇江惠通电子有限公司 Voice control method, device and system
CN107767861B (zh) * 2016-08-22 2021-07-02 科大讯飞股份有限公司 Voice wake-up method, system and intelligent terminal
CN106157950A (zh) * 2016-09-29 2016-11-23 合肥华凌股份有限公司 Voice control system, wake-up method and wake-up device thereof, household appliance and coprocessor
CN108074563A (zh) * 2016-11-09 2018-05-25 珠海格力电器股份有限公司 Control method and device for a clock application
CN106847273B (zh) * 2016-12-23 2020-05-05 北京云知声信息技术有限公司 Wake-up word selection method and device for speech recognition
CN106653022B (zh) * 2016-12-29 2020-06-23 百度在线网络技术(北京)有限公司 Artificial-intelligence-based voice wake-up method and device
CN107610695B (зh) * 2017-08-08 2021-07-06 大众问问(北京)信息科技有限公司 Method for dynamically adjusting the weights of driver voice wake-up command words
CN107704275B (zh) * 2017-09-04 2021-07-23 百度在线网络技术(北京)有限公司 Smart device wake-up method, apparatus, server and smart device
CN107610702B (zh) * 2017-09-22 2021-01-29 百度在线网络技术(北京)有限公司 Terminal device standby wake-up method, apparatus and computer equipment
CN107578776B (zh) * 2017-09-25 2021-08-06 咪咕文化科技有限公司 Voice interaction wake-up method, device and computer-readable storage medium
CN109584860B (zh) * 2017-09-27 2021-08-03 九阳股份有限公司 Voice wake-up word definition method and system
CN109741735B (zh) * 2017-10-30 2023-09-01 阿里巴巴集团控股有限公司 Modeling method, and acoustic model acquisition method and device
CN109817220A (zh) * 2017-11-17 2019-05-28 阿里巴巴集团控股有限公司 Speech recognition method, device and system
CN109903751B (zh) * 2017-12-08 2023-07-07 阿里巴巴集团控股有限公司 Keyword confirmation method and device
CN108198552B (zh) * 2018-01-18 2021-02-02 深圳市大疆创新科技有限公司 Voice control method and video glasses
CN108039175B (zh) 2018-01-29 2021-03-26 北京百度网讯科技有限公司 Speech recognition method, device and server
CN110097870B (zh) * 2018-01-30 2023-05-30 阿里巴巴集团控股有限公司 Speech processing method, device, equipment and storage medium
CN110096249A (zh) * 2018-01-31 2019-08-06 阿里巴巴集团控股有限公司 Method, device and system for prompting shortcut wake-up words
CN108520743B (zh) * 2018-02-02 2021-01-22 百度在线网络技术(北京)有限公司 Voice control method for a smart device, smart device and computer-readable medium
CN108536668B (zh) * 2018-02-26 2022-06-07 科大讯飞股份有限公司 Wake-up word evaluation method and device, storage medium and electronic device
CN111819626A (zh) * 2018-03-07 2020-10-23 华为技术有限公司 Voice interaction method and device
CN108538298B (zh) * 2018-04-04 2021-05-04 科大讯飞股份有限公司 Voice wake-up method and device
EP3561806B1 (en) * 2018-04-23 2020-04-22 Spotify AB Activation trigger processing
CN108735210A (zh) * 2018-05-08 2018-11-02 宇龙计算机通信科技(深圳)有限公司 Voice control method and terminal
JP2019211599A (ja) * 2018-06-04 2019-12-12 本田技研工業株式会社 Speech recognition device, speech recognition method and program
CN108877780B (zh) * 2018-06-06 2021-06-01 广东小天才科技有限公司 Voice-based question search method and tutoring device
CN108899028A (зh) * 2018-06-08 2018-11-27 广州视源电子科技股份有限公司 Voice wake-up method, search method, device and terminal
CN108735216B (zh) * 2018-06-12 2020-10-16 广东小天才科技有限公司 Voice-based question search method using semantic recognition, and tutoring device
CN110600023A (zh) * 2018-06-12 2019-12-20 Tcl集团股份有限公司 Terminal device interaction method, device and terminal device
CN109065045A (zh) * 2018-08-30 2018-12-21 出门问问信息科技有限公司 Speech recognition method, device, electronic device and computer-readable storage medium
CN109253728A (zh) * 2018-08-31 2019-01-22 平安科技(深圳)有限公司 Voice navigation method, device, computer equipment and storage medium
CN109346070A (zh) * 2018-09-17 2019-02-15 佛吉亚好帮手电子科技有限公司 Wake-up-free voice method based on an in-vehicle Android system
CN109147764A (zh) * 2018-09-20 2019-01-04 百度在线网络技术(北京)有限公司 Voice interaction method, device, equipment and computer-readable medium
CN109215634A (zh) * 2018-10-22 2019-01-15 上海声瀚信息科技有限公司 Method and system for controlling an on-off device with multi-word speech
CN111199732B (zh) * 2018-11-16 2022-11-15 深圳Tcl新技术有限公司 Emotion-based voice interaction method, storage medium and terminal device
CN109545207A (zh) * 2018-11-16 2019-03-29 广东小天才科技有限公司 Voice wake-up method and device
CN109243462A (zh) * 2018-11-20 2019-01-18 广东小天才科技有限公司 Voice wake-up method and device
CN109360567B (zh) * 2018-12-12 2021-07-20 思必驰科技股份有限公司 Method and device for customizable wake-up
CN109364477A (zh) * 2018-12-24 2019-02-22 苏州思必驰信息科技有限公司 Method and device for playing mahjong games via voice control
CN109584878A (zh) * 2019-01-14 2019-04-05 广东小天才科技有限公司 Voice wake-up method and system
CN111462756B (zh) * 2019-01-18 2023-06-27 北京猎户星空科技有限公司 Voiceprint recognition method, device, electronic device and storage medium
CN109753665B (zh) * 2019-01-30 2020-10-16 北京声智科技有限公司 Wake-up model update method and device
CN109754788B (zh) * 2019-01-31 2020-08-28 百度在线网络技术(北京)有限公司 Voice control method, device, equipment and storage medium
CN109741746A (zh) * 2019-01-31 2019-05-10 上海元趣信息技术有限公司 Highly anthropomorphic robot voice interaction algorithm, emotional communication algorithm and robot
CN110070863A (zh) * 2019-03-11 2019-07-30 华为技术有限公司 Voice control method and device
CN109979440B (zh) * 2019-03-13 2021-05-11 广州市网星信息技术有限公司 Keyword sample determination method, speech recognition method, device, equipment and medium
CN110032316A (зh) * 2019-03-29 2019-07-19 五邑大学 Interaction method, device and storage medium for a smart wall clock
CN111862963B (zh) * 2019-04-12 2024-05-10 阿里巴巴集团控股有限公司 Voice wake-up method, device and equipment
CN110232916A (zh) * 2019-05-10 2019-09-13 平安科技(深圳)有限公司 Speech processing method, device, computer equipment and storage medium
CN110444207A (zh) * 2019-08-06 2019-11-12 广州豫本草电子科技有限公司 Intelligent response control method, device, medium and terminal device based on the Hengtong instrument
CN111756935A (zh) * 2019-12-12 2020-10-09 北京沃东天骏信息技术有限公司 Information processing method for an intelligent system, and intelligent system
CN111081254B (zh) * 2019-12-26 2022-09-23 思必驰科技股份有限公司 Speech recognition method and device
CN111462777B (zh) * 2020-03-30 2023-02-14 厦门快商通科技股份有限公司 Keyword retrieval method, system, mobile terminal and storage medium
CN111554284A (zh) * 2020-04-24 2020-08-18 广东电网有限责任公司东莞供电局 Switching operation monitoring method, device, equipment and storage medium
CN111555247A (zh) * 2020-04-24 2020-08-18 广东电网有限责任公司东莞供电局 Switching operation control method, device, equipment and medium for power equipment
CN111739521B (zh) * 2020-06-19 2021-06-22 腾讯科技(深圳)有限公司 Electronic device wake-up method, device, electronic device and storage medium
CN112037772B (zh) * 2020-09-04 2024-04-02 平安科技(深圳)有限公司 Multimodal response obligation detection method, system and device
CN112233656A (zh) * 2020-10-09 2021-01-15 安徽讯呼信息科技有限公司 Artificial intelligence voice wake-up method
CN112420044A (zh) * 2020-12-03 2021-02-26 深圳市欧瑞博科技股份有限公司 Speech recognition method, speech recognition device and electronic device
CN112735441A (zh) * 2020-12-07 2021-04-30 浙江合众新能源汽车有限公司 Intelligent ecological speech recognition system
CN113643700B (zh) * 2021-07-27 2024-02-27 广州市威士丹利智能科技有限公司 Control method and system for an intelligent voice switch
CN115472156A (zh) * 2022-09-05 2022-12-13 Oppo广东移动通信有限公司 Voice control method, device, storage medium and electronic device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101383150A (zh) * 2008-08-19 2009-03-11 南京师范大学 Control method of a voice softswitch and its application in geographic information systems
CN102929390A (zh) * 2012-10-16 2013-02-13 广东欧珀移动通信有限公司 Method and device for launching an application in standby state
CN103871408A (zh) * 2012-12-14 2014-06-18 联想(北京)有限公司 Speech recognition method and device, and electronic device
CN103943105A (zh) * 2014-04-18 2014-07-23 安徽科大讯飞信息科技股份有限公司 Voice interaction method and system
CN104538030A (zh) * 2014-12-11 2015-04-22 科大讯飞股份有限公司 Control system and method for controlling household appliances by voice
CN104866274A (zh) * 2014-12-01 2015-08-26 联想(北京)有限公司 Information processing method and electronic device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11393461B2 (en) * 2013-03-12 2022-07-19 Cerence Operating Company Methods and apparatus for detecting a voice command
US20140365225A1 (en) * 2013-06-05 2014-12-11 DSP Group Ultra-low-power adaptive, user independent, voice triggering schemes

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3179475A4 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110415691A (zh) * 2018-04-28 2019-11-05 青岛海尔多媒体有限公司 Control method and device based on speech recognition, and computer-readable storage medium
CN109003611A (zh) * 2018-09-29 2018-12-14 百度在线网络技术(北京)有限公司 Method, device, equipment and medium for vehicle voice control
CN111128134A (zh) * 2018-10-11 2020-05-08 阿里巴巴集团控股有限公司 Acoustic model training method, voice wake-up method, device and electronic device
CN110225386A (zh) * 2019-05-09 2019-09-10 青岛海信电器股份有限公司 Display control method and display device
CN111429915A (zh) * 2020-03-31 2020-07-17 国家电网有限公司华东分部 Dispatching system and dispatching method based on speech recognition
CN112331229A (zh) * 2020-10-23 2021-02-05 网易有道信息技术(北京)有限公司 Voice detection method, device, medium and computing device
CN112331229B (zh) * 2020-10-23 2024-03-12 网易有道信息技术(北京)有限公司 Voice detection method, device, medium and computing device
CN115331670A (зh) * 2022-08-09 2022-11-11 深圳市麦驰信息技术有限公司 Offline voice remote control for household appliances

Also Published As

Publication number Publication date
RU2016135447A3 (ru) 2018-03-02
EP3179475A4 (en) 2017-06-28
RU2016135447A (ru) 2018-03-02
EP3179475A1 (en) 2017-06-14
CN105654943A (zh) 2016-06-08

Similar Documents

Publication Publication Date Title
WO2017071182A1 (zh) Voice wake-up method, device and system
US11720326B2 (en) Audio output control
US10884701B2 (en) Voice enabling applications
US9972318B1 (en) Interpreting voice commands
US11061644B2 (en) Maintaining context for voice processes
US11669300B1 (en) Wake word detection configuration
US20170116994A1 (en) Voice-awaking method, electronic device and storage medium
US10365887B1 (en) Generating commands based on location and wakeword
CN107016994B (zh) Method and device for speech recognition
US11862174B2 (en) Voice command processing for locked devices
US20200184967A1 (en) Speech processing system
US10623246B1 (en) Device configuration by natural language processing system
JP2018523156A (ja) Language model speech endpointing
US11195522B1 (en) False invocation rejection for speech processing systems
CN112927683A (zh) Dynamic wake words for voice-enabled devices
US11579841B1 (en) Task resumption in a natural understanding system
US20240029743A1 (en) Intermediate data for inter-device speech processing
CN111862943B (zh) Speech recognition method and device, electronic device and storage medium
WO2022271435A1 (en) Interactive content output
KR20190032557A (ko) Voice-based communication
US11955112B1 (en) Cross-assistant command processing
CN113611316A (zh) Human-computer interaction method, device, equipment and storage medium
US11044567B1 (en) Microphone degradation detection and compensation
US11699444B1 (en) Speech recognition using multiple voice-enabled devices
US11328713B1 (en) On-device contextual understanding

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2016549257

Country of ref document: JP

Kind code of ref document: A

REEP Request for entry into the european phase

Ref document number: 2016739388

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 2016739388

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2016135447

Country of ref document: RU

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE