CN107767863B - Voice awakening method and system and intelligent terminal - Google Patents


Publication number
CN107767863B
CN107767863B
Authority
CN
China
Prior art keywords
awakening
acoustic
recognition result
word recognition
awakening word
Prior art date
Legal status
Active
Application number
CN201610701651.2A
Other languages
Chinese (zh)
Other versions
CN107767863A (en)
Inventor
吴国兵
潘嘉
刘聪
胡国平
胡郁
刘庆峰
Current Assignee
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN201610701651.2A
Publication of CN107767863A
Application granted
Publication of CN107767863B

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/08: Speech classification or search
    • G10L 15/14: Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 15/142: Hidden Markov Models [HMMs]
    • G10L 15/144: Training of HMMs
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue


Abstract

The invention discloses a voice wake-up method and system. The method comprises: receiving voice data; acquiring a first acoustic feature of the voice data; performing wake-up word recognition using the first acoustic feature, a first acoustic model, and a first decoding network to obtain a primary wake-up word recognition result; if the primary result is a wake-up word, judging whether it reaches a set target; if so, acquiring a second acoustic feature of the voice data; performing secondary wake-up word recognition using the second acoustic feature, a second acoustic model, and a second decoding network to obtain a secondary wake-up word recognition result; and determining from the secondary result whether wake-up succeeds. The invention further provides an intelligent terminal. The invention can effectively reduce the power consumption of a voice wake-up system.

Description

Voice awakening method and system and intelligent terminal
Technical Field
The invention relates to the field of voice processing, in particular to a voice awakening method, a voice awakening system and an intelligent terminal.
Background
Voice wake-up works by understanding the semantic information in a user's voice data. Because the process requires no physical contact with the device, it frees the user's hands and has opened one of the first doors toward artificial intelligence; it is therefore widely applied in various intelligent terminals, such as smart wearable devices, mobile phones, tablet computers, and smart household appliances. In the existing method, acoustic features are extracted from the voice data as soon as it is received, and wake-up word recognition is performed using the extracted features and a pre-constructed acoustic model.
The existing voice awakening method has the following defects:
(1) Because it is impossible to predict when a user will initiate a human-computer interaction, the voice data must be monitored continuously, and wake-up word recognition is performed immediately whenever voice data is received; this consumes a large amount of the intelligent terminal's resources and power.
(2) To improve the wake-up success rate, the existing method generally uses a large acoustic model and decoding network to recognize wake-up words, which further increases the power consumption of voice wake-up. This is unacceptable for an intelligent terminal with limited memory: when power consumption is too high, the terminal often crashes or stops responding, greatly degrading the user experience.
Disclosure of Invention
The invention provides a voice awakening method, a voice awakening system and an intelligent terminal, which can effectively reduce the power consumption of the system while ensuring the awakening success rate.
Therefore, the invention provides the following technical scheme:
a voice wake-up method, comprising:
receiving voice data;
acquiring a first acoustic feature of the voice data;
performing awakening word recognition by using the first acoustic feature, the first acoustic model and the first decoding network to obtain a primary awakening word recognition result;
if the primary awakening word recognition result is an awakening word, judging whether the primary awakening word recognition result reaches a set target or not;
if yes, acquiring a second acoustic feature of the voice data;
performing secondary awakening word recognition by using the second acoustic feature, the second acoustic model and the second decoding network to obtain a secondary awakening word recognition result; the second acoustic model is larger than the first acoustic model, and/or the second decoding network is larger than the first decoding network;
and determining whether the awakening is successful or not according to the secondary awakening word recognition result.
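Taken together, the claimed steps form a coarse-to-fine cascade. The sketch below is a minimal, hypothetical Python rendering of that flow; `primary`, `secondary`, and `reaches_target` are illustrative placeholder callables standing in for the small model, the large model, and the set-target check, and are not part of the patent itself.

```python
def voice_wakeup(voice_data, primary, secondary, reaches_target):
    """Two-stage wake-up flow as claimed (all callables are placeholders).

    primary / secondary: functions mapping voice data to
    (is_wake_word, likelihood_ratio) using the small / large model.
    reaches_target: environment-dependent check on the primary result.
    """
    is_wake, t1 = primary(voice_data)        # small model + small network
    if not (is_wake and reaches_target(t1)):
        return False                         # stay in low-power monitoring
    is_wake2, _t2 = secondary(voice_data)    # large model + large network
    return is_wake2
```

Only when the cheap first stage both fires and passes the set-target check does the expensive second stage run, which is what keeps the always-on monitoring path low-power.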
Optionally, the second acoustic characteristic is the same as or different from the first acoustic characteristic.
Optionally, the first acoustic characteristic is any one of the following characteristics: MFCC feature, Bottleneck feature, Filterbank feature.
Preferably, the first acoustic model comprises a wake word acoustic model and an absorption model, wherein the wake word acoustic model and the absorption model are respectively trained, the wake word acoustic model is characterized by a GMM-HMM based on the first acoustic feature, and the absorption model is characterized by the GMM-HMM;
the second acoustic model includes a wake word acoustic model and an absorption model, wherein the wake word acoustic model and the absorption model are trained simultaneously, both of which are characterized using a neural network model based on the second acoustic features.
Preferably, the determining whether the initial awakening word recognition result reaches a set target includes:
determining a current environment state;
and judging whether the initial awakening word recognition result reaches a set target or not according to the environment state.
Preferably, the determining the current environmental state includes:
calculating the signal-to-noise ratio of the voice data;
if the signal to noise ratio is larger than a set value, the current environment state is a quiet environment; otherwise, the current environment state is a noise environment.
Preferably, the determining whether the initial awakening word recognition result reaches a set target according to the environment state includes:
acquiring acoustic likelihood of an awakening word and a non-awakening word obtained in the process of recognizing the initial awakening word;
calculating the acoustic likelihood ratio of the awakening word and the non-awakening word according to the acoustic likelihood;
and if the acoustic likelihood ratio is larger than a judgment threshold corresponding to the environment state, the primary awakening word recognition result reaches a set target.
Preferably, the determining whether the waking is successful according to the secondary waking word recognition result includes:
and if the secondary awakening word recognition result is the awakening word, determining that the awakening is successful.
Preferably, the determining whether the waking is successful according to the secondary waking word recognition result includes:
if the secondary awakening word recognition result is an awakening word, fusing the primary awakening word recognition result and the secondary awakening word recognition result to obtain a fusion result;
and determining whether the awakening is successful or not according to the fusion result.
Preferably, the fusing the primary awakening word recognition result and the secondary awakening word recognition result to obtain a fusion result includes:
respectively acquiring an acoustic likelihood ratio T1 of the primary awakening identification result and an acoustic likelihood ratio T2 of the secondary awakening identification result;
carrying out weighted combination on the acoustic likelihood ratio T1 of the primary awakening identification result and the acoustic likelihood ratio T2 of the secondary awakening identification result to obtain a fusion result T;
the determining whether the awakening is successful according to the fusion result comprises:
if the fusion result T is larger than the set fusion threshold value, awakening successfully; otherwise, the awakening fails.
Preferably, the determining whether the waking is successful according to the secondary waking word recognition result includes:
if the secondary awakening word recognition result is an awakening word, fusing the primary awakening word recognition result and the secondary awakening word recognition result to obtain a fusion result;
calculating the similarity between the time length of the primary awakening identification result and the time length of the secondary awakening identification result;
if the fusion result is larger than a set fusion threshold value and the similarity is larger than a set similarity threshold value, awakening successfully; otherwise, the awakening fails.
A voice wake-up system comprising:
the receiving module is used for receiving voice data;
the first acoustic feature acquisition module is used for acquiring first acoustic features of the voice data;
the primary awakening module is used for performing awakening word recognition by utilizing the first acoustic characteristic, the first acoustic model and the first decoding network to obtain a primary awakening word recognition result;
the judging module is used for judging whether the primary awakening word recognition result reaches a set target or not when the primary awakening word recognition result is the awakening word; if yes, triggering a second acoustic feature acquisition module;
the second acoustic feature acquisition module is used for acquiring a second acoustic feature of the voice data;
the secondary awakening module is used for carrying out secondary awakening word recognition by utilizing the second acoustic characteristic, the second acoustic model and the second decoding network to obtain a secondary awakening word recognition result; the second acoustic model is larger than the first acoustic model, and/or the second decoding network is larger than the first decoding network;
and the determining module is used for determining whether the awakening is successful or not according to the secondary awakening word recognition result.
Preferably, the first acoustic model comprises a wake word acoustic model and an absorption model, wherein the wake word acoustic model and the absorption model are respectively trained, the wake word acoustic model is characterized by a GMM-HMM based on the first acoustic feature, and the absorption model is characterized by the GMM-HMM;
the second acoustic model includes a wake word acoustic model and an absorption model, wherein the wake word acoustic model and the absorption model are trained simultaneously, both of which are characterized using a neural network model based on the second acoustic features.
Preferably, the judging module includes:
an environment state determination unit for determining a current environment state;
and the judging unit is used for judging whether the initial awakening word recognition result reaches a set target or not according to the environment state.
Preferably, the environment state determining unit is specifically configured to calculate a signal-to-noise ratio of the voice data, determine that the current environment state is a quiet environment when the signal-to-noise ratio is greater than a set value, and otherwise determine that the current environment state is a noisy environment.
Preferably, the judging unit includes:
the likelihood obtaining subunit is used for obtaining the acoustic likelihoods of the awakening words and the non-awakening words obtained in the process of identifying the primary awakening words;
and the likelihood ratio calculating subunit is used for calculating the acoustic likelihood ratio of the awakening word and the non-awakening word according to the acoustic likelihoods, and determining that the initial awakening word recognition result reaches a set target when the acoustic likelihood ratio is greater than a judgment threshold corresponding to the environment state.
Preferably, the determining module is specifically configured to determine that the waking is successful when the secondary waking word recognition result is a waking word.
Preferably, the determining module comprises:
the fusion unit is used for fusing the primary awakening word recognition result and the secondary awakening word recognition result to obtain a fusion result when the secondary awakening word recognition result is the awakening word;
and the first determining unit is used for determining whether the awakening is successful or not according to the fusion result.
Preferably, the fusion unit is specifically configured to obtain an acoustic likelihood ratio T1 of the primary wake-up recognition result and an acoustic likelihood ratio T2 of the secondary wake-up recognition result, and perform weighted combination on the acoustic likelihood ratio T1 of the primary wake-up recognition result and the acoustic likelihood ratio T2 of the secondary wake-up recognition result to obtain a fusion result T;
the determining unit is specifically configured to determine that the wake-up is successful when the fusion result T is greater than a set fusion threshold; otherwise, determining that the awakening fails.
Preferably, the determining module comprises:
the fusion unit is used for fusing the primary awakening word recognition result and the secondary awakening word recognition result to obtain a fusion result when the secondary awakening word recognition result is the awakening word;
the similarity calculation unit is used for calculating the similarity between the duration of the primary awakening identification result and the duration of the secondary awakening identification result;
a second determining unit, configured to determine that the waking up is successful when the fusion result is greater than a set fusion threshold and the similarity is greater than a set similarity threshold; otherwise, determining that the awakening fails.
An intelligent terminal comprises the voice awakening system.
Preferably, the intelligent terminal is any one of the following: a wearable device, a mobile phone, a tablet computer, a smart speaker, a household appliance, or an in-vehicle unit.
According to the voice wake-up method, system, and intelligent terminal of the invention, once voice data is received, a small acoustic model and decoding network perform primary wake-up word recognition; only after a wake-up word is recognized and the primary result reaches the set target are a larger acoustic model and decoding network used for secondary recognition. Because primary wake-up consumes little power, using it for continuous monitoring effectively reduces wake-up power consumption; and because the secondary wake-up operation, which uses the larger acoustic model and decoding network, is started only when the primary result reaches the set target, the wake-up success rate is effectively ensured.
Furthermore, the secondary wake-up uses a neural network model with strong learning and nonlinear transformation capabilities, so the trained model is highly discriminative and the wake-up success rate is further improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. Obviously, the drawings described below are only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a voice wake-up method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a voice wake-up system according to an embodiment of the present invention.
Detailed Description
To help those skilled in the art better understand the solutions of the embodiments of the invention, the embodiments are described in further detail below with reference to the drawings and implementations.
Aiming at the problem of high power consumption of the existing voice awakening method, the embodiment of the invention provides a voice awakening method and a voice awakening system.
As shown in fig. 1, it is a flowchart of a voice wake-up method according to an embodiment of the present invention, including the following steps:
step 101, receiving voice data.
The voice data is received through a microphone of the intelligent terminal.
Step 102, obtaining a first acoustic feature of the voice data.
The first acoustic feature is used for primary wake-up; it may, for example, be an MFCC feature. During extraction, the speech data may first be framed, the framed data then pre-emphasized, and finally the spectral features of each frame extracted in turn.
Of course, to further improve distinctiveness, the first acoustic feature may also be a more discriminative feature such as the Bottleneck feature or the Filterbank feature. To extract Bottleneck features, the MFCC features of the voice data are extracted first and fed as input to a pre-constructed deep neural network model, and the output of the Bottleneck layer is taken as the Bottleneck feature; the specific extraction method is the same as in the prior art and is not detailed here. Filterbank feature extraction may likewise follow conventional techniques and is not described in detail here.
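For illustration, the framing, pre-emphasis, and Filterbank steps described above can be sketched with NumPy alone. This is a generic log-Mel filterbank front end under common default settings (25 ms frames, 10 ms shift, 40 Mel bands, 0.97 pre-emphasis), not the patent's exact extractor, and it assumes the signal is at least one frame long:

```python
import numpy as np

def filterbank_features(signal, sample_rate=16000, frame_len=0.025,
                        frame_shift=0.010, n_fft=512, n_mels=40):
    """Log-Mel filterbank features: pre-emphasis, framing, FFT, Mel filtering."""
    signal = np.asarray(signal, float)
    # Pre-emphasis boosts high frequencies: y[t] = x[t] - 0.97 * x[t-1]
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    # Split into overlapping frames and taper the frame edges
    flen = int(frame_len * sample_rate)
    fshift = int(frame_shift * sample_rate)
    n_frames = 1 + max(0, (len(emphasized) - flen) // fshift)
    frames = np.stack([emphasized[i * fshift: i * fshift + flen]
                       for i in range(n_frames)])
    frames *= np.hamming(flen)

    # Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # Triangular Mel filterbank
    def hz_to_mel(hz):
        return 2595.0 * np.log10(1.0 + hz / 700.0)
    def mel_to_hz(mel):
        return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)

    return np.log(power @ fbank.T + 1e-10)  # shape: (n_frames, n_mels)
```

For MFCC features one would additionally apply a DCT to these log-filterbank outputs; a Bottleneck extractor would instead feed MFCCs through a trained DNN and read off the bottleneck layer.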
And 103, performing awakening word recognition by using the first acoustic feature, the first acoustic model and the first decoding network to obtain a primary awakening word recognition result.
In order to reduce wake-up power consumption and storage, a small acoustic model and a small decoding network are used for primary wake-up word recognition, so that the wake-up system can stay in a real-time monitoring state and respond promptly whenever the user wakes it. The construction of the first acoustic model and the structure of the first decoding network are described in detail later.
During decoding, using the pre-constructed smaller decoding network and acoustic model, the acoustic score of the acoustic features of each voice unit on each path of the first decoding network is calculated by dynamic programming, and the path with the highest acoustic score is taken as the optimal path. If the optimal path is a wake-up word path, the recognition result is the wake-up word on that path; if it is an absorption path, the recognition result is a non-wake-up word.
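The dynamic-programming scoring described above can be sketched as a small Viterbi pass over a left-to-right path. The per-frame log-likelihood matrices here are assumed inputs standing in for the acoustic model's scores, and the two-path network is a deliberate simplification of the patent's decoding network:

```python
import numpy as np

def best_path_score(log_likes):
    """Viterbi (dynamic-programming) score of one left-to-right path.

    log_likes: (n_frames, n_states) acoustic log-likelihoods of the states
    along the path, ordered left to right. At each frame the token either
    stays in its state or advances to the next one.
    """
    n_frames, n_states = log_likes.shape
    score = np.full(n_states, -np.inf)
    score[0] = log_likes[0, 0]              # decoding must start in state 0
    for t in range(1, n_frames):
        advanced = np.full(n_states, -np.inf)
        advanced[1:] = score[:-1]           # score of entering state s from s-1
        score = np.maximum(score, advanced) + log_likes[t]
    return score[-1]                        # decoding must end in the last state

def recognize(paths):
    """Return the label of the decoding-network path with the best score.

    paths: dict mapping a path label ('wake-word' or 'absorption') to that
    path's (n_frames, n_states) log-likelihood matrix.
    """
    return max(paths, key=lambda label: best_path_score(paths[label]))
```

The result is the wake-up word when the wake-word path outscores every absorption path, exactly the optimal-path rule in the text.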
104, if the primary awakening word recognition result is an awakening word, judging whether the primary awakening word recognition result reaches a set target; if yes, go to step 105; otherwise, the awakening fails.
In order to further reduce noise interference and improve wake-up accuracy, in the embodiment of the invention, whether the primary wake-up word recognition result reaches the set target can be determined according to the current environment state. To this end, the current environment state must first be determined, for example from the signal-to-noise ratio (SNR) of the received voice data. Specifically, the SNR of the voice data is calculated; if it is greater than a set value, the current environment state is a quiet environment; otherwise it is a noisy environment.
Of course, the environment states are not limited to quiet and noisy; multiple states can be defined according to actual application requirements to meet users' personalized needs. For example, according to the time at which the user performs each wake-up, the environment can be further divided into morning, afternoon, evening, early morning, and so on, with the acoustic likelihood ratio threshold of the wake-up word in each environment set according to experimental results or application requirements.
When judging according to the environment state whether the primary recognition result reaches the set target, decision thresholds for the different environments can be preset, and the result is judged against the threshold for the current environment.
For example, from the acoustic likelihoods of the wake-up word and the non-wake-up words obtained during wake-up word recognition, the ratio between the two is calculated as the acoustic likelihood ratio of the wake-up word. When this likelihood ratio is greater than the threshold, the current voice data is considered non-noise voice data and the secondary wake-up operation is started; otherwise wake-up fails and the system continues to receive voice data.
Acoustic likelihood ratio thresholds are thus set separately for each environment, and the wake-up word is confirmed against the threshold for the current environment. Taking the two environment states above as an example, the confirmation results are shown in Table 1, where T1 is the acoustic likelihood ratio of the wake-up word computed during primary recognition, thres_clear is the threshold in a quiet environment, and thres_noise is the threshold in a noisy environment; the thresholds can be determined from a large number of experiments or from actual application requirements.
TABLE 1
Environment    Condition            Wake-up word confirmation
Quiet          T1 > thres_clear     confirmed; start secondary wake-up
Quiet          T1 <= thres_clear    not confirmed; wake-up fails
Noisy          T1 > thres_noise     confirmed; start secondary wake-up
Noisy          T1 <= thres_noise    not confirmed; wake-up fails
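The environment decision and threshold comparison can be sketched as follows. The SNR floor and the `thres_clear`/`thres_noise` values are illustrative placeholders only, since the patent leaves the actual thresholds to experiment or application requirements:

```python
import numpy as np

def snr_db(speech_frames, noise_frames):
    """Estimate the SNR in dB from speech-segment and noise-segment samples."""
    p_speech = np.mean(np.asarray(speech_frames, float) ** 2)
    p_noise = np.mean(np.asarray(noise_frames, float) ** 2) + 1e-12
    return 10.0 * np.log10(p_speech / p_noise)

def confirm_primary_wakeup(t1, snr, snr_floor=15.0,
                           thres_clear=8.0, thres_noise=12.0):
    """Environment-dependent confirmation of the primary recognition result.

    t1: acoustic likelihood ratio of the wake-up word from primary recognition.
    snr_floor, thres_clear, thres_noise: placeholder values; the patent
    determines them from experiments or application requirements.
    """
    env = "quiet" if snr > snr_floor else "noisy"
    threshold = thres_clear if env == "quiet" else thres_noise
    return env, t1 > threshold
```

Noisy environments get the stricter threshold, so borderline primary detections in noise do not trigger the power-hungry secondary stage.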
Step 105, obtaining a second acoustic feature of the voice data.
The second acoustic feature is used for the secondary wake-up operation. It should be noted that the second acoustic feature may be the same as or different from the first, as determined by application requirements. For example, the first acoustic feature may be the Bottleneck feature while the second is the Filterbank feature; of course, the two may also be identical, e.g. both Bottleneck features. If the second acoustic feature is the same as the first, then in step 106 the first acoustic feature extracted in step 102 can be used directly for secondary wake-up word recognition, without re-extracting acoustic features from the voice data.
And 106, performing secondary awakening word recognition by using the second acoustic feature, the second acoustic model and the second decoding network to obtain a secondary awakening word recognition result.
It should be noted that, in the embodiment of the present invention, the second acoustic model is larger than the first acoustic model, and/or the second decoding network is larger than the first decoding network. In addition, in order to improve the wake-up success rate, the secondary wake-up operation not only uses a larger acoustic model and a decoding network, but also considers the primary wake-up result. The construction of the second acoustic model and the structure of the second decoding network will be described in detail later.
In this step, the wake-up word recognition process is similar to that of the primary wake-up: using the pre-constructed larger decoding network and acoustic model, the acoustic score of the acoustic features of each voice unit on each path of the second decoding network is calculated by dynamic programming, and the path with the highest acoustic score is taken as the optimal path. If the optimal path is a wake-up word path, the recognition result is the wake-up word on that path; if it is an absorption path, the recognition result is a non-wake-up word.
And 107, determining whether the awakening is successful or not according to the secondary awakening word recognition result.
After the secondary awakening word recognition result is obtained, the following ways can be used for determining whether the awakening is successful:
1) and directly determining whether the awakening is successful according to the identification result of the secondary awakening word.
For example, if the secondary awakening word recognition result is an awakening word, the awakening is determined to be successful, otherwise, the awakening is failed.
2) And comprehensively considering the secondary awakening word recognition result and the primary awakening word recognition result to determine whether the awakening is successful.
For example, the primary awakening word recognition result and the secondary awakening word recognition result are fused to obtain a fusion result; and determining whether the awakening is successful or not according to the fusion result. Specific fusion methods are exemplified as follows:
the acoustic likelihood ratio T1 of the primary wake-up recognition result and the acoustic likelihood ratio T2 of the secondary wake-up recognition result are respectively acquired;
T1 and T2 are then weighted and combined to obtain the fusion result T, as shown in formula (1):
T = α*T1 + β*T2 (1)
where α and β are the weighting coefficients of the primary and secondary results.
if the fusion result T is larger than the set fusion threshold value, awakening successfully; otherwise, the awakening fails.
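A minimal sketch of the weighted fusion of formula (1). The weights and the fusion threshold are illustrative values, not values from the patent; in practice they would be tuned experimentally, and a natural choice gives the larger, more accurate secondary model the larger weight:

```python
def fuse_and_decide(t1, t2, alpha=0.4, beta=0.6, fusion_threshold=10.0):
    """Weighted fusion of the primary and secondary acoustic likelihood
    ratios, as in formula (1): T = alpha*T1 + beta*T2.

    alpha, beta, and fusion_threshold are illustrative placeholders.
    """
    t = alpha * t1 + beta * t2
    return t, t > fusion_threshold  # wake-up succeeds only if T exceeds the threshold
```
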
3) Not only the fusion result T but also the similarity between the duration of the wake-up word in the primary wake-up and its duration in the secondary wake-up is considered.
Specifically, the received voice data is first segmented at the state level using the first acoustic feature and the primary wake-up acoustic model, giving a primary duration vector D1 = [d11, d12, …, d1n]; the voice data is then segmented at the state level using the second acoustic feature and the secondary wake-up acoustic model, giving a secondary duration vector D2 = [d21, d22, …, d2n]; finally the similarity between D1 and D2 is calculated, which may be expressed by the cosine distance, Euclidean distance, or the like between the vectors, a smaller distance meaning a higher similarity.
The specific calculation method taking the cosine distance as an example is shown in formula (2):
Dcos = 1 - (D1 · D2) / (||D1|| * ||D2||) (2)
where Dcos is the cosine distance between the duration vectors, the smaller the distance, the higher the similarity.
If the fusion result is larger than a set fusion threshold value and the similarity is larger than a set similarity threshold value, awakening successfully; otherwise, the awakening fails, and the voice data continues to be received.
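The duration-similarity check of formula (2) combined with the fusion threshold can be sketched as follows; both threshold values are illustrative placeholders, and the similarity is taken as one minus the cosine distance so that higher means more similar:

```python
import numpy as np

def duration_cosine_distance(d1, d2):
    """Cosine distance between the two duration vectors, as in formula (2);
    a smaller distance means a higher similarity."""
    d1 = np.asarray(d1, float)
    d2 = np.asarray(d2, float)
    cos_sim = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2) + 1e-12)
    return 1.0 - cos_sim

def final_decision(t_fused, d1, d2, fusion_threshold=10.0, sim_threshold=0.95):
    """Wake up only if the fused score exceeds its threshold AND the duration
    similarity exceeds its threshold. Threshold values are illustrative."""
    similarity = 1.0 - duration_cosine_distance(d1, d2)
    return bool(t_fused > fusion_threshold and similarity > sim_threshold)
```

Requiring the two stages to agree on where the wake-up word's states fall in time is an extra guard against false alarms that happen to score well in both models.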
In the embodiment of the present invention, when performing the first wake-up word recognition, a smaller acoustic model and a smaller decoding network are used, and when performing the second wake-up word recognition, a larger acoustic model and a larger decoding network are used, that is, the first acoustic model is smaller than the second acoustic model, and/or the first decoding network is smaller than the second decoding network.
The acoustic models used in the two wake-up word recognition processes are described in detail below.
First, first acoustic model
The first acoustic model comprises an awakening word acoustic model and an absorption model. The awakening word acoustic model is used to recognize awakening words in the voice data, while the absorption model is used to absorb all acoustic phenomena other than the awakening word, such as non-awakening-word speech, various forms of noise, music, and the like.
a) Training awakening word acoustic model
To improve the awakening success rate under low power consumption, the awakening word acoustic model is characterized by a GMM (Gaussian Mixture Model) based on the first acoustic features. During training, a large amount of voice data containing the awakening word is first collected and its acoustic features (the same as the first acoustic features) are extracted. A Gaussian mixture model based on an HMM (Hidden Markov Model) is then trained under the MLE (Maximum Likelihood Estimation) criterion, followed by discriminative training under the MPE (Minimum Phone Error) criterion, yielding the awakening word acoustic model.
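As a rough illustration of the MLE stage only, the sketch below fits a one-dimensional Gaussian mixture by EM under the maximum likelihood criterion. The HMM state structure, real acoustic features, and the subsequent MPE discriminative training are all omitted, and every size here is a toy value.

```python
import numpy as np

def fit_gmm_em(x, k=2, iters=50):
    """Fit a 1-D diagonal GMM by EM (maximum likelihood). Illustrative
    stand-in for the MLE training stage; not the patent's implementation."""
    x = np.asarray(x, float)
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)  # spread initial means over data
    var = np.full(k, x.var() + 1e-6)
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibility of each component for each frame
        p = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = p / p.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, variances
        n = r.sum(axis=0)
        w = n / len(x)
        mu = (r * x[:, None]).sum(axis=0) / n
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / n + 1e-6
    return w, mu, var

# Two well-separated clusters of toy "feature" values
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-3, 0.5, 500), rng.normal(3, 0.5, 500)])
w, mu, var = fit_gmm_em(x, k=2)
print(sorted(mu))  # means recovered near -3 and 3
```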
b) Training absorption model
Like the awakening word acoustic model, the absorption model is characterized by a GMM-HMM. Unlike the awakening word acoustic model, its absorption units are formed by clustering all speech units, and the number of absorption units depends on the number of cluster categories, generally between 1 and 100.
During training, a large amount of voice data is first collected, covering as many speech units (phonemes, syllables, etc.) as possible; for Chinese, the collected data should cover all syllables. The acoustic features of the voice data (the same as the first acoustic features) are then extracted, and a Gaussian mixture model based on an HMM is trained under the maximum likelihood criterion to obtain an acoustic model for each speech unit. Next, the speech-unit acoustic models are clustered based on the KL (Kullback-Leibler) distance to obtain the absorption units, each of which is a cluster of speech units; the number of clusters can be preset according to experimental results. Finally, the labels of the training data are rewritten as absorption units, and the acoustic model corresponding to each absorption unit, called the absorption model, is retrained on the relabeled data using the same method as for the speech-unit acoustic models.
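The KL-based clustering step can be illustrated with one-dimensional Gaussian unit models, for which the KL divergence has a closed form. The greedy complete-linkage merge below is a sketch under assumed toy models, not the patent's exact clustering algorithm.

```python
import numpy as np

def kl_gauss(m1, v1, m2, v2):
    """Closed-form KL divergence KL(N1 || N2) between two 1-D Gaussians."""
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def sym_kl(a, b):
    """Symmetrised KL, a common distance for comparing unit models."""
    return kl_gauss(*a, *b) + kl_gauss(*b, *a)

def cluster_units(units, n_clusters):
    """Greedy agglomerative (complete-linkage) clustering of (mean, var)
    unit models by symmetric KL distance."""
    clusters = [[i] for i in range(len(units))]
    while len(clusters) > n_clusters:
        best, pair = None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # distance between clusters: worst pairwise unit distance
                d = max(sym_kl(units[a], units[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best:
                    best, pair = d, (i, j)
        i, j = pair
        clusters[i] += clusters.pop(j)  # j > i, so index i is unaffected
    return clusters

# Four toy "speech unit" models: two low-mean, two high-mean
units = [(-3.0, 1.0), (-2.8, 1.2), (3.0, 1.0), (3.2, 0.9)]
print(sorted(map(sorted, cluster_units(units, 2))))  # [[0, 1], [2, 3]]
```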
For example: the method for modifying the training data labels is as follows: the phonetic unit labeled by the training data is 'zhong 1', after clustering, the phonetic unit 'zhong 1' belongs to class 1, namely, the absorption unit 1, and the labeling of the training data is only required to be modified into 'absorption unit 1'.
The first decoding network comprises the awakening word acoustic model and the absorption model.
Second, second acoustic model
The second acoustic model also comprises an awakening word acoustic model and an absorption model. For the secondary awakening, the two models are trained simultaneously, and both are characterized by a deep neural network based on the second acoustic features, which are Filterbank features. The network structure may be a feedforward neural network, a convolutional neural network, a recurrent neural network, or a combination thereof; the number of hidden layers is generally 3 to 8, with generally 2048 nodes per hidden layer. Model training uses a large amount of collected voice data: the network input is the acoustic features of the voice data (i.e., the second acoustic features), and the output is the states corresponding to the awakening word and to the general speech units. The awakening word states are used to construct the awakening word acoustic model, and the general speech units are used to construct the absorption model. Training follows the cross-entropy criterion; after training, the awakening word acoustic model and the absorption model are obtained.
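A toy forward pass and cross-entropy criterion for such a network might look as follows. This is a numpy sketch: the layer sizes are assumed placeholders far smaller than the 2048-node hidden layers described above, and no training loop is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))  # numerically stable
    return e / e.sum(axis=1, keepdims=True)

# Toy sizes; a real system would use Filterbank inputs, 3-8 hidden
# layers of ~2048 nodes, and thousands of output states.
n_in, n_hidden, n_out = 40, 64, 10
W1 = rng.normal(0, 0.1, (n_in, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.normal(0, 0.1, (n_hidden, n_out)); b2 = np.zeros(n_out)

def forward(x):
    """Feedforward pass: feature frames -> hidden layer -> state posteriors."""
    h = relu(x @ W1 + b1)
    return softmax(h @ W2 + b2)

def cross_entropy(p, targets):
    """Cross-entropy training criterion against the target state labels."""
    return -np.mean(np.log(p[np.arange(len(targets)), targets] + 1e-12))

x = rng.normal(size=(8, n_in))        # a batch of 8 feature frames
targets = rng.integers(0, n_out, 8)   # their target state labels
p = forward(x)
print(p.shape, round(cross_entropy(p, targets), 2))  # (8, 10), loss near ln(10)
```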
The first decoding network and the second decoding network can be constructed from pre-collected awakening word text data; the construction method is the same as for a decoding network in speech recognition.
According to the voice awakening method provided by the embodiment of the present invention, after voice data is received, a small acoustic model and decoding network are used for the primary awakening word recognition; only after an awakening word is recognized and the primary recognition result reaches the set target are a larger acoustic model and decoding network used for the secondary awakening word recognition. Because the primary awakening consumes little power, continuous monitoring incurs low awakening power consumption; and because the secondary awakening, which uses the larger acoustic model and decoding network, is started only when the primary result reaches the set target, the awakening success rate is effectively ensured.
Furthermore, the secondary awakening uses a neural network model with strong learning and nonlinear transformation capability, so the trained model is highly discriminative and the awakening success rate is further improved.
Correspondingly, an embodiment of the present invention further provides a voice wake-up system, as shown in fig. 2, the system includes:
a receiving module 201, configured to receive voice data;
a first acoustic feature obtaining module 202, configured to obtain a first acoustic feature of the voice data;
the primary awakening module 203 is configured to perform awakening word recognition by using the first acoustic feature, the first acoustic model and the first decoding network to obtain a primary awakening word recognition result;
a judging module 204, configured to judge whether the primary awakening word recognition result reaches a set target when the primary awakening word recognition result is an awakening word; if so, the second acoustic feature acquisition module 205 is triggered;
the second acoustic feature obtaining module 205 is configured to obtain a second acoustic feature of the voice data;
a secondary awakening module 206, configured to perform secondary awakening word recognition by using the second acoustic feature, the second acoustic model, and the second decoding network, so as to obtain a secondary awakening word recognition result; the second acoustic model is larger than the first acoustic model, and/or the second decoding network is larger than the first decoding network;
and the determining module 207 is configured to determine whether the waking is successful according to the secondary waking word recognition result.
It should be noted that the second acoustic feature may be the same as or different from the first acoustic feature; specifically, MFCC, Bottleneck, or Filterbank features may be used, and they may be extracted with prior-art methods. If the same acoustic features are used for both awakening word recognitions, the first acoustic feature obtaining module 202 extracts them from the voice data and the second acoustic feature obtaining module 205 may obtain them directly from module 202; alternatively, module 205 may be omitted, with the secondary awakening module 206 performing recognition using the features extracted by module 202.
In the system of the embodiment of the present invention, the first acoustic model and the second acoustic model may be pre-trained by corresponding training modules, which may be part of the system or independent of it; the embodiment of the present invention is not limited in this respect. To reduce awakening power consumption and storage, a small acoustic model and decoding network are used for the primary awakening word recognition, so the awakening system can stay in a real-time monitoring state and respond promptly whenever the user wakes it. The first acoustic model comprises an awakening word acoustic model and an absorption model, which are trained separately; the awakening word acoustic model is characterized by a GMM-HMM based on the first acoustic features, and the absorption model is also characterized by a GMM-HMM. The secondary awakening word recognition uses a larger acoustic model and decoding network: the second acoustic model likewise comprises an awakening word acoustic model and an absorption model, trained simultaneously and both characterized by a neural network model (e.g., DNN, CNN, RNN) based on the second acoustic features.
In order to further reduce noise interference and improve awakening accuracy, in an embodiment of the present invention, the judging module 204 may judge whether the primary awakening word recognition result reaches the set target according to the current environment state. One specific structure of the judging module 204 may include the following two units:
an environment state determination unit for determining a current environment state;
and the judging unit is used for judging whether the initial awakening word recognition result reaches a set target or not according to the environment state.
For example, the environment state determining unit may determine the current environment state according to the signal-to-noise ratio of the voice data: when the signal-to-noise ratio is greater than a set value, the current environment is judged to be quiet; otherwise, it is judged to be noisy. Of course, other environment states may also be defined; the embodiment of the present invention is not limited thereto.
Accordingly, the judging unit may judge according to the acoustic likelihood ratio of the awakening word and the non-awakening word, and may include a likelihood obtaining subunit and a likelihood ratio calculating subunit, wherein:
the likelihood obtaining subunit is used for obtaining the acoustic likelihoods of the awakening words and the non-awakening words obtained in the process of identifying the primary awakening words;
and the likelihood ratio calculating subunit is used for calculating the acoustic likelihood ratio of the awakening word and the non-awakening word according to the acoustic likelihoods, and determining that the primary awakening word recognition result reaches the set target when the acoustic likelihood ratio is greater than the judgment threshold corresponding to the environment state.
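The cooperation of these units can be sketched as follows; the SNR set value and the per-environment judgment thresholds are assumed placeholder numbers, not values from the patent.

```python
def environment_state(snr_db, quiet_threshold=15.0):
    """Classify the environment by SNR; 15 dB is an assumed set value."""
    return "quiet" if snr_db > quiet_threshold else "noisy"

# A stricter threshold in noise reduces false wake-ups; values assumed.
THRESHOLDS = {"quiet": 1.2, "noisy": 2.0}

def primary_result_reaches_target(lik_wake, lik_nonwake, snr_db):
    """The set target is reached when the wake/non-wake acoustic likelihood
    ratio exceeds the judgment threshold for the current environment."""
    ratio = lik_wake / lik_nonwake
    return ratio > THRESHOLDS[environment_state(snr_db)]

print(primary_result_reaches_target(1.5, 1.0, snr_db=20))  # quiet: 1.5 > 1.2 -> True
print(primary_result_reaches_target(1.5, 1.0, snr_db=5))   # noisy: 1.5 < 2.0 -> False
```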
After the secondary awakening module 206 obtains the secondary awakening word recognition result, the determining module 207 may determine whether the awakening is successful in a variety of ways, for example:
1) The determining module 207 determines whether the awakening is successful directly according to the secondary awakening word recognition result: when the secondary awakening word recognition result is the awakening word, the awakening succeeds; otherwise, the awakening fails.
2) The determining module 207 determines whether the awakening is successful by comprehensively considering both the secondary awakening word recognition result and the primary awakening word recognition result. Correspondingly, the determining module 207 may specifically include the following units:
the fusion unit is used for fusing the primary awakening word recognition result and the secondary awakening word recognition result to obtain a fusion result when the secondary awakening word recognition result is the awakening word;
and the first determining unit is used for determining whether the awakening is successful or not according to the fusion result.
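The fusion unit's weighted combination of the two acoustic likelihood ratios and the first determining unit's threshold test can be sketched as follows; the weights and the fusion threshold are assumed placeholder values.

```python
def fuse(t1, t2, w1=0.4, w2=0.6):
    """Weighted combination of the primary (T1) and secondary (T2) acoustic
    likelihood ratios. The weights are assumptions; here the better-modelled
    secondary pass is weighted more heavily."""
    return w1 * t1 + w2 * t2

def wake_successful(t1, t2, fusion_threshold=1.5):
    """Awakening succeeds when the fusion result T exceeds the set threshold."""
    return fuse(t1, t2) > fusion_threshold

print(wake_successful(1.2, 2.0))  # 0.4*1.2 + 0.6*2.0 = 1.68 > 1.5 -> True
```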
3) The determining module 207 comprehensively considers both the fusion result and the similarity between the durations of the primary and secondary awakening word recognition results. Correspondingly, the determining module 207 may specifically include the following units:
the fusion unit is used for fusing the primary awakening word recognition result and the secondary awakening word recognition result to obtain a fusion result when the secondary awakening word recognition result is the awakening word;
the similarity calculation unit is used for calculating the similarity between the duration of the primary awakening word recognition result and the duration of the secondary awakening word recognition result;
a second determining unit, configured to determine that the waking up is successful when the fusion result is greater than a set fusion threshold and the similarity is greater than a set similarity threshold; otherwise, determining that the awakening fails.
According to the voice awakening system provided by the embodiment of the present invention, after voice data is received, a small acoustic model and decoding network are used for the primary awakening word recognition; only after an awakening word is recognized and the primary recognition result reaches the set target are a larger acoustic model and decoding network used for the secondary awakening word recognition. Because the primary awakening consumes little power, continuous monitoring incurs low awakening power consumption; and because the secondary awakening, which uses the larger acoustic model and decoding network, is started only when the primary result reaches the set target, the awakening success rate is effectively ensured.
Furthermore, the secondary awakening uses a neural network model with strong learning and nonlinear transformation capability, so the trained model is highly discriminative and the awakening success rate is further improved.
The voice awakening system provided by the embodiment of the present invention can be applied to various intelligent terminals, such as wearable devices, mobile phones, tablet computers, speakers, smart home appliances, and the like.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, they are described in a relatively simple manner, and reference may be made to some descriptions of method embodiments for relevant points. The above-described system embodiments are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The embodiments of the present invention have been described in detail above using specific examples, which are intended only to help understand the method and system of the present invention. A person skilled in the art may, following the idea of the present invention, vary the specific embodiments and the application scope; in summary, the content of this specification should not be construed as limiting the present invention.

Claims (22)

1. A voice wake-up method, comprising:
receiving voice data;
acquiring a first acoustic feature of the voice data;
performing awakening word recognition by using the first acoustic feature, the first acoustic model and the first decoding network to obtain a primary awakening word recognition result;
if the primary awakening word recognition result is an awakening word, judging whether the primary awakening word recognition result reaches a set target or not;
if yes, acquiring a second acoustic feature of the voice data;
performing secondary awakening word recognition by using the second acoustic feature, the second acoustic model and the second decoding network to obtain a secondary awakening word recognition result; the second acoustic model is larger than the first acoustic model, and/or the second decoding network is larger than the first decoding network;
and determining whether the awakening is successful or not according to the secondary awakening word recognition result.
2. The method of claim 1, wherein the second acoustic characteristic is the same as or different from the first acoustic characteristic.
3. The method of claim 2, wherein the first acoustic feature is any one of: MFCC feature, Bottleneck feature, Filterbank feature.
4. The method of claim 1, wherein:
the first acoustic model comprises a wake-up word acoustic model and an absorption model, wherein the wake-up word acoustic model and the absorption model are respectively trained, the wake-up word acoustic model is characterized by using a GMM-HMM based on first acoustic features, and the absorption model is characterized by adopting the GMM-HMM;
the second acoustic model includes a wake word acoustic model and an absorption model, wherein the wake word acoustic model and the absorption model are trained simultaneously, both of which are characterized using a neural network model based on the second acoustic features.
5. The method of claim 1, wherein the determining whether the initial wake word recognition result reaches a set target comprises:
determining a current environment state;
and judging whether the initial awakening word recognition result reaches a set target or not according to the environment state.
6. The method of claim 5, wherein the determining a current environmental state comprises:
calculating the signal-to-noise ratio of the voice data;
if the signal to noise ratio is larger than a set value, the current environment state is a quiet environment; otherwise, the current environment state is a noise environment.
7. The method according to claim 5, wherein the determining whether the initial wake word recognition result reaches a set target according to the environment state comprises:
acquiring acoustic likelihood of an awakening word and a non-awakening word obtained in the process of recognizing the initial awakening word;
calculating the acoustic likelihood ratio of the awakening word and the non-awakening word according to the acoustic likelihood;
and if the acoustic likelihood ratio is larger than a judgment threshold corresponding to the environment state, the primary awakening word recognition result reaches a set target.
8. The method according to any one of claims 1 to 7, wherein the determining whether the waking is successful according to the secondary wake word recognition result comprises:
and if the secondary awakening word recognition result is the awakening word, determining that the awakening is successful.
9. The method according to any one of claims 1 to 7, wherein the determining whether the waking is successful according to the secondary wake word recognition result comprises:
if the secondary awakening word recognition result is an awakening word, fusing the primary awakening word recognition result and the secondary awakening word recognition result to obtain a fusion result;
and determining whether the awakening is successful or not according to the fusion result.
10. The method of claim 9,
the fusing the primary awakening word recognition result and the secondary awakening word recognition result to obtain a fusion result comprises:
respectively acquiring an acoustic likelihood ratio T1 of the primary awakening word recognition result and an acoustic likelihood ratio T2 of the secondary awakening word recognition result;
carrying out weighted combination on the acoustic likelihood ratio T1 of the primary awakening word recognition result and the acoustic likelihood ratio T2 of the secondary awakening word recognition result to obtain a fusion result T;
the determining whether the awakening is successful according to the fusion result comprises:
if the fusion result T is larger than the set fusion threshold value, awakening successfully; otherwise, the awakening fails.
11. The method according to any one of claims 1 to 7, wherein the determining whether the waking is successful according to the secondary wake word recognition result comprises:
if the secondary awakening word recognition result is an awakening word, fusing the primary awakening word recognition result and the secondary awakening word recognition result to obtain a fusion result;
calculating the similarity between the time length of the primary awakening word recognition result and the time length of the secondary awakening word recognition result;
if the fusion result is larger than a set fusion threshold value and the similarity is larger than a set similarity threshold value, awakening successfully; otherwise, the awakening fails.
12. A voice wake-up system, comprising:
the receiving module is used for receiving voice data;
the first acoustic feature acquisition module is used for acquiring first acoustic features of the voice data;
the primary awakening module is used for performing awakening word recognition by utilizing the first acoustic characteristic, the first acoustic model and the first decoding network to obtain a primary awakening word recognition result;
the judging module is used for judging whether the primary awakening word recognition result reaches a set target or not when the primary awakening word recognition result is the awakening word; if yes, triggering a second acoustic feature acquisition module;
the second acoustic feature acquisition module is used for acquiring a second acoustic feature of the voice data;
the secondary awakening module is used for carrying out secondary awakening word recognition by utilizing the second acoustic characteristic, the second acoustic model and the second decoding network to obtain a secondary awakening word recognition result; the second acoustic model is larger than the first acoustic model, and/or the second decoding network is larger than the first decoding network;
and the determining module is used for determining whether the awakening is successful or not according to the secondary awakening word recognition result.
13. The system of claim 12, wherein:
the first acoustic model comprises a wake-up word acoustic model and an absorption model, wherein the wake-up word acoustic model and the absorption model are respectively trained, the wake-up word acoustic model is characterized by using a GMM-HMM based on first acoustic features, and the absorption model is characterized by adopting the GMM-HMM;
the second acoustic model includes a wake word acoustic model and an absorption model, wherein the wake word acoustic model and the absorption model are trained simultaneously, both of which are characterized using a neural network model based on the second acoustic features.
14. The system of claim 12, wherein the determining module comprises:
an environment state determination unit for determining a current environment state;
and the judging unit is used for judging whether the initial awakening word recognition result reaches a set target or not according to the environment state.
15. The system of claim 14,
the environment state determining unit is specifically configured to calculate a signal-to-noise ratio of the voice data, determine that the current environment state is a quiet environment when the signal-to-noise ratio is greater than a set value, and otherwise determine that the current environment state is a noisy environment.
16. The system according to claim 14, wherein the judging unit includes:
the likelihood obtaining subunit is used for obtaining the acoustic likelihoods of the awakening words and the non-awakening words obtained in the process of identifying the primary awakening words;
and the likelihood ratio calculating subunit is used for calculating the acoustic likelihood ratio of the awakening word and the non-awakening word according to the acoustic likelihoods, and determining that the initial awakening word recognition result reaches a set target when the acoustic likelihood ratio is greater than a judgment threshold corresponding to the environment state.
17. The system according to any one of claims 12 to 16,
the determining module is specifically configured to determine that the waking is successful when the secondary waking word recognition result is a waking word.
18. The system according to any one of claims 12 to 16, wherein the determining module comprises:
the fusion unit is used for fusing the primary awakening word recognition result and the secondary awakening word recognition result to obtain a fusion result when the secondary awakening word recognition result is the awakening word;
and the first determining unit is used for determining whether the awakening is successful or not according to the fusion result.
19. The system of claim 18,
the fusion unit is specifically configured to obtain an acoustic likelihood ratio T1 of the primary awakening word recognition result and an acoustic likelihood ratio T2 of the secondary awakening word recognition result, and perform weighted combination on the acoustic likelihood ratio T1 of the primary awakening word recognition result and the acoustic likelihood ratio T2 of the secondary awakening word recognition result to obtain a fusion result T;
the determining unit is specifically configured to determine that the wake-up is successful when the fusion result T is greater than a set fusion threshold; otherwise, determining that the awakening fails.
20. The system according to any one of claims 12 to 16, wherein the determining module comprises:
the fusion unit is used for fusing the primary awakening word recognition result and the secondary awakening word recognition result to obtain a fusion result when the secondary awakening word recognition result is the awakening word;
the similarity calculation unit is used for calculating the similarity between the time length of the primary awakening word recognition result and the time length of the secondary awakening word recognition result;
a second determining unit, configured to determine that the waking up is successful when the fusion result is greater than a set fusion threshold and the similarity is greater than a set similarity threshold; otherwise, determining that the awakening fails.
21. An intelligent terminal, characterized in that it comprises a voice wake-up system according to any one of claims 12 to 20.
22. The intelligent terminal according to claim 21, wherein the intelligent terminal is any one of: wearable equipment, cell-phone, panel computer, audio amplifier, household electrical appliances, intelligent car machine.
CN201610701651.2A 2016-08-22 2016-08-22 Voice awakening method and system and intelligent terminal Active CN107767863B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610701651.2A CN107767863B (en) 2016-08-22 2016-08-22 Voice awakening method and system and intelligent terminal


Publications (2)

Publication Number Publication Date
CN107767863A CN107767863A (en) 2018-03-06
CN107767863B true CN107767863B (en) 2021-05-04

Family

ID=61263952


Country Status (1)

Country Link
CN (1) CN107767863B (en)

Families Citing this family (56)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10264030B2 (en) 2016-02-22 2019-04-16 Sonos, Inc. Networked microphone device control
US10095470B2 (en) 2016-02-22 2018-10-09 Sonos, Inc. Audio response playback
US10509626B2 (en) 2016-02-22 2019-12-17 Sonos, Inc Handling of loss of pairing between networked devices
US9820039B2 (en) 2016-02-22 2017-11-14 Sonos, Inc. Default playback devices
US9978390B2 (en) 2016-06-09 2018-05-22 Sonos, Inc. Dynamic player selection for audio signal processing
US10134399B2 (en) 2016-07-15 2018-11-20 Sonos, Inc. Contextualization of voice inputs
US10115400B2 (en) 2016-08-05 2018-10-30 Sonos, Inc. Multiple voice services
US10181323B2 (en) 2016-10-19 2019-01-15 Sonos, Inc. Arbitration-based voice recognition
US10475449B2 (en) 2017-08-07 2019-11-12 Sonos, Inc. Wake-word detection suppression
US10048930B1 (en) 2017-09-08 2018-08-14 Sonos, Inc. Dynamic computation of system response volume
US10051366B1 (en) 2017-09-28 2018-08-14 Sonos, Inc. Three-dimensional beam forming with a microphone array
US10482868B2 (en) 2017-09-28 2019-11-19 Sonos, Inc. Multi-channel acoustic echo cancellation
US10466962B2 (en) 2017-09-29 2019-11-05 Sonos, Inc. Media playback system with voice assistance
CN108564941B (en) * 2018-03-22 2020-06-02 腾讯科技(深圳)有限公司 Voice recognition method, device, equipment and storage medium
US11175880B2 (en) 2018-05-10 2021-11-16 Sonos, Inc. Systems and methods for voice-assisted media content selection
US10959029B2 (en) 2018-05-25 2021-03-23 Sonos, Inc. Determining and adapting to changes in microphone performance of playback devices
CN109147763B (en) * 2018-07-10 2020-08-11 深圳市感动智能科技有限公司 Audio and video keyword identification method and device based on neural network and inverse entropy weighting
US11076035B2 (en) 2018-08-28 2021-07-27 Sonos, Inc. Do not disturb feature for audio notifications
CN109065046A (en) * 2018-08-30 2018-12-21 出门问问信息科技有限公司 Method, apparatus, electronic equipment and the computer readable storage medium that voice wakes up
CN108831471B (en) * 2018-09-03 2020-10-23 重庆与展微电子有限公司 Voice safety protection method and device and routing terminal
CN110890087A (en) * 2018-09-10 2020-03-17 北京嘉楠捷思信息技术有限公司 Voice recognition method and device based on cosine similarity
US10587430B1 (en) 2018-09-14 2020-03-10 Sonos, Inc. Networked devices, systems, and methods for associating playback devices based on sound codes
US11024331B2 (en) 2018-09-21 2021-06-01 Sonos, Inc. Voice detection optimization using sound metadata
US11100923B2 (en) * 2018-09-28 2021-08-24 Sonos, Inc. Systems and methods for selective wake word detection using neural network models
US10692518B2 (en) 2018-09-29 2020-06-23 Sonos, Inc. Linear filtering for noise-suppressed speech detection via multiple network microphone devices
US11899519B2 (en) 2018-10-23 2024-02-13 Sonos, Inc. Multiple stage network microphone device with reduced power consumption and processing load
CN112740321A (en) * 2018-11-20 2021-04-30 深圳市欢太科技有限公司 Method and device for waking up equipment, storage medium and electronic equipment
US11183183B2 (en) 2018-12-07 2021-11-23 Sonos, Inc. Systems and methods of operating media playback systems having multiple voice assistant services
US11132989B2 (en) 2018-12-13 2021-09-28 Sonos, Inc. Networked microphone devices, systems, and methods of localized arbitration
US10602268B1 (en) 2018-12-20 2020-03-24 Sonos, Inc. Optimization of network microphone devices using noise classification
US11120794B2 (en) 2019-05-03 2021-09-14 Sonos, Inc. Voice assistant persistence across multiple network microphone devices
CN110047485B (en) * 2019-05-16 2021-09-28 北京地平线机器人技术研发有限公司 Method and apparatus for recognizing wake-up word, medium, and device
US11200894B2 (en) 2019-06-12 2021-12-14 Sonos, Inc. Network microphone device with command keyword eventing
CN110197663B (en) * 2019-06-30 2022-05-31 联想(北京)有限公司 Control method and device and electronic equipment
US10871943B1 (en) 2019-07-31 2020-12-22 Sonos, Inc. Noise classification for event detection
CN110634468B (en) * 2019-09-11 2022-04-15 中国联合网络通信集团有限公司 Voice wake-up method, device, equipment and computer readable storage medium
CN110570861B (en) * 2019-09-24 2022-02-25 Oppo广东移动通信有限公司 Method and device for voice wake-up, terminal equipment and readable storage medium
CN110853633A (en) * 2019-09-29 2020-02-28 联想(北京)有限公司 Awakening method and device
CN110580908A (en) * 2019-09-29 2019-12-17 出门问问信息科技有限公司 command word detection method and device supporting different languages
US11189286B2 (en) 2019-10-22 2021-11-30 Sonos, Inc. VAS toggle based on device orientation
CN110808030B (en) * 2019-11-22 2021-01-22 珠海格力电器股份有限公司 Voice awakening method, system, storage medium and electronic equipment
US11200900B2 (en) 2019-12-20 2021-12-14 Sonos, Inc. Offline voice control
CN111092798B (en) * 2019-12-24 2021-06-11 东华大学 Wearable system based on spoken language understanding
CN111161714B (en) * 2019-12-25 2023-07-21 联想(北京)有限公司 Voice information processing method, electronic equipment and storage medium
US11562740B2 (en) 2020-01-07 2023-01-24 Sonos, Inc. Voice verification for media playback
CN111243604B (en) * 2020-01-13 2022-05-10 思必驰科技股份有限公司 Training method for speaker recognition neural network model supporting multiple awakening words, speaker recognition method and system
US11308958B2 (en) 2020-02-07 2022-04-19 Sonos, Inc. Localized wakeword verification
CN111312222B (en) * 2020-02-13 2023-09-12 北京声智科技有限公司 Awakening and voice recognition model training method and device
US11482224B2 (en) 2020-05-20 2022-10-25 Sonos, Inc. Command keywords with input detection windowing
CN111816193B (en) * 2020-08-12 2020-12-15 深圳市友杰智新科技有限公司 Voice awakening method and device based on multi-segment network and storage medium
CN111933114B (en) * 2020-10-09 2021-02-02 深圳市友杰智新科技有限公司 Training method and use method of voice awakening hybrid model and related equipment
US11984123B2 (en) 2020-11-12 2024-05-14 Sonos, Inc. Network device interaction by range
CN113129874B (en) * 2021-04-27 2022-05-10 思必驰科技股份有限公司 Voice awakening method and system
CN113947855A (en) * 2021-09-18 2022-01-18 中标慧安信息技术股份有限公司 Intelligent building personnel safety alarm system based on voice recognition
CN114360522B (en) * 2022-03-09 2022-08-02 深圳市友杰智新科技有限公司 Training method of voice awakening model, and detection method and equipment of voice false awakening
CN115223573A (en) * 2022-07-15 2022-10-21 北京百度网讯科技有限公司 Voice wake-up method and device, electronic equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105869637A (en) * 2016-05-26 2016-08-17 百度在线网络技术(北京)有限公司 Voice wake-up method and device

Family Cites Families (21)

Publication number Priority date Publication date Assignee Title
CN1175398C (en) * 2000-11-18 2004-11-10 中兴通讯股份有限公司 Sound activation detection method for identifying speech and music from noise environment
CN100514446C (en) * 2004-09-16 2009-07-15 北京中科信利技术有限公司 Pronunciation evaluating method based on voice identification and voice analysis
CN1841500B (en) * 2005-03-30 2010-04-14 松下电器产业株式会社 Method and apparatus for resisting noise based on adaptive nonlinear spectral subtraction
US8949266B2 (en) * 2007-03-07 2015-02-03 Vlingo Corporation Multiple web-based content category searching in mobile search application
CN101241699B (en) * 2008-03-14 2012-07-18 北京交通大学 A speaker identification method for remote Chinese teaching
CN101308653A (en) * 2008-07-17 2008-11-19 安徽科大讯飞信息科技股份有限公司 End-point detecting method applied to speech identification system
CN102238190B (en) * 2011-08-01 2013-12-11 安徽科大讯飞信息科技股份有限公司 Identity authentication method and system
CN102270451B (en) * 2011-08-18 2013-05-29 安徽科大讯飞信息科技股份有限公司 Method and system for identifying speaker
CN102623009B (en) * 2012-03-02 2013-11-20 安徽科大讯飞信息科技股份有限公司 Abnormal emotion automatic detection and extraction method and system on basis of short-time analysis
CN102999161B (en) * 2012-11-13 2016-03-02 科大讯飞股份有限公司 A kind of implementation method of voice wake-up module and application
CN103021409B (en) * 2012-11-13 2016-02-24 安徽科大讯飞信息科技股份有限公司 A kind of voice activation camera system
CN103811003B (en) * 2012-11-13 2019-09-24 联想(北京)有限公司 A kind of audio recognition method and electronic equipment
CN103823867B (en) * 2014-02-26 2017-02-15 深圳大学 Humming type music retrieval method and system based on note modeling
CN103854662B (en) * 2014-03-04 2017-03-15 中央军委装备发展部第六十三研究所 Adaptive voice detection method based on multiple domain Combined estimator
CN103943107B (en) * 2014-04-03 2017-04-05 北京大学深圳研究生院 A kind of audio frequency and video keyword recognition method based on Decision-level fusion
CN105374352B (en) * 2014-08-22 2019-06-18 中国科学院声学研究所 A kind of voice activated method and system
CN104299612B (en) * 2014-11-10 2017-11-07 科大讯飞股份有限公司 The detection method and device of imitative sound similarity
US10719115B2 (en) * 2014-12-30 2020-07-21 Avago Technologies International Sales Pte. Limited Isolated word training and detection using generated phoneme concatenation models of audio inputs
CN104616653B (en) * 2015-01-23 2018-02-23 北京云知声信息技术有限公司 Wake up word matching process, device and voice awakening method, device
CN105096939B (en) * 2015-07-08 2017-07-25 百度在线网络技术(北京)有限公司 voice awakening method and device
CN105679316A (en) * 2015-12-29 2016-06-15 深圳微服机器人科技有限公司 Voice keyword identification method and apparatus based on deep neural network

Also Published As

Publication number Publication date
CN107767863A (en) 2018-03-06

Similar Documents

Publication Publication Date Title
CN107767863B (en) Voice awakening method and system and intelligent terminal
CN107767861B (en) Voice awakening method and system and intelligent terminal
WO2021093449A1 (en) Wakeup word detection method and apparatus employing artificial intelligence, device, and medium
CN108711421B (en) Speech recognition acoustic model establishing method and device and electronic equipment
CN110265040B (en) Voiceprint model training method and device, storage medium and electronic equipment
CN108320733B (en) Voice data processing method and device, storage medium and electronic equipment
CN110364143B (en) Voice awakening method and device and intelligent electronic equipment
US9202462B2 (en) Key phrase detection
CN111462756B (en) Voiceprint recognition method and device, electronic equipment and storage medium
US20170256270A1 (en) Voice Recognition Accuracy in High Noise Conditions
CN110570873B (en) Voiceprint wake-up method and device, computer equipment and storage medium
CN104575504A (en) Method for personalized television voice wake-up by voiceprint and voice identification
CN112102850A (en) Processing method, device and medium for emotion recognition and electronic equipment
CN116601598A (en) Hotphrase triggering based on detection sequences
CN111862963B (en) Voice wakeup method, device and equipment
US20190103110A1 (en) Information processing device, information processing method, and program
CN110853669A (en) Audio identification method, device and equipment
CN114120979A (en) Optimization method, training method, device and medium of voice recognition model
CN114360510A (en) Voice recognition method and related device
CN113611316A (en) Man-machine interaction method, device, equipment and storage medium
CN111862943A (en) Speech recognition method and apparatus, electronic device, and storage medium
GB2576960A (en) Speaker recognition
WO2020073839A1 (en) Voice wake-up method, apparatus and system, and electronic device
CN115691478A (en) Voice wake-up method and device, man-machine interaction equipment and storage medium
US20230113883A1 (en) Digital Signal Processor-Based Continued Conversation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant