CN107767863B - Voice awakening method and system and intelligent terminal - Google Patents


Publication number
CN107767863B
CN107767863B
Authority
CN
China
Prior art keywords
awakening
acoustic
recognition result
word recognition
awakening word
Prior art date
Legal status
Active
Application number
CN201610701651.2A
Other languages
Chinese (zh)
Other versions
CN107767863A (en)
Inventor
吴国兵
潘嘉
刘聪
胡国平
胡郁
刘庆峰
Current Assignee
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN201610701651.2A
Publication of CN107767863A
Application granted
Publication of CN107767863B

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/08: Speech classification or search
    • G10L 15/14: Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 15/142: Hidden Markov Models [HMMs]
    • G10L 15/144: Training of HMMs
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue


Abstract

The invention discloses a voice wake-up method and system. The method comprises: receiving voice data; acquiring a first acoustic feature of the voice data; performing wake-up word recognition using the first acoustic feature, a first acoustic model, and a first decoding network to obtain a primary wake-up word recognition result; if the primary result is a wake-up word, judging whether it reaches a set target; if so, acquiring a second acoustic feature of the voice data; performing secondary wake-up word recognition using the second acoustic feature, a second acoustic model, and a second decoding network to obtain a secondary wake-up word recognition result; and determining from the secondary result whether wake-up succeeds. The invention further provides an intelligent terminal. The invention can effectively reduce the power consumption of a voice wake-up system.

Description

Voice awakening method and system and intelligent terminal
Technical Field
The invention relates to the field of voice processing, in particular to a voice awakening method, a voice awakening system and an intelligent terminal.
Background
Voice wake-up works by understanding the semantic information in a user's voice data. Because the process requires no physical contact with the device, it frees the user's hands and has opened one of the first doors toward artificial intelligence; it is therefore widely applied in various intelligent terminals, such as smart wearable devices, mobile phones, tablet computers, and smart household appliances. In the existing method, acoustic features are extracted from the voice data as soon as it is received, and wake-up word recognition is performed using the extracted features and a pre-constructed acoustic model.
The existing voice awakening method has the following defects:
(1) Because it is impossible to predict when a user will initiate a human-computer interaction, the voice data must be monitored continuously, and wake-up word recognition is performed immediately whenever voice data is received; this consumes a large amount of the intelligent terminal's resources and power.
(2) To improve the wake-up success rate, the existing method generally uses a large acoustic model and decoding network to recognize wake-up words, which further increases the power consumption of voice wake-up. This is unacceptable for an intelligent terminal with limited memory: when power consumption is too high, the terminal often crashes or stops responding, greatly degrading the user experience.
Disclosure of Invention
The invention provides a voice awakening method, a voice awakening system and an intelligent terminal, which can effectively reduce the power consumption of the system while ensuring the awakening success rate.
Therefore, the invention provides the following technical scheme:
a voice wake-up method, comprising:
receiving voice data;
acquiring a first acoustic feature of the voice data;
performing awakening word recognition by using the first acoustic feature, the first acoustic model and the first decoding network to obtain a primary awakening word recognition result;
if the primary awakening word recognition result is an awakening word, judging whether the primary awakening word recognition result reaches a set target or not;
if yes, acquiring a second acoustic feature of the voice data;
performing secondary awakening word recognition by using the second acoustic feature, the second acoustic model and the second decoding network to obtain a secondary awakening word recognition result; the second acoustic model is larger than the first acoustic model, and/or the second decoding network is larger than the first decoding network;
and determining whether the awakening is successful or not according to the secondary awakening word recognition result.
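Taken together, the claimed steps form a coarse-to-fine cascade. The sketch below is a minimal, hypothetical Python rendering of that flow; `primary`, `secondary`, and `reaches_target` are illustrative placeholder callables standing in for the small model, the large model, and the set-target check, and are not part of the patent itself.

```python
def voice_wakeup(voice_data, primary, secondary, reaches_target):
    """Two-stage wake-up flow as claimed (all callables are placeholders).

    primary / secondary: functions mapping voice data to
    (is_wake_word, likelihood_ratio) using the small / large model.
    reaches_target: environment-dependent check on the primary result.
    """
    is_wake, t1 = primary(voice_data)        # small model + small network
    if not (is_wake and reaches_target(t1)):
        return False                         # stay in low-power monitoring
    is_wake2, _t2 = secondary(voice_data)    # large model + large network
    return is_wake2
```

Only when the cheap first stage both fires and passes the set-target check does the expensive second stage run, which is what keeps the always-on monitoring path low-power.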
Optionally, the second acoustic characteristic is the same as or different from the first acoustic characteristic.
Optionally, the first acoustic characteristic is any one of the following characteristics: MFCC feature, Bottleneck feature, Filterbank feature.
Preferably, the first acoustic model comprises a wake word acoustic model and an absorption model, wherein the wake word acoustic model and the absorption model are respectively trained, the wake word acoustic model is characterized by a GMM-HMM based on the first acoustic feature, and the absorption model is characterized by the GMM-HMM;
the second acoustic model includes a wake word acoustic model and an absorption model, wherein the wake word acoustic model and the absorption model are trained simultaneously, both of which are characterized using a neural network model based on the second acoustic features.
Preferably, the determining whether the initial awakening word recognition result reaches a set target includes:
determining a current environment state;
and judging whether the initial awakening word recognition result reaches a set target or not according to the environment state.
Preferably, the determining the current environmental state includes:
calculating the signal-to-noise ratio of the voice data;
if the signal to noise ratio is larger than a set value, the current environment state is a quiet environment; otherwise, the current environment state is a noise environment.
Preferably, the determining whether the initial awakening word recognition result reaches a set target according to the environment state includes:
acquiring acoustic likelihood of an awakening word and a non-awakening word obtained in the process of recognizing the initial awakening word;
calculating the acoustic likelihood ratio of the awakening word and the non-awakening word according to the acoustic likelihood;
and if the acoustic likelihood ratio is larger than a judgment threshold corresponding to the environment state, the primary awakening word recognition result reaches a set target.
Preferably, the determining whether the waking is successful according to the secondary waking word recognition result includes:
and if the secondary awakening word recognition result is the awakening word, determining that the awakening is successful.
Preferably, the determining whether the waking is successful according to the secondary waking word recognition result includes:
if the secondary awakening word recognition result is an awakening word, fusing the primary awakening word recognition result and the secondary awakening word recognition result to obtain a fusion result;
and determining whether the awakening is successful or not according to the fusion result.
Preferably, the fusing the primary awakening word recognition result and the secondary awakening word recognition result to obtain a fusion result includes:
respectively acquiring an acoustic likelihood ratio T1 of the primary awakening identification result and an acoustic likelihood ratio T2 of the secondary awakening identification result;
carrying out weighted combination on the acoustic likelihood ratio T1 of the primary awakening identification result and the acoustic likelihood ratio T2 of the secondary awakening identification result to obtain a fusion result T;
the determining whether the awakening is successful according to the fusion result comprises:
if the fusion result T is larger than the set fusion threshold value, awakening successfully; otherwise, the awakening fails.
Preferably, the determining whether the waking is successful according to the secondary waking word recognition result includes:
if the secondary awakening word recognition result is an awakening word, fusing the primary awakening word recognition result and the secondary awakening word recognition result to obtain a fusion result;
calculating the similarity between the time length of the primary awakening identification result and the time length of the secondary awakening identification result;
if the fusion result is larger than a set fusion threshold value and the similarity is larger than a set similarity threshold value, awakening successfully; otherwise, the awakening fails.
A voice wake-up system comprising:
the receiving module is used for receiving voice data;
the first acoustic feature acquisition module is used for acquiring first acoustic features of the voice data;
the primary awakening module is used for performing awakening word recognition by utilizing the first acoustic characteristic, the first acoustic model and the first decoding network to obtain a primary awakening word recognition result;
the judging module is used for judging whether the primary awakening word recognition result reaches a set target or not when the primary awakening word recognition result is the awakening word; if yes, triggering a second acoustic feature acquisition module;
the second acoustic feature acquisition module is used for acquiring a second acoustic feature of the voice data;
the secondary awakening module is used for carrying out secondary awakening word recognition by utilizing the second acoustic characteristic, the second acoustic model and the second decoding network to obtain a secondary awakening word recognition result; the second acoustic model is larger than the first acoustic model, and/or the second decoding network is larger than the first decoding network;
and the determining module is used for determining whether the awakening is successful or not according to the secondary awakening word recognition result.
Preferably, the first acoustic model comprises a wake word acoustic model and an absorption model, wherein the wake word acoustic model and the absorption model are respectively trained, the wake word acoustic model is characterized by a GMM-HMM based on the first acoustic feature, and the absorption model is characterized by the GMM-HMM;
the second acoustic model includes a wake word acoustic model and an absorption model, wherein the wake word acoustic model and the absorption model are trained simultaneously, both of which are characterized using a neural network model based on the second acoustic features.
Preferably, the judging module includes:
an environment state determination unit for determining a current environment state;
and the judging unit is used for judging whether the initial awakening word recognition result reaches a set target or not according to the environment state.
Preferably, the environment state determining unit is specifically configured to calculate a signal-to-noise ratio of the voice data, determine that the current environment state is a quiet environment when the signal-to-noise ratio is greater than a set value, and otherwise determine that the current environment state is a noisy environment.
Preferably, the judging unit includes:
the likelihood obtaining subunit is used for obtaining the acoustic likelihoods of the awakening words and the non-awakening words obtained in the process of identifying the primary awakening words;
and the likelihood ratio calculating subunit is used for calculating the acoustic likelihood ratio of the awakening word and the non-awakening word according to the acoustic likelihoods, and determining that the initial awakening word recognition result reaches a set target when the acoustic likelihood ratio is greater than a judgment threshold corresponding to the environment state.
Preferably, the determining module is specifically configured to determine that the waking is successful when the secondary waking word recognition result is a waking word.
Preferably, the determining module comprises:
the fusion unit is used for fusing the primary awakening word recognition result and the secondary awakening word recognition result to obtain a fusion result when the secondary awakening word recognition result is the awakening word;
and the first determining unit is used for determining whether the awakening is successful or not according to the fusion result.
Preferably, the fusion unit is specifically configured to obtain an acoustic likelihood ratio T1 of the primary wake-up recognition result and an acoustic likelihood ratio T2 of the secondary wake-up recognition result, and perform weighted combination on the acoustic likelihood ratio T1 of the primary wake-up recognition result and the acoustic likelihood ratio T2 of the secondary wake-up recognition result to obtain a fusion result T;
the determining unit is specifically configured to determine that the wake-up is successful when the fusion result T is greater than a set fusion threshold; otherwise, determining that the awakening fails.
Preferably, the determining module comprises:
the fusion unit is used for fusing the primary awakening word recognition result and the secondary awakening word recognition result to obtain a fusion result when the secondary awakening word recognition result is the awakening word;
the similarity calculation unit is used for calculating the similarity between the duration of the primary awakening identification result and the duration of the secondary awakening identification result;
a second determining unit, configured to determine that the waking up is successful when the fusion result is greater than a set fusion threshold and the similarity is greater than a set similarity threshold; otherwise, determining that the awakening fails.
An intelligent terminal comprises the voice awakening system.
Preferably, the intelligent terminal is any one of the following: a wearable device, a mobile phone, a tablet computer, a smart speaker, a household appliance, or an in-vehicle unit.
According to the voice wake-up method, system, and intelligent terminal of the invention, once voice data is received, a small acoustic model and decoding network perform primary wake-up word recognition; only after a wake-up word is recognized and the primary result reaches the set target are a larger acoustic model and decoding network used for secondary recognition. Because primary wake-up consumes little power, using it for continuous monitoring effectively reduces wake-up power consumption; and because the secondary wake-up operation, which uses the larger acoustic model and decoding network, is started only when the primary result reaches the set target, the wake-up success rate is effectively ensured.
Furthermore, the secondary wake-up uses a neural network model with strong learning and nonlinear transformation capabilities, so the trained model is highly discriminative and the wake-up success rate is further improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. Obviously, the drawings described below are only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a voice wake-up method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a voice wake-up system according to an embodiment of the present invention.
Detailed Description
To help those skilled in the art better understand the solutions of the embodiments of the invention, the embodiments are described in further detail below with reference to the drawings and implementations.
Aiming at the problem of high power consumption of the existing voice awakening method, the embodiment of the invention provides a voice awakening method and a voice awakening system.
As shown in fig. 1, it is a flowchart of a voice wake-up method according to an embodiment of the present invention, including the following steps:
step 101, receiving voice data.
The voice data is received through a microphone of the intelligent terminal.
Step 102, obtaining a first acoustic feature of the voice data.
The first acoustic feature is used for primary wake-up; it may, for example, be an MFCC feature. During extraction, the speech data may first be framed, the framed data then pre-emphasized, and finally the spectral features of each frame extracted in turn.
Of course, to further improve distinctiveness, the first acoustic feature may also be a more discriminative feature such as the Bottleneck feature or the Filterbank feature. To extract Bottleneck features, the MFCC features of the voice data are extracted first and fed as input to a pre-constructed deep neural network model, and the output of the Bottleneck layer is taken as the Bottleneck feature; the specific extraction method is the same as in the prior art and is not detailed here. Filterbank feature extraction may likewise follow conventional techniques and is not described in detail here.
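For illustration, the framing, pre-emphasis, and Filterbank steps described above can be sketched with NumPy alone. This is a generic log-Mel filterbank front end under common default settings (25 ms frames, 10 ms shift, 40 Mel bands, 0.97 pre-emphasis), not the patent's exact extractor, and it assumes the signal is at least one frame long:

```python
import numpy as np

def filterbank_features(signal, sample_rate=16000, frame_len=0.025,
                        frame_shift=0.010, n_fft=512, n_mels=40):
    """Log-Mel filterbank features: pre-emphasis, framing, FFT, Mel filtering."""
    signal = np.asarray(signal, float)
    # Pre-emphasis boosts high frequencies: y[t] = x[t] - 0.97 * x[t-1]
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    # Split into overlapping frames and taper the frame edges
    flen = int(frame_len * sample_rate)
    fshift = int(frame_shift * sample_rate)
    n_frames = 1 + max(0, (len(emphasized) - flen) // fshift)
    frames = np.stack([emphasized[i * fshift: i * fshift + flen]
                       for i in range(n_frames)])
    frames *= np.hamming(flen)

    # Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # Triangular Mel filterbank
    def hz_to_mel(hz):
        return 2595.0 * np.log10(1.0 + hz / 700.0)
    def mel_to_hz(mel):
        return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)

    return np.log(power @ fbank.T + 1e-10)  # shape: (n_frames, n_mels)
```

For MFCC features one would additionally apply a DCT to these log-filterbank outputs; a Bottleneck extractor would instead feed MFCCs through a trained DNN and read off the bottleneck layer.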
And 103, performing awakening word recognition by using the first acoustic feature, the first acoustic model and the first decoding network to obtain a primary awakening word recognition result.
In order to reduce wake-up power consumption and storage, a small acoustic model and a small decoding network are used for primary wake-up word recognition, so that the wake-up system can stay in a real-time monitoring state and respond promptly whenever the user wakes it. The construction of the first acoustic model and the structure of the first decoding network are described in detail later.
During decoding, using the pre-constructed smaller decoding network and acoustic model, the acoustic score of the acoustic features of each voice unit on each path of the first decoding network is calculated by dynamic programming, and the path with the highest acoustic score is taken as the optimal path. If the optimal path is a wake-up word path, the recognition result is the wake-up word on that path; if it is an absorption path, the recognition result is a non-wake-up word.
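The dynamic-programming scoring described above can be sketched as a small Viterbi pass over a left-to-right path. The per-frame log-likelihood matrices here are assumed inputs standing in for the acoustic model's scores, and the two-path network is a deliberate simplification of the patent's decoding network:

```python
import numpy as np

def best_path_score(log_likes):
    """Viterbi (dynamic-programming) score of one left-to-right path.

    log_likes: (n_frames, n_states) acoustic log-likelihoods of the states
    along the path, ordered left to right. At each frame the token either
    stays in its state or advances to the next one.
    """
    n_frames, n_states = log_likes.shape
    score = np.full(n_states, -np.inf)
    score[0] = log_likes[0, 0]              # decoding must start in state 0
    for t in range(1, n_frames):
        advanced = np.full(n_states, -np.inf)
        advanced[1:] = score[:-1]           # score of entering state s from s-1
        score = np.maximum(score, advanced) + log_likes[t]
    return score[-1]                        # decoding must end in the last state

def recognize(paths):
    """Return the label of the decoding-network path with the best score.

    paths: dict mapping a path label ('wake-word' or 'absorption') to that
    path's (n_frames, n_states) log-likelihood matrix.
    """
    return max(paths, key=lambda label: best_path_score(paths[label]))
```

The result is the wake-up word when the wake-word path outscores every absorption path, exactly the optimal-path rule in the text.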
104, if the primary awakening word recognition result is an awakening word, judging whether the primary awakening word recognition result reaches a set target; if yes, go to step 105; otherwise, the awakening fails.
In order to further reduce noise interference and improve wake-up accuracy, in the embodiment of the invention, whether the primary wake-up word recognition result reaches the set target can be determined according to the current environment state. To this end, the current environment state must first be determined, for example from the signal-to-noise ratio (SNR) of the received voice data. Specifically, the SNR of the voice data is calculated; if it is greater than a set value, the current environment state is a quiet environment; otherwise it is a noisy environment.
Of course, the environment states are not limited to quiet and noisy; multiple states can be defined according to actual application requirements to meet users' personalized needs. For example, according to the time at which the user performs each wake-up, the environment can be further divided into morning, afternoon, evening, early morning, and so on, with the acoustic likelihood ratio threshold of the wake-up word in each environment set according to experimental results or application requirements.
When judging according to the environment state whether the primary recognition result reaches the set target, decision thresholds for the different environments can be preset, and the result is judged against the threshold for the current environment.
For example, from the acoustic likelihoods of the wake-up word and the non-wake-up words obtained during wake-up word recognition, the ratio between the two is calculated as the acoustic likelihood ratio of the wake-up word. When this likelihood ratio is greater than the threshold, the current voice data is considered non-noise voice data and the secondary wake-up operation is started; otherwise wake-up fails and the system continues to receive voice data.
Acoustic likelihood ratio thresholds are thus set separately for each environment, and the wake-up word is confirmed against the threshold for the current environment. Taking the two environment states above as an example, the confirmation results are shown in Table 1, where T1 is the acoustic likelihood ratio of the wake-up word computed during primary recognition, thres_clear is the threshold in a quiet environment, and thres_noise is the threshold in a noisy environment; the thresholds can be determined from a large number of experiments or from actual application requirements.
TABLE 1
Environment    Condition            Wake-up word confirmation
Quiet          T1 > thres_clear     confirmed; start secondary wake-up
Quiet          T1 <= thres_clear    not confirmed; wake-up fails
Noisy          T1 > thres_noise     confirmed; start secondary wake-up
Noisy          T1 <= thres_noise    not confirmed; wake-up fails
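The environment decision and threshold comparison can be sketched as follows. The SNR floor and the `thres_clear`/`thres_noise` values are illustrative placeholders only, since the patent leaves the actual thresholds to experiment or application requirements:

```python
import numpy as np

def snr_db(speech_frames, noise_frames):
    """Estimate the SNR in dB from speech-segment and noise-segment samples."""
    p_speech = np.mean(np.asarray(speech_frames, float) ** 2)
    p_noise = np.mean(np.asarray(noise_frames, float) ** 2) + 1e-12
    return 10.0 * np.log10(p_speech / p_noise)

def confirm_primary_wakeup(t1, snr, snr_floor=15.0,
                           thres_clear=8.0, thres_noise=12.0):
    """Environment-dependent confirmation of the primary recognition result.

    t1: acoustic likelihood ratio of the wake-up word from primary recognition.
    snr_floor, thres_clear, thres_noise: placeholder values; the patent
    determines them from experiments or application requirements.
    """
    env = "quiet" if snr > snr_floor else "noisy"
    threshold = thres_clear if env == "quiet" else thres_noise
    return env, t1 > threshold
```

Noisy environments get the stricter threshold, so borderline primary detections in noise do not trigger the power-hungry secondary stage.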
Step 105, obtaining a second acoustic feature of the voice data.
The second acoustic feature is used for the secondary wake-up operation. It should be noted that the second acoustic feature may be the same as or different from the first, as determined by application requirements. For example, the first acoustic feature may be the Bottleneck feature while the second is the Filterbank feature; of course, the two may also be identical, e.g. both Bottleneck features. If the second acoustic feature is the same as the first, then in step 106 the first acoustic feature extracted in step 102 can be used directly for secondary wake-up word recognition, without re-extracting acoustic features from the voice data.
And 106, performing secondary awakening word recognition by using the second acoustic feature, the second acoustic model and the second decoding network to obtain a secondary awakening word recognition result.
It should be noted that, in the embodiment of the present invention, the second acoustic model is larger than the first acoustic model, and/or the second decoding network is larger than the first decoding network. In addition, in order to improve the wake-up success rate, the secondary wake-up operation not only uses a larger acoustic model and a decoding network, but also considers the primary wake-up result. The construction of the second acoustic model and the structure of the second decoding network will be described in detail later.
In this step, the wake-up word recognition process is similar to that of the primary wake-up: using the pre-constructed larger decoding network and acoustic model, the acoustic score of the acoustic features of each voice unit on each path of the second decoding network is calculated by dynamic programming, and the path with the highest acoustic score is taken as the optimal path. If the optimal path is a wake-up word path, the recognition result is the wake-up word on that path; if it is an absorption path, the recognition result is a non-wake-up word.
And 107, determining whether the awakening is successful or not according to the secondary awakening word recognition result.
After the secondary awakening word recognition result is obtained, the following ways can be used for determining whether the awakening is successful:
1) and directly determining whether the awakening is successful according to the identification result of the secondary awakening word.
For example, if the secondary awakening word recognition result is an awakening word, the awakening is determined to be successful, otherwise, the awakening is failed.
2) And comprehensively considering the secondary awakening word recognition result and the primary awakening word recognition result to determine whether the awakening is successful.
For example, the primary awakening word recognition result and the secondary awakening word recognition result are fused to obtain a fusion result; and determining whether the awakening is successful or not according to the fusion result. Specific fusion methods are exemplified as follows:
the acoustic likelihood ratio T1 of the primary wake-up recognition result and the acoustic likelihood ratio T2 of the secondary wake-up recognition result are respectively acquired;
T1 and T2 are then weighted and combined to obtain the fusion result T, as shown in formula (1):
T = α*T1 + β*T2 (1)
where α and β are the weighting coefficients of the primary and secondary results.
if the fusion result T is larger than the set fusion threshold value, awakening successfully; otherwise, the awakening fails.
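A minimal sketch of the weighted fusion of formula (1). The weights and the fusion threshold are illustrative values, not values from the patent; in practice they would be tuned experimentally, and a natural choice gives the larger, more accurate secondary model the larger weight:

```python
def fuse_and_decide(t1, t2, alpha=0.4, beta=0.6, fusion_threshold=10.0):
    """Weighted fusion of the primary and secondary acoustic likelihood
    ratios, as in formula (1): T = alpha*T1 + beta*T2.

    alpha, beta, and fusion_threshold are illustrative placeholders.
    """
    t = alpha * t1 + beta * t2
    return t, t > fusion_threshold  # wake-up succeeds only if T exceeds the threshold
```
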
3) Not only the fusion result T but also the similarity between the duration of the wake-up word in the primary wake-up and its duration in the secondary wake-up is considered.
Specifically, the received voice data is first segmented at the state level using the first acoustic feature and the primary wake-up acoustic model, giving a primary duration vector D1 = [d11, d12, …, d1n]; the voice data is then segmented at the state level using the second acoustic feature and the secondary wake-up acoustic model, giving a secondary duration vector D2 = [d21, d22, …, d2n]; finally the similarity between D1 and D2 is calculated, which may be expressed by the cosine distance, Euclidean distance, or the like between the vectors, a smaller distance meaning a higher similarity.
The specific calculation method taking the cosine distance as an example is shown in formula (2):
Dcos = 1 - (D1 · D2) / (||D1|| * ||D2||) (2)
where Dcos is the cosine distance between the duration vectors, the smaller the distance, the higher the similarity.
If the fusion result is larger than a set fusion threshold value and the similarity is larger than a set similarity threshold value, awakening successfully; otherwise, the awakening fails, and the voice data continues to be received.
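The duration-similarity check of formula (2) combined with the fusion threshold can be sketched as follows; both threshold values are illustrative placeholders, and the similarity is taken as one minus the cosine distance so that higher means more similar:

```python
import numpy as np

def duration_cosine_distance(d1, d2):
    """Cosine distance between the two duration vectors, as in formula (2);
    a smaller distance means a higher similarity."""
    d1 = np.asarray(d1, float)
    d2 = np.asarray(d2, float)
    cos_sim = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2) + 1e-12)
    return 1.0 - cos_sim

def final_decision(t_fused, d1, d2, fusion_threshold=10.0, sim_threshold=0.95):
    """Wake up only if the fused score exceeds its threshold AND the duration
    similarity exceeds its threshold. Threshold values are illustrative."""
    similarity = 1.0 - duration_cosine_distance(d1, d2)
    return bool(t_fused > fusion_threshold and similarity > sim_threshold)
```

Requiring the two stages to agree on where the wake-up word's states fall in time is an extra guard against false alarms that happen to score well in both models.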
In the embodiment of the present invention, when performing the first wake-up word recognition, a smaller acoustic model and a smaller decoding network are used, and when performing the second wake-up word recognition, a larger acoustic model and a larger decoding network are used, that is, the first acoustic model is smaller than the second acoustic model, and/or the first decoding network is smaller than the second decoding network.
The acoustic models used in the two wake-up word recognition processes are described in detail below.
First, first acoustic model
The first acoustic model comprises an awakening word acoustic model and an absorption model. The awakening word acoustic model is used to recognize awakening words in the voice data, while the absorption model is used to absorb all acoustic phenomena other than the awakening word, such as non-awakening-word speech, various forms of noise, music, and the like.
a) Training awakening word acoustic model
To improve the awakening success rate under low power consumption, the awakening word acoustic model is characterized by a GMM (Gaussian Mixture Model) based on the first acoustic features. During training, a large amount of voice data containing the awakening word is first collected and its acoustic features (the same as the first acoustic features) are extracted. A Gaussian mixture model based on an HMM (Hidden Markov Model) is then trained under the MLE (Maximum Likelihood Estimation) criterion, followed by discriminative training under the MPE (Minimum Phone Error) criterion, yielding the awakening word acoustic model.
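As a rough illustration of the MLE stage only, the sketch below fits a one-dimensional Gaussian mixture by EM under the maximum likelihood criterion. The HMM state structure, real acoustic features, and the subsequent MPE discriminative training are all omitted, and every size here is a toy value.

```python
import numpy as np

def fit_gmm_em(x, k=2, iters=50):
    """Fit a 1-D diagonal GMM by EM (maximum likelihood). Illustrative
    stand-in for the MLE training stage; not the patent's implementation."""
    x = np.asarray(x, float)
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)  # spread initial means over data
    var = np.full(k, x.var() + 1e-6)
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibility of each component for each frame
        p = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = p / p.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, variances
        n = r.sum(axis=0)
        w = n / len(x)
        mu = (r * x[:, None]).sum(axis=0) / n
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / n + 1e-6
    return w, mu, var

# Two well-separated clusters of toy "feature" values
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-3, 0.5, 500), rng.normal(3, 0.5, 500)])
w, mu, var = fit_gmm_em(x, k=2)
print(sorted(mu))  # means recovered near -3 and 3
```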
b) Training absorption model
Like the awakening word acoustic model, the absorption model is characterized by a GMM-HMM. Unlike the awakening word acoustic model, its absorption units are formed by clustering all speech units, and the number of absorption units depends on the number of cluster categories, generally between 1 and 100.
During training, a large amount of voice data is first collected, covering as many speech units (phonemes, syllables, etc.) as possible; for Chinese, the collected data should cover all syllables. The acoustic features of the voice data (the same as the first acoustic features) are then extracted, and a Gaussian mixture model based on an HMM is trained under the maximum likelihood criterion to obtain an acoustic model for each speech unit. Next, the speech-unit acoustic models are clustered based on the KL (Kullback-Leibler) distance to obtain the absorption units, each of which is a cluster of speech units; the number of clusters can be preset according to experimental results. Finally, the labels of the training data are rewritten as absorption units, and the acoustic model corresponding to each absorption unit, called the absorption model, is retrained on the relabeled data using the same method as for the speech-unit acoustic models.
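The KL-based clustering step can be illustrated with one-dimensional Gaussian unit models, for which the KL divergence has a closed form. The greedy complete-linkage merge below is a sketch under assumed toy models, not the patent's exact clustering algorithm.

```python
import numpy as np

def kl_gauss(m1, v1, m2, v2):
    """Closed-form KL divergence KL(N1 || N2) between two 1-D Gaussians."""
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def sym_kl(a, b):
    """Symmetrised KL, a common distance for comparing unit models."""
    return kl_gauss(*a, *b) + kl_gauss(*b, *a)

def cluster_units(units, n_clusters):
    """Greedy agglomerative (complete-linkage) clustering of (mean, var)
    unit models by symmetric KL distance."""
    clusters = [[i] for i in range(len(units))]
    while len(clusters) > n_clusters:
        best, pair = None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # distance between clusters: worst pairwise unit distance
                d = max(sym_kl(units[a], units[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best:
                    best, pair = d, (i, j)
        i, j = pair
        clusters[i] += clusters.pop(j)  # j > i, so index i is unaffected
    return clusters

# Four toy "speech unit" models: two low-mean, two high-mean
units = [(-3.0, 1.0), (-2.8, 1.2), (3.0, 1.0), (3.2, 0.9)]
print(sorted(map(sorted, cluster_units(units, 2))))  # [[0, 1], [2, 3]]
```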
For example: the method for modifying the training data labels is as follows: the phonetic unit labeled by the training data is 'zhong 1', after clustering, the phonetic unit 'zhong 1' belongs to class 1, namely, the absorption unit 1, and the labeling of the training data is only required to be modified into 'absorption unit 1'.
The first decoding network comprises the awakening word acoustic model and the absorption model.
Second, second acoustic model
The second acoustic model also comprises an awakening word acoustic model and an absorption model. For the secondary awakening, the two models are trained simultaneously, and both are characterized by a deep neural network based on the second acoustic features, which are Filterbank features. The network structure may be a feedforward neural network, a convolutional neural network, a recurrent neural network, or a combination thereof; the number of hidden layers is generally 3 to 8, with generally 2048 nodes per hidden layer. Model training uses a large amount of collected voice data: the network input is the acoustic features of the voice data (i.e., the second acoustic features), and the output is the states corresponding to the awakening word and to the general speech units. The awakening word states are used to construct the awakening word acoustic model, and the general speech units are used to construct the absorption model. Training follows the cross-entropy criterion; after training, the awakening word acoustic model and the absorption model are obtained.
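A toy forward pass and cross-entropy criterion for such a network might look as follows. This is a numpy sketch: the layer sizes are assumed placeholders far smaller than the 2048-node hidden layers described above, and no training loop is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))  # numerically stable
    return e / e.sum(axis=1, keepdims=True)

# Toy sizes; a real system would use Filterbank inputs, 3-8 hidden
# layers of ~2048 nodes, and thousands of output states.
n_in, n_hidden, n_out = 40, 64, 10
W1 = rng.normal(0, 0.1, (n_in, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.normal(0, 0.1, (n_hidden, n_out)); b2 = np.zeros(n_out)

def forward(x):
    """Feedforward pass: feature frames -> hidden layer -> state posteriors."""
    h = relu(x @ W1 + b1)
    return softmax(h @ W2 + b2)

def cross_entropy(p, targets):
    """Cross-entropy training criterion against the target state labels."""
    return -np.mean(np.log(p[np.arange(len(targets)), targets] + 1e-12))

x = rng.normal(size=(8, n_in))        # a batch of 8 feature frames
targets = rng.integers(0, n_out, 8)   # their target state labels
p = forward(x)
print(p.shape, round(cross_entropy(p, targets), 2))  # (8, 10), loss near ln(10)
```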
The first decoding network and the second decoding network can be constructed from pre-collected awakening word text data; the construction method is the same as for a decoding network in speech recognition.
According to the voice awakening method provided by the embodiment of the present invention, after voice data is received, a small acoustic model and decoding network are used for the primary awakening word recognition; only after an awakening word is recognized and the primary recognition result reaches the set target are a larger acoustic model and decoding network used for the secondary awakening word recognition. Because the primary awakening consumes little power, continuous monitoring incurs low awakening power consumption; and because the secondary awakening, which uses the larger acoustic model and decoding network, is started only when the primary result reaches the set target, the awakening success rate is effectively ensured.
Furthermore, the secondary awakening uses a neural network model with strong learning and nonlinear transformation capability, so the trained model is highly discriminative and the awakening success rate is further improved.
Correspondingly, an embodiment of the present invention further provides a voice wake-up system, as shown in fig. 2, the system includes:
a receiving module 201, configured to receive voice data;
a first acoustic feature obtaining module 202, configured to obtain a first acoustic feature of the voice data;
the primary awakening module 203 is configured to perform awakening word recognition by using the first acoustic feature, the first acoustic model and the first decoding network to obtain a primary awakening word recognition result;
a judging module 204, configured to judge whether the primary awakening word recognition result reaches a set target when the primary awakening word recognition result is an awakening word; if so, the second acoustic feature acquisition module 205 is triggered;
the second acoustic feature obtaining module 205 is configured to obtain a second acoustic feature of the voice data;
a secondary awakening module 206, configured to perform secondary awakening word recognition by using the second acoustic feature, the second acoustic model, and the second decoding network, so as to obtain a secondary awakening word recognition result; the second acoustic model is larger than the first acoustic model, and/or the second decoding network is larger than the first decoding network;
and the determining module 207 is configured to determine whether the waking is successful according to the secondary waking word recognition result.
It should be noted that the second acoustic feature may be the same as or different from the first acoustic feature; specifically, MFCC, Bottleneck, or Filterbank features may be used, and they may be extracted with prior-art methods. If the same acoustic features are used for both awakening word recognitions, the first acoustic feature obtaining module 202 extracts them from the voice data and the second acoustic feature obtaining module 205 may obtain them directly from module 202; alternatively, module 205 may be omitted, with the secondary awakening module 206 performing recognition using the features extracted by module 202.
In the system of the embodiment of the present invention, the first acoustic model and the second acoustic model may be pre-trained by corresponding training modules, which may be part of the system or independent of it; the embodiment of the present invention is not limited in this respect. To reduce awakening power consumption and storage, a small acoustic model and decoding network are used for the primary awakening word recognition, so the awakening system can stay in a real-time monitoring state and respond promptly whenever the user wakes it. The first acoustic model comprises an awakening word acoustic model and an absorption model, which are trained separately; the awakening word acoustic model is characterized by a GMM-HMM based on the first acoustic features, and the absorption model is also characterized by a GMM-HMM. The secondary awakening word recognition uses a larger acoustic model and decoding network: the second acoustic model likewise comprises an awakening word acoustic model and an absorption model, trained simultaneously and both characterized by a neural network model (e.g., DNN, CNN, RNN) based on the second acoustic features.
In order to further reduce noise interference and improve awakening accuracy, in an embodiment of the present invention, the judging module 204 may judge whether the primary awakening word recognition result reaches the set target according to the current environment state. One specific structure of the judging module 204 may include the following two units:
an environment state determination unit for determining a current environment state;
and the judging unit is used for judging whether the initial awakening word recognition result reaches a set target or not according to the environment state.
For example, the environment state determining unit may determine the current environment state according to the signal-to-noise ratio of the voice data: when the signal-to-noise ratio is greater than a set value, the current environment is judged to be quiet; otherwise, it is judged to be noisy. Of course, other environment states may also be defined; the embodiment of the present invention is not limited thereto.
Accordingly, the judging unit may judge according to the acoustic likelihood ratio of the awakening word and the non-awakening word, and may include a likelihood obtaining subunit and a likelihood ratio calculating subunit, wherein:
the likelihood obtaining subunit is used for obtaining the acoustic likelihoods of the awakening words and the non-awakening words obtained in the process of identifying the primary awakening words;
and the likelihood ratio calculating subunit is used for calculating the acoustic likelihood ratio of the awakening word and the non-awakening word according to the acoustic likelihoods, and determining that the primary awakening word recognition result reaches the set target when the acoustic likelihood ratio is greater than the judgment threshold corresponding to the environment state.
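The cooperation of these units can be sketched as follows; the SNR set value and the per-environment judgment thresholds are assumed placeholder numbers, not values from the patent.

```python
def environment_state(snr_db, quiet_threshold=15.0):
    """Classify the environment by SNR; 15 dB is an assumed set value."""
    return "quiet" if snr_db > quiet_threshold else "noisy"

# A stricter threshold in noise reduces false wake-ups; values assumed.
THRESHOLDS = {"quiet": 1.2, "noisy": 2.0}

def primary_result_reaches_target(lik_wake, lik_nonwake, snr_db):
    """The set target is reached when the wake/non-wake acoustic likelihood
    ratio exceeds the judgment threshold for the current environment."""
    ratio = lik_wake / lik_nonwake
    return ratio > THRESHOLDS[environment_state(snr_db)]

print(primary_result_reaches_target(1.5, 1.0, snr_db=20))  # quiet: 1.5 > 1.2 -> True
print(primary_result_reaches_target(1.5, 1.0, snr_db=5))   # noisy: 1.5 < 2.0 -> False
```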
After the secondary awakening module 206 obtains the secondary awakening word recognition result, the determining module 207 may determine whether the awakening is successful in a variety of ways, for example:
1) The determining module 207 determines whether the awakening is successful directly according to the secondary awakening word recognition result: when the secondary awakening word recognition result is the awakening word, the awakening succeeds; otherwise, the awakening fails.
2) The determining module 207 determines whether the awakening is successful by comprehensively considering both the secondary awakening word recognition result and the primary awakening word recognition result. Correspondingly, the determining module 207 may specifically include the following units:
the fusion unit is used for fusing the primary awakening word recognition result and the secondary awakening word recognition result to obtain a fusion result when the secondary awakening word recognition result is the awakening word;
and the first determining unit is used for determining whether the awakening is successful or not according to the fusion result.
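The fusion unit's weighted combination of the two acoustic likelihood ratios and the first determining unit's threshold test can be sketched as follows; the weights and the fusion threshold are assumed placeholder values.

```python
def fuse(t1, t2, w1=0.4, w2=0.6):
    """Weighted combination of the primary (T1) and secondary (T2) acoustic
    likelihood ratios. The weights are assumptions; here the better-modelled
    secondary pass is weighted more heavily."""
    return w1 * t1 + w2 * t2

def wake_successful(t1, t2, fusion_threshold=1.5):
    """Awakening succeeds when the fusion result T exceeds the set threshold."""
    return fuse(t1, t2) > fusion_threshold

print(wake_successful(1.2, 2.0))  # 0.4*1.2 + 0.6*2.0 = 1.68 > 1.5 -> True
```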
3) The determining module 207 comprehensively considers both the fusion result and the similarity between the durations of the primary and secondary awakening word recognition results. Correspondingly, the determining module 207 may specifically include the following units:
the fusion unit is used for fusing the primary awakening word recognition result and the secondary awakening word recognition result to obtain a fusion result when the secondary awakening word recognition result is the awakening word;
the similarity calculation unit is used for calculating the similarity between the duration of the primary awakening word recognition result and the duration of the secondary awakening word recognition result;
a second determining unit, configured to determine that the waking up is successful when the fusion result is greater than a set fusion threshold and the similarity is greater than a set similarity threshold; otherwise, determining that the awakening fails.
According to the voice awakening system provided by the embodiment of the present invention, after voice data is received, a small acoustic model and decoding network are used for the primary awakening word recognition; only after an awakening word is recognized and the primary recognition result reaches the set target are a larger acoustic model and decoding network used for the secondary awakening word recognition. Because the primary awakening consumes little power, continuous monitoring incurs low awakening power consumption; and because the secondary awakening, which uses the larger acoustic model and decoding network, is started only when the primary result reaches the set target, the awakening success rate is effectively ensured.
Furthermore, the secondary awakening uses a neural network model with strong learning and nonlinear transformation capability, so the trained model is highly discriminative and the awakening success rate is further improved.
The voice awakening system provided by the embodiment of the present invention can be applied to various intelligent terminals, such as wearable devices, mobile phones, tablet computers, speakers, smart home appliances, and the like.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, they are described in a relatively simple manner, and reference may be made to some descriptions of method embodiments for relevant points. The above-described system embodiments are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The embodiments of the present invention have been described in detail above using specific examples, which are intended only to help understand the method and system of the present invention. A person skilled in the art may, following the idea of the present invention, vary the specific embodiments and the application scope; in summary, the content of this specification should not be construed as limiting the present invention.

Claims (22)

1. A voice wake-up method, comprising:
receiving voice data;
acquiring a first acoustic feature of the voice data;
performing awakening word recognition by using the first acoustic feature, the first acoustic model and the first decoding network to obtain a primary awakening word recognition result;
if the primary awakening word recognition result is an awakening word, judging whether the primary awakening word recognition result reaches a set target or not;
if yes, acquiring a second acoustic feature of the voice data;
performing secondary awakening word recognition by using the second acoustic feature, the second acoustic model and the second decoding network to obtain a secondary awakening word recognition result; the second acoustic model is larger than the first acoustic model, and/or the second decoding network is larger than the first decoding network;
and determining whether the awakening is successful or not according to the secondary awakening word recognition result.
2. The method of claim 1, wherein the second acoustic characteristic is the same as or different from the first acoustic characteristic.
3. The method of claim 2, wherein the first acoustic feature is any one of: MFCC feature, Bottleneck feature, Filterbank feature.
4. The method of claim 1, wherein:
the first acoustic model comprises a wake-up word acoustic model and an absorption model, wherein the wake-up word acoustic model and the absorption model are respectively trained, the wake-up word acoustic model is characterized by using a GMM-HMM based on first acoustic features, and the absorption model is characterized by adopting the GMM-HMM;
the second acoustic model includes a wake word acoustic model and an absorption model, wherein the wake word acoustic model and the absorption model are trained simultaneously, both of which are characterized using a neural network model based on the second acoustic features.
5. The method of claim 1, wherein the determining whether the initial wake word recognition result reaches a set target comprises:
determining a current environment state;
and judging whether the initial awakening word recognition result reaches a set target or not according to the environment state.
6. The method of claim 5, wherein the determining a current environmental state comprises:
calculating the signal-to-noise ratio of the voice data;
if the signal to noise ratio is larger than a set value, the current environment state is a quiet environment; otherwise, the current environment state is a noise environment.
7. The method according to claim 5, wherein the determining whether the initial wake word recognition result reaches a set target according to the environment state comprises:
acquiring acoustic likelihood of an awakening word and a non-awakening word obtained in the process of recognizing the initial awakening word;
calculating the acoustic likelihood ratio of the awakening word and the non-awakening word according to the acoustic likelihood;
and if the acoustic likelihood ratio is larger than a judgment threshold corresponding to the environment state, the primary awakening word recognition result reaches a set target.
8. The method according to any one of claims 1 to 7, wherein the determining whether the waking is successful according to the secondary wake word recognition result comprises:
and if the secondary awakening word recognition result is the awakening word, determining that the awakening is successful.
9. The method according to any one of claims 1 to 7, wherein the determining whether the waking is successful according to the secondary wake word recognition result comprises:
if the secondary awakening word recognition result is an awakening word, fusing the primary awakening word recognition result and the secondary awakening word recognition result to obtain a fusion result;
and determining whether the awakening is successful or not according to the fusion result.
10. The method of claim 9,
the fusing the primary awakening word recognition result and the secondary awakening word recognition result to obtain a fusion result comprises:
respectively acquiring an acoustic likelihood ratio T1 of the primary awakening word recognition result and an acoustic likelihood ratio T2 of the secondary awakening word recognition result;
carrying out weighted combination on the acoustic likelihood ratio T1 of the primary awakening word recognition result and the acoustic likelihood ratio T2 of the secondary awakening word recognition result to obtain a fusion result T;
the determining whether the awakening is successful according to the fusion result comprises:
if the fusion result T is larger than the set fusion threshold value, awakening successfully; otherwise, the awakening fails.
11. The method according to any one of claims 1 to 7, wherein the determining whether the waking is successful according to the secondary wake word recognition result comprises:
if the secondary awakening word recognition result is an awakening word, fusing the primary awakening word recognition result and the secondary awakening word recognition result to obtain a fusion result;
calculating the similarity between the time length of the primary awakening word recognition result and the time length of the secondary awakening word recognition result;
if the fusion result is larger than a set fusion threshold value and the similarity is larger than a set similarity threshold value, awakening successfully; otherwise, the awakening fails.
12. A voice wake-up system, comprising:
the receiving module is used for receiving voice data;
the first acoustic feature acquisition module is used for acquiring first acoustic features of the voice data;
the primary awakening module is used for performing awakening word recognition by utilizing the first acoustic characteristic, the first acoustic model and the first decoding network to obtain a primary awakening word recognition result;
the judging module is used for judging whether the primary awakening word recognition result reaches a set target or not when the primary awakening word recognition result is the awakening word; if yes, triggering a second acoustic feature acquisition module;
the second acoustic feature acquisition module is used for acquiring a second acoustic feature of the voice data;
the secondary awakening module is used for carrying out secondary awakening word recognition by utilizing the second acoustic characteristic, the second acoustic model and the second decoding network to obtain a secondary awakening word recognition result; the second acoustic model is larger than the first acoustic model, and/or the second decoding network is larger than the first decoding network;
and the determining module is used for determining whether the awakening is successful or not according to the secondary awakening word recognition result.
13. The system of claim 12, wherein:
the first acoustic model comprises a wake-up word acoustic model and an absorption model, wherein the wake-up word acoustic model and the absorption model are respectively trained, the wake-up word acoustic model is characterized by using a GMM-HMM based on first acoustic features, and the absorption model is characterized by adopting the GMM-HMM;
the second acoustic model includes a wake word acoustic model and an absorption model, wherein the wake word acoustic model and the absorption model are trained simultaneously, both of which are characterized using a neural network model based on the second acoustic features.
14. The system of claim 12, wherein the determining module comprises:
an environment state determination unit for determining a current environment state;
and the judging unit is used for judging whether the initial awakening word recognition result reaches a set target or not according to the environment state.
15. The system of claim 14,
the environment state determining unit is specifically configured to calculate a signal-to-noise ratio of the voice data, determine that the current environment state is a quiet environment when the signal-to-noise ratio is greater than a set value, and otherwise determine that the current environment state is a noisy environment.
16. The system according to claim 14, wherein the judging unit includes:
the likelihood obtaining subunit is used for obtaining the acoustic likelihoods of the awakening words and the non-awakening words obtained in the process of identifying the primary awakening words;
and the likelihood ratio calculating subunit is used for calculating the acoustic likelihood ratio of the awakening word and the non-awakening word according to the acoustic likelihoods, and determining that the initial awakening word recognition result reaches a set target when the acoustic likelihood ratio is greater than a judgment threshold corresponding to the environment state.
17. The system according to any one of claims 12 to 16,
the determining module is specifically configured to determine that the waking is successful when the secondary waking word recognition result is a waking word.
18. The system according to any one of claims 12 to 16, wherein the determining module comprises:
the fusion unit is used for fusing the primary awakening word recognition result and the secondary awakening word recognition result to obtain a fusion result when the secondary awakening word recognition result is the awakening word;
and the first determining unit is used for determining whether the awakening is successful or not according to the fusion result.
19. The system of claim 18,
the fusion unit is specifically configured to obtain an acoustic likelihood ratio T1 of the primary awakening word recognition result and an acoustic likelihood ratio T2 of the secondary awakening word recognition result, and perform weighted combination on the acoustic likelihood ratio T1 of the primary awakening word recognition result and the acoustic likelihood ratio T2 of the secondary awakening word recognition result to obtain a fusion result T;
the determining unit is specifically configured to determine that the wake-up is successful when the fusion result T is greater than a set fusion threshold; otherwise, determining that the awakening fails.
20. The system according to any one of claims 12 to 16, wherein the determining module comprises:
the fusion unit is used for fusing the primary awakening word recognition result and the secondary awakening word recognition result to obtain a fusion result when the secondary awakening word recognition result is the awakening word;
the similarity calculation unit is used for calculating the similarity between the time length of the primary awakening word recognition result and the time length of the secondary awakening word recognition result;
a second determining unit, configured to determine that the waking up is successful when the fusion result is greater than a set fusion threshold and the similarity is greater than a set similarity threshold; otherwise, determining that the awakening fails.
21. An intelligent terminal, characterized in that it comprises a voice wake-up system according to any one of claims 12 to 20.
22. The intelligent terminal according to claim 21, wherein the intelligent terminal is any one of: wearable equipment, cell-phone, panel computer, audio amplifier, household electrical appliances, intelligent car machine.
CN201610701651.2A 2016-08-22 2016-08-22 Voice awakening method and system and intelligent terminal Active CN107767863B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610701651.2A CN107767863B (en) 2016-08-22 2016-08-22 Voice awakening method and system and intelligent terminal


Publications (2)

Publication Number Publication Date
CN107767863A CN107767863A (en) 2018-03-06
CN107767863B true CN107767863B (en) 2021-05-04

Family

ID=61263952


Country Status (1)

Country Link
CN (1) CN107767863B (en)

Families Citing this family (56)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10264030B2 (en) 2016-02-22 2019-04-16 Sonos, Inc. Networked microphone device control
US10095470B2 (en) 2016-02-22 2018-10-09 Sonos, Inc. Audio response playback
US10509626B2 (en) 2016-02-22 2019-12-17 Sonos, Inc Handling of loss of pairing between networked devices
US9820039B2 (en) 2016-02-22 2017-11-14 Sonos, Inc. Default playback devices
US9978390B2 (en) 2016-06-09 2018-05-22 Sonos, Inc. Dynamic player selection for audio signal processing
US10134399B2 (en) 2016-07-15 2018-11-20 Sonos, Inc. Contextualization of voice inputs
US10115400B2 (en) 2016-08-05 2018-10-30 Sonos, Inc. Multiple voice services
US10181323B2 (en) 2016-10-19 2019-01-15 Sonos, Inc. Arbitration-based voice recognition
US10475449B2 (en) 2017-08-07 2019-11-12 Sonos, Inc. Wake-word detection suppression
US10048930B1 (en) 2017-09-08 2018-08-14 Sonos, Inc. Dynamic computation of system response volume
US10051366B1 (en) 2017-09-28 2018-08-14 Sonos, Inc. Three-dimensional beam forming with a microphone array
US10482868B2 (en) 2017-09-28 2019-11-19 Sonos, Inc. Multi-channel acoustic echo cancellation
US10466962B2 (en) 2017-09-29 2019-11-05 Sonos, Inc. Media playback system with voice assistance
CN108564941B (en) * 2018-03-22 2020-06-02 腾讯科技(深圳)有限公司 Voice recognition method, device, equipment and storage medium
US11175880B2 (en) 2018-05-10 2021-11-16 Sonos, Inc. Systems and methods for voice-assisted media content selection
US10959029B2 (en) 2018-05-25 2021-03-23 Sonos, Inc. Determining and adapting to changes in microphone performance of playback devices
CN109147763B (en) * 2018-07-10 2020-08-11 深圳市感动智能科技有限公司 Audio and video keyword identification method and device based on neural network and inverse entropy weighting
US11076035B2 (en) 2018-08-28 2021-07-27 Sonos, Inc. Do not disturb feature for audio notifications
CN109065046A (en) * 2018-08-30 2018-12-21 出门问问信息科技有限公司 Method, apparatus, electronic equipment and the computer readable storage medium that voice wakes up
CN108831471B (en) * 2018-09-03 2020-10-23 重庆与展微电子有限公司 Voice safety protection method and device and routing terminal
CN110890087A (en) * 2018-09-10 2020-03-17 北京嘉楠捷思信息技术有限公司 Voice recognition method and device based on cosine similarity
US10587430B1 (en) 2018-09-14 2020-03-10 Sonos, Inc. Networked devices, systems, and methods for associating playback devices based on sound codes
US11024331B2 (en) 2018-09-21 2021-06-01 Sonos, Inc. Voice detection optimization using sound metadata
US11100923B2 (en) * 2018-09-28 2021-08-24 Sonos, Inc. Systems and methods for selective wake word detection using neural network models
US10692518B2 (en) 2018-09-29 2020-06-23 Sonos, Inc. Linear filtering for noise-suppressed speech detection via multiple network microphone devices
US11899519B2 (en) 2018-10-23 2024-02-13 Sonos, Inc. Multiple stage network microphone device with reduced power consumption and processing load
CN112740321A (en) * 2018-11-20 2021-04-30 深圳市欢太科技有限公司 Method and device for waking up equipment, storage medium and electronic equipment
US11183183B2 (en) 2018-12-07 2021-11-23 Sonos, Inc. Systems and methods of operating media playback systems having multiple voice assistant services
US11132989B2 (en) 2018-12-13 2021-09-28 Sonos, Inc. Networked microphone devices, systems, and methods of localized arbitration
US10602268B1 (en) 2018-12-20 2020-03-24 Sonos, Inc. Optimization of network microphone devices using noise classification
US11120794B2 (en) 2019-05-03 2021-09-14 Sonos, Inc. Voice assistant persistence across multiple network microphone devices
CN110047485B (en) * 2019-05-16 2021-09-28 北京地平线机器人技术研发有限公司 Method and apparatus for recognizing wake-up word, medium, and device
US11200894B2 (en) 2019-06-12 2021-12-14 Sonos, Inc. Network microphone device with command keyword eventing
CN110197663B (en) * 2019-06-30 2022-05-31 联想(北京)有限公司 Control method and device and electronic equipment
US10871943B1 (en) 2019-07-31 2020-12-22 Sonos, Inc. Noise classification for event detection
CN110634468B (en) * 2019-09-11 2022-04-15 中国联合网络通信集团有限公司 Voice wake-up method, device, equipment and computer readable storage medium
CN110570861B (en) * 2019-09-24 2022-02-25 Oppo广东移动通信有限公司 Method and device for voice wake-up, terminal equipment and readable storage medium
CN110853633A (en) * 2019-09-29 2020-02-28 联想(北京)有限公司 Awakening method and device
CN110580908A (en) * 2019-09-29 2019-12-17 出门问问信息科技有限公司 command word detection method and device supporting different languages
US11189286B2 (en) 2019-10-22 2021-11-30 Sonos, Inc. VAS toggle based on device orientation
CN110808030B (en) * 2019-11-22 2021-01-22 珠海格力电器股份有限公司 Voice awakening method, system, storage medium and electronic equipment
US11200900B2 (en) 2019-12-20 2021-12-14 Sonos, Inc. Offline voice control
CN111092798B (en) * 2019-12-24 2021-06-11 东华大学 Wearable system based on spoken language understanding
CN111161714B (en) * 2019-12-25 2023-07-21 联想(北京)有限公司 Voice information processing method, electronic equipment and storage medium
US11562740B2 (en) 2020-01-07 2023-01-24 Sonos, Inc. Voice verification for media playback
CN111243604B (en) * 2020-01-13 2022-05-10 思必驰科技股份有限公司 Training method for speaker recognition neural network model supporting multiple awakening words, speaker recognition method and system
US11308958B2 (en) 2020-02-07 2022-04-19 Sonos, Inc. Localized wakeword verification
CN111312222B (en) * 2020-02-13 2023-09-12 北京声智科技有限公司 Awakening and voice recognition model training method and device
US11482224B2 (en) 2020-05-20 2022-10-25 Sonos, Inc. Command keywords with input detection windowing
CN111816193B (en) * 2020-08-12 2020-12-15 深圳市友杰智新科技有限公司 Voice awakening method and device based on multi-segment network and storage medium
CN111933114B (en) * 2020-10-09 2021-02-02 深圳市友杰智新科技有限公司 Training method and use method of voice awakening hybrid model and related equipment
US11984123B2 (en) 2020-11-12 2024-05-14 Sonos, Inc. Network device interaction by range
CN113129874B (en) * 2021-04-27 2022-05-10 思必驰科技股份有限公司 Voice awakening method and system
CN113947855A (en) * 2021-09-18 2022-01-18 中标慧安信息技术股份有限公司 Intelligent building personnel safety alarm system based on voice recognition
CN114360522B (en) * 2022-03-09 2022-08-02 深圳市友杰智新科技有限公司 Training method of voice awakening model, and detection method and equipment of voice false awakening
CN115223573A (en) * 2022-07-15 2022-10-21 北京百度网讯科技有限公司 Voice wake-up method and device, electronic equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105869637A (en) * 2016-05-26 2016-08-17 百度在线网络技术(北京)有限公司 Voice wake-up method and device

Family Cites Families (21)

Publication number Priority date Publication date Assignee Title
CN1175398C (en) * 2000-11-18 2004-11-10 中兴通讯股份有限公司 Sound activation detection method for identifying speech and music from noise environment
CN100514446C (en) * 2004-09-16 2009-07-15 北京中科信利技术有限公司 Pronunciation evaluating method based on voice identification and voice analysis
CN1841500B (en) * 2005-03-30 2010-04-14 松下电器产业株式会社 Method and apparatus for resisting noise based on adaptive nonlinear spectral subtraction
US8949266B2 (en) * 2007-03-07 2015-02-03 Vlingo Corporation Multiple web-based content category searching in mobile search application
CN101241699B (en) * 2008-03-14 2012-07-18 北京交通大学 A speaker identification method for remote Chinese teaching
CN101308653A (en) * 2008-07-17 2008-11-19 安徽科大讯飞信息科技股份有限公司 End-point detecting method applied to speech identification system
CN102238190B (en) * 2011-08-01 2013-12-11 安徽科大讯飞信息科技股份有限公司 Identity authentication method and system
CN102270451B (en) * 2011-08-18 2013-05-29 安徽科大讯飞信息科技股份有限公司 Method and system for identifying speaker
CN102623009B (en) * 2012-03-02 2013-11-20 安徽科大讯飞信息科技股份有限公司 Abnormal emotion automatic detection and extraction method and system on basis of short-time analysis
CN102999161B (en) * 2012-11-13 2016-03-02 科大讯飞股份有限公司 A kind of implementation method of voice wake-up module and application
CN103021409B (en) * 2012-11-13 2016-02-24 安徽科大讯飞信息科技股份有限公司 A kind of voice activation camera system
CN103811003B (en) * 2012-11-13 2019-09-24 联想(北京)有限公司 A kind of audio recognition method and electronic equipment
CN103823867B (en) * 2014-02-26 2017-02-15 深圳大学 Humming type music retrieval method and system based on note modeling
CN103854662B (en) * 2014-03-04 2017-03-15 中央军委装备发展部第六十三研究所 Adaptive voice detection method based on multiple domain Combined estimator
CN103943107B (en) * 2014-04-03 2017-04-05 北京大学深圳研究生院 A kind of audio frequency and video keyword recognition method based on Decision-level fusion
CN105374352B (en) * 2014-08-22 2019-06-18 中国科学院声学研究所 A kind of voice activated method and system
CN104299612B (en) * 2014-11-10 2017-11-07 科大讯飞股份有限公司 The detection method and device of imitative sound similarity
US10719115B2 (en) * 2014-12-30 2020-07-21 Avago Technologies International Sales Pte. Limited Isolated word training and detection using generated phoneme concatenation models of audio inputs
CN104616653B (en) * 2015-01-23 2018-02-23 北京云知声信息技术有限公司 Wake up word matching process, device and voice awakening method, device
CN105096939B (en) * 2015-07-08 2017-07-25 百度在线网络技术(北京)有限公司 voice awakening method and device
CN105679316A (en) * 2015-12-29 2016-06-15 深圳微服机器人科技有限公司 Voice keyword identification method and apparatus based on deep neural network

Also Published As

Publication number Publication date
CN107767863A (en) 2018-03-06

Similar Documents

Publication Publication Date Title
CN107767863B (en) Voice awakening method and system and intelligent terminal
CN107767861B (en) Voice awakening method and system and intelligent terminal
WO2021093449A1 (en) Wakeup word detection method and apparatus employing artificial intelligence, device, and medium
CN108711421B (en) Speech recognition acoustic model establishing method and device and electronic equipment
CN110265040B (en) Voiceprint model training method and device, storage medium and electronic equipment
CN108320733B (en) Voice data processing method and device, storage medium and electronic equipment
CN110364143B (en) Voice awakening method and device and intelligent electronic equipment
US9202462B2 (en) Key phrase detection
CN111462756B (en) Voiceprint recognition method and device, electronic equipment and storage medium
US20170256270A1 (en) Voice Recognition Accuracy in High Noise Conditions
CN110570873B (en) Voiceprint wake-up method and device, computer equipment and storage medium
CN104575504A (en) Method for personalized television voice wake-up by voiceprint and voice identification
CN112102850A (en) Processing method, device and medium for emotion recognition and electronic equipment
CN116601598A (en) Hotphrase triggering based on detection sequences
CN111862963B (en) Voice wakeup method, device and equipment
US20190103110A1 (en) Information processing device, information processing method, and program
CN110853669A (en) Audio identification method, device and equipment
CN114120979A (en) Optimization method, training method, device and medium of voice recognition model
CN114360510A (en) Voice recognition method and related device
CN113611316A (en) Man-machine interaction method, device, equipment and storage medium
CN111862943A (en) Speech recognition method and apparatus, electronic device, and storage medium
GB2576960A (en) Speaker recognition
WO2020073839A1 (en) Voice wake-up method, apparatus and system, and electronic device
CN115691478A (en) Voice wake-up method and device, man-machine interaction equipment and storage medium
US20230113883A1 (en) Digital Signal Processor-Based Continued Conversation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant