CN109273007B - Voice wake-up method and device

Voice wake-up method and device

Info

Publication number
CN109273007B
Authority
CN
China
Prior art keywords
wake
awakening
word
acoustic
subword
Prior art date
Legal status
Active
Application number
CN201811184504.8A
Other languages
Chinese (zh)
Other versions
CN109273007A (en)
Inventor
Wu Guobing (吴国兵)
Pan Jia (潘嘉)
Current Assignee
Xi'an Xunfei Super Brain Information Technology Co., Ltd.
Original Assignee
Xi'an Xunfei Super Brain Information Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Xi'an Xunfei Super Brain Information Technology Co., Ltd.
Priority to CN201811184504.8A
Publication of CN109273007A
Application granted
Publication of CN109273007B

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems
    • G10L 15/06 - Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G10L 17/00 - Speaker identification or verification
    • G10L 17/04 - Training, enrolment or model building

Abstract

The embodiment of the invention provides a voice wake-up method and device, belonging to the technical field of speech recognition. The method comprises: acquiring acoustic features of a wake-up word in voice data; and inputting the acoustic features into a wake-up decision network and outputting a wake-up decision result, where the wake-up decision result is used for indicating whether wake-up is successful, the wake-up decision network is obtained by training on sample acoustic features, and the wake-up decision network is used for performing a confidence decision on the wake-up word. Because the wake-up decision can be made by the wake-up decision network for any user-defined wake-up word, without relying on a fixed preset threshold, the wake-up success rate can be improved and the wake-up process is applicable to a wider range of scenarios.

Description

Voice wake-up method and device
Technical Field
The embodiment of the invention relates to the technical field of speech recognition, and in particular to a voice wake-up method and device.
Background
With the development of smart homes, the voice wake-up function is increasingly popular. Voice wake-up mainly wakes up an intelligent terminal by understanding a user's voice data. At present, voice wake-up is generally implemented as follows: during wake-up word recognition, the acoustic likelihoods respectively corresponding to the wake-up word path and the filler path are obtained; the ratio of the two acoustic likelihoods is computed to obtain the acoustic likelihood ratio of the wake-up word; and if the acoustic likelihood ratio is larger than a fixed preset threshold, the wake-up word recognition result is confirmed to be credible and the intelligent terminal is woken up successfully. Because the preset threshold is fixed, if the wake-up phrase changes, the preset threshold may no longer be suitable for deciding on the current wake-up phrase, so the wake-up success rate decreases.
Disclosure of Invention
To solve the above problem, embodiments of the present invention provide a voice wake-up method and apparatus that overcome, or at least partially solve, the above problem.
According to a first aspect of the embodiments of the present invention, there is provided a voice wake-up method, including:
acquiring acoustic features of a wake-up word in voice data;
inputting the acoustic features into a wake-up decision network and outputting a wake-up decision result, where the wake-up decision result is used for indicating whether wake-up is successful, the wake-up decision network is obtained by training on sample acoustic features, and the wake-up decision network is used for performing a confidence decision on the wake-up word.
According to the method provided by the embodiment of the invention, the acoustic features of the wake-up word in the voice data are acquired, the acoustic features are input into the wake-up decision network, and a wake-up decision result indicating whether wake-up is successful is output. Because the wake-up decision can be made by the wake-up decision network for any user-defined wake-up word, without relying on a fixed preset threshold, the wake-up success rate can be improved and the wake-up process is applicable to a wider range of scenarios.
According to a second aspect of the embodiments of the present invention, there is provided a voice wake-up apparatus, including:
an acquisition module, configured to acquire acoustic features of a wake-up word in voice data;
an output module, configured to input the acoustic features into a wake-up decision network and output a wake-up decision result, where the wake-up decision result is used for indicating whether wake-up is successful, the wake-up decision network is obtained by training on sample acoustic features, and the wake-up decision network is used for performing a confidence decision on the wake-up word.
According to a third aspect of embodiments of the present invention, there is provided an electronic apparatus, including:
at least one processor; and
at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, and the processor calls the program instructions to perform the voice wake-up method provided by any of the various possible implementations of the first aspect.
According to a fourth aspect of the embodiments of the present invention, there is provided a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the voice wake-up method provided by any of the various possible implementations of the first aspect.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of embodiments of the invention.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings in the following description are obviously only some embodiments of the present invention; those skilled in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a schematic flowchart of a voice wake-up method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a wake-up word recognition network according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a wake-up decision network according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a voice wake-up apparatus according to an embodiment of the present invention;
fig. 5 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are obviously only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort fall within the protection scope of the present invention.
With the development of smart homes, the voice wake-up function is increasingly popular. Voice wake-up mainly wakes up an intelligent terminal by understanding a user's voice data. At present, voice wake-up is generally implemented as follows: during wake-up word recognition, the acoustic likelihood of the wake-up word and the acoustic likelihood of the non-wake-up word are obtained; the ratio of the acoustic likelihood of the wake-up word to the acoustic likelihood of the non-wake-up word is computed to obtain the acoustic likelihood ratio of the wake-up word; and if the acoustic likelihood ratio is larger than a fixed preset threshold, the wake-up word recognition result is confirmed to be credible and the intelligent terminal is woken up successfully. Because the preset threshold is fixed, if the wake-up phrase changes, the preset threshold may no longer be suitable for deciding on the current wake-up phrase, so the wake-up success rate decreases.
For example, if the currently preset wake-up phrase is "ding-dong", then by comparing the acoustic likelihood ratio of the wake-up word "ding-dong" with the preset threshold, it can be determined fairly accurately whether the wake-up phrase spoken by the user should wake up the intelligent terminal. If the user defines a new wake-up phrase, such as "hello, Xiaofei", the fixed preset threshold may not be suitable for the user-defined wake-up phrase, thereby reducing the wake-up success rate.
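To make this baseline concrete, the fixed-threshold decision described above reduces to a single comparison. The following is a minimal sketch, not part of the claimed method; treating the path scores as log-likelihoods (so the ratio becomes a difference) is an assumption, and all names and numbers are illustrative:

    # Minimal sketch of the fixed-threshold baseline. Assumption: path scores
    # are log-likelihoods, so the likelihood "ratio" is a log-domain difference.
    def baseline_wake_decision(keyword_log_lik: float,
                               filler_log_lik: float,
                               threshold: float) -> bool:
        log_likelihood_ratio = keyword_log_lik - filler_log_lik
        return log_likelihood_ratio > threshold  # one fixed threshold for all wake-up words

    # A threshold tuned for "ding-dong" may be a poor fit for "hello, Xiaofei".
    print(baseline_wake_decision(-120.0, -150.0, threshold=25.0))  # True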
In view of the above situation, an embodiment of the present invention provides a voice wake-up method. It should be noted that the method can be applied to an intelligent terminal with a wake-up function, such as a smart speaker, a wearable device, or a smart household appliance. Referring to fig. 1, the method includes, but is not limited to:
101. Acquire acoustic features of the wake-up word in voice data.
Before 101 is executed, the voice data may be input into a wake-up word recognition network to recognize the wake-up word. Specifically, the wake-up word recognition network may be a keyword-filler network based on hidden Markov models, which contains a keyword path and a filler path as shown in fig. 2. The filler path represents the non-wake-up-word path: all words other than the wake-up word are absorbed by the filler path. While the wake-up word is recognized, its acoustic features can be extracted at the same time.
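As an illustration of how the two paths of fig. 2 compete, the sketch below scores a frame sequence against a forced wake-up word phoneme sequence and against a free filler loop. This is a simplification under stated assumptions: per-frame phoneme log-probabilities are assumed available from an acoustic model, and the keyword alignment is crudely uniform rather than a real Viterbi alignment:

    import numpy as np

    def keyword_path_score(frame_logprobs: np.ndarray, phoneme_ids: list) -> float:
        # Force the wake-up word's phoneme sequence, splitting frames uniformly
        # across the phonemes (a real decoder would Viterbi-align instead).
        num_frames = frame_logprobs.shape[0]
        bounds = np.linspace(0, num_frames, len(phoneme_ids) + 1).astype(int)
        return float(sum(frame_logprobs[bounds[i]:bounds[i + 1], p].sum()
                         for i, p in enumerate(phoneme_ids)))

    def filler_path_score(frame_logprobs: np.ndarray) -> float:
        # The filler loop may absorb any phoneme, so take the best one per frame.
        return float(frame_logprobs.max(axis=1).sum())

    rng = np.random.default_rng(0)
    logprobs = np.log(rng.dirichlet(np.ones(83), size=40))   # 40 frames x 83 phonemes
    ratio = keyword_path_score(logprobs, [3, 17, 5, 29, 3, 41, 8, 12]) \
            - filler_path_score(logprobs)                    # log-domain likelihood ratio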
102. Input the acoustic features into a wake-up decision network and output a wake-up decision result, where the wake-up decision result is used for indicating whether wake-up is successful, the wake-up decision network is obtained by training on sample acoustic features, and the wake-up decision network is used for performing a confidence decision on the wake-up word.
Before 102 is executed, the wake-up decision network may be trained. Specifically, an initial decision network may be trained on positive-sample acoustic features and negative-sample acoustic features to obtain the wake-up decision network. The positive-sample acoustic features are acoustic features of wake-up words for which wake-up should succeed, and the negative-sample acoustic features are acoustic features of non-wake-up words for which wake-up should fail. Based on the content of the foregoing embodiment, as an alternative embodiment, the wake-up decision network may be an encoder-decoder model. Accordingly, the initial decision network may include an encoder side and a decoder side, and the wake-up word may include a plurality of subwords. The encoder side may adopt a sequence modeling network such as a recurrent neural network or a long short-term memory network: the acoustic features of each subword are input into the sequence modeling network to obtain a representation vector for each subword; the subword representation vectors are fed into an attention mechanism to obtain the weight with which each subword representation vector influences the wake-up word, from which the representation vector of the whole word is obtained; and the decision is made accordingly.
Referring to fig. 3, on the encoder side, x1 to xT denote the acoustic feature vectors of the subwords, h1 to hT are the subword representation vectors after the sequence modeling network, at,1 to at,T are the weights each representation vector receives from the attention mechanism, and ct is the representation vector of the whole word. On the decoder side, yt-1 denotes the previous decoding result and yt denotes the final decoding result, i.e., whether wake-up is successful; st-1 denotes the intermediate state of the previous decoding step and st that of the current decoding step. It should be noted that, because the wake-up decision network uses an encoder-decoder model, it can accommodate wake-up words of different lengths.
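The following PyTorch sketch mirrors the structure of fig. 3 in simplified form; the layer sizes, the single LSTM layer, the additive attention, and the one-step two-class decoder are all assumptions for illustration rather than the patent's exact architecture:

    import torch
    import torch.nn as nn

    class WakeDecisionNet(nn.Module):
        def __init__(self, feat_dim: int, hidden_dim: int = 64):
            super().__init__()
            self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)  # x1..xT -> h1..hT
            self.attn_score = nn.Linear(hidden_dim, 1)   # additive attention score per subword
            self.decoder = nn.Linear(hidden_dim, 2)      # two classes: wake / no-wake

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, T, feat_dim), one acoustic feature vector per subword
            h, _ = self.encoder(x)                       # h: (batch, T, hidden_dim)
            a = torch.softmax(self.attn_score(h), dim=1) # attention weights at,1..at,T
            c = (a * h).sum(dim=1)                       # whole-word representation ct
            return self.decoder(c)                       # logits for the decision yt

    # Usage: 8 subwords (e.g. "hello, Xiaofei" with 2 phonemes per unit), 16-dim features.
    net = WakeDecisionNet(feat_dim=16)
    logits = net(torch.randn(1, 8, 16))
    wake_prob = logits.softmax(dim=-1)[0, 1].item()      # untrained, so arbitrary here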
According to the method provided by the embodiment of the invention, the acoustic features of the wake-up word in the voice data are acquired, the acoustic features are input into the wake-up decision network, and a wake-up decision result indicating whether wake-up is successful is output. Because the wake-up decision can be made by the wake-up decision network for any user-defined wake-up word, without relying on a fixed preset threshold, the wake-up success rate can be improved and the wake-up process is applicable to a wider range of scenarios.
Based on the content of the foregoing embodiment, as an optional embodiment, the acoustic features include at least one of the following five kinds of information: the wake-up word score of each subword in the wake-up word, the non-wake-up word score of each subword in the wake-up word, the number of frames corresponding to each subword in the wake-up word, the score distribution of the acoustic features corresponding to each subword in the wake-up word, and the embedding features of the subwords in the wake-up word.
For the five kinds of information provided in the embodiment of the present invention, the time at which they are acquired may also be specified: they may be acquired when the wake-up word is recognized, i.e., before the acoustic features are input into the wake-up decision network and the wake-up decision result is output. For ease of understanding, the five kinds of information are described below, taking phonemes as the subwords:
(1) The wake-up word score of each subword in the wake-up word.
Specifically, the acoustic score of each subword on the keyword path can be obtained from recognition and decoding, and the wake-up word scores of the subwords are concatenated into a one-dimensional vector, namely the wake-up word scores of the subwords. For example, suppose the wake-up word is "hello, Xiaofei" and contains 4 speech units; assuming each speech unit is represented by 2 phonemes, the whole wake-up word contains 8 phonemes, and a subword acoustic score sequence of length 8 is obtained, i.e., the wake-up word scores of the subwords in the wake-up word.
(2) The non-wake-up word score of each subword in the wake-up word.
Similarly, the filler acoustic score of each subword can be obtained from recognition and decoding, and the non-wake-up word scores of the subwords are concatenated into a one-dimensional vector, namely the non-wake-up word scores of the subwords. Again taking "hello, Xiaofei" as an example, a non-wake-up word acoustic score sequence of length 8 is obtained, i.e., the non-wake-up word scores of the subwords in the wake-up word.
(3) The number of frames corresponding to each subword in the wake-up word.
Because each subword occupies a certain duration, and duration can be expressed as a number of frames, the number of frames corresponding to each subword in the wake-up word can be determined. Again taking "hello, Xiaofei" as an example, if each speech unit includes two subwords, there are 8 subwords in total; after determining the number of frames occupied by each subword, a frame-count sequence of length 8 is obtained.
(4) The score distribution of the acoustic features corresponding to each subword in the wake-up word.
As can be seen from the above, obtaining the acoustic scores of the subwords in the wake-up word may take place before the acoustic features are input into the wake-up decision network and the wake-up decision result is output. Based on the content of the foregoing embodiment, as an optional embodiment, the embodiment of the present invention does not specifically limit the manner of obtaining the score distribution of the acoustic features corresponding to the subwords in the wake-up word, which includes but is not limited to: determining, for any subword, the probability values with which the acoustic features corresponding to that subword belong to each example phoneme, and taking these probability values as the score distribution of the acoustic features corresponding to that subword.
Here, example phonemes refer to all phonemes that can currently be enumerated. Taking Chinese as an example, the example phonemes may be the initials and finals, 83 in total. For any subword in the wake-up word, the acoustic features corresponding to that subword may belong to some example phoneme among the 83; therefore, the likelihood that the acoustic features corresponding to the subword belong to each example phoneme can be determined and expressed as a probability value. For any subword, "belong to" means that the acoustic features corresponding to that subword match that example phoneme.
(5) The embedding features of the subwords in the wake-up word.
Since the subwords are phonemes, the wake-up word is composed of a plurality of subwords, each subword is deterministic, and the example phonemes can be enumerated, it can be determined which of the 83 example phonemes each subword in the wake-up word matches. The embedding features are used to represent this matching relationship. Specifically, each embedding feature may be a one-hot vector, and the matching relationship between a subword in the wake-up word and all example phonemes is represented by that one-hot vector. For example, if the first subword in the wake-up word matches the second example phoneme, the one-hot vector corresponding to the first subword is (0, 1, 0, ..., 0), where the 1 is followed by 81 zeros. Similarly, if the second subword in the wake-up word matches the first example phoneme, the one-hot vector corresponding to the second subword is (1, 0, ..., 0), where the 1 is followed by 82 zeros.
It should be noted that the embedding features of the subwords can be obtained through the wake-up word recognition network during the preceding wake-up recognition process. That is, before the embodiment of the present invention is executed, wake-up recognition may first be performed through the wake-up word recognition network using an acoustic-likelihood-ratio-based decision; while wake-up recognition is performed through the network, the embedding features of the subwords in the wake-up word can be obtained at the same time. For example, if the voice data is "hello, Xiaofei", then after passing through the wake-up word path corresponding to "hello, Xiaofei" in the recognition network, the obtained path confidence is higher than that obtained through other wake-up word paths, so the embedding features of the subwords in the wake-up word can be obtained.
In the method provided by the embodiment of the invention, at least one of the above five kinds of information is input into the wake-up decision network, and the wake-up decision result is output. Because the wake-up decision can be made by the wake-up decision network for any user-defined wake-up word, without relying on a fixed preset threshold, the wake-up success rate can be improved and the wake-up process is applicable to a wider range of scenarios.
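To show how the five kinds of information could feed the wake-up decision network together, here is a sketch that concatenates them into one vector per subword; the concatenation itself and the dimensions are assumptions, since the patent leaves the exact input layout open:

    import numpy as np

    def subword_input(kw_score: float, non_kw_score: float, num_frames: int,
                      score_dist: np.ndarray, embedding: np.ndarray) -> np.ndarray:
        # Features (1)-(3) are scalars; (4) and (5) are 83-dim vectors.
        return np.concatenate([[kw_score, non_kw_score, num_frames],
                               score_dist, embedding])

    x = subword_input(0.9, 0.2, 12, np.full(83, 1 / 83), np.eye(83)[1])
    print(x.shape)   # (169,) = 3 + 83 + 83, one input vector per subword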
Based on the content of the foregoing embodiment, as an alternative embodiment, the embodiment of the present invention does not specifically limit the manner of determining the probability values with which the acoustic features corresponding to any subword belong to each example phoneme, which includes but is not limited to: calculating, for each frame in the subword, the probability values with which that frame belongs to each example phoneme, to obtain a probability value sequence corresponding to each frame; and normalizing the per-frame probability value sequences by the total number of frames contained in the subword, to obtain the probability values with which the acoustic features corresponding to the subword belong to each example phoneme.
For example, consider a subword containing two frames. For the acoustic features of the first frame, the probability value with which they belong to each example phoneme can be determined; with 83 example phonemes, the first frame yields 83 probability values, denoted A = (0.01, 0.02, ..., 0.3). Similarly, for the acoustic features of the second frame, the probability value with which they belong to each example phoneme can be determined, denoted B = (0.03, 0.04, ...). Since the total number of frames contained in the subword is 2, normalizing the per-frame probability value sequences means adding the two sequences and averaging, so the probability values with which the acoustic features corresponding to the subword belong to each example phoneme are C = (0.02, 0.03, ...).
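The worked example above, as code; only the first two entries of A and B are fixed by the example, the remaining entries being illustrative padding:

    import numpy as np

    A = np.zeros(83); A[0], A[1] = 0.01, 0.02   # per-phoneme probabilities, frame 1
    B = np.zeros(83); B[0], B[1] = 0.03, 0.04   # per-phoneme probabilities, frame 2

    C = (A + B) / 2                  # normalize by the subword's total frame count (2)
    print(np.round(C[:2], 4))        # [0.02 0.03], matching the example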
According to the method provided by the embodiment of the invention, the probability values with which each frame of acoustic features in a subword belongs to each example phoneme are calculated to obtain the probability value sequence corresponding to each frame, and the per-frame probability value sequences are normalized by the total number of frames contained in the subword to obtain the probability values with which the acoustic features corresponding to the subword belong to each example phoneme. Because the wake-up decision can be made by the wake-up decision network for any user-defined wake-up word, without relying on a fixed preset threshold, the wake-up success rate can be improved and the wake-up process is applicable to a wider range of scenarios.
Based on the above description of the embodiments, as an alternative embodiment, the subword is a state, a monophone, a triphone, or a syllable. Each speech unit in the wake-up word may include a plurality of phonemes, and each phoneme may include a plurality of states. Alternatively, each speech unit may include a plurality of triphones, and each triphone may include a plurality of states.
Based on the content of the above embodiment, as an optional embodiment, before the acoustic features are input into the wake-up decision network and the wake-up decision result is output, it may also be determined whether the wake-up word satisfies a preset condition. The embodiment of the present invention does not specifically limit the manner of determining whether the wake-up word satisfies the preset condition, which includes but is not limited to: acquiring the acoustic likelihood ratio of the wake-up word, and determining whether the acoustic likelihood ratio satisfies the preset condition. Accordingly, the embodiment of the present invention does not specifically limit the manner of inputting the acoustic features into the wake-up decision network and outputting the wake-up decision result, which includes but is not limited to: if the acoustic likelihood ratio satisfies the preset condition, inputting the acoustic features into the wake-up decision network and outputting the wake-up decision result.
Based on the content of the foregoing embodiment, as an optional embodiment, the preset condition is that the acoustic likelihood ratio is smaller than a preset threshold or within a preset threshold interval.
When the preset condition is that the acoustic likelihood ratio is smaller than the preset threshold: in actual implementation, if the acoustic likelihood ratio is smaller than the preset threshold, the wake-up decision process may be executed; if the acoustic likelihood ratio is larger than the preset threshold, wake-up success may be determined directly. Of course, the preset condition may also be that the acoustic likelihood ratio is larger than the preset threshold; in that case, if the acoustic likelihood ratio is larger than the preset threshold, the wake-up decision process may be executed, and if the acoustic likelihood ratio is smaller than the preset threshold, wake-up failure may be determined directly. When the preset condition is that the acoustic likelihood ratio is larger than the preset threshold, wake-up is achieved through two successive confidence decisions.
When the preset condition is that the acoustic likelihood ratio lies within a preset interval (a, b): if the actually calculated acoustic likelihood ratio of the wake-up word lies within (a, b), the wake-up decision process may be executed; if the actually calculated acoustic likelihood ratio of the wake-up word is smaller than a, wake-up failure is determined directly; and if it is larger than b, wake-up success is determined directly.
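The decision regimes above can be summarized in one gating function. This is a sketch of the interval variant, with decision_net standing in as a hypothetical callable for the trained wake-up decision network:

    def gated_wake_decision(likelihood_ratio: float, a: float, b: float,
                            features, decision_net) -> bool:
        if likelihood_ratio <= a:
            return False                 # confidently reject: wake-up fails
        if likelihood_ratio >= b:
            return True                  # confidently accept: wake-up succeeds
        return decision_net(features)    # borderline: run the wake-up decision network

    # Usage with a stand-in network, for illustration only.
    print(gated_wake_decision(5.0, a=2.0, b=8.0,
                              features=None, decision_net=lambda f: True))  # True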
According to the method provided by the embodiment of the invention, the acoustic likelihood ratio of the wake-up word is acquired and it is determined whether the acoustic likelihood ratio satisfies the preset condition; if the acoustic likelihood ratio satisfies the preset condition, the acoustic features are input into the wake-up decision network and the wake-up decision result is output. Since the acoustic likelihood ratio of the wake-up word is judged first and the wake-up decision is executed afterwards, this two-stage decision process can improve both the wake-up success rate and the accuracy.
It should be noted that all of the above alternative embodiments may be combined arbitrarily to form further alternative embodiments of the present invention, which are not described in detail here.
Based on the content of the foregoing embodiments, an embodiment of the present invention provides a voice wake-up apparatus, which is configured to execute the voice wake-up method in the foregoing method embodiments. Referring to fig. 4, the apparatus includes an acquisition module 401 and an output module 402, wherein:
the acquisition module 401 is configured to acquire acoustic features of a wake-up word in voice data;
the output module 402 is configured to input the acoustic features into a wake-up decision network and output a wake-up decision result, where the wake-up decision result is used for indicating whether wake-up is successful, and the wake-up decision network is obtained by training on sample acoustic features.
Based on the content of the foregoing embodiment, as an optional embodiment, the acoustic features include at least one of the following five kinds of information: the wake-up word score of each subword in the wake-up word, the non-wake-up word score of each subword in the wake-up word, the number of frames corresponding to each subword in the wake-up word, the score distribution of the acoustic features corresponding to each subword in the wake-up word, and the embedding features of the subwords in the wake-up word.
Based on the content of the foregoing embodiment, as an optional embodiment, the acoustic features include the score distribution of the acoustic features corresponding to the subwords in the wake-up word; correspondingly, the apparatus further includes:
a determination module, configured to determine, for any subword in the wake-up word, the probability values with which the acoustic features corresponding to that subword belong to each example phoneme, and to take these probability values as the score distribution of the acoustic features corresponding to that subword.
Based on the content of the foregoing embodiment, as an optional embodiment, the determination module is configured to calculate, for each frame of acoustic features in the subword, the probability values with which that frame belongs to each example phoneme, to obtain a probability value sequence corresponding to each frame; and to normalize the per-frame probability value sequences by the total number of frames contained in the subword, to obtain the probability values with which the acoustic features corresponding to the subword belong to each example phoneme.
Based on the contents of the above embodiments, as an alternative embodiment, the subword is a state, a monophone, a triphone, or a syllable.
Based on the content of the foregoing embodiment, as an alternative embodiment, the apparatus further includes:
the judging module is used for acquiring the acoustic likelihood ratio of the awakening word and judging whether the acoustic likelihood ratio meets a preset condition; correspondingly, the output module 402 is configured to, when the acoustic likelihood ratio satisfies a preset condition, input the acoustic feature to the wake-up determination network, and output a wake-up determination result.
Based on the content of the foregoing embodiment, as an optional embodiment, the preset condition is that the acoustic likelihood ratio is smaller than a preset threshold or within a preset threshold interval.
Based on the content of the foregoing embodiment, as an alternative embodiment, the wake-up decision network is an encoder-decoder model.
The apparatus provided by the embodiment of the invention acquires the acoustic features of the wake-up word in the voice data, inputs the acoustic features into the wake-up decision network, and outputs a wake-up decision result indicating whether wake-up is successful. Because the wake-up decision can be made by the wake-up decision network for any user-defined wake-up word, without relying on a fixed preset threshold, the wake-up success rate can be improved and the wake-up process is applicable to a wider range of scenarios.
Fig. 5 illustrates the physical structure of an electronic device. As shown in fig. 5, the electronic device may include: a processor 510, a communications interface 520, a memory 530, and a communication bus 540, where the processor 510, the communications interface 520, and the memory 530 communicate with each other via the communication bus 540. The processor 510 may call logic instructions in the memory 530 to perform the following method: acquiring acoustic features of a wake-up word in voice data; and inputting the acoustic features into a wake-up decision network and outputting a wake-up decision result, where the wake-up decision result is used for indicating whether wake-up is successful, the wake-up decision network is obtained by training on sample acoustic features, and the wake-up decision network is used for performing a confidence decision on the wake-up word.
Furthermore, the logic instructions in the memory 530 may be implemented in the form of software functional units and, when sold or used as an independent product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, an electronic device, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Embodiments of the present invention further provide a non-transitory computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, performing the method provided by the foregoing embodiments, for example including: acquiring acoustic features of a wake-up word in voice data; and inputting the acoustic features into a wake-up decision network and outputting a wake-up decision result, where the wake-up decision result is used for indicating whether wake-up is successful, the wake-up decision network is obtained by training on sample acoustic features, and the wake-up decision network is used for performing a confidence decision on the wake-up word.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (8)

1. A voice wake-up method, comprising:
acquiring acoustic features of a wake-up word in voice data;
inputting the acoustic features into a wake-up decision network, and outputting a wake-up decision result, wherein the wake-up decision result is used for indicating whether wake-up is successful, the wake-up decision network is obtained by training on sample acoustic features, and the wake-up decision network is used for performing a confidence decision on the wake-up word;
wherein the wake-up decision network is an encoder-decoder model, and a final decoding result of a decoder in the encoder-decoder model is used for indicating whether wake-up is successful; and
wherein the acoustic features comprise the following five kinds of information: a wake-up word score of each subword in the wake-up word, a non-wake-up word score of each subword in the wake-up word, a number of frames corresponding to each subword in the wake-up word, a score distribution of acoustic features corresponding to each subword in the wake-up word, and embedding features of the subwords in the wake-up word, wherein the subwords are monophones, triphones, or syllables.
2. The method of claim 1, wherein the acoustic features comprise the score distribution of the acoustic features corresponding to each subword in the wake-up word; correspondingly, before inputting the acoustic features into the wake-up decision network and outputting the wake-up decision result, the method further comprises:
for any subword in the wake-up word, determining probability values with which the acoustic features corresponding to the subword belong to each example phoneme, and taking the probability values as the score distribution of the acoustic features corresponding to the subword.
3. The method of claim 2, wherein determining the probability values with which the acoustic features corresponding to the subword belong to each example phoneme comprises:
calculating, for each frame of acoustic features in the subword, probability values with which the frame belongs to each example phoneme, to obtain a probability value sequence corresponding to each frame;
normalizing the per-frame probability value sequences by the total number of frames contained in the subword, to obtain the probability values with which the acoustic features corresponding to the subword belong to each example phoneme.
4. The method of claim 1, wherein before inputting the acoustic features into the wake-up decision network and outputting the wake-up decision result, the method further comprises:
acquiring an acoustic likelihood ratio of the wake-up word, and determining whether the acoustic likelihood ratio satisfies a preset condition;
correspondingly, the inputting the acoustic features into the wake-up decision network and outputting the wake-up decision result comprises:
if the acoustic likelihood ratio satisfies the preset condition, inputting the acoustic features into the wake-up decision network and outputting the wake-up decision result.
5. The method of claim 4, wherein the preset condition is that the acoustic likelihood ratio is smaller than a preset threshold or lies within a preset threshold interval.
6. A voice wake-up apparatus, comprising:
an acquisition module, configured to acquire acoustic features of a wake-up word in voice data;
an output module, configured to input the acoustic features into a wake-up decision network and output a wake-up decision result, wherein the wake-up decision result is used for indicating whether wake-up is successful, the wake-up decision network is obtained by training on sample acoustic features, and the wake-up decision network is used for performing a confidence decision on the wake-up word;
wherein the wake-up decision network is an encoder-decoder model, and a final decoding result of a decoder in the encoder-decoder model is used for indicating whether wake-up is successful; and
wherein the acoustic features comprise the following five kinds of information: a wake-up word score of each subword in the wake-up word, a non-wake-up word score of each subword in the wake-up word, a number of frames corresponding to each subword in the wake-up word, a score distribution of acoustic features corresponding to each subword in the wake-up word, and embedding features of the subwords in the wake-up word, wherein the subwords are monophones, triphones, or syllables.
7. An electronic device, comprising:
at least one processor; and
at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1 to 5.
8. A non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the method of any one of claims 1 to 5.
CN201811184504.8A (priority date 2018-10-11, filing date 2018-10-11) Voice wake-up method and device, Active, granted as CN109273007B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811184504.8A CN109273007B (en) 2018-10-11 2018-10-11 Voice wake-up method and device

Publications (2)

Publication Number Publication Date
CN109273007A CN109273007A (en) 2019-01-25
CN109273007B (en) 2022-05-17

Family

ID=65196523

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811184504.8A Active CN109273007B (en) 2018-10-11 2018-10-11 Voice wake-up method and device

Country Status (1)

Country Link
CN (1) CN109273007B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111862963A (en) * 2019-04-12 2020-10-30 阿里巴巴集团控股有限公司 Voice wake-up method, device and equipment
CN110310628B (en) * 2019-06-27 2022-05-20 百度在线网络技术(北京)有限公司 Method, device and equipment for optimizing wake-up model and storage medium
CN110473536B (en) * 2019-08-20 2021-10-15 北京声智科技有限公司 Awakening method and device and intelligent device
CN110473539B (en) * 2019-08-28 2021-11-09 思必驰科技股份有限公司 Method and device for improving voice awakening performance
CN111243604B (en) * 2020-01-13 2022-05-10 思必驰科技股份有限公司 Training method for speaker recognition neural network model supporting multiple awakening words, speaker recognition method and system
CN111489737B (en) * 2020-04-13 2020-11-10 深圳市友杰智新科技有限公司 Voice command recognition method and device, storage medium and computer equipment
CN111696562B (en) * 2020-04-29 2022-08-19 华为技术有限公司 Voice wake-up method, device and storage medium
CN112164395A (en) * 2020-09-18 2021-01-01 北京百度网讯科技有限公司 Vehicle-mounted voice starting method and device, electronic equipment and storage medium
CN112669818B (en) * 2020-12-08 2022-12-02 北京地平线机器人技术研发有限公司 Voice wake-up method and device, readable storage medium and electronic equipment
CN113327617B (en) * 2021-05-17 2024-04-19 西安讯飞超脑信息科技有限公司 Voiceprint discrimination method, voiceprint discrimination device, computer device and storage medium
CN113327618B (en) * 2021-05-17 2024-04-19 西安讯飞超脑信息科技有限公司 Voiceprint discrimination method, voiceprint discrimination device, computer device and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103971685B (en) * 2013-01-30 2015-06-10 腾讯科技(深圳)有限公司 Method and system for recognizing voice commands

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105575395A (en) * 2014-10-14 2016-05-11 中兴通讯股份有限公司 Voice wake-up method and apparatus, terminal, and processing method thereof
CN105096939A (en) * 2015-07-08 2015-11-25 百度在线网络技术(北京)有限公司 Voice wake-up method and device
CN107767861A (en) * 2016-08-22 2018-03-06 科大讯飞股份有限公司 voice awakening method, system and intelligent terminal
CN108281137A (en) * 2017-01-03 2018-07-13 中国科学院声学研究所 A kind of universal phonetic under whole tone element frame wakes up recognition methods and system
CN106910496A (en) * 2017-02-28 2017-06-30 广东美的制冷设备有限公司 Intelligent electrical appliance control and device
CN107123417A (en) * 2017-05-16 2017-09-01 上海交通大学 Optimization method and system are waken up based on the customized voice that distinctive is trained
CN107134279A (en) * 2017-06-30 2017-09-05 百度在线网络技术(北京)有限公司 A kind of voice awakening method, device, terminal and storage medium
CN107369439A (en) * 2017-07-31 2017-11-21 北京捷通华声科技股份有限公司 A kind of voice awakening method and device
CN107644638A (en) * 2017-10-17 2018-01-30 北京智能管家科技有限公司 Audio recognition method, device, terminal and computer-readable recording medium
CN107871499A (en) * 2017-10-27 2018-04-03 珠海市杰理科技股份有限公司 Audio recognition method, system, computer equipment and computer-readable recording medium
CN107871506A (en) * 2017-11-15 2018-04-03 北京云知声信息技术有限公司 The awakening method and device of speech identifying function
CN108198548A (en) * 2018-01-25 2018-06-22 苏州奇梦者网络科技有限公司 A kind of voice awakening method and its system
CN108335696A (en) * 2018-02-09 2018-07-27 百度在线网络技术(北京)有限公司 Voice awakening method and device
CN108564941A (en) * 2018-03-22 2018-09-21 腾讯科技(深圳)有限公司 Audio recognition method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN109273007A (en) 2019-01-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20190617

Address after: 710000 Yunhui Valley D Block 101, No. 156 Tiangu Eighth Road, Software New City, Xi'an High-tech Zone, Xi'an City, Shaanxi Province

Applicant after: Xi'an Xunfei Super Brain Information Technology Co., Ltd.

Address before: 230088 666 Wangjiang West Road, Hefei hi-tech Development Zone, Anhui

Applicant before: iFlytek Co., Ltd.

GR01 Patent grant