CN109273007B - Voice wake-up method and device

Voice wake-up method and device

Info

Publication number
CN109273007B
Authority
CN
China
Prior art keywords
wake
awakening
word
acoustic
subword
Prior art date
Legal status
Active
Application number
CN201811184504.8A
Other languages
Chinese (zh)
Other versions
CN109273007A (en)
Inventor
Wu Guobing (吴国兵)
Pan Jia (潘嘉)
Current Assignee
Xi'an Xunfei Super Brain Information Technology Co., Ltd.
Original Assignee
Xi'an Xunfei Super Brain Information Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Xi'an Xunfei Super Brain Information Technology Co., Ltd.
Priority to CN201811184504.8A
Publication of CN109273007A
Application granted
Publication of CN109273007B

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems
    • G10L 15/06 - Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G10L 17/00 - Speaker identification or verification
    • G10L 17/04 - Training, enrolment or model building

Abstract

The embodiment of the invention provides a voice wake-up method and device, belonging to the technical field of speech recognition. The method comprises: acquiring acoustic features of a wake-up word in voice data; and inputting the acoustic features into a wake-up decision network and outputting a wake-up decision result, where the wake-up decision result is used for indicating whether wake-up is successful, the wake-up decision network is obtained by training on sample acoustic features, and the wake-up decision network is used for performing a confidence decision on the wake-up word. Because the wake-up decision can be made by the wake-up decision network for any user-defined wake-up word, without relying on a fixed preset threshold, the wake-up success rate can be improved and the wake-up process is applicable to a wider range of scenarios.

Description

Voice wake-up method and device
Technical Field
The embodiment of the invention relates to the technical field of speech recognition, and in particular to a voice wake-up method and device.
Background
With the development of smart homes, the voice wake-up function is increasingly popular. Voice wake-up mainly wakes up an intelligent terminal by understanding a user's voice data. At present, voice wake-up is generally implemented as follows: during wake-up word recognition, the acoustic likelihoods respectively corresponding to the wake-up word path and the filler path are obtained; the ratio of the two acoustic likelihoods is computed to obtain the acoustic likelihood ratio of the wake-up word; and if the acoustic likelihood ratio is larger than a fixed preset threshold, the wake-up word recognition result is confirmed to be credible and the intelligent terminal is woken up successfully. Because the preset threshold is fixed, if the wake-up phrase changes, the preset threshold may no longer be suitable for deciding on the current wake-up phrase, so the wake-up success rate decreases.
Disclosure of Invention
To solve the above problem, embodiments of the present invention provide a voice wake-up method and apparatus that overcome, or at least partially solve, the above problem.
According to a first aspect of the embodiments of the present invention, there is provided a voice wake-up method, including:
acquiring acoustic features of a wake-up word in voice data;
inputting the acoustic features into a wake-up decision network and outputting a wake-up decision result, where the wake-up decision result is used for indicating whether wake-up is successful, the wake-up decision network is obtained by training on sample acoustic features, and the wake-up decision network is used for performing a confidence decision on the wake-up word.
According to the method provided by the embodiment of the invention, the acoustic features of the wake-up word in the voice data are acquired, the acoustic features are input into the wake-up decision network, and a wake-up decision result indicating whether wake-up is successful is output. Because the wake-up decision can be made by the wake-up decision network for any user-defined wake-up word, without relying on a fixed preset threshold, the wake-up success rate can be improved and the wake-up process is applicable to a wider range of scenarios.
According to a second aspect of the embodiments of the present invention, there is provided a voice wake-up apparatus, including:
an acquisition module, configured to acquire acoustic features of a wake-up word in voice data;
an output module, configured to input the acoustic features into a wake-up decision network and output a wake-up decision result, where the wake-up decision result is used for indicating whether wake-up is successful, the wake-up decision network is obtained by training on sample acoustic features, and the wake-up decision network is used for performing a confidence decision on the wake-up word.
According to a third aspect of embodiments of the present invention, there is provided an electronic apparatus, including:
at least one processor; and
at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, and the processor calls the program instructions to perform the voice wake-up method provided by any of the various possible implementations of the first aspect.
According to a fourth aspect of the embodiments of the present invention, there is provided a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the voice wake-up method provided by any of the various possible implementations of the first aspect.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of embodiments of the invention.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings in the following description are obviously only some embodiments of the present invention; those skilled in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a schematic flowchart of a voice wake-up method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a wake-up word recognition network according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a wake-up decision network according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a voice wake-up apparatus according to an embodiment of the present invention;
fig. 5 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are obviously only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort fall within the protection scope of the present invention.
With the development of smart homes, the voice wake-up function is increasingly popular. Voice wake-up mainly wakes up an intelligent terminal by understanding a user's voice data. At present, voice wake-up is generally implemented as follows: during wake-up word recognition, the acoustic likelihood of the wake-up word and the acoustic likelihood of the non-wake-up word are obtained; the ratio of the acoustic likelihood of the wake-up word to the acoustic likelihood of the non-wake-up word is computed to obtain the acoustic likelihood ratio of the wake-up word; and if the acoustic likelihood ratio is larger than a fixed preset threshold, the wake-up word recognition result is confirmed to be credible and the intelligent terminal is woken up successfully. Because the preset threshold is fixed, if the wake-up phrase changes, the preset threshold may no longer be suitable for deciding on the current wake-up phrase, so the wake-up success rate decreases.
For example, if the currently preset wake-up phrase is "ding-dong", then by comparing the acoustic likelihood ratio of the wake-up word "ding-dong" with the preset threshold, it can be determined fairly accurately whether the wake-up phrase spoken by the user should wake up the intelligent terminal. If the user defines a new wake-up phrase, such as "hello, Xiaofei", the fixed preset threshold may not be suitable for the user-defined wake-up phrase, thereby reducing the wake-up success rate.
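To make this baseline concrete, the fixed-threshold decision described above reduces to a single comparison. The following is a minimal sketch, not part of the claimed method; treating the path scores as log-likelihoods (so the ratio becomes a difference) is an assumption, and all names and numbers are illustrative:

    # Minimal sketch of the fixed-threshold baseline. Assumption: path scores
    # are log-likelihoods, so the likelihood "ratio" is a log-domain difference.
    def baseline_wake_decision(keyword_log_lik: float,
                               filler_log_lik: float,
                               threshold: float) -> bool:
        log_likelihood_ratio = keyword_log_lik - filler_log_lik
        return log_likelihood_ratio > threshold  # one fixed threshold for all wake-up words

    # A threshold tuned for "ding-dong" may be a poor fit for "hello, Xiaofei".
    print(baseline_wake_decision(-120.0, -150.0, threshold=25.0))  # True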
In view of the above situation, an embodiment of the present invention provides a voice wake-up method. It should be noted that the method can be applied to an intelligent terminal with a wake-up function, such as a smart speaker, a wearable device, or a smart household appliance. Referring to fig. 1, the method includes, but is not limited to:
101. Acquire acoustic features of the wake-up word in voice data.
Before 101 is executed, the voice data may be input into a wake-up word recognition network to recognize the wake-up word. Specifically, the wake-up word recognition network may be a keyword-filler network based on hidden Markov models, which contains a keyword path and a filler path as shown in fig. 2. The filler path represents the non-wake-up-word path: all words other than the wake-up word are absorbed by the filler path. While the wake-up word is recognized, its acoustic features can be extracted at the same time.
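As an illustration of how the two paths of fig. 2 compete, the sketch below scores a frame sequence against a forced wake-up word phoneme sequence and against a free filler loop. This is a simplification under stated assumptions: per-frame phoneme log-probabilities are assumed available from an acoustic model, and the keyword alignment is crudely uniform rather than a real Viterbi alignment:

    import numpy as np

    def keyword_path_score(frame_logprobs: np.ndarray, phoneme_ids: list) -> float:
        # Force the wake-up word's phoneme sequence, splitting frames uniformly
        # across the phonemes (a real decoder would Viterbi-align instead).
        num_frames = frame_logprobs.shape[0]
        bounds = np.linspace(0, num_frames, len(phoneme_ids) + 1).astype(int)
        return float(sum(frame_logprobs[bounds[i]:bounds[i + 1], p].sum()
                         for i, p in enumerate(phoneme_ids)))

    def filler_path_score(frame_logprobs: np.ndarray) -> float:
        # The filler loop may absorb any phoneme, so take the best one per frame.
        return float(frame_logprobs.max(axis=1).sum())

    rng = np.random.default_rng(0)
    logprobs = np.log(rng.dirichlet(np.ones(83), size=40))   # 40 frames x 83 phonemes
    ratio = keyword_path_score(logprobs, [3, 17, 5, 29, 3, 41, 8, 12]) \
            - filler_path_score(logprobs)                    # log-domain likelihood ratio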
102. Input the acoustic features into a wake-up decision network and output a wake-up decision result, where the wake-up decision result is used for indicating whether wake-up is successful, the wake-up decision network is obtained by training on sample acoustic features, and the wake-up decision network is used for performing a confidence decision on the wake-up word.
Before 102 is executed, the wake-up decision network may be trained. Specifically, an initial decision network may be trained on positive-sample acoustic features and negative-sample acoustic features to obtain the wake-up decision network. The positive-sample acoustic features are acoustic features of wake-up words for which wake-up should succeed, and the negative-sample acoustic features are acoustic features of non-wake-up words for which wake-up should fail. Based on the content of the foregoing embodiment, as an alternative embodiment, the wake-up decision network may be an encoder-decoder model. Accordingly, the initial decision network may include an encoder side and a decoder side, and the wake-up word may include a plurality of subwords. The encoder side may adopt a sequence modeling network such as a recurrent neural network or a long short-term memory network: the acoustic features of each subword are input into the sequence modeling network to obtain a representation vector for each subword; the subword representation vectors are fed into an attention mechanism to obtain the weight with which each subword representation vector influences the wake-up word, from which the representation vector of the whole word is obtained; and the decision is made accordingly.
Referring to fig. 3, on the encoder side, x1 to xT denote the acoustic feature vectors of the subwords, h1 to hT are the subword representation vectors after the sequence modeling network, at,1 to at,T are the weights each representation vector receives from the attention mechanism, and ct is the representation vector of the whole word. On the decoder side, yt-1 denotes the previous decoding result and yt denotes the final decoding result, i.e., whether wake-up is successful; st-1 denotes the intermediate state of the previous decoding step and st that of the current decoding step. It should be noted that, because the wake-up decision network uses an encoder-decoder model, it can accommodate wake-up words of different lengths.
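The following PyTorch sketch mirrors the structure of fig. 3 in simplified form; the layer sizes, the single LSTM layer, the additive attention, and the one-step two-class decoder are all assumptions for illustration rather than the patent's exact architecture:

    import torch
    import torch.nn as nn

    class WakeDecisionNet(nn.Module):
        def __init__(self, feat_dim: int, hidden_dim: int = 64):
            super().__init__()
            self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)  # x1..xT -> h1..hT
            self.attn_score = nn.Linear(hidden_dim, 1)   # additive attention score per subword
            self.decoder = nn.Linear(hidden_dim, 2)      # two classes: wake / no-wake

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, T, feat_dim), one acoustic feature vector per subword
            h, _ = self.encoder(x)                       # h: (batch, T, hidden_dim)
            a = torch.softmax(self.attn_score(h), dim=1) # attention weights at,1..at,T
            c = (a * h).sum(dim=1)                       # whole-word representation ct
            return self.decoder(c)                       # logits for the decision yt

    # Usage: 8 subwords (e.g. "hello, Xiaofei" with 2 phonemes per unit), 16-dim features.
    net = WakeDecisionNet(feat_dim=16)
    logits = net(torch.randn(1, 8, 16))
    wake_prob = logits.softmax(dim=-1)[0, 1].item()      # untrained, so arbitrary here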
According to the method provided by the embodiment of the invention, the acoustic features of the wake-up word in the voice data are acquired, the acoustic features are input into the wake-up decision network, and a wake-up decision result indicating whether wake-up is successful is output. Because the wake-up decision can be made by the wake-up decision network for any user-defined wake-up word, without relying on a fixed preset threshold, the wake-up success rate can be improved and the wake-up process is applicable to a wider range of scenarios.
Based on the content of the foregoing embodiment, as an optional embodiment, the acoustic features include at least one of the following five kinds of information: the wake-up word score of each subword in the wake-up word, the non-wake-up word score of each subword in the wake-up word, the number of frames corresponding to each subword in the wake-up word, the score distribution of the acoustic features corresponding to each subword in the wake-up word, and the embedding features of the subwords in the wake-up word.
For the five kinds of information provided in the embodiment of the present invention, the time at which they are acquired may also be specified: they may be acquired when the wake-up word is recognized, i.e., before the acoustic features are input into the wake-up decision network and the wake-up decision result is output. For ease of understanding, the five kinds of information are described below, taking phonemes as the subwords:
(1) The wake-up word score of each subword in the wake-up word.
Specifically, the acoustic score of each subword on the keyword path can be obtained from recognition and decoding, and the wake-up word scores of the subwords are concatenated into a one-dimensional vector, namely the wake-up word scores of the subwords. For example, suppose the wake-up word is "hello, Xiaofei" and contains 4 speech units; assuming each speech unit is represented by 2 phonemes, the whole wake-up word contains 8 phonemes, and a subword acoustic score sequence of length 8 is obtained, i.e., the wake-up word scores of the subwords in the wake-up word.
(2) The non-wake-up word score of each subword in the wake-up word.
Similarly, the filler acoustic score of each subword can be obtained from recognition and decoding, and the non-wake-up word scores of the subwords are concatenated into a one-dimensional vector, namely the non-wake-up word scores of the subwords. Again taking "hello, Xiaofei" as an example, a non-wake-up word acoustic score sequence of length 8 is obtained, i.e., the non-wake-up word scores of the subwords in the wake-up word.
(3) The number of frames corresponding to each subword in the wake-up word.
Because each subword occupies a certain duration, and duration can be expressed as a number of frames, the number of frames corresponding to each subword in the wake-up word can be determined. Again taking "hello, Xiaofei" as an example, if each speech unit includes two subwords, there are 8 subwords in total; after determining the number of frames occupied by each subword, a frame-count sequence of length 8 is obtained.
(4) The score distribution of the acoustic features corresponding to each subword in the wake-up word.
As can be seen from the above, obtaining the acoustic scores of the subwords in the wake-up word may take place before the acoustic features are input into the wake-up decision network and the wake-up decision result is output. Based on the content of the foregoing embodiment, as an optional embodiment, the embodiment of the present invention does not specifically limit the manner of obtaining the score distribution of the acoustic features corresponding to the subwords in the wake-up word, which includes but is not limited to: determining, for any subword, the probability values with which the acoustic features corresponding to that subword belong to each example phoneme, and taking these probability values as the score distribution of the acoustic features corresponding to that subword.
Here, example phonemes refer to all phonemes that can currently be enumerated. Taking Chinese as an example, the example phonemes may be the initials and finals, 83 in total. For any subword in the wake-up word, the acoustic features corresponding to that subword may belong to some example phoneme among the 83; therefore, the likelihood that the acoustic features corresponding to the subword belong to each example phoneme can be determined and expressed as a probability value. For any subword, "belong to" means that the acoustic features corresponding to that subword match that example phoneme.
(5) The embedding features of the subwords in the wake-up word.
Since the subwords are phonemes, the wake-up word is composed of a plurality of subwords, each subword is deterministic, and the example phonemes can be enumerated, it can be determined which of the 83 example phonemes each subword in the wake-up word matches. The embedding features are used to represent this matching relationship. Specifically, each embedding feature may be a one-hot vector, and the matching relationship between a subword in the wake-up word and all example phonemes is represented by that one-hot vector. For example, if the first subword in the wake-up word matches the second example phoneme, the one-hot vector corresponding to the first subword is (0, 1, 0, ..., 0), where the 1 is followed by 81 zeros. Similarly, if the second subword in the wake-up word matches the first example phoneme, the one-hot vector corresponding to the second subword is (1, 0, ..., 0), where the 1 is followed by 82 zeros.
It should be noted that the embedding features of the subwords can be obtained through the wake-up word recognition network during the preceding wake-up recognition process. That is, before the embodiment of the present invention is executed, wake-up recognition may first be performed through the wake-up word recognition network using an acoustic-likelihood-ratio-based decision; while wake-up recognition is performed through the network, the embedding features of the subwords in the wake-up word can be obtained at the same time. For example, if the voice data is "hello, Xiaofei", then after passing through the wake-up word path corresponding to "hello, Xiaofei" in the recognition network, the obtained path confidence is higher than that obtained through other wake-up word paths, so the embedding features of the subwords in the wake-up word can be obtained.
In the method provided by the embodiment of the invention, at least one of the above five kinds of information is input into the wake-up decision network, and the wake-up decision result is output. Because the wake-up decision can be made by the wake-up decision network for any user-defined wake-up word, without relying on a fixed preset threshold, the wake-up success rate can be improved and the wake-up process is applicable to a wider range of scenarios.
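To show how the five kinds of information could feed the wake-up decision network together, here is a sketch that concatenates them into one vector per subword; the concatenation itself and the dimensions are assumptions, since the patent leaves the exact input layout open:

    import numpy as np

    def subword_input(kw_score: float, non_kw_score: float, num_frames: int,
                      score_dist: np.ndarray, embedding: np.ndarray) -> np.ndarray:
        # Features (1)-(3) are scalars; (4) and (5) are 83-dim vectors.
        return np.concatenate([[kw_score, non_kw_score, num_frames],
                               score_dist, embedding])

    x = subword_input(0.9, 0.2, 12, np.full(83, 1 / 83), np.eye(83)[1])
    print(x.shape)   # (169,) = 3 + 83 + 83, one input vector per subword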
Based on the content of the foregoing embodiment, as an alternative embodiment, the embodiment of the present invention does not specifically limit the manner of determining the probability values with which the acoustic features corresponding to any subword belong to each example phoneme, which includes but is not limited to: calculating, for each frame in the subword, the probability values with which that frame belongs to each example phoneme, to obtain a probability value sequence corresponding to each frame; and normalizing the per-frame probability value sequences by the total number of frames contained in the subword, to obtain the probability values with which the acoustic features corresponding to the subword belong to each example phoneme.
For example, consider a subword containing two frames. For the acoustic features of the first frame, the probability value with which they belong to each example phoneme can be determined; with 83 example phonemes, the first frame yields 83 probability values, denoted A = (0.01, 0.02, ..., 0.3). Similarly, for the acoustic features of the second frame, the probability value with which they belong to each example phoneme can be determined, denoted B = (0.03, 0.04, ...). Since the total number of frames contained in the subword is 2, normalizing the per-frame probability value sequences means adding the two sequences and averaging, so the probability values with which the acoustic features corresponding to the subword belong to each example phoneme are C = (0.02, 0.03, ...).
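The worked example above, as code; only the first two entries of A and B are fixed by the example, the remaining entries being illustrative padding:

    import numpy as np

    A = np.zeros(83); A[0], A[1] = 0.01, 0.02   # per-phoneme probabilities, frame 1
    B = np.zeros(83); B[0], B[1] = 0.03, 0.04   # per-phoneme probabilities, frame 2

    C = (A + B) / 2                  # normalize by the subword's total frame count (2)
    print(np.round(C[:2], 4))        # [0.02 0.03], matching the example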
According to the method provided by the embodiment of the invention, the probability values with which each frame of acoustic features in a subword belongs to each example phoneme are calculated to obtain the probability value sequence corresponding to each frame, and the per-frame probability value sequences are normalized by the total number of frames contained in the subword to obtain the probability values with which the acoustic features corresponding to the subword belong to each example phoneme. Because the wake-up decision can be made by the wake-up decision network for any user-defined wake-up word, without relying on a fixed preset threshold, the wake-up success rate can be improved and the wake-up process is applicable to a wider range of scenarios.
Based on the above description of the embodiments, as an alternative embodiment, the subword is a state, a monophone, a triphone, or a syllable. Each speech unit in the wake-up word may include a plurality of phonemes, and each phoneme may include a plurality of states. Alternatively, each speech unit may include a plurality of triphones, and each triphone may include a plurality of states.
Based on the content of the above embodiment, as an optional embodiment, before the acoustic features are input into the wake-up decision network and the wake-up decision result is output, it may also be determined whether the wake-up word satisfies a preset condition. The embodiment of the present invention does not specifically limit the manner of determining whether the wake-up word satisfies the preset condition, which includes but is not limited to: acquiring the acoustic likelihood ratio of the wake-up word, and determining whether the acoustic likelihood ratio satisfies the preset condition. Accordingly, the embodiment of the present invention does not specifically limit the manner of inputting the acoustic features into the wake-up decision network and outputting the wake-up decision result, which includes but is not limited to: if the acoustic likelihood ratio satisfies the preset condition, inputting the acoustic features into the wake-up decision network and outputting the wake-up decision result.
Based on the content of the foregoing embodiment, as an optional embodiment, the preset condition is that the acoustic likelihood ratio is smaller than a preset threshold or within a preset threshold interval.
When the preset condition is that the acoustic likelihood ratio is smaller than the preset threshold: in actual implementation, if the acoustic likelihood ratio is smaller than the preset threshold, the wake-up decision process may be executed; if the acoustic likelihood ratio is larger than the preset threshold, wake-up success may be determined directly. Of course, the preset condition may also be that the acoustic likelihood ratio is larger than the preset threshold; in that case, if the acoustic likelihood ratio is larger than the preset threshold, the wake-up decision process may be executed, and if the acoustic likelihood ratio is smaller than the preset threshold, wake-up failure may be determined directly. When the preset condition is that the acoustic likelihood ratio is larger than the preset threshold, wake-up is achieved through two successive confidence decisions.
When the preset condition is that the acoustic likelihood ratio lies within a preset interval (a, b): if the actually calculated acoustic likelihood ratio of the wake-up word lies within (a, b), the wake-up decision process may be executed; if the actually calculated acoustic likelihood ratio of the wake-up word is smaller than a, wake-up failure is determined directly; and if it is larger than b, wake-up success is determined directly.
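The decision regimes above can be summarized in one gating function. This is a sketch of the interval variant, with decision_net standing in as a hypothetical callable for the trained wake-up decision network:

    def gated_wake_decision(likelihood_ratio: float, a: float, b: float,
                            features, decision_net) -> bool:
        if likelihood_ratio <= a:
            return False                 # confidently reject: wake-up fails
        if likelihood_ratio >= b:
            return True                  # confidently accept: wake-up succeeds
        return decision_net(features)    # borderline: run the wake-up decision network

    # Usage with a stand-in network, for illustration only.
    print(gated_wake_decision(5.0, a=2.0, b=8.0,
                              features=None, decision_net=lambda f: True))  # True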
According to the method provided by the embodiment of the invention, the acoustic likelihood ratio of the wake-up word is acquired and it is determined whether the acoustic likelihood ratio satisfies the preset condition; if the acoustic likelihood ratio satisfies the preset condition, the acoustic features are input into the wake-up decision network and the wake-up decision result is output. Since the acoustic likelihood ratio of the wake-up word is judged first and the wake-up decision is executed afterwards, this two-stage decision process can improve both the wake-up success rate and the accuracy.
It should be noted that all of the above alternative embodiments may be combined arbitrarily to form further alternative embodiments of the present invention, which are not described in detail here.
Based on the content of the foregoing embodiments, an embodiment of the present invention provides a voice wake-up apparatus, which is configured to execute the voice wake-up method in the foregoing method embodiments. Referring to fig. 4, the apparatus includes an acquisition module 401 and an output module 402, wherein:
the acquisition module 401 is configured to acquire acoustic features of a wake-up word in voice data;
the output module 402 is configured to input the acoustic features into a wake-up decision network and output a wake-up decision result, where the wake-up decision result is used for indicating whether wake-up is successful, and the wake-up decision network is obtained by training on sample acoustic features.
Based on the content of the foregoing embodiment, as an optional embodiment, the acoustic features include at least one of the following five kinds of information: the wake-up word score of each subword in the wake-up word, the non-wake-up word score of each subword in the wake-up word, the number of frames corresponding to each subword in the wake-up word, the score distribution of the acoustic features corresponding to each subword in the wake-up word, and the embedding features of the subwords in the wake-up word.
Based on the content of the foregoing embodiment, as an optional embodiment, the acoustic features include the score distribution of the acoustic features corresponding to the subwords in the wake-up word; correspondingly, the apparatus further includes:
a determination module, configured to determine, for any subword in the wake-up word, the probability values with which the acoustic features corresponding to that subword belong to each example phoneme, and to take these probability values as the score distribution of the acoustic features corresponding to that subword.
Based on the content of the foregoing embodiment, as an optional embodiment, the determination module is configured to calculate, for each frame of acoustic features in the subword, the probability values with which that frame belongs to each example phoneme, to obtain a probability value sequence corresponding to each frame; and to normalize the per-frame probability value sequences by the total number of frames contained in the subword, to obtain the probability values with which the acoustic features corresponding to the subword belong to each example phoneme.
Based on the contents of the above embodiments, as an alternative embodiment, the subword is a state, a monophone, a triphone, or a syllable.
Based on the content of the foregoing embodiment, as an alternative embodiment, the apparatus further includes:
the judging module is used for acquiring the acoustic likelihood ratio of the awakening word and judging whether the acoustic likelihood ratio meets a preset condition; correspondingly, the output module 402 is configured to, when the acoustic likelihood ratio satisfies a preset condition, input the acoustic feature to the wake-up determination network, and output a wake-up determination result.
Based on the content of the foregoing embodiment, as an optional embodiment, the preset condition is that the acoustic likelihood ratio is smaller than a preset threshold or within a preset threshold interval.
Based on the content of the foregoing embodiment, as an alternative embodiment, the wake-up decision network is an encoder-decoder model.
The apparatus provided by the embodiment of the invention acquires the acoustic features of the wake-up word in the voice data, inputs the acoustic features into the wake-up decision network, and outputs a wake-up decision result indicating whether wake-up is successful. Because the wake-up decision can be made by the wake-up decision network for any user-defined wake-up word, without relying on a fixed preset threshold, the wake-up success rate can be improved and the wake-up process is applicable to a wider range of scenarios.
Fig. 5 illustrates the physical structure of an electronic device. As shown in fig. 5, the electronic device may include: a processor 510, a communications interface 520, a memory 530, and a communication bus 540, where the processor 510, the communications interface 520, and the memory 530 communicate with each other via the communication bus 540. The processor 510 may call logic instructions in the memory 530 to perform the following method: acquiring acoustic features of a wake-up word in voice data; and inputting the acoustic features into a wake-up decision network and outputting a wake-up decision result, where the wake-up decision result is used for indicating whether wake-up is successful, the wake-up decision network is obtained by training on sample acoustic features, and the wake-up decision network is used for performing a confidence decision on the wake-up word.
Furthermore, the logic instructions in the memory 530 may be implemented in the form of software functional units and, when sold or used as an independent product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, an electronic device, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Embodiments of the present invention further provide a non-transitory computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, performing the method provided by the foregoing embodiments, for example including: acquiring acoustic features of a wake-up word in voice data; and inputting the acoustic features into a wake-up decision network and outputting a wake-up decision result, where the wake-up decision result is used for indicating whether wake-up is successful, the wake-up decision network is obtained by training on sample acoustic features, and the wake-up decision network is used for performing a confidence decision on the wake-up word.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (8)

1. A voice wake-up method, comprising:
acquiring acoustic features of a wake-up word in voice data;
inputting the acoustic features into a wake-up decision network, and outputting a wake-up decision result, wherein the wake-up decision result is used for indicating whether wake-up is successful, the wake-up decision network is obtained by training on sample acoustic features, and the wake-up decision network is used for performing a confidence decision on the wake-up word;
wherein the wake-up decision network is an encoder-decoder model, and a final decoding result of a decoder in the encoder-decoder model is used for indicating whether wake-up is successful; and
wherein the acoustic features comprise the following five kinds of information: a wake-up word score of each subword in the wake-up word, a non-wake-up word score of each subword in the wake-up word, a number of frames corresponding to each subword in the wake-up word, a score distribution of acoustic features corresponding to each subword in the wake-up word, and embedding features of the subwords in the wake-up word, wherein the subwords are monophones, triphones, or syllables.
2. The method of claim 1, wherein the acoustic features comprise the score distribution of the acoustic features corresponding to each subword in the wake-up word; correspondingly, before inputting the acoustic features into the wake-up decision network and outputting the wake-up decision result, the method further comprises:
for any subword in the wake-up word, determining probability values with which the acoustic features corresponding to the subword belong to each example phoneme, and taking the probability values as the score distribution of the acoustic features corresponding to the subword.
3. The method of claim 2, wherein determining the probability values with which the acoustic features corresponding to the subword belong to each example phoneme comprises:
calculating, for each frame of acoustic features in the subword, probability values with which the frame belongs to each example phoneme, to obtain a probability value sequence corresponding to each frame;
normalizing the per-frame probability value sequences by the total number of frames contained in the subword, to obtain the probability values with which the acoustic features corresponding to the subword belong to each example phoneme.
4. The method of claim 1, wherein before inputting the acoustic features into the wake-up decision network and outputting the wake-up decision result, the method further comprises:
acquiring an acoustic likelihood ratio of the wake-up word, and determining whether the acoustic likelihood ratio satisfies a preset condition;
correspondingly, the inputting the acoustic features into the wake-up decision network and outputting the wake-up decision result comprises:
if the acoustic likelihood ratio satisfies the preset condition, inputting the acoustic features into the wake-up decision network and outputting the wake-up decision result.
5. The method of claim 4, wherein the preset condition is that the acoustic likelihood ratio is smaller than a preset threshold or lies within a preset threshold interval.
6. A voice wake-up apparatus, comprising:
an acquisition module, configured to acquire acoustic features of a wake-up word in voice data;
an output module, configured to input the acoustic features into a wake-up decision network and output a wake-up decision result, wherein the wake-up decision result is used for indicating whether wake-up is successful, the wake-up decision network is obtained by training on sample acoustic features, and the wake-up decision network is used for performing a confidence decision on the wake-up word;
wherein the wake-up decision network is an encoder-decoder model, and a final decoding result of a decoder in the encoder-decoder model is used for indicating whether wake-up is successful; and
wherein the acoustic features comprise the following five kinds of information: a wake-up word score of each subword in the wake-up word, a non-wake-up word score of each subword in the wake-up word, a number of frames corresponding to each subword in the wake-up word, a score distribution of acoustic features corresponding to each subword in the wake-up word, and embedding features of the subwords in the wake-up word, wherein the subwords are monophones, triphones, or syllables.
7. An electronic device, comprising:
at least one processor; and
at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1 to 5.
8. A non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the method of any one of claims 1 to 5.
CN201811184504.8A (priority date 2018-10-11, filing date 2018-10-11) Voice wake-up method and device, Active, granted as CN109273007B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811184504.8A CN109273007B (en) 2018-10-11 2018-10-11 Voice wake-up method and device

Publications (2)

Publication Number Publication Date
CN109273007A CN109273007A (en) 2019-01-25
CN109273007B (en) 2022-05-17

Family

ID=65196523

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811184504.8A Active CN109273007B (en) 2018-10-11 2018-10-11 Voice wake-up method and device

Country Status (1)

Country Link
CN (1) CN109273007B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111862963A (en) * 2019-04-12 2020-10-30 阿里巴巴集团控股有限公司 Voice wake-up method, device and equipment
CN110310628B (en) * 2019-06-27 2022-05-20 百度在线网络技术(北京)有限公司 Method, device and equipment for optimizing wake-up model and storage medium
CN110473536B (en) * 2019-08-20 2021-10-15 北京声智科技有限公司 Awakening method and device and intelligent device
CN110473539B (en) * 2019-08-28 2021-11-09 思必驰科技股份有限公司 Method and device for improving voice awakening performance
CN111243604B (en) * 2020-01-13 2022-05-10 思必驰科技股份有限公司 Training method for speaker recognition neural network model supporting multiple awakening words, speaker recognition method and system
CN111489737B (en) * 2020-04-13 2020-11-10 深圳市友杰智新科技有限公司 Voice command recognition method and device, storage medium and computer equipment
CN111696562B (en) * 2020-04-29 2022-08-19 华为技术有限公司 Voice wake-up method, device and storage medium
CN112164395A (en) * 2020-09-18 2021-01-01 北京百度网讯科技有限公司 Vehicle-mounted voice starting method and device, electronic equipment and storage medium
CN112669818B (en) * 2020-12-08 2022-12-02 北京地平线机器人技术研发有限公司 Voice wake-up method and device, readable storage medium and electronic equipment
CN113327617B (en) * 2021-05-17 2024-04-19 西安讯飞超脑信息科技有限公司 Voiceprint discrimination method, voiceprint discrimination device, computer device and storage medium
CN113327618B (en) * 2021-05-17 2024-04-19 西安讯飞超脑信息科技有限公司 Voiceprint discrimination method, voiceprint discrimination device, computer device and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103971685B (en) * 2013-01-30 2015-06-10 腾讯科技(深圳)有限公司 Method and system for recognizing voice commands

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105575395A (en) * 2014-10-14 2016-05-11 中兴通讯股份有限公司 Voice wake-up method and apparatus, terminal, and processing method thereof
CN105096939A (en) * 2015-07-08 2015-11-25 百度在线网络技术(北京)有限公司 Voice wake-up method and device
CN107767861A (en) * 2016-08-22 2018-03-06 科大讯飞股份有限公司 voice awakening method, system and intelligent terminal
CN108281137A (en) * 2017-01-03 2018-07-13 中国科学院声学研究所 A kind of universal phonetic under whole tone element frame wakes up recognition methods and system
CN106910496A (en) * 2017-02-28 2017-06-30 广东美的制冷设备有限公司 Intelligent electrical appliance control and device
CN107123417A (en) * 2017-05-16 2017-09-01 上海交通大学 Optimization method and system are waken up based on the customized voice that distinctive is trained
CN107134279A (en) * 2017-06-30 2017-09-05 百度在线网络技术(北京)有限公司 A kind of voice awakening method, device, terminal and storage medium
CN107369439A (en) * 2017-07-31 2017-11-21 北京捷通华声科技股份有限公司 A kind of voice awakening method and device
CN107644638A (en) * 2017-10-17 2018-01-30 北京智能管家科技有限公司 Audio recognition method, device, terminal and computer-readable recording medium
CN107871499A (en) * 2017-10-27 2018-04-03 珠海市杰理科技股份有限公司 Audio recognition method, system, computer equipment and computer-readable recording medium
CN107871506A (en) * 2017-11-15 2018-04-03 北京云知声信息技术有限公司 The awakening method and device of speech identifying function
CN108198548A (en) * 2018-01-25 2018-06-22 苏州奇梦者网络科技有限公司 A kind of voice awakening method and its system
CN108335696A (en) * 2018-02-09 2018-07-27 百度在线网络技术(北京)有限公司 Voice awakening method and device
CN108564941A (en) * 2018-03-22 2018-09-21 腾讯科技(深圳)有限公司 Audio recognition method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN109273007A (en) 2019-01-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20190617

Address after: 710000 Yunhui Valley D Block 101, No. 156 Tiangu Eighth Road, Software New City, Xi'an High-tech Zone, Xi'an City, Shaanxi Province

Applicant after: Xi'an Xunfei Super Brain Information Technology Co., Ltd.

Address before: 230088 666 Wangjiang West Road, Hefei hi-tech Development Zone, Anhui

Applicant before: iFlytek Co., Ltd.

GR01 Patent grant