CN114299935A

CN114299935A - Awakening word recognition method, awakening word recognition device, terminal and storage medium

Info

Publication number: CN114299935A
Application number: CN202111590038.5A
Authority: CN
Inventors: 李良斌; 李志勇; 陈孝良
Original assignee: Beijing SoundAI Technology Co Ltd
Current assignee: Beijing SoundAI Technology Co Ltd
Priority date: 2021-12-23
Filing date: 2021-12-23
Publication date: 2022-04-08

Abstract

The disclosure provides a wake-up word recognition method, device, terminal and storage medium, and belongs to the field of artificial intelligence. In the method, the edge difference between each target edge on the target decoding path and the corresponding target wake-up edge on the wake-up path is calculated from the weight values on the two edges, and the path difference between the target decoding path and the wake-up path is then calculated from these edge differences. When the path difference meets a threshold condition, the voice information to be recognized is determined to contain the wake-up word. The method does not directly use the difference between the highest decoding score, which corresponds to the target decoding path, and the wake-up score, which corresponds to the wake-up path, as the basis for deciding whether the voice information to be recognized contains the wake-up word. Instead, it considers the difference at each step of the decoding process, accumulates the per-step differences into a difference between the two paths, and uses that path difference as the basis for wake-up word recognition, which improves the accuracy of the recognition result and reduces the number of false wake-ups.

Description

Awakening word recognition method, awakening word recognition device, terminal and storage medium

Technical Field

The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a terminal, and a storage medium for identifying a wakeup word.

Background

With the development of artificial intelligence technology and voice recognition technology, human-computer interaction based on voice information has been widely applied in life, for example, in scenes such as vehicle navigation, smart home, voice dialing, simultaneous interpretation and the like. When the awakening system of the intelligent voice equipment recognizes that the voice information of the user contains the awakening words, the intelligent voice equipment is awakened so as to enable the intelligent voice equipment to enter a normal working state.

At present, the following method is mainly adopted when the related technology identifies the awakening word: acquiring a decoding path corresponding to the voice information to be detected and a decoding score thereof; acquiring a wake-up score of a wake-up path corresponding to a wake-up word; and when the difference value between the decoding score and the awakening score corresponding to the decoding path is smaller than a preset threshold value, determining that the voice message to be detected contains the awakening word.

However, the related art decides whether the voice information contains the wake-up word based only on the difference between the decoding score and the wake-up score. When the decoding path of the voice information to be detected differs greatly from the wake-up path, but the difference between the decoding score of that decoding path and the wake-up score is still smaller than the preset threshold, the voice information is mistakenly considered to contain the wake-up word, which causes a false wake-up. To improve the accuracy of wake-up word recognition and reduce the number of false wake-ups, a new wake-up word recognition method is urgently needed.

Disclosure of Invention

The embodiment of the disclosure provides a method, a device, a terminal and a storage medium for identifying a wake-up word, which can improve the accuracy of a wake-up word identification result and reduce the number of times of false wake-up. The technical scheme is as follows:

in a first aspect, a method for identifying a wake-up word is provided, where the method includes:

acquiring at least one decoding path of the voice information to be recognized and a decoding score corresponding to each decoding path;

according to the decoding score corresponding to the at least one decoding path, obtaining a target decoding path with the highest decoding score from the at least one decoding path;

obtaining at least one target edge from the target decoding path, wherein the at least one target edge comprises all edges in the target decoding path or edges with a specified decoding order in the target decoding path;

calculating edge difference between each target edge and the corresponding target awakening edge according to the weight value on each target edge and the weight value on the corresponding target awakening edge, wherein the target awakening edge is an awakening edge in the awakening path of the awakening word and has the same decoding sequence with the target edge;

calculating a path difference degree between the target decoding path and the awakening path based on the edge difference degree between each target edge and the corresponding target awakening edge;

and when the path difference degree meets a threshold condition, determining that the voice information to be recognized contains the awakening word.

In another embodiment of the present disclosure, the obtaining at least one decoding path of the to-be-recognized speech information and a decoding score corresponding to each decoding path includes:

decoding the voice information to be recognized based on a decoding network corresponding to the awakening word recognition model to obtain at least one token, wherein the token is used for recording the state of each moment in the decoding process and the weight value of each edge;

and generating the at least one decoding path and a decoding score corresponding to each decoding path based on the at least one token.

In another embodiment of the present disclosure, the calculating an edge difference between each target edge and the corresponding target wake-up edge according to the weight value on each target edge and the weight value on the corresponding target wake-up edge includes:

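calculating the difference between the weight value on each target edge and the weight value on the corresponding target wake-up edge to obtain the edge difference between each target edge and the corresponding target wake-up edge.

In another embodiment of the present disclosure, the calculating an edge difference between each target edge and the corresponding target wake-up edge according to the weight value on each target edge and the weight value on the corresponding target wake-up edge includes:

calculating an average weight value according to the weight value on at least one wake-up edge in the wake-up path;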

and calculating the difference between the weight value on each target edge and the average weight value to obtain the edge difference between each target edge and the corresponding target awakening edge.

In another embodiment of the present disclosure, the calculating an average weight value according to a weight value on at least one wake-up edge in the wake-up path includes:

calculating the average value of the weight values on all wake-up edges in the wake-up path to obtain the average weight value; or,

and calculating the average value of the weight values on the wake-up edges with the appointed decoding order in the wake-up path to obtain the average weight value.

In another embodiment of the present disclosure, the calculating a path difference degree between the target decoding path and the wake-up path based on the edge difference degree between each target edge and the corresponding target wake-up edge includes:

performing root mean square calculation based on the edge difference between each target edge and the corresponding target awakening edge to obtain a root mean square calculation result;

and determining the root-mean-square calculation result as the path difference degree between the target decoding path and the wake-up path.

In a second aspect, an apparatus for identifying a wake-up word is provided, the apparatus comprising:

the acquisition module is used for acquiring at least one decoding path of the voice information to be recognized and a decoding score corresponding to each decoding path;

the obtaining module is further configured to obtain, according to a decoding score corresponding to the at least one decoding path, a target decoding path with a highest decoding score from the at least one decoding path;

the obtaining module is further configured to obtain at least one target edge from the target decoding path, where the at least one target edge includes all edges in the target decoding path or edges having a specified decoding order in the target decoding path;

the calculation module is used for calculating edge difference between each target edge and the corresponding target awakening edge according to the weight value on each target edge and the weight value on the corresponding target awakening edge, wherein the target awakening edge is an awakening edge in the awakening path of the awakening word and has the same decoding sequence with the target edge;

the calculation module is configured to calculate a path difference between the target decoding path and the wakeup path based on an edge difference between each target edge and a corresponding target wakeup edge;

and the determining module is used for determining that the voice information to be recognized contains the awakening word when the path difference meets a threshold condition.

In another embodiment of the present disclosure, the obtaining module is configured to decode the voice information to be recognized based on a decoding network corresponding to a wake-up word recognition model to obtain at least one token, where the token is used to record the state at each moment and the weight value on each edge traversed in the decoding process; and to generate the at least one decoding path and a decoding score corresponding to each decoding path based on the at least one token.

In another embodiment of the disclosure, the calculating module is configured to calculate a difference between a weight value on each target edge and a weight value on a corresponding target wake-up edge, so as to obtain an edge difference between each target edge and the corresponding target wake-up edge.

In another embodiment of the present disclosure, the calculating module is configured to calculate an average weight value according to a weight value on at least one wake-up edge in the wake-up path; and calculating the difference between the weight value on each target edge and the average weight value to obtain the edge difference between each target edge and the corresponding target awakening edge.

In another embodiment of the present disclosure, the calculating module is configured to calculate an average value of the weight values on all wake-up edges in the wake-up path, so as to obtain the average weight value; or,

the calculating module is configured to calculate an average value of weight values on a wake-up edge having a specified decoding order in the wake-up path, and obtain the average weight value.

In another embodiment of the present disclosure, the calculating module is configured to perform root-mean-square calculation based on an edge difference between each target edge and a corresponding target wake-up edge, so as to obtain a root-mean-square calculation result; and determining the root-mean-square calculation result as the path difference degree between the target decoding path and the wake-up path.

In a third aspect, a terminal is provided, where the terminal includes a processor and a memory, where the memory stores at least one program code, and the at least one program code is loaded and executed by the processor to implement the method for identifying a wakeup word according to the first aspect.

In a fourth aspect, a computer-readable storage medium is provided, in which at least one program code is stored, and the at least one program code is loaded and executed by a processor to implement the method for identifying a wake word according to the first aspect.

In a fifth aspect, a computer program product is provided, the computer program product comprising computer program code stored in a computer readable storage medium, the computer program code being read from the computer readable storage medium by a processor of a terminal, the processor executing the computer program code to cause the terminal to perform the wake word recognition method according to the first aspect.

The technical scheme provided by the embodiment of the disclosure has the following beneficial effects:

The edge difference between each target edge on the target decoding path and the corresponding target wake-up edge on the wake-up path is calculated from the weight values on the two edges, and the path difference between the target decoding path and the wake-up path is then calculated from these edge differences, so that when the path difference meets a threshold condition, the voice information to be recognized is determined to contain the wake-up word. The method does not directly use the difference between the highest decoding score, which corresponds to the target decoding path, and the wake-up score, which corresponds to the wake-up path, as the basis for deciding whether the voice information to be recognized contains the wake-up word. Instead, it considers the difference at each step of the decoding process, accumulates the per-step differences into a difference between the two paths, and uses that path difference as the basis for wake-up word recognition, which improves the accuracy of the recognition result and reduces the number of false wake-ups.

Drawings

To illustrate the technical solutions in the embodiments of the present disclosure more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present disclosure, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of a method for identifying a wakeup word according to an embodiment of the present disclosure;

fig. 2 is a flowchart of another wake word recognition method provided by the embodiment of the present disclosure;

fig. 3 is a schematic structural diagram of an apparatus for identifying a wakeup word according to an embodiment of the present disclosure;

fig. 4 shows a block diagram of a terminal according to an exemplary embodiment of the present disclosure.

Detailed Description

To make the objects, technical solutions and advantages of the present disclosure more apparent, embodiments of the present disclosure will be described in detail with reference to the accompanying drawings.

It is to be understood that, as used in the embodiments of the present disclosure, "a plurality" means two or more, "each" refers to every one of the corresponding plurality, and "any" refers to any one of the corresponding plurality. For example, if a plurality of words includes 10 words, "each word" refers to every one of the 10 words, and "any word" refers to any one of the 10 words.

The embodiment of the present disclosure provides a method for identifying a wakeup word, referring to fig. 1, the method provided by the embodiment of the present disclosure includes:

101. Acquiring at least one decoding path of the voice information to be recognized and a decoding score corresponding to each decoding path.

102. And acquiring a target decoding path with the highest decoding score from the at least one decoding path according to the decoding score corresponding to the at least one decoding path.

103. At least one target edge is obtained from the target decoding path.

Wherein the at least one target edge comprises all edges in the target decoding path or edges with a specified decoding order in the target decoding path.

104. And calculating the edge difference between each target edge and the corresponding target awakening edge according to the weight value on each target edge and the weight value on the corresponding target awakening edge.

And the target awakening edge is an awakening edge in the awakening path of the awakening word, and the target awakening edge has the same decoding order as the target edge.

105. And calculating the path difference degree between the target decoding path and the awakening path based on the edge difference degree between each target edge and the corresponding target awakening edge.

106. And when the path difference meets the threshold condition, determining that the voice information to be recognized contains awakening words.

In the method provided by the embodiment of the disclosure, the edge difference between each target edge on the target decoding path and the corresponding target wake-up edge on the wake-up path is calculated from the weight values on the two edges, and the path difference between the target decoding path and the wake-up path is then calculated from these edge differences, so that when the path difference meets the threshold condition, the voice information to be recognized is determined to contain the wake-up word. The method does not directly use the difference between the highest decoding score, which corresponds to the target decoding path, and the wake-up score, which corresponds to the wake-up path, as the basis for deciding whether the voice information to be recognized contains the wake-up word. Instead, it considers the difference at each step of the decoding process, accumulates the per-step differences into a difference between the two paths, and uses that path difference as the basis for wake-up word recognition, which improves the accuracy of the recognition result and reduces the number of false wake-ups.
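The contrast between the score-based check of the related art and the per-edge path difference can be illustrated with a small sketch. The weight values are invented for illustration, the decoding score is assumed to be the sum of edge weights, and the function names `score_difference` and `path_difference` are hypothetical, not taken from the disclosure.

```python
import math
from typing import Sequence


def score_difference(decoded: Sequence[float], wake: Sequence[float]) -> float:
    """Related-art style check: compare only the accumulated scores.

    Assumption: the score is the sum of edge weights and the absolute
    difference is compared against the threshold.
    """
    return abs(sum(decoded) - sum(wake))


def path_difference(decoded: Sequence[float], wake: Sequence[float]) -> float:
    """Root mean square of the per-edge weight differences."""
    if not decoded or len(decoded) != len(wake):
        raise ValueError("paths must be non-empty and have the same number of edges")
    squared = sum((d - w) ** 2 for d, w in zip(decoded, wake))
    return math.sqrt(squared / len(decoded))


# Illustrative weights only: the totals match, but the individual edges do not.
decoded_weights = [0.9, 0.1, 0.5]
wake_weights = [0.5, 0.5, 0.5]

baseline = score_difference(decoded_weights, wake_weights)
per_edge = path_difference(decoded_weights, wake_weights)
```

With these invented weights the two accumulated scores coincide, so a score-only check cannot tell the paths apart, while the per-edge measure stays clearly above zero.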

In another embodiment of the present disclosure, acquiring at least one decoding path of the speech information to be recognized and a decoding score corresponding to each decoding path includes:

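decoding the voice information to be recognized based on a decoding network corresponding to the wake-up word recognition model to obtain at least one token, wherein the token is used for recording the state at each moment in the decoding process and the weight value on each edge;

and generating at least one decoding path and a decoding score corresponding to each decoding path based on the at least one token.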

In another embodiment of the present disclosure, calculating an edge difference between each target edge and the corresponding target wake-up edge according to the weight value on each target edge and the weight value on the corresponding target wake-up edge includes:

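calculating the difference between the weight value on each target edge and the weight value on the corresponding target wake-up edge to obtain the edge difference between each target edge and the corresponding target wake-up edge; or,

calculating an average weight value according to the weight value on at least one wake-up edge in the wake-up path, and calculating the difference between the weight value on each target edge and the average weight value to obtain the edge difference between each target edge and the corresponding target wake-up edge.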

In another embodiment of the present disclosure, calculating an average weight value according to a weight value on at least one wake-up edge in a wake-up path includes:

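calculating the average value of the weight values on all wake-up edges in the wake-up path to obtain the average weight value; or,

calculating the average value of the weight values on the wake-up edges having the specified decoding order in the wake-up path to obtain the average weight value.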

In another embodiment of the present disclosure, calculating a path difference degree between a target decoding path and a wakeup path based on an edge difference degree between each target edge and a corresponding target wakeup edge includes:

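performing root mean square calculation based on the edge difference between each target edge and the corresponding target wake-up edge to obtain a root mean square calculation result;

and determining the root mean square calculation result as the path difference between the target decoding path and the wake-up path.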

All the above optional technical solutions may be combined arbitrarily to form the optional embodiments of the present disclosure, and are not described herein again.

The embodiment of the disclosure provides a method for identifying a wake-up word, which is implemented by a terminal, wherein the terminal can be a smart phone, a smart elevator, a smart air conditioner, a smart sound box and the like, and the type of the terminal is not specifically limited by the embodiment of the disclosure. Referring to fig. 2, a method flow provided by the embodiment of the present disclosure includes:

201. The terminal obtains at least one decoding path of the voice information to be recognized and a decoding score corresponding to each decoding path.

When the voice information to be recognized is acquired through the audio acquisition equipment such as the microphone, the terminal recognizes the voice information to be recognized to acquire at least one decoding path of the voice information to be recognized and a decoding score corresponding to each decoding path. When the terminal obtains at least one decoding path of the voice information to be recognized and a decoding score corresponding to each decoding path, the following method can be adopted:

2011. The terminal decodes the voice information to be recognized based on a decoding network corresponding to the wake-up word recognition model to obtain at least one token.

The wake-up word recognition model is used to recognize wake-up words from voice information. In the field of speech technology, the wake-up word recognition model corresponds to a decoding network, which may be a Weighted Finite-State Transducer (WFST) or the like. A WFST is used for large-scale speech recognition, and its state transitions can be labeled with input symbols and output symbols. A WFST integrates various knowledge sources, including at least one of a language model, an acoustic model, a context-dependency model and a pronunciation dictionary model, and can form different types of decoding networks: for example, a monophone-level decoding network composed of L and G, denoted LG; a C-level decoding network composed of C, L and G, denoted CLG; and an HCLG network, which is represented using hidden Markov models. G represents a language model, whose input and output are of the same type. A language model is a representation of language structure (including the rules between words and sentences, such as grammar and common word collocations), and its probabilities represent how likely a sequence of language units is to occur in a segment of speech signal. L represents a pronunciation dictionary, which takes monophones (phonemes) as input and outputs words. The pronunciation dictionary includes a set of words and their pronunciations. C represents context dependency, with triphones as input and monophones as output. Context dependency indicates the correspondence between triphones and phonemes. H represents an acoustic model, which is a differentiated representation of acoustics, linguistics, environmental variables, speaker gender, accent, and the like. The acoustic model includes, but is not limited to, at least one of GMM (Gaussian Mixture Model)-HMM (Hidden Markov Model), DNN (Deep Neural Network), CNN (Convolutional Neural Network), LSTM (Long Short-Term Memory network), and the like.

In the embodiment of the disclosure, when the voice information to be recognized is input into the decoding network corresponding to the wake-up word recognition model, the terminal can use a search algorithm to search for decoding paths in the decoding network. To manage the decoding of the voice information during this search, the terminal can use a token to record the weight value and the information of a given state at a given moment in the decoding process. Starting from the initial state of the WFST, the token transitions along directed edges. During the state transitions from the initial state to the end state, the state at each moment in the decoding process and the weight value on each edge traversed are recorded in the token.
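As a minimal sketch of the bookkeeping described above, a token can be modelled as a small record that keeps the state reached, the weight on the edge just traversed, and a link to the previous token. The `Token` class and the `advance` and `backtrace` helpers are hypothetical names chosen for illustration; the disclosure does not prescribe a data layout, and a real decoder would also handle pruning and beam search.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Token:
    """Hypothetical token: one state visited during decoding."""
    state: int                       # state reached at this moment
    weight: float                    # weight on the edge taken to reach `state`
    prev: Optional["Token"] = None   # token of the previous moment, if any


def advance(token: Token, next_state: int, edge_weight: float) -> Token:
    """Move a token along one directed edge while keeping its history."""
    return Token(state=next_state, weight=edge_weight, prev=token)


def backtrace(token: Optional[Token]) -> Tuple[List[int], List[float]]:
    """Recover the visited states and the edge weights in decoding order."""
    states: List[int] = []
    weights: List[float] = []
    while token is not None:
        states.append(token.state)
        if token.prev is not None:   # the initial token has no incoming edge
            weights.append(token.weight)
        token = token.prev
    return states[::-1], weights[::-1]


# Illustrative transitions: state 0 -> 1 -> 2 with invented edge weights.
start = Token(state=0, weight=0.0)
end = advance(advance(start, 1, 0.4), 2, 0.7)
states, weights = backtrace(end)
```

Linking each token to its predecessor is one simple way to keep the per-edge weights available after decoding, which the later edge-by-edge comparison relies on.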

2012. And the terminal generates at least one decoding path and a decoding score corresponding to each decoding path based on the at least one token.

When the voice information to be recognized is decoded, the terminal connects the state nodes from the initial state node to the end state node in search-time order, based on the states recorded in each token, to obtain at least one edge; the at least one edge then forms a decoding path. Each edge has a determined decoding order, which follows the time order of decoding and is consistent with the order in which the directed decoding path traverses its edges from the initial state node to the end state node: the first edge on the path is decoded first and has the first decoding order, and the last edge is decoded last and has the last decoding order. The terminal obtains the weight values on all edges in each decoding path, and then sums or multiplies together the weight values on all edges in each decoding path to obtain the decoding score corresponding to each decoding path.
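The two ways of combining edge weights into a decoding score can be sketched as follows. The function name `decoding_score` and its `mode` parameter are assumptions made for illustration; whether summation or multiplication applies depends on how the weights are defined (for example, log-domain versus probability-domain), which the text leaves open.

```python
import math
from typing import Sequence


def decoding_score(edge_weights: Sequence[float], mode: str = "sum") -> float:
    """Combine the weights on all edges of one decoding path into a score.

    mode="sum" accumulates the weights; mode="product" multiplies them.
    """
    if mode == "sum":
        return sum(edge_weights)
    if mode == "product":
        return math.prod(edge_weights)
    raise ValueError(f"unknown mode: {mode!r}")


# Edge weights listed in decoding order (invented values).
path_weights = [0.4, 0.7, 0.9]
summed = decoding_score(path_weights, mode="sum")
multiplied = decoding_score(path_weights, mode="product")
```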

202. And the terminal acquires a target decoding path with the highest decoding score from the at least one decoding path according to the decoding score corresponding to the at least one decoding path.

And based on the decoding score corresponding to at least one decoding path, the terminal acquires the decoding path with the highest decoding score from the at least one decoding path, and further takes the decoding path with the highest decoding score as a target decoding path.
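Selecting the target decoding path amounts to an arg-max over the decoding scores. The sketch below assumes each path is represented simply by its list of edge weights and that the scores are supplied in the same order as the paths; `select_target_path` is a hypothetical helper name.

```python
from typing import Sequence, Tuple


def select_target_path(
    paths: Sequence[Sequence[float]], scores: Sequence[float]
) -> Tuple[Sequence[float], float]:
    """Return the decoding path with the highest decoding score, and that score."""
    if not paths or len(paths) != len(scores):
        raise ValueError("need one score per decoding path and at least one path")
    best = max(range(len(paths)), key=lambda i: scores[i])
    return paths[best], scores[best]


# Two candidate paths with invented edge weights and summed scores.
candidate_paths = [[0.1, 0.2], [0.4, 0.7]]
candidate_scores = [0.3, 1.1]
target_path, target_score = select_target_path(candidate_paths, candidate_scores)
```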

203. The terminal selects at least one target edge from the target decoding path.

The target edges are the edges in the target decoding path that are used to calculate the path difference between the target decoding path and the wake-up path. The at least one target edge may include all edges in the target decoding path, or only the edges that have a specified decoding order in the target decoding path. In practice, the edges with the specified decoding order are edges that are prone to error during decoding: when the difference between such an error-prone edge and the corresponding target wake-up edge in the wake-up path is small, the path difference between the target decoding path and the wake-up path can be considered small, and the voice information to be recognized is determined to contain the wake-up word. The edges with the specified decoding order are usually the first preset number of edges in the directed target decoding path from the initial state to the end state, and may also be a second preset number of edges in the target decoding path, and so on. The first preset number may be 2, 3, and the like, and the second preset number may be 3, 5, and the like; the embodiments of the present disclosure do not specifically limit the first preset number or the second preset number.
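The two selection options (all edges, or the first preset number of edges) can be sketched as below. The helper `select_target_edges` and its `first_n` parameter are hypothetical; the sketch covers only the "first preset number" variant, since the text does not pin down where the second preset number of edges is taken from.

```python
from typing import List, Optional, Sequence, Tuple


def select_target_edges(
    path_weights: Sequence[float], first_n: Optional[int] = None
) -> List[Tuple[int, float]]:
    """Return (decoding_order, weight) pairs for the chosen target edges.

    decoding_order is 1-based: 1 is the first edge decoded.
    first_n=None selects every edge; otherwise only the first `first_n` edges.
    """
    indexed = list(enumerate(path_weights, start=1))
    if first_n is None:
        return indexed
    if first_n <= 0:
        raise ValueError("first_n must be positive")
    return indexed[:first_n]


# Invented edge weights of a target decoding path, in decoding order.
all_edges = select_target_edges([0.4, 0.7, 0.9])
first_two = select_target_edges([0.4, 0.7, 0.9], first_n=2)
```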

204. And the terminal acquires at least one target awakening edge according to the at least one target edge.

Based on the at least one target edge selected from the target decoding path and the decoding order of each target edge in the target decoding path, the terminal selects from the wake-up path the wake-up edges that have the same decoding order as the target edges, obtaining the at least one target wake-up edge. For example, if the at least one target edge includes all edges in the target decoding path, the at least one target wake-up edge also includes all wake-up edges on the wake-up path; for another example, if the at least one target edge includes the first three edges on the target decoding path, the at least one target wake-up edge also includes the first three wake-up edges on the wake-up path.
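Matching by decoding order can be sketched as a positional lookup into the wake-up path. The helper `match_wake_edges` is a hypothetical name, decoding orders are assumed to be 1-based, and raising an error when the wake-up path has no edge at a given position is a choice made for this sketch rather than behavior described in the disclosure.

```python
from typing import List, Sequence, Tuple


def match_wake_edges(
    target_edges: Sequence[Tuple[int, float]], wake_weights: Sequence[float]
) -> List[Tuple[float, float]]:
    """Pair each target edge with the wake-up edge of the same decoding order.

    target_edges holds (decoding_order, weight) pairs with 1-based order;
    wake_weights lists the wake-up path's edge weights in decoding order.
    """
    pairs: List[Tuple[float, float]] = []
    for order, target_weight in target_edges:
        if not 1 <= order <= len(wake_weights):
            raise ValueError(f"wake-up path has no edge with decoding order {order}")
        pairs.append((target_weight, wake_weights[order - 1]))
    return pairs


# First two target edges matched against a three-edge wake-up path (invented values).
pairs = match_wake_edges([(1, 0.4), (2, 0.7)], [0.5, 0.6, 0.8])
```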

205. And the terminal calculates the edge difference between each target edge and the corresponding target awakening edge according to the weight value on each target edge and the weight value on the corresponding target awakening edge.

Wherein the edge difference is used for representing the difference degree of the two edges. When the terminal calculates the edge difference between each target edge and the corresponding target wake-up edge according to the weight value on each target edge and the weight value on the corresponding target wake-up edge, the following modes can be adopted:

In a possible implementation, the terminal calculates the difference between the weight value on each target edge and the weight value on the corresponding target wake-up edge to obtain the edge difference between each target edge and the corresponding target wake-up edge. For example, if the target edge is the first edge on the target decoding path, the target wake-up edge is the first wake-up edge on the wake-up path, the weight value on the target edge is 0.4, and the weight value on the target wake-up edge is 0.5, then the edge difference between the target edge and the corresponding target wake-up edge is 0.4 - 0.5 = -0.1.
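The direct-difference variant can be sketched in a few lines. The function name `edge_differences` is hypothetical, and both inputs are assumed to list weights in the same decoding order.

```python
from typing import List, Sequence


def edge_differences(
    target_weights: Sequence[float], wake_weights: Sequence[float]
) -> List[float]:
    """Edge difference = weight on target edge minus weight on matching wake-up edge."""
    if len(target_weights) != len(wake_weights):
        raise ValueError("each target edge needs exactly one matching wake-up edge")
    return [t - w for t, w in zip(target_weights, wake_weights)]


# The single-edge example from the text: 0.4 on the target edge, 0.5 on the wake-up edge.
diffs = edge_differences([0.4], [0.5])
```

Because of floating-point rounding, the computed value is only approximately -0.1, so comparisons should use a tolerance.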

In another possible implementation, the terminal calculates an average weight value from the weight values on at least one wake-up edge in the wake-up path, and then calculates the difference between the weight value on each target edge and the average weight value to obtain the edge difference between each target edge and the corresponding target wake-up edge. Because the at least one target edge includes either all edges in the target decoding path or the edges with the specified decoding order in the target decoding path, and the at least one target wake-up edge correspondingly includes either all wake-up edges in the wake-up path or the wake-up edges with the specified decoding order in the wake-up path, the calculation of the average weight value includes, but is not limited to, the following two cases:

in the first case, when at least one target edge includes all edges in the target decoding path, the terminal calculates an average value of the weight values on all the awakening edges in the awakening path to obtain an average weight value.

In the second case, when at least one target edge includes an edge having a specified decoding order in the target decoding path, the terminal calculates an average value of the weight values on the wakeup edge having the specified decoding order in the wakeup path, to obtain an average weight value.
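Both cases of the average-based variant can be sketched with one helper that averages either all wake-up edge weights or only those with the specified decoding order. The names `average_weight` and `edge_differences_from_average` are hypothetical, and the specified decoding order is assumed here to mean the first `first_n` edges.

```python
from typing import List, Optional, Sequence


def average_weight(wake_weights: Sequence[float], first_n: Optional[int] = None) -> float:
    """Average the wake-up edge weights: all of them, or only the first `first_n`."""
    selected = list(wake_weights) if first_n is None else list(wake_weights[:first_n])
    if not selected:
        raise ValueError("no wake-up edges selected")
    return sum(selected) / len(selected)


def edge_differences_from_average(
    target_weights: Sequence[float], avg: float
) -> List[float]:
    """Edge difference = weight on target edge minus the average wake-up weight."""
    return [t - avg for t in target_weights]


# Invented wake-up path weights, in decoding order.
wake = [0.5, 0.6, 0.7]
avg_all = average_weight(wake)                    # first case: all wake-up edges
avg_first_two = average_weight(wake, first_n=2)   # second case: specified order
diffs = edge_differences_from_average([0.4, 0.8], avg_first_two)
```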

206. And the terminal calculates the path difference between the target decoding path and the awakening path based on the edge difference between each target edge and the corresponding target awakening edge.

And the terminal performs root mean square calculation on the edge difference between each target edge and the corresponding target awakening edge to obtain a root mean square calculation result, and further determines the root mean square calculation result as the path difference between the target decoding path and the awakening path. For the edge difference between each target edge and the corresponding target wake-up edge calculated in the above two manners, the root-mean-square calculation process can be represented by the following two formulas:

the formula I is as follows:

wherein maxk represents a weight value on the k-th entry label edge in the target decoding path, roadk represents a weight value on the k-th entry label wake-up edge in the wake-up path, and n represents the number of target edges included in the target decoding path and is also the number of target wake-up edges included in the wake-up path.

Formula II:

$$\text{path difference} = \sqrt{\frac{1}{n}\sum_{k=1}^{n}\left(x_k - \bar{x}\right)^2}$$

where x_k represents the weight value on the k-th target edge in the target decoding path, $\bar{x}$ represents the average weight value, and n represents the number of target edges included in the target decoding path, which is also the number of target wake-up edges included in the wake-up path.
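The two root-mean-square calculations described for Formula I and Formula II can be sketched in Python as below. The function names and the use of plain lists of edge weights are illustrative assumptions; only the root-mean-square step itself comes from the description.

```python
import math
from typing import Sequence


def root_mean_square(edge_differences: Sequence[float]) -> float:
    """Square each edge difference, average the squares, and take the square root."""
    if not edge_differences:
        raise ValueError("at least one edge difference is required")
    return math.sqrt(sum(d * d for d in edge_differences) / len(edge_differences))


def path_difference_formula_one(target_weights: Sequence[float],
                                wake_weights: Sequence[float]) -> float:
    """Formula I: each target edge is compared with the wake-up edge of the same order."""
    if len(target_weights) != len(wake_weights):
        raise ValueError("both paths must contribute the same number of edges")
    return root_mean_square([t - w for t, w in zip(target_weights, wake_weights)])


def path_difference_formula_two(target_weights: Sequence[float],
                                average_weight: float) -> float:
    """Formula II: each target edge is compared with the average wake-up weight."""
    return root_mean_square([t - average_weight for t in target_weights])
```

In both variants, a target decoding path whose edge weights match the wake-up path closely yields a value near zero, which is what the later threshold comparison relies on.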

207. When the path difference satisfies the threshold condition, the terminal determines that the voice information to be recognized contains the wake-up word.

The threshold condition may be, for example, that the path difference is less than a preset threshold. That is, when the path difference between the target decoding path and the wake-up path is less than the preset threshold, the terminal determines that the voice information to be recognized contains the wake-up word. In response to the voice information to be recognized containing the wake-up word, the wake-up system wakes up the other systems in the terminal, and the terminal then enters a normal working state. In the normal working state, when the terminal recognizes that collected voice information contains an instruction word, the terminal performs the operation corresponding to that instruction word.
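A minimal sketch of this decision step follows, assuming the "less than a preset threshold" form of the condition. The `TerminalState` enum and the function names are hypothetical and serve only to show the transition from standby to the normal working state.

```python
from enum import Enum


class TerminalState(Enum):
    STANDBY = "standby"   # only the wake-up system is listening
    WORKING = "working"   # other systems have been woken up


def meets_threshold_condition(path_difference: float, preset_threshold: float) -> bool:
    """Assumed condition: the path difference is strictly below the preset threshold."""
    return path_difference < preset_threshold


def next_state(state: TerminalState,
               path_difference: float,
               preset_threshold: float) -> TerminalState:
    """Wake the terminal only when it is in standby and the wake-up word is detected."""
    if state is TerminalState.STANDBY and meets_threshold_condition(
            path_difference, preset_threshold):
        return TerminalState.WORKING
    return state
```

The comparison is strict here, so a path difference exactly equal to the threshold does not wake the terminal; whether the boundary should be inclusive is a design choice the description leaves open.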

Referring to fig. 3, an embodiment of the present disclosure provides a wake word recognition apparatus, including:

an obtaining module 301, configured to obtain at least one decoding path of the voice information to be recognized and a decoding score corresponding to each decoding path;

the obtaining module 301 being further configured to obtain, according to the decoding score corresponding to each decoding path, the target decoding path with the highest decoding score from the at least one decoding path;

the obtaining module 301 being further configured to obtain at least one target edge from the target decoding path, where the at least one target edge includes all edges in the target decoding path or the edges having a specified decoding order in the target decoding path;

a calculating module 302, configured to calculate an edge difference between each target edge and a corresponding target wake-up edge according to the weight value on each target edge and the weight value on the corresponding target wake-up edge, where the target wake-up edge is a wake-up edge in the wake-up path of the wake-up word that has the same decoding order as the target edge;

the calculating module 302 being further configured to calculate a path difference between the target decoding path and the wake-up path based on the edge difference between each target edge and the corresponding target wake-up edge; and

a determining module 303, configured to determine that the voice information to be recognized contains the wake-up word when the path difference satisfies the threshold condition.
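The division of work among the three modules can be sketched as three small Python functions chained together. This is an illustration under stated assumptions: the `DecodingPath` data class, the function names, the pairwise edge difference, and the "less than threshold" condition are choices made for the example, and all edges of the target path are used as target edges.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class DecodingPath:
    edge_weights: List[float]  # weight value on each edge, in decoding order
    score: float               # decoding score of the whole path


def obtain_target_path(paths: Sequence[DecodingPath]) -> DecodingPath:
    """Obtaining module 301: pick the decoding path with the highest decoding score."""
    return max(paths, key=lambda p: p.score)


def calculate_path_difference(target_edges: Sequence[float],
                              wake_edges: Sequence[float]) -> float:
    """Calculating module 302: per-edge differences, then their root mean square."""
    if not target_edges or len(target_edges) != len(wake_edges):
        raise ValueError("paths must be non-empty and have the same number of edges")
    diffs = [t - w for t, w in zip(target_edges, wake_edges)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


def determine_wake_word(path_difference: float, threshold: float) -> bool:
    """Determining module 303: assumed condition is 'less than the preset threshold'."""
    return path_difference < threshold


def recognize(paths: Sequence[DecodingPath],
              wake_edges: Sequence[float],
              threshold: float) -> bool:
    target = obtain_target_path(paths)
    difference = calculate_path_difference(target.edge_weights, wake_edges)
    return determine_wake_word(difference, threshold)
```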

In another embodiment of the present disclosure, the obtaining module 301 is configured to decode voice information to be recognized based on a decoding network corresponding to the awakening word recognition model to obtain at least one token, where the token is used to record a state of each time and a weight value on each edge experienced in a decoding process; and generating at least one decoding path and a decoding score corresponding to each decoding path based on the at least one token.
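A token of the kind described here can be sketched as a small record that grows by one state and one edge weight per decoding step. The `Token` class, its `extend` method, and in particular the choice to compute the decoding score as the sum of the edge weights are assumptions for illustration; the description does not state how the score is derived from a token.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple


@dataclass
class Token:
    states: List[int] = field(default_factory=list)          # state at each time step
    edge_weights: List[float] = field(default_factory=list)  # weight on each edge traversed

    def extend(self, next_state: int, weight: float) -> "Token":
        """Return a new token that has moved along one more edge."""
        return Token(self.states + [next_state], self.edge_weights + [weight])


def paths_from_tokens(tokens: Sequence[Token]) -> List[Tuple[List[int], float]]:
    """Turn each token into (state sequence, decoding score).

    Assumption: the decoding score is the sum of the edge weights recorded by the token.
    """
    return [(t.states, sum(t.edge_weights)) for t in tokens]
```

Because each token keeps the individual edge weights rather than only a running total, the per-edge comparison with the wake-up path remains possible after decoding finishes.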

In another embodiment of the present disclosure, the calculating module 302 is configured to calculate a difference between a weight value on each target edge and a weight value on a corresponding target wake-up edge, so as to obtain an edge difference between each target edge and the corresponding target wake-up edge.

In another embodiment of the present disclosure, the calculating module 302 is configured to calculate an average weight value according to a weight value on at least one wake-up edge in the wake-up path; and calculating the difference between the weight value of each target edge and the average weight value to obtain the edge difference between each target edge and the corresponding target awakening edge.

In another embodiment of the present disclosure, the calculating module 302 is configured to calculate an average value of the weight values on all wake-up edges in the wake-up path, so as to obtain the average weight value; alternatively,

the calculating module 302 is configured to calculate an average value of the weight values on the wake-up edges having the specified decoding order in the wake-up path, so as to obtain the average weight value.

In another embodiment of the present disclosure, the calculating module 302 is configured to perform root-mean-square calculation based on an edge difference between each target edge and a corresponding target wake-up edge, so as to obtain a root-mean-square calculation result; and determining the root mean square calculation result as a path difference degree between the target decoding path and the wake-up path.

To sum up, the apparatus provided in the embodiment of the present disclosure calculates the edge difference between each target edge and the corresponding target wake-up edge according to the weight value on each target edge in the target decoding path and the weight value on the corresponding target wake-up edge in the wake-up path, and then calculates the path difference between the target decoding path and the wake-up path, so that the voice information to be recognized is determined to contain the wake-up word when the path difference satisfies the threshold condition. Instead of directly using the difference between the highest decoding score, which corresponds to the target decoding path, and the wake-up score corresponding to the wake-up path as the basis for judging whether the voice information to be recognized contains the wake-up word, this approach considers the difference at each step of the decoding process: the per-step differences are calculated and accumulated into the difference between the two paths, and the path difference is then used as the basis for wake-up word recognition. This improves the accuracy of the wake-up word recognition result and reduces the number of false wake-ups.
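The contrast between a score-based comparison and the per-step comparison can be illustrated with a small numerical example. The sketch below assumes, purely for illustration, that a path's score is the sum of its edge weights; the function names and the weight values are made up.

```python
import math
from typing import Sequence


def score_gap(target_weights: Sequence[float], wake_weights: Sequence[float]) -> float:
    """Difference of total scores (assumed here to be sums of edge weights)."""
    return abs(sum(target_weights) - sum(wake_weights))


def rms_path_difference(target_weights: Sequence[float],
                        wake_weights: Sequence[float]) -> float:
    """Root mean square of the per-edge differences between the two paths."""
    diffs = [t - w for t, w in zip(target_weights, wake_weights)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


if __name__ == "__main__":
    target = [1.0, 5.0, 3.0]  # hypothetical target decoding path
    wake = [3.0, 3.0, 3.0]    # hypothetical wake-up path
    print(score_gap(target, wake))            # totals are equal, so the gap is 0.0
    print(rms_path_difference(target, wake))  # per-step comparison still shows a mismatch
```

In this made-up case the two totals coincide even though the first two edges differ by 2.0 each, so a decision based only on total scores would treat the paths as identical, whereas the per-step measure is about 1.63.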

Fig. 4 shows a block diagram of a terminal 400 according to an exemplary embodiment of the present disclosure. The terminal 400 may be: a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a notebook computer, or a desktop computer. The terminal 400 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, or desktop terminal.

Generally, the terminal 400 includes: a processor 401 and a memory 402.

Processor 401 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 401 may be implemented in at least one hardware form of a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 401 may also include a main processor and a coprocessor, where the main processor, also called a CPU (Central Processing Unit), processes data in the awake state, and the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 401 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content that the display screen needs to display. In some embodiments, the processor 401 may further include an AI (Artificial Intelligence) processor for handling computing operations related to machine learning.

Memory 402 may include one or more computer-readable storage media, which may be non-transitory. Memory 402 may also include high speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 402 is used to store at least one instruction for execution by processor 401 to implement a wake word recognition method provided by method embodiments herein.

In some embodiments, the terminal 400 may further optionally include: a peripheral interface 403 and at least one peripheral. The processor 401, memory 402 and peripheral interface 403 may be connected by bus or signal lines. Each peripheral may be connected to the peripheral interface 403 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 404, a display screen 405, a camera assembly 406, an audio circuit 407, a positioning assembly 408, and a power supply 409.

The peripheral interface 403 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 401 and the memory 402. In some embodiments, processor 401, memory 402, and peripheral interface 403 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 401, the memory 402 and the peripheral interface 403 may be implemented on a separate chip or circuit board, which is not limited by this embodiment.

The radio frequency circuit 404 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 404 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 404 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 404 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 404 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, mobile communication networks of various generations (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 404 may further include NFC (Near Field Communication) related circuits, which is not limited in this application.

The display screen 405 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 405 is a touch display screen, the display screen 405 also has the ability to capture touch signals on or over its surface. The touch signal may be input to the processor 401 as a control signal for processing. In this case, the display screen 405 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 405, disposed on the front panel of the terminal 400; in other embodiments, there may be at least two display screens 405, respectively disposed on different surfaces of the terminal 400 or in a folded design; in still other embodiments, the display screen 405 may be a flexible display disposed on a curved surface or a folded surface of the terminal 400. Furthermore, the display screen 405 may be arranged in a non-rectangular irregular shape, that is, a specially shaped screen. The display screen 405 may be made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).

The camera assembly 406 is used to capture images or video. Optionally, the camera assembly 406 includes a front camera and a rear camera. Generally, the front camera is disposed on the front panel of the terminal, and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fused shooting functions. In some embodiments, the camera assembly 406 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash, and can be used for light compensation at different color temperatures.

The audio circuit 407 may include a microphone and a speaker. The microphone is used for collecting sound waves from the user and the environment, converting the sound waves into electrical signals, and inputting the electrical signals to the processor 401 for processing, or inputting the electrical signals to the radio frequency circuit 404 to realize voice communication. For the purpose of stereo sound collection or noise reduction, a plurality of microphones may be provided at different portions of the terminal 400. The microphone may also be an array microphone or an omnidirectional pickup microphone. The speaker is used to convert electrical signals from the processor 401 or the radio frequency circuit 404 into sound waves. The speaker may be a conventional diaphragm speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, it can convert an electrical signal not only into a sound wave audible to humans, but also into a sound wave inaudible to humans for purposes such as distance measurement. In some embodiments, the audio circuit 407 may also include a headphone jack.

The positioning component 408 is used to locate the current geographic position of the terminal 400 for navigation or LBS (Location Based Service). The positioning component 408 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.

The power supply 409 is used to supply power to the various components in the terminal 400. The power source 409 may be alternating current, direct current, disposable or rechargeable. When power source 409 comprises a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.

In some embodiments, the terminal 400 also includes one or more sensors 410. The one or more sensors 410 include, but are not limited to: acceleration sensor 411, gyro sensor 412, pressure sensor 413, fingerprint sensor 414, optical sensor 415, and proximity sensor 416.

The acceleration sensor 411 may detect the magnitude of acceleration in three coordinate axes of the coordinate system established with the terminal 400. For example, the acceleration sensor 411 may be used to detect components of the gravitational acceleration in three coordinate axes. The processor 401 may control the display screen 405 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 411. The acceleration sensor 411 may also be used for acquisition of motion data of a game or a user.

The gyro sensor 412 may detect a body direction and a rotation angle of the terminal 400, and the gyro sensor 412 may cooperate with the acceleration sensor 411 to acquire a 3D motion of the terminal 400 by the user. From the data collected by the gyro sensor 412, the processor 401 may implement the following functions: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.

The pressure sensor 413 may be disposed on a side bezel of the terminal 400 and/or on a lower layer of the display screen 405. When the pressure sensor 413 is disposed on the side frame of the terminal 400, a user's holding signal to the terminal 400 can be detected, and the processor 401 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 413. When the pressure sensor 413 is disposed at the lower layer of the display screen 405, the processor 401 controls the operability control on the UI interface according to the pressure operation of the user on the display screen 405. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.

The fingerprint sensor 414 is used for collecting a fingerprint of the user, and the processor 401 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 414, or the fingerprint sensor 414 identifies the identity of the user according to the collected fingerprint. Upon recognizing that the user's identity is a trusted identity, processor 401 authorizes the user to perform relevant sensitive operations including unlocking the screen, viewing encrypted information, downloading software, paying, and changing settings, etc. The fingerprint sensor 414 may be disposed on the front, back, or side of the terminal 400. When a physical key or vendor Logo is provided on the terminal 400, the fingerprint sensor 414 may be integrated with the physical key or vendor Logo.

The optical sensor 415 is used to collect the ambient light intensity. In one embodiment, processor 401 may control the display brightness of display screen 405 based on the ambient light intensity collected by optical sensor 415. Specifically, when the ambient light intensity is high, the display brightness of the display screen 405 is increased; when the ambient light intensity is low, the display brightness of the display screen 405 is reduced. In another embodiment, the processor 401 may also dynamically adjust the shooting parameters of the camera assembly 406 according to the ambient light intensity collected by the optical sensor 415.

The proximity sensor 416, also known as a distance sensor, is typically disposed on the front panel of the terminal 400. The proximity sensor 416 is used to collect the distance between the user and the front surface of the terminal 400. In one embodiment, when the proximity sensor 416 detects that the distance between the user and the front surface of the terminal 400 gradually decreases, the processor 401 controls the display screen 405 to switch from the bright screen state to the dark screen state; when the proximity sensor 416 detects that the distance between the user and the front surface of the terminal 400 gradually increases, the processor 401 controls the display screen 405 to switch from the dark screen state to the bright screen state.

Those skilled in the art will appreciate that the configuration shown in fig. 4 is not intended to be limiting of terminal 400 and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be used.

The terminal provided by the embodiment of the present disclosure calculates the edge difference between each target edge and the corresponding target wake-up edge according to the weight value on each target edge in the target decoding path and the weight value on the corresponding target wake-up edge in the wake-up path, and then calculates the path difference between the target decoding path and the wake-up path, so that the voice information to be recognized is determined to contain the wake-up word when the path difference satisfies the threshold condition. Instead of directly using the difference between the highest decoding score, which corresponds to the target decoding path, and the wake-up score corresponding to the wake-up path as the basis for judging whether the voice information to be recognized contains the wake-up word, this approach considers the difference at each step of the decoding process: the per-step differences are calculated and accumulated into the difference between the two paths, and the path difference is then used as the basis for wake-up word recognition. This improves the accuracy of the wake-up word recognition result and reduces the number of false wake-ups.

The disclosed embodiments provide a computer-readable storage medium having at least one program code stored therein, the at least one program code being loaded and executed by a processor to implement a wake-up word recognition method. The computer readable storage medium may be non-transitory. For example, the computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

With the computer-readable storage medium provided by the embodiment of the present disclosure, the edge difference between each target edge and the corresponding target wake-up edge is calculated according to the weight value on each target edge in the target decoding path and the weight value on the corresponding target wake-up edge in the wake-up path, and the path difference between the target decoding path and the wake-up path is then calculated, so that the voice information to be recognized is determined to contain the wake-up word when the path difference satisfies the threshold condition. Instead of directly using the difference between the highest decoding score, which corresponds to the target decoding path, and the wake-up score corresponding to the wake-up path as the basis for judging whether the voice information to be recognized contains the wake-up word, this approach considers the difference at each step of the decoding process: the per-step differences are calculated and accumulated into the difference between the two paths, and the path difference is then used as the basis for wake-up word recognition. This improves the accuracy of the wake-up word recognition result and reduces the number of false wake-ups.

An embodiment of the present disclosure provides a computer program product including computer program code stored in a computer-readable storage medium, a processor of a terminal reading the computer program code from the computer-readable storage medium, the processor executing the computer program code to cause the terminal to execute a wake-up word recognition method.

With the computer program product provided by the embodiment of the present disclosure, the edge difference between each target edge and the corresponding target wake-up edge is calculated according to the weight value on each target edge in the target decoding path and the weight value on the corresponding target wake-up edge in the wake-up path, and the path difference between the target decoding path and the wake-up path is then calculated, so that the voice information to be recognized is determined to contain the wake-up word when the path difference satisfies the threshold condition. Instead of directly using the difference between the highest decoding score, which corresponds to the target decoding path, and the wake-up score corresponding to the wake-up path as the basis for judging whether the voice information to be recognized contains the wake-up word, this approach considers the difference at each step of the decoding process: the per-step differences are calculated and accumulated into the difference between the two paths, and the path difference is then used as the basis for wake-up word recognition. This improves the accuracy of the wake-up word recognition result and reduces the number of false wake-ups.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The above description is intended to be exemplary only and not to limit the present disclosure; any modification, equivalent replacement, or improvement made within the spirit and principles of the present disclosure shall fall within the scope of protection of the present disclosure.

Claims

1. A method of wake word recognition, the method comprising:

2. The method according to claim 1, wherein the obtaining at least one decoding path of the speech information to be recognized and a decoding score corresponding to each decoding path comprises:

3. The method according to claim 1, wherein calculating the edge difference between each target edge and the corresponding target wake-up edge according to the weight value on each target edge and the weight value on the corresponding target wake-up edge comprises:

4. The method according to claim 1, wherein calculating the edge difference between each target edge and the corresponding target wake-up edge according to the weight value on each target edge and the weight value on the corresponding target wake-up edge comprises:

5. The method according to claim 4, wherein the calculating an average weight value according to the weight value on at least one wake-up edge in the wake-up path comprises:

6. The method of claim 1, wherein calculating the path difference between the target decoding path and the wake-up path based on the edge difference between each target edge and the corresponding target wake-up edge comprises:

7. An apparatus for wake word recognition, the apparatus comprising:

8. A terminal, characterized in that the terminal comprises a processor and a memory, wherein at least one program code is stored in the memory, and the at least one program code is loaded and executed by the processor to implement the wake word recognition method according to any one of claims 1 to 6.

9. A computer-readable storage medium, having stored therein at least one program code, which is loaded and executed by a processor, to implement the wake word recognition method according to any one of claims 1 to 6.

10. A computer program product, characterized in that the computer program product comprises computer program code, which is stored in a computer-readable storage medium, from which a processor of a terminal reads the computer program code, the processor executing the computer program code, causing the terminal to perform the wake-up word recognition method according to any of claims 1 to 6.