CN111128172B - Voice recognition method, electronic equipment and storage medium

Info

Publication number: CN111128172B (granted publication of application CN111128172A)
Application number: CN201911414565.3A
Inventor: 吴占伟
Original assignee: Cloudminds (Shanghai) Robotics Co., Ltd.
Current assignee: Cloudminds Robotics Co., Ltd.
Legal status: Active

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/28Constructional details of speech recognition systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025Phonemes, fenemes or fenones being the recognition units
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223Execution procedure of a spoken command

Abstract

Embodiments of the present invention relate to the field of computer technology and disclose a speech recognition method, an electronic device, and a storage medium. Some embodiments of the present application provide a speech recognition method including: determining the probability of each phoneme corresponding to a speech frame according to the acoustic features of the speech frame; and inputting the probabilities of the phonemes corresponding to the speech frame into a path tree model. The path tree model determines the command word formed by the speech frames according to the probabilities of the phonemes corresponding to the speech frames, a pre-stored path tree, and a preset path search rule. The path tree is generated from the command words; each node of the path tree stores information of the word corresponding to that node, and the information of the word indicates the composition of the word. The command word is then determined according to the output of the path tree model. With the speech recognition method of these embodiments, command words in speech can be recognized accurately and quickly.

Description

Voice recognition method, electronic equipment and storage medium
Technical Field
Embodiments of the present invention relate to the field of computer technology, and in particular to a speech recognition method, an electronic device, and a storage medium.
Background
With social development and technological progress, more and more intelligent devices are appearing in people's lives, and as speech recognition technology develops, voice interaction is gradually changing people's habits. Intelligent devices support a variety of functions, and command-word speech recognition is needed to launch these functions or services through fast voice interaction.
However, the inventors found at least the following problem in the related art: command words are currently recognized by combining an acoustic model and a language model, but the accuracy of this recognition approach is limited.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
An object of embodiments of the present invention is to provide a speech recognition method, an electronic device, and a storage medium that enable command words in speech to be recognized accurately and quickly.
To solve the above technical problem, an embodiment of the present invention provides a speech recognition method including: determining the probability of each phoneme corresponding to a speech frame according to the acoustic features of the speech frame; inputting the probabilities of the phonemes corresponding to the speech frame into a path tree model, where the path tree model determines the command word formed by the speech frames according to those probabilities, a pre-stored path tree, and a preset path search rule, the path tree is generated from the command words, and each node of the path tree stores information of the word corresponding to that node, the information indicating the composition of the word; and determining the command word according to the output of the path tree model.
An embodiment of the present invention also provides an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor. The memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the above speech recognition method.
An embodiment of the present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above speech recognition method.
Compared with the prior art, embodiments of the present invention input the probabilities of the phonemes corresponding to a speech frame into a path tree model and determine the command word according to the output of the path tree model. The path tree model determines the command word formed by the speech frames according to those probabilities, a pre-stored path tree, and a preset path search rule. Because the path tree can store a large number of command words, the number of command words the path tree model can recognize is increased, and the accuracy with which the path tree model determines command words is improved. Because the path tree model uses a preset path search rule instead of extensive training, the complexity of constructing the model is reduced and its range of application is expanded. Moreover, since the path tree model determines the command word from the probability of each phoneme of the input speech frames, each word in the command word is identified from phoneme probabilities, which improves the accuracy of each determined word and therefore the accuracy of the command word the words form.
In addition, the preset path search rule includes: determining the probability that a speech frame belongs to an active node of the path tree according to the probabilities of the phonemes corresponding to the speech frame and the information of the word corresponding to the active node, where the nodes of the path tree include active nodes and inactive nodes; and determining the command word formed by the speech frames according to the probability that the speech frame belongs to an active node and the position information of each node of the path tree. Because no training is needed, the command word is found simply by computing the probability that each speech frame belongs to an active node and following the positions of the nodes, so the search rule is simple.
In addition, determining the command word formed by the speech frames according to the probability that a speech frame belongs to an active node and the position information of each node includes: extracting one active node from the unextracted active nodes in the active node list of the path tree and performing the following operations on it. If the probability that the speech frame belongs to the active node is less than a first threshold and the active node is a leaf node, trace back from the active node to the root node and determine the command word formed by the speech frames. If the probability is greater than or equal to the first threshold, or the probability is less than the first threshold but the active node is not a leaf node, determine whether to activate the child nodes of the active node according to the probability that the speech frame belongs to each child node. Then judge whether unextracted active nodes remain in the active node list; if so, return to extracting one active node, until no unextracted active node remains or the command word formed by the speech frames has been determined; if none remain, update the active node list and process the next speech frame. Searching the path tree according to the probability that the speech frame belongs to each active node improves the accuracy of each determined active node and hence the accuracy of the command word.
In addition, determining whether to activate a child node of the active node according to the probability that the speech frame belongs to that child node specifically includes: for each child node of the active node, judging whether the probability that the speech frame belongs to the child node is greater than or equal to a second threshold, and if so, activating it. Activating a child node only when this probability reaches the second threshold avoids erroneous activation of child nodes, reduces unnecessary search steps, and speeds up the determination.
In addition, before determining whether to activate a child node of the active node, the speech recognition method further includes: determining the skip information of the active node according to the probabilities of the phonemes corresponding to the speech frame and the pre-stored information of the word corresponding to the active node, where the skip information indicates which phoneme of the active node's word the speech frame corresponds to; and determining that this phoneme is the ending phoneme of the active node's word. If a child node were activated before the ending phoneme is reached, the speech frame would very likely not belong to that child node, and search resources would be wasted.
In addition, after determining that the probability that the speech frame belongs to the active node is greater than or equal to the first threshold, the method further includes: adding one to a first statistical value of the active node, where the first statistical value indicates the number of speech frames belonging to the active node. Before determining whether to activate a child node of the active node, the method further includes: determining that the first statistical value is greater than or equal to a third threshold. Checking the first statistical value against the third threshold before activating child nodes means that, when the speech frame does not belong to a child node, no subsequent operation is performed, which reduces useless searching and further improves search efficiency.
In addition, after determining the skip information of the active node, the method further includes: determining the number of phonemes lost by the speech frames. Before tracing back from the active node to the root node to determine the command word, the method further includes: determining that the total number of lost phonemes is less than a fourth threshold. If the total number of lost phonemes exceeds the fourth threshold, the current search contains an error; this comparison therefore further improves the accuracy of the determined command word.
In addition, after determining that the probability that the speech frame belongs to the active node is less than the first threshold, the method further includes: turning off the active node. Because a leaf node has no nodes below it, the path search is complete when the leaf node is reached, and turning off the active node reduces the difficulty and redundancy of subsequent path searches.
In addition, after determining the skip information of the active node, the method further includes: judging from the skip information whether the phoneme of the active node's word has skipped erroneously, and if so, adding one to a second statistical value, where the second statistical value indicates the number of speech frames not belonging to the active node. After determining that the probability that the speech frame belongs to the active node is less than the first threshold, the method likewise adds one to the second statistical value. Before turning off the active node, the method further includes: determining that the second statistical value is greater than a fifth threshold. The fifth threshold prevents the active node from being turned off erroneously.
In addition, the composition information of a word includes the phonemes the word contains. Determining the probability that the speech frame belongs to an active node according to the phoneme probabilities and the information of the active node's word specifically includes: determining, from the information of the word, the probability of each phoneme the word contains, and calculating from those probabilities the probability that the speech frame belongs to the active node. Because words are composed of phonemes, determining this probability at the smaller unit of phonemes improves the accuracy of each determined word and therefore of the command word.
In addition, the composition information of a word further includes a confusable phoneme for each phoneme the word contains. Before calculating the probability that the speech frame belongs to the active node, the method further includes: determining the probability of the confusable phoneme of each contained phoneme; the probability that the speech frame belongs to the active node is then calculated from the probability of each contained phoneme together with the probability of its confusable phoneme. Taking confusable phonemes into account improves the accuracy of the determined command word under strong noise or inaccurate pronunciation.
In addition, this calculation specifically includes: for each phoneme of the active node's word, adding the probability of the phoneme to the product of the probability of its confusable phoneme and the weight of that confusable phoneme, to obtain the corrected probability of the phoneme; and taking the sum of the corrected probabilities of all phonemes of the word as the probability that the speech frame belongs to the active node.
Drawings
One or more embodiments are illustrated by way of example in the accompanying drawings, in which like reference numerals denote similar elements; the figures are not to scale unless otherwise specified.
Fig. 1 is a detailed flowchart of a speech recognition method according to a first embodiment of the present invention;
Fig. 2 is a diagram of a path tree in the speech recognition method according to the first embodiment of the present invention;
Fig. 3 is a diagram illustrating the sub-steps of a preset path search rule in the speech recognition method according to the first embodiment of the present invention;
Fig. 4 is a schematic diagram of an implementation of determining a command word formed by speech frames in the speech recognition method according to the first embodiment of the present invention;
Fig. 5 is a schematic diagram of an implementation of determining a command word formed by speech frames in a speech recognition method according to a second embodiment of the present invention;
Fig. 6 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the embodiments are described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will appreciate that numerous technical details are set forth in the embodiments to provide a better understanding of the present application; however, the technical solution claimed in the present application can be implemented without these technical details, and with various changes and modifications based on the following embodiments.
Unless the context clearly requires otherwise, throughout the description and the claims, the words "comprise", "comprising", and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is, what is meant is "including, but not limited to".
In the description of the present disclosure, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present disclosure, "a plurality" means two or more unless otherwise specified.
The inventors found that, in related speech recognition technology, an accurate command word cannot be obtained under strong noise or inaccurate pronunciation; the accuracy of current speech recognition technology is therefore limited, which affects voice interaction between users and devices.
A first embodiment of the present invention relates to a speech recognition method. The method can be applied to an electronic device such as a robot, a speaker, a smart-home device, or a server. The specific flow of the speech recognition method is shown in Fig. 1.
Step 101: determine the probability of each phoneme corresponding to a speech frame according to the acoustic features of the speech frame.
Specifically, real-time audio signals can be collected by a sound collection device such as a microphone. After the collected audio signal is framed and windowed, acoustic features are extracted; these may be FBANK features (for example, 40-dimensional FBANK features) or MFCC features (for example, 12-dimensional MFCC features). The acoustic features of each speech frame are input into an acoustic model, which outputs the probability of each phoneme corresponding to the speech frame.
The acoustic model takes the acoustic features of a speech frame as input and outputs the probability of each phoneme corresponding to the frame. It may use a deep neural network as its architecture, trained on all phonemes of a language system, for example the phonemes of all pinyin in Chinese, namely the initial and final phonemes of the pinyin. The specific training process of the acoustic model is not described in detail here.
It is worth mentioning that, because the input of the acoustic model is the acoustic features of a speech frame and the output is the probability of each phoneme, the model is not trained on fixed command words. If the command words are replaced, trained command-word speech data does not need to be collected again, which effectively reduces the development cost of replacing command words.
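As a concrete illustration of this front end, the sketch below frames and windows a waveform, computes a rough stand-in for FBANK features, and converts acoustic-model logits into per-frame phoneme probabilities. It is a minimal sketch, not the patent's implementation: the phoneme list is truncated, the filterbank is approximated by simple band pooling, and `model` is a placeholder callable.

```python
import numpy as np

# Hypothetical phoneme inventory (truncated); the patent assumes the full set
# of Chinese pinyin initials and finals plus silence.
PHONEMES = ["sil", "n", "i", "w", "o", "l", "h", "f", "u"]

def frame_signal(signal, sr=16000, win_len=0.025, hop=0.010):
    """Split the waveform into overlapping, Hamming-windowed frames."""
    n, step = int(sr * win_len), int(sr * hop)
    signal = np.pad(signal, (0, max(0, n - len(signal))))
    window = np.hamming(n)
    starts = range(0, len(signal) - n + 1, step)
    return np.stack([signal[s:s + n] * window for s in starts])

def fbank_features(frames, n_filt=40):
    """Crude stand-in for 40-dimensional FBANK features: log band energies.
    A real front end would apply a mel filterbank to the power spectrum."""
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    bands = np.array_split(power, n_filt, axis=1)
    return np.log(np.stack([b.sum(axis=1) for b in bands], axis=1) + 1e-10)

def phoneme_posteriors(features, model):
    """Step 101: per-frame probability of each phoneme, via a softmax over
    the acoustic model's output logits (`model` is a placeholder callable)."""
    logits = model(features)                      # shape: (frames, len(PHONEMES))
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```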
Step 102: input the probabilities of the phonemes corresponding to the speech frame into a path tree model.
The path tree model determines the command word formed by the speech frames according to the probabilities of the phonemes corresponding to the speech frames, a pre-stored path tree, and a preset path search rule. The path tree is generated from the command words; each node of the path tree stores information of the word corresponding to that node, and the information of the word indicates the composition of the word.
Specifically, the path tree is generated from the command words. Since every speech frame contains silence and the silence state can jump to any phoneme, the root node of the path tree can be set as a silence modeling unit, and the tree is built layer by layer in the order of the words contained in each command word. For example, if the command words include "bluetooth on", "desk lamp off", "city query", and "weather query", the root node is a silence modeling unit and building word by word yields the path tree shown in Fig. 2. It can be understood that the root node is always in the activated state and can jump to its child nodes.
It is worth mentioning that, when constructing the path tree, command words with the same prefix store the prefix information in the same parent nodes, which effectively reduces memory occupancy and search complexity during decoding.
Each node holds information of its corresponding word, which may include the composition information of the word, and may further include the identification information of the word, its decoding status, and its activation information.
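One way such a path tree could be represented and built is sketched below, continuing in Python. The `Node` fields, the `lexicon` mapping from words to their phonemes, and the English stand-ins for the Chinese words of Fig. 2 are assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One path tree node: the word it represents, the word's composition
    (phonemes), activation information, and the decoding statistics used later."""
    word: str
    phonemes: list
    parent: "Node" = None
    children: dict = field(default_factory=dict)
    active: bool = False
    first_count: int = 0    # first statistical value: frames judged to belong here
    second_count: int = 0   # second statistical value: frames judged not to belong

def build_path_tree(command_words, lexicon):
    """Build the tree layer by layer in word order; command words sharing a
    prefix reuse the same parent nodes. The root is a silence unit, always active."""
    root = Node(word="<sil>", phonemes=["sil"], active=True)
    for command in command_words:
        node = root
        for word in command:
            if word not in node.children:
                node.children[word] = Node(word=word, phonemes=lexicon[word], parent=node)
            node = node.children[word]
    return root

# Hypothetical stand-in for the "bluetooth on" branch of Fig. 2:
lexicon = {"on": ["k", "ai"], "blue": ["l", "an"], "tooth": ["y", "a"]}
tree = build_path_tree([["on", "blue", "tooth"]], lexicon)
```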
In one example, the preset path search rule includes the path search sub-steps shown in Fig. 3:
Sub-step S1: determine the probability that the speech frame belongs to an active node of the path tree according to the probabilities of the phonemes corresponding to the speech frame and the information of the word corresponding to the active node; the nodes of the path tree comprise active nodes and inactive nodes.
In one example, the composition information of a word includes the phonemes the word contains. The probability of each phoneme contained in the active node's word is determined from the information of that word, and the probability that the speech frame belongs to the active node is calculated from those phoneme probabilities.
Specifically, the state information of each node of the path tree may be stored in an active node list. After the probabilities of the phonemes corresponding to the speech frame are determined, one active node is extracted from the active node list and the information of its word is obtained. This information includes the identification information of the word, through which the phonemes contained in the word can be looked up. From the probabilities of the phonemes corresponding to the speech frame, the probability of each phoneme of the active node's word is determined, and the sum of these probabilities is taken as the probability that the speech frame belongs to the active node. A specific example follows.
The acoustic features of the speech frame are input into the acoustic model to obtain the probability of each phoneme corresponding to the frame. For example, if the content of the speech frame is "you" (Chinese "ni"), the phonemes of the speech frame are "n" and "i", and the acoustic model outputs a vector of the probabilities of the 32 Chinese phonemes, in which the probabilities of phonemes "n" and "i" are high and the probabilities of the remaining phonemes are low or zero. Suppose the word of the active node is "I" (Chinese "wo"); the phonemes of the active node, looked up from its identification information, are "w" and "o". From the probabilities of the phonemes corresponding to the speech frame, the probability values of "w" and "o" are determined to be 0.01 and 0.05 respectively, and their sum is taken as the probability P1 that the speech frame belongs to the active node, i.e., P1 = 0.01 + 0.05 = 0.06.
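Reusing the `Node` sketch above, sub-step S1 reduces to a few lines; the posterior dictionary below simply restates the worked example (all values illustrative).

```python
def node_probability(frame_posteriors, node):
    """Sub-step S1: probability that the speech frame belongs to an active node,
    as the sum of the frame's probabilities for the phonemes of the node's word."""
    return sum(frame_posteriors.get(ph, 0.0) for ph in node.phonemes)

# Worked example above: frame "ni" scored against active node "wo".
posteriors = {"n": 0.60, "i": 0.30, "w": 0.01, "o": 0.05}   # remaining phonemes ~0
p1 = node_probability(posteriors, Node(word="I", phonemes=["w", "o"]))
assert abs(p1 - 0.06) < 1e-9
```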
In practical applications, the collected audio data very often contains strong noise, and pronunciation may also be non-standard. To improve the accuracy of the probability that a speech frame belongs to an active node of the path tree, this embodiment provides another way of determining that probability.
In another example, the composition information of a word further includes a confusable phoneme for each phoneme the word contains. The probability of the confusable phoneme of each contained phoneme is determined, and the probability that the speech frame belongs to the active node is calculated from the probability of each contained phoneme together with the probability of its confusable phoneme.
Specifically, the confusable phoneme of each phoneme can be pre-stored; for example, in the Chinese system the confusable phoneme of "n" is "l". After the probabilities of the phonemes corresponding to the speech frame are determined, an active node is extracted from the active node list and the information of its word is obtained. This information includes the identification information of the word, through which the phonemes contained in the word, and the confusable phoneme of each of them, can be looked up. From the probabilities of the phonemes corresponding to the speech frame, the probability of each phoneme of the active node's word and the probability of its confusable phoneme are determined.
The probability that the speech frame belongs to the active node is then calculated as follows: for each phoneme of the active node's word, the product of the probability of its confusable phoneme and the weight of that confusable phoneme is added to the probability of the phoneme itself, giving the corrected probability of the phoneme; the sum of the corrected probabilities of all phonemes of the word is taken as the probability that the speech frame belongs to the active node.
The weight of a confusable phoneme may be preset according to the language of the actual application; for example, for a smart speaker supporting Chinese, the weight of "l", the confusable phoneme of "n", may be set in the range 0 to 1 (this range is only illustrative and does not limit the technical solution of the present invention). As an example: the content of the speech frame is "you" (Chinese "ni"), so its phonemes are "n" and "i", and the acoustic model outputs a vector of the probabilities of the 32 Chinese phonemes, in which "n" and "i" have high probability and the remaining phonemes are low or zero. Suppose the word of the active node is "good"; the phonemes looked up from the active node's identification information are "f" and "u". From the probabilities of the phonemes corresponding to the speech frame, the probability values of "f" and "u" are 0.01 and 0.05 respectively; the probability of "h", the confusable phoneme of "f", is 0.02, and the probability of "o", the confusable phoneme of "u", is 0.012. The corrected probabilities are then: for "f", P1 = 0.01 + 0.02 × weight1; for "u", P2 = 0.05 + 0.012 × weight2; and the probability that the speech frame belongs to the active node is P0 = P1 + P2.
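A sketch of the confusion-corrected score, continuing the code above. The confusion table and the weights of 0.5 are assumed values within the 0-to-1 range mentioned in the text.

```python
# Hypothetical confusion table and weights (both would be tuned per language).
CONFUSABLE = {"n": "l", "f": "h", "u": "o"}
CONF_WEIGHT = {"n": 0.5, "f": 0.5, "u": 0.5}

def corrected_node_probability(frame_posteriors, node):
    """Corrected score: P_corr(ph) = P(ph) + weight(ph) * P(confusable(ph)),
    summed over the phonemes of the active node's word."""
    total = 0.0
    for ph in node.phonemes:
        p = frame_posteriors.get(ph, 0.0)
        conf = CONFUSABLE.get(ph)
        if conf is not None:
            p += CONF_WEIGHT[ph] * frame_posteriors.get(conf, 0.0)
        total += p
    return total

# Worked example above, with both weights taken as 0.5:
posteriors = {"f": 0.01, "u": 0.05, "h": 0.02, "o": 0.012}
p0 = corrected_node_probability(posteriors, Node(word="good", phonemes=["f", "u"]))
# = (0.01 + 0.5 * 0.02) + (0.05 + 0.5 * 0.012) = 0.076
```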
Sub-step S2: determine the command word formed by the speech frames according to the probability that the speech frame belongs to an active node of the path tree and the position information of each node of the path tree.
Specifically, the process of determining the command word formed by the speech frames may comprise the sub-steps shown in Fig. 4:
Sub-step S21: extract one active node from the unextracted active nodes in the active node list of the path tree.
Specifically, to reduce search difficulty, the unextracted active nodes may be taken from the active node list in their order in the list; for example, the active node ranked first in the list is extracted first.
For the extracted active node, the following processing is performed:
Sub-step S22: judge whether the probability that the speech frame belongs to the active node is less than a first threshold and the active node is a leaf node. If both hold, execute step S23; if the probability is greater than or equal to the first threshold, or the probability is less than the first threshold but the active node is not a leaf node, execute step S24.
Specifically, the first threshold may be preset, for example in the range 0.5 to 0.9. If the probability that the speech frame belongs to the active node is less than the first threshold, the speech frame does not belong to the active node. Because the path tree is built layer by layer in the order of the words in each command word, an active node that is a leaf node indicates that the search has reached the tail of a command word; no further search is needed, and the tree can be traced back to the root node to determine the command word.
For example, in the path tree of Fig. 2, suppose the current active node is the word "tooth" and the current speech frame is silence. Since the probability that the speech frame belongs to the active node is less than the first threshold and the active node "tooth" is a leaf node, step S23 is executed: trace back from the active node to the root node and determine the command word formed by the speech frames, namely "bluetooth on".
If the probability that the speech frame belongs to the active node is greater than or equal to the first threshold, the speech frame belongs to the active node and a subsequent search may be performed. If the probability is less than the first threshold but the active node is not a leaf node, the speech frame does not belong to the active node, but because the node is not a leaf, another active node can be tried and the search continued.
For example, in Fig. 2, if the current active node is the word "blue" and the probability that the speech frame belongs to it is greater than the first threshold, the speech frame belongs to the active node "blue", and sub-step S24 is executed. If the probability is less than the first threshold, the speech frame does not belong to the word "blue"; since that active node is not a leaf node, step S24 is also executed.
Sub-step S23: trace back from the active node to the root node and determine the command word formed by the speech frames.
Sub-step S24: determine whether to activate the child nodes of the active node according to the probability that the speech frame belongs to each child node.
In one example, for each child node of the active node, judge whether the probability that the speech frame belongs to that child node is greater than or equal to a second threshold; if so, activate the child node.
Specifically, since the active node in sub-step S22 is not a leaf node, its child nodes may be activated for subsequent searching. Any child node of the active node may be extracted and the probability that the speech frame belongs to it determined, in the same manner as in sub-step S1, which is not repeated here. The second threshold may be preset, and its value may be the same as or different from the first threshold. The probability that the speech frame belongs to the child node is compared with the second threshold: if it is greater than or equal to the second threshold, the speech frame belongs to the child node and the child node is activated; if it is less than the second threshold, the speech frame does not belong to the child node, the child node is not activated, and sub-step S25 is performed directly.
Sub-step S25: judge whether unextracted active nodes remain in the active node list. If so, return to sub-step S21, that is, to extracting one active node from the unextracted active nodes in the active node list of the path tree, until no unextracted active node remains or the command word formed by the speech frames has been determined. If none remain, perform step S26.
For example, if there is an unextracted active node in the active node list, return to sub-step S21 until no unextracted active node remains, or until the command word formed by the speech frames has been determined.
Step S26: update the active node list and process the next speech frame.
Specifically, if no unextracted active nodes remain, all active nodes have been extracted, and the active node list may be updated before the next speech frame is processed; the update may set all nodes except the root node to the inactive state.
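Putting sub-steps S21 to S26 together, the sketch below processes one speech frame against the active node list, reusing `Node` and `node_probability` from above. The threshold values and the exact list-update strategy are assumptions; the text permits either keeping activated nodes for the next frame or resetting every non-root node.

```python
FIRST_THRESHOLD = 0.6    # assumed; the text suggests a value in 0.5-0.9
SECOND_THRESHOLD = 0.6   # assumed; may equal the first threshold or not

def backtrack(node):
    """Sub-step S23: walk parent links from the matched leaf back to the root.
    Chinese command words concatenate character by character, hence "".join."""
    words = []
    while node is not None and node.word != "<sil>":
        words.append(node.word)
        node = node.parent
    return "".join(reversed(words))

def process_frame(frame_posteriors, active_list, root):
    """Sub-steps S21-S26 for one frame; returns the command word, or None."""
    pending = list(active_list)                        # unextracted active nodes
    while pending:                                     # S21: extract one node
        node = pending.pop(0)
        p = node_probability(frame_posteriors, node)   # S1
        if p < FIRST_THRESHOLD and not node.children:  # S22: non-matching leaf
            return backtrack(node)                     # S23
        for child in node.children.values():           # S24: try the children
            if (not child.active and
                    node_probability(frame_posteriors, child) >= SECOND_THRESHOLD):
                child.active = True
                active_list.append(child)
        # S25: children activated this frame are taken up from the next frame on
    # S26: all active nodes extracted; refresh the list for the next frame.
    # (The text also allows a full reset that deactivates every non-root node.)
    active_list[:] = [root] + [n for n in active_list if n.active and n is not root]
    return None
```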
Step 103: determine the command word according to the output of the path tree model.
Specifically, the output of the path tree model is taken as the determined command word. This embodiment provides a method for speech recognition of a limited set of command words. Because the command words are stored in the form of a path tree, recognition of more command words can be supported with low resource occupancy while command-word recognition accuracy is maintained.
Compared with the prior art, embodiments of the present invention input the probabilities of the phonemes corresponding to a speech frame into a path tree model and determine the command word according to the output of the path tree model. The path tree model determines the command word formed by the speech frames according to those probabilities, a pre-stored path tree, and a preset path search rule. Because the path tree can store a large number of command words, the number of command words the path tree model can recognize is increased, and the accuracy with which it determines command words is improved. Because the model uses a preset path search rule instead of extensive training, the complexity of constructing it is reduced and its range of application is expanded. Moreover, since the command word is determined from the probability of each phoneme of the input speech frames, each word in the command word is identified from phoneme probabilities, which improves the accuracy of each determined word and therefore of the command word they form.
A second embodiment of the present invention relates to a speech recognition method. It is a further refinement of sub-step S2 of the first embodiment, in which the command word formed by the speech frames is determined according to the probability that the speech frame belongs to an active node of the path tree and the position information of each node of the path tree. The specific improvement is that, before the child nodes of an active node are activated, it is determined that the skip information of the active node indicates that the phoneme of the active node's word corresponding to the speech frame is the ending phoneme of that word. A schematic diagram of this implementation of determining the command word formed by the speech frames is shown in Fig. 5.
Sub-step S21: extract one active node from the unextracted active nodes in the active node list of the path tree.
This sub-step S21 is substantially the same as sub-step S21 in the first embodiment and is not described again here.
For the extracted active node, the following processing is performed:
Sub-step S22: judge whether the probability that the speech frame belongs to the active node is less than the first threshold and the active node is a leaf node. If both hold, execute step S23; if the probability is greater than or equal to the first threshold, or the probability is less than the first threshold but the active node is not a leaf node, execute step S24.
In one example, if it is determined that the probability that the speech frame belongs to the active node is less than the first threshold and the active node is a leaf node, the active node may additionally be turned off.
Specifically, a probability below the first threshold indicates that the speech frame does not belong to the word of the active node, and the active node being a leaf node indicates that the active nodes before it already form the words of the command, so the search for a command word is complete; the active node may therefore be turned off before step S23 is performed.
It is worth mentioning that, since no other node exists under a leaf node, the path search is complete when a leaf node is reached, and turning off the active node reduces the difficulty and redundancy of subsequent path searches.
This sub-step S22 is otherwise substantially the same as sub-step S22 in the first embodiment and is not described again here.
Sub-step S23: trace back from the active node to the root node and determine the command word formed by the speech frames.
In one example, to further improve the determined command word, the number of phonemes lost by the speech frames may also be determined; if the total number of lost phonemes is less than a fourth threshold, sub-step S23 is performed.
Specifically, the fourth threshold may be set as needed, for example to 3. When the total number of lost phonemes is less than the fourth threshold, the lost phonemes do not affect the accuracy of the determined command word, and step S23 may be performed directly to determine the command word formed by the speech frames. If the total number of lost phonemes is greater than the fourth threshold, the current search may contain an error and the command word obtained by tracing back from the active node to the root node would be inaccurate; sub-step S23 may then be skipped, the process may return to the step of extracting an active node (sub-step S21 of this embodiment), or another operation such as raising an alarm may be performed.
Sub-step S24: determine the skip information of the active node according to the probabilities of the phonemes corresponding to the speech frame and the pre-stored information of the word corresponding to the active node, where the skip information indicates which phoneme of the active node's word the speech frame corresponds to.
According to the skip information of the active node, judge whether the phoneme of the active node's word has skipped erroneously; if so, add one to the second statistical value, which indicates the number of speech frames not belonging to the active node. Likewise, after determining that the probability that the speech frame belongs to the active node is less than the first threshold, add one to the second statistical value. In both situations the second statistical value is incremented, since it counts the speech frames that do not belong to the active node. Before the active node is turned off, it is accordingly determined that the second statistical value is greater than a fifth threshold.
Specifically, the second statistical value is incremented when a skip error occurs or when the probability that the speech frame belongs to the active node is less than the first threshold. The fifth threshold may be set to N, with N greater than 1; the active node may be turned off when the probability that the speech frame belongs to it is less than the first threshold and the second statistical value is greater than the fifth threshold. For example, with a fifth threshold of 3, a second statistical value greater than the fifth threshold means that the probability that the speech frame belongs to the active node has been less than the first threshold, or a skip error has occurred, three consecutive times. The active node may then be turned off to reduce the complexity of subsequently searching for the speech frame.
Sub-step S25: determine that the phoneme of the active node's word corresponding to the speech frame, as indicated by the skip information of the active node, is the ending phoneme of that word.
Specifically, the speech of a word comprises several phonemes, and the ending phoneme is the last of them. If the search jumped down to a child node of the active node while the speech frame still corresponded to an intermediate phoneme of the word, no child node would belong to the speech frame, the process would return to sub-step S21 to re-extract an active node, and the determination of the command word would be slowed down. If instead the skip information indicates that the phoneme of the active node's word corresponding to the speech frame is the ending phoneme, the word corresponding to the speech frames has been completely recognized, and jumping at this point reduces redundancy. For example, if the speech of a word is "lian", it comprises three phonemes; if the skip information of the active node is "i", the phoneme of the active node's word corresponding to the speech frame is not the ending phoneme, and it is not yet necessary to jump to the child nodes of the active node.
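One hedged reading of the skip information in code: take the phoneme of the word that the current frame matches best as the phoneme jumped to, and gate jumping to child nodes on it being the word's last phoneme. The text does not fix how the skip information is derived, so this argmax reading is an assumption.

```python
def jumped_phoneme(frame_posteriors, node):
    """Assumed reading of the skip information: the phoneme of the node's word
    that the current frame most probably corresponds to."""
    return max(node.phonemes, key=lambda ph: frame_posteriors.get(ph, 0.0))

def at_ending_phoneme(frame_posteriors, node):
    """Sub-step S25's gate: only jump to child nodes once the frame corresponds
    to the last phoneme of the active node's word."""
    return jumped_phoneme(frame_posteriors, node) == node.phonemes[-1]

# E.g. for a word whose phonemes are ["l", "i", "an"], a frame matching "i"
# must not trigger child activation; only a frame matching "an" may.
```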
It should be noted that, in practical applications, other relevant judgments may be made before determining whether to activate the child nodes of the active node. For example, after determining that the probability that the speech frame belongs to the active node is greater than or equal to the first threshold, one is added to the first statistical value of the active node, which indicates the number of speech frames belonging to the active node; before determining whether to activate a child node according to the probability that the speech frame belongs to it, it is judged whether the first statistical value is greater than or equal to a third threshold, and if so, sub-step S26 may be performed, while if the first statistical value is less than the third threshold, sub-step S26 is skipped and sub-step S27 is performed. As another example, before determining whether to activate the child nodes, the electronic device judges from the skip information whether the speech frame corresponds to the ending phoneme of the active node's word, and judges whether the first statistical value is greater than or equal to the third threshold; sub-step S26 is performed only after both conditions hold, and sub-step S27 is performed if the speech frame does not correspond to the ending phoneme or the first statistical value is less than the third threshold.
Specifically, the third threshold may be set according to actual needs, for example to 10. If the first statistical value is greater than or equal to the third threshold and the skip information of the active node indicates that the phoneme of the active node's word corresponding to the speech frame is the ending phoneme of that word, sub-step S26 is performed.
Sub-step S26: determine whether to activate the child nodes of the active node according to the probability that the speech frame belongs to each child node.
This step is substantially the same as sub-step S24 in the first embodiment and is not described again here.
Sub-step S27: judge whether unextracted active nodes remain in the active node list. If so, return to sub-step S21, that is, to extracting one active node from the unextracted active nodes in the active node list, until no unextracted active node remains or the command word formed by the speech frames has been determined. If none remain, perform step S28.
This step is substantially the same as sub-step S25 in the first embodiment and is not described again here.
Sub-step S28: update the active node list and process the next speech frame.
This step is substantially the same as step S26 in the first embodiment and is not described again here.
The process of determining a command word with this speech recognition method is described below through an example.
Step 2001: collect speech from the microphone in real time.
Step 2002: frame and window the collected speech, and extract acoustic features.
Step 2003: determine the probability of each phoneme corresponding to the speech frame according to the acoustic features of the speech frame.
Step 2004: extract one active node from the active nodes not yet extracted from the active node list.
Step 2005: determine the probability that the speech frame belongs to the active node according to the probabilities of the phonemes corresponding to the speech frame and the information of the word corresponding to the active node, and determine the skip information of the active node.
Step 2006: judge whether the probability that the speech frame belongs to the active node is greater than or equal to the first threshold. If so, perform step 2007; otherwise perform step 2012.
Step 2007: add 1 to the first statistical value.
Step 2008: judge whether the active node satisfies the following constraints: the first statistical value is greater than or equal to the third threshold, and the skip information of the active node indicates that the speech frame corresponds to the ending phoneme of the active node's word.
If the constraints are satisfied, perform step 2009; otherwise perform step 2010.
Step 2009: extract the child nodes of the active node in turn, and for each child node judge whether the probability that the speech frame belongs to it is greater than or equal to the second threshold. If so, activate the child node and extract the next child node until all child nodes have been traversed; if not, extract the next child node until all child nodes have been traversed.
Step 2010: judge whether unextracted active nodes remain in the active node list. If so, perform step 2004; otherwise perform step 2011.
Step 2011: update the active node list. The judgment of the current speech frame then ends, and the next speech frame is judged, that is, step 2001 is performed.
Step 2012: judge whether the second statistical value is greater than or equal to the fifth threshold. If not, perform step 2013; otherwise perform step 2014.
Step 2013: add 1 to the second statistical value, then perform step 2008.
Step 2014: turn off the active node.
Step 2015: judge whether the active node is a leaf node. If so, perform step 2016; otherwise perform step 2008.
Step 2016: judge whether the total number of phonemes lost by the speech is less than the fourth threshold. If so, perform step 2017; otherwise perform step 2018.
Step 2017: trace back from the active node to the root node and determine the command word formed by the speech frames.
Step 2018: determine that command word recognition has failed. Step 2001 may be re-executed, or another operation such as raising an alarm may be performed.
In the speech recognition method provided by this embodiment, the first statistical value is compared with the third threshold before deciding whether to activate the child nodes of an active node. Because of this check, no subsequent operations are performed when the voice frame does not belong to a child node, which reduces useless searching and further improves search efficiency.
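For illustration only, the per-node decision procedure of steps 2004 through 2018 can be assembled into a single Python sketch. This is one plausible reading of the flow above, not the patent's reference implementation; it reuses the hypothetical try_activate_children helper shown after step 2009, and the node fields (active, is_leaf, parent, word, first_stat, second_stat) and helpers frame_node_prob, at_ending_phoneme and lost_phonemes are likewise assumptions made for the sketch.

```python
class RecognitionFailed(Exception):
    """Step 2018: command word recognition failed for this utterance."""

def process_active_node(node, probs, th):
    """Steps 2006-2018 for one extracted active node. `probs` holds the
    per-phoneme probabilities of the current voice frame; `th` maps the
    threshold names "first" .. "fifth" to their values. Returns the
    recognized command word, or None if the search simply continues."""
    p = frame_node_prob(node, probs)                  # probability from step 2005
    if p >= th["first"]:                              # step 2006: frame matches node
        node.first_stat += 1                          # step 2007
    elif node.second_stat < th["fifth"]:              # step 2012
        node.second_stat += 1                         # step 2013, then on to step 2008
    else:
        node.active = False                           # step 2014: close the node
        if node.is_leaf:                              # step 2015
            if lost_phonemes(node) < th["fourth"]:    # step 2016
                return backtrack(node)                # step 2017
            raise RecognitionFailed()                 # step 2018
        # a closed non-leaf node still falls through to the step 2008 check
    # Step 2008: enough matching frames, and at the word's ending phoneme?
    if node.first_stat >= th["third"] and at_ending_phoneme(node, probs):
        try_activate_children(node, probs, th["second"])  # step 2009
    return None                                       # step 2010 runs in the caller

def backtrack(node):
    """Step 2017: collect the words on the path from the node back to the root."""
    words = []
    while node is not None and node.word is not None:
        words.append(node.word)
        node = node.parent
    return "".join(reversed(words))
```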
The above description is for illustration only and does not limit the technical solution of the present invention.
The steps of the above methods are divided only for clarity of description. In implementation, steps may be combined into one step, or a step may be split into multiple steps; as long as the same logical relationship is preserved, such variations fall within the protection scope of this patent. Likewise, adding insignificant modifications to an algorithm or flow, or introducing insignificant design changes without altering the core design of the algorithm or flow, falls within the protection scope of this patent.
A third embodiment of the present invention relates to an electronic device. The specific structure of the electronic device 30 is shown in fig. 6 and includes: at least one processor 301; and a memory 302 communicatively connected to the at least one processor 301. The memory 302 stores instructions executable by the at least one processor 301, and the instructions are executed by the at least one processor 301 to enable the at least one processor 301 to perform the voice recognition method of the first or second embodiment.
The electronic device 30 includes one or more processors 301 and a memory 302; one processor 301 is taken as an example in fig. 6. The processor 301 and the memory 302 may be connected by a bus or in another manner; connection by a bus is taken as an example in fig. 6. As a non-volatile computer-readable storage medium, the memory 302 may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program corresponding to the voice recognition method described in the embodiments of the present application. The processor 301 executes the non-volatile software programs, instructions, and modules stored in the memory 302 to perform the various functional applications and data processing of the device, thereby implementing the above voice recognition method.
The memory 302 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store an option list and the like. Further, the memory 302 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the memory 302 may optionally include memory located remotely from the processor 301, and such remote memory may be connected to the device via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
One or more modules are stored in the memory 302, which when executed by the one or more processors 301, perform the speech recognition method of any of the method embodiments described above.
The above product can execute the method provided by the embodiments of the present application and has the functional modules and beneficial effects corresponding to the executed method. For technical details not described in this embodiment, refer to the method provided by the embodiments of the present application.
A fourth embodiment of the present invention relates to a computer-readable storage medium storing a computer program. When executed by a processor, the computer program implements the method of the above embodiments.
That is, as can be understood by those skilled in the art, all or part of the steps of the methods in the above embodiments may be implemented by a program instructing related hardware. The program is stored in a storage medium and includes several instructions to cause a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
It will be understood by those of ordinary skill in the art that the above embodiments are specific examples of carrying out the invention, and that various changes in form and detail may be made in practice without departing from the spirit and scope of the invention.

Claims (12)

1. A speech recognition method, comprising:
determining the probability of each phoneme corresponding to the voice frame according to the acoustic characteristics of the voice frame;
inputting the probability of each phoneme corresponding to the voice frame into a path tree model, wherein the path tree model determines a command word formed by voice frames according to the probability of each phoneme corresponding to the voice frame, a pre-stored path tree, and a preset path search rule; the path tree is generated based on command words, each node of the path tree stores information of the word corresponding to the node, and the information of the word indicates the composition information of the word;
determining command words according to the output of the path tree model;
the preset path search rule comprises:
determining the probability that the voice frame belongs to an activated node of the path tree according to the probability of each phoneme corresponding to the voice frame and the information of the word corresponding to the activated node, wherein the nodes of the path tree include activated nodes and inactivated nodes;
extracting an active node from the active nodes that have not been extracted in the active node list of the path tree, and performing the following operations for the extracted active node:
if the probability that the voice frame belongs to the activated node is smaller than a first threshold and the activated node is a leaf node, backtracking from the activated node to the root node and determining the command word formed by the voice frames;
if it is determined that the probability that the voice frame belongs to the activated node is greater than or equal to the first threshold, or the probability that the voice frame belongs to the activated node is smaller than the first threshold and the activated node is not a leaf node, determining whether to activate the child node of the activated node according to the probability that the voice frame belongs to the child node of the activated node; judging whether the active node list contains activated nodes that have not been extracted, and if so, returning to the step of extracting one active node from the unextracted active nodes in the active node list of the path tree, until no unextracted activated node exists or the command word formed by the voice frames is successfully determined; and if no unextracted activated node exists, updating the active node list and processing the next voice frame.
2. The speech recognition method according to claim 1, wherein the determining whether to activate the child node of the activated node according to the probability that the voice frame belongs to the child node of the activated node specifically comprises:
for each child node of the activated node, judging whether the probability that the voice frame belongs to the child node is greater than or equal to a second threshold;
and if so, activating the child node.
3. The speech recognition method of claim 1, wherein before the determining whether to activate the child node of the activated node according to the probability that the voice frame belongs to the child node of the activated node, the speech recognition method further comprises:
determining the jump information of the activated node according to the probability of each phoneme corresponding to the voice frame and the pre-stored information of the word corresponding to the activated node, wherein the jump information of the activated node indicates the phoneme, of the word corresponding to the activated node, to which the voice frame corresponds;
and determining that the jump information of the activated node indicates that the phoneme to which the voice frame corresponds is the ending phoneme of the word corresponding to the activated node.
4. The speech recognition method of claim 1, wherein after determining that the probability that the voice frame belongs to the activated node is greater than or equal to the first threshold, the speech recognition method further comprises:
adding one to the first statistical value of the activated node, wherein the first statistical value indicates the number of voice frames belonging to the activated node;
and before the determining whether to activate the child node of the activated node according to the probability that the voice frame belongs to the child node of the activated node, the speech recognition method further comprises:
determining that the first statistical value is greater than or equal to a third threshold.
5. The speech recognition method of claim 3, further comprising, after determining the jump information of the activated node:
determining the number of phonemes lost by the voice frame;
and before the backtracking from the activated node to the root node and determining the command word formed by the voice frames, the speech recognition method further comprises:
determining that the total number of phonemes lost by the voice is less than a fourth threshold.
6. The speech recognition method of claim 3, wherein after determining that the probability that the voice frame belongs to the activated node is less than the first threshold, the speech recognition method further comprises:
closing the activated node.
7. The speech recognition method of claim 6, further comprising, after the determining the jump information of the activated node:
judging, according to the jump information of the activated node, whether a phoneme jump error has occurred for the word corresponding to the activated node;
and if so, adding one to the second statistical value, wherein the second statistical value indicates the number of voice frames not belonging to the activated node;
after determining that the probability that the voice frame belongs to the activated node is less than the first threshold, the speech recognition method further comprises:
adding one to the second statistical value;
and before the closing the activated node, the speech recognition method further comprises:
determining that the second statistical value is greater than a fifth threshold.
8. The speech recognition method according to claim 1, wherein the composition information of the word includes each phoneme contained in the word;
the determining the probability that the voice frame belongs to the activated node of the path tree according to the probability of each phoneme corresponding to the voice frame and the information of the word corresponding to the activated node specifically comprises:
determining the probability of each phoneme contained in the word corresponding to the activated node according to the information of the word corresponding to the activated node;
and calculating the probability that the voice frame belongs to the activated node according to the probability of each contained phoneme.
9. The speech recognition method of claim 8, wherein the composition information of the word further includes a confusable phoneme of each phoneme contained in the word;
before the calculating the probability that the voice frame belongs to the activated node according to the probability of each contained phoneme, the speech recognition method further comprises:
determining the probability of the confusable phoneme of each contained phoneme;
and the calculating the probability that the voice frame belongs to the activated node according to the probability of each contained phoneme specifically comprises:
calculating the probability that the voice frame belongs to the activated node according to the probability of each contained phoneme and the probability of the confusable phoneme of each contained phoneme.
10. The speech recognition method according to claim 9, wherein the calculating the probability that the voice frame belongs to the activated node according to the probability of each contained phoneme and the probability of the confusable phoneme of each contained phoneme specifically comprises:
for each phoneme of the word corresponding to the activated node, adding, to the probability of the phoneme, the product of the probability of the phoneme's confusable phoneme and the weight of that confusable phoneme, to obtain the corrected probability of the phoneme;
and taking the sum of the corrected probabilities of all phonemes of the word corresponding to the activated node as the probability that the voice frame belongs to the activated node.
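For illustration only, and not as part of the claims: the computation of claims 8 to 10 amounts to summing, over the phonemes of the node's word, each phoneme probability corrected by the weighted probability of its confusable phoneme. A minimal sketch with hypothetical variable names:

```python
def node_probability(phoneme_probs, confusable_probs, confusable_weights):
    """Claims 8-10: each phoneme's corrected probability is its own probability
    plus (weight of its confusable phoneme x probability of that confusable
    phoneme); the node probability is the sum of the corrected probabilities."""
    return sum(p + w * q
               for p, q, w in zip(phoneme_probs, confusable_probs, confusable_weights))
```

For example, with phoneme probabilities [0.5, 0.3], confusable-phoneme probabilities [0.2, 0.1] and weights [0.4, 0.4], the node probability is (0.5 + 0.4 × 0.2) + (0.3 + 0.4 × 0.1) = 0.92.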
11. An electronic device, comprising: at least one processor; and
a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the speech recognition method of any one of claims 1 to 10.
12. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the speech recognition method of any one of claims 1 to 10.
CN201911414565.3A 2019-12-31 2019-12-31 Voice recognition method, electronic equipment and storage medium Active CN111128172B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911414565.3A CN111128172B (en) 2019-12-31 2019-12-31 Voice recognition method, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN111128172A CN111128172A (en) 2020-05-08
CN111128172B true CN111128172B (en) 2022-12-16

Family

ID=70506629




Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
TA01: Transfer of patent application right
Effective date of registration: 20210207
Address after: 200245 2nd floor, building 2, no.1508, Kunyang Road, Minhang District, Shanghai
Applicant after: Dalu Robot Co.,Ltd.
Address before: 610094 West Section of Fucheng Avenue, Chengdu High-tech District, Sichuan Province
Applicant before: CLOUDMINDS (CHENGDU) TECHNOLOGIES Co.,Ltd.
CB02: Change of applicant information
Address after: 200245 Building 8, No. 207, Zhongqing Road, Minhang District, Shanghai
Applicant after: Dayu robot Co.,Ltd.
Address before: 200245 2nd floor, building 2, no.1508, Kunyang Road, Minhang District, Shanghai
Applicant before: Dalu Robot Co.,Ltd.
GR01: Patent grant