CN111754981A - Command word recognition method and system using mutual prior constraint model

Command word recognition method and system using mutual prior constraint model

Info

Publication number: CN111754981A
Authority: CN (China)
Prior art keywords: recognition, word, action, target, command
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN202010593154.1A
Other languages: Chinese (zh)
Inventors: 曾可为 (Zeng Kewei), 杨毅 (Yang Yi), 孙甲松 (Sun Jiasong)
Current and original assignee: Tsinghua University (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Application filed by Tsinghua University
Priority and filing date: 2020-06-26 (priority to CN202010593154.1A; the priority date is an assumption and is not a legal conclusion)
Publication date: 2020-10-09 (publication of CN111754981A)


Classifications

    • G10L 15/02 — Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/063 — Training (Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
    • G10L 15/16 — Speech classification or search using artificial neural networks
    • G10L 15/26 — Speech to text systems
    • G10L 15/32 — Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems
    • G10L 2015/025 — Phonemes, fenemes or fenones being the recognition units
    • G06F 40/30 — Semantic analysis (Handling natural language data)
    • G06N 3/084 — Backpropagation, e.g. using gradient descent (Neural networks; Learning methods)


Abstract

The invention discloses a command word recognition method using mutual prior constraint models, based on an end-to-end voice command word recognition structure comprising a phoneme module for extracting phoneme features from audio, a word module for extracting word features from the phoneme features, and a semantic module for extracting semantic features from the word features. The content components of a command are recognized in the semantic module in sequence, so that each recognized component constrains, as a prior, the recognition of the remaining components. Compared with the prior art, the method exploits the correlation among these variables and greatly improves recognition accuracy.

Description

Command word recognition method and system using mutual prior constraint model
Technical Field
The invention belongs to the technical field of machine learning, and particularly relates to a command word recognition method and system using mutual prior constraint models.
Background
Most popular speech recognition products on the market use a traditional online semantic understanding system. Such a system sends the audio recorded by the device to a server, decodes the speech into text using a very large speech recognition model stored on the server, analyzes the semantic information contained in the text using natural language understanding technology, and finally sends the recognized semantics back to the client. Because two separate models are used, the system is bulky and inefficient, and semantic understanding can only be performed on the server, which introduces network latency, risks privacy disclosure, and makes security difficult to guarantee. Moreover, passing information between the two disjoint models of speech recognition and natural language understanding loses original audio information and compounds errors, degrading recognition accuracy. In addition, because bandwidth and resources are limited, an online semantic understanding system cannot keep recording and continuously upload audio to the server; a wake-word system, which can only use offline speech recognition, is needed to wake the device, start recording, and upload the voice information. This not only increases the complexity of the overall semantic understanding system but also inconveniences the user, especially when different products use different wake words.
In recent years, popular end-to-end semantic understanding systems have avoided these problems. First, an end-to-end semantic understanding system maps the acoustic feature sequence directly to semantic information using only one model. The semantic recognition error can therefore be optimized directly, which improves convergence speed and accuracy; error-prone intermediate steps such as search algorithms, language models and finite state transducers are avoided; and components of speech that carry semantics but cannot be expressed in text, such as stress and prosody, can be exploited. In addition, an end-to-end semantic understanding system has low computational cost and high recognition speed and does not require online speech recognition, so the wake-word system can be omitted, leaving the whole system with fewer recognition steps and making it more convenient to use.
The overall structure of end-to-end speech command word recognition is shown in fig. 1, and includes a phoneme module for extracting phoneme features from audio, a word module for extracting word features from phoneme features, and a semantic module for extracting semantic features from word features.
After the phoneme module and the word module are pre-trained on a data set, the fully connected layer used to map phoneme or word features to specific elements is removed, the two modules are spliced together, and the whole model is then trained end to end, with supervision, on the voice command word data set.
The specific structure of the phoneme module is shown in fig. 2; this module obtains phoneme features from the input audio. The input audio signal is first passed through an interpretable convolutional filter to extract feature information from the raw audio. The features are filtered by a one-dimensional max pooling layer, activated with a rectified linear unit, and then regularized with dropout (random discarding) to prevent overfitting. Finally, a bidirectional recurrent neural network extracts the phoneme features. The extracted phoneme features are again passed through dropout to prevent overfitting, then down-sampled to a suitable dimension and fed into the word module. During pre-training, the phoneme features are additionally mapped to specific phonemes by a fully connected layer, but this fully connected layer is not used in the actual end-to-end training.
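As a concrete illustration, a minimal PyTorch-style sketch of such a phoneme module is given below. It is a sketch under stated assumptions, not the patent's implementation: a plain Conv1d stands in for the interpretable convolutional filter (e.g. a SincNet-style filter bank), and the filter length, hidden size, dropout rate, downsampling stride and phoneme inventory are illustrative guesses.

```python
import torch
import torch.nn as nn

class PhonemeModule(nn.Module):
    """Conv filter bank -> max pool -> ReLU -> dropout -> BiGRU ->
    dropout -> temporal downsampling, following the pipeline in the
    text. All hyperparameters are illustrative assumptions."""

    def __init__(self, n_filters=40, hidden=128, n_phones=42,
                 dropout=0.5, stride=2):
        super().__init__()
        # Stand-in for the interpretable convolutional filter.
        self.filters = nn.Conv1d(1, n_filters, kernel_size=401, stride=80)
        self.pool = nn.MaxPool1d(kernel_size=2)   # 1-D max pooling
        self.act = nn.ReLU()                      # rectified linear unit
        self.drop = nn.Dropout(dropout)           # "random discarding"
        self.rnn = nn.GRU(n_filters, hidden, batch_first=True,
                          bidirectional=True)
        self.stride = stride                      # downsampling factor
        # Used only to map features to phonemes during pre-training;
        # removed when the module is spliced into the full model.
        self.pretrain_fc = nn.Linear(2 * hidden, n_phones)

    def forward(self, wav):                       # wav: (batch, samples)
        x = self.filters(wav.unsqueeze(1))        # (batch, filters, frames)
        x = self.drop(self.act(self.pool(x)))
        x, _ = self.rnn(x.transpose(1, 2))        # (batch, frames, 2*hidden)
        return self.drop(x)[:, ::self.stride, :]  # downsample in time
```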
The specific structure of the word module is shown in fig. 3; this module extracts word features from the phoneme features. A bidirectional RNN performs feature extraction on the input phoneme features, dropout prevents overfitting, and down-sampling adjusts the dimensionality to obtain the word features, which are fed into the semantic module. As with the phoneme module, the fully connected layer used for pre-training is omitted in the actual end-to-end training.
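Continuing the sketch, the word module reduces to a BiRNN with dropout and temporal downsampling; the input width of 256 matches the 2 x 128 output of the phoneme module sketch above and, like the word-inventory size, is an assumption rather than a figure from the text.

```python
import torch.nn as nn

class WordModule(nn.Module):
    """BiGRU over phoneme features, dropout, temporal downsampling.
    The pre-training word classifier head is kept separate so it can
    be dropped for end-to-end training, as described in the text."""

    def __init__(self, in_dim=256, hidden=128, n_words=200,
                 dropout=0.5, stride=2):
        super().__init__()
        self.rnn = nn.GRU(in_dim, hidden, batch_first=True,
                          bidirectional=True)
        self.drop = nn.Dropout(dropout)
        self.stride = stride
        self.pretrain_fc = nn.Linear(2 * hidden, n_words)  # pre-training only

    def forward(self, phone_feats):     # (batch, frames, in_dim)
        x, _ = self.rnn(phone_feats)
        return self.drop(x)[:, ::self.stride, :]
```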
The specific structure of the semantic module is shown in fig. 4; this module extracts semantic information from the word features. After word information is extracted by a bidirectional RNN layer, dropout is applied to prevent overfitting and down-sampling adjusts the dimensionality; the result is fed into a linear layer, which yields the probability of each candidate command word, and a max pooling layer finds the command word with the highest probability. The command words are mapped to three components, namely the action, the target and the position of a command, completing the end-to-end command word recognition.
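A matching sketch of the baseline semantic module follows: one BiRNN, then one linear head per component whose frame-wise scores are max-pooled over time. The class counts (6 actions, 14 targets, 4 positions, roughly as in the Fluent Speech Commands data set) are assumptions for illustration only.

```python
import torch.nn as nn

class BaselineSemanticModule(nn.Module):
    """Baseline of Fig. 4: the three components are scored
    independently, i.e. p(A,O,L) = p(A) * p(O) * p(L)."""

    def __init__(self, in_dim=256, hidden=128, dropout=0.5,
                 n_classes=(6, 14, 4)):   # (action, target, position)
        super().__init__()
        self.rnn = nn.GRU(in_dim, hidden, batch_first=True,
                          bidirectional=True)
        self.drop = nn.Dropout(dropout)
        self.heads = nn.ModuleList(
            [nn.Linear(2 * hidden, n) for n in n_classes])

    def forward(self, word_feats):        # (batch, frames, in_dim)
        x, _ = self.rnn(word_feats)
        x = self.drop(x)
        # Max pooling over time picks the highest frame-wise score for
        # each class, i.e. the most probable command word per component.
        return [head(x).max(dim=1).values for head in self.heads]
```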
In this model, the action, the target and the position are recognized synchronously and independently, with no sequential relation among them. The recognition accuracy in this case is formulated as
p(A, O, L) = p(A) · p(O) · p(L)
The accuracy of the method needs to be further improved.
Disclosure of Invention
In order to overcome the disadvantages of the prior art, the present invention provides a command word recognition method and system using mutual prior constraint models, which are used for end-to-end speech command word recognition and can greatly improve the recognition accuracy.
To achieve this purpose, the invention establishes the relation between command words by modifying the original independent models into mutually prior-constrained models. The specific technical scheme is as follows:
a command word recognition method using mutual prior constraint models is based on an end-to-end voice command word recognition structure, wherein the end-to-end voice command word recognition structure comprises a phoneme module used for extracting phoneme characteristics from audio, a word module used for extracting word characteristics from the phoneme characteristics and a semantic module used for extracting semantic characteristics from the word characteristics.
The content components are actions, targets and locations.
When the action is recognized first, the action of the command is obtained first, then input into the target recognition network to influence the target recognition result, and finally the action and the target are input into the position recognition network to influence the position recognition result; when the target is recognized first, the target of the command is obtained first, then input into the position recognition network to influence the position recognition result, and finally the target and the position are input into the action recognition network to influence the action recognition result; when the position is recognized first, the position of the command is obtained first, then input into the action recognition network to influence the action recognition result, and finally the position and the action are input into the target recognition network to influence the target recognition result.
In the recognition process, after the word features are extracted from the word information, dropout is used to prevent overfitting and down-sampling is used to adjust the dimensionality; the features are then input into a fully connected layer to obtain the probability of each candidate command word, and max pooling finds the command word with the highest probability.
After the command word with the highest probability is obtained, the discrete variable is mapped into a continuous vector space, generating a new representation in that space; that is, the recognized action and/or target is mapped to a vector.
The invention also provides a command word recognition system using mutual prior constraint models, based on an end-to-end voice command word recognition structure, wherein the end-to-end voice command word recognition structure comprises a phoneme module for extracting phoneme features from audio, a word module for extracting word features from the phoneme features and a semantic module for extracting semantic features from the word features, and the content components of a command are recognized in the semantic module in sequence under mutual prior constraints.
The semantic module comprises a recognition part and a mapping part, wherein:
the recognition part comprises a parallel multi-path structure, where the number N of parallel paths equals the number of content component classes to be recognized, and each path comprises a bidirectional recurrent neural network and a fully connected layer; after the semantic features of the input word features are extracted by the bidirectional recurrent neural network, dropout is used to prevent overfitting and down-sampling adjusts the dimensionality; the result is then input into a linear layer to obtain the probability of each candidate command word, and max pooling finds the command word with the highest probability;
the mapping part comprises a parallel multi-path structure with N-1 parallel paths, corresponding one-to-one to the first N-1 paths of the recognition part; each path of the mapping part maps a discrete variable into a continuous vector space through a function, generating a new representation in that space, i.e. a recognized discrete element such as the action and/or target and/or position is mapped to a vector, which is then combined with the output of the next recurrent neural network and input into its linear layer.
The parameters of the function are randomly initialized at the beginning of training, continuously updated according to the back-propagated error information as training proceeds, and finally a unique vector mapping is found for each discrete variable.
Compared with the prior art, the method utilizes the correlation among the variables, and greatly improves the identification accuracy.
Drawings
Fig. 1 is a schematic diagram of the overall structure of end-to-end voice command word recognition.
Fig. 2 is a schematic diagram of a specific structure of the phoneme module.
Fig. 3 is a specific structural diagram of the word module.
FIG. 4 is a schematic diagram of a specific structure of a semantic module.
Fig. 5 is a schematic diagram of a network structure after the improvement of the present invention.
Detailed Description
The embodiments of the present invention will be described in detail below with reference to the drawings and examples.
The invention relates to a command word recognition method using mutual prior constraint models, which improves on the existing end-to-end voice command word recognition structure. That structure comprises a phoneme module for extracting phoneme features from audio, a word module for extracting word features from the phoneme features and a semantic module for extracting semantic features from the word features; the phoneme module and the word module need to be pre-trained on a larger speech data set.
Take content components to be recognized consisting of an action, a target and a position as an example. Commands used in daily life constrain one another as priors: a command to turn a light on or off is often used in a garage, whereas a command to control color is not; commands that control temperature are used indoors more often than outdoors; and a command to change language is mostly directed at a smart device, not at furniture or lights.
To exploit this relation, the improvement of the invention lies in the semantic module: the command word contents are recognized in the semantic module in sequence, so that the independent recognition process becomes a sequential, mutually prior-constrained recognition process. That is, one content component is recognized first and then influences the recognition results of the other content components, thereby improving accuracy. Taking the case where the action is recognized first as an example, the action of the command is recognized first, then input into the target recognition network to influence the target recognition result, and finally the action and the target are input into the position recognition network to influence the position recognition result. The stronger the correlation between the action and the target, the greater the improvement brought by the model, which agrees well with everyday experience. The recognition accuracy is formulated as
p(A, O, L) = p(A) · p(O | A) · p(L | A, O)
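The contrast with the baseline can be made explicit: the chained form is exactly the chain rule of probability and always holds, while the baseline's product form is the special case in which A, O and L are assumed mutually independent. In LaTeX notation:

```latex
% Baseline (Fig. 4): holds only if A, O, L are mutually independent
p(A, O, L) = p(A)\, p(O)\, p(L)
% Proposed (Fig. 5): the chain rule of probability, which always holds
p(A, O, L) = p(A)\, p(O \mid A)\, p(L \mid A, O)
```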
Therefore, modifying the original independent model into the mutual prior constraint model greatly improves recognition accuracy. The method can be widely applied in offline end-to-end speech recognition and can run on embedded systems.
The improved semantic module network structure of the invention is shown in fig. 5, and comprises:
(1) Recognition part
The recognition part comprises a parallel three-path structure for recognizing the action, the target and the position respectively, and each path comprises a bidirectional recurrent neural network and a fully connected layer. As in the baseline system, after the semantic features of the input word features are extracted by the bidirectional recurrent neural network, dropout is used to prevent overfitting and down-sampling adjusts the dimensionality; the result is then input into a linear layer to obtain the probability of each candidate command word, and max pooling finds the command word with the highest probability.
(2) Mapping part
The mapping part comprises two parallel structures, one mapping the action and feeding it into the target recognition structure, the other mapping the action and the target and feeding them into the position recognition structure. Specifically, each path of the mapping part maps a discrete variable into a continuous vector space through a function, generating a new representation in that space; that is, a recognized discrete element such as the action and/or target is mapped to a vector, which is combined with the output of the next recurrent neural network and input into its linear layer.
The parameters of the function are randomly initialized at the beginning of training, updated according to the back-propagated error information as training proceeds, and finally a unique vector mapping is found for each discrete variable.
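Under the same assumptions as the earlier sketches, the improved semantic module for the action -> target -> position order might be wired as follows. This is an illustrative sketch, not the patent's code: nn.Embedding plays the role of the trainable mapping function (randomly initialized, updated by backpropagation), and the embedding width is a guess.

```python
import torch
import torch.nn as nn

class PriorConstrainedSemanticModule(nn.Module):
    """Improved semantic module of Fig. 5 (action -> target -> position):
    each recognized component is embedded into a continuous vector and
    concatenated onto the next path's RNN output, so earlier decisions
    act as priors on later ones. Sizes are illustrative assumptions."""

    def __init__(self, in_dim=256, hidden=128, emb_dim=16, dropout=0.5,
                 n_action=6, n_target=14, n_position=4):
        super().__init__()
        gru = lambda: nn.GRU(in_dim, hidden, batch_first=True,
                             bidirectional=True)
        self.rnn_a, self.rnn_t, self.rnn_p = gru(), gru(), gru()
        self.drop = nn.Dropout(dropout)
        # Mapping part: trainable lookup tables from a discrete label
        # to a point in a continuous vector space.
        self.emb_a = nn.Embedding(n_action, emb_dim)
        self.emb_t = nn.Embedding(n_target, emb_dim)
        self.fc_a = nn.Linear(2 * hidden, n_action)
        self.fc_t = nn.Linear(2 * hidden + emb_dim, n_target)
        self.fc_p = nn.Linear(2 * hidden + 2 * emb_dim, n_position)

    def forward(self, w):                        # w: (batch, frames, dim)
        pool = lambda s: s.max(dim=1).values     # max pooling over time
        tile = lambda v, t: v.unsqueeze(1).expand(-1, t, -1)

        xa = self.drop(self.rnn_a(w)[0])
        a_scores = pool(self.fc_a(xa))           # recognize the action
        a_vec = self.emb_a(a_scores.argmax(dim=1))

        xt = self.drop(self.rnn_t(w)[0])         # action constrains target
        t_scores = pool(self.fc_t(
            torch.cat([xt, tile(a_vec, xt.size(1))], dim=2)))
        t_vec = self.emb_t(t_scores.argmax(dim=1))

        xp = self.drop(self.rnn_p(w)[0])         # action and target constrain position
        p_scores = pool(self.fc_p(torch.cat(
            [xp, tile(a_vec, xp.size(1)), tile(t_vec, xp.size(1))], dim=2)))
        return a_scores, t_scores, p_scores
```

Note that although argmax itself is not differentiable, the embedding rows it selects still receive gradients through the later heads, which is consistent with the text's account of the mapping parameters being refined by back-propagated error.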
The recognition accuracy of the present invention on the Snips SLU dataset and Fluent Speech Commands dataset is shown in tables 1 and 2:
TABLE 1 Test accuracy on the Snips SLU data set (table rendered as an image in the original publication)
TABLE 2 Test accuracy on the Fluent Speech Commands data set (table rendered as an image in the original publication)
It can be seen that, both on the Snips SLU data set with single instructions and on the Fluent Speech Commands data set with complex instructions, the recognition accuracy of the proposed mutual prior constraint model improves over the original system. Because the correlations among action, target and position are stronger in the Fluent Speech Commands data set than in the Snips SLU data set, the accuracy advantage of the proposed model over the original model is larger on that data set.

Claims (10)

1. A command word recognition method using mutual prior constraint models, based on an end-to-end voice command word recognition structure, wherein the end-to-end voice command word recognition structure comprises a phoneme module for extracting phoneme features from audio, a word module for extracting word features from the phoneme features and a semantic module for extracting semantic features from the word features, characterized in that content components of a command are recognized in the semantic module in sequence, each recognized component being input into the networks recognizing the remaining components so as to constrain their results as a prior.
2. The command word recognition method using mutual prior constraint models according to claim 1, wherein the content components are actions, targets and locations.
3. The method for recognizing command words using mutual prior constraint models according to claim 2, wherein, when the action is recognized first, the action of the command is obtained first, then input into the target recognition network to influence the target recognition result, and finally the action and the target are input into the position recognition network to influence the position recognition result; when the target is recognized first, the target of the command is obtained first, then input into the position recognition network to influence the position recognition result, and finally the target and the position are input into the action recognition network to influence the action recognition result; when the position is recognized first, the position of the command is obtained first, then input into the action recognition network to influence the action recognition result, and finally the position and the action are input into the target recognition network to influence the target recognition result.
4. The method for recognizing command words using mutual prior constraint models according to claim 1, wherein, in the recognition process, after the word features are extracted from the word information, dropout is used to prevent overfitting and down-sampling is used to adjust the dimensionality; the features are then input into a fully connected layer to obtain the probability of each candidate command word, and max pooling finds the command word with the highest probability.
5. The method of claim 4, wherein, after the command word with the highest probability is obtained, the discrete variable is mapped into a continuous vector space to generate a new representation in that space, i.e. the recognized action and/or target is mapped to a vector.
6. A command word recognition system using mutual prior constraint models, based on an end-to-end voice command word recognition structure, wherein the end-to-end voice command word recognition structure comprises a phoneme module for extracting phoneme features from audio, a word module for extracting word features from the phoneme features and a semantic module for extracting semantic features from the word features, and content components of a command are recognized in the semantic module in sequence under mutual prior constraints.
7. The system according to claim 6, wherein the content components are actions, objects and locations.
8. The system of claim 6, wherein, when the action is recognized first, the action of the command is obtained first, then input into the target recognition network to influence the target recognition result, and finally the action and the target are input into the position recognition network to influence the position recognition result; when the target is recognized first, the target of the command is obtained first, then input into the position recognition network to influence the position recognition result, and finally the target and the position are input into the action recognition network to influence the action recognition result; when the position is recognized first, the position of the command is obtained first, then input into the action recognition network to influence the action recognition result, and finally the position and the action are input into the target recognition network to influence the target recognition result.
9. The command word recognition system using mutual prior constraint models according to claim 6, wherein the semantic module comprises a recognition part and a mapping part, wherein:
the recognition part comprises a parallel multi-path structure, where the number N of parallel paths equals the number of content component classes to be recognized, and each path comprises a bidirectional recurrent neural network and a fully connected layer; after the semantic features of the input word features are extracted by the bidirectional recurrent neural network, dropout is used to prevent overfitting and down-sampling adjusts the dimensionality; the result is then input into a linear layer to obtain the probability of each candidate command word, and max pooling finds the command word with the highest probability;
the mapping part comprises a parallel multi-path structure with N-1 parallel paths, corresponding one-to-one to the first N-1 paths of the recognition part; each path of the mapping part maps a discrete variable into a continuous vector space through a function, generating a new representation in that space, i.e. a recognized discrete element such as the action and/or target and/or position is mapped to a vector, which is then combined with the output of the next recurrent neural network and input into its linear layer.
10. The system of claim 9, wherein the parameters of the function are initialized at random at the beginning of training, and as the training progresses, the parameters of the function are updated according to the back-propagated error information, and finally a unique vector mapping is found for each discrete variable.
CN202010593154.1A (priority and filing date 2020-06-26) — Command word recognition method and system using mutual prior constraint model — CN111754981A (en), status: Pending

Priority Applications (1)

CN202010593154.1A — priority and filing date 2020-06-26 — Command word recognition method and system using mutual prior constraint model

Publications (1)

CN111754981A — published 2020-10-09

Family

ID=72677438

Country Status (1)

CN: CN111754981A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112489639A (en) * 2020-11-26 2021-03-12 北京百度网讯科技有限公司 Audio signal processing method, device, system, electronic equipment and readable medium
CN112750434A (en) * 2020-12-16 2021-05-04 马上消费金融股份有限公司 Method and device for optimizing voice recognition system and electronic equipment
CN113053377A (en) * 2021-03-23 2021-06-29 南京地平线机器人技术有限公司 Voice wake-up method and device, computer readable storage medium and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170372200A1 (en) * 2016-06-23 2017-12-28 Microsoft Technology Licensing, Llc End-to-end memory networks for contextual language understanding
CN107704866A (en) * 2017-06-15 2018-02-16 清华大学 Multitask Scene Semantics based on new neural network understand model and its application
US20180060301A1 (en) * 2016-08-31 2018-03-01 Microsoft Technology Licensing, Llc End-to-end learning of dialogue agents for information access
CN111092798A (en) * 2019-12-24 2020-05-01 东华大学 Wearable system based on spoken language understanding


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Elisavet Palogiannidi et al.: "End-to-End Architectures for ASR-Free Spoken Language Understanding", arXiv *
Loren Lugosch et al.: "Speech Model Pre-training for End-to-End Spoken Language Understanding", arXiv *
Zhou Qi'an et al.: "Improved model and tuning method for natural language understanding in BERT-based task-oriented dialogue systems" (基于BERT的任务导向对话系统自然语言理解的改进模型与调优方法), Journal of Chinese Information Processing (中文信息学报) *



Legal Events

PB01 — Publication
SE01 — Entry into force of request for substantive examination
RJ01 — Rejection of invention patent application after publication (application publication date: 2020-10-09)