CN111754981A - Command word recognition method and system using mutual prior constraint model

Command word recognition method and system using mutual prior constraint model

Info

Publication number: CN111754981A
Authority: CN (China)
Prior art keywords: recognition, word, action, target, command
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN202010593154.1A
Other languages: Chinese (zh)
Inventors: 曾可为 (Zeng Kewei), 杨毅 (Yang Yi), 孙甲松 (Sun Jiasong)
Current and original assignee: Tsinghua University (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Application filed by Tsinghua University
Priority and filing date: 2020-06-26 (priority to CN202010593154.1A; the priority date is an assumption and is not a legal conclusion)
Publication date: 2020-10-09 (publication of CN111754981A)


Classifications

    • G10L 15/02 — Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/063 — Training (Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
    • G10L 15/16 — Speech classification or search using artificial neural networks
    • G10L 15/26 — Speech to text systems
    • G10L 15/32 — Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems
    • G10L 2015/025 — Phonemes, fenemes or fenones being the recognition units
    • G06F 40/30 — Semantic analysis (Handling natural language data)
    • G06N 3/084 — Backpropagation, e.g. using gradient descent (Neural networks; Learning methods)


Abstract

The invention discloses a command word recognition method using mutual prior constraint models, based on an end-to-end voice command word recognition structure comprising a phoneme module for extracting phoneme features from audio, a word module for extracting word features from the phoneme features, and a semantic module for extracting semantic features from the word features. The content components of a command are recognized in the semantic module in sequence, so that each recognized component constrains, as a prior, the recognition of the remaining components. Compared with the prior art, the method exploits the correlation among these variables and greatly improves recognition accuracy.

Description

Command word recognition method and system using mutual prior constraint model
Technical Field
The invention belongs to the technical field of machine learning, and particularly relates to a command word recognition method and system using mutual prior constraint models.
Background
Most popular speech recognition products on the market use a traditional online semantic understanding system. Such a system sends the audio recorded by the device to a server, decodes the speech into text using a very large speech recognition model stored on the server, analyzes the semantic information contained in the text using natural language understanding technology, and finally sends the recognized semantics back to the client. Because two separate models are used, the system is bulky and inefficient, and semantic understanding can only be performed on the server, which introduces network latency, risks privacy disclosure, and makes security difficult to guarantee. Moreover, passing information between the two disjoint models of speech recognition and natural language understanding loses original audio information and compounds errors, degrading recognition accuracy. In addition, because bandwidth and resources are limited, an online semantic understanding system cannot keep recording and continuously upload audio to the server; a wake-word system, which can only use offline speech recognition, is needed to wake the device, start recording, and upload the voice information. This not only increases the complexity of the overall semantic understanding system but also inconveniences the user, especially when different products use different wake words.
In recent years, popular end-to-end semantic understanding systems have avoided these problems. First, an end-to-end semantic understanding system maps the acoustic feature sequence directly to semantic information using only one model. The semantic recognition error can therefore be optimized directly, which improves convergence speed and accuracy; error-prone intermediate steps such as search algorithms, language models and finite state transducers are avoided; and components of speech that carry semantics but cannot be expressed in text, such as stress and prosody, can be exploited. In addition, an end-to-end semantic understanding system has low computational cost and high recognition speed and does not require online speech recognition, so the wake-word system can be omitted, leaving the whole system with fewer recognition steps and making it more convenient to use.
The overall structure of end-to-end speech command word recognition is shown in fig. 1, and includes a phoneme module for extracting phoneme features from audio, a word module for extracting word features from phoneme features, and a semantic module for extracting semantic features from word features.
After the phoneme module and the word module are pre-trained on a data set, the fully connected layer used to map phoneme or word features to specific elements is removed, the two modules are spliced together, and the whole model is then trained end to end, with supervision, on the voice command word data set.
The specific structure of the phoneme module is shown in fig. 2; this module obtains phoneme features from the input audio. The input audio signal is first passed through an interpretable convolutional filter to extract feature information from the raw audio. The features are filtered by a one-dimensional max pooling layer, activated with a rectified linear unit, and then regularized with dropout (random discarding) to prevent overfitting. Finally, a bidirectional recurrent neural network extracts the phoneme features. The extracted phoneme features are again passed through dropout to prevent overfitting, then down-sampled to a suitable dimension and fed into the word module. During pre-training, the phoneme features are additionally mapped to specific phonemes by a fully connected layer, but this fully connected layer is not used in the actual end-to-end training.
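As a concrete illustration, a minimal PyTorch-style sketch of such a phoneme module is given below. It is a sketch under stated assumptions, not the patent's implementation: a plain Conv1d stands in for the interpretable convolutional filter (e.g. a SincNet-style filter bank), and the filter length, hidden size, dropout rate, downsampling stride and phoneme inventory are illustrative guesses.

```python
import torch
import torch.nn as nn

class PhonemeModule(nn.Module):
    """Conv filter bank -> max pool -> ReLU -> dropout -> BiGRU ->
    dropout -> temporal downsampling, following the pipeline in the
    text. All hyperparameters are illustrative assumptions."""

    def __init__(self, n_filters=40, hidden=128, n_phones=42,
                 dropout=0.5, stride=2):
        super().__init__()
        # Stand-in for the interpretable convolutional filter.
        self.filters = nn.Conv1d(1, n_filters, kernel_size=401, stride=80)
        self.pool = nn.MaxPool1d(kernel_size=2)   # 1-D max pooling
        self.act = nn.ReLU()                      # rectified linear unit
        self.drop = nn.Dropout(dropout)           # "random discarding"
        self.rnn = nn.GRU(n_filters, hidden, batch_first=True,
                          bidirectional=True)
        self.stride = stride                      # downsampling factor
        # Used only to map features to phonemes during pre-training;
        # removed when the module is spliced into the full model.
        self.pretrain_fc = nn.Linear(2 * hidden, n_phones)

    def forward(self, wav):                       # wav: (batch, samples)
        x = self.filters(wav.unsqueeze(1))        # (batch, filters, frames)
        x = self.drop(self.act(self.pool(x)))
        x, _ = self.rnn(x.transpose(1, 2))        # (batch, frames, 2*hidden)
        return self.drop(x)[:, ::self.stride, :]  # downsample in time
```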
The specific structure of the word module is shown in fig. 3; this module extracts word features from the phoneme features. A bidirectional RNN performs feature extraction on the input phoneme features, dropout prevents overfitting, and down-sampling adjusts the dimensionality to obtain the word features, which are fed into the semantic module. As with the phoneme module, the fully connected layer used for pre-training is omitted in the actual end-to-end training.
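Continuing the sketch, the word module reduces to a BiRNN with dropout and temporal downsampling; the input width of 256 matches the 2 x 128 output of the phoneme module sketch above and, like the word-inventory size, is an assumption rather than a figure from the text.

```python
import torch.nn as nn

class WordModule(nn.Module):
    """BiGRU over phoneme features, dropout, temporal downsampling.
    The pre-training word classifier head is kept separate so it can
    be dropped for end-to-end training, as described in the text."""

    def __init__(self, in_dim=256, hidden=128, n_words=200,
                 dropout=0.5, stride=2):
        super().__init__()
        self.rnn = nn.GRU(in_dim, hidden, batch_first=True,
                          bidirectional=True)
        self.drop = nn.Dropout(dropout)
        self.stride = stride
        self.pretrain_fc = nn.Linear(2 * hidden, n_words)  # pre-training only

    def forward(self, phone_feats):     # (batch, frames, in_dim)
        x, _ = self.rnn(phone_feats)
        return self.drop(x)[:, ::self.stride, :]
```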
The specific structure of the semantic module is shown in fig. 4; this module extracts semantic information from the word features. After word information is extracted by a bidirectional RNN layer, dropout is applied to prevent overfitting and down-sampling adjusts the dimensionality; the result is fed into a linear layer, which yields the probability of each candidate command word, and a max pooling layer finds the command word with the highest probability. The command words are mapped to three components, namely the action, the target and the position of a command, completing the end-to-end command word recognition.
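A matching sketch of the baseline semantic module follows: one BiRNN, then one linear head per component whose frame-wise scores are max-pooled over time. The class counts (6 actions, 14 targets, 4 positions, roughly as in the Fluent Speech Commands data set) are assumptions for illustration only.

```python
import torch.nn as nn

class BaselineSemanticModule(nn.Module):
    """Baseline of Fig. 4: the three components are scored
    independently, i.e. p(A,O,L) = p(A) * p(O) * p(L)."""

    def __init__(self, in_dim=256, hidden=128, dropout=0.5,
                 n_classes=(6, 14, 4)):   # (action, target, position)
        super().__init__()
        self.rnn = nn.GRU(in_dim, hidden, batch_first=True,
                          bidirectional=True)
        self.drop = nn.Dropout(dropout)
        self.heads = nn.ModuleList(
            [nn.Linear(2 * hidden, n) for n in n_classes])

    def forward(self, word_feats):        # (batch, frames, in_dim)
        x, _ = self.rnn(word_feats)
        x = self.drop(x)
        # Max pooling over time picks the highest frame-wise score for
        # each class, i.e. the most probable command word per component.
        return [head(x).max(dim=1).values for head in self.heads]
```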
In this model, the action, the target and the position are recognized synchronously and independently, with no sequential relation among them. The recognition accuracy in this case is formulated as
p(A, O, L) = p(A) · p(O) · p(L)
The accuracy of the method needs to be further improved.
Disclosure of Invention
In order to overcome the disadvantages of the prior art, the present invention provides a command word recognition method and system using mutual prior constraint models, which are used for end-to-end speech command word recognition and can greatly improve the recognition accuracy.
To achieve this purpose, the invention establishes the relation between command words by modifying the original independent models into mutually prior-constrained models. The specific technical scheme is as follows:
a command word recognition method using mutual prior constraint models is based on an end-to-end voice command word recognition structure, wherein the end-to-end voice command word recognition structure comprises a phoneme module used for extracting phoneme characteristics from audio, a word module used for extracting word characteristics from the phoneme characteristics and a semantic module used for extracting semantic characteristics from the word characteristics.
The content components are actions, targets and locations.
When the action is recognized first, the action of the command is obtained first, then input into the target recognition network to influence the target recognition result, and finally the action and the target are input into the position recognition network to influence the position recognition result; when the target is recognized first, the target of the command is obtained first, then input into the position recognition network to influence the position recognition result, and finally the target and the position are input into the action recognition network to influence the action recognition result; when the position is recognized first, the position of the command is obtained first, then input into the action recognition network to influence the action recognition result, and finally the position and the action are input into the target recognition network to influence the target recognition result.
In the recognition process, after the word features are extracted from the word information, dropout is used to prevent overfitting and down-sampling is used to adjust the dimensionality; the features are then input into a fully connected layer to obtain the probability of each candidate command word, and max pooling finds the command word with the highest probability.
After the command word with the highest probability is obtained, the discrete variable is mapped into a continuous vector space, generating a new representation in that space; that is, the recognized action and/or target is mapped to a vector.
The invention also provides a command word recognition system using mutual prior constraint models, based on an end-to-end voice command word recognition structure, wherein the end-to-end voice command word recognition structure comprises a phoneme module for extracting phoneme features from audio, a word module for extracting word features from the phoneme features and a semantic module for extracting semantic features from the word features, and the content components of a command are recognized in the semantic module in sequence under mutual prior constraints.
The semantic module comprises a recognition part and a mapping part, wherein:
the recognition part comprises a parallel multi-path structure, where the number N of parallel paths equals the number of content component classes to be recognized, and each path comprises a bidirectional recurrent neural network and a fully connected layer; after the semantic features of the input word features are extracted by the bidirectional recurrent neural network, dropout is used to prevent overfitting and down-sampling adjusts the dimensionality; the result is then input into a linear layer to obtain the probability of each candidate command word, and max pooling finds the command word with the highest probability;
the mapping part comprises a parallel multi-path structure with N-1 parallel paths, corresponding one-to-one to the first N-1 paths of the recognition part; each path of the mapping part maps a discrete variable into a continuous vector space through a function, generating a new representation in that space, i.e. a recognized discrete element such as the action and/or target and/or position is mapped to a vector, which is then combined with the output of the next recurrent neural network and input into its linear layer.
The parameters of the function are randomly initialized at the beginning of training, continuously updated according to the back-propagated error information as training proceeds, and finally a unique vector mapping is found for each discrete variable.
Compared with the prior art, the method utilizes the correlation among the variables, and greatly improves the identification accuracy.
Drawings
Fig. 1 is a schematic diagram of the overall structure of end-to-end voice command word recognition.
Fig. 2 is a schematic diagram of a specific structure of the phoneme module.
Fig. 3 is a specific structural diagram of the word module.
FIG. 4 is a schematic diagram of a specific structure of a semantic module.
Fig. 5 is a schematic diagram of a network structure after the improvement of the present invention.
Detailed Description
The embodiments of the present invention will be described in detail below with reference to the drawings and examples.
The invention relates to a command word recognition method using mutual prior constraint models, which improves on the existing end-to-end voice command word recognition structure. That structure comprises a phoneme module for extracting phoneme features from audio, a word module for extracting word features from the phoneme features and a semantic module for extracting semantic features from the word features; the phoneme module and the word module need to be pre-trained on a larger speech data set.
Take content components to be recognized consisting of an action, a target and a position as an example. Commands used in daily life constrain one another as priors: a command to turn a light on or off is often used in a garage, whereas a command to control color is not; commands that control temperature are used indoors more often than outdoors; and a command to change language is mostly directed at a smart device, not at furniture or lights.
To exploit this relation, the improvement of the invention lies in the semantic module: the command word contents are recognized in the semantic module in sequence, so that the independent recognition process becomes a sequential, mutually prior-constrained recognition process. That is, one content component is recognized first and then influences the recognition results of the other content components, thereby improving accuracy. Taking the case where the action is recognized first as an example, the action of the command is recognized first, then input into the target recognition network to influence the target recognition result, and finally the action and the target are input into the position recognition network to influence the position recognition result. The stronger the correlation between the action and the target, the greater the improvement brought by the model, which agrees well with everyday experience. The recognition accuracy is formulated as
p(A, O, L) = p(A) · p(O | A) · p(L | A, O)
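The contrast with the baseline can be made explicit: the chained form is exactly the chain rule of probability and always holds, while the baseline's product form is the special case in which A, O and L are assumed mutually independent. In LaTeX notation:

```latex
% Baseline (Fig. 4): holds only if A, O, L are mutually independent
p(A, O, L) = p(A)\, p(O)\, p(L)
% Proposed (Fig. 5): the chain rule of probability, which always holds
p(A, O, L) = p(A)\, p(O \mid A)\, p(L \mid A, O)
```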
Therefore, modifying the original independent model into the mutual prior constraint model greatly improves recognition accuracy. The method can be widely applied in offline end-to-end speech recognition and can run on embedded systems.
The improved semantic module network structure of the invention is shown in fig. 5, and comprises:
(1) Recognition part
The recognition part comprises a parallel three-path structure for recognizing the action, the target and the position respectively, and each path comprises a bidirectional recurrent neural network and a fully connected layer. As in the baseline system, after the semantic features of the input word features are extracted by the bidirectional recurrent neural network, dropout is used to prevent overfitting and down-sampling adjusts the dimensionality; the result is then input into a linear layer to obtain the probability of each candidate command word, and max pooling finds the command word with the highest probability.
(2) Mapping part
The mapping part comprises two parallel structures, one mapping the action and feeding it into the target recognition structure, the other mapping the action and the target and feeding them into the position recognition structure. Specifically, each path of the mapping part maps a discrete variable into a continuous vector space through a function, generating a new representation in that space; that is, a recognized discrete element such as the action and/or target is mapped to a vector, which is combined with the output of the next recurrent neural network and input into its linear layer.
The parameters of the function are randomly initialized at the beginning of training, updated according to the back-propagated error information as training proceeds, and finally a unique vector mapping is found for each discrete variable.
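Under the same assumptions as the earlier sketches, the improved semantic module for the action -> target -> position order might be wired as follows. This is an illustrative sketch, not the patent's code: nn.Embedding plays the role of the trainable mapping function (randomly initialized, updated by backpropagation), and the embedding width is a guess.

```python
import torch
import torch.nn as nn

class PriorConstrainedSemanticModule(nn.Module):
    """Improved semantic module of Fig. 5 (action -> target -> position):
    each recognized component is embedded into a continuous vector and
    concatenated onto the next path's RNN output, so earlier decisions
    act as priors on later ones. Sizes are illustrative assumptions."""

    def __init__(self, in_dim=256, hidden=128, emb_dim=16, dropout=0.5,
                 n_action=6, n_target=14, n_position=4):
        super().__init__()
        gru = lambda: nn.GRU(in_dim, hidden, batch_first=True,
                             bidirectional=True)
        self.rnn_a, self.rnn_t, self.rnn_p = gru(), gru(), gru()
        self.drop = nn.Dropout(dropout)
        # Mapping part: trainable lookup tables from a discrete label
        # to a point in a continuous vector space.
        self.emb_a = nn.Embedding(n_action, emb_dim)
        self.emb_t = nn.Embedding(n_target, emb_dim)
        self.fc_a = nn.Linear(2 * hidden, n_action)
        self.fc_t = nn.Linear(2 * hidden + emb_dim, n_target)
        self.fc_p = nn.Linear(2 * hidden + 2 * emb_dim, n_position)

    def forward(self, w):                        # w: (batch, frames, dim)
        pool = lambda s: s.max(dim=1).values     # max pooling over time
        tile = lambda v, t: v.unsqueeze(1).expand(-1, t, -1)

        xa = self.drop(self.rnn_a(w)[0])
        a_scores = pool(self.fc_a(xa))           # recognize the action
        a_vec = self.emb_a(a_scores.argmax(dim=1))

        xt = self.drop(self.rnn_t(w)[0])         # action constrains target
        t_scores = pool(self.fc_t(
            torch.cat([xt, tile(a_vec, xt.size(1))], dim=2)))
        t_vec = self.emb_t(t_scores.argmax(dim=1))

        xp = self.drop(self.rnn_p(w)[0])         # action and target constrain position
        p_scores = pool(self.fc_p(torch.cat(
            [xp, tile(a_vec, xp.size(1)), tile(t_vec, xp.size(1))], dim=2)))
        return a_scores, t_scores, p_scores
```

Note that although argmax itself is not differentiable, the embedding rows it selects still receive gradients through the later heads, which is consistent with the text's account of the mapping parameters being refined by back-propagated error.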
The recognition accuracy of the present invention on the Snips SLU dataset and Fluent Speech Commands dataset is shown in tables 1 and 2:
TABLE 1 Test accuracy on the Snips SLU data set (table rendered as an image in the original publication)
TABLE 2 Test accuracy on the Fluent Speech Commands data set (table rendered as an image in the original publication)
It can be seen that, both on the Snips SLU data set with single instructions and on the Fluent Speech Commands data set with complex instructions, the recognition accuracy of the proposed mutual prior constraint model improves over the original system. Because the correlations among action, target and position are stronger in the Fluent Speech Commands data set than in the Snips SLU data set, the accuracy advantage of the proposed model over the original model is larger on that data set.

Claims (10)

1. A command word recognition method using mutual prior constraint models, based on an end-to-end voice command word recognition structure, wherein the end-to-end voice command word recognition structure comprises a phoneme module for extracting phoneme features from audio, a word module for extracting word features from the phoneme features and a semantic module for extracting semantic features from the word features, characterized in that content components of a command are recognized in the semantic module in sequence, each recognized component being input into the networks recognizing the remaining components so as to constrain their results as a prior.
2. The command word recognition method using mutual prior constraint models according to claim 1, wherein the content components are actions, targets and locations.
3. The method for recognizing command words using mutual prior constraint models according to claim 2, wherein, when the action is recognized first, the action of the command is obtained first, then input into the target recognition network to influence the target recognition result, and finally the action and the target are input into the position recognition network to influence the position recognition result; when the target is recognized first, the target of the command is obtained first, then input into the position recognition network to influence the position recognition result, and finally the target and the position are input into the action recognition network to influence the action recognition result; when the position is recognized first, the position of the command is obtained first, then input into the action recognition network to influence the action recognition result, and finally the position and the action are input into the target recognition network to influence the target recognition result.
4. The method for recognizing command words using mutual prior constraint models according to claim 1, wherein, in the recognition process, after the word features are extracted from the word information, dropout is used to prevent overfitting and down-sampling is used to adjust the dimensionality; the features are then input into a fully connected layer to obtain the probability of each candidate command word, and max pooling finds the command word with the highest probability.
5. The method of claim 4, wherein, after the command word with the highest probability is obtained, the discrete variable is mapped into a continuous vector space to generate a new representation in that space, i.e. the recognized action and/or target is mapped to a vector.
6. A command word recognition system using mutual prior constraint models, based on an end-to-end voice command word recognition structure, wherein the end-to-end voice command word recognition structure comprises a phoneme module for extracting phoneme features from audio, a word module for extracting word features from the phoneme features and a semantic module for extracting semantic features from the word features, and content components of a command are recognized in the semantic module in sequence under mutual prior constraints.
7. The system according to claim 6, wherein the content components are actions, objects and locations.
8. The system of claim 6, wherein, when the action is recognized first, the action of the command is obtained first, then input into the target recognition network to influence the target recognition result, and finally the action and the target are input into the position recognition network to influence the position recognition result; when the target is recognized first, the target of the command is obtained first, then input into the position recognition network to influence the position recognition result, and finally the target and the position are input into the action recognition network to influence the action recognition result; when the position is recognized first, the position of the command is obtained first, then input into the action recognition network to influence the action recognition result, and finally the position and the action are input into the target recognition network to influence the target recognition result.
9. The command word recognition system using mutual prior constraint models according to claim 6, wherein the semantic module comprises a recognition part and a mapping part, wherein:
the recognition part comprises a parallel multi-path structure, where the number N of parallel paths equals the number of content component classes to be recognized, and each path comprises a bidirectional recurrent neural network and a fully connected layer; after the semantic features of the input word features are extracted by the bidirectional recurrent neural network, dropout is used to prevent overfitting and down-sampling adjusts the dimensionality; the result is then input into a linear layer to obtain the probability of each candidate command word, and max pooling finds the command word with the highest probability;
the mapping part comprises a parallel multi-path structure with N-1 parallel paths, corresponding one-to-one to the first N-1 paths of the recognition part; each path of the mapping part maps a discrete variable into a continuous vector space through a function, generating a new representation in that space, i.e. a recognized discrete element such as the action and/or target and/or position is mapped to a vector, which is then combined with the output of the next recurrent neural network and input into its linear layer.
10. The system of claim 9, wherein the parameters of the function are initialized at random at the beginning of training, and as the training progresses, the parameters of the function are updated according to the back-propagated error information, and finally a unique vector mapping is found for each discrete variable.
CN202010593154.1A (priority and filing date 2020-06-26) — Command word recognition method and system using mutual prior constraint model — CN111754981A (en), status: Pending

Priority Applications (1)

CN202010593154.1A — priority and filing date 2020-06-26 — Command word recognition method and system using mutual prior constraint model

Publications (1)

CN111754981A — published 2020-10-09

Family

ID=72677438

Country Status (1)

CN: CN111754981A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112489639A (en) * 2020-11-26 2021-03-12 北京百度网讯科技有限公司 Audio signal processing method, device, system, electronic equipment and readable medium
CN112750434A (en) * 2020-12-16 2021-05-04 马上消费金融股份有限公司 Method and device for optimizing voice recognition system and electronic equipment
CN113053377A (en) * 2021-03-23 2021-06-29 南京地平线机器人技术有限公司 Voice wake-up method and device, computer readable storage medium and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170372200A1 (en) * 2016-06-23 2017-12-28 Microsoft Technology Licensing, Llc End-to-end memory networks for contextual language understanding
CN107704866A (en) * 2017-06-15 2018-02-16 清华大学 Multitask Scene Semantics based on new neural network understand model and its application
US20180060301A1 (en) * 2016-08-31 2018-03-01 Microsoft Technology Licensing, Llc End-to-end learning of dialogue agents for information access
CN111092798A (en) * 2019-12-24 2020-05-01 东华大学 Wearable system based on spoken language understanding


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Elisavet Palogiannidi et al.: "End-to-End Architectures for ASR-Free Spoken Language Understanding", arXiv *
Loren Lugosch et al.: "Speech Model Pre-training for End-to-End Spoken Language Understanding", arXiv *
Zhou Qi'an et al.: "Improved model and tuning method for natural language understanding in BERT-based task-oriented dialogue systems" (基于BERT的任务导向对话系统自然语言理解的改进模型与调优方法), Journal of Chinese Information Processing (中文信息学报) *



Legal Events

PB01 — Publication
SE01 — Entry into force of request for substantive examination
RJ01 — Rejection of invention patent application after publication (application publication date: 2020-10-09)