CN111092798B - Wearable system based on spoken language understanding - Google Patents
- Publication number
- CN111092798B (application CN201911344765.6A)
- Authority
- CN
- China
- Prior art keywords
- language understanding
- spoken language
- voice
- input
- layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L12/00—Data switching networks
- H04L12/28—Data switching networks characterised by path configuration, e.g. LAN [Local Area Networks] or WAN [Wide Area Networks]
- H04L12/2803—Home automation networks
- H04L12/2816—Controlling appliance services of a home automation network by calling their functionalities
- H04L12/282—Controlling appliance services of a home automation network by calling their functionalities based on user interaction within the home
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems
- G10L15/30—Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04B—TRANSMISSION
- H04B1/00—Details of transmission systems, not covered by a single one of groups H04B3/00 - H04B13/00; Details of transmission systems not characterised by the medium used for transmission
- H04B1/38—Transceivers, i.e. devices in which transmitter and receiver form a structural unit and in which at least one part is used for functions of transmitting and receiving
- H04B1/3827—Portable transceivers
- H04B1/385—Transceivers carried on the body, e.g. in helmets
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M1/00—Substation equipment, e.g. for use by subscribers
- H04M1/72—Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
- H04M1/724—User interfaces specially adapted for cordless or mobile telephones
- H04M1/72403—User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality
Abstract
The invention relates to a design method for a wearable system based on spoken language understanding, comprising the following steps: first, the wearable device picks up the user's speech and transmits it via Bluetooth to a smartphone APP; the APP forwards the speech signal to a spoken language understanding server, which recognizes the keyword intent text directly from the speech signal and returns it to the phone. The smartphone then controls the actions of other smart devices according to the received intent, for example switching the kitchen light on and off. Smart-home devices can thus be controlled quickly, accurately and conveniently, simplifying people's lives and improving well-being.
Description
Technical Field
The invention relates to a wearable system based on spoken language understanding, applicable to smart homes or internet-of-things systems. The wearable device picks up voice signals in real time, the mobile phone serves as transmission medium and control center, and ordinary voice utterances are recognized as control commands, so that smart-home devices are controlled accurately and daily life becomes more convenient.
Background
With the rapid development of computer system performance, in particular the marked improvement of graphics card performance, and continuous breakthroughs in speech processing, natural language processing and machine learning, research on spoken language understanding systems has advanced considerably in recent years.
Current spoken language understanding systems generally adopt a pipeline model consisting of two parts: automatic speech recognition (ASR) and natural language understanding (NLU). More and more products now also integrate a knowledge base, mainly introduced in the dialogue management module. This divide-and-conquer approach lets each subtask be modeled independently and is simple to implement.
According to the current state of research, no product yet combines spoken language understanding with wearable devices. Mainstream intelligent products, such as "Tmall Genie" smart speakers, can realize simple human-computer interaction but have three obvious disadvantages: first, they are bulky and inconvenient to carry; second, they have no mobile power supply and must remain plugged in to work normally; third, their far-field pickup is poor and the covered area is limited.
Disclosure of Invention
The purpose of the invention is to control household intelligent devices anytime and anywhere by training a spoken language understanding model and exploiting the convenience of wearable devices.
In order to achieve the above object, a technical solution of the present invention is to provide a wearable system based on spoken language understanding, including:
the wearable device is worn on the user's body, collects the voice signals issued by the user to control smart-home actions, and forwards the collected voice signals to the smartphone used by the user;
the smartphone APP runs on the smartphone; on the one hand it receives the voice signals forwarded by the wearable device and uploads them to the spoken language understanding server, and on the other hand it receives the intent text issued by the spoken language understanding server and controls the smart home according to the intent text;
the spoken language understanding server runs a spoken language understanding model that generates intent text from the voice signal uploaded by the smartphone; the model comprises a speech recognition module ASR and a natural language understanding module NLU; the pre-processed voice signal is recognized by the speech recognition module ASR to obtain text information, and the natural language understanding module NLU derives a semantic analysis result from the text information to form the intent text, wherein:
the speech recognition module ASR realizes end-to-end speech recognition through a recurrent neural network (RNN) and connectionist temporal classification (CTC);
the natural language understanding module NLU is composed of an input layer, a first hidden layer, a second hidden layer and an output layer:
given an input sentence S = {w_1, w_2, ..., w_T}, where w_i represents the i-th word in the sentence and T the sentence length, the input layer of the natural language understanding module NLU uses word embedding to convert each word text w_i into a word vector x_i;
The processed word vector x_i is fed into the first hidden layer, which selects GRUs as its neural units. After the recurrent neurons, a second hidden layer is trained; in that training, the source input of a neuron consists of 3 parts: the output h_t^(1) of the previous layer, the activation h_(t-1)^(2) of the neuron at the previous time, and the output y_(t-1) of the output layer at the previous moment. The state value h_t^(2) of the second hidden layer is:

h_t^(2) = σ(W_h2 · [h_t^(1); h_(t-1)^(2)] + W_y · y_(t-1))

where σ is the activation function and W_h2, W_y are coefficient matrices;
the output layer uses a softmax classification function to solve the multi-sequence labeling problem; its output y_t^k denotes the probability of category k at the t-th word. There are 3 classification categories in total, namely place, object and action, and the probabilities of all categories at one word sum to 1.
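As a hedged illustration of the output layer just described, the sketch below applies a softmax classification function over the three slot categories (place, object, action); the per-word logits and the word chosen are invented for the example, not taken from the patent.

```python
import math

# Slot categories named in the patent's NLU output layer.
LABELS = ["place", "object", "action"]

def softmax(logits):
    """Softmax classification: turn raw scores into probabilities
    that sum to 1 across all categories for one word."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Invented logits for the word "kitchen" in "turn on the kitchen light".
probs = softmax([4.0, 0.5, 0.2])
assert abs(sum(probs) - 1.0) < 1e-9     # probabilities at one word sum to 1
best = LABELS[probs.index(max(probs))]  # highest-probability slot category
```

Here `best` resolves to `"place"`, matching the intuition that "kitchen" fills the place slot of a smart-home command.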
Preferably, the speech recognition module ASR first pre-processes the input speech signal; the pre-processing includes pre-emphasis, silence removal, and windowing and framing. After pre-processing, the speech signal becomes many small segments, each defined as one frame of the speech waveform. MFCC feature extraction turns each frame of the waveform into a multi-dimensional vector, and the set X = {x_1, x_2, ..., x_T} represents the feature sequence of the current speech signal, where T is the total number of frames of the speech waveform. A recurrent neural network is trained, with GRUs selected as its hidden-layer neurons; via softmax, the network outputs the posterior probability matrix y of the characters to be recognized, defined as:

y = (y_1, y_2, ..., y_t, ..., y_T)

where the t-th column of y is y_t = (y_t^1, y_t^2, ..., y_t^N)^T, and y_t^n represents the probability that the pronunciation at the t-th frame is character n, N being the number of character classes. The probabilities of all character classes at one frame sum to 1:

Σ_{n=1}^{N} y_t^n = 1
and then enters the CTC output layer where the CTC only care as to whether the input sequence is close to the true sequence as a loss function, and not whether each result in the predicted output sequence is exactly aligned in time with the input sequence. The CTC adds a blank for marking the invalid voice of the label, and removes repeated phonemes and blanks by calculating the forward and backward loss calculation to realize the result of the recognized text with the output length far smaller than the input length.
The wearable equipment system mainly comprises two parts. One is the pickup part (which can be a bow tie, bracelet, etc.), mainly used to collect the user's voice signals at close range in real time; the other is the smartphone part, which acts like a transport hub, mainly used to transmit voice signals to the spoken language understanding server and to control smart-home actions according to the spoken language understanding result.
At present, mainstream intelligent products such as "Tmall Genie" smart speakers can realize simple human-computer interaction, but three obvious disadvantages exist: they are bulky and inconvenient to carry; they have no mobile power supply and must remain plugged in to work normally; and their far-field pickup is poor, so the covered area is limited. To address these defects, the wearable device system is adopted. The system mainly comprises two parts: a pickup part (which can be a bow tie, bracelet, etc.), mainly used to collect the user's voice signals at close range in real time; and a smartphone part, which acts like a transport hub, mainly used to transmit voice signals to the spoken language understanding server and to control smart-home actions according to the spoken language understanding result.
In the spoken language understanding scheme of the invention, speech recognition and natural language understanding are modeled independently, and the output of the previous module is fed to the next module in a pipeline. The speech recognition part uses a recurrent neural network with CTC to realize end-to-end recognition; compared with traditional speech recognition it needs no separate language model, is simpler and easier to debug, and is more accurate, though it demands a larger training dataset. The natural language understanding part uses GRUs as the neurons of a deep recurrent neural network, which effectively avoids gradient explosion or vanishing and, having fewer parameters than LSTM, is easier to train. The divide-and-conquer approach lets each subtask be modeled independently and is simple to implement.
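The pipeline arrangement described above (ASR and NLU modeled independently, with the previous module's output fed to the next as input) can be sketched with stub modules. Both functions below are placeholders standing in for the real RNN+CTC recognizer and GRU slot tagger, and the keyword rules are invented for the example.

```python
def asr_stub(audio_frames):
    """Placeholder recognizer: pretend the frames decode to one command."""
    return "turn on the kitchen light"

def nlu_stub(text):
    """Placeholder slot tagger keyed on invented keywords."""
    slots = {}
    if "kitchen" in text:
        slots["location"] = "kitchen"
    if "light" in text:
        slots["object"] = "lamp"
    if "turn on" in text:
        slots["action"] = "open"
    return slots

def spoken_language_understanding(audio_frames):
    # Pipeline: the previous module's output is the next module's input.
    return nlu_stub(asr_stub(audio_frames))

print(spoken_language_understanding([0.0] * 160))
# -> {'location': 'kitchen', 'object': 'lamp', 'action': 'open'}
```

Because each stage has a plain text interface, either stub can later be replaced by a trained model without touching the other, which is exactly the appeal of the divide-and-conquer design.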
Drawings
FIG. 1 is a first scheme of wearable system composition based on spoken language understanding;
FIG. 2 is a second scheme of wearable system composition based on spoken language understanding;
FIG. 3 is a schematic diagram of a GRU;
FIG. 4 is the deep recurrent neural network of the ASR model;
FIG. 5 is the deep recurrent neural network of the NLU model.
Detailed Description
The invention is further elucidated with reference to the drawings. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that, after reading the teaching of the present invention, those skilled in the art may make various changes or modifications, and such equivalents fall within the scope of the present invention as defined in the appended claims.
The embodiment of the invention relates to a design method of a wearable system based on spoken language understanding, which comprises the following modules as shown in figure 1:
(1) Voice signal collection: this part can be realized by a wearable device such as a brooch, button or bracelet. It contains a miniature high-performance microphone to pick up the user's voice signals, a speech processing module for noise reduction and analog-to-digital conversion, and a Bluetooth module for transmission to the smartphone.
(2) Smartphone APP: the APP development comprises three parts. The first is the Bluetooth part, used for the connection between the phone and the wearable device; the second transmits the voice data to the spoken language understanding server; the third receives and displays the recognition result, outputs the spoken language understanding result to the phone interface, connects to the smart devices in the home through a personal area network, and switches the smart devices on and off according to the spoken-language control command.
(3) Spoken language understanding server: the server providing the spoken language understanding service is the core of the invention, and the spoken language understanding model runs on it. The model is divided into two parts: a speech recognition module ASR and a natural language understanding module NLU. The speech signal is pre-processed and passed to the ASR module to obtain recognized text; the text information is passed to the NLU module, which outputs the semantic analysis result. For example, the voice input "turn on the kitchen light" resolves to {location: kitchen, object: lamp, action: open}.
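To make the example above concrete, here is a hedged sketch of how the phone-side APP might dispatch such a parsed result to an appliance; the Device class and the registry are invented for illustration and are not part of the patent.

```python
class Device:
    """Toy stand-in for a smart-home appliance reachable from the APP."""
    def __init__(self, name):
        self.name = name
        self.on = False

    def apply(self, action):
        # The parse in the text uses "open"/"close" for on and off.
        self.on = (action == "open")
        return f"{self.name} {'on' if self.on else 'off'}"

# Invented registry mapping (location, object) slots to appliances.
registry = {("kitchen", "lamp"): Device("kitchen lamp")}

def control(intent):
    """Dispatch a parsed intent {location, object, action} to its device."""
    device = registry[(intent["location"], intent["object"])]
    return device.apply(intent["action"])

print(control({"location": "kitchen", "object": "lamp", "action": "open"}))
# -> kitchen lamp on
```

The (location, object) pair acts as the device address while the action slot selects the operation, mirroring how the intent text separates where, what, and how.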
The speech recognition module ASR implements end-to-end speech recognition through a recurrent neural network (RNN) and connectionist temporal classification (CTC). The input speech signal is first pre-emphasized, silence-trimmed, windowed and framed, after which the speech becomes many small segments. Since the raw waveform has little descriptive power in the time domain, it must be transformed; the invention adopts MFCC feature extraction, which, following the physiological characteristics of the human ear, turns each frame of the waveform into a multi-dimensional vector. The set X = {x_1, x_2, ..., x_T} represents the feature sequence corresponding to the speech, where T is the number of frames. Considering the continuity of a voice signal in time, a recurrent neural network is selected; considering that GRUs effectively counter gradient explosion or vanishing and have few parameters, GRUs are chosen as the hidden nodes of the network. Softmax outputs the posterior probability matrix y of the characters to be recognized, defined as:
y=(y1,y2,...,yt,...,yT)
where the t-th column of y is y_t = (y_t^1, y_t^2, ..., y_t^N)^T, N being the number of character classes (the 26 English letters plus blank); y_t^n represents the probability that the pronunciation at the t-th frame is character n, and the probabilities of all character classes at one frame sum to 1:

Σ_{n=1}^{N} y_t^n = 1
The matrix then enters the CTC output layer. CTC, serving as the loss function, cares only whether the output sequence is close to the true sequence, not whether each result in the predicted output sequence is exactly aligned in time with the input sequence. CTC adds a blank symbol to mark non-speech in the labels; through forward-backward loss computation and the removal of repeated phonemes and blanks, it produces a recognized text whose length is far smaller than the input length. The network architecture is shown in figure 4.
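The windowing-and-framing pre-processing described in this section (splitting the signal into "many small segments", each one frame of the waveform) can be sketched as follows. The 25 ms frame and 10 ms hop at 16 kHz are conventional values, not taken from the patent, and real MFCC extraction would then run on each windowed frame.

```python
import math

def frame_signal(samples, frame_len=400, hop=160):
    """Split a speech signal into overlapping frames and apply a Hamming
    window to each; every returned list is one frame of the waveform."""
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        chunk = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames

frames = frame_signal([0.0] * 1600)  # 0.1 s of audio at 16 kHz
print(len(frames))                   # -> 8
```

The overlap between consecutive frames is what preserves the temporal continuity that the recurrent network later exploits.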
After the recognized text is obtained from the speech recognition module ASR, it enters the natural language understanding module NLU. This module is formed by a four-layer network structure. Given an input sentence W = {w_1, w_2, ..., w_M}, where w_i represents the i-th word in the sentence and M the sentence length, the input layer uses word embedding to convert each word text w_i into a word vector. The processed word vectors are fed into the first hidden layer, which again uses GRU neural units. After the recurrent neurons, a second hidden layer is trained; in that layer the source input of a neuron consists of 3 parts: the output h_t^(1) of the previous layer, the activation h_(t-1)^(2) of the neuron at the previous time, and the output y_(t-1) of the output layer at the previous moment. The state value h_t^(2) of this layer is:

h_t^(2) = σ(W_h2 · [h_t^(1); h_(t-1)^(2)] + W_y · y_(t-1))

where σ is the activation function and W_h2, W_y are coefficient matrices. The output layer uses a softmax classification function to solve the multi-sequence labeling problem. Since the invention centers on smart-home control commands, the outputs are the slot types place, object and action; accordingly, y_t^k denotes the probability of category k at the t-th word, where the number of classification categories is 4 in total, namely place, object, action, and NULL, and the probabilities of all categories at one word sum to 1. The specific neural network structure is shown in fig. 5.
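One time step of the second-hidden-layer update just described can be sketched in plain Python. The exact combination used below (concatenating the three source inputs into one vector, then applying a weighted sum and a sigmoid) is a reconstruction consistent with the description, not verbatim from the patent.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def second_hidden_step(h1_t, h2_prev, y_prev, W):
    """One time step of the second hidden layer: the source input stacks
    the previous layer's output h1_t, this layer's own previous activation
    h2_prev, and the output layer's previous output y_prev, then each unit
    applies a weighted sum followed by the activation function."""
    src = h1_t + h2_prev + y_prev  # list concatenation = stacked input
    return [sigmoid(sum(w * s for w, s in zip(row, src))) for row in W]

# 2 hidden units over a 7-dimensional stacked input (2 + 2 + 3 slots).
h2_t = second_hidden_step([0.2, -0.1], [0.0, 0.0], [1.0, 0.0, 0.0],
                          [[0.1] * 7, [0.1] * 7])
```

Feeding the previous output y_(t-1) back into the hidden state is what lets the tagger condition each word's label on the label it just emitted.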
The test vectors are written into a program to test the trained network, and the training results are compared against the expected data to observe recognition accuracy and improve the algorithm. The practicality of the method lies in that the spoken language understanding algorithm can understand the keywords of the user's speech in real time, quickly recognize the user's intent, and control smart-home actions through the smartphone and the home network.
The wearable system based on spoken language understanding can accurately control the smart home anytime and anywhere, thereby simplifying people's lives, saving time, and greatly improving quality of life.
Claims (1)
1. A wearable system based on spoken language understanding, comprising:
the wearable device is worn on the user's body, collects the voice signals issued by the user to control smart-home actions, and forwards the collected voice signals to the smartphone used by the user;
the smartphone APP runs on the smartphone; on the one hand it receives the voice signals forwarded by the wearable device and uploads them to the spoken language understanding server, and on the other hand it receives the intent text issued by the spoken language understanding server and controls the smart home according to the intent text;
the spoken language understanding server runs a spoken language understanding model that generates intent text from the voice signal uploaded by the smartphone; the model comprises a speech recognition module ASR and a natural language understanding module NLU; the pre-processed voice signal is recognized by the speech recognition module ASR to obtain text information, and the natural language understanding module NLU derives a semantic analysis result from the text information to form the intent text, wherein:
the speech recognition module ASR realizes end-to-end speech recognition through a recurrent neural network (RNN) and connectionist temporal classification (CTC);
the natural language understanding module NLU is composed of an input layer, a first hidden layer, a second hidden layer and an output layer:
given an input sentence S = {w_1, w_2, ..., w_T}, where w_i represents the i-th word in the sentence and T the sentence length, the input layer of the natural language understanding module NLU uses word embedding to convert each word text w_i into a word vector x_i;
The processed word vector x_i is fed into the first hidden layer, which selects GRUs as its neural units. After the recurrent neurons, a second hidden layer is trained; in that training, the source input of a neuron consists of 3 parts: the output h_t^(1) of the previous layer, the activation h_(t-1)^(2) of the neuron at the previous time, and the output y_(t-1) of the output layer at the previous moment. The state value h_t^(2) of the second hidden layer is:

h_t^(2) = σ(W_h2 · [h_t^(1); h_(t-1)^(2)] + W_y · y_(t-1))

where σ is the activation function and W_h2, W_y are coefficient matrices;
the output layer uses a softmax classification function to solve the multi-sequence labeling problem; its output y_t^k denotes the probability of category k at the t-th word, where the number of classification categories is 3 in total, namely place, object and action, and the probabilities of all categories at one word sum to 1;
the speech recognition module ASR first pre-processes the input speech signal, the pre-processing including pre-emphasis, silence removal, and windowing and framing; after pre-processing the speech signal becomes many small segments, each defined as one frame of the speech waveform; MFCC feature extraction turns each frame of the waveform into a multi-dimensional vector, and the set X = {x_1, x_2, ..., x_T} represents the feature sequence of the current speech signal, T being the total number of frames of the speech waveform; a recurrent neural network is trained, with GRUs selected as its hidden-layer neurons, and the network outputs via softmax the posterior probability matrix y of the characters to be recognized, defined as:
y=(y1,y2,...,yt,...,yT)
where the t-th column of y is y_t = (y_t^1, y_t^2, ..., y_t^N)^T; y_t^n represents the probability that the pronunciation at the t-th frame is character n, N being the number of character classes, and the probabilities of all character classes at one frame sum to 1:

Σ_{n=1}^{N} y_t^n = 1
the matrix then enters the CTC output layer, where CTC, serving as the loss function, concerns only whether the output sequence is close to the true sequence, not whether each result in the predicted output sequence is exactly aligned in time with the input sequence; CTC adds a blank symbol to mark non-speech in the labels and, through forward-backward loss computation and the removal of repeated phonemes and blanks, produces a recognized text whose length is far smaller than the input length.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911344765.6A CN111092798B (en) | 2019-12-24 | 2019-12-24 | Wearable system based on spoken language understanding |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111092798A CN111092798A (en) | 2020-05-01 |
CN111092798B true CN111092798B (en) | 2021-06-11 |
Family
ID=70395373
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911344765.6A Active CN111092798B (en) | 2019-12-24 | 2019-12-24 | Wearable system based on spoken language understanding |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111092798B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111754981A (en) * | 2020-06-26 | 2020-10-09 | 清华大学 | Command word recognition method and system using mutual prior constraint model |
CN113793599B (en) * | 2021-09-15 | 2023-09-29 | 北京百度网讯科技有限公司 | Training method of voice recognition model, voice recognition method and device |
CN114596947A (en) * | 2022-03-08 | 2022-06-07 | 北京百度网讯科技有限公司 | Diagnosis and treatment department recommendation method and device, electronic equipment and storage medium |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105139864A (en) * | 2015-08-17 | 2015-12-09 | 北京天诚盛业科技有限公司 | Voice recognition method and voice recognition device |
CN105895087A (en) * | 2016-03-24 | 2016-08-24 | 海信集团有限公司 | Voice recognition method and apparatus |
CN106469552A (en) * | 2015-08-20 | 2017-03-01 | 三星电子株式会社 | Speech recognition apparatus and method |
CN106782497A (en) * | 2016-11-30 | 2017-05-31 | 天津大学 | A kind of intelligent sound noise reduction algorithm based on Portable intelligent terminal |
CN107767863A (en) * | 2016-08-22 | 2018-03-06 | 科大讯飞股份有限公司 | voice awakening method, system and intelligent terminal |
CN108268452A (en) * | 2018-01-15 | 2018-07-10 | 东北大学 | A kind of professional domain machine synchronous translation device and method based on deep learning |
CN109767759A (en) * | 2019-02-14 | 2019-05-17 | 重庆邮电大学 | End-to-end speech recognition methods based on modified CLDNN structure |
US10373610B2 (en) * | 2017-02-24 | 2019-08-06 | Baidu Usa Llc | Systems and methods for automatic unit selection and target decomposition for sequence labelling |
US10468019B1 (en) * | 2017-10-27 | 2019-11-05 | Kadho, Inc. | System and method for automatic speech recognition using selection of speech models based on input characteristics |
CN110428820A (en) * | 2019-08-27 | 2019-11-08 | 深圳大学 | A kind of Chinese and English mixing voice recognition methods and device |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106023995A (en) * | 2015-08-20 | 2016-10-12 | 漳州凯邦电子有限公司 | Voice recognition method and wearable voice control device using the method |
CN105632486B (en) * | 2015-12-23 | 2019-12-17 | 北京奇虎科技有限公司 | Voice awakening method and device of intelligent hardware |
CN106950927B (en) * | 2017-02-17 | 2019-05-17 | 深圳大学 | A kind of method and intelligent wearable device controlling smart home |
Non-Patent Citations (1)
Title |
---|
Recurrent Neural Network Language Models Based on Word Vector Features; Zhang Jian et al.; Pattern Recognition and Artificial Intelligence; 2015-04-15; full text *
Also Published As
Publication number | Publication date |
---|---|
CN111092798A (en) | 2020-05-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Tripathi et al. | Deep learning based emotion recognition system using speech features and transcriptions | |
Sun | End-to-end speech emotion recognition with gender information | |
CN111092798B (en) | Wearable system based on spoken language understanding | |
JP6810283B2 (en) | Image processing equipment and method | |
WO2022057712A1 (en) | Electronic device and semantic parsing method therefor, medium, and human-machine dialog system | |
Zhou et al. | Converting anyone's emotion: Towards speaker-independent emotional voice conversion | |
CN106847279A (en) | Man-machine interaction method based on robot operating system ROS | |
CN110675859A (en) | Multi-emotion recognition method, system, medium, and apparatus combining speech and text | |
CN112151015B (en) | Keyword detection method, keyword detection device, electronic equipment and storage medium | |
CN112101044B (en) | Intention identification method and device and electronic equipment | |
CN112151030B (en) | Multi-mode-based complex scene voice recognition method and device | |
CN106557165B (en) | The action simulation exchange method and device and smart machine of smart machine | |
CN115329779A (en) | Multi-person conversation emotion recognition method | |
CN115393933A (en) | Video face emotion recognition method based on frame attention mechanism | |
Verkholyak et al. | Modeling short-term and long-term dependencies of the speech signal for paralinguistic emotion classification | |
Sun et al. | Sparse autoencoder with attention mechanism for speech emotion recognition | |
Peerzade et al. | A review: Speech emotion recognition | |
CN111009235A (en) | Voice recognition method based on CLDNN + CTC acoustic model | |
Song et al. | A review of audio-visual fusion with machine learning | |
KR102297480B1 (en) | System and method for structured-paraphrasing the unstructured query or request sentence | |
Jie | Speech emotion recognition based on convolutional neural network | |
CN112749567A (en) | Question-answering system based on reality information environment knowledge graph | |
CN117251057A (en) | AIGC-based method and system for constructing AI number wisdom | |
CN111009236A (en) | Voice recognition method based on DBLSTM + CTC acoustic model | |
Pujari et al. | A survey on deep learning based lip-reading techniques |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||