CN111092798B - Wearable system based on spoken language understanding - Google Patents

Wearable system based on spoken language understanding

Info

Publication number
CN111092798B
CN111092798B (application number CN201911344765.6A)
Authority
CN
China
Prior art keywords
language understanding
spoken language
voice
input
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911344765.6A
Other languages
Chinese (zh)
Other versions
CN111092798A (en)
Inventor
吴怡之 (Wu Yizhi)
施军 (Shi Jun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Donghua University
Original Assignee
Donghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Donghua University filed Critical Donghua University
Priority to CN201911344765.6A priority Critical patent/CN111092798B/en
Publication of CN111092798A publication Critical patent/CN111092798A/en
Application granted granted Critical
Publication of CN111092798B publication Critical patent/CN111092798B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L12/00 Data switching networks
    • H04L12/28 Data switching networks characterised by path configuration, e.g. LAN [Local Area Networks] or WAN [Wide Area Networks]
    • H04L12/2803 Home automation networks
    • H04L12/2816 Controlling appliance services of a home automation network by calling their functionalities
    • H04L12/282 Controlling appliance services of a home automation network by calling their functionalities based on user interaction within the home
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04B TRANSMISSION
    • H04B1/00 Details of transmission systems, not covered by a single one of groups H04B3/00 - H04B13/00; Details of transmission systems not characterised by the medium used for transmission
    • H04B1/38 Transceivers, i.e. devices in which transmitter and receiver form a structural unit and in which at least one part is used for functions of transmitting and receiving
    • H04B1/3827 Portable transceivers
    • H04B1/385 Transceivers carried on the body, e.g. in helmets
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M1/00 Substation equipment, e.g. for use by subscribers
    • H04M1/72 Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724 User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72403 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality

Abstract

The invention relates to a design method of a wearable system based on spoken language understanding, which comprises the following steps: first, a wearable device picks up the user's speech and transmits it via Bluetooth to a smart phone APP; the APP forwards the speech signal to a spoken language understanding server, which recognizes the keyword intention text directly from the speech signal and returns it to the phone. The smart phone then controls the actions of other smart devices according to the received intention, such as switching the kitchen light on and off. The smart home devices can thus be controlled quickly, accurately, and conveniently, simplifying people's lives and improving their well-being.

Description

Wearable system based on spoken language understanding
Technical Field
The invention relates to a wearable system based on spoken language understanding, which can be applied to smart homes or Internet of Things systems. The wearable device picks up voice signals in real time, the mobile phone serves as the transmission medium and control center, and ordinary voice signals are recognized as control commands, so that the smart home is controlled accurately and people's lives become more convenient.
Background
With the rapid development of computer system performance, especially the remarkable improvement in graphics card performance, and with continuous breakthroughs in speech processing, natural language processing, and machine learning methods, research on spoken language understanding systems has advanced rapidly in recent years.
Current spoken language understanding systems generally employ a pipeline model, with the entire pipeline consisting of two parts: speech recognition (ASR) and natural language understanding (NLU). More and more products now also integrate a knowledge base, introduced mainly in the dialogue management module. This divide-and-conquer approach allows each subtask to be modeled independently and is simple and easy to implement.
According to the current state of research, there is no product that combines spoken language understanding with wearable devices. Mainstream intelligent products, such as smart speakers like the "Tmall Genie", can realize simple human-computer interaction, but they have three obvious disadvantages: first, they are large and inconvenient to carry; second, they have no mobile power supply and must be plugged in at all times to work normally; third, their far-field pickup performance is poor and the covered area is limited.
Disclosure of Invention
The purpose of the invention is to control household intelligent devices anytime and anywhere by training a spoken language understanding algorithm model and exploiting the convenience of a wearable device.
In order to achieve the above object, a technical solution of the present invention is to provide a wearable system based on spoken language understanding, including:
the wearable device is worn on the body of a user, collects voice signals sent by the user and used for controlling the smart home to act, and forwards the collected voice signals to the smart phone used by the user;
the smart phone APP runs on the smart phone, on one hand, the smart phone APP receives voice signals forwarded by the wearable device and uploads the voice signals to the spoken language understanding server, on the other hand, the smart phone APP receives intention texts issued by the spoken language understanding server, and the smart home is controlled according to the intention texts;
the spoken language understanding server runs a spoken language understanding model, the spoken language understanding model generates an intention text according to a voice signal uploaded by the smart phone, the spoken language understanding model comprises a voice recognition module ASR and a natural language understanding module NLU, the voice signal after preprocessing is recognized by the voice recognition module ASR to obtain text information, and the natural language understanding module NLU obtains a semantic analysis result according to the text information to form the intention text, wherein:
the speech recognition module ASR realizes end-to-end speech recognition through a recurrent neural network (RNN) and connectionist temporal classification (CTC);
the natural language understanding module NLU is composed of an input layer, a first hidden layer, a second hidden layer and an output layer:
given an input sentence $S=\{w_1,w_2,\ldots,w_T\}$, where $w_i$ denotes the i-th word in the sentence and $T$ the sentence length, the input layer of the natural language understanding module NLU uses word embedding (Embedding) to convert each word text $w_i$ into a word vector $x_i$;
the processed word vector $x_i$ is fed into the first hidden layer, which uses the GRU as its neural unit; after processing by the recurrent neurons, training continues through a second hidden layer, in which the source input of each neuron consists of 3 parts, namely the output $h_t^{(1)}$ of the previous layer, the activation value $h_{t-1}^{(2)}$ of the neuron at the previous time step, and the output $y_{t-1}$ of the output layer at the previous time step; the state value $h_t^{(2)}$ of the second hidden layer is given by

$$h_t^{(2)}=\sigma\!\left(W_{h2}\left[h_t^{(1)};\,h_{t-1}^{(2)}\right]+W_y\,y_{t-1}\right)$$

where $\sigma$ is the activation function and $W_{h2}$, $W_y$ are coefficient matrices;
the output layer uses a softmax classification function to solve the multi-sequence labeling problem and outputs $y_t^k$, the probability of category $k$ at the $t$-th word, where there are 3 classification categories in total, namely place, object, and action, and the probabilities of all categories for one word sum to 1.
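As an illustration of the four-layer NLU structure just described, the following is a minimal PyTorch sketch: an embedding layer, a GRU first hidden layer, a second hidden layer whose input combines $h_t^{(1)}$, $h_{t-1}^{(2)}$ and the previous output $y_{t-1}$, and a softmax output layer. The layer sizes, the concatenation used inside the second hidden layer, and the label count are illustrative assumptions, not details fixed by the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlotTagger(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128, n_labels=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)           # word text w_i -> word vector x_i
        self.gru1 = nn.GRU(emb_dim, hid_dim, batch_first=True)   # first hidden layer (GRU units)
        # second hidden layer: driven by h1_t, h2_{t-1} and the previous output y_{t-1}
        self.W_h2 = nn.Linear(2 * hid_dim, hid_dim, bias=False)
        self.W_y = nn.Linear(n_labels, hid_dim, bias=False)
        self.out = nn.Linear(hid_dim, n_labels)                   # softmax output layer

    def forward(self, tokens):                                    # tokens: (batch, T) word indices
        x = self.embed(tokens)
        h1, _ = self.gru1(x)                                      # (batch, T, hid_dim)
        batch, T, hid = h1.shape
        h2 = h1.new_zeros(batch, hid)
        y_prev = h1.new_zeros(batch, self.out.out_features)
        outputs = []
        for t in range(T):
            # h2_t = sigma( W_h2 [h1_t ; h2_{t-1}] + W_y y_{t-1} )
            h2 = torch.sigmoid(self.W_h2(torch.cat([h1[:, t], h2], dim=-1)) + self.W_y(y_prev))
            y_prev = F.softmax(self.out(h2), dim=-1)              # y_t^k: probability of label k at word t
            outputs.append(y_prev)
        return torch.stack(outputs, dim=1)                        # (batch, T, n_labels)
```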
Preferably, the speech recognition module ASR first preprocesses the input speech signal, where the preprocessing includes pre-emphasis, silence removal, and windowing and framing. After the preprocessing, the speech signal becomes many small segments, each defined as one frame of the speech waveform. MFCC is used for feature extraction, turning each frame of the speech waveform into a multi-dimensional vector $x$, and the set $X=\{x_1,x_2,\ldots,x_T\}$ represents the feature sequence corresponding to the current speech signal, where $T$ is the total number of frames of the speech waveform. Training is performed with a recurrent neural network whose hidden-layer neurons are GRUs, and the network uses softmax to output the posterior probability matrix $y$ of the characters to be recognized, defined as:

$$y=(y_1,y_2,\ldots,y_t,\ldots,y_T)$$

where the $t$-th column $y_t$ is

$$y_t=\left(y_t^1,y_t^2,\ldots,y_t^N\right)^{\mathsf T}$$

and $y_t^n$ denotes the probability that the $t$-th frame is pronounced as the $n$-th character, where $N$ is the number of character classes to be recognized; the probabilities of all character classes for one frame of data sum to 1, namely

$$\sum_{n=1}^{N} y_t^n = 1.$$

The output then enters the CTC output layer. As a loss function, CTC only cares whether the predicted sequence is close to the true sequence, not whether each result in the predicted output sequence is exactly aligned in time with the input sequence. CTC adds a blank to mark invalid speech in the label and, by computing the forward-backward loss and removing repeated phonemes and blanks, yields a recognized-text result whose output length is far smaller than the input length.
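To make the above front end concrete, here is a minimal sketch, assuming PyTorch and torchaudio: MFCC features per frame, a GRU recurrent network, per-frame softmax posteriors over the character classes, and CTC as the loss. The sampling rate, feature dimension, and layer sizes are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torchaudio

N_CLASSES = 27                                                     # 26 English letters + 1 CTC blank
mfcc = torchaudio.transforms.MFCC(sample_rate=16000, n_mfcc=13)    # per-frame MFCC feature extraction

class CTCRecognizer(nn.Module):
    def __init__(self, n_mfcc=13, hid_dim=256, n_classes=N_CLASSES):
        super().__init__()
        self.rnn = nn.GRU(n_mfcc, hid_dim, num_layers=2, batch_first=True)  # GRU hidden layers
        self.proj = nn.Linear(hid_dim, n_classes)

    def forward(self, feats):                     # feats: (batch, T, n_mfcc)
        h, _ = self.rnn(feats)
        return self.proj(h).log_softmax(dim=-1)   # log posteriors y_t^n, one column per frame

model = CTCRecognizer()
ctc_loss = nn.CTCLoss(blank=0)   # compares whole sequences; no frame-level alignment is required
```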
The wearable device system mainly comprises two parts. One part is a sound pickup part (which may take the form of a bow tie, a bracelet, and the like) and is mainly used to collect the user's voice signals at close range in real time; the other part is the smart phone, which acts like a transport hub and is mainly used to transmit voice signals to the spoken language understanding server and to control the actions of the smart home according to the spoken language understanding result.
At present, mainstream intelligent products such as smart speakers like the "Tmall Genie" can realize simple human-computer interaction, but they have three obvious disadvantages: first, they are large and inconvenient to carry; second, they have no mobile power supply and must be plugged in at all times to work normally; third, their far-field pickup performance is poor and the covered area is limited. To address these shortcomings, the invention adopts a wearable device system. The system mainly comprises two parts: one is a sound pickup part (which may be a bow tie, a bracelet, and the like), mainly used to collect the user's voice signals at close range in real time; the other is the smart phone part, which acts like a transport hub and is mainly used to transmit voice signals to the spoken language understanding server and to control the actions of the smart home according to the spoken language understanding result.
In the spoken language understanding scheme of the invention, the speech recognition and natural language understanding parts are modeled independently, and the output of the previous module is taken as the input of the next module in a pipeline manner. The speech recognition part adopts a recurrent neural network and CTC to realize end-to-end speech recognition; compared with traditional speech recognition it needs less language modeling, is simpler, is easier to debug, and is more accurate, although it requires a larger training data set. The natural language understanding part uses the GRU as the neuron of a deep recurrent neural network, which effectively eliminates gradient explosion or vanishing gradients and, having fewer parameters than an LSTM, is easier to train. This divide-and-conquer approach allows each subtask to be modeled independently and is simple and easy to implement.
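A minimal sketch of this pipeline arrangement, with the ASR and NLU stages passed in as callables; the function names are hypothetical and only illustrate that each module's output feeds the next module.

```python
def understand(speech_features, asr_transcribe, nlu_tag, to_intent):
    """Pipeline: ASR turns speech features into text, NLU tags each word,
    and the tags are collected into the intent. All three callables are
    hypothetical stand-ins for the independently modeled subtasks."""
    text = asr_transcribe(speech_features)   # speech -> recognized text
    words = text.split()
    labels = nlu_tag(words)                  # text -> per-word slot labels
    return to_intent(words, labels)          # slot labels -> intent text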
Drawings
FIG. 1 is a first scheme of wearable system composition based on spoken language understanding;
FIG. 2 is a second scheme of wearable system composition based on spoken language understanding;
FIG. 3 is a schematic diagram of a GRU;
FIG. 4 is the deep recurrent neural network of the ASR model;
FIG. 5 is the deep recurrent neural network of the NLU model.
Detailed Description
The invention is further elucidated with reference to the drawing. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and such equivalents may fall within the scope of the present invention as defined in the appended claims.
The embodiment of the invention relates to a design method of a wearable system based on spoken language understanding, which comprises the following modules as shown in figure 1:
(1) Voice signal collection: this part can be realized by a wearable device, for example a brooch, a button, or a bracelet. It includes a miniature high-performance pickup for collecting the user's voice signal, a speech processing module for noise reduction and analog-to-digital conversion, and a Bluetooth module for transmitting the signal to the smart phone.
(2) Smart phone APP: the APP mainly comprises three parts. The first is a Bluetooth part used to connect the phone to the wearable device; the second transmits the voice data to the spoken language understanding server; the third receives and displays the spoken language understanding result, outputs it to the phone interface, connects to the smart devices in the home through a personal area network, and controls the switching on and off of the smart devices according to the spoken language understanding control command.
(3) Spoken language understanding server: the server providing the spoken language understanding service is the core part of the invention, and the spoken language understanding model runs on it. The model is divided into two parts: a speech recognition module ASR and a natural language understanding module NLU. The preprocessed speech signal is passed to the speech recognition module ASR to obtain recognized text, the text information is transmitted to the natural language understanding module NLU, and a semantic analysis result is output. For example, the voice input "turn on the kitchen light" resolves to {location: kitchen, object: lamp, action: open}.
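A small sketch of how per-word slot labels from the NLU module could be collected into such an intent; the helper and the label normalization are illustrative assumptions, not the patent's exact procedure.

```python
def labels_to_intent(words, labels):
    """words: list of tokens; labels: 'place', 'object', 'action', or None per token."""
    intent = {}
    for word, label in zip(words, labels):
        if label is not None:
            intent.setdefault(label, word)   # keep the first word tagged with each category
    return intent

print(labels_to_intent(
    ["turn", "on", "the", "kitchen", "light"],
    ["action", None, None, "place", "object"]))
# -> {'action': 'turn', 'place': 'kitchen', 'object': 'light'}
```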
The speech recognition module ASR implements end-to-end speech recognition through a recurrent neural network (RNN) and connectionist temporal classification (CTC). The ASR module first applies pre-emphasis, silence removal, and windowing and framing to the input speech signal, converting the speech into many small segments. Since the raw waveform has little descriptive power in the time domain, it must be transformed. The invention uses MFCC for feature extraction which, following the physiological characteristics of the human ear, turns each frame of the speech waveform into a multi-dimensional vector $x$; the set $X=\{x_1,x_2,\ldots,x_T\}$ represents the feature sequence corresponding to each frame of speech, where $T$ is the number of frames. Considering the temporal continuity of the speech signal, a recurrent neural network is selected, and because the GRU effectively avoids gradient explosion or vanishing gradients and has few parameters, the GRU is chosen as the hidden node of the network. The posterior probability matrix $y$ of the characters to be recognized is output with softmax and is defined as:

$$y=(y_1,y_2,\ldots,y_t,\ldots,y_T)$$

where the $t$-th column $y_t$ is

$$y_t=\left(y_t^1,y_t^2,\ldots,y_t^N\right)^{\mathsf T}$$

Here $N$ is the number of character classes to be recognized (the 26 English letters plus a blank), and $y_t^n$ denotes the probability of pronouncing the $n$-th character in the $t$-th frame; the probabilities of all character classes for one frame of data sum to 1, that is

$$\sum_{n=1}^{N} y_t^n = 1.$$

The output then enters the CTC output layer. As a loss function, CTC only cares whether the predicted sequence is close to the true sequence, not whether each result in the predicted output sequence is exactly aligned in time with the input sequence. CTC adds a blank to mark invalid speech in the label and, by computing the forward-backward loss and removing repeated phonemes and blanks, yields a recognized-text result whose output length is far smaller than the input length. The network architecture is shown in FIG. 4.
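As an illustration of the final collapsing step described above (remove repeats, drop blanks), here is a minimal greedy CTC decoding sketch; the alphabet and the blank index are illustrative assumptions.

```python
import numpy as np

ALPHABET = ["-"] + list("abcdefghijklmnopqrstuvwxyz")   # index 0 is the CTC blank

def greedy_ctc_decode(posteriors):
    """posteriors: (T, N) matrix y with one row of character probabilities per frame."""
    best = np.argmax(posteriors, axis=1)                 # most probable class per frame
    chars, prev = [], None
    for idx in best:
        if idx != prev and idx != 0:                     # collapse repeats and drop blanks
            chars.append(ALPHABET[idx])
        prev = idx
    return "".join(chars)                                # output far shorter than the input frames
```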
After the recognized text is obtained from the speech recognition module ASR, it enters the natural language understanding module NLU. This module is formed by a four-layer network structure. Given an input sentence $W=\{w_1,w_2,\ldots,w_M\}$, where $w_i$ denotes the i-th word in the sentence and $M$ the sentence length, the input layer uses word embedding (Embedding) to convert each word text $w_i$ into a word vector, and the processed input word vectors are fed into the first hidden layer, which again uses GRUs as its neural units. After processing by the recurrent neurons, training continues through a second hidden layer; in the training of this layer, the source input of each neuron consists of 3 parts: the output $h_t^{(1)}$ of the previous layer, the activation value $h_{t-1}^{(2)}$ of the neuron at the previous time step, and the output $y_{t-1}$ of the output layer at the previous time step. The state value $h_t^{(2)}$ of this layer is given by

$$h_t^{(2)}=\sigma\!\left(W_{h2}\left[h_t^{(1)};\,h_{t-1}^{(2)}\right]+W_y\,y_{t-1}\right)$$
where $\sigma$ is the activation function and $W_{h2}$, $W_y$ are coefficient matrices. The output layer uses a softmax classification function to solve the multi-sequence labeling problem. Since the invention is mainly aimed at smart-home control commands, the outputs correspond to place, object, and action, and $y_t^k$ denotes the probability of category $k$ at the $t$-th word, where there are 4 classification categories in total, namely place, object, action, and NULL, and the probabilities of all categories for one word sum to 1. The specific neural network structure is shown in FIG. 5.
Test vectors are then written into a program to test the trained network. The training results are compared with the expected data, the recognition accuracy is observed, and the algorithm is improved accordingly. The practical value of the method is that the spoken language understanding algorithm can understand the keywords in the user's speech in real time, quickly recognize the user's intention, and control the actions of the smart home through the smart phone and the home network.
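A small sketch of this testing step, assuming the network predictions and the expected labels are already available as lists; the helper is illustrative only.

```python
def recognition_accuracy(predicted, expected):
    """Compare trained-network outputs with the expected data and report accuracy."""
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected) if expected else 0.0
```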
The wearable system based on spoken language understanding can accurately control the smart home anytime and anywhere, thereby simplifying people's lives, saving time, and greatly improving their quality of life and sense of well-being.

Claims (1)

1. A wearable system based on spoken language understanding, comprising:
the wearable device is worn on the body of a user, collects voice signals sent by the user and used for controlling the smart home to act, and forwards the collected voice signals to the smart phone used by the user;
the smart phone APP runs on the smart phone, on one hand, the smart phone APP receives voice signals forwarded by the wearable device and uploads the voice signals to the spoken language understanding server, on the other hand, the smart phone APP receives intention texts issued by the spoken language understanding server, and the smart home is controlled according to the intention texts;
the spoken language understanding server runs a spoken language understanding model, the spoken language understanding model generates an intention text according to a voice signal uploaded by the smart phone, the spoken language understanding model comprises a voice recognition module ASR and a natural language understanding module NLU, the voice signal after preprocessing is recognized by the voice recognition module ASR to obtain text information, and the natural language understanding module NLU obtains a semantic analysis result according to the text information to form the intention text, wherein:
the speech recognition module ASR realizes end-to-end speech recognition through a recurrent neural network (RNN) and connectionist temporal classification (CTC);
the natural language understanding module NLU is composed of an input layer, a first hidden layer, a second hidden layer and an output layer:
given an input sentence $S=\{w_1,w_2,\ldots,w_T\}$, where $w_i$ denotes the i-th word in the sentence and $T$ the sentence length, the input layer of the natural language understanding module NLU uses word embedding (Embedding) to convert each word text $w_i$ into a word vector $x_i$;
the processed word vector $x_i$ is fed into the first hidden layer, which uses the GRU as its neural unit; after processing by the recurrent neurons, training continues through a second hidden layer, in which the source input of each neuron consists of 3 parts, namely the output $h_t^{(1)}$ of the previous layer, the activation value $h_{t-1}^{(2)}$ of the neuron at the previous time step, and the output $y_{t-1}$ of the output layer at the previous time step; the state value $h_t^{(2)}$ of the second hidden layer is given by

$$h_t^{(2)}=\sigma\!\left(W_{h2}\left[h_t^{(1)};\,h_{t-1}^{(2)}\right]+W_y\,y_{t-1}\right)$$

where $\sigma$ is the activation function and $W_{h2}$, $W_y$ are coefficient matrices;
the output layer uses a softmax classification function to solve the multi-sequence labeling problem and outputs $y_t^k$, the probability of category $k$ at the $t$-th word, where there are 3 classification categories in total, namely place, object, and action, and the probabilities of all categories for one word sum to 1;
the speech recognition module ASR first preprocesses the input speech signal, where the preprocessing includes pre-emphasis, silence removal, and windowing and framing; after the preprocessing, the speech signal becomes many small segments, each defined as one frame of the speech waveform; MFCC is used for feature extraction, turning each frame of the speech waveform into a multi-dimensional vector $x$, and the set $X=\{x_1,x_2,\ldots,x_T\}$ represents the feature sequence corresponding to the current speech signal, where $T$ is the total number of frames of the speech waveform; training is performed with a recurrent neural network whose hidden-layer neurons are GRUs, and the network uses softmax to output the posterior probability matrix $y$ of the characters to be recognized, defined as:

$$y=(y_1,y_2,\ldots,y_t,\ldots,y_T)$$

where the $t$-th column $y_t$ is

$$y_t=\left(y_t^1,y_t^2,\ldots,y_t^N\right)^{\mathsf T}$$

and $y_t^n$ denotes the probability that the $t$-th frame is pronounced as the $n$-th character, where $N$ is the number of character classes to be recognized; the probabilities of all character classes for one frame of data sum to 1, namely:

$$\sum_{n=1}^{N} y_t^n = 1$$
then the output enters the CTC output layer, where the CTC, acting as a loss function, only concerns whether the predicted sequence is close to the true sequence and not whether each result in the predicted output sequence is exactly aligned in time with the input sequence; the CTC adds a blank to mark invalid speech in the label and, by computing the forward-backward loss and removing repeated phonemes and blanks, yields a recognized-text result whose output length is far smaller than the input length.
CN201911344765.6A 2019-12-24 2019-12-24 Wearable system based on spoken language understanding Active CN111092798B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911344765.6A CN111092798B (en) 2019-12-24 2019-12-24 Wearable system based on spoken language understanding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911344765.6A CN111092798B (en) 2019-12-24 2019-12-24 Wearable system based on spoken language understanding

Publications (2)

Publication Number Publication Date
CN111092798A CN111092798A (en) 2020-05-01
CN111092798B true CN111092798B (en) 2021-06-11

Family

ID=70395373

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911344765.6A Active CN111092798B (en) 2019-12-24 2019-12-24 Wearable system based on spoken language understanding

Country Status (1)

Country Link
CN (1) CN111092798B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111754981A (en) * 2020-06-26 2020-10-09 清华大学 Command word recognition method and system using mutual prior constraint model
CN113793599B (en) * 2021-09-15 2023-09-29 北京百度网讯科技有限公司 Training method of voice recognition model, voice recognition method and device
CN114596947A (en) * 2022-03-08 2022-06-07 北京百度网讯科技有限公司 Diagnosis and treatment department recommendation method and device, electronic equipment and storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105139864A (en) * 2015-08-17 2015-12-09 北京天诚盛业科技有限公司 Voice recognition method and voice recognition device
CN105895087A (en) * 2016-03-24 2016-08-24 海信集团有限公司 Voice recognition method and apparatus
CN106469552A (en) * 2015-08-20 2017-03-01 三星电子株式会社 Speech recognition apparatus and method
CN106782497A (en) * 2016-11-30 2017-05-31 天津大学 A kind of intelligent sound noise reduction algorithm based on Portable intelligent terminal
CN107767863A (en) * 2016-08-22 2018-03-06 科大讯飞股份有限公司 voice awakening method, system and intelligent terminal
CN108268452A (en) * 2018-01-15 2018-07-10 东北大学 A kind of professional domain machine synchronous translation device and method based on deep learning
CN109767759A (en) * 2019-02-14 2019-05-17 重庆邮电大学 End-to-end speech recognition methods based on modified CLDNN structure
US10373610B2 (en) * 2017-02-24 2019-08-06 Baidu Usa Llc Systems and methods for automatic unit selection and target decomposition for sequence labelling
US10468019B1 (en) * 2017-10-27 2019-11-05 Kadho, Inc. System and method for automatic speech recognition using selection of speech models based on input characteristics
CN110428820A (en) * 2019-08-27 2019-11-08 深圳大学 A kind of Chinese and English mixing voice recognition methods and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106023995A (en) * 2015-08-20 2016-10-12 漳州凯邦电子有限公司 Voice recognition method and wearable voice control device using the method
CN105632486B (en) * 2015-12-23 2019-12-17 北京奇虎科技有限公司 Voice awakening method and device of intelligent hardware
CN106950927B (en) * 2017-02-17 2019-05-17 深圳大学 A kind of method and intelligent wearable device controlling smart home

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105139864A (en) * 2015-08-17 2015-12-09 北京天诚盛业科技有限公司 Voice recognition method and voice recognition device
CN106469552A (en) * 2015-08-20 2017-03-01 三星电子株式会社 Speech recognition apparatus and method
CN105895087A (en) * 2016-03-24 2016-08-24 海信集团有限公司 Voice recognition method and apparatus
CN107767863A (en) * 2016-08-22 2018-03-06 科大讯飞股份有限公司 voice awakening method, system and intelligent terminal
CN106782497A (en) * 2016-11-30 2017-05-31 天津大学 A kind of intelligent sound noise reduction algorithm based on Portable intelligent terminal
US10373610B2 (en) * 2017-02-24 2019-08-06 Baidu Usa Llc Systems and methods for automatic unit selection and target decomposition for sequence labelling
US10468019B1 (en) * 2017-10-27 2019-11-05 Kadho, Inc. System and method for automatic speech recognition using selection of speech models based on input characteristics
CN108268452A (en) * 2018-01-15 2018-07-10 东北大学 A kind of professional domain machine synchronous translation device and method based on deep learning
CN109767759A (en) * 2019-02-14 2019-05-17 重庆邮电大学 End-to-end speech recognition methods based on modified CLDNN structure
CN110428820A (en) * 2019-08-27 2019-11-08 深圳大学 A kind of Chinese and English mixing voice recognition methods and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Recurrent Neural Network Language Model Based on Word Vector Features; Zhang Jian et al.; Pattern Recognition and Artificial Intelligence; 2015-04-15; full text *

Also Published As

Publication number Publication date
CN111092798A (en) 2020-05-01

Similar Documents

Publication Publication Date Title
Tripathi et al. Deep learning based emotion recognition system using speech features and transcriptions
Sun End-to-end speech emotion recognition with gender information
CN111092798B (en) Wearable system based on spoken language understanding
JP6810283B2 (en) Image processing equipment and method
WO2022057712A1 (en) Electronic device and semantic parsing method therefor, medium, and human-machine dialog system
Zhou et al. Converting anyone's emotion: Towards speaker-independent emotional voice conversion
CN106847279A (en) Man-machine interaction method based on robot operating system ROS
CN110675859A (en) Multi-emotion recognition method, system, medium, and apparatus combining speech and text
CN112151015B (en) Keyword detection method, keyword detection device, electronic equipment and storage medium
CN112101044B (en) Intention identification method and device and electronic equipment
CN112151030B (en) Multi-mode-based complex scene voice recognition method and device
CN106557165B (en) The action simulation exchange method and device and smart machine of smart machine
CN115329779A (en) Multi-person conversation emotion recognition method
CN115393933A (en) Video face emotion recognition method based on frame attention mechanism
Verkholyak et al. Modeling short-term and long-term dependencies of the speech signal for paralinguistic emotion classification
Sun et al. Sparse autoencoder with attention mechanism for speech emotion recognition
Peerzade et al. A review: Speech emotion recognition
CN111009235A (en) Voice recognition method based on CLDNN + CTC acoustic model
Song et al. A review of audio-visual fusion with machine learning
KR102297480B1 (en) System and method for structured-paraphrasing the unstructured query or request sentence
Jie Speech emotion recognition based on convolutional neural network
CN112749567A (en) Question-answering system based on reality information environment knowledge graph
CN117251057A (en) AIGC-based method and system for constructing AI number wisdom
CN111009236A (en) Voice recognition method based on DBLSTM + CTC acoustic model
Pujari et al. A survey on deep learning based lip-reading techniques

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant