CN111092798B - Wearable system based on spoken language understanding - Google Patents

Wearable system based on spoken language understanding

Info

Publication number
CN111092798B
CN111092798B (application number CN201911344765.6A)
Authority
CN
China
Prior art keywords
language understanding
spoken language
voice
input
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911344765.6A
Other languages
Chinese (zh)
Other versions
CN111092798A (en)
Inventor
吴怡之 (Wu Yizhi)
施军 (Shi Jun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Donghua University
Original Assignee
Donghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Donghua University filed Critical Donghua University
Priority to CN201911344765.6A priority Critical patent/CN111092798B/en
Publication of CN111092798A publication Critical patent/CN111092798A/en
Application granted granted Critical
Publication of CN111092798B publication Critical patent/CN111092798B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L12/00 Data switching networks
    • H04L12/28 Data switching networks characterised by path configuration, e.g. LAN [Local Area Networks] or WAN [Wide Area Networks]
    • H04L12/2803 Home automation networks
    • H04L12/2816 Controlling appliance services of a home automation network by calling their functionalities
    • H04L12/282 Controlling appliance services of a home automation network by calling their functionalities based on user interaction within the home
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04B TRANSMISSION
    • H04B1/00 Details of transmission systems, not covered by a single one of groups H04B3/00 - H04B13/00; Details of transmission systems not characterised by the medium used for transmission
    • H04B1/38 Transceivers, i.e. devices in which transmitter and receiver form a structural unit and in which at least one part is used for functions of transmitting and receiving
    • H04B1/3827 Portable transceivers
    • H04B1/385 Transceivers carried on the body, e.g. in helmets
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M1/00 Substation equipment, e.g. for use by subscribers
    • H04M1/72 Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724 User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72403 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality

Abstract

The invention relates to a design method of a wearable system based on spoken language understanding, which comprises the following steps: first, a wearable device picks up the user's speech and transmits it via Bluetooth to a smart phone APP; the APP forwards the speech signal to a spoken language understanding server, which recognizes the keyword intention text directly from the speech signal and returns it to the phone. The smart phone then controls the actions of other smart devices according to the received intention, such as switching the kitchen light on and off. The smart home devices can thus be controlled quickly, accurately, and conveniently, simplifying people's lives and improving their well-being.

Description

Wearable system based on spoken language understanding
Technical Field
The invention relates to a wearable system based on spoken language understanding, which can be applied to smart homes or Internet of Things systems. The wearable device picks up voice signals in real time, the mobile phone serves as the transmission medium and control center, and ordinary voice signals are recognized as control commands, so that the smart home is controlled accurately and people's lives become more convenient.
Background
With the rapid development of computer system performance, especially the remarkable improvement in graphics card performance, and with continuous breakthroughs in speech processing, natural language processing, and machine learning methods, research on spoken language understanding systems has advanced rapidly in recent years.
Current spoken language understanding systems generally employ a pipeline model, with the entire pipeline consisting of two parts: speech recognition (ASR) and natural language understanding (NLU). More and more products now also integrate a knowledge base, introduced mainly in the dialogue management module. This divide-and-conquer approach allows each subtask to be modeled independently and is simple and easy to implement.
According to the current state of research, there is no product that combines spoken language understanding with wearable devices. Mainstream intelligent products, such as smart speakers like the "Tmall Genie", can realize simple human-computer interaction, but they have three obvious disadvantages: first, they are large and inconvenient to carry; second, they have no mobile power supply and must be plugged in at all times to work normally; third, their far-field pickup performance is poor and the covered area is limited.
Disclosure of Invention
The purpose of the invention is to control household intelligent devices anytime and anywhere by training a spoken language understanding algorithm model and exploiting the convenience of a wearable device.
In order to achieve the above object, a technical solution of the present invention is to provide a wearable system based on spoken language understanding, including:
the wearable device is worn on the body of a user, collects voice signals sent by the user and used for controlling the smart home to act, and forwards the collected voice signals to the smart phone used by the user;
the smart phone APP runs on the smart phone, on one hand, the smart phone APP receives voice signals forwarded by the wearable device and uploads the voice signals to the spoken language understanding server, on the other hand, the smart phone APP receives intention texts issued by the spoken language understanding server, and the smart home is controlled according to the intention texts;
the spoken language understanding server runs a spoken language understanding model, the spoken language understanding model generates an intention text according to a voice signal uploaded by the smart phone, the spoken language understanding model comprises a voice recognition module ASR and a natural language understanding module NLU, the voice signal after preprocessing is recognized by the voice recognition module ASR to obtain text information, and the natural language understanding module NLU obtains a semantic analysis result according to the text information to form the intention text, wherein:
the speech recognition module ASR realizes end-to-end speech recognition through a recurrent neural network (RNN) and connectionist temporal classification (CTC);
the natural language understanding module NLU is composed of an input layer, a first hidden layer, a second hidden layer and an output layer:
given an input sentence $S=\{w_1,w_2,\ldots,w_T\}$, where $w_i$ denotes the i-th word in the sentence and $T$ the sentence length, the input layer of the natural language understanding module NLU uses word embedding (Embedding) to convert each word text $w_i$ into a word vector $x_i$;
the processed word vector $x_i$ is fed into the first hidden layer, which uses the GRU as its neural unit; after processing by the recurrent neurons, training continues through a second hidden layer, in which the source input of each neuron consists of 3 parts, namely the output $h_t^{(1)}$ of the previous layer, the activation value $h_{t-1}^{(2)}$ of the neuron at the previous time step, and the output $y_{t-1}$ of the output layer at the previous time step; the state value $h_t^{(2)}$ of the second hidden layer is given by

$$h_t^{(2)}=\sigma\!\left(W_{h2}\left[h_t^{(1)};\,h_{t-1}^{(2)}\right]+W_y\,y_{t-1}\right)$$

where $\sigma$ is the activation function and $W_{h2}$, $W_y$ are coefficient matrices;
the output layer uses a softmax classification function to solve the multi-sequence labeling problem and outputs $y_t^k$, the probability of category $k$ at the $t$-th word, where there are 3 classification categories in total, namely place, object, and action, and the probabilities of all categories for one word sum to 1.
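As an illustration of the four-layer NLU structure just described, the following is a minimal PyTorch sketch: an embedding layer, a GRU first hidden layer, a second hidden layer whose input combines $h_t^{(1)}$, $h_{t-1}^{(2)}$ and the previous output $y_{t-1}$, and a softmax output layer. The layer sizes, the concatenation used inside the second hidden layer, and the label count are illustrative assumptions, not details fixed by the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlotTagger(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128, n_labels=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)           # word text w_i -> word vector x_i
        self.gru1 = nn.GRU(emb_dim, hid_dim, batch_first=True)   # first hidden layer (GRU units)
        # second hidden layer: driven by h1_t, h2_{t-1} and the previous output y_{t-1}
        self.W_h2 = nn.Linear(2 * hid_dim, hid_dim, bias=False)
        self.W_y = nn.Linear(n_labels, hid_dim, bias=False)
        self.out = nn.Linear(hid_dim, n_labels)                   # softmax output layer

    def forward(self, tokens):                                    # tokens: (batch, T) word indices
        x = self.embed(tokens)
        h1, _ = self.gru1(x)                                      # (batch, T, hid_dim)
        batch, T, hid = h1.shape
        h2 = h1.new_zeros(batch, hid)
        y_prev = h1.new_zeros(batch, self.out.out_features)
        outputs = []
        for t in range(T):
            # h2_t = sigma( W_h2 [h1_t ; h2_{t-1}] + W_y y_{t-1} )
            h2 = torch.sigmoid(self.W_h2(torch.cat([h1[:, t], h2], dim=-1)) + self.W_y(y_prev))
            y_prev = F.softmax(self.out(h2), dim=-1)              # y_t^k: probability of label k at word t
            outputs.append(y_prev)
        return torch.stack(outputs, dim=1)                        # (batch, T, n_labels)
```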
Preferably, the speech recognition module ASR first preprocesses the input speech signal, where the preprocessing includes pre-emphasis, silence removal, and windowing and framing. After the preprocessing, the speech signal becomes many small segments, each defined as one frame of the speech waveform. MFCC is used for feature extraction, turning each frame of the speech waveform into a multi-dimensional vector $x$, and the set $X=\{x_1,x_2,\ldots,x_T\}$ represents the feature sequence corresponding to the current speech signal, where $T$ is the total number of frames of the speech waveform. Training is performed with a recurrent neural network whose hidden-layer neurons are GRUs, and the network uses softmax to output the posterior probability matrix $y$ of the characters to be recognized, defined as:

$$y=(y_1,y_2,\ldots,y_t,\ldots,y_T)$$

where the $t$-th column $y_t$ is

$$y_t=\left(y_t^1,y_t^2,\ldots,y_t^N\right)^{\mathsf T}$$

and $y_t^n$ denotes the probability that the $t$-th frame is pronounced as the $n$-th character, where $N$ is the number of character classes to be recognized; the probabilities of all character classes for one frame of data sum to 1, namely

$$\sum_{n=1}^{N} y_t^n = 1.$$

The output then enters the CTC output layer. As a loss function, CTC only cares whether the predicted sequence is close to the true sequence, not whether each result in the predicted output sequence is exactly aligned in time with the input sequence. CTC adds a blank to mark invalid speech in the label and, by computing the forward-backward loss and removing repeated phonemes and blanks, yields a recognized-text result whose output length is far smaller than the input length.
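To make the above front end concrete, here is a minimal sketch, assuming PyTorch and torchaudio: MFCC features per frame, a GRU recurrent network, per-frame softmax posteriors over the character classes, and CTC as the loss. The sampling rate, feature dimension, and layer sizes are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torchaudio

N_CLASSES = 27                                                     # 26 English letters + 1 CTC blank
mfcc = torchaudio.transforms.MFCC(sample_rate=16000, n_mfcc=13)    # per-frame MFCC feature extraction

class CTCRecognizer(nn.Module):
    def __init__(self, n_mfcc=13, hid_dim=256, n_classes=N_CLASSES):
        super().__init__()
        self.rnn = nn.GRU(n_mfcc, hid_dim, num_layers=2, batch_first=True)  # GRU hidden layers
        self.proj = nn.Linear(hid_dim, n_classes)

    def forward(self, feats):                     # feats: (batch, T, n_mfcc)
        h, _ = self.rnn(feats)
        return self.proj(h).log_softmax(dim=-1)   # log posteriors y_t^n, one column per frame

model = CTCRecognizer()
ctc_loss = nn.CTCLoss(blank=0)   # compares whole sequences; no frame-level alignment is required
```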
The wearable device system mainly comprises two parts. One part is a sound pickup part (which may take the form of a bow tie, a bracelet, and the like) and is mainly used to collect the user's voice signals at close range in real time; the other part is the smart phone, which acts like a transport hub and is mainly used to transmit voice signals to the spoken language understanding server and to control the actions of the smart home according to the spoken language understanding result.
At present, mainstream intelligent products such as smart speakers like the "Tmall Genie" can realize simple human-computer interaction, but they have three obvious disadvantages: first, they are large and inconvenient to carry; second, they have no mobile power supply and must be plugged in at all times to work normally; third, their far-field pickup performance is poor and the covered area is limited. To address these shortcomings, the invention adopts a wearable device system. The system mainly comprises two parts: one is a sound pickup part (which may be a bow tie, a bracelet, and the like), mainly used to collect the user's voice signals at close range in real time; the other is the smart phone part, which acts like a transport hub and is mainly used to transmit voice signals to the spoken language understanding server and to control the actions of the smart home according to the spoken language understanding result.
In the spoken language understanding scheme of the invention, the speech recognition and natural language understanding parts are modeled independently, and the output of the previous module is taken as the input of the next module in a pipeline manner. The speech recognition part adopts a recurrent neural network and CTC to realize end-to-end speech recognition; compared with traditional speech recognition it needs less language modeling, is simpler, is easier to debug, and is more accurate, although it requires a larger training data set. The natural language understanding part uses the GRU as the neuron of a deep recurrent neural network, which effectively eliminates gradient explosion or vanishing gradients and, having fewer parameters than an LSTM, is easier to train. This divide-and-conquer approach allows each subtask to be modeled independently and is simple and easy to implement.
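A minimal sketch of this pipeline arrangement, with the ASR and NLU stages passed in as callables; the function names are hypothetical and only illustrate that each module's output feeds the next module.

```python
def understand(speech_features, asr_transcribe, nlu_tag, to_intent):
    """Pipeline: ASR turns speech features into text, NLU tags each word,
    and the tags are collected into the intent. All three callables are
    hypothetical stand-ins for the independently modeled subtasks."""
    text = asr_transcribe(speech_features)   # speech -> recognized text
    words = text.split()
    labels = nlu_tag(words)                  # text -> per-word slot labels
    return to_intent(words, labels)          # slot labels -> intent text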
Drawings
FIG. 1 is a first scheme of wearable system composition based on spoken language understanding;
FIG. 2 is a second scheme of wearable system composition based on spoken language understanding;
FIG. 3 is a schematic diagram of a GRU;
FIG. 4 is the deep recurrent neural network of the ASR model;
FIG. 5 is the deep recurrent neural network of the NLU model.
Detailed Description
The invention is further elucidated with reference to the drawing. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and such equivalents may fall within the scope of the present invention as defined in the appended claims.
The embodiment of the invention relates to a design method of a wearable system based on spoken language understanding, which comprises the following modules as shown in figure 1:
(1) Voice signal collection: this part can be realized by a wearable device, for example a brooch, a button, or a bracelet. It includes a miniature high-performance pickup for collecting the user's voice signal, a speech processing module for noise reduction and analog-to-digital conversion, and a Bluetooth module for transmitting the signal to the smart phone.
(2) Smart phone APP: the APP mainly comprises three parts. The first is a Bluetooth part used to connect the phone to the wearable device; the second transmits the voice data to the spoken language understanding server; the third receives and displays the spoken language understanding result, outputs it to the phone interface, connects to the smart devices in the home through a personal area network, and controls the switching on and off of the smart devices according to the spoken language understanding control command.
(3) Spoken language understanding server: the server providing the spoken language understanding service is the core part of the invention, and the spoken language understanding model runs on it. The model is divided into two parts: a speech recognition module ASR and a natural language understanding module NLU. The preprocessed speech signal is passed to the speech recognition module ASR to obtain recognized text, the text information is transmitted to the natural language understanding module NLU, and a semantic analysis result is output. For example, the voice input "turn on the kitchen light" resolves to {location: kitchen, object: lamp, action: open}.
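A small sketch of how per-word slot labels from the NLU module could be collected into such an intent; the helper and the label normalization are illustrative assumptions, not the patent's exact procedure.

```python
def labels_to_intent(words, labels):
    """words: list of tokens; labels: 'place', 'object', 'action', or None per token."""
    intent = {}
    for word, label in zip(words, labels):
        if label is not None:
            intent.setdefault(label, word)   # keep the first word tagged with each category
    return intent

print(labels_to_intent(
    ["turn", "on", "the", "kitchen", "light"],
    ["action", None, None, "place", "object"]))
# -> {'action': 'turn', 'place': 'kitchen', 'object': 'light'}
```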
The speech recognition module ASR implements end-to-end speech recognition through a recurrent neural network (RNN) and connectionist temporal classification (CTC). The ASR module first applies pre-emphasis, silence removal, and windowing and framing to the input speech signal, converting the speech into many small segments. Since the raw waveform has little descriptive power in the time domain, it must be transformed. The invention uses MFCC for feature extraction which, following the physiological characteristics of the human ear, turns each frame of the speech waveform into a multi-dimensional vector $x$; the set $X=\{x_1,x_2,\ldots,x_T\}$ represents the feature sequence corresponding to each frame of speech, where $T$ is the number of frames. Considering the temporal continuity of the speech signal, a recurrent neural network is selected, and because the GRU effectively avoids gradient explosion or vanishing gradients and has few parameters, the GRU is chosen as the hidden node of the network. The posterior probability matrix $y$ of the characters to be recognized is output with softmax and is defined as:

$$y=(y_1,y_2,\ldots,y_t,\ldots,y_T)$$

where the $t$-th column $y_t$ is

$$y_t=\left(y_t^1,y_t^2,\ldots,y_t^N\right)^{\mathsf T}$$

Here $N$ is the number of character classes to be recognized (the 26 English letters plus a blank), and $y_t^n$ denotes the probability of pronouncing the $n$-th character in the $t$-th frame; the probabilities of all character classes for one frame of data sum to 1, that is

$$\sum_{n=1}^{N} y_t^n = 1.$$

The output then enters the CTC output layer. As a loss function, CTC only cares whether the predicted sequence is close to the true sequence, not whether each result in the predicted output sequence is exactly aligned in time with the input sequence. CTC adds a blank to mark invalid speech in the label and, by computing the forward-backward loss and removing repeated phonemes and blanks, yields a recognized-text result whose output length is far smaller than the input length. The network architecture is shown in FIG. 4.
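As an illustration of the final collapsing step described above (remove repeats, drop blanks), here is a minimal greedy CTC decoding sketch; the alphabet and the blank index are illustrative assumptions.

```python
import numpy as np

ALPHABET = ["-"] + list("abcdefghijklmnopqrstuvwxyz")   # index 0 is the CTC blank

def greedy_ctc_decode(posteriors):
    """posteriors: (T, N) matrix y with one row of character probabilities per frame."""
    best = np.argmax(posteriors, axis=1)                 # most probable class per frame
    chars, prev = [], None
    for idx in best:
        if idx != prev and idx != 0:                     # collapse repeats and drop blanks
            chars.append(ALPHABET[idx])
        prev = idx
    return "".join(chars)                                # output far shorter than the input frames
```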
After the recognized text is obtained from the speech recognition module ASR, it enters the natural language understanding module NLU. This module is formed by a four-layer network structure. Given an input sentence $W=\{w_1,w_2,\ldots,w_M\}$, where $w_i$ denotes the i-th word in the sentence and $M$ the sentence length, the input layer uses word embedding (Embedding) to convert each word text $w_i$ into a word vector, and the processed input word vectors are fed into the first hidden layer, which again uses GRUs as its neural units. After processing by the recurrent neurons, training continues through a second hidden layer; in the training of this layer, the source input of each neuron consists of 3 parts: the output $h_t^{(1)}$ of the previous layer, the activation value $h_{t-1}^{(2)}$ of the neuron at the previous time step, and the output $y_{t-1}$ of the output layer at the previous time step. The state value $h_t^{(2)}$ of this layer is given by

$$h_t^{(2)}=\sigma\!\left(W_{h2}\left[h_t^{(1)};\,h_{t-1}^{(2)}\right]+W_y\,y_{t-1}\right)$$
where $\sigma$ is the activation function and $W_{h2}$, $W_y$ are coefficient matrices. The output layer uses a softmax classification function to solve the multi-sequence labeling problem. Since the invention is mainly aimed at smart-home control commands, the outputs correspond to place, object, and action, and $y_t^k$ denotes the probability of category $k$ at the $t$-th word, where there are 4 classification categories in total, namely place, object, action, and NULL, and the probabilities of all categories for one word sum to 1. The specific neural network structure is shown in FIG. 5.
Test vectors are then written into a program to test the trained network. The training results are compared with the expected data, the recognition accuracy is observed, and the algorithm is improved accordingly. The practical value of the method is that the spoken language understanding algorithm can understand the keywords in the user's speech in real time, quickly recognize the user's intention, and control the actions of the smart home through the smart phone and the home network.
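A small sketch of this testing step, assuming the network predictions and the expected labels are already available as lists; the helper is illustrative only.

```python
def recognition_accuracy(predicted, expected):
    """Compare trained-network outputs with the expected data and report accuracy."""
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected) if expected else 0.0
```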
The wearable system based on spoken language understanding can accurately control the smart home anytime and anywhere, thereby simplifying people's lives, saving time, and greatly improving their quality of life and sense of well-being.

Claims (1)

1. A wearable system based on spoken language understanding, comprising:
the wearable device is worn on the body of a user, collects voice signals sent by the user and used for controlling the smart home to act, and forwards the collected voice signals to the smart phone used by the user;
the smart phone APP runs on the smart phone, on one hand, the smart phone APP receives voice signals forwarded by the wearable device and uploads the voice signals to the spoken language understanding server, on the other hand, the smart phone APP receives intention texts issued by the spoken language understanding server, and the smart home is controlled according to the intention texts;
the spoken language understanding server runs a spoken language understanding model, the spoken language understanding model generates an intention text according to a voice signal uploaded by the smart phone, the spoken language understanding model comprises a voice recognition module ASR and a natural language understanding module NLU, the voice signal after preprocessing is recognized by the voice recognition module ASR to obtain text information, and the natural language understanding module NLU obtains a semantic analysis result according to the text information to form the intention text, wherein:
the speech recognition module ASR realizes end-to-end speech recognition through a recurrent neural network (RNN) and connectionist temporal classification (CTC);
the natural language understanding module NLU is composed of an input layer, a first hidden layer, a second hidden layer and an output layer:
given an input sentence $S=\{w_1,w_2,\ldots,w_T\}$, where $w_i$ denotes the i-th word in the sentence and $T$ the sentence length, the input layer of the natural language understanding module NLU uses word embedding (Embedding) to convert each word text $w_i$ into a word vector $x_i$;
the processed word vector $x_i$ is fed into the first hidden layer, which uses the GRU as its neural unit; after processing by the recurrent neurons, training continues through a second hidden layer, in which the source input of each neuron consists of 3 parts, namely the output $h_t^{(1)}$ of the previous layer, the activation value $h_{t-1}^{(2)}$ of the neuron at the previous time step, and the output $y_{t-1}$ of the output layer at the previous time step; the state value $h_t^{(2)}$ of the second hidden layer is given by

$$h_t^{(2)}=\sigma\!\left(W_{h2}\left[h_t^{(1)};\,h_{t-1}^{(2)}\right]+W_y\,y_{t-1}\right)$$

where $\sigma$ is the activation function and $W_{h2}$, $W_y$ are coefficient matrices;
the output layer uses a softmax classification function to solve the multi-sequence labeling problem and outputs $y_t^k$, the probability of category $k$ at the $t$-th word, where there are 3 classification categories in total, namely place, object, and action, and the probabilities of all categories for one word sum to 1;
the speech recognition module ASR first preprocesses the input speech signal, where the preprocessing includes pre-emphasis, silence removal, and windowing and framing; after the preprocessing, the speech signal becomes many small segments, each defined as one frame of the speech waveform; MFCC is used for feature extraction, turning each frame of the speech waveform into a multi-dimensional vector $x$, and the set $X=\{x_1,x_2,\ldots,x_T\}$ represents the feature sequence corresponding to the current speech signal, where $T$ is the total number of frames of the speech waveform; training is performed with a recurrent neural network whose hidden-layer neurons are GRUs, and the network uses softmax to output the posterior probability matrix $y$ of the characters to be recognized, defined as:

$$y=(y_1,y_2,\ldots,y_t,\ldots,y_T)$$

where the $t$-th column $y_t$ is

$$y_t=\left(y_t^1,y_t^2,\ldots,y_t^N\right)^{\mathsf T}$$

and $y_t^n$ denotes the probability that the $t$-th frame is pronounced as the $n$-th character, where $N$ is the number of character classes to be recognized; the probabilities of all character classes for one frame of data sum to 1, namely:

$$\sum_{n=1}^{N} y_t^n = 1$$
then the output enters the CTC output layer, where the CTC, acting as a loss function, only concerns whether the predicted sequence is close to the true sequence and not whether each result in the predicted output sequence is exactly aligned in time with the input sequence; the CTC adds a blank to mark invalid speech in the label and, by computing the forward-backward loss and removing repeated phonemes and blanks, yields a recognized-text result whose output length is far smaller than the input length.
CN201911344765.6A 2019-12-24 2019-12-24 Wearable system based on spoken language understanding Active CN111092798B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911344765.6A CN111092798B (en) 2019-12-24 2019-12-24 Wearable system based on spoken language understanding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911344765.6A CN111092798B (en) 2019-12-24 2019-12-24 Wearable system based on spoken language understanding

Publications (2)

Publication Number Publication Date
CN111092798A CN111092798A (en) 2020-05-01
CN111092798B true CN111092798B (en) 2021-06-11

Family

ID=70395373

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911344765.6A Active CN111092798B (en) 2019-12-24 2019-12-24 Wearable system based on spoken language understanding

Country Status (1)

Country Link
CN (1) CN111092798B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111754981A (en) * 2020-06-26 2020-10-09 清华大学 Command word recognition method and system using mutual prior constraint model
CN113793599B (en) * 2021-09-15 2023-09-29 北京百度网讯科技有限公司 Training method of voice recognition model, voice recognition method and device
CN114596947A (en) * 2022-03-08 2022-06-07 北京百度网讯科技有限公司 Diagnosis and treatment department recommendation method and device, electronic equipment and storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105139864A (en) * 2015-08-17 2015-12-09 北京天诚盛业科技有限公司 Voice recognition method and voice recognition device
CN105895087A (en) * 2016-03-24 2016-08-24 海信集团有限公司 Voice recognition method and apparatus
CN106469552A (en) * 2015-08-20 2017-03-01 三星电子株式会社 Speech recognition apparatus and method
CN106782497A (en) * 2016-11-30 2017-05-31 天津大学 A kind of intelligent sound noise reduction algorithm based on Portable intelligent terminal
CN107767863A (en) * 2016-08-22 2018-03-06 科大讯飞股份有限公司 voice awakening method, system and intelligent terminal
CN108268452A (en) * 2018-01-15 2018-07-10 东北大学 A kind of professional domain machine synchronous translation device and method based on deep learning
CN109767759A (en) * 2019-02-14 2019-05-17 重庆邮电大学 End-to-end speech recognition methods based on modified CLDNN structure
US10373610B2 (en) * 2017-02-24 2019-08-06 Baidu Usa Llc Systems and methods for automatic unit selection and target decomposition for sequence labelling
US10468019B1 (en) * 2017-10-27 2019-11-05 Kadho, Inc. System and method for automatic speech recognition using selection of speech models based on input characteristics
CN110428820A (en) * 2019-08-27 2019-11-08 深圳大学 A kind of Chinese and English mixing voice recognition methods and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106023995A (en) * 2015-08-20 2016-10-12 漳州凯邦电子有限公司 Voice recognition method and wearable voice control device using the method
CN105632486B (en) * 2015-12-23 2019-12-17 北京奇虎科技有限公司 Voice awakening method and device of intelligent hardware
CN106950927B (en) * 2017-02-17 2019-05-17 深圳大学 A kind of method and intelligent wearable device controlling smart home

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105139864A (en) * 2015-08-17 2015-12-09 北京天诚盛业科技有限公司 Voice recognition method and voice recognition device
CN106469552A (en) * 2015-08-20 2017-03-01 三星电子株式会社 Speech recognition apparatus and method
CN105895087A (en) * 2016-03-24 2016-08-24 海信集团有限公司 Voice recognition method and apparatus
CN107767863A (en) * 2016-08-22 2018-03-06 科大讯飞股份有限公司 voice awakening method, system and intelligent terminal
CN106782497A (en) * 2016-11-30 2017-05-31 天津大学 A kind of intelligent sound noise reduction algorithm based on Portable intelligent terminal
US10373610B2 (en) * 2017-02-24 2019-08-06 Baidu Usa Llc Systems and methods for automatic unit selection and target decomposition for sequence labelling
US10468019B1 (en) * 2017-10-27 2019-11-05 Kadho, Inc. System and method for automatic speech recognition using selection of speech models based on input characteristics
CN108268452A (en) * 2018-01-15 2018-07-10 东北大学 A kind of professional domain machine synchronous translation device and method based on deep learning
CN109767759A (en) * 2019-02-14 2019-05-17 重庆邮电大学 End-to-end speech recognition methods based on modified CLDNN structure
CN110428820A (en) * 2019-08-27 2019-11-08 深圳大学 A kind of Chinese and English mixing voice recognition methods and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Recurrent Neural Network Language Model Based on Word Vector Features; Zhang Jian et al.; Pattern Recognition and Artificial Intelligence; 2015-04-15; full text *

Also Published As

Publication number Publication date
CN111092798A (en) 2020-05-01

Similar Documents

Publication Publication Date Title
Tripathi et al. Deep learning based emotion recognition system using speech features and transcriptions
Sun End-to-end speech emotion recognition with gender information
CN111092798B (en) Wearable system based on spoken language understanding
JP6810283B2 (en) Image processing equipment and method
WO2022057712A1 (en) Electronic device and semantic parsing method therefor, medium, and human-machine dialog system
Zhou et al. Converting anyone's emotion: Towards speaker-independent emotional voice conversion
CN106847279A (en) Man-machine interaction method based on robot operating system ROS
CN110675859A (en) Multi-emotion recognition method, system, medium, and apparatus combining speech and text
CN112151015B (en) Keyword detection method, keyword detection device, electronic equipment and storage medium
CN112101044B (en) Intention identification method and device and electronic equipment
CN112151030B (en) Multi-mode-based complex scene voice recognition method and device
CN106557165B (en) The action simulation exchange method and device and smart machine of smart machine
CN115329779A (en) Multi-person conversation emotion recognition method
CN115393933A (en) Video face emotion recognition method based on frame attention mechanism
Verkholyak et al. Modeling short-term and long-term dependencies of the speech signal for paralinguistic emotion classification
Sun et al. Sparse autoencoder with attention mechanism for speech emotion recognition
Peerzade et al. A review: Speech emotion recognition
CN111009235A (en) Voice recognition method based on CLDNN + CTC acoustic model
Song et al. A review of audio-visual fusion with machine learning
KR102297480B1 (en) System and method for structured-paraphrasing the unstructured query or request sentence
Jie Speech emotion recognition based on convolutional neural network
CN112749567A (en) Question-answering system based on reality information environment knowledge graph
CN117251057A (en) AIGC-based method and system for constructing AI number wisdom
CN111009236A (en) Voice recognition method based on DBLSTM + CTC acoustic model
Pujari et al. A survey on deep learning based lip-reading techniques

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant