CN112183062B - Spoken language understanding method based on alternate decoding, electronic equipment and storage medium - Google Patents
- Publication number
- CN112183062B (application CN202011045822.3A, filed as CN202011045822A)
- Authority
- CN
- China
- Prior art keywords
- word
- decoding
- sequence
- level
- slot
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/126—Character encoding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a spoken language understanding method based on alternate decoding, electronic equipment and a storage medium.
Description
Technical Field
The invention relates to the technical field of man-machine conversation systems, in particular to a spoken language understanding method based on alternate decoding, electronic equipment and a storage medium.
Background
Spoken language understanding mainly includes two subtasks: intent detection (Intent Detection) and slot filling (Slot Filling). The two tasks are not independent of each other: slot filling is highly dependent on the result of intent detection, and slot filling can in turn promote intent detection. The prior art jointly models the two tasks to make full use of the knowledge information they share, typically with a multi-task framework in which the two subtasks share the encoding layer and the loss functions of the two parts are added together.
The existing multi-task approach of a shared encoding layer plus added loss functions is an implicit joint modeling method; it does not explicitly model the interaction between the intent detection and slot filling subtasks. The Slot-Gated and SF-ID methods have been proposed as preliminary explorations, but existing models still cannot fully exploit the co-occurrence relation between slots and intents, which limits their potential performance.
Disclosure of Invention
In order to solve these problems, the invention alternately decodes word-level intent recognition and word-level slot filling, and initializes the overall decoding state with the sentence-level intent recognition result. In this way intent recognition and slot filling contribute to each other from both the global and the local perspective, improving the effect of the spoken language understanding task.
According to an aspect of an embodiment of the present invention, there is provided a spoken language understanding method based on alternate decoding, including:
S100, obtaining a semantic vector sequence from the pre-trained language model BERT based on an input sequence, and taking the first token of the input sequence as the classification token;
S200, obtaining the final hidden state corresponding to the classification token, and performing sentence-level intent recognition on that hidden state with a fully connected neural network and a Softmax function to obtain sentence-level global intent information;
S300, performing alternate decoding of word-level intent recognition and slot filling based on the semantic vector sequence and the sentence-level global intent information;
S400, obtaining the intent recognition and slot filling results from the result of the alternate decoding.
Preferably, the alternate decoding includes: obtaining the corresponding decoder hidden states based on the semantic vector sequence and the sentence-level global intent information, decoding those hidden states to obtain an output sequence of length 2n, and obtaining a word-level intent recognition sequence and a word-level slot filling sequence from the output sequence, wherein the decoder is a unidirectional LSTM.
Preferably, the decoding maps intent labels and slot labels into a high-dimensional embedding space, explicitly distinguishing and semantically representing the classification categories to facilitate the model's understanding of the classification labels.
Preferably, the method further comprises:
S310, when decoding reaches step i and i is odd, the intent of the previous word and the slot information of the previous word are taken into account, based on the semantic vector sequence, when predicting the intent of the ([i/2]+1)-th word, where [·] denotes the floor operation and i ∈ [0, 2n].
Preferably, the decoding method is as follows:
y_([i/2]+1) = argmax(softmax(W_y · h_i)),
where W_y and W_o are trainable parameters, the symbol [·] denotes the floor (rounding-down) operation, emb(y_[i/2]) denotes the embedding vector of the intent label y_[i/2], and emb(o_[i/2]) denotes the embedding vector of the slot label o_[i/2],
The semantic vector sequence is expressed as E = (e_[CLS], e_1, e_2, e_3, e_4, ..., e_n),
The output sequence is denoted y = (y_1, o_1, y_2, o_2, y_3, o_3, ..., y_n, o_n),
The word-level intent recognition sequence is y^I = (y_1, y_2, y_3, ..., y_n),
The word-level slot filling sequence is y^S = (o_1, o_2, o_3, ..., o_n).
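For illustration, the interleaved output sequence above can be split back into the two word-level sequences; the intent and slot label values below are hypothetical examples, not taken from the patent:

```python
# De-interleave the alternate-decoding output y = (y_1, o_1, ..., y_n, o_n)
# into the word-level intent sequence y^I and slot sequence y^S.
output = ["PlayMusic", "O", "PlayMusic", "B-genre", "PlayMusic", "B-service"]
intents = output[0::2]   # y^I = (y_1, ..., y_n): intent at every word
slots = output[1::2]     # y^S = (o_1, ..., o_n): slot at every word
print(intents)  # → ['PlayMusic', 'PlayMusic', 'PlayMusic']
print(slots)    # → ['O', 'B-genre', 'B-service']
```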
Preferably, the method further comprises:
S320, when decoding reaches step i and i is even, the intent information of the current word and the slot information of the previous word are taken into account when predicting the slot of the [i/2]-th word, where i ∈ [0, 2n].
Preferably, the decoding method is as follows:
o_[i/2] = argmax(softmax(W_o · h_i)),
where W_y and W_o are trainable parameters, the symbol [·] denotes the floor (rounding-down) operation, emb(y_[i/2]) denotes the embedding vector of the intent label y_[i/2], and emb(o_[i/2]) denotes the embedding vector of the slot label o_[i/2],
The semantic vector sequence is expressed as E = (e_[CLS], e_1, e_2, e_3, e_4, ..., e_n),
The output sequence is denoted y = (y_1, o_1, y_2, o_2, y_3, o_3, ..., y_n, o_n),
The word-level intent recognition sequence is y^I = (y_1, y_2, y_3, ..., y_n),
The word-level slot filling sequence is y^S = (o_1, o_2, o_3, ..., o_n).
Preferably, the method further comprises:
S330, loss-function calculation for the word-level alternate decoding: negative log-likelihood is used as the loss function to optimize the model parameters.
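A minimal sketch of the negative log-likelihood loss summed over interleaved decoding steps; the toy distributions and gold indices below are illustrative, not the patented model:

```python
import numpy as np

def nll(probs, gold):
    # Negative log-likelihood of the gold label under a predicted distribution.
    return -np.log(probs[gold])

# Toy per-step distributions for the first intent/slot pair (y_1, o_1);
# gold labels are given as indices into each distribution.
step_probs = [np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.8, 0.1])]
gold = [0, 1]
loss = sum(nll(p, g) for p, g in zip(step_probs, gold))
print(round(loss, 4))  # → 0.5798
```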
Preferably, the decoder is trained using a scheduled sampling mechanism.
Preferably, in the prediction stage, the intent result of the whole sentence is determined by voting over all word-level intents, and greedy search is used to sample the slot prediction result.
According to another aspect of an embodiment of the present invention, there is provided an electronic apparatus including: one or more processors; and a storage device having one or more programs stored thereon, which when executed by the one or more processors, cause the one or more processors to implement the foregoing method.
According to yet another aspect of embodiments of the present invention, there is provided a non-transitory computer readable storage medium having stored thereon executable instructions that, when run on a processor, implement the foregoing method.
Drawings
To describe the embodiments of the present invention or the prior-art technical solutions more clearly, the drawings used in that description are briefly introduced below. It will be apparent to those of ordinary skill in the art that the following drawings are exemplary only, and that other implementations can be derived from them without inventive effort.
The structures, proportions, sizes, and the like shown in this specification are provided only for illustration and description; they do not limit the scope of the invention, which is defined by the claims. Any structural modification, change of proportion, or adjustment of size that does not affect the efficacy or purpose of the invention falls within the scope of the disclosed technology.
Fig. 1 is a schematic diagram of a spoken language understanding model proposed by the present invention.
Detailed Description
Other aspects and advantages of the present invention will become apparent to those skilled in the art from the following detailed description, which illustrates the invention by way of certain specific embodiments, but not all embodiments. All other embodiments obtained by those skilled in the art from the embodiments of the invention without inventive effort fall within the scope of the invention.
The spoken language understanding model proposed by the invention is shown in Fig. 1. The model uses the pre-trained language model BERT (Bidirectional Encoder Representations from Transformers) as the encoding layer and performs word-level sequence decoding on the semantic vector sequence representation obtained from BERT. The method and device alternately decode word-level intent recognition and word-level slot filling, and initialize the overall decoding state with the sentence-level intent recognition result; in this way intent recognition and slot filling contribute to each other from both the global and the local perspective, improving the effect of the spoken language understanding task.
The specific method comprises the following steps:
Step 1, obtaining a semantic vector sequence coding representation of a pre-training language model BERT:
The BERT model structure is a multi-layer bidirectional Transformer-based encoder whose input includes word embeddings, sentence (segment) embeddings, and position embeddings. The first token of the input sequence is always the special classification token [CLS], and the final hidden state corresponding to this token is used for classification tasks; the special token [SEP] is used as the last token of the sequence. The input of the invention is expressed as x = ([CLS], x_1, x_2, x_3, x_4, ..., x_n), and the semantic vector sequence encoded by BERT is expressed as E = (e_[CLS], e_1, e_2, e_3, e_4, ..., e_n).
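The [CLS]/[SEP] framing can be illustrated with a toy vocabulary. The IDs 101 and 102 match BERT's convention for [CLS] and [SEP], but the word IDs are hypothetical, and a real tokenizer would use WordPiece sub-words:

```python
# Hypothetical toy vocabulary; real BERT tokenization uses WordPiece.
vocab = {"[CLS]": 101, "[SEP]": 102, "play": 2, "jazz": 3, "music": 4}
words = ["play", "jazz", "music"]
tokens = ["[CLS]"] + words + ["[SEP]"]  # x = ([CLS], x_1, ..., x_n, [SEP])
input_ids = [vocab[t] for t in tokens]
print(input_ids)  # → [101, 2, 3, 4, 102]
```

The final hidden state at position 0 (the [CLS] slot) is what step 2 feeds into the sentence-level intent classifier.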
Step 2, obtaining a sentence level intention recognition result:
According to the final hidden state e_[CLS] corresponding to the special token [CLS] obtained from BERT, sentence-level intent recognition is performed with a fully connected neural network (FCN, Fully Connected Network) and a Softmax function, yielding the sentence-level global intent information y_[CLS] = softmax(FCN(e_[CLS])). This global intent information then guides the alternate decoding of word-level intent recognition and slot filling, ensuring overall consistency.
Fully connected network (FCN): for layers n−1 and n, every node of layer n−1 is connected to every node of layer n; that is, when each node of layer n is computed, the input to its activation function is a weighted sum over all nodes of layer n−1, which gives the weight matrix of the fully connected layer.
The fully connected layer multiplies the input vector by the weight matrix and adds a bias, mapping n real numbers in (−∞, +∞) to K real numbers (scores) in (−∞, +∞); Softmax then maps those K real numbers to K real numbers (probabilities) in (0, 1) whose sum is 1. Concretely:
y_[CLS] = softmax(FCN(e_[CLS])) = softmax(W^T x + b)
where x is the input of the fully connected layer, W ∈ R^(n×K) is the weight matrix, b is the bias term, and y_[CLS] is the probability vector output by Softmax.
Split into the probability of each category:
y_[CLS] = softmax(FCN(e_[CLS])) = softmax(W_[CLS] · x + b_[CLS])
where W_[CLS] is the vector formed by the weights between one node of layer n and all nodes of layer n−1 in the fully connected layer.
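The score-to-probability mapping described above can be sketched as follows; the input scores are illustrative, and the max-subtraction is the standard numerically stable variant:

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = [2.0, 1.0, 0.1]     # K unbounded scores from the FCN
probs = softmax(scores)      # K probabilities in (0, 1) summing to 1
print(np.round(probs, 3))    # → [0.659 0.242 0.099]
```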
Step 3, word level intention recognition and slot filling alternate decoding are carried out:
Word-level intent recognition is used to reduce the negative impact on the slot predictions of all words when the intent of the entire sentence is mispredicted.
The invention denotes the output sequence obtained by decoding the semantic vector sequence as y = (y_1, o_1, y_2, o_2, y_3, o_3, ..., y_n, o_n), with sequence length 2n; the corresponding decoder hidden states are denoted h = (h_1, h_2, h_3, ..., h_2n). The word-level intent recognition result sequence is y^I = (y_1, y_2, y_3, ..., y_n), and the word-level slot filling result sequence is y^S = (o_1, o_2, o_3, ..., o_n).
The invention adopts a unidirectional LSTM as the decoder; the LSTM follows the state-transition equations of a long short-term memory network. When decoding reaches step i, i ∈ [0, 2n], the corresponding decoder hidden state is computed as follows:
Step 3.1: when i is odd, the intent of the previous word and the slot information of the previous word are taken into account when predicting the intent of the ([i/2]+1)-th word. The word-level intent is decoded with the following formula:
y_([i/2]+1) = argmax(softmax(W_y · h_i)).
Step 3.2: when i is even, the intent information of the current word and the slot information of the previous word are taken into account when predicting the slot of the [i/2]-th word. The decoding process is:
o_[i/2] = argmax(softmax(W_o · h_i)),
where W_y and W_o are trainable parameters and the symbol [·] denotes the floor operation; emb(y_[i/2]) denotes the embedding vector of the intent label y_[i/2], and emb(o_[i/2]) denotes the embedding vector of the slot label o_[i/2]. Mapping the labels into a high-dimensional embedding space explicitly distinguishes and semantically represents the classification categories to some extent, which helps the model understand the classification labels.
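The odd/even alternation of steps 3.1 and 3.2 can be sketched as below. All dimensions are hypothetical, and the LSTM update is replaced by a random tanh stand-in; this illustrates the decoding schedule only, not the patented decoder:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n_intents, n_slots = 3, 6, 4, 5       # hypothetical sizes

W_y = rng.standard_normal((n_intents, d))   # intent projection W_y
W_o = rng.standard_normal((n_slots, d))     # slot projection W_o

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

intents, slots = [], []
h = rng.standard_normal(d)                  # stand-in decoder state
for i in range(1, 2 * n + 1):
    h = np.tanh(h + 0.1 * rng.standard_normal(d))  # stand-in for the LSTM step
    if i % 2 == 1:   # odd step: predict the intent of word [i/2]+1
        intents.append(int(np.argmax(softmax(W_y @ h))))
    else:            # even step: predict the slot of word [i/2]
        slots.append(int(np.argmax(softmax(W_o @ h))))

print(len(intents), len(slots))  # → 3 3
```

After 2n steps the n intent decisions and n slot decisions have been produced in strict alternation, which is exactly the interleaved output sequence y.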
Step 3.3, loss-function calculation for word-level alternate decoding: the invention uses negative log-likelihood as the loss function to optimize the model parameters, i.e. L(y) = −log(y). A scheduled sampling (SS, Scheduled Sampling) mechanism is used during decoder training to resolve the mismatch between the information available at training time and at prediction time. In scheduled sampling, the sampling rate P varies during training: at the beginning, training is insufficient, so P is kept low and the gold labels are used as input as much as possible; as training progresses, P is increased and the model's own outputs are mostly used as the input for the next prediction. Eventually the training-time model behaves the same as the prediction-time model, eliminating the mismatch between training and prediction.
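A minimal sketch of scheduled sampling; the linear schedule and the helper names are assumptions, since the patent only specifies that the sampling rate P grows over training:

```python
import random

def next_input(gold, model_pred, step, total_steps, rng):
    """With probability P (growing over training) feed the model's own
    previous prediction; otherwise feed the gold label."""
    p = step / total_steps           # hypothetical linear schedule for P
    return model_pred if rng.random() < p else gold

rng = random.Random(0)
early = [next_input("gold", "pred", 1, 100, rng) for _ in range(1000)]
late = [next_input("gold", "pred", 99, 100, rng) for _ in range(1000)]
# Early in training the gold label dominates; late, the model's output does.
print(early.count("pred") < late.count("pred"))  # → True
```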
Step 4, the result acquisition of the prediction stage intention recognition and the slot filling is carried out:
Because word-level intent recognition is used, the final whole-sentence intent result is determined by voting over all word-level intents. Greedy search is used to sample the slot prediction result: like greedy algorithms such as Dijkstra's, Prim's, or Kruskal's, it makes the locally best choice at each step, in the expectation that the overall result is the best or near-optimal. Moreover, using a Conditional Random Field (CRF) here may have negative effects.
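The whole-sentence vote over word-level intents can be sketched as follows (the intent labels are hypothetical):

```python
from collections import Counter

def sentence_intent(word_intents):
    # Majority vote over the word-level intent predictions.
    return Counter(word_intents).most_common(1)[0][0]

print(sentence_intent(["PlayMusic", "PlayMusic", "GetWeather"]))  # → PlayMusic
```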
The method models intent recognition as a word-level classification task and, through a word-level alternate decoding framework, explicitly models the interaction among sentence-level intent recognition, word-level intent recognition, and word-level slot filling; the three assist each other, raising the potential performance of spoken language understanding. This remarkably improves on conventional methods, which treat intent recognition as a sentence-level classification task and slot filling as a sequence labeling task and therefore cannot fully utilize the co-occurrence relation between slots and intents.
The alternate-decoding-based spoken language understanding method provided by the embodiments of the invention can be realized as a software functional module and sold or used as an independent product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, or the part of it contributing over the prior art, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
While the invention has been described in detail in the foregoing general description and specific examples, it will be apparent to those skilled in the art that modifications and improvements can be made thereto. Accordingly, such modifications or improvements may be made without departing from the spirit of the invention and are intended to be within the scope of the invention as claimed.
Claims (8)
1. A method of spoken language understanding based on alternate decoding, comprising:
S100, obtaining a semantic vector sequence from the pre-trained language model BERT based on an input sequence, and taking the first token of the input sequence as the classification token;
S200, obtaining the final hidden state corresponding to the classification token, and performing sentence-level intent recognition on that hidden state with a fully connected neural network and a Softmax function to obtain sentence-level global intent information;
S300, performing alternate decoding of word-level intent recognition and slot filling based on the semantic vector sequence and the sentence-level global intent information; wherein the alternate decoding includes: obtaining the corresponding decoder hidden states based on the semantic vector sequence and the sentence-level global intent information, decoding those hidden states to obtain an output sequence of length 2n, and obtaining a word-level intent recognition sequence and a word-level slot filling sequence from the output sequence, wherein the decoder is a unidirectional LSTM;
S310, when decoding reaches step i and i is odd, taking into account, based on the semantic vector sequence, the intent of the previous word and the slot information of the previous word when predicting the intent of the ([i/2]+1)-th word, where i ∈ [0, 2n];
S320, when decoding reaches step i and i is even, taking into account the intent information of the current word and the slot information of the previous word when predicting the slot of the [i/2]-th word, where i ∈ [0, 2n];
S330, calculating the loss function of the word-level alternate decoding: optimizing the model parameters using negative log-likelihood as the loss function;
S400, obtaining the intent recognition and slot filling results from the result of the alternate decoding.
2. The alternative decoding-based spoken language understanding method of claim 1, wherein the decoding is implemented to map intent tags and slot tags to a high-dimensional embedding space, explicitly distinguishing and semantically representing classification categories.
3. The spoken language understanding method based on alternate decoding according to claim 1, wherein the decoding method is as follows:
y_([i/2]+1) = argmax(softmax(W_y · h_i)),
where W_y is a trainable parameter, the symbol [·] denotes the floor operation, emb(y_[i/2]) denotes the embedding vector of the intent label y_[i/2], and emb(o_[i/2]) denotes the embedding vector of the slot label o_[i/2],
The semantic vector sequence is expressed as E = (e_[CLS], e_1, e_2, e_3, e_4, ..., e_n), where [CLS] is the special classification token of the input sequence,
The output sequence is denoted y = (y_1, o_1, y_2, o_2, y_3, o_3, ..., y_n, o_n),
The word-level intent recognition sequence is y^I = (y_1, y_2, y_3, ..., y_n),
The word-level slot filling sequence is y^S = (o_1, o_2, o_3, ..., o_n).
4. The alternative decoding-based spoken language understanding method of claim 1, wherein the decoding method is as follows:
o_[i/2] = argmax(softmax(W_o · h_i)),
where W_o is a trainable parameter,
The semantic vector sequence is expressed as E = (e_[CLS], e_1, e_2, e_3, e_4, ..., e_n),
The output sequence is denoted y = (y_1, o_1, y_2, o_2, y_3, o_3, ..., y_n, o_n),
The word-level intent recognition sequence is y^I = (y_1, y_2, y_3, ..., y_n),
The word-level slot filling sequence is y^S = (o_1, o_2, o_3, ..., o_n);
the symbol [·] denotes the floor operation, emb(y_[i/2]) denotes the embedding vector of the intent label y_[i/2], and [CLS] is the special classification token of the input sequence.
5. The alternate-decoding-based spoken language understanding method of claim 1, wherein the decoder is trained using a scheduled sampling mechanism.
6. The alternative decoding-based spoken language understanding method of any one of claims 1-5,
wherein, in the prediction stage, the intent result of the whole sentence is determined by voting over all word-level intents, and greedy search is used to sample the slot prediction result.
7. An electronic device, comprising: one or more processors; storage means having stored thereon one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-6.
8. A non-transitory computer readable storage medium having stored thereon executable instructions that, when run on a processor, implement the method according to any of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011045822.3A CN112183062B (en) | 2020-09-28 | 2020-09-28 | Spoken language understanding method based on alternate decoding, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011045822.3A CN112183062B (en) | 2020-09-28 | 2020-09-28 | Spoken language understanding method based on alternate decoding, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112183062A CN112183062A (en) | 2021-01-05 |
CN112183062B true CN112183062B (en) | 2024-04-19 |
Family
ID=73945683
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011045822.3A Active CN112183062B (en) | 2020-09-28 | 2020-09-28 | Spoken language understanding method based on alternate decoding, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112183062B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113204952B (en) * | 2021-03-26 | 2023-09-15 | 南京邮电大学 | Multi-intention and semantic slot joint identification method based on cluster pre-analysis |
CN114021582B (en) * | 2021-12-30 | 2022-04-01 | 深圳市北科瑞声科技股份有限公司 | Spoken language understanding method, device, equipment and storage medium combined with voice information |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109858030A (en) * | 2019-02-11 | 2019-06-07 | 北京邮电大学 | The Task dialogue of two-way intention slot value crosscorrelation understands system and method |
CN110188342A (en) * | 2019-04-19 | 2019-08-30 | 杭州电子科技大学 | A kind of speech understanding method of knowledge based map and semantic diagram technology |
CN110838288A (en) * | 2019-11-26 | 2020-02-25 | 杭州博拉哲科技有限公司 | Voice interaction method and system and dialogue equipment |
CN111177341A (en) * | 2019-12-11 | 2020-05-19 | 江苏艾佳家居用品有限公司 | End-to-end ID + SF model-based user conversation demand extraction method and system |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10431207B2 (en) * | 2018-02-06 | 2019-10-01 | Robert Bosch Gmbh | Methods and systems for intent detection and slot filling in spoken dialogue systems |
- 2020-09-28: application CN202011045822.3A filed in China; granted as CN112183062B (status: Active)
Also Published As
Publication number | Publication date |
---|---|
CN112183062A (en) | 2021-01-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111783462B (en) | Chinese named entity recognition model and method based on double neural network fusion | |
Yao et al. | An improved LSTM structure for natural language processing | |
Kalchbrenner et al. | Recurrent convolutional neural networks for discourse compositionality | |
JP7087938B2 (en) | Question generator, question generation method and program | |
CN108962224B (en) | Joint modeling method, dialogue method and system for spoken language understanding and language model | |
CN110516253B (en) | Chinese spoken language semantic understanding method and system | |
CN107870902A (en) | Neural machine translation system | |
CN109074517B (en) | Global normalized neural network | |
CN110866401A (en) | Chinese electronic medical record named entity identification method and system based on attention mechanism | |
CN111062217B (en) | Language information processing method and device, storage medium and electronic equipment | |
CN112183061B (en) | Multi-intention spoken language understanding method, electronic equipment and storage medium | |
US20210248471A1 (en) | Method and apparatus for creating dialogue, and storage medium | |
WO2019235103A1 (en) | Question generation device, question generation method, and program | |
US11694677B2 (en) | Decoding method and apparatus in artificial neural network for speech recognition | |
CN112183062B (en) | Spoken language understanding method based on alternate decoding, electronic equipment and storage medium | |
CN110678882A (en) | Selecting answer spans from electronic documents using machine learning | |
CN116204674B (en) | Image description method based on visual concept word association structural modeling | |
CN111737974A (en) | Semantic abstract representation method and device for statement | |
CN115186147B (en) | Dialogue content generation method and device, storage medium and terminal | |
CN111145914B (en) | Method and device for determining text entity of lung cancer clinical disease seed bank | |
CN114912450B (en) | Information generation method and device, training method, electronic device and storage medium | |
JP7342971B2 (en) | Dialogue processing device, learning device, dialogue processing method, learning method and program | |
Park et al. | Natural language generation using dependency tree decoding for spoken dialog systems | |
CN110633473A (en) | Implicit discourse relation identification method and system based on conditional random field | |
CN112364659B (en) | Automatic identification method and device for unsupervised semantic representation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |