CN112183062A

CN112183062A - Spoken language understanding method based on alternate decoding, electronic equipment and storage medium

Info

Publication number: CN112183062A
Application number: CN202011045822.3A
Authority: CN
Inventors: 刘广灿
Original assignee: Unisound Intelligent Technology Co Ltd; Xiamen Yunzhixin Intelligent Technology Co Ltd
Current assignee: Unisound Intelligent Technology Co Ltd; Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date: 2020-09-28
Filing date: 2020-09-28
Publication date: 2021-01-05
Anticipated expiration: 2040-09-28
Also published as: CN112183062B

Abstract

The invention discloses a spoken language understanding method based on alternate decoding, electronic equipment and a storage medium.

Description

Spoken language understanding method based on alternate decoding, electronic equipment and storage medium

Technical Field

The invention relates to the technical field of man-machine conversation systems, in particular to a spoken language understanding method based on alternate decoding, electronic equipment and a storage medium.

Background

Spoken language understanding mainly includes two subtasks: intent recognition (Intent Detection) and Slot Filling (Slot Filling). The two tasks of intent recognition and slot filling are not independent of each other, slot filling is highly dependent on the result of intent recognition, and slot filling may also facilitate intent recognition. In the prior art, two tasks are jointly modeled to fully utilize knowledge information common to the two tasks, a Multi-task (Multi-task) framework is generally adopted, two subtasks share a coding layer, and then loss functions of the two parts are added to model.

The existing method for using a multi-task frame and adopting a shared coding layer and loss function addition is an implicit joint modeling method, and the method does not explicitly model the interaction relationship between intention identification and slot position filling two subtasks; the Slot-Gated method and the SF-ID method are also proposed for preliminary exploration, but the existing models cannot fully utilize the co-occurrence relationship between the slots and the intentions, so that the potential performances of the slots and the intentions are limited.

Disclosure of Invention

In order to solve the problems, the invention proposes that the intention recognition at the word level and the slot filling at the word level are alternately decoded, and the whole decoding state is initialized by using the intention recognition result at the sentence level, so that the intention recognition and the slot filling can mutually contribute from the global and local angles, thereby improving the effect of the spoken language understanding task.

According to an aspect of an embodiment of the present invention, there is provided a spoken language understanding method based on alternating decoding, including:

s100, obtaining a semantic vector sequence of a pre-training language model BERT based on an input sequence, and marking a first label of the input sequence as a classification label;

s200, obtaining a corresponding final hidden state according to the classification marks, and performing sentence level intention recognition by using a fully-connected neural network and a Softmax function based on the final hidden state to obtain sentence level global intention information;

s300, performing word level intention recognition and slot filling alternate decoding based on the semantic vector sequence and the sentence level global intention information;

s400 obtains results of intent recognition and slot filling based on the results of the alternating decoding.

Preferably, the alternating decoding comprises: and obtaining a corresponding decoder hidden state based on the semantic vector sequence and the sentence level global intention information, decoding the decoder hidden state to obtain an output sequence, wherein the length of the output sequence is 2n, and obtaining a word level intention identification sequence and a word level slot filling sequence based on the output sequence, wherein the decoder is a unidirectional LSTM.

Preferably, the decoding is implemented to map intent tags and slot tags into a high-dimensional embedding space, explicitly distinguishing and semantically representing classification categories to facilitate understanding of the classification tags.

Preferably, the method further comprises the following steps:

s310, when the ith step is decoded and i is an odd number, when the intention of the ([ i/2] +1) th word is predicted, the intention of the previous word and the slot position information corresponding to the previous word are analyzed based on the semantic vector sequence, wherein i belongs to [0,2n ].

Preferably, the decoding method is as follows:

y_[i/2]+1＝argmax(softmax(W_{y i}))，

wherein, W_yAnd W_oAre trainable parameters and symbols]It is shown that the rounding operation is performed,

label y for indicating intention_[i/2]The embedded vector of (a) is embedded,

indicating slot label o_[i/2]Is to be used to represent the embedded representation of,

the semantic vector sequence is represented as e ═ (e)_[cls],e₁,e₂,e₃,e₄,...,e_n)，

The output sequence is represented as y ═ y (y)₁,o₁,y₂,o₂,y₃,o₃,...,y_n,o_n)，

The word-level intent recognition sequence is yⁱ＝(y₁,y₂,y₃,...,y_n)，

The word-level slot fill sequence is y^s＝(o₁,o₂,o₃,...,o_n)。

Preferably, the method further comprises the following steps:

s320, when the ith step is decoded and i is an even number, when the slot position of the [ i/2] th word is predicted, analyzing intention information of the current word and slot position information of the previous word, wherein i belongs to [0,2n ].

Preferably, the decoding method is as follows:

o_[i/2]＝argmax(softmax(W_{o i}))，

label y for indicating intention_[i/2]The embedded vector of (a) is embedded,

The word-level slot fill sequence is y^s＝(o₁,o₂,o₃,...,o_n)。

Preferably, the method further comprises the following steps:

s330 loss function calculation for word-level alternating decoding: the model parameters are optimized using negative log-likelihood as a loss function.

Preferably, the decoder is trained using a planned sampling mechanism.

Preferably, in the prediction stage, the voting results of all word intentions are adopted for determining the intention result of the whole word, and sampling is performed by using greedy search to obtain a slot prediction result.

According to another aspect of an embodiment of the present invention, there is provided an electronic apparatus including: one or more processors; a storage device having one or more programs stored thereon which, when executed by the one or more processors, cause the one or more processors to implement the aforementioned methods.

According to yet another aspect of embodiments of the present invention, there is provided a non-transitory computer-readable storage medium having stored thereon executable instructions that, when executed on a processor, implement the foregoing method.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It should be apparent that the drawings in the following description are merely exemplary, and that other embodiments can be derived from the drawings provided by those of ordinary skill in the art without inventive effort.

The structures, ratios, sizes, and the like shown in the present specification are only used for matching with the contents disclosed in the specification, so as to be understood and read by those skilled in the art, and are not used to limit the conditions that the present invention can be implemented, so that the present invention has no technical significance, and any structural modifications, changes in the ratio relationship, or adjustments of the sizes, without affecting the effects and the achievable by the present invention, should still fall within the range that the technical contents disclosed in the present invention can cover.

Fig. 1 is a schematic diagram of a spoken language understanding model provided by the present invention.

Detailed Description

The present invention is described in terms of particular embodiments, other advantages and features of the invention will become apparent to those skilled in the art from the following disclosure, and it is to be understood that the described embodiments are merely exemplary of the invention and that it is not intended to limit the invention to the particular embodiments disclosed. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The model for spoken language understanding proposed by the present invention is shown in fig. 1, and the model uses a pre-training language model BERT (bidirectional Encoder expressions from transforms) as a coding layer, and performs word-level sequence decoding using semantic vector sequence representation obtained by BERT. The invention provides that the intention recognition at the word level and the slot filling at the word level are alternately decoded, and the whole decoding state is initialized by using the intention recognition result at the sentence level, so that the intention recognition and the slot filling can mutually contribute from the global and local angles, and the effect of the spoken language understanding task is improved.

The specific method comprises the following steps:

step 1, obtaining semantic vector sequence coding representation of a pre-training language model BERT:

the BERT model structure is a multi-layer bidirectional-based Transformer encoder, and the input of the BERT model structure comprises three parts of word embedding, sentence embedding and position embedding. The first marker of the input sequence is always a special class marker [ CLS]The final hidden state corresponding to the special mark is used for a classification task; and use a special mark SEP]As the last marker of the sequence. The input to the present invention is expressed as x ═ ([ cls ]],x₁,x₂,x₃,x₄,...,x_n) The BERT derived semantic vector sequence code is expressed as e ═ (e)_[cls],e₁,e₂,e₃,e₄,...,e_n)。

And step 2, obtaining a sentence level intention recognition result:

special symbols [ CLS ] obtained from BERT]And corresponding to the final hidden state, performing sentence-level intention recognition by using a fully Connected neural Network (FCN) and a Softmax function, and finally obtaining sentence-level global intention information y_[cls]＝softmax(FCN(e_[cls]) And then use sentence-level global intention information y)_[cls]And guiding the alternate decoding process of word-level intention identification and slot filling to ensure the overall consistency.

Wherein, the full-connection neural network FCN analyzes: for the n-1 layer and the n layer, any node of the n-1 layer is connected with all nodes of the n layer. That is, when each node of the nth layer performs calculation, the input of the activation function is the weight of all nodes of the n-1 layer, so as to obtain the weight matrix of the fully-connected layer.

The fully-connected layer multiplies the weight matrix by the input vector, adds an offset, and maps n (— infinity, + infinity) real numbers into K (— infinity, + infinity) real numbers (fractions); softmax maps K (— infinity, + ∞) real numbers to K (0,1) real numbers (probabilities), while ensuring that their sum is 1. The method comprises the following specific steps:

y_[CLS]＝softmax(FCN(e_[cls]))＝softmax(W^Tx+b)

where x is the input of the full link layer, W_nxK is the weight b is the bias term, y_[CLS]Probability of output of Softmax.

The probability of splitting into each class is as follows:

y_[cls]＝softmax(FCN(e_[cls]))＝softmax(W_[CLS]·x+b_[CLS])

wherein, W_[CLS]And a vector is formed by the weight of one node of the n-th layer in the full connection layer and all nodes of the n-1-th layer.

And 3, performing word level intention identification and slot filling alternate decoding:

word-level intent recognition is used to reduce the negative impact of slot prediction on all words when the intent of the entire sentence is mispredicted.

The invention expresses the output sequence obtained by decoding the semantic vector sequence as y ═ y₁,o₁,y₂,o₂,y₃,o₃,...,y_n,o_n) The sequence length is 2 n; the hidden state of the corresponding decoder is denoted as ═ c (c)₁, ₂, ₃,..., _2n) (ii) a Wherein the word-level intent recognition decoding result sequence is yⁱ＝(y₁,y₂,y₃,...,y_n) The result sequence of word-level slot filling is y^s＝(o₁,o₂,o₃,...,o_n)。

The invention adopts a unidirectional LSTM as a decoder, the LSTM refers to a state transfer equation of a long-time memory network, when the decoding reaches the ith step, i belongs to [0,2n ], and the calculation process of the hidden state of the corresponding decoder is as follows:

and 3.1, when i is an odd number, when the intention of the ([ i/2] +1) th word is predicted, analyzing the intention of the previous word and the slot position information corresponding to the previous word. Decoding the word-level intent using the following formula:

y_[i/2]+1＝argmax(softmax(W_{y i}))。

and 3.2, when i is an even number, analyzing intention information of the current word and slot position information of the previous word when the slot position of the [ i/2] th word is predicted. The decoding process is as follows:

o_[i/2]＝argmax(softmax(W_{o i}))，

wherein W_yAnd W_oAre trainable parameters and symbols]Indicating a rounding operation.

Label y for indicating intention_[i/2]The embedded vector of (a) is embedded,

indicating slot label o_[i/2]Mapping tags into a high-dimensional embedding space to a certain extent explicitly distinguishes and semantically represents classification categories, thereby facilitating the understanding of classification tags by the model.

And 3.3, calculating a loss function of the word-level alternative decoding: the present invention uses negative log-likelihood as a loss function to optimize the model parameters, i.e., l (y) -log (y). And a plan Sampling (SS) mechanism is used in the decoder training process to solve the problem that the training and prediction information is biased. In planning sampling, i.e. the sampling rate P is varied during training. The training is insufficient at the beginning, the sampling rate P is reduced, real labels are used as input as much as possible, the sampling rate P is increased along with the training, and the output of the sampling rate P is mostly used as the input of the next prediction. As training progresses, the sampling rate P becomes larger and larger, and the training model is finally the same as the prediction model, so that the deviation between the training information and the prediction information is eliminated.

And 4, acquiring results of intention identification and slot filling in the prediction stage:

because the word level intention recognition mode is used, the final intention result of the whole sentence is decided by adopting the voting result of all word intentions; sampling is performed by greedy search to obtain a slot prediction result, such as Dijkstra algorithm, Prim algorithm, Kruskal algorithm and the like, and when solving is performed, an optimal solution is guaranteed each time, so that the coverage area of each time is maximized, and the best or optimal (i.e., most favorable) selection is adopted in each step of selection, so that the result is expected to be the best or optimal. Furthermore, the use of Conditional Random Fields (CRF) may have negative effects.

The method models the intention recognition into a word-level classification task, explicitly models the interactive relation among the sentence-level intention recognition, the word-level intention recognition and the word-level slot position through a word-level alternate decoding frame, and improves the potential performance of spoken language understanding through mutual assistance of the three. Therefore, the problem that the conventional method treats intention recognition as a sentence-level classification task and fills the slot position as a sequence labeling task and cannot fully utilize the co-occurrence relationship between the slot and the intention is solved.

The spoken language understanding method based on the alternative decoding provided by the embodiment of the invention can be realized in the form of a software functional module, sold or used as an independent product, and can be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

Although the invention has been described in detail above with reference to a general description and specific examples, it will be apparent to one skilled in the art that modifications or improvements may be made thereto based on the invention. Accordingly, such modifications and improvements are intended to be within the scope of the invention as claimed.

Claims

1. A method for spoken language understanding based on alternating decoding, comprising:

2. The alternate decoding-based spoken language understanding method of claim 1, wherein the alternate decoding comprises:

deriving a corresponding decoder hidden state based on the semantic vector sequence and the sentence-level global intent information,

decoding the hidden state of the decoder to obtain an output sequence, wherein the length of the output sequence is 2n, and obtaining a word-level intention identification sequence and a word-level slot filling sequence based on the output sequence, wherein the decoder is a unidirectional LSTM.

3. The spoken language understanding method based on alternating decoding according to claim 2,

the decoding is implemented to map intent tags and slot tags into a high-dimensional embedding space, explicitly distinguishing and semantically representing classification categories to facilitate understanding of the classification tags.

4. The spoken language understanding method based on alternating decoding according to claim 2, further comprising:

5. The spoken language understanding method based on alternating decoding according to claim 4, characterized in that the decoding method is as follows:

y_[i/2]+1＝argmax(softmax(W_yi))，

label y for indicating intention_[i/2]The embedded vector of (a) is embedded,

the semantic vector sequence is represented as e ═ (e)_[cls]，e₁，e₂，e₃，e₄，...，e_n)，

The output sequence is represented as y ═ y (y)₁，o₁，y₂，o₂，y₃，o₃，...，y_n，o_n)，

The word-level intent recognition sequence is yⁱ＝(y₁，y₂，y₃，...，y_n)，

The word-level slot fill sequence is y^s＝(o₁，o₂，o₃，...，o_n)。

6. The spoken language understanding method based on alternating decoding according to claim 4, further comprising:

7. The spoken language understanding method based on alternating decoding according to claim 6,

the decoding method comprises the following steps:

o_[i/2]＝argmax(softmax(W_oi))，

label y for indicating intention_[i/2]The embedded vector of (a) is embedded,

The word-level slot fill sequence is y^s＝(o₁，o₂，o₃，...，o_n)。

8. The spoken language understanding method based on alternating decoding according to claim 6, further comprising:

9. The spoken language understanding method based on alternating decoding according to claim 8,

the decoder is trained using a planned sampling mechanism.

10. The method for spoken language understanding based on alternating decoding according to one of claims 1 to 9,

in the prediction stage, the whole sentence intention result is determined by adopting the voting results of all word intentions, and sampling is carried out by greedy search to obtain a slot position prediction result.

11. An electronic device, comprising: one or more processors; storage means having one or more programs stored thereon which, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-10.

12. A non-transitory computer readable storage medium having stored thereon executable instructions which, when executed on a processor, implement the method of any one of claims 1-10.