CN111814489A

CN111814489A - Spoken language semantic understanding method and system

Info

Publication number: CN111814489A
Application number: CN202010716764.6A
Authority: CN
Inventors: 俞凯; 刘辰; 朱苏; 赵子健; 曹瑞升; 陈露
Original assignee: AI Speech Ltd
Current assignee: AI Speech Ltd
Priority date: 2020-07-23
Filing date: 2020-07-23
Publication date: 2020-10-23

Abstract

The embodiment of the invention provides a spoken language semantic understanding method. The method comprises the following steps: serializing the word confusion network and the system behaviors of the previous round of conversation, and splicing them into an input sequence; performing word segmentation on the input sequence to obtain a sequence at the sub-word level, and determining the word embedding, position embedding and fragment embedding of each word segmentation as input of a Transformer-based bidirectional coding representation model; outputting feature vectors at the sub-word level from the Transformer-based bidirectional coding representation model; gradually aggregating the feature vectors at the sub-word level into feature vectors at the sentence level through a sentence representation module; and determining the system behavior of the spoken dialog in the current turn based on the feature vectors at the sentence level. The embodiment of the invention also provides a spoken language semantic understanding system. The embodiment of the invention utilizes the posterior probability information of the word confusion network and encodes the context information of the conversation by introducing the system behavior of the previous round, thereby improving the ability of the model to encode uncertainty and showing good generalization performance.

Description

Spoken language semantic understanding method and system

Technical Field

The invention relates to the field of semantic understanding, in particular to a spoken language semantic understanding method and a spoken language semantic understanding system.

Background

Spoken language understanding is a technique for converting the output produced by automatic speech recognition into a structured semantic representation and is therefore very sensitive to speech recognition errors. Traditional spoken language understanding systems use the best hypothesis text of speech recognition as input. To improve the robustness of the system to speech recognition errors, speech recognition outputs containing more information are also used as input for spoken language understanding, such as N-best hypothesis lists, word lattices, and Word Confusion Networks (WCNs). Compared with word lattices, the word confusion network has a more compact structure and is more efficient to compute. Spoken language understanding is usually performed using: 1. a method of modeling word confusion networks with an LSTM (Long Short-Term Memory network); 2. a method of modeling word lattices with GPT (Generative Pre-Training).

The word confusion network can be regarded as a segment (bin) sequence, each segment comprises all candidate words and corresponding posterior probability in the time step, word embedding vectors in each segment are weighted to obtain word embedding representation at the segment (bin) level, and the word embedding representation is input into the LSTM for coding.

The word lattice is a directed acyclic graph, and the word lattice is coded by using a one-way pre-training language model GPT. The essence of GPT is a Transformer structure based entirely on the attention mechanism, representing the positional relationship between nodes of a word lattice by introducing a reachability mask.

In the process of implementing the invention, the inventor finds that at least the following problems exist in the related art:

In the techniques that encode word confusion networks with an LSTM, the embeddings of all candidate words in each segment are simply weighted by the word posterior probabilities to obtain a segment (bin)-level representation. These features are local, i.e., they do not take the interactions between segments into account. In addition, because the LSTM parallelizes poorly, model training and inference are slow, which limits its application in large-scale deep neural networks.

Since a word confusion network contains a large amount of uncertainty information, in a very noisy environment, simply taking the uncertainty assumption of speech recognition as a semantic understanding input is still not sufficient to arrive at a correct result.

In the method for modeling word lattices by using GPT, the modeling object is the word lattice, and the word confusion network has certain advantages in space and time efficiency compared with the word lattice. In addition, GPT is a unidirectional language model, future information is not considered in the encoding process, and the encoding capability of the GPT is slightly insufficient.

Disclosure of Invention

The method aims to at least solve the problems that context information is not considered, the speed is low, and the spoken language understanding effect is poor in the prior art.

In a first aspect, an embodiment of the present invention provides a spoken language semantic understanding method, including:

serializing the word confusion network and the system behaviors of the previous round of conversation, and splicing the serialized word confusion network and the system behaviors into an input sequence, wherein the word confusion network comprises: candidate words of the current dialog and posterior probabilities of the candidate words, the system behavior comprising: structured behavior-slot-value triplets;

performing word segmentation on the input sequence, and determining word embedding, position embedding and fragment embedding of each word segmentation as input of a Transformer-based bidirectional coding representation model;

modifying self-attention weights within the Transformer-based bidirectional coding representation model based on the posterior probabilities of the candidate words in the word confusion network, the model outputting feature vectors at a sub-word level;

gradually aggregating the feature vectors at the sub-word level into feature vectors at a statement level through a statement representation module;

and outputting a structured behavior-slot-value triple based on the feature vector of the statement level, and determining the triple as the system behavior of the spoken language dialog.

In a second aspect, an embodiment of the present invention provides a spoken language semantic understanding system, including:

a sequence conversion program module, configured to serialize a word confusion network and system behaviors of a previous round of dialog and splice the system behaviors into an input sequence, where the word confusion network includes: candidate words of the current dialog and posterior probabilities of the candidate words, the system behavior comprising: structured behavior-slot-value triplets;

the word segmentation program module is used for performing word segmentation on the input sequence, determining word embedding, position embedding and fragment embedding of each word segmentation, and taking the word embedding, position embedding and fragment embedding as input of a Transformer-based bidirectional coding representation model;

a feature vector determination program module for modifying self-attention weights within the Transformer-based bidirectional coding representation model based on the posterior probabilities of the candidate words in the word confusion network, the model outputting feature vectors at a sub-word level;

the aggregation program module is used for gradually aggregating the feature vectors at the sub-word level into feature vectors at the statement level through the statement representation module;

and the triple determining program module is used for outputting a structured behavior-slot-value triple based on the feature vector of the statement level and determining the triple as the system behavior of the spoken language dialog.

In a third aspect, an electronic device is provided, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the steps of the spoken semantic understanding method of any of the embodiments of the present invention.

In a fourth aspect, an embodiment of the present invention provides a storage medium, on which a computer program is stored, where the computer program is configured to, when executed by a processor, implement the steps of the spoken language semantic understanding method according to any embodiment of the present invention.

The embodiment of the invention has the beneficial effects that: the method is used for a semantic tuple prediction task in spoken language understanding. By improving the BERT calculation, the posterior probability information of the word confusion network can be effectively utilized. In addition, the structural information of the word confusion network is embodied in the position sequence and the characteristic aggregation module. And further, context information of a conversation is coded by introducing the system behavior of the previous round, so that the capability of the model for coding uncertainty in a noise environment is further improved, and good generalization performance is shown.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

FIG. 1 is a flowchart of a method for semantic understanding of spoken language according to an embodiment of the present invention;

FIG. 2 is a structural diagram of the WCN-BERT SLU model of a spoken language semantic understanding method according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of an aggregation of the spoken language semantic understanding method according to an embodiment of the present invention;

FIG. 4 is a data diagram of the F1 scores (%) and sentence-level accuracy (%) of the baseline models and the proposed models on the test set, for the spoken language semantic understanding method provided by an embodiment of the present invention;

FIG. 5 is a data diagram comparing a spoken semantic understanding method provided by an embodiment of the invention with the prior art on a DSTC2 data set;

FIG. 6 is a graph of F1 score (%) data for an ablation study on a test set of a spoken language semantic understanding method provided by an embodiment of the present invention;

FIG. 7 is a data diagram of the F1 scores (%) of the WCN-BERT SLU on the test set, for the spoken language semantic understanding method according to an embodiment of the present invention;

FIG. 8 is a schematic structural diagram of a spoken language semantic understanding system according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Fig. 1 is a flowchart of a spoken language semantic understanding method according to an embodiment of the present invention, which includes the following steps:

s11: serializing the word confusion network and the system behaviors of the previous round of conversation, and splicing the serialized word confusion network and the system behaviors into an input sequence, wherein the word confusion network comprises: candidate words of the current dialog and posterior probabilities of the candidate words, the system behavior comprising: structured behavior-slot-value triplets;

S12: performing word segmentation on the input sequence, and determining word embedding, position embedding and fragment embedding of each word segmentation as input of a Transformer-based bidirectional coding representation model;

S13: modifying self-attention weights within the Transformer-based bidirectional coding representation model based on the posterior probabilities of the candidate words in the word confusion network, the model outputting feature vectors at a sub-word level;

s14: gradually aggregating the feature vectors at the sub-word level into feature vectors at a statement level through a statement representation module;

s15: and outputting a structured behavior-slot-value triple based on the feature vector of the statement level, and determining the triple as the system behavior of the spoken language dialog.

In the present embodiment, a method of encoding a word confusion network with BERT (Bidirectional Encoder Representations from Transformers, i.e. the Transformer-based bidirectional coding representation model) is applied to the semantic tuple prediction task in spoken language understanding. To further improve the fault tolerance of the model, in addition to the speech recognition hypotheses, the system behavior (system act) of the previous round of dialog is introduced; the system behavior represents the feedback of the system to the user's request and can reflect the user's intention to a certain extent. As shown in FIG. 2, the model comprises three modules: a BERT encoder, a sentence representation module and an output module.

For step S11, the input to BERT contains two parts: the word confusion network and the system behavior of the previous round. The input is serialized before entering BERT. When the word confusion network is serialized, all words in each segment are expanded in a default order, so that the word confusion network shown in the figure is serialized into "I want a of gastropub pub food". The system behavior is a group of triples with the structure behavior-slot-value (act-slot-value), and it is serialized by splicing the words of all triples into one sequence. The word confusion network in this example comprises: I/1.0; want/1.0; (a/0.84; of/0.16); (gastropub/0.96; pub/0.04); food/1.0.

As an embodiment, the serializing the system behaviors of the word confusion network and the previous dialog includes:

and when the system behavior of the word confusion network or the previous round of conversation comprises a compound word consisting of a plurality of words, segmenting the compound word into a plurality of words and then serializing.

In this embodiment, some compound words are segmented into meaningful words (e.g. "pricerange" is segmented into the two words "price" and "range"), and then the system behavior shown in the figure (containing only one triple) is serialized into "informance range model".
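The serialization described for step S11 can be sketched as follows. This is a minimal illustration, not the patented implementation: the list-of-bins data layout, the `COMPOUND_SPLITS` lookup table and the example triple are assumptions introduced here.

```python
from typing import Dict, List, Tuple

# A WCN is modelled as a list of bins; each bin is a list of
# (candidate word, posterior probability) pairs. This layout is an assumption.
Bin = List[Tuple[str, float]]

# Hypothetical lookup used to split compound names into meaningful words.
COMPOUND_SPLITS: Dict[str, List[str]] = {"pricerange": ["price", "range"]}


def serialize_wcn(wcn: List[Bin]) -> Tuple[List[str], List[float]]:
    """Expand every bin in its stored order into one flat word sequence."""
    words, probs = [], []
    for bin_ in wcn:
        for word, prob in bin_:
            words.append(word)
            probs.append(prob)
    return words, probs


def serialize_system_acts(acts: List[Tuple[str, str, str]]) -> List[str]:
    """Splice the words of all act-slot-value triples into one sequence."""
    words = []
    for triple in acts:
        for item in triple:
            # Compound words are split before being appended.
            words.extend(COMPOUND_SPLITS.get(item, [item]))
    return words


wcn = [
    [("i", 1.0)],
    [("want", 1.0)],
    [("a", 0.84), ("of", 0.16)],
    [("gastropub", 0.96), ("pub", 0.04)],
    [("food", 1.0)],
]
wcn_words, wcn_probs = serialize_wcn(wcn)

# Illustrative triple only; it is not taken from the figure.
sa_words = serialize_system_acts([("inform", "pricerange", "moderate")])
print(" ".join(wcn_words))
print(" ".join(sa_words))
```

Keeping the posterior probabilities in a list parallel to the flattened words makes them easy to reuse later, when the attention weights and the segment-level aggregation need them.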

For step S12, the input sequence is segmented by the BERT word segmenter (tokenizer) to obtain a sequence of sub-words; for example, "gastropub" becomes "gas ##tro ##pu ##b" after segmentation. The position sequence and fragment sequence corresponding to this sub-word sequence are then obtained; these three sequences are the necessary inputs for BERT.

As an embodiment, the segmenting the input sequence, and determining word embedding, position embedding and segment embedding of each segmented word includes:

performing word segmentation on the input sequence through a word segmentation device of the Transformer-based bidirectional coding representation model to obtain a sequence at a sub-word level, and determining a sub-word sequence, a position sequence and a fragment sequence of each word segmentation;

inputting the sub-word sequence, the position sequence and the fragment sequence into an embedding layer of the Transformer-based bidirectional coding representation model, and outputting word embedding, position embedding and fragment embedding.

These sequences must be converted into embeddings that BERT can process: the sub-word sequence, the position sequence and the fragment sequence are passed through the embedding layer of BERT to obtain three different embeddings, namely the token, position and segment embeddings in the figure (corresponding to word embedding, position embedding and fragment embedding, respectively).

As an embodiment, the position numbers of the sub-word sequences of the same participle are the same;

in the fragment sequence, the fragment values of the participles in the word confusion network are the same, and the fragment values of the participles in the previous round of system behavior are the same and different from the fragment values of the participles in the word confusion network.

In the present embodiment, for the position sequence, the positions of all sub-words in each segment are the same; for example, the positions of "gas", "##tro", "##pu", "##b" and "pub" are all 4. For the fragment sequence, the word confusion network corresponds to 0 and the system behavior corresponds to 1.
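A minimal sketch of how the sub-word, position and fragment sequences could be built is shown below. The toy tokenizer table stands in for a real BERT tokenizer. The position of "[CLS]" (0), the positions and fragment values of the "[SEP]" tokens, and the per-sub-word positions in the system behavior part are assumptions; the text fixes only that sub-words of one segment share a position and that the two parts use fragment values 0 and 1.

```python
from typing import Dict, List, Tuple

# Toy sub-word table standing in for a real BERT tokenizer (assumption).
TOY_SUBWORDS: Dict[str, List[str]] = {"gastropub": ["gas", "##tro", "##pu", "##b"]}


def toy_tokenize(word: str) -> List[str]:
    """Return the sub-words of a word; unknown words are kept whole."""
    return TOY_SUBWORDS.get(word, [word])


def build_inputs(wcn, sa_words) -> Tuple[List[str], List[int], List[int]]:
    """Build the sub-word, position and fragment sequences for BERT."""
    tokens, positions, fragments = [], [], []

    def add(token: str, position: int, fragment: int) -> None:
        tokens.append(token)
        positions.append(position)
        fragments.append(fragment)

    add("[CLS]", 0, 0)
    pos = 0
    for bin_ in wcn:
        pos += 1  # one position per segment (bin), shared by all its sub-words
        for word, _prob in bin_:
            for sub in toy_tokenize(word):
                add(sub, pos, 0)
    pos += 1
    add("[SEP]", pos, 0)
    for word in sa_words:
        for sub in toy_tokenize(word):
            pos += 1  # system behavior sub-words advance one position each
            add(sub, pos, 1)
    pos += 1
    add("[SEP]", pos, 1)
    return tokens, positions, fragments


wcn = [
    [("i", 1.0)],
    [("want", 1.0)],
    [("a", 0.84), ("of", 0.16)],
    [("gastropub", 0.96), ("pub", 0.04)],
    [("food", 1.0)],
]
tokens, positions, fragments = build_inputs(wcn, ["inform", "price", "range", "moderate"])
for row in zip(tokens, positions, fragments):
    print(row)
```

With this layout the five sub-words "gas", "##tro", "##pu", "##b" and "pub" all receive position 4, matching the example in the text.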

For step S13, encoding is performed using BERT. We adapt BERT to the word confusion network by taking the posterior probabilities of the words in the word confusion network into account when calculating the self-attention weights. The formula for calculating the self-attention weight in the original BERT is:

The present method modifies this formula as follows:

wherein $p_j$ is the posterior probability corresponding to the j-th word (note that if a word is split into sub-words, each sub-word takes the probability of the original word, e.g. 0.96 for "gas", "##tro", "##pu" and "##b"), and $\lambda^{l,h}$ are trainable parameters that are updated continuously during model training. The model thereby outputs feature vectors at the sub-word level.
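The exact form of the modified formula is not reproduced in this text. The sketch below assumes one plausible form, in which the scaled dot-product score between positions i and j receives an additive bias $\lambda \cdot p_j$ before the softmax. The function names, the single-head setup and the random weights are illustrative assumptions only.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def prob_aware_attention(x, w_q, w_k, w_v, p, lam):
    """Single-head self-attention with an assumed additive bias lam * p_j.

    x: (T, d) inputs; w_q, w_k, w_v: (d, d_head) projections;
    p: (T,) posterior probability of each sub-word; lam: scalar parameter.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_head = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_head)   # (T, T) scaled dot products
    scores = scores + lam * p[None, :]   # column j is shifted by lam * p_j
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
T, d, d_head = 5, 8, 4
x = rng.normal(size=(T, d))
w_q, w_k, w_v = (rng.normal(size=(d, d_head)) for _ in range(3))
p = np.array([1.0, 0.84, 0.16, 0.96, 0.04])

out, weights = prob_aware_attention(x, w_q, w_k, w_v, p, lam=1.0)
print(out.shape, weights.sum(axis=-1))
```

Under this assumption, a positive `lam` shifts attention toward sub-words with high ASR posterior probability, and `lam = 0` recovers ordinary scaled dot-product attention.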

For step S14, the feature vectors at the sub-word level obtained above are gradually aggregated into feature vectors at the sentence level, which are used to determine the triples.

As an embodiment, the step-by-step aggregation of the feature vectors at the subword level into the feature vectors at the sentence level by the sentence representation module includes:

averaging, among the feature vectors at the sub-word level, the sub-word-level feature vectors of each word segmentation corresponding to the word confusion network to obtain feature vectors at the word level, and computing a weighted sum of the word-level feature vectors according to the posterior probabilities to determine feature vectors at the segment level;

keeping the feature vectors of the sub-word level corresponding to the system behaviors of the previous round of conversation unchanged in the feature vectors of the sub-word level;

and aggregating the feature vectors at the segment level corresponding to the word confusion network and the feature vectors at the sub-word level corresponding to the system behavior of the previous round of conversation through a self-attention mechanism in the sentence representation module, and outputting the feature vectors at the sentence level of the current round of conversation.

In this embodiment, the BERT encoder outputs feature vectors at the sub-word level, and we aggregate the feature vectors at the sub-word level into feature vectors at the sentence level in a sentence representation module that utilizes special structural information of a word confusion network.

The first step aggregates the sub-word-level features corresponding to the word confusion network into segment-level features, as shown in FIG. 3. The sub-word features are first averaged to obtain word-level features; for example, the vectors of the four sub-words "gas", "##tro", "##pu" and "##b" are averaged to obtain a word-level vector representing "gastropub". Then, following the structure of the word confusion network, all words in a segment are weighted by their posterior probabilities and summed to obtain a segment-level vector. Note that the features corresponding to the system behavior remain unchanged. The segment-level features obtained in this step are denoted $u = (u_1, \ldots, u_{T'})$.
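The first aggregation step can be illustrated with a small numeric sketch. The 2-dimensional feature values and the span representation of words are made up for illustration; only the two operations (mean over sub-words, posterior-weighted sum over words) come from the text.

```python
import numpy as np


def aggregate_bin(subword_vecs, word_spans, word_probs):
    """Aggregate the sub-word vectors of one segment (bin) into one vector.

    subword_vecs: (n_subwords, d) features of all sub-words in the bin.
    word_spans:   (start, end) index range of each candidate word.
    word_probs:   posterior probability of each candidate word.
    """
    # Step 1: average the sub-word vectors of each word.
    word_vecs = [subword_vecs[start:end].mean(axis=0) for start, end in word_spans]
    # Step 2: weight the word vectors by posterior probability and sum them.
    return sum(prob * vec for prob, vec in zip(word_probs, word_vecs))


# Toy features for the bin (gastropub/0.96; pub/0.04).
subword_vecs = np.array([
    [1.0, 0.0],   # gas
    [3.0, 0.0],   # ##tro
    [1.0, 2.0],   # ##pu
    [3.0, 2.0],   # ##b
    [0.0, 10.0],  # pub
])
bin_vec = aggregate_bin(subword_vecs, [(0, 4), (4, 5)], [0.96, 0.04])
print(bin_vec)
```

Here "gastropub" averages to (2.0, 1.0), so the segment vector is 0.96 · (2.0, 1.0) + 0.04 · (0.0, 10.0) = (1.92, 1.36).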

The second step aggregates the features obtained above by means of a self-attention mechanism:

The final sentence-level vector is then $r = \mathrm{Concat}(u_1, u')$, where $u_1$ is the feature corresponding to the special token "[CLS]", which is often used directly for classification in BERT. Note that the calculation of $u'$ starts from $t = 2$ so that $u_1$ is not used twice.
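The pooling formula for the second step is not reproduced in this text. The sketch below assumes a common self-attentive pooling form, in which each position is scored with a one-hidden-layer network, the scores are normalized with a softmax, and the weighted sum gives $u'$. The parameter names `w_a`, `b_a` and `v_a` and their shapes are assumptions.

```python
import numpy as np


def sentence_vector(u, w_a, b_a, v_a):
    """Pool u_2..u_T' with attention and concatenate the result with u_1.

    u: (T', d) segment-level features, u[0] being the [CLS] feature.
    w_a: (d_a, d), b_a: (d_a,), v_a: (d_a,) assumed pooling parameters.
    """
    rest = u[1:]                                # pooling starts at t = 2
    scores = np.tanh(rest @ w_a.T + b_a) @ v_a  # one scalar score per position
    scores = scores - scores.max()              # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()
    u_prime = alpha @ rest                      # attention-weighted sum
    return np.concatenate([u[0], u_prime])      # r = Concat(u_1, u')


rng = np.random.default_rng(0)
u = np.array([[1.0, 1.0], [2.0, 0.0], [0.0, 2.0], [4.0, 4.0]])
w_a = rng.normal(size=(3, 2))
b_a = np.zeros(3)
v_a = rng.normal(size=3)

r = sentence_vector(u, w_a, b_a, v_a)
print(r.shape)
```

Excluding `u[0]` from the pooled part keeps the "[CLS]" feature from contributing to both halves of `r`.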

For step S15, for the effectiveness and universality of the model, two methods are used for processing, and the output structure is an act-slot-value (act-slot-value) triple.

As an embodiment, the outputting a structured behavior-slot-value triplet based on the statement-level feature vector includes:

inputting the feature vector of the statement level into a semantic tuple classifier, and determining a multi-label classifier of a behavior-slot;

constructing a multi-classifier for the value corresponding to each candidate behavior-slot based on semantic understanding;

and combining the multi-label classifier and the classification result of the multi-classifier to generate the semantic triple of the system behavior of the current round of conversation.

In this embodiment, a semantic tuple classifier (STC) first performs multi-label classification over "behavior-slot" pairs; the classification result in FIG. 2 is "inform-food". Since each "behavior-slot" pair in semantic understanding can correspond to only one value, we construct a multi-class classifier for each "behavior-slot" pair, such as the "inform-food" classifier, whose result is "gastropub". Finally, the classification results of the two classifiers are combined to obtain the semantic tuple.
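The two-stage decision of the semantic tuple classifier can be sketched as follows. The linear classifier heads, the 0.5 decision threshold, the hand-set weights and the candidate value lists are assumptions chosen so that the toy example is easy to follow.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def stc_predict(r, pair_names, w_pair, b_pair, value_heads, threshold=0.5):
    """Combine a multi-label pair classifier with per-pair value classifiers.

    r: sentence-level feature vector.
    pair_names: list of (behavior, slot) pairs, one per row of w_pair.
    value_heads: maps a pair to (candidate values, weight matrix, bias).
    """
    pair_probs = sigmoid(w_pair @ r + b_pair)  # independent probability per pair
    triples = []
    for name, prob in zip(pair_names, pair_probs):
        if prob < threshold:
            continue  # this behavior-slot pair is predicted absent
        values, w_val, b_val = value_heads[name]
        value = values[int(np.argmax(w_val @ r + b_val))]  # exactly one value
        triples.append((name[0], name[1], value))
    return triples


r = np.array([1.0, -1.0])
pair_names = [("inform", "food"), ("inform", "area")]
w_pair = np.array([[2.0, 0.0], [-2.0, 0.0]])
b_pair = np.zeros(2)
value_heads = {
    ("inform", "food"): (["gastropub", "pub"], np.eye(2), np.zeros(2)),
    ("inform", "area"): (["north", "south"], np.eye(2), np.zeros(2)),
}
triples = stc_predict(r, pair_names, w_pair, b_pair, value_heads)
print(triples)
```

Because every value classifier chooses among a fixed candidate list, a design of this kind can only output values that were available when the classifier was built.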

As another embodiment, the outputting a structured behavior-slot-value triple based on the statement-level feature vector further includes:

and inputting the statement-level feature vectors into a hierarchical decoder, classifying the behaviors and the slots in sequence, determining values corresponding to the behaviors and the slots by adopting a generating mode, and generating semantic triples of the system behaviors of the current round of conversation.

In the present embodiment, a hierarchical decoder (HD) classifies the "behavior" and the "slot" in sequence and then generates the "value". It differs from the existing method in two respects: the existing method uses GloVe pre-trained word vectors for "behaviors" and "slots", whereas the present method uses BERT encodings; and the LSTM decoder of the existing method is replaced with a Transformer decoder to improve efficiency.

The difference between the above two methods is that the former is discriminative and the latter is generative. The former performs better, while the latter generalizes better, has an advantage in unknown domains, and can better predict unseen slot values.

According to this embodiment, the more effective word confusion network is used for spoken language semantic understanding, and a word confusion network encoding method based on the Transformer-based bidirectional coding representation model is applied to the semantic tuple prediction task in spoken language understanding. By improving the BERT calculation, the posterior probability information of the word confusion network can be effectively utilized. In addition, the structural information of the word confusion network is reflected in the position sequence and the feature aggregation module. Furthermore, the context information of the conversation is encoded by introducing the system behavior of the previous round, so that the ability of the model to encode uncertainty in a noisy environment is further improved, and good generalization performance is shown.

The above steps are now explained in detail, together with the framework of the WCN-BERT SLU shown in FIG. 2. WCN (Word Confusion Network): a linear directed graph structure in which a plurality of candidate word hypotheses and their posterior probabilities lie between every two adjacent nodes; each path passes through all the nodes and represents one speech recognition hypothesis. BERT (Bidirectional Encoder Representations from Transformers, the Transformer-based bidirectional coding representation model): a bidirectional language model obtained by pre-training on a large-scale unsupervised corpus, which many studies have shown to have strong language representation capability. SLU (Spoken Language Understanding): the module connecting speech recognition and dialog state tracking in a spoken dialog system, which converts the text information input by the user into structured semantic information.

For each user turn, the model takes the corresponding WCN and the last system behavior as input and predicts the semantic tuples at the turn level.

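The WCN is a compact lattice structure in which the candidate words and their associated posterior probabilities are aligned at each location. It is usually regarded as a sequence of word segments (bins) $B = (b_1, \ldots, b_M)$. The m-th segment can be formalized as

$$b_m = \{(w_m^1, P(w_m^1)), \ldots, (w_m^{I_m}, P(w_m^{I_m}))\}$$

wherein $I_m$ denotes the number of candidate words in $b_m$, and $w_m^i$ and $P(w_m^i)$ are the i-th candidate word and its posterior probability, respectively, as given by the ASR system. The WCN is flattened into the word sequence

$$w^{WCN} = (w_1^1, \ldots, w_1^{I_1}, \ldots, w_M^1, \ldots, w_M^{I_M})$$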

The system behavior contains the dialog context in the form of "act-slot-value" triples. We consider the last system behavior before the current dialog turn, $A = ((a_1, s_1, v_1), \ldots, (a_K, s_K, v_K))$, where K is the number of triples, and $a_k$, $s_k$ and $v_k$ are the k-th act, slot and value, respectively. It is rearranged into the sequence $w^{SA} = (a_1, s_1, v_1, \ldots, a_K, s_K, v_K)$. Act names and slot names need to be split into meaningful words, e.g., "pricerange" is split into "price" and "range".

To input the WCN into BERT together with the system behavior, we represent the input sub-word sequence as $w = (w_1, \ldots, w_T) = [CLS] \oplus TOK(w^{WCN}) \oplus [SEP] \oplus TOK(w^{SA}) \oplus [SEP]$, where $\oplus$ concatenates sequences, $TOK(\cdot)$ is the BERT tokenizer that splits words into sub-words, and [CLS] and [SEP] are auxiliary sub-words used as separators.

$T = |TOK(w^{WCN})| + |TOK(w^{SA})| + 3$. Considering the structural feature of the WCN that multiple words compete within one segment, all words (in fact, their sub-words after word segmentation) in the same segment are defined to share the same position ID.

Finally, the BERT input layer embeds $w$ into a sequence of $d_x$-dimensional representations by summing the following three embeddings:

$$x_t = E_{token}(w_t) + E_{pos}(w_t) + E_{seg}(w_t)$$

wherein $E_{token}(\cdot)$, $E_{pos}(\cdot)$ and $E_{seg}(\cdot)$ represent sub-word embedding, position embedding and fragment embedding, respectively.

BERT consists of a stack of bidirectional Transformer encoder layers, each of which contains a multi-head self-attention module and a feed-forward network with residual connections. For the l-th layer, assume the input is $x^l = (x_1^l, \ldots, x_T^l)$.

The output is calculated by the self-attention layer (consisting of H heads):

wherein FC is a fully connected layer, LayerNorm denotes layer normalization, and $1 \le h \le H$.

The WCN probability-aware self-attention mechanism extends the original self-attention mechanism to take into account the posterior probabilities of the sub-words in the WCN. For each input sub-word sequence $w = (w_1, \ldots, w_T)$, we define its corresponding probability sequence as $p = (p_1, \ldots, p_T)$, wherein $p_t = P(w_t)$ if $w_t$ belongs to the WCN part, and $p_t = 1.0$ for the sub-words of the system behavior.

Here $P(w_t)$ represents the ASR posterior probability of the sub-word. Note that in the WCN, the probability of a sub-word is equal to the probability of its original word. The ASR posterior probability is injected into the BERT encoder by changing the calculation of the self-attention weights:

wherein $\lambda^{l,h}$ are trainable parameters.
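The display equations for the encoder layer are not reproduced in this text. A sketch consistent with the surrounding definitions is given below. It assumes standard multi-head scaled dot-product attention and an additive probability term; the projection matrices $W_Q^{l,h}$, $W_K^{l,h}$, $W_V^{l,h}$ and the exact placement of $\lambda^{l,h} p_j$ are assumptions rather than the confirmed formulas.

```latex
\begin{align}
\tilde{x}_i^{l,h} &= \sum_{j=1}^{T} \operatorname{softmax}_j\!\left(e_{ij}^{l,h}\right) W_V^{l,h} x_j^{l} \\
x_i^{\prime\, l} &= \operatorname{LayerNorm}\!\left(x_i^{l} + \operatorname{FC}\!\left(\left[\tilde{x}_i^{l,1}; \ldots; \tilde{x}_i^{l,H}\right]\right)\right) \\
e_{ij}^{l,h} &= \frac{\left(W_Q^{l,h} x_i^{l}\right)^{\top} \left(W_K^{l,h} x_j^{l}\right)}{\sqrt{d_x / H}}
  && \text{(original BERT)} \\
e_{ij}^{l,h} &= \frac{\left(W_Q^{l,h} x_i^{l}\right)^{\top} \left(W_K^{l,h} x_j^{l}\right)}{\sqrt{d_x / H}} + \lambda^{l,h}\, p_j
  && \text{(assumed probability-aware form)}
\end{align}
```

In this form each head h of each layer l has its own scalar $\lambda^{l,h}$, so the model can learn how strongly each head should rely on the ASR confidence.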

Finally, the sub-word-level representations are produced after the stacked encoder layers, denoted $o = (o_1, \ldots, o_T)$.

The output hidden vector corresponding to the [CLS] sub-word, i.e. $o_1$, is typically used to represent the entire sentence. In addition, the other hidden vectors are aggregated by taking the structural information of the WCN into account.

First, the sub-word-level hidden vectors of the WCN part are aggregated to the bin level by the following two steps:

(1) the BERT sub-word vectors belonging to the same input word are averaged into a word vector;

(2) the word-level vectors in each bin are weighted by their posterior probabilities and summed to obtain a bin-level vector. An example of feature aggregation for the WCN part is shown in FIG. 3, while the features corresponding to the system behavior are unchanged. After aggregation, we obtain a new feature sequence $u = (u_1, \ldots, u_{T'})$, wherein $T' = M + |TOK(w^{SA})| + 3$ (M is the number of bins) and $T' \le T$.

Then, we aggregate the bin-level features using a self-attention approach, as follows:

wherein $b_a$,

and

are trainable parameters. By concatenating $u'$ with the hidden state of [CLS], i.e. $r = \mathrm{Concat}(u_1, u')$, the final sentence-level feature representation is obtained.

Semantic tuple classifier (STC): on top of the final sentence-level feature representation r, we apply a binary classifier to predict the presence of each "behavior-slot" pair, and a multi-class classifier to choose among the possible values of each "behavior-slot" pair that occur in the training set. Consequently, this method cannot predict values that are not seen in the training set.

To improve the generalization capability of value prediction, a Transformer-based hierarchical decoder (HD) is constructed, which is composed of a behavior classifier, a slot classifier and a value generator. Compared with the existing hierarchical decoding method, there are two main differences:

(1) Behaviors and slots are tokenized and then embedded by the BERT embedding layer as additional features of the slot classifier and the value generator. For each behavior a or slot s, the sub-word-level vectors from the BERT embedding layer are averaged to obtain a single feature vector (denoted $e_a$ for a and $e_s$ for s).

(2) The LSTM-based value generator is replaced with a Transformer-based value generator. The values are tokenized, embedded by the BERT embedding layer, and generated at the sub-word level. Therefore, we tie the sub-word embeddings of BERT to the weight matrix of the linear output layer in the value generator.

The method was evaluated experimentally as follows.

We experimented on the dataset of the second Dialog State Tracking Challenge (DSTC2), which contains 11677, 3934 and 9890 samples for training, validation and testing, respectively. To shorten the flattened WCN sequence, the WCN is pruned by deleting interjections and candidate words with a probability below a certain threshold (0.001 is suggested). The evaluation metrics are the F1 score and the sentence-level accuracy of the behavior-slot-value triplets. We do not assume that all candidate values for each slot are known a priori.
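The pruning step can be sketched as below. The interjection list is hypothetical, since the text does not enumerate it, and dropping bins that become empty is an assumption; only the threshold value 0.001 comes from the text.

```python
from typing import List, Tuple

Bin = List[Tuple[str, float]]

# Hypothetical interjection list; the source does not enumerate it.
INTERJECTIONS = {"uh", "um", "oh", "ah"}


def prune_wcn(wcn: List[Bin], threshold: float = 0.001) -> List[Bin]:
    """Drop interjections and candidates whose posterior is below threshold."""
    pruned = []
    for bin_ in wcn:
        kept = [
            (word, prob)
            for word, prob in bin_
            if prob >= threshold and word not in INTERJECTIONS
        ]
        if kept:  # assumption: bins left empty are removed entirely
            pruned.append(kept)
    return pruned


raw_wcn = [
    [("uh", 0.9), ("i", 0.1)],
    [("want", 0.9995), ("wand", 0.0005)],
    [("um", 1.0)],
]
pruned_wcn = prune_wcn(raw_wcn)
print(pruned_wcn)
```

The sketch leaves the surviving probabilities untouched; whether they should be renormalized after pruning is not stated in the text.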

In the experiment, we used the English uncased BERT model with 12 layers, 768 hidden units and 12 attention heads. Adam is used as the optimizer during training. We select the initial learning rate from {5e-5, 3e-5, 2e-5}, with a warmup rate of 0.1 and an L2 weight decay of 0.01. The maximum norm for gradient clipping is set to 1. The dropout of the BERT encoder is set to 0.1, and the dropout of the sentence representation module and the output layer is set to 0.3. The model is trained for 50 epochs and saved according to the best performance on the validation set. Each experimental setup was run five times with different random seeds, and the average results are reported.
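For reference, the reported settings can be collected in a small configuration object. The class name, field names and the use of seeds 0 to 4 are illustrative assumptions; the numeric values are those stated above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as reported in the text; field names are illustrative."""
    bert_layers: int = 12
    hidden_size: int = 768
    attention_heads: int = 12
    optimizer: str = "adam"
    learning_rate_grid: Tuple[float, ...] = (5e-5, 3e-5, 2e-5)
    warmup_rate: float = 0.1
    weight_decay: float = 0.01
    max_grad_norm: float = 1.0
    bert_dropout: float = 0.1
    head_dropout: float = 0.3  # sentence representation module and output layer
    epochs: int = 50
    num_seeds: int = 5


def experiment_grid(cfg: TrainConfig) -> List[Tuple[float, int]]:
    """Enumerate (learning rate, seed) runs; seeds 0..num_seeds-1 are assumed."""
    return [(lr, seed) for lr in cfg.learning_rate_grid for seed in range(cfg.num_seeds)]


cfg = TrainConfig()
runs = experiment_grid(cfg)
print(len(runs), runs[0])
```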

As shown in FIG. 4, different types of input were applied to the baselines, including manually transcribed text, ASR 1-best, 10-best lists, and WCN. The baseline model for the first three input types (manually transcribed text, 1-best and 10-best) is a BLSTM (Bidirectional Long Short-Term Memory network) with a self-attention mechanism. The BLSTM encodes the input sequence, and sentence-level features are obtained through the self-attention mechanism. To test on the 10-best list with a model trained on 1-best, we run the model on each hypothesis in the 10-best list and weight the individual results by the ASR posterior probabilities. For direct training and evaluation on the 10-best list, the representation vector r is calculated as

$$r = \sum_{i=1}^{10} \gamma_i r_i$$

where r_i is the representation vector of hypothesis i, and γ_i is the corresponding ASR posterior probability.
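The posterior-weighted combination of the hypothesis vectors r_i can be sketched in a few lines of Python. The function name `combine_nbest` and the toy vectors are hypothetical; the sketch assumes that the posteriors are used as given, without renormalization.

```python
from typing import List


def combine_nbest(vectors: List[List[float]], posteriors: List[float]) -> List[float]:
    """Weighted sum of hypothesis vectors r_i with ASR posteriors gamma_i."""
    if not vectors or len(vectors) != len(posteriors):
        raise ValueError("need one posterior per hypothesis vector")
    dim = len(vectors[0])
    return [sum(gamma * vec[d] for gamma, vec in zip(posteriors, vectors))
            for d in range(dim)]


# Two toy hypotheses with posteriors 0.75 and 0.25 (illustrative values only).
r = combine_nbest([[1.0, 0.0], [0.0, 1.0]], [0.75, 0.25])
```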

For WCN modeling, we followed the Neural ConfNet Classification method. The WCN is fed into the model through a simple weighted-sum representation, in which all word vectors in a bin are weighted by their posterior probabilities and then summed. We also apply the GPT-based Lattice-SLU method to the WCN. The output layer of all baseline models is the STC.
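The weighted-sum representation of a WCN can be sketched as below, assuming that each bin is a list of (word, posterior) pairs and that a word-vector lookup table is available. The names `encode_bins` and `toy_vectors` are hypothetical, and the sketch covers only the input representation, not the classifier that consumes it.

```python
from typing import Dict, List, Tuple

Bin = List[Tuple[str, float]]


def encode_bins(wcn: List[Bin], word_vectors: Dict[str, List[float]]) -> List[List[float]]:
    """Return one vector per bin: the posterior-weighted sum of its word vectors."""
    dim = len(next(iter(word_vectors.values())))
    sequence: List[List[float]] = []
    for candidates in wcn:
        bin_vector = [0.0] * dim
        for word, prob in candidates:
            for d, value in enumerate(word_vectors[word]):
                bin_vector[d] += prob * value
        sequence.append(bin_vector)
    return sequence


# Toy lookup table and WCN (illustrative values only).
toy_vectors = {"cheap": [1.0, 0.0], "chip": [0.0, 1.0], "food": [2.0, 2.0]}
toy_wcn = [[("cheap", 0.5), ("chip", 0.5)], [("food", 1.0)]]
bin_sequence = encode_bins(toy_wcn, toy_vectors)
```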

By comparing the baselines, we find that performance improves when a larger space of ASR hypotheses is used. The Neural ConfNet Classification method outperforms the system trained and tested on 1-best and achieves results comparable to the 10-best system. By virtue of the powerful pre-trained language model, the Lattice-SLU outperforms the other baseline models described above. The previous round of system behavior is not used in the baselines and is analyzed in the subsequent ablation studies.

By jointly modeling the WCN and the previous round of system behavior using BERT, our proposed framework is significantly better than the baselines in terms of both F1 score and sentence-level accuracy, and achieves the best performance to date on the DSTC2 data set, as shown in FIG. 4 and FIG. 5.

We performed ablation experiments on the WCN-BERT SLU with STC and HD, as shown in FIG. 6. The results in row (b) do not take the WCN probabilities into account in the BERT encoder. In this case, the model lacks prior knowledge of the ASR confidence, resulting in a significant drop in performance (0.76% for STC and 0.86% for HD). When only the hidden state corresponding to [CLS] is used as the sentence-level representation (row (c)), i.e., r = u_1, the F1 score decreases with both STC and HD, indicating that the structural information of the WCN is beneficial for utterance representation.

In row (d), BERT is removed: we jointly encode the WCN and the system behavior with a Transformer, but without fine-tuning from pre-trained BERT. We initialize the word vectors with the 100-dimensional GloVe 6B word embeddings. The results show that removing BERT leads to a significant decrease in F1 score (1.79% for STC and 1.32% for HD).

In addition, we investigate the impact of the dialog context by removing the previous round of system behavior from the input, as shown in row (e). The results show that jointly encoding the WCN and the previous round of system behavior significantly improves performance. Row (e) can be compared with the Lattice-SLU baseline in FIG. 4, since neither uses the dialog context. The comparison shows that our model is more effective, owing to the ability of the bidirectional Transformer to take future context into account and to the different treatment of the WCN structure.

In row (f), neither BERT nor the dialog context is included. This setting is comparable to the Neural ConfNet Classification method at the model level (a Transformer in one case and a BLSTM in the other). With the STC as the output layer, the Transformer with WCN-probability-aware self-attention has better modeling capability (F1 improves by 0.84%) and higher training efficiency (6 times faster).

Furthermore, in row (g) we replaced the previous round of system behavior in the input with the previous round's system utterance, which resulted in only a slight change in the F1 score (a difference within 0.1%). This shows that our model is also applicable to data sets that contain only system utterances and no system behaviors.

As can be seen from the foregoing results, the model with the layered decoder (HD) performs worse than the one with STC. However, the generative approach gives the model generalization capability, and the pointer-generator network helps to handle out-of-vocabulary (OOV) words. To analyze the generalization ability of the model with the layered decoder, we randomly select a certain proportion of the training set to train the proposed WCN-BERT SLU with STC or HD. The validation and test sets remain unchanged. Furthermore, we evaluate the F1 scores of seen and unseen triplets separately, according to whether a behavior-slot-value triplet appears in the training set.
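The split evaluation can be sketched as follows. This is one plausible reading rather than the evaluation script used in the experiments: it assumes that both predicted and reference triplets are partitioned by membership in the set of training triplets before a micro-averaged F1 is computed on each partition. The function names are hypothetical.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (behavior, slot, value)


def f1_score(predicted: List[Set[Triple]], gold: List[Set[Triple]]) -> float:
    """Micro-averaged F1 over behavior-slot-value triplets."""
    tp = fp = fn = 0
    for pred, ref in zip(predicted, gold):
        tp += len(pred & ref)   # correctly predicted triplets
        fp += len(pred - ref)   # predicted but not in the reference
        fn += len(ref - pred)   # in the reference but not predicted
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def seen_unseen_f1(predicted: List[Set[Triple]],
                   gold: List[Set[Triple]],
                   train_triples: Set[Triple]) -> Tuple[float, float]:
    """F1 restricted to triplets seen in training, and to unseen triplets."""
    seen_pred = [p & train_triples for p in predicted]
    seen_gold = [g & train_triples for g in gold]
    unseen_pred = [p - train_triples for p in predicted]
    unseen_gold = [g - train_triples for g in gold]
    return f1_score(seen_pred, seen_gold), f1_score(unseen_pred, unseen_gold)


# Toy data (illustrative only).
train = {("inform", "food", "chinese")}
gold_sets = [{("inform", "food", "chinese"), ("inform", "area", "north")}]
pred_sets = [{("inform", "food", "chinese")}]
overall = f1_score(pred_sets, gold_sets)
seen_f1, unseen_f1 = seen_unseen_f1(pred_sets, gold_sets, train)
```

In the toy example the seen triplet is recovered and the unseen one is missed, so the seen F1 is perfect while the unseen F1 is zero.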

As shown in FIG. 7, the overall performance of HD does not drop dramatically as the training set size decreases. Moreover, the layered decoder exhibits better generalization capability for both seen and unseen labels.

FIG. 8 is a schematic structural diagram of a spoken language semantic understanding system according to an embodiment of the present invention. The system can execute the spoken language semantic understanding method according to any of the above embodiments and is configured in a terminal.

The spoken language semantic understanding system provided by the embodiment comprises: a sequence conversion program module 11, a word segmentation program module 12, a feature vector determination program module 13, an aggregation program module 14 and a triplet determination program module 15.

The sequence conversion program module 11 is configured to serialize the word confusion network and the system behaviors of the previous round of dialog, and splice them into an input sequence, where the word confusion network includes the candidate words of the current dialog and the posterior probabilities of the candidate words, and the system behaviors include structured behavior-slot-value triplets; the word segmentation program module 12 is configured to perform word segmentation on the input sequence, determine the word embedding, position embedding, and segment embedding of each segmented word, and use them as input of a Transformer-based bidirectional coding representation model; the feature vector determination program module 13 is configured to modify the self-attention weights in the Transformer-based bidirectional coding representation model based on the posterior probabilities of the candidate words in the word confusion network, the model outputting feature vectors at the sub-word level; the aggregation program module 14 is configured to gradually aggregate the feature vectors at the sub-word level into feature vectors at the sentence level through the sentence representation module; the triplet determination program module 15 is configured to output a structured behavior-slot-value triplet based on the feature vectors at the sentence level, and determine the triplet as a system behavior of the spoken language dialog.
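The role of the sequence conversion step can be illustrated with a minimal Python sketch. It is a simplified illustration, not the module itself: sub-word tokenization is omitted, the use of [CLS] and [SEP] markers follows the usual BERT convention, and the position numbering (candidates in the same bin share one position, system behavior items continue the count) is an assumption of this sketch. The function name `serialize_input` is hypothetical.

```python
from typing import List, Optional, Tuple

Bin = List[Tuple[str, float]]                     # candidate words with posteriors
Act = Tuple[str, Optional[str], Optional[str]]    # (behavior, slot, value)


def serialize_input(wcn: List[Bin], system_acts: List[Act]):
    """Flatten a WCN and the previous system behaviors into parallel sequences.

    Returns tokens, posterior probabilities, position ids and segment ids.
    """
    tokens, probs, positions, segments = ["[CLS]"], [1.0], [0], [0]

    def push(token: str, prob: float, position: int, segment: int) -> None:
        tokens.append(token)
        probs.append(prob)
        positions.append(position)
        segments.append(segment)

    # Segment 0: WCN candidates; all candidates of a bin share the bin index.
    for bin_index, candidates in enumerate(wcn, start=1):
        for word, prob in candidates:
            push(word, prob, bin_index, 0)
    position = len(wcn) + 1
    push("[SEP]", 1.0, position, 0)

    # Segment 1: previous system behaviors, treated as certain (probability 1).
    for act in system_acts:
        for item in act:
            if item is not None:
                position += 1
                push(item, 1.0, position, 1)
    push("[SEP]", 1.0, position + 1, 1)
    return tokens, probs, positions, segments


toy_wcn = [[("cheap", 0.9), ("chip", 0.1)], [("food", 1.0)]]
toy_acts = [("request", "food", None)]
tokens, probs, positions, segments = serialize_input(toy_wcn, toy_acts)
```

The four parallel lists correspond to the inputs that the later modules consume: tokens and segment ids for the embeddings, position ids for the position embedding, and posteriors for the adjustment of the self-attention weights.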

The embodiment of the invention also provides a nonvolatile computer storage medium, wherein the computer storage medium stores computer executable instructions which can execute the spoken language semantic understanding method in any method embodiment;

as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:

The non-volatile computer-readable storage medium may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the methods in the embodiments of the present invention. One or more program instructions are stored in the non-volatile computer-readable storage medium and, when executed by a processor, perform the spoken language semantic understanding method of any of the method embodiments described above.

The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the device, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory located remotely from the processor, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

An embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the steps of the spoken semantic understanding method of any of the embodiments of the present invention.

The client of the embodiment of the present application exists in various forms, including but not limited to:

(1) Mobile communication devices, which are characterized by mobile communication capabilities and are primarily intended to provide voice and data communication. Such terminals include smart phones, multimedia phones, feature phones, and low-end phones, among others.

(2) Ultra-mobile personal computer devices, which belong to the category of personal computers, have computing and processing functions, and generally also have mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as tablet computers.

(3) Portable entertainment devices, which can display and play multimedia content. Such devices include audio and video players, handheld game consoles, e-book readers, smart toys and portable in-vehicle navigation devices.

(4) Other electronic devices with data processing capabilities.

In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A method of spoken semantic understanding, comprising:

2. The method of claim 1, wherein the segmenting the input sequence, and determining word embedding, position embedding, and segment embedding of each segmented word comprises:

3. The method of claim 2, wherein in the sequence of positions, the sequence of subwords of the same participle has the same position number;

4. The method of claim 1, wherein the step-by-step aggregation of the subword-level feature vectors into sentence-level feature vectors by a sentence representation module comprises:

5. The method of claim 1, wherein outputting a structured behavior-slot-value triplet based on the statement-level feature vector comprises:

6. The method of claim 1, wherein outputting a structured behavior-slot-value triplet based on the statement-level feature vector further comprises:

7. The method of claim 1, wherein the serializing system behaviors of a word confusion network and a previous conversation round comprises:

8. A spoken semantic understanding system, comprising:

9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1-7.

10. A storage medium on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.