CN113486669A - Semantic recognition method for emergency rescue input voice - Google Patents

Semantic recognition method for emergency rescue input voice

Info

Publication number
CN113486669A
Authority
CN
China
Prior art keywords
sequence
intention
information
word
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110764294.5A
Other languages
Chinese (zh)
Other versions
CN113486669B (en)
Inventor
刘中民
夏新
沈方舟
朱建成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai East Hospital Tongji University Affiliated East Hospital
Original Assignee
Shanghai East Hospital Tongji University Affiliated East Hospital
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai East Hospital Tongji University Affiliated East Hospital filed Critical Shanghai East Hospital Tongji University Affiliated East Hospital
Priority to CN202110764294.5A priority Critical patent/CN113486669B/en
Publication of CN113486669A publication Critical patent/CN113486669A/en
Application granted granted Critical
Publication of CN113486669B publication Critical patent/CN113486669B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

A semantic recognition method for emergency rescue input voice relates to the technical field of natural language processing. The method uses the tokenizer and encoder of a BERT pre-trained model to tokenize and encode a user input sentence, obtaining a word-level representation; a convolutional neural network encoder acquires local intention information of the user input sentence; an encoder based on the self-attention mechanism acquires intention information fused with semantic slot information; an intention recognition decoder acquires the specific intention category; and a semantic slot decoder obtains the label classification of the semantic slots. The method provided by the invention enables a machine to recognize the user's input voice in emergency rescue.

Description

Semantic recognition method for emergency rescue input voice
Technical Field
The invention relates to natural language processing technology, and in particular to a semantic recognition method for emergency rescue input voice.
Background
In current emergency rescue systems, speech recognition technology is already widely applied: rescuers issue voice commands to a machine terminal, which intelligently recognizes the intent and semantics of those commands and then performs the corresponding operations. Controlling the machine terminal by voice frees both hands of the rescuers and can effectively improve emergency rescue efficiency. In this process, the accuracy with which the machine terminal recognizes the intent and semantics of the voice commands is critical.
Intent recognition and semantic slot filling are the two main tasks of a spoken language understanding module (or natural language understanding module), which aims to understand the user's conversational intent. The intent recognition task has the machine process the user's input text and classify the user's query sentence into a user intent. The semantic slot filling task processes the user's query sentence and labels the semantic slots in the input text as specific slot-value pair information. Because the result of intent recognition can improve semantic slot filling, and the result of semantic slot filling can in turn benefit intent recognition, joint training of the two tasks has become an important research branch in spoken language understanding.
The existing methods for jointly training intent recognition and semantic slot filling can be divided into three categories:
1) Rule-based methods: owing to dataset limitations, completing intent recognition and semantic slot filling requires considerable manpower and material resources to formulate domain-specific intent recognition rules and semantic slot rules, so the scalability and generalization ability of such methods are poor.
2) Joint training methods based on recurrent neural networks: these mainly use a Seq2Seq framework for intent recognition and semantic slot filling and achieve relatively good results.
3) Implicit and explicit joint learning methods: implicit joint learning methods learn the features of the two tasks and associate them only through the loss function, while most explicit joint learning methods propose a gating-mechanism structure to further combine the intent recognition and semantic slot filling tasks.
Convolutional neural networks (CNNs) are commonly used in computer vision and image processing because they focus on local features; in recent years they have also played a significant role in natural language processing, especially as feature extractors that effectively capture local information in a corpus.
The attention mechanism (Attention) was proposed so that, under today's limited computing power, more computing resources can be allocated to the more important tasks while handling the problem of information overload. As deep learning networks have grown more expressive, the number of model parameters has grown ever larger, which frequently causes information overload. Introducing an attention mechanism lets the model focus on the information most critical to the current task and reduces its sensitivity to other information, improving both the efficiency and the accuracy of task processing; moreover, because the key vectors are concatenated and the input vectors are operated on independently, the attention mechanism also improves the parallel efficiency of the computation. The self-attention mechanism (Self-Attention) is a variant of the attention mechanism; it differs in that it relies more on the sequence itself, reducing the dependence on external information.
It should be noted that although joint training of intent recognition and semantic slot filling achieves good results, existing research still has problems to be improved, such as the lack of labeled data; limited domain universality (a method may work well on a specific dataset in a given domain, yet model performance drops greatly once the dialogue domain and dataset are replaced); and out-of-vocabulary words (words in the test set that do not appear in the training set can lower test performance).
Disclosure of Invention
Aiming at the defects of the prior art, the invention seeks to solve the technical problem of providing a semantic recognition method for emergency rescue input voice that overcomes the problems of data labeling, domain universality and out-of-vocabulary words in the prior art, preserves the integrity of feature information well, and improves the overall coding efficiency and accuracy of the model.
In order to solve the above technical problem, the semantic recognition method for emergency rescue input voice provided by the invention comprises the following specific steps:
S1: after the tokenizer of a BERT pre-trained model is used to tokenize the user input sentence, a [CLS] label is added before the first token of the sentence and an [SEP] label after the last token, and the sequence is input into the encoder of the BERT pre-trained model for encoding, obtaining the output sequence H of the user input sentence;
S2: extracting from the output sequence H all elements other than the element carrying the [CLS] label, inputting the extracted elements into a convolutional neural network encoder, and using that encoder to obtain the local intention information sequence P of the user input sentence;
S3: calculating the self-attention of each token word in the output sequence H obtained in step S1, and fusing the semantic slot information into the [CLS] label containing the sentence intention information, obtaining the intention information sequence G fused with semantic slot information;
S4: inputting the intention information sequence G fused with semantic slot information, the local intention information sequence P obtained by the convolutional neural network encoder, and the intention information of the [CLS] label output by the encoder of the BERT pre-trained model into an intention recognition decoder, and using the intention recognition decoder to obtain the specific intention category sequence Y composed of the final intention categories of the token words in the output sequence H;
S5: inputting the specific intention category sequence Y obtained in step S4 into the gating mechanism of a semantic slot decoder, calculating the semantic slot of each token word in the output sequence H through the gating mechanism, and classifying each semantic slot into a slot label using the semantic slot information classifier.
Further, the specific step of acquiring the output sequence H of the user input sentence in step S1 is as follows:
S11: using the tokenizer of the BERT pre-trained model to split the user input sentence according to the smallest units of the tokenizer's vocabulary;
S12: adding a [CLS] label before the first token of the sentence and an [SEP] label after the last token, obtaining the input sequence x = (x_1, x_2, x_3, ..., x_T) of the user input sentence, where T is the number of elements in the input sequence x;
S13: inputting the input sequence x obtained in step S12 into the encoder of the BERT pre-trained model; after encoding with the encoder of the BERT pre-trained model, the output sequence H = (h_1, h_2, h_3, ..., h_T) of the user input sentence is obtained;
Further, the specific steps of acquiring the local intention information sequence P of the user input sentence in step S2 are as follows:
S21: extracting from the output sequence H obtained in step S1 all elements except the first element h_1, and constructing a new output sequence H2 = (h_2, h_3, ..., h_T) from the extracted elements;
S22: using the output sequence H2 obtained in step S21 as the input layer of the convolutional neural network, convolving it with several types of convolution kernels, and after convolution applying the Top-K algorithm for max pooling of the features, obtaining the local intention information sequence P = (p_2, p_3, ..., p_T) of the user input sentence.
Further, the specific steps of acquiring the intention information sequence G fused with the semantic slot information in step S3 are as follows:
S31: the output sequence H = (h_1, h_2, h_3, ..., h_T) obtained in step S1 is linearly transformed to obtain the query vector matrix Q, the key vector matrix K and the value vector matrix V, the linear transformations being:
Q = W_Q · H
K = W_K · H
V = W_V · H
where W_Q is the parameter of the query vector matrix Q, W_K the parameter of the key vector matrix K, and W_V the parameter of the value vector matrix V;
S32: the self-attention of each token word in the output sequence H = (h_1, h_2, h_3, ..., h_T) obtained in step S1 is calculated by the following formula:
g_i = Σ_{j=1}^{T} softmax(q_i · k_j) · v_j
where g_i is the self-attention of the i-th token word in the output sequence H, q_i is the query vector of the i-th token word in the query vector matrix Q, v_j is the value vector of the j-th token word in the value vector matrix V, k_j is the key vector of the j-th token word in the key vector matrix K, and softmax is the normalized exponential function;
S33: the self-attention of each token word in the output sequence H is used to construct the intention information sequence G = (g_1, g_2, g_3, ..., g_T) fused with semantic slot information.
Further, the specific steps of acquiring the specific intention category sequence Y in step S4 are as follows:
S41: inputting the intention information sequence G fused with semantic slot information, the local intention information sequence P obtained by the convolutional neural network encoder, and the intention information of the [CLS] label output by the encoder of the BERT pre-trained model into the intention recognition decoder;
S42: the intention recognition decoder calculates the final intention information of each token word in the output sequence H obtained in step S1, the calculation formula being:
f_i = h_1 + W_p · p_i + W_g · g_i
where f_i is the final intention information of the i-th token word in the output sequence H, W_p is the introduction parameter for the local intention information obtained by the convolutional neural network encoder, W_g is the introduction parameter for the intention information fused with semantic slot information, h_1 is the intention information of the [CLS] label output by the encoder of the BERT pre-trained model, p_i is the i-th element of the local intention information sequence P of the user input sentence obtained in step S22, and g_i is the self-attention of the i-th token word in the output sequence H;
S43: the intention recognition decoder maps the final intention information of each token word in the output sequence H to a final intention category through a fully connected classifier, obtaining the specific intention category sequence Y composed of the final intention categories of the token words in the output sequence H, the calculation formula being:
y_i = softmax(W_f · f_i + b_f)
where y_i is the final intention category of the i-th token word in the output sequence H, W_f is the neural network parameter of the classifier, f_i is the final intention information of the i-th token word in the output sequence H, b_f is the bias vector, and softmax is the normalized exponential function.
Further, the specific steps of tag classification of the semantic slot in step S5 are as follows:
S51: inputting the specific intention category sequence Y obtained in step S4 into the gating mechanism of the semantic slot decoder, and calculating the semantic slot of each token word in the output sequence H through the gating mechanism, the calculation formulas being:
r_i = sigmoid(W_r · [g_i, f_i])
[equation rendered as an image in the source: s_i is computed by fusing g_i and f_i under the gate r_i]
where s_i is the semantic slot of the i-th token word in the output sequence H, r_i is the slot gate coefficient of the i-th token word in the output sequence H, sigmoid is the activation function, W_r is the neural network parameter of the semantic slot decoder, g_i is the self-attention of the i-th token word in the output sequence H, and f_i is the final intention information of the i-th token word in the output sequence H;
S52: classifying the semantic slots of the token words in the output sequence H through a classifier, assigning each semantic slot a slot label, the classification formula being:
y_i^S = softmax(W_S · s_i + b_S)
where y_i^S is the slot label of the i-th token word in the output sequence H, W_S is the neural network parameter of the classifier, b_S is the bias vector, and softmax is the normalized exponential function.
The semantic recognition method for the emergency rescue input voice provided by the invention has the following beneficial effects:
1) A BERT encoding layer, a self-attention encoding layer and a convolutional neural network encoding layer are used at the encoding stage, and a decoder for intent recognition and a decoder for semantic slot filling are used at the decoding stage, effectively avoiding the out-of-vocabulary and domain-universality problems.
2) When sentence features are captured, the self-attention mechanism is used while the convolutional neural network extracts local intention information from the user's sentence, so that local feature information is fused while the long-range feature information of the sentence is captured, enhancing the integrity of the feature information and improving the overall coding efficiency of the model.
3) A bidirectional gating mechanism between intent and semantic slots is adopted: an implicit multi-head self-attention layer is used when the semantic slot information assists intent recognition, and a gating mechanism is used when the intent information assists semantic slot filling, truly realizing a bidirectional complementary design of intent and semantic slots that improves intent recognition accuracy while also greatly helping semantic slot filling accuracy.
Detailed Description
The embodiments of the present invention are described below in further detail, but the invention is not limited thereto; all similar structures and similar variations adopting the invention shall be included within its protection scope. The enumeration commas used herein all express an "and" relation, and the English letters are case-sensitive.
The embodiment of the invention provides a semantic recognition method for emergency rescue input voice, comprising the following specific steps:
S1: after the tokenizer of a BERT pre-trained model is used to tokenize the user input sentence, a [CLS] label is added before the first token of the sentence and an [SEP] label after the last token, and the sequence is input into the encoder of the BERT pre-trained model for encoding, obtaining the output sequence H of the user input sentence;
S2: extracting from the output sequence H all elements other than the element carrying the [CLS] label, inputting the extracted elements into a convolutional neural network encoder, and using that encoder to obtain the local intention information sequence P of the user input sentence;
S3: an encoder based on the self-attention mechanism realizes a slot-intent mechanism in which slot information assists the intention classification information; the self-attention of each token word in the output sequence H obtained in step S1 is calculated, and the semantic slot information is fused into the [CLS] label containing the sentence intention information, obtaining the intention information sequence G fused with semantic slot information;
Although the BERT pre-trained model already contains multiple multi-head self-attention encoding layers, passing the obtained semantic encoding through one further self-attention encoding layer still yields a partial improvement. The method therefore also uses the multi-head self-attention mechanism and performs the self-attention calculation on the token of the first [CLS] label. Because the self-attention mechanism attends most strongly to its own sequence, including the [CLS] label in the self-attention calculation is equivalent to assigning the attention information of each slot to this first label, which constitutes a slot-intent mechanism in which the semantic slot information assists the intention information;
S4: the intention information sequence G fused with semantic slot information (including the intention information of the [CLS] label after the self-attention-layer calculation, i.e. intention information in which the semantic slots assist the intention and the semantic slot information is fully utilized), the local intention information sequence P obtained by the convolutional neural network encoder (intention information that fuses the local intention of the dialogue and completely retains the sequence information), and the intention information of the [CLS] label output by the encoder of the BERT pre-trained model (introduced mainly to prevent the intention information obtained by the slot-intent mechanism from carrying too much weight and masking the intention information of the BERT pre-trained model, so the intention information of the [CLS] label is fused directly, in the spirit of a residual network) are input into the intention recognition decoder, and the intention recognition decoder obtains the specific intention category sequence Y composed of the final intention categories of the token words in the output sequence H (the three parts of intention information are integrated into complete dialogue intention information and input to the intention recognition decoder to obtain the specific intention category);
S5: the specific intention category sequence Y obtained in step S4 is input into the gating mechanism of the semantic slot decoder, realizing an intent-slot mechanism in which intention information assists the prediction of the semantic slots; the semantic slot of each token word in the output sequence H is calculated through the gating mechanism, the semantic slot information being predicted in a manner in which the intention information assists semantic slot filling, and each semantic slot is classified into a slot label using the semantic slot information classifier.
In the bidirectional information assistance proposed in this embodiment, the slot-intent mechanism is realized mainly by the encoder based on the self-attention mechanism, while the intent-slot direction, in which intention information assists the semantic slots, is computed mainly through the gating mechanism for the semantic slots.
In step S1 of the embodiment of the present invention, the specific steps of obtaining the output sequence H of the user input sentence are as follows:
S11: using the tokenizer of the BERT pre-trained model to split the user input sentence according to the smallest units of the tokenizer's vocabulary;
S12: adding a [CLS] label before the first token of the sentence and an [SEP] label after the last token, obtaining the input sequence x = (x_1, x_2, x_3, ..., x_T) of the user input sentence, where T is the number of elements in the input sequence x;
The BERT used adopts the BERT-Base-Uncased configuration, the basic configuration of BERT, containing 110M parameters;
S13: inputting the input sequence x obtained in step S12 into the encoder of the BERT pre-trained model; after encoding with the encoder of the BERT pre-trained model, the output sequence H = (h_1, h_2, h_3, ..., h_T) of the user input sentence is obtained;
The output of the first token (the [CLS] label) contains the intention information of the sentence, and each output vector of the sequence has 768 dimensions.
The BERT pre-trained model is in fact the encoder part of the Transformer model and mainly comprises three kinds of embedding features: word-piece-based embedding features (WordPiece), position-based embedding features (Position Embedding) and segmentation-based embedding features (Segment Embedding);
In the WordPiece-based embedding, the tokenizer of the BERT pre-trained model splits the words of the user dialogue input according to the smallest units of its vocabulary, balancing the flexibility of characters against the validity of words. For example, the user input sentence "book a breakdown for one" is divided by the BERT tokenizer into word-piece features, with out-of-vocabulary words split into sub-word pieces prefixed with "##" (e.g. "[##eri]", "[##e]");
In the Position Embedding, position information of the tokens is encoded; the Segment Embedding is mainly used to distinguish multiple sentences, with different sentences encoded by different numbers. Since the dataset used by the method consists of single-turn dialogues, no segment distinction needs to be added.
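As an illustration of steps S11-S13, the following minimal sketch uses the HuggingFace transformers library (a library choice assumed here; the patent only names the BERT-Base-Uncased model) to tokenize a sentence, add the [CLS] and [SEP] labels, and obtain the 768-dimensional output sequence H:

```python
# Minimal sketch of S11-S13; the transformers library is an assumption.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

sentence = "book a breakdown for one"  # the example sentence from the text
enc = tokenizer(sentence, return_tensors="pt")  # S11/S12: [CLS]/[SEP] added
print(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]))
# expected: ['[CLS]', 'book', 'a', 'breakdown', 'for', 'one', '[SEP]']

with torch.no_grad():
    H = bert(**enc).last_hidden_state  # S13: H = (h_1, ..., h_T), 768-dim
h1 = H[:, 0, :]   # vector of the [CLS] label: sentence intention information
H2 = H[:, 1:, :]  # remaining elements, used as input to the CNN encoder (S21)
```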
In step S2 of the embodiment of the present invention, the specific steps of obtaining the local intention information sequence P of the user input sentence are as follows:
S21: extracting from the output sequence H obtained in step S1 all elements except the first element h_1, and constructing a new output sequence H2 = (h_2, h_3, ..., h_T) from the extracted elements;
Because h_1 is the vector representation of the [CLS] label, it represents the intention information of the whole sentence and contains long-range dependency information, which is not conducive to extracting local intention features; this label is therefore not fed to the encoder module that extracts the local intention;
S22: taking the output sequence H2 obtained in step S21 as the input layer of the convolutional neural network, convolving it with several types of convolution kernels, and after convolution applying the Top-K algorithm for max pooling of the features (these features retain both the local intention feature information and the sequence features of the whole sentence), obtaining the local intention information sequence P = (p_2, p_3, ..., p_T) of the user input sentence.
When convolutional neural networks are used in natural language processing, each part of the structure is adjusted somewhat. At the input layer, there is no pixel matrix as for an image; instead, the word vector of each word serves as input: each row of the embedding matrix represents one word vector, which can be a static vector or be updated during training. At the convolution layer, since the content of the input layer has changed, a two-dimensional convolution kernel scanning an image matrix is no longer needed; a text is convolved along a single direction, with the kernel width fixed to the word-vector dimension and the kernel height a hyper-parameter that can be set differently. At the pooling layer, this embodiment selects Top-K max pooling rather than a plain max pooling layer.
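A minimal PyTorch sketch of this local-intention encoder follows; the kernel heights, the value of K, and the way the Top-K responses are combined are assumptions, since the patent leaves these hyper-parameters open:

```python
# Sketch of the CNN local-intention encoder (S21-S22); hyper-parameters assumed.
import torch
import torch.nn as nn

class LocalIntentEncoder(nn.Module):
    def __init__(self, dim=768, k=2, kernel_heights=(1, 3, 5)):
        super().__init__()
        # Kernel width is the word-vector dimension; height varies per kernel
        # type. 'same'-style padding keeps one output per input token, so the
        # sequence structure of P = (p_2, ..., p_T) is preserved.
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=h, padding=h // 2)
            for h in kernel_heights
        )
        self.k = k

    def forward(self, H2):                        # H2: (batch, T-1, dim)
        x = H2.transpose(1, 2)                    # (batch, dim, T-1)
        feats = torch.stack([torch.relu(c(x)) for c in self.convs])
        # Top-K max pooling across kernel types at every position: keep the
        # K strongest responses per channel (one plausible reading of the
        # patent's "Top K" pooling) and average them.
        topk = feats.topk(self.k, dim=0).values   # (k, batch, dim, T-1)
        return topk.mean(0).transpose(1, 2)       # P: (batch, T-1, dim)

# Usage: P = LocalIntentEncoder()(H2)  # with H2 from the BERT sketch above
```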
In step S3 of the embodiment of the present invention, the specific steps of acquiring the intention information sequence G fused with the semantic slot information are as follows:
S31: the output sequence H = (h_1, h_2, h_3, ..., h_T) obtained in step S1 is linearly transformed to obtain the query vector matrix Q, the key vector matrix K and the value vector matrix V, the linear transformations being:
Q = W_Q · H
K = W_K · H
V = W_V · H
where W_Q is the parameter of the query vector matrix Q, W_K the parameter of the key vector matrix K, and W_V the parameter of the value vector matrix V; these parameters of the linear transformations are adjusted dynamically through training;
S32: the self-attention of each token word in the output sequence H = (h_1, h_2, h_3, ..., h_T) obtained in step S1 is calculated by the following formula:
g_i = Σ_{j=1}^{T} softmax(q_i · k_j) · v_j
where g_i is the self-attention of the i-th token word in the output sequence H, q_i is the query vector of the i-th token word in the query vector matrix Q, v_j is the value vector of the j-th token word in the value vector matrix V, k_j is the key vector of the j-th token word in the key vector matrix K, and softmax is the normalized exponential function;
S33: the self-attention of each token word in the output sequence H is used to construct the intention information sequence G = (g_1, g_2, g_3, ..., g_T) fused with semantic slot information.
Through steps S31 to S33, the information of the [CLS] label is merged into the self-attention layer, and self-attention is calculated between the information of each semantic slot and the global intention information, realizing the slot-intent mechanism in which semantic slot information assists intention classification.
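A minimal sketch of the S31-S33 self-attention encoder in PyTorch follows; the single-head form and the 1/√d scaling are assumptions (the patent's formula is rendered only as an image in the source):

```python
# Sketch of S31-S33: self-attention over the BERT output sequence H.
import math
import torch
import torch.nn as nn

class SlotIntentAttention(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.W_q = nn.Linear(dim, dim, bias=False)  # parameters of Q
        self.W_k = nn.Linear(dim, dim, bias=False)  # parameters of K
        self.W_v = nn.Linear(dim, dim, bias=False)  # parameters of V

    def forward(self, H):                           # H: (batch, T, dim)
        Q, K, V = self.W_q(H), self.W_k(H), self.W_v(H)          # S31
        scores = Q @ K.transpose(1, 2) / math.sqrt(H.size(-1))   # q_i · k_j
        # S32: g_i = sum_j softmax(q_i · k_j) v_j
        G = torch.softmax(scores, dim=-1) @ V                    # S33: sequence G
        return G
```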
In step S4 of the embodiment of the present invention, the specific steps of acquiring the specific intention category sequence Y are as follows:
S41: inputting the intention information sequence G fused with semantic slot information, the local intention information sequence P obtained by the convolutional neural network encoder, and the intention information of the [CLS] label output by the encoder of the BERT pre-trained model into the intention recognition decoder;
S42: the intention recognition decoder calculates the final intention information of each token word in the output sequence H obtained in step S1, the calculation formula being:
f_i = h_1 + W_p · p_i + W_g · g_i
where f_i is the final intention information of the i-th token word in the output sequence H, W_p is the introduction parameter for the local intention information obtained by the convolutional neural network encoder, W_g is the introduction parameter for the intention information fused with semantic slot information, h_1 is the intention information of the [CLS] label output by the encoder of the BERT pre-trained model, p_i is the i-th element of the local intention information sequence P of the user input sentence obtained in step S22, and g_i is the self-attention of the i-th token word in the output sequence H;
S43: the intention recognition decoder maps the final intention information of each token word in the output sequence H to a final intention category through a fully connected classifier, obtaining the specific intention category sequence Y composed of the final intention categories of the token words in the output sequence H, the calculation formula being:
y_i = softmax(W_f · f_i + b_f)
where y_i is the final intention category of the i-th token word in the output sequence H, W_f is the neural network parameter of the classifier, f_i is the final intention information of the i-th token word in the output sequence H, b_f is the bias vector, and softmax is the normalized exponential function used to normalize all predicted class probabilities.
The primary role of the intent recognition decoder module is to fuse the intention information and then classify the final intention information into a specific intention category. Its input is divided into three parts: intention information fused with the local intention, intention information from the pre-trained model, and intention information in which the semantic slot information assists intention classification.
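The fusion of S42 and the classifier of S43 can be sketched as below; the module name and intent count are illustrative assumptions, and returning logits rather than probabilities is an implementation choice (the softmax of the formula is folded into the cross-entropy loss during training):

```python
# Sketch of the intent recognition decoder (S42-S43); num_intents is illustrative.
import torch
import torch.nn as nn

class IntentDecoder(nn.Module):
    def __init__(self, dim=768, num_intents=21):
        super().__init__()
        self.W_p = nn.Linear(dim, dim, bias=False)     # introduces local intent p_i
        self.W_g = nn.Linear(dim, dim, bias=False)     # introduces fused intent g_i
        self.classifier = nn.Linear(dim, num_intents)  # W_f and b_f

    def forward(self, h1, P, G):
        # P and G are assumed aligned over the non-[CLS] tokens.
        # S42: f_i = h_1 + W_p p_i + W_g g_i (h_1 broadcast residual-style)
        F = h1.unsqueeze(1) + self.W_p(P) + self.W_g(G)
        # S43: y_i = softmax(W_f f_i + b_f); softmax applied inside the loss
        return self.classifier(F), F  # intent logits, plus f_i for the slot decoder
```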
In step S5 of the embodiment of the present invention, the specific steps of tag classification for a semantic slot are as follows:
S51: the specific intention category sequence Y obtained in step S4 is input into the gating mechanism of the semantic slot decoder; the gating mechanism expresses the degree of correlation between the final intention information and the semantic slots by defining a slot gate, and the semantic slot of each token word in the output sequence H is calculated through the gating mechanism, the calculation formulas being:
r_i = sigmoid(W_r · [g_i, f_i])
[equation rendered as an image in the source: s_i is computed by fusing g_i and f_i under the gate r_i]
where s_i is the semantic slot of the i-th token word in the output sequence H, r_i is the slot gate coefficient of the i-th token word in the output sequence H, sigmoid is the activation function, which maps the result into the range 0 to 1 so that the slot gate of the gating mechanism controls the degree to which the final intention information correlates with the semantic slots, W_r is the neural network parameter of the semantic slot decoder, g_i is the self-attention of the i-th token word in the output sequence H, and f_i is the final intention information of the i-th token word in the output sequence H;
S52: classifying the semantic slots of the token words in the output sequence H through a classifier, assigning each semantic slot a slot label, the classification formula being:
y_i^S = softmax(W_S · s_i + b_S)
where y_i^S is the slot label of the i-th token word in the output sequence H, W_S is the neural network parameter of the classifier, b_S is the bias vector, and softmax is the normalized exponential function.
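A corresponding sketch of the semantic slot decoder follows; since the exact fusion of g_i and f_i under the gate r_i appears only as an image in the source, the form `g + r * f` used here is an explicit assumption, as is the number of slot labels:

```python
# Sketch of the semantic slot decoder (S51-S52); the fusion s_i = g_i + r_i*f_i
# is an assumed reading of the gated formula, and num_slot_labels is illustrative.
import torch
import torch.nn as nn

class SlotDecoder(nn.Module):
    def __init__(self, dim=768, num_slot_labels=120):
        super().__init__()
        self.W_r = nn.Linear(2 * dim, dim)                  # gate over [g_i, f_i]
        self.classifier = nn.Linear(dim, num_slot_labels)   # W_S and b_S

    def forward(self, G, F):
        # S51: r_i = sigmoid(W_r [g_i, f_i]) controls how much final
        # intention information flows into each slot representation.
        r = torch.sigmoid(self.W_r(torch.cat([G, F], dim=-1)))
        S = G + r * F                                       # assumed fusion form
        # S52: y_i^S = softmax(W_S s_i + b_S); softmax applied inside the loss
        return self.classifier(S)                           # slot-label logits
```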
Finally, the objective function combining the intent recognition decoder and the semantic slot decoder can be defined as the joint conditional probability
p(y^I, y^S | x) = p(y^I | x) · Π_{i=1}^{T} p(y_i^S | x)
The optimization goal of the method is to maximize this joint conditional probability, i.e. to minimize the corresponding loss, and training is performed with a cross-entropy loss function.
The method of the embodiment of the invention converts the user input sentence (such as "book a breakdown for one") into a semantic frame: each word in the user input sentence is regarded as a semantic slot, and the user input sentence as a whole is understood as a specific intention. The semantic slot filling task is regarded as a sequence labeling problem, which takes the word sequence of the user input sentence as input, i.e. x = (x_1, x_2, x_3, ..., x_T), and the label of each semantic slot as output, i.e. y^S = (y_1^S, y_2^S, ..., y_T^S). The intent recognition task can likewise be viewed as a classification problem, with the same input x = (x_1, x_2, x_3, ..., x_T) and the intention category y^I to be classified as output.
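Concretely, the semantic frame for one sentence pairs per-token slot tags with a sentence-level intent; the tag and intent names below are invented purely for illustration:

```python
# Hypothetical semantic frame for the example sentence; all labels invented.
tokens    = ["book", "a", "breakdown", "for", "one"]
slot_tags = ["O", "O", "B-item", "O", "B-party_size"]  # y^S: one tag per token
intent    = "BookItem"                                  # y^I: one per sentence
```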

Claims (6)

1. A semantic recognition method for emergency rescue input voice is characterized by comprising the following specific steps:
S1: after the tokenizer of a BERT pre-trained model is used to tokenize the user input sentence, a [CLS] label is added before the first token of the sentence and an [SEP] label after the last token, and the sequence is input into the encoder of the BERT pre-trained model for encoding, obtaining the output sequence H of the user input sentence;
S2: extracting from the output sequence H all elements other than the element carrying the [CLS] label, inputting the extracted elements into a convolutional neural network encoder, and using that encoder to obtain the local intention information sequence P of the user input sentence;
S3: calculating the self-attention of each token word in the output sequence H obtained in step S1, and fusing the semantic slot information into the [CLS] label containing the sentence intention information, obtaining the intention information sequence G fused with semantic slot information;
S4: inputting the intention information sequence G fused with semantic slot information, the local intention information sequence P obtained by the convolutional neural network encoder, and the intention information of the [CLS] label output by the encoder of the BERT pre-trained model into an intention recognition decoder, and using the intention recognition decoder to obtain the specific intention category sequence Y composed of the final intention categories of the token words in the output sequence H;
S5: inputting the specific intention category sequence Y obtained in step S4 into the gating mechanism of a semantic slot decoder, calculating the semantic slot of each token word in the output sequence H through the gating mechanism, and classifying each semantic slot into a slot label using the semantic slot information classifier.
2. The semantic recognition method for emergency rescue input speech according to claim 1, wherein the specific steps of obtaining the output sequence H of the user input sentence in step S1 are as follows:
S11: using the tokenizer of the BERT pre-trained model to split the user input sentence according to the smallest units of the tokenizer's vocabulary;
S12: adding a [CLS] label before the first token of the sentence and an [SEP] label after the last token, obtaining the input sequence x = (x_1, x_2, x_3, ..., x_T) of the user input sentence, where T is the number of elements in the input sequence x;
S13: inputting the input sequence x obtained in step S12 into the encoder of the BERT pre-trained model; after encoding with the encoder of the BERT pre-trained model, the output sequence H = (h_1, h_2, h_3, ..., h_T) of the user input sentence is obtained.
3. The semantic recognition method for emergency rescue input speech according to claim 2, wherein the specific steps of obtaining the local intention information sequence P of the user input sentence in step S2 are as follows:
S21: extracting from the output sequence H obtained in step S1 all elements except the first element h_1, and constructing a new output sequence H2 = (h_2, h_3, ..., h_T) from the extracted elements;
S22: using the output sequence H2 obtained in step S21 as the input layer of the convolutional neural network, convolving it with several types of convolution kernels, and after convolution applying the Top-K algorithm for max pooling of the features, obtaining the local intention information sequence P = (p_2, p_3, ..., p_T) of the user input sentence.
4. The semantic recognition method of the emergency rescue input voice according to claim 3, wherein the specific steps of obtaining the intention information sequence G fused with the semantic slot information in step S3 are as follows:
S31: the output sequence H = (h_1, h_2, h_3, ..., h_T) obtained in step S1 is linearly transformed to obtain the query vector matrix Q, the key vector matrix K and the value vector matrix V, the linear transformations being:
Q = W_Q · H
K = W_K · H
V = W_V · H
where W_Q is the parameter of the query vector matrix Q, W_K the parameter of the key vector matrix K, and W_V the parameter of the value vector matrix V;
S32: the self-attention of each token word in the output sequence H = (h_1, h_2, h_3, ..., h_T) obtained in step S1 is calculated by the following formula:
g_i = Σ_{j=1}^{T} softmax(q_i · k_j) · v_j
where g_i is the self-attention of the i-th token word in the output sequence H, q_i is the query vector of the i-th token word in the query vector matrix Q, v_j is the value vector of the j-th token word in the value vector matrix V, k_j is the key vector of the j-th token word in the key vector matrix K, and softmax is the normalized exponential function;
S33: the self-attention of each token word in the output sequence H is used to construct the intention information sequence G = (g_1, g_2, g_3, ..., g_T) fused with semantic slot information.
5. The semantic recognition method for emergency rescue input speech according to claim 4, wherein the specific steps of obtaining the specific intention category sequence Y in step S4 are as follows:
S41: inputting the intention information sequence G fused with semantic slot information, the local intention information sequence P obtained by the convolutional neural network encoder, and the intention information of the [CLS] label output by the encoder of the BERT pre-trained model into the intention recognition decoder;
S42: the intention recognition decoder calculates the final intention information of each token word in the output sequence H obtained in step S1, the calculation formula being:
f_i = h_1 + W_p · p_i + W_g · g_i
where f_i is the final intention information of the i-th token word in the output sequence H, W_p is the introduction parameter for the local intention information obtained by the convolutional neural network encoder, W_g is the introduction parameter for the intention information fused with semantic slot information, h_1 is the intention information of the [CLS] label output by the encoder of the BERT pre-trained model, p_i is the i-th element of the local intention information sequence P of the user input sentence obtained in step S22, and g_i is the self-attention of the i-th token word in the output sequence H;
S43: the intention recognition decoder maps the final intention information of each token word in the output sequence H to a final intention category through a fully connected classifier, obtaining the specific intention category sequence Y composed of the final intention categories of the token words in the output sequence H, the calculation formula being:
y_i = softmax(W_f · f_i + b_f)
where y_i is the final intention category of the i-th token word in the output sequence H, W_f is the neural network parameter of the classifier, f_i is the final intention information of the i-th token word in the output sequence H, b_f is the bias vector, and softmax is the normalized exponential function.
6. The semantic recognition method for emergency rescue input speech according to claim 5, wherein the specific steps of tag classification of the semantic slots in step S5 are as follows:
S51: inputting the specific intention category sequence Y obtained in step S4 into the gating mechanism of the semantic slot decoder, and calculating the semantic slot of each token word in the output sequence H through the gating mechanism, the calculation formulas being:
r_i = sigmoid(W_r · [g_i, f_i])
[equation rendered as an image in the source: s_i is computed by fusing g_i and f_i under the gate r_i]
where s_i is the semantic slot of the i-th token word in the output sequence H, r_i is the slot gate coefficient of the i-th token word in the output sequence H, sigmoid is the activation function, W_r is the neural network parameter of the semantic slot decoder, g_i is the self-attention of the i-th token word in the output sequence H, and f_i is the final intention information of the i-th token word in the output sequence H;
S52: classifying the semantic slots of the token words in the output sequence H through a classifier, assigning each semantic slot a slot label, the classification formula being:
y_i^S = softmax(W_S · s_i + b_S)
where y_i^S is the slot label of the i-th token word in the output sequence H, W_S is the neural network parameter of the classifier, b_S is the bias vector, and softmax is the normalized exponential function.
CN202110764294.5A 2021-07-06 2021-07-06 Semantic recognition method for emergency rescue input voice Active CN113486669B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110764294.5A CN113486669B (en) 2021-07-06 2021-07-06 Semantic recognition method for emergency rescue input voice

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110764294.5A CN113486669B (en) 2021-07-06 2021-07-06 Semantic recognition method for emergency rescue input voice

Publications (2)

Publication Number Publication Date
CN113486669A 2021-10-08
CN113486669B CN113486669B (en) 2024-03-29

Family

ID=77941353

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110764294.5A Active CN113486669B (en) 2021-07-06 2021-07-06 Semantic recognition method for emergency rescue input voice

Country Status (1)

Country Link
CN (1) CN113486669B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110555097A (en) * 2018-05-31 2019-12-10 罗伯特·博世有限公司 Slot filling with joint pointer and attention in spoken language understanding
WO2021051503A1 (en) * 2019-09-19 2021-03-25 平安科技(深圳)有限公司 Semantic representation model-based text classification method and apparatus, and computer device
CN111625641A (en) * 2020-07-30 2020-09-04 浙江大学 Dialog intention recognition method and system based on multi-dimensional semantic interaction representation model
CN112084790A (en) * 2020-09-24 2020-12-15 中国民航大学 Relation extraction method and system based on pre-training convolutional neural network
CN113032568A (en) * 2021-04-02 2021-06-25 同方知网(北京)技术有限公司 Query intention identification method based on bert + bilstm + crf and combined sentence pattern analysis

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
周奇安; 李舟军: "基于BERT的任务导向对话系统自然语言理解的改进模型与调优方法" (An improved model and tuning method for natural language understanding in BERT-based task-oriented dialogue systems), 中文信息学报 (Journal of Chinese Information Processing), no. 05, 15 May 2020, pages 82-90 *
迟海洋; 严馨; 周枫; 徐广义; 张磊: "基于BERT-BiGRU-Attention的在线健康社区用户意图识别方法" (A user intent recognition method for online health communities based on BERT-BiGRU-Attention), 河北科技大学学报 (Journal of Hebei University of Science and Technology), no. 03, 15 June 2020, pages 225-231 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114021582A (en) * 2021-12-30 2022-02-08 深圳市北科瑞声科技股份有限公司 Spoken language understanding method, device, equipment and storage medium combined with voice information
CN114021582B (en) * 2021-12-30 2022-04-01 深圳市北科瑞声科技股份有限公司 Spoken language understanding method, device, equipment and storage medium combined with voice information
WO2024001101A1 (en) * 2022-06-30 2024-01-04 青岛海尔科技有限公司 Text intention recognition method and apparatus, storage medium, and electronic apparatus
CN115658891A (en) * 2022-10-18 2023-01-31 支付宝(杭州)信息技术有限公司 Intention identification method and device, storage medium and electronic equipment
CN115658891B (en) * 2022-10-18 2023-07-25 支付宝(杭州)信息技术有限公司 Method and device for identifying intention, storage medium and electronic equipment
CN116092495A (en) * 2023-04-07 2023-05-09 广州小鹏汽车科技有限公司 Voice interaction method, server and computer readable storage medium
CN116092495B (en) * 2023-04-07 2023-08-29 广州小鹏汽车科技有限公司 Voice interaction method, server and computer readable storage medium
CN116227629A (en) * 2023-05-10 2023-06-06 荣耀终端有限公司 Information analysis method, model training method, device and electronic equipment
CN116227629B (en) * 2023-05-10 2023-10-20 荣耀终端有限公司 Information analysis method, model training method, device and electronic equipment

Also Published As

Publication number Publication date
CN113486669B (en) 2024-03-29

Similar Documents

Publication Publication Date Title
CN113486669B (en) Semantic recognition method for emergency rescue input voice
Cheng et al. Fully convolutional networks for continuous sign language recognition
CN111897908B (en) Event extraction method and system integrating dependency information and pre-training language model
CN109524006B (en) Chinese mandarin lip language identification method based on deep learning
CN108647603B (en) Semi-supervised continuous sign language translation method and device based on attention mechanism
CN110119786B (en) Text topic classification method and device
WO2023134073A1 (en) Artificial intelligence-based image description generation method and apparatus, device, and medium
Gao et al. RNN-transducer based Chinese sign language recognition
CN109165563B (en) Pedestrian re-identification method and apparatus, electronic device, storage medium, and program product
CN112860888B (en) Attention mechanism-based bimodal emotion analysis method
CN113902964A (en) Multi-mode attention video question-answering method and system based on keyword perception
CN113836992B (en) Label identification method, label identification model training method, device and equipment
CN112329760A (en) Method for recognizing and translating Mongolian in printed form from end to end based on space transformation network
CN111598183A (en) Multi-feature fusion image description method
CN112712068B (en) Key point detection method and device, electronic equipment and storage medium
CN116861995A (en) Training of multi-mode pre-training model and multi-mode data processing method and device
CN113516152A (en) Image description method based on composite image semantics
CN113657115A (en) Multi-modal Mongolian emotion analysis method based on ironic recognition and fine-grained feature fusion
CN115796182A (en) Multi-modal named entity recognition method based on entity-level cross-modal interaction
Gui et al. Adaptive Context-aware Reinforced Agent for Handwritten Text Recognition.
CN113609922A (en) Continuous sign language sentence recognition method based on mode matching
CN116229482A (en) Visual multi-mode character detection recognition and error correction method in network public opinion analysis
Chen et al. Cross-lingual text image recognition via multi-task sequence to sequence learning
CN116910307A (en) Cross-modal video text retrieval method, system, equipment and medium
CN113408287B (en) Entity identification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant