CN113486669A - Semantic recognition method for emergency rescue input voice - Google Patents

Semantic recognition method for emergency rescue input voice

Info

Publication number
CN113486669A
Authority
CN
China
Prior art keywords
sequence
intention
information
word
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110764294.5A
Other languages
Chinese (zh)
Other versions
CN113486669B (en)
Inventor
刘中民
夏新
沈方舟
朱建成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai East Hospital Tongji University Affiliated East Hospital
Original Assignee
Shanghai East Hospital Tongji University Affiliated East Hospital
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai East Hospital Tongji University Affiliated East Hospital filed Critical Shanghai East Hospital Tongji University Affiliated East Hospital
Priority to CN202110764294.5A priority Critical patent/CN113486669B/en
Publication of CN113486669A publication Critical patent/CN113486669A/en
Application granted granted Critical
Publication of CN113486669B publication Critical patent/CN113486669B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

A semantic recognition method for emergency rescue input voice relates to the technical field of natural language processing. The method uses the tokenizer and encoder of a BERT pre-trained model to tokenize and encode a user input sentence, obtaining a word-level representation; a convolutional neural network encoder acquires local intention information of the user input sentence; an encoder based on the self-attention mechanism acquires intention information fused with semantic slot information; an intention recognition decoder acquires the specific intention category; and a semantic slot decoder obtains the label classification of the semantic slots. The method provided by the invention enables a machine to recognize the user's input voice in emergency rescue.

Description

Semantic recognition method for emergency rescue input voice
Technical Field
The invention relates to natural language processing technology, and in particular to a semantic recognition method for emergency rescue input voice.
Background
In current emergency rescue systems, speech recognition technology is already widely applied: rescuers issue voice commands to a machine terminal, which intelligently recognizes the intent and semantics of those commands and then performs the corresponding operations. Controlling the machine terminal by voice frees both hands of the rescuers and can effectively improve emergency rescue efficiency. In this process, the accuracy with which the machine terminal recognizes the intent and semantics of the voice commands is critical.
Intent recognition and semantic slot filling are the two main tasks of a spoken language understanding module (or natural language understanding module), which aims to understand the user's conversational intent. The intent recognition task has the machine process the user's input text and classify the user's query sentence into a user intent. The semantic slot filling task processes the user's query sentence and labels the semantic slots in the input text as specific slot-value pair information. Because the result of intent recognition can improve semantic slot filling, and the result of semantic slot filling can in turn benefit intent recognition, joint training of the two tasks has become an important research branch in spoken language understanding.
The existing methods for jointly training intent recognition and semantic slot filling can be divided into three categories:
1) Rule-based methods: owing to dataset limitations, completing intent recognition and semantic slot filling requires considerable manpower and material resources to formulate domain-specific intent recognition rules and semantic slot rules, so the scalability and generalization ability of such methods are poor.
2) Joint training methods based on recurrent neural networks: these mainly use a Seq2Seq framework for intent recognition and semantic slot filling and achieve relatively good results.
3) Implicit and explicit joint learning methods: implicit joint learning methods learn the features of the two tasks and associate them only through the loss function, while most explicit joint learning methods propose a gating-mechanism structure to further combine the intent recognition and semantic slot filling tasks.
Convolutional neural networks (CNNs) are commonly used in computer vision and image processing because they focus on local features; in recent years they have also played a significant role in natural language processing, especially as feature extractors that effectively capture local information in a corpus.
The attention mechanism (Attention) was proposed so that, under today's limited computing power, more computing resources can be allocated to the more important tasks while handling the problem of information overload. As deep learning networks have grown more expressive, the number of model parameters has grown ever larger, which frequently causes information overload. Introducing an attention mechanism lets the model focus on the information most critical to the current task and reduces its sensitivity to other information, improving both the efficiency and the accuracy of task processing; moreover, because the key vectors are concatenated and the input vectors are operated on independently, the attention mechanism also improves the parallel efficiency of the computation. The self-attention mechanism (Self-Attention) is a variant of the attention mechanism; it differs in that it relies more on the sequence itself, reducing the dependence on external information.
It should be noted that although joint training of intent recognition and semantic slot filling achieves good results, existing research still has problems to be improved, such as the lack of labeled data; limited domain universality (a method may work well on a specific dataset in a given domain, yet model performance drops greatly once the dialogue domain and dataset are replaced); and out-of-vocabulary words (words in the test set that do not appear in the training set can lower test performance).
Disclosure of Invention
Aiming at the defects of the prior art, the invention seeks to solve the technical problem of providing a semantic recognition method for emergency rescue input voice that overcomes the problems of data labeling, domain universality and out-of-vocabulary words in the prior art, preserves the integrity of feature information well, and improves the overall coding efficiency and accuracy of the model.
In order to solve the above technical problem, the semantic recognition method for emergency rescue input voice provided by the invention comprises the following specific steps:
S1: after the tokenizer of a BERT pre-trained model is used to tokenize the user input sentence, a [CLS] label is added before the first token of the sentence and an [SEP] label after the last token, and the sequence is input into the encoder of the BERT pre-trained model for encoding, obtaining the output sequence H of the user input sentence;
S2: extracting from the output sequence H all elements other than the element carrying the [CLS] label, inputting the extracted elements into a convolutional neural network encoder, and using that encoder to obtain the local intention information sequence P of the user input sentence;
S3: calculating the self-attention of each token word in the output sequence H obtained in step S1, and fusing the semantic slot information into the [CLS] label containing the sentence intention information, obtaining the intention information sequence G fused with semantic slot information;
S4: inputting the intention information sequence G fused with semantic slot information, the local intention information sequence P obtained by the convolutional neural network encoder, and the intention information of the [CLS] label output by the encoder of the BERT pre-trained model into an intention recognition decoder, and using the intention recognition decoder to obtain the specific intention category sequence Y composed of the final intention categories of the token words in the output sequence H;
S5: inputting the specific intention category sequence Y obtained in step S4 into the gating mechanism of a semantic slot decoder, calculating the semantic slot of each token word in the output sequence H through the gating mechanism, and classifying each semantic slot into a slot label using the semantic slot information classifier.
Further, the specific step of acquiring the output sequence H of the user input sentence in step S1 is as follows:
S11: using the tokenizer of the BERT pre-trained model to split the user input sentence according to the smallest units of the tokenizer's vocabulary;
S12: adding a [CLS] label before the first token of the sentence and an [SEP] label after the last token, obtaining the input sequence x = (x_1, x_2, x_3, ..., x_T) of the user input sentence, where T is the number of elements in the input sequence x;
S13: inputting the input sequence x obtained in step S12 into the encoder of the BERT pre-trained model; after encoding with the encoder of the BERT pre-trained model, the output sequence H = (h_1, h_2, h_3, ..., h_T) of the user input sentence is obtained;
Further, the specific steps of acquiring the local intention information sequence P of the user input sentence in step S2 are as follows:
S21: extracting from the output sequence H obtained in step S1 all elements except the first element h_1, and constructing a new output sequence H2 = (h_2, h_3, ..., h_T) from the extracted elements;
S22: using the output sequence H2 obtained in step S21 as the input layer of the convolutional neural network, convolving it with several types of convolution kernels, and after convolution applying the Top-K algorithm for max pooling of the features, obtaining the local intention information sequence P = (p_2, p_3, ..., p_T) of the user input sentence.
Further, the specific steps of acquiring the intention information sequence G fused with the semantic slot information in step S3 are as follows:
S31: the output sequence H = (h_1, h_2, h_3, ..., h_T) obtained in step S1 is linearly transformed to obtain the query vector matrix Q, the key vector matrix K and the value vector matrix V, the linear transformations being:
Q = W_Q · H
K = W_K · H
V = W_V · H
where W_Q is the parameter of the query vector matrix Q, W_K the parameter of the key vector matrix K, and W_V the parameter of the value vector matrix V;
S32: the self-attention of each token word in the output sequence H = (h_1, h_2, h_3, ..., h_T) obtained in step S1 is calculated by the following formula:
g_i = Σ_{j=1}^{T} softmax(q_i · k_j) · v_j
where g_i is the self-attention of the i-th token word in the output sequence H, q_i is the query vector of the i-th token word in the query vector matrix Q, v_j is the value vector of the j-th token word in the value vector matrix V, k_j is the key vector of the j-th token word in the key vector matrix K, and softmax is the normalized exponential function;
S33: the self-attention of each token word in the output sequence H is used to construct the intention information sequence G = (g_1, g_2, g_3, ..., g_T) fused with semantic slot information.
Further, the specific steps of acquiring the specific intention category sequence Y in step S4 are as follows:
S41: inputting the intention information sequence G fused with semantic slot information, the local intention information sequence P obtained by the convolutional neural network encoder, and the intention information of the [CLS] label output by the encoder of the BERT pre-trained model into the intention recognition decoder;
S42: the intention recognition decoder calculates the final intention information of each token word in the output sequence H obtained in step S1, the calculation formula being:
f_i = h_1 + W_p · p_i + W_g · g_i
where f_i is the final intention information of the i-th token word in the output sequence H, W_p is the introduction parameter for the local intention information obtained by the convolutional neural network encoder, W_g is the introduction parameter for the intention information fused with semantic slot information, h_1 is the intention information of the [CLS] label output by the encoder of the BERT pre-trained model, p_i is the i-th element of the local intention information sequence P of the user input sentence obtained in step S22, and g_i is the self-attention of the i-th token word in the output sequence H;
S43: the intention recognition decoder maps the final intention information of each token word in the output sequence H to a final intention category through a fully connected classifier, obtaining the specific intention category sequence Y composed of the final intention categories of the token words in the output sequence H, the calculation formula being:
y_i = softmax(W_f · f_i + b_f)
where y_i is the final intention category of the i-th token word in the output sequence H, W_f is the neural network parameter of the classifier, f_i is the final intention information of the i-th token word in the output sequence H, b_f is the bias vector, and softmax is the normalized exponential function.
Further, the specific steps of tag classification of the semantic slot in step S5 are as follows:
S51: inputting the specific intention category sequence Y obtained in step S4 into the gating mechanism of the semantic slot decoder, and calculating the semantic slot of each token word in the output sequence H through the gating mechanism, the calculation formulas being:
r_i = sigmoid(W_r · [g_i, f_i])
[equation rendered as an image in the source: s_i is computed by fusing g_i and f_i under the gate r_i]
where s_i is the semantic slot of the i-th token word in the output sequence H, r_i is the slot gate coefficient of the i-th token word in the output sequence H, sigmoid is the activation function, W_r is the neural network parameter of the semantic slot decoder, g_i is the self-attention of the i-th token word in the output sequence H, and f_i is the final intention information of the i-th token word in the output sequence H;
S52: classifying the semantic slots of the token words in the output sequence H through a classifier, assigning each semantic slot a slot label, the classification formula being:
y_i^S = softmax(W_S · s_i + b_S)
where y_i^S is the slot label of the i-th token word in the output sequence H, W_S is the neural network parameter of the classifier, b_S is the bias vector, and softmax is the normalized exponential function.
The semantic recognition method for the emergency rescue input voice provided by the invention has the following beneficial effects:
1) A BERT encoding layer, a self-attention encoding layer and a convolutional neural network encoding layer are used at the encoding stage, and a decoder for intent recognition and a decoder for semantic slot filling are used at the decoding stage, effectively avoiding the out-of-vocabulary and domain-universality problems.
2) When sentence features are captured, the self-attention mechanism is used while the convolutional neural network extracts local intention information from the user's sentence, so that local feature information is fused while the long-range feature information of the sentence is captured, enhancing the integrity of the feature information and improving the overall coding efficiency of the model.
3) A bidirectional gating mechanism between intent and semantic slots is adopted: an implicit multi-head self-attention layer is used when the semantic slot information assists intent recognition, and a gating mechanism is used when the intent information assists semantic slot filling, truly realizing a bidirectional complementary design of intent and semantic slots that improves intent recognition accuracy while also greatly helping semantic slot filling accuracy.
Detailed Description
The embodiments of the present invention are described below in further detail, but the invention is not limited thereto; all similar structures and similar variations adopting the invention shall be included within its protection scope. The enumeration commas used herein all express an "and" relation, and the English letters are case-sensitive.
The embodiment of the invention provides a semantic recognition method for emergency rescue input voice, comprising the following specific steps:
S1: after the tokenizer of a BERT pre-trained model is used to tokenize the user input sentence, a [CLS] label is added before the first token of the sentence and an [SEP] label after the last token, and the sequence is input into the encoder of the BERT pre-trained model for encoding, obtaining the output sequence H of the user input sentence;
S2: extracting from the output sequence H all elements other than the element carrying the [CLS] label, inputting the extracted elements into a convolutional neural network encoder, and using that encoder to obtain the local intention information sequence P of the user input sentence;
S3: an encoder based on the self-attention mechanism realizes a slot-intent mechanism in which slot information assists the intention classification information; the self-attention of each token word in the output sequence H obtained in step S1 is calculated, and the semantic slot information is fused into the [CLS] label containing the sentence intention information, obtaining the intention information sequence G fused with semantic slot information;
Although the BERT pre-trained model already contains multiple multi-head self-attention encoding layers, passing the obtained semantic encoding through one further self-attention encoding layer still yields a partial improvement. The method therefore also uses the multi-head self-attention mechanism and performs the self-attention calculation on the token of the first [CLS] label. Because the self-attention mechanism attends most strongly to its own sequence, including the [CLS] label in the self-attention calculation is equivalent to assigning the attention information of each slot to this first label, which constitutes a slot-intent mechanism in which the semantic slot information assists the intention information;
S4: the intention information sequence G fused with semantic slot information (including the intention information of the [CLS] label after the self-attention-layer calculation, i.e. intention information in which the semantic slots assist the intention and the semantic slot information is fully utilized), the local intention information sequence P obtained by the convolutional neural network encoder (intention information that fuses the local intention of the dialogue and completely retains the sequence information), and the intention information of the [CLS] label output by the encoder of the BERT pre-trained model (introduced mainly to prevent the intention information obtained by the slot-intent mechanism from carrying too much weight and masking the intention information of the BERT pre-trained model, so the intention information of the [CLS] label is fused directly, in the spirit of a residual network) are input into the intention recognition decoder, and the intention recognition decoder obtains the specific intention category sequence Y composed of the final intention categories of the token words in the output sequence H (the three parts of intention information are integrated into complete dialogue intention information and input to the intention recognition decoder to obtain the specific intention category);
S5: the specific intention category sequence Y obtained in step S4 is input into the gating mechanism of the semantic slot decoder, realizing an intent-slot mechanism in which intention information assists the prediction of the semantic slots; the semantic slot of each token word in the output sequence H is calculated through the gating mechanism, the semantic slot information being predicted in a manner in which the intention information assists semantic slot filling, and each semantic slot is classified into a slot label using the semantic slot information classifier.
In the bidirectional information assistance proposed in this embodiment, the slot-intent mechanism is realized mainly by the encoder based on the self-attention mechanism, while the intent-slot direction, in which intention information assists the semantic slots, is computed mainly through the gating mechanism for the semantic slots.
In step S1 of the embodiment of the present invention, the specific steps of obtaining the output sequence H of the user input sentence are as follows:
S11: using the tokenizer of the BERT pre-trained model to split the user input sentence according to the smallest units of the tokenizer's vocabulary;
S12: adding a [CLS] label before the first token of the sentence and an [SEP] label after the last token, obtaining the input sequence x = (x_1, x_2, x_3, ..., x_T) of the user input sentence, where T is the number of elements in the input sequence x;
The BERT used adopts the BERT-Base-Uncased configuration, the basic configuration of BERT, containing 110M parameters;
S13: inputting the input sequence x obtained in step S12 into the encoder of the BERT pre-trained model; after encoding with the encoder of the BERT pre-trained model, the output sequence H = (h_1, h_2, h_3, ..., h_T) of the user input sentence is obtained;
The output of the first token (the [CLS] label) contains the intention information of the sentence, and each output vector of the sequence has 768 dimensions.
The BERT pre-trained model is in fact the encoder part of the Transformer model and mainly comprises three kinds of embedding features: word-piece-based embedding features (WordPiece), position-based embedding features (Position Embedding) and segmentation-based embedding features (Segment Embedding);
In the WordPiece-based embedding, the tokenizer of the BERT pre-trained model splits the words of the user dialogue input according to the smallest units of its vocabulary, balancing the flexibility of characters against the validity of words. For example, the user input sentence "book a breakdown for one" is divided by the BERT tokenizer into word-piece features, with out-of-vocabulary words split into sub-word pieces prefixed with "##" (e.g. "[##eri]", "[##e]");
In the Position Embedding, position information of the tokens is encoded; the Segment Embedding is mainly used to distinguish multiple sentences, with different sentences encoded by different numbers. Since the dataset used by the method consists of single-turn dialogues, no segment distinction needs to be added.
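As an illustration of steps S11-S13, the following minimal sketch uses the HuggingFace transformers library (a library choice assumed here; the patent only names the BERT-Base-Uncased model) to tokenize a sentence, add the [CLS] and [SEP] labels, and obtain the 768-dimensional output sequence H:

```python
# Minimal sketch of S11-S13; the transformers library is an assumption.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

sentence = "book a breakdown for one"  # the example sentence from the text
enc = tokenizer(sentence, return_tensors="pt")  # S11/S12: [CLS]/[SEP] added
print(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]))
# expected: ['[CLS]', 'book', 'a', 'breakdown', 'for', 'one', '[SEP]']

with torch.no_grad():
    H = bert(**enc).last_hidden_state  # S13: H = (h_1, ..., h_T), 768-dim
h1 = H[:, 0, :]   # vector of the [CLS] label: sentence intention information
H2 = H[:, 1:, :]  # remaining elements, used as input to the CNN encoder (S21)
```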
In step S2 of the embodiment of the present invention, the specific steps of obtaining the local intention information sequence P of the user input sentence are as follows:
S21: extracting from the output sequence H obtained in step S1 all elements except the first element h_1, and constructing a new output sequence H2 = (h_2, h_3, ..., h_T) from the extracted elements;
Because h_1 is the vector representation of the [CLS] label, it represents the intention information of the whole sentence and contains long-range dependency information, which is not conducive to extracting local intention features; this label is therefore not fed to the encoder module that extracts the local intention;
S22: taking the output sequence H2 obtained in step S21 as the input layer of the convolutional neural network, convolving it with several types of convolution kernels, and after convolution applying the Top-K algorithm for max pooling of the features (these features retain both the local intention feature information and the sequence features of the whole sentence), obtaining the local intention information sequence P = (p_2, p_3, ..., p_T) of the user input sentence.
When convolutional neural networks are used in natural language processing, each part of the structure is adjusted somewhat. At the input layer, there is no pixel matrix as for an image; instead, the word vector of each word serves as input: each row of the embedding matrix represents one word vector, which can be a static vector or be updated during training. At the convolution layer, since the content of the input layer has changed, a two-dimensional convolution kernel scanning an image matrix is no longer needed; a text is convolved along a single direction, with the kernel width fixed to the word-vector dimension and the kernel height a hyper-parameter that can be set differently. At the pooling layer, this embodiment selects Top-K max pooling rather than a plain max pooling layer.
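A minimal PyTorch sketch of this local-intention encoder follows; the kernel heights, the value of K, and the way the Top-K responses are combined are assumptions, since the patent leaves these hyper-parameters open:

```python
# Sketch of the CNN local-intention encoder (S21-S22); hyper-parameters assumed.
import torch
import torch.nn as nn

class LocalIntentEncoder(nn.Module):
    def __init__(self, dim=768, k=2, kernel_heights=(1, 3, 5)):
        super().__init__()
        # Kernel width is the word-vector dimension; height varies per kernel
        # type. 'same'-style padding keeps one output per input token, so the
        # sequence structure of P = (p_2, ..., p_T) is preserved.
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=h, padding=h // 2)
            for h in kernel_heights
        )
        self.k = k

    def forward(self, H2):                        # H2: (batch, T-1, dim)
        x = H2.transpose(1, 2)                    # (batch, dim, T-1)
        feats = torch.stack([torch.relu(c(x)) for c in self.convs])
        # Top-K max pooling across kernel types at every position: keep the
        # K strongest responses per channel (one plausible reading of the
        # patent's "Top K" pooling) and average them.
        topk = feats.topk(self.k, dim=0).values   # (k, batch, dim, T-1)
        return topk.mean(0).transpose(1, 2)       # P: (batch, T-1, dim)

# Usage: P = LocalIntentEncoder()(H2)  # with H2 from the BERT sketch above
```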
In step S3 of the embodiment of the present invention, the specific steps of acquiring the intention information sequence G fused with the semantic slot information are as follows:
S31: the output sequence H = (h_1, h_2, h_3, ..., h_T) obtained in step S1 is linearly transformed to obtain the query vector matrix Q, the key vector matrix K and the value vector matrix V, the linear transformations being:
Q = W_Q · H
K = W_K · H
V = W_V · H
where W_Q is the parameter of the query vector matrix Q, W_K the parameter of the key vector matrix K, and W_V the parameter of the value vector matrix V; these parameters of the linear transformations are adjusted dynamically through training;
S32: the self-attention of each token word in the output sequence H = (h_1, h_2, h_3, ..., h_T) obtained in step S1 is calculated by the following formula:
g_i = Σ_{j=1}^{T} softmax(q_i · k_j) · v_j
where g_i is the self-attention of the i-th token word in the output sequence H, q_i is the query vector of the i-th token word in the query vector matrix Q, v_j is the value vector of the j-th token word in the value vector matrix V, k_j is the key vector of the j-th token word in the key vector matrix K, and softmax is the normalized exponential function;
S33: the self-attention of each token word in the output sequence H is used to construct the intention information sequence G = (g_1, g_2, g_3, ..., g_T) fused with semantic slot information.
Through steps S31 to S33, the information of the [CLS] label is merged into the self-attention layer, and self-attention is calculated between the information of each semantic slot and the global intention information, realizing the slot-intent mechanism in which semantic slot information assists intention classification.
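A minimal sketch of the S31-S33 self-attention encoder in PyTorch follows; the single-head form and the 1/√d scaling are assumptions (the patent's formula is rendered only as an image in the source):

```python
# Sketch of S31-S33: self-attention over the BERT output sequence H.
import math
import torch
import torch.nn as nn

class SlotIntentAttention(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.W_q = nn.Linear(dim, dim, bias=False)  # parameters of Q
        self.W_k = nn.Linear(dim, dim, bias=False)  # parameters of K
        self.W_v = nn.Linear(dim, dim, bias=False)  # parameters of V

    def forward(self, H):                           # H: (batch, T, dim)
        Q, K, V = self.W_q(H), self.W_k(H), self.W_v(H)          # S31
        scores = Q @ K.transpose(1, 2) / math.sqrt(H.size(-1))   # q_i · k_j
        # S32: g_i = sum_j softmax(q_i · k_j) v_j
        G = torch.softmax(scores, dim=-1) @ V                    # S33: sequence G
        return G
```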
In step S4 of the embodiment of the present invention, the specific steps of acquiring the specific intention category sequence Y are as follows:
S41: inputting the intention information sequence G fused with semantic slot information, the local intention information sequence P obtained by the convolutional neural network encoder, and the intention information of the [CLS] label output by the encoder of the BERT pre-trained model into the intention recognition decoder;
S42: the intention recognition decoder calculates the final intention information of each token word in the output sequence H obtained in step S1, the calculation formula being:
f_i = h_1 + W_p · p_i + W_g · g_i
where f_i is the final intention information of the i-th token word in the output sequence H, W_p is the introduction parameter for the local intention information obtained by the convolutional neural network encoder, W_g is the introduction parameter for the intention information fused with semantic slot information, h_1 is the intention information of the [CLS] label output by the encoder of the BERT pre-trained model, p_i is the i-th element of the local intention information sequence P of the user input sentence obtained in step S22, and g_i is the self-attention of the i-th token word in the output sequence H;
S43: the intention recognition decoder maps the final intention information of each token word in the output sequence H to a final intention category through a fully connected classifier, obtaining the specific intention category sequence Y composed of the final intention categories of the token words in the output sequence H, the calculation formula being:
y_i = softmax(W_f · f_i + b_f)
where y_i is the final intention category of the i-th token word in the output sequence H, W_f is the neural network parameter of the classifier, f_i is the final intention information of the i-th token word in the output sequence H, b_f is the bias vector, and softmax is the normalized exponential function used to normalize all predicted class probabilities.
The primary role of the intent recognition decoder module is to fuse the intention information and then classify the final intention information into a specific intention category. Its input is divided into three parts: intention information fused with the local intention, intention information from the pre-trained model, and intention information in which the semantic slot information assists intention classification.
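The fusion of S42 and the classifier of S43 can be sketched as below; the module name and intent count are illustrative assumptions, and returning logits rather than probabilities is an implementation choice (the softmax of the formula is folded into the cross-entropy loss during training):

```python
# Sketch of the intent recognition decoder (S42-S43); num_intents is illustrative.
import torch
import torch.nn as nn

class IntentDecoder(nn.Module):
    def __init__(self, dim=768, num_intents=21):
        super().__init__()
        self.W_p = nn.Linear(dim, dim, bias=False)     # introduces local intent p_i
        self.W_g = nn.Linear(dim, dim, bias=False)     # introduces fused intent g_i
        self.classifier = nn.Linear(dim, num_intents)  # W_f and b_f

    def forward(self, h1, P, G):
        # P and G are assumed aligned over the non-[CLS] tokens.
        # S42: f_i = h_1 + W_p p_i + W_g g_i (h_1 broadcast residual-style)
        F = h1.unsqueeze(1) + self.W_p(P) + self.W_g(G)
        # S43: y_i = softmax(W_f f_i + b_f); softmax applied inside the loss
        return self.classifier(F), F  # intent logits, plus f_i for the slot decoder
```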
In step S5 of the embodiment of the present invention, the specific steps of tag classification for a semantic slot are as follows:
S51: the specific intention category sequence Y obtained in step S4 is input into the gating mechanism of the semantic slot decoder; the gating mechanism expresses the degree of correlation between the final intention information and the semantic slots by defining a slot gate, and the semantic slot of each token word in the output sequence H is calculated through the gating mechanism, the calculation formulas being:
r_i = sigmoid(W_r · [g_i, f_i])
[equation rendered as an image in the source: s_i is computed by fusing g_i and f_i under the gate r_i]
where s_i is the semantic slot of the i-th token word in the output sequence H, r_i is the slot gate coefficient of the i-th token word in the output sequence H, sigmoid is the activation function, which maps the result into the range 0 to 1 so that the slot gate of the gating mechanism controls the degree to which the final intention information correlates with the semantic slots, W_r is the neural network parameter of the semantic slot decoder, g_i is the self-attention of the i-th token word in the output sequence H, and f_i is the final intention information of the i-th token word in the output sequence H;
S52: classifying the semantic slots of the token words in the output sequence H through a classifier, assigning each semantic slot a slot label, the classification formula being:
y_i^S = softmax(W_S · s_i + b_S)
where y_i^S is the slot label of the i-th token word in the output sequence H, W_S is the neural network parameter of the classifier, b_S is the bias vector, and softmax is the normalized exponential function.
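A corresponding sketch of the semantic slot decoder follows; since the exact fusion of g_i and f_i under the gate r_i appears only as an image in the source, the form `g + r * f` used here is an explicit assumption, as is the number of slot labels:

```python
# Sketch of the semantic slot decoder (S51-S52); the fusion s_i = g_i + r_i*f_i
# is an assumed reading of the gated formula, and num_slot_labels is illustrative.
import torch
import torch.nn as nn

class SlotDecoder(nn.Module):
    def __init__(self, dim=768, num_slot_labels=120):
        super().__init__()
        self.W_r = nn.Linear(2 * dim, dim)                  # gate over [g_i, f_i]
        self.classifier = nn.Linear(dim, num_slot_labels)   # W_S and b_S

    def forward(self, G, F):
        # S51: r_i = sigmoid(W_r [g_i, f_i]) controls how much final
        # intention information flows into each slot representation.
        r = torch.sigmoid(self.W_r(torch.cat([G, F], dim=-1)))
        S = G + r * F                                       # assumed fusion form
        # S52: y_i^S = softmax(W_S s_i + b_S); softmax applied inside the loss
        return self.classifier(S)                           # slot-label logits
```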
Finally, the objective function combining the intent recognition decoder and the semantic slot decoder can be defined as the joint conditional probability
p(y^I, y^S | x) = p(y^I | x) · Π_{i=1}^{T} p(y_i^S | x)
The optimization goal of the method is to maximize this joint conditional probability, i.e. to minimize the corresponding loss, and training is performed with a cross-entropy loss function.
The method of the embodiment of the invention converts the user input sentence (such as "book a breakdown for one") into a semantic frame: each word in the user input sentence is regarded as a semantic slot, and the user input sentence as a whole is understood as a specific intention. The semantic slot filling task is regarded as a sequence labeling problem, which takes the word sequence of the user input sentence as input, i.e. x = (x_1, x_2, x_3, ..., x_T), and the label of each semantic slot as output, i.e. y^S = (y_1^S, y_2^S, ..., y_T^S). The intent recognition task can likewise be viewed as a classification problem, with the same input x = (x_1, x_2, x_3, ..., x_T) and the intention category y^I to be classified as output.
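Concretely, the semantic frame for one sentence pairs per-token slot tags with a sentence-level intent; the tag and intent names below are invented purely for illustration:

```python
# Hypothetical semantic frame for the example sentence; all labels invented.
tokens    = ["book", "a", "breakdown", "for", "one"]
slot_tags = ["O", "O", "B-item", "O", "B-party_size"]  # y^S: one tag per token
intent    = "BookItem"                                  # y^I: one per sentence
```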

Claims (6)

1. A semantic recognition method for emergency rescue input voice is characterized by comprising the following specific steps:
S1: after the tokenizer of a BERT pre-trained model is used to tokenize the user input sentence, a [CLS] label is added before the first token of the sentence and an [SEP] label after the last token, and the sequence is input into the encoder of the BERT pre-trained model for encoding, obtaining the output sequence H of the user input sentence;
S2: extracting from the output sequence H all elements other than the element carrying the [CLS] label, inputting the extracted elements into a convolutional neural network encoder, and using that encoder to obtain the local intention information sequence P of the user input sentence;
S3: calculating the self-attention of each token word in the output sequence H obtained in step S1, and fusing the semantic slot information into the [CLS] label containing the sentence intention information, obtaining the intention information sequence G fused with semantic slot information;
S4: inputting the intention information sequence G fused with semantic slot information, the local intention information sequence P obtained by the convolutional neural network encoder, and the intention information of the [CLS] label output by the encoder of the BERT pre-trained model into an intention recognition decoder, and using the intention recognition decoder to obtain the specific intention category sequence Y composed of the final intention categories of the token words in the output sequence H;
S5: inputting the specific intention category sequence Y obtained in step S4 into the gating mechanism of a semantic slot decoder, calculating the semantic slot of each token word in the output sequence H through the gating mechanism, and classifying each semantic slot into a slot label using the semantic slot information classifier.
2. The semantic recognition method for emergency rescue input speech according to claim 1, wherein the specific steps of obtaining the output sequence H of the user input sentence in step S1 are as follows:
S11: using the tokenizer of the BERT pre-trained model to split the user input sentence according to the smallest units of the tokenizer's vocabulary;
S12: adding a [CLS] label before the first token of the sentence and an [SEP] label after the last token, obtaining the input sequence x = (x_1, x_2, x_3, ..., x_T) of the user input sentence, where T is the number of elements in the input sequence x;
S13: inputting the input sequence x obtained in step S12 into the encoder of the BERT pre-trained model; after encoding with the encoder of the BERT pre-trained model, the output sequence H = (h_1, h_2, h_3, ..., h_T) of the user input sentence is obtained.
3. The semantic recognition method for emergency rescue input speech according to claim 2, wherein the specific steps of obtaining the local intention information sequence P of the user input sentence in step S2 are as follows:
S21: extracting from the output sequence H obtained in step S1 all elements except the first element h_1, and constructing a new output sequence H2 = (h_2, h_3, ..., h_T) from the extracted elements;
S22: using the output sequence H2 obtained in step S21 as the input layer of the convolutional neural network, convolving it with several types of convolution kernels, and after convolution applying the Top-K algorithm for max pooling of the features, obtaining the local intention information sequence P = (p_2, p_3, ..., p_T) of the user input sentence.
4. The semantic recognition method of the emergency rescue input voice according to claim 3, wherein the specific steps of obtaining the intention information sequence G fused with the semantic slot information in step S3 are as follows:
S31: the output sequence H = (h_1, h_2, h_3, ..., h_T) obtained in step S1 is linearly transformed to obtain the query vector matrix Q, the key vector matrix K and the value vector matrix V, the linear transformations being:
Q = W_Q · H
K = W_K · H
V = W_V · H
where W_Q is the parameter of the query vector matrix Q, W_K the parameter of the key vector matrix K, and W_V the parameter of the value vector matrix V;
S32: the self-attention of each token word in the output sequence H = (h_1, h_2, h_3, ..., h_T) obtained in step S1 is calculated by the following formula:
g_i = Σ_{j=1}^{T} softmax(q_i · k_j) · v_j
where g_i is the self-attention of the i-th token word in the output sequence H, q_i is the query vector of the i-th token word in the query vector matrix Q, v_j is the value vector of the j-th token word in the value vector matrix V, k_j is the key vector of the j-th token word in the key vector matrix K, and softmax is the normalized exponential function;
S33: the self-attention of each token word in the output sequence H is used to construct the intention information sequence G = (g_1, g_2, g_3, ..., g_T) fused with semantic slot information.
5. The semantic recognition method for emergency rescue input speech according to claim 4, wherein the specific steps of obtaining the specific intention category sequence Y in step S4 are as follows:
S41: inputting the intention information sequence G fused with semantic slot information, the local intention information sequence P obtained by the convolutional neural network encoder, and the intention information of the [CLS] label output by the encoder of the BERT pre-trained model into the intention recognition decoder;
S42: the intention recognition decoder calculates the final intention information of each token word in the output sequence H obtained in step S1, the calculation formula being:
f_i = h_1 + W_p · p_i + W_g · g_i
where f_i is the final intention information of the i-th token word in the output sequence H, W_p is the introduction parameter for the local intention information obtained by the convolutional neural network encoder, W_g is the introduction parameter for the intention information fused with semantic slot information, h_1 is the intention information of the [CLS] label output by the encoder of the BERT pre-trained model, p_i is the i-th element of the local intention information sequence P of the user input sentence obtained in step S22, and g_i is the self-attention of the i-th token word in the output sequence H;
S43: the intention recognition decoder maps the final intention information of each token word in the output sequence H to a final intention category through a fully connected classifier, obtaining the specific intention category sequence Y composed of the final intention categories of the token words in the output sequence H, the calculation formula being:
y_i = softmax(W_f · f_i + b_f)
where y_i is the final intention category of the i-th token word in the output sequence H, W_f is the neural network parameter of the classifier, f_i is the final intention information of the i-th token word in the output sequence H, b_f is the bias vector, and softmax is the normalized exponential function.
6. The semantic recognition method for emergency rescue input speech according to claim 5, wherein the specific steps of tag classification of the semantic slots in step S5 are as follows:
S51: inputting the specific intention category sequence Y obtained in step S4 into the gating mechanism of the semantic slot decoder, and calculating the semantic slot of each token word in the output sequence H through the gating mechanism, the calculation formulas being:
r_i = sigmoid(W_r · [g_i, f_i])
[equation rendered as an image in the source: s_i is computed by fusing g_i and f_i under the gate r_i]
where s_i is the semantic slot of the i-th token word in the output sequence H, r_i is the slot gate coefficient of the i-th token word in the output sequence H, sigmoid is the activation function, W_r is the neural network parameter of the semantic slot decoder, g_i is the self-attention of the i-th token word in the output sequence H, and f_i is the final intention information of the i-th token word in the output sequence H;
S52: classifying the semantic slots of the token words in the output sequence H through a classifier, assigning each semantic slot a slot label, the classification formula being:
y_i^S = softmax(W_S · s_i + b_S)
where y_i^S is the slot label of the i-th token word in the output sequence H, W_S is the neural network parameter of the classifier, b_S is the bias vector, and softmax is the normalized exponential function.
CN202110764294.5A 2021-07-06 2021-07-06 Semantic recognition method for emergency rescue input voice Active CN113486669B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110764294.5A CN113486669B (en) 2021-07-06 2021-07-06 Semantic recognition method for emergency rescue input voice

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110764294.5A CN113486669B (en) 2021-07-06 2021-07-06 Semantic recognition method for emergency rescue input voice

Publications (2)

Publication Number Publication Date
CN113486669A 2021-10-08
CN113486669B CN113486669B (en) 2024-03-29

Family

ID=77941353

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110764294.5A Active CN113486669B (en) 2021-07-06 2021-07-06 Semantic recognition method for emergency rescue input voice

Country Status (1)

Country Link
CN (1) CN113486669B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110555097A (en) * 2018-05-31 2019-12-10 罗伯特·博世有限公司 Slot filling with joint pointer and attention in spoken language understanding
WO2021051503A1 (en) * 2019-09-19 2021-03-25 平安科技(深圳)有限公司 Semantic representation model-based text classification method and apparatus, and computer device
CN111625641A (en) * 2020-07-30 2020-09-04 浙江大学 Dialog intention recognition method and system based on multi-dimensional semantic interaction representation model
CN112084790A (en) * 2020-09-24 2020-12-15 中国民航大学 Relation extraction method and system based on pre-training convolutional neural network
CN113032568A (en) * 2021-04-02 2021-06-25 同方知网(北京)技术有限公司 Query intention identification method based on bert + bilstm + crf and combined sentence pattern analysis

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
周奇安; 李舟军: "基于BERT的任务导向对话系统自然语言理解的改进模型与调优方法" (An improved model and tuning method for natural language understanding in BERT-based task-oriented dialogue systems), 中文信息学报 (Journal of Chinese Information Processing), no. 05, 15 May 2020, pages 82-90 *
迟海洋; 严馨; 周枫; 徐广义; 张磊: "基于BERT-BiGRU-Attention的在线健康社区用户意图识别方法" (A user intent recognition method for online health communities based on BERT-BiGRU-Attention), 河北科技大学学报 (Journal of Hebei University of Science and Technology), no. 03, 15 June 2020, pages 225-231 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114021582A (en) * 2021-12-30 2022-02-08 深圳市北科瑞声科技股份有限公司 Spoken language understanding method, device, equipment and storage medium combined with voice information
CN114021582B (en) * 2021-12-30 2022-04-01 深圳市北科瑞声科技股份有限公司 Spoken language understanding method, device, equipment and storage medium combined with voice information
WO2024001101A1 (en) * 2022-06-30 2024-01-04 青岛海尔科技有限公司 Text intention recognition method and apparatus, storage medium, and electronic apparatus
CN115658891A (en) * 2022-10-18 2023-01-31 支付宝(杭州)信息技术有限公司 Intention identification method and device, storage medium and electronic equipment
CN115658891B (en) * 2022-10-18 2023-07-25 支付宝(杭州)信息技术有限公司 Method and device for identifying intention, storage medium and electronic equipment
CN116092495A (en) * 2023-04-07 2023-05-09 广州小鹏汽车科技有限公司 Voice interaction method, server and computer readable storage medium
CN116092495B (en) * 2023-04-07 2023-08-29 广州小鹏汽车科技有限公司 Voice interaction method, server and computer readable storage medium
CN116227629A (en) * 2023-05-10 2023-06-06 荣耀终端有限公司 Information analysis method, model training method, device and electronic equipment
CN116227629B (en) * 2023-05-10 2023-10-20 荣耀终端有限公司 Information analysis method, model training method, device and electronic equipment

Also Published As

Publication number Publication date
CN113486669B (en) 2024-03-29

Similar Documents

Publication Publication Date Title
CN113486669B (en) Semantic recognition method for emergency rescue input voice
Cheng et al. Fully convolutional networks for continuous sign language recognition
CN111897908B (en) Event extraction method and system integrating dependency information and pre-training language model
CN109524006B (en) Chinese mandarin lip language identification method based on deep learning
CN108647603B (en) Semi-supervised continuous sign language translation method and device based on attention mechanism
CN110119786B (en) Text topic classification method and device
WO2023134073A1 (en) Artificial intelligence-based image description generation method and apparatus, device, and medium
Gao et al. RNN-transducer based Chinese sign language recognition
CN109165563B (en) Pedestrian re-identification method and apparatus, electronic device, storage medium, and program product
CN112860888B (en) Attention mechanism-based bimodal emotion analysis method
CN113902964A (en) Multi-mode attention video question-answering method and system based on keyword perception
CN113836992B (en) Label identification method, label identification model training method, device and equipment
CN112329760A (en) Method for recognizing and translating Mongolian in printed form from end to end based on space transformation network
CN111598183A (en) Multi-feature fusion image description method
CN112712068B (en) Key point detection method and device, electronic equipment and storage medium
CN116861995A (en) Training of multi-mode pre-training model and multi-mode data processing method and device
CN113516152A (en) Image description method based on composite image semantics
CN113657115A (en) Multi-modal Mongolian emotion analysis method based on ironic recognition and fine-grained feature fusion
CN115796182A (en) Multi-modal named entity recognition method based on entity-level cross-modal interaction
Gui et al. Adaptive Context-aware Reinforced Agent for Handwritten Text Recognition.
CN113609922A (en) Continuous sign language sentence recognition method based on mode matching
CN116229482A (en) Visual multi-mode character detection recognition and error correction method in network public opinion analysis
Chen et al. Cross-lingual text image recognition via multi-task sequence to sequence learning
CN116910307A (en) Cross-modal video text retrieval method, system, equipment and medium
CN113408287B (en) Entity identification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant