WO2022057712A1 - Electronic device and semantic parsing method therefor, medium, and human-machine dialog system - Google Patents

Electronic device and semantic parsing method therefor, medium, and human-machine dialog system

Info

Publication number
WO2022057712A1
Authority
WO
WIPO (PCT)
Prior art keywords
slot
intent
word
semantic
corpus data
Prior art date
Application number
PCT/CN2021/117251
Other languages
French (fr)
Chinese (zh)
Inventor
童甜甜
祝官文
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2022057712A1 publication Critical patent/WO2022057712A1/en

Links

Images

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/205: Parsing
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295: Named entity recognition
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/26: Speech to text systems

Definitions

  • The present invention relates to the technical field of man-machine dialogue, and in particular to an electronic device and a semantic parsing method, medium, and man-machine dialogue system thereof.
  • Human-machine dialogue systems are increasingly applied in various intelligent terminal electronic devices, such as smart speakers, smartphones, in-vehicle intelligent systems such as in-vehicle voice navigation, and robots.
  • the human-computer dialogue system uses technologies such as speech recognition, semantic analysis and language generation to realize dialogue and information exchange between humans and machines.
  • the spoken language comprehension task in semantic parsing technology includes two sub-tasks, intent recognition and slot filling.
  • At present, intent recognition and slot filling are mainly aimed at single-intent and single-slot identification; that is, for the same utterance, the closest intent is selected from multiple intent recognition result options as the recognition result.
  • However, a multi-intent corpus and a single-intent corpus may have the same sentence pattern, and a single-intent classification model cannot distinguish a multi-intent corpus, which eventually leads to a high misclassification rate of the model, that is, a high error rate in the intent recognition and slot filling results.
  • In addition, the prior-art intent-slot identification architecture cannot explicitly model the relationship between intents and slots, its accuracy of intent identification and slot filling for multiple labels is poor, and it is not compatible with mixed single-intent and multi-intent scenarios.
  • Embodiments of the present application provide an electronic device, a semantic parsing method thereof, a medium, and a human-machine dialogue system. By recognizing, from a user's voice, multiple intents close to the user's true intent and then using the identified multiple intents to predict slot information, the accuracy of slot filling is improved and the speed or efficiency of slot filling is correspondingly improved, thereby improving the accuracy of semantic parsing in human-computer dialogue.
  • An embodiment of the present application provides a semantic parsing method. The method includes: acquiring corpus data to be parsed; determining, for a word included in the corpus data to be parsed, the degree of intent correlation between the word and the intent represented by the corpus data to be parsed, and the degree of slot correlation between the word and the slot represented by the corpus data to be parsed; and predicting the slot of the corpus data to be parsed based on the semantic information of the word, the above semantic information of the word, and the intent correlation degree and slot correlation degree of the word.
  • the corpus data can be obtained by performing voice recognition and conversion on the user's voice command.
  • The degree of intent correlation between a word included in the corpus data to be parsed and the intent represented by the corpus data to be parsed can be represented by an intent attention vector, and the degree of slot correlation between the word and the slot represented by the corpus data to be parsed can be represented by a slot attention vector.
  • The semantic information of a word can be understood as the word meaning information of the word, that is, the literal meaning of the word and the meaning it refers to, for example when the word appears in a sentence as part of a noun (e.g., in the song title 'Hello Old Days').
  • The above semantic information of a word can be the semantic information of the word immediately preceding the current word in the corpus data; if the currently processed word is the first word, the above semantic information can be the sentence semantic information of the corpus data.
  • The above semantic information is used mainly because of its significance for the slot prediction of the current word.
  • The above semantic information can be expressed by the hidden state vector output at the previous moment (relative to the current moment).
  • The above method further includes: predicting multiple intents from the corpus data to be parsed; and determining, from the predicted slots, the slot corresponding to each of the multiple intents.
  • That is, multiple intents are obtained by parsing the corpus data converted from a user's voice command. If the corpus data only contains a single intent, the present application can also be applied to parse the single intent in such single-intent corpus data, so the method has a certain versatility and provides a better user experience.
  • each intent should correspond to at least one slot, and some intents may have three or more slots corresponding to it.
  • the present application can accurately sort out the correspondence between multiple intents and multiple slots.
  • In the above method, the above semantic information includes the semantic information of at least one word located before the word in the corpus data to be parsed.
  • the above semantic information of the first word is the sentence semantic information of this piece of corpus data.
  • The above semantic information of the second word is the semantic information of the first word, and at this time the semantic information of the first word contains the information that the sentence semantic information of the corpus data passed to the first word at the first moment.
  • The above semantic information of each subsequent word is the semantic information of the previous word, and the semantic information of the previous word includes the semantic information transmitted from the word before it; this transfer relationship is progressive.
  • The word meaning correlation between two adjacent words is the largest, while the correlation between two non-adjacent words is smaller, and the correlation gradually approaches 0 as the number of words between them increases.
  • the method further includes: generating sentence semantic information of the corpus data to be parsed and semantic information of each word in the corpus data to be parsed.
  • The sentence character representing the sentence in the corpus data is encoded by the encoder, so that the sentence character can express specific semantic information, and this specific semantic information is the same as or close to the semantic information obtained when a human understands the sentence.
  • The word character of each word in the corpus data is encoded by the encoder, so that the word character can express specific word meaning information, and this specific word meaning information is the same as or close to the word meaning information obtained when a human understands the word in the sentence.
  • sentence semantic information of the corpus data can be represented by a sentence vector
  • word meaning information of each word in the corpus data can be represented by a word vector.
  • the above-mentioned method further includes: the method is implemented by a neural network model.
  • the neural network model includes a fully connected layer and a long short-term memory network model.
  • a semantic parsing model is trained through a neural network model combined with a BERT model, an attention mechanism, a slot gate mechanism, and a sigmoid activation function, enabling it to implement the above method.
  • In the above method, the sentence semantic information of the corpus data to be parsed, the above semantic information of the word, and the intent correlation degree and slot correlation degree of the word are represented in the form of vectors in the neural network model.
  • For example, the sentence semantic information of the corpus data to be parsed is represented by a sentence vector, the above semantic information of the word is represented by the hidden state vector at the previous moment, and the intent correlation degree and slot correlation degree of the word are represented by the intent attention vector and the slot attention vector, respectively.
  • An embodiment of the present application provides a man-machine dialogue method, which includes: receiving a user voice command; converting the user voice command into corpus data to be parsed in text form; parsing the intents in the corpus data and the slot corresponding to each intent; and, based on the parsed intents and the slot corresponding to each intent, executing the operation corresponding to the user's voice command or generating a response voice.
  • the method further includes: the operations include one or more of sending instructions to the smart home device, opening application software, searching web pages, making calls, and sending and receiving short messages.
  • For example, if the parsed intents are to book a ticket and to book a hotel, and the slots corresponding to these two intents are the departure, destination, (hotel) location, and (hotel) star rating, then the operation performed by the smartphone may be to open ticket and hotel reservation software, query the ticket information corresponding to the departure and destination for the user to choose from, and recommend a five-star hotel in a certain location for the user to select.
  • The electronic device may include, but is not limited to, laptop computers, desktop computers, tablet computers, smartphones, wearable devices, portable music players, reader devices, or other electronic devices capable of accessing a network.
  • An embodiment of the present application provides a human-machine dialogue system. The system includes: a speech recognition module for converting a user's voice command into corpus data in text form; a semantic parsing module for performing the above semantic parsing method; a problem solving module for finding a solution for the results obtained by the semantic parsing module; a language generation module for generating natural language sentences corresponding to the solution; a speech synthesis module for synthesizing the natural language sentences into a response voice; and a dialogue management module for scheduling the speech recognition module, the semantic parsing module, the problem solving module, the language generation module, and the speech synthesis module to cooperate with each other to realize the man-machine dialogue.
  • an embodiment of the present application provides a readable medium, where an instruction is stored on the readable medium, and the instruction, when executed on an electronic device, causes the electronic device to execute the above semantic parsing method or the above man-machine dialogue method.
  • An embodiment of the present application provides an electronic device, including: a memory for storing instructions executed by one or more processors of the electronic device; and a processor, which is one of the processors of the electronic device and is used for executing the above semantic parsing method or the above man-machine dialogue method.
  • FIG. 1 is a schematic software block diagram of a common man-machine dialogue system;
  • FIG. 2 is a schematic diagram of a man-machine dialogue scene to which an embodiment of the present application is applicable;
  • FIG. 3 is a schematic structural diagram of an exemplary structure of a semantic parsing model in an embodiment of the present application
  • FIG. 4 is a schematic diagram of processing results of corpus data at different stages in the semantic parsing method according to an embodiment of the present application
  • FIG. 5 is a schematic diagram of a training process of a semantic parsing model in the semantic parsing method according to an embodiment of the present application
  • FIG. 6 is a schematic diagram of an interaction flow between a mobile phone 100 and a user according to an embodiment of the present application
  • FIG. 7 is a schematic interface diagram of a mobile phone 100 according to an embodiment of the present application performing corresponding operations according to user voice commands;
  • FIG. 8 is an exemplary structural diagram of a mobile phone 100 according to an embodiment of the present application.
  • Illustrative embodiments of the present application include, but are not limited to, electronic devices and semantic parsing methods and media thereof.
  • the embodiment of the present application first identifies multiple intents close to the user's true intent from the user's voice, and then uses the identified multiple intents to predict slot information, thereby improving the accuracy of slot filling, and correspondingly The speed or efficiency of slot filling is improved, thereby improving the accuracy of semantic parsing in human-machine dialogue.
  • NLP (Natural Language Processing): natural language is human language, and natural language processing is the processing of human language, that is, the process of systematically analyzing, understanding, and extracting information from text data in an intelligent and efficient manner.
  • Common natural language processing tasks include NER (Named Entity Recognition), RE (Relation Extraction), IE (Information Extraction), sentiment analysis, speech recognition, question answering, topic segmentation, and the like.
  • In general, natural language processing tasks can fall into the following categories.
  • Sequence tagging: for each word in a sentence, the model gives a categorical label based on the context, for example Chinese word segmentation, part-of-speech tagging, named entity recognition, and semantic role labeling.
  • Classification tasks: a single classification value is output for the entire sentence, such as text classification.
  • Sentence relationship inference: given two sentences, determine whether the two sentences have a certain relationship, for example entailment, QA, semantic rewriting, and natural language inference.
  • Generation tasks: given a piece of text, output another piece of text.
  • Intent: each voice command input by the user corresponds to a user intention. It is understandable that the so-called intent is the expression of the user's will. In a human-machine dialogue system, an intent is generally named in the form "verb + noun", for example checking the weather or booking a hotel.
  • Intent recognition, also known as intent classification, mainly extracts the intent corresponding to the current voice command from the voice command input by the user.
  • An intent is a collection of one or more expressions; for example, "I want to watch a movie" and "I want to see an action movie made by a certain star in a certain year" can belong to the same intent of playing a video.
  • An intent can be configured with one or more slots.
  • the slot is the key information used to express the user's intention, and the accuracy of the slot filling directly affects whether the electronic device can match the correct intention.
  • a slot corresponds to a keyword of a type of attribute, and the information in the slot can be filled with keywords of the same type, that is, slot filling.
  • For example, the query pattern corresponding to the intent of playing a song could be "I want to hear {song} of {singer}", where {singer} is the singer slot and {song} is the song slot.
  • The electronic device can extract from the voice command the slot information filled into the {singer} slot as "Faye Wong" and the slot information filled into the {song} slot as "Red Bean". In this way, the electronic device (or server) can identify, from the two pieces of slot information, that the user's intent for this voice input is to play Faye Wong's song "Red Bean".
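  • As an illustration of the slot-filling concept above, the following sketch (not part of the patent; the function name and rule-based matching are hypothetical) shows how the query pattern "I want to hear {song} of {singer}" could be turned into an intent plus filled slots:

    # A minimal, rule-based sketch of slot filling for the query pattern
    # "I want to hear {song} of {singer}". Purely illustrative.
    def parse_play_song(utterance: str) -> dict:
        prefix = "I want to hear "
        if utterance.startswith(prefix) and " of " in utterance:
            # everything between the prefix and the last " of " fills {song},
            # the remainder fills {singer}
            song, singer = utterance[len(prefix):].rsplit(" of ", 1)
            return {"intent": "PLAY_MUSIC", "slots": {"song": song, "singer": singer}}
        return {"intent": "UNKNOWN", "slots": {}}

    print(parse_play_song("I want to hear Red Bean of Faye Wong"))
    # {'intent': 'PLAY_MUSIC', 'slots': {'song': 'Red Bean', 'singer': 'Faye Wong'}}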
  • the semantic parsing method of the present application is suitable for various scenarios requiring semantic parsing, for example, a user sends a voice command to an intelligent electronic device, and a user conducts a man-machine dialogue with a voice assistant of the intelligent electronic device.
  • the following introduces the semantic parsing solution of the present application based on the human-machine dialogue system.
  • A common human-machine dialogue system 110 mainly includes the following six technical modules: a speech recognition module 111, a semantic parsing module 112, a problem solving module 113, a language generation module 114, a dialogue management module 115, and a speech synthesis module 116. Among them:
  • The speech recognition module 111 is used to realize speech-to-text recognition and conversion through automatic speech recognition (Automatic Speech Recognition, ASR) technology.
  • The recognition result, that is, the output corpus data, is generally in the form of the top n (n ≥ 1) sentences or word lattices with the highest scores.
  • The semantic parsing module 112, also known as the Natural Language Understanding (NLU) module, is mainly used for performing natural language processing (NLP) tasks, including semantically parsing and identifying the corpus data output by the speech recognition module.
  • the function of the semantic parsing module is implemented by a pre-trained semantic parsing model 121, and the semantic parsing model 121 will be described in detail below, and will not be repeated here.
  • the problem solving module 113 is mainly used for reasoning or querying according to the intention identified by the semantic analysis and the corresponding slot, so as to feed back the solution corresponding to the intention and the corresponding slot to the user.
  • The language generation module 114 mainly generates, for the solution found by the problem solving module 113 that needs to be output to the user, a natural language sentence, which is fed back to the user as text or further converted into voice.
  • The dialogue management module 115 is the central hub of the human-machine dialogue system. It is used to schedule, based on the dialogue history, the cooperation of the other modules in the human-computer interaction system, assist the semantic parsing module in correctly understanding the speech recognition results, provide assistance to the problem solving module, and guide the natural language generation process of the language generation module.
  • the speech synthesis module 116 is used for converting the natural language sentences generated by the language generation module into speech output.
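  • To make the division of labor among the six modules concrete, the following sketch (assumed interfaces, not the patent's code) shows how the dialogue management module 115 could schedule the other modules for a single dialogue turn:

    # A minimal sketch of one dialogue turn; each module is passed in as a callable.
    def run_turn(audio, asr, nlu, solver, nlg, tts):
        text = asr(audio)          # speech recognition module 111: audio -> text
        parse = nlu(text)          # semantic parsing module 112: text -> intents + slots
        solution = solver(parse)   # problem solving module 113: intents + slots -> solution
        sentence = nlg(solution)   # language generation module 114: solution -> sentence
        return tts(sentence)       # speech synthesis module 116: sentence -> response voice

    # Usage with trivial stand-in functions:
    reply = run_turn(
        b"<pcm audio>",
        asr=lambda a: "play Red Bean",
        nlu=lambda t: {"intent": "PLAY_MUSIC", "slots": {"song": "Red Bean"}},
        solver=lambda p: "playing Red Bean",
        nlg=lambda s: "OK, " + s,
        tts=lambda s: s.encode("utf-8"),
    )
    print(reply)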
  • FIG. 2 shows a schematic diagram of a man-machine dialogue scene according to an embodiment of the present application.
  • the application scenario includes the electronic device 100 and the electronic device 200 .
  • the electronic device 100 is a terminal intelligent device that interacts with a user, and an application system capable of semantic analysis, such as the above-mentioned human-machine dialogue system 110 , is installed thereon.
  • the electronic device 100 can recognize the user's voice command through the man-machine dialogue system 110, and perform corresponding operations according to the voice command or answer the questions raised by the user.
  • the electronic device 100 may include, but is not limited to, smart speakers, smart phones, wearable devices, head-mounted displays, in-vehicle intelligent systems such as in-vehicle intelligent voice navigation, as well as intelligent robots, portable music players, and readers.
  • the electronic device 200 can be used to train the semantic parsing model 121 , and transplant the trained semantic parsing model 121 to the electronic device 100 for the electronic device 100 to perform semantic parsing and perform corresponding operations.
  • the electronic device 200 can also perform semantic parsing on the corpus data sent by the electronic device 100 through the trained semantic parsing model 121, and feed the result back to the electronic device 100, and the electronic device 100 further performs corresponding operations.
  • The electronic device 200 may include, but is not limited to, clouds, servers, laptops, desktops, tablet computers, and other electronic devices capable of accessing a network, with one or more processors embedded or coupled therein.
  • the technical solutions of the present application are described in detail below by taking the electronic device 100 as a mobile phone and the electronic device 200 as a server as an example.
  • the mobile phone 100 is installed with the human-machine dialogue system 110, and the semantic analysis module 112 in the human-machine dialogue system 110 has a semantic analysis model 121, which can perform semantic analysis on user speech based on the technical solution of the present application.
  • the semantic parsing model 121 of the present application will be described in detail below.
  • the semantic parsing model 121 is a natural language processing model pre-trained by the server 200 based on natural language processing and the above-mentioned various neural network structures and models.
  • The pre-trained semantic parsing model 121 can extract multiple intents in a single piece of corpus data and predict slots based on the multiple intents, so as to accurately identify the intents and corresponding slots in the corpus data, which can greatly improve the accuracy of slot filling.
  • the data input into the semantic parsing model 121 is the data obtained after preprocessing the corpus data, wherein the corpus data is obtained after the user's voice instruction is recognized and transformed.
  • the preprocessing of the corpus data is a routine operation for understanding text in the human-machine dialogue system 110, and is one of the natural language processing tasks performed by the semantic parsing module 112.
  • Preprocessing generally includes performing word segmentation on the corpus data, filling and marking the Token sequence, adding segmentation marks (Segmentation), and creating masks.
  • The data preprocessing finally obtains the Token sequence containing the sentence character of the sentence and the word characters of each word in the sentence, the segmentation marks representing the sentence position corresponding to each word, and the corresponding mask indicating whether each character position in the Token sequence is a valid character.
  • The word segmentation process mainly uses word segmentation tools (such as a Chinese vocabulary) to divide the corpus data into sentences and the individual words that make up the sentences, and to mark the obtained sentences and words with possible intent labels and slot labels.
  • word segmentation processing is to prepare data for the next step of filling the Token sequence.
  • For example, for the corpus data "please play Hello Old Times for me", the possible intent labels marked for the sentence are PLAY_MUSIC, PLAY_VIDEO, and PLAY_VOICE, and the slot labels marked for each word are: the word "you" corresponds to the three slot labels songName-B, videoName-B, and mediaName-B, and the four words "good", "old", "time", and "light" each correspond to the three slot labels (songName-I, videoName-I, mediaName-I).
  • Forming the Token sequence mainly uses the data obtained by word segmentation to obtain a Token sequence that meets the character length requirements by truncating sentences or filling characters.
  • the Token sequence contains sentence characters corresponding to the entire sentence of the voice command, and word characters corresponding to each word in the sentence.
  • The first character in the Token sequence is generally [CLS], which marks the sentence obtained by word segmentation (for example, the character [CLS] marks the sentence "please play Hello Old Times for me"), and the ending character in the Token sequence is generally the truncation character [SEP], which indicates that the preceding sentence is a complete sentence meeting the character length requirement for a single sentence; the characters between [CLS] and [SEP] together form a complete sentence.
  • The words are marked with the segmentation mark "Sentence 1", indicating that these words are the words that make up Sentence 1.
  • The number of characters in the user instruction plus 2 (for the [CLS] and [SEP] characters) is required to meet the maximum character length requirement; generally, the number of characters contained in the user instruction plus 2 is within the maximum character length of 32.
  • Creating a mask is mainly to create a mask (Mask) corresponding to each character in the Token sequence obtained by the above filling.
  • the purpose of creating a mask is to mark whether each character in the Token sequence expresses valid information into a computer-readable marking code.
  • For example, the value of the mask element created for the character [pad] in the Token sequence is 0, and the value of the mask element created for characters other than [pad] is 1.
  • Token sequence: [CLS] please play Hello Old Times for me [pad] ... [pad] [SEP];
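  • The preprocessing described above can be sketched as follows (a minimal illustration assuming character-level tokenization and a maximum length of 32; the function and token spellings are not the patent's API):

    # Build the Token sequence, segmentation marks and mask for one piece of corpus data.
    MAX_LEN = 32

    def preprocess(sentence: str):
        chars = list(sentence)[: MAX_LEN - 2]               # leave room for [CLS] and [SEP]
        tokens = ["[CLS]"] + chars + ["[SEP]"]
        tokens += ["[pad]"] * (MAX_LEN - len(tokens))        # fill up to the fixed length
        segments = [1] * MAX_LEN                             # single-sentence case: all Sentence 1
        mask = [0 if t == "[pad]" else 1 for t in tokens]    # 1 = valid character, 0 = padding
        return tokens, segments, mask

    tokens, segments, mask = preprocess("please play Hello Old Times for me")
    print(tokens[:3], mask[:3], mask[-1])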
  • For another example, if the corpus data recognized from the voice command input by the user is "help me book a train ticket from Shanghai to Beijing and book a five-star hotel near Beijing Railway Station",
  • the data obtained after the above data preprocessing include:
  • Token sequence: [CLS] help me book a train ticket from Shanghai to Beijing and book a five-star hotel near Beijing Railway Station [SEP];
  • Segmentation mark: Sentence 1 (each character of "help me book a train ticket from Shanghai to Beijing and book a five-star hotel near Beijing Railway Station" is marked as belonging to Sentence 1).
  • the three data obtained after the above-mentioned data preprocessing of the corpus data can be input into the semantic parsing model 121 for semantic parsing.
  • the semantic parsing model 121 will be described in detail below.
  • the semantic parsing model 121 includes a BERT encoding layer 1211, an intent classification layer 1212, an attention layer 1213, a slot filling layer 1214, and a post-processing layer 1215.
  • The BERT encoding layer 1211 takes as input the Token sequence, segmentation marks and mask obtained after data preprocessing of the corpus data, and outputs the encoded vector sequence after encoding.
  • The encoding vector sequence includes a sentence vector and word vectors.
  • the sentence vector represents the semantic information of the corpus data to be parsed
  • the word vector contains the lexical information of each word in the corpus data to be parsed.
  • semantic information and word meaning information are the meaning expressions of corpus data based on natural language understanding, and these semantic information and word meaning information can express the real intention of the user and the real slot corresponding to the real intention of the user.
  • For example, for the corpus data "please play Hello Old Times for me", the semantic information represented by the sentence vector h0 may include PLAY_MUSIC, PLAY_VIDEO, PLAY_VOICE, hello, old times, Hello Old Times, and so on.
  • The word meaning information represented by the word vectors h1, h2, ..., ht may include songName, videoName, mediaName and the literal meaning of each word that makes up the sentence, where the word corresponding to h1 is "please", the word corresponding to h2 is "for", the word corresponding to h3 is "me", ..., and the word corresponding to h10 is "light".
  • the corpus data to be parsed is "help me book a train ticket from Shanghai to Beijing and book a five-star hotel near Beijing Railway Station"
  • In the encoding vector sequence {h0, h1, h2, ..., ht} output by the BERT encoding layer 1211, the semantic information represented by the sentence vector h0 may include booking a ticket, booking a hotel, departure place, destination, Shanghai, Beijing, hotel, star rating, five-star, and so on.
  • The word meaning information represented by the word vectors h1, h2, ..., ht may include the departure place, destination, Shanghai, Beijing, hotel, star rating, five-star, and the literal meaning of each word composing the sentence, where the word corresponding to h1 is "help", the word corresponding to h2 is "me", the word corresponding to h3 is "book", ..., and the word corresponding to h30 is "shop".
  • The Token sequence, segmentation marks and the mask generated for the Token sequence, obtained after data preprocessing, are used as the input of the BERT encoding layer 1211.
  • The BERT encoding layer 1211 sequentially identifies the valid characters [CLS], x1, x2, ..., xt-1, [SEP] in the Token sequence according to the mask (the mask element value of a valid character is 1, and the mask element value of a blank character is 0).
  • The character [CLS] that marks the sentence in the Token sequence is input into the trained BERT encoding layer 1211 for semantic encoding, so that the character [CLS] is assigned the semantic information of the corpus data, generating a high-dimensional sentence vector h0.
  • The characters x1, x2, ..., xt-1 between the character [CLS] and the truncation character [SEP] in the Token sequence correspond to the words that make up the sentence in the corpus data. The characters x1, x2, ..., xt-1 are input into the trained BERT encoding layer 1211 for semantic encoding, which assigns the semantic information of the corpus data to the characters x1, x2, ..., xt-1, correspondingly generating the high-dimensional word vectors h1, h2, ..., ht.
  • The mask element value corresponding to the blank character [pad] in the Token sequence is 0 and it does not mark any word, so it is not used as an input of the BERT encoding layer 1211.
  • the BERT coding layer 1211 can be obtained by training based on the BERT model.
  • the BERT model is a multi-layer bidirectional transformer encoder model based on fine-tuning, and the key technological innovation of the BERT model is to apply the bidirectional training of the transformer to language modeling.
  • a striking feature of the BERT model is its unified architecture across different tasks, so there is little difference between its pretrained architecture and the final downstream architecture.
  • the BERT model can further increase the generalization ability of the word vector model, and fully describe the character-level, word-level, sentence-level and even inter-sentence relationship features.
  • the BERT encoding layer 1211 can also be obtained by training other encoders or encoding models, which is not limited here.
  • the intent classification layer 1212 is used to predict candidate intents in the corpus data, wherein the intent classification layer 1212 can extract multiple intent labels in the corpus data, and retain the intent labels that meet the conditions as candidate intent outputs.
  • The intent classification layer 1212 takes the sentence vector h0 obtained by the above BERT encoding layer 1211 as input. Based on the semantic information represented by the sentence vector h0, the intent classification layer 1212 can extract all possible intent labels, and for each extracted intent label it calculates an intent confidence to judge whether the intent label satisfies the output condition.
  • the intent confidence represents the closeness of the extracted intent label to the real intent expressed by the corpus data, and may also be referred to as intent reliability.
  • the intent with higher intent confidence is closer to the real intent expressed by the corpus data.
  • a certain threshold can be set for the intent confidence, for example, the threshold of the intent confidence is set to 0.5, and the intent label whose intent confidence is greater than or equal to the threshold satisfies the output condition, and the corresponding intent label will be output As a candidate intent; an intent label whose intent confidence is less than the threshold does not meet the output conditions, and its corresponding intent label will be deleted and will not be output from the intent classification layer 1212 .
  • For example, for the corpus data "please play Hello Old Times for me", the semantic information represented by the sentence vector h0 output by the BERT encoding layer may include 3 possible intent labels: PLAY_MUSIC, PLAY_VIDEO, PLAY_VOICE.
  • the intent classification layer 1212 extracts the above three possible intent labels, and calculates the intent confidence of each intent label as 0.8, 0.75, and 0.5, respectively.
  • If the intent confidence threshold set by the intent classification layer 1212 is 0.5, then the intent confidences of the above three intent labels all satisfy the condition of being greater than or equal to 0.5, that is, all three intent labels satisfy the output condition, and the intent classification layer 1212 finally outputs three candidate intents: PLAY_MUSIC, PLAY_VIDEO, PLAY_VOICE.
  • For another example, for the corpus data "help me book a train ticket from Shanghai to Beijing and book a five-star hotel near Beijing Railway Station", the semantic information represented by the sentence vector h0 output by the BERT encoding layer may include 4 possible intent labels: checking train times, booking tickets, finding hotels, and booking hotels.
  • the intent classification layer 1212 extracts the above four possible intent labels, and calculates the intent confidence of each intent label as 0.48, 0.87, 0.45, and 0.7, respectively.
  • If the intent confidence threshold set by the intent classification layer 1212 is 0.5, then among the above four intent labels, the labels whose intent confidence is greater than or equal to 0.5 are booking tickets and booking hotels, which satisfy the output conditions, so the intent classification layer 1212 outputs 2 candidate intents: book tickets and book hotels. The two intent labels whose intent confidence is less than 0.5, checking train times and finding hotels, do not meet the output conditions and are therefore not output from the intent classification layer 1212.
  • the working process of the intent classification layer 1212 is shown in FIG. 3 :
  • The intent classification layer 1212 takes the sentence vector h0 in the encoding vector sequence output by the BERT encoding layer 1211 as input. By decoding and activating the sentence vector h0, the intent classification layer 1212 extracts all possible intent labels in the semantic information represented by h0 and computes the intent confidence yI for each intent label.
  • The intent confidence yI obtained after the sigmoid activation function can be written as yI = sigmoid(WI · h0 + bI), where I represents the number of intents, WI is the random weight coefficient applied to the sentence vector h0, and bI represents the bias value.
  • In this embodiment, the intent classification layer 1212 can be obtained by training a fully connected layer (dense) with a sigmoid function as the activation function.
  • In other embodiments, a deep neural network with the same function as the fully connected layer can also be used as the decoder, and other functions with the same function as the sigmoid function can also be used as the activation function of the corresponding deep neural network decoder; there is no restriction here.
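  • A minimal sketch of this multi-label intent classification, using numpy with random placeholder weights (shapes and names are assumptions, not the trained model), is:

    import numpy as np

    rng = np.random.default_rng(0)
    INTENTS = ["PLAY_MUSIC", "PLAY_VIDEO", "PLAY_VOICE"]
    HIDDEN = 8                                       # toy size of the sentence vector h0

    W_I = rng.normal(size=(len(INTENTS), HIDDEN))    # random weight coefficients
    b_I = np.zeros(len(INTENTS))                     # bias values

    def candidate_intents(h0, threshold=0.5):
        # sigmoid over a fully connected layer: one confidence per intent label
        y_I = 1.0 / (1.0 + np.exp(-(W_I @ h0 + b_I)))
        # keep only the intent labels whose confidence meets the output condition
        return [(name, float(c)) for name, c in zip(INTENTS, y_I) if c >= threshold]

    h0 = rng.normal(size=HIDDEN)                     # stands in for the BERT sentence vector
    print(candidate_intents(h0))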
  • The attention layer 1213 is used to quantify the degree of correlation between each word in the corpus data and the intent expressed by the sentence, which can be represented, for example, by an intent attention vector (the intent attention vector can also be understood as an intent context vector); the attention layer 1213 is also used to quantify the degree of correlation between each word in the corpus data and the slot expressed by the sentence, represented, for example, by a slot attention vector.
  • The intent attention vector output by the attention layer 1213 is used as an input of the slot filling layer 1214 to guide slot prediction and improve its accuracy; the slot attention vector output by the attention layer 1213 is used as a bias value of the slot calculation to correct the deviation of the slot prediction.
  • The attention layer 1213 takes the encoding vector sequence output by the BERT encoding layer 1211 as input. Based on the semantic information represented by the sentence vector h0 and the word meaning information represented by the word vectors h1, h2, ..., ht, the intent attention vector output by the attention layer can be understood as quantifying the degree of correlation between the word corresponding to each word vector and the intent expressed by the sentence corresponding to the sentence vector, and the slot attention vector output by the attention layer can be understood as quantifying the degree of correlation between the word corresponding to each word vector and the slot expressed by the sentence corresponding to the sentence vector.
  • For example, for the corpus data "please play Hello Old Times for me", the semantic information represented by the sentence vector h0 in the encoding vector sequence output by the BERT encoding layer may include 3 possible intent labels (PLAY_MUSIC, PLAY_VIDEO, PLAY_VOICE), and the word meaning information represented by the word vectors h1, h2, ..., ht may include songName, videoName, mediaName, and the literal meaning of each word that composes the sentence.
  • In the intent attention vector CI output by the attention layer 1213 (corresponding to "please play Hello Old Times for me, play, play"), the intent expressed by the sentence "please play Hello Old Times for me" may be PLAY_MUSIC, PLAY_VIDEO, or PLAY_VOICE; "play, play" have a relatively high degree of correlation with the intent expressed by the sentence, while "you, good, old, time, light, please, for, me" have a low degree of correlation or no correlation with the intent expressed by the sentence.
  • For example, a correlation degree of 0.9 means that the degree of correlation is relatively high; in the end, it can be concluded that the degree of correlation between "you, good, old, time, light" and the above three slots is relatively high, and that "play, play, please, for, me" have a low degree of correlation or no correlation with the slots expressed by the sentence.
  • For another example, for the corpus data "help me book a train ticket from Shanghai to Beijing and book a five-star hotel near Beijing Railway Station", in the intent attention vector output by the attention layer 1213, the intent expressed by the sentence may be to book a ticket or to book a hotel; the words "book, book, train, train, ticket, hotel, shop" have a relatively high degree of correlation with the intent expressed by the sentence, while words such as "Shanghai, sea, Beijing, Beijing, fire, train, station, five, star, grade, help, me" have a low degree of correlation or no correlation with the intent expressed by the sentence.
  • For example, if the degree of correlation between the character "Shang" (the first character of Shanghai) and the slot "departure" is 0.9, while its degree of correlation with the other 3 slots (destination, location, star rating) is 0.3, this indicates that the character has a high degree of correlation with the slot "departure" and a low degree of correlation with the other three slots.
  • The attention layer 1213 takes the encoding vector sequence {h0, h1, h2, ..., ht} output by the BERT encoding layer 1211 as input. The attention layer 1213 extracts the semantic information represented by the sentence vector h0 and the word meaning information represented by the word vectors h1, h2, ..., ht, and outputs a hidden state vector at each time step t, which represents the semantic information and word meaning information extracted before the current time step t (that is, up to time t-1).
  • The attention vector calculation formula based on the attention mechanism is given as formula (2).
  • When computing the intent attention vector, Q in formula (2) represents the sentence vector h0 in the encoding vector sequence input to the attention layer 1213, and V represents the word vectors h1, h2, ..., ht in the encoding vector sequence input to the attention layer 1213 at each time step t; the attention vector obtained by formula (2) quantifies the degree of correlation between each word vector and the sentence vector.
  • Since the semantic information represented by the sentence vector h0 contains all possible intent label information, the sentence vector h0 is combined with the attention vector calculated by formula (2) to obtain the intent attention vector CI, and the obtained intent attention vector CI is used to quantify the degree of correlation between the word corresponding to each word vector and the intent expressed by the sentence corresponding to the sentence vector.
  • When computing the slot attention vector, Q in formula (2) represents the hidden state vector C output by the attention layer 1213 at the previous moment (time t-1), and V represents the encoding vector sequence {h0, h1, h2, ..., ht} input to the attention layer 1213.
  • In this way, the attention vector obtained by formula (2) can be combined with the hidden state vector at the previous moment to learn the correlation degree of the word vector processed at the current moment t.
  • The hidden state vector C output at time t-1 is combined with the attention vector calculated by formula (2) to obtain the slot attention vector, and the resulting slot attention vector is used to quantify the degree of correlation between the word corresponding to each word vector and the slot expressed by the sentence corresponding to the sentence vector.
  • In this embodiment, the attention layer 1213 can be obtained by training a Long Short-Term Memory (LSTM) model together with an attention mechanism; for the specific training process, please refer to the detailed description below, which will not be repeated here.
  • In other embodiments, other neural network models and mechanisms that have the same functions as the LSTM model and the attention mechanism, that is, that are used to learn the degree of correlation between the words in a sentence and the intent or slot expressed by the sentence, can also be used; there is no restriction here.
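  • The role of the attention layer can be illustrated with a small numpy sketch of dot-product attention (an illustrative reconstruction under assumed shapes, not the patent's exact formula (2)): using the sentence vector h0 as the query yields an intent attention vector, and using the previous hidden state as the query yields a slot attention vector.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def attention(query, values):
        # weight each word vector by its correlation with the query, then sum
        scores = values @ query        # one score per word vector
        weights = softmax(scores)      # degree of correlation with the query
        return weights @ values        # attention (context) vector

    rng = np.random.default_rng(0)
    h0 = rng.normal(size=8)            # sentence vector
    words = rng.normal(size=(10, 8))   # word vectors h1..h10
    prev_hidden = rng.normal(size=8)   # hidden state output at time t-1

    c_intent = attention(h0, words)          # intent attention vector CI
    c_slot = attention(prev_hidden, words)   # slot attention vector
    print(c_intent.shape, c_slot.shape)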
  • The slot filling layer 1214 is used to predict candidate slots in the corpus data and fill in the slot values. The slot filling layer 1214 can predict multiple slot labels in the corpus data and retain the slot labels that meet the conditions as candidate slot outputs.
  • The slot filling layer 1214 takes as input the encoding vector ht output by the BERT encoding layer 1211, the hidden state vector C output by the attention layer 1213 at time t-1 (that is, the semantic information of the sentence before the currently processed word or the word meaning information of the preceding word), and the intent attention vector CI and slot attention vector output by the attention layer 1213 at time t, and outputs the candidate slot at time t.
  • The slot filling layer 1214 predicts possible slot labels based on the four input vectors at each time step t, and calculates the slot position reliability of each predicted slot label to determine whether the slot label satisfies the output condition.
  • That is, the slot filling layer 1214 obtains the possible slot labels of the corpus data to be parsed based on the encoding vector including the word vector (containing the word meaning information of each word in the corpus data to be parsed), the semantic information of the sentence before the currently processed word or the word meaning information of the preceding word, the degree of correlation between the currently processed word and the intent expressed by the sentence, and the degree of correlation between the currently processed word and the slot expressed by the sentence; it then calculates the slot position reliability of each slot label, and outputs as candidate slots the slot labels that satisfy the condition, that is, whose degree of correlation with the slot actually expressed by the corpus data to be parsed exceeds the threshold.
  • the slot position reliability represents the closeness of the predicted slot label to the actual slot expressed by the corpus data, and may also be referred to as slot reliability.
  • the slot label with higher slot position reliability is closer to the real slot expressed by the corpus data.
  • For example, a certain threshold can be set for the slot position reliability, such as 0.5: a slot label whose slot position reliability is greater than or equal to the threshold satisfies the output condition and is output as a candidate slot, while a slot label whose slot position reliability is less than the threshold does not meet the output condition, and its corresponding slot label is deleted and not output from the slot filling layer 1214.
  • For example, for the corpus data "please play Hello Old Times for me", suppose the threshold set for slot position reliability in the slot filling layer 1214 is 0.5. Among the slot labels predicted by the slot filling layer 1214 for the 5 words "please, for, me, play, play", the slot position reliability of slot O (for example, 0.7) is greater than or equal to 0.5, and the slot position reliability of the other slots (such as songName, for example 0.3) is less than 0.5. Therefore, the candidate slots output for the five words "please, for, me, play, play" are all the O slot.
  • Among the slot labels predicted by the slot filling layer 1214 for the five words "you, good, old, time, light", the slot position reliabilities of songName, videoName, and mediaName (for example, 0.86, 0.7, and 0.55) are greater than or equal to 0.5, and the slot position reliability of slot O (for example, 0.3) is less than 0.5. Therefore, the candidate slots output for "you" are songName-B, videoName-B, and mediaName-B, and the candidate slots output for "good, old, time, light" are songName-I, videoName-I, and mediaName-I, where B marks the word at the starting position of a name (meaning that "you" is the first word of the name) and I marks a word after the start of the name. Since the O slot represents an empty or unimportant slot, the slot filling layer 1214 finally outputs the three candidate slots songName, videoName, and mediaName, and fills each candidate slot with the slot value "Hello Old Times".
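  • The B/I labels in the example above can be merged into slot values as sketched below (simplified to a single label per character; in the model each character can carry several candidate slot labels at once):

    # Merge B/I-tagged characters into slot values; O-tagged characters are skipped.
    def merge_bio(chars, tags):
        slots = {}
        for ch, tag in zip(chars, tags):
            if tag == "O":
                continue
            name, pos = tag.rsplit("-", 1)
            if pos == "B":
                slots[name] = ch                 # start a new slot value
            elif pos == "I" and name in slots:
                slots[name] += ch                # extend the current slot value
        return slots

    chars = ["please", "for", "me", "play", "play", "you", "good", "old", "time", "light"]
    tags = ["O", "O", "O", "O", "O",
            "songName-B", "songName-I", "songName-I", "songName-I", "songName-I"]
    print(merge_bio(chars, tags))    # the tagged characters are concatenated into the slot value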
  • For another example, for the corpus data "help me book a train ticket from Shanghai to Beijing and book a five-star hotel near Beijing Railway Station", suppose the threshold set for slot position reliability in the slot filling layer 1214 is 0.5.
  • Among the slot labels predicted by the slot filling layer 1214 for the two characters that make up "Shanghai", the slot position reliability of the slot label "departure" (for example, 0.7) is greater than or equal to 0.5; therefore, the candidate slots output for these two characters are all "departure".
  • Among the slot labels predicted by the slot filling layer 1214 for the two characters that make up "Beijing", the slot position reliability of the slot label "destination" (for example, 0.8) is greater than or equal to 0.5; therefore, the candidate slots output for these two characters are all "destination".
  • Among the slot labels predicted by the slot filling layer 1214 for the five characters that make up "Beijing Railway Station", the slot position reliability of the slot label "location" (for example, 0.75) is greater than or equal to 0.5; therefore, the candidate slots output for these five characters are all "location". Among the slot labels predicted by the slot filling layer 1214 for the three characters that make up "five-star", the slot position reliability of the slot label "star rating" (for example, 0.75) is greater than or equal to 0.5; therefore, the candidate slots output for these three characters are all "star rating".
  • The slot filling layer 1214 finally outputs 4 candidate slots: departure, destination, location, and star rating; the slot value filled into the slot (departure) is (Shanghai), the slot value filled into the slot (destination) is (Beijing), the slot value filled into the slot (location) is (Beijing Railway Station), and the slot value filled into the slot (star rating) is (five-star).
  • In addition, since the sentence vector h0 is input as the initial value and the semantic information represented by the intent attention vector and the sentence vector includes all possible intent labels, the slot filling layer 1214 predicts possible slot labels based on the possible intent labels, so that the slot labels are associated with the intent labels. Therefore, the accuracy of slot prediction is greatly improved, and the speed or efficiency of slot prediction is also improved accordingly.
  • The working process of the slot filling layer 1214 is shown in FIG. 3:
  • The slot filling layer 1214 takes as input the encoding vector ht output by the BERT encoding layer 1211 at time t, the intent attention vector CI and the slot attention vector output by the attention layer 1213 at time t, and the hidden state vector C output by the attention layer 1213 at time t-1.
  • The slot filling layer 1214 first models the relationship between the intent and the slot based on the slot gate mechanism to obtain a fusion vector gS of the intent attention vector CI and the slot attention vector, then further predicts the slot label corresponding to each time step t and calculates the slot position reliability of each slot label.
  • In formula (3), v represents the random weight coefficient of the hyperbolic tangent function tanh(x), and W represents the random weight coefficient of the intent attention vector CI. W greater than 1 means that the intent attention vector CI has a greater influence on slot prediction than the slot attention vector, W less than 1 means that its influence on slot prediction is smaller than that of the slot attention vector, and W equal to 1 means that the intent attention vector CI and the slot attention vector have the same degree of influence on slot prediction.
  • Based on the above four input vectors, the slot filling layer 1214 obtains a slot vector representing the slot label information, and then calculates the slot position reliability of the corresponding slot label based on the slot vector.
  • The slot position reliability yS obtained after the sigmoid activation function can be written as yS = sigmoid(WS · hS + bS), where S is the number of slot labels, hS is the slot vector, WS is the random weight coefficient applied to the slot vector, and bS represents the bias value.
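  • A minimal numpy sketch of this slot-gate style fusion followed by a sigmoid is shown below; the exact fusion expression and shapes are assumptions reconstructed from the variables described above (v, W, tanh, CI and the slot attention vector), not the patent's published formula (3):

    import numpy as np

    rng = np.random.default_rng(0)
    DIM, N_SLOT_LABELS = 8, 4

    v = rng.normal(size=DIM)                  # weight coefficient of the tanh term
    W = 1.0                                   # relative weight of CI versus the slot attention vector
    W_S = rng.normal(size=(N_SLOT_LABELS, DIM))
    b_S = np.zeros(N_SLOT_LABELS)

    def slot_reliability(h_t, c_intent, c_slot, threshold=0.5):
        g = v * np.tanh(c_slot + W * c_intent)               # slot-gate fusion vector gS
        slot_vec = h_t + g                                    # slot vector for time step t
        y_S = 1.0 / (1.0 + np.exp(-(W_S @ slot_vec + b_S)))   # sigmoid slot position reliabilities
        return [i for i, c in enumerate(y_S) if c >= threshold]

    h_t = rng.normal(size=DIM)
    print(slot_reliability(h_t, rng.normal(size=DIM), rng.normal(size=DIM)))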
  • For example, when processing the word "me", the slot filling layer 1214 takes as input the encoding vector h3 (corresponding to "me"), the hidden state vector C output by the attention layer 1213 at time t-1 (corresponding to "for"), and the intent attention vector CI (corresponding to "please play Hello Old Times for me"); the hidden state vector C includes the word meaning information passed on from the word vector corresponding to "please", and the word vector corresponding to "please" also includes the semantic information transmitted from the sentence vector (corresponding to "please play Hello Old Times for me").
  • Similarly, when processing the word "you", the slot filling layer 1214 takes as input the encoding vector h6 (corresponding to "you"), the hidden state vector C output by the attention layer 1213 at time t-1 (corresponding to the second "play"), the intent attention vector CI (corresponding to "please play Hello Old Times for me, play, play") and the slot attention vector (corresponding to "play, you"); the hidden state vector C includes the word meaning information passed on from the word vector corresponding to the first "play", which in turn includes the word meaning information passed on from the word vector before it, and so on, until the word vector corresponding to "please", which also includes the semantic information transmitted from the sentence vector (corresponding to "please play Hello Old Times for me").
  • If the slot position reliability of the slot label songName predicted for "you" is 0.86, the slot position reliability of the slot label videoName is 0.7, the slot position reliability of the slot label mediaName is 0.55, and the slot position reliability of the slot label O is 0.2, then the final predicted slots for "you" are songName, videoName, and mediaName, which are output by the slot filling layer 1214.
  • the slot filling layer 1214 can be obtained by training based on the slot-gate mechanism, the LSTM model and the Sigmoid activation function.
  • the slot gate mechanism focuses on learning the relationship between the intent attention vector and the slot attention vector, and obtains a better semantic frame through global optimization.
  • the slot gate mechanism mainly uses the intent context vector to model the relationship between intent and slot to improve slot filling performance.
  • In other embodiments, other deep neural network models with the same function as the LSTM model can be used as the decoder, and other functions with the same function as the sigmoid function can also be used as the activation function of the corresponding deep neural network decoder; there are no restrictions here.
  • The post-processing layer 1215 is used to sort out the correspondence between candidate intents and candidate slots.
  • the result obtained after the candidate intent corresponds to the candidate slot is output from the post-processing layer 1215 as the semantic parsing result.
  • For example, after the candidate intents (PLAY_MUSIC, PLAY_VIDEO, PLAY_VOICE) output by the intent classification layer 1212 and the candidate slots (songName, videoName, mediaName) output by the slot filling layer 1214 are input to the post-processing layer 1215, the semantic parsing result output after inference and prediction based on the intent-slot mapping table in the post-processing layer 1215 is: the candidate intents PLAY_MUSIC, PLAY_VIDEO, and PLAY_VOICE are the intents identified by parsing the corpus data, the candidate slots songName, videoName, and mediaName are the slots obtained by parsing the corpus data, and "Hello Old Times" is the filled slot value.
  • For another example, after the candidate intents (booking a ticket, booking a hotel) output by the intent classification layer 1212 and the candidate slots (departure, destination, location, star rating) output by the slot filling layer 1214 are input to the post-processing layer 1215, the semantic parsing result output after inference and prediction based on the intent-slot mapping table in the post-processing layer 1215 is: the candidate intents (booking a ticket, booking a hotel) are the intents identified by parsing the corpus data, the candidate slots (departure, destination, location, star rating) are the slots obtained by parsing the corpus data, and Shanghai, Beijing, Beijing Railway Station, and five-star are the slot values filled into the corresponding slots (departure, destination, location, star rating).
  • The working process of the post-processing layer 1215 is shown in FIG. 3:
  • the post-processing layer 1215 takes the candidate intents obtained by the above-mentioned intent classification layer 1212 and the candidate slots obtained by the slot filling layer 1214 as input, and sorts out candidate intents and candidates based on the intent-slot mapping table obtained during the pre-training process of the semantic parsing model 121 . Correspondence between slots.
  • the intent slot mapping table obtained based on the pre-training process of the semantic parsing model 121 is described in detail below, and details are not repeated here.
• the intent-slot mapping table is obtained by sorting out the candidate intents and candidate slots produced by training on a large number of samples; therefore, in the process of performing the semantic parsing task, the intent-slot mapping table can be continuously updated based on more corpus data from practical applications.
  • the above BERT encoding layer 1211 , intent classification layer 1212 , attention layer 1213 , slot filling layer 1214 and post-processing layer 1215 together constitute the semantic parsing model 121 .
• each layer in the structure of the semantic parsing model 121 needs to be pre-trained with a large amount of sample corpus data so that it has the corresponding function of each layer described above.
• the semantic parsing model 121 is pre-trained by the server 200; afterwards, the trained semantic parsing model 121 can either be transplanted to the mobile phone 100 to directly perform the semantic parsing task, or it can remain on the server 200 to execute semantic parsing tasks requested by the mobile phone 100.
  • the pre-training process of the semantic parsing model 121 will be described in detail below.
  • the pre-training process of the semantic parsing model 121 includes:
  • the server 200 collects sample corpus data for training the semantic parsing model 121 .
• the collected sample corpus data should cover as many domains as possible and include as many verbs, proper nouns, common nouns, and so on as possible, so that the trained semantic parsing model 121 has better generalization performance.
  • sample corpus data used for training the semantic parsing model 121 needs to be input into the layers of the semantic parsing model 121 for training in batches.
  • concepts related to sample data are introduced below.
• (a) batch: the loss function required for each parameter update in deep learning is not obtained from a single {data: label} pair, but is obtained by weighting a set of data; the number of samples in this set of data is the batchsize.
• (b) batchsize: the batch size, that is, the number of samples in a batch. Each training step takes batchsize samples from the training set for training.
• (c) iteration: the number of iterations is the number of batches needed to complete one epoch. One iteration is equal to training once with batchsize samples; within one epoch, the number of batches and the number of iterations are equal.
• (d) epoch: when the complete dataset passes through the neural network once and returns once, the process is called one epoch. That is to say, one epoch is equivalent to training once with all the samples in the training set.
  • training the entire sample set requires 100 iterations and 1 epoch.
• for example, for a dataset with 2000 training samples, dividing the 2000 samples into batches of size 500, it takes 4 iterations to complete one epoch, as the sketch below also shows.
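• The relationship among these quantities can be checked with a few lines of Python (illustrative only):

```python
def iterations_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of iterations (batches) needed for one epoch over the dataset."""
    # each iteration consumes batch_size samples; one epoch consumes them all
    return num_samples // batch_size

print(iterations_per_epoch(2000, 500))  # -> 4 iterations per epoch
```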
  • the server 200 performs data preprocessing on the sample corpus data to be input into the training of the semantic parsing model 121 through the NLP module.
  • data preprocessing of the sample corpus data please refer to the relevant description of the data preprocessing in the BERT coding layer 1211 above, which will not be repeated here.
• after data preprocessing, a Token sequence, a segmentation mark, and a mask created corresponding to the Token sequence are obtained for each piece of sample corpus data.
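• As an illustration, preprocessing one piece of corpus data into a Token sequence, segment ids and a mask could look like the following sketch; the character-level tokenization, the [CLS]/[SEP]/[PAD] markers and the fixed length are assumptions made for the example:

```python
def preprocess(text: str, max_len: int = 16):
    """Turn one corpus sentence into tokens, segment ids and a mask (sketch)."""
    tokens = ["[CLS]"] + list(text) + ["[SEP]"]          # character-level tokens
    segment_ids = [0] * len(tokens)                      # single-sentence input
    mask = [1] * len(tokens)                             # 1 marks a real token
    pad = max_len - len(tokens)                          # pad so sentences can be batched
    tokens += ["[PAD]"] * pad
    segment_ids += [0] * pad
    mask += [0] * pad                                    # 0 marks padding
    return tokens, segment_ids, mask

tokens, segment_ids, mask = preprocess("请给我播放你好旧时光")
print(tokens)
print(mask)
```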
• in one epoch of training, the server 200 respectively inputs, for each sample corpus, the Token sequence, the segmentation mark and the mask corresponding to the Token sequence obtained by data preprocessing into the BERT encoding layer 1211 of the semantic parsing model 121 for training, so that it can output a sequence of encoding vectors as described for the BERT encoding layer 1211 above.
• the BERT coding layer 1211 is obtained based on the training of the BERT model. During the training process, it is necessary to continuously fine-tune the upstream and downstream parameters of the semantic parsing model 121, so that the BERT coding layer can output the above coding vector sequence {h0, h1, h2, ..., ht}.
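• For instance, obtaining such an encoding vector sequence from a pre-trained BERT encoder can be sketched as follows; this sketch assumes the Hugging Face transformers library and the bert-base-chinese checkpoint, which are not part of this application:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")

enc = tokenizer("请给我播放你好旧时光", return_tensors="pt")
with torch.no_grad():
    out = model(**enc)

h_seq = out.last_hidden_state[0]   # one encoding vector per token: h1 ... ht
h_0 = out.last_hidden_state[0, 0]  # vector at [CLS], usable as the sentence vector h0
print(h_seq.shape, h_0.shape)
```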
• in one epoch of training, the server 200 inputs the sentence vector h0 output by the BERT coding layer 1211 in the above process 503 into the intent classification layer 1212 of the semantic parsing model 121 for training, so that it can output the candidate intents as described for the intent classification layer 1212 above, which is not repeated here.
• the intent classification layer 1212 is obtained by training based on a fully connected layer and the Sigmoid function as the activation function. During the training process, it is necessary to continuously fine-tune the upstream and downstream parameters of the semantic parsing model 121, so that after learning from the sample corpus data for a long enough time or with a large enough number of samples, the intent classification layer 1212 can predict all possible intent labels and the intent confidence corresponding to each intent label, and then extract the multiple intent labels that meet the output condition as candidate intents, which are output from the intent classification layer 1212. For details, refer to the above formula (1) and the related description, which is not repeated here.
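• A minimal sketch of such a multi-label intent classification head (a fully connected layer followed by a Sigmoid, with a confidence threshold as the output condition) is given below; the threshold value 0.5, the label names other than those above, and the toy dimensions are assumptions made for the example:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_intents(h0, W, b, intent_labels, threshold=0.5):
    """Fully connected layer + Sigmoid over the sentence vector h0.
    Every label whose confidence exceeds the threshold is kept as a candidate intent."""
    confidences = sigmoid(W @ h0 + b)            # one confidence per intent label
    return [(label, float(c)) for label, c in zip(intent_labels, confidences)
            if c > threshold]

labels = ["PLAY_MUSIC", "PLAY_VIDEO", "PLAY_VOICE", "QUERY_WEATHER"]
rng = np.random.default_rng(1)
h0 = rng.normal(size=8)
print(predict_intents(h0, rng.normal(size=(4, 8)), np.zeros(4), labels))
```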
  • the candidate intents output by the intent classification layer 1212 are input to the post-processing layer 1215.
• the server 200 inputs the encoding vector sequence {h0, h1, h2, ..., ht} output by the BERT coding layer 1211 trained in the above process 503 into the attention layer 1213 of the semantic parsing model 121 for training, so that it can output the intent attention vector CI and the slot attention vector as described for the attention layer 1213 above, which is not repeated here.
• the attention layer 1213 is obtained by training based on the attention mechanism and the LSTM model. During the training process, it is necessary to continuously fine-tune the upstream and downstream parameters of the semantic parsing model 121, so that the attention layer 1213 can quantify the degree to which the word corresponding to each word vector is related to the expressed intent, and the degree to which it is related to the represented slot, and finally output the intent attention vector and the slot attention vector.
• the LSTM model is a special RNN model, which was proposed to solve the gradient dispersion (vanishing gradient) problem of the RNN model. Its core is the cell state, which can be understood as a conveyor belt; it is in effect the memory space of the entire model and changes over time.
• the working principle of the LSTM model can be briefly described as: (1) forget gate: choose to forget some past information; (2) input gate: remember some current information; (3) merge the past and present memory; (4) output gate: choose to output some information. A sketch of one such cell step is given below.
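• The following NumPy sketch shows one LSTM cell step implementing the four gates above; the weight shapes, the stacked parameter layout and the toy sequence are assumptions made for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold the parameters of the four gates
    (forget, input, candidate, output) stacked along the first axis."""
    z = W @ x_t + U @ h_prev + b
    f, i, g, o = np.split(z, 4)
    f = sigmoid(f)            # forget gate: choose to forget some past information
    i = sigmoid(i)            # input gate: remember some current information
    g = np.tanh(g)            # candidate memory
    c_t = f * c_prev + i * g  # merge past and present memory (the cell state)
    o = sigmoid(o)            # output gate: choose what to output
    h_t = o * np.tanh(c_t)
    return h_t, c_t

d = 4
rng = np.random.default_rng(2)
h, c = np.zeros(d), np.zeros(d)
for x in rng.normal(size=(3, d)):             # a toy sequence of 3 word vectors
    h, c = lstm_cell(x, h, c, rng.normal(size=(4 * d, d)),
                     rng.normal(size=(4 * d, d)), np.zeros(4 * d))
print(h)
```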
• the attention mechanism imitates the internal process of biological observation behavior, that is, a mechanism that aligns internal experience with external sensation to increase the fineness of observation in some regions; it can use limited attention resources to quickly screen out high-value information from a large amount of information.
• the attention mechanism can quickly extract important features from sparse data.
• the essential idea of the attention mechanism can be written as the following formula:

  Attention(Query, Source) = Σ_{i=1..Lx} Similarity(Query, Key_i) · Value_i

where Lx represents the length of Source. The meaning of the formula is to imagine that the constituent elements in Source are composed of a series of <Key, Value> data pairs. Given an element Query in the target Target, the similarity or correlation between the Query and each Key is calculated to obtain the weight coefficient of each Key's corresponding Value, and the Values are then weighted and summed to obtain the final Attention value. In essence, the Attention mechanism is a weighted sum of the Value of the elements in Source, where Query and Key are used to calculate the weight coefficient of the corresponding Value.
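• The weighted-sum idea in the above formula can be sketched directly in Python; the dot-product similarity and softmax normalization chosen here are assumptions made for the example, and other similarity functions can be used:

```python
import numpy as np

def attention(query, keys, values):
    """Attention(Query, Source) = sum_i Similarity(Query, Key_i) * Value_i."""
    scores = keys @ query                        # similarity of Query to each Key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # normalized weight coefficients
    return weights @ values                      # weighted sum of the Values

rng = np.random.default_rng(3)
L_x, d = 5, 4                                    # L_x: length of Source
keys = rng.normal(size=(L_x, d))
values = rng.normal(size=(L_x, d))
query = rng.normal(size=d)
print(attention(query, keys, values))            # the final Attention value
```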
• the server 200 inputs the encoding vector ht output at time t by the BERT coding layer 1211 trained in the above process 503, the intent attention vector and slot attention vector output at time t by the attention layer 1213 trained in the above process 505, and the hidden state vector C output at time t-1 by the LSTM model in the attention layer 1213 (that is, the semantic information of the sentence preceding the currently processed word, or the word meaning information of the preceding word) into the slot filling layer 1214 of the semantic parsing model 121 for training, so that it can output candidate slots as described for the slot filling layer 1214 above, which is not repeated here.
• the slot filling layer 1214 is obtained by training based on the slot gate mechanism, the LSTM model as the decoder, and the Sigmoid function as the activation function. During the training process, it is necessary to continuously fine-tune the upstream and downstream parameters of the semantic parsing model 121, so that after learning from the sample corpus data for a long enough time or with a large enough number of samples, the slot filling layer 1214 can predict, corresponding to the possible intent labels, all possible slot labels and the slot confidence corresponding to each slot label, and then extract the multiple candidate slots that meet the output condition as the output of the slot filling layer 1214. For details, refer to the above formulas (3) to (4) and the related descriptions, which are not repeated here.
  • the candidate slots output by the slot filling layer 1214 are input to the post-processing layer 1215 .
  • the server 200 determines whether the training results of the above-mentioned processes 501-506 satisfy the training termination condition. If the training result satisfies the training termination condition, go to 508 ; if the training result does not satisfy the training termination condition, go to 509 .
  • an Early Stopping mechanism may be used to determine the termination of model training. That is, when the number of training epochs reaches the number threshold or the epoch interval with the last optimal model is greater than the set interval threshold, the training result satisfies the training termination condition; otherwise, the training result does not meet the training termination condition.
  • the early stopping mechanism can make the trained neural network model have good generalization performance, that is, it can fit the data well. Its basic meaning is to calculate the performance of the model on the validation set during training. When performance starts to drop, stop training to avoid overfitting problems caused by continuing training.
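• A minimal sketch of such an Early Stopping check is given below; the epoch threshold and the interval (patience) threshold values, as well as the toy validation losses, are assumptions made for the example:

```python
def should_stop(epoch: int, best_epoch: int, max_epochs: int = 50,
                patience: int = 5) -> bool:
    """Stop when the epoch count reaches the threshold, or when the gap to the
    epoch that produced the last best model exceeds the interval threshold."""
    return epoch >= max_epochs or (epoch - best_epoch) > patience

# toy training loop: validation loss stops improving after epoch 4
val_losses = [1.0, 0.8, 0.7, 0.69, 0.70, 0.71, 0.72, 0.73, 0.74, 0.75]
best, best_epoch = float("inf"), 0
for epoch, loss in enumerate(val_losses, start=1):
    if loss < best:
        best, best_epoch = loss, epoch
    if should_stop(epoch, best_epoch):
        print(f"stop at epoch {epoch}, best epoch was {best_epoch}")
        break
```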
• the server 200 terminates the training of the BERT encoding layer 1211, the intent classification layer 1212, the attention layer 1213 and the slot filling layer 1214 in the semantic parsing model 121, and further inputs a large number of candidate intents and a large number of candidate slots into the post-processing layer 1215 of the semantic parsing model 121 to sort out their relationship, for example, sorting out the candidate slots based on the candidate intents to obtain an intent-slot mapping table.
  • the semantic parsing model training ends.
• a candidate intent and a candidate slot are obtained for each sample corpus data after the training of the above processes 502 to 506; since the sample corpus data is sufficient, the candidate intents and candidate slots input into the post-processing layer 1215 are also sufficient.
• the post-processing layer 1215 is trained based on a sufficient number of candidate intents and candidate slots, so that it can sort out the candidate slots based on the candidate intents and output an ordered correspondence between intents and slots, for example, training to obtain an intent-slot mapping table. Based on the intent-slot mapping table, the post-processing layer 1215 can accurately and quickly find the correspondence between the candidate intents and the candidate slots that are input.
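• For example, the intent-slot mapping table can be thought of as a lookup structure such as the one sketched below, so that sorting candidate slots by candidate intents reduces to a simple lookup; the table contents only repeat the examples above, and the function and field names are assumptions made for illustration:

```python
# Illustrative intent-slot mapping table built up during training.
INTENT_SLOT_TABLE = {
    "PLAY_MUSIC": ["songName"],
    "PLAY_VIDEO": ["videoName"],
    "PLAY_VOICE": ["mediaName"],
    "booking a ticket": ["departure", "destination"],
    "booking a hotel": ["location", "star rating"],
}

def sort_slots_by_intent(candidate_intents, candidate_slots):
    """Attach each candidate slot to the candidate intent it belongs to."""
    result = {}
    for intent in candidate_intents:
        allowed = INTENT_SLOT_TABLE.get(intent, [])
        result[intent] = {s: v for s, v in candidate_slots.items() if s in allowed}
    return result

print(sort_slots_by_intent(
    ["booking a ticket", "booking a hotel"],
    {"departure": "Shanghai", "destination": "Beijing",
     "location": "Beijing Railway Station", "star rating": "Five Star"}))
```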
  • the server 200 continues to input the sample corpus data of the next epoch and repeats the processes 502 to 507 to continue training the semantic parsing model 121 .
• the objective loss function adopted for the joint optimization of intent and slot is the sum of the intent classification loss function, the slot filling loss function, and a regularization term on the weights.
  • the intent classification loss function adopts the multi-label Sigmoid cross entropy loss (Cross Entropy Loss) function
  • the slot filling loss function adopts the serialized multi-label Sigmoid Cross Entropy Loss function.
• the Sigmoid Cross Entropy Loss and the joint objective loss function can be written as follows:

  L(y, f(x)) = -[ y·log(σ(f(x))) + (1 - y)·log(1 - σ(f(x))) ]

  J = L_y(y, f(x)) + L_c(y, f(x)) + (λ / 2m) · Σ_l Σ_k Σ_j ( W_{k,j}^{[l]} )²

where σ(·) denotes the Sigmoid function, L_y(y, f(x)) is the intent classification loss function calculated according to the above formula (6), L_c(y, f(x)) is the slot filling loss function calculated according to the above formula (6), λ is the hyperparameter, m is the number of data in a batch, the reason for dividing by 2 is so that it cancels out during differentiation, Σ_k Σ_j (W_{k,j}^{[l]})² represents the sum of the squared W parameters of the l-th layer, and W^{[l]} is a matrix whose rows and columns are indexed by k and j.
• the joint optimization function mainly jointly optimizes the intent classification loss and the slot filling loss generated in the process of matrix transformation in the neural network, as the sketch below illustrates.
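• The following Python sketch computes such a joint objective: a multi-label Sigmoid cross entropy for the intent loss, the same loss summed over time steps for the slot loss, and an L2 regularization term; the λ/(2m) scaling follows the reconstruction above, and the toy shapes and values are assumptions made for the example:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def multilabel_bce(y_true, logits):
    """Multi-label Sigmoid cross entropy loss, averaged over labels."""
    p = sigmoid(logits)
    eps = 1e-12
    return -np.mean(y_true * np.log(p + eps) + (1 - y_true) * np.log(1 - p + eps))

def joint_loss(intent_true, intent_logits, slot_true, slot_logits,
               weight_matrices, lam=0.01, m=32):
    """Intent loss + slot loss + (lambda / 2m) * sum of squared weights."""
    l_intent = multilabel_bce(intent_true, intent_logits)
    # slot loss: serialized multi-label loss, one term per time step
    l_slot = np.mean([multilabel_bce(t, l) for t, l in zip(slot_true, slot_logits)])
    l2 = sum(np.sum(W ** 2) for W in weight_matrices)
    return l_intent + l_slot + lam / (2 * m) * l2

rng = np.random.default_rng(4)
intent_true = np.array([1.0, 0.0, 1.0])
slot_true = rng.integers(0, 2, size=(5, 4)).astype(float)   # 5 time steps, 4 labels
print(joint_loss(intent_true, rng.normal(size=3),
                 slot_true, rng.normal(size=(5, 4)),
                 [rng.normal(size=(4, 4))]))
```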
  • the semantic parsing model 121 trained by the server 200 can parse the corpus data to be parsed into candidate intents and candidate slots that are closer to the real intent and the real slot.
• the trained semantic parsing model 121 can either be transplanted to the mobile phone 100 to directly perform the semantic parsing task, or can remain on the server 200 to execute semantic parsing tasks requested by the mobile phone 100. Specifically, as shown in FIG. 6, the user enters a voice instruction by waking up the voice assistant of the mobile phone 100, and the mobile phone 100, through the internal human-machine dialogue system 110 and based on the above semantic parsing model 121, extracts one or more intents corresponding to the user's voice instruction and the corresponding slot information.
  • the mobile phone 100 further performs corresponding operations based on the identified intent and the slot, for example, opening an application software, or performing a web page search.
• for the specific interaction process between the user and the mobile phone 100 onto which the semantic parsing model 121 has been transplanted, refer to the following example:
  • the mobile phone 100 obtains the user's voice instruction.
  • a voice assistant is installed in the mobile phone 100 , and the user can send a voice command to the mobile phone 100 by waking up the voice assistant of the mobile phone 100 .
  • the mobile phone 100 acquires the user's voice instruction "help me book a train ticket from Shanghai to Beijing and book a five-star hotel near Beijing Railway Station".
  • the speech recognition module 111 in the man-machine dialogue system 110 of the mobile phone 100 recognizes and converts the acquired user speech instruction into corpus data in the form of text. For example, converting the above voice command into textual corpus data "help me book a train ticket from Shanghai to Beijing and book a five-star hotel near Beijing Railway Station".
• the semantic parsing module 112 in the human-machine dialogue system 110 of the mobile phone 100 is configured to perform semantic parsing on the corpus data to obtain a semantic parsing result in which intents correspond to slots.
  • the semantic parsing module 112 preprocesses the corpus data to obtain a Token sequence, a sentence segmentation mark, and a mask created corresponding to the Token sequence.
• the semantic parsing module 112 uses the Token sequence, the segmentation mark and the mask created corresponding to the Token sequence as the input of the semantic parsing model 121, performs semantic parsing, and extracts multiple candidate intents and multiple candidate slots; finally, after the semantic parsing model 121 sorts out the correspondence between the multiple candidate intents and the multiple candidate slots, the result is output as the semantic parsing result.
  • a simple single-intent corpus can also be parsed by the semantic parsing model 121 to extract a single candidate intent and one or more corresponding candidate slots, which is not limited herein.
• the semantic parsing result obtained by the semantic parsing module 112 of the human-machine dialogue system 110 through the semantic parsing model 121 is:
  • the problem solving module 113 in the human-machine dialogue system 110 of the mobile phone 100 searches for a corresponding application or network resource based on the semantic analysis result obtained by the semantic analysis module 112 to obtain a solution to the intent and slot in the semantic analysis result.
• the solution searched by the problem solving module 113 is that the mobile phone 100 can open an installed booking service software application or travel software application to query train ticket information and hotel information for the user to choose and book, or select a train ticket by default according to the user's historical usage records, enter the booking interface, and ask the user to confirm.
  • the mobile phone interface is shown in Figure 7.
• the intent and slot mapping result obtained by parsing the corpus data recognized from the instruction includes the user's three intents, the slots corresponding to each of the three intents, and the slot value filled in each slot; then the mobile phone 100 can, based on the user's usage habits, open the music player software by default to play the local music "Hello Old Times", or open the audio and video player software to obtain music or video files related to "Hello Old Times" for the user to choose to play.
  • the language generation module 114 in the man-machine dialogue system 110 of the mobile phone 100 generates a natural language sentence for the solution found by the problem solving module 113 , and feeds it back to the user through the display interface of the mobile phone 100 .
• the solution searched by the above problem solving module 113 is:
  • the mobile phone 100 can open the installed booking service software application or travel software application to query train ticket information and hotel information for the user to select and reserve, or select a train ticket by default according to the user's historical usage record to enter the reservation interface and ask the user to confirm.
  • the language generation module 114 can correspondingly generate the train number information of the train ticket or the introduction information of the hotel, and feed it back to the user through the display interface of the mobile phone 100 , as shown in FIG. 7 .
  • the user's voice command obtained by the mobile phone 100 is to query the weather for the last three days.
• the solution searched by the problem solving module 113 is to open the browser on the mobile phone 100, or open the weather query software installed on the mobile phone 100, to search for the weather conditions of the last three days.
  • the language generation module 114 generates natural language texts from the searched weather conditions as follows:
  • the weather today is 28-32°C;
  • the dialogue management module 115 in the human-machine dialogue system 110 of the mobile phone 100 may schedule other modules based on the user's dialogue history to further improve the accurate understanding of the user's voice command. For example, in the process of searching the weather by the problem solving module 113, the location is not clearly indicated in the user's voice command, then the dialogue management module 115 can schedule the problem solving module 113 based on the user's dialogue history to search for Beijing, which is frequently inquired by the user, as a search address, and provide feedback to the user.
  • the dialogue management module 115 can also dispatch the problem solving module 113 based on the location information of the mobile phone 100 to search for the weather in the current location of the user for the past three days, and further dispatch the language generation module 114 to generate the following natural language sentences:
  • the weather today is 28-32°C;
  • the dialogue management module 115 in the human-machine dialogue system 110 of the mobile phone 100 can flexibly schedule other modules in the human-machine dialogue system 110 to perform corresponding functions.
  • the speech synthesis module 116 in the man-machine dialogue system 110 of the mobile phone 100 further synthesizes and converts the natural language sentences generated by the language generation module 114 into speech, which is played back to the user through the mobile phone 100 .
  • the weather conditions generated by the language generation module 114 in the above process 605 are converted into voice and played to the user, so that the user can hear the weather conditions without looking at the mobile phone.
  • the trained semantic parsing model 121 may also continue to exist in the server 200 to perform the semantic parsing task requested from the mobile phone 100 .
  • the user inputs voice commands by waking up the voice assistant of the mobile phone 100, the mobile phone 100 converts the user's voice commands into corpus data through the internal man-machine dialogue system 110, and the mobile phone 100 interacts with the server 200 to send the converted corpus data to the server 200 for semantic processing.
  • the server 200 extracts multiple candidate intents and candidate slots corresponding to the intents in the user's voice instruction based on the semantic parsing model 121 . Further, the server 200 feeds back the extracted intent and the corresponding result of the slot to the mobile phone 100, and the mobile phone 100 further performs corresponding operations based on the identified intent and the slot, such as opening an application software or performing a web page search.
  • FIG. 8 shows a schematic structural diagram of a mobile phone 100 according to an embodiment of the present application.
• the mobile phone 100 may include a processor 101, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, a headphone jack 170D, a sensor module 180, buttons 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and so on.
  • the sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, and ambient light. Sensor 180L, bone conduction sensor 180M, etc.
  • the structures illustrated in the embodiments of the present invention do not constitute a specific limitation on the mobile phone 100 .
  • the mobile phone 100 may include more or less components than shown, or some components are combined, or some components are separated, or different components are arranged.
  • the illustrated components may be implemented in hardware, software, or a combination of software and hardware.
  • the mobile phone 100 can obtain the user's voice command and feed back the response voice to the user through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headphone interface 170D, and the application processor.
  • the mobile phone 100 obtains the user's voice command through the receiver 170B or the microphone 170C, and sends the obtained user's voice command to the human-machine dialogue system 110 for voice recognition and semantic analysis.
• the corresponding solution is then matched, and the mobile phone 100 executes the corresponding operation to realize the solution corresponding to the semantic parsing result.
  • the man-machine dialogue system 110 can also generate a response voice from the solution corresponding to the semantic analysis result and feed back the response voice to the user through the speaker 170A of the mobile phone 100 or the earphone plugged in the earphone interface 170D.
  • the audio module 170 is used for converting digital audio information into analog audio signal output, and also for converting analog audio input into digital audio signal. Audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be provided in the processor 101 , or some functional modules of the audio module 170 may be provided in the processor 101 .
• the speaker 170A, also referred to as a "loudspeaker", is used to convert audio electrical signals into sound signals.
  • the electronic device 100 can listen to music through the speaker 170A, or listen to a hands-free call.
• the receiver 170B, also referred to as an "earpiece", is used to convert audio electrical signals into sound signals.
• when answering a call or listening to a voice message, the receiver 170B can be placed close to the human ear to hear the voice.
• the microphone 170C, also called a "mike" or a "mic", is used to convert sound signals into electrical signals.
• when making a sound, the user can speak with the mouth close to the microphone 170C to input the sound signal into the microphone 170C.
  • the electronic device 100 may be provided with at least one microphone 170C. In other embodiments, the electronic device 100 may be provided with two microphones 170C, which can implement a noise reduction function in addition to collecting sound signals. In other embodiments, the electronic device 100 may further be provided with three, four or more microphones 170C to collect sound signals, reduce noise, identify sound sources, and implement directional recording functions.
• the processor 101 may include one or more processing units; for example, the processor 101 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), and the like. Different processing units may be independent devices, or may be integrated into one or more processors.
• the processor 101 realizes the function of the semantic parsing model 121 by running a program; the human-machine dialogue system 110 recognizes and converts the user's voice instruction into text corpus data, which, after data preprocessing, is input into the semantic parsing model 121 run by the processor 101 for semantic parsing to obtain the semantic parsing result.
  • the controller can generate an operation control signal according to the instruction operation code and timing signal, and complete the control of fetching and executing instructions.
  • a memory may also be provided in the processor 101 for storing instructions and data.
  • the memory in processor 101 is a cache memory.
  • the memory may hold instructions or data that have just been used or recycled by the processor 101 . If the processor 101 needs to use the instruction or data again, it can be called directly from the memory. Repeated access is avoided, and the waiting time of the processor 101 is reduced, thereby improving the efficiency of the system.
  • the processor 101 may include one or more interfaces.
• the interface may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a general-purpose input/output (GPIO) interface, a SIM interface, and/or a USB interface, etc.
  • the interface connection relationship between the modules illustrated in the embodiment of the present invention is only a schematic illustration, and does not constitute a structural limitation of the mobile phone 100 .
  • the mobile phone 100 may also adopt different interface connection manners in the foregoing embodiments, or a combination of multiple interface connection manners.
  • the charging management module 140 is used to receive charging input from the charger.
  • the charger may be a wireless charger or a wired charger.
  • the charging management module 140 may receive charging input from the wired charger through the USB interface 130.
  • the charging management module 140 may receive wireless charging input through the wireless charging coil of the mobile phone 100 . While the charging management module 140 charges the battery 142 , it can also supply power to the electronic device through the power management module 141 .
  • the power management module 141 is used to connect the battery 142 , the charging management module 140 and the processor 101 .
  • the power management module 141 receives input from the battery 142 and/or the charging management module 140, and supplies power to the processor 101, the internal memory 121, the display screen 194, the camera 193, and the wireless communication module 160.
  • the wireless communication function of the mobile phone 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modulation and demodulation processor, the baseband processor, and the like.
  • Antenna 1 and Antenna 2 are used to transmit and receive electromagnetic wave signals.
  • Each antenna in handset 100 may be used to cover a single or multiple communication frequency bands. Different antennas can also be reused to improve antenna utilization.
  • the antenna 1 can be multiplexed as a diversity antenna of the wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
• the mobile communication module 150 can provide wireless communication solutions including 2G/3G/4G/5G and the like applied on the mobile phone 100.
• the wireless communication module 160 can provide wireless communication solutions applied on the mobile phone 100, including wireless local area network (WLAN) (such as a wireless fidelity (Wi-Fi) network), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), and the like.
  • the antenna 1 of the mobile phone 100 is coupled with the mobile communication module 150, and the antenna 2 is coupled with the wireless communication module 160, so that the mobile phone 100 can communicate with the network and other devices through wireless communication technology.
  • the mobile phone 100 implements a display function through a GPU, a display screen 194, an application processor, and the like.
  • the GPU is a microprocessor for image processing, and is connected to the display screen 194 and the application processor.
  • Display screen 194 is used to display images, videos, and the like. Display screen 194 includes a display panel. In some embodiments, the handset 100 may include 1 or N display screens 194, where N is a positive integer greater than 1.
  • the SIM card interface 195 is used to connect a SIM card.
• the present disclosure also relates to an apparatus for performing the operations herein.
  • This apparatus may be specially constructed for the required purposes or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer.
  • a computer program may be stored on a computer readable medium, such as, but not limited to, any type of disk, including floppy disks, optical disks, CD-ROMs, magneto-optical disks, read only memory (ROM), random access memory (RAM) , EPROM, EEPROM, magnetic or optical cards, application specific integrated circuits (ASICs), or any type of medium suitable for storing electronic instructions, and each may be coupled to a computer system bus.
  • the computers referred to in the specification may include a single processor or may be architectures employing multiple processors for increased computing power.

Abstract

The present application relates to the technical field of human-machine dialogs, and specifically relates to an electronic device and a semantic parsing method therefor, a medium, and a human-machine dialog system. The semantic parsing method comprises: obtaining corpus data to be parsed; calculating the degree of intention correlation between a word comprised in said corpus data and an intention represented by said corpus data, and the degree of slot correlation between the word and a slot represented by said corpus data; and predicting the slot of said corpus data according to the semantic information of the word, the foregoing semantic information of the word, and the degree of intention correlation and the degree of slot correlation of the word. A plurality of intentions close to the real intention of a user are recognized from user voice, and then slot information is predicted by using the plurality of recognized intentions, thereby improving the accuracy of slot filling, also correspondingly improving the speed or efficiency of slot filling, and further improving the accuracy of semantic parsing in a human-machine dialog.

Description

Electronic device and its semantic parsing method, medium and human-machine dialogue system
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on September 15, 2020, with application number 202010970477.8 and entitled "Electronic Device and Its Semantic Parsing Method, Medium and Human-Machine Dialogue System", the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the technical field of man-machine dialogue, and in particular to an electronic device and a semantic parsing method, medium and man-machine dialogue system thereof.
Background Art
With the continuous development of artificial intelligence technology and the deep popularization of various intelligent terminal electronic devices, man-machine dialogue systems are increasingly applied in various intelligent terminal electronic devices, for example, smart speakers, smart phones, in-vehicle intelligent systems such as in-vehicle intelligent voice navigation, robots, and the like. A man-machine dialogue system uses technologies such as speech recognition, semantic parsing and language generation to realize dialogue and information exchange between humans and machines.
The spoken language understanding task in semantic parsing technology includes two sub-tasks: intent recognition and slot filling. At present, intent recognition and slot filling are mainly aimed at single-intent, single-slot recognition, that is, for the same piece of speech, the closest intent is selected from multiple intent recognition result options as the recognition result. However, in practical applications, a corpus with multiple intents and a corpus with a single intent may have the same sentence pattern, and a single-intent classification model cannot distinguish a corpus with multiple intents, which eventually leads to a high false-trigger rate of the model, that is, a high error rate of the intent recognition and slot filling results. Moreover, the existing intent-slot recognition architecture cannot explicitly model the relationship between intents and slots, has poor accuracy for multi-label intent recognition and slot filling, and is not compatible with mixed single-intent and multi-intent scenarios.
SUMMARY OF THE INVENTION
Embodiments of the present application provide an electronic device, a semantic parsing method therefor, a medium, and a human-machine dialogue system. By recognizing, from a user's voice, multiple intents close to the user's true intent, and then using the identified multiple intents to predict slot information, the accuracy of slot filling is improved, the speed or efficiency of slot filling is correspondingly improved, and the accuracy of semantic parsing in human-machine dialogue is further improved.
In a first aspect, an embodiment of the present application provides a semantic parsing method, the method including: acquiring corpus data to be parsed; calculating an intent correlation degree between a word included in the corpus data to be parsed and an intent represented by the corpus data to be parsed, and a slot correlation degree between the word and a slot represented by the corpus data to be parsed; and predicting a slot of the corpus data to be parsed based on semantic information of the word, preceding semantic information of the word, and the intent correlation degree and slot correlation degree of the word.
For example, the corpus data may be obtained by performing speech recognition and conversion on a user's voice instruction.
The intent correlation degree between a word included in the corpus data to be parsed and the intent represented by the corpus data to be parsed may be represented by an intent attention vector, and the slot correlation degree between the word and the slot represented by the corpus data to be parsed may be represented by a slot attention vector.
The semantic information of a word may be understood as the word meaning information of the word, that is, the literal meaning of the word and what it refers to. For example, the word "you" may be a pronoun (expressing a form of address for the other party), or may serve as a noun in a specific sentence (for example, in a song title such as Hello Old Times).
The preceding semantic information of a word may be the word meaning information of the previous word that is adjacent to the current word in the corpus data; if the current word being processed is the first word, the preceding semantic information may be the sentence semantic information of the corpus data. The preceding semantic information is used mainly because it is of great significance to the slot prediction of the current word. The preceding semantic information may be expressed in the hidden state vector output at the previous moment (relative to the current moment).
In a possible implementation of the above first aspect, the above method further includes: predicting multiple intents from the corpus data to be parsed; and determining, from the predicted slots, the slot corresponding to each of the multiple intents.
For example, multiple intents are obtained by parsing the corpus data converted from a user's voice instruction. If the corpus data contains only a single intent, the present application is also applicable to parsing the single intent in such single-intent corpus data, which provides a certain degree of generality and a better user experience.
Among the multiple intents obtained by parsing, each intent has at least one slot corresponding to it, and some intents may have three or more slots corresponding to them. The present application can accurately sort out the correspondence between multiple intents and multiple slots.
In a possible implementation of the above first aspect, the above method further includes: the preceding semantic information includes the semantic information of at least one word located before the word in the corpus data to be parsed.
For example, when performing semantic parsing on a piece of corpus data, when the slot is predicted for the first word at the first moment, the preceding semantic information of the first word is the sentence semantic information of this piece of corpus data. When the slot is predicted for the second word at the second moment, the preceding semantic information of the second word is the word meaning information of the first word, and at this time the word meaning information of the first word contains the information that the sentence semantic information of the corpus data passed to the first word at the first moment. By analogy, the preceding semantic information of each subsequent word is the word meaning information of the previous word, and the word meaning information of the previous word includes the word meaning information passed on by the word before it. This passing relationship is progressive: the word meaning information of two adjacent words has the greatest degree of correlation, while the word meaning information of two non-adjacent words has a smaller degree of correlation, or the degree of correlation gradually approaches 0 as the number of characters between them increases.
In a possible implementation of the above first aspect, the above method further includes: generating sentence semantic information of the corpus data to be parsed and semantic information of each word in the corpus data to be parsed.
For example, the sentence character representing the sentence in the corpus data is encoded by an encoder so that the sentence character can express specific semantic information, and this specific semantic information is the same as or close to the semantic information obtained by a human understanding the sentence. In addition, the word character of each word in the corpus data is encoded by the encoder so that the word character can express specific word meaning information, and this specific word meaning information is the same as or close to the word meaning information that a human would attribute to each word after understanding the sentence. In some embodiments, a sentence vector may be used to represent the sentence semantic information of the corpus data, and a word vector may be used to represent the word meaning information of each word in the corpus data.
In a possible implementation of the above first aspect, the above method further includes: the method is implemented by a neural network model. The neural network model includes a fully connected layer and a long short-term memory network model.
For example, a semantic parsing model is trained through a neural network model combined with a BERT model, an attention mechanism, a slot gate mechanism and a Sigmoid activation function, so that it can implement the above method.
In a possible implementation of the above first aspect, the above method further includes: the sentence semantic information of the corpus data to be parsed, the preceding semantic information of the word, and the intent correlation degree and slot correlation degree of the word are represented in the form of vectors in the neural network model.
For example, the sentence semantic information of the corpus data to be parsed is represented by a sentence vector, the preceding semantic information of the word is represented by the hidden state vector at the previous moment, and the intent correlation degree and the slot correlation degree of the word are represented by an intent attention vector and a slot attention vector, respectively.
In a second aspect, an embodiment of the present application provides a man-machine dialogue method, including: receiving a user voice instruction; converting the user voice instruction into a corpus to be parsed in text form; parsing out, through the above semantic parsing method, the intents in the corpus to be parsed and the slot corresponding to each intent; and based on the parsed intents and the slots corresponding to the intents, executing the operation corresponding to the user voice instruction or generating a response voice.
In a possible implementation of the above second aspect, the above method further includes: the operation includes one or more of sending an instruction to a smart home device, opening application software, searching a web page, making a call, and sending and receiving short messages.
For example, for the corpus to be parsed obtained by converting a user voice instruction through a smartphone, the parsed intents are booking a ticket and booking a hotel, and the slots corresponding to these two intents are departure, destination, (hotel) location and (hotel) star rating; then the operation performed by the smartphone may be to open ticket and hotel reservation software, query the ticket information corresponding to the departure and destination for the user to choose, and recommend a five-star hotel in a certain location for the user to select. The smart home device may include, but is not limited to, a laptop computer, a desktop computer, a tablet computer, a smartphone, a wearable device, a portable music player, a reader device, or other electronic devices capable of accessing a network.
In a third aspect, an embodiment of the present application provides a man-machine dialogue system, the system including: a speech recognition module, configured to convert a user voice instruction into corpus data in text form; a semantic parsing module, configured to execute the above semantic parsing method; a problem solving module, configured to find a solution for the result obtained by the semantic parsing module; a language generation module, configured to generate a natural language sentence corresponding to the solution; a speech synthesis module, configured to synthesize the natural language sentence into a response voice; and a dialogue management module, configured to schedule the speech recognition module, the semantic parsing module, the problem solving module, the language generation module and the speech synthesis module to cooperate with each other to realize man-machine dialogue.
In a fourth aspect, an embodiment of the present application provides a readable medium, where instructions are stored on the readable medium, and when executed on an electronic device, the instructions cause the electronic device to execute the above semantic parsing method or the above man-machine dialogue method.
In a fifth aspect, an embodiment of the present application provides an electronic device, including: a memory, configured to store instructions executed by one or more processors of the electronic device; and a processor, which is one of the processors of the electronic device and is configured to execute the above semantic parsing method or the above man-machine dialogue method.
Description of the Drawings
FIG. 1 is a schematic software block diagram of a common man-machine dialogue system;
FIG. 2 is a schematic diagram of a man-machine dialogue scenario to which an embodiment of the present application is applicable;
FIG. 3 is a schematic diagram of an exemplary structure of a semantic parsing model in an embodiment of the present application;
FIG. 4 is a schematic diagram of processing results of corpus data at different stages in the semantic parsing method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of the training process of the semantic parsing model in the semantic parsing method according to an embodiment of the present application;
FIG. 6 is a schematic diagram of the interaction flow between a mobile phone 100 and a user according to an embodiment of the present application;
FIG. 7 is a schematic interface diagram of the mobile phone 100 performing corresponding operations according to a user voice instruction according to an embodiment of the present application;
FIG. 8 is an exemplary structural diagram of a mobile phone 100 according to an embodiment of the present application.
具体实施方式detailed description
本申请的说明性实施例包括但不限于电子设备及其语义解析方法和介质。Illustrative embodiments of the present application include, but are not limited to, electronic devices and semantic parsing methods and media thereof.
如上所述,现有技术在处理多意图语句时,存在无法识别出用户语音中的多个意图,从而在通过意图识别结果进行槽位填充时槽位填充结果出错率高的问题。为了解决该问题,本申请实施例首先从用户语音中识别出接近用户真实意图的多个意图,然后采用识别出的多个意图预测槽位信息,从而提高槽位填充的准确性,也相应的提高了槽位填充的速度或效率,进而提高人机对话中语义解析的准确度。As mentioned above, when processing multi-intent sentences in the prior art, there is a problem that multiple intentions in the user's voice cannot be recognized, so that the error rate of the slot filling result is high when the slot filling is performed based on the intention recognition result. In order to solve this problem, the embodiment of the present application first identifies multiple intents close to the user's true intent from the user's voice, and then uses the identified multiple intents to predict slot information, thereby improving the accuracy of slot filling, and correspondingly The speed or efficiency of slot filling is improved, thereby improving the accuracy of semantic parsing in human-machine dialogue.
为了方便清楚的理解本申请实施例,下面对本申请实施例中可能涉及中技术术语以及神经网络的相关术语做简要介绍。In order to facilitate a clear understanding of the embodiments of the present application, the following briefly introduces technical terms that may be involved in the embodiments of the present application and related terms of neural networks.
(1)自然语言处理(natural language processing,NLP)(1) Natural language processing (NLP)
自然语言(natural language)即人类语言,自然语言处理(NLP)就是对人类语言的处理。自然语言处理是以一种智能与高效的方式,对文本数据进行系统化分析、理解与信息提取的过程。通过使用NLP及其组件,我们可以管理非常大块的文本数据,或者执行大量的自动化任务,并且解决各式各样的问题,如自动摘要(automatic summarization),机器翻译(machine translation,MT),命名实体识别(named entity recognition,NER),关系提取(relation extraction,RE),信息抽取(information extraction,IE),情感分析,语音识别(speech recognition),问答系统(question answering)以及主题分割等等。示例性的,自然语言处理任务可以有以下几类。Natural language is human language, and natural language processing (NLP) is the processing of human language. Natural language processing is the process of systematically analyzing, understanding, and extracting information from text data in an intelligent and efficient manner. By using NLP and its components, we can manage very large chunks of textual data, or perform a large number of automated tasks, and solve a wide variety of problems, such as automatic summarization, machine translation (MT), Named Entity Recognition (NER), Relation Extraction (RE), Information Extraction (IE), Sentiment Analysis, Speech Recognition, Question Answering, Topic Segmentation, etc. . Exemplarily, natural language processing tasks can fall into the following categories.
序列标注:句子中每一个单词要求模型根据上下文给出一个分类类别。如中文分词、词性标注、命名实体识别、语义角色标注。Sequence tagging: Each word in a sentence requires the model to give a categorical category based on the context. Such as Chinese word segmentation, part-of-speech tagging, named entity recognition, semantic role tagging.
分类任务:整个句子输出一个分类值,如文本分类。Classification tasks: output a classification value for the entire sentence, such as text classification.
句子关系推断:给定两个句子,判断这两个句子是否具备某种名义关系。例如entilment、QA、语义改写、自然语言推断。Sentence relationship inference: Given two sentences, determine whether the two sentences have a nominal relationship. For example, enlightenment, QA, semantic rewriting, natural language inference.
生成式任务:输出一段文本,生成另一段文本。如机器翻译、文本摘要、写诗造句、看图说话。Generative task: output a piece of text, generate another piece of text. Such as machine translation, text summarization, writing poems and sentences, looking at pictures and talking.
(2)意图(intent):用户输入的语音指令都对应着用户的意图,可以理解,所谓意图就是用户的意愿表达,在人机对话系统中,意图一般是以“动词+名词”命名,例如查询天气、预定酒店等。意图识别,又称意图分类,主要是根据用户输入的语音指令提取与本次语音指令对应的意图。意图是一句或多句表达形式的集合,例如“我要看电影”和“我想看某年某位明星拍摄的动作电影”可以属于同一个播放视频的意图。一个意图下可以配置有一个或多个槽位。(2) Intent: The voice commands input by the user all correspond to the user's intention. It is understandable that the so-called intention is the expression of the user's will. In the human-machine dialogue system, the intention is generally named after "verb + noun", for example Check the weather, book hotels, etc. Intent recognition, also known as intent classification, mainly extracts the intent corresponding to the current voice command according to the voice command input by the user. An intent is a collection of one or more expressions, such as "I want to watch a movie" and "I want to see an action movie made by a certain star in a certain year" can belong to the same intent to play a video. An intent can be configured with one or more slots.
(3)槽位(slot)是用来表达用户意图的关键信息,槽位填充的准确度直接影响到电子设备能否匹配正确的意图。一个槽位对应着一类属性的关键词,该槽位中信息可以由同一类型的关键词进行填充,即槽位填充。例如,与歌曲播放这一意图对应的查询句式可以为“我想听{singer}的{song}”。其中,{singer}为歌手的槽位,{song}为歌曲的槽位。那么,如果接收到用户输入“我想听王菲的红豆”这一语音指令,则电子设备(或服务器)可从该语音指令中提取到{singer}这一槽位中填充的槽位信息为:王菲,{song}这一槽位中填充的槽位信息为:红豆。这样,电子设备(或服务器)可根据这两个槽位信息识别出本次语音输入的用户意图为:播放王菲的歌曲红豆。(3) The slot is the key information used to express the user's intention, and the accuracy of the slot filling directly affects whether the electronic device can match the correct intention. A slot corresponds to a keyword of a type of attribute, and the information in the slot can be filled with keywords of the same type, that is, slot filling. For example, the query pattern corresponding to the intent to play a song could be "I want to hear {song} of {singer}". Among them, {singer} is the singer's slot, and {song} is the song's slot. Then, if the voice command "I want to listen to Faye Wong's red beans" is received from the user, the electronic device (or server) can extract the slot information filled in the {singer} slot from the voice command as: Faye Wong, the slot information filled in the slot {song} is: red beans. In this way, the electronic device (or server) can identify, according to the two slot information, that the user's intention of this voice input is: to play Faye Wong's song Red Bean.
可以理解,本申请的语义解析方法适用于各种需要进行语义解析的场景,例如,用户向智能电子设备发出语音指令、用户与智能电子设备的语音助手进行人机对话等。为了便于说明,下文以人机对话系统为基础介绍本申请的语义解析方案。It can be understood that the semantic parsing method of the present application is suitable for various scenarios requiring semantic parsing, for example, a user sends a voice command to an intelligent electronic device, and a user conducts a man-machine dialogue with a voice assistant of the intelligent electronic device. For the convenience of description, the following introduces the semantic parsing solution of the present application based on the human-machine dialogue system.
目前，如图1所示，常见的人机对话系统110主要包括如下6个技术模块：语音识别模块111；语义解析模块112；问题求解模块113；语言生成模块114；对话管理模块115；语音合成模块116。其中，At present, as shown in FIG. 1, a common human-machine dialogue system 110 mainly includes the following six technical modules: a speech recognition module 111, a semantic parsing module 112, a problem solving module 113, a language generation module 114, a dialogue management module 115, and a speech synthesis module 116. Among them,
语音识别模块111,用于通过语音识别技术(Automatic Speech Recognition,ASR)实现语音到文本的识别转换,识别结果一般以得分最高的前n(n≥1)个句子或词格(word lattice)形式输出语料数据。The speech recognition module 111 is used to realize speech-to-text recognition and conversion through speech recognition technology (Automatic Speech Recognition, ASR). The recognition result is generally in the form of the top n (n≥1) sentences or word lattices with the highest scores. Output corpus data.
语义解析模块112，也被称为自然语言理解（Natural Language Understanding，NLU）模块，主要用于执行自然语言处理（natural language processing，NLP）任务，包括对语音识别模块输出的语料数据进行语义解析，识别用户表达的意图（intent）及相应的槽位（slot）。在本申请的实施例中，语义解析模块的功能通过预训练的语义解析模型121实现，关于语义解析模型121将在下文详细描述，此处不再赘述。The semantic parsing module 112, also called the natural language understanding (NLU) module, is mainly used to perform natural language processing (NLP) tasks, including performing semantic parsing on the corpus data output by the speech recognition module and identifying the intent expressed by the user and the corresponding slots. In the embodiments of the present application, the function of the semantic parsing module is implemented by a pre-trained semantic parsing model 121, which will be described in detail below and is not repeated here.
问题求解模块113,主要用于根据语义解析识别的意图及相应槽位进行推理或查询,以向用户反馈 对应其意图及相应槽位的解决方案。The problem solving module 113 is mainly used for reasoning or querying according to the intention identified by the semantic analysis and the corresponding slot, so as to feed back the solution corresponding to the intention and the corresponding slot to the user.
语言生成模块114，主要是对问题求解模块113找到的需要向用户输出的解决方案生成自然语言句子，以文本或进一步转化成语音反馈给用户。The language generation module 114 mainly generates natural language sentences for the solution that is found by the problem solving module 113 and needs to be output to the user, and feeds the sentences back to the user as text or after further conversion into speech.
对话管理模块115，是人机对话系统中的中心枢纽，用于基于对话历史调度人机交互系统中其他模块的相互配合，辅助语义解析模块对语音识别的结果进行正确的理解，为问题求解模块提供帮助，并指导语言生成模块的自然语言生成过程。The dialogue management module 115 is the central hub of the human-machine dialogue system. Based on the dialogue history, it schedules the cooperation of the other modules in the human-computer interaction system, assists the semantic parsing module in correctly understanding the speech recognition results, provides support for the problem solving module, and guides the natural language generation process of the language generation module.
语音合成模块116,用于对语言生成模块生成的自然语言句子转化成语音输出。The speech synthesis module 116 is used for converting the natural language sentences generated by the language generation module into speech output.
为使本申请的目的、技术方案和优点更加清楚,下面通过结合附图和实施方案,对本申请实施例的技术方案做进一步地详细描述。In order to make the objectives, technical solutions and advantages of the present application clearer, the technical solutions of the embodiments of the present application will be further described in detail below with reference to the accompanying drawings and embodiments.
图2根据本申请实施例示出了一种人机对话场景的示意图。FIG. 2 shows a schematic diagram of a man-machine dialogue scene according to an embodiment of the present application.
具体地,如图2所示,该应用场景包括电子设备100和电子设备200。其中电子设备100为与用户进行交互的终端智能设备,其上安装有能够进行语义解析的应用系统,例如上述的人机对话系统110。电子设备100可以通过人机对话系统110识别用户的语音指令,并根据语音指令执行相应的操作或者回答用户提出的问题。可以理解,在本申请中,电子设备100可以包括但不限于智能音箱、智能手机、可穿戴设备、头戴式显示器、车载智能语音导航等车载智能系统以及智能机器人、便携式音乐播放器、阅读器设备及其他安装有人机对话系统或者其他语音识别应用程序的电子设备。Specifically, as shown in FIG. 2 , the application scenario includes the electronic device 100 and the electronic device 200 . The electronic device 100 is a terminal intelligent device that interacts with a user, and an application system capable of semantic analysis, such as the above-mentioned human-machine dialogue system 110 , is installed thereon. The electronic device 100 can recognize the user's voice command through the man-machine dialogue system 110, and perform corresponding operations according to the voice command or answer the questions raised by the user. It can be understood that in this application, the electronic device 100 may include, but is not limited to, smart speakers, smart phones, wearable devices, head-mounted displays, in-vehicle intelligent systems such as in-vehicle intelligent voice navigation, as well as intelligent robots, portable music players, and readers. Equipment and other electronic equipment with man-machine dialogue systems or other speech recognition applications installed.
电子设备200可以用于训练语义解析模型121,将训练出的语义解析模型121移植到电子设备100,以供电子设备100进行语义解析并执行相应的操作。此外,电子设备200也可以通过训练出的语义解析模型121对电子设备100发送过来的语料数据进行语义解析并将结果反馈给电子设备100,电子设备100进一步执行相应的操作。The electronic device 200 can be used to train the semantic parsing model 121 , and transplant the trained semantic parsing model 121 to the electronic device 100 for the electronic device 100 to perform semantic parsing and perform corresponding operations. In addition, the electronic device 200 can also perform semantic parsing on the corpus data sent by the electronic device 100 through the trained semantic parsing model 121, and feed the result back to the electronic device 100, and the electronic device 100 further performs corresponding operations.
可以理解,电子设备200可以包括但不限于云端、服务器、膝上型计算机、台式计算机、平板计算机、以及其中嵌入或耦接有一个或多个处理器的能够访问网络的其他电子设备。It will be appreciated that electronic device 200 may include, but is not limited to, clouds, servers, laptops, desktops, tablet computers, and other electronic devices capable of accessing a network with one or more processors embedded or coupled therein.
为了便于说明,下文以电子设备100为手机、电子设备200为服务器为例,详细说明本申请的技术方案。其中,手机100安装有上述人机对话系统110,人机对话系统110中的语义解析模块112具有语义解析模型121,该语义解析模型121能够基于本申请的技术方案对用户语音进行语义解析。For convenience of description, the technical solutions of the present application are described in detail below by taking the electronic device 100 as a mobile phone and the electronic device 200 as a server as an example. The mobile phone 100 is installed with the human-machine dialogue system 110, and the semantic analysis module 112 in the human-machine dialogue system 110 has a semantic analysis model 121, which can perform semantic analysis on user speech based on the technical solution of the present application.
下面详细介绍本申请的语义解析模型121。The semantic parsing model 121 of the present application will be described in detail below.
语义解析模型121是由服务器200基于自然语言处理及上述各种神经网络结构和模型预训练出来的一个自然语言处理模型。预训练的语义解析模型121能够提取单条语料数据中的多个意图，并基于多个意图来预测槽位，从而准确地识别语料数据中的意图及相应的槽位，能够大大提高槽位填充的准确性。The semantic parsing model 121 is a natural language processing model pre-trained by the server 200 based on natural language processing and the various neural network structures and models described above. The pre-trained semantic parsing model 121 can extract multiple intents from a single piece of corpus data and predict slots based on the multiple intents, so that the intents and the corresponding slots in the corpus data are identified accurately, which can greatly improve the accuracy of slot filling.
数据预处理data preprocessing
输入语义解析模型121的数据，是语料数据经过预处理后得到的数据，其中，语料数据是用户语音指令经过识别转化后得到的。对语料数据的预处理，是人机对话系统110中理解文本的常规操作，是通过语义解析模块112执行的自然语言处理任务之一，例如，预处理一般包括对语料数据进行分词处理、填充标记（Token）序列及断句标记（Segmentation）以及创建掩码。数据预处理最终得到包含句子文本字符和句子中每个字文本字符的Token序列、代表每个字对应的句子位置的断句标记以及对应表示Token序列中各个字符位置上是否为有效字符的掩码。The data input into the semantic parsing model 121 is obtained by preprocessing the corpus data, where the corpus data is obtained by recognizing and converting the user's voice instruction. Preprocessing the corpus data is a routine operation for understanding text in the human-machine dialogue system 110 and is one of the natural language processing tasks performed by the semantic parsing module 112. For example, the preprocessing generally includes performing word segmentation on the corpus data, filling a Token sequence and segmentation marks (Segmentation), and creating a mask. The data preprocessing finally obtains a Token sequence containing a sentence character and a character for each word of the sentence, segmentation marks representing the sentence position corresponding to each word, and a mask indicating whether each character position in the Token sequence holds a valid character.
其中，分词处理，主要是利用分词工具（例如中文词汇表）将语料数据划分成句子和组成句子的单个字，并对得到的句子打上所有可能的意图标签、对组成句子的每个字打上所有可能的槽位标签。分词处理的目的是为下一步填充Token序列做数据准备。Among them, the word segmentation processing mainly uses a word segmentation tool (such as a Chinese vocabulary) to divide the corpus data into sentences and the individual characters that make up the sentences, labels the obtained sentences with all possible intent labels, and labels each character of the sentences with all possible slot labels. The purpose of the word segmentation processing is to prepare data for filling the Token sequence in the next step.
例如,如图4所示,对语音指令转化得到的语料数据“请为我播放你好旧时光”经过分词处理后得到:For example, as shown in Figure 4, the corpus data "please play hello old times for me" converted from the voice command is obtained after word segmentation:
3个可能的意图标签:PLAY_MUSIC、PLAY_VIDEO、PLAY_VOICE;3 possible intent tags: PLAY_MUSIC, PLAY_VIDEO, PLAY_VOICE;
一个句子:请为我播放你好旧时光;A sentence: please play hello old times for me;
组成句子的10个字:请、为、我、播、放、你、好、旧、时、光;10 words that make up a sentence: please, for, me, play, play, you, good, old, time, light;
其中,为每个字打上的槽位标签分别是:Among them, the slot labels marked for each word are:
请、为、我、播、放这五个词分别对应的槽位标签都是O;The slot labels corresponding to the five words please, for, me, play, and play are all O;
你对应3个槽位分别是songName-B、videoName-B和mediaName-B;Your corresponding 3 slots are songName-B, videoName-B and mediaName-B;
好、旧、时、光这四个字分别对应的3个槽位标签是songName-I、videoName-I、mediaName-I。The four words 好, 旧, 时 and 光 each correspond to the three slot labels songName-I, videoName-I and mediaName-I.
填充Token序列，主要是利用分词处理得到的数据通过对句子截断或者填充字符的方式得到符合字符长度要求的Token序列。通常，Token序列中每个句子的最大字符长度要求是maxLength=32，若分词得到的句子字符长度+2大于maxLength，则需要对句子进行截断；若分词得到的句子字符长度+2小于maxLength，则需要在句子结尾处填充空白字符<pad>使句子字符长度+2达到maxLength。Token序列中包含对应语音指令的整句话的句子字符，以及对应句子中每个字的词字符。Filling the Token sequence mainly uses the data obtained by word segmentation to produce a Token sequence that meets the character length requirement, by truncating the sentence or padding characters. Usually, the maximum character length requirement for each sentence in the Token sequence is maxLength=32. If the character length of the sentence obtained by word segmentation + 2 is greater than maxLength, the sentence needs to be truncated; if the character length of the sentence obtained by word segmentation + 2 is less than maxLength, the blank character <pad> needs to be padded at the end of the sentence so that the sentence character length + 2 reaches maxLength. The Token sequence contains the sentence character corresponding to the whole sentence of the voice instruction and the word characters corresponding to each word in the sentence.
其中，字符长度计算时+2主要是因为Token序列中的第一个字符一般是<CLS>，它标记的是分词得到的句子（例如，字符<CLS>标记的是句子：请为我播放你好旧时光），Token序列中的结尾字符一般是截断字符<SEP>，<SEP>表示其前面的句子是符合单句字符长度要求的完整句子，字符<CLS>与<SEP>之间的每个字上都打上断句标记“句子1”，表明这些字都是组成句子1的字。如果一个Token序列中有两个<SEP>，则表明第一个<SEP>到前面<CLS>之间是第一个句子，两个<SEP>之间是第二个句子，每个句子的字符长度+2均要求符合最大字符长度要求。一般地，用户指令中包含的字符长度+2是在32位最大字符长度范围内。The +2 in the character length calculation is mainly because the first character in the Token sequence is generally <CLS>, which marks the sentence obtained by word segmentation (for example, the character <CLS> marks the sentence: 请为我播放你好旧时光), and the ending character in the Token sequence is generally the truncation character <SEP>, which indicates that the sentence before it is a complete sentence meeting the single-sentence character length requirement. Each word between the characters <CLS> and <SEP> carries the segmentation mark "sentence 1", indicating that these words make up sentence 1. If there are two <SEP> characters in a Token sequence, the part between the preceding <CLS> and the first <SEP> is the first sentence, the part between the two <SEP> characters is the second sentence, and the character length + 2 of each sentence is required to meet the maximum character length requirement. Generally, the character length + 2 of a user instruction is within the maximum character length of 32.
创建掩码,主要是对上述填充得到的Token序列中的每个字符对应创建一个掩码(Mask)。创建掩码的目的是将Token序列中每个字符是否表达有效信息标记计算机可读的标记码。其中,Token序列中字符<pad>对应创建的掩码元素值为0,非字符<pad>的字符对应创建的掩码元素值为1。Creating a mask is mainly to create a mask (Mask) corresponding to each character in the Token sequence obtained by the above filling. The purpose of creating a mask is to mark whether each character in the Token sequence expresses valid information into a computer-readable marking code. Among them, the value of the created mask element corresponding to the character <pad> in the Token sequence is 0, and the value of the created mask element corresponding to the character other than the character <pad> is 1.
如图4所示，对语料数据进行数据预处理后主要得到三个数据，即Token序列、断句标记以及对应Token序列生成的掩码，对于图4所示的语料数据，上述三个数据分别为：As shown in Figure 4, after the data preprocessing is performed on the corpus data, three pieces of data are mainly obtained, namely the Token sequence, the segmentation marks, and the mask generated for the Token sequence. For the corpus data shown in Figure 4, the three pieces of data are respectively:
Token序列:<CLS>请为我播放你好旧时光<pad>…<pad><SEP>;Token sequence: <CLS> Please play hello old times for me <pad>…<pad><SEP>;
断句标记:句子1(请、为、我、播、放、你、好、旧、时、光);Segmentation mark: Sentence 1 (please, for, me, play, play, you, good, old, time, light);
掩码:{11111111111000000000000000000001}。Mask: {11111111111000000000000000000001}.
又例如，如果用户输入的语音指令识别得到的语料数据是“帮我预定从上海到北京的火车票并预定北京火车站附近的五星级酒店”，经过上述数据预处理后得到的三个数据分别为：For another example, if the corpus data obtained by recognizing the voice instruction input by the user is "帮我预定从上海到北京的火车票并预定北京火车站附近的五星级酒店" (help me book a train ticket from Shanghai to Beijing and book a five-star hotel near Beijing Railway Station), the three pieces of data obtained after the above data preprocessing are respectively:
Token序列:<CLS>帮我预定从上海到北京的火车票并预定北京火车站附近的五星级酒店<SEP>;Token sequence: <CLS> help me book train tickets from Shanghai to Beijing and book a five-star hotel near Beijing Railway Station <SEP>;
断句标记:句子1(帮、我、预、定、从、上、海、到、北、京、的、火、车、票、并、预、定、北、京、火、车、站、附、近、的、五、星、级、酒、店);Segmentation mark: Sentence 1 (help, me, pre-determined, from, Shanghai, sea, to, north, Beijing, de, fire, train, ticket, and, pre, fixed, north, Beijing, train, train, station, Near, near, of, five, star, grade, hotel, shop);
掩码:{11111111111111111111111111111111}。Mask: {11111111111111111111111111111111}.
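For readability, the preprocessing described above (character-level word segmentation, filling the Token sequence to maxLength=32 with <CLS>, <SEP> and <pad>, assigning the segmentation mark "sentence 1", and creating the mask) can be sketched with the following illustrative Python code. This sketch is not part of the embodiments; the function name preprocess and the simplification that every position is marked as sentence 1 are assumptions made only for illustration.

    # Illustrative sketch of the data preprocessing: character-level segmentation,
    # Token sequence of maxLength=32, segmentation marks and mask.
    MAX_LENGTH = 32

    def preprocess(corpus: str):
        chars = list(corpus)                         # split the corpus into single characters
        if len(chars) + 2 > MAX_LENGTH:              # sentence length + 2 (<CLS> and <SEP>)
            chars = chars[:MAX_LENGTH - 2]           # truncate over-long sentences
        tokens = ["<CLS>"] + chars
        tokens += ["<pad>"] * (MAX_LENGTH - 1 - len(tokens))   # pad up to position 31
        tokens += ["<SEP>"]                          # ending truncation character
        segment_ids = [1] * MAX_LENGTH               # simplification: all positions marked as sentence 1
        mask = [0 if t == "<pad>" else 1 for t in tokens]       # 1 = valid character, 0 = padding
        return tokens, segment_ids, mask

    tokens, segment_ids, mask = preprocess("请为我播放你好旧时光")
    print("".join(str(m) for m in mask))
    # -> 11111111111000000000000000000001  (matches the mask given above)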
将语料数据经过上述数据预处理后得到的三个数据可以输入语义解析模型121中进行语义解析。下面将详细介绍语义解析模型121。The three data obtained after the above-mentioned data preprocessing of the corpus data can be input into the semantic parsing model 121 for semantic parsing. The semantic parsing model 121 will be described in detail below.
语义解析模型121Semantic Parsing Model 121
具体地,如图3所示,该语义解析模型121包括BERT编码层1211、意图分类层1212、注意力层 1213、槽位填充层1214以及后置处理层1215。Specifically, as shown in FIG. 3 , the semantic parsing model 121 includes a BERT encoding layer 1211, an intent classification layer 1212, an attention layer 1213, a slot filling layer 1214, and a post-processing layer 1215.
1)BERT编码层12111) BERT coding layer 1211
BERT编码层1211以语料数据经过数据预处理后得到的Token序列、断句标记以及掩码作为输入，编码后输出编码向量序列。其中，编码向量序列包括句向量和词向量，句向量表示待解析语料数据的语义信息，词向量包含待解析语料数据中每个字的词义信息。可以理解，语义信息和词义信息是对语料数据基于自然语言理解的意思表达，这些语义信息和词义信息能够表达用户的真实意图以及对应用户真实意图的真实槽位。The BERT encoding layer 1211 takes as input the Token sequence, segmentation marks and mask obtained after the data preprocessing of the corpus data, and outputs an encoded vector sequence after encoding. The encoded vector sequence includes a sentence vector and word vectors; the sentence vector represents the semantic information of the corpus data to be parsed, and the word vectors contain the word meaning information of each word in the corpus data to be parsed. It can be understood that the semantic information and the word meaning information are meaning expressions of the corpus data based on natural language understanding, and can express the real intent of the user and the real slots corresponding to that real intent.
例如，如图4所示，如果待解析语料数据是“请为我播放你好旧时光”，那么BERT编码层1211输出的编码向量序列{h_0,h_1,h_2,……,h_t}中，句向量h_0表示的语义信息可能包括PLAY_MUSIC、PLAY_VIDEO、PLAY_VOICE、你好、旧时光、你好旧时光等。词向量h_1,h_2,……,h_t表示的词义信息可能包括songName、videoName、mediaName以及组成句子的每个字的字面含义，其中h_1对应的字是请、h_2对应的字是为、h_3对应的字是我，……，h_10对应的字是光。For example, as shown in Figure 4, if the corpus data to be parsed is "请为我播放你好旧时光", then in the encoded vector sequence {h_0, h_1, h_2, ..., h_t} output by the BERT encoding layer 1211, the semantic information represented by the sentence vector h_0 may include PLAY_MUSIC, PLAY_VIDEO, PLAY_VOICE, 你好, 旧时光, 你好旧时光, and so on. The word meaning information represented by the word vectors h_1, h_2, ..., h_t may include songName, videoName, mediaName and the literal meaning of each word of the sentence, where the word corresponding to h_1 is 请, the word corresponding to h_2 is 为, the word corresponding to h_3 is 我, ..., and the word corresponding to h_10 is 光.
又例如，如果待解析的语料数据是“帮我预定从上海到北京的火车票并预定北京火车站附近的五星级酒店”，那么BERT编码层1211输出的编码向量序列{h_0,h_1,h_2,……,h_t}中，句向量h_0表示的语义信息可能包括订车票、订酒店、出发地、目的地、上海、北京、酒店、星级、五星级等。词向量h_1,h_2,……,h_t表示的词义信息可能包括出发地、目的地、上海、北京、酒店、星级、五星级以及组成句子的每个字的字面含义，其中h_1对应的字是帮、h_2对应的字是我、h_3对应的字是预、……、h_30对应的字是店。For another example, if the corpus data to be parsed is "帮我预定从上海到北京的火车票并预定北京火车站附近的五星级酒店", then in the encoded vector sequence {h_0, h_1, h_2, ..., h_t} output by the BERT encoding layer 1211, the semantic information represented by the sentence vector h_0 may include booking a train ticket, booking a hotel, departure place, destination, Shanghai, Beijing, hotel, star rating, five-star, and so on. The word meaning information represented by the word vectors h_1, h_2, ..., h_t may include departure place, destination, Shanghai, Beijing, hotel, star rating, five-star and the literal meaning of each word of the sentence, where the word corresponding to h_1 is 帮, the word corresponding to h_2 is 我, the word corresponding to h_3 is 预, ..., and the word corresponding to h_30 is 店.
具体地,BERT编码层1211的工作过程如图3所示:Specifically, the working process of the BERT coding layer 1211 is shown in Figure 3:
经过数据预处理得到的Token序列、断句标记以及对应Token序列生成的掩码将作为BERT编码层1211的输入。The Token sequence, segmented tag and the mask generated by the corresponding Token sequence obtained after data preprocessing will be used as the input of the BERT coding layer 1211 .
BERT编码层1211通过识别掩码元素值，依次识别Token序列中的有效字符<CLS>、x_1、x_2、……、x_{t-1}、<SEP>和空白字符（无效字符）（有效字符的掩码元素值为1，空白字符的掩码元素值为0）。其中，t表示对句子中的每个字处理的时刻，下称时间步t或t时刻，例如，t=1时刻处理x_1字符对应的字，t=2时刻处理x_2字符对应的字。The BERT encoding layer 1211 identifies, by reading the mask element values, the valid characters <CLS>, x_1, x_2, ..., x_{t-1}, <SEP> and the blank (invalid) characters in the Token sequence in sequence (the mask element value of a valid character is 1, and that of a blank character is 0). Here, t denotes the moment at which each word in the sentence is processed, hereinafter called time step t or time t; for example, the word corresponding to the character x_1 is processed at time t=1, and the word corresponding to the character x_2 is processed at time t=2.
Token序列中标记句子的字符<CLS>输入训练后的BERT编码层1211进行语义编码后,对字符<CLS>赋予语料数据的语义信息,生成一个高维的句向量h 0The character <CLS> that marks the sentence in the Token sequence is input into the trained BERT coding layer 1211 for semantic encoding, and then the character <CLS> is given semantic information of the corpus data to generate a high-dimensional sentence vector h 0 .
Token序列中字符<CLS>与截断字符<SEP>之间的字符x 1,x 2,……,x t-1对应语料数据中组成句子的每个字,字符x 1,x 2,……,x t-1输入训练后的BERT编码层1211进行语义编码后,对字符x 1,x 2,……,x t-1赋予语料数据的词义信息,对应生成高维的词向量h 1,h 2,……,h tThe characters x 1 , x 2 ,...,x t-1 between the character <CLS> and the truncated character <SEP> in the Token sequence correspond to each word that constitutes the sentence in the corpus data, and the characters x 1 , x 2 ,... , x t-1 is input into the trained BERT coding layer 1211 for semantic encoding, and then assigns the semantic information of the corpus data to the characters x 1 , x 2 ,..., x t-1 , correspondingly generates a high-dimensional word vector h 1 , h 2 ,...,h t .
Token序列中的空白字符<pad>对应的掩码元素值为0,未标记任何词,因此不作为BERT编码层1211的输入。The mask element value corresponding to the blank character <pad> in the Token sequence is 0, and no word is marked, so it is not used as the input of the BERT encoding layer 1211.
基于句向量h 0和词向量h 1,h 2,……,h t生成编码向量序列,作为BERT编码层1211的输出。 Based on the sentence vector h 0 and the word vectors h 1 , h 2 , .
BERT编码层1211可以基于BERT模型训练得到,具体训练过程参考下文详细描述,在此不再赘述。其中,BERT模型是一种基于微调的多层双向变换器编码器模型,BERT模型的关键技术创新是将变换器的双向培训应用于语言建模。使用BERT模型训练BERT编码层有两个阶段:预训练和微调。预训练期间BERT模型在不同的预训练任务上训练未标记的数据后,首先使用预训练参数初始化BERT模型,并使用来自下游任务的标记数据对所有参数进行微调。BERT模型的一个显著特点是它跨越不同任务的统一架构,因此其预训练架构与最终下游架构之间的差异很小。BERT模型能够进一步增加词向量模型泛化能力,充分描述字符级、词级、句子级甚至句间关系特征。The BERT coding layer 1211 can be obtained by training based on the BERT model. For the specific training process, please refer to the detailed description below, which will not be repeated here. Among them, the BERT model is a multi-layer bidirectional transformer encoder model based on fine-tuning, and the key technological innovation of the BERT model is to apply the bidirectional training of the transformer to language modeling. There are two stages to training a BERT encoding layer with a BERT model: pre-training and fine-tuning. After the BERT model is trained on unlabeled data on different pre-training tasks during pre-training, the BERT model is first initialized with pre-trained parameters, and all parameters are fine-tuned using labeled data from downstream tasks. A striking feature of the BERT model is its unified architecture across different tasks, so there is little difference between its pretrained architecture and the final downstream architecture. The BERT model can further increase the generalization ability of the word vector model, and fully describe the character-level, word-level, sentence-level and even inter-sentence relationship features.
在另一些实施例中,BERT编码层1211也可以通过其他编码器或编码模型训练得到,此处不做限制。In other embodiments, the BERT encoding layer 1211 can also be obtained by training other encoders or encoding models, which is not limited here.
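As an illustrative sketch only, the following Python code shows how the three preprocessed inputs can be fed through a BERT-style encoder to obtain the sentence vector h_0 and the word vectors h_1 ... h_t. It uses the open-source HuggingFace transformers package and the public bert-base-chinese checkpoint as stand-ins; the actual pre-trained and fine-tuned encoder of the embodiments is not specified here.

    # Sketch: obtaining h_0 and h_1...h_t from a public BERT checkpoint (stand-in).
    import torch
    from transformers import BertTokenizer, BertModel

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    encoder = BertModel.from_pretrained("bert-base-chinese")

    inputs = tokenizer("请为我播放你好旧时光",
                       padding="max_length", max_length=32, return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**inputs)      # uses input_ids, token_type_ids, attention_mask

    hidden = outputs.last_hidden_state   # shape (1, 32, 768): one vector per Token position
    h0 = hidden[:, 0, :]                 # sentence vector h_0 (at the [CLS] position)
    word_vectors = hidden[:, 1:, :]      # word vectors h_1 ... h_t for the characters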
2)意图分类层12122) Intent classification layer 1212
意图分类层1212用于预测语料数据中的候选意图,其中,该意图分类层1212可以提取出语料数据中的多个意图标签,并保留满足条件的意图标签作为候选意图输出。The intent classification layer 1212 is used to predict candidate intents in the corpus data, wherein the intent classification layer 1212 can extract multiple intent labels in the corpus data, and retain the intent labels that meet the conditions as candidate intent outputs.
具体地,意图分类层1212以上述BERT编码层1211得到的句向量h 0为输入,基于句向量h 0表示的语义信息,意图分类层1212可以提取所有可能的意图标签,并对提取的每个意图标签计算意图置信度以判断该意图标签是否满足输出条件。 Specifically, the intent classification layer 1212 takes the sentence vector h 0 obtained by the above-mentioned BERT encoding layer 1211 as input, and based on the semantic information represented by the sentence vector h 0 , the intent classification layer 1212 can extract all possible intent labels, and for each extracted The intent label calculates the intent confidence to judge whether the intent label satisfies the output condition.
可以理解,此处,意图置信度表示提取的意图标签与语料数据表达的真实意图的接近程度,也可以称为意图可靠度。意图置信度越高的意图与语料数据表达的真实意图越接近。在意图分类层1212中,可以对意图置信度设定某一阈值,例如,设置意图置信度的阈值为0.5,意图置信度大于或等于阈值的意图标签满足输出条件,对应的意图标签将会输出作为候选意图;意图置信度小于阈值的意图标签不满足输出条件,其对应的意图标签将会被删除,不会从意图分类层1212输出。It can be understood that, here, the intent confidence represents the closeness of the extracted intent label to the real intent expressed by the corpus data, and may also be referred to as intent reliability. The intent with higher intent confidence is closer to the real intent expressed by the corpus data. In the intent classification layer 1212, a certain threshold can be set for the intent confidence, for example, the threshold of the intent confidence is set to 0.5, and the intent label whose intent confidence is greater than or equal to the threshold satisfies the output condition, and the corresponding intent label will be output As a candidate intent; an intent label whose intent confidence is less than the threshold does not meet the output conditions, and its corresponding intent label will be deleted and will not be output from the intent classification layer 1212 .
例如,如图4所示,如果待解析的语料数据是“请为我播放你好旧时光”,则通过BERT编码层输出的句向量h 0表示的语义信息中可能包括3个可能的意图标签:PLAY_MUSIC、PLAY_VIDEO、PLAY_VOICE。该句向量h 0输入意图分类层1212后,意图分类层1212提取上述3个可能的意图标签,并计算每个意图标签的意图置信度分别为0.8,0.75,0.5,假如在意图分类层1212设置的意图置信度阈值为0.5,那么上述3个意图标签的意图置信度均符合大于等于0.5的条件,即上述3个意图标签满足输出条件,最终意图分类层1212输出3个候选意图:PLAY_MUSIC、PLAY_VIDEO、PLAY_VOICE。 For example, as shown in Figure 4, if the corpus data to be parsed is "please play hello old times for me", the semantic information represented by the sentence vector h 0 output by the BERT coding layer may include 3 possible intent labels : PLAY_MUSIC, PLAY_VIDEO, PLAY_VOICE. After the sentence vector h0 is input into the intent classification layer 1212, the intent classification layer 1212 extracts the above three possible intent labels, and calculates the intent confidence of each intent label as 0.8, 0.75, and 0.5, respectively. If the intent classification layer 1212 sets the The intent confidence threshold is 0.5, then the intent confidence of the above three intent labels all meet the condition of being greater than or equal to 0.5, that is, the above three intent labels satisfy the output condition, and the final intent classification layer 1212 outputs three candidate intents: PLAY_MUSIC, PLAY_VIDEO , PLAY_VOICE.
又例如,如果待解析的语料数据是“帮我预定从上海到北京的火车票并预定北京火车站附近的五星级酒店”,则通过BERT编码层输出的句向量h 0表示的语义信息中可能包括4个可能的意图标签查车次、订车票、找酒店、订酒店。该句向量h 0输入意图分类层1212后,意图分类层1212提取上述4个可能的意图标签,并计算每个意图标签的意图置信度分别为0.48,0.87,0.45,0.7,假如在意图分类层1212设置的意图置信度阈值为0.5,那么上述4个意图标签中对应的意图置信度大于或等于0.5的意图标签为订车票和订酒店,满足输出条件,那么意图分类层1212输出2个候选意图:订车票、订酒店。而意图置信度小于0.5的两个意图标签:查车次、找酒店不满足输出条件,则不会从意图分类层1212中输出。 For another example, if the corpus data to be parsed is "help me book a train ticket from Shanghai to Beijing and book a five-star hotel near Beijing Railway Station", the semantic information represented by the sentence vector h 0 output by the BERT coding layer May include 4 possible intent labels to check train times, book tickets, find hotels, and book hotels. After the sentence vector h 0 is input into the intent classification layer 1212, the intent classification layer 1212 extracts the above four possible intent labels, and calculates the intent confidence of each intent label as 0.48, 0.87, 0.45, and 0.7, respectively. The intent confidence threshold set by 1212 is 0.5, then the intent labels with the corresponding intent confidence greater than or equal to 0.5 among the above four intent labels are ticket booking and hotel booking, which satisfy the output conditions, then the intent classification layer 1212 outputs 2 candidate intents : Book tickets, book hotels. However, the two intent labels whose intent confidence is less than 0.5: checking the number of trains and finding a hotel do not meet the output conditions, so they will not be output from the intent classification layer 1212 .
具体地,意图分类层1212的工作过程如图3所示:Specifically, the working process of the intent classification layer 1212 is shown in FIG. 3 :
意图分类层1212以BERT编码层1211输出的编码向量序列中的句向量h_0作为输入，通过解码并激活句向量h_0，提取句向量h_0表示的语义信息中所有可能的意图标签，并计算每个意图标签的意图置信度y^I。其中，意图置信度y^I经过Sigmoid激活函数后的计算公式如下：The intent classification layer 1212 takes the sentence vector h_0 in the encoded vector sequence output by the BERT encoding layer 1211 as input, extracts all possible intent labels in the semantic information represented by the sentence vector h_0 by decoding and activating the sentence vector h_0, and calculates the intent confidence y^I of each intent label. The intent confidence y^I obtained through the Sigmoid activation function is calculated as follows:
y^I = Sigmoid(W^I · h_0 + b^I)    (1)
其中，I表示意图数量，W^I为句向量h_0的随机权重系数，b^I表示偏差值。Here, I denotes the number of intents, W^I is the random weight coefficient applied to the sentence vector h_0, and b^I denotes the bias value.
意图分类层1212可以通过全连接层(dense)以及Sigmoid函数作为激活函数训练得到,具体训练过程参考下文详细描述,在此不再赘述。可以理解,在另一些实施例中,可以采用其他与全连接层具有相同功能的深度神经网络作为解码器,也可以采用其他与Sigmoid函数具有相同功能的函数作为对应的深度神经网络解码器的激活函数,此处不做限制。The intent classification layer 1212 can be obtained by training a fully connected layer (dense) and a sigmoid function as an activation function. For the specific training process, please refer to the detailed description below, which will not be repeated here. It can be understood that in other embodiments, other deep neural networks with the same function as the fully connected layer can be used as the decoder, and other functions with the same function as the Sigmoid function can also be used as the activation of the corresponding deep neural network decoder. function, there is no restriction here.
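A minimal sketch of such a multi-label intent classifier, assuming a PyTorch implementation with a 768-dimensional sentence vector and the three intent labels of the example above, might look as follows; the class name, layer sizes, label set and the random h_0 are illustrative assumptions, not the trained model of the embodiments.

    # Sketch: fully connected layer + Sigmoid over h_0, as in formula (1),
    # keeping only intent labels whose confidence reaches the 0.5 threshold.
    import torch
    import torch.nn as nn

    class IntentClassifier(nn.Module):
        def __init__(self, hidden_size=768, num_intents=3):
            super().__init__()
            self.dense = nn.Linear(hidden_size, num_intents)   # W^I, b^I of formula (1)

        def forward(self, h0):
            return torch.sigmoid(self.dense(h0))               # y^I = Sigmoid(W^I h_0 + b^I)

    intent_labels = ["PLAY_MUSIC", "PLAY_VIDEO", "PLAY_VOICE"]  # assumed label set
    classifier = IntentClassifier(num_intents=len(intent_labels))
    h0 = torch.randn(1, 768)                                    # stand-in for the sentence vector h_0
    confidence = classifier(h0)[0]                              # one confidence per intent label
    candidates = [lab for lab, c in zip(intent_labels, confidence) if c >= 0.5]
    print(candidates)     # intent labels whose confidence passes the threshold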
3)注意力层12133) Attention layer 1213
注意力层1213用于量化语料数据中每个字与句子所表达意图的相关程度，例如，可以用意图注意力向量表示，意图注意力向量也可以理解为意图上下文向量；注意力层1213还用于量化语料数据中每个字与句子所表达槽位的相关程度，例如，用槽位注意力向量表示。其中，注意力层1213输出的意图注意力向量将作为槽位填充层1214的输入以指导槽位预测，以提高槽位预测的准确性；注意力层1213输出的槽位注意力向量用作槽位计算的偏差值，以矫正槽位预测计算的偏差。The attention layer 1213 is used to quantify the degree of correlation between each word in the corpus data and the intent expressed by the sentence, which can be represented, for example, by an intent attention vector (the intent attention vector can also be understood as an intent context vector); the attention layer 1213 is also used to quantify the degree of correlation between each word in the corpus data and the slots expressed by the sentence, which can be represented, for example, by a slot attention vector. The intent attention vector output by the attention layer 1213 is used as an input of the slot filling layer 1214 to guide slot prediction and improve its accuracy; the slot attention vector output by the attention layer 1213 is used as a bias term in the slot calculation to correct the deviation of the slot prediction calculation.
具体地，注意力层1213以BERT编码层1211输出的编码向量序列为输入，基于句向量h_0表示的语义信息和词向量h_1,h_2,……,h_t表示的词义信息，注意力层输出的意图注意力向量相应地可以理解为用于量化每个词向量对应的字与句向量对应的句子表达意图的相关程度，注意力层输出的槽位注意力向量相应地可以理解为用于量化每个词向量对应的字与句向量对应的句子所表达槽位的相关程度。Specifically, the attention layer 1213 takes the encoded vector sequence output by the BERT encoding layer 1211 as input. Based on the semantic information represented by the sentence vector h_0 and the word meaning information represented by the word vectors h_1, h_2, ..., h_t, the intent attention vector output by the attention layer can be understood as quantifying the degree of correlation between the word corresponding to each word vector and the intent expressed by the sentence corresponding to the sentence vector, and the slot attention vector output by the attention layer can be understood as quantifying the degree of correlation between the word corresponding to each word vector and the slots expressed by the sentence corresponding to the sentence vector.
例如，如图4所示，如果待解析的语料数据是“请为我播放你好旧时光”，则通过BERT编码层输出的编码向量序列中，句向量h_0表示的语义信息中可能包括3个可能的意图标签：PLAY_MUSIC、PLAY_VIDEO、PLAY_VOICE，词向量h_1,h_2,……,h_t表示的词义信息可能包括songName、videoName、mediaName以及组成句子的每个字的字面含义。将上述编码向量序列输入注意力层1213后，例如，注意力层1213输出的意图注意力向量C^I（对应：请为我播放你好旧时光，播，放），其中，句子“请为我播放你好旧时光”所表达的意图可能是PLAY_MUSIC、PLAY_VIDEO、PLAY_VOICE，“播、放”与句子所表达意图的相关程度比较高，“你、好、旧、时、光、请、为、我”与句子所表达意图的相关程度较低或者不相关。而句子“请为我播放你好旧时光”所表达的槽位是songName、videoName、mediaName，那么t=1时刻，注意力层1213输出的槽位注意力向量C_1^S表示“请”与上述3个槽位的相关程度，例如相关程度为0即不相关；在t=2时刻，注意力层1213输出的槽位注意力向量C_2^S表示“为”与上述3个槽位的相关程度，例如相关程度为0即不相关；以此类推，在t=6时刻，注意力层1213输出的槽位注意力向量C_6^S表示“你”与上述3个槽位的相关程度，例如相关程度为0.9即相关程度较大；最终可以得出，“你、好、旧、时、光”与上述三个槽位的相关程度均比较高，“播、放、请、为、我”与句子所表达槽位的相关程度较低或者不相关。For example, as shown in Figure 4, if the corpus data to be parsed is "请为我播放你好旧时光", the semantic information represented by the sentence vector h_0 in the encoded vector sequence output by the BERT encoding layer may include 3 possible intent labels: PLAY_MUSIC, PLAY_VIDEO, PLAY_VOICE, and the word meaning information represented by the word vectors h_1, h_2, ..., h_t may include songName, videoName, mediaName and the literal meaning of each word of the sentence. After the encoded vector sequence is input into the attention layer 1213, the attention layer 1213 outputs, for example, the intent attention vector C^I (corresponding to: 请为我播放你好旧时光, 播, 放). The intents expressed by the sentence "请为我播放你好旧时光" may be PLAY_MUSIC, PLAY_VIDEO and PLAY_VOICE; "播, 放" are highly correlated with the intents expressed by the sentence, while "你, 好, 旧, 时, 光, 请, 为, 我" have a low or no correlation with them. The slots expressed by the sentence "请为我播放你好旧时光" are songName, videoName and mediaName. At time t=1, the slot attention vector C_1^S output by the attention layer 1213 indicates the degree of correlation between "请" and the above 3 slots, for example a correlation of 0, i.e. not correlated; at time t=2, the slot attention vector C_2^S indicates the degree of correlation between "为" and the above 3 slots, for example 0, i.e. not correlated; and so on, at time t=6, the slot attention vector C_6^S indicates the degree of correlation between "你" and the above 3 slots, for example 0.9, i.e. a high degree of correlation. In the end it can be concluded that "你, 好, 旧, 时, 光" are all highly correlated with the above three slots, while "播, 放, 请, 为, 我" have a low or no correlation with the slots expressed by the sentence.
又例如，如果待解析的语料数据是“帮我预定从上海到北京的火车票并预定北京火车站附近的五星级酒店”，那么注意力层1213输出的意图注意力向量C^I（对应：帮我预定从上海到北京的火车票并预定北京火车站附近的五星级酒店，预，定，火，车，票，酒，店），其中，句子“帮我预定从上海到北京的火车票并预定北京火车站附近的五星级酒店”所表达的意图可能是订车票、订酒店，“预、定、火、车、票、酒、店”与句子所表达意图的相关程度比较高，“上、海、北、京、火、车、站、五、星、级、帮、我”与句子所表达意图的相关程度较低或者不相关。For another example, if the corpus data to be parsed is "帮我预定从上海到北京的火车票并预定北京火车站附近的五星级酒店", the attention layer 1213 outputs the intent attention vector C^I (corresponding to: 帮我预定从上海到北京的火车票并预定北京火车站附近的五星级酒店, 预, 定, 火, 车, 票, 酒, 店). The intents expressed by the sentence may be booking a train ticket and booking a hotel; "预, 定, 火, 车, 票, 酒, 店" are highly correlated with the intents expressed by the sentence, while "上, 海, 北, 京, 火, 车, 站, 五, 星, 级, 帮, 我" have a low or no correlation with them.
而句子“帮我预定从上海到北京的火车票并预定北京火车站附近的五星级酒店”所表达的槽位可能是出发地、目的地、地点、星级，那么t=1时刻，注意力层1213输出的槽位注意力向量C_1^S表示“帮”与上述4个槽位的相关程度，例如相关程度为0即不相关；在t=2时刻，注意力层1213输出的槽位注意力向量C_2^S表示“我”与上述4个槽位的相关程度，例如相关程度为0即不相关；以此类推，在t=6时刻，注意力层1213输出的槽位注意力向量C_6^S表示“上”与上述4个槽位的相关程度，例如，“上”与槽位“出发地”的相关程度为0.9，而与其他3个槽位（目的地、地点、星级）的相关程度为0.3，则表明“上”与槽位“出发地”的相关程度较高，与其他3个槽位（目的地、地点、星级）的相关程度较低。以此类推，最终可以得出，“上，海”与槽位“出发地”的相关程度较大，“北，京”与槽位“目的地”的相关程度较大，“北，京，火，车，站”与槽位“地点”的相关程度较大，“五，星，级”与槽位“星级”的相关程度较大，而句子中其他字，例如“帮、我”与上述4个槽位的相关程度较低或者不相关。The slots expressed by the sentence "帮我预定从上海到北京的火车票并预定北京火车站附近的五星级酒店" may be departure place, destination, location and star rating. At time t=1, the slot attention vector C_1^S output by the attention layer 1213 indicates the degree of correlation between "帮" and the above 4 slots, for example a correlation of 0, i.e. not correlated; at time t=2, the slot attention vector C_2^S indicates the degree of correlation between "我" and the above 4 slots, for example 0, i.e. not correlated; and so on, at time t=6, the slot attention vector C_6^S indicates the degree of correlation between "上" and the above 4 slots: for example, if the correlation between "上" and the slot "departure place" is 0.9 while its correlation with the other 3 slots (destination, location, star rating) is 0.3, this indicates that "上" is highly correlated with the slot "departure place" and weakly correlated with the other 3 slots. By analogy, it can finally be concluded that "上, 海" are highly correlated with the slot "departure place", "北, 京" with the slot "destination", "北, 京, 火, 车, 站" with the slot "location", and "五, 星, 级" with the slot "star rating", while the other words in the sentence, such as "帮, 我", have a low or no correlation with the above 4 slots.
具体地,注意力层1213的工作过程如图3所示:Specifically, the working process of the attention layer 1213 is shown in Figure 3:
注意力层1213以BERT编码层1211输出的编码向量序列{h 0,h 1,h 2,……,h t}作为输入,注意力层1213提取句向量h 0表示的语义信息以及词向量h 1,h 2,……,h t表示的词义信息,并在每个时间步t输出隐藏状态向量,该隐藏状态向量表示相应时间步t的前一时刻(t-1时刻)之前已提取的语义信息、词义信息。其中,相对于第一时刻(t=1)的t-1时刻(0时刻)注意力层1213输出的隐藏状态向量为句向量h 0对应的语义信息,t=2时刻的上一时刻(t=1时刻)输出的隐藏状态向量为词向量h 1对应的第一个字的词义信息,其中词向量h 1对应的词义信息还包括t=0时刻句向量h 0传递过来的语义信息;t=3时刻的上一时刻(t=2时刻)输出的隐藏状态向量为词向量h 2对应的第二个字的词义信息,其中,词向量h 2对应的词义信息包括t=1时刻词向量h 1传递过来的词义信息,而词向量h 1对应的词义信息还包括t=0时刻句向量h 0传递过来的语义信息;以此类推。 The attention layer 1213 takes the encoding vector sequence {h 0 , h 1 , h 2 ,..., h t } output by the BERT encoding layer 1211 as input, and the attention layer 1213 extracts the semantic information represented by the sentence vector h 0 and the word vector h 1 , h 2 ,...,h t represents the semantic information, and outputs a hidden state vector at each time step t, which represents the extracted data before the previous moment (t-1 moment) of the corresponding time step t. Semantic information, word meaning information. Among them, the hidden state vector output by the attention layer 1213 at time t-1 (time 0) at the first time (t=1) is the semantic information corresponding to the sentence vector h 0 , and the previous time at time t=2 (t = 1 time) the output hidden state vector is the semantic information of the first word corresponding to the word vector h 1 , wherein the semantic information corresponding to the word vector h 1 also includes the semantic information transmitted by the sentence vector h 0 at time t=0; t = The hidden state vector output at the previous moment at time 3 (time t=2) is the word meaning information of the second word corresponding to the word vector h 2 , wherein the word meaning information corresponding to the word vector h 2 includes the word vector at time t=1 The word meaning information transmitted by h 1 , and the word meaning information corresponding to the word vector h 1 also includes the semantic information transmitted by the sentence vector h 0 at time t=0; and so on.
进一步地,在注意力层1213中,基于注意力机制的注意力向量计算公式如下:Further, in the attention layer 1213, the attention vector calculation formula based on the attention mechanism is as follows:
Attention = W_u · tanh(W_q · Q + W_v · V)    (2)
其中，在计算意图注意力向量C^I时，上述公式（2）中的Q表示输入注意力层1213的编码向量序列中的句向量h_0，V表示在每个时间步t输入注意力层1213的编码向量序列中的词向量h_1,h_2,……,h_t，通过上述公式（2）得到的注意力向量能够量化每个词向量与句向量的相关程度。由于句向量h_0表示的语义信息中包含所有可能的意图标签信息，因此将句向量h_0与通过上述公式（2）计算得到的注意力向量组合，得到意图注意力向量C^I，得到的意图注意力向量C^I用于量化每个词向量对应的字与句向量对应的句子表达意图的相关程度。When the intent attention vector C^I is calculated, Q in the above formula (2) denotes the sentence vector h_0 in the encoded vector sequence input into the attention layer 1213, and V denotes the word vectors h_1, h_2, ..., h_t in the encoded vector sequence input into the attention layer 1213 at each time step t; the attention vector obtained by the above formula (2) can quantify the degree of correlation between each word vector and the sentence vector. Since the semantic information represented by the sentence vector h_0 contains all possible intent label information, the sentence vector h_0 is combined with the attention vector calculated by the above formula (2) to obtain the intent attention vector C^I, which is used to quantify the degree of correlation between the word corresponding to each word vector and the intent expressed by the sentence corresponding to the sentence vector.
在计算槽位注意力向量时，上述公式（2）中的Q表示上一时刻（t-1时刻）注意力层1213输出的隐藏状态向量C，V表示输入注意力层1213的编码向量序列{h_0,h_1,h_2,……,h_t}，通过上述公式（2）得到的注意力向量能够结合上一时刻的隐藏状态向量学习当前t时刻处理的词向量的相关程度。由于上一时刻隐藏状态向量表示的已提取的语义信息或/和词义信息包含所有可能的槽位标签信息，因此将t-1时刻输出的隐藏状态向量C与通过上述公式（2）计算得到的注意力向量组合，得到槽位注意力向量C_t^S，得到的槽位注意力向量C_t^S用于量化每个词向量对应的字与句向量对应的句子所表达槽位的相关程度。When the slot attention vector is calculated, Q in the above formula (2) denotes the hidden state vector C output by the attention layer 1213 at the previous moment (time t-1), and V denotes the encoded vector sequence {h_0, h_1, h_2, ..., h_t} input into the attention layer 1213; the attention vector obtained by the above formula (2) can learn, in combination with the hidden state vector of the previous moment, the degree of correlation of the word vector processed at the current time t. Since the extracted semantic information and/or word meaning information represented by the hidden state vector of the previous moment contains all possible slot label information, the hidden state vector C output at time t-1 is combined with the attention vector calculated by the above formula (2) to obtain the slot attention vector C_t^S, which is used to quantify the degree of correlation between the word corresponding to each word vector and the slots expressed by the sentence corresponding to the sentence vector.
如上所述,注意力层1213可以通过长短期记忆(Long Short Term Memory,LSTM)模型和注意力机制训练得到,具体训练过程参考下文详细描述,在此不再赘述。可以理解,在另一些实施例中,可以采用其他与LSTM模型以及注意力机制具有相同功能的神经网络模型以及用于学习自然语言的句子中的字与句子所表达意图或槽位相关程度的其他机制,此处不做限制。As mentioned above, the attention layer 1213 can be obtained by training a Long Short Term Memory (LSTM) model and an attention mechanism. For the specific training process, please refer to the detailed description below, which will not be repeated here. It can be understood that in other embodiments, other neural network models that have the same functions as the LSTM model and the attention mechanism, as well as other neural network models that are used to learn the degree of correlation between the words in a sentence and the intent or slot expressed in the sentence can be used. mechanism, there is no restriction here.
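The additive attention of formula (2) can be sketched as follows (PyTorch, illustrative only). Formula (2) only defines the attention scores; the softmax normalization and the weighted sum used here to turn the scores into a context vector are common practice and are assumptions of this sketch, as are all dimensions and names.

    # Sketch of formula (2): Attention = W_u * tanh(W_q * Q + W_v * V),
    # scoring how strongly each word vector V relates to the query Q
    # (h_0 for the intent attention vector C^I, or the previous hidden state
    #  for the slot attention vector C_t^S).
    import torch
    import torch.nn as nn

    class AdditiveAttention(nn.Module):
        def __init__(self, hidden_size=768):
            super().__init__()
            self.W_q = nn.Linear(hidden_size, hidden_size, bias=False)
            self.W_v = nn.Linear(hidden_size, hidden_size, bias=False)
            self.W_u = nn.Linear(hidden_size, 1, bias=False)

        def forward(self, query, values):
            # query: (1, hidden), values: (t, hidden) -> one score per word vector
            scores = self.W_u(torch.tanh(self.W_q(query) + self.W_v(values)))  # (t, 1)
            weights = torch.softmax(scores, dim=0)               # assumed normalization
            context = (weights * values).sum(dim=0, keepdim=True)  # weighted context vector
            return weights, context

    attention = AdditiveAttention()
    h0 = torch.randn(1, 768)          # stand-in for the sentence vector (query for intent attention)
    words = torch.randn(10, 768)      # stand-ins for the word vectors h_1 ... h_10
    weights, c_intent = attention(h0, words)   # weights quantify word-to-intent relevance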
4)槽位填充层12144) Slot filling layer 1214
槽位填充层1214用于预测语料数据中的候选槽位并填充槽位值,其中,该槽位填充层1214可以预测出语料数据中的多个槽位标签,并保留满足条件的槽位标签作为候选槽位输出。The slot filling layer 1214 is used to predict candidate slots in the corpus data and fill in the slot value, wherein the slot filling layer 1214 can predict multiple slot labels in the corpus data, and retain the slot labels that meet the conditions Output as a candidate slot.
具体地，在t时刻，假如槽位填充层1214对当前的某个字进行处理，其以BERT编码层1211输出的编码向量h_t、t-1时刻注意力层1213输出的隐藏状态向量C（即当前处理的字之前的句子的语义信息或字的词义信息）、以及t时刻注意力层1213输出的意图注意力向量C^I和槽位注意力向量C_t^S为输入，输出t时刻的候选槽位。槽位填充层1214在每个时间步t基于上述输入的四个向量预测可能的槽位标签，并对预测的槽位标签计算槽位置信度以判断该槽位标签是否满足输出条件。即槽位填充层1214基于包括词向量（包含待解析语料数据中每个字的词义信息）的编码向量、当前处理的字之前的句子的语义信息或字的词义信息、当前处理的字与句子所表达意图的相关程度以及当前处理的字与句子所表达槽位的相关程度，得到待解析语料数据的可能的槽位标签，并计算每个槽位标签的槽位置信度，然后从中选择出满足条件（即与待解析语料数据真实表达出的槽位的相关程度超过阈值）的槽位标签作为候选槽位输出。Specifically, at time t, when the slot filling layer 1214 processes a current word, it takes as input the encoding vector h_t output by the BERT encoding layer 1211, the hidden state vector C output by the attention layer 1213 at time t-1 (that is, the semantic information of the sentence or the word meaning information of the words before the currently processed word), and the intent attention vector C^I and the slot attention vector C_t^S output by the attention layer 1213 at time t, and outputs the candidate slots at time t. At each time step t, the slot filling layer 1214 predicts possible slot labels based on the four input vectors and calculates a slot confidence for each predicted slot label to determine whether the slot label satisfies the output condition. That is, based on the encoding vector including the word vector (which contains the word meaning information of each word in the corpus data to be parsed), the semantic or word meaning information preceding the currently processed word, the degree of correlation between the currently processed word and the intents expressed by the sentence, and the degree of correlation between the currently processed word and the slots expressed by the sentence, the slot filling layer 1214 obtains the possible slot labels of the corpus data to be parsed, calculates the slot confidence of each slot label, and then selects, as the candidate slot output, the slot labels that satisfy the condition, that is, whose degree of correlation with the slots actually expressed by the corpus data exceeds the threshold.
可以理解，此处，槽位置信度表示预测的槽位标签与语料数据表达的真实槽位的接近程度，也可以称为槽位可靠度。槽位置信度越高的槽位标签越接近语料数据表达的真实槽位。在槽位填充层1214中，可以对槽位置信度设定某一阈值，例如设置槽位置信度的阈值为0.5，槽位置信度大于或等于阈值的槽位标签满足输出条件，对应的槽位标签将会输出作为候选槽位；槽位置信度小于阈值的槽位标签不满足输出条件，其对应的槽位标签将会被删除，不会从槽位填充层1214输出。It can be understood that, here, the slot confidence represents how close a predicted slot label is to the real slot expressed by the corpus data, and may also be called slot reliability. A slot label with a higher slot confidence is closer to the real slot expressed by the corpus data. In the slot filling layer 1214, a threshold may be set for the slot confidence, for example 0.5; a slot label whose slot confidence is greater than or equal to the threshold satisfies the output condition and is output as a candidate slot, while a slot label whose slot confidence is less than the threshold does not satisfy the output condition, is deleted, and is not output from the slot filling layer 1214.
例如，如图4所示，如果待解析的语料数据是“请为我播放你好旧时光”，假设槽位填充层1214中对槽位置信度设置的阈值为0.5。那么槽位填充层1214为“请、为、我、播、放”这5个字预测的槽位标签中，O槽位的槽位置信度（例如是0.7）大于或等于0.5，其他槽位（如songName）的槽位置信度（例如是0.3）小于0.5，因此，“请、为、我、播、放”这5个字对应输出的候选槽位都是O槽位。槽位填充层1214为“你、好、旧、时、光”这5个字预测的槽位标签中，songName、videoName、mediaName的槽位置信度（例如槽位置信度分别为0.86、0.7、0.55）大于或等于0.5，而O槽位的槽位置信度（例如是0.3）小于0.5，因此，“你”对应输出的候选槽位是songName-B、videoName-B、mediaName-B，“好、旧、时、光”对应输出的候选槽位是songName-I、videoName-I、mediaName-I，其中，B标记的是名称起始位置的字，即表示“你”是名称中的第一个字；I标记的是名称起始位置之后的字。由于O槽位表示空槽位或不重要的槽位，因此槽位填充层1214最终输出3个候选槽位songName、videoName、mediaName，并对每个候选槽位填充槽位值“你好旧时光”。For example, as shown in Figure 4, if the corpus data to be parsed is "请为我播放你好旧时光" and the threshold set for the slot confidence in the slot filling layer 1214 is assumed to be 0.5, then among the slot labels predicted by the slot filling layer 1214 for the five words "请, 为, 我, 播, 放", the slot confidence of the O slot (for example 0.7) is greater than or equal to 0.5 while the slot confidence of the other slots (such as songName, for example 0.3) is less than 0.5, so the candidate slots output for these five words are all the O slot. Among the slot labels predicted for the five words "你, 好, 旧, 时, 光", the slot confidences of songName, videoName and mediaName (for example 0.86, 0.7 and 0.55 respectively) are greater than or equal to 0.5 while the slot confidence of the O slot (for example 0.3) is less than 0.5, so the candidate slots output for "你" are songName-B, videoName-B and mediaName-B, and the candidate slots output for "好, 旧, 时, 光" are songName-I, videoName-I and mediaName-I, where B marks the word at the starting position of a name (that is, "你" is the first word of the name) and I marks the words after the starting position. Since the O slot represents an empty or unimportant slot, the slot filling layer 1214 finally outputs three candidate slots songName, videoName and mediaName, and fills each candidate slot with the slot value "你好旧时光".
又例如，如果待解析语料数据是“帮我预定从上海到北京的火车票并预定北京火车站附近的五星级酒店”，假设槽位填充层1214中对槽位置信度设置的阈值为0.5。其中，槽位填充层1214为“上，海”预测的槽位标签中，槽位标签“出发地”的槽位置信度（例如是0.7）大于或等于0.5，因此，“上，海”这2个字对应输出的候选槽位均是“出发地”；槽位填充层1214为“北，京”预测的槽位标签中，槽位标签“目的地”的槽位置信度（例如是0.8）大于或等于0.5，因此，“北，京”这2个字对应输出的候选槽位均是“目的地”；槽位填充层1214为“北，京，火，车，站”预测的槽位标签中，槽位标签“地点”的槽位置信度（例如是0.75）大于或等于0.5，因此，“北，京，火，车，站”这5个字对应输出的候选槽位均是“地点”；槽位填充层1214为“五，星，级”预测的槽位标签中，槽位标签“星级”的槽位置信度（例如是0.75）大于或等于0.5，因此，“五，星，级”这3个字对应输出的候选槽位均是“星级”。因此槽位填充层1214最终输出4个候选槽位：出发地、目的地、地点、星级，并对槽位（出发地）填充的槽位值是（上海），对槽位（目的地）填充的槽位值是（北京），对槽位（地点）填充的槽位值是（北京火车站），对槽位（星级）填充的槽位值是（五星级）。For another example, if the corpus data to be parsed is "帮我预定从上海到北京的火车票并预定北京火车站附近的五星级酒店" and the threshold set for the slot confidence in the slot filling layer 1214 is assumed to be 0.5, then among the slot labels predicted for "上, 海", the slot confidence of the slot label "departure place" (for example 0.7) is greater than or equal to 0.5, so the candidate slots output for these two words are both "departure place"; among the slot labels predicted for "北, 京", the slot confidence of the slot label "destination" (for example 0.8) is greater than or equal to 0.5, so the candidate slots output for these two words are both "destination"; among the slot labels predicted for "北, 京, 火, 车, 站", the slot confidence of the slot label "location" (for example 0.75) is greater than or equal to 0.5, so the candidate slots output for these five words are all "location"; and among the slot labels predicted for "五, 星, 级", the slot confidence of the slot label "star rating" (for example 0.75) is greater than or equal to 0.5, so the candidate slots output for these three words are all "star rating". Therefore, the slot filling layer 1214 finally outputs four candidate slots: departure place, destination, location and star rating, and the slot (departure place) is filled with the slot value (上海), the slot (destination) with (北京), the slot (location) with (北京火车站), and the slot (star rating) with (五星级).
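The way the per-character candidate slot labels of these examples are assembled into slot values (B marking the first character of a value, I the following characters, O no slot, as described above) can be illustrated by the following sketch; the helper name collect_slot_values is an assumed name used only here, and a character may carry several candidate labels at once.

    # Sketch: turning per-character B/I/O candidate labels into slot values.
    def collect_slot_values(chars, labels_per_char):
        slots = {}
        for ch, labels in zip(chars, labels_per_char):
            for label in labels:
                if label == "O":
                    continue                            # O slots fill no value
                name, position = label.rsplit("-", 1)   # e.g. "songName-B" -> ("songName", "B")
                if position == "B":
                    slots[name] = ch                    # start a new value for this slot
                elif position == "I" and name in slots:
                    slots[name] += ch                   # extend the current value
        return slots

    chars = list("请为我播放你好旧时光")
    labels_per_char = [["O"]] * 5 + [
        ["songName-B", "videoName-B", "mediaName-B"],
        ["songName-I", "videoName-I", "mediaName-I"],
        ["songName-I", "videoName-I", "mediaName-I"],
        ["songName-I", "videoName-I", "mediaName-I"],
        ["songName-I", "videoName-I", "mediaName-I"],
    ]
    print(collect_slot_values(chars, labels_per_char))
    # -> {'songName': '你好旧时光', 'videoName': '你好旧时光', 'mediaName': '你好旧时光'}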
值得注意的是，槽位填充层1214基于意图注意力向量作为每个时间步t预测槽位的输入量进行槽位预测，并且t=1时刻对第一个字预测槽位时采用的是句向量h_0作为初始值输入，由于意图注意力向量和句向量所表示的语义信息包括所有可能的意图标签，因此槽位填充层1214是基于可能的意图标签来预测可能的槽位标签，预测得到的槽位标签与意图标签相关联。因而使得槽位预测的准确度大大提高，相应地也提高了槽位预测的速度或效率。It is worth noting that the slot filling layer 1214 performs slot prediction with the intent attention vector as an input for predicting the slot at each time step t, and uses the sentence vector h_0 as the initial input when predicting the slot for the first word at time t=1. Since the semantic information represented by the intent attention vector and the sentence vector includes all possible intent labels, the slot filling layer 1214 predicts the possible slot labels based on the possible intent labels, and the predicted slot labels are associated with the intent labels. This greatly improves the accuracy of slot prediction and correspondingly also improves the speed or efficiency of slot prediction.
具体地,槽位填充层1214的工作过程如图3所示:Specifically, the working process of the slot filling layer 1214 is shown in FIG. 3 :
槽位填充层1214以t时刻BERT编码层1211输出的编码向量h_t、注意力层1213在t时刻输出的意图注意力向量C^I、槽位注意力向量C_t^S，以及注意力层1213在t-1时刻输出的隐藏状态向量C作为输入。槽位填充层1214先基于槽位门（slot-gate）机制对意图和槽位之间的关系进行显式建模，得到意图注意力向量C^I与槽位注意力向量C_t^S的融合向量g^S，再进一步预测对应每个时间步t的槽位标签，并计算每个槽位标签的槽位置信度。The slot filling layer 1214 takes as input the encoding vector h_t output by the BERT encoding layer 1211 at time t, the intent attention vector C^I and the slot attention vector C_t^S output by the attention layer 1213 at time t, and the hidden state vector C output by the attention layer 1213 at time t-1. Based on a slot-gate mechanism, the slot filling layer 1214 first explicitly models the relationship between the intents and the slots to obtain a fusion vector g^S of the intent attention vector C^I and the slot attention vector C_t^S, and then further predicts the slot label corresponding to each time step t and calculates the slot confidence of each slot label.
其中，意图注意力向量与槽位注意力向量的融合向量g^S的计算公式如下：The fusion vector g^S of the intent attention vector and the slot attention vector is calculated as follows:
g^S = v · tanh(C_t^S + W · C^I)    (3)
其中，v表示上述公式（3）中的双曲正切函数tanh(x)的随机权重系数，W表示意图注意力向量C^I的随机权重系数。W大于1表示意图注意力向量C^I对槽位预测的影响程度比槽位注意力向量C_t^S的影响程度大，W小于1则表示意图注意力向量C^I对槽位预测的影响程度比槽位注意力向量C_t^S的影响程度小，W等于1则表示意图注意力向量C^I对槽位预测的影响程度与槽位注意力向量C_t^S的影响程度相同。Here, v denotes the random weight coefficient of the hyperbolic tangent function tanh(x) in formula (3), and W denotes the random weight coefficient of the intent attention vector C^I. W greater than 1 means that the intent attention vector C^I influences the slot prediction more strongly than the slot attention vector C_t^S, W less than 1 means that it influences the slot prediction less strongly than the slot attention vector C_t^S, and W equal to 1 means that the intent attention vector C^I and the slot attention vector C_t^S influence the slot prediction to the same degree.
在每个时间步t，槽位填充层1214基于输入的上述四个向量可以得到表示槽位标签信息的槽位向量（记为h_t^S），进而基于槽位向量h_t^S计算相应槽位标签的槽位置信度y_t^S。槽位置信度y_t^S经过Sigmoid激活函数后的计算公式如下：At each time step t, the slot filling layer 1214 can obtain, based on the above four input vectors, a slot vector (denoted h_t^S) representing the slot label information, and then calculates the slot confidence y_t^S of the corresponding slot label based on the slot vector h_t^S. The slot confidence y_t^S obtained through the Sigmoid activation function is calculated as follows:
y_t^S = Sigmoid(W^S · h_t^S + b^S)    (4)
其中，S为槽位数量，W^S表示槽位向量h_t^S的随机权重系数，b^S表示偏差值。Here, S is the number of slots, W^S denotes the random weight coefficient applied to the slot vector h_t^S, and b^S denotes the bias value.
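Formulas (3) and (4) together can be sketched for one time step t as follows (PyTorch, illustrative only). The text above does not spell out how the gate g^S, the encoding vector h_t and the slot attention vector C_t^S are combined into the slot vector h_t^S, so the combination used below is an assumption of this sketch, as are the dimensions and the slot label set.

    # Sketch of one slot filling step: slot gate of formula (3) + Sigmoid of formula (4).
    import torch
    import torch.nn as nn

    class SlotFillingStep(nn.Module):
        def __init__(self, hidden_size=768, num_slots=4):
            super().__init__()
            self.W = nn.Linear(hidden_size, hidden_size, bias=False)  # weight of C^I in formula (3)
            self.v = nn.Linear(hidden_size, 1, bias=False)            # weight of tanh(...) in formula (3)
            self.out = nn.Linear(hidden_size, num_slots)               # W^S, b^S of formula (4)

        def forward(self, h_t, c_slot_t, c_intent):
            g = self.v(torch.tanh(c_slot_t + self.W(c_intent)))       # slot gate g^S, formula (3)
            slot_vector = h_t + g * c_slot_t                           # assumed fusion into h_t^S
            return torch.sigmoid(self.out(slot_vector))                # y_t^S, formula (4)

    slot_labels = ["O", "songName", "videoName", "mediaName"]          # assumed label set
    step = SlotFillingStep(num_slots=len(slot_labels))
    h_t, c_slot_t, c_intent = (torch.randn(1, 768) for _ in range(3))  # stand-in input vectors
    confidence = step(h_t, c_slot_t, c_intent)[0]
    candidates = [lab for lab, c in zip(slot_labels, confidence) if lab != "O" and c >= 0.5]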
For example, in the example shown in FIG. 4 above, in the process of predicting slots for the corpus data "请为我播放你好旧时光" ("please play 你好旧时光 (Hello Old Times) for me"), when predicting the slot for the character "我" ("me") at time t=3, the slot filling layer 1214 takes as input the coding vector h_3 (corresponding to "我"), the hidden state vector C output by the attention layer 1213 at time t-1 (corresponding to "为"), the intent attention vector C_I (corresponding respectively to "请为我播放你好旧时光", "播", "放"), and the slot attention vector (corresponding respectively to "为", "我"). The hidden state vector C (corresponding to "为") includes the word meaning information passed on from the word vector (corresponding to "请"), and the word vector (corresponding to "请") in turn includes the semantic information passed on from the sentence vector (corresponding to "请为我播放你好旧时光").
Because the word meaning of "我" is a pronoun referring to the speaker, "我" is related neither to the intent nor to the slots expressed by the sentence "请为我播放你好旧时光". Therefore, when predicting the slot for "我", suppose the calculation gives a slot confidence of 0.2 for the slot label songName, 0.3 for the slot label videoName, and 0.7 for the slot label O; the slot finally predicted for "我" is then the O slot, and the O slot generally denotes an unimportant slot and is not used as an output of the slot filling layer 1214.
For example, when predicting the slot for the character "你" at time t=6, the slot filling layer 1214 takes as input the coding vector h_6 (corresponding to "你"), the hidden state vector C output by the attention layer 1213 at time t-1 (corresponding to "放"), the intent attention vector C_I (corresponding respectively to "请为我播放你好旧时光", "播", "放"), and the slot attention vector (corresponding respectively to "放", "你"). The hidden state vector C (corresponding to "你") includes the word meaning information passed on from the word vector (corresponding to "放"), the word vector (corresponding to "放") in turn includes the word meaning information passed on from the preceding word vector (corresponding to "播"), and so on, and the word vector (corresponding to "请") includes the semantic information passed on from the sentence vector (corresponding to "请为我播放你好旧时光").
Because the word meaning of "你" is a character in a song or video title, "你" is only weakly related to the intent expressed by the sentence "请为我播放你好旧时光" but strongly related to the slots expressed by that sentence. Therefore, when predicting the slot for "你", suppose the calculation gives a slot confidence of 0.86 for the slot label songName, 0.7 for videoName, 0.55 for mediaName, and 0.2 for O; the slots finally predicted for "你" are then songName, videoName and mediaName, which are used as outputs of the slot filling layer 1214.
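The examples above amount to keeping every slot label whose confidence clears a threshold and discarding the O label. A minimal sketch of that selection step is shown below; the 0.5 threshold and the dictionary of confidences are illustrative assumptions.

```python
def select_candidate_slots(confidences, threshold=0.5):
    """Keep slot labels whose slot confidence exceeds the threshold (0.5 assumed)."""
    return [label for label, score in confidences.items()
            if score > threshold and label != "O"]

# Confidences computed for the character "你" in the example above.
print(select_candidate_slots(
    {"songName": 0.86, "videoName": 0.7, "mediaName": 0.55, "O": 0.2}))
# ['songName', 'videoName', 'mediaName']

# Confidences for the character "我": only the O label is high, so nothing is kept.
print(select_candidate_slots({"songName": 0.2, "videoName": 0.3, "O": 0.7}))
# []
```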
The slot filling layer 1214 can be obtained by training based on the slot-gate mechanism, the LSTM model and the Sigmoid activation function; the specific training process is described in detail below and is not repeated here. The slot gate mechanism focuses on learning the relationship between the intent attention vector and the slot attention vector, and obtains a better semantic frame through global optimization. It mainly uses the intent context vector to model the relationship between intents and slots, so as to improve slot filling performance. In other embodiments, another deep neural network model having the same function as the LSTM model may be used as the decoder, and another function having the same function as the Sigmoid function may be used as the activation function of the corresponding deep neural network decoder, which is not limited here.
5) Post-processing layer 1215
The post-processing layer 1215 is used to sort out the correspondence between candidate intents and candidate slots. The result obtained after matching candidate intents with candidate slots is output from the post-processing layer 1215 as the semantic parsing result.
For example, as shown in FIG. 4, if the input corpus is "请为我播放你好旧时光", the candidate intents output by the intent classification layer 1212 (PLAY_MUSIC, PLAY_VIDEO, PLAY_VOICE) and the candidate slots output by the slot filling layer 1214 (songName, videoName, mediaName) are input into the post-processing layer 1215, and the semantic parsing result output after inference and prediction based on the intent-slot mapping table in the post-processing layer 1215 is:
PLAY_MUSIC songName, videoName, mediaName, 你好旧时光;
PLAY_VIDEO songName, videoName, mediaName, 你好旧时光;
PLAY_VOICE songName, videoName, mediaName, 你好旧时光.
The candidate intents PLAY_MUSIC, PLAY_VIDEO and PLAY_VOICE are the intents identified by parsing the corpus data, the candidate slots songName, videoName and mediaName are the slots obtained by parsing the corpus data, and "你好旧时光" is the filled slot value.
For another example, if the corpus data to be parsed is "帮我预定从上海到北京的火车票并预定北京火车站附近的五星级酒店" ("book me a train ticket from Shanghai to Beijing and book a five-star hotel near Beijing Railway Station"), the candidate intents output by the intent classification layer 1212 (book train ticket, book hotel) and the candidate slots output by the slot filling layer 1214 (departure, destination) are input into the post-processing layer 1215, and the semantic parsing result output after inference and prediction based on the intent-slot mapping table in the post-processing layer 1215 is:
Book train ticket: departure, Shanghai;
destination, Beijing;
Book hotel: location, Beijing Railway Station;
star rating, five-star.
The candidate intents (book train ticket, book hotel) are the intents identified by parsing the corpus data, the candidate slots (departure, destination, location, star rating) are the slots obtained by parsing the corpus data, and Shanghai, Beijing, Beijing Railway Station and five-star are the slot values filled into the corresponding slots (departure, destination, location, star rating).
Specifically, the working process of the post-processing layer 1215 is shown in FIG. 3:
The post-processing layer 1215 takes as input the candidate intents obtained by the intent classification layer 1212 and the candidate slots obtained by the slot filling layer 1214, and sorts out the correspondence between candidate intents and candidate slots based on the intent-slot mapping table obtained during the pre-training of the semantic parsing model 121. How the intent-slot mapping table is obtained during the pre-training of the semantic parsing model 121 is described in detail below and is not repeated here.
It can be understood that the intent-slot mapping table is the result of sorting out the candidate intents and candidate slots obtained by training on a large number of samples; therefore, while performing semantic parsing tasks, the intent-slot mapping table can be continuously updated based on more corpus data from practical applications.
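As a minimal sketch of the idea, the post-processing step can be thought of as a table lookup: for each candidate intent, keep only the candidate slots that the mapping table associates with that intent. The table contents below are assumptions built from the examples in this description, not a table produced by the trained model.

```python
# Hypothetical intent-slot mapping table learned during pre-training (assumed contents).
INTENT_SLOT_MAP = {
    "PLAY_MUSIC": {"songName", "videoName", "mediaName"},
    "PLAY_VIDEO": {"songName", "videoName", "mediaName"},
    "PLAY_VOICE": {"songName", "videoName", "mediaName"},
    "BOOK_TICKET": {"departure", "destination"},
    "BOOK_HOTEL": {"location", "starRating"},
}

def match_intents_to_slots(candidate_intents, candidate_slots):
    # For every candidate intent, keep only the candidate slots the table allows.
    return {intent: sorted(INTENT_SLOT_MAP.get(intent, set()) & set(candidate_slots))
            for intent in candidate_intents}

print(match_intents_to_slots(
    ["BOOK_TICKET", "BOOK_HOTEL"],
    ["departure", "destination", "location", "starRating"]))
# {'BOOK_TICKET': ['departure', 'destination'], 'BOOK_HOTEL': ['location', 'starRating']}
```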
The above BERT coding layer 1211, intent classification layer 1212, attention layer 1213, slot filling layer 1214 and post-processing layer 1215 together constitute the semantic parsing model 121. Each layer in the structure of the semantic parsing model 121 needs to be pre-trained with a large amount of sample corpus data so that it has the corresponding function described above. As mentioned above, the semantic parsing model 121 is pre-trained by the server 200; afterwards, the trained semantic parsing model 121 can either be transplanted to the mobile phone 100 to directly perform semantic parsing tasks, or continue to reside in the server 200 to perform semantic parsing tasks requested by the mobile phone 100.
The pre-training process of the semantic parsing model 121 is described in detail below; reference may be made to the following example.
As shown in FIG. 5, the pre-training process of the semantic parsing model 121 includes:
501: The server 200 collects sample corpus data for training the semantic parsing model 121. The collected sample corpus data should cover as many domains as possible and as many verbs, proper nouns, common nouns and the like as possible, so that the generalization performance of the trained semantic parsing model 121 will be better.
The sample corpus data used for training the semantic parsing model 121 needs to be input in batches into each layer of the semantic parsing model 121 for training, and every piece of sample corpus data is processed by every layer of the semantic parsing model 121. For ease of understanding, several concepts related to the sample data are introduced below.
(a) batch: in deep learning, the loss function used for each parameter update is not obtained from a single data-label pair {data: label} but from a weighted combination of a group of data; the number of samples in this group is the batch size.
(b) batchsize: the batch size, i.e. the number of samples in one batch. Each training step takes batchsize samples from the training set.
(c) iteration: the number of iterations is the number of batches needed to complete one epoch. One iteration equals training once with batchsize samples; within one epoch, the number of batches and the number of iterations are equal.
(d) epoch: when the complete data set has passed through the neural network once and returned once, this process is called one epoch. In other words, one epoch equals training once with all samples in the training set.
For example, if the training set has 1000 samples and batchsize = 10, then training on the entire sample set requires 100 iterations, i.e. 1 epoch. As another example, for a data set with 2000 training samples, dividing the 2000 samples into batches of size 500 means that 4 iterations are needed to complete one epoch.
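The relationship between these quantities is simple arithmetic, as the short sketch below shows for the two examples just given.

```python
def iterations_per_epoch(num_samples, batch_size):
    # One epoch passes every training sample through the network once,
    # so the number of iterations equals the number of batches.
    return num_samples // batch_size

print(iterations_per_epoch(1000, 10))   # 100 iterations for 1 epoch
print(iterations_per_epoch(2000, 500))  # 4 iterations for 1 epoch
```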
502: The server 200 performs data preprocessing, through the NLP module, on the sample corpus data about to be input into the semantic parsing model 121 for training. For the data preprocessing of the sample corpus data, refer to the description of data preprocessing in the BERT coding layer 1211 above, which is not repeated here.
After data preprocessing, each piece of sample corpus data yields the Token sequence corresponding to that piece of sample corpus data, the sentence segmentation marks, and the mask corresponding to the Token sequence.
503: In one epoch of training, the server 200 inputs, for each piece of sample corpus data, the Token sequence, the sentence segmentation marks and the mask corresponding to the Token sequence obtained by data preprocessing into the BERT coding layer 1211 of the semantic parsing model 121 for training, so that it can output the coding vector sequence described in the BERT coding layer 1211 above.
The BERT coding layer 1211 is obtained by training based on the BERT model. During training, the upstream and downstream parameters of the semantic parsing model 121 need to be continuously fine-tuned, so that after learning for a sufficiently long time or from sufficient sample corpus data the BERT coding layer can output the above coding vector sequence {h_0, h_1, h_2, ..., h_t}.
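For illustration, the sketch below shows how a pre-trained Chinese BERT model can produce a per-character coding vector sequence and a sentence vector for one utterance. The model name "bert-base-chinese" and the use of the [CLS] position as the sentence vector h_0 are assumptions for the sketch; in the patent the BERT coding layer is additionally fine-tuned jointly with the downstream layers, which is not shown here.

```python
import torch
from transformers import BertModel, BertTokenizer

# "bert-base-chinese" is an illustrative choice, not necessarily the model used here.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")

encoded = tokenizer("请为我播放你好旧时光", return_tensors="pt")  # token ids + mask
with torch.no_grad():
    outputs = model(**encoded)

hidden = outputs.last_hidden_state   # [1, seq_len, 768]: coding vectors h_1 ... h_t
h0 = hidden[:, 0, :]                 # vector at the [CLS] position, used as h_0
print(hidden.shape, h0.shape)
```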
504: In one epoch of training, the server 200 inputs the sentence vector h_0 output by the BERT coding layer 1211 in the above process 503 into the intent classification layer 1212 of the semantic parsing model 121 for training, so that it can output the candidate intents described in the intent classification layer 1212 above, which is not repeated here.
The intent classification layer 1212 is obtained by training based on a fully connected layer with the Sigmoid function as the activation function. During training, the upstream and downstream parameters of the semantic parsing model 121 need to be continuously fine-tuned, so that after learning for a sufficiently long time or from a sufficiently large amount of sample corpus data the intent classification layer 1212 can extract all possible intent labels and the intent confidence corresponding to each intent label, and then extract multiple intent labels that satisfy the output condition as candidate intents to be output from the intent classification layer 1212; see formula (1) above and the related description for details, which are not repeated here.
For each piece of sample corpus data, the candidate intents output by the intent classification layer 1212 are input into the post-processing layer 1215.
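A minimal PyTorch sketch of such a multi-label intent head is given below: a fully connected layer over the sentence vector h_0 followed by a Sigmoid, with one independent confidence per intent label. The hidden size, number of intent labels and 0.5 output threshold are assumptions for the sketch.

```python
import torch
import torch.nn as nn

class IntentClassifier(nn.Module):
    """Fully connected layer over the sentence vector h_0, Sigmoid per intent label."""

    def __init__(self, hidden_size=768, num_intents=5):  # sizes are assumptions
        super().__init__()
        self.fc = nn.Linear(hidden_size, num_intents)

    def forward(self, h0):
        # Multi-label: one independent confidence per intent label, not a softmax.
        return torch.sigmoid(self.fc(h0))

clf = IntentClassifier()
confidences = clf(torch.randn(1, 768))
candidate_intents = (confidences > 0.5).nonzero(as_tuple=True)[1]  # labels above threshold
print(confidences, candidate_intents)
```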
505: In one epoch of training, the server 200 inputs the coding vector sequence {h_0, h_1, h_2, ..., h_t} output by the BERT coding layer 1211 trained in the above process 503 into the attention layer 1213 of the semantic parsing model 121 for training, so that it can output the intent attention vector C_I and the slot attention vector C_S described in the attention layer 1213 above, which is not repeated here.
The attention layer 1213 is obtained by training based on the attention mechanism and the LSTM model. During training, the upstream and downstream parameters of the semantic parsing model 121 need to be continuously fine-tuned, so that the attention layer 1213 can quantify the degree to which the character corresponding to each word vector is related to the expressed intent and the degree to which it is related to the represented slot, and finally output the intent attention vector and the slot attention vector; see formula (2) above and the related description for details, which are not repeated here.
The LSTM model is a special RNN model proposed to solve the gradient vanishing problem of the RNN model. Its core is the cell state, which can be understood as a conveyor belt; it is essentially the memory space of the entire model, changing over time. The working principle of the LSTM model can be briefly described as: (1) forget gate: choose to forget some past information; (2) input gate: memorize some current information; (3) merge the past and current memories; (4) output gate: choose to output some information. The attention mechanism imitates the internal process of biological observation behavior, i.e. a mechanism that aligns internal experience and external sensation so as to increase the fineness of observation in certain regions, and it can use limited attention resources to quickly filter out high-value information from a large amount of information. The attention mechanism can quickly extract important features of sparse data, and its essential idea can be expressed by the following formula:
Attention(Query, Source) = Σ_{i=1..Lx} Similarity(Query, Key_i) · Value_i    (5)

where Lx = ||Source|| denotes the length of Source. The meaning of the formula is that the constituent elements of Source are imagined as a series of <Key, Value> data pairs; given an element Query of the target Target, the weight coefficient of the Value corresponding to each Key is obtained by calculating the similarity or correlation between the Query and each Key, and the Values are then weighted and summed to obtain the final Attention value. In essence, therefore, the attention mechanism is a weighted summation of the Value values of the elements in Source, while Query and Key are used to calculate the weight coefficients of the corresponding Values.
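The sketch below illustrates this weighted-sum view of attention in NumPy. The dot product as the Similarity function and the softmax-style normalization of the weights are illustrative choices; the dimensions are assumed.

```python
import numpy as np

def attention(query, keys, values):
    # Formula (5): score each Key against the Query, turn the scores into weight
    # coefficients, then take the weighted sum of the Values.
    scores = keys @ query                              # dot-product similarity (one choice)
    weights = np.exp(scores) / np.exp(scores).sum()    # normalized weight coefficients
    return weights @ values                            # weighted sum = Attention value

rng = np.random.default_rng(0)
keys = rng.normal(size=(6, 8))      # Lx = 6 source elements, dimension 8 (assumed)
values = rng.normal(size=(6, 8))
query = rng.normal(size=8)
print(attention(query, keys, values).shape)            # (8,)
```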
506: In one epoch of training, the server 200 inputs the coding vector h_t output at time t by the BERT coding layer 1211 trained in the above process 503, the intent attention vector C_I and the slot attention vector C_S output at time t by the attention layer 1213 trained in the above process 505, and the hidden state vector C output by the LSTM model in the attention layer 1213 at time t-1 (i.e. the semantic information of the sentence preceding the currently processed character, or the word meaning information of that character) into the slot filling layer 1214 of the semantic parsing model 121 for training, so that it can output the candidate slots described in the slot filling layer 1214 above, which is not repeated here.
The slot filling layer 1214 is obtained by training based on the slot gate mechanism, with the LSTM model as the decoder and the Sigmoid function as the activation function. During training, the upstream and downstream parameters of the semantic parsing model 121 need to be continuously fine-tuned, so that after learning for a sufficiently long time or from a sufficiently large amount of sample corpus data the slot filling layer 1214 can, conditioned on the possible intent labels, predict all possible slot labels and the slot confidence corresponding to each slot label, and then extract multiple candidate slots that satisfy the output condition to be output from the slot filling layer 1214; see formulas (3) to (4) above and the related descriptions for details, which are not repeated here.
For each piece of sample corpus data, the candidate slots output by the slot filling layer 1214 are input into the post-processing layer 1215.
507: The server 200 determines whether the training results of the above processes 501 to 506 satisfy the training termination condition. If the training results satisfy the training termination condition, process 508 is performed; if the training results do not satisfy the training termination condition, process 509 is performed.
In this embodiment of the present application, an Early Stopping mechanism may be used to determine whether to terminate model training: when the number of training epochs reaches a count threshold, or when the number of epochs since the last optimal model exceeds a set interval threshold, the training results satisfy the training termination condition; otherwise, the training results do not satisfy the training termination condition.
The early stopping mechanism enables the trained neural network model to have good generalization performance, i.e. to fit the data well. Its basic idea is to evaluate the model on the validation set during training and to stop training when the model's performance on the validation set starts to decline, which avoids the overfitting problem caused by continuing training.
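The sketch below shows one way such an early stopping check could be written. The epoch limit, the patience interval and the stand-in validation function are illustrative assumptions.

```python
import random

def evaluate_on_validation_set(epoch):
    # Stand-in for a real validation pass: a noisy score that stops improving.
    return min(epoch, 20) + random.random()

def should_stop(epoch, best_epoch, max_epochs=50, patience=5):
    # Stop when the epoch count reaches the limit, or when no new optimal model
    # has appeared for more than `patience` epochs (thresholds are assumptions).
    return epoch >= max_epochs or (epoch - best_epoch) > patience

best_epoch, best_score = 0, float("-inf")
for epoch in range(1, 51):
    val_score = evaluate_on_validation_set(epoch)
    if val_score > best_score:                  # a new optimal model appeared
        best_epoch, best_score = epoch, val_score
    if should_stop(epoch, best_epoch):
        break
print(f"stopped at epoch {epoch}, best epoch {best_epoch}")
```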
508: The server 200 terminates the training of the BERT coding layer 1211, the intent classification layer 1212, the attention layer 1213 and the slot filling layer 1214 in the semantic parsing model 121, and further inputs the large number of candidate intents and candidate slots accumulated during the training of the above processes 502 to 506 into the post-processing layer 1215 of the semantic parsing model 121 to sort out their relationship, for example sorting the candidate slots by candidate intent to obtain the intent-slot mapping table. The training of the semantic parsing model then ends.
For each piece of sample corpus data trained through the above processes 502 to 506, candidate intents and candidate slots are obtained; after a sufficient number of epochs of training, enough candidate intents and candidate slots have been input into the post-processing layer 1215. Before the post-processing layer 1215 is trained, the candidate intents and candidate slots are unordered and have no correspondence, that is, no mapping has been formed between them. The post-processing layer 1215 is trained on a sufficiently large number of candidate intents and candidate slots so that it can sort candidate slots by candidate intent and output an ordered correspondence between intents and slots, i.e. an intent-slot mapping table is obtained by training. Based on the intent-slot mapping table, the post-processing layer 1215 can accurately and quickly find the correspondence between the candidate intents and candidate slots that are input into it.
509: The server 200 continues to input the sample corpus data of the next epoch and repeats processes 502 to 507 to continue training the semantic parsing model 121.
It is worth noting that, in order to eliminate the difference, caused by intent classification or slot filling loss, between the candidate intents or candidate slots obtained by the semantic parsing model 121 and the true intents or slots, a joint optimization function needs to be introduced when training the semantic parsing model 121, so that the output candidate intents and candidate slots undergo joint optimization training of the intent classification loss function and the slot filling loss function.
Specifically, the target loss function used for the joint optimization of intents and slots is the sum of the intent classification loss function, the slot filling loss function and a regularization term over the weights. The intent classification loss function uses a multi-label Sigmoid cross entropy loss (Cross Entropy Loss) function, and the slot filling loss function uses a serialized multi-label Sigmoid Cross Entropy Loss function, where the calculation formula of the Sigmoid Cross Entropy Loss is derived as follows:
L(t, x) = -Σ_i [ t_i · ln P(t_i=1|x_i) + (1 - t_i) · ln(1 - P(t_i=1|x_i)) ]    (6)

where P(t_i=1|x_i) is the Sigmoid function, P(t_i=1|x_i) = 1 / (1 + e^{-x_i}).
After adding L2 regularization over the weights, the joint optimization target loss function is obtained as follows:

J = L_y(y, f(x)) + L_c(y, f(x)) + (λ / 2m) · Σ_l Σ_k Σ_j (W_{k,j}^[l])²    (7)

where L_y(y, f(x)) is the intent classification loss function calculated according to formula (6) above, L_c(y, f(x)) is the slot filling loss function calculated according to formula (6) above, λ is a hyperparameter, m is the number of data items in one batch, and the division by 2 is so that the factor cancels out during differentiation; Σ_k Σ_j (W_{k,j}^[l])² denotes the sum over the W parameters of the l-th layer, where W^[l] is a matrix and k and j denote the rows and columns of that matrix.
It can be seen from this that the joint optimization function mainly performs joint optimization of the intent classification loss and the slot filling loss generated during the matrix transformations in the neural network. After the joint optimization of formula (7) above, the semantic parsing model 121 trained by the server 200 can parse the corpus data to be parsed into candidate intents and candidate slots that are closer to the true intents and true slots.
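For ease of understanding, the sketch below expresses the joint target of formulas (6) and (7) in PyTorch: multi-label Sigmoid cross entropy for the intents, the same loss applied per time step for the slots, plus L2 regularization over the weight matrices. The value of λ, the tensor shapes and the toy model are assumptions for the sketch.

```python
import torch
import torch.nn.functional as F

def joint_loss(intent_logits, intent_targets, slot_logits, slot_targets,
               model, lam=0.01, batch_size=32):
    # Formula (6): multi-label Sigmoid cross entropy for intent classification.
    l_intent = F.binary_cross_entropy_with_logits(intent_logits, intent_targets)
    # Formula (6) applied at every time step: serialized multi-label loss for slot filling.
    l_slot = F.binary_cross_entropy_with_logits(slot_logits, slot_targets)
    # Formula (7): L2 regularization over the weight matrices W of every layer.
    l2 = sum((p ** 2).sum() for n, p in model.named_parameters() if "weight" in n)
    return l_intent + l_slot + lam / (2 * batch_size) * l2

# Toy shapes (assumed): 32 utterances, 5 intent labels, 20 time steps, 10 slot labels.
model = torch.nn.Linear(768, 5)
loss = joint_loss(torch.randn(32, 5), torch.randint(0, 2, (32, 5)).float(),
                  torch.randn(32, 20, 10), torch.randint(0, 2, (32, 20, 10)).float(),
                  model)
loss.backward()
```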
As described above, after the server 200 completes the pre-training of the semantic parsing model 121, the trained semantic parsing model 121 can either be transplanted to the mobile phone 100 to directly perform semantic parsing tasks, or continue to reside in the server 200 to perform semantic parsing tasks requested by the mobile phone 100. Specifically, as shown in FIG. 6, the user inputs a voice instruction by waking up the voice assistant of the mobile phone 100; the mobile phone 100, through its internal human-machine dialogue system 110 and based on the above semantic parsing model 121, extracts one or more intents and slots corresponding to the user's voice instruction, and further performs the corresponding operation based on the identified intents and slots, for example opening an application or performing a web search. For the specific interaction process between the user and the mobile phone 100 onto which the semantic parsing model 121 has been transplanted, reference may be made to the following example:
601: The mobile phone 100 obtains the user's voice instruction.
A voice assistant is installed in the mobile phone 100, and the user can issue a voice instruction to the mobile phone 100 by waking up its voice assistant. For example, the mobile phone 100 obtains the user's voice instruction "帮我预定从上海到北京的火车票并预定北京火车站附近的五星级酒店" ("book me a train ticket from Shanghai to Beijing and book a five-star hotel near Beijing Railway Station").
602: The speech recognition module 111 in the human-machine dialogue system 110 of the mobile phone 100 recognizes and converts the obtained user voice instruction into corpus data in text form, for example converting the above voice instruction into the text corpus data "帮我预定从上海到北京的火车票并预定北京火车站附近的五星级酒店".
603: The semantic parsing module 112 in the human-machine dialogue system 110 of the mobile phone 100 performs semantic parsing on the corpus data to obtain a semantic parsing result in which intents correspond to slots.
Specifically, the semantic parsing module 112 first preprocesses the corpus data to obtain the Token sequence, the sentence segmentation marks and the mask created for the Token sequence. The semantic parsing module 112 then uses the Token sequence, the sentence segmentation marks and the mask created for the Token sequence as the input of the semantic parsing model 121, performs semantic parsing, and extracts multiple candidate intents and multiple candidate slots; finally, the semantic parsing model 121 sorts out the correspondence between the multiple candidate intents and the multiple candidate slots and outputs it as the semantic parsing result. In some embodiments, a simple single-intent corpus can also be parsed by the semantic parsing model 121 to extract a single candidate intent and one or more corresponding candidate slots, which is not limited here.
For example, the semantic parsing result obtained by the semantic parsing module 112 of the human-machine dialogue system 110 through the semantic parsing model 121 for the above corpus data "帮我预定从上海到北京的火车票并预定北京火车站附近的五星级酒店" is:
Book train ticket: departure, Shanghai;
destination, Beijing;
Book hotel: location, Beijing Railway Station;
star rating, five-star.
604: The problem solving module 113 in the human-machine dialogue system 110 of the mobile phone 100 searches for corresponding applications or network resources based on the semantic parsing result obtained by the semantic parsing module 112, so as to obtain a solution for the intents and slots in the semantic parsing result.
For example, in the above process 603, parsing the user instruction "帮我预定从上海到北京的火车票并预定北京火车站附近的五星级酒店" yields an intent-slot mapping result that contains the user's two intents, the four slots corresponding to those intents, and the slot value filled into each slot. The solution found by the problem solving module 113 is then that the mobile phone 100 can open an installed ticket-booking application or travel application to query train ticket information and hotel information for the user to select and book, or select the tickets of a certain train by default according to the user's historical usage records and enter the booking interface for the user to confirm; the mobile phone interface is shown in FIG. 7.
For another example, for the user instruction "请为我播放你好旧时光", as shown in FIG. 4, the intent-slot mapping result obtained by parsing the corpus data recognized from this instruction contains the user's three intents, the three slots corresponding to those intents, and the slot value filled into each slot. The mobile phone 100 can then, based on the user's usage habits, by default open music player software to play the local music 《你好旧时光》, or open audio player software to obtain music or video files related to 《你好旧时光》 for the user to select and play.
605: The language generation module 114 in the human-machine dialogue system 110 of the mobile phone 100 generates natural language sentences for the solution found by the problem solving module 113 and feeds them back to the user through the display interface of the mobile phone 100.
For the user instruction "帮我预定从上海到北京的火车票并预定北京火车站附近的五星级酒店" in the above process 603, after speech recognition and semantic parsing, the solution found by the problem solving module 113 is that the mobile phone 100 can open an installed ticket-booking application or travel application to query train ticket information and hotel information for the user to select and book, or select the tickets of a certain train by default according to the user's historical usage records and enter the booking interface for the user to confirm. The language generation module 114 can correspondingly generate the train number information of the train tickets or the introduction information of the hotel and feed it back to the user through the display interface of the mobile phone 100, as shown in FIG. 7.
For another example, the user voice instruction obtained by the mobile phone 100 is to query the weather for the next three days. After speech recognition and semantic parsing, the solution found by the problem solving module 113 is to open the browser on the mobile phone 100 or open the weather query software installed on the mobile phone 100 to search for the weather conditions of the next three days. Correspondingly, the language generation module 114 generates natural language text from the retrieved weather conditions as follows:
Today: sunny, 28-32°C;
Tomorrow: sunny, 28-33°C;
Wednesday: sunny turning cloudy, 28-32°C.
606: The dialogue management module 115 in the human-machine dialogue system 110 of the mobile phone 100 can schedule other modules based on the user's dialogue history to further refine the accurate understanding of the user's voice instruction. For example, if no location is explicitly specified in the user's voice instruction when the above problem solving module 113 searches for the weather, the dialogue management module 115 can, based on the user's dialogue history, schedule the problem solving module 113 to use Beijing, which the user frequently queries, as the search location and feed back to the user the weather conditions in Beijing for the next three days; the dialogue management module 115 can also, based on the location information of the mobile phone 100, schedule the problem solving module 113 to search for the weather at the user's current location for the next three days, and further schedule the language generation module 114 to generate the following natural language sentences:
Beijing area:
Today: sunny, 28-32°C;
Tomorrow: sunny, 28-33°C;
Wednesday: sunny turning cloudy, 28-32°C.
It can be understood that the dialogue management module 115 in the human-machine dialogue system 110 of the mobile phone 100 can flexibly schedule the other modules in the human-machine dialogue system 110 to perform their corresponding functions.
607: The speech synthesis module 116 in the human-machine dialogue system 110 of the mobile phone 100 further synthesizes and converts the natural language sentences generated by the language generation module 114 into speech, which is played by the mobile phone 100 as feedback to the user. For example, the weather conditions generated by the language generation module 114 in the above process 605 are converted into speech and played to the user, so that the user can hear the weather conditions without looking at the mobile phone.
In other embodiments, the trained semantic parsing model 121 can also continue to reside in the server 200 to perform semantic parsing tasks requested by the mobile phone 100. The user inputs a voice instruction by waking up the voice assistant of the mobile phone 100; the mobile phone 100 converts the user's voice instruction into corpus data through its internal human-machine dialogue system 110, and sends the converted corpus data to the server 200 for semantic parsing by interacting with the server 200. The server 200 extracts, based on the above semantic parsing model 121, multiple candidate intents in the user's voice instruction and the candidate slots corresponding to those intents. Further, the server 200 feeds back the extracted intent-slot correspondence result to the mobile phone 100, and the mobile phone 100 further performs the corresponding operation based on the identified intents and slots, for example opening an application or performing a web search.
An exemplary structure of the electronic device 100 is given below in conjunction with the embodiments of the present application.
FIG. 8 shows a schematic structural diagram of the mobile phone 100 according to an embodiment of the present application.
The mobile phone 100 may include a processor 101, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, a headset jack 170D, a sensor module 180, buttons 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and the like. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
It can be understood that the structure illustrated in this embodiment of the present invention does not constitute a specific limitation on the mobile phone 100. In other embodiments of the present application, the mobile phone 100 may include more or fewer components than shown, or combine some components, or split some components, or have a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The mobile phone 100 can implement the functions of obtaining the user's voice instructions and feeding back response speech to the user through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headset jack 170D, the application processor, and the like. For example, the mobile phone 100 obtains the user's voice instruction through the receiver 170B or the microphone 170C, and sends the obtained user voice instruction to the human-machine dialogue system 110 for speech recognition and semantic parsing; the corresponding solution is matched according to the semantic parsing result, and the mobile phone 100 performs the corresponding operation to realize the solution corresponding to the semantic parsing result. The human-machine dialogue system 110 can also generate response speech from the solution corresponding to the semantic parsing result and feed this response speech back to the user through the speaker 170A of the mobile phone 100 or through a headset plugged into the headset jack 170D.
The audio module 170 is used to convert digital audio information into an analog audio signal output, and is also used to convert an analog audio input into a digital audio signal. The audio module 170 can also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be disposed in the processor 101, or some functional modules of the audio module 170 may be disposed in the processor 101.
The speaker 170A, also called a "horn", is used to convert an audio electrical signal into a sound signal. The electronic device 100 can listen to music or listen to a hands-free call through the speaker 170A.
The receiver 170B, also called an "earpiece", is used to convert an audio electrical signal into a sound signal. When the electronic device 100 answers a call or a voice message, the voice can be heard by placing the receiver 170B close to the ear.
The microphone 170C, also called a "mic" or "mike", is used to convert a sound signal into an electrical signal. When making a call or sending a voice message, the user can speak with the mouth close to the microphone 170C to input the sound signal into the microphone 170C. The electronic device 100 may be provided with at least one microphone 170C. In other embodiments, the electronic device 100 may be provided with two microphones 170C, which can implement a noise reduction function in addition to collecting sound signals. In still other embodiments, the electronic device 100 may be provided with three, four or more microphones 170C to collect sound signals, reduce noise, identify sound sources, implement directional recording functions, and the like.
The processor 101 may include one or more processing units; for example, the processor 101 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc. Different processing units may be independent devices or may be integrated in one or more processors. The processor 101 implements the functions of the semantic parsing model 121 by running programs; the human-machine dialogue system 110 recognizes and converts the user's voice instruction into text corpus data, which, after data preprocessing, is input into the semantic parsing model 121 run by the processor 101 for semantic parsing to obtain the semantic parsing result.
The controller can generate operation control signals according to the instruction operation code and timing signals, and complete the control of fetching and executing instructions.
A memory may also be provided in the processor 101 for storing instructions and data. In some embodiments, the memory in the processor 101 is a cache memory. This memory can store instructions or data that the processor 101 has just used or uses cyclically. If the processor 101 needs to use the instruction or data again, it can be called directly from this memory. This avoids repeated accesses and reduces the waiting time of the processor 101, thereby improving the efficiency of the system.
In some embodiments, the processor 101 may include one or more interfaces. The interfaces may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a general-purpose input/output (GPIO) interface, a SIM interface, and/or a USB interface, etc.
It can be understood that the interface connection relationships between the modules illustrated in this embodiment of the present invention are only schematic illustrations and do not constitute a structural limitation on the mobile phone 100. In other embodiments of the present application, the mobile phone 100 may also adopt interface connection manners different from those in the above embodiments, or a combination of multiple interface connection manners.
The charging management module 140 is used to receive charging input from a charger. The charger may be a wireless charger or a wired charger. In some wired charging embodiments, the charging management module 140 may receive the charging input of the wired charger through the USB interface 130. In some wireless charging embodiments, the charging management module 140 may receive wireless charging input through the wireless charging coil of the mobile phone 100. While charging the battery 142, the charging management module 140 can also supply power to the electronic device through the power management module 141.
The power management module 141 is used to connect the battery 142 and the charging management module 140 to the processor 101. The power management module 141 receives input from the battery 142 and/or the charging management module 140, and supplies power to the processor 101, the internal memory 121, the display screen 194, the camera 193, the wireless communication module 160, and the like.
The wireless communication function of the mobile phone 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor, the baseband processor, and the like.
The antenna 1 and the antenna 2 are used to transmit and receive electromagnetic wave signals. Each antenna in the mobile phone 100 may be used to cover a single communication frequency band or multiple communication frequency bands. Different antennas may also be multiplexed to improve antenna utilization. For example, the antenna 1 may be multiplexed as a diversity antenna of a wireless local area network. In other embodiments, the antennas may be used in combination with a tuning switch.
The mobile communication module 150 can provide solutions for wireless communication, including 2G/3G/4G/5G, applied on the mobile phone 100.
The wireless communication module 160 can provide solutions for wireless communication applied on the mobile phone 100, including wireless local area networks (WLAN) such as a wireless fidelity (Wi-Fi) network, Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), and the like.
In some embodiments, the antenna 1 of the mobile phone 100 is coupled with the mobile communication module 150, and the antenna 2 is coupled with the wireless communication module 160, so that the mobile phone 100 can communicate with networks and other devices through wireless communication technologies.
The mobile phone 100 implements the display function through the GPU, the display screen 194, the application processor, and the like. The GPU is a microprocessor for image processing and is connected to the display screen 194 and the application processor.
The display screen 194 is used to display images, videos, and the like. The display screen 194 includes a display panel. In some embodiments, the mobile phone 100 may include 1 or N display screens 194, where N is a positive integer greater than 1.
The SIM card interface 195 is used to connect a SIM card.
Reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one example implementation or technique disclosed in accordance with the present application. The appearances of the phrase "in one embodiment" in various places in the specification do not necessarily all refer to the same embodiment.
The present disclosure also relates to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer-readable medium, such as, but not limited to, any type of disk, including floppy disks, optical disks, CD-ROMs, magneto-optical disks, read-only memories (ROM), random access memories (RAM), EPROMs, EEPROMs, magnetic or optical cards, application-specific integrated circuits (ASICs), or any type of medium suitable for storing electronic instructions, and each may be coupled to a computer system bus. Furthermore, the computers referred to in the specification may include a single processor or may be architectures employing multiple processors for increased computing capability.
The processes and displays presented herein are not inherently related to any particular computer or other apparatus. Various general-purpose systems may also be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatus to perform one or more method steps. Structures for a variety of these systems are discussed in the description below. In addition, any particular programming language sufficient to implement the techniques and implementations disclosed in the present application may be used. A variety of programming languages may be used to implement the present disclosure, as discussed herein.
In addition, the language used in this specification has been principally selected for readability and instructional purposes and may not have been selected to delineate or limit the disclosed subject matter. Accordingly, the present disclosure is intended to be illustrative, but not limiting, of the scope of the concepts discussed herein.

Claims (12)

  1. A semantic parsing method, characterized in that the method comprises:
    obtaining corpus data to be parsed;
    calculating a degree of intent correlation between a word comprised in the corpus data to be parsed and an intent represented by the corpus data to be parsed, and a degree of slot correlation between the word and a slot represented by the corpus data to be parsed;
    predicting a slot of the corpus data to be parsed based on semantic information of the word and preceding semantic information of the word, as well as the degree of intent correlation and the degree of slot correlation of the word.
  2. The method according to claim 1, characterized by further comprising:
    predicting a plurality of intents from the corpus data to be parsed;
    determining, from the predicted slots, a slot corresponding to each intent of the plurality of intents.
  3. The method according to claim 1, characterized in that the preceding semantic information comprises semantic information of at least one word located before the word in the corpus data to be parsed.
  4. The method according to claim 1, characterized by further comprising:
    generating sentence semantic information of the corpus data to be parsed and semantic information of each word in the corpus data to be parsed.
  5. The method according to claim 4, characterized in that the method is implemented by a neural network model.
  6. The method according to claim 5, characterized in that the neural network model comprises a fully connected layer and a long short-term memory network model.
  7. The method according to claim 5 or 6, characterized in that the sentence semantic information of the corpus data to be parsed, the preceding semantic information of the word, and the degree of intent correlation and the degree of slot correlation of the word are represented in the form of vectors in the neural network model.
  8. A human-machine dialog method, characterized by comprising:
    receiving a user voice instruction;
    converting the user voice instruction into corpus to be parsed in text form;
    parsing out, by the semantic parsing method according to any one of claims 1 to 6, intents in the corpus to be parsed and a slot corresponding to each intent;
    executing, based on the parsed intents and the slot corresponding to each intent, an operation corresponding to the user voice instruction, or generating a response voice.
  9. The method according to claim 8, characterized in that the operation comprises one or more of sending an instruction to a smart home device, opening application software, searching a web page, making a phone call, and sending or receiving a short message.
  10. A human-machine dialog system, characterized in that the system comprises:
    a speech recognition module, configured to convert a user voice instruction into corpus data in text form;
    a semantic parsing module, configured to perform the semantic parsing method according to any one of claims 1 to 6;
    a problem solving module, configured to find a solution for the result obtained by the semantic parsing module;
    a language generation module, configured to generate a natural language sentence corresponding to the solution;
    a speech synthesis module, configured to synthesize the natural language sentence into a response voice; and
    a dialog management module, configured to schedule the speech recognition module, the semantic parsing module, the problem solving module, the language generation module, and the speech synthesis module to cooperate with one another, so as to implement human-machine dialog.
  11. A readable medium, characterized in that the readable medium stores instructions which, when executed on an electronic device, cause the electronic device to perform the method according to any one of claims 1 to 6 and claim 9.
  12. An electronic device, characterized by comprising:
    a memory, configured to store instructions to be executed by one or more processors of the electronic device; and
    a processor, which is one of the processors of the electronic device and is configured to perform the method according to any one of claims 1 to 6 and claim 9.
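
The joint prediction of multiple intents and slots described in claims 1 to 7 can be illustrated with a short sketch. The code below is one minimal reading of those claims, not the claimed implementation: it assumes a PyTorch-style model in which an encoder produces per-word semantic vectors, a degree of intent correlation and a degree of slot correlation are computed for every word, and slot labels are decoded left to right so that each position also sees the semantics of the preceding words. All class names, layer choices, and dimensions are illustrative assumptions.

    import torch
    import torch.nn as nn

    # A sketch of joint multi-intent and slot prediction in the spirit of claims 1-7.
    # Not the claimed implementation: all names and dimensions are illustrative.
    class JointIntentSlotSketch(nn.Module):
        def __init__(self, vocab_size, emb_dim, hidden_dim, num_intents, num_slot_labels):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, emb_dim)
            # Encoder producing per-word semantics and, by pooling, sentence semantics (claim 4).
            self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
            # Degree of intent correlation and degree of slot correlation per word (claim 1).
            self.intent_relevance = nn.Linear(2 * hidden_dim, 1)
            self.slot_relevance = nn.Linear(2 * hidden_dim, 1)
            # Multi-label intent head: several intents may be predicted at once (claim 2).
            self.intent_head = nn.Linear(2 * hidden_dim, num_intents)
            # Unidirectional decoder so each position carries the semantics of preceding words (claim 3).
            self.slot_decoder = nn.LSTM(2 * hidden_dim + 2, hidden_dim, batch_first=True)
            self.slot_head = nn.Linear(hidden_dim, num_slot_labels)

        def forward(self, word_ids):
            emb = self.embedding(word_ids)                               # (B, T, E)
            word_sem, _ = self.encoder(emb)                              # (B, T, 2H) per-word semantics
            sent_sem = word_sem.mean(dim=1)                              # (B, 2H) sentence semantics
            intents = torch.sigmoid(self.intent_head(sent_sem))          # (B, num_intents)
            intent_rel = torch.sigmoid(self.intent_relevance(word_sem))  # (B, T, 1)
            slot_rel = torch.sigmoid(self.slot_relevance(word_sem))      # (B, T, 1)
            # Word semantics plus both correlation degrees, all as vectors, decoded left to right.
            decoder_in = torch.cat([word_sem, intent_rel, slot_rel], dim=-1)
            slot_states, _ = self.slot_decoder(decoder_in)
            slot_logits = self.slot_head(slot_states)                    # (B, T, num_slot_labels)
            return intents, slot_logits

    # Toy usage: a batch containing one utterance of six words.
    model = JointIntentSlotSketch(vocab_size=3000, emb_dim=64, hidden_dim=128,
                                  num_intents=10, num_slot_labels=20)
    intents, slot_logits = model(torch.randint(0, 3000, (1, 6)))
    print(intents.shape, slot_logits.shape)   # torch.Size([1, 10]) torch.Size([1, 6, 20])

In this toy setting, the six-word utterance yields a multi-label intent score vector, in which several intents may exceed a decision threshold at once, and one slot-label distribution per word, from which the slot corresponding to each predicted intent can be read off.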
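
Claim 10 enumerates the modules of the human-machine dialog system. The following sketch shows one possible way a dialog management module could schedule those modules within a single dialog turn; the function names, signatures, and stubbed back ends are assumptions for illustration and are not taken from the present application.

    from dataclasses import dataclass
    from typing import Callable, List, Tuple

    @dataclass
    class ParseResult:
        intents: List[str]
        slots: List[Tuple[str, str, str]]   # (intent, slot name, slot value)

    def dialog_turn(audio: bytes,
                    recognize: Callable[[bytes], str],            # speech recognition module
                    parse: Callable[[str], ParseResult],          # semantic parsing module
                    solve: Callable[[ParseResult], str],          # problem solving module
                    generate: Callable[[str], str],               # language generation module
                    synthesize: Callable[[str], bytes]) -> bytes: # speech synthesis module
        """One scheduled turn: recognition -> parsing -> solving -> generation -> synthesis."""
        text = recognize(audio)
        parsed = parse(text)
        solution = solve(parsed)
        reply_text = generate(solution)
        return synthesize(reply_text)

    # Toy usage with stubs standing in for the real back ends.
    reply = dialog_turn(
        b"<audio bytes>",
        recognize=lambda a: "play some jazz and turn on the living room light",
        parse=lambda t: ParseResult(
            intents=["play_music", "control_device"],
            slots=[("play_music", "genre", "jazz"),
                   ("control_device", "device", "living room light")]),
        solve=lambda p: "Playing jazz; the living room light is on.",
        generate=lambda s: s,
        synthesize=lambda s: s.encode("utf-8"),
    )
    print(reply.decode("utf-8"))

Each module is passed in as a callable, so the same scheduling code could drive a real speech recognizer, a semantic parser such as the one sketched above, and a text-to-speech engine, or simple stubs during testing.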
PCT/CN2021/117251 2020-09-15 2021-09-08 Electronic device and semantic parsing method therefor, medium, and human-machine dialog system WO2022057712A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010970477.8A CN114186563A (en) 2020-09-15 2020-09-15 Electronic equipment and semantic analysis method and medium thereof and man-machine conversation system
CN202010970477.8 2020-09-15

Publications (1)

Publication Number Publication Date
WO2022057712A1 true WO2022057712A1 (en) 2022-03-24

Family

ID=80539263

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/117251 WO2022057712A1 (en) 2020-09-15 2021-09-08 Electronic device and semantic parsing method therefor, medium, and human-machine dialog system

Country Status (2)

Country Link
CN (1) CN114186563A (en)
WO (1) WO2022057712A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115440200B (en) * 2021-06-02 2024-03-12 上海擎感智能科技有限公司 Control method and control system of vehicle-mounted system
CN115292463B (en) * 2022-08-08 2023-05-12 云南大学 Information extraction-based method for joint multi-intention detection and overlapping slot filling
CN115934913B (en) * 2022-12-23 2024-03-22 国义招标股份有限公司 Carbon emission accounting method and system based on deep learning data generation
CN115906874A (en) * 2023-03-08 2023-04-04 小米汽车科技有限公司 Semantic parsing method, system, electronic device and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110309276A (en) * 2018-03-28 2019-10-08 蔚来汽车有限公司 Electric car dialogue state management method and system
CN110309277A (en) * 2018-03-28 2019-10-08 蔚来汽车有限公司 Human-computer dialogue semanteme parsing method and system
US20190385611A1 (en) * 2018-06-18 2019-12-19 Sas Institute Inc. System for determining user intent from text
CN110705267A (en) * 2019-09-29 2020-01-17 百度在线网络技术(北京)有限公司 Semantic parsing method, semantic parsing device and storage medium

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114818659B (en) * 2022-06-29 2022-09-23 北京澜舟科技有限公司 Text emotion source analysis method and system and storage medium
CN114818659A (en) * 2022-06-29 2022-07-29 北京澜舟科技有限公司 Text emotion source analysis method and system and storage medium
CN115358186A (en) * 2022-08-31 2022-11-18 南京擎盾信息科技有限公司 Slot position label generation method and device and storage medium
CN115358186B (en) * 2022-08-31 2023-11-14 南京擎盾信息科技有限公司 Generating method and device of slot label and storage medium
CN116050427B (en) * 2022-12-30 2023-10-27 北京百度网讯科技有限公司 Information generation method, training device, electronic equipment and storage medium
CN116050427A (en) * 2022-12-30 2023-05-02 北京百度网讯科技有限公司 Information generation method, training device, electronic equipment and storage medium
CN115934922A (en) * 2023-03-09 2023-04-07 杭州心识宇宙科技有限公司 Conversation service execution method and device, storage medium and electronic equipment
CN115934922B (en) * 2023-03-09 2024-01-30 杭州心识宇宙科技有限公司 Dialogue service execution method and device, storage medium and electronic equipment
CN116227496A (en) * 2023-05-06 2023-06-06 国网智能电网研究院有限公司 Deep learning-based electric public opinion entity relation extraction method and system
CN116227496B (en) * 2023-05-06 2023-07-14 国网智能电网研究院有限公司 Deep learning-based electric public opinion entity relation extraction method and system
CN116227629B (en) * 2023-05-10 2023-10-20 荣耀终端有限公司 Information analysis method, model training method, device and electronic equipment
CN116227629A (en) * 2023-05-10 2023-06-06 荣耀终端有限公司 Information analysis method, model training method, device and electronic equipment
CN116959442A (en) * 2023-07-29 2023-10-27 浙江阳宁科技有限公司 Chip for intelligent switch panel and method thereof
CN116959442B (en) * 2023-07-29 2024-03-19 浙江阳宁科技有限公司 Chip for intelligent switch panel and method thereof
CN117238277A (en) * 2023-11-09 2023-12-15 北京水滴科技集团有限公司 Intention recognition method, device, storage medium and computer equipment
CN117238277B (en) * 2023-11-09 2024-01-19 北京水滴科技集团有限公司 Intention recognition method, device, storage medium and computer equipment

Also Published As

Publication number Publication date
CN114186563A (en) 2022-03-15

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21868537

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21868537

Country of ref document: EP

Kind code of ref document: A1