WO2022057712A1 - Electronic device and semantic parsing method therefor, medium, and human-machine dialog system - Google Patents

Electronic device and semantic parsing method therefor, medium, and human-machine dialog system

Info

Publication number
WO2022057712A1
Authority
WO
WIPO (PCT)
Prior art keywords
slot
intent
word
semantic
corpus data
Prior art date
Application number
PCT/CN2021/117251
Other languages
French (fr)
Chinese (zh)
Inventor
童甜甜
祝官文
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2022057712A1 publication Critical patent/WO2022057712A1/en

Links

Images

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/205: Parsing
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295: Named entity recognition
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/26: Speech to text systems

Definitions

  • The present invention relates to the technical field of man-machine dialogue, and in particular to an electronic device and a semantic parsing method, medium, and man-machine dialogue system thereof.
  • Human-machine dialogue systems are increasingly applied in various intelligent terminal electronic devices, such as smart speakers, smartphones, in-vehicle intelligent systems such as in-vehicle voice navigation, and robots.
  • the human-computer dialogue system uses technologies such as speech recognition, semantic analysis and language generation to realize dialogue and information exchange between humans and machines.
  • the spoken language comprehension task in semantic parsing technology includes two sub-tasks, intent recognition and slot filling.
  • At present, intent recognition and slot filling are mainly aimed at single-intent and single-slot identification; that is, for the same utterance, the closest intent is selected from multiple intent recognition result options as the recognition result.
  • However, a multi-intent corpus and a single-intent corpus may have the same sentence pattern, and a single-intent classification model cannot distinguish a multi-intent corpus, which eventually leads to a high misclassification rate of the model, that is, a high error rate in the intent recognition and slot filling results.
  • In addition, the prior-art intent-slot identification architecture cannot explicitly model the relationship between intents and slots, its accuracy of intent identification and slot filling for multiple labels is poor, and it is not compatible with mixed single-intent and multi-intent scenarios.
  • Embodiments of the present application provide an electronic device, a semantic parsing method thereof, a medium, and a human-machine dialogue system. By recognizing, from a user's voice, multiple intents close to the user's true intent and then using the identified multiple intents to predict slot information, the accuracy of slot filling is improved and the speed or efficiency of slot filling is correspondingly improved, thereby improving the accuracy of semantic parsing in human-computer dialogue.
  • An embodiment of the present application provides a semantic parsing method. The method includes: acquiring corpus data to be parsed; determining, for a word included in the corpus data to be parsed, the degree of intent correlation between the word and the intent represented by the corpus data to be parsed, and the degree of slot correlation between the word and the slot represented by the corpus data to be parsed; and predicting the slot of the corpus data to be parsed based on the semantic information of the word, the above semantic information of the word, and the intent correlation degree and slot correlation degree of the word.
  • the corpus data can be obtained by performing voice recognition and conversion on the user's voice command.
  • The degree of intent correlation between a word included in the corpus data to be parsed and the intent represented by the corpus data to be parsed can be represented by an intent attention vector, and the degree of slot correlation between the word and the slot represented by the corpus data to be parsed can be represented by a slot attention vector.
  • The semantic information of a word can be understood as the word meaning information of the word, that is, the literal meaning of the word and the meaning it refers to, for example when the word appears in a sentence as part of a noun (e.g., in the song title 'Hello Old Days').
  • The above semantic information of a word can be the semantic information of the word immediately preceding the current word in the corpus data; if the currently processed word is the first word, the above semantic information can be the sentence semantic information of the corpus data.
  • The above semantic information is used mainly because of its significance for the slot prediction of the current word.
  • The above semantic information can be expressed by the hidden state vector output at the previous moment (relative to the current moment).
  • The above method further includes: predicting multiple intents from the corpus data to be parsed; and determining, from the predicted slots, the slot corresponding to each of the multiple intents.
  • That is, multiple intents are obtained by parsing the corpus data converted from a user's voice command. If the corpus data only contains a single intent, the present application can also be applied to parse the single intent in such single-intent corpus data, so the method has a certain versatility and provides a better user experience.
  • each intent should correspond to at least one slot, and some intents may have three or more slots corresponding to it.
  • the present application can accurately sort out the correspondence between multiple intents and multiple slots.
  • In the above method, the above semantic information includes the semantic information of at least one word located before the word in the corpus data to be parsed.
  • the above semantic information of the first word is the sentence semantic information of this piece of corpus data.
  • The above semantic information of the second word is the semantic information of the first word, and at this time the semantic information of the first word contains the information that the sentence semantic information of the corpus data passed to the first word at the first moment.
  • The above semantic information of each subsequent word is the semantic information of the previous word, and the semantic information of the previous word includes the semantic information transmitted from the word before it; this transfer relationship is progressive.
  • The word meaning correlation between two adjacent words is the largest, while the correlation between two non-adjacent words is smaller, and the correlation gradually approaches 0 as the number of words between them increases.
  • the method further includes: generating sentence semantic information of the corpus data to be parsed and semantic information of each word in the corpus data to be parsed.
  • The sentence character representing the sentence in the corpus data is encoded by the encoder, so that the sentence character can express specific semantic information, and this specific semantic information is the same as or close to the semantic information obtained when a human understands the sentence.
  • The word character of each word in the corpus data is encoded by the encoder, so that the word character can express specific word meaning information, and this specific word meaning information is the same as or close to the word meaning information obtained when a human understands the word in the sentence.
  • sentence semantic information of the corpus data can be represented by a sentence vector
  • word meaning information of each word in the corpus data can be represented by a word vector.
  • the above-mentioned method further includes: the method is implemented by a neural network model.
  • the neural network model includes a fully connected layer and a long short-term memory network model.
  • a semantic parsing model is trained through a neural network model combined with a BERT model, an attention mechanism, a slot gate mechanism, and a sigmoid activation function, enabling it to implement the above method.
  • In the above method, the sentence semantic information of the corpus data to be parsed, the above semantic information of the word, and the intent correlation degree and slot correlation degree of the word are represented in the form of vectors in the neural network model.
  • For example, the sentence semantic information of the corpus data to be parsed is represented by a sentence vector, the above semantic information of the word is represented by the hidden state vector at the previous moment, and the intent correlation degree and slot correlation degree of the word are represented by the intent attention vector and the slot attention vector, respectively.
  • An embodiment of the present application provides a man-machine dialogue method, which includes: receiving a user voice command; converting the user voice command into corpus data to be parsed in text form; parsing the intents in the corpus data and the slot corresponding to each intent; and, based on the parsed intents and the slot corresponding to each intent, executing the operation corresponding to the user's voice command or generating a response voice.
  • the method further includes: the operations include one or more of sending instructions to the smart home device, opening application software, searching web pages, making calls, and sending and receiving short messages.
  • For example, if the parsed intents are to book a ticket and to book a hotel, and the slots corresponding to these two intents are the departure, destination, (hotel) location, and (hotel) star rating, then the operation performed by the smartphone may be to open ticket and hotel reservation software, query the ticket information corresponding to the departure and destination for the user to choose from, and recommend a five-star hotel in a certain location for the user to select.
  • The electronic device may include, but is not limited to, laptop computers, desktop computers, tablet computers, smartphones, wearable devices, portable music players, reader devices, or other electronic devices capable of accessing a network.
  • An embodiment of the present application provides a human-machine dialogue system. The system includes: a speech recognition module for converting a user's voice command into corpus data in text form; a semantic parsing module for performing the above semantic parsing method; a problem solving module for finding a solution for the results obtained by the semantic parsing module; a language generation module for generating natural language sentences corresponding to the solution; a speech synthesis module for synthesizing the natural language sentences into a response voice; and a dialogue management module for scheduling the speech recognition module, the semantic parsing module, the problem solving module, the language generation module, and the speech synthesis module to cooperate with each other to realize the man-machine dialogue.
  • an embodiment of the present application provides a readable medium, where an instruction is stored on the readable medium, and the instruction, when executed on an electronic device, causes the electronic device to execute the above semantic parsing method or the above man-machine dialogue method.
  • An embodiment of the present application provides an electronic device, including: a memory for storing instructions executed by one or more processors of the electronic device; and a processor, which is one of the processors of the electronic device and is used for executing the above semantic parsing method or the above man-machine dialogue method.
  • FIG. 1 is a schematic software block diagram of a common man-machine dialogue system;
  • FIG. 2 is a schematic diagram of a man-machine dialogue scene to which an embodiment of the present application is applicable;
  • FIG. 3 is a schematic structural diagram of an exemplary structure of a semantic parsing model in an embodiment of the present application
  • FIG. 4 is a schematic diagram of processing results of corpus data at different stages in the semantic parsing method according to an embodiment of the present application
  • FIG. 5 is a schematic diagram of a training process of a semantic parsing model in the semantic parsing method according to an embodiment of the present application
  • FIG. 6 is a schematic diagram of an interaction flow between a mobile phone 100 and a user according to an embodiment of the present application
  • FIG. 7 is a schematic interface diagram of a mobile phone 100 according to an embodiment of the present application performing corresponding operations according to user voice commands;
  • FIG. 8 is an exemplary structural diagram of a mobile phone 100 according to an embodiment of the present application.
  • Illustrative embodiments of the present application include, but are not limited to, electronic devices and semantic parsing methods and media thereof.
  • the embodiment of the present application first identifies multiple intents close to the user's true intent from the user's voice, and then uses the identified multiple intents to predict slot information, thereby improving the accuracy of slot filling, and correspondingly The speed or efficiency of slot filling is improved, thereby improving the accuracy of semantic parsing in human-machine dialogue.
  • NLP (Natural Language Processing): natural language is human language, and natural language processing is the processing of human language, that is, the process of systematically analyzing, understanding, and extracting information from text data in an intelligent and efficient manner.
  • Common natural language processing tasks include NER (Named Entity Recognition), RE (Relation Extraction), IE (Information Extraction), sentiment analysis, speech recognition, question answering, topic segmentation, and the like.
  • In general, natural language processing tasks can fall into the following categories.
  • Sequence tagging: for each word in a sentence, the model gives a categorical label based on the context, for example Chinese word segmentation, part-of-speech tagging, named entity recognition, and semantic role labeling.
  • Classification tasks: a single classification value is output for the entire sentence, such as text classification.
  • Sentence relationship inference: given two sentences, determine whether the two sentences have a certain relationship, for example entailment, QA, semantic rewriting, and natural language inference.
  • Generation tasks: given a piece of text, output another piece of text.
  • Intent: each voice command input by the user corresponds to a user intention. It is understandable that the so-called intent is the expression of the user's will. In a human-machine dialogue system, an intent is generally named in the form "verb + noun", for example checking the weather or booking a hotel.
  • Intent recognition, also known as intent classification, mainly extracts the intent corresponding to the current voice command from the voice command input by the user.
  • An intent is a collection of one or more expressions; for example, "I want to watch a movie" and "I want to see an action movie made by a certain star in a certain year" can belong to the same intent of playing a video.
  • An intent can be configured with one or more slots.
  • the slot is the key information used to express the user's intention, and the accuracy of the slot filling directly affects whether the electronic device can match the correct intention.
  • a slot corresponds to a keyword of a type of attribute, and the information in the slot can be filled with keywords of the same type, that is, slot filling.
  • For example, the query pattern corresponding to the intent of playing a song could be "I want to hear {song} of {singer}", where {singer} is the singer slot and {song} is the song slot.
  • The electronic device can extract from the voice command the slot information filled into the {singer} slot as "Faye Wong" and the slot information filled into the {song} slot as "Red Bean". In this way, the electronic device (or server) can identify, from the two pieces of slot information, that the user's intent for this voice input is to play Faye Wong's song "Red Bean".
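  • As an illustration of the slot-filling concept above, the following sketch (not part of the patent; the function name and rule-based matching are hypothetical) shows how the query pattern "I want to hear {song} of {singer}" could be turned into an intent plus filled slots:

    # A minimal, rule-based sketch of slot filling for the query pattern
    # "I want to hear {song} of {singer}". Purely illustrative.
    def parse_play_song(utterance: str) -> dict:
        prefix = "I want to hear "
        if utterance.startswith(prefix) and " of " in utterance:
            # everything between the prefix and the last " of " fills {song},
            # the remainder fills {singer}
            song, singer = utterance[len(prefix):].rsplit(" of ", 1)
            return {"intent": "PLAY_MUSIC", "slots": {"song": song, "singer": singer}}
        return {"intent": "UNKNOWN", "slots": {}}

    print(parse_play_song("I want to hear Red Bean of Faye Wong"))
    # {'intent': 'PLAY_MUSIC', 'slots': {'song': 'Red Bean', 'singer': 'Faye Wong'}}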
  • the semantic parsing method of the present application is suitable for various scenarios requiring semantic parsing, for example, a user sends a voice command to an intelligent electronic device, and a user conducts a man-machine dialogue with a voice assistant of the intelligent electronic device.
  • the following introduces the semantic parsing solution of the present application based on the human-machine dialogue system.
  • A common human-machine dialogue system 110 mainly includes the following six technical modules: a speech recognition module 111, a semantic parsing module 112, a problem solving module 113, a language generation module 114, a dialogue management module 115, and a speech synthesis module 116. Among them:
  • The speech recognition module 111 is used to realize speech-to-text recognition and conversion through automatic speech recognition (Automatic Speech Recognition, ASR) technology.
  • The recognition result, that is, the output corpus data, is generally in the form of the top n (n ≥ 1) sentences or word lattices with the highest scores.
  • The semantic parsing module 112, also known as the Natural Language Understanding (NLU) module, is mainly used for performing natural language processing (NLP) tasks, including semantically parsing and identifying the corpus data output by the speech recognition module.
  • the function of the semantic parsing module is implemented by a pre-trained semantic parsing model 121, and the semantic parsing model 121 will be described in detail below, and will not be repeated here.
  • the problem solving module 113 is mainly used for reasoning or querying according to the intention identified by the semantic analysis and the corresponding slot, so as to feed back the solution corresponding to the intention and the corresponding slot to the user.
  • The language generation module 114 mainly generates, for the solution found by the problem solving module 113 that needs to be output to the user, a natural language sentence, which is fed back to the user as text or further converted into voice.
  • The dialogue management module 115 is the central hub of the human-machine dialogue system. It is used to schedule, based on the dialogue history, the cooperation of the other modules in the human-computer interaction system, assist the semantic parsing module in correctly understanding the speech recognition results, provide assistance to the problem solving module, and guide the natural language generation process of the language generation module.
  • the speech synthesis module 116 is used for converting the natural language sentences generated by the language generation module into speech output.
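  • To make the division of labor among the six modules concrete, the following sketch (assumed interfaces, not the patent's code) shows how the dialogue management module 115 could schedule the other modules for a single dialogue turn:

    # A minimal sketch of one dialogue turn; each module is passed in as a callable.
    def run_turn(audio, asr, nlu, solver, nlg, tts):
        text = asr(audio)          # speech recognition module 111: audio -> text
        parse = nlu(text)          # semantic parsing module 112: text -> intents + slots
        solution = solver(parse)   # problem solving module 113: intents + slots -> solution
        sentence = nlg(solution)   # language generation module 114: solution -> sentence
        return tts(sentence)       # speech synthesis module 116: sentence -> response voice

    # Usage with trivial stand-in functions:
    reply = run_turn(
        b"<pcm audio>",
        asr=lambda a: "play Red Bean",
        nlu=lambda t: {"intent": "PLAY_MUSIC", "slots": {"song": "Red Bean"}},
        solver=lambda p: "playing Red Bean",
        nlg=lambda s: "OK, " + s,
        tts=lambda s: s.encode("utf-8"),
    )
    print(reply)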
  • FIG. 2 shows a schematic diagram of a man-machine dialogue scene according to an embodiment of the present application.
  • the application scenario includes the electronic device 100 and the electronic device 200 .
  • the electronic device 100 is a terminal intelligent device that interacts with a user, and an application system capable of semantic analysis, such as the above-mentioned human-machine dialogue system 110 , is installed thereon.
  • the electronic device 100 can recognize the user's voice command through the man-machine dialogue system 110, and perform corresponding operations according to the voice command or answer the questions raised by the user.
  • the electronic device 100 may include, but is not limited to, smart speakers, smart phones, wearable devices, head-mounted displays, in-vehicle intelligent systems such as in-vehicle intelligent voice navigation, as well as intelligent robots, portable music players, and readers.
  • the electronic device 200 can be used to train the semantic parsing model 121 , and transplant the trained semantic parsing model 121 to the electronic device 100 for the electronic device 100 to perform semantic parsing and perform corresponding operations.
  • the electronic device 200 can also perform semantic parsing on the corpus data sent by the electronic device 100 through the trained semantic parsing model 121, and feed the result back to the electronic device 100, and the electronic device 100 further performs corresponding operations.
  • The electronic device 200 may include, but is not limited to, clouds, servers, laptops, desktops, tablet computers, and other electronic devices capable of accessing a network, with one or more processors embedded or coupled therein.
  • the technical solutions of the present application are described in detail below by taking the electronic device 100 as a mobile phone and the electronic device 200 as a server as an example.
  • the mobile phone 100 is installed with the human-machine dialogue system 110, and the semantic analysis module 112 in the human-machine dialogue system 110 has a semantic analysis model 121, which can perform semantic analysis on user speech based on the technical solution of the present application.
  • the semantic parsing model 121 of the present application will be described in detail below.
  • the semantic parsing model 121 is a natural language processing model pre-trained by the server 200 based on natural language processing and the above-mentioned various neural network structures and models.
  • The pre-trained semantic parsing model 121 can extract multiple intents in a single piece of corpus data and predict slots based on the multiple intents, so as to accurately identify the intents and corresponding slots in the corpus data, which can greatly improve the accuracy of slot filling.
  • the data input into the semantic parsing model 121 is the data obtained after preprocessing the corpus data, wherein the corpus data is obtained after the user's voice instruction is recognized and transformed.
  • the preprocessing of the corpus data is a routine operation for understanding text in the human-machine dialogue system 110, and is one of the natural language processing tasks performed by the semantic parsing module 112.
  • Preprocessing generally includes performing word segmentation on the corpus data, filling and marking the Token sequence, adding segmentation marks (Segmentation), and creating masks.
  • The data preprocessing finally obtains the Token sequence containing the sentence character of the sentence and the word characters of each word in the sentence, the segmentation marks representing the sentence position corresponding to each word, and the corresponding mask indicating whether each character position in the Token sequence is a valid character.
  • The word segmentation process mainly uses word segmentation tools (such as a Chinese vocabulary) to divide the corpus data into sentences and the individual words that make up the sentences, and to mark the obtained sentences and words with possible intent labels and slot labels.
  • word segmentation processing is to prepare data for the next step of filling the Token sequence.
  • For example, for the corpus data "please play Hello Old Times for me", the possible intent labels marked for the sentence are PLAY_MUSIC, PLAY_VIDEO, and PLAY_VOICE, and the slot labels marked for each word are: the word "you" corresponds to the three slot labels songName-B, videoName-B, and mediaName-B, and the four words "good", "old", "time", and "light" each correspond to the three slot labels (songName-I, videoName-I, mediaName-I).
  • Forming the Token sequence mainly uses the data obtained by word segmentation to obtain a Token sequence that meets the character length requirements by truncating sentences or filling characters.
  • the Token sequence contains sentence characters corresponding to the entire sentence of the voice command, and word characters corresponding to each word in the sentence.
  • The first character in the Token sequence is generally [CLS], which marks the sentence obtained by word segmentation (for example, the character [CLS] marks the sentence "please play Hello Old Times for me"), and the ending character in the Token sequence is generally the truncation character [SEP], which indicates that the preceding sentence is a complete sentence meeting the character length requirement for a single sentence; the characters between [CLS] and [SEP] together form a complete sentence.
  • The words are marked with the segmentation mark "Sentence 1", indicating that these words are the words that make up Sentence 1.
  • The number of characters in the user instruction plus 2 (for the [CLS] and [SEP] characters) is required to meet the maximum character length requirement; generally, the number of characters contained in the user instruction plus 2 is within the maximum character length of 32.
  • Creating a mask is mainly to create a mask (Mask) corresponding to each character in the Token sequence obtained by the above filling.
  • the purpose of creating a mask is to mark whether each character in the Token sequence expresses valid information into a computer-readable marking code.
  • For example, the value of the mask element created for the character [pad] in the Token sequence is 0, and the value of the mask element created for characters other than [pad] is 1.
  • Token sequence: [CLS] please play Hello Old Times for me [pad] ... [pad] [SEP];
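  • The preprocessing described above can be sketched as follows (a minimal illustration assuming character-level tokenization and a maximum length of 32; the function and token spellings are not the patent's API):

    # Build the Token sequence, segmentation marks and mask for one piece of corpus data.
    MAX_LEN = 32

    def preprocess(sentence: str):
        chars = list(sentence)[: MAX_LEN - 2]               # leave room for [CLS] and [SEP]
        tokens = ["[CLS]"] + chars + ["[SEP]"]
        tokens += ["[pad]"] * (MAX_LEN - len(tokens))        # fill up to the fixed length
        segments = [1] * MAX_LEN                             # single-sentence case: all Sentence 1
        mask = [0 if t == "[pad]" else 1 for t in tokens]    # 1 = valid character, 0 = padding
        return tokens, segments, mask

    tokens, segments, mask = preprocess("please play Hello Old Times for me")
    print(tokens[:3], mask[:3], mask[-1])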
  • For another example, if the corpus data recognized from the voice command input by the user is "help me book a train ticket from Shanghai to Beijing and book a five-star hotel near Beijing Railway Station",
  • the data obtained after the above data preprocessing include:
  • Token sequence: [CLS] help me book a train ticket from Shanghai to Beijing and book a five-star hotel near Beijing Railway Station [SEP];
  • Segmentation mark: Sentence 1 (each character of "help me book a train ticket from Shanghai to Beijing and book a five-star hotel near Beijing Railway Station" is marked as belonging to Sentence 1).
  • the three data obtained after the above-mentioned data preprocessing of the corpus data can be input into the semantic parsing model 121 for semantic parsing.
  • the semantic parsing model 121 will be described in detail below.
  • the semantic parsing model 121 includes a BERT encoding layer 1211, an intent classification layer 1212, an attention layer 1213, a slot filling layer 1214, and a post-processing layer 1215.
  • The BERT encoding layer 1211 takes as input the Token sequence, segmentation marks and mask obtained after data preprocessing of the corpus data, and outputs the encoded vector sequence after encoding.
  • The encoding vector sequence includes a sentence vector and word vectors.
  • the sentence vector represents the semantic information of the corpus data to be parsed
  • the word vector contains the lexical information of each word in the corpus data to be parsed.
  • semantic information and word meaning information are the meaning expressions of corpus data based on natural language understanding, and these semantic information and word meaning information can express the real intention of the user and the real slot corresponding to the real intention of the user.
  • For example, for the corpus data "please play Hello Old Times for me", the semantic information represented by the sentence vector h0 may include PLAY_MUSIC, PLAY_VIDEO, PLAY_VOICE, hello, old times, Hello Old Times, and so on.
  • The word meaning information represented by the word vectors h1, h2, ..., ht may include songName, videoName, mediaName and the literal meaning of each word that makes up the sentence, where the word corresponding to h1 is "please", the word corresponding to h2 is "for", the word corresponding to h3 is "me", ..., and the word corresponding to h10 is "light".
  • the corpus data to be parsed is "help me book a train ticket from Shanghai to Beijing and book a five-star hotel near Beijing Railway Station"
  • In the encoding vector sequence {h0, h1, h2, ..., ht} output by the BERT encoding layer 1211, the semantic information represented by the sentence vector h0 may include booking a ticket, booking a hotel, departure place, destination, Shanghai, Beijing, hotel, star rating, five-star, and so on.
  • The word meaning information represented by the word vectors h1, h2, ..., ht may include the departure place, destination, Shanghai, Beijing, hotel, star rating, five-star, and the literal meaning of each word composing the sentence, where the word corresponding to h1 is "help", the word corresponding to h2 is "me", the word corresponding to h3 is "book", ..., and the word corresponding to h30 is "shop".
  • The Token sequence, segmentation marks and the mask generated for the Token sequence, obtained after data preprocessing, are used as the input of the BERT encoding layer 1211.
  • The BERT encoding layer 1211 sequentially identifies the valid characters [CLS], x1, x2, ..., xt-1, [SEP] in the Token sequence according to the mask (the mask element value of a valid character is 1, and the mask element value of a blank character is 0).
  • The character [CLS] that marks the sentence in the Token sequence is input into the trained BERT encoding layer 1211 for semantic encoding, so that the character [CLS] is assigned the semantic information of the corpus data, generating a high-dimensional sentence vector h0.
  • The characters x1, x2, ..., xt-1 between the character [CLS] and the truncation character [SEP] in the Token sequence correspond to the words that make up the sentence in the corpus data. The characters x1, x2, ..., xt-1 are input into the trained BERT encoding layer 1211 for semantic encoding, which assigns the semantic information of the corpus data to the characters x1, x2, ..., xt-1, correspondingly generating the high-dimensional word vectors h1, h2, ..., ht.
  • The mask element value corresponding to the blank character [pad] in the Token sequence is 0 and it does not mark any word, so it is not used as an input of the BERT encoding layer 1211.
  • the BERT coding layer 1211 can be obtained by training based on the BERT model.
  • the BERT model is a multi-layer bidirectional transformer encoder model based on fine-tuning, and the key technological innovation of the BERT model is to apply the bidirectional training of the transformer to language modeling.
  • a striking feature of the BERT model is its unified architecture across different tasks, so there is little difference between its pretrained architecture and the final downstream architecture.
  • the BERT model can further increase the generalization ability of the word vector model, and fully describe the character-level, word-level, sentence-level and even inter-sentence relationship features.
  • the BERT encoding layer 1211 can also be obtained by training other encoders or encoding models, which is not limited here.
  • the intent classification layer 1212 is used to predict candidate intents in the corpus data, wherein the intent classification layer 1212 can extract multiple intent labels in the corpus data, and retain the intent labels that meet the conditions as candidate intent outputs.
  • The intent classification layer 1212 takes the sentence vector h0 obtained by the above BERT encoding layer 1211 as input. Based on the semantic information represented by the sentence vector h0, the intent classification layer 1212 can extract all possible intent labels, and for each extracted intent label it calculates an intent confidence to judge whether the intent label satisfies the output condition.
  • the intent confidence represents the closeness of the extracted intent label to the real intent expressed by the corpus data, and may also be referred to as intent reliability.
  • the intent with higher intent confidence is closer to the real intent expressed by the corpus data.
  • a certain threshold can be set for the intent confidence, for example, the threshold of the intent confidence is set to 0.5, and the intent label whose intent confidence is greater than or equal to the threshold satisfies the output condition, and the corresponding intent label will be output As a candidate intent; an intent label whose intent confidence is less than the threshold does not meet the output conditions, and its corresponding intent label will be deleted and will not be output from the intent classification layer 1212 .
  • For example, for the corpus data "please play Hello Old Times for me", the semantic information represented by the sentence vector h0 output by the BERT encoding layer may include 3 possible intent labels: PLAY_MUSIC, PLAY_VIDEO, PLAY_VOICE.
  • the intent classification layer 1212 extracts the above three possible intent labels, and calculates the intent confidence of each intent label as 0.8, 0.75, and 0.5, respectively.
  • If the intent confidence threshold set by the intent classification layer 1212 is 0.5, then the intent confidences of the above three intent labels all satisfy the condition of being greater than or equal to 0.5, that is, all three intent labels satisfy the output condition, and the intent classification layer 1212 finally outputs three candidate intents: PLAY_MUSIC, PLAY_VIDEO, PLAY_VOICE.
  • For another example, for the corpus data "help me book a train ticket from Shanghai to Beijing and book a five-star hotel near Beijing Railway Station", the semantic information represented by the sentence vector h0 output by the BERT encoding layer may include 4 possible intent labels: checking train times, booking tickets, finding hotels, and booking hotels.
  • the intent classification layer 1212 extracts the above four possible intent labels, and calculates the intent confidence of each intent label as 0.48, 0.87, 0.45, and 0.7, respectively.
  • If the intent confidence threshold set by the intent classification layer 1212 is 0.5, then among the above four intent labels, the labels whose intent confidence is greater than or equal to 0.5 are booking tickets and booking hotels, which satisfy the output conditions, so the intent classification layer 1212 outputs 2 candidate intents: book tickets and book hotels. The two intent labels whose intent confidence is less than 0.5, checking train times and finding hotels, do not meet the output conditions and are therefore not output from the intent classification layer 1212.
  • the working process of the intent classification layer 1212 is shown in FIG. 3 :
  • The intent classification layer 1212 takes the sentence vector h0 in the encoding vector sequence output by the BERT encoding layer 1211 as input. By decoding and activating the sentence vector h0, the intent classification layer 1212 extracts all possible intent labels in the semantic information represented by h0 and computes the intent confidence yI for each intent label.
  • The intent confidence yI obtained after the sigmoid activation function can be written as yI = sigmoid(WI · h0 + bI), where I represents the number of intents, WI is the random weight coefficient applied to the sentence vector h0, and bI represents the bias value.
  • In this embodiment, the intent classification layer 1212 can be obtained by training a fully connected layer (dense) with a sigmoid function as the activation function.
  • In other embodiments, a deep neural network with the same function as the fully connected layer can also be used as the decoder, and other functions with the same function as the sigmoid function can also be used as the activation function of the corresponding deep neural network decoder; there is no restriction here.
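  • A minimal sketch of this multi-label intent classification, using numpy with random placeholder weights (shapes and names are assumptions, not the trained model), is:

    import numpy as np

    rng = np.random.default_rng(0)
    INTENTS = ["PLAY_MUSIC", "PLAY_VIDEO", "PLAY_VOICE"]
    HIDDEN = 8                                       # toy size of the sentence vector h0

    W_I = rng.normal(size=(len(INTENTS), HIDDEN))    # random weight coefficients
    b_I = np.zeros(len(INTENTS))                     # bias values

    def candidate_intents(h0, threshold=0.5):
        # sigmoid over a fully connected layer: one confidence per intent label
        y_I = 1.0 / (1.0 + np.exp(-(W_I @ h0 + b_I)))
        # keep only the intent labels whose confidence meets the output condition
        return [(name, float(c)) for name, c in zip(INTENTS, y_I) if c >= threshold]

    h0 = rng.normal(size=HIDDEN)                     # stands in for the BERT sentence vector
    print(candidate_intents(h0))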
  • The attention layer 1213 is used to quantify the degree of correlation between each word in the corpus data and the intent expressed by the sentence, which can be represented, for example, by an intent attention vector (the intent attention vector can also be understood as an intent context vector); the attention layer 1213 is also used to quantify the degree of correlation between each word in the corpus data and the slot expressed by the sentence, represented, for example, by a slot attention vector.
  • The intent attention vector output by the attention layer 1213 is used as an input of the slot filling layer 1214 to guide slot prediction and improve its accuracy; the slot attention vector output by the attention layer 1213 is used as a bias value of the slot calculation to correct the deviation of the slot prediction.
  • The attention layer 1213 takes the encoding vector sequence output by the BERT encoding layer 1211 as input. Based on the semantic information represented by the sentence vector h0 and the word meaning information represented by the word vectors h1, h2, ..., ht, the intent attention vector output by the attention layer can be understood as quantifying the degree of correlation between the word corresponding to each word vector and the intent expressed by the sentence corresponding to the sentence vector, and the slot attention vector output by the attention layer can be understood as quantifying the degree of correlation between the word corresponding to each word vector and the slot expressed by the sentence corresponding to the sentence vector.
  • For example, for the corpus data "please play Hello Old Times for me", the semantic information represented by the sentence vector h0 in the encoding vector sequence output by the BERT encoding layer may include 3 possible intent labels (PLAY_MUSIC, PLAY_VIDEO, PLAY_VOICE), and the word meaning information represented by the word vectors h1, h2, ..., ht may include songName, videoName, mediaName, and the literal meaning of each word that composes the sentence.
  • In the intent attention vector CI output by the attention layer 1213 (corresponding to "please play Hello Old Times for me, play, play"), the intent expressed by the sentence "please play Hello Old Times for me" may be PLAY_MUSIC, PLAY_VIDEO, or PLAY_VOICE; "play, play" have a relatively high degree of correlation with the intent expressed by the sentence, while "you, good, old, time, light, please, for, me" have a low degree of correlation or no correlation with the intent expressed by the sentence.
  • For example, a correlation degree of 0.9 means that the degree of correlation is relatively high; in the end, it can be concluded that the degree of correlation between "you, good, old, time, light" and the above three slots is relatively high, and that "play, play, please, for, me" have a low degree of correlation or no correlation with the slots expressed by the sentence.
  • For another example, for the corpus data "help me book a train ticket from Shanghai to Beijing and book a five-star hotel near Beijing Railway Station", in the intent attention vector output by the attention layer 1213, the intent expressed by the sentence may be to book a ticket or to book a hotel; the words "book, book, train, train, ticket, hotel, shop" have a relatively high degree of correlation with the intent expressed by the sentence, while words such as "Shanghai, sea, Beijing, Beijing, fire, train, station, five, star, grade, help, me" have a low degree of correlation or no correlation with the intent expressed by the sentence.
  • For example, if the degree of correlation between the character "Shang" (the first character of Shanghai) and the slot "departure" is 0.9, while its degree of correlation with the other 3 slots (destination, location, star rating) is 0.3, this indicates that the character has a high degree of correlation with the slot "departure" and a low degree of correlation with the other three slots.
  • The attention layer 1213 takes the encoding vector sequence {h0, h1, h2, ..., ht} output by the BERT encoding layer 1211 as input. The attention layer 1213 extracts the semantic information represented by the sentence vector h0 and the word meaning information represented by the word vectors h1, h2, ..., ht, and outputs a hidden state vector at each time step t, which represents the semantic information and word meaning information extracted before the current time step t (that is, up to time t-1).
  • The attention vector calculation formula based on the attention mechanism is given as formula (2).
  • When computing the intent attention vector, Q in formula (2) represents the sentence vector h0 in the encoding vector sequence input to the attention layer 1213, and V represents the word vectors h1, h2, ..., ht in the encoding vector sequence input to the attention layer 1213 at each time step t; the attention vector obtained by formula (2) quantifies the degree of correlation between each word vector and the sentence vector.
  • Since the semantic information represented by the sentence vector h0 contains all possible intent label information, the sentence vector h0 is combined with the attention vector calculated by formula (2) to obtain the intent attention vector CI, and the obtained intent attention vector CI is used to quantify the degree of correlation between the word corresponding to each word vector and the intent expressed by the sentence corresponding to the sentence vector.
  • When computing the slot attention vector, Q in formula (2) represents the hidden state vector C output by the attention layer 1213 at the previous moment (time t-1), and V represents the encoding vector sequence {h0, h1, h2, ..., ht} input to the attention layer 1213.
  • In this way, the attention vector obtained by formula (2) can be combined with the hidden state vector at the previous moment to learn the correlation degree of the word vector processed at the current moment t.
  • The hidden state vector C output at time t-1 is combined with the attention vector calculated by formula (2) to obtain the slot attention vector, and the resulting slot attention vector is used to quantify the degree of correlation between the word corresponding to each word vector and the slot expressed by the sentence corresponding to the sentence vector.
  • In this embodiment, the attention layer 1213 can be obtained by training a Long Short-Term Memory (LSTM) model together with an attention mechanism; for the specific training process, please refer to the detailed description below, which will not be repeated here.
  • In other embodiments, other neural network models and mechanisms that have the same functions as the LSTM model and the attention mechanism, that is, that are used to learn the degree of correlation between the words in a sentence and the intent or slot expressed by the sentence, can also be used; there is no restriction here.
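  • The role of the attention layer can be illustrated with a small numpy sketch of dot-product attention (an illustrative reconstruction under assumed shapes, not the patent's exact formula (2)): using the sentence vector h0 as the query yields an intent attention vector, and using the previous hidden state as the query yields a slot attention vector.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def attention(query, values):
        # weight each word vector by its correlation with the query, then sum
        scores = values @ query        # one score per word vector
        weights = softmax(scores)      # degree of correlation with the query
        return weights @ values        # attention (context) vector

    rng = np.random.default_rng(0)
    h0 = rng.normal(size=8)            # sentence vector
    words = rng.normal(size=(10, 8))   # word vectors h1..h10
    prev_hidden = rng.normal(size=8)   # hidden state output at time t-1

    c_intent = attention(h0, words)          # intent attention vector CI
    c_slot = attention(prev_hidden, words)   # slot attention vector
    print(c_intent.shape, c_slot.shape)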
  • The slot filling layer 1214 is used to predict candidate slots in the corpus data and fill in the slot values. The slot filling layer 1214 can predict multiple slot labels in the corpus data and retain the slot labels that meet the conditions as candidate slot outputs.
  • The slot filling layer 1214 takes as input the encoding vector ht output by the BERT encoding layer 1211, the hidden state vector C output by the attention layer 1213 at time t-1 (that is, the semantic information of the sentence before the currently processed word or the word meaning information of the preceding word), and the intent attention vector CI and slot attention vector output by the attention layer 1213 at time t, and outputs the candidate slot at time t.
  • The slot filling layer 1214 predicts possible slot labels based on the four input vectors at each time step t, and calculates the slot position reliability of each predicted slot label to determine whether the slot label satisfies the output condition.
  • That is, the slot filling layer 1214 obtains the possible slot labels of the corpus data to be parsed based on the encoding vector including the word vector (containing the word meaning information of each word in the corpus data to be parsed), the semantic information of the sentence before the currently processed word or the word meaning information of the preceding word, the degree of correlation between the currently processed word and the intent expressed by the sentence, and the degree of correlation between the currently processed word and the slot expressed by the sentence; it then calculates the slot position reliability of each slot label, and outputs as candidate slots the slot labels that satisfy the condition, that is, whose degree of correlation with the slot actually expressed by the corpus data to be parsed exceeds the threshold.
  • the slot position reliability represents the closeness of the predicted slot label to the actual slot expressed by the corpus data, and may also be referred to as slot reliability.
  • the slot label with higher slot position reliability is closer to the real slot expressed by the corpus data.
  • For example, a certain threshold can be set for the slot position reliability, such as 0.5: a slot label whose slot position reliability is greater than or equal to the threshold satisfies the output condition and is output as a candidate slot, while a slot label whose slot position reliability is less than the threshold does not meet the output condition, and its corresponding slot label is deleted and not output from the slot filling layer 1214.
  • For example, for the corpus data "please play Hello Old Times for me", suppose the threshold set for slot position reliability in the slot filling layer 1214 is 0.5. Among the slot labels predicted by the slot filling layer 1214 for the 5 words "please, for, me, play, play", the slot position reliability of slot O (for example, 0.7) is greater than or equal to 0.5, and the slot position reliability of the other slots (such as songName, for example 0.3) is less than 0.5. Therefore, the candidate slots output for the five words "please, for, me, play, play" are all the O slot.
  • Among the slot labels predicted by the slot filling layer 1214 for the five words "you, good, old, time, light", the slot position reliabilities of songName, videoName, and mediaName (for example, 0.86, 0.7, and 0.55) are greater than or equal to 0.5, and the slot position reliability of slot O (for example, 0.3) is less than 0.5. Therefore, the candidate slots output for "you" are songName-B, videoName-B, and mediaName-B, and the candidate slots output for "good, old, time, light" are songName-I, videoName-I, and mediaName-I, where B marks the word at the starting position of a name (meaning that "you" is the first word of the name) and I marks a word after the start of the name. Since the O slot represents an empty or unimportant slot, the slot filling layer 1214 finally outputs the three candidate slots songName, videoName, and mediaName, and fills each candidate slot with the slot value "Hello Old Times".
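  • The B/I labels in the example above can be merged into slot values as sketched below (simplified to a single label per character; in the model each character can carry several candidate slot labels at once):

    # Merge B/I-tagged characters into slot values; O-tagged characters are skipped.
    def merge_bio(chars, tags):
        slots = {}
        for ch, tag in zip(chars, tags):
            if tag == "O":
                continue
            name, pos = tag.rsplit("-", 1)
            if pos == "B":
                slots[name] = ch                 # start a new slot value
            elif pos == "I" and name in slots:
                slots[name] += ch                # extend the current slot value
        return slots

    chars = ["please", "for", "me", "play", "play", "you", "good", "old", "time", "light"]
    tags = ["O", "O", "O", "O", "O",
            "songName-B", "songName-I", "songName-I", "songName-I", "songName-I"]
    print(merge_bio(chars, tags))    # the tagged characters are concatenated into the slot value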
  • For another example, for the corpus data "help me book a train ticket from Shanghai to Beijing and book a five-star hotel near Beijing Railway Station", suppose the threshold set for slot position reliability in the slot filling layer 1214 is 0.5.
  • Among the slot labels predicted by the slot filling layer 1214 for the two characters that make up "Shanghai", the slot position reliability of the slot label "departure" (for example, 0.7) is greater than or equal to 0.5; therefore, the candidate slots output for these two characters are all "departure".
  • Among the slot labels predicted by the slot filling layer 1214 for the two characters that make up "Beijing", the slot position reliability of the slot label "destination" (for example, 0.8) is greater than or equal to 0.5; therefore, the candidate slots output for these two characters are all "destination".
  • Among the slot labels predicted by the slot filling layer 1214 for the five characters that make up "Beijing Railway Station", the slot position reliability of the slot label "location" (for example, 0.75) is greater than or equal to 0.5; therefore, the candidate slots output for these five characters are all "location". Among the slot labels predicted by the slot filling layer 1214 for the three characters that make up "five-star", the slot position reliability of the slot label "star rating" (for example, 0.75) is greater than or equal to 0.5; therefore, the candidate slots output for these three characters are all "star rating".
  • The slot filling layer 1214 finally outputs 4 candidate slots: departure, destination, location, and star rating; the slot value filled into the slot (departure) is (Shanghai), the slot value filled into the slot (destination) is (Beijing), the slot value filled into the slot (location) is (Beijing Railway Station), and the slot value filled into the slot (star rating) is (five-star).
  • In addition, since the sentence vector h0 is input as the initial value and the semantic information represented by the intent attention vector and the sentence vector includes all possible intent labels, the slot filling layer 1214 predicts possible slot labels based on the possible intent labels, so that the slot labels are associated with the intent labels. Therefore, the accuracy of slot prediction is greatly improved, and the speed or efficiency of slot prediction is also improved accordingly.
  • The working process of the slot filling layer 1214 is shown in FIG. 3:
  • The slot filling layer 1214 takes as input the encoding vector ht output by the BERT encoding layer 1211 at time t, the intent attention vector CI and the slot attention vector output by the attention layer 1213 at time t, and the hidden state vector C output by the attention layer 1213 at time t-1.
  • The slot filling layer 1214 first models the relationship between the intent and the slot based on the slot gate mechanism to obtain a fusion vector gS of the intent attention vector CI and the slot attention vector, then further predicts the slot label corresponding to each time step t and calculates the slot position reliability of each slot label.
  • In formula (3), v represents the random weight coefficient of the hyperbolic tangent function tanh(x), and W represents the random weight coefficient of the intent attention vector CI. W greater than 1 means that the intent attention vector CI has a greater influence on slot prediction than the slot attention vector, W less than 1 means that its influence on slot prediction is smaller than that of the slot attention vector, and W equal to 1 means that the intent attention vector CI and the slot attention vector have the same degree of influence on slot prediction.
  • Based on the above four input vectors, the slot filling layer 1214 obtains a slot vector representing the slot label information, and then calculates the slot position reliability of the corresponding slot label based on the slot vector.
  • The slot position reliability yS obtained after the sigmoid activation function can be written as yS = sigmoid(WS · hS + bS), where S is the number of slot labels, hS is the slot vector, WS is the random weight coefficient applied to the slot vector, and bS represents the bias value.
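  • A minimal numpy sketch of this slot-gate style fusion followed by a sigmoid is shown below; the exact fusion expression and shapes are assumptions reconstructed from the variables described above (v, W, tanh, CI and the slot attention vector), not the patent's published formula (3):

    import numpy as np

    rng = np.random.default_rng(0)
    DIM, N_SLOT_LABELS = 8, 4

    v = rng.normal(size=DIM)                  # weight coefficient of the tanh term
    W = 1.0                                   # relative weight of CI versus the slot attention vector
    W_S = rng.normal(size=(N_SLOT_LABELS, DIM))
    b_S = np.zeros(N_SLOT_LABELS)

    def slot_reliability(h_t, c_intent, c_slot, threshold=0.5):
        g = v * np.tanh(c_slot + W * c_intent)               # slot-gate fusion vector gS
        slot_vec = h_t + g                                    # slot vector for time step t
        y_S = 1.0 / (1.0 + np.exp(-(W_S @ slot_vec + b_S)))   # sigmoid slot position reliabilities
        return [i for i, c in enumerate(y_S) if c >= threshold]

    h_t = rng.normal(size=DIM)
    print(slot_reliability(h_t, rng.normal(size=DIM), rng.normal(size=DIM)))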
  • For example, when processing the word "me", the slot filling layer 1214 takes as input the encoding vector h3 (corresponding to "me"), the hidden state vector C output by the attention layer 1213 at time t-1 (corresponding to "for"), and the intent attention vector CI (corresponding to "please play Hello Old Times for me"); the hidden state vector C includes the word meaning information passed on from the word vector corresponding to "please", and the word vector corresponding to "please" also includes the semantic information transmitted from the sentence vector (corresponding to "please play Hello Old Times for me").
  • Similarly, when processing the word "you", the slot filling layer 1214 takes as input the encoding vector h6 (corresponding to "you"), the hidden state vector C output by the attention layer 1213 at time t-1 (corresponding to the second "play"), the intent attention vector CI (corresponding to "please play Hello Old Times for me, play, play") and the slot attention vector (corresponding to "play, you"); the hidden state vector C includes the word meaning information passed on from the word vector corresponding to the first "play", which in turn includes the word meaning information passed on from the word vector before it, and so on, until the word vector corresponding to "please", which also includes the semantic information transmitted from the sentence vector (corresponding to "please play Hello Old Times for me").
  • If the slot position reliability of the slot label songName predicted for "you" is 0.86, the slot position reliability of the slot label videoName is 0.7, the slot position reliability of the slot label mediaName is 0.55, and the slot position reliability of the slot label O is 0.2, then the final predicted slots for "you" are songName, videoName, and mediaName, which are output by the slot filling layer 1214.
  • the slot filling layer 1214 can be obtained by training based on the slot-gate mechanism, the LSTM model and the Sigmoid activation function.
  • the slot gate mechanism focuses on learning the relationship between the intent attention vector and the slot attention vector, and obtains a better semantic frame through global optimization.
  • the slot gate mechanism mainly uses the intent context vector to model the relationship between intent and slot to improve slot filling performance.
  • In other embodiments, other deep neural network models with the same function as the LSTM model can be used as the decoder, and other functions with the same function as the sigmoid function can also be used as the activation function of the corresponding deep neural network decoder; there are no restrictions here.
  • The post-processing layer 1215 is used to sort out the correspondence between candidate intents and candidate slots.
  • the result obtained after the candidate intent corresponds to the candidate slot is output from the post-processing layer 1215 as the semantic parsing result.
  • For example, after the candidate intents (PLAY_MUSIC, PLAY_VIDEO, PLAY_VOICE) output by the intent classification layer 1212 and the candidate slots (songName, videoName, mediaName) output by the slot filling layer 1214 are input to the post-processing layer 1215, the semantic parsing result output after inference and prediction based on the intent-slot mapping table in the post-processing layer 1215 is: the candidate intents PLAY_MUSIC, PLAY_VIDEO, and PLAY_VOICE are the intents identified by parsing the corpus data, the candidate slots songName, videoName, and mediaName are the slots obtained by parsing the corpus data, and "Hello Old Times" is the filled slot value.
  • For another example, after the candidate intents (booking a ticket, booking a hotel) output by the intent classification layer 1212 and the candidate slots (departure, destination, location, star rating) output by the slot filling layer 1214 are input to the post-processing layer 1215, the semantic parsing result output after inference and prediction based on the intent-slot mapping table in the post-processing layer 1215 is: the candidate intents (booking a ticket, booking a hotel) are the intents identified by parsing the corpus data, the candidate slots (departure, destination, location, star rating) are the slots obtained by parsing the corpus data, and Shanghai, Beijing, Beijing Railway Station, and five-star are the slot values filled into the corresponding slots (departure, destination, location, star rating).
  • The working process of the post-processing layer 1215 is shown in FIG. 3:
  • the post-processing layer 1215 takes the candidate intents obtained by the above-mentioned intent classification layer 1212 and the candidate slots obtained by the slot filling layer 1214 as input, and sorts out candidate intents and candidates based on the intent-slot mapping table obtained during the pre-training process of the semantic parsing model 121 . Correspondence between slots.
  • the intent slot mapping table obtained based on the pre-training process of the semantic parsing model 121 is described in detail below, and details are not repeated here.
• the intent-slot mapping table is obtained by sorting out the candidate intents and candidate slots produced by training on a large number of samples; therefore, in the process of performing the semantic parsing task, the intent-slot mapping table can be continuously updated based on more corpus data from practical applications.
  • the above BERT encoding layer 1211 , intent classification layer 1212 , attention layer 1213 , slot filling layer 1214 and post-processing layer 1215 together constitute the semantic parsing model 121 .
• each layer in the structure of the semantic parsing model 121 needs to be pre-trained with a large amount of sample corpus data so that it has the corresponding function of each layer described above.
• the semantic parsing model 121 is pre-trained by the server 200; afterwards, the trained semantic parsing model 121 can either be transplanted to the mobile phone 100 to directly perform the semantic parsing task, or it can remain on the server 200 to execute semantic parsing tasks requested by the mobile phone 100.
  • the pre-training process of the semantic parsing model 121 will be described in detail below.
  • the pre-training process of the semantic parsing model 121 includes:
  • the server 200 collects sample corpus data for training the semantic parsing model 121 .
• the collected sample corpus data should cover as many domains as possible and include as many verbs, proper nouns, common nouns, and so on as possible, so that the trained semantic parsing model 121 has better generalization performance.
  • sample corpus data used for training the semantic parsing model 121 needs to be input into the layers of the semantic parsing model 121 for training in batches.
  • concepts related to sample data are introduced below.
• (a) batch: the loss function required for each parameter update in deep learning is not obtained from a single {data: label} pair, but is obtained by weighting a set of data; the number of samples in this set of data is the batchsize.
• (b) batchsize: the batch size, that is, the number of samples in a batch. Each training step takes batchsize samples from the training set for training.
• (c) iteration: the number of iterations is the number of batches needed to complete one epoch. One iteration is equal to training once with batchsize samples; within one epoch, the number of batches and the number of iterations are equal.
• (d) epoch: when the complete dataset passes through the neural network once and returns once, the process is called one epoch. That is to say, one epoch is equivalent to training once with all the samples in the training set.
  • training the entire sample set requires 100 iterations and 1 epoch.
• for example, for a dataset with 2000 training samples, dividing the 2000 samples into batches of size 500, it takes 4 iterations to complete one epoch, as the sketch below also shows.
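• The relationship among these quantities can be checked with a few lines of Python (illustrative only):

```python
def iterations_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of iterations (batches) needed for one epoch over the dataset."""
    # each iteration consumes batch_size samples; one epoch consumes them all
    return num_samples // batch_size

print(iterations_per_epoch(2000, 500))  # -> 4 iterations per epoch
```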
  • the server 200 performs data preprocessing on the sample corpus data to be input into the training of the semantic parsing model 121 through the NLP module.
  • data preprocessing of the sample corpus data please refer to the relevant description of the data preprocessing in the BERT coding layer 1211 above, which will not be repeated here.
• after data preprocessing, a Token sequence, a segmentation mark, and a mask created corresponding to the Token sequence are obtained for each piece of sample corpus data.
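• As an illustration, preprocessing one piece of corpus data into a Token sequence, segment ids and a mask could look like the following sketch; the character-level tokenization, the [CLS]/[SEP]/[PAD] markers and the fixed length are assumptions made for the example:

```python
def preprocess(text: str, max_len: int = 16):
    """Turn one corpus sentence into tokens, segment ids and a mask (sketch)."""
    tokens = ["[CLS]"] + list(text) + ["[SEP]"]          # character-level tokens
    segment_ids = [0] * len(tokens)                      # single-sentence input
    mask = [1] * len(tokens)                             # 1 marks a real token
    pad = max_len - len(tokens)                          # pad so sentences can be batched
    tokens += ["[PAD]"] * pad
    segment_ids += [0] * pad
    mask += [0] * pad                                    # 0 marks padding
    return tokens, segment_ids, mask

tokens, segment_ids, mask = preprocess("请给我播放你好旧时光")
print(tokens)
print(mask)
```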
• in one epoch of training, the server 200 respectively inputs, for each sample corpus, the Token sequence, the segmentation mark and the mask corresponding to the Token sequence obtained by data preprocessing into the BERT encoding layer 1211 of the semantic parsing model 121 for training, so that it can output a sequence of encoding vectors as described for the BERT encoding layer 1211 above.
• the BERT coding layer 1211 is obtained based on the training of the BERT model. During the training process, it is necessary to continuously fine-tune the upstream and downstream parameters of the semantic parsing model 121, so that the BERT coding layer can output the above coding vector sequence {h0, h1, h2, ..., ht}.
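• For instance, obtaining such an encoding vector sequence from a pre-trained BERT encoder can be sketched as follows; this sketch assumes the Hugging Face transformers library and the bert-base-chinese checkpoint, which are not part of this application:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")

enc = tokenizer("请给我播放你好旧时光", return_tensors="pt")
with torch.no_grad():
    out = model(**enc)

h_seq = out.last_hidden_state[0]   # one encoding vector per token: h1 ... ht
h_0 = out.last_hidden_state[0, 0]  # vector at [CLS], usable as the sentence vector h0
print(h_seq.shape, h_0.shape)
```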
• in one epoch of training, the server 200 inputs the sentence vector h0 output by the BERT coding layer 1211 in the above process 503 into the intent classification layer 1212 of the semantic parsing model 121 for training, so that it can output the candidate intents as described for the intent classification layer 1212 above, which is not repeated here.
• the intent classification layer 1212 is obtained by training based on a fully connected layer and the Sigmoid function as the activation function. During the training process, it is necessary to continuously fine-tune the upstream and downstream parameters of the semantic parsing model 121, so that after learning from the sample corpus data for a long enough time or with a large enough number of samples, the intent classification layer 1212 can predict all possible intent labels and the intent confidence corresponding to each intent label, and then extract the multiple intent labels that meet the output condition as candidate intents, which are output from the intent classification layer 1212. For details, refer to the above formula (1) and the related description, which is not repeated here.
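• A minimal sketch of such a multi-label intent classification head (a fully connected layer followed by a Sigmoid, with a confidence threshold as the output condition) is given below; the threshold value 0.5, the label names other than those above, and the toy dimensions are assumptions made for the example:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_intents(h0, W, b, intent_labels, threshold=0.5):
    """Fully connected layer + Sigmoid over the sentence vector h0.
    Every label whose confidence exceeds the threshold is kept as a candidate intent."""
    confidences = sigmoid(W @ h0 + b)            # one confidence per intent label
    return [(label, float(c)) for label, c in zip(intent_labels, confidences)
            if c > threshold]

labels = ["PLAY_MUSIC", "PLAY_VIDEO", "PLAY_VOICE", "QUERY_WEATHER"]
rng = np.random.default_rng(1)
h0 = rng.normal(size=8)
print(predict_intents(h0, rng.normal(size=(4, 8)), np.zeros(4), labels))
```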
  • the candidate intents output by the intent classification layer 1212 are input to the post-processing layer 1215.
• the server 200 inputs the encoding vector sequence {h0, h1, h2, ..., ht} output by the BERT coding layer 1211 trained in the above process 503 into the attention layer 1213 of the semantic parsing model 121 for training, so that it can output the intent attention vector CI and the slot attention vector as described for the attention layer 1213 above, which is not repeated here.
• the attention layer 1213 is obtained by training based on the attention mechanism and the LSTM model. During the training process, it is necessary to continuously fine-tune the upstream and downstream parameters of the semantic parsing model 121, so that the attention layer 1213 can quantify the degree to which the word corresponding to each word vector is related to the expressed intent, and the degree to which it is related to the represented slot, and finally output the intent attention vector and the slot attention vector.
• the LSTM model is a special RNN model, which was proposed to solve the gradient dispersion (vanishing gradient) problem of the RNN model. Its core is the cell state, which can be understood as a conveyor belt; it is in effect the memory space of the entire model and changes over time.
• the working principle of the LSTM model can be briefly described as: (1) forget gate: choose to forget some past information; (2) input gate: remember some current information; (3) merge the past and present memory; (4) output gate: choose to output some information. A sketch of one such cell step is given below.
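• The following NumPy sketch shows one LSTM cell step implementing the four gates above; the weight shapes, the stacked parameter layout and the toy sequence are assumptions made for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold the parameters of the four gates
    (forget, input, candidate, output) stacked along the first axis."""
    z = W @ x_t + U @ h_prev + b
    f, i, g, o = np.split(z, 4)
    f = sigmoid(f)            # forget gate: choose to forget some past information
    i = sigmoid(i)            # input gate: remember some current information
    g = np.tanh(g)            # candidate memory
    c_t = f * c_prev + i * g  # merge past and present memory (the cell state)
    o = sigmoid(o)            # output gate: choose what to output
    h_t = o * np.tanh(c_t)
    return h_t, c_t

d = 4
rng = np.random.default_rng(2)
h, c = np.zeros(d), np.zeros(d)
for x in rng.normal(size=(3, d)):             # a toy sequence of 3 word vectors
    h, c = lstm_cell(x, h, c, rng.normal(size=(4 * d, d)),
                     rng.normal(size=(4 * d, d)), np.zeros(4 * d))
print(h)
```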
• the attention mechanism imitates the internal process of biological observation behavior, that is, a mechanism that aligns internal experience with external sensation to increase the fineness of observation in some regions; it can use limited attention resources to quickly screen out high-value information from a large amount of information.
• the attention mechanism can quickly extract important features from sparse data.
• the essential idea of the attention mechanism can be written as the following formula:

  Attention(Query, Source) = Σ_{i=1..Lx} Similarity(Query, Key_i) · Value_i

where Lx represents the length of Source. The meaning of the formula is to imagine that the constituent elements in Source are composed of a series of <Key, Value> data pairs. Given an element Query in the target Target, the similarity or correlation between the Query and each Key is calculated to obtain the weight coefficient of each Key's corresponding Value, and the Values are then weighted and summed to obtain the final Attention value. In essence, the Attention mechanism is a weighted sum of the Value of the elements in Source, where Query and Key are used to calculate the weight coefficient of the corresponding Value.
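• The weighted-sum idea in the above formula can be sketched directly in Python; the dot-product similarity and softmax normalization chosen here are assumptions made for the example, and other similarity functions can be used:

```python
import numpy as np

def attention(query, keys, values):
    """Attention(Query, Source) = sum_i Similarity(Query, Key_i) * Value_i."""
    scores = keys @ query                        # similarity of Query to each Key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # normalized weight coefficients
    return weights @ values                      # weighted sum of the Values

rng = np.random.default_rng(3)
L_x, d = 5, 4                                    # L_x: length of Source
keys = rng.normal(size=(L_x, d))
values = rng.normal(size=(L_x, d))
query = rng.normal(size=d)
print(attention(query, keys, values))            # the final Attention value
```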
• the server 200 inputs the encoding vector ht output at time t by the BERT coding layer 1211 trained in the above process 503, the intent attention vector and slot attention vector output at time t by the attention layer 1213 trained in the above process 505, and the hidden state vector C output at time t-1 by the LSTM model in the attention layer 1213 (that is, the semantic information of the sentence preceding the currently processed word, or the word meaning information of the preceding word) into the slot filling layer 1214 of the semantic parsing model 121 for training, so that it can output candidate slots as described for the slot filling layer 1214 above, which is not repeated here.
• the slot filling layer 1214 is obtained by training based on the slot gate mechanism, the LSTM model as the decoder, and the Sigmoid function as the activation function. During the training process, it is necessary to continuously fine-tune the upstream and downstream parameters of the semantic parsing model 121, so that after learning from the sample corpus data for a long enough time or with a large enough number of samples, the slot filling layer 1214 can predict, corresponding to the possible intent labels, all possible slot labels and the slot confidence corresponding to each slot label, and then extract the multiple candidate slots that meet the output condition as the output of the slot filling layer 1214. For details, refer to the above formulas (3) to (4) and the related descriptions, which are not repeated here.
  • the candidate slots output by the slot filling layer 1214 are input to the post-processing layer 1215 .
  • the server 200 determines whether the training results of the above-mentioned processes 501-506 satisfy the training termination condition. If the training result satisfies the training termination condition, go to 508 ; if the training result does not satisfy the training termination condition, go to 509 .
  • an Early Stopping mechanism may be used to determine the termination of model training. That is, when the number of training epochs reaches the number threshold or the epoch interval with the last optimal model is greater than the set interval threshold, the training result satisfies the training termination condition; otherwise, the training result does not meet the training termination condition.
  • the early stopping mechanism can make the trained neural network model have good generalization performance, that is, it can fit the data well. Its basic meaning is to calculate the performance of the model on the validation set during training. When performance starts to drop, stop training to avoid overfitting problems caused by continuing training.
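• A minimal sketch of such an Early Stopping check is given below; the epoch threshold and the interval (patience) threshold values, as well as the toy validation losses, are assumptions made for the example:

```python
def should_stop(epoch: int, best_epoch: int, max_epochs: int = 50,
                patience: int = 5) -> bool:
    """Stop when the epoch count reaches the threshold, or when the gap to the
    epoch that produced the last best model exceeds the interval threshold."""
    return epoch >= max_epochs or (epoch - best_epoch) > patience

# toy training loop: validation loss stops improving after epoch 4
val_losses = [1.0, 0.8, 0.7, 0.69, 0.70, 0.71, 0.72, 0.73, 0.74, 0.75]
best, best_epoch = float("inf"), 0
for epoch, loss in enumerate(val_losses, start=1):
    if loss < best:
        best, best_epoch = loss, epoch
    if should_stop(epoch, best_epoch):
        print(f"stop at epoch {epoch}, best epoch was {best_epoch}")
        break
```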
• the server 200 terminates the training of the BERT encoding layer 1211, the intent classification layer 1212, the attention layer 1213 and the slot filling layer 1214 in the semantic parsing model 121, and further inputs a large number of candidate intents and a large number of candidate slots into the post-processing layer 1215 of the semantic parsing model 121 to sort out their relationship, for example, sorting out the candidate slots based on the candidate intents to obtain an intent-slot mapping table.
  • the semantic parsing model training ends.
• a candidate intent and a candidate slot are obtained for each sample corpus data after the training of the above processes 502 to 506; since the sample corpus data is sufficient, the candidate intents and candidate slots input into the post-processing layer 1215 are also sufficient.
• the post-processing layer 1215 is trained based on a sufficient number of candidate intents and candidate slots, so that it can sort out the candidate slots based on the candidate intents and output an ordered correspondence between intents and slots, for example, training to obtain an intent-slot mapping table. Based on the intent-slot mapping table, the post-processing layer 1215 can accurately and quickly find the correspondence between the candidate intents and the candidate slots that are input.
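• For example, the intent-slot mapping table can be thought of as a lookup structure such as the one sketched below, so that sorting candidate slots by candidate intents reduces to a simple lookup; the table contents only repeat the examples above, and the function and field names are assumptions made for illustration:

```python
# Illustrative intent-slot mapping table built up during training.
INTENT_SLOT_TABLE = {
    "PLAY_MUSIC": ["songName"],
    "PLAY_VIDEO": ["videoName"],
    "PLAY_VOICE": ["mediaName"],
    "booking a ticket": ["departure", "destination"],
    "booking a hotel": ["location", "star rating"],
}

def sort_slots_by_intent(candidate_intents, candidate_slots):
    """Attach each candidate slot to the candidate intent it belongs to."""
    result = {}
    for intent in candidate_intents:
        allowed = INTENT_SLOT_TABLE.get(intent, [])
        result[intent] = {s: v for s, v in candidate_slots.items() if s in allowed}
    return result

print(sort_slots_by_intent(
    ["booking a ticket", "booking a hotel"],
    {"departure": "Shanghai", "destination": "Beijing",
     "location": "Beijing Railway Station", "star rating": "Five Star"}))
```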
  • the server 200 continues to input the sample corpus data of the next epoch and repeats the processes 502 to 507 to continue training the semantic parsing model 121 .
• the objective loss function adopted for the joint optimization of intent and slot is the sum of the intent classification loss function, the slot filling loss function, and a regularization term on the weights.
  • the intent classification loss function adopts the multi-label Sigmoid cross entropy loss (Cross Entropy Loss) function
  • the slot filling loss function adopts the serialized multi-label Sigmoid Cross Entropy Loss function.
• the Sigmoid Cross Entropy Loss and the joint objective loss function can be written as follows:

  L(y, f(x)) = -[ y·log(σ(f(x))) + (1 - y)·log(1 - σ(f(x))) ]

  J = L_y(y, f(x)) + L_c(y, f(x)) + (λ / 2m) · Σ_l Σ_k Σ_j ( W_{k,j}^{[l]} )²

where σ(·) denotes the Sigmoid function, L_y(y, f(x)) is the intent classification loss function calculated according to the above formula (6), L_c(y, f(x)) is the slot filling loss function calculated according to the above formula (6), λ is the hyperparameter, m is the number of data in a batch, the reason for dividing by 2 is so that it cancels out during differentiation, Σ_k Σ_j (W_{k,j}^{[l]})² represents the sum of the squared W parameters of the l-th layer, and W^{[l]} is a matrix whose rows and columns are indexed by k and j.
• the joint optimization function mainly jointly optimizes the intent classification loss and the slot filling loss generated in the process of matrix transformation in the neural network, as the sketch below illustrates.
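• The following Python sketch computes such a joint objective: a multi-label Sigmoid cross entropy for the intent loss, the same loss summed over time steps for the slot loss, and an L2 regularization term; the λ/(2m) scaling follows the reconstruction above, and the toy shapes and values are assumptions made for the example:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def multilabel_bce(y_true, logits):
    """Multi-label Sigmoid cross entropy loss, averaged over labels."""
    p = sigmoid(logits)
    eps = 1e-12
    return -np.mean(y_true * np.log(p + eps) + (1 - y_true) * np.log(1 - p + eps))

def joint_loss(intent_true, intent_logits, slot_true, slot_logits,
               weight_matrices, lam=0.01, m=32):
    """Intent loss + slot loss + (lambda / 2m) * sum of squared weights."""
    l_intent = multilabel_bce(intent_true, intent_logits)
    # slot loss: serialized multi-label loss, one term per time step
    l_slot = np.mean([multilabel_bce(t, l) for t, l in zip(slot_true, slot_logits)])
    l2 = sum(np.sum(W ** 2) for W in weight_matrices)
    return l_intent + l_slot + lam / (2 * m) * l2

rng = np.random.default_rng(4)
intent_true = np.array([1.0, 0.0, 1.0])
slot_true = rng.integers(0, 2, size=(5, 4)).astype(float)   # 5 time steps, 4 labels
print(joint_loss(intent_true, rng.normal(size=3),
                 slot_true, rng.normal(size=(5, 4)),
                 [rng.normal(size=(4, 4))]))
```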
  • the semantic parsing model 121 trained by the server 200 can parse the corpus data to be parsed into candidate intents and candidate slots that are closer to the real intent and the real slot.
• the trained semantic parsing model 121 can either be transplanted to the mobile phone 100 to directly perform the semantic parsing task, or can remain on the server 200 to execute semantic parsing tasks requested by the mobile phone 100. Specifically, as shown in FIG. 6, the user enters a voice instruction by waking up the voice assistant of the mobile phone 100, and the mobile phone 100, through the internal human-machine dialogue system 110 and based on the above semantic parsing model 121, extracts one or more intents corresponding to the user's voice instruction and the corresponding slot information.
  • the mobile phone 100 further performs corresponding operations based on the identified intent and the slot, for example, opening an application software, or performing a web page search.
• for the specific interaction process between the user and the mobile phone 100 onto which the semantic parsing model 121 has been transplanted, refer to the following example:
  • the mobile phone 100 obtains the user's voice instruction.
  • a voice assistant is installed in the mobile phone 100 , and the user can send a voice command to the mobile phone 100 by waking up the voice assistant of the mobile phone 100 .
  • the mobile phone 100 acquires the user's voice instruction "help me book a train ticket from Shanghai to Beijing and book a five-star hotel near Beijing Railway Station".
  • the speech recognition module 111 in the man-machine dialogue system 110 of the mobile phone 100 recognizes and converts the acquired user speech instruction into corpus data in the form of text. For example, converting the above voice command into textual corpus data "help me book a train ticket from Shanghai to Beijing and book a five-star hotel near Beijing Railway Station".
• the semantic parsing module 112 in the human-machine dialogue system 110 of the mobile phone 100 is configured to perform semantic parsing on the corpus data to obtain a semantic parsing result in which intents correspond to slots.
  • the semantic parsing module 112 preprocesses the corpus data to obtain a Token sequence, a sentence segmentation mark, and a mask created corresponding to the Token sequence.
• the semantic parsing module 112 uses the Token sequence, the segmentation mark and the mask created corresponding to the Token sequence as the input of the semantic parsing model 121, performs semantic parsing, and extracts multiple candidate intents and multiple candidate slots; finally, after the semantic parsing model 121 sorts out the correspondence between the multiple candidate intents and the multiple candidate slots, the result is output as the semantic parsing result.
  • a simple single-intent corpus can also be parsed by the semantic parsing model 121 to extract a single candidate intent and one or more corresponding candidate slots, which is not limited herein.
• the semantic parsing result obtained by the semantic parsing module 112 of the human-machine dialogue system 110 through the semantic parsing model 121 is:
  • the problem solving module 113 in the human-machine dialogue system 110 of the mobile phone 100 searches for a corresponding application or network resource based on the semantic analysis result obtained by the semantic analysis module 112 to obtain a solution to the intent and slot in the semantic analysis result.
• the solution searched by the problem solving module 113 is that the mobile phone 100 can open an installed booking service software application or travel software application to query train ticket information and hotel information for the user to choose and book, or select a train ticket by default according to the user's historical usage records, enter the booking interface, and ask the user to confirm.
  • the mobile phone interface is shown in Figure 7.
• the intent and slot mapping result obtained by parsing the corpus data recognized from the instruction includes the user's three intents, the slots corresponding to each of the three intents, and the slot value filled in each slot; then the mobile phone 100 can, based on the user's usage habits, open the music player software by default to play the local music "Hello Old Times", or open the audio and video player software to obtain music or video files related to "Hello Old Times" for the user to choose to play.
  • the language generation module 114 in the man-machine dialogue system 110 of the mobile phone 100 generates a natural language sentence for the solution found by the problem solving module 113 , and feeds it back to the user through the display interface of the mobile phone 100 .
• the solution searched by the above problem solving module 113 is:
  • the mobile phone 100 can open the installed booking service software application or travel software application to query train ticket information and hotel information for the user to select and reserve, or select a train ticket by default according to the user's historical usage record to enter the reservation interface and ask the user to confirm.
  • the language generation module 114 can correspondingly generate the train number information of the train ticket or the introduction information of the hotel, and feed it back to the user through the display interface of the mobile phone 100 , as shown in FIG. 7 .
  • the user's voice command obtained by the mobile phone 100 is to query the weather for the last three days.
• the solution searched by the problem solving module 113 is to open the browser on the mobile phone 100, or open the weather query software installed on the mobile phone 100, to search for the weather conditions of the last three days.
  • the language generation module 114 generates natural language texts from the searched weather conditions as follows:
  • the weather today is 28-32°C;
  • the dialogue management module 115 in the human-machine dialogue system 110 of the mobile phone 100 may schedule other modules based on the user's dialogue history to further improve the accurate understanding of the user's voice command. For example, in the process of searching the weather by the problem solving module 113, the location is not clearly indicated in the user's voice command, then the dialogue management module 115 can schedule the problem solving module 113 based on the user's dialogue history to search for Beijing, which is frequently inquired by the user, as a search address, and provide feedback to the user.
  • the dialogue management module 115 can also dispatch the problem solving module 113 based on the location information of the mobile phone 100 to search for the weather in the current location of the user for the past three days, and further dispatch the language generation module 114 to generate the following natural language sentences:
  • the weather today is 28-32°C;
  • the dialogue management module 115 in the human-machine dialogue system 110 of the mobile phone 100 can flexibly schedule other modules in the human-machine dialogue system 110 to perform corresponding functions.
  • the speech synthesis module 116 in the man-machine dialogue system 110 of the mobile phone 100 further synthesizes and converts the natural language sentences generated by the language generation module 114 into speech, which is played back to the user through the mobile phone 100 .
  • the weather conditions generated by the language generation module 114 in the above process 605 are converted into voice and played to the user, so that the user can hear the weather conditions without looking at the mobile phone.
  • the trained semantic parsing model 121 may also continue to exist in the server 200 to perform the semantic parsing task requested from the mobile phone 100 .
  • the user inputs voice commands by waking up the voice assistant of the mobile phone 100, the mobile phone 100 converts the user's voice commands into corpus data through the internal man-machine dialogue system 110, and the mobile phone 100 interacts with the server 200 to send the converted corpus data to the server 200 for semantic processing.
  • the server 200 extracts multiple candidate intents and candidate slots corresponding to the intents in the user's voice instruction based on the semantic parsing model 121 . Further, the server 200 feeds back the extracted intent and the corresponding result of the slot to the mobile phone 100, and the mobile phone 100 further performs corresponding operations based on the identified intent and the slot, such as opening an application software or performing a web page search.
  • FIG. 8 shows a schematic structural diagram of a mobile phone 100 according to an embodiment of the present application.
• the mobile phone 100 may include a processor 101, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, a headphone jack 170D, a sensor module 180, buttons 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and so on.
  • the sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, and ambient light. Sensor 180L, bone conduction sensor 180M, etc.
  • the structures illustrated in the embodiments of the present invention do not constitute a specific limitation on the mobile phone 100 .
  • the mobile phone 100 may include more or less components than shown, or some components are combined, or some components are separated, or different components are arranged.
  • the illustrated components may be implemented in hardware, software, or a combination of software and hardware.
  • the mobile phone 100 can obtain the user's voice command and feed back the response voice to the user through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headphone interface 170D, and the application processor.
  • the mobile phone 100 obtains the user's voice command through the receiver 170B or the microphone 170C, and sends the obtained user's voice command to the human-machine dialogue system 110 for voice recognition and semantic analysis.
• the corresponding solution is then matched, and the mobile phone 100 executes the corresponding operation to realize the solution corresponding to the semantic parsing result.
  • the man-machine dialogue system 110 can also generate a response voice from the solution corresponding to the semantic analysis result and feed back the response voice to the user through the speaker 170A of the mobile phone 100 or the earphone plugged in the earphone interface 170D.
  • the audio module 170 is used for converting digital audio information into analog audio signal output, and also for converting analog audio input into digital audio signal. Audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be provided in the processor 101 , or some functional modules of the audio module 170 may be provided in the processor 101 .
• the speaker 170A, also referred to as a "loudspeaker", is used to convert audio electrical signals into sound signals.
  • the electronic device 100 can listen to music through the speaker 170A, or listen to a hands-free call.
• the receiver 170B, also referred to as an "earpiece", is used to convert audio electrical signals into sound signals.
• when answering a call or listening to a voice message, the receiver 170B can be placed close to the human ear to hear the voice.
• the microphone 170C, also called a "mike" or a "mic", is used to convert sound signals into electrical signals.
• when making a sound, the user can speak with the mouth close to the microphone 170C to input the sound signal into the microphone 170C.
  • the electronic device 100 may be provided with at least one microphone 170C. In other embodiments, the electronic device 100 may be provided with two microphones 170C, which can implement a noise reduction function in addition to collecting sound signals. In other embodiments, the electronic device 100 may further be provided with three, four or more microphones 170C to collect sound signals, reduce noise, identify sound sources, and implement directional recording functions.
• the processor 101 may include one or more processing units; for example, the processor 101 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), and the like. Different processing units may be independent devices, or may be integrated into one or more processors.
• the processor 101 realizes the function of the semantic parsing model 121 by running a program; the human-machine dialogue system 110 recognizes and converts the user's voice instruction into text corpus data, which, after data preprocessing, is input into the semantic parsing model 121 run by the processor 101 for semantic parsing to obtain the semantic parsing result.
  • the controller can generate an operation control signal according to the instruction operation code and timing signal, and complete the control of fetching and executing instructions.
  • a memory may also be provided in the processor 101 for storing instructions and data.
  • the memory in processor 101 is a cache memory.
  • the memory may hold instructions or data that have just been used or recycled by the processor 101 . If the processor 101 needs to use the instruction or data again, it can be called directly from the memory. Repeated access is avoided, and the waiting time of the processor 101 is reduced, thereby improving the efficiency of the system.
  • the processor 101 may include one or more interfaces.
• the interface may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a general-purpose input/output (GPIO) interface, a SIM interface, and/or a USB interface, etc.
  • the interface connection relationship between the modules illustrated in the embodiment of the present invention is only a schematic illustration, and does not constitute a structural limitation of the mobile phone 100 .
  • the mobile phone 100 may also adopt different interface connection manners in the foregoing embodiments, or a combination of multiple interface connection manners.
  • the charging management module 140 is used to receive charging input from the charger.
  • the charger may be a wireless charger or a wired charger.
  • the charging management module 140 may receive charging input from the wired charger through the USB interface 130.
  • the charging management module 140 may receive wireless charging input through the wireless charging coil of the mobile phone 100 . While the charging management module 140 charges the battery 142 , it can also supply power to the electronic device through the power management module 141 .
  • the power management module 141 is used to connect the battery 142 , the charging management module 140 and the processor 101 .
  • the power management module 141 receives input from the battery 142 and/or the charging management module 140, and supplies power to the processor 101, the internal memory 121, the display screen 194, the camera 193, and the wireless communication module 160.
  • the wireless communication function of the mobile phone 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modulation and demodulation processor, the baseband processor, and the like.
  • Antenna 1 and Antenna 2 are used to transmit and receive electromagnetic wave signals.
  • Each antenna in handset 100 may be used to cover a single or multiple communication frequency bands. Different antennas can also be reused to improve antenna utilization.
  • the antenna 1 can be multiplexed as a diversity antenna of the wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
• the mobile communication module 150 can provide wireless communication solutions including 2G/3G/4G/5G and the like applied on the mobile phone 100.
• the wireless communication module 160 can provide wireless communication solutions applied on the mobile phone 100, including wireless local area network (WLAN) (such as a wireless fidelity (Wi-Fi) network), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), and the like.
  • the antenna 1 of the mobile phone 100 is coupled with the mobile communication module 150, and the antenna 2 is coupled with the wireless communication module 160, so that the mobile phone 100 can communicate with the network and other devices through wireless communication technology.
  • the mobile phone 100 implements a display function through a GPU, a display screen 194, an application processor, and the like.
  • the GPU is a microprocessor for image processing, and is connected to the display screen 194 and the application processor.
  • Display screen 194 is used to display images, videos, and the like. Display screen 194 includes a display panel. In some embodiments, the handset 100 may include 1 or N display screens 194, where N is a positive integer greater than 1.
  • the SIM card interface 195 is used to connect a SIM card.
• the present disclosure also relates to an apparatus for performing the operations herein.
  • This apparatus may be specially constructed for the required purposes or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer.
  • a computer program may be stored on a computer readable medium, such as, but not limited to, any type of disk, including floppy disks, optical disks, CD-ROMs, magneto-optical disks, read only memory (ROM), random access memory (RAM) , EPROM, EEPROM, magnetic or optical cards, application specific integrated circuits (ASICs), or any type of medium suitable for storing electronic instructions, and each may be coupled to a computer system bus.
  • the computers referred to in the specification may include a single processor or may be architectures employing multiple processors for increased computing power.

Abstract

The present application relates to the technical field of human-machine dialogs, and specifically relates to an electronic device and a semantic parsing method therefor, a medium, and a human-machine dialog system. The semantic parsing method comprises: obtaining corpus data to be parsed; calculating the degree of intention correlation between a word comprised in said corpus data and an intention represented by said corpus data, and the degree of slot correlation between the word and a slot represented by said corpus data; and predicting the slot of said corpus data according to the semantic information of the word, the foregoing semantic information of the word, and the degree of intention correlation and the degree of slot correlation of the word. A plurality of intentions close to the real intention of a user are recognized from user voice, and then slot information is predicted by using the plurality of recognized intentions, thereby improving the accuracy of slot filling, also correspondingly improving the speed or efficiency of slot filling, and further improving the accuracy of semantic parsing in a human-machine dialog.

Description

Electronic device and its semantic parsing method, medium and human-machine dialogue system
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on September 15, 2020, with application number 202010970477.8 and entitled "Electronic Device and Its Semantic Parsing Method, Medium and Human-Machine Dialogue System", the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the technical field of man-machine dialogue, and in particular to an electronic device and a semantic parsing method, medium and man-machine dialogue system thereof.
Background Art
With the continuous development of artificial intelligence technology and the deep popularization of various intelligent terminal electronic devices, man-machine dialogue systems are increasingly applied in various intelligent terminal electronic devices, for example, smart speakers, smart phones, in-vehicle intelligent systems such as in-vehicle intelligent voice navigation, robots, and the like. A man-machine dialogue system uses technologies such as speech recognition, semantic parsing and language generation to realize dialogue and information exchange between humans and machines.
The spoken language understanding task in semantic parsing technology includes two sub-tasks: intent recognition and slot filling. At present, intent recognition and slot filling are mainly aimed at single-intent, single-slot recognition, that is, for the same piece of speech, the closest intent is selected from multiple intent recognition result options as the recognition result. However, in practical applications, a corpus with multiple intents and a corpus with a single intent may have the same sentence pattern, and a single-intent classification model cannot distinguish a corpus with multiple intents, which eventually leads to a high false-trigger rate of the model, that is, a high error rate of the intent recognition and slot filling results. Moreover, the existing intent-slot recognition architecture cannot explicitly model the relationship between intents and slots, has poor accuracy for multi-label intent recognition and slot filling, and is not compatible with mixed single-intent and multi-intent scenarios.
SUMMARY OF THE INVENTION
Embodiments of the present application provide an electronic device, a semantic parsing method therefor, a medium, and a human-machine dialogue system. By recognizing, from a user's voice, multiple intents close to the user's true intent, and then using the identified multiple intents to predict slot information, the accuracy of slot filling is improved, the speed or efficiency of slot filling is correspondingly improved, and the accuracy of semantic parsing in human-machine dialogue is further improved.
In a first aspect, an embodiment of the present application provides a semantic parsing method, the method including: acquiring corpus data to be parsed; calculating an intent correlation degree between a word included in the corpus data to be parsed and an intent represented by the corpus data to be parsed, and a slot correlation degree between the word and a slot represented by the corpus data to be parsed; and predicting a slot of the corpus data to be parsed based on semantic information of the word, preceding semantic information of the word, and the intent correlation degree and slot correlation degree of the word.
For example, the corpus data may be obtained by performing speech recognition and conversion on a user's voice instruction.
The intent correlation degree between a word included in the corpus data to be parsed and the intent represented by the corpus data to be parsed may be represented by an intent attention vector, and the slot correlation degree between the word and the slot represented by the corpus data to be parsed may be represented by a slot attention vector.
The semantic information of a word may be understood as the word meaning information of the word, that is, the literal meaning of the word and what it refers to. For example, the word "you" may be a pronoun (expressing a form of address for the other party), or may serve as a noun in a specific sentence (for example, in a song title such as Hello Old Times).
The preceding semantic information of a word may be the word meaning information of the previous word that is adjacent to the current word in the corpus data; if the current word being processed is the first word, the preceding semantic information may be the sentence semantic information of the corpus data. The preceding semantic information is used mainly because it is of great significance to the slot prediction of the current word. The preceding semantic information may be expressed in the hidden state vector output at the previous moment (relative to the current moment).
In a possible implementation of the above first aspect, the above method further includes: predicting multiple intents from the corpus data to be parsed; and determining, from the predicted slots, the slot corresponding to each of the multiple intents.
For example, multiple intents are obtained by parsing the corpus data converted from a user's voice instruction. If the corpus data contains only a single intent, the present application is also applicable to parsing the single intent in such single-intent corpus data, which provides a certain degree of generality and a better user experience.
Among the multiple intents obtained by parsing, each intent has at least one slot corresponding to it, and some intents may have three or more slots corresponding to them. The present application can accurately sort out the correspondence between multiple intents and multiple slots.
In a possible implementation of the above first aspect, the above method further includes: the preceding semantic information includes the semantic information of at least one word located before the word in the corpus data to be parsed.
For example, when performing semantic parsing on a piece of corpus data, when the slot is predicted for the first word at the first moment, the preceding semantic information of the first word is the sentence semantic information of this piece of corpus data. When the slot is predicted for the second word at the second moment, the preceding semantic information of the second word is the word meaning information of the first word, and at this time the word meaning information of the first word contains the information that the sentence semantic information of the corpus data passed to the first word at the first moment. By analogy, the preceding semantic information of each subsequent word is the word meaning information of the previous word, and the word meaning information of the previous word includes the word meaning information passed on by the word before it. This passing relationship is progressive: the word meaning information of two adjacent words has the greatest degree of correlation, while the word meaning information of two non-adjacent words has a smaller degree of correlation, or the degree of correlation gradually approaches 0 as the number of characters between them increases.
In a possible implementation of the above first aspect, the above method further includes: generating sentence semantic information of the corpus data to be parsed and semantic information of each word in the corpus data to be parsed.
For example, the sentence character representing the sentence in the corpus data is encoded by an encoder so that the sentence character can express specific semantic information, and this specific semantic information is the same as or close to the semantic information obtained by a human understanding the sentence. In addition, the word character of each word in the corpus data is encoded by the encoder so that the word character can express specific word meaning information, and this specific word meaning information is the same as or close to the word meaning information that a human would attribute to each word after understanding the sentence. In some embodiments, a sentence vector may be used to represent the sentence semantic information of the corpus data, and a word vector may be used to represent the word meaning information of each word in the corpus data.
In a possible implementation of the above first aspect, the above method further includes: the method is implemented by a neural network model. The neural network model includes a fully connected layer and a long short-term memory network model.
For example, a semantic parsing model is trained through a neural network model combined with a BERT model, an attention mechanism, a slot gate mechanism and a Sigmoid activation function, so that it can implement the above method.
In a possible implementation of the above first aspect, the above method further includes: the sentence semantic information of the corpus data to be parsed, the preceding semantic information of the word, and the intent correlation degree and slot correlation degree of the word are represented in the form of vectors in the neural network model.
For example, the sentence semantic information of the corpus data to be parsed is represented by a sentence vector, the preceding semantic information of the word is represented by the hidden state vector at the previous moment, and the intent correlation degree and the slot correlation degree of the word are represented by an intent attention vector and a slot attention vector, respectively.
In a second aspect, an embodiment of the present application provides a man-machine dialogue method, including: receiving a user voice instruction; converting the user voice instruction into a corpus to be parsed in text form; parsing out, through the above semantic parsing method, the intents in the corpus to be parsed and the slot corresponding to each intent; and based on the parsed intents and the slots corresponding to the intents, executing the operation corresponding to the user voice instruction or generating a response voice.
In a possible implementation of the above second aspect, the above method further includes: the operation includes one or more of sending an instruction to a smart home device, opening application software, searching a web page, making a call, and sending and receiving short messages.
For example, for the corpus to be parsed obtained by converting a user voice instruction through a smartphone, the parsed intents are booking a ticket and booking a hotel, and the slots corresponding to these two intents are departure, destination, (hotel) location and (hotel) star rating; then the operation performed by the smartphone may be to open ticket and hotel reservation software, query the ticket information corresponding to the departure and destination for the user to choose, and recommend a five-star hotel in a certain location for the user to select. The smart home device may include, but is not limited to, a laptop computer, a desktop computer, a tablet computer, a smartphone, a wearable device, a portable music player, a reader device, or other electronic devices capable of accessing a network.
In a third aspect, an embodiment of the present application provides a man-machine dialogue system, the system including: a speech recognition module, configured to convert a user voice instruction into corpus data in text form; a semantic parsing module, configured to execute the above semantic parsing method; a problem solving module, configured to find a solution for the result obtained by the semantic parsing module; a language generation module, configured to generate a natural language sentence corresponding to the solution; a speech synthesis module, configured to synthesize the natural language sentence into a response voice; and a dialogue management module, configured to schedule the speech recognition module, the semantic parsing module, the problem solving module, the language generation module and the speech synthesis module to cooperate with each other to realize man-machine dialogue.
In a fourth aspect, an embodiment of the present application provides a readable medium, where instructions are stored on the readable medium, and when executed on an electronic device, the instructions cause the electronic device to execute the above semantic parsing method or the above man-machine dialogue method.
In a fifth aspect, an embodiment of the present application provides an electronic device, including: a memory, configured to store instructions executed by one or more processors of the electronic device; and a processor, which is one of the processors of the electronic device and is configured to execute the above semantic parsing method or the above man-machine dialogue method.
Description of the Drawings
FIG. 1 is a schematic software block diagram of a common man-machine dialogue system;
FIG. 2 is a schematic diagram of a man-machine dialogue scenario to which an embodiment of the present application is applicable;
FIG. 3 is a schematic diagram of an exemplary structure of a semantic parsing model in an embodiment of the present application;
FIG. 4 is a schematic diagram of processing results of corpus data at different stages in the semantic parsing method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of the training process of the semantic parsing model in the semantic parsing method according to an embodiment of the present application;
FIG. 6 is a schematic diagram of the interaction flow between a mobile phone 100 and a user according to an embodiment of the present application;
FIG. 7 is a schematic interface diagram of the mobile phone 100 performing corresponding operations according to a user voice instruction according to an embodiment of the present application;
FIG. 8 is an exemplary structural diagram of a mobile phone 100 according to an embodiment of the present application.
具体实施方式detailed description
本申请的说明性实施例包括但不限于电子设备及其语义解析方法和介质。Illustrative embodiments of the present application include, but are not limited to, electronic devices and semantic parsing methods and media thereof.
如上所述,现有技术在处理多意图语句时,存在无法识别出用户语音中的多个意图,从而在通过意图识别结果进行槽位填充时槽位填充结果出错率高的问题。为了解决该问题,本申请实施例首先从用户语音中识别出接近用户真实意图的多个意图,然后采用识别出的多个意图预测槽位信息,从而提高槽位填充的准确性,也相应的提高了槽位填充的速度或效率,进而提高人机对话中语义解析的准确度。As mentioned above, when processing multi-intent sentences in the prior art, there is a problem that multiple intentions in the user's voice cannot be recognized, so that the error rate of the slot filling result is high when the slot filling is performed based on the intention recognition result. In order to solve this problem, the embodiment of the present application first identifies multiple intents close to the user's true intent from the user's voice, and then uses the identified multiple intents to predict slot information, thereby improving the accuracy of slot filling, and correspondingly The speed or efficiency of slot filling is improved, thereby improving the accuracy of semantic parsing in human-machine dialogue.
为了方便清楚的理解本申请实施例,下面对本申请实施例中可能涉及中技术术语以及神经网络的相关术语做简要介绍。In order to facilitate a clear understanding of the embodiments of the present application, the following briefly introduces technical terms that may be involved in the embodiments of the present application and related terms of neural networks.
(1)自然语言处理(natural language processing,NLP)(1) Natural language processing (NLP)
自然语言(natural language)即人类语言,自然语言处理(NLP)就是对人类语言的处理。自然语言处理是以一种智能与高效的方式,对文本数据进行系统化分析、理解与信息提取的过程。通过使用NLP及其组件,我们可以管理非常大块的文本数据,或者执行大量的自动化任务,并且解决各式各样的问题,如自动摘要(automatic summarization),机器翻译(machine translation,MT),命名实体识别(named entity recognition,NER),关系提取(relation extraction,RE),信息抽取(information extraction,IE),情感分析,语音识别(speech recognition),问答系统(question answering)以及主题分割等等。示例性的,自然语言处理任务可以有以下几类。Natural language is human language, and natural language processing (NLP) is the processing of human language. Natural language processing is the process of systematically analyzing, understanding, and extracting information from text data in an intelligent and efficient manner. By using NLP and its components, we can manage very large chunks of textual data, or perform a large number of automated tasks, and solve a wide variety of problems, such as automatic summarization, machine translation (MT), Named Entity Recognition (NER), Relation Extraction (RE), Information Extraction (IE), Sentiment Analysis, Speech Recognition, Question Answering, Topic Segmentation, etc. . Exemplarily, natural language processing tasks can fall into the following categories.
序列标注:句子中每一个单词要求模型根据上下文给出一个分类类别。如中文分词、词性标注、命名实体识别、语义角色标注。Sequence tagging: Each word in a sentence requires the model to give a categorical category based on the context. Such as Chinese word segmentation, part-of-speech tagging, named entity recognition, semantic role tagging.
分类任务:整个句子输出一个分类值,如文本分类。Classification tasks: output a classification value for the entire sentence, such as text classification.
句子关系推断:给定两个句子,判断这两个句子是否具备某种名义关系。例如entilment、QA、语义改写、自然语言推断。Sentence relationship inference: Given two sentences, determine whether the two sentences have a nominal relationship. For example, enlightenment, QA, semantic rewriting, natural language inference.
生成式任务:输出一段文本,生成另一段文本。如机器翻译、文本摘要、写诗造句、看图说话。Generative task: output a piece of text, generate another piece of text. Such as machine translation, text summarization, writing poems and sentences, looking at pictures and talking.
(2)意图(intent):用户输入的语音指令都对应着用户的意图,可以理解,所谓意图就是用户的意愿表达,在人机对话系统中,意图一般是以“动词+名词”命名,例如查询天气、预定酒店等。意图识别,又称意图分类,主要是根据用户输入的语音指令提取与本次语音指令对应的意图。意图是一句或多句表达形式的集合,例如“我要看电影”和“我想看某年某位明星拍摄的动作电影”可以属于同一个播放视频的意图。一个意图下可以配置有一个或多个槽位。(2) Intent: The voice commands input by the user all correspond to the user's intention. It is understandable that the so-called intention is the expression of the user's will. In the human-machine dialogue system, the intention is generally named after "verb + noun", for example Check the weather, book hotels, etc. Intent recognition, also known as intent classification, mainly extracts the intent corresponding to the current voice command according to the voice command input by the user. An intent is a collection of one or more expressions, such as "I want to watch a movie" and "I want to see an action movie made by a certain star in a certain year" can belong to the same intent to play a video. An intent can be configured with one or more slots.
(3)槽位(slot)是用来表达用户意图的关键信息,槽位填充的准确度直接影响到电子设备能否匹配正确的意图。一个槽位对应着一类属性的关键词,该槽位中信息可以由同一类型的关键词进行填充,即槽位填充。例如,与歌曲播放这一意图对应的查询句式可以为“我想听{singer}的{song}”。其中,{singer}为歌手的槽位,{song}为歌曲的槽位。那么,如果接收到用户输入“我想听王菲的红豆”这一语音指令,则电子设备(或服务器)可从该语音指令中提取到{singer}这一槽位中填充的槽位信息为:王菲,{song}这一槽位中填充的槽位信息为:红豆。这样,电子设备(或服务器)可根据这两个槽位信息识别出本次语音输入的用户意图为:播放王菲的歌曲红豆。(3) The slot is the key information used to express the user's intention, and the accuracy of the slot filling directly affects whether the electronic device can match the correct intention. A slot corresponds to a keyword of a type of attribute, and the information in the slot can be filled with keywords of the same type, that is, slot filling. For example, the query pattern corresponding to the intent to play a song could be "I want to hear {song} of {singer}". Among them, {singer} is the singer's slot, and {song} is the song's slot. Then, if the voice command "I want to listen to Faye Wong's red beans" is received from the user, the electronic device (or server) can extract the slot information filled in the {singer} slot from the voice command as: Faye Wong, the slot information filled in the slot {song} is: red beans. In this way, the electronic device (or server) can identify, according to the two slot information, that the user's intention of this voice input is: to play Faye Wong's song Red Bean.
可以理解,本申请的语义解析方法适用于各种需要进行语义解析的场景,例如,用户向智能电子设备发出语音指令、用户与智能电子设备的语音助手进行人机对话等。为了便于说明,下文以人机对话系统为基础介绍本申请的语义解析方案。It can be understood that the semantic parsing method of the present application is suitable for various scenarios requiring semantic parsing, for example, a user sends a voice command to an intelligent electronic device, and a user conducts a man-machine dialogue with a voice assistant of the intelligent electronic device. For the convenience of description, the following introduces the semantic parsing solution of the present application based on the human-machine dialogue system.
目前，如图1所示，常见的人机对话系统110主要包括如下6个技术模块：语音识别模块111；语义解析模块112；问题求解模块113；语言生成模块114；对话管理模块115；语音合成模块116。其中，At present, as shown in FIG. 1, a common human-machine dialogue system 110 mainly includes the following six technical modules: a speech recognition module 111, a semantic parsing module 112, a problem solving module 113, a language generation module 114, a dialogue management module 115, and a speech synthesis module 116. Among them,
语音识别模块111,用于通过语音识别技术(Automatic Speech Recognition,ASR)实现语音到文本的识别转换,识别结果一般以得分最高的前n(n≥1)个句子或词格(word lattice)形式输出语料数据。The speech recognition module 111 is used to realize speech-to-text recognition and conversion through speech recognition technology (Automatic Speech Recognition, ASR). The recognition result is generally in the form of the top n (n≥1) sentences or word lattices with the highest scores. Output corpus data.
语义解析模块112，也被称为自然语言理解（Natural Language Understanding，NLU）模块，主要用于执行自然语言处理（natural language processing，NLP）任务，包括对语音识别模块输出的语料数据进行语义解析，识别用户表达的意图（intent）及相应的槽位（slot）。在本申请的实施例中，语义解析模块的功能通过预训练的语义解析模型121实现，关于语义解析模型121将在下文详细描述，此处不再赘述。The semantic parsing module 112, also called the natural language understanding (NLU) module, is mainly used to perform natural language processing (NLP) tasks, including performing semantic parsing on the corpus data output by the speech recognition module and identifying the intent expressed by the user and the corresponding slots. In the embodiments of the present application, the function of the semantic parsing module is implemented by a pre-trained semantic parsing model 121, which will be described in detail below and is not repeated here.
问题求解模块113,主要用于根据语义解析识别的意图及相应槽位进行推理或查询,以向用户反馈 对应其意图及相应槽位的解决方案。The problem solving module 113 is mainly used for reasoning or querying according to the intention identified by the semantic analysis and the corresponding slot, so as to feed back the solution corresponding to the intention and the corresponding slot to the user.
语言生成模块114，主要是对问题求解模块113找到的需要向用户输出的解决方案生成自然语言句子，以文本或进一步转化成语音反馈给用户。The language generation module 114 mainly generates natural language sentences for the solution that is found by the problem solving module 113 and needs to be output to the user, and feeds the sentences back to the user as text or after further conversion into speech.
对话管理模块115，是人机对话系统中的中心枢纽，用于基于对话历史调度人机交互系统中其他模块的相互配合，辅助语义解析模块对语音识别的结果进行正确的理解，为问题求解模块提供帮助，并指导语言生成模块的自然语言生成过程。The dialogue management module 115 is the central hub of the human-machine dialogue system. Based on the dialogue history, it schedules the cooperation of the other modules in the human-computer interaction system, assists the semantic parsing module in correctly understanding the speech recognition results, provides support for the problem solving module, and guides the natural language generation process of the language generation module.
语音合成模块116,用于对语言生成模块生成的自然语言句子转化成语音输出。The speech synthesis module 116 is used for converting the natural language sentences generated by the language generation module into speech output.
为使本申请的目的、技术方案和优点更加清楚,下面通过结合附图和实施方案,对本申请实施例的技术方案做进一步地详细描述。In order to make the objectives, technical solutions and advantages of the present application clearer, the technical solutions of the embodiments of the present application will be further described in detail below with reference to the accompanying drawings and embodiments.
图2根据本申请实施例示出了一种人机对话场景的示意图。FIG. 2 shows a schematic diagram of a man-machine dialogue scene according to an embodiment of the present application.
具体地,如图2所示,该应用场景包括电子设备100和电子设备200。其中电子设备100为与用户进行交互的终端智能设备,其上安装有能够进行语义解析的应用系统,例如上述的人机对话系统110。电子设备100可以通过人机对话系统110识别用户的语音指令,并根据语音指令执行相应的操作或者回答用户提出的问题。可以理解,在本申请中,电子设备100可以包括但不限于智能音箱、智能手机、可穿戴设备、头戴式显示器、车载智能语音导航等车载智能系统以及智能机器人、便携式音乐播放器、阅读器设备及其他安装有人机对话系统或者其他语音识别应用程序的电子设备。Specifically, as shown in FIG. 2 , the application scenario includes the electronic device 100 and the electronic device 200 . The electronic device 100 is a terminal intelligent device that interacts with a user, and an application system capable of semantic analysis, such as the above-mentioned human-machine dialogue system 110 , is installed thereon. The electronic device 100 can recognize the user's voice command through the man-machine dialogue system 110, and perform corresponding operations according to the voice command or answer the questions raised by the user. It can be understood that in this application, the electronic device 100 may include, but is not limited to, smart speakers, smart phones, wearable devices, head-mounted displays, in-vehicle intelligent systems such as in-vehicle intelligent voice navigation, as well as intelligent robots, portable music players, and readers. Equipment and other electronic equipment with man-machine dialogue systems or other speech recognition applications installed.
电子设备200可以用于训练语义解析模型121,将训练出的语义解析模型121移植到电子设备100,以供电子设备100进行语义解析并执行相应的操作。此外,电子设备200也可以通过训练出的语义解析模型121对电子设备100发送过来的语料数据进行语义解析并将结果反馈给电子设备100,电子设备100进一步执行相应的操作。The electronic device 200 can be used to train the semantic parsing model 121 , and transplant the trained semantic parsing model 121 to the electronic device 100 for the electronic device 100 to perform semantic parsing and perform corresponding operations. In addition, the electronic device 200 can also perform semantic parsing on the corpus data sent by the electronic device 100 through the trained semantic parsing model 121, and feed the result back to the electronic device 100, and the electronic device 100 further performs corresponding operations.
可以理解,电子设备200可以包括但不限于云端、服务器、膝上型计算机、台式计算机、平板计算机、以及其中嵌入或耦接有一个或多个处理器的能够访问网络的其他电子设备。It will be appreciated that electronic device 200 may include, but is not limited to, clouds, servers, laptops, desktops, tablet computers, and other electronic devices capable of accessing a network with one or more processors embedded or coupled therein.
为了便于说明,下文以电子设备100为手机、电子设备200为服务器为例,详细说明本申请的技术方案。其中,手机100安装有上述人机对话系统110,人机对话系统110中的语义解析模块112具有语义解析模型121,该语义解析模型121能够基于本申请的技术方案对用户语音进行语义解析。For convenience of description, the technical solutions of the present application are described in detail below by taking the electronic device 100 as a mobile phone and the electronic device 200 as a server as an example. The mobile phone 100 is installed with the human-machine dialogue system 110, and the semantic analysis module 112 in the human-machine dialogue system 110 has a semantic analysis model 121, which can perform semantic analysis on user speech based on the technical solution of the present application.
下面详细介绍本申请的语义解析模型121。The semantic parsing model 121 of the present application will be described in detail below.
语义解析模型121是由服务器200基于自然语言处理及上述各种神经网络结构和模型预训练出来的一个自然语言处理模型。预训练的语义解析模型121能够提取单条语料数据中的多个意图，并基于多个意图来预测槽位，从而准确地识别语料数据中的意图及相应的槽位，能够大大提高槽位填充的准确性。The semantic parsing model 121 is a natural language processing model pre-trained by the server 200 based on natural language processing and the various neural network structures and models described above. The pre-trained semantic parsing model 121 can extract multiple intents from a single piece of corpus data and predict slots based on the multiple intents, so that the intents and the corresponding slots in the corpus data are identified accurately, which can greatly improve the accuracy of slot filling.
数据预处理data preprocessing
输入语义解析模型121的数据，是语料数据经过预处理后得到的数据，其中，语料数据是用户语音指令经过识别转化后得到的。对语料数据的预处理，是人机对话系统110中理解文本的常规操作，是通过语义解析模块112执行的自然语言处理任务之一，例如，预处理一般包括对语料数据进行分词处理、填充标记（Token）序列及断句标记（Segmentation）以及创建掩码。数据预处理最终得到包含句子文本字符和句子中每个字文本字符的Token序列、代表每个字对应的句子位置的断句标记以及对应表示Token序列中各个字符位置上是否为有效字符的掩码。The data input into the semantic parsing model 121 is obtained by preprocessing the corpus data, where the corpus data is obtained by recognizing and converting the user's voice instruction. Preprocessing the corpus data is a routine operation for understanding text in the human-machine dialogue system 110 and is one of the natural language processing tasks performed by the semantic parsing module 112. For example, the preprocessing generally includes performing word segmentation on the corpus data, filling a Token sequence and segmentation marks (Segmentation), and creating a mask. The data preprocessing finally obtains a Token sequence containing a sentence character and a character for each word of the sentence, segmentation marks representing the sentence position corresponding to each word, and a mask indicating whether each character position in the Token sequence holds a valid character.
其中，分词处理，主要是利用分词工具（例如中文词汇表）将语料数据划分成句子和组成句子的单个字，并对得到的句子打上所有可能的意图标签、对组成句子的每个字打上所有可能的槽位标签。分词处理的目的是为下一步填充Token序列做数据准备。Among them, the word segmentation processing mainly uses a word segmentation tool (such as a Chinese vocabulary) to divide the corpus data into sentences and the individual characters that make up the sentences, labels the obtained sentences with all possible intent labels, and labels each character of the sentences with all possible slot labels. The purpose of the word segmentation processing is to prepare data for filling the Token sequence in the next step.
例如,如图4所示,对语音指令转化得到的语料数据“请为我播放你好旧时光”经过分词处理后得到:For example, as shown in Figure 4, the corpus data "please play hello old times for me" converted from the voice command is obtained after word segmentation:
3个可能的意图标签:PLAY_MUSIC、PLAY_VIDEO、PLAY_VOICE;3 possible intent tags: PLAY_MUSIC, PLAY_VIDEO, PLAY_VOICE;
一个句子:请为我播放你好旧时光;A sentence: please play hello old times for me;
组成句子的10个字:请、为、我、播、放、你、好、旧、时、光;10 words that make up a sentence: please, for, me, play, play, you, good, old, time, light;
其中,为每个字打上的槽位标签分别是:Among them, the slot labels marked for each word are:
请、为、我、播、放这五个词分别对应的槽位标签都是O;The slot labels corresponding to the five words please, for, me, play, and play are all O;
你对应3个槽位分别是songName-B、videoName-B和mediaName-B;Your corresponding 3 slots are songName-B, videoName-B and mediaName-B;
好、旧、时、光这四个字分别对应的3个槽位标签是songName-I、videoName-I、mediaName-I。The four words 好, 旧, 时 and 光 each correspond to the three slot labels songName-I, videoName-I and mediaName-I.
填充Token序列，主要是利用分词处理得到的数据通过对句子截断或者填充字符的方式得到符合字符长度要求的Token序列。通常，Token序列中每个句子的最大字符长度要求是maxLength=32，若分词得到的句子字符长度+2大于maxLength，则需要对句子进行截断；若分词得到的句子字符长度+2小于maxLength，则需要在句子结尾处填充空白字符<pad>使句子字符长度+2达到maxLength。Token序列中包含对应语音指令的整句话的句子字符，以及对应句子中每个字的词字符。Filling the Token sequence mainly uses the data obtained by word segmentation to produce a Token sequence that meets the character length requirement, by truncating the sentence or padding characters. Usually, the maximum character length requirement for each sentence in the Token sequence is maxLength=32. If the character length of the sentence obtained by word segmentation + 2 is greater than maxLength, the sentence needs to be truncated; if the character length of the sentence obtained by word segmentation + 2 is less than maxLength, the blank character <pad> needs to be padded at the end of the sentence so that the sentence character length + 2 reaches maxLength. The Token sequence contains the sentence character corresponding to the whole sentence of the voice instruction and the word characters corresponding to each word in the sentence.
其中，字符长度计算时+2主要是因为Token序列中的第一个字符一般是<CLS>，它标记的是分词得到的句子（例如，字符<CLS>标记的是句子：请为我播放你好旧时光），Token序列中的结尾字符一般是截断字符<SEP>，<SEP>表示其前面的句子是符合单句字符长度要求的完整句子，字符<CLS>与<SEP>之间的每个字上都打上断句标记“句子1”，表明这些字都是组成句子1的字。如果一个Token序列中有两个<SEP>，则表明第一个<SEP>到前面<CLS>之间是第一个句子，两个<SEP>之间是第二个句子，每个句子的字符长度+2均要求符合最大字符长度要求。一般地，用户指令中包含的字符长度+2是在32位最大字符长度范围内。The +2 in the character length calculation is mainly because the first character in the Token sequence is generally <CLS>, which marks the sentence obtained by word segmentation (for example, the character <CLS> marks the sentence: 请为我播放你好旧时光), and the ending character in the Token sequence is generally the truncation character <SEP>, which indicates that the sentence before it is a complete sentence meeting the single-sentence character length requirement. Each word between the characters <CLS> and <SEP> carries the segmentation mark "sentence 1", indicating that these words make up sentence 1. If there are two <SEP> characters in a Token sequence, the part between the preceding <CLS> and the first <SEP> is the first sentence, the part between the two <SEP> characters is the second sentence, and the character length + 2 of each sentence is required to meet the maximum character length requirement. Generally, the character length + 2 of a user instruction is within the maximum character length of 32.
创建掩码,主要是对上述填充得到的Token序列中的每个字符对应创建一个掩码(Mask)。创建掩码的目的是将Token序列中每个字符是否表达有效信息标记计算机可读的标记码。其中,Token序列中字符<pad>对应创建的掩码元素值为0,非字符<pad>的字符对应创建的掩码元素值为1。Creating a mask is mainly to create a mask (Mask) corresponding to each character in the Token sequence obtained by the above filling. The purpose of creating a mask is to mark whether each character in the Token sequence expresses valid information into a computer-readable marking code. Among them, the value of the created mask element corresponding to the character <pad> in the Token sequence is 0, and the value of the created mask element corresponding to the character other than the character <pad> is 1.
如图4所示，对语料数据进行数据预处理后主要得到三个数据，即Token序列、断句标记以及对应Token序列生成的掩码，对于图4所示的语料数据，上述三个数据分别为：As shown in Figure 4, after the data preprocessing is performed on the corpus data, three pieces of data are mainly obtained, namely the Token sequence, the segmentation marks, and the mask generated for the Token sequence. For the corpus data shown in Figure 4, the three pieces of data are respectively:
Token序列:<CLS>请为我播放你好旧时光<pad>…<pad><SEP>;Token sequence: <CLS> Please play hello old times for me <pad>…<pad><SEP>;
断句标记:句子1(请、为、我、播、放、你、好、旧、时、光);Segmentation mark: Sentence 1 (please, for, me, play, play, you, good, old, time, light);
掩码:{11111111111000000000000000000001}。Mask: {11111111111000000000000000000001}.
又例如，如果用户输入的语音指令识别得到的语料数据是“帮我预定从上海到北京的火车票并预定北京火车站附近的五星级酒店”，经过上述数据预处理后得到的三个数据分别为：For another example, if the corpus data obtained by recognizing the voice instruction input by the user is "帮我预定从上海到北京的火车票并预定北京火车站附近的五星级酒店" (help me book a train ticket from Shanghai to Beijing and book a five-star hotel near Beijing Railway Station), the three pieces of data obtained after the above data preprocessing are respectively:
Token序列:<CLS>帮我预定从上海到北京的火车票并预定北京火车站附近的五星级酒店<SEP>;Token sequence: <CLS> help me book train tickets from Shanghai to Beijing and book a five-star hotel near Beijing Railway Station <SEP>;
断句标记:句子1(帮、我、预、定、从、上、海、到、北、京、的、火、车、票、并、预、定、北、京、火、车、站、附、近、的、五、星、级、酒、店);Segmentation mark: Sentence 1 (help, me, pre-determined, from, Shanghai, sea, to, north, Beijing, de, fire, train, ticket, and, pre, fixed, north, Beijing, train, train, station, Near, near, of, five, star, grade, hotel, shop);
掩码:{11111111111111111111111111111111}。Mask: {11111111111111111111111111111111}.
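For readability, the preprocessing described above (character-level word segmentation, filling the Token sequence to maxLength=32 with <CLS>, <SEP> and <pad>, assigning the segmentation mark "sentence 1", and creating the mask) can be sketched with the following illustrative Python code. This sketch is not part of the embodiments; the function name preprocess and the simplification that every position is marked as sentence 1 are assumptions made only for illustration.

    # Illustrative sketch of the data preprocessing: character-level segmentation,
    # Token sequence of maxLength=32, segmentation marks and mask.
    MAX_LENGTH = 32

    def preprocess(corpus: str):
        chars = list(corpus)                         # split the corpus into single characters
        if len(chars) + 2 > MAX_LENGTH:              # sentence length + 2 (<CLS> and <SEP>)
            chars = chars[:MAX_LENGTH - 2]           # truncate over-long sentences
        tokens = ["<CLS>"] + chars
        tokens += ["<pad>"] * (MAX_LENGTH - 1 - len(tokens))   # pad up to position 31
        tokens += ["<SEP>"]                          # ending truncation character
        segment_ids = [1] * MAX_LENGTH               # simplification: all positions marked as sentence 1
        mask = [0 if t == "<pad>" else 1 for t in tokens]       # 1 = valid character, 0 = padding
        return tokens, segment_ids, mask

    tokens, segment_ids, mask = preprocess("请为我播放你好旧时光")
    print("".join(str(m) for m in mask))
    # -> 11111111111000000000000000000001  (matches the mask given above)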
将语料数据经过上述数据预处理后得到的三个数据可以输入语义解析模型121中进行语义解析。下面将详细介绍语义解析模型121。The three data obtained after the above-mentioned data preprocessing of the corpus data can be input into the semantic parsing model 121 for semantic parsing. The semantic parsing model 121 will be described in detail below.
语义解析模型121Semantic Parsing Model 121
具体地,如图3所示,该语义解析模型121包括BERT编码层1211、意图分类层1212、注意力层 1213、槽位填充层1214以及后置处理层1215。Specifically, as shown in FIG. 3 , the semantic parsing model 121 includes a BERT encoding layer 1211, an intent classification layer 1212, an attention layer 1213, a slot filling layer 1214, and a post-processing layer 1215.
1)BERT编码层12111) BERT coding layer 1211
BERT编码层1211以语料数据经过数据预处理后得到的Token序列、断句标记以及掩码作为输入，编码后输出编码向量序列。其中，编码向量序列包括句向量和词向量，句向量表示待解析语料数据的语义信息，词向量包含待解析语料数据中每个字的词义信息。可以理解，语义信息和词义信息是对语料数据基于自然语言理解的意思表达，这些语义信息和词义信息能够表达用户的真实意图以及对应用户真实意图的真实槽位。The BERT encoding layer 1211 takes as input the Token sequence, segmentation marks and mask obtained after the data preprocessing of the corpus data, and outputs an encoded vector sequence after encoding. The encoded vector sequence includes a sentence vector and word vectors; the sentence vector represents the semantic information of the corpus data to be parsed, and the word vectors contain the word meaning information of each word in the corpus data to be parsed. It can be understood that the semantic information and the word meaning information are meaning expressions of the corpus data based on natural language understanding, and can express the real intent of the user and the real slots corresponding to that real intent.
例如，如图4所示，如果待解析语料数据是“请为我播放你好旧时光”，那么BERT编码层1211输出的编码向量序列{h_0,h_1,h_2,……,h_t}中，句向量h_0表示的语义信息可能包括PLAY_MUSIC、PLAY_VIDEO、PLAY_VOICE、你好、旧时光、你好旧时光等。词向量h_1,h_2,……,h_t表示的词义信息可能包括songName、videoName、mediaName以及组成句子的每个字的字面含义，其中h_1对应的字是请、h_2对应的字是为、h_3对应的字是我，……，h_10对应的字是光。For example, as shown in Figure 4, if the corpus data to be parsed is "请为我播放你好旧时光", then in the encoded vector sequence {h_0, h_1, h_2, ..., h_t} output by the BERT encoding layer 1211, the semantic information represented by the sentence vector h_0 may include PLAY_MUSIC, PLAY_VIDEO, PLAY_VOICE, 你好, 旧时光, 你好旧时光, and so on. The word meaning information represented by the word vectors h_1, h_2, ..., h_t may include songName, videoName, mediaName and the literal meaning of each word of the sentence, where the word corresponding to h_1 is 请, the word corresponding to h_2 is 为, the word corresponding to h_3 is 我, ..., and the word corresponding to h_10 is 光.
又例如，如果待解析的语料数据是“帮我预定从上海到北京的火车票并预定北京火车站附近的五星级酒店”，那么BERT编码层1211输出的编码向量序列{h_0,h_1,h_2,……,h_t}中，句向量h_0表示的语义信息可能包括订车票、订酒店、出发地、目的地、上海、北京、酒店、星级、五星级等。词向量h_1,h_2,……,h_t表示的词义信息可能包括出发地、目的地、上海、北京、酒店、星级、五星级以及组成句子的每个字的字面含义，其中h_1对应的字是帮、h_2对应的字是我、h_3对应的字是预、……、h_30对应的字是店。For another example, if the corpus data to be parsed is "帮我预定从上海到北京的火车票并预定北京火车站附近的五星级酒店", then in the encoded vector sequence {h_0, h_1, h_2, ..., h_t} output by the BERT encoding layer 1211, the semantic information represented by the sentence vector h_0 may include booking a train ticket, booking a hotel, departure place, destination, Shanghai, Beijing, hotel, star rating, five-star, and so on. The word meaning information represented by the word vectors h_1, h_2, ..., h_t may include departure place, destination, Shanghai, Beijing, hotel, star rating, five-star and the literal meaning of each word of the sentence, where the word corresponding to h_1 is 帮, the word corresponding to h_2 is 我, the word corresponding to h_3 is 预, ..., and the word corresponding to h_30 is 店.
具体地,BERT编码层1211的工作过程如图3所示:Specifically, the working process of the BERT coding layer 1211 is shown in Figure 3:
经过数据预处理得到的Token序列、断句标记以及对应Token序列生成的掩码将作为BERT编码层1211的输入。The Token sequence, segmented tag and the mask generated by the corresponding Token sequence obtained after data preprocessing will be used as the input of the BERT coding layer 1211 .
BERT编码层1211通过识别掩码元素值，依次识别Token序列中的有效字符<CLS>、x_1、x_2、……、x_{t-1}、<SEP>和空白字符（无效字符）（有效字符的掩码元素值为1，空白字符的掩码元素值为0）。其中，t表示对句子中的每个字处理的时刻，下称时间步t或t时刻，例如，t=1时刻处理x_1字符对应的字，t=2时刻处理x_2字符对应的字。The BERT encoding layer 1211 identifies, by reading the mask element values, the valid characters <CLS>, x_1, x_2, ..., x_{t-1}, <SEP> and the blank (invalid) characters in the Token sequence in sequence (the mask element value of a valid character is 1, and that of a blank character is 0). Here, t denotes the moment at which each word in the sentence is processed, hereinafter called time step t or time t; for example, the word corresponding to the character x_1 is processed at time t=1, and the word corresponding to the character x_2 is processed at time t=2.
Token序列中标记句子的字符<CLS>输入训练后的BERT编码层1211进行语义编码后,对字符<CLS>赋予语料数据的语义信息,生成一个高维的句向量h 0The character <CLS> that marks the sentence in the Token sequence is input into the trained BERT coding layer 1211 for semantic encoding, and then the character <CLS> is given semantic information of the corpus data to generate a high-dimensional sentence vector h 0 .
Token序列中字符<CLS>与截断字符<SEP>之间的字符x 1,x 2,……,x t-1对应语料数据中组成句子的每个字,字符x 1,x 2,……,x t-1输入训练后的BERT编码层1211进行语义编码后,对字符x 1,x 2,……,x t-1赋予语料数据的词义信息,对应生成高维的词向量h 1,h 2,……,h tThe characters x 1 , x 2 ,...,x t-1 between the character <CLS> and the truncated character <SEP> in the Token sequence correspond to each word that constitutes the sentence in the corpus data, and the characters x 1 , x 2 ,... , x t-1 is input into the trained BERT coding layer 1211 for semantic encoding, and then assigns the semantic information of the corpus data to the characters x 1 , x 2 ,..., x t-1 , correspondingly generates a high-dimensional word vector h 1 , h 2 ,...,h t .
Token序列中的空白字符<pad>对应的掩码元素值为0,未标记任何词,因此不作为BERT编码层1211的输入。The mask element value corresponding to the blank character <pad> in the Token sequence is 0, and no word is marked, so it is not used as the input of the BERT encoding layer 1211.
基于句向量h 0和词向量h 1,h 2,……,h t生成编码向量序列,作为BERT编码层1211的输出。 Based on the sentence vector h 0 and the word vectors h 1 , h 2 , .
BERT编码层1211可以基于BERT模型训练得到,具体训练过程参考下文详细描述,在此不再赘述。其中,BERT模型是一种基于微调的多层双向变换器编码器模型,BERT模型的关键技术创新是将变换器的双向培训应用于语言建模。使用BERT模型训练BERT编码层有两个阶段:预训练和微调。预训练期间BERT模型在不同的预训练任务上训练未标记的数据后,首先使用预训练参数初始化BERT模型,并使用来自下游任务的标记数据对所有参数进行微调。BERT模型的一个显著特点是它跨越不同任务的统一架构,因此其预训练架构与最终下游架构之间的差异很小。BERT模型能够进一步增加词向量模型泛化能力,充分描述字符级、词级、句子级甚至句间关系特征。The BERT coding layer 1211 can be obtained by training based on the BERT model. For the specific training process, please refer to the detailed description below, which will not be repeated here. Among them, the BERT model is a multi-layer bidirectional transformer encoder model based on fine-tuning, and the key technological innovation of the BERT model is to apply the bidirectional training of the transformer to language modeling. There are two stages to training a BERT encoding layer with a BERT model: pre-training and fine-tuning. After the BERT model is trained on unlabeled data on different pre-training tasks during pre-training, the BERT model is first initialized with pre-trained parameters, and all parameters are fine-tuned using labeled data from downstream tasks. A striking feature of the BERT model is its unified architecture across different tasks, so there is little difference between its pretrained architecture and the final downstream architecture. The BERT model can further increase the generalization ability of the word vector model, and fully describe the character-level, word-level, sentence-level and even inter-sentence relationship features.
在另一些实施例中,BERT编码层1211也可以通过其他编码器或编码模型训练得到,此处不做限制。In other embodiments, the BERT encoding layer 1211 can also be obtained by training other encoders or encoding models, which is not limited here.
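As an illustrative sketch only, the following Python code shows how the three preprocessed inputs can be fed through a BERT-style encoder to obtain the sentence vector h_0 and the word vectors h_1 ... h_t. It uses the open-source HuggingFace transformers package and the public bert-base-chinese checkpoint as stand-ins; the actual pre-trained and fine-tuned encoder of the embodiments is not specified here.

    # Sketch: obtaining h_0 and h_1...h_t from a public BERT checkpoint (stand-in).
    import torch
    from transformers import BertTokenizer, BertModel

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    encoder = BertModel.from_pretrained("bert-base-chinese")

    inputs = tokenizer("请为我播放你好旧时光",
                       padding="max_length", max_length=32, return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**inputs)      # uses input_ids, token_type_ids, attention_mask

    hidden = outputs.last_hidden_state   # shape (1, 32, 768): one vector per Token position
    h0 = hidden[:, 0, :]                 # sentence vector h_0 (at the [CLS] position)
    word_vectors = hidden[:, 1:, :]      # word vectors h_1 ... h_t for the characters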
2)意图分类层12122) Intent classification layer 1212
意图分类层1212用于预测语料数据中的候选意图,其中,该意图分类层1212可以提取出语料数据中的多个意图标签,并保留满足条件的意图标签作为候选意图输出。The intent classification layer 1212 is used to predict candidate intents in the corpus data, wherein the intent classification layer 1212 can extract multiple intent labels in the corpus data, and retain the intent labels that meet the conditions as candidate intent outputs.
具体地,意图分类层1212以上述BERT编码层1211得到的句向量h 0为输入,基于句向量h 0表示的语义信息,意图分类层1212可以提取所有可能的意图标签,并对提取的每个意图标签计算意图置信度以判断该意图标签是否满足输出条件。 Specifically, the intent classification layer 1212 takes the sentence vector h 0 obtained by the above-mentioned BERT encoding layer 1211 as input, and based on the semantic information represented by the sentence vector h 0 , the intent classification layer 1212 can extract all possible intent labels, and for each extracted The intent label calculates the intent confidence to judge whether the intent label satisfies the output condition.
可以理解,此处,意图置信度表示提取的意图标签与语料数据表达的真实意图的接近程度,也可以称为意图可靠度。意图置信度越高的意图与语料数据表达的真实意图越接近。在意图分类层1212中,可以对意图置信度设定某一阈值,例如,设置意图置信度的阈值为0.5,意图置信度大于或等于阈值的意图标签满足输出条件,对应的意图标签将会输出作为候选意图;意图置信度小于阈值的意图标签不满足输出条件,其对应的意图标签将会被删除,不会从意图分类层1212输出。It can be understood that, here, the intent confidence represents the closeness of the extracted intent label to the real intent expressed by the corpus data, and may also be referred to as intent reliability. The intent with higher intent confidence is closer to the real intent expressed by the corpus data. In the intent classification layer 1212, a certain threshold can be set for the intent confidence, for example, the threshold of the intent confidence is set to 0.5, and the intent label whose intent confidence is greater than or equal to the threshold satisfies the output condition, and the corresponding intent label will be output As a candidate intent; an intent label whose intent confidence is less than the threshold does not meet the output conditions, and its corresponding intent label will be deleted and will not be output from the intent classification layer 1212 .
例如,如图4所示,如果待解析的语料数据是“请为我播放你好旧时光”,则通过BERT编码层输出的句向量h 0表示的语义信息中可能包括3个可能的意图标签:PLAY_MUSIC、PLAY_VIDEO、PLAY_VOICE。该句向量h 0输入意图分类层1212后,意图分类层1212提取上述3个可能的意图标签,并计算每个意图标签的意图置信度分别为0.8,0.75,0.5,假如在意图分类层1212设置的意图置信度阈值为0.5,那么上述3个意图标签的意图置信度均符合大于等于0.5的条件,即上述3个意图标签满足输出条件,最终意图分类层1212输出3个候选意图:PLAY_MUSIC、PLAY_VIDEO、PLAY_VOICE。 For example, as shown in Figure 4, if the corpus data to be parsed is "please play hello old times for me", the semantic information represented by the sentence vector h 0 output by the BERT coding layer may include 3 possible intent labels : PLAY_MUSIC, PLAY_VIDEO, PLAY_VOICE. After the sentence vector h0 is input into the intent classification layer 1212, the intent classification layer 1212 extracts the above three possible intent labels, and calculates the intent confidence of each intent label as 0.8, 0.75, and 0.5, respectively. If the intent classification layer 1212 sets the The intent confidence threshold is 0.5, then the intent confidence of the above three intent labels all meet the condition of being greater than or equal to 0.5, that is, the above three intent labels satisfy the output condition, and the final intent classification layer 1212 outputs three candidate intents: PLAY_MUSIC, PLAY_VIDEO , PLAY_VOICE.
又例如,如果待解析的语料数据是“帮我预定从上海到北京的火车票并预定北京火车站附近的五星级酒店”,则通过BERT编码层输出的句向量h 0表示的语义信息中可能包括4个可能的意图标签查车次、订车票、找酒店、订酒店。该句向量h 0输入意图分类层1212后,意图分类层1212提取上述4个可能的意图标签,并计算每个意图标签的意图置信度分别为0.48,0.87,0.45,0.7,假如在意图分类层1212设置的意图置信度阈值为0.5,那么上述4个意图标签中对应的意图置信度大于或等于0.5的意图标签为订车票和订酒店,满足输出条件,那么意图分类层1212输出2个候选意图:订车票、订酒店。而意图置信度小于0.5的两个意图标签:查车次、找酒店不满足输出条件,则不会从意图分类层1212中输出。 For another example, if the corpus data to be parsed is "help me book a train ticket from Shanghai to Beijing and book a five-star hotel near Beijing Railway Station", the semantic information represented by the sentence vector h 0 output by the BERT coding layer May include 4 possible intent labels to check train times, book tickets, find hotels, and book hotels. After the sentence vector h 0 is input into the intent classification layer 1212, the intent classification layer 1212 extracts the above four possible intent labels, and calculates the intent confidence of each intent label as 0.48, 0.87, 0.45, and 0.7, respectively. The intent confidence threshold set by 1212 is 0.5, then the intent labels with the corresponding intent confidence greater than or equal to 0.5 among the above four intent labels are ticket booking and hotel booking, which satisfy the output conditions, then the intent classification layer 1212 outputs 2 candidate intents : Book tickets, book hotels. However, the two intent labels whose intent confidence is less than 0.5: checking the number of trains and finding a hotel do not meet the output conditions, so they will not be output from the intent classification layer 1212 .
具体地,意图分类层1212的工作过程如图3所示:Specifically, the working process of the intent classification layer 1212 is shown in FIG. 3 :
意图分类层1212以BERT编码层1211输出的编码向量序列中的句向量h_0作为输入，通过解码并激活句向量h_0，提取句向量h_0表示的语义信息中所有可能的意图标签，并计算每个意图标签的意图置信度y^I。其中，意图置信度y^I经过Sigmoid激活函数后的计算公式如下：The intent classification layer 1212 takes the sentence vector h_0 in the encoded vector sequence output by the BERT encoding layer 1211 as input, extracts all possible intent labels in the semantic information represented by the sentence vector h_0 by decoding and activating the sentence vector h_0, and calculates the intent confidence y^I of each intent label. The intent confidence y^I obtained through the Sigmoid activation function is calculated as follows:
y^I = Sigmoid(W^I · h_0 + b^I)    (1)
其中，I表示意图数量，W^I为句向量h_0的随机权重系数，b^I表示偏差值。Here, I denotes the number of intents, W^I is the random weight coefficient applied to the sentence vector h_0, and b^I denotes the bias value.
意图分类层1212可以通过全连接层(dense)以及Sigmoid函数作为激活函数训练得到,具体训练过程参考下文详细描述,在此不再赘述。可以理解,在另一些实施例中,可以采用其他与全连接层具有相同功能的深度神经网络作为解码器,也可以采用其他与Sigmoid函数具有相同功能的函数作为对应的深度神经网络解码器的激活函数,此处不做限制。The intent classification layer 1212 can be obtained by training a fully connected layer (dense) and a sigmoid function as an activation function. For the specific training process, please refer to the detailed description below, which will not be repeated here. It can be understood that in other embodiments, other deep neural networks with the same function as the fully connected layer can be used as the decoder, and other functions with the same function as the Sigmoid function can also be used as the activation of the corresponding deep neural network decoder. function, there is no restriction here.
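A minimal sketch of such a multi-label intent classifier, assuming a PyTorch implementation with a 768-dimensional sentence vector and the three intent labels of the example above, might look as follows; the class name, layer sizes, label set and the random h_0 are illustrative assumptions, not the trained model of the embodiments.

    # Sketch: fully connected layer + Sigmoid over h_0, as in formula (1),
    # keeping only intent labels whose confidence reaches the 0.5 threshold.
    import torch
    import torch.nn as nn

    class IntentClassifier(nn.Module):
        def __init__(self, hidden_size=768, num_intents=3):
            super().__init__()
            self.dense = nn.Linear(hidden_size, num_intents)   # W^I, b^I of formula (1)

        def forward(self, h0):
            return torch.sigmoid(self.dense(h0))               # y^I = Sigmoid(W^I h_0 + b^I)

    intent_labels = ["PLAY_MUSIC", "PLAY_VIDEO", "PLAY_VOICE"]  # assumed label set
    classifier = IntentClassifier(num_intents=len(intent_labels))
    h0 = torch.randn(1, 768)                                    # stand-in for the sentence vector h_0
    confidence = classifier(h0)[0]                              # one confidence per intent label
    candidates = [lab for lab, c in zip(intent_labels, confidence) if c >= 0.5]
    print(candidates)     # intent labels whose confidence passes the threshold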
3)注意力层12133) Attention layer 1213
注意力层1213用于量化语料数据中每个字与句子所表达意图的相关程度，例如，可以用意图注意力向量表示，意图注意力向量也可以理解为意图上下文向量；注意力层1213还用于量化语料数据中每个字与句子所表达槽位的相关程度，例如，用槽位注意力向量表示。其中，注意力层1213输出的意图注意力向量将作为槽位填充层1214的输入以指导槽位预测，以提高槽位预测的准确性；注意力层1213输出的槽位注意力向量用作槽位计算的偏差值，以矫正槽位预测计算的偏差。The attention layer 1213 is used to quantify the degree of correlation between each word in the corpus data and the intent expressed by the sentence, which can be represented, for example, by an intent attention vector (the intent attention vector can also be understood as an intent context vector); the attention layer 1213 is also used to quantify the degree of correlation between each word in the corpus data and the slots expressed by the sentence, which can be represented, for example, by a slot attention vector. The intent attention vector output by the attention layer 1213 is used as an input of the slot filling layer 1214 to guide slot prediction and improve its accuracy; the slot attention vector output by the attention layer 1213 is used as a bias term in the slot calculation to correct the deviation of the slot prediction calculation.
具体地，注意力层1213以BERT编码层1211输出的编码向量序列为输入，基于句向量h_0表示的语义信息和词向量h_1,h_2,……,h_t表示的词义信息，注意力层输出的意图注意力向量相应地可以理解为用于量化每个词向量对应的字与句向量对应的句子表达意图的相关程度，注意力层输出的槽位注意力向量相应地可以理解为用于量化每个词向量对应的字与句向量对应的句子所表达槽位的相关程度。Specifically, the attention layer 1213 takes the encoded vector sequence output by the BERT encoding layer 1211 as input. Based on the semantic information represented by the sentence vector h_0 and the word meaning information represented by the word vectors h_1, h_2, ..., h_t, the intent attention vector output by the attention layer can be understood as quantifying the degree of correlation between the word corresponding to each word vector and the intent expressed by the sentence corresponding to the sentence vector, and the slot attention vector output by the attention layer can be understood as quantifying the degree of correlation between the word corresponding to each word vector and the slots expressed by the sentence corresponding to the sentence vector.
例如，如图4所示，如果待解析的语料数据是“请为我播放你好旧时光”，则通过BERT编码层输出的编码向量序列中，句向量h_0表示的语义信息中可能包括3个可能的意图标签：PLAY_MUSIC、PLAY_VIDEO、PLAY_VOICE，词向量h_1,h_2,……,h_t表示的词义信息可能包括songName、videoName、mediaName以及组成句子的每个字的字面含义。将上述编码向量序列输入注意力层1213后，例如，注意力层1213输出的意图注意力向量C^I（对应：请为我播放你好旧时光，播，放），其中，句子“请为我播放你好旧时光”所表达的意图可能是PLAY_MUSIC、PLAY_VIDEO、PLAY_VOICE，“播、放”与句子所表达意图的相关程度比较高，“你、好、旧、时、光、请、为、我”与句子所表达意图的相关程度较低或者不相关。而句子“请为我播放你好旧时光”所表达的槽位是songName、videoName、mediaName，那么t=1时刻，注意力层1213输出的槽位注意力向量C_1^S表示“请”与上述3个槽位的相关程度，例如相关程度为0即不相关；在t=2时刻，注意力层1213输出的槽位注意力向量C_2^S表示“为”与上述3个槽位的相关程度，例如相关程度为0即不相关；以此类推，在t=6时刻，注意力层1213输出的槽位注意力向量C_6^S表示“你”与上述3个槽位的相关程度，例如相关程度为0.9即相关程度较大；最终可以得出，“你、好、旧、时、光”与上述三个槽位的相关程度均比较高，“播、放、请、为、我”与句子所表达槽位的相关程度较低或者不相关。For example, as shown in Figure 4, if the corpus data to be parsed is "请为我播放你好旧时光", the semantic information represented by the sentence vector h_0 in the encoded vector sequence output by the BERT encoding layer may include 3 possible intent labels: PLAY_MUSIC, PLAY_VIDEO, PLAY_VOICE, and the word meaning information represented by the word vectors h_1, h_2, ..., h_t may include songName, videoName, mediaName and the literal meaning of each word of the sentence. After the encoded vector sequence is input into the attention layer 1213, the attention layer 1213 outputs, for example, the intent attention vector C^I (corresponding to: 请为我播放你好旧时光, 播, 放). The intents expressed by the sentence "请为我播放你好旧时光" may be PLAY_MUSIC, PLAY_VIDEO and PLAY_VOICE; "播, 放" are highly correlated with the intents expressed by the sentence, while "你, 好, 旧, 时, 光, 请, 为, 我" have a low or no correlation with them. The slots expressed by the sentence "请为我播放你好旧时光" are songName, videoName and mediaName. At time t=1, the slot attention vector C_1^S output by the attention layer 1213 indicates the degree of correlation between "请" and the above 3 slots, for example a correlation of 0, i.e. not correlated; at time t=2, the slot attention vector C_2^S indicates the degree of correlation between "为" and the above 3 slots, for example 0, i.e. not correlated; and so on, at time t=6, the slot attention vector C_6^S indicates the degree of correlation between "你" and the above 3 slots, for example 0.9, i.e. a high degree of correlation. In the end it can be concluded that "你, 好, 旧, 时, 光" are all highly correlated with the above three slots, while "播, 放, 请, 为, 我" have a low or no correlation with the slots expressed by the sentence.
又例如，如果待解析的语料数据是“帮我预定从上海到北京的火车票并预定北京火车站附近的五星级酒店”，那么注意力层1213输出的意图注意力向量C^I（对应：帮我预定从上海到北京的火车票并预定北京火车站附近的五星级酒店，预，定，火，车，票，酒，店），其中，句子“帮我预定从上海到北京的火车票并预定北京火车站附近的五星级酒店”所表达的意图可能是订车票、订酒店，“预、定、火、车、票、酒、店”与句子所表达意图的相关程度比较高，“上、海、北、京、火、车、站、五、星、级、帮、我”与句子所表达意图的相关程度较低或者不相关。For another example, if the corpus data to be parsed is "帮我预定从上海到北京的火车票并预定北京火车站附近的五星级酒店", the attention layer 1213 outputs the intent attention vector C^I (corresponding to: 帮我预定从上海到北京的火车票并预定北京火车站附近的五星级酒店, 预, 定, 火, 车, 票, 酒, 店). The intents expressed by the sentence may be booking a train ticket and booking a hotel; "预, 定, 火, 车, 票, 酒, 店" are highly correlated with the intents expressed by the sentence, while "上, 海, 北, 京, 火, 车, 站, 五, 星, 级, 帮, 我" have a low or no correlation with them.
而句子“帮我预定从上海到北京的火车票并预定北京火车站附近的五星级酒店”所表达的槽位可能是出发地、目的地、地点、星级，那么t=1时刻，注意力层1213输出的槽位注意力向量C_1^S表示“帮”与上述4个槽位的相关程度，例如相关程度为0即不相关；在t=2时刻，注意力层1213输出的槽位注意力向量C_2^S表示“我”与上述4个槽位的相关程度，例如相关程度为0即不相关；以此类推，在t=6时刻，注意力层1213输出的槽位注意力向量C_6^S表示“上”与上述4个槽位的相关程度，例如，“上”与槽位“出发地”的相关程度为0.9，而与其他3个槽位（目的地、地点、星级）的相关程度为0.3，则表明“上”与槽位“出发地”的相关程度较高，与其他3个槽位（目的地、地点、星级）的相关程度较低。以此类推，最终可以得出，“上，海”与槽位“出发地”的相关程度较大，“北，京”与槽位“目的地”的相关程度较大，“北，京，火，车，站”与槽位“地点”的相关程度较大，“五，星，级”与槽位“星级”的相关程度较大，而句子中其他字，例如“帮、我”与上述4个槽位的相关程度较低或者不相关。The slots expressed by the sentence "帮我预定从上海到北京的火车票并预定北京火车站附近的五星级酒店" may be departure place, destination, location and star rating. At time t=1, the slot attention vector C_1^S output by the attention layer 1213 indicates the degree of correlation between "帮" and the above 4 slots, for example a correlation of 0, i.e. not correlated; at time t=2, the slot attention vector C_2^S indicates the degree of correlation between "我" and the above 4 slots, for example 0, i.e. not correlated; and so on, at time t=6, the slot attention vector C_6^S indicates the degree of correlation between "上" and the above 4 slots: for example, if the correlation between "上" and the slot "departure place" is 0.9 while its correlation with the other 3 slots (destination, location, star rating) is 0.3, this indicates that "上" is highly correlated with the slot "departure place" and weakly correlated with the other 3 slots. By analogy, it can finally be concluded that "上, 海" are highly correlated with the slot "departure place", "北, 京" with the slot "destination", "北, 京, 火, 车, 站" with the slot "location", and "五, 星, 级" with the slot "star rating", while the other words in the sentence, such as "帮, 我", have a low or no correlation with the above 4 slots.
具体地,注意力层1213的工作过程如图3所示:Specifically, the working process of the attention layer 1213 is shown in Figure 3:
注意力层1213以BERT编码层1211输出的编码向量序列{h 0,h 1,h 2,……,h t}作为输入,注意力层1213提取句向量h 0表示的语义信息以及词向量h 1,h 2,……,h t表示的词义信息,并在每个时间步t输出隐藏状态向量,该隐藏状态向量表示相应时间步t的前一时刻(t-1时刻)之前已提取的语义信息、词义信息。其中,相对于第一时刻(t=1)的t-1时刻(0时刻)注意力层1213输出的隐藏状态向量为句向量h 0对应的语义信息,t=2时刻的上一时刻(t=1时刻)输出的隐藏状态向量为词向量h 1对应的第一个字的词义信息,其中词向量h 1对应的词义信息还包括t=0时刻句向量h 0传递过来的语义信息;t=3时刻的上一时刻(t=2时刻)输出的隐藏状态向量为词向量h 2对应的第二个字的词义信息,其中,词向量h 2对应的词义信息包括t=1时刻词向量h 1传递过来的词义信息,而词向量h 1对应的词义信息还包括t=0时刻句向量h 0传递过来的语义信息;以此类推。 The attention layer 1213 takes the encoding vector sequence {h 0 , h 1 , h 2 ,..., h t } output by the BERT encoding layer 1211 as input, and the attention layer 1213 extracts the semantic information represented by the sentence vector h 0 and the word vector h 1 , h 2 ,...,h t represents the semantic information, and outputs a hidden state vector at each time step t, which represents the extracted data before the previous moment (t-1 moment) of the corresponding time step t. Semantic information, word meaning information. Among them, the hidden state vector output by the attention layer 1213 at time t-1 (time 0) at the first time (t=1) is the semantic information corresponding to the sentence vector h 0 , and the previous time at time t=2 (t = 1 time) the output hidden state vector is the semantic information of the first word corresponding to the word vector h 1 , wherein the semantic information corresponding to the word vector h 1 also includes the semantic information transmitted by the sentence vector h 0 at time t=0; t = The hidden state vector output at the previous moment at time 3 (time t=2) is the word meaning information of the second word corresponding to the word vector h 2 , wherein the word meaning information corresponding to the word vector h 2 includes the word vector at time t=1 The word meaning information transmitted by h 1 , and the word meaning information corresponding to the word vector h 1 also includes the semantic information transmitted by the sentence vector h 0 at time t=0; and so on.
进一步地,在注意力层1213中,基于注意力机制的注意力向量计算公式如下:Further, in the attention layer 1213, the attention vector calculation formula based on the attention mechanism is as follows:
Attention = W_u · tanh(W_q · Q + W_v · V)    (2)
其中，在计算意图注意力向量C^I时，上述公式（2）中的Q表示输入注意力层1213的编码向量序列中的句向量h_0，V表示在每个时间步t输入注意力层1213的编码向量序列中的词向量h_1,h_2,……,h_t，通过上述公式（2）得到的注意力向量能够量化每个词向量与句向量的相关程度。由于句向量h_0表示的语义信息中包含所有可能的意图标签信息，因此将句向量h_0与通过上述公式（2）计算得到的注意力向量组合，得到意图注意力向量C^I，得到的意图注意力向量C^I用于量化每个词向量对应的字与句向量对应的句子表达意图的相关程度。When the intent attention vector C^I is calculated, Q in the above formula (2) denotes the sentence vector h_0 in the encoded vector sequence input into the attention layer 1213, and V denotes the word vectors h_1, h_2, ..., h_t in the encoded vector sequence input into the attention layer 1213 at each time step t; the attention vector obtained by the above formula (2) can quantify the degree of correlation between each word vector and the sentence vector. Since the semantic information represented by the sentence vector h_0 contains all possible intent label information, the sentence vector h_0 is combined with the attention vector calculated by the above formula (2) to obtain the intent attention vector C^I, which is used to quantify the degree of correlation between the word corresponding to each word vector and the intent expressed by the sentence corresponding to the sentence vector.
在计算槽位注意力向量时，上述公式（2）中的Q表示上一时刻（t-1时刻）注意力层1213输出的隐藏状态向量C，V表示输入注意力层1213的编码向量序列{h_0,h_1,h_2,……,h_t}，通过上述公式（2）得到的注意力向量能够结合上一时刻的隐藏状态向量学习当前t时刻处理的词向量的相关程度。由于上一时刻隐藏状态向量表示的已提取的语义信息或/和词义信息包含所有可能的槽位标签信息，因此将t-1时刻输出的隐藏状态向量C与通过上述公式（2）计算得到的注意力向量组合，得到槽位注意力向量C_t^S，得到的槽位注意力向量C_t^S用于量化每个词向量对应的字与句向量对应的句子所表达槽位的相关程度。When the slot attention vector is calculated, Q in the above formula (2) denotes the hidden state vector C output by the attention layer 1213 at the previous moment (time t-1), and V denotes the encoded vector sequence {h_0, h_1, h_2, ..., h_t} input into the attention layer 1213; the attention vector obtained by the above formula (2) can learn, in combination with the hidden state vector of the previous moment, the degree of correlation of the word vector processed at the current time t. Since the extracted semantic information and/or word meaning information represented by the hidden state vector of the previous moment contains all possible slot label information, the hidden state vector C output at time t-1 is combined with the attention vector calculated by the above formula (2) to obtain the slot attention vector C_t^S, which is used to quantify the degree of correlation between the word corresponding to each word vector and the slots expressed by the sentence corresponding to the sentence vector.
如上所述,注意力层1213可以通过长短期记忆(Long Short Term Memory,LSTM)模型和注意力机制训练得到,具体训练过程参考下文详细描述,在此不再赘述。可以理解,在另一些实施例中,可以采用其他与LSTM模型以及注意力机制具有相同功能的神经网络模型以及用于学习自然语言的句子中的字与句子所表达意图或槽位相关程度的其他机制,此处不做限制。As mentioned above, the attention layer 1213 can be obtained by training a Long Short Term Memory (LSTM) model and an attention mechanism. For the specific training process, please refer to the detailed description below, which will not be repeated here. It can be understood that in other embodiments, other neural network models that have the same functions as the LSTM model and the attention mechanism, as well as other neural network models that are used to learn the degree of correlation between the words in a sentence and the intent or slot expressed in the sentence can be used. mechanism, there is no restriction here.
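The additive attention of formula (2) can be sketched as follows (PyTorch, illustrative only). Formula (2) only defines the attention scores; the softmax normalization and the weighted sum used here to turn the scores into a context vector are common practice and are assumptions of this sketch, as are all dimensions and names.

    # Sketch of formula (2): Attention = W_u * tanh(W_q * Q + W_v * V),
    # scoring how strongly each word vector V relates to the query Q
    # (h_0 for the intent attention vector C^I, or the previous hidden state
    #  for the slot attention vector C_t^S).
    import torch
    import torch.nn as nn

    class AdditiveAttention(nn.Module):
        def __init__(self, hidden_size=768):
            super().__init__()
            self.W_q = nn.Linear(hidden_size, hidden_size, bias=False)
            self.W_v = nn.Linear(hidden_size, hidden_size, bias=False)
            self.W_u = nn.Linear(hidden_size, 1, bias=False)

        def forward(self, query, values):
            # query: (1, hidden), values: (t, hidden) -> one score per word vector
            scores = self.W_u(torch.tanh(self.W_q(query) + self.W_v(values)))  # (t, 1)
            weights = torch.softmax(scores, dim=0)               # assumed normalization
            context = (weights * values).sum(dim=0, keepdim=True)  # weighted context vector
            return weights, context

    attention = AdditiveAttention()
    h0 = torch.randn(1, 768)          # stand-in for the sentence vector (query for intent attention)
    words = torch.randn(10, 768)      # stand-ins for the word vectors h_1 ... h_10
    weights, c_intent = attention(h0, words)   # weights quantify word-to-intent relevance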
4)槽位填充层12144) Slot filling layer 1214
槽位填充层1214用于预测语料数据中的候选槽位并填充槽位值,其中,该槽位填充层1214可以预测出语料数据中的多个槽位标签,并保留满足条件的槽位标签作为候选槽位输出。The slot filling layer 1214 is used to predict candidate slots in the corpus data and fill in the slot value, wherein the slot filling layer 1214 can predict multiple slot labels in the corpus data, and retain the slot labels that meet the conditions Output as a candidate slot.
具体地，在t时刻，假如槽位填充层1214对当前的某个字进行处理，其以BERT编码层1211输出的编码向量h_t、t-1时刻注意力层1213输出的隐藏状态向量C（即当前处理的字之前的句子的语义信息或字的词义信息）、以及t时刻注意力层1213输出的意图注意力向量C^I和槽位注意力向量C_t^S为输入，输出t时刻的候选槽位。槽位填充层1214在每个时间步t基于上述输入的四个向量预测可能的槽位标签，并对预测的槽位标签计算槽位置信度以判断该槽位标签是否满足输出条件。即槽位填充层1214基于包括词向量（包含待解析语料数据中每个字的词义信息）的编码向量、当前处理的字之前的句子的语义信息或字的词义信息、当前处理的字与句子所表达意图的相关程度以及当前处理的字与句子所表达槽位的相关程度，得到待解析语料数据的可能的槽位标签，并计算每个槽位标签的槽位置信度，然后从中选择出满足条件（即与待解析语料数据真实表达出的槽位的相关程度超过阈值）的槽位标签作为候选槽位输出。Specifically, at time t, when the slot filling layer 1214 processes a current word, it takes as input the encoding vector h_t output by the BERT encoding layer 1211, the hidden state vector C output by the attention layer 1213 at time t-1 (that is, the semantic information of the sentence or the word meaning information of the words before the currently processed word), and the intent attention vector C^I and the slot attention vector C_t^S output by the attention layer 1213 at time t, and outputs the candidate slots at time t. At each time step t, the slot filling layer 1214 predicts possible slot labels based on the four input vectors and calculates a slot confidence for each predicted slot label to determine whether the slot label satisfies the output condition. That is, based on the encoding vector including the word vector (which contains the word meaning information of each word in the corpus data to be parsed), the semantic or word meaning information preceding the currently processed word, the degree of correlation between the currently processed word and the intents expressed by the sentence, and the degree of correlation between the currently processed word and the slots expressed by the sentence, the slot filling layer 1214 obtains the possible slot labels of the corpus data to be parsed, calculates the slot confidence of each slot label, and then selects, as the candidate slot output, the slot labels that satisfy the condition, that is, whose degree of correlation with the slots actually expressed by the corpus data exceeds the threshold.
可以理解，此处，槽位置信度表示预测的槽位标签与语料数据表达的真实槽位的接近程度，也可以称为槽位可靠度。槽位置信度越高的槽位标签越接近语料数据表达的真实槽位。在槽位填充层1214中，可以对槽位置信度设定某一阈值，例如设置槽位置信度的阈值为0.5，槽位置信度大于或等于阈值的槽位标签满足输出条件，对应的槽位标签将会输出作为候选槽位；槽位置信度小于阈值的槽位标签不满足输出条件，其对应的槽位标签将会被删除，不会从槽位填充层1214输出。It can be understood that, here, the slot confidence represents how close a predicted slot label is to the real slot expressed by the corpus data, and may also be called slot reliability. A slot label with a higher slot confidence is closer to the real slot expressed by the corpus data. In the slot filling layer 1214, a threshold may be set for the slot confidence, for example 0.5; a slot label whose slot confidence is greater than or equal to the threshold satisfies the output condition and is output as a candidate slot, while a slot label whose slot confidence is less than the threshold does not satisfy the output condition, is deleted, and is not output from the slot filling layer 1214.
例如，如图4所示，如果待解析的语料数据是“请为我播放你好旧时光”，假设槽位填充层1214中对槽位置信度设置的阈值为0.5。那么槽位填充层1214为“请、为、我、播、放”这5个字预测的槽位标签中，O槽位的槽位置信度（例如是0.7）大于或等于0.5，其他槽位（如songName）的槽位置信度（例如是0.3）小于0.5，因此，“请、为、我、播、放”这5个字对应输出的候选槽位都是O槽位。槽位填充层1214为“你、好、旧、时、光”这5个字预测的槽位标签中，songName、videoName、mediaName的槽位置信度（例如槽位置信度分别为0.86、0.7、0.55）大于或等于0.5，而O槽位的槽位置信度（例如是0.3）小于0.5，因此，“你”对应输出的候选槽位是songName-B、videoName-B、mediaName-B，“好、旧、时、光”对应输出的候选槽位是songName-I、videoName-I、mediaName-I，其中，B标记的是名称起始位置的字，即表示“你”是名称中的第一个字；I标记的是名称起始位置之后的字。由于O槽位表示空槽位或不重要的槽位，因此槽位填充层1214最终输出3个候选槽位songName、videoName、mediaName，并对每个候选槽位填充槽位值“你好旧时光”。For example, as shown in Figure 4, if the corpus data to be parsed is "请为我播放你好旧时光" and the threshold set for the slot confidence in the slot filling layer 1214 is assumed to be 0.5, then among the slot labels predicted by the slot filling layer 1214 for the five words "请, 为, 我, 播, 放", the slot confidence of the O slot (for example 0.7) is greater than or equal to 0.5 while the slot confidence of the other slots (such as songName, for example 0.3) is less than 0.5, so the candidate slots output for these five words are all the O slot. Among the slot labels predicted for the five words "你, 好, 旧, 时, 光", the slot confidences of songName, videoName and mediaName (for example 0.86, 0.7 and 0.55 respectively) are greater than or equal to 0.5 while the slot confidence of the O slot (for example 0.3) is less than 0.5, so the candidate slots output for "你" are songName-B, videoName-B and mediaName-B, and the candidate slots output for "好, 旧, 时, 光" are songName-I, videoName-I and mediaName-I, where B marks the word at the starting position of a name (that is, "你" is the first word of the name) and I marks the words after the starting position. Since the O slot represents an empty or unimportant slot, the slot filling layer 1214 finally outputs three candidate slots songName, videoName and mediaName, and fills each candidate slot with the slot value "你好旧时光".
又例如，如果待解析语料数据是“帮我预定从上海到北京的火车票并预定北京火车站附近的五星级酒店”，假设槽位填充层1214中对槽位置信度设置的阈值为0.5。其中，槽位填充层1214为“上，海”预测的槽位标签中，槽位标签“出发地”的槽位置信度（例如是0.7）大于或等于0.5，因此，“上，海”这2个字对应输出的候选槽位均是“出发地”；槽位填充层1214为“北，京”预测的槽位标签中，槽位标签“目的地”的槽位置信度（例如是0.8）大于或等于0.5，因此，“北，京”这2个字对应输出的候选槽位均是“目的地”；槽位填充层1214为“北，京，火，车，站”预测的槽位标签中，槽位标签“地点”的槽位置信度（例如是0.75）大于或等于0.5，因此，“北，京，火，车，站”这5个字对应输出的候选槽位均是“地点”；槽位填充层1214为“五，星，级”预测的槽位标签中，槽位标签“星级”的槽位置信度（例如是0.75）大于或等于0.5，因此，“五，星，级”这3个字对应输出的候选槽位均是“星级”。因此槽位填充层1214最终输出4个候选槽位：出发地、目的地、地点、星级，并对槽位（出发地）填充的槽位值是（上海），对槽位（目的地）填充的槽位值是（北京），对槽位（地点）填充的槽位值是（北京火车站），对槽位（星级）填充的槽位值是（五星级）。For another example, if the corpus data to be parsed is "帮我预定从上海到北京的火车票并预定北京火车站附近的五星级酒店" and the threshold set for the slot confidence in the slot filling layer 1214 is assumed to be 0.5, then among the slot labels predicted for "上, 海", the slot confidence of the slot label "departure place" (for example 0.7) is greater than or equal to 0.5, so the candidate slots output for these two words are both "departure place"; among the slot labels predicted for "北, 京", the slot confidence of the slot label "destination" (for example 0.8) is greater than or equal to 0.5, so the candidate slots output for these two words are both "destination"; among the slot labels predicted for "北, 京, 火, 车, 站", the slot confidence of the slot label "location" (for example 0.75) is greater than or equal to 0.5, so the candidate slots output for these five words are all "location"; and among the slot labels predicted for "五, 星, 级", the slot confidence of the slot label "star rating" (for example 0.75) is greater than or equal to 0.5, so the candidate slots output for these three words are all "star rating". Therefore, the slot filling layer 1214 finally outputs four candidate slots: departure place, destination, location and star rating, and the slot (departure place) is filled with the slot value (上海), the slot (destination) with (北京), the slot (location) with (北京火车站), and the slot (star rating) with (五星级).
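The way the per-character candidate slot labels of these examples are assembled into slot values (B marking the first character of a value, I the following characters, O no slot, as described above) can be illustrated by the following sketch; the helper name collect_slot_values is an assumed name used only here, and a character may carry several candidate labels at once.

    # Sketch: turning per-character B/I/O candidate labels into slot values.
    def collect_slot_values(chars, labels_per_char):
        slots = {}
        for ch, labels in zip(chars, labels_per_char):
            for label in labels:
                if label == "O":
                    continue                            # O slots fill no value
                name, position = label.rsplit("-", 1)   # e.g. "songName-B" -> ("songName", "B")
                if position == "B":
                    slots[name] = ch                    # start a new value for this slot
                elif position == "I" and name in slots:
                    slots[name] += ch                   # extend the current value
        return slots

    chars = list("请为我播放你好旧时光")
    labels_per_char = [["O"]] * 5 + [
        ["songName-B", "videoName-B", "mediaName-B"],
        ["songName-I", "videoName-I", "mediaName-I"],
        ["songName-I", "videoName-I", "mediaName-I"],
        ["songName-I", "videoName-I", "mediaName-I"],
        ["songName-I", "videoName-I", "mediaName-I"],
    ]
    print(collect_slot_values(chars, labels_per_char))
    # -> {'songName': '你好旧时光', 'videoName': '你好旧时光', 'mediaName': '你好旧时光'}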
值得注意的是，槽位填充层1214基于意图注意力向量作为每个时间步t预测槽位的输入量进行槽位预测，并且t=1时刻对第一个字预测槽位时采用的是句向量h_0作为初始值输入，由于意图注意力向量和句向量所表示的语义信息包括所有可能的意图标签，因此槽位填充层1214是基于可能的意图标签来预测可能的槽位标签，预测得到的槽位标签与意图标签相关联。因而使得槽位预测的准确度大大提高，相应地也提高了槽位预测的速度或效率。It is worth noting that the slot filling layer 1214 performs slot prediction with the intent attention vector as an input for predicting the slot at each time step t, and uses the sentence vector h_0 as the initial input when predicting the slot for the first word at time t=1. Since the semantic information represented by the intent attention vector and the sentence vector includes all possible intent labels, the slot filling layer 1214 predicts the possible slot labels based on the possible intent labels, and the predicted slot labels are associated with the intent labels. This greatly improves the accuracy of slot prediction and correspondingly also improves the speed or efficiency of slot prediction.
具体地,槽位填充层1214的工作过程如图3所示:Specifically, the working process of the slot filling layer 1214 is shown in FIG. 3 :
槽位填充层1214以t时刻BERT编码层1211输出的编码向量h_t、注意力层1213在t时刻输出的意图注意力向量C^I、槽位注意力向量C_t^S，以及注意力层1213在t-1时刻输出的隐藏状态向量C作为输入。槽位填充层1214先基于槽位门（slot-gate）机制对意图和槽位之间的关系进行显式建模，得到意图注意力向量C^I与槽位注意力向量C_t^S的融合向量g^S，再进一步预测对应每个时间步t的槽位标签，并计算每个槽位标签的槽位置信度。The slot filling layer 1214 takes as input the encoding vector h_t output by the BERT encoding layer 1211 at time t, the intent attention vector C^I and the slot attention vector C_t^S output by the attention layer 1213 at time t, and the hidden state vector C output by the attention layer 1213 at time t-1. Based on a slot-gate mechanism, the slot filling layer 1214 first explicitly models the relationship between the intents and the slots to obtain a fusion vector g^S of the intent attention vector C^I and the slot attention vector C_t^S, and then further predicts the slot label corresponding to each time step t and calculates the slot confidence of each slot label.
其中，意图注意力向量与槽位注意力向量的融合向量g^S的计算公式如下：The fusion vector g^S of the intent attention vector and the slot attention vector is calculated as follows:
g^S = v · tanh(C_t^S + W · C^I)    (3)
其中，v表示上述公式（3）中的双曲正切函数tanh(x)的随机权重系数，W表示意图注意力向量C^I的随机权重系数。W大于1表示意图注意力向量C^I对槽位预测的影响程度比槽位注意力向量C_t^S的影响程度大，W小于1则表示意图注意力向量C^I对槽位预测的影响程度比槽位注意力向量C_t^S的影响程度小，W等于1则表示意图注意力向量C^I对槽位预测的影响程度与槽位注意力向量C_t^S的影响程度相同。Here, v denotes the random weight coefficient of the hyperbolic tangent function tanh(x) in formula (3), and W denotes the random weight coefficient of the intent attention vector C^I. W greater than 1 means that the intent attention vector C^I influences the slot prediction more strongly than the slot attention vector C_t^S, W less than 1 means that it influences the slot prediction less strongly than the slot attention vector C_t^S, and W equal to 1 means that the intent attention vector C^I and the slot attention vector C_t^S influence the slot prediction to the same degree.
在每个时间步t，槽位填充层1214基于输入的上述四个向量可以得到表示槽位标签信息的槽位向量（记为h_t^S），进而基于槽位向量h_t^S计算相应槽位标签的槽位置信度y_t^S。槽位置信度y_t^S经过Sigmoid激活函数后的计算公式如下：At each time step t, the slot filling layer 1214 can obtain, based on the above four input vectors, a slot vector (denoted h_t^S) representing the slot label information, and then calculates the slot confidence y_t^S of the corresponding slot label based on the slot vector h_t^S. The slot confidence y_t^S obtained through the Sigmoid activation function is calculated as follows:
y_t^S = Sigmoid(W^S · h_t^S + b^S)    (4)
其中，S为槽位数量，W^S表示槽位向量h_t^S的随机权重系数，b^S表示偏差值。Here, S is the number of slots, W^S denotes the random weight coefficient applied to the slot vector h_t^S, and b^S denotes the bias value.
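Formulas (3) and (4) together can be sketched for one time step t as follows (PyTorch, illustrative only). The text above does not spell out how the gate g^S, the encoding vector h_t and the slot attention vector C_t^S are combined into the slot vector h_t^S, so the combination used below is an assumption of this sketch, as are the dimensions and the slot label set.

    # Sketch of one slot filling step: slot gate of formula (3) + Sigmoid of formula (4).
    import torch
    import torch.nn as nn

    class SlotFillingStep(nn.Module):
        def __init__(self, hidden_size=768, num_slots=4):
            super().__init__()
            self.W = nn.Linear(hidden_size, hidden_size, bias=False)  # weight of C^I in formula (3)
            self.v = nn.Linear(hidden_size, 1, bias=False)            # weight of tanh(...) in formula (3)
            self.out = nn.Linear(hidden_size, num_slots)               # W^S, b^S of formula (4)

        def forward(self, h_t, c_slot_t, c_intent):
            g = self.v(torch.tanh(c_slot_t + self.W(c_intent)))       # slot gate g^S, formula (3)
            slot_vector = h_t + g * c_slot_t                           # assumed fusion into h_t^S
            return torch.sigmoid(self.out(slot_vector))                # y_t^S, formula (4)

    slot_labels = ["O", "songName", "videoName", "mediaName"]          # assumed label set
    step = SlotFillingStep(num_slots=len(slot_labels))
    h_t, c_slot_t, c_intent = (torch.randn(1, 768) for _ in range(3))  # stand-in input vectors
    confidence = step(h_t, c_slot_t, c_intent)[0]
    candidates = [lab for lab, c in zip(slot_labels, confidence) if lab != "O" and c >= 0.5]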
For example, in the example shown in FIG. 4 above, in the process of predicting slots for the corpus data "请为我播放你好旧时光" ("please play 你好旧时光 (Hello Old Times) for me"), when predicting the slot for the character "我" ("me") at time t=3, the slot filling layer 1214 takes as input the coding vector h_3 (corresponding to "我"), the hidden state vector C output by the attention layer 1213 at time t-1 (corresponding to "为"), the intent attention vector C_I (corresponding respectively to "请为我播放你好旧时光", "播", "放"), and the slot attention vector (corresponding respectively to "为", "我"). The hidden state vector C (corresponding to "为") includes the word meaning information passed on from the word vector (corresponding to "请"), and the word vector (corresponding to "请") in turn includes the semantic information passed on from the sentence vector (corresponding to "请为我播放你好旧时光").
Because the word meaning of "我" is a pronoun referring to the speaker, "我" is related neither to the intent nor to the slots expressed by the sentence "请为我播放你好旧时光". Therefore, when predicting the slot for "我", suppose the calculation gives a slot confidence of 0.2 for the slot label songName, 0.3 for the slot label videoName, and 0.7 for the slot label O; the slot finally predicted for "我" is then the O slot, and the O slot generally denotes an unimportant slot and is not used as an output of the slot filling layer 1214.
For example, when predicting the slot for the character "你" at time t=6, the slot filling layer 1214 takes as input the coding vector h_6 (corresponding to "你"), the hidden state vector C output by the attention layer 1213 at time t-1 (corresponding to "放"), the intent attention vector C_I (corresponding respectively to "请为我播放你好旧时光", "播", "放"), and the slot attention vector (corresponding respectively to "放", "你"). The hidden state vector C (corresponding to "你") includes the word meaning information passed on from the word vector (corresponding to "放"), the word vector (corresponding to "放") in turn includes the word meaning information passed on from the preceding word vector (corresponding to "播"), and so on, and the word vector (corresponding to "请") includes the semantic information passed on from the sentence vector (corresponding to "请为我播放你好旧时光").
Because the word meaning of "你" is a character in a song or video title, "你" is only weakly related to the intent expressed by the sentence "请为我播放你好旧时光" but strongly related to the slots expressed by that sentence. Therefore, when predicting the slot for "你", suppose the calculation gives a slot confidence of 0.86 for the slot label songName, 0.7 for videoName, 0.55 for mediaName, and 0.2 for O; the slots finally predicted for "你" are then songName, videoName and mediaName, which are used as outputs of the slot filling layer 1214.
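The examples above amount to keeping every slot label whose confidence clears a threshold and discarding the O label. A minimal sketch of that selection step is shown below; the 0.5 threshold and the dictionary of confidences are illustrative assumptions.

```python
def select_candidate_slots(confidences, threshold=0.5):
    """Keep slot labels whose slot confidence exceeds the threshold (0.5 assumed)."""
    return [label for label, score in confidences.items()
            if score > threshold and label != "O"]

# Confidences computed for the character "你" in the example above.
print(select_candidate_slots(
    {"songName": 0.86, "videoName": 0.7, "mediaName": 0.55, "O": 0.2}))
# ['songName', 'videoName', 'mediaName']

# Confidences for the character "我": only the O label is high, so nothing is kept.
print(select_candidate_slots({"songName": 0.2, "videoName": 0.3, "O": 0.7}))
# []
```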
The slot filling layer 1214 can be obtained by training based on the slot-gate mechanism, the LSTM model and the Sigmoid activation function; the specific training process is described in detail below and is not repeated here. The slot gate mechanism focuses on learning the relationship between the intent attention vector and the slot attention vector, and obtains a better semantic frame through global optimization. It mainly uses the intent context vector to model the relationship between intents and slots, so as to improve slot filling performance. In other embodiments, another deep neural network model having the same function as the LSTM model may be used as the decoder, and another function having the same function as the Sigmoid function may be used as the activation function of the corresponding deep neural network decoder, which is not limited here.
5) Post-processing layer 1215
The post-processing layer 1215 is used to sort out the correspondence between candidate intents and candidate slots. The result obtained after matching candidate intents with candidate slots is output from the post-processing layer 1215 as the semantic parsing result.
For example, as shown in FIG. 4, if the input corpus is "请为我播放你好旧时光", the candidate intents output by the intent classification layer 1212 (PLAY_MUSIC, PLAY_VIDEO, PLAY_VOICE) and the candidate slots output by the slot filling layer 1214 (songName, videoName, mediaName) are input into the post-processing layer 1215, and the semantic parsing result output after inference and prediction based on the intent-slot mapping table in the post-processing layer 1215 is:
PLAY_MUSIC songName, videoName, mediaName, 你好旧时光;
PLAY_VIDEO songName, videoName, mediaName, 你好旧时光;
PLAY_VOICE songName, videoName, mediaName, 你好旧时光.
The candidate intents PLAY_MUSIC, PLAY_VIDEO and PLAY_VOICE are the intents identified by parsing the corpus data, the candidate slots songName, videoName and mediaName are the slots obtained by parsing the corpus data, and "你好旧时光" is the filled slot value.
For another example, if the corpus data to be parsed is "帮我预定从上海到北京的火车票并预定北京火车站附近的五星级酒店" ("book me a train ticket from Shanghai to Beijing and book a five-star hotel near Beijing Railway Station"), the candidate intents output by the intent classification layer 1212 (book train ticket, book hotel) and the candidate slots output by the slot filling layer 1214 (departure, destination) are input into the post-processing layer 1215, and the semantic parsing result output after inference and prediction based on the intent-slot mapping table in the post-processing layer 1215 is:
Book train ticket: departure, Shanghai;
destination, Beijing;
Book hotel: location, Beijing Railway Station;
star rating, five-star.
The candidate intents (book train ticket, book hotel) are the intents identified by parsing the corpus data, the candidate slots (departure, destination, location, star rating) are the slots obtained by parsing the corpus data, and Shanghai, Beijing, Beijing Railway Station and five-star are the slot values filled into the corresponding slots (departure, destination, location, star rating).
Specifically, the working process of the post-processing layer 1215 is shown in FIG. 3:
The post-processing layer 1215 takes as input the candidate intents obtained by the intent classification layer 1212 and the candidate slots obtained by the slot filling layer 1214, and sorts out the correspondence between candidate intents and candidate slots based on the intent-slot mapping table obtained during the pre-training of the semantic parsing model 121. How the intent-slot mapping table is obtained during the pre-training of the semantic parsing model 121 is described in detail below and is not repeated here.
It can be understood that the intent-slot mapping table is the result of sorting out the candidate intents and candidate slots obtained by training on a large number of samples; therefore, while performing semantic parsing tasks, the intent-slot mapping table can be continuously updated based on more corpus data from practical applications.
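As a minimal sketch of the idea, the post-processing step can be thought of as a table lookup: for each candidate intent, keep only the candidate slots that the mapping table associates with that intent. The table contents below are assumptions built from the examples in this description, not a table produced by the trained model.

```python
# Hypothetical intent-slot mapping table learned during pre-training (assumed contents).
INTENT_SLOT_MAP = {
    "PLAY_MUSIC": {"songName", "videoName", "mediaName"},
    "PLAY_VIDEO": {"songName", "videoName", "mediaName"},
    "PLAY_VOICE": {"songName", "videoName", "mediaName"},
    "BOOK_TICKET": {"departure", "destination"},
    "BOOK_HOTEL": {"location", "starRating"},
}

def match_intents_to_slots(candidate_intents, candidate_slots):
    # For every candidate intent, keep only the candidate slots the table allows.
    return {intent: sorted(INTENT_SLOT_MAP.get(intent, set()) & set(candidate_slots))
            for intent in candidate_intents}

print(match_intents_to_slots(
    ["BOOK_TICKET", "BOOK_HOTEL"],
    ["departure", "destination", "location", "starRating"]))
# {'BOOK_TICKET': ['departure', 'destination'], 'BOOK_HOTEL': ['location', 'starRating']}
```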
The above BERT coding layer 1211, intent classification layer 1212, attention layer 1213, slot filling layer 1214 and post-processing layer 1215 together constitute the semantic parsing model 121. Each layer in the structure of the semantic parsing model 121 needs to be pre-trained with a large amount of sample corpus data so that it has the corresponding function described above. As mentioned above, the semantic parsing model 121 is pre-trained by the server 200; afterwards, the trained semantic parsing model 121 can either be transplanted to the mobile phone 100 to directly perform semantic parsing tasks, or continue to reside in the server 200 to perform semantic parsing tasks requested by the mobile phone 100.
The pre-training process of the semantic parsing model 121 is described in detail below; reference may be made to the following example.
As shown in FIG. 5, the pre-training process of the semantic parsing model 121 includes:
501: The server 200 collects sample corpus data for training the semantic parsing model 121. The collected sample corpus data should cover as many domains as possible and as many verbs, proper nouns, common nouns and the like as possible, so that the generalization performance of the trained semantic parsing model 121 will be better.
The sample corpus data used for training the semantic parsing model 121 needs to be input in batches into each layer of the semantic parsing model 121 for training, and every piece of sample corpus data is processed by every layer of the semantic parsing model 121. For ease of understanding, several concepts related to the sample data are introduced below.
(a) batch: in deep learning, the loss function used for each parameter update is not obtained from a single data-label pair {data: label} but from a weighted combination of a group of data; the number of samples in this group is the batch size.
(b) batchsize: the batch size, i.e. the number of samples in one batch. Each training step takes batchsize samples from the training set.
(c) iteration: the number of iterations is the number of batches needed to complete one epoch. One iteration equals training once with batchsize samples; within one epoch, the number of batches and the number of iterations are equal.
(d) epoch: when the complete data set has passed through the neural network once and returned once, this process is called one epoch. In other words, one epoch equals training once with all samples in the training set.
For example, if the training set has 1000 samples and batchsize = 10, then training on the entire sample set requires 100 iterations, i.e. 1 epoch. As another example, for a data set with 2000 training samples, dividing the 2000 samples into batches of size 500 means that 4 iterations are needed to complete one epoch.
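The relationship between these quantities is simple arithmetic, as the short sketch below shows for the two examples just given.

```python
def iterations_per_epoch(num_samples, batch_size):
    # One epoch passes every training sample through the network once,
    # so the number of iterations equals the number of batches.
    return num_samples // batch_size

print(iterations_per_epoch(1000, 10))   # 100 iterations for 1 epoch
print(iterations_per_epoch(2000, 500))  # 4 iterations for 1 epoch
```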
502: The server 200 performs data preprocessing, through the NLP module, on the sample corpus data about to be input into the semantic parsing model 121 for training. For the data preprocessing of the sample corpus data, refer to the description of data preprocessing in the BERT coding layer 1211 above, which is not repeated here.
After data preprocessing, each piece of sample corpus data yields the Token sequence corresponding to that piece of sample corpus data, the sentence segmentation marks, and the mask corresponding to the Token sequence.
503: In one epoch of training, the server 200 inputs, for each piece of sample corpus data, the Token sequence, the sentence segmentation marks and the mask corresponding to the Token sequence obtained by data preprocessing into the BERT coding layer 1211 of the semantic parsing model 121 for training, so that it can output the coding vector sequence described in the BERT coding layer 1211 above.
The BERT coding layer 1211 is obtained by training based on the BERT model. During training, the upstream and downstream parameters of the semantic parsing model 121 need to be continuously fine-tuned, so that after learning for a sufficiently long time or from sufficient sample corpus data the BERT coding layer can output the above coding vector sequence {h_0, h_1, h_2, ..., h_t}.
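For illustration, the sketch below shows how a pre-trained Chinese BERT model can produce a per-character coding vector sequence and a sentence vector for one utterance. The model name "bert-base-chinese" and the use of the [CLS] position as the sentence vector h_0 are assumptions for the sketch; in the patent the BERT coding layer is additionally fine-tuned jointly with the downstream layers, which is not shown here.

```python
import torch
from transformers import BertModel, BertTokenizer

# "bert-base-chinese" is an illustrative choice, not necessarily the model used here.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")

encoded = tokenizer("请为我播放你好旧时光", return_tensors="pt")  # token ids + mask
with torch.no_grad():
    outputs = model(**encoded)

hidden = outputs.last_hidden_state   # [1, seq_len, 768]: coding vectors h_1 ... h_t
h0 = hidden[:, 0, :]                 # vector at the [CLS] position, used as h_0
print(hidden.shape, h0.shape)
```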
504: In one epoch of training, the server 200 inputs the sentence vector h_0 output by the BERT coding layer 1211 in the above process 503 into the intent classification layer 1212 of the semantic parsing model 121 for training, so that it can output the candidate intents described in the intent classification layer 1212 above, which is not repeated here.
The intent classification layer 1212 is obtained by training based on a fully connected layer with the Sigmoid function as the activation function. During training, the upstream and downstream parameters of the semantic parsing model 121 need to be continuously fine-tuned, so that after learning for a sufficiently long time or from a sufficiently large amount of sample corpus data the intent classification layer 1212 can extract all possible intent labels and the intent confidence corresponding to each intent label, and then extract multiple intent labels that satisfy the output condition as candidate intents to be output from the intent classification layer 1212; see formula (1) above and the related description for details, which are not repeated here.
For each piece of sample corpus data, the candidate intents output by the intent classification layer 1212 are input into the post-processing layer 1215.
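A minimal PyTorch sketch of such a multi-label intent head is given below: a fully connected layer over the sentence vector h_0 followed by a Sigmoid, with one independent confidence per intent label. The hidden size, number of intent labels and 0.5 output threshold are assumptions for the sketch.

```python
import torch
import torch.nn as nn

class IntentClassifier(nn.Module):
    """Fully connected layer over the sentence vector h_0, Sigmoid per intent label."""

    def __init__(self, hidden_size=768, num_intents=5):  # sizes are assumptions
        super().__init__()
        self.fc = nn.Linear(hidden_size, num_intents)

    def forward(self, h0):
        # Multi-label: one independent confidence per intent label, not a softmax.
        return torch.sigmoid(self.fc(h0))

clf = IntentClassifier()
confidences = clf(torch.randn(1, 768))
candidate_intents = (confidences > 0.5).nonzero(as_tuple=True)[1]  # labels above threshold
print(confidences, candidate_intents)
```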
505: In one epoch of training, the server 200 inputs the coding vector sequence {h_0, h_1, h_2, ..., h_t} output by the BERT coding layer 1211 trained in the above process 503 into the attention layer 1213 of the semantic parsing model 121 for training, so that it can output the intent attention vector C_I and the slot attention vector C_S described in the attention layer 1213 above, which is not repeated here.
The attention layer 1213 is obtained by training based on the attention mechanism and the LSTM model. During training, the upstream and downstream parameters of the semantic parsing model 121 need to be continuously fine-tuned, so that the attention layer 1213 can quantify the degree to which the character corresponding to each word vector is related to the expressed intent and the degree to which it is related to the represented slot, and finally output the intent attention vector and the slot attention vector; see formula (2) above and the related description for details, which are not repeated here.
The LSTM model is a special RNN model proposed to solve the gradient vanishing problem of the RNN model. Its core is the cell state, which can be understood as a conveyor belt; it is essentially the memory space of the entire model, changing over time. The working principle of the LSTM model can be briefly described as: (1) forget gate: choose to forget some past information; (2) input gate: memorize some current information; (3) merge the past and current memories; (4) output gate: choose to output some information. The attention mechanism imitates the internal process of biological observation behavior, i.e. a mechanism that aligns internal experience and external sensation so as to increase the fineness of observation in certain regions, and it can use limited attention resources to quickly filter out high-value information from a large amount of information. The attention mechanism can quickly extract important features of sparse data, and its essential idea can be expressed by the following formula:
Attention(Query, Source) = Σ_{i=1..Lx} Similarity(Query, Key_i) · Value_i    (5)

where Lx = ||Source|| denotes the length of Source. The meaning of the formula is that the constituent elements of Source are imagined as a series of <Key, Value> data pairs; given an element Query of the target Target, the weight coefficient of the Value corresponding to each Key is obtained by calculating the similarity or correlation between the Query and each Key, and the Values are then weighted and summed to obtain the final Attention value. In essence, therefore, the attention mechanism is a weighted summation of the Value values of the elements in Source, while Query and Key are used to calculate the weight coefficients of the corresponding Values.
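The sketch below illustrates this weighted-sum view of attention in NumPy. The dot product as the Similarity function and the softmax-style normalization of the weights are illustrative choices; the dimensions are assumed.

```python
import numpy as np

def attention(query, keys, values):
    # Formula (5): score each Key against the Query, turn the scores into weight
    # coefficients, then take the weighted sum of the Values.
    scores = keys @ query                              # dot-product similarity (one choice)
    weights = np.exp(scores) / np.exp(scores).sum()    # normalized weight coefficients
    return weights @ values                            # weighted sum = Attention value

rng = np.random.default_rng(0)
keys = rng.normal(size=(6, 8))      # Lx = 6 source elements, dimension 8 (assumed)
values = rng.normal(size=(6, 8))
query = rng.normal(size=8)
print(attention(query, keys, values).shape)            # (8,)
```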
506: In one epoch of training, the server 200 inputs the coding vector h_t output at time t by the BERT coding layer 1211 trained in the above process 503, the intent attention vector C_I and the slot attention vector C_S output at time t by the attention layer 1213 trained in the above process 505, and the hidden state vector C output by the LSTM model in the attention layer 1213 at time t-1 (i.e. the semantic information of the sentence preceding the currently processed character, or the word meaning information of that character) into the slot filling layer 1214 of the semantic parsing model 121 for training, so that it can output the candidate slots described in the slot filling layer 1214 above, which is not repeated here.
The slot filling layer 1214 is obtained by training based on the slot gate mechanism, with the LSTM model as the decoder and the Sigmoid function as the activation function. During training, the upstream and downstream parameters of the semantic parsing model 121 need to be continuously fine-tuned, so that after learning for a sufficiently long time or from a sufficiently large amount of sample corpus data the slot filling layer 1214 can, conditioned on the possible intent labels, predict all possible slot labels and the slot confidence corresponding to each slot label, and then extract multiple candidate slots that satisfy the output condition to be output from the slot filling layer 1214; see formulas (3) to (4) above and the related descriptions for details, which are not repeated here.
For each piece of sample corpus data, the candidate slots output by the slot filling layer 1214 are input into the post-processing layer 1215.
507: The server 200 determines whether the training results of the above processes 501 to 506 satisfy the training termination condition. If the training results satisfy the training termination condition, process 508 is performed; if the training results do not satisfy the training termination condition, process 509 is performed.
In this embodiment of the present application, an Early Stopping mechanism may be used to determine whether to terminate model training: when the number of training epochs reaches a count threshold, or when the number of epochs since the last optimal model exceeds a set interval threshold, the training results satisfy the training termination condition; otherwise, the training results do not satisfy the training termination condition.
The early stopping mechanism enables the trained neural network model to have good generalization performance, i.e. to fit the data well. Its basic idea is to evaluate the model on the validation set during training and to stop training when the model's performance on the validation set starts to decline, which avoids the overfitting problem caused by continuing training.
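The sketch below shows one way such an early stopping check could be written. The epoch limit, the patience interval and the stand-in validation function are illustrative assumptions.

```python
import random

def evaluate_on_validation_set(epoch):
    # Stand-in for a real validation pass: a noisy score that stops improving.
    return min(epoch, 20) + random.random()

def should_stop(epoch, best_epoch, max_epochs=50, patience=5):
    # Stop when the epoch count reaches the limit, or when no new optimal model
    # has appeared for more than `patience` epochs (thresholds are assumptions).
    return epoch >= max_epochs or (epoch - best_epoch) > patience

best_epoch, best_score = 0, float("-inf")
for epoch in range(1, 51):
    val_score = evaluate_on_validation_set(epoch)
    if val_score > best_score:                  # a new optimal model appeared
        best_epoch, best_score = epoch, val_score
    if should_stop(epoch, best_epoch):
        break
print(f"stopped at epoch {epoch}, best epoch {best_epoch}")
```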
508: The server 200 terminates the training of the BERT coding layer 1211, the intent classification layer 1212, the attention layer 1213 and the slot filling layer 1214 in the semantic parsing model 121, and further inputs the large number of candidate intents and candidate slots accumulated during the training of the above processes 502 to 506 into the post-processing layer 1215 of the semantic parsing model 121 to sort out their relationship, for example sorting the candidate slots by candidate intent to obtain the intent-slot mapping table. The training of the semantic parsing model then ends.
For each piece of sample corpus data trained through the above processes 502 to 506, candidate intents and candidate slots are obtained; after a sufficient number of epochs of training, enough candidate intents and candidate slots have been input into the post-processing layer 1215. Before the post-processing layer 1215 is trained, the candidate intents and candidate slots are unordered and have no correspondence, that is, no mapping has been formed between them. The post-processing layer 1215 is trained on a sufficiently large number of candidate intents and candidate slots so that it can sort candidate slots by candidate intent and output an ordered correspondence between intents and slots, i.e. an intent-slot mapping table is obtained by training. Based on the intent-slot mapping table, the post-processing layer 1215 can accurately and quickly find the correspondence between the candidate intents and candidate slots that are input into it.
509: The server 200 continues to input the sample corpus data of the next epoch and repeats processes 502 to 507 to continue training the semantic parsing model 121.
It is worth noting that, in order to eliminate the difference, caused by intent classification or slot filling loss, between the candidate intents or candidate slots obtained by the semantic parsing model 121 and the true intents or slots, a joint optimization function needs to be introduced when training the semantic parsing model 121, so that the output candidate intents and candidate slots undergo joint optimization training of the intent classification loss function and the slot filling loss function.
Specifically, the target loss function used for the joint optimization of intents and slots is the sum of the intent classification loss function, the slot filling loss function and a regularization term over the weights. The intent classification loss function uses a multi-label Sigmoid cross entropy loss (Cross Entropy Loss) function, and the slot filling loss function uses a serialized multi-label Sigmoid Cross Entropy Loss function, where the calculation formula of the Sigmoid Cross Entropy Loss is derived as follows:
L(t, x) = -Σ_i [ t_i · ln P(t_i=1|x_i) + (1 - t_i) · ln(1 - P(t_i=1|x_i)) ]    (6)

where P(t_i=1|x_i) is the Sigmoid function, P(t_i=1|x_i) = 1 / (1 + e^{-x_i}).
After adding L2 regularization over the weights, the joint optimization target loss function is obtained as follows:

J = L_y(y, f(x)) + L_c(y, f(x)) + (λ / 2m) · Σ_l Σ_k Σ_j (W_{k,j}^[l])²    (7)

where L_y(y, f(x)) is the intent classification loss function calculated according to formula (6) above, L_c(y, f(x)) is the slot filling loss function calculated according to formula (6) above, λ is a hyperparameter, m is the number of data items in one batch, and the division by 2 is so that the factor cancels out during differentiation; Σ_k Σ_j (W_{k,j}^[l])² denotes the sum over the W parameters of the l-th layer, where W^[l] is a matrix and k and j denote the rows and columns of that matrix.
It can be seen from this that the joint optimization function mainly performs joint optimization of the intent classification loss and the slot filling loss generated during the matrix transformations in the neural network. After the joint optimization of formula (7) above, the semantic parsing model 121 trained by the server 200 can parse the corpus data to be parsed into candidate intents and candidate slots that are closer to the true intents and true slots.
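For ease of understanding, the sketch below expresses the joint target of formulas (6) and (7) in PyTorch: multi-label Sigmoid cross entropy for the intents, the same loss applied per time step for the slots, plus L2 regularization over the weight matrices. The value of λ, the tensor shapes and the toy model are assumptions for the sketch.

```python
import torch
import torch.nn.functional as F

def joint_loss(intent_logits, intent_targets, slot_logits, slot_targets,
               model, lam=0.01, batch_size=32):
    # Formula (6): multi-label Sigmoid cross entropy for intent classification.
    l_intent = F.binary_cross_entropy_with_logits(intent_logits, intent_targets)
    # Formula (6) applied at every time step: serialized multi-label loss for slot filling.
    l_slot = F.binary_cross_entropy_with_logits(slot_logits, slot_targets)
    # Formula (7): L2 regularization over the weight matrices W of every layer.
    l2 = sum((p ** 2).sum() for n, p in model.named_parameters() if "weight" in n)
    return l_intent + l_slot + lam / (2 * batch_size) * l2

# Toy shapes (assumed): 32 utterances, 5 intent labels, 20 time steps, 10 slot labels.
model = torch.nn.Linear(768, 5)
loss = joint_loss(torch.randn(32, 5), torch.randint(0, 2, (32, 5)).float(),
                  torch.randn(32, 20, 10), torch.randint(0, 2, (32, 20, 10)).float(),
                  model)
loss.backward()
```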
As described above, after the server 200 completes the pre-training of the semantic parsing model 121, the trained semantic parsing model 121 can either be transplanted to the mobile phone 100 to directly perform semantic parsing tasks, or continue to reside in the server 200 to perform semantic parsing tasks requested by the mobile phone 100. Specifically, as shown in FIG. 6, the user inputs a voice instruction by waking up the voice assistant of the mobile phone 100; the mobile phone 100, through its internal human-machine dialogue system 110 and based on the above semantic parsing model 121, extracts one or more intents and slots corresponding to the user's voice instruction, and further performs the corresponding operation based on the identified intents and slots, for example opening an application or performing a web search. For the specific interaction process between the user and the mobile phone 100 onto which the semantic parsing model 121 has been transplanted, reference may be made to the following example:
601: The mobile phone 100 obtains the user's voice instruction.
A voice assistant is installed in the mobile phone 100, and the user can issue a voice instruction to the mobile phone 100 by waking up its voice assistant. For example, the mobile phone 100 obtains the user's voice instruction "帮我预定从上海到北京的火车票并预定北京火车站附近的五星级酒店" ("book me a train ticket from Shanghai to Beijing and book a five-star hotel near Beijing Railway Station").
602: The speech recognition module 111 in the human-machine dialogue system 110 of the mobile phone 100 recognizes and converts the obtained user voice instruction into corpus data in text form, for example converting the above voice instruction into the text corpus data "帮我预定从上海到北京的火车票并预定北京火车站附近的五星级酒店".
603: The semantic parsing module 112 in the human-machine dialogue system 110 of the mobile phone 100 performs semantic parsing on the corpus data to obtain a semantic parsing result in which intents correspond to slots.
Specifically, the semantic parsing module 112 first preprocesses the corpus data to obtain the Token sequence, the sentence segmentation marks and the mask created for the Token sequence. The semantic parsing module 112 then uses the Token sequence, the sentence segmentation marks and the mask created for the Token sequence as the input of the semantic parsing model 121, performs semantic parsing, and extracts multiple candidate intents and multiple candidate slots; finally, the semantic parsing model 121 sorts out the correspondence between the multiple candidate intents and the multiple candidate slots and outputs it as the semantic parsing result. In some embodiments, a simple single-intent corpus can also be parsed by the semantic parsing model 121 to extract a single candidate intent and one or more corresponding candidate slots, which is not limited here.
For example, the semantic parsing result obtained by the semantic parsing module 112 of the human-machine dialogue system 110 through the semantic parsing model 121 for the above corpus data "帮我预定从上海到北京的火车票并预定北京火车站附近的五星级酒店" is:
Book train ticket: departure, Shanghai;
destination, Beijing;
Book hotel: location, Beijing Railway Station;
star rating, five-star.
604: The problem solving module 113 in the human-machine dialogue system 110 of the mobile phone 100 searches for corresponding applications or network resources based on the semantic parsing result obtained by the semantic parsing module 112, so as to obtain a solution for the intents and slots in the semantic parsing result.
For example, in the above process 603, parsing the user instruction "帮我预定从上海到北京的火车票并预定北京火车站附近的五星级酒店" yields an intent-slot mapping result that contains the user's two intents, the four slots corresponding to those intents, and the slot value filled into each slot. The solution found by the problem solving module 113 is then that the mobile phone 100 can open an installed ticket-booking application or travel application to query train ticket information and hotel information for the user to select and book, or select the tickets of a certain train by default according to the user's historical usage records and enter the booking interface for the user to confirm; the mobile phone interface is shown in FIG. 7.
For another example, for the user instruction "请为我播放你好旧时光", as shown in FIG. 4, the intent-slot mapping result obtained by parsing the corpus data recognized from this instruction contains the user's three intents, the three slots corresponding to those intents, and the slot value filled into each slot. The mobile phone 100 can then, based on the user's usage habits, by default open music player software to play the local music 《你好旧时光》, or open audio player software to obtain music or video files related to 《你好旧时光》 for the user to select and play.
605: The language generation module 114 in the human-machine dialogue system 110 of the mobile phone 100 generates natural language sentences for the solution found by the problem solving module 113 and feeds them back to the user through the display interface of the mobile phone 100.
For the user instruction "帮我预定从上海到北京的火车票并预定北京火车站附近的五星级酒店" in the above process 603, after speech recognition and semantic parsing, the solution found by the problem solving module 113 is that the mobile phone 100 can open an installed ticket-booking application or travel application to query train ticket information and hotel information for the user to select and book, or select the tickets of a certain train by default according to the user's historical usage records and enter the booking interface for the user to confirm. The language generation module 114 can correspondingly generate the train number information of the train tickets or the introduction information of the hotel and feed it back to the user through the display interface of the mobile phone 100, as shown in FIG. 7.
For another example, the user voice instruction obtained by the mobile phone 100 is to query the weather for the next three days. After speech recognition and semantic parsing, the solution found by the problem solving module 113 is to open the browser on the mobile phone 100 or open the weather query software installed on the mobile phone 100 to search for the weather conditions of the next three days. Correspondingly, the language generation module 114 generates natural language text from the retrieved weather conditions as follows:
Today: sunny, 28-32°C;
Tomorrow: sunny, 28-33°C;
Wednesday: sunny turning cloudy, 28-32°C.
606: The dialogue management module 115 in the human-machine dialogue system 110 of the mobile phone 100 can schedule other modules based on the user's dialogue history to further refine the accurate understanding of the user's voice instruction. For example, if no location is explicitly specified in the user's voice instruction when the above problem solving module 113 searches for the weather, the dialogue management module 115 can, based on the user's dialogue history, schedule the problem solving module 113 to use Beijing, which the user frequently queries, as the search location and feed back to the user the weather conditions in Beijing for the next three days; the dialogue management module 115 can also, based on the location information of the mobile phone 100, schedule the problem solving module 113 to search for the weather at the user's current location for the next three days, and further schedule the language generation module 114 to generate the following natural language sentences:
Beijing area:
Today: sunny, 28-32°C;
Tomorrow: sunny, 28-33°C;
Wednesday: sunny turning cloudy, 28-32°C.
It can be understood that the dialogue management module 115 in the human-machine dialogue system 110 of the mobile phone 100 can flexibly schedule the other modules in the human-machine dialogue system 110 to perform their corresponding functions.
607: The speech synthesis module 116 in the human-machine dialogue system 110 of the mobile phone 100 further synthesizes and converts the natural language sentences generated by the language generation module 114 into speech, which is played by the mobile phone 100 as feedback to the user. For example, the weather conditions generated by the language generation module 114 in the above process 605 are converted into speech and played to the user, so that the user can hear the weather conditions without looking at the mobile phone.
In other embodiments, the trained semantic parsing model 121 can also continue to reside in the server 200 to perform semantic parsing tasks requested by the mobile phone 100. The user inputs a voice instruction by waking up the voice assistant of the mobile phone 100; the mobile phone 100 converts the user's voice instruction into corpus data through its internal human-machine dialogue system 110, and sends the converted corpus data to the server 200 for semantic parsing by interacting with the server 200. The server 200 extracts, based on the above semantic parsing model 121, multiple candidate intents in the user's voice instruction and the candidate slots corresponding to those intents. Further, the server 200 feeds back the extracted intent-slot correspondence result to the mobile phone 100, and the mobile phone 100 further performs the corresponding operation based on the identified intents and slots, for example opening an application or performing a web search.
An exemplary structure of the electronic device 100 is given below in conjunction with the embodiments of the present application.
FIG. 8 shows a schematic structural diagram of the mobile phone 100 according to an embodiment of the present application.
The mobile phone 100 may include a processor 101, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, a headset jack 170D, a sensor module 180, buttons 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and the like. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
It can be understood that the structure illustrated in this embodiment of the present invention does not constitute a specific limitation on the mobile phone 100. In other embodiments of the present application, the mobile phone 100 may include more or fewer components than shown, or combine some components, or split some components, or have a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The mobile phone 100 can implement the functions of obtaining the user's voice instructions and feeding back response speech to the user through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headset jack 170D, the application processor, and the like. For example, the mobile phone 100 obtains the user's voice instruction through the receiver 170B or the microphone 170C, and sends the obtained user voice instruction to the human-machine dialogue system 110 for speech recognition and semantic parsing; the corresponding solution is matched according to the semantic parsing result, and the mobile phone 100 performs the corresponding operation to realize the solution corresponding to the semantic parsing result. The human-machine dialogue system 110 can also generate response speech from the solution corresponding to the semantic parsing result and feed this response speech back to the user through the speaker 170A of the mobile phone 100 or through a headset plugged into the headset jack 170D.
The audio module 170 is used to convert digital audio information into an analog audio signal output, and is also used to convert an analog audio input into a digital audio signal. The audio module 170 can also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be disposed in the processor 101, or some functional modules of the audio module 170 may be disposed in the processor 101.
The speaker 170A, also called a "horn", is used to convert an audio electrical signal into a sound signal. The electronic device 100 can listen to music or listen to a hands-free call through the speaker 170A.
The receiver 170B, also called an "earpiece", is used to convert an audio electrical signal into a sound signal. When the electronic device 100 answers a call or a voice message, the voice can be heard by placing the receiver 170B close to the ear.
The microphone 170C, also called a "mic" or "mike", is used to convert a sound signal into an electrical signal. When making a call or sending a voice message, the user can speak with the mouth close to the microphone 170C to input the sound signal into the microphone 170C. The electronic device 100 may be provided with at least one microphone 170C. In other embodiments, the electronic device 100 may be provided with two microphones 170C, which can implement a noise reduction function in addition to collecting sound signals. In still other embodiments, the electronic device 100 may be provided with three, four or more microphones 170C to collect sound signals, reduce noise, identify sound sources, implement directional recording functions, and the like.
The processor 101 may include one or more processing units; for example, the processor 101 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc. Different processing units may be independent devices or may be integrated in one or more processors. The processor 101 implements the functions of the semantic parsing model 121 by running programs; the human-machine dialogue system 110 recognizes and converts the user's voice instruction into text corpus data, which, after data preprocessing, is input into the semantic parsing model 121 run by the processor 101 for semantic parsing to obtain the semantic parsing result.
The controller can generate operation control signals according to the instruction operation code and timing signals, and complete the control of fetching and executing instructions.
A memory may also be provided in the processor 101 for storing instructions and data. In some embodiments, the memory in the processor 101 is a cache memory. This memory can store instructions or data that the processor 101 has just used or uses cyclically. If the processor 101 needs to use the instruction or data again, it can be called directly from this memory. This avoids repeated accesses and reduces the waiting time of the processor 101, thereby improving the efficiency of the system.
In some embodiments, the processor 101 may include one or more interfaces. The interfaces may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a general-purpose input/output (GPIO) interface, a SIM interface, and/or a USB interface, etc.
It can be understood that the interface connection relationships between the modules illustrated in this embodiment of the present invention are only schematic illustrations and do not constitute a structural limitation on the mobile phone 100. In other embodiments of the present application, the mobile phone 100 may also adopt interface connection manners different from those in the above embodiments, or a combination of multiple interface connection manners.
The charging management module 140 is used to receive charging input from a charger. The charger may be a wireless charger or a wired charger. In some wired charging embodiments, the charging management module 140 may receive the charging input of the wired charger through the USB interface 130. In some wireless charging embodiments, the charging management module 140 may receive wireless charging input through the wireless charging coil of the mobile phone 100. While charging the battery 142, the charging management module 140 can also supply power to the electronic device through the power management module 141.
The power management module 141 is used to connect the battery 142 and the charging management module 140 to the processor 101. The power management module 141 receives input from the battery 142 and/or the charging management module 140, and supplies power to the processor 101, the internal memory 121, the display screen 194, the camera 193, the wireless communication module 160, and the like.
The wireless communication function of the mobile phone 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor, the baseband processor, and the like.
The antenna 1 and the antenna 2 are used to transmit and receive electromagnetic wave signals. Each antenna in the mobile phone 100 may be used to cover a single communication frequency band or multiple communication frequency bands. Different antennas may also be multiplexed to improve antenna utilization. For example, the antenna 1 may be multiplexed as a diversity antenna of a wireless local area network. In other embodiments, the antennas may be used in combination with a tuning switch.
The mobile communication module 150 can provide solutions for wireless communication, including 2G/3G/4G/5G, applied on the mobile phone 100.
The wireless communication module 160 can provide solutions for wireless communication applied on the mobile phone 100, including wireless local area networks (WLAN) such as a wireless fidelity (Wi-Fi) network, Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), and the like.
In some embodiments, the antenna 1 of the mobile phone 100 is coupled with the mobile communication module 150, and the antenna 2 is coupled with the wireless communication module 160, so that the mobile phone 100 can communicate with networks and other devices through wireless communication technologies.
The mobile phone 100 implements the display function through the GPU, the display screen 194, the application processor, and the like. The GPU is a microprocessor for image processing and is connected to the display screen 194 and the application processor.
The display screen 194 is used to display images, videos, and the like. The display screen 194 includes a display panel. In some embodiments, the mobile phone 100 may include 1 or N display screens 194, where N is a positive integer greater than 1.
The SIM card interface 195 is used to connect a SIM card.
Reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one example implementation or technique disclosed in accordance with the present application. The appearances of the phrase "in one embodiment" in various places in the specification do not necessarily all refer to the same embodiment.
The present disclosure also relates to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer-readable medium, such as, but not limited to, any type of disk, including floppy disks, optical disks, CD-ROMs, magneto-optical disks, read-only memories (ROM), random access memories (RAM), EPROMs, EEPROMs, magnetic or optical cards, application-specific integrated circuits (ASICs), or any type of medium suitable for storing electronic instructions, and each may be coupled to a computer system bus. Furthermore, the computers referred to in the specification may include a single processor or may be architectures employing multiple processors for increased computing capability.
The processes and displays presented herein are not inherently related to any particular computer or other apparatus. Various general-purpose systems may also be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatus to perform one or more method steps. Structures for a variety of these systems are discussed in the description below. In addition, any particular programming language sufficient to implement the techniques and implementations disclosed in the present application may be used. A variety of programming languages may be used to implement the present disclosure, as discussed herein.
In addition, the language used in this specification has been principally selected for readability and instructional purposes and may not have been selected to delineate or limit the disclosed subject matter. Accordingly, the present disclosure is intended to be illustrative, but not limiting, of the scope of the concepts discussed herein.

Claims (12)

  1. A semantic parsing method, characterized in that the method comprises:
    obtaining corpus data to be parsed;
    calculating a degree of intent correlation between a word comprised in the corpus data to be parsed and an intent represented by the corpus data to be parsed, and a degree of slot correlation between the word and a slot represented by the corpus data to be parsed;
    predicting a slot of the corpus data to be parsed based on semantic information of the word and preceding semantic information of the word, as well as the degree of intent correlation and the degree of slot correlation of the word.
  2. The method according to claim 1, characterized by further comprising:
    predicting a plurality of intents from the corpus data to be parsed;
    determining, from the predicted slots, a slot corresponding to each intent of the plurality of intents.
  3. The method according to claim 1, characterized in that the preceding semantic information comprises semantic information of at least one word located before the word in the corpus data to be parsed.
  4. The method according to claim 1, characterized by further comprising:
    generating sentence semantic information of the corpus data to be parsed and semantic information of each word in the corpus data to be parsed.
  5. The method according to claim 4, characterized in that the method is implemented by a neural network model.
  6. The method according to claim 5, characterized in that the neural network model comprises a fully connected layer and a long short-term memory network model.
  7. The method according to claim 5 or 6, characterized in that the sentence semantic information of the corpus data to be parsed, the preceding semantic information of the word, and the degree of intent correlation and the degree of slot correlation of the word are represented in the form of vectors in the neural network model.
  8. A human-machine dialog method, characterized by comprising:
    receiving a user voice instruction;
    converting the user voice instruction into corpus to be parsed in text form;
    parsing out, by the semantic parsing method according to any one of claims 1 to 6, intents in the corpus to be parsed and a slot corresponding to each intent;
    executing, based on the parsed intents and the slot corresponding to each intent, an operation corresponding to the user voice instruction, or generating a response voice.
  9. The method according to claim 8, characterized in that the operation comprises one or more of sending an instruction to a smart home device, opening application software, searching a web page, making a phone call, and sending or receiving a short message.
  10. A human-machine dialog system, characterized in that the system comprises:
    a speech recognition module, configured to convert a user voice instruction into corpus data in text form;
    a semantic parsing module, configured to perform the semantic parsing method according to any one of claims 1 to 6;
    a problem solving module, configured to find a solution for the result obtained by the semantic parsing module;
    a language generation module, configured to generate a natural language sentence corresponding to the solution;
    a speech synthesis module, configured to synthesize the natural language sentence into a response voice; and
    a dialog management module, configured to schedule the speech recognition module, the semantic parsing module, the problem solving module, the language generation module, and the speech synthesis module to cooperate with one another, so as to implement human-machine dialog.
  11. A readable medium, characterized in that the readable medium stores instructions which, when executed on an electronic device, cause the electronic device to perform the method according to any one of claims 1 to 6 and claim 9.
  12. An electronic device, characterized by comprising:
    a memory, configured to store instructions to be executed by one or more processors of the electronic device; and
    a processor, which is one of the processors of the electronic device and is configured to perform the method according to any one of claims 1 to 6 and claim 9.
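
The joint prediction of multiple intents and slots described in claims 1 to 7 can be illustrated with a short sketch. The code below is one minimal reading of those claims, not the claimed implementation: it assumes a PyTorch-style model in which an encoder produces per-word semantic vectors, a degree of intent correlation and a degree of slot correlation are computed for every word, and slot labels are decoded left to right so that each position also sees the semantics of the preceding words. All class names, layer choices, and dimensions are illustrative assumptions.

    import torch
    import torch.nn as nn

    # A sketch of joint multi-intent and slot prediction in the spirit of claims 1-7.
    # Not the claimed implementation: all names and dimensions are illustrative.
    class JointIntentSlotSketch(nn.Module):
        def __init__(self, vocab_size, emb_dim, hidden_dim, num_intents, num_slot_labels):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, emb_dim)
            # Encoder producing per-word semantics and, by pooling, sentence semantics (claim 4).
            self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
            # Degree of intent correlation and degree of slot correlation per word (claim 1).
            self.intent_relevance = nn.Linear(2 * hidden_dim, 1)
            self.slot_relevance = nn.Linear(2 * hidden_dim, 1)
            # Multi-label intent head: several intents may be predicted at once (claim 2).
            self.intent_head = nn.Linear(2 * hidden_dim, num_intents)
            # Unidirectional decoder so each position carries the semantics of preceding words (claim 3).
            self.slot_decoder = nn.LSTM(2 * hidden_dim + 2, hidden_dim, batch_first=True)
            self.slot_head = nn.Linear(hidden_dim, num_slot_labels)

        def forward(self, word_ids):
            emb = self.embedding(word_ids)                               # (B, T, E)
            word_sem, _ = self.encoder(emb)                              # (B, T, 2H) per-word semantics
            sent_sem = word_sem.mean(dim=1)                              # (B, 2H) sentence semantics
            intents = torch.sigmoid(self.intent_head(sent_sem))          # (B, num_intents)
            intent_rel = torch.sigmoid(self.intent_relevance(word_sem))  # (B, T, 1)
            slot_rel = torch.sigmoid(self.slot_relevance(word_sem))      # (B, T, 1)
            # Word semantics plus both correlation degrees, all as vectors, decoded left to right.
            decoder_in = torch.cat([word_sem, intent_rel, slot_rel], dim=-1)
            slot_states, _ = self.slot_decoder(decoder_in)
            slot_logits = self.slot_head(slot_states)                    # (B, T, num_slot_labels)
            return intents, slot_logits

    # Toy usage: a batch containing one utterance of six words.
    model = JointIntentSlotSketch(vocab_size=3000, emb_dim=64, hidden_dim=128,
                                  num_intents=10, num_slot_labels=20)
    intents, slot_logits = model(torch.randint(0, 3000, (1, 6)))
    print(intents.shape, slot_logits.shape)   # torch.Size([1, 10]) torch.Size([1, 6, 20])

In this toy setting, the six-word utterance yields a multi-label intent score vector, in which several intents may exceed a decision threshold at once, and one slot-label distribution per word, from which the slot corresponding to each predicted intent can be read off.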
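
Claim 10 enumerates the modules of the human-machine dialog system. The following sketch shows one possible way a dialog management module could schedule those modules within a single dialog turn; the function names, signatures, and stubbed back ends are assumptions for illustration and are not taken from the present application.

    from dataclasses import dataclass
    from typing import Callable, List, Tuple

    @dataclass
    class ParseResult:
        intents: List[str]
        slots: List[Tuple[str, str, str]]   # (intent, slot name, slot value)

    def dialog_turn(audio: bytes,
                    recognize: Callable[[bytes], str],            # speech recognition module
                    parse: Callable[[str], ParseResult],          # semantic parsing module
                    solve: Callable[[ParseResult], str],          # problem solving module
                    generate: Callable[[str], str],               # language generation module
                    synthesize: Callable[[str], bytes]) -> bytes: # speech synthesis module
        """One scheduled turn: recognition -> parsing -> solving -> generation -> synthesis."""
        text = recognize(audio)
        parsed = parse(text)
        solution = solve(parsed)
        reply_text = generate(solution)
        return synthesize(reply_text)

    # Toy usage with stubs standing in for the real back ends.
    reply = dialog_turn(
        b"<audio bytes>",
        recognize=lambda a: "play some jazz and turn on the living room light",
        parse=lambda t: ParseResult(
            intents=["play_music", "control_device"],
            slots=[("play_music", "genre", "jazz"),
                   ("control_device", "device", "living room light")]),
        solve=lambda p: "Playing jazz; the living room light is on.",
        generate=lambda s: s,
        synthesize=lambda s: s.encode("utf-8"),
    )
    print(reply.decode("utf-8"))

Each module is passed in as a callable, so the same scheduling code could drive a real speech recognizer, a semantic parser such as the one sketched above, and a text-to-speech engine, or simple stubs during testing.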
PCT/CN2021/117251 2020-09-15 2021-09-08 Electronic device and semantic parsing method therefor, medium, and human-machine dialog system WO2022057712A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010970477.8A CN114186563A (en) 2020-09-15 2020-09-15 Electronic equipment and semantic analysis method and medium thereof and man-machine conversation system
CN202010970477.8 2020-09-15

Publications (1)

Publication Number Publication Date
WO2022057712A1 true WO2022057712A1 (en) 2022-03-24

Family

ID=80539263

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/117251 WO2022057712A1 (en) 2020-09-15 2021-09-08 Electronic device and semantic parsing method therefor, medium, and human-machine dialog system

Country Status (2)

Country Link
CN (1) CN114186563A (en)
WO (1) WO2022057712A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115440200B (en) * 2021-06-02 2024-03-12 上海擎感智能科技有限公司 Control method and control system of vehicle-mounted system
CN115292463B (en) * 2022-08-08 2023-05-12 云南大学 Information extraction-based method for joint multi-intention detection and overlapping slot filling
CN115934913B (en) * 2022-12-23 2024-03-22 国义招标股份有限公司 Carbon emission accounting method and system based on deep learning data generation
CN115906874A (en) * 2023-03-08 2023-04-04 小米汽车科技有限公司 Semantic parsing method, system, electronic device and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110309276A (en) * 2018-03-28 2019-10-08 蔚来汽车有限公司 Electric car dialogue state management method and system
CN110309277A (en) * 2018-03-28 2019-10-08 蔚来汽车有限公司 Human-computer dialogue semanteme parsing method and system
US20190385611A1 (en) * 2018-06-18 2019-12-19 Sas Institute Inc. System for determining user intent from text
CN110705267A (en) * 2019-09-29 2020-01-17 百度在线网络技术(北京)有限公司 Semantic parsing method, semantic parsing device and storage medium

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114818659B (en) * 2022-06-29 2022-09-23 北京澜舟科技有限公司 Text emotion source analysis method and system and storage medium
CN114818659A (en) * 2022-06-29 2022-07-29 北京澜舟科技有限公司 Text emotion source analysis method and system and storage medium
CN115358186A (en) * 2022-08-31 2022-11-18 南京擎盾信息科技有限公司 Slot position label generation method and device and storage medium
CN115358186B (en) * 2022-08-31 2023-11-14 南京擎盾信息科技有限公司 Generating method and device of slot label and storage medium
CN116050427B (en) * 2022-12-30 2023-10-27 北京百度网讯科技有限公司 Information generation method, training device, electronic equipment and storage medium
CN116050427A (en) * 2022-12-30 2023-05-02 北京百度网讯科技有限公司 Information generation method, training device, electronic equipment and storage medium
CN115934922A (en) * 2023-03-09 2023-04-07 杭州心识宇宙科技有限公司 Conversation service execution method and device, storage medium and electronic equipment
CN115934922B (en) * 2023-03-09 2024-01-30 杭州心识宇宙科技有限公司 Dialogue service execution method and device, storage medium and electronic equipment
CN116227496A (en) * 2023-05-06 2023-06-06 国网智能电网研究院有限公司 Deep learning-based electric public opinion entity relation extraction method and system
CN116227496B (en) * 2023-05-06 2023-07-14 国网智能电网研究院有限公司 Deep learning-based electric public opinion entity relation extraction method and system
CN116227629B (en) * 2023-05-10 2023-10-20 荣耀终端有限公司 Information analysis method, model training method, device and electronic equipment
CN116227629A (en) * 2023-05-10 2023-06-06 荣耀终端有限公司 Information analysis method, model training method, device and electronic equipment
CN116959442A (en) * 2023-07-29 2023-10-27 浙江阳宁科技有限公司 Chip for intelligent switch panel and method thereof
CN116959442B (en) * 2023-07-29 2024-03-19 浙江阳宁科技有限公司 Chip for intelligent switch panel and method thereof
CN117238277A (en) * 2023-11-09 2023-12-15 北京水滴科技集团有限公司 Intention recognition method, device, storage medium and computer equipment
CN117238277B (en) * 2023-11-09 2024-01-19 北京水滴科技集团有限公司 Intention recognition method, device, storage medium and computer equipment

Also Published As

Publication number Publication date
CN114186563A (en) 2022-03-15

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21868537

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21868537

Country of ref document: EP

Kind code of ref document: A1