CN113505591A - Slot position identification method and electronic equipment - Google Patents

Slot position identification method and electronic equipment Download PDF

Info

Publication number
CN113505591A
Authority
CN
China
Prior art keywords
vector
participle
intention
hidden state
user command
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010210034.9A
Other languages
Chinese (zh)
Inventor
季冬
孟函可
祝官文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202010210034.9A priority Critical patent/CN113505591A/en
Priority to PCT/CN2021/078762 priority patent/WO2021190259A1/en
Publication of CN113505591A publication Critical patent/CN113505591A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M1/00Substation equipment, e.g. for use by subscribers
    • H04M1/72Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72403User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality
    • H04M1/7243User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages
    • H04M1/72433User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages for voice messaging, e.g. dictaphones
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • G10L2015/0631Creating reference templates; Clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Theoretical Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Business, Economics & Management (AREA)
  • General Business, Economics & Management (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a slot position identification method and an electronic device. The method belongs to the technical field of speech recognition and can be executed by a man-machine dialog system or a man-machine dialog apparatus. The method comprises the following steps: preprocessing a user command to obtain an original word sequence; performing BERT encoding on the original word sequence to obtain an intention vector and a hidden state vector for each participle; and, for any participle, determining an attention vector according to the hidden state vector of the participle and the intention vector, concatenating the hidden state vector of the participle with the attention vector of the participle to determine a slot probability vector for the participle, and finally determining the slot corresponding to the participle according to the K probability values in its slot probability vector. Because the intention serves as an input to the slot filling task, the method improves the accuracy with which the dialog system understands the user's request information and thus improves the user experience.

Description

Slot position identification method and electronic equipment
Technical Field
The present application relates to the field of terminal technologies, and in particular, to a slot position identification method and an electronic device.
Background
With the rapid development of the internet, man-machine dialog systems are used more and more widely. Taking a task-oriented man-machine dialog system as an example, the user's input may be a request such as asking about the weather, booking an air ticket, or making a medical inquiry, and the man-machine dialog system feeds back response information to the user according to the user's request. For example, the user may input "what will the weather be like in Beijing tomorrow", and the man-machine dialog system may search a preset database and feed back the response "tomorrow's weather in Beijing will be clear to cloudy". It is therefore important for the man-machine dialog system to accurately recognize the request information input by the user. In the process of speech recognition, intent recognition and slot filling techniques are key to ensuring the accuracy of the recognition result.
For intent recognition, the problem can be abstracted as a classification task, and an intent recognition model is trained using a classifier such as a convolutional neural network. In such intent recognition models, in addition to word embedding of the user's utterance, semantic representations of knowledge are introduced to increase the generalization ability of the representation layer; in practical applications, however, such models suffer from slot filling bias, which affects the accuracy of the intent recognition model. Slot filling, in essence, formalizes a sentence sequence into a labelling sequence, and there are many common sequence labelling methods, such as hidden Markov models or conditional random field models; in specific application scenarios, however, these slot filling models lack contextual information, which can make a slot ambiguous under different semantic intentions, so they cannot meet practical application requirements. In the prior art, therefore, the two models are trained independently, and the intent recognition task and the slot filling task are not jointly optimized, which ultimately leads to low recognition accuracy of the trained models in speech understanding and degrades the user experience.
Disclosure of Invention
The application provides a slot position identification method and an electronic device, which perform joint optimization training on intent recognition and slot filling and use the jointly trained model to recognize voice dialogs, so as to improve the accuracy of speech recognition.
In a first aspect, an embodiment of the present application provides a slot position identification method, which may be performed by a man-machine dialog system or a man-machine dialog apparatus. The method comprises the following steps: preprocessing a user command to obtain an original word sequence; performing BERT encoding on the original word sequence to obtain an intention vector and a hidden state vector for each participle; and performing the following processing for any participle, referred to as a first participle: determining the attention vector of the first participle according to the hidden state vector of the first participle and the intention vector; then concatenating the hidden state vector of the first participle with the attention vector of the first participle to determine the slot probability vector of the first participle; and determining the slot corresponding to the first participle according to the K probability values in the slot probability vector of the first participle.
Therefore, in the embodiment of the application, the intention is used as the input of the slot filling task to correct the slot prediction result, so that the accuracy of understanding the request information of the user by the dialogue system is improved, and the use experience of the user is improved.
In a possible design, for any participle among the T−1 participles of the original word sequence other than the sequence-initial participle, the attention vector of the participle may further be determined in combination with the slot probability vector of the preceding participle; that is, the attention vector of the participle may be determined according to the hidden state vector of the participle, the intention vector, and the slot probability vector corresponding to the participle preceding it.
In the embodiment of the application, the intention and the slot position of the previous participle are used as the input of the slot position filling task of the next participle, so that the slot position prediction result is further corrected, and the accuracy of understanding the request information of the user by the dialogue system is further improved.
In one possible design, preprocessing the user command to obtain the original word sequence includes: first generating a Token sequence from the user command, then randomly ordering the Token sequences and dividing them into several batches of Token sequences according to batch_size, and finally performing a truncation or padding operation on the Token sequences of each batch to obtain the preprocessed original word sequence. Preprocessing the user command in this way filters out invalid information.
In one possible design, the hidden state vectors may be generated as follows. BERT semantic encoding is first performed on the original word sequence to generate a vector sequence h_0, h_1, …, h_T, where h_0 is the sentence-vector encoding of the user command and h_1, …, h_T are the hidden state vectors corresponding to the T participles. An intention vector of the user command is then generated from the sentence-vector encoding h_0, the intention vector satisfying

y^I = softmax(W^I · h_0 + b^I)

where y^I ∈ R^{1×I}, I represents the number of possible intentions of the user command, the intention corresponding to the maximum probability value in y^I is the intention of the user command, h_0 is the sentence-vector encoding of the user command, b^I is a bias term, and W^I is a weight matrix.
In one possible embodiment, the slot probability vector may be calculated as follows. First, the hidden state vector h_i of the first participle and the attention vector c_i of the first participle are concatenated to generate the deep vector encoding

h̃_i = concat(h_i, c_i)

where concat is the concatenation function and h̃_i represents the concatenated deep vector encoding. The deep vector encoding h̃_i then undergoes a softmax transformation in a logistic regression layer to obtain the slot probability vector of the first participle,

y_i^S = softmax(W^S · h̃_i + b^S)

where softmax represents the normalized exponential function, W^S represents a weight matrix, h̃_i represents the deep vector encoding, and b^S represents a bias term.
In a second aspect, an embodiment of the present application provides an electronic device, including a processor and a memory, where the memory is used for storing one or more computer programs; the one or more computer programs stored in the memory, when executed by the processor, enable the electronic device to implement any of the possible design methodologies of any of the aspects described above.
In a third aspect, the present application further provides an apparatus including a module/unit for performing the method of any one of the possible designs of any one of the above aspects. These modules/units may be implemented by hardware, or by hardware executing corresponding software.
In a fourth aspect, this embodiment also provides a computer-readable storage medium, which includes a computer program and when the computer program runs on an electronic device, causes the electronic device to execute any one of the possible design methods of any one of the above aspects.
In a fifth aspect, the present application further provides a computer program product, which when run on a terminal, causes the electronic device to execute any one of the possible design methods of any one of the above aspects.
In a sixth aspect, the present application further provides a chip, coupled with a memory, for executing a computer program stored in the memory to perform any one of the possible design methods of the foregoing aspects.
Drawings
FIG. 1 is a schematic diagram of a possible dialog system architecture applicable to the embodiment of the present application;
FIG. 2 is a schematic diagram of a joint training model according to an embodiment of the present disclosure;
fig. 3 is a schematic flowchart of a slot position identification method according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of system module interaction provided in an embodiment of the present application;
fig. 5 is a schematic view of a dialog interface of a dialog system according to an embodiment of the present application;
FIG. 6 is a schematic diagram illustrating an example of joint intent inference and slot filling provided by an embodiment of the present application;
FIG. 7 is an exemplary block diagram of an apparatus provided by an embodiment of the present application;
fig. 8 is a schematic diagram of an apparatus according to an embodiment of the present disclosure.
Detailed Description
First, some terms referred to in the present application are explained so as to be understood by those skilled in the art.
(1) User command
In the field of man-machine dialog, a user command is input by a user and may also be referred to as a user requirement. The user command in the embodiments of the present application may be one of voice, image, video, audio-video, text, and the like, or a combination thereof. For example, if the user command is voice input by the user through a microphone, the user command may also be called a "voice command"; if the user command is text input by the user through a keyboard or virtual keyboard, it may also be called a "text command"; if the user command is an image input by the user through a camera together with the question "who is the person in the image?" input through a virtual keyboard, the user command is a combination of image and text; and if the user command is a segment of audio-video input by the user through a camera and a microphone, it may also be called an "audio-video command".
(2) Speech recognition (speech recognition)
Speech recognition technology, also known as automatic speech recognition (ASR), computer speech recognition, or speech-to-text (STT), converts human speech into the corresponding text by computer.
When the user command is a voice command or a command containing voice, the user command may be converted into text by ASR. In general, ASR works as follows: first, the audio signal input by the user is split into frames to obtain frame information; second, the frame information is recognized as states, with several frames corresponding to one state; third, states are combined into phonemes, with every three states forming one phoneme; and fourth, phonemes are combined into words, with several phonemes forming one word. The speech recognition result is thus obtained once it is known which state each frame corresponds to; the state corresponding to a frame is generally taken to be the state with the highest probability for that frame.
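As a toy illustration (not from the patent) of the per-frame decision just described, the following Python sketch labels each audio frame with its most probable state; the number of states and all probabilities are invented:

```python
# A minimal sketch, assuming an acoustic model has already produced, for each
# of 6 audio frames, a probability distribution over 4 hypothetical states.
import numpy as np

frame_state_probs = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.60, 0.20, 0.10, 0.10],
    [0.10, 0.80, 0.05, 0.05],
    [0.10, 0.70, 0.10, 0.10],
    [0.05, 0.10, 0.80, 0.05],
    [0.10, 0.10, 0.10, 0.70],
])

# Each frame is labelled with its highest-probability state, as described above.
states = frame_state_probs.argmax(axis=1)
print(states)  # [0 0 1 1 2 3] -> runs of states are then grouped into phonemes,
               # and phonemes are combined into words.
```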
In the process of speech recognition, an Acoustic Model (AM) and a Language Model (LM) may be used to determine a set of word sequences corresponding to a piece of speech. An acoustic model is understood to be a model of an utterance that can convert a speech input into an output of an acoustic representation, i.e. a unit that decodes acoustic features of a piece of speech into phonemes or words, more precisely a probability that the speech belongs to an acoustic symbol (e.g. phoneme). The language model then gives the probability that a set of word sequences is the speech, i.e. the words are decoded into a set of word sequences (i.e. a complete sentence).
(3) Natural Language Understanding (NLU)
Natural language understanding aims to give a machine the language understanding ability of a normal person. One important function is intent recognition. For example, for the user command "how far is the Hilton Hotel from Baiyun Airport?", if the intention of the user command is "query distance", the slots configured for that intention are "origin" and "destination"; the slot "origin" is filled with "Hilton Hotel" and the slot "destination" with "Baiyun Airport", and the machine can answer the request using the intention and the slot information.
(4) Intent (Intent) and Intent recognition
Intent recognition determines what a user command specifically wants to do. It can be understood as a semantic-expression classification problem; in other words, intent recognition is a classifier (also referred to as an intent classifier in the embodiments of the present application) that determines which intention a user command corresponds to. Commonly used intent classifiers include support vector machines (SVMs), decision trees, and deep neural networks (DNNs). The deep neural network may be a convolutional neural network (CNN) or a recurrent neural network (RNN), and the RNN may include a long short-term memory (LSTM) network, a stacked recurrent neural network (SRNN), and the like.
The general flow of intent recognition is as follows: first, the corpus (i.e., a set of word sequences) is preprocessed, for example by removing punctuation marks, stop words, and so on; second, word vectors are generated from the preprocessed corpus using a word embedding algorithm, for example word2vec; finally, an intent classifier (e.g., an LSTM) performs feature extraction and intent classification. In the embodiments of the present application, the intent classifier is a trained model that can recognize intentions in one or more scenarios, or recognize arbitrary intentions. For example, the intent classifier may recognize intentions in a ticket booking scenario, including booking tickets, filtering tickets, querying ticket prices, querying ticket information, refunding tickets, changing tickets, querying the distance to the airport, and so on. As another example, the intent classifier may recognize intentions across multiple scenarios. A minimal sketch of this flow is given below.
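The following sketch illustrates the generic flow under simplifying assumptions: scikit-learn's TF-IDF features and a linear SVM stand in for the word2vec + LSTM stack, and the training phrases and intent labels are invented:

```python
# A minimal sketch of intent recognition as text classification
# (preprocess -> vectorize -> classify); not the patent's model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = [
    "book a flight to Shanghai", "reserve a plane ticket for tomorrow",
    "how is the weather tomorrow", "will it rain in Beijing",
]
train_intents = ["book_ticket", "book_ticket", "query_weather", "query_weather"]

# TfidfVectorizer plays the role of the embedding step; LinearSVC is the
# intent classifier (an SVM, one of the classifiers named above).
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(train_texts, train_intents)
print(clf.predict(["buy an air ticket to Shenzhen"]))  # -> ['book_ticket']
```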
(5) Slot position (slot)
After the user's intention is determined, the NLU module needs to further understand the content of the user command. For simplicity, only the most essential parts need to be understood while the rest can be ignored; those most important parts are called slots. That is, a slot is a definition of key information in the user's expression (e.g., the word sequence recognized from a user command). One or more slots may be configured for the intention of a user command, and once the slot information is obtained, the machine can respond to the user command. For example, for the intention of booking an air ticket, the slots include "departure time", "origin" and "destination"; these three pieces of key information need to be recognized during natural language understanding, and accurate slot recognition requires a slot type (Slot-Type). Continuing the example, to accurately recognize the three slots "departure time", "origin" and "destination", the corresponding slot types are needed, namely "time", "city name" and "city name". A slot type can be regarded as a structured knowledge base of specific knowledge used to recognize and convert the slot expressed by the user. From a programming-language perspective, intent + slots can be seen as describing the user's requirement with a function: the intention corresponds to the function, a slot corresponds to a parameter of the function, and the slot type corresponds to the type of the parameter (see the sketch after this paragraph). The slots configured for different intentions can be divided into mandatory slots and optional slots: a mandatory slot must be filled for the user command to be executed, while an optional slot may or may not be filled.
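The function analogy can be made concrete with a short, purely hypothetical Python signature (the function name and parameters are illustrative, not from the patent):

```python
# Intent "book ticket" as a function; slots as parameters; slot types as
# parameter types. A minimal sketch of the analogy only.
from datetime import datetime

def book_ticket(departure_time: datetime,  # slot "departure time", slot type: time
                origin: str,               # slot "origin",         slot type: city name
                destination: str) -> None: # slot "destination",    slot type: city name
    """Executable once the mandatory slots are filled."""
    print(f"Booking ticket: {origin} -> {destination}, departing {departure_time}")

# Slot values extracted from a user command "call" the intent:
book_ticket(datetime(2020, 3, 24, 8, 0), "Shenzhen", "Shanghai")
```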
In the above example "ticket booking" three core slots are defined, respectively "departure time", "origin" and "destination". If the content that needs to be input by the user for booking the air ticket is considered comprehensively, more slot positions can be expected, such as the number of passengers, an airline company, a take-off airport, a landing airport and the like, and for a designer of the slot position, the slot position can be designed based on the granularity of intention.
(6) Slot filling (slot filling)
Slot filling extracts structured fields from a user command; equivalently, it reads certain semantic components of a sentence (in the embodiments of the present application, the user command), so slot filling can be regarded as a sequence labelling problem. Sequence labelling problems in natural language processing include word segmentation, part-of-speech tagging, named entity recognition (NER), keyword extraction, semantic role labelling, and so on. Given a specific tag set, sequence labelling can be performed. Methods for solving the sequence labelling problem include maximum entropy Markov models (MEMMs), conditional random fields (CRFs), and recurrent neural networks (RNNs).
Sequence labelling is essentially the problem of classifying each element in a linear sequence according to its context: each character in a given text is assigned a label. That is, for a one-dimensional linear input sequence, each element is labelled with some tag from the tag set. In the embodiments of the present application, a slot extraction classifier can label the text of a user command with slots. In the NLU concerned here, the linear sequence is the text of the user command (text input by the user, or text recognized from input voice); a Chinese character is often regarded as an element of the linear sequence, the meaning represented by the tag set differs for different tasks, and sequence labelling marks each character with a suitable tag according to its context, i.e., determines the slot of the character. A small illustration follows.
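For instance, under an assumed BIO-style slot tag set (illustrative, not the patent's), labelling each token of a command and collecting the non-empty tags yields the slots:

```python
# Slot filling viewed as sequence labelling: one tag per token.
tokens = ["book", "a", "ticket", "from", "Shenzhen", "to", "Shanghai"]
tags   = ["O",    "O", "O",      "O",    "B-FromLoc", "O", "B-ToLoc"]

# Collect tokens whose tag is not "O" (empty slot).
slots = {tag.split("-", 1)[1]: tok for tok, tag in zip(tokens, tags) if tag != "O"}
print(slots)  # {'FromLoc': 'Shenzhen', 'ToLoc': 'Shanghai'}
```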
Illustratively, when slot filling information is missing from the user command, such as in "how far is this hotel from Hongqiao Airport?", the machine needs to know which hotel "this hotel" refers to in order to respond. In the prior art, the machine might ask the user "which hotel's distance from Hongqiao Airport are you asking about?" to obtain the slot information. The machine may thus need to interact with the user many times to obtain the slot information missing from the user command.
In order to make the objects, technical solutions and advantages of the present application more clear, the present application will be further described in detail with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of a possible architecture of a human-computer dialog system according to an embodiment of the present application. As shown in fig. 1, the man-machine interaction system architecture may include a server 101, one or more user-side devices (e.g., the user-side device 1021 and the user-side device 1022 illustrated in fig. 1). Optionally, one or more customer service side devices (such as customer service side device 1023 and customer service side device 1024 illustrated in fig. 1) may be further included in the dialog system architecture.
The user-side device or the customer-service-side device may be a terminal device, such as a mobile phone, a tablet computer, or a desktop computer, and is not limited specifically. In the embodiment of the application, based on the difference of the operation interfaces of the client devices, the client devices are divided into user-side devices (client devices for user operation) and customer-service-side devices (client devices for manual customer service operation). In other embodiments, the user-side device may also be referred to as a user client or other names, and the customer-side device may also be referred to as a human agent client or other names, which is not limited specifically.
The user-side device may be configured to obtain information input by a user and send it to the server. For example, if the user inputs text information in the dialog box, the user-side device may acquire the text information and send it to the server; if the user inputs voice information in the dialog box, the user-side device may convert the voice into text through speech recognition technology and then send the text to the server. Optionally, the user-side device may further communicate with the customer-service-side device; for example, it may send the user's input to the customer-service-side device and receive information returned by it, thereby enabling human agents to serve the user.
The server handles the various computations required by the man-machine dialog system, such as question-answer matching, i.e., searching a preset database according to the user's request information to obtain the corresponding response information. The preset database may comprise a question bank and a corresponding response bank: the question bank contains a number of preset request messages, and the response bank contains the response messages corresponding to them. The server may compare the user's request message with the preset request messages, then feed back to the user-side device the response message corresponding to the preset request message with the greatest similarity to the user's request, and the user-side device presents the response to the user. A minimal sketch of this matching is given below.
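A minimal sketch of such matching, assuming a string-similarity measure (difflib here, purely as a stand-in; the patent does not specify the similarity function) over an invented question bank:

```python
from difflib import SequenceMatcher

# Question bank and its corresponding response bank, as described above.
bank = {
    "what is the weather tomorrow": "Tomorrow will be clear to cloudy.",
    "book an air ticket": "Which day would you like to fly?",
}

def answer(request: str) -> str:
    # Return the response of the preset request most similar to the user's.
    best = max(bank, key=lambda q: SequenceMatcher(None, request, q).ratio())
    return bank[best]

print(answer("please book an air ticket for me"))  # -> "Which day would you like to fly?"
```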
Taking a task-based multi-turn dialog system as an example, the application scenario may be a virtual personal assistant, intelligent customer service, and so on. In general, when a virtual personal assistant or intelligent customer service conducts a task-oriented multi-turn dialog with a user, it can return response information meeting the user's expectation based on the request information the user inputs. However, since language is highly flexible, current virtual personal assistants and intelligent customer services perform the intent recognition task and the slot filling task independently when understanding the semantics of the user's input. Because the slots corresponding to different intentions may differ, the intention and the slots in the recognition result may be misaligned, so that the assistant misunderstands the user's meaning and returns response information that does not meet the user's expectation. For example, if the user's voice input is "buy an air ticket flying to Shanghai in the morning", the intent recognition task in the man-machine dialog system may determine the intention as "book air ticket", while the slot filling task may label "Shanghai" as a "navigation destination"; since the two tasks run independently, the intelligent customer service may then try to fill a "navigation origin" slot and return incorrect response information to the user, such as "is the navigation origin your current location?". That is, the conventional slot filling process cannot take the intent recognition result into account, so the intention and the slots are misaligned and the semantic parsing result is affected.
It is often found in practice that slot filling is intent-dependent: the same location entity (e.g., the City God Temple), under a food-search intention, represents a restaurant, while under a navigation intention it can represent a starting point or an end point. Based on this analysis, the embodiments of the present application provide a slot position identification method that improves the training model in the existing man-machine dialog system: in the modified joint training model, the intent recognition result is used as an input parameter for slot filling, i.e., the intent recognition result is associated with slot filling, so that the intent recognition result corrects the slot filling. Using the joint training model obtained in this way to semantically parse the user's input improves the accuracy of the semantic parsing result.
The improved joint training model is shown in fig. 2 and comprises a BERT encoding layer, a fully connected layer (dense layer), a masked attention layer, and a softmax (logistic regression) layer. When a dialog corpus is input into the joint training model, the logistic regression layer outputs the intention and the slot information corresponding to the corpus. As can be seen from the connection between the fully connected layer and the masked attention layer, the intention is an input to the masked attention layer; the fully connected layer and the masked attention layer are what associate the intent recognition result with slot filling.
It should be noted that, in the training process of the joint training model, the objective loss function for joint optimization of intent and slots is the sum of an intent classification loss, a slot filling loss, and a regularization term on the weights, where the intent classification loss may be a binary cross-entropy loss and the slot filling loss a multi-class cross-entropy loss. The training termination condition of the joint training model may be: when the number of training epochs reaches a set threshold, or the batch interval since the last best model exceeds a set threshold, training terminates and the final joint training model is generated. A hedged sketch of such an objective follows.
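A minimal PyTorch sketch of such an objective, under assumptions: plain cross-entropy stands in for both the binary intent loss and the multi-class slot loss, and the L2 weight is invented:

```python
import torch
import torch.nn as nn

intent_loss_fn = nn.CrossEntropyLoss()  # intent classification loss
slot_loss_fn = nn.CrossEntropyLoss()    # slot filling loss (multi-class)

def joint_loss(intent_logits,  # [batch, num_intents]
               intent_labels,  # [batch]
               slot_logits,    # [batch, seq_len, num_slots]
               slot_labels,    # [batch, seq_len]
               params, l2_weight=1e-5):
    li = intent_loss_fn(intent_logits, intent_labels)
    ls = slot_loss_fn(slot_logits.flatten(0, 1), slot_labels.flatten())
    reg = sum((p ** 2).sum() for p in params)  # regularization term on weights
    return li + ls + l2_weight * reg           # joint intent + slot objective
```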
It should be noted that the slot position identification method provided by the embodiment of the present application may be applicable to various possible human-computer conversation systems, and is particularly applicable to a task-based multi-round conversation system. In order to facilitate understanding of the embodiments of the present application, the slot position identification method is described below by taking a task-based multi-turn dialogue system as an example.
Fig. 3 is a schematic flow chart of a slot position identification method according to an embodiment of the present application, including:
step 301, after the server obtains the user command, the server preprocesses the user command to obtain an original word sequence.
The user command may be one of voice, image, video, audio-video, text, and the like, or a combination thereof. For example, if the user command is voice input by the user through a microphone, it may also be called a "voice command"; if the user command is text input through a keyboard or virtual keyboard, it may also be called a "text command". Specifically, the server may tokenize the user command to generate a Token sequence, randomly order the Token sequences, divide them into several batches of Token sequences according to the batch size, and perform a truncation or padding operation on the Token sequences of each batch to obtain the preprocessed original word sequence. Optionally, the server may also create a mask of the same dimension for each Token sequence after the truncation or padding operation: at the positions of <pad> elements in the Token sequence, the element value in the mask sequence is 0, and elsewhere it is 1.
For example, a mobile phone user may run a voice assistant software program on the phone. When the user inputs the voice "play red break" through the microphone, the voice assistant converts the voice into English text and sends the text to the server corresponding to the program. First, the server serializes the English text; for example, a Token sequence can be generated from the text using the WordPiece (word fragmentation) technique. Note that if the text is Chinese, the Token sequence may be generated character by character. Second, the server randomly orders the serialized corpus, divides it into several batches according to batch_size, and then performs a truncation or padding operation on the Token sequence of each batch. Specifically, for each Token sequence of each batch, <pad> is padded if its length + 2 is smaller than the preset maximum sequence length (usually maxLength = 512), and the excess Token words are truncated if its length + 2 is larger than the preset maximum sequence length. After truncation/padding, <CLS> is added at the beginning of the Token sequence to mark the classification task, and <SEP> is added at the end of the Token sequence for sentence segmentation, indicating that a complete sentence precedes it, for example as shown in Table 1.
TABLE 1

<CLS> play red break <SEP> <pad> … <pad>
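A minimal sketch of this preprocessing, assuming whitespace tokenization in place of WordPiece and a short maxLength for display:

```python
def preprocess(text: str, max_len: int = 8):
    tokens = text.split()                          # stand-in for WordPiece
    tokens = tokens[: max_len - 2]                 # truncate if length + 2 > max_len
    tokens = ["<CLS>"] + tokens + ["<SEP>"]        # classification / sentence markers
    tokens += ["<pad>"] * (max_len - len(tokens))  # pad up to max_len
    mask = [0 if t == "<pad>" else 1 for t in tokens]  # 0 at <pad>, 1 elsewhere
    return tokens, mask

tokens, mask = preprocess("play red break")
print(tokens)  # ['<CLS>', 'play', 'red', 'break', '<SEP>', '<pad>', '<pad>', '<pad>']
print(mask)    # [1, 1, 1, 1, 1, 0, 0, 0]
```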
Step 302, the joint training model in the server first performs BERT encoding on the original word sequence to obtain an intention vector and the hidden state vectors corresponding to the T participles.

Here, BERT (Bidirectional Encoder Representations from Transformers) is a deep bidirectional pre-trained language understanding model used as a feature extractor. Specifically, after performing BERT semantic encoding on the original word sequence input into the joint training model, the server generates a hidden state vector sequence h_0, h_1, …, h_T, where h_0 is the sentence-vector encoding corresponding to the user command (i.e., the encoding vector at the <CLS> position) and h_1, …, h_T are the hidden state vectors corresponding to the T participles (i.e., the encoding vectors at the remaining positions). The BERT encoding layer then inputs the sentence vector h_0 into the logistic regression (softmax) layer to generate the intention vector, i.e.,

y^I = softmax(W^I · h_0 + b^I)

where y^I ∈ R^{1×I}, I represents the number of intentions corresponding to the user command, the intention corresponding to the maximum probability value in y^I is the intention of the user command, h_0 is the sentence-vector encoding of the user command, b^I is a bias term, and W^I is a weight matrix.
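A hedged PyTorch sketch of this intent head; the BERT encoder is mocked with a random tensor and all dimensions are invented:

```python
import torch
import torch.nn as nn

hidden_size, num_intents, T = 768, 5, 8

# Stand-in for the BERT output: h_0 (the <CLS> sentence vector) followed by
# the hidden state vectors h_1..h_T of the T participles.
encodings = torch.randn(T + 1, hidden_size)
h0, hidden_states = encodings[0], encodings[1:]

intent_head = nn.Linear(hidden_size, num_intents)   # W^I and b^I
y_intent = torch.softmax(intent_head(h0), dim=-1)   # intention vector y^I
predicted_intent = y_intent.argmax().item()         # intention with maximum probability
```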
For example, when the user inputs the voice command "play red break" shown in Table 1 and the corresponding original word sequence is input into the joint training model, the output intention is "play music", as shown in fig. 2. In addition, the hidden state vector h_1 corresponding to "play", the hidden state vector h_2 corresponding to "red", and the hidden state vector h_3 corresponding to "break" are input to the fully connected layer.
Step 303, for a first participle in the original word sequence, the first participle being any one of the T participles, the joint training model in the server performs the following processing:

determining the attention vector of the first participle according to the hidden state vector of the first participle and the intention vector; then concatenating the hidden state vector of the first participle with the attention vector of the first participle to determine the slot probability vector of the first participle; and determining the slot corresponding to the first participle according to the K probability values in the slot probability vector of the first participle.

The hidden state vector of the first participle and the attention vector of the first participle may be concatenated as follows: the hidden state vector h_i of the first participle and the attention vector c_i of the first participle are concatenated to generate the deep vector encoding

h̃_i = concat(h_i, c_i)

where concat is the concatenation function and h̃_i represents the concatenated deep vector encoding. Further, the server inputs the deep vector encoding h̃_i into the softmax layer for softmax transformation, thereby obtaining the slot probability vector of the first participle,

y_i^S = softmax(W^S · h̃_i + b^S)

where softmax represents the normalized exponential function, W^S represents a weight matrix, h̃_i represents the deep vector encoding, and b^S represents a bias term.
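A hedged PyTorch sketch of this per-participle slot head (sizes invented; h_i and c_i are random stand-ins for the actual hidden state and attention vectors):

```python
import torch
import torch.nn as nn

hidden_size, num_slots = 768, 10
slot_head = nn.Linear(2 * hidden_size, num_slots)   # W^S and b^S

h_i = torch.randn(hidden_size)                      # hidden state vector of the participle
c_i = torch.randn(hidden_size)                      # attention vector of the participle
h_tilde = torch.cat([h_i, c_i], dim=-1)             # deep vector encoding concat(h_i, c_i)
y_slot = torch.softmax(slot_head(h_tilde), dim=-1)  # slot probability vector y_i^S (K values)
slot_id = y_slot.argmax().item()                    # slot with the maximum probability
```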
Referring to fig. 2, for the first participle "play" in the voice command "play red break", c_1 in the fully connected layer is the intention vector y^I; the intention vector y^I and the hidden state vector h_1 serve as inputs to the masked attention layer, which generates the attention vector corresponding to the first participle "play". For the second participle "red", c_2 in the fully connected layer is the intention vector y^I; the intention vector y^I and the hidden state vector h_2 serve as inputs to the masked attention layer, which generates the attention vector corresponding to the second participle "red". For the third participle "break", c_3 in the fully connected layer is the intention vector y^I; the intention vector y^I and the hidden state vector h_3 serve as inputs to the masked attention layer, which generates the attention vector corresponding to the third participle "break".
In another possible implementation, assuming the first participle is any one of the T−1 participles other than the sequence-initial participle, its attention vector may be determined as follows: the attention vector of the first participle is determined according to the hidden state vector of the first participle, the intention vector, and the slot probability vector corresponding to the participle preceding the first participle. For example, referring to fig. 2, for the first participle "play" in the voice command "play red break", the intention vector y^I and the hidden state vector h_1 serve as inputs to the masked attention layer, generating the attention vector corresponding to "play", from which the slot probability vector y_1^S corresponding to "play" is finally obtained. For the second participle "red", the intention vector y^I, the hidden state vector h_2, and the slot probability vector y_1^S corresponding to "play" serve as inputs to the masked attention layer, thereby generating the attention vector c_2 corresponding to the second participle "red". For the third participle "break", the intention vector y^I, the hidden state vector h_3, and the slot probability vector y_2^S corresponding to the second participle "red" serve as inputs to the masked attention layer, thereby generating the attention vector c_3 corresponding to the third participle "break".
Specifically, the masked attention layer may calculate the attention vector as follows:

Step a: for the hidden vector h_i at each time step (i = 1, …, T), compute the query vector q_i, which receives only the intention vector information y^I, the hidden vector information h_i at the current time step, and the slot output information y_{1:i−1}^S predicted at the previous time steps (from 1 to i−1 at step i), where q_i ∈ R^{1×d}.

Step b: the key vector information is obtained by linear transformation, k_i = W^k · h_i, where k_i ∈ R^{1×d}; the key vectors of all time steps form the matrix K = (k_1, k_2, …, k_T) ∈ R^{n×d}.

Step c: the value vector information is obtained by linear transformation, v_i = W^v · h_i, where v_i ∈ R^{1×d}; the value vectors of all time steps form the matrix V = (v_1, v_2, …, v_T) ∈ R^{n×d}.

Step d: the attention vector information at the current time step is computed as

c_i = softmax((q_i · K^T / √d) ⊙ m_i) · V

where m_i is the i-th row of the mask matrix M, which has the form of an upper-triangular matrix: m_ij = 1 when i ≤ j, and m_ij = −∞ when i > j.
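A hedged PyTorch sketch of steps a-d. The construction of the query from (y^I, h_i, previous slot outputs) in step a is abbreviated to a random stand-in, and the mask is applied additively (0 / −∞), the usual numerical equivalent of the 1 / −∞ description above; T and d are invented:

```python
import torch
import torch.nn as nn

T, d = 4, 16
W_k, W_v = nn.Linear(d, d), nn.Linear(d, d)  # key / value transformations

H = torch.randn(T, d)   # hidden vectors h_1..h_T
Q = torch.randn(T, d)   # stand-in for the query vectors q_1..q_T of step a
K, V = W_k(H), W_v(H)   # steps b and c: K, V matrices

# Mask M: positions with i <= j are allowed (0), i > j are blocked (-inf).
allowed = torch.triu(torch.ones(T, T)).bool()
M = torch.where(allowed, torch.zeros(T, T), torch.full((T, T), float("-inf")))

scores = Q @ K.transpose(0, 1) / d ** 0.5 + M  # step d: scaled dot-product + mask
C = torch.softmax(scores, dim=-1) @ V          # attention vectors c_1..c_T
```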
It can be seen that, in the embodiment of the present application, after receiving a user command, the server performs preprocessing according to step 301 to generate an original word sequence, then uses the original word sequence as the input of the joint training model, and obtains, after model predictive inference, the intention vector y^I and the slot vector sequence y_i^S. For the intention vector y^I, the intention corresponding to the maximum probability value may be selected as the predicted intention; for the i-th slot vector y_i^S, the slot corresponding to the maximum probability value may be selected as the i-th predicted slot. In this way, the intention serves as an input to the slot filling task to correct the slot prediction result, which improves the accuracy with which the dialog system understands the user's request information and improves the user experience.
The slot position identification method provided by the present application can be specifically applied to the system architecture shown in fig. 4, wherein the NLU module is integrated with the joint training model. The following describes, by way of example, a specific application process of the above method in conjunction with the system architecture, and specific steps are as follows.
Step 401, the user opens the voice assistant software program of the mobile phone and utters the voice message "help me book an air ticket from Shenzhen to Shanghai" in the dialog box.
In step 402, the ASR module in the mobile phone voice assistant software program converts the voice information into text information, as shown in fig. 5, and sends the converted text information to a DM (Dialog management) module in the voice assistant software program.
In step 403, the DM (dialog management) module in the voice assistant software program obtains from the dialog box the context information (such as the historical dialog information and the current dialog information in fig. 5) and the status information corresponding to the voice information, and the DM module sends the voice information and other relevant information to the NLU (natural language understanding) module.
In step 404, the NLU module identifies the intention and the slots in the text "help me book an air ticket from Shenzhen to Shanghai" according to the method provided by the embodiment of the present application.
Specifically, as shown in fig. 6, after BERT semantic encoding is performed on "help me book an air ticket from Shenzhen to Shanghai", a hidden state vector sequence h_0, h_1, …, h_8 is generated. The BERT encoding layer inputs the sentence vector h_0 into the softmax layer to generate the intention vector y^I, and the intention corresponding to the maximum probability value in y^I is "book air ticket". When slot filling is performed, the hidden state vector h_1 and the intention vector y^I serve as inputs to generate the slot probability vector y_1^S corresponding to the first participle "help"; the slot corresponding to the maximum probability value in y_1^S is empty, so it is labelled "o". Furthermore, when slot filling is performed, the hidden state vector h_4, the intention vector y^I, and the slot probability vector y_3^S of the preceding participle serve as inputs to generate the slot probability vector y_4^S corresponding to the fourth participle "Shenzhen"; the slot corresponding to the maximum probability value in y_4^S is "origin", so it is labelled "origin" or "FromLoc". By analogy, when slot filling is performed, the hidden state vector h_6, the intention vector y^I, and the slot probability vector y_5^S of the preceding participle serve as inputs to generate the slot probability vector y_6^S corresponding to the sixth participle "Shanghai"; the slot corresponding to the maximum probability value in y_6^S is "destination", so it is labelled "destination" or "ToLoc". The joint training model therefore outputs: the intention corresponding to "help me book an air ticket from Shenzhen to Shanghai" is "book air ticket", the slot corresponding to "Shenzhen" is "origin", and the slot corresponding to "Shanghai" is "destination".
In step 405, the NLU module returns the intent and slot identification results to the DM module.
In step 406, the DM module inputs the intention and slot recognition result to an NLG (natural language generation) module.
The DM module is divided into two sub-modules, dialog state tracking (DST) and dialog policy learning (DPL), and its main function is to update the state of the dialog system according to the recognition result of the NLU module and to generate the corresponding system action, such as querying air tickets.
In step 407, the NLG module converts the system action output by the DM module into text form and sends it, together with other relevant information, to the DM module.
The DM module sends the system action execution result to the TTS (speech synthesis) module, step 408.
In step 409, the TTS module converts the text to speech and outputs the speech to the user. For example, outputting voice content corresponding to the queried flight information.
Further, if the server determines that a slot is vacant, for example the "time" slot in the request "help me book an air ticket from Shenzhen to Shanghai", the server may further send guidance information to the user-side device, for example "which day would you like the air ticket for", as shown in fig. 5; the guidance information is used to guide the user into providing information associated with the request. In this way, the guidance information leads the user to supply the associated information, so that the server can query the preset database for the response information the user expects based on that associated information, thereby avoiding unnecessary transfers to human agents and improving user satisfaction.
In the embodiment of the present application, the guidance information may further include third historical dialog information, where the similarity between the historical request information in the third historical dialog information and the current request information is greater than a sixth threshold; that is, historical request information similar to the user's request may be added to the guidance information to remind the user that similar questions have been asked before.
It is understood that, in other possible situations, the guidance information may include other contents, and those skilled in the art may set the contents of the guidance information according to practical experience and needs; any information sent to the user to actively seek response information meeting the user's expectation, with the functions of guidance, prompting, placation, and the like, falls within the protection scope of the present invention.
Therefore, in the embodiment of the present application, intent recognition and slot filling are performed jointly, and the intention is taken into account during slot filling, so that slot filling is finer-grained and more accurate, and both tasks can be completed well by training only a single joint model.
It should be noted that: (1) the step numbers are only an example of an execution flow in the embodiment of the present application, and there is no strict execution order between steps that have no temporal dependency on each other; (2) not every one of steps 401 to 410 must be performed, and in a specific implementation some of the steps may be selectively performed according to actual needs.
The above description has mainly introduced the solution provided by the present application from the perspective of interaction between the devices. It is understood that, to implement the above functions, each device includes corresponding hardware structures and/or software modules for performing the respective functions. Those skilled in the art will readily appreciate that the present invention can be implemented in hardware, or in a combination of hardware and computer software, with the exemplary units and algorithm steps described in connection with the embodiments disclosed herein. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementations should not be considered beyond the scope of the present invention.
In case of an integrated unit, fig. 7 shows a possible exemplary block diagram of the apparatus involved in the embodiments of the present application, which apparatus 700 may be present in the form of software. The apparatus 700 may include: a processing unit 702 and a communication unit 703. The processing unit 702 is configured to control and manage operations of the apparatus 700. The communication unit 703 is used to support communication between the apparatus 700 and other devices (such as a user-side device or a customer service-side device). The apparatus 700 may further comprise a storage unit 701 for storing program codes and data of the apparatus 700.
The processing unit 702 may be a processor or a controller, such as a general-purpose central processing unit (CPU), a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof, and may implement or execute the various illustrative logical blocks, modules, and circuits described in connection with this disclosure. The processor may also be a combination implementing computing functions, for example, a combination of one or more microprocessors, or a combination of a DSP and a microprocessor. The communication unit 703 may be a communication interface, a transceiver circuit, or the like, where "communication interface" is a general term and, in a specific implementation, may include a plurality of interfaces. The storage unit 701 may be a memory.
The apparatus 700 may be the server in the above embodiment, or may be a semiconductor chip disposed in the server. The processing unit 702 may support the apparatus 700 in performing the actions of the server in the above method examples, and the communication unit 703 may support communication between the apparatus 700 and the user-side device or the customer service-side device; for example, the processing unit 702 is configured to support the apparatus 700 in performing steps 301 to 303 in fig. 3, and the communication unit 703 is configured to support the apparatus 700 in performing step 405 in fig. 4.
Specifically, in an embodiment, the processing unit 702 is configured to preprocess the user command after it is obtained, so as to obtain an original word sequence; a joint training model in the server first performs BERT encoding on the original word sequence to obtain an intention vector and hidden state vectors respectively corresponding to the T participles. For a first participle in the original word sequence, where the first participle is any one of the T participles, the joint training model in the server performs the following processing: determining an attention vector of the first participle according to the hidden state vector of the first participle and the intention vector; then splicing the hidden state vector of the first participle with the attention vector of the first participle to determine a slot position probability vector of the first participle; and selecting, from the K probability values in the slot position probability vector of the first participle, the slot position corresponding to the maximum probability value as the slot position of the first participle.
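As a rough illustration of this processing, the following PyTorch sketch wires together the steps just described (BERT encoding, intention vector, intention-conditioned attention, splicing, and slot softmax). The layer names, the dot-product attention form, and the use of the transformers library are assumptions; the patent does not specify them:

```python
import torch
import torch.nn as nn
from transformers import BertModel  # assumed library; any BERT encoder would do

class JointIntentSlotSketch(nn.Module):
    """Hypothetical sketch of the joint model described above:
    BERT encoding -> intention vector -> per-participle attention
    conditioned on the intention -> splicing -> slot softmax."""

    def __init__(self, num_intents: int, num_slots: int, hidden: int = 768):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")
        self.intent_head = nn.Linear(hidden, num_intents)   # plays the role of W^I, b^I
        self.attn_query = nn.Linear(2 * hidden, hidden)     # assumed attention form
        self.slot_head = nn.Linear(2 * hidden, num_slots)   # plays the role of W^S, b^S

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        h = out.last_hidden_state             # hidden state vectors h_1..h_T (plus specials)
        h0 = h[:, 0]                          # sentence vector encoding information h_0
        intent_logits = self.intent_head(h0)  # y^I = softmax(W^I h_0 + b^I) after softmax

        # Attention of each participle over all hidden states, with the query
        # conditioned on both h_i and the intention context derived from h_0.
        T = h.size(1)
        intent_ctx = h0.unsqueeze(1).expand(-1, T, -1)
        q = self.attn_query(torch.cat([h, intent_ctx], dim=-1))
        scores = torch.matmul(q, h.transpose(1, 2)) / (h.size(-1) ** 0.5)
        c = torch.matmul(torch.softmax(scores, dim=-1), h)  # attention vector c_i

        # Splice h_i with c_i, then score the K slot labels.
        slot_logits = self.slot_head(torch.cat([h, c], dim=-1))
        slot_ids = slot_logits.argmax(dim=-1)  # slot with the maximum probability value
        return intent_logits, slot_logits, slot_ids
```

Here h0 plays the role of the sentence vector encoding information, and slot_ids selects, for each participle, the slot corresponding to the maximum of the K probability values.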
In a possible embodiment, assuming that the first participle is any one of the T-1 participles other than the initial participle, when the processing unit 702 determines the attention vector of the first participle according to the hidden state vector of the first participle and the intention vector, it may specifically determine the attention vector of the first participle according to the hidden state vector of the first participle, the intention vector, and the slot position probability vector corresponding to the participle preceding the first participle.
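Under the same assumptions as the previous sketch, this variant can be illustrated by additionally feeding the slot position probability vector of the preceding participle into the attention query, which forces a left-to-right loop over the participles:

```python
# Hypothetical variant: condition the attention query of participle i on the
# slot probability vector of participle i-1, so decoding proceeds left to right.
# Expected module shapes (assumptions): attn_query = nn.Linear(2*H + num_slots, H),
# slot_head = nn.Linear(2*H, num_slots).
import torch

def decode_left_to_right(h, h0, attn_query, slot_head, num_slots):
    B, T, H = h.shape
    prev_slot_prob = torch.zeros(B, num_slots)       # no predecessor for the first step
    slot_probs = []
    for i in range(T):
        q_in = torch.cat([h[:, i], h0, prev_slot_prob], dim=-1)
        q = attn_query(q_in)                         # query from h_i, intention, prev slots
        scores = torch.matmul(q.unsqueeze(1), h.transpose(1, 2)) / (H ** 0.5)
        c = torch.matmul(torch.softmax(scores, dim=-1), h).squeeze(1)
        y = torch.softmax(slot_head(torch.cat([h[:, i], c], dim=-1)), dim=-1)
        slot_probs.append(y)
        prev_slot_prob = y                           # feed forward to participle i+1
    return torch.stack(slot_probs, dim=1)            # (B, T, K) slot probability vectors
```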
For related specific implementation, reference may be made to the content in the above method, and details are not repeated here.
Referring to fig. 8, a schematic diagram of an apparatus provided in the present application is shown, where the apparatus may be the server, or may also be a chip disposed in the server. The apparatus 800 comprises: a processor 802, a communication interface 803, and a memory 801. Optionally, the apparatus 800 may also include a communication line 804. The communication interface 803, the processor 802, and the memory 801 may be connected to each other via a communication line 804; the communication line 804 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication lines 804 may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 8, but this is not intended to represent only one bus or type of bus.
The processor 802 may be a CPU, microprocessor, ASIC, or one or more integrated circuits configured to control the execution of programs in accordance with the teachings of the present application.
The communication interface 803 may be any device, such as a transceiver, for communicating with other devices or communication networks, such as an ethernet, a Radio Access Network (RAN), a Wireless Local Area Network (WLAN), a wired access network, etc.
The memory 801 may be a read-only memory (ROM) or other type of static storage device that can store static information and instructions, a Random Access Memory (RAM) or other type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage, optical disc storage (including compact disc, laser disc, optical disc, digital versatile disc, blu-ray disc, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to such. The memory may be separate and coupled to the processor via a communication line 804. The memory may also be integral to the processor.
The memory 801 is used for storing computer-executable instructions for executing the present application, and is controlled by the processor 802 to execute the instructions. The processor 802 is configured to execute computer-executable instructions stored in the memory 801 to implement the methods provided by the above-described embodiments of the present application.
Optionally, the computer-executable instructions in the embodiments of the present application may also be referred to as application program codes, which are not specifically limited in the embodiments of the present application.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device, such as a server or a data center, integrating one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)), among others.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The embodiment of the present application further provides a computer storage medium, where computer instructions are stored in the computer storage medium, and when the computer instructions are run on an electronic device, the electronic device is caused to execute the above related method steps to implement the method in the foregoing embodiment.
The embodiment of the present application further provides a computer program product, which when running on a computer, causes the computer to execute the above related steps to implement the method in the above embodiment.
In addition, embodiments of the present application also provide an apparatus, which may be specifically a chip, a component or a module, and may include a processor and a memory connected to each other; the memory is used for storing computer execution instructions, and when the device runs, the processor can execute the computer execution instructions stored in the memory, so that the chip can execute the method in the above method embodiments.
Through the description of the above embodiments, those skilled in the art will understand that, for convenience and simplicity of description, only the division of the above functional modules is used as an example, and in practical applications, the above function distribution may be completed by different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the above described functions.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a module or a unit is merely one type of logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another apparatus, or some features may be discarded or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may be one physical unit or a plurality of physical units, may be located in one place, or may be distributed to a plurality of different places. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a readable storage medium. Based on such an understanding, the technical solutions of the embodiments of the present application essentially, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for enabling a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (12)

1. A slot position identification method is characterized by comprising the following steps:
preprocessing a user command to obtain an original word sequence;
performing deep bidirectional pre-trained language understanding model (BERT) encoding processing on the original word sequence to obtain an intention vector and hidden state vectors respectively corresponding to T participles in the original word sequence;
for a first participle in the T participles, wherein the first participle is any one of the T participles, performing the following processing:
determining an attention vector of the first participle according to the hidden state vector of the first participle and the intention vector;
splicing the hidden state vector of the first participle and the attention vector of the first participle to determine a slot position probability vector of the first participle;
and determining the slot position corresponding to the first participle according to K probability values in the slot position probability vector of the first participle.
2. The method of claim 1, wherein determining the attention vector of the first participle from the hidden state vector and the intent vector of the first participle comprises:
determining the attention vector of the first participle according to the hidden state vector of the first participle, the intention vector, and the slot position probability vector corresponding to the participle preceding the first participle;
wherein the first participle is any one of the T-1 participles other than the initial participle in the T participles.
3. The method of claim 1 or 2, wherein the pre-processing the user command to obtain an original word sequence comprises:
generating a marker Token sequence according to the user command;
randomly ordering the Token sequences, and dividing the Token sequences into a plurality of batches of Token sequences according to a batch size batch_size;
and performing a truncation or padding operation on the Token sequence of each batch to obtain the preprocessed original word sequence.
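As an illustration of this preprocessing (also recited in claim 8), a minimal Python sketch is given below; the tokenizer choice, batch_size, max_len, and the padding token are assumptions:

```python
import random
from transformers import BertTokenizer  # assumed tokenizer; any Token-sequence generator works

def preprocess(user_commands, batch_size=32, max_len=48, pad_token="[PAD]"):
    """Hypothetical sketch of claim 3: tokenize -> shuffle -> batch -> truncate/pad."""
    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

    # 1. Generate a Token sequence for each user command.
    token_seqs = [tokenizer.tokenize(cmd) for cmd in user_commands]

    # 2. Randomly order the Token sequences and split them into batches of batch_size.
    random.shuffle(token_seqs)
    batches = [token_seqs[i:i + batch_size] for i in range(0, len(token_seqs), batch_size)]

    # 3. Truncate or pad every sequence in each batch to a fixed length max_len.
    return [
        [seq[:max_len] + [pad_token] * max(0, max_len - len(seq)) for seq in batch]
        for batch in batches
    ]
```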
4. The method according to any one of claims 1 to 3, wherein the obtaining of the intention vector and the hidden state vectors corresponding to the T participles by performing a BERT coding process on the original word sequence comprises:
performing BERT semantic encoding on the original word sequence to generate a vector sequence h_0, h_1, ..., h_T, wherein h_0 is the sentence vector encoding information of the user command, and h_1, ..., h_T are the hidden state vectors respectively corresponding to the T participles;
generating the intention vector of the user command according to the sentence vector encoding information h_0 of the user command, wherein the intention vector satisfies
y^I = softmax(W^I · h_0 + b^I)
wherein y^I ∈ R^(1×I), I represents the number of possible intentions of the user command, the intention corresponding to the maximum probability value in y^I is the intention of the user command, h_0 is the sentence vector encoding information of the user command, b^I is a bias term, and W^I is a weight matrix.
5. The method of any one of claims 1 to 4, wherein splicing the hidden state vector of the first participle with the attention vector of the first participle to determine the slot position probability vector of the first participle comprises:
splicing the hidden state vector h_i of the first participle with the attention vector c_i of the first participle to generate deep vector encoding information h'_i, which satisfies
h'_i = concat(h_i, c_i)
wherein concat is the splicing operation function, and h'_i represents the spliced deep vector encoding information;
performing logistic regression model softmax conversion on the deep vector encoding information h'_i to obtain the slot position probability vector y^S_i of the first participle, which satisfies
y^S_i = softmax(W^S · h'_i + b^S)
wherein softmax represents the normalized exponential function, W^S represents a weight matrix, h'_i represents the deep vector encoding information, and b^S represents a bias term.
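The formulas in claims 4 and 5 can be checked numerically with a short NumPy sketch; the dimensions (hidden size 4, I = 3 intentions, K = 5 slot labels) are arbitrary assumptions:

```python
import numpy as np

def softmax(z):
    """Normalized exponential function."""
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
h0 = rng.normal(size=4)                # sentence vector encoding information h_0
h_i = rng.normal(size=4)               # hidden state vector of the first participle
c_i = rng.normal(size=4)               # attention vector of the first participle

# Claim 4: y^I = softmax(W^I · h_0 + b^I), with I = 3 possible intentions.
W_I, b_I = rng.normal(size=(3, 4)), rng.normal(size=3)
y_I = softmax(W_I @ h0 + b_I)
intent = int(np.argmax(y_I))           # intention with the maximum probability value

# Claim 5: h'_i = concat(h_i, c_i), then y^S_i = softmax(W^S · h'_i + b^S).
h_cat = np.concatenate([h_i, c_i])     # deep vector encoding information
W_S, b_S = rng.normal(size=(5, 8)), rng.normal(size=5)
y_S = softmax(W_S @ h_cat + b_S)       # K = 5 probability values summing to 1
slot = int(np.argmax(y_S))             # slot corresponding to the maximum probability

print(intent, slot, y_I.sum(), y_S.sum())
```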
6. An electronic device, comprising a processor and a memory;
the memory stores program instructions;
the processor is configured to execute the program instructions stored by the memory to cause the electronic device to perform:
preprocessing a user command to obtain an original word sequence;
performing deep bidirectional pre-trained language understanding model (BERT) encoding processing on the original word sequence to obtain an intention vector and hidden state vectors respectively corresponding to the T participles;
for a first participle in the T participles, wherein the first participle is any one of the T participles, performing the following processing:
determining an attention vector of the first participle according to the hidden state vector of the first participle and the intention vector;
splicing the hidden state vector of the first participle and the attention vector of the first participle to determine a slot position probability vector of the first participle;
and determining the slot position corresponding to the first word segmentation according to K probability values in the slot position probability vector of the first word segmentation.
7. The electronic device of claim 6, wherein the processor, when determining the attention vector of the first participle from the hidden state vector and the intention vector of the first participle, specifically performs:
determining the attention vector of the first participle according to the hidden state vector of the first participle, the intention vector, and the slot position corresponding to the participle preceding the first participle;
wherein the first participle is any one of the T-1 participles other than the initial participle in the T participles.
8. The electronic device according to claim 6 or 7, wherein when the processor preprocesses the user command to obtain the original word sequence, the following is specifically performed:
generating a marker Token sequence according to the user command;
randomly ordering the Token sequences, and dividing the Token sequences into a plurality of batches of Token sequences according to a batch size batch_size;
and performing a truncation or padding operation on the Token sequence of each batch to obtain the preprocessed original word sequence.
9. The electronic device according to any one of claims 6 to 8, wherein, when performing BERT encoding processing on the original word sequence to obtain the intention vector and the hidden state vectors corresponding to the T participles, the processor specifically performs:
performing BERT semantic encoding on the original word sequence to generate a vector sequence h_0, h_1, ..., h_T, wherein h_0 is the sentence vector encoding information of the user command, and h_1, ..., h_T are the hidden state vectors respectively corresponding to the T participles;
generating the intention vector of the user command according to the sentence vector encoding information h_0 of the user command, wherein the intention vector satisfies
y^I = softmax(W^I · h_0 + b^I)
wherein y^I ∈ R^(1×I), I represents the number of possible intentions of the user command, the intention corresponding to the maximum probability value in y^I is the intention of the user command, h_0 is the sentence vector encoding information of the user command, b^I is a bias term, and W^I is a weight matrix.
10. The electronic device according to any one of claims 6 to 9, wherein, when splicing the hidden state vector of the first participle with the attention vector of the first participle to determine the slot position probability vector of the first participle, the processor specifically performs:
splicing the hidden state vector h_i of the first participle with the attention vector c_i of the first participle to generate deep vector encoding information h'_i, which satisfies
h'_i = concat(h_i, c_i)
wherein concat is the splicing operation function, and h'_i represents the spliced deep vector encoding information;
performing logistic regression model softmax conversion on the deep vector encoding information h'_i to obtain the slot position probability vector y^S_i of the first participle, which satisfies
y^S_i = softmax(W^S · h'_i + b^S)
wherein softmax represents the normalized exponential function, W^S represents a weight matrix, h'_i represents the deep vector encoding information, and b^S represents a bias term.
11. A computer-readable storage medium, comprising program instructions which, when run on a processor, cause the processor to perform the method of any one of claims 1 to 5.
12. A chip, wherein the chip is coupled to a memory for executing a computer program stored in the memory to perform the method of any of claims 1 to 5.
CN202010210034.9A 2020-03-23 2020-03-23 Slot position identification method and electronic equipment Pending CN113505591A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010210034.9A CN113505591A (en) 2020-03-23 2020-03-23 Slot position identification method and electronic equipment
PCT/CN2021/078762 WO2021190259A1 (en) 2020-03-23 2021-03-02 Slot identification method and electronic device

Publications (1)

Publication Number Publication Date
CN113505591A true CN113505591A (en) 2021-10-15

Family

ID=77890951

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010210034.9A Pending CN113505591A (en) 2020-03-23 2020-03-23 Slot position identification method and electronic equipment

Country Status (2)

Country Link
CN (1) CN113505591A (en)
WO (1) WO2021190259A1 (en)


Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114302028A (en) * 2021-12-24 2022-04-08 贝壳找房网(北京)信息技术有限公司 Word extraction method, word extraction device, electronic equipment, storage medium and program product
CN114021582B (en) * 2021-12-30 2022-04-01 深圳市北科瑞声科技股份有限公司 Spoken language understanding method, device, equipment and storage medium combined with voice information
CN115358186B (en) * 2022-08-31 2023-11-14 南京擎盾信息科技有限公司 Generating method and device of slot label and storage medium
CN117725189B (en) * 2024-02-18 2024-04-16 国家超级计算天津中心 Method for generating questions and answers in professional field and electronic equipment


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110008476B (en) * 2019-04-10 2023-04-28 出门问问信息科技有限公司 Semantic analysis method, device, equipment and storage medium
CN110334210A (en) * 2019-05-30 2019-10-15 哈尔滨理工大学 A kind of Chinese sentiment analysis method merged based on BERT with LSTM, CNN
CN110309514B (en) * 2019-07-09 2023-07-11 北京金山数字娱乐科技有限公司 Semantic recognition method and device
CN110413785B (en) * 2019-07-25 2021-10-19 淮阴工学院 Text automatic classification method based on BERT and feature fusion

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108334496A (en) * 2018-01-30 2018-07-27 中国科学院自动化研究所 Human-computer dialogue understanding method and system and relevant device for specific area
CN109785833A (en) * 2019-01-02 2019-05-21 苏宁易购集团股份有限公司 Human-computer interaction audio recognition method and system for smart machine
CN110853626A (en) * 2019-10-21 2020-02-28 成都信息工程大学 Bidirectional attention neural network-based dialogue understanding method, device and equipment

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210374343A1 (en) * 2020-05-29 2021-12-02 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for obtaining word vectors based on language model, device and storage medium
US11526668B2 (en) * 2020-05-29 2022-12-13 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for obtaining word vectors based on language model, device and storage medium
CN115273849A (en) * 2022-09-27 2022-11-01 北京宝兰德软件股份有限公司 Intention identification method and device for audio data
CN115935994A (en) * 2022-12-12 2023-04-07 重庆邮电大学 Method for intelligently identifying electric trademark
CN115935994B (en) * 2022-12-12 2024-03-08 芽米科技(广州)有限公司 Method for intelligently identifying current label questions
CN116092495A (en) * 2023-04-07 2023-05-09 广州小鹏汽车科技有限公司 Voice interaction method, server and computer readable storage medium
CN116153313A (en) * 2023-04-07 2023-05-23 广州小鹏汽车科技有限公司 Voice interaction method, server and computer readable storage medium
CN116092495B (en) * 2023-04-07 2023-08-29 广州小鹏汽车科技有限公司 Voice interaction method, server and computer readable storage medium

Also Published As

Publication number Publication date
WO2021190259A1 (en) 2021-09-30

Similar Documents

Publication Publication Date Title
CN111862977B (en) Voice conversation processing method and system
CN113505591A (en) Slot position identification method and electronic equipment
CN110211563B (en) Chinese speech synthesis method, device and storage medium for scenes and emotion
CN111191016B (en) Multi-round dialogue processing method and device and computing equipment
CN110838288B (en) Voice interaction method and system and dialogue equipment
US11538478B2 (en) Multiple virtual assistants
US20240153489A1 (en) Data driven dialog management
CN111738016A (en) Multi-intention recognition method and related equipment
CN110428823A (en) Speech understanding device and the speech understanding method for using the device
CN114830139A (en) Training models using model-provided candidate actions
US11579841B1 (en) Task resumption in a natural understanding system
US11132994B1 (en) Multi-domain dialog state tracking
CN115392264A (en) RASA-based task-type intelligent multi-turn dialogue method and related equipment
US11955112B1 (en) Cross-assistant command processing
CN112183061A (en) Multi-intention spoken language understanding method, electronic device and storage medium
CN114220461A (en) Customer service call guiding method, device, equipment and storage medium
CN112395887A (en) Dialogue response method, dialogue response device, computer equipment and storage medium
CN114239607A (en) Conversation reply method and device
CN114003700A (en) Method and system for processing session information, electronic device and storage medium
CN112927695A (en) Voice recognition method, device, equipment and storage medium
CN113158062A (en) User intention identification method and device based on heterogeneous graph neural network
US20240095987A1 (en) Content generation
US11922938B1 (en) Access to multiple virtual assistants
WO2023172442A1 (en) Shared encoder for natural language understanding processing
US11763809B1 (en) Access to multiple virtual assistants

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination