CN114077666A - Dialog intention classification method, apparatus and non-volatile computer storage medium - Google Patents

Dialog intention classification method, apparatus and non-volatile computer storage medium

Info

Publication number
CN114077666A
CN114077666A
Authority
CN
China
Prior art keywords
feature vector
dialog
intention
intent
loss function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010850324.XA
Other languages
Chinese (zh)
Inventor
徐华
张瀚镭
林廷恩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Toyota Motor Corp
Original Assignee
Tsinghua University
Toyota Motor Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University, Toyota Motor Corp
Priority to CN202010850324.XA
Publication of CN114077666A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/332 - Query formulation
    • G06F 16/3329 - Natural language query formulation or dialogue systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The present disclosure relates to a dialog intent classification method, apparatus, and non-volatile computer storage medium. The method includes receiving data of a dialog and extracting, using a first learning network, a first feature vector of a sentence based on the received data. The modular length of the first feature vector is adjusted such that the smaller its minimum difference from the representative feature vectors of the known intent categories, the larger the modular length, thereby obtaining a second feature vector. Probability-related parameters for the respective known intent categories are determined using a second learning network based on the second feature vector, the first and second learning networks being jointly trained using a first loss function serving as a metric loss function and a second loss function characterizing a classification loss. The probability-related parameters are compared with a threshold to determine whether the dialog belongs to an unknown intent category and, if not, to which known intent category it belongs. When facing the task of open intent classification, the disclosed method can effectively detect unknown intent categories while ensuring the accuracy of classification of known intent categories.

Description

Dialog intention classification method, apparatus and non-volatile computer storage medium
Technical Field
The present disclosure relates to methods and apparatus for automated processing and analysis of conversations, and more particularly, to methods and apparatus for intent recognition and classification of conversations.
Background
Artificial intelligence techniques, including dialogue robots, are widely used for automated processing and analysis of human dialogs. Taking a conversation robot as an example, it needs to effectively recognize and classify a conversation intention from data of a conversation. However, not all dialog intents are known, and unknown dialog intents also exist.
As shown in Table 1, the dialog robot often needs to handle the task of open intent classification: some sentences in the dialog data to be analyzed belong to known intent categories, while others cannot be assigned an intent and are labeled as unknown intent. As examples of known intent categories, the intent label of "I want to take medicine" is "sick", the intent label of "my head hurts a little" is "sick", the intent label of "I am hungry" is "eat", and the intent label of "I am thirsty" is "drink". As examples of unknown intent, sentences such as "How is the weather today?" and "Can you help me see what this is?" cannot be identified and classified into known intent categories by machine learning methods.
Table 1: Example of open intent classification of a dialog

Dialog content of the user            | Intent label
I want to take medicine.              | Sick
My head hurts a little.               | Sick
I am hungry.                          | Eat
I am thirsty.                         | Drink
……                                    | ……
How is the weather today?             | Unknown intent
Can you help me see what this is?     | Unknown intent
Dialog data of unknown intent categories can interfere with the identification and classification of known intent categories, but methods for effectively identifying and classifying unknown intent categories are currently lacking. In particular, in the task of open intent classification, sentences of known intent and of unknown intent coexist, and existing methods cannot handle the recognition and classification of both at the same time. Although traditional supervised classification methods perform well in classifying dialog data of known intent categories, the lack of prior knowledge about unknown intent categories means that they introduce false-positive errors when classifying dialog data of unknown intent categories; at present there is also no effective technical means for generalizing from dialog data of known intent categories to the identification and classification of unknown intent categories.
Disclosure of Invention
The present disclosure is provided to solve the above-mentioned problems occurring in the prior art.
There is a need for a dialog intent classification method, a dialog intent classification apparatus, and a non-volatile computer storage medium that, when facing the task of open intent classification, can effectively detect unknown intent categories while ensuring the accuracy of classification of known intent categories.
According to a first aspect of the present disclosure, a dialog intent classification method is provided. The method includes receiving data of a dialog. The method also includes performing, by a processor, the following steps. A first feature vector of a sentence may be extracted, based on the received data of the dialog, using a trained first learning network. The modular length of the first feature vector may be adjusted based on the smallest difference of the first feature vector with respect to the representative feature vectors of the known intent categories, such that the smaller the smallest difference, the larger the modular length, so as to obtain a second feature vector. Probability-related parameters for the respective known intent categories may be determined, based on the second feature vector, using a trained second learning network, wherein the second learning network is configured to perceive the modular length of the second feature vector. The first learning network and the second learning network may be jointly trained using a first loss function and a second loss function characterizing a classification loss. The first loss function may be defined such that the difference of the first feature vector from the representative feature vector of the known intent category to which it belongs is smaller than a representative value of its differences from the representative feature vectors of the other known intent categories. The determined probability-related parameters for the respective known intent categories are compared with a threshold. If all the probability-related parameters are smaller than the threshold, the dialog is determined to belong to the unknown intent category; if the probability-related parameters are not all smaller than the threshold, the dialog is determined to belong to the known intent category corresponding to the largest probability-related parameter.
According to a second aspect of the present disclosure, a dialog intent classification apparatus is provided, comprising an interface and a processor. The interface may be configured to receive data of a dialog. The processor may be configured to perform the following steps. A first feature vector of a sentence may be extracted, based on the received data of the dialog, using a trained first learning network. The modular length of the first feature vector may be adjusted based on the smallest difference of the first feature vector with respect to the representative feature vectors of the known intent categories, such that the smaller the smallest difference, the larger the modular length, so as to obtain a second feature vector. Probability-related parameters for the respective known intent categories may be determined, based on the second feature vector, using a trained second learning network, wherein the second learning network is configured to perceive the modular length of the second feature vector. The first learning network and the second learning network may be jointly trained using a first loss function and a second loss function characterizing a classification loss. The first loss function may be defined such that the difference of the first feature vector from the representative feature vector of the known intent category to which it belongs is smaller than a representative value of its differences from the representative feature vectors of the other known intent categories. The determined probability-related parameters for the respective known intent categories are compared with a threshold. If all the probability-related parameters are smaller than the threshold, the dialog is determined to belong to the unknown intent category; if the probability-related parameters are not all smaller than the threshold, the dialog is determined to belong to the known intent category corresponding to the largest probability-related parameter.
According to a third aspect of the present disclosure, there is provided a non-transitory computer storage medium having stored thereon executable instructions that, when executed by a processor, implement a dialog intention classification method according to various embodiments of the present disclosure.
By using the dialog intent classification method, the dialog intent classification apparatus, and the non-volatile computer storage medium of the embodiments of the present disclosure, when facing the task of open intent classification, unknown intent categories can be effectively detected while the accuracy of classification of known intent categories is ensured, by combining the adjustment of the modular length of the first feature vector, the modular-length-aware configuration of the second learning network, and joint training using a metric loss function and a classification loss function.
Drawings
In the drawings, which are not necessarily drawn to scale, like reference numerals may describe similar components in different views. Like reference numerals having letter suffixes or different letter suffixes may represent different instances of similar components. The drawings illustrate various embodiments generally by way of example and not by way of limitation, and together with the description and claims serve to explain the disclosed embodiments. Such embodiments are illustrative, and are not intended to be exhaustive or exclusive embodiments of the present apparatus or method.
FIG. 1(a) shows a flow diagram of a dialog intention classification method according to an embodiment of the present disclosure;
FIG. 1(b) shows a flow diagram of constructing and training the learning networks employed by a dialog intention classification method according to an embodiment of the present disclosure;
FIG. 2 illustrates a framework diagram of a pre-trained language model for extracting a first feature vector of a sentence in a dialog, according to an embodiment of the present disclosure;
FIG. 3 shows a flow diagram of a dialog intention classification method according to another embodiment of the present disclosure;
FIG. 4 illustrates a configuration diagram of a dialog intention classification system according to another embodiment of the present disclosure; and
fig. 5 illustrates a block diagram of a dialog intention classification apparatus according to an embodiment of the present disclosure.
Detailed Description
For a better understanding of the technical aspects of the present disclosure, reference is made to the following detailed description taken in conjunction with the accompanying drawings. Embodiments of the present disclosure are described in further detail below with reference to the figures and the detailed description, but the present disclosure is not limited thereto. The terms "first," "second," and "third" as used in this disclosure are intended only to distinguish between corresponding features, do not denote a need for such ordering, and do not necessarily denote only the singular.
Fig. 1(a) shows a flowchart of a dialog intent classification method according to an embodiment of the present disclosure. As shown in Fig. 1(a), the dialog intent classification method starts at step 101: data of a dialog is received, the dialog including a plurality of sentences. At step 102, a first feature vector of a sentence is extracted by a processor, based on the received data of the dialog, using a trained first learning network. The first feature vector contains semantic information of the sentence and can be extracted by first learning networks of various configurations. For example, words may be characterized by discrete space vectors, such as the codes in a bag-of-words model. As another example, the first learning network may be used to convert discrete vectors of a high-dimensional space into a dense, distributed vector representation in a low-dimensional space, which can capture distance relationships between words and word-sense similarities; each dimension may carry a specific meaning, so that more semantic information is encoded, and, compared with a discrete space vector representation, the growth of dimensionality and space overhead is suppressed. In some embodiments, a language model with dynamic word vectors may be used, so that the feature vectors of words are adjusted dynamically according to context. Whereas a static word vector (one fixed vector per word) cannot resolve word-sense ambiguity, the feature vectors of the same word in different contexts can then carry rich semantics and accurately reflect the influence of the different contexts.
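As an illustration of the dynamic-word-vector idea (not part of the claimed method), the following minimal sketch uses the Hugging Face transformers library and the publicly available "bert-base-uncased" checkpoint to show that the same word receives different contextual vectors in different sentences; the word "bank" and both sentences are illustrative assumptions:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def word_vector(sentence, word):
    # Hidden state of the first sub-token of `word` within `sentence`.
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]            # (seq_len, hidden_size)
    word_id = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(word)[0])
    pos = (enc["input_ids"][0] == word_id).nonzero()[0, 0]
    return hidden[pos]

v1 = word_vector("I deposited cash at the bank.", "bank")
v2 = word_vector("We sat on the grassy bank of the river.", "bank")
print(torch.cosine_similarity(v1, v2, dim=0))  # noticeably below 1.0: same word, different contexts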
At step 103, based on the minimum difference of the first feature vector relative to the representative feature vectors of the known intent categories, the modular length of the first feature vector is adjusted such that the smaller the minimum difference, the larger the modular length, so as to obtain a second feature vector. Although the first feature vector may contain richer semantic information thanks to a language model with dynamic word vectors, the inventors found that by itself it still cannot effectively separate unknown intent categories from known intent categories. By introducing difference information on top of the first feature vector (which may be implemented as variously defined distance information, dissimilarity information, correlation information, etc.) and using this difference information to guide the adjustment of the modular length of the feature vector, the difference information can be incorporated into the modular length in the form of an adjustment coefficient. The adjustment coefficient represents how close the data is to the known intents: the larger the coefficient, the closer the data is to a known intent category; the smaller the coefficient, the closer it is to the unknown intent category. In this way, the second feature vector obtained after the modular-length adjustment (an example of which is referred to below as "meta-embedding") can learn deep data relationships that are close within a class and far apart between classes (in particular, far apart between known intent categories and the unknown intent category).
At step 104, probability-related parameters for the respective known intent categories are determined by the processor, based on the second feature vector, using a second learning network configured to be able to perceive the modular-length information of the feature vector. With the second feature vector whose modular length has been adjusted in step 103 based on the difference information, and a second learning network configured to sense (capture) the modular-length information of the feature vector, easily distinguishable probability-related parameters (e.g., classification confidences) can be obtained.
Then, at step 105, the determined probability-related parameters for the respective known intent categories are compared with a threshold by the processor. If all the probability-related parameters are smaller than the threshold, the dialog is determined to belong to the unknown intent category (step 106); otherwise, the dialog is determined to belong to the known intent category corresponding to the largest probability-related parameter (step 107).
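A minimal sketch of this decision step, assuming the probability-related parameters are already available as a vector of per-category confidences (the function name and the threshold value are illustrative, not taken from the disclosure):

import numpy as np

def classify_open_intent(probs, threshold=0.5):
    # probs[k] is the probability-related parameter for known intent category k.
    if np.all(probs < threshold):            # steps 105-106: every confidence below the threshold
        return "unknown_intent"
    return int(np.argmax(probs))             # step 107: known category with the largest confidence

print(classify_open_intent(np.array([0.10, 0.20, 0.30])))   # -> unknown_intent
print(classify_open_intent(np.array([0.10, 0.70, 0.20])))   # -> 1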
Fig. 1(b) shows a flowchart of a method of training the learning networks employed by a dialog intent classification method according to an embodiment of the present disclosure. The learning networks include a first learning network and a second learning network.
The training method may begin at step 108: a first learning network and a second learning network are constructed, the first learning network being configurable to extract a first feature vector of a sentence, and the second learning network being configurable to perceive the modular-length information of the second feature vector. As described above, the first learning network may employ various network structures. In some embodiments, based on the received data of the dialog, a third feature vector may be extracted for each token using a pre-trained language model, taking into account the context of each token; and the first feature vector of the sentence may be determined from the extracted third feature vectors through a synthesis operation. In some embodiments, the pre-trained language model includes, but is not limited to, the BERT language model, the RoBERTa language model, various recurrent neural networks (RNNs, such as, but not limited to, the long short-term memory network (LSTM)), XLNet, and the like. The pre-trained language model may be pre-trained on multiple tasks so as to capture different contextual information of a word. After pre-training is completed, part of the parameters of the first learning network may be fixed, and in the subsequent training step 111 performed jointly with the second learning network, only the remaining parameters of the first learning network are determined and adjusted (fine-tuned), which significantly reduces the amount of computation during training while preserving the training effect.
At step 109, training corpora of known intent categories are received. Training corpora are lacking for dialog data of unknown intent categories; through the cooperation of the definition of the second feature vector, the configuration of the second learning network, and the special definition of the first loss function, the training method can jointly train the first and second learning networks using only training corpora of known intent categories, so that the trained networks can effectively distinguish unknown intent categories while accurately classifying known intent categories in the application scenario of open intent classification of dialogs.
At step 110, each corpus entry may be loaded as a training sample. At step 111, parameters of at least part of the first learning network and the parameters of the second learning network may be determined based on the training samples; a first learning network that employs a pre-trained language model may directly reuse its pre-trained network layers, while only the parameters of the remaining network layers are determined and adjusted at step 111. At step 112, the first learning network and the second learning network may be jointly trained against the first loss function and the second loss function, so as to verify and adjust the network parameters. The second loss function may be a loss function characterizing a classification loss in various forms, including but not limited to a squared loss function, a cross-entropy loss function, and the like. The first loss function is defined such that the difference between the first feature vector and the representative feature vector of the known intent category to which it belongs is smaller than a representative value of its differences from the representative feature vectors of the other intent categories; for example, a loss boundary may be set (e.g., as a set value, a boundary penalty term, or the like) that the second learning network must respect when learning the decision boundary, so that, under the constraint of the first loss function, the learning networks learn features that are close within a class and far apart between classes.
Note that, although step 111 and step 112 are shown one after the other in Fig. 1(b), their execution order is not limited thereto, as long as at least part of the parameters of the first learning network and the parameters of the second learning network are adjusted in consideration of both the first loss function and the second loss function.
At step 113, it is determined whether all training samples have been processed. If yes, the trained first learning network and second learning network are output at step 114; if not, the method returns to step 110 to load the next training sample and continue training. In some embodiments, a batch training mode, such as mini-batch training, may be adopted; accordingly, steps 110, 111, 112, and 113 may be adapted to the specific training mode adopted, which is not described in detail here.
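As an illustration of steps 110 to 114, a minimal sketch of the joint training loop, assuming PyTorch and placeholder objects for the two learning networks, the metric loss term, and the data loader (all names and hyperparameter values here are illustrative assumptions, not taken from the disclosure):

import torch

def train_jointly(encoder, classifier, metric_loss_fn, loader, epochs=10, lam=1.0):
    # Step 111: only the non-frozen encoder parameters and the classifier parameters are updated.
    params = [p for p in encoder.parameters() if p.requires_grad] + list(classifier.parameters())
    optimizer = torch.optim.AdamW(params, lr=2e-5)
    ce = torch.nn.CrossEntropyLoss()                     # second loss function (classification loss)
    for _ in range(epochs):
        for sentences, labels in loader:                 # step 110: load a (mini-)batch of samples
            z = encoder(sentences)                       # first feature vectors
            logits = classifier(z)                       # scores over the known intent categories
            loss = ce(logits, labels) + lam * metric_loss_fn(z, labels)   # step 112: joint objective
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return encoder, classifier                           # step 114: trained first and second networks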
Fig. 2 illustrates a framework diagram of a pre-trained language model for extracting the first feature vector of a sentence in a dialog, according to an embodiment of the present disclosure. As shown in Fig. 2, a sentence 201 input by the user is denoted S = [word 1, word 2, …, word n], where n is the total number of words contained in the sentence. The description here takes words as an example of tokens; the pre-trained language model can be applied by generalizing words to various other tokens (for example, subwords). The following description with reference to Fig. 2 uses the BERT pre-trained language model as an example, but it should be understood that the pre-trained language model is not limited thereto, and various implementations such as the RoBERTa language model, various recurrent neural networks (RNNs, such as, but not limited to, the long short-term memory network (LSTM)), XLNet, and the like may be used. With adaptive modification, the following description given for the BERT pre-trained language model also applies to the extraction of sentence feature vectors with those other pre-trained language models, and is not repeated here.
S may be encoded in the BERT manner and then passed through BERT's conversion layers (also called hidden coding layers), the 1st conversion layer 202, ……, and the 12th conversion layer 203, to obtain the word feature vectors [CLS, T_1, T_2, …, T_n] of the corresponding words, where the element CLS serves as a classification tag and T_i denotes the feature vector of the i-th word. Twelve conversion layers are shown in Fig. 2 as an example, but the number of conversion layers is not limited thereto; the BERT pre-trained language model may include 12 to 24 conversion layers.
The extracted word feature vectors (also referred to as third feature vectors in this disclosure) may be pooled by the pooling layer 204, such as, but not limited to, by an average pooling operation, to determine a fourth feature vector of the sentence, x = average_pooling([CLS, T_1, T_2, …, T_n]) ∈ R^H, where H is the number of hidden-layer neurons. The pooling layer 204 in Fig. 2 is only an example: the first feature vector of the sentence may also be determined from the extracted third feature vectors through other forms of synthesis (composition) operations, the synthesis operation being configured to integrate the third feature vectors of the words into a sentence-level feature vector representation. In some embodiments, the determined fourth feature vector of the sentence may itself be used as the first feature vector of the sentence. In some embodiments, as shown in Fig. 2, a dense layer 205 may be disposed downstream of the pooling layer 204, so that the fourth feature vector of the sentence is processed by the dense layer 205 to obtain the first feature vector z of the sentence (also identified as "BERT embedding" in this disclosure), where z = f_θ(x) ∈ R^D, θ denotes the parameters of the dense layer 205, and D is the dimension of the feature vector.
In some embodiments, the BERT language model may be pre-trained on two tasks, namely masked language prediction and next-sentence prediction, so that different contextual information of words can be captured and the feature vectors of words can be adjusted dynamically according to context, thereby encoding richer semantic information and external knowledge and strengthening the characterization power of the features. By introducing the dense layer 205, the higher-dimensional discrete sentence vector output by the pooling layer 204 can be converted into a dense sentence feature vector in a lower-dimensional space. This dense, distributed representation can capture the difference (distance) relationships between words and word-sense similarities, and each dimension may carry a specific meaning, so that richer information can be encoded while the growth of space overhead is markedly suppressed, which improves computational efficiency and can also improve the predictive power of the model.
In some embodiments, the BERT language model may include 12 to 24 conversion layers, of which only the last 1 to 24 conversion layers and the dense layer are trained during the joint training of the first learning network and the second learning network. That is, the other conversion layers can directly adopt the pre-trained parameters without further training. Taking a BERT language model with 12 conversion layers as an example, during the subsequent joint training only the last conversion layer (i.e., the 12th conversion layer 203) and the dense layer 205 may actually need to be trained, which significantly reduces the number of parameters to be trained, lowers the computational load of training, and speeds up training.
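A minimal sketch of this feature-extraction pipeline, assuming the Hugging Face transformers BERT implementation; the mean pooling, the dense layer, and the layer-freezing strategy follow the description above, while the checkpoint name, output dimension D, and other details are illustrative assumptions:

import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

class SentenceEncoder(nn.Module):
    """First learning network: BERT -> mean pooling -> dense layer (cf. Fig. 2)."""
    def __init__(self, model_name="bert-base-uncased", out_dim=128):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.bert = AutoModel.from_pretrained(model_name)
        # Freeze all pre-trained parameters, then unfreeze only the last conversion
        # (encoder) layer; only it and the dense layer are fine-tuned during joint training.
        for p in self.bert.parameters():
            p.requires_grad = False
        for p in self.bert.encoder.layer[-1].parameters():
            p.requires_grad = True
        self.dense = nn.Linear(self.bert.config.hidden_size, out_dim)   # f_theta

    def forward(self, sentences):
        enc = self.tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
        hidden = self.bert(**enc).last_hidden_state          # (B, seq_len, H), includes [CLS]
        mask = enc["attention_mask"].unsqueeze(-1)            # exclude padding from the average
        x = (hidden * mask).sum(1) / mask.sum(1)              # fourth feature vector (average pooling)
        return self.dense(x)                                  # first feature vector z in R^D

In a deployed system, tokenization would typically be performed outside forward() and the tensors moved to the appropriate device; the sketch keeps everything in one place for readability.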
Fig. 3 illustrates a flowchart of a dialog intent classification method according to another embodiment of the present disclosure. In Fig. 3, the flow of the dialog intent classification method according to an embodiment of the present disclosure is described using a pre-trained BERT language model as an example of the pre-trained language model, the cluster center vector of a known intent category as an example of the representative feature vector of that category, accordingly the Euclidean distance between the first feature vector and the cluster center vector of a known intent category as an example of the difference between the first feature vector and the representative feature vector, and a cosine classifier as an example of the second learning network. It should be appreciated that other pre-trained language models may be employed, the representative feature vectors of known intent categories may also take other forms, such as cluster median vectors, the differences may also be implemented as correlations, dissimilarities, etc., the Euclidean distance may also be modified to, for example, the Poisson distance, and the second learning network may also take other configurations, which are not described in detail here.
As shown in Fig. 3, a sentence input by the user is fed into the pre-trained BERT language model 301 to obtain the BERT embedding 302, i.e., the first feature vector of the sentence, z = f_θ(x) ∈ R^D.
First, the cluster center vector of each known intent category can be calculated according to formula (3.1) as the representative feature vector of that category, the cluster center vector being the average of the feature vectors of all data of the known intent category:

c_k = (1 / |S_k|) · Σ_{(x, y) ∈ S_k} z(x)     formula (3.1)

where c_k denotes the cluster center vector of class k, S_k denotes the set of data labeled k, x denotes a sentence sample, y denotes a label, z(x) denotes the first feature vector extracted from x, and |·| denotes the size of a set.
Next, the distance of each data point (i.e., each first feature vector z) to its nearest cluster center can be calculated according to formula (3.2):

d_min = min_k ||z − c_k||_2     formula (3.2)

where d_min denotes the minimum distance from the data point z to all the cluster centers {c_k}, k = 1, …, K, ||·||_2 denotes the Euclidean distance, and min_k{·} denotes the minimum over all K distances.
The modular-length adjustment of the first feature vector z can be guided by introducing Euclidean distance information on top of the first feature vector z and applying an adjustment coefficient derived from that distance information. The adjustment coefficient represents how close each data point is to the known intents: the larger the coefficient, the closer the data point is to a known intent category; the smaller the coefficient, the closer it is to the unknown intent category. The adjustment coefficient thus guides the feature vector to learn deep data relationships that are close within a class and far apart between classes. For example, the reciprocal of the minimum distance d_min obtained in the previous step can be applied to the original first feature vector z as an adjustment coefficient according to formula (3.3), to obtain the second feature vector z_meta (identified as "meta-embedding 303" in Fig. 3):

z_meta = (1 / d_min) · z     formula (3.3)

where 1/d_min represents how close the data point is to the known intent categories: the larger the adjustment coefficient, the closer the data point is to a known intent category; the smaller, the closer it is to the unknown intent category. The adjustment coefficient determines the modular-length characteristic of the second feature vector z_meta, which facilitates learning deep data relationships that are close within classes (close within known intent categories) and far apart between classes (in particular, far apart between known intent categories and the unknown intent category), so that the subsequent second learning network (illustrated in Fig. 3 as the cosine classifier 304) can obtain more easily distinguishable probability-related parameters (illustrated in Fig. 3 as the classification probabilities 306).
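A minimal sketch of formulas (3.1) to (3.3), assuming the first feature vectors have already been extracted as NumPy arrays (all names are illustrative):

import numpy as np

def cluster_centers(Z, labels, num_classes):
    # Formula (3.1): c_k is the mean first feature vector of all samples labeled k.
    return np.stack([Z[labels == k].mean(axis=0) for k in range(num_classes)])

def meta_embedding(z, centers):
    # Formulas (3.2)-(3.3): scale z by the reciprocal of its distance to the nearest center.
    d_min = np.min(np.linalg.norm(centers - z, axis=1))   # (3.2) minimum Euclidean distance
    return z / d_min                                       # (3.3) z_meta = (1 / d_min) * z

# Sentences far from every cluster center (likely unknown intent) obtain a small modular length,
# while sentences near some known-intent center obtain a large one.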
As shown in Fig. 3, after the distance information has been integrated into the second feature vector z_meta via the adjustment coefficient, the magnitude of its modular length (which carries the distance information) is converted into the classification probability 306 information by the cosine classifier 304.
The cosine classifier 304 is essentially one weight layer of a neural network; its classification weights for the second feature vector z_meta can be learned through network training, and the classification scores can be obtained by computing the cosine similarity between the second feature vector z_meta and the classification weights according to formula (3.4):

sim_k = τ · ⟨ ẑ_meta , ŵ_k ⟩     formula (3.4)

where sim_k denotes the cosine similarity between the second feature vector z_meta and the class-k classification weight vector w_k, τ is a learnable scalar value, and the cosine similarity can be computed with a dot product. In some embodiments, as shown in the expression on the right side of formula (3.4), the second feature vector z_meta and the class-k classification weight vector w_k may be normalized, such as, but not limited to, by L1-norm normalization, L2-norm normalization, and the like.
The L2-norm normalization is explained below as an example of the normalization. ẑ_meta and ŵ_k denote the second feature vector z_meta and the class-k classification weight vector w_k after L2-norm normalization, respectively. Specifically, ẑ_meta and ŵ_k can be calculated as follows:

ẑ_meta = ( ||z_meta||² / (1 + ||z_meta||²) ) · ( z_meta / ||z_meta|| )     formula (3.5)

ŵ_k = w_k / ||w_k||     formula (3.6)

where ||·|| denotes the modular length of the feature. A nonlinear squeeze function can be used to apply the L2-norm normalization to the second feature vector z_meta: the nonlinear squeeze function compresses vectors with a larger modular length to a length slightly below 1 and vectors with a smaller modular length to a length close to 0. With this processing, the feature vector of an unknown intent category, which has a small modular length, is converted into a feature vector of small length, which in turn yields a very low classification score that is easy to separate with a threshold. In some embodiments, the classification weight vector w_k may also be L2-norm normalized as defined by formula (3.6), so as to eliminate the influence of the modular length of the classification weight vector on the classification result and make the classification more accurate.
In some embodiments, as shown in Fig. 3, the classification prediction 305 may be completed by applying a softmax function to the classification scores obtained with the cosine classifier 304 to obtain the classification probabilities 306, and then separating known intent categories from the unknown intent category based on a confidence threshold.
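A minimal sketch of the cosine classifier of formulas (3.4) to (3.6) and the prediction step, assuming PyTorch; the exact squash form of the nonlinear squeeze function, the initial value of τ, and the threshold are illustrative assumptions:

import torch
from torch import nn
import torch.nn.functional as F

def squash(v, dim=-1):
    # Nonlinear squeeze: long vectors -> length slightly below 1, short vectors -> length near 0.
    sq = (v ** 2).sum(dim=dim, keepdim=True)
    return (sq / (1.0 + sq)) * v / torch.sqrt(sq + 1e-9)

class CosineClassifier(nn.Module):
    """Second learning network: modular-length-aware cosine classifier (304 in Fig. 3)."""
    def __init__(self, dim, num_known_classes):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_known_classes, dim))  # w_k
        self.tau = nn.Parameter(torch.tensor(10.0))                       # learnable scalar τ

    def forward(self, z_meta):
        z_hat = squash(z_meta)                              # formula (3.5)
        w_hat = F.normalize(self.weight, p=2, dim=-1)       # formula (3.6)
        return self.tau * z_hat @ w_hat.t()                 # formula (3.4): sim_k for every class

def predict(logits, threshold=0.5):
    probs = F.softmax(logits, dim=-1)                       # classification probabilities 306
    conf, cls = probs.max(dim=-1)
    return torch.where(conf < threshold, torch.full_like(cls, -1), cls)   # -1 marks unknown intent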
The partial network layers of the pre-trained BERT language model 301 and the cosine classifier 304 may be jointly trained using a first loss function and a second loss function. In some embodiments, the first loss function Loss_d and the second loss function Loss_ce may be combined into a total loss function Loss according to formula (3.7), where the first loss function Loss_d serves as a metric learning loss function and the second loss function Loss_ce serves as a classification loss function; combining the classification loss function and the metric learning loss function for joint training allows a feature representation that is aware of the distance information to be learned while the classification performance is preserved.

Loss = Loss_ce + λ · Loss_d     formula (3.7)

where λ is a scalar value used to balance the two loss functions.
In some embodiments, the second loss function Loss_ce, as the classification loss function, can adopt a cross-entropy loss function defined by the following formulas (3.8) and (3.9):

Loss_ce = −(1/N) · Σ_{i=1}^{N} log( exp(φ_{y_i}(z_meta,i)) / Σ_{j=1}^{K} exp(φ_j(z_meta,i)) )     formula (3.8)

φ_j(z_meta,i) = W_j^T · z_meta,i + b_j     formula (3.9)

where φ(·) denotes the linear output layer, i denotes the sample index, N denotes the total number of samples, K denotes the number of known intent categories, y_i denotes the known intent category label of the i-th sample with y_i ∈ {1, 2, …, K}, m1 denotes a set loss boundary, j denotes the index of a known intent category, W_{y_i}^T denotes the weight parameters of the known intent category identified by the label y_i, W_j^T denotes the weight vector of the j-th category, b_{y_i} denotes the bias of the known intent category identified by the label y_i, b_j denotes the bias of the j-th category, and z_meta,i denotes the second feature vector (also referred to as a "meta-feature") extracted for the i-th sample. A classification probability 306 can be computed from the meta-feature using the softmax function, and the cross-entropy loss is defined so as to maximize the probability that the meta-feature is classified into the intent category to which it belongs.
In some embodiments, the first loss function Loss_d can serve as a metric learning loss function, which can be defined by various formulas as long as the following physical meaning is achieved: the difference (e.g., the Euclidean distance) between the first feature vector and the representative feature vector of the known intent category to which it belongs is smaller than a representative value (e.g., the mean) of its differences (e.g., Euclidean distances) from the representative feature vectors of the other intent categories, for example by at least a set loss boundary threshold. Thus, by introducing the metric learning loss function Loss_d into the total loss function, the first learning network (and accordingly the first feature vector) learns deep data relationships that are close within a class and far apart between classes. Because the constraint of the first loss function Loss_d forces the differences of the first feature vector to the known and unknown intent categories to differ by at least the loss boundary threshold, whether the first feature vector belongs to a known intent category or the unknown intent category can be judged relatively easily from the distance between the data and the known intent categories.
In some embodiments, the first loss function Loss_d can be defined by the following formula (1):

Loss_1 = (1/N) · Σ_{i=1}^{N} max( 0, ||z_i − c_{y_i}||_2 − (1/(K−1)) · Σ_{k≠y_i} ||z_i − c_k||_2 + m1 )     formula (1)

where Loss_1 denotes the first loss function Loss_d or its first component, ||·||_2 denotes the Euclidean distance, i denotes the sample index, N denotes the total number of samples, K denotes the number of known intent categories, k denotes the index of a known intent category, y_i denotes the known intent category label of the i-th sample with y_i ∈ {1, 2, …, K}, m1 denotes the set loss boundary, z_i denotes the first feature vector extracted for the i-th sample, and c_k denotes the cluster center vector of the k-th known intent category. Formula (1) has the following physical meaning: the Euclidean distance from the first feature vector to the cluster center vector of the known intent category in which it lies is at least m1 less than the mean of its distances to the cluster center vectors of the other intent categories. Through the constraint of the first loss function Loss_d, the first feature vector acquires the characteristics of being close within a class and far apart between classes, and since the Euclidean distances from the first feature vector to the known and unknown intent categories differ by at least the boundary value m1, whether the first feature vector belongs to a known category or an unknown category can be judged relatively easily from the distance between the data and the known intent categories.
In some embodiments, the first loss function Loss_d may also be defined as a combination of the components Loss_1, Loss_2, and Loss_3 of the first loss function defined by formula (1), the following formula (2), and the following formula (3):

Loss_2 = −(1/N) · Σ_{i=1}^{N} log( exp(s1·(cos θ_{y_i} − m2)) / ( exp(s1·(cos θ_{y_i} − m2)) + Σ_{j≠y_i} exp(s1·cos θ_j) ) )     formula (2)

where Loss_2 denotes the second component of the first loss function, s1 is a scaling factor, θ_{y_i} is the angle between the first feature vector z_i and the weight vector W_{y_i} of the known intent category identified by the label y_i, j denotes the index of a known intent category, θ_j is the angle between the first feature vector z_i and the weight vector W_j of the j-th category, and m2 is the cosine-distance margin constant;

Loss_3 = −(1/N) · Σ_{i=1}^{N} log( exp(s2·cos(θ_{y_i} + m3)) / ( exp(s2·cos(θ_{y_i} + m3)) + Σ_{j≠y_i} exp(s2·cos θ_j) ) )     formula (3)

where Loss_3 denotes the third component of the first loss function, s2 is a scaling factor, m3 is an angular-distance margin constant, and the other parameters have the same definitions as in formula (2), which are not repeated here.
By combining Loss_1, Loss_2, and Loss_3, the first feature vector can be constrained by decision boundaries on the three metrics of Euclidean distance, cosine distance, and angular distance, further promoting closeness within classes and separation between classes, so that in the application scenario of open intent classification of dialogs the known intent categories can be classified more accurately while unknown intent categories are effectively detected. This has been verified in comparative experiments on three public standard datasets, which are described in detail below and not repeated here.
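A minimal sketch of how the cosine-margin and angular-margin components might be computed and combined with the Euclidean component, assuming the forms of formulas (2) and (3) as reconstructed above; all functional forms and hyperparameter values are illustrative assumptions:

import torch
import torch.nn.functional as F

def margin_components(Z, labels, W, s1=30.0, m2=0.35, s2=30.0, m3=0.5):
    # Z: (N, D) first feature vectors; W: (K, D) per-class weight vectors.
    cos = F.normalize(Z, dim=1) @ F.normalize(W, dim=1).t()          # cos θ_j for every class
    one_hot = F.one_hot(labels, W.shape[0]).bool()
    # Formula (2): subtract the cosine margin m2 from the target-class cosine.
    logits2 = s1 * torch.where(one_hot, cos - m2, cos)
    loss_2 = F.cross_entropy(logits2, labels)
    # Formula (3): add the angular margin m3 to the target-class angle.
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    logits3 = s2 * torch.where(one_hot, torch.cos(theta + m3), cos)
    loss_3 = F.cross_entropy(logits3, labels)
    return loss_2, loss_3

# Loss_d can then be formed by adding these two terms to the Euclidean-distance component
# Loss_1 (sketched after formula (1) above) before combining with Loss_ce via formula (3.7).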
Fig. 4 shows a configuration diagram of a dialog intent classification system 400 according to another embodiment of the present disclosure. As shown in Fig. 4, the dialog intent classification system 400 may include at least an intent classification unit 406 configured to: receive the trained first learning network, the trained second learning network, and data of a dialog; determine, based on the received data of the dialog and using the trained first and second learning networks, probability-related parameters of the dialog data with respect to the respective known intent categories; and derive an intent classification result for the dialog based on the probability-related parameters. The intent classification result may be open, for example indicating whether the dialog belongs to a known intent category or the unknown intent category, which known intent category it belongs to, and so on. The first and second learning networks may be trained elsewhere and fed to the intent classification unit 406. In some embodiments, the first learning network may be configured to extract a first feature vector of a sentence based on the data of the dialog, and the second learning network may be configured to determine the probability-related parameters for the respective known intent categories based on a second feature vector obtained after the modular-length adjustment of the first feature vector. Accordingly, the intent classification unit 406 may be configured to adjust the modular length of the first feature vector based on the minimum difference of the first feature vector with respect to the representative feature vectors of the known intent categories, such that the smaller the minimum difference, the larger the modular length; the modular length of the second feature vector thus obtained can represent how close the corresponding dialog data is to the known intent categories, so that the second feature vector incorporates this closeness information. This characteristic facilitates learning deep data relationships that are close within a class (close within a known intent category) and far apart between classes (in particular, far apart between known intent categories and the unknown intent category), so that the subsequent second learning network can obtain more easily distinguishable probability-related parameters. In some embodiments, the intent classification unit 406 may be configured to compare the determined probability-related parameters of the known intent categories with a threshold, determine that the dialog belongs to the unknown intent category when all the probability-related parameters are smaller than the threshold, and otherwise determine that the dialog belongs to the known intent category corresponding to the largest probability-related parameter.
In some embodiments, the dialog intent classification system 400 may further include a learning network construction unit 401 configured to construct the first learning network and the second learning network. For example, the first learning network may be constructed on the basis of a pre-trained language model; the construction of the first and second learning networks is described in detail in other embodiments of the disclosure and is not repeated here. Pre-training samples may be obtained from a pre-training sample database 403, and the pre-training of the first learning network may be completed by the pre-training unit 402. The pre-trained first learning network and the constructed second learning network may be transmitted to the training unit 404, which completes the training of the first and second learning networks using training samples obtained from the training sample database 405. After the pre-training is completed, part of the parameters of the first learning network can be kept in the pre-trained state; in the subsequent training step performed jointly with the second learning network, only the remaining parameters of the first learning network and the parameters of the second learning network are determined and adjusted (fine-tuned), which significantly reduces the amount of computation during training while preserving the training effect.
The above units may be respectively configured to perform corresponding pre-training processing, intention classification processing, and the like, which are described in other embodiments of the present disclosure, and are not described herein again.
Fig. 5 illustrates a block diagram of a dialog intention classification apparatus 500 according to an embodiment of the present disclosure. As shown in fig. 5, the network training device 501 is constructed as a separate device from the dialogue intention classification device 500, the former being configured to construct and train first and second learning networks, which can be fed to the dialogue intention classification device 500 for use thereof via the communication interface 503. This is merely by way of example and the two devices may also be integrated in one and the same device.
The dialogue data collection device 502 may be configured to collect data of a dialogue, which may include, for example, a microphone, an analog-to-digital converter, a filter, etc., and the data of the dialogue collected thereby may be transmitted to the dialogue intention classification device 500 via the communication interface 503. In some embodiments, the dialogue data collection apparatus 502 may be integrated with the dialogue intent classification apparatus 500 in a same device, such as but not limited to a smart wearable device, a smart service robot, a smart phone, and so on.
The dialog intention classification apparatus 500 may be a special purpose computer or a general purpose computer. For example, the dialog intent classification device 500 may be a customized computer to perform dialog data collection and dialog data processing tasks. As shown in fig. 5, the dialog intention classification apparatus 500 may include a communication interface 503, a processor 504, a memory 505, a storage 506, and a display 507.
The communication interface 503 may include a network adapter, cable connector, serial connector, USB connector, parallel connector, high speed data transmission adapter (such as fiber optic, USB 3.0, thunderbolt interface, etc.), wireless network adapter (such as WiFi adapter), telecommunications (3G, 4G/LTE, 5G, etc.) adapter, and the like. The dialog intention classification device 500 may be connected to other components, such as, but not limited to, a network training device 501, a dialog data collection device 502, etc., through a communication interface 503. In some embodiments, the communication interface 503 receives session data from the session data acquisition device 502. In some embodiments, the communication interface 503 may also receive, for example, trained first and second learning networks from the network training device 501.
Processor 504 may be a processing device including one or more general-purpose processing devices such as a microprocessor, central processing unit (CPU), graphics processing unit (GPU), etc. More specifically, the processor may be a complex instruction set computing (CISC) microprocessor, a reduced instruction set computing (RISC) microprocessor, a very long instruction word (VLIW) microprocessor, a processor running other instruction sets, or a processor running a combination of instruction sets. The processor may also be one or more special-purpose processing devices such as an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), a system on a chip (SoC), or the like. Processor 504 may be communicatively coupled to memory 505 and configured to execute computer-executable instructions stored thereon to perform dialog intent classification processing procedures such as those described in various embodiments of the present disclosure.
The memory 505/storage 506 may be a non-transitory computer-readable medium, such as Read Only Memory (ROM), Random Access Memory (RAM), phase change random access memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), Electrically Erasable Programmable Read Only Memory (EEPROM), other types of Random Access Memory (RAM), flash disk or other forms of flash memory, cache, registers, static memory, compact disc read only memory (CD-ROM), Digital Versatile Disc (DVD) or other optical storage, tape cassettes or other magnetic storage devices, or any other possible non-transitory medium that may be used to store information or instructions that may be accessed by a computer device, and so forth.
In some embodiments, the storage 506 may store the trained first and second learning networks and data (such as raw dialogue data, first feature vectors, intermediate parameters that modulo-length adjust the first feature vectors, second feature vectors, respective probability-related parameters for respective known intent classes, comparison thresholds for respective probability-related parameters, and so forth), data received, used, or generated while executing a computer program, and so forth. In some embodiments, memory 505 may store computer-executable instructions, such as more than one dialog intention classification program.
In some embodiments, the processor 504 may be further configured to cause a corresponding output device to provide a corresponding service based on the intent classification result of the dialog. For example, the output device may be the display 507 shown in Fig. 5, but is not limited thereto, and may also be a speaker, a motion driving device, or the like. For example, in the case where the intent classification result of the dialog is "eat", the processor 504 may be configured to cause the display 507 to display a list of nearby restaurants for the user to select from, and also to cause the display 507 to display a navigation map to the selected restaurant and to play a navigation voice signal via a speaker. As another example, in the case where the intent classification result of the dialog is "sick", the processor 504 may be configured to open an Internet consultation platform, display the interface of the consultation platform on the display 507, and cause a speaker (not shown) to play consultation voice prompts such as "Please describe the painful part", "Is there a designated doctor?", "Please upload the examination report", and the like.
In some embodiments, display 507 may include a Liquid Crystal Display (LCD), a light emitting diode display (LED), a plasma display, or any other type of display, and provides a Graphical User Interface (GUI) presented on the display for user input and image/data display. The display may comprise many different types of materials, such as plastic or glass, and may be touch sensitive to receive commands from a user. For example, the display may comprise a substantially rigid touch sensitive material (such as Gorilla glass (TM)) or a substantially flexible touch sensitive material (such as Willow glass (TM)).
According to the present disclosure, the network training device 501 may have the same or similar structure as the dialogue intention classification device 500. In some embodiments, the network training device 501 includes a processor and other components configured to train the first and second learning networks.
Comparative experiments were performed on the general model and the dialog intent classification methods according to various embodiments of the present disclosure using three public standard datasets, namely the StackOverflow technical question dataset, the FewRel relation extraction dataset, and the OOS dialog dataset provided by Clinc. StackOverflow contains 20 technical question intent labels, FewRel contains 80 relation intents, and the OOS dataset covers 150 intents across 20 domains and includes 1200 out-of-domain samples.
The macro F1 value is used as the evaluation index for each model. Each dataset is divided into disjoint training, validation, and test sets; the training and validation sets contain only known-category intents, the test set contains both known-category intents and open intents, and the validation set is used to select the best parameters. The general model adopts a conventional supervised classification model with a cross-entropy loss function, while the dialog intent classification methods according to various embodiments of the present disclosure include five variant classification methods, referred to as dialog intent classification methods 1 through 5. Specifically, dialog intent classification method 1 uses only the cross-entropy loss function on top of the classification model shown in Fig. 3, while dialog intent classification methods 2 through 5 are jointly trained, on top of the classification model shown in Fig. 3, with the cross-entropy loss function and a metric loss function according to various embodiments of the present disclosure: the metric loss function used in method 2 is the loss function of formula (1), that used in method 3 is the loss function of formula (2), that used in method 4 is the loss function of formula (3), and that used in method 5 is the combination of the loss functions of formulas (1) through (3).
The results of the comparative experiment are shown in table 1, where dialog intent classification methods 1-5 are simply referred to as example methods 1-5.
TABLE 1 Comparative experimental results of the open intent classification methods
[Table 1 is provided as an image in the original publication (Figure BDA0002644506260000161): macro F1 scores of the general model and example methods 1 to 5 on the StackOverflow, FewRel, and OOS datasets; the numerical values are not reproduced here.]
Compared with the general model, the open intent classification methods of the embodiments of the present disclosure significantly improve classification performance under different proportions of known-class intents, exhibit better robustness, and maintain stable performance across datasets from different scenarios. In particular, joint training with the cross-entropy loss function and a metric loss function according to various embodiments of the present disclosure significantly improves the classification performance and robustness of the same classification model; furthermore, using a combination of several independent metric loss functions together with the cross-entropy loss function for joint training improves classification performance and robustness further still, compared with joint training using a single metric loss function and the cross-entropy loss function.
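As an illustration of such joint training, the following minimal PyTorch sketch combines the cross-entropy classification loss with a center-margin metric loss in the spirit of formula (1); the hinge form, the margin value, and the weighting coefficient alpha are assumptions made for illustration rather than the definitive formulation of the disclosure.

```python
import torch
import torch.nn.functional as F

def joint_loss(logits, features, centers, labels, margin=1.0, alpha=1.0):
    """Cross-entropy classification loss plus a center-margin metric loss.

    logits:   (N, K) scores from the second learning network (classifier)
    features: (N, D) first feature vectors z_i from the first learning network
    centers:  (K, D) cluster-center vectors c_k, one per known intent class
    labels:   (N,)   known intent class indices y_i
    """
    ce = F.cross_entropy(logits, labels)                     # second loss function
    dists = torch.cdist(features, centers)                   # (N, K) Euclidean distances
    own = dists.gather(1, labels.unsqueeze(1)).squeeze(1)    # distance to own class center
    other = dists.scatter(1, labels.unsqueeze(1), float("inf")).min(dim=1).values
    metric = F.relu(own - other + margin).mean()             # formula (1)-style margin loss
    return ce + alpha * metric
```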
Moreover, although exemplary embodiments have been described herein, the scope thereof includes any and all embodiments having equivalent elements, modifications, omissions, combinations (e.g., of aspects across various embodiments), adaptations, or alterations based on the present disclosure. The elements of the claims are to be interpreted broadly based on the language employed in the claims and are not limited to the examples described in the present specification or during the prosecution of the application, which examples are to be construed as non-exclusive. It is intended, therefore, that the specification and examples be considered as exemplary only, with the true scope and spirit being indicated by the following claims and their full scope of equivalents.
The order of the various steps in the present disclosure is merely exemplary and not limiting. The order in which the steps are performed may be adjusted without affecting the implementation of the present disclosure, provided that the logical relationships between the steps are preserved, and the various embodiments obtained after such adjustment still fall within the scope of the present disclosure.
The above description is intended to be illustrative, and not restrictive. For example, the above-described examples (or one or more versions thereof) may be used in combination with each other. Other embodiments may be devised, for example, by those of ordinary skill in the art upon reading the above description. In addition, in the foregoing detailed description, various features may be grouped together to streamline the disclosure. This should not be interpreted as intending that an unclaimed disclosed feature is essential to any claim. Rather, inventive subject matter may lie in less than all features of a particular disclosed embodiment. Thus, the following claims are hereby incorporated into the detailed description as examples or embodiments, with each claim standing on its own as a separate embodiment, and it is contemplated that these embodiments may be combined with each other in various combinations or permutations. The scope of the invention should be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.

Claims (17)

1. A dialog intention classification method, comprising:
receiving data of a conversation;
extracting, by a processor, a first feature vector of a sentence using a trained first learning network based on the received data of the conversation;
adjusting, by the processor, a modular length of the first feature vector based on a smallest difference of the first feature vector with respect to a representative feature vector of each known intent category such that the smaller the smallest difference, the larger the modular length, to obtain a second feature vector;
determining, by the processor, based on the second feature vector, respective probability-related parameters for respective known intent classes using a trained second learning network configured to perceive the modular length of the second feature vector, wherein the first learning network and the second learning network are jointly trained using a first loss function and a second loss function characterizing a classification loss, the first loss function being defined such that the difference of the first feature vector from the representative feature vector of the known intent class to which it belongs is smaller than a representative value of the respective differences from the representative feature vectors of other known intent classes;
and comparing each determined probability related parameter of each known intention category with a threshold value by the processor, determining that the dialog belongs to the unknown intention category under the condition that each probability related parameter is smaller than the threshold value, and determining that the dialog belongs to the known intention category corresponding to the maximum probability related parameter under the condition that each probability related parameter is higher than or equal to the threshold value.
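A minimal PyTorch sketch of the classification steps of claim 1 is given below for illustration; the exact scaling function used to adjust the modular length, the temperature tau, and the threshold value are assumptions, since the claim does not fix their concrete forms.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def classify(z, centers, weights, tau=8.0, threshold=0.5):
    """z: (D,) first feature vector; centers: (K, D) class center vectors;
    weights: (K, D) cosine-classifier weights. Returns a known class index,
    or -1 when the dialog is assigned to the unknown intent category."""
    d_min = torch.norm(z.unsqueeze(0) - centers, dim=1).min()  # smallest difference to any center
    scale = tau / (1.0 + d_min)                                # smaller difference -> larger modular length
    z2 = scale * F.normalize(z, dim=0)                         # second feature vector
    logits = F.normalize(weights, dim=1) @ z2                  # length-aware cosine classifier
    probs = torch.softmax(logits, dim=0)                       # probability-related parameters
    if probs.max() < threshold:
        return -1                                              # all below threshold: unknown intent
    return int(probs.argmax())                                 # otherwise: best-scoring known intent
```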
2. The dialog intent classification method according to claim 1, wherein extracting a first feature vector of a sentence using a trained first learning network based on the received data of the dialog further comprises:
extracting respective third feature vectors of the respective tokens, taking into account their context, using a pre-trained language model, based on the received data of the dialog;
determining the first feature vector of the sentence through a synthesis operation based on the extracted respective third feature vectors.
3. The dialog intention classification method according to claim 2, characterized in that determining the first feature vector of the sentence by a synthesis operation based on the extracted respective third feature vectors further comprises:
performing pooling operation on each extracted third feature vector to determine a fourth feature vector of the sentence; and
and processing the fourth feature vector of the sentence by using a dense network layer to obtain the first feature vector of the sentence.
4. The dialog intent classification method according to claim 3, characterized in that the pre-trained language model comprises a BERT language model comprising 12 to 24 transformer layers, and only the last 1 to 24 transformer layers and the dense network layer are trained during the joint training of the first learning network and the second learning network.
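For illustration, a minimal sketch of the sentence encoder described in claims 2 to 4 follows, assuming the HuggingFace transformers library; the model name, feature dimension, and number of unfrozen layers are illustrative choices, not values mandated by the claims.

```python
import torch.nn as nn
from transformers import BertModel

class SentenceEncoder(nn.Module):
    """First learning network: BERT token features -> pooling -> dense layer."""

    def __init__(self, feat_dim: int = 128, trainable_layers: int = 1):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")  # 12 transformer layers
        self.dense = nn.Linear(self.bert.config.hidden_size, feat_dim)
        # Freeze all BERT parameters, then unfreeze only the last transformer layer(s).
        for p in self.bert.parameters():
            p.requires_grad = False
        for layer in self.bert.encoder.layer[-trainable_layers:]:
            for p in layer.parameters():
                p.requires_grad = True

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        token_vecs = out.last_hidden_state                    # third feature vectors (per token)
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (token_vecs * mask).sum(1) / mask.sum(1)     # fourth feature vector (mean pooling)
        return self.dense(pooled)                             # first feature vector of the sentence
```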
5. The dialog intention classification method according to claim 1, characterized in that a representative feature vector of a known intention class is a cluster center vector of the known intention class, the difference of the first feature vector from the representative feature vector comprises a distance of the first feature vector from the representative feature vector, and the second learning network comprises a cosine classifier.
6. The dialog intent classification method according to claim 5, wherein determining, based on the second feature vector, respective probability-related parameters for respective known intent classes using a trained second learning network further comprises: normalizing the second feature vector and normalizing the weight of the cosine classifier; based on the normalized second feature vector, determining each probability-related parameter of each known intention category by using the cosine classifier after weight normalization.
7. The dialog intention classification method according to claim 5, characterized in that the second loss function comprises a cross-entropy loss function, the first loss function being defined with the following formula (1):
$$\mathrm{Loss}_1=\frac{1}{N}\sum_{i=1}^{N}\max\Bigl(\lVert z_i-c_{y_i}\rVert_2-\min_{k\neq y_i}\lVert z_i-c_k\rVert_2+m1,\;0\Bigr)\qquad(1)$$
where Loss1 represents the first loss function or a first component thereof, ‖·‖2 represents the Euclidean distance, i represents the sample index, N represents the total number of samples, K represents the number of known intent classes, k represents the index of a known intent class, y_i represents the known intent class label of the i-th sample with y_i ∈ {1, 2, …, K}, m1 represents the set loss boundary, z_i represents the first feature vector extracted for the i-th sample, and c_k represents the cluster center vector of the k-th known intent class.
8. The dialog intention classification method according to claim 7, characterized in that the first loss function is defined by combining the respective components Loss1, Loss2 and Loss3 of the first loss function, which are defined by formula (1), the following formula (2), and formula (3), respectively:
$$\mathrm{Loss}_2=-\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{\,s1\,(\cos\theta_{y_i}-m2)}}{e^{\,s1\,(\cos\theta_{y_i}-m2)}+\sum_{j\neq y_i}e^{\,s1\,\cos\theta_j}}\qquad(2)$$
where Loss2 represents the second component of the first loss function, s1 is a scaling factor, θ_{y_i} is the angle between the first feature vector z_i and the weight vector W_{y_i} of the known intent class identified by the label y_i, j denotes the index of a known intent class, θ_j is the angle between the first feature vector z_i and the weight vector W_j of class j, and m2 is a cosine distance margin constant;
$$\mathrm{Loss}_3=-\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{\,s2\,\cos(\theta_{y_i}+m3)}}{e^{\,s2\,\cos(\theta_{y_i}+m3)}+\sum_{j\neq y_i}e^{\,s2\,\cos\theta_j}}\qquad(3)$$
where Loss3 represents the third component of the first loss function, s2 is a scaling factor, and m3 is an angular distance margin constant.
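For illustration, the following minimal PyTorch sketch implements the second and third metric-loss components, assuming the CosFace-style form of formula (2) and the ArcFace-style form of formula (3) given above; the default scale and margin values are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def loss2_cosine_margin(features, weights, labels, s1=30.0, m2=0.35):
    # cos(theta_j) between normalized feature vectors z_i and class weight vectors W_j
    cos = F.normalize(features, dim=1) @ F.normalize(weights, dim=1).t()   # (N, K)
    target = F.one_hot(labels, num_classes=cos.size(1)).bool()
    logits = s1 * torch.where(target, cos - m2, cos)                       # formula (2)
    return F.cross_entropy(logits, labels)

def loss3_angular_margin(features, weights, labels, s2=30.0, m3=0.5):
    cos = F.normalize(features, dim=1) @ F.normalize(weights, dim=1).t()
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    target = F.one_hot(labels, num_classes=cos.size(1)).bool()
    logits = s2 * torch.where(target, torch.cos(theta + m3), cos)          # formula (3)
    return F.cross_entropy(logits, labels)
```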
9. A dialog intention classifying apparatus, comprising:
an interface configured to receive data of a conversation; and
a processor configured to:
extracting a first feature vector of a sentence by using the trained first learning network based on the received data of the dialogue;
based on the minimum difference of the first feature vector relative to the representative feature vectors of the known intention classes, adjusting the modular length of the first feature vector so that the smaller the minimum difference is, the larger the modular length is, to obtain a second feature vector;
determining, based on the second feature vector, respective probability-related parameters for respective known intent classes using a trained second learning network configured to perceive a modular length of the second feature vector, wherein the first learning network and the second learning network are jointly trained using a first loss function and a second loss function characterizing a classification loss, the first loss function being defined such that a difference of the first feature vector from a representative feature vector of a known intent class to which the first feature vector belongs is smaller than a representative value of respective differences from representative feature vectors of other known intent classes;
and comparing each determined probability related parameter of each known intention category with a threshold, determining that the dialog belongs to the unknown intention category under the condition that each probability related parameter is smaller than the threshold, and determining that the dialog belongs to the known intention category corresponding to the maximum probability related parameter under the condition that each probability related parameter is more than or equal to the threshold.
10. The dialog intent classification device according to claim 9, wherein extracting the first feature vector of the sentence using the trained first learning network based on the received data of the dialog further comprises:
extracting respective third feature vectors of the respective tokens, taking into account their context, using a pre-trained language model, based on the received data of the dialog;
determining the first feature vector of the sentence through a synthesis operation based on the extracted respective third feature vectors.
11. The dialog intention classification device according to claim 10, wherein determining the first feature vector of the sentence through a synthesis operation based on the extracted respective third feature vectors further comprises:
performing pooling operation on each extracted third feature vector to determine a fourth feature vector of the sentence; and
and processing the fourth feature vector of the sentence by using a dense network layer to obtain the first feature vector of the sentence.
12. The dialog intent classification device according to claim 11, characterized in that the pre-trained language model comprises a BERT language model comprising 12 to 24 transformer layers, and only the last 1 to 24 transformer layers and the dense network layer are trained during the joint training of the first learning network and the second learning network.
13. The dialog intent classification device according to claim 9, wherein a representative feature vector of a known intent class is a cluster center vector of the known intent class, the difference between the first feature vector and the representative feature vector comprises a distance between the first feature vector and the representative feature vector, and the second learning network comprises a cosine classifier.
14. The dialog intent classification device according to claim 13, wherein determining, based on the second feature vector, respective probability-related parameters for respective known intent classes using a trained second learning network further comprises: normalizing the second feature vector and normalizing the weight of the cosine classifier; based on the normalized second feature vector, determining each probability-related parameter of each known intention category by using the cosine classifier after weight normalization.
15. The dialog intent classification device according to claim 13, characterized in that the second loss function comprises a cross-entropy loss function, the first loss function being defined with the following formula (1):
$$\mathrm{Loss}_1=\frac{1}{N}\sum_{i=1}^{N}\max\Bigl(\lVert z_i-c_{y_i}\rVert_2-\min_{k\neq y_i}\lVert z_i-c_k\rVert_2+m1,\;0\Bigr)\qquad(1)$$
where Loss1 represents the first loss function or a first component thereof, ‖·‖2 represents the Euclidean distance, i represents the sample index, N represents the total number of samples, K represents the number of known intent classes, k represents the index of a known intent class, y_i represents the known intent class label of the i-th sample with y_i ∈ {1, 2, …, K}, m1 represents the set loss boundary, z_i represents the first feature vector extracted for the i-th sample, and c_k represents the cluster center vector of the k-th known intent class.
16. The dialog intention classification device according to claim 15, characterized in that the first loss function is defined by combining the respective components Loss1, Loss2 and Loss3 of the first loss function, which are defined by formula (1), the following formula (2), and formula (3), respectively:
$$\mathrm{Loss}_2=-\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{\,s1\,(\cos\theta_{y_i}-m2)}}{e^{\,s1\,(\cos\theta_{y_i}-m2)}+\sum_{j\neq y_i}e^{\,s1\,\cos\theta_j}}\qquad(2)$$
where Loss2 represents the second component of the first loss function, s1 is a scaling factor, θ_{y_i} is the angle between the first feature vector z_i and the weight vector W_{y_i} of the known intent class identified by the label y_i, j denotes the index of a known intent class, θ_j is the angle between the first feature vector z_i and the weight vector W_j of class j, and m2 is a cosine distance margin constant;
$$\mathrm{Loss}_3=-\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{\,s2\,\cos(\theta_{y_i}+m3)}}{e^{\,s2\,\cos(\theta_{y_i}+m3)}+\sum_{j\neq y_i}e^{\,s2\,\cos\theta_j}}\qquad(3)$$
where Loss3 represents the third component of the first loss function, s2 is a scaling factor, and m3 is an angular distance margin constant.
17. A non-transitory computer storage medium having stored thereon executable instructions that, when executed by a processor, implement a dialog intent classification method, comprising:
extracting a first feature vector of a sentence by using the trained first learning network based on the received data of the dialogue;
based on the minimum difference of the first feature vector relative to the representative feature vectors of the known intention classes, adjusting the modular length of the first feature vector so that the smaller the minimum difference is, the larger the modular length is, to obtain a second feature vector;
determining, based on the second feature vector, respective probability-related parameters for respective known intent classes using a trained second learning network configured to perceive a modular length of the second feature vector, wherein the first learning network and the second learning network are jointly trained using a first loss function and a second loss function characterizing a classification loss, the first loss function being defined such that a difference of the first feature vector from a representative feature vector of a known intent class to which the first feature vector belongs is smaller than a representative value of respective differences from representative feature vectors of other known intent classes;
and comparing each determined probability related parameter of each known intention category with a threshold, determining that the dialog belongs to the unknown intention category under the condition that each probability related parameter is smaller than the threshold, and determining that the dialog belongs to the known intention category corresponding to the maximum probability related parameter under the condition that each probability related parameter is more than or equal to the threshold.
CN202010850324.XA 2020-08-21 2020-08-21 Dialog intention classification method, apparatus and non-volatile computer storage medium Pending CN114077666A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010850324.XA CN114077666A (en) 2020-08-21 2020-08-21 Dialog intention classification method, apparatus and non-volatile computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010850324.XA CN114077666A (en) 2020-08-21 2020-08-21 Dialog intention classification method, apparatus and non-volatile computer storage medium

Publications (1)

Publication Number Publication Date
CN114077666A true CN114077666A (en) 2022-02-22

Family

ID=80282542

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010850324.XA Pending CN114077666A (en) 2020-08-21 2020-08-21 Dialog intention classification method, apparatus and non-volatile computer storage medium

Country Status (1)

Country Link
CN (1) CN114077666A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114564964A (en) * 2022-02-24 2022-05-31 杭州中软安人网络通信股份有限公司 Unknown intention detection method based on k-nearest neighbor comparison learning
CN117235629A (en) * 2023-11-15 2023-12-15 中邮消费金融有限公司 Intention recognition method, system and computer equipment based on knowledge domain detection
CN117235629B (en) * 2023-11-15 2024-04-12 中邮消费金融有限公司 Intention recognition method, system and computer equipment based on knowledge domain detection


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination