CN111159332A - Text multi-intention identification method based on bert - Google Patents
Text multi-intention identification method based on bert

- Publication number: CN111159332A (application CN201911219732.9A)
- Authority: CN (China)
- Prior art keywords: text, intention, bert, vector, recognition method
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications

(G: Physics; G06: Computing, calculating or counting; G06F: Electric digital data processing)

- G06F16/3329: Natural language query formulation or dialogue systems
- G06F16/3344: Query execution using natural language analysis
- G06F18/214: Generating training patterns; bootstrap methods, e.g. bagging or boosting
Abstract
The invention discloses a bert-based text multi-intent recognition method comprising the following steps. S1: acquire the text to be recognized and perform duplicate removal and stop-word deletion to obtain a training corpus. S2: obtain sentence vectors. S3: train a sentence-vector model for intent recognition with a lightgbm model to obtain the intent categories and output the main intent of every text. S4: select the standard vectors. S5: calculate the Mahalanobis distance to the standard vectors and output the sub-category intents.
Description
Technical Field
The invention relates to the field of medical aesthetics and to the technical field of natural language processing, and in particular to a bert-based text multi-intent recognition method.
Background
The reply mechanism of a medical-aesthetics marketing robot is to answer according to the visitor's question and the item being consulted. In practice, the text is often ambiguous, or genuinely carries multiple intents, so a single category cannot be selected with accuracy. On the other hand, building intelligent dialogue systems creates a real need for multi-intent recognition of text, since a comprehensive reply must be tailored to the different intents a text contains. Multi-intent recognition of text is therefore a problem that urgently needs to be solved.
Text multi-intent recognition is generally addressed by manual labeling or machine labeling. In manual labeling, annotators read the corpus item by item, work out the several meanings each item carries, and label it accordingly. Manual labeling is robust and comparatively accurate, but it is inefficient and consumes labor and time. When several annotators work in parallel, differences in their understanding also introduce systematic errors. Machine labeling has its own problems: 1. a classical machine-learning model can only output a single high-confidence optimal solution and struggles to output the suboptimal solutions (the other intents), so it suits only single-intent recognition; 2. a deep-learning model can be made to label multiple intents, but it requires a large amount of balanced multi-intent data for training, and its accuracy is hard to guarantee.
Disclosure of Invention
To solve these problems in the prior art, the method improves on existing single-intent text recognition models: it recognizes the intent with a bert model and a lightgbm model, then performs a secondary match on text distances using the Mahalanobis distance, thereby outputting the multiple intents of a text. A bert-based text multi-intent recognition method is provided.
The method comprises the following specific steps:
a text multi-intention recognition method based on bert comprises the following steps:
s1: acquiring a text to be recognized, and performing duplicate removal and stop word deletion to obtain a training corpus;
s2: obtaining a sentence vector;
s3: training a sentence vector model for identifying the intention by using a lightgbm model to obtain an intention category and outputting all main intents;
s4: selecting a standard vector;
s5: and calculating the mahalanobis distance of the standard vector and outputting the subcategory intention.
Preferably, step S2 further comprises: setting up bert as an embedding service (bert-as-service), inputting the training corpus into bert-as-service, and obtaining the sentence vector of each sentence.
Preferably, step S4 further comprises: for each main-intent category, counting the frequency of occurrence of every text in that category, and taking the sentence vector of the most frequent text as the standard vector of the category.
Preferably, step S5 further comprises: calculating the Mahalanobis distance from each text to the standard vectors of all intent categories, obtaining a set of n distance values, and selecting the intent categories corresponding to the k smallest absolute values in the set as the sub-categories of the text, where n is the number of intent categories and k < n.
Preferably, the standard vector can also be obtained by averaging the sentence vectors or by an empirical method.
Preferably, step S1 is implemented using ETL.
Compared with the prior art, the invention has the following advantages:
1. By exploiting the efficiency and accuracy of ensemble learning, and improving on it, a new high-precision text multi-intent recognition method is provided.
2. The most likely sub-intents contained in a text are output while the main intent is guaranteed to be correct.
3. When selecting the marker vector of an intent category, the sentence vector of the most frequent text in that category is used as the standard vector, which gives higher accuracy in practical applications.
4. Sentence vectors are obtained with bert, which greatly improves the semantic and generalization capability of the predictions.
Drawings
FIG. 1 is a flow chart of the text multi-intent recognition method based on bert in the invention.
Detailed Description
Fig. 1 is a flow chart of the bert-based text multi-intent recognition method. The method improves on existing single-intent text recognition models: it performs intent recognition with the bert and lightgbm models and performs a secondary match on text distances using the Mahalanobis distance, thereby outputting the multiple intents of a text. The specific steps are as follows.
The core algorithms involved are lightgbm, bert and the Mahalanobis distance.
LightGBM is a fast, distributed, high-performance gradient-boosting framework based on decision-tree algorithms. Boosting is a framework algorithm that draws subsets from the sample set and trains a series of base classifiers on them with a weak classification algorithm; tools built on boosting algorithms include gbdt, adaboost and xgboost. Microsoft's DMTK (Distributed Machine Learning Toolkit) team open-sourced LightGBM on GitHub with performance exceeding other boosting tools. Its main techniques are a histogram-based decision-tree algorithm, a leaf-wise growth strategy with a depth limit, histogram-difference acceleration, direct support for categorical features, cache hit-rate optimization, sparse-feature optimization, and multithreaded optimization.
The Mahalanobis distance was proposed by the Indian statistician P. C. Mahalanobis; it compares the similarity of unknown samples by computing a covariance-weighted distance between them. Its advantage over the Euclidean distance is scale independence: the Mahalanobis distance between two points does not depend on the measurement units of the raw data, and it excludes interference from correlations between variables. It therefore avoids the interference caused by correlations between dimensions once texts are converted to vectors. The Mahalanobis distance is computed as follows:
Given a vector space {X1, X2, ..., Xn} with covariance matrix S, the Mahalanobis distance from Xi to Xj is D(Xi, Xj) = sqrt((Xi - Xj)^T S^(-1) (Xi - Xj)).
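As an illustrative sketch (not part of the patent, names and data are stand-ins), this distance can be computed directly with numpy:

```python
import numpy as np

def mahalanobis(x_i, x_j, cov):
    """Mahalanobis distance sqrt((xi - xj)^T S^-1 (xi - xj))."""
    diff = x_i - x_j
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

# With an identity covariance matrix the Mahalanobis distance
# reduces to the ordinary Euclidean distance.
a = np.array([1.0, 0.0])
b = np.array([0.0, 0.0])
print(mahalanobis(a, b, np.eye(2)))  # 1.0
```

With a non-identity covariance matrix, correlated dimensions are down-weighted, which is exactly the property the method relies on for sentence vectors.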
The BERT model released by the Google AI team caused an enormous stir in the NLP community and is regarded as a milestone in the field. BERT showed surprising performance on the machine reading-comprehension benchmark SQuAD 1.1, surpassing human performance on both metrics, and achieved the best results on 11 different NLP tasks, including pushing the GLUE benchmark to 80.4% (a 7.6% absolute improvement) and MultiNLI accuracy to 86.7% (a 5.6% absolute improvement). BERT's innovation is applying a bidirectional Transformer to language modeling; earlier approaches either read the text sequence left to right, or combined left-to-right and right-to-left training. Experiments show that a bidirectionally trained language model understands context more deeply than a unidirectional one.
A text multi-intention recognition method based on bert comprises the following steps:
1) Perform ETL on the training corpus, i.e. remove duplicates and delete stop words.
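A minimal sketch of this ETL step; the stop-word list here is a tiny illustrative stand-in (a real system would use a full Chinese stop-word list):

```python
# Toy stand-in for a real Chinese stop-word list.
STOP_WORDS = {"的", "了", "吗", "呢"}

def clean_corpus(texts):
    seen, cleaned = set(), []
    for text in texts:
        if text in seen:          # duplicate removal
            continue
        seen.add(text)
        # stop-word deletion, character by character for this sketch
        cleaned.append("".join(ch for ch in text if ch not in STOP_WORDS))
    return cleaned

print(clean_corpus(["价格贵吗", "价格贵吗", "效果好的"]))  # ['价格贵', '效果好']
```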
2) Set up bert as an embedding service (bert-as-service), input the training corpus into bert-as-service, and obtain the sentence vector of each sentence.
3) Using the sentence vectors output in step 2) as training data, train a high-quality single-intent output model with the lightgbm model and output the main intent of every training text.
4) For each main-intent category, count the frequency of occurrence of every text in that category and take the sentence vector of the most frequent text as the standard vector of the category. Alternatively, the standard vector may be determined in other ways, such as averaging the sentence vectors or using an empirical method.
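A small sketch of the frequency-based standard-vector selection, with toy texts and one-hot stand-ins for sentence vectors (all names illustrative, not from the patent):

```python
from collections import Counter
import numpy as np

def standard_vectors(texts, intents, vectors):
    """For each main intent, pick the sentence vector of its most frequent text."""
    result = {}
    for intent in set(intents):
        idx = [i for i, it in enumerate(intents) if it == intent]
        counts = Counter(texts[i] for i in idx)
        most_common_text = counts.most_common(1)[0][0]
        # first index whose text matches the most frequent text of this category
        i = next(i for i in idx if texts[i] == most_common_text)
        result[intent] = vectors[i]
    return result

texts = ["多少钱", "多少钱", "疼不疼", "价格", "疼不疼", "疼不疼"]
intents = ["price", "price", "pain", "price", "pain", "pain"]
vecs = np.eye(6)  # toy one-hot "sentence vectors", one per text
std = standard_vectors(texts, intents, vecs)
```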
5) Calculate the Mahalanobis distance from each text to the standard vectors of all intent categories, obtaining a set of n distance values (n is the number of intent categories), and select the intent categories (excluding the main intent) corresponding to the k (k < n) smallest absolute values in the set as the sub-categories of the text.
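The sub-intent selection above can be sketched as follows; the standard vectors, covariance matrix and category names are illustrative stand-ins:

```python
import numpy as np

def sub_intents(vec, standard, cov, k, main_intent):
    """k intent categories closest to `vec` by Mahalanobis distance, main intent excluded."""
    inv = np.linalg.inv(cov)
    dists = {}
    for intent, std in standard.items():
        d = vec - std
        dists[intent] = float(np.sqrt(d @ inv @ d))
    ranked = sorted(dists, key=dists.get)
    # drop the already-known main intent, keep the k nearest remaining categories
    return [it for it in ranked if it != main_intent][:k]

standard = {"price": np.array([1.0, 0.0]),
            "pain":  np.array([0.0, 1.0]),
            "time":  np.array([5.0, 5.0])}
v = np.array([0.9, 0.2])
print(sub_intents(v, standard, np.eye(2), k=1, main_intent="price"))  # ['pain']
```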
The above embodiments may be further combined or substituted; they describe only preferred embodiments of the invention and do not limit its concept and scope. Changes and modifications made to the technical solution by those skilled in the art without departing from the design idea of the invention all belong to its protection scope.
Claims (6)
1. A text multi-intention recognition method based on bert is characterized by comprising the following steps:
s1: acquiring a text to be recognized, and performing duplicate removal and stop word deletion to obtain a training corpus;
s2: obtaining a sentence vector;
s3: training a sentence vector model for identifying the intention by using a lightgbm model to obtain an intention category and outputting all main intents;
s4: selecting a standard vector;
s5: and calculating the mahalanobis distance of the standard vector and outputting the subcategory intention.
2. The bert-based text multi-intent recognition method according to claim 1, wherein step S2 further comprises: setting up bert as an embedding service (bert-as-service), inputting the training corpus into bert-as-service, and obtaining the sentence vector of each sentence.
3. The bert-based text multi-intent recognition method according to claim 1, wherein step S4 further comprises: for each main-intent category, counting the frequency of occurrence of every text in that category, and taking the sentence vector of the most frequent text as the standard vector of the category.
4. The bert-based text multi-intent recognition method according to claim 1, wherein step S5 further comprises: calculating the Mahalanobis distance from each text to the standard vectors of all intent categories to obtain a set of n distance values, and selecting the intent categories corresponding to the k smallest absolute values in the set as the sub-categories of the text, where n is the number of intent categories and k < n.
5. The method according to claim 1 or 3, wherein the standard vector is obtained by averaging the sentence vectors or by an empirical method.
6. The bert-based text multi-intent recognition method according to claim 1, wherein step S1 is implemented using ETL.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201911219732.9A | 2019-12-03 | 2019-12-03 | Text multi-intention identification method based on bert |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN111159332A | 2020-05-15 |
Family

- Family ID: 70556541

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date | Status |
|---|---|---|---|---|
| CN201911219732.9A | Text multi-intention identification method based on bert | 2019-12-03 | 2019-12-03 | Pending |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN111159332A |
Patent Citations (4)

| Publication | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109635105A | 2018-10-29 | 2019-04-16 | 厦门快商通信息技术有限公司 | Chinese text multi-intent recognition method and system |
| CN110147452A | 2019-05-17 | 2019-08-20 | 北京理工大学 | Coarse-grained sentiment analysis method based on a hierarchical BERT neural network |
| CN110287309A | 2019-06-21 | 2019-09-27 | 深圳大学 | Method for rapidly extracting text summaries |
| CN110446063A | 2019-07-26 | 2019-11-12 | 腾讯科技(深圳)有限公司 | Video cover generation method, device and electronic equipment |
Non-Patent Citations (2)

- Mounika Marreddy et al., "Evaluating the Combination of Word Embeddings with Mixture of Experts and Cascading gcForest in Identifying Sentiment Polarity", in Proceedings of KDD 2019 (WISDOM'19): 8th KDD Workshop on Issues of Sentiment Discovery and Opinion Mining.
- 刘娇 et al., "A survey of intent recognition methods in human-machine dialogue systems", 《计算机工程与应用》 (Computer Engineering and Applications).
Cited By (8)

| Publication | Priority date | Publication date | Title |
|---|---|---|---|
| CN112256864A | 2020-09-23 | 2021-01-22 | Multi-intention recognition method and device, electronic equipment and readable storage medium |
| CN112256864B | 2020-09-23 | 2024-05-14 | Multi-intention recognition method, device, electronic equipment and readable storage medium |
| CN112560458A | 2020-12-09 | 2021-03-26 | Article title generation method based on end-to-end deep learning model |
| CN112989800A | 2021-04-30 | 2021-06-18 | Multi-intention identification method and device based on Bert sections and readable storage medium |
| CN113223735A | 2021-04-30 | 2021-08-06 | Triage method, device and equipment based on session representation and storage medium |
| CN113223735B | 2021-04-30 | 2024-08-20 | Diagnosis method, device, equipment and storage medium based on dialogue characterization |
| CN114818665A | 2022-04-22 | 2022-07-29 | Multi-intention identification method and system based on bert + bilstm + crf and xgboost models |
| CN118656494A | 2024-08-16 | 2024-09-17 | Acoustic fine-granularity intention analysis and matching method and system for buyers |
Similar Documents

| Publication | Title |
|---|---|
| CN110209823B | Multi-label text classification method and system |
| CN111159332A | Text multi-intention identification method based on bert |
| CN104699763B | Multi-feature-fusion text similarity measurement system |
| CN109344250B | Rapid structuring method for single-disease diagnosis information based on medical-insurance data |
| CN110096570A | Intent recognition method and device applied to intelligent customer-service robots |
| CN110147451B | Dialogue command understanding method based on knowledge graph |
| CN106776538A | Information extraction method for non-standard-format enterprise documents |
| CN106445921B | Chinese text term extraction method using quadratic mutual information |
| CN106777957B | New method for biomedical multi-argument event extraction on unbalanced datasets |
| Alotaibi et al. | Optical character recognition for Quranic image similarity matching |
| CN109635105A | Chinese text multi-intent recognition method and system |
| CN110096572B | Sample generation method, device and computer-readable medium |
| CN113033183B | Network new-word discovery method and system based on statistics and similarity |
| CN111597328B | New-event topic extraction method |
| CN110046356B | Label-embedded multi-label emotion classification method for microblog text |
| CN105205124A | Semi-supervised text sentiment classification method based on random feature subspaces |
| CN109657039A | Work-history information extraction method based on a double-layer BiLSTM-CRF |
| CN112069312B | Text classification method based on entity recognition, and electronic device |
| CN103823857A | Spatial information retrieval method based on natural language processing |
| CN113220865B | Text similar-vocabulary retrieval method, system, medium and electronic equipment |
| Devi et al. | Entity extraction for Malayalam social media text using structured skip-gram-based embedding features from unlabeled data |
| CN108763192B | Entity relation extraction method and device for text processing |
| CN113157918A | Commodity-name short-text classification method and system based on attention mechanism |
| CN104317882A | Decision-based Chinese word segmentation and fusion method |
| CN107797986A | Mixed-corpus word segmentation method based on LSTM-CNN |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 2020-05-15 |