CN111159332A - Text multi-intention identification method based on bert - Google Patents
Text multi-intention identification method based on bert

- Publication number: CN111159332A (application CN201911219732.9A)
- Authority: CN (China)
- Prior art keywords: text, intention, bert, vector, recognition method
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications

(G: Physics; G06: Computing, calculating or counting; G06F: Electric digital data processing)

- G06F16/3329: Natural language query formulation or dialogue systems
- G06F16/3344: Query execution using natural language analysis
- G06F18/214: Generating training patterns; bootstrap methods, e.g. bagging or boosting
Abstract
The invention discloses a bert-based text multi-intent recognition method comprising the following steps. S1: acquire the text to be recognized and perform duplicate removal and stop-word deletion to obtain a training corpus. S2: obtain sentence vectors. S3: train a sentence-vector model for intent recognition with a lightgbm model to obtain the intent categories and output the main intent of every text. S4: select the standard vectors. S5: calculate the Mahalanobis distance to the standard vectors and output the sub-category intents.
Description
Technical Field
The invention relates to the field of medical aesthetics and to the technical field of natural language processing, and in particular to a bert-based text multi-intent recognition method.
Background
The reply mechanism of a medical-aesthetics marketing robot is to answer according to the visitor's question and the item being consulted. In practice, the text is often ambiguous, or genuinely carries multiple intents, so a single category cannot be selected with accuracy. On the other hand, building intelligent dialogue systems creates a real need for multi-intent recognition of text, since a comprehensive reply must be tailored to the different intents a text contains. Multi-intent recognition of text is therefore a problem that urgently needs to be solved.
Text multi-intent recognition is generally addressed by manual labeling or machine labeling. In manual labeling, annotators read the corpus item by item, work out the several meanings each item carries, and label it accordingly. Manual labeling is robust and comparatively accurate, but it is inefficient and consumes labor and time. When several annotators work in parallel, differences in their understanding also introduce systematic errors. Machine labeling has its own problems: 1. a classical machine-learning model can only output a single high-confidence optimal solution and struggles to output the suboptimal solutions (the other intents), so it suits only single-intent recognition; 2. a deep-learning model can be made to label multiple intents, but it requires a large amount of balanced multi-intent data for training, and its accuracy is hard to guarantee.
Disclosure of Invention
To solve these problems in the prior art, the method improves on existing single-intent text recognition models: it recognizes the intent with a bert model and a lightgbm model, then performs a secondary match on text distances using the Mahalanobis distance, thereby outputting the multiple intents of a text. A bert-based text multi-intent recognition method is provided.
The method comprises the following specific steps:
a text multi-intention recognition method based on bert comprises the following steps:
s1: acquiring a text to be recognized, and performing duplicate removal and stop word deletion to obtain a training corpus;
s2: obtaining a sentence vector;
s3: training a sentence vector model for identifying the intention by using a lightgbm model to obtain an intention category and outputting all main intents;
s4: selecting a standard vector;
s5: and calculating the mahalanobis distance of the standard vector and outputting the subcategory intention.
Preferably, step S2 further comprises: setting up bert as an embedding service (bert-as-service), inputting the training corpus into bert-as-service, and obtaining the sentence vector of each sentence.
Preferably, step S4 further comprises: for each main-intent category, counting the frequency of occurrence of every text in that category, and taking the sentence vector of the most frequent text as the standard vector of the category.
Preferably, step S5 further comprises: calculating the Mahalanobis distance from each text to the standard vectors of all intent categories, obtaining a set of n distance values, and selecting the intent categories corresponding to the k smallest absolute values in the set as the sub-categories of the text, where n is the number of intent categories and k < n.
Preferably, the standard vector can also be obtained by averaging the sentence vectors or by an empirical method.
Preferably, step S1 is implemented using ETL.
Compared with the prior art, the invention has the following advantages:
1. By exploiting the efficiency and accuracy of ensemble learning, and improving on it, a new high-precision text multi-intent recognition method is provided.
2. The most likely sub-intents contained in a text are output while the main intent is guaranteed to be correct.
3. When selecting the marker vector of an intent category, the sentence vector of the most frequent text in that category is used as the standard vector, which gives higher accuracy in practical applications.
4. Sentence vectors are obtained with bert, which greatly improves the semantic and generalization capability of the predictions.
Drawings
FIG. 1 is a flow chart of the text multi-intent recognition method based on bert in the invention.
Detailed Description
Fig. 1 is a flow chart of the bert-based text multi-intent recognition method. The method improves on existing single-intent text recognition models: it performs intent recognition with the bert and lightgbm models and performs a secondary match on text distances using the Mahalanobis distance, thereby outputting the multiple intents of a text. The specific steps are as follows.
The core algorithms involved are lightgbm, bert and the Mahalanobis distance.
LightGBM is a fast, distributed, high-performance gradient-boosting framework based on decision-tree algorithms. Boosting is a framework algorithm that draws subsets from the sample set and trains a series of base classifiers on them with a weak classification algorithm; tools built on boosting algorithms include gbdt, adaboost and xgboost. Microsoft's DMTK (Distributed Machine Learning Toolkit) team open-sourced LightGBM on GitHub with performance exceeding other boosting tools. Its main techniques are a histogram-based decision-tree algorithm, a leaf-wise growth strategy with a depth limit, histogram-difference acceleration, direct support for categorical features, cache hit-rate optimization, sparse-feature optimization, and multithreaded optimization.
The Mahalanobis distance was proposed by the Indian statistician P. C. Mahalanobis; it compares the similarity of unknown samples by computing a covariance-weighted distance between them. Its advantage over the Euclidean distance is scale independence: the Mahalanobis distance between two points does not depend on the measurement units of the raw data, and it excludes interference from correlations between variables. It therefore avoids the interference caused by correlations between dimensions once texts are converted to vectors. The Mahalanobis distance is computed as follows:
Given a vector space {X1, X2, ..., Xn} with covariance matrix S, the Mahalanobis distance from Xi to Xj is D(Xi, Xj) = sqrt((Xi - Xj)^T S^(-1) (Xi - Xj)).
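As an illustrative sketch (not part of the patent, names and data are stand-ins), this distance can be computed directly with numpy:

```python
import numpy as np

def mahalanobis(x_i, x_j, cov):
    """Mahalanobis distance sqrt((xi - xj)^T S^-1 (xi - xj))."""
    diff = x_i - x_j
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

# With an identity covariance matrix the Mahalanobis distance
# reduces to the ordinary Euclidean distance.
a = np.array([1.0, 0.0])
b = np.array([0.0, 0.0])
print(mahalanobis(a, b, np.eye(2)))  # 1.0
```

With a non-identity covariance matrix, correlated dimensions are down-weighted, which is exactly the property the method relies on for sentence vectors.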
The BERT model released by the Google AI team caused an enormous stir in the NLP community and is regarded as a milestone in the field. BERT showed surprising performance on the machine reading-comprehension benchmark SQuAD 1.1, surpassing human performance on both metrics, and achieved the best results on 11 different NLP tasks, including pushing the GLUE benchmark to 80.4% (a 7.6% absolute improvement) and MultiNLI accuracy to 86.7% (a 5.6% absolute improvement). BERT's innovation is applying a bidirectional Transformer to language modeling; earlier approaches either read the text sequence left to right, or combined left-to-right and right-to-left training. Experiments show that a bidirectionally trained language model understands context more deeply than a unidirectional one.
A text multi-intention recognition method based on bert comprises the following steps:
1) Perform ETL on the training corpus, i.e. remove duplicates and delete stop words.
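A minimal sketch of this ETL step; the stop-word list here is a tiny illustrative stand-in (a real system would use a full Chinese stop-word list):

```python
# Toy stand-in for a real Chinese stop-word list.
STOP_WORDS = {"的", "了", "吗", "呢"}

def clean_corpus(texts):
    seen, cleaned = set(), []
    for text in texts:
        if text in seen:          # duplicate removal
            continue
        seen.add(text)
        # stop-word deletion, character by character for this sketch
        cleaned.append("".join(ch for ch in text if ch not in STOP_WORDS))
    return cleaned

print(clean_corpus(["价格贵吗", "价格贵吗", "效果好的"]))  # ['价格贵', '效果好']
```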
2) Set up bert as an embedding service (bert-as-service), input the training corpus into bert-as-service, and obtain the sentence vector of each sentence.
3) Using the sentence vectors output in step 2) as training data, train a high-quality single-intent output model with the lightgbm model and output the main intent of every training text.
4) For each main-intent category, count the frequency of occurrence of every text in that category and take the sentence vector of the most frequent text as the standard vector of the category. Alternatively, the standard vector may be determined in other ways, such as averaging the sentence vectors or using an empirical method.
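A small sketch of the frequency-based standard-vector selection, with toy texts and one-hot stand-ins for sentence vectors (all names illustrative, not from the patent):

```python
from collections import Counter
import numpy as np

def standard_vectors(texts, intents, vectors):
    """For each main intent, pick the sentence vector of its most frequent text."""
    result = {}
    for intent in set(intents):
        idx = [i for i, it in enumerate(intents) if it == intent]
        counts = Counter(texts[i] for i in idx)
        most_common_text = counts.most_common(1)[0][0]
        # first index whose text matches the most frequent text of this category
        i = next(i for i in idx if texts[i] == most_common_text)
        result[intent] = vectors[i]
    return result

texts = ["多少钱", "多少钱", "疼不疼", "价格", "疼不疼", "疼不疼"]
intents = ["price", "price", "pain", "price", "pain", "pain"]
vecs = np.eye(6)  # toy one-hot "sentence vectors", one per text
std = standard_vectors(texts, intents, vecs)
```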
5) Calculate the Mahalanobis distance from each text to the standard vectors of all intent categories, obtaining a set of n distance values (n is the number of intent categories), and select the intent categories (excluding the main intent) corresponding to the k (k < n) smallest absolute values in the set as the sub-categories of the text.
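The sub-intent selection above can be sketched as follows; the standard vectors, covariance matrix and category names are illustrative stand-ins:

```python
import numpy as np

def sub_intents(vec, standard, cov, k, main_intent):
    """k intent categories closest to `vec` by Mahalanobis distance, main intent excluded."""
    inv = np.linalg.inv(cov)
    dists = {}
    for intent, std in standard.items():
        d = vec - std
        dists[intent] = float(np.sqrt(d @ inv @ d))
    ranked = sorted(dists, key=dists.get)
    # drop the already-known main intent, keep the k nearest remaining categories
    return [it for it in ranked if it != main_intent][:k]

standard = {"price": np.array([1.0, 0.0]),
            "pain":  np.array([0.0, 1.0]),
            "time":  np.array([5.0, 5.0])}
v = np.array([0.9, 0.2])
print(sub_intents(v, standard, np.eye(2), k=1, main_intent="price"))  # ['pain']
```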
The above embodiments may be further combined or substituted; they describe only preferred embodiments of the invention and do not limit its concept and scope. Changes and modifications made to the technical solution by those skilled in the art without departing from the design idea of the invention all belong to its protection scope.
Claims (6)
1. A text multi-intention recognition method based on bert is characterized by comprising the following steps:
s1: acquiring a text to be recognized, and performing duplicate removal and stop word deletion to obtain a training corpus;
s2: obtaining a sentence vector;
s3: training a sentence vector model for identifying the intention by using a lightgbm model to obtain an intention category and outputting all main intents;
s4: selecting a standard vector;
s5: and calculating the mahalanobis distance of the standard vector and outputting the subcategory intention.
2. The bert-based text multi-intent recognition method according to claim 1, wherein step S2 further comprises: setting up bert as an embedding service (bert-as-service), inputting the training corpus into bert-as-service, and obtaining the sentence vector of each sentence.
3. The bert-based text multi-intent recognition method according to claim 1, wherein step S4 further comprises: for each main-intent category, counting the frequency of occurrence of every text in that category, and taking the sentence vector of the most frequent text as the standard vector of the category.
4. The bert-based text multi-intent recognition method according to claim 1, wherein step S5 further comprises: calculating the Mahalanobis distance from each text to the standard vectors of all intent categories to obtain a set of n distance values, and selecting the intent categories corresponding to the k smallest absolute values in the set as the sub-categories of the text, where n is the number of intent categories and k < n.
5. The method according to claim 1 or 3, wherein the standard vector is obtained by averaging the sentence vectors or by an empirical method.
6. The bert-based text multi-intent recognition method according to claim 1, wherein step S1 is implemented using ETL.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201911219732.9A | 2019-12-03 | 2019-12-03 | Text multi-intention identification method based on bert |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN111159332A | 2020-05-15 |
Family

- Family ID: 70556541

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date | Status |
|---|---|---|---|---|
| CN201911219732.9A | Text multi-intention identification method based on bert | 2019-12-03 | 2019-12-03 | Pending |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN111159332A |
Patent Citations (4)

| Publication | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109635105A | 2018-10-29 | 2019-04-16 | 厦门快商通信息技术有限公司 | Chinese text multi-intent recognition method and system |
| CN110147452A | 2019-05-17 | 2019-08-20 | 北京理工大学 | Coarse-grained sentiment analysis method based on a hierarchical BERT neural network |
| CN110287309A | 2019-06-21 | 2019-09-27 | 深圳大学 | Method for rapidly extracting text summaries |
| CN110446063A | 2019-07-26 | 2019-11-12 | 腾讯科技(深圳)有限公司 | Video cover generation method, device and electronic equipment |
Non-Patent Citations (2)

- Mounika Marreddy et al., "Evaluating the Combination of Word Embeddings with Mixture of Experts and Cascading gcForest in Identifying Sentiment Polarity", in Proceedings of KDD 2019 (WISDOM'19): 8th KDD Workshop on Issues of Sentiment Discovery and Opinion Mining.
- 刘娇 et al., "A survey of intent recognition methods in human-machine dialogue systems", 《计算机工程与应用》 (Computer Engineering and Applications).
Cited By (8)

| Publication | Priority date | Publication date | Title |
|---|---|---|---|
| CN112256864A | 2020-09-23 | 2021-01-22 | Multi-intention recognition method and device, electronic equipment and readable storage medium |
| CN112256864B | 2020-09-23 | 2024-05-14 | Multi-intention recognition method, device, electronic equipment and readable storage medium |
| CN112560458A | 2020-12-09 | 2021-03-26 | Article title generation method based on end-to-end deep learning model |
| CN112989800A | 2021-04-30 | 2021-06-18 | Multi-intention identification method and device based on Bert sections and readable storage medium |
| CN113223735A | 2021-04-30 | 2021-08-06 | Triage method, device and equipment based on session representation and storage medium |
| CN113223735B | 2021-04-30 | 2024-08-20 | Diagnosis method, device, equipment and storage medium based on dialogue characterization |
| CN114818665A | 2022-04-22 | 2022-07-29 | Multi-intention identification method and system based on bert + bilstm + crf and xgboost models |
| CN118656494A | 2024-08-16 | 2024-09-17 | Acoustic fine-granularity intention analysis and matching method and system for buyers |
Similar Documents

| Publication | Title |
|---|---|
| CN110209823B | Multi-label text classification method and system |
| CN111159332A | Text multi-intention identification method based on bert |
| CN104699763B | Multi-feature-fusion text similarity measurement system |
| CN109344250B | Rapid structuring method for single-disease diagnosis information based on medical-insurance data |
| CN110096570A | Intent recognition method and device applied to intelligent customer-service robots |
| CN110147451B | Dialogue command understanding method based on knowledge graph |
| CN106776538A | Information extraction method for non-standard-format enterprise documents |
| CN106445921B | Chinese text term extraction method using quadratic mutual information |
| CN106777957B | New method for biomedical multi-argument event extraction on unbalanced datasets |
| Alotaibi et al. | Optical character recognition for Quranic image similarity matching |
| CN109635105A | Chinese text multi-intent recognition method and system |
| CN110096572B | Sample generation method, device and computer-readable medium |
| CN113033183B | Network new-word discovery method and system based on statistics and similarity |
| CN111597328B | New-event topic extraction method |
| CN110046356B | Label-embedded multi-label emotion classification method for microblog text |
| CN105205124A | Semi-supervised text sentiment classification method based on random feature subspaces |
| CN109657039A | Work-history information extraction method based on a double-layer BiLSTM-CRF |
| CN112069312B | Text classification method based on entity recognition, and electronic device |
| CN103823857A | Spatial information retrieval method based on natural language processing |
| CN113220865B | Text similar-vocabulary retrieval method, system, medium and electronic equipment |
| Devi et al. | Entity extraction for Malayalam social media text using structured skip-gram-based embedding features from unlabeled data |
| CN108763192B | Entity relation extraction method and device for text processing |
| CN113157918A | Commodity-name short-text classification method and system based on attention mechanism |
| CN104317882A | Decision-based Chinese word segmentation and fusion method |
| CN107797986A | Mixed-corpus word segmentation method based on LSTM-CNN |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 2020-05-15 |