CN111159332A - Text multi-intention identification method based on bert - Google Patents

Info

Publication number
CN111159332A
Authority
CN
China
Prior art keywords
text
intention
bert
vector
recognition method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911219732.9A
Other languages
Chinese (zh)
Inventor
黄友福
肖龙源
蔡振华
李稀敏
刘晓葳
谭玉坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Kuaishangtong Technology Co Ltd
Original Assignee
Xiamen Kuaishangtong Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-12-03
Filing date
2019-12-03
Publication date
2020-05-15
Application filed by Xiamen Kuaishangtong Technology Co Ltd
Priority to CN201911219732.9A
Publication of CN111159332A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a BERT-based text multi-intent recognition method comprising the following steps: S1: acquiring the text to be recognized and performing deduplication and stop-word removal to obtain a training corpus; S2: obtaining sentence vectors; S3: training an intent recognition model on the sentence vectors with a LightGBM model to obtain the intent categories, and outputting all main intents; S4: selecting a standard vector for each intent category; S5: calculating the Mahalanobis distance to the standard vectors and outputting the sub-intent categories.

Description

Text multi-intention identification method based on bert
Technical Field
The invention relates to the field of medical aesthetics and the technical field of natural language processing, and in particular to a BERT-based text multi-intent recognition method.
Background
The reply mechanism of a medical aesthetics marketing robot is to reply according to the visitor's question and the item it concerns. In practice, a text is often ambiguous, or genuinely carries several intents, so that a single category cannot be selected accurately. In addition, fields such as intelligent dialogue systems have a real need for multi-intent recognition, because a comprehensive reply must be tailored to the different intents a text contains. Multi-intent recognition of text is therefore a problem in urgent need of a solution.
Text multi-intent recognition is generally addressed by either manual labeling or machine labeling. In manual labeling, an annotator reads the corpus item by item, works out the several meanings each item carries, and labels it. Manual labeling is robust and relatively accurate, but it is inefficient and consumes labor and time. If several annotators share the work, systematic errors caused by differences in their understanding can also arise. Machine labeling has problems of its own: 1. a machine learning model can only provide a single optimal solution with high accuracy and struggles to output suboptimal solutions (other intents), so it is only suitable for single-intent recognition; 2. a deep learning model can label multiple intents and thus solve the output problem, but it needs a large amount of balanced multi-intent data for training, and its accuracy is difficult to guarantee.
Disclosure of Invention
To solve the problems in the prior art, the invention provides a BERT-based text multi-intent recognition method. The method improves on existing single-intent recognition models: it recognizes the main intent with a BERT model and a LightGBM model, and then performs a secondary matching on text distance using the Mahalanobis distance, thereby outputting multiple intents for a text.
The method comprises the following specific steps:
A BERT-based text multi-intent recognition method comprises the following steps:
S1: acquiring the text to be recognized, and performing deduplication and stop-word removal to obtain a training corpus;
S2: obtaining sentence vectors;
S3: training an intent recognition model on the sentence vectors with a LightGBM model to obtain the intent categories, and outputting all main intents;
S4: selecting a standard vector for each intent category;
S5: calculating the Mahalanobis distance to the standard vectors and outputting the sub-intent categories.
Preferably, step S2 further comprises: setting up bert-as-service, a service that generates BERT embeddings, and feeding the training corpus into bert-as-service to obtain the sentence vector of each sentence.
Preferably, step S4 further comprises: for each main intent, counting how often each text with that main intent occurs among the texts of that intent category, and taking the sentence vector of the most frequent text as the standard vector of that category.
Preferably, step S5 further comprises: calculating the Mahalanobis distance from each text to the standard vectors of all intent categories to obtain a set of n distance values, and selecting the intent categories corresponding to the k smallest values in the set as the sub-intent categories of the text, where n is the number of intent categories and k < n.
Preferably, the standard vector can also be obtained by averaging the sentence vectors or by an empirical method.
Preferably, step S1 is implemented using ETL.
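As an illustration of step S1, the following is a minimal sketch of deduplication and stop-word removal. It is not the applicant's ETL pipeline: the stop-word list, the whitespace tokenization (Chinese text would need a word segmenter instead) and the function name `clean_corpus` are all assumptions made for the example.

```python
from typing import Iterable, List, Set

# Hypothetical stop-word list; a real system would load a domain-specific one.
STOP_WORDS: Set[str] = {"the", "a", "an", "is", "of", "please"}


def clean_corpus(texts: Iterable[str], stop_words: Set[str] = STOP_WORDS) -> List[str]:
    """Remove stop words from each text, then drop duplicates (first occurrence kept)."""
    seen: Set[str] = set()
    cleaned: List[str] = []
    for text in texts:
        # Assumption: whitespace tokenization and lowercasing are good enough here.
        tokens = [tok for tok in text.lower().split() if tok not in stop_words]
        sentence = " ".join(tokens)
        if sentence and sentence not in seen:
            seen.add(sentence)
            cleaned.append(sentence)
    return cleaned


corpus = clean_corpus([
    "What is the price of the treatment",
    "what is the price of treatment",
    "Please book an appointment",
])
print(corpus)  # ['what price treatment', 'book appointment']
```

Removing stop words before deduplicating is a choice made in this sketch: texts that differ only in stop words then collapse into one entry.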
Compared with the prior art, the invention has the following advantages:
1. By exploiting the efficiency and accuracy of ensemble learning and improving on it, a new high-precision text multi-intent recognition method is provided.
2. The most likely sub-intents contained in a text are output while ensuring that its main intent is correct.
3. When the standard vector of an intent category is selected, the sentence vector of the most frequent text in that category is used, which gives higher accuracy in practical applications.
4. Sentence vectors are obtained with BERT, which greatly improves the semantic and generalization capabilities of the predictions.
Drawings
FIG. 1 is a flow chart of the text multi-intent recognition method based on bert in the invention.
Detailed Description
Fig. 1 is a flow chart of the BERT-based text multi-intent recognition method. The method improves on existing single-intent recognition models: it recognizes the main intent with the BERT and LightGBM models and performs a secondary matching on text distance using the Mahalanobis distance, thereby outputting multiple intents for a text. The details are as follows.
The core algorithms involved are LightGBM, BERT and the Mahalanobis distance, where:
LightGBM is a fast, distributed, high-performance gradient boosting framework based on decision tree algorithms. Boosting is a framework algorithm: it derives sample subsets by operating on the sample set, and then trains a series of base classifiers on those subsets with a weak classification algorithm. Tools built on boosting algorithms include GBDT, AdaBoost and XGBoost. The Microsoft DMTK (Distributed Machine Learning Toolkit) team open-sourced LightGBM on GitHub, with performance exceeding that of other boosting tools. Its optimizations include: a histogram-based decision tree algorithm; a leaf-wise leaf growth strategy with depth limitation; histogram difference acceleration; direct support for categorical features; cache hit rate optimization; histogram-based sparse feature optimization; and multithreading optimization. Of these, the main ones are the histogram algorithm, the leaf-wise leaf growth strategy with depth limitation, and histogram difference acceleration.
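The histogram difference acceleration mentioned above can be illustrated with a toy sketch: once the histogram of a parent node and of one child are known, the sibling's histogram follows by subtraction, with no second pass over the data. This is a simplified illustration, not LightGBM's actual code: it tracks only gradient sums per bin, and the names `build_histogram` and `subtract_histograms` are hypothetical.

```python
from typing import List, Sequence


def build_histogram(bin_ids: Sequence[int], gradients: Sequence[float], n_bins: int) -> List[float]:
    """Sum the gradients of the samples that fall into each feature bin."""
    hist = [0.0] * n_bins
    for bin_id, grad in zip(bin_ids, gradients):
        hist[bin_id] += grad
    return hist


def subtract_histograms(parent: Sequence[float], child: Sequence[float]) -> List[float]:
    """Histogram of the sibling = parent histogram - known child histogram."""
    return [p - c for p, c in zip(parent, child)]


# Toy data: six samples whose feature value was pre-bucketed into three bins.
bin_ids = [0, 1, 2, 0, 1, 2]
gradients = [0.5, -1.0, 0.25, 1.5, 2.0, -0.75]
in_left = [True, True, False, False, True, False]  # which child each sample goes to

parent = build_histogram(bin_ids, gradients, 3)
left = build_histogram(
    [b for b, keep in zip(bin_ids, in_left) if keep],
    [g for g, keep in zip(gradients, in_left) if keep],
    3,
)
right_by_subtraction = subtract_histograms(parent, left)
right_direct = build_histogram(
    [b for b, keep in zip(bin_ids, in_left) if not keep],
    [g for g, keep in zip(gradients, in_left) if not keep],
    3,
)
print(right_by_subtraction, right_direct)  # both are [1.5, 0.0, -0.5]
```

The saving comes from building the histogram only for the smaller child and deriving the larger one by subtraction.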
The Mahalanobis distance was proposed by the Indian statistician P. C. Mahalanobis. It compares the similarity of unknown samples by computing a covariance-based distance between them. Its advantage over the Euclidean distance is that it is scale-invariant, i.e. the Mahalanobis distance between two points does not depend on the units of measurement of the raw data, and it removes the interference of correlations between variables. The Mahalanobis distance can therefore avoid the interference caused by correlations among dimensions once text has been converted into vectors. It is computed as follows:
Given a vector space {X1, X2, …, Xn} with covariance matrix S, the Mahalanobis distance from Xi to Xj is calculated as
$$D(X_i, X_j) = \sqrt{(X_i - X_j)^{T} S^{-1} (X_i - X_j)}$$
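To make the Mahalanobis distance concrete, here is a minimal pure-Python sketch using the standard definition, the square root of (x - y)^T S^{-1} (x - y), with S the sample covariance matrix. The helper names and toy data are assumptions for illustration. The patent does not say how S is estimated; for high-dimensional sentence vectors S can be singular, in which case some form of regularization would be needed.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def covariance_matrix(samples: Sequence[Vector]) -> List[List[float]]:
    """Unbiased sample covariance matrix of the given vectors."""
    n, dim = len(samples), len(samples[0])
    means = [sum(s[j] for s in samples) / n for j in range(dim)]
    return [
        [sum((s[a] - means[a]) * (s[b] - means[b]) for s in samples) / (n - 1)
         for b in range(dim)]
        for a in range(dim)
    ]


def solve(matrix: Sequence[Vector], rhs: Vector) -> List[float]:
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def mahalanobis(x: Vector, y: Vector, cov: Sequence[Vector]) -> float:
    """sqrt((x - y)^T S^{-1} (x - y)), computed without forming S^{-1} explicitly."""
    diff = [a - b for a, b in zip(x, y)]
    weighted = solve(cov, diff)  # equals S^{-1} (x - y)
    return math.sqrt(sum(d * w for d, w in zip(diff, weighted)))


samples = [(1.0, 2.0), (2.0, 1.0), (3.0, 5.0), (4.0, 3.0)]
rescaled = [(a, 100.0 * b) for a, b in samples]  # change the unit of the second feature

d_original = mahalanobis(samples[0], samples[2], covariance_matrix(samples))
d_rescaled = mahalanobis(rescaled[0], rescaled[2], covariance_matrix(rescaled))
print(math.isclose(d_original, d_rescaled))  # True: the distance ignores the unit change
```

With the identity matrix as covariance, the function reduces to the Euclidean distance, which is a quick sanity check.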
The BERT model released by the Google AI team caused a huge stir in the NLP community and is regarded as a milestone in the field. BERT showed striking performance on SQuAD 1.1, a top-level machine reading comprehension benchmark: it surpassed human performance on both metrics, and it also achieved the best results on 11 different NLP tasks, including pushing the GLUE benchmark to 80.4% (7.6% absolute improvement) and MultiNLI accuracy to 86.7% (5.6% absolute improvement). BERT's innovation is to apply a bidirectional Transformer to language modeling, whereas earlier models read a text sequence from left to right or combined left-to-right and right-to-left training. Experimental results show that a bidirectionally trained language model understands context more deeply than a unidirectional one.
A BERT-based text multi-intent recognition method comprises the following steps:
1) Perform ETL on the training corpus, i.e. deduplicate it and remove stop words.
2) Set up bert-as-service, a service that generates BERT embeddings, and feed the training corpus into bert-as-service to obtain the sentence vector of each sentence.
3) Using the sentence vectors output in step 2) as training data, train a high-quality single-intent output model with LightGBM and output the main intent of every text in the training corpus.
4) For each main intent, count how often each text with that main intent occurs among the texts of that intent category, and take the sentence vector of the most frequent text as the standard vector of that category. The standard vector can alternatively be determined by other means, such as averaging the sentence vectors or an empirical method.
5) Calculate the Mahalanobis distance from each text to the standard vectors of all intent categories to obtain a set of n distance values (n is the number of intent categories), and select the intent categories (excluding the main intent) corresponding to the k (k < n) smallest values in the set as the sub-intent categories of the text.
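Steps 4) and 5) can be sketched as follows. This is an illustrative reading of the text, not the applicant's code: the function names and toy data are hypothetical, the texts are assumed to be counted before deduplication (otherwise every frequency would be 1), and `math.dist` (Euclidean) stands in for the Mahalanobis distance only to keep the example short. The `distance` parameter accepts any distance function.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Tuple[float, ...]


def pick_standard_vectors(
    texts: Sequence[str], intents: Sequence[str], vectors: Sequence[Vector]
) -> Dict[str, Vector]:
    """Step 4: per main intent, use the sentence vector of its most frequent text."""
    counts: Dict[str, Counter] = {}
    vector_of: Dict[str, Vector] = {}
    for text, intent, vec in zip(texts, intents, vectors):
        counts.setdefault(intent, Counter())[text] += 1
        vector_of[text] = vec
    return {
        intent: vector_of[counter.most_common(1)[0][0]]
        for intent, counter in counts.items()
    }


def sub_intents(
    vec: Vector,
    main_intent: str,
    standards: Dict[str, Vector],
    distance: Callable[[Vector, Vector], float],
    k: int,
) -> List[str]:
    """Step 5: the k intents (main intent excluded) with the nearest standard vectors."""
    ranked = sorted(
        (distance(vec, std), intent)
        for intent, std in standards.items()
        if intent != main_intent
    )
    return [intent for _, intent in ranked[:k]]


# Toy corpus with made-up two-dimensional "sentence vectors".
texts = ["price", "price", "cost", "book", "book", "book", "where"]
intents = ["pricing", "pricing", "pricing", "booking", "booking", "booking", "location"]
vectors = [(0.0, 0.0), (0.0, 0.0), (0.0, 1.0),
           (10.0, 0.0), (10.0, 0.0), (10.0, 0.0), (1.0, 1.0)]

standards = pick_standard_vectors(texts, intents, vectors)
print(standards["pricing"])  # (0.0, 0.0), the vector of the most frequent text "price"
print(sub_intents((0.5, 0.5), "pricing", standards, math.dist, k=1))  # ['location']
```

Passing the distance as a parameter lets the same ranking code work with a Mahalanobis implementation, for example a function that closes over a precomputed covariance matrix.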
The above embodiments may be further combined or substituted. They merely describe preferred embodiments of the invention and do not limit its concept or scope; any changes and modifications made to the technical solution of the invention by those skilled in the art without departing from its design idea fall within the scope of protection of the invention.

Claims (6)

1. A BERT-based text multi-intent recognition method, characterized by comprising the following steps:
S1: acquiring the text to be recognized, and performing deduplication and stop-word removal to obtain a training corpus;
S2: obtaining sentence vectors;
S3: training an intent recognition model on the sentence vectors with a LightGBM model to obtain the intent categories, and outputting all main intents;
S4: selecting a standard vector for each intent category;
S5: calculating the Mahalanobis distance to the standard vectors and outputting the sub-intent categories.
2. The BERT-based text multi-intent recognition method according to claim 1, wherein step S2 further comprises: setting up bert-as-service, a service that generates BERT embeddings, and feeding the training corpus into bert-as-service to obtain the sentence vector of each sentence.
3. The BERT-based text multi-intent recognition method according to claim 1, wherein step S4 further comprises: for each main intent, counting how often each text with that main intent occurs among the texts of that intent category, and taking the sentence vector of the most frequent text as the standard vector of that category.
4. The BERT-based text multi-intent recognition method according to claim 1, wherein step S5 further comprises: calculating the Mahalanobis distance from each text to the standard vectors of all intent categories to obtain a set of n distance values, and selecting the intent categories corresponding to the k smallest values in the set as the sub-intent categories of the text, where n is the number of intent categories and k < n.
5. The method according to claim 1 or 3, wherein the standard vector is obtained by averaging the sentence vectors or by an empirical method.
6. The BERT-based text multi-intent recognition method according to claim 1, wherein step S1 is implemented using ETL.
CN201911219732.9A 2019-12-03 2019-12-03 Text multi-intention identification method based on bert Pending CN111159332A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911219732.9A CN111159332A (en) 2019-12-03 2019-12-03 Text multi-intention identification method based on bert

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911219732.9A CN111159332A (en) 2019-12-03 2019-12-03 Text multi-intention identification method based on bert

Publications (1)

Publication Number Publication Date
CN111159332A true CN111159332A (en) 2020-05-15

Family

ID=70556541

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911219732.9A Pending CN111159332A (en) 2019-12-03 2019-12-03 Text multi-intention identification method based on bert

Country Status (1)

Country Link
CN (1) CN111159332A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635105A (en) * 2018-10-29 2019-04-16 厦门快商通信息技术有限公司 A kind of more intension recognizing methods of Chinese text and system
CN110147452A (en) * 2019-05-17 2019-08-20 北京理工大学 A kind of coarseness sentiment analysis method based on level BERT neural network
CN110287309A (en) * 2019-06-21 2019-09-27 深圳大学 The method of rapidly extracting text snippet
CN110446063A (en) * 2019-07-26 2019-11-12 腾讯科技(深圳)有限公司 Generation method, device and the electronic equipment of video cover

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MOUNIKA MARREDDY等: ""Evaluating the Combination of Word Embeddings with Mixture of Experts and Cascading gcForest in Identifying Sentiment Polarity"", 《IN PROCEEDINGS OF KDD 2019 (WISDOM’19): 8TH KDD WORKSHOP ON ISSUES OF SENTIMENT DISCOVERY AND OPINION MINING》 *
刘娇等: ""人机对话系统中意图识别方法综述"", 《计算机工程与应用》 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112256864A (en) * 2020-09-23 2021-01-22 北京捷通华声科技股份有限公司 Multi-intention recognition method and device, electronic equipment and readable storage medium
CN112256864B (en) * 2020-09-23 2024-05-14 北京捷通华声科技股份有限公司 Multi-intention recognition method, device, electronic equipment and readable storage medium
CN112560458A (en) * 2020-12-09 2021-03-26 杭州艾耕科技有限公司 Article title generation method based on end-to-end deep learning model
CN112989800A (en) * 2021-04-30 2021-06-18 平安科技(深圳)有限公司 Multi-intention identification method and device based on Bert sections and readable storage medium
CN113223735A (en) * 2021-04-30 2021-08-06 平安科技(深圳)有限公司 Triage method, device and equipment based on session representation and storage medium
CN113223735B (en) * 2021-04-30 2024-08-20 平安科技(深圳)有限公司 Diagnosis method, device, equipment and storage medium based on dialogue characterization
CN114818665A (en) * 2022-04-22 2022-07-29 电子科技大学 Multi-intention identification method and system based on bert + bilstm + crf and xgboost models
CN118656494A (en) * 2024-08-16 2024-09-17 成都晓多科技有限公司 Acoustic fine granularity intention analysis and matching method and system for buyers

Similar Documents

Publication Publication Date Title
CN110209823B (en) Multi-label text classification method and system
CN111159332A (en) Text multi-intention identification method based on bert
CN104699763B (en) The text similarity gauging system of multiple features fusion
CN109344250B (en) Rapid structuring method of single disease diagnosis information based on medical insurance data
CN110096570A (en) A kind of intension recognizing method and device applied to intelligent customer service robot
CN110147451B (en) Dialogue command understanding method based on knowledge graph
CN106776538A (en) The information extracting method of enterprise's noncanonical format document
CN106445921B (en) Utilize the Chinese text terminology extraction method of quadratic mutual information
CN106777957B (en) The new method of biomedical more ginseng event extractions on unbalanced dataset
Alotaibi et al. Optical character recognition for quranic image similarity matching
CN109635105A (en) A kind of more intension recognizing methods of Chinese text and system
CN110096572B (en) Sample generation method, device and computer readable medium
CN113033183B (en) Network new word discovery method and system based on statistics and similarity
CN111597328B (en) New event theme extraction method
CN110046356B (en) Label-embedded microblog text emotion multi-label classification method
CN105205124A (en) Semi-supervised text sentiment classification method based on random feature subspace
CN109657039A (en) A kind of track record information extraction method based on the double-deck BiLSTM-CRF
CN112069312B (en) Text classification method based on entity recognition and electronic device
CN103823857A (en) Space information searching method based on natural language processing
CN113220865B (en) Text similar vocabulary retrieval method, system, medium and electronic equipment
Devi et al. Entity extraction for malayalam social media text using structured skip-gram based embedding features from unlabeled data
CN108763192B (en) Entity relation extraction method and device for text processing
CN113157918A (en) Commodity name short text classification method and system based on attention mechanism
CN104317882A (en) Decision-based Chinese word segmentation and fusion method
CN107797986A (en) A kind of mixing language material segmenting method based on LSTM CNN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200515
