CN112101423A - Multi-model fused FAQ matching method and device - Google Patents

Multi-model fused FAQ matching method and device Download PDF

Info

Publication number
CN112101423A
CN112101423A
Authority
CN
China
Prior art keywords
model
matching
faq
training
sentence pair
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010852824.7A
Other languages
Chinese (zh)
Inventor
田东坡
巩乐
翟永佳
张铭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Changtou Network Technology Co ltd
Original Assignee
Shanghai Changtou Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Changtou Network Technology Co ltd filed Critical Shanghai Changtou Network Technology Co ltd
Priority to CN202010852824.7A priority Critical patent/CN112101423A/en
Publication of CN112101423A publication Critical patent/CN112101423A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention is applicable to the technical field of natural language processing and provides a multi-model fused FAQ (frequently asked questions) matching method and device. The method sequentially obtains a training text set of questions to be processed; extracts and summarizes financial-education knowledge points with the assistance of BERT-encoder + DBSCAN clustering, thereby constructing a financial-education FAQ; manually labels a small number of similar questions; generates a large number of similar questions from the labeled ones and reviews them manually to construct a sentence-pair matching data set; trains a sentence-pair matching model using an unsupervised model and a supervised deep-learning model; and, after the sentence-pair matching model is trained, receives a question input by a user, identifies the question that best matches the input, and outputs the corresponding answer. By fusing multiple models, training or pre-training models, extracting text, matching standard questions, and replying with corresponding answers, the invention solves the problems that browsing an FAQ is tedious for users and that manual customer service is inefficient.

Description

Multi-model fused FAQ matching method and device
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a multi-model fused FAQ matching method and device.
Background
Artificial intelligence has seen few real deployments in the financial-education industry. In particular, the corpora in this industry are proprietary, which makes open-sourcing them inconvenient and limits development.
In recent years the NLP field has advanced rapidly, yet these advances have not been applied to financial education with good effect, and even state-of-the-art FAQ sentence-pair matching algorithms are difficult to deploy in this field.
An intelligent FAQ also struggles to achieve good results when there are many knowledge points and the semantics of the questions are extremely similar.
Disclosure of Invention
The invention provides a multi-model fused FAQ matching method and device, and aims to solve the problems in the prior art.
The invention is realized as a multi-model fused FAQ matching method, comprising the following steps:
S1, obtaining a training text set of questions to be processed, and extracting and summarizing financial-education knowledge points with the assistance of BERT-encoder + DBSCAN clustering, thereby constructing a financial-education FAQ and manually labeling a small number of similar questions;
S2, generating a large number of similar questions from the labeled similar questions using a similar-question generation module, reviewing them manually, and constructing a sentence-pair matching data set;
S3, constructing a pre-training model data set;
S4, training a sentence-pair matching model using an unsupervised model and a supervised deep-learning model;
S5, after the sentence-pair matching model is trained, receiving a question input by the user, feeding its text into the sentence-pair matching model, identifying the question that best matches the input together with its corresponding answer, and replying with that answer to the user.
Preferably, the training text set comprises text whose length is limited to 3-50 characters, with emoticons, numbers, and e-mail text removed.
Preferably, the unsupervised model comprises a WMD model and an SIF model;
the supervised model comprises a BERT model, an ALBERT model, and a RoBERTa model.
The invention also provides a multi-model fused FAQ matching device, comprising:
a financial-education corpus database for storing pre-entered FAQ corpus data and generating a training text set;
a manual labeling module for an operator to manually label a small number of similar questions in the training text set;
a similar-question generation module for generating a large number of similar questions from the labeled similar questions, performing manual review, and constructing a sentence-pair matching data set;
and an NLU module for training the sentence-pair matching model, matching a question input by the user with the trained model, finding the best-matching question, and replying to the user with the best-matching question and its corresponding answer.
Preferably, the training text set comprises text whose length is limited to 3-50 characters, with emoticons, numbers, and e-mail text removed.
Preferably, the NLU module comprises an unsupervised model and a supervised model;
the unsupervised model comprises a WMD model and an SIF model;
the supervised model comprises a BERT model, an ALBERT model, and a RoBERTa model.
Compared with the prior art, the invention has the following beneficial effects. The multi-model fused FAQ matching method and device sequentially obtain a training text set of questions to be processed; extract and summarize financial-education knowledge points with the assistance of BERT-encoder + DBSCAN clustering, thereby constructing a financial-education FAQ; manually label a small number of similar questions; generate a large number of similar questions from the labeled ones and review them manually to construct a sentence-pair matching data set; train a sentence-pair matching model with an unsupervised model and a supervised deep-learning model; and finally, after training, receive a question input by a user, identify the best-matching question, and output the corresponding answer. By fusing multiple models, training or pre-training models, extracting text, matching standard questions, and replying with corresponding answers, the invention solves the problems that browsing an FAQ is tedious for users and that manual customer service is inefficient.
Drawings
Fig. 1 is an overall system schematic diagram of a multi-model fused FAQ matching apparatus according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, the present invention provides the following technical solution: a multi-model fused FAQ matching method and device, the method comprising the following steps.
s1, obtaining a training text set of the problems to be processed, extracting and summarizing the finance education knowledge points by combining Bert-encoder + DBSCAn clustering assistance, thereby constructing a finance education FAQ and manually labeling a small number of similar problems. Wherein the training text set comprises text with the text length limit of 3-50, deleted expressions, numbers and mails.
In the present embodiment, an example question is: What is the difference between A-shares and B-shares?
Similar questions added manually include: How do A-shares and B-shares differ? Are A-shares or B-shares better?
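The clustering assistance in S1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the toy vectors stand in for BERT-encoder sentence embeddings (which in practice a library such as sentence-transformers would produce), and the questions, `eps`, and `min_samples` values are assumptions chosen for demonstration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import normalize

# Questions covering two hypothetical knowledge points.
questions = [
    "What is the difference between A-shares and B-shares?",
    "How do A-shares and B-shares differ?",
    "What are concept stocks?",
    "What is the meaning of concept stocks?",
]

# Toy stand-ins for BERT sentence embeddings: two base directions plus noise.
rng = np.random.default_rng(0)
base = rng.normal(size=(2, 8))
emb = np.vstack([base[0] + 0.01 * rng.normal(size=8),
                 base[0] + 0.01 * rng.normal(size=8),
                 base[1] + 0.01 * rng.normal(size=8),
                 base[1] + 0.01 * rng.normal(size=8)])
emb = normalize(emb)  # unit length, so euclidean distance tracks cosine distance

# DBSCAN groups near-duplicate questions into candidate knowledge points.
labels = DBSCAN(eps=0.2, min_samples=2).fit_predict(emb)

clusters = {}
for question, label in zip(questions, labels):
    clusters.setdefault(int(label), []).append(question)
```

Each resulting cluster is a candidate knowledge point that an annotator can review and turn into one standard question plus a few labeled similar questions.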
S2, generating a large number of similar questions from the labeled similar questions using the similar-question generation module, reviewing them manually, and constructing a sentence-pair matching data set.
In the present embodiment, the sentence-pair matching data set is shown in Table 1 below:
TABLE 1
Standard question | Similar question
What is the difference between A-shares and B-shares? | How do A-shares and B-shares differ?
What is the difference between A-shares and B-shares? | Are A-shares the same as B-shares?
Hello, may I ask what concept stocks are? | What are concept stocks?
What are concept stocks? | What do concept stocks mean?
What is the meaning of concept stocks? | What are the so-called concept stocks?
S3, constructing a pre-training model data set from the sentence-pair matching data set.
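One way to realize S3 is to flatten the sentence-pair data set into a plain-text corpus that a domain-adaptive pre-training step could consume. The file name and example pairs below are illustrative assumptions, not details taken from the patent.

```python
# Flatten sentence pairs into a deduplicated plain-text pre-training corpus.
pairs = [
    ("What is the difference between A-shares and B-shares?",
     "How do A-shares and B-shares differ?"),
    ("What are concept stocks?",
     "What is the meaning of concept stocks?"),
]

# Deduplicate, then sort for a deterministic file layout.
corpus = sorted({question for pair in pairs for question in pair})

# One question per line; an MLM pre-training loader could read this file.
with open("pretrain_corpus.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(corpus))
```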
S4, training the sentence-pair matching model using an unsupervised model and a supervised deep-learning model. The unsupervised model comprises a WMD model and an SIF model; the supervised model comprises a BERT model, an ALBERT model, and a RoBERTa model. On top of the per-model prediction probabilities, an online fusion scheme combines linear regression and XGBoost: the linear regression, which can be retrained and updated in real time, carries a 40% weight, while the XGBoost model, trained in advance, carries a 60% weight. The 60% XGBoost share ensures the stability of the fused model, and the 40% linear-regression share ensures its flexibility.
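The 40%/60% fusion described above can be sketched as a weighted average of two classifiers' predicted probabilities. To keep the example self-contained, it substitutes scikit-learn's `GradientBoostingClassifier` for XGBoost and a logistic regression for the linear model, and trains both on synthetic features; the feature layout and data are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic match/no-match training data: each row could be the similarity
# scores produced by the individual matching models.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X.sum(axis=1) > 0).astype(int)

lr = LogisticRegression().fit(X, y)                        # online model, 40% weight
gb = GradientBoostingClassifier(random_state=0).fit(X, y)  # pre-trained model, 60% weight

def fused_prob(x):
    """Weighted fusion: 0.4 * linear model + 0.6 * boosted trees."""
    x = np.atleast_2d(x)
    return 0.4 * lr.predict_proba(x)[:, 1] + 0.6 * gb.predict_proba(x)[:, 1]

p = fused_prob(X[:5])
```

Because both components output probabilities in [0, 1], the convex combination is itself a valid matching probability.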
S5, after the sentence-pair matching model is trained, receiving a question input by the user, feeding its text into the sentence-pair matching model, identifying the question that best matches the input together with its corresponding answer, and replying with that answer to the user.
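A minimal sketch of the S5 answering flow, with a trivial token-overlap score standing in for the trained sentence-pair matching model; the FAQ entries and answers are illustrative placeholders.

```python
import re

def tokens(s: str) -> set:
    return set(re.findall(r"[a-z]+(?:-[a-z]+)*", s.lower()))

def similarity(a: str, b: str) -> float:
    """Jaccard token overlap; a stand-in for the trained sentence-pair model."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

# Illustrative FAQ: standard question -> answer.
faq = {
    "What is the difference between A-shares and B-shares?":
        "A-shares are traded in CNY; B-shares are traded in foreign currency.",
    "What are concept stocks?":
        "Stocks grouped together because they share a market theme.",
}

def answer(user_question: str):
    """Find the best-matching standard question and reply with its answer."""
    best = max(faq, key=lambda q: similarity(user_question, q))
    return best, faq[best]

matched, reply = answer("How do A-shares and B-shares differ?")
```

In the patented system the `similarity` step would be the fused output of the WMD, SIF, and BERT-family models rather than token overlap.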
The multi-model fused FAQ matching device disclosed by the invention comprises a financial-education corpus database, a manual labeling module, a similar-question generation module, and an NLU module.
The financial-education corpus database is used for storing pre-entered FAQ corpus data and generating a training text set. The training text set comprises text whose length is limited to 3-50 characters, with emoticons, numbers, and e-mail text removed.
The manual labeling module allows an operator to manually label a small number of similar questions in the training text set.
The similar-question generation module comprises a similar-question generation model, which generates a large number of similar questions from the labeled similar questions; these are reviewed manually to construct the sentence-pair matching data set.
The NLU module is used for training the sentence-pair matching model, matching a question input by the user with the trained model, finding the best-matching question, and replying to the user with that question and its corresponding answer. The NLU module comprises an unsupervised model and a supervised model: the unsupervised model comprises a WMD model and an SIF model, and the supervised model comprises BERT, ALBERT, and RoBERTa.
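Of the listed models, the SIF (smooth inverse frequency) baseline is simple enough to sketch directly: each sentence becomes a frequency-weighted average of its word vectors, after which the first principal component of the sentence matrix is removed. The toy vocabulary, random word vectors, and uniform word frequencies below are assumptions; real usage would load pretrained embeddings and corpus word statistics.

```python
import numpy as np

# Toy word vectors and frequencies standing in for pretrained resources.
rng = np.random.default_rng(0)
vocab = ["what", "is", "a", "share", "stock", "concept", "difference"]
vectors = {w: rng.normal(size=16) for w in vocab}
freq = {w: 1.0 / len(vocab) for w in vocab}  # uniform stand-in frequencies
a = 1e-3                                     # SIF smoothing parameter

def sif_embed(sentences):
    """Weighted-average embeddings with the first principal component removed."""
    embs = []
    for s in sentences:
        words = [w for w in s.lower().split() if w in vectors]
        weights = [a / (a + freq[w]) for w in words]
        embs.append(np.average([vectors[w] for w in words], axis=0, weights=weights))
    embs = np.vstack(embs)
    # Remove the common component along the first right singular vector.
    u = np.linalg.svd(embs, full_matrices=False)[2][0]
    return embs - np.outer(embs @ u, u)

E = sif_embed(["what is a share", "what is a stock", "concept difference"])
```

The WMD model, by contrast, would compare sentences by the minimum-cost transport between their word vectors rather than by a single pooled embedding.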
The above modules are deployed in an online environment on two RTX 6000 24 GB GPU servers. The online environment optimizes service performance to handle high concurrency, keeping the response time within 300 ms. The optimizations include parallel computation in the BERT preprocessing step, hot loading of trained models, and parallel computation across the multiple models.
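The "parallel computation across the multiple models" optimization can be sketched with a thread pool that fans a question out to every matcher at once and fuses the scores afterwards. The model names, simulated latencies, and scores below are placeholders, not measurements from the described system.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def make_model(name, delay, score):
    """Build a fake matcher with a fixed simulated inference latency."""
    def run(question):
        time.sleep(delay)  # simulated per-model inference time
        return name, score
    return run

models = [make_model("wmd", 0.05, 0.71),
          make_model("sif", 0.05, 0.64),
          make_model("bert", 0.05, 0.88)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(models)) as pool:
    scores = dict(pool.map(lambda m: m("How do A-shares differ?"), models))
elapsed = time.perf_counter() - start
# Wall time tracks max(delays) rather than sum(delays).
```

The same fan-out pattern applies to the parallelized BERT preprocessing, where tokenization of a batch can be split across workers.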
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (6)

1. A multi-model fused FAQ matching method, characterized in that the method comprises the following steps:
S1, obtaining a training text set of questions to be processed, and extracting and summarizing financial-education knowledge points with the assistance of BERT-encoder + DBSCAN clustering, thereby constructing a financial-education FAQ and manually labeling a small number of similar questions;
S2, generating a large number of similar questions from the labeled similar questions using a similar-question generation module, reviewing them manually, and constructing a sentence-pair matching data set;
S3, constructing a pre-training model data set;
S4, training a sentence-pair matching model using an unsupervised model and a supervised deep-learning model;
S5, after the sentence-pair matching model is trained, receiving a question input by the user, feeding its text into the sentence-pair matching model, identifying the question that best matches the input together with its corresponding answer, and replying with that answer to the user.
2. The multi-model fused FAQ matching method as claimed in claim 1, characterized in that the training text set comprises text whose length is limited to 3-50 characters, with emoticons, numbers, and e-mail text removed.
3. The multi-model fused FAQ matching method as claimed in claim 1, characterized in that the unsupervised model comprises a WMD model and an SIF model;
the supervised model comprises a BERT model, an ALBERT model, and a RoBERTa model.
4. A multi-model fused FAQ matching device, characterized in that it comprises:
a financial-education corpus database for storing pre-entered FAQ corpus data and generating a training text set;
a manual labeling module for an operator to manually label a small number of similar questions in the training text set;
a similar-question generation module for generating a large number of similar questions from the labeled similar questions, performing manual review, and constructing a sentence-pair matching data set;
and an NLU module for training the sentence-pair matching model, matching a question input by the user with the trained model, finding the best-matching question, and replying to the user with the best-matching question and its corresponding answer.
5. The multi-model fused FAQ matching device according to claim 4, characterized in that the training text set comprises text whose length is limited to 3-50 characters, with emoticons, numbers, and e-mail text removed.
6. The multi-model fused FAQ matching device according to claim 4, characterized in that the NLU module comprises an unsupervised model and a supervised model;
the unsupervised model comprises a WMD model and an SIF model;
the supervised model comprises a BERT model, an ALBERT model, and a RoBERTa model.
CN202010852824.7A 2020-08-22 2020-08-22 Multi-model fused FAQ matching method and device Pending CN112101423A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010852824.7A CN112101423A (en) 2020-08-22 2020-08-22 Multi-model fused FAQ matching method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010852824.7A CN112101423A (en) 2020-08-22 2020-08-22 Multi-model fused FAQ matching method and device

Publications (1)

Publication Number Publication Date
CN112101423A true CN112101423A (en) 2020-12-18

Family

ID=73754202

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010852824.7A Pending CN112101423A (en) 2020-08-22 2020-08-22 Multi-model fused FAQ matching method and device

Country Status (1)

Country Link
CN (1) CN112101423A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113505207A (en) * 2021-07-02 2021-10-15 中科苏州智能计算技术研究院 Machine reading understanding method and system for financial public opinion research and report
CN114117022A (en) * 2022-01-26 2022-03-01 杭州远传新业科技有限公司 FAQ similarity problem generation method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110347835A (en) * 2019-07-11 2019-10-18 招商局金融科技有限公司 Text Clustering Method, electronic device and storage medium
CN110727779A (en) * 2019-10-16 2020-01-24 信雅达系统工程股份有限公司 Question-answering method and system based on multi-model fusion
CN111191442A (en) * 2019-12-30 2020-05-22 杭州远传新业科技有限公司 Similar problem generation method, device, equipment and medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110347835A (en) * 2019-07-11 2019-10-18 招商局金融科技有限公司 Text Clustering Method, electronic device and storage medium
CN110727779A (en) * 2019-10-16 2020-01-24 信雅达系统工程股份有限公司 Question-answering method and system based on multi-model fusion
CN111191442A (en) * 2019-12-30 2020-05-22 杭州远传新业科技有限公司 Similar problem generation method, device, equipment and medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113505207A (en) * 2021-07-02 2021-10-15 中科苏州智能计算技术研究院 Machine reading understanding method and system for financial public opinion research and report
CN113505207B (en) * 2021-07-02 2024-02-20 中科苏州智能计算技术研究院 Machine reading understanding method and system for financial public opinion research report
CN114117022A (en) * 2022-01-26 2022-03-01 杭州远传新业科技有限公司 FAQ similarity problem generation method and system
CN114117022B (en) * 2022-01-26 2022-05-06 杭州远传新业科技有限公司 FAQ similarity problem generation method and system

Similar Documents

Publication Publication Date Title
US12010073B2 (en) Systems and processes for operating and training a text-based chatbot
CN110555095B (en) Man-machine conversation method and device
CN111708869B (en) Processing method and device for man-machine conversation
CN117009490A (en) Training method and device for generating large language model based on knowledge base feedback
US11487952B2 (en) Method and terminal for generating a text based on self-encoding neural network, and medium
CN110175229B (en) Method and system for on-line training based on natural language
CN110781681B (en) Automatic first-class mathematic application problem solving method and system based on translation model
CN112101423A (en) Multi-model fused FAQ matching method and device
CN113434688B (en) Data processing method and device for public opinion classification model training
CN114676255A (en) Text processing method, device, equipment, storage medium and computer program product
CN109508367A (en) Automatically extract the method, on-line intelligence customer service system and electronic equipment of question and answer corpus
CN113326367A (en) Task type dialogue method and system based on end-to-end text generation
CN112287085A (en) Semantic matching method, system, device and storage medium
CN111523328A (en) Intelligent customer service semantic processing method
CN114330318A (en) Method and device for recognizing Chinese fine-grained entities in financial field
CN113051388A (en) Intelligent question and answer method and device, electronic equipment and storage medium
Kaviya et al. Artificial intelligence based farmer assistant chatbot
Sawant et al. Analytical and Sentiment based text generative chatbot
CN111488448A (en) Method and device for generating machine reading marking data
CN114579706B (en) Automatic subjective question review method based on BERT neural network and multi-task learning
CN116362331A (en) Knowledge point filling method based on man-machine cooperation construction knowledge graph
CN114610743A (en) Structured query language statement processing method, system, device, and medium
CN113886521A (en) Text relation automatic labeling method based on similar vocabulary
Liang et al. Intelligent chat robot in digital campus based on deep learning
Buyrukoğlu et al. A Novel Semi-Automated Chatbot Model: Providing Consistent Response of Students’ Email in Higher Education based on Case-Based Reasoning and Latent Semantic Analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination