CN111125334B - Search question-answering system based on pre-training - Google Patents
- Publication number
- CN111125334B (application CN201911341560.2A)
- Authority
- CN
- China
- Prior art keywords
- question
- questions
- module
- knowledge
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a search question-answering system based on pre-training, which comprises a noise judgment module, a QA question-answering module, a knowledge matching module and a response output module. The noise judgment module judges whether a user question is noise; the QA question-answering module comprises a rule entry unit and a rule parsing unit; the knowledge matching module indexes the questions and knowledge in a question-answer library and ranks them by similarity, the knowledge index comprising two modes, an inverted index and an Annoy index; and the response output module outputs responses, which comprise a similar-question list, accurate answers, no answer, and recommended hot questions. The invention effectively solves the problems of knowledge generalization and migration, noise judgment and QA customization, and greatly improves question-answering efficiency while improving user experience.
Description
Technical Field
The invention relates to a search question-answering system based on pre-training, that is, a system that answers client questions by means of a pre-trained language model and an existing question-answer database, and belongs to the fields of natural language processing and machine learning.
Background
A search question-answering system receives a user question, searches for and ranks similar questions in a question-answer knowledge base, and displays a list of similar questions from which the user can choose, so as to resolve the user's question as far as possible. This question-answering mode is currently used in knowledge-base question-answering systems, intelligent customer-service assistants, self-service machines and other terminal devices. Unlike traditional dialogue systems, which focus on interaction, search question-answering systems focus on returning a more accurate list of similar questions; they need to maintain neither as much context state as a dialogue system nor fully accurate question-answer responses. On terminal devices, speech-recognition accuracy and the way clients phrase questions remain key factors restricting the development of dialogue systems. A search question-answering system alleviates this in a simple and efficient way by recommending a list of similar questions, but it still faces low accuracy in the top-5 recommendations. The main factors behind this low top-5 accuracy are three: spoken-language generalization, noise interference, and accurate answering of fixed questions.
A search question-answering system usually holds a large amount of data for a single industry, while users' question forms are highly diverse. User questions may come from terminal speech-recognition devices (with noise interference), industry-website inquiries and so on, and tend to be colloquial. Taking the tax industry as an example, instead of "pay tax" a user is more likely to say "pay more money". In practice, industry clients often provide only question-answer pairs, i.e., one standard question corresponding to one standard answer, with no generalization of the standard question; when the system receives a user question, it can only match it against the questions in the question-answer knowledge base one by one. Search question-answering systems therefore also face this large volume of spoken-language generalization problems.
A search question-answering system recommends a list of similar questions, but it cannot respond to every question. One solution is to set a minimum threshold on the similarity between the user question and the questions in the question-answer knowledge base, below which no response is made. However, because unrelated questions can still show high similarity to questions in the knowledge base, it is difficult to select a threshold that reliably filters out noise.
In addition, a search question-answering system may still face customized question-answers: when the user enters a fixed question, the system should accurately return a fixed answer instead of a similar-question list, for example page-jump instructions of a terminal device, common announcements, operation instructions, and the like. The search question-answering system therefore also needs a customized QA question-answering function.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a search question-answering system based on pre-training, which can effectively solve the problems of knowledge generalization migration, noise judgment and QA customization, and greatly improve question-answering efficiency while improving user experience.
In order to solve the technical problems, the invention adopts the following technical scheme: a search question-answering system based on pre-training comprises a noise judgment module, a QA question-answering module, a knowledge matching module and a response output module. The noise judgment module judges whether the user question is noise by means of an industry word stock and an exclusion word stock: when the user question contains an industry word and contains no exclusion word, it is determined to be non-noise and passed to the QA question-answering module for analysis; otherwise it is determined to be noise, and the response output module returns a hot recommended question or no response;
the QA question-answering module comprises a rule entry unit and a rule parsing unit; the rule parsing unit analyzes the user question against the rules entered through the rule entry unit and judges whether an answer must be accurately returned: if so, the response output module outputs the standard answer corresponding to the question; if not, the analyzed question is sent to the knowledge matching module;
the knowledge matching module indexes the knowledge in the question-answer library and performs similarity ranking; the knowledge index comprises an inverted index and an Annoy index, the Annoy index being based on a semantic model obtained through training-data generation and fine-tuning on top of a pre-training model, with the outputs of the last and second-to-last layers of the semantic model used as question vectors for the Annoy index; when similarity is calculated and ranked, vector similarity, question frequency and text alignment ratio are considered together;
the response output module is used for outputting responses, and the responses comprise four types of similar question lists, accurate answers, no answers and recommended hot questions.
Further, the exclusion word stock of the noise judgment module is obtained through manual screening and later log maintenance, and the industry word stock of the noise judgment module is obtained as follows: A1) first, training data are collected, the sources comprising the question-answer knowledge base and other industry-data questions crawled from network resources; A2) words are segmented using the precise mode of jieba word segmentation; the word frequency TF is calculated on the industry question-answer knowledge base, the inverse document frequency IDF of each word is calculated over all the data, and the word weight W is calculated from them:
TF(w) = (occurrences of w in the industry question-answer knowledge base) / (total number of words in the knowledge base);
IDF(w) = log(N / (1 + number of documents containing w)), where N is the total number of documents;
W = TF * IDF;
a3 B), selecting a proper number of words as industry words according to the word weight calculated in the step b, or selecting the industry words by setting a lowest threshold value, splitting a split phrase in the industry words or adding the industry words into custom word segmentation of the crust word segmentation, and improving the weight of the split phrase to ensure that the industry words can be correctly segmented;
a4 A plurality of spoken abbreviations or other non-conventional industry words are provided by professionals to form a final industry word stock.
Further, the rule entry unit of the QA question-answering module supports the entry of logic expressions, brackets, logic nesting, number parsing and entities; for an entered logic expression, the rule parsing unit first matches the question against the rule expression to form a logic expression containing only 1s and 0s, then evaluates the logic expression through a rule parsing algorithm and outputs whether it matches.
Further, the logic expression is evaluated by the rule parsing algorithm as follows: the logic expression is pushed onto a number stack and an operator stack for recursive calculation, with the operation priority: brackets > AND operation > OR operation, and the operation rules: 1&1=1; 1&0=0&1=0; 1|0=0|1=1; 1|1=1; 0|0=0, where & denotes the AND operation and | denotes the OR operation; in the logic expression, the operator AND is replaced with & and the operator OR with |.
Furthermore, the inverted index in the knowledge matching module is obtained by applying jieba word segmentation and stop-word removal to the questions in the question-answer knowledge base; the Annoy index in the knowledge matching module is obtained by constructing several high-dimensional nearest-neighbor search trees over the knowledge vectors of the question-answer knowledge base; the knowledge vectors are produced by the semantic model, which is obtained through training-data generation and fine-tuning on top of the pre-training model.
Further, the semantic model is obtained from the pre-training model as follows: B1) a fully connected layer and a softmax layer are added after the output layer of the pre-training model, and when knowledge vectors are constructed, the outputs of the last and second-to-last layers of the pre-training model are used as question vectors respectively; B2) training data are built through text enhancement; B3) the pre-training model processed in step B1 is fine-tuned on the training data built in step B2, yielding the semantic model.
Further, in step B2, training data are constructed through text enhancement as follows: C1) a question-answer knowledge base is built from what the industry user supplies and then supplemented, the supplementary sources comprising common question-answers crawled from industry websites and manual supply; C2) synonym extraction, comprising industry-thesaurus extraction and synonym-group extraction: the question-answer knowledge base and network data are crawled, industry keywords are constructed after tf-idf weight calculation and manual screening, and synonyms of each keyword are then extracted using the Tencent word vectors, the extraction rule being to take the top-10 most similar words excluding the keyword itself, with cosine similarity as the measure; the final synonym groups are obtained after manual screening and de-duplication; C3) similar questions are constructed from the questions in the question-answer knowledge base through synonym replacement; C4) back translation: the questions in the question-answer knowledge base are translated into English through a translation open platform, translated back into Chinese, and then manually screened and labeled; C5) synonymous question pairs are obtained through text enhancement, and non-synonymous question pairs through unanswered questions, random combination of questions and non-synonym replacement, thereby constructing the training data and validation set.
Further, the pre-training model is a Chinese pre-training RoBERTa or XLNet model.
Further, the formula for calculating question similarity is:
S(Q_u, Q_k) = μ·cos(V_u, V_k) - γ_1·L(Q_u, Q_k) + γ_2·C(Q_k)/max(C(Q_k))
wherein Q_u and Q_k are the user question and an indexed knowledge-base question, and S(Q_u, Q_k) denotes their similarity; V_u and V_k are the question vectors obtained by feeding the user question and the indexed question into the semantic model; cos(V_u, V_k) is the cosine similarity of the two vectors; L(Q_u, Q_k) is the absolute difference between the word-segmentation counts of the two questions; γ_1, γ_2 and μ are coefficients with γ_1 ∈ (0, 0.1), γ_2 ∈ (0, 0.1) and μ ∈ (0, 1); C(Q_k) is the number of times question Q_k has been asked, obtained from log statistics; and max(C(Q_k)) is the query count of the most frequently asked question.
Further, if the user question is judged to be noise, the response output module returns a hot recommended question, chosen by question frequency, or no response; if, after parsing by the QA module, the user question matches a rule requiring an accurate answer, the response output module outputs the standard answer corresponding to the question; if the user question is non-noise and matches no QA rule, the knowledge matching module returns a similar-question list for the user to choose from. In addition, the system maintains the state of the similar-question list, and the user resolves the question by clicking or by a position instruction.
The invention has the beneficial effects that: the search question-answering system based on pre-training solves knowledge generalization and migration by fine-tuning a pre-training model, solves noise judgment through statistical learning, and solves the customized QA problem through a rule parsing algorithm, greatly improving question-answering efficiency while improving user experience.
Drawings
FIG. 1 is a diagram of an architecture of a search question-answering system;
FIG. 2 is a knowledge index acquisition flow chart;
FIG. 3 is a semantic model acquisition flow chart;
fig. 4 is a training data generation flow chart.
Detailed Description
The invention will be further described with reference to the drawings and the specific examples.
Fig. 1 shows the architecture of the whole search question-answering system, comprising a noise judgment module, a QA question-answering module, a knowledge matching module and a response output module. When the system receives a user question, it first judges whether the question is noise; if it is non-noise, QA question-answer matching is performed; if no QA rule produces a response, the question is matched against the knowledge-base index, the results are ranked by similarity, and the top 5 similar questions are selected and returned to the response output module, which finally gives the question-answering response.
In this embodiment, the noise judgment module judges whether the user question is noise by means of the industry word stock and the exclusion word stock: when the user question contains an industry word and contains no exclusion word, it is determined to be non-noise and passed to the QA question-answering module for analysis; otherwise it is determined to be noise, and the response output module returns a hot recommended question or no response.
The exclusion word stock of the noise judgment module is constructed manually from industry experience and uniformly filters out irregular questions such as sensitive information and speech-recognition errors. The exclusion word stock is continuously enriched during subsequent use, through log statistics and similar processes.
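As a minimal sketch, the containment test described above can be written as follows; the word stocks and the whitespace tokenizer are hypothetical stand-ins for the curated lexicons and the jieba segmentation used by the system.

```python
# Sketch of the noise-judgment rule: a question is non-noise only if it
# contains at least one industry word and no exclusion word.
# The word stocks and the whitespace tokenizer are illustrative assumptions.

INDUSTRY_WORDS = {"tax", "invoice", "deduction"}   # hypothetical industry word stock
EXCLUSION_WORDS = {"weather", "joke"}              # hypothetical exclusion word stock

def is_noise(question: str) -> bool:
    tokens = set(question.lower().split())
    has_industry = bool(tokens & INDUSTRY_WORDS)
    has_exclusion = bool(tokens & EXCLUSION_WORDS)
    # noise = anything that is not (industry word present AND no exclusion word)
    return not (has_industry and not has_exclusion)
```

A noisy question such as "tell me a joke" is rejected, while "how to print an invoice" passes through to the QA module.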
The process of obtaining the noise judgment module industry word stock is as follows:
a1 Firstly, counting training data, wherein the training data sources comprise a question-answer knowledge base and other industry data problems crawled through network resources;
a2 Using a precise model of barker word segmentation to segment words, calculating word frequency TF based on an industry question-answer knowledge base, calculating inverse document frequency IDF of words based on all data, calculating word weight W based on the word frequency TF and the inverse document frequency IDF, and calculating formulas of the word frequency TF, the inverse document frequency IDF and the word weight W are respectively as follows:
W=TF*IDF;
a3 B), selecting a proper number of words as industry words according to the word weight calculated in the step b, or selecting the industry words by setting the lowest threshold value, wherein the split phrase in the industry words needs to be split as fine as possible when constructing the index, such as 'paying instead of deduction', and the like, and the barking word can be split into one word, but if a user only says 'paying instead of deduction', the problem cannot be located through the inverted index, and therefore the split into 'paying instead of deduction' is needed. In addition, the industry word is required to be added into the custom word segmentation of the bargain word, and the weight of the custom word segmentation is increased, so that the industry word can be correctly segmented;
a4 A plurality of spoken abbreviations or other non-conventional industry words are provided by professionals to form a final industry word stock.
For conventional question-answers such as instruction intents (e.g., instructions controlling front-end page jumps), industry or unit introductions, greetings and the like, an answer must be returned accurately, so the QA question-answering module is provided.
The QA question-answering module comprises a rule input unit and a rule analysis unit, wherein the rule analysis unit analyzes the user questions input by the rule input unit and judges whether the analyzed user questions need to accurately return answers, if so, the response output module outputs standard answers corresponding to the questions, and if not, the analyzed questions are sent to the knowledge matching module.
In order to improve data reuse and reduce data entry and maintenance work, common information can be factored out and defined as an entity. Taking invoice printing as an example, an entity of type @invoice-category can be defined as "water charge OR electricity charge OR gas charge" and used directly when rules are entered.
In this embodiment, the rule entry unit of the QA question-answering module supports logic expressions (AND, OR, NOT), brackets, logic nesting, number parsing, entity entry and the like, so that various complex logic rules can be implemented. Taking invoice printing as an example, the intent may be entered as "(water OR electricity OR gas) AND invoice", or using the entity as "{@invoice-category} AND invoice". In addition, this embodiment provides system entities (numbers, places, etc.); for the number entity in particular, numeric logic rules (greater than, less than, equal to, etc.) may be entered.
For an entered logic expression, the rule parsing unit first matches the question against the rule expression to form a logic expression containing only 1 (True) and 0 (False) (AND is replaced with &, OR with |, and Chinese brackets with English brackets), then evaluates the logic expression through the rule parsing algorithm and outputs whether it matches (1 for a match, 0 for no match).
Specifically, the logic expression is evaluated by the rule parsing algorithm as follows: the logic expression is pushed onto a number stack and an operator stack and evaluated recursively, in the same way as ordinary arithmetic expressions, with the operation priority: brackets > AND operation > OR operation,
the operation rule is as follows:
1&1=1;
1&0=0&1=0;
1|0=0|1=1;
1|1=1;
0|0=0,
where & denotes the AND operation and | denotes the OR operation; in the logic expression, the operator AND is replaced with & and the operator OR with |.
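A sketch of the two stages described above: the question is matched against the rule's keywords to form a 1/0 expression, which is then evaluated with a value stack and an operator stack honoring the priority brackets > & > |. The rule strings and the substring matching are illustrative assumptions.

```python
def rule_to_expr(question: str, rule: str) -> str:
    """Replace each rule keyword with 1/0 depending on its presence in the
    question; AND/OR become &/| (hypothetical rule format)."""
    out = []
    for tok in rule.replace("(", " ( ").replace(")", " ) ").split():
        if tok == "AND":
            out.append("&")
        elif tok == "OR":
            out.append("|")
        elif tok in "()":
            out.append(tok)
        else:
            out.append("1" if tok in question else "0")
    return "".join(out)

def eval_logic(expr: str) -> int:
    """Evaluate an expression of 1/0 with & (AND), | (OR) and brackets,
    using a value stack and an operator stack; & binds tighter than |."""
    prec = {"|": 1, "&": 2}
    vals, ops = [], []

    def pop_while(cond):
        while ops and cond(ops[-1]):
            op = ops.pop()
            b, a = vals.pop(), vals.pop()
            vals.append((a & b) if op == "&" else (a | b))

    for ch in expr.replace(" ", ""):
        if ch in "01":
            vals.append(int(ch))
        elif ch == "(":
            ops.append(ch)
        elif ch == ")":
            pop_while(lambda op: op != "(")
            ops.pop()  # discard the matching "("
        else:  # & or |
            pop_while(lambda op: op != "(" and prec[op] >= prec[ch])
            ops.append(ch)
    pop_while(lambda op: True)
    return vals[0]
```

For the invoice rule, "print water invoice" yields "(1|0)&1", which evaluates to 1 (match), while "check my gas bill" yields 0.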
The knowledge matching module indexes and sorts the knowledge in the question and answer library, the knowledge index comprises an inverted index and an Annoy index, the Annoy index is based on a semantic model, the semantic model is obtained through training data generation and fine-tuning on the basis of a pre-training model, the output of the last layer and the last but one layer of the semantic model is used as a question vector to carry out Annoy index, and when the similarity is calculated and sorted, the vector similarity, the question and answer frequency and the text alignment ratio are comprehensively considered.
In this embodiment, the specific implementation process of the knowledge matching module is:
1. Training set construction: as shown in Fig. 4, the training data are constructed mainly from three parts: synonym replacement, back translation and random combination. The steps are as follows:
c1 Building a question-answer knowledge base according to industry user supply and supplementing the question-answer knowledge base, wherein the supplementing sources comprise common question-answers crawled in an industry website and manual supply;
c2 Synonym extraction, including industry thesaurus extraction and synonym phrase extraction; crawling a question-answer knowledge base and network data, constructing industry keywords after tf-idf weight calculation and manual screening, then extracting synonyms of the keywords by using a Tencent vector, wherein the extraction rule is that the top 10 synonyms of the keywords are not included, similarity is cosine similarity, and a final synonym phrase is obtained after manual screening and de-duplication;
the synonymous phrases prior to screening are as follows:
innovative enterprise and civil enterprise innovation enterprise industry and civil enterprise manufacturing industry manufacturing enterprise investment organization products and service medicine enterprise
The screening was followed as follows:
medicine enterprise for enterprise civil enterprise company
C3 Constructing similar questions of questions in a question-answering knowledge base through synonym replacement, replacing synonyms in the questions by using the questions in the knowledge base to expand the number of the questions, and then removing unreasonable questions and marking whether the questions are similar or not;
c4 Performing back translation, namely traversing all the problems in the knowledge base, calling a hundred-degree translation open platform interface to perform English translation on the problems in the knowledge base, performing back translation into Chinese, and finally removing unreasonable problems and marking whether the problems are similar;
c5 The number of similar question pairs is arranged, all questions (questions without answers, random combination of questions, non-synonym replacement to obtain non-synonymous question-answer pairs) are randomly combined, the same number of dissimilar question pairs are constructed, and the random disorder is followed by training set. The ratio of training set to validation set was 9:1.
2. Performing fine-tuning on the pre-training model based on training data to obtain a semantic model, wherein the specific process is as shown in fig. 3:
b1 The full-join layer and the softmax layer are added after the output layer of the pre-training model, and the probability of similarity and dissimilarity is output after the softmax layer. The input of the model is a problem pair in training data, the loss function uses cross entropy, the gradient descent uses Adam gradient descent, f1 and accuracy are adopted as comprehensive indexes, and the optimal semantic model is selected through a verification set.
When the knowledge vectors are constructed, the knowledge-base questions are traversed and the outputs (768-dimensional) of the model's last layer (higher top-1 precision) and second-to-last layer (higher recall) are used as question vectors respectively; the client may choose between them, and the system uses the second-to-last layer's output by default.
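A toy sketch of taking one layer's output as the question vector; the patent does not state the pooling strategy, so mean pooling over token vectors is assumed here, with tiny 2-d vectors standing in for the real 768-d hidden states.

```python
def question_vector(hidden_states, layer=-2):
    """Pool one transformer layer's token vectors into a single question
    vector by mean pooling (an assumption; the system may pool differently).
    `hidden_states` is a list of layers, each a list of token vectors;
    layer=-2 selects the second-to-last layer, the system default here."""
    tokens = hidden_states[layer]
    dim = len(tokens[0])
    return [sum(tok[d] for tok in tokens) / len(tokens) for d in range(dim)]
```

With a real model, `hidden_states` would come from the fine-tuned encoder's per-layer outputs; the resulting vectors are what the Annoy trees are built over.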
Compared with the original BERT model, the pre-training model used here is trained on more data for more steps, uses dynamic whole-word masking, and drops the next-sentence prediction task, giving better results on various NLP tasks. In the search system, tests conducted before fine-tuning show the pre-training model improving the top-1 hit rate by 20.5% and the top-5 hit rate by 11.5% over the plain BERT model.
3. Knowledge index construction mainly comprises the inverted index and the Annoy index. The inverted index is fast to build, and updating it on knowledge insertion and deletion is cheap, so it suits question-answering over small knowledge bases (preferably within ten thousand entries) and improves question recall. The Annoy index is slow to build and does not support index updates when the knowledge changes, but queries are fast over large amounts of data, so it suits index search over large base knowledge bases (up to millions of entries); the user can choose the index according to the size and category of the knowledge base. In general, the industry base question-answer library contains many questions and changes rarely, so it is shared across different clients in the same industry and uses the Annoy index, while for the customized (dynamic) questions of individual clients the inverted index is used.
As shown in Fig. 2, the inverted index is obtained by applying jieba word segmentation and stop-word removal to the questions in the question-answer knowledge base. The Annoy index is obtained by constructing several high-dimensional nearest-neighbor search trees over the knowledge vectors of the question-answer knowledge base; the knowledge vectors are produced by the semantic model, which is obtained through training-data generation and fine-tuning on top of the pre-training model.
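The inverted-index side can be sketched as a token-to-question-id map; the stop-word set and whitespace tokens stand in for the curated stop-word list and jieba segmentation, and the Annoy side would be built separately over the semantic-model vectors.

```python
from collections import defaultdict

def build_inverted_index(questions, stopwords=frozenset({"the", "a", "to"})):
    """Sketch of the inverted index: map each non-stop-word token to the
    ids of knowledge-base questions containing it (jieba segmentation and
    a curated stop-word list in the real system; whitespace tokens here)."""
    index = defaultdict(set)
    for qid, q in enumerate(questions):
        for tok in q.lower().split():
            if tok not in stopwords:
                index[tok].add(qid)
    return index

def lookup(index, user_question):
    """Candidate question ids sharing at least one token with the user question."""
    hits = set()
    for tok in user_question.lower().split():
        hits |= index.get(tok, set())
    return hits
```

Insertion and deletion only touch the affected token sets, which is why this index suits the small, frequently updated customized question sets.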
In this embodiment, the pre-training model is a chinese pre-training RoBERTa or XLNet model.
4. Question similarity calculation and ranking: cosine similarity is calculated between the user question vector and the knowledge vectors in the knowledge base, features such as hot-question statistics and text alignment ratio are taken into account, and the results are ranked by the combined similarity.
The similarity calculation formula is:
S(Q_u, Q_k) = μ·cos(V_u, V_k) - γ_1·L(Q_u, Q_k) + γ_2·C(Q_k)/max(C(Q_k))
wherein Q_u and Q_k are the user question and an indexed knowledge-base question, and S(Q_u, Q_k) denotes their similarity; V_u and V_k are the question vectors obtained by feeding the user question and the indexed question into the semantic model; cos(V_u, V_k) is the cosine similarity of the two vectors; γ_1, γ_2 and μ are coefficients with γ_1 ∈ (0, 0.1), γ_2 ∈ (0, 0.1) and μ ∈ (0, 1); C(Q_k) is the number of times question Q_k has been asked, obtained from log statistics; and max(C(Q_k)) is the query count of the most frequently asked question.
The first part of the formula, the cosine similarity between the user question vector and the index question vector, is the dominant term; the second part considers the alignment of the two questions' word-segmentation counts, with L(Q_u, Q_k) the absolute difference between them; the third part considers the query frequency of common questions. The combined values are normalized to similarity scores in 0-1 by a softmax function and then ranked.
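A sketch of a three-part score consistent with the description above: cosine similarity dominates, a penalty grows with the word-count gap, and frequently asked questions receive a small boost. The exact combination and the coefficient values are assumptions, not the patent's verbatim formula.

```python
import math

def cosine(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def score(v_u, v_k, len_u, len_k, freq_k, max_freq, mu=0.9, g1=0.05, g2=0.05):
    """Three-part ranking score (one consistent reading of the description):
    mu weights the vector similarity, g1 penalizes the absolute difference
    in word-segmentation counts, g2 boosts frequently asked questions.
    Coefficient values are illustrative, within the stated ranges for mu."""
    return (mu * cosine(v_u, v_k)
            - g1 * abs(len_u - len_k)
            + g2 * freq_k / max_freq)
```

In the full system the scores over all candidates would then be softmax-normalized to 0-1 before ranking and truncation to the top 5.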
The response output module is used for outputting responses, and the responses comprise four types of similar question lists, accurate answers, no answers and recommended hot questions.
Specifically, if the user question is judged to be noise, the response output module returns a hot recommended question, chosen by question frequency, or no response; if, after parsing by the QA module, the user question matches a rule requiring an accurate answer, the response output module outputs the standard answer corresponding to the question; if the user question is non-noise and matches no QA rule, the knowledge matching module returns a similar-question list for the user to choose from. In addition, the system maintains the state of the similar-question list, and the user resolves the question by clicking or by a position instruction.
According to the invention, a language model with a certain generalization capability is obtained by performing unsupervised learning on a large amount of data, and a question-vector index of the question-answer library is constructed from this language model. In addition, statistical information of the question-answer library such as tf (term frequency) and idf (inverse document frequency), together with manually added semantic matching rules, is taken into account to build a search question-answering system that responds to user questions. This improves the generalization capability of question answering while recommending highly relevant questions, greatly improving the efficiency of resolving questions and the user experience.
The embodiment is based on an intelligent search question-answering system. Parameter tuning of the pre-training model greatly improves the generalization capability of the model while increasing its sensitivity to industry knowledge; statistical learning and manual screening of the industry word stock effectively eliminate interference from non-industry and other irrelevant questions; and by entering QA rules into the analysis system and nesting it within the search question-answering system, with the rule input supporting entities (including numbers) and complex logic rules, accurate QA question answering is achieved. The invention effectively addresses pain points in current engineering applications such as knowledge generalization and transfer, noise judgment and QA customization, greatly improving question-answering efficiency while improving the user experience. The invention can also be used on intelligent devices such as mobile phones, computers, intelligent robots and self-service machines.
The foregoing description is only of the basic principles and preferred embodiments of the present invention, and modifications and alternatives thereto will occur to those skilled in the art to which the present invention pertains, as defined by the appended claims.
Claims (9)
1. A search question-answering system based on pre-training, characterized in that: the system comprises a noise judging module, a QA question-answering module, a knowledge matching module and a response output module;
the noise judging module judges whether the user question is noise by means of an industry word stock and an exclusion word stock: when the user question contains an industry word and contains no exclusion word, it is determined to be non-noise and passed to the QA question-answering module for analysis; otherwise it is determined to be noise, and the response output module returns a hot recommended question or no response;
the QA question-answering module comprises a rule input unit and a rule analysis unit; the rule analysis unit analyzes the user question against the rules entered through the rule input unit and judges whether an exact answer must be returned: if so, the response output module outputs the standard answer corresponding to the question, and if not, the analyzed question is sent to the knowledge matching module;
the knowledge matching module indexes the knowledge in the question-answer library and performs similarity sorting; the knowledge index comprises an inverted index and an Annoy index, the Annoy index being based on a semantic model obtained by generating training data and fine-tuning on the basis of a pre-training model; the output of the last layer or second-to-last layer of the semantic model is used as the question vector for the Annoy index, and when calculating and sorting similarity, vector similarity, query frequency and text alignment ratio are considered together;
the formula for calculating the similarity of the problems is as follows:
wherein Q is u ,Q k S (Q) u ,Q k ) Representing the similarity of user questions to questions indexed in the knowledge base, V u ,V k Problem vectors respectively input to the semantic model for user problems and queried by indexing, cos (V u ,V k ) Represents V u ,V k Cosine similarity of two vectors, gamma 1 、γ 2 Mu is a coefficient, wherein gamma 1 ∈(0,0.1),γ 2 ∈(0,0.1),μ∈(0,1);C(Q k ) For problem Q obtained by log statistics k Is a number of inquiries; l (Q) u ,Q k ) The absolute value of the word segmentation quantity difference value between the user problem and the index problem is calculated; max (C (Q) 1,2... ) A maximum number of interrogated questions;
the response output module is used for outputting responses, which comprise four types: a similar question list, an exact answer, no answer, and recommended hot questions.
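The noise judgment of claim 1 — a question is non-noise only if it contains at least one industry word and no exclusion word — can be sketched as follows. The function name and set-based lookup are illustrative assumptions:

```python
def is_noise(question_tokens, industry_words, exclusion_words):
    """Non-noise iff the segmented question contains an industry word
    and contains no exclusion word; noise otherwise."""
    has_industry = any(t in industry_words for t in question_tokens)
    has_exclusion = any(t in exclusion_words for t in question_tokens)
    return not (has_industry and not has_exclusion)
```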
2. The pre-training based search question-answering system according to claim 1, wherein: the exclusion word stock of the noise judging module is obtained through manual screening and later log maintenance, and the industry word stock of the noise judging module is obtained as follows: A1) first collecting training data, the sources of which include the question-answer knowledge base and other industry questions crawled from network resources; A2) segmenting words using the accurate mode of jieba word segmentation, calculating the term frequency TF based on the industry question-answer knowledge base, calculating the inverse document frequency IDF of each word over all the data, and calculating the word weight W from TF and IDF; assuming the standard definitions, the formulas for TF, IDF and W are respectively:
TF = (occurrences of the word in the industry question-answer knowledge base) / (total number of words);
IDF = log(total number of documents / (number of documents containing the word + 1));
W = TF * IDF;
a3 B), selecting a proper number of words as industry words according to the word weight calculated in the step b, or selecting the industry words by setting a minimum threshold value, splitting the split phrase in the industry words or adding the industry words into the custom word segmentation of the crust word segmentation, and improving the weight of the split phrase;
a4 A plurality of spoken abbreviations or other non-conventional industry words are provided by professionals to form a final industry word stock.
3. The pre-training based search question-answering system according to claim 1, wherein: the rule input unit of the QA question-answering module supports the input of logical expressions, brackets, nested logic, numeric parsing and entities; for an input logical expression, the rule analysis unit first matches the question against the rule expression to form a logical expression containing only 1s and 0s, then evaluates that expression through a rule analysis algorithm and outputs whether the rule matches.
4. A pre-training based search question-answering system according to claim 3, wherein: the logical expression is evaluated by the rule analysis algorithm as follows: the expression is pushed onto a digit stack and an operator stack for recursive calculation, with operation priority brackets > AND > OR, and operation rules 1&1=1; 1&0=0&1=0; 1|0=0|1=1; 1|1=1; 0|0=0, where & denotes the AND operation and | denotes the OR operation; in the logical expression, the operator AND is replaced with & and the operator OR is replaced with |.
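The evaluation of claim 4 can be sketched with a recursive-descent parser, which enforces the same precedence (brackets > & > |) as the stack-based recursion described; this is an equivalent formulation, not necessarily the patent's exact algorithm:

```python
def eval_logic(expr: str) -> int:
    """Evaluate a matched-rule expression containing only 1, 0, &, | and
    parentheses, e.g. '1&(0|1)'. Returns 1 (match) or 0 (no match)."""
    tokens = list(expr.replace(" ", ""))
    pos = 0

    def parse_or():
        nonlocal pos
        val = parse_and()
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            val = val | parse_and()  # OR binds loosest
        return val

    def parse_and():
        nonlocal pos
        val = parse_atom()
        while pos < len(tokens) and tokens[pos] == "&":
            pos += 1
            val = val & parse_atom()  # AND binds tighter than OR
        return val

    def parse_atom():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1            # consume "("
            val = parse_or()    # brackets bind tightest
            pos += 1            # consume ")"
            return val
        val = int(tokens[pos])  # a literal 1 or 0
        pos += 1
        return val

    return parse_or()
```

In practice a rule such as `invoice AND (void OR red-letter)` would first be reduced to 1s and 0s by term matching, then handed to this evaluator.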
5. The pre-training based search question-answering system according to claim 1, wherein: the inverted index in the knowledge matching module is obtained by segmenting the questions in the question-answer knowledge base and removing stop words; the Annoy index in the knowledge matching module is obtained by constructing a plurality of high-dimensional nearest-neighbor search trees over the knowledge vectors of the question-answer knowledge base, the knowledge vectors being obtained through the semantic model, and the semantic model being obtained by generating training data and fine-tuning on the basis of a pre-training model.
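The inverted-index half of this knowledge index can be sketched as below (the Annoy half would typically use the `annoy` package's `AnnoyIndex`, omitted here). The function names are illustrative:

```python
from collections import defaultdict

def build_inverted_index(questions, stopwords):
    """questions: {question_id: token list} after word segmentation.
    Maps each non-stop-word token to the set of questions containing it."""
    index = defaultdict(set)
    for qid, tokens in questions.items():
        for tok in tokens:
            if tok not in stopwords:
                index[tok].add(qid)
    return index

def candidate_questions(index, query_tokens):
    """Union of all questions sharing at least one token with the query."""
    hits = set()
    for tok in query_tokens:
        hits |= index.get(tok, set())
    return hits
```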
6. The pre-training based search question-answering system according to claim 5, wherein: the specific process of obtaining the semantic model from the pre-training model is as follows: B1) adding a fully connected layer and a softmax layer after the output layer of the pre-training model, the outputs of the last layer and the second-to-last layer of the pre-training model being used respectively as the question vector when constructing knowledge vectors; B2) constructing training data through text enhancement; B3) performing fine-tuning on the pre-training model processed in step B1 using the text-enhanced training data, thereby obtaining the semantic model.
7. The pre-training based search question-answering system according to claim 6, wherein: the process of constructing training data through text enhancement in step B2 is as follows: C1) building a question-answer knowledge base supplied by industry users and supplementing it, the supplementary sources including common question-answer pairs crawled from industry websites and manually supplied entries; C2) synonym extraction, including industry thesaurus extraction and synonym group extraction: the question-answer knowledge base and crawled network data are used to construct industry keywords after tf-idf weight calculation and manual screening; synonyms of the keywords are then extracted using Tencent word vectors, the extraction rule being the top-10 most similar words excluding the keyword itself, with similarity measured by cosine similarity, and the final synonym groups obtained after manual screening and de-duplication; C3) constructing similar questions for the questions in the question-answer knowledge base through synonym replacement; C4) back translation: translating the questions in the question-answer knowledge base into English through an open translation platform, translating them back into Chinese, and performing manual screening and labeling; C5) obtaining synonymous question pairs through text enhancement, and obtaining non-synonymous question pairs from unanswered questions, random combinations of questions and non-synonym replacement, thereby constructing the training data and validation set.
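Step C3, constructing similar questions by synonym replacement, might look like the following minimal sketch; one variant is generated per single-token substitution, and the function name is illustrative:

```python
def generate_similar_questions(tokens, synonym_groups):
    """tokens: a segmented question; synonym_groups: {word: [synonyms]}.
    Yields one variant question per single synonym substitution."""
    variants = []
    for i, tok in enumerate(tokens):
        for syn in synonym_groups.get(tok, []):
            if syn != tok:
                variants.append(tokens[:i] + [syn] + tokens[i + 1:])
    return variants
```

The resulting variants form synonymous question pairs with the original; substituting deliberately non-synonymous words would give negative pairs, as in step C5.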
8. The pre-training based search question-answering system according to any one of claims 1, 5, 6 and 7, wherein: the pre-training model is a Chinese pre-trained RoBERTa or XLNet model.
9. The pre-training based search question-answering system according to claim 1, wherein: if the user question is judged to be noise, the response output module returns a hot recommended question or no response according to the question-answer frequency; if, after analysis by the QA module, the user question matches a rule requiring an exact answer, the response output module outputs the standard answer corresponding to the question; and if the user question is judged to be non-noise but matches no QA rule, a similar question list is returned through the knowledge matching module for the user to choose from; in addition, the system maintains the state of the similar question list, and the user can obtain an answer by clicking an entry or issuing a position instruction.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911341560.2A CN111125334B (en) | 2019-12-20 | 2019-12-20 | Search question-answering system based on pre-training |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111125334A CN111125334A (en) | 2020-05-08 |
CN111125334B true CN111125334B (en) | 2023-09-12 |
Family
ID=70501440
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911341560.2A Active CN111125334B (en) | 2019-12-20 | 2019-12-20 | Search question-answering system based on pre-training |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111125334B (en) |
Families Citing this family (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111310234B (en) * | 2020-05-09 | 2020-11-03 | 支付宝(杭州)信息技术有限公司 | Personal data processing method and device based on zero-knowledge proof and electronic equipment |
CN111752804B (en) * | 2020-06-29 | 2022-09-09 | 中国电子科技集团公司第二十八研究所 | Database cache system based on database log scanning |
CN112100345A (en) * | 2020-08-25 | 2020-12-18 | 百度在线网络技术(北京)有限公司 | Training method and device for non-question-answer-like model, electronic equipment and storage medium |
CN112231537A (en) * | 2020-11-09 | 2021-01-15 | 张印祺 | Intelligent reading system based on deep learning and web crawler |
CN112380843B (en) * | 2020-11-18 | 2022-12-30 | 神思电子技术股份有限公司 | Random disturbance network-based open answer generation method |
CN112506963B (en) * | 2020-11-23 | 2022-09-09 | 上海方立数码科技有限公司 | Multi-service-scene-oriented service robot problem matching method |
CN112507097B (en) * | 2020-12-17 | 2022-11-18 | 神思电子技术股份有限公司 | Method for improving generalization capability of question-answering system |
CN112541069A (en) * | 2020-12-24 | 2021-03-23 | 山东山大鸥玛软件股份有限公司 | Text matching method, system, terminal and storage medium combined with keywords |
CN112597291A (en) * | 2020-12-26 | 2021-04-02 | 中国农业银行股份有限公司 | Intelligent question and answer implementation method, device and equipment |
CN112380358A (en) * | 2020-12-31 | 2021-02-19 | 神思电子技术股份有限公司 | Rapid construction method of industry knowledge base |
CN113627152B (en) * | 2021-07-16 | 2023-05-16 | 中国科学院软件研究所 | Self-supervision learning-based unsupervised machine reading and understanding training method |
CN114860913B (en) * | 2022-05-24 | 2023-12-12 | 北京百度网讯科技有限公司 | Intelligent question-answering system construction method, question-answering processing method and device |
CN116860951B (en) * | 2023-09-04 | 2023-11-14 | 贵州中昂科技有限公司 | Information consultation service management method and management system based on artificial intelligence |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101621391A (en) * | 2009-08-07 | 2010-01-06 | 北京百问百答网络技术有限公司 | Method and system for classifying short texts based on probability topic |
CN105005564A (en) * | 2014-04-17 | 2015-10-28 | 北京搜狗科技发展有限公司 | Data processing method and apparatus based on question-and-answer platform |
CN107273350A (en) * | 2017-05-16 | 2017-10-20 | 广东电网有限责任公司江门供电局 | A kind of information processing method and its device for realizing intelligent answer |
CN107980130A (en) * | 2017-11-02 | 2018-05-01 | 深圳前海达闼云端智能科技有限公司 | It is automatic to answer method, apparatus, storage medium and electronic equipment |
WO2018077655A1 (en) * | 2016-10-24 | 2018-05-03 | Koninklijke Philips N.V. | Multi domain real-time question answering system |
CN108959531A (en) * | 2018-06-29 | 2018-12-07 | 北京百度网讯科技有限公司 | Information search method, device, equipment and storage medium |
CN109947921A (en) * | 2019-03-19 | 2019-06-28 | 河海大学常州校区 | A kind of intelligent Answer System based on natural language processing |
CN110008322A (en) * | 2019-03-25 | 2019-07-12 | 阿里巴巴集团控股有限公司 | Art recommended method and device under more wheel session operational scenarios |
CN110309267A (en) * | 2019-07-08 | 2019-10-08 | 哈尔滨工业大学 | Semantic retrieving method and system based on pre-training model |
CN110413783A (en) * | 2019-07-23 | 2019-11-05 | 银江股份有限公司 | A kind of judicial style classification method and system based on attention mechanism |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107220380A (en) * | 2017-06-27 | 2017-09-29 | 北京百度网讯科技有限公司 | Question and answer based on artificial intelligence recommend method, device and computer equipment |
Non-Patent Citations (1)
Title |
---|
Liu Fei; Zhang Junran; Yang Hao. Research progress in medical image recognition based on deep learning. Chinese Journal of Biomedical Engineering, No. 01, full text. *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111125334B (en) | Search question-answering system based on pre-training | |
CN108363790B (en) | Method, device, equipment and storage medium for evaluating comments | |
CN108304468B (en) | Text classification method and text classification device | |
JP5936698B2 (en) | Word semantic relation extraction device | |
CN109299280B (en) | Short text clustering analysis method and device and terminal equipment | |
CN111325029B (en) | Text similarity calculation method based on deep learning integrated model | |
CN112035599B (en) | Query method and device based on vertical search, computer equipment and storage medium | |
CN112632228A (en) | Text mining-based auxiliary bid evaluation method and system | |
CN111460148A (en) | Text classification method and device, terminal equipment and storage medium | |
US10586174B2 (en) | Methods and systems for finding and ranking entities in a domain specific system | |
CN112307182B (en) | Question-answering system-based pseudo-correlation feedback extended query method | |
CN108287848B (en) | Method and system for semantic parsing | |
CN109255012A (en) | A kind of machine reads the implementation method and device of understanding | |
CN111625621A (en) | Document retrieval method and device, electronic equipment and storage medium | |
Khalid et al. | Topic detection from conversational dialogue corpus with parallel dirichlet allocation model and elbow method | |
CN110347833B (en) | Classification method for multi-round conversations | |
CN114722176A (en) | Intelligent question answering method, device, medium and electronic equipment | |
Al Mostakim et al. | Bangla content categorization using text based supervised learning methods | |
TWI734085B (en) | Dialogue system using intention detection ensemble learning and method thereof | |
CN115329207B (en) | Intelligent sales information recommendation method and system | |
CN114579729B (en) | FAQ question-answer matching method and system fusing multi-algorithm models | |
CN111104422A (en) | Training method, device, equipment and storage medium of data recommendation model | |
CN116628146A (en) | FAQ intelligent question-answering method and system in financial field | |
CN110287396A (en) | Text matching technique and device | |
CN115713349A (en) | Small sample comment data driven product key user demand mining method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||