CN111125334B - Search question-answering system based on pre-training - Google Patents


Info

Publication number
CN111125334B
CN111125334B CN201911341560.2A
Authority
CN
China
Prior art keywords
question
questions
module
knowledge
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911341560.2A
Other languages
Chinese (zh)
Other versions
CN111125334A (en)
Inventor
申冲
张传锋
朱锦雷
薛付忠
杨帆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Synthesis Electronic Technology Co Ltd
Original Assignee
Synthesis Electronic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Synthesis Electronic Technology Co Ltd filed Critical Synthesis Electronic Technology Co Ltd
Priority to CN201911341560.2A priority Critical patent/CN111125334B/en
Publication of CN111125334A publication Critical patent/CN111125334A/en
Application granted granted Critical
Publication of CN111125334B publication Critical patent/CN111125334B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a search question-answering system based on pre-training, comprising a noise judgment module, a QA question-answering module, a knowledge matching module and a response output module. The noise judgment module decides whether a user question is noise; the QA question-answering module comprises a rule entry unit and a rule parsing unit; the knowledge matching module indexes the user question against the knowledge in the question-answer base and ranks by similarity, the knowledge index comprising two modes, an inverted index and an Annoy index; the response output module outputs the response, which is one of a similar-question list, an exact answer, no answer, or recommended hot questions. The invention effectively solves the problems of knowledge generalization and transfer, noise judgment and QA customization, and greatly improves question-answering efficiency while improving user experience.

Description

Search question-answering system based on pre-training
Technical Field
The invention relates to a search question-answering system based on pre-training, namely a system that interacts with customer questions using a language model obtained by pre-training together with an existing question-answer database, and belongs to the fields of natural language processing and machine learning.
Background
A search question-answering system receives a user question, searches and ranks similar questions in a question-answer knowledge base, and displays a list of similar questions from which the user selects, so as to resolve the user's question as far as possible. This question-answering mode is currently adopted in knowledge-base question-answering systems, intelligent customer-service assistants, self-service kiosks and other terminal devices. Unlike a traditional dialogue system, which focuses on the interaction itself, a search question-answering system focuses on returning a more accurate list of similar questions; it neither maintains as much context state as a dialogue system nor necessarily requires an exact question-answering response. On terminal devices, speech-recognition accuracy and colloquial customer phrasing remain the key factors restricting the development of dialogue systems. A search question-answering system can alleviate this in a simple and efficient way by recommending a list of similar questions, but it still suffers from low accuracy in the top-5 recommendations. Three main factors depress top-5 recommendation accuracy: spoken-language generalization, noise, and the need to answer fixed questions exactly.
A search question-answering system typically holds a large amount of data for a single industry, and users phrase their questions in diverse ways. User questions may come from terminal speech-recognition devices (with noise interference), from industry-website inquiries and so on, and are often highly colloquial. Taking the tax industry as an example, instead of the standard phrase "pay tax" a user is more likely to say something like "pay more money". In practice, industry clients usually provide only question-answer pairs, i.e. one standard question mapped to one standard answer, with no generalizations of the standard question, so when the system receives a user question it can only match it against the questions in the question-answer knowledge base one by one. Search question-answering systems therefore also face this large volume of spoken-language generalization.
Because a search question-answering system returns a list of recommended similar questions, it should not respond to arbitrary input. One solution is to set a minimum threshold on the similarity between the user question and the questions in the question-answer knowledge base, below which no response is given. However, because unrelated questions can still score high similarity against questions in the knowledge base, it is difficult to choose a threshold that reliably filters out noise.
In addition, a search question-answering system may still face customized question-answers: for certain fixed questions entered into the system, a fixed answer must be returned exactly rather than a list of similar questions, for example jump instructions for a terminal-device page, common announcements of the client, operating instructions and so on. The search question-answering system therefore also needs a customized QA question-answering function.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a search question-answering system based on pre-training which effectively solves the problems of knowledge generalization and transfer, noise judgment and QA customization, and greatly improves question-answering efficiency while improving user experience.
In order to solve the above technical problem, the invention adopts the following technical scheme: a search question-answering system based on pre-training comprises a noise judgment module, a QA question-answering module, a knowledge matching module and a response output module. The noise judgment module judges whether the user question is noise by means of an industry word stock and an exclusion word stock: when the user question contains an industry word and contains no exclusion word, it is judged non-noise and passed to the QA question-answering module for parsing; otherwise it is judged to be noise, and the response output module returns a hot recommended question or no response;
the QA question-answering module comprises a rule entry unit and a rule parsing unit. The rule parsing unit parses the user question against the rules entered through the rule entry unit and judges whether the question requires an exactly returned answer; if so, the response output module outputs the standard answer corresponding to the question; if not, the parsed question is sent to the knowledge matching module;
the knowledge matching module indexes the user question against the knowledge in the question-answer base and ranks by similarity. The knowledge index comprises an inverted index and an Annoy index. The Annoy index is based on a semantic model obtained by generating training data and fine-tuning on top of a pre-training model; the outputs of the last and second-to-last layers of the semantic model are used as question vectors for the Annoy index. When computing and ranking similarity, vector similarity, question-answer frequency and text alignment ratio are considered together;
the response output module is used for outputting responses, which comprise four types: a similar-question list, an exact answer, no answer, and recommended hot questions.
Further, the exclusion word stock of the noise judgment module is obtained through manual screening and later log maintenance, and the industry word stock of the noise judgment module is obtained as follows: A1) first, collect training data, the sources of which comprise the question-answer knowledge base and questions from other industry data crawled from network resources; A2) segment words using the precise mode of jieba word segmentation, compute the term frequency TF of each word over the industry question-answer knowledge base, compute the inverse document frequency IDF of each word over all the data, and compute the word weight W from TF and IDF; the formulas for TF, IDF and W are respectively:
TF = n_w / N, where n_w is the number of occurrences of the word in the industry question-answer knowledge base and N is the total number of words in it;
IDF = log(D / (D_w + 1)), where D is the total number of questions in all the data and D_w is the number of questions containing the word;
W = TF * IDF;
a3 B), selecting a proper number of words as industry words according to the word weight calculated in the step b, or selecting the industry words by setting a lowest threshold value, splitting a split phrase in the industry words or adding the industry words into custom word segmentation of the crust word segmentation, and improving the weight of the split phrase to ensure that the industry words can be correctly segmented;
a4 A plurality of spoken abbreviations or other non-conventional industry words are provided by professionals to form a final industry word stock.
Further, the rule entry unit of the QA question-answering module supports entering logic expressions, brackets, logic nesting, number parsing and entities. For an entered logic expression, the rule parsing unit first matches the question against the rule expression to form a logic expression containing only 1s and 0s, then evaluates that expression with the rule parsing algorithm and outputs whether it matches.
Further, the logic expression is evaluated by the rule parsing algorithm as follows: the logic expression is pushed onto a number stack and an operator stack and computed recursively, with operator precedence brackets > AND > OR, and operation rules: 1&1=1; 1&0=0&1=0; 1|0=0|1=1; 1|1=1; 0|0=0, where & denotes the AND operation and | the OR operation; in the logic expression, the AND operator is replaced with & and the OR operator with |.
Furthermore, the inverted index in the knowledge matching module is obtained by applying jieba word segmentation and stop-word removal to the questions in the question-answer knowledge base. The Annoy index in the knowledge matching module is obtained by building several high-dimensional neighbor search trees over the knowledge vectors of the question-answer knowledge base; the knowledge vectors are produced by the semantic model, which is obtained by generating training data and fine-tuning on top of the pre-training model.
Further, the semantic model is obtained from the pre-training model as follows: B1) add a fully connected layer and a softmax layer after the output layer of the pre-training model; when building knowledge vectors, use the outputs of the last and second-to-last layers of the pre-training model as question vectors; B2) build training data through text enhancement; B3) fine-tune the pre-training model processed in step B1 on the text-enhanced training data, obtaining the semantic model.
Further, in step B2, training data is built through text enhancement as follows: C1) build the question-answer knowledge base from material supplied by the industry client, and supplement it from common question-answers crawled from industry websites and from manual entry; C2) synonym extraction, comprising industry-thesaurus extraction and synonym-phrase extraction: crawl the question-answer knowledge base and network data, build industry keywords after tf-idf weight calculation and manual screening, then extract synonyms of each keyword using Tencent word vectors; the extraction rule takes the ten words most similar to the keyword (by cosine similarity) excluding the keyword itself, and the final synonym phrases are obtained after manual screening and de-duplication; C3) construct similar questions for the questions in the question-answer knowledge base through synonym replacement; C4) back-translation: translate the questions in the question-answer knowledge base into English through a translation open platform, translate them back into Chinese, and manually screen and label them; C5) synonymous question pairs are obtained through text enhancement, and non-synonymous pairs are obtained from questions without answers, random combinations of questions and non-synonym replacement, thereby building the training data and a validation set.
Further, the pre-training model is a Chinese pre-trained RoBERTa or XLNet model.
Further, the question similarity is calculated as:
S(Q_u, Q_k) = μ·cos(V_u, V_k) + γ1/(1 + L(Q_u, Q_k)) + γ2·C(Q_k)/max(C(Q_k))
where Q_u and Q_k are the user question and an indexed knowledge-base question, S(Q_u, Q_k) is their similarity, V_u and V_k are the question vectors obtained by feeding the user question and the indexed question through the semantic model, cos(V_u, V_k) is the cosine similarity of the two vectors, L(Q_u, Q_k) is the absolute difference of the two questions' word-segmentation counts, and γ1, γ2 and μ are coefficients with γ1 ∈ (0, 0.1), γ2 ∈ (0, 0.1) and μ ∈ (0, 1); C(Q_k) is the number of times question Q_k has been asked, obtained from log statistics, and max(C(Q_k)) is the maximum such count over the knowledge base.
Further, if the user question is judged to be noise, the response output module returns a hot recommended question (chosen by question-answer frequency) or no response; if, after parsing by the QA module, the user question matches a rule requiring an exactly returned answer, the response output module outputs the standard answer corresponding to the question; if the user question is non-noise and matches no QA rule, the knowledge matching module returns a similar-question list for the user to choose from. In addition, the system maintains the state of the similar-question list, and the user resolves the question by clicking or by a position instruction.
The beneficial effects of the invention are as follows: the search question-answering system based on pre-training solves knowledge generalization and transfer by fine-tuning a pre-training model, solves noise judgment with statistical learning, and solves the customized QA problem with a rule parsing algorithm, greatly improving question-answering efficiency while improving user experience.
Drawings
FIG. 1 is a diagram of an architecture of a search question-answering system;
FIG. 2 is a knowledge index acquisition flow chart;
FIG. 3 is a semantic model acquisition flow chart;
fig. 4 is a training data generation flow chart.
Detailed Description
The invention will be further described with reference to the drawings and the specific examples.
Fig. 1 shows the architecture of the whole search question-answering system, comprising the noise judgment module, QA question-answering module, knowledge matching module and response output module. When the system receives a user question, it first judges whether the question is noise; if it is non-noise, QA question-answering matching is performed; if no response has been produced yet, the questions in the knowledge base are matched through the indexes and ranked by similarity, the top 5 similar questions are selected and returned to the response module, and the response module finally gives the question-answering response.
In this embodiment, the noise judgment module judges whether the user question is noise by means of the industry word stock and the exclusion word stock: when the user question contains an industry word and contains no exclusion word, it is judged non-noise and passed to the QA question-answering module for parsing; otherwise it is judged to be noise, and the response output module returns a hot recommended question or no response.
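The industry-word/exclusion-word test above can be sketched as follows. This is a minimal sketch: the two word stocks and the whitespace tokenization are illustrative stand-ins for the patent's actual lexicons and jieba-style segmentation.

```python
def is_noise(question_tokens, industry_words, exclusion_words):
    """Return True if the tokenized user question should be treated as noise.

    A question is non-noise only when it contains at least one industry
    word and no exclusion words; everything else is judged noise.
    """
    tokens = set(question_tokens)
    has_industry = bool(tokens & industry_words)
    has_exclusion = bool(tokens & exclusion_words)
    return not (has_industry and not has_exclusion)

# Hypothetical tax-industry lexicons, for illustration only.
industry = {"tax", "invoice", "withholding"}
exclusion = {"weather", "joke"}

print(is_noise(["how", "to", "pay", "tax"], industry, exclusion))            # non-noise
print(is_noise(["tell", "me", "a", "joke", "about", "tax"], industry, exclusion))
```

A noise question is then routed to the response output module (hot recommendation or no response) instead of the QA and knowledge-matching modules.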
The exclusion word stock of the noise judgment module is built manually from industry experience and uniformly filters out irregular input such as sensitive information and speech-recognition errors. The exclusion word stock is continuously enriched during subsequent use through log statistics and similar processes.
The process of obtaining the noise judgment module industry word stock is as follows:
a1 Firstly, counting training data, wherein the training data sources comprise a question-answer knowledge base and other industry data problems crawled through network resources;
a2 Using a precise model of barker word segmentation to segment words, calculating word frequency TF based on an industry question-answer knowledge base, calculating inverse document frequency IDF of words based on all data, calculating word weight W based on the word frequency TF and the inverse document frequency IDF, and calculating formulas of the word frequency TF, the inverse document frequency IDF and the word weight W are respectively as follows:
W=TF*IDF;
a3 B), selecting a proper number of words as industry words according to the word weight calculated in the step b, or selecting the industry words by setting the lowest threshold value, wherein the split phrase in the industry words needs to be split as fine as possible when constructing the index, such as 'paying instead of deduction', and the like, and the barking word can be split into one word, but if a user only says 'paying instead of deduction', the problem cannot be located through the inverted index, and therefore the split into 'paying instead of deduction' is needed. In addition, the industry word is required to be added into the custom word segmentation of the bargain word, and the weight of the custom word segmentation is increased, so that the industry word can be correctly segmented;
a4 A plurality of spoken abbreviations or other non-conventional industry words are provided by professionals to form a final industry word stock.
For conventional question-answers such as instruction intents (e.g. instructions controlling front-end page jumps), industry/organization introductions, greetings and the like, an answer needs to be returned exactly, so the QA question-answering module is provided.
The QA question-answering module comprises a rule entry unit and a rule parsing unit. The rule parsing unit parses the user question against the rules entered through the rule entry unit and judges whether the question requires an exactly returned answer; if so, the response output module outputs the standard answer corresponding to the question; if not, the parsed question is sent to the knowledge matching module.
In order to improve data reuse and reduce the work of entering and maintaining rules, common information can be factored out and defined as an entity. Taking invoice printing as an example, an entity of type "@invoice category" can be defined as "water fee OR electricity fee OR gas fee" and then used directly when entering rules.
In this embodiment, the rule entry unit of the QA question-answering module supports entering logic expressions (AND, OR, NOT), brackets, logic nesting, number parsing, entities and the like, so that a variety of complex logic rules can be implemented. Taking invoice printing as an example, its intent may be entered as "(water fee OR electricity fee OR gas fee) AND invoice", or, using the entity, "{@invoice category} AND invoice". In addition, this embodiment provides system entities (numbers, places, etc.); in particular, for the number entity, numeric logic rules (greater than, less than, equal to, etc.) can be entered.
For an entered logic expression, the rule parsing unit first matches the question against the rule expression to form a logic expression containing only 1 (True) and 0 (False) (AND is replaced with &, OR with |, and Chinese brackets with English brackets), then evaluates the expression with the rule parsing algorithm and outputs whether it matches (1 for matched, 0 for not matched).
Specifically, the logic expression is evaluated by the rule parsing algorithm as follows: the expression is pushed onto a number stack and an operator stack and computed recursively, much like arithmetic expression evaluation, with operator precedence: brackets > AND > OR.
the operation rule is as follows:
1&1=1;
1&0=0&1=0;
1|0=0|1=1;
1|1=1;
0|0=0,
where & denotes the AND operation and | the OR operation; in the logic expression, the AND operator is replaced with & and the OR operator with |.
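The evaluation above can be sketched as follows. This compact sketch resolves brackets recursively and applies the AND-before-OR precedence via string splitting; it produces the same results as the stack-based evaluation described in the text, for expressions containing only 1, 0, &, | and brackets.

```python
def eval_rule(expr):
    """Evaluate a matched rule expression containing only 1, 0, &, | and ().

    Precedence follows the rule parsing algorithm: brackets > AND (&) > OR (|).
    Returns 1 for matched, 0 for not matched.
    """
    expr = expr.replace(" ", "")
    # Resolve innermost bracket pairs first, recursively.
    while "(" in expr:
        close = expr.index(")")
        open_ = expr.rindex("(", 0, close)
        inner = eval_rule(expr[open_ + 1:close])
        expr = expr[:open_] + str(inner) + expr[close + 1:]
    # OR has the lowest precedence: any |-separated clause whose
    # &-separated bits are all 1 makes the whole expression match.
    for or_part in expr.split("|"):
        if all(bit == "1" for bit in or_part.split("&")):
            return 1
    return 0

print(eval_rule("1&(0|1)"))   # matches -> 1
print(eval_rule("0&1|0"))     # does not match -> 0
```

In the full system, the 1s and 0s would come from matching the user question's words and entities against the entered rule terms before evaluation.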
The knowledge matching module indexes the user question against the knowledge in the question-answer base and ranks by similarity. The knowledge index comprises an inverted index and an Annoy index. The Annoy index is based on a semantic model obtained by generating training data and fine-tuning on top of a pre-training model; the outputs of the last and second-to-last layers of the semantic model are used as question vectors for the Annoy index. When computing and ranking similarity, vector similarity, question-answer frequency and text alignment ratio are considered together.
In this embodiment, the specific implementation process of the knowledge matching module is:
1. Training set construction. As shown in fig. 4, the training data is built mainly from three parts: synonym replacement, back-translation and random combination. Specifically:
c1 Building a question-answer knowledge base according to industry user supply and supplementing the question-answer knowledge base, wherein the supplementing sources comprise common question-answers crawled in an industry website and manual supply;
c2 Synonym extraction, including industry thesaurus extraction and synonym phrase extraction; crawling a question-answer knowledge base and network data, constructing industry keywords after tf-idf weight calculation and manual screening, then extracting synonyms of the keywords by using a Tencent vector, wherein the extraction rule is that the top 10 synonyms of the keywords are not included, similarity is cosine similarity, and a final synonym phrase is obtained after manual screening and de-duplication;
the synonymous phrases prior to screening are as follows:
innovative enterprise and civil enterprise innovation enterprise industry and civil enterprise manufacturing industry manufacturing enterprise investment organization products and service medicine enterprise
The screening was followed as follows:
medicine enterprise for enterprise civil enterprise company
C3 Constructing similar questions of questions in a question-answering knowledge base through synonym replacement, replacing synonyms in the questions by using the questions in the knowledge base to expand the number of the questions, and then removing unreasonable questions and marking whether the questions are similar or not;
c4 Performing back translation, namely traversing all the problems in the knowledge base, calling a hundred-degree translation open platform interface to perform English translation on the problems in the knowledge base, performing back translation into Chinese, and finally removing unreasonable problems and marking whether the problems are similar;
c5 The number of similar question pairs is arranged, all questions (questions without answers, random combination of questions, non-synonym replacement to obtain non-synonymous question-answer pairs) are randomly combined, the same number of dissimilar question pairs are constructed, and the random disorder is followed by training set. The ratio of training set to validation set was 9:1.
2. Fine-tune the pre-training model on the training data to obtain the semantic model; the specific process is shown in fig. 3:
B1) A fully connected layer and a softmax layer are added after the output layer of the pre-training model, and the softmax layer outputs the probabilities that a pair is similar or dissimilar. The model input is a question pair from the training data; the loss function is cross-entropy; gradient descent uses Adam; F1 and accuracy are used as the combined metric, and the best semantic model is selected on the validation set.
When building the knowledge vectors, the knowledge-base questions are traversed in a loop, and the 768-dimensional outputs of the model's last layer (higher top-1 precision) and second-to-last layer (higher recall) can each be used as the question vector; the customer can choose between them, and by default the system uses the second-to-last layer's output as the question vector.
Compared with the traditional BERT model, the pre-training model used here is trained on more data for more steps, uses dynamic whole-word masking, and drops next-sentence prediction, giving better results on a range of NLP tasks. In tests of the search system after fine-tuning, this pre-training model improves the top-1 hit rate by 20.5% and the top-5 hit rate by 11.5% over the plain BERT model.
3. Knowledge index construction mainly comprises the inverted index and the Annoy index. The inverted index is fast to build, and updating it on knowledge insertion and deletion is cheap; it suits a knowledge search question-answering system with a small data volume (preferably within ten thousand entries) and improves question recall. The Annoy index is slow to build and does not support index updates when knowledge changes, but queries are fast over large amounts of data; it suits index search over a large base knowledge store (up to millions of entries). The user can choose the index according to the size and category of the knowledge base. In general, the industry base question-answer store contains many questions, changes rarely, and is shared across different customers in the same industry, so the Annoy index is used for it; for the customized (dynamic) questions of individual customers, the inverted index is used.
As shown in FIG. 2, the inverted index is obtained by applying jieba word segmentation and stop-word removal to the questions in the question-answer knowledge base. The Annoy index is obtained by building several high-dimensional neighbor search trees over the knowledge vectors of the question-answer knowledge base; the knowledge vectors are produced by the semantic model, which is obtained by generating training data and fine-tuning on top of the pre-training model.
In this embodiment, the pre-training model is a chinese pre-training RoBERTa or XLNet model.
4. Question similarity calculation and ranking: cosine similarity is computed between the user question vector and the knowledge vectors in the knowledge base, features such as hot-question statistics and text alignment ratio are considered together, and the results are ranked by the combined similarity.
The similarity calculation formula is:
wherein Q is u ,Q k S (Q) u ,Q k ) Representing similarity of user-containing questions to questions indexed in the knowledge base, V u ,V k Problem vectors respectively input to the semantic model for user problems and queried by indexing, cos (V u ,V k ) Represents V u ,V k Cosine similarity of two vectors, gamma 1 、γ 2 Mu is a coefficient, wherein gamma 1 ∈(0,0.1),γ 2 ∈(0,0.1),μ∈(0,1);C(Q k ) For problem Q obtained by log statistics k Is a number of inquiries; max (C (Q) k ) A maximum number of interrogated questions.
The first term, the cosine similarity between the user-question vector and the indexed-question vector, is the dominant part; the second term penalizes L(Q_u, Q_k), the absolute difference in word-segment counts between the user question and the indexed question; the third term rewards frequently asked questions. The three-part scores are normalized to similarity values in (0, 1) by a softmax function and then sorted.
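A hedged sketch of this three-part score follows. The exact combination in the original is not reproduced (the formula is an image in the patent), so the signs and the default coefficients μ = 0.8, γ_1 = γ_2 = 0.05 are illustrative assumptions consistent with the stated ranges:

```python
import math

def similarity(cos_sim, len_diff, query_count, max_count,
               mu=0.8, gamma1=0.05, gamma2=0.05):
    """Combine vector similarity, a length-difference penalty and popularity.

    mu in (0, 1) and the gammas in (0, 0.1) follow the stated ranges; the
    exact functional form is a plausible reading of the three described terms.
    """
    hot = query_count / max_count if max_count else 0.0
    return mu * cos_sim - gamma1 * len_diff + gamma2 * hot

def rank(scores):
    """Softmax-normalize raw scores to (0, 1) and return indices sorted descending."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
```

Since softmax is monotonic, the ranking matches the raw scores; its role here is to put the values on a common (0, 1) scale before thresholding or display.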
The response output module outputs responses of four types: a similar-question list, an exact answer, no answer, and recommended hot questions.
Specifically, if the user question is judged to be noise, the response output module returns hot recommended questions, or no response, according to question frequency; if, after analysis by the QA module, the user question matches a rule requiring an exact answer, the response output module outputs the standard answer for that question; if the user question is non-noise but matches no QA rule, the knowledge matching module returns a similar-question list for the user to choose from. In addition, the system maintains the state of the similar-question list, and the user selects an answer by clicking or by a position instruction.
In this way, unsupervised learning over a large amount of data yields a language model with some generalization ability, from which a question-vector index of the question-answer library is built; combined with statistics of the library such as tf (term frequency) and idf (inverse document frequency) and manually added semantic matching rules, this produces a search-based question-answering system that responds to user questions, improves question-answering generalization, recommends highly related questions, greatly increases the efficiency of resolving questions, and improves user experience.
This embodiment is based on an intelligent search question-answering system. Tuning the parameters of the pre-trained model greatly improves the model's generalization ability while increasing its sensitivity to industry knowledge; statistical learning combined with manual screening of the industry word stock effectively filters out irrelevant, non-industry questions; and a QA rule input and parsing system nested inside the search question-answering system, whose rule input supports entities (including numbers) and complex logic rules, enables accurate QA question answering. The invention effectively addresses pain points in current engineering practice such as knowledge generalization and migration, noise judgment, and QA customization, and greatly improves question-answering efficiency and user experience. The invention can also be used on intelligent devices, such as mobile phones, computers, intelligent robots, and self-service machines.
The foregoing description is only of the basic principles and preferred embodiments of the present invention, and modifications and alternatives thereto will occur to those skilled in the art to which the present invention pertains, as defined by the appended claims.

Claims (9)

1. A search question-answering system based on pre-training, characterized in that: the system comprises a noise judging module, a QA question-answering module, a knowledge matching module and a response output module;
the noise judging module determines whether the user question is noise by consulting an industry word stock and an exclusion word stock: when the question contains an industry word and no exclusion word it is judged non-noise and passed to the QA question-answering module for analysis; otherwise it is judged noise and the response output module returns hot recommended questions or no response;
the QA question-answering module comprises a rule input unit and a rule analysis unit; the rule analysis unit analyses the user question against the rules entered through the rule input unit and judges whether an exact answer must be returned: if so, the response output module outputs the standard answer for that question; if not, the analysed question is sent to the knowledge matching module;
the knowledge matching module indexes the knowledge in the question-answer library and ranks candidates by similarity; the knowledge index comprises an inverted index and an Annoy index; the Annoy index is based on a semantic model obtained by generating training data and fine-tuning on top of a pre-trained model, with the output of the last or second-to-last layer of the semantic model used as the question vector for Annoy indexing; similarity calculation and ranking jointly consider vector similarity, question frequency, and text-length alignment;
the formula for calculating question similarity is:

S(Q_u, Q_k) = μ·cos(V_u, V_k) − γ_1·L(Q_u, Q_k) + γ_2·C(Q_k)/max(C(Q_k))

wherein Q_u and Q_k are the user question and a question indexed in the knowledge base, and S(Q_u, Q_k) represents their similarity; V_u and V_k are the question vectors obtained by inputting the user question and the indexed question into the semantic model; cos(V_u, V_k) represents the cosine similarity of the two vectors; γ_1, γ_2 and μ are coefficients, wherein γ_1 ∈ (0, 0.1), γ_2 ∈ (0, 0.1), μ ∈ (0, 1); L(Q_u, Q_k) is the absolute value of the difference in word-segment counts between the user question and the indexed question; C(Q_k) is the number of times question Q_k has been asked, obtained from log statistics; and max(C(Q_k)) is the maximum such count over all questions;
the response output module is used for outputting responses, and the responses comprise four types of similar question lists, accurate answers, no answers and recommended hot questions.
2. The pre-training based search question-answering system according to claim 1, wherein: the exclusion word stock of the noise judging module is obtained through manual screening and later log maintenance, and the industry word stock of the noise judging module is obtained as follows: A1) training data are first collected, the sources comprising the question-answer knowledge base and other-industry questions crawled from network resources; A2) the data are segmented using the precise mode of the jieba segmenter; the term frequency TF is calculated over the industry question-answer knowledge base, the inverse document frequency IDF of each word is calculated over all data, and the word weight W is calculated from TF and IDF; the formulas for TF, IDF and W are respectively:
TF = (occurrences of the word in the industry corpus) / (total number of words in the industry corpus);
IDF = log(total number of documents / (number of documents containing the word + 1));
W = TF * IDF;
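The TF/IDF/W computation above can be sketched in Python as follows; whitespace tokenization stands in for jieba's precise mode, and the +1 smoothing in the IDF denominator is an assumption (the patent's original formulas are images):

```python
import math
from collections import Counter

def word_weights(industry_docs, all_docs):
    """Compute W = TF * IDF: TF over the industry corpus, IDF over all data."""
    tokens = [t for doc in industry_docs for t in doc.split()]
    tf = Counter(tokens)          # raw counts per word in the industry corpus
    total = len(tokens)
    n_docs = len(all_docs)
    weights = {}
    for word, count in tf.items():
        df = sum(1 for doc in all_docs if word in doc.split())
        idf = math.log(n_docs / (df + 1))   # +1 smoothing (assumption)
        weights[word] = (count / total) * idf
    return weights
```

Words frequent in the industry corpus but rare across all data get the highest W, which is what makes them candidate industry words in step A3.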
a3 B), selecting a proper number of words as industry words according to the word weight calculated in the step b, or selecting the industry words by setting a minimum threshold value, splitting the split phrase in the industry words or adding the industry words into the custom word segmentation of the crust word segmentation, and improving the weight of the split phrase;
a4 A plurality of spoken abbreviations or other non-conventional industry words are provided by professionals to form a final industry word stock.
3. The pre-training based search question-answering system according to claim 1, wherein: the rule input unit of the QA question-answering module supports input of logical expressions, brackets, logical nesting, number parsing and entity input; for an input logical expression, the rule analysis unit first matches the question against the rule expression to form a logical expression containing only 1s and 0s, then evaluates that expression with a rule parsing algorithm and outputs whether it matches.
4. A pre-training based search question-answering system according to claim 3, wherein: the logical expression is evaluated by the rule parsing algorithm as follows: the expression is pushed onto a number stack and an operator stack and evaluated recursively, with operation precedence brackets > AND > OR and operation rules 1&1=1; 1&0=0&1=0; 1|0=0|1=1; 1|1=1; 0|0=0, where & denotes AND and | denotes OR; in the logical expression, the AND operator is replaced with & and the OR operator with |.
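The stack-based evaluation described in this claim can be sketched as follows. The space-separated token format and the substring-based term matching are simplifying assumptions; a real rule would be matched against segmented words or extracted entities:

```python
def parse_rule(rule, question):
    """Replace AND/OR with &,| and each term with 1/0 depending on presence."""
    expr = []
    for tok in rule.split():
        if tok == "AND":
            expr.append("&")
        elif tok == "OR":
            expr.append("|")
        elif tok in "()":
            expr.append(tok)
        else:  # term: 1 if it occurs in the question, else 0
            expr.append("1" if tok in question else "0")
    return " ".join(expr)

def eval_expr(expr):
    """Evaluate a 1/0 &,| expression with a value stack and an operator stack."""
    prec = {"&": 2, "|": 1}  # brackets handled explicitly; & binds tighter than |
    vals, ops = [], []

    def reduce_top():
        op = ops.pop()
        b, a = vals.pop(), vals.pop()
        vals.append((a & b) if op == "&" else (a | b))

    for tok in expr.split():
        if tok in ("0", "1"):
            vals.append(int(tok))
        elif tok == "(":
            ops.append(tok)
        elif tok == ")":
            while ops and ops[-1] != "(":
                reduce_top()
            ops.pop()  # discard the "("
        else:  # & or |
            while ops and ops[-1] != "(" and prec[ops[-1]] >= prec[tok]:
                reduce_top()
            ops.append(tok)
    while ops:
        reduce_top()
    return vals[0]
```

For example, the rule "( social AND security ) OR pension" matched against a question mentioning only "pension" becomes "( 0 & 0 ) | 1", which evaluates to 1 (matched).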
5. The pre-training based search question-answering system according to claim 1, wherein: the inverted index in the knowledge matching module is obtained by segmenting the questions in the question-answer knowledge base into words and removing stop words; the Annoy index in the knowledge matching module is obtained by building multiple high-dimensional nearest-neighbor search trees over the knowledge vectors of the question-answer knowledge base, the knowledge vectors being produced by the semantic model, which is obtained by generating training data and fine-tuning on top of a pre-trained model.
6. The pre-training based search question-answering system according to claim 5, wherein: the semantic model is obtained from the pre-trained model as follows: B1) a fully connected layer and a softmax layer are added after the output layer of the pre-trained model, and the output of the last layer or of the second-to-last layer of the pre-trained model is used as the question vector when knowledge vectors are constructed; B2) training data are built through text enhancement; B3) the pre-trained model processed in step B1 is fine-tuned on the text-enhanced training data, yielding the semantic model.
7. The pre-training based search question-answering system according to claim 6, wherein: training data are constructed through text enhancement in step B2 as follows: C1) a question-answer knowledge base is built from material supplied by industry users and then supplemented, the supplementary sources comprising frequently asked questions crawled from industry websites and manually supplied entries; C2) synonyms are extracted, including industry word extraction and synonym group extraction: the question-answer knowledge base and crawled network data are used to build industry keywords after tf-idf weight calculation and manual screening; synonyms of each keyword are then extracted using Tencent word vectors, taking the 10 most similar words by cosine similarity (the keyword itself excluded), and the final synonym groups are obtained after manual screening and de-duplication; C3) similar questions for the questions in the question-answer knowledge base are constructed through synonym replacement; C4) back-translation is performed: questions in the question-answer knowledge base are translated into English through an open translation platform, translated back into Chinese, then manually screened and labelled; C5) synonymous question pairs are obtained through text enhancement, and non-synonymous question pairs are obtained from unanswered questions, random question combinations, and non-synonym replacement, thereby constructing the training data and the validation set.
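Step C3 above, building similar questions by synonym substitution, can be sketched as follows; whitespace tokenization and the toy synonym map are illustrative stand-ins for jieba segmentation and the manually screened embedding neighbours from step C2:

```python
import itertools

def augment(question, synonyms):
    """Generate similar questions by substituting synonyms for known words.

    `synonyms` maps a word to its alternatives; every combination of
    substitutions is produced, minus the original phrasing.
    """
    tokens = question.split()
    options = [[t] + synonyms.get(t, []) for t in tokens]
    variants = {" ".join(combo) for combo in itertools.product(*options)}
    variants.discard(question)  # keep only genuinely new phrasings
    return sorted(variants)
```

Pairing each variant with the original question yields the synonymous question pairs of step C5; pairing across unrelated questions yields the non-synonymous pairs.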
8. The pre-training based search question-answering system according to any one of claims 1, 5, 6 and 7, wherein: the pre-training model is a Chinese pre-trained RoBERTa or XLNet model.
9. The pre-training based search question-answering system according to claim 1, wherein: if the user question is judged to be noise, the response output module returns hot recommended questions or no response according to question frequency; if, after analysis by the QA module, the user question matches a rule requiring an exact answer, the response output module outputs the standard answer for that question; and if the user question is non-noise but matches no QA rule, the knowledge matching module returns a similar-question list for the user to choose from; in addition, the system maintains the state of the similar-question list, and the user selects an answer by clicking or by a position instruction.
CN201911341560.2A 2019-12-20 2019-12-20 Search question-answering system based on pre-training Active CN111125334B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911341560.2A CN111125334B (en) 2019-12-20 2019-12-20 Search question-answering system based on pre-training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911341560.2A CN111125334B (en) 2019-12-20 2019-12-20 Search question-answering system based on pre-training

Publications (2)

Publication Number Publication Date
CN111125334A CN111125334A (en) 2020-05-08
CN111125334B true CN111125334B (en) 2023-09-12

Family

ID=70501440

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911341560.2A Active CN111125334B (en) 2019-12-20 2019-12-20 Search question-answering system based on pre-training

Country Status (1)

Country Link
CN (1) CN111125334B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111310234B (en) * 2020-05-09 2020-11-03 支付宝(杭州)信息技术有限公司 Personal data processing method and device based on zero-knowledge proof and electronic equipment
CN111752804B (en) * 2020-06-29 2022-09-09 中国电子科技集团公司第二十八研究所 Database cache system based on database log scanning
CN112100345A (en) * 2020-08-25 2020-12-18 百度在线网络技术(北京)有限公司 Training method and device for non-question-answer-like model, electronic equipment and storage medium
CN112231537A (en) * 2020-11-09 2021-01-15 张印祺 Intelligent reading system based on deep learning and web crawler
CN112380843B (en) * 2020-11-18 2022-12-30 神思电子技术股份有限公司 Random disturbance network-based open answer generation method
CN112506963B (en) * 2020-11-23 2022-09-09 上海方立数码科技有限公司 Multi-service-scene-oriented service robot problem matching method
CN112507097B (en) * 2020-12-17 2022-11-18 神思电子技术股份有限公司 Method for improving generalization capability of question-answering system
CN112541069A (en) * 2020-12-24 2021-03-23 山东山大鸥玛软件股份有限公司 Text matching method, system, terminal and storage medium combined with keywords
CN112597291A (en) * 2020-12-26 2021-04-02 中国农业银行股份有限公司 Intelligent question and answer implementation method, device and equipment
CN112380358A (en) * 2020-12-31 2021-02-19 神思电子技术股份有限公司 Rapid construction method of industry knowledge base
CN113627152B (en) * 2021-07-16 2023-05-16 中国科学院软件研究所 Self-supervision learning-based unsupervised machine reading and understanding training method
CN114860913B (en) * 2022-05-24 2023-12-12 北京百度网讯科技有限公司 Intelligent question-answering system construction method, question-answering processing method and device
CN116860951B (en) * 2023-09-04 2023-11-14 贵州中昂科技有限公司 Information consultation service management method and management system based on artificial intelligence

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101621391A (en) * 2009-08-07 2010-01-06 北京百问百答网络技术有限公司 Method and system for classifying short texts based on probability topic
CN105005564A (en) * 2014-04-17 2015-10-28 北京搜狗科技发展有限公司 Data processing method and apparatus based on question-and-answer platform
CN107273350A (en) * 2017-05-16 2017-10-20 广东电网有限责任公司江门供电局 A kind of information processing method and its device for realizing intelligent answer
CN107980130A (en) * 2017-11-02 2018-05-01 深圳前海达闼云端智能科技有限公司 It is automatic to answer method, apparatus, storage medium and electronic equipment
WO2018077655A1 (en) * 2016-10-24 2018-05-03 Koninklijke Philips N.V. Multi domain real-time question answering system
CN108959531A (en) * 2018-06-29 2018-12-07 北京百度网讯科技有限公司 Information search method, device, equipment and storage medium
CN109947921A (en) * 2019-03-19 2019-06-28 河海大学常州校区 A kind of intelligent Answer System based on natural language processing
CN110008322A (en) * 2019-03-25 2019-07-12 阿里巴巴集团控股有限公司 Art recommended method and device under more wheel session operational scenarios
CN110309267A (en) * 2019-07-08 2019-10-08 哈尔滨工业大学 Semantic retrieving method and system based on pre-training model
CN110413783A (en) * 2019-07-23 2019-11-05 银江股份有限公司 A kind of judicial style classification method and system based on attention mechanism

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107220380A (en) * 2017-06-27 2017-09-29 北京百度网讯科技有限公司 Question and answer based on artificial intelligence recommend method, device and computer equipment

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101621391A (en) * 2009-08-07 2010-01-06 北京百问百答网络技术有限公司 Method and system for classifying short texts based on probability topic
CN105005564A (en) * 2014-04-17 2015-10-28 北京搜狗科技发展有限公司 Data processing method and apparatus based on question-and-answer platform
WO2018077655A1 (en) * 2016-10-24 2018-05-03 Koninklijke Philips N.V. Multi domain real-time question answering system
CN107273350A (en) * 2017-05-16 2017-10-20 广东电网有限责任公司江门供电局 A kind of information processing method and its device for realizing intelligent answer
CN107980130A (en) * 2017-11-02 2018-05-01 深圳前海达闼云端智能科技有限公司 It is automatic to answer method, apparatus, storage medium and electronic equipment
CN108959531A (en) * 2018-06-29 2018-12-07 北京百度网讯科技有限公司 Information search method, device, equipment and storage medium
CN109947921A (en) * 2019-03-19 2019-06-28 河海大学常州校区 A kind of intelligent Answer System based on natural language processing
CN110008322A (en) * 2019-03-25 2019-07-12 阿里巴巴集团控股有限公司 Art recommended method and device under more wheel session operational scenarios
CN110309267A (en) * 2019-07-08 2019-10-08 哈尔滨工业大学 Semantic retrieving method and system based on pre-training model
CN110413783A (en) * 2019-07-23 2019-11-05 银江股份有限公司 A kind of judicial style classification method and system based on attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Liu Fei; Zhang Junran; Yang Hao. Research progress of medical image recognition based on deep learning. Chinese Journal of Biomedical Engineering, No. 01, full text. *

Also Published As

Publication number Publication date
CN111125334A (en) 2020-05-08

Similar Documents

Publication Publication Date Title
CN111125334B (en) Search question-answering system based on pre-training
CN108363790B (en) Method, device, equipment and storage medium for evaluating comments
CN108304468B (en) Text classification method and text classification device
JP5936698B2 (en) Word semantic relation extraction device
CN109299280B (en) Short text clustering analysis method and device and terminal equipment
CN111325029B (en) Text similarity calculation method based on deep learning integrated model
CN112035599B (en) Query method and device based on vertical search, computer equipment and storage medium
CN112632228A (en) Text mining-based auxiliary bid evaluation method and system
CN111460148A (en) Text classification method and device, terminal equipment and storage medium
US10586174B2 (en) Methods and systems for finding and ranking entities in a domain specific system
CN112307182B (en) Question-answering system-based pseudo-correlation feedback extended query method
CN108287848B (en) Method and system for semantic parsing
CN109255012A (en) A kind of machine reads the implementation method and device of understanding
CN111625621A (en) Document retrieval method and device, electronic equipment and storage medium
Khalid et al. Topic detection from conversational dialogue corpus with parallel dirichlet allocation model and elbow method
CN110347833B (en) Classification method for multi-round conversations
CN114722176A (en) Intelligent question answering method, device, medium and electronic equipment
Al Mostakim et al. Bangla content categorization using text based supervised learning methods
TWI734085B (en) Dialogue system using intention detection ensemble learning and method thereof
CN115329207B (en) Intelligent sales information recommendation method and system
CN114579729B (en) FAQ question-answer matching method and system fusing multi-algorithm models
CN111104422A (en) Training method, device, equipment and storage medium of data recommendation model
CN116628146A (en) FAQ intelligent question-answering method and system in financial field
CN110287396A (en) Text matching technique and device
CN115713349A (en) Small sample comment data driven product key user demand mining method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant