CN111125334A - Search question-answering system based on pre-training - Google Patents

Search question-answering system based on pre-training

Info

Publication number
CN111125334A
CN111125334A (application CN201911341560.2A)
Authority
CN
China
Prior art keywords
question
answer
module
training
industry
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911341560.2A
Other languages
Chinese (zh)
Other versions
CN111125334B (en)
Inventor
申冲
张传锋
朱锦雷
薛付忠
杨帆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Synthesis Electronic Technology Co Ltd
Original Assignee
Synthesis Electronic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Synthesis Electronic Technology Co Ltd filed Critical Synthesis Electronic Technology Co Ltd
Priority to CN201911341560.2A priority Critical patent/CN111125334B/en
Publication of CN111125334A publication Critical patent/CN111125334A/en
Application granted granted Critical
Publication of CN111125334B publication Critical patent/CN111125334B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a pre-training-based search question-answering system, which comprises a noise judgment module, a QA question-answering module, a knowledge matching module and a response output module. The noise judgment module judges whether a user question is noise; the QA question-answering module comprises a rule entry unit and a rule analysis unit; the knowledge matching module indexes the user question against the knowledge in the question-answer base and ranks candidates by similarity, the knowledge index comprising an inverted index and an Annoy index; the response output module outputs the response, which is of four types: a similar-question list, an exact answer, no answer, or recommended hot questions. The invention can effectively solve the problems of knowledge generalization transfer, noise judgment and QA customization, and greatly improves question-answering efficiency while improving the user experience.

Description

Search question-answering system based on pre-training
Technical Field
The invention relates to a search question-answering system based on pre-training, i.e. a system that uses a pre-trained language model together with an existing question-answer database to interact with customer questions, and belongs to the fields of natural language processing and machine learning.
Background
A search question-answering system receives a user question, retrieves and ranks similar questions in a question-answer knowledge base, and displays a list of similar questions for the user to choose from, so as to resolve the user's question as far as possible. At present, various knowledge-base question-answering systems, intelligent customer-service assistants, self-service machines and other terminal devices adopt this question-answering mode. Unlike a traditional dialog system, whose focus is on interaction, the focus of a search question-answering system is on returning a more accurate list of similar questions; it does not have to maintain as much contextual state as a dialog system, nor does it always have to give a single exact answer. On terminal devices, speech-recognition accuracy and the colloquial language of customers remain key factors restricting the development of dialog systems. A search question-answering system can alleviate this in a simple and efficient way by recommending a list of similar questions, but it still faces low accuracy in the top-5 recommendations. Three main factors lower the top-5 recommendation accuracy: spoken-language generalization, noise, and fixed questions that require exact answers.
A search question-answering system usually serves a single industry with a large amount of data, and users phrase their questions in many different ways. User questions may come from terminal speech-recognition devices (with noise interference) or from industry website queries, and tend to be highly colloquial. Taking the tax industry as an example, instead of "paying taxes" a user is more likely to say "how much money do I have to pay". In practice, an industry customer often only provides question-answer pairs, i.e. one standard question corresponding to one standard answer, without any generalization of the standard questions, so when the system receives a user question it can only be matched against the questions in the question-answer knowledge base one by one. The search question-answering system therefore has to cope with a large number of spoken generalization problems.
A search question-answering system recommends a list of similar questions, but it cannot respond to every input. One solution is to set a minimum threshold on the similarity between the user question and the questions in the question-answer knowledge base, below which no response is given. However, because the similarity between different user questions and the questions in the knowledge base varies widely, it is difficult to choose an appropriate threshold that avoids the noise problem.
In addition, a search question-answering system may still face some customized question-answering: when the user inputs a fixed question, the system must return a fixed answer exactly instead of a similar-question list, for example page-jump instructions of a terminal device, common announcements and operating instructions for the user. The search question-answering system therefore still needs a customized QA question-answering function.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a pre-training-based question-answering search system, which can effectively solve the problems of knowledge generalization migration, noise judgment and QA customization, and greatly improve the question-answering efficiency while improving the user experience.
In order to solve the technical problem, the technical scheme adopted by the invention is as follows: a pre-training-based question-answering search system comprises a noise judgment module, a QA question-answering module, a knowledge matching module and a response output module; the noise judgment module judges whether the user question belongs to noise through an industry word bank and an exclusion word bank, when the user question contains industry words and does not contain exclusion words, the user question is determined to be non-noise and enters a QA question-answering module for analysis, otherwise, the user question is determined to be noise, and a response output module returns a hot recommendation question or no response;
the QA question-answering module comprises a rule entry unit and a rule analysis unit; the rule analysis unit parses the user question against the rules entered by the rule entry unit and judges whether the parsed user question requires an exactly returned answer; if so, the response output module outputs the standard answer corresponding to the question, and if not, the parsed question is sent to the knowledge matching module;
the knowledge matching module indexes the knowledge in the question-answer base and performs similarity ranking; the knowledge index comprises an inverted index and an Annoy index, the Annoy index is based on a semantic model, and the semantic model is obtained from a pre-training model through training-data generation and fine-tuning; the output of the last layer or the second-to-last layer of the semantic model is used as the question vector for the Annoy index, and when the similarity is calculated and ranked, the vector similarity, the question-answer frequency and the text alignment ratio are considered together;
the response output module is used for outputting the response, which is of four types: a similar-question list, an exact answer, no answer, or recommended hot questions.
Further, the exclusion word stock of the noise judgment module is obtained through manual screening and later log maintenance, and the industry word stock of the noise judgment module is obtained as follows: A1) first, statistics are gathered over the training data, whose sources include the question-answer knowledge base and other industry questions crawled from web resources; A2) the precise mode of jieba word segmentation is used for segmentation, the term frequency TF is calculated over the industry question-answer knowledge base, the inverse document frequency IDF of each word is calculated over all the data, and the word weight W is calculated from TF and IDF, where TF, IDF and W are respectively calculated as follows:
TF = (number of occurrences of the word in the industry question-answer knowledge base) / (total number of words in the industry question-answer knowledge base);
IDF = log( total number of questions in all the data / (number of questions containing the word + 1) );
W=TF*IDF;
A3) a suitable number of words are selected as industry words according to the word weights calculated in step A2, or the industry words are selected by setting a minimum threshold; splittable compound phrases among the industry words are split, or the industry words are added to the custom dictionary of jieba segmentation with increased weight so that they are segmented correctly;
A4) professionals supplement a number of spoken abbreviations and other non-standard industry words, forming the final industry word stock.
Furthermore, the rule entry unit of the QA question-answering module supports the entry of logical expressions, brackets, logical nesting, number parsing and entities; for an entered logical expression, the rule analysis unit first matches the question against the rule expression to form a logical expression containing only 1s and 0s, then evaluates the logical expression with the rule parsing algorithm and outputs whether it matches.
Further, the rule parsing algorithm evaluates the logical expression as follows: the logical expression is pushed onto a number stack and an operator stack and evaluated recursively, with operation priority: brackets > AND operation > OR operation; the operation rules are: 1&1=1; 1&0=0&1=0; 1|0=0|1=1; 1|1=1; 0|0=0, where & denotes the AND operation and | denotes the OR operation, and in the logical expression the operator AND is replaced by & and the operator OR is replaced by |.
Further, the inverted index in the knowledge matching module is obtained by applying jieba word segmentation and stop-word removal to the questions in the question-answer knowledge base; the Annoy index in the knowledge matching module is obtained by building multiple high-dimensional nearest-neighbor search trees over the knowledge vectors of the question-answer knowledge base, the knowledge vectors are produced by the semantic model, and the semantic model is obtained from the pre-training model through training-data generation and fine-tuning.
Further, the specific process of obtaining the semantic model from the pre-training model is as follows: B1) a fully connected layer and a softmax layer are added after the output layer of the pre-training model, and when the knowledge vectors are constructed the outputs of the last layer and the second-to-last layer of the pre-training model are respectively used as question vectors; B2) training data are built through text enhancement; B3) fine-tuning is performed on the pre-training model processed in step B1 with the text-enhanced training data to obtain the semantic model.
Further, the process of building training data through text enhancement in step B2 is as follows: C1) a question-answer knowledge base is constructed from material supplied by the industry customer and then supplemented, the supplementary sources including frequently asked questions crawled from industry websites and manually supplied entries; C2) synonyms are extracted, including industry word stock extraction and synonym group extraction: industry keywords are constructed from the crawled question-answer knowledge base and web data after tf-idf weight calculation and manual screening, synonyms of each keyword are then extracted with the Tencent word vectors, the extraction rule being to take the 10 most similar words (by cosine similarity) excluding the keyword itself, and the final synonym groups are obtained after manual screening and de-duplication; C3) similar questions are constructed for the questions in the question-answer knowledge base through synonym replacement; C4) the questions in the question-answer knowledge base are translated into English and back into Chinese through a translation open platform, followed by manual screening and labeling; C5) synonymous question pairs are obtained after text enhancement, and non-synonymous question pairs are obtained from questions without answers, random combination of questions and non-synonym replacement, thereby constructing the training data and the validation set.
Further, the pre-training model is a Chinese pre-trained RoBERTa or XLNet model.
Further, the formula for calculating question similarity is as follows:
[Formula: S(Q_u, Q_k) is the combination of the cosine similarity Cos(V_u, V_k) weighted by μ, a text-alignment-ratio term weighted by γ_1, and a query-frequency term C(Q_k)/max(C(Q_k)) weighted by γ_2, normalized with a softmax function.]
where Q_u and Q_k are respectively the user question and a question retrieved from the knowledge base by indexing; S(Q_u, Q_k) denotes the similarity between the user question and the indexed question; V_u and V_k are respectively the question vectors output by the semantic model for the user question and for the indexed question; Cos(V_u, V_k) is the cosine similarity of the two vectors; γ_1, γ_2 and μ are coefficients, with γ_1 ∈ (0, 0.1), γ_2 ∈ (0, 0.1) and μ ∈ (0, 1); C(Q_k) is the number of times question Q_k has been queried, obtained from log statistics; and max(C(Q_k)) is the maximum query count over all questions.
Further, if the user question is judged to be noise, the response output module returns hot recommended questions according to query frequency, or no response; if the user question, after analysis by the QA module, matches a rule requiring an exactly returned answer, the response output module outputs the standard answer corresponding to the question; if the user question is not noise and matches no QA rule, a similar-question list is returned through the knowledge matching module for the user to choose from. In addition, the system maintains the state of the similar-question list, and the user can continue the question-answering by clicking or by position instructions.
The invention has the following beneficial effects: the search question-answering system based on pre-training solves the problem of knowledge generalization transfer by fine-tuning a pre-trained model, solves noise judgment with statistical learning, solves the QA customization problem with a rule parsing algorithm, and greatly improves question-answering efficiency while improving the user experience.
Drawings
FIG. 1 is an architecture diagram of a search question and answer system;
FIG. 2 is a flow diagram of knowledge index acquisition;
FIG. 3 is a semantic model acquisition flow diagram;
FIG. 4 is a flow chart of training data generation.
Detailed Description
The invention is further described with reference to the following figures and specific embodiments.
FIG. 1 is an architecture diagram of the whole search question-answering system, which includes a noise judgment module, a QA question-answering module, a knowledge matching module and a response output module. When the system receives a user question, it first judges whether the question is noise; if the question is not noise, QA question-answer matching is performed; if there is no QA match, index matching is performed against the knowledge base, and after similarity ranking the top 5 similar questions are selected and passed to the response output module, which finally returns the question-answering response.
In this embodiment, the noise determination module determines whether the user question belongs to noise through the industry lexicon and the exclusion lexicon, determines that the user question is non-noise when the user question includes an industry word and does not include an exclusion word, and enters the QA question-answering module for analysis, otherwise determines that the user question is noise, and the response output module returns a hot recommendation question or no response.
The exclusion word stock of the noise judgment module is constructed manually according to industry experience and uniformly filters out irregular questions such as sensitive information and speech-recognition errors. The exclusion word stock is continuously enriched during subsequent use, through log statistics and similar processes.
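The noise check itself reduces to two set-membership tests over the segmented question. The following is a minimal sketch, assuming the two lexicons are stored as one-word-per-line text files (the file names are hypothetical):

```python
import jieba

# Hypothetical lexicon files: one word per line.
industry_words = {line.strip() for line in open("industry_words.txt", encoding="utf-8")}
exclusion_words = {line.strip() for line in open("exclusion_words.txt", encoding="utf-8")}

def is_noise(question: str) -> bool:
    """Non-noise only if the question contains at least one industry word
    and no exclusion word; anything else is treated as noise."""
    tokens = set(jieba.lcut(question))
    has_industry = bool(tokens & industry_words)
    has_excluded = bool(tokens & exclusion_words)
    return not (has_industry and not has_excluded)
```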
The industry word stock of the noise judgment module is obtained through the following steps (a TF-IDF sketch follows these steps):
A1) first, statistics are gathered over the training data, whose sources include the question-answer knowledge base and other industry questions crawled from web resources;
A2) the precise mode of jieba word segmentation is used for segmentation, the term frequency TF is calculated over the industry question-answer knowledge base, the inverse document frequency IDF of each word is calculated over all the data, and the word weight W is calculated from TF and IDF, where TF, IDF and W are respectively calculated as follows:
TF = (number of occurrences of the word in the industry question-answer knowledge base) / (total number of words in the industry question-answer knowledge base);
IDF = log( total number of questions in all the data / (number of questions containing the word + 1) );
W=TF*IDF;
A3) a suitable number of words are selected as industry words according to the word weights calculated in step A2, or the industry words are selected by setting a minimum threshold. Because segmentation should be as fine-grained as possible when the index is constructed, compound industry phrases that jieba would segment as a single token are split into their component words; otherwise, if the user utters only part of such a compound, the inverted index cannot locate the question. In addition, the industry words are added to the custom dictionary of jieba segmentation and their weight is increased to ensure that they are segmented correctly;
A4) professionals supplement a number of spoken abbreviations and other non-standard industry words, forming the final industry word stock.
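A compact sketch of steps A1-A3, under the assumption that the knowledge-base questions and the crawled corpus are available as plain string lists; the top-k cutoff, the placeholder example questions and the jieba custom-dictionary frequency are illustrative values, not taken from the patent:

```python
import math
from collections import Counter
import jieba

def industry_word_weights(industry_questions, all_questions, top_k=500):
    # Step A2: term frequency over the industry question-answer knowledge base.
    industry_tokens = [w for q in industry_questions for w in jieba.lcut(q)]
    tf_counts = Counter(industry_tokens)
    total = sum(tf_counts.values())

    # Inverse document frequency over all collected questions.
    docs = [set(jieba.lcut(q)) for q in all_questions]
    n_docs = len(docs)

    weights = {}
    for word, count in tf_counts.items():
        df = sum(1 for d in docs if word in d)
        idf = math.log(n_docs / (df + 1))
        weights[word] = (count / total) * idf          # W = TF * IDF
    # Step A3: keep the top-weighted words as candidate industry words.
    return sorted(weights.items(), key=lambda x: -x[1])[:top_k]

# Placeholder example data.
industry_kb_questions = ["怎么缴纳个人所得税", "发票如何打印"]
all_corpus_questions = industry_kb_questions + ["今天天气怎么样"]

# Step A3 (continued): register the selected industry words in jieba's custom
# dictionary with a high frequency so that they are segmented as single tokens.
for word, _ in industry_word_weights(industry_kb_questions, all_corpus_questions):
    jieba.add_word(word, freq=100000)
```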
For conventional questions and answers such as instruction intents (e.g. instructions controlling front-end page jumps), industry/organization introductions and greetings, answers must be returned exactly, so a QA question-answering module is provided.
The QA question-answering module comprises a rule entry unit and a rule analysis unit. The rule analysis unit parses the user question against the rules entered by the rule entry unit and judges whether the parsed user question requires an exactly returned answer; if so, the response output module outputs the standard answer corresponding to the question, and if not, the parsed question is sent to the knowledge matching module.
In order to improve data reuse and reduce data entry and maintenance work, some common information can be factored out and defined as entities. Taking invoice printing as an example, an entity can be defined as "water charge OR electricity charge OR gas charge" and used directly when rules are entered.
In this embodiment, the rule entry unit of the QA question-answering module supports logical operators ("and", "or", "not"), brackets, logical nesting, number parsing, entity entry and the like, and can therefore express a variety of complex logical rules. For example, an intent may be entered as "AND (water OR electricity OR gas) AND invoice", or, using the entity, as "AND { @invoice kind } AND invoice". In addition, this embodiment also provides system entities (number, place, etc.); in particular, for the number entity, numeric logical rules (greater than, less than, equal to, etc.) can be entered.
For an entered logical expression, the rule analysis unit first matches the question against the rule expression to form a logical expression containing only 1 (True) and 0 (False) (AND is replaced by &, OR by |, and Chinese brackets by English brackets), then evaluates the expression with the rule parsing algorithm and outputs whether it matches (1 = matched, 0 = not matched).
Specifically, the rule parsing algorithm evaluates the logical expression as follows (a code sketch follows the operation rules below): the logical expression is pushed onto a number stack and an operator stack and evaluated recursively, in the same way as arithmetic expressions with addition, subtraction, multiplication and division are evaluated, with operation priority: brackets > AND operation > OR operation,
the operation rule is as follows:
1&1=1;
1&0=0&1=0;
1|0=0|1=1;
1|1=1;
0|0=0,
where & denotes the AND operation and | denotes the OR operation; in the logical expression, the operator AND is replaced by & and the operator OR is replaced by |.
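The following sketch illustrates this two-step matching and evaluation. It is an iterative stack variant of the recursive computation described above; the example rule, the substring-based term matching and the handling of entities (which would be expanded into their OR-expressions before matching) are illustrative assumptions:

```python
import re

def rule_to_binary(rule: str, question: str) -> str:
    # Replace operators and brackets, then each remaining term by 1/0 according
    # to whether it occurs in the user question.
    expr = rule.replace("AND", "&").replace("OR", "|").replace("（", "(").replace("）", ")")
    expr = re.sub(r"[^&|()\s]+", lambda m: "1" if m.group(0) in question else "0", expr)
    return expr.replace(" ", "")

def eval_expr(expr: str) -> int:
    nums, ops = [], []                        # number stack and operator stack

    def apply(op):
        b, a = nums.pop(), nums.pop()
        nums.append((a & b) if op == "&" else (a | b))

    prec = {"|": 1, "&": 2}                   # brackets handled explicitly below
    for ch in expr:
        if ch in "01":
            nums.append(int(ch))
        elif ch == "(":
            ops.append(ch)
        elif ch == ")":
            while ops[-1] != "(":
                apply(ops.pop())
            ops.pop()
        else:                                  # & or |
            while ops and ops[-1] != "(" and prec[ops[-1]] >= prec[ch]:
                apply(ops.pop())
            ops.append(ch)
    while ops:
        apply(ops.pop())
    return nums[0]                             # 1 = matched, 0 = not matched

# Example: the rule "(水费 OR 电费) AND 发票" matched against "我想打印水费发票"
# becomes "(1|0)&1", which evaluates to 1 (matched).
print(eval_expr(rule_to_binary("(水费 OR 电费) AND 发票", "我想打印水费发票")))
```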
The knowledge matching module indexes the knowledge in the question-answer base and performs similarity ranking. The knowledge index comprises two modes, the inverted index and the Annoy index; the Annoy index is based on a semantic model, the semantic model is obtained from a pre-training model through training-data generation and fine-tuning, and the output of the last layer or the second-to-last layer of the semantic model is used as the question vector for the Annoy index. When the similarity is calculated and ranked, the vector similarity, the question-answer frequency and the text alignment ratio are considered together.
In this embodiment, the specific implementation process of the knowledge matching module is as follows:
First, training-set construction. As shown in FIG. 4, the training data are constructed mainly from three parts: synonym replacement, back-translation and random combination. The specific steps are as follows:
C1) a question-answer knowledge base is constructed from material supplied by the industry customer and then supplemented, the supplementary sources including frequently asked questions crawled from industry websites and manually supplied entries;
C2) synonyms are extracted, including industry word stock extraction and synonym group extraction: industry keywords are constructed from the crawled question-answer knowledge base and web data after tf-idf weight calculation and manual screening, synonyms of each keyword are then extracted with the Tencent word vectors, the extraction rule being to take the 10 most similar words (by cosine similarity) excluding the keyword itself, and the final synonym groups are obtained after manual screening and de-duplication (see the synonym-extraction sketch after this list);
synonyms before screening are as follows:
manufacturing company's own investment organization product and service medicine enterprise in enterprise's private enterprise company innovation company industry private-member manufacturing industry
The screening was followed as follows:
enterprise civil-enterprise company industry medicine enterprise
C3) similar questions are constructed for the questions in the question-answer knowledge base through synonym replacement: synonyms appearing in the knowledge-base questions are replaced to expand the number of questions, after which unreasonable questions are removed and the pairs are labeled as similar or not;
C4) back-translation: all questions in the knowledge base are traversed and the Baidu translation open platform interface is called to translate them into English and back into Chinese; unreasonable questions are then removed and the pairs are labeled as similar or not;
C5) the similar question pairs are collected, all questions are randomly combined (dissimilar pairs are also obtained from questions without answers, random question combinations and non-synonym replacement) to construct the same number of dissimilar question pairs, and the training set is formed after random shuffling. The ratio of training set to validation set is 9:1.
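A brief sketch of the synonym-extraction part of step C2 and the synonym replacement of step C3, under the assumption that the Tencent word vectors are available in word2vec text format and loaded with gensim; the file path and the manual-screening step are placeholders:

```python
from gensim.models import KeyedVectors

# Hypothetical path to the Tencent word vectors in word2vec text format.
wv = KeyedVectors.load_word2vec_format("tencent_word_vectors.txt", binary=False)

def candidate_synonyms(keyword: str, topn: int = 10):
    # Step C2: the 10 most similar words by cosine similarity, excluding the
    # keyword itself; the result still needs manual screening and de-duplication.
    return [w for w, _ in wv.most_similar(keyword, topn=topn) if w != keyword]

def similar_questions(question: str, synonym_groups: dict):
    # Step C3: generate similar questions by replacing screened industry
    # keywords in the original question with their synonyms.
    variants = []
    for word, synonyms in synonym_groups.items():
        if word in question:
            variants.extend(question.replace(word, s) for s in synonyms)
    return variants
```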
Second, fine-tuning is performed on the pre-training model with the training data to obtain the semantic model. As shown in FIG. 3, the specific process is as follows:
B1) a fully connected layer and a softmax layer are added after the output layer of the pre-training model, and the softmax layer outputs the probabilities of "similar" and "dissimilar". The model input is a question pair from the training data; cross-entropy is used as the loss function and Adam for gradient descent; f1 and accuracy are used as the combined metric, and the optimal semantic model is selected on the validation set.
When the knowledge vectors are constructed, the knowledge-base questions are traversed and the 768-dimensional output of the last layer (higher precision) or of the second-to-last layer (higher recall) of the model is used as the question vector; the customer can choose between them, and the system uses the output of the second-to-last layer by default.
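A hedged sketch of this fine-tuning setup and of the question-vector extraction using the HuggingFace transformers API; the checkpoint name and the choice of the [CLS] position for pooling are assumptions, not specified in the description:

```python
import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification

CHECKPOINT = "hfl/chinese-roberta-wwm-ext"   # example Chinese RoBERTa checkpoint (assumption)
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)

# Sentence-pair classifier: pre-trained encoder + new fully connected/softmax head.
classifier = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)
optimizer = torch.optim.Adam(classifier.parameters(), lr=2e-5)

def training_step(q1: str, q2: str, label: int) -> float:
    # One fine-tuning step on a (similar / dissimilar) question pair.
    inputs = tokenizer(q1, q2, return_tensors="pt", truncation=True, padding=True)
    out = classifier(**inputs, labels=torch.tensor([label]))   # cross-entropy loss
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()

# Question vector: [CLS] output of the second-to-last encoder layer (system default).
encoder = AutoModel.from_pretrained(CHECKPOINT, output_hidden_states=True)

@torch.no_grad()
def question_vector(question: str) -> torch.Tensor:
    inputs = tokenizer(question, return_tensors="pt", truncation=True)
    hidden_states = encoder(**inputs).hidden_states   # embeddings + every encoder layer
    return hidden_states[-2][0, 0]                     # 768-dimensional vector
```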
Compared with the original BERT model, the pre-training model used by the invention is trained on more data for more training steps, uses dynamic whole-word masking and removes the next-sentence-prediction task, and performs better on various NLP tasks. In the present search system, before fine-tuning, the pre-training model was tested to improve the top-1 hit rate by 20.5% and the top-5 hit rate by 11.5% over the simplest BERT model.
Third, knowledge-index construction. The inverted index is fast to build, and inserting or deleting knowledge requires little index-update computation, so it is suitable for search question-answering over small knowledge bases (preferably within ten thousand entries) and improves question recall. The Annoy index is slow to build and does not support index updates when the knowledge changes, but queries over large amounts of data are fast, so it is suitable for indexing massive base-library data (up to the million level); the user can configure which index to use according to the size and category of the knowledge base. In general, the industry base question-answer library contains more questions, is rarely updated and is shared among different customers in the same industry, so the Annoy index is used for it; for the customization questions of individual customers (which are dynamic), the inverted index is used.
As shown in FIG. 2, the inverted index is obtained by applying jieba word segmentation and stop-word removal to the questions in the question-answer knowledge base. The Annoy index is obtained by building multiple high-dimensional nearest-neighbor search trees over the knowledge vectors of the question-answer knowledge base; the knowledge vectors are produced by the semantic model, and the semantic model is obtained from the pre-training model through training-data generation and fine-tuning.
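Both index types map onto standard Python tooling. The sketch below builds a simple inverted index from jieba tokens and an Annoy index over the question vectors using the annoy library; the stop-word list and the number of trees are illustrative:

```python
from collections import defaultdict
import jieba
from annoy import AnnoyIndex

STOP_WORDS = {"的", "了", "吗", "怎么"}       # illustrative stop-word list

def build_inverted_index(questions):
    # Token -> set of question ids, after segmentation and stop-word removal.
    index = defaultdict(set)
    for qid, question in enumerate(questions):
        for token in jieba.lcut(question):
            if token not in STOP_WORDS:
                index[token].add(qid)
    return index

def build_annoy_index(question_vectors, n_trees=10):
    # question_vectors: list of 768-dimensional vectors from the semantic model.
    index = AnnoyIndex(768, "angular")        # angular distance ~ cosine similarity
    for qid, vector in enumerate(question_vectors):
        index.add_item(qid, vector)
    index.build(n_trees)                      # multiple random-projection trees
    return index

# Query example: the 5 nearest knowledge-base questions to a user question vector.
# top5_ids = annoy_index.get_nns_by_vector(user_vector, 5)
```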
In this embodiment, the pre-training model is a Chinese pre-trained RoBERTa or XLNet model.
Fourth, question-similarity calculation and ranking: the cosine similarity between the user question vector and the knowledge vectors in the knowledge base is computed, features such as hot-question statistics and the text alignment ratio are additionally considered, and the candidates are ranked by the combined similarity.
The similarity calculation formula is as follows:
[Formula: S(Q_u, Q_k) is the combination of the cosine similarity Cos(V_u, V_k) weighted by μ, a text-alignment term based on L(Q_u, Q_k) weighted by γ_1, and a query-frequency term C(Q_k)/max(C(Q_k)) weighted by γ_2, normalized with a softmax function.]
where Q_u and Q_k are respectively the user question and a question retrieved from the knowledge base by indexing; S(Q_u, Q_k) denotes the similarity between the user question and the indexed question; V_u and V_k are respectively the question vectors output by the semantic model for the user question and for the indexed question; Cos(V_u, V_k) is the cosine similarity of the two vectors; γ_1, γ_2 and μ are coefficients, with γ_1 ∈ (0, 0.1), γ_2 ∈ (0, 0.1) and μ ∈ (0, 1); C(Q_k) is the number of times question Q_k has been queried, obtained from log statistics; and max(C(Q_k)) is the maximum query count over all questions.
The first part of the formula considers the cosine similarity between the user question vector and the indexed question vector and is the main component; the second part considers the alignment of the word-segmentation counts of the user question and the indexed question, where L(Q_u, Q_k) is the absolute difference of the two word counts; the third part considers the query frequency of common questions. The three components are passed through a softmax function, normalized into similarity values between 0 and 1, and then ranked.
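An illustrative scoring sketch that follows this description: a μ-weighted cosine term, a γ_1-weighted word-count-alignment term and a γ_2-weighted query-frequency term are combined and the candidate scores are normalized with softmax. The exact way the alignment term enters the score and the coefficient values are assumptions made for illustration:

```python
import numpy as np
import jieba

def rank_candidates(user_question, user_vector, candidates,
                    mu=0.9, gamma1=0.05, gamma2=0.05, top_k=5):
    """candidates: list of (question_text, question_vector, query_count)."""
    max_count = max(count for _, _, count in candidates) or 1
    user_len = len(jieba.lcut(user_question))
    scores = []
    for text, vector, count in candidates:
        # First part: cosine similarity of the two question vectors.
        cos = float(np.dot(user_vector, vector) /
                    (np.linalg.norm(user_vector) * np.linalg.norm(vector)))
        # Second part: word-count alignment, L(Q_u, Q_k).
        align = abs(user_len - len(jieba.lcut(text)))
        # Third part: normalized query frequency from log statistics.
        score = mu * cos + gamma1 / (1 + align) + gamma2 * count / max_count
        scores.append(score)
    probs = np.exp(scores) / np.exp(scores).sum()          # softmax normalization
    order = np.argsort(-probs)
    return [(candidates[i][0], float(probs[i])) for i in order[:top_k]]
```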
The response output module is used for outputting the response, which is of four types: a similar-question list, an exact answer, no answer, or recommended hot questions.
Specifically, if the user question is judged to be noise, the response output module returns hot recommended questions according to query frequency, or no response; if the user question, after analysis by the QA module, matches a rule requiring an exactly returned answer, the response output module outputs the standard answer corresponding to the question; if the user question is not noise and matches no QA rule, a similar-question list is returned through the knowledge matching module for the user to choose from. In addition, the system maintains the state of the similar-question list, and the user can continue the question-answering by clicking or by position instructions.
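The routing logic of the response output module can be summarized as follows; is_noise, qa_match, knowledge_match and hot_questions stand for the modules described above and are placeholders, not functions defined by the patent:

```python
def respond(question: str):
    # Noise: return hot questions ranked by query frequency, or no answer.
    if is_noise(question):
        hot = hot_questions()
        return {"type": "hot_recommendation", "questions": hot} if hot else {"type": "no_answer"}
    # QA rule matched: return the standard answer exactly.
    answer = qa_match(question)
    if answer is not None:
        return {"type": "exact_answer", "answer": answer}
    # Otherwise: return the top-5 similar questions for the user to choose from.
    return {"type": "similar_questions", "questions": knowledge_match(question, top_k=5)}
```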
In this embodiment, unsupervised learning over a large amount of data yields a language model with a certain generalization capability; the question-vector index of the question-answer library is built from this language model, and statistical information of the question-answer library such as TF (term frequency) and IDF (inverse document frequency), together with manually added semantic matching rules, is also taken into account, so as to build a search question-answering system that responds to user questions.
This embodiment is based on an intelligent search question-answering system: parameter optimization of the pre-training model improves the model's sensitivity to industry knowledge while greatly improving its generalization capability; the industry word stock, screened by statistical learning and manual work, effectively excludes the interference of non-industry and other irrelevant questions; QA rules are entered and parsed by the system and embedded into the question-answering system, and since QA rule entry supports entities (including numbers) and complex logical rules, exact QA question-answering can be realized effectively. The invention can effectively resolve pain points such as knowledge generalization transfer, noise judgment and QA customization in current engineering applications, and greatly improves question-answering efficiency while improving the user experience. The invention can also be used on intelligent devices such as mobile phones, computers, intelligent robots and self-service machines.
The foregoing description is only for the basic principle and the preferred embodiments of the present invention, and modifications and substitutions by those skilled in the art are included in the scope of the present invention.

Claims (10)

1. A search question-answering system based on pre-training is characterized in that: the system comprises a noise judgment module, a QA question-answering module, a knowledge matching module and a response output module;
the noise judgment module judges whether the user question belongs to noise through an industry word bank and an exclusion word bank, when the user question contains industry words and does not contain exclusion words, the user question is determined to be non-noise and enters a QA question-answering module for analysis, otherwise, the user question is determined to be noise, and a response output module returns a hot recommendation question or no response;
the QA question-answering module comprises a rule entry unit and a rule analysis unit; the rule analysis unit parses the user question against the rules entered by the rule entry unit and judges whether the parsed user question requires an exactly returned answer; if so, the response output module outputs the standard answer corresponding to the question, and if not, the parsed question is sent to the knowledge matching module;
the knowledge matching module indexes knowledge in a question and answer base and performs similarity sorting, the knowledge index comprises an inverted index and an Annoy index, the Annoy index is based on a semantic model, the semantic model is obtained by training data generation and fine-tuning on the basis of a pre-training model, the output of the last layer or the second last layer of the semantic model is used as a question vector to perform the Annoy index, and when the similarity is calculated and sorted, the vector similarity, the question and answer frequency and the text alignment ratio are comprehensively considered;
the response output module is used for outputting the response, which is of four types: a similar-question list, an exact answer, no answer, or recommended hot questions.
2. The pre-training based search question-answering system according to claim 1, wherein: the exclusion word stock of the noise judgment module is obtained through manual screening and later log maintenance, and the industry word stock of the noise judgment module is obtained as follows: A1) first, statistics are gathered over the training data, whose sources include the question-answer knowledge base and other industry questions crawled from web resources; A2) the precise mode of jieba word segmentation is used for segmentation, the term frequency TF is calculated over the industry question-answer knowledge base, the inverse document frequency IDF of each word is calculated over all the data, and the word weight W is calculated from TF and IDF, where TF, IDF and W are respectively calculated as follows:
TF = (number of occurrences of the word in the industry question-answer knowledge base) / (total number of words in the industry question-answer knowledge base);
IDF = log( total number of questions in all the data / (number of questions containing the word + 1) );
W=TF*IDF;
A3) a suitable number of words are selected as industry words according to the word weights calculated in step A2, or the industry words are selected by setting a minimum threshold; splittable compound phrases among the industry words are split, or the industry words are added to the custom dictionary of jieba segmentation with increased weight so that they are segmented correctly;
A4) professionals supplement a number of spoken abbreviations and other non-standard industry words, forming the final industry word stock.
3. The pre-training based search question-answering system according to claim 1, wherein: the rule entry unit of the QA question-answering module supports the entry of logical expressions, brackets, logical nesting, number parsing and entities; for an entered logical expression, the rule analysis unit first matches the question against the rule expression to form a logical expression containing only 1s and 0s, then evaluates the logical expression with the rule parsing algorithm and outputs whether it matches.
4. The pre-training based search question-answering system according to claim 3, wherein: the rule parsing algorithm evaluates the logical expression as follows: the logical expression is pushed onto a number stack and an operator stack and evaluated recursively, with operation priority: brackets > AND operation > OR operation; the operation rules are: 1&1=1; 1&0=0&1=0; 1|0=0|1=1; 1|1=1; 0|0=0, where & denotes the AND operation and | denotes the OR operation, and in the logical expression the operator AND is replaced by & and the operator OR is replaced by |.
5. The pre-training based search question-answering system according to claim 1, wherein: the inverted index in the knowledge matching module is obtained by applying jieba word segmentation and stop-word removal to the questions in the question-answer knowledge base; the Annoy index in the knowledge matching module is obtained by building multiple high-dimensional nearest-neighbor search trees over the knowledge vectors of the question-answer knowledge base, the knowledge vectors are produced by the semantic model, and the semantic model is obtained from the pre-training model through training-data generation and fine-tuning.
6. The pre-training based search question-answering system according to claim 5, wherein: the specific process of obtaining the semantic model from the pre-training model is as follows: B1) a fully connected layer and a softmax layer are added after the output layer of the pre-training model, and when the knowledge vectors are constructed the outputs of the last layer and the second-to-last layer of the pre-training model are respectively used as question vectors; B2) training data are built through text enhancement; B3) fine-tuning is performed on the pre-training model processed in step B1 with the text-enhanced training data to obtain the semantic model.
7. The pre-training based search question-answering system according to claim 6, wherein: the process of building training data through text enhancement in step B2 is as follows: C1) a question-answer knowledge base is constructed from material supplied by the industry customer and then supplemented, the supplementary sources including frequently asked questions crawled from industry websites and manually supplied entries; C2) synonyms are extracted, including industry word stock extraction and synonym group extraction: industry keywords are constructed from the crawled question-answer knowledge base and web data after tf-idf weight calculation and manual screening, synonyms of each keyword are then extracted with the Tencent word vectors, the extraction rule being to take the 10 most similar words (by cosine similarity) excluding the keyword itself, and the final synonym groups are obtained after manual screening and de-duplication; C3) similar questions are constructed for the questions in the question-answer knowledge base through synonym replacement; C4) the questions in the question-answer knowledge base are translated into English and back into Chinese through a translation open platform, followed by manual screening and labeling; C5) synonymous question pairs are obtained after text enhancement, and non-synonymous question pairs are obtained from questions without answers, random combination of questions and non-synonym replacement, thereby constructing the training data and the validation set.
8. The pre-training based search question-answering system according to any one of claims 1, 5, 6 and 7, wherein: the pre-training model is a Chinese pre-trained RoBERTa or XLNet model.
9. The pre-training based search question-answering system according to claim 1, wherein: the formula for calculating question similarity is as follows:
[Formula: S(Q_u, Q_k) is the combination of the cosine similarity Cos(V_u, V_k) weighted by μ, a text-alignment-ratio term weighted by γ_1, and a query-frequency term C(Q_k)/max(C(Q_k)) weighted by γ_2, normalized with a softmax function.]
where Q_u and Q_k are respectively the user question and a question retrieved from the knowledge base by indexing; S(Q_u, Q_k) denotes the similarity between the user question and the indexed question; V_u and V_k are respectively the question vectors output by the semantic model for the user question and for the indexed question; Cos(V_u, V_k) is the cosine similarity of the two vectors; γ_1, γ_2 and μ are coefficients, with γ_1 ∈ (0, 0.1), γ_2 ∈ (0, 0.1) and μ ∈ (0, 1); C(Q_k) is the number of times question Q_k has been queried, obtained from log statistics; and max(C(Q_k)) is the maximum query count over all questions.
10. The pre-training based search question-answering system according to claim 1, wherein: if the user question is judged to be noise, the response output module returns hot recommended questions according to query frequency, or no response; if the user question, after analysis by the QA module, matches a rule requiring an exactly returned answer, the response output module outputs the standard answer corresponding to the question; if the user question is not noise and matches no QA rule, a similar-question list is returned through the knowledge matching module for the user to choose from; in addition, the system maintains the state of the similar-question list, and the user can continue the question-answering by clicking or by position instructions.
CN201911341560.2A 2019-12-20 2019-12-20 Search question-answering system based on pre-training Active CN111125334B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911341560.2A CN111125334B (en) 2019-12-20 2019-12-20 Search question-answering system based on pre-training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911341560.2A CN111125334B (en) 2019-12-20 2019-12-20 Search question-answering system based on pre-training

Publications (2)

Publication Number Publication Date
CN111125334A true CN111125334A (en) 2020-05-08
CN111125334B CN111125334B (en) 2023-09-12

Family

ID=70501440

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911341560.2A Active CN111125334B (en) 2019-12-20 2019-12-20 Search question-answering system based on pre-training

Country Status (1)

Country Link
CN (1) CN111125334B (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111310234A (en) * 2020-05-09 2020-06-19 支付宝(杭州)信息技术有限公司 Personal data processing method and device based on zero-knowledge proof and electronic equipment
CN111752804A (en) * 2020-06-29 2020-10-09 中国电子科技集团公司第二十八研究所 Database cache system based on database log scanning
CN111813910A (en) * 2020-06-24 2020-10-23 平安科技(深圳)有限公司 Method, system, terminal device and computer storage medium for updating customer service problem
CN112100345A (en) * 2020-08-25 2020-12-18 百度在线网络技术(北京)有限公司 Training method and device for non-question-answer-like model, electronic equipment and storage medium
CN112231537A (en) * 2020-11-09 2021-01-15 张印祺 Intelligent reading system based on deep learning and web crawler
CN112380843A (en) * 2020-11-18 2021-02-19 神思电子技术股份有限公司 Random disturbance network-based open answer generation method
CN112380358A (en) * 2020-12-31 2021-02-19 神思电子技术股份有限公司 Rapid construction method of industry knowledge base
CN112507097A (en) * 2020-12-17 2021-03-16 神思电子技术股份有限公司 Method for improving generalization capability of question-answering system
CN112506963A (en) * 2020-11-23 2021-03-16 上海方立数码科技有限公司 Multi-service-scene-oriented service robot problem matching method
CN112541069A (en) * 2020-12-24 2021-03-23 山东山大鸥玛软件股份有限公司 Text matching method, system, terminal and storage medium combined with keywords
CN112597291A (en) * 2020-12-26 2021-04-02 中国农业银行股份有限公司 Intelligent question and answer implementation method, device and equipment
CN113627152A (en) * 2021-07-16 2021-11-09 中国科学院软件研究所 Unsupervised machine reading comprehension training method based on self-supervised learning
CN114860913A (en) * 2022-05-24 2022-08-05 北京百度网讯科技有限公司 Intelligent question-answering system construction method, question-answering processing method and device
CN116860951A (en) * 2023-09-04 2023-10-10 贵州中昂科技有限公司 Information consultation service management method and management system based on artificial intelligence
CN118093839A (en) * 2024-04-24 2024-05-28 北京中关村科金技术有限公司 Knowledge operation question-answer dialogue processing method and system based on deep learning

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101621391A (en) * 2009-08-07 2010-01-06 北京百问百答网络技术有限公司 Method and system for classifying short texts based on probability topic
CN105005564A (en) * 2014-04-17 2015-10-28 北京搜狗科技发展有限公司 Data processing method and apparatus based on question-and-answer platform
CN107273350A (en) * 2017-05-16 2017-10-20 广东电网有限责任公司江门供电局 A kind of information processing method and its device for realizing intelligent answer
CN107980130A (en) * 2017-11-02 2018-05-01 深圳前海达闼云端智能科技有限公司 It is automatic to answer method, apparatus, storage medium and electronic equipment
WO2018077655A1 (en) * 2016-10-24 2018-05-03 Koninklijke Philips N.V. Multi domain real-time question answering system
CN108959531A (en) * 2018-06-29 2018-12-07 北京百度网讯科技有限公司 Information search method, device, equipment and storage medium
US20180373782A1 (en) * 2017-06-27 2018-12-27 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for recommending answer to question based on artificial intelligence
CN109947921A (en) * 2019-03-19 2019-06-28 河海大学常州校区 A kind of intelligent Answer System based on natural language processing
CN110008322A (en) * 2019-03-25 2019-07-12 阿里巴巴集团控股有限公司 Art recommended method and device under more wheel session operational scenarios
CN110309267A (en) * 2019-07-08 2019-10-08 哈尔滨工业大学 Semantic retrieving method and system based on pre-training model
CN110413783A (en) * 2019-07-23 2019-11-05 银江股份有限公司 A kind of judicial style classification method and system based on attention mechanism

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101621391A (en) * 2009-08-07 2010-01-06 北京百问百答网络技术有限公司 Method and system for classifying short texts based on probability topic
CN105005564A (en) * 2014-04-17 2015-10-28 北京搜狗科技发展有限公司 Data processing method and apparatus based on question-and-answer platform
WO2018077655A1 (en) * 2016-10-24 2018-05-03 Koninklijke Philips N.V. Multi domain real-time question answering system
CN107273350A (en) * 2017-05-16 2017-10-20 广东电网有限责任公司江门供电局 A kind of information processing method and its device for realizing intelligent answer
US20180373782A1 (en) * 2017-06-27 2018-12-27 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for recommending answer to question based on artificial intelligence
CN107980130A (en) * 2017-11-02 2018-05-01 深圳前海达闼云端智能科技有限公司 It is automatic to answer method, apparatus, storage medium and electronic equipment
CN108959531A (en) * 2018-06-29 2018-12-07 北京百度网讯科技有限公司 Information search method, device, equipment and storage medium
CN109947921A (en) * 2019-03-19 2019-06-28 河海大学常州校区 A kind of intelligent Answer System based on natural language processing
CN110008322A (en) * 2019-03-25 2019-07-12 阿里巴巴集团控股有限公司 Art recommended method and device under more wheel session operational scenarios
CN110309267A (en) * 2019-07-08 2019-10-08 哈尔滨工业大学 Semantic retrieving method and system based on pre-training model
CN110413783A (en) * 2019-07-23 2019-11-05 银江股份有限公司 A kind of judicial style classification method and system based on attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘飞; 张俊然; 杨豪: "Research progress in medical image recognition based on deep learning", no. 01 *

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111310234A (en) * 2020-05-09 2020-06-19 支付宝(杭州)信息技术有限公司 Personal data processing method and device based on zero-knowledge proof and electronic equipment
CN111813910A (en) * 2020-06-24 2020-10-23 平安科技(深圳)有限公司 Method, system, terminal device and computer storage medium for updating customer service problem
CN111813910B (en) * 2020-06-24 2024-05-31 平安科技(深圳)有限公司 Customer service problem updating method, customer service problem updating system, terminal equipment and computer storage medium
CN111752804A (en) * 2020-06-29 2020-10-09 中国电子科技集团公司第二十八研究所 Database cache system based on database log scanning
CN112100345A (en) * 2020-08-25 2020-12-18 百度在线网络技术(北京)有限公司 Training method and device for non-question-answer-like model, electronic equipment and storage medium
CN112231537A (en) * 2020-11-09 2021-01-15 张印祺 Intelligent reading system based on deep learning and web crawler
CN112380843A (en) * 2020-11-18 2021-02-19 神思电子技术股份有限公司 Random disturbance network-based open answer generation method
CN112506963B (en) * 2020-11-23 2022-09-09 上海方立数码科技有限公司 Multi-service-scene-oriented service robot problem matching method
CN112506963A (en) * 2020-11-23 2021-03-16 上海方立数码科技有限公司 Multi-service-scene-oriented service robot problem matching method
CN112507097A (en) * 2020-12-17 2021-03-16 神思电子技术股份有限公司 Method for improving generalization capability of question-answering system
CN112507097B (en) * 2020-12-17 2022-11-18 神思电子技术股份有限公司 Method for improving generalization capability of question-answering system
CN112541069A (en) * 2020-12-24 2021-03-23 山东山大鸥玛软件股份有限公司 Text matching method, system, terminal and storage medium combined with keywords
CN112597291A (en) * 2020-12-26 2021-04-02 中国农业银行股份有限公司 Intelligent question and answer implementation method, device and equipment
CN112597291B (en) * 2020-12-26 2024-09-17 中国农业银行股份有限公司 Intelligent question-answering implementation method, device and equipment
CN112380358A (en) * 2020-12-31 2021-02-19 神思电子技术股份有限公司 Rapid construction method of industry knowledge base
CN113627152A (en) * 2021-07-16 2021-11-09 中国科学院软件研究所 Unsupervised machine reading comprehension training method based on self-supervised learning
CN113627152B (en) * 2021-07-16 2023-05-16 中国科学院软件研究所 Self-supervision learning-based unsupervised machine reading and understanding training method
CN114860913A (en) * 2022-05-24 2022-08-05 北京百度网讯科技有限公司 Intelligent question-answering system construction method, question-answering processing method and device
CN114860913B (en) * 2022-05-24 2023-12-12 北京百度网讯科技有限公司 Intelligent question-answering system construction method, question-answering processing method and device
CN116860951A (en) * 2023-09-04 2023-10-10 贵州中昂科技有限公司 Information consultation service management method and management system based on artificial intelligence
CN116860951B (en) * 2023-09-04 2023-11-14 贵州中昂科技有限公司 Information consultation service management method and management system based on artificial intelligence
CN118093839A (en) * 2024-04-24 2024-05-28 北京中关村科金技术有限公司 Knowledge operation question-answer dialogue processing method and system based on deep learning
CN118093839B (en) * 2024-04-24 2024-07-02 北京中关村科金技术有限公司 Knowledge operation question-answer dialogue processing method and system based on deep learning

Also Published As

Publication number Publication date
CN111125334B (en) 2023-09-12

Similar Documents

Publication Publication Date Title
CN111125334B (en) Search question-answering system based on pre-training
CN108052583B (en) E-commerce ontology construction method
CN108363790B (en) Method, device, equipment and storage medium for evaluating comments
AU2018355097B2 (en) Methods, systems, and computer program product for implementing an intelligent system with dynamic configurability
CN109101479B (en) Clustering method and device for Chinese sentences
CN112800170A (en) Question matching method and device and question reply method and device
US12032911B2 (en) Systems and methods for structured phrase embedding and use thereof
CN109299280B (en) Short text clustering analysis method and device and terminal equipment
CN109460457A (en) Text sentence similarity calculating method, intelligent government affairs auxiliary answer system and its working method
US10586174B2 (en) Methods and systems for finding and ranking entities in a domain specific system
US12093648B2 (en) Systems and methods for producing a semantic representation of a document
CN111625621B (en) Document retrieval method and device, electronic equipment and storage medium
CN102609433A (en) Method and system for recommending query based on user log
CN111159381B (en) Data searching method and device
CN108287848B (en) Method and system for semantic parsing
US20220245361A1 (en) System and method for managing and optimizing lookup source templates in a natural language understanding (nlu) framework
CN112581327B (en) Knowledge graph-based law recommendation method and device and electronic equipment
CN111460114A (en) Retrieval method, device, equipment and computer readable storage medium
Nugraha et al. Typographic-based data augmentation to improve a question retrieval in short dialogue system
US20220237383A1 (en) Concept system for a natural language understanding (nlu) framework
CN114722176A (en) Intelligent question answering method, device, medium and electronic equipment
WO2023207566A1 (en) Voice room quality assessment method, apparatus, and device, medium, and product
CN109684357B (en) Information processing method and device, storage medium and terminal
US20220229990A1 (en) System and method for lookup source segmentation scoring in a natural language understanding (nlu) framework
CN110750967A (en) Pronunciation labeling method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant