CN111125334A - Search question-answering system based on pre-training - Google Patents

Search question-answering system based on pre-training

Info

Publication number
CN111125334A
CN111125334A (application CN201911341560.2A)
Authority
CN
China
Prior art keywords
question
answer
module
training
industry
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911341560.2A
Other languages
Chinese (zh)
Other versions
CN111125334B (en)
Inventor
申冲
张传锋
朱锦雷
薛付忠
杨帆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Synthesis Electronic Technology Co Ltd
Original Assignee
Synthesis Electronic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Synthesis Electronic Technology Co Ltd filed Critical Synthesis Electronic Technology Co Ltd
Priority to CN201911341560.2A priority Critical patent/CN111125334B/en
Publication of CN111125334A publication Critical patent/CN111125334A/en
Application granted granted Critical
Publication of CN111125334B publication Critical patent/CN111125334B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a pre-training-based search question-answering system, which comprises a noise judgment module, a QA question-answering module, a knowledge matching module and a response output module. The noise judgment module judges whether a user question is noise; the QA question-answering module comprises a rule entry unit and a rule analysis unit; the knowledge matching module indexes the user question against the knowledge in the question-answer base and ranks candidates by similarity, the knowledge index comprising an inverted index and an Annoy index; the response output module outputs the response, which is of four types: a similar-question list, an exact answer, no answer, or recommended hot questions. The invention can effectively solve the problems of knowledge generalization transfer, noise judgment and QA customization, and greatly improves question-answering efficiency while improving the user experience.

Description

Search question-answering system based on pre-training
Technical Field
The invention relates to a search question-answering system based on pre-training, i.e. a system that uses a pre-trained language model together with an existing question-answer database to interact with customer questions, and belongs to the fields of natural language processing and machine learning.
Background
A search question-answering system receives a user question, retrieves and ranks similar questions in a question-answer knowledge base, and displays a list of similar questions for the user to choose from, so as to resolve the user's question as far as possible. At present, various knowledge-base question-answering systems, intelligent customer-service assistants, self-service machines and other terminal devices adopt this question-answering mode. Unlike a traditional dialog system, whose focus is on interaction, the focus of a search question-answering system is on returning a more accurate list of similar questions; it does not have to maintain as much contextual state as a dialog system, nor does it always have to give a single exact answer. On terminal devices, speech-recognition accuracy and the colloquial language of customers remain key factors restricting the development of dialog systems. A search question-answering system can alleviate this in a simple and efficient way by recommending a list of similar questions, but it still faces low accuracy in the top-5 recommendations. Three main factors lower the top-5 recommendation accuracy: spoken-language generalization, noise, and fixed questions that require exact answers.
A search question-answering system usually serves a single industry with a large amount of data, and users phrase their questions in many different ways. User questions may come from terminal speech-recognition devices (with noise interference) or from industry website queries, and tend to be highly colloquial. Taking the tax industry as an example, instead of "paying taxes" a user is more likely to say "how much money do I have to pay". In practice, an industry customer often only provides question-answer pairs, i.e. one standard question corresponding to one standard answer, without any generalization of the standard questions, so when the system receives a user question it can only be matched against the questions in the question-answer knowledge base one by one. The search question-answering system therefore has to cope with a large number of spoken generalization problems.
A search question-answering system recommends a list of similar questions, but it cannot respond to every input. One solution is to set a minimum threshold on the similarity between the user question and the questions in the question-answer knowledge base, below which no response is given. However, because the similarity between different user questions and the questions in the knowledge base varies widely, it is difficult to choose an appropriate threshold that avoids the noise problem.
In addition, a search question-answering system may still face some customized question-answering: when the user inputs a fixed question, the system must return a fixed answer exactly instead of a similar-question list, for example page-jump instructions of a terminal device, common announcements and operating instructions for the user. The search question-answering system therefore still needs a customized QA question-answering function.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a pre-training-based question-answering search system, which can effectively solve the problems of knowledge generalization migration, noise judgment and QA customization, and greatly improve the question-answering efficiency while improving the user experience.
In order to solve the technical problem, the technical scheme adopted by the invention is as follows: a pre-training-based question-answering search system comprises a noise judgment module, a QA question-answering module, a knowledge matching module and a response output module; the noise judgment module judges whether the user question belongs to noise through an industry word bank and an exclusion word bank, when the user question contains industry words and does not contain exclusion words, the user question is determined to be non-noise and enters a QA question-answering module for analysis, otherwise, the user question is determined to be noise, and a response output module returns a hot recommendation question or no response;
the QA question-answering module comprises a rule entry unit and a rule analysis unit; the rule analysis unit parses the user question against the rules entered by the rule entry unit and judges whether the parsed user question requires an exactly returned answer; if so, the response output module outputs the standard answer corresponding to the question, and if not, the parsed question is sent to the knowledge matching module;
the knowledge matching module indexes the knowledge in the question-answer base and performs similarity ranking; the knowledge index comprises an inverted index and an Annoy index, the Annoy index is based on a semantic model, and the semantic model is obtained from a pre-training model through training-data generation and fine-tuning; the output of the last layer or the second-to-last layer of the semantic model is used as the question vector for the Annoy index, and when the similarity is calculated and ranked, the vector similarity, the question-answer frequency and the text alignment ratio are considered together;
the response output module is used for outputting the response, which is of four types: a similar-question list, an exact answer, no answer, or recommended hot questions.
Further, the exclusion word stock of the noise judgment module is obtained through manual screening and later log maintenance, and the industry word stock of the noise judgment module is obtained as follows: A1) first, statistics are gathered over the training data, whose sources include the question-answer knowledge base and other industry questions crawled from web resources; A2) the precise mode of jieba word segmentation is used for segmentation, the term frequency TF is calculated over the industry question-answer knowledge base, the inverse document frequency IDF of each word is calculated over all the data, and the word weight W is calculated from TF and IDF, where TF, IDF and W are respectively calculated as follows:
TF = (number of occurrences of the word in the industry question-answer knowledge base) / (total number of words in the industry question-answer knowledge base);
IDF = log( total number of questions in all the data / (number of questions containing the word + 1) );
W=TF*IDF;
A3) a suitable number of words are selected as industry words according to the word weights calculated in step A2, or the industry words are selected by setting a minimum threshold; splittable compound phrases among the industry words are split, or the industry words are added to the custom dictionary of jieba segmentation with increased weight so that they are segmented correctly;
A4) professionals supplement a number of spoken abbreviations and other non-standard industry words, forming the final industry word stock.
Furthermore, the rule entry unit of the QA question-answering module supports the entry of logical expressions, brackets, logical nesting, number parsing and entities; for an entered logical expression, the rule analysis unit first matches the question against the rule expression to form a logical expression containing only 1s and 0s, then evaluates the logical expression with the rule parsing algorithm and outputs whether it matches.
Further, the rule parsing algorithm evaluates the logical expression as follows: the logical expression is pushed onto a number stack and an operator stack and evaluated recursively, with operation priority: brackets > AND operation > OR operation; the operation rules are: 1&1=1; 1&0=0&1=0; 1|0=0|1=1; 1|1=1; 0|0=0, where & denotes the AND operation and | denotes the OR operation, and in the logical expression the operator AND is replaced by & and the operator OR is replaced by |.
Further, the inverted index in the knowledge matching module is obtained by applying jieba word segmentation and stop-word removal to the questions in the question-answer knowledge base; the Annoy index in the knowledge matching module is obtained by building multiple high-dimensional nearest-neighbor search trees over the knowledge vectors of the question-answer knowledge base, the knowledge vectors are produced by the semantic model, and the semantic model is obtained from the pre-training model through training-data generation and fine-tuning.
Further, the specific process of obtaining the semantic model from the pre-training model is as follows: B1) a fully connected layer and a softmax layer are added after the output layer of the pre-training model, and when the knowledge vectors are constructed the outputs of the last layer and the second-to-last layer of the pre-training model are respectively used as question vectors; B2) training data are built through text enhancement; B3) fine-tuning is performed on the pre-training model processed in step B1 with the text-enhanced training data to obtain the semantic model.
Further, the process of building training data through text enhancement in step B2 is as follows: C1) a question-answer knowledge base is constructed from material supplied by the industry customer and then supplemented, the supplementary sources including frequently asked questions crawled from industry websites and manually supplied entries; C2) synonyms are extracted, including industry word stock extraction and synonym group extraction: industry keywords are constructed from the crawled question-answer knowledge base and web data after tf-idf weight calculation and manual screening, synonyms of each keyword are then extracted with the Tencent word vectors, the extraction rule being to take the 10 most similar words (by cosine similarity) excluding the keyword itself, and the final synonym groups are obtained after manual screening and de-duplication; C3) similar questions are constructed for the questions in the question-answer knowledge base through synonym replacement; C4) the questions in the question-answer knowledge base are translated into English and back into Chinese through a translation open platform, followed by manual screening and labeling; C5) synonymous question pairs are obtained after text enhancement, and non-synonymous question pairs are obtained from questions without answers, random combination of questions and non-synonym replacement, thereby constructing the training data and the validation set.
Further, the pre-training model is a Chinese pre-trained RoBERTa or XLNet model.
Further, the formula for calculating question similarity is as follows:
[Formula: S(Q_u, Q_k) is the combination of the cosine similarity Cos(V_u, V_k) weighted by μ, a text-alignment-ratio term weighted by γ_1, and a query-frequency term C(Q_k)/max(C(Q_k)) weighted by γ_2, normalized with a softmax function.]
where Q_u and Q_k are respectively the user question and a question retrieved from the knowledge base by indexing; S(Q_u, Q_k) denotes the similarity between the user question and the indexed question; V_u and V_k are respectively the question vectors output by the semantic model for the user question and for the indexed question; Cos(V_u, V_k) is the cosine similarity of the two vectors; γ_1, γ_2 and μ are coefficients, with γ_1 ∈ (0, 0.1), γ_2 ∈ (0, 0.1) and μ ∈ (0, 1); C(Q_k) is the number of times question Q_k has been queried, obtained from log statistics; and max(C(Q_k)) is the maximum query count over all questions.
Further, if the user question is judged to be noise, the response output module returns hot recommended questions according to query frequency, or no response; if the user question, after analysis by the QA module, matches a rule requiring an exactly returned answer, the response output module outputs the standard answer corresponding to the question; if the user question is not noise and matches no QA rule, a similar-question list is returned through the knowledge matching module for the user to choose from. In addition, the system maintains the state of the similar-question list, and the user can continue the question-answering by clicking or by position instructions.
The invention has the following beneficial effects: the search question-answering system based on pre-training solves the problem of knowledge generalization transfer by fine-tuning a pre-trained model, solves noise judgment with statistical learning, solves the QA customization problem with a rule parsing algorithm, and greatly improves question-answering efficiency while improving the user experience.
Drawings
FIG. 1 is an architecture diagram of a search question and answer system;
FIG. 2 is a flow diagram of knowledge index acquisition;
FIG. 3 is a semantic model acquisition flow diagram;
FIG. 4 is a flow chart of training data generation.
Detailed Description
The invention is further described with reference to the following figures and specific embodiments.
FIG. 1 is an architecture diagram of the whole search question-answering system, which includes a noise judgment module, a QA question-answering module, a knowledge matching module and a response output module. When the system receives a user question, it first judges whether the question is noise; if the question is not noise, QA question-answer matching is performed; if there is no QA match, index matching is performed against the knowledge base, and after similarity ranking the top 5 similar questions are selected and passed to the response output module, which finally returns the question-answering response.
In this embodiment, the noise determination module determines whether the user question belongs to noise through the industry lexicon and the exclusion lexicon, determines that the user question is non-noise when the user question includes an industry word and does not include an exclusion word, and enters the QA question-answering module for analysis, otherwise determines that the user question is noise, and the response output module returns a hot recommendation question or no response.
The exclusion word stock of the noise judgment module is constructed manually according to industry experience and uniformly filters out irregular questions such as sensitive information and speech-recognition errors. The exclusion word stock is continuously enriched during subsequent use, through log statistics and similar processes.
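The noise check itself reduces to two set-membership tests over the segmented question. The following is a minimal sketch, assuming the two lexicons are stored as one-word-per-line text files (the file names are hypothetical):

```python
import jieba

# Hypothetical lexicon files: one word per line.
industry_words = {line.strip() for line in open("industry_words.txt", encoding="utf-8")}
exclusion_words = {line.strip() for line in open("exclusion_words.txt", encoding="utf-8")}

def is_noise(question: str) -> bool:
    """Non-noise only if the question contains at least one industry word
    and no exclusion word; anything else is treated as noise."""
    tokens = set(jieba.lcut(question))
    has_industry = bool(tokens & industry_words)
    has_excluded = bool(tokens & exclusion_words)
    return not (has_industry and not has_excluded)
```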
The industry word stock of the noise judgment module is obtained through the following steps (a TF-IDF sketch follows these steps):
A1) first, statistics are gathered over the training data, whose sources include the question-answer knowledge base and other industry questions crawled from web resources;
A2) the precise mode of jieba word segmentation is used for segmentation, the term frequency TF is calculated over the industry question-answer knowledge base, the inverse document frequency IDF of each word is calculated over all the data, and the word weight W is calculated from TF and IDF, where TF, IDF and W are respectively calculated as follows:
TF = (number of occurrences of the word in the industry question-answer knowledge base) / (total number of words in the industry question-answer knowledge base);
IDF = log( total number of questions in all the data / (number of questions containing the word + 1) );
W=TF*IDF;
A3) a suitable number of words are selected as industry words according to the word weights calculated in step A2, or the industry words are selected by setting a minimum threshold. Because segmentation should be as fine-grained as possible when the index is constructed, compound industry phrases that jieba would segment as a single token are split into their component words; otherwise, if the user utters only part of such a compound, the inverted index cannot locate the question. In addition, the industry words are added to the custom dictionary of jieba segmentation and their weight is increased to ensure that they are segmented correctly;
A4) professionals supplement a number of spoken abbreviations and other non-standard industry words, forming the final industry word stock.
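A compact sketch of steps A1-A3, under the assumption that the knowledge-base questions and the crawled corpus are available as plain string lists; the top-k cutoff, the placeholder example questions and the jieba custom-dictionary frequency are illustrative values, not taken from the patent:

```python
import math
from collections import Counter
import jieba

def industry_word_weights(industry_questions, all_questions, top_k=500):
    # Step A2: term frequency over the industry question-answer knowledge base.
    industry_tokens = [w for q in industry_questions for w in jieba.lcut(q)]
    tf_counts = Counter(industry_tokens)
    total = sum(tf_counts.values())

    # Inverse document frequency over all collected questions.
    docs = [set(jieba.lcut(q)) for q in all_questions]
    n_docs = len(docs)

    weights = {}
    for word, count in tf_counts.items():
        df = sum(1 for d in docs if word in d)
        idf = math.log(n_docs / (df + 1))
        weights[word] = (count / total) * idf          # W = TF * IDF
    # Step A3: keep the top-weighted words as candidate industry words.
    return sorted(weights.items(), key=lambda x: -x[1])[:top_k]

# Placeholder example data.
industry_kb_questions = ["怎么缴纳个人所得税", "发票如何打印"]
all_corpus_questions = industry_kb_questions + ["今天天气怎么样"]

# Step A3 (continued): register the selected industry words in jieba's custom
# dictionary with a high frequency so that they are segmented as single tokens.
for word, _ in industry_word_weights(industry_kb_questions, all_corpus_questions):
    jieba.add_word(word, freq=100000)
```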
For conventional questions and answers such as instruction intents (e.g. instructions controlling front-end page jumps), industry/organization introductions and greetings, answers must be returned exactly, so a QA question-answering module is provided.
The QA question-answering module comprises a rule entry unit and a rule analysis unit. The rule analysis unit parses the user question against the rules entered by the rule entry unit and judges whether the parsed user question requires an exactly returned answer; if so, the response output module outputs the standard answer corresponding to the question, and if not, the parsed question is sent to the knowledge matching module.
In order to improve data reuse and reduce data entry and maintenance work, some common information can be factored out and defined as entities. Taking invoice printing as an example, an entity can be defined as "water charge OR electricity charge OR gas charge" and used directly when rules are entered.
In this embodiment, the rule entry unit of the QA question-answering module supports logical operators ("and", "or", "not"), brackets, logical nesting, number parsing, entity entry and the like, and can therefore express a variety of complex logical rules. For example, an intent may be entered as "AND (water OR electricity OR gas) AND invoice", or, using the entity, as "AND { @invoice kind } AND invoice". In addition, this embodiment also provides system entities (number, place, etc.); in particular, for the number entity, numeric logical rules (greater than, less than, equal to, etc.) can be entered.
For an entered logical expression, the rule analysis unit first matches the question against the rule expression to form a logical expression containing only 1 (True) and 0 (False) (AND is replaced by &, OR by |, and Chinese brackets by English brackets), then evaluates the expression with the rule parsing algorithm and outputs whether it matches (1 = matched, 0 = not matched).
Specifically, the rule parsing algorithm evaluates the logical expression as follows (a code sketch follows the operation rules below): the logical expression is pushed onto a number stack and an operator stack and evaluated recursively, in the same way as arithmetic expressions with addition, subtraction, multiplication and division are evaluated, with operation priority: brackets > AND operation > OR operation,
the operation rule is as follows:
1&1=1;
1&0=0&1=0;
1|0=0|1=1;
1|1=1;
0|0=0,
where & denotes the AND operation and | denotes the OR operation; in the logical expression, the operator AND is replaced by & and the operator OR is replaced by |.
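The following sketch illustrates this two-step matching and evaluation. It is an iterative stack variant of the recursive computation described above; the example rule, the substring-based term matching and the handling of entities (which would be expanded into their OR-expressions before matching) are illustrative assumptions:

```python
import re

def rule_to_binary(rule: str, question: str) -> str:
    # Replace operators and brackets, then each remaining term by 1/0 according
    # to whether it occurs in the user question.
    expr = rule.replace("AND", "&").replace("OR", "|").replace("（", "(").replace("）", ")")
    expr = re.sub(r"[^&|()\s]+", lambda m: "1" if m.group(0) in question else "0", expr)
    return expr.replace(" ", "")

def eval_expr(expr: str) -> int:
    nums, ops = [], []                        # number stack and operator stack

    def apply(op):
        b, a = nums.pop(), nums.pop()
        nums.append((a & b) if op == "&" else (a | b))

    prec = {"|": 1, "&": 2}                   # brackets handled explicitly below
    for ch in expr:
        if ch in "01":
            nums.append(int(ch))
        elif ch == "(":
            ops.append(ch)
        elif ch == ")":
            while ops[-1] != "(":
                apply(ops.pop())
            ops.pop()
        else:                                  # & or |
            while ops and ops[-1] != "(" and prec[ops[-1]] >= prec[ch]:
                apply(ops.pop())
            ops.append(ch)
    while ops:
        apply(ops.pop())
    return nums[0]                             # 1 = matched, 0 = not matched

# Example: the rule "(水费 OR 电费) AND 发票" matched against "我想打印水费发票"
# becomes "(1|0)&1", which evaluates to 1 (matched).
print(eval_expr(rule_to_binary("(水费 OR 电费) AND 发票", "我想打印水费发票")))
```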
The knowledge matching module indexes the knowledge in the question-answer base and performs similarity ranking. The knowledge index comprises two modes, the inverted index and the Annoy index; the Annoy index is based on a semantic model, the semantic model is obtained from a pre-training model through training-data generation and fine-tuning, and the output of the last layer or the second-to-last layer of the semantic model is used as the question vector for the Annoy index. When the similarity is calculated and ranked, the vector similarity, the question-answer frequency and the text alignment ratio are considered together.
In this embodiment, the specific implementation process of the knowledge matching module is as follows:
First, training-set construction. As shown in FIG. 4, the training data are constructed mainly from three parts: synonym replacement, back-translation and random combination. The specific steps are as follows:
C1) a question-answer knowledge base is constructed from material supplied by the industry customer and then supplemented, the supplementary sources including frequently asked questions crawled from industry websites and manually supplied entries;
C2) synonyms are extracted, including industry word stock extraction and synonym group extraction: industry keywords are constructed from the crawled question-answer knowledge base and web data after tf-idf weight calculation and manual screening, synonyms of each keyword are then extracted with the Tencent word vectors, the extraction rule being to take the 10 most similar words (by cosine similarity) excluding the keyword itself, and the final synonym groups are obtained after manual screening and de-duplication (see the synonym-extraction sketch after this list);
synonyms before screening are as follows:
manufacturing company's own investment organization product and service medicine enterprise in enterprise's private enterprise company innovation company industry private-member manufacturing industry
The screening was followed as follows:
enterprise civil-enterprise company industry medicine enterprise
C3) similar questions are constructed for the questions in the question-answer knowledge base through synonym replacement: synonyms appearing in the knowledge-base questions are replaced to expand the number of questions, after which unreasonable questions are removed and the pairs are labeled as similar or not;
C4) back-translation: all questions in the knowledge base are traversed and the Baidu translation open platform interface is called to translate them into English and back into Chinese; unreasonable questions are then removed and the pairs are labeled as similar or not;
C5) the similar question pairs are collected, all questions are randomly combined (dissimilar pairs are also obtained from questions without answers, random question combinations and non-synonym replacement) to construct the same number of dissimilar question pairs, and the training set is formed after random shuffling. The ratio of training set to validation set is 9:1.
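A brief sketch of the synonym-extraction part of step C2 and the synonym replacement of step C3, under the assumption that the Tencent word vectors are available in word2vec text format and loaded with gensim; the file path and the manual-screening step are placeholders:

```python
from gensim.models import KeyedVectors

# Hypothetical path to the Tencent word vectors in word2vec text format.
wv = KeyedVectors.load_word2vec_format("tencent_word_vectors.txt", binary=False)

def candidate_synonyms(keyword: str, topn: int = 10):
    # Step C2: the 10 most similar words by cosine similarity, excluding the
    # keyword itself; the result still needs manual screening and de-duplication.
    return [w for w, _ in wv.most_similar(keyword, topn=topn) if w != keyword]

def similar_questions(question: str, synonym_groups: dict):
    # Step C3: generate similar questions by replacing screened industry
    # keywords in the original question with their synonyms.
    variants = []
    for word, synonyms in synonym_groups.items():
        if word in question:
            variants.extend(question.replace(word, s) for s in synonyms)
    return variants
```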
Second, fine-tuning is performed on the pre-training model with the training data to obtain the semantic model. As shown in FIG. 3, the specific process is as follows:
B1) a fully connected layer and a softmax layer are added after the output layer of the pre-training model, and the softmax layer outputs the probabilities of "similar" and "dissimilar". The model input is a question pair from the training data; cross-entropy is used as the loss function and Adam for gradient descent; f1 and accuracy are used as the combined metric, and the optimal semantic model is selected on the validation set.
When the knowledge vectors are constructed, the knowledge-base questions are traversed and the 768-dimensional output of the last layer (higher precision) or of the second-to-last layer (higher recall) of the model is used as the question vector; the customer can choose between them, and the system uses the output of the second-to-last layer by default.
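A hedged sketch of this fine-tuning setup and of the question-vector extraction using the HuggingFace transformers API; the checkpoint name and the choice of the [CLS] position for pooling are assumptions, not specified in the description:

```python
import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification

CHECKPOINT = "hfl/chinese-roberta-wwm-ext"   # example Chinese RoBERTa checkpoint (assumption)
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)

# Sentence-pair classifier: pre-trained encoder + new fully connected/softmax head.
classifier = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)
optimizer = torch.optim.Adam(classifier.parameters(), lr=2e-5)

def training_step(q1: str, q2: str, label: int) -> float:
    # One fine-tuning step on a (similar / dissimilar) question pair.
    inputs = tokenizer(q1, q2, return_tensors="pt", truncation=True, padding=True)
    out = classifier(**inputs, labels=torch.tensor([label]))   # cross-entropy loss
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()

# Question vector: [CLS] output of the second-to-last encoder layer (system default).
encoder = AutoModel.from_pretrained(CHECKPOINT, output_hidden_states=True)

@torch.no_grad()
def question_vector(question: str) -> torch.Tensor:
    inputs = tokenizer(question, return_tensors="pt", truncation=True)
    hidden_states = encoder(**inputs).hidden_states   # embeddings + every encoder layer
    return hidden_states[-2][0, 0]                     # 768-dimensional vector
```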
Compared with the original BERT model, the pre-training model used by the invention is trained on more data for more training steps, uses dynamic whole-word masking and removes the next-sentence-prediction task, and performs better on various NLP tasks. In the present search system, before fine-tuning, the pre-training model was tested to improve the top-1 hit rate by 20.5% and the top-5 hit rate by 11.5% over the simplest BERT model.
Third, knowledge-index construction. The inverted index is fast to build, and inserting or deleting knowledge requires little index-update computation, so it is suitable for search question-answering over small knowledge bases (preferably within ten thousand entries) and improves question recall. The Annoy index is slow to build and does not support index updates when the knowledge changes, but queries over large amounts of data are fast, so it is suitable for indexing massive base-library data (up to the million level); the user can configure which index to use according to the size and category of the knowledge base. In general, the industry base question-answer library contains more questions, is rarely updated and is shared among different customers in the same industry, so the Annoy index is used for it; for the customization questions of individual customers (which are dynamic), the inverted index is used.
As shown in FIG. 2, the inverted index is obtained by applying jieba word segmentation and stop-word removal to the questions in the question-answer knowledge base. The Annoy index is obtained by building multiple high-dimensional nearest-neighbor search trees over the knowledge vectors of the question-answer knowledge base; the knowledge vectors are produced by the semantic model, and the semantic model is obtained from the pre-training model through training-data generation and fine-tuning.
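Both index types map onto standard Python tooling. The sketch below builds a simple inverted index from jieba tokens and an Annoy index over the question vectors using the annoy library; the stop-word list and the number of trees are illustrative:

```python
from collections import defaultdict
import jieba
from annoy import AnnoyIndex

STOP_WORDS = {"的", "了", "吗", "怎么"}       # illustrative stop-word list

def build_inverted_index(questions):
    # Token -> set of question ids, after segmentation and stop-word removal.
    index = defaultdict(set)
    for qid, question in enumerate(questions):
        for token in jieba.lcut(question):
            if token not in STOP_WORDS:
                index[token].add(qid)
    return index

def build_annoy_index(question_vectors, n_trees=10):
    # question_vectors: list of 768-dimensional vectors from the semantic model.
    index = AnnoyIndex(768, "angular")        # angular distance ~ cosine similarity
    for qid, vector in enumerate(question_vectors):
        index.add_item(qid, vector)
    index.build(n_trees)                      # multiple random-projection trees
    return index

# Query example: the 5 nearest knowledge-base questions to a user question vector.
# top5_ids = annoy_index.get_nns_by_vector(user_vector, 5)
```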
In this embodiment, the pre-training model is a Chinese pre-trained RoBERTa or XLNet model.
Fourth, question-similarity calculation and ranking: the cosine similarity between the user question vector and the knowledge vectors in the knowledge base is computed, features such as hot-question statistics and the text alignment ratio are additionally considered, and the candidates are ranked by the combined similarity.
The similarity calculation formula is as follows:
[Formula: S(Q_u, Q_k) is the combination of the cosine similarity Cos(V_u, V_k) weighted by μ, a text-alignment term based on L(Q_u, Q_k) weighted by γ_1, and a query-frequency term C(Q_k)/max(C(Q_k)) weighted by γ_2, normalized with a softmax function.]
where Q_u and Q_k are respectively the user question and a question retrieved from the knowledge base by indexing; S(Q_u, Q_k) denotes the similarity between the user question and the indexed question; V_u and V_k are respectively the question vectors output by the semantic model for the user question and for the indexed question; Cos(V_u, V_k) is the cosine similarity of the two vectors; γ_1, γ_2 and μ are coefficients, with γ_1 ∈ (0, 0.1), γ_2 ∈ (0, 0.1) and μ ∈ (0, 1); C(Q_k) is the number of times question Q_k has been queried, obtained from log statistics; and max(C(Q_k)) is the maximum query count over all questions.
The first part of the formula considers the cosine similarity between the user question vector and the indexed question vector and is the main component; the second part considers the alignment of the word-segmentation counts of the user question and the indexed question, where L(Q_u, Q_k) is the absolute difference of the two word counts; the third part considers the query frequency of common questions. The three components are passed through a softmax function, normalized into similarity values between 0 and 1, and then ranked.
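An illustrative scoring sketch that follows this description: a μ-weighted cosine term, a γ_1-weighted word-count-alignment term and a γ_2-weighted query-frequency term are combined and the candidate scores are normalized with softmax. The exact way the alignment term enters the score and the coefficient values are assumptions made for illustration:

```python
import numpy as np
import jieba

def rank_candidates(user_question, user_vector, candidates,
                    mu=0.9, gamma1=0.05, gamma2=0.05, top_k=5):
    """candidates: list of (question_text, question_vector, query_count)."""
    max_count = max(count for _, _, count in candidates) or 1
    user_len = len(jieba.lcut(user_question))
    scores = []
    for text, vector, count in candidates:
        # First part: cosine similarity of the two question vectors.
        cos = float(np.dot(user_vector, vector) /
                    (np.linalg.norm(user_vector) * np.linalg.norm(vector)))
        # Second part: word-count alignment, L(Q_u, Q_k).
        align = abs(user_len - len(jieba.lcut(text)))
        # Third part: normalized query frequency from log statistics.
        score = mu * cos + gamma1 / (1 + align) + gamma2 * count / max_count
        scores.append(score)
    probs = np.exp(scores) / np.exp(scores).sum()          # softmax normalization
    order = np.argsort(-probs)
    return [(candidates[i][0], float(probs[i])) for i in order[:top_k]]
```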
The response output module is used for outputting the response, which is of four types: a similar-question list, an exact answer, no answer, or recommended hot questions.
Specifically, if the user question is judged to be noise, the response output module returns hot recommended questions according to query frequency, or no response; if the user question, after analysis by the QA module, matches a rule requiring an exactly returned answer, the response output module outputs the standard answer corresponding to the question; if the user question is not noise and matches no QA rule, a similar-question list is returned through the knowledge matching module for the user to choose from. In addition, the system maintains the state of the similar-question list, and the user can continue the question-answering by clicking or by position instructions.
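The routing logic of the response output module can be summarized as follows; is_noise, qa_match, knowledge_match and hot_questions stand for the modules described above and are placeholders, not functions defined by the patent:

```python
def respond(question: str):
    # Noise: return hot questions ranked by query frequency, or no answer.
    if is_noise(question):
        hot = hot_questions()
        return {"type": "hot_recommendation", "questions": hot} if hot else {"type": "no_answer"}
    # QA rule matched: return the standard answer exactly.
    answer = qa_match(question)
    if answer is not None:
        return {"type": "exact_answer", "answer": answer}
    # Otherwise: return the top-5 similar questions for the user to choose from.
    return {"type": "similar_questions", "questions": knowledge_match(question, top_k=5)}
```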
In this embodiment, unsupervised learning over a large amount of data yields a language model with a certain generalization capability; the question-vector index of the question-answer library is built from this language model, and statistical information of the question-answer library such as TF (term frequency) and IDF (inverse document frequency), together with manually added semantic matching rules, is also taken into account, so as to build a search question-answering system that responds to user questions.
This embodiment is based on an intelligent search question-answering system: parameter optimization of the pre-training model improves the model's sensitivity to industry knowledge while greatly improving its generalization capability; the industry word stock, screened by statistical learning and manual work, effectively excludes the interference of non-industry and other irrelevant questions; QA rules are entered and parsed by the system and embedded into the question-answering system, and since QA rule entry supports entities (including numbers) and complex logical rules, exact QA question-answering can be realized effectively. The invention can effectively resolve pain points such as knowledge generalization transfer, noise judgment and QA customization in current engineering applications, and greatly improves question-answering efficiency while improving the user experience. The invention can also be used on intelligent devices such as mobile phones, computers, intelligent robots and self-service machines.
The foregoing description is only for the basic principle and the preferred embodiments of the present invention, and modifications and substitutions by those skilled in the art are included in the scope of the present invention.

Claims (10)

1. A search question-answering system based on pre-training is characterized in that: the system comprises a noise judgment module, a QA question-answering module, a knowledge matching module and a response output module;
the noise judgment module judges whether the user question belongs to noise through an industry word bank and an exclusion word bank, when the user question contains industry words and does not contain exclusion words, the user question is determined to be non-noise and enters a QA question-answering module for analysis, otherwise, the user question is determined to be noise, and a response output module returns a hot recommendation question or no response;
the QA question-answering module comprises a rule entry unit and a rule analysis unit; the rule analysis unit parses the user question against the rules entered by the rule entry unit and judges whether the parsed user question requires an exactly returned answer; if so, the response output module outputs the standard answer corresponding to the question, and if not, the parsed question is sent to the knowledge matching module;
the knowledge matching module indexes knowledge in a question and answer base and performs similarity sorting, the knowledge index comprises an inverted index and an Annoy index, the Annoy index is based on a semantic model, the semantic model is obtained by training data generation and fine-tuning on the basis of a pre-training model, the output of the last layer or the second last layer of the semantic model is used as a question vector to perform the Annoy index, and when the similarity is calculated and sorted, the vector similarity, the question and answer frequency and the text alignment ratio are comprehensively considered;
the response output module is used for outputting the response, which is of four types: a similar-question list, an exact answer, no answer, or recommended hot questions.
2. The pre-training based search question-answering system according to claim 1, wherein: the exclusion word stock of the noise judgment module is obtained through manual screening and later log maintenance, and the industry word stock of the noise judgment module is obtained as follows: A1) first, statistics are gathered over the training data, whose sources include the question-answer knowledge base and other industry questions crawled from web resources; A2) the precise mode of jieba word segmentation is used for segmentation, the term frequency TF is calculated over the industry question-answer knowledge base, the inverse document frequency IDF of each word is calculated over all the data, and the word weight W is calculated from TF and IDF, where TF, IDF and W are respectively calculated as follows:
TF = (number of occurrences of the word in the industry question-answer knowledge base) / (total number of words in the industry question-answer knowledge base);
IDF = log( total number of questions in all the data / (number of questions containing the word + 1) );
W=TF*IDF;
A3) a suitable number of words are selected as industry words according to the word weights calculated in step A2, or the industry words are selected by setting a minimum threshold; splittable compound phrases among the industry words are split, or the industry words are added to the custom dictionary of jieba segmentation with increased weight so that they are segmented correctly;
A4) professionals supplement a number of spoken abbreviations and other non-standard industry words, forming the final industry word stock.
3. The pre-training based search question-answering system according to claim 1, wherein: the rule entry unit of the QA question-answering module supports the entry of logical expressions, brackets, logical nesting, number parsing and entities; for an entered logical expression, the rule analysis unit first matches the question against the rule expression to form a logical expression containing only 1s and 0s, then evaluates the logical expression with the rule parsing algorithm and outputs whether it matches.
4. The pre-training based search question-answering system according to claim 3, wherein: the rule parsing algorithm evaluates the logical expression as follows: the logical expression is pushed onto a number stack and an operator stack and evaluated recursively, with operation priority: brackets > AND operation > OR operation; the operation rules are: 1&1=1; 1&0=0&1=0; 1|0=0|1=1; 1|1=1; 0|0=0, where & denotes the AND operation and | denotes the OR operation, and in the logical expression the operator AND is replaced by & and the operator OR is replaced by |.
5. The pre-training based search question-answering system according to claim 1, wherein: the inverted index in the knowledge matching module is obtained by applying jieba word segmentation and stop-word removal to the questions in the question-answer knowledge base; the Annoy index in the knowledge matching module is obtained by building multiple high-dimensional nearest-neighbor search trees over the knowledge vectors of the question-answer knowledge base, the knowledge vectors are produced by the semantic model, and the semantic model is obtained from the pre-training model through training-data generation and fine-tuning.
6. The pre-training based search question-answering system according to claim 5, wherein: the specific process of obtaining the semantic model from the pre-training model is as follows: B1) a fully connected layer and a softmax layer are added after the output layer of the pre-training model, and when the knowledge vectors are constructed the outputs of the last layer and the second-to-last layer of the pre-training model are respectively used as question vectors; B2) training data are built through text enhancement; B3) fine-tuning is performed on the pre-training model processed in step B1 with the text-enhanced training data to obtain the semantic model.
7. The pre-training based search question-answering system according to claim 6, wherein: the process of building training data through text enhancement in step B2 is as follows: C1) a question-answer knowledge base is constructed from material supplied by the industry customer and then supplemented, the supplementary sources including frequently asked questions crawled from industry websites and manually supplied entries; C2) synonyms are extracted, including industry word stock extraction and synonym group extraction: industry keywords are constructed from the crawled question-answer knowledge base and web data after tf-idf weight calculation and manual screening, synonyms of each keyword are then extracted with the Tencent word vectors, the extraction rule being to take the 10 most similar words (by cosine similarity) excluding the keyword itself, and the final synonym groups are obtained after manual screening and de-duplication; C3) similar questions are constructed for the questions in the question-answer knowledge base through synonym replacement; C4) the questions in the question-answer knowledge base are translated into English and back into Chinese through a translation open platform, followed by manual screening and labeling; C5) synonymous question pairs are obtained after text enhancement, and non-synonymous question pairs are obtained from questions without answers, random combination of questions and non-synonym replacement, thereby constructing the training data and the validation set.
8. The pre-training based search question-answering system according to any one of claims 1, 5, 6 and 7, wherein: the pre-training model is a Chinese pre-trained RoBERTa or XLNet model.
9. The pre-training based search question-answering system according to claim 1, wherein: the formula for calculating question similarity is as follows:
[Formula: S(Q_u, Q_k) is the combination of the cosine similarity Cos(V_u, V_k) weighted by μ, a text-alignment-ratio term weighted by γ_1, and a query-frequency term C(Q_k)/max(C(Q_k)) weighted by γ_2, normalized with a softmax function.]
where Q_u and Q_k are respectively the user question and a question retrieved from the knowledge base by indexing; S(Q_u, Q_k) denotes the similarity between the user question and the indexed question; V_u and V_k are respectively the question vectors output by the semantic model for the user question and for the indexed question; Cos(V_u, V_k) is the cosine similarity of the two vectors; γ_1, γ_2 and μ are coefficients, with γ_1 ∈ (0, 0.1), γ_2 ∈ (0, 0.1) and μ ∈ (0, 1); C(Q_k) is the number of times question Q_k has been queried, obtained from log statistics; and max(C(Q_k)) is the maximum query count over all questions.
10. The pre-training based search question-answering system according to claim 1, wherein: if the user question is judged to be noise, the response output module returns hot recommended questions according to query frequency, or no response; if the user question, after analysis by the QA module, matches a rule requiring an exactly returned answer, the response output module outputs the standard answer corresponding to the question; if the user question is not noise and matches no QA rule, a similar-question list is returned through the knowledge matching module for the user to choose from; in addition, the system maintains the state of the similar-question list, and the user can continue the question-answering by clicking or by position instructions.
CN201911341560.2A 2019-12-20 2019-12-20 Search question-answering system based on pre-training Active CN111125334B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911341560.2A CN111125334B (en) 2019-12-20 2019-12-20 Search question-answering system based on pre-training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911341560.2A CN111125334B (en) 2019-12-20 2019-12-20 Search question-answering system based on pre-training

Publications (2)

Publication Number Publication Date
CN111125334A true CN111125334A (en) 2020-05-08
CN111125334B CN111125334B (en) 2023-09-12

Family

ID=70501440

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911341560.2A Active CN111125334B (en) 2019-12-20 2019-12-20 Search question-answering system based on pre-training

Country Status (1)

Country Link
CN (1) CN111125334B (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111310234A (en) * 2020-05-09 2020-06-19 支付宝(杭州)信息技术有限公司 Personal data processing method and device based on zero-knowledge proof and electronic equipment
CN111752804A (en) * 2020-06-29 2020-10-09 中国电子科技集团公司第二十八研究所 Database cache system based on database log scanning
CN111813910A (en) * 2020-06-24 2020-10-23 平安科技(深圳)有限公司 Method, system, terminal device and computer storage medium for updating customer service problem
CN112100345A (en) * 2020-08-25 2020-12-18 百度在线网络技术(北京)有限公司 Training method and device for non-question-answer-like model, electronic equipment and storage medium
CN112231537A (en) * 2020-11-09 2021-01-15 张印祺 Intelligent reading system based on deep learning and web crawler
CN112380843A (en) * 2020-11-18 2021-02-19 神思电子技术股份有限公司 Random disturbance network-based open answer generation method
CN112380358A (en) * 2020-12-31 2021-02-19 神思电子技术股份有限公司 Rapid construction method of industry knowledge base
CN112507097A (en) * 2020-12-17 2021-03-16 神思电子技术股份有限公司 Method for improving generalization capability of question-answering system
CN112506963A (en) * 2020-11-23 2021-03-16 上海方立数码科技有限公司 Multi-service-scene-oriented service robot problem matching method
CN112541069A (en) * 2020-12-24 2021-03-23 山东山大鸥玛软件股份有限公司 Text matching method, system, terminal and storage medium combined with keywords
CN112597291A (en) * 2020-12-26 2021-04-02 中国农业银行股份有限公司 Intelligent question and answer implementation method, device and equipment
CN113627152A (en) * 2021-07-16 2021-11-09 中国科学院软件研究所 Unsupervised machine reading comprehension training method based on self-supervised learning
CN114860913A (en) * 2022-05-24 2022-08-05 北京百度网讯科技有限公司 Intelligent question-answering system construction method, question-answering processing method and device
CN116860951A (en) * 2023-09-04 2023-10-10 贵州中昂科技有限公司 Information consultation service management method and management system based on artificial intelligence
CN118093839A (en) * 2024-04-24 2024-05-28 北京中关村科金技术有限公司 Knowledge operation question-answer dialogue processing method and system based on deep learning

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101621391A (en) * 2009-08-07 2010-01-06 北京百问百答网络技术有限公司 Method and system for classifying short texts based on probability topic
CN105005564A (en) * 2014-04-17 2015-10-28 北京搜狗科技发展有限公司 Data processing method and apparatus based on question-and-answer platform
CN107273350A (en) * 2017-05-16 2017-10-20 广东电网有限责任公司江门供电局 A kind of information processing method and its device for realizing intelligent answer
CN107980130A (en) * 2017-11-02 2018-05-01 深圳前海达闼云端智能科技有限公司 It is automatic to answer method, apparatus, storage medium and electronic equipment
WO2018077655A1 (en) * 2016-10-24 2018-05-03 Koninklijke Philips N.V. Multi domain real-time question answering system
CN108959531A (en) * 2018-06-29 2018-12-07 北京百度网讯科技有限公司 Information search method, device, equipment and storage medium
US20180373782A1 (en) * 2017-06-27 2018-12-27 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for recommending answer to question based on artificial intelligence
CN109947921A (en) * 2019-03-19 2019-06-28 河海大学常州校区 A kind of intelligent Answer System based on natural language processing
CN110008322A (en) * 2019-03-25 2019-07-12 阿里巴巴集团控股有限公司 Art recommended method and device under more wheel session operational scenarios
CN110309267A (en) * 2019-07-08 2019-10-08 哈尔滨工业大学 Semantic retrieving method and system based on pre-training model
CN110413783A (en) * 2019-07-23 2019-11-05 银江股份有限公司 A kind of judicial style classification method and system based on attention mechanism

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101621391A (en) * 2009-08-07 2010-01-06 北京百问百答网络技术有限公司 Method and system for classifying short texts based on probability topic
CN105005564A (en) * 2014-04-17 2015-10-28 北京搜狗科技发展有限公司 Data processing method and apparatus based on question-and-answer platform
WO2018077655A1 (en) * 2016-10-24 2018-05-03 Koninklijke Philips N.V. Multi domain real-time question answering system
CN107273350A (en) * 2017-05-16 2017-10-20 广东电网有限责任公司江门供电局 A kind of information processing method and its device for realizing intelligent answer
US20180373782A1 (en) * 2017-06-27 2018-12-27 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for recommending answer to question based on artificial intelligence
CN107980130A (en) * 2017-11-02 2018-05-01 深圳前海达闼云端智能科技有限公司 It is automatic to answer method, apparatus, storage medium and electronic equipment
CN108959531A (en) * 2018-06-29 2018-12-07 北京百度网讯科技有限公司 Information search method, device, equipment and storage medium
CN109947921A (en) * 2019-03-19 2019-06-28 河海大学常州校区 A kind of intelligent Answer System based on natural language processing
CN110008322A (en) * 2019-03-25 2019-07-12 阿里巴巴集团控股有限公司 Art recommended method and device under more wheel session operational scenarios
CN110309267A (en) * 2019-07-08 2019-10-08 哈尔滨工业大学 Semantic retrieving method and system based on pre-training model
CN110413783A (en) * 2019-07-23 2019-11-05 银江股份有限公司 A kind of judicial style classification method and system based on attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘飞; 张俊然; 杨豪: "Research progress in medical image recognition based on deep learning", no. 01 *

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111310234A (en) * 2020-05-09 2020-06-19 支付宝(杭州)信息技术有限公司 Personal data processing method and device based on zero-knowledge proof and electronic equipment
CN111813910A (en) * 2020-06-24 2020-10-23 平安科技(深圳)有限公司 Method, system, terminal device and computer storage medium for updating customer service problem
CN111813910B (en) * 2020-06-24 2024-05-31 平安科技(深圳)有限公司 Customer service problem updating method, customer service problem updating system, terminal equipment and computer storage medium
CN111752804A (en) * 2020-06-29 2020-10-09 中国电子科技集团公司第二十八研究所 Database cache system based on database log scanning
CN112100345A (en) * 2020-08-25 2020-12-18 百度在线网络技术(北京)有限公司 Training method and device for non-question-answer-like model, electronic equipment and storage medium
CN112231537A (en) * 2020-11-09 2021-01-15 张印祺 Intelligent reading system based on deep learning and web crawler
CN112380843A (en) * 2020-11-18 2021-02-19 神思电子技术股份有限公司 Random disturbance network-based open answer generation method
CN112506963B (en) * 2020-11-23 2022-09-09 上海方立数码科技有限公司 Multi-service-scene-oriented service robot problem matching method
CN112506963A (en) * 2020-11-23 2021-03-16 上海方立数码科技有限公司 Multi-service-scene-oriented service robot problem matching method
CN112507097A (en) * 2020-12-17 2021-03-16 神思电子技术股份有限公司 Method for improving generalization capability of question-answering system
CN112507097B (en) * 2020-12-17 2022-11-18 神思电子技术股份有限公司 Method for improving generalization capability of question-answering system
CN112541069A (en) * 2020-12-24 2021-03-23 山东山大鸥玛软件股份有限公司 Text matching method, system, terminal and storage medium combined with keywords
CN112597291A (en) * 2020-12-26 2021-04-02 中国农业银行股份有限公司 Intelligent question and answer implementation method, device and equipment
CN112597291B (en) * 2020-12-26 2024-09-17 中国农业银行股份有限公司 Intelligent question-answering implementation method, device and equipment
CN112380358A (en) * 2020-12-31 2021-02-19 神思电子技术股份有限公司 Rapid construction method of industry knowledge base
CN113627152A (en) * 2021-07-16 2021-11-09 中国科学院软件研究所 Unsupervised machine reading comprehension training method based on self-supervised learning
CN113627152B (en) * 2021-07-16 2023-05-16 中国科学院软件研究所 Self-supervision learning-based unsupervised machine reading and understanding training method
CN114860913A (en) * 2022-05-24 2022-08-05 北京百度网讯科技有限公司 Intelligent question-answering system construction method, question-answering processing method and device
CN114860913B (en) * 2022-05-24 2023-12-12 北京百度网讯科技有限公司 Intelligent question-answering system construction method, question-answering processing method and device
CN116860951A (en) * 2023-09-04 2023-10-10 贵州中昂科技有限公司 Information consultation service management method and management system based on artificial intelligence
CN116860951B (en) * 2023-09-04 2023-11-14 贵州中昂科技有限公司 Information consultation service management method and management system based on artificial intelligence
CN118093839A (en) * 2024-04-24 2024-05-28 北京中关村科金技术有限公司 Knowledge operation question-answer dialogue processing method and system based on deep learning
CN118093839B (en) * 2024-04-24 2024-07-02 北京中关村科金技术有限公司 Knowledge operation question-answer dialogue processing method and system based on deep learning

Also Published As

Publication number Publication date
CN111125334B (en) 2023-09-12

Similar Documents

Publication Publication Date Title
CN111125334B (en) Search question-answering system based on pre-training
CN108052583B (en) E-commerce ontology construction method
CN108363790B (en) Method, device, equipment and storage medium for evaluating comments
AU2018355097B2 (en) Methods, systems, and computer program product for implementing an intelligent system with dynamic configurability
CN109101479B (en) Clustering method and device for Chinese sentences
CN112800170A (en) Question matching method and device and question reply method and device
US12032911B2 (en) Systems and methods for structured phrase embedding and use thereof
CN109299280B (en) Short text clustering analysis method and device and terminal equipment
CN109460457A (en) Text sentence similarity calculating method, intelligent government affairs auxiliary answer system and its working method
US10586174B2 (en) Methods and systems for finding and ranking entities in a domain specific system
US12093648B2 (en) Systems and methods for producing a semantic representation of a document
CN111625621B (en) Document retrieval method and device, electronic equipment and storage medium
CN102609433A (en) Method and system for recommending query based on user log
CN111159381B (en) Data searching method and device
CN108287848B (en) Method and system for semantic parsing
US20220245361A1 (en) System and method for managing and optimizing lookup source templates in a natural language understanding (nlu) framework
CN112581327B (en) Knowledge graph-based law recommendation method and device and electronic equipment
CN111460114A (en) Retrieval method, device, equipment and computer readable storage medium
Nugraha et al. Typographic-based data augmentation to improve a question retrieval in short dialogue system
US20220237383A1 (en) Concept system for a natural language understanding (nlu) framework
CN114722176A (en) Intelligent question answering method, device, medium and electronic equipment
WO2023207566A1 (en) Voice room quality assessment method, apparatus, and device, medium, and product
CN109684357B (en) Information processing method and device, storage medium and terminal
US20220229990A1 (en) System and method for lookup source segmentation scoring in a natural language understanding (nlu) framework
CN110750967A (en) Pronunciation labeling method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant