CN106649561B - Intelligent question-answering system for tax consultation service - Google Patents


Info

Publication number
CN106649561B
CN106649561B CN201610990193.9A
Authority
CN
China
Prior art keywords
module
text
word
question
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610990193.9A
Other languages
Chinese (zh)
Other versions
CN106649561A (en
Inventor
张文强
高恩强
张尚彤
郑骁庆
路红
张睿
陈辰
王洪荣
张超
薛向阳
Current Assignee
Fudan University
Original Assignee
Fudan University
Priority date
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN201610990193.9A priority Critical patent/CN106649561B/en
Publication of CN106649561A publication Critical patent/CN106649561A/en
Application granted granted Critical
Publication of CN106649561B publication Critical patent/CN106649561B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/9032Query formulation
    • G06F16/90332Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes

Abstract

The invention belongs to the technical field of artificial intelligence, and particularly relates to an intelligent question-answering system for tax consultation service. The system comprises a terminal device running the Android operating system and a computer. The terminal is provided with an application software program comprising a voice conversion module and a question return module; the computer is provided with a service software system comprising a question understanding module and a question retrieval module. When the system works, the voice conversion module converts the voice data uttered by the user into text data, semantic understanding is performed by the question understanding module, answers are retrieved by the question retrieval module, and the processing result is transmitted to the terminal user by the question return module. The invention combines voice recognition, text classification and similarity calculation to form a method for text-similarity matching on an incomplete data set in a professional field; it can perform deep semantic analysis on questions raised by taxpayers, serve massive numbers of users simultaneously, and provide uninterrupted, accurate consultation service so as to meet the actual needs of tax consultation.

Description

Intelligent question-answering system for tax consultation service
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to an intelligent question-answering system for tax consultation service.
Background
With the rapid development of the 12366 service hotline over more than ten years, tax consultation has become an important way for taxpayers to understand tax law and express their appeals. At present, however, the total number of taxpayers is growing rapidly, the complexity of questions keeps deepening, and hot-topic questions are relatively concentrated, so the original consultation channels alone can no longer meet taxpayers' needs.
Tax intelligent consultation, as an application of question-answering systems, can provide continuous online service for massive numbers of taxpayers and is gradually becoming the mode of future consultation service. It can meet the government's needs for deepening reform, changing roles and improving service quality, and is an effective means of improving answer accuracy and user satisfaction, improving tax-consultation efficiency and reducing tax-administration cost. At present, tax intelligent consultation mainly adopts keyword matching, lacks understanding of question semantics, and cannot meet users' needs.
Disclosure of Invention
In order to overcome the problems of current traditional methods, the invention provides an intelligent question-answering system for tax consultation service.
The intelligent question-answering system for tax consultation service provided by the invention adopts a method of text-similarity matching on a general database, and solves the problems that traditional models generalize too poorly on an incomplete database and that a specialized database would be too large.
The invention uses a Word Mover's Distance (WMD) model, which computes the similarity of two texts by calculating the distance between them and effectively improves the understanding of question semantics. It uses a Long Short-Term Memory (LSTM) model to classify questions before computing similarity, which improves the accuracy of the similarity calculation and effectively shortens the calculation time. In addition, both the WMD model and the LSTM network are algorithmically optimized, greatly reducing the time complexity of the algorithm.
The invention provides an intelligent question-answering system for tax consultation service, comprising:
the Android operating system terminal equipment is used for acquiring the user's spoken question, converting the voice data into text data in real time and transmitting the text data to the computer; it is also used for returning the final matched answer to the user for display;
a computer for understanding and searching the real-time incoming text questions;
the terminal equipment is provided with an application software program, and the application software program comprises a voice conversion module 1 and a question return module 4 and is used for acquiring user voice data, providing accurate question answers for users and providing a friendly interface for the users;
the computer is provided with a service software system, and the service software system comprises a problem understanding module 2 and a problem retrieval module 3, used for performing semantic analysis so as to understand the question and retrieve the questions with the highest similarity;
when the system works, the voice conversion module 1 converts voice data output by a user into text data, semantic understanding is carried out through the question understanding module 2, answers are retrieved through the question retrieving module 3, and processing results are transmitted to a terminal user through the question returning module 4.
The voice conversion module 1 is used for completing the function of converting the user voice signal into the information of the corresponding text, and comprises the steps of collecting the user voice, extracting the characteristics of the voice information to form a model to be recognized, matching the model with a reference model, searching the model with the highest similarity and finally outputting a recognition result; it inputs voice information and outputs text information.
The problem understanding module 2 is used for processing text information, including word segmentation of input text, classification of text, deletion of stop words contained in text and the like; the input is question text and the output is feature words.
The question retrieval module 3 is used for matching the question input by the user against the questions and answers in the tax corpus: by comparing the degree of matching between two given questions, it retrieves the questions that share the most features with the input question. Its input is the question features, and its output is the numbers of the questions with the highest similarity.
The question returning module 4 is used for completing the display of the matched questions, presetting the number of returned questions and returning the answers of the questions with the highest similarity to the user; the input is the question number to be returned, and the output is the corresponding question and answer.
The invention comprehensively uses the voice recognition technology, the text classification technology and the similarity calculation technology to form a method for performing text similarity matching on an incomplete data set in the professional field, can perform deep semantic analysis on problems raised by taxpayers, can simultaneously deal with massive users, and provides uninterrupted and accurate consultation service so as to meet the actual requirement of tax consultation.
In the invention, the problem understanding module 2 comprises a Chinese word segmentation module 21, a text classification module 22 and a stop word module 23. The Chinese word segmentation module 21 is configured to perform word segmentation on an input text to determine feature words included in the text; the text classification module 22 is used for classifying the texts according to the trained classification data so as to improve the accuracy and efficiency of the system; and the stop word module is used for deleting stop words contained in the text so as to improve the system efficiency. The Chinese word segmentation module 21 determines the characteristic words contained in the text and transmits the characteristic words to the text classification module 22; the text classification module 22 classifies the questions; the feature words are transmitted to the stop word module 23 for processing.
The Chinese word segmentation module 21 analyzes the sentence, judges which characters should be combined together to form words according to a certain understanding rule, and separates each word of the whole sentence.
The text classification module 22 extracts the predefined classification into specific features, establishes corresponding judgment rules, and then automatically classifies the text to be classified;
The stop word module 23 removes stop words from the question using a preset stop-word list.
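A minimal sketch of the stop-word step (module 23): tokens produced by the word segmenter (the jieba component in the deployed system) are filtered against a preset stop-word list; the example stop words and tokens below are hypothetical.

```python
# Sketch of the stop word module (23): given tokens produced by a word
# segmenter (jieba in the deployed system), drop any token found in a preset
# stop-word list. The example stop words and tokens are hypothetical.

def remove_stop_words(tokens, stop_words):
    """Return the feature words of a question with stop words deleted."""
    return [t for t in tokens if t not in stop_words]

# Hypothetical segmented question: "how do I declare value-added tax?"
tokens = ["how", "do", "I", "declare", "value-added", "tax"]
stop_words = {"how", "do", "I"}
features = remove_stop_words(tokens, stop_words)
print(features)  # feature words passed on to the similarity calculation
```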
In the present invention, the text classification module 22 uses a Long Short-Term Memory (LSTM) neural network model.
In the invention, the problem retrieval module 3 uses a Word Mover's Distance (WMD) model to perform similarity calculation.
In the invention, the system needs to acquire and train data before being put into automatic operation. The question searching module 3 needs to use a tax question and answer corpus, which is a question and answer library at the core of the tax consultation system and is a data source for answering the user questions. The problem retrieval module 3 also needs to use a word vector training library, which is mainly used for completing training of different word characteristics and used for subsequent similarity calculation. The text classification module 22 needs to use a training set after manual classification to extract the features of each classification for completing the classification task of strange problems. The stop word module 23 collects a stop word list. The training does not need additional equipment, and only needs to be completed in the calculation. After the initialization is completed, the system can start to operate.
Advantages of the Invention
The invention can effectively understand the questions posed by users and provide accurate answers in real time, and has broad application prospects. As the demand for tax consultation increases dramatically, the invention can serve massive numbers of users simultaneously and effectively help human customer service distribute the load. The system can be designed comprehensively, overcoming the difficulty that complex questions are hard to handle when human customer service is split between national and local tax service scopes. New questions can be added in advance for hot topics, preparing for consultation peaks ahead of time. The invention adopts voice input, a mode that taxpayers are happier to use.
Drawings
Fig. 1 is an overall configuration diagram of the system of the present invention.
FIG. 2 is a flow chart of the system of the present invention.
FIG. 3 is a model diagram of a text classification module of the present invention.
FIG. 4 is a model diagram of a problem search module of the present invention.
Detailed Description
Preferred embodiments of the present invention are given below with reference to fig. 1 to 4 and are described in detail to facilitate a better understanding of the present invention and not to limit the scope of use of the present invention.
Referring to fig. 1, in order to fulfill the requirement of effectively answering the question posed by the user, the information transmission among the modules of the present invention is performed according to the sequence of a speech conversion module 1, a question understanding module 2, a question retrieving module 3 and a question returning module 4. The problem understanding module 2 and the problem retrieval module 3 are the core of the whole system, a large amount of calculation work is completed by running background service software, and the voice conversion module 1 and the problem return module 4 are mainly used for completing information exchange with a user by running foreground application software.
Referring to fig. 2, the specific process of the system mainly includes the following steps. Voice data input by the user is converted into text data. A word-segmentation component segments the input question. The invention adopts a base-category classifier to classify directly those questions that contain specific keywords: their categories can be judged without running the classification algorithm, which speeds up classification and improves its accuracy. The flow branches on whether such a keyword exists; if no keyword is found, the segmented data is pushed to the LSTM classifier, which classifies the segmented question using the data trained from the LSTM model designed in the earlier stage. Stop words are then removed from the segmented data using the stop-word list, mainly to improve the speed and accuracy of the similarity calculation. The WMD similarity calculator uses word-vector data trained with word2vec to compute the similarity between the stop-word-free feature words and the processed question corpus, and outputs the IDs of the series of questions with the highest similarity; a question ID is the unique number of a question-answer pair in the tax corpus, and its purpose is to allow fast indexing. Finally, the corresponding questions and answers are looked up in the unprocessed question-answer corpus by question ID and returned to the user.
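The classification shunt described above — try the base-category keyword classifier first, and fall back to the LSTM classifier only when no keyword matches — can be sketched as follows; the keyword table and the fallback stub are hypothetical stand-ins, not data from the patent.

```python
# Sketch of the classification shunt: questions containing a category-specific
# keyword are classified directly; otherwise a trained classifier (the LSTM
# model in the invention) is consulted. The keyword table is a hypothetical
# stand-in for the base-category classifier's word list.

def classify(tokens, keyword_to_category, fallback_classifier):
    for t in tokens:
        if t in keyword_to_category:          # base-category shortcut
            return keyword_to_category[t]
    return fallback_classifier(tokens)        # LSTM classifier in the system

keywords = {"invoice": "invoice-issues", "VAT": "value-added-tax"}
lstm_stub = lambda toks: "other"              # stand-in for the trained LSTM

print(classify(["lost", "invoice"], keywords, lstm_stub))   # invoice-issues
print(classify(["general", "query"], keywords, lstm_stub))  # other
```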
The main work of the invention is focused on the text classification module 22 and the question retrieval module 3, which are core modules for completing the text similarity calculation.
The text classification module 22 uses the LSTM network, which has a memory module designed to store historical data. The memory module is composed of memory cells; information can be transmitted freely through each memory cell without being affected by vanishing gradients, and cells can be added or removed. A memory cell mainly consists of several parts: an input gate, a forget gate and an output gate. These gates adjust the relationship between the memory cell and its external environment: the input gate mainly determines whether the received data changes the cell, the forget gate determines whether the state of the memory cell at the previous moment is deleted, and the output gate influences the other neural cells.
How the neural cells, i.e. memory cells, are updated at each time step is described below. Assume h is the output of the LSTM unit, C is the value of the LSTM memory unit, x is the input, W is the corresponding weight matrix, σ and tanh are activation functions, and b is the bias vector. The update process is described by the following equations:
(1) The candidate value of the memory cell at time t, denoted C̃_t, is:
C̃_t = tanh(W_xc x_t + W_hc h_{t-1} + b_c)  (1)
(2) The value of the input gate at time t, denoted i_t, is:
i_t = σ(W_xi x_t + W_ci c_{t-1} + b_i)  (2)
(3) The value of the forget gate at time t, denoted f_t, is:
f_t = σ(W_xf x_t + W_hf h_{t-1} + b_f)  (3)
(4) Combining the candidate value at time t with the forget gate updates the memory cell; the updated cell value, denoted C_t, is:
C_t = f_t * C_{t-1} + i_t * C̃_t  (4)
(5) Combining the new cell value, a hidden layer activated by the sigmoid function determines which part of the information to output, denoted o_t:
o_t = σ(W_xo x_t + W_ho h_{t-1} + W_co c_{t-1} + b_o)  (5)
(6) The final update is filtered with the tanh function to determine the final output h_t:
h_t = o_t * tanh(C_t)  (6)
With these settings, the LSTM network not only overcomes the vanishing-gradient problem but also gains special capabilities, such as saving, reading, resetting and updating historical information at different time steps.
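Equations (1)–(6) can be sanity-checked with a toy scalar implementation; the weights below are single numbers rather than matrices, and all names and values are illustrative, not taken from the patent.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM update following equations (1)-(6), with scalar weights
    for readability; a real layer uses weight matrices."""
    c_tilde = math.tanh(W["xc"] * x + W["hc"] * h_prev + b["c"])             # (1)
    i = sigmoid(W["xi"] * x + W["ci"] * c_prev + b["i"])                     # (2)
    f = sigmoid(W["xf"] * x + W["hf"] * h_prev + b["f"])                     # (3)
    c = f * c_prev + i * c_tilde                                             # (4)
    o = sigmoid(W["xo"] * x + W["ho"] * h_prev + W["co"] * c_prev + b["o"])  # (5)
    h = o * math.tanh(c)                                                     # (6)
    return h, c

# With all weights and biases zero, every gate equals sigmoid(0) = 0.5 and the
# candidate value is tanh(0) = 0, so the cell state halves: c = 0.5 * c_prev.
W0 = dict.fromkeys(["xc", "hc", "xi", "ci", "xf", "hf", "xo", "ho", "co"], 0.0)
b0 = dict.fromkeys(["c", "i", "f", "o"], 0.0)
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=1.0, W=W0, b=b0)
print(h, c)  # h = 0.5 * tanh(0.5), c = 0.5
```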
As shown in FIG. 3, the present invention uses a variant of the conventional LSTM network in which the output gate is independent of the memory cell C_t; this allows faster training of the neural network. The output gate used in the invention therefore has no W_co c_{t-1} term in its formula, and equation (5) becomes:
o_t = σ(W_xo x_t + W_ho h_{t-1} + b_o)  (7)
Unlike a conventional LSTM network, the network used in the invention adds a mean-pooling layer and a logistic-regression layer to the conventional LSTM. The input of the whole neural network is a sentence formed of word vectors, denoted x_0, x_1, …, x_n. After processing by the LSTM network, the word vectors become abstract representations h_0, h_1, …, h_n, and mean pooling over these representations yields the vector representation of the whole sentence:
h = (1/n) Σ_i h_i  (8)
The logistic-regression layer applies a softmax function to the final sentence vector h to obtain the final class and the probability of belonging to each class; the calculation formula is:
P(y = j | h) = exp(w_j · h + b_j) / Σ_k exp(w_k · h + b_k)  (9)
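The mean-pooling and softmax steps of equations (8) and (9) can be sketched in plain Python; the toy LSTM outputs and layer weights below are illustrative, not parameters from the patent.

```python
import math

def mean_pool(hs):
    """Equation (8): average the LSTM outputs into one sentence vector."""
    n = len(hs)
    return [sum(h[k] for h in hs) / n for k in range(len(hs[0]))]

def softmax_classify(h, W, b):
    """Equation (9): logistic-regression layer, class probabilities via softmax."""
    logits = [sum(wk * hk for wk, hk in zip(row, h)) + bj for row, bj in zip(W, b)]
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

hs = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]]   # toy LSTM outputs h_0, h_1, h_2
h = mean_pool(hs)
probs = softmax_classify(h, W=[[1.0, 0.0], [0.0, 1.0]], b=[0.0, 0.0])
print(h, probs)                              # probabilities sum to 1
```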
the question retrieval module 3, as shown in fig. 4, combines a text and word vectors trained by word2vec into a d × n matrix R, where d is the number of feature values of the text expressed by using a bag-of-words model, but removes stop words. n is the number of words in the word2vec vocabulary, and each column represents the feature description of a certain word in d dimension. Converting the text into a word vector matrix, two feature words can be understood as two points in an n-dimensional space, and the semantic similarity of the two feature words is related to the distance between the two points and can be calculated by Euclidean distance. For two texts, which can be considered as two distributions, the similarity degree of the two texts, i.e. the distance between the two distributions, is now compared, but the distance can only be used for the calculation of two points, and here the idea of EMD distance can be applied to the distribution of the two texts. When the EMD distance of the two texts is smaller, the similarity is high, and then the similarity of the texts can be calculated and compared through the EMD distance.
The distance between texts is computed from the distances between their words. Suppose the two texts are P and Q; their meanings may be similar even though they share no feature word, so the two texts occupy different regions of the space, and semantically similar words in P and Q must be found for the conversion. Let i be an arbitrary word of P and j an arbitrary word of Q, with Euclidean distance d_ij between them. The aim is to find the minimum distance between pairs of words: for the selected feature words of the texts, the closest word can be found by comparing the distances between each pair of words in P and Q, and the conversion distance c(i, j) is defined as:
c(i, j) = ||x_i − x_j||_2  (10)
But the numbers of feature words in the two texts are usually not equal, so the words cannot be converted one-to-one. Therefore, by analogy with the weights in the EMD distance, suppose a word appears c_i times in the text; its mass, denoted d_i, is then:
d_i = c_i / Σ_{j=1}^{n} c_j  (11)
Assume a word-movement matrix T is available, where T_ij ≥ 0 is the amount of mass transferred from word i to word j. To convert all words i into words j, the following conditions are imposed on the above assumptions:
(1) Each word in d should move out all of its mass, formulated as:
Σ_j T_ij = d_i  (12)
(2) Each word in d′ should also receive all of its mass, formulated as:
Σ_i T_ij = d′_j  (13)
(3) The objective is to minimize the total amount of mass transported, so the problem becomes a linear program, expressed by the following formula:
WMD = min_{T ≥ 0} Σ_{i,j} T_ij c(i, j)  (14)
The linear program moves semantically similar words onto each other because they are closer together; when the masses differ because the two texts contain different numbers of words, the excess mass is moved to other words of similar meaning.
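The linear program (12)–(14) is ordinarily handed to a generic LP solver (the embodiment below lists PuLP among its installed components). As a dependency-free illustration, the sketch handles the special case of two texts with the same number of words and uniform mass, where the optimum of the transportation problem is attained at a one-to-one assignment, so brute-forcing permutations yields the exact WMD; the word vectors are toy values.

```python
# Exact WMD for the special case of two texts with n words each and uniform
# mass d_i = 1/n: the optimum of (14) is then attained at a one-to-one
# assignment, so brute force over permutations suffices. (The deployed system
# would use a general LP solver; word vectors here are toy values.)
from itertools import permutations

def euclid(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5   # c(i, j), eq. (10)

def wmd_uniform(X, Y):
    n = len(X)
    best = float("inf")
    for perm in permutations(range(n)):   # each perm is a feasible assignment
        cost = sum(euclid(X[i], Y[perm[i]]) for i in range(n)) / n
        best = min(best, cost)
    return best

P = [(0.0, 0.0), (1.0, 0.0)]   # word vectors of text P
Q = [(0.0, 1.0), (1.0, 1.0)]   # word vectors of text Q
print(wmd_uniform(P, Q))        # each word moves distance 1, so WMD = 1.0
```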
The invention uses the WCD distance and the RWMD distance to pre-screen the texts, which reduces the workload of the final WMD computation, shortens the computation time and simplifies the WMD calculation. The WCD distance represents each document as its weighted average word vector, and it can be proven that WMD ≥ WCD. Although the WCD is very cheap to compute, it is only a loose approximation: it shrinks the value of the WMD distance in the vector space. The principle of the RWMD distance is to relax one of the WMD constraints, formula (12) or formula (13); this yields two lower bounds L1 and L2, and defining RWMD = max(L1, L2) gives a tighter bound. Relaxing a constraint can only reduce the objective, that is, WMD ≥ RWMD, which can be proven mathematically. Intuitively, the relaxed constraint allows incomplete word transport: some words of one document are closer in meaning to words of the other document, and because of those shorter distances the relaxed problem moves them many times while words at larger distances are not moved at all, so the total transport cost decreases.
The main optimization process of the invention is as follows. First, the WCD distance between the query text and every candidate text is computed, all WCD distances are sorted in ascending order, and the k texts with the smallest WCD distances are taken out and their exact WMD values computed. Next, the RWMD distance of each remaining text is computed; if a text's RWMD exceeds the current k-th smallest WMD value, the text can be discarded, since its WMD cannot enter the top k. Otherwise its exact WMD is computed and the current top k is updated; this is repeated until all texts are processed. Because RWMD is a very tight bound, it allows about 95% of the texts to be discarded, that is, only 5% of the texts need a full WMD computation, which saves a great deal of time.
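A sketch of the WCD pre-screening step: each text is collapsed to its mass-weighted mean word vector, and because WMD ≥ WCD, candidates can be ranked cheaply by centroid distance before any full WMD computation; the vectors and k below are illustrative.

```python
# WCD pre-filter sketch: represent each text by its weighted average word
# vector; since WMD >= WCD, candidates can be ranked cheaply by this centroid
# distance and only the k closest passed to the full WMD computation.
def centroid(vectors, masses):
    dim = len(vectors[0])
    return [sum(m * v[k] for v, m in zip(vectors, masses)) for k in range(dim)]

def wcd(c1, c2):
    return sum((a - b) ** 2 for a, b in zip(c1, c2)) ** 0.5

def prefilter(query_centroid, candidate_centroids, k):
    """Return indices of the k candidates with the smallest WCD to the query."""
    order = sorted(range(len(candidate_centroids)),
                   key=lambda i: wcd(query_centroid, candidate_centroids[i]))
    return order[:k]

q = centroid([(0.0, 0.0), (2.0, 0.0)], [0.5, 0.5])   # query centroid (1.0, 0.0)
cands = [(10.0, 0.0), (1.5, 0.0), (0.9, 0.1)]        # candidate centroids
print(prefilter(q, cands, k=2))                       # the two nearest candidates
```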
The method optimizes not only the WMD algorithm but also the texts to be matched. The question s posed by the user and each target question q_i to be matched are of approximately equal length, while the answer a_i of a target question is usually much longer. If a_i and q_i were directly spliced together as a new text whose WMD value against s is computed, the computation time would increase considerably. Therefore, before splicing, the method uses TF-IDF to extract the keywords of a_i, splices these keywords together with q_i, and finally computes the WMD value.
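The answer-shortening step can be sketched as follows: score each word of the answer a_i by TF-IDF over the answer corpus, keep the top-k keywords, and splice them onto q_i before the WMD comparison. Tokenized English toy data stands in for segmented Chinese text, and the IDF formula used is one common variant.

```python
import math
from collections import Counter

def tfidf_keywords(doc, corpus, k):
    """Top-k words of `doc` by TF-IDF, with IDF computed over `corpus`."""
    n_docs = len(corpus)
    tf = Counter(doc)
    def idf(word):
        df = sum(1 for d in corpus if word in d)
        return math.log(n_docs / df)          # df >= 1 since doc is in corpus
    scores = {w: (tf[w] / len(doc)) * idf(w) for w in tf}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy corpus of answers; "deduction" occurs only in answer 0, so it is the
# highest-scoring keyword there, while "tax" (in every answer) scores zero.
corpus = [
    ["tax", "deduction", "applies", "tax"],
    ["tax", "rate", "applies"],
    ["tax", "invoice"],
]
keywords = tfidf_keywords(corpus[0], corpus, k=2)
question = ["how", "to", "claim"]
spliced = question + keywords                 # the text actually compared by WMD
print(keywords, spliced)
```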
Through the above optimization process, the WMD value between the question input by the user and each question-answer pair in the tax corpus is computed; the IDs of the questions with the smallest WMD values are found by sorting the WMD values, and finally the corresponding questions and answers are returned to the user.
Examples
Purpose: in an environment of real consultation questions, questions are input into the system of the invention to produce answers to those questions.
The main parameters of the software programming environment of the system are: CPU model Intel Core i7, quad-core; CPU frequency 2 GHz; 8 GB of memory; 256 GB hard disk; operating system OS X El Capitan 10.11.4; development environments Android Studio, PyCharm and IntelliJ IDEA; data stored mainly in JSON, bin and txt formats; the service software is written in Python and the application software in Java.
The main parameters of the software deployment are: CPU model Intel i3-2120, dual-core; CPU frequency 3.30 GHz; 4 GB of memory; 500 GB hard disk; operating system Windows 10 Professional, 32-bit; Python 2.7.3 is used for the background services of the deployment environment, with the main installed components being gensim 0.12.4, jieba 0.38, numpy 1.9.1, PuLP 1.5.3, scikit-learn 0.14.1, six 1.10.0, Theano 0.8.2 and smart_open 1.3.3; the deployment-environment foreground uses Genymotion 2.6.0 and VirtualBox 5.0.20 to simulate the Android system.
The voice conversion module uses the online speech-recognition service of iFLYTEK Co., Ltd.
The Chinese segmentation module uses the Jieba component of Python.
The text classification module uses a Long Short-Term Memory (LSTM) model. Training uses a manually classified training set, including tax documents, tax treatises, local government documents, tax news and the like.
The question retrieval module uses a Word Mover's Distance (WMD) model. Word-vector training uses the Sogou full-network news corpus with Google's word2vec tool; the main parameters of word2vec are: Skip-gram model, vector dimension 250, and the Hierarchical Softmax algorithm.
The question return module sets the number of returned questions to 10.
The core tax corpus uses a question-answer library in the 12366 tax service hot-line tax business knowledge base.
The question bank used for testing consists of internet questions similar to the questions in the tax corpus, drawn from three question-answer systems: Baidu Zhidao, 360 Wenda and Sogou Wenwen. When 10 questions were returned, the average accuracy was 74.49%.

Claims (3)

1. An intelligent question-answering system for tax consultation service is characterized by comprising:
the Android operating system terminal equipment is used for acquiring the user's spoken question, converting the voice data into text data in real time and transmitting the text data to the computer; it is also used for returning the final matched answer to the user for display;
a computer for understanding and searching the real-time incoming text questions;
the terminal equipment is provided with an application software program, and the application software program comprises a voice conversion module 1 and a question return module 4 and is used for acquiring user voice data, providing accurate question answers for users and providing a friendly interface for the users;
the computer is provided with a service software system, and the service software system comprises a problem understanding module 2 and a problem retrieval module 3, and is used for performing semantic analysis so as to understand problems and retrieve the problem with the highest similarity;
when the system works, the voice conversion module 1 converts voice data output by a user into text data, semantic understanding is carried out through the problem understanding module 2, answers are retrieved through the problem retrieval module 3, and a processing result is transmitted to a terminal user through the problem return module 4;
wherein:
the voice conversion module 1 is used for completing the function of converting the user voice signal into the information of the corresponding text, and comprises the steps of collecting the user voice, extracting the characteristics of the voice information to form a model to be recognized, matching the model with a reference model, searching the model with the highest similarity and finally outputting a recognition result; the input is voice information, and the output is text information;
the problem understanding module 2 is used for processing the text information, including segmenting the input text, classifying the text and deleting stop words contained in the text; the input is question text, and the output is a feature word;
the question retrieval module 3 is used for matching the question input by the user against the questions and answers in the tax corpus: by comparing the degree of matching between two given questions, it retrieves the questions that share the most features with the input question; the input is the question features, and the output is a plurality of question numbers with the highest similarity;
the question returning module 4 is used for completing the display of the matched questions, presetting the number of returned questions and returning the answers of the questions with the highest similarity to the user; the input is the question number to be returned, and the output is the corresponding question and answer;
the problem understanding module 2 comprises a Chinese word segmentation module 21, a text classification module 22 and a stop word module 23; the Chinese word segmentation module 21 is configured to perform word segmentation on an input text to determine feature words included in the text; the text classification module 22 is used for classifying the texts according to the trained classification data; the stop word module is used for deleting stop words contained in the text; the Chinese word segmentation module 21 determines the characteristic words contained in the text and transmits the characteristic words to the text classification module 22; the text classification module 22 classifies the questions; transmitting the feature words to a stop word module 23 for processing;
the question retrieval module 3 calculates similarity with a Word Mover's Distance (WMD) model;
before the system is put into automatic operation, data must be acquired and trained on: the question retrieval module 3 needs a tax question-and-answer corpus, which is the question-and-answer library at the core of the tax consultation system and the data source for answering user questions; the question retrieval module 3 also needs a word vector training library, used to train the features of different words and for the subsequent similarity calculation; the text classification module 22 needs a manually classified training set, from which the features of each class are extracted in order to classify unseen questions; the stop word module 23 needs a collected stop word list;
the text classification module 22 uses a long short-term memory (LSTM) network, which is designed with a memory module for storing historical information; the memory module is composed of memory cells, so that information can flow through each memory unit without being affected by vanishing gradients; a memory cell mainly comprises an input gate, a forgetting gate and an output gate; the gates regulate the interaction between the memory cell and its environment: the input gate determines whether the received data changes the cell, the forgetting gate determines whether the state of the memory cell at the previous moment is discarded, and the output gate controls how the cell influences other neural cells;
the memory cells are updated at each moment in the following way:
assuming h is the output of the LSTM unit, C is the value of the LSTM memory cell, x is the input, W is the corresponding weight matrix, σ and tanh are activation functions, and b is a bias vector, the update process is described by the following equations:
(1) the candidate value of the neural cell at time t is recorded as C̃_t:
C̃_t = tanh(W_xc x_t + W_hc h_(t-1) + b_c)  (1)
wherein: W_xc is the weight of the input data at time t, x_t is the input at time t, W_hc is the weight of the LSTM unit output at the previous moment, h_(t-1) is the LSTM unit output at the previous moment, and b_c is the bias vector of the neural cell;
(2) the value of the input gate is calculated and recorded as i_t:
i_t = σ(W_xi x_t + W_ci c_(t-1) + b_i)  (2)
wherein W_xi is the weight of the input data at time t, x_t is the input at time t, W_ci is the weight of the neural cell value at the previous moment, c_(t-1) is the value of the neural cell at the previous moment, and b_i is the bias vector of the input gate;
(3) the value of the forgetting gate at time t is calculated and recorded as f_t:
f_t = σ(W_xf x_t + W_hf h_(t-1) + b_f)  (3)
wherein W_xf is the weight of the input data at time t, x_t is the input at time t, W_hf is the weight of the LSTM unit output at the previous moment, h_(t-1) is the LSTM unit output at the previous moment, and b_f is the bias vector of the forgetting gate;
(4) the candidate value of the neural cell at time t is combined with the values of the forgetting gate and the input gate to update the neural cell, and the updated cell value is recorded as C_t:
C_t = f_t * C_(t-1) + i_t * C̃_t  (4)
wherein C_(t-1) is the value of the neural cell at the previous moment;
(5) a hidden layer activated by the σ function, combined with the new cell value, calculates which part of the information is output, recorded as o_t:
o_t = σ(W_xo x_t + W_ho c_(t-1) + b_o)  (5)
(6) the final update is filtered with a tanh function to determine the final output h_t:
h_t = o_t * tanh(C_t)  (6)
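The six update equations can be sketched as a single LSTM time step. Only the equations themselves come from the claim; the weight shapes, dimensions and random initialisation below are assumptions made for illustration.

```python
import numpy as np

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM update implementing equations (1)-(6) of the claim."""
    sigma = lambda z: 1.0 / (1.0 + np.exp(-z))                     # σ activation
    c_tilde = np.tanh(W["xc"] @ x_t + W["hc"] @ h_prev + b["c"])   # (1) candidate
    i_t = sigma(W["xi"] @ x_t + W["ci"] @ c_prev + b["i"])         # (2) input gate
    f_t = sigma(W["xf"] @ x_t + W["hf"] @ h_prev + b["f"])         # (3) forget gate
    c_t = f_t * c_prev + i_t * c_tilde                             # (4) cell update
    o_t = sigma(W["xo"] @ x_t + W["ho"] @ c_prev + b["o"])         # (5) output gate
    h_t = o_t * np.tanh(c_t)                                       # (6) final output
    return h_t, c_t

rng = np.random.default_rng(0)
d, k = 4, 3                                  # toy input / hidden sizes
W = {name: rng.standard_normal((k, d if name[0] == "x" else k)) * 0.1
     for name in ["xc", "hc", "xi", "ci", "xf", "hf", "xo", "ho"]}
b = {name: np.zeros(k) for name in ["c", "i", "f", "o"]}
h, c = lstm_step(rng.standard_normal(d), np.zeros(k), np.zeros(k), W, b)
print(h.shape, c.shape)                      # (3,) (3,)
```

Note that equations (2) and (5) of the claim gate on the previous cell value c_(t-1) rather than the previous output, and the sketch follows the claim.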
2. The intelligent question-answering system for tax consultation service according to claim 1, wherein two components are added to the traditional LSTM network: a mean pooling function and a logistic regression layer; the input to the whole neural network is a sentence composed of word vectors, denoted x_0, x_1, ..., x_n; after being processed by the LSTM network, the word vectors become abstract representations h_0, h_1, ..., h_n, and after mean pooling of these abstract representations, the vector representation of the whole sentence is obtained as:
h = (1/n) Σ_(i=0..n) h_i  (7)
the logistic regression layer then passes the final sentence vector h through a softmax function to obtain the final classification and the probability of belonging to each class, calculated as:
p(y_i | s) = exp(θ_i · h) / Σ_j exp(θ_j · h)  (8)
wherein s denotes the input question, y_i denotes the topic of the i-th class, p(y_i | s) denotes the probability that question s belongs to the i-th class topic y_i, and θ denotes the class models obtained after training the LSTM with the classified corpus; by continuously fitting the topic models θ, the probability of the question s is calculated.
3. The intelligent question-answering system for tax consultation service according to claim 1 or 2, wherein in the question retrieval module 3, the text and the word vectors trained by word2vec form a d × n matrix R, where d is the number of features of the text expressed by a bag-of-words model with stop words removed, n is the number of words in the word2vec vocabulary, and each column is the d-dimensional feature description of a certain word; after the text is converted into a word vector matrix, two feature words can be understood as two points in the d-dimensional space, and the semantic similarity of two feature words is calculated with the Euclidean distance; two texts are regarded as two distributions, and the similarity of two texts is calculated with the EMD (earth mover's distance);
the distance between texts is derived from the distances between words: if two texts P and Q have similar meanings but do not share any feature word, they occupy different regions of the space, and the semantically similar words of P and Q have to be found and converted into each other; let i be an arbitrary word of P and j an arbitrary word of Q, with Euclidean distance d_ij; the goal is to find the minimum distance for each pair of words; by comparing the distances of every pair of words in P and Q, the word at the shortest distance can be found and converted into, where the conversion cost c(i, j) is defined as:
c(i, j) = ||x_i − x_j||_2  (10)
however, the numbers of feature words of the two texts are in general not equal, so the conversion cannot be performed pair by pair; borrowing the notion of weight from the EMD distance, suppose word i appears c_i times in its text; the mass of the word is then expressed as d_i:
d_i = c_i / Σ_j c_j  (11)
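The word mass of equation (11) is simply a normalised term frequency, for example:

```python
from collections import Counter

def nbow_weights(words):
    """Equation (11): mass d_i = c_i / sum_j c_j over the text's feature words."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

w = nbow_weights(["tax", "invoice", "tax"])   # "tax" appears twice
print(w)                                      # masses 2/3 and 1/3, summing to 1
```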
suppose a word movement matrix T is available, where T_ij ≥ 0 denotes the amount of mass transported from word i to word j; in order to convert all of the words i into words j, the following conditions are imposed on top of the above assumptions:
(1) each word i of d should move out all of its mass, formulated as:
Σ_j T_ij = d_i  (12)
(2) each word j of d' should also receive all of its mass, formulated as:
Σ_i T_ij = d'_j  (13)
(3) the goal is to minimize the total mass transported, which becomes a linear programming problem expressed by the following equation:
WMD = min_(T≥0) Σ_(i,j) T_ij c(i, j)  (14)
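Equation (14) is a small linear program. In the special case of two texts with the same number of distinct feature words and uniform masses, an optimal transport plan is a one-to-one assignment, so a tiny instance can be solved exactly by enumerating permutations; the 2-D "word vectors" below are toy values, and a general linear programming solver would be needed for unequal masses. A minimal sketch under these assumptions:

```python
from itertools import permutations
import math

def transfer_cost(x, y):
    """c(i, j) = ||x_i - x_j||_2, equation (10)."""
    return math.dist(x, y)

def wmd_uniform(vecs_p, vecs_q):
    """Equation (14) for equal-size texts with uniform masses d_i = 1/n:
    the optimum is attained at a permutation, so enumerate assignments."""
    n = len(vecs_p)
    assert n == len(vecs_q), "uniform-mass shortcut needs equal word counts"
    best = min(sum(transfer_cost(vecs_p[i], vecs_q[p[i]]) for i in range(n))
               for p in permutations(range(n)))
    return best / n                      # each word carries mass 1/n

P = [(0.0, 0.0), (1.0, 0.0)]             # toy word vectors of text P
Q = [(0.0, 1.0), (1.0, 1.0)]             # toy word vectors of text Q
print(wmd_uniform(P, Q))                 # 1.0: each word moves distance 1
```

The identity assignment (cost 2, mass 1/2 each) beats the crossed assignment (cost 2√2), so the minimum total transport is 1.0.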
the minimizing scheme of this linear program converts semantically similar words into one another, and if the masses of the converted words of the two texts differ because their counts differ, the surplus mass is converted into other words with similar meanings;
define RWMD = max(L1, L2), called the relaxed word mover's distance; here L1 is the optimum of the WMD problem under constraint (12) only, and L2 is the optimum under constraint (13) only;
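Dropping one of the two constraints lets every word send its whole mass to its nearest counterpart, so each relaxed optimum has a closed form; a sketch with toy masses and vectors (assumed values, not from the patent):

```python
import math

def relaxed_cost(masses_from, vecs_from, vecs_to):
    """Optimum with one constraint dropped: each source word sends all of
    its mass to its nearest target word."""
    return sum(m * min(math.dist(x, y) for y in vecs_to)
               for m, x in zip(masses_from, vecs_from))

def rwmd(d_p, vecs_p, d_q, vecs_q):
    """RWMD = max(L1, L2): L1 keeps constraint (12), L2 keeps constraint (13)."""
    L1 = relaxed_cost(d_p, vecs_p, vecs_q)
    L2 = relaxed_cost(d_q, vecs_q, vecs_p)
    return max(L1, L2)

P, d_p = [(0.0, 0.0), (1.0, 0.0)], [0.5, 0.5]    # text P: vectors and masses
Q, d_q = [(0.0, 1.0)], [1.0]                     # text Q: one word, all mass
print(rwmd(d_p, P, d_q, Q))                      # max(0.5 + 0.5*sqrt(2), 1.0)
```

Because each relaxation only loosens the problem, RWMD never exceeds the true WMD, which is what makes it usable as a pruning lower bound.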
the texts are first screened with these cheaper distances: the WCD (word centroid distance) between the query text and every corpus text is calculated, all WCD values are sorted in ascending order, the k texts with the smallest WCD values are taken out, and their exact WMD values are calculated; next, the RWMD value of each remaining text is calculated, and a remaining text is discarded if its RWMD exceeds the largest of the current k smallest WMD values, since only the k smallest WMD values are sought; if its RWMD does not exceed that bound, its exact WMD is calculated and the current k smallest values are updated; this is repeated until all remaining texts have been processed;
in this way, the WMD value between the question input by the user and each question in the tax corpus is calculated; the questions are sorted by WMD value in ascending order, the IDs of the questions with the smallest WMD values are found, and finally the corresponding questions and answers are returned to the user.
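The prefetch-and-prune retrieval described above can be sketched as follows; the three distance functions are passed in as callables (toy absolute-difference stand-ins in the usage below), since the claim only fixes how the bounds are used, not their implementation.

```python
def top_k_questions(query, corpus, k, wcd, rwmd, wmd):
    """Return the k corpus questions closest to the query under WMD, using
    WCD for prefetching and RWMD as a pruning lower bound.
    corpus: list of (question_id, text); result: (wmd, id) pairs, best first."""
    order = sorted(corpus, key=lambda item: wcd(query, item[1]))
    best = sorted((wmd(query, text), qid) for qid, text in order[:k])
    for qid, text in order[k:]:
        if rwmd(query, text) >= best[-1][0]:
            continue                    # lower bound cannot beat k-th best: prune
        d = wmd(query, text)
        if d < best[-1][0]:             # better than the current k-th best
            best[-1] = (d, qid)
            best.sort()
    return best

dist = lambda a, b: abs(a - b)          # toy stand-in for WCD, RWMD and WMD
corpus = [(i, float(i)) for i in range(10)]
print(top_k_questions(4.2, corpus, 3, dist, dist, dist))
# IDs 4, 5 and 3 are the three questions nearest to the query 4.2
```

In the toy run every text beyond the first three is pruned by its lower bound, so the expensive WMD is computed only k times.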
CN201610990193.9A 2016-11-10 2016-11-10 Intelligent question-answering system for tax consultation service Active CN106649561B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610990193.9A CN106649561B (en) 2016-11-10 2016-11-10 Intelligent question-answering system for tax consultation service

Publications (2)

Publication Number Publication Date
CN106649561A CN106649561A (en) 2017-05-10
CN106649561B true CN106649561B (en) 2020-05-26

Family

ID=58806015

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610990193.9A Active CN106649561B (en) 2016-11-10 2016-11-10 Intelligent question-answering system for tax consultation service

Country Status (1)

Country Link
CN (1) CN106649561B (en)

Families Citing this family (49)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107301213A (en) * 2017-06-09 2017-10-27 腾讯科技(深圳)有限公司 Intelligent answer method and device
CN107239574B (en) * 2017-06-29 2018-11-02 北京神州泰岳软件股份有限公司 A kind of intelligent Answer System knowledge-matched method and device of problem
CN107368548A (en) * 2017-06-30 2017-11-21 华中科技大学鄂州工业技术研究院 Intelligent government affairs service interaction method and system
CN107391614A (en) * 2017-07-04 2017-11-24 重庆智慧思特大数据有限公司 A kind of Chinese question and answer matching process based on WMD
CN111680089B (en) * 2017-09-19 2023-03-21 广州市妇女儿童医疗中心 Text structuring method, device and system and non-volatile storage medium
CN107679234B (en) * 2017-10-24 2020-02-11 上海携程国际旅行社有限公司 Customer service information providing method, customer service information providing device, electronic equipment and storage medium
CN109726372B (en) * 2017-10-31 2023-06-30 上海优扬新媒信息技术有限公司 Method and device for generating work order based on call records and computer readable medium
CN107798116A (en) * 2017-11-07 2018-03-13 盐城工学院 Traffic rules answering method and device
CN110019710A (en) * 2017-11-27 2019-07-16 厦门快商通信息技术有限公司 A kind of topic forest formula interactive method and system
CN108153868A (en) * 2017-12-25 2018-06-12 安徽磐众信息科技有限公司 A kind of voice encyclopaedia for option trade answers machine
CN108133297A (en) * 2018-01-26 2018-06-08 广州大学 Dissolved oxygen prediction method and system based on shot and long term memory network
US11222054B2 (en) 2018-03-12 2022-01-11 International Business Machines Corporation Low-complexity methods for assessing distances between pairs of documents
CN108804492B (en) * 2018-03-27 2022-04-29 阿里巴巴(中国)有限公司 Method and device for recommending multimedia objects
CN108595696A (en) * 2018-05-09 2018-09-28 长沙学院 A kind of human-computer interaction intelligent answering method and system based on cloud platform
CN108846035A (en) * 2018-05-26 2018-11-20 许崇兰 A kind of intelligence advisory tax system
CN108763477A (en) * 2018-05-29 2018-11-06 厦门快商通信息技术有限公司 A kind of short text classification method and system
CN110555343B (en) * 2018-06-01 2022-04-05 北京师范大学 Method and system for extracting three elements of forest, shrub and grass in typical resource elements
CN108959396B (en) * 2018-06-04 2021-08-17 众安信息技术服务有限公司 Machine reading model training method and device and question and answer method and device
CN108777751A (en) * 2018-06-07 2018-11-09 上海航动科技有限公司 A kind of call center system and its voice interactive method, device and equipment
CN108922634A (en) * 2018-06-13 2018-11-30 平安医疗科技有限公司 The problem of based on online interrogation, replies processing method, device and computer equipment
CN109947909B (en) * 2018-06-19 2024-03-12 平安科技(深圳)有限公司 Intelligent customer service response method, equipment, storage medium and device
CN109165291B (en) * 2018-06-29 2021-07-09 厦门快商通信息技术有限公司 Text matching method and electronic equipment
CN109241240A (en) * 2018-08-17 2019-01-18 国家电网有限公司客户服务中心 Power failure repairing information automatically forwarding method
CN109410911A (en) * 2018-09-13 2019-03-01 何艳玲 Artificial intelligence learning method based on speech recognition
CN109359302B (en) * 2018-10-26 2023-04-18 重庆大学 Optimization method of domain word vectors and fusion ordering method based on optimization method
CN109446314A (en) * 2018-11-14 2019-03-08 沈文策 A kind of customer service question processing method and device
CN109493186A (en) * 2018-11-20 2019-03-19 北京京东尚科信息技术有限公司 The method and apparatus for determining pushed information
CN109684475B (en) * 2018-11-21 2021-03-30 斑马网络技术有限公司 Complaint processing method, complaint processing device, complaint processing equipment and storage medium
CN109657037A (en) * 2018-12-21 2019-04-19 焦点科技股份有限公司 A kind of knowledge mapping answering method and system based on entity type and semantic similarity
CN109739985A (en) * 2018-12-26 2019-05-10 斑马网络技术有限公司 Automatic document classification method, equipment and storage medium
CN109766421A (en) * 2018-12-28 2019-05-17 上海汇付数据服务有限公司 Intelligent Answer System and method
CN111626686B (en) * 2019-02-28 2023-08-18 百度在线网络技术(北京)有限公司 Intelligent tax processing method, device, terminal and medium
CN111626684B (en) * 2019-02-28 2023-06-23 百度在线网络技术(北京)有限公司 Intelligent tax processing method, device, terminal and medium
CN109993649A (en) * 2019-03-13 2019-07-09 王亚萍 A kind of pilot-operated type consultancy relating to payment of tax method for pushing
CN109993507A (en) * 2019-04-17 2019-07-09 王亚萍 A kind of pilot-operated type service system of paying taxes
CN110113422A (en) * 2019-05-10 2019-08-09 南京硅基智能科技有限公司 A kind of intension recognizing method and system of the virtual mobile phone based on cloud
CN110334110A (en) * 2019-05-28 2019-10-15 平安科技(深圳)有限公司 Natural language classification method, device, computer equipment and storage medium
CN110489533A (en) * 2019-07-09 2019-11-22 深圳追一科技有限公司 Interactive method and relevant device
CN110727778A (en) * 2019-10-15 2020-01-24 大连中河科技有限公司 Intelligent question-answering system for tax affairs
CN110827797B (en) * 2019-11-06 2022-04-12 北京沃东天骏信息技术有限公司 Voice response event classification processing method and device
CN110909142B (en) * 2019-11-20 2023-03-31 腾讯科技(深圳)有限公司 Question and sentence processing method and device of question-answer model, electronic equipment and storage medium
CN111221950A (en) * 2019-12-30 2020-06-02 航天信息股份有限公司 Method and device for analyzing weak emotion of user
CN111275536A (en) * 2020-01-13 2020-06-12 山东浪潮商用系统有限公司 Tax intelligent consultation method and system integrating manual intervention
CN111581338B (en) * 2020-03-25 2022-01-28 北京市农林科学院 Agricultural technical service robot man-machine fusion consultation question-answering method and system
CN111666381B (en) * 2020-06-17 2022-11-18 中国电子科技集团公司第二十八研究所 Task type question-answer interaction system oriented to intelligent control
CN112599120A (en) * 2020-12-11 2021-04-02 上海中通吉网络技术有限公司 Semantic determination method and device based on user-defined weighted WMD algorithm
CN112861757B (en) * 2021-02-23 2022-11-22 天津汇智星源信息技术有限公司 Intelligent record auditing method based on text semantic understanding and electronic equipment
CN117041618B (en) * 2023-10-10 2024-02-06 北京装库创意科技有限公司 Intelligent voice customer service method and system for electronic commerce
CN117198290A (en) * 2023-11-06 2023-12-08 深圳市金鼎胜照明有限公司 Acoustic control-based multi-mode LED intelligent control method and apparatus

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101566998A (en) * 2009-05-26 2009-10-28 华中师范大学 Chinese question-answering system based on neural network
CN103198155A (en) * 2013-04-27 2013-07-10 俞志晨 Mobile terminal based smart question answering interaction system and method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4654745B2 (en) * 2005-04-13 2011-03-23 富士ゼロックス株式会社 Question answering system, data retrieval method, and computer program

Also Published As

Publication number Publication date
CN106649561A (en) 2017-05-10

Similar Documents

Publication Publication Date Title
CN106649561B (en) Intelligent question-answering system for tax consultation service
CN109753566B (en) Model training method for cross-domain emotion analysis based on convolutional neural network
CN110263325B (en) Chinese word segmentation system
CN113239186A (en) Graph convolution network relation extraction method based on multi-dependency relation representation mechanism
US20230140981A1 (en) Tutorial recommendation using discourse-level consistency and ontology-based filtering
CN112749274A (en) Chinese text classification method based on attention mechanism and interference word deletion
CN111651602A (en) Text classification method and system
CN112148868A (en) Law recommendation method based on law co-occurrence
Chou et al. Exploiting annotators’ typed description of emotion perception to maximize utilization of ratings for speech emotion recognition
CN113297410A (en) Image retrieval method and device, computer equipment and storage medium
CN114936277A (en) Similarity problem matching method and user similarity problem matching system
Learning Hybrid model for twitter data sentiment analysis based on ensemble of dictionary based classifier and stacked machine learning classifiers-svm, knn and c50
CN113486143A (en) User portrait generation method based on multi-level text representation and model fusion
CN115952292B (en) Multi-label classification method, apparatus and computer readable medium
TWI734085B (en) Dialogue system using intention detection ensemble learning and method thereof
CN109635289B (en) Entry classification method and audit information extraction method
CN111859955A (en) Public opinion data analysis model based on deep learning
CN114896404A (en) Document classification method and device
CN114254622A (en) Intention identification method and device
CN113434668A (en) Deep learning text classification method and system based on model fusion
CN111008281A (en) Text classification method and device, computer equipment and storage medium
Rosander et al. Email Classification with Machine Learning and Word Embeddings for Improved Customer Support
CN113284498B (en) Client intention identification method and device
CN115618968B (en) New idea discovery method and device, electronic device and storage medium
CN114036946B (en) Text feature extraction and auxiliary retrieval system and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant