Disclosure of Invention
In order to improve the accuracy of intelligent question answering and enable a user to quickly obtain high-quality answers, the invention provides an intelligent question answering method and system based on subject word filtering.
The technical scheme adopted by the invention is as follows:
An intelligent question-answering method based on subject word filtering comprises the following steps:
Step 1: obtain the question information q_0 posed by a user;
Step 2: load the topic word bank T, and perform word segmentation and stop-word removal on the question information q_0 to obtain an initial word set S_0;
Step 3: filter S_0 with the topic word bank T to obtain a topic word set S_1, i.e. S_1 = S_0 ∩ T;
Step 4: load the topic model M, and from M and S_1 obtain the vector of the question information q_0, w = ({S_{11}, w_1}, {S_{12}, w_2}, …, {S_{1n}, w_n}), where S_{1i} (i = 1, 2, …, n, with n the number of words in S_1) is a word in S_1 and w_i is the weight of the word S_{1i}, given by
w_i = M(S_{1i});
Step 5: compute, one by one, the similarity p_j (j = 1, 2, …, m, with m the number of question-answer pairs in C) between w and each question c_j in the question set C; retain the candidate questions whose similarity exceeds the threshold t, and sort them in descending order of similarity to obtain the initial candidate question list L_0. The similarity is computed between w and the vector of question c_j (the formula itself is not reproduced in this text), where c_{jk} (k = 1, 2, …, up to the number of words in c_j) is a word of c_j that appears in S_1, and the weight of the word c_{jk} is obtained from the question vector set C_w;
Step 6: initialize L_1 = {} and L_2 = {};
Step 7: take the first candidate question q in L_0;
Step 8: if q does not exist, go to Step 9; otherwise perform string matching between q and q_0 as follows:
(8.1) obtain the answer r of q from the question-answer pair set C_p;
(8.2) if q = q_0, return the answer r of q and end; otherwise go to step 8.3;
(8.3) if q contains q_0, then L_1 = {(q, r)} ∪ L_1; otherwise L_2 = {(q, r)} ∪ L_2;
(8.4) delete q from L_0 and return to Step 7;
Step 9: if L_1 is not empty, sort L_1 in reverse order and return it; otherwise sort L_2 in reverse order and return it; the process then ends.
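Steps 6 to 9 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function name `string_match`, the use of a Python list (already sorted by descending similarity) for the candidate list, and the dictionary `answers` standing in for the question-answer pair set are all hypothetical. The test that decides between the two lists is assumed to be substring containment, and the union with a one-element set is modelled as prepending to a list.

```python
from typing import Dict, List, Tuple, Union

QAPair = Tuple[str, str]


def string_match(q0: str, candidates: List[str],
                 answers: Dict[str, str]) -> Union[str, List[QAPair]]:
    """Sketch of Steps 6-9.

    candidates: the candidate list, assumed sorted by descending similarity.
    answers: hypothetical stand-in for the question-answer pair set.
    """
    l0 = list(candidates)          # work on a copy of the candidate list
    l1: List[QAPair] = []          # Step 6: pairs whose question contains q0
    l2: List[QAPair] = []          #         pairs whose question does not
    while l0:                      # Steps 7-8 repeat until no candidate is left
        q = l0[0]                  # Step 7: first candidate question
        r = answers[q]             # (8.1) look up its answer
        if q == q0:                # (8.2) exact match: return the answer
            return r
        if q0 in q:                # (8.3) containment test (assumed)
            l1.insert(0, (q, r))   # prepend, mirroring {(q, r)} ∪ L1
        else:
            l2.insert(0, (q, r))
        l0.pop(0)                  # (8.4) delete q from the candidate list
    # Step 9: prepending reversed the order, so reverse it back.
    chosen = l1 if l1 else l2
    return list(reversed(chosen))
```

Under this reading, the reverse ordering in Step 9 simply restores the descending-similarity order of the candidates, because each new pair was placed at the front of its list.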
Further, the question-answer library comprises three parts:
1. the question set C: contains only the questions of the question-answer pairs, which facilitates training;
2. the question-answer pair set C_p: stored in "question-answer" form;
3. the question vector set C_w: stored in the form "word 1, word 2 … - weight 1, weight 2 …", obtained by training the topic model on the question set;
The three parts are associated through the unique index number of each question, so the data of any part can be retrieved by that number.
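One way to picture the three parts and their shared index is the small in-memory layout below. It is only a sketch: the class name `QALibrary`, its field names, and the choice of parallel Python lists are assumptions made for illustration, and the source does not say how the library is actually stored.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class QALibrary:
    """Hypothetical in-memory layout of the three-part question-answer library."""
    questions: List[str] = field(default_factory=list)              # question set
    pairs: List[Tuple[str, str]] = field(default_factory=list)      # question-answer pairs
    vectors: List[Dict[str, float]] = field(default_factory=list)   # question vectors

    def add(self, question: str, answer: str, vector: Dict[str, float]) -> int:
        """Append one entry to all three parts and return its shared index."""
        self.questions.append(question)
        self.pairs.append((question, answer))
        self.vectors.append(vector)
        return len(self.questions) - 1

    def answer_of(self, index: int) -> str:
        """Fetch the answer through the question's index."""
        return self.pairs[index][1]

    def vector_of(self, index: int) -> Dict[str, float]:
        """Fetch the word-weight vector through the question's index."""
        return self.vectors[index]
```

Because every entry is appended to the three lists together, a single integer is enough to move from a matched question to its vector or its answer.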
Further, in Step 2, word segmentation uses the NLPIR word segmentation system, with the topic word bank serving as its user-defined dictionary.
In Step 3, the topic word bank is constructed in advance and consists of topic keywords and topic-related high-frequency words. The topic keywords are key terms specific to the topic, such as the keywords of a programming language, and can be taken from official documentation; the high-frequency words are extracted automatically from topic-related e-books or documents using the NLPIR keyword extraction tool of the Chinese Academy of Sciences.
In Step 4, the topic model is trained in advance with the topic word bank, and the vectorized representation of the question information is obtained directly from it. The topic model is obtained by training on the question set of the question-answer library, as follows:
1. load the question set, perform word segmentation and stop-word removal on it, and filter out words that are not in the topic word bank to obtain the corresponding initial word sets;
2. compute the weights of the words in the word segmentation result sets with the TF-IDF algorithm;
3. output the topic model to a file in "word weight" form;
4. output the vectorized representation of the question set to a file in the form "word 1 word 2 … - weight 1 weight 2 …", based on the word segmentation result sets and the corresponding weights;
In the topic model M, entries are stored as "word: weight" key-value pairs, so when the vectorized representation of the question information is built from the topic model, the weight of each word can be looked up directly by the word, one word after another.
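The training and lookup just described might look roughly like the sketch below. The source names TF-IDF but does not give the exact variant, so the formula here (term frequency over the whole question set multiplied by the log of the inverse question frequency) is an assumption, as are the function names `train_topic_model`, `dump_model` and `vectorize` and the choice of weight 0.0 for a word missing from the model. The input is assumed to be already segmented and stripped of stop words.

```python
import math
from collections import Counter
from typing import Dict, List, Set, Tuple


def train_topic_model(tokenized_questions: List[List[str]],
                      topic_words: Set[str]) -> Dict[str, float]:
    """Return a word -> weight mapping (assumed corpus-level TF-IDF variant)."""
    # Keep only words that belong to the topic word bank.
    docs = [[w for w in doc if w in topic_words] for doc in tokenized_questions]
    term_counts = Counter(w for doc in docs for w in doc)        # occurrences
    doc_counts = Counter(w for doc in docs for w in set(doc))    # questions containing w
    total_terms = sum(term_counts.values())
    model: Dict[str, float] = {}
    for word, count in term_counts.items():
        tf = count / total_terms
        idf = math.log(len(docs) / doc_counts[word])
        model[word] = tf * idf
    return model


def dump_model(model: Dict[str, float]) -> str:
    """Serialise the model as one 'word weight' entry per line."""
    return "\n".join(f"{word} {model[word]:.6f}" for word in sorted(model))


def vectorize(words: List[str], model: Dict[str, float]) -> List[Tuple[str, float]]:
    """Look up each word's weight directly by key; unknown words get 0.0 (assumed)."""
    return [(word, model.get(word, 0.0)) for word in words]
```

With this variant a word that occurs in every question receives weight 0, which may or may not match the behaviour of the original system; a smoothed inverse frequency would avoid that.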
In Step 5, the question vector set C_w stores the vector representations of all questions, so the similarity can be computed directly from these vectors; the threshold t is a predefined minimum similarity.
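The thresholding and ordering in Step 5 can be illustrated with the sketch below. The similarity formula is not given in this text, so cosine similarity over sparse word-weight vectors is used here purely as an assumed stand-in; the names `cosine_similarity` and `candidate_list` are likewise hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = Dict[str, float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two sparse word -> weight vectors (assumed measure)."""
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0                      # an empty vector matches nothing
    return dot / (norm_a * norm_b)


def candidate_list(w: Vector, question_vectors: List[Vector],
                   t: float) -> List[Tuple[int, float]]:
    """Return (question index, similarity) pairs with similarity > t, best first."""
    scored = [(j, cosine_similarity(w, cw)) for j, cw in enumerate(question_vectors)]
    kept = [(j, p) for j, p in scored if p > t]
    return sorted(kept, key=lambda item: item[1], reverse=True)
```

The returned indices are the shared question indices, so each retained candidate can be mapped back to its question text and answer.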
Further, in Step 8, the question-answer pair set C_p stores the questions and their corresponding answers, and the answer to a question can be obtained through the question's index.
An intelligent question-answering system based on subject word filtering comprises the following modules:
a question information acquisition module, used for acquiring the question information of a user;
a question-answer library module, used for storing the topic word bank, the question-answer library and the topic model of a topic;
a natural language processing module, used for processing the user's question information to obtain its word set;
a question-answer library matching module, used for matching the user's question information against the questions in the question set of the question-answer library to obtain a list of related candidate questions;
a character string matching module, used for processing the candidate question list obtained from the question-answer library matching module and further matching it against the question information;
and an answer returning module, used for returning the finally obtained answer to the user.
Further, the question-answer library module comprises: 1) a topic word bank, storing the topic keywords and topic-related high-frequency words; 2) a question-answer library, storing the question-answer pairs of the topic; 3) a topic model, storing the trained question word sets and their vector representations in the same order as the question-answer pairs.
The natural language processing module comprises: a word segmentation unit, which splits the question information into a word list, using the topic word bank as one of the bases for segmentation; and a stop-word and topic-word filtering unit, which, after segmentation, filters out stop words and words that do not belong to the topic word bank.
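The two units of the natural language processing module can be sketched as a short pipeline. The tokenizer below is only a placeholder that splits on non-word characters; it is not NLPIR and would drop operator tokens such as "++", so a real segmenter with the topic word bank as its user dictionary would be plugged in through the `segment` parameter. All names here are hypothetical.

```python
import re
from typing import Callable, List, Set


def simple_segment(text: str) -> List[str]:
    """Placeholder tokenizer: lowercases and splits on non-word characters."""
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]


def topic_word_set(question: str, topic_words: Set[str], stop_words: Set[str],
                   segment: Callable[[str], List[str]] = simple_segment) -> List[str]:
    """Segment, drop stop words, then keep only words in the topic word bank."""
    tokens = segment(question)                         # word segmentation unit
    s0 = [t for t in tokens if t not in stop_words]    # stop-word removal
    s1 = [t for t in s0 if t in topic_words]           # topic-word filtering
    return list(dict.fromkeys(s1))                     # de-duplicate, keep order
```

Returning an order-preserving list rather than a set keeps the words aligned with the weights looked up for them later.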
Furthermore, in the character string matching module, each candidate question-answer pair is obtained from the question-answer library. If a question identical to the question information exists, its answer is returned directly; otherwise the question-answer pairs whose questions contain q_0 are collected in a list L_1. If L_1 is not empty, L_1 is sorted in reverse order and returned; otherwise the list L_2 of question-answer pairs whose questions do not contain q_0 is sorted in reverse order and returned.
The technical conception of the invention is as follows: the method comprises the steps of obtaining question information, carrying out natural language processing and subject word filtering on the question information, obtaining a candidate question list after matching with a question-answering library, carrying out character string matching, and finally returning a result, so that the intelligent question-answering accuracy is improved.
While users are asking questions, the question-answer library is updated periodically, for example every hour: if a question that is not recorded in the question-answer library appears, it is answered manually and the resulting question-answer pair is added to the library, so that useful answers are available to users.
The invention has the following beneficial effects: the method comprises the steps of filtering contents irrelevant to the subject in question information based on a specific subject word bank to enable the question information to be more suitable for the subject, and meanwhile, improving the matching degree of the question information and a question-answer bank by adopting a character string matching method to enable the question of a user to be answered more accurately.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Referring to Figs. 1 and 2, an intelligent question-answering method based on subject word filtering obtains the user's question information, applies natural language processing to obtain the corresponding word vector, then performs similarity matching against the question-answer library followed by character string matching on the basis of that vector, and returns the resulting answer or list of question-answer pairs to the user. The method comprises the following steps:
Step 1: obtain the user's question information q_0 (e.g., "I want to know what the difference between the array and the pointer is?").
In this embodiment, the topic is the programming language C/C++. Word segmentation uses the NLPIR word segmentation system with the topic word bank added as a user-defined dictionary. The topic word bank includes the C/C++ keywords and operators, together with high-frequency words extracted from some sections of the book "C language confusion: pointer, array, function and multi-file programming". The stop-word list combines the common stop-word lists provided by Baidu and Haohang. TF-IDF values are used as word weights when training the topic model, and the similarity threshold t is 0.6.
Step 2: load the topic word bank T, shown in Table 1, and apply natural language processing (word segmentation and stop-word removal) to q_0 to obtain the initial word set S_0 (<want, array, pointer, difference>):
array | pointer | string | output | … | auto | break | … | ++ | … | &&
TABLE 1
Step 3: filter S_0 with the topic word bank T to obtain the topic word set S_1 (<array, pointer, difference>);
Step 4: load the topic model M, shown in Table 2, and from S_1 obtain the vector of the question information q_0, w = ({"array", 0.2002578}, {"pointer", 0.202271}, {"difference", 0.097653});
TABLE 2
Step 5: compute, one by one, the similarity p_j between w and each question c_j in the question set C, as follows:
when j = 1, p_1 is the similarity between w and the vector of c_1 (the worked calculation is not reproduced in this text);
The results are shown in Table 3. The candidate questions whose similarity exceeds the threshold t are retained and sorted in descending order of similarity to obtain the initial candidate question list L_0, shown in Table 4:
TABLE 3
TABLE 4
Step 6: initialize L_1 = {} and L_2 = {};
Step 7: take the first candidate question in L_0; now q = "Actually, I want to know what the difference between the array and the pointer is, can you tell me?" and q_0 = "I want to know what the difference between the array and the pointer is?";
Step 8: perform string matching between q and q_0;
(8.1) obtain the answer of q from the question-answer pair set C_p; now r = "array auto allocate space, but …";
(8.2) q ≠ q_0, so go to step 8.3;
(8.3) q contains q_0, so L_1 = {(q, r)} ∪ L_1; L_1 is now as shown in Table 5:
TABLE 5
(8.4) delete q from L_0 and return to Step 7;
Step 7: take the first candidate question in L_0; now q = "Then why can array and pointer declarations be interchanged as function parameters?" and q_0 = "I want to know what the difference between the array and the pointer is?";
Step 8: perform string matching between q and q_0;
(8.1) obtain the answer of q from the question-answer pair set C_p; now r = "array auto allocate space, but …";
(8.2) q ≠ q_0, so go to step 8.3;
(8.3) q does not contain q_0, so L_2 = {(q, r)} ∪ L_2; L_2 is now as shown in Table 6:
TABLE 6
(8.4) delete q from L_0 and return to Step 7;
Step 7: take the first candidate question in L_0; now q = "What is a void pointer, can you tell me?" and q_0 = "I want to know what the difference between the array and the pointer is?";
Step 8: perform string matching between q and q_0;
(8.1) obtain the answer of q from the question-answer pair set C_p; now r = "the meaning of void …";
(8.2) q ≠ q_0, so go to step 8.3;
(8.3) q does not contain q_0, so L_2 = {(q, r)} ∪ L_2; L_2 is now as shown in Table 7:
TABLE 7
(8.4) delete q from L_0 and return to Step 7;
Step 7: no candidate question remains in L_0;
Step 8: q does not exist, so go to Step 9;
Step 9: L_1 is not empty and contains only one record; return L_1, and the process ends.
In this embodiment, a trailing question mark is used as the criterion for deciding whether an input is question information, and the trailing question mark is not included in string matching. Ellipses indicate text that is too long and has been abbreviated.
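The question-mark convention can be expressed as a small pre-processing helper applied to both strings before matching. This is a sketch only: the function names are hypothetical, and accepting the full-width mark "？" alongside the ASCII "?" is an assumption made because the original questions are in Chinese.

```python
QUESTION_MARKS = ("?", "？")   # ASCII and full-width forms (the latter is assumed)


def is_question(text: str) -> bool:
    """Treat an input as question information if it ends with a question mark."""
    return text.strip().endswith(QUESTION_MARKS)


def strip_question_mark(text: str) -> str:
    """Remove trailing question marks so they do not take part in string matching."""
    stripped = text.strip()
    while stripped.endswith(QUESTION_MARKS):
        stripped = stripped[:-1].rstrip()
    return stripped
```

With the marks removed, the first candidate of the embodiment contains the user's question as a substring even though the two strings end differently.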
It will be appreciated by persons skilled in the art that the foregoing is illustrative only and is not to be construed as limiting the invention, as variations and modifications of the foregoing examples are within the spirit and scope of the invention as defined by the appended claims.