CN111177316A

CN111177316A - An intelligent question answering method and system based on subject word filtering

Info

Publication number: CN111177316A
Application number: CN201911325753.9A
Authority: CN
Inventors: 潘建; 汤绍雄; 祝训醉
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2019-12-20
Filing date: 2019-12-20
Publication date: 2020-05-19

Abstract

An intelligent question answering method based on subject word filtering, comprising the following steps: step 1, obtaining the question information q₀ posed by a user; step 2, loading a subject thesaurus T and obtaining an initial word set S₀; step 3, obtaining a subject word set S₁; step 4, obtaining the vectorized representation w of the question information q₀ from the topic model M and S₁; step 5, retaining the candidate questions whose similarity is greater than the threshold t and sorting them in descending order of similarity to obtain the initial candidate question list L₀; step 6, initializing L₁ and L₂; step 7, obtaining the first candidate question q in L₀; step 8, if q does not exist, going to step 9, otherwise performing string matching on q and q₀; step 9, if L₁ is not empty, sorting it in reverse order and returning L₁, otherwise sorting L₂ in reverse order and returning L₂, and ending. An intelligent question answering system based on subject word filtering is also provided. The invention enables users' questions to be answered more accurately.

Description

Intelligent question and answer method and system based on subject word filtering

Technical Field

The invention relates to an intelligent question and answer method and system based on subject word filtering.

Background Art

Intelligent question answering aims to automatically provide answers to natural language questions posed by users. In recent years, with the massive growth of internet data, the improvement of computing power and the progress of natural language processing technology, intelligent question-answering methods and systems have developed rapidly and are widely applied in daily life.

However, because questions are diverse and open-ended, existing intelligent question-answering algorithms have shortcomings. For example, some community-oriented question-answering algorithms give results of low accuracy because the topics they cover span too widely, while some algorithms designed for a specific topic are not extensible and cannot be applied to several topics at once. As a result, the questions posed by a user do not receive useful or high-quality answers, and the user has to spend more time searching for them. Therefore, how to accurately retrieve high-quality answers to the questions posed by a user has strong theoretical and practical value.

Disclosure of Invention

In order to improve the accuracy of intelligent question answering and enable a user to quickly obtain high-quality answers, the invention provides an intelligent question answering method and system based on subject word filtering.

The technical scheme adopted by the invention is as follows:

An intelligent question-answering method based on subject word filtering, comprising the following steps:

Step 1, obtaining the question information q₀ posed by the user;

Step 2, loading the subject thesaurus T, and performing word segmentation and stop-word removal on the question information q₀ to obtain an initial word set S₀;

Step 3, filtering S₀ with the subject thesaurus T to obtain the subject word set S₁, i.e. S₁ = S₀ ∩ T;

Step 4, loading the topic model M, and obtaining from M and S₁ the vector w = ({S_11, w_1}, {S_12, w_2}, …, {S_1n, w_n}) of the question information q₀, where S_1i (i = 1, 2, …, n; n is the number of words in S₁) is a word in S₁ and w_i is the weight of the word S_1i, with value:

w_i = M(S_1i);

Step 5, calculating one by one the similarity p_j between w and each question c_j in the question set C (j = 1, 2, …, m; m is the number of question-answer pairs in C), retaining the candidate questions whose similarity is greater than the threshold t, and sorting them in descending order of similarity to obtain the initial candidate question list L₀; in the vector of question c_j, c_jk (k = 1, 2, …, up to the number of words in c_j) is a word in S₁, and the weight of the word c_jk is obtained from the question vector set C_w;

Step 6, initializing L₁ = {}, L₂ = {};

Step 7, obtaining the first candidate question q in L₀;

Step 8, if q does not exist, going to step 9; otherwise performing string matching on q and q₀, as follows:

(8.1) obtaining the answer r of q from the question-answer pair set C_p;

(8.2) if q = q₀, returning the answer r of q and ending; otherwise going to step 8.3;

(8.3) if q contains q₀, then L₁ = {(q, r)} ∪ L₁; otherwise L₂ = {(q, r)} ∪ L₂;

(8.4) deleting q from L₀ and returning to step 7;

Step 9, if L₁ is not empty, sorting it in reverse order and returning L₁; otherwise sorting L₂ in reverse order and returning L₂; then ending.
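Steps 6 to 9 can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation. It assumes that the condition in step 8.3 is a substring test (q contains q₀), that {(q, r)} ∪ L means prepending to a list, and that "sorting in reverse order" means reversing that list so the most similar question comes first again. The function name `string_match_stage` and the use of question text as the key of C_p are hypothetical.

```python
from typing import Dict, List, Tuple, Union

QAPair = Tuple[str, str]


def string_match_stage(
    q0: str, candidates: List[str], answers: Dict[str, str]
) -> Union[str, List[QAPair]]:
    """Steps 6-9. `candidates` is L0 (descending similarity); `answers` plays the role of C_p."""
    l0 = list(candidates)      # work on a copy of L0
    l1: List[QAPair] = []      # step 6: questions that contain q0
    l2: List[QAPair] = []      # step 6: questions that do not contain q0
    while l0:                  # step 8: leave the loop once no candidate q exists
        q = l0[0]              # step 7: first candidate question
        r = answers[q]         # step 8.1: answer of q
        if q == q0:            # step 8.2: exact match, return the answer directly
            return r
        if q0 in q:            # step 8.3 (assumed substring test)
            l1.insert(0, (q, r))
        else:
            l2.insert(0, (q, r))
        l0.pop(0)              # step 8.4: delete q from L0
    result = l1 if l1 else l2  # step 9: prefer L1 when it is not empty
    result.reverse()           # assumed meaning of "sort in reverse order"
    return result
```

With this reading, the returned list keeps the descending-similarity order of L₀, because prepending followed by reversal restores the original order.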

Further, the question-answer library comprises three parts:

1. question set C: contains only the questions from the question-answer pair set, which is convenient for training;

2. question-answer pair set C_p: stored in the form "question-answer";

3. question vector set C_w: stored in the form "word 1 word 2 … - weight 1 weight 2 …" and obtained by training the topic model on the question set;

the three parts are linked by the unique index number of each question, and the data in any part can be retrieved through that index number.
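The three-part library can be pictured with the following sketch, in which C, C_p and C_w are dictionaries sharing one integer index. The class name `QALibrary`, its methods, and the exact line format (space-separated words, a single hyphen, space-separated fixed-point weights) are assumptions made for illustration; the source only gives the format as "word 1 word 2 … - weight 1 weight 2 …".

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


def format_vector_line(vector: Dict[str, float]) -> str:
    """Serialize a question vector as 'word1 word2-weight1 weight2' (assumed layout)."""
    words = " ".join(vector)
    weights = " ".join(f"{weight:.6f}" for weight in vector.values())
    return f"{words}-{weights}"


def parse_vector_line(line: str) -> Dict[str, float]:
    """Inverse of format_vector_line; splits on the last hyphen only."""
    words_part, weights_part = line.rsplit("-", 1)
    return dict(zip(words_part.split(), (float(x) for x in weights_part.split())))


@dataclass
class QALibrary:
    questions: Dict[int, str] = field(default_factory=dict)            # C
    pairs: Dict[int, Tuple[str, str]] = field(default_factory=dict)    # C_p
    vectors: Dict[int, Dict[str, float]] = field(default_factory=dict) # C_w

    def add(self, question: str, answer: str, vector: Dict[str, float]) -> int:
        """Store one entry in all three parts under the same index number."""
        index = len(self.questions) + 1
        self.questions[index] = question
        self.pairs[index] = (question, answer)
        self.vectors[index] = dict(vector)
        return index

    def answer_of(self, index: int) -> str:
        """Look up the answer through the question's index number."""
        return self.pairs[index][1]
```

Writing weights in fixed-point form keeps the last hyphen unambiguous as the separator, at the cost of rounding each weight to six decimal places.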

Further, in step 2, word segmentation uses the NLPIR word segmentation system, with the subject thesaurus as its user-defined dictionary.

In step 3, the subject thesaurus is a pre-constructed thesaurus consisting of topic keywords and topic-related high-frequency words. The topic keywords are key words specific to the topic, such as the keywords of a programming language, and can be obtained from official documents; the high-frequency words are automatically extracted from topic-related e-books or documents using the NLPIR keyword extraction tool of the Chinese Academy of Sciences.

In step 4, the topic model is trained in advance according to the subject thesaurus, and the vectorized representation of the question information is obtained directly through the topic model. The topic model is trained from the question set in the question-answer library as follows:

1. loading the question set, performing word segmentation and stop-word removal on it, and filtering out words that are not in the subject thesaurus to obtain the corresponding word segmentation result set;

2. calculating the weights of the words in the word segmentation result set with the TF-IDF algorithm;

3. outputting the topic model to a file in the form "word weight";

4. outputting the vectorized representation of the question set to a file in the form "word 1 word 2 … - weight 1 weight 2 …", according to the word segmentation result set and the corresponding weights.

In the topic model M, entries are stored as "word weight" key-value pairs, so when the vectorized representation of the question information is obtained from the topic model, the weight of each word can be looked up directly by the word, one after another.
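The training procedure and the lookup w_i = M(S_1i) can be sketched as below. The source does not spell out its TF-IDF formula, so this sketch assumes one weight per word: term frequency over the whole question set multiplied by a smoothed inverse document frequency. It also assumes the questions are already segmented, and that words missing from M are skipped. The names `train_topic_model` and `vectorize` are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Set, Tuple


def train_topic_model(
    segmented_questions: List[List[str]], thesaurus: Set[str], stop_words: Set[str]
) -> Tuple[Dict[str, float], List[Dict[str, float]]]:
    """Return (M, question vectors) from pre-segmented questions."""
    # Training step 1: drop stop words and words outside the subject thesaurus.
    filtered = [
        [w for w in words if w not in stop_words and w in thesaurus]
        for words in segmented_questions
    ]
    n_docs = len(filtered)
    term_counts = Counter(w for words in filtered for w in words)
    doc_freq = Counter(w for words in filtered for w in set(words))
    total_terms = sum(term_counts.values())
    # Training step 2: one TF-IDF weight per word (assumed corpus-level variant).
    model: Dict[str, float] = {}
    for word, count in term_counts.items():
        tf = count / total_terms
        idf = math.log((1 + n_docs) / (1 + doc_freq[word])) + 1.0  # smoothed IDF
        model[word] = tf * idf
    # Training step 4: each question becomes a word -> weight mapping.
    vectors = [{w: model[w] for w in dict.fromkeys(words)} for words in filtered]
    return model, vectors


def vectorize(subject_words: List[str], model: Dict[str, float]) -> Dict[str, float]:
    """Step 4 of the method: w_i = M(S_1i) by direct key lookup."""
    return {w: model[w] for w in subject_words if w in model}
```

Because M is a plain key-value mapping, vectorizing a new question costs one dictionary lookup per subject word and needs no retraining.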

In step 5, the question vector set C_w stores the vector representations of all questions, so the similarity can be computed directly from the vectors; the threshold t is a predefined minimum similarity.
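The similarity formula itself is not reproduced in this text, so the following sketch assumes cosine similarity between sparse word-weight vectors; any other vector similarity could be substituted. It shows the thresholding and descending sort that produce L₀. The function names and the representation of C_w as a dictionary keyed by question index are illustrative assumptions.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors (assumed measure, not from the source)."""
    dot = sum(weight * v[word] for word, weight in u.items() if word in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def initial_candidates(
    w: Dict[str, float], question_vectors: Dict[int, Dict[str, float]], t: float
) -> List[Tuple[int, float]]:
    """Step 5: keep questions with similarity > t, sorted by descending similarity."""
    scored = [(idx, cosine_similarity(w, vec)) for idx, vec in question_vectors.items()]
    kept = [(idx, p) for idx, p in scored if p > t]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept
```

The result is a list of (question index, similarity) pairs; the indices can then be used to fetch the question text from C and the answer from C_p.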

Further, in step 8, the question-answer pair set C_p stores the questions and their corresponding answers, and the answer to a question can be retrieved through the question index.

An intelligent question-answering system based on subject word filtering comprises the following modules:

a question information acquisition module, used for acquiring the user's question information;

a question-answer library module, used for storing the subject thesaurus, the question-answer library and the topic model for a topic;

a natural language processing module, used for processing the user's question information to obtain the word set of the question information;

a question-answer library matching module, used for matching the user's question information against the questions in the question set of the question-answer library to obtain a list of related candidate questions;

a string matching module, used for processing the candidate question list obtained from the question-answer library matching module and further matching it against the question information;

and an answer returning module, which returns the finally obtained answer to the user.

Further, the question-answer library module comprises: 1) a subject thesaurus, which stores topic keywords and topic-related high-frequency words; 2) a question-answer library, which stores the question-answer pairs for the topic; 3) a topic model, which stores the trained question word sets and vector representations in the order of the question-answer pairs.

The natural language processing module comprises: a word segmentation unit, which divides the question information into a word list, with the subject thesaurus added as one of the bases for word segmentation; and a stop-word and subject-word filtering unit, which, after word segmentation, filters out stop words and words that do not belong to the subject thesaurus.
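The two units of the natural language processing module can be sketched as follows. The NLPIR segmenter is not called here; a simple regular-expression tokenizer named `simple_segment` stands in for it and is purely hypothetical, as are the example stop-word list and thesaurus. The filtering itself follows S₁ = S₀ ∩ T.

```python
import re
from typing import Callable, List, Set, Tuple


def simple_segment(text: str) -> List[str]:
    """Hypothetical stand-in for the word segmentation unit (not NLPIR)."""
    return re.findall(r"\w+", text.lower())


def subject_words(
    question: str,
    thesaurus: Set[str],
    stop_words: Set[str],
    segment: Callable[[str], List[str]] = simple_segment,
) -> Tuple[List[str], List[str]]:
    """Return (S0, S1): S0 after stop-word removal, S1 = S0 ∩ T."""
    tokens = segment(question)
    # Stop-word removal; dict.fromkeys drops duplicates while keeping first-seen order.
    s0 = list(dict.fromkeys(w for w in tokens if w not in stop_words))
    # Subject-word filtering: keep only words that belong to the thesaurus.
    s1 = [w for w in s0 if w in thesaurus]
    return s0, s1


if __name__ == "__main__":
    T = {"array", "pointer", "difference", "void"}
    stop = {"i", "to", "know", "the", "between", "and"}
    print(subject_words("I want to know the difference between the array and the pointer?", T, stop))
```

A real segmenter that takes the subject thesaurus as a user dictionary would keep topic terms such as operators intact, which this regular-expression stand-in does not attempt.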

Further, in the string matching module, each question-answer pair is obtained from the question-answer library. If a question identical to the question information exists, its answer is returned directly; otherwise the list L₁ of question-answer pairs whose questions contain q₀ is built. If L₁ is not empty, it is sorted in reverse order and returned; otherwise the list L₂ of question-answer pairs whose questions do not contain q₀ is sorted in reverse order and returned.

The technical concept of the invention is as follows: the question information is obtained and subjected to natural language processing and subject word filtering; a candidate question list is obtained by matching against the question-answer library; string matching is then performed; and the result is finally returned, thereby improving the accuracy of intelligent question answering.

While users are asking questions, the question-answer library is updated periodically, for example every hour: if a question not recorded in the question-answer library appears, the question-answer pair is added to the library after the question has been answered manually, so that effective answers are available to users.

The invention has the following beneficial effects: content irrelevant to the topic is filtered out of the question information using a topic-specific thesaurus, so that the question information fits the topic more closely; at the same time, string matching improves the degree of matching between the question information and the question-answer library, so that the user's question is answered more accurately.

Drawings

FIG. 1 is a flow chart of the intelligent question answering method of the invention;

FIG. 2 is a schematic diagram of the system modules.

Detailed Description

The invention is further described below with reference to the accompanying drawings.

Referring to FIG. 1 and FIG. 2, the intelligent question-answering method based on subject word filtering obtains the user's question information, applies natural language processing to obtain the corresponding word vector, then performs question-answer library similarity matching and string matching in turn on that vector, and returns the resulting answer or list of question-answer pairs to the user. The method comprises the following steps:

Step 1, obtaining the user's question information q₀ (e.g., "I want to know what the difference between an array and a pointer is?").

In this embodiment, the topic is the programming language C/C++. Word segmentation uses the NLPIR word segmentation system, with the subject thesaurus added as a user-defined dictionary. The subject thesaurus comprises C/C++ keywords and operators, together with high-frequency words extracted from selected chapters of the book "C language confusion: pointer, array, function and multi-file programming". The stop-word list combines the common stop-word lists provided by Baidu and Haohang. TF-IDF values are used as the word weights when training the topic model, and the similarity threshold is t = 0.6.

Step 2, loading the subject thesaurus T, shown in Table 1, and applying natural language processing (word segmentation and stop-word removal) to q₀ to obtain the initial word set S₀ (<think, array, pointer, difference>):

array	pointer
string	output
…	auto
break	…
++	…
&&

TABLE 1

Step 3, filtering S₀ with the subject thesaurus T to obtain the subject word set S₁ (<array, pointer, difference>);

Step 4, loading the topic model M, shown in Table 2, and obtaining from S₁ the vector of the question information q₀: w = ({"array", 0.2002578}, {"pointer", 0.202271}, {"difference", 0.097653});

TABLE 2

Step 5, calculating one by one the similarity p_j between w and each question c_j in the question set C, as follows:

when j = 1,

and likewise for the remaining questions.

As shown in Table 3, the candidate questions whose similarity is greater than the threshold t are retained and sorted in descending order of similarity to obtain the initial candidate question list L₀, shown in Table 4:

TABLE 3

TABLE 4

Step 6, initializing L₁＝{}，L₂＝{}；

Step 7, selecting the first candidate question in L₀; at this point q = "Actually, I want to know what the difference between an array and a pointer is, can you tell me?" and q₀ = "I want to know what the difference between an array and a pointer is?";

Step 8, performing string matching on q and q₀:

(8.1) obtaining the answer r of q from the question-answer pair set C_p; at this point r = "Arrays automatically allocate space, but …";

(8.2) q ≠ q₀, so go to step 8.3;

(8.3) q contains q₀, so L₁ = {(q, r)} ∪ L₁; L₁ is now as shown in Table 5:

TABLE 5

(8.4) deleting q from L₀ and returning to step 7;

Step 7, selecting the first candidate question in L₀; at this point q = "Then why can array and pointer declarations be interchanged as function parameters?" and q₀ = "I want to know what the difference between an array and a pointer is?";

Step 8, performing string matching on q and q₀:

(8.2) q ≠ q₀, so go to step 8.3;

(8.3) q does not contain q₀, so L₂ = {(q, r)} ∪ L₂; L₂ is now as shown in Table 6:

TABLE 6

(8.4) deleting q from L₀ and returning to step 7;

Step 7, selecting the first candidate question in L₀; at this point q = "What is a void pointer, can you tell me?" and q₀ = "I want to know what the difference between an array and a pointer is?";

Step 8, performing string matching on q and q₀:

(8.1) obtaining the answer r of q from the question-answer pair set C_p; at this point r = "The meaning of void …";

(8.2) q ≠ q₀, so go to step 8.3;

(8.3) q does not contain q₀, so L₂ = {(q, r)} ∪ L₂; L₂ is now as shown in Table 7:

TABLE 7

(8.4) deleting q from L₀ and returning to step 7;

Step 7, no candidate question remains in L₀;

Step 8, q does not exist, so go to step 9;

Step 9, L₁ is not empty and contains only one record; L₁ is returned and the process ends.

In this embodiment, a trailing question mark is used as the criterion for deciding whether the input is a question; the trailing question mark is excluded from string matching; and an ellipsis indicates that the text is too long and has been truncated for display.
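The effect of excluding the trailing question mark can be illustrated with the three candidate questions of this embodiment. The sketch below uses English renderings of those questions, assumes both ASCII and full-width question marks should be stripped, and assumes the step 8.3 test is substring containment; the helper names are hypothetical.

```python
def strip_question_mark(text: str) -> str:
    """Remove trailing whitespace and trailing '?' or full-width '？' characters."""
    return text.rstrip().rstrip("?？").rstrip()


def classify(q: str, q0: str) -> str:
    """Return 'exact', 'L1' (q contains q0) or 'L2', ignoring trailing question marks."""
    a, b = strip_question_mark(q), strip_question_mark(q0)
    if a == b:
        return "exact"
    return "L1" if b in a else "L2"


if __name__ == "__main__":
    q0 = "I want to know what the difference between an array and a pointer is?"
    candidates = [
        "Actually, I want to know what the difference between an array and a pointer is, can you tell me?",
        "Then why can array and pointer declarations be interchanged as function parameters?",
        "What is a void pointer, can you tell me?",
    ]
    for q in candidates:
        print(classify(q, q0), "<-", q)
```

Without stripping the question mark, the first candidate would not contain q₀ literally, since q₀ ends in "is?" while the candidate continues with "is, can you tell me?".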

It will be appreciated by persons skilled in the art that the foregoing is illustrative only and is not to be construed as limiting the invention, as variations and modifications of the foregoing examples are within the spirit and scope of the invention as defined by the appended claims.

Claims

1. An intelligent question answering method based on subject word filtering, characterized in that the method comprises the following steps:

Step 1. Obtain the question information q₀ posed by the user;

Step 2. Load the subject thesaurus T, and perform word segmentation and stop-word removal on the question information q₀ to obtain an initial word set S₀;

Step 3. Filter S₀ using the subject thesaurus T to obtain the subject word set S₁, that is, S₁ = S₀ ∩ T;

Step 4. Load the topic model M, and obtain from M and S₁ the vector w = ({S_11, w_1}, {S_12, w_2}, ..., {S_1n, w_n}) of the question information q₀, where S_1i (i = 1, 2, ..., n; n is the number of words in S₁) is a word in S₁ and w_i is the weight of the word S_1i, with value:

w_i = M(S_1i);

Step 5. Calculate one by one the similarity p_j between w and each question c_j in the question set C (j = 1, 2, ..., m; m is the number of question-answer pairs in C), retain the candidate questions whose similarity is greater than the threshold t, and sort them in descending order of similarity to obtain the initial candidate question list L₀; in the vector of question c_j, c_jk (k = 1, 2, ..., up to the number of words in c_j) is a word in S₁, and the weight of the word c_jk is obtained from the question vector set C_w;

Step 6. Initialize L₁ = {}, L₂ = {};

Step 7. Obtain the first candidate question q in L₀;

Step 8. If q does not exist, go to step 9; otherwise perform string matching on q and q₀, as follows:

(8.1) Obtain the answer r of q from the question-answer pair set C_p;

(8.2) If q = q₀, return the answer r of q and end; otherwise go to step 8.3;

(8.3) If q contains q₀, then L₁ = {(q, r)} ∪ L₁; otherwise L₂ = {(q, r)} ∪ L₂;

(8.4) Delete q from L₀ and return to step 7;

Step 9. If L₁ is not empty, sort it in reverse order and return L₁; otherwise sort L₂ in reverse order and return L₂; end.

2. The intelligent question answering method based on subject word filtering according to claim 1, characterized in that:

The question-answer library consists of three parts:

1) Question set C: contains only the questions from the question-answer pair set, which is convenient for training;

2) Question-answer pair set C_p: stored in the form "question-answer";

3) Question vector set C_w: stored in the form "word 1 word 2 ... - weight 1 weight 2 ...", obtained by training the topic model on the question set;

The three parts are linked by the unique index number of each question, and the data in any part can be retrieved through that index number.

3. The intelligent question answering method based on subject word filtering according to claim 1 or 2, characterized in that: in step 2, word segmentation uses the NLPIR word segmentation system, with the subject thesaurus as its user-defined dictionary.

4. The intelligent question answering method based on subject word filtering according to claim 1 or 2, characterized in that: in step 3, the subject thesaurus is a pre-constructed thesaurus consisting of topic keywords and topic-related high-frequency words; the topic keywords are key words specific to a topic and can be obtained from official documents; the high-frequency words are automatically extracted from topic-related e-books or documents using the NLPIR keyword extraction tool of the Chinese Academy of Sciences.

5. The intelligent question answering method based on subject word filtering according to claim 3, characterized in that: in step 4, the topic model is trained in advance according to the subject thesaurus, and the vectorized representation of the question information is obtained directly through the topic model; the topic model is trained from the question set in the question-answer library as follows:

1) Load the question set, perform word segmentation and stop-word removal on it, and filter out the words outside the subject thesaurus to obtain the corresponding word segmentation result set;

2) Calculate the weights of the words in the word segmentation result set with the TF-IDF algorithm;

3) Output the topic model to a file in the form "word weight";

4) According to the word segmentation result set and the corresponding weights, output the vectorized representation of the question set to a file in the form "word 1 word 2 ... - weight 1 weight 2 ...";

In the topic model M, entries are stored as "word weight" key-value pairs, so when the vectorized representation of the question information is obtained from the topic model, the weight of each word can be looked up directly by the word, one after another.

6. The intelligent question answering method based on subject word filtering according to claim 4, characterized in that: in step 5, the question vector set C_w stores the vector representations of all questions, so the similarity can be calculated directly from the vectors; the threshold t is a predefined minimum similarity.

7. The intelligent question answering method based on subject word filtering according to claim 3, characterized in that: in step 8, the question-answer pair set C_p stores the questions and their corresponding answers, and the corresponding answer can be retrieved through the question index.

8. An intelligent question answering system based on subject word filtering, characterized in that the system comprises:

a question information acquisition module, used to obtain the user's question information;

a question-answer library module, used to store the subject thesaurus, the question-answer library and the topic model for a topic;

a natural language processing module, used to process the user's question information to obtain the word set of the question information;

a question-answer library matching module, used to match the user's question information against the questions in the question set of the question-answer library to obtain a list of related candidate questions;

a string matching module, used to process the candidate question list obtained from the question-answer library matching module and further match it against the question information;

an answer returning module, which returns the final answer to the user.

9. The intelligent question answering system based on subject word filtering according to claim 7, characterized in that the question-answer library module comprises:

1) a subject thesaurus, which stores topic keywords and topic-related high-frequency words;

2) a question-answer library, which stores the question-answer pairs for the topic;

3) a topic model, which stores the trained question word sets and vector representations in the order of the question-answer pairs.

10. The intelligent question answering system based on subject word filtering according to claim 7, characterized in that the natural language processing module comprises: a word segmentation unit, which divides the question information into a word list, with the subject thesaurus added as one of the bases for word segmentation; and a stop-word and subject-word filtering unit, which, after word segmentation, removes stop words and words that do not belong to the subject thesaurus;

In the string matching module, each question-answer pair is obtained from the question-answer library. If a question identical to the question information exists, its answer is returned directly; otherwise the list L₁ of question-answer pairs whose questions contain q₀ is built. If L₁ is not empty, it is sorted in reverse order and returned; if L₁ is empty, the list L₂ of question-answer pairs whose questions do not contain q₀ is sorted in reverse order and returned.