CN110825930A

CN110825930A - Method for automatically identifying correct answers in community question-answering forum based on artificial intelligence

Info

Publication number: CN110825930A
Application number: CN201911058818.8A
Authority: CN
Inventors: 孙海峰; 王晶; 戚琦; 王敬宇; 郭令奇; 马兵; 杜纯宁
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2019-11-01
Filing date: 2019-11-01
Publication date: 2020-02-21

Abstract

The method for automatically identifying correct answers in a community question-answering forum based on artificial intelligence comprises the following operation steps: (1) establishing a data set; (2) extracting information features of the text pairs by using a deep learning method; (3) extracting other features of the question and the answer by using rules, and concatenating them with the features obtained in step (2) into a feature vector whose format is [BERT prediction probability, similarity between the current answer and the excellent answer, similarity between the answer and the question, day difference]; (4) training a machine learning classification model and predicting new posts. The method can quickly and accurately identify the answers under a post that are likely to be correct, saving time and labor.

Description

Method for automatically identifying correct answers in community question-answering forum based on artificial intelligence

Technical Field

The invention relates to a method for automatically identifying correct answers in a community question-answering forum based on artificial intelligence. It belongs to the technical field of natural language processing, and in particular to artificial-intelligence-based question answering for forums.

Background

With the emergence of numerous community forums, tasks related to them have become increasingly important. A large number of new questions pour into these forums every day, and many of the replies to them contain errors that can mislead other users. Identifying such wrong replies manually requires fairly authoritative experts in the relevant fields and is time-consuming and laborious. How to judge quickly and effectively whether an answer to a new question actually helps to solve it is therefore a growing problem that needs an effective solution.

Artificial intelligence and natural language processing technologies have developed greatly in recent years; how to use them to judge answer quality is a technical problem that urgently needs to be solved.

Disclosure of Invention

In view of the above, the object of the present invention is to provide a method for automatically identifying correct answers in a community question-answering forum based on artificial intelligence, so as to assess the answers in a question-and-answer post and select excellent answers for others to refer to.

In order to achieve the above object, the present invention provides a method for automatically identifying correct answers in a community question-answering forum based on artificial intelligence, which comprises the following operation steps:

(1) establishing a data set, specifically: crawling a large number of question-and-answer posts with crawler software; after crawling, storing the post contents as text pairs, each consisting of a question and a single answer; then cleaning the stored data and labeling it manually to establish the data set;

(2) extracting information features of the text pairs by using a deep learning method, specifically: training a deep learning model with the data set obtained in step (1) as the training set, and then using the model to extract features of each text pair such as tone, keywords and grammatical structure;

(3) extracting other features of the questions and answers by using rules, specifically: calculating the difference in days between the posting of the question and the posting of the answer, calculating the similarity between the single answer and the current question by using TF-IDF, calculating the similarity between the single answer and the other answers to the current question by using TF-IDF, and extracting other features; then concatenating these features with the features obtained in step (2) into a feature vector;

(4) training a machine learning classification model and predicting new posts, specifically: training the machine learning classification model with the feature vectors obtained in step (3); after training, predicting a new post by crawling and storing all of its contents with a crawler, extracting features and composing feature vectors according to steps (2) and (3), predicting with the machine learning classification model, and selecting the n answers with the highest probability, wherein n is a natural number not greater than the total number of answers.
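The TF-IDF similarity used in step (3) can be illustrated with a minimal sketch. The patent does not specify a TF-IDF variant or a similarity measure, so the smoothed IDF, the cosine similarity, the whitespace tokenization and the function names `tfidf_vectors` and `cosine` are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumption; the patent does not fix a variant).
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Whitespace tokenization is a simplification for this toy example.
question = "how many months is the visa valid".split()
answers = [
    "the visa is valid for three months".split(),
    "happy holidays everyone".split(),
]
vectors = tfidf_vectors([question] + answers)
question_similarity = [cosine(vectors[0], v) for v in vectors[1:]]
```

Here the first answer shares several terms with the question and receives a positive similarity, while the second shares none and scores zero. The same function could be applied between two answers to obtain the answer-to-answer similarity.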

The specific content of the step (1) comprises the following operation steps:

(11) crawling a website with a crawler and storing information such as each post's question, answers, asking user, answering users and posting time; alternatively, data can be obtained from other similar data sets and merged;

(12) traversing the data and filling NULL attributes, unifying the maximum text length, and cleaning out interference data;

(13) storing the data obtained in the previous step as text pairs, each consisting of a question and a single answer, and labeling them manually.
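Steps (12) and (13) could be sketched as follows. The record layout (a dict with `question` and `answers` keys), the `MAX_LEN` value and the decision to drop pairs that remain empty after filling are hypothetical choices for illustration; the patent does not prescribe them.

```python
MAX_LEN = 128  # hypothetical maximum text length, in characters


def build_text_pairs(posts, max_len=MAX_LEN):
    """Turn crawled posts into cleaned (question, answer) text pairs."""
    pairs = []
    for post in posts:
        # Fill NULL attributes with an empty string so later steps never see None.
        question = (post.get("question") or "").strip()[:max_len]
        for answer in post.get("answers") or []:
            answer = (answer or "").strip()[:max_len]
            # Assumption: a pair that is still empty after filling is treated as noise.
            if question and answer:
                pairs.append((question, answer))
    return pairs


posts = [
    {"question": "How many months is the visa valid?",
     "answers": ["3 months", None, "About two or three months"]},
    {"question": None, "answers": ["orphan reply"]},
]
pairs = build_text_pairs(posts)
```

Each element of `pairs` holds one question together with a single answer, which is the unit that is later labeled and fed to the model.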

The specific content of the step (2) comprises the following operation steps:

(21) fine-tuning a BERT model with the data obtained in step (1); the BERT model applies token embedding, segment embedding and position embedding to the input text; after fine-tuning is finished, the fine-tuned model is saved;

(22) adding the vectors of the three embedding layers obtained in step (21) and then classifying, to obtain a classification result for each pair of a single question and a single answer, wherein the classification result reflects text features, such as tone and keywords, that the BERT model has learned from the text.
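The addition of the three vectors mentioned in step (22) can be illustrated with a toy sketch. The tiny vocabulary, the hidden size of 4 and the random tables are stand-ins for learned weights and are assumptions; in an actual BERT model the summed vectors additionally pass through the Transformer encoder before the classification layer, which is not shown here.

```python
import random

HIDDEN = 4  # toy hidden size; BERT-base uses 768
random.seed(0)


def make_table(rows, dim=HIDDEN):
    """Create a toy embedding table with random values (a stand-in for learned weights)."""
    return [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(rows)]


vocab = {"[CLS]": 0, "[SEP]": 1, "visa": 2, "valid": 3, "months": 4}
token_table = make_table(len(vocab))
segment_table = make_table(2)     # segment 0 = question, segment 1 = answer
position_table = make_table(16)   # one row per input position


def embed(tokens, segment_ids):
    """Sum token, segment and position vectors element-wise for each input position."""
    output = []
    for position, (token, segment) in enumerate(zip(tokens, segment_ids)):
        rows = (token_table[vocab[token]], segment_table[segment], position_table[position])
        output.append([sum(values) for values in zip(*rows)])
    return output


tokens = ["[CLS]", "visa", "valid", "[SEP]", "months", "[SEP]"]
segment_ids = [0, 0, 0, 0, 1, 1]
embedded = embed(tokens, segment_ids)
```

The result has one vector per input position, each being the element-wise sum of the three lookups.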

The specific content of the step (3) comprises the following operation steps:

(31) reading the times of the current question and of its answer from the data set and calculating the day difference, i.e. day difference = time of the question - time of the answer; and calculating the similarity between the single answer and the question by using the TF-IDF (term frequency-inverse document frequency) algorithm;

(32) according to the classification results of all answers obtained in step (2), calculating the similarity between each answer and the highest-probability answer to the current question by using the TF-IDF (term frequency-inverse document frequency) algorithm, the highest-probability answer being regarded as the excellent answer;

(33) concatenating the day-difference feature, the similarity features and the feature value obtained in step (2) into a feature vector whose format is [BERT prediction probability, similarity between the current answer and the excellent answer, similarity between the answer and the question, day difference].
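As a minimal sketch of steps (31) and (33), the day difference and the four-element feature vector might be assembled as shown here. The function names and the numeric inputs are hypothetical; the sign convention follows the formula stated in step (31), so an answer posted after the question yields a negative value.

```python
from datetime import date


def day_difference(question_date, answer_date):
    """Day difference as stated in step (31): question time minus answer time, in days."""
    return (question_date - answer_date).days


def build_feature_vector(bert_probability, sim_to_best_answer, sim_to_question, day_diff):
    """Concatenate the features in the order given in step (33)."""
    return [bert_probability, sim_to_best_answer, sim_to_question, float(day_diff)]


# Hypothetical values for one answer: the probability would come from the fine-tuned
# BERT model and the similarities from the TF-IDF computations.
diff = day_difference(date(2019, 10, 1), date(2019, 10, 4))
features = build_feature_vector(0.91, 0.62, 0.48, diff)
```

One such vector is built per answer, so a post with k answers yields k feature vectors.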

The specific content of the step (4) comprises the following operation steps:

(41) selecting an SVM model as a machine learning classification model, and training the machine learning classification model according to the feature vector obtained in the step (3);

(42) obtaining relevant information of target posts, including but not limited to question content, answer content and posting time, and storing the question and the single answer in a text pair mode according to the storage format of the step (1);

(43) predicting on the target post with the BERT model fine-tuned in step (2), using the text data obtained in the previous step; calculating features such as the day difference and the similarities according to the method in step (3); and combining them into feature vectors that have the same format as in step (3), the number of feature vectors being equal to the number of answers;

(44) predicting on the feature vectors with the machine learning classification model trained in step (41), and outputting the n answers with the highest probability for the user to refer to, wherein n is a natural number not greater than the total number of answers.
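The final selection in step (44) amounts to ranking the answers by predicted probability. A minimal sketch follows, assuming the probabilities have already been produced by the trained classifier; the values below are invented for illustration and `top_n_answers` is a hypothetical helper name.

```python
def top_n_answers(answers, probabilities, n):
    """Return the n answers with the highest predicted probability, best first."""
    if not 0 <= n <= len(answers):
        raise ValueError("n must be between 0 and the number of answers")
    ranked = sorted(zip(answers, probabilities), key=lambda pair: pair[1], reverse=True)
    return [answer for answer, _ in ranked[:n]]


answers = ["3 months", "Multiple entries within 3 months", "About two or three months"]
probabilities = [0.88, 0.93, 0.35]  # hypothetical classifier outputs
best = top_n_answers(answers, probabilities, n=2)
```

The check on `n` mirrors the constraint that n must not exceed the total number of answers.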

The beneficial effects of the invention are as follows: the method is not limited to the text of the original post and its replies, but also considers information beyond the text, such as user names, the time difference between posting and replying, and similarity to other answers, and trains the model on these multi-dimensional features, which makes the model more accurate. The method can quickly and accurately identify the answers under a post that are likely to be correct, saving time and labor and reducing the extent to which wrong answers mislead others.

Drawings

FIG. 1 is a flow chart of the method for automatically identifying correct answers in a community question-answering forum based on artificial intelligence according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the accompanying drawings.

Referring to FIG. 1, the method for automatically identifying correct answers in a community question-answering forum based on artificial intelligence comprises the following operation steps:

(1) establishing a data set, specifically: crawling a large number of question-and-answer posts with crawler software; after crawling, storing the post contents as text pairs, each consisting of a question and a single answer, in the data storage format shown in Table 1; data may also be obtained from other data sets; then cleaning the stored data and labeling it manually to establish the data set;

TABLE 1

Question	Answer
How many months is the visa valid?	3 months
How many months is the visa valid?	Multiple entries and exits are allowed within 3 months.
How many months is the visa valid?	About two months, or three months

The specific content of the step (1) comprises the following operation steps:

(11) crawling a website with a crawler and storing information such as each post's question, answers, asking user, answering users and posting time; data can also be obtained from other similar data sets whose main content is forum help posts, such as the data set of Task 8 of SemEval-2019, and merged with the crawled data;

(12) traversing the data and filling NULL attributes, unifying the maximum text length, and cleaning out interference data by using rules; for example, irrelevant posts such as discussion posts and announcement posts are found mainly by checking whether a post contains keywords such as "happy holidays", "filler post" or "discussion";

TABLE 2

(13) storing the data obtained in the previous step as text pairs, each consisting of a question and a single answer, and labeling them manually according to the following formula:

$$a = \begin{cases} 1, & \text{if the answer is correct} \\ 0, & \text{if the answer is wrong} \\ 2, & \text{if the answer is a question} \end{cases}$$

In the above formula, a represents the label of a text pair: "1" if the answer is correct, "0" if the answer is wrong, and "2" if the answer is itself a question.
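The keyword rule described in step (12) might look like the following sketch. The keyword list contains hypothetical English stand-ins for the keywords named in the text, and simple substring matching is an assumption; the patent only states that such keywords are searched for.

```python
# Hypothetical English stand-ins for the keywords named in step (12).
INTERFERENCE_KEYWORDS = ("happy holidays", "filler post", "discussion")


def is_interference(post_text, keywords=INTERFERENCE_KEYWORDS):
    """Return True if the post contains any keyword that marks it as off-topic."""
    lowered = post_text.lower()
    return any(keyword in lowered for keyword in keywords)


posts = [
    "How many months is the visa valid?",
    "Happy holidays to everyone on the forum!",
    "Open discussion: favourite travel destinations",
]
kept = [post for post in posts if not is_interference(post)]
```

Only the help-seeking post survives the filter; the greeting and the discussion thread are discarded before labeling.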

Referring to table 1, data is stored in a file for reading in the form of text pairs of questions and individual answers. Referring to the data example shown in table 2, each row in table 2 represents a single text pair, with the first column being a question and the second column being an answer to the question. There may be zero, one, or more correct answers to a post. In this example, the first and second answers are correct and the third answer is wrong.
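The labeling scheme and the example just described can be written down as a short sketch. The `Label` enum, the tab-separated line layout and the English wording of the question are assumptions for illustration; the label values 1, 0 and 2 and the correct/correct/wrong assignment come from the text.

```python
from enum import IntEnum


class Label(IntEnum):
    WRONG = 0     # the answer is wrong
    CORRECT = 1   # the answer is correct
    QUESTION = 2  # the "answer" is itself a question


question = "How many months is the visa valid?"
labeled_pairs = [
    (question, "3 months", Label.CORRECT),
    (question, "Multiple entries and exits are allowed within 3 months.", Label.CORRECT),
    (question, "About two months, or three months", Label.WRONG),
]
# One tab-separated line per text pair, ready to be written to a file.
lines = ["\t".join((q, a, str(int(label)))) for q, a, label in labeled_pairs]
```

Keeping one pair per line means a post with several answers simply contributes several lines that share the same question.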

The specific content of the step (2) comprises the following operation steps:

(21) fine-tuning a BERT model with the data obtained in step (1); the BERT model applies token embedding, segment embedding and position embedding to the input text; after fine-tuning is finished, the fine-tuned model is saved. The full name of BERT is Bidirectional Encoder Representations from Transformers; see Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805;
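To make the input side of step (21) concrete, the sketch below packs one question/answer pair in the layout BERT uses for sentence pairs: [CLS] question [SEP] answer [SEP], with segment 0 for the question part and segment 1 for the answer part. The helper name `pack_pair`, the `max_len` value, the naive truncation and the whitespace tokenization are assumptions; BERT itself uses WordPiece sub-word tokens.

```python
def pack_pair(question_tokens, answer_tokens, max_len=32):
    """Build a BERT-style input for a question/answer pair.

    Returns the token sequence, the segment id of each token and the position ids.
    """
    first = ["[CLS]"] + question_tokens + ["[SEP]"]
    second = answer_tokens + ["[SEP]"]
    # Naive truncation (a simplification): anything beyond max_len is cut off.
    tokens = (first + second)[:max_len]
    segment_ids = ([0] * len(first) + [1] * len(second))[:max_len]
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


# Whitespace splitting stands in for a real sub-word tokenizer.
tokens, segment_ids, position_ids = pack_pair(
    "how long is the visa valid".split(), "three months".split()
)
```

The three returned lists are what the token, segment and position embedding layers would look up, one entry per input position.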

The specific content of the step (3) comprises the following operation steps:

The specific content of the step (4) comprises the following operation steps:

The inventors have conducted a large number of experiments on the method, and the results show that it is feasible and effective.

Claims

1. A method for automatically identifying correct answers in a community question-answering forum based on artificial intelligence, characterized in that the method comprises the following operation steps:

2. The method for automatically identifying correct answers in a community question-answering forum based on artificial intelligence as claimed in claim 1, wherein: the specific content of the step (1) comprises the following operation steps:

3. The method for automatically identifying correct answers in a community question-answering forum based on artificial intelligence as claimed in claim 1, wherein: the specific content of the step (2) comprises the following operation steps:

(21) fine-tuning a BERT model with the data obtained in step (1); the BERT model applies token embedding, segment embedding and position embedding to the input text; after fine-tuning is finished, the fine-tuned model is saved;

4. The method for automatically identifying correct answers in a community question-answering forum based on artificial intelligence as claimed in claim 1, wherein: the specific content of the step (3) comprises the following operation steps:

5. The method for automatically identifying correct answers in a community question-answering forum based on artificial intelligence as claimed in claim 1, wherein: the specific content of the step (4) comprises the following operation steps: