CN112966518A - High-quality answer identification method for large-scale online learning platform - Google Patents

High-quality answer identification method for large-scale online learning platform

Info

Publication number
CN112966518A
CN112966518A
Authority
CN
China
Prior art keywords
answer
model
answers
comments
question
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011535456.XA
Other languages
Chinese (zh)
Other versions
CN112966518B (en
Inventor
吴宁
陆鑫
梁欢
王雅迪
邹斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN202011535456.XA priority Critical patent/CN112966518B/en
Publication of CN112966518A publication Critical patent/CN112966518A/en
Application granted granted Critical
Publication of CN112966518B publication Critical patent/CN112966518B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/10Services
    • G06Q50/20Education
    • G06Q50/205Education administration or guidance
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

A high-quality answer identification method for a large-scale online learning platform comprises the following steps. First, feature vectors are constructed: the acquired data set is preprocessed and manually labeled, and feature vectors are then constructed. Second, a classification model based on XGBoost is constructed and trained, taking the feature vectors constructed in step one as input and the manually assigned labels as output. Third, for a new question with its series of answers and comments, three feature vectors are constructed from the text of the question, the answers, and the comments, and fed into the model trained in step two; the resulting series of classification results identifies the high-quality answers. The invention uses more information from different angles, making full use of questions, answers, and answer comments to identify high-quality answers, and improves the prediction results on several evaluation metrics.

Description

High-quality answer identification method for large-scale online learning platform
Technical Field
The invention relates to the technical field of artificial intelligence natural language processing, in particular to a high-quality answer identification method for a large-scale online learning platform.
Background
With the development of Internet technology, online education has been widely accepted for advantages such as freedom from time and place constraints, and more and more people study online, driving its rapid growth. Although the question-answering communities provided by large-scale online learning platforms offer learners opportunities for online communication, the large number of learners means that a teacher cannot provide personalized, real-time answers for every student. Intelligent question-answering technology that simulates a teacher being online at all times has therefore become a research hotspot in online education, and quickly selecting the best answer to a learner's question is an important problem to be solved in this field.
Both high-quality answer identification and answer ranking aim to help users obtain good answers and improve the user experience. The difference is that answer ranking generally takes the number of likes as the model's learning target, but the like count reflects answer quality only to a limited extent: it is affected by factors such as when the answer was posted, and the answer with the most likes is not necessarily the best one. Answer ranking in community question-answering platforms is mainly done by content relevance, answer length, publication time, identified high-quality answers, number of comments on the answer, number of likes of the answer, and so on. At present, large-scale online learning platforms only sort answers by like count and publication time and do not provide a function for identifying high-quality answers. For such platforms, an intelligent question-answering service that simulates a teacher being online at all times is an important way to improve the user experience, and identifying high-quality answers is a key technology in intelligent question answering.
At present there is relatively little research on high-quality answer identification; the most closely related work is answer ranking, for which researchers have proposed a variety of methods, for example:
(1) a community question-answering platform answer ranking method (applicant: University of Science and Technology of China, application No. 201810186972.2);
(2) an answer ranking method for question-answering systems (applicant: Shenzhen Research Institute of Peking University, application No. 201810284245.X);
(3) an answer quality determination model training method, answer quality determination method, and device (applicant: China letter Youyi data Co., Ltd., application No. 201811285467.X);
(4) a method for automatically identifying correct answers in community question-answering forums based on artificial intelligence (applicant: Beijing University of Posts and Telecommunications, application No. 201911058818.8).
This related research mainly takes the number of likes an answer receives as the learning target for answer quality ranking and focuses on features such as question-answer relevance, answer content attributes, and answer time attributes, while ignoring the positive contribution of the answer's comment text and its sentiment polarity to answer quality evaluation.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention aims to provide a high-quality answer identification method for large-scale online learning platforms that uses more information from different angles, making full use of questions, answers, and answer comments to identify high-quality answers, and improves the prediction results on several evaluation metrics.
The above object is achieved by the following technical solution.
A high-quality answer identification method for a large-scale online learning platform comprises the following steps:
(I) Feature vector construction: the acquired data set is preprocessed and manually labeled, and feature vectors are then constructed from the following three angles: semantic relevance features of the question and the answer, document vector features of all comments on each answer, and sentiment features of the comments. The three kinds of features are obtained as follows: (1) sentence vector representations of the question and the answer are obtained, and the cosine similarity of the two semantic vectors is computed to measure their semantic relevance; (2) a HAN model produces a document vector representation of the answer comments; (3) transfer learning extracts the sentiment features of the comments;
(II) Model construction: taking the feature vectors constructed in step (I) as input and the manually assigned labels as output, a classification model based on XGBoost is constructed and trained;
(III) For a new question with its series of answers and comments, the three feature vectors described in step (I) are constructed from the text of the question, the answers, and the comments and fed into the model trained in step (II); the resulting series of classification results identifies the high-quality answers.
The manual labeling in step (I) is performed as follows:
Website information is crawled, and the questions, answers, answer comments, and answer like counts are stored and organized. Records with empty questions, answers, or comments are removed, comments on the same answer under the same question are merged, and the data are stored as question, answer, and merged comments. The crawled data set is then manually labeled as follows:
Flag = 0, if the answer is wrong (poor answer);
Flag = 1, if the answer is correct but incomplete (normal answer);
Flag = 2, if the answer is correct and complete (high-quality answer).
In the above formula, Flag is the label of the text pair: a wrong answer is considered poor and labeled '0', a correct but incomplete answer is considered normal and labeled '1', and a correct and complete answer is considered high-quality and labeled '2'. After manual labeling, the final data set contains: the question, the answer, the merged answer comments, and the label of the text pair;
the semantic relevance feature extraction operation of the question and the answer is as follows:
(1) obtaining sentence vectors of the questions and the answers by using a BERT model, inputting the texts of the questions and the answers into the BERT model, generating the sentence vectors, and taking the output value of the second layer of the pre-training model which is the reciprocal as the sentence vectors of the questions and the answers;
(2) and calculating the similarity between the question and the answer by using a cosine similarity method, and measuring the similarity between the question and the answer by calculating a cosine value of an included angle between two vectors.
The document vector features of the answer comments are extracted as follows:
A hierarchical attention network (HAN) extracts features from the multiple comments. The HAN model has two parts: one builds sentence vectors from word vectors, and the other builds a document vector from the sentence vectors. The comment content in the data set is used as the HAN model's input and the label of the text pair as its output for training, and the output of the model's penultimate layer is taken as the document vector of the comments;
The HAN model is a neural network for document classification with two characteristics: first, a hierarchical structure, in which sentence representations are constructed first and then aggregated into a document representation; second, attention mechanisms applied at both the word and sentence levels, which strengthen the representation of important content when the document representation is constructed;
the extraction operation of the emotional features of the answer comment is as follows:
because the obtained answer comment content has no related emotion label and the workload of manual labeling is very large, emotion label labeling is carried out on partial data randomly, and then a pseudo label strategy in semi-supervised learning is adopted to solve the problem of insufficient training data: firstly, training marked data by using an emotion classification model to obtain an optimal model, carrying out pseudo label marking on unmarked data by using the optimal model, and then training all data to improve the model effect, wherein the method specifically comprises the following steps:
(1) training on the marked comment data, obtaining a sentence vector of a comment by using a BERT model, inputting a comment text into the BERT model, taking an output value of a second layer which is the reciprocal of a pre-training model as a sentence vector of a question and an answer, reducing the dimension of the sentence vector by using a full-connection network, carrying out softmax normalization processing on the sentence vector after dimension reduction, using a result for emotion classification, and obtaining a trained emotion classification model; the emotion classification model consists of an input layer, a pretrained BERT model, a full-connection network layer and an output layer;
(2) analyzing the comment texts which are not marked by using the emotion classification model trained in the step (1), expressing the comment texts which are not marked into sentence vectors, and analyzing the emotional characteristics by using the trained model to obtain the emotional characteristics of the comments; and combining the original labeled data with the data generated based on the pseudo label strategy, and continuing training the emotion analysis model to obtain the optimal model.
The invention has the following advantages: it targets high-quality answer identification for online education platforms and extracts features from three angles, namely question-answer relevance features, answer comment document vector features, and answer comment sentiment features. Compared with other methods, it uses more information from different angles and improves the prediction results on several evaluation metrics.
Drawings
Fig. 1 is a flow chart of an implementation of the embodiment of the present invention.
FIG. 2 is a model diagram of question-answer similarity.
Fig. 3 is a model diagram of the HAN model.
FIG. 4 is a diagram of an answer comment emotion feature extraction model.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
Referring to FIG. 1, a method for identifying high-quality answers for a large-scale online learning platform includes the following steps:
(I) Feature vector construction: the acquired data set is preprocessed (including outlier deletion, formatting, etc.) and manually labeled, and feature vectors are then constructed from the following three angles: semantic relevance features of the question and the answer, document vector features of all comments on each answer, and sentiment features of the comments. The three kinds of features are obtained as follows: (1) sentence vector representations of the question and the answer are obtained, and the cosine similarity of the two semantic vectors is computed to measure their semantic relevance; (2) a HAN model produces a document vector representation of the answer comments; (3) transfer learning extracts the sentiment features of the comments.
(II) Model construction: taking the feature vectors constructed in step (I) as input and the manually assigned labels as output, a classification model based on XGBoost is constructed and trained.
(III) For a new question with its series of answers and comments, the three feature vectors described in step (I) are constructed from the text of the question, the answers, and the comments and fed into the model trained in step (II); the resulting series of classification results identifies the high-quality answers.
The manual labeling in step (I) is performed as follows:
Website information is crawled, and the questions, answers, answer comments, and answer like counts are stored and organized. Records with empty questions, answers, or comments are removed, comments on the same answer under the same question are merged, and the data are stored as question, answer, and merged comments. The crawled data set is then manually labeled as follows:
Flag = 0, if the answer is wrong (poor answer);
Flag = 1, if the answer is correct but incomplete (normal answer);
Flag = 2, if the answer is correct and complete (high-quality answer).
In the above formula, Flag is the label of the text pair: a wrong answer is considered poor and labeled '0', a correct but incomplete answer is considered normal and labeled '1', and a correct and complete answer is considered high-quality and labeled '2'. After manual labeling, the final data set contains: the question, the answer, the merged answer comments, and the label of the text pair.
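The three-way labeling rule above can be sketched as a small helper function (the function name and boolean inputs are illustrative, not from the patent):

```python
def label_answer(is_correct: bool, is_complete: bool) -> int:
    """Map manual judgments of an answer to the Flag label of its text pair.

    0 = poor answer (wrong), 1 = normal answer (correct but incomplete),
    2 = high-quality answer (correct and complete).
    """
    if not is_correct:
        return 0  # a wrong answer is poor regardless of completeness
    return 2 if is_complete else 1

assert label_answer(False, True) == 0   # wrong -> poor
assert label_answer(True, False) == 1   # correct but incomplete -> normal
assert label_answer(True, True) == 2    # correct and complete -> high quality
```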
Referring to FIG. 2, the semantic relevance features of the question and the answer are extracted as follows:
(1) Sentence vectors of the question and the answer are obtained with a BERT model. Traditional word-vector-based sentence representations have a major drawback: the same word is mapped to the same vector even when it carries different meanings in different contexts. BERT is a large-scale pre-trained model that addresses this polysemy problem, and fine-tuning BERT on a specific domain yields good experimental results. BERT comes in two versions, with 12 and 24 Transformer layers; this experiment uses the 12-layer model. In principle the output of any Transformer layer could serve as the sentence vector, but experimental evidence suggests the penultimate (second-to-last) layer works best: the last layer's values are too close to the pre-training objective, while earlier layers have not fully learned the sentence's semantic information. The question and answer texts are fed into the BERT model to generate sentence vectors, and the output of the penultimate layer of the pre-trained model is taken as the sentence vectors of the question and the answer.
(2) The similarity between the question and the answer is computed with cosine similarity, i.e., it is measured by the cosine of the angle between the two sentence vectors.
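Assuming the question and answer sentence vectors have already been extracted from BERT's penultimate layer (stubbed here with fixed arrays; the values are hypothetical), the cosine similarity step is a minimal NumPy computation:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two sentence vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Stand-ins for BERT penultimate-layer sentence vectors.
question_vec = np.array([0.2, 0.1, 0.7])
answer_vec = np.array([0.4, 0.2, 1.4])  # same direction as the question vector

print(round(cosine_similarity(question_vec, answer_vec), 6))  # 1.0
```

Vectors pointing in the same direction score 1.0; orthogonal vectors score 0.0, so the feature grows with semantic relevance of the answer to the question.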
Referring to FIG. 3, the document vector features of the answer comments are extracted as follows:
Generally an answer has multiple comments, and there are two common approaches to extracting features from them: one concatenates the comments into a longer document and then extracts features from that document; the other models each comment separately and then aggregates the modeled features. Since the invention does not need to distinguish individual comments, the first approach is adopted: the comments are concatenated into one document, which is processed with a document vector feature extraction method. Specifically:
A hierarchical attention network (HAN) extracts features from the multiple comments. The HAN model has two parts: one builds sentence vectors from word vectors, and the other builds a document vector from the sentence vectors. The comment content in the data set is used as the HAN model's input and the label of the text pair as its output for training, and the output of the model's penultimate layer is taken as the document vector of the comments;
The HAN model is a neural network for document classification with two characteristics: first, a hierarchical structure, in which sentence representations are constructed first and then aggregated into a document representation; second, attention mechanisms applied at both the word and sentence levels, which strengthen the representation of important content when the document representation is constructed.
Referring to FIG. 4, the sentiment features of the answer comments are extracted as follows:
Because the collected answer comments carry no sentiment labels and manual labeling is very laborious, sentiment labels are assigned to a random subset of the data, and a pseudo-label strategy from semi-supervised learning addresses the shortage of training data: a sentiment classification model is first trained on the labeled data to obtain the best model, this model assigns pseudo labels to the unlabeled data, and the model is then trained on all the data to improve its performance. Specifically:
(1) Training on the labeled comment corpus: a sentence vector of each comment is obtained with a BERT model (the comment text is fed into BERT and the output of the penultimate layer of the pre-trained model is taken as the comment's sentence vector), a fully connected network then reduces the vector's dimensionality, softmax normalization is applied to the reduced vector, and the result is used for sentiment classification, yielding a trained sentiment classification model. The sentiment classification model consists of an input layer, the pre-trained BERT model, a fully connected layer, and an output layer.
(2) The unlabeled comment texts are analyzed with the sentiment classification model trained in step (1): each unlabeled comment is represented as a sentence vector, and its sentiment features are extracted by the trained model.
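The pseudo-label strategy can be sketched end to end. A toy nearest-centroid classifier stands in for the BERT-based sentiment model, and random clusters stand in for comment sentence vectors; only the train / pseudo-label / retrain loop is from the method described above.

```python
import numpy as np

class CentroidClassifier:
    """Toy stand-in for the sentiment classifier: predicts the class
    whose centroid is nearest to a sentence vector."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        return self
    def predict(self, X):
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.classes_[d.argmin(axis=1)]

rng = np.random.default_rng(1)
# Toy sentence vectors in two sentiment clusters (0 = negative, 1 = positive).
X_labeled = np.vstack([rng.normal(-2, 1, (20, 4)), rng.normal(2, 1, (20, 4))])
y_labeled = np.array([0] * 20 + [1] * 20)
X_unlabeled = np.vstack([rng.normal(-2, 1, (30, 4)), rng.normal(2, 1, (30, 4))])

# Step 1: train on the manually labeled subset.
model = CentroidClassifier().fit(X_labeled, y_labeled)
# Step 2: assign pseudo labels to the unlabeled comments.
pseudo = model.predict(X_unlabeled)
# Step 3: retrain on labeled plus pseudo-labeled data.
model.fit(np.vstack([X_labeled, X_unlabeled]), np.concatenate([y_labeled, pseudo]))
print(len(pseudo))  # 60
```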
In summary, based on these three feature extraction methods, the final feature vector has the format [question-answer similarity, comment document vector, comment sentiment features].
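Assembling the final per-answer feature vector is then a simple concatenation. The dimensions and values below are hypothetical stand-ins for the outputs of the three extractors:

```python
import numpy as np

# Hypothetical outputs of the three feature extractors for one answer.
qa_similarity = np.array([0.83])          # question-answer cosine similarity
comment_doc_vec = np.zeros(8)             # HAN document vector of the comments
comment_sentiment = np.array([0.1, 0.9])  # sentiment class probabilities

# Final format: [similarity, comment document vector, comment sentiment].
feature_vector = np.concatenate([qa_similarity, comment_doc_vec, comment_sentiment])
print(feature_vector.shape)  # (11,)
# This vector is the per-answer input row for the XGBoost classifier.
```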

Claims (5)

1. A high-quality answer identification method for a large-scale online learning platform, characterized by comprising the following steps:
(I) feature vector construction: preprocessing the acquired data set, manually labeling it, and then constructing feature vectors from the following three angles: semantic relevance features of the question and the answer, document vector features of all comments on each answer, and sentiment features of the comments; the three kinds of features are obtained as follows: (1) obtaining sentence vector representations of the question and the answer, then computing the cosine similarity of the two semantic vectors to measure their semantic relevance; (2) producing a document vector representation of the answer comments with a HAN model; (3) extracting the sentiment features of the comments with transfer learning;
(II) model construction: taking the feature vectors constructed in step (I) as input and the manually assigned labels as output, constructing and training a classification model based on XGBoost;
(III) for a new question with its series of answers and comments, constructing the three feature vectors described in step (I) from the text of the question, the answers, and the comments and feeding them into the model trained in step (II), thereby obtaining a series of classification results that identify the high-quality answers.
2. The method for identifying high-quality answers for a large-scale online learning platform according to claim 1, characterized in that
the manual labeling in step (I) is performed as follows:
crawling website information, storing and organizing the questions, answers, answer comments, and answer like counts; removing records with empty questions, answers, or comments; merging the comments on the same answer under the same question; and storing the data as question, answer, and merged comments; the crawled data set is then manually labeled as follows:
Flag = 0, if the answer is wrong (poor answer);
Flag = 1, if the answer is correct but incomplete (normal answer);
Flag = 2, if the answer is correct and complete (high-quality answer);
in the above formula, Flag is the label of the text pair: a wrong answer is considered poor and labeled '0', a correct but incomplete answer is considered normal and labeled '1', and a correct and complete answer is considered high-quality and labeled '2'; after manual labeling, the final data set contains: the question, the answer, the merged answer comments, and the label of the text pair.
3. The method for identifying high-quality answers for a large-scale online learning platform according to claim 1, characterized in that
the semantic relevance features of the question and the answer are extracted as follows:
(1) obtaining sentence vectors of the question and the answer with a BERT model: the question and answer texts are fed into BERT, and the output of the penultimate (second-to-last) layer of the pre-trained model is taken as the sentence vectors of the question and the answer;
(2) computing the similarity between the question and the answer with cosine similarity, i.e., measuring it by the cosine of the angle between the two sentence vectors.
4. The method for identifying high-quality answers for a large-scale online learning platform according to claim 1, characterized in that
the document vector features of the answer comments are extracted as follows:
a hierarchical attention network (HAN) extracts features from the multiple comments; the HAN model has two parts: one builds sentence vectors from word vectors, and the other builds a document vector from the sentence vectors; the comment content in the data set is used as the HAN model's input and the label of the text pair as its output for training, and the output of the model's penultimate layer is taken as the document vector of the comments;
the HAN model is a neural network for document classification with two characteristics: first, a hierarchical structure, in which sentence representations are constructed first and then aggregated into a document representation; second, attention mechanisms applied at both the word and sentence levels, which strengthen the representation of important content when the document representation is constructed.
5. The method for identifying high-quality answers for a large-scale online learning platform according to claim 1, characterized in that
the sentiment features of the answer comments are extracted as follows:
because the collected answer comments carry no sentiment labels and manual labeling is very laborious, sentiment labels are assigned to a random subset of the data, and a pseudo-label strategy from semi-supervised learning addresses the shortage of training data: a sentiment classification model is first trained on the labeled data to obtain the best model, this model assigns pseudo labels to the unlabeled data, and the model is then trained on all the data to improve its performance; specifically:
(1) training on the labeled comment data: a sentence vector of each comment is obtained with a BERT model (the comment text is fed into BERT and the output of the penultimate layer of the pre-trained model is taken as the comment's sentence vector), a fully connected network then reduces the vector's dimensionality, softmax normalization is applied to the reduced vector, and the result is used for sentiment classification, yielding a trained sentiment classification model; the sentiment classification model consists of an input layer, the pre-trained BERT model, a fully connected layer, and an output layer;
(2) analyzing the unlabeled comment texts with the sentiment classification model trained in step (1): each unlabeled comment is represented as a sentence vector, and its sentiment features are extracted by the trained model; the original labeled data are then combined with the data generated by the pseudo-label strategy, and the sentiment analysis model is trained further to obtain the best model.
CN202011535456.XA 2020-12-22 2020-12-22 High-quality answer identification method for large-scale online learning platform Active CN112966518B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011535456.XA CN112966518B (en) 2020-12-22 2020-12-22 High-quality answer identification method for large-scale online learning platform

Publications (2)

Publication Number Publication Date
CN112966518A true CN112966518A (en) 2021-06-15
CN112966518B CN112966518B (en) 2023-12-19

Family

ID=76271262

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011535456.XA Active CN112966518B (en) 2020-12-22 2020-12-22 High-quality answer identification method for large-scale online learning platform

Country Status (1)

Country Link
CN (1) CN112966518B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012039686A1 (en) * 2010-09-24 2012-03-29 National University Of Singapore Methods and systems for automated text correction
LU101290B1 (en) * 2018-08-17 2019-11-29 Univ Qilu Technology Method, System, Storage Medium and Electric Device of Medical Automatic Question Answering
CN111259127A (en) * 2020-01-15 2020-06-09 浙江大学 Long text answer selection method based on transfer learning sentence vector
CN112069302A (en) * 2020-09-15 2020-12-11 腾讯科技(深圳)有限公司 Training method of conversation intention recognition model, conversation intention recognition method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
姜雯; 许鑫; 武高峰: "Automated evaluation of information quality in online Q&A communities with additional emotional features", Library and Information Service (图书情报工作), no. 04 *
杨玮祺; 杜晔: "TextCGA: a text classification network based on pre-trained models", Modern Computer (现代计算机), no. 12 *
王伟; 冀宇强; 王洪伟; 郑丽娟: "Research on answer quality evaluation in Chinese Q&A communities: the case of Zhihu", Library and Information Service (图书情报工作), no. 22 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113486177A (en) * 2021-07-12 2021-10-08 贵州电网有限责任公司 Electric power field table column labeling method based on text classification
CN114444481A (en) * 2022-01-27 2022-05-06 四川大学 Sentiment analysis and generation method of news comments
CN114444481B (en) * 2022-01-27 2023-04-07 四川大学 Sentiment analysis and generation method of news comment

Also Published As

Publication number Publication date
CN112966518B (en) 2023-12-19

Similar Documents

Publication Publication Date Title
US20200090539A1 (en) Method and system for intelligent identification and correction of questions
CN108363743B (en) Intelligent problem generation method and device and computer readable storage medium
WO2021031480A1 (en) Text generation method and device
CN108121702B (en) Method and system for evaluating and reading mathematical subjective questions
CN110825867B (en) Similar text recommendation method and device, electronic equipment and storage medium
CN108052504B (en) Structure analysis method and system for mathematic subjective question answer result
CN111326040B (en) Intelligent test and intelligent tutoring system and method for Chinese reading understanding
CN109949637B (en) Automatic answering method and device for objective questions
CN110264792B (en) Intelligent tutoring system for composition of pupils
CN111552773A (en) Method and system for searching key sentence of question or not in reading and understanding task
CN112966518B (en) High-quality answer identification method for large-scale online learning platform
CN112527968A (en) Composition review method and system based on neural network
CN110287298A (en) A kind of automatic question answering answer selection method based on question sentence theme
CN110968708A (en) Method and system for labeling education information resource attributes
CN110765241A (en) Super-outline detection method and device for recommendation questions, electronic equipment and storage medium
CN113011196B (en) Concept-enhanced representation and one-way attention-containing subjective question automatic scoring neural network model
CN113011154A (en) Job duplicate checking method based on deep learning
CN117252259A (en) Deep learning-based natural language understanding method and AI teaching aid system
CN107992482B (en) Protocol method and system for solving steps of mathematic subjective questions
CN116257618A (en) Multi-source intelligent travel recommendation method based on fine granularity emotion analysis
CN111353397B (en) Big data and OCR (optical character recognition) based structured sharing system for Chinese blackboard-writing in online classroom
CN114490930A (en) Cultural relic question-answering system and question-answering method based on knowledge graph
CN113536808A (en) Reading understanding test question difficulty automatic prediction method introducing multiple text relations
ALMUAYQIL et al. Towards an Ontology-Based Fully Integrated System for Student E-Assessment
CN111428623A (en) Chinese blackboard-writing style analysis system based on big data and computer vision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant