CN112966518B - High-quality answer identification method for large-scale online learning platform - Google Patents

High-quality answer identification method for large-scale online learning platform

Info

Publication number
CN112966518B
CN112966518B (application CN202011535456.XA)
Authority
CN
China
Prior art keywords
answer
model
answers
comments
questions
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011535456.XA
Other languages
Chinese (zh)
Other versions
CN112966518A (en)
Inventor
吴宁
陆鑫
梁欢
王雅迪
邹斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN202011535456.XA priority Critical patent/CN112966518B/en
Publication of CN112966518A publication Critical patent/CN112966518A/en
Application granted granted Critical
Publication of CN112966518B publication Critical patent/CN112966518B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00: Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10: Services
    • G06Q50/20: Education
    • G06Q50/205: Education administration or guidance
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Educational Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Educational Administration (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Biophysics (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Economics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Human Resources & Organizations (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

A high-quality answer identification method for large-scale online learning platforms, comprising the following steps: (1) feature vector construction: after preprocessing the acquired data set, the data set is manually annotated and feature vectors are then constructed; (2) model construction: the feature vectors constructed in step (1) are used as input and the manually annotated labels as output to build and train an XGBOOST-based classification model; (3) for a new question and its series of answers and comments, three feature vectors are constructed from the text content of the new question, the text content of the answers and the text content of the comments, and fed into the model trained in step (2), yielding a series of classification results that identify the high-quality answers. The invention uses more information from different perspectives, making full use of the questions, the answers and the answer comments to identify high-quality answers, and the prediction results show improvement on multiple evaluation metrics.

Description

High-quality answer identification method for large-scale online learning platforms

Technical Field

The present invention relates to the technical field of natural language processing in artificial intelligence, and in particular to a high-quality answer identification method for large-scale online learning platforms.

Background Art

With the development of Internet technology, online education has been widely accepted for advantages such as freedom from time and location constraints, and more and more people learn online, driving the rapid growth of online education. Although the Q&A communities provided by large-scale online learning platforms give learners an opportunity to communicate online, the sheer number of learners makes it impossible for teachers to provide personalized, real-time answers to students' questions. Intelligent question-answering technology that simulates a teacher who is always online has therefore become one of the research hotspots in online education, and how to quickly select the best answer to a learner's question has become an important problem to be solved in the field of intelligent question answering.

Both high-quality answer identification and answer ranking essentially aim to help users obtain high-quality answers and improve the user experience. The difference is that answer ranking generally uses the number of likes as the learning target of the model, but the number of likes only reflects answer quality to a limited extent: it is affected by factors such as when the answer was posted, so the answer with the most likes is not necessarily the best one. Answers in community Q&A platforms are mainly sorted by content relevance, answer length, publication time, quality flags, number of comments or number of likes. At present, large-scale online learning platforms only sort answers by number of likes and publication time and do not provide a function for identifying high-quality answers. For large-scale online learning platforms, providing an intelligent Q&A service that simulates an always-available teacher is an important way to improve the user experience, and identifying high-quality answers is a key technology for intelligent question answering.

At present there is relatively little research on high-quality answer identification; the most closely related work is on answer ranking, for which many researchers have proposed a variety of methods, as listed below:

(1) A method for ranking answers on a community question-answering platform (Applicant: University of Science and Technology of China, Application No. 201810186972.2);

(2) An answer ranking method for question-answering systems (Applicant: Peking University Shenzhen Graduate School, Application No. 201810284245.X);

(3) Answer quality determination model training method, answer quality determination method and device (Applicant: Guoxin Youyi Data Co., Ltd., Application No. 201811285467.X);

(4) A method for automatically identifying correct answers in community question-answering forums based on artificial intelligence (Applicant: Beijing University of Posts and Telecommunications, Application No. 201911058818.8).

The above related research mainly uses the number of likes an answer receives as the learning target for ranking answer quality, and focuses on features such as the relevance between the question and the answer, the content attributes of the answer and its time attributes, ignoring the positive contribution that the answer's comment text and the sentiment polarity of that text can make to answer quality evaluation.

Summary of the Invention

To overcome the above shortcomings of the prior art, the purpose of the present invention is to provide a high-quality answer identification method for large-scale online learning platforms that uses more information from different perspectives and makes full use of the questions, the answers and the answer comments to solve the problem of identifying high-quality answers, with prediction results that improve on multiple evaluation metrics.

To achieve the above purpose, the present invention is realized through the following technical solution.

A high-quality answer identification method for large-scale online learning platforms, with the following steps:

(1) Feature vector construction: after preprocessing the acquired data set, the data set is manually annotated, and feature vectors are then constructed from the following three perspectives: the semantic relevance between the question and the answer, the document vector of all comments on each answer, and the sentiment features of the comments. The three kinds of features are obtained as follows: (1) sentence vector representations of the question and the answer are obtained, and the similarity of the two semantic vectors is computed with cosine similarity to give the semantic relevance between the question and the answer; (2) a HAN model is used to represent the answer's comments as a document vector; (3) transfer learning is used to extract the sentiment features of the comments;

(2) Model construction: the feature vectors constructed in step (1) are used as input and the manually annotated labels as output to build and train an XGBOOST-based classification model;

(3) For a new question and its series of answers and comments, the three feature vectors described in step (1) are constructed from the text content of the new question, the text content of the answers and the text content of the comments, and fed into the model trained in step (2), yielding a series of classification results that identify the high-quality answers.

The specific manual annotation operation in step (1) is as follows:

Website information is crawled with a web crawler, and the questions, answers, answer comments and answer like counts are stored and organized. Records whose question, answer or comments are empty are removed, the comments on the same answer under the same question are merged, and the resulting data are stored in the form of questions, answers and merged comments. The crawled data set is then manually annotated using the following rule:

The label Flag of each text pair is assigned as follows: if the answer is wrong, it is considered a poor answer and the text pair is labeled '0'; if the answer is correct but incomplete, it is considered an ordinary answer and the pair is labeled '1'; if the answer is correct and complete, it is considered a high-quality answer and the pair is labeled '2'. After manual annotation, the final data set contains the following: the question, the answer, the merged answer comments, and the label of the text pair;

The semantic relevance feature extraction for the question and the answer is as follows:

(1) The BERT model is used to obtain the sentence vectors of the question and the answer: the question and answer texts are fed into the BERT model to generate sentence vectors, and the output of the penultimate layer of the pre-trained model is taken as the sentence vector of the question and of the answer;

(2) The cosine similarity method is used to compute the similarity between the question and the answer, measuring it as the cosine of the angle between the two vectors.

The document vector feature extraction for the answer comments is as follows:

A hierarchical attention network (HAN) is used to extract features from multiple comments. The HAN model has two parts: one builds sentence vectors from word vectors, the other builds a document vector from the sentence vectors. The comment content in the data set is used as the input of the HAN model and the labels of the text pairs as its output for model training, and the output of the penultimate layer of the model is taken as the document vector of the comments;

The HAN model is a neural network for document classification with two characteristics: first, it has a hierarchical structure, so a document vector can be constructed by first building sentence representations and then aggregating them into a document representation; second, it applies attention at both the word level and the sentence level, which strengthens the representation of important content when building the document representation;

The extraction of the sentiment features of the answer comments is as follows:

Because the collected answer comments carry no sentiment labels and full manual labeling would be very costly, sentiment labels are assigned to a randomly chosen subset of the data, and a pseudo-labeling strategy from semi-supervised learning is then used to address the shortage of training data: a sentiment classification model is first trained on the labeled data to obtain an optimal model, the optimal model is used to assign pseudo-labels to the unlabeled data, and all the data are then used for training to improve the model. Specifically:

(1) Training is performed on the labeled comment data. The BERT model is used to obtain the sentence vector of each comment: the comment text is fed into the BERT model and the output of the penultimate layer of the pre-trained model is taken as the sentence vector. A fully connected network reduces the dimensionality of the sentence vector, the reduced vector is normalized with softmax, and the result is used for sentiment classification, giving a trained sentiment classification model; the sentiment classification model consists of an input layer, the pre-trained BERT model, a fully connected network layer and an output layer;

(2) The sentiment classification model trained in (1) is used to analyze the unlabeled comment text: the unlabeled comments are represented as sentence vectors and the trained model performs sentiment feature analysis to obtain the sentiment features of the comments; the originally labeled data and the data generated by the pseudo-labeling strategy are then combined, and the sentiment analysis model is trained further to obtain the optimal model.

Advantages of the present invention: the invention targets high-quality answer identification for online education platforms and extracts features from three perspectives, namely the relevance between the question and the answer, the document vector of the answer's comments, and the sentiment features of the answer comments. Compared with other methods it uses more information from different perspectives, and the prediction results improve on multiple evaluation metrics.

Brief Description of the Drawings

Figure 1 is a flow chart of an implementation of an embodiment of the present invention.

Figure 2 is a diagram of the question-answer similarity model.

Figure 3 is a diagram of the HAN model.

Figure 4 is a diagram of the sentiment feature extraction model for answer comments.

Detailed Description

The present invention is described in further detail below with reference to the drawings and specific embodiments.

Referring to Figure 1, a high-quality answer identification method for large-scale online learning platforms comprises the following steps:

(1) Feature vector construction: after preprocessing the acquired data set (including removing abnormal records, normalizing formats, etc.), the data set is manually annotated, and feature vectors are then constructed from the following three perspectives: the semantic relevance between the question and the answer, the document vector of all comments on each answer, and the sentiment features of the comments. The three kinds of features are obtained as follows: (1) sentence vector representations of the question and the answer are obtained, and the similarity of the two semantic vectors is computed with cosine similarity to give the semantic relevance between the question and the answer; (2) a HAN model is used to represent the answer's comments as a document vector; (3) transfer learning is used to extract the sentiment features of the comments;

(2) Model construction: the feature vectors constructed in step (1) are used as input and the manually annotated labels as output to build and train an XGBOOST-based classification model;

(3) For a new question and its series of answers and comments, the three feature vectors described in step (1) are constructed from the text content of the new question, the text content of the answers and the text content of the comments, and fed into the model trained in step (2), yielding a series of classification results that identify the high-quality answers.

The specific manual annotation operation in step (1) is as follows:

Website information is crawled with a web crawler, and the questions, answers, answer comments and answer like counts are stored and organized. Records whose question, answer or comments are empty are removed, the comments on the same answer under the same question are merged, and the resulting data are stored in the form of questions, answers and merged comments. The crawled data set is then manually annotated using the following rule:
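Written as a piecewise function (a reconstruction based on the description that follows, with Flag denoting the label assigned to a question-answer text pair), the annotation rule is:

$$
\mathrm{Flag}=\begin{cases}
0, & \text{the answer is wrong (poor answer)}\\
1, & \text{the answer is correct but incomplete (ordinary answer)}\\
2, & \text{the answer is correct and complete (high-quality answer)}
\end{cases}
$$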

In the above rule, Flag denotes the label of the text pair: if the answer is wrong, it is considered a poor answer and the text pair is labeled '0'; if the answer is correct but incomplete, it is considered an ordinary answer and the pair is labeled '1'; if the answer is correct and complete, it is considered a high-quality answer and the pair is labeled '2'. After manual annotation, the final data set contains the following: the question, the answer, the merged answer comments, and the label of the text pair.
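As an illustration of this data preparation step, the following is a minimal pandas sketch; the file names and column names (question, answer, comment, likes, flag) are assumptions for illustration, not part of the patent.

```python
import pandas as pd

# Hypothetical raw export of the crawler: one row per comment on an answer.
raw = pd.read_csv("qa_comments_raw.csv")   # assumed columns: question, answer, comment, likes

# Remove records whose question, answer or comment is empty.
raw = raw.dropna(subset=["question", "answer", "comment"])

# Merge all comments on the same answer under the same question into one document.
merged = (
    raw.groupby(["question", "answer"], as_index=False)
       .agg(comments=("comment", " ".join))
)

# Placeholder column for the manual annotation: 0 = poor, 1 = ordinary, 2 = high-quality answer.
merged["flag"] = pd.NA

merged.to_csv("qa_dataset.csv", index=False)
```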

Referring to Figure 2, the semantic relevance feature extraction for the question and the answer is as follows:

(1) The BERT model is used to obtain the sentence vectors of the question and the answer. Traditional word vector and sentence vector generation methods have a major drawback: the same word is represented by the same vector even when its meaning differs across contexts. BERT is a large pre-trained model that can resolve such polysemy, and using BERT with fine-tuning in a specific domain gives good experimental results. BERT comes in two versions, with 12 and 24 transformer layers; this embodiment uses the 12-layer model. In principle the output of any transformer layer can serve as the sentence vector, but the experimental data indicate that the penultimate layer works best, because the values of the last layer are too close to the training objective while the earlier layers have not yet fully captured the semantic information of the sentence. The question and answer texts are fed into the BERT model to generate sentence vectors, and the output of the penultimate layer of the pre-trained model is taken as the sentence vector of the question and of the answer.

(2) The cosine similarity method is used to compute the similarity between the question and the answer, measuring it as the cosine of the angle between the two vectors.
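A minimal sketch of this similarity computation with the Hugging Face transformers library follows; the checkpoint name bert-base-chinese and mean pooling over the penultimate hidden layer are assumptions, since the text only specifies that the penultimate-layer output of a 12-layer BERT is used.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")   # assumed checkpoint
model = BertModel.from_pretrained("bert-base-chinese", output_hidden_states=True)
model.eval()

def sentence_vector(text: str) -> torch.Tensor:
    """Mean-pool the penultimate transformer layer into one sentence vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    penultimate = outputs.hidden_states[-2]        # (1, seq_len, 768), second-to-last layer
    return penultimate.mean(dim=1).squeeze(0)      # (768,)

def qa_similarity(question: str, answer: str) -> float:
    """Cosine of the angle between the question and answer sentence vectors."""
    q, a = sentence_vector(question), sentence_vector(answer)
    return torch.nn.functional.cosine_similarity(q, a, dim=0).item()
```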

Referring to Figure 3, the document vector feature extraction for the answer comments is as follows:

An answer usually has multiple comments. Existing work extracts features from multiple comments in one of two ways: either the comments are concatenated into a longer document whose features are then extracted, or each comment is modeled separately and the modeled features are aggregated afterwards. The present invention does not need to distinguish between individual comments, so the first approach is adopted: multiple comments are concatenated into a document, which is then processed with a document vector feature extraction method. Specifically:

A hierarchical attention network (HAN) is used to extract features from multiple comments. The HAN model has two parts: one builds sentence vectors from word vectors, the other builds a document vector from the sentence vectors. The comment content in the data set is used as the input of the HAN model and the labels of the text pairs as its output for model training, and the output of the penultimate layer of the model is taken as the document vector of the comments;

The HAN model is a neural network for document classification with two characteristics: first, it has a hierarchical structure, so a document vector can be constructed by first building sentence representations and then aggregating them into a document representation; second, it applies attention at both the word level and the sentence level, which strengthens the representation of important content when building the document representation.
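A compact PyTorch sketch of such a hierarchical attention network follows; the embedding size, GRU encoders, hidden width and three-class output are illustrative assumptions, since the text only fixes the word-level and sentence-level structure with two levels of attention and that the document vector before the final classifier is kept as the comment feature.

```python
import torch
import torch.nn as nn

class Attention(nn.Module):
    """Additive attention that pools a sequence of vectors into one vector."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.context = nn.Linear(dim, 1, bias=False)

    def forward(self, x):                                   # x: (batch, seq, dim)
        scores = self.context(torch.tanh(self.proj(x)))     # (batch, seq, 1)
        alpha = torch.softmax(scores, dim=1)
        return (alpha * x).sum(dim=1)                       # (batch, dim)

class HAN(nn.Module):
    """Word-level GRU + attention builds sentence vectors; sentence-level GRU + attention builds the document vector."""
    def __init__(self, vocab_size, embed_dim=128, hidden=64, num_classes=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.word_gru = nn.GRU(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.word_attn = Attention(2 * hidden)
        self.sent_gru = nn.GRU(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.sent_attn = Attention(2 * hidden)
        self.fc = nn.Linear(2 * hidden, num_classes)         # trained against the text-pair labels

    def forward(self, docs):                                 # docs: (batch, n_sents, n_words) token ids
        b, n_sents, n_words = docs.shape
        words = self.embedding(docs.view(b * n_sents, n_words))
        word_states, _ = self.word_gru(words)
        sent_vecs = self.word_attn(word_states).view(b, n_sents, -1)
        sent_states, _ = self.sent_gru(sent_vecs)
        doc_vec = self.sent_attn(sent_states)                # penultimate output: the comment document vector
        return self.fc(doc_vec), doc_vec
```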

Referring to Figure 4, the extraction of the sentiment features of the answer comments is as follows:

Because the collected answer comments carry no sentiment labels and full manual labeling would be very costly, sentiment labels are assigned to a randomly chosen subset of the data, and a pseudo-labeling strategy from semi-supervised learning is then used to address the shortage of training data: a sentiment classification model is first trained on the labeled data to obtain an optimal model, the optimal model is used to assign pseudo-labels to the unlabeled data, and all the data are then used for training to improve the model. Specifically:

(1) Training is performed on the labeled comment corpus. The BERT model is used to obtain the sentence vector of each comment: the comment text is fed into the BERT model and the output of the penultimate layer of the pre-trained model is taken as the sentence vector. A fully connected network reduces the dimensionality of the sentence vector, the reduced vector is normalized with softmax, and the result is used for sentiment classification, giving a trained sentiment classification model; the sentiment classification model consists of an input layer, the pre-trained BERT model, a fully connected network layer and an output layer;

(2) The sentiment classification model trained in (1) is used to analyze the unlabeled comment text: the unlabeled comments are represented as sentence vectors, and the trained model performs sentiment feature analysis to obtain the sentiment features of the comments.
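The pseudo-labeling workflow described above can be sketched as follows. This is a simplification rather than the model of this embodiment: the BERT-plus-fully-connected sentiment classifier is stood in for by a scikit-learn softmax-regression head applied to precomputed comment sentence vectors (for example those produced by the sentence_vector helper sketched earlier), and the confidence threshold is an added assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pseudo_label_training(x_labeled, y_labeled, x_unlabeled, threshold=0.9):
    """Train on labeled comment vectors, pseudo-label the unlabeled ones, retrain on the union."""
    clf = LogisticRegression(max_iter=1000)          # stand-in for the fully connected + softmax head
    clf.fit(x_labeled, y_labeled)                    # step 1: best model on the labeled data

    proba = clf.predict_proba(x_unlabeled)           # step 2: pseudo-label the unlabeled data
    confident = proba.max(axis=1) >= threshold       # keep only confident pseudo-labels (assumption)
    pseudo_y = clf.classes_[proba.argmax(axis=1)][confident]

    x_all = np.vstack([x_labeled, x_unlabeled[confident]])   # step 3: retrain on labeled + pseudo-labeled
    y_all = np.concatenate([y_labeled, pseudo_y])
    clf.fit(x_all, y_all)
    return clf
```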

In summary, based on the three feature extraction methods, the feature vector finally obtained has the format [question-answer similarity, document vector of the comments, sentiment features of the comments].
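To illustrate how the three feature groups and the XGBOOST classifier of step (2) fit together, the following is a minimal end-to-end sketch; the placeholder data, feature dimensions and hyperparameters are assumptions, and in practice the feature columns would come from the question-answer similarity, HAN document vector and sentiment sketches above.

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Placeholder feature blocks standing in for the three extracted feature groups.
rng = np.random.default_rng(0)
n = 500
sim = rng.random((n, 1))                  # question-answer cosine similarity
comment_doc = rng.normal(size=(n, 128))   # HAN document vector of the merged comments
sentiment = rng.random((n, 3))            # sentiment features of the comments
labels = rng.integers(0, 3, size=n)       # manual labels: 0 poor, 1 ordinary, 2 high quality

# Feature vector format: [question-answer similarity, comment document vector, comment sentiment].
features = np.hstack([sim, comment_doc, sentiment])

x_train, x_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=42)

# Three-class XGBoost classifier; multi-class handling is inferred from the labels.
clf = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
clf.fit(x_train, y_train)

print(classification_report(y_test, clf.predict(x_test)))
```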

Claims (5)

1. A high-quality answer identification method for a large-scale online learning platform, characterized by comprising the following steps:
(1) feature vector construction: after preprocessing the acquired data set, manually labeling the data set, and then constructing feature vectors from the following three perspectives: semantic relevance features of the question and the answers, document vector features of all comments on each answer, and sentiment features of the comments; the features of the three perspectives are obtained in three ways: (1) obtaining sentence vector representations of the question and the answers, and calculating the similarity of the two semantic vectors based on cosine similarity to obtain the semantic relevance between the question and the answer; (2) using a HAN model to represent the answer comments as a document vector; (3) extracting sentiment features of the comments by using transfer learning;
(2) model construction: taking the feature vectors constructed in step (1) as input and the manually annotated labels as output, constructing and training an XGBOOST-based classification model;
(3) for a new question and a series of answers and comments thereof, constructing the three feature vectors described in step (1) from the text content of the new question, the text content of the answers and the text content of the comments, and inputting them into the model trained in step (2) to obtain a series of classification results as the result of identifying high-quality answers.
2. The high-quality answer identification method for a large-scale online learning platform according to claim 1, wherein
the specific manual labeling operation in step (1) is as follows:
crawling website information using a web crawler, storing and organizing the questions, answers, answer comments and answer like counts, removing records whose question, answer or comments are empty, merging the comments on the same answer under the same question, and storing the acquired data in the form of questions, answers and merged comments; manually labeling the crawled data set using the following rule:
the label Flag of a text pair is assigned as follows: if the answer is wrong, it is considered a poor answer and the text pair is labeled '0'; if the answer is correct but incomplete, it is considered an ordinary answer and the text pair is labeled '1'; if the answer is correct and complete, it is considered a high-quality answer and the text pair is labeled '2'; after the manual labeling is completed, the final data set contains the following: the questions, the answers, the merged answer comments, and the labels of the text pairs.
3. The high-quality answer identification method for a large-scale online learning platform according to claim 1, wherein
the semantic relevance feature extraction of the question and the answers is as follows:
(1) obtaining sentence vectors of the question and the answers using a BERT model: the question and answer texts are input into the BERT model to generate sentence vectors, and the output of the penultimate layer of the pre-trained model is taken as the sentence vector of the question and of the answer;
(2) calculating the similarity between the question and the answer using the cosine similarity method, measuring it as the cosine of the angle between the two vectors.
4. The high-quality answer identification method for a large-scale online learning platform according to claim 1, wherein
the document vector feature extraction of the answer comments comprises the following steps:
extracting features from a plurality of comments using a hierarchical attention network (HAN), wherein the HAN model is divided into two parts, one building sentence vectors from word vectors and the other building a document vector from the sentence vectors; the comment content in the data set is used as the input of the HAN model and the labels of the text pairs as its output for model training, and the output of the penultimate layer of the model is taken as the document vector of the comments;
the HAN model is a neural network for document classification with two features: first, it has a hierarchical structure, so a document vector can be constructed by first building sentence representations and then aggregating them into a document representation; second, attention is applied at two levels, the word level and the sentence level, which strengthens the representation of important content when building the document representation.
5. The high-quality answer identification method for a large-scale online learning platform according to claim 1, wherein
the extraction of the sentiment features of the answer comments comprises the following steps:
because the obtained answer comments carry no sentiment labels and the workload of full manual labeling is very large, sentiment labels are assigned to a randomly chosen part of the data, and a pseudo-labeling strategy from semi-supervised learning is then adopted to address the shortage of training data: a sentiment classification model is first trained on the labeled data to obtain an optimal model, the optimal model is used to assign pseudo-labels to the unlabeled data, and all the data are then used for training to improve the model, specifically:
(1) training on the labeled comment data: a BERT model is used to obtain the sentence vector of each comment, the comment text is input into the BERT model and the output of the penultimate layer of the pre-trained model is taken as the sentence vector, a fully connected network reduces the dimensionality of the sentence vector, the reduced sentence vector is normalized with softmax, and the result is used for sentiment classification, giving a trained sentiment classification model; the sentiment classification model consists of an input layer, the pre-trained BERT model, a fully connected network layer and an output layer;
(2) analyzing unlabeled comment text using the sentiment classification model trained in (1): the unlabeled comment text is represented as sentence vectors and the trained model performs sentiment feature analysis to obtain the sentiment features of the comments; the originally labeled data and the data generated based on the pseudo-labeling strategy are then combined, and the sentiment analysis model is trained further to obtain an optimal model.
CN202011535456.XA 2020-12-22 2020-12-22 High-quality answer identification method for large-scale online learning platform Active CN112966518B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011535456.XA CN112966518B (en) 2020-12-22 2020-12-22 High-quality answer identification method for large-scale online learning platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011535456.XA CN112966518B (en) 2020-12-22 2020-12-22 High-quality answer identification method for large-scale online learning platform

Publications (2)

Publication Number Publication Date
CN112966518A CN112966518A (en) 2021-06-15
CN112966518B true CN112966518B (en) 2023-12-19

Family

ID=76271262

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011535456.XA Active CN112966518B (en) 2020-12-22 2020-12-22 High-quality answer identification method for large-scale online learning platform

Country Status (1)

Country Link
CN (1) CN112966518B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113486177A (en) * 2021-07-12 2021-10-08 贵州电网有限责任公司 Electric power field table column labeling method based on text classification
CN114444481B (en) * 2022-01-27 2023-04-07 四川大学 Sentiment analysis and generation method of news comment

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012039686A1 (en) * 2010-09-24 2012-03-29 National University Of Singapore Methods and systems for automated text correction
LU101290B1 (en) * 2018-08-17 2019-11-29 Univ Qilu Technology Method, System, Storage Medium and Electric Device of Medical Automatic Question Answering
CN111259127A (en) * 2020-01-15 2020-06-09 浙江大学 Long text answer selection method based on transfer learning sentence vector
CN112069302A (en) * 2020-09-15 2020-12-11 腾讯科技(深圳)有限公司 Training method of conversation intention recognition model, conversation intention recognition method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Research on evaluating answer quality in Chinese Q&A communities: the case of Zhihu; 王伟; 冀宇强; 王洪伟; 郑丽娟; Library and Information Service (No. 22); full text *
TextCGA: a text classification network based on pre-trained models; 杨玮祺; 杜晔; Modern Computer (No. 12); full text *
Automated quality evaluation of information in online Q&A communities with additional sentiment features; 姜雯; 许鑫; 武高峰; Library and Information Service (No. 04); full text *

Also Published As

Publication number Publication date
CN112966518A (en) 2021-06-15

Similar Documents

Publication Publication Date Title
CN112115238B (en) Question-answering method and system based on BERT and knowledge base
CN113962219B (en) Semantic matching method and system for knowledge retrieval and question answering of power transformers
CN108804654A (en) A kind of collaborative virtual learning environment construction method based on intelligent answer
CN111475629A (en) A knowledge graph construction method and system for mathematics tutoring question answering system
CN110569508A (en) Emotional orientation classification method and system integrating part-of-speech and self-attention mechanism
CN116127095A (en) Question-answering method combining sequence model and knowledge graph
CN110825867B (en) Similar text recommendation method and device, electronic equipment and storage medium
CN113157885B (en) An efficient intelligent question answering system for artificial intelligence domain knowledge
CN110851599A (en) A Chinese composition automatic scoring method and teaching assistance system
CN110134954A (en) A Named Entity Recognition Method Based on Attention Mechanism
CN118484520A (en) Intelligent teaching method and system based on deep knowledge tracking and large language model
CN113011196A (en) Concept-enhanced representation and one-way attention-containing subjective question automatic scoring neural network model
CN110968708A (en) A method and system for labeling attributes of educational information resources
CN111091002B (en) A Recognition Method of Chinese Named Entity
CN112966518B (en) High-quality answer identification method for large-scale online learning platform
CN116401373B (en) A method, storage medium and device for marking test knowledge points
CN116127954A (en) A dictionary-based method for extracting Chinese knowledge concepts for new engineering majors
CN117992614A (en) A method, device, equipment and medium for sentiment classification of Chinese online course reviews
Maji et al. An interpretable deep learning system for automatically scoring request for proposals
Huang et al. PQSCT: Pseudo-siamese BERT for concept tagging with both questions and solutions
CN114547342A (en) College professional intelligent question-answering system and method based on knowledge graph
CN114491023A (en) Text processing method and device, electronic equipment and storage medium
Xia et al. Question-answering using keyword entries in the Oil&Gas domain
CN113361615A (en) Text classification method based on semantic relevance
CN117271776A (en) Difficulty-knowledge points-intelligent multi-label annotation method and system for problem-solving ideas

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant