CN115270713A - A human-machine collaborative corpus construction method and system


Info

Publication number
CN115270713A
Authority
CN
China
Prior art keywords
labeling
corpus
result
manual
annotation
Prior art date
Legal status
Pending
Application number
CN202210869795.4A
Other languages
Chinese (zh)
Inventor
周东岱
董晓晓
顾恒年
李振
邬伟业
Current Assignee
Northeast Normal University
Original Assignee
Northeast Normal University
Priority date
Filing date
Publication date
Application filed by Northeast Normal University
Priority to CN202210869795.4A
Publication of CN115270713A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/10 - Text processing
    • G06F40/103 - Formatting, i.e. changing of presentation of documents
    • G06F40/117 - Tagging; Marking up; Designating a block; Setting of attributes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/10 - Text processing
    • G06F40/12 - Use of codes for handling textual entities
    • G06F40/151 - Transformation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a human-machine collaborative corpus construction method and system, belonging to the field of corpus construction. The method comprises: constructing a corpus automatic labeling model and a data set; manually labeling the data set and building a basic corpus; training the corpus automatic labeling model on the basic corpus; performing machine labeling with the trained model; and computing the labeling consistency between the two labeling modes. If the labeling consistency meets the requirement, construction of the human-machine collaborative corpus is complete. The human-machine collaborative corpus constructed by combining manual labeling with machine labeling shortens the labeling cycle and improves labeling quality.

Description

A Human-Machine Collaborative Corpus Construction Method and System

Technical Field

The present invention relates to the field of corpus construction, and in particular to a human-machine collaborative corpus construction method and system.

Background Art

In corpus construction, the main work is to label the data according to manually formulated entity and relation types and a labeling specification. There are currently three main corpus labeling modes. The first is domain-expert labeling, which is suitable for corpora in professional fields and ensures labeling quality, but its cost is high and its cycle is long. The second is crowdsourced labeling, in which online users label the same data and the final labels are obtained by voting; its cost is low, but it is only suitable for simple labeling tasks that require little domain knowledge. The third is group labeling, which does not depend on experts and yields a high-quality corpus, but places high demands on the labeling group. All three modes are manual. Manual labeling can be carried out with relatively little corpus preparation and can handle quite complex phenomena with high quality, but its cycle is long and it is difficult to implement: in the current big-data era, large-scale labeling covering tens of thousands of records is a major undertaking. Machine labeling, by contrast, is fast, has a short cycle, and can complete corpus construction in a short time, but the quality of purely machine-labeled data is poor.

Summary of the Invention

The purpose of the present invention is to provide a human-machine collaborative corpus construction method and system, so as to solve the prior-art problems that manual labeling has a long cycle and machine labeling has poor quality.

To achieve the above object, the present invention provides the following scheme:

A human-machine collaborative corpus construction method, comprising:

constructing a corpus automatic labeling model and a data set, where the texts in the data set include subject textbooks, teaching designs, and tutorial plans;

manually labeling the data set to obtain manual labeling results and build a basic corpus, where manual labeling includes manual pre-labeling and manual formal labeling, and the manual labeling results include manual formal labeling results and manual pre-labeling results;

training the corpus automatic labeling model on the basic corpus to obtain a trained corpus automatic labeling model;

performing machine labeling on the text to be labeled using the trained corpus automatic labeling model to obtain machine labeling results, where the text to be labeled is the text selected during the manual formal labeling;

computing the labeling consistency between the machine labeling results and the manual formal labeling results to obtain a first labeling consistency;

judging whether the first labeling consistency reaches a first labeling consistency threshold to obtain a judgment result;

if the judgment result is that the first labeling consistency reaches the first labeling consistency threshold, outputting the trained corpus automatic labeling model and using it as the corpus;

if the judgment result is that the first labeling consistency does not reach the first labeling consistency threshold, expanding the basic corpus and returning to the step of "training the corpus automatic labeling model on the basic corpus to obtain a trained corpus automatic labeling model".
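The train, machine-label, compare, and expand loop described in the steps above can be sketched as follows. The model, the corpus expansion, and the consistency metric are stand-ins (the patent does not fix a concrete metric here); in this sketch, consistency is taken as the F1 overlap between the two sets of labeled spans, purely as an illustration.

```python
def consistency(machine_labels, human_labels):
    """F1 overlap between two sets of (span, tag) annotations (assumed metric)."""
    m, h = set(machine_labels), set(human_labels)
    if not m or not h:
        return 0.0
    p = len(m & h) / len(m)          # precision of machine vs. human labels
    r = len(m & h) / len(h)          # recall of machine vs. human labels
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

def build_corpus(base_corpus, human_labels, train, machine_label,
                 expand, threshold=0.9, max_rounds=10):
    """Train on the basic corpus, machine-label, and expand until consistent."""
    for _ in range(max_rounds):
        model = train(base_corpus)             # train the automatic labeling model
        machine_labels = machine_label(model)  # label the formally labeled texts
        if consistency(machine_labels, human_labels) >= threshold:
            return model                       # consistency reached: model accepted
        base_corpus = expand(base_corpus)      # add more manually labeled data
    return None
```

`train`, `machine_label`, and `expand` are caller-supplied hooks, so the same loop applies whether the underlying model is the BERT-wwm-BiLSTM-CRF recognizer or any other labeler.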

Optionally, constructing the corpus automatic labeling model and the data set is followed by:

preprocessing the data set to obtain a processed data set.

Optionally, preprocessing the data set to obtain a processed data set specifically includes:

converting the format of the texts to obtain texts in a preset format;

removing irregular text from the texts in the preset format to obtain the processed data set, where the irregular text includes tables, pictures, school names, and teacher names.

Optionally, manually labeling the data set to obtain the basic corpus specifically includes:

formulating a labeling specification according to the dimensions of knowledge and the classification methods of other fields, combined with the characteristics of junior high school mathematics, where the other fields include the medical field and the military field;

performing lexical and syntactic analysis on the processed data set to construct event triples;

manually pre-labeling a first preset quantity of texts twice according to the event triples to obtain pre-labeling results, and updating the labeling specification according to the pre-labeling results, where the pre-labeling results include a first pre-labeling result and a second pre-labeling result;

manually formally labeling a second preset quantity of texts twice according to the event triples to obtain manual formal labeling results and building the basic corpus, where the manual formal labeling results include a first formal labeling result and a second formal labeling result, and the second preset quantity of texts is different from the first preset quantity of texts.

Optionally, manually pre-labeling the first preset quantity of texts twice according to the event triples to obtain pre-labeling results and updating the labeling specification according to the pre-labeling results specifically includes:

manually pre-labeling the first preset quantity of texts twice according to the event triples to obtain the pre-labeling results;

computing a second labeling consistency between the first pre-labeling result and the second pre-labeling result;

judging whether a preset number of iterations has been reached to obtain a second judgment result;

if the second judgment result is that the preset number of iterations has not been reached, comparing the first pre-labeling result with the second pre-labeling result to obtain a comparison result;

updating the labeling specification according to the comparison result to obtain a new labeling specification, and returning to "manually pre-labeling the first preset quantity of texts twice according to the event triples to obtain the pre-labeling results";

if the second judgment result is that the preset number of iterations has been reached, ending the manual pre-labeling and outputting the new labeling specification.

Optionally, manually formally labeling the second preset quantity of texts twice according to the event triples to obtain the manual formal labeling results and building the basic corpus specifically includes:

manually formally labeling the second preset quantity of texts twice according to the event triples to obtain the manual formal labeling results;

computing a third labeling consistency between the first formal labeling result and the second formal labeling result;

judging whether the third labeling consistency reaches a second labeling consistency threshold to obtain a third judgment result;

if the third judgment result is that the third labeling consistency reaches the second labeling consistency threshold, completing the manual formal labeling and outputting the manual formal labeling results;

building the basic corpus according to the manual formal labeling results;

if the third judgment result is that the third labeling consistency does not reach the second labeling consistency threshold, returning to "manually formally labeling the second preset quantity of texts twice according to the event triples to obtain the manual formal labeling results".
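The labeling consistency between the two annotation passes above can be computed with any inter-annotator agreement measure; the patent does not name one. Cohen's kappa is a common choice and is used here purely as an illustrative sketch.

```python
# Cohen's kappa between two annotators' per-token label sequences: agreement
# corrected for the agreement expected by chance. An assumed metric, not one
# the patent specifies.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Kappa for two equal-length label sequences from two annotation passes."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[l] * cb[l] for l in set(labels_a) | set(labels_b)) / (n * n)
    if expected == 1.0:
        return 1.0        # both passes used a single identical label everywhere
    return (observed - expected) / (1 - expected)
```

A threshold on this value (the "second labeling consistency threshold" of the claims) then decides whether the formal labeling pass is accepted or repeated.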

A human-machine collaborative corpus construction system, comprising:

a model and data set construction module, configured to construct a corpus automatic labeling model and a data set, where the texts in the data set include subject textbooks, teaching designs, and tutorial plans;

a manual labeling module, configured to manually label the data set, obtain manual labeling results, and build a basic corpus, where manual labeling includes manual pre-labeling and manual formal labeling, and the manual labeling results include manual formal labeling results and manual pre-labeling results;

a training module, configured to train the corpus automatic labeling model on the basic corpus to obtain a trained corpus automatic labeling model;

a machine labeling module, configured to perform machine labeling on the text to be labeled using the trained corpus automatic labeling model to obtain machine labeling results, where the text to be labeled is the text selected during the manual formal labeling;

a calculation module, configured to compute the labeling consistency between the machine labeling results and the manual formal labeling results to obtain a first labeling consistency;

a judgment module, configured to judge whether the first labeling consistency reaches a first labeling consistency threshold to obtain a judgment result;

a first execution module, configured to, if the judgment result is that the first labeling consistency reaches the first labeling consistency threshold, output the trained corpus automatic labeling model and use it as the corpus;

a second execution module, configured to, if the judgment result is that the first labeling consistency does not reach the first labeling consistency threshold, expand the basic corpus and return to the step of "training the corpus automatic labeling model on the basic corpus to obtain a trained corpus automatic labeling model".

Optionally, the system further includes:

a data processing module, configured to preprocess the data set to obtain a processed data set.

Optionally, the data processing module includes:

a format conversion unit, configured to convert the format of the texts to obtain texts in a preset format;

a data removal unit, configured to remove irregular text from the texts in the preset format to obtain the processed data set, where the irregular text includes tables, pictures, school names, and teacher names.

Optionally, the manual labeling module includes:

a labeling specification formulation unit, configured to formulate a labeling specification according to the dimensions of knowledge and the classification methods of other fields, combined with the characteristics of junior high school mathematics, where the other fields include the medical field and the military field;

a data analysis unit, configured to perform lexical and syntactic analysis on the processed data set to construct event triples;

a pre-labeling unit, configured to manually pre-label a first preset quantity of texts twice according to the event triples to obtain pre-labeling results, and update the labeling specification according to the pre-labeling results, where the pre-labeling results include a first pre-labeling result and a second pre-labeling result;

a formal labeling unit, configured to manually formally label a second preset quantity of texts twice according to the event triples to obtain manual formal labeling results and build the basic corpus, where the manual formal labeling results include a first formal labeling result and a second formal labeling result, and the second preset quantity of texts is different from the first preset quantity of texts.

According to the specific embodiments provided herein, the present invention discloses the following technical effects:

In the human-machine collaborative corpus construction method and system of the present invention, a corpus automatic labeling model and a data set are constructed; the data set is manually labeled to build a basic corpus; the automatic labeling model is trained on the basic corpus; machine labeling is performed with the trained model; and the labeling consistency between the two labeling modes is computed. If the labeling consistency meets the requirement, construction of the human-machine collaborative corpus is complete. The human-machine collaborative corpus constructed by combining manual labeling with machine labeling shortens the labeling cycle and improves labeling quality.

Brief Description of the Drawings

To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings required in the embodiments are briefly introduced below. Obviously, the drawings described below are merely some embodiments of the present invention; a person of ordinary skill in the art may derive other drawings from them without creative effort.

Fig. 1 is a flowchart of the human-machine collaborative corpus construction method provided by the present invention;

Fig. 2 is a structural diagram of the named entity recognition model based on a pre-trained language model according to the present invention;

Fig. 3 is a structural diagram of the relation extraction model based on a graph convolutional neural network according to the present invention;

Fig. 4 is a flowchart of the manual pre-labeling of the present invention;

Fig. 5 is a flowchart of the manual formal labeling of the present invention;

Fig. 6 shows the recognition results of the automatic labeling model of the present invention;

Fig. 7 is a flowchart of the human-machine collaborative labeling of the present invention;

Fig. 8 is an overall framework diagram of the human-machine collaborative corpus construction method of the present invention;

Fig. 9 is a structural diagram of the human-machine collaborative corpus construction system provided by the present invention.

Detailed Description of the Embodiments

The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

The purpose of the present invention is to provide a human-machine collaborative corpus construction method and system, so as to solve the prior-art problems that manual labeling has a long cycle and machine labeling has poor quality. The method mainly includes two stages: manual processing and machine processing. The manual processing stage includes data collection and processing, labeling specification formulation, and manual labeling; the machine processing stage includes lexical and syntactic analysis and machine labeling.

The present invention adds machine processing on the basis of manual labeling and studies a method for constructing a corpus through human-machine collaboration. Taking junior high school mathematics as an example, entity and relation categories were formulated with reference to the knowledge-classification dimensions in Bloom's Taxonomy of Educational Objectives and the classification methods of other fields (such as the medical and military fields), combined with the characteristics of junior high school mathematics knowledge, and a junior high school mathematics corpus was constructed, which provides practical support for the subsequent construction and application of educational knowledge graphs.

To make the above objects, features, and advantages of the present invention more comprehensible, the present invention is described in further detail below with reference to the drawings and specific embodiments.

Fig. 1 is a flowchart of the human-machine collaborative corpus construction method provided by the present invention. As shown in Fig. 1, the method includes:

Step 101: construct a corpus automatic labeling model and a data set. The texts in the data set include subject textbooks, teaching designs, and tutorial plans.

As an optional implementation, step 101 is followed by:

preprocessing the data set to obtain a processed data set.

As an optional implementation, preprocessing the data set to obtain a processed data set specifically includes:

converting the format of the texts to obtain texts in a preset format;

removing irregular text from the texts in the preset format to obtain the processed data set, where the irregular text includes tables, pictures, school names, and teacher names.

Human-machine collaborative corpus labeling mainly includes two parts: manual labeling and automatic model labeling. Manual labeling is performed strictly according to the labeling process, while automatic labeling is realized by the corpus automatic labeling model. In the human-machine collaborative construction method for the subject corpus, manual labeling and machine labeling are interleaved and coordinated; the overall framework is shown in Fig. 8.

In practical applications, the corpus automatic labeling process automatically identifies the entities and entity relations in the corpus data through deep-learning models (the corpus automatic labeling model). The corpus automatic labeling model is divided into a named entity recognition model based on a pre-trained language model (BERT-wwm-BiLSTM-CRF) and a relation extraction model based on a graph convolutional neural network (BERT-GCN).

On the basis of this model framework, the present invention takes mathematics corpus data from basic education as an example and, combining multi-source educational texts such as teaching designs, tutorial plans, and test booklets, prepares training corpora for the corpus automatic labeling model according to the characteristics of educational corpora and trains the model. This is divided into the following steps: data collection and processing, labeling specification formulation, lexical and syntactic analysis, and model training.

The present invention mainly targets basic-education mathematics at the junior high school stage, using textbooks published by the People's Education Press. A total of 688 teaching designs plus electronic textbooks were collected, covering every section of all 29 chapters and 91 sections of the People's Education Press junior high school mathematics textbooks, ensuring full coverage of junior high school mathematics entity relations. The collected junior high school mathematics textbooks, teaching designs, tutorial plans, and other data were then normalized as text.

The data collected from websites is mainly stored in pdf, doc, and docx formats. The teaching designs contain private information such as school names and teacher names, some content is presented as tables, and the electronic textbooks contain pictures and exercises. None of this is conducive to the subsequent labeling work, so the raw data must be processed. First, the collected data is format-converted: texts in different formats are uniformly encoded in UTF-8 and stored in txt format. Second, the tables, pictures, exercises, school names, and teacher names are removed from the texts to facilitate subsequent work.
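The cleaning step described above can be sketched as follows: unify the encoding to UTF-8 plain text and strip school and teacher names. The regular expressions are illustrative guesses, not the authors' actual patterns, and the pdf/doc/docx extraction itself would require a separate library such as pdfminer or python-docx.

```python
import re

def clean_text(raw: str) -> str:
    """Remove school names, teacher-name fields, and leftover blank lines."""
    text = re.sub(r"[^\s]*(中学|学校)", "", raw)          # school names (assumed pattern)
    text = re.sub(r"(教师|授课人)[::]\s*\S+", "", text)   # teacher-name fields
    lines = [ln.strip() for ln in text.splitlines()]
    return "\n".join(ln for ln in lines if ln)            # drop now-empty lines

def save_as_txt(path: str, raw: str) -> None:
    """Store the cleaned text as UTF-8 .txt, as in the data-processing step."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(clean_text(raw))
```

Tables, pictures, and exercises would need format-aware handling during extraction; only the plain-text stage is shown here.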

Named Entity Recognition (NER) refers to identifying entities with specific meaning in text, mainly including names of people, places, and institutions, proper nouns, and so on. The named entity recognition model BERT-wwm-BiLSTM-CRF based on a pre-trained language model used in the present invention mainly includes three modules; the model diagram is shown in Fig. 2.

First, the junior high school mathematics text passes through the BERT-wwm pre-trained language model, which extracts rich text features to obtain word vectors. The resulting word vectors are then input into the BiLSTM module to fully extract the contextual information of the mathematics sentences. Finally, the CRF module decodes the output, which fuses sentence-level contextual semantic features, to obtain the globally optimal label sequence, thereby completing named entity recognition of junior high school mathematics texts.

The input of the BERT-wwm model is junior high school mathematics text, represented as token embeddings, segment embeddings, and position embeddings. The model adopts the Transformer encoder structure, which can represent the specific semantics of a word according to the semantic relations of its context. The Transformer models junior high school mathematics sentences with an attention mechanism, specifically self-attention: by learning three parameter matrices that transform the same embedded features, the attention score of the sentence is computed as shown in formula (1).

Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V        (1)

where Q, K, and V are the input word-vector matrices and d_k is the dimension. Q·K^T computes the relations between the input word vectors; after scaling down by d_k, softmax normalization yields the weight representation. Finally, the output is a weighted sum of all the word vectors in the sentence, so the representation of each word contains information about the other words in the sentence; this information is context-dependent and more global than in traditional word-embedding models.
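Formula (1) can be written out directly with NumPy as scaled dot-product self-attention. In the model, Q, K, and V would come from three learned projections of the token embeddings; the toy matrices in the test are illustrative only.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, normalized row-wise over the sequence."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # relations between word vectors
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax normalization
    return weights @ V                                 # weighted sum of value vectors
```

Each output row is a convex combination of the value vectors, which is exactly the "weighted sum of all word vectors" described above.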

The BiLSTM receives the character embeddings generated by the BERT-wwm pre-trained model and predicts the probability of each character's label. The bidirectional LSTM structure uses contextual information more effectively: each junior high school mathematics sentence is processed in forward and in reverse order, and the hidden-layer representation is obtained by concatenating the two directions' vectors. Let the hidden state at the previous time step be h_{t-1} and the input at the current step be x_t; then the following formulas hold.

$$f_t = \sigma(w_f \cdot [h_{t-1}, x_t] + b_f)$$

$$i_t = \sigma(w_i \cdot [h_{t-1}, x_t] + b_i)$$

$$\tilde{C}_t = \tanh(w_c \cdot [h_{t-1}, x_t] + b_c)$$

$$S_t = f_t \cdot S_{t-1} + i_t \cdot \tilde{C}_t$$

$$O_t = \sigma(w_o \cdot [h_{t-1}, x_t] + b_o)$$

$$h_t = O_t \cdot \tanh(S_t)$$

Here σ is the sigmoid activation function, f_t the forget gate, i_t the input gate, O_t the output gate, x_t the current input, and tanh the hyperbolic tangent function. b_f, b_i, b_o, and b_c are the bias vectors of the forget gate, input gate, output gate, and memory cell, respectively, and w_f, w_i, w_o, and w_c are the corresponding weight matrices. $\tilde{C}_t$ is the candidate state at the current time step, a new candidate vector created by the tanh layer; S_t is the new cell state updated from S_{t-1}; and h_t is the final output at time t.
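The gate equations above can be bundled into a single LSTM time step (a minimal NumPy sketch under invented toy dimensions; the dictionaries w and b stand in for the per-gate weight matrices and bias vectors named in the text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, S_prev, w, b):
    """One LSTM step following the formulas above."""
    concat = np.concatenate([h_prev, x_t])     # [h_{t-1}, x_t]
    f_t = sigmoid(w["f"] @ concat + b["f"])    # forget gate
    i_t = sigmoid(w["i"] @ concat + b["i"])    # input gate
    C_tilde = np.tanh(w["c"] @ concat + b["c"])  # candidate state (tanh layer)
    S_t = f_t * S_prev + i_t * C_tilde         # new cell state from S_{t-1}
    O_t = sigmoid(w["o"] @ concat + b["o"])    # output gate
    h_t = O_t * np.tanh(S_t)                   # final output at time t
    return h_t, S_t

# Toy dimensions: input size 3, hidden size 2
rng = np.random.default_rng(1)
w = {k: rng.standard_normal((2, 5)) for k in "fico"}
b = {k: np.zeros(2) for k in "fico"}
h, S = lstm_step(rng.standard_normal(3), np.zeros(2), np.zeros(2), w, b)
print(h.shape)  # (2,)
```

A BiLSTM runs this step once over the sentence in order and once in reverse order, then concatenates the two hidden states per position.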

The input to the CRF layer consists of the word vectors produced by the BiLSTM layer, denoted X = (x_1, x_2, …, x_n), where x_n is the n-th word. The predicted sequence is Y = (y_1, y_2, …, y_n), where y_n is the predicted label of the n-th word. The score function is shown in formula (3).

$$s(X, Y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i} \tag{3}$$

The conditional probability is computed with the softmax function, as in formula (4).

$$P(Y \mid X) = \frac{e^{s(X, Y)}}{\sum_{\tilde{Y}} e^{s(X, \tilde{Y})}} \tag{4}$$

Here A_{y_i, y_{i+1}} is the transition probability from label y_i to label y_{i+1}, taken from the transition matrix learned by the CRF. P_{i, y_i} is the unnormalized probability (also called the state score) of word x_i having label y_i, obtained from the BiLSTM output. $\tilde{Y}$ ranges over all label sequences in the full set other than the currently evaluated Y.
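Formulas (3) and (4) can be checked on a toy example: the score of a label sequence is the sum of the BiLSTM state scores and the CRF transition scores, and the softmax normalizes over every possible label sequence (a brute-force sketch with invented toy matrices, feasible only at this tiny size):

```python
import numpy as np
from itertools import product

def crf_score(P, A, y):
    """Score s(X, Y) from formula (3): emission scores P[i, y_i]
    plus transition scores A[y_i, y_{i+1}]."""
    emit = sum(P[i, yi] for i, yi in enumerate(y))
    trans = sum(A[y[i], y[i + 1]] for i in range(len(y) - 1))
    return emit + trans

def crf_prob(P, A, y):
    """Conditional probability from formula (4), normalized over all
    label sequences by brute force."""
    n, k = P.shape
    Z = sum(np.exp(crf_score(P, A, seq)) for seq in product(range(k), repeat=n))
    return np.exp(crf_score(P, A, y)) / Z

P = np.array([[2.0, 0.5], [0.3, 1.5]])   # 2 words, 2 labels (toy emissions)
A = np.array([[0.1, 0.4], [0.2, 0.1]])   # toy transition matrix
p = crf_prob(P, A, (0, 1))
print(round(p, 3))
```

In practice the normalizer Z is computed with the forward algorithm rather than by enumerating sequences; the enumeration here is only to make formulas (3) and (4) concrete.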

Relation extraction (RE) finds the relations that hold between a subject and an object in unstructured or semi-structured data (given two entities that stand in a relation, one is called the subject and the other the object) and represents them as entity-relation triples, i.e., (subject, relation, object). The graph-convolutional relation extraction model used in the present invention, BERT-GCN, consists of three main parts; the model diagram is shown in FIG. 3.

First, the text is fed into the pre-trained BERT model to extract context-rich semantic features and obtain the corresponding word-vector representations. At the same time, dependency parsing is performed on the text to build a dependency graph, from which the adjacency matrix storing the sentence's syntactic structure is obtained. Next, the word vectors and the adjacency matrix are passed to the GCN for further feature extraction. Finally, a softmax layer classifies the feature vectors to predict the probability of each relation.

First, the text is fed into the pre-trained BERT model to extract rich text features and obtain the corresponding word vectors. Second, dependency parsing of the input junior high school mathematics sentences produces a dependency graph, from which the graph neural network can learn their syntactic information. The dependency graph is denoted G = (V, E), where V = {v_1, v_2, …, v_n} is the set of vertices and E = {e_1, e_2, …, e_m} the set of edges. The adjacency matrix A of the dependency graph is defined in formula (5): A_ij = 1 indicates that an edge exists between vertices v_i and v_j, and A_ij = 0 that none does.

$$A_{ij} = \begin{cases} 1, & (v_i, v_j) \in E \\ 0, & \text{otherwise} \end{cases} \tag{5}$$

Following the adjacency-matrix storage scheme for undirected graphs, the adjacency matrix obtained from dependency parsing is fed into the graph neural network for further feature extraction. The hidden state of node i is computed as in formula (6).

$$h_i = f\!\left(\sum_{j=1}^{n} A_{ij}\left(W^{(j)} h_j + b^{(j)}\right)\right) \tag{6}$$

Here A_ij is the entry of the adjacency matrix for nodes i and j, W^(j) is the weight matrix of node j, h_j the hidden state of node j, b^(j) the bias of node j, and f a nonlinear function; the present invention uses the ReLU nonlinearity.
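Building the adjacency matrix of formula (5) from a dependency parse is straightforward (a sketch over a hypothetical four-word parse; edges are stored symmetrically because the graph is undirected):

```python
import numpy as np

def dependency_adjacency(n, edges):
    """Adjacency matrix of formula (5): A[i][j] = 1 if vertices
    v_i and v_j share an edge, else 0."""
    A = np.zeros((n, n), dtype=int)
    for i, j in edges:
        A[i, j] = A[j, i] = 1   # undirected: store both directions
    return A

# Hypothetical parse of a 4-word sentence: word 1 is the head of words 0, 2, 3
A = dependency_adjacency(4, [(1, 0), (1, 2), (1, 3)])
print(A)
```

The resulting matrix is symmetric with a zero diagonal, ready to be passed to the graph neural network together with the word vectors.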

For the l-th layer of the graph neural network, the input consists of the output of layer l-1 and the adjacency matrix. After the l-th layer's operation, the hidden state of node i is given by formula (7).

$$h_i^{(l)} = f\!\left(\sum_{j=1}^{n} A_{ij} W^{(l)} h_j^{(l-1)} + b^{(l)}\right) \tag{7}$$

Here W^(l) is the weight matrix of layer l, h_j^(l-1) is the hidden state of node j after layer l-1, and b^(l) is the bias of layer l.

Applying the graph convolution of formula (7) directly can produce node representations at very different scales, because node degrees vary widely; when the node information of layer l-1 is propagated to layer l, high-degree nodes can then dominate the sentence representation regardless of the information the nodes carry. The data are therefore normalized before being passed to the nonlinear layer, and a self-loop is added to every graph node, as shown in formula (8).

$$h_i^{(l)} = f\!\left(\frac{1}{d_i}\sum_{j=1}^{n} \tilde{A}_{ij} W^{(l)} h_j^{(l-1)} + b^{(l)}\right) \tag{8}$$

where $\tilde{A} = A + I$, I is the identity matrix, and $d_i = \sum_{j} \tilde{A}_{ij}$ is the degree of node i.
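The normalized layer of formula (8) can be sketched as follows (a toy two-node graph with invented features; adding the identity matrix supplies the self-loops and each row is divided by its degree d_i before the ReLU):

```python
import numpy as np

def gcn_layer(A, H, W, b):
    """One graph-convolution layer in the normalized form of formula (8)."""
    A_tilde = A + np.eye(A.shape[0])          # add self-loops: A~ = A + I
    d = A_tilde.sum(axis=1, keepdims=True)    # node degrees d_i
    return np.maximum(0.0, (A_tilde @ H @ W) / d + b)   # ReLU nonlinearity

A = np.array([[0, 1], [1, 0]], dtype=float)   # two connected nodes
H = np.eye(2)                                 # toy node features
W = np.eye(2)
out = gcn_layer(A, H, W, np.zeros(2))
print(out)  # [[0.5 0.5] [0.5 0.5]]
```

With both nodes connected and self-loops added, each node averages its own and its neighbor's features, which is exactly the degree normalization the text motivates.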

The relation extraction task of the present invention can be treated as multi-class classification: a softmax classifier produces the probability distribution over relations and predicts the relation label, and the cross-entropy function is used as the model's loss function during training.

Step 102: manually annotate the data set to obtain manual annotation results and build the basic corpus. Manual annotation comprises manual pre-annotation and manual formal annotation; the manual annotation results comprise the manual formal annotation results and the manual pre-annotation results.

As an optional implementation, step 102 specifically comprises:

Step 1021: formulate the annotation specification according to the dimensions of knowledge and the classification methods of other fields, combined with the characteristics of junior high school mathematics. Said other fields include the medical field and the military field.

In practical application, the present invention draws on Bloom's classification of knowledge in the Taxonomy of Educational Objectives and on classification methods from other fields (such as medicine and the military), combined with the characteristics of junior high school mathematics knowledge, to define three entity categories: knowledge point, knowledge unit, and knowledge cluster. A knowledge point is the smallest, indivisible piece of knowledge; a knowledge unit contains several indivisible knowledge points; a knowledge cluster is the general term for knowledge of a given type, as shown in Table 1.

Table 1. Entity category table

[Table 1 appears as an image in the original publication; it lists the three entity categories (knowledge point, knowledge unit, knowledge cluster) with their definitions.]

Five relation types are defined: predecessor-successor, containment, parallel, parent-child, and sibling. A predecessor-successor relation means that a dependency holds between entities A and B, i.e., one precedes and the other follows. A containment relation means a whole-part relation holds between entities A and B. A parallel relation means entities A and B stand side by side. A parent-child relation means entities A and B share certain attributes. A sibling relation means entities A and B have the same parent class. Examples are shown in Table 2.

Table 2. Relation type table

Predecessor-successor: (polynomial, equation, predecessor-successor)
Containment: (triangle, edge, containment)
Parallel: (independent variable, dependent variable, parallel)
Parent-child: (function, linear function, parent-child)
Sibling: (linear function, quadratic function, sibling)

Before formal annotation begins, the annotators first familiarize themselves with the overall teaching and knowledge systems of junior high school mathematics, and the annotation specification is then completed. First, junior high school mathematics entities can be nested, so entity boundaries may be impossible to confirm during annotation and recognition. For example, the text may contain the entity 二元一次方程组 ("system of linear equations in two variables"), which itself contains the entities 二元一次方程 ("linear equation in two variables") and 方程 ("equation"). Entity boundaries should therefore be determined from context during annotation and recognition.

Second, entities in junior high school mathematics often end with a specific character: 实数 (real number), 整数 (integer), 复数 (complex number), and 有理数 (rational number) all end with the character 数 ("number"). This feature can be used to improve annotation accuracy during entity recognition. When annotating entities, the principle of the maximum entity span is followed. For relations between entities, if a predecessor-successor relation holds between two entities, the predecessor entity must come first and the successor entity second: if "equation" and "function" stand in a predecessor-successor relation, the annotated entity 1 is "equation" and entity 2 is "function". Finally, annotated entities must not overlap or nest, and the annotated content must not contain punctuation marks.

Step 1022: perform lexical and syntactic analysis on the processed data set and construct event triples. In practice, lexical and syntactic processing first segments the text into Chinese words with a mathematical lexicon loaded; syntactic analysis is then performed and combined with semantic role analysis to construct event triples.

In practice, Chinese word segmentation splits a sentence into individual words. Cross-domain segmentation must handle a large number of domain-specific terms, and educational terminology is strongly discipline-specific and highly specialized, so existing segmentation tools cannot correctly segment the technical vocabulary of basic school subjects, which strongly affects subsequent experiments. For example, the segmentation tool used in the present invention, jieba, splits 一元一次方程 ("linear equation in one variable") into 一元 and 一次方程, and 配方法 ("completing the square") into 配 and 方法.

To address the poor performance of Chinese word segmentation in the education domain, the present invention adds a mathematical lexicon to improve segmentation accuracy. A comprehensive mathematics vocabulary was downloaded from the Sogou input method lexicon collection, but its scel file format cannot be used directly by jieba, so the file was converted from scel to a UTF-8 encoded txt format. The converted lexicon contains 135,593 entries covering mathematical vocabulary at all school stages, one entry per line. After loading this lexicon into jieba and segmenting again, the words mentioned above (配方法, 一元一次方程, and so on) are segmented correctly. This shows that for domain-restricted Chinese word segmentation, introducing a domain lexicon improves segmentation accuracy and benefits the subsequent work.
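The effect of the domain lexicon can be illustrated without jieba itself: the sketch below uses a toy greedy forward longest-match segmenter, which keeps 配方法 and 一元一次方程 whole once they appear in the lexicon (in practice the converted txt file would be loaded through jieba's user-dictionary mechanism; the function and tiny lexicon here are invented for illustration):

```python
def longest_match_segment(text, lexicon, max_len=8):
    """Greedy forward longest-match segmentation over a domain lexicon,
    a toy stand-in for a segmenter with a user dictionary loaded."""
    out, i = [], 0
    while i < len(text):
        for L in range(min(max_len, len(text) - i), 0, -1):
            if L == 1 or text[i:i + L] in lexicon:
                out.append(text[i:i + L])   # longest known term wins
                i += L
                break
    return out

math_lexicon = {"一元一次方程", "配方法", "方程"}
print(longest_match_segment("用配方法解一元一次方程", math_lexicon))
# ['用', '配方法', '解', '一元一次方程']
```

Without the lexicon every domain term would fall through to single characters or generic words, which is exactly the failure mode described above.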

In the education domain, instructional design texts contain special syntactic structures such as "on the basis of ..., learn ..." and "understand (know, master) ...", from which predecessor-successor relations between entities can be read directly. Therefore, on the basis of the original treebank, some of these syntactic structures specific to instructional design are added to support Chinese syntactic analysis of the junior high school mathematics corpus. Annotation analysis is based mainly on the Chinese Dependency Treebank of Harbin Institute of Technology, whose data come from the People's Daily corpus; the treebank contains 8,000 sentences in total, with 14 types of dependency relations.

Taking the sentence "Students have already learned the concept of the square root, which is the basis for learning cube roots and real numbers" as an example, the meaning of each column of the dependency annotation is given in Table 3 below, and the annotation result in Table 4. The focus is on the sentence's core (head) relation as well as its subject-verb and verb-object relations.

Table 3. Symbol meaning table

ID: token index
FORM: word or punctuation
LEMMA: in Chinese, the same as FORM
CPOSTAG: part of speech (coarse-grained)
POSTAG: part of speech (fine-grained)
FEATS: (left empty)
HEAD: index of the head word
DEPREL: dependency relation to the head word

Table 4. Annotation result table

[Table 4 appears as an image in the original publication.]

The treebank augmented with the special instructional design syntactic structures is used to train a CRF model, and the trained model is loaded into the pyLTP parser for syntactic analysis of the junior high school mathematics corpus. In addition, semantic role analysis is added on top of the dependency parse, and simple event triples are finally constructed to assist manual relation annotation. First, the sentence is checked for a subject-verb-object structure; if one exists, the triple is extracted in subject-verb-object form, otherwise the dependency parse is used instead. For each word, a dependency child node is generated, together with a parent-child dependency record holding the word's part of speech, its parent's part of speech, and their relation. Each word is then visited in turn to extract verb-object, postposed-attributive verb-object, and preposition-object patterns in subject-verb-complement form. The construction of simple event triples is completed on the basis of this extraction process.
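The subject-verb-object branch of the extraction described above can be sketched as follows (a toy function over a hypothetical parse; SBV, VOB, and HED follow the HIT-LTP dependency label set, and the heads/deprels arrays are invented for illustration):

```python
def extract_svo(tokens, heads, deprels):
    """Toy subject-verb-object triple extraction from a dependency parse:
    SBV = subject-verb relation, VOB = verb-object relation."""
    triples = []
    for i, rel in enumerate(deprels):
        if rel == "SBV":                       # token i is the subject of its head
            verb = heads[i]
            for j, rel2 in enumerate(deprels):
                if rel2 == "VOB" and heads[j] == verb:
                    triples.append((tokens[i], tokens[verb], tokens[j]))
    return triples

# "学生 学习 平方根" (student / learn / square root), hypothetical parse
tokens = ["学生", "学习", "平方根"]
heads = [1, -1, 1]                 # head index per token, -1 for the root
deprels = ["SBV", "HED", "VOB"]
print(extract_svo(tokens, heads, deprels))  # [('学生', '学习', '平方根')]
```

A real pipeline would fall back to the other dependency patterns named above when no subject-verb-object structure is found.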

During manual annotation, two annotators perform manual pre-annotation and manual formal annotation based on the word segmentation and syntactic analysis results. In the pre-annotation stage, the two annotators label a shared portion of the texts, which is used to compute the annotation consistency of each round; three rounds are run, gradually refining the annotation specification. Formal annotation is then completed under the latest specification, producing the basic corpus.

Step 1023: manually pre-annotate a first preset quantity of texts twice according to the event triples to obtain pre-annotation results, and update the annotation specification according to the pre-annotation results. The pre-annotation results comprise a first pre-annotation result and a second pre-annotation result.

As an optional implementation, step 1023 specifically comprises:

The first preset quantity of texts is manually pre-annotated twice according to the event triples to obtain the pre-annotation results.

A second annotation consistency between the first pre-annotation result and the second pre-annotation result is computed.

Whether a preset number of iterations has been reached is judged, giving a second judgment result.

If the second judgment result is that the preset number of iterations has not been reached, the first pre-annotation result and the second pre-annotation result are compared to obtain a comparison result.

The annotation specification is updated according to the comparison result to obtain a new annotation specification, and the process returns to "manually pre-annotate the first preset quantity of texts twice according to the event triples to obtain the pre-annotation results".

If the second judgment result is that the preset number of iterations has been reached, the manual pre-annotation ends and the new annotation specification is output.

In practice, to refine the established annotation specification and improve annotation quality, corpus pre-annotation is carried out before formal annotation begins. The manual pre-annotation workflow is shown in FIG. 4: 50 documents are randomly selected and independently annotated by two annotators; differing annotations are analyzed, and the specification is revised and refined under the guidance of professional junior high school mathematics teachers; this is repeated for three rounds, with annotation consistency computed for each round.

Step 1024: manually and formally annotate a second preset quantity of texts twice according to the event triples to obtain manual formal annotation results, and build the basic corpus. The manual formal annotation results comprise a first formal annotation result and a second formal annotation result; the second preset quantity of texts differs from the first preset quantity of texts.

As an optional implementation, step 1024 specifically comprises:

The second preset quantity of texts is manually and formally annotated twice according to the event triples to obtain the manual formal annotation results.

A third annotation consistency between the first formal annotation result and the second formal annotation result is computed.

Whether the third annotation consistency reaches a second annotation consistency threshold is judged, giving a third judgment result.

If the third judgment result is that the third annotation consistency reaches the second annotation consistency threshold, the manual formal annotation is completed and the manual formal annotation results are output.

The basic corpus is built from the manual formal annotation results.

If the third judgment result is that the third annotation consistency does not reach the second annotation consistency threshold, the process returns to "manually and formally annotate the second preset quantity of texts twice according to the event triples to obtain the manual formal annotation results".

In practice, after pre-annotation ends, the formal annotation stage begins: the two annotators each annotate 200 documents beyond the pre-annotated ones, and 50 documents are randomly selected for annotation by both annotators, used to compute annotation consistency after annotation finishes. The workflow of the formal annotation stage is shown in FIG. 5.

After manual entity annotation is completed, the data are exported in ann format. The BIO scheme of sequence labeling is then applied to the exported data: every character in a sentence is labeled "B-X", "I-X", or "O", where B marks the beginning of an entity, I its middle or end, and O a non-entity. Combined with the three defined entity types, seven label classes are obtained in total, with knowledge points labeled "KNO", knowledge units "KUN", and knowledge clusters "KNG". The resulting BIO-tagged basic corpus is shown in Table 5.

Table 5. Manual annotation result table for entities

1. B-KNO: first character of a knowledge point entity
2. I-KNO: middle or end of a knowledge point entity
3. B-KUN: first character of a knowledge unit entity
4. I-KUN: middle or end of a knowledge unit entity
5. B-KNG: first character of a knowledge cluster entity
6. I-KNG: middle or end of a knowledge cluster entity
7. O: not an entity
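Converting annotated entity spans into BIO tags of this form can be sketched as follows (a minimal function with an invented example sentence; spans are character-level with exclusive end positions):

```python
def to_bio(sentence, entities):
    """Convert character-level entity spans (start, end, type) into BIO
    tags; the entity types here are KNO / KUN / KNG."""
    tags = ["O"] * len(sentence)
    for start, end, etype in entities:       # end is exclusive
        tags[start] = "B-" + etype           # first character of the entity
        for k in range(start + 1, end):
            tags[k] = "I-" + etype           # middle or end of the entity
    return tags

sent = "学习一次函数"                         # "learn the linear function"
tags = to_bio(sent, [(2, 6, "KNO")])         # 一次函数 as a knowledge point
print(list(zip(sent, tags)))
```

Each (character, tag) pair then becomes one line of the BIO-formatted training file for the entity recognition model.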

After manual relation annotation is completed, the data are exported in ann format and processed into the format required to train the relation extraction model (relation type, index, sentence, entity 1 start position, entity 1 end position, entity 2 start position, entity 2 end position), where entity 1 is delimited by "#" and entity 2 by "$". The labels corresponding to the relation types are shown in Table 6.

Table 6. Annotation result table for relation types

[Table 6 appears as an image in the original publication; it maps each of the five relation types to its label.]
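Assembling one training line in the format described above can be sketched as follows (an invented helper that assumes entity 1 occurs once and before entity 2; the tab-separated column order follows the description in the text):

```python
def format_relation_sample(label, idx, sentence, e1, e2):
    """Build one relation-extraction training line: relation label, index,
    sentence with entity 1 wrapped in '#' and entity 2 in '$', then the
    two entity spans (start, end)."""
    s1, s2 = sentence.find(e1), sentence.find(e2)
    marked = (sentence[:s1] + "#" + e1 + "#" +
              sentence[s1 + len(e1):s2] + "$" + e2 + "$" +
              sentence[s2 + len(e2):])        # assumes e1 occurs before e2
    return f"{label}\t{idx}\t{marked}\t{s1}\t{s1 + len(e1)}\t{s2}\t{s2 + len(e2)}"

# "方程是学习函数的基础": equations are the basis for learning functions
print(format_relation_sample("前驱后继关系", 1, "方程是学习函数的基础", "方程", "函数"))
```

A production converter would work from the exported ann offsets instead of `str.find`, which breaks when an entity string occurs more than once.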

Step 103: train the corpus automatic annotation model with the basic corpus to obtain a trained corpus automatic annotation model.

In practice, the named entity recognition model based on the pre-trained language model (BERT-wwm-BiLSTM-CRF) is trained on the junior high school mathematics corpus built through human-machine collaboration. The entity corpus contains 22,803 sentences, randomly split into training and test sets at a ratio of 8:2, and the corpus contains 2,756 knowledge point (KNO), 808 knowledge unit (KUN), and 185 knowledge cluster (KNG) entities. The main training settings are the Adam optimizer, batch_size 64, epoch 100, learning rate 5e-5, drop_out_rate 0.5, and max_seq_length 128. The loss function for training the named entity recognition model is as follows:

$$L = \log Z(x) - score(x, y), \qquad Z(x) = \sum_{\tilde{y}} e^{score(x, \tilde{y})}$$

Here Z(x) is the cumulative CRF score function over all label sequences, and the score function score(·) is the sum of the emission and transition probabilities in the CRF.

For training the graph-convolutional relation extraction model (BERT-GCN), the triple data come from the junior high school mathematics corpus of Chapter 3. The relation corpus contains 26,764 sentences, randomly split into training and test sets at a ratio of 8:2 in the experiments, with 6,351 predecessor-successor, 2,044 containment, 554 parallel, 460 parent-child, and 209 sibling relations. The main training settings are the Adam optimizer, batch_size 32, epoch 100, learning rate 1.7e-5, drop_out_rate 0.4, max_seq_length 128, and weight decay 1e-3. Cross-entropy is used as the training loss function, as follows:

$$L = -\frac{1}{N}\sum_{n=1}^{N}\sum_{d=1}^{D} Y_{nd}\,\log Z_{nd}$$

Here D is the number of relation types, N the number of triples, Y the true relation label, and Z the predicted relation probability.
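The cross-entropy loss above can be checked numerically (a minimal sketch with invented one-hot labels Y and predicted probabilities Z for N = 2 samples and D = 2 relation types):

```python
import numpy as np

def cross_entropy(Y, Z):
    """Multi-class cross-entropy: Y holds one-hot true relation labels
    (N x D), Z the predicted softmax probabilities (N x D)."""
    N = Y.shape[0]
    return -np.sum(Y * np.log(Z)) / N   # average over the N samples

Y = np.array([[1.0, 0.0], [0.0, 1.0]])  # true labels, one-hot
Z = np.array([[0.9, 0.1], [0.2, 0.8]])  # predicted probabilities
print(round(cross_entropy(Y, Z), 4))
```

Only the probability assigned to the true class contributes, so confident correct predictions drive the loss toward zero.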

Step 104: perform machine annotation on the text to be annotated with the trained corpus automatic annotation model to obtain machine annotation results. The text to be annotated is the text selected during manual formal annotation.

In practice, machine annotation is the process of automatically recognizing entities and relations with the trained models (the named entity recognition model BERT-wwm-BiLSTM-CRF and the graph-convolutional relation extraction model BERT-GCN). After the named entity recognition model and the relation extraction model are trained on the manually annotated basic corpus (together forming the trained corpus automatic annotation model), they are embedded into the annotation tool developed by the team for automatic machine annotation. The text to be annotated is imported into the tool; with automatic annotation mode enabled, the tool visualizes the models' recognition results, as shown in FIG. 6.

Step 105: Compute the annotation consistency between the machine annotation result and the formal manual annotation result to obtain a first annotation consistency.

Step 106: Judge whether the first annotation consistency reaches a first annotation consistency threshold to obtain a judgment result. If it does, execute step 107; otherwise, execute step 108.

Step 107: Output the trained automatic corpus annotation model, taking the trained automatic corpus annotation model as the corpus.

Step 108: Expand the basic corpus and return to step 103.

In practice, the 50 documents randomly selected during formal manual annotation are also machine-annotated, in order to compute annotation consistency. Taking the 50 manually annotated documents as the benchmark, the machine annotations of those same documents are compared against them, and it is judged whether the annotation consistency over the 50 documents reaches 80%. If it does not reach 80%, the basic corpus must be expanded and model training continued until the junior high school mathematics corpus is finally formed; if it reaches 80%, machine annotation is complete and the final junior high school mathematics corpus is formed.

Specifically, after manual and machine annotation are complete, the annotation consistency of the corpus is analyzed and compared. The present invention uses the F value or the Kappa value to evaluate annotation consistency. The Kappa value is normally computed over both positive and negative annotation examples; for the corpus of the present invention, only unannotated text could serve as negative examples, which is difficult to count. In such cases the F value is generally used instead. Let the two annotators (human and machine) be A and B, and take A as the reference group (the choice of reference group is random) to compute the precision P of B:

$$P = \frac{N_{A \cap B}}{N_{B}}$$

where $N_{A \cap B}$ is the number of annotations on which A and B agree and $N_{B}$ is the total number of annotations produced by B.

The recall R of B is then:

$$R = \frac{N_{A \cap B}}{N_{A}}$$

where $N_{A}$ is the total number of annotations produced by A.

From B's precision P and recall R, the F value is computed:

$$F = \frac{2PR}{P + R}$$

The higher the F value, the more accurate B's annotation, i.e. the higher the annotation consistency. When computing entity annotation consistency, two annotations are considered consistent only when both the entity string and the entity type are identical. The same metrics apply to relation consistency: annotations are consistent when the entity pair and the relation between the entities are both identical.
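A minimal sketch of this consistency metric, assuming each entity annotation is a (string, type) pair (relation annotations would use (head, relation, tail) triples instead); the function name and example annotations are illustrative, not from the patent:

```python
def consistency_f1(reference, candidate):
    """F value of candidate annotations B measured against reference annotations A."""
    A, B = set(reference), set(candidate)
    agreed = len(A & B)           # annotations identical in both string and type
    if not A or not B or agreed == 0:
        return 0.0
    precision = agreed / len(B)   # P: B's accuracy with A as the reference group
    recall = agreed / len(A)      # R: B's recall against A
    return 2 * precision * recall / (precision + recall)

# Hypothetical human (reference) vs. machine annotations.
human = {("等腰三角形", "概念"), ("勾股定理", "定理"), ("平行线", "概念")}
machine = {("等腰三角形", "概念"), ("勾股定理", "定理"), ("全等", "性质")}
f = consistency_f1(human, machine)   # P = R = 2/3, so F = 2/3
```

The same function computes relation consistency when the set elements are triples rather than pairs.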

Table 7. Entity annotation comparison results

Figure BDA0003760131510000204

Table 8. Relation annotation comparison results

Figure BDA0003760131510000205

Tables 7 and 8 show that during pre-annotation the annotation consistency value grows from round to round. This is because after each pre-annotation round the annotation specification is further revised in light of the annotation results, and the gradual refinement of the specification in turn gradually raises the annotation consistency.

If the consistency of the machine-annotated corpus does not reach 80%, the basic corpus must be expanded: a further 1,000 items are manually annotated to enlarge it, after which the procedure iterates from the first step until the annotation consistency exceeds 80%.
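The expand-and-retrain loop described above can be sketched as follows. All the callables (`train_fn`, `annotate_fn`, `score_fn`, `expand_fn`) are placeholders for the patent's components, not its actual interfaces:

```python
def build_corpus(base_corpus, benchmark_docs, expand_fn,
                 train_fn, annotate_fn, score_fn, threshold=0.80):
    """Iterate: train, annotate the benchmark documents, check consistency.

    When the consistency score reaches the threshold (80% in the text),
    the trained model and current corpus are returned; otherwise the base
    corpus is expanded (e.g. by 1,000 manually annotated items) and the
    loop repeats from the training step.
    """
    while True:
        model = train_fn(base_corpus)
        machine_result = annotate_fn(model, benchmark_docs)
        consistency = score_fn(benchmark_docs, machine_result)
        if consistency >= threshold:
            return model, base_corpus
        base_corpus = expand_fn(base_corpus)
```

With stub functions whose score grows as the corpus grows, the loop terminates once the simulated consistency crosses the threshold.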

In the human-machine collaborative corpus construction method of the present invention, the manual processing stage comprises data collection and processing, annotation specification formulation, and manual annotation; the machine processing stage comprises lexical and syntactic analysis and machine annotation, as shown in Figure 7. When annotating massive amounts of data, this method effectively reduces the time and cost of manual annotation; it is broadly applicable and can effectively solve the problem of large-scale data annotation.

Figure 9 is a structural diagram of the human-machine collaborative corpus construction system provided by the present invention. As shown in Figure 9, the system comprises:

A model and data set construction module 901, configured to construct the automatic corpus annotation model and the data set; the texts in the data set include subject textbooks, teaching designs and guided-learning plans.

A manual annotation module 902, configured to manually annotate the data set, obtain a manual annotation result and construct the basic corpus. Manual annotation comprises manual pre-annotation and formal manual annotation; the manual annotation result comprises a formal manual annotation result and a manual pre-annotation result.

A training module 903, configured to train the automatic corpus annotation model with the basic corpus to obtain a trained automatic corpus annotation model.

A machine annotation module 904, configured to machine-annotate the text to be annotated with the trained automatic corpus annotation model to obtain a machine annotation result. The text to be annotated is the text selected during formal manual annotation.

A calculation module 905, configured to compute the annotation consistency between the machine annotation result and the formal manual annotation result to obtain a first annotation consistency.

A judgment module 906, configured to judge whether the first annotation consistency reaches a first annotation consistency threshold to obtain a judgment result.

A first execution module 907, configured to, if the judgment result is that the first annotation consistency reaches the first annotation consistency threshold, output the trained automatic corpus annotation model and take it as the corpus.

A second execution module 908, configured to, if the judgment result is that the first annotation consistency does not reach the first annotation consistency threshold, expand the basic corpus and return to the step of training the automatic corpus annotation model with the basic corpus to obtain the trained automatic corpus annotation model.

As an optional implementation, the system further comprises:

A data processing module, configured to preprocess the data set to obtain a processed data set.

As an optional implementation, the data processing module comprises:

A format conversion unit, configured to convert the format of the texts to obtain texts in a preset format.

A data elimination unit, configured to remove irregular text from the texts in the preset format to obtain a processed data set; the irregular text includes tables, pictures, school names and teacher names.
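The data elimination step could look like the following sketch. The regular expressions are illustrative assumptions, not the patent's actual cleaning rules:

```python
import re

# Hypothetical noise patterns: runs of CJK characters ending in 中学/学校
# (school names) and teacher-name fields such as 执教教师:张三.
NOISE_PATTERNS = [
    r"[\u4e00-\u9fff]+(中学|学校)",   # school names
    r"(授课|执教)教师[:：]\S+",        # teacher-name fields
]

def clean_text(text: str) -> str:
    """Strip assumed school-name and teacher-name noise from a document."""
    for pat in NOISE_PATTERNS:
        text = re.sub(pat, "", text)
    return text.strip()
```

Tables and pictures would normally be dropped earlier, during format conversion, since they are structural elements rather than text runs.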

As an optional implementation, the manual annotation module 902 comprises:

An annotation specification formulation unit, configured to formulate the annotation specification according to the dimensions of knowledge and classification methods from other fields, combined with the characteristics of junior high school mathematics. The other fields include the medical field and the military field.

A data analysis unit, configured to perform lexical and syntactic analysis on the processed data set and construct event triples.

A pre-annotation unit, configured to manually pre-annotate a first preset quantity of texts twice according to the event triples, obtain pre-annotation results, and update the annotation specification according to the pre-annotation results. The pre-annotation results include a first pre-annotation result and a second pre-annotation result.

A formal annotation unit, configured to formally annotate a second preset quantity of texts twice manually according to the event triples, obtain formal manual annotation results, and construct the basic corpus. The formal manual annotation results include a first formal annotation result and a second formal annotation result; the second preset quantity of texts differs from the first preset quantity of texts.

As an optional implementation, the pre-annotation unit comprises:

A pre-annotation subunit, configured to manually pre-annotate the first preset quantity of texts twice according to the event triples to obtain pre-annotation results.

A first calculation subunit, configured to calculate a second annotation consistency between the first pre-annotation result and the second pre-annotation result.

A first judgment subunit, configured to judge whether a preset number of iterations has been completed, to obtain a second judgment result.

A first execution subunit, configured to, if the second judgment result is that the preset number of iterations has not been completed, compare the first pre-annotation result with the second pre-annotation result to obtain a comparison result.

An update subunit, configured to update the annotation specification according to the comparison result to obtain a new annotation specification, and return to the step of manually pre-annotating the first preset quantity of texts twice according to the event triples to obtain pre-annotation results.

A second execution subunit, configured to, if the second judgment result is that the preset number of iterations has been completed, end the manual pre-annotation and output the new annotation specification.
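The pre-annotation subunits above amount to a fixed number of double-annotation rounds with specification revision in between, which can be sketched as follows. The callables modeling the two annotators and the revision step are placeholders, not the patent's actual interfaces:

```python
def pre_annotation_rounds(texts, annotate_a, annotate_b, revise_spec,
                          spec, max_rounds=3):
    """Run a preset number of pre-annotation rounds.

    Each round, two annotators label the same texts under the current
    specification; between rounds the specification is revised from their
    disagreements. After the final round the specification is output.
    """
    for round_no in range(1, max_rounds + 1):
        result_a = annotate_a(texts, spec)
        result_b = annotate_b(texts, spec)
        if round_no == max_rounds:
            break                        # preset iterations done: output spec
        disagreements = set(result_a) ^ set(result_b)
        spec = revise_spec(spec, disagreements)
    return spec
```

Computing the second annotation consistency between `result_a` and `result_b` each round (as the first calculation subunit does) would use the same F-value metric given earlier.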

As an optional implementation, the formal annotation unit comprises:

A formal annotation subunit, configured to formally annotate the second preset quantity of texts twice manually according to the event triples to obtain formal manual annotation results.

A second calculation subunit, configured to calculate a third annotation consistency between the first formal annotation result and the second formal annotation result.

A second judgment subunit, configured to judge whether the third annotation consistency reaches a second annotation consistency threshold to obtain a third judgment result.

A third execution subunit, configured to, if the third judgment result is that the third annotation consistency reaches the second annotation consistency threshold, complete the formal manual annotation and output the formal manual annotation result.

A basic corpus construction subunit, configured to construct the basic corpus according to the formal manual annotation result.

A fourth execution subunit, configured to, if the third judgment result is that the third annotation consistency does not reach the second annotation consistency threshold, return to the step of formally annotating the second preset quantity of texts twice manually according to the event triples to obtain formal manual annotation results.

The human-machine collaborative corpus construction system of the present invention accomplishes human-machine collaborative corpus construction. It offers reference value for research on data-driven corpus construction methods, and the resulting junior high school mathematics corpus provides practical support for the subsequent construction and application of educational knowledge graphs.

The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the others, and identical or similar parts of the embodiments may be consulted across embodiments. Since the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for the relevant details, refer to the description of the method.

Specific examples have been used herein to explain the principles and implementations of the present invention; the description of the above embodiments is only intended to help understand the method of the present invention and its core idea. Meanwhile, those of ordinary skill in the art may, in accordance with the idea of the present invention, make changes to the specific implementation and the scope of application. In summary, the contents of this specification should not be construed as limiting the present invention.

Claims (10)

1. A method for constructing a human-computer collaborative corpus is characterized by comprising the following steps:
constructing an automatic labeling model and a data set of a corpus; the text in the data set comprises subject textbooks, teaching designs and teaching guide cases;
carrying out manual annotation on the data set to obtain a manual annotation result and construct a basic corpus; the manual marking comprises manual pre-marking and manual formal marking; the manual annotation result comprises a manual formal annotation result and a manual pre-annotation result;
training the corpus automatic annotation model by using the basic corpus to obtain a trained corpus automatic annotation model;
performing machine labeling on a text to be labeled by using the trained corpus automatic labeling model to obtain a machine labeling result; the text to be marked is the text selected during the manual formal marking;
carrying out labeling consistency calculation on the machine labeling result and the manual formal labeling result to obtain a first labeling consistency;
judging whether the first labeling consistency reaches a first labeling consistency threshold value or not to obtain a judgment result;
if the judgment result shows that the first labeling consistency reaches the first labeling consistency threshold value, outputting the trained corpus automatic labeling model, and taking the trained corpus automatic labeling model as a corpus;
and if the judgment result is that the first annotation consistency does not reach the first annotation consistency threshold, expanding the basic corpus, and returning to the step of training the automatic annotation model of the corpus by using the basic corpus to obtain the trained automatic annotation model of the corpus.
2. The method for constructing a human-computer collaborative corpus according to claim 1, wherein, after the constructing of the corpus automatic labeling model and the data set, the method further comprises:
and preprocessing the data set to obtain a processed data set.
3. The method for constructing the human-computer collaborative corpus of claim 2, wherein the preprocessing the data set to obtain a processed data set specifically comprises:
carrying out format conversion on the text to obtain a text with a preset format;
eliminating irregular texts from the texts with the preset format to obtain a processed data set; the irregular texts comprise a table, a picture, a school name and a teacher name.
4. The method for constructing a human-computer collaborative corpus according to claim 3, wherein the manually labeling the data set to obtain a basic corpus specifically comprises:
according to knowledge dimensions and classification methods in other fields, and by combining the characteristics of junior middle school mathematics disciplines, marking specifications are formulated; such other fields include the medical field and the military field;
performing lexical and syntactic analysis on the processed data set to construct event triples;
manually pre-labeling the texts with a first preset number twice according to the event triad to obtain a pre-labeling result, and updating the labeling specification according to the pre-labeling result; the pre-labeling results comprise a first pre-labeling result and a second pre-labeling result;
carrying out manual formal annotation twice on a second preset number of texts according to the event triplet to obtain a manual formal annotation result, and constructing a basic corpus; the manual formal annotation result comprises a first formal annotation result and a second formal annotation result; the second preset number of texts is different from the first preset number of texts.
5. The method for constructing the human-computer collaborative corpus of claim 4, wherein the manually pre-labeling a first preset number of texts twice according to the event triplet to obtain pre-labeling results, and updating the labeling specifications according to the pre-labeling results specifically comprises:
manually pre-labeling the texts with the first preset number twice according to the event triad to obtain a pre-labeling result;
calculating second annotation consistency of the first pre-labeling result and the second pre-labeling result;
judging whether iteration is carried out for a preset number of times to obtain a second judgment result;
if the second judgment result is that the preset times are not iterated, comparing the first pre-labeling result with the second pre-labeling result to obtain a comparison result;
updating the marking specification according to the comparison result to obtain a new marking specification, and returning to the step of performing manual pre-marking twice on the texts with the first preset number according to the event triplet to obtain a pre-marking result;
and if the second judgment result is that the preset times are iterated, ending the manual pre-labeling and outputting the new labeling specification.
6. The method for constructing a human-computer collaborative corpus of claim 5, wherein the performing two times of artificial formal annotation on a second preset number of texts according to the event triplet to obtain an artificial formal annotation result, and constructing a basic corpus specifically comprises:
carrying out manual formal annotation twice on a second preset number of texts according to the event triad to obtain a manual formal annotation result;
calculating third labeling consistency of the first formal labeling result and the second formal labeling result;
judging whether the third labeling consistency reaches a second labeling consistency threshold value or not to obtain a third judgment result;
if the third judgment result is that the third labeling consistency reaches the second labeling consistency threshold, completing the manual formal labeling and outputting a manual formal labeling result;
constructing the basic corpus according to the artificial formal annotation result;
and if the third judgment result shows that the third annotation consistency does not reach the second annotation consistency threshold, returning to the step of carrying out twice manual formal annotation on the texts with the second preset number according to the event triad to obtain a manual formal annotation result.
7. A human-computer collaborative corpus construction system, comprising:
the model and data set construction module is used for constructing an automatic annotation model and a data set of a corpus; the texts in the data set comprise subject textbooks, teaching designs and teaching guide cases;
the manual labeling module is used for performing manual labeling on the data set to obtain a manual labeling result and construct a basic corpus; the manual marking comprises manual pre-marking and manual formal marking; the manual annotation result comprises a manual formal annotation result and a manual pre-annotation result;
the training module is used for training the corpus automatic annotation model by utilizing the basic corpus to obtain a trained corpus automatic annotation model;
the machine labeling module is used for performing machine labeling on a text to be labeled by using the trained corpus automatic labeling model to obtain a machine labeling result; the text to be marked is the text selected during the manual formal marking;
the calculation module is used for carrying out marking consistency calculation on the machine marking result and the manual formal marking result to obtain first marking consistency;
the judging module is used for judging whether the first marking consistency reaches a first marking consistency threshold value to obtain a judging result;
the first execution module is used for outputting the trained corpus automatic labeling model if the judgment result shows that the first labeling consistency reaches the first labeling consistency threshold value, and taking the trained corpus automatic labeling model as a corpus;
and the second execution module is used for expanding the basic corpus and returning to the step of training the corpus automatic labeling model by using the basic corpus to obtain the trained corpus automatic labeling model if the judgment result shows that the first labeling consistency does not reach the first labeling consistency threshold value.
8. The system for constructing a human-computer collaborative corpus of claim 7, further comprising:
and the data processing module is used for preprocessing the data set to obtain a processed data set.
9. The system for constructing a human-computer collaborative corpus of claim 8, wherein the data processing module comprises:
the format conversion unit is used for carrying out format conversion on the text to obtain the text with a preset format;
the data removing unit is used for removing the irregular texts in the texts with the preset formats to obtain a processed data set; the irregular text comprises a table, a picture, a school name and a teacher name.
10. The system for constructing a human-computer collaborative corpus of claim 9, wherein the manual labeling module comprises:
the marking specification making unit is used for making a marking specification according to the dimensionality of knowledge and classification methods in other fields by combining the characteristics of the junior high school mathematics subject; such other fields include the medical field and the military field;
the data analysis unit is used for carrying out lexical and syntactic analysis on the processed data set and constructing event triples;
the pre-labeling unit is used for manually pre-labeling the texts with the first preset number twice according to the event triad to obtain a pre-labeling result and updating the labeling specification according to the pre-labeling result; the pre-labeling result comprises a first pre-labeling result and a second pre-labeling result;
the manual labeling unit is used for carrying out two times of manual formal labeling on the second preset number of texts according to the event triplet, obtaining manual formal labeling results and constructing a basic corpus; the manual formal annotation result comprises a first formal annotation result and a second formal annotation result; the second preset number of texts is different from the first preset number of texts.
CN202210869795.4A 2022-07-22 2022-07-22 A human-machine collaborative corpus construction method and system Pending CN115270713A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210869795.4A CN115270713A (en) 2022-07-22 2022-07-22 A human-machine collaborative corpus construction method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210869795.4A CN115270713A (en) 2022-07-22 2022-07-22 A human-machine collaborative corpus construction method and system

Publications (1)

Publication Number Publication Date
CN115270713A true CN115270713A (en) 2022-11-01

Family

ID=83768991

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210869795.4A Pending CN115270713A (en) 2022-07-22 2022-07-22 A human-machine collaborative corpus construction method and system

Country Status (1)

Country Link
CN (1) CN115270713A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118070028A (en) * 2024-04-17 2024-05-24 北方健康医疗大数据科技有限公司 A method and system for evaluating annotation quality


Similar Documents

Publication Publication Date Title
Khan et al. A novel natural language processing (NLP)–based machine translation model for English to Pakistan sign language translation
CN108363743B (en) Intelligent problem generation method and device and computer readable storage medium
CN109902298B (en) Domain knowledge modeling and knowledge level estimation method in self-adaptive learning system
CN110110054B (en) Method for acquiring question-answer pairs from unstructured text based on deep learning
CN113642330A (en) Entity recognition method of rail transit specification based on catalog topic classification
CN106599032B (en) Text event extraction method combining sparse coding and structure sensing machine
CN114547298B (en) Biomedical relation extraction method, device and medium based on combination of multi-head attention and graph convolution network and R-Drop mechanism
CN110134954B (en) Named entity recognition method based on Attention mechanism
CN110569345B (en) An Intelligent Question Answering Method for Current Political Knowledge Based on Entity Linking and Relationship Prediction
CN112765952A (en) Conditional probability combined event extraction method under graph convolution attention mechanism
CN111143574A (en) Query and visualization system construction method based on minority culture knowledge graph
CN108874896A (en) A kind of humorous recognition methods based on neural network and humorous feature
CN114238653A (en) A method for the construction, completion and intelligent question answering of programming education knowledge graph
CN113204967B (en) Resume Named Entity Recognition Method and System
CN111274829A (en) Sequence labeling method using cross-language information
CN112699685A (en) Named entity recognition method based on label-guided word fusion
Chen et al. ADOL: a novel framework for automatic domain ontology learning
CN111666374A (en) Method for integrating additional knowledge information into deep language model
CN113869054A (en) A feature recognition method of power field project based on deep learning
CN117992614A (en) A method, device, equipment and medium for sentiment classification of Chinese online course reviews
CN109815497B (en) Character attribute extraction method based on syntactic dependency
CN111291550A (en) Chinese entity extraction method and device
CN115270713A (en) A human-machine collaborative corpus construction method and system
CN116595992B (en) A single-step extraction method of terms and types of binary pairs and its model
Žitko et al. Automatic question generation using semantic role labeling for morphologically rich languages

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination