CN105824791A

CN105824791A - Reference format checking method

Info

Publication number: CN105824791A
Application number: CN201610153946.0A
Authority: CN
Inventors: 李宁; 侯霞; 赵琳; 田英爱
Original assignee: Beijing Information Science and Technology University
Current assignee: Beijing Information Science and Technology University
Priority date: 2016-03-17
Filing date: 2016-03-17
Publication date: 2016-08-03
Anticipated expiration: 2036-03-17
Also published as: CN105824791B

Abstract

The present invention provides a method for checking the format of references, comprising: step 1, expressing the format rules for the bibliographic items of references in an XML Schema, wherein the reference format includes at least one of the following bibliographic items: author, title, reference type, publisher, publication date, and page number; step 2, reading each reference and segmenting it into bibliographic items; step 3, identifying the bibliographic items of the reference and extracting the identified items as XML nodes, wherein the bibliographic items include at least one of the following: responsible party, title, place of publication, publisher, and publication date; at the same time, determining whether the bibliographic items include a document type flag and, if not, adding the document type flag of the reference according to its bibliographic items; step 4, validating the bibliographic items against the format rules.

Description

A Method for Checking the Format of References

Technical Field

The invention belongs to the technical field of text processing, and in particular relates to a method for checking the format of references.

Background

Papers of every kind inevitably cite previously published references to help readers understand the background of the work. A citation generally needs to give the reference's author, title, publisher (that is, where the work was published), page numbers (publishpage), and publication date (publishyear). In documents that gather many papers, such as conference proceedings and large journals, each paper cites a large number of references, so it is difficult to ensure that every paper cites its references in the same format.

At present, format compliance is checked by reviewers while they review the paper and then checked again by editors; with a purely manual process of this kind, omissions are hard to avoid.

Summary of the Invention

Manual checking of reference formats in the prior art is prone to omissions and cannot ensure that every paper in a collection or journal cites references by the same rules. The technical problem to be solved by the present invention is to provide a reference format checking method and system that automatically checks whether the references cited in an electronic paper conform to preset rules, thereby ensuring that reference formats are standardized, improving efficiency, and preventing omissions.

To solve the above problem, an embodiment of the present invention proposes a reference format checking method, comprising:

Step 1. Express the format rules for the bibliographic items of references in an XML Schema, wherein the reference format includes at least one of the following bibliographic items: author, title, reference type, publisher, publication date, and page number;

Step 2. Read each reference and segment it into bibliographic items;

Step 3. Identify the bibliographic items of the reference and extract the identified items as XML nodes, wherein the bibliographic items include at least one of the following: responsible party, title, place of publication, publisher, and publication date; at the same time, determine whether the bibliographic items include a document type flag and, if not, add the document type flag of the reference according to its bibliographic items;

Step 4. Validate the bibliographic items against the format rules.

Wherein the method further comprises:

Step 5. When a bibliographic item of a reference contains an error, correct the item; specifically:

When the error is a missing item, supply the missing bibliographic item and reassemble the reference with punctuation into a correctly formatted reference;

When the error is an extra item, delete that bibliographic item and reassemble the reference with punctuation into a correctly formatted reference;

When the error is a wrong item, correct it according to the standard format and reassemble the reference with punctuation into a correctly formatted reference.

Wherein step 2 comprises:

Step 21. Read the document with Apache POI to extract the reference content;

Step 22. Segment the extracted reference content to obtain the bibliographic items, including:

Examining the symbols in the reference to determine whether it contains non-half-width symbols and, if so, replacing them with the corresponding half-width symbols;

Segmenting the bibliographic items according to the bibliographic punctuation symbols.

Wherein step 3 comprises: using a preset bibliographic item recognition model to recognize the references cited in the paper and extract their bibliographic items, the recognition model being obtained by learning from a preset corpus; specifically:

Step 31. Extract the corpus;

Step 32. Train on the preset corpus with the NER algorithm to obtain the bibliographic item recognition model;

Step 33. Determine whether the reference includes a reference type parameter; if not, determine the type of the reference from its bibliographic items.

Wherein step 33 comprises:

Step 331: Build a decision tree over the bibliographic items; specifically:

Calculate the Gini index (Gini), the entropy (Entropy), and the error rate (Error) by the following formulas:

$Gini = 1 - \sum_{i=1}^{n} p(i)^{2}$

$Entropy = -\sum_{i=1}^{n} p(i) \log_{2} p(i)$

$Error = 1 - \max\{\, p(i) \mid i \in [1, n] \,\}$

and calculate the information gain (Gain) and the information gain rate (GainRate):

$Gain(U, V) = Entropy(U) - Entropy(U, V)$

$GainRate(U, V) = Gain(U, V) / Entropy(V)$

so as to determine the root node of the decision tree and the best grouping variable;

Step 332. Preprocess the data. The specific steps are: check the completeness of the bibliographic items of the references and convert data that is neither numeric nor nominal into numeric or nominal form; look for missing bibliographic items in a reference and, if any are found, fill the missing values from the related bibliographic items in the reference; delete negligible bibliographic items according to their relevance; generalize the data;

Step 333. Determine the type using the reference decision tree and the preprocessed data. In this embodiment of the invention, the WEKA platform is used for type determination.

Wherein step 333 specifically comprises:

Step 3331. Import the data set to be tested;

Step 3332. Obtain the data to be tested as preprocessed in step 332;

Step 3333. Feed the processed data set to different learning schemes and build a prediction model to predict unknown instances;

Step 3334. Evaluate the prediction results.
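The selection of the root node and best grouping variable in step 331 can be sketched as follows. This is a minimal Python illustration, not the WEKA-based implementation of the embodiment. It assumes that Entropy(U, V) in the Gain formula denotes the weighted entropy of U after the records are partitioned by V, which is the usual reading of information gain; the toy records and the names `ref_type`, `publisher_type`, and `has_pages` are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Entropy = -sum p(i) * log2 p(i) over the distinct values."""
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in Counter(values).values())


def entropy_after_split(labels, groups):
    """Assumed meaning of Entropy(U, V): weighted entropy of the labels
    within each partition induced by the grouping variable."""
    total = len(labels)
    partitions = {}
    for label, group in zip(labels, groups):
        partitions.setdefault(group, []).append(label)
    return sum(len(part) / total * entropy(part) for part in partitions.values())


def gain(labels, groups):
    return entropy(labels) - entropy_after_split(labels, groups)


def gain_rate(labels, groups):
    split_entropy = entropy(groups)
    return gain(labels, groups) / split_entropy if split_entropy > 0 else 0.0


# Hypothetical toy records: reference type U and two candidate variables V.
ref_type = ["D", "D", "J", "J", "M", "M"]
candidates = {
    "publisher_type": ["school", "school", "journal", "journal", "press", "press"],
    "has_pages": ["no", "yes", "yes", "yes", "no", "yes"],
}

# The variable with the highest gain rate becomes the root node.
best = max(candidates, key=lambda name: gain_rate(ref_type, candidates[name]))
```

On this toy data the publisher type separates the three reference types completely, so it is chosen as the root.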

The beneficial effects of the above technical solution of the present invention are as follows:

With the proliferation of scientific and technical papers, the relevant national authorities have promoted the standardization of academic journals, and the format standard for references has become a rule that authors and editorial staff must follow. Authors need to learn the standard in order to write high-quality papers, and editorial staff need to learn it in order to check papers efficiently. Both therefore need a convenient tool for checking whether reference formats conform to the standard. Because different types of references have different formats and a single reference has many bibliographic items, authors inevitably make mistakes when writing, so references in academic papers still contain many irregularities, which makes checking harder for editorial staff. This work addresses the conformity of reference formats and has high practical value.

1) It makes reference format checking more intelligent, reduces errors in reference description, and improves the efficiency of format checking.

2) Correctly interpreting each bibliographic item of a reference facilitates further exploitation of references later (for example, analyzing citing and cited information, assessing the research level of academic works, and collating the research output of particular authors).

Embodiments of the present invention can check whether reference formats conform to the standard, locate the exact position of an error, and suggest how to correct it, which is convenient for researchers. The results are of significant value for improving the quality of digital publishing, promoting the efficient dissemination and use of document information, and reducing the labor cost of typesetting.

Brief Description of the Drawings

Figure 1 is a schematic flow chart of an embodiment of the present invention;

Figure 2 shows the code of the XML-based OOXML structure in a Word file;

Figure 3 is a schematic diagram of the hierarchy of Word document elements;

Figure 4 shows 10 correctly formatted references used as an example;

Figure 5 is a schematic diagram of the structure of the reference decision tree in an embodiment of the present invention;

Figure 6 is a schematic diagram of some records in the ARFF file;

Figure 7 is a schematic diagram of the conversion in an embodiment of the present invention;

Figure 8 shows 10 references to be tested, used as an example;

Figure 9 shows part of the results of bibliographic item recognition;

Figure 10 shows the check results for the references to be tested in Figure 8;

Figure 11 shows the XML file generated during the check.

Detailed Description

To make the technical problem to be solved, the technical solution, and the advantages of the present invention clearer, a detailed description is given below with reference to the drawings and specific embodiments.

An embodiment of the present invention proposes a reference format checking method, comprising:

Step 1. Define standard format templates using XML Schema. The purpose is to verify the correctness of the XML file generated from the bibliographic items of a reference.

Step 2. Read each reference and segment it into bibliographic items. Segmentation aims to make each unit to be recognized belong to as few categories as possible, laying the foundation for recognizing the items in the next step; recognition accuracy improves only when the items are segmented accurately.

Step 3. Recognize the bibliographic items. Building on the segmentation of step 2, each bibliographic item of the reference is recognized; the items include responsible party, title, place of publication, publisher, publication date, and so on.

Step 4. Extract the recognized bibliographic items as XML nodes.

Step 5. Determine the document type flag. GB/T 7714-2005 specifies the following document type flags: general book (M), compilation (G), standard (S), periodical (J), computer program (CP), dissertation (D), report (R), patent (P), database (DB), electronic bulletin (EB), magnetic tape (MT), disk (DK), conference proceedings (C), optical disc (CD), newspaper (N), online (OL). The title item extracted in step 4 is searched for a document type flag specified in GB/T 7714-2005.

Step 6. Validate against the standard format templates of step 1. If the reference contains a document type flag, the Schema template for that document type is invoked for validation; if it does not, the document type is determined first and the Schema template for that type is then invoked. If validation passes, the reference is correctly formatted; if validation fails, the reference is incorrectly formatted.

Step 7. Identify the erroneous bibliographic items and correct them, including checking the order of the items, so that a correct XML instance is finally generated. The design is as follows: when the XML file fails Schema validation, the specific bibliographic items that failed are extracted from the XML file. Failed items fall into three cases: a missing item, an extra item, or a wrong item. For a missing item, supply the bibliographic item and reassemble the reference with punctuation into a correctly formatted reference. For an extra item, delete the bibliographic item and reassemble the reference with punctuation into a correctly formatted reference. For a wrong item, correct it according to the standard format and reassemble the reference with punctuation into a correctly formatted reference.
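Steps 4, 5, and 7 can be sketched together as follows. This is a minimal Python illustration under stated assumptions, not the implementation of the embodiment: the element names follow the example Schema given for step 1, the flag is assumed to be written in square brackets (optionally as type/carrier, for example [EB/OL]), the helper names are hypothetical, and the four-digit year rule is invented purely to show a "wrong item".

```python
import re
import xml.etree.ElementTree as ET

# Document type flags listed in step 5 (GB/T 7714-2005).
DOC_TYPE_FLAGS = {"M", "G", "S", "J", "CP", "D", "R", "P",
                  "DB", "EB", "MT", "DK", "C", "CD", "N", "OL"}
# Assumption: a flag appears in square brackets, optionally "type/carrier".
FLAG_PATTERN = re.compile(r"\[([A-Z]{1,2})(?:/([A-Z]{1,2}))?\]")

# Assumption: (element name, required) pairs for one reference type.
EXPECTED_ITEMS = [("author", True), ("title", True), ("type", True),
                  ("publisher", True), ("publish_year", True),
                  ("page_number", False)]


def items_to_xml(items):
    """Step 4: turn (name, text) pairs into child nodes of <reference>."""
    root = ET.Element("reference")
    for name, text in items:
        ET.SubElement(root, name).text = text
    return root


def find_doc_type_flag(title):
    """Step 5: return the first valid flag found in the title, else None."""
    for match in FLAG_PATTERN.finditer(title):
        codes = [code for code in match.groups() if code]
        if all(code in DOC_TYPE_FLAGS for code in codes):
            return "/".join(codes)
    return None


def diagnose(reference, checks=None):
    """Step 7: sort problems into missing, extra, and wrong items.

    checks maps element names to predicates that return True when the
    text of that element is well formed.
    """
    checks = checks or {}
    present = [child.tag for child in reference]
    known = {name for name, _ in EXPECTED_ITEMS}
    missing = [n for n, required in EXPECTED_ITEMS if required and n not in present]
    extra = [n for n in present if n not in known]
    wrong = [child.tag for child in reference
             if child.tag in checks and not checks[child.tag](child.text or "")]
    return {"missing": missing, "extra": extra, "wrong": wrong}


reference = items_to_xml([
    ("author", "Zhu Gang"),
    ("title", "New fluid finite element method[D]"),
    ("publisher", "Tsinghua University"),
    ("publish_year", "96"),
    ("editor", "Someone"),
])
flag = find_doc_type_flag(reference.findtext("title"))
# Hypothetical rule for illustration only: a four-digit publication year.
report = diagnose(reference, {"publish_year": lambda t: len(t) == 4 and t.isdigit()})
```

Here the flag found in the title could be used to fill the missing type node, the editor node would be deleted as an extra item, and the year would be rewritten as a wrong item, after which the reference is reassembled with punctuation as described in step 7.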

Each step of the embodiment of the present invention is described in detail below.

Step 1. Express the format rules for the bibliographic items of references in code, wherein the reference format includes at least one of the following bibliographic items: author, title, reference type, publisher, publication date, and page number.

In this embodiment of the invention, the rules can be expressed in the XML Schema language.

The following is an example of reference format rules expressed in XML Schema, here for conference proceedings:

<?xml version="1.0" encoding="GB2312"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns="http://www.w3school.com.cn"
           targetNamespace="http://www.w3school.com.cn"
           elementFormDefault="qualified">
  <xs:element name="reference">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="author" type="xs:string"/>
        <xs:element name="title" type="xs:string"/>
        <xs:element name="type" type="xs:string"/>
        <xs:element name="publish" type="xs:string" minOccurs="0"/>
        <xs:element name="publisher" type="xs:string"/>
        <xs:element name="publish_year" type="xs:string"/>
        <xs:element name="page_number" minOccurs="0">
          <xs:simpleType>
            <xs:restriction base="xs:string">
              <xs:pattern value="(\d{1,4}-)?\d{1,4}"/>
            </xs:restriction>
          </xs:simpleType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
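The page_number restriction in the Schema above can be checked in isolation. The following is a minimal Python sketch, not part of the embodiment; the function name is hypothetical. An XML Schema pattern must match the whole value, which is why the sketch uses a full match rather than a search.

```python
import re

# Pattern copied from the page_number restriction in the Schema above.
PAGE_NUMBER = re.compile(r"(\d{1,4}-)?\d{1,4}")


def is_valid_page_number(text):
    """Return True when the whole text is a page or a page range."""
    return PAGE_NUMBER.fullmatch(text) is not None
```

A single page such as 15 and a range such as 120-135 are accepted, while a five-digit number, a dangling range such as 3-, or a value with a "pp." prefix is rejected.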

Step 2. Extract the references cited in the text using a preset library, and extract the bibliographic items from them. Step 2 has two parts: extracting the reference content and extracting the bibliographic items. Step 2 therefore specifically comprises:

Step 21. Read the document with Apache POI to extract the reference content.

Most existing documents are stored in Microsoft Word format or a Word-compatible format. In a Microsoft Word document, information is stored in the XML-based OOXML (Office Open XML) format, so Apache POI 3.13 can be used to read the document.

The meaning of the OOXML structure is illustrated below with an example. Two paragraphs of text, "Identification of bibliographic items in Chinese references" and "Thesis", are entered in Microsoft Word 2013; the corresponding XML code is shown in Figure 2.

In the code, the <w:document> element is the root element of the document, and all other elements are its descendants. Several namespaces are declared on the element through attributes.

The <w:body> element holds the content of the document and is the only required element. It contains many child elements; see the OOXML standard for details. Of these, three are the most basic: <w:p>, <w:r>, and <w:t>. The <w:p> element represents a paragraph and defines content that starts on a new line. The <w:r> element represents run-level content, which may be a run of text, mathematical content, smart tags, user-defined tags, and so on; a run is the smallest unit to which formatting can be applied. The <w:t> element holds the actual text inside a run. The hierarchy of these elements is shown in Figure 3.

Because the position of the references in a paper is fixed, once the XML-based OOXML (Office Open XML) structure is understood, Apache POI 3.13 can be used to read the document and extract the content of its references.
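The paragraph, run, and text hierarchy described above can be walked with any XML parser. The sketch below is a minimal Python illustration using the standard library; it is not the Apache POI code of the embodiment, and the sample document is hand-written for this example. Only the WordprocessingML namespace URI is taken from the OOXML standard.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

# Hand-written sample with two paragraphs; the first is split into two runs.
SAMPLE = f"""<w:document xmlns:w="{W_NS}">
  <w:body>
    <w:p>
      <w:r><w:t>Identification of bibliographic items</w:t></w:r>
      <w:r><w:t> in Chinese references</w:t></w:r>
    </w:p>
    <w:p><w:r><w:t>Thesis</w:t></w:r></w:p>
  </w:body>
</w:document>"""


def paragraph_texts(document_xml):
    """Return one string per <w:p>, joining the <w:t> text of its runs."""
    body = ET.fromstring(document_xml).find("w:body", NS)
    return [
        "".join(t.text or "" for t in p.iterfind("w:r/w:t", NS))
        for p in body.iterfind("w:p", NS)
    ]


paragraphs = paragraph_texts(SAMPLE)
```

Joining the runs of each paragraph matters because Word may split one reference across several runs when formatting changes inside it.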

Step 22. Extract the bibliographic items from the reference content.

Because a reference consists of several bibliographic items, the items must be segmented before they can be recognized. Specifically:

Symbol normalization step: examine the symbols in the reference to determine whether it contains non-half-width symbols and, if so, replace them with the corresponding half-width symbols;

Segmentation step: segment the bibliographic items according to the bibliographic punctuation symbols specified in the national standard for references, GB/T 7714-2005.

In GB/T 7714-2005, every bibliographic punctuation symbol is used as a prefix. For example, no symbol precedes the responsible party, which is the first bibliographic item of a reference, and "." is the prefix of the title item, of the title of an extracted document, and so on. Based on an analysis of GB/T 7714-2005, the different prefix symbols are used as split points to segment the bibliographic items.
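The normalization and segmentation steps above can be sketched as follows. This is a minimal Python illustration, not the implementation of the embodiment. The set of prefix symbols used for splitting (".", ":", ",") is an assumed subset chosen for the example, and the function names are hypothetical; a real segmenter would also need to protect periods inside initials or numbers.

```python
import re


def to_half_width(text):
    """Replace full-width symbols with their half-width counterparts."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:                 # ideographic space
            out.append(" ")
        elif code == 0x3002:               # ideographic full stop
            out.append(".")
        elif 0xFF01 <= code <= 0xFF5E:     # full-width ASCII variants
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)


# Assumed subset of prefix symbols used as split points.
PREFIX_SYMBOLS = re.compile(r"\s*[.:,]\s*")


def segment(reference):
    """Normalize the symbols, then split the reference at the prefix symbols."""
    normalized = to_half_width(reference)
    return [part for part in PREFIX_SYMBOLS.split(normalized) if part]


# The example uses escaped full-width period, colon, and comma.
items = segment(
    "Zhu Gang\uff0eNew fluid finite element method\uff0e"
    "Beijing\uff1aTsinghua University\uff0c1996\uff0e"
)
```

The result is the list of raw items (author, title, place, publisher, year) that is passed on to recognition in step 3.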

Sampling analysis of a number of graduate dissertations showed that formatting errors take many forms, and a probability model of writing errors can be derived statistically. For example, if a reference contains the half-width full stop "." as a separator, the full-width full stop "。" will almost never appear as a separator between bibliographic items in the same reference, and vice versa.

Step 3. Use a preset bibliographic item recognition model to recognize the references cited in the paper and extract their bibliographic items, the recognition model being obtained by learning from a preset corpus.

This embodiment of the invention uses the Stanford Named Entity Recognizer (NER), which is based on conditional random fields. NER labels entities by category, such as names of people, companies, regions, genes, and proteins. NER comes with carefully designed feature extractors for recognizing named entities, and a model is obtained by training. In principle, the more training data there is, that is, the more manually labeled text, the better NER performs. To meet new requirements the model must be retrained.

Step 3 therefore specifically comprises:

Step 31. Extract the corpus. The corpus consists of an extract of the annotated People's Daily corpus for January 1998 and the 2015 Peking University edition of the "Overview of Chinese Core Periodicals" (中文核心期刊要目总览).

Wherein:

1. Extract of the annotated People's Daily corpus for January 1998: the People's Daily corpus contains a high proportion of nouns such as personal names and place names, so it makes a good training corpus.

2. The 2015 Peking University edition of the "Overview of Chinese Core Periodicals": beyond personal names and place names, it is desirable to recognize commonly used journal names and keywords that often appear in paper titles, so the "Overview of Chinese Core Periodicals" is used together with the People's Daily corpus.

For example, the titles of graduation theses and journal articles often contain the word "基于" ("based on"). The People's Daily corpus is therefore supplemented with the journal names that appear in the 2015 Peking University edition of the "Overview of Chinese Core Periodicals" and with keywords found by counting to be common in titles. These parts are combined into the training set for this experiment, saved in the file testdata.tsv, and used for closed testing of the system.

In addition, references extracted from the end-of-text reference lists of graduation theses at the university are assembled into a test set for open testing of the system.

The extracted reference content may be as shown in Figure 4.

Step 32. Train on the preset corpus with the NER algorithm to obtain the bibliographic item recognition model.

NER provides two ways of training a model: from the command line and from a configuration file.

In this embodiment of the invention, the configuration file approach may be used.

Specifically, the configuration file in Stanford NER is named austen.prop, and its parameters are modified as shown in Table 1.

Table 1. Parameters modified in austen.prop

Here, trainFile specifies the data set used for training and serializeTo specifies the name of the model produced by training. Save the modified configuration file, place it in the program's root directory together with the training data set testdata.tsv, and execute the following command:

java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop austen.prop

When the command succeeds, ner-model.ser.gz is generated in the directory; this is the model obtained from the training data.

Once the bibliographic item recognition model has been obtained, it can be used to recognize the bibliographic items produced in step 2.

The information in a reference may be incomplete, and the reference type may be missing. Because the accuracy of the reference type strongly affects the subsequent format check, this embodiment of the invention may further include step 33, which determines the reference type from the bibliographic items.

Specifically, step 33 comprises:

Step 331: Build a decision tree over the bibliographic items.

Take again the reference content shown in Figure 4, which comprises 10 correctly formatted references. As Figure 4 shows, each reference consists of many bibliographic items, and different types of references are composed of different items. Analyzing the 10 references yields the bibliographic items and the descriptions of their attribute values summarized in Table 2.

Table 2. Bibliographic items of references and descriptions of their attribute values

After data transformation, the information model of each reference in Figure 4 is as shown in Table 3:

Table 3. Information model of each reference in Figure 4

The bibliographic item decision tree shown in Figure 5 can therefore be built from Table 3. The decision tree in Figure 5 can be used to predict the type of a document whose type is unknown. Consider, for example, the following reference:

朱刚.新型流体有限元法及叶轮机械正反混合问题.北京：清华大学,1996. (Zhu Gang. New fluid finite element method and the forward and reverse mixing problem of impeller machinery. Beijing: Tsinghua University, 1996.)

According to the decision tree shown in Figure 5, it can be predicted to be a dissertation.

要生成如图5所示的决策树，其中有两个关键问题：To generate the decision tree shown in Figure 5, there are two key issues:

一是如何从众多的输入变量中选择一个当前最佳的分组变量？比如为什么要把出版者类型作为决策树的根节点？为什么选择著者类型作为下层的子节点而不是其它著录项？One is how to select a currently best grouping variable from numerous input variables? For example, why should the publisher type be used as the root node of the decision tree? Why choose the author type as the child node of the lower layer instead of other bibliographic items?

二是如何从分组变量的众多取值中找到一个最佳的分割点？比如出版者类型为名称类型，其属性包括“期刊”、“教育机构”、“其它”、“出版社”，为什么选择“教育机构”作为分割点？解决了这两个关键问题即可以容易地构造出决策树。The second is how to find an optimal split point from the many values of the grouping variables? For example, the publisher type is a name type, and its attributes include "journals", "educational institutions", "others", and "publishers". Why do you choose "educational institutions" as the split point? After solving these two key problems, a decision tree can be easily constructed.

The concept of "purity" must be introduced for the decision tree. Three purity measures are in common use: the Gini index (Gini), entropy (Entropy) and error rate (Error). In the embodiment of the present invention they can be calculated as follows.

Suppose an attribute of a bibliographic item has n distinct attribute values i (i = 1, 2, ..., n), and the proportion of each is p(i) = (number of occurrences of the i-th attribute value) / (total number of values of the attribute); p(i) lies in [0, 1].

$Gini = 1 - \sum_{i=1}^{n} p(i)^{2}$   (1)

$Entropy = -\sum_{i=1}^{n} p(i) \log_{2} p(i)$   (2)

$Error = 1 - \max\{ p(i) \mid i \in [1, n] \}$   (3)
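The three purity measures can be sketched in a few lines of Python. This is only an illustration of the definitions above, not code from the embodiment; the function names and the sample type codes ("J", "D") are assumptions made for the example.

```python
from collections import Counter
from math import log2


def proportions(values):
    """Return p(i) for each distinct value: its count divided by the total."""
    total = len(values)
    return [count / total for count in Counter(values).values()]


def gini(values):
    """Gini index: 1 minus the sum of squared proportions."""
    return 1.0 - sum(p * p for p in proportions(values))


def entropy(values):
    """Entropy in bits: minus the sum of p(i) * log2 p(i)."""
    return -sum(p * log2(p) for p in proportions(values))


def error_rate(values):
    """Error rate: 1 minus the largest proportion."""
    return 1.0 - max(proportions(values))


# Hypothetical samples: a pure set and an evenly mixed set of type codes.
pure = ["J", "J", "J", "J"]
mixed = ["J", "J", "D", "D"]

for name, sample in (("pure", pure), ("mixed", mixed)):
    print(name, gini(sample), entropy(sample), error_rate(sample))
```

All three measures are zero for the pure sample and larger for the mixed one.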

For all three purity measures in formulas 1-3, a larger value means more "impure" and a smaller value means more "pure". In practice the choice among the three has little effect on the final classification accuracy. The embodiment of the present invention also uses the entropy formula to derive two commonly used attribute selection measures: information gain (Gain), formula 4, and information gain ratio (GainRate), formula 5.

$Gain(U, V) = Entropy(U) - Entropy(U \mid V)$   (4)

$GainRate(U, V) = Gain(U, V) / Entropy(V)$   (5)

Information theory treats the transmission of information as carried out by a system consisting of a source, a channel and a sink: the source sends the information and the sink receives it. Taking the prediction of the reference type marker above as an example, author type (T1), report number (T2), patent number (T3), publisher type (T4), year-volume-issue marker (T5) and page number (T6) are the input variables, and the reference type marker is the output variable. The decision tree regards the output variable (the reference type marker) as the information U sent by the source, and the input variables as the series of information V received by the sink.


The information gain ratio (GainRate) is used to resolve each of the two key questions above. The calculation proceeds as follows.

Taking author type T1 as an example, compute Entropy(U), Entropy(U|T1), Gain(U,T1) and GainRate(U,T1) in turn. Among the references there are 2 each of type journal, dissertation and book, and 1 each of type report, conference proceedings, patent and standard.

Suppose the output variable U has M distinct values u_i (i = 1, 2, ..., M), each occurring with proportion p(u_i), and author type T1 has N distinct attribute values t_1j (j = 1, 2, ..., N).

$Entropy(U) = -\sum_{i=1}^{M} p(u_i) \log_{2} p(u_i) = 1.832$

$Entropy(U \mid T1) = \sum_{j=1}^{N} p(t_{1j}) \left( -\sum_{i=1}^{M} p(u_i \mid t_{1j}) \log_{2} p(u_i \mid t_{1j}) \right) = 1.279$

$Gain(U, T1) = Entropy(U) - Entropy(U \mid T1) = 0.553$

$GainRate(U, T1) = Gain(U, T1) / Entropy(T1) = 0.628$

Thus the information gain ratio of author type (T1) is 0.628. The other bibliographic items are computed in the same way; T4 has the largest information gain ratio, 1.275, so T4 is selected as the best grouping variable, that is, the root node of the decision tree.
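The selection of the grouping variable by information gain ratio can be sketched as follows. The six-row data set, the attribute names `publisher_type` and `author_type`, and the type codes are hypothetical and chosen only to keep the arithmetic small; they are not the 10 references of Fig. 4 and do not reproduce the figures quoted above.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(values):
    """Entropy in bits of the empirical distribution of values."""
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in Counter(values).values())


def conditional_entropy(labels, attribute):
    """Entropy(U|V): entropy of the labels within each attribute value, weighted by p(v)."""
    groups = defaultdict(list)
    for label, value in zip(labels, attribute):
        groups[value].append(label)
    total = len(labels)
    return sum(len(group) / total * entropy(group) for group in groups.values())


def gain(labels, attribute):
    """Gain(U, V) = Entropy(U) - Entropy(U|V)."""
    return entropy(labels) - conditional_entropy(labels, attribute)


def gain_rate(labels, attribute):
    """GainRate(U, V) = Gain(U, V) / Entropy(V); defined as 0 when V has one value."""
    split_info = entropy(attribute)
    return gain(labels, attribute) / split_info if split_info > 0 else 0.0


def best_attribute(labels, table):
    """Return the name of the attribute with the largest information gain ratio."""
    return max(table, key=lambda name: gain_rate(labels, table[name]))


# Hypothetical toy data: reference type codes and two input attributes.
labels = ["J", "J", "D", "D", "M", "M"]
table = {
    "publisher_type": ["journal", "journal", "school", "school", "press", "press"],
    "author_type": ["individual", "individual", "individual",
                    "individual", "group", "individual"],
}

print(best_attribute(labels, table))
```

In this toy data the publisher type separates the three classes completely, so it has the larger gain ratio and would become the root node.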

Publisher type has 4 attribute values: "journal", "educational institution", "other" and "press". The split point is chosen by a calculation similar to the one above; "educational institution" has the largest information gain ratio, 3.948, so it is selected as the best split point.

As the analysis above shows, the decision tree of the embodiment of the present invention is an intuitive decision analysis method. The decision tree model is readable and descriptive, which helps manual analysis; it is efficient to execute, since it is built once and can then be used repeatedly; and expert prior knowledge can be embedded in it naturally.

The bibliographic items of references may suffer from inconsistent data, duplicate data, noisy data and high dimensionality, so the data must be preprocessed before the items are classified. That is, step 33 further includes:

Step 332: preprocess the data.

Specifically, the data preprocessing includes:

Step 3321: check the completeness of the bibliographic items of the reference.

Decision tree variables are of two types, numeric and nominal, so the main preprocessing task before the tree is built is to convert data that is neither numeric nor nominal into one of these two types.

When attributes are selected from the raw data for data mining, the following principles are applied: give attribute names and attribute values a clear meaning wherever possible, remove duplicate data, remove negligible fields, and choose associated fields sensibly. The preprocessing is described in detail below.

The raw data is the set of extracted references. Each reference is split into its bibliographic items, and each item can be regarded as an attribute of the record. Table 4 shows a fragment selected from the raw data.

Table 4 Raw data records

As the table above shows, after a reference is split some field values are vacant and some fields can be ignored, so step 332 may further include the following three sub-steps:

Step 3322: check whether the reference has missing bibliographic items and, if so, fill the vacant values according to the related bibliographic items in the reference.

For example, fields such as "patent country", "patent number" and "report number" in Table 4 may be vacant; all vacant values in the selected data are then filled. A vacant value is filled with a value of the same type as the values already present in that field: if some existing records hold numeric values in the field, the vacant values are filled with numeric values, and if they hold nominal values, the vacant values are filled with nominal values.

Step 3323: delete negligible bibliographic items according to their relevance. For example, in the references shown in Fig. 4 the serial numbers (the leading 1, 2, 3, ..., 10) contribute nothing to the prediction and only add computational complexity, so they can be deleted. Likewise, the place of publication does not affect the reference type, whatever its value, so the "place of publication" field can be ignored.

Step 3324: generalize the data. In the raw records, the values of each field can be summarized into a few classes. For example, the values of the "responsible person" field fall into two classes: names of individual persons and names of organizations. The reference type is predicted from whether the "responsible person" field holds a personal name or an organization name, and does not depend on what the particular name is. The "responsible person" field can therefore be generalized into two classes, personal name and organization name. All data in similar situations is generalized in the same way.
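Sub-steps 3322 to 3324 can be sketched as three small functions. This is a minimal sketch under stated assumptions: the field names, the `0` and `"NUL"` placeholders, the keyword list used to spot organization names and the sample records (including the report number `R-01`) are all hypothetical and are not taken from Tables 4 and 5.

```python
# Assumed names of fields that carry no information about the reference type.
IGNORED_FIELDS = {"serial_number", "place_of_publication"}
# Assumed keywords that mark a responsible person as an organization.
ORG_HINTS = ("University", "Institute", "Committee", "大学", "研究所", "委员会")


def is_vacant(value):
    return value is None or value == ""


def fill_missing(records):
    """Fill vacant values with a placeholder of the same type as the existing values."""
    fields = sorted({field for record in records for field in record})
    filled = [dict(record) for record in records]
    for field in fields:
        present = [r[field] for r in records if not is_vacant(r.get(field))]
        numeric = bool(present) and all(isinstance(v, (int, float)) for v in present)
        placeholder = 0 if numeric else "NUL"
        for record in filled:
            if is_vacant(record.get(field)):
                record[field] = placeholder
    return filled


def drop_ignored(records):
    """Delete fields that do not help to predict the reference type."""
    return [{k: v for k, v in r.items() if k not in IGNORED_FIELDS} for r in records]


def generalize_author(name):
    """Generalize a responsible person to an individual or a group."""
    return "PER.Group" if any(hint in name for hint in ORG_HINTS) else "PER.Individual"


def preprocess(records):
    result = drop_ignored(fill_missing(records))
    for record in result:
        if record.get("author") not in (None, "NUL"):
            record["author"] = generalize_author(record["author"])
    return result


# Hypothetical raw records.
RECORDS = [
    {"serial_number": 1, "author": "朱刚", "place_of_publication": "北京",
     "year": 1996, "report_number": ""},
    {"serial_number": 2, "author": "Example University", "place_of_publication": "",
     "year": None, "report_number": "R-01"},
]

for row in preprocess(RECORDS):
    print(row)
```

The keyword test for organizations is deliberately crude; a named entity recognizer would be a more robust way to make the same distinction.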

After the steps above, the preprocessed data shown in Table 5 is obtained.

Table 5 Preprocessed data

As Table 5 shows, the preprocessed fields and field values are all in English. This is only one option in the embodiment of the present invention; the preprocessed fields and values may be expressed in any other form. Because the embodiment uses the WEKA system to determine the reference type, English field names and values give better computational results.

The field values are illustrated with the example in Table 5:

The "responsible person" field takes the values PER.Individual and PER.Group. PER denotes a person or group of persons mentioned in the reference; PER.Individual and PER.Group are subclasses of PER denoting an individual and a group or organization, respectively.

The "title feature tag" field takes values such as title_D_tag and title_C_tag. For example, title_C_tag means that the title contains the "conference proceedings" feature tag, as the titles of conference proceedings do; the titles of reports generally contain a "report" feature tag. For other types, which have no feature tag, the value is recorded as no.

The "publisher" field takes the values PUB.Press, PUB.Journal, PUB.School, PUB.Institution and NUL, which denote non-school presses, journals, school presses and research institutes respectively; NUL denotes a missing item.

In step 33, after the decision tree has been built in step 331 and the data preprocessed in step 332, the type of the reference must be determined. That is, step 33 further includes step 333, type determination using the decision tree and the preprocessed data, which in this embodiment is carried out on the WEKA platform.

Data mining on the WEKA platform proceeds as follows:

1) import the data set to be tested;

2) preprocess the data to be tested (already done in step 332);

3) run the processed data set through different learning schemes and build a prediction model to predict unknown instances;

4) evaluate and visualize the prediction results.

These four steps are described in detail below.

Accordingly, step 333 specifically includes:

Step 3331: import the data set to be tested.

WEKA can process data in CSV and ARFF format, with ARFF preferred, so ARFF files are used here and the file must be converted before the data set is imported. The raw data is stored in an Excel file, which is first converted to CSV and then to ARFF. Part of the records in the ARFF file are shown in Fig. 6.
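The CSV-to-ARFF conversion can be illustrated with a small Python function. This is a sketch that assumes every column is nominal; the column names and rows are hypothetical, and WEKA's own converters would normally do this job.

```python
import csv
import io


def csv_to_arff(csv_text, relation):
    """Convert CSV text with a header row into ARFF text, treating every column as nominal."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    fields = list(rows[0].keys())
    lines = [f"@relation {relation}", ""]
    for field in fields:
        # A nominal attribute lists its permitted values in braces.
        values = sorted({row[field] for row in rows if row[field] != ""})
        lines.append(f"@attribute {field} {{{','.join(values)}}}")
    lines += ["", "@data"]
    for row in rows:
        # ARFF marks a missing value with a question mark.
        lines.append(",".join(row[field] or "?" for field in fields))
    return "\n".join(lines) + "\n"


# Hypothetical preprocessed records.
CSV_TEXT = (
    "author,publisher,type\n"
    "PER.Individual,PUB.Journal,J\n"
    "PER.Individual,PUB.School,D\n"
    "PER.Group,PUB.Press,M\n"
)

arff = csv_to_arff(CSV_TEXT, "references")
print(arff)
```

Values that contain spaces or commas would need quoting in a real ARFF file; the sketch leaves that out.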

Step 3332: obtain the data preprocessed in step 332.

Step 3333: select specific classification algorithms for training and test classification. The classification module of the WEKA system integrates about 50 classification algorithms; the embodiment of the present invention selects three classic ones, NaiveBayes, J48 (decision tree) and ZeroR, to classify the test set.

Step 3334: evaluate the results of the different classification algorithms. There are many ways to assess classification accuracy, chiefly cross-validation, holdout, leave-one-out and resubstitution (back-substitution). Cross-validation and holdout are the most common. Leave-one-out can be regarded as a special case of cross-validation. Resubstitution is generally not used, because it evaluates an overfitted model and therefore overstates classification accuracy. Visualization can be applied either to the result of a single classification run or to a data set. Visualizing a data set shows a two-dimensional scatter plot for each pair of attributes; visualizing the output of a classification run shows the classification errors, the tree, cost curves, ROC curves and so on, which are used to evaluate the performance of each learning scheme.
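Cross-validation and its leave-one-out special case can be illustrated with ZeroR, the simplest of the three classifiers named above. This is a plain-Python sketch, not WEKA code; the label list and the way folds are formed (every k-th item) are assumptions made for the example.

```python
from collections import Counter


def zero_r(train_labels):
    """ZeroR ignores all attributes and always predicts the majority class."""
    return Counter(train_labels).most_common(1)[0][0]


def k_fold_accuracy(labels, k):
    """Estimate ZeroR accuracy by k-fold cross-validation; fold f holds items with index % k == f."""
    correct = 0
    for fold in range(k):
        test = [y for i, y in enumerate(labels) if i % k == fold]
        train = [y for i, y in enumerate(labels) if i % k != fold]
        prediction = zero_r(train)
        correct += sum(1 for y in test if y == prediction)
    return correct / len(labels)


# Hypothetical reference type codes for 10 references.
labels = ["J", "J", "J", "J", "D", "D", "M", "M", "J", "J"]

five_fold = k_fold_accuracy(labels, k=5)
leave_one_out = k_fold_accuracy(labels, k=len(labels))  # k equal to the data set size
print(five_fold, leave_one_out)
```

Because ZeroR uses no attributes, its accuracy is a baseline that any useful classifier, such as the decision tree, should exceed.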

Although some algorithms determine the document type marker with relatively high accuracy, 100% accuracy cannot be achieved, and this affects the accuracy of the final format error detection.

To minimize the prediction error, the embodiment of the present invention uses feature tags. After the decision tree has determined the document type marker, a second determination is made from the feature tag. If the two agree, that result is final; if they differ, the result determined from the feature tag prevails. Table 6 lists the correspondence between feature tags and reference types.

Table 6 Relationship between feature tags and document types
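The override rule described above can be written as a short function. The tag-to-type mapping below is an assumption made for illustration: only `title_C_tag` and `title_D_tag` are named in the text, `title_R_tag` and the single-letter type codes are hypothetical, and the actual correspondence is the one given in Table 6. Falling back to the decision tree when no feature tag is present is also an assumption.

```python
# Hypothetical mapping from title feature tags to reference type codes.
FEATURE_TAG_TO_TYPE = {
    "title_C_tag": "C",  # conference proceedings
    "title_D_tag": "D",  # dissertation
    "title_R_tag": "R",  # report
}


def final_type(tree_prediction, feature_tag):
    """Combine the decision tree result with the result derived from the feature tag."""
    tag_type = FEATURE_TAG_TO_TYPE.get(feature_tag)
    if tag_type is None:
        # No usable feature tag (for example the value "no"): keep the tree result.
        return tree_prediction
    # If both agree the result is the same either way; if not, the tag prevails.
    return tag_type


print(final_type("J", "title_D_tag"))
print(final_type("J", "no"))
```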

Step 4: check the identified bibliographic items of the reference using the format rule code for bibliographic items. Specifically, an XML document is generated from the identified items according to the reference type marker and is then validated with the Schema. If validation succeeds, the reference is correctly formatted; otherwise its format contains errors.

In the embodiment of the present invention, the XML document generated for a journal-type reference is as follows:

<?xml version="1.0" encoding="GB2312" standalone="no"?>
<reference xmlns="http://www.w3school.com.cn"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3school.com.cn J_pre.xsd">
    <author authorLoc="1">陈路瑶</author>
    <title titleLoc="2">信息文档结构信任模式的提取及逻辑描述</title>
    <type typeLoc="2">J_pre</type>
    <publish publishLoc="3">北京</publish>
    <publisher publisherLoc="4">计算机应用研究</publisher>
    <publish_year publish_yearLoc="5">2015</publish_year>
    <volumn_mark volumn_markLoc="6">27</volumn_mark>
    <page_number page_numberLoc="7">4624-4629</page_number>
</reference>
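Generating such a document from the identified items can be sketched with Python's standard library. The function name and the triple layout (element name, position, text) are assumptions made for illustration, and the namespace declarations of the document above are omitted for brevity. Validation against the Schema is not shown, because the Python standard library has no XSD validator.

```python
import xml.etree.ElementTree as ET


def build_reference_xml(items):
    """Build a <reference> element from (element name, position, text) triples."""
    root = ET.Element("reference")
    for name, position, text in items:
        # Each item records its position in an attribute named <element>Loc.
        child = ET.SubElement(root, name, {f"{name}Loc": str(position)})
        child.text = text
    return ET.tostring(root, encoding="unicode")


# Items of the journal reference shown above, in recognition order.
ITEMS = [
    ("author", 1, "陈路瑶"),
    ("title", 2, "信息文档结构信任模式的提取及逻辑描述"),
    ("type", 2, "J_pre"),
    ("publish", 3, "北京"),
    ("publisher", 4, "计算机应用研究"),
    ("publish_year", 5, "2015"),
    ("volumn_mark", 6, "27"),
    ("page_number", 7, "4624-4629"),
]

xml_text = build_reference_xml(ITEMS)
print(xml_text)
```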

When a document is validated against the Schema template and fails, the validator returns an error message. The message consists of an error type and an error description: the error type indicates roughly what went wrong, and the description is needed to locate the error precisely. Three common error types are summarized below:

(1) cvc-complex-type.3.1. An attribute value in the XML does not match the attribute value defined in the Schema, for example because the order of tags is reversed.

(2) cvc-complex-type.2.4.a. The logical structure of the XML file does not conform to the Schema, for example because an element appears that the Schema does not define.

(3) cvc-complex-type.2.4.b. The content of the XML file is incomplete, for example because an item is missing.
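The three error types can be mapped to the kind of bibliographic item error with a small lookup. The category labels and the sample message strings below are illustrative assumptions; real validator messages differ in wording, and only the leading error code is relied on here.

```python
import re

# Assumed correspondence, following the examples given for each error type above.
ERROR_KINDS = {
    "cvc-complex-type.3.1": "out of order",
    "cvc-complex-type.2.4.a": "extra item",
    "cvc-complex-type.2.4.b": "missing item",
}

# An error code: "cvc-", a rule name, dotted numbers, and an optional letter suffix.
CODE_PATTERN = re.compile(r"cvc-[a-z-]+(?:\.\d+)+(?:\.[a-z])?")


def classify(message):
    """Return the error kind for a validator message, or "unknown"."""
    match = CODE_PATTERN.search(message)
    if match is None:
        return "unknown"
    return ERROR_KINDS.get(match.group(0), "unknown")


# Illustrative messages, not verbatim validator output.
print(classify("cvc-complex-type.2.4.b: content of element 'reference' is not complete"))
print(classify("cvc-complex-type.3.1: attribute 'titleLoc' has an unexpected value"))
```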

Every format error that occurs in a reference falls into one or more of these three types. The error detection procedure is shown in Table 7.

Table 7 Reference format error detection algorithm

In the algorithm of Table 7, R is the set of references to be checked and r is one reference in R. ERRORS is the set of error types for references that fail XML Schema validation, and Er is the error type corresponding to one reference. After detection, locating the erroneous item requires converting the error description supplied by the validator into position information. Fig. 7 illustrates the conversion with an example.

As Fig. 7 shows, bibliographic item errors are of three kinds: extra items, missing items and items out of order. The position numbers change differently in each case, so Algorithm 2, shown in Table 8, is designed around the position numbers and the content of the bibliographic items.

Table 8 Algorithm for locating erroneous items in references
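One way to locate the three kinds of error is to compare the expected item order with the observed one. The sketch below is not Algorithm 2 of Table 8, which is not reproduced in this text; it is a simplified illustration that assumes each item occurs at most once, and the item names (including `isbn`) are hypothetical.

```python
def locate_errors(expected, observed):
    """Compare expected and observed item orders.

    Returns (kind, item, position) tuples with 1-based positions: the position in
    the observed list for extra and out-of-order items, and the position in the
    expected list for missing items.
    """
    errors = []
    for position, item in enumerate(observed, start=1):
        if item not in expected:
            errors.append(("extra", item, position))
    for position, item in enumerate(expected, start=1):
        if item not in observed:
            errors.append(("missing", item, position))
    # Items present on both sides should appear in the same relative order.
    common_expected = [item for item in expected if item in observed]
    common_observed = [item for item in observed if item in expected]
    for want, got in zip(common_expected, common_observed):
        if want != got:
            errors.append(("out of order", got, observed.index(got) + 1))
    return errors


EXPECTED = ["author", "title", "publisher", "publish_year", "page_number"]

print(locate_errors(EXPECTED, ["author", "title", "publish_year", "publisher"]))
```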

Following the analysis of the algorithms, the 10 references to be checked in Fig. 8 are taken as an example and checked for conformance with the system; the results are shown in Fig. 9 and Fig. 10.

This patent applies to checking the format of the references listed at the end of a text and correcting reference formats that are wrong. In the present invention reference information is extracted from Microsoft Word documents; the method applies equally to extracting references from text files. An example follows.

Fig. 9 shows part of the results of bibliographic item identification. Taking the first reference in Fig. 8 as an example, the first 8 lines of Fig. 9 are the identification results for that reference. The first line, "J_pre", indicates that the first reference lacks a document type marker and that type determination predicted it to be of journal type. The second line identifies "陈路瑶" (Chen Luyao) as the author. The third line identifies "信息文档结构信任模式的提取及逻辑描述" (Extraction and logical description of information document structure trust model) as the title. The fourth line identifies "北京" (Beijing) as the place of publication. The fifth line identifies "计算机应用研究" (Application Research of Computers) as a publisher of journal type. The sixth line identifies "2010" as the year of publication. The seventh line identifies "27" as the volume. The eighth line identifies "4624-4629" as the page numbers.

Fig. 10 shows the check results for the 10 references. For references whose format does not conform, the specific error position is reported and a correction is suggested, to make revision easier. Fig. 11 shows the XML file generated during the check.

The present invention has the following beneficial effects:

The above is the preferred embodiment of the present invention. It should be noted that a person of ordinary skill in the art may make various improvements and refinements without departing from the principles described herein, and such improvements and refinements shall also be regarded as falling within the scope of protection of the present invention.

Claims

1. A reference format checking method, characterized by comprising:

Step 1: expressing the format rules for the bibliographic items of references using Schema, wherein the bibliographic item format includes at least one of the following bibliographic items: responsible person, title, reference type, publisher, publication date, page number;

Step 2: reading each reference and segmenting it into bibliographic items;

Step 3: identifying the bibliographic items of the reference and extracting the identified items as XML nodes, wherein the bibliographic items include at least one of the following: responsible person, title, place of publication, publisher, publication date, etc.; and, at the same time, judging whether the bibliographic items of the reference include a document type marker and, if not, adding the document type marker of the reference according to its bibliographic items;

Step 4: verifying the bibliographic items using the format rules for the bibliographic items of references.

2. The reference format checking method according to claim 1, characterized in that the method further comprises:

Step 5: when a bibliographic item of the reference contains an error, correcting the bibliographic item, specifically including:

when the error is a missing item, completing the bibliographic item, adding punctuation, and reassembling a reference in the standard format;

when the error is an extra item, deleting the bibliographic item, adding punctuation, and reassembling a reference in the standard format;

when the error is an incorrect item, correcting it according to the standard format, adding punctuation, and reassembling a reference in the standard format.

3. The reference format checking method according to claim 1, characterized in that step 2 includes:

Step 21: using Apache POI to read the document and extract the reference content;

Step 22: segmenting the extracted reference content to obtain the bibliographic items, including:

identifying the symbols in the reference to judge whether the reference contains non-half-width symbols and, if so, replacing them with the corresponding half-width symbols;

segmenting the bibliographic items according to the symbols prescribed for bibliographic description.

4. The reference format checking method according to claim 1, characterized in that step 3 includes: identifying the references listed in the paper with a preset bibliographic item recognition model to extract the bibliographic items of the references, wherein the bibliographic item recognition model is obtained by learning from a preset training corpus; specifically including:

Step 31: extracting the training corpus;

Step 32: training on the preset corpus with an NER algorithm to obtain the bibliographic item recognition model;

Step 33: judging whether the reference includes a reference type parameter and, if not, judging the type of the reference from its bibliographic items.

5. The reference format checking method according to claim 4, characterized in that step 33 includes:

Step 331: building a decision tree for the bibliographic items, specifically including:

calculating the Gini index (Gini), entropy (Entropy) and error rate (Error) with the following formulas:

$Gini = 1 - \sum_{i=1}^{n} p(i)^{2}$

$Entropy = -\sum_{i=1}^{n} p(i) \log_{2} p(i)$

$Error = 1 - \max\{ p(i) \mid i \in [1, n] \}$

and calculating the information gain Gain and the information gain ratio GainRate:

$Gain(U, V) = Entropy(U) - Entropy(U \mid V)$

$GainRate(U, V) = Gain(U, V) / Entropy(V)$

so as to determine the root node and the best grouping variable of the decision tree;

Step 332: preprocessing the data, specifically including: checking the completeness of the bibliographic items of the reference, so as to convert data that is neither numeric nor nominal into numeric or nominal type; checking whether the reference has missing bibliographic items and, if so, filling the vacant values according to the related bibliographic items in the reference; deleting negligible bibliographic items according to the relevance of the bibliographic items; and generalizing the data;

Step 333: performing type determination using the decision tree of the reference and the preprocessed data. In the embodiment of the present invention, the WEKA platform is used for the type determination.

6. The reference format checking method according to claim 5, characterized in that step 333 specifically includes:

Step 3331: importing the data set to be tested;

Step 3332: obtaining the test data preprocessed in step 332;

Step 3333: running the processed data set through different learning schemes and building a prediction model to predict unknown instances;

Step 3334: evaluating the prediction results.