CN105824791A - Reference format checking method - Google Patents

Info

Publication number
CN105824791A
Authority
CN
China
Prior art keywords
references
list
bibliographical particulars
bibliographical
type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610153946.0A
Other languages
Chinese (zh)
Other versions
CN105824791B (en)
Inventor
李宁
侯霞
赵琳
田英爱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Information Science and Technology University
Original Assignee
Beijing Information Science and Technology University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Information Science and Technology University
Priority to CN201610153946.0A
Publication of CN105824791A
Application granted
Publication of CN105824791B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/12: Use of codes for handling textual entities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a reference format checking method. The method comprises the following steps: 1, expressing a reference bibliographical particular format rule through Schema, wherein a reference bibliographical particular format comprises at least one of the following bibliographical particulars: an author, a title, a reference type, a publisher, a publication date and a page number; 2, reading references, and partitioning the bibliographical particulars; 3, recognizing the reference bibliographical particulars, extracting the recognized bibliographical particulars into XML (Extensible Markup Language) nodes, judging whether the reference bibliographical particulars include a literature type tag or not at the same time, and if not, adding the literature type tag of the reference according to the bibliographical particulars, wherein the bibliographical particulars include at least one of the author, the title, a publication place, the publisher, the publication date and the like; 4, verifying the description particulars by using the reference bibliographical particular format rule.

Description

A reference format checking method
Technical field
The present invention belongs to the technical field of text processing, and in particular relates to a reference format checking method.
Background art
In papers of all kinds it is almost inevitable to cite previously published references to help readers understand the background of the paper. A citation generally has to give the reference's author, title, publisher (i.e. where the work was published), page numbers (publish page) and publication date (publish year). However, in collections that gather many papers, such as conference proceedings and large journals, every paper cites a large number of references, so it is difficult to ensure that every paper cites its references in the same format.
At present, checking relies entirely on the reviewer, who checks the format requirements while reviewing the manuscript, after which an editor checks them again. A purely manual check of this kind can hardly guarantee that nothing is missed.
Summary of the invention
Manual checking of reference formats in the prior art is prone to omissions and cannot ensure that every paper in a collection or journal cites references according to the same rules. To address this, the technical problem to be solved by the present invention is to provide a reference format checking method and system that can automatically check whether the references cited in an electronic paper conform to preset rules, so as to ensure that reference formats are standardized, improve efficiency and prevent omissions.
To solve the above problem, an embodiment of the present invention proposes a reference format checking method, comprising:
Step 1: express the format rules for the bibliographic items (bibliographical particulars) of references in Schema, wherein the bibliographic item format of a reference includes at least one of the following items: author, title, reference type, publisher, publication date, page number;
Step 2: read each reference and segment it into bibliographic items;
Step 3: recognize the bibliographic items of the reference and extract the recognized items into XML nodes, wherein the bibliographic items include at least one of the following: author, title, place of publication, publisher, publication date, etc.; at the same time, judge whether the bibliographic items of the reference include a document type code, and if not, add the document type code of the reference according to its bibliographic items;
Step 4: verify the bibliographic items using the bibliographic item format rules.
Wherein, the method further includes:
Step 5: when a bibliographic item of the reference contains an error, correct the item; specifically:
when the error is a missing item, fill in the item, add punctuation and reassemble the reference in the standard format;
when the error is an extra item, delete the item, add punctuation and reassemble the reference in the standard format;
when the error is a wrong item, correct it according to the standard format, add punctuation and reassemble the reference in the standard format.
Wherein, step 2 includes:
Step 21: use Apache POI to parse the document and extract the reference content;
Step 22: segment the extracted reference content to obtain the bibliographic items, including:
identifying the symbols in the reference to judge whether it contains non-half-width (full-width) symbols and, if so, replacing them with the corresponding half-width symbols;
segmenting the reference into bibliographic items at the description symbols.
Wherein, step 3 includes: using a preset bibliographic item recognition model to recognize the references cited in the text of the paper so as to extract their bibliographic items, wherein the bibliographic item recognition model is obtained by learning from a preset corpus; specifically:
Step 31: extract a corpus;
Step 32: using the preset corpus, train with the NER algorithm to obtain the bibliographic item recognition model;
Step 33: judge whether the reference includes a reference type parameter and, if not, use the bibliographic items of the reference to determine its type.
Wherein, step 33 includes:
Step 331: construct the decision tree of bibliographic items; specifically:
calculate the Gini index (Gini), entropy (Entropy) and error rate (Error) by the following formulas:
$$\mathrm{Gini} = 1 - \sum_{i=1}^{n} p(i)^2$$
$$\mathrm{Entropy} = -\sum_{i=1}^{n} p(i)\,\log_2 p(i)$$
$$\mathrm{Error} = 1 - \max\{\, p(i) \mid i \in [1, n] \,\}$$
and calculate the information gain Gain and the information gain ratio GainRate:
$$\mathrm{Gain}(U, V) = \mathrm{Entropy}(U) - \mathrm{Entropy}(U \mid V)$$
$$\mathrm{GainRate}(U, V) = \mathrm{Gain}(U, V) / \mathrm{Entropy}(V)$$
so as to determine the root node and the best grouping variable of the decision tree;
Step 332: preprocess the data, which includes: checking the completeness of the bibliographic items of the reference, and converting data that are neither numeric nor nominal into numeric or nominal types; checking whether the reference lacks any bibliographic item and, if so, filling in the missing value from related bibliographic items in the reference; deleting meaningless bibliographic items according to the correlation between items; and generalizing the data;
Step 333: use the decision tree of the reference and the preprocessed data to determine the type. In the embodiment of the present invention, the WEKA platform is used for type determination.
Wherein, step 333 specifically includes:
Step 3331: import the data set to be tested;
Step 3332: obtain the test data preprocessed in step 332;
Step 3333: feed the processed data set into different learning schemes for learning, and build a prediction model to predict unknown instances;
Step 3334: evaluate the prediction results.
The beneficial effects of the above technical solution of the present invention are as follows:
With the emergence of large numbers of scientific papers, the relevant national departments have standardized academic journals, and the reference format standard has become a rule that authors and editors must follow. Authors have to learn the standard in order to write high-quality papers, and editors likewise have to learn it in order to check papers efficiently. Both therefore need a convenient tool for checking whether reference formats conform to the standard. Because different types of references have different formats, and a single type of reference has many bibliographic items, authors inevitably make mistakes when writing, so many non-conforming references still appear in scientific papers, which makes checking harder for editors. This project mainly addresses the problem of reference format conformance and has high practical value.
1) This research can make reference format checking more intelligent, reduce description errors in references, and improve the efficiency of reference format checking.
2) Each bibliographic item of a reference is correctly understood, which facilitates further mining and use of references in the future (for example analyzing citation information, evaluating the research level of academic works, and collating the research results of related authors).
The embodiment of the present invention can check reference format conformance, locate the exact position of an error and prompt how to correct it, which is convenient for researchers. The results of this project are of significant value for improving the quality of digital publishing, promoting the efficient dissemination and use of literature information, and saving the labor cost of typesetting.
Brief description of the drawings
Fig. 1 is a schematic flow chart of the embodiment of the present invention;
Fig. 2 is the code of the XML-based OOXML structure in a Word file;
Fig. 3 is a schematic diagram of the element hierarchy of a Word document;
Fig. 4 shows 10 correctly formatted references as an example;
Fig. 5 is a structural diagram of the reference decision tree of the embodiment of the present invention;
Fig. 6 is a schematic diagram of some of the records in the ARFF file;
Fig. 7 is a schematic diagram of the conversion in the embodiment of the present invention;
Fig. 8 shows 10 references to be checked as an example;
Fig. 9 shows part of the results of bibliographic item recognition;
Fig. 10 shows the check results for the references to be checked in Fig. 8;
Fig. 11 shows the XML file generated during checking.
Detailed description of the embodiments
To make the technical problem to be solved, the technical solution and the advantages of the present invention clearer, a detailed description is given below with reference to the drawings and specific embodiments.
The embodiment of the present invention proposes a reference format checking method, comprising:
Step 1: use Schema to define the reference format templates. The purpose is to verify the correctness of the XML file generated from the bibliographic items of a reference.
Step 2: read each reference and segment it into bibliographic items. Segmenting the items narrows the classes to be recognized and lays the foundation for recognizing the items in the next step; only when each item is segmented accurately can the accuracy of item recognition be improved.
Step 3: recognize the bibliographic items of the reference. On the basis of the segmentation in step 2, each bibliographic item of the reference is recognized; the items include author, title, place of publication, publisher, publication date, etc.
Step 4: extract the recognized bibliographic items into XML nodes.
Step 5: judge the document type code. The document type codes specified in GB/T 7714-2005 are as follows: monograph (M), collection (G), standard (S), journal (J), computer program (CP), dissertation (D), report (R), patent (P), database (DB), electronic bulletin board (EB), magnetic tape (MT), disk (DK), conference proceedings (C), CD-ROM (CD), newspaper (N), online (OL). The title item recognized in step 4 is searched for a document type code defined in GB/T 7714-2005.
Step 6: verify against the reference format templates described in step 1. If the reference contains a document type code, the Schema reference format template for that document type is invoked for verification; if it does not, the document type is determined first and the Schema template for that type is then invoked. If verification passes, the format of the reference is correct; if it fails, the format of the reference is wrong.
Step 7: determine the erroneous bibliographic items and correct them, including checking the order of the items, and finally generate a correct XML instance. The design is as follows: when the XML file fails Schema verification, the specific items that failed are extracted from the XML file. Failed items fall into three cases: a missing item, an extra item, or a wrong item. For a missing item, the item is filled in, punctuation is added and the reference is reassembled in the standard format. For an extra item, the item is deleted, punctuation is added and the reference is reassembled in the standard format. For a wrong item, the item is corrected according to the standard format, punctuation is added and the reference is reassembled in the standard format.
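The three error cases of step 7 can be illustrated with a minimal Python sketch. It is not the patented implementation: the `TEMPLATE` dictionary, the required/optional split, the four-digit year rule and the function name `classify_errors` are all assumptions introduced for illustration, with item names borrowed from the Schema example in this description.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical template: item name -> whether the item is required.
# The names follow the Schema example in this description; the
# required/optional split and the year rule are assumptions.
TEMPLATE: Dict[str, bool] = {
    "author": True, "title": True, "type": True, "publish": False,
    "publisher": True, "publish_year": True, "page_number": False,
}
VALIDATORS: Dict[str, Callable[[str], bool]] = {
    "publish_year": lambda v: re.fullmatch(r"\d{4}", v) is not None,
}

def classify_errors(items: Dict[str, str],
                    template: Dict[str, bool] = TEMPLATE,
                    validators: Dict[str, Callable[[str], bool]] = VALIDATORS
                    ) -> List[Tuple[str, str]]:
    """Return (item name, kind) pairs; kind is 'missing', 'wrong' or 'extra'."""
    errors: List[Tuple[str, str]] = []
    for name, required in template.items():
        if name not in items:
            if required:
                errors.append((name, "missing"))   # missing item
        elif name in validators and not validators[name](items[name]):
            errors.append((name, "wrong"))         # wrong item
    for name in items:
        if name not in template:
            errors.append((name, "extra"))         # extra item
    return errors

sample = {"author": "Zhu Gang", "title": "A title", "type": "D",
          "publish_year": "96", "isbn": "x"}
print(classify_errors(sample))
```

A real checker would take the failing items from the Schema validator's error report rather than from a hand-written template; the sketch only shows how the three cases differ.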
Each step of the embodiment of the present invention is described in detail below.
Step 1: express the bibliographic item format rules of references in code, wherein the bibliographic item format of a reference includes at least one of the following items: author, title, reference type, publisher, publication date, page number.
In the embodiment of the present invention, the rules can be expressed in the XML Schema language.
The following is an example of bibliographic item format rules expressed in XML Schema, here for conference proceedings:
<?xml version="1.0" encoding="GB2312"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns="http://www.w3school.com.cn"
    targetNamespace="http://www.w3school.com.cn"
    elementFormDefault="qualified">
  <xs:element name="reference">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="author" type="xs:string"/>
        <xs:element name="title" type="xs:string"/>
        <xs:element name="type" type="xs:string"/>
        <xs:element name="publish" type="xs:string" minOccurs="0"/>
        <xs:element name="publisher" type="xs:string"/>
        <xs:element name="publish_year" type="xs:string"/>
        <xs:element name="page_number" minOccurs="0">
          <xs:simpleType>
            <xs:restriction base="xs:string">
              <xs:pattern value="(\d{1,4}-)?\d{1,4}"/>
            </xs:restriction>
          </xs:simpleType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
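To see what the page_number restriction in the schema accepts, the pattern can be tried out directly. The sketch below is Python rather than an XML Schema validator, and it assumes the pattern is meant to read `(\d{1,4}-)?\d{1,4}`, i.e. either a single page or a page range of up to four digits each. The function name `is_valid_page_number` is introduced here for illustration.

```python
import re

# Assumed reading of the schema's page_number pattern.
PAGE_NUMBER = re.compile(r"(\d{1,4}-)?\d{1,4}")

def is_valid_page_number(value: str) -> bool:
    # An XSD pattern facet must match the whole value, so fullmatch is used
    # instead of search or match.
    return PAGE_NUMBER.fullmatch(value) is not None

for candidate in ["35", "120-135", "12345", "12 - 34"]:
    print(candidate, is_valid_page_number(candidate))
```

Note that a range written with spaces or with a dash other than the ASCII hyphen is rejected, which is the kind of formatting slip the check is meant to surface.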
Step 2: use a preset library to extract the references cited in the text, and extract the bibliographic items from them. Step 2 consists of two parts, reference content extraction and bibliographic item extraction, and therefore specifically includes:
Step 21: use Apache POI to parse the document and extract the reference content.
Most existing documents are stored in Microsoft Word format or a Word-compatible format. In a Microsoft Word document, information is stored in the XML-based OOXML (Office Open XML) format, so Apache POI 3.13 can be used to parse the document.
The meaning of the OOXML structure is illustrated below with an example. Two paragraphs are edited in Microsoft Word 2013, namely "Chinese reference bibliographic item recognition" and "paper"; the corresponding XML code is shown in Fig. 2.
In the code, the <w:document> element is the root element of the document, and all other elements are its descendants. The attributes following the element name define several namespaces.
The <w:body> element holds the document content and is the only required element. It contains many child elements; see the OOXML standard for details. Of these, three are the most basic: <w:p>, <w:r> and <w:t>. <w:p> represents a paragraph and defines content that starts on a new line; <w:r> represents run-level content, which may be a text run, mathematical content, smart tags, user-defined markup and so on, the run being the smallest unit to which formatting is applied; <w:t> represents the actual text content within a run. The hierarchy of these elements is shown in Fig. 3.
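The paragraph/run/text hierarchy described above can be walked with a few lines of code. The embodiment uses Apache POI (a Java library); the sketch below instead uses Python's standard `xml.etree.ElementTree` purely to illustrate the structure, and the `SAMPLE` fragment is a hand-written stand-in for Fig. 2, not a copy of it.

```python
import xml.etree.ElementTree as ET
from typing import List

# WordprocessingML main namespace, bound to the "w" prefix.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hand-written minimal document body (illustrative only).
SAMPLE = f"""<w:document xmlns:w="{W}">
  <w:body>
    <w:p><w:r><w:t>Chinese reference </w:t></w:r><w:r><w:t>item recognition</w:t></w:r></w:p>
    <w:p><w:r><w:t>paper</w:t></w:r></w:p>
  </w:body>
</w:document>"""

def paragraphs(document_xml: str) -> List[str]:
    """Join the w:t text of every run in each w:p paragraph."""
    root = ET.fromstring(document_xml)
    result = []
    for p in root.findall("./w:body/w:p", NS):
        texts = [t.text or "" for t in p.findall("./w:r/w:t", NS)]
        result.append("".join(texts))
    return result

print(paragraphs(SAMPLE))
```

One paragraph may be split across several runs when formatting changes mid-sentence, which is why the text of all runs is concatenated before any reference is segmented.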
Because the position of the references in a document is fixed, once the XML-based OOXML (Office Open XML) information is understood, Apache POI 3.13 can be used to parse the document and extract the reference content from it.
Step 22: extract the bibliographic items from the reference content.
Because a reference consists of several bibliographic items, the items must first be segmented before they can be recognized. Specifically:
Symbol normalization step: identify the symbols in the reference to judge whether it contains non-half-width (full-width) symbols and, if so, replace them with the corresponding half-width symbols;
Segmentation step: segment the bibliographic items at the description symbols specified in the national standard GB/T 7714-2005.
In GB/T 7714-2005, all description symbols are prefix symbols. For example, no symbol is placed before the author, which is the first bibliographic item of a reference; "." is the prefix symbol for the title item, for the title item of a contribution extracted from a larger work, and so on. Based on an analysis of GB/T 7714-2005, the different prefix symbols are used as separators to segment the bibliographic items.
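The two sub-steps above, symbol normalization followed by segmentation at prefix symbols, can be sketched as follows. This is an illustrative simplification: only ".", ":" and "," are treated as prefix symbols, the mapping of the ideographic full stop to "." is an assumption, and the function names are introduced here.

```python
import re
from typing import List, Tuple

# Full-width ASCII variants U+FF01..U+FF5E sit exactly 0xFEE0 above their
# half-width counterparts.
FULL_TO_HALF = {code: code - 0xFEE0 for code in range(0xFF01, 0xFF5F)}
FULL_TO_HALF[0x3000] = 0x20  # ideographic space -> ASCII space
FULL_TO_HALF[0x3002] = 0x2E  # ideographic full stop -> "." (assumption)

def normalize_symbols(reference: str) -> str:
    """Replace full-width symbols with their half-width equivalents."""
    return reference.translate(FULL_TO_HALF)

def segment(reference: str) -> List[Tuple[str, str]]:
    """Split a normalized reference into (prefix symbol, item text) pairs.

    Only '.', ':' and ',' are handled; the first item has no prefix symbol.
    """
    parts = re.split(r"([.:,])\s*", reference.strip().rstrip("."))
    items = [("", parts[0].strip())]
    for i in range(1, len(parts) - 1, 2):
        items.append((parts[i], parts[i + 1].strip()))
    return [(symbol, text) for symbol, text in items if text]

# Same reference written with full-width ".", ":" and ",".
raw = "Zhu Gang\uff0eTitle\uff0eBeijing\uff1aTsinghua University\uff0c1996\uff0e"
print(segment(normalize_symbols(raw)))
```

A production segmenter would need to protect periods inside initials, abbreviations and decimal numbers; the sketch ignores those cases on purpose.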
Sampling and analysis of a number of graduate dissertations showed that format errors take many different forms, and statistical methods can be used to build a probability model of description errors. For example, if a reference contains the half-width ". ", the full-width full stop will almost never appear in it as a separator between bibliographic items; conversely, if a reference uses the full-width full stop as a separator between bibliographic items, the half-width ". " will almost never appear as a separator in it.
Step 3: use a preset bibliographic item recognition model to recognize the references cited in the text of the paper so as to extract their bibliographic items, wherein the bibliographic item recognition model is obtained by learning from a preset corpus.
In the embodiment of the present invention, Stanford University's named entity recognizer based on conditional random fields (Stanford Named Entity Recognizer, NER) is used. NER labels entities by category, such as person names, company names, regions, and gene and protein names. NER comes with well-designed feature extractors for named entity recognition, and a model is obtained through training. In theory, the more training data there is (i.e. manually annotated text), the better NER recognizes. The model must be retrained to meet new requirements.
Step 3 therefore specifically includes:
Step 31: extract a corpus. Specifically, the corpus uses an extract of the annotated People's Daily corpus of January 1998 and the 2015 Peking University edition of "A Guide to the Core Chinese periodical".
Wherein:
1. The extract of the annotated People's Daily corpus of January 1998: because nouns such as person names and place names make up a large share of the People's Daily corpus, it serves as good training data.
2. The 2015 Peking University edition of "A Guide to the Core Chinese periodical": besides person names and place names, it is also desirable to recognize common journal titles and the keywords frequently used in article titles, so "A Guide to the Core Chinese periodical" is used together with the People's Daily corpus.
For example, the title of a thesis or of a journal often contains the word "based on". Therefore, the journal titles appearing in the 2015 Peking University edition of "A Guide to the Core Chinese periodical" and the keywords commonly found in titles are added to the People's Daily corpus. These parts together form the training set of the system, which is saved in the file testdata.tsv and used for the closed test of the system.
In addition, the references at the end of this university's dissertations are extracted to form a test set used for the open test of the system.
The extracted reference content may be as shown in Fig. 4.
Step 32: using the preset corpus, train with the NER algorithm to obtain the bibliographic item recognition model.
NER provides two ways of training a model: command-line mode and configuration-file mode.
In the embodiment of the present invention, the configuration-file mode can be used.
Specifically, the configuration file in Stanford NER is named austen.prop, and its parameters are modified as shown in Table 1 below:
Table 1: Modified parameters of austen.prop
Here trainFile specifies the data set used for training, and serializeTo specifies the name of the model output after training. The modified configuration file is saved and placed, together with the training data set testdata.tsv, in the root directory of the program, and the following command is executed:
java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop austen.prop
After the command runs successfully, ner-model.ser.gz is generated in the directory; this is the model obtained from the training data.
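The layout of the training file testdata.tsv is not shown in the text. A common arrangement for Stanford NER training data is one token per line followed by a tab and its label, with "O" for tokens outside any entity; which column holds what is governed by the `map` property of the configuration file. The sketch below assumes that two-column layout, and the label names AUTHOR and TITLE are hypothetical.

```python
from typing import Iterable, Tuple

def to_tsv(labeled_tokens: Iterable[Tuple[str, str]]) -> str:
    """Serialize (token, label) pairs as one tab-separated pair per line.

    Assumes a two-column layout (token, then label); the actual layout is
    set by the 'map' property in the .prop file.
    """
    return "".join(f"{token}\t{label}\n" for token, label in labeled_tokens)

# Hypothetical labels for the start of one reference.
example = [("Zhu", "AUTHOR"), ("Gang", "AUTHOR"), (".", "O"), ("Title", "TITLE")]
print(to_tsv(example), end="")
```

Writing the returned string to testdata.tsv with UTF-8 encoding would give a file of the assumed shape; the encoding expected by the trainer should be checked against the configuration actually used.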
Once the bibliographic item recognition model has been obtained, it can be used to recognize the bibliographic items from step 2.
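After recognition, the items are extracted into XML nodes so that they can be validated against the Schema. A minimal sketch of that extraction is given below, assuming the recognizer's output has already been collected into a dictionary keyed by the element names of the Schema example; the namespace declared in the schema is omitted for brevity, and `to_xml` is a name introduced here.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Element order taken from the xs:sequence in the Schema example.
ORDER = ["author", "title", "type", "publish", "publisher",
         "publish_year", "page_number"]

def to_xml(items: Dict[str, str]) -> str:
    """Build a <reference> element whose children follow the schema order."""
    reference = ET.Element("reference")
    for name in ORDER:
        if name in items:
            ET.SubElement(reference, name).text = items[name]
    return ET.tostring(reference, encoding="unicode")

print(to_xml({"author": "Zhu Gang", "title": "T", "publish_year": "1996"}))
```

Emitting children in schema order matters because an xs:sequence rejects correctly labelled items that appear in the wrong position.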
Because the information in a reference may be incomplete, the reference type may be missing, and the reference type has a major influence on the subsequent format check. The embodiment of the present invention may therefore further include:
Step 33: judge whether the reference includes a reference type parameter and, if not, use the bibliographic items of the reference to determine its type.
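The first half of step 33, checking whether a type code is already present, can be sketched as a simple lookup. The sketch assumes the code appears in square brackets after the title, optionally followed by a carrier code, as in "[J]" or "[EB/OL]"; the set of codes is the one listed for GB/T 7714-2005 in this description, and `find_type_code` is a name introduced here.

```python
import re
from typing import Optional

# Codes listed in this description for GB/T 7714-2005.
TYPE_CODES = {"M", "G", "S", "J", "CP", "D", "R", "P", "DB", "EB",
              "MT", "DK", "C", "CD", "N", "OL"}

# Assumed notation: "[J]" or "[EB/OL]" (type code, optional carrier code).
CODE_RE = re.compile(r"\[([A-Z]{1,2})(?:/([A-Z]{2}))?\]")

def find_type_code(title_item: str) -> Optional[str]:
    """Return the document type code found in the title item, or None."""
    match = CODE_RE.search(title_item)
    if match and match.group(1) in TYPE_CODES:
        return match.group(1)
    return None

print(find_type_code("A title[J]"))       # J
print(find_type_code("A title[EB/OL]"))   # EB
print(find_type_code("A title"))          # None
```

When the function returns None, the type has to be inferred from the other items, which is what the decision tree in the following steps is for.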
Specifically, step 33 includes:
Step 331: construct the decision tree of bibliographic items.
Take again the reference content shown in Fig. 4, which comprises 10 correctly formatted references. As shown in Fig. 4, each reference consists of many bibliographic items, and different types of references are composed of different items. By analyzing the 10 references, their bibliographic items and attribute value descriptions are summarized in Table 2.
Table 2: Bibliographic items of references and their attribute value descriptions
After data conversion, the information model of each reference in Fig. 4 is as shown in Table 3:
Table 3: Information model of each reference in Fig. 4
The bibliographic item decision tree shown in Fig. 5 can therefore be constructed from Table 3. The decision tree in Fig. 5 can be used to predict the type of a document of unknown type. For example, given the following reference:
Zhu Gang. A new fluid finite element method and the direct-inverse hybrid problem of turbomachinery. Beijing: Tsinghua University, 1996.
the decision tree shown in Fig. 5 predicts that it is a dissertation.
Generating the decision tree shown in Fig. 5 involves two key questions:
The first is how to select the current best grouping variable from the many input variables. For example, why is the publisher type taken as the root node of the decision tree? Why is the author type chosen as the child node on the next level rather than some other bibliographic item?
The second is how to find the best split point among the many values of a grouping variable. For example, the publisher type is nominal and its values include "journal", "educational institution", "other" and "publishing house"; why is "educational institution" chosen as the split point? Once these two key questions are answered, the decision tree can be constructed easily.
Decision trees require the concept of "purity". There are three common purity measures: the Gini index (Gini), entropy (Entropy) and error rate (Error). In the embodiment of the present invention they can be calculated with the formulas below.
Suppose an attribute of a bibliographic item has n distinct classes of values i (i = 1, 2, ..., n). The proportion of each class is p(i) = (number of values in class i) / (total number of values of the attribute), and p(i) ranges over [0, 1].
$$\mathrm{Gini} = 1 - \sum_{i=1}^{n} p(i)^2 \tag{1}$$
$$\mathrm{Entropy} = -\sum_{i=1}^{n} p(i)\,\log_2 p(i) \tag{2}$$
$$\mathrm{Error} = 1 - \max\{\, p(i) \mid i \in [1, n] \,\} \tag{3}$$
For all three purity formulas 1-3 above, the larger the value, the more "impure", and the smaller the value, the more "pure". Practice has shown that the choice among the three has little effect on the final classification accuracy. The entropy formula is used here, and two common attribute selection measures are derived from it: the information gain (Gain) of formula 4 and the information gain ratio (GainRate) of formula 5.
$$\mathrm{Gain}(U, V) = \mathrm{Entropy}(U) - \mathrm{Entropy}(U \mid V) \tag{4}$$
$$\mathrm{GainRate}(U, V) = \mathrm{Gain}(U, V) / \mathrm{Entropy}(V) \tag{5}$$
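The three purity measures (Gini index, entropy and error rate) are straightforward to compute from the class proportions p(i). The following sketch implements them directly from their definitions; the function names and the four-element example are illustrative, not data from the patent.

```python
import math
from collections import Counter
from typing import List, Sequence

def proportions(values: Sequence[str]) -> List[float]:
    """p(i): share of each distinct value among all values."""
    counts = Counter(values)
    return [count / len(values) for count in counts.values()]

def gini(values: Sequence[str]) -> float:
    return 1 - sum(p * p for p in proportions(values))

def entropy(values: Sequence[str]) -> float:
    return -sum(p * math.log2(p) for p in proportions(values))

def error(values: Sequence[str]) -> float:
    return 1 - max(proportions(values))

# Two journal articles and two dissertations: maximally impure for two classes.
labels = ["J", "J", "D", "D"]
print(gini(labels), entropy(labels), error(labels))
```

All three functions return 0 for a list containing a single class, matching the statement that smaller values mean a purer node.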
In information theory, the exchange of information is regarded as being carried out by a transmission system consisting of a source, a channel and a sink; the source is the sender of the information and the sink is the receiver. Taking the prediction of the reference type code above as an example, author type (T1), report number (T2), patent number (T3), publisher type (T4), year/volume/issue identifier (T5) and page number (T6) are the input variables, and the reference type code is the output variable. The decision tree regards the output variable (the reference type code) as the information U sent by the source, and the input variables as the information V received by the sink.
The information gain ratio (GainRate) is used to answer each of the two key questions above. The calculation proceeds as follows:
Taking author type T1 as an example, Entropy(U), Entropy(U | T1), Gain(U, T1) and GainRate(U, T1) are calculated in turn, where there are 2 references each of the journal, dissertation and book types, and 1 each of the report, conference proceedings, patent and one other type.
Suppose the bibliographic item has M different classes of attribute values u_i (i = 1, 2, ..., M), each class with proportion p(u_i), and the author type T1 has N different attribute values t_1j (j = 1, 2, ..., N).
$$\mathrm{Entropy}(U) = -\sum_{i=1}^{M} p(u_i) \log_2 p(u_i) = 1.832$$
$$\mathrm{Entropy}(U \mid T1) = \sum_{j=1}^{N} p(t_{1j}) \left( -\sum_{i=1}^{M} p(u_i \mid t_{1j}) \log_2 p(u_i \mid t_{1j}) \right) = 1.279$$
$$\mathrm{Gain}(U, T1) = \mathrm{Entropy}(U) - \mathrm{Entropy}(U \mid T1) = 0.553$$
$$\mathrm{GainRate}(U, T1) = \mathrm{Gain}(U, T1) \,/\, \mathrm{Entropy}(T1) = 0.628$$
The information gain ratio of the author type (T1) is therefore 0.628. The other bibliographic items are computed in the same way. T4 has the largest information gain ratio, 1.275, so T4 is selected as the best grouping variable and becomes the root node of the decision tree.
The publisher type has four attribute values: "journal", "educational institution", "other" and "publishing house". The split point is chosen by a calculation similar to the one above: "educational institution" has the largest information gain ratio, 3.948, so "educational institution" is selected as the best split.
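The selection of the best grouping variable can be sketched as follows. This is an illustration under stated assumptions: the four toy records, the attribute values and the helper names are hypothetical, and the sketch does not reproduce the patent's sample or its figures of 0.628 and 1.275.

```python
import math
from collections import Counter


def entropy(values):
    """Entropy in bits of the empirical distribution of `values`."""
    n = len(values)
    return sum(-c / n * math.log2(c / n) for c in Counter(values).values())


def conditional_entropy(labels, attr):
    """Entropy(U | V): entropy of the labels within each attribute value,
    weighted by how often that attribute value occurs."""
    n = len(labels)
    total = 0.0
    for value in set(attr):
        subset = [u for u, a in zip(labels, attr) if a == value]
        total += len(subset) / n * entropy(subset)
    return total


def gain_rate(labels, attr):
    """GainRate(U, V) = (Entropy(U) - Entropy(U | V)) / Entropy(V)."""
    gain = entropy(labels) - conditional_entropy(labels, attr)
    split = entropy(attr)
    return gain / split if split > 0 else 0.0


def best_attribute(labels, table):
    """Name of the attribute with the largest information gain ratio."""
    return max(table, key=lambda name: gain_rate(labels, table[name]))


# Hypothetical toy data: document type identifiers and two input variables.
labels = ["J", "J", "M", "M"]
table = {
    "T1": ["person", "group", "person", "group"],
    "T4": ["journal", "journal", "press", "press"],
}
print(best_attribute(labels, table))
```

In this toy data T4 separates the two document types completely while T1 carries no information about them, so T4 is chosen, mirroring the way the root node is selected in the text.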
The analysis above shows that the decision tree used in this embodiment is an intuitive decision-analysis method with clear advantages. A decision tree model is readable and descriptive, which helps manual analysis. It is also efficient to execute: it only has to be built once and can then be reused, and it can naturally embed the prior knowledge of experts.
The bibliographic items of a reference may contain inconsistent data, duplicate data, noisy data, high-dimensional data and similar problems. The data therefore need to be preprocessed before the bibliographic items are classified. That is, step 33 further includes:
Step 332: preprocess the data.
Specifically, the data preprocessing step includes:
Step 3321: check the completeness of the bibliographic items of the reference.
Decision tree variables come in two types, numeric and nominal, so the main preprocessing work before building the decision tree is to convert data that are neither numeric nor nominal into numeric or nominal data.
In data mining, suitable attributes are selected from the raw data as mining attributes. The principles followed are: give attribute names and attribute values a clear meaning wherever possible, remove duplicate data, remove negligible fields, and select related fields sensibly. The preprocessing procedure is described in detail below.
The raw data are the extracted references. Each reference is split into its bibliographic items, and each item obtained from the split is treated as an attribute of the record. Table 4 below is a fragment selected from the raw data.
Table 4: Raw data records
As the table above shows, once a reference has been split, some field values are missing and some fields can be ignored. Step 332 can therefore include the following three sub-steps:
Step 3322: check whether the reference has missing bibliographic items; if so, fill in the missing values according to the related bibliographic items in the reference.
For example, "patent country", "patent number" and "report number" in Table 4 have missing values, so the missing values in the selected data are filled in. The rule for filling a missing value is to follow the type of the values already present in that field: if the existing values of a field are numeric, the fill value for its missing entries is also numeric, and if the existing values are nominal, the fill value is also nominal.
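The type-preserving fill rule can be sketched as below. The field names, the numeric fill value 0 and the use of "NUL" as the nominal fill value are assumptions made for illustration; the patent only states that the fill value must have the same type as the values already present.

```python
def fill_missing(records, field, numeric_fill=0, nominal_fill="NUL"):
    """Return copies of `records` in which empty `field` values are filled.

    The fill value follows the type of the values already present:
    numeric if every existing value is a number, nominal otherwise.
    """
    def is_empty(record):
        return record.get(field) in (None, "")

    present = [r[field] for r in records if not is_empty(r)]
    numeric = bool(present) and all(isinstance(v, (int, float)) for v in present)
    fill = numeric_fill if numeric else nominal_fill
    return [{**r, field: fill} if is_empty(r) else dict(r) for r in records]


# Hypothetical records: one nominal field and one numeric field.
patents = [{"patent_no": "CN105824791A"}, {"patent_no": ""}, {}]
pages = [{"page": 12}, {"page": None}]
print(fill_missing(patents, "patent_no"))
print(fill_missing(pages, "page"))
```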
Step 3323: delete meaningless bibliographic items according to their relevance. For example, in the reference content shown in Fig. 4, the sequence numbers (the leading 1, 2, 3, ..., 10) have no effect on the prediction result and only increase the computational complexity, so they can be deleted. Likewise, the place of publication does not affect the final type of a reference, so the "place of publication" field can be ignored.
Step 3324: generalize the data. The values of each field in the raw records can be summarized into a few classes. For example, the values of the "owner" field fall into two classes: one is the name of a specific person, the other is the name of an organization. The document type is predicted from whether the "owner" field is a personal name or an organization name; which person or which organization it is makes no difference. The "owner" field can therefore be generalized into two classes, personal name and organization name. All similar data are generalized in the same way.
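The two sub-steps above, dropping irrelevant fields and generalizing the "owner" field, might look like the following sketch. The field names in DROP_FIELDS and the keyword heuristic in ORG_HINTS are hypothetical; the patent does not say how a personal name is told apart from an organization name.

```python
# Hypothetical field names for the sequence number and place of publication.
DROP_FIELDS = {"seq_no", "publish_place"}

# Assumed keyword heuristic for recognizing organization names.
ORG_HINTS = ("University", "Institute", "Committee", "Company", "Association")


def generalize_owner(owner):
    """Map a concrete owner string to PER.Individual, PER.Group or NUL."""
    if not owner:
        return "NUL"
    if any(hint in owner for hint in ORG_HINTS):
        return "PER.Group"
    return "PER.Individual"


def preprocess(record):
    """Drop irrelevant fields, then generalize the owner field."""
    kept = {k: v for k, v in record.items() if k not in DROP_FIELDS}
    kept["owner"] = generalize_owner(kept.get("owner"))
    return kept


record = {"seq_no": 1, "owner": "Chen Luyao",
          "publish_place": "Beijing", "page": "4624-4629"}
print(preprocess(record))
```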
After the steps above, the preprocessed data shown in Table 5 are obtained.
Table 5: Preprocessed data
As Table 5 shows, the preprocessed fields and field values are all in English. This is only one option in this embodiment; the preprocessed fields and field values may be expressed in any other form. Because this embodiment uses the WEKA system to determine the reference type, English fields and field values give better results.
The field values of each field in this embodiment are explained using the example in Table 5:
The values of the "owner" field are PER.Individual and PER.Group. PER denotes a person or group of people mentioned in the document; PER.Individual and PER.Group are subclasses of PER and denote an individual and a group or organization, respectively.
The values of the "title feature word" field are title_D_tag, title_C_tag and so on. For example, title_C_tag means that the title of a conference proceedings contains a "proceedings" feature word, and the title of a report generally contains a "report" feature word. For other types without a feature word, the value is recorded as no.
The values of the "publisher" field are PUB.Press, PUB.Journal, PUB.School, PUB.Institution and NUL, which denote a non-school publishing house, a journal, a school publishing house and an institute, respectively; NUL denotes a missing item.
In step 33, after the decision tree has been constructed in step 331 and the data have been preprocessed in step 332, the type of the reference has to be determined. The method therefore also includes:
Step 333: use the decision tree of the reference and the preprocessed data to determine the type. In this embodiment, the WEKA platform is used for type determination.
The data mining process on the WEKA platform is as follows:
1) import the data set to be tested;
2) preprocess the data to be tested (already completed in step 332);
3) place the processed data set in different learning schemes to learn, build a prediction model and predict unknown instances;
4) evaluate and visualize the prediction results.
These four steps are described in detail below.
Step 333 therefore specifically includes:
Step 3331: import the data set to be tested.
The WEKA platform can process data in CSV and ARFF files, and ARFF is the preferred format, so ARFF files are used here. The file format must be converted before the data set to be tested is imported: the raw data are stored in an Excel file, which is first converted to a CSV file and then to an ARFF file. Some of the records in the ARFF file are shown in Fig. 6.
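The CSV-to-ARFF step can be sketched in plain Python as follows. The relation name, the column names and the assumption that every column is nominal and that values contain no spaces, commas or quotes are illustrative simplifications; WEKA also ships its own converters, which are not shown here.

```python
import csv
import io


def csv_to_arff(csv_text, relation):
    """Convert CSV text whose columns are all nominal into ARFF text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    fields = list(rows[0].keys())
    lines = ["@relation " + relation, ""]
    for name in fields:
        # A nominal attribute lists every value that occurs in the column.
        values = sorted({row[name] for row in rows})
        lines.append("@attribute " + name + " {" + ",".join(values) + "}")
    lines += ["", "@data"]
    lines += [",".join(row[name] for name in fields) for row in rows]
    return "\n".join(lines) + "\n"


# Hypothetical preprocessed records in the style of Table 5.
sample = (
    "owner,publisher,type\n"
    "PER.Individual,PUB.Journal,J\n"
    "PER.Group,PUB.Press,M\n"
)
arff = csv_to_arff(sample, "references")
print(arff)
```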
Step 3332: obtain the data preprocessed in step 332.
Step 3333: select specific classification algorithms for training and test classification. The classification module of the WEKA system integrates about 50 classification algorithms. This embodiment selects three classic classifiers, NaiveBayes, J48 (decision tree) and ZeroR, to classify the test set.
The fourth step evaluates the results of the different classification algorithms. There are many methods for evaluating classification accuracy, chiefly cross-validation, holdout, leave-one-out and back-substitution. Cross-validation and holdout are the most commonly used. Leave-one-out can be regarded as a special case of cross-validation. Back-substitution is not used, because overfitting makes its estimate of the classification accuracy too high. Visualization can be applied either to the result of a single classification run or to a data set. Visualizing a data set shows a two-dimensional scatter plot for every pair of attributes; visualizing the output of a classification run shows the classification errors, the tree, the cost curve, the ROC curve and so on, which are used to assess the performance of each learning scheme.
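To make the evaluation methods concrete, the sketch below runs k-fold cross-validation on a ZeroR-style baseline, which always predicts the majority class of the training data. It is a pure-Python illustration with hypothetical labels and a simple deterministic fold assignment; it is not WEKA's API, and leave-one-out is obtained by setting k to the number of records.

```python
from collections import Counter


def zero_r(train_labels):
    """ZeroR-style baseline: always predict the most frequent training label."""
    return Counter(train_labels).most_common(1)[0][0]


def cross_validate(labels, k):
    """Accuracy of the baseline under k-fold cross-validation.

    Record i is assigned to fold i % k (a simplifying assumption);
    k == len(labels) gives leave-one-out.
    """
    correct = 0
    for fold in range(k):
        train = [y for i, y in enumerate(labels) if i % k != fold]
        test = [y for i, y in enumerate(labels) if i % k == fold]
        prediction = zero_r(train)
        correct += sum(1 for y in test if y == prediction)
    return correct / len(labels)


# Hypothetical document type identifiers for ten references.
labels = ["J"] * 7 + ["M"] * 3
print(cross_validate(labels, k=5))
print(cross_validate(labels, k=len(labels)))
```

A baseline of this kind gives the accuracy that any real classifier, such as a decision tree, has to beat.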
Although some algorithms determine the document type identifier with relatively high accuracy, none reaches 100%, and this affects the accuracy of the final reference format error detection.
To reduce the prediction error as far as possible, this embodiment also uses feature words. After the decision tree has determined the document type identifier, a second determination is made from the feature words. If the two results agree, that result is final; if they differ, the result based on the feature words prevails. Table 6 lists the correspondence between the feature words and the reference types.
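The rule for combining the two determinations can be sketched as follows. The contents of FEATURE_WORDS are hypothetical stand-ins for Table 6, which is not reproduced in the text, and falling back to the decision tree when no feature word is found is an assumption.

```python
# Hypothetical feature words mapped to document type identifiers.
FEATURE_WORDS = {"proceedings": "C", "report": "R", "thesis": "D"}


def type_from_feature_words(title):
    """Return the type implied by a feature word in the title, if any."""
    lowered = title.lower()
    for word, code in FEATURE_WORDS.items():
        if word in lowered:
            return code
    return None


def final_type(tree_prediction, title):
    """Combine the decision tree result with the feature word result.

    If both agree the result is unchanged; if they differ, the feature
    word result wins; with no feature word the tree result is kept.
    """
    by_word = type_from_feature_words(title)
    return by_word if by_word is not None else tree_prediction


print(final_type("J", "Proceedings of a hypothetical conference"))
print(final_type("J", "A hypothetical journal article"))
```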
Table 6: Relation between feature words and document types
Step 4: check the bibliographic items of the identified reference against the reference bibliographic item format specification. Specifically, an XML document is generated from the identified bibliographic items according to the document type identifier and is then validated with Schema. If validation passes, the format of the reference is correct; otherwise the format contains an error.
In this embodiment, the XML document generated for the journal type is as follows:
<?xml version="1.0" encoding="GB2312" standalone="no"?>
<reference xmlns="http://www.w3school.com.cn"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3school.com.cn J_pre.xsd">
  <author authorLoc="1">Chen Luyao</author>
  <title titleLoc="2">Extraction of information document structure trust mode and logical description</title>
  <type typeLoc="2">j_pre</type>
  <publish publishLoc="3">Beijing</publish>
  <publisher publisherLoc="4">Computer Utility Research</publisher>
  <publish_year publish_yearLoc="5">2015</publish_year>
  <volumn_mark volumn_markLoc="6">27</volumn_mark>
  <page_number page_numberLoc="7">4624-4629</page_number>
</reference>
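A document of this shape can be generated from the identified bibliographic items with the Python standard library, as in the sketch below. The function name is hypothetical, the namespace attributes are omitted, and positions are simply numbered 1, 2, 3, ... in input order, which is a simplification of the Loc values shown above. Schema validation itself is left out, because the standard library has no XSD validator.

```python
import xml.etree.ElementTree as ET


def build_reference_xml(items):
    """Build a <reference> element from (tag, text) pairs.

    Each child gets a '<tag>Loc' attribute holding its 1-based position,
    so a schema can later check both the content and the order of items.
    """
    root = ET.Element("reference")
    for position, (tag, text) in enumerate(items, start=1):
        child = ET.SubElement(root, tag, {tag + "Loc": str(position)})
        child.text = text
    return ET.tostring(root, encoding="unicode")


items = [
    ("author", "Chen Luyao"),
    ("type", "j_pre"),
    ("page_number", "4624-4629"),
]
xml_text = build_reference_xml(items)
print(xml_text)
```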
After validation against the Schema template, a document that fails the compiler's check produces error information. The error information consists of an error type and an error description. The error type gives a rough idea of the problem; locating the error precisely also requires the error description. Three common error types are summarized below:
(1) cvc-complex-type.3.1. This error type means that an attribute value in the XML does not match the attribute value defined in the Schema, for example when the order of two tags is reversed.
(2) cvc-complex-type.2.4.a. This error type means that the logical structure of the XML file does not conform to the Schema specification, for example when an element that is not defined in the Schema appears.
(3) cvc-complex-type.2.4.b. This error type means that the content of the XML file is incomplete, for example when an item is missing.
The format errors that occur in references can be attributed to one or more of these three classes. The error detection procedure is shown in Table 7.
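Mapping the error type at the start of a validator message to one of the three classes can be sketched as follows. The category wording and the sample messages are illustrative assumptions modeled on typical validator output, not text taken from the patent.

```python
import re

# The three error types named in the text, with an assumed short label each.
ERROR_KINDS = {
    "cvc-complex-type.3.1": "attribute value mismatch (e.g. items out of order)",
    "cvc-complex-type.2.4.a": "unexpected element (e.g. extra item)",
    "cvc-complex-type.2.4.b": "incomplete content (e.g. missing item)",
}

# The error type is the leading token, e.g. 'cvc-complex-type.2.4.a'.
CODE_PATTERN = re.compile(r"(cvc-[\w-]+(?:\.\w+)*)")


def classify(message):
    """Return (error type, label) for a validator message."""
    match = CODE_PATTERN.match(message)
    code = match.group(1) if match else None
    return code, ERROR_KINDS.get(code, "other")


# Hypothetical messages for illustration.
messages = [
    "cvc-complex-type.2.4.b: The content of element 'reference' is not complete.",
    "cvc-complex-type.3.1: Value '5' of attribute 'publisherLoc' is not valid.",
]
for message in messages:
    print(classify(message))
```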
Table 7: Reference format error detection algorithm
In the algorithm of Table 7, R is the set of references to be checked and r is one reference in R. ERRORS is the set of error types of the references that fail XML Schema validation, and Er is the error type of one reference. After detection, the erroneous item still has to be located, so the error description provided by the compiler is converted into the corresponding position information. Fig. 7 illustrates the conversion with an example.
As Fig. 7 shows, errors in bibliographic items fall into three cases: extra item, missing item and out of order. The position numbers change differently in each case, so Algorithm 2 is designed on the basis of the position numbers and the content of the bibliographic items; see Table 8.
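Since Table 8 is not reproduced in the text, the following is only a sketch of how the three cases could be told apart by comparing the expected item order with the observed one. It is not the patent's Algorithm 2; the field names are hypothetical and each item is assumed to occur at most once.

```python
def locate_errors(expected, observed):
    """Compare observed item order with the expected order.

    Returns 1-based positions: missing items by their expected position,
    extra and out-of-order items by their observed position.
    """
    missing = [(i, f) for i, f in enumerate(expected, 1) if f not in observed]
    extra = [(i, f) for i, f in enumerate(observed, 1) if f not in expected]
    # Restrict both sequences to the shared items, then compare pairwise.
    common_expected = [f for f in expected if f in observed]
    common_observed = [f for f in observed if f in expected]
    out_of_order = [
        (observed.index(o) + 1, o)
        for e, o in zip(common_expected, common_observed)
        if e != o
    ]
    return {"missing": missing, "extra": extra, "out_of_order": out_of_order}


expected = ["author", "title", "type", "publisher", "year", "volume", "pages"]
observed = ["author", "title", "type", "year", "publisher", "pages", "doi"]
print(locate_errors(expected, observed))
```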
Table 8: Reference error item location algorithm
Following the analysis of the algorithms, the 10 references to be checked in Fig. 8 are taken as an example. The system is used to check the 10 references for conformance, and the results are shown in Fig. 9 and Fig. 10.
This patent is applied to checking the format of the references listed at the end of a paper and to correcting references whose format is wrong. In this invention the reference information is extracted from Microsoft Word documents; the method applies equally to extracting references from plain text. An example is described below.
Fig. 9 shows part of the result of bibliographic item identification. Taking the first reference in Fig. 8 as an example, the first 8 rows of Fig. 9 are the identification result for that reference. The first row, "J_pre", indicates that the first reference lacks a document type identifier and that the document type determination predicts it to be of the journal type. The second row indicates that "Chen Luyao" is identified as the author. The third row indicates that "Extraction of information document structure trust mode and logical description" is identified as the title. The fourth row indicates that "Beijing" is identified as the place of publication. The fifth row indicates that "Computer Utility Research" is identified as a publisher of the journal type. The sixth row indicates that "2010" is identified as the publication year. The seventh row indicates that "27" is identified as the volume. The eighth row indicates that "4624-4629" is identified as the page numbers.
Fig. 10 shows the detection results for the 10 references. For each reference whose format does not conform, the specific error position is indicated and a suggested correction is given to make revision easier. Fig. 11 shows the XML file generated during detection.
The invention has the following advantages:
With the surge in scientific papers, the relevant national departments have standardized academic journals, and the format standard for references has become a rule that authors and editors must follow. Authors have to learn the standard to write high-quality papers, and editors likewise have to learn it to proofread papers efficiently. Both therefore need a convenient tool for checking whether reference formats conform. Different types of references have different formats, and a single reference contains many bibliographic items, so authors inevitably make mistakes when compiling them. As a result, the references of scientific papers still contain many nonconforming entries, which makes proofreading harder for editors. This work mainly addresses the conformance of reference formats and has high practical value.
1) This research makes reference format checking more intelligent, reduces bibliographic errors in references, and improves the efficiency of reference format checking.
2) Correctly understanding each bibliographic item of a reference supports the further mining and use of references in future (for example analyzing citation information, assessing the research level of academic writing, and collating the research output of related authors).
The embodiment of the invention can check whether reference formats conform, locate the exact position of an error and indicate how to correct it, which is convenient for researchers. The results of this work are valuable for improving the quality of digital publishing, promoting the efficient dissemination and use of document information, and saving the labor cost of typesetting.
The above is the preferred embodiment of the invention. It should be noted that those of ordinary skill in the art may make improvements and modifications without departing from the principle of the invention, and such improvements and modifications shall also be regarded as falling within the scope of protection of the invention.

Claims (6)

1. A reference format checking method, characterized by comprising:
Step 1: expressing the reference bibliographic item format specification in Schema, wherein the reference bibliographic item format includes at least one of the following bibliographic items: owner, title, reference type, publisher, publication date, page number;
Step 2: reading each reference and splitting it into bibliographic items;
Step 3: identifying the bibliographic items of the reference and extracting the identified items into XML nodes, wherein the bibliographic items include at least one of the following: owner, title, place of publication, publisher, publication date, etc.; and at the same time determining whether the bibliographic items of the reference include a document type identifier and, if not, adding the document type identifier of the reference according to its bibliographic items;
Step 4: verifying the bibliographic items against the reference bibliographic item format specification.
2. The reference format checking method according to claim 1, characterized in that the method further comprises:
Step 5: when a bibliographic item of the reference contains an error, correcting the bibliographic item, specifically including:
when the error is a missing item, completing the bibliographic item, adding punctuation and reassembling a reference in the standard format;
when the error is an extra item, deleting the bibliographic item, adding punctuation and reassembling a reference in the standard format;
when the error is a wrong item, correcting it according to the standard format, adding punctuation and reassembling a reference in the standard format.
3. The reference format checking method according to claim 1, characterized in that step 2 comprises:
Step 21: using Apache POI to process the document and extract the reference content;
Step 22: splitting the extracted reference content to obtain the bibliographic items, including:
identifying the symbols in the reference to determine whether the reference contains symbols that are not half-width and, if so, replacing them with the corresponding half-width symbols;
splitting the reference into bibliographic items according to the bibliographic punctuation symbols.
4. The reference format checking method according to claim 1, characterized in that step 3 comprises: using a preset bibliographic item identification model to identify the references cited in a Word paper so as to extract the bibliographic items of the references, wherein the bibliographic item identification model is obtained by learning from a preset corpus; specifically including:
Step 31: extracting the corpus;
Step 32: using the preset corpus, training with an NER algorithm to obtain the bibliographic item identification model;
Step 33: determining whether the reference includes a reference type parameter and, if not, determining the type of the reference from its bibliographic items.
5. The reference format checking method according to claim 4, characterized in that step 33 comprises:
Step 331: constructing the decision tree of the bibliographic items, specifically including:
computing the Gini index (Gini), the entropy (Entropy) and the error rate (Error) by the following formulas:
$$\mathrm{Gini} = 1 - \sum_{i=1}^{n} p(i)^2$$
$$\mathrm{Entropy} = -\sum_{i=1}^{n} p(i) \log_2 p(i)$$
$$\mathrm{Error} = 1 - \max\{\, p(i) \mid i \in [1, n] \,\}$$
and computing the information gain Gain and the information gain ratio GainRate:
$$\mathrm{Gain}(U, V) = \mathrm{Entropy}(U) - \mathrm{Entropy}(U \mid V)$$
$$\mathrm{GainRate}(U, V) = \mathrm{Gain}(U, V) \,/\, \mathrm{Entropy}(V)$$
so as to determine the root node and the best grouping variable of the decision tree;
Step 332: preprocessing the data, specifically including: checking the completeness of the bibliographic items of the reference, so as to convert data that are neither numeric nor nominal into numeric or nominal data; checking whether the reference has missing bibliographic items and, if so, filling in the missing values according to the related bibliographic items in the reference; deleting meaningless bibliographic items according to the relevance of the bibliographic items; and generalizing the data;
Step 333: using the decision tree of the reference and the preprocessed data to determine the type. In this embodiment of the invention, the WEKA platform is used for type determination.
6. The reference format checking method according to claim 5, characterized in that step 333 specifically comprises:
Step 3331: importing the data set to be tested;
Step 3332: obtaining the data to be tested as preprocessed in step 332;
Step 3333: placing the processed data set in different learning schemes to learn, build a prediction model and predict unknown instances;
Step 3334: evaluating the prediction results.
CN201610153946.0A 2016-03-17 2016-03-17 A kind of bibliography format checking method Active CN105824791B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610153946.0A CN105824791B (en) 2016-03-17 2016-03-17 A kind of bibliography format checking method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610153946.0A CN105824791B (en) 2016-03-17 2016-03-17 A kind of bibliography format checking method

Publications (2)

Publication Number Publication Date
CN105824791A true CN105824791A (en) 2016-08-03
CN105824791B CN105824791B (en) 2018-11-23

Family

ID=56525297

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610153946.0A Active CN105824791B (en) 2016-03-17 2016-03-17 A kind of bibliography format checking method

Country Status (1)

Country Link
CN (1) CN105824791B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108196880A (en) * 2017-12-11 2018-06-22 北京大学 Software project knowledge mapping method for automatically constructing and system
CN108733634A (en) * 2017-04-20 2018-11-02 北大方正集团有限公司 The recognition methods of bibliography and identification device
CN110688823A (en) * 2019-09-20 2020-01-14 中国银行股份有限公司 XML file verification method and device
CN110717314A (en) * 2019-10-17 2020-01-21 长江师范学院 Document bibliographic format conversion method
CN111125381A (en) * 2018-11-01 2020-05-08 北大方正集团有限公司 Identification method, device, equipment and storage medium of key information of reference document
CN111401005A (en) * 2018-12-28 2020-07-10 北大方正集团有限公司 Text conversion method and device and readable storage medium
CN113505570A (en) * 2021-05-25 2021-10-15 北京北大方正电子有限公司 Method, device and equipment for checking and correcting falling-in-space in reference documents and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101952802A (en) * 2007-06-21 2011-01-19 汤姆森路透社全球资源公司 Method and system for author and publisher's checking list of references
JP2011076254A (en) * 2009-09-29 2011-04-14 Fujitsu Ltd Inter-document relation analyzing device, and program and method of the same
CN103077162A (en) * 2013-01-23 2013-05-01 北京理工大学 Word document reference organization system
CN103440233A (en) * 2013-09-10 2013-12-11 青岛大学 Automatic sScientific paper standardization automatic detecting and editing system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DAS S ET AL.: "An online software for decision tree classification and visualization using c4.5 algorithm(ODTC)", 《INTERNATIONAL CONFERENCE ON COMPUTING FOR SUSTAINABLE GLOBAL DEVELOPMENT.IEEE,2014》 *
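Zhang Chunling: "An XML solution for automatic checking of references in electronic manuscripts of academic journals", 《China Master's Theses Full-text Database, Information Science and Technology》 *
Huang Yingling et al.: "Research on problems and countermeasures in citing references in graduation theses", 《Journal of Taiyuan University》 *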

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108733634A (en) * 2017-04-20 2018-11-02 北大方正集团有限公司 The recognition methods of bibliography and identification device
CN108196880A (en) * 2017-12-11 2018-06-22 北京大学 Software project knowledge mapping method for automatically constructing and system
CN111125381A (en) * 2018-11-01 2020-05-08 北大方正集团有限公司 Identification method, device, equipment and storage medium of key information of reference document
CN111125381B (en) * 2018-11-01 2023-08-11 新方正控股发展有限责任公司 Method, device, equipment and storage medium for identifying key information of reference
CN111401005A (en) * 2018-12-28 2020-07-10 北大方正集团有限公司 Text conversion method and device and readable storage medium
CN110688823A (en) * 2019-09-20 2020-01-14 中国银行股份有限公司 XML file verification method and device
CN110688823B (en) * 2019-09-20 2023-08-22 中国银行股份有限公司 XML file verification method and device
CN110717314A (en) * 2019-10-17 2020-01-21 长江师范学院 Document bibliographic format conversion method
CN110717314B (en) * 2019-10-17 2023-11-17 长江师范学院 Literature writing format conversion method
CN113505570A (en) * 2021-05-25 2021-10-15 北京北大方正电子有限公司 Method, device and equipment for checking and correcting falling-in-space in reference documents and storage medium
CN113505570B (en) * 2021-05-25 2024-04-12 北京北大方正电子有限公司 Reference is made to empty checking method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN105824791B (en) 2018-11-23

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant