CN105824791A - Reference format checking method - Google Patents
- Publication number
- CN105824791A (application CN201610153946.0A)
- Authority
- CN
- China
- Prior art keywords
- references
- list
- bibliographical particulars
- bibliographical
- type
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention provides a reference format checking method comprising the following steps: (1) expressing the format rules for the bibliographic items of a reference in Schema, the format covering at least one of the following items: author, title, reference type, publisher, publication date and page number; (2) reading each reference and segmenting it into bibliographic items; (3) recognizing the bibliographic items, which include at least one of author, title, place of publication, publisher, publication date and the like, and extracting the recognized items into XML (Extensible Markup Language) nodes, while determining whether the reference contains a document type identifier and, if it does not, adding one according to the bibliographic items; (4) validating the bibliographic items against the format rules.
Description
Technical field
The present invention belongs to the field of text processing and in particular relates to a reference format checking method.
Background technology
Papers almost inevitably cite previously published references to help readers understand the background. A citation generally has to give the reference's author, title, publisher (that is, where the work was published), page numbers (publish page) and publication date (publish year). In collections that gather many papers, such as conference proceedings and large journals, every paper cites a large number of references, so it is difficult to ensure that all papers cite them in the same format.
At present, the format requirements are checked by the responsible reviewer while reviewing the manuscript and are then checked again by the editor. Such purely manual checking cannot guarantee that nothing is missed.
Summary of the invention
In the prior art, manual checking of reference formats is prone to omissions and cannot ensure that every paper in a collection or journal cites references according to the same rules. The technical problem addressed by the present invention is therefore to provide a reference format checking method and system that automatically checks whether the citations in an electronic paper conform to preset rules, thereby standardizing reference formats, improving efficiency and preventing omissions.
To solve the above problem, an embodiment of the present invention provides a reference format checking method, comprising:
Step 1: expressing the format rules for the bibliographic items of a reference in Schema, wherein the format includes at least one of the following bibliographic items: author, title, reference type, publisher, publication date and page number;
Step 2: reading each reference and segmenting it into bibliographic items;
Step 3: recognizing the bibliographic items of the reference and extracting the recognized items into XML nodes, wherein the bibliographic items include at least one of: responsible party (author), title, place of publication, publisher, publication date and the like; and, at the same time, determining whether the bibliographic items of the reference include a document type identifier and, if not, adding the document type identifier of the reference according to its bibliographic items;
Step 4: validating the bibliographic items against the format rules.
The method further comprises:
Step 5: when a bibliographic item of a reference contains an error, correcting the item, specifically:
when the error is a missing item, completing the item, adding the punctuation and reassembling the reference in the standard format;
when the error is an extra item, deleting the item, adding the punctuation and reassembling the reference in the standard format;
when the error is an incorrect item, correcting it according to the standard form, adding the punctuation and reassembling the reference in the standard format.
Step 2 comprises:
Step 21: using Apache POI to parse the document and extract the reference content;
Step 22: segmenting the extracted reference content into bibliographic items, including:
examining the symbols in the reference to determine whether it contains non-half-width (full-width) symbols and, if so, replacing them with the corresponding half-width symbols;
segmenting the bibliographic items according to the bibliographic punctuation symbols.
Step 3 comprises: using a preset bibliographic item recognition model to recognize the references cited in the Word document of the paper and extract their bibliographic items, the recognition model being obtained by learning from a preset corpus. Specifically:
Step 31: extracting a corpus;
Step 32: using the preset corpus to train a bibliographic item recognition model with an NER algorithm;
Step 33: determining whether the reference includes a reference type parameter and, if not, determining the type of the reference from its bibliographic items.
Step 33 comprises:
Step 331: constructing a decision tree over the bibliographic items, specifically:
computing the Gini index (Gini), the entropy (Entropy) and the error rate (Error) by the following formulas, where p(i) is the proportion of the i-th of n classes of attribute values:
$$Gini = 1 - \sum_{i=1}^{n} p(i)^2$$
$$Entropy = -\sum_{i=1}^{n} p(i) \log_2 p(i)$$
$$Error = 1 - \max\{\, p(i) \mid i \in [1, n] \,\}$$
and computing the information gain Gain and the information gain ratio GainRate:
$$Gain(U, V) = Entropy(U) - Entropy(U \mid V)$$
$$GainRate(U, V) = Gain(U, V) / Entropy(V)$$
so as to determine the root node and the best grouping variable of the decision tree;
Step 332: preprocessing the data, which specifically includes: checking the completeness of the bibliographic items of the reference and converting non-numeric, non-nominal data into numeric or nominal data; checking whether the reference lacks any bibliographic item and, if so, filling the missing value according to the related bibliographic items of the reference; deleting insignificant bibliographic items according to the correlation between items; and generalizing the data;
Step 333: performing type determination using the decision tree of the references and the preprocessed data. In an embodiment of the present invention, the WEKA platform is used for type determination.
Step 333 specifically comprises:
Step 3331: importing the data set to be tested;
Step 3332: obtaining the test data preprocessed in step 332;
Step 3333: placing the processed data set into different learning schemes for learning, and building a prediction model to predict unknown instances;
Step 3334: evaluating the prediction results.
The beneficial effects of the above technical solution are as follows:
With the proliferation of scientific papers, the relevant national authorities have standardized academic journals, and the format standard for references has become a rule that authors and editors must follow. Authors must learn the standard to produce high-quality papers, and editors must likewise learn it to proofread papers efficiently. Both therefore need a convenient tool for checking whether reference formats comply with the standard. Because different types of references have different formats, and a single type of reference has many bibliographic items, authors inevitably make mistakes when writing. References in scientific papers therefore still contain many non-standard entries, which makes proofreading harder for editors. This work mainly addresses reference format compliance and has considerable practical value.
1) It makes reference format checking more intelligent, reduces errors in reference entries and improves the efficiency of format checking.
2) It correctly interprets each bibliographic item of a reference, which facilitates further mining and use of references in the future (for example citation analysis, assessing the research level of academic works, and collating the research output of related authors).
The embodiments of the present invention can check reference format compliance, locate the exact position of an error and suggest how to correct it, which is convenient for researchers. The results are valuable for improving the quality of digital publishing, promoting the efficient dissemination and use of bibliographic information, and saving typesetting labor costs.
Brief description of the drawings
Fig. 1 is a flow chart of an embodiment of the present invention;
Fig. 2 shows the code of the XML-based OOXML structure in a Word file;
Fig. 3 is a schematic diagram of the element hierarchy of a Word document;
Fig. 4 shows ten correctly formatted references used as an example;
Fig. 5 is a schematic diagram of the structure of the reference decision tree of an embodiment of the present invention;
Fig. 6 is a schematic diagram of some of the records in an ARFF file;
Fig. 7 is a schematic diagram of the conversion in an embodiment of the present invention;
Fig. 8 shows ten references to be checked, used as an example;
Fig. 9 shows partial results of bibliographic item recognition;
Fig. 10 shows the checking results for the references of Fig. 8;
Fig. 11 shows the XML file generated during checking.
Detailed description of the invention
To make the technical problem to be solved, the technical solution and the advantages of the present invention clearer, a detailed description is given below with reference to the drawings and specific embodiments.
An embodiment of the present invention provides a reference format checking method, comprising:
Step 1: defining reference format templates using Schema. The purpose is to validate the XML file generated from the bibliographic items of a reference.
Step 2: reading each reference and segmenting it into bibliographic items. Segmentation ensures that each unit to be recognized belongs to a single category and lays the foundation for recognizing the bibliographic items in the next step; the more accurately the items are segmented, the higher the recognition accuracy.
Step 3: recognizing the bibliographic items. On the basis of the segmentation of step 2, each bibliographic item of the reference is recognized, including responsible party (author), title, place of publication, publisher, publication date and so on.
Step 4: extracting the recognized bibliographic items into XML nodes.
Step 5: determining the document type identifier. GB/T 7714-2005 prescribes the following document type identifiers: ordinary book (M), collection (G), standard (S), journal (J), computer program (CP), dissertation (D), report (R), patent (P), database (DB), electronic bulletin board (EB), magnetic tape (MT), disk (DK), conference proceedings (C), optical disc (CD), newspaper (N) and online (OL). The title item recognized in step 4 is searched for a document type identifier defined in GB/T 7714-2005.
Step 6: validating against the reference format templates of step 1. If the reference contains a document type identifier, the Schema template for that document type is invoked for validation; if it does not, the document type is determined first and the Schema template for that type is then invoked. If validation passes, the format of the reference is correct; if it fails, the format of the reference is wrong.
Step 7: identifying the erroneous bibliographic items and correcting them, including checking the order of the items, and finally generating a correct XML instance. The design is as follows. When the XML file fails Schema validation, the specific items that failed are extracted from the XML file. Failed items fall into three cases: a missing item, an extra item and an incorrect item. For a missing item, the item is completed, the punctuation is added and the reference is reassembled in the standard format. For an extra item, the item is deleted, the punctuation is added and the reference is reassembled in the standard format. For an incorrect item, the item is corrected according to the standard form, the punctuation is added and the reference is reassembled in the standard format.
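The three error cases of step 7 can be illustrated with a minimal Python sketch. It does not perform Schema validation; it only compares a dictionary of recognized items with a template. The item names mirror the element names of the example Schema in this description, while the `REQUIRED`/`OPTIONAL` lists, the `VALIDATORS` table (including the four-digit year check) and the function `classify_errors` are assumptions made for illustration.

```python
import re

# Hypothetical template for one document type: required and optional item names.
REQUIRED = ["author", "title", "type", "publisher", "publish_year"]
OPTIONAL = ["publish", "page_number"]

# Hypothetical per-item format checks; items without an entry are not checked.
VALIDATORS = {
    "publish_year": lambda value: re.fullmatch(r"\d{4}", value) is not None,
    "page_number": lambda value: re.fullmatch(r"(\d{1,4}-)?\d{1,4}", value) is not None,
}

def classify_errors(items, required=REQUIRED, optional=OPTIONAL, validators=VALIDATORS):
    """Sort template violations into missing, extra and incorrect items."""
    allowed = set(required) | set(optional)
    missing = [name for name in required if name not in items]
    extra = [name for name in items if name not in allowed]
    wrong = [
        name for name, value in items.items()
        if name in allowed and name in validators and not validators[name](value)
    ]
    return {"missing": missing, "extra": extra, "wrong": wrong}

recognized = {
    "author": "Zhu Gang",
    "title": "Example title",
    "type": "D",
    "publisher": "Tsinghua University",
    "page_number": "12-",  # malformed page range
    "isbn": "0000",        # not part of the template
}
print(classify_errors(recognized))
```

Each of the three lists would then drive the corresponding correction: completing, deleting or rewriting the item before the reference is reassembled.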
Each step of the embodiment of the present invention is described in detail below.
Step 1: expressing the format rules for the bibliographic items of a reference in code, wherein the format includes at least one of the following bibliographic items: author, title, reference type, publisher, publication date and page number.
In an embodiment of the present invention, the rules can be expressed in the XML Schema language.
The following is an example of bibliographic item format rules expressed in XML Schema, here for conference proceedings:
<?xml version="1.0" encoding="GB2312"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns="http://www.w3school.com.cn"
    targetNamespace="http://www.w3school.com.cn"
    elementFormDefault="qualified">
  <xs:element name="reference">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="author" type="xs:string"/>
        <xs:element name="title" type="xs:string"/>
        <xs:element name="type" type="xs:string"/>
        <xs:element name="publish" type="xs:string" minOccurs="0"/>
        <xs:element name="publisher" type="xs:string"/>
        <xs:element name="publish_year" type="xs:string"/>
        <xs:element name="page_number" minOccurs="0">
          <xs:simpleType>
            <xs:restriction base="xs:string">
              <xs:pattern value="(\d{1,4}-)?\d{1,4}"/>
            </xs:restriction>
          </xs:simpleType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
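The page_number restriction is the only item in this example with a non-trivial format rule. As a small sketch of what the pattern accepts, the same regular expression can be tried in Python. XML Schema patterns must match the whole value, so `re.fullmatch` is used here; the helper name `is_valid_page_number` is an illustrative assumption.

```python
import re

# Same expression as the xs:pattern facet: an optional "start-" prefix
# followed by a page number of one to four digits.
PAGE_NUMBER = re.compile(r"(\d{1,4}-)?\d{1,4}")

def is_valid_page_number(text):
    """Return True when the whole string matches the page number pattern."""
    return PAGE_NUMBER.fullmatch(text) is not None

for sample in ["7", "12-15", "12-", "12345"]:
    print(sample, is_valid_page_number(sample))
```

A single page or a range such as 12-15 is accepted, while a dangling hyphen or a five-digit number is rejected.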
Step 2: using a preset library to extract the references cited in the Word document, and extracting the bibliographic items from them. Step 2 consists of two parts, reference content extraction and bibliographic item extraction, and therefore specifically comprises:
Step 21: using Apache POI to parse the document and extract the reference content.
Most existing documents are stored in Microsoft Word format or a Word-compatible format. In a Microsoft Word document, information is stored in the XML-based OOXML (Office Open XML) format, so Apache POI 3.13 can be used to parse the document.
The OOXML structure is illustrated by the following example. Two paragraphs are edited in Microsoft Word 2013, "Chinese reference bibliographic item recognition" and "paper"; the corresponding XML code is shown in Fig. 2.
In the code, the <w:document> element is the root element of the document, and all other elements are its descendants. The attributes following the element name declare several namespaces.
The <w:body> element holds the document content and is the only required element. It contains many child elements; see the OOXML standard for details. Of these, three are the most basic: <w:p>, <w:r> and <w:t>. The <w:p> element represents a paragraph, that is, content that starts on a new line. The <w:r> element represents run-level content, which may be text, mathematical content, smart tags, user-defined markup and so on; a run is the smallest unit to which formatting is applied. The <w:t> element holds the actual text of a run. The hierarchy of these elements is shown in Fig. 3.
Because references occupy a fixed position in a paper, once the XML-based OOXML (Office Open XML) information is understood, Apache POI 3.13 can be used to parse the document and extract the reference content from it.
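The element hierarchy described above can be explored without Apache POI. The following is a minimal Python sketch, not the Apache POI approach of the embodiment: it opens a .docx archive with the standard library and joins the <w:t> text of each <w:p> paragraph. The function names and the in-memory sample document are assumptions for illustration; a real Word file contains further parts that are omitted here.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace bound to the "w" prefix in document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def paragraph_texts(docx_file):
    """Return the text of every <w:p> paragraph in a .docx file or file object."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    texts = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        # Concatenate the <w:t> text of all runs in this paragraph.
        pieces = [node.text or "" for node in paragraph.iter(f"{{{W_NS}}}t")]
        texts.append("".join(pieces))
    return texts

def make_sample_docx(paragraphs):
    """Build a minimal in-memory archive with only word/document.xml (demo only)."""
    body = "".join(
        f"<w:p><w:r><w:t>{text}</w:t></w:r></w:p>" for text in paragraphs
    )
    xml = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer

sample = make_sample_docx(["Chinese reference bibliographic item recognition", "paper"])
print(paragraph_texts(sample))
```

In a real paper, the paragraphs after the reference list heading would then be passed on to the segmentation of step 22.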
Step 22: extracting bibliographic items from the reference content.
A reference consists of several bibliographic items, so the items must first be segmented before they can be recognized. This specifically includes:
Symbol normalization: the symbols in the reference are examined to determine whether the reference contains non-half-width (full-width) symbols; if so, they are replaced with the corresponding half-width symbols.
Segmentation: the bibliographic items are segmented according to the bibliographic punctuation symbols prescribed in the national standard for references, GB/T 7714-2005.
In GB/T 7714-2005, every bibliographic punctuation symbol is a prefix symbol, that is, it precedes the item it introduces. For example, no symbol is used before the responsible party, which is the first item of a reference, while "." is the prefix symbol of items such as the title item and the source document item. Based on an analysis of GB/T 7714-2005, the different prefix symbols are used as delimiters for segmenting the bibliographic items.
Sampling and analysis of a number of postgraduate dissertations showed that formatting errors take many different forms, and statistical methods can be used to build a probability model of these errors. For example, if a reference uses the half-width "." as the separator between bibliographic items, the full-width period will almost never appear as a separator in the same reference; conversely, if a reference uses the full-width period as the separator, the half-width "." will almost never appear as one.
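The two sub-steps of step 22 can be sketched in Python as follows. This is a simplified illustration under stated assumptions: only ".", ":" and "," are treated as prefix symbols, whereas GB/T 7714-2005 prescribes more, and names such as `to_half_width` and `split_items` are hypothetical. A comma inside an author list would also be split by this sketch.

```python
import re

def to_half_width(text):
    """Replace full-width ASCII variants and the ideographic space and full stop."""
    out = []
    for ch in text:
        code = ord(ch)
        if 0xFF01 <= code <= 0xFF5E:   # full-width forms of "!" to "~"
            out.append(chr(code - 0xFEE0))
        elif ch == "\u3000":           # ideographic space
            out.append(" ")
        elif ch == "\u3002":           # ideographic full stop
            out.append(".")
        else:
            out.append(ch)
    return "".join(out)

# Illustrative subset of prefix symbols used as delimiters.
SEPARATORS = re.compile(r"([.:,])")

def split_items(reference):
    """Return (prefix symbol, item text) pairs; the first item has no prefix."""
    parts = SEPARATORS.split(to_half_width(reference))
    items = [("", parts[0].strip())]
    # re.split with one capture group alternates text and separator.
    for symbol, text in zip(parts[1::2], parts[2::2]):
        if text.strip():
            items.append((symbol, text.strip()))
    return items

reference = "Zhu Gang．Novel method［D］．Beijing：Tsinghua University，1996．"
for symbol, item in split_items(reference):
    print(repr(symbol), item)
```

Keeping the prefix symbol with each item preserves the information that the later recognition and validation steps need about which punctuation introduced the item.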
Step 3: using a preset bibliographic item recognition model to recognize the references cited in the Word document of the paper and extract their bibliographic items, the recognition model being obtained by learning from a preset corpus.
In an embodiment of the present invention, the Stanford Named Entity Recognizer (NER), Stanford University's named entity recognition method based on conditional random fields, is used. NER labels entities by category, such as person names, company names, regions, and the names of genes and proteins. NER comes with well-engineered feature extractors for named entity recognition, and a model is obtained by training. In principle, the more training data (that is, manually annotated text) is available, the better NER performs. To meet new requirements, the model has to be retrained.
Step 3 therefore specifically comprises:
Step 31: extracting a corpus. Specifically, the corpus uses the annotated People's Daily corpus of January 1998 and the 2015 Peking University edition of "A Guide to the Core Journals of China".
In detail:
1. The annotated People's Daily corpus of January 1998: person names, place names and similar nouns make up a large proportion of the People's Daily corpus, so it is good training material.
2. The 2015 Peking University edition of "A Guide to the Core Journals of China": besides person names and place names, common journal titles and keywords frequently used in paper titles also need to be recognized well, so the guide is used together with the People's Daily corpus.
For example, the titles of dissertations and of journals often contain the word "based on". The journal titles appearing in the 2015 Peking University edition of "A Guide to the Core Journals of China" and the keywords commonly found in titles, obtained by counting, are therefore added to the People's Daily corpus. These parts are combined to form the training set of the system, which is saved in the file testdata.tsv and used for closed testing of the system.
In addition, references extracted from the back matter of this university's dissertations form a test set, which is used for open testing of the system.
The extracted reference content may be as shown in Fig. 4.
Step 32: using the preset corpus to train a bibliographic item recognition model with the NER algorithm.
NER provides two ways of training a model: from the command line and through a configuration file. In an embodiment of the present invention, the configuration file approach can be used.
Specifically, the configuration file in Stanford NER is named austen.prop, and its parameters are modified as shown in Table 1 below.
Table 1: Parameters modified in austen.prop
Here trainFile specifies the data set used for training, and serializeTo specifies the name of the model output after training. The modified configuration file is saved and placed, together with the training data set testdata.tsv, in the root directory of the program, and the following command is executed:
java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop austen.prop
After the command runs successfully, ner-model.ser.gz is generated in the directory; this is the model obtained from the training data.
Once the bibliographic item recognition model has been obtained, it can be used to recognize the bibliographic items from step 2.
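The training file testdata.tsv is a tab-separated file. As a hedged sketch of how annotated references might be turned into such a file, the Python snippet below writes one token and one label per line and ends each reference with a blank line, which is the column layout commonly used with Stanford NER. The label names (AUTHOR, TITLE, O), whitespace tokenization and the helper `to_tsv_lines` are assumptions; Chinese text would need a word segmenter instead of `split()`.

```python
def to_tsv_lines(labelled_items):
    """Convert (text, label) pairs of one reference into token<TAB>label lines."""
    lines = []
    for text, label in labelled_items:
        for token in text.split():
            lines.append(f"{token}\t{label}")
    lines.append("")  # blank line marks the end of this reference
    return lines

# Hypothetical annotation of one reference.
annotated = [
    ("Zhu Gang", "AUTHOR"),
    (".", "O"),
    ("Novel fluid finite element method", "TITLE"),
    (".", "O"),
]
print("\n".join(to_tsv_lines(annotated)))
```

The column order would have to agree with the column mapping set in the configuration file used for training.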
Because the information in a reference may be incomplete, the reference type may be missing, and the reference type has a major influence on the subsequent format checking. An embodiment of the present invention may therefore further comprise:
Step 33: determining whether the reference includes a reference type parameter and, if not, determining the type of the reference from its bibliographic items.
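The first half of step 33, checking whether a type identifier is present, can be sketched with a regular expression. The codes are the ones listed in step 5 of this description. The bracket notation such as [M] or [J/OL] follows GB/T 7714-2005, and the split into document types and carriers, the English labels and the function name `find_type_mark` are assumptions of this sketch.

```python
import re

# Identifier codes listed in this description (English labels are illustrative).
DOCUMENT_TYPES = {
    "M": "book", "G": "collection", "S": "standard", "J": "journal",
    "CP": "computer program", "D": "dissertation", "R": "report",
    "P": "patent", "DB": "database", "EB": "electronic bulletin board",
    "C": "conference proceedings", "N": "newspaper",
}
CARRIERS = {"MT": "magnetic tape", "DK": "disk", "CD": "optical disc", "OL": "online"}

# Matches marks such as [D] or [J/OL]; a numeric label like [1] does not match.
MARK = re.compile(r"\[([A-Z]{1,2})(?:/([A-Z]{2}))?\]")

def find_type_mark(reference):
    """Return (document type code, carrier code or None), or None if absent."""
    for match in MARK.finditer(reference):
        doc_type, carrier = match.group(1), match.group(2)
        if doc_type in DOCUMENT_TYPES and (carrier is None or carrier in CARRIERS):
            return doc_type, carrier
    return None

print(find_type_mark("[1] Zhu Gang. Example title[D]. Beijing: Tsinghua University, 1996."))
print(find_type_mark("Zhu Gang. Example title. Beijing: Tsinghua University, 1996."))
```

When the function returns None, the type would be predicted from the bibliographic items with the decision tree described next.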
Specifically, step 33 comprises:
Step 331: constructing a decision tree over the bibliographic items.
Take again the reference content shown in Fig. 4, which consists of ten correctly formatted references. As Fig. 4 shows, each reference is made up of many bibliographic items, and different types of references are made up of different items. From an analysis of the ten references, the bibliographic items and their attribute values are summarized in Table 2.
Table 2: Bibliographic items of the references and their attribute values
After data conversion, the information model of each reference in Fig. 4 is as shown in Table 3:
Table 3: Information model of each reference in Fig. 4
The bibliographic item decision tree shown in Fig. 5 can therefore be constructed from Table 3. The decision tree in Fig. 5 can be used to predict the type of a document of unknown type. For example, given the following reference:
Zhu Gang. A novel fluid finite element method and the direct and inverse mixed problem of turbomachinery. Beijing: Tsinghua University, 1996.
the decision tree of Fig. 5 predicts that it is a dissertation.
Two key questions arise in generating the decision tree of Fig. 5.
The first is how to select the current best grouping variable from the many input variables. For example, why is publisher type used as the root node of the decision tree? Why is author type, rather than some other bibliographic item, selected as the child node on the next level?
The second is how to find the best split point among the many values of the grouping variable. For example, publisher type is a nominal variable whose values are "journal", "educational institution", "other" and "publishing house"; why is "educational institution" selected as the split point? Once these two questions are answered, the decision tree can readily be constructed.
Building a decision tree requires the concept of "purity". Three purity measures are in common use: the Gini index (Gini), the entropy (Entropy) and the error rate (Error). In an embodiment of the present invention they can be computed as follows.
Suppose an attribute of a bibliographic item has n different classes of attribute values i (i = 1, 2, ..., n), and the proportion of each class is p(i) = (number of values of class i) / (total number of values of the attribute), so that p(i) lies in [0, 1]. Then:
$$Gini = 1 - \sum_{i=1}^{n} p(i)^2 \tag{1}$$
$$Entropy = -\sum_{i=1}^{n} p(i) \log_2 p(i) \tag{2}$$
$$Error = 1 - \max\{\, p(i) \mid i \in [1, n] \,\} \tag{3}$$
For all three purity measures, formulas (1) to (3), a larger value means "less pure" and a smaller value means "purer". Practice shows that the choice among the three formulas has little effect on the final classification accuracy. Here the entropy formula is used, and two common attribute selection measures are derived from it: the information gain (Gain), formula (4), and the information gain ratio (GainRate), formula (5).
$$Gain(U, V) = Entropy(U) - Entropy(U \mid V) \tag{4}$$
$$GainRate(U, V) = Gain(U, V) / Entropy(V) \tag{5}$$
In information theory, information exchange is regarded as being carried out by a transmission system consisting of a source, a channel and a sink; the source is the sender of the information and the sink is its receiver. Taking the prediction of the reference type code above as an example, author type (T1), report number (T2), patent number (T3), publisher type (T4), year, volume and issue marker (T5) and page number (T6) are the input variables, and the reference type code is the output variable. The decision tree regards the output variable (the reference type code) as the information U sent by the source, and the input variables as the information V received by the sink.
The information gain ratio (GainRate) is used to answer the two key questions above. The calculation proceeds as follows.
Take author type T1 as an example: Entropy(U), Entropy(U | T1), Gain(U, T1) and GainRate(U, T1) are computed in turn. Among the references there are two each of the journal, dissertation and book types, and one each of the report, conference proceedings, patent and one other type.
Suppose the output variable has M different attribute values u_i (i = 1, 2, ..., M), the proportion of each being p(u_i), and author type T1 has N different attribute values t1_j (j = 1, 2, ..., N). Then:
$$Entropy(U) = -\sum_{i=1}^{M} p(u_i) \log_2 p(u_i)$$
$$Entropy(U \mid T1) = -\sum_{j=1}^{N} p(t1_j) \sum_{i=1}^{M} p(u_i \mid t1_j) \log_2 p(u_i \mid t1_j)$$
$$Gain(U, T1) = Entropy(U) - Entropy(U \mid T1) = 0.553$$
$$GainRate(U, T1) = Gain(U, T1) / Entropy(T1) = 0.628$$
The information gain ratio of author type (T1) is thus 0.628. The other bibliographic items are computed in the same way, and T4 is finally found to have the largest information gain ratio, 1.275. T4 should therefore be selected as the best grouping variable, that is, as the root node of the decision tree.
Publisher type has four attribute values: "journal", "educational institution", "other" and "publishing house". The split point is selected by a calculation similar to the one above: the information gain ratio of "educational institution" is the largest, 3.948, so "educational institution" should be selected as the best split point.
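The purity measures and the gain ratio above are straightforward to compute. The following Python sketch implements formulas (1) to (5) with the standard library. The six-record toy data set is invented for illustration and is not the data of Fig. 4, so it does not reproduce the figures quoted above. Returning 0.0 when Entropy(V) is zero is a choice of this sketch to avoid division by zero.

```python
from collections import Counter
from math import log2

def proportions(values):
    """Proportion p(i) of each distinct value."""
    total = len(values)
    return [count / total for count in Counter(values).values()]

def gini(values):
    return 1 - sum(p * p for p in proportions(values))

def entropy(values):
    return -sum(p * log2(p) for p in proportions(values))

def error(values):
    return 1 - max(proportions(values))

def conditional_entropy(targets, attribute):
    """Entropy(U | V): entropy of the targets within each attribute value, weighted."""
    groups = {}
    for target, value in zip(targets, attribute):
        groups.setdefault(value, []).append(target)
    total = len(targets)
    return sum(len(group) / total * entropy(group) for group in groups.values())

def gain(targets, attribute):
    return entropy(targets) - conditional_entropy(targets, attribute)

def gain_rate(targets, attribute):
    split_entropy = entropy(attribute)
    if split_entropy == 0:
        return 0.0  # attribute has a single value and cannot split the data
    return gain(targets, attribute) / split_entropy

# Hypothetical toy data: reference type codes (U) and two candidate attributes.
ref_types = ["J", "J", "D", "D", "M", "M"]
publisher_type = ["journal", "journal", "school", "school", "press", "press"]
author_type = ["person"] * 6

print("GainRate(U, publisher type):", gain_rate(ref_types, publisher_type))
print("GainRate(U, author type):", gain_rate(ref_types, author_type))
```

In this toy data, publisher type separates the three reference types completely while author type carries no information, so publisher type would be chosen as the grouping variable.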
The above analysis shows that the decision tree of the embodiment of the present invention is an intuitive method of decision analysis with clear advantages. A decision tree model is easy to read and descriptive, which helps manual analysis. It also executes efficiently: it needs to be built only once and can then be reused, and expert prior knowledge can be embedded in it naturally.
The bibliographical particulars of a reference may contain inconsistent data, duplicated data, noisy data, high-dimensional data and similar problems. The data therefore need to be preprocessed before the bibliographical particulars are classified. That is, step 33 further includes:
Step 332: preprocess the data.
Specifically, the data preprocessing includes:
Step 3321: check the completeness of the bibliographical particulars of the reference.
The variables of a decision tree are of two types, numeric and nominal, so the main preprocessing work before the decision tree is built is to convert data that are neither numeric nor nominal into numeric or nominal data.
In data mining, suitable attributes are chosen from the raw data as mining attributes. The principles for the data used are: give attribute names and attribute values a clear meaning wherever possible, remove duplicate data, remove negligible fields, and select associated fields sensibly. The preprocessing is described in detail below.
The raw data are the extracted references. Each reference is split into its bibliographical particulars, and each bibliographical particular obtained from the split is treated as an attribute of a record. Table 4 below is a fragment chosen from the raw data.
Table 4 Raw data records
As Table 4 shows, after a reference is split, some field values are missing and some fields can be ignored. Step 332 can therefore include the following three sub-steps:
Step 3322: check whether the reference has missing bibliographical particulars; if it does, fill in the missing values according to the related bibliographical particulars of the reference.
For example, "patent country", "patent number" and "report number" in Table 4 have missing values, and the missing values of the selected data are filled in. The principle for filling in a missing value is to follow the type of the values that the field already has: if the existing values of a field are numeric, the values filled in for that field must also be numeric; if the existing values are nominal, the values filled in must also be nominal.
Step 3323: delete insignificant bibliographical particulars according to their relevance. For example, in the reference content shown in Fig. 4, the serial numbers at the front (1, 2, 3, ..., 10) have no effect on the prediction result and only increase the computational complexity, so they can be deleted. Likewise, the place of publication does not affect the final reference type wherever it is, so the "publication place" field can be ignored.
Step 3324: generalize the data. The values of each field in the data can be summarized into a few classes. For example, the values of the "owner" field fall into two classes: names of specific persons and names of organizations. The document type is predicted from whether the "owner" field holds a person name or an organization name, regardless of which person or which organization is named. The "owner" field can therefore be generalized into the two classes person name and organization name. All data in a similar situation are generalized in the same way.
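The three sub-steps above (filling in missing values, deleting insignificant fields, generalizing values) can be sketched as follows. This is a minimal illustration, not the procedure of the embodiment: the field names, the keyword list used to recognise an organization, and the choice of "NUL" as the fill value for every nominal field are all assumptions made for the example.

```python
# Fields assumed to carry no information about the reference type.
IGNORED_FIELDS = {"serial_number", "publish_place"}
# Hypothetical keywords that mark an owner as an organization.
ORG_KEYWORDS = ("university", "institute", "committee", "company", "press")


def generalize_owner(owner):
    """Generalize a concrete owner name into one of two nominal classes."""
    lowered = owner.lower()
    if any(word in lowered for word in ORG_KEYWORDS):
        return "PER.Group"
    return "PER.Individual"


def preprocess(record, all_fields):
    """Return a cleaned copy of one split reference record."""
    cleaned = {}
    for field in all_fields:
        if field in IGNORED_FIELDS:
            continue                      # delete insignificant fields
        value = record.get(field)
        if value is None or value == "":
            cleaned[field] = "NUL"        # fill a missing nominal value
        elif field == "owner":
            cleaned[field] = generalize_owner(value)
        else:
            cleaned[field] = value
    return cleaned


FIELDS = ["serial_number", "owner", "publish_place", "publisher"]
raw = {"serial_number": "1", "owner": "Chen Luyao",
       "publish_place": "Beijing", "publisher": ""}
print(preprocess(raw, FIELDS))
```

A numeric field would need a numeric fill value instead of "NUL", in line with the filling principle stated above.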
After the above steps, the preprocessed data shown in Table 5 are obtained.
Table 5 Preprocessed data
As Table 5 shows, the preprocessed fields and field values are all in English. This is only one option of the embodiment of the present invention; the preprocessed fields and field values may be expressed in any other form. Because the embodiment of the present invention uses the WEKA system to determine the reference type, English fields and field values give better results.
The field values of each field are explained below using the example of Table 5:
The "owner" field takes the values PER.Individual and PER.Group. PER refers to a person or group of people mentioned in the document; PER.Individual and PER.Group are subclasses of PER and refer to an individual and to a group or organization respectively.
The "title feature word" (signature) field takes values such as title_D_tag and title_C_tag. For example, title_C_tag indicates that the title of a proceedings volume contains the feature word "proceedings", in the same way that the title of a report generally contains the feature word "report". Types without a feature word are recorded as no.
The "publisher" field takes the values PUB.Press, PUB.Journal, PUB.School, PUB.Institution and NUL, which refer to a non-school publishing house, a journal, a school publishing house and a research institute respectively; NUL indicates a missing item.
In step 33, after the decision tree has been constructed in step 331 and the data have been preprocessed in step 332, the type of the reference needs to be determined. The method therefore further includes:
Step 333: determine the type using the decision tree of the reference and the preprocessed data. In the embodiment of the present invention, the WEKA platform is used for the type determination.
Data mining on the WEKA platform proceeds as follows:
1) import the data set to be tested;
2) preprocess the data to be tested (completed in step 332);
3) place the processed data set in different learning schemes to learn, build a prediction model and predict unknown instances;
4) evaluate and visualize the prediction results.
These four steps are described in detail below. Step 333 therefore specifically includes:
Step 3331: import the data set to be tested.
The WEKA platform can process data in CSV and ARFF files, and ARFF is the preferred format, so ARFF files are used here; the file format must be converted before the data set to be tested is imported. The raw data are stored in an EXCEL file, which is first converted into a CSV file and then into an ARFF file. Some of the records in the ARFF file are shown in Figure 6.
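The CSV-to-ARFF conversion can be sketched as below. The sketch assumes that every column is nominal and that no value contains spaces, commas or quotes (such values would need quoting in ARFF); the column names and the two sample rows are invented for illustration and are not the records of Figure 6.

```python
import csv
import io


def csv_to_arff(csv_text, relation):
    """Convert CSV text with a header row into ARFF text.

    Every column is declared as a nominal attribute whose value set is
    collected from the data rows.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    for index, name in enumerate(header):
        nominal_values = sorted({row[index] for row in data})
        lines.append("@attribute %s {%s}" % (name, ",".join(nominal_values)))
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


SAMPLE_CSV = (
    "owner,title_tag,publisher,type\n"
    "PER.Individual,no,PUB.Journal,journal\n"
    "PER.Individual,title_D_tag,PUB.School,dissertation\n"
)

print(csv_to_arff(SAMPLE_CSV, "references"))
```

The last attribute ("type" here) plays the role of the class attribute, which is where WEKA classifiers conventionally expect the output variable.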
Step 3332: obtain the data preprocessed in step 332.
Step 3333: select specific classification algorithms for training and test classification. The classification module of the WEKA system integrates about 50 classification algorithms. The embodiment of the present invention selects three classical classification algorithms, NaiveBayes, J48 (decision tree) and ZeroR, to classify the test set.
The fourth step evaluates the results of the different classification algorithms. There are many methods for assessing classification accuracy, chiefly cross-validation, holdout, leave-one-out and resubstitution (back-substitution). Cross-validation and holdout are the most commonly used. Leave-one-out can be regarded as a special case of cross-validation. Resubstitution is not used, because overfitting makes its estimate of classification accuracy too high. Visualization can be applied either to the result of a single classification run or to a data set. Visualizing a data set shows a two-dimensional scatter plot for every pair of attributes; visualizing the output of a classification run shows classification errors, trees, cost curves, ROC curves and so on, which are used to assess the performance of each learning scheme.
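The evaluation idea can be illustrated with a small pure-Python sketch that applies k-fold cross-validation to a ZeroR-style baseline (always predict the most frequent training label). It does not call WEKA; the label list and the interleaved fold assignment are assumptions made for the example. Setting the number of folds to the number of samples gives leave-one-out, which shows why it is a special case of cross-validation.

```python
from collections import Counter


def zero_r(train_labels):
    """ZeroR baseline: return the most frequent label in the training set."""
    return Counter(train_labels).most_common(1)[0][0]


def cross_validate(labels, folds):
    """Accuracy of the ZeroR baseline under k-fold cross-validation.

    Sample i is assigned to fold i % folds. With folds == len(labels)
    this is leave-one-out.
    """
    correct = 0
    for fold in range(folds):
        test_idx = set(range(fold, len(labels), folds))
        train = [lab for i, lab in enumerate(labels) if i not in test_idx]
        prediction = zero_r(train)
        correct += sum(1 for i in test_idx if labels[i] == prediction)
    return correct / len(labels)


# Invented labels: six journal references, two dissertations, two books.
LABELS = ["journal"] * 6 + ["dissertation"] * 2 + ["book"] * 2

print(cross_validate(LABELS, 5))
print(cross_validate(LABELS, len(LABELS)))
```

A real classifier such as a decision tree would replace zero_r and would use the attribute values as well as the labels; the fold logic stays the same.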
Although some algorithms determine the document type mark with relatively high accuracy, none reaches 100% accuracy, and this affects the accuracy of the final format error detection.
To reduce prediction error as far as possible, the embodiment of the present invention also uses feature words (signatures). That is, after the decision tree has determined the document type mark, the type is determined again from the feature words. If the two determinations agree, that result is final; if they differ, the result determined from the feature words prevails. Table 6 lists the correspondence between the feature words and the reference types.
Table 6 Correspondence between feature words (signatures) and document types
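The combination of the two determinations can be sketched as follows. The feature-word table here is a hypothetical stand-in for Table 6, and falling back to the decision tree result when a title contains no feature word is an assumption, since the text only describes the case where both determinations exist.

```python
# Hypothetical feature words; the real pairs are those listed in Table 6.
FEATURE_WORDS = {
    "proceedings": "proceedings",
    "report": "report",
    "dissertation": "dissertation",
}


def type_from_feature_words(title):
    """Return the type implied by a feature word in the title, or None."""
    lowered = title.lower()
    for word, doc_type in FEATURE_WORDS.items():
        if word in lowered:
            return doc_type
    return None


def final_type(tree_prediction, title):
    """Combine the decision tree result with the feature-word result.

    If both agree the shared result is returned; if they differ the
    feature-word result prevails. With no feature word (assumption) the
    decision tree result is kept.
    """
    by_word = type_from_feature_words(title)
    if by_word is None:
        return tree_prediction
    return by_word


print(final_type("journal", "Annual report on digital publishing"))
print(final_type("journal", "A study of reference formats"))
```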
Step 4: check the identified bibliographical particulars of the reference against the format rules for reference bibliographical particulars. Specifically, an XML document is generated from the identified bibliographical particulars according to the reference type code and is then validated with the Schema. If validation passes, the format of the reference is correct; otherwise the format of the reference contains an error.
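Generating the XML document from the identified bibliographical particulars can be sketched with the Python standard library. The function name and the simplified position numbering (1, 2, 3, ... in order of appearance) are assumptions for illustration. Validation against the Schema is not shown, because it needs an XSD-capable validator, which the Python standard library does not include.

```python
import xml.etree.ElementTree as ET


def build_reference_xml(items):
    """Build a <reference> document from (tag, text) pairs.

    Each item becomes a child element whose "<tag>Loc" attribute records
    its position in the reference, counted from 1.
    """
    root = ET.Element("reference")
    for position, (tag, text) in enumerate(items, start=1):
        child = ET.SubElement(root, tag, {tag + "Loc": str(position)})
        child.text = text
    return ET.tostring(root, encoding="unicode")


ITEMS = [
    ("author", "Chen Luyao"),
    ("title", "An example title"),
    ("publish_year", "2015"),
    ("page_number", "4624-4629"),
]

print(build_reference_xml(ITEMS))
```

Recording the position of every item is what later allows a validation error to be translated back into the place in the reference where the problem occurs.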
In the embodiment of the present invention, the XML document generated for the journal type is as follows:
<?xml version="1.0" encoding="GB2312" standalone="no"?>
<reference xmlns="http://www.w3school.com.cn"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3school.com.cn J_pre.xsd">
  <author authorLoc="1">chen Luyao</author>
  <title titleLoc="2">the extraction of information document structure trust mode and logical description</title>
  <type typeLoc="2">j_pre</type>
  <publish publishLoc="3">beijing</publish>
  <publisher publisherLoc="4">computer utility is studied</publisher>
  <publish_year publish_yearLoc="5">2015</publish_year>
  <volumn_mark volumn_markLoc="6">27</volumn_mark>
  <page_number page_numberLoc="7">4624-4629</page_number>
</reference>
When a document fails validation against the Schema template, the compiler returns error information. The error information includes an error type and an error description. The error type gives a rough indication of the problem; locating the error precisely also requires the error description. Three common error types are summarized below:
(1) cvc-complex-type.3.1. In this type of error, an attribute value in the XML does not match the attribute value defined in the Schema, for example when the order of tags is reversed.
(2) cvc-complex-type.2.4.a. In this type of error, the logical structure of the XML file does not conform to the Schema specification, for example when an element appears that is not defined in the Schema.
(3) cvc-complex-type.2.4.b. In this type of error, the content of the XML file is incomplete, for example when an item is missing.
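Sorting validator messages into these three types can be sketched with a regular expression that picks out the error code. The short descriptions in the table paraphrase the list above, and the sample message is an invented illustration of how such a message may look, not output captured from the system.

```python
import re

# Short labels for the three error types summarized above.
ERROR_KINDS = {
    "cvc-complex-type.3.1": "attribute value mismatch",
    "cvc-complex-type.2.4.a": "unexpected element",
    "cvc-complex-type.2.4.b": "incomplete content",
}

# Matches codes such as cvc-complex-type.3.1 or cvc-complex-type.2.4.a
CODE_PATTERN = re.compile(r"cvc-complex-type\.\d+(?:\.\d+)*(?:\.[a-z])?")


def classify(message):
    """Return the error kind named by the code in a validator message."""
    match = CODE_PATTERN.search(message)
    if match is None:
        return "unknown"
    return ERROR_KINDS.get(match.group(0), "unknown")


sample = "cvc-complex-type.2.4.a: Invalid content was found starting with element 'publisher'."
print(classify(sample))
```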
The format errors that occur in references can be attributed to one or more of the above three types. The error detection process is shown in Table 7.
Table 7 Reference format error detection algorithm
In the algorithm of Table 7, R is the set of references to be checked and r is one reference in R. ERRORS is the set of error types for references that fail XML Schema validation, and Er is the error type corresponding to one reference. After detection, to locate the erroneous item, the error description provided by the compiler needs to be converted into the corresponding position information. Fig. 7 illustrates the conversion with an example.
As Fig. 7 shows, errors in bibliographical particulars fall into three cases: extra item, missing item and out-of-order item. The position numbers change differently in each case, so Algorithm 2 is designed from the position numbers and the content of the bibliographical particulars, as shown in Table 8.
Table 8 Reference error item location algorithm
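The three cases can be illustrated with the following sketch, which compares the order of items expected by the format rule with the order actually found. It is not Algorithm 2 of Table 8, only one possible way to report extra, missing and out-of-order items with their positions; the item names are examples and each item is assumed to occur at most once.

```python
def locate_errors(expected, actual):
    """Compare expected and actual item order and report problems.

    Returns a list of (kind, item, position) tuples, with positions
    counted from 1. Extra and out-of-order positions refer to the actual
    reference; missing positions refer to the expected order.
    """
    errors = []
    for position, item in enumerate(actual, start=1):
        if item not in expected:
            errors.append(("extra", item, position))
    for position, item in enumerate(expected, start=1):
        if item not in actual:
            errors.append(("missing", item, position))
    # Compare the relative order of the items present on both sides.
    common_expected = [item for item in expected if item in actual]
    common_actual = [item for item in actual if item in expected]
    for wanted, found in zip(common_expected, common_actual):
        if wanted != found:
            errors.append(("out_of_order", found, actual.index(found) + 1))
    return errors


EXPECTED = ["author", "title", "publisher", "publish_year", "page_number"]
ACTUAL = ["author", "title", "publish_year", "publisher"]

for error in locate_errors(EXPECTED, ACTUAL):
    print(error)
```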
Following the analysis of the algorithms, the 10 references to be checked in Fig. 8 are taken as an example. The system is used to check the 10 references for conformity, and the results are shown in Fig. 9 and Fig. 10.
This patent applies to checking the format of the references at the end of a paper and to correcting references whose format is wrong. In the present invention, reference information is extracted from Microsoft Word documents; the method is equally applicable to extracting references from plain text. An example is described below.
Fig. 9 shows part of the results of identifying the bibliographical particulars. Taking the first reference in Fig. 8 as an example, the first 8 rows of Fig. 9 are the identification result for that reference. The first row, "J_pre", indicates that the first reference lacks a document type mark and that the type determination predicts the journal type. The second row indicates that "Chen Luyao" is identified as the author; the third row that "the extraction of information document structure trust mode and logical description" is identified as the title; the fourth row that "Beijing" is identified as the publication place; the fifth row that "computer utility research" is identified as the publisher of the journal type; the sixth row that "2010" is identified as the publication year; the seventh row that "27" is identified as the volume; and the eighth row that "4624-4629" is identified as the page numbers.
Fig. 10 shows the detection results for the 10 references. For a reference whose format does not conform, the specific error position is indicated and a suggested correction is given to make revision easier. Fig. 11 shows the XML file generated during detection.
The present invention has the following beneficial effects:
With the large number of scientific papers now being produced, the relevant national departments have standardized academic journals, and the format standard for references has become a rule that authors and editors must follow. Authors must learn the standard in order to write papers of high quality, and editors must likewise learn it in order to proofread papers efficiently. Both therefore need a convenient tool for checking reference formats. Different types of reference have different formats, and a single reference has many bibliographical particulars, so authors inevitably make mistakes when compiling them. As a result, the references of scientific papers still contain many nonconforming entries, which makes proofreading harder for editors. This work mainly addresses the conformity of reference formats and has high practical value.
1) This research makes reference format checking more intelligent, reduces errors in recording references and improves the efficiency of reference format checking.
2) Correctly understanding each bibliographical particular of a reference supports further use of references in the future (for example, analysing citation information, assessing the research level of academic writing, and collating the research results of related authors).
The embodiment of the present invention can check whether the format of a reference conforms to the standard, locate the exact position of an error and suggest how to correct it, which is convenient for researchers. The results of this work are valuable for improving the quality of digital publishing, promoting the efficient dissemination and use of document information, and saving the labour cost of typesetting.
The above are preferred embodiments of the present invention. It should be noted that those skilled in the art may make improvements and modifications without departing from the principle of the present invention, and such improvements and modifications shall also fall within the protection scope of the present invention.
Claims (6)
1. A reference format checking method, characterized by comprising:
Step 1: expressing the format rules for reference bibliographical particulars with a Schema, wherein the format of the reference bibliographical particulars includes at least one of the following bibliographical particulars: owner, title, reference type, publisher, publication date and page number;
Step 2: reading each reference and splitting it into bibliographical particulars;
Step 3: identifying the bibliographical particulars of the reference and extracting the identified bibliographical particulars into XML nodes, wherein the bibliographical particulars include at least one of the following: owner, title, publication place, publisher and publication date; and at the same time judging whether the bibliographical particulars of the reference include a document type mark and, if not, adding the document type mark of the reference according to the bibliographical particulars;
Step 4: verifying the bibliographical particulars with the format rules for reference bibliographical particulars.
2. The reference format checking method according to claim 1, characterized in that the method further comprises:
Step 5: when an error exists in the bibliographical particulars of a reference, correcting the bibliographical particulars, specifically:
when the error is a missing item, completing the bibliographical particular, adding punctuation and reassembling the reference in the standard format;
when the error is an extra item, deleting the bibliographical particular, adding punctuation and reassembling the reference in the standard format;
when the error is a wrong item, correcting it according to the standard format, adding punctuation and reassembling the reference in the standard format.
3. The reference format checking method according to claim 1, characterized in that step 2 includes:
Step 21: using Apache POI to read the document and extract the reference content;
Step 22: splitting the extracted reference content to obtain the bibliographical particulars, including:
identifying the symbols in the reference to judge whether the reference includes non-half-width (full-width) characters and, if it does, replacing them with the corresponding half-width characters;
splitting the bibliographical particulars according to the symbols used for bibliographic description.
4. The reference format checking method according to claim 1, characterized in that step 3 includes: using a preset bibliographical particular identification model to identify the references listed in the paper (Word document) so as to extract the bibliographical particulars of the references, wherein the bibliographical particular identification model is obtained by learning from a preset training corpus; specifically:
Step 31: extracting a training corpus;
Step 32: using the preset training corpus and an NER algorithm to train the bibliographical particular identification model;
Step 33: judging whether the reference includes a reference type parameter and, if not, determining the type of the reference from its bibliographical particulars.
5. The reference format checking method according to claim 4, characterized in that step 33 includes:
Step 331: constructing the decision tree of the bibliographical particulars, specifically:
calculating the Gini index (Gini), the entropy (Entropy) and the error rate (Error) by the following formulas:
Error = 1 - max{p(i) | i in [1, n]}
and calculating the information gain Gain and the information gain ratio GainRate:
Gain(U, V) = Entropy(U) - Entropy(U | V)
GainRate(U, V) = Gain(U, V) / Entropy(V)
so as to determine the root node and the best grouping variables of the decision tree;
Step 332: preprocessing the data, specifically: checking the completeness of the bibliographical particulars of the reference so as to convert data that are neither numeric nor nominal into numeric or nominal data; checking whether the reference has missing bibliographical particulars and, if so, filling in the missing values according to the related bibliographical particulars of the reference; deleting insignificant bibliographical particulars according to their relevance; and generalizing the data;
Step 333: determining the type using the decision tree of the reference and the preprocessed data, wherein in the embodiment of the present invention the WEKA platform is used for the type determination.
6. The reference format checking method according to claim 5, characterized in that step 333 specifically includes:
Step 3331: importing the data set to be tested;
Step 3332: obtaining the test data preprocessed in step 332;
Step 3333: placing the processed data set in different learning schemes to learn, build a prediction model and predict unknown instances;
Step 3334: evaluating the prediction results.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610153946.0A CN105824791B (en) | 2016-03-17 | 2016-03-17 | A kind of bibliography format checking method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610153946.0A CN105824791B (en) | 2016-03-17 | 2016-03-17 | A kind of bibliography format checking method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105824791A true CN105824791A (en) | 2016-08-03 |
CN105824791B CN105824791B (en) | 2018-11-23 |
Family
ID=56525297
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610153946.0A Active CN105824791B (en) | 2016-03-17 | 2016-03-17 | A kind of bibliography format checking method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105824791B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||