CN107870991A

CN107870991A - Similarity calculation method for paper metadata and computer-readable storage medium

Info

Publication number: CN107870991A
Application number: CN201711022946.8A
Authority: CN
Inventors: 龙开亮; 王志
Original assignee: Hunan Latitude Mdt Infotech Ltd
Current assignee: Hunan Latitude Mdt Infotech Ltd
Priority date: 2017-10-27
Filing date: 2017-10-27
Publication date: 2018-04-03

Abstract

A similarity calculation method for paper metadata and a computer-readable storage medium, relating to the field of computer data processing. The method comprises the following steps: for two paper metadata records to be compared, extract at least two features from the first record and extract the corresponding features from the second record; determine the similarity of each pair of corresponding extracted features; and, according to the weight assigned to each feature, weight the feature similarities to obtain the similarity of the two records. The computer-readable storage medium stores a computer program that implements the steps of the method when executed. The invention can improve the accuracy of redundancy detection for paper metadata.

Description

Similarity calculation method for paper metadata and computer-readable storage medium

Technical field

The present invention relates to the field of computer data processing, and in particular to a similarity calculation method for paper metadata and a computer-readable storage medium.

Background art

At present, scientific research output has become the standard by which domestic universities and research institutions measure their basic research strength. At many universities, professional title evaluation, performance appraisal, and departmental bonuses are tied directly to research output and its impact, and the same output provides an objective factual basis for major decisions by universities and relevant authorities, such as departmental restructuring, development adjustment, and planning. As an important carrier of scientific achievement, papers hold a key position in research statistics. With the development of information technology and the globalization of scholarly communication, various abstract databases offer paper metadata retrieval at different levels and scopes. Collecting paper metadata from different abstract databases and analysing it statistically is an important means of assessing an institution's publications. However, the data sources of these databases overlap to varying degrees, so the aggregated paper metadata contains some redundancy, which affects the accuracy of statistical analysis to some extent. Redundancy detection and removal of duplicate records in metadata gathered from different databases is therefore the basis for subsequent data processing, and deciding whether two metadata records are identical is the key to redundancy detection.

Duplicate detection for unstructured web data has long been studied extensively in the industry, has produced a steady stream of algorithms, and is used in current search engines. At present, whether two paper metadata records are identical is judged mainly by comparing whether the relevant field values agree, for example whether the doi (Digital Object Identifier, DOI) is the same, whether the author names and institutions are the same, and whether the publisher, year, issue, and page numbers agree. However, metadata is structured data with semantics, so both its duplicate-detection criteria and its accuracy requirements are stricter. Existing duplicate-detection schemes for unstructured data therefore cannot fully meet the requirements of metadata duplicate detection. In addition, the exact-match duplicate-detection schemes commonly used for databases are not suited to an environment in which the metadata itself may contain partial data errors.

Summary of the invention

The object of the invention is to avoid the shortcomings of the prior art by providing a similarity calculation method for paper metadata and a computer-readable storage medium that can improve the accuracy of redundancy detection for paper metadata.

The object of the invention is achieved through the following technical solution:

A similarity calculation method for paper metadata is provided, comprising the following steps:

For two paper metadata records to be compared, extract at least two features from the first record and extract the corresponding features from the second record;

Determine the similarity of each pair of corresponding extracted features of the two records;

According to the weight assigned to each feature, weight the similarities of the features to obtain the similarity of the two paper metadata records.

Before the features are extracted, the relevant fields of the metadata are cleaned and normalized.

The feature types include at least two of: author, abstract, title, doi, publication issue, publication volume, and page number.

The features are extracted as follows:

Author: extract the author field from the metadata, split the string on a separator to obtain the author list, and save the author list and the field string as the author feature value;

Abstract: extract the abstract field from the metadata, segment the string into words to obtain the word list of the abstract, and save the word list and the field string as the abstract feature value;

Title: extract the title field from the metadata, segment the string into words to obtain the word list of the title, and save the word list and the field string as the title feature value;

doi: extract the doi field from the metadata and save the field string as the doi feature value;

Publication issue: extract the publication issue field from the metadata, convert the field string to numeric form, and use it as the publication issue feature value;

Publication volume: extract the publication volume field from the metadata, convert the field string to numeric form, and use it as the publication volume feature value;

Page number: extract the page number field from the metadata, extract the numerals from the string, assign them to the start page and end page according to their numeric size, and use the field string, start page, and end page as the page number feature value.
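
The extraction steps above can be sketched in Python. This is a minimal illustration and not the patented implementation: the field names (`author`, `description`, `docTitle`, `number`, `volume`, `pageRange`, `doi`) are borrowed from the sample records in the embodiment, the separator set is an assumption, and `tokenize` is a hypothetical stand-in for a real word segmenter.

```python
import re


def to_int(text):
    """Return the first integer found in text, or None when there is none."""
    match = re.search(r"\d+", text or "")
    return int(match.group()) if match else None


def tokenize(text):
    # Hypothetical stand-in for a word segmenter: split on non-word characters.
    return [t for t in re.split(r"\W+", text.lower()) if t]


def extract_features(record):
    """Build the per-feature values described in the text from one record."""
    def raw(key):
        return (record.get(key) or "").strip()

    author, pages = raw("author"), raw("pageRange")
    # Assumed separators: ASCII and full-width semicolons.
    authors = [a.strip() for a in re.split(r"[;；]", author) if a.strip()]
    # Numerals are ordered by size: smallest is the start page, largest the end page.
    numbers = sorted(int(n) for n in re.findall(r"\d+", pages))
    return {
        "author": {"raw": author, "list": authors},
        "abstract": {"raw": raw("description"), "words": tokenize(raw("description"))},
        "title": {"raw": raw("docTitle"), "words": tokenize(raw("docTitle"))},
        "doi": {"raw": raw("doi")},
        "issue": {"raw": raw("number"), "value": to_int(raw("number"))},
        "volume": {"raw": raw("volume"), "value": to_int(raw("volume"))},
        "pages": {
            "raw": pages,
            "start": numbers[0] if numbers else None,
            "end": numbers[-1] if numbers else None,
        },
    }


features = extract_features({
    "author": "Chen Nanyi；Zhang Gongming",
    "docTitle": "Head shape design",
    "number": "06",
    "pageRange": "25-26",
})
```

Keeping both the raw field string and the parsed value for each feature mirrors the text, which compares the strings first and falls back to the parsed values.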

The similarity of each feature is calculated as follows:

Author: set to 1 if the field strings are identical; otherwise count the number of positions at which the two author lists hold the same value, denoted N, take the length of the longer of the two author lists, denoted M, and set the similarity to N/M,

Abstract: set to 1 if the field strings are identical; otherwise set the similarity to the cosine distance between the word lists,

Title: set to 1 if the field strings are identical; otherwise set the similarity to the cosine distance between the word lists,

doi: set to 1 if the field strings are identical; otherwise set to 0,

Publication issue: set to 1 if the field strings or the publication issue values are identical; otherwise set to 0,

Publication volume: set to 1 if the field strings or the publication volume values are identical; otherwise set to 0,

Page number: set to 1 if the field strings, or the start page and end page, are identical; otherwise set to 0.
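
The author rule and the 0/1 rules above can be sketched as two small Python functions. The function names and argument layout are illustrative assumptions; the abstract and title rules are not repeated here because they depend on the cosine calculation between word lists.

```python
def author_similarity(raw_a, list_a, raw_b, list_b):
    """1 if the field strings match, otherwise N/M as described in the text."""
    if raw_a == raw_b:
        return 1.0
    longest = max(len(list_a), len(list_b))  # M: length of the longer list
    if longest == 0:
        return 0.0  # assumption: two empty lists with different strings score 0
    # N: positions where both lists hold the same author
    same = sum(1 for a, b in zip(list_a, list_b) if a == b)
    return same / longest


def exact_or_value_similarity(raw_a, raw_b, value_a=None, value_b=None):
    """1 if the field strings match or the parsed values match, otherwise 0.

    Covers doi (strings only) and issue, volume, and page number
    (strings or parsed values).
    """
    if raw_a == raw_b:
        return 1.0
    if value_a is not None and value_a == value_b:
        return 1.0
    return 0.0


author_score = author_similarity(
    "A;B;C", ["A", "B", "C"], "A;X;C;D", ["A", "X", "C", "D"]
)
issue_score = exact_or_value_similarity("06", "6", 6, 6)
page_score = exact_or_value_similarity("25-26", "pp. 25-26", (25, 26), (25, 26))
doi_score = exact_or_value_similarity("10.1/a", "10.1/b")
```

Comparing parsed values as well as strings is what lets "06" and "6", or "25-26" and "pp. 25-26", count as the same issue or page range.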

For every feature, if the value of that feature is empty in the metadata pair, the similarity of the feature is set to 0.5 and the weight of the feature is set to 1.
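
A minimal Python sketch of this empty-value rule follows. It assumes the rule fires when the value is empty in either record, which matches the worked embodiment, where an abstract that is empty in only one record scores 0.5; the function name and signature are hypothetical.

```python
def score_feature(value_a, value_b, similarity_fn, configured_weight):
    """Return (similarity, effective weight) for one feature.

    Assumption: the feature counts as empty if either record lacks a value.
    """
    if not value_a or not value_b:
        # Empty value: neutral similarity, and the configured weight is dropped.
        return 0.5, 1
    return similarity_fn(value_a, value_b), configured_weight


def same_string(a, b):
    return 1.0 if a == b else 0.0


empty_doi = score_feature("", "10.1000/x", same_string, 20)
equal_doi = score_feature("10.1000/x", "10.1000/x", same_string, 20)
```

Dropping the weight to 1 keeps a missing high-weight field such as doi from dominating the overall score in either direction.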

The cosine distance between two word lists is calculated as follows. Index the words of both word lists to obtain the index mapping table map, where map(term) = N means that the index of the word term is N, and let m be the size of map. For the first word list, create an array vec1 of size m with all elements initialized to 0, then traverse the word list: for each word t, whose index is map(t), set vec1[map(t)] = vec1[map(t)] + 1. When the traversal ends, vec1 is the term vector array of the first word list. The same operation on the second word list gives the term vector array vec2. The cosine distance is

$$\cos(vec1, vec2) = \frac{\sum_{i} vec1[i] \cdot vec2[i]}{\sqrt{\sum_{i} vec1[i]^{2}} \cdot \sqrt{\sum_{i} vec2[i]^{2}}}$$

where i runs over all m indices of map.
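
The term-vector construction and cosine calculation described above can be sketched in Python as follows. The function name is illustrative, and returning 0 for an empty word list is an assumption made only to avoid dividing by zero; in the method itself, empty feature values are handled separately.

```python
import math


def cosine_between_word_lists(words_a, words_b):
    # Build the index mapping table: each distinct word gets the next free index.
    index = {}
    for term in words_a + words_b:
        index.setdefault(term, len(index))
    m = len(index)

    # Term-frequency vectors of size m, initialized to 0.
    vec1 = [0] * m
    for t in words_a:
        vec1[index[t]] += 1
    vec2 = [0] * m
    for t in words_b:
        vec2[index[t]] += 1

    dot = sum(x * y for x, y in zip(vec1, vec2))
    norm = math.sqrt(sum(x * x for x in vec1)) * math.sqrt(sum(y * y for y in vec2))
    # Assumption: an empty list yields 0 rather than a division by zero.
    return dot / norm if norm else 0.0


half = cosine_between_word_lists(["a", "b"], ["a", "c"])
```

Identical word lists score 1 and lists with no word in common score 0, which is why the text can use this value directly as a similarity.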

The feature similarities are weighted to obtain the similarity of the metadata, as follows: suppose the feature types F₁, F₂, ..., F_n have been selected, where n is the number of selected features, and the corresponding weights are set to W₁, W₂, ..., W_n. If the similarities calculated for all features of the metadata pair are S₁, S₂, ..., S_n, then the similarity between the two metadata records is obtained by weighting each S_i with its W_i.
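
The combining formula itself does not appear in this text. One natural reading, assumed here and not confirmed by the source, is a weight-normalized mean of the feature similarities:

```latex
S = \frac{\sum_{i=1}^{n} W_i \, S_i}{\sum_{i=1}^{n} W_i}
```

Under this assumption S stays within [0, 1] whenever every S_i does, and a feature with an empty value enters the sum with S_i = 0.5 and W_i = 1 instead of its configured weight.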

A computer-readable storage medium stores a computer program that implements the steps of the method described above when executed.

When executed, the program allows the user to assign the weight of each feature.

Beneficial effects of the present invention:

In the similarity calculation method for paper metadata of the present invention, two metadata records pass in turn through initialization, feature extraction, and similarity calculation to obtain their similarity. Cleaning and normalizing the relevant metadata fields first removes unnecessary impurities and noise and ensures the accuracy of the data. Configuring the feature types used for the similarity calculation and combining different feature types allows the differences between metadata records to be compared from different dimensions and at fine granularity, which helps raise the recall of redundancy detection. Quantifying the similarity of the individual metadata features describes the differences between records more precisely, which helps reduce the error rate of redundancy detection. Assigning a weight to each feature takes into account both the dominant role of primary information (high-weight features) in redundancy detection and the influence of auxiliary information (low-weight features) on the similarity, so the error rate can be reduced while the recall of redundancy detection is raised.

The computer-readable storage medium of the present invention stores a computer program that implements the steps of the method described above when executed, and can improve the accuracy of redundancy detection for paper metadata.

Brief description of the drawings

The invention is further described with reference to the accompanying drawing, but the embodiment shown in the drawing does not limit the invention in any way; a person of ordinary skill in the art can derive other drawings from the following drawing without creative effort.

Fig. 1 is a flow chart of the similarity calculation method for paper metadata of the present invention.

Detailed description of the embodiments

The invention is further described with reference to the following embodiment.

For the similarity calculation method for paper metadata of this embodiment, see Fig. 1, which is a simplified schematic diagram. This embodiment is illustrated with the following two metadata records:

Metadata 1: {"pageRange": "25-26", "keywords": "electric multiple unit; structural design; driver's cab; exterior shape; head; advanced international standard; train running speed; railway design; China Star; design level; railway technology; air resistance; mechanical phenomenon; locomotive", "description": "In recent years, China's railways have designed and produced electric multiple units such as 'Pioneer', 'Blue Arrow', 'Star of Central Plains', and 'China Star', which are more advanced than China's conventional locomotives. The design level of these new electric multiple units is markedly close to the advanced international standard and has promoted the overall development of China's railways, and electric multiple units have become the direction of China's railway technology development. The running speed of these electric multiple units is markedly higher, and the higher the train running speed, the more it is affected by air resistance", "docTitle": "A new electric multiple unit head external form and cab structure design", "number": "06", "org": "Central South University Electrical and Mechanical Engineering College", "author": "Chen Nanyi; Zhang Gongming", "volume": "", "issn": "1672-0954", "journalTitle": "Invention and Innovation (general news column)", "doi": ""}.

Metadata 2: {"pageRange": "25-26", "keywords": "electric multiple unit; structural design; head configuration; driver's cab; 'Pioneer'; railway design; China Star; railway technology; advanced international standard; running speed", "description": "", "docTitle": "A new electric multiple unit head configuration and cab structure design", "number": "06", "org": "Central South University Electrical and Mechanical Engineering College", "author": "Chen Nanyi; Zhang Gongming", "colleges": [177], "volume": "", "issn": "1672-0954", "journalTitle": "Invention and Innovation (in Generation in class hour)", "doi": ""}.

The features doi, author, title, abstract, publication issue, publication volume, and page number are selected, with the weights 20, 3, 4, 5, 2, 2, and 4 assigned respectively. The similarity of each feature of the metadata is calculated as follows:

doi: the doi field is empty, so the similarity is set directly to 0.5 and the weight directly to 1; the configured weight of 20 is dropped;

Author: the author field strings are identical, so the similarity is 1 and the weight is 3;

Title: the docTitle field strings differ, so the similarity is the cosine distance between the segmented word lists, 0.8947368, and the weight is 4;

Abstract: the description field of one metadata record is empty, so the similarity is 0.5 and the weight is 1;

Publication issue: the number field strings are identical, so the similarity is 1 and the weight is 2;

Publication volume: the volume field is empty, so the similarity is 0.5 and the weight is 1;

Page number: the pageRange field strings are identical, so the similarity is 1 and the weight is 4.

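The similarity of the two metadata records is then obtained by weighting the seven feature similarities above with their effective weights.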
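
The worked example can be checked numerically. The sketch below assumes that the overall similarity is the weight-normalized mean of the feature similarities, which the text does not state explicitly; the variable and function names are illustrative.

```python
# (similarity, effective weight) per feature, taken from the worked example.
feature_scores = {
    "doi": (0.5, 1),          # empty, so the configured weight 20 is dropped
    "author": (1.0, 3),
    "title": (0.8947368, 4),
    "abstract": (0.5, 1),     # empty in one record
    "issue": (1.0, 2),
    "volume": (0.5, 1),       # empty
    "pages": (1.0, 4),
}


def weighted_similarity(scores):
    """Weight-normalized mean of the feature similarities (assumed formula)."""
    total_weight = sum(w for _, w in scores.values())
    weighted_sum = sum(s * w for s, w in scores.values())
    return weighted_sum / total_weight


overall = weighted_similarity(feature_scores)
print(round(overall, 4))  # about 0.8799 under the normalized-mean assumption
```

Under this assumption the weighted sum is 14.0789472 over a total weight of 16, so the two records score roughly 0.88.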

The computer-readable storage medium of this embodiment stores a computer program that implements the steps of the above method when executed.

When executed, the program allows the user to assign the weight of each feature.

In the similarity calculation method for paper metadata and the computer-readable storage medium of this embodiment, two metadata records pass in turn through initialization, feature extraction, and similarity calculation to obtain their similarity. Cleaning and normalizing the relevant metadata fields first removes unnecessary impurities and noise and ensures the accuracy of the data. Configuring the feature types used for the similarity calculation and combining different feature types allows the differences between metadata records to be compared from different dimensions and at fine granularity, which helps raise the recall of redundancy detection. Quantifying the similarity of the individual metadata features describes the differences between records more precisely, which helps reduce the error rate of redundancy detection. Assigning a weight to each feature takes into account both the dominant role of primary information (high-weight features) in redundancy detection and the influence of auxiliary information (low-weight features) on the similarity, so the error rate can be reduced while the recall of redundancy detection is raised.

Finally, it should be noted that the above embodiment only illustrates the technical solution of the present invention and does not limit its scope of protection. Although the invention has been explained with reference to a preferred embodiment, a person of ordinary skill in the art should understand that the technical solution may be modified or replaced by equivalents without departing from its substance and scope.

Claims

1. A method for calculating the similarity of paper metadata, characterized by comprising the following steps:

For two paper metadata records to be compared, extract at least two features from the first record and extract the corresponding features from the second record;

Determine the similarity of each pair of corresponding extracted features of the two records;

According to the weight assigned to each feature, weight the similarities of the features to obtain the similarity of the two paper metadata records.
2. The similarity calculation method for paper metadata according to claim 1, characterized in that the relevant fields of the metadata are cleaned and normalized before the features are extracted.
3. The similarity calculation method for paper metadata according to claim 1, characterized in that the feature types include at least two of: author, abstract, title, doi, publication issue, publication volume, and page number.
4. The similarity calculation method for paper metadata according to claim 3, characterized in that the features are extracted as follows:

Author: extract the author field from the metadata, split the string on a separator to obtain the author list, and save the author list and the field string as the author feature value;

Abstract: extract the abstract field from the metadata, segment the string into words to obtain the word list of the abstract, and save the word list and the field string as the abstract feature value;

Title: extract the title field from the metadata, segment the string into words to obtain the word list of the title, and save the word list and the field string as the title feature value;

doi: extract the doi field from the metadata and save the field string as the doi feature value;

Publication issue: extract the publication issue field from the metadata, convert the field string to numeric form, and use it as the publication issue feature value;

Publication volume: extract the publication volume field from the metadata, convert the field string to numeric form, and use it as the publication volume feature value;

Page number: extract the page number field from the metadata, extract the numerals from the string, assign them to the start page and end page according to their numeric size, and use the field string, start page, and end page as the page number feature value.
5. The similarity calculation method for paper metadata according to claim 4, characterized in that the similarity of each feature is calculated as follows:

Author: set to 1 if the field strings are identical; otherwise count the number of positions at which the two author lists hold the same value, denoted N, take the length of the longer of the two author lists, denoted M, and set the similarity to N/M,

Abstract: set to 1 if the field strings are identical; otherwise set the similarity to the cosine distance between the word lists,

Title: set to 1 if the field strings are identical; otherwise set the similarity to the cosine distance between the word lists,

doi: set to 1 if the field strings are identical; otherwise set to 0,

Publication issue: set to 1 if the field strings or the publication issue values are identical; otherwise set to 0,

Publication volume: set to 1 if the field strings or the publication volume values are identical; otherwise set to 0,

Page number: set to 1 if the field strings, or the start page and end page, are identical; otherwise set to 0.
6. The similarity calculation method for paper metadata according to claim 5, characterized in that, for every feature, if the value of that feature is empty in the metadata pair, the similarity of the feature is set to 0.5 and the weight of the feature is set to 1.
7. The similarity calculation method for paper metadata according to claim 6, characterized in that the cosine distance between the word lists is calculated as follows: index the words of both word lists to obtain the index mapping table map, where map(term) = N means that the index of the word term is N, and let m be the size of map; for the first word list, create an array vec1 of size m with all elements initialized to 0, then traverse the word list and, for each word t, whose index is map(t), set vec1[map(t)] = vec1[map(t)] + 1; when the traversal ends, vec1 is the term vector array of the first word list; the same operation on the second word list gives the term vector array vec2; the cosine distance is $$\cos(vec1, vec2) = \frac{\sum_{i} vec1[i] \cdot vec2[i]}{\sqrt{\sum_{i} vec1[i]^{2}} \cdot \sqrt{\sum_{i} vec2[i]^{2}}}$$ where i runs over all m indices of map.
8. The similarity calculation method for paper metadata according to claim 1, characterized in that the feature similarities are weighted to obtain the similarity of the metadata as follows: suppose the feature types F₁, F₂, ..., F_n have been selected, where n is the number of selected features, and the corresponding weights are set to W₁, W₂, ..., W_n; if the similarities calculated for all features of the metadata pair are S₁, S₂, ..., S_n, then the similarity between the two metadata records is obtained by weighting each S_i with its W_i.
9. A computer-readable storage medium storing a computer program, characterized in that the program, when executed, implements the steps of the method according to any one of claims 1 to 8.
10. The computer-readable storage medium according to claim 9, characterized in that the program, when executed, allows the user to assign the weight of each feature.