CN109376238B - Paper correlation degree quantification method based on reference document list overlapping degree - Google Patents
- Publication number
- CN109376238B
- Authority
- CN
- China
- Prior art keywords
- paper
- articles
- article
- papers
- value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
The invention discloses a paper correlation degree quantification method based on the overlapping degree of reference lists. Using the co-citation idea, the method associates the overlapping degree of the reference lists with the similarity of the articles, quantifies that similarity by combining a statistically validated network with the false discovery rate, and obtains a threshold for judging whether two articles are similar. The invention also provides several applications of the quantification method: detecting the homology of paper citations, searching for missed citations, and simplifying reference lists. On this basis the method evaluates the relevance of papers, improves the accuracy and effectiveness of paper relevance calculation, and greatly reduces the amount of calculation. Applying the evaluation to missed-citation detection further provides a basis for the ranked retrieval, clustering and classification of papers and for error detection in reference lists.
Description
Technical Field
The invention belongs to the technical field of quantifying the similarity of academic papers, and particularly relates to a paper correlation measurement method based on the co-citation idea and a statistically validated network.
Background
With the explosive development of scientific fields, the number of academic papers is increasing rapidly. Under these conditions, quantifying the relevance among papers is of great value, because the relevance can serve as an important basis for document retrieval and for document classification and clustering. However, mainstream text analysis methods (such as methods based on cosine similarity and on TF-IDF) are poorly suited to academic papers with a huge volume of text: their computational complexity is high and their efficiency is low. Classification and screening methods (such as the naive Bayes algorithm and the KNN algorithm) give unreliable results when the academic papers belong to the same class and subject. On the other hand, the reference list is a very important part of an academic paper and reflects, among other things, the important problems in the research field of the paper. Reference lists nevertheless contain many implicit citation irregularities, such as missed citations and superfluous citations. These problems are difficult to detect, so a method is needed that captures the relevance between a paper and its references.
Disclosure of Invention
In view of the problems in the existing research, the invention aims to provide a paper similarity quantification method based on the overlapping degree of reference lists, in which the similarity between articles is measured by applying the statistically validated network method to the overlap of the references cited by two papers. The invention also uses the false discovery rate (FDR) method to perform multiple testing on the calculated values, obtaining a threshold for judging whether two papers are similar. In addition, the invention provides applications of the quantification method to problems such as verifying the homology of paper citations, searching for missed citations, and simplifying reference lists.
The technical scheme of the invention is as follows:
A paper correlation quantification method based on reference list overlapping degree comprises the following steps:
S1: Preprocessing the data set to build a paper citation network
Selecting papers in the field of Computer Science as the data set; extracting the useful information in the data set, including the number of each paper, its title and its reference list; calculating, for each paper, the number of articles it cites and the number of times it is cited; and storing this information in a dictionary structure to form a paper citation network;
S2: Classifying the data set of S1 and calculating related parameters
Classifying the data set of S1 and calculating the related parameters, the formula being as follows:
$$S_B^k = \{\, b \in B : b \text{ is cited by exactly } k \text{ articles} \,\}, \qquad S_A^k = \{\, a \in A : a \text{ cites at least one article in } S_B^k \,\}, \qquad N_B^k = |S_B^k|$$
wherein A represents the set of papers that cite at least one article, B represents the set of papers that are cited by at least two articles, $S_B^k$ represents the subset of B obtained by dividing B according to the number of citations k, $S_A^k$ represents the set of all articles in A that cite articles in $S_B^k$, and $N_B^k$ represents the size of $S_B^k$;
S3: Calculating the correlation p of two papers by using the parameters of S2 and the idea of a statistically validated network
S3.1: For a paper pair (i, j), under the assumption that paper i cites $d_i$ articles and paper j cites $d_j$ articles, calculating the probability that paper i and paper j jointly cite X articles, the calculation formula being as follows:
$$H(X \mid N_B^k, d_i, d_j) = \frac{\binom{d_i}{X}\binom{N_B^k - d_i}{d_j - X}}{\binom{N_B^k}{d_j}}$$
S3.2: Using $H(X \mid N_B^k, d_i, d_j)$ as described in S3.1, calculating the similarity quantization value p between any two papers, namely the probability that the number of articles in $S_B^k$ jointly cited by paper i and paper j is greater than or equal to $N_{ij}^k$, wherein each p value corresponds to a k value, the calculation formula being as follows:
$$p_{ij}(k) = 1 - \sum_{X=0}^{N_{ij}^k - 1} H(X \mid N_B^k, d_i, d_j)$$
wherein $d_i$ and $d_j$ represent the numbers of articles in the set $S_B^k$ cited by paper i and paper j respectively, $N_B^k$ represents the size of $S_B^k$ (the set of articles cited by exactly k articles), and $N_{ij}^k$ represents the number of articles in $S_B^k$ that paper i and paper j cite jointly; the p value $p_{ij}(k)$ is the probability that the number of articles in $S_B^k$ jointly cited by paper i and paper j is greater than or equal to $N_{ij}^k$;
S3.3: Repeating S3.1-S3.2 for all possible values of k from $k_{min}$ to $k_{max}$, such that each paper pair (i, j) is associated with more than one p value, each p value being associated with a citation count k;
S4: Performing multiple testing on the p values of S3 by using the false discovery rate (FDR) method to obtain a threshold $p^*$ for judging whether articles are significantly similar, and identifying the significantly similar paper pairs;
S4.1: Setting a statistical threshold $p^*$ and, assuming a total of $N_t$ tests, ranking the p values of all the different tests in increasing order:
$$q_1 \le q_2 \le \dots \le q_{N_t}$$
S4.2: Continuously readjusting the threshold $p^*$ until the maximum $t_{max}$ satisfying the condition is found, the condition being as follows:
$$q_{t_{max}} < \frac{p^* \, t_{max}}{N_t}$$
wherein $N_t$ denotes the number of tests, that is, the number of all different paper pairs tested on the sets $S_k$ corresponding to all values of the citation count k;
S4.3: Comparing each p value with the readjusted threshold $p^*$; if the p value is less than the threshold $p^*$, the similarity of the paper pair (i, j) passes statistical verification at the threshold $p^*$, i.e. papers i and j are similar;
S5: Analyzing the result of S4, verifying the homology of paper citations, finding possible missed citations, and deleting references with low correlation, so as to provide a correct reference for scientific evaluation
S5.1: Verifying the homology of paper citations: for each threshold $p^*$, calculating $P_{i \to j}(p^*)$, the calculation formula being as follows:
$$P_{i \to j}(p^*) = \frac{K(p^*)}{M(p^*)}$$
wherein $P_{i \to j}(p^*)$ is the probability that a citation exists between any two papers that pass similarity verification at the threshold $p^*$, $K(p^*)$ is the number of citations that exist between verified paper pairs, and $M(p^*)$ is the size of the set of paper pairs for which at least one associated p value passes the statistical test at the threshold $p^*$;
Fitting the functional curve of $P_{i \to j}(p^*)$ against $p^*$, obtaining the conclusion that the higher the similarity of two papers, the higher the probability that a citation relationship exists between them, and thereby verifying the homology of paper citations;
S5.2: Searching for possible missed citations: based on the conclusion of S5.1, selecting a paper m and obtaining, through steps S1-S4, a plurality of articles similar to paper m; collecting all articles that are similar to paper m and cite article n into a set S; if paper m does not cite article n, automatically setting a threshold p', and judging that paper m has missed citing article n when the size of the set S is larger than the threshold p';
S5.3: Performing steps S1-S4 on the paper and its own references, deleting the references whose similarity is not sufficiently high, and thereby simplifying the reference list of the paper.
The beneficial effects of the invention are as follows. Through the co-citation idea, the overlapping degree of the reference lists is associated with the similarity of two papers, and the similarity is quantified by combining a statistically validated network. In addition, the invention uses statistical methods to calculate the probability that two articles cite the same articles, and performs multiple testing on the calculated values with the false discovery rate (FDR) method to obtain a threshold for judging whether the two articles are similar. On this basis the invention can evaluate the similarity between papers and their references, thereby providing a basis for document retrieval and for the clustering and classification of papers, and helping authors and reviewers discover implicit citation problems such as missed citations and superfluous citations.
Drawings
FIG. 1 is a flow chart of a method for quantifying paper relevancy based on reference list overlap;
FIG. 3 is a graph of the functional relationship between the experimentally obtained $P_{i \to j}(p^*)$ and the statistical threshold $p^*$;
FIG. 4 is a diagram illustrating the missed-citation phenomenon.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in further detail below.
The embodiment of the invention provides a paper relevancy quantification method based on reference list overlapping degree. The flow is shown in FIG. 1, and the method comprises the following steps:
Step one: Preprocess the data and delete irrelevant redundant data.
Select the papers in the data set whose field information is Computer Science and delete their redundant attributes, leaving only the number id identifying each paper, its title, and its reference list references. At the same time, calculate for each article the number of articles it cites and the number of times it is cited, and store this information in a dictionary structure.
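Step one can be illustrated with a minimal Python sketch. It assumes that each raw record is a dictionary with the keys `id`, `title` and `references`; the function name `build_citation_network` and the stored field names `out_degree` and `in_degree` are hypothetical and are not taken from the data set itself.

```python
from collections import Counter


def build_citation_network(records):
    """Build a dictionary-based citation network from raw paper records.

    Each record is assumed to look like
    {"id": ..., "title": ..., "references": [...]} (an assumption of this sketch).
    """
    network = {}
    times_cited = Counter()
    for record in records:
        paper_id = record["id"]
        # Deduplicate the reference list and drop accidental self-citations.
        references = set(record.get("references") or []) - {paper_id}
        network[paper_id] = {
            "title": record["title"],
            "references": references,
            "out_degree": len(references),  # number of articles this paper cites
        }
        times_cited.update(references)
    for paper_id, entry in network.items():
        # Number of times this paper is cited within the data set.
        entry["in_degree"] = times_cited[paper_id]
    return network


records = [
    {"id": "p1", "title": "Paper one", "references": ["p3", "p4"]},
    {"id": "p2", "title": "Paper two", "references": ["p3"]},
    {"id": "p3", "title": "Paper three", "references": []},
]
network = build_citation_network(records)
print(network["p3"]["in_degree"])  # 2
```

Storing the references as a set makes the later intersection operations on reference lists cheap.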
Step two: the data sets are classified and simplified and relevant parameters are calculated.
2.1: Screen out the sets A, B, $S_k$, $S_A^k$ and $S_B^k$, where A represents the set of articles that cite at least one document, B represents the set of articles that are cited by at least two articles, $S_B^k$ represents the set of articles cited by k articles, $S_A^k$ represents the set of articles that cite at least one article in $S_B^k$, and $S_k$ is the union of $S_A^k$ and $S_B^k$. FIG. 2 shows the relationship between $S_A^k$ and $S_B^k$ more intuitively.
The screening process is as follows. First, every paper that cites at least one article is put into set A, and every paper that is cited by at least two articles is put into set B. Set B is then divided into subsets $S_B^k$ according to the number of citations k, and the size $N_B^k$ of each $S_B^k$ is calculated. Next, the articles in set A that cite articles in $S_B^k$ are put into the set $S_A^k$. Finally, the union of $S_A^k$ and $S_B^k$ is taken to obtain $S_k$.
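The screening process can be sketched in Python as follows. This is an illustration under stated assumptions rather than a definitive implementation: the input is assumed to be a mapping from each paper id to the set of ids it cites, and the function name `partition_by_citation_count` and the variable names are hypothetical.

```python
from collections import defaultdict


def partition_by_citation_count(references):
    """Split a citation network into the sets A, B, S_B^k, S_A^k and S_k.

    `references` maps each paper id to the set of paper ids it cites.
    """
    # Invert the network: for every cited article, collect the papers citing it.
    cited_by = defaultdict(set)
    for citing, cited_ids in references.items():
        for cited in cited_ids:
            cited_by[cited].add(citing)

    set_a = {paper for paper, cited_ids in references.items() if cited_ids}
    set_b = {paper for paper, citers in cited_by.items() if len(citers) >= 2}

    # S_B^k: articles of B cited by exactly k articles.
    s_b = defaultdict(set)
    for paper in set_b:
        s_b[len(cited_by[paper])].add(paper)

    # S_A^k: articles citing at least one member of S_B^k.
    s_a = {k: set().union(*(cited_by[p] for p in members)) for k, members in s_b.items()}
    n_b = {k: len(members) for k, members in s_b.items()}  # N_B^k = |S_B^k|
    s_k = {k: s_a[k] | s_b[k] for k in s_b}  # S_k is the union of both sets
    return set_a, set_b, dict(s_b), s_a, n_b, s_k
```

For example, with `{"p1": {"x", "y"}, "p2": {"x", "y"}, "p3": {"x"}, "p4": set()}` the article `x` falls into the subset for k = 3 and `y` into the subset for k = 2, while `p4` belongs to neither A nor B.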
2.2: Construct the bipartite statistically validated network, using the property that each article in the citation network can only cite other articles that have already been published, and calculate the values of the parameters $d_i$, $d_j$, $N_B^k$ and $N_{ij}^k$, where $d_i$ and $d_j$ represent the numbers of articles in the set $S_B^k$ cited by articles i and j respectively, $N_B^k$ represents the size of $S_B^k$, and $N_{ij}^k$ represents the measured number of articles in $S_B^k$ that i and j cite jointly.
If there is only one cited article with in-degree k, the number of articles jointly cited by the two articles must be 1, i.e. $N_{ij}^k = 1$. When there is more than one cited article, $N_{ij}^k$ is obtained as the size of the intersection of the sets of cited articles of the two articles.
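The three counts used for one pair and one value of k reduce to set intersections. The sketch below assumes reference lists and the set of articles with in-degree k are available as Python sets; the function name `overlap_counts` is hypothetical.

```python
def overlap_counts(refs_i, refs_j, s_b_k):
    """Return (d_i, d_j, n_ij) for one paper pair and one in-degree class.

    refs_i, refs_j: sets of articles cited by papers i and j.
    s_b_k: set of cited articles whose in-degree is k.
    """
    cited_i = refs_i & s_b_k  # articles with in-degree k cited by paper i
    cited_j = refs_j & s_b_k  # articles with in-degree k cited by paper j
    # The joint count is the size of the intersection of the two cited sets.
    return len(cited_i), len(cited_j), len(cited_i & cited_j)


print(overlap_counts({"a", "b", "z"}, {"b", "c"}, {"a", "b", "c"}))  # (2, 2, 1)
```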
Step three: Calculate the similarity quantization value p. Whether two articles are related is judged using the idea of a statistically validated network: the quantitative similarity value p between any two articles is the probability that the number of articles in $S_B^k$ jointly cited by articles i and j is greater than or equal to $N_{ij}^k$. Each p value corresponds to one k value (the number of times the reference is cited).
For each pair of papers (i, j), $d_i$ and $d_j$ denote the numbers of elements of $S_B^k$ in their respective reference lists.
Under the assumption that article i and article j separately cite $d_i$ and $d_j$ different articles, the probability that they jointly cite X of the same articles can be calculated using the following formula:
$$H(X \mid N_B^k, d_i, d_j) = \frac{\binom{d_i}{X}\binom{N_B^k - d_i}{d_j - X}}{\binom{N_B^k}{d_j}}$$
The p value for each article pair can thus be calculated using the following formula:
$$p_{ij}(k) = 1 - \sum_{X=0}^{N_{ij}^k - 1} H(X \mid N_B^k, d_i, d_j)$$
where $N_B^k$ is the size of $S_B^k$, and the p value $p_{ij}(k)$ is the probability that the number of articles in $S_B^k$ jointly cited by i and j is greater than or equal to $N_{ij}^k$.
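The p value can be sketched with the Python standard library alone, assuming the co-citation count follows the hypergeometric distribution that is standard for statistically validated networks. The function names `hypergeom_pmf` and `p_value` are hypothetical.

```python
from math import comb


def hypergeom_pmf(x, n_k, d_i, d_j):
    """Probability that two papers citing d_i and d_j of n_k articles share exactly x."""
    if x < 0 or x > min(d_i, d_j):
        return 0.0
    # comb(n, r) returns 0 when r > n, which covers impossible configurations.
    return comb(d_i, x) * comb(n_k - d_i, d_j - x) / comb(n_k, d_j)


def p_value(n_ij, n_k, d_i, d_j):
    """Probability of observing n_ij or more jointly cited articles by chance."""
    below = sum(hypergeom_pmf(x, n_k, d_i, d_j) for x in range(n_ij))
    return max(0.0, 1.0 - below)  # clamp tiny negative rounding errors


# Two papers each cite 3 of 10 articles and share all 3 of them.
print(p_value(3, 10, 3, 3))
```

In this example only one of the 120 equally likely choices of three articles reproduces the full overlap, so the p value is 1/120, which marks the overlap as unlikely to arise by chance.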
This process is repeated for all possible values of k from $k_{min}$ to $k_{max}$. Each article pair (i, j) is thus associated with more than one p value, and each associated p value corresponds to the cited articles whose in-degree is k.
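The loop over k can be sketched as below. The sketch assumes a mapping from each in-degree k to the set of articles with that in-degree, and it sums the upper tail of the hypergeometric distribution directly, which is mathematically equivalent to one minus the lower tail. The names `upper_tail` and `pair_p_values` are hypothetical.

```python
from math import comb


def upper_tail(n_ij, n_k, d_i, d_j):
    """P(X >= n_ij) for the hypergeometric co-citation count."""
    total = 0.0
    for x in range(n_ij, min(d_i, d_j) + 1):
        total += comb(d_i, x) * comb(n_k - d_i, d_j - x) / comb(n_k, d_j)
    return min(1.0, total)


def pair_p_values(refs_i, refs_j, s_b):
    """Return {k: p_ij(k)} for every in-degree class in which both papers cite something.

    s_b maps each in-degree k to the set of articles cited by exactly k papers.
    """
    result = {}
    for k in sorted(s_b):
        members = s_b[k]
        cited_i, cited_j = refs_i & members, refs_j & members
        if not cited_i or not cited_j:
            continue  # no test is possible for this value of k
        result[k] = upper_tail(len(cited_i & cited_j), len(members), len(cited_i), len(cited_j))
    return result
```

Summing the upper tail avoids subtracting two nearly equal numbers when the p value is very small.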
Step four: Multiple testing is performed on the results using the false discovery rate (FDR) method to obtain a significance threshold $p^*$, and all article pairs associated with a p value smaller than the threshold $p^*$ are validated. Given this statistical threshold, only validated article pairs are considered significantly similar. The specific steps are as follows:
4.1: Set a statistical threshold $p^*$ and assume a total of $N_t$ tests. Rank the p values of all the different tests in increasing order:
$$q_1 \le q_2 \le \dots \le q_{N_t}$$
4.2: Continuously readjust the threshold until the maximum $t_{max}$ satisfying the following condition is found:
$$q_{t_{max}} < \frac{p^* \, t_{max}}{N_t}$$
where $N_t$ denotes the number of tests; in the present invention, $N_t$ is the number of all different article pairs tested on the sets $S_k$ corresponding to all in-degrees k in the citation network.
4.3: Compare each p value with the readjusted threshold $p^*$. If the p value is less than the threshold $p^*$, the similarity of the article pair (i, j) passes statistical verification at the threshold $p^*$, i.e. papers i and j are similar.
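The multiple-testing step can be sketched as follows. The sketch assumes that the readjusted threshold equals the initial threshold multiplied by $t_{max}/N_t$, as in the usual false discovery rate procedure, and that the comparison is strict; both are assumptions of this illustration. The function name `fdr_threshold` is hypothetical.

```python
def fdr_threshold(p_values, p_star, n_tests=None):
    """Return (readjusted_threshold, t_max) for a false discovery rate correction.

    p_values: iterable of p values, one per test.
    p_star: initial statistical threshold.
    n_tests: total number of tests N_t (defaults to the number of p values).
    """
    ordered = sorted(p_values)
    n_t = len(ordered) if n_tests is None else n_tests
    if n_t == 0:
        return 0.0, 0
    t_max = 0
    for t, q_t in enumerate(ordered, start=1):
        # Keep the largest rank t whose p value lies below p_star * t / N_t.
        if q_t < p_star * t / n_t:
            t_max = t
    return p_star * t_max / n_t, t_max


p_values = [0.6, 0.001, 0.039, 0.008, 0.041]
threshold, t_max = fdr_threshold(p_values, 0.05)
validated = [p for p in p_values if p < threshold]
print(t_max, validated)  # 2 [0.001, 0.008]
```

With five tests and an initial threshold of 0.05, only the two smallest p values fall below their rank-dependent bounds, so the readjusted threshold is 0.05 × 2 / 5 = 0.02.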
Through the above steps, we can obtain the similarity values of all the paper pairs.
Step five: Apply the algorithm to study the citation pattern of homologous articles, reveal possible missed citations, and simplify the reference list.
5.1: Homology refers to the tendency of papers with a certain degree of similarity to cite one another. How homology manifests itself in paper citations can be seen by plotting the results obtained with the above algorithm. For a given threshold $p^*$, the probability $P_{i \to j}(p^*)$ that paper i cites paper j is observed among the paper pairs that pass verification. $M(p^*)$ decreases as $p^*$ decreases, because at a smaller $p^*$ the reference lists of an article pair must overlap to a higher degree for the pair to enter the set counted by $M(p^*)$.
For each threshold $p^*$, calculate $P_{i \to j}(p^*)$, i.e. the probability that a citation exists between any two papers that pass similarity verification at the threshold $p^*$, by the formula:
$$P_{i \to j}(p^*) = \frac{K(p^*)}{M(p^*)}$$
where $K(p^*)$ denotes the number of citations that exist between validated paper pairs, and $M(p^*)$ denotes the size of the set of paper pairs for which at least one associated p value passes the statistical test at the threshold $p^*$.
FIG. 3 shows the experimentally obtained $P_{i \to j}(p^*)$ as a function of the statistical threshold $p^*$. The graph shows that the higher the similarity of two articles, the higher the probability that a citation relationship exists between them. This verifies the homology of paper citations.
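The ratio of validated pairs that are linked by a citation to all validated pairs can be computed with a short sketch such as the one below. It assumes that a pair counts toward the numerator when either paper cites the other; the function name `citation_probability` is hypothetical.

```python
def citation_probability(validated_pairs, references):
    """Fraction of validated pairs (i, j) that are linked by a citation.

    validated_pairs: iterable of (i, j) pairs that passed the statistical test.
    references: mapping from paper id to the set of paper ids it cites.
    Returns None when no pair was validated.
    """
    pairs = list(validated_pairs)
    if not pairs:
        return None
    linked = sum(
        1
        for i, j in pairs
        if j in references.get(i, set()) or i in references.get(j, set())
    )
    return linked / len(pairs)


references = {"a": {"b"}, "b": set(), "c": {"x"}, "d": set()}
print(citation_probability([("a", "b"), ("c", "d")], references))  # 0.5
```

Evaluating this function for a decreasing sequence of thresholds yields the points of a curve like the one described for FIG. 3.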
5.2: The previous step verified that the higher the similarity of two articles, the higher the overlap of their reference lists. Consider the situation shown in FIG. 4. Article B is verified to be similar to articles C, D, E, etc. at a small threshold, and the reference lists of B and C, B and D, and B and E all overlap, i.e. B and C, D, E have a high probability of citing the same documents. Suppose C, D and E all cite document A, and let S be the set of all articles that are similar to B and cite A. When the set S is sufficiently large, it is highly probable that B should also cite A. A threshold p' is set, and when the size of the set S is larger than p', article B is judged to have missed citing article A.
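The missed-citation check amounts to counting, for every document that the paper does not cite, how many of its similar articles cite that document. The sketch below assumes the similar articles have already been identified by the preceding steps; the function name `missed_citations` and the parameter name `min_support` (playing the role of the threshold p') are hypothetical.

```python
from collections import Counter


def missed_citations(paper, similar_papers, references, min_support):
    """Return {document: support} for documents the paper has likely missed.

    A document is reported when more than `min_support` articles similar to
    `paper` cite it while `paper` itself does not.
    """
    own_refs = references.get(paper, set())
    support = Counter()
    for other in similar_papers:
        for doc in references.get(other, set()):
            if doc != paper and doc not in own_refs:
                support[doc] += 1  # size of the set S for this document
    return {doc: count for doc, count in support.items() if count > min_support}


references = {"B": {"X"}, "C": {"A", "X"}, "D": {"A"}, "E": {"A", "Y"}}
print(missed_citations("B", ["C", "D", "E"], references, min_support=2))  # {'A': 3}
```

Here document A is cited by all three articles similar to B, so it exceeds the support threshold, whereas Y is cited by only one of them and is not reported.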
5.3: The above method is applied to a paper and its own references: the thresholds associated with the references are determined, and references whose correlation is not sufficiently high are deleted, thereby simplifying the reference list.
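A minimal sketch of the simplification step is given below. It assumes that the p values between the paper and each of its references have already been computed and that a reference is kept when at least one of its p values lies below the chosen threshold; this retention rule, the function name `prune_references` and the decision to return the flagged references instead of silently discarding them are assumptions of the illustration.

```python
def prune_references(reference_p_values, threshold):
    """Split a reference list into kept and flagged references.

    reference_p_values: mapping from reference id to the list of p values
        obtained between the paper and that reference (one per value of k).
    threshold: readjusted statistical threshold.
    """
    kept, flagged = [], []
    for ref, p_values in reference_p_values.items():
        # A reference with no p value at all could not be tested and is flagged.
        if p_values and min(p_values) < threshold:
            kept.append(ref)
        else:
            flagged.append(ref)
    return kept, flagged


kept, flagged = prune_references({"r1": [0.001, 0.2], "r2": [0.3], "r3": []}, 0.01)
print(kept, flagged)  # ['r1'] ['r2', 'r3']
```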
While the invention has been described in connection with specific embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (1)
1. A paper relevancy quantification method based on reference list overlapping degree, characterized by comprising the following steps:
S1: Preprocessing the data set to build a paper citation network
Selecting papers in the field of Computer Science as the data set; extracting the useful information in the data set, including the number of each paper, its title and its reference list; calculating, for each paper, the number of articles it cites and the number of times it is cited; and storing this information in a dictionary structure to form a paper citation network;
S2: Classifying the data set of S1 and calculating related parameters
Classifying the data set of S1 and calculating the related parameters, the formula being as follows:
$$S_B^k = \{\, b \in B : b \text{ is cited by exactly } k \text{ articles} \,\}, \qquad S_A^k = \{\, a \in A : a \text{ cites at least one article in } S_B^k \,\}, \qquad N_B^k = |S_B^k|$$
wherein A represents the set of papers that cite at least one article, B represents the set of papers that are cited by at least two articles, $S_B^k$ represents the subset of B obtained by dividing B according to the number of citations k, $S_A^k$ represents the set of all articles in A that cite articles in $S_B^k$, and $N_B^k$ represents the size of $S_B^k$;
S3: Calculating the correlation p of two papers by using the parameters of S2 and the idea of a statistically validated network
S3.1: For a paper pair (i, j), under the assumption that paper i cites $d_i$ articles and paper j cites $d_j$ articles, calculating the probability that paper i and paper j jointly cite X articles, the calculation formula being as follows:
$$H(X \mid N_B^k, d_i, d_j) = \frac{\binom{d_i}{X}\binom{N_B^k - d_i}{d_j - X}}{\binom{N_B^k}{d_j}}$$
S3.2: Using $H(X \mid N_B^k, d_i, d_j)$ as described in S3.1, calculating the similarity quantization value p between any two papers, namely the probability that the number of articles in $S_B^k$ jointly cited by paper i and paper j is greater than or equal to $N_{ij}^k$, wherein each p value corresponds to a k value, the calculation formula being as follows:
$$p_{ij}(k) = 1 - \sum_{X=0}^{N_{ij}^k - 1} H(X \mid N_B^k, d_i, d_j)$$
wherein $d_i$ and $d_j$ represent the numbers of articles in the set $S_B^k$ cited by paper i and paper j respectively, $N_B^k$ represents the size of $S_B^k$ (the set of articles cited by exactly k articles), and $N_{ij}^k$ represents the number of articles in $S_B^k$ that paper i and paper j cite jointly; the p value $p_{ij}(k)$ is the probability that the number of articles in $S_B^k$ jointly cited by paper i and paper j is greater than or equal to $N_{ij}^k$;
S3.3: Repeating S3.1-S3.2 for all possible values of k from $k_{min}$ to $k_{max}$, such that each paper pair (i, j) is associated with more than one p value, each p value being associated with a citation count k;
S4: Performing multiple testing on the p values of S3 by using the false discovery rate (FDR) method to obtain a threshold $p^*$ for judging whether articles are significantly similar, and identifying the significantly similar paper pairs;
S4.1: Setting a statistical threshold $p^*$ and, assuming a total of $N_t$ tests, ranking the p values of all the different tests in increasing order:
$$q_1 \le q_2 \le \dots \le q_{N_t}$$
S4.2: Continuously readjusting the threshold $p^*$ until the maximum $t_{max}$ satisfying the condition is found, the condition being as follows:
$$q_{t_{max}} < \frac{p^* \, t_{max}}{N_t}$$
wherein $N_t$ denotes the number of tests, that is, the number of all different paper pairs tested on the sets $S_k$ corresponding to all values of the citation count k;
S4.3: Comparing each p value with the readjusted threshold $p^*$; if the p value is less than the threshold $p^*$, the similarity of the paper pair (i, j) passes statistical verification at the threshold $p^*$, i.e. papers i and j are similar;
S5: Analyzing the result of S4, verifying the homology of paper citations, finding possible missed citations, and deleting references with low correlation, so as to provide a correct reference for scientific evaluation
S5.1: Verifying the homology of paper citations: for each threshold $p^*$, calculating $P_{i \to j}(p^*)$, the calculation formula being as follows:
$$P_{i \to j}(p^*) = \frac{K(p^*)}{M(p^*)}$$
wherein $P_{i \to j}(p^*)$ is the probability that a citation exists between any two papers that pass similarity verification at the threshold $p^*$, $K(p^*)$ is the number of citations that exist between verified paper pairs, and $M(p^*)$ is the size of the set of paper pairs for which at least one associated p value passes the statistical test at the threshold $p^*$;
Fitting the functional curve of $P_{i \to j}(p^*)$ against $p^*$, obtaining the conclusion that the higher the similarity of two papers, the higher the probability that a citation relationship exists between them, and thereby verifying the homology of paper citations;
S5.2: Searching for possible missed citations: based on the conclusion of S5.1, selecting a paper m and obtaining, through steps S1-S4, a plurality of articles similar to paper m; collecting all articles that are similar to paper m and cite article n into a set S; if paper m does not cite article n, automatically setting a threshold p', and judging that paper m has missed citing article n when the size of the set S is larger than the threshold p';
S5.3: Performing steps S1-S4 on the paper and its own references, deleting the references whose similarity is not sufficiently high, and thereby simplifying the reference list of the paper.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811072484.5A CN109376238B (en) | 2018-09-14 | 2018-09-14 | Paper correlation degree quantification method based on reference document list overlapping degree |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109376238A (en) | 2019-02-22 |
CN109376238B (en) | 2021-01-05 |
Family
ID=65405261
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811072484.5A Active CN109376238B (en) | 2018-09-14 | 2018-09-14 | Paper correlation degree quantification method based on reference document list overlapping degree |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109376238B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110096710B (en) * | 2019-05-09 | 2022-12-30 | 董云鹏 | Article analysis and self-demonstration method |
CN114911935A (en) * | 2019-05-17 | 2022-08-16 | 爱酷赛股份有限公司 | Cluster analysis method, cluster analysis system, and cluster analysis program |
CN110489745B (en) * | 2019-07-31 | 2020-12-22 | 北京大学 | Paper text similarity detection method based on citation network |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106126732A (en) * | 2016-07-04 | 2016-11-16 | 中南大学 | Author's power of influence transmission capacity Forecasting Methodology based on interest scale model |
CN108132961A (en) * | 2017-11-06 | 2018-06-08 | 浙江工业大学 | A kind of bibliography based on reference prediction recommends method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2011044578A1 (en) * | 2009-10-11 | 2011-04-14 | Patrick Walsh | Method and system for performing classified document research |
Non-Patent Citations (3)
Title |
---|
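Yicong Liang et al. Finding Relevant Papers Based on Citation Relations. Web-Age Information Management. 2011, 403-414. * |
Academic paper recommendation based on content and citation relations; Cai Ani; China Master's Theses Full-text Database, Information Science and Technology; 2014-11-15; pp. I138-568 * |
Control algorithms for the false discovery rate in multiple correlated tests; Liu Zunxiong et al.; Journal of Jinggangshan University (Natural Science Edition); 2016-12-31; pp. 35-40 * |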
Also Published As
Publication number | Publication date |
---|---|
CN109376238A (en) | 2019-02-22 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||