CN109376238B

CN109376238B - Paper correlation degree quantification method based on reference document list overlapping degree

Info

Publication number: CN109376238B
Application number: CN201811072484.5A
Authority: CN
Inventors: 刘嘉莹; 张冬瑜; 肖心茹; 步晓楠; 宁兆龙; 夏锋
Original assignee: Dalian University of Technology
Current assignee: Dalian University of Technology
Priority date: 2018-09-14
Filing date: 2018-09-14
Publication date: 2021-01-05
Anticipated expiration: 2038-09-14
Also published as: CN109376238A

Abstract

The invention discloses a method for quantifying the degree of correlation between papers based on the degree of overlap of their reference lists. Using the idea of co-citation, the method links the overlap between the reference lists of two papers to the similarity of the papers, quantifies that similarity by combining a statistically validated network with the false discovery rate (FDR), and obtains a threshold for judging whether two papers are similar. The invention also provides several applications of the quantification method: verifying the homology of paper citations, finding missing citations, and simplifying reference lists. On this basis, the method evaluates the relevance between papers, improves the accuracy and effectiveness of the relevance calculation, and greatly reduces the amount of computation, thereby providing a basis for ranked retrieval, clustering and classification of papers and for detecting errors in reference lists.

Description

Paper correlation degree quantification method based on reference document list overlapping degree

Technical Field

The invention belongs to the technical field of quantifying the similarity of academic papers, and particularly relates to a method for measuring paper relevance based on the co-citation idea and a statistically validated network.

Background

With the rapid development of scientific fields, the number of academic papers is also increasing rapidly. Under these conditions, quantifying the relevance between papers is of great value, since it can serve as an important basis for document retrieval and for document classification and clustering. However, mainstream text analysis methods (such as methods based on cosine similarity and on TF-IDF) are poorly suited to academic papers with their very large volume of text, and have high computational complexity and low efficiency; the results produced by classification and screening methods (such as the naive Bayes algorithm and the KNN algorithm) are not very reliable for academic papers of the same category and subject. On the other hand, we note that the reference list, as a very important part of an academic paper, reflects important problems in the paper's research field. However, reference lists contain many implicit citation irregularities, such as missing citations and excessive citations. These problems are difficult to detect, and a way of expressing the relevance between references and papers is needed.

Disclosure of Invention

In view of the problems in the existing research, the invention aims to provide a paper similarity quantification method based on the degree of overlap of reference lists, in which the similarity between two papers is measured by applying the statistically validated network method to the overlap of the references cited by the two papers. The invention also uses the false discovery rate (FDR) method to apply a multiple-testing correction to the calculated values, obtaining a threshold for judging whether two papers are similar. In addition, the invention provides applications of the quantification method to problems such as verifying the homology of paper citations, finding missing citations, and simplifying reference lists.

The technical scheme of the invention is as follows:

A paper correlation quantification method based on reference list overlapping degree comprises the following steps:

S1: preprocessing the data set to build a paper citation network

Selecting papers in the field of Computer Science as the data set, extracting the useful information in the data set, including the paper number, the paper title and the reference list, calculating for each paper the number of articles it cites and the number of times it is cited, and storing the information in a dictionary structure to form the paper citation network;

S2: classifying the data set of S1 and calculating related parameters

Classifying the data set of S1 and calculating the related parameters, the formula is as follows:

wherein A represents the set of papers that cite at least one article; B represents the set of papers that are cited by at least two articles; B is divided according to the number of citations k into subsets, each containing the papers cited by exactly k articles; for each such subset, the papers in set A that cite an article in the subset form the corresponding citing set; and the size of each subset of B is calculated;

S3: calculating the correlation p of two papers using the parameters described in S2 and the concept of a statistically validated network

S3.1: for a paper pair (i, j), under the assumption that paper i cites d_i articles and paper j cites d_j articles, the probability that paper i and paper j cite X articles in common is calculated; the calculation formula is as follows:

S3.2: using the probability described in S3.1, calculating the similarity quantization value p between any two papers, namely the probability that the number of articles in the subset of B for citation count k that paper i and paper j cite in common is greater than or equal to the observed co-citation count, wherein each p value corresponds to a k value; the calculation formula is as follows:

wherein d_i and d_j represent the numbers of articles in the subset cited by paper i and paper j of the pair (i, j), respectively; the observed co-citation count is the number of articles in the subset cited by both paper i and paper j; and the value p, i.e. p_ij(k), is the probability that the number of articles in the subset cited by both paper i and paper j is greater than or equal to the observed co-citation count;

S3.3: repeating S3.1-S3.2 for all possible values of k from k_min to k_max, so that each paper pair (i, j) is associated with more than one p value, each p value corresponding to one citation count k;

S4: applying a multiple-testing correction to the p values of S3 using the false discovery rate method to obtain a threshold p^* for judging whether papers are significantly similar, and identifying significantly similar paper pairs;

S4.1: setting a statistical threshold p^*; assuming a total of N_t tests, the p values of all the different tests are arranged in increasing order:

S4.2: continuously readjusting the threshold p^* until the maximum t_max satisfying the condition is found; the condition is as follows:

wherein N_t denotes the number of tests, namely the number of all distinct paper pairs tested over the sets S_k corresponding to all values of the citation count k;

S4.3: comparing each p value with the readjusted threshold p^*; if the p value is less than the threshold p^*, the similarity of the paper pair (i, j) passes statistical validation at the threshold p^*, i.e. papers i and j are similar;

S5: analyzing the results of S4, verifying the homology of paper citations, finding possible missing citations, and deleting references with a low degree of correlation, thereby providing a correct reference for scientific evaluation

S5.1: verifying the homology of paper citations; for each threshold p^*, P_i→j(p^*) is calculated; the calculation formula is as follows:

wherein P_i→j(p^*) is the probability that a citation exists between any two papers that pass similarity validation at the threshold p^*, K(p^*) is the number of citations that exist between validated paper pairs, and M(p^*) is the size of the set of paper pairs for which at least one associated p value passes the statistical test at the threshold p^*;

fitting the curve of P_i→j(p^*) as a function of p^* gives the conclusion that the higher the similarity of two papers, the higher the probability that a citation relationship exists between them, which verifies the homology of paper citations;

S5.2: searching for possible missing citations: based on the conclusion of S5.1, a paper m is selected, and the papers similar to paper m are obtained through steps S1-S4; all papers that are similar to paper m and cite an article n are collected into a set S; if paper m does not cite article n, a threshold p' is set, and when the size of the set S is larger than the threshold p', paper m is judged to have missed the citation of article n;

S5.3: performing steps S1-S4 on a paper and its own references, so as to delete references whose similarity is not sufficiently high and simplify the paper's reference list.

The invention has the following beneficial effects: through the co-citation idea, the overlap between reference lists is linked to the similarity of two papers, and the similarity of the two papers is quantified by combining a statistically validated network. In addition, the invention uses statistics to calculate the probability that two papers cite the same articles, and applies a multiple-testing correction to the calculated values using the false discovery rate (FDR) method to obtain a threshold for judging whether the two papers are similar. On this basis, the invention can evaluate the similarity between papers and between a paper and its references, thereby providing a basis for document retrieval and for the clustering and classification of papers, and helping authors and reviewers discover implicit citation problems such as missing citations and excessive citations.

Drawings

FIG. 1 is a flow chart of a method for quantifying paper relevancy based on reference list overlap;

FIG. 2 is a diagram of the relationships between the sets used in the present invention;

FIG. 3 shows the experimentally obtained functional relationship between P_i→j(p^*) and the statistical threshold p^*;

FIG. 4 is a diagram illustrating the missing-citation phenomenon of a paper.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in further detail below.

The embodiment of the invention provides a paper relevancy quantification method based on reference document list overlapping degree. The flow is shown in FIG. 1 and comprises the following steps:

Step one: preprocess the data and delete irrelevant redundant data.

Select the papers in the data set whose field information is Computer Science and delete their redundant attributes, leaving only the number id that identifies the paper, the paper title, and the reference list references. At the same time, calculate for each paper the number of articles it cites and the number of times it is cited, and store this information in a dictionary structure.
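The preprocessing of step one can be illustrated with a minimal sketch. The record layout (keys `id`, `title`, `field`, `references`) and the function name `build_citation_network` are assumptions made for illustration, not part of the original disclosure; references that point outside the selected data set are dropped here, which is also an assumption.

```python
def build_citation_network(records, field="Computer Science"):
    """Build a dictionary-based citation network from raw paper records.

    `records` is assumed to be an iterable of dicts with the hypothetical
    keys "id", "title", "field" and "references".
    """
    network = {}
    for rec in records:
        if rec.get("field") != field:
            continue  # keep only papers of the selected field
        network[rec["id"]] = {
            "title": rec["title"],
            "references": set(rec.get("references", [])),
            "cited_by": set(),
        }
    # Assumption: keep only references that are themselves in the data set.
    for pid, entry in network.items():
        entry["references"] = {
            r for r in entry["references"] if r in network and r != pid
        }
    # Fill in the reverse direction (who cites each paper).
    for pid, entry in network.items():
        for ref in entry["references"]:
            network[ref]["cited_by"].add(pid)
    # Number of articles cited and number of times cited.
    for entry in network.values():
        entry["out_degree"] = len(entry["references"])
        entry["in_degree"] = len(entry["cited_by"])
    return network
```

Storing both directions of each citation makes the later screening by number of citations a simple dictionary lookup.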

Step two: the data sets are classified and simplified and relevant parameters are calculated.

2.1: Screen out several sets: A, B, the in-degree-k subsets of B, the corresponding citing sets, and S_k. Here A represents the set of papers that cite at least one document; B represents the set of papers that are cited by at least two papers; the in-degree-k subset is the set of papers cited by exactly k papers; the citing set for k is the set of papers that cite at least one paper in the in-degree-k subset; and S_k is the union of the in-degree-k subset and its citing set. FIG. 2 shows the relationship between these sets more intuitively.

The screening process is as follows. First, papers that cite at least one article are put into set A, and papers cited by at least two articles are put into set B. Set B is then divided into subsets according to the citation count k, and the size of each subset is calculated. Next, the papers in set A that cite an article in the subset for a given k are put into the corresponding citing set. Finally, the union of the subset and its citing set is taken to obtain S_k.
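The screening in 2.1 can be sketched as follows. This is an illustration under stated assumptions: the input is a plain dictionary mapping each paper id to the set of ids it cites, and the names `screen_sets`, `"in"`, `"out"` and `"size"` are hypothetical labels for the in-degree-k subset, its citing set and the subset size.

```python
from collections import defaultdict


def screen_sets(references):
    """Return set A, set B, and per-k sets derived from a citation dict.

    `references` maps a paper id to the set of paper ids it cites.
    """
    cited_by = defaultdict(set)
    for paper, refs in references.items():
        for ref in refs:
            cited_by[ref].add(paper)

    set_a = {p for p, refs in references.items() if refs}           # cites >= 1
    set_b = {p for p, citers in cited_by.items() if len(citers) >= 2}  # cited >= 2

    # Divide B by the number of citations k (the in-degree).
    in_subsets = defaultdict(set)
    for p in set_b:
        in_subsets[len(cited_by[p])].add(p)

    per_k = {}
    for k, subset in in_subsets.items():
        # Papers that cite at least one member of the in-degree-k subset.
        citing = {c for p in subset for c in cited_by[p]}
        per_k[k] = {
            "in": subset,
            "out": citing,
            "size": len(subset),
            "S_k": subset | citing,
        }
    return set_a, set_b, per_k
```

Every member of a citing set cites at least one article, so each citing set is a subset of A by construction.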

2.2: Using the property that each paper in the citation network can only cite other, already published papers, construct the bipartite statistically validated network and calculate the parameters d_i and d_j, the subset size, and the co-citation count. Here d_i and d_j represent the numbers of articles in the in-degree-k subset cited by papers i and j, respectively; the subset size is the number of articles in the in-degree-k subset; and the co-citation count is the measured number of articles in that subset cited by both i and j.

Note that if there is only one cited article with an in-degree of k, the number of common citations of the two papers must be 1. When there is more than one cited article, the co-citation count is obtained as the size of the intersection of the sets of articles in the subset cited by the two papers.
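A short sketch of the parameter calculation in 2.2, assuming the reference lists and the in-degree-k subset are available as Python sets. The function name `pair_parameters` and the keys `N_k` and `n_ij` are labels chosen here for the subset size and the co-citation count; they do not appear in the original text.

```python
def pair_parameters(refs_i, refs_j, in_subset):
    """Compute d_i, d_j, the subset size and the co-citation count for one k.

    refs_i, refs_j: sets of ids cited by papers i and j.
    in_subset: set of ids cited exactly k times (the in-degree-k subset).
    """
    cited_i = refs_i & in_subset  # references of i that lie in the subset
    cited_j = refs_j & in_subset  # references of j that lie in the subset
    return {
        "d_i": len(cited_i),
        "d_j": len(cited_j),
        "N_k": len(in_subset),           # hypothetical label: subset size
        "n_ij": len(cited_i & cited_j),  # hypothetical label: co-citation count
    }
```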

Step three: calculate the similarity quantization value p. Whether two papers are related is determined using the idea of a statistically validated network, that is, a probability is calculated to obtain the similarity quantization value p between any two papers (the probability that the number of articles in the in-degree-k subset cited by both paper i and paper j is greater than or equal to the observed co-citation count). Each p value therefore corresponds to a k value (the number of times the cited articles are cited).

For each pair of papers (i, j), d_i and d_j denote the numbers of elements of the subset in their respective reference lists.

Under the assumption that paper i and paper j cite d_i and d_j different articles respectively, the probability that they cite X articles in common can be calculated using the following formula:

the p value for each paper pair can thus be calculated using the following formula:

wherein the value p, i.e. p_ij(k), is the probability that the number of articles in the subset cited by both i and j is greater than or equal to the observed co-citation count.
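The formula images are not preserved in this text. A plausible reading, assuming the hypergeometric null model commonly used for statistically validated networks, is given below. The symbols N_k (size of the in-degree-k subset) and N_{ij}^{k} (observed co-citation count) are labels chosen here and may differ from the original notation.

```latex
P(X \mid N_k, d_i, d_j)
  = \frac{\binom{d_i}{X}\,\binom{N_k - d_i}{d_j - X}}{\binom{N_k}{d_j}},
\qquad
p_{ij}(k)
  = 1 - \sum_{X=0}^{N_{ij}^{k}-1} P(X \mid N_k, d_i, d_j)
```

Under this reading, p_ij(k) is the upper tail of the distribution, that is, the probability of observing at least N_{ij}^{k} common references by chance.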

This process is repeated for all possible values of k from k_min to k_max. Each pair of papers (i, j) is thus associated with more than one p value, each corresponding to cited articles whose in-degree is k.
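Step three can be sketched in Python under the assumption that the chance overlap of two reference lists follows a hypergeometric distribution, which is the usual null model of statistically validated networks; the original formulas are not preserved here, so this is an illustration rather than the patented computation. The names `hypergeom_pmf`, `co_citation_p_value`, `p_values_over_k`, `n_k` and `n_ij` are hypothetical.

```python
from math import comb


def hypergeom_pmf(x, n_k, d_i, d_j):
    """Probability that papers i and j share exactly x references, when they
    cite d_i and d_j of the n_k articles in the in-degree-k subset at random."""
    return comb(d_i, x) * comb(n_k - d_i, d_j - x) / comb(n_k, d_j)


def co_citation_p_value(n_ij, n_k, d_i, d_j):
    """Probability of at least n_ij common references (upper tail)."""
    upper = min(d_i, d_j)
    return sum(hypergeom_pmf(x, n_k, d_i, d_j) for x in range(n_ij, upper + 1))


def p_values_over_k(params_by_k):
    """Repeat the calculation for every k; params_by_k maps k to a dict with
    the assumed keys "n_ij", "N_k", "d_i" and "d_j"."""
    return {
        k: co_citation_p_value(p["n_ij"], p["N_k"], p["d_i"], p["d_j"])
        for k, p in params_by_k.items()
    }
```

Summing the upper tail directly, instead of subtracting the lower tail from 1, avoids loss of precision when the p value is very small.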

Step four: apply a multiple-testing correction to the results using the false discovery rate (FDR) method to obtain a significance threshold p^*, and validate all paper pairs associated with a p value less than the threshold p^*. At this statistical threshold, only validated paper pairs are considered significantly similar. Specifically:

4.1: Set a statistical threshold p^* and suppose there are N_t tests in total. Arrange the p values of all the different tests in increasing order.

4.2: Continuously readjust the threshold until the maximum t_max that satisfies the following condition is found:

wherein N_t denotes the number of tests; in the present invention, N_t is the number of all distinct paper pairs tested over the sets S_k corresponding to all in-degrees k in the citation network.

4.3: Compare each p value with the readjusted threshold p^*. If the p value is less than the threshold p^*, the similarity of the paper pair (i, j) passes statistical validation at the threshold p^*, i.e. papers i and j are similar.

Through the above steps, we can obtain the similarity values of all the paper pairs.
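The multiple-testing correction of step four can be illustrated with a sketch that assumes the standard Benjamini-Hochberg style step-up rule: the sorted p values q_1 <= q_2 <= ... are scanned for the largest t with q_t < p^* · t / N_t, and the rescaled threshold is p^* · t_max / N_t. The condition in the original is not preserved in this text, so this rule, the strict inequality, and the names `fdr_threshold` and `validated_pairs` are assumptions.

```python
def fdr_threshold(p_values, p_star, n_tests=None):
    """Return the rescaled threshold p_star * t_max / N_t (0.0 if no test passes).

    n_tests is N_t; if omitted, the number of supplied p values is used,
    which is an assumption of this sketch.
    """
    n_t = n_tests if n_tests is not None else len(p_values)
    t_max = 0
    for t, q in enumerate(sorted(p_values), start=1):
        if q < p_star * t / n_t:
            t_max = t  # remember the largest rank that satisfies the condition
    return p_star * t_max / n_t


def validated_pairs(p_by_pair, p_star, n_tests=None):
    """p_by_pair maps a pair (i, j) to its list of p values, one per k.
    A pair is validated if at least one of its p values is below the
    rescaled threshold."""
    all_p = [q for qs in p_by_pair.values() for q in qs]
    threshold = fdr_threshold(all_p, p_star, n_tests)
    return {pair for pair, qs in p_by_pair.items() if min(qs) < threshold}
```

For example, with p values 0.001, 0.008, 0.039, 0.041 and 0.6 and p^* = 0.05, only the first two ranks satisfy the condition, so the rescaled threshold is 0.05 · 2 / 5 = 0.02.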

Step five: apply the algorithm to study the citation patterns of homologous papers, reveal possible missing citations, and simplify reference lists.

5.1: Homology refers to the tendency of papers with a certain degree of similarity to cite one another. The homology of paper citations can be shown by plotting the results obtained with the above algorithm. At a given threshold p^*, the probability P_i→j(p^*) that paper i cites paper j is observed for the validated pairs. M(p^*) decreases as p^* decreases, because at a smaller p^* the reference lists of a paper pair need a higher degree of overlap for the pair to enter the set M(p^*).

For each threshold p^*, calculate P_i→j(p^*), i.e. the probability that a citation exists between any two papers that pass similarity validation at the threshold p^*; the calculation formula is:

wherein K(p^*) is the number of citations that exist between validated paper pairs, and M(p^*) is the size of the set of paper pairs for which at least one associated p value passes the statistical test at the threshold p^*.

FIG. 3 shows the experimentally obtained P_i→j(p^*) as a function of the statistical threshold p^*. Analysis of the figure shows that the higher the similarity of two papers, the higher the probability that a citation relationship exists between them. This verifies the homology of paper citations.
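The quantity in 5.1 can be sketched as a simple ratio, assuming that P_i→j(p^*) = K(p^*) / M(p^*), which matches the definitions of K and M in the text although the formula image itself is not preserved. Counting a citation in either direction between the two papers of a pair is an assumption of this sketch, and `citation_probability` is a hypothetical name.

```python
def citation_probability(validated, references):
    """Fraction of validated pairs that are linked by a citation.

    validated: set of (i, j) pairs that passed validation at some threshold.
    references: dict mapping a paper id to the set of ids it cites.
    Returns None when no pair was validated.
    """
    m = len(validated)  # M(p*): number of validated pairs
    if m == 0:
        return None
    # K(p*): validated pairs with a citation between them (either direction).
    k = sum(
        1
        for i, j in validated
        if j in references.get(i, set()) or i in references.get(j, set())
    )
    return k / m
```

Evaluating this ratio over a range of thresholds gives the points of a curve like the one described for FIG. 3.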

5.2: In the previous step, we verified that the higher the similarity of two papers, the higher the overlap of their reference lists. Consider the situation shown in FIG. 4. Paper B is validated as similar to papers C, D, E, etc. at a small threshold, and the reference lists of B and C, B and D, and B and E all overlap, i.e. B, C, D and E have a high probability of citing the same documents. If C, D and E all cite document A, and S is the set of all papers that are similar to B and cite A, then when the set S is sufficiently large there is a high probability that B should also cite A. A threshold p' is set, and when the size of the set S is larger than the threshold p', paper B is judged to have missed the citation of document A.
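The missing-citation check of 5.2 can be sketched as follows, assuming the papers similar to the target paper have already been identified by the earlier steps. The function name `missing_citation_candidates` and the parameter `min_support` (playing the role of the threshold p') are hypothetical.

```python
def missing_citation_candidates(paper, similar_papers, references, min_support):
    """Return documents that `paper` does not cite but that are cited by more
    than `min_support` papers similar to it.

    references maps a paper id to the set of ids it cites.
    The result maps each candidate document to its supporting set S.
    """
    own = references.get(paper, set())
    support = {}
    for other in similar_papers:
        for ref in references.get(other, set()):
            if ref != paper and ref not in own:
                support.setdefault(ref, set()).add(other)
    # Keep only candidates whose supporting set is larger than the threshold.
    return {ref: s for ref, s in support.items() if len(s) > min_support}
```

In the situation of FIG. 4, calling the function for paper B with the similar papers C, D and E would report document A with S = {C, D, E}, provided the chosen threshold is below 3.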

5.3: The above method is applied to a paper and its own references, a relevance threshold is determined, and references whose relevance is not sufficiently high are deleted, thereby simplifying the reference list.
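A minimal sketch of the simplification in 5.3, assuming that "sufficiently relevant" means the paper and the reference form a validated pair at the chosen threshold. That interpretation and the name `simplify_reference_list` are assumptions; the source does not specify the exact criterion.

```python
def simplify_reference_list(paper, reference_list, validated):
    """Split a reference list into kept and dropped entries.

    validated: set of (i, j) pairs that passed statistical validation.
    A reference is kept if it forms a validated pair with `paper`
    (in either order); the original list order is preserved.
    """
    kept, dropped = [], []
    for ref in reference_list:
        if (paper, ref) in validated or (ref, paper) in validated:
            kept.append(ref)
        else:
            dropped.append(ref)
    return kept, dropped
```

Returning the dropped entries alongside the kept ones lets an author or reviewer inspect the candidates before actually removing them.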

While the invention has been described in connection with specific embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A paper relevancy quantification method based on reference document list overlapping degree is characterized by comprising the following steps:

S1: preprocessing the data set to build a paper citation network

S2: classifying the data set of S1 and calculating related parameters, wherein the subsets are obtained by dividing set B according to the number of citations k; for each subset, the corresponding citing set is the set of all papers in set A that cite an article in the subset; and the size of each subset is calculated;

S3.2: using the probability described in S3.1, calculating the probability that the number of articles in the subset cited by both paper i and paper j is greater than or equal to the observed co-citation count, wherein the observed co-citation count is the number of articles in the subset cited by both paper i and paper j;

S4.1: setting a statistical threshold p^*; assuming a total of N_t tests, the p values of all the different tests are arranged in increasing order:

S4.2: continuously readjusting the threshold p^* until the maximum t_max satisfying the condition is found; the condition is as follows:

S4.3: comparing each p value with the readjusted threshold p^*; if the p value is less than the threshold p^*, the similarity of the paper pair (i, j) passes statistical validation at the threshold p^*, i.e. papers i and j are similar;

wherein P_i→j(p^*) is the probability that a citation exists between any two papers that pass similarity validation at the threshold p^*, K(p^*) is the number of citations that exist between validated paper pairs, and M(p^*) is the size of the set of paper pairs for which at least one associated p value passes the statistical test at the threshold p^*;