CN111460324B - Citation recommendation method and system based on link analysis

Info

Publication number: CN111460324B
Application number: CN202010556832.7A
Authority: CN
Inventors: 冯雅; 吴宗羲
Original assignee: Hangzhou Canba Technology Co ltd
Current assignee: Hangzhou Canba Technology Co ltd
Priority date: 2020-06-18
Filing date: 2020-06-18
Publication date: 2020-11-06
Anticipated expiration: 2040-06-18
Also published as: CN111460324A

Abstract

The invention discloses a citation recommendation method and system based on link analysis. The method comprises: constructing a directed weighted citation network; dividing the citation network into network clusters; selecting a representative node for each network cluster; selecting a candidate network cluster for a new paper; adding the node with the highest similarity to a candidate citation recommendation set and calculating the link degrees between nodes; selecting nodes whose link degree is higher than a second threshold to join the candidate citation recommendation set; continuing to select nodes to join the candidate citation recommendation set; acquiring papers published within a first time period as first papers and predicting their citation counts; acquiring papers published within a third time period as second papers; calculating the citation-count growth degree of the first and second papers; and adding papers whose citation-count growth degree exceeds a fourth threshold to the candidate citation recommendation set. The invention optimizes citation recommendation over the community network, improves recommendation accuracy, and predicts citations for recently published papers, so that the recommended citations are more comprehensive.

Description

Citation recommendation method and system based on link analysis

Technical Field

The invention relates to the field of document searching, in particular to a citation recommendation method and system based on link analysis.

Background

An academic paper needs to cite relevant prior work to help readers understand its background and innovations, and researchers often want to quickly grasp the existing literature in a field, including which papers are most relevant and which sub-topics those papers cover. As the number of academic papers increases, the citation network formed by academic papers and their references is becoming a large-scale complex network. Citation analysis plays an important role in document retrieval and paper recommendation.

The invention patent application with publication number CN 110674318A discloses a data recommendation method based on citation network community discovery. It constructs a citation network based on co-authorship relations between authors and co-citation and coupling relations between papers; divides the citation network into a plurality of community networks; establishes associations between data sets and community networks based on the similarity between papers and data sets; and superimposes and de-duplicates the paper nodes in the community networks associated with a data set before recommending data.

Although the above application mentions recommendation based on citation network community discovery, it associates data sets with community networks in order to recommend data. Papers within the same community network differ in community influence, so their probabilities of being recommended as citations also differ considerably. Moreover, the citations of a paper change dynamically and publication time strongly affects them: a newly published, technically advanced paper may still have few citations. The recommendation method disclosed in that application therefore suffers from low accuracy. How to achieve accurate, high-quality citation recommendation in view of these shortcomings is a problem to be solved in the field.

Disclosure of Invention

The invention aims to provide a citation recommendation method and system based on link analysis that address the shortcomings of the prior art. The invention optimizes citation recommendation over the community network, improves recommendation accuracy, and predicts citations for recently published papers, so that the recommended citations are more comprehensive.

To achieve this purpose, the invention adopts the following technical solution:

A citation recommendation method based on link analysis comprises the following steps:

S1, constructing a directed weighted citation network based on citation relations, author similarities and content similarities among the papers;

S2, dividing the directed weighted citation network into a plurality of network clusters;

S3, selecting the node with the largest influence in each network cluster as its representative node;

S4, selecting a corresponding network cluster for the new paper, based on author similarity and content similarity, as the candidate network cluster;

S5, adding the node in the candidate network cluster with the highest similarity to the new paper, as a first node, to a candidate citation recommendation set, and calculating the link degrees between the first node and the other nodes in the candidate network cluster;

S6, selecting the nodes whose link degree with the first node is higher than a second threshold to join the candidate citation recommendation set; taking each selected node as a first node and continuing to select nodes to join the candidate citation recommendation set until no node satisfies the second threshold or all nodes in the network cluster have joined the candidate citation recommendation set;

S7, acquiring papers published within a first time period as first papers, and predicting the citation counts of the first papers in a future second time period;

S8, acquiring papers published within a third time period as second papers, wherein the third time period = the first time period + the second time period; calculating the citation-count growth degree of the first papers and the second papers;

S9, adding the papers whose citation-count growth degree is greater than a fourth threshold to the candidate citation recommendation set to obtain a final citation recommendation set, and recommending the papers in the final citation recommendation set to the user as citations for the new paper;

the author similarity of two papers with a citation relationship is:

[formula not recoverable from the source text]

wherein the two weights in the formula are the weights assigned to shared authors and to author collaboration, respectively; the remaining quantities are the number of authors the two papers have in common, the number of author pairs across the two papers that have a collaborative relationship, and, for each such pair, the number of papers the pair has completed collaboratively;

the content similarity of two papers with a citation relationship is:

[formula not recoverable from the source text]

wherein the formula is expressed in terms of the value of each paper's vector in a given dimension and the dimension of the paper vector;

the similarity of two papers with a citation relationship is:

[formula not recoverable from the source text]

wherein the weight in the formula is the weight of the author similarity;

the weight of an edge in the directed weighted citation network is the similarity of the two papers connected by the edge;

S2 specifically comprises:

S21, selecting the paper node with the highest degree in the directed weighted citation network as the initial node, and setting c = 1;

S22, adding the initial node to a newly established cluster SC_c;

S23, acquiring the nodes that are connected to paper nodes in SC_c and do not belong to any established cluster, and adding them to a candidate cluster; if the candidate cluster is empty, executing step S25;

S24, judging whether the maximum weight of the connecting edges between the candidate cluster and the paper nodes in cluster SC_c is greater than the first threshold; if so, adding the paper node and connecting edge with the maximum weight to cluster SC_c and continuing with step S23; if not, setting c = c + 1 and executing step S25;

S25, judging whether the directed weighted citation network contains a paper node that does not belong to any cluster; if so, selecting the highest-degree paper node not belonging to any cluster as the initial node and executing step S22; if not, outputting the clusters SC₁, SC₂, ..., SC_c, ..., SC_x, wherein x is the number of network clusters;

the influence of node f is:

[formula not recoverable from the source text]

wherein the quantities in the formula are the number of nodes in the network cluster that cite node f; for the j-th node citing node f, the number of times that node is itself cited within the network cluster; and, among the nodes citing the j-th node, the number that also cite node f;

the link degree between a neighbor node l and the first node is:

[formula not recoverable from the source text]

wherein the quantities in the formula are the degree of neighbor node l, the degrees of the neighbor nodes other than the starting point, the average degree of the nodes in the current network cluster, and a covariance term and a variance term whose arguments are not recoverable from the source text.

Further, S4 comprises: selecting the network cluster with the greatest similarity to the new paper as the candidate network cluster; the similarity between the new paper and network cluster SC_c is:

[formula not recoverable from the source text]

wherein M is the number of paper nodes in SC_c, and the formula's other term is the similarity between the new paper and the q-th paper node of SC_c.

Further, S7 specifically comprises:

S71, calculating the similarity between the first paper and the nodes in the citation network;

S72, selecting the nodes whose similarity exceeds a third threshold to join a similar node set;

S73, fitting and predicting the citation count of the first paper in the future second time period based on the citation counts that the nodes in the similar node set received within the third time period counted from their publication dates, wherein the third time period = the first time period + the second time period.

Further, the citation-count growth degree of paper node r is:

[formula not recoverable from the source text]

wherein the formula uses the number of citations received within a period p after publication, with p measured in years; for a first paper, T is the third time period, and for a second paper, T is the time between the paper's publication and the time of recommendation.

Further, the citation counts used in the formula include both actual citation counts and predicted citation counts.

The invention also provides a citation recommendation system based on link analysis for implementing the above citation recommendation method, characterized by comprising:

a network construction module, configured to construct a directed weighted citation network based on citation relations, author similarities and content similarities among the papers;

a cluster dividing module, configured to divide the directed weighted citation network into a plurality of network clusters;

a representative node selection module, configured to select the node with the largest influence in each network cluster as its representative node;

a candidate network cluster selection module, configured to select a corresponding network cluster for the new paper, based on author similarity and content similarity, as the candidate network cluster;

a link degree calculation module, configured to add the node in the candidate network cluster with the highest similarity to the new paper, as a first node, to the candidate citation recommendation set, and to calculate the link degrees between the first node and other nodes in the candidate citation recommendation set;

a candidate citation recommendation set building module, configured to select the nodes whose link degree with the first node is higher than a second threshold to join the candidate citation recommendation set, and, taking each selected node as a first node, to continue selecting nodes to join the candidate citation recommendation set until no node satisfies the second threshold or all nodes in the network cluster have joined the candidate citation recommendation set;

a citation prediction module, configured to acquire papers published within a first time period as first papers and to predict the citation counts of the first papers in a future second time period;

a growth degree calculation module, configured to acquire papers published within a third time period as second papers, wherein the third time period = the first time period + the second time period, and to calculate the citation-count growth degree of the first papers and the second papers;

and a final citation recommendation set generating module, configured to add the papers whose citation-count growth degree is greater than a fourth threshold to the candidate citation recommendation set to obtain a final citation recommendation set, and to recommend the papers in the final citation recommendation set to the user as citations for the new paper.

Compared with the prior art, the invention has the following effects:

(1) citations are recommended on the basis of network clusters, which makes full use of the clustering among papers: papers in the same cluster are more likely to cite one another, while papers in different clusters are much less likely to do so, and the cost of processing the whole citation network is reduced;

(2) the paper nodes within a cluster are screened, which takes full account of the differences among papers in the same cluster and improves the accuracy of cluster-based citation recommendation;

(3) a first node is selected, and further recommended nodes are selected based on the link degree between the first node and other nodes; this overcomes the problem that the paper awaiting citation recommendation has no citation edges to the citation network, so that the citation relations in the network could not otherwise be fully exploited, and improves the accuracy of citation recommendation;

(4) the citation counts of recently published papers are predicted, which avoids missing valuable papers merely because they were published too recently and further improves the accuracy of citation recommendation;

(5) the citation relations among papers are comprehensively evaluated based on author similarity and content similarity, the differences between citations are fully considered, and a directed graph of the papers' citation relations is constructed, so that citation recommendation based on the directed weighted citation network is more accurate.

Drawings

FIG. 1 is a flowchart of the citation recommendation method based on link analysis according to Embodiment 1;

FIG. 2 is a structural diagram of the citation recommendation system based on link analysis according to Embodiment 2.

Detailed Description

The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.

It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.

The invention is further described with reference to the following drawings and specific examples, which are not intended to be limiting.

Embodiment 1

As shown in FIG. 1, this embodiment provides a citation recommendation method based on link analysis, comprising:

S1, constructing a directed weighted citation network based on citation relations, author similarities and content similarities among the papers;

The invention first constructs a citation network. Because a citation relation distinguishes the citing paper from the cited paper, the citation network of the invention is a directed graph. It can be represented as a directed graph made up of a set of paper nodes, where n is the number of papers, and a set of connecting edges between papers, where m is the number of connecting edges. Every connecting edge is directed: an edge between two paper nodes indicates that one of the papers is cited by the other.

The citation network constructed by the invention is a directed weighted citation network, in which different citations carry different weights. Specifically, the citation weight depends on author similarity and content similarity: the higher the author similarity and the content similarity, the higher the citation weight between the papers.

Regarding authors, if two papers share authors, the papers are likely to be related; they may be successive research results in the same field or results in related fields. The more authors two papers share, the closer the connection between them. In addition, if authors of the two papers have collaborated on other papers, there is some association between those authors, who may be members of the same research team, in which case both papers may be results of that team. The more papers two authors have completed together, the more closely they are related. Thus, the author similarity of two papers with a citation relationship is:

[formula not recoverable from the source text]

wherein the formula's quantities include the number of authors the two papers have in common, the number of author pairs across the two papers that have a collaborative relationship, and, for each such pair, the number of papers the pair has completed collaboratively.

Regarding content, papers in similar fields with similar content are more strongly related. Analyzing the full text of each paper would require a large amount of data processing and high computational complexity. Because a paper generally includes an abstract that concisely summarizes its content, the invention evaluates content similarity based on the similarity of the papers' abstracts. Specifically, the invention obtains the abstract of each paper and performs distributed representation learning on the words in the abstract with a Word2vec model, converting them into vectors that an algorithm can process. Word2vec, open-sourced by Google, is a tool for representing words as distributed word vectors; it is a neural-network-based model that converts low-level features into high-level abstract features. The invention performs representation learning on each paper with a citation relationship and obtains a vector of the following form:

[formula not recoverable from the source text]

wherein the entries of the vector are the paper's values in each dimension, and the length of the vector is the dimension of the paper vector.

Thus, the content similarity of two papers with a citation relationship is:

[formula not recoverable from the source text]

wherein the formula is expressed in terms of the values of the two papers' vectors in each dimension.

Based on the author similarity and the content similarity, the similarity between two papers with a citation relationship is calculated and used as the weight of the connecting edge:

[formula not recoverable from the source text]

wherein the weight in the formula is the weight given to the author similarity.

The invention comprehensively evaluates the citation relations among papers based on author similarity and content similarity, takes full account of the differences between citations, and constructs a directed graph of the papers' citation relations, so that citation recommendation based on the directed weighted citation network is more accurate.
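As an illustration of how an edge weight could be assembled from the two similarities, the following Python sketch may help. It rests on two assumptions that this text does not confirm, since the formulas do not survive here: content similarity is taken to be the cosine similarity of the abstract vectors, and the combined similarity is taken to be a convex combination controlled by a hypothetical weight `lam`. The function names are likewise hypothetical, and the author similarity is passed in as an already computed number.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (an assumed measure)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def edge_weight(author_sim, content_sim, lam=0.5):
    """Combine the two similarities; the convex form is an assumption."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * author_sim + (1.0 - lam) * content_sim
```

With `lam` close to 1 the edge weight is dominated by shared authorship and collaboration; with `lam` close to 0 it is dominated by abstract content.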

S2, dividing the directed weighted citation network into a plurality of network clusters;

Paper citation networks exhibit pronounced clustering. Papers in the same cluster are more likely to cite one another, while papers in different clusters are much less likely to do so. Therefore, the invention first divides the directed weighted citation network into a plurality of network clusters, which specifically comprises:

S21, selecting the paper node with the highest degree in the directed weighted citation network as the initial node, and setting i = 1;

A network cluster comprises paper nodes and the corresponding connecting edges. When the weighted citation network is divided, the x resulting network clusters must satisfy the following conditions:

[formula not recoverable from the source text]

wherein the conditions are stated in terms of the set of paper nodes contained in the i-th network cluster.

In a citation network, the node with the highest degree usually occupies an important position, and many nodes are likely to be closely related to it. Therefore, the invention starts building a network cluster from the paper node with the highest degree in the citation network and takes that node as the initial node.

S22, adding the initial node to a newly established cluster SC_i;

The initial node is the first paper node to join the cluster. For example, for the newly established cluster SC₁, the paper node with the highest degree in the directed weighted citation network is first added to SC₁, and the paper nodes and connecting edges of the citation network are then clustered starting from this initial node.

S23, acquiring the nodes that are connected to paper nodes in SC_i and do not belong to any established cluster, and adding them to a candidate cluster; if the candidate cluster is empty, executing step S25;

Papers in the same cluster are more likely to cite one another. Therefore, when building a network cluster, the invention takes the nodes that are connected to paper nodes in SC_i and do not belong to any established cluster as candidate paper nodes, so that SC_i expands along the citation relations among paper nodes.

When the candidate cluster is empty, none of the remaining paper nodes is connected to a paper node in SC_i, and construction of the current cluster is finished.

S24, judging whether the maximum weight of the connecting edges between the candidate cluster and the paper nodes in cluster SC_i is greater than the first threshold; if so, adding the paper node and connecting edge with the maximum weight to cluster SC_i and continuing with step S23; if not, setting i = i + 1 and executing step S25;

Each node in the candidate cluster has a connecting edge to at least one existing paper node in cluster SC_i; it may be connected to one existing paper node or to several. During cluster construction, if the weights of all connecting edges between nodes in the candidate cluster and existing paper nodes in SC_i are less than the first threshold, the remaining paper nodes have low similarity to, and are far from, the current cluster SC_i. No new node or connecting edge is then added to SC_i, and construction proceeds to the next new cluster.

Conversely, if some paper node has a connecting edge to an existing paper node in SC_i whose weight exceeds the first threshold, the remaining paper nodes include nodes that are similar and close to the current cluster SC_i. The paper node and connecting edge with the maximum weight are therefore added to SC_i, and the next node is selected in the same way.

S25, judging whether the directed weighted citation network contains a paper node that does not belong to any cluster; if so, selecting the highest-degree paper node not belonging to any cluster as the initial node and executing step S22; if not, outputting the clusters SC₁, SC₂, ..., SC_i, ..., SC_x, wherein x is the number of network clusters.

Through the above steps, the invention keeps constructing clusters SC_i until every paper node in the directed weighted citation network has been added to a cluster. Each new cluster SC_i is built by selecting the highest-degree paper node among the remaining paper nodes as the initial node and clustering from it. Finally, all constructed clusters SC₁, SC₂, ..., SC_i, ..., SC_x are output, completing the division of the directed weighted citation network.
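The cluster-growing procedure of steps S21 to S25 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation. It assumes that the network is given as a dictionary of directed edges `{(citing, cited): weight}`, that a node's degree is its number of distinct neighbours regardless of direction, and that ties are broken by node identifier; all function and parameter names are hypothetical.

```python
def partition_into_clusters(nodes, edges, first_threshold):
    """Greedy cluster growth following steps S21 to S25.

    nodes: iterable of sortable node identifiers.
    edges: dict mapping (citing, cited) pairs to similarity weights.
    Returns a list of sets, one per network cluster, in construction order.
    """
    # Undirected view of the citation edges for neighbourhood lookups.
    neighbours = {n: {} for n in nodes}
    for (u, v), w in edges.items():
        for a, b in ((u, v), (v, u)):
            row = neighbours.setdefault(a, {})
            row[b] = max(w, row.get(b, w))
    # Assumption: degree = number of distinct neighbours.
    degree = {n: len(row) for n, row in neighbours.items()}

    unassigned = set(neighbours)
    clusters = []
    while unassigned:  # S25: repeat while some node has no cluster
        # S21 / S25: seed with the highest-degree unassigned node.
        seed = max(sorted(unassigned), key=degree.get)
        cluster = {seed}  # S22
        unassigned.remove(seed)
        while True:
            # S23: candidates are unassigned nodes adjacent to the cluster.
            # S24: keep the candidate joined by the heaviest edge, provided
            # that weight is strictly greater than the first threshold.
            best_node, best_weight = None, first_threshold
            for member in sorted(cluster):
                for nb, w in sorted(neighbours[member].items()):
                    if nb in unassigned and w > best_weight:
                        best_node, best_weight = nb, w
            if best_node is None:
                break  # no candidate, or no edge above the threshold
            cluster.add(best_node)
            unassigned.remove(best_node)
        clusters.append(cluster)
    return clusters
```

For example, with edges `{("b", "a"): 0.9, ("c", "a"): 0.8, ("d", "a"): 0.2, ("e", "d"): 0.7}` and a first threshold of 0.5, node `a` seeds the first cluster and absorbs `b` and `c`, while `d` is left out because its edge to `a` is too weak and later seeds a second cluster together with `e`.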

S3, selecting the node with the largest influence in each network cluster as its representative node;

The influence of a node in a cluster depends not only on the node itself but also on the influence of its neighbor nodes. In general, the greater the degree of a node, the greater its influence. As for the neighbor nodes, if they keep being cited by new nodes, the content of the original node becomes known to more authors, which enlarges the influence of the original node. Therefore, the influence of node i in the invention is:

[formula not recoverable from the source text]

wherein the quantities in the formula are the number of nodes in the network cluster that cite node i; for the j-th node citing node i, the number of times that node is itself cited within the network cluster; and, among the nodes citing the j-th node, the number that also cite node i.

S4, selecting a corresponding network cluster for the new paper, based on author similarity and content similarity, as the candidate network cluster;

When citations are to be recommended, the new paper has no citation relations with the nodes in the paper citation network. Therefore, the similarity between the new paper and each node in the citation network is calculated in turn; this similarity is determined by author similarity and content similarity. The calculation is the same as the similarity calculation between papers in step S1 and is not repeated here. Because the content similarity is calculated from paper abstracts, an author who has not finished the whole paper can still receive citation recommendations by entering the paper's abstract, and can then consult and learn from the recommended papers.

The invention first recommends citations based on the constructed network clusters. Therefore, based on the similarity between the new paper and each node in the citation network, the similarity between the new paper and network cluster SC_i is calculated as:

[formula not recoverable from the source text]

wherein M is the number of paper nodes in SC_i, and the formula's other term is the similarity between the new paper and the j-th paper node of SC_i.

The greater the similarity between the new paper and network cluster SC_i, the greater the probability that the new paper has citation relations with nodes in SC_i. Therefore, the invention selects the network cluster with the greatest similarity to the new paper as the candidate network cluster.
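Selecting the candidate network cluster can be sketched as below. The exact cluster-similarity formula does not survive in this text; the sketch assumes it is the mean of the similarities between the new paper and the M paper nodes of the cluster, which matches the quantities named above but is not confirmed. The `similarity` callable and both function names are hypothetical.

```python
def cluster_similarity(new_paper, cluster, similarity):
    """Mean similarity between new_paper and the cluster's nodes (assumed form)."""
    members = list(cluster)
    if not members:
        return 0.0
    total = sum(similarity(new_paper, member) for member in members)
    return total / len(members)


def select_candidate_cluster(new_paper, clusters, similarity):
    """Return the cluster with the greatest similarity to new_paper."""
    if not clusters:
        raise ValueError("at least one network cluster is required")
    return max(
        clusters,
        key=lambda cluster: cluster_similarity(new_paper, cluster, similarity),
    )
```

Any pairwise measure that combines author similarity and content similarity can be supplied as `similarity`; the selection logic does not depend on how it is computed.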

S5, adding the node in the candidate network cluster with the highest similarity to the new paper, as a first node, to a candidate citation recommendation set, and calculating the link degrees between the first node and other nodes in the candidate citation recommendation set;

Different nodes in the same network cluster differ in influence, and recommending every paper in the candidate network cluster as a citation would make the recommendation inaccurate. Therefore, the invention further screens the paper nodes in the candidate network cluster so that citations are recommended more accurately.

Specifically, the higher a node's similarity to the new paper, the higher the probability that the node is selected as a citation. Therefore, after the similarity between the new paper and each node in the candidate network cluster has been calculated in step S4, the invention first adds the node in the candidate network cluster with the highest similarity to the new paper, as the first node, to the candidate citation recommendation set, where it serves as the first citation recommendation node.

As described above, the new paper for which citations are to be recommended has no citation relations with the nodes in the candidate network cluster, whereas papers closely related to the first node are likely to be recommended. Therefore, the invention takes the first node as the starting point, analyzes the link relations between nodes, evaluates the degree of correlation between nodes based on the link degree, and calculates in turn the link degree between the starting point and its neighbor nodes in the candidate citation recommendation set. For a neighbor node i, the link degree between node i and the starting point is:

[formula not recoverable from the source text]

wherein the quantities in the formula are the degree of neighbor node i, the degrees of the neighbor nodes other than the starting point, the average degree of the nodes in the current network cluster, and a covariance term and a variance term whose arguments are not recoverable from the source text.

S6, selecting the nodes whose link degree with the first node is higher than a second threshold to join the candidate citation recommendation set; taking each selected node as a first node and continuing to select nodes to join the candidate citation recommendation set until no node satisfies the second threshold or all nodes in the network cluster have joined the candidate citation recommendation set;

The higher the link degree, the more strongly two nodes are correlated, and the more likely they are to be cited together by the new paper. Therefore, based on the calculated link degrees, the nodes whose link degree with the first node is higher than the second threshold are selected to join the candidate citation recommendation set and, together with the first node, serve as citation recommendation nodes.

The invention keeps selecting nodes in this way: after a new node has joined the candidate citation recommendation set, the newly added node is taken as the first node and its neighbor nodes with a high link degree are selected in turn, until no node satisfies the second threshold or all nodes in the network cluster have joined the candidate citation recommendation set, which completes the construction of the candidate citation recommendation set.
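The expansion described above can be sketched as a breadth-first traversal. Because the link degree formula does not survive in this text, the sketch takes it as a caller-supplied function `link_degree(u, v)`; that function, the adjacency dictionary `cluster_neighbours`, and the function name are all hypothetical.

```python
from collections import deque


def build_candidate_set(cluster_neighbours, first_node, link_degree, second_threshold):
    """Grow the candidate citation recommendation set from first_node.

    cluster_neighbours: dict mapping each node of the candidate network
        cluster to an iterable of its neighbours inside that cluster.
    link_degree: callable (current_node, neighbour) -> float.
    Returns the selected nodes in the order they joined the set.
    """
    selected = [first_node]
    in_set = {first_node}
    queue = deque([first_node])
    while queue:
        current = queue.popleft()  # each selected node takes a turn as "first node"
        for neighbour in cluster_neighbours.get(current, ()):
            if neighbour in in_set:
                continue
            if link_degree(current, neighbour) > second_threshold:
                in_set.add(neighbour)
                selected.append(neighbour)
                queue.append(neighbour)
    return selected
```

A node rejected from one starting point stays eligible and may still join later from another selected node. The loop ends when no remaining node clears the second threshold or when every node of the cluster has joined.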

in general, citation of a paper is dynamically changed, the number of times the paper is cited is closely related to time, and papers with longer publication times are generally cited more frequently than papers with shorter publication times. For papers whose publication time is within the first time period, the times of their citations may be few or none, due to their very short publication time, but some new research success may be included in these papers. Therefore, in order to quickly acquire an important paper with a short publication time, the invention treats the paper published in the first time period as the first paper, and predicts the number of times of reference of the first paper in a second time period in the future, which specifically comprises:

The method predicts the number of citations of the first paper based on nodes similar to the first paper. The similarity between papers is determined by author similarity and content similarity. The specific calculation is the same as the similarity calculation between papers in step S1 and is not repeated here. The invention computes, in turn, the similarity between the first paper and each node in the citation network.

The invention selects the nodes whose similarity exceeds a third threshold and adds them to the similar node set. The higher the similarity between two papers, the more likely they are to follow the same citation pattern.

The invention predicts the number of citations in the future second time period, and the core assumption of the prediction is that similar papers follow approximately the same citation pattern. The citation counts of the nodes in the similar node set in the third time period are the actual numbers of times these similar nodes were cited during the third time period counted from their respective publication dates. Therefore, the method fits and predicts the number of citations of the first paper based on how the citation counts of the similar papers developed over the third time period. The specific prediction method is not limited herein, and any conventional data prediction method may be used.
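Since the text leaves the prediction method open, the following is just one possible sketch: similar nodes are chosen with the third threshold, and the yearly citations of the first paper in the second time period are estimated as a similarity-weighted average of the similar papers' yearly counts. The function names, the yearly granularity and the weighted-average rule are all assumptions made for illustration, not the method prescribed by the source.

```python
from typing import Dict, List, Sequence


def select_similar_nodes(
    similarities: Dict[str, float], third_threshold: float
) -> Dict[str, float]:
    """Keep the nodes whose similarity to the first paper exceeds the threshold."""
    return {n: s for n, s in similarities.items() if s > third_threshold}


def predict_future_citations(
    similar: Dict[str, float],
    yearly_history: Dict[str, Sequence[float]],
    first_period: int,
    second_period: int,
) -> List[float]:
    """Predict yearly citations for the years after first_period.

    yearly_history[n][k] is the actual citation count of similar node n in
    year k+1 after its own publication; each history is assumed to cover
    the whole third period (first_period + second_period years).
    """
    third_period = first_period + second_period
    total_weight = sum(similar.values())
    if total_weight <= 0:
        raise ValueError("need at least one similar node with positive similarity")
    prediction = []
    for year in range(first_period, third_period):  # 0-based year index
        weighted = sum(sim * yearly_history[n][year] for n, sim in similar.items())
        prediction.append(weighted / total_weight)
    return prediction


# Hypothetical example: first period 1 year, second period 2 years.
toy_similar = select_similar_nodes({"p1": 0.9, "p2": 0.3, "p3": 0.1}, 0.2)
toy_history = {"p1": [0, 2, 4], "p2": [0, 6, 8], "p3": [9, 9, 9]}
toy_prediction = predict_future_citations(toy_similar, toy_history, 1, 2)
```

Any other regression or curve-fitting technique could replace the weighted average, as long as it is fitted on the citation histories of the similar node set.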

The invention also selects second papers, which are papers published within a third time period, where third time period = first time period + second time period; for example, the third time period is 3 years, the first time period is 1 year and the second time period is 2 years. That is, the first paper has been published for a shorter time than the second paper, but the second paper has not been published for very long either. The second papers published within the third time period exclude the first papers.
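The relationship between the three time periods can be illustrated with a small sketch that splits papers by how long ago they were published. The function name, the use of ages in years and the treatment of the period boundaries as inclusive are assumptions made for the example.

```python
from typing import Dict, Set, Tuple


def split_by_publication_age(
    ages_in_years: Dict[str, float], first_period: float, second_period: float
) -> Tuple[Set[str], Set[str]]:
    """Return (first papers, second papers) according to publication age."""
    third_period = first_period + second_period
    # First papers: published within the first time period.
    first_papers = {p for p, age in ages_in_years.items() if age <= first_period}
    # Second papers: published within the third time period, first papers excluded.
    second_papers = {
        p for p, age in ages_in_years.items() if first_period < age <= third_period
    }
    return first_papers, second_papers


# With a 1-year first period and a 2-year second period, the third period is 3 years.
toy_first, toy_second = split_by_publication_age({"a": 0.5, "b": 2.0, "c": 5.0}, 1, 2)
```

Paper "c" in the example falls outside the third time period and is therefore neither a first paper nor a second paper.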

In general, the more times a paper is cited, the more important it is. However, as technology progresses, papers that accumulated many citations in earlier years may become less likely to be cited again. Correspondingly, for a recently published paper, even if its total number of citations is not high, a recent rise in its citations indicates that the paper is currently active; such a paper is also important and has a high probability of being cited in newly written papers. Therefore, the invention calculates the growth degree of the number of citations of the first paper and the second paper, so as to select papers that were published recently and have attracted strong recent attention. For paper node i, the growth degree of the number of citations is specifically as follows:

wherein [symbol] is the number of citations in the j-th period after the publication time, with j measured in years and [symbol]. For the first paper, T is the third time period, and for the second paper, T is the time difference between the publication time of the paper and the recommendation time. Since the citation counts of the first paper include predicted values, [symbol] may be either the actual number of citations or the predicted number of citations of the paper.

S9, adding the papers whose growth degree of the number of citations is greater than a fourth threshold to the candidate citation recommendation set to obtain a final citation recommendation set, and recommending the papers in the citation recommendation set to the user as citations for the newly-built paper.

The invention adds the papers whose growth degree of the number of citations is greater than the fourth threshold to the candidate citation recommendation set to obtain the final citation recommendation set, which avoids the problem that papers with a short publication time cannot be effectively recommended, and achieves comprehensive and accurate citation recommendation. After the final citation recommendation set is obtained, the papers in it are returned to the user as citations for the newly-built paper.
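Step S9 can be sketched as a simple threshold filter over the first and second papers. The growth formula is not recoverable from this text, so the `average_yearly_growth` function below is a stand-in assumption (mean year-over-year change in citations), not the measure defined by the invention; the function and parameter names are likewise hypothetical.

```python
from typing import Callable, Dict, Sequence, Set


def average_yearly_growth(yearly_counts: Sequence[float]) -> float:
    """Assumed stand-in growth measure: mean change in citations per year."""
    if len(yearly_counts) < 2:
        return 0.0
    return (yearly_counts[-1] - yearly_counts[0]) / (len(yearly_counts) - 1)


def finalize_recommendations(
    candidate_set: Set[str],
    yearly_counts_by_paper: Dict[str, Sequence[float]],
    fourth_threshold: float,
    growth: Callable[[Sequence[float]], float] = average_yearly_growth,
) -> Set[str]:
    """Add first/second papers whose growth exceeds the fourth threshold."""
    final_set = set(candidate_set)
    for paper, counts in yearly_counts_by_paper.items():
        # counts may mix actual and predicted yearly citation numbers.
        if growth(counts) > fourth_threshold:
            final_set.add(paper)
    return final_set


# Hypothetical example: one fast-growing and one flat recent paper.
toy_final = finalize_recommendations(
    {"p1"}, {"new_hot": [1, 4, 9], "new_flat": [2, 2, 2]}, 2.0
)
```

Passing the growth measure as a parameter keeps the filter independent of whichever growth definition is actually used.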

Example two

As shown in fig. 2, the present embodiment provides a citation recommendation system based on link analysis, including:

wherein [symbol] denotes the paper nodes in the citation network, n is the number of papers, and [symbol]; [symbol] denotes the directed edges. [symbol] indicates that paper [symbol] is cited by paper [symbol], [symbol].

wherein [symbol] and [symbol]; [symbol] is the number of authors that the two papers have in common, and [symbol] is the number of author pairs with a cooperative relationship across the two papers.

[symbol] is as follows:

wherein [symbol] is the value of the [symbol]-th dimension of paper [symbol], and [symbol] is the dimension of the paper vector.

Thus, the content similarity of two papers [symbol] and [symbol] having a citation relationship is as follows:

wherein [symbol] is the value of the [symbol]-th dimension of paper [symbol], and [symbol] is the value of the corresponding dimension of paper o.

wherein [symbol] is the weight of the author similarity [symbol].

An initialization module, configured to select the paper node with the highest node degree in the directed weighted citation network as the initial node, and to set i = 1;

wherein [symbol] is the set of paper nodes comprised by the i-th network cluster, [symbol], [symbol], [symbol].

A first adding module, configured to add the initial node to the newly established cluster SC_i;

A second adding module, configured to obtain the nodes that are connected to paper nodes in SC_i and do not belong to any established cluster, and to add them to the candidate cluster; if the candidate cluster is empty, the second judging module is called;

Papers belonging to the same cluster are more likely to cite one another. Therefore, when constructing a network cluster, the invention obtains the nodes that are connected to paper nodes in SC_i and do not belong to any established cluster, and takes them as candidate paper nodes, so that SC_i expands according to the citation relationships among paper nodes.

When the candidate cluster is empty, none of the remaining paper nodes is connected to a paper node in SC_i, and the construction of the current cluster is finished.

A first judging module, configured to judge whether the maximum weight of the edges connecting the candidate cluster to paper nodes in the cluster SC_i is greater than the first threshold; if so, the paper node and the connecting edge corresponding to the maximum weight are added to the cluster SC_i and the second adding module is called; if not, i = i + 1 and the second judging module is called;

A second judging module, configured to judge whether the directed weighted citation network still contains a paper node that does not belong to any cluster; if so, the paper node with the highest degree among the nodes that do not belong to any cluster is selected as the initial node and the first adding module is called; if not, the clusters SC₁, SC₂, ..., SC_i, ..., SC_x are output, wherein x is the number of network clusters.

Through the above steps, the invention keeps constructing clusters SC_i until all paper nodes in the directed weighted citation network have been added to a cluster. Each new cluster SC_i is constructed by selecting the paper node with the highest degree among the remaining paper nodes as the initial node and clustering paper nodes around it. Finally, all the constructed clusters SC₁, SC₂, ..., SC_i, ..., SC_x are output, completing the division of the directed weighted citation network.
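The cooperation of the initialization, adding and judging modules can be sketched as a greedy cluster-growing loop. The sketch makes several assumptions that the text does not fix: the function name `build_clusters` is hypothetical, node degree is taken as the number of distinct neighbors regardless of edge direction, edge weights are given as a mapping from directed node pairs to similarity values, and connectivity ignores edge direction.

```python
from typing import Dict, Hashable, List, Sequence, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def build_clusters(
    nodes: Sequence[Hashable], weights: Dict[Edge, float], first_threshold: float
) -> List[Set[Hashable]]:
    """Divide a weighted citation network into clusters SC_1 ... SC_x."""
    # Direction-agnostic adjacency view: adj[u][v] is the largest weight between u and v.
    adj: Dict[Hashable, Dict[Hashable, float]] = {n: {} for n in nodes}
    for (u, v), w in weights.items():
        adj[u][v] = max(w, adj[u].get(v, w))
        adj[v][u] = max(w, adj[v].get(u, w))

    unassigned = set(nodes)
    clusters: List[Set[Hashable]] = []
    while unassigned:  # second judging module: any node left without a cluster?
        # Initial node: highest-degree node that belongs to no cluster yet.
        seed = max((n for n in nodes if n in unassigned), key=lambda n: len(adj[n]))
        cluster = {seed}  # first adding module
        unassigned.discard(seed)
        while True:
            # Second adding module: candidates are unassigned nodes linked to the cluster.
            best_node, best_weight = None, float("-inf")
            for member in cluster:
                for nb, w in adj[member].items():
                    if nb in unassigned and w > best_weight:
                        best_node, best_weight = nb, w
            # First judging module: stop when no candidate exceeds the first threshold.
            if best_node is None or best_weight <= first_threshold:
                break
            cluster.add(best_node)
            unassigned.discard(best_node)
        clusters.append(cluster)
    return clusters


# Hypothetical toy network: the weak edge c-d (0.3) separates two clusters.
toy_nodes = ["a", "b", "c", "d", "e"]
toy_weights = {("a", "b"): 0.9, ("a", "c"): 0.8, ("c", "d"): 0.3, ("d", "e"): 0.7}
toy_clusters = build_clusters(toy_nodes, toy_weights, 0.5)
```

In the toy network, the cluster seeded at "a" absorbs "b" and "c" but stops at the weak edge to "d", so "d" and "e" form a second cluster.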

wherein [symbol] is the number of nodes in the network cluster that cite node i, [symbol].

wherein M is the number of paper nodes in SC_i, [symbol] is the similarity between the newly-built paper and the j-th paper node, and [symbol].

wherein [symbol] is the degree of the neighbor node i, [symbol] is the degree of the neighbor node excluding the starting point, [symbol] is the average degree of the nodes in the current network cluster, [symbol] is the covariance of [symbols], and [symbol] is the variance of [symbol].

A calculation module, configured to calculate the similarity between the first paper and the nodes in the citation network;

A selecting module, configured to select the nodes whose similarity exceeds a third threshold and add them to the similar node set;

A prediction module, configured to fit and predict the number of citations of the first paper in a second time period in the future, based on the numbers of times the nodes in the similar node set were cited during a third time period counted from their publication dates, wherein third time period = first time period + second time period.


Therefore, the citation recommendation method and system based on link analysis provided by the invention have the following advantages. Citations are recommended on the basis of network clusters, making full use of the clustering characteristics of papers: papers belonging to the same cluster are more likely to cite one another, while papers belonging to different clusters are much less likely to do so, which reduces the cost of processing the whole citation network. The paper nodes within a cluster are screened, so that the differences among papers in the same cluster are fully considered and the accuracy of cluster-based citation recommendation is improved. A first node is selected, and the corresponding recommended nodes are selected based on the link degrees between the first node and other nodes; this overcomes the problems that recommended citations have no citation edges in the citation network and that the citation relationships in the citation network are not fully utilized, and improves the accuracy of citation recommendation. The number of citations of papers with a short publication time is predicted, which avoids missing valuable papers merely because they were published too recently and further improves the accuracy of citation recommendation; meanwhile, papers are evaluated by the growth degree of their citation counts, which effectively captures changes in the attention they receive and assesses their importance more accurately. Finally, the citation relationships among papers are evaluated comprehensively based on author similarity and content similarity, the differences among citations are fully considered, and a directed graph of the citation relationships is constructed, so that citation recommendation based on the directed weighted citation network is more accurate.

It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims

1. A citation recommendation method based on link analysis, characterized by comprising the following steps:

the author similarity of papers with reference relationships is:

wherein [symbol] and [symbol]; [symbol] is the number of authors that the papers have in common, [symbol] is the number of author pairs with a cooperative relationship in the papers,

is as follows

the content similarity of papers [symbol] and [symbol] with a reference relationship is:

wherein [symbol] is the value of the [symbol]-th dimension of paper [symbol], [symbol] is the value of the [symbol]-th dimension of paper o, and [symbol] is the dimension of the paper vector;

the similarity of papers with reference relationships is:

wherein [symbol] is the weight of the author similarity [symbol];

the S2 specifically includes:

S22, adding the initial node to the newly established cluster SC_c;

S23, obtaining the nodes that are connected to paper nodes in SC_c and do not belong to any established cluster, and adding them to the candidate cluster; if the candidate cluster is empty, executing step S25;

the influence of node f is:

wherein [symbol] is the number of nodes in the network cluster that cite node f, [symbol] is the number of times the j-th node citing node f is cited in the network cluster, and [symbol];

taking the first node as a starting point, the link degree between the starting point and a neighbor node x in the candidate citation recommendation set is:

wherein [symbol] is the degree of the neighbor node x, [symbol] is the degree of the neighbor node x excluding the starting point, [symbol] is the average degree of the nodes in the current network cluster, [symbol] is the covariance of [symbols], and [symbol] is the variance of [symbol].

2. The citation recommendation method of claim 1, characterized in that said S4 includes: selecting the network cluster with the maximum similarity to the newly-built paper as the candidate network cluster; the similarity between the newly-built paper and the network cluster SC_c is as follows:

wherein M is the number of paper nodes in SC_c, [symbol] is the similarity between the newly-built paper and the q-th paper node, and [symbol].

3. The citation recommendation method according to claim 1, wherein said S7 specifically is:

4. The citation recommendation method according to claim 3, wherein said growth degree of the number of citations of paper node r is:

wherein [symbol] is the number of citations in the p-th period after the publication time, p is measured in years, and [symbol].

5. The citation recommendation method of claim 4, wherein said [symbol] includes actual paper citation counts and predicted paper citation counts.

6. A citation recommendation system based on link analysis, which is used for realizing the citation recommendation method of any one of claims 1-5, and is characterized by comprising: