CN111460324B - Citation recommendation method and system based on link analysis - Google Patents

Info

Publication number
CN111460324B
Authority
CN
China
Prior art keywords
paper
node
cluster
network
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010556832.7A
Other languages
Chinese (zh)
Other versions
CN111460324A (en
Inventor
冯雅
吴宗羲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Canba Technology Co ltd
Original Assignee
Hangzhou Canba Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Canba Technology Co ltd filed Critical Hangzhou Canba Technology Co ltd
Priority to CN202010556832.7A priority Critical patent/CN111460324B/en
Publication of CN111460324A publication Critical patent/CN111460324A/en
Application granted granted Critical
Publication of CN111460324B publication Critical patent/CN111460324B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9536 Search customisation based on social or collaborative filtering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking

Abstract

The invention discloses a citation recommendation method and system based on link analysis. The method comprises: constructing a directed weighted citation network; dividing the citation network into network clusters; selecting a representative node for each network cluster; selecting a candidate network cluster for a new paper; adding the node with the highest similarity to a candidate citation recommendation set and calculating the link degree between that node and the other nodes; adding nodes whose link degree exceeds a second threshold to the candidate citation recommendation set, and repeating this selection from each newly added node; acquiring papers published within a first time period as first papers and predicting their citation counts; acquiring papers published within a third time period as second papers; calculating the citation-count growth of the first and second papers; and adding papers whose citation-count growth exceeds a fourth threshold to the candidate citation recommendation set. The invention optimizes citation recommendation over the community network, improves recommendation accuracy, and predicts citations for recently published papers, so that the recommended citations are more comprehensive.

Description

Citation recommendation method and system based on link analysis
Technical Field
The invention relates to the field of document searching, in particular to a citation recommendation method and system based on link analysis.
Background
An academic paper needs to cite relevant prior work to help readers understand its background and contribution, and researchers often want to quickly survey the existing literature in a field, including which papers are most relevant and which sub-topics they cover. As the number of academic papers grows, the citation network formed by papers and their references is becoming a large-scale complex network. Citation analysis plays an important role in document retrieval and paper recommendation.
The invention patent application with publication number CN 110674318A discloses a data recommendation method based on citation network community discovery. It constructs a citation network from the co-authorship relations between authors and the co-citation and coupling relations between papers; divides the citation network into a plurality of community networks; associates data sets with community networks based on the similarity between papers and data sets; and recommends data after merging and de-duplicating the paper nodes in the community networks associated with a data set.
Although the above application mentions recommendation based on citation network community discovery, it associates data sets with community networks in order to recommend data. Papers within the same community network differ in influence, so their probabilities of being recommended as citations differ greatly. Citation counts also change over time: publication time strongly affects how often a paper is cited, and a newly published, technically advanced paper may still have few citations. The recommendation method disclosed in that application therefore suffers from low accuracy. How to achieve accurate, high-quality citation recommendation in view of these shortcomings is a problem to be solved in the field.
Disclosure of Invention
The invention aims to provide a citation recommendation method and system based on link analysis that address the defects of the prior art. The invention optimizes citation recommendation over the community network, improves recommendation accuracy, and predicts citations for recently published papers, so that the recommended citations are more comprehensive.
To achieve this purpose, the invention adopts the following technical scheme:
A citation recommendation method based on link analysis comprises the following steps:
S1, constructing a directed weighted citation network based on the citation relations, author similarities and content similarities among papers;
S2, dividing the directed weighted citation network into a plurality of network clusters;
S3, selecting, for each network cluster, the node with the largest influence as a representative node;
S4, selecting a network cluster for a new paper, based on author similarity and content similarity, as a candidate network cluster;
S5, adding the node in the candidate network cluster with the highest similarity to the new paper, as a first node, to a candidate citation recommendation set, and calculating the link degree between the first node and the other nodes in the candidate network cluster;
S6, adding the nodes whose link degree with the first node is higher than a second threshold to the candidate citation recommendation set; taking each selected node as the first node in turn and continuing to add nodes to the candidate citation recommendation set, until no node meets the second threshold or all nodes in the network cluster have been added;
S7, acquiring a paper published within a first time period as a first paper, and predicting the citation count of the first paper over a second time period in the future;
S8, acquiring a paper published within a third time period as a second paper, wherein the third time period = the first time period + the second time period; calculating the citation-count growth of the first paper and the second paper;
S9, adding the papers whose citation-count growth is greater than a fourth threshold to the candidate citation recommendation set to obtain a final citation recommendation set, and recommending the papers in the final citation recommendation set to the user as citations for the new paper;
the author similarity of papers with a citation relationship is:
[formula image not reproduced]
wherein the two weights in the formula are those assigned to shared authors and to author collaboration, respectively; the remaining quantities are the number of authors the papers have in common, the number of author pairs across the papers that have a collaborative relationship, and, for each such pair, the number of papers the pair has co-written;
the content similarity of two papers with a citation relationship is:
[formula image not reproduced]
wherein the terms of the formula are the values of each paper's vector in a given dimension, and the remaining quantity is the dimension of the paper vectors;
the similarity of papers with a citation relationship is:
[formula image not reproduced]
wherein the coefficient in the formula is the weight of the author similarity;
the weight of each edge in the directed weighted citation network is the similarity of the two papers the edge connects;
S2 specifically comprises:
S21, selecting the paper node with the highest degree in the directed weighted citation network as the initial node, and setting c = 1;
S22, adding the initial node to a newly established cluster SC_c;
S23, acquiring the nodes that are connected to paper nodes in SC_c and do not belong to any established cluster, and adding them to a candidate cluster; if the candidate cluster is empty, executing step S25;
S24, determining whether the maximum weight of the connecting edges between nodes in the candidate cluster and paper nodes in cluster SC_c is greater than a first threshold; if so, adding the paper node and connecting edge with the maximum weight to cluster SC_c and returning to step S23; if not, setting c = c + 1 and executing step S25;
S25, determining whether any paper node in the directed weighted citation network does not yet belong to a cluster; if so, selecting the highest-degree paper node that does not belong to any cluster as the initial node and executing step S22; if not, outputting the clusters SC_1, SC_2, ..., SC_c, ..., SC_x, wherein x is the number of network clusters;
the influence of node f is:
[formula image not reproduced]
wherein the quantities in the formula are the number of nodes in the network cluster that cite node f, the number of times the j-th node citing node f is itself cited within the network cluster, and the number of nodes citing that j-th node which also cite node f;
the link degree between a neighbor node l and the first node is:
[formula image not reproduced]
wherein the quantities in the formula are the degree of the neighbor node, the degrees of the neighbor nodes other than the starting point, the average degree of the nodes in the current network cluster, and the covariance and variance of the neighbor-node degree.
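The expansion loop of steps S5 and S6 can be sketched as follows. This is a minimal illustration, not the patented implementation: the link-degree formula is not reproduced in this text, so it is passed in as a caller-supplied function, and the names `build_candidate_set`, `similarity_to_new` and `link_degree` are hypothetical.

```python
def build_candidate_set(cluster, similarity_to_new, link_degree, second_threshold):
    """Grow a candidate citation set inside one network cluster (steps S5-S6).

    cluster: iterable of paper ids in the candidate network cluster.
    similarity_to_new: dict mapping paper id -> similarity to the new paper.
    link_degree: hypothetical callable (u, v) -> float; the real formula is
        not available here, so any measure can be plugged in.
    """
    nodes = sorted(cluster)
    # S5: the most similar node becomes the first node.
    first = max(nodes, key=lambda p: similarity_to_new[p])
    selected = [first]
    frontier = [first]
    # S6: every newly selected node becomes the "first node" in turn.
    while frontier:
        current = frontier.pop(0)
        for node in nodes:
            if node not in selected and link_degree(current, node) > second_threshold:
                selected.append(node)
                frontier.append(node)
    return selected
```

The loop stops either when no remaining node clears the second threshold or when the whole cluster has been added, which mirrors the two termination conditions stated in S6.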
Further, S4 comprises: selecting the network cluster with the greatest similarity to the new paper as the candidate network cluster; the similarity between the new paper and network cluster SC_c is:
[formula image not reproduced]
wherein M is the number of paper nodes in SC_c, and the remaining term is the similarity between the new paper and the q-th paper node.
Further, S7 specifically comprises:
S71, calculating the similarity between the first paper and the nodes in the citation network;
S72, adding the nodes whose similarity exceeds a third threshold to a similar node set;
S73, fitting the citation counts that the nodes in the similar node set received within a third time period starting from their publication dates, and predicting from the fit the citation count of the first paper over the second time period in the future, wherein the third time period = the first time period + the second time period.
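A rough sketch of the prediction in S71 to S73, under stated assumptions: time periods are whole years, each similar paper supplies a list of per-year citation counts starting at its publication year, and the "fit" is simply the mean over similar papers. The text does not specify the fitting method, so the averaging and the name `predict_future_citations` are illustrative only.

```python
def predict_future_citations(similar_histories, first_period, second_period):
    """Estimate citations a recent paper will receive in the second period.

    similar_histories: list of per-year citation counts for similar papers,
        index 0 being the publication year (assumed data layout).
    first_period, second_period: lengths in years (assumed integers).
    """
    third_period = first_period + second_period
    # Only papers observed over the full third period can inform the estimate.
    usable = [h for h in similar_histories if len(h) >= third_period]
    if not usable:
        return 0.0
    # Citations each similar paper received during its own "second period".
    future = [sum(h[first_period:third_period]) for h in usable]
    # Assumed fit: a plain average over the similar papers.
    return sum(future) / len(future)
```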
Further, the citation-count growth of paper node r is:
[formula image not reproduced]
wherein the term in the formula is the number of citations received within a period p after publication, with p measured in years; for the first paper, T is the third time period, and for the second paper, T is the time between the paper's publication and the time of recommendation.
Further, the citation counts used in the formula include both actual and predicted citation counts.
The invention also provides a citation recommendation system based on link analysis for implementing the above citation recommendation method, the system comprising:
a network construction module, configured to construct a directed weighted citation network based on the citation relations, author similarities and content similarities among papers;
a cluster division module, configured to divide the directed weighted citation network into a plurality of network clusters;
a representative node selection module, configured to select, for each network cluster, the node with the largest influence as a representative node;
a candidate network cluster selection module, configured to select a network cluster for a new paper, based on author similarity and content similarity, as a candidate network cluster;
a link degree calculation module, configured to add the node in the candidate network cluster with the highest similarity to the new paper, as a first node, to a candidate citation recommendation set, and to calculate the link degree between the first node and the other nodes in the candidate citation recommendation set;
a candidate citation recommendation set construction module, configured to add the nodes whose link degree with the first node is higher than a second threshold to the candidate citation recommendation set, and to take each selected node as the first node in turn and continue adding nodes until no node meets the second threshold or all nodes in the network cluster have been added;
a citation prediction module, configured to acquire a paper published within a first time period as a first paper and to predict the citation count of the first paper over a second time period in the future;
a growth calculation module, configured to acquire a paper published within a third time period as a second paper, wherein the third time period = the first time period + the second time period, and to calculate the citation-count growth of the first paper and the second paper;
and a final citation recommendation set generation module, configured to add the papers whose citation-count growth is greater than a fourth threshold to the candidate citation recommendation set to obtain a final citation recommendation set, and to recommend the papers in the final citation recommendation set to the user as citations for the new paper.
Compared with the prior art, the invention has the following effects:
(1) the method recommends citations based on network clusters and makes full use of the clustering among papers: papers in the same cluster are more likely to cite one another, while citations between papers in different clusters are far less likely, which avoids the cost of processing the whole citation network;
(2) the method screens the paper nodes within a cluster, takes into account the differences among papers in the same cluster, and improves the accuracy of cluster-based citation recommendation;
(3) the method selects a first node and then selects recommended nodes based on the link degree between the first node and other nodes, which addresses the problems that no citation edge exists between the new paper and the citation network and that the citation relationships in the network would otherwise go underused, and improves recommendation accuracy;
(4) the method predicts the citation counts of recently published papers, which avoids missing valuable papers merely because they were published too recently, and further improves recommendation accuracy;
(5) the method evaluates the citation relationships among papers using both author similarity and content similarity, takes into account that citations differ in strength, and builds a directed graph of the citation relationships, so that citation recommendation based on the directed weighted citation network is more accurate.
Drawings
FIG. 1 is a flowchart of the citation recommendation method based on link analysis according to Embodiment 1;
FIG. 2 is a structural diagram of the citation recommendation system based on link analysis according to Embodiment 2.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.
The invention is further described with reference to the following drawings and specific examples, which are not intended to be limiting.
Embodiment 1
As shown in fig. 1, the present embodiment provides a citation recommendation method based on link analysis, including:
S1, constructing a directed weighted citation network based on the citation relations, author similarities and content similarities among papers;
The invention first constructs a citation network. A citation relationship has a direction, from the citing paper to the cited paper, so the citation network of the invention is a directed graph. The citation network can be represented as a directed graph
[formula image not reproduced]
whose node set contains the paper nodes of the citation network, n being the number of papers, and whose edge set contains the connecting edges between papers, m being the number of connecting edges. Each connecting edge is a directed edge indicating that one paper is cited by another.
The citation network constructed by the invention is a directed weighted citation network in which different citations carry different weights. Specifically, the citation weight depends on author similarity and content similarity: the higher the author similarity and the content similarity, the higher the citation weight between the papers.
For the authors of the papers, if the same authors exist between two papers, it is indicated that the two papers are likely to be related papers, and can be the progressive research results in the same field or the research results in related fields. The greater the number of authors in the two papers that are the same, the closer the connection between the two papers is. In addition, if the authors in the two papers cooperate to complete other papers, it is indicated that there is a certain association between the two authors, which may be the research and development members of the same team, and the two papers both belong to the research result of the team. The more papers the author completes collaboratively, the more closely the two authors are related. Thus, the author similarity of two papers with citation relationships is:
[formula image not reproduced]
wherein the two weights in the formula are those assigned to shared authors and to author collaboration, respectively; the remaining quantities are the number of authors the two papers have in common, the number of author pairs across the two papers that have a collaborative relationship, and, for each such pair, the number of papers the pair has co-written.
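As an illustration of how the quantities described above could be combined, the sketch below assumes a plain weighted sum of the shared-author count and the co-written paper counts. The patent's exact formula is not reproduced in this text, so the functional form, the default weights, and the names `author_similarity` and `coauthored` are all assumptions.

```python
from itertools import product


def author_similarity(authors_a, authors_b, coauthored, alpha=0.6, beta=0.4):
    """Hypothetical weighted-sum form of author similarity.

    authors_a, authors_b: author names of the two papers.
    coauthored: dict mapping frozenset({x, y}) -> number of other papers
        that authors x and y wrote together (assumed data layout).
    alpha, beta: weights for shared authors and collaboration (assumed values).
    """
    a, b = set(authors_a), set(authors_b)
    shared = a & b
    # Cross-paper author pairs, excluding authors who appear on both papers.
    pair_counts = [
        coauthored.get(frozenset((x, y)), 0)
        for x, y in product(a - shared, b - shared)
    ]
    # Only pairs that actually collaborated contribute to the second term.
    collaboration = sum(c for c in pair_counts if c > 0)
    return alpha * len(shared) + beta * collaboration
```

A real implementation would likely normalize both terms so the result stays in a fixed range; that choice is left open here because the source formula is unavailable.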
For paper content, papers from similar fields with similar content are more strongly related. Analyzing the full text of a paper requires a large amount of data processing and is computationally expensive; since a paper generally includes an abstract that concisely summarizes its content, the invention evaluates content similarity based on the similarity of paper abstracts. Specifically, the invention obtains the abstract of each paper and performs distributed representation learning on the words in the abstract with a Word2vec model, converting them into vectors that an algorithm can process. Word2vec is a tool open-sourced by Google for representing words as distributed word vectors; it is a deep learning model that converts low-level features into high-level abstract features through a neural-network-based perceptron. The invention performs representation learning on each paper with a citation relationship and obtains a vector:
[formula image not reproduced]
wherein each component is the value of the paper's vector in the corresponding dimension, and the remaining quantity is the dimension of the paper vector.
Thus, the content similarity of two papers with a citation relationship is:
[formula image not reproduced]
wherein the terms of the formula are the values of each of the two papers' vectors in a given dimension.
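One common way to compare two abstract vectors dimension by dimension is cosine similarity. The sketch below uses it as a stand-in; whether the patent's formula is cosine similarity or another vector measure cannot be confirmed from this text, and `content_similarity` is a hypothetical name.

```python
import math


def content_similarity(vec_a, vec_b):
    """Cosine similarity between two paper vectors (assumed measure)."""
    if len(vec_a) != len(vec_b):
        raise ValueError("paper vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(x * x for x in vec_a))
    norm_b = math.sqrt(sum(y * y for y in vec_b))
    # A zero vector carries no content signal, so treat it as dissimilar.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```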
Based on author similarity and content similarity, the similarity between two papers with a citation relationship is calculated and used as the weight of the connecting edge:
[formula image not reproduced]
wherein the coefficient in the formula is the weight of the author similarity.
The method evaluates the citation relationships among papers using both author similarity and content similarity, takes into account that citations differ in strength, and builds a directed graph of the citation relationships, so that citation recommendation based on the directed weighted citation network is more accurate.
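Step S1 as a whole can be sketched with a plain adjacency dictionary. The convex combination of author and content similarity used for the edge weight is an assumption (the text only says one coefficient weights the author similarity), and `paper_similarity`, `build_citation_network` and `lam` are hypothetical names.

```python
def paper_similarity(author_sim, content_sim, lam=0.5):
    """Assumed form: lam weights author similarity, the rest goes to content."""
    return lam * author_sim + (1 - lam) * content_sim


def build_citation_network(citations, author_sim, content_sim, lam=0.5):
    """Build a directed weighted citation network.

    citations: iterable of (citing, cited) paper-id pairs.
    author_sim, content_sim: dicts keyed by the same (citing, cited) pairs.
    Returns {citing: {cited: weight}}; every paper appears as a key.
    """
    graph = {}
    for citing, cited in citations:
        graph.setdefault(citing, {})
        graph.setdefault(cited, {})
        pair = (citing, cited)
        # The edge weight is the similarity of the two papers it connects.
        graph[citing][cited] = paper_similarity(author_sim[pair], content_sim[pair], lam)
    return graph
```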
S2, dividing the directed weighted citation network into a plurality of network clusters;
A paper citation network exhibits pronounced clustering: papers in the same cluster are more likely to cite one another, and citations between papers in different clusters are far less likely. Therefore, the invention first divides the directed weighted citation network into a plurality of network clusters, which specifically comprises:
S21, selecting the paper node with the highest degree in the directed weighted citation network as the initial node, and setting i = 1;
A network cluster comprises paper nodes and the corresponding connecting edges. When the directed weighted citation network is divided, the x resulting network clusters must satisfy the following conditions:
[formula image not reproduced]
wherein the conditions are expressed in terms of the paper nodes contained in the i-th network cluster.
In a citation network, the node with the highest degree usually occupies an important position, and more nodes are likely to be closely related to it. Therefore, the invention starts constructing a network cluster from the paper node with the highest degree in the citation network and takes that node as the initial node.
S22, adding the initial node to a newly established cluster SC_i;
The initial node is the first paper node to join the cluster. For example, for the newly established cluster SC_1, the paper node with the highest degree in the directed weighted citation network is first added to SC_1, and the paper nodes and connecting edges in the citation network are then clustered starting from this initial node.
S23, acquiring the nodes that are connected to paper nodes in SC_i and do not belong to any established cluster, and adding them to a candidate cluster; if the candidate cluster is empty, executing step S25;
Papers in the same cluster are more likely to cite one another. Therefore, when constructing a network cluster, the invention takes the nodes that are connected to paper nodes in SC_i and do not belong to any established cluster as candidate paper nodes, so that SC_i expands along the citation relationships among paper nodes.
When the candidate cluster is empty, none of the remaining paper nodes is connected to a paper node in SC_i, and construction of the current cluster is finished.
S24, determining whether the maximum weight of the connecting edges between nodes in the candidate cluster and paper nodes in cluster SC_i is greater than a first threshold; if so, adding the paper node and connecting edge with the maximum weight to cluster SC_i and returning to step S23; if not, setting i = i + 1 and executing step S25;
Each node in the candidate cluster has a connecting edge to at least one existing paper node in cluster SC_i, and may have connecting edges to several. During cluster construction, when the weights of all connecting edges between candidate nodes and the existing paper nodes of SC_i are less than the first threshold, the remaining paper nodes have low similarity to, and are distant from, the current cluster SC_i; no further nodes or connecting edges are added to SC_i, and construction continues with the next new cluster.
Conversely, if some paper node has a connecting edge to an existing paper node of SC_i whose weight is greater than the first threshold, the remaining paper nodes include nodes that are highly similar and close to the current cluster SC_i; the paper node and connecting edge with the maximum weight are added to SC_i, and selection of the next node to cluster continues.
S25, determining whether any paper node in the directed weighted citation network does not yet belong to a cluster; if so, selecting the highest-degree paper node that does not belong to any cluster as the initial node and executing step S22; if not, outputting the clusters SC_1, SC_2, ..., SC_i, ..., SC_x, wherein x is the number of network clusters.
Through the above steps, the invention constructs clusters SC_i one after another until every paper node in the directed weighted citation network has been added to a cluster. Each new cluster SC_i is built by selecting the highest-degree paper node among the remaining paper nodes as the initial node and clustering paper nodes from it. Finally, all constructed clusters SC_1, SC_2, ..., SC_i, ..., SC_x are output, completing the division of the directed weighted citation network.
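Steps S21 to S25 amount to a greedy, seed-based partition, which the following sketch tries to mirror. Some details are assumptions not fixed by the text: degree is taken as the number of distinct neighbors regardless of edge direction, "connected" means an edge in either direction, and ties are broken by sorted node id. The name `partition_into_clusters` is hypothetical.

```python
def partition_into_clusters(edges, first_threshold):
    """Greedy clustering of a weighted citation network (steps S21-S25).

    edges: dict mapping (citing, cited) -> edge weight.
    Returns a list of clusters, each a set of paper ids.
    """
    # Direction-agnostic neighbor view; keep the larger weight if both
    # directions exist (assumption).
    neighbours = {}
    for (u, v), w in edges.items():
        neighbours.setdefault(u, {})
        neighbours.setdefault(v, {})
        neighbours[u][v] = max(w, neighbours[u].get(v, 0.0))
        neighbours[v][u] = max(w, neighbours[v].get(u, 0.0))
    degree = {n: len(nb) for n, nb in neighbours.items()}

    unassigned = set(neighbours)
    clusters = []
    while unassigned:  # S25: repeat while some node has no cluster
        # S21: highest-degree unassigned node seeds the next cluster.
        seed = max(sorted(unassigned), key=lambda n: degree[n])
        cluster = {seed}  # S22
        unassigned.discard(seed)
        while True:
            # S23: candidates are unassigned nodes linked to the cluster;
            # track the one reached through the heaviest edge.
            best = None
            for member in cluster:
                for cand, w in neighbours[member].items():
                    if cand in unassigned and (best is None or w > best[1]):
                        best = (cand, w)
            # S24: stop growing if there is no candidate or the best edge
            # does not exceed the first threshold.
            if best is None or best[1] <= first_threshold:
                break
            cluster.add(best[0])
            unassigned.discard(best[0])
        clusters.append(cluster)
    return clusters
```

Because each cluster absorbs only one node per pass (the one with the heaviest connecting edge), a weakly connected neighbor is left for a later cluster rather than being pulled in early.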
S3, selecting, for each network cluster, the node with the largest influence as a representative node;
The influence of a node in a cluster depends not only on the node itself but also on the influence of its neighbor nodes. In general, the greater the degree of a node, the greater its influence. As for the neighbor nodes, if they continue to be cited by new nodes, the content of the original node becomes known to more authors, which enlarges the influence of the original node. Therefore, the influence of node i in the invention is:
[formula image not reproduced]
wherein the quantities in the formula are the number of nodes in the network cluster that cite node i, the number of times the j-th node citing node i is itself cited within the network cluster, and the number of nodes citing that j-th node which also cite node i.
S4, selecting, based on author similarity and content similarity, a corresponding network cluster for the new paper as the candidate network cluster;
When citation recommendation is performed, the new paper has no citation relationship with the nodes in the paper citation network. The invention therefore computes, in turn, the similarity between the new paper and each node in the citation network; this similarity is determined by author similarity and content similarity. The calculation is the same as the inter-paper similarity calculation of step S1 and is not repeated here. Because the content similarity is computed from the paper abstract, an author who has not finished the whole paper can still receive citation recommendations by entering only the abstract, and can then consult and study the recommended papers.
The invention first performs citation recommendation based on the constructed network clusters. Therefore, based on the similarity between the new paper and each node in the citation network, the similarity between the new paper and network cluster SC_i is calculated as follows:
Figure DEST_PATH_IMAGE073
wherein M is the number of paper nodes in SC_i,
Figure 550147DEST_PATH_IMAGE074
is the similarity between the new paper and the j-th paper node,
Figure DEST_PATH_IMAGE075
The greater the similarity between the new paper and network cluster SC_i, the greater the probability that a citation relationship arises between the new paper and the nodes in SC_i. Therefore, the invention selects the network cluster with the greatest similarity to the new paper as the candidate network cluster.
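The cluster selection described above can be sketched as follows. This is a minimal illustration, not the patented formula: it assumes the cluster similarity is the mean of the per-node similarities over the M nodes of the cluster, which is what the definitions of M and the per-node similarity suggest. The names `cluster_similarity`, `select_candidate_cluster`, and the `sim` callback are hypothetical.

```python
def cluster_similarity(new_paper, cluster_nodes, sim):
    """Mean similarity between the new paper and the M nodes of one cluster.

    Assumption: the cluster similarity is the arithmetic mean of the
    per-node similarities; `sim(new_paper, node)` is a caller-supplied
    function combining author and content similarity.
    """
    nodes = list(cluster_nodes)
    if not nodes:
        return 0.0
    return sum(sim(new_paper, node) for node in nodes) / len(nodes)


def select_candidate_cluster(new_paper, clusters, sim):
    """Return the id of the cluster with the greatest similarity to the new paper.

    `clusters` maps a cluster id to the collection of its paper nodes.
    """
    return max(
        clusters,
        key=lambda cid: cluster_similarity(new_paper, clusters[cid], sim),
    )
```

For example, with per-node similarities 0.2 and 0.4 in one cluster and 0.9 and 0.7 in another, the second cluster would be selected as the candidate network cluster.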
S5, adding the node in the candidate network cluster with the highest similarity to the new paper to a candidate citation recommendation set as a first node, and calculating the link degrees between the first node and other nodes in the candidate citation recommendation set;
Different nodes in the same network cluster have different influence, and recommending every document in the candidate network cluster as a citation would make the recommendation inaccurate. The invention therefore further screens the paper nodes in the candidate network cluster to recommend citations more accurately.
Specifically, the higher a node's similarity to the new paper, the higher the probability that the node is selected as a citation. Therefore, after the similarity between the new paper and each node in the candidate network cluster has been calculated in step S4, the invention first adds the node in the candidate network cluster with the highest similarity to the new paper to the candidate citation recommendation set as the first node, which serves as the first citation recommendation node.
As described above, the new paper for which citations are to be recommended has no citation relationship with the nodes in the candidate network cluster, whereas papers highly correlated with the first node are likely to be recommended. The invention therefore takes the first node as a starting point, analyzes the link relationships between nodes, evaluates the correlation between nodes by their link degree, and calculates in turn the link degree between the starting point and its neighbor nodes in the candidate citation recommendation set. For neighbor node i, the link degree with the starting point is:
Figure 846261DEST_PATH_IMAGE076
wherein
Figure DEST_PATH_IMAGE077
is the degree of neighbor node i,
Figure 570634DEST_PATH_IMAGE078
is the degree of the neighbor nodes other than the starting point,
Figure 538459DEST_PATH_IMAGE029
is the average of the degrees of the nodes in the current network cluster,
Figure DEST_PATH_IMAGE079
is the covariance of
Figure 415411DEST_PATH_IMAGE080
; and
Figure DEST_PATH_IMAGE081
is the variance of
Figure 131694DEST_PATH_IMAGE077
S6, selecting nodes whose link degree with the first node is higher than a second threshold and adding them to the candidate citation recommendation set; taking each selected node as the first node and continuing to select nodes to add to the candidate citation recommendation set until no node meets the second threshold or all nodes in the network cluster have been added to the candidate citation recommendation set;
The higher the link degree, the greater the correlation between nodes and the greater the probability that they are cited together by the new paper. Based on the calculated link degrees, the invention therefore selects the nodes whose link degree with the first node is higher than the second threshold, adds them to the candidate citation recommendation set, and uses them together with the first node as citation recommendation nodes.
The invention keeps selecting nodes in this way: once a new node has been added to the citation recommendation set, the newly added node is taken as the first node and its neighbor nodes with high link degree are selected in turn, until no node meets the second threshold or all nodes in the network cluster have been added to the candidate citation recommendation set, which completes the construction of the candidate citation recommendation set.
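The iterative expansion of steps S5 and S6 can be sketched as a breadth-first traversal. This is only an illustrative sketch: the link degree formula is not reproduced here, so `link_degree` is a hypothetical caller-supplied function, and `cluster_neighbors` is an assumed adjacency mapping restricted to the candidate network cluster.

```python
from collections import deque


def build_candidate_set(cluster_neighbors, first_node, link_degree, second_threshold):
    """Grow the candidate citation recommendation set from the first node.

    cluster_neighbors: dict mapping each node in the candidate network
        cluster to an iterable of its neighbor nodes (assumed structure).
    link_degree: hypothetical function (start, neighbor) -> float.
    A neighbor joins the set when its link degree with the current
    starting point exceeds the second threshold; each newly added node
    then becomes a starting point in turn.
    """
    candidates = [first_node]
    selected = {first_node}
    queue = deque([first_node])
    while queue:
        start = queue.popleft()
        for neighbor in cluster_neighbors.get(start, ()):
            if neighbor in selected:
                continue
            if link_degree(start, neighbor) > second_threshold:
                selected.add(neighbor)
                candidates.append(neighbor)
                queue.append(neighbor)
    return candidates
```

A node rejected from one starting point is not marked as visited, so it can still be added later if its link degree with another starting point exceeds the threshold. The loop ends when no node meets the threshold or every node in the cluster has been added.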
S7, acquiring a paper published within a first time period as a first paper, and predicting the number of citations of the first paper in a future second time period;
In general, the citations of a paper change dynamically: the citation count is closely related to time, and papers published longer ago are generally cited more often than recently published papers. A paper published within the first time period may have few or no citations because it has been out for only a very short time, yet it may contain new research results. Therefore, in order to quickly identify important papers that were published only recently, the invention treats a paper published within the first time period as a first paper and predicts its number of citations in a future second time period, which specifically comprises:
S71, calculating the similarity between the first paper and the nodes in the citation network;
The invention predicts the number of citations of the first paper from nodes similar to the first paper. The similarity between papers is determined by author similarity and content similarity. The calculation is the same as the inter-paper similarity calculation of step S1 and is not repeated here. The invention computes, in turn, the similarity between the first paper and each paper node in the citation network.
S72, selecting the nodes whose similarity exceeds a third threshold and adding them to a similar node set;
The invention selects the nodes whose similarity exceeds the third threshold and adds them to the similar node set. The higher the similarity between papers, the more likely they are to follow the same citation pattern.
S73, fitting and predicting the number of citations of the first paper in the future second time period based on the number of citations that the nodes in the similar node set received within a third time period starting from their publication dates, wherein the third time period = the first time period + the second time period.
The invention predicts the number of citations in the future second time period on the premise that similar papers follow approximately the same citation pattern. The citation counts of the nodes in the similar node set in the third time period are the actual numbers of citations those nodes received within the third time period starting from their publication dates. The invention therefore fits and predicts the number of citations of the first paper from how the citation counts of the similar papers developed over the third time period. The specific prediction method is not limited here; any conventional data prediction method may be used.
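Since the prediction method is left open, the following is only one possible sketch, not the method of the invention. It assumes yearly citation counts are available for each similar node and predicts the second-period citations of the first paper as the similarity-weighted mean of the citations the similar nodes received during the corresponding years of their own histories. The function name and the data layout are hypothetical.

```python
def predict_second_period_citations(similar_histories, first_period, second_period):
    """Predict citations of a first paper in the future second time period.

    similar_histories: list of (similarity, yearly_counts) pairs, where
        yearly_counts[k] is the number of citations a similar node
        received in year k + 1 after its publication (assumed layout).
    first_period, second_period: lengths of the periods in years.
    Assumption: a similarity-weighted mean stands in for the unspecified
    fitting method.
    """
    third_period = first_period + second_period
    weighted_sum = 0.0
    weight_total = 0.0
    for similarity, yearly_counts in similar_histories:
        if len(yearly_counts) < third_period:
            continue  # history does not cover the whole third time period
        gained = sum(yearly_counts[first_period:third_period])
        weighted_sum += similarity * gained
        weight_total += similarity
    if weight_total == 0:
        return 0.0
    return weighted_sum / weight_total
```

With a first period of 1 year and a second period of 2 years, a similar node with yearly counts [1, 4, 6] contributes the 10 citations it received in its second and third years.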
S8, acquiring a paper published within a third time period as a second paper, wherein the third time period = the first time period + the second time period; calculating the citation growth degree of the first paper and the second paper;
The invention also selects second papers published within the third time period, where the third time period = the first time period + the second time period; for example, the third time period is 3 years, the first time period is 1 year, and the second time period is 2 years. In other words, the first paper was published more recently than the second paper, but the second paper was not published long ago either. The second papers published within the third time period exclude the first papers.
Generally, the more often a paper is cited, the more important it is. As technology advances, however, a paper that was heavily cited in earlier years may become less likely to be cited again. Conversely, a recently published paper whose total citation count is still modest but has been rising lately is currently active; such a paper is also important and is particularly likely to be cited in current work. The invention therefore calculates the citation growth degree of the first paper and the second paper, so as to select papers that were published recently and have attracted strong recent attention. For paper node i, the citation growth degree is:
Figure 389369DEST_PATH_IMAGE082
wherein
Figure DEST_PATH_IMAGE083
is the number of citations in the j-th period after publication, with j in units of years, where
Figure 600033DEST_PATH_IMAGE084
; for the first paper, T is the third time period, and for the second paper, T is the time between the paper's publication time and the recommendation time. Because the citation counts of the first paper include predicted values,
Figure 908654DEST_PATH_IMAGE083
may be either the actual or the predicted number of citations of the paper.
S9, adding the papers whose citation growth degree is greater than a fourth threshold to the candidate citation recommendation set to obtain a final citation recommendation set, and recommending the papers in the citation recommendation set to the user as citations for the new paper.
By adding the papers whose citation growth degree is greater than the fourth threshold to the candidate citation recommendation set, the invention obtains the final citation recommendation set. This avoids the problem that recently published papers cannot be effectively recommended, and achieves comprehensive and accurate citation recommendation. Once the final citation recommendation set is obtained, the papers in it are returned to the user as citations for the new paper.
Example two
As shown in fig. 2, the present embodiment provides a citation recommendation system based on link analysis, including:
The network construction module is used for constructing a directed weighted citation network based on citation relations, author similarities and content similarities among the papers;
The invention first constructs a citation network. Because paper citations involve a citing side and a cited side, the citation network of the invention is a directed graph. The citation network can be represented as a directed graph
Figure 326997DEST_PATH_IMAGE039
wherein
Figure DEST_PATH_IMAGE085
Figure 540810DEST_PATH_IMAGE041
are the paper nodes in the citation network, n is the number of papers,
Figure 601170DEST_PATH_IMAGE086
Figure DEST_PATH_IMAGE087
Figure 718292DEST_PATH_IMAGE088
are the connecting edges between papers in the citation network, m is the number of connecting edges, and
Figure 573116DEST_PATH_IMAGE088
are directed edges.
Figure DEST_PATH_IMAGE089
indicates that paper
Figure 274225DEST_PATH_IMAGE046
is cited by paper
Figure 872696DEST_PATH_IMAGE047
,
Figure 375484DEST_PATH_IMAGE090
Figure 932367DEST_PATH_IMAGE049
The citation network constructed by the invention is a directed weighted citation network, in which the citation weights between different papers differ. Specifically, the citation weight is related to author similarity and content similarity: the higher the author similarity and the content similarity, the higher the citation weight between the papers.
As for the authors, if two papers share an author, the two papers are likely to be related: they may be successive research results in the same field or results in related fields. The more authors two papers share, the closer the connection between them. In addition, if authors of the two papers have collaborated on other papers, those authors are associated in some way; they may be members of the same research team, with both papers belonging to that team's output. The more papers two authors have completed together, the more closely they are associated. Thus, the author similarity of two papers with a citation relationship is:
Figure 402663DEST_PATH_IMAGE050
wherein
Figure 54093DEST_PATH_IMAGE051
Figure 660655DEST_PATH_IMAGE052
are the weights of shared authors and of author collaboration, respectively,
Figure DEST_PATH_IMAGE091
Figure 88574DEST_PATH_IMAGE092
is the number of authors shared by the two papers,
Figure 780586DEST_PATH_IMAGE053
is the number of authors with a collaborative relationship across the two papers, and
Figure 970128DEST_PATH_IMAGE093
is the number of papers completed collaboratively by the pair of authors with a collaborative relationship indexed by
Figure 696775DEST_PATH_IMAGE094
As for content, papers in similar fields with similar content are more strongly related. Analyzing the full text of every paper would require a large amount of data processing and high computational complexity. Because a paper generally includes an abstract that concisely summarizes its content, the invention evaluates content similarity from the similarity of the abstracts. Specifically, the invention obtains the abstract of each paper and applies distributed representation learning to the words in the abstract through a Word2vec model, converting them into vectors that an algorithm can process. Word2vec is a tool open-sourced by Google for representing words as distributed word vectors; it is a neural-network-based model that maps low-level features to higher-level abstract features. The invention performs representation learning on papers with a citation relationship and obtains the following vector:
Figure 80614DEST_PATH_IMAGE095
wherein
Figure 259923DEST_PATH_IMAGE096
is the value of paper
Figure 738309DEST_PATH_IMAGE010
in dimension
Figure 568731DEST_PATH_IMAGE014
, and
Figure 107159DEST_PATH_IMAGE015
is the dimensionality of the paper vector.
Thus, for two papers with a citation relationship,
Figure 790076DEST_PATH_IMAGE010
Figure 72152DEST_PATH_IMAGE011
the content similarity is as follows:
Figure 757080DEST_PATH_IMAGE012
wherein
Figure 466411DEST_PATH_IMAGE097
is the value of paper
Figure 885891DEST_PATH_IMAGE010
in dimension
Figure 456811DEST_PATH_IMAGE014
, and
Figure 746978DEST_PATH_IMAGE013
is the value of paper o in the same dimension.
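As an illustration of comparing two paper vectors dimension by dimension, the sketch below uses cosine similarity. This is an assumption: the formula image is not reproduced here, and cosine similarity is simply a common choice for Word2vec-derived vectors. The function name `content_similarity` is hypothetical, and the abstract vectors are taken as already computed.

```python
import math


def content_similarity(vec_a, vec_b):
    """Cosine similarity between two paper vectors of equal dimensionality.

    Assumption: cosine similarity stands in for the content similarity
    formula, which is not reproduced in the text.
    """
    if len(vec_a) != len(vec_b):
        raise ValueError("paper vectors must have the same dimensionality")
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector carries no content information
    return dot / (norm_a * norm_b)
```

Vectors pointing in the same direction score close to 1, and orthogonal vectors score 0.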
Based on author similarity and content similarity, the similarity between two papers with a citation relationship is calculated and used as the weight of the connecting edge:
Figure 290525DEST_PATH_IMAGE016
wherein
Figure 416875DEST_PATH_IMAGE017
is the weight of the author similarity
Figure DEST_PATH_IMAGE098
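A minimal sketch of combining the two similarities into an edge weight follows. It assumes a convex combination in which the author similarity receives a weight between 0 and 1 and the content similarity receives the remainder; the formula image is not reproduced here, so this form, the function name `edge_weight`, and the default value are assumptions.

```python
def edge_weight(author_sim, content_sim, author_weight=0.5):
    """Weight of a connecting edge between two papers with a citation relationship.

    Assumption: the weight is author_weight * author_sim
    + (1 - author_weight) * content_sim, with author_weight in [0, 1].
    """
    if not 0.0 <= author_weight <= 1.0:
        raise ValueError("author_weight must lie in [0, 1]")
    return author_weight * author_sim + (1.0 - author_weight) * content_sim
```

With an author similarity of 0.8, a content similarity of 0.4, and an author weight of 0.25, the edge weight under this assumption is 0.5.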
The invention comprehensively evaluates the citation relationships among papers based on author similarity and content similarity, fully accounts for the differences between citations, and constructs a directed weighted graph of the paper citation relationships, so that citation recommendation based on the directed weighted citation network is more accurate.
The cluster dividing module is used for dividing the directed weighted citation network into a plurality of network clusters;
The paper citation network has a pronounced clustering property: papers in the same cluster are more likely to cite one another, while papers in different clusters are much less likely to do so. The invention therefore first divides the directed weighted citation network into a plurality of network clusters, and the cluster dividing module specifically includes:
the initialization module is used for selecting the paper node with the highest node degree in the directed weighted citation network as the initial node, and setting i = 1;
A network cluster comprises paper nodes and the corresponding connecting edges. The directed weighted citation network
Figure 899809DEST_PATH_IMAGE099
is divided, and the x resulting network clusters need to satisfy the following conditions:
Figure DEST_PATH_IMAGE100
wherein
Figure 529635DEST_PATH_IMAGE101
denotes the paper nodes comprised by the i-th network cluster,
Figure 830036DEST_PATH_IMAGE102
Figure 958529DEST_PATH_IMAGE103
Figure 120520DEST_PATH_IMAGE104
In a citation network, the node with the highest degree usually occupies an important position, and many nodes are likely to be closely related to it. The invention therefore starts constructing a network cluster from the paper node with the highest node degree in the citation network and takes that node as the initial node.
A first adding module, configured to add the initial node to the newly established cluster SC_i;
The initial node is the first paper node to join the cluster. For example, for a newly established cluster SC_1, the paper node with the highest node degree in the directed weighted citation network is first added to SC_1, and the paper nodes and connecting edges in the citation network are then clustered on the basis of this initial node.
A second adding module, configured to obtain the nodes that are connected to paper nodes in SC_i and do not belong to any established cluster, and to add them to a candidate cluster; if the candidate cluster is empty, the second judging module is called;
Papers belonging to the same cluster are more likely to cite one another. When constructing a network cluster, the invention therefore takes the nodes that are connected to paper nodes in SC_i and do not belong to any established cluster as candidate paper nodes, so that SC_i expands along the citation relationships among paper nodes.
When the candidate cluster is empty, none of the remaining paper nodes is connected to a paper node in SC_i, and the construction of the current cluster is finished.
A first judging module, configured to judge whether the maximum weight of the connecting edges between the candidate cluster and the paper nodes in cluster SC_i is greater than the first threshold; if so, the paper node and connecting edge corresponding to the maximum weight are added to cluster SC_i and the second adding module is called; if not, i = i + 1 and the second judging module is called;
Every node in the candidate cluster has a connecting edge to the existing paper nodes of cluster SC_i, either to one existing paper node or to several. During cluster construction, when the weights of all connecting edges between the nodes in the candidate cluster and the existing paper nodes of cluster SC_i are less than the first threshold, the remaining paper nodes have low similarity to, and are far from, the current cluster SC_i. No further node or connecting edge is then added to cluster SC_i, and construction proceeds to the next new cluster.
Conversely, if some paper node has a connecting edge to an existing paper node in cluster SC_i whose weight is greater than the first threshold, the remaining paper nodes include nodes that are highly similar and close to the current cluster SC_i. The paper node and connecting edge with the maximum weight are therefore added to cluster SC_i, and selection of the next node for the cluster continues.
A second judging module, configured to judge whether the directed weighted citation network contains a paper node that does not belong to any cluster; if so, the highest-degree paper node that does not belong to any cluster is selected as the initial node and the first adding module is called; if not, the clusters SC_1, SC_2, ..., SC_i, ..., SC_x are output, where x is the number of network clusters.
Through the above steps, the invention keeps constructing clusters SC_i until every paper node in the directed weighted citation network has been added to a cluster. Each new cluster SC_i is built by selecting the highest-degree paper node among the remaining paper nodes as the initial node and clustering paper nodes around it. Finally, all constructed clusters SC_1, SC_2, ..., SC_i, ..., SC_x are output, completing the division of the directed weighted citation network.
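The interplay of the initialization, adding, and judging modules can be sketched as a greedy cluster-growing loop. This is an illustrative sketch under stated assumptions: node degree is taken as the number of distinct neighbors regardless of edge direction, "connected" ignores edge direction, and ties are broken by node order. The function name and data layout are hypothetical.

```python
def partition_citation_network(nodes, edges, first_threshold):
    """Divide a weighted citation network into clusters SC_1 ... SC_x.

    nodes: iterable of sortable paper identifiers.
    edges: dict mapping (citing, cited) -> edge weight (assumed layout).
    Returns a list of clusters, each a list of nodes in joining order.
    """
    # Undirected view of the directed network, used for degree and adjacency.
    adjacency = {node: {} for node in nodes}
    for (citing, cited), weight in edges.items():
        adjacency.setdefault(citing, {})[cited] = weight
        adjacency.setdefault(cited, {})[citing] = weight

    unassigned = set(adjacency)
    clusters = []
    while unassigned:
        # Initial node: highest-degree paper node not yet in any cluster.
        seed = max(sorted(unassigned), key=lambda n: len(adjacency[n]))
        cluster = [seed]
        unassigned.remove(seed)
        while True:
            # Candidate cluster: unassigned nodes connected to the cluster;
            # keep only the candidate reached by the heaviest edge.
            best_weight, best_node = None, None
            for member in cluster:
                for neighbor, weight in adjacency[member].items():
                    if neighbor in unassigned and (
                        best_weight is None or weight > best_weight
                    ):
                        best_weight, best_node = weight, neighbor
            # Stop when no candidate exists or the first threshold is not met.
            if best_node is None or best_weight <= first_threshold:
                break
            cluster.append(best_node)
            unassigned.remove(best_node)
        clusters.append(cluster)
    return clusters
```

Each outer iteration builds one cluster, and each inner iteration adds the single candidate with the maximum edge weight, mirroring the repeated calls between the second adding module and the first judging module.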
A representative node selection module, configured to select, as a representative node, a node with the largest influence for each network cluster;
For a node in a cluster, its influence depends not only on the node itself but also on the influence of its neighbor nodes. In general, the greater the degree of a node, the greater its influence. As for the neighbor nodes, if they keep being cited by new nodes, the content of the original node becomes known to more and more authors, which enlarges the influence of the original node. Therefore, the influence of node i in the invention is:
Figure 136011DEST_PATH_IMAGE105
wherein
Figure 76154DEST_PATH_IMAGE070
is the number of nodes in the network cluster that cite node i,
Figure 691943DEST_PATH_IMAGE106
is the number of times the j-th node citing node i is itself cited within the network cluster, and
Figure 142779DEST_PATH_IMAGE072
Figure 527624DEST_PATH_IMAGE024
is the number of nodes that, among those citing the j-th node that cites node i, also cite node i.
The candidate network cluster selection module is used for selecting, based on author similarity and content similarity, a corresponding network cluster for the new paper as the candidate network cluster;
When citation recommendation is performed, the new paper has no citation relationship with the nodes in the paper citation network. The invention therefore computes, in turn, the similarity between the new paper and each node in the citation network; this similarity is determined by author similarity and content similarity. The calculation is the same as the inter-paper similarity calculation of step S1 and is not repeated here. Because the content similarity is computed from the paper abstract, an author who has not finished the whole paper can still receive citation recommendations by entering only the abstract, and can then consult and study the recommended papers.
The invention first performs citation recommendation based on the constructed network clusters. Therefore, based on the similarity between the new paper and each node in the citation network, the similarity between the new paper and network cluster SC_i is calculated as follows:
Figure 920559DEST_PATH_IMAGE073
wherein M is the number of paper nodes in SC_i,
Figure DEST_PATH_IMAGE107
is the similarity between the new paper and the j-th paper node,
Figure 476174DEST_PATH_IMAGE075
The greater the similarity between the new paper and network cluster SC_i, the greater the probability that a citation relationship arises between the new paper and the nodes in SC_i. Therefore, the invention selects the network cluster with the greatest similarity to the new paper as the candidate network cluster.
The link degree calculation module is used for adding the node in the candidate network cluster with the highest similarity to the new paper to a candidate citation recommendation set as a first node, and calculating the link degrees between the first node and other nodes in the candidate citation recommendation set;
Different nodes in the same network cluster have different influence, and recommending every document in the candidate network cluster as a citation would make the recommendation inaccurate. The invention therefore further screens the paper nodes in the candidate network cluster to recommend citations more accurately.
Specifically, the higher a node's similarity to the new paper, the higher the probability that the node is selected as a citation. Therefore, after the similarity between the new paper and each node in the candidate network cluster has been calculated in step S4, the invention first adds the node in the candidate network cluster with the highest similarity to the new paper to the candidate citation recommendation set as the first node, which serves as the first citation recommendation node.
As described above, the new paper for which citations are to be recommended has no citation relationship with the nodes in the candidate network cluster, whereas papers highly correlated with the first node are likely to be recommended. The invention therefore takes the first node as a starting point, analyzes the link relationships between nodes, evaluates the correlation between nodes by their link degree, and calculates in turn the link degree between the starting point and its neighbor nodes in the candidate citation recommendation set. For neighbor node i, the link degree with the starting point is:
Figure 996279DEST_PATH_IMAGE108
wherein
Figure 235631DEST_PATH_IMAGE077
is the degree of neighbor node i,
Figure DEST_PATH_IMAGE109
is the degree of the neighbor nodes other than the starting point,
Figure 517576DEST_PATH_IMAGE029
is the average of the degrees of the nodes in the current network cluster,
Figure 107958DEST_PATH_IMAGE110
is the covariance of
Figure 681021DEST_PATH_IMAGE080
; and
Figure 791191DEST_PATH_IMAGE081
is the variance of
Figure 57087DEST_PATH_IMAGE077
The candidate citation recommendation set building module is used for selecting nodes whose link degree with the first node is higher than a second threshold and adding them to the candidate citation recommendation set; taking each selected node as the first node and continuing to select nodes to add to the candidate citation recommendation set until no node meets the second threshold or all nodes in the network cluster have been added to the candidate citation recommendation set;
The higher the link degree, the greater the correlation between nodes and the greater the probability that they are cited together by the new paper. Based on the calculated link degrees, the invention therefore selects the nodes whose link degree with the first node is higher than the second threshold, adds them to the candidate citation recommendation set, and uses them together with the first node as citation recommendation nodes.
The invention keeps selecting nodes in this way: once a new node has been added to the citation recommendation set, the newly added node is taken as the first node and its neighbor nodes with high link degree are selected in turn, until no node meets the second threshold or all nodes in the network cluster have been added to the candidate citation recommendation set, which completes the construction of the candidate citation recommendation set.
The citation prediction module is used for acquiring a paper published in a first time period as a first paper and predicting the citation times of the first paper in a second time period in the future;
in general, citation of a paper is dynamically changed, the number of times the paper is cited is closely related to time, and papers with longer publication times are generally cited more frequently than papers with shorter publication times. For papers whose publication time is within the first time period, the times of their citations may be few or none, due to their very short publication time, but some new research success may be included in these papers. Therefore, in order to quickly acquire an important paper with a short publication time, the invention treats the paper published in the first time period as the first paper, and predicts the number of times of reference of the first paper in a second time period in the future, which specifically comprises:
a calculation module for calculating a similarity of the first paper to a node in the reference network;
the method predicts the number of times of reference of the first paper based on nodes similar to the first paper. The similarity between papers is determined by author similarity, content similarity. The specific calculation method is consistent with the similarity calculation between the papers of step S1, and is not repeated here. The invention calculates the first thesis and the nodes in the citation network in sequence to obtain the similarity with the nodes of the thesis.
The selecting module is used for selecting the nodes with the similarity exceeding a third threshold value to be added into the similar node set;
The invention selects the nodes whose similarity exceeds a third threshold and adds them to the similar node set. The higher the similarity, the more alike the papers are and the more likely they are to follow the same citation pattern.
And the prediction module is used for fitting and predicting the number of citations of the first paper in a second time period in the future, based on the number of citations received by the nodes in the similar node set during a third time period starting from their publication dates, wherein the third time period = the first time period + the second time period.
The invention predicts the number of citations in the future second time period; the core assumption of the prediction is that similar papers follow approximately the same citation pattern. The citation counts of the nodes in the similar node set in the third time period are the actual citation counts of those nodes during the third time period starting from their publication dates. The method therefore fits and predicts the number of citations of the first paper based on how the citation counts of the similar papers developed over the third time period. The specific prediction method is not limited herein, and any conventional data prediction method may be used.
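Since the text leaves the prediction method open, the following sketch uses one simple stand-in: the per-year average of the similar papers' citation counts. The function names, the yearly granularity and the averaging rule are all assumptions made for illustration; any conventional fitting method could take the place of `predict_future_citations`.

```python
from typing import Dict, List, Sequence


def select_similar(similarity: Dict[str, float], third_threshold: float) -> List[str]:
    """Return the nodes whose similarity to the first paper exceeds the threshold."""
    return [node for node, score in similarity.items() if score > third_threshold]


def predict_future_citations(
    similar_histories: Sequence[Sequence[float]],
    first_period: int,
    second_period: int,
) -> List[float]:
    """Predict yearly citations for the second time period.

    similar_histories[k][y] is the actual citation count of the k-th
    similar paper in year y after its publication (y starts at 0).
    The prediction for each future year is the mean over the similar
    papers, which is an assumed placeholder for a real fitting method.
    """
    third_period = first_period + second_period
    predictions = []
    for year in range(first_period, third_period):
        values = [h[year] for h in similar_histories if len(h) > year]
        predictions.append(sum(values) / len(values) if values else 0.0)
    return predictions


# Illustrative data: two similar papers observed over a 3-year third period.
chosen = select_similar({"n1": 0.9, "n2": 0.4, "n3": 0.7}, third_threshold=0.5)
predicted = predict_future_citations([[1, 3, 5], [3, 5, 7]], first_period=1, second_period=2)
```

Here the first period is 1 year and the second period is 2 years, so the sketch returns one predicted count for each of the two future years.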
The growth degree calculation module is used for acquiring a paper published in a third time period as a second paper, wherein the third time period = the first time period + the second time period; and calculating the growth degree of the citation counts of the first paper and the second paper;
The invention also selects a second paper published in a third time period, where the third time period = the first time period + the second time period; for example, the third time period is 3 years, the first time period is 1 year and the second time period is 2 years. That is, the first paper has been published for a shorter time than the second paper, but the second paper has not been published for very long either. The second papers published in the third time period exclude the first papers.
In general, the more times a paper is cited, the more important it is. However, as technology progresses, papers that were heavily cited in earlier years may become less likely to be cited again. Conversely, for a recently published paper, even if its total number of citations is not high, a rapid recent increase in citations indicates that the paper is currently active; such a paper is also important and is likely to be cited in current work. Therefore, the invention calculates the growth degree of the citation counts of the first paper and the second paper, so as to select papers that have a short publication time and strong recent attention. For paper node i, the growth degree of the citation count is:
Figure 869186DEST_PATH_IMAGE082
wherein
Figure 760787DEST_PATH_IMAGE083
is the number of citations in period j counted from the publication time, j being in units of years, and
Figure DEST_PATH_IMAGE111
For the first paper, T is the third time period; for the second paper, T is the time difference between the publication time of the paper and the recommendation time. Since the citation counts of the first paper include predicted values,
Figure 709151DEST_PATH_IMAGE083
may be either an actual citation count or a predicted citation count of the paper.
And the final quotation recommendation set generating module is used for adding the papers whose citation-count growth degree is larger than a fourth threshold into the candidate quotation recommendation set to obtain a final quotation recommendation set, and recommending the papers in the final quotation recommendation set to the user as the quotations of the newly-built paper.
According to the invention, papers whose citation-count growth degree is larger than the fourth threshold are added into the candidate quotation recommendation set to obtain the final quotation recommendation set. This avoids the problem that recently published papers cannot be effectively recommended, and achieves comprehensive and accurate quotation recommendation. After the final quotation recommendation set is obtained, the papers in it are returned to the user as quotations for the newly-built paper.
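The final merging step can be sketched as a simple set union. The growth degrees are assumed to be precomputed elsewhere, because the growth formula itself is not reproduced in this text; the function and variable names are hypothetical.

```python
from typing import Dict, Hashable, Iterable, Set


def final_recommendation_set(
    candidate_set: Iterable[Hashable],
    growth_degree: Dict[Hashable, float],
    fourth_threshold: float,
) -> Set[Hashable]:
    """Merge fast-growing recent papers into the candidate set.

    growth_degree maps each first or second paper to its citation-count
    growth degree (assumed to be computed beforehand).
    """
    fast_growing = {p for p, g in growth_degree.items() if g > fourth_threshold}
    return set(candidate_set) | fast_growing


# Illustrative values only.
final_set = final_recommendation_set({"p1"}, {"p2": 0.8, "p3": 0.1}, fourth_threshold=0.5)
```

Paper "p2" joins the final set because its growth degree exceeds the threshold, even though it was never reached through link analysis.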
In summary, the quotation recommendation method and system based on link analysis provided by the invention recommend quotations based on network clusters and make full use of the clustering characteristics among papers: papers belonging to the same cluster are more likely to cite one another, while papers belonging to different clusters are much less likely to do so, which reduces the cost of processing the whole citation network. The paper nodes within a cluster are screened, so that differences among papers in the same cluster are fully considered and the accuracy of cluster-based citation recommendation is improved. A first node is selected, and the corresponding recommended nodes are selected based on the link degrees between the first node and the other nodes; this solves the problems that recommended quotations have no citation edges to the citation network and that the citation relationships in the citation network cannot be fully utilized, and it improves the accuracy of citation recommendation. The citation counts of papers with a short publication time are predicted, which avoids missing valuable papers merely because they were published too recently and further improves the accuracy of citation recommendation. At the same time, papers are evaluated based on the growth degree of their citation counts, so that changes in the attention paid to a paper are captured and its importance is assessed more accurately. Finally, the citation relationships among papers are comprehensively evaluated based on author similarity and content similarity, the differences among citations are fully considered, and a directed graph of the citation relationships is constructed, so that citation recommendation based on the directed weighted citation network is more accurate.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (6)

1. A citation recommendation method based on link analysis, characterized by comprising the following steps:
S1, constructing a directed weighted citation network based on citation relations, author similarities and content similarities among the papers;
S2, dividing the directed weighted citation network into a plurality of network clusters;
S3, selecting, for each network cluster, the node with the largest influence as a representative node;
S4, selecting a corresponding network cluster for the newly-built paper based on author similarity and content similarity as a candidate network cluster;
S5, adding the node with the highest similarity to the newly-built paper in the candidate network cluster into a candidate quotation recommendation set as a first node, and calculating the link degrees between the first node and the other nodes in the candidate network cluster;
S6, selecting the nodes whose link degree with the first node is higher than a second threshold value to join the candidate quotation recommendation set; taking each selected node as a first node and continuing to select nodes to join the candidate quotation recommendation set until the requirement of the second threshold value cannot be met or all nodes in the network cluster have joined the candidate quotation recommendation set;
S7, acquiring a paper published in a first time period as a first paper, and predicting the number of citations of the first paper in a second time period in the future;
S8, acquiring a paper published in a third time period as a second paper, wherein the third time period = the first time period + the second time period; calculating the growth degree of the citation counts of the first paper and the second paper;
S9, adding the papers whose citation-count growth degree is larger than a fourth threshold value into the candidate quotation recommendation set to obtain a final quotation recommendation set, and recommending the papers in the final quotation recommendation set to a user as quotations of the newly-built paper;
the author similarity of papers with citation relationships is:
Figure DEST_PATH_IMAGE001
wherein
Figure 602978DEST_PATH_IMAGE002
Figure DEST_PATH_IMAGE003
are respectively the weights of shared authors and of author collaboration,
Figure 864326DEST_PATH_IMAGE004
Figure DEST_PATH_IMAGE005
is the number of authors that the papers have in common,
Figure 785009DEST_PATH_IMAGE006
is the number of author pairs with a collaborative relationship across the papers,
Figure DEST_PATH_IMAGE007
is the number of papers completed collaboratively by the
Figure 521496DEST_PATH_IMAGE008
-th pair of authors with a collaborative relationship;
the content similarity of papers
Figure DEST_PATH_IMAGE009
Figure 414497DEST_PATH_IMAGE010
with a citation relationship is:
Figure DEST_PATH_IMAGE011
wherein
Figure 377905DEST_PATH_IMAGE012
is the value of paper
Figure DEST_PATH_IMAGE013
in dimension
Figure 520304DEST_PATH_IMAGE014
,
Figure 250363DEST_PATH_IMAGE012
is the value of paper o in dimension
Figure 981559DEST_PATH_IMAGE014
, and
Figure DEST_PATH_IMAGE015
is the dimension of the paper vector;
the similarity of papers with citation relationships is:
Figure 50621DEST_PATH_IMAGE016
wherein
Figure DEST_PATH_IMAGE017
is the weight of the author similarity
Figure 211475DEST_PATH_IMAGE018
;
the weight of the edge in the directed weighted quotation network is the similarity of the papers connected with the edge;
the S2 specifically includes:
S21, selecting the paper node with the highest node degree in the directed weighted citation network as an initial node, and setting c = 1;
S22, adding the initial node into a newly established cluster SC_c;
S23, acquiring the paper nodes that are connected to the paper nodes in SC_c and do not belong to any established cluster, and adding them into a candidate cluster; if the candidate cluster is an empty cluster, executing step S25;
S24, judging whether the maximum weight of the connecting edges between the candidate cluster and the paper nodes in cluster SC_c is larger than a first threshold; if so, selecting the paper node and the connecting edge corresponding to the maximum weight to be added into cluster SC_c, and continuing to execute step S23; if not, setting c = c + 1 and executing step S25;
S25, judging whether there is a paper node not belonging to any cluster in the directed weighted citation network; if yes, selecting the paper node with the highest degree not belonging to any cluster as the initial node and executing step S22; if not, outputting the clusters SC_1, SC_2, ..., SC_c, ..., SC_x, wherein x is the number of network clusters;
the influence of node f is:
Figure DEST_PATH_IMAGE019
wherein
Figure 105744DEST_PATH_IMAGE020
is the number of nodes in the network cluster that cite node f,
Figure DEST_PATH_IMAGE021
is the number of times that the j-th node citing node f is itself cited in the network cluster,
Figure 501566DEST_PATH_IMAGE022
Figure DEST_PATH_IMAGE023
is the number of nodes that cite both node f and the j-th node citing node f;
taking the first node as a starting point, the link degree between the starting point and a neighbor node x in the candidate quotation recommendation set is:
Figure 337935DEST_PATH_IMAGE024
wherein
Figure 641877DEST_PATH_IMAGE025
is the degree of the neighbor node x,
Figure DEST_PATH_IMAGE026
is the degree of the neighbor node x excluding the starting point,
Figure 792367DEST_PATH_IMAGE027
is the average of the degrees of the nodes in the current network cluster,
Figure DEST_PATH_IMAGE028
is the covariance of
Figure 780046DEST_PATH_IMAGE025
, and
Figure 925332DEST_PATH_IMAGE029
is the variance of
Figure 982149DEST_PATH_IMAGE025
.
2. The citation recommendation method of claim 1, characterized in that said S4 includes: selecting the network cluster with the maximum similarity to the newly-built paper as the candidate network cluster; the similarity between the newly-built paper and a network cluster SC_c is:
Figure DEST_PATH_IMAGE030
wherein M is the number of paper nodes in SC_c,
Figure DEST_PATH_IMAGE032
is the similarity between the newly-built paper and the q-th paper node,
Figure 483800DEST_PATH_IMAGE033
3. The citation recommendation method according to claim 1, wherein said S7 specifically is:
S71, calculating the similarity between the first paper and the nodes in the citation network;
S72, selecting the nodes whose similarity exceeds a third threshold value to join a similar node set;
S73, fitting and predicting the number of citations of the first paper in a second time period in the future, based on the number of citations received by the nodes in the similar node set during a third time period starting from their publication dates, wherein the third time period = the first time period + the second time period.
4. The citation recommendation method according to claim 3, wherein the growth degree of the citation count of paper node r is:
Figure DEST_PATH_IMAGE034
wherein
Figure 122723DEST_PATH_IMAGE035
is the number of citations in period p counted from the publication time, p being in units of years, and
Figure DEST_PATH_IMAGE036
For the first paper, T is the third time period; for the second paper, T is the time difference between the publication time of the paper and the recommendation time.
5. The citation recommendation method of claim 4, wherein
Figure DEST_PATH_IMAGE037
includes actual citation counts of papers and predicted citation counts of papers.
6. A citation recommendation system based on link analysis, which is used for realizing the citation recommendation method of any one of claims 1-5, and is characterized by comprising:
the network construction module is used for constructing a directed weighted citation network based on citation relations, author similarities and content similarities among the papers;
the cluster dividing module is used for dividing the directed weighted reference network into a plurality of network clusters;
a representative node selection module, configured to select, as a representative node, a node with the largest influence for each network cluster;
the candidate network cluster selection module is used for selecting a corresponding network cluster for the newly-built thesis based on the author similarity and the content similarity to serve as a candidate network cluster;
the link degree calculation module is used for adding a node with the highest similarity to the newly-built paper in the candidate network cluster as a first node into the candidate quotation recommendation set and calculating the link degree between the first node and other nodes in the candidate quotation recommendation set;
the candidate quotation recommendation set building module is used for selecting the nodes with the linking degree higher than a second threshold value with the first node to join the candidate quotation recommendation set; taking the selected node as a first node, continuing to select the node to join the candidate quotation recommendation set until the requirement of a second threshold value cannot be met or all nodes in the network cluster are joined into the candidate quotation recommendation set;
the citation prediction module is used for acquiring a paper published in a first time period as a first paper and predicting the citation times of the first paper in a second time period in the future;
the growth degree calculation module is used for acquiring a paper published in a third time period as a second paper, wherein the third time period = the first time period + the second time period; and calculating the growth degree of the citation counts of the first paper and the second paper;
and the final quotation recommendation set generating module is used for adding the papers whose citation-count growth degree is larger than a fourth threshold value into the candidate quotation recommendation set to obtain a final quotation recommendation set, and recommending the papers in the final quotation recommendation set to the user as the quotations of the newly-built paper.
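For illustration only, the cluster division of steps S21 to S25 in claim 1 can be sketched as below. This is a hedged reading of the steps, not the patented implementation: edge direction is ignored when deciding whether two nodes are connected, node degree is taken as the number of distinct neighbors, and the names `divide_into_clusters`, `weights` and `first_threshold` are hypothetical. Every edge endpoint is assumed to appear in `nodes`.

```python
from typing import Dict, Hashable, List, Sequence, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def divide_into_clusters(
    nodes: Sequence[Hashable],
    weights: Dict[Edge, float],
    first_threshold: float,
) -> List[Set[Hashable]]:
    """Greedy cluster growth following steps S21 to S25 (assumed reading)."""
    # Undirected adjacency: neighbor -> largest edge weight in either direction.
    adjacency: Dict[Hashable, Dict[Hashable, float]] = {n: {} for n in nodes}
    for (u, v), w in weights.items():
        adjacency[u][v] = max(w, adjacency[u].get(v, 0.0))
        adjacency[v][u] = max(w, adjacency[v].get(u, 0.0))
    degree = {n: len(adjacency[n]) for n in nodes}

    unassigned = set(nodes)
    clusters: List[Set[Hashable]] = []
    while unassigned:  # S25: repeat while unclustered paper nodes remain
        # S21 / S25: the highest-degree unclustered node starts a new cluster.
        seed = max((n for n in nodes if n in unassigned), key=degree.__getitem__)
        cluster = {seed}  # S22
        unassigned.remove(seed)
        while True:
            # S23: find the unclustered neighbor joined by the heaviest edge.
            best_node, best_weight = None, float("-inf")
            for member in cluster:
                for nb, w in adjacency[member].items():
                    if nb in unassigned and w > best_weight:
                        best_node, best_weight = nb, w
            # S24: stop growing when no candidate exceeds the first threshold.
            if best_node is None or best_weight <= first_threshold:
                break
            cluster.add(best_node)
            unassigned.remove(best_node)
        clusters.append(cluster)
    return clusters


# Toy citation network with illustrative edge weights.
toy_nodes = ["a", "b", "c", "d", "e"]
toy_weights = {
    ("a", "b"): 0.9,
    ("a", "c"): 0.8,
    ("b", "c"): 0.6,
    ("c", "d"): 0.2,
    ("d", "e"): 0.7,
}
toy_clusters = divide_into_clusters(toy_nodes, toy_weights, first_threshold=0.5)
```

In the toy network the weak edge between "c" and "d" falls below the threshold, so the sketch yields two clusters, one grown from "c" and one grown from "d".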
CN202010556832.7A 2020-06-18 2020-06-18 Citation recommendation method and system based on link analysis Active CN111460324B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010556832.7A CN111460324B (en) 2020-06-18 2020-06-18 Citation recommendation method and system based on link analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010556832.7A CN111460324B (en) 2020-06-18 2020-06-18 Citation recommendation method and system based on link analysis

Publications (2)

Publication Number Publication Date
CN111460324A CN111460324A (en) 2020-07-28
CN111460324B true CN111460324B (en) 2020-11-06

Family

ID=71683968

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010556832.7A Active CN111460324B (en) 2020-06-18 2020-06-18 Citation recommendation method and system based on link analysis

Country Status (1)

Country Link
CN (1) CN111460324B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113064996B (en) * 2021-04-06 2022-08-30 合肥工业大学 Method for measuring influence of thesis in asymmetric information network

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102567476A (en) * 2011-12-15 2012-07-11 浙江大学 Screening and valuing method of technical similarity patent
CN103729432B (en) * 2013-12-27 2017-01-25 河海大学 Method for analyzing and sequencing academic influence of theme literature in citation database
CN104657488B (en) * 2015-03-05 2016-03-02 中南大学 A kind of author's influence power computing method based on quoting communication network
CN105653706B (en) * 2015-12-31 2018-04-06 北京理工大学 A kind of multilayer quotation based on literature content knowledge mapping recommends method
CN108132961B (en) * 2017-11-06 2020-06-30 浙江工业大学 Reference recommendation method based on citation prediction
CN110674318A (en) * 2019-08-14 2020-01-10 中国科学院计算机网络信息中心 Data recommendation method based on citation network community discovery

Also Published As

Publication number Publication date
CN111460324A (en) 2020-07-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant