CN110674318A

CN110674318A - Data recommendation method based on citation network community discovery

Info

Publication number: CN110674318A
Application number: CN201910748028.6A
Authority: CN
Inventors: 李成赞; 杜一
Original assignee: Computer Network Information Center of CAS
Current assignee: Computer Network Information Center of CAS
Priority date: 2019-08-14
Filing date: 2019-08-14
Publication date: 2020-01-10

Abstract

The invention provides a data recommendation method based on citation network community discovery, comprising the following steps: constructing a citation network based on the co-authorship relation between authors and the co-citation and coupling relations between papers; discovering, in the citation network, community networks with similar or related research content by using the modularity-based Louvain algorithm; establishing an association between a data set and the community networks based on the similarity between papers and the data set; and superimposing and de-duplicating the paper nodes in the community networks associated with the data set, and then performing data recommendation.

Description

Data recommendation method based on citation network community discovery

Technical Field

The invention relates to the technical fields of citation networks, community discovery, and similarity measurement, and provides a data recommendation method based on citation network community discovery.

Background

Scientific data is both the input and the output of scientific research activities and is a core driver of scientific and technological innovation. The International Data Corporation (IDC) report "Data Age 2025" indicates that the volume of global data is doubling roughly every two years and that global data storage will reach 47 ZB by 2020. Only 3% of the potentially valuable data in the world has been developed and utilized, and even less has been deeply analyzed and mined. Further statistical analysis of the Data Citation Index (DCI) shows that, as of 2018, only 11.83% of the data sets indexed in the DCI had been cited.

Multiple studies have shown that users still discover and retrieve data mainly by visiting repositories, institutional websites, or search engines, and that this remains the main channel for the open dissemination of shared data resources. In the big data era, with surging data volumes and information overload, passively waiting for users to retrieve and discover data limits the dissemination and reuse of data to a certain extent.

Academic papers have a history of more than 350 years and have formed a complex, ultra-large-scale citation network for knowledge flow and information dissemination. Implicit in the citation network are research groups made up of authors with similar or related research directions. Community discovery algorithms for complex networks can divide the citation network into such research groups.

The contradiction between the demand for open sharing of scientific data and the low dissemination efficiency and reuse rate of data publications is increasingly urgent. How to use the complex citation network formed by existing academic papers to actively and accurately recommend data resources to researchers and scholars, the main users of scientific data, and thereby accelerate the dissemination and reuse of those resources, therefore has important research value and significance.

Research on complex networks is well established. With the development of computer technology, and especially after Watts and Barabási proposed the small-world network model and the scale-free network model in 1998 and 1999, complex network research became a hot topic. A large number of scholars began to study the structure, characteristics, information propagation mechanisms, and dynamics of complex networks. As the theory matured, more and more scholars applied it to practical problems such as political elections, disease propagation prediction, population migration, carbon emissions, and economic models.

The citation network is a typical complex network, and many scholars have used it for centrality analysis, path analysis, cluster analysis, knowledge propagation analysis, and similar research. Community discovery based on the citation network also has a considerable history: Kessler proposed the concept of bibliographic coupling in 1963; Small proposed the concept of the co-citation network in 1973; and White first proposed the concept of author co-citation in 1981. Huang et al. studied research fronts using the co-citation and bibliographic coupling relationships of the citation network. In 2004, Newman used author information from papers in different disciplines to analyze the community structure of author collaboration and proposed a modularity-based hierarchical method for detecting community structure. In 2018, Hanqing et al. calculated document similarity based on the co-citation features of documents. In addition, many scholars in China and abroad have used the citation network to evaluate the influence of scholars, papers, and journals. For recommendation based on the citation network, West et al. applied hierarchical clustering to the paper citation network, exploiting the hierarchical structure of scientific knowledge, to recommend papers by establishing multidimensional relevance for different users. Haruna et al. recommended academic papers by studying similarity measures based on co-citation correlation matrices.

In general, considerable results have been achieved in the theory, models, algorithms, and applications of complex networks, and research on knowledge propagation, community discovery, and influence evaluation based on the citation network has likewise been fruitful. However, no research or practical work has so far been found that recommends data resources by using community discovery on a citation network.

Disclosure of Invention

Scientific data is both the input and the output of scientific research activities and is a core driver of scientific and technological innovation. The value of scientific data can be maximized only through open sharing and wide dissemination, yet the utilization rate and dissemination efficiency of current data publications are low overall. To accelerate the dissemination and reuse of scientific data and improve the effect of its open sharing, the invention provides a data recommendation method based on citation network community discovery. The research groups within each community network have similar or related research directions. If a data resource is found and verified to have research or reference value for one or more academic papers in a specific community network, the authors of the other papers in that community network can be assumed to be interested in it as well. The data resource is therefore recommended to that community network, so that the knowledge propagation mechanism of the citation network is fully used to accelerate the dissemination and reuse of the data resource.

In order to achieve the purpose, the invention adopts the following technical scheme:

A data recommendation method based on citation network community discovery comprises the following steps:

constructing a citation network based on the co-authorship relation between authors and the co-citation and coupling relations between papers;

discovering, in the citation network, community networks with similar or related research content by using the modularity-based Louvain algorithm;

establishing an association between a data set and the community networks based on the content similarity between papers and the data set;

superimposing and de-duplicating the paper nodes in the community networks associated with the data set, and then performing data recommendation.

A citation association network model can be constructed in advance; data on data sets, papers, and authors that satisfy the specified relations are input into the model, and the data recommendation results are then output.

The invention has the following beneficial effects:

On the basis of an association network constructed among data sets, papers, and authors, the method uses the Louvain algorithm to discover communities separately under the co-authorship, co-citation, and coupling association modes. It then calculates the similarity between a data set and academic papers by combining the TF-IDF algorithm with cosine similarity, and recommends the data set after establishing the association between the data set and the communities in which the papers are located. Experimental results show that the data recommendation method based on citation network community discovery can effectively discover papers or authors with potential interest in a data set. In terms of contribution to, and stability of, the recommendation effect, community discovery based on the coupling relationship performs best and the co-authorship relationship is second, while the effect of the co-citation relationship varies widely because it is affected by publication time and the number of citations.

Drawings

FIG. 1 is a diagram of data recommendation principles and steps based on citation network community discovery.

FIG. 2 is a model diagram of a citation association network.

FIG. 3 is a schematic diagram of building associations based on the co-authorship relationship.

FIG. 4 is a schematic diagram of building associations based on the co-citation relationship.

FIG. 5 is a schematic diagram of building an association based on coupling relationships.

FIG. 6 is a diagram of an example of the 3 kinds of community discovery effect and data set recommendation of the citation network.

Detailed Description

In order to make the aforementioned and other features and advantages of the invention more comprehensible, embodiments accompanied with figures are described in detail below.

The embodiment discloses a data recommendation method based on citation network community discovery, as shown in fig. 1, comprising the following steps:

(1) First, a citation association network model is constructed; the data set, author, and paper data that satisfy the specified relations are then input into the model through steps (2) to (4) below, and a recommendation result is output.

It should be noted that building a model facilitates data processing but is not required: the representation of data sets, papers, and author relationships, and the data recommendation itself, can still be achieved by the following steps without building the model. Building a model is only one embodiment.

(2) A citation network is constructed based on the co-authorship, co-citation, and coupling relations, and community networks with similar or related research content are identified by using the modularity-based Louvain algorithm.

(3) An association between the data set and the community networks is established based on the content similarity between papers and the data set.

(4) The paper nodes in the community networks associated with the data set are superimposed and de-duplicated, and data recommendation is then performed.
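The superposition and de-duplication in step (4) can be pictured as a set union over every community that contains at least one paper associated with the data set. The following is a minimal sketch under that reading; the function name `recommend_targets`, the paper identifiers, and the example communities are hypothetical and are not taken from the patent.

```python
def recommend_targets(associated_papers, communities):
    """Union the members of every community that contains an associated paper.

    communities: iterable of node sets, possibly coming from several
    networks (co-authorship, co-citation, coupling). Using a set makes
    the superposition de-duplicate nodes automatically.
    """
    associated = set(associated_papers)
    targets = set()
    for members in communities:
        members = set(members)
        if associated & members:      # community is linked to the data set
            targets |= members        # superimpose and de-duplicate
    return sorted(targets)

# Hypothetical communities; "P2" appears in two of them.
communities = [{"P1", "P2", "P3"}, {"P4", "P5"}, {"P2", "P6"}]
targets = recommend_targets(["P2"], communities)
print(targets)
```

Papers that belong to several associated communities appear only once in the result, which is the de-duplication the step calls for.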

1) Data preparation

As shown in Table 1, this embodiment uses the following test data, obtained from open Internet data resources and the Web of Science core database:

(1) 8 data sets published as data articles in the data journal Earth System Science Data (ESSD) and published in PANGAEA, Dryad, the National Oceanic and Atmospheric Administration (NOAA), and other repositories, used as the test data sets to be recommended;

(2) the 1001 academic papers that cite the 8 data sets, used to test and verify the effect of the recommendation algorithm;

(3) 5037 papers that cite papers in the ESSD journal, 53809 papers that cite those 5037 papers, and 337483 references, used for constructing the academic paper citation network and for testing data recommendation based on community discovery.

TABLE 1 test data set to be recommended

2) Citation association network model

An associated knowledge network is constructed for the data sets, papers, and authors and for the citation, publication, and collaboration relationships among them. Entities and their associations are expressed as a node set and its adjacency lists: each adjacency list stores all edges of one node, and a standardized graph is used to describe the entity nodes and their associated edges. The citation association network model design is shown in FIG. 2.
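One way to picture the node set and adjacency lists is the small sketch below. It is an illustration only: the class name `CitationGraph`, its methods, the node identifiers, and the relation labels are assumptions, not the model defined in FIG. 2, and edge direction is ignored for simplicity.

```python
from collections import defaultdict

class CitationGraph:
    """Node set plus adjacency lists; each list holds all edges of one node."""

    def __init__(self):
        self.nodes = {}                     # node id -> attribute dict
        self.adjacency = defaultdict(list)  # node id -> [(neighbor id, relation)]

    def add_node(self, node_id, kind, **attrs):
        self.nodes[node_id] = {"kind": kind, **attrs}

    def add_edge(self, source, target, relation):
        # Record the edge on both endpoints so that every adjacency list
        # contains all edges touching its node (direction is not kept).
        self.adjacency[source].append((target, relation))
        self.adjacency[target].append((source, relation))

# Hypothetical entities of the three kinds named in the text.
graph = CitationGraph()
graph.add_node("D1", "dataset", title="Example data set")
graph.add_node("P1", "paper", title="Example paper")
graph.add_node("A1", "author", name="Example author")
graph.add_edge("P1", "D1", "cites")
graph.add_edge("A1", "P1", "publishes")
print(graph.adjacency["P1"])
```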

Table 2 shows the formal expression of the entities in the citation association network model by taking the data set nodes as an example. Table 3 gives a formal representation of the association relationship between the data set and the citation network, i.e. the associated edges between the nodes.

TABLE 2 data set node entity attributes

TABLE 3 data set and citation network Association relationship

3) Associative network construction

(1) Co-authorship network

As shown in FIG. 3, the principle of constructing the association network based on the co-authorship relationship is as follows: if two authors have collaborated on a paper, the two authors are related to some degree. The more papers the two authors have written together, the more closely they are related.
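A minimal sketch of this principle counts, for every pair of authors, the number of papers they wrote together and uses that count as the edge weight. The function name `coauthor_edges` and the sample author lists are hypothetical.

```python
from collections import Counter
from itertools import combinations

def coauthor_edges(paper_authors):
    """Return {(author_a, author_b): number of jointly written papers}."""
    weights = Counter()
    for authors in paper_authors.values():
        # Sort so that each unordered pair always has the same key.
        for pair in combinations(sorted(set(authors)), 2):
            weights[pair] += 1
    return weights

# Hypothetical author lists keyed by paper id.
paper_authors = {
    "P1": ["A", "B"],
    "P2": ["A", "B", "C"],
    "P3": ["C"],
}
coauthor_weights = coauthor_edges(paper_authors)
print(dict(coauthor_weights))
```

Authors A and B share two papers here, so their edge carries more weight than the edges involving C.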

(2) Co-citation network

As shown in FIG. 4, the principle of constructing the association network based on the co-citation relationship is as follows: if two papers are cited together by the same paper, the two papers are related to some degree. The more often two papers are cited together, the higher their degree of similarity or association.
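The co-citation count can be sketched as follows: every pair of papers that appears together in one reference list gains one co-citation. The function name `cocitation_edges`, the optional `min_count` threshold parameter, and the sample data are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def cocitation_edges(reference_lists, min_count=1):
    """Return {(paper_a, paper_b): times cited together}, filtered by min_count."""
    counts = Counter()
    for cited in reference_lists.values():
        for pair in combinations(sorted(set(cited)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}

# Hypothetical citing papers C1..C3 and the papers they cite.
reference_lists = {
    "C1": ["P1", "P2", "P3"],
    "C2": ["P1", "P2"],
    "C3": ["P2", "P3"],
}
all_cocitations = cocitation_edges(reference_lists)
strong_cocitations = cocitation_edges(reference_lists, min_count=2)
print(strong_cocitations)
```

Raising `min_count` keeps only the more strongly co-cited pairs, which yields smaller and more tightly related communities.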

(3) Coupling network

As shown in FIG. 5, the principle of constructing the association network based on the coupling relationship is as follows: if two papers share a reference, the two papers are related to some degree. The more references two papers have in common, the greater their degree of similarity or association.
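Coupling strength can be sketched as the size of the intersection of two reference lists. The function name `coupling_edges`, the `min_shared` threshold parameter, and the sample reference lists are hypothetical; the pairwise loop is quadratic in the number of papers and is meant only to show the idea.

```python
from itertools import combinations

def coupling_edges(reference_lists, min_shared=1):
    """Return {(paper_a, paper_b): number of shared references}."""
    edges = {}
    for a, b in combinations(sorted(reference_lists), 2):
        shared = len(set(reference_lists[a]) & set(reference_lists[b]))
        if shared >= min_shared:
            edges[(a, b)] = shared
    return edges

# Hypothetical papers and their reference lists.
paper_references = {
    "P1": ["R1", "R2", "R3"],
    "P2": ["R2", "R3", "R4"],
    "P3": ["R5"],
}
coupled = coupling_edges(paper_references)
print(coupled)
```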

4) Community discovery for citation networks

In this method, community discovery on the citation network is implemented mainly by the modularity-based Louvain algorithm.

The modularity is calculated as follows:

$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j) \tag{1}$$

where $m$ is the total number of edges in the network; $A_{ij}$ is the weight of the edge between nodes $i$ and $j$, with $A_{ij} = 1$ if the network is unweighted; $k_i$ is the degree of node $i$; and $\delta(c_i, c_j)$ indicates whether the community $c_i$ of node $i$ and the community $c_j$ of node $j$ are the same community, taking the value 1 if they are and 0 otherwise.
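To make the modularity of a partition concrete, the sketch below evaluates the standard definition, $Q = \frac{1}{2m}\sum_{i,j}[A_{ij} - k_i k_j / 2m]\,\delta(c_i, c_j)$, on a toy graph of two triangles joined by one edge. The function name `modularity`, the edge-dictionary input format, and the toy graph are assumptions; the sketch assumes an undirected graph without self-loops.

```python
def modularity(edges, community):
    """edges: {(u, v): weight} for an undirected graph; community: node -> label."""
    degree = {}
    for (u, v), w in edges.items():
        degree[u] = degree.get(u, 0.0) + w
        degree[v] = degree.get(v, 0.0) + w
    m = sum(edges.values())

    # Sum of A_ij over ordered pairs in the same community:
    # every intra-community edge is counted twice, as (i, j) and (j, i).
    intra = sum(2.0 * w for (u, v), w in edges.items()
                if community[u] == community[v])

    # Sum of k_i * k_j / 2m over ordered pairs in the same community
    # equals (sum of degrees in the community)^2 / 2m.
    degree_sum = {}
    for node, k in degree.items():
        label = community[node]
        degree_sum[label] = degree_sum.get(label, 0.0) + k
    expected = sum(total * total for total in degree_sum.values()) / (2.0 * m)

    return (intra - expected) / (2.0 * m)

# Toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
toy_edges = {(0, 1): 1, (0, 2): 1, (1, 2): 1,
             (3, 4): 1, (3, 5): 1, (4, 5): 1, (2, 3): 1}
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "a" for node in range(6)}
q_two = modularity(toy_edges, two_groups)
q_one = modularity(toy_edges, one_group)
print(q_two, q_one)
```

Splitting the graph into its two triangles gives a clearly positive modularity, while placing every node in one community gives zero, which is why the algorithm prefers the split.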

When dividing communities with the Louvain algorithm, each node $i$ is tentatively assigned, in turn, to the community of each of its neighbor nodes, and the modularity increment $\Delta Q$ before and after the assignment is calculated. The simplified calculation formula is as follows:

where $k_{i,in}$ is the sum of the weights of the edges between node $i$ and the nodes in community $c$, and $\Sigma_{tot}$ is the sum of the weights of the edges incident to the nodes in community $c$.
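As a hedged sketch of how such an increment can be obtained, the gain of placing an isolated node $i$ into community $c$ can be derived directly from the standard modularity definition. The derivation below is an assumption about the intended form, written only in terms of the symbols defined here, with $k_{i,in} = \sum_{j \in c} A_{ij}$ and $\Sigma_{tot} = \sum_{j \in c} k_j$; it is not quoted from the original formula.

```latex
\begin{aligned}
Q &= \frac{1}{2m}\sum_{i,j}\left[A_{ij}-\frac{k_i k_j}{2m}\right]\delta(c_i,c_j) \\
\Delta Q &= \frac{1}{2m}\cdot 2\sum_{j\in c}\left[A_{ij}-\frac{k_i k_j}{2m}\right]
          = \frac{1}{m}\left(k_{i,in}-\frac{\Sigma_{tot}\,k_i}{2m}\right)
\end{aligned}
```

The factor 2 appears because the sum in $Q$ runs over ordered pairs, so the new pairs $(i, j)$ and $(j, i)$ both contribute. Since $1/m$ is the same positive constant for every candidate community, comparing $k_{i,in} - \Sigma_{tot}\,k_i / 2m$ across the neighbor communities is enough to pick the best move.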

5) Data set-community network association construction and recommendation

Constructing the association between the data set and the community network is a crucial part of the data recommendation algorithm once community discovery on the citation network is complete. Whether the association can guide the data set to the community networks that are truly interested in it determines the final effect of data recommendation. The association can be constructed by citation, by similarity measurement, and by other means. Because citation relationships are subject to time lag and uncertainty, the association is constructed mainly by similarity measurement in the initial period after a data set is released; once the data set has been published for some time and citing papers have appeared, citation relationships can also be used to construct the association.

In this embodiment, the association between the data set and the community network is constructed mainly by similarity measurement, as follows: first, the title and abstract of each data set and paper are vectorized and their features extracted based on the vector space model; during feature extraction, term weights are calculated with the TF-IDF algorithm; finally, the similarity between the data set and each paper in the citation network is calculated using cosine similarity.

The Vector Space Model (VSM) is a common model in natural language processing, proposed by Gerard Salton et al. in 1969. The VSM maps text content to a feature vector $V(d) = (t_1, w_1(d); \ldots; t_n, w_n(d))$, where $t_i$ ($i = 1, 2, \ldots, n$) are the terms and $w_i(d)$ is the weight of $t_i$ in document $d$.

TF-IDF (Term Frequency-Inverse Document Frequency) is a commonly used weighting technique in information retrieval and data mining. The importance of a word increases in proportion to the number of times it appears in a single text, but decreases in inverse proportion to the frequency with which it appears in the entire corpus. The calculation formula of TF-IDF is:

$$\mathrm{tfidf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}} \times \log\frac{|D|}{|\{j : t_i \in d_j\}|} \tag{3}$$

where $n_{i,j}$ is the number of occurrences of word $t_i$ in document $d_j$; $\sum_k n_{k,j}$ is the total number of occurrences of all words in document $d_j$; $|D|$ is the total number of documents in the corpus; and $|\{j : t_i \in d_j\}|$ is the number of documents containing the word $t_i$. To avoid a divisor of zero, $1 + |\{j : t_i \in d_j\}|$ is typically used instead.
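The weighting described above (term frequency within a document multiplied by the logarithm of the inverse document frequency) can be sketched in a few lines. The function name `tf_idf`, the `smooth` flag for the 1 + document-count variant, the natural logarithm, and the sample token lists are assumptions made for illustration.

```python
import math
from collections import Counter

def tf_idf(documents, smooth=False):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document

    vectors = []
    for tokens in documents:
        counts = Counter(tokens)
        total = sum(counts.values())
        vector = {}
        for term, n in counts.items():
            # smooth=True uses 1 + document count in the denominator.
            denominator = doc_freq[term] + (1 if smooth else 0)
            vector[term] = (n / total) * math.log(n_docs / denominator)
        vectors.append(vector)
    return vectors

# Hypothetical tokenized titles.
tfidf_vectors = tf_idf([["sea", "ice", "ice"], ["sea", "level"]])
print(tfidf_vectors[0])
```

A term that occurs in every document, such as "sea" here, receives weight zero without smoothing, while a term concentrated in one document is weighted up.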

During feature extraction, because the selected test data sets and papers are in English, the text is segmented into words by whitespace. Common stop words such as "a", "the", and "of" must be removed, and English punctuation marks and digits are stripped with regular expressions.
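A minimal sketch of this preprocessing is shown below. The stop-word set is a short illustrative list rather than the one used in the embodiment, and the function name `tokenize` and the sample title are hypothetical.

```python
import re

# Illustrative stop words only; a real list would be considerably longer.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "to", "for"}

def tokenize(text):
    """Lowercase, strip punctuation and digits, split on whitespace, drop stop words."""
    cleaned = re.sub(r"[^A-Za-z\s]", " ", text.lower())
    return [token for token in cleaned.split() if token not in STOP_WORDS]

tokens = tokenize("The mass-balance of Arctic sea ice, 1979-2018.")
print(tokens)
```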

Furthermore, the similarity between data set $d_i$ and paper $d_j$ is measured by cosine similarity, calculated as follows:

$$\mathrm{sim}(d_i, d_j) = \frac{\sum_k w_k(d_i)\, w_k(d_j)}{\sqrt{\sum_k w_k(d_i)^2}\ \sqrt{\sum_k w_k(d_j)^2}} \tag{4}$$

where $w_k(d_i)$ is the weight of word $k$ in the description information of data set $d_i$, calculated by the TF-IDF formula (3), and $w_k(d_j)$ is the corresponding weight of word $k$ in paper $d_j$.
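Cosine similarity between two sparse weight vectors can be sketched as follows, with each vector stored as a dictionary from term to weight. The function name `cosine_similarity`, the zero-vector convention, and the sample vectors are assumptions.

```python
import math

def cosine_similarity(u, v):
    """u, v: {term: weight} dicts. Returns 0.0 if either vector is all zeros."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical weight vectors for a data set and a paper.
dataset_vector = {"ice": 1.0, "sea": 1.0}
paper_vector = {"ice": 1.0}
similarity = cosine_similarity(dataset_vector, paper_vector)
print(similarity)
```

Terms missing from one vector are treated as having weight zero, so only shared terms contribute to the numerator.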

6) Results of the experiment

In this embodiment, a citation association network model is constructed from the experimental data, and community discovery is then carried out with the modularity-based Louvain algorithm under the co-authorship, co-citation, and coupling association modes. To increase the relatedness of papers within a community and reduce community size, a co-citation association between two papers is constructed only when they are co-cited at least 4 times, and a coupling relation is constructed only when they share at least 5 references. The final results of community discovery based on the 3 relationships are shown in FIG. 6. FIG. 6 also illustrates an example of building associations between data sets and community networks through similarity measures or citation relationships.

TABLE 7 data recommendation effect based on citation network community discovery

The effect of recommending the experimental data by using citation network community discovery is shown in Table 7. Note that, when the association between a data set and the citation community network is constructed by similarity measurement on titles and abstracts, a paper is selected as an associated paper only if its similarity is greater than 0.50; if more than 5 papers have similarity greater than 0.50, the 5 papers with the highest similarity are selected. As Table 7 shows, under the similarity-based association mode, the recommendation effect for data set 4 is poor, but for each of the other 7 data sets the recommended papers cover more than 60% of the actual citing papers, with an average coverage of 80.02%. This indicates that constructing the association between the data set and the citation community network through similarity can effectively and correctly guide the data set to community networks that are likely to be interested in it. For data set 4, this embodiment additionally uses the first citing paper of the data set to construct the association between the data set and the citation community network. Under this association mode, the coverage of the actual citing papers of data set 4 reaches 80.38%, which shows that constructing the association based on the citation relation is also effective to a certain extent.
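The selection rule and a coverage measure can be sketched as follows. The selection function follows the rule stated above (similarity greater than 0.50, at most the 5 most similar papers). The coverage function assumes that coverage means the fraction of actual citing papers that appear among the recommended papers, which is one plausible reading of the text; the function names and sample values are hypothetical.

```python
def select_associated(similarities, threshold=0.50, top_k=5):
    """Keep papers with similarity > threshold; if more than top_k, keep the top_k."""
    above = [(paper, score) for paper, score in similarities.items()
             if score > threshold]
    above.sort(key=lambda item: item[1], reverse=True)
    return [paper for paper, _ in above[:top_k]]

def coverage(recommended, actual_citing):
    """Fraction of actual citing papers found among the recommended papers."""
    actual = set(actual_citing)
    if not actual:
        return 0.0
    return len(actual & set(recommended)) / len(actual)

# Hypothetical similarity scores between one data set and seven papers.
scores = {"P1": 0.90, "P2": 0.51, "P3": 0.50, "P4": 0.70,
          "P5": 0.60, "P6": 0.55, "P7": 0.80}
selected = select_associated(scores)
rate = coverage(["P1", "P2", "P3"], ["P1", "P3", "P8", "P9"])
print(selected, rate)
```

A score of exactly 0.50 does not pass the strict threshold, so "P3" is excluded from the candidates in this example.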

In addition, considering how the community networks obtained from the co-authorship, co-citation, and coupling association networks each affect the final recommendation effect, the community network built on the coupling relationship contributes the most and is the most stable, followed by the one built on the co-authorship relationship. The community network built on the co-citation relationship shows larger variation in effect, because it is affected by the publication time of the data set and the number of times the data set has actually been cited.

The above embodiments are intended only to illustrate the technical solution of the invention, not to limit it. A person skilled in the art may modify the technical solution or substitute equivalents without departing from the spirit and scope of the invention, and the scope of protection shall be determined by the claims.

Claims

1. A data recommendation method based on citation network community discovery is characterized by comprising the following steps:

dividing the citation network into a plurality of community networks;

establishing association between the data set and the community network based on the similarity between the paper and the data set;

superimposing and de-duplicating the paper nodes in the community networks associated with the data set, and then performing data recommendation.

2. The method according to claim 1, wherein the citation network is constructed as follows: authors and papers are taken as nodes, the co-authorship relation between authors and the co-citation and coupling relations between papers are taken as edges, and a standardized graph is used to describe the nodes and edges, thereby constructing the citation network.

3. The method of claim 1, wherein the citation network is divided into a plurality of community networks by using a modularity-based Louvain algorithm, the modularity being given by the formula:

$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$$

wherein $m$ is the total number of edges in the citation network; $A_{ij}$ is the weight of the edge between nodes $i$ and $j$, with $A_{ij} = 1$ if the network is unweighted; $k_i$ is the degree of node $i$; and $\delta(c_i, c_j)$ indicates whether community $c_i$ and community $c_j$ are the same community, taking the value 1 if they are and 0 otherwise.

4. The method of claim 3, wherein, when communities are divided by using the modularity-based Louvain algorithm, each node $i$ is tentatively assigned in turn to the community of each of its neighbor nodes, and the modularity increment $\Delta Q$ before and after the assignment is calculated by the following formula:

5. The method of claim 1, wherein the similarity between the paper and the data set is calculated by:

vectorizing and extracting characteristics of the titles and abstract information of the data sets and the papers on the basis of a vector space model;

in the characteristic extraction process, performing word vector weight calculation by using a TF-IDF algorithm;

and calculating the similarity between the data set and the paper in the citation network by utilizing the cosine similarity.

6. The method of claim 5, wherein, during feature extraction from English-language papers, articles and prepositions are removed as stop words, and punctuation marks and digits are removed with regular expressions.

7. The method of claim 5, wherein the TF-IDF algorithm has the formula:

$$\mathrm{tfidf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}} \times \log\frac{|D|}{|\{j : t_i \in d_j\}|}$$

wherein $n_{i,j}$ is the number of occurrences of word $t_i$ in document $d_j$; $\sum_k n_{k,j}$ is the total number of occurrences of all words in document $d_j$; $|D|$ is the total number of documents in the corpus; and $|\{j : t_i \in d_j\}|$ is the number of documents containing the word $t_i$.

8. The method of claim 5, wherein the cosine similarity is used to calculate the similarity between the data set and the paper in the citation network by the formula:

$$\mathrm{sim}(d_i, d_j) = \frac{\sum_k w_k(d_i)\, w_k(d_j)}{\sqrt{\sum_k w_k(d_i)^2}\ \sqrt{\sum_k w_k(d_j)^2}}$$

wherein $d_i$ represents the data set, $d_j$ represents the paper, and $w_k(d_i)$ is the weight of word $k$ in the description information of data set $d_i$.

9. The method of claim 8, wherein the weight $w_k(d_i)$ is calculated by the TF-IDF algorithm.

10. The method of claim 1, wherein a citation association network model is constructed in advance; the data sets, papers, and authors are stored through formal expressions of entities; the citation, publication, and collaboration relations among the data sets, papers, and authors are stored in adjacency lists, each of which stores all edges of one node; and a standardized graph is used to describe the nodes and their associated edges.