CN105808729A

CN105808729A - Academic big data analysis method based on citation relationships between papers

Info

Publication number: CN105808729A
Application number: CN201610131343.0A
Authority: CN
Inventors: 谈兆炜; 刘长风; 周劲光; 杜佳俊; 骆铮; 毛宇宁; 沈嘉明; 王彪; 傅洛伊; 王新兵
Original assignee: Shanghai Jiaotong University
Current assignee: Shanghai Jiaotong University
Priority date: 2016-03-08
Filing date: 2016-03-08
Publication date: 2016-07-27
Anticipated expiration: 2036-03-08
Also published as: CN105808729B

Abstract

The invention provides an academic big data analysis method based on citation relationships between papers. The method comprises the following steps: step 1, analyzing and processing a local paper data set and building a paper citation network in a database; step 2, constructing an analysis algorithm from the citation relationships in the paper citation network, using it to obtain the importance of, and the relationships between, the nodes of the network, and obtaining the importance of each paper relative to a center paper; step 3, converting the one-to-one citation relationships between papers into a mapping set in the citing direction and a mapping set in the cited direction, obtaining the development paths between specified papers in the citation network by means of an extraction algorithm, and calculating the importance of each path from the paper importance obtained in step 2. The method makes it convenient to analyze the citation relationships of the papers in the database, yields the development paths between papers, and improves the precision of paper search.

Description

Academic big data analysis method based on citation relationships between papers

Technical field

The present invention relates to the technical field of big data processing and network analysis, and in particular to an academic big data analysis method based on citation relationships between papers.

Background art

With the rapid development of information technology, people have more and more ways of obtaining data, and the total volume of data is growing explosively. Research by International Data Corporation (IDC) shows that the volume of data generated worldwide was 0.49 ZB in 2008 and 0.8 ZB in 2009, rose to 1.2 ZB in 2010, and reached 1.82 ZB in 2011, equivalent to more than 200 GB for every person in the world. By 2012, the data volume of all printed material produced by humankind was 200 PB. Research by IBM claims that 90% of all the data obtained by human civilization was produced in the past two years. By 2020, the scale of data generated worldwide is expected to reach 44 times that of today. With the emergence of massive data, tools for processing it have emerged as well; Hadoop and Spark, for example, are software frameworks capable of distributed processing of massive data.

In scientific research, as countries invest more and more in research activity, the number of papers, the direct output of research, is also growing every year at an astonishing rate. Taking China as an example, statistics on Chinese scientific and technical papers announced by the Institute of Scientific and Technical Information of China on 26 September 2014 show that, from 2004 to September 2014, China published a total of 1.3698 million international papers, ranking 2nd in the world; these papers were cited 10.3701 million times in total, ranking 4th in the world. Beyond their text, papers are linked by complex citation relationships that form a very large network, which gives rise to "academic big data". To help researchers quickly obtain the papers they want, some commercial companies, such as Microsoft and Google, have developed their own academic search systems.

In these academic search systems, many network algorithms are applied to paper ranking, the best known being the PageRank algorithm and the HITS (Hypertext-Induced Topic Search) algorithm. However, most of these algorithms consider only the global importance of a paper when ranking and ignore its relative importance with respect to one specific paper. They are therefore poorly suited to finding the papers that have had the greatest influence on a given paper. In addition, most existing algorithms provide nothing beyond a ranking and give no further information about the line of development between two papers. As a result, researchers cannot efficiently obtain research resources or fully grasp the development trends of a research field.

Against this background, the present invention develops, on the basis of the paper citation network, a system that can both analyze the importance of a paper relative to another specified paper and show the line of development between two papers.

Summary of the invention

In view of the defects in the prior art, the object of the present invention is to provide an academic big data analysis method based on citation relationships between papers.

The academic big data analysis method based on citation relationships between papers provided by the present invention comprises the following steps:

Step 1: analyze and process a local paper data set accordingly, then build a paper citation network in a database;

Step 2: construct an analysis algorithm from the citation relationships in the paper citation network, use it to obtain the importance of, and the relationships between, the nodes of the paper citation network, and obtain the importance of each paper relative to the center paper;

The center paper is the paper that the user inputs as a query. (It is assumed here that the user is interested in this paper and its related papers and wants to know the degree of contribution and the relative importance of other papers with respect to this paper, rather than their global importance.)

Step 3: convert the one-to-one citation relationships between papers into a mapping set in the citing direction and a mapping set in the cited direction, obtain the development paths between specified papers in the paper citation network, and calculate the importance of each path from the paper importance obtained in step 2.

Preferably, the method further comprises step a: performing distributed processing of the analysis algorithm of step 2, and calculating, by means of multiple nodes of the paper citation network, the importance of every paper in the network relative to the center paper.

Preferably, the method further comprises step b: performing distributed processing of the extraction algorithm of step 3, and calculating, by means of multiple nodes of the paper citation network, the importance of the development paths between specified papers in the network.

Preferably, step 1 comprises:

Step 1.1: using text processing and analysis techniques, extract the citation information from the local paper data, the citation information being, for any paper in the paper set, the information about which papers it cites;

Step 1.2: build the paper citation network;

Step 1.3: compare the citation information in the resulting paper citation network and remove duplicate content, store and index it with database software, and store the citation relationships between papers in the database in the form of key-value pairs.

Preferably, step 2 comprises:

Step 2.1: according to the citation relationships in the paper citation network, calculate the score based on network structure analysis;

Step 2.2: traverse the subgraph that the user is concerned with using breadth-first search, and calculate, for each paper, the ratio of its longest path to its shortest path relative to the center paper as the score based on citation level analysis. The formula is as follows:

Score based on citation level analysis = longest path / shortest path;

Step 2.3: select the corresponding weights and combine the score based on network structure analysis with the score based on citation level analysis as the final importance of each other paper relative to the center paper. The formula is as follows:

Importance = weight of network structure analysis × score based on network structure analysis + weight of citation level analysis × score based on citation level analysis.

Preferably, step 3 comprises:

Step 3.1: convert the one-to-one citation relationships between papers in the database into a mapping set in the citing direction and a mapping set in the cited direction;

Step 3.2: preliminarily analyze the citation relationships between papers, using the dictionary data structure of the Python programming language;

Step 3.3: extract the path information between two papers.

Preferably, step 3.1 comprises:

Initializing two dictionaries, ref_dic and refed_dic, where ref_dic represents the mapping from a central paper node to its multiple citers and refed_dic represents the mapping from a central paper node to the multiple papers it cites;

Traversing every row of data in the database and looking up the left-hand data of the row among the keys of ref_dic: if the left-hand data is already a key of ref_dic, the right-hand data of the row is appended to the end of that key's entry; if it is not, the left-hand data is saved as a new key with the right-hand data as its entry. In this way the one-to-one citation relationships in the database are converted into the mapping set in the citing direction;

For refed_dic, the right-hand data serves as the key and the left-hand data as the entry. The right-hand data of the row is looked up among the keys of refed_dic: if it is already a key, the left-hand data of the row is appended to the end of that key's entry; if it is not, the right-hand data is saved as a new key with the left-hand data as its entry. In this way the one-to-one citation relationships in the database are converted into the mapping set in the cited direction.

Compared with the prior art, the present invention has the following beneficial effects:

1. The academic big data analysis method based on citation relationships between papers described in the present invention shows more effectively the contribution and relative importance of other papers with respect to the paper the user queries, so that the user can more easily find other related papers starting from one paper of interest.

2. The present invention further describes a distributed processing method for this analysis, which helps to increase the speed of academic big data analysis; when the invention is used to build a paper query system, this distributed processing method can effectively reduce the response time of user queries and improve the user experience.

Brief description of the drawings

Other features, objects and advantages of the present invention will become more apparent on reading the detailed description of non-limiting embodiments given with reference to the following drawings:

Fig. 1 is a flow chart of the distributed path-marking algorithm;

Fig. 2 is a flow chart of path exploration between two papers;

Fig. 3 is a flow chart of the path-marking algorithm;

Fig. 4 is a diagram of the demonstration system for the path-marking algorithm.

Detailed description of the embodiments

The present invention is described in detail below with reference to specific embodiments. The following embodiments will help those skilled in the art to further understand the invention but do not limit it in any form. It should be noted that those skilled in the art can make a number of variations and improvements without departing from the inventive concept. These all fall within the scope of protection of the present invention.

Step a: perform distributed processing of the analysis algorithm of step 2, and calculate, by means of multiple nodes of the paper citation network, the importance of every paper in the network relative to the center paper.

Step 3: convert the one-to-one citation relationships between papers into a mapping set in the citing direction and a mapping set in the cited direction, and obtain the development paths between specified papers in the paper citation network by means of an extraction algorithm;

Step b: perform distributed processing of the extraction algorithm of step 3, and calculate, by means of multiple nodes of the paper citation network, the importance of the development paths between specified papers in the network.

Specifically, step 1 comprises:

Step 1.2: build the paper citation network. The construction algorithm is as follows:

Construct an empty vertex set V and an empty edge set E. For each paper u in the database, first add paper u to the vertex set V as a paper node; then, for each paper v cited by paper u, construct a directed edge e(u, v), add e(u, v) to the edge set E, and set its weight to 1. After this operation has been carried out once for all papers in the database, the vertex set and the edge set together constitute the paper citation network G(V, E).
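The construction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input format (a dictionary from paper id to the list of ids it cites) and the function name `build_citation_network` are hypothetical and do not come from the patent.

```python
def build_citation_network(papers):
    """Build G(V, E) from a mapping: paper id -> list of cited paper ids.

    The input format is an assumption made for this sketch.
    """
    V = set()   # vertex set: one node per paper in the database
    E = {}      # edge set: (u, v) -> weight, where u cites v
    for u, cited in papers.items():
        V.add(u)
        for v in cited:
            E[(u, v)] = 1   # every citation edge gets weight 1
    return V, E


V, E = build_citation_network({"p1": ["p2"], "p2": []})
```

Storing the edges in a dictionary keyed by `(u, v)` also means that a citation listed twice is kept only once.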

Step 1.3: use a computer program to compare entries and remove duplicates, store and index the resulting paper citation network data uniformly with database software, and store the citation relationships between papers in the database in the form of key-value pairs.

More specifically, based on the local paper data set, Python and Mathematica tools are used to clean the text, removing meaningless characters such as spaces and line breaks, and to give every paper a unique number and organize the data into a form convenient for analysis. After the papers cited by each article have been obtained, the numbers of the cited papers are found by lookup and the citation relationships are extracted from the data; using database technology, the citation relationships are stored in the database as key-value pairs. For the resulting network information, comprising vertices and edges, the citing paper and the cited paper of each citation are stored in the database as the two values of a key-value pair. Every entry in the database then undergoes uniform, automated, structured de-duplication, storage and indexing.

Step 2 comprises:

Step 2.1: according to the citation relationships in the paper citation network, calculate the score based on network structure analysis. The algorithm is as follows:

Input: the paper citation network G and the center paper v₀. Output: the importance of every paper in the paper citation network (other than the center paper) relative to the center paper.

Create a processing queue Q as an empty queue, and create the set S of already-scored papers as an empty set.

Add the center paper to the processing queue Q.

Add the center paper to the set S of already-scored papers.

While the queue is not empty, do the following:

Dequeue a paper v.

Assign the number of papers cited by v to n.

For every paper v' cited by paper v, do the following:

Divide the score of paper v into n equal parts and denote one part delta_score.

Add delta_score to the score of v' to give its new score.

If this is the first time the score of v' has been updated, do the following:

Add paper v' to the processing queue Q.

Add paper v' to the set S of already-scored papers.
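The queue-based procedure of step 2.1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `structure_scores` and the input format (a dictionary from paper id to the list of ids it cites) are hypothetical, and the initial score of 1.0 for the center paper is borrowed from the description of the distributed variant.

```python
from collections import deque


def structure_scores(cites, center):
    """Propagate score from the center paper along citation edges."""
    scores = {center: 1.0}    # assumption: the center paper starts at 1.0
    queue = deque([center])   # processing queue Q
    scored = {center}         # set S of papers that have been scored
    while queue:
        v = queue.popleft()
        refs = cites.get(v, [])
        n = len(refs)
        if n == 0:
            continue                      # v cites nothing, nothing to share
        delta_score = scores[v] / n       # equal share of v's current score
        for w in refs:
            scores[w] = scores.get(w, 0.0) + delta_score
            if w not in scored:           # first update: schedule w
                scored.add(w)
                queue.append(w)
    scores.pop(center)                    # output excludes the center paper
    return scores


result = structure_scores({"C": ["A", "B"], "A": ["B"]}, "C")
```

In this small example paper B is cited both by the center paper C and by A, so it collects score from two levels and ends up above A.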

Step 2.2: traverse the subgraph that the user is concerned with using breadth-first search, and calculate, for each paper, the ratio of its longest path to its shortest path relative to the center paper as the score based on citation level analysis. The formula is as follows:

Score based on citation level analysis = longest path / shortest path;

Step 2.3: select the optimal parameters and combine the score based on network structure analysis with the score based on citation level analysis as the final importance of each other paper relative to the center paper. The formula is as follows:

Importance = weight of network structure analysis × score based on network structure analysis + weight of citation level analysis × score based on citation level analysis.
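A possible reading of steps 2.2 and 2.3 is sketched below. Several points are assumptions made for this sketch: the subgraph reachable from the center paper is taken to be acyclic (so that a longest path is well defined), the longest path is computed in topological order rather than by plain breadth-first search, and the names `level_scores`, `combine`, `w_struct` and `w_level` are hypothetical.

```python
from collections import deque


def level_scores(cites, center):
    """Longest/shortest path ratio of each paper relative to the center."""
    # Shortest path lengths by breadth-first search from the center paper.
    shortest = {center: 0}
    queue = deque([center])
    while queue:
        v = queue.popleft()
        for w in cites.get(v, []):
            if w not in shortest:
                shortest[w] = shortest[v] + 1
                queue.append(w)
    # Longest path lengths in topological order (assumes no citation cycles).
    indegree = {v: 0 for v in shortest}
    for v in shortest:
        for w in cites.get(v, []):
            indegree[w] += 1
    longest = {v: 0 for v in shortest}
    ready = deque(v for v, d in indegree.items() if d == 0)
    while ready:
        v = ready.popleft()
        for w in cites.get(v, []):
            longest[w] = max(longest[w], longest[v] + 1)
            indegree[w] -= 1
            if indegree[w] == 0:
                ready.append(w)
    return {v: longest[v] / shortest[v] for v in shortest if v != center}


def combine(struct, level, w_struct, w_level):
    """Weighted sum of the two scores, as in the importance formula."""
    return {p: w_struct * struct[p] + w_level * level[p] for p in struct}


levels = level_scores({"C": ["A", "B"], "A": ["B"]}, "C")
importance = combine({"A": 0.5, "B": 1.0}, levels, 0.5, 0.5)
```

Here B is reachable from C both directly (length 1) and through A (length 2), so its ratio is 2.0, whereas A has ratio 1.0. The weights 0.5 and 0.5 are placeholders; the description obtains the weights by optimization against a training set.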

Step a comprises: implementing the analysis algorithm of step 2 as a distributed program so that it can run on a distributed system. Using the Spark parallel framework, the data are kept in memory, interaction with the front end is maintained, and the computation is carried out in parallel on multiple computing nodes. This greatly increases the running speed of the program and allows results to be provided dynamically to the front end for display. The algorithm is as follows:

Step 2 uses a queue and processes one paper node per iteration, whereas the distributed implementation processes one layer of papers per iteration. First, the score of the center paper is initialized to 1.0, and the first layer contains only the center paper. In each round, the score of each paper in the previous layer is divided equally among the papers it cites, and the score contribution to each cited paper is generated with a distributed Map method. Then, once the score contributions of all papers in the layer have been calculated, the contributions are added to the previous score of each paper with a distributed Reduce method. Next, the following layer is updated to the papers cited for the first time in this round (excluding papers visited earlier). The loop ends when the next layer is empty or the search has run too long, and the result is then returned to the front end.
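The layer-by-layer scheme can be illustrated without a cluster. The sketch below imitates the Map and Reduce phases with ordinary Python collections instead of Spark, so it shows only the data flow, not the parallel execution; the function name `layered_scores` and the `max_rounds` cap (standing in for "the search has run too long") are assumptions.

```python
from collections import defaultdict


def layered_scores(cites, center, max_rounds=50):
    """Score propagation one layer at a time, in Map/Reduce style."""
    scores = {center: 1.0}
    seen = {center}
    layer = [center]            # the first layer holds only the center paper
    rounds = 0
    while layer and rounds < max_rounds:
        rounds += 1
        # "Map": each paper in the layer emits an equal share per citation.
        contributions = [
            (w, scores[p] / len(cites[p]))
            for p in layer if cites.get(p)
            for w in cites[p]
        ]
        # "Reduce": sum the contributions per cited paper.
        totals = defaultdict(float)
        for w, delta in contributions:
            totals[w] += delta
        # Update all scores together once the whole layer is done.
        for w, total in totals.items():
            scores[w] = scores.get(w, 0.0) + total
        # Next layer: papers cited for the first time in this round.
        layer = [w for w in totals if w not in seen]
        seen.update(layer)
    return scores


scores = layered_scores({"C": ["A", "B"], "A": ["B"]}, "C")
```

Because each contribution depends only on one paper of the current layer, the list of contributions is the part that a framework such as Spark could compute on separate nodes.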

More specifically, by analyzing the topology of the network, the score based on network structure analysis and the score based on citation level analysis are calculated separately. The two scores are finally combined through optimization of the parameters to give the importance of each other paper relative to the center paper.

First, the score based on network structure analysis is calculated. In general, the list of papers cited by a paper includes both papers that had an important influence on it and some unimportant ones, cited for example as background. The directly cited papers that had an important influence on the center paper may include seminal papers in the field from long ago as well as recently published papers that directly inspired the center paper. These recently published cited papers, belonging to the same field, are very likely to cite the same seminal papers. Starting from the center paper and following the structure of the citation network, the score of each paper is divided equally among the papers it cites and is used as a score increment to update their scores layer by layer. Because the seminal papers of a field receive citations from papers at many different levels, they obtain higher scores. The citation paths from a high-scoring paper to the center paper show how the work developed and evolved, and the structure formed by many such citation paths presents a panorama of the whole academic map centered on the center paper.

Calculating the score based on network structure analysis yields a preliminary score of every paper relative to the center paper. To reflect the attributes of each paper more fully, the score based on citation level analysis is used to characterize how seminal each paper is: the higher the score, the more seminal the paper. The longest path from a paper to the center paper is defined as the level of that paper relative to the center paper. A seminal paper in a field is very likely to be cited directly, across many levels, by most of the later papers in the field. Those later papers, following the order in which the research gradually developed, in turn cite one another step by step across single levels. As a result, from the center paper to a seminal paper there exist both very long paths and very short paths, and the ratio of the two is taken as the score characterizing how seminal a paper is. Using breadth-first search to traverse the subgraph that the user is concerned with, the ratio of the longest path to the shortest path of each paper relative to the center paper is calculated as the score based on citation level analysis. A set of recommended papers in certain fields is used as the training set and, with the goal that papers appearing in the recommended set obtain high scores, an optimization algorithm yields suitable parameters. These parameters are used as weight coefficients to combine the score based on network structure analysis with the score based on citation level analysis into the final importance of each other paper relative to the center paper.

More specifically, the algorithm is implemented as a program following the design principles of distributed programs. Taking the citation layer as the unit, the program calculates the change in paper scores in the citation network layer by layer, stores the changes in temporary variables, and updates the scores together after each layer has been calculated. Because the calculation for each paper is independent when the score changes of cited papers are computed, the program assigns these independent calculations to different computing nodes to run in parallel, which greatly increases the calculation speed. Updating the scores together also avoids the database being locked during the calculation. Key data such as paper serial numbers, reference lists and scores are kept in memory and, using the Spark parallel framework, the program runs on a distributed system with multiple computing nodes. After the calculation is complete, textual information such as paper titles, publication dates and journals or conferences is read from the database, processed by the program into the data format required by the front end, and sent to the front end for display. When the front end needs to change the center paper, the program receives the information from the front end and carries out a new calculation, thereby providing data to the front end dynamically and presenting the user with an interactive interface and functions.

Further, whereas step 2 uses a queue and processes one paper node per iteration, the distributed implementation of step a processes one layer of papers per iteration. First, the score of the center paper is initialized to 1.0, and the first layer contains only the center paper. In each round, the score of each paper in the previous layer is divided equally among the papers it cites, and the score contribution to each cited paper is generated with a distributed Map method. Then, once the score contributions of all papers in the layer have been calculated, the contributions are added to the previous score of each paper with a distributed Reduce method. Next, the following layer is updated to the papers cited for the first time in this round (excluding papers visited earlier). The loop ends when the next layer is empty.

Step 3 comprises:

Step 3.2: preliminarily analyze the citation relationships between papers, using the dictionary data structure of the Python programming language (i.e. node 1: [corresponding node 1, corresponding node 2, ...], node 2: [corresponding node 1, corresponding node 2, ...], node 3 ...) to store the nodes together with the associated direction and length information;

Step 3.3: (Note: depth-first search is a graph algorithm; briefly, it follows each possible branch path as deep as it can go, and each node is visited only once. Breadth-first traversal is a traversal strategy for connected graphs; its idea is to start from a vertex V0 and radiate outward, traversing the wider surrounding region first.)

More specifically, in order to explore the number and length of the paths that may exist between any two papers, the first step is of course to extract the information from the database. Since only the relationships between papers need to be examined, each paper is treated as an idealized point; if the content of the papers themselves has to be considered later, the point can also hold the storage location and characteristics of the related content. Starting from the database obtained in step 1, note first that what the database stores is the relationship between each pair of nodes. Because it is stored pair by pair, many nodes are stored repeatedly. These records are all necessary for the database, but a node with several adjacent nodes appears several times, which runs counter to human intuition about topological structure and makes reading the data back very inconvenient. The data are therefore preprocessed so that the relationships between citing and cited papers can be obtained more efficiently. The specific algorithm is as follows:

First, the dictionaries ref_dic = {} and refed_dic = {} are initialized. ref_dic represents the mapping from a central paper node to its multiple citers, and refed_dic represents the mapping from a central paper node to the multiple papers it cites. The paper data are then read from the database. For each record, it is first determined whether the central paper node already exists among the keys of the dictionary, and the two cases are handled separately. If the paper node is already a key of the dictionary, the new citer or cited paper is simply appended to its entry. If the paper node has never appeared among the keys, it is added as a new key and the new citer or cited paper is added as its entry. The two resulting dictionaries contain the mapping from each central point to its adjacent points, converted from the one-to-one structural relationships in the database.
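The preprocessing described above amounts to grouping pairwise records by their left and right items. A minimal sketch follows; it assumes the database rows are available as `(left, right)` tuples and deliberately stays neutral about which column holds the citing paper, since the text only speaks of left-hand and right-hand data. The function name `build_maps` is hypothetical.

```python
def build_maps(rows):
    """Turn one-to-one (left, right) records into two adjacency dictionaries."""
    ref_dic = {}     # keyed by the left-hand item of each row
    refed_dic = {}   # keyed by the right-hand item of each row
    for left, right in rows:
        if left in ref_dic:
            ref_dic[left].append(right)     # existing key: append to its entry
        else:
            ref_dic[left] = [right]         # new key with a one-item entry
        if right in refed_dic:
            refed_dic[right].append(left)
        else:
            refed_dic[right] = [left]
    return ref_dic, refed_dic


ref_dic, refed_dic = build_maps([("a", "b"), ("a", "c"), ("b", "c")])
```

After this step each node appears once as a key, with all of its neighbours in one list, instead of once per pairwise record.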

After obtaining the dictionary indexed by citer and the dictionary indexed by cited paper, the number and length of the paths between any two points are explored. The topology of citing and being cited is rather special: most papers point only to the paper nodes they cite, while a small number of papers have comparatively many citations and become central nodes of the structure. If two paper nodes are separated by too many intermediate nodes, their correlation is usually very weak, so to make the search faster and more effective a maximum length is set for the search path. Breadth-first and depth-first algorithms are then used respectively; the basic idea is as follows:

The description is given for breadth-first search; the variations for depth-first search are explained in brackets.

First initialize a dictionary Dir = {}, used to save the citation direction between papers, that is, whether an edge points from the citer to the cited paper or the other way round. Initialize another dictionary p_dic = {} to save the nodes already traversed.

Then initialize a list p_list (for breadth-first search it is a queue: elements are added at the end of p_list and removed from its beginning; for depth-first search it is a stack: elements are both added and removed at the end of p_list), and add the initial paper node and its related information, such as length and direction, to the processing list p_list.

While p_list is not empty, repeat the following loop and count the iterations; the loop also terminates when the count exceeds the required number of steps. The loop body is:

Dequeue a paper v. Store all citing papers that v maps to in ref_dic in a list a_list, and all cited papers that v maps to in refed_dic in a list b_list.

Traverse a_list. If a paper v' is the target paper node end, output it, assign v to a temporary node, and trace back along the path saved in the dictionary Dir until the start node is found; then output the saved nodes and direction information;

If v' is not the target, check whether it has already been traversed. If it has not, set Dir[v'] ← (v, 0) {v denotes the paper cited by v', and 0 denotes the direction from cited paper to citer}, which saves the mapping from v to v' in the dictionary Dir. Paper v' and its direction are added to the set of traversed papers p_dic. Then v' is enqueued, i.e. added to p_list.

Then traverse b_list. The process is the same as for a_list, except that this step uses Dir[v'] ← (v, 1).
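The search loop above can be sketched as one function that switches between a queue and a stack. This is an illustration only: the names `find_paths`, `max_steps` and `depth_first` are hypothetical, the two dictionaries are assumed to map each paper to a list of neighbouring papers, and a set is used for the traversed nodes where the text uses the dictionary p_dic.

```python
from collections import deque


def find_paths(ref_dic, refed_dic, start, end, max_steps=1000, depth_first=False):
    """Search from start to end, following edges in both citation directions."""
    direction = {}            # Dir: node -> (predecessor, direction flag)
    visited = {start}         # traversed nodes (p_dic in the text)
    p_list = deque([start])   # queue for breadth-first, stack for depth-first
    paths = []
    steps = 0
    while p_list and steps < max_steps:
        steps += 1
        v = p_list.pop() if depth_first else p_list.popleft()
        a_list = ref_dic.get(v, [])      # neighbours recorded with flag 0
        b_list = refed_dic.get(v, [])    # neighbours recorded with flag 1
        for neighbours, flag in ((a_list, 0), (b_list, 1)):
            for nxt in neighbours:
                if nxt == end:
                    # Trace back through Dir until the start node is reached.
                    path = [(end, flag)]
                    node = v
                    while node != start:
                        prev, prev_flag = direction[node]
                        path.append((node, prev_flag))
                        node = prev
                    path.append((start, None))
                    path.reverse()
                    paths.append(path)
                elif nxt not in visited:
                    visited.add(nxt)
                    direction[nxt] = (v, flag)
                    p_list.append(nxt)
    return paths


paths = find_paths({"A": ["B"], "B": ["C"]}, {"B": ["A"], "C": ["B"]}, "A", "C")
```

Each returned path is a list of `(node, flag)` pairs, where the flag records the direction of the edge by which the node was reached. The target itself is never marked as traversed, so it can be reported once for every predecessor that reaches it within the step limit.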

This makes it convenient to carry out research on a given paper data set. For different data sets the two indicators of preprocessing time and searchable range will differ, and the number of paths and the paths themselves between two paper nodes are obtained under a variety of conditions.

Step b comprises: implementing the extraction algorithm of step 3 as a distributed program so that it can run on a distributed system. The distributed algorithm is as follows:

More specifically, the distributed algorithm for marking the paths between two papers in the academic map is divided into two parts: forward path marking and reverse path marking.

In forward path marking, each iteration processes one layer of papers. The first layer is initialized to the starting paper. In each round, the citation information is used to mark the paths from every paper in the previous layer to the papers it cites. The next layer is then updated to the papers cited for the first time in this round (excluding the end paper and papers visited earlier). The loop ends when the next layer is empty.

In reverse path marking, each iteration again processes one layer of papers. The first layer is initialized to the end paper. In each round, the path information calculated during forward path marking is used to mark the paths from every paper in the previous layer to its parent papers. The next layer is then updated to the parent papers visited for the first time in this round (excluding the starting paper and papers visited earlier). The loop ends when the next layer is empty. At this point, all paths from the starting point to the end point have been marked.
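The two passes can be sketched in plain Python as follows. The layers are processed sequentially here, so this shows the logic rather than a distributed execution; the function name `mark_paths`, the input format (paper id to list of cited ids) and the representation of a marked path as a set of `(parent, child)` edges are assumptions made for the sketch.

```python
from collections import defaultdict


def mark_paths(cites, start, end):
    """Return the citation edges lying on paths from start to end."""
    # Forward pass: record, layer by layer, which papers cite each paper.
    parents = defaultdict(set)
    visited = {start}
    layer = {start}
    while layer:
        next_layer = set()
        for p in layer:
            for c in cites.get(p, []):
                parents[c].add(p)             # mark the path p -> c
                if c != end and c not in visited:
                    next_layer.add(c)         # cited for the first time
        visited |= next_layer
        layer = next_layer
    # Reverse pass: walk back from the end paper through recorded parents.
    marked = set()
    visited = {end}
    layer = {end}
    while layer:
        next_layer = set()
        for c in layer:
            for p in parents.get(c, ()):
                marked.add((p, c))
                if p != start and p not in visited:
                    next_layer.add(p)         # parent reached for the first time
        visited |= next_layer
        layer = next_layer
    return marked


edges = mark_paths({"S": ["A", "B"], "A": ["T"], "B": ["T", "X"], "X": []}, "S", "T")
```

In the example, the edge from B to X is recorded during the forward pass but dropped by the reverse pass, because X does not lead to the end paper T.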

Specific embodiments of the present invention have been described above. It should be understood that the invention is not limited to the particular embodiments described; those skilled in the art can make various variations or modifications within the scope of the claims without affecting the substance of the invention.

Claims

1. An academic big data analysis method based on citation relationships between papers, characterized by comprising the following steps:

The center paper is the paper that the user inputs as a query;

2. The academic big data analysis method based on citation relationships between papers according to claim 1, characterized by further comprising step a: performing distributed processing of the analysis algorithm of step 2, and calculating, by means of multiple nodes of the paper citation network, the importance of every paper in the network relative to the center paper.

3. The academic big data analysis method based on citation relationships between papers according to claim 1, characterized by further comprising step b: performing distributed processing of the extraction algorithm of step 3, and calculating, by means of multiple nodes of the paper citation network, the importance of the development paths between specified papers in the network.

4. The academic big data analysis method based on citation relationships between papers according to claim 1, characterized in that step 1 comprises:

Step 1.2: build the paper citation network;

5. The academic big data analysis method based on citation relationships between papers according to claim 1, characterized in that step 2 comprises:

Score based on citation level analysis = longest path / shortest path;

6. The academic big data analysis method based on citation relationships between papers according to claim 1, characterized in that step 3 comprises:

Step 3.3: extract the path information between two papers.

7. The academic big data analysis method based on citation relationships between papers according to claim 6, characterized in that step 3.1 comprises: