CN105808729A - Academic big data analysis method based on reference relationship among pieces of thesis - Google Patents

Academic big data analysis method based on reference relationship among pieces of thesis

Info

Publication number
CN105808729A
Authority
CN
China
Prior art keywords
paper
data
thesis
citation relationship
dic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610131343.0A
Other languages
Chinese (zh)
Other versions
CN105808729B (en)
Inventor
谈兆炜
刘长风
周劲光
杜佳俊
骆铮
毛宇宁
沈嘉明
王彪
傅洛伊
王新兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN201610131343.0A
Publication of CN105808729A
Application granted
Publication of CN105808729B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3325 Reformulation based on results of preceding query
    • G06F16/3326 Reformulation based on results of preceding query using relevance feedback from the user, e.g. relevance feedback on documents, documents sets, document terms or passages

Abstract

The invention provides an academic big data analysis method based on citation relationships among papers. The method comprises the following steps: step 1, analyzing and processing a local paper data set and building a paper citation network in a database; step 2, establishing an analysis algorithm according to the citation relationships in the paper citation network, obtaining the importance of the nodes in the network and their mutual relationships by means of the analysis algorithm, and obtaining the importance of each paper relative to the central paper; step 3, converting the one-to-one citation relationships between papers into a mapping set in the citing direction and a mapping set in the cited direction, obtaining the development paths between specified papers in the paper citation network by means of an extraction algorithm, and calculating the importance of the paths from the paper importance obtained in step 2. With this method, the citation relationships of the papers in the database can be conveniently analyzed, the development paths between papers can be obtained, and the precision of paper search can be improved.

Description

Academic big data analysis method based on citation relationships between papers
Technical field
The present invention relates to the technical field of big data processing and network analysis, and in particular to an academic big data analysis method based on citation relationships between papers.
Background art
With the rapid development of information technology, people have more and more ways to obtain data, and the total volume of data is growing explosively. Research by International Data Corporation (IDC) shows that the world generated 0.49 ZB of data in 2008, 0.8 ZB in 2009, and 1.2 ZB in 2010; in 2011 the figure reached 1.82 ZB, equivalent to more than 200 GB for every person in the world. By 2012, all the printed material ever produced by humankind amounted to 200 PB of data. Research by IBM claims that 90% of all the data obtained by human civilization was produced in the past two years. By 2020, the volume of data produced worldwide is expected to reach 44 times today's. With the emergence of massive data, tools for processing it have also emerged; Hadoop and Spark, for example, are software frameworks capable of distributed processing of massive data.
In scientific research, as countries invest more and more in research activity, the number of papers, the direct output of research, is also growing at a surprising rate every year. Taking China as an example, the statistics on Chinese scientific and technical papers released by the Institute of Scientific and Technical Information of China on September 26, 2014 show that from 2004 to September 2014 China published 1.3698 million international papers in total, ranking 2nd in the world, and that these papers were cited 10.3701 million times in total, ranking 4th in the world. Beyond their text, papers are linked by complex citation relationships that form a very large network, which in turn gives rise to "academic big data". To help researchers quickly find the papers they want, some commercial companies, such as Microsoft and Google, have developed their own academic search systems.
In these academic search systems, many network algorithms are applied to paper ranking, the best known being the PageRank algorithm and the HITS (Hypertext-Induced Topic Search) algorithm. However, most algorithms consider only the global importance of a paper when ranking and do not consider its importance relative to one specific paper. They are therefore unsuitable when the goal is to find the few papers that most influenced a given paper. In addition, most existing algorithms provide nothing beyond a ranking, and in particular no information about the line of development between two papers. As a result, researchers cannot efficiently obtain research resources or fully grasp the development trends of a research field.
Against this background, the present invention develops a system, based on the paper citation network, that can both analyze the importance of a paper relative to another specified paper and present the line of development between two papers.
Summary of the invention
In view of the defects in the prior art, the object of the present invention is to provide an academic big data analysis method based on citation relationships between papers.
The academic big data analysis method based on citation relationships between papers provided by the present invention comprises the following steps:
Step 1: analyze and process the local paper data set, then build a paper citation network in a database;
Step 2: establish an analysis algorithm according to the citation relationships in the paper citation network, use the analysis algorithm to obtain the importance of the nodes in the paper citation network and their mutual relationships, and obtain the importance of each paper relative to the central paper;
The central paper is the paper that the user enters as a query. The user is assumed to be interested in this paper and in the papers related to it, and to want to know the contribution and relative importance of other papers with respect to this paper, rather than their global importance.
Step 3: convert the one-to-one citation relationships between papers into a mapping set in the citing direction and a mapping set in the cited direction, obtain the development paths between specified papers in the paper citation network, and calculate the importance of each path from the paper importance obtained in step 2.
Preferably, the method further comprises step a: applying distributed processing to the analysis algorithm of step 2, and calculating, through multiple nodes of the paper citation network, the importance of every paper in the network relative to the central paper.
Preferably, the method further comprises step b: applying distributed processing to the extraction algorithm of step 3, and calculating, through multiple nodes of the paper citation network, the importance of the development paths between specified papers in the network.
Preferably, step 1 comprises:
Step 1.1: use text processing and analysis techniques to extract the reference information from the local paper data; the reference information records, for every paper in the paper collection, which papers it cites;
Step 1.2: build the paper citation network;
Step 1.3: compare the reference information in the resulting paper citation network and remove duplicates, store and index it with database software, and store the citation relationships between papers in the database as key-value pairs.
Preferably, step 2 comprises:
Step 2.1: according to the citation relationships in the paper citation network, calculate the score based on network structure analysis;
Step 2.2: use breadth-first traversal to traverse the subgraph the user is interested in, and calculate, for each paper, the ratio of its longest path to its shortest path relative to the central paper as the score based on citation level analysis; the formula is as follows:
score based on citation level analysis = longest path / shortest path;
Step 2.3: select the corresponding weights and combine the score based on network structure analysis with the score based on citation level analysis into the final importance of each other paper relative to the central paper; the formula is as follows:
importance = weight of network structure analysis * score based on network structure analysis + weight of citation level analysis * score based on citation level analysis.
Preferably, step 3 comprises:
Step 3.1: convert the one-to-one citation relationships between papers in the database into a mapping set in the citing direction and a mapping set in the cited direction;
Step 3.2: preliminarily analyze the citation relationships of the papers, using the dictionary data structure of the Python programming language;
Step 3.3: extract the path information between two papers.
Preferably, step 3.1 comprises:
Initialize two dictionaries, ref_dic and refed_dic, where ref_dic represents the mapping from a central paper node to its multiple citers, and refed_dic represents the mapping from a central paper node to the multiple papers it cites;
Traverse every row of data in the database. For any row, look up its left-hand value among the keys of ref_dic: if the left-hand value is already a key, append the row's right-hand value to the end of that key's entry; if it is not, save the left-hand value as a new key with the right-hand value as its entry. This converts the one-to-one citation relationships in the database into the mapping set in the citing direction;
For refed_dic, the right-hand value serves as the key and the left-hand value as the entry. Look up the row's right-hand value among the keys of refed_dic: if the right-hand value is already a key, append the row's left-hand value to the end of that key's entry; if it is not, save the right-hand value as a new key with the left-hand value as its entry. This converts the one-to-one citation relationships in the database into the mapping set in the cited direction.
Compared with the prior art, the present invention has the following beneficial effects:
1. The academic big data analysis method based on citation relationships between papers described in the present invention shows more effectively the contribution and relative importance of other papers with respect to the paper the user queries, so that the user can more easily start from one paper of interest and find other related papers.
2. The present invention further describes a distributed processing method for academic big data analysis based on citation relationships between papers, which helps to increase the speed of the analysis. When the present invention is used to build a paper query system, this distributed processing method effectively reduces the response time of user queries and improves the user experience.
Brief description of the drawings
Other features, objects and advantages of the present invention will become more apparent from the detailed description of non-limiting embodiments given with reference to the following drawings:
Fig. 1 is a flow chart of the distributed scoring path algorithm;
Fig. 2 is a flow chart of the path exploration between two papers;
Fig. 3 is a flow chart of the path labeling algorithm;
Fig. 4 is a diagram of the demonstration system for the path labeling algorithm.
Detailed description of the invention
The present invention is described in detail below with reference to specific embodiments. The following embodiments will help those skilled in the art to further understand the present invention, but do not limit it in any form. It should be noted that those skilled in the art can make a number of variations and improvements without departing from the concept of the present invention, and these all fall within the scope of protection of the present invention.
The academic big data analysis method based on citation relationships between papers provided by the present invention comprises the following steps:
Step 1: analyze and process the local paper data set, then build a paper citation network in a database;
Step 2: establish an analysis algorithm according to the citation relationships in the paper citation network, use the analysis algorithm to obtain the importance of the nodes in the paper citation network and their mutual relationships, and obtain the importance of each paper relative to the central paper;
Step a: apply distributed processing to the analysis algorithm of step 2, and calculate, through multiple nodes of the paper citation network, the importance of every paper in the network relative to the central paper.
Step 3: convert the one-to-one citation relationships between papers into a mapping set in the citing direction and a mapping set in the cited direction, and obtain the development paths between specified papers in the paper citation network by means of an extraction algorithm;
Step b: apply distributed processing to the extraction algorithm of step 3, and calculate, through multiple nodes of the paper citation network, the importance of the development paths between specified papers in the network.
Specifically, step 1 comprises:
Step 1.1: use text processing and analysis techniques to extract the reference information from the local paper data; the reference information records, for every paper in the paper collection, which papers it cites;
Step 1.2: build the paper citation network; the construction algorithm is as follows:
Construct an empty vertex set V and an empty edge set E. For each paper u in the database, first add u to the vertex set V as a paper node; then, for each paper v cited by u, construct a directed edge e(u, v), add e(u, v) to the edge set E, and set its weight to 1. Once this operation has been carried out for every paper in the database, the vertex set and the edge set together make up the paper citation network G(V, E).
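The construction in step 1.2 can be sketched in a few lines of Python. This is a minimal illustration rather than the patent's own implementation: the function name `build_citation_network` and the input format (a dictionary from each paper to the papers it cites) are assumptions, as is the reading that the edge e(u, v) points from the citing paper u to the cited paper v.

```python
def build_citation_network(references):
    """Build G(V, E) from reference data.

    `references` is a hypothetical input format: a dict mapping each paper
    in the database to the list of papers it cites.
    """
    vertices = set()   # vertex set V
    edges = {}         # edge set E, stored as (u, v) -> weight
    for u, cited in references.items():
        vertices.add(u)            # every paper in the database becomes a node
        for v in cited:
            edges[(u, v)] = 1      # directed edge e(u, v) with weight 1
    return vertices, edges


V, E = build_citation_network({"A": ["B", "C"], "B": ["C"], "C": []})
```

Storing the edges in a dictionary keyed by (u, v) also removes duplicate citation records as a side effect, which is in the spirit of the de-duplication in step 1.3.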
Step 1.3: use a computer program to compare the resulting paper citation network data and remove duplicates, store and index the data uniformly with database software, and store the citation relationships between papers in the database as key-value pairs.
More specifically, starting from the local paper data set, Python and Mathematica tools are used to clean the text, removing meaningless characters such as spaces and newlines, and to assign every paper a unique number, producing a format convenient for analysis. Once the papers cited by each article have been obtained, their numbers are found by lookup and the citation relationships are extracted; using database technology, the citation relationships are stored in the database as key-value pairs. In the resulting network information, which comprises nodes and edges, each citation is stored in the database as a key-value pair whose two values are the citing paper and the cited paper. Every entry in the database then undergoes unified, automated, structured de-duplication, storage and indexing.
Step 2 comprises:
Step 2.1: according to the citation relationships in the paper citation network, calculate the score based on network structure analysis. The algorithm is as follows:
Input: paper citation network G and central paper v0. Output: the importance of every article in the paper citation network (except the central paper) relative to the central paper.
Create a processing queue Q, initially empty, and a set S of scored papers, initially empty.
Add the central paper to the processing queue Q.
Add the central paper to the scored paper set S.
While the queue is not empty, do the following:
    Dequeue a paper v.
    Assign the number of papers cited by v to n.
    For every paper v' cited by v, do the following:
        Divide the score of v into n equal parts and denote one part as delta_score.
        Add delta_score to the score of v' to obtain its new score.
        If this is the first time the score of v' has been updated, do the following:
            Add v' to the processing queue Q.
            Add v' to the scored paper set S.
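The queue-based procedure above can be written as a short Python sketch. The function name `structure_scores` and the `cites` dictionary (paper to the list of papers it cites) are hypothetical names chosen for illustration. The initial score of 1.0 for the central paper is taken from the distributed variant described later in the text; the algorithm listing itself does not state it.

```python
from collections import deque


def structure_scores(cites, center):
    """Propagate score from the central paper along citation edges."""
    scores = {center: 1.0}     # assumption: the central paper starts at 1.0
    queue = deque([center])    # processing queue Q
    scored = {center}          # scored paper set S
    while queue:
        v = queue.popleft()
        cited = cites.get(v, [])
        n = len(cited)
        if n == 0:
            continue           # v cites nothing, so there is nothing to share
        delta_score = scores[v] / n
        for w in cited:
            scores[w] = scores.get(w, 0.0) + delta_score
            if w not in scored:    # first time w receives a score
                scored.add(w)
                queue.append(w)
    scores.pop(center)         # the output excludes the central paper
    return scores


result = structure_scores({"A": ["B", "C"], "B": ["C"]}, "A")
```

In this small example paper C is cited both by the central paper A and by B, so it collects score along two routes and ends up above B, which is the effect the method relies on to surface widely cited papers.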
Step 2.2: use breadth-first traversal to traverse the subgraph the user is interested in, and calculate, for each paper, the ratio of its longest path to its shortest path relative to the central paper as the score based on citation level analysis; the formula is as follows:
score based on citation level analysis = longest path / shortest path;
Step 2.3: select the optimal parameters and combine the score based on network structure analysis with the score based on citation level analysis into the final importance of each other paper relative to the central paper; the formula is as follows:
importance = weight of network structure analysis * score based on network structure analysis + weight of citation level analysis * score based on citation level analysis.
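Steps 2.2 and 2.3 can be illustrated as follows. The sketch assumes that the subgraph reachable from the central paper is acyclic (papers normally cite only earlier work), because the longest simple path is hard to compute in a general graph; the text does not say how cycles are handled. The names `level_scores`, `combine`, `w_struct` and `w_level` are hypothetical.

```python
from collections import deque


def level_scores(cites, center):
    """Ratio of longest to shortest path length from the central paper."""
    # Shortest path lengths by breadth-first traversal.
    shortest = {center: 0}
    queue = deque([center])
    while queue:
        v = queue.popleft()
        for w in cites.get(v, []):
            if w not in shortest:
                shortest[w] = shortest[v] + 1
                queue.append(w)

    # Depth-first post-order; reversed, it is a topological order of a DAG.
    order, seen = [], set()

    def visit(v):
        seen.add(v)
        for w in cites.get(v, []):
            if w not in seen:
                visit(w)
        order.append(v)

    visit(center)

    # Longest path lengths by relaxing edges in topological order.
    longest = {center: 0}
    for v in reversed(order):
        for w in cites.get(v, []):
            longest[w] = max(longest.get(w, 0), longest[v] + 1)

    return {p: longest[p] / shortest[p] for p in shortest if p != center}


def combine(struct, level, w_struct, w_level):
    """Weighted sum of the two scores, following the formula of step 2.3."""
    papers = set(struct) | set(level)
    return {p: w_struct * struct.get(p, 0.0) + w_level * level.get(p, 0.0)
            for p in papers}


cites = {"A": ["B", "C"], "B": ["C"]}
level = level_scores(cites, "A")
importance = combine({"B": 0.5, "C": 1.0}, level, 0.5, 0.5)
```

The recursive `visit` is adequate for small graphs; a large citation network would need an iterative traversal to avoid Python's recursion limit.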
Step a comprises: implementing the analysis algorithm of step 2 as a distributed program so that it can run on a distributed system. Using the Spark parallel framework, the data is kept in memory, interaction with the front end is maintained, and computation proceeds in parallel on multiple computing nodes. This greatly increases the running speed of the program and dynamically supplies results for the front end to display. The algorithm is as follows:
Step 2 uses a queue and computes one paper node per loop iteration, whereas the distributed implementation computes one layer of papers per iteration. First, the score of the central paper is initialized to 1.0, and the first layer contains only the central paper. In each round, the score of each paper in the previous layer is divided evenly among the papers it cites, and a distributed Map method generates the score contribution to each cited paper. Then, once all score contributions from the papers of this layer have been calculated, each contribution is added to the previous score of the corresponding paper using a distributed Reduce method. Next, the next layer is set to the papers cited for the first time in this round (excluding papers visited earlier). The loop ends when the next layer is empty or the search has run too long, and the results are then returned to the front end.
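The layer-by-layer scheme can be simulated on a single machine to make the Map and Reduce stages concrete. The sketch below is plain Python, not Spark code: the two stages are ordinary loops standing in for the distributed operations, `layered_scores` and `cites` are hypothetical names, and `max_layers` is an assumed stand-in for the "search has run too long" condition.

```python
from collections import defaultdict


def layered_scores(cites, center, max_layers=None):
    """Single-machine simulation of the layer-by-layer scoring."""
    scores = defaultdict(float)
    scores[center] = 1.0          # the central paper starts at 1.0
    visited = {center}
    layer = [center]              # the first layer holds only the central paper
    rounds = 0
    while layer and (max_layers is None or rounds < max_layers):
        # Map stage: each paper in the layer emits (cited paper, contribution).
        contributions = []
        for v in layer:
            cited = cites.get(v, [])
            for w in cited:
                contributions.append((w, scores[v] / len(cited)))
        # Reduce stage: add every contribution to the paper's previous score.
        for w, delta in contributions:
            scores[w] += delta
        # Next layer: papers cited for the first time in this round.
        next_layer = []
        for w, _ in contributions:
            if w not in visited:
                visited.add(w)
                next_layer.append(w)
        layer = next_layer
        rounds += 1
    return dict(scores)


result = layered_scores({"A": ["B", "C"], "B": ["C"]}, "A")
```

Because all contributions of a layer are computed before any score is updated, the per-paper work inside the Map stage is independent, which is what makes it suitable for parallel execution.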
More specifically, by analyzing the topology of the network, the score based on network structure analysis and the score based on citation level analysis are calculated separately. Finally, the two scores are combined through parameter optimization and tuning to give the importance of each other paper relative to the central paper.
First, the score based on network structure analysis is calculated. Generally, the list of papers cited by a paper includes both papers that influenced it significantly and some unimportant ones, such as those cited as background. Among the directly cited papers that significantly influenced the central paper, there may be seminal papers published in the field long ago as well as recently published papers that directly inspired the central paper. Because these recently published cited papers belong to the same field, they are very likely to cite the same seminal paper. Starting from the central paper and following the structure of the citation network, the score of each paper is divided evenly among the papers it cites, and paper scores are updated layer by layer using these increments. Because a seminal paper in a field receives citations from papers at different levels, it obtains a higher score. The citation paths from high-scoring papers to the central paper show how the research developed and evolved, and the structure formed by multiple citation paths presents a panorama of the whole academic map centered on the central paper.
Calculating the score based on network structure analysis yields a preliminary score of every paper relative to the central paper. To reflect the attributes of every paper more fully, the score based on citation level analysis is used to characterize how seminal each paper is: the higher the score, the more seminal the paper. Define the longest path from a paper to the central paper as the level of that paper relative to the central paper. A seminal paper in a field is very likely to be cited directly, across multiple levels, by most of the subsequent papers in the field. These subsequent papers, following the order in which the research gradually developed, in turn cite one another one level at a time. As a result, from the central paper to a seminal paper there exist both very long paths and very short paths. We therefore use the ratio of the two as the score characterizing how seminal a paper is. Using breadth-first search, the subgraph the user is interested in is traversed and, for each paper, the ratio of its longest path to its shortest path relative to the central paper is calculated as the score based on citation level analysis. Using a recommended paper collection from certain fields as a training set, with the objective that the papers appearing in that collection receive high scores, an optimization algorithm obtains suitable parameters. With these parameters as weight coefficients, the score based on network structure analysis and the score based on citation level analysis are combined into the final importance of each other paper relative to the central paper.
More specifically, the algorithm is implemented following the design principles of distributed programs. The program works in units of citation layers: it calculates the change in the score of each paper in the citation network layer by layer, stores the changes as temporary variables, and updates the scores together after the calculation for each layer has finished. Because the calculation for each paper is independent when computing the changes in the scores of cited papers, the program assigns these independent calculations to different computing nodes for parallel computation, which greatly increases the calculation speed. Updating the scores together also avoids database locking during the calculation. Key data such as paper numbers, reference lists and scores are stored in memory, and the program runs, using the Spark parallel framework, on a distributed system with multiple computing nodes. After the calculation has finished, text information such as paper titles, publication dates and journals or conferences is read from the database, converted by the program into the data format required by the front end, and sent to the front end for display. When the front end needs to change the central paper, the program receives the information from the front end and performs a new calculation, thereby supplying data to the front end dynamically and presenting the user with an interactive interface and functions.
Further, whereas step 2 uses a queue and computes one paper node per loop iteration, the distributed implementation of step a computes one layer of papers per iteration. First, the score of the central paper is initialized to 1.0, and the first layer contains only the central paper. In each round, the score of each paper in the previous layer is divided evenly among the papers it cites, and a distributed Map method generates the score contribution to each cited paper. Then, once all score contributions from the papers of this layer have been calculated, each contribution is added to the previous score of the corresponding paper using a distributed Reduce method. Next, the next layer is set to the papers cited for the first time in this round (excluding papers visited earlier). The loop ends when the next layer is empty.
Step 3 comprises:
Step 3.1: convert the one-to-one citation relationships between papers in the database into a mapping set in the citing direction and a mapping set in the cited direction;
Step 3.2: preliminarily analyze the citation relationships of the papers, using the dictionary data structure of the Python programming language (that is, node 1: [corresponding node 1, corresponding node 2, ...], node 2: [corresponding node 1, corresponding node 2, ...], node 3 ...) to store the nodes together with the associated direction and length information;
Step 3.3: extract the path information between two papers. (Note: depth-first search is a graph algorithm; briefly, it follows each possible branch as deep as it can go, and each node can be visited only once. Breadth-first traversal is a traversal strategy for connected graphs; its idea is to start from a vertex V0 and first traverse, radially, the wider area around it.)
More specifically, to explore the number and length of the paths that may exist between any two papers, the first step is of course to extract information from the database. Because only the relationships between papers are of interest, each paper is treated as an idealized point; if the content of the papers themselves must also be considered later, a point can additionally hold the location where the related content is stored. Consider the database obtained in step 1: what it stores is the relationship between each pair of nodes. Note that storage is per pair, so many nodes are stored repeatedly. Although this storage is necessary for the database, a node with several adjacent nodes appears several times, which runs counter to human intuition about topological structure and makes reading the data back very inconvenient. The data is therefore preprocessed so that the relationships between citing and cited papers can be obtained more efficiently. The specific algorithm is as follows:
First, the dictionaries ref_dic = {} and refed_dic = {} are initialized. ref_dic represents the mapping from a central paper node to its multiple citers, and refed_dic represents the mapping from a central paper node to the multiple papers it cites. Then the paper data is fetched from the database. For each record, first determine whether the central paper node is already among the keys of the dictionary, then handle the two cases. If the paper node is already among the keys, simply append the new citer or cited paper to its entry. If the paper node has never appeared among the keys, add it as a new key and add the new citer or cited paper as its entry. The two resulting dictionaries hold the relationships from each central node to its adjacent nodes, converted from the one-to-one structural relationships in the database.
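The preprocessing can be sketched directly with Python dictionaries. The dictionary names ref_dic and refed_dic come from the text; the function name `build_mapping_dicts` and the representation of database rows as (left, right) tuples are assumptions made for illustration.

```python
def build_mapping_dicts(rows):
    """Turn one-to-one (left, right) rows into two one-to-many dictionaries."""
    ref_dic = {}     # left value  -> list of right values
    refed_dic = {}   # right value -> list of left values
    for left, right in rows:
        if left in ref_dic:
            ref_dic[left].append(right)     # key exists: append to its entry
        else:
            ref_dic[left] = [right]         # new key with a one-item entry
        if right in refed_dic:
            refed_dic[right].append(left)
        else:
            refed_dic[right] = [left]
    return ref_dic, refed_dic


rows = [("A", "B"), ("A", "C"), ("B", "C")]
ref_dic, refed_dic = build_mapping_dicts(rows)
```

After this step each node appears once as a key, with all of its neighbours in one list, instead of once per row.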
After obtaining the dictionary indexed by citing papers and the dictionary indexed by cited papers, the number and length of the paths between any two nodes can be explored. The topology of citing and being cited is quite special: most papers only point to the papers they cite, while a small number of papers have relatively many citations and become central nodes in the structure. If two paper nodes are separated by too many intermediate nodes, their correlation is often extremely weak, so a maximum length is set for the search path to make the search more efficient. Breadth-first and depth-first algorithms are then each used; the basic idea is as follows:
The description below is for breadth-first search; the depth-first variations are noted in parentheses where they differ.
First initialize a dictionary Dir = {} to store the citation direction between papers, that is, whether a step runs from the citing paper to the cited paper or the other way round. Initialize another dictionary p_dic = {} to store the nodes already traversed.
Then initialize a list p_list (for breadth-first search it is a queue: elements are added at the end of p_list and removed from its beginning; for depth-first search it is a stack: elements are both added and removed at the end of p_list), and add the starting paper node together with its related information, such as length and direction, to p_list.
While p_list is not empty, repeat the following loop and count the iterations; the loop also terminates when the count exceeds the required number of steps. The loop body is:
Dequeue a paper v. Store all the citing papers that v maps to in ref_dic in a list a_list, and all the cited papers that v maps to in refed_dic in a list b_list.
Traverse a_list. If a paper v' is the target paper node, output it: assign v to a temporary node and trace back along the path stored in Dir until the start node is found, then output the stored nodes and direction information.
If v' is not the target, check whether it has already been traversed. If it has not, set Dir[v'] ← (v, 0) {v is the paper cited by v', and 0 denotes the direction from cited paper to citing paper}, which saves the mapping from v to v' in Dir. Add v' and its direction to the set of traversed papers p_dic, then enqueue v', that is, append it to p_list.
Then traverse b_list. The process is the same as for a_list, except that this step uses Dir[v'] ← (v, 1).
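The search loop above can be sketched as a single Python function. This is an illustrative sketch under stated assumptions rather than the exact patented procedure: the names find_path, max_steps and depth_first are hypothetical, the step limit is applied to the loop count, and the function returns only the first path it finds instead of enumerating every path.

```python
from collections import deque


def find_path(ref_dic, refed_dic, start, end, max_steps=1000, depth_first=False):
    """Return the first path found from start to end, or None.

    The path is a list of (paper, flag) pairs. flag is None for the start
    node, 0 for a step from a cited paper to a citing paper, and 1 for a
    step from a citing paper to a cited paper.
    """
    direction = {}      # "Dir" in the text: v' -> (predecessor v, flag)
    visited = {start}   # "p_dic" in the text: papers already traversed
    p_list = deque([start])
    steps = 0
    while p_list and steps < max_steps:
        steps += 1
        # Queue for breadth-first search, stack for depth-first search.
        v = p_list.pop() if depth_first else p_list.popleft()
        a_list = ref_dic.get(v, [])    # papers that cite v
        b_list = refed_dic.get(v, [])  # papers that v cites
        for flag, neighbours in ((0, a_list), (1, b_list)):
            for v2 in neighbours:
                if v2 in visited:
                    continue
                direction[v2] = (v, flag)
                if v2 == end:
                    return _trace_back(direction, start, end)
                visited.add(v2)
                p_list.append(v2)
    return None


def _trace_back(direction, start, end):
    """Walk the stored predecessors from end back to start."""
    path = []
    node = end
    while node != start:
        previous, flag = direction[node]
        path.append((node, flag))
        node = previous
    path.append((start, None))
    path.reverse()
    return path


# Hypothetical data: B cites A, and C cites B.
ref_dic = {"A": ["B"], "B": ["C"]}
refed_dic = {"B": ["A"], "C": ["B"]}
path = find_path(ref_dic, refed_dic, "A", "C")
```

Using one deque for both modes keeps the only difference between breadth-first and depth-first search in the choice of popleft() or pop(), which matches the queue and stack variants described in the text.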
This makes it convenient to study a given paper data set. The preprocessing time and the searchable range differ between data sets, and the number of paths between two paper nodes, as well as the paths themselves, can be obtained in a variety of situations.
Step b includes: implementing the extraction algorithm of step 3 as a distributed program so that it can run on a distributed system. The distributed algorithm is as follows:
More specifically, the distributed algorithm for marking the paths between two papers in the academic map is divided into two parts: forward path marking and reverse path marking.
In forward path marking, each iteration of the loop processes one layer of papers. The first layer is initialized to the starting paper. In each iteration, the citation information is used to mark the paths from all papers of the previous layer to the papers they cite. The next layer is then updated to the papers cited for the first time in this iteration (excluding the end paper and papers visited before). The loop ends when the next layer is empty.
In reverse path marking, each iteration again processes one layer of papers. The first layer is initialized to the end paper. In each iteration, the path information computed during forward path marking is used to mark the paths from all papers of the previous layer to their parent papers. The next layer is then updated to the parent papers visited for the first time in this iteration (excluding the starting paper and papers visited before). The loop ends when the next layer is empty. At that point, all paths from the start to the end have been marked.
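The two marking passes can be illustrated with a single-process Python sketch. It shows only the layer-by-layer logic; how each layer would be partitioned across the nodes of a distributed system is not specified here. The function name mark_paths and the dictionary cites (paper to the list of papers it cites) are assumptions made for the example.

```python
def mark_paths(cites, start, end):
    """Return the set of (citing, cited) edges lying on a path from start to end.

    cites: dict mapping a paper to the list of papers it cites (assumed input).
    """
    # Forward pass: record, layer by layer, which papers cite each reached paper.
    parents = {}
    visited = {start}
    layer = {start}
    while layer:
        next_layer = set()
        for paper in layer:
            for cited in cites.get(paper, []):
                parents.setdefault(cited, set()).add(paper)
                # Next layer: first-time papers, excluding the end paper.
                if cited != end and cited not in visited:
                    next_layer.add(cited)
        visited |= next_layer
        layer = next_layer

    # Reverse pass: walk back from the end paper through the recorded parents.
    marked = set()
    visited = {end}
    layer = {end}
    while layer:
        next_layer = set()
        for paper in layer:
            for parent in parents.get(paper, ()):
                marked.add((parent, paper))
                # Next layer: first-time parents, excluding the starting paper.
                if parent != start and parent not in visited:
                    next_layer.add(parent)
        visited |= next_layer
        layer = next_layer
    return marked


# Hypothetical data: S cites A and B, A cites B and T, B cites T, X cites T.
cites = {"S": ["A", "B"], "A": ["B", "T"], "B": ["T"], "X": ["T"]}
marked = mark_paths(cites, "S", "T")
```

In this example the edge from X to T is left unmarked because X is never reached from S in the forward pass, and a paper reached in the forward pass that does not lead to T would likewise be skipped in the reverse pass.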
Specific embodiments of the invention have been described above. It should be understood that the invention is not limited to these particular embodiments; those skilled in the art may make various variations or modifications within the scope of the claims without affecting the substance of the invention.

Claims (7)

1. An academic big data analysis method based on citation relationships between papers, characterized by comprising the following steps:
Step 1: analyze and process a local paper data set, and build a paper citation network in a database;
Step 2: build an analysis algorithm according to the citation relationships in the paper citation network, use the analysis algorithm to obtain the importance of the nodes in the paper citation network and their mutual relationships, and obtain the importance of each paper relative to a center paper;
The center paper is the paper that the user enters as the query;
Step 3: convert the one-to-one citation relationships between papers into a mapping set in the citing direction and a mapping set in the cited direction, obtain the development paths between specified papers in the paper citation network, and calculate the importance of each path from the paper importance obtained in step 2.
2. The academic big data analysis method based on citation relationships between papers according to claim 1, characterized by further comprising step a: performing distributed processing of the analysis algorithm in step 2, and calculating, by multiple nodes of the paper citation network, the importance of every paper in the paper citation network relative to the center paper.
3. The academic big data analysis method based on citation relationships between papers according to claim 1, characterized by further comprising step b: performing distributed processing of the extraction algorithm in step 3, and calculating, by multiple nodes of the paper citation network, the importance of the development paths between specified papers in the paper citation network.
4. The academic big data analysis method based on citation relationships between papers according to claim 1, characterized in that step 1 includes:
Step 1.1: use text processing and analysis techniques to extract the citation information from the local paper data, the citation information recording which papers each paper in the collection cites;
Step 1.2: build the paper citation network;
Step 1.3: compare the citation information in the resulting paper citation network and remove duplicate content, store and index it with database software, and store the citation relationships between papers in the database as key-value pairs.
5. The academic big data analysis method based on citation relationships between papers according to claim 1, characterized in that step 2 includes:
Step 2.1: according to the citation relationships in the paper citation network, calculate the score based on network structure analysis;
Step 2.2: use breadth-first traversal to search the subgraph the user is interested in, and calculate, for each paper, the ratio of its longest path to its shortest path relative to the center paper as the score based on citation-level analysis; the formula used with the breadth-first traversal is as follows:
Score based on citation-level analysis = longest path / shortest path;
Step 2.3: select corresponding weights and combine the score based on network structure analysis with the score based on citation-level analysis to give the final importance of each other paper relative to the center paper; the formula is as follows:
Importance = weight of network structure analysis * score based on network structure analysis + weight of citation-level analysis * score based on citation-level analysis.
6. The academic big data analysis method based on citation relationships between papers according to claim 1, characterized in that step 3 includes:
Step 3.1: convert the one-to-one citation relationships between papers in the database into a mapping set in the citing direction and a mapping set in the cited direction;
Step 3.2: preliminarily analyze the citation relationships of the papers, using the dictionary data structure of the Python programming language;
Step 3.3: extract the information on the paths between two papers.
7. The academic big data analysis method based on citation relationships between papers according to claim 6, characterized in that step 3.1 includes:
Initializing two dictionaries, ref_dic and refed_dic, where ref_dic represents the mapping from a central paper node to its multiple citing papers, and refed_dic represents the mapping from a central paper node to its multiple cited papers;
Traversing each row of data in the database and looking up the left-hand data of the row among the keys of ref_dic; if the left-hand data is already a key of ref_dic, appending the right-hand data of the row to the end of that key's entry; if it is not, saving the left-hand data as a new key with the right-hand data as its entry, so that the one-to-one citation relationships between papers in the database are converted into the mapping set in the citing direction;
For refed_dic, using the right-hand data as the key and the left-hand data as the entry: looking up the right-hand data of the row among the keys of refed_dic; if the right-hand data is already a key of refed_dic, appending the left-hand data of the row to the end of that key's entry; if it is not, saving the right-hand data as a new key with the left-hand data as its entry, so that the one-to-one citation relationships between papers in the database are converted into the mapping set in the cited direction.
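The two scoring formulas of claim 5 can be written out as a short Python sketch. The function names and the sample values are hypothetical, and the score based on network structure analysis is taken as a given input because the claims do not define how it is computed.

```python
def citation_level_score(longest_path, shortest_path):
    """Score based on citation-level analysis = longest path / shortest path."""
    if shortest_path <= 0:
        raise ValueError("shortest_path must be positive")
    return longest_path / shortest_path


def importance(network_score, longest_path, shortest_path, w_network, w_citation):
    """Weighted sum of the network-structure score and the citation-level score."""
    return (w_network * network_score
            + w_citation * citation_level_score(longest_path, shortest_path))


# Hypothetical values: network score 0.5, longest path 4, shortest path 2,
# weights 0.6 and 0.4.
score = importance(0.5, 4, 2, w_network=0.6, w_citation=0.4)
```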
CN201610131343.0A 2016-03-08 2016-03-08 Academic big data analysis method based on citation relationships between papers Active CN105808729B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610131343.0A CN105808729B (en) 2016-03-08 2016-03-08 Academic big data analysis method based on adduction relationship between paper

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610131343.0A CN105808729B (en) 2016-03-08 2016-03-08 Academic big data analysis method based on adduction relationship between paper

Publications (2)

Publication Number Publication Date
CN105808729A true CN105808729A (en) 2016-07-27
CN105808729B CN105808729B (en) 2019-08-23

Family

ID=56467913

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610131343.0A Active CN105808729B (en) 2016-03-08 2016-03-08 Academic big data analysis method based on citation relationships between papers

Country Status (1)

Country Link
CN (1) CN105808729B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090043797A1 (en) * 2007-07-27 2009-02-12 Sparkip, Inc. System And Methods For Clustering Large Database of Documents
CN104537063A (en) * 2014-12-29 2015-04-22 北京理工大学 Knowledge venation map construction system and method based on thesis citation network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
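Wang Shangbing (汪尚兵): "Research on an expert system for commissioning new power plant units based on the PDA mode", China Master's Theses Full-text Database, Engineering Science and Technology II *
Ji Xuemei (纪雪梅) et al.: "Discussion of citation network analysis methods based on the power index", Library and Information Service *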

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108846019A (en) * 2018-05-08 2018-11-20 北京市科学技术情报研究所 A kind of paper sort method based on gold reference algorithm
CN108846019B (en) * 2018-05-08 2019-05-21 北京市科学技术情报研究所 A kind of paper sort method based on gold reference algorithm
CN108874990A (en) * 2018-06-12 2018-11-23 亓富军 A kind of method and system extracted based on power technology journal article unstructured data
CN110119412A (en) * 2019-04-16 2019-08-13 南京昆虫软件有限公司 A kind of quotation source database discriminating conduct
CN112612785A (en) * 2020-11-20 2021-04-06 北京理工大学 Dynamic monitoring method for key development path of unconventional energy technology
CN112612785B (en) * 2020-11-20 2023-11-17 北京理工大学 Dynamic monitoring method for key development path of unconventional energy technology

Also Published As

Publication number Publication date
CN105808729B (en) 2019-08-23

Similar Documents

Publication Publication Date Title
CN109408627B (en) Question-answering method and system fusing convolutional neural network and cyclic neural network
CN109492077B (en) Knowledge graph-based petrochemical field question-answering method and system
CN104834735B (en) A kind of documentation summary extraction method based on term vector
CN103325061B (en) A kind of community discovery method and system
JP4537391B2 (en) Method, information processing apparatus, and program for handling tree-type data structure
CN108920527A (en) A kind of personalized recommendation method of knowledge based map
CN104317801B (en) A kind of Data clean system and method towards big data
CN104331449B (en) Query statement and determination method, device, terminal and the server of webpage similarity
CN111460311A (en) Search processing method, device and equipment based on dictionary tree and storage medium
CN108920556B (en) Expert recommending method based on discipline knowledge graph
CN108563729B (en) Bid winning information extraction method for bidding website based on DOM tree
CN104778256B (en) A kind of the quick of field question answering system consulting can increment clustering method
CN105808696B (en) It is a kind of based on global and local feature across line social network user matching process
CN105808729A (en) Academic big data analysis method based on reference relationship among pieces of thesis
CN112667877A (en) Scenic spot recommendation method and equipment based on tourist knowledge map
CN109308497A (en) A kind of multidirectional scale dendrography learning method based on multi-tag network
CN104063507A (en) Graph computation method and engine
CN104484380A (en) Personalized search method and personalized search device
CN106021457A (en) Keyword-based RDF distributed semantic search method
CN110968761A (en) Self-adaptive extraction method for webpage structured data
WO2014210387A2 (en) Concept extraction
CN108363725A (en) A kind of method of the extraction of user comment viewpoint and the generation of viewpoint label
CN107977393A (en) A kind of recommended engine design method based on data collection of illustrative plates, Information Atlas, knowledge mapping and wisdom collection of illustrative plates towards 5W question and answer
CN106909669A (en) The detection method and device of a kind of promotion message
CN107239549A (en) Method, device and the terminal of database terminology retrieval

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant