CN103020302A - Academic core author excavation and related information extraction method and system based on complex network - Google Patents
- Publication number
- CN103020302A
- Authority
- CN
- China
- Prior art keywords
- author
- corporations
- data
- network
- node
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
The invention belongs to the field of data mining. It addresses the problem of mining the core authors of an academic field and intelligently extracting information about them, and provides an improved method and system for core-author mining and information extraction based on the core-node discovery algorithms of social network analysis. The method combines vertical search, social network analysis, and text analysis to find the core authors or groups of an academic field within a mass of information and then obtain related information about those authors. It uses vertical search to collect open-source literature data, uses bibliometric and complex network analysis techniques to assess the importance of the various social entities in the data, and applies a community discovery algorithm to cluster entities by the closeness of their relationships, thereby identifying academic communities. Users can find core authors or institutions from the entity importance rankings, and find leading teams from the distribution of publication counts across collaborating groups.
Description
Technical field
The present invention relates to the field of data mining, and in particular to a complex-network-based method and system for mining academic core authors and extracting related information.
Background technology
Many real-world networks share a common property: they are made up of communities that are connected to one another. Connections between nodes inside a community are relatively dense, while connections between communities are relatively sparse. The World Wide Web, for example, can be viewed as a large number of website communities, and the websites inside one community often discuss topics of common interest. Similarly, in an author collaboration network or a circuit network, nodes can be divided into communities according to their properties. The number of communities in a network, and the community membership of each node, are therefore significant for the study of complex networks.
There is currently no universally accepted definition of community structure in a network. Many definitions exist, but they fall broadly into two classes:
1. Definitions based on the relative density of edges between nodes. Under this kind of definition, node pairs inside a community are relatively densely connected, while connections between communities are relatively sparse.
2. Definitions based on exact quantitative measures from graph theory. These community structures are all derived from the graph-theoretic definition of a clique. Under this kind of definition, it is generally required that every node inside a community be adjacent to every other, or be non-adjacent to at most a given number of nodes, or that any two nodes be at most a given number of hops apart, and so on.
Current approaches to identifying and recommending domain experts usually build a fuzzy text classifier, apply fuzzy text classification to the documents that experts upload to a knowledge base, and construct an expertise model that also takes into account factors such as document quantity and time. The text corpus used by this approach is incomplete and its coverage is low, which makes it difficult to analyze comprehensively, across several fields, a domain expert's specific contributions and related personal information; the approach therefore has significant limitations. For this reason, the present invention uses network construction, parameter analysis, and community discovery algorithms from complex network analysis, which can be applied effectively to discovering the key figures or core groups of a discipline and obtaining their related information.
Summary of the invention
To address the problem of mining the key figures of an academic field and intelligently extracting their related information, the present invention proposes an improved algorithm and system for academic core-author mining and information extraction, based on the core-node discovery methods of social network analysis. For the literature data of a specific field, the method and system use the network construction, parameter analysis, and community discovery algorithms of complex network analysis to find the core groups or key persons of the field efficiently.
The complex-network-based method for mining academic core authors and extracting related information proposed by the present invention comprises:
Step 1: collecting the literature data of a designated field using vertical search, and cleaning and parsing the literature data to obtain author-related information;
Step 2: extracting the author collaboration network from the author-related information obtained, computing author-related parameters, and obtaining different author rankings according to the different parameters computed;
Step 3: dividing the extracted collaboration network into communities, each resulting community being treated as a research group;
Step 4: presenting the different author rankings and the research groups to the user, and recommending core authors and leading teams to the user according to the author ranking and research groups the user selects.
The present invention also proposes a complex-network-based system for mining academic core authors and extracting related information, comprising:
Data acquisition and cleaning device: collects the literature data of a designated field using vertical search, and cleans and parses the literature data to obtain author-related information;
Parameter analysis and statistics device: extracts the author collaboration network from the author-related information obtained, computes author-related parameters, and obtains different author rankings according to the different parameters computed;
Community division device: divides the extracted collaboration network into communities, each resulting community being treated as a research group;
Result presentation device: presents the different author rankings and the research groups to the user, and recommends core authors and leading teams to the user according to the author ranking and research groups the user selects.
Description of drawings
Fig. 1 is a schematic diagram of the application system of the present invention;
Fig. 2 is a flow chart of the basic usage of the application system of the present invention;
Fig. 3 is a flow chart of the complex-network-based method for mining academic core authors and extracting related information according to the present invention;
Fig. 4 is a flow chart of the data acquisition sub-process of the present invention;
Fig. 5 is a flow chart of the data acquisition configuration sub-process of the present invention;
Fig. 6 is a flow chart of the data analysis and organization sub-process of the present invention;
Fig. 7 is a screenshot of the application system implemented by the present invention.
Embodiment
To make the purpose, technical solutions, and advantages of the present invention clearer, the invention is described in more detail below with reference to specific embodiments and the accompanying drawings.
The complex-network-based method and system for mining academic core authors and extracting related information proposed by the present invention are designed for retrieving information about the core expert groups of a field. The principle of the application system is shown in Fig. 1.
The techniques used are introduced below:
1. Acquisition techniques
1.1 Vertical search
The method uses vertical search: according to the field, conference, and other information the user is interested in, it obtains metadata such as authors, institutions, and conferences from commonly used literature search engines such as CNKI and SpringerLink, automatically downloads and parses the full text of each document, and obtains detailed contact information for the document's authors or institutions.
Vertical search is a specialized search engine for a particular domain, a subdivision and extension of the general search engine. It integrates a particular class of information from the web page library, extracts the required data by targeted, field-by-field extraction, processes it, and returns it to the user in some form. It is a new search engine service model put forward in response to the large information volume, imprecise queries, and insufficient depth of general search engines, providing valuable information and related services for a specific field, a specific group of people, or a specific need. Its characteristics are "specialized, precise, and deep", and it has an industry flavour: compared with the mass of unordered information in a general search engine, a vertical search engine is more focused, specific, and in-depth.
The most important technique in vertical search is the search engine crawler, a technique for automatically fetching information from the web according to certain rules. The crawler of the present system is based on an ordinary crawler with its functions effectively extended; it mainly comprises a domain-specific initial URL subset, a page fetching module, a topic relevance analysis module, a URL deduplication module, a page download module, and other modules. This design ensures good topic relevance, raises the hit rate of topic-relevant pages among those crawled, and matches the user's needs.
1.2 Web page acquisition
In the present technique, web page acquisition is divided mainly into deep web acquisition and dynamic web acquisition. The distinguishing feature of the deep web is that its pages are hidden: in general the user must submit a data request through a form to obtain the returned results. The main feature of dynamic web pages is that they "exist dynamically", that is, they are temporary pages generated dynamically by a program when the user requests them. According to the distribution of information items, dynamic web pages fall into two main types: multi-record dynamic pages and single-record dynamic pages. The main difficulties in extracting them are locating the web page information effectively and representing precisely the different extraction requests defined by different users.
2. Analysis techniques
2.1 Complex network techniques
2.1.1 Basic concepts
A concrete network can be abstracted as a graph G consisting of a node set V and an edge set E. The number of nodes is denoted N = |V(G)| and the number of edges M = |E(G)|. Each edge in E corresponds to a pair of nodes in V. If the node pairs (i, j) and (j, i) always correspond to the same edge, the network is undirected; otherwise it is directed. If the network contains only one type of node and one type of edge, it is called homogeneous; otherwise it is heterogeneous.
2.1.2 Betweenness centrality
Betweenness centrality is defined in terms of a node's ability to control communication in the network. The idea is that if a node lies on the paths that communication between other node pairs in the network must pass through, it necessarily occupies an important position in the network.
2.1.3 Clustering coefficient
The clustering coefficient is often used to describe the transitivity of a network. In a social network, for example, a friend of your friend is likely also to be your friend, and two of your friends are likely also to be friends with each other. The clustering coefficient measures this property of the network.
2.2 Other statistical indicators
2.2.1 H-index
An important measure for evaluating a scientist's influence is the H-index. Its value is based on the number of a scientist's papers and the number of times they are cited. For example, if a scholar has h papers that have each been cited at least h times, the scholar's H-index is h. It follows that the larger a scholar's H-index, the greater the scholar's influence in the research field. The H-index takes into account both the quantity and the quality of a scholar's published output.
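The H-index definition above can be illustrated with a short Python sketch. The function name `h_index` and the input format (a plain list of per-paper citation counts) are assumptions made for illustration; the patent does not specify an implementation.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each.

    `citations` is a hypothetical input format: one citation count per paper.
    """
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break
    return h
```

Sorting in descending order means the loop can stop at the first paper whose citation count falls below its rank.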
2.2.2 APS value (average output score)
The APS value is defined as follows: for a paper with n authors, each author receives a score of 1/n. An author's APS is the sum of the scores over all of that author's papers. It describes the author's degree of contribution to the papers the author has published.
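A minimal Python sketch of the APS computation follows. It assumes each paper is given as a list of author names; the function name `aps_scores` and this input format are illustrative choices, not part of the patent.

```python
from collections import defaultdict


def aps_scores(papers):
    """Compute the APS of every author.

    `papers` is assumed to be a list of author-name lists, one per paper.
    Each of the n distinct authors of a paper receives 1/n for that paper.
    """
    scores = defaultdict(float)
    for authors in papers:
        unique_authors = set(authors)  # guard against a name listed twice
        if not unique_authors:
            continue
        share = 1.0 / len(unique_authors)
        for author in unique_authors:
            scores[author] += share
    return dict(scores)
```

For example, an author with one solo paper, one two-author paper, and one four-author paper would score 1 + 1/2 + 1/4 = 1.75.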
The present invention proposes a complex-network-based method for academic core-author mining and information extraction. The basic usage flow of the application system is shown in Fig. 2, and the flow of the core-author mining and related information extraction method is shown in Fig. 3. The specific steps are as follows:
Step 1: data acquisition and cleaning. The method uses vertical search to collect the papers of a specified conference. The acquisition flow is shown in Fig. 4. This step comprises three stages:
Stage 1: basic data acquisition, which specifically comprises: step a) determining the acquisition conditions; this procedure is shown in Fig. 5. First the retrieval type is determined; there are three retrieval types: journal, conference, and keyword. Then, according to the type, search conditions such as search terms and time range are determined, for example the conference configuration conditions (terms related to the conference, etc.), the literature search sources, and the retrieval time. Data sources are then chosen from various domestic and foreign databases. Together these form the set of search conditions. The conference configuration conditions must be entered by the user; the remaining configuration conditions are adjusted automatically by the system; step b) dynamically configuring the acquisition settings according to the acquisition conditions: acquisition settings are configured separately for each selected data source site, such as CNKI and SpringerLink; for example, if the retrieval type is journal, the acquisition settings are configured for journals; step c) collecting the basic literature data. Here vertical search is used: according to the field, conference, and other information the user is interested in, and by means of the initial URL subset, page fetching module, topic relevance analysis module, URL deduplication module, page download module, and other modules, metadata such as authors, institutions, and conferences is obtained from commonly used literature search engines such as CNKI and SpringerLink, and the full text of each document is automatically downloaded and parsed.
Stage 2: data cleaning and organization, which specifically comprises: step d) data cleaning, which mainly standardizes author names by removing redundant characters such as spaces, and merges institutions to some degree, for example replacing a second-level unit with its first-level organization; step e) obtaining the specified information: the main research object of the present invention is the author, so this step obtains basic author information, namely the author's name and a unique ID assigned by the system.
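The name standardization and ID assignment of stage 2 might look roughly like the following Python sketch. The function names, the Unicode normalization step, and the case-insensitive matching are assumptions added for illustration; the patent only states that redundant characters such as spaces are removed and that each author receives a unique system ID.

```python
import re
import unicodedata


def normalize_name(raw):
    """Standardize an author name (assumed rules, for illustration only)."""
    text = unicodedata.normalize("NFKC", raw)  # fold full-width characters
    return re.sub(r"\s+", " ", text).strip()   # collapse redundant whitespace


def assign_author_ids(names):
    """Map each distinct normalized name to a sequential ID starting at 1.

    Matching names case-insensitively is an assumption, not a patent rule.
    """
    ids = {}
    for name in names:
        key = normalize_name(name).casefold()
        if key and key not in ids:
            ids[key] = len(ids) + 1
    return ids
```

A real system would also need to disambiguate different authors who share a name, which this sketch does not attempt.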
Stage 3: storing the information, which specifically comprises: step f) presenting the results to the user, who judges whether they are satisfactory; if so, step g) is carried out, otherwise the procedure returns to step a) for reconfiguration; step g) storing the basic literature information and author information in the specified database; step h) the system determines whether data is to be collected periodically; if so, it waits for a period of time and collects again, otherwise the acquisition step ends.
Step 2: parameter statistics and analysis. The data analysis and organization sub-flow is shown in Fig. 6. The research objects of the method are the core authors and groups of the designated field. It is therefore necessary to analyze the authors' bibliometric parameters, rank all authors by each parameter value, and thereby identify the core authors of the field. The statistical parameters include the distribution of each author's publication count and the distribution of author APS (average output score). In addition, the author collaboration network is extracted from co-authorship relations, and each author's node betweenness centrality, degree, and clustering coefficient in the collaboration network, together with the H-index, are analyzed. Node betweenness centrality measures the extent to which an author can control contact between other people: if a node lies on the shortest paths between many other node pairs, it has high betweenness centrality, and the author can be regarded as occupying an important position. Degree indicates how many people an author has collaborated with. The clustering coefficient is the proportion of a node's neighbours that are also neighbours of each other, that is, the completeness of the small cluster structure around the node; it measures how clustered the author's node is in the network. The H-index means that h of the author's papers have each been cited at least h times, in which case the scholar's H-index is h; it measures the author's influence in the research field. The author rankings obtained from the different parameters are saved; that is, different author rankings are obtained according to the author's publication count, APS, node betweenness centrality, degree, and clustering coefficient in the collaboration network, and H-index.
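As a rough illustration of step 2, the Python sketch below builds an undirected co-authorship network from per-paper author lists, counts publications, and ranks authors by any numeric score. All function names and the input format are hypothetical; the patent does not describe its data structures.

```python
from itertools import combinations


def build_coauthor_network(papers):
    """Return {author: set of co-authors} from a list of author-name lists."""
    graph = {}
    for authors in papers:
        unique_authors = sorted(set(authors))
        for author in unique_authors:
            graph.setdefault(author, set())  # keep solo authors as isolated nodes
        for a, b in combinations(unique_authors, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def publication_counts(papers):
    """Count how many papers each author appears on."""
    counts = {}
    for authors in papers:
        for author in set(authors):
            counts[author] = counts.get(author, 0) + 1
    return counts


def rank_authors(scores):
    """Order authors by descending score, breaking ties by name."""
    return sorted(scores, key=lambda author: (-scores[author], author))
```

The degree of each author is then `len(graph[author])`, and the same `rank_authors` helper can be applied to each parameter in turn to produce the separate rankings the step describes.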
Step 3: group analysis using a community division algorithm. The method divides the author collaboration network into communities; each resulting community corresponds to a research group. The publication count distribution is then computed for each research group as a whole.
Step 4: presentation of author rankings and research group information. The different author rankings saved in step 2 and the research groups found in step 3 are presented to the user. According to the author ranking and research group ranking the user selects, important authors are recommended as research leaders and important groups as core teams.
Step 5: extraction and presentation of core author information. The user selects major scholars of the field as core authors as required; the system automatically extracts their personal information from the literature data and presents it to the user for related services or research use.
In stage 1 of step 1, the literature is acquired by a combination of deep web acquisition and dynamic web acquisition.
Deep web acquisition works in three steps: 1) analyze the pages and look for forms; 2) learn how to fill in the forms; 3) identify and retrieve the result pages. In the first step, the deep web crawler starts from the site home page and crawls for form pages, using a set of heuristic rules to find forms. In the second step, it extracts labels from the form and, with the help of the domain knowledge base and the site's identifying features (user name, password, or verification code), tries to learn how to fill in the form correctly. In the last step, it submits the form and then retrieves the result pages and identifies the records. In addition, during deep web acquisition the crawler must intelligently recognize knowledge of the specific application domain on the basis of the domain knowledge base, to ensure the relevance and accuracy of the information collected.
In dynamic web acquisition, extracting information from multi-record dynamic pages requires tree edit distance and tree merging algorithms to locate and extract the web page information. Tree edit distance is used to locate the extraction structure of the page precisely: the dynamic page is converted into a tag tree, the separate data items in the page are located, and an individual data item tree is generated for each data item. For pattern extraction over many data items, the tree merging model is used to handle repeated and optional data items and to generate the wrapper tree used for extraction, that is, the final extractor. When extracting information from single-record dynamic pages, the user defines the data items to be extracted through an optional module, and the system generates an extraction template from the data items selected. During extraction, the page is first converted into a tag tree; the web page information is then matched against the user-defined extraction template, extracted, and saved.
In step c) of stage 1, the main literature sources are CNKI and SpringerLink. The content collected comprises the document title, document full text, authors, keywords, author institutions, the publication in which the document appeared, and the publication date.
In step 2, betweenness centrality is defined as:
$$C_B(i) = \sum_{j<k} \frac{g_{jk}(i)}{g_{jk}}$$
where $g_{jk}(i)$ denotes the number of shortest paths between nodes j and k that pass through node i, and $g_{jk}$ denotes the number of shortest paths between node j and node k. For directed networks, the direction of paths must be taken into account. The concept of betweenness centrality is very important in social network analysis. Besides defining the betweenness of a node, the concept can also be used to define the betweenness of an edge, which measures the importance of the edge in the network.
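One well-known way to evaluate this quantity is Brandes' algorithm, sketched below in Python for an undirected, unweighted network stored as a dict of neighbour sets. The patent does not say which algorithm its system uses, so this is an assumption chosen for illustration, as are the function name and graph format.

```python
from collections import deque


def betweenness_centrality(graph):
    """Unnormalized node betweenness for an undirected, unweighted graph.

    `graph` maps each node to the set of its neighbours (assumed format).
    """
    centrality = {v: 0.0 for v in graph}
    for source in graph:
        # Breadth-first search that counts shortest paths from `source`.
        order = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}  # number of shortest paths from source
        sigma[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies from the farthest nodes back to source.
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                centrality[w] += delta[w]
    # Each unordered pair (j, k) was visited from both endpoints.
    return {v: c / 2.0 for v, c in centrality.items()}
```

On a three-node path A-B-C, the middle node B lies on the only shortest path between A and C and therefore has betweenness 1, while A and C have 0.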
The network clustering coefficient indicates whether the neighbours of a node are themselves connected, and is a measure of the transitivity of the network. Informally, a neighbour of a node's neighbour is likely also to be a neighbour of that node. It is defined as:
$$C = \frac{3 N_\triangle}{N_3}$$
where $N_\triangle$ is the number of triangles in the collaboration network and $N_3$ is the number of connected triples in the collaboration network. A connected triple is a set of three nodes including a given node, in which there are at least two edges from the given node to the other two nodes.
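The ratio of three times the triangle count to the connected-triple count can be computed directly, as in the Python sketch below. The graph format (dict of neighbour sets, undirected) and the function name are assumptions for illustration.

```python
def transitivity(graph):
    """Return 3 * (number of triangles) / (number of connected triples).

    `graph` maps each node to the set of its neighbours (undirected).
    """
    closed = 0   # triples whose two outer nodes are also linked; equals 3 * triangles
    triples = 0  # connected triples, counted once per centre node
    for centre, neighbours in graph.items():
        nbrs = list(neighbours)
        k = len(nbrs)
        triples += k * (k - 1) // 2
        for i in range(k):
            for j in range(i + 1, k):
                if nbrs[j] in graph[nbrs[i]]:
                    closed += 1
    return closed / triples if triples else 0.0
```

Counting closed triples per centre node visits every triangle three times, once from each corner, which is where the factor 3 in the formula comes from.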
In step 3, the community division algorithm is the fast community division algorithm adapted to directed networks. The fast community division algorithm is an improved algorithm based on the modularity concept introduced for the GN algorithm.
The GN algorithm is introduced first:
A simple way to divide a network into groups is to remove the edges that connect different communities; this is the central idea of divisive methods. The community discovery algorithm proposed by Girvan and Newman, the GN algorithm, is the best-known divisive method for community discovery. The algorithm uses the edge betweenness introduced above: according to the degree to which an edge does not belong to any community, it removes such edges step by step until all edges have been removed. By the definition of a community, edges between communities have larger edge betweenness than edges inside a community. By progressively removing the edges with high edge betweenness, the network is divided into communities.
However, in the worst case the GN algorithm must recompute all edge betweenness values every time it removes an edge, so it is suitable only for medium-sized social networks. Many studies have improved on this shortcoming from different angles. In addition, the GN algorithm as proposed gives no quantitative definition of community structure. It is therefore not possible to judge directly from the network topology whether the resulting community structure is meaningful in practice; moreover, when the number of communities is unknown, the GN algorithm does not know at which step the decomposition should stop. To solve this problem, Newman et al. introduced a measure of the quality of a community division, the modularity (Q), defined as:
$$Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(C_i, C_j)$$
where A is the adjacency matrix and $A_{ij}$ is the edge weight; $k_i$ and $k_j$ are the degrees of nodes i and j, the degree of a node (vertex) being the number of edges incident to it; m is the total number of edges in the graph; and $C_i$ and $C_j$ denote the communities to which nodes i and j belong. If node i and node j belong to the same community, the δ function takes the value 1, otherwise 0.
The fast community division algorithm is introduced below:
The GN algorithm is an important milestone in the field of community discovery, but because of its high complexity it is limited to medium-sized social networks. For this reason, a fast community division algorithm was proposed on the basis of the GN algorithm. This fast algorithm is in fact an agglomerative algorithm based on the greedy approach. Its steps are as follows: 1. initialize the network as n communities, that is, each node is a community on its own; 2. merge, in turn, pairs of communities that are joined by edges, and compute the modularity (Q) increment after each merge; 3. repeat step 2, merging communities continually until the whole network has been merged into one community. Each merge in the algorithm corresponds to a modularity value; the division corresponding to the local maximum of modularity is the best community division.
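The three steps above can be sketched in Python as follows, for an undirected, unweighted network. The sketch assumes that at each round the merge giving the highest resulting modularity is chosen, as in Newman's greedy method; the text does not state the selection rule explicitly. It also recomputes Q from scratch for every candidate merge, which is simple but far slower than the incremental updates a real implementation would use. All names are hypothetical.

```python
def partition_modularity(graph, labels):
    """Q computed per community: e_c / m - (d_c / 2m)^2, summed over c."""
    two_m = sum(len(nbrs) for nbrs in graph.values())
    if two_m == 0:
        return 0.0
    internal, degree = {}, {}
    for v, nbrs in graph.items():
        c = labels[v]
        degree[c] = degree.get(c, 0) + len(nbrs)
        # Each internal edge is seen from both ends, so this sums to 2 * e_c.
        internal[c] = internal.get(c, 0) + sum(1 for w in nbrs if labels[w] == c)
    return sum(internal[c] / two_m - (degree[c] / two_m) ** 2 for c in degree)


def greedy_communities(graph):
    """Agglomerative community detection; returns (labels, best Q)."""
    # Step 1: every node starts as its own community (integer labels).
    labels = {v: index for index, v in enumerate(graph)}
    best_q = partition_modularity(graph, labels)
    best_labels = dict(labels)
    while True:
        # Step 2: candidate merges are community pairs joined by an edge.
        pairs = {(min(labels[v], labels[w]), max(labels[v], labels[w]))
                 for v in graph for w in graph[v] if labels[v] != labels[w]}
        if not pairs:
            break  # Step 3 ends: no connected communities remain to merge
        best_merge = None
        for a, b in sorted(pairs):
            trial = {v: (a if c == b else c) for v, c in labels.items()}
            q = partition_modularity(graph, trial)
            if best_merge is None or q > best_merge[0]:
                best_merge = (q, trial)
        q, labels = best_merge
        if q > best_q:  # remember the division with the highest Q seen so far
            best_q, best_labels = q, dict(labels)
    return best_labels, best_q
```

On two triangles joined by one edge, this procedure recovers the two triangles as communities, with Q = 5/14.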
However, its complexity is still rather high. Moreover, most community discovery algorithms at present still concentrate on undirected networks, whereas most networks of practical interest are in fact directed, for example the World Wide Web, telephone call networks, email communication networks, and biological networks. Performing community discovery while ignoring the direction of network connections discards important information in the network structure and biases the community discovery results.
For directed networks, a revised modularity function is adopted:
$$Q = \frac{1}{m}\sum_{i,j}\left[A_{ij} - \frac{k_i^{in}\,k_j^{out}}{m}\right]\delta(C_i, C_j)$$
where A is the adjacency matrix, $A_{ij}$ denotes the edge weight, and δ is the Kronecker delta symbol: if node i and node j belong to the same community, the δ function takes the value 1; otherwise it takes the value 0. $k_i^{in}$ is the in-degree of node i and $k_j^{out}$ is the out-degree of node j; the in-degree of a node is the number of edges entering the node, and the out-degree of a node is the number of edges leaving it. m is the total number of edges in the network.
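The directed modularity described here can be evaluated directly from an edge list. The sketch below is illustrative only and rests on assumptions not stated in the text: edges are unweighted pairs `(u, v)` meaning "u points to v", the expected-edge term pairs the in-degree of one node with the out-degree of the other as in the definitions above, and the function name `directed_modularity` is hypothetical.

```python
from collections import defaultdict


def directed_modularity(edges, community):
    """Directed modularity Q for a node -> community-label mapping.

    edges: list of (source, target) pairs, one per directed edge.
    """
    m = len(edges)
    in_deg, out_deg = defaultdict(int), defaultdict(int)
    intra = 0  # edges whose two endpoints share a community
    for u, v in edges:
        out_deg[u] += 1
        in_deg[v] += 1
        if community[u] == community[v]:
            intra += 1
    # Sum in-degrees and out-degrees per community; the double sum over
    # node pairs (i, j) in one community factorises into their product.
    in_sum, out_sum = defaultdict(int), defaultdict(int)
    for node, c in community.items():
        in_sum[c] += in_deg[node]
        out_sum[c] += out_deg[node]
    expected = sum(in_sum[c] * out_sum[c] for c in in_sum) / m
    return (intra - expected) / m


# Two directed 3-cycles joined by one edge from node 2 to node 3.
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
two_groups = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
q_split = directed_modularity(edges, two_groups)
q_single = directed_modularity(edges, {n: "A" for n in range(6)})
```

Here `q_split` is positive while `q_single` is zero, so the two-cycle division scores higher than lumping all nodes together.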
In step 5, the contact details of a contact person may be the contact person's e-mail address. The e-mail information is obtained from the original document text extracted in step 1. The system parses the document text automatically, uses regular expressions to match text in e-mail format, and extracts all e-mail information contained in the document text. The system uses three regular expressions at the same time to match target e-mail addresses:
Regex1 = "\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*"
Regex2 = "\{?(\w*([-+.]\w+)*,(\s)*)*\w*([-+.]\w+)*\}?@\w+([-.]\w+)*\.\w*([-.]\w*)*"
Regex3 = "(\s)*e-mail:(\s)*\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*(\s)*"
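As an illustration of how the three expressions behave, the following Python sketch applies them to short sample strings. The patent does not say which regex engine the system uses; Python's `re` module is assumed here, the sample addresses are invented, and the helper `expand_grouped`, which splits a grouped form such as `{alice, bob}@example.edu` into individual addresses, is a hypothetical post-processing step that the text does not describe.

```python
import re

# The three expressions from the description, as raw strings.
REGEX1 = re.compile(r"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*")
REGEX2 = re.compile(
    r"\{?(\w*([-+.]\w+)*,(\s)*)*\w*([-+.]\w+)*\}?@\w+([-.]\w+)*\.\w*([-.]\w*)*"
)
REGEX3 = re.compile(
    r"(\s)*e-mail:(\s)*\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*(\s)*"
)


def find_all(pattern, text):
    """Return the full text of every match (not the capture groups)."""
    return [m.group(0) for m in pattern.finditer(text)]


def expand_grouped(match_text):
    """Hypothetical helper: split '{a, b}@host' into ['a@host', 'b@host']."""
    local, domain = match_text.rsplit("@", 1)
    names = [n.strip() for n in local.strip("{}").split(",") if n.strip()]
    return [f"{name}@{domain}" for name in names]


plain = find_all(REGEX1, "Contact alice.smith@example.edu or bob@cs.example.org")
grouped = find_all(REGEX2, "{alice, bob}@example.edu")
labelled = find_all(REGEX3, "e-mail: carol@example.org")
expanded = expand_grouped(grouped[0])
```

`finditer` with `group(0)` is used instead of `findall` because the expressions contain capture groups, and `findall` would return the groups rather than the whole address.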
The specific embodiments described above further explain the purpose, technical solution, and beneficial effects of the present invention; a screenshot of the application system implementing the invention is shown in Figure 7. It should be understood that the foregoing are merely specific embodiments of the invention and do not limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.
Claims (10)
1. A method, based on complex networks, for mining academic core authors and extracting related information, comprising:
Step 1: collecting literature data of a specified field using vertical search technology, and organizing and analyzing the literature data to obtain author-related information;
Step 2: extracting an author collaboration network from the obtained author-related information, computing author-related parameters, and obtaining different author ranking information according to the different parameters computed;
Step 3: dividing the extracted collaboration network into communities, each community after division being treated as one research group;
Step 4: presenting the different author ranking information and the research groups to the user, and recommending core authors and leading teams to the user according to the author ranking information and research group selected by the user.
2. The method of claim 1, characterized in that collecting the literature data of the specified field in step 1 specifically comprises:
Step 11: determining the collection conditions, including determining the retrieval type and determining the retrieval conditions according to the retrieval type;
Step 12: dynamically configuring the collection information according to the collection conditions;
Step 13: obtaining the literature data according to the collection conditions and the collection information.
3. The method of claim 1, characterized in that organizing and analyzing the data in step 1 to obtain the author-related information specifically comprises:
Step 14: performing data cleaning;
Step 15: obtaining the specified author-related information.
4. The method of claim 1, characterized in that step 1 further comprises presenting the obtained author-related information to the user, the user deciding whether the data need to be collected again; if so, the collection conditions are reconfigured and data collection is carried out according to the reconfigured collection conditions.
5. The method of claim 1, characterized in that the parameters in step 2 comprise the distribution of the authors' publication counts, the authors' average output score, and each author's node betweenness centrality, degree distribution, network clustering coefficient, and H-index in the collaboration network.
6. The method of claim 5, characterized in that the node betweenness centrality is calculated according to the following formula:
$$C_B(i) = \sum_{j<k} \frac{g_{jk}(i)}{g_{jk}}$$
where $g_{jk}(i)$ denotes the number of shortest paths between node j and node k that pass through node i, and $g_{jk}$ denotes the number of shortest paths between node j and node k;
the network clustering coefficient is obtained according to the following formula:
$$C = \frac{3N_{\triangle}}{N_3}$$
where $N_{\triangle}$ is the number of triangles in the collaboration network and $N_3$ is the number of connected triples in the collaboration network.
7. the method for claim 1 is characterized in that, corporations described in the step 3 divide the quick group dividing method that adopts for directed networks, specifically comprise:
Step 31, the described cooperative network of initialization are n corporations, and namely each node is independent corporations;
Step 32, be associated with the corporations that the limit links to each other successively, and calculate the modularity value after merging;
Step 33, repeated execution of steps 32, until whole cooperative network all is merged into corporations, wherein, when the modularity value was maximum, corresponding corporations were the corporations after final the division after merging.
9. the method for claim 1 is characterized in that, the method also comprises:
Step 5, analysis data in literature, the personal information that extracts Core Authors also offers the user.
10. the academic Core Authors based on complex network excavates and the relevant information extraction system, and it comprises:
Data acquisition and collating unit: be used for adopting the vertical search technology to gather the data in literature of designated field, and described data in literature is carried out finishing analysis, to obtain author's relevant information;
Parameter analytic statistics device: extract author's cooperative network according to author's relevant information of obtaining, and add up the parameter that the author is correlated with, obtain different author's ranking informations according to the different correlation parameters of adding up;
Corporations divide device: the cooperative network that extracts is carried out corporations divide, the corporations after the division are as a scientific research colony;
Exhibiting device as a result: show described different author's ranking information and scientific research colony to the user, and recommend Core Authors and leader team according to user-selected author's ranking information and scientific research colony for the user.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210592828.1A CN103020302B (en) | 2012-12-31 | 2012-12-31 | Academic core author excavation and related information extraction method and system based on complex network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210592828.1A CN103020302B (en) | 2012-12-31 | 2012-12-31 | Academic core author excavation and related information extraction method and system based on complex network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103020302A true CN103020302A (en) | 2013-04-03 |
CN103020302B CN103020302B (en) | 2016-03-02 |
Family
ID=47968905
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210592828.1A (Active, CN103020302B (en)) | 2012-12-31 | 2012-12-31 | Academic core author excavation and related information extraction method and system based on complex network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103020302B (en) |
Cited By (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103235874A (en) * | 2013-04-08 | 2013-08-07 | 浙江大学医学院附属第二医院 | Intelligent control system for clinical use knowledge library of antibacterial drugs in hospital |
CN104156437A (en) * | 2014-08-13 | 2014-11-19 | 中科嘉速(北京)并行软件有限公司 | Academic relationship network construction method based on paper author information extraction and relationship weight model |
CN104537063A (en) * | 2014-12-29 | 2015-04-22 | 北京理工大学 | Knowledge venation map construction system and method based on thesis citation network |
CN104573060A (en) * | 2015-01-23 | 2015-04-29 | 徐立水 | Batched doctor information generation method and device applied to medical websites |
CN104933621A (en) * | 2015-06-19 | 2015-09-23 | 天睿信科技术(北京)有限公司 | Big data analysis system and method for guarantee ring |
CN105260849A (en) * | 2015-10-21 | 2016-01-20 | 内蒙古科技大学 | Scientific researcher evaluation method across social networks |
CN105302882A (en) * | 2015-10-14 | 2016-02-03 | 东软集团股份有限公司 | Keyword obtaining method and apparatus |
CN105512316A (en) * | 2015-12-15 | 2016-04-20 | 中国科学院自动化研究所 | Knowledge service system combining mobile terminal |
CN105978743A (en) * | 2016-07-25 | 2016-09-28 | 广东科学技术职业学院 | Journal evaluation method based on high-order aggregation coefficient |
CN106021352A (en) * | 2016-05-10 | 2016-10-12 | 南京大学 | Community analysis-based academic search engine ranking method |
CN106022936A (en) * | 2016-05-25 | 2016-10-12 | 南京大学 | Influence maximization algorithm based on community structure and applicable to paper cooperation network |
CN106055604A (en) * | 2016-05-25 | 2016-10-26 | 南京大学 | Short text topic model mining method based on word network to extend characteristics |
CN106227835A (en) * | 2016-07-25 | 2016-12-14 | 中南大学 | Team's research direction method for digging based on two subnetwork figure hierarchical clusterings |
CN107092651A (en) * | 2017-03-14 | 2017-08-25 | 中国科学院计算技术研究所 | A kind of key person's method for digging analyzed based on communication network data and system |
CN107103551A (en) * | 2017-03-20 | 2017-08-29 | 重庆邮电大学 | A kind of coauthorship network community division method of selected seed node |
CN108595713A (en) * | 2018-05-14 | 2018-09-28 | 中国科学院计算机网络信息中心 | The method and apparatus for determining object set |
CN109086399A (en) * | 2018-07-30 | 2018-12-25 | 中国人民解放军军事科学院系统工程研究院 | A kind of analysis of comprehensive contribution degree and integrated visual technique of expression |
CN109657122A (en) * | 2018-12-10 | 2019-04-19 | 大连理工大学 | A kind of Academic Teams' important member's recognition methods based on academic big data |
CN109829634A (en) * | 2019-01-18 | 2019-05-31 | 北京工业大学 | A kind of adaptive patent Research Team, colleges and universities recognition methods |
CN110674183A (en) * | 2019-08-23 | 2020-01-10 | 上海科技发展有限公司 | Scientific research community division and core student discovery method, system, medium and terminal |
CN110825935A (en) * | 2019-09-26 | 2020-02-21 | 福建新大陆软件工程有限公司 | Community core character mining method, system, electronic equipment and readable storage medium |
WO2020042501A1 (en) * | 2018-08-27 | 2020-03-05 | 平安科技(深圳)有限公司 | Method and system for fund manager social group division, computer device, and storage medium |
CN110990662A (en) * | 2019-11-22 | 2020-04-10 | 北京市科学技术情报研究所 | Domain expert selection method based on citation network and scientific research cooperation network |
CN111090984A (en) * | 2019-11-25 | 2020-05-01 | 江苏大学 | Community division system and method based on document coupling analysis |
CN111090801A (en) * | 2019-12-18 | 2020-05-01 | 创新奇智(青岛)科技有限公司 | Expert interpersonal relationship atlas drawing method and system |
CN114741522A (en) * | 2022-03-11 | 2022-07-12 | 北京师范大学 | Text analysis method and device, storage medium and electronic equipment |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102298579A (en) * | 2010-06-22 | 2011-12-28 | 北京大学 | Scientific and technical literature-oriented model and method for sequencing papers, authors and periodicals |
CN102521337A (en) * | 2011-12-08 | 2012-06-27 | 华中科技大学 | Academic community system based on massive knowledge network |
CN102609546A (en) * | 2011-12-08 | 2012-07-25 | 清华大学 | Method and system for excavating information of academic journal paper authors |
CN102646122A (en) * | 2012-02-21 | 2012-08-22 | 北京航空航天大学 | Automatic building method of academic social network |
US20120310928A1 (en) * | 2011-06-01 | 2012-12-06 | Microsoft Corporation | Discovering expertise using document metadata in part to rank authors |
Also Published As
Publication number | Publication date |
---|---|
CN103020302B (en) | 2016-03-02 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |