CN103020302A

CN103020302A - Academic core author excavation and related information extraction method and system based on complex network

Info

Publication number: CN103020302A
Application number: CN2012105928281A
Authority: CN
Inventors: 陆浩; 王飞跃; 温婉婷; 甘润生; 孙星恺
Original assignee: Institute of Automation of Chinese Academy of Science
Current assignee: Institute of Automation of Chinese Academy of Science
Priority date: 2012-12-31
Filing date: 2012-12-31
Publication date: 2013-04-03
Anticipated expiration: 2032-12-31
Also published as: CN103020302B

Abstract

The invention belongs to the field of data mining. To address the problem of mining the core authors of an academic field and intelligently extracting their related information, the invention proposes an improved method and system for academic core author mining and information extraction, based on the core node discovery algorithm used in social network analysis. The method integrates vertical search, social network analysis and text analysis, and can find the core authors or groups of an academic field in massive amounts of information and then obtain their personal information. The invention uses vertical search technology to collect open-source literature data, and analyzes the importance of the various social entities present in the data using bibliometric and complex network analysis techniques. A community discovery algorithm then clusters the entities according to the closeness of the relationships between them in order to discover academic communities. Users sort the entities by importance to find the core authors or institutions, and find the leadership team according to the distribution of the number of publications of the cooperating groups.

Description

Method and system for academic core author mining and related information extraction based on complex network

Technical field

The invention relates to the field of data mining, and in particular to a complex-network-based method and system for academic core author mining and related information extraction.

Background

Many real-world networks share a common property: they are formed by communities connected to one another through shared nodes. Connections between nodes within a community are relatively dense, while connections between communities are relatively sparse. For example, the World Wide Web can be regarded as consisting of a large number of website communities, and the sites within one community usually discuss topics of common interest. Similarly, in an author collaboration network or a circuit network, the nodes can be divided into different communities according to their properties. Therefore, the number of communities in a network, and the communities to which each node belongs, are of great significance to the study of complex networks.

There is currently no generally accepted standard definition of community structure in a network. Many definitions exist, but they generally fall into two categories:

1. Definitions that use the relative density of edges between node pairs to measure community structure. Under such a definition, the connections between node pairs within each community are relatively dense, while the connections between communities are relatively sparse.

2. Definitions that use exact quantitative indicators from graph theory. These community structures are all derived from the definition of a clique in graph theory. Such definitions generally require that every node in the community be adjacent to every other node, or be non-adjacent to at most a given number of nodes, or that any two nodes be at most a given number of hops apart, and so on.

At present, expert identification and recommendation in a field usually works by constructing a fuzzy text classifier, applying fuzzy text classification to the documents that experts upload to a knowledge base, and building an expert knowledge model that also takes factors such as quantity and time into account. This approach suffers from an incomplete text corpus and low coverage, and it is difficult to use it to comprehensively analyze, across multiple fields, the specific contributions and related personal information of the experts in a field, so it has significant limitations. For this reason, the present invention uses the network construction, parameter analysis and community discovery algorithms of complex network analysis, which can be used effectively to discover the core figures or core groups of a subject field and to obtain their related information.

Summary of the invention

To address the problem of mining the core figures of an academic field and intelligently extracting their related information, the present invention proposes an improved algorithm and system for academic core author mining and information extraction, based on the core node discovery method used in social network analysis. The method and system work on the literature data of a specific field and use the network construction, parameter analysis and community discovery algorithms of complex network analysis to efficiently find the core groups or key figures of the field.

The complex-network-based method for academic core author mining and related information extraction proposed by the present invention comprises:

Step 1. Use vertical search technology to collect literature data in the designated field, and collate and analyze the literature data to obtain author-related information;

Step 2. Extract the author collaboration network from the obtained author-related information, compute author-related parameters, and obtain different author rankings according to the different parameters computed;

Step 3. Divide the extracted collaboration network into communities, each resulting community being treated as a scientific research group;

Step 4. Display the different author rankings and the scientific research groups to the user, and recommend core authors and leadership teams to the user according to the author rankings and scientific research groups that the user selects.

The present invention also proposes a complex-network-based system for academic core author mining and related information extraction, which comprises:

Data collection and collation device: used to collect literature data in the designated field using vertical search technology, and to collate and analyze the literature data to obtain author-related information;

Parameter analysis and statistics device: extracts the author collaboration network from the obtained author-related information, computes author-related parameters, and obtains different author rankings according to the different parameters computed;

Community division device: divides the extracted collaboration network into communities, each resulting community being treated as a scientific research group;

Result display device: displays the different author rankings and the scientific research groups to the user, and recommends core authors and leadership teams to the user according to the author rankings and scientific research groups that the user selects.

Description of drawings

Fig. 1 is a schematic diagram of the application system of the present invention;

Fig. 2 is a simple usage flow chart of the application system of the present invention;

Fig. 3 is a flow chart of the complex-network-based method for academic core author mining and related information extraction of the present invention;

Fig. 4 is a sub-flow chart of data collection in the present invention;

Fig. 5 is a sub-flow chart of data collection configuration in the present invention;

Fig. 6 is a sub-flow chart of data analysis and collation in the present invention;

Fig. 7 is a screenshot of the application system implemented according to the present invention.

Detailed description

To make the object, technical solution and advantages of the present invention clearer, the present invention is described in further detail below in conjunction with specific embodiments and with reference to the accompanying drawings.

The complex-network-based method and system for academic core author mining and related information extraction proposed by the present invention are designed for retrieving information about the core expert groups of a field. The principle of the application system is shown in Figure 1.

The technologies used in the present invention are introduced below:

1. Collection technology

1.1 Vertical search

This method uses vertical search technology. According to the fields, conferences and other information that the user is interested in, it obtains the relevant author, institution, conference and other metadata from commonly used literature search engines such as CNKI and SpringerLink, automatically downloads and parses the full text of the literature, and obtains the detailed contact information of the authors or institutions.

Vertical search is a specialized search engine for a particular field, a subdivision and extension of the general search engine. It integrates a certain kind of specialized information in the web page repository, extracts the required data field by field, processes it, and returns it to the user in some form. It is a new search engine service model proposed in response to the large information volume, inaccurate queries and insufficient depth of general search engines, and it provides information and related services of value for a specific field, a specific group of people or a specific need. Its characteristics are "specialized, precise and deep", and it has an industry flavor: compared with the unordered mass of information in general search engines, vertical search engines are more focused, specific and in-depth.

The most important technology in vertical search is the search engine crawler, which automatically fetches information from the web according to given rules. The crawler in this system is based on an ordinary crawler with its functions extended, and mainly includes a field-related initial URL seed set, a page fetching module, a topic relevance analysis module, and URL deduplication and page download modules. This design keeps the system focused on the topic, which raises the hit rate of topic-relevant pages among the pages crawled and matches the needs of the user.

1.2 Web page collection

Web page collection in this technology is mainly divided into deep web collection and dynamic web collection. The deep web is characterized by the hidden nature of its pages: in general, the user must submit a form with a data request before any results are returned. The main feature of dynamic web pages is that they "exist dynamically", that is, they are generated on the fly by a program when the user requests the page. According to the distribution of information items, dynamic web pages fall into two main types: multi-record dynamic web pages and single-record dynamic web pages. The main difficulty in extracting such pages lies in effectively locating the page information and in precisely representing the different extraction requests defined by different users.

2. Analysis technology

2.1 Complex network technology

2.1.1 Basic concepts

A concrete network can be abstracted as a graph G consisting of a node set V and an edge set E. The number of nodes is denoted N = |V(G)| and the number of edges M = |E(G)|. Each edge in E corresponds to a pair of nodes in V. If, for every node pair, (i, j) and (j, i) correspond to the same edge, the network is undirected; otherwise it is directed. If the network contains only one type of node and one type of edge, it is said to be homogeneous; otherwise it is heterogeneous.

2.1.2 Betweenness centrality

Betweenness centrality is defined in terms of a node's ability to control communication in the network. The idea is that if a node lies on the paths that communication between other node pairs must take, then it must occupy an important position in the network.

2.1.3 Clustering coefficient

The clustering coefficient is often used to describe the transitivity of a network. For example, in a social network, a friend of your friend is likely to be your friend as well, and two of your friends are likely to be friends with each other. The clustering coefficient measures this property of the network.

2.2 Other statistical indicators

2.2.1 H-index

An important measure of a scientist's influence is the H-index. Its value depends on the number of articles the scientist has published and the number of times they have been cited: if a scholar has h articles that have each been cited at least h times, and h is the largest number for which this holds, then the scholar's H-index is h. It follows that the larger a scholar's H-index, the greater his or her influence in the research field. The H-index takes both the quantity and the quality of a scholar's published research into account.
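The H-index definition above can be illustrated with a minimal Python sketch. The function name `h_index` and the input format (a plain list of per-article citation counts) are assumptions made for illustration and do not come from the patent.

```python
def h_index(citations):
    """Return the largest h such that h articles have at least h citations each.

    `citations` is a list of citation counts, one per article (assumed input format).
    """
    ranked = sorted(citations, reverse=True)  # most-cited article first
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank      # the top `rank` articles all have >= rank citations
        else:
            break         # counts only decrease from here, so h cannot grow
    return h


print(h_index([10, 8, 5, 4, 3]))
```

Sorting in descending order means the loop can stop at the first article whose citation count falls below its rank.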

2.2.2 APS value (average output score)

The APS value is defined as follows: for a paper with n authors, each author receives a score of 1/n. An author's APS is the sum of the scores over all of his or her papers. It describes the author's degree of contribution to the articles he or she has published.
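A short sketch of the APS computation follows. The function name `aps_scores` and the representation of the input (one list of author names per paper) are hypothetical; treating a name repeated within one paper as a single author is also an assumption.

```python
from collections import defaultdict


def aps_scores(papers):
    """Sum, for every author, a share of 1/n for each n-author paper.

    `papers` is an iterable of author-name lists (assumed input format).
    """
    scores = defaultdict(float)
    for authors in papers:
        unique = list(dict.fromkeys(authors))  # drop repeated names, keep order
        if not unique:
            continue                           # skip papers with no author data
        share = 1.0 / len(unique)
        for author in unique:
            scores[author] += share
    return dict(scores)


papers = [["A", "B"], ["A"], ["A", "B", "C", "D"]]
print(aps_scores(papers))
```

For the three sample papers, author A receives 1/2 + 1 + 1/4 and author B receives 1/2 + 1/4.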

The present invention proposes a complex-network-based method for academic core author mining and information extraction. The simple usage flow of the application system is shown in Figure 2, and the flow of the method for academic core author mining and related information extraction is shown in Figure 3. The specific steps are as follows:

Step 1: Data collection and collation. This method uses vertical search technology to collect the literature data of the papers of a specified conference. The collection flow is shown in Figure 4. This step consists of three stages:

Stage 1: Basic data acquisition, which specifically includes the following. Step a) Determine the collection conditions (see Figure 5). First the retrieval type is determined; there are three retrieval types: journal, conference and keyword. Then the search terms, time range and other retrieval conditions are determined according to the type, such as the conference configuration conditions (conference-related search terms and so on), the literature retrieval source and the retrieval year. Next the data sources are selected, including different domestic and foreign databases. Together these form the set of retrieval conditions. The conference configuration conditions must be entered by the user, while the remaining configuration conditions are adjusted by the system itself. Step b) Dynamically configure the collection information according to the collection conditions: the collection information is configured separately for each selected data source site, such as CNKI or SpringerLink; for example, if the retrieval type is journal, the configured collection information is for journals. Step c) Collect the basic literature data. Here vertical search technology is used: according to the fields, conferences and other information that the user is interested in, the initial URL seed set, page fetching module, topic relevance analysis module, and URL deduplication and page download modules obtain the relevant author, institution, conference and other metadata from commonly used literature search engines such as CNKI and SpringerLink, and automatically download and parse the full text of the literature.

Stage 2: Data collation, which specifically includes the following. Step d) Clean the data: this mainly means normalizing author names, removing superfluous characters such as spaces, and merging institutions to some extent, for example replacing a second-level unit with the name of its first-level unit. Step e) Obtain the specified information: the main object of study in the present invention is the author, so this step yields simple author information, namely the author's name and a unique ID assigned by the system.
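The cleaning and ID assignment in steps d) and e) might look like the sketch below. The patent does not give concrete normalization rules, so everything here is an assumption: the helper names, the rule of collapsing whitespace runs to a single space, the sequential integer IDs, and the `parent_of` lookup table that maps a second-level unit to its first-level unit.

```python
import re


def normalize_name(raw):
    """Collapse runs of whitespace and trim the ends (assumed cleaning rule)."""
    return re.sub(r"\s+", " ", raw).strip()


def assign_author_ids(raw_names):
    """Give each distinct normalized name a sequential ID starting at 1."""
    ids = {}
    for raw in raw_names:
        name = normalize_name(raw)
        if name and name not in ids:
            ids[name] = len(ids) + 1
    return ids


def merge_institution(raw, parent_of):
    """Replace a second-level unit by its first-level unit when a mapping exists.

    `parent_of` is a hypothetical lookup table supplied by the caller.
    """
    name = normalize_name(raw)
    return parent_of.get(name, name)


print(assign_author_ids(["  Li  Wei ", "Li Wei", "Zhang Min"]))
print(merge_institution("Lab X,  Univ Y", {"Lab X, Univ Y": "Univ Y"}))
```

Name normalization alone does not distinguish different people who share a name; a production system would need further disambiguation, which is outside this sketch.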

Stage 3: Information storage, which specifically includes the following. Step f) Display the results to the user, who judges whether they are satisfactory; if so, proceed to step g), otherwise return to step a) and reconfigure. Step g) Store the basic literature information and the author information in the designated database. Step h) The system determines whether data is to be collected cyclically; if so, it waits for a period of time and then collects again, otherwise the collection step ends.

Step 2: Statistical analysis of parameters. The data analysis and collation sub-flow is shown in Figure 6. The objects studied by this method are the core authors and groups of the designated field. It is therefore necessary to analyze the bibliometric parameters of the authors and to identify the core authors of the field by ranking the parameter values. The statistical parameters include the distribution of the authors' publication counts and the distribution of the authors' APS (average output score). In addition, the collaboration relationships are used to extract the author collaboration network, and each author's node betweenness centrality, degree distribution, network clustering coefficient and H-index are analyzed. Node betweenness centrality measures the extent to which an author can control the interactions between others: if a node lies on the shortest paths between many other node pairs, it has a high betweenness centrality, and the author can be considered to occupy an important position. The degree distribution indicates how many people an author has collaborated with. The network clustering coefficient is the proportion of a node's neighbors that are also neighbors of each other, that is, the degree of perfection of the small-cluster structure, and it is used to measure how the author clusters with other nodes in the network. The H-index is h if h of an author's articles have each been cited at least h times, and it is used to measure the author's influence in the research field. The author rankings obtained under the different parameters are saved; that is, separate rankings are obtained by publication count, APS (average output score), node betweenness centrality in the collaboration network, degree, network clustering coefficient and H-index.
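As a rough illustration of how the collaboration network and a per-parameter ranking could be derived from the collected papers, consider the following sketch. It assumes an undirected, unweighted co-authorship graph stored as a dictionary of neighbor sets; the function names and the tie-breaking rule (alphabetical by name) are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_coauthor_network(papers):
    """Link every pair of authors who share at least one paper."""
    adj = defaultdict(set)
    for authors in papers:
        unique = sorted(set(authors))
        for a in unique:
            adj[a]                      # touch the key so solo authors become nodes
        for a, b in combinations(unique, 2):
            adj[a].add(b)
            adj[b].add(a)
    return dict(adj)


def rank_by(metric):
    """Order authors by descending metric value, ties broken by name."""
    return sorted(metric, key=lambda author: (-metric[author], author))


papers = [["A", "B", "C"], ["A", "D"], ["E"]]
network = build_coauthor_network(papers)
degree = {a: len(nbrs) for a, nbrs in network.items()}      # number of collaborators
pub_count = Counter(a for p in papers for a in set(p))      # number of publications

print(rank_by(degree))
print(rank_by(pub_count))
```

Each parameter (publication count, APS, degree, and so on) yields its own ranking through the same `rank_by` helper, which mirrors the idea of saving one ranking per parameter.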

Step 3: Group analysis using the community division algorithm. This method divides the author collaboration network into communities, and each resulting community corresponds to a scientific research group. The distribution of publication counts is then computed over all scientific research groups.
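Once every author has been assigned to a community, the publication count per research group can be tallied as in the sketch below. The patent does not say how a paper whose authors span several communities is counted; this sketch assumes it is counted once for each community involved. The function name and the `community_of` mapping are hypothetical.

```python
from collections import Counter


def community_publication_counts(papers, community_of):
    """Count papers per community.

    `community_of` maps author name -> community label (assumed to be the
    output of the community division step). A paper is counted once for each
    community that has at least one author on it (an assumption).
    """
    counts = Counter()
    for authors in papers:
        groups = {community_of[a] for a in authors if a in community_of}
        for group in groups:
            counts[group] += 1
    return counts


papers = [["A", "B"], ["A", "C"], ["C"]]
community_of = {"A": 0, "B": 0, "C": 1}
print(community_publication_counts(papers, community_of))
```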

Step 4: Display of author rankings and scientific research group information. The different author rankings saved in step 2 and the scientific research groups found in step 3 are presented to the user, and, according to the author ranking and research group ranking that the user selects, important authors are recommended as research leaders and important groups as core teams.

Step 5: Extraction and display of core author information. According to need, the user selects scholars in the main field as core authors, and the system automatically extracts their personal profile information from the literature information and presents it to the user for business or research use.

In stage 1 of step 1, the literature is collected by combining deep web collection with dynamic web collection.

Deep web collection works in three steps: 1) analyze the pages and find the forms; 2) learn how to fill in the forms; 3) identify and retrieve the result pages. In the first step, the deep web crawler starts from the home page of the site and crawls for form pages, using a set of heuristic rules to discard non-research forms. In the second step, it extracts the labels from the form and, with the help of the domain rule knowledge base and the characteristic identifiers of the website (user name, password or verification code), tries to learn how to fill in the form correctly. In the last step it submits the form, then retrieves the result pages and identifies the records. In addition, during deep web collection the crawler needs to recognize the knowledge of the specific application domain intelligently, on the basis of the domain knowledge base, to ensure that the collected information is relevant and accurate.

In dynamic web collection, extracting information from multi-record dynamic web pages requires the tree edit distance model and the tree merging model to locate and extract the page information. The tree edit distance is used to locate the extraction structure of the page accurately: the dynamic web page is converted into a tag tree, the data items in the page are located and separated, and a separate data item tree is generated for each data item. The tree merging model is applied to pattern extraction over multiple data items, handling repeated and optional data items, to generate the wrapper tree used for extraction, which is the final extractor. When extracting information from single-record dynamic web pages, the user customizes the data items to be extracted through an optional module, and the system generates an extraction template from the data items that the user selects. During extraction, the web page is first converted into a tag tree, and the page information is then matched against the user-defined extraction template, extracted and saved.

In step c) of stage 1, the literature search engines used as sources are mainly CNKI and SpringerLink, and the collected content includes the title, full text, authors and keywords of each document, the authors' institutions, the publication in which the document appeared, and its publication date.

In step 2, betweenness centrality is defined as:

$BC_i = \sum_{j<k} \frac{g_{jk}(i)}{g_{jk}},$

where g_jk(i) is the number of shortest paths between nodes j and k that pass through node i, and g_jk is the total number of shortest paths between nodes j and k. For a directed network, the direction of the paths must be taken into account. The concept of betweenness centrality is very important in social network analysis. Besides defining the betweenness of nodes, it can also be used to define the betweenness of edges, which measures the importance of an edge in the network.
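The formula can be evaluated directly for a small undirected, unweighted graph, as in the following sketch. It runs one breadth-first search per node to obtain distances and shortest-path counts, then uses the fact that the number of shortest j-k paths through i equals g_ji times g_ik whenever d(j,i) + d(i,k) = d(j,k). The function names and the adjacency-dictionary input are assumptions, node labels must be sortable, and this direct evaluation is meant for illustration rather than for large networks.

```python
from collections import deque


def bfs_counts(adj, source):
    """Return (distance, number of shortest paths) from `source` to each reachable node."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:               # first time v is reached
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:      # u precedes v on a shortest path
                sigma[v] += sigma[u]
    return dist, sigma


def betweenness(adj):
    """Evaluate BC_i = sum over pairs j<k of g_jk(i) / g_jk on an undirected graph."""
    nodes = sorted(adj)
    info = {s: bfs_counts(adj, s) for s in nodes}
    bc = {i: 0.0 for i in nodes}
    for pos, j in enumerate(nodes):
        dist_j, sigma_j = info[j]
        for k in nodes[pos + 1:]:
            if k not in dist_j:
                continue                    # j and k are not connected
            for i in nodes:
                if i == j or i == k or i not in dist_j:
                    continue
                dist_i, sigma_i = info[i]
                if k in dist_i and dist_j[i] + dist_i[k] == dist_j[k]:
                    bc[i] += sigma_j[i] * sigma_i[k] / sigma_j[k]
    return bc


path = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
print(betweenness(path))
```

In the three-node path A-B-C, only B lies between another pair of nodes, so it is the only node with a non-zero score.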

The network clustering coefficient indicates whether the neighbors of a node are connected to each other, and is a measure of the transitivity of the network. Informally, a neighbor of a node's neighbor is likely to be a neighbor of that node as well. It is defined as:

$C = \frac{3 N_{\Delta}}{N_{3}}.$

where N_Δ is the number of triangles in the collaboration network and N_3 is the number of connected triples in the collaboration network. A connected triple is a set of three nodes including a given node, in which there are at least two edges running from the given node to the other two nodes.
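A small sketch of this ratio for an undirected graph follows. For every node it enumerates the pairs of its neighbors: each pair is one connected triple centered on that node, and the triple is closed when the two neighbors are themselves adjacent. Because every triangle is found once from each of its three corners, the number of closed triples equals 3 N_Δ. The function name and the dictionary-of-sets input are assumptions, and node labels must be sortable.

```python
from itertools import combinations


def transitivity(adj):
    """Compute C = 3 * (number of triangles) / (number of connected triples)."""
    closed = 0    # closed triples; equals 3 * number of triangles
    triples = 0   # all connected triples, counted by their center node
    for center, nbrs in adj.items():
        for a, b in combinations(sorted(nbrs), 2):
            triples += 1
            if b in adj[a]:
                closed += 1
    return closed / triples if triples else 0.0


# Triangle A-B-C with a pendant node D attached to C.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
print(transitivity(graph))
```

The sample graph has one triangle and five connected triples, so the coefficient is 3/5.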

In step 3, the community division algorithm used is a fast community division algorithm for directed networks. The fast community division algorithm is an improvement built on the concept of modularity introduced for the GN algorithm.

The GN algorithm is introduced first:

A simple way to divide a network into communities is to remove the edges that connect different communities; this is the central idea of divisive methods. The GN algorithm proposed by Girvan and Newman is the best-known divisive algorithm for community discovery. It uses the edge betweenness centrality introduced above and removes edges step by step according to the degree to which they do not belong to a community, until all edges have been removed. By the definition of a community, edges between communities have a larger edge betweenness than edges within a community, so removing the edges with high edge betweenness one by one divides the network into communities.

However, in the worst case the GN algorithm must recompute the betweenness of all edges every time an edge is removed, so it is only suitable for medium-scale social networks. Many studies have improved on this shortcoming from different angles. In addition, researchers have pointed out that the GN algorithm has no quantitative definition of the community structure of a network. It is therefore impossible to judge directly from the topology of the network whether the community structure obtained is meaningful, and, when the number of communities is not known, the GN algorithm does not know at which step the decomposition should stop. To solve this problem, Newman et al. introduced a criterion for measuring the quality of a community division, the modularity (Q), which is defined as:

$Q = \frac{1}{2m} \sum_{ij} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j)$

where A is the adjacency matrix of the graph, A_ij is the weight of the edge between nodes i and j, k_i and k_j are the degrees of nodes i and j (the degree of a node is the number of edges incident to it), m is the total number of edges in the graph, and c_i and c_j are the numbers of the communities to which nodes i and j belong. The function δ takes the value 1 if nodes i and j belong to the same community, and 0 otherwise.
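The modularity formula can be evaluated term by term, as in this minimal sketch. It assumes an undirected, unweighted graph without self-loops, so that A_ij is 1 for adjacent nodes and 0 otherwise; the function name and the `community` mapping from node to community label are hypothetical.

```python
def modularity(adj, community):
    """Q = 1/(2m) * sum_ij (A_ij - k_i*k_j/(2m)) * delta(c_i, c_j)."""
    two_m = sum(len(nbrs) for nbrs in adj.values())   # each edge is seen twice
    if two_m == 0:
        return 0.0
    q = 0.0
    for i in adj:
        for j in adj:
            if community[i] != community[j]:
                continue                              # delta(c_i, c_j) = 0
            a_ij = 1.0 if j in adj[i] else 0.0
            q += a_ij - len(adj[i]) * len(adj[j]) / two_m
    return q / two_m


# Two triangles (1,2,3) and (4,5,6) joined by the single edge 3-4.
graph = {
    1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
    4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5},
}
split = {1: "left", 2: "left", 3: "left", 4: "right", 5: "right", 6: "right"}
print(modularity(graph, split))
```

For this graph the two-triangle split gives Q = 5/14, while placing all six nodes in one community gives Q = 0, which matches the intuition that the split is the better division.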

The fast community division algorithm is introduced below.

The GN algorithm is a very important milestone in the field of community discovery algorithms, but because of its relatively high algorithmic complexity it is limited to the study of medium-scale social networks. For this reason, a fast community division algorithm has been proposed on the basis of the GN algorithm. This fast algorithm is in fact an agglomerative algorithm based on the greedy approach. Its steps are as follows: 1. initialize the network as n communities, that is, each node is an independent community; 2. successively merge pairs of communities that are connected by edges, and compute the modularity (Q) increment after each merge; 3. repeat step 2, continuing to merge communities until the whole network has been merged into a single community. Each merge operation in the algorithm corresponds to a modularity value, and the division corresponding to the local maximum of the modularity is the best community division.
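The three steps above can be sketched in Python as follows, under stated assumptions: the graph is unweighted and undirected, has no self-loops, and its node labels are mutually comparable. The sketch always merges the connected pair with the largest modularity increment, computed as ΔQ = 2(e_ij − a_i·a_j), where e_ij is the fraction of edge ends joining communities i and j and a_i is the fraction of edge ends attached to community i. The function name `fast_greedy_communities` is hypothetical, and the unoptimized pair search is chosen for readability rather than speed.

```python
def fast_greedy_communities(edges):
    """Greedy agglomerative community division; returns (best division, best Q)."""
    m = len(edges)
    nodes = sorted({n for edge in edges for n in edge})
    members = {n: {n} for n in nodes}   # step 1: every node is its own community
    e = {n: {} for n in nodes}          # e[i][j]: fraction of edge ends between i and j
    a = {n: 0.0 for n in nodes}         # a[i]: fraction of edge ends attached to i
    for u, v in edges:
        e[u][v] = e[u].get(v, 0.0) + 1 / (2 * m)
        e[v][u] = e[v].get(u, 0.0) + 1 / (2 * m)
        a[u] += 1 / (2 * m)
        a[v] += 1 / (2 * m)
    q = -sum(x * x for x in a.values())  # Q of the all-singletons division
    best_q, best = q, [set(s) for s in members.values()]
    while len(members) > 1:
        # step 2: modularity increment for every pair of connected communities
        pairs = [(2 * (w - a[i] * a[j]), i, j)
                 for i in e for j, w in e[i].items() if i < j]
        if not pairs:                    # disconnected graph: nothing left to merge
            break
        dq, i, j = max(pairs)
        for k, w in e.pop(j).items():    # fold community j into community i
            del e[k][j]
            if k != i:
                e[i][k] = e[i].get(k, 0.0) + w
                e[k][i] = e[k].get(i, 0.0) + w
        a[i] += a.pop(j)
        members[i] |= members.pop(j)
        q += dq
        if q > best_q:                   # remember the division with the largest Q
            best_q, best = q, [set(s) for s in members.values()]
    return best, best_q                  # step 3 ends when one community remains
```

Tracking only the increment avoids recomputing Q from scratch after each merge.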

However, its algorithmic complexity is still relatively high. Moreover, most community discovery algorithms at present still concentrate on undirected networks, whereas in fact most networks of interest are directed networks, for example the World Wide Web, telephone call networks, Email communication networks, biological networks and so on. Performing community discovery while ignoring the direction of network connections means discarding important information in the network structure, which biases the results of community discovery.

For directed networks, a modified modularity function is used:

$Q = \frac{1}{m} \sum_{ij} \left[ A_{ij} - \frac{k_{i}^{in} k_{j}^{out}}{m} \right] \delta_{c_{i} c_{j}},$

$\delta_{c_{i} c_{j}} = \begin{cases} 1, & c_{i} = c_{j} \\ 0, & c_{i} \neq c_{j} \end{cases}$

where A is the adjacency matrix of the graph, A_ij is the edge weight, and δ is the Kronecker delta symbol: if node i and node j belong to the same community, δ takes the value 1, otherwise it takes the value 0. $k_{i}^{in}$ is the in-degree of node i and $k_{j}^{out}$ is the out-degree of node j; the in-degree of a node is the number of edges entering the node, and the out-degree of a node is the number of edges leaving the node. m is the total number of edges of the network.
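As a minimal sketch, the directed modularity above may be evaluated in Python as shown below. The sketch assumes an unweighted directed graph given as a list of (source, target) pairs and a mapping from each node to its community number. The name `directed_modularity` is an illustrative assumption.

```python
from collections import defaultdict


def directed_modularity(edges, community):
    """Q = 1/m * sum_ij [A_ij - k_i^in * k_j^out / m] * delta(c_i, c_j)."""
    m = len(edges)
    in_total = defaultdict(int)    # total in-degree of each community
    out_total = defaultdict(int)   # total out-degree of each community
    intra = 0                      # edges whose two ends share a community
    for source, target in edges:
        out_total[community[source]] += 1
        in_total[community[target]] += 1
        if community[source] == community[target]:
            intra += 1
    labels = set(in_total) | set(out_total)
    # Sum of k_i^in * k_j^out over same-community pairs factorizes per community.
    expected = sum(in_total[c] * out_total[c] for c in labels) / m
    return (intra - expected) / m
```

For two directed 3-cycles joined by one edge, placing each cycle in its own community gives Q = 18/49, and placing all nodes in one community gives Q = 0.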

In Step 5, the contact information of a contact may be the contact's Email address. Email information is obtained from the original text of the documents extracted in Step 1. The system automatically parses the original text, uses regular expressions to match text in Email format, and extracts all Email information contained in the original text. The system uses three regular expressions simultaneously to match target Emails:

Regex1 =

"\w+([-+.]\w+)*\w+([-.]\w+)*\.\w+([-.]\w+)*"

Regex2 =

"\{?(\w*([-+.]\w+)*,(\s)*)*\w*([-+.]\w+)*\}?\w+([-.]\w+)*\.\w*([-.]\w*)*"

Regex3 =

"(\s)*e-mail:(\s)*\w+([-+.]\w+)*\w+([-.]\w+)*\.\w+([-.]\w+)*(\s)*"
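By way of non-limiting illustration, regular-expression extraction of this kind can be sketched in Python as follows. The patterns below are simplified variants and are not the exact expressions of the present disclosure: the expressions as reproduced in this text show no "@" between the local part and the domain, and the sketch assumes one. It is likewise an assumption that the grouped form is written with braces, as in `{alice, bob}@example.org`. The names `SINGLE`, `GROUPED` and `extract_emails` are hypothetical.

```python
import re

# Assumption: "@" separates the local part from the domain.
SINGLE = re.compile(r"\w+(?:[-+.]\w+)*@\w+(?:[-.]\w+)*\.\w+(?:[-.]\w+)*")

# Assumption: several local parts sharing one domain are written in braces.
GROUPED = re.compile(
    r"\{((?:\w+(?:[-+.]\w+)*,\s*)*\w+(?:[-+.]\w+)*)\}"
    r"@(\w+(?:[-.]\w+)*\.\w+(?:[-.]\w+)*)"
)


def extract_emails(text):
    """Return the Email addresses found in text, without duplicates, in order found."""
    found = []
    for match in GROUPED.finditer(text):
        names, domain = match.group(1), match.group(2)
        # Expand "{a, b}@domain" into one address per local part.
        found.extend(f"{name.strip()}@{domain}" for name in names.split(","))
    # Plain addresses, including those that follow an "e-mail:" prefix.
    found.extend(match.group(0) for match in SINGLE.finditer(text))
    return list(dict.fromkeys(found))
```

Running the grouped pattern first lets shared-domain author lists be expanded into individual addresses before the plain addresses are collected.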

The specific embodiments described above further explain the purpose, technical solutions and beneficial effects of the present invention in detail. A screenshot of the application system implemented by the present invention is shown in Figure 7. It should be understood that the above are only specific embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

1. A complex-network-based method for mining academic core authors and extracting related information, comprising:

Step 1: collecting literature data of a designated field using vertical search technology, and organizing and analyzing the literature data to obtain author-related information;

Step 2: extracting an author collaboration network from the obtained author-related information, computing author-related parameters, and obtaining different author ranking information according to the different parameters computed;

Step 3: dividing the extracted collaboration network into communities, each community after division being treated as a research group;

Step 4: presenting the different author ranking information and the research groups to the user, and recommending core authors and leadership teams to the user according to the author ranking information and research group selected by the user.

2. the method for claim 1 is characterized in that, the data in literature that gathers designated field in the step 1 specifically comprises:

Step 11, determine acquisition condition, comprise the deterministic retrieval type, according to difference retrieval type deterministic retrieval condition;

Step 12, according to the Information Monitoring of acquisition condition dynamic-configuration;

Step 13, obtain data in literature according to acquisition condition and Information Monitoring.

3. the method for claim 1 is characterized in that, in the step 1 data is carried out finishing analysis and specifically comprises to obtain author's relevant information:

Step 14, carry out data cleansing;

Step 15, obtain author's relevant information of appointment.

4. the method for claim 1, it is characterized in that step 1 comprises that also the author's relative information displaying that will obtain to the user, need to determine whether the Resurvey data by the user, if need to would reconfigure acquisition condition, and carry out image data according to the acquisition condition that reconfigures.

5. the method for claim 1, it is characterized in that correlation parameter described in the step 2 comprises author's the distribution of dispatch amount, author's average output score, node Betweenness Centrality, degree distribution, network convergence factor and the H-index tolerance of author in cooperative network.

6. The method of claim 5, characterized in that the node betweenness centrality is calculated according to the following formula:

$BC_{i} = \sum_{j < k} \frac{g_{jk}(i)}{g_{jk}},$

where g_jk(i) is the number of shortest paths between node j and node k that pass through node i, and g_jk is the number of shortest paths between node j and node k;

and the network clustering coefficient is obtained according to the following formula:

$C = \frac{3 N_{\Delta}}{N_{3}},$

where N_Δ is the number of triangles in the collaboration network and N_3 is the number of connected triples in the collaboration network.

7. the method for claim 1 is characterized in that, corporations described in the step 3 divide the quick group dividing method that adopts for directed networks, specifically comprise:

Step 31, the described cooperative network of initialization are n corporations, and namely each node is independent corporations;

Step 32, be associated with the corporations that the limit links to each other successively, and calculate the modularity value after merging;

Step 33, repeated execution of steps 32, until whole cooperative network all is merged into corporations, wherein, when the modularity value was maximum, corresponding corporations were the corporations after final the division after merging.

8. The method of claim 7, wherein the modularity value is calculated according to the following formulas:

$Q = \frac{1}{m} \sum_{ij} \left[ A_{ij} - \frac{k_{i}^{in} k_{j}^{out}}{m} \right] \delta_{c_{i} c_{j}},$

$\delta_{c_{i} c_{j}} = \begin{cases} 1, & c_{i} = c_{j} \\ 0, & c_{i} \neq c_{j} \end{cases}$

where Q is the modularity value, A is the adjacency matrix, A_ij is the edge weight, $k_{i}^{in}$ is the in-degree of node i, $k_{j}^{out}$ is the out-degree of node j, and m is the total number of edges of the collaboration network.

9. the method for claim 1 is characterized in that, the method also comprises:

Step 5, analysis data in literature, the personal information that extracts Core Authors also offers the user.

10. A complex-network-based system for mining academic core authors and extracting related information, comprising:

a data collection and organization device, configured to collect literature data of a designated field using vertical search technology, and to organize and analyze the literature data to obtain author-related information;

a parameter analysis and statistics device, configured to extract an author collaboration network from the obtained author-related information, to compute author-related parameters, and to obtain different author ranking information according to the different parameters computed;

a community division device, configured to divide the extracted collaboration network into communities, each community after division being treated as a research group;

a result presentation device, configured to present the different author ranking information and the research groups to the user, and to recommend core authors and leadership teams to the user according to the author ranking information and research group selected by the user.