A Key Point Identification Method Based on a Technology Map
Technical Field
The present invention relates to a data processing method, and in particular to a key point identification method based on a technology map.
Background
In a technology map network, identifying the key nodes, that is, the key technologies and hotspot technologies, greatly assists the planning of science and technology innovation work. Traditional discussion of key nodes in a network centers on the centralization problem of complex networks and on node importance evaluation, measuring the statistical properties of the network by empirical methods. Using any single one of these measures or methods to identify key nodes is highly one-sided: each measure reflects a node's position in the network from only one angle, which does not match the actual situation. With the rapid development of the Internet, a simple combination of measures cannot meet practical needs, and higher accuracy is required in identifying key points.
In particular, now that networks are applied more widely and carry more practical significance, measures defined from a purely theoretical perspective do not fit real conditions, which lowers the accuracy of key node identification.
Summary of the Invention
The purpose of the present invention is to overcome the above defects of the prior art by providing a key point identification method based on a technology map, addressing the problems that the indicators used to identify key nodes in a technology map are one-dimensional and detached from practice.
The purpose of the present invention can be achieved through the following technical solution:
A key point identification method based on a technology map, comprising:
constructing a technology map;
performing centrality calculation on the node data in the technology map to obtain key nodes;
simplifying the multi-dimensional technical indicators of the node data by principal component analysis;
analyzing the relationship between the key nodes and the technical indicators to obtain the key nodes under different dimensions.
The technology map is constructed by extracting scientific and technological achievements from multiple websites and databases using entity, relation and attribute extraction methods, and then performing knowledge fusion on the extracted achievements.
The websites and databases include at least one of Tongfang CNKI (同方知网), DRCNet (国研网), a self-built resource library, R&D institution data, policy and regulation data, industry news data, a patent database, and an industry standard database.
The centrality includes degree centrality, closeness centrality and betweenness centrality.
The dimensions of the technical indicators include a project level dimension, a talent level dimension and a scientific research achievement level dimension.
The technical indicators of the project level dimension include the total number of projects, the category of funded projects, and research funding.
The technical indicators of the talent level dimension include the average age of personnel, the average education level of personnel, and the number of personnel.
In the scientific research achievement level dimension, the research achievements include papers, patents and other achievements.
The paper-related technical indicators include the total number of papers, the total citation count, the number of core journal papers, the total citation count of core journal papers, the number of funded papers, the total citation count of funded papers, the share of core journal papers, the average citations per paper, the average citations per core journal paper, the average citations per funded paper, and the H-index; the patent-related technical indicators include the total number of patents and the number of invention patents; the technical indicators related to other achievements include achievement awards, achievement appraisal results, the number of standards, and works as chief editor or associate editor.
Linear regression is used to analyze the relationship between the key nodes and the technical indicators.
Compared with the prior art, the present invention considers both network centrality indicators and the bibliometrics of scientific and technological resources, overcoming the shortcomings that the indicators used to identify key nodes in a technology map are one-dimensional and detached from practice. Drawing on the theory of complex networks, it computes the relevant indicators of the technology map quantitatively, which helps to identify key nodes more accurately, to discover the direction of technical research or clues to technology trends, and to provide decision support for scientific and technological innovation.
Brief Description of the Drawings
Figure 1 is a flowchart of the key point identification method based on a technology map in this embodiment;
Figure 2 is the technology map constructed in this embodiment;
Figure 3 is a curve of the cumulative contribution rate of the evaluation indicators in this embodiment.
Detailed Description
The present invention is described in detail below with reference to the drawings and a specific embodiment. The embodiment is implemented on the premise of the technical solution of the present invention and gives a detailed implementation and a specific operating procedure, but the protection scope of the present invention is not limited to the following embodiment.
Embodiment
As shown in Figure 1, a key point identification method based on a technology map includes the following steps:
1) Constructing the technology map
Metadata is obtained from Tongfang CNKI (同方知网), DRCNet (国研网), the self-built resource library, external expert and R&D institution data, and internal project and scientific achievement data, supplemented with policy and regulation data, industry news data, patent data and industry standard data. Entities, relations and attributes are extracted, entity disambiguation and coreference resolution are performed on the extracted information, and the technology map is constructed, as shown in Figure 2.
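The construction step can be illustrated with a minimal sketch. It assumes that extraction has already produced (head, relation, tail) triples and that disambiguation and coreference resolution have been reduced to a lookup table of aliases. The function name `build_technology_map`, the alias table and the triples in the usage note are hypothetical and are not part of the original disclosure.

```python
def build_technology_map(triples, aliases):
    """Build a directed graph {head: {tail: relation}} from extracted triples.

    `aliases` maps surface forms to a canonical entity name; it stands in
    for the entity disambiguation and coreference resolution step
    (an assumption made for this illustration).
    """
    graph = {}
    for head, relation, tail in triples:
        h = aliases.get(head, head)
        t = aliases.get(tail, tail)
        if h == t:
            continue  # drop self-loops created by merging aliases
        graph.setdefault(h, {})[t] = relation
        graph.setdefault(t, {})  # make sure every entity appears as a node
    return graph
```

For example, with the alias table `{"DL": "deep learning"}`, the triples `("DL", "applied_to", "speech recognition")` and `("deep learning", "applied_to", "image recognition")` are merged under the single node `"deep learning"`.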
2) From the perspective of the statistical indicators of complex networks, key nodes are located according to the values of indicators such as degree centrality, closeness centrality and betweenness centrality. Nodes with both high betweenness centrality and high frequency are the key technologies in the field and represent the hot research topics of the period;
Degree centrality is the total number of direct connections between a node and other nodes. Since the connections in the technology map are directed, it can be divided into in-degree centrality and out-degree centrality. Taking both into account, the degree centrality of a node is calculated by the following formula:
[formula not reproduced in the source text]
where u is any node in the technology map, n is the number of nodes in the map, and $X_{vu}$ indicates whether nodes v and u are directly connected. Degree centrality is the most direct measure of node centrality in network analysis and reflects the cohesion of a node. The higher the degree centrality of a node, the more important the node is in the network;
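As an illustration only, the degree centrality described above can be sketched as follows. The sketch assumes the commonly used normalized form $C_D(u) = \sum_{v \neq u} X_{vu} / (n-1)$, with $X_{vu} = 1$ when v and u are joined by an edge in either direction; the text defines the symbols but not the exact expression, so this form, the function name `degree_centrality` and the edge-list input are assumptions.

```python
def degree_centrality(nodes, edges):
    """Return in-, out- and combined degree centrality for a directed graph.

    nodes: iterable of node names; edges: iterable of (source, target) pairs.
    All three values are divided by n - 1 (assumed normalization).
    """
    nodes = list(nodes)
    scale = max(len(nodes) - 1, 1)
    in_deg = {u: 0 for u in nodes}
    out_deg = {u: 0 for u in nodes}
    neighbours = {u: set() for u in nodes}
    for src, dst in set(edges):  # ignore duplicate edges
        if src == dst:
            continue  # ignore self-loops
        out_deg[src] += 1
        in_deg[dst] += 1
        neighbours[src].add(dst)
        neighbours[dst].add(src)
    return {
        u: {
            "in": in_deg[u] / scale,
            "out": out_deg[u] / scale,
            "combined": len(neighbours[u]) / scale,  # X_vu in either direction
        }
        for u in nodes
    }
```

Keeping the in and out values next to the combined value lets a caller see whether a technology is mostly cited by others or mostly points to others.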
Closeness centrality is the reciprocal of the sum of the shortest path distances from a node to all other nodes. It reflects how close a node is to the other nodes in the network. The normalized closeness centrality of a node is calculated by the following formula:
[formula not reproduced in the source text]
where u is any node in the technology map, n is the number of nodes in the map, and d(u, v) is the shortest path distance between u and another node v. Since the connections in the technology map are directed, closeness centrality can be divided into in-closeness centrality and out-closeness centrality. In-closeness centrality reflects the integrating power of a node, and out-closeness centrality reflects its radiating power;
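A minimal sketch of the closeness calculation is given below for illustration. It assumes the standard normalized form $C_C(u) = (n-1) / \sum_{v \neq u} d(u, v)$, unweighted edges, and a score of 0 for any node that cannot reach every other node; none of these choices is stated in the text. The helper names `bfs_distances`, `closeness_centrality` and `reverse_graph` are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest hop counts from `source` along directed edges."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def closeness_centrality(adj):
    """Out-closeness for every key of `adj` ({node: [successors]})."""
    n = len(adj)
    result = {}
    for u in adj:
        dist = bfs_distances(adj, u)
        total = sum(d for v, d in dist.items() if v != u)
        # Assumed convention: 0.0 when some node is unreachable from u.
        result[u] = (n - 1) / total if len(dist) == n and total > 0 else 0.0
    return result


def reverse_graph(adj):
    """Flip every edge so the same routine yields in-closeness."""
    rev = {u: [] for u in adj}
    for u, successors in adj.items():
        for v in successors:
            rev.setdefault(v, []).append(u)
    return rev
```

Calling `closeness_centrality(adj)` gives the out-closeness, and `closeness_centrality(reverse_graph(adj))` gives the in-closeness of the same nodes.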
Betweenness centrality is the number of shortest paths that pass through a node, that is, the number of times a node serves as a bridge on the shortest path between any two other nodes. The betweenness centrality of a node is calculated by the following formula:
[formula not reproduced in the source text]
where u is any node in the technology map, p is the total number of shortest paths between node s and node t, and p(u) is the number of those shortest paths that pass through node u. The more often a node acts as an "intermediary", the greater its betweenness centrality, and the more it functions as a "transportation hub" in the network.
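For illustration, the betweenness calculation can be sketched with Brandes' accumulation scheme, a well-known way to evaluate $C_B(u) = \sum_{s \neq u \neq t} p(u) / p$ without enumerating paths. The unnormalized form, the unweighted directed graph and the function name `betweenness_centrality` are assumptions; the text does not say which algorithm or normalization was used.

```python
from collections import deque


def betweenness_centrality(adj):
    """Unnormalized betweenness for a directed, unweighted graph.

    adj: {node: [successors]}; every node must appear as a key.
    """
    score = {u: 0.0 for u in adj}
    for s in adj:
        sigma = {u: 0 for u in adj}   # number of shortest s-u paths
        sigma[s] = 1
        dist = {s: 0}
        preds = {u: [] for u in adj}  # predecessors on shortest paths
        order = []                    # nodes in non-decreasing distance
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {u: 0.0 for u in adj}  # dependency of s on each node
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score
```

In the small graph A→B, A→D, B→C, D→C, C→E, node C lies on every shortest path into E and therefore receives the highest score, while B and D share the two equally short routes from A.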
3) Bibliometric measurement of scientific and technological resources is carried out from two aspects, research input and research output;
Research input is divided into research projects and the talent echelon. Research projects cover the total number of projects, funded projects and research funding; the talent echelon covers the average age of personnel, the average education level of personnel and the number of personnel;
Research output includes papers, patents, standards, monographs and achievements. For papers, the factors considered are the total number of papers, the total citation count, the number of core journal papers, the total citation count of core journal papers, the number of funded papers, the total citation count of funded papers, the share of core journal papers, the average citations per paper, the average citations per core journal paper, the average citations per funded paper, and the H-index. Patents cover the total number of patents and the number of invention patents. Achievements cover achievement awards and achievement appraisals, and there are also the number of standards and works as chief editor or associate editor;
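Several of the paper-related indicators listed above can be derived from a simple list of paper records, as in the following sketch. The record layout (keys `citations`, `core` and `funded`) and the function names are hypothetical; only the indicator definitions, such as the H-index being the largest h for which h papers have at least h citations each, are standard.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    h = 0
    for rank, cites in enumerate(sorted(citations, reverse=True), start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def paper_indicators(papers):
    """Compute paper-level indicators for one technology node.

    papers: list of dicts with keys "citations" (int), "core" (bool,
    published in a core journal) and "funded" (bool, supported by a fund).
    """
    def mean(values):
        return sum(values) / len(values) if values else 0.0

    cites = [p["citations"] for p in papers]
    core = [p["citations"] for p in papers if p["core"]]
    funded = [p["citations"] for p in papers if p["funded"]]
    return {
        "total_papers": len(papers),
        "total_citations": sum(cites),
        "core_papers": len(core),
        "core_citations": sum(core),
        "funded_papers": len(funded),
        "funded_citations": sum(funded),
        "core_share": len(core) / len(papers) if papers else 0.0,
        "avg_citations": mean(cites),
        "avg_core_citations": mean(core),
        "avg_funded_citations": mean(funded),
        "h_index": h_index(cites),
    }
```

Each technology node would yield one such dictionary, whose values become columns of the evaluation indicator matrix.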
4) Principal component analysis converts the multi-dimensional evaluation indicators defined in 2) and 3) into mutually independent composite evaluation indicators, eliminating the correlation between indicators and reducing the number of indicators needed to evaluate node criticality.
The present invention builds a technology map from the co-occurrence relationships of 200 technologies in scientific and technical literature, and evaluates the criticality of nodes along the dimensions of network topology, project level, talent level and research output. The 27 evaluation indicators are computed for each technology to form a 200×27 matrix. Principal component analysis of this matrix yields the eigenvalues, contribution rates and cumulative contribution rates; the cumulative contribution rates are shown in Figure 3.
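The eigenvalues, contribution rates and cumulative contribution rates can be obtained as in the sketch below. It is a dependency-free illustration that diagonalizes the covariance matrix with Jacobi rotations; in practice a library routine such as `numpy.linalg.eigh` would normally be used. Whether the original analysis worked on the covariance or the correlation matrix is not stated, so the use of the covariance matrix is an assumption, as are all function names.

```python
import math


def covariance_matrix(rows):
    """Sample covariance of an N x M matrix given as a list of rows."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(m)] for i in range(m)]


def jacobi_eigen(matrix, sweeps=100, tol=1e-18):
    """Eigenvalues and eigenvectors (columns of v) of a symmetric matrix."""
    n = len(matrix)
    a = [row[:] for row in matrix]
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Angle that zeroes a[p][q] after the rotation J^T A J.
                theta = 0.5 * math.atan2(2 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # A <- A J (columns p and q)
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # A <- J^T A (rows p and q)
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
                for k in range(n):  # accumulate eigenvectors: V <- V J
                    vkp, vkq = v[k][p], v[k][q]
                    v[k][p], v[k][q] = c * vkp - s * vkq, s * vkp + c * vkq
    return [a[i][i] for i in range(n)], v


def contribution_rates(rows):
    """Eigenvalues (descending), contribution rates and cumulative rates."""
    eigenvalues, _ = jacobi_eigen(covariance_matrix(rows))
    eigenvalues.sort(reverse=True)
    total = sum(eigenvalues)
    rates = [value / total for value in eigenvalues]
    cumulative, running = [], 0.0
    for rate in rates:
        running += rate
        cumulative.append(running)
    return eigenvalues, rates, cumulative
```

Applied to the 200×27 indicator matrix, `contribution_rates` would return 27 eigenvalues together with the rates from which a curve like the one in Figure 3 is drawn.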
As Figure 3 shows, the cumulative contribution rate of the first 5 principal components reaches 90.79%, so the first 5 principal components alone adequately represent the information contained in the 27 evaluation indicators. Multiplying the evaluation indicator matrix by the matrix of original-indicator weights corresponding to the first 5 principal components reduces the evaluation matrix to 200×5.
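The selection and reduction described above amount to a threshold test followed by a matrix product, sketched below. The 0.90 threshold in the usage note is an assumption chosen to match the reported 90.79%; the text does not state the cut-off that was used. The function names and the list-of-lists matrix layout are hypothetical, and the five rates are the weights reported in step 5).

```python
def components_needed(rates, threshold):
    """Smallest k whose first k contribution rates sum to >= threshold."""
    running = 0.0
    for k, rate in enumerate(rates, start=1):
        running += rate
        if running >= threshold:
            return k
    return len(rates)


def project(indicators, weights):
    """Multiply an N x M indicator matrix by an M x k weight matrix."""
    k = len(weights[0])
    return [
        [sum(row[j] * weights[j][c] for j in range(len(row))) for c in range(k)]
        for row in indicators
    ]


# Contribution rates of the first five principal components as reported.
REPORTED_RATES = [0.3284, 0.1531, 0.2157, 0.1196, 0.0911]
```

With the reported rates, `components_needed(REPORTED_RATES, 0.90)` returns 5, and `project` applied to a 200×27 indicator matrix and a 27×5 weight matrix returns the 200×5 reduced matrix.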
5) Using a linear regression expression, with the contribution rates of the first 5 principal components as their weights, a composite value of node criticality is obtained. Based on the result of 4), the composite function for evaluating node criticality is:
$$Z = 0.3284\,y_1 + 0.1531\,y_2 + 0.2157\,y_3 + 0.1196\,y_4 + 0.0911\,y_5$$
where $y_1$ to $y_5$ are the first 5 principal components obtained by the principal component analysis in 4).
The function values are computed and sorted to obtain the key nodes, which are marked in the network with a conspicuous color for easy identification. The same method can also be used to identify key nodes in networks formed by other entities such as research fields, authors and research institutions.
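The scoring and sorting can be sketched as follows. The weights are the five coefficients of the composite function Z given in step 5); the function names, the dictionary layout and the `top_k` cut-off are hypothetical, since the text does not say how many top-ranked nodes are treated as key nodes.

```python
# Coefficients of the composite function Z from step 5).
WEIGHTS = (0.3284, 0.1531, 0.2157, 0.1196, 0.0911)


def composite_score(components, weights=WEIGHTS):
    """Z = sum of weight_i * y_i over the retained principal components."""
    if len(components) != len(weights):
        raise ValueError("expected one component value per weight")
    return sum(w * y for w, y in zip(weights, components))


def rank_key_nodes(component_table, top_k):
    """Return the names of the top_k nodes by composite score.

    component_table: {node name: (y1, ..., y5)}.
    """
    scores = {name: composite_score(ys) for name, ys in component_table.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

The returned names are the nodes that would be highlighted in the map; the same two functions apply unchanged to a network of authors or institutions once their component values are available.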