CN111581394B

CN111581394B - A Large-Scale Knowledge Topographic Mapping Method

Info

Publication number: CN111581394B
Application number: CN202010368399.4A
Authority: CN
Inventors: 刘玉琴; 汪雪锋; 刘佳
Original assignee: Beijing Institute of Graphic Communication
Current assignee: Beijing Institute of Graphic Communication
Priority date: 2020-04-30
Filing date: 2020-04-30
Publication date: 2023-06-23
Anticipated expiration: 2040-04-30
Also published as: CN111581394A

Abstract

The invention discloses a method for drawing a large-scale knowledge topographic map, a relatively simple way to render such maps from large-scale text data. The TSNE algorithm lays out the documents on a plane; a planar pixel density function parameterized by the degree of document aggregation is established to map the color of points on the plane; pixel colors are then set in HSV mode and divided into a fixed number of levels to render the knowledge topographic map. The method has good visual expressiveness and is simple to implement.

Description

A Large-Scale Knowledge Topographic Mapping Method

Technical Field

The invention relates to a text analysis method in the field of information processing, and in particular to a method for drawing a large-scale knowledge topographic map.

Background Art

A knowledge topographic map visualizes text data in a form similar to the contour maps used in geographic information systems, using color depth to show the amount of data and the relationships among the data. Some literature calls it a landscape map or theme map; although the names and forms of presentation are not identical, the basic idea is the same. Knowledge topographic maps are mainly used in text data analysis, for example of patent texts, academic papers, and online text such as Weibo and WeChat posts, to reveal the knowledge content expressed in the text.

A knowledge topographic map can use the subject terms in the text for knowledge representation and drawing. For example, Chinese patent application CN106021228A (published 2016-10-12) discloses a method and system for text analysis using a knowledge topographic map, in which subject terms are extracted from the text to draw the map and express knowledge. Other technical solutions cluster documents by document distance for knowledge extraction and drawing; a typical example is the patent map of Clarivate Innovation (Canada), whose algorithm is relatively complex. Drawing forms include heat maps, topographic maps, rainbow maps, and weather maps.

When the text data is large and the number of subject terms is high (for example, tens of thousands of documents or subject terms), it is very difficult to draw a knowledge topographic map that has a clear structure and accurately reveals the text content. In CN106021228A, once the number of subject terms exceeds 1000, the limitations of the layout algorithm greatly reduce the readability of the map; moreover, the map cannot show the relationships among the entities to which the texts belong and only displays structural features of the text. The patent map of Clarivate Innovation (Canada) can display a relatively large number of nodes and can show the relationships among the entities to which the texts belong, but its algorithm is complex and difficult to implement.

Summary of the Invention

In view of the deficiencies of the prior art, the present invention aims to provide a method for drawing a large-scale knowledge topographic map that has good visual expressiveness and is simple to implement.

To achieve the above object, the present invention adopts the following technical solution:

A method for drawing a large-scale knowledge topographic map, comprising the following steps:

S1. Obtain the subject terms of each document by word segmentation, and build a subject term matrix from the subject terms of all documents;

S2. Input the subject term matrix built in step S1 into the TSNE algorithm, which uses the matrix to map the documents onto a two-dimensional plane, where each document is represented by a spherical node; the entity to which each node's document belongs is distinguished by node color, and the nodes of documents belonging to the same entity have the same color;

S3. Construct the density function of the pixels based on the number and coordinates of the spherical nodes within a unit area of the two-dimensional plane;

Let the coordinates of the spherical nodes corresponding to the N documents be (x_i, y_i), i = 1, ..., N, and let the average two-dimensional Euclidean distance between the spherical nodes corresponding to the documents be [formula not reproduced in the source].

For a pixel P with coordinates (x, y), the density function of the pixel is defined as:

[formula not reproduced in the source]

where N_p is the number of spherical nodes within the unit area centered on pixel P. The values of α and β determine the slope effect of the topographic map: the larger α and β are, the smaller the slope of the topographic map and the more pronounced its peak effect;

S4. Calculate the color value of each pixel and render the topographic map;

S4.1. Normalize the pixel density function obtained in step S3 so that its value is a floating-point number between 0 and 1;

S4.2. Establish an HSV palette in which the values of H and V are fixed;

S4.3. Establish a one-to-one mapping between the normalized density value of each pixel and the HSV palette: denoting the normalized density value of a pixel by q, the corresponding HSV color of that pixel on the palette has fixed H and V values and an S value of q*100%;

S4.4. Divide the range of the S value of the pixel color into M level values, and adjust the S value of each pixel calculated in step S4.3 to the corresponding level value according to the following formula:

[formula not reproduced in the source]

M >= 3.

Further, in step S1, the subject term matrix is as follows:

$$
\begin{array}{c|cccc}
 & \mathrm{Keyword}_1 & \mathrm{Keyword}_2 & \cdots & \mathrm{Keyword}_l \\ \hline
D_1 & e_{11} & e_{12} & \cdots & e_{1l} \\
D_2 & e_{21} & e_{22} & \cdots & e_{2l} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
D_N & e_{N1} & e_{N2} & \cdots & e_{Nl}
\end{array}
$$

where Keyword_1, ..., Keyword_l denote the l extracted subject terms; D_1, ..., D_N denote the N documents; and e_ij denotes the number of occurrences of Keyword_j in document D_i, i = 1, ..., N, j = 1, ..., l.
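The construction of the subject term matrix can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes word segmentation has already produced a token list per document, and the function name `build_subject_term_matrix` and the sample tokens are hypothetical.

```python
from collections import Counter


def build_subject_term_matrix(docs_terms):
    """Build the N x l count matrix e_ij from pre-segmented documents.

    docs_terms: list of token lists, one per document (segmentation is
    assumed to have been done elsewhere).
    Returns (keywords, matrix), where matrix[i][j] is the number of times
    keywords[j] occurs in document i.
    """
    # Column order is an assumption: keywords sorted alphabetically.
    keywords = sorted({term for terms in docs_terms for term in terms})
    matrix = []
    for terms in docs_terms:
        counts = Counter(terms)  # missing keys count as 0
        matrix.append([counts[k] for k in keywords])
    return keywords, matrix


# Hypothetical toy corpus of three segmented documents.
docs = [
    ["graph", "layout", "graph"],
    ["layout", "color"],
    ["color", "density", "density"],
]
keywords, matrix = build_subject_term_matrix(docs)
```

Each row of `matrix` corresponds to one document D_i and each column to one Keyword_j, so the result can be passed directly to a dimensionality-reduction step such as TSNE.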

Further, in step S4.1, the pixel density function is normalized using the following formula:

[formula not reproduced in the source]

where Density_max is the maximum of the density values of all pixels.
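A minimal sketch of the normalization in step S4.1 follows. Because the formula itself is not reproduced in the text, the sketch assumes the simplest reading consistent with the definition of Density_max: each density value is divided by the maximum. The function name `normalize_density` and the sample grid are hypothetical.

```python
def normalize_density(density):
    """Scale a 2D grid of non-negative density values into [0, 1].

    Assumption: normalization is division by the maximum density
    (Density_max); the source does not reproduce the exact formula.
    """
    density_max = max(max(row) for row in density)
    if density_max == 0:
        # No node contributes anywhere: every pixel stays at 0.
        return [[0.0 for _ in row] for row in density]
    return [[value / density_max for value in row] for row in density]


# Hypothetical 2 x 2 pixel grid of raw density values.
grid = [[0.0, 2.0], [1.0, 4.0]]
q = normalize_density(grid)
```

With this reading, the densest pixel always maps to 1.0 and therefore to full saturation in step S4.3.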

The beneficial effect of the present invention is that it provides a relatively simple method for drawing a knowledge topographic map from large-scale text data. The method uses the TSNE algorithm for the planar layout of documents, establishes a planar pixel density function parameterized by the degree of document aggregation to map the color of points on the plane, then sets pixel colors in HSV mode and divides them into a fixed number of levels to render the knowledge topographic map. It has good visual expressiveness and is simple to implement.

Brief Description of the Drawings

Fig. 1 is a schematic diagram, in an embodiment of the present invention, of the relationship between the total number N of document nodes and the number N_p of document nodes within the unit area centered on pixel P;

Fig. 2 is the rendered knowledge topographic map for 5000 nodes in an embodiment of the present invention;

Fig. 3 is the rendered knowledge topographic map for 15000 nodes in an embodiment of the present invention.

Detailed Description of the Embodiments

The present invention is further described below with reference to the accompanying drawings. It should be noted that this embodiment, which is premised on the present technical solution, gives a detailed implementation and a specific operating procedure, but the scope of protection of the present invention is not limited to this embodiment.

This embodiment provides a method for drawing a large-scale knowledge topographic map, comprising the following steps:

S1. Obtain the subject terms of each document by word segmentation, and build a subject term matrix from the subject terms of all documents. The subject term matrix is as follows:

$$
\begin{array}{c|cccc}
 & \mathrm{Keyword}_1 & \mathrm{Keyword}_2 & \cdots & \mathrm{Keyword}_l \\ \hline
D_1 & e_{11} & e_{12} & \cdots & e_{1l} \\
D_2 & e_{21} & e_{22} & \cdots & e_{2l} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
D_N & e_{N1} & e_{N2} & \cdots & e_{Nl}
\end{array}
$$

where Keyword_1, ..., Keyword_l denote the l extracted subject terms; D_1, ..., D_N denote the N documents; and e_ij denotes the number of occurrences of Keyword_j in document D_i, i = 1, ..., N, j = 1, ..., l;

S2. Input the subject term matrix built in step S1 into the TSNE algorithm, which uses the matrix to map the documents onto a two-dimensional plane, where each document is represented by a spherical node. The entity to which each node's document belongs (for example an institution, author, or country) is distinguished by node color, and the nodes of documents belonging to the same entity have the same color.

Unlike the prior art CN106021228A, the nodes on the two-dimensional plane in the method of this embodiment are document nodes, not subject-term nodes. The planar layout algorithm used is the TSNE algorithm, not the Fruchterman-Reingold layout or VOS mapping algorithms. The advantage of the TSNE algorithm is that it can still present a relatively clear structure for large document collections.
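The node-coloring rule of step S2 (one color per entity, identical for all documents of that entity) can be sketched as below. The patent does not say how the colors are chosen, so the evenly spaced hues, the function name `entity_colors`, and the sample entity names are all assumptions made for illustration.

```python
import colorsys


def entity_colors(entities):
    """Return one RGB color per document, shared by documents of one entity.

    entities: list with the owning entity (e.g. institution, author,
    country) of each document, in document order.
    Assumption: hues are spread evenly around the color wheel.
    """
    unique = sorted(set(entities))
    palette = {}
    for index, name in enumerate(unique):
        hue = index / len(unique)  # distinct hue per entity
        palette[name] = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
    return [palette[name] for name in entities]


# Hypothetical entities for three documents.
doc_entities = ["Institute A", "Institute B", "Institute A"]
colors = entity_colors(doc_entities)
```

The returned list is aligned with the document order, so it can be used directly as the per-node color when the two-dimensional layout is plotted.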

S3. Construct the density function of the pixels based on the number and coordinates of the spherical nodes within a unit area of the two-dimensional plane:

Once the coordinates of each spherical node on the two-dimensional plane are determined, the nodes must be drawn on the computer screen, and to achieve the rendering effect of a topographic map the color of each pixel must be determined. To this end, a density function is established to map the color value of each pixel;

where N_p is the number of spherical nodes within the unit area centered on pixel P (rather than the number of all spherical nodes, in order to speed up the algorithm and save running time). The values of α and β determine the slope effect of the topographic map: the larger α and β are, the smaller the slope of the topographic map and the more pronounced its peak effect. Fig. 1 shows the relationship between N and N_p.

The unit area is typically 10 pixels × 10 pixels, 100 pixels × 100 pixels, or similar.
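The two quantities the density function is built from, N_p and the average Euclidean distance between nodes, can be computed as in the sketch below. The density formula itself is not reproduced in the text, so the sketch stops at its inputs. It assumes that the unit area is an axis-aligned square window centered on the pixel and that the average is taken over all node pairs; the function names and sample coordinates are hypothetical.

```python
import math
from itertools import combinations


def mean_pairwise_distance(nodes):
    """Average 2D Euclidean distance over all node pairs (assumed reading).

    Note: this is O(N^2); for tens of thousands of nodes a sampled
    estimate may be preferable.
    """
    pairs = list(combinations(nodes, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def count_nodes_in_window(nodes, px, py, side=10):
    """N_p: number of nodes inside a side x side square centered on (px, py)."""
    half = side / 2
    return sum(
        1 for x, y in nodes
        if abs(x - px) <= half and abs(y - py) <= half
    )


# Hypothetical node coordinates in pixel units.
nodes = [(0.0, 0.0), (3.0, 4.0), (50.0, 50.0)]
n_p = count_nodes_in_window(nodes, 2, 2, side=10)
avg = mean_pairwise_distance(nodes)
```

Restricting the count to the window around P, rather than scanning all N nodes for every pixel, is what the embodiment describes as the way to save running time; a spatial index or grid bucketing would make the window query itself cheaper.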

Specifically, the following transformation can be used:

Density_max is the maximum of the density values of all pixels;

S4.2. Establish an HSV palette in which the values of H and V are fixed; HSV stands for hue (H), saturation (S), and value (V);

The values of H and V are not specifically restricted and may correspond to any color, for example fixed at the H and V values of blue or green; these two values determine the overall color of the topographic map;

S4.3. Establish a one-to-one mapping between the normalized density value of each pixel and the HSV palette:

Denoting the normalized density value of a pixel by q, the corresponding HSV color of that pixel on the palette has fixed H and V values and an S value of q*100%;

For example, if the normalized density value of a pixel is 0.1, the corresponding color of that pixel on the HSV palette has fixed H and V values and an S value of 10%.
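The mapping of step S4.3 can be sketched with Python's standard `colorsys` module. The fixed hue and value below (a bluish hue at full brightness) are arbitrary choices for illustration, as the text allows any H and V; the function name `density_to_rgb` is hypothetical.

```python
import colorsys

# Assumed fixed palette parameters: any H and V are permitted by the text.
FIXED_H = 0.6  # hue as a fraction of the color wheel (bluish)
FIXED_V = 1.0  # full brightness


def density_to_rgb(q, h=FIXED_H, v=FIXED_V):
    """Map a normalized density q in [0, 1] to an 8-bit RGB color.

    H and V stay fixed; the saturation S is set to q (that is, q*100%).
    """
    q = min(max(q, 0.0), 1.0)  # clamp defensively
    r, g, b = colorsys.hsv_to_rgb(h, q, v)
    return tuple(round(channel * 255) for channel in (r, g, b))


low = density_to_rgb(0.0)   # zero saturation: white when V is 1.0
high = density_to_rgb(1.0)  # full saturation: the pure palette color
```

With V fixed at full brightness, sparse regions fade toward white and dense regions toward the pure palette color, which produces the depth-of-color effect described for the map.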

M >= 3;

The purpose of the division is to give the final knowledge topographic map a sense of layering.

For example, if the S value is divided into 10 levels, the S value in the HSV color of each pixel is adjusted according to the following formula:

[formula not reproduced in the source]

For example, when the S value in the HSV color of a pixel is 11%, it is adjusted to 20% according to the above formula, which gives the adjusted image a more layered topographic effect.
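The level adjustment can be sketched as below. Since the formula is not reproduced in the text, the sketch assumes rounding up to the next of M equally spaced levels, which is the simplest rule consistent with the worked example (11% becomes 20% with 10 levels). The function name `quantize_saturation` is hypothetical.

```python
import math


def quantize_saturation(s, levels=10):
    """Snap a saturation s in [0, 1] up to the next of `levels` equal steps.

    Assumption: ceiling-based rounding, inferred from the example in the
    text (0.11 -> 0.20 for 10 levels); the source formula is not available.
    """
    if levels < 3:
        raise ValueError("the method requires M >= 3 levels")
    s = min(max(s, 0.0), 1.0)
    # round() removes floating-point noise such as 0.3 * 10 == 3.0000000000000004
    step = math.ceil(round(s * levels, 9))
    return step / levels


adjusted = quantize_saturation(0.11, levels=10)
```

Values that already sit exactly on a level, such as 0.3 with 10 levels, stay unchanged under this rule; only values strictly between two levels are raised.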

Figs. 2 and 3 are the rendered knowledge topographic maps for 5000 and 15000 nodes respectively, where the numbers indicate the identifiers of the individual documents.

Those skilled in the art can make various corresponding changes and modifications based on the above technical solution and concept, and all such changes and modifications shall fall within the scope of protection of the claims of the present invention.

Claims

1. A method for drawing a large-scale knowledge topographic map, characterized by comprising the following steps:

S1. Obtain the subject terms of each document by word segmentation, and build a subject term matrix from the subject terms of all documents;

S2. Input the subject term matrix built in step S1 into the TSNE algorithm, which uses the matrix to map the documents onto a two-dimensional plane, where each document is represented by a spherical node; the entity to which each node's document belongs is distinguished by node color, and the nodes of documents belonging to the same entity have the same color;

S3. Construct the density function of the pixels based on the number and coordinates of the spherical nodes within a unit area of the two-dimensional plane;

Let the coordinates of the spherical nodes corresponding to the N documents be (x_i, y_i), i = 1, ..., N, and let the average two-dimensional Euclidean distance between the spherical nodes corresponding to the documents be [formula not reproduced in the source].

where N_p is the number of spherical nodes within the unit area centered on pixel P. The values of α and β determine the slope effect of the topographic map: the larger α and β are, the smaller the slope of the topographic map and the more pronounced its peak effect;

S4. Calculate the color value of each pixel and render the topographic map;

S4.1. Normalize the pixel density function obtained in step S3 so that its value is a floating-point number between 0 and 1;

S4.2. Establish an HSV palette in which the values of H and V are fixed;

S4.3. Establish a one-to-one mapping between the normalized density value of each pixel and the HSV palette: denoting the normalized density value of a pixel by q, the corresponding HSV color of that pixel on the palette has fixed H and V values and an S value of q*100%;

S4.4. Divide the range of the S value of the pixel color into M level values, and adjust the S value of each pixel calculated in step S4.3 to the corresponding level value according to the following formula:

M >= 3;

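In step S1, the subject term matrix is as follows:

$$
\begin{array}{c|cccc}
 & \mathrm{Keyword}_1 & \mathrm{Keyword}_2 & \cdots & \mathrm{Keyword}_l \\ \hline
D_1 & e_{11} & e_{12} & \cdots & e_{1l} \\
D_2 & e_{21} & e_{22} & \cdots & e_{2l} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
D_N & e_{N1} & e_{N2} & \cdots & e_{Nl}
\end{array}
$$

where Keyword_1, ..., Keyword_l denote the l extracted subject terms; D_1, ..., D_N denote the N documents; and e_ik denotes the number of occurrences of Keyword_k in document D_i, i = 1, ..., N, k = 1, ..., l.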

2. The method according to claim 1, characterized in that, in step S4.1, the pixel density function is normalized using the following formula:

where Density_max is the maximum of the density values of all pixels.