CN111581394B - Large-scale knowledge topography drawing method - Google Patents

Info

Publication number
CN111581394B
Authority
CN
China
Prior art keywords
value
pixel point
documents
values
topography
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010368399.4A
Other languages
Chinese (zh)
Other versions
CN111581394A (en)
Inventor
刘玉琴
汪雪锋
刘佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Graphic Communication
Original Assignee
Beijing Institute of Graphic Communication
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Graphic Communication
Priority to CN202010368399.4A
Publication of CN111581394A
Application granted
Publication of CN111581394B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Processing Or Creating Images (AREA)
  • Image Generation (AREA)

Abstract

The invention discloses a large-scale knowledge topographic map drawing method, a relatively simple way to draw knowledge topographic maps for large-scale text data. The TSNE algorithm is used to lay the documents out on a plane; a pixel density function that takes the degree of document aggregation as a parameter is established to map the colors of points on the plane; pixel colors are then set in HSV mode and divided into several fixed levels to render the knowledge topographic map. The method has good visual expressiveness and is simple to implement.

Description

Large-scale knowledge topography drawing method
Technical Field
The invention relates to a text analysis method, belongs to the field of information processing, and particularly relates to a large-scale knowledge topography drawing method.
Background
A knowledge topographic map visualizes text data as a contour map similar to those in geographic information systems, using shades of color to show how much data there is and how the data are related. Some literature calls it a landscape map or a topic map; the names and forms of expression differ, but the basic idea is the same. Knowledge topographic maps are mainly applied to the analysis of text data, such as patent texts, research papers, and online texts from microblogs and WeChat, to reveal the knowledge content expressed in the text.
A knowledge topographic map can be drawn from the subject words in the text. For example, Chinese patent application CN106021228A (published 2016-10-12) discloses a method and system for text analysis using a knowledge topographic map, in which the map is drawn by extracting subject words from the text. There are also technical solutions that use document distances for document clustering, knowledge extraction, and graph drawing; a typical example is the Canadian Rui Wei An Innovation patent map, whose algorithm is complex. In terms of drawing style, there are heat maps, topographic maps, rainbow maps, weather maps, and so on.
When there is a large amount of text data and the number of subject words is large (for example, tens of thousands of documents or subject words), it is very difficult to draw a knowledge topographic map with a clear and accurate structure that reveals the text content. In CN106021228A, once the number of subject words exceeds 1000, the limitations of the layout algorithm greatly reduce the readability of the drawn map; moreover, the map only displays structural features of the text and cannot reflect the relationship between the entities to which the texts belong. The Canadian Rui Wei An Innovation patent map displays relatively more nodes and can reflect the relationship between the entities to which the texts belong, but its algorithm is complex and technically difficult to implement.
Disclosure of Invention
Aiming at the defects of the prior art, the invention aims to provide a large-scale knowledge topographic map drawing method which has good visual expressive force and simple realization technology.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
A large-scale knowledge topographic map drawing method comprises the following steps:
S1, obtaining the subject words of each document by word segmentation, and building a subject word matrix from the subject words of all documents;
S2, inputting the subject word matrix established in step S1 into the TSNE algorithm, which uses the matrix to map the documents onto a two-dimensional plane, where each document is represented by a spherical node; the entities to which the documents belong are distinguished by node color, and spherical nodes whose documents belong to the same entity have the same color;
S3, constructing a density function of the pixel points based on the number and coordinates of the spherical nodes in a unit area of the two-dimensional plane;
the coordinates of the spherical nodes corresponding to the N documents are denoted (x_i, y_i), i = 1, …, N, and the average of the two-dimensional Euclidean distances between the spherical nodes corresponding to the documents is
[Formula image BDA0002477294640000021 is not reproduced in this text.]
For a pixel point P with coordinates (x, y), the density function of the pixel point is defined as follows:
[Formula image BDA0002477294640000031 is not reproduced in this text.]
where N_p is the number of spherical nodes covered by the unit area centered on pixel point P; the values of α and β determine the slope effect of the topographic map: the larger α and β are, the smaller the slope of the topographic map and the more pronounced its mountain effect;
S4, calculating the color value of every pixel point and rendering the topographic map;
S4.1, normalizing the density function of the pixel points obtained in step S3 so that its value is a floating-point number between 0 and 1;
S4.2, establishing an HSV-mode palette in which the values of H and V are fixed;
S4.3, establishing a one-to-one mapping between the normalized density value of each pixel point and the HSV-mode palette: denoting the normalized density value of the pixel point as q, the H and V values of the pixel point's HSV color on the palette are fixed, and the S value is q × 100%;
S4.4, dividing the range of the S value of the pixel color into M level values, and adjusting the S value of each pixel point calculated in step S4.3 to the corresponding level value; the formula is as follows:
[Formula image BDA0002477294640000032 is not reproduced in this text.]
M ≥ 3.
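Further, in step S1, the subject word matrix is as follows:
\[
\begin{array}{c|ccc}
 & \text{Keyword}_1 & \cdots & \text{Keyword}_l \\ \hline
D_1 & e_{11} & \cdots & e_{1l} \\
\vdots & \vdots & \ddots & \vdots \\
D_N & e_{N1} & \cdots & e_{Nl}
\end{array}
\]
where Keyword_1, …, Keyword_l represent the l extracted subject words; D_1, …, D_N represent the N documents; and e_ij represents the occurrence of Keyword_j in document D_i, i = 1, …, N, j = 1, …, l.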
Further, in step S4.1, the density function of the pixel points is normalized using the following formula:
[Formula image BDA0002477294640000042 is not reproduced in this text.]
where Density_max is the maximum of the density values of all pixel points.
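Because the formula images are not reproduced in this text, the exact expressions cannot be confirmed here. The block below is only one reading that is consistent with the surrounding definitions, under stated assumptions: the mean distance is taken over unordered pairs of nodes; normalization divides by Density_max (the only constant the text defines for it); and the level formula is chosen to match the worked example in the detailed description (an S value of 11% adjusted to 20% with 10 levels). The symbols \bar{d} and S' are notation introduced for illustration. The density function itself, which involves N_p, α, and β, cannot be recovered from the text and is not attempted.

```latex
% Assumed form of the average pairwise Euclidean distance between nodes
\bar{d} = \frac{2}{N(N-1)} \sum_{1 \le i < j \le N}
          \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}

% Assumed normalization in step S4.1 (q lies in [0, 1])
q = \frac{\mathrm{Density}(P)}{\mathrm{Density}_{\max}}

% Assumed level adjustment in step S4.4, with M \ge 3
S' = \frac{k}{M} \quad \text{if } \frac{k-1}{M} < S \le \frac{k}{M},
     \qquad k = 1, \dots, M
```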
The beneficial effects of the invention are as follows: the invention provides a relatively simple method for drawing a knowledge topographic map for large-scale text data. The TSNE algorithm is used to lay the documents out on a plane; a pixel density function that takes the degree of document aggregation as a parameter is established to map the colors of points on the plane; pixel colors are then set in HSV mode and divided into several fixed levels to render the knowledge topographic map. The method has good visual expressiveness and is simple to implement.
Drawings
FIG. 1 is a schematic diagram of the relationship between the number N of all document nodes and the number N_p of document nodes in the unit area centered on pixel point P in an embodiment of the invention;
FIG. 2 shows the rendered knowledge topographic map for 5000 nodes in an embodiment of the invention;
FIG. 3 shows the rendered knowledge topographic map for 15000 nodes in an embodiment of the invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. Note that this embodiment gives a detailed implementation and a specific operating procedure based on the present technical solution, but the scope of protection of the invention is not limited to this embodiment.
The embodiment provides a large-scale knowledge topography drawing method, which comprises the following steps:
S1, obtaining the subject words of each document by word segmentation, and building a subject word matrix from the subject words of all documents, the subject word matrix being as follows:
\[
\begin{array}{c|ccc}
 & \text{Keyword}_1 & \cdots & \text{Keyword}_l \\ \hline
D_1 & e_{11} & \cdots & e_{1l} \\
\vdots & \vdots & \ddots & \vdots \\
D_N & e_{N1} & \cdots & e_{Nl}
\end{array}
\]
where Keyword_1, …, Keyword_l represent the l extracted subject words; D_1, …, D_N represent the N documents; and e_ij represents the occurrence of Keyword_j in document D_i, i = 1, …, N, j = 1, …, l;
S2, inputting the subject word matrix established in step S1 into the TSNE algorithm, which uses the matrix to map the documents onto a two-dimensional plane, where each document is represented by a spherical node; the entity to which the document of each spherical node belongs (for example an institution, author, or country) is distinguished by node color, and spherical nodes whose documents belong to the same entity have the same color.
Unlike the prior art CN106021228A, in the method of this embodiment the nodes on the two-dimensional plane are document nodes, not subject word nodes. The planar layout algorithm used is the TSNE algorithm, rather than the Fruchterman-Reingold layout algorithm or the VOS mapping algorithm. The advantage of the TSNE algorithm is that it still presents a clear structure for large-scale document sets.
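Step S1 above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes that word segmentation has already produced a list of subject words per document, and that e_ij counts how often Keyword_j occurs in document D_i (the text does not say whether the entries are counts or 0/1 flags). The function name `build_subject_word_matrix` and the sample words are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def build_subject_word_matrix(
    doc_words: Sequence[Sequence[str]],
) -> Tuple[List[str], List[List[int]]]:
    """Return (keywords, matrix); matrix[i][j] plays the role of e_ij."""
    # Columns Keyword_1..Keyword_l: the distinct subject words, sorted
    # so that the column order is reproducible.
    keywords = sorted({w for words in doc_words for w in words})
    column: Dict[str, int] = {w: j for j, w in enumerate(keywords)}

    # Rows D_1..D_N: one row per document, initialised to zero.
    matrix = [[0] * len(keywords) for _ in doc_words]
    for i, words in enumerate(doc_words):
        for w in words:
            # Assumption: e_ij is an occurrence count, not a 0/1 flag.
            matrix[i][column[w]] += 1
    return keywords, matrix


# Hypothetical output of word segmentation for three documents.
docs = [
    ["map", "density", "map"],  # D_1
    ["density", "color"],       # D_2
    ["layout"],                 # D_3
]
keywords, matrix = build_subject_word_matrix(docs)
```

The resulting N × l matrix is what step S2 would pass to a t-SNE implementation to obtain two-dimensional coordinates. One possible choice, not named in the source, is scikit-learn's `sklearn.manifold.TSNE` with `n_components=2`.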
S3, constructing a density function of the pixel points based on the number and coordinates of spherical nodes in the unit area of the two-dimensional plane:
after the coordinates of each spherical node on the two-dimensional plane are determined, the spherical nodes are drawn on the computer screen; to achieve the rendering effect of a topographic map, the color of every pixel point must also be determined. For this purpose, a density function is established to map the color value of each pixel point;
the coordinates of the spherical nodes corresponding to the N documents are denoted (x_i, y_i), i = 1, …, N, and the average of the two-dimensional Euclidean distances between the spherical nodes corresponding to the documents is
[Formula image BDA0002477294640000061 is not reproduced in this text.]
For a pixel point P with coordinates (x, y), the density function of the pixel point is defined as follows:
[Formula image BDA0002477294640000062 is not reproduced in this text.]
where N_p is the number of spherical nodes covered by the unit area centered on pixel point P (not the total number of spherical nodes), which speeds up execution of the algorithm and saves running time; the values of α and β determine the slope effect of the topographic map: the larger α and β are, the smaller the slope of the topographic map and the more pronounced its mountain effect. FIG. 1 shows the relationship between N and N_p.
The unit area is typically 10 pixels by 10 pixels, 100 pixels by 100 pixels, and so on.
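The two inputs of the density function described above can be sketched as follows. This is a hedged illustration under stated assumptions: the average distance is taken over all unordered pairs of nodes, and the unit area is an axis-aligned square of side `side` centered on pixel P with its boundary included. The density formula itself is not reproduced in this text, so the sketch stops at its inputs. The function names are hypothetical.

```python
import math
from itertools import combinations
from typing import List, Tuple

Point = Tuple[float, float]


def mean_pairwise_distance(nodes: List[Point]) -> float:
    """Average 2-D Euclidean distance over all unordered node pairs."""
    n = len(nodes)
    if n < 2:
        return 0.0
    total = sum(
        math.hypot(a[0] - b[0], a[1] - b[1])
        for a, b in combinations(nodes, 2)
    )
    return total / (n * (n - 1) / 2)


def nodes_in_unit_area(nodes: List[Point], pixel: Point, side: float) -> int:
    """N_p: nodes inside the side x side square centred on the pixel."""
    half = side / 2.0
    px, py = pixel
    return sum(
        1 for x, y in nodes
        if abs(x - px) <= half and abs(y - py) <= half
    )


# Hypothetical node coordinates on the two-dimensional plane.
nodes = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (50.0, 50.0)]
d_mean = mean_pairwise_distance(nodes[:3])
n_p = nodes_in_unit_area(nodes, pixel=(2.0, 2.0), side=10.0)
```

The all-pairs average costs O(N²) distance evaluations, whereas N_p only needs the nodes near each pixel. For tens of thousands of nodes, a grid or spatial index over the nodes would keep the per-pixel count cheap; that is a design suggestion, not something the source prescribes.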
S4, calculating the color value of every pixel point and rendering the topographic map;
S4.1, normalizing the density function of the pixel points obtained in step S3 so that its value is a floating-point number between 0 and 1;
the following conversion can be used:
[Formula image BDA0002477294640000071 is not reproduced in this text.]
where Density_max is the maximum of the density values of all pixel points;
S4.2, establishing an HSV-mode palette in which the values of H and V are fixed; HSV refers to hue (H), saturation (S), and value, i.e. brightness (V);
the values of H and V are not particularly limited and may correspond to any color, for example the H and V values of blue or green; these two values determine the overall color of the topographic map;
S4.3, establishing a one-to-one mapping between the normalized density value of each pixel point and the HSV-mode palette:
denoting the normalized density value of the pixel point as q, the H and V values of the pixel point's HSV color on the palette are fixed, and the S value is q × 100%;
for example, if the normalized density value of a pixel point is 0.1, then in its corresponding color on the HSV-mode palette the values of H and V are fixed and the value of S is 10%.
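Steps S4.1 to S4.3 can be sketched with Python's standard `colorsys` module. The normalization below divides by the maximum density, which is an assumption based on Density_max being the only constant the text defines for this step. The hue 0.6 (a blue) and value 1.0 are arbitrary illustrative choices for the fixed H and V, and the function names are hypothetical.

```python
import colorsys
from typing import List, Tuple


def normalize(densities: List[float]) -> List[float]:
    """Scale densities into [0, 1] by dividing by the maximum (assumed form)."""
    d_max = max(densities)
    if d_max <= 0:
        return [0.0 for _ in densities]
    return [d / d_max for d in densities]


def pixel_rgb(q: float, hue: float = 0.6, value: float = 1.0) -> Tuple[int, int, int]:
    """Map normalized density q to an RGB color: H and V fixed, S = q."""
    # colorsys works on 0..1 floats, so S = q corresponds to q * 100%.
    r, g, b = colorsys.hsv_to_rgb(hue, q, value)
    return round(r * 255), round(g * 255), round(b * 255)


# Hypothetical raw density values for four pixels.
q_values = normalize([0.0, 2.0, 5.0, 20.0])
colors = [pixel_rgb(q) for q in q_values]
```

With V fixed at 1.0, a pixel with zero density renders as white and the densest pixel renders as the fully saturated hue, so saturation alone carries the height information.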
S4.4, dividing the range of the S value of the pixel color into M level values, and adjusting the S value of each pixel point calculated in step S4.3 to the corresponding level value; the formula is as follows:
[Formula image BDA0002477294640000072 is not reproduced in this text.]
M ≥ 3;
the purpose of the division is to give the final knowledge topographic map a layered appearance.
For example, if the range of the S value is divided into 10 levels, the S value in the HSV color of each pixel point is adjusted according to the following formula:
[Formula image BDA0002477294640000081 is not reproduced in this text.]
for example, when the S value in the HSV color of a pixel point is 11%, it is adjusted to 20% according to the above formula, so that the adjusted map shows a more layered topographic effect.
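The level adjustment in step S4.4 can be sketched as rounding each S value up to the next of M evenly spaced levels. This rule is an assumption chosen to reproduce the worked example above (11% becomes 20% with 10 levels), since the formula image is not available. The sketch also assumes that an S value of exactly 0 stays at 0, which the source does not specify. The function name is hypothetical.

```python
import math


def quantize_saturation(s: float, levels: int = 10) -> float:
    """Round saturation s in [0, 1] up to the next multiple of 1/levels."""
    if levels < 3:
        raise ValueError("the source requires M >= 3")
    if not 0.0 <= s <= 1.0:
        raise ValueError("s must lie in [0, 1]")
    # round() removes floating-point noise in s * levels so that values
    # already on a level boundary are not pushed up to the next level.
    k = math.ceil(round(s * levels, 9))
    # Assumption: s == 0 gives k == 0, so empty regions stay unsaturated.
    return k / levels


adjusted = quantize_saturation(0.11, levels=10)
```

With 10 levels every pixel ends up with one of the saturations 0.1, 0.2, …, 1.0 (or 0 under the assumption above), which is what produces the contour-like bands in the rendered map.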
FIGS. 2 and 3 show the rendered knowledge topographic maps for 5000 and 15000 nodes, respectively; the numerals in the figures are the numbers of the respective documents.
Various modifications and variations of the present invention will be apparent to those skilled in the art in light of the foregoing teachings and are intended to be included within the scope of the following claims.

Claims (2)

1. A large-scale knowledge topography drawing method is characterized by comprising the following steps:
S1, obtaining the subject words of each document by word segmentation, and building a subject word matrix from the subject words of all documents;
S2, inputting the subject word matrix established in step S1 into the TSNE algorithm, which uses the matrix to map the documents onto a two-dimensional plane, where each document is represented by a spherical node; the entities to which the documents belong are distinguished by node color, and spherical nodes whose documents belong to the same entity have the same color;
S3, constructing a density function of the pixel points based on the number and coordinates of the spherical nodes in a unit area of the two-dimensional plane;
the coordinates of the spherical nodes corresponding to the N documents are denoted (x_i, y_i), i = 1, …, N, and the average of the two-dimensional Euclidean distances between the spherical nodes corresponding to the documents is
[Formula image FDA0004246220390000011 is not reproduced in this text.]
For a pixel point P with coordinates (x, y), the density function of the pixel point is defined as follows:
[Formula image FDA0004246220390000012 is not reproduced in this text.]
where N_p is the number of spherical nodes covered by the unit area centered on pixel point P; the values of α and β determine the slope effect of the topographic map: the larger α and β are, the smaller the slope of the topographic map and the more pronounced its mountain effect;
S4, calculating the color value of every pixel point and rendering the topographic map;
S4.1, normalizing the density function of the pixel points obtained in step S3 so that its value is a floating-point number between 0 and 1;
S4.2, establishing an HSV-mode palette in which the values of H and V are fixed;
S4.3, establishing a one-to-one mapping between the normalized density value of each pixel point and the HSV-mode palette: denoting the normalized density value of the pixel point as q, the H and V values of the pixel point's HSV color on the palette are fixed, and the S value is q × 100%;
S4.4, dividing the range of the S value of the pixel color into M level values, and adjusting the S value of each pixel point calculated in step S4.3 to the corresponding level value; the formula is as follows:
[Formula image FDA0004246220390000021 is not reproduced in this text.]
M ≥ 3;
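in step S1, the subject word matrix is as follows:
\[
\begin{array}{c|ccc}
 & \text{Keyword}_1 & \cdots & \text{Keyword}_l \\ \hline
D_1 & e_{11} & \cdots & e_{1l} \\
\vdots & \vdots & \ddots & \vdots \\
D_N & e_{N1} & \cdots & e_{Nl}
\end{array}
\]
where Keyword_1, …, Keyword_l represent the l extracted subject words; D_1, …, D_N represent the N documents; and e_ik represents the occurrence of Keyword_k in document D_i, i = 1, …, N, k = 1, …, l.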
2. The method according to claim 1, wherein in step S4.1, the density function normalization of the pixel points is performed using the following formula:
[Formula image FDA0004246220390000023 is not reproduced in this text.]
where Density_max is the maximum of the density values of all pixel points.
CN202010368399.4A 2020-04-30 2020-04-30 Large-scale knowledge topography drawing method Active CN111581394B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010368399.4A CN111581394B (en) 2020-04-30 2020-04-30 Large-scale knowledge topography drawing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010368399.4A CN111581394B (en) 2020-04-30 2020-04-30 Large-scale knowledge topography drawing method

Publications (2)

Publication Number Publication Date
CN111581394A 2020-08-25
CN111581394B 2023-06-23

Family

ID=72111948

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010368399.4A Active CN111581394B (en) 2020-04-30 2020-04-30 Large-scale knowledge topography drawing method

Country Status (1)

Country Link
CN (1) CN111581394B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106021228A (en) * 2016-05-18 2016-10-12 德稻全球创新网络(北京)有限公司 Method and system for performing text analysis by utilizing knowledge topographic map
CN107766412A (en) * 2017-09-05 2018-03-06 华南师范大学 A kind of mthods, systems and devices for establishing thematic map
CN108628991A (en) * 2018-04-28 2018-10-09 上海久誉软件系统有限公司 The analysis and visualization system that rail traffic failure influences passenger flow
CN109952612A (en) * 2016-11-08 2019-06-28 赛卢拉研究公司 Method for express spectra classification
CN110019766A (en) * 2017-09-25 2019-07-16 腾讯科技(深圳)有限公司 Methods of exhibiting, device, mobile terminal and the readable storage medium storing program for executing of knowledge mapping
CN110780873A (en) * 2019-09-06 2020-02-11 平安普惠企业管理有限公司 Interface color adaptation method and device, computer equipment and storage medium
CN110866126A (en) * 2019-11-22 2020-03-06 福建工程学院 College online public opinion risk assessment method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180366013A1 (en) * 2014-08-28 2018-12-20 Ideaphora India Private Limited System and method for providing an interactive visual learning environment for creation, presentation, sharing, organizing and analysis of knowledge on subject matter

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Hsiang-Yun Wu. Focus+context metro map layout and annotation.《SCCG '16: Proceedings of the 32nd Spring Conference on Computer Graphics》.2016,41-47. *
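A simple method for drawing technology topic maps (一种简易的技术主题图绘制方法); Liu Yuqin (刘玉琴) et al.; Library and Information Service (《图书情报工作》); Vol. 61, No. 13; 125-132 *
Clustering method and credibility detection based on a graph data model (基于图数据模型的聚类方法及可信度检测); Cheng Yanyun (程艳云) et al.; Journal of System Simulation (《系统仿真学报》); 2102-2108 *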

Also Published As

Publication number Publication date
CN111581394A (en) 2020-08-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant