CN101211344A - Text message ergodic rapid four-dimensional visualization method - Google Patents
Text message ergodic rapid four-dimensional visualization method
- Publication number
- CN101211344A (application number CNA2006101483476A)
- Authority
- CN
- China
- Prior art keywords
- text
- cluster
- coordinate
- dimensional
- barycenter
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention provides a new, rapid four-dimensional visualization method for traversing text information. The method comprises: (1) building a database of the texts to be analyzed; (2) accepting user input and combining the input values with fixed feature values to create high-dimensional feature vectors, each of which represents the subject attributes of an independent text set; (3) grouping the resulting high-dimensional feature vectors into clusters, each cluster being preliminarily divided according to its degree of association with a subject attribute; (4) calculating the centroid coordinates of each cluster and projecting the centroids onto a two-dimensional plane; (5) establishing a vector for each text, each vector containing the distance from the text to the centroid; (6) creating text layers, each associated with a corresponding cluster, with coordinates (x, y) indicating the texts associated with each layer; and (7) obtaining the z coordinate and u coordinate of each text using a conversion function, yielding a four-dimensional visual representation, and overlaying these coordinates onto the other layers.
Description
Technical field
The invention belongs to the field of computer information retrieval and storage. It provides a new automatic four-dimensional visual representation method for traversing text information (building a multi-dimensional index for the user). The method builds on three-dimensional visualization and human-computer interaction.
Background art
Current text visualization methods consist mainly of traditional chart-based techniques, for example histograms, organization charts, product catalogs, and entity-attribute association diagrams in databases. Their shortcomings are that they cannot visualize arbitrary text and do not scale to large databases. Computer "visual query" tools visualize a text library by graphical methods or data abstraction and can be used by any user in any environment, but they are still unsuitable for very large text databases. Researchers have built analysis systems for large text-based information databases, but these rely on Boolean queries, document lists, and a great deal of manual effort to classify, edit, and structure the data. In fields such as market analysis, weather forecast assessment, environmental monitoring, and even national security intelligence analysis, the analyst's task is to sift through large volumes of data to find the appropriate information, recognize patterns, and handle patterns that cut across many different data sources. As open digital resources grow exponentially, however, users confronted with massive document collections face the following problems: documents are hard to classify, documents are hard to identify, storage requirements increase, and retrieval slows down. Existing three-dimensional visualization methods also have shortcomings: their processing is too simple, they easily lose text information, and their human-computer interaction is weak.
Summary of the invention
To overcome the shortcomings of the prior art described above, the invention provides, for the retrieval and analysis of massive text information, a new spatial representation and vector processing method that converts text on the basis of vector-space dimensions. It can visualize in any number of dimensions according to actual requirements and adds a user-preference parameter as the fourth dimension.
The basic idea of the invention is to extract the number of feature vectors according to user input, derive the optimal dimensionality for text retrieval and analysis, and use it to determine and display the content and context of the related texts in the text database. All texts are represented by relevant size values, peak values (a rank value expressing how the text's subjects are ordered by importance in the space), content, and additional feature values supplied by the user. The method comprises: (1) building a database of the texts to be analyzed; (2) accepting user input and combining the input values with fixed feature values to create high-dimensional feature vectors, each of which represents the subject attributes of an independent text set; (3) grouping the resulting high-dimensional feature vectors into clusters, each cluster being preliminarily divided according to its degree of association with a subject attribute; (4) calculating the centroid coordinates of each cluster and projecting the centroids onto a two-dimensional plane; (5) establishing a vector for each text, each vector containing the distance from the text to the centroid; (6) creating text layers, each associated with a corresponding cluster, with coordinates (x, y) indicating the texts associated with each layer; and (7) obtaining the z coordinate and u coordinate of each text using a conversion function, yielding a four-dimensional visual representation, and overlaying these coordinates onto the other layers.
The invention can classify texts effectively according to both user-supplied features and system-defined features. It converts a traditional collection of text data into three-dimensional form and, on top of the three-dimensional visualization, adds the user as a further dimension. This gives the query and analysis of massive text collections a more intuitive, vivid, and simple method, greatly strengthens human-computer interaction, better meets the needs of different users, and is easy to implement in software.
Description of drawings
Figure 1 shows the representation of the text database on a two-dimensional plane.
Figure 2 is a one-dimensional representation of Figure 1.
Figure 3 is a smoothed conversion of Figure 2.
Figure 4 is the four-dimensional representation of the text database.
Embodiment
The specific implementation steps are as follows:
(1) Text preprocessing. Set the number N of texts to be processed and input the texts. Convert the natural-language text into a form suitable for visualization, using the following statistical attributes as the feature values that characterize an individual text: X = (text number; text size; text format; positions and occurrence counts of the keywords in the text; position and occurrence count of each word and the numbers of its adjacent words; number of times users have accessed the text; semantics defined by linguistic knowledge obtained in advance; and feature values that the user may input). Each text is represented by its feature values.
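As an illustration of step (1), the following minimal sketch turns one text into a numeric feature vector. It covers only a subset of the listed attributes (text number, size, access count, keyword counts, and first keyword positions); the function name `text_features`, its parameters, and the whitespace tokenization are assumptions made for this example and are not specified in the source.

```python
from collections import Counter


def text_features(text_id, text, keywords, access_count):
    """Build a simple numeric feature vector for one text (illustrative only)."""
    words = text.lower().split()  # naive whitespace tokenization (assumption)
    counts = Counter(words)
    # Text number, text size in characters, and user access count.
    features = [float(text_id), float(len(text)), float(access_count)]
    for kw in keywords:
        kw = kw.lower()
        first_pos = words.index(kw) if kw in counts else -1
        features.append(float(counts[kw]))  # occurrence count of the keyword
        features.append(float(first_pos))   # first word position, -1 if absent
    return features


vec = text_features(1, "Rain and wind. Rain again", ["rain", "snow"], access_count=7)
print(vec)
```

Attributes such as text format or predefined semantics would need their own numeric encodings, which the source does not describe.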
(2) From the feature values, obtain the Euclidean distance between any two texts, $D_{ij} = (X_i - X_j)^2 / 2$ (where $X_i$ and $X_j$ denote the feature vectors of the i-th and j-th texts). Use this distance as the similarity between texts, and combine the similarity with the feature values obtained in step (1) to form the set of high-dimensional feature vectors.
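The pairwise distance computation in step (2) can be sketched as follows. This example assumes the standard Euclidean norm (via `math.dist`, Python 3.8+), which may differ from the exact expression in the source by a square or a constant factor; the function name `pairwise_distances` is hypothetical.

```python
import math


def pairwise_distances(vectors):
    """Return the symmetric matrix of Euclidean distances between feature vectors."""
    n = len(vectors)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(vectors[i], vectors[j])
            dist[i][j] = dist[j][i] = d  # distance is symmetric
    return dist


D = pairwise_distances([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
print(D[0][1], D[0][2])
```

A smaller value in this matrix means two texts are closer, which is how the distance serves as the similarity measure.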
(3) Cluster the text feature vectors. (a) When the number M of text feature vectors is less than or equal to N, cluster the data with the K-means algorithm (referred to here also as ISODATA): (i) Let c be the number of clusters, max the maximum number of iterations allowed, and Th the minimum deviation threshold allowed between successive iterations. The clustering error E is the sum of the squared deviations of each feature vector from its centroid. (ii) At k = 1, select c feature vectors as the initial centroids $m_j^{(k)}$ in descending order of user access count (for example, more than 100 accesses at least), and assign each $X_i$ in the text feature vector set to the cluster represented by its nearest centroid $m_j^{(k)}$ (that is, the one with the smallest similarity value). Compute $E^{(k)}$. (iii) Compute the new centroids $m_j^{(k+1)}$ after assignment and the error $E^{(k+1)}$. (iv) Repeat steps (ii) and (iii) until k is greater than or equal to max or $\|E^{(k+1)} - E^{(k)}\|$ is less than Th, at which point clustering ends.
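The K-means procedure in case (a) can be sketched roughly as below. The function name `kmeans`, the default values of `max_iter` and `th`, and the handling of empty clusters are assumptions for this example; only the initialization by descending access count and the two stopping conditions follow the description.

```python
import math


def kmeans(vectors, access_counts, c, max_iter=100, th=1e-6):
    """K-means with initial centroids taken from the c most-accessed texts."""
    order = sorted(range(len(vectors)), key=lambda i: access_counts[i], reverse=True)
    centroids = [list(vectors[i]) for i in order[:c]]
    prev_error = None
    labels = []
    for _ in range(max_iter):
        # Assign each vector to its nearest centroid.
        labels = [min(range(c), key=lambda j: math.dist(x, centroids[j])) for x in vectors]
        # Error E: sum of squared distances to the assigned centroids.
        error = sum(math.dist(x, centroids[l]) ** 2 for x, l in zip(vectors, labels))
        # Recompute each centroid as the mean of its members (empty clusters are kept as-is).
        for j in range(c):
            members = [x for x, l in zip(vectors, labels) if l == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
        # Stop when the error changes by less than the threshold Th.
        if prev_error is not None and abs(error - prev_error) < th:
            break
        prev_error = error
    return labels, centroids


labels, centroids = kmeans(
    [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]],
    access_counts=[5, 1, 9, 2],
    c=2,
)
print(labels, centroids)
```

In this toy input the two most-accessed texts lie in different groups, so the initial centroids already separate the data and the loop converges after a few iterations.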
(b) When the number of text feature vectors is greater than N, use a heuristic method based on a knowledge-base set to determine the initial centroids $m_j^{(k)}$, mainly from features such as text size and similarity, ensuring that the similarity value between centroids is largest (that is, the centroids are farthest apart) and that the number of clusters is small. Place these initial centroids in the multi-dimensional text space; the remaining steps are the same as in the K-means algorithm.
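The source does not spell out the knowledge-base heuristic used in case (b). As a stand-in, the sketch below uses greedy farthest-point selection, seeded with the largest text, to obtain initial centroids that are far apart. The function name `farthest_first_centroids` and the seeding rule are assumptions and should not be read as the patent's actual heuristic.

```python
import math


def farthest_first_centroids(vectors, sizes, c):
    """Pick c well-separated initial centroids (requires c <= len(vectors))."""
    # Seed with the text that has the largest size (assumed rule).
    first = max(range(len(vectors)), key=lambda i: sizes[i])
    chosen = [first]
    while len(chosen) < c:
        # Next centroid: the vector whose nearest chosen centroid is farthest away.
        nxt = max(
            (i for i in range(len(vectors)) if i not in chosen),
            key=lambda i: min(math.dist(vectors[i], vectors[j]) for j in chosen),
        )
        chosen.append(nxt)
    return [list(vectors[i]) for i in chosen]


init = farthest_first_centroids(
    [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [5.0, 0.0]],
    sizes=[100, 50, 20, 10],
    c=3,
)
print(init)
```

The resulting centroids could then be passed to an ordinary K-means loop in place of the access-count initialization.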
(4) Apply rule-based processing to the centroid coordinates of the high-dimensional clusters formed in step (3): compute the Euclidean distance from each text feature vector to each cluster centroid, construct a Euclidean distance matrix from these distances, and multiply this matrix with each text feature vector of the high-dimensional space. The coordinates of the high-dimensional text feature vectors and cluster centroids are thereby converted into two-dimensional plane coordinates (that is, coordinate pairs for the texts and the cluster centroids).
(5) Step (4) produces a two-dimensional visual representation of the texts, but this is insufficient for many applications and users. Therefore, the subject terms to which a text belongs and the user-preference parameter userfrequency (the access frequency) are used to derive the third dimension z and the fourth dimension u of the text, respectively. Input the set of subject terms related to the texts, with subjects numbered I, and let $f_n$ be the frequency with which a given subject occurs in a given cluster. If the n-th subject occurs most frequently in the k-th cluster, then the third-dimension coordinate of all texts in the k-th cluster is $z_k = I$. If the user accesses a given subject n times within a period t, then the fourth-dimension coordinate of all texts in the k-th cluster related to that subject is $u_k = n/t$.
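The rule for the third and fourth coordinates in step (5) can be sketched for a single cluster as follows. The function name `cluster_z_u`, the input layout (one subject number per text, a dictionary of visit counts per subject), and the tie-breaking rule are assumptions for this example.

```python
from collections import Counter


def cluster_z_u(cluster_topic_ids, visits, period):
    """Return (z, u) shared by all texts of one cluster.

    cluster_topic_ids: subject number of each text in the cluster.
    visits: mapping from subject number to user visit count within `period`.
    """
    counts = Counter(cluster_topic_ids)
    # z is the number of the most frequent subject; ties go to the smaller number (assumption).
    z = max(counts, key=lambda topic: (counts[topic], -topic))
    # u is the access frequency n / t of that subject.
    u = visits.get(z, 0) / period
    return z, u


z, u = cluster_z_u([3, 3, 7, 3, 7], visits={3: 12, 7: 4}, period=4.0)
print(z, u)
```

Combined with the (x, y) plane coordinates of each text, these two values give the four-dimensional coordinate (x, y, z, u).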
(6) Represent all texts in the text library with four-dimensional coordinates (x, y, z, u), providing a visualization result that the user can operate on.
Claims (2)
1. A rapid four-dimensional visualization method for traversing text information, characterized by: (1) building a database of the texts to be analyzed; (2) accepting user input and combining the input values with fixed feature values to create high-dimensional feature vectors, each of which represents the subject attributes of an independent text set; (3) grouping the resulting high-dimensional feature vectors into clusters, each cluster being preliminarily divided according to its degree of association with a subject attribute; (4) calculating the centroid coordinates of each cluster and projecting the centroids onto a two-dimensional plane; (5) establishing a vector for each text, each vector containing the distance from the text to the centroid; (6) creating text layers, each associated with a corresponding cluster, with coordinates (x, y) indicating the texts associated with each layer; and (7) obtaining the z coordinate and u coordinate of each text using a conversion function, yielding a four-dimensional visual representation, and overlaying these coordinates onto the other layers.
2. The rapid four-dimensional visualization method for traversing text information according to claim 1, characterized by: (1) text preprocessing: setting the number N of texts to be processed, inputting the texts, and converting the natural-language text into a form suitable for visualization, using the following statistical attributes as the feature values that characterize an individual text: X = (text number; text size; text format; positions and occurrence counts of the keywords in the text; position and occurrence count of each word and the numbers of its adjacent words; number of times users have accessed the text; semantics defined by linguistic knowledge obtained in advance; and feature values that the user may input), each text being represented by its feature values; (2) obtaining from the feature values the Euclidean distance between any two texts, where $X_i$ and $X_j$ denote the feature vectors of the i-th and j-th texts, using this distance as the similarity between texts, and combining the similarity with the feature values obtained in step (1) to form the set of high-dimensional feature vectors;
(3) clustering the text feature vectors: (a) when the number M of text feature vectors is less than or equal to N, clustering the data with the K-means algorithm: (i) letting c be the number of clusters, max the maximum number of iterations allowed, and Th the minimum deviation threshold allowed between successive iterations, the clustering error E being the sum of the squared deviations of each feature vector from its centroid; (ii) at k = 1, selecting c feature vectors as the initial centroids $m_j^{(k)}$ in descending order of user access count, assigning each $X_i$ in the text feature vector set to the cluster represented by its nearest centroid $m_j^{(k)}$ (that is, the one with the smallest similarity value), and computing $E^{(k)}$; (iii) computing the new centroids $m_j^{(k+1)}$ after assignment and the error $E^{(k+1)}$; (iv) repeating steps (ii) and (iii) until k is greater than or equal to max or $\|E^{(k+1)} - E^{(k)}\|$ is less than Th, at which point clustering ends;
(b) when the number of text feature vectors is greater than N, using a heuristic method based on a knowledge-base set to determine the initial centroids $m_j^{(k)}$ from features such as text size and similarity, ensuring that the similarity value between centroids is largest (that is, the centroids are farthest apart) and that the number of clusters is small, and placing these initial centroids in the multi-dimensional text space, the remaining steps being the same as in the K-means algorithm;
(4) applying rule-based processing to the centroid coordinates of the high-dimensional clusters formed in step (3): computing the Euclidean distance from each text feature vector to each cluster centroid, constructing a Euclidean distance matrix from these distances, and multiplying this matrix with each text feature vector of the high-dimensional space, whereby the coordinates of the high-dimensional text feature vectors and cluster centroids are converted into two-dimensional plane coordinates, that is, coordinate pairs for the texts and the cluster centroids;
(5) step (4) having produced a two-dimensional visual representation of the texts, using the subject terms to which a text belongs and the user-preference parameter (the access frequency) to derive the third dimension z and the fourth dimension u of the text, respectively: inputting the set of subject terms related to the texts, with subjects numbered I, and letting $f_n$ be the frequency with which a given subject occurs in a given cluster; if the n-th subject occurs most frequently in the k-th cluster, the third-dimension coordinate of all texts in the k-th cluster is $z_k = I$; if the user accesses a given subject n times within a period t, the fourth-dimension coordinate of all texts in the k-th cluster related to that subject is $u_k = n/t$;
(6) representing all texts in the text library with four-dimensional coordinates (x, y, z, u), providing a visualization result that the user can operate on.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNA2006101483476A CN101211344A (en) | 2006-12-29 | 2006-12-29 | Text message ergodic rapid four-dimensional visualization method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNA2006101483476A CN101211344A (en) | 2006-12-29 | 2006-12-29 | Text message ergodic rapid four-dimensional visualization method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN101211344A true CN101211344A (en) | 2008-07-02 |
Family
ID=39611376
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CNA2006101483476A Pending CN101211344A (en) | 2006-12-29 | 2006-12-29 | Text message ergodic rapid four-dimensional visualization method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101211344A (en) |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102110166A (en) * | 2011-03-01 | 2011-06-29 | 浙江大学 | Browser-based body 3D (3-demensional) visualizing and editing system and method |
CN102591924A (en) * | 2010-12-13 | 2012-07-18 | 微软公司 | Bull's-eye multidimensional data visualization |
CN102663089A (en) * | 2012-04-09 | 2012-09-12 | 中国科学院软件研究所 | Unstructured data visualization method based on stereographic mapping |
CN102999483A (en) * | 2011-09-16 | 2013-03-27 | 北京百度网讯科技有限公司 | Method and device for correcting text |
CN103077157A (en) * | 2013-01-22 | 2013-05-01 | 清华大学 | Method and device for visualizing text set similarity |
CN103646035A (en) * | 2013-11-14 | 2014-03-19 | 北京锐安科技有限公司 | Information search method based on heuristic method |
CN105630748A (en) * | 2014-10-31 | 2016-06-01 | 富士通株式会社 | Information processing device and information processing method |
CN107038193A (en) * | 2016-11-17 | 2017-08-11 | 阿里巴巴集团控股有限公司 | A kind for the treatment of method and apparatus of text message |
CN107169119A (en) * | 2017-05-26 | 2017-09-15 | 九次方大数据信息集团有限公司 | The automation visualization rendering method and system recognized based on data structure |
CN107632998A (en) * | 2017-07-24 | 2018-01-26 | 电子科技大学 | A kind of multidimensional data visualization method based on human figure |
CN108509981A (en) * | 2018-03-05 | 2018-09-07 | 天津工业大学 | Three-dimension object internal part Automated Partition Method based on sequence apex feature |
CN110047509A (en) * | 2019-03-28 | 2019-07-23 | 国家计算机网络与信息安全管理中心 | A kind of two-stage Subspace partition method and device |
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102591924B (en) * | 2010-12-13 | 2016-01-20 | 微软技术许可有限责任公司 | Target center multidimensional data visualization |
CN102591924A (en) * | 2010-12-13 | 2012-07-18 | 微软公司 | Bull's-eye multidimensional data visualization |
CN102110166A (en) * | 2011-03-01 | 2011-06-29 | 浙江大学 | Browser-based body 3D (3-demensional) visualizing and editing system and method |
CN102110166B (en) * | 2011-03-01 | 2013-07-31 | 浙江大学 | Browser-based body 3D (3-demensional) visualizing and editing system and method |
CN102999483A (en) * | 2011-09-16 | 2013-03-27 | 北京百度网讯科技有限公司 | Method and device for correcting text |
CN102999483B (en) * | 2011-09-16 | 2016-04-27 | 北京百度网讯科技有限公司 | The method and apparatus that a kind of text is corrected |
CN102663089A (en) * | 2012-04-09 | 2012-09-12 | 中国科学院软件研究所 | Unstructured data visualization method based on stereographic mapping |
CN103077157A (en) * | 2013-01-22 | 2013-05-01 | 清华大学 | Method and device for visualizing text set similarity |
CN103077157B (en) * | 2013-01-22 | 2015-08-19 | 清华大学 | A kind of method for visualizing of text collection similarity and device |
CN103646035A (en) * | 2013-11-14 | 2014-03-19 | 北京锐安科技有限公司 | Information search method based on heuristic method |
CN103646035B (en) * | 2013-11-14 | 2017-07-07 | 北京锐安科技有限公司 | A kind of information search method based on heuristic |
CN105630748A (en) * | 2014-10-31 | 2016-06-01 | 富士通株式会社 | Information processing device and information processing method |
CN107038193A (en) * | 2016-11-17 | 2017-08-11 | 阿里巴巴集团控股有限公司 | A kind for the treatment of method and apparatus of text message |
CN107169119A (en) * | 2017-05-26 | 2017-09-15 | 九次方大数据信息集团有限公司 | The automation visualization rendering method and system recognized based on data structure |
CN107632998A (en) * | 2017-07-24 | 2018-01-26 | 电子科技大学 | A kind of multidimensional data visualization method based on human figure |
CN107632998B (en) * | 2017-07-24 | 2021-04-23 | 电子科技大学 | Human body form-based multidimensional data visualization method |
CN108509981A (en) * | 2018-03-05 | 2018-09-07 | 天津工业大学 | Three-dimension object internal part Automated Partition Method based on sequence apex feature |
CN110047509A (en) * | 2019-03-28 | 2019-07-23 | 国家计算机网络与信息安全管理中心 | A kind of two-stage Subspace partition method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN101211344A (en) | Text message ergodic rapid four-dimensional visualization method | |
CN109992645B (en) | Data management system and method based on text data | |
Liu et al. | Region-based image retrieval with high-level semantics using decision tree learning | |
JP6190887B2 (en) | Image search system and information recording medium | |
CN104850633B (en) | A kind of three-dimensional model searching system and method based on the segmentation of cartographical sketching component | |
CN102902826B (en) | A kind of image method for quickly retrieving based on reference picture index | |
US10657162B2 (en) | Method and system for visualizing documents | |
CN102663138A (en) | Method and device for inputting formula query terms | |
Martinet et al. | A relational vector space model using an advanced weighting scheme for image retrieval | |
Mishra et al. | Image mining in the context of content based image retrieval: a perspective | |
Han et al. | Tree-based visualization and optimization for image collection | |
Tsai et al. | Qualitative evaluation of automatic assignment of keywords to images | |
da Fonseca | Sketch-based retrieval in large sets of drawings | |
Plant et al. | Visualising image databases | |
CN111143400A (en) | Full-stack type retrieval method, system, engine and electronic equipment | |
CN114077682B (en) | Intelligent recognition matching processing method and system for image retrieval and storage medium | |
Munarko et al. | HII: Histogram Inverted Index for Fast Images Retrieval. | |
Zhang et al. | Robust sketch-based image retrieval by saliency detection | |
Yan et al. | Research on Application Value Analysis of Real Estate Registration Based on Big Data Mining | |
Wilkins et al. | Text based approaches for content-based image retrieval on large image collections | |
Gupta et al. | A new approach for cbir feedback based image classifier | |
Yang et al. | A Data Mining Model and Methods Based on Multimedia Database | |
Usman et al. | Multi level mining of warehouse schema | |
Kuo et al. | Constructing a discriminative visual vocabulary with macro and micro sense of visual words | |
Jiang et al. | Content-based image retrieval algorithm oriented by users' experience |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |
Open date: 20080702 |