CN101211344A - Text message ergodic rapid four-dimensional visualization method - Google Patents

Info

Publication number
CN101211344A
Authority
CN
China
Prior art keywords
text
cluster
coordinate
dimensional
barycenter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2006101483476A
Other languages
Chinese (zh)
Inventor
蔡阳波
陈勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHANGHAI XINSHENG ELECTRONIC TECHNOLOGY Co Ltd
Original Assignee
SHANGHAI XINSHENG ELECTRONIC TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2006-12-29
Filing date
2006-12-29
Publication date
2008-07-02
Application filed by SHANGHAI XINSHENG ELECTRONIC TECHNOLOGY Co Ltd
Priority to CNA2006101483476A
Publication of CN101211344A

Abstract

The invention provides a new rapid four-dimensional visualization method for traversing text information. The method comprises: (1) building a database of the texts to be analyzed; (2) accepting user input and combining the input values with fixed feature values to create high-dimensional feature vectors, each of which represents the subject attributes of an independent text set; (3) organizing the resulting high-dimensional feature vectors into clusters, each cluster being preliminarily partitioned according to its degree of association with a subject attribute; (4) calculating the centroid coordinates of each cluster and projecting the centroids onto a two-dimensional plane; (5) establishing a vector for each text, each vector containing the distance from the text to the centroid; (6) creating text layers, each layer being associated with a corresponding cluster, with coordinates (x, y) representing the texts associated with each layer; and (7) obtaining the z coordinate and the u coordinate of each text with a conversion function, yielding a four-dimensional visual representation, and overlaying these coordinates onto the other layers.

Description

Rapid four-dimensional visualization method for text information traversal
Technical field
The invention belongs to the field of computer information retrieval and storage. It provides a new automatic four-dimensional visual representation method for traversing text information (building a multi-dimensional index for the user). The method builds on three-dimensional visualization and human-computer interaction.
Background art
Existing text visualization methods mainly consist of traditional chart-based methods, such as histograms, organizational charts, product catalogs, and entity-attribute relationship diagrams in databases. Their shortcomings are that they cannot visualize arbitrary text and do not scale to the visualization of large databases. Computer "visual query" tools visualize a text collection through graphical methods or data abstraction and can be used by any user in any environment, but they are still unsuitable for very large text databases. Researchers have built analysis systems for large text-based information databases, but these rely on Boolean queries, document lists, and a great deal of manual effort to classify, edit, and structure the data. In fields such as market analysis, weather forecast assessment, environmental monitoring, and even national security intelligence analysis, the analyst's task is to sift carefully through large amounts of data in order to recognize patterns in the relevant information and to reconcile the many overlapping patterns across different data sources. As open digital resources grow exponentially, however, users who face massive document collections encounter the following problems: documents are hard to classify, documents are hard to identify, storage requirements increase, and retrieval slows down. Existing three-dimensional visualization methods also have shortcomings: their processing is too simple, so text information is easily lost, and their human-computer interaction is weak.
Summary of the invention
To overcome the above shortcomings of the prior art, the invention provides, for the retrieval and analysis of massive text information, a new spatial representation and vector processing method based on converting text into a vector space of a chosen dimension. It can produce a visualization in any number of dimensions as required, and it adds a user-preference parameter as the fourth dimension.
The basic idea of the invention is to determine, from user input, the number of feature vectors to extract and thereby the best dimensionality for text retrieval and analysis, and to use this to decide and display the content and context of the related texts in the text database. Every text is represented by its size-related value, its peak value (a rank value that orders the text's topics by importance in the space), its content, and the additional feature values entered by the user. The method comprises: (1) building a database of the texts to be analyzed; (2) accepting user input and combining the input values with fixed feature values to create high-dimensional feature vectors, each of which represents the subject attributes of an independent text set; (3) organizing the resulting high-dimensional feature vectors into clusters, each cluster being preliminarily partitioned according to its degree of association with a subject attribute; (4) calculating the centroid coordinates of each cluster and projecting the centroids onto a two-dimensional plane; (5) establishing a vector for each text, each vector containing the distance from the text to the centroid; (6) creating text layers, each layer being associated with a corresponding cluster, with coordinates (x, y) representing the texts associated with each layer; and (7) obtaining the z coordinate and the u coordinate of each text with a conversion function, yielding a four-dimensional visual representation, and adding these coordinates to the other layers.
The invention can effectively classify texts according to user-input features and system-defined features. It converts a conventional text data set into three-dimensional form and, on top of the three-dimensional visualization, treats the user as a further dimension. It offers a more intuitive, vivid, and simple way to query and analyze massive text collections, greatly strengthens human-computer interaction, better meets the needs of different users, and is easy to implement in software.
Description of drawings
Figure 1 is a representation of the text database on a two-dimensional plane.
Figure 2 is a one-dimensional representation of Figure 1.
Figure 3 is a smoothed transformation of Figure 2.
Figure 4 is a four-dimensional representation of the text database.
Embodiment
The specific implementation steps are as follows:
(1) Text preprocessing. Set the number N of texts to be processed and input the texts. Convert the natural-language text into a form suitable for visualization, using the following statistical attributes as the feature values that characterize an individual text: X = (text number; text size; text format; the positions and number of occurrences of keywords in the text; the position, occurrence count, and adjacent-word numbers of each word; the number of times users have accessed the text; the semantics defined by previously acquired linguistic knowledge; and any feature values the user may enter). Each text is represented by its feature values.
(2) From the feature values, compute the Euclidean distance $D_{ij} = (X_i - X_j)^2 / 2$ between any two texts, where $X_i$ and $X_j$ denote the feature vectors of the i-th and j-th texts. Use this distance as the similarity between texts, and combine the similarity with the feature values obtained in step (1) to form the set of high-dimensional feature vectors.
(3) Cluster the text feature vectors. (a) When the number M of text feature vectors is less than or equal to N, cluster the data with the K-means (also referred to as ISODATA) clustering algorithm: (i) Let c be the number of clusters, max the maximum number of iterations allowed, and Th the minimum deviation threshold allowed between successive iterations. The clustering error E is the sum of the squared deviations of each feature vector from its centroid. (ii) At k = 1, select c feature vectors as the initial centroids $m_j^{(k)}$ in descending order of user access count (for example, only vectors accessed more than 100 times). Assign each $X_i$ in the set of text feature vectors to the cluster represented by its nearest centroid $m_j^{(k)}$ (that is, the centroid with the smallest similarity value, since similarity is measured as distance). Compute $E^{(k)}$. (iii) After the assignment, compute the new centroids $m_j^{(k+1)}$ and the error $E^{(k+1)}$. (iv) Repeat steps (ii) and (iii) until k is greater than or equal to max or $\lVert E^{(k+1)} - E^{(k)} \rVert$ is less than Th, at which point clustering ends.
(b) When the number of text feature vectors is greater than N, use a heuristic based on a knowledge-base set: determine the initial centroids $m_j^{(k)}$ mainly from features such as text size and similarity, so that the similarity value between centroids is as large as possible (that is, the centroids are as far apart as possible) and the number of clusters is small, and place these initial centroids in the multidimensional text space. The remaining steps are the same as in the K-means algorithm.
(4) Apply rule-based processing to the centroid coordinates of the high-dimensional clusters partitioned in step (3). Compute the Euclidean distance from each text feature vector to each cluster centroid and build a Euclidean distance matrix from these values. Multiply this matrix by each text feature vector in the high-dimensional space; the coordinates of the high-dimensional text feature vectors and cluster centroids are thereby converted into two-dimensional plane coordinates (that is, coordinate pairs for the texts and the cluster centroids).
(5) Step (4) produces a two-dimensional visual representation of the texts, but this is insufficient for many applications and users. Therefore, the topic terms to which the texts belong and the user-preference parameter userfrequency (the access frequency) are used to derive the third dimension z and the fourth dimension u of each text, respectively. Input the set of topic terms related to the texts, with the topics numbered I, and let $f_n$ be the frequency with which a given topic occurs in a given cluster. If the n-th topic is the one that occurs most frequently in the k-th cluster, the third-dimension coordinate of all texts in the k-th cluster is $z_k = I$, the number of that topic. If the user accesses a given topic n times within a period t, the fourth-dimension coordinate of all texts in the k-th cluster related to that topic is $u_k = n/t$.
(6) Represent all texts in the text library with the four-dimensional coordinates (x, y, z, u), and provide a visualization result that the user can operate on.
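Step (1) above can be illustrated with a minimal sketch that turns one text into a numeric feature vector. The function name `text_features`, the whitespace tokenization, and the layout of the vector are assumptions made for illustration; the source lists the attributes but does not say how they are encoded, and attributes such as text format and semantics are left out here.

```python
def text_features(text_id, text, keywords, access_count):
    """Build a numeric feature vector X for one text (hypothetical layout).

    Layout: [text number, text size, then for each keyword its occurrence
    count and first word position (-1.0 if absent), user access count].
    Keywords are expected in lowercase.
    """
    words = text.lower().split()  # naive whitespace tokenization (assumption)
    features = [float(text_id), float(len(text))]
    for keyword in keywords:
        positions = [i for i, word in enumerate(words) if word == keyword]
        features.append(float(len(positions)))
        features.append(float(positions[0]) if positions else -1.0)
    features.append(float(access_count))
    return features
```

Every text must be encoded with the same keyword list so that all vectors have the same length and can be compared in step (2).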
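The pairwise distances of step (2) might be computed as follows. The formula in the source is damaged, so this sketch assumes one reading of it: half the sum of squared component differences. If the ordinary Euclidean norm is intended, `math.dist` from the Python standard library can replace the helper. The function names are hypothetical.

```python
def half_squared_distance(a, b):
    """Half the squared Euclidean distance (assumed reading of D_ij)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / 2


def distance_matrix(vectors):
    """Symmetric matrix of pairwise distances D[i][j] between feature vectors."""
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = half_squared_distance(vectors[i], vectors[j])
            matrix[i][j] = d
            matrix[j][i] = d
    return matrix
```

Because the source treats distance itself as the similarity value, a smaller entry in this matrix means two texts are closer.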
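The K-means loop of step (3)(a) can be sketched as below, assuming plain Python lists as feature vectors. The initial centroids are the c vectors with the highest access counts, and iteration stops after `max_iter` passes or when the error changes by less than `th`, mirroring max and Th in the text. The function names, the default values, and the handling of empty clusters (the old centroid is kept) are assumptions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, access_counts, c, max_iter=100, th=1e-6):
    """Cluster feature vectors; returns (labels, centroids, error)."""
    # (ii) Initial centroids: the c vectors with the highest access counts.
    order = sorted(range(len(vectors)), key=lambda i: access_counts[i], reverse=True)
    centroids = [list(vectors[i]) for i in order[:c]]
    previous_error = None
    for _ in range(max_iter):
        # Assign each vector to its nearest centroid.
        labels = [
            min(range(c), key=lambda j: squared_distance(x, centroids[j]))
            for x in vectors
        ]
        # E: sum of squared deviations from the assigned centroids.
        error = sum(squared_distance(x, centroids[l]) for x, l in zip(vectors, labels))
        # (iii) Recompute each centroid as the mean of its members.
        for j in range(c):
            members = [x for x, l in zip(vectors, labels) if l == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
        # (iv) Stop once the error changes by less than th.
        if previous_error is not None and abs(previous_error - error) < th:
            break
        previous_error = error
    return labels, centroids, error
```

Seeding from access counts makes the result deterministic for a given input, unlike the random seeding used by many K-means implementations.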
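Step (3)(b) asks for initial centroids that are as far apart as possible but does not describe the knowledge-base heuristic itself. As a stand-in, the sketch below uses a greedy farthest-first selection, which is a different and simpler technique: it starts from the largest text (an assumption) and repeatedly adds the vector farthest from the centroids chosen so far. All names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first_centroids(vectors, sizes, c):
    """Pick c well-separated initial centroids (greedy farthest-first stand-in)."""
    # Start from the vector of the largest text (assumption, not from the source).
    chosen = [max(range(len(vectors)), key=lambda i: sizes[i])]
    while len(chosen) < c:
        candidates = [i for i in range(len(vectors)) if i not in chosen]
        # Take the candidate whose nearest chosen centroid is farthest away.
        best = max(
            candidates,
            key=lambda i: min(squared_distance(vectors[i], vectors[j]) for j in chosen),
        )
        chosen.append(best)
    return [list(vectors[i]) for i in chosen]
```

The returned centroids can then seed the same assignment and update loop that step (3)(a) describes.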
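Steps (5) and (6) can be illustrated by attaching z and u to existing plane coordinates. The sketch assumes each text carries a single topic number, that access counts per topic over the period t are available as a dictionary, and that ties between topics go to the smallest topic number; none of these details are given in the source, and the function names are hypothetical.

```python
from collections import Counter


def dominant_topic(topic_numbers):
    """Number of the most frequent topic; ties go to the smallest number."""
    counts = Counter(topic_numbers)
    return min(counts, key=lambda topic: (-counts[topic], topic))


def four_d_coordinates(xy, labels, topic_of_text, accesses_by_topic, period):
    """Return one (x, y, z, u) tuple per text.

    z_k is the number of the dominant topic in cluster k; u_k is the number
    of user accesses to that topic divided by the observation period t.
    """
    z, u = {}, {}
    for k in set(labels):
        topics_in_k = [t for t, l in zip(topic_of_text, labels) if l == k]
        z[k] = dominant_topic(topics_in_k)
        u[k] = accesses_by_topic.get(z[k], 0) / period
    return [(x, y, z[l], u[l]) for (x, y), l in zip(xy, labels)]
```

All texts in a cluster share the same z and u, so the third and fourth dimensions separate clusters rather than individual texts.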

Claims (2)

1. A rapid four-dimensional visualization method for text information traversal, characterized by: (1) building a database of the texts to be analyzed; (2) accepting user input and combining the input values with fixed feature values to create high-dimensional feature vectors, each of which represents the subject attributes of an independent text set; (3) organizing the resulting high-dimensional feature vectors into clusters, each cluster being preliminarily partitioned according to its degree of association with a subject attribute; (4) calculating the centroid coordinates of each cluster and projecting the centroids onto a two-dimensional plane; (5) establishing a vector for each text, each vector containing the distance from the text to the centroid; (6) creating text layers, each layer being associated with a corresponding cluster, with coordinates (x, y) representing the texts associated with each layer; and (7) obtaining the z coordinate and the u coordinate of each text with a conversion function, yielding a four-dimensional visual representation, and adding these coordinates to the other layers.
2. The rapid four-dimensional visualization method for text information traversal according to claim 1, characterized by: (1) text preprocessing: setting the number N of texts to be processed and inputting the texts; converting the natural-language text into a form suitable for visualization, using the following statistical attributes as the feature values that characterize an individual text: X = (text number; text size; text format; the positions and number of occurrences of keywords in the text; the position, occurrence count, and adjacent-word numbers of each word; the number of times users have accessed the text; the semantics defined by previously acquired linguistic knowledge; and any feature values the user may enter); and representing each text by its feature values; (2) computing from the feature values the Euclidean distance $D_{ij}$ between any two texts, where $X_i$ and $X_j$ denote the feature vectors of the i-th and j-th texts, using this distance as the similarity between texts, and combining the similarity with the feature values obtained in step (1) to form the set of high-dimensional feature vectors;
(3) clustering the text feature vectors: (a) when the number M of text feature vectors is less than or equal to N, clustering the data with the K-means clustering algorithm: (i) letting c be the number of clusters, max the maximum number of iterations allowed, and Th the minimum deviation threshold allowed between successive iterations, the clustering error E being the sum of the squared deviations of each feature vector from its centroid; (ii) at k = 1, selecting c feature vectors as the initial centroids $m_j^{(k)}$ in descending order of user access count, assigning each $X_i$ in the set of text feature vectors to the cluster represented by its nearest centroid $m_j^{(k)}$, that is, the centroid with the smallest similarity value, and computing $E^{(k)}$; (iii) after the assignment, computing the new centroids $m_j^{(k+1)}$ and the error $E^{(k+1)}$; (iv) repeating steps (ii) and (iii) until k is greater than or equal to max or $\lVert E^{(k+1)} - E^{(k)} \rVert$ is less than Th, at which point clustering ends;
(b) when the number of text feature vectors is greater than N, using a heuristic based on a knowledge-base set, determining the initial centroids $m_j^{(k)}$ from features such as text size and similarity so that the similarity value between centroids is as large as possible, that is, the centroids are as far apart as possible, and the number of clusters is small, and placing these initial centroids in the multidimensional text space, the remaining steps being the same as in the K-means algorithm;
(4) applying rule-based processing to the centroid coordinates of the high-dimensional clusters partitioned in step (3), computing the Euclidean distance from each text feature vector to each cluster centroid, building a Euclidean distance matrix from these values, and multiplying this matrix by each text feature vector in the high-dimensional space, whereby the coordinates of the high-dimensional text feature vectors and cluster centroids are converted into two-dimensional plane coordinates, that is, coordinate pairs for the texts and the cluster centroids;
(5) step (4) having produced a two-dimensional visual representation of the texts, using the topic terms to which the texts belong and the user-preference parameter, namely the access frequency, to derive the third dimension z and the fourth dimension u of each text, respectively: inputting the set of topic terms related to the texts, with the topics numbered I, and letting $f_n$ be the frequency with which a given topic occurs in a given cluster; if the n-th topic is the one that occurs most frequently in the k-th cluster, the third-dimension coordinate of all texts in the k-th cluster is $z_k = I$; if the user accesses a given topic n times within a period t, the fourth-dimension coordinate of all texts in the k-th cluster related to that topic is $u_k = n/t$;
(6) representing all texts in the text library with the four-dimensional coordinates (x, y, z, u), and providing a visualization result that the user can operate on.
CNA2006101483476A 2006-12-29 2006-12-29 Text message ergodic rapid four-dimensional visualization method Pending CN101211344A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2006101483476A CN101211344A (en) 2006-12-29 2006-12-29 Text message ergodic rapid four-dimensional visualization method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNA2006101483476A CN101211344A (en) 2006-12-29 2006-12-29 Text message ergodic rapid four-dimensional visualization method

Publications (1)

Publication Number Publication Date
CN101211344A true CN101211344A (en) 2008-07-02

Family

ID=39611376

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2006101483476A Pending CN101211344A (en) 2006-12-29 2006-12-29 Text message ergodic rapid four-dimensional visualization method

Country Status (1)

Country Link
CN (1) CN101211344A (en)

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102591924B (en) * 2010-12-13 2016-01-20 微软技术许可有限责任公司 Target center multidimensional data visualization
CN102591924A (en) * 2010-12-13 2012-07-18 微软公司 Bull's-eye multidimensional data visualization
CN102110166A (en) * 2011-03-01 2011-06-29 浙江大学 Browser-based body 3D (3-demensional) visualizing and editing system and method
CN102110166B (en) * 2011-03-01 2013-07-31 浙江大学 Browser-based body 3D (3-demensional) visualizing and editing system and method
CN102999483A (en) * 2011-09-16 2013-03-27 北京百度网讯科技有限公司 Method and device for correcting text
CN102999483B (en) * 2011-09-16 2016-04-27 北京百度网讯科技有限公司 The method and apparatus that a kind of text is corrected
CN102663089A (en) * 2012-04-09 2012-09-12 中国科学院软件研究所 Unstructured data visualization method based on stereographic mapping
CN103077157A (en) * 2013-01-22 2013-05-01 清华大学 Method and device for visualizing text set similarity
CN103077157B (en) * 2013-01-22 2015-08-19 清华大学 A kind of method for visualizing of text collection similarity and device
CN103646035A (en) * 2013-11-14 2014-03-19 北京锐安科技有限公司 Information search method based on heuristic method
CN103646035B (en) * 2013-11-14 2017-07-07 北京锐安科技有限公司 A kind of information search method based on heuristic
CN105630748A (en) * 2014-10-31 2016-06-01 富士通株式会社 Information processing device and information processing method
CN107038193A (en) * 2016-11-17 2017-08-11 阿里巴巴集团控股有限公司 A kind for the treatment of method and apparatus of text message
CN107169119A (en) * 2017-05-26 2017-09-15 九次方大数据信息集团有限公司 The automation visualization rendering method and system recognized based on data structure
CN107632998A (en) * 2017-07-24 2018-01-26 电子科技大学 A kind of multidimensional data visualization method based on human figure
CN107632998B (en) * 2017-07-24 2021-04-23 电子科技大学 Human body form-based multidimensional data visualization method
CN108509981A (en) * 2018-03-05 2018-09-07 天津工业大学 Three-dimension object internal part Automated Partition Method based on sequence apex feature
CN110047509A (en) * 2019-03-28 2019-07-23 国家计算机网络与信息安全管理中心 A kind of two-stage Subspace partition method and device

Similar Documents

Publication Publication Date Title
CN101211344A (en) Text message ergodic rapid four-dimensional visualization method
CN109992645B (en) Data management system and method based on text data
Liu et al. Region-based image retrieval with high-level semantics using decision tree learning
JP6190887B2 (en) Image search system and information recording medium
CN104850633B (en) A kind of three-dimensional model searching system and method based on the segmentation of cartographical sketching component
CN102902826B (en) A kind of image method for quickly retrieving based on reference picture index
US10657162B2 (en) Method and system for visualizing documents
CN102663138A (en) Method and device for inputting formula query terms
Martinet et al. A relational vector space model using an advanced weighting scheme for image retrieval
Mishra et al. Image mining in the context of content based image retrieval: a perspective
Han et al. Tree-based visualization and optimization for image collection
Tsai et al. Qualitative evaluation of automatic assignment of keywords to images
da Fonseca Sketch-based retrieval in large sets of drawings
Plant et al. Visualising image databases
CN111143400A (en) Full-stack type retrieval method, system, engine and electronic equipment
CN114077682B (en) Intelligent recognition matching processing method and system for image retrieval and storage medium
Munarko et al. HII: Histogram Inverted Index for Fast Images Retrieval.
Zhang et al. Robust sketch-based image retrieval by saliency detection
Yan et al. Research on Application Value Analysis of Real Estate Registration Based on Big Data Mining
Wilkins et al. Text based approaches for content-based image retrieval on large image collections
Gupta et al. A new approach for cbir feedback based image classifier
Yang et al. A Data Mining Model and Methods Based on Multimedia Database
Usman et al. Multi level mining of warehouse schema
Kuo et al. Constructing a discriminative visual vocabulary with macro and micro sense of visual words
Jiang et al. Content-based image retrieval algorithm oriented by users' experience

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20080702