CN101211344A

CN101211344A - Rapid four-dimensional visualization method for text information traversal

Info

Publication number: CN101211344A
Application number: CNA2006101483476A
Authority: CN
Inventors: 蔡阳波; 陈勇
Original assignee: SHANGHAI XINSHENG ELECTRONIC TECHNOLOGY Co Ltd
Current assignee: SHANGHAI XINSHENG ELECTRONIC TECHNOLOGY Co Ltd
Priority date: 2006-12-29
Filing date: 2006-12-29
Publication date: 2008-07-02

Abstract

The invention provides a new rapid four-dimensional visualization method for text information traversal, comprising: (1) building a database of the texts to be analyzed; (2) accepting user input and combining the input values with fixed feature values to create high-dimensional feature vectors, each of which represents the subject attributes of an independent text set; (3) grouping the resulting high-dimensional feature vectors into clusters, each cluster being preliminarily divided according to its degree of association with a subject attribute; (4) calculating the centroid coordinates of each cluster and projecting the centroids onto a two-dimensional plane; (5) establishing a vector for each text, each vector containing the distance from the text to the centroid; (6) creating text layers, each layer being associated with a corresponding cluster, with coordinates (x, y) representing the texts associated with each layer; (7) obtaining the z coordinate and the u coordinate of each text using a transfer function to yield a four-dimensional visual representation, and superimposing these coordinates onto the other layers.

Description

Rapid four-dimensional visualization method for text information traversal

Technical field

The invention belongs to the field of computer information retrieval and storage, and provides a new automatic four-dimensional visual representation method for text information traversal (building a multi-dimensional index for the user). The method builds on three-dimensional visualization and human-computer interaction.

Background technology

Current text visualization methods mainly consist of traditional chart-based visualization, such as histograms, organization charts, product catalogs, and entity-attribute relationship diagrams in databases. Their shortcomings are that they cannot visualize arbitrary text and do not scale to large databases. Computer "visual query" tools visualize a text library through graphical methods or data abstraction and can be used by any user in any environment, but they are still unsuitable for larger text databases. Researchers have built analysis systems for large text-based information databases, but these rely on Boolean queries, document lists, and a great deal of manual labor to classify, edit, and structure the data. In fields such as market analysis, weather forecast assessment, environmental monitoring, and even national security intelligence gathering and analysis, the analyst's task is to sift carefully through large amounts of data to extract the relevant information and to recognize the patterns that cut across many different data sources. However, as open digital resources grow exponentially, users facing massive volumes of document data encounter the following problems: documents are difficult to classify and to identify, storage requirements increase, and retrieval slows down. Existing three-dimensional visualization methods also have shortcomings: their processing is too simplistic, text information is easily lost, and human-computer interaction is weak.

Summary of the invention

To overcome the above shortcomings of the prior art, the invention provides, for the retrieval and analysis of massive text information, a new spatialized representation and vector processing method based on converting text into a vector space. Visualization in any number of dimensions can be carried out according to actual needs, and a user preference parameter is added as the fourth dimension.

The basic idea of the invention is to extract a number of feature vectors according to user input and derive the best dimensionality for text retrieval and analysis, which in turn determines the content and context of the related texts displayed from the text database. All texts are represented by size-related values, peak values (the rank of the text's subjects ordered by importance in the space), content, and additional feature values input by the user. The method comprises: (1) building a database of the texts to be analyzed; (2) accepting user input and combining the input values with fixed feature values to create high-dimensional feature vectors, each of which represents the subject attributes of an independent text set; (3) grouping the resulting high-dimensional feature vectors into clusters, each cluster being preliminarily divided according to its degree of association with a subject attribute; (4) calculating the centroid coordinates of each cluster and projecting the centroids onto a two-dimensional plane; (5) establishing a vector for each text, each vector containing the distance from the text to the centroid; (6) creating text layers, each layer being associated with a corresponding cluster, with coordinates (x, y) representing the texts associated with each layer; (7) obtaining the z coordinate and the u coordinate of each text using a transfer function to yield a four-dimensional visual representation, and superimposing these coordinates onto the other layers.

The invention classifies texts effectively according to user-input features and system-defined features, converts a traditional text data collection into three-dimensional form, and, on top of the three-dimensional visualization, treats the user as a further dimension. It provides a more intuitive, vivid, and simple method for querying and analyzing massive text collections, greatly strengthens human-computer interaction, better meets the needs of different users, and is easy to implement in software.

Description of drawings

Figure 1 is a representation of the text database on a two-dimensional plane.

Figure 2 is a one-dimensional representation of Fig. 1.

Figure 3 is a smoothed transformation of Fig. 2.

Figure 4 is a four-dimensional representation of the text database.

Embodiment

The specific implementation steps are as follows:

(1) Text preprocessing. Set the number N of texts to be processed and input the texts. The natural-language text is converted into a form suitable for visualization, using the following statistical attributes as the feature values that characterize an individual text: X = (text number, text size, text format, positions and counts of keyword occurrences in the text, the position and occurrence count of each word and the numbers of its adjacent words, the number of times users have accessed the text, semantics defined by linguistic knowledge obtained in advance, and feature values the user may input). Each text is represented by its feature values.
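The preprocessing step can be illustrated with a minimal sketch. It covers only a subset of the attributes listed above (text number, text size, access count, and the count and first position of each keyword); the function name `text_features`, the whitespace-and-punctuation tokenization, and the use of -1 for an absent keyword are assumptions made for illustration and are not specified in the patent.

```python
import re
from collections import Counter

def text_features(text_id, text, keywords, access_count=0):
    """Build a numeric feature vector for one text (illustrative subset only)."""
    # Assumption: simple lowercase word tokenization.
    words = re.findall(r"\w+", text.lower())
    counts = Counter(words)
    # Text number, text size in characters, number of user accesses.
    features = [float(text_id), float(len(text)), float(access_count)]
    for kw in keywords:
        kw = kw.lower()
        # Assumption: -1 marks a keyword that does not occur in the text.
        first_position = words.index(kw) if counts[kw] > 0 else -1
        features.extend([float(counts[kw]), float(first_position)])
    return features

sample = "Flood warning: flood levels rise"
vector = text_features(1, sample, ["flood", "rain"], access_count=12)
```

Each keyword contributes two entries, so all texts processed with the same keyword list yield vectors of equal length, which later distance computations require.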

(2) From the feature values, compute the Euclidean distance between any two texts, $D_{ij} = (X_i - X_j)^2 / 2$ (where $X_i$ and $X_j$ denote the feature vectors of the i-th and j-th texts). This distance serves as the similarity between texts; the similarity is combined with the feature values obtained in step (1) to form the set of high-dimensional feature vectors.
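As a hedged illustration of step (2), the sketch below reads $(X_i - X_j)^2$ as the squared Euclidean norm of the difference vector, which is an interpretation rather than something the text states. The way similarity and feature values are "combined" is also not specified; appending each text's row of the distance matrix to its feature vector is an assumption, as are the function names.

```python
import numpy as np

def pairwise_half_squared_distance(X):
    """Return D with D[i, j] = ||X[i] - X[j]||^2 / 2 for the rows of X."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]      # shape (N, N, d)
    return (diff ** 2).sum(axis=-1) / 2.0     # shape (N, N)

def augment_with_similarity(X):
    """Append each text's similarity row to its feature vector.

    Assumption: this concatenation is one possible reading of "combining"
    similarity and feature values; the patent does not define it.
    """
    X = np.asarray(X, dtype=float)
    return np.hstack([X, pairwise_half_squared_distance(X)])

features = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 4.0]])
D = pairwise_half_squared_distance(features)
high_dim = augment_with_similarity(features)
```

With N texts and d original features, the augmented vectors have d + N components, so this reading is practical only for moderate N.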

(3) Cluster the text feature vectors. (a) When the number M of text feature vectors is less than or equal to N, the K-means (also called ISODATA) clustering algorithm is used to cluster the data: (i) Let c be the number of clusters, max the maximum number of iterations allowed, and Th the minimum deviation threshold allowed between successive iterations. The clustering error E is the sum of the squared deviations of each feature vector from its centroid. (ii) At k = 1, select c feature vectors in descending order of user access count (for example, at least more than 100 accesses) as the initial centroids $m_j^{(k)}$, and assign each $X_i$ in the set of text feature vectors to the cluster represented by the centroid $m_j^{(k)}$ nearest to it (that is, the one with the smallest similarity value). Compute $E^{(k)}$. (iii) After the assignment, compute the new centroids $m_j^{(k+1)}$ and the error $E^{(k+1)}$. (iv) Repeat steps (ii) and (iii) until k is greater than or equal to max or $\lVert E^{(k+1)} - E^{(k)} \rVert$ is less than Th, at which point clustering ends.

(b) When the number of text feature vectors is greater than N, a heuristic method based on a knowledge base is used: the initial centroids $m_j^{(k)}$ are determined mainly from features such as text size and similarity, ensuring that the similarity value between centroids is largest (that is, the centroids are farthest apart) and that the number of clusters is small. These initial centroids are placed into the multi-dimensional text space, and the remaining steps are the same as in the K-means algorithm.
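The clustering in step (3) can be sketched as follows, assuming plain K-means with squared Euclidean distance. `init_by_access` follows case (a) by taking the c most-accessed feature vectors as initial centroids. The knowledge-base heuristic of case (b) is not described in enough detail to reproduce, so `init_farthest_first` swaps in a standard farthest-first selection purely as a stand-in that spreads the initial centroids apart. All function and parameter names (`max_iter` for max, `th` for Th) are illustrative.

```python
import numpy as np

def init_by_access(X, access_counts, c):
    """Case (a): the c feature vectors with the highest access counts."""
    order = np.argsort(access_counts)[::-1][:c]
    return np.asarray(X, dtype=float)[order].copy()

def init_farthest_first(X, c, start=0):
    """Stand-in for case (b): greedily pick mutually distant points."""
    X = np.asarray(X, dtype=float)
    chosen = [start]
    d2 = ((X - X[start]) ** 2).sum(axis=1)   # distance to nearest chosen point
    while len(chosen) < c:
        nxt = int(d2.argmax())
        chosen.append(nxt)
        d2 = np.minimum(d2, ((X - X[nxt]) ** 2).sum(axis=1))
    return X[chosen].copy()

def kmeans(X, centroids, max_iter=100, th=1e-6):
    """Iterate assignment and centroid update until the error E stabilizes."""
    X = np.asarray(X, dtype=float)
    centroids = np.asarray(centroids, dtype=float).copy()
    prev_error = None
    for _ in range(max_iter):
        # Squared distance from every vector to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # E: sum of squared deviations from the assigned centroids.
        error = float(d2[np.arange(len(X)), labels].sum())
        if prev_error is not None and abs(error - prev_error) < th:
            break
        prev_error = error
        for j in range(len(centroids)):
            members = X[labels == j]
            if len(members):                 # keep an empty cluster's centroid
                centroids[j] = members.mean(axis=0)
    return labels, centroids, error

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
access = [500, 5, 300, 7]
labels, centroids, error = kmeans(X, init_by_access(X, access, c=2))
```

In this toy data the two most-accessed texts sit in different groups, so the access-based initialization already separates the clusters and the loop stops once the error no longer changes.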

(4) Apply rule-based processing to the centroid coordinates of the high-dimensional clusters obtained in step (3): compute the Euclidean distance from each text feature vector to each cluster centroid and construct a Euclidean distance matrix from these distances. Multiplying this matrix by each text feature vector in the high-dimensional space converts the coordinates of the high-dimensional text feature vectors and cluster centroids into two-dimensional plane coordinates (that is, coordinate pairs for the texts and the cluster centroids).
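The text does not spell out the matrix product that yields the plane coordinates, so the following sketch does not reproduce it. Instead it substitutes a different, explicitly named technique: the centroids are placed in the plane by principal component analysis (via SVD), and each text is positioned as an inverse-distance-weighted average of the centroid positions, using the same text-to-centroid Euclidean distance matrix the step describes. The function name, the `eps` parameter, and the weighting scheme are all assumptions; the sketch requires at least two centroids and at least two feature dimensions.

```python
import numpy as np

def project_to_plane(X, centroids, eps=1e-9):
    """Map texts and centroids to 2D (substitute technique, see lead-in)."""
    X = np.asarray(X, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # dist[i, j]: Euclidean distance from text i to centroid j.
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
    # PCA of the centroids: project onto their two leading principal axes.
    centered = centroids - centroids.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    centroid_xy = centered @ vt[:2].T            # shape (c, 2)
    # Inverse-distance weights: a text sits closest to its nearest centroids.
    weights = 1.0 / (dist + eps)
    weights /= weights.sum(axis=1, keepdims=True)
    text_xy = weights @ centroid_xy              # shape (N, 2)
    return text_xy, centroid_xy, dist

cents = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0], [0.0, 10.0, 0.0]])
texts = np.vstack([cents, [[5.0, 0.0, 0.0]]])
text_xy, centroid_xy, dist = project_to_plane(texts, cents)
```

A text that coincides with a centroid lands on that centroid's plane position, and texts between clusters fall between the corresponding centroid positions, which preserves the layered, per-cluster layout the method aims for.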

(5) Step (4) produces a two-dimensional visual representation of the texts, which is insufficient for many applications and users. Therefore, the subject terms of the texts and a user preference parameter, userfrequency (the access frequency), are used to derive the third dimension z and the fourth dimension u of the texts, respectively. Input the set of subject terms related to the texts, with subjects numbered I, and let $f_n$ be the frequency with which a subject occurs in a cluster. If the n-th subject has the highest frequency in the k-th cluster, the third-dimension coordinate of all texts in the k-th cluster is $z_k = I$. If the user accesses a subject n times within a time period t, the fourth-dimension coordinate of all texts in the k-th cluster related to that subject is $u_k = n/t$.
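A minimal sketch of step (5) follows, assuming each text carries exactly one subject number and that access counts per subject over the window t are available as a dictionary. Both of those input shapes, and the name `cluster_z_u`, are assumptions for illustration. For each cluster, z is the number of its most frequent subject and u is that subject's access count divided by t.

```python
from collections import Counter

def cluster_z_u(text_subjects, labels, subject_access_counts, t):
    """Return {cluster index: (z, u)} for every cluster present in labels."""
    subjects_by_cluster = {}
    for subject, k in zip(text_subjects, labels):
        subjects_by_cluster.setdefault(k, Counter())[subject] += 1
    coords = {}
    for k, counts in subjects_by_cluster.items():
        # z_k: number of the subject occurring most often in cluster k.
        dominant = counts.most_common(1)[0][0]
        # u_k = n / t: accesses to that subject per unit time.
        u = subject_access_counts.get(dominant, 0) / t
        coords[k] = (dominant, u)
    return coords

zu = cluster_z_u(
    text_subjects=[3, 3, 7, 7, 7],
    labels=[0, 0, 1, 1, 0],
    subject_access_counts={3: 30, 7: 5},
    t=10,
)
```

Because z and u are defined per cluster, every text in a cluster shares the same pair, which is what stacks the clusters into separate layers.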

(6) Represent all texts in the text library with four-dimensional coordinates (x, y, z, u), providing a visualization result that the user can manipulate.
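Assembling the final coordinates can be sketched as below, assuming per-text plane coordinates, per-text cluster labels, and a per-cluster mapping to (z, u) are already available. The function name and input layout are hypothetical.

```python
import numpy as np

def assemble_4d(text_xy, labels, cluster_zu):
    """Stack (x, y) per text with the (z, u) of its cluster into (N, 4)."""
    xy = np.asarray(text_xy, dtype=float)
    zu = np.array([cluster_zu[int(k)] for k in labels], dtype=float)
    return np.hstack([xy, zu])

coords_4d = assemble_4d(
    text_xy=[[0.0, 1.0], [2.0, 3.0]],
    labels=[0, 1],
    cluster_zu={0: (3, 3.0), 1: (7, 0.5)},
)
```

How the fourth coordinate is rendered is left open by the text; mapping u to color or marker size in a three-dimensional scatter plot would be one possible choice.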

Claims

1. A rapid four-dimensional visualization method for text information traversal, characterized by: (1) building a database of the texts to be analyzed; (2) accepting user input and combining the input values with fixed feature values to create high-dimensional feature vectors, each of which represents the subject attributes of an independent text set; (3) grouping the resulting high-dimensional feature vectors into clusters, each cluster being preliminarily divided according to its degree of association with a subject attribute; (4) calculating the centroid coordinates of each cluster and projecting the centroids onto a two-dimensional plane; (5) establishing a vector for each text, each vector containing the distance from the text to the centroid; (6) creating text layers, each layer being associated with a corresponding cluster, with coordinates (x, y) representing the texts associated with each layer; (7) obtaining the z coordinate and the u coordinate of each text using a transfer function to yield a four-dimensional visual representation, and superimposing these coordinates onto the other layers.

2. The rapid four-dimensional visualization method for text information traversal according to claim 1, characterized by: (1) text preprocessing: setting the number N of texts to be processed, inputting the texts, and converting the natural-language text into a form suitable for visualization, using the following statistical attributes as the feature values that characterize an individual text: X = (text number, text size, text format, positions and counts of keyword occurrences in the text, the position and occurrence count of each word and the numbers of its adjacent words, the number of times users have accessed the text, semantics defined by linguistic knowledge obtained in advance, and feature values the user may input), each text being represented by its feature values; (2) computing from the feature values the Euclidean distance between any two texts, $D_{ij} = (X_i - X_j)^2 / 2$, wherein $X_i$ and $X_j$ denote the feature vectors of the i-th and j-th texts, this distance serving as the similarity between texts, and combining the similarity with the feature values obtained in step (1) to form the set of high-dimensional feature vectors;

(3) clustering the text feature vectors: (a) when the number M of text feature vectors is less than or equal to N, using the K-means clustering algorithm to cluster the data: (i) letting c be the number of clusters, max the maximum number of iterations allowed, and Th the minimum deviation threshold allowed between successive iterations, the clustering error E being the sum of the squared deviations of each feature vector from its centroid; (ii) at k = 1, selecting c feature vectors in descending order of user access count as the initial centroids $m_j^{(k)}$, assigning each $X_i$ in the set of text feature vectors to the cluster represented by the centroid $m_j^{(k)}$ nearest to it (that is, the one with the smallest similarity value), and computing $E^{(k)}$; (iii) after the assignment, computing the new centroids $m_j^{(k+1)}$ and the error $E^{(k+1)}$; (iv) repeating steps (ii) and (iii) until k is greater than or equal to max or $\lVert E^{(k+1)} - E^{(k)} \rVert$ is less than Th, at which point clustering ends;

(b) when the number of text feature vectors is greater than N, using a heuristic method based on a knowledge base to determine the initial centroids $m_j^{(k)}$ from features such as text size and similarity, ensuring that the similarity value between centroids is largest (that is, the centroids are farthest apart) and that the number of clusters is small, and placing these initial centroids into the multi-dimensional text space, the remaining steps being the same as in the K-means algorithm;

(4) applying rule-based processing to the centroid coordinates of the high-dimensional clusters obtained in step (3): computing the Euclidean distance from each text feature vector to each cluster centroid, constructing a Euclidean distance matrix from these distances, and multiplying this matrix by each text feature vector in the high-dimensional space, whereby the coordinates of the high-dimensional text feature vectors and cluster centroids are converted into two-dimensional plane coordinates, that is, coordinate pairs for the texts and the cluster centroids;

(5) step (4) having produced a two-dimensional visual representation of the texts, using the subject terms of the texts and a user preference parameter, namely the access frequency, to derive the third dimension z and the fourth dimension u of the texts, respectively: inputting the set of subject terms related to the texts, with subjects numbered I, and letting $f_n$ be the frequency with which a subject occurs in a cluster; if the n-th subject has the highest frequency in the k-th cluster, the third-dimension coordinate of all texts in the k-th cluster is $z_k = I$; if the user accesses a subject n times within a time period t, the fourth-dimension coordinate of all texts in the k-th cluster related to that subject is $u_k = n/t$;