CN103631856A

CN103631856A - Subject visualization method for Chinese document set

Info

Publication number: CN103631856A
Application number: CN201310488312.7A
Authority: CN
Inventors: 朱敏; 梁婷; 甘启宏; 李明召; 李�一
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2013-10-17
Filing date: 2013-10-17
Publication date: 2014-03-12
Anticipated expiration: 2033-10-17
Also published as: CN103631856B

Abstract

The invention discloses a subject visualization method for a Chinese document set. The subject visualization method comprises the steps of classifying the document set according to subjects, dividing the time periods of the document set, calculating subject frequency, ranking the subjects, generating a subject flow graph, extracting keywords expressing the content of the subjects, calculating and ranking the weights of the keywords and generating a character cloud. The subject visualization method further comprises an ordering method based on the subject frequency and geometrical complementarity, a character cloud arrangement method and a method for generating the detailed character cloud. The subject visualization method has the advantages that the subject visualization on the Chinese document set is achieved; the subject flow graph generated through the ordering method based on the subject frequency and the geometrical complementarity is more attractive, flatter, high in space use ratio and more beneficial to character cloud arrangement; the character cloud arrangement method can effectively utilize space, and arrangement efficiency is greatly improved; the detailed character cloud is generated, and all the keyword content of the subjects can be shown.

Description

A theme visualization method for Chinese document sets

Technical field

The present invention relates to the fields of text visualization and topic analysis, and specifically to a theme visualization method for Chinese document sets.

Background technology

Large-scale document collections, such as news, scientific literature, web pages, electronic publications and bulletins, contain a great deal of information. As information is increasingly digitized, document collections grow daily; reading and understanding this vast body of information quickly, and extracting useful knowledge from it, has become an urgent problem.

A "theme" generally comprises a core event or activity together with all events and activities directly related to it. Topic detection methods use techniques such as clustering, classification, retrieval and topic tracking to classify and organize a document set hierarchically by theme, making it easier for users to retrieve, select and browse documents. However, once the documents have been classified, users still have to spend a great deal of time reading every document under a theme in order to understand its main content, discover latent knowledge and obtain the information they need.

Multi-document automatic summarization builds on topic detection: it gathers the content of a theme, removes redundant information and generates a comprehensive, concise text, which greatly improves the efficiency of information acquisition. However, existing multi-document summaries are usually complex and hard for users to understand, the summarization process is difficult to control, and friendly user interfaces and human-computer interaction are lacking. In addition, multi-document summarization often ignores attributes other than text content, such as time and quantity; it therefore has difficulty showing how themes and theme content in a document set evolve over time, and cannot reflect the relationships among the themes of one document set.

Text visualization is an important branch of information visualization. It exploits the innate human ability to recognize, remember and analyze graphics by converting text information into graphical images, helping people understand, read and analyze text content and structure intuitively and efficiently, and, through suitable interactive operations, helping them discover valuable knowledge and patterns.

The word cloud visualization technique abstracts text content into a set of words, uses font size to represent the frequency of each word, and then arranges the words compactly and attractively according to some rule to show the characteristics of the text. However, a word cloud can visualize only a single document. For multiple documents, ThemeRiver (theme stream) visualizes the themes in a document set and shows how the strength of each theme changes over time. The original theme stream contains only theme strength and time information, and the themes are ordered arbitrarily. Later, Liu Shixia et al. proposed TIARA, an improved theme stream that embeds word clouds in the stream to further visualize the content of each theme, helping users quickly analyze how theme content changes over time.

None of the above text visualization techniques is general enough to be applied to Chinese documents, and to date no visualization technique exists domestically for analyzing the themes of Chinese documents. In addition, the TIARA technique, which visualizes themes of English documents only, has the following problems: 1) the shape and layout of the word clouds in the theme stream are unstable, which can mislead users and degrade the theme analysis; 2) because of region size limits, the generated word cloud cannot show all the key content of each theme.

Summary of the invention

The object of the present invention is to provide a theme visualization method for Chinese document sets that collects and processes the information extracted for each theme in a Chinese document set, measures the strength of each theme and the weight of its content, and then presents them graphically.

The technical solution that achieves this object is a theme visualization method for Chinese document sets, comprising: a step of classifying the document set by theme: let the document set have n themes l_j, j = 0, 1, 2, ..., n-1; classify all documents in the document set by theme to obtain n document subsets D_j, j = 0, 1, 2, ..., n-1, where D_j is the document subset corresponding to theme l_j;

a step of dividing the time span of the document set: let the start time of the document set be t_{start} and the end time be t_{end}; divide the time span [t_{start}, t_{end}] into equal parts to obtain the time periods T_p = (t_{start} + (p-1)Δt, t_{start} + pΔt], where p = 1, 2, ..., m-1, m is the total number of time points (the start time, the end time and the division points), and

Δt = \frac{t_{end} - t_{start}}{m - 1};

a step of calculating the theme frequencies: let the theme frequencies comprise v_{j,0} and v_{j,p}, where v_{j,0} is the number of documents at the start time t_{start} in the document subset D_j corresponding to theme l_j, and v_{j,p} is the number of documents within time period T_p in that subset; calculate the theme frequencies of every theme;

a step of ordering the themes: order all themes to obtain the ordered theme sequence table;

a step of generating the theme flow graph: from the ordered theme sequence table and the theme frequencies, generate the theme flow graph using the theme stream algorithm;

a step of extracting keywords that represent theme content: let W_{j,p} be the subset of keywords representing the content of theme l_j in the documents of subset D_j that fall within time period T_p; using the General Word Segmentation System for Modern Chinese, extract the keyword subset from the documents of each time period in the document subset of each theme;

a step of calculating and ranking keyword weights: let the weight of a keyword be the number of times it occurs in a keyword subset; calculate the weight of each keyword in each keyword subset and sort the keywords of each subset in descending order of weight;

a step of generating the word cloud: from the keyword subsets and keyword weights, generate the word cloud on the theme flow graph.
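As a rough illustration of the data-preparation steps above (time division, theme frequency and keyword weighting), a minimal Python sketch follows. The function names `theme_frequencies` and `rank_keywords`, the numeric encoding of document times, the clamping of the period index and the tie-break between equal weights are all assumptions made for the example; word segmentation is assumed to have been done beforehand by an external segmenter.

```python
from collections import Counter
from math import ceil


def theme_frequencies(doc_times, t_start, t_end, m):
    """Return [v_0, v_1, ..., v_{m-1}] for the documents of one theme.

    v_0 counts documents dated exactly t_start; v_p counts documents in
    T_p = (t_start + (p-1)*dt, t_start + p*dt].
    """
    dt = (t_end - t_start) / (m - 1)
    freq = [0] * m
    for t in doc_times:
        if t == t_start:
            freq[0] += 1
        elif t_start < t <= t_end:
            # Clamp to m-1 to guard against floating-point round-up (assumption).
            p = min(ceil((t - t_start) / dt), m - 1)
            freq[p] += 1
    return freq


def rank_keywords(tokens, stop_words=frozenset()):
    """Weight = number of occurrences; sort by descending weight.

    Equal weights are ordered by the keyword string (an assumed tie-break).
    """
    counts = Counter(t for t in tokens if t not in stop_words)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

With journal issues 1 to 9 as time points (m = 9), a document dated issue 1 is counted in v_{j,0} and a document dated issue 3 falls in T_2.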

In the above technical solution, the step of ordering the themes may use an ordering method based on theme frequency and geometric complementarity, comprising:

Step 1: let the onset time of theme l_j be OT_j. When v_{j,0} is not zero, take the start time t_{start} of the document set as OT_j; when v_{j,0} is zero, take as OT_j the smallest left endpoint among the time periods T_p for which v_{j,p} is non-zero. Calculate the onset time of every theme;

Step 2: let the frequency sum of theme l_j be the sum of its theme frequencies v_{j,p} over p = 0, 1, ..., m-1; calculate the frequency sum of every theme;

Step 3: create an empty list B. If n is even, write the theme with the largest frequency sum into the first row of the list as the upper endpoint theme l_{up}, and write the theme with the second-largest frequency sum into the second row as the lower endpoint theme l_{down}. If n is odd, write the theme with the largest frequency sum into the first row, where it serves as both the upper endpoint theme l_{up} and the lower endpoint theme l_{down};

Step 4: select a theme l_i that is not in list B and calculate the mean of the combined frequencies of l_{up} and l_i:

\overline{V(l_{up} + l_i)} = \frac{1}{m} \sum_{p=0}^{m-1} (v_{up,p} + v_{i,p}) \quad (1);

Calculate the geometric complementarity of l_{up} and l_i, expressed by the variance σ_{up,i}:

\sigma_{up,i} = \sqrt{\frac{1}{m} \sum_{p=0}^{m-1} \left( (v_{up,p} + v_{i,p}) - \overline{V(l_{up} + l_i)} \right)^2} \quad (2);

After normalizing σ_{up,i} and OT_i, calculate the weighted value D_i:

D_i = s \cdot OT_i + (1 - s) \cdot \sigma_{up,i} \quad (3);

where s is a control parameter, 0 ≤ s ≤ 1;

Step 5: repeat step 4 until a weighted value has been obtained for every theme not in list B;

Step 6: select the theme with the smallest weighted value D_i and insert it above the upper endpoint theme of list B, where it becomes the new upper endpoint theme l_{up};

Step 7: select a theme l_k that is not in list B and calculate the mean of the combined frequencies of l_{down} and l_k:

\overline{V(l_{down} + l_k)} = \frac{1}{m} \sum_{p=0}^{m-1} (v_{down,p} + v_{k,p}) \quad (4);

Calculate the geometric complementarity of l_{down} and l_k, expressed by the variance σ_{down,k}:

\sigma_{down,k} = \sqrt{\frac{1}{m} \sum_{p=0}^{m-1} \left( (v_{down,p} + v_{k,p}) - \overline{V(l_{down} + l_k)} \right)^2} \quad (5);

After normalizing σ_{down,k} and OT_k, calculate the weighted value D_k:

D_k = s \cdot OT_k + (1 - s) \cdot \sigma_{down,k} \quad (6);

where s is a control parameter, 0 ≤ s ≤ 1;

Step 8: repeat step 7 until a weighted value has been obtained for every theme not in list B;

Step 9: select the theme with the smallest weighted value and insert it below the lower endpoint theme of list B, where it becomes the new lower endpoint theme l_{down};

Step 10: repeat steps 4 to 9 until all themes have been added to list B.

In the present invention, the control parameter s takes the value 0.3.
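Steps 1 to 10 can be sketched in Python as below. This is only an illustrative reading of the method: the names `order_themes`, `onset_index`, `complementarity` and `normalise` are hypothetical, the onset time is measured in units of Δt, and min-max normalization is assumed because the text does not say which normalization is used. Note that formula (2), as written, is the square root of the mean squared deviation, and the sketch follows the formula.

```python
from math import sqrt


def normalise(values):
    """Min-max normalisation to [0, 1] (assumed; the source only says 'normalize')."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]


def onset_index(freq):
    """Onset time OT in units of dt: 0 if v_0 != 0, else the left endpoint
    index (p - 1) of the first period T_p with a non-zero frequency."""
    if freq[0]:
        return 0
    return next((p - 1 for p, v in enumerate(freq) if v), len(freq))


def complementarity(a, b):
    """Formulas (1)-(2): dispersion of the combined frequencies of two themes."""
    total = [x + y for x, y in zip(a, b)]
    mean = sum(total) / len(total)
    return sqrt(sum((x - mean) ** 2 for x in total) / len(total))


def order_themes(freqs, s=0.3):
    """freqs: dict theme -> [v_0, ..., v_{m-1}]. Returns themes from top to bottom."""
    remaining = sorted(freqs, key=lambda k: -sum(freqs[k]))
    seed = 2 if len(remaining) % 2 == 0 else 1     # step 3: even n seeds two themes
    order, remaining = remaining[:seed], remaining[seed:]
    at_top = True                                   # steps 4-6 first, then steps 7-9
    while remaining:
        end = order[0] if at_top else order[-1]
        ot = normalise([onset_index(freqs[k]) for k in remaining])
        sg = normalise([complementarity(freqs[end], freqs[k]) for k in remaining])
        d = [s * o + (1 - s) * g for o, g in zip(ot, sg)]   # formulas (3) and (6)
        best = remaining.pop(d.index(min(d)))
        if at_top:
            order.insert(0, best)
        else:
            order.append(best)
        at_top = not at_top
    return order
```

In this sketch a theme whose peaks fill the troughs of the current endpoint theme gets a low dispersion value and is therefore placed next to it, which is what flattens the outline of the resulting graph.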

In the foregoing technical solution, the word cloud is generated on the theme flow graph as follows:

Step 1: select the region G_j corresponding to theme l_j on the theme flow graph; its start and end times equal the start time t_{start} and end time t_{end} of the document set. Divide the time span [t_{start}, t_{end}] of region G_j into m-1 segments, each of length

Δt = \frac{t_{end} - t_{start}}{m - 1},

to obtain the division time points t_{start} + pΔt, where p = 1, 2, ..., m-2;

Step 2: taking each division time point t_{start} + pΔt in turn as the centre, cut out a sub-region R_{j,p} from region G_j according to Δt; R_{j,p} is the closed region formed by a point set N, curves and line segments;

Step 3: place the keywords of keyword subset W_{j,p} on each sub-region R_{j,p} to generate the word cloud of theme l_j, comprising:

3.1 connect the points of point set N in turn with line segments to obtain an approximate polygon of sub-region R_{j,p};

3.2 let y_{max} be the y coordinate of the vertex with the largest y coordinate among the vertices of the approximate polygon of R_{j,p}, and let y_{min} be the y coordinate of the vertex with the smallest y coordinate;

3.3 intersect a set of horizontal lines H = {y = c | y_{min} ≤ c ≤ y_{max}, c ∈ Z} with sub-region R_{j,p} to obtain a number of intersection segments; take the sub-segments of each intersection segment that lie inside the polygon, denoted r_c(i), where M is the number of sub-segments of that intersection segment lying inside R_{j,p}; R_{j,p} is thereby represented as a set of horizontal segments

L_{j,p} = \{ r_c(i) = \overline{s_c(i) e_c(i)},\ y_{\min} \leq c \leq y_{\max},\ 0 < i \leq M \};

3.4 select keywords from W_{j,p} one at a time in descending order of weight; substitute a rectangle of height h and width w for the keyword, lay out the rectangle, and then place the keyword at the layout position, comprising:

A. test whether, in the set L_{j,p} corresponding to R_{j,p}, a rectangle of width w and height h can be placed at position c = (y_{max} - y_{min})/2. The test is: check whether, among the r_c(i) of the rows from c to c-h, there is one index i common to all rows for which the length of segment r_c(i) is greater than w. If so, place the keyword at position (c, s_c(i)) and update s_c(i) to s_c(i) = s_c(i) + w; if not, go to step B;

B. starting from c = (y_{max} - y_{min})/2, traverse L_{j,p} by alternately setting c = c+1 and c = c-1, testing whether the rectangle can be placed at position c. If so, place the keyword at position (c, s_c(i)) and update s_c(i) to s_c(i) = s_c(i) + w; if not, continue alternating c = c+1 and c = c-1 through L_{j,p} until an r_c(i) satisfying the condition is found or all r_c(i) have been traversed. If no position c satisfying the condition has been found after all r_c(i) have been traversed, discard the keyword;

3.5 repeat step 3.4 until all keywords in W_{j,p} have been placed;

Step 4: repeat steps 1 to 3 until the word cloud of every theme on the theme flow graph has been generated.
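Steps 3.1 to 3.3, which turn the approximate polygon of a sub-region into the horizontal segment set L_{j,p}, might be sketched as follows. The function name `polygon_to_rows`, the vertex-list input and the half-open edge rule are assumptions of this example rather than details given in the text; because of the half-open rule, a scan line that only touches the topmost vertex or edge yields no segment.

```python
from math import ceil, floor


def polygon_to_rows(vertices):
    """vertices: [(x, y), ...] in drawing order (the approximate polygon).

    Returns a dict mapping each integer scan line c to its inside
    sub-segments [[s_c(1), e_c(1)], [s_c(2), e_c(2)], ...].
    """
    ys = [y for _, y in vertices]
    y_min, y_max = ceil(min(ys)), floor(max(ys))
    rows = {}
    n = len(vertices)
    for c in range(y_min, y_max + 1):
        xs = []
        for k in range(n):
            (x1, y1), (x2, y2) = vertices[k], vertices[(k + 1) % n]
            # Half-open rule: count an edge only where it spans the scan line,
            # so shared vertices are not counted twice and horizontal edges are skipped.
            if (y1 <= c < y2) or (y2 <= c < y1):
                xs.append(x1 + (c - y1) * (x2 - x1) / (y2 - y1))
        xs.sort()
        # Consecutive pairs of crossings bound the parts lying inside the polygon.
        rows[c] = [[xs[i], xs[i + 1]] for i in range(0, len(xs) - 1, 2)]
    return rows
```

For a triangle with vertices (0, 0), (8, 0) and (4, 4), the scan line c = 2 gives the single segment from x = 2 to x = 6.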

The invention also provides a method of generating a detailed word cloud, comprising:

Step 1: select the keyword set that expresses the content of theme l_j;

Step 2: set up a circular region C and discretize the boundary of C into a set of conflict points P;

Step 3: select a keyword from the keyword set in descending order of weight, and use a random greedy algorithm to generate a candidate position (word.x, word.y) for it in region C;

Step 4: set the font size according to the weight of the keyword; then, from the font size and the number of characters in the keyword, approximate the keyword by a rectangle r whose lower-left corner is the candidate position;

Step 5: for each conflict point in P, test whether the point conflicts with r; if any conflict exists, go to step 6; if there is no conflict, go to step 7;

Step 6: update the position along a spiral path and repeat steps 4 and 5 until a position satisfying the condition is found or the spiral radius exceeds 100; when the spiral radius exceeds 100, the keyword is discarded;

Step 7: place the keyword at position (word.x, word.y), discretize the region occupied by the keyword into conflict points, and add them to the conflict point set P;

Step 8: repeat steps 3 to 7 until all keywords in the keyword set have been placed.

When generating the detailed word cloud, any one keyword subset W_{j,p} of theme l_j may be chosen as the keyword set expressing the content of the theme, or the set formed by merging all keyword subsets W_{j,p} of theme l_j may be chosen.
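The detailed word cloud procedure can be approximated by the Python sketch below. Several details are assumptions made for the example: the function names, uniform random sampling in place of the random greedy candidate generation (whose details are not given here), the mapping from weight to font size, one square cell per character, integer coordinates, the Archimedean spiral and its step size, and a default region radius of 300.

```python
import math
import random


def boundary_points(radius, n=2000):
    """Step 2: discretise the boundary of the circular region C into conflict points."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]


def conflicts(rect, points):
    """Step 5: a conflict exists if any conflict point lies in the closed rectangle."""
    x, y, w, h = rect
    return any(x <= px <= x + w and y <= py <= y + h for px, py in points)


def detailed_cloud(weights, radius=300, max_spiral=100, seed=0):
    """weights: dict keyword -> weight. Returns [(keyword, x, y, w, h), ...]."""
    rng = random.Random(seed)
    points = boundary_points(radius)
    placed = []
    top = max(weights.values())
    for word, wt in sorted(weights.items(), key=lambda kv: -kv[1]):
        size = 10 + round(20 * wt / top)        # assumed weight-to-font-size mapping
        w, h = size * len(word), size           # assumed: one square cell per character
        x0 = rng.randint(-radius // 3, radius // 3)   # stand-in for the random greedy step
        y0 = rng.randint(-radius // 3, radius // 3)
        theta = 0.0
        while True:
            r = 2.0 * theta                     # Archimedean spiral radius
            if r > max_spiral:
                break                           # step 6: keyword is discarded
            x = x0 + round(r * math.cos(theta))
            y = y0 + round(r * math.sin(theta))
            if not conflicts((x, y, w, h), points):
                placed.append((word, x, y, w, h))
                # Step 7: the occupied area becomes new conflict points.
                points += [(px, py) for px in range(x, x + w + 1)
                           for py in range(y, y + h + 1)]
                break
            theta += 0.2
    return placed
```

Because placed rectangles use integer corners and every lattice point they cover becomes a conflict point, later keywords in this sketch cannot overlap earlier ones.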

Compared with the prior art, the present invention has the following technical effects: 1. It achieves theme visualization for Chinese document sets. 2. When the themes are ordered with the ordering method based on theme frequency and geometric complementarity, the generated theme flow graph is more attractive and flatter, uses space more fully, and is better suited to word cloud placement. 3. The word cloud layout method of the invention uses space effectively: for the same region size and font size, more keywords can be placed. The generated layout is stable and does not change with subsequent interactive operations. The algorithm is also clearly more efficient: it represents an irregular region, according to a fixed rule, as a set of discrete entities, so that placing a word only requires traversing the entities to find one that satisfies the placement condition, with no collision detection or boundary detection; layout efficiency is therefore greatly improved. 4. The generated detailed word cloud can show all the keyword content of a theme.

Brief description of the drawings

Fig. 1 is a flow chart of the present invention.

Fig. 2 shows, for the first embodiment of the invention, the result of visualizing theme content when the theme flow graph is generated after ordering the themes randomly and the word clouds are placed with the layout algorithm of the TIARA technique.

Fig. 3 is a flow chart of the ordering method based on theme frequency and geometric complementarity in the second embodiment of the invention.

Fig. 4 shows the theme flow graph generated in the second embodiment of the invention.

Fig. 5 is a schematic diagram of cutting out sub-regions in the third embodiment of the invention.

Fig. 6 is a schematic diagram, for the third embodiment of the invention, of approximating a sub-region as a set of horizontal line segments.

Fig. 7 shows the result of placing keywords on the theme flow graph in the third embodiment of the invention; the theme flow graph is generated after ordering the themes randomly.

Fig. 8 shows the result of visualizing the themes of a Chinese document set in the fourth embodiment of the invention; the theme flow graph is generated after ordering with the method based on theme frequency and geometric complementarity, and the word clouds are generated by the improved layout algorithm.

Fig. 9 shows the result of visualizing the themes of a Chinese document set with the detailed word cloud added, in the fifth embodiment of the invention.

Embodiment

Embodiment 1: the theme visualization method for Chinese documents is illustrated below with reference to Fig. 1, taking data from the Journal of Software as an example.

Step 1, classify the document set by theme: let the document set have n themes l_j, j = 0, 1, 2, ..., n-1; classify all documents in the document set by theme to obtain n document subsets D_j, j = 0, 1, 2, ..., n-1, where D_j is the document subset corresponding to theme l_j. Specifically, the input is the paper data of issues 1 to 9 of the Journal of Software. The document set is classified into five themes (system software and software engineering, database technology, computer networks and information security, pattern recognition and artificial intelligence, and operating systems), giving five document subsets.

Step 2, divide the time span of the document set: let the start time of the document set be t_{start} and the end time be t_{end}; divide the time span [t_{start}, t_{end}] into equal parts to obtain the time periods T_p = (t_{start} + (p-1)Δt, t_{start} + pΔt], where p = 1, 2, ..., m-1 and

Δt = \frac{t_{end} - t_{start}}{m - 1},

where m is the total number of time points, counting the start time, the end time and the division points. Here the start time of the document set is issue 1 and the end time is issue 9; the time span is divided into 8 time periods with an interval of one issue.

Step 3, calculate the theme frequencies: let the theme frequencies comprise v_{j,0} and v_{j,p}, where v_{j,0} is the number of documents at the start time t_{start} in the document subset D_j corresponding to theme l_j, and v_{j,p} is the number of documents within time period T_p in that subset; calculate the theme frequencies of every theme. Here this means counting, for each theme, the documents that appear in issue 1 and in each later issue, that is, the number of documents of each theme at the start time and within each time period.

Step 4, order the themes: order all themes to obtain the ordered theme sequence table. In this embodiment the themes are ordered with the traditional random arrangement. The choice of ordering method directly affects the appearance of the generated theme flow graph.

Step 5, generate the theme flow graph: from the ordered theme sequence table and the theme frequencies, generate the theme flow graph with the theme stream algorithm to visualize theme strength. In this embodiment the generated theme flow graph is shown by the coloured bands in Fig. 2: the blue band is pattern recognition and artificial intelligence, the purple band is computer networks and information security, the red band is operating systems, the yellow band is database technology, and the green band is system software and software engineering. The theme stream algorithm (from the paper "ThemeRiver: Visualizing thematic changes in large document collections") interpolates the weight of each theme over the discrete time points (the interpolating function must have zero derivative at the extreme points) and then draws a stacked graph to produce the theme flow graph. In the theme flow graph the horizontal axis represents time, the height along the vertical axis represents theme strength, and bands of different colours represent different themes. A band that widens or narrows over time shows how the strength of the theme evolves.

Step 6, extract keywords that represent theme content: let W_{j,p} be the subset of keywords representing the content of theme l_j in the documents of subset D_j that fall within time period T_p; using the General Word Segmentation System for Modern Chinese, extract the keyword subset from the documents of each time period in the document subset of each theme. This embodiment uses bag-of-words text analysis to extract the keyword subsets. Specifically, the General Word Segmentation System for Modern Chinese developed by the Institute of Language Information Processing at Beijing Language and Culture University is used to segment all documents of each time period in each theme's document subset; stop words such as modal particles, adverbs, prepositions and conjunctions are removed, finally yielding the keyword subsets.

Step 7, calculate and rank the keyword weights: the weight of a keyword is the number of times it occurs in a keyword subset; calculate the weight of each keyword in each keyword subset and sort the keywords in descending order of weight.

Step 8, generate the word cloud: from the keyword subsets and keyword weights, generate the word cloud on the theme flow graph with the TIARA algorithm to visualize theme content.

The result of visualizing the themes of the Chinese document set with the method of this embodiment is shown in Fig. 2.
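The theme flow generation of step 5 (interpolation followed by stacking) can be illustrated with the short Python sketch below. It is not the ThemeRiver implementation: the smoothstep interpolant is merely one assumed choice that has zero derivative at every sample and therefore at the extreme points, the baseline centred on y = 0 is an assumed layout, and the names `smooth` and `stack` are hypothetical.

```python
def smooth(values, x):
    """Interpolate samples taken at x = 0, 1, 2, ... with zero slope at each sample."""
    p = min(int(x), len(values) - 2)   # index of the sample to the left of x
    u = x - p
    k = u * u * (3 - 2 * u)            # smoothstep: derivative is 0 at u = 0 and u = 1
    return values[p] + (values[p + 1] - values[p]) * k


def stack(freqs_in_order, x):
    """Band edges [(lower, upper), ...] at position x, stacked from the bottom up.

    freqs_in_order: one frequency list per theme, in stacking order.
    The whole stack is centred on y = 0 (an assumed layout).
    """
    heights = [smooth(v, x) for v in freqs_in_order]
    y = -sum(heights) / 2
    bands = []
    for h in heights:
        bands.append((y, y + h))
        y += h
    return bands
```

Evaluating `stack` at closely spaced values of x and filling the area between each pair of edges with the theme's colour would give bands whose width follows the theme frequency.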

Embodiment 2: in the theme visualization method described above, the themes are arranged in random order. When the theme stream is generated, a theme whose strength changes too sharply distorts the shape of its neighbouring themes, which makes the result unattractive and the relative strengths of the themes hard to read. The distorted themes also hamper word cloud placement. Moreover, among all the themes of a document set, users are usually most interested in the content of the strongest theme. The invention therefore further improves the theme ordering step of Embodiment 1 with an ordering method based on theme frequency and geometric complementarity. The ordering method is described in detail below with reference to Fig. 3:

Step 2: let the frequency sum of theme l_j be the sum of its theme frequencies v_{j,p} over p = 0, 1, ..., m-1; calculate the frequency sum of every theme;

Step 3: create an empty list B. If n is even, write the theme with the largest frequency sum into the first row of the list as the upper endpoint theme l_{up}, and write the theme with the second-largest frequency sum into the second row as the lower endpoint theme l_{down}. If n is odd, write the theme with the largest frequency sum into the first row, where it serves as both the upper endpoint theme l_{up} and the lower endpoint theme l_{down};

Step 4: select a theme l_i that is not in list B and calculate the mean of the combined frequencies of l_{up} and l_i:

\overline{V(l_{up} + l_i)} = \frac{1}{m} \sum_{p=0}^{m-1} (v_{up,p} + v_{i,p}) \quad (1);

Calculate the geometric complementarity of l_{up} and l_i, expressed by the variance σ_{up,i}:

\sigma_{up,i} = \sqrt{\frac{1}{m} \sum_{p=0}^{m-1} \left( (v_{up,p} + v_{i,p}) - \overline{V(l_{up} + l_i)} \right)^2} \quad (2);

After normalizing σ_{up,i} and OT_i, calculate the weighted value D_i:

D_i = s \cdot OT_i + (1 - s) \cdot \sigma_{up,i} \quad (3);

where s is a control parameter, 0 ≤ s ≤ 1;

\overline{V(l_{down} + l_k)} = \frac{1}{m} \sum_{p=0}^{m-1} (v_{down,p} + v_{k,p}) \quad (4);

\sigma_{down,k} = \sqrt{\frac{1}{m} \sum_{p=0}^{m-1} \left( (v_{down,p} + v_{k,p}) - \overline{V(l_{down} + l_k)} \right)^2} \quad (5);

After normalizing σ_{down,k} and OT_k, calculate the weighted value D_k:

D_k = s \cdot OT_k + (1 - s) \cdot \sigma_{down,k} \quad (6);

where s is a control parameter, 0 ≤ s ≤ 1;

In this embodiment the control parameter s takes the value 0.3.

Fig. 4 shows the theme flow graph generated after the themes are ordered with the method of this embodiment. The graph is more attractive and flatter, uses space more fully, and is better suited to word cloud placement.

Embodiment 3: to address the unstable shape and layout of the word clouds in the TIARA technique, the invention also improves the word cloud. A theme is first divided into several sub-regions; a scalable algorithm (from the paper "Tag Cloud++ - Scalable Tag Clouds for Arbitrary Layouts") is then used to represent each sub-region as a set of horizontal line segments; the keywords are then placed in turn to generate the word cloud. The visual features are: 1) the larger the weight of a keyword, the larger its font; 2) the larger the weight of a keyword, the closer it lies to the centre of the region. A detailed description follows with reference to Fig. 5 and Fig. 6:

obtain the division time points t_{start} + pΔt, where p = 1, 2, ..., m-2;

Step 2: as shown in Fig. 5, taking each division time point t_{start} + pΔt in turn as the centre, cut out a sub-region R_{j,p} from region G_j according to Δt; R_{j,p} is the closed region formed by a point set N, curves and line segments;

Step 3: place the keywords of keyword subset W_{j,p} on each sub-region R_{j,p} to generate the word cloud of theme l_j, comprising:

3.1 connect the points of point set N in turn with line segments to obtain an approximate polygon of sub-region R_{j,p};

3.3 as shown in Fig. 6, intersect a set of horizontal lines H = {y = c | y_{min} ≤ c ≤ y_{max}, c ∈ Z} with sub-region R_{j,p} to obtain a number of intersection segments; take the sub-segments of each intersection segment that lie inside the polygon, denoted r_c(i);

3.5 repeat step 3.4 until all keywords in W_{j,p} have been placed;

Fig. 7 shows the result of generating the word cloud with this method; the theme flow graph was generated with the themes ordered randomly. A comparison with Fig. 2 shows that the word cloud layout algorithm of the invention has the following advantages: 1) It uses space effectively: for the same region size and font size, more keywords can be placed. 2) The generated layout is stable and does not change with subsequent interactive operations. 3) The algorithm is clearly more efficient: it represents an irregular region, according to a fixed rule, as a set of discrete entities, so that placing a word only requires traversing the entities to find one that satisfies the placement condition, with no collision detection or boundary detection; layout efficiency is therefore greatly improved.
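The placement rule of step 3.4, which needs only a traversal of the horizontal segments, might look as follows in Python. The names `scan_order` and `place` are hypothetical. Two points are assumptions of the sketch: the starting row is taken as the midpoint of y_min and y_max, which coincides with the (y_max - y_min)/2 of the text when y_min = 0, and the x position is the largest left endpoint among the spanned rows, so that the rectangle fits in every row.

```python
def scan_order(y_min, y_max):
    """Yield rows starting at the centre, then alternately above and below it."""
    centre = (y_min + y_max) // 2
    yield centre
    for k in range(1, y_max - y_min + 1):
        for c in (centre + k, centre - k):
            if y_min <= c <= y_max:
                yield c


def place(rows, w, h, y_min, y_max):
    """rows: dict c -> list of free segments [s, e]; modified in place.

    Returns the position (c, x) of a w-by-h rectangle, or None if the
    keyword has to be discarded.
    """
    for c in scan_order(y_min, y_max):
        span = range(c - h, c + 1)              # rows c-h .. c, as in step A
        if span[0] < y_min:
            continue
        # The same segment index i must be usable in every spanned row.
        for i in range(min(len(rows[r]) for r in span)):
            x = max(rows[r][i][0] for r in span)
            if all(rows[r][i][1] - x > w for r in span):
                for r in span:
                    rows[r][i][0] = x + w       # shrink the free segments
                return c, x
    return None
```

On a rectangular region of five rows, each with one free segment from 0 to 10, successive calls with w = 4 and h = 1 fill the central rows first and then move outward, so heavier keywords, which are placed first, end up nearest the centre.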

Embodiment 4: this embodiment combines the theme ordering method of Embodiment 2 with the improved word cloud placement method of Embodiment 3; the other steps are unchanged. Fig. 8 shows the result of visualizing the themes of the Chinese document set with this embodiment.

Embodiment 5: in TIARA, the limited size of a region makes it difficult to place all keywords in the region. The invention therefore uses a detailed word cloud to further visualize the full content of each theme, or the full content of each theme within each time period. The background colour of the word cloud matches the colour of the theme in the theme flow graph, and the size of a keyword corresponds to its weight. The invention uses a random greedy algorithm (from "TIARA: A Visual Exploratory Text Analytic System") to generate the detailed word cloud, specifically:

Step 1: select the keyword set that expresses the content of theme l_j;

Step 6: update the position along a spiral path and repeat steps 4 and 5 until a position satisfying the condition is found or the spiral radius exceeds 100; when the spiral radius exceeds 100, the keyword is discarded;

Fig. 9 shows the result of adding the detailed word cloud to the theme visualization of the Chinese document set. The detailed word cloud is placed at the lower right of the figure; clicking the coloured band of a theme on the theme flow graph switches the display to the detailed word cloud of that theme. The figure shows the detailed word cloud of the operating systems theme. It can be seen that, because of the limited region size, the word cloud of the operating systems theme on the theme flow graph does not contain all of the keyword content, whereas the detailed word cloud shows all the keyword content of the theme.

Claims

1. A theme visualization method for Chinese document sets, characterized by comprising:

a step of classifying the document set by theme: let the document set have n themes l_j, j = 0, 1, 2, ..., n-1; classify all documents in the document set by theme to obtain n document subsets D_j, j = 0, 1, 2, ..., n-1, where D_j is the document subset corresponding to theme l_j;

Δt = \frac{t_{end} - t_{start}}{m - 1};

a step of calculating the theme frequencies: let the theme frequencies comprise v_{j,0} and v_{j,p}, where v_{j,0} is the number of documents at the start time t_{start} in the document subset D_j corresponding to theme l_j, and v_{j,p} is the number of documents within time period T_p in that subset; calculate the theme frequencies of every theme;

2. The theme visualization method for Chinese document sets according to claim 1, characterized in that the step of ordering the themes is carried out with an ordering method based on theme frequency and geometric complementarity, comprising:

Step 1: let the onset time of theme l_j be OT_j. When v_{j,0} is not zero, take the start time t_{start} of the document set as OT_j; when v_{j,0} is zero, take as OT_j the smallest left endpoint among the time periods T_p for which v_{j,p} is non-zero. Calculate the onset time of every theme;

Step 2: let the frequency sum of theme l_j be the sum of its theme frequencies v_{j,p} over p = 0, 1, ..., m-1; calculate the frequency sum of every theme;

\overline{V(l_{up} + l_i)} = \frac{1}{m} \sum_{p=0}^{m-1} (v_{up,p} + v_{i,p}) \quad (1);

Calculate the geometric complementarity of l_{up} and l_i, expressed by the variance σ_{up,i}:

\sigma_{up,i} = \sqrt{\frac{1}{m} \sum_{p=0}^{m-1} \left( (v_{up,p} + v_{i,p}) - \overline{V(l_{up} + l_i)} \right)^2} \quad (2);

After normalizing σ_{up,i} and OT_i, calculate the weighted value D_i:

D_i = s \cdot OT_i + (1 - s) \cdot \sigma_{up,i} \quad (3);

where s is a control parameter, 0 ≤ s ≤ 1;

\overline{V(l_{down} + l_k)} = \frac{1}{m} \sum_{p=0}^{m-1} (v_{down,p} + v_{k,p}) \quad (4);

\sigma_{down,k} = \sqrt{\frac{1}{m} \sum_{p=0}^{m-1} \left( (v_{down,p} + v_{k,p}) - \overline{V(l_{down} + l_k)} \right)^2} \quad (5);

After normalizing σ_{down,k} and OT_k, calculate the weighted value D_k:

D_k = s \cdot OT_k + (1 - s) \cdot \sigma_{down,k} \quad (6);

where s is a control parameter, 0 ≤ s ≤ 1;

3. The theme visualization method for Chinese document sets according to claim 2, characterized in that the control parameter s = 0.3.

4. The theme visualization method for Chinese document sets according to claim 1, characterized in that the step of generating the word cloud comprises:

obtaining the division time points t_{start} + pΔt, where p = 1, 2, ..., m-2;

R_{j,p} being the closed region formed by a point set N, curves and line segments;

3.1 connect the points of point set N in turn with line segments to obtain an approximate polygon of sub-region R_{j,p};

3.2 let y_{max} be the y coordinate of the vertex with the largest y coordinate among the vertices of the approximate polygon of R_{j,p}, and let y_{min} be the y coordinate of the vertex with the smallest y coordinate;

L_{j,p} = \{ r_c(i) = \overline{s_c(i) e_c(i)},\ y_{\min} \leq c \leq y_{\max},\ 0 < i \leq M \};

3.5 repeat step 3.4 until all keywords in W_{j,p} have been placed;

5. The theme visualization method for Chinese document sets according to claim 1, characterized by further comprising a step of generating a detailed word cloud, comprising:

Step 1: select the keyword set that expresses the content of theme l_j;

Step 6: update the position along a spiral path and repeat steps 4 and 5 until a position satisfying the condition is found or the spiral radius exceeds 100; when the spiral radius exceeds 100, the keyword is discarded;

Step 7: place the keyword at position (word.x, word.y), discretize the region occupied by the keyword into conflict points, and add them to the conflict point set P;

6. The theme visualization method for Chinese document sets according to claim 5, characterized in that the keyword set expressing the content of theme l_j is any one keyword subset W_{j,p} of theme l_j.

7. The theme visualization method for Chinese document sets according to claim 5, characterized in that the keyword set expressing the content of theme l_j is obtained by the following steps:

Step 1: merge all keyword subsets W_{j,p} of theme l_j, p = 1, 2, ..., m-1;

Step 2: calculate the weight of every keyword in the merged set, the weight of a keyword being the number of times it occurs in all the keyword subsets.