CN103631856A - Subject visualization method for Chinese document set - Google Patents

Info

Publication number
CN103631856A
Authority
CN
China
Prior art keywords
theme
keyword
document
subset
frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310488312.7A
Other languages
Chinese (zh)
Other versions
CN103631856B (en
Inventor
朱敏
梁婷
甘启宏
李明召
李�一
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN201310488312.7A priority Critical patent/CN103631856B/en
Publication of CN103631856A publication Critical patent/CN103631856A/en
Application granted granted Critical
Publication of CN103631856B publication Critical patent/CN103631856B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a subject visualization method for a Chinese document set. The subject visualization method comprises the steps of classifying the document set according to subjects, dividing the time periods of the document set, calculating subject frequency, ranking the subjects, generating a subject flow graph, extracting keywords expressing the content of the subjects, calculating and ranking the weights of the keywords and generating a character cloud. The subject visualization method further comprises an ordering method based on the subject frequency and geometrical complementarity, a character cloud arrangement method and a method for generating the detailed character cloud. The subject visualization method has the advantages that the subject visualization on the Chinese document set is achieved; the subject flow graph generated through the ordering method based on the subject frequency and the geometrical complementarity is more attractive, flatter, high in space use ratio and more beneficial to character cloud arrangement; the character cloud arrangement method can effectively utilize space, and arrangement efficiency is greatly improved; the detailed character cloud is generated, and all the keyword content of the subjects can be shown.

Description

A theme visualization method for Chinese document sets
Technical field
The present invention relates to the fields of text visualization and theme analysis, and specifically to a theme visualization method for Chinese document sets.
Background
Large document collections such as news, scientific literature, web pages, electronic publications, and bulletins contain a great deal of information. As information digitization develops and spreads, document collections grow daily. Reading and understanding this vast body of information quickly, and extracting useful knowledge from it, has become an urgent problem.
A "theme" generally comprises a core event or activity together with all events and activities directly related to it. Topic detection methods use techniques such as clustering, classification, retrieval, and topic tracking to classify and organize a document set hierarchically by theme, which makes it easier for users to retrieve, select, and browse documents. After the documents are classified, however, users still need to spend a great deal of time reading every document under a theme in order to understand its main content, uncover latent knowledge, and obtain the information they need.
Multi-document automatic summarization builds on topic detection: it gathers the content of a theme, removes redundant information, and generates a comprehensive, concise text, which greatly improves the efficiency of information acquisition. Existing multi-document summaries, however, are usually rather complex and hard for users to understand; the summary generation process is difficult to control; and friendly user interfaces and human-computer interaction are lacking. In addition, multi-document summarization often ignores attributes other than text content, such as time and quantity. It is therefore hard to show how the themes of a document set and their content evolve over time, and the relationships among the themes of the same document set cannot be reflected.
Text visualization, an important branch of information visualization, exploits people's innate ability to recognize, remember, and analyze graphics. It converts text information into graphics and images to help people understand, read, and analyze text content and structure intuitively and efficiently, and, through corresponding interactive operations, helps them discover valuable knowledge and patterns.
Word cloud visualization abstracts text content into a set of words, uses font size to represent each word's frequency, and then arranges the words compactly and attractively according to certain rules to present the features of the text. A word cloud, however, can only visualize a single document. For multiple documents, ThemeRiver (theme stream) visualizes the themes in a document set and shows how the intensity of each theme changes over time. The original theme stream contains only theme intensity and time information, and the themes are arranged in random order. Later, Liu Shixia et al. proposed TIARA, an improved theme stream that embeds word clouds in the theme stream to further visualize the content of each theme, helping users quickly analyze how theme content changes over time.
The text visualization techniques above all lack generality and are not suitable for Chinese documents; in China, visualization techniques for analyzing the themes of Chinese documents have so far been lacking. In addition, TIARA, which visualizes themes only for English documents, has the following problems: 1) the shape and layout of the word clouds in the theme stream are unstable, which easily misleads users and impairs theme analysis; 2) because of region limits, the generated word cloud cannot show all the key content of each theme.
Summary of the invention
The object of the present invention is to provide a theme visualization method for Chinese document sets: the information on each theme extracted from a Chinese document set is counted and processed to measure the intensity of each theme and the weight of its content, and the result is then displayed graphically.
The technical scheme that achieves the object of the invention is as follows. A theme visualization method for Chinese document sets comprises the step of classifying the document set by theme: let the document set have n themes $l_j$, j=0,1,2,...,n-1; classify all documents in the document set by theme to obtain n document subsets $D_j$, j=0,1,2,...,n-1, where the document subset corresponding to theme $l_j$ is $D_j$;
The step of dividing the time period of the document set: let the start time of the document set be $t_{start}$ and the end time be $t_{end}$; divide the period $[t_{start}, t_{end}]$ into equal parts to obtain the time periods $T_p=(t_{start}+(p-1)\Delta t,\ t_{start}+p\Delta t]$, where p=1,2,...,m-1, $\Delta t=(t_{end}-t_{start})/(m-1)$, and m is the total number of time points (the start time, the end time, and the equal-division points);
The step of calculating the theme frequency: let the theme frequency comprise $v_{j,0}$ and $v_{j,p}$, where $v_{j,0}$ is the number of documents in the document subset $D_j$ of theme $l_j$ at the start time $t_{start}$, and $v_{j,p}$ is the number of documents in $D_j$ within time period $T_p$; calculate the theme frequency of each theme;
The step of sorting the themes: sort all themes to obtain the sorted theme sequence table;
The step of generating the theme flow graph: from the sorted theme sequence table and the theme frequencies, generate the theme flow graph using the theme flow algorithm;
The step of extracting the keywords that represent theme content: let $W_{j,p}$ be the keyword subset that represents the content of theme $l_j$ in the documents of its document subset $D_j$ within time period $T_p$; use the Modern Chinese General Word Segmentation System to extract, from the documents of each time period in the document subset of each theme, the keyword subset that represents that theme's content;
The step of calculating and sorting keyword weights: let the weight of a keyword be the number of times it occurs in a keyword subset; calculate the weight of each keyword in each keyword subset, and sort the keywords of each subset by weight in descending order;
The step of generating the word cloud: from the keyword subsets and keyword weights, generate the word cloud on the theme flow graph.
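The time-period division and theme-frequency steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `theme_frequencies`, the `(theme, time)` input pairs, and the use of plain numbers for times are all hypothetical choices made for the example.

```python
import math
from collections import defaultdict

def theme_frequencies(docs, t_start, t_end, m):
    """Return {theme: [v_0, v_1, ..., v_{m-1}]}.

    docs is an iterable of (theme, time) pairs (an assumed input format).
    Index 0 counts documents dated exactly t_start; index p counts documents
    in the left-open, right-closed period T_p = (t_start+(p-1)*dt, t_start+p*dt].
    """
    dt = (t_end - t_start) / (m - 1)
    freq = defaultdict(lambda: [0] * m)
    for theme, t in docs:
        if not t_start <= t <= t_end:
            continue  # outside the document-set period
        if t == t_start:
            p = 0
        else:
            # ceil maps a time inside (a, b] to the index of its period;
            # clamping guards against floating-point rounding at the ends.
            p = min(max(math.ceil((t - t_start) / dt), 1), m - 1)
        freq[theme][p] += 1
    return dict(freq)
```

For nine journal issues (t_start = 1, t_end = 9, m = 9) the interval is one issue, so a document in issue 2 falls in $T_1$ and a document in issue 9 falls in $T_8$.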
In the above technical scheme, the step of sorting the themes may use a sorting method based on theme frequency and geometric complementarity, comprising:
Step 1: Let the initial time of theme $l_j$ be $OT_j$. When $v_{j,0}$ is not zero, take the start time $t_{start}$ of the document set as $OT_j$; when $v_{j,0}$ is zero, take as $OT_j$ the smallest left endpoint among the time periods $T_p$ in which $v_{j,p}$ is nonzero. Calculate the initial time of each theme;
Step 2: Let the frequency sum of theme $l_j$ be the sum of its theme frequencies, $\sum_{p=0}^{m-1} v_{j,p}$; calculate the frequency sum of each theme;
Step 3: Create an empty list B. If n is even, write the theme with the largest frequency sum into the first row of the list as the upper endpoint theme $l_{up}$, and write the theme with the second largest frequency sum into the second row as the lower endpoint theme $l_{down}$. If n is odd, write the theme with the largest frequency sum into the first row of the list as both the upper endpoint theme $l_{up}$ and the lower endpoint theme $l_{down}$;
Step 4: Select a theme $l_i$ that is not in list B, and calculate the mean of the summed frequencies of $l_{up}$ and $l_i$:
$$\overline{V(l_{up}+l_i)}=\frac{1}{m}\sum_{p=0}^{m-1}\left(v_{up,p}+v_{i,p}\right) \quad (1);$$
Calculate the geometric complementarity of $l_{up}$ and $l_i$, expressed by the variance $\sigma_{up,i}$:
$$\sigma_{up,i}=\frac{1}{m}\sum_{p=0}^{m-1}\left(\left(v_{up,p}+v_{i,p}\right)-\overline{V(l_{up}+l_i)}\right)^2 \quad (2);$$
After normalizing $\sigma_{up,i}$ and $OT_i$, calculate the weighted value $D_i$:
$$D_i=s\,OT_i+(1-s)\,\sigma_{up,i} \quad (3);$$
where s is a control parameter, 0≤s≤1;
Step 5: Repeat step 4 until the weighted value of every theme not in list B has been obtained;
Step 6: Select the theme with the smallest weighted value $D_i$ and insert it above the upper endpoint theme of list B, as the new upper endpoint theme $l_{up}$;
Step 7: Select a theme $l_k$ that is not in list B, and calculate the mean of the summed frequencies of $l_{down}$ and $l_k$:
$$\overline{V(l_{down}+l_k)}=\frac{1}{m}\sum_{p=0}^{m-1}\left(v_{down,p}+v_{k,p}\right) \quad (4);$$
Calculate the geometric complementarity of $l_{down}$ and $l_k$, expressed by the variance $\sigma_{down,k}$:
$$\sigma_{down,k}=\frac{1}{m}\sum_{p=0}^{m-1}\left(\left(v_{down,p}+v_{k,p}\right)-\overline{V(l_{down}+l_k)}\right)^2 \quad (5);$$
After normalizing $\sigma_{down,k}$ and $OT_k$, calculate the weighted value $D_k$:
$$D_k=s\,OT_k+(1-s)\,\sigma_{down,k} \quad (6);$$
where s is a control parameter, 0≤s≤1;
Step 8: Repeat step 7 until the weighted value of every theme not in list B has been obtained;
Step 9: Select the theme with the smallest weighted value and insert it below the lower endpoint theme of list B, as the new lower endpoint theme $l_{down}$;
Step 10: Repeat steps 4 to 9 until all themes have been added to list B.
In the present invention, the value of the control parameter s is 0.3.
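Steps 1 to 10 above can be sketched as a small greedy routine. This is an illustrative sketch rather than the patented implementation: the name `order_themes` and its input dictionaries are hypothetical, and because the text does not say how $\sigma$ and $OT$ are normalized, min-max scaling over the remaining candidates is assumed here.

```python
def order_themes(freq, onset, s=0.3):
    """Order themes from top to bottom of the theme flow graph.

    freq:  {theme: [v_0, ..., v_{m-1}]}  theme frequencies
    onset: {theme: OT}                   initial time of each theme
    """
    def variance(a, b):
        # Formulas (1)/(2): variance of the summed frequencies of two themes.
        combined = [x + y for x, y in zip(freq[a], freq[b])]
        mean = sum(combined) / len(combined)
        return sum((c - mean) ** 2 for c in combined) / len(combined)

    def normalise(values):
        # ASSUMPTION: min-max scaling; the source only says "after normalizing".
        lo, hi = min(values.values()), max(values.values())
        span = hi - lo
        return {k: (v - lo) / span if span else 0.0 for k, v in values.items()}

    def best(endpoint, remaining):
        sig = normalise({t: variance(endpoint, t) for t in remaining})
        ot = normalise({t: onset[t] for t in remaining})
        # Formulas (3)/(6): the smallest weighted value wins.
        return min(remaining, key=lambda t: s * ot[t] + (1 - s) * sig[t])

    ranked = sorted(freq, key=lambda t: sum(freq[t]), reverse=True)
    seeds = 2 if len(ranked) % 2 == 0 else 1        # step 3
    ordered, remaining = ranked[:seeds], ranked[seeds:]
    while remaining:
        pick = best(ordered[0], remaining)          # steps 4 to 6: grow upward
        ordered.insert(0, pick)
        remaining.remove(pick)
        if remaining:
            pick = best(ordered[-1], remaining)     # steps 7 to 9: grow downward
            ordered.append(pick)
            remaining.remove(pick)
    return ordered
```

With s = 0.3 the variance term carries 70% of the weight, so a theme that fills the dips of the current endpoint theme is preferred over one that merely starts early.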
In the foregoing technical scheme, the word cloud is generated on the theme flow graph as follows:
Step 1: Select the region $G_j$ corresponding to theme $l_j$ on the theme flow graph; its start time and end time equal the start time $t_{start}$ and end time $t_{end}$ of the document set. Divide the period $[t_{start}, t_{end}]$ of region $G_j$ into m-1 segments, each of length $\Delta t=(t_{end}-t_{start})/(m-1)$, to obtain the equal-division time points $t_{start}+p\Delta t$, where p=1,2,...,m-2;
Step 2: Taking each equal-division time point $t_{start}+p\Delta t$ in turn as the center, cut a subregion $R_{j,p}$ from region $G_j$ according to $\Delta t$. $R_{j,p}$ is the closed region formed by a point set [formula image], a curve [formula image], and a line segment [formula image];
Step 3: Place the keywords of keyword subset $W_{j,p}$ on each subregion $R_{j,p}$ to generate the word cloud of theme $l_j$, comprising:
3.1 Connect the points of point set N in turn with line segments [formula image] to obtain an approximate polygon of subregion $R_{j,p}$;
3.2 Let $y_{max}$ be the y coordinate of the vertex with the largest y coordinate among the vertices of the approximate polygon of $R_{j,p}$, and let $y_{min}$ be the y coordinate of the vertex with the smallest y coordinate;
3.3 Intersect the set of horizontal lines $H=\{y=c \mid y_{min}\le c\le y_{max},\ c\in\mathbb{Z}\}$ with subregion $R_{j,p}$ to obtain a number of intersection segments. For each intersection line, take the sub-segments that lie inside the polygon, denoted $r_c(i)=\overline{s_c(i)e_c(i)}$, where M is the number of sub-segments of that line inside $R_{j,p}$. $R_{j,p}$ is thus represented as the set of horizontal line segments
$$L_{j,p}=\{\,r_c(i)=\overline{s_c(i)e_c(i)},\ y_{min}\le c\le y_{max},\ 0<i\le M\,\};$$
3.4 Select keywords from $W_{j,p}$ one at a time in descending order of weight. Replace the keyword with a rectangle of height h and width w for layout, then place the keyword at the position found, comprising:
A. Detect whether, in the $L_{j,p}$ corresponding to $R_{j,p}$, a rectangle of width w and height h can be placed at position $c=(y_{max}-y_{min})/2$. The detection method is: check whether, among the $r_c(i)$ for rows c through c-h, there is a common i for which the length of line segment $\overline{s_c(i)e_c(i)}$ is greater than w. If so, place the keyword at position $(c, s_c(i))$ and update $s_c(i)$ to $s_c(i)=s_c(i)+w$; if not, go to step B;
B. Centered on $c=(y_{max}-y_{min})/2$, set c=c+1 and c=c-1 in turn to traverse $L_{j,p}$ alternately, and detect whether the rectangle can be placed at position c. If so, place the keyword at position $(c, s_c(i))$ and update $s_c(i)$ to $s_c(i)=s_c(i)+w$; if not, continue setting c=c+1 and c=c-1 to traverse $L_{j,p}$ alternately until an $r_c(i)$ that satisfies the condition is found or all $r_c(i)$ have been traversed. If no position c satisfying the condition is found after all $r_c(i)$ have been traversed, discard the keyword;
3.5 Repeat step 3.4 until all keywords in $W_{j,p}$ have been processed;
Step 4: Repeat steps 1 to 3 until the word cloud of every theme on the theme flow graph has been generated.
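The placement loop of step 3.4 can be sketched as follows. This is a simplified illustration with several assumptions that go beyond the text: each integer scan line holds a single free segment (M = 1), the starting row is taken as the middle row `(y_min + y_max) // 2`, the fit test uses the largest start and smallest end over the covered rows, and the start is advanced in every covered row rather than only in row c. The name `place_keywords` and the data layout are hypothetical.

```python
def place_keywords(rows, words):
    """Place rectangles on a region stored as horizontal segments.

    rows:  {y: [start_x, end_x]} one free segment per integer scan line
           (modified in place as space is consumed).
    words: list of (text, w, h) already sorted by descending weight.
    Returns {text: (x, c)}; keywords that fit nowhere are dropped.
    """
    y_min, y_max = min(rows), max(rows)
    centre = (y_min + y_max) // 2
    # Visit rows alternately outward from the centre: c, c+1, c-1, c+2, ...
    order = [centre]
    for d in range(1, y_max - y_min + 1):
        order += [centre + d, centre - d]

    placed = {}
    for text, w, h in words:
        for c in order:
            covered = range(c - h, c + 1)          # rows c-h .. c
            if any(y not in rows for y in covered):
                continue
            x = max(rows[y][0] for y in covered)
            # The free length shared by all covered rows must exceed w.
            if min(rows[y][1] for y in covered) - x > w:
                placed[text] = (x, c)
                for y in covered:
                    rows[y][0] = x + w             # consume the space
                break
    return placed
```

Because heavier keywords are tried first and the search starts at the middle row, they end up nearest the centre of the subregion, and no pairwise collision test between keywords is needed.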
The invention also provides a method for generating a detailed word cloud, comprising:
Step 1: Select the keyword set that expresses the content of theme $l_j$;
Step 2: Set a circular region C and discretize the boundary of C into a set of conflict points P;
Step 3: Select a keyword from the keyword set in descending order of weight, and use a random greedy algorithm to generate a candidate position coordinate (word.x, word.y) for it within region C;
Step 4: Set the font size according to the weight of the keyword; then, from the font size and the number of characters in the keyword, approximate the keyword by a rectangle r whose lower-left corner is at the candidate position coordinate;
Step 5: For each conflict point in P, detect whether the point conflicts with r. If there is a conflict, go to step 6; if there is no conflict, go to step 7;
Step 6: Update the position coordinate along a spiral path and repeat steps 4 and 5 until a position coordinate that satisfies the condition is found or the spiral radius exceeds 100. When the spiral radius exceeds 100, the keyword is discarded;
Step 7: Place the keyword at position coordinate (word.x, word.y), discretize the region occupied by the keyword into conflict points, and add them to the conflict point set P;
Step 8: Repeat steps 3 to 7 until all keywords in the keyword set have been processed.
When generating the detailed word cloud, any single keyword subset $W_{j,p}$ of theme $l_j$ may be selected as the keyword set that expresses the theme's content, or all keyword subsets $W_{j,p}$ of theme $l_j$ may be merged to form that keyword set.
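The spiral search with conflict points can be sketched as below. Several details are assumptions made for this example and are not stated in the text: the search starts at the centre of the circle instead of using the random greedy candidate, the spiral is Archimedean with a fixed angular step, rectangle sizes are passed in directly rather than derived from font size, and an explicit corner test keeps each rectangle inside the circle. The name `detailed_cloud` is hypothetical.

```python
import math

def detailed_cloud(words, radius=50.0, max_spiral=100.0):
    """words: list of (text, w, h) in descending weight order.
    Returns {text: (x, y)} with (x, y) the lower-left corner of each rectangle."""
    # Step 2: discretize the circular boundary into conflict points.
    conflicts = [(radius * math.cos(2 * math.pi * k / 720),
                  radius * math.sin(2 * math.pi * k / 720)) for k in range(720)]
    placed = {}
    for text, w, h in words:
        theta = 0.0
        while True:
            r = 0.5 * theta                      # spiral radius grows with the angle
            if r > max_spiral:
                break                            # step 6: discard the keyword
            x = r * math.cos(theta) - w / 2
            y = r * math.sin(theta) - h / 2
            # ASSUMPTION: additionally require all four corners inside the circle.
            inside = all(math.hypot(cx, cy) <= radius
                         for cx in (x, x + w) for cy in (y, y + h))
            # Step 5: a conflict is any conflict point inside the rectangle.
            if inside and not any(x <= px <= x + w and y <= py <= y + h
                                  for px, py in conflicts):
                placed[text] = (x, y)
                # Step 7: discretize the occupied rectangle on a unit grid.
                conflicts += [(x + i, y + j)
                              for i in range(int(w) + 1)
                              for j in range(int(h) + 1)]
                break
            theta += 0.1
    return placed
```

Point-based conflict detection is approximate: with a unit grid, two rectangles can still overlap by a sliver narrower than one unit, so a finer grid trades speed for tighter packing.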
Compared with the prior art, the present invention has the following technical effects: 1. It achieves theme visualization for Chinese document sets. 2. When the themes are sorted with the method based on theme frequency and geometric complementarity, the generated theme flow graph is more attractive and flatter, uses space better, and is better suited to word cloud placement. 3. The word cloud layout method of the present invention uses space effectively: for the same region size and font size, more keywords can be placed. The generated layout is stable and does not change with subsequent interactive operations. The algorithm is also clearly more efficient: it represents an irregular region as a series of discrete entities according to certain rules, so placing a word only requires traversing them to find an entity that satisfies the word's placement condition, with no collision detection or boundary detection, which greatly improves layout efficiency. 4. The detailed word cloud can show all the keyword content of a theme.
Brief description of the drawings
Fig. 1 is the flow chart of the present invention.
Fig. 2 shows, for the first embodiment of the invention, the result of visualizing theme content when the theme flow graph is generated after random sorting of the themes and the word clouds are laid out with the TIARA algorithm.
Fig. 3 is the flow chart of the sorting method based on theme frequency and geometric complementarity in the second embodiment of the invention.
Fig. 4 shows the theme flow graph generated in the second embodiment of the invention.
Fig. 5 is a schematic diagram of cutting out subregions in the third embodiment of the invention.
Fig. 6 is a schematic diagram of approximating a subregion by a set of horizontal line segments in the third embodiment of the invention.
Fig. 7 shows the result of placing keywords on the theme flow graph in the third embodiment of the invention; the theme flow graph was generated after random sorting of the themes.
Fig. 8 shows the result of visualizing the themes of a Chinese document set in the fourth embodiment of the invention; the theme flow graph was generated after sorting with the method based on theme frequency and geometric complementarity, and the word clouds were generated by the improved layout algorithm.
Fig. 9 shows the result of visualizing the themes of a Chinese document set with the detailed word cloud added, in the fifth embodiment of the invention.
Embodiments
Embodiment 1: Taking data from the journal "Journal of Software" as an example, the theme visualization method for Chinese documents is described below with reference to Fig. 1.
Step 1, classify the document set by theme: let the document set have n themes $l_j$, j=0,1,2,...,n-1; classify all documents in the document set by theme to obtain n document subsets $D_j$, j=0,1,2,...,n-1, where the document subset corresponding to theme $l_j$ is $D_j$. Specifically: input the paper data of issues 1 to 9 of "Journal of Software". Classify the document set into five themes (system software and software engineering; database technology; computer network and information security; pattern recognition and artificial intelligence; operating systems) to obtain five document subsets.
Step 2, divide the time period of the document set: let the start time of the document set be $t_{start}$ and the end time be $t_{end}$; divide the period $[t_{start}, t_{end}]$ into equal parts to obtain the time periods $T_p=(t_{start}+(p-1)\Delta t,\ t_{start}+p\Delta t]$, where p=1,2,...,m-1, $\Delta t=(t_{end}-t_{start})/(m-1)$, and m is the total number of time points (the start time, the end time, and the equal-division points). Here, the start time of the document set is issue 1 and the end time is issue 9; the period is divided into 8 time periods with an interval of 1 issue.
Step 3, calculate the theme frequency: let the theme frequency comprise $v_{j,0}$ and $v_{j,p}$, where $v_{j,0}$ is the number of documents in the document subset $D_j$ of theme $l_j$ at the start time $t_{start}$, and $v_{j,p}$ is the number of documents in $D_j$ within time period $T_p$; calculate the theme frequency of each theme. Here, this means counting the documents of each theme in issue 1 and in each later issue, that is, the number of documents each theme contains at the start time and within each time period.
Step 4, sort the themes: sort all themes to obtain the sorted theme sequence table. In this embodiment, the themes are sorted with the traditional random arrangement. The choice of sorting method directly affects the appearance of the generated theme flow graph.
Step 5, generate the theme flow graph: from the sorted theme sequence table and the theme frequencies, generate the theme flow graph using the theme flow algorithm, thereby visualizing theme intensity. In this embodiment, the generated theme flow graph is shown by the colored bands in Fig. 2: the blue band is pattern recognition and artificial intelligence, the purple band is computer network and information security, the red band is operating systems, the yellow band is database technology, and the green band is system software and software engineering. The theme flow algorithm (from the paper "ThemeRiver: Visualizing thematic changes in large document collections") interpolates the weights of each theme at the discrete times (the interpolating function must satisfy the constraint that its derivative is zero at the extreme points) and then draws a stacked graph to generate the theme flow graph. In the theme flow graph, the horizontal axis represents time, the height of a band along the vertical axis represents theme intensity, and bands of different colors represent different themes. A band widening or narrowing over time shows how the intensity of its theme evolves.
Step 6, extract the keywords that represent theme content: let $W_{j,p}$ be the keyword subset that represents the content of theme $l_j$ in the documents of its document subset $D_j$ within time period $T_p$; use the Modern Chinese General Word Segmentation System to extract, from the documents of each time period in the document subset of each theme, the keyword subset that represents that theme's content. This embodiment uses the bag-of-words text analysis technique to extract the keyword subsets. Specifically: the Modern Chinese General Word Segmentation System developed by the Institute of Language Information Processing of Beijing Language and Culture University is used to segment all documents of each time period in the document subset of each theme; stop words such as modal particles, adverbs, prepositions, and conjunctions are removed; and a number of keyword subsets are finally obtained.
Step 7, calculate and sort keyword weights: the weight of a keyword is the number of times it occurs in a keyword subset; calculate the weight of each keyword in each keyword subset, and sort the keywords by weight in descending order.
Step 8, generate the word cloud: from the keyword subsets and keyword weights, generate the word cloud on the theme flow graph with the TIARA algorithm, thereby visualizing theme content.
The result of visualizing the themes of the Chinese document set with the method of this embodiment is shown in Fig. 2.
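Steps 6 and 7 of this embodiment amount to counting the segmented words that survive stop-word removal. A minimal sketch follows, assuming the documents have already been segmented into tokens (the segmentation system named above is not reproduced here); the function name `keyword_weights` and the sample stop-word set are hypothetical.

```python
from collections import Counter

def keyword_weights(tokens, stop_words):
    """Return [(keyword, weight), ...] sorted by descending weight.

    tokens:     words of all documents of one theme in one time period,
                as produced by a word segmenter (assumed to be given).
    stop_words: set of words to drop, e.g. particles and conjunctions.
    """
    counts = Counter(t for t in tokens if t not in stop_words)
    # most_common() sorts by count, largest first.
    return counts.most_common()

# Example with made-up tokens for one keyword subset W_{j,p}.
tokens = ["数据库", "的", "索引", "数据库", "和", "索引", "数据库"]
weights = keyword_weights(tokens, stop_words={"的", "和"})
```

The sorted list can be fed directly to a layout routine that places the heaviest keywords first.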
Embodiment 2: In the above theme visualization method for Chinese document sets, the themes are arranged in random order. When the theme stream is generated, if the intensity of a theme changes too sharply, the shapes of the adjacent themes are distorted along with it, which makes the result unattractive and makes the relative intensity of the themes hard to judge. The distorted themes also interfere with the placement of the word cloud. Moreover, among all themes of a document set, users usually care most about the content of the theme with the greatest intensity. The present invention therefore further improves the theme sorting step of Embodiment 1 by designing a sorting method based on theme frequency and geometric complementarity. The sorting method is described in detail below with reference to Fig. 3:
Step 1: Let the initial time of theme $l_j$ be $OT_j$. When $v_{j,0}$ is not zero, take the start time $t_{start}$ of the document set as $OT_j$; when $v_{j,0}$ is zero, take as $OT_j$ the smallest left endpoint among the time periods $T_p$ in which $v_{j,p}$ is nonzero. Calculate the initial time of each theme;
Step 2: Let the frequency sum of theme $l_j$ be the sum of its theme frequencies, $\sum_{p=0}^{m-1} v_{j,p}$; calculate the frequency sum of each theme;
Step 3: Create an empty list B. If n is even, write the theme with the largest frequency sum into the first row of the list as the upper endpoint theme $l_{up}$, and write the theme with the second largest frequency sum into the second row as the lower endpoint theme $l_{down}$. If n is odd, write the theme with the largest frequency sum into the first row of the list as both the upper endpoint theme $l_{up}$ and the lower endpoint theme $l_{down}$;
Step 4: Select a theme $l_i$ that is not in list B, and calculate the mean of the summed frequencies of $l_{up}$ and $l_i$:
$$\overline{V(l_{up}+l_i)}=\frac{1}{m}\sum_{p=0}^{m-1}\left(v_{up,p}+v_{i,p}\right) \quad (1);$$
Calculate the geometric complementarity of $l_{up}$ and $l_i$, expressed by the variance $\sigma_{up,i}$:
$$\sigma_{up,i}=\frac{1}{m}\sum_{p=0}^{m-1}\left(\left(v_{up,p}+v_{i,p}\right)-\overline{V(l_{up}+l_i)}\right)^2 \quad (2);$$
After normalizing $\sigma_{up,i}$ and $OT_i$, calculate the weighted value $D_i$:
$$D_i=s\,OT_i+(1-s)\,\sigma_{up,i} \quad (3);$$
where s is a control parameter, 0≤s≤1;
Step 5: Repeat step 4 until the weighted value of every theme not in list B has been obtained;
Step 6: Select the theme with the smallest weighted value $D_i$ and insert it above the upper endpoint theme of list B, as the new upper endpoint theme $l_{up}$;
Step 7: Select a theme $l_k$ that is not in list B, and calculate the mean of the summed frequencies of $l_{down}$ and $l_k$:
$$\overline{V(l_{down}+l_k)}=\frac{1}{m}\sum_{p=0}^{m-1}\left(v_{down,p}+v_{k,p}\right) \quad (4);$$
Calculate the geometric complementarity of $l_{down}$ and $l_k$, expressed by the variance $\sigma_{down,k}$:
$$\sigma_{down,k}=\frac{1}{m}\sum_{p=0}^{m-1}\left(\left(v_{down,p}+v_{k,p}\right)-\overline{V(l_{down}+l_k)}\right)^2 \quad (5);$$
After normalizing $\sigma_{down,k}$ and $OT_k$, calculate the weighted value $D_k$:
$$D_k=s\,OT_k+(1-s)\,\sigma_{down,k} \quad (6);$$
where s is a control parameter, 0≤s≤1;
Step 8: Repeat step 7 until the weighted value of every theme not in list B has been obtained;
Step 9: Select the theme with the smallest weighted value and insert it below the lower endpoint theme of list B, as the new lower endpoint theme $l_{down}$;
Step 10: Repeat steps 4 to 9 until all themes have been added to list B.
In this embodiment, the value of the control parameter s is 0.3.
After the themes are sorted with the method of this embodiment, the generated theme flow graph is as shown in Fig. 4: it is more attractive and flatter, uses space better, and is better suited to word cloud placement.
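As a hand-worked illustration of formulas (1) and (2), using made-up frequencies that are not taken from the patent's data, suppose m = 4, the upper endpoint theme has $v_{up}=(6,2,6,2)$, and there are two candidate themes with $v_a=(1,4,1,4)$ and $v_b=(3,1,3,1)$:

$$\overline{V(l_{up}+l_a)}=\tfrac{1}{4}(7+6+7+6)=6.5,\qquad \sigma_{up,a}=\tfrac{1}{4}\left(0.5^2+0.5^2+0.5^2+0.5^2\right)=0.25$$

$$\overline{V(l_{up}+l_b)}=\tfrac{1}{4}(9+3+9+3)=6,\qquad \sigma_{up,b}=\tfrac{1}{4}\left(3^2+3^2+3^2+3^2\right)=9$$

Theme a rises where the upper endpoint theme falls, so their combined band is nearly flat and the variance is small. Theme b peaks together with the endpoint theme, so the variance is large. If the two candidates have equal initial times, formula (3) therefore gives a the smaller weighted value, and a is inserted next to $l_{up}$.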
Embodiment 3: for TIARA technology Chinese word cloud shape, the unsettled problem of layout, the present invention also improves word cloud, first theme is divided into several subregions, then adopting scalable algorithm (drawing mono-literary composition from < < Tag Cloud++-Scalable Tag Clouds for Arbitrary Layouts > >) is one group of horizontal line section collection by this region representation, place successively again keyword, generating character cloud.Visual signature is as follows: 1) weight of keyword is larger, and font is larger; 2) keyword that weight is larger is the closer to this regional center.Below in conjunction with Fig. 5, Fig. 6, be elaborated:
Step 1: theme l on choosing a topic flow graph jcorresponding region G j, its start time and end time equal respectively the start time t of document sets startwith end time t end, by region G jtime period [t start, t end] being divided into m-1 section, the length of each time period is
Figure BDA0000397461710000093
obtain decile time point t start+p Δ t, wherein, p=1,2 ..., m-2;
Step 2: as shown in Figure 5, successively with decile time point t startcentered by+p Δ t, according to Δ t at region G jupper intercepting subregion R j,p; R j,pby a set
Figure BDA0000397461710000094
and curve and line segment
Figure BDA0000397461710000096
the closed space forming;
Step 3: at each subregion R j,pupper placement keyword subset W j,pin keyword, generate theme l jword cloud; Comprise
3.1 use line segment successively
Figure BDA0000397461710000097
the each point of Pointcut N, obtains subregion R j,papproximate polygon;
3.2 establish subregion R j,peach summit of approximate polygon in the y coordinate figure on that summit of y coordinate maximum be y max; If the y coordinate figure on that summit of y coordinate minimum is y in each summit of polygon min;
3.3 As shown in Fig. 6, intersect the set of horizontal lines $H = \{y = c \mid y_{min} \le c \le y_{max},\ c \in \mathbb{Z}\}$ with subregion $R_{j,p}$ to obtain a number of intersection segments. Take the sub-segments of each intersection segment that lie inside the polygon, denoted $r_c(i) = \overline{s_c(i)e_c(i)}$, $0 < i \le M$, where $M$ is the number of sub-segments of that intersection segment lying inside $R_{j,p}$. $R_{j,p}$ is then represented as the set of horizontal line segments $L_{j,p} = \{r_c(i) = \overline{s_c(i)e_c(i)},\ y_{min} \le c \le y_{max},\ 0 < i \le M\}$;
3.4 Select the keywords of $W_{j,p}$ one at a time in descending order of weight. For layout, represent the keyword by a rectangle of height $h$ and width $w$, then place the keyword at the position found, comprising:
A. Check whether a rectangle of width $w$ and height $h$ can be placed at position $c = (y_{max} - y_{min})/2$ in the set $L_{j,p}$ corresponding to $R_{j,p}$. The check is: among the $r_c(i)$ of the rows from $c$ to $c-h$, determine whether one and the same index $i$ exists in every row such that the length of the segment $\overline{s_c(i)e_c(i)}$ is greater than $w$. If so, place the keyword at position $(c, s_c(i))$ and update $s_c(i)$ to $s_c(i) = s_c(i) + w$; if not, go to step B;
B. Starting from $c = (y_{max} - y_{min})/2$ as the center, traverse $L_{j,p}$ by alternately setting $c = c+1$ and $c = c-1$, checking whether the rectangle can be placed at position $c$. If so, place the keyword at position $(c, s_c(i))$ and update $s_c(i)$ to $s_c(i) = s_c(i) + w$. If not, continue alternating $c = c+1$ and $c = c-1$ through $L_{j,p}$ until an $r_c(i)$ satisfying the condition is found or all $r_c(i)$ have been traversed. If no position $c$ satisfying the condition has been found after all $r_c(i)$ have been traversed, discard the keyword;
3.5 Repeat step 3.4 until all keywords in $W_{j,p}$ have been placed;
Step 4: Repeat steps 1 to 3 until the word cloud of every theme on the theme flow graph has been generated.
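Steps 3.1 to 3.3 can be illustrated with a small sketch. The Python below is not taken from the patent or from the Tag Cloud++ paper; it is a minimal scanline routine, assuming the approximate polygon of a subregion is given as a list of (x, y) vertices. The function name `scanline_segments` and the half-open crossing rule (an edge is counted on row c only when its lower end is at or below c and its upper end is strictly above c) are illustrative choices. Under that rule a row that only touches the topmost vertices yields no segment.

```python
import math


def scanline_segments(polygon):
    """Map each integer row c to the list of (s, e) x-intervals inside the polygon.

    polygon: list of (x, y) vertices in order (hypothetical input format).
    """
    ys = [y for _, y in polygon]
    y_min, y_max = math.ceil(min(ys)), math.floor(max(ys))
    rows = {}
    n = len(polygon)
    for c in range(y_min, y_max + 1):
        crossings = []
        for k in range(n):
            (x1, y1), (x2, y2) = polygon[k], polygon[(k + 1) % n]
            if y1 == y2:
                continue  # horizontal edges never cross a horizontal line
            if y1 > y2:
                x1, y1, x2, y2 = x2, y2, x1, y1  # orient the edge upward
            if y1 <= c < y2:  # half-open rule avoids counting a vertex twice
                crossings.append(x1 + (c - y1) * (x2 - x1) / (y2 - y1))
        crossings.sort()
        # Consecutive pairs of crossings bound the parts inside the polygon.
        rows[c] = [(crossings[i], crossings[i + 1])
                   for i in range(0, len(crossings) - 1, 2)]
    return rows
```

For a U-shaped polygon, a row passing through the notch comes back as two intervals, which corresponds to the case M = 2 in step 3.3.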
Fig. 7 shows the result of generating word clouds with this method; the theme flow graph was generated with the themes in random order. A comparison with Fig. 2 shows that the word cloud placement algorithm of the invention has the following advantages: 1) Space is used effectively: for the same region size and font size, more keywords can be placed. 2) The generated layout is stable and does not change under subsequent interactive operations. 3) The algorithm is markedly more efficient. It represents an irregular region, by a fixed rule, as a collection of discrete entities; placing a word only requires traversing these entities to find one that satisfies the placement condition, with no collision detection or boundary detection, which greatly improves placement efficiency.
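The placement search that makes collision detection unnecessary (step 3.4) can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the patented implementation: `rows` is a hypothetical dictionary mapping each integer row c to a list of mutable `[s, e]` intervals; the search starts at the midpoint row `(y_min + y_max) // 2`, which coincides with the $(y_{max} - y_{min})/2$ of the text when $y_{min} = 0$; and, as a conservative choice not spelled out in the text, the rectangle is anchored at the rightmost start among the covered rows and every covered row is updated.

```python
def center_out(center, lo, hi):
    """Yield center, center+1, center-1, center+2, ... within [lo, hi]."""
    yield center
    for d in range(1, max(center - lo, hi - center) + 1):
        if center + d <= hi:
            yield center + d
        if center - d >= lo:
            yield center - d


def try_place(rows, c, w, h):
    """Try to fit a w-by-h rectangle covering rows c-h .. c; return x or None."""
    band = [rows.get(y) for y in range(c - h, c + 1)]
    if any(not r for r in band):  # a covered row is missing or has no interval
        return None
    for i in range(min(len(r) for r in band)):  # same index i in every row
        x = max(r[i][0] for r in band)
        right = min(r[i][1] for r in band)
        if right - x > w:
            for r in band:
                r[i][0] = x + w  # consume the used part of each interval
            return x
    return None


def place_keyword(rows, w, h):
    """Return (c, x) for the first row, searched center-out, that fits; else None."""
    y_min, y_max = min(rows), max(rows)
    for c in center_out((y_min + y_max) // 2, y_min, y_max):
        x = try_place(rows, c, w, h)
        if x is not None:
            return (c, x)
    return None  # the keyword is discarded


# Usage: a 10-wide region with rows 0..8; keywords arrive in descending weight.
rows = {c: [[0.0, 10.0]] for c in range(9)}
first = place_keyword(rows, 6, 2)
second = place_keyword(rows, 6, 2)
```

Because heavier keywords are placed first and the search starts at the middle row, they end up nearest the center of the region, matching visual feature 2) of this embodiment.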
Embodiment 4: This embodiment combines the theme sorting method of Embodiment 2 with the improved word cloud placement method of Embodiment 3; the other steps are unchanged. Fig. 8 shows the result of visualizing the themes of a Chinese document set with this embodiment.
Embodiment 5: In TIARA, the limited size of a region makes it difficult to place all keywords in that region. The invention therefore uses a detailed word cloud to further visualize the full content of each theme, or the full content of each theme in each time period. The background color of the word cloud corresponds to the color of the theme in the theme flow graph, and the size of a keyword corresponds to its weight. The invention generates the detailed word cloud with a random greedy algorithm (taken from the paper "TIARA: A Visual Exploratory Text Analytic System"), specifically:
Step 1: Select the keyword set that expresses the content of theme $l_j$;
Step 2: Set up a circular region C and discretize the boundary of C into a set of conflict points P;
Step 3: Select one keyword from the keyword set in descending order of weight, and use the random greedy algorithm to generate a candidate position coordinate (word.x, word.y) for it inside region C;
Step 4: Set the font size according to the weight of the keyword; then, from the font size and the number of characters in the keyword, approximate the keyword by a rectangle r whose lower-left corner is at the position coordinate;
Step 5: For each conflict point in P, check whether the point conflicts with r. If there is a conflict, go to step 6; if there is none, go to step 7;
Step 6: Update the position coordinate along a spiral path and repeat steps 4 and 5 until a position coordinate satisfying the condition is found or the spiral radius exceeds 100; when the spiral radius exceeds 100, the keyword is discarded;
Step 7: Place the keyword at position coordinate (word.x, word.y), discretize the region occupied by the keyword into conflict points, and add them to the conflict point set P;
Step 8: Repeat steps 3 to 7 until all keywords in the keyword set have been placed.
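Steps 2 to 8 can be sketched in Python as below. This is a minimal illustration, not the TIARA implementation or the patented code. Several details are assumptions made only for the sketch: the font-size rule (10 to 30 units, linear in weight), square glyphs (rectangle width = font size × character count), candidate positions drawn uniformly from the inner half of the circle, an Archimedean spiral with radius 0.5·t, and a 5-unit grid for discretizing placed rectangles. All function names are hypothetical. The spiral cutoff of 100 is the value given in step 6.

```python
import math
import random


def circle_boundary(radius, n=360):
    """Step 2: discretize the boundary of circle C into n conflict points."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]


def rect_points(x, y, w, h, step=5.0):
    """Step 7: discretize a placed rectangle into grid points, edges included."""
    xs = [min(x + k * step, x + w) for k in range(math.ceil(w / step) + 1)]
    ys = [min(y + k * step, y + h) for k in range(math.ceil(h / step) + 1)]
    return [(px, py) for px in xs for py in ys]


def conflicts(points, x, y, w, h):
    """Step 5: does any conflict point fall inside the rectangle at (x, y)?"""
    return any(x <= px <= x + w and y <= py <= y + h for px, py in points)


def detailed_cloud(keywords, radius=200.0, max_spiral=100.0, seed=0):
    """keywords: list of (word, weight). Returns (placed, rejected)."""
    rng = random.Random(seed)
    points = circle_boundary(radius)
    top = max(weight for _, weight in keywords)
    placed, rejected = [], []
    # Step 3: take keywords in descending order of weight.
    for word, weight in sorted(keywords, key=lambda kw: -kw[1]):
        size = 10.0 + 20.0 * weight / top   # assumed font-size rule
        w, h = size * len(word), size       # Step 4: bounding rectangle
        angle = rng.uniform(0.0, 2.0 * math.pi)
        dist = rng.uniform(0.0, radius / 2.0)
        cx, cy = dist * math.cos(angle), dist * math.sin(angle)
        t = 0.0
        while 0.5 * t <= max_spiral:        # Step 6: walk outward on a spiral
            r = 0.5 * t
            x, y = cx + r * math.cos(t), cy + r * math.sin(t)
            if not conflicts(points, x, y, w, h):
                placed.append((word, x, y, w, h))
                points.extend(rect_points(x, y, w, h))
                break
            t += 0.2
        else:
            rejected.append(word)           # spiral radius exceeded the cutoff
    return placed, rejected
```

Representing both the circle boundary and the placed words as conflict points keeps the check in step 5 to a single point-in-rectangle test. The grid step must stay no larger than the smallest rectangle side, otherwise a small word could slip between grid points of an earlier one.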
When generating the detailed word cloud, any single keyword subset $W_{j,p}$ of theme $l_j$ may be selected as the keyword set expressing the content of the theme; alternatively, the keyword set formed by merging all keyword subsets $W_{j,p}$ of theme $l_j$ may be used.
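The merged option can be sketched with Python's standard `collections.Counter`. The sketch assumes, purely for illustration, that each keyword subset is available as a list of keyword occurrences, so that a keyword's weight is its occurrence count; the function name `merged_keyword_set` and the sample words are hypothetical.

```python
from collections import Counter


def merged_keyword_set(subsets):
    """Merge the keyword subsets of one theme and weight each keyword by its
    total number of occurrences across all subsets, heaviest first."""
    total = Counter()
    for subset in subsets:
        total.update(subset)
    return total.most_common()


# Usage: two time periods of a hypothetical "operating system" theme.
periods = [["kernel", "kernel", "process"], ["process", "file", "kernel"]]
weights = merged_keyword_set(periods)
```

Passing a list with a single subset gives the first option described above, the keyword set of one time period.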
Fig. 9 shows the effect of adding the detailed word cloud to the theme visualization method for a Chinese document set. The detailed word cloud is placed at the lower right of the figure; clicking the color band of a theme on the theme flow graph switches the display to the detailed word cloud of that theme. The figure shows the detailed word cloud of the operating system theme. As can be seen, because of the limited region size, the word cloud of the operating system theme on the theme flow graph does not hold all of the keywords, whereas the detailed word cloud shows all keywords of the theme.

Claims (7)

1. A theme visualization method for a Chinese document set, characterized by comprising:
A step of classifying the document set by theme: let the document set have n themes $l_j$, $j = 0, 1, 2, \ldots, n-1$; classify all documents in the document set by theme to obtain n document subsets $D_j$, $j = 0, 1, 2, \ldots, n-1$, where the document subset corresponding to theme $l_j$ is $D_j$;
A step of dividing the time span of the document set: let the start time of the document set be $t_{start}$ and the end time be $t_{end}$; divide the time span $[t_{start}, t_{end}]$ equally to obtain the time periods $T_p = (t_{start} + (p-1)\Delta t,\ t_{start} + p\Delta t]$, where $p = 1, 2, \ldots, m-1$ and $\Delta t = \frac{t_{end} - t_{start}}{m-1}$;
A step of calculating the theme frequencies: let the theme frequencies comprise $v_{j,0}$ and $v_{j,p}$, where $v_{j,0}$ is the number of documents of the document subset $D_j$ corresponding to theme $l_j$ at the start time $t_{start}$, and $v_{j,p}$ is the number of documents of $D_j$ within time period $T_p$; calculate the theme frequencies of each theme;
A step of sorting the themes: sort all themes to obtain the sorted theme sequence table;
A step of generating the theme flow graph: generate the theme flow graph with a theme flow algorithm from the sorted theme sequence table and the theme frequencies;
A step of extracting the keywords that represent theme content: let $W_{j,p}$ be the keyword subset representing the content of theme $l_j$ in the documents of the corresponding document subset $D_j$ within time period $T_p$; using a general-purpose Modern Chinese word segmentation system, extract from the documents of each time period in the document subset of each theme the keyword subset representing the content of that theme;
A step of calculating and sorting keyword weights: let the weight of a keyword be the number of times the keyword occurs in a keyword subset; calculate the weight of each keyword in each keyword subset, and sort all keywords in each keyword subset in descending order of weight;
A step of generating the word cloud: generate the word cloud on the theme flow graph from the keyword subsets and the keyword weights.
2. The theme visualization method for a Chinese document set as claimed in claim 1, characterized in that the step of sorting the themes is carried out by a sorting method based on theme frequency and geometric complementarity, comprising:
Step 1: Let the initial time of theme $l_j$ be $OT_j$. When $v_{j,0}$ is not zero, take the start time $t_{start}$ of the document set as $OT_j$; when $v_{j,0}$ is zero, take as $OT_j$ the smallest left endpoint among the time periods $T_p$ for which $v_{j,p}$ is non-zero. Calculate the initial time of each theme;
Step 2: Let the frequency sum of theme $l_j$ be [formula image FDA0000397461700000012]; calculate the frequency sum of each theme;
Step 3: Create a new empty list B. If n is even, write the theme with the largest frequency sum into the first row of the list as the upper endpoint theme $l_{up}$, and write the theme with the second-largest frequency sum into the second row as the lower endpoint theme $l_{down}$. If n is odd, write the theme with the largest frequency sum into the first row of the list, where it serves as both the upper endpoint theme $l_{up}$ and the lower endpoint theme $l_{down}$;
Step 4: Select a theme $l_i$ not in list B and calculate the mean $\overline{V(l_{up} + l_i)}$ of the frequency sum of $l_{up}$ and $l_i$:
$\overline{V(l_{up} + l_i)} = \frac{1}{m}\sum_{p=0}^{m-1}(v_{up,p} + v_{i,p})$ (1);
Calculate the geometric complementarity of $l_{up}$ and $l_i$, expressed as the variance $\sigma_{up,i}$:
$\sigma_{up,i} = \frac{1}{m}\sum_{p=0}^{m-1}\left((v_{up,p} + v_{i,p}) - \overline{V(l_{up} + l_i)}\right)^2$ (2);
After normalizing $\sigma_{up,i}$ and $OT_i$, calculate the weighted value $D_i$:
$D_i = s\,OT_i + (1-s)\,\sigma_{up,i}$ (3);
where s is a control parameter, $0 \le s \le 1$;
Step 5: Repeat step 4 until the weighted value of every theme not in list B has been obtained;
Step 6: Select the theme with the smallest weighted value $D_i$ and insert it above the upper endpoint theme of list B as the new upper endpoint theme $l_{up}$;
Step 7: Select a theme $l_k$ not in list B and calculate the mean $\overline{V(l_{down} + l_k)}$ of the frequency sum of $l_{down}$ and $l_k$:
$\overline{V(l_{down} + l_k)} = \frac{1}{m}\sum_{p=0}^{m-1}(v_{down,p} + v_{k,p})$ (4);
Calculate the geometric complementarity of $l_{down}$ and $l_k$, expressed as the variance $\sigma_{down,k}$:
$\sigma_{down,k} = \frac{1}{m}\sum_{p=0}^{m-1}\left((v_{down,p} + v_{k,p}) - \overline{V(l_{down} + l_k)}\right)^2$ (5);
After normalizing $\sigma_{down,k}$ and $OT_k$, calculate the weighted value $D_k$:
$D_k = s\,OT_k + (1-s)\,\sigma_{down,k}$ (6);
where s is a control parameter, $0 \le s \le 1$;
Step 8: Repeat step 7 until the weighted value of every theme not in list B has been obtained;
Step 9: Select the theme with the smallest weighted value and insert it below the lower endpoint theme of list B as the new lower endpoint theme $l_{down}$;
Step 10: Repeat steps 4 to 9 until all themes have been added to list B.
3. The theme visualization method for a Chinese document set as claimed in claim 2, characterized in that the control parameter $s = 0.3$.
4. The theme visualization method for a Chinese document set as claimed in claim 1, characterized in that the step of generating the word cloud comprises:
Step 1: Select the region $G_j$ corresponding to theme $l_j$ on the theme flow graph; its start time and end time equal the start time $t_{start}$ and end time $t_{end}$ of the document set, respectively. Divide the time span $[t_{start}, t_{end}]$ of region $G_j$ into $m-1$ segments, each of length $\Delta t = \frac{t_{end} - t_{start}}{m-1}$, which gives the equal-division time points $t_{start} + p\Delta t$, where $p = 1, 2, \ldots, m-2$;
Step 2: Take each equal-division time point $t_{start} + p\Delta t$ in turn as the center and cut out a subregion $R_{j,p}$ from region $G_j$ according to $\Delta t$. $R_{j,p}$ is the closed space bounded by a point set [formula image FDA0000397461700000032], a curve [formula image FDA0000397461700000033], and a line segment [formula image FDA0000397461700000034];
Step 3: Place the keywords of keyword subset $W_{j,p}$ in each subregion $R_{j,p}$ to generate the word cloud of theme $l_j$, comprising:
3.1 Connect the points of point set $N$ in turn with line segments [formula images FDA0000397461700000035, FDA0000397461700000036] to obtain the approximate polygon of subregion $R_{j,p}$;
3.2 Let $y_{max}$ be the largest y coordinate among the vertices of the approximate polygon of subregion $R_{j,p}$, and let $y_{min}$ be the smallest y coordinate among those vertices;
3.3 Intersect the set of horizontal lines $H = \{y = c \mid y_{min} \le c \le y_{max},\ c \in \mathbb{Z}\}$ with subregion $R_{j,p}$ to obtain a number of intersection segments. Take the sub-segments of each intersection segment that lie inside the polygon, denoted $r_c(i) = \overline{s_c(i)e_c(i)}$, $0 < i \le M$, where $M$ is the number of sub-segments of that intersection segment lying inside $R_{j,p}$. $R_{j,p}$ is then represented as the set of horizontal line segments
$L_{j,p} = \{r_c(i) = \overline{s_c(i)e_c(i)},\ y_{min} \le c \le y_{max},\ 0 < i \le M\}$;
3.4 Select the keywords of $W_{j,p}$ one at a time in descending order of weight. For layout, represent the keyword by a rectangle of height $h$ and width $w$, then place the keyword at the position found, comprising:
A. Check whether a rectangle of width $w$ and height $h$ can be placed at position $c = (y_{max} - y_{min})/2$ in the set $L_{j,p}$ corresponding to $R_{j,p}$. The check is: among the $r_c(i)$ of the rows from $c$ to $c-h$, determine whether one and the same index $i$ exists in every row such that the length of the segment $\overline{s_c(i)e_c(i)}$ is greater than $w$. If so, place the keyword at position $(c, s_c(i))$ and update $s_c(i)$ to $s_c(i) = s_c(i) + w$; if not, go to step B;
B. Starting from $c = (y_{max} - y_{min})/2$ as the center, traverse $L_{j,p}$ by alternately setting $c = c+1$ and $c = c-1$, checking whether the rectangle can be placed at position $c$. If so, place the keyword at position $(c, s_c(i))$ and update $s_c(i)$ to $s_c(i) = s_c(i) + w$. If not, continue alternating $c = c+1$ and $c = c-1$ through $L_{j,p}$ until an $r_c(i)$ satisfying the condition is found or all $r_c(i)$ have been traversed. If no position $c$ satisfying the condition has been found after all $r_c(i)$ have been traversed, discard the keyword;
3.5 Repeat step 3.4 until all keywords in $W_{j,p}$ have been placed;
Step 4: Repeat steps 1 to 3 until the word cloud of every theme on the theme flow graph has been generated.
5. The theme visualization method for a Chinese document set as claimed in claim 1, characterized by further comprising a step of generating a detailed word cloud, comprising:
Step 1: Select the keyword set that expresses the content of theme $l_j$;
Step 2: Set up a circular region C and discretize the boundary of C into a set of conflict points P;
Step 3: Select one keyword from the keyword set in descending order of weight, and use a random greedy algorithm to generate a candidate position coordinate (word.x, word.y) for it inside region C;
Step 4: Set the font size according to the weight of the keyword; then, from the font size and the number of characters in the keyword, approximate the keyword by a rectangle r whose lower-left corner is at the position coordinate;
Step 5: For each conflict point in P, check whether the point conflicts with r. If there is a conflict, go to step 6; if there is none, go to step 7;
Step 6: Update the position coordinate along a spiral path and repeat steps 4 and 5 until a position coordinate satisfying the condition is found or the spiral radius exceeds 100; when the spiral radius exceeds 100, the keyword is discarded;
Step 7: Place the keyword at position coordinate (word.x, word.y), discretize the region occupied by the keyword into conflict points, and add them to the conflict point set P;
Step 8: Repeat steps 3 to 7 until all keywords in the keyword set have been placed.
6. The theme visualization method for a Chinese document set as claimed in claim 5, characterized in that the keyword set expressing the content of theme $l_j$ is any one keyword subset $W_{j,p}$ of theme $l_j$.
7. The theme visualization method for a Chinese document set as claimed in claim 5, characterized in that the keyword set expressing the content of theme $l_j$ is obtained by the following steps:
Step 1: Merge all keyword subsets $W_{j,p}$ of theme $l_j$, $p = 1, 2, \ldots, m-1$;
Step 2: Calculate the weight of every keyword in the merged set, the weight of a keyword being the number of times the keyword occurs across all keyword subsets.
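As a reading aid for the ordering procedure recited in claim 2, the following Python sketch walks through steps 1 to 10. It is an illustration under stated assumptions and not part of the claims: the frequency sum of a theme is taken to be the sum of its $v_{j,p}$ over all p (the formula image in step 2 is not reproduced in this text); normalization is assumed to be min-max scaling over the themes not yet in list B; initial times are measured in units of $\Delta t$ from $t_{start}$; and every theme is assumed to contain at least one document. The function name `order_themes` is hypothetical.

```python
def order_themes(freq, s=0.3):
    """freq[j][p] is the frequency v_{j,p} of theme j, p = 0..m-1.

    Returns theme indices from top to bottom of list B.
    """
    n, m = len(freq), len(freq[0])

    def onset(v):
        # Step 1: t_start if v_0 != 0, else the left endpoint of the first
        # period T_p with documents, i.e. (p - 1) in units of delta t.
        if v[0] != 0:
            return 0
        return next(p for p in range(1, m) if v[p] != 0) - 1

    ot = [onset(v) for v in freq]
    # Steps 2-3: seed list B with the theme(s) of largest frequency sum.
    by_sum = sorted(range(n), key=lambda j: -sum(freq[j]))
    order = by_sum[:2] if n % 2 == 0 else by_sum[:1]
    rest = [j for j in range(n) if j not in order]

    def variance(a, b):
        # Equations (1)-(2) / (4)-(5): variance of the combined frequencies.
        total = [x + y for x, y in zip(freq[a], freq[b])]
        mean = sum(total) / m
        return sum((t - mean) ** 2 for t in total) / m

    def normalize(xs):
        lo, hi = min(xs), max(xs)
        return [0.0 if hi == lo else (x - lo) / (hi - lo) for x in xs]

    def pick(end):
        # Equations (3) / (6): weighted value; the smallest one wins.
        sigma = normalize([variance(end, i) for i in rest])
        start = normalize([ot[i] for i in rest])
        scores = [s * o + (1 - s) * g for o, g in zip(start, sigma)]
        return rest.pop(scores.index(min(scores)))

    while rest:
        order.insert(0, pick(order[0]))      # steps 4-6: grow upward
        if rest:
            order.append(pick(order[-1]))    # steps 7-9: grow downward
    return order


# Usage: theme 2 complements theme 0, and theme 3 complements theme 1.
freq = [[8, 2, 8, 2], [3, 6, 3, 6], [1, 6, 1, 6], [5, 1, 5, 1]]
ordering = order_themes(freq)
```

In this example the theme whose peaks fill the valleys of the current upper endpoint theme is placed next to it, which is the intended effect of the geometric complementarity term.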
CN201310488312.7A 2013-10-17 2013-10-17 Subject visualization method for Chinese document set Active CN103631856B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310488312.7A CN103631856B (en) 2013-10-17 2013-10-17 Subject visualization method for Chinese document set

Publications (2)

Publication Number Publication Date
CN103631856A true CN103631856A (en) 2014-03-12
CN103631856B CN103631856B (en) 2017-01-11

Family

ID=50212898

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310488312.7A Active CN103631856B (en) 2013-10-17 2013-10-17 Subject visualization method for Chinese document set

Country Status (1)

Country Link
CN (1) CN103631856B (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105320683A (en) * 2014-07-24 2016-02-10 贾新志 Graphical display method of literature theme content analysis
CN105373579A (en) * 2015-08-18 2016-03-02 天津大学 Regression analysis-based news competitiveness analysis method and visualization device
CN105989090A (en) * 2015-02-12 2016-10-05 中兴通讯股份有限公司 Critical data processing method and device as well as critical data display method and system
CN106250512A (en) * 2016-08-04 2016-12-21 国家基础地理信息中心 A kind of subject network information collecting method taking time intention into account
CN106681983A (en) * 2016-11-25 2017-05-17 北京掌行通信息技术有限公司 Station name participle display method and device
CN106909381A (en) * 2017-02-24 2017-06-30 西南交通大学 A kind of interactive theme river method for visualizing
CN107622132A (en) * 2017-10-09 2018-01-23 四川大学 A kind of association analysis method for visualizing towards online Ask-Answer Community
CN109144504A (en) * 2017-06-26 2019-01-04 华东师范大学 Data visualization image generation method and storage medium based on D3
CN109783616A (en) * 2018-12-03 2019-05-21 广东蔚海数问大数据科技有限公司 A kind of text subject extracting method, system and storage medium
CN109933702A (en) * 2019-03-11 2019-06-25 智慧芽信息科技(苏州)有限公司 A kind of retrieval methods of exhibiting, device, equipment and storage medium
CN111737523A (en) * 2020-04-22 2020-10-02 聚好看科技股份有限公司 Video tag, search content generation method and server
WO2020244214A1 (en) * 2019-06-05 2020-12-10 山东大学 Method and device for generating shape word cloud

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101996234A (en) * 2009-08-17 2011-03-30 阿瓦雅公司 Word cloud audio navigation
US8402030B1 (en) * 2011-11-21 2013-03-19 Raytheon Company Textual document analysis using word cloud comparison
US20130267287A1 (en) * 2012-04-04 2013-10-10 David Goldenberg System and Method for Interactive Gameplay with Song Lyric Database

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TING LIANG ET.AL: "WordStream: Visualizing Theme Summarization and Comparison in Document Collections over Time", 《ADVANCES IN INFORMATION SCIENCES AND SERVICE SCIENCES(AISS)》 *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105320683A (en) * 2014-07-24 2016-02-10 贾新志 Graphical display method of literature theme content analysis
CN105989090A (en) * 2015-02-12 2016-10-05 中兴通讯股份有限公司 Critical data processing method and device as well as critical data display method and system
CN105373579B (en) * 2015-08-18 2018-08-03 天津大学 A kind of news competitiveness analysis method and its visualization device based on regression analysis
CN105373579A (en) * 2015-08-18 2016-03-02 天津大学 Regression analysis-based news competitiveness analysis method and visualization device
CN106250512A (en) * 2016-08-04 2016-12-21 国家基础地理信息中心 A kind of subject network information collecting method taking time intention into account
CN106250512B (en) * 2016-08-04 2019-07-26 国家基础地理信息中心 A kind of subject network information collecting method for taking time intention into account
CN106681983A (en) * 2016-11-25 2017-05-17 北京掌行通信息技术有限公司 Station name participle display method and device
CN106909381B (en) * 2017-02-24 2020-01-03 西南交通大学 Interactive theme river visualization method
CN106909381A (en) * 2017-02-24 2017-06-30 西南交通大学 A kind of interactive theme river method for visualizing
CN109144504A (en) * 2017-06-26 2019-01-04 华东师范大学 Data visualization image generation method and storage medium based on D3
CN107622132A (en) * 2017-10-09 2018-01-23 四川大学 A kind of association analysis method for visualizing towards online Ask-Answer Community
CN107622132B (en) * 2017-10-09 2020-07-03 四川大学 Online question-answer community oriented association analysis visualization method
CN109783616A (en) * 2018-12-03 2019-05-21 广东蔚海数问大数据科技有限公司 A kind of text subject extracting method, system and storage medium
CN109933702A (en) * 2019-03-11 2019-06-25 智慧芽信息科技(苏州)有限公司 A kind of retrieval methods of exhibiting, device, equipment and storage medium
WO2020244214A1 (en) * 2019-06-05 2020-12-10 山东大学 Method and device for generating shape word cloud
CN111737523A (en) * 2020-04-22 2020-10-02 聚好看科技股份有限公司 Video tag, search content generation method and server
CN111737523B (en) * 2020-04-22 2023-11-14 聚好看科技股份有限公司 Video tag, generation method of search content and server

Also Published As

Publication number Publication date
CN103631856B (en) 2017-01-11

Legal Events

Date Code Title Description
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant