CN103631856A - Subject visualization method for Chinese document set - Google Patents
- Publication number
- CN103631856A (application number CN201310488312.7)
- Authority
- CN
- China
- Prior art keywords
- theme
- keyword
- document
- subset
- frequency
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a theme (subject) visualization method for a Chinese document set. The method comprises classifying the document set by theme, dividing the time span of the document set into periods, calculating theme frequencies, sorting the themes, generating a theme flow graph, extracting keywords that express the content of each theme, calculating and ranking keyword weights, and generating a word cloud. The method further comprises a sorting method based on theme frequency and geometric complementarity, a word cloud layout method, and a method for generating a detailed word cloud. Its advantages are as follows: theme visualization of a Chinese document set is achieved; the theme flow graph produced with the sorting method based on theme frequency and geometric complementarity is more attractive and smoother, uses space more fully, and is better suited to word cloud placement; the word cloud layout method uses space effectively and greatly improves layout efficiency; and the detailed word cloud can show all keyword content of a theme.
Description
Technical field
The present invention relates to the field of text visualization and theme analysis, and specifically to a theme visualization method for a Chinese document set.
Background art
Large document collections, such as news, scientific literature, web pages, electronic publications, and bulletins, contain a great deal of information. As information digitization develops and spreads, document collections grow daily. Reading and understanding this vast body of information quickly, and extracting useful knowledge from it, has become an urgent problem.
A "theme" generally comprises a core event or activity together with all events and activities directly related to it. Topic detection methods use techniques such as clustering, classification, retrieval, and topic tracking to classify and organize a document set hierarchically by theme, so that users can retrieve, select, and browse it. After the documents are classified, however, users still have to spend a great deal of time reading every document under a theme in order to understand its main content, uncover latent knowledge, and obtain the information they need.
Multi-document automatic summarization builds on topic detection: it gathers the content of a theme, removes redundant information, and generates a comprehensive, concise text, which greatly improves the efficiency of information acquisition. Existing multi-document summaries, however, are usually complicated and hard for users to understand, the summary generation process is difficult to control, and friendly user interfaces and human-computer interaction are lacking. In addition, multi-document summarization often ignores attributes other than the text content, such as time and quantity. It therefore has difficulty showing how the themes and their content evolve over time within a document set, and it cannot reflect the relations among the themes of the same document set.
Text visualization is an important branch of information visualization. It exploits the inherent human ability to recognize, remember, and analyze graphics by converting text information into graphical images, helping people understand, read, and analyze the content and structure of text intuitively and efficiently, and, through corresponding interactive operations, helping them discover valuable knowledge and patterns.
The word cloud visualization technique abstracts text content into a set of words, uses font size to represent the frequency of each word, and then arranges the words compactly and attractively according to certain rules in order to convey the features of the text. A word cloud, however, can only visualize a single document. For multiple documents, ThemeRiver (theme flow) visualizes the themes in a document set and shows how the intensity of each theme changes over time. The original theme flow contains only theme intensity and time information, and the themes are arranged in random order. Later, Shixia Liu et al. proposed TIARA, an improved theme flow that embeds word clouds in the flow to further visualize the content of each theme, which helps users quickly analyze how the content of text themes changes over time.
None of the above text visualization techniques is general, and none is suited to Chinese documents; to date, a visualization technique for theme analysis of Chinese documents is still lacking in China. In addition, the TIARA technique, which visualizes themes only for English documents, has the following problems: 1) the shape and layout of the word clouds in the theme flow are unstable, which easily misleads users and impairs theme analysis; 2) because of region size limits, the generated word cloud cannot show all the key content of each theme.
Summary of the invention
The object of the present invention is to provide a theme visualization method for a Chinese document set. The method compiles statistics on and processes the information of each theme extracted from a Chinese document set, measures the intensity of each theme and the weight of its content, and then presents them graphically.
The technical solution that achieves this object is as follows. A theme visualization method for a Chinese document set comprises:
A step of classifying the document set by theme: let the document set have $n$ themes $l_j$, $j = 0, 1, 2, \dots, n-1$; classify all documents in the document set by theme to obtain $n$ document subsets $D_j$, $j = 0, 1, 2, \dots, n-1$, where $D_j$ is the document subset corresponding to theme $l_j$;
A step of dividing the time span of the document set: let the start time of the document set be $t_{start}$ and the end time be $t_{end}$; divide the interval $[t_{start}, t_{end}]$ into equal parts to obtain the time periods $T_p = (t_{start} + (p-1)\Delta t,\ t_{start} + p\Delta t]$, where $p = 1, 2, \dots, m-1$ and $\Delta t = (t_{end} - t_{start})/(m-1)$;
A step of calculating the theme frequency: let the theme frequency comprise $v_{j,0}$ and $v_{j,p}$, where $v_{j,0}$ is the number of documents in the document subset $D_j$ of theme $l_j$ at the start time $t_{start}$, and $v_{j,p}$ is the number of documents in $D_j$ within time period $T_p$; calculate the theme frequency of each theme;
A step of sorting the themes: sort all themes to obtain a sorted theme sequence list;
A step of generating the theme flow graph: generate the theme flow graph with the theme flow algorithm, based on the sorted theme sequence list and the theme frequencies;
A step of extracting keywords that represent the theme content: let $W_{j,p}$ be the subset of keywords that represent the content of theme $l_j$ in the documents of its document subset $D_j$ within time period $T_p$; using the Modern Chinese General Word Segmentation System, extract this keyword subset from the documents of each time period in the document subset of each theme;
A step of calculating and ranking keyword weights: let the weight of a keyword be the number of times it occurs in a keyword subset; calculate the weight of every keyword in each keyword subset, and sort all keywords in each keyword subset by weight in descending order;
A step of generating the word cloud: generate the word cloud on the theme flow graph from the keyword subsets and the keyword weights.
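The counting steps above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes times are plain numbers (for example issue numbers), that documents arrive as hypothetical `(theme index, time)` pairs, and that segmentation has already produced a list of tokens for each keyword subset. The function names are invented for the example.

```python
import math
from collections import Counter

def period_index(t, t_start, t_end, m):
    """Map a time t to 0 (exactly the start time) or to the period p = 1..m-1
    whose interval (t_start+(p-1)*dt, t_start+p*dt] contains t."""
    if t == t_start:
        return 0
    dt = (t_end - t_start) / (m - 1)   # equal division into m-1 periods
    p = math.ceil((t - t_start) / dt)
    return min(max(p, 1), m - 1)

def theme_frequency(docs, n, t_start, t_end, m):
    """docs: iterable of (theme index j, time t).
    Returns v with v[j][0] = v_{j,0} and v[j][p] = v_{j,p}."""
    v = [[0] * m for _ in range(n)]
    for j, t in docs:
        v[j][period_index(t, t_start, t_end, m)] += 1
    return v

def keyword_weights(tokens):
    """Weight of a keyword = its number of occurrences in the keyword subset;
    the result is sorted by weight in descending order."""
    return Counter(tokens).most_common()
```

For example, `theme_frequency([(0, 0), (0, 2), (1, 4)], 2, 0, 12, 5)` counts one document of theme 0 at the start time, one in $T_1$, and one document of theme 1 in $T_2$.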
In the above technical solution, the step of sorting the themes may use a sorting method based on theme frequency and geometric complementarity, comprising:
Step 3: create an empty list B. If $n$ is even, write the theme with the largest frequency sum into the first row of the list as the upper endpoint theme $l_{up}$, and write the theme with the second-largest frequency sum into the second row as the lower endpoint theme $l_{down}$. If $n$ is odd, write the theme with the largest frequency sum into the first row, where it serves as both the upper endpoint theme $l_{up}$ and the lower endpoint theme $l_{down}$;
Step 4: select a theme $l_i$ not in list B, calculate the mean of the frequency sums of $l_{up}$ and $l_i$ and their geometric complementarity, expressed as the variance $\sigma_{up,i}$, and calculate the weighted value $D_i$:
$D_i = s\,OT_i + (1-s)\,\sigma_{up,i}$ (3);
where $s$ is a control parameter, $0 \le s \le 1$;
Step 5: repeat step 4 until the weighted value of every theme not in list B has been obtained;
Step 6: select the theme with the smallest weighted value $D_i$ and insert it above the upper endpoint theme in list B, as the new upper endpoint theme $l_{up}$;
Step 7: for a theme $l_k$ not in list B, calculate the geometric complementarity of $l_{down}$ and $l_k$, expressed as the variance $\sigma_{down,k}$; after normalizing $\sigma_{down,k}$ and $OT_k$, calculate the weighted value $D_k$:
$D_k = s\,OT_k + (1-s)\,\sigma_{down,k}$ (6);
where $s$ is a control parameter, $0 \le s \le 1$;
Step 8: repeat step 7 until the weighted value of every theme not in list B has been obtained;
Step 9: select the theme with the smallest weighted value and insert it below the lower endpoint theme in list B, as the new lower endpoint theme $l_{down}$;
Step 10: repeat steps 4 to 9 until all themes have been added to list B.
In the present invention, the control parameter $s$ takes the value 0.3.
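The alternating insertion of steps 3 to 10 can be sketched as below. The formulas that define $OT$ and $\sigma$ are not reproduced in this text, so the two helper functions are assumptions made only for illustration: `mean_of_sums` takes $OT$ to be the mean of the two themes' frequency sums, and `sum_variance` takes the geometric complementarity to be the variance of the pointwise sum of the two frequency series. Min-max normalization is likewise an assumption. Only the weighted value $D = s\,OT + (1-s)\,\sigma$, the choice of the minimum, and $s = 0.3$ come from the description.

```python
from statistics import pvariance

def mean_of_sums(a, b):
    # Assumption: OT is the mean of the two themes' frequency sums.
    return (sum(a) + sum(b)) / 2

def sum_variance(a, b):
    # Assumption: complementarity is the variance of the pointwise sum.
    return pvariance([x + y for x, y in zip(a, b)])

def normalize(values):
    # Assumption: min-max normalization; constant input maps to zeros.
    lo, hi = min(values), max(values)
    return [0.0] * len(values) if hi == lo else [(v - lo) / (hi - lo) for v in values]

def pick_theme(candidates, endpoint, freqs, s):
    """Return the candidate with the smallest weighted value D."""
    ot = normalize([mean_of_sums(freqs[endpoint], freqs[c]) for c in candidates])
    sg = normalize([sum_variance(freqs[endpoint], freqs[c]) for c in candidates])
    d = [s * o + (1 - s) * g for o, g in zip(ot, sg)]
    return candidates[d.index(min(d))]

def order_themes(freqs, s=0.3):
    """freqs: {theme: [frequency per period]}. Returns list B, top to bottom."""
    ranked = sorted(freqs, key=lambda t: sum(freqs[t]), reverse=True)
    order = ranked[:2] if len(ranked) % 2 == 0 else ranked[:1]   # step 3
    remaining = [t for t in ranked if t not in order]
    while remaining:
        up = pick_theme(remaining, order[0], freqs, s)           # steps 4-6
        order.insert(0, up)
        remaining.remove(up)
        if remaining:
            down = pick_theme(remaining, order[-1], freqs, s)    # steps 7-9
            order.append(down)
            remaining.remove(down)
    return order
```

Swapping in the actual definitions of $OT$ and $\sigma$ only requires replacing the two helper functions; the greedy loop itself is unchanged.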
In the above technical solution, the word cloud is generated on the theme flow graph as follows:
Step 1: select the region $G_j$ corresponding to theme $l_j$ on the theme flow graph; its start time and end time equal the start time $t_{start}$ and end time $t_{end}$ of the document set, respectively. Divide the time span $[t_{start}, t_{end}]$ of region $G_j$ into $m-1$ periods, each of length $\Delta t = (t_{end} - t_{start})/(m-1)$, to obtain the equal-division time points $t_{start} + p\Delta t$, where $p = 1, 2, \dots, m-2$;
Step 2: taking each equal-division time point $t_{start} + p\Delta t$ in turn as the center, cut a subregion $R_{j,p}$ of width $\Delta t$ out of region $G_j$; $R_{j,p}$ is the closed region bounded by a point set $N$, a curve, and a line segment;
Step 3: place the keywords of keyword subset $W_{j,p}$ on each subregion $R_{j,p}$ to generate the word cloud of theme $l_j$, comprising:
3.1 connect the points of point set $N$ with line segments to obtain the approximating polygon of subregion $R_{j,p}$;
3.2 let $y_{max}$ be the y coordinate of the vertex with the largest y coordinate among the vertices of the approximating polygon of $R_{j,p}$, and let $y_{min}$ be the y coordinate of the vertex with the smallest y coordinate;
3.3 intersect the set of horizontal lines $H = \{y = c \mid y_{min} \le c \le y_{max},\ c \in \mathbb{Z}\}$ with subregion $R_{j,p}$ to obtain a number of intersection segments; take the sub-segments of each intersection segment that lie inside the polygon, denoted $r_c(i)$, where $M$ is the number of sub-segments of that intersection segment lying inside $R_{j,p}$; $R_{j,p}$ is thereby represented as a set of horizontal line segments $L_{j,p}$;
3.4 select keywords from $W_{j,p}$ one at a time in descending order of weight; represent the keyword by a rectangle of height $h$ and width $w$, lay out the rectangle, and then place the keyword at the layout position, comprising:
A. check whether a rectangle of width $w$ and height $h$ can be placed at position $c = (y_{max} - y_{min})/2$ in the set $L_{j,p}$ corresponding to $R_{j,p}$. The check is: among the segments $r_c(i)$ of the rows from $c$ to $c-h$, determine whether there is a common index $i$ for which the segment length is greater than $w$ in every row. If so, place the keyword at position $(c, s_c(i))$ and update $s_c(i)$ to $s_c(i) = s_c(i) + w$; if not, go to step B;
B. starting from $c = (y_{max} - y_{min})/2$, traverse $L_{j,p}$ by alternately setting $c = c + 1$ and $c = c - 1$, and check whether the rectangle can be placed at position $c$. If so, place the keyword at position $(c, s_c(i))$ and update $s_c(i)$ to $s_c(i) = s_c(i) + w$; if not, continue the alternating traversal until an $r_c(i)$ that satisfies the condition is found or all $r_c(i)$ have been traversed. If no position $c$ satisfies the condition after all $r_c(i)$ have been traversed, discard the keyword;
3.5 repeat step 3.4 until all keywords in $W_{j,p}$ have been processed;
Step 4: repeat steps 1 to 3 until the word cloud of every theme on the theme flow graph has been generated.
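The scanline placement of step 3.4 can be sketched as below. This is a simplified illustration under stated assumptions: the subregion is given directly as a hypothetical dictionary `rows` mapping each integer row $c$ to its free segments `[start, end]`, keywords arrive as `(word, w, h)` tuples already sorted by descending weight, and the center row is taken as the midpoint `(y_min + y_max) // 2`, which coincides with $(y_{max} - y_{min})/2$ when $y_{min} = 0$. As a conservative variation, the rectangle is aligned to the largest start value among the covered rows so that it stays inside every row.

```python
def row_order(y_min, y_max):
    """Yield rows starting at the center, then alternating c+1, c-1, c+2, ..."""
    c0 = (y_min + y_max) // 2
    yield c0
    for d in range(1, y_max - y_min + 1):
        if c0 + d <= y_max:
            yield c0 + d
        if c0 - d >= y_min:
            yield c0 - d

def try_place(rows, c, w, h):
    """Try to put a w-by-h rectangle with its top on row c (rows c-h .. c).
    Returns (x, c) on success and consumes the space, else None."""
    covered = range(c - h, c + 1)
    if any(y not in rows for y in covered):
        return None
    for i in range(len(rows[c])):
        if any(i >= len(rows[y]) for y in covered):
            continue                      # the same index i must exist in every row
        x = max(rows[y][i][0] for y in covered)
        if min(rows[y][i][1] for y in covered) - x >= w:
            for y in covered:
                rows[y][i][0] = x + w     # update s_c(i) for the covered rows
            return (x, c)
    return None

def layout(rows, keywords):
    """Place keywords in order; a keyword that fits nowhere is discarded."""
    y_min, y_max = min(rows), max(rows)
    placed = {}
    for word, w, h in keywords:
        for c in row_order(y_min, y_max):
            pos = try_place(rows, c, w, h)
            if pos is not None:
                placed[word] = pos
                break
    return placed
```

Because each placement only advances segment start values, no collision test against previously placed words is needed, which matches the efficiency argument made for this layout method.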
The invention also proposes a method for generating a detailed word cloud, comprising:
Step 1: select the keyword set that expresses the content of theme $l_j$;
Step 2: set up a circular region C and discretize the boundary of C into a set of conflict points P;
Step 3: select a keyword from the keyword set in descending order of weight, and use the random greedy algorithm to generate a candidate position (word.x, word.y) for it within region C;
Step 4: set the font size according to the weight of the keyword, then approximate the keyword by a rectangle r based on the font size and the number of characters in the keyword; the lower-left corner of rectangle r is set to the candidate position;
Step 5: for each conflict point in P, check whether the point conflicts with r; if there is a conflict, go to step 6; if there is no conflict, go to step 7;
Step 6: update the position along a spiral path and repeat steps 4 and 5 until a position that satisfies the condition is found or the spiral radius exceeds 100; when the spiral radius exceeds 100, the keyword is rejected;
Step 7: place the keyword at position (word.x, word.y), discretize the region occupied by the keyword into conflict points, and add them to the conflict point set P;
Step 8: repeat steps 3 to 7 until all keywords in the keyword set have been processed.
When generating the detailed word cloud, any one keyword subset $W_{j,p}$ of theme $l_j$ may be selected as the keyword set that expresses the content of the theme; alternatively, all keyword subsets $W_{j,p}$ of theme $l_j$ may be merged and the resulting set used as the keyword set that expresses the content of the theme.
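The spiral placement of steps 2 to 8 can be sketched as follows. Several details are assumptions made for illustration only: the candidate position is drawn uniformly near the center (the random greedy step is not specified further here), the font-size rule `4 + weight` and the spiral step sizes are invented, and an explicit corner check keeps each rectangle inside the circle in addition to the conflict-point test. The radius limit of 100 follows the description.

```python
import math
import random

def circle_points(radius, n=720):
    """Discretize the boundary of the circular region into conflict points."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]

def rect_points(rect, step=1.0):
    """Discretize an occupied rectangle (x, y, w, h) into conflict points."""
    x, y, w, h = rect
    xs = [x + i * step for i in range(int(w // step) + 1)] + [x + w]
    ys = [y + i * step for i in range(int(h // step) + 1)] + [y + h]
    return [(px, py) for px in xs for py in ys]

def conflicts(rect, points, radius):
    x, y, w, h = rect
    corners = [(x, y), (x + w, y), (x, y + h), (x + w, y + h)]
    if any(math.hypot(cx, cy) > radius for cx, cy in corners):
        return True   # safeguard (assumption): keep the word inside the circle
    return any(x <= px <= x + w and y <= py <= y + h for px, py in points)

def place_words(words, radius=100.0, max_spiral=100.0, seed=0):
    """words: list of (text, weight). Returns [(text, (x, y, w, h)), ...]."""
    rng = random.Random(seed)
    points = circle_points(radius)
    placed = []
    for text, weight in sorted(words, key=lambda kw: -kw[1]):
        size = 4 + weight                       # hypothetical font-size rule
        w, h = size * len(text), size
        x0 = rng.uniform(-radius / 4, radius / 4)
        y0 = rng.uniform(-radius / 4, radius / 4)
        angle = 0.0
        while 0.5 * angle <= max_spiral:        # spiral radius grows with the angle
            r = 0.5 * angle
            rect = (x0 + r * math.cos(angle), y0 + r * math.sin(angle), w, h)
            if not conflicts(rect, points, radius):
                placed.append((text, rect))
                points.extend(rect_points(rect))
                break
            angle += 0.1
        # if the loop ends without a break, the keyword is rejected
    return placed
```

Point-based conflict testing is approximate: a very thin overlap between two rectangles can fall between sampled points, so a finer `step` trades speed for accuracy.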
Compared with the prior art, the present invention has the following technical effects: 1. Theme visualization of a Chinese document set is achieved. 2. After the themes are sorted with the sorting method based on theme frequency and geometric complementarity, the generated theme flow graph is more attractive and smoother, uses space more fully, and is better suited to word cloud placement. 3. The word cloud layout method of the invention uses space effectively: for the same region size and font size, more keywords can be placed. The generated layout is stable and does not change with subsequent interactive operations. The efficiency of the algorithm is also clearly improved: the algorithm represents an irregular region as discrete entities according to certain rules, and placing a word only requires traversing the entities to find one that satisfies the placement condition, with no collision detection or boundary detection, so positioning efficiency is greatly improved. 4. The generated detailed word cloud can show all keyword content of a theme.
Brief description of the drawings
Fig. 1 is a flow chart of the present invention.
Fig. 2 shows, for the first embodiment of the invention, the result of visualizing theme content when the theme flow graph is generated after random sorting of the themes and the word clouds are laid out with the placement algorithm of the TIARA technique.
Fig. 3 is a flow chart of the sorting method based on theme frequency and geometric complementarity in the second embodiment of the invention.
Fig. 4 shows the theme flow graph generated in the second embodiment of the invention.
Fig. 5 is a schematic diagram of cutting out a subregion in the third embodiment of the invention.
Fig. 6 is a schematic diagram of approximating a subregion by a set of horizontal line segments in the third embodiment of the invention.
Fig. 7 shows the result of placing keywords on the theme flow graph in the third embodiment of the invention, where the theme flow graph is generated after random sorting of the themes.
Fig. 8 shows the result of visualizing the themes of a Chinese document set in the fourth embodiment of the invention, where the theme flow graph is generated after sorting with the method based on theme frequency and geometric complementarity and the word clouds are generated by the improved placement algorithm.
Fig. 9 shows the result of visualizing the themes of a Chinese document set with the detailed word cloud added, in the fifth embodiment of the invention.
Detailed description of the embodiments
Embodiment 1: the theme visualization method for Chinese documents is described below with reference to Fig. 1, taking data from the journal "Journal of Software" as an example.
Step 3, calculate the theme frequency: let the theme frequency comprise $v_{j,0}$ and $v_{j,p}$, where $v_{j,0}$ is the number of documents in the document subset $D_j$ of theme $l_j$ at the start time $t_{start}$, and $v_{j,p}$ is the number of documents in $D_j$ within time period $T_p$; calculate the theme frequency of each theme. Here, this means calculating how often documents of each theme appear in the first issue and in each subsequent issue, that is, the number of documents each theme contains at the start time of the document set and within each time period.
Step 4, sort the themes: sort all themes to obtain a sorted theme sequence list. In this embodiment, the themes are sorted with the traditional random arrangement method. The choice of sorting method directly affects the appearance of the generated theme flow graph.
Step 5, generate the theme flow graph: generate the theme flow graph with the theme flow algorithm, based on the sorted theme sequence list and the theme frequencies, in order to visualize theme intensity. In this embodiment, the generated theme flow graph is shown as the colored bands in Fig. 2: the blue band is pattern recognition and artificial intelligence, the purple band is computer networks and information security, the red band is operating systems, the yellow band is database technology, and the green band is system software and software engineering. The theme flow algorithm (from the paper "ThemeRiver: Visualizing thematic changes in large document collections") interpolates the weights of each theme at the discrete time points (the interpolating function must satisfy the constraint that its derivative is zero at the extreme points) and then draws a stacked graph to produce the theme flow graph. In the theme flow graph, the horizontal axis represents time, the vertical extent of a band represents theme intensity, and bands of different colors represent different themes. A band widening or narrowing over time represents the evolution of the theme's intensity.
Step 6, extract the keywords that represent the theme content: let $W_{j,p}$ be the subset of keywords that represent the content of theme $l_j$ in the documents of its document subset $D_j$ within time period $T_p$; using the Modern Chinese General Word Segmentation System, extract this keyword subset from the documents of each time period in the document subset of each theme. In this embodiment, the bag-of-words text analysis technique is used to extract the keyword subsets. Specifically, the Modern Chinese General Word Segmentation System developed by the Institute of Language Information Processing of Beijing Language and Culture University is used to segment all documents of each time period in the document subset of each theme; stop words, such as modal particles, adverbs, prepositions, and conjunctions, are removed; and a number of keyword subsets are finally obtained.
Step 7, calculate and rank keyword weights: the weight of a keyword is the number of times it occurs in a keyword subset; calculate the weight of every keyword in each keyword subset, and sort all keywords by weight in descending order.
Step 8, generate the word cloud: based on the keyword subsets and keyword weights, generate the word cloud on the theme flow graph with the TIARA algorithm in order to visualize the theme content.
The result of visualizing the themes of a Chinese document set with the method of this embodiment is shown in Fig. 2.
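The theme flow generation of step 5 can be sketched as follows. This is an illustrative approximation rather than the ThemeRiver implementation: cosine interpolation is chosen here as one simple interpolant whose derivative is zero at every sample point (and therefore at the extreme points), and the bands are stacked around a symmetric baseline. The function names and the `steps` parameter are assumptions.

```python
import math

def cosine_interp(v0, v1, u):
    """Interpolate between v0 and v1 for u in [0, 1]; the slope is zero at both ends."""
    k = (1 - math.cos(math.pi * u)) / 2
    return v0 * (1 - k) + v1 * k

def densify(series, steps=10):
    """Insert interpolated values between consecutive frequency samples."""
    out = []
    for a, b in zip(series, series[1:]):
        out.extend(cosine_interp(a, b, i / steps) for i in range(steps))
    out.append(series[-1])
    return out

def theme_flow(order, freqs, steps=10):
    """order: themes from top to bottom; freqs: {theme: [frequency per period]}.
    Returns {theme: [(lower, upper), ...]}, the band boundaries per column."""
    smooth = {t: densify(freqs[t], steps) for t in order}
    n_cols = len(smooth[order[0]])
    bands = {t: [] for t in order}
    for x in range(n_cols):
        total = sum(smooth[t][x] for t in order)
        y = -total / 2                  # baseline symmetric about the time axis
        for t in reversed(order):       # stack upward so order[0] ends on top
            bands[t].append((y, y + smooth[t][x]))
            y += smooth[t][x]
    return bands
```

Each band's thickness at a column equals the interpolated theme frequency, so a band that widens or narrows from left to right shows the change in theme intensity over time.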
Embodiment 2: in the Chinese document set theme visualization method above, the themes are arranged in random order. When the theme flow is generated, if the intensity of a theme changes too sharply, the shapes of the adjacent themes are distorted along with it, which makes the result unattractive and makes the relative intensity of the themes hard to judge. Distorted themes also interfere with word cloud placement. Moreover, among all the themes of a document set, users are usually most interested in the specific content of the theme with the greatest intensity. The invention therefore further improves the theme sorting step of Embodiment 1 by designing a sorting method based on theme frequency and geometric complementarity. The sorting method is described in detail below with reference to Fig. 3:
Step 3: create an empty list B. If $n$ is even, write the theme with the largest frequency sum into the first row of the list as the upper endpoint theme $l_{up}$, and write the theme with the second-largest frequency sum into the second row as the lower endpoint theme $l_{down}$. If $n$ is odd, write the theme with the largest frequency sum into the first row, where it serves as both the upper endpoint theme $l_{up}$ and the lower endpoint theme $l_{down}$;
Step 4: select a theme $l_i$ not in list B and calculate the mean of the frequency sums of $l_{up}$ and $l_i$; calculate the geometric complementarity of $l_{up}$ and $l_i$, expressed as the variance $\sigma_{up,i}$; then calculate the weighted value $D_i$:
$D_i = s\,OT_i + (1-s)\,\sigma_{up,i}$ (3);
where $s$ is a control parameter, $0 \le s \le 1$;
Step 5: repeat step 4 until the weighted value of every theme not in list B has been obtained;
Step 6: select the theme with the smallest weighted value $D_i$ and insert it above the upper endpoint theme in list B, as the new upper endpoint theme $l_{up}$;
Step 7: for a theme $l_k$ not in list B, calculate the geometric complementarity of $l_{down}$ and $l_k$, expressed as the variance $\sigma_{down,k}$; after normalizing $\sigma_{down,k}$ and $OT_k$, calculate the weighted value $D_k$:
$D_k = s\,OT_k + (1-s)\,\sigma_{down,k}$ (6);
where $s$ is a control parameter, $0 \le s \le 1$;
Step 8: repeat step 7 until the weighted value of every theme not in list B has been obtained;
Step 9: select the theme with the smallest weighted value and insert it below the lower endpoint theme in list B, as the new lower endpoint theme $l_{down}$;
Step 10: repeat steps 4 to 9 until all themes have been added to list B.
In this embodiment, the control parameter $s$ takes the value 0.3.
The theme flow graph generated after sorting the themes with the method of this embodiment is shown in Fig. 4. The theme flow graph is visibly more attractive and smoother, uses space more fully, and is better suited to word cloud placement.
Embodiment 3: to address the unstable shape and layout of the word clouds in the TIARA technique, the invention also improves the word cloud. The theme region is first divided into several subregions; a scalable algorithm (from the paper "Tag Cloud++ - Scalable Tag Clouds for Arbitrary Layouts") is then used to represent each subregion as a set of horizontal line segments; and the keywords are then placed one by one to generate the word cloud. The visual characteristics are: 1) the larger the weight of a keyword, the larger its font; 2) the larger the weight of a keyword, the closer it is to the center of the region. The method is described in detail below with reference to Fig. 5 and Fig. 6:
Step 1: select the region $G_j$ corresponding to theme $l_j$ on the theme flow graph; its start time and end time equal the start time $t_{start}$ and end time $t_{end}$ of the document set, respectively. Divide the time span $[t_{start}, t_{end}]$ of region $G_j$ into $m-1$ periods, each of length $\Delta t = (t_{end} - t_{start})/(m-1)$, to obtain the equal-division time points $t_{start} + p\Delta t$, where $p = 1, 2, \dots, m-2$;
Step 2: as shown in Fig. 5, taking each equal-division time point $t_{start} + p\Delta t$ in turn as the center, cut a subregion $R_{j,p}$ of width $\Delta t$ out of region $G_j$; $R_{j,p}$ is the closed region bounded by a point set $N$, a curve, and a line segment;
Step 3: place the keywords of keyword subset $W_{j,p}$ on each subregion $R_{j,p}$ to generate the word cloud of theme $l_j$, comprising:
3.1 connect the points of point set $N$ successively with line segments to obtain the approximating polygon of subregion $R_{j,p}$;
3.2 let $y_{max}$ be the y coordinate of the vertex with the largest y coordinate among the vertices of the approximating polygon of $R_{j,p}$, and let $y_{min}$ be the y coordinate of the vertex with the smallest y coordinate;
3.3 as shown in Fig. 6, intersect the set of horizontal lines $H = \{y = c \mid y_{min} \le c \le y_{max},\ c \in \mathbb{Z}\}$ with subregion $R_{j,p}$ to obtain a number of intersection segments; take the sub-segments of each intersection segment that lie inside the polygon, denoted $r_c(i)$, where $M$ is the number of sub-segments of that intersection segment lying inside $R_{j,p}$; $R_{j,p}$ is thereby represented as a set of horizontal line segments $L_{j,p}$;
3.4 select keywords from $W_{j,p}$ one at a time in descending order of weight; represent the keyword by a rectangle of height $h$ and width $w$, lay out the rectangle, and then place the keyword at the layout position, comprising:
A. check whether a rectangle of width $w$ and height $h$ can be placed at position $c = (y_{max} - y_{min})/2$ in the set $L_{j,p}$ corresponding to $R_{j,p}$. The check is: among the segments $r_c(i)$ of the rows from $c$ to $c-h$, determine whether there is a common index $i$ for which the segment length is greater than $w$ in every row. If so, place the keyword at position $(c, s_c(i))$ and update $s_c(i)$ to $s_c(i) = s_c(i) + w$; if not, go to step B;
B. starting from $c = (y_{max} - y_{min})/2$, traverse $L_{j,p}$ by alternately setting $c = c + 1$ and $c = c - 1$, and check whether the rectangle can be placed at position $c$. If so, place the keyword at position $(c, s_c(i))$ and update $s_c(i)$ to $s_c(i) = s_c(i) + w$; if not, continue the alternating traversal until an $r_c(i)$ that satisfies the condition is found or all $r_c(i)$ have been traversed. If no position $c$ satisfies the condition after all $r_c(i)$ have been traversed, discard the keyword;
3.5 repeat step 3.4 until all keywords in $W_{j,p}$ have been processed;
Step 4: repeat steps 1 to 3 until the word cloud of every theme on the theme flow graph has been generated.
Fig. 7 shows the result of generating the word cloud with this method; the theme flow graph in it was generated with the themes sorted randomly. A comparison with Fig. 2 shows that the word cloud placement algorithm of the invention has the following advantages: 1) It uses space effectively: for the same region size and font size, more keywords can be placed. 2) The generated layout is stable and does not change with subsequent interactive operations. 3) The efficiency of the algorithm is clearly improved: the algorithm represents an irregular region as discrete entities according to certain rules, and placing a word only requires traversing the entities to find one that satisfies the placement condition, with no collision detection or boundary detection, so positioning efficiency is greatly improved.
Embodiment 4: this embodiment combines the theme sorting method of Embodiment 2 with the improved word cloud placement method of Embodiment 3; the other steps are unchanged. Fig. 8 shows the result of visualizing the themes of a Chinese document set in this embodiment.
Embodiment 5: in TIARA, region size limits make it difficult to place all keywords within a region. The invention therefore uses a detailed word cloud to further visualize the full content of each theme, or the full content of each theme in each time period. The background color of the word cloud corresponds to the color of the theme in the theme flow graph, and the size of a keyword corresponds to its weight. The invention uses a random greedy algorithm (from the paper "TIARA: A Visual Exploratory Text Analytic System") to generate the detailed word cloud, specifically:
Step 1: select the keyword set that expresses the content of theme $l_j$;
Step 2: set up a circular region C and discretize the boundary of C into a set of conflict points P;
Step 3: select a keyword from the keyword set in descending order of weight, and use the random greedy algorithm to generate a candidate position (word.x, word.y) for it within region C;
Step 4: set the font size according to the weight of the keyword, then approximate the keyword by a rectangle r based on the font size and the number of characters in the keyword; the lower-left corner of rectangle r is set to the candidate position;
Step 5: for each conflict point in P, check whether the point conflicts with r; if there is a conflict, go to step 6; if there is no conflict, go to step 7;
Step 6: update the position along a spiral path and repeat steps 4 and 5 until a position that satisfies the condition is found or the spiral radius exceeds 100; when the spiral radius exceeds 100, the keyword is rejected;
Step 7: place the keyword at position (word.x, word.y), discretize the region occupied by the keyword into conflict points, and add them to the conflict point set P;
Step 8: repeat steps 3 to 7 until all keywords in the keyword set have been processed.
When generating the detailed word cloud, any one keyword subset $W_{j,p}$ of theme $l_j$ may be selected as the keyword set that expresses the content of the theme; alternatively, all keyword subsets $W_{j,p}$ of theme $l_j$ may be merged and the resulting set used as the keyword set that expresses the content of the theme.
Fig. 9 shows the theme visualization of a Chinese document set with the detailed word cloud added. The detailed word cloud is placed at the lower right of the figure; clicking the band of a theme on the theme flow graph switches the display to the detailed word cloud of that theme. The figure shows the detailed word cloud of the operating systems theme. Because of region size limits, the word cloud of the operating systems theme on the theme flow graph does not contain all of the keyword content, whereas the detailed word cloud shows all keyword content of this theme.
Claims (7)
1. A theme visualization method for a Chinese document set, characterized by comprising:
A step of classifying the document set by theme: let the document set have n themes $l_j$, j = 0, 1, 2, ..., n-1; classify all documents in the document set by theme to obtain n document subsets $D_j$, j = 0, 1, 2, ..., n-1, where the document subset corresponding to theme $l_j$ is $D_j$;
A step of dividing the time period of the document set: let the start time of the document set be $t_{start}$ and the end time be $t_{end}$; divide the time period $[t_{start}, t_{end}]$ into equal parts to obtain the time periods $T_p = (t_{start} + (p-1)\Delta t,\ t_{start} + p\Delta t]$, where p = 1, 2, ..., m-1 and $\Delta t = (t_{end} - t_{start})/(m-1)$;
A step of calculating the theme frequency: let the theme frequency comprise $v_{j,0}$ and $v_{j,p}$, where $v_{j,0}$ is the number of documents of the document subset $D_j$ corresponding to theme $l_j$ at the start time $t_{start}$, and $v_{j,p}$ is the number of documents of $D_j$ within time period $T_p$; calculate the theme frequency of each theme;
A step of sorting the themes: sort all themes to obtain the sorted theme sequence list;
A step of generating the theme flow graph: from the sorted theme sequence list and the theme frequencies, generate the theme flow graph using the theme flow algorithm;
A step of extracting keywords representing the theme content: let $W_{j,p}$ be the keyword subset representing the content of theme $l_j$ in the documents of its document subset $D_j$ within time period $T_p$; use a general-purpose modern Chinese word segmentation system to extract, for each theme and each time period, the keyword subset representing the theme content from the documents of the corresponding document subset;
A step of calculating and sorting keyword weights: let the weight of a keyword be the number of times the keyword occurs in a keyword subset; calculate the weight of each keyword in each keyword subset, and sort the keywords within each keyword subset in descending order of weight;
A step of generating the word cloud: generate the word cloud on the theme flow graph according to the keyword subsets and the keyword weights.
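The counting steps of claim 1 (time-period division, theme frequency, and keyword weights) can be illustrated with a short Python sketch. It assumes documents are already classified and segmented, so each document is a hypothetical tuple `(theme index, timestamp, keyword list)` with numeric timestamps inside the document-set time period; the function names are illustrative only, and the word segmentation system itself is not modelled.

```python
import math
from collections import Counter

def period_index(t, t_start, t_end, m):
    """Return 0 if t == t_start, else p in 1..m-1 with t inside T_p.

    T_p = (t_start + (p-1)*dt, t_start + p*dt], dt = (t_end - t_start)/(m-1).
    """
    dt = (t_end - t_start) / (m - 1)
    if t == t_start:
        return 0
    return min(m - 1, math.ceil((t - t_start) / dt))

def theme_frequency(docs, n, t_start, t_end, m):
    """v[j][0] counts documents at t_start; v[j][p] counts documents in T_p."""
    v = [[0] * m for _ in range(n)]
    for theme, t, _ in docs:
        v[theme][period_index(t, t_start, t_end, m)] += 1
    return v

def keyword_weights(docs, n, t_start, t_end, m):
    """W[j][p] maps each keyword to its occurrence count (its weight)."""
    w = [[Counter() for _ in range(m)] for _ in range(n)]
    for theme, t, words in docs:
        w[theme][period_index(t, t_start, t_end, m)].update(words)
    return w
```

Calling `most_common()` on any `W[j][p]` counter yields the keywords of that subset in descending order of weight, which is the order the word cloud step consumes.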
2. The theme visualization method for a Chinese document set as claimed in claim 1, characterized in that the step of sorting the themes is performed by a sorting method based on theme frequency and geometric complementarity, comprising:
Step 1: Let the start time of theme $l_j$ be $OT_j$; when $v_{j,0}$ is not zero, take the start time $t_{start}$ of the document set as $OT_j$; when $v_{j,0}$ is zero, take as $OT_j$ the smallest left endpoint of the time periods $T_p$ in which $v_{j,p}$ is nonzero; calculate the start time of each theme;
Step 3: Create a new empty list B; if n is even, write the theme with the largest frequency sum into the first row of the list as the upper endpoint theme $l_{up}$, and write the theme with the second-largest frequency sum into the second row as the lower endpoint theme $l_{down}$; if n is odd, write the theme with the largest frequency sum into the first row of the list, where it serves as both the upper endpoint theme $l_{up}$ and the lower endpoint theme $l_{down}$;
$D_i = s \cdot OT_i + (1 - s) \cdot \sigma_{up,i}$ (3);
where s is a control parameter, 0 ≤ s ≤ 1;
Step 5: Repeat Step 4 until the weighted value of every theme not yet in list B has been obtained;
Step 6: Select the theme with the smallest weighted value $D_i$ and insert it above the upper endpoint theme of list B as the new upper endpoint theme $l_{up}$;
Calculate the geometric complementarity of $l_{down}$ and $l_k$, expressed by the variance $\sigma_{down,k}$:
After normalizing $\sigma_{down,k}$ and $OT_k$, calculate the weighted value $D_k$:
$D_k = s \cdot OT_k + (1 - s) \cdot \sigma_{down,k}$ (6);
where s is a control parameter, 0 ≤ s ≤ 1;
Step 8: Repeat Step 7 until the weighted value of every theme not yet in list B has been obtained;
Step 9: Select the theme with the smallest weighted value and insert it below the lower endpoint theme of list B as the new lower endpoint theme $l_{down}$;
Step 10: Repeat Steps 4 to 9 until all themes have been added to list B.
3. The theme visualization method for a Chinese document set as claimed in claim 2, characterized in that the control parameter s = 0.3.
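The alternating insertion of claims 2 and 3 can be sketched as below. The variance formulas for $\sigma_{up,i}$ and $\sigma_{down,k}$ are not reproduced in this text, so the sketch assumes that geometric complementarity is the population variance of the two themes' summed frequencies over all periods (a flatter combined band counts as more complementary) and that normalization is min-max scaling over the remaining candidates. Both are assumptions, as are all function names and the choice of the document-set end time as the start time of a theme without documents.

```python
from statistics import pvariance

def start_time(freq, t_start, dt):
    """OT_j: t_start if v_{j,0} != 0, else left end of the first non-empty T_p."""
    if freq[0] != 0:
        return t_start
    for p in range(1, len(freq)):
        if freq[p] != 0:
            return t_start + (p - 1) * dt
    return t_start + (len(freq) - 1) * dt   # assumed: empty theme starts at t_end

def normalise(values):
    """Assumed min-max scaling of a dict of values to [0, 1]."""
    lo, hi = min(values.values()), max(values.values())
    span = hi - lo
    return {k: (v - lo) / span if span else 0.0 for k, v in values.items()}

def complementarity(a, b):
    """ASSUMED sigma: variance of the combined frequency of two themes."""
    return pvariance([x + y for x, y in zip(a, b)])

def order_themes(v, t_start, dt, s=0.3):
    """Return theme indices from top to bottom of list B."""
    n = len(v)
    ot = {j: start_time(v[j], t_start, dt) for j in range(n)}
    by_total = sorted(range(n), key=lambda j: -sum(v[j]))
    # Step 3: seed the list with one (n odd) or two (n even) themes.
    order = by_total[:2] if n % 2 == 0 else by_total[:1]
    remaining = set(range(n)) - set(order)

    def pick(anchor):
        """Steps 4-5 / 7-8: weighted value D = s*OT + (1-s)*sigma."""
        cand = sorted(remaining)
        sig = normalise({k: complementarity(v[anchor], v[k]) for k in cand})
        o = normalise({k: ot[k] for k in cand})
        return min(cand, key=lambda k: s * o[k] + (1 - s) * sig[k])

    while remaining:
        k = pick(order[0])                  # Step 6: insert above l_up
        order.insert(0, k)
        remaining.discard(k)
        if not remaining:
            break
        k = pick(order[-1])                 # Step 9: insert below l_down
        order.append(k)
        remaining.discard(k)
    return order
```

With s = 0.3, the complementarity term carries more weight than the start time, so the ordering favors flat combined bands over strictly chronological stacking.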
4. The theme visualization method for a Chinese document set as claimed in claim 1, characterized in that the step of generating the word cloud comprises:
Step 1: Select the region $G_j$ corresponding to theme $l_j$ on the theme flow graph, whose start time and end time equal the start time $t_{start}$ and end time $t_{end}$ of the document set respectively; divide the time period $[t_{start}, t_{end}]$ of region $G_j$ into m-1 segments, each of length $\Delta t = (t_{end} - t_{start})/(m-1)$, to obtain the division time points $t_{start} + p\Delta t$, where p = 1, 2, ..., m-2;
Step 2: Centered on each division time point $t_{start} + p\Delta t$ in turn, cut out a subregion $R_{j,p}$ from region $G_j$ according to $\Delta t$; $R_{j,p}$ is the closed region formed by a point set, a curve, and a line segment;
Step 3: Place the keywords of keyword subset $W_{j,p}$ on each subregion $R_{j,p}$ to generate the word cloud of theme $l_j$, comprising:
3.2 Among the vertices of the polygon approximating subregion $R_{j,p}$, let $y_{max}$ be the y coordinate of the vertex with the largest y coordinate, and let $y_{min}$ be the y coordinate of the vertex with the smallest y coordinate;
3.3 Intersect the set of horizontal lines $H = \{y = c \mid y_{min} \le c \le y_{max},\ c \in \mathbb{Z}\}$ with subregion $R_{j,p}$ to obtain a number of intersecting line segments; take the sub-segments of each intersecting line segment that lie inside the polygon, denoted $r_c(i)$, where M is the number of sub-segments of that intersecting line segment lying inside $R_{j,p}$; represent $R_{j,p}$ as a set of horizontal line segments $L_{j,p}$;
3.4 Select keywords from $W_{j,p}$ in descending order of weight; replace the selected keyword by a rectangle of height h and width w, perform the layout, and then place the keyword at the resulting layout position, comprising:
A. Test whether a rectangle of width w and height h can be placed at position $c = (y_{max} - y_{min})/2$ in the set $L_{j,p}$ corresponding to $R_{j,p}$. The test is: check whether, among the $r_c(i)$ for rows c down to c-h, there is a common index i for which the line segment has length greater than w. If so, place the keyword at position $(c, s_c(i))$ and update $s_c(i)$ to $s_c(i) = s_c(i) + w$; if not, go to step B;
B. Starting from $c = (y_{max} - y_{min})/2$, traverse $L_{j,p}$ by alternately setting c = c+1 and c = c-1, testing whether the rectangle can be placed at position c. If so, place the keyword at position $(c, s_c(i))$ and update $s_c(i)$ to $s_c(i) = s_c(i) + w$; if not, continue alternating c = c+1 and c = c-1 over $L_{j,p}$ until an $r_c(i)$ satisfying the condition is found or all $r_c(i)$ have been traversed; if no position c satisfying the condition is found after all $r_c(i)$ have been traversed, discard the keyword;
3.5 Repeat step 3.4 until all keywords in $W_{j,p}$ have been placed;
Step 4: Repeat Steps 1 to 3 until the word cloud of every theme has been generated on the theme flow graph.
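Steps 3.3 and 3.4 can be illustrated with the following Python sketch, which stores a subregion as a hypothetical dict mapping each integer row c to a list of free segments `[start, end]`. Several details are interpretations rather than statements of the claim: the centre row is taken as the midpoint `(y_min + y_max) // 2` (equal to the claimed expression when `y_min` is 0), the rectangle is anchored at the largest start value over the rows it spans so that it fits in every row, and the start of every spanned row is advanced past the rectangle.

```python
def try_place(segments, c, w, h):
    """Step A: try to fit a w-by-h rectangle whose top row is c.

    Looks for a segment index i shared by rows c-h..c whose common free
    span is longer than w. Returns (x, c) on success, else None.
    """
    rows = range(c - h, c + 1)
    if any(r not in segments for r in rows):
        return None
    shared = min(len(segments[r]) for r in rows)
    for i in range(shared):
        x = max(segments[r][i][0] for r in rows)      # common free start
        right = min(segments[r][i][1] for r in rows)  # common free end
        if right - x > w:
            for r in rows:
                segments[r][i][0] = x + w             # mark the span as used
            return (x, c)
    return None

def place_keyword(segments, w, h):
    """Step B: start at the centre row, then alternate c+1, c-1, c+2, ..."""
    if not segments:
        return None
    y_min, y_max = min(segments), max(segments)
    centre = (y_min + y_max) // 2
    for offset in range(y_max - y_min + 1):
        rows = [centre] if offset == 0 else [centre + offset, centre - offset]
        for c in rows:
            pos = try_place(segments, c, w, h)
            if pos is not None:
                return pos
    return None                                       # keyword is discarded
```

Working on integer scan rows keeps each placement test to a few comparisons per row, which is where the layout method gets its speed compared with point-by-point collision checks.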
5. The theme visualization method for a Chinese document set as claimed in claim 1, characterized by further comprising a step of generating a detailed word cloud, comprising:
Step 1: Select the keyword set expressing the content of theme $l_j$;
Step 2: Set up a circular region C and discretize the boundary of C into a set of conflict points P;
Step 3: Select keywords from the keyword set in descending order of weight; for the selected keyword, use a randomized greedy algorithm to generate a candidate position coordinate (word.x, word.y) inside region C;
Step 4: Set the font size of the keyword according to its weight; then, from the font size and the number of characters in the keyword, approximate the keyword by a rectangle r whose lower-left corner is at the candidate coordinate;
Step 5: For each conflict point in P, test whether the point conflicts with r; if any conflict exists, go to Step 6; otherwise go to Step 7;
Step 6: Update the position coordinate along a spiral path and repeat Steps 4 and 5 until a position satisfying the condition is found or the spiral radius exceeds 100; once the spiral radius exceeds 100, the keyword is discarded;
Step 7: Place the keyword at position (word.x, word.y), discretize the region occupied by the keyword into conflict points, and add them to the conflict point set P;
Step 8: Repeat Steps 3 to 7 until every keyword in the keyword set has been placed.
6. The theme visualization method for a Chinese document set as claimed in claim 5, characterized in that the keyword set expressing the content of theme $l_j$ is any one keyword subset $W_{j,p}$ of theme $l_j$.
7. The theme visualization method for a Chinese document set as claimed in claim 5, characterized in that the keyword set expressing the content of theme $l_j$ is obtained by the following steps:
Step 1: Merge all keyword subsets $W_{j,p}$ of theme $l_j$, p = 1, 2, ..., m-1;
Step 2: Calculate the weight of every keyword in the merged set, the weight of a keyword being the number of times the keyword occurs across all keyword subsets.
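The merge in claim 7 amounts to summing per-period keyword counts. A minimal Python sketch, assuming each keyword subset is available as a mapping from keyword to occurrence count (the function name is hypothetical):

```python
from collections import Counter

def merge_keyword_subsets(subsets):
    """Merge the W_{j,p} of one theme and return (keyword, weight) pairs.

    The weight of a keyword is its total count over all subsets; the result
    is sorted in descending order of weight.
    """
    merged = Counter()
    for subset in subsets:
        merged.update(subset)   # adds counts keyword by keyword
    return merged.most_common()
```

The returned list can be passed directly to a detailed word cloud layout that consumes keywords in descending order of weight.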
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310488312.7A CN103631856B (en) | 2013-10-17 | 2013-10-17 | Subject visualization method for Chinese document set |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310488312.7A CN103631856B (en) | 2013-10-17 | 2013-10-17 | Subject visualization method for Chinese document set |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103631856A (en) | 2014-03-12 |
CN103631856B (en) | 2017-01-11 |
Family
ID=50212898
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310488312.7A Active CN103631856B (en) | 2013-10-17 | 2013-10-17 | Subject visualization method for Chinese document set |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103631856B (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101996234A (en) * | 2009-08-17 | 2011-03-30 | 阿瓦雅公司 | Word cloud audio navigation |
US8402030B1 (en) * | 2011-11-21 | 2013-03-19 | Raytheon Company | Textual document analysis using word cloud comparison |
US20130267287A1 (en) * | 2012-04-04 | 2013-10-10 | David Goldenberg | System and Method for Interactive Gameplay with Song Lyric Database |
Non-Patent Citations (1)
Title |
---|
TING LIANG ET.AL: "WordStream: Visualizing Theme Summarization and Comparison in Document Collections over Time", 《ADVANCES IN INFORMATION SCIENCES AND SERVICE SCIENCES(AISS)》 * |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105320683A (en) * | 2014-07-24 | 2016-02-10 | 贾新志 | Graphical display method of literature theme content analysis |
CN105989090A (en) * | 2015-02-12 | 2016-10-05 | 中兴通讯股份有限公司 | Critical data processing method and device as well as critical data display method and system |
CN105373579B (en) * | 2015-08-18 | 2018-08-03 | 天津大学 | A kind of news competitiveness analysis method and its visualization device based on regression analysis |
CN105373579A (en) * | 2015-08-18 | 2016-03-02 | 天津大学 | Regression analysis-based news competitiveness analysis method and visualization device |
CN106250512A (en) * | 2016-08-04 | 2016-12-21 | 国家基础地理信息中心 | A kind of subject network information collecting method taking time intention into account |
CN106250512B (en) * | 2016-08-04 | 2019-07-26 | 国家基础地理信息中心 | A kind of subject network information collecting method for taking time intention into account |
CN106681983A (en) * | 2016-11-25 | 2017-05-17 | 北京掌行通信息技术有限公司 | Station name participle display method and device |
CN106909381B (en) * | 2017-02-24 | 2020-01-03 | 西南交通大学 | Interactive theme river visualization method |
CN106909381A (en) * | 2017-02-24 | 2017-06-30 | 西南交通大学 | A kind of interactive theme river method for visualizing |
CN109144504A (en) * | 2017-06-26 | 2019-01-04 | 华东师范大学 | Data visualization image generation method and storage medium based on D3 |
CN107622132A (en) * | 2017-10-09 | 2018-01-23 | 四川大学 | A kind of association analysis method for visualizing towards online Ask-Answer Community |
CN107622132B (en) * | 2017-10-09 | 2020-07-03 | 四川大学 | Online question-answer community oriented association analysis visualization method |
CN109783616A (en) * | 2018-12-03 | 2019-05-21 | 广东蔚海数问大数据科技有限公司 | A kind of text subject extracting method, system and storage medium |
CN109933702A (en) * | 2019-03-11 | 2019-06-25 | 智慧芽信息科技(苏州)有限公司 | A kind of retrieval methods of exhibiting, device, equipment and storage medium |
WO2020244214A1 (en) * | 2019-06-05 | 2020-12-10 | 山东大学 | Method and device for generating shape word cloud |
CN111737523A (en) * | 2020-04-22 | 2020-10-02 | 聚好看科技股份有限公司 | Video tag, search content generation method and server |
CN111737523B (en) * | 2020-04-22 | 2023-11-14 | 聚好看科技股份有限公司 | Video tag, generation method of search content and server |
Also Published As
Publication number | Publication date |
---|---|
CN103631856B (en) | 2017-01-11 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||