CN105550365A - Visualization analysis system based on text topic model - Google Patents
- Publication number
- CN105550365A (application CN201610028107.6A)
- Authority
- CN
- China
- Prior art keywords
- data
- word
- theme
- text
- unit
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/358—Browsing; Visualisation therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Abstract
The invention discloses a visualization analysis system based on a text topic model. The visualization analysis system comprises an internet text data acquisition module, a corpus module, a topic analysis module, a topic clustering module and a data visualization module, wherein the internet text data acquisition module is used for acquiring and cleaning webpage text data from an internet; the corpus module is used for performing Chinese word segmentation and word frequency statistics of the webpage text data; the topic analysis module is used for generating a document-topic vector set and a topic-word vector set; the topic clustering module is used for performing clustering analysis of the document-topic vector set; and the data visualization module is used for performing data display and variable parameter adjustment. According to the invention, optimization of the analysis effect and dynamic adjustment of variable parameters in the analysis process are realized; and thus, the analysis efficiency is increased.
Description
Technical field
The present invention relates to the field of Internet text topic analysis, and in particular to a visualization analysis system based on a text topic model.
Background
The Internet holds a massive amount of text, such as news reports, literary criticism and popular-science articles, in many forms, including news web pages, blogs and microblogs. Topic analysis of this text can reveal the hot topics currently being discussed online. Hot topics support a range of useful applications, such as forecasting industry trends, recommending popular products and analyzing online public opinion.
Data visualization is an interdisciplinary field that combines computer graphics, psychology and human-computer interaction. Using visualization algorithms, it builds graphical visualization models for displaying multidimensional or high-dimensional data. A visualization model that incorporates human-computer interaction supports dynamic analysis from multiple angles. The main purpose of data visualization is to present data graphically so that users understand complex data more easily and analyze it more efficiently.
An intuitive, easy-to-understand visualization of result data helps users understand analysis results and greatly improves analysis efficiency. Analysis results can be read from different angles: for example, from the angle of a particular topic, to see how it is distributed across documents, or from the angle of a particular document cluster, to analyze the topics it contains. A static visualization can hardly show all of these views at once. A static visualization model therefore needs to be combined with human-computer interaction techniques so that it can dynamically present the view the user wants. In addition, each analysis stage involves relatively independent data processing, so the parameter settings of each sub-module directly affect the overall result. During topic analysis and clustering the user should therefore be able to adjust parameters in order to optimize the overall analysis. An interactive visualization model lets the user adjust parameters dynamically in a graphical interface and see the updated analysis results in real time.
Summary of the invention
In view of the above problems, the object of the present invention is to provide a visualization analysis system based on a text topic model that optimizes the analysis effect, allows variable parameters to be adjusted dynamically during analysis, and improves analysis efficiency.
To achieve this object, the invention discloses a visualization analysis system based on a text topic model. The system comprises an Internet text data acquisition module, a corpus module, a topic analysis module, a topic clustering module and a data visualization module.
The Internet text data acquisition module collects web page text data from the Internet and cleans each piece of collected text data.
The corpus module stores the text data cleaned by the Internet text data acquisition module and performs Chinese word segmentation and word frequency statistics on the stored web page text data, generating word frequency data that comprise the mapping between words and the stored web page text data together with the word frequency statistics.
The topic analysis module builds a topic model from the word frequency data generated by the corpus module, computes the model using Gibbs sampling, and stores and outputs the resulting document-topic vector set and topic-word vector set.
The topic clustering module performs cluster analysis on the document-topic vector set output by the topic analysis module, and stores and outputs the cluster data.
The data visualization module displays the data output by the topic analysis module and the topic clustering module as graphics. It also displays and adjusts the variable parameters of the corpus module, the topic analysis module and the topic clustering module.
Preferably, the Internet text data acquisition module comprises a web page capture unit and a data cleaning unit.
The web page capture unit collects the text data of web pages from the Internet. It uses web crawler technology: given a seed website, it follows the links on that site to other websites, thereby crawling web pages automatically.
The data cleaning unit cleans the text data collected by the web page capture unit, removing data unrelated to the page content. The retained data comprise the title, author, time, source and body text of each web page.
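The seed-and-follow-links behaviour of the web page capture unit can be illustrated with a minimal Python sketch. It is not the patented implementation: the names LinkExtractor and crawl are hypothetical, the breadth-first order and the max_pages limit are assumptions, and page retrieval is delegated to a caller-supplied fetch function so that the sketch does not depend on any particular network library.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href targets of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, fetch, max_pages=100):
    """Breadth-first crawl starting from seed_url.

    fetch is a hypothetical callable mapping a URL to an HTML string,
    or to None when the page cannot be retrieved.
    Returns a dict mapping each visited URL to its raw HTML.
    """
    pages = {}
    queue = deque([seed_url])
    seen = {seed_url}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return pages
```

In a real deployment the raw HTML returned here would be handed to the data cleaning unit, which keeps only the title, author, time, source and body text.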
Preferably, the corpus module comprises a corpus construction unit, a corpus, a Chinese word segmentation unit, a word frequency data management unit and a word frequency database.
The corpus construction unit stores the cleaned text data in the corpus, which is based on a relational database.
The Chinese word segmentation unit segments the data in the corpus into Chinese words and removes stop words unrelated to the body text according to the stop-word list defined in this unit.
The word frequency data management unit performs word frequency statistics on the segmentation results produced by the Chinese word segmentation unit and stores the resulting statistics in the word frequency database. The word frequency data stored in the word frequency database comprise the mapping between each word in the segmentation results and the text data in the corpus, together with the statistics managed by the word frequency data management unit. The statistics comprise the number of times each word in the segmentation results occurs in each corresponding piece of text data, and the occurrence counts of the words contained in each piece of text data.
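As an illustration of the word frequency data described above, the following Python sketch builds per-document word counts and a word-to-document mapping. It assumes the text has already been segmented into tokens (the segmenter itself is out of scope here), and the function name build_word_frequencies is hypothetical.

```python
from collections import Counter


def build_word_frequencies(segmented_docs, stop_words):
    """Compute word frequency data from segmented documents.

    segmented_docs: dict mapping a document id to its list of tokens.
    stop_words: set of tokens to discard.
    Returns (per_doc, word_to_docs), where per_doc maps each document id
    to a Counter of word occurrences and word_to_docs maps each word to
    the set of document ids in which it occurs.
    """
    per_doc = {}
    word_to_docs = {}
    for doc_id, tokens in segmented_docs.items():
        counts = Counter(t for t in tokens if t not in stop_words)
        per_doc[doc_id] = counts
        for word in counts:
            word_to_docs.setdefault(word, set()).add(doc_id)
    return per_doc, word_to_docs
```

A relational word frequency database could store per_doc as rows of (document id, word, count), which also encodes the word-to-document mapping.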
Preferably, the topic analysis module comprises an LDA topic model construction unit, a Gibbs sampling calculation unit, a result vector set management unit and a vector set database.
The LDA topic model construction unit builds an LDA topic model from the word frequency data.
The Gibbs sampling calculation unit computes the LDA model using Gibbs sampling, obtaining a document-topic vector set, which describes the topics contained in each piece of text data, and a topic-word vector set, which describes the keywords contained in each topic.
The result vector set management unit saves the vector sets obtained by the Gibbs sampling calculation unit in the vector set database, which is based on a relational database.
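The patent does not specify which Gibbs sampling scheme is used. As a hedged illustration, the sketch below implements the widely used collapsed Gibbs sampler for LDA in plain Python. The function name lda_gibbs, the symmetric priors alpha and beta and their default values, and the fixed iteration count are all assumptions, and documents are assumed to be given as lists of integer word ids.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
              iterations=200, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs: list of documents, each a list of word ids in [0, vocab_size).
    Returns (doc_topic, topic_word); every row is a probability vector.
    """
    rng = random.Random(seed)
    ndk = [[0] * num_topics for _ in docs]               # document-topic counts
    nkw = [[0] * vocab_size for _ in range(num_topics)]  # topic-word counts
    nk = [0] * num_topics                                # tokens per topic
    z = []                                               # topic of every token

    # Random initial topic assignment.
    for d, doc in enumerate(docs):
        assignments = []
        for w in doc:
            k = rng.randrange(num_topics)
            assignments.append(k)
            ndk[d][k] += 1
            nkw[k][w] += 1
            nk[k] += 1
        z.append(assignments)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                # Remove the token, then resample its topic from the
                # conditional distribution given all other assignments.
                k = z[d][n]
                ndk[d][k] -= 1
                nkw[k][w] -= 1
                nk[k] -= 1
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][w] + beta)
                    / (nk[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][n] = k
                ndk[d][k] += 1
                nkw[k][w] += 1
                nk[k] += 1

    doc_topic = [
        [(ndk[d][t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    topic_word = [
        [(nkw[t][w] + beta) / (nk[t] + vocab_size * beta)
         for w in range(vocab_size)]
        for t in range(num_topics)
    ]
    return doc_topic, topic_word
```

Here doc_topic plays the role of the document-topic vector set and topic_word that of the topic-word vector set; the number of topics and the priors are natural candidates for the variable parameters that the system lets the user adjust.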
Preferably, the topic clustering module comprises a cluster analysis unit, a topic clustering data set management unit and a document cluster database.
The cluster analysis unit performs cluster analysis on the document-topic vector set to obtain text cluster data, which comprise the texts contained in each document cluster and the document cluster to which each text belongs.
The topic clustering data set management unit saves the text cluster data in the document cluster database, which is based on a relational database.
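The description later names K-means as one of the algorithms the cluster analysis unit may use. The following plain-Python sketch shows K-means applied to document-topic vectors. The function name kmeans, the squared Euclidean distance and the random choice of initial centroids are assumptions, not details taken from the patent.

```python
import random


def kmeans(vectors, k, iterations=100, seed=0):
    """Cluster document-topic vectors with K-means.

    Returns (labels, clusters): labels[i] is the cluster of document i,
    and clusters maps each cluster id to the list of its document indices.
    """
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(iterations):
        # Assign every vector to its nearest centroid.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2
                                  for a, b in zip(v, centroids[c])))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    clusters = {c: [i for i, lab in enumerate(labels) if lab == c]
                for c in range(k)}
    return labels, clusters
```

The two return values correspond to the two parts of the text cluster data: the document cluster of each text, and the texts contained in each document cluster.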
Preferably, the data visualization module comprises a data integration unit, a visualization unit and a human-computer interaction unit.
The data integration unit reads the document-topic vector set data and topic-word vector set data from the vector set database, reads the text cluster data from the document cluster database, and converts the data it has read into the data format defined by the visualization unit.
The visualization unit is mainly used to present the integrated data graphically to the end user.
The human-computer interaction unit adjusts the variable parameters of those units in the corpus module, the topic analysis module and the topic clustering module that have computation and filtering functions.
Preferably, the units with computation and filtering functions comprise the corpus construction unit, the Chinese word segmentation unit, the LDA topic model construction unit, the Gibbs sampling calculation unit, the cluster analysis unit and the topic clustering data set management unit.
Preferably, the result vector set management unit further has a topic heat subunit, which calculates the heat of each topic and stores the result in the vector set database.
Preferably, the data visualization module generates the visual image as follows:
Step 1: obtain the topic heat data set H = {h1, h2, h3, ..., hk} from the vector set database, where hi is the heat value of the i-th topic.
Step 2: draw the topic area on the display screen, specifically:
Step 21: draw two concentric circles.
Step 22: normalize the topic heat values in H to obtain the normalized data set H' = {h1', h2', ..., hk'}, where hi' is the normalized heat value of topic i.
Step 23: according to the proportion of each topic heat value hi, divide the ring between the outer and inner circles drawn in step 21 into k sectors, each representing one topic; the sector angle of topic i, in radians, is 2*PI*hi'.
Step 3: draw a word cloud in each sector, specifically:
Step 31: for topic i, access the vector set database and obtain the word vector of topic i, Wi = {{wi1, v1}, {wi2, v2}, ..., {win, vn}}, where wip is the p-th word contained in topic i and vp is the value of wip, that is, the importance of this word to topic i.
Step 32: normalize the values v in Wi to obtain Wi' = {{wi1, v1'}, {wi2, v2'}, ..., {win, vn'}}, where vp' is the normalized value of vp.
Step 33: generate the word cloud in the sector corresponding to topic i; the font size of the p-th word of the word cloud = set initial size * vp' * hi'. A word whose font size is smaller than size 2 is not displayed.
Step 34: place each word in the word cloud horizontally.
Step 4: draw the document clusters, specifically:
Step 41: obtain the size information of the document clusters from the document cluster database: SC = {sc1, sc2, ..., scy}, where sci is the number of documents contained in the i-th document cluster.
Step 42: normalize SC to obtain SC' = {sc1', sc2', ..., scy'}, where sci' is the normalized number of documents contained in the i-th document cluster.
Step 43: inside the inner circle drawn in step 21, draw one circle for each document cluster, with a radius proportional to the normalized value sci'; the circles are arranged in a spiral from the outside inward, in descending order of radius.
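Steps 22 and 23 can be illustrated with a short Python sketch. The patent does not state the normalization method; the sketch assumes each value is divided by the sum of all values, which is the choice that makes the k sector angles add up to a full circle. The function names normalize and topic_sectors are hypothetical.

```python
import math


def normalize(values):
    """Scale non-negative values so that they sum to 1 (assumed method)."""
    total = sum(values)
    if total <= 0:
        raise ValueError("values must have a positive sum")
    return [v / total for v in values]


def topic_sectors(heat_values):
    """Return (start_angle, end_angle) in radians for each topic.

    The sweep of topic i is 2 * pi * hi', where hi' is its normalized heat.
    """
    sectors = []
    start = 0.0
    for h in normalize(heat_values):
        sweep = 2 * math.pi * h
        sectors.append((start, start + sweep))
        start += sweep
    return sectors
```

For example, heat values of 2, 1 and 1 give one sector covering half of the ring and two sectors covering a quarter each.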
Preferably, the sectors of the topic area in the visual image can be triggered. Specifically, after the sector corresponding to topic i is triggered, the data visualization module obtains from the document cluster database the proportions TC = {tc1, tc2, ..., tcy} of the documents containing topic i within their document clusters, where tcs is the proportion of documents in document cluster s that contain topic i, and draws a sector in each corresponding document cluster with angle Ai = 2*PI*tcs radians.
Preferably, the circular areas of the document clusters in the visual image can be triggered. Specifically, after the circular area corresponding to a document cluster is triggered, the data visualization module reads from the document cluster database the proportions CT = {ct1, ct2, ..., ctk} of the topics contained in that document cluster, where cti is the share of topic i among all topics contained in the selected document cluster, re-divides the topic area into sectors according to CT, and generates a word cloud in the sector corresponding to each topic.
Preferably, in step 34, each word in the word cloud is placed horizontally as follows: each word is rotated according to the angle of its center point about the center of the circles, so that the words are displayed horizontally regardless of the angle of the sector that contains them.
The visualization analysis system based on a text topic model proposed by the invention analyzes the topics of Internet text and presents the topic clustering intuitively and graphically through an interactive topic visualization model. Its variable parameters can be adjusted dynamically, which optimizes the analysis effect and improves analysis efficiency.
Brief description of the drawings
Fig. 1 shows the framework of the visualization analysis system based on a text topic model according to an embodiment of the invention;
Fig. 2 is a schematic diagram of the structure of the interactive topic visualization model according to an embodiment of the invention.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the invention clearer, the invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
Fig. 1 shows the system framework of an embodiment of the invention. The visualization analysis system based on a text topic model of this embodiment comprises an Internet text data acquisition module S101, a corpus module S102, a topic analysis module S103, a topic clustering module S104 and a data visualization module S105.
The Internet text data acquisition module S101 collects web page text data from the Internet and cleans each piece of collected text data. It comprises a web page capture unit S114 and a data cleaning unit S115. The web page capture unit S114 collects the text data of web pages from the Internet. It uses web crawler technology: starting from a given seed website, it follows the links on that site to other websites, thereby crawling web pages automatically. The data cleaning unit S115 cleans the text data collected by the web page capture unit, removing data unrelated to the page content. The retained data comprise the title, author, time, source, body text and so on of each web page.
The corpus module S102 comprises a corpus construction unit S116, a corpus, a Chinese word segmentation unit S117, a word frequency data management unit S118 and a word frequency database. The corpus construction unit S116 stores the cleaned text data in the corpus, which is based on a relational database. The Chinese word segmentation unit S117 segments the data in the corpus into Chinese words and removes stop words unrelated to the body text according to the stop-word list defined in this unit. The word frequency data management unit S118 performs word frequency statistics on the segmentation results and stores the resulting statistics in the word frequency database; it also provides access to the data in the word frequency database. The word frequency data stored in the word frequency database comprise the mapping between each word in the segmentation results and the text data in the corpus, together with the statistics managed by the word frequency data management unit. The statistics comprise the number of times each word in the segmentation results occurs in each corresponding piece of text data, and the occurrence counts of the words contained in each piece of text data.
The topic analysis module S103 comprises an LDA topic model construction unit S119, a Gibbs sampling calculation unit S120, a result vector set management unit S121 and a vector set database. The LDA topic model construction unit S119 builds an LDA (Latent Dirichlet Allocation) topic model from the word frequency data. The Gibbs sampling calculation unit S120 computes the LDA model using Gibbs sampling; the results are the document-topic vector set and the topic-word vector set, which describe, respectively, the topics contained in each piece of text data and the keywords contained in each topic. The result vector set management unit S121 saves the vector sets obtained by the Gibbs sampling calculation unit in the vector set database, which is based on a relational database, and provides a data access interface. It also has a topic heat subunit, which calculates the heat of each topic and stores the result in the vector set database.
The topic clustering module S104 comprises a cluster analysis unit S122, a topic clustering data set management unit S123 and a document cluster database. The cluster analysis unit S122 contains several different clustering algorithms, such as the K-means, OPTICS and DBSCAN clustering algorithms; the algorithm to use can be selected through the data visualization module. The unit performs cluster analysis on the document-topic vector set to obtain text cluster data, which comprise the texts contained in each document cluster and the document cluster to which each text belongs. In addition, a subunit computes statistics on the relationship between document clusters and topics, for example the topics contained in each document cluster and the document clusters in which each topic appears. The topic clustering data set management unit S123 saves the cluster analysis results in the document cluster database, which is based on a relational database, and provides a data access interface.
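The patent does not define how the statistics relating document clusters and topics are computed. The Python sketch below shows one possible reading, under two explicit assumptions: the share of a topic within a cluster is taken to be the average weight of that topic over the cluster's documents, and a document is taken to "contain" a topic when its weight for that topic reaches a threshold. The function names and the threshold value are hypothetical.

```python
def cluster_topic_shares(doc_topic, labels):
    """For each cluster, the average topic weights of its documents.

    If every document-topic vector sums to 1, so does every result row.
    """
    sums, counts = {}, {}
    for vec, lab in zip(doc_topic, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for t, weight in enumerate(vec):
            acc[t] += weight
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def topic_document_ratio(doc_topic, labels, topic, threshold=0.2):
    """For each cluster, the fraction of its documents containing `topic`.

    A document is assumed to contain the topic when its weight for that
    topic is at least `threshold`.
    """
    hits, counts = {}, {}
    for vec, lab in zip(doc_topic, labels):
        counts[lab] = counts.get(lab, 0) + 1
        if vec[topic] >= threshold:
            hits[lab] = hits.get(lab, 0) + 1
    return {lab: hits.get(lab, 0) / n for lab, n in counts.items()}
```

Under these assumptions, the first function yields values of the kind used to re-divide the topic area when a document cluster is selected, and the second yields values of the kind used to draw a sector inside each document cluster when a topic is selected.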
The data visualization module S105 comprises a data integration unit S112, a visualization unit S113 and a human-computer interaction unit S114. The data integration unit S112 reads the document-topic vector set data and topic-word vector set data from the vector set database, reads the text cluster data from the document cluster database, and converts the data it has read into the data format defined by the visualization unit. The visualization unit S113 is mainly used to present the integrated data graphically to the end user. The human-computer interaction unit S114 displays and adjusts the variable parameters of the corpus construction unit S116, the Chinese word segmentation unit S117, the LDA topic model construction unit S119, the Gibbs sampling calculation unit S120, the cluster analysis unit S122 and the topic clustering data set management unit S123, including the choice of clustering algorithm in the cluster analysis unit S122. The analysis is then recomputed, and the data visualization module shows the new results on screen, replacing the old graphics.
Fig. 2 is a schematic diagram of the structure of the interactive topic visualization model of the embodiment. In this embodiment, the visual image of this structure is generated by the data visualization module and shown on a display. It is generated as follows:
Step 1: obtain the topic heat data set H = {h1, h2, h3, ..., hk} from the vector set database, where hi is the heat value of the i-th topic.
Step 2: draw the topic area S201 on the display screen, specifically:
Step 21: draw two concentric circles. In this embodiment the center c is the center point of the screen, the outer radius ro = screen height * 2/5 and the inner radius ri = screen height * 1/5.
Step 22: normalize the topic heat values in H to obtain the normalized data set H' = {h1', h2', ..., hk'}, where hi' is the normalized heat value of topic i.
Step 23: according to the proportion of each topic heat value hi, divide the ring between the outer and inner circles drawn in step 21 into k sectors, each representing one topic; the sector angle of topic i, in radians, is 2*PI*hi'.
Step 3: draw the word cloud S202 in each sector, specifically:
Step 31: for topic i, access the vector set database and obtain the word vector of topic i, Wi = {{wi1, v1}, {wi2, v2}, ..., {win, vn}}, where wip is the p-th word contained in topic i, such as "football" or "mobile phone", and vp is the value of wip, that is, the importance of this word to topic i.
Step 32: normalize the values v in Wi to obtain Wi' = {{wi1, v1'}, {wi2, v2'}, ..., {win, vn'}}, where vp' is the normalized value of vp.
Step 33: generate the word cloud in the sector corresponding to topic i; the font size of the p-th word of the word cloud = set initial size * vp' * hi'. A word whose font size is smaller than the set minimum threshold is not displayed. In this embodiment the initial font is set to Song typeface, size 18, and the minimum font size threshold is set to size 2.
Step 34: place each word of the word cloud horizontally. In this embodiment the word cloud is placed in sector i, and each word is rotated according to the angle of its center point about the center of the circles, so that the words are displayed horizontally regardless of the angle of sector i.
Step 4: draw the document clusters S204, specifically:
Step 41: obtain the size information of the document clusters from the document cluster database: SC = {sc1, sc2, ..., scy}, where sci is the number of documents contained in the i-th document cluster.
Step 42: normalize SC to obtain SC' = {sc1', sc2', ..., scy'}, where sci' is the normalized number of documents contained in the i-th document cluster.
Step 43: inside the inner circle drawn in step 21, draw one circle for each document cluster, with a radius proportional to the normalized value sci'; the circles are arranged in a spiral from the outside inward, in descending order of radius.
The sectors of the topic area in the visual image can be triggered. Specifically, after the sector corresponding to topic i is triggered, the data visualization module obtains from the document cluster database the proportions TC = {tc1, tc2, ..., tcy} of the documents containing topic i within their document clusters, where tcs is the proportion of documents in document cluster s that contain topic i, and draws a sector in each corresponding document cluster with angle Ai = 2*PI*tcs radians.
The circular areas of the document clusters in the visual image can also be triggered. Specifically, after the circular area corresponding to a document cluster is triggered, the data visualization module reads from the document cluster database the proportions CT = {ct1, ct2, ..., ctk} of the topics contained in that document cluster, where cti is the share of topic i among all topics contained in the selected document cluster, re-divides the topic area into sectors according to CT, and generates a word cloud in the sector corresponding to each topic. The word cloud is generated in the same way and under the same constraints as in step 33.
As shown in Fig. 2, the visualization structure is based on a pie chart and consists of two basic parts: the topic area S201 and the document clustering area S203.
The topic area S201 is based on a pie chart and shows the topics of the whole corpus. The angle of each pie sector expresses the quantified proportion of the topic's heat: the hotter the topic, the larger the sector angle.
The word cloud S202 shows the content of the topic in each sector, namely the words it contains and their weights. The word cloud uses the tag cloud technique to express the weight of each word in its topic: the larger the font of a word, the greater its weight in the topic, that is, the better it expresses the meaning of the topic. The word cloud is displayed only within the sector of its topic, and its size is determined by the area of the sector. If the sector area is too small, the word cloud of that sector is not displayed.
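The font size rule of step 33 (initial size multiplied by the normalized word weight and the normalized topic heat, with a minimum threshold) can be sketched in Python as follows. The function name word_font_sizes is hypothetical, the sizes are treated as plain numbers, and the normalized values are taken as inputs because the patent does not say how they are normalized. The defaults of 18 and 2 follow the embodiment.

```python
def word_font_sizes(word_weights, topic_heat, base_size=18.0, min_size=2.0):
    """Compute the font size of each word in one topic's word cloud.

    word_weights: dict mapping each word to its normalized weight in the topic.
    topic_heat: normalized heat value of the topic.
    Words whose size falls below min_size are omitted, i.e. not displayed.
    """
    sizes = {}
    for word, weight in word_weights.items():
        size = base_size * weight * topic_heat
        if size >= min_size:
            sizes[word] = size
    return sizes
```

Because the size is scaled by the topic heat, a topic with a small sector automatically receives small words, and a sufficiently cold topic ends up with no visible word cloud at all.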
The document clustering area S203 shows the result of the topic clustering of the documents. It contains the document clusters S204 and the topic distribution of the document clusters S205.
The document clusters S204 represent the clustering result as circles. The radius of a circle expresses the number of documents in the document cluster: the larger the radius, the more documents the cluster contains. The document clusters are arranged in a spiral in descending order within the document clustering area S203, which makes them easy to compare.
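The patent gives no formula for the spiral arrangement. The sketch below shows one simple possibility in Python: circles are sorted by size and placed along an Archimedean spiral that runs from the edge of the inner circle toward its center. The function name spiral_layout, the number of turns and the maximum circle radius are assumptions, and the sketch does not try to prevent circles from overlapping.

```python
import math


def spiral_layout(cluster_sizes, inner_radius, max_circle_radius=None,
                  turns=2.0):
    """Place one circle per document cluster on an inward spiral.

    cluster_sizes: number of documents in each cluster.
    Returns a list of (cluster_index, x, y, radius) with the largest
    cluster first; coordinates are relative to the center of the rings.
    """
    if not cluster_sizes:
        return []
    if max_circle_radius is None:
        max_circle_radius = inner_radius / 4
    total = sum(cluster_sizes)
    order = sorted(range(len(cluster_sizes)),
                   key=lambda i: cluster_sizes[i], reverse=True)
    largest_share = cluster_sizes[order[0]] / total
    placed = []
    for rank, i in enumerate(order):
        share = cluster_sizes[i] / total           # normalized size
        radius = max_circle_radius * share / largest_share
        t = rank / len(order)                      # 0 at the outer end
        angle = 2 * math.pi * turns * t
        dist = (inner_radius - max_circle_radius) * (1 - t)
        placed.append((i, dist * math.cos(angle), dist * math.sin(angle),
                       radius))
    return placed
```

The radius of each circle is proportional to its normalized document count, and the largest circle sits at the outer end of the spiral, so every circle stays inside the inner ring.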
In this embodiment, the displayed parts of the visualization structure, namely the topic area S201, the document clustering area S203 and the document clusters S204, are functional areas: clicking them during use updates the data and redraws the image.
The above are only preferred embodiments of the invention and are not intended to limit it. Those skilled in the art may make various modifications and variations to the invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.
Claims (12)
1. A visualization analysis system based on a text topic model, characterized in that the system comprises an Internet text data acquisition module, a corpus module, a topic analysis module, a topic clustering module, and a data visualization module;
The Internet text data acquisition module is used to collect web page text data from the Internet and to clean each piece of collected text data;
The corpus module is used to store the text data cleaned by the Internet text data acquisition module, and to perform Chinese word segmentation and word frequency statistics on the stored web page text data, generating word frequency data that comprise the mapping relations between words and the stored web page text data together with the word frequency statistics;
The topic analysis module is used to build a topic model from the word frequency data generated by the corpus module, to compute the topic model using the Gibbs sampling method, and to store and output the resulting document-topic vector set and topic-word vector set;
The topic clustering module performs cluster analysis on the document-topic vector set output by the topic analysis module, and stores and outputs the cluster data;
The data visualization module displays the data output by the topic analysis module and the topic clustering module in graphical form; the data visualization module is also used to display and adjust the variable parameters in the corpus module, the topic analysis module, and the topic clustering module.
2. The visualization analysis system based on a text topic model as claimed in claim 1, characterized in that the Internet text data acquisition module comprises a web page capture unit and a data cleaning unit;
The web page capture unit is used to collect the text data in web pages from the Internet; this unit uses web crawler technology: given seed websites, it follows the links on the seed websites to other websites, thereby crawling web pages automatically;
The data cleaning unit is used to clean the text data collected by the web page capture unit, removing data unrelated to the web page content; the retained data comprise the web page's title, author, time, source, and body content.
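As an illustration only, the capture and cleaning units of claim 2 could be sketched as a breadth-first crawl that starts from a seed page and strips non-content markup. The names `PageParser`, `crawl`, `fetch`, and `max_pages` are hypothetical and do not come from the patent; the sketch keeps only visible text rather than the separate title, author, time, and source fields, and it does not resolve relative links.

```python
from collections import deque
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collects outgoing links and visible text, skipping script/style content."""

    def __init__(self):
        super().__init__()
        self.links, self.text, self._skip = [], [], 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a script/style element.
        if not self._skip and data.strip():
            self.text.append(data.strip())


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from `seed`; `fetch(url)` returns HTML or None."""
    seen, queue, cleaned = {seed}, deque([seed]), {}
    while queue and len(cleaned) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = PageParser()
        parser.feed(html)
        cleaned[url] = " ".join(parser.text)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return cleaned
```

Passing the fetch function as a parameter keeps the sketch independent of any particular HTTP library; a real deployment would also need politeness delays and URL normalization.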
3. The visualization analysis system based on a text topic model as claimed in claim 2, characterized in that the corpus module comprises a corpus construction unit, a corpus, a Chinese word segmentation unit, a word frequency data management unit, and a word frequency database;
The corpus construction unit is used to store the cleaned text data in the corpus, which is based on a relational database;
The Chinese word segmentation unit is used to perform Chinese word segmentation on the data in the corpus and to remove stop words unrelated to the body content according to the stop-word list defined in this unit;
The word frequency data management unit performs word frequency statistics on the segmentation results obtained by the Chinese word segmentation unit and stores the resulting statistics in the word frequency database; the word frequency data stored in the word frequency database comprise the mapping relations between each word in the segmentation results and the text data in the corpus, together with the word frequency statistics; the statistics comprise the number of times each word in the segmentation results occurs in each corresponding piece of text data, and the occurrence count of each word contained in each piece of text data.
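The word frequency data described in claim 3 can be pictured with a minimal sketch, assuming the texts have already been segmented into word lists. The function name `build_word_frequency` and the in-memory dictionaries are illustrative stand-ins for the relational word frequency database, not the patent's implementation.

```python
from collections import Counter, defaultdict


def build_word_frequency(segmented_docs, stop_words):
    """Return (per-document word counts, word -> set of documents containing it)."""
    per_doc = {}
    postings = defaultdict(set)
    for doc_id, tokens in segmented_docs.items():
        # Drop stop words before counting, as the segmentation unit does.
        counts = Counter(t for t in tokens if t not in stop_words)
        per_doc[doc_id] = counts
        for word in counts:
            postings[word].add(doc_id)
    return per_doc, dict(postings)
```

Here `per_doc` holds the occurrence count of each word in each text, and `postings` holds the mapping between words and the texts that contain them.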
4. The visualization analysis system based on a text topic model as claimed in claim 3, characterized in that the topic analysis module comprises an LDA topic model construction unit, a Gibbs sampling calculation unit, a result vector set management unit, and a vector set database;
The LDA topic model construction unit is used to build an LDA topic model from the word frequency data;
The Gibbs sampling calculation unit is used to compute the LDA model using the Gibbs sampling method, obtaining a document-topic vector set that describes the topics contained in each piece of text data and a topic-word vector set that describes the keywords contained in each topic;
The result vector set management unit is used to save the vector sets obtained by the Gibbs sampling calculation unit in the vector set database, which is based on a relational database.
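The patent does not spell out the sampler, so the following is only a minimal sketch of one common choice, collapsed Gibbs sampling for LDA, written with the standard library. The function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the encoding of documents as lists of integer word ids are all assumptions made for illustration.

```python
import random


def lda_gibbs(docs, n_topics, vocab_size, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids.

    Returns (theta, phi): document-topic and topic-word probability vectors.
    """
    rng = random.Random(seed)
    n_dk = [[0] * n_topics for _ in docs]               # topic counts per document
    n_kw = [[0] * vocab_size for _ in range(n_topics)]  # word counts per topic
    n_k = [0] * n_topics                                # total words per topic
    z = []                                              # topic assignment of each token
    for d, doc in enumerate(docs):
        assignments = []
        for w in doc:
            k = rng.randrange(n_topics)                 # random initial assignment
            assignments.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(assignments)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the token's current assignment from the counts.
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Conditional probability of each topic given all other assignments.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1

    theta = [
        [(n_dk[d][t] + alpha) / (len(doc) + n_topics * alpha) for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta) for w in range(vocab_size)]
        for t in range(n_topics)
    ]
    return theta, phi
```

In this sketch `theta` plays the role of the document-topic vector set and `phi` the role of the topic-word vector set; each row is a probability distribution.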
5. The visualization analysis system based on a text topic model as claimed in claim 4, characterized in that the topic clustering module comprises a cluster analysis unit, a topic clustering data set management unit, and a document clustering database;
The cluster analysis unit is used to perform cluster analysis on the document-topic vector set to obtain text cluster data; the text cluster data comprise the texts contained in each document cluster and the document cluster to which each text belongs;
The topic clustering data set management unit is used to save the text cluster data in the document clustering database, which is based on a relational database.
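The claims do not name a clustering algorithm. Purely as an assumption, the sketch below uses k-means on the document-topic vectors and returns both views of the text cluster data: the cluster of each text and the texts in each cluster. The function name `kmeans_clusters` and the choice of the first `k` vectors as initial centroids are illustrative.

```python
def kmeans_clusters(vectors, k, iters=50):
    """Cluster document-topic vectors with k-means.

    Returns (labels, clusters): the cluster index of each document and the
    document indices contained in each cluster.
    """
    centroids = [list(v) for v in vectors[:k]]  # simple deterministic initialisation
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    clusters = {c: [i for i, label in enumerate(labels) if label == c] for c in range(k)}
    return labels, clusters
```

A production system would likely use a better initialisation and a library implementation; the point here is only the shape of the output that the document clustering database would store.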
6. The visualization analysis system based on a text topic model as claimed in claim 5, characterized in that the data visualization module comprises a data integration unit, a visualization unit, and a human-computer interaction unit;
The data integration unit is used to read the document-topic vector set data and the topic-word vector set data from the vector set database, to read the text cluster data from the document clustering database, and to convert the data read into the data format defined by the visualization unit;
The visualization unit is mainly used to present the integrated data to the end user in graphical form;
The human-computer interaction unit is used to adjust the variable parameters of each unit that has a computing or screening function in the corpus module, the topic analysis module, and the topic clustering module.
7. The visualization analysis system based on a text topic model as claimed in claim 6, characterized in that the units having a computing or screening function comprise the corpus construction unit, the Chinese word segmentation unit, the LDA topic model construction unit, the Gibbs sampling calculation unit, the cluster analysis unit, and the topic clustering data set management unit.
8. The visualization analysis system based on a text topic model according to any one of claims 1-7, characterized in that the result vector set management unit further has a topic heat subunit, which is used to calculate the topic heat and store the result in the vector set database.
9. The visualization analysis system based on a text topic model as claimed in claim 8, characterized in that the method by which the data visualization module generates the visual image is:
Step 1, obtain the topic heat data set from the vector set database:
H = {h1, h2, h3, ..., hk}, where hi is the heat value of the i-th topic;
Step 2, draw the topic region on the display screen, specifically:
Step 21, draw two concentric circles;
Step 22, normalize the topic heat values in the heat data set H to obtain the normalized data set H' = {h1', h2', ..., hk'}, where hi' is the normalized heat value of topic i;
Step 23, according to the proportion of each topic heat value hi, divide the region between the outer ring and the inner ring of the concentric circles drawn in step 21 into k sectors, each sector representing one topic; the arc of the sector of topic i, in radians, is 2*PI*hi';
Step 3, draw a word cloud in each sector, specifically:
Step 31, for topic i, access the vector set database and obtain the word vector contained in topic i, Wi = {{wi1, v1}, {wi2, v2}, ..., {win, vn}}, where wip is the content of the p-th word contained in topic i and vp is the numerical value of wip, that is, the importance of this word for topic i;
Step 32, normalize the values v in Wi to obtain Wi' = {{wi1, v1'}, {wi2, v2'}, ..., {win, vn'}}, where vp' is the normalized value of vp;
Step 33, generate the word cloud in the sector corresponding to topic i; the font size of the p-th word of the word cloud = preset original size * vp' * hi'. If the font size is smaller than size 2, the word is not displayed;
Step 34, place each word in the word cloud horizontally;
Step 4, draw the document clusters, specifically:
Step 41, obtain the size information of the document clusters from the document clustering database:
SC = {sc1, sc2, ..., scy}, where sci is the number of documents contained in the i-th document cluster;
Step 42, normalize SC to obtain SC' = {sc1', sc2', ..., scy'}, where sci' is the normalized value of the number of documents contained in the i-th document cluster;
Step 43, inside the inner ring of the concentric circles drawn in step 21, draw one circle for each document cluster, with a radius proportional to the normalized value sci'; the circles are arranged in a spiral from the outside inward in descending order of radius.
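The geometry in steps 2 to 4 can be illustrated with a short sketch. Several points are assumptions rather than statements of the patent: normalization is taken to mean division by the sum (which makes the sector arcs add up to a full circle), the spiral is an Archimedean one with a hypothetical `turns` parameter, and all function names are invented for the example.

```python
import math


def normalize(values):
    """Scale values so that they sum to 1 (assumed normalization)."""
    total = sum(values)
    return [v / total for v in values]


def topic_sectors(heat):
    """Return (start, end) angles in radians per topic; arc = 2*pi*h_i'."""
    sectors, start = [], 0.0
    for h in normalize(heat):
        end = start + 2 * math.pi * h
        sectors.append((start, end))
        start = end
    return sectors


def word_font_sizes(words, topic_heat_norm, base_size, min_size=2):
    """words: list of (word, weight). Font size = base_size * v_p' * h_i'."""
    weights = normalize([v for _, v in words])
    sizes = {}
    for (word, _), v in zip(words, weights):
        size = base_size * v * topic_heat_norm
        if size >= min_size:  # words below the minimum size are not displayed
            sizes[word] = size
    return sizes


def cluster_circles(doc_counts, max_radius, region_radius, turns=2):
    """One circle per cluster, largest first, on an inward spiral."""
    norm = normalize(doc_counts)
    order = sorted(range(len(norm)), key=lambda i: norm[i], reverse=True)
    biggest = norm[order[0]]
    circles = []
    for rank, i in enumerate(order):
        t = rank / max(len(order) - 1, 1)   # 0 at the outer end, 1 at the centre
        angle = 2 * math.pi * turns * t
        dist = region_radius * (1 - t)
        circles.append({
            "cluster": i,
            "radius": max_radius * norm[i] / biggest,  # proportional to sc_i'
            "x": dist * math.cos(angle),
            "y": dist * math.sin(angle),
        })
    return circles
```

For example, heat values of 1, 1, and 2 give the third topic half of the ring, and a word carrying three quarters of a topic's weight in a topic with normalized heat 0.5 is drawn at 0.375 times the base size.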
10. The visualization analysis system based on a text topic model as claimed in claim 9, characterized in that, in the visual image, the sectors of the topic region have a trigger function, specifically: after the sector corresponding to topic i is triggered, the data visualization module obtains from the document clustering database the proportions, within their document clusters, of the documents that contain topic i, TC = {tc1, tc2, ..., tcy}, where tcs is the proportion of documents in document cluster s that contain topic i; a sector is drawn in each corresponding document cluster, with arc Ai = 2*PI*tcs in radians.
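A small sketch of the computation behind claim 10 follows. The patent reads TC from the document clustering database and does not say how a document is judged to "contain" a topic; the `threshold` on the document-topic weight used here is therefore an assumption, as is the function name `topic_share_arcs`.

```python
import math


def topic_share_arcs(theta, labels, topic, threshold=0.2):
    """Arc (radians) per cluster: 2*pi times the share of its documents containing `topic`."""
    totals, hits = {}, {}
    for vec, cluster in zip(theta, labels):
        totals[cluster] = totals.get(cluster, 0) + 1
        if vec[topic] >= threshold:  # assumed test for "document contains the topic"
            hits[cluster] = hits.get(cluster, 0) + 1
    return {s: 2 * math.pi * hits.get(s, 0) / totals[s] for s in totals}
```

A cluster in which every document contains the selected topic is drawn as a full disc, and one in which half of the documents contain it is drawn as a half disc.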
11. The visualization analysis system based on a text topic model as claimed in claim 10, characterized in that, in the visual image, the circular regions of the document clusters have a trigger function, specifically: after the circular region corresponding to a document cluster is triggered, the data visualization module reads from the document clustering database the proportion information of the topics contained in that document cluster, CT = {ct1, ct2, ..., ctk}, where cti is the proportion of topic i among all the topics contained in the selected document cluster; the sectors of the topic region are re-divided according to CT, and a word cloud is generated in the sector corresponding to each topic.
12. The visualization analysis system based on a text topic model as claimed in claim 11, characterized in that, in step 34, the method of placing each word in the word cloud horizontally is specifically: each word of the word cloud is rotated according to the rotation angle of its center point about the corresponding circle center, ensuring that the words are displayed horizontally whatever the angle of the sector.
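One way to read claim 12 is as a counter-rotation: if a word's position is obtained by rotating a sector-local layout about the circle center, the glyph is turned back by the same angle so that its net rotation is zero. The sketch below shows that idea under this interpretation; `place_word` and its parameters are hypothetical names.

```python
import math


def place_word(center, local_offset, sector_angle):
    """Rotate a word's offset about `center` by `sector_angle`, keeping the text level.

    Returns (x, y, glyph_rotation, net_rotation), where glyph_rotation cancels the
    rotation of the sector frame so the word is drawn horizontally.
    """
    cos_a, sin_a = math.cos(sector_angle), math.sin(sector_angle)
    dx, dy = local_offset
    x = center[0] + dx * cos_a - dy * sin_a
    y = center[1] + dx * sin_a + dy * cos_a
    glyph_rotation = -sector_angle             # turn the glyph back by the frame angle
    net_rotation = sector_angle + glyph_rotation
    return x, y, glyph_rotation, net_rotation
```

The word's center still follows the sector, so the cloud stays inside its region, while the text itself always reads left to right.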
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610028107.6A CN105550365A (en) | 2016-01-15 | 2016-01-15 | Visualization analysis system based on text topic model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105550365A (en) | 2016-05-04 |
Family
ID=55829554
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610028107.6A Pending CN105550365A (en) | 2016-01-15 | 2016-01-15 | Visualization analysis system based on text topic model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105550365A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101419623A (en) * | 2008-12-09 | 2009-04-29 | 中山大学 | Geographical simulation optimizing system |
CN101980199A (en) * | 2010-10-28 | 2011-02-23 | 北京交通大学 | Method and system for discovering network hot topic based on situation assessment |
US20130151531A1 (en) * | 2011-12-13 | 2013-06-13 | Xerox Corporation | Systems and methods for scalable topic detection in social media |
CN103853821A (en) * | 2014-02-21 | 2014-06-11 | 河海大学 | Method for constructing multiuser collaboration oriented data mining platform |
CN104199974A (en) * | 2013-09-22 | 2014-12-10 | 中科嘉速(北京)并行软件有限公司 | Microblog-oriented dynamic topic detection and evolution tracking method |
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106055604A (en) * | 2016-05-25 | 2016-10-26 | 南京大学 | Short text topic model mining method based on word network to extend characteristics |
CN109478191A (en) * | 2016-07-25 | 2019-03-15 | 株式会社斯库林集团 | Text mining method, text mining program and text mining device |
CN109478191B (en) * | 2016-07-25 | 2022-04-08 | 株式会社斯库林集团 | Text mining method, recording medium, and text mining device |
CN106156364A (en) * | 2016-08-02 | 2016-11-23 | 西南石油大学 | A kind of method and system of calculating media event dynamic effect power based on time stream |
CN106250513A (en) * | 2016-08-02 | 2016-12-21 | 西南石油大学 | A kind of event personalization sorting technique based on event modeling and system |
CN106250513B (en) * | 2016-08-02 | 2021-04-23 | 西南石油大学 | Event modeling-based event personalized classification method and system |
CN106469138A (en) * | 2016-09-29 | 2017-03-01 | 东软集团股份有限公司 | The generation method of word cloud and device |
CN106469138B (en) * | 2016-09-29 | 2020-07-17 | 东软集团股份有限公司 | Word cloud generation method and device |
CN106777043A (en) * | 2016-12-09 | 2017-05-31 | 宁波大学 | A kind of academic resources acquisition methods based on LDA |
CN106682231A (en) * | 2017-01-10 | 2017-05-17 | 深圳淞鑫金融服务科技发展有限公司 | Graphical visual display method and device for big data |
CN107066585A (en) * | 2017-04-17 | 2017-08-18 | 济南大学 | A kind of probability topic calculates the public sentiment monitoring method and system with matching |
CN107066585B (en) * | 2017-04-17 | 2019-10-01 | 济南大学 | A kind of probability topic calculates and matched public sentiment monitoring method and system |
CN107833271B (en) * | 2017-09-30 | 2020-04-07 | 中国科学院自动化研究所 | Skeleton redirection method and device based on Kinect |
CN107833271A (en) * | 2017-09-30 | 2018-03-23 | 中国科学院自动化研究所 | A kind of bone reorientation method and device based on Kinect |
CN108334591A (en) * | 2018-01-30 | 2018-07-27 | 天津中科智能识别产业技术研究院有限公司 | Industry analysis method and system based on focused crawler technology |
CN108573155B (en) * | 2018-04-18 | 2020-10-16 | 北京知道创宇信息技术股份有限公司 | Method and device for detecting vulnerability influence range, electronic equipment and storage medium |
CN108573155A (en) * | 2018-04-18 | 2018-09-25 | 北京知道创宇信息技术有限公司 | Detect method, apparatus, electronic equipment and the storage medium of loophole coverage |
CN109189934A (en) * | 2018-11-13 | 2019-01-11 | 平安科技(深圳)有限公司 | Public sentiment recommended method, device, computer equipment and storage medium |
CN110750646A (en) * | 2019-10-16 | 2020-02-04 | 乐山师范学院 | Attribute description extracting method for hotel comment text |
CN110750646B (en) * | 2019-10-16 | 2022-12-06 | 乐山师范学院 | Attribute description extracting method for hotel comment text |
CN112269871A (en) * | 2020-10-12 | 2021-01-26 | 国网新疆电力有限公司信息通信公司 | Data visualization analysis method and device based on LDA topic generation model |
CN113378512A (en) * | 2021-07-05 | 2021-09-10 | 中国科学技术信息研究所 | Automatic indexing-based generation method of stepless dynamic evolution theme cloud picture |
CN113378512B (en) * | 2021-07-05 | 2023-05-26 | 中国科学技术信息研究所 | Automatic indexing-based stepless dynamic evolution subject cloud image generation method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| C06 | Publication | |
| PB01 | Publication | |
| C10 | Entry into substantive examination | |
| SE01 | Entry into force of request for substantive examination | |
| TA01 | Transfer of patent application right | Effective date of registration: 20180601. Address after: 518057 A, 6 yuan, Zhuyuan garden, No. 5 KELONG Road, Yuhai street, Nanshan District, Shenzhen, Guangdong. Applicant after: Zhongke Jun Sheng (Shenzhen) intelligent data science and Technology Development Co., Ltd. Address before: 100080 No. 95 East Zhongguancun Road, Beijing, Haidian District. Applicant before: Institute of Automation, Chinese Academy of Sciences |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20160504 |