CN105550365A

CN105550365A - Visualization analysis system based on text topic model

Info

Publication number: CN105550365A
Application number: CN201610028107.6A
Authority: CN
Inventors: 王健; 张桂刚; 杨颐; 黄卫星
Original assignee: Institute of Automation of Chinese Academy of Science
Current assignee: Zhongke Jun Sheng (shenzhen) Intelligent Data Science And Technology Development Co Ltd
Priority date: 2016-01-15
Filing date: 2016-01-15
Publication date: 2016-05-04

Abstract

The invention discloses a visualization analysis system based on a text topic model. The visualization analysis system comprises an internet text data acquisition module, a corpus module, a topic analysis module, a topic clustering module and a data visualization module, wherein the internet text data acquisition module is used for acquiring and cleaning webpage text data from an internet; the corpus module is used for performing Chinese word segmentation and word frequency statistics of the webpage text data; the topic analysis module is used for generating a document-topic vector set and a topic-word vector set; the topic clustering module is used for performing clustering analysis of the document-topic vector set; and the data visualization module is used for performing data display and variable parameter adjustment. According to the invention, optimization of the analysis effect and dynamic adjustment of variable parameters in the analysis process are realized; and thus, the analysis efficiency is increased.

Description

Visualization analysis system based on a text topic model

Technical field

The present invention relates to the field of internet text topic analysis, and in particular to a visualization analysis system based on a text topic model.

Background

The internet holds a massive amount of text, such as news reports, literary criticism, and popular science writing, in many forms, including news pages, blogs, and microblogs. Topic analysis of this text can reveal the hot topics currently under discussion online. Hot topics have many practical applications, such as forecasting industry trends, recommending popular products, and analyzing online public opinion.

Data visualization is an interdisciplinary field that combines computer graphics, psychology, and human-computer interaction. It uses visualization algorithms to build graphical visualization models that display multidimensional or high-dimensional data. A visualization model combined with human-computer interaction supports dynamic analysis from multiple angles. The main purpose of data visualization is to present data graphically so that users understand complex data more easily and analyze it more efficiently.

An intuitive, easy-to-understand visualization of result data helps users understand the analysis results and greatly improves analysis efficiency. Analysis results can be understood from different angles: for example, from the angle of a particular topic, to see how it is distributed across documents, or from the angle of a particular document cluster, to analyze the topics it contains. A single static visualization can hardly show all of these cases at once. A static visualization model therefore needs to be combined with human-computer interaction so that it dynamically presents the angle the user wants. In addition, because each analysis stage involves relatively independent data processing, the parameter settings of each sub-analysis module directly affect the result of the overall analysis. During topic analysis and clustering, the user should therefore be able to adjust parameters to optimize the overall result. An interactive visualization model lets the user adjust parameters dynamically in the graphical interface and see the adjusted results in real time.

Summary of the invention

In view of the above problems, the object of the present invention is to provide a visualization analysis system based on a text topic model that optimizes the analysis effect, allows variable parameters to be adjusted dynamically during analysis, and improves analysis efficiency.

To achieve this object, the invention discloses a visualization analysis system based on a text topic model. The system comprises an internet text data acquisition module, a corpus module, a topic analysis module, a topic clustering module, and a data visualization module;

The internet text data acquisition module collects web page text data from the internet and cleans each collected piece of text data;

The corpus module stores the text data cleaned by the internet text data acquisition module, performs Chinese word segmentation and word-frequency statistics on the stored web page text data, and generates word-frequency data comprising the mapping between words and the stored web page text data together with the word-frequency statistics;

The topic analysis module builds a topic model from the word-frequency data generated by the corpus module, computes the model using Gibbs sampling, and stores and outputs the resulting document-topic vector set and topic-word vector set;

The topic clustering module performs cluster analysis on the document-topic vector set output by the topic analysis module, and stores and outputs the cluster data;

The data visualization module displays the data output by the topic analysis module and the topic clustering module graphically; it also displays and adjusts the variable parameters of the corpus module, the topic analysis module, and the topic clustering module.

Preferably, the internet text data acquisition module comprises a webpage capture unit and a data cleaning unit;

The webpage capture unit collects text data from web pages on the internet. It uses web crawler technology: given seed websites, it follows their links to other websites and thus crawls web pages automatically;

The data cleaning unit cleans the text data collected by the webpage capture unit, removing data unrelated to the page content. The retained data includes the page's title, author, time, source, and body text.
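The link-following behaviour of the webpage capture unit can be sketched as a breadth-first traversal. This is a minimal illustration, not the patented implementation: the function `crawl`, the callback `fetch_links`, the `max_pages` limit, and the toy link graph are all hypothetical names introduced here, and a real crawler would download and parse each page instead of looking it up in a dictionary.

```python
from collections import deque
from typing import Callable, Iterable


def crawl(seeds: Iterable[str],
          fetch_links: Callable[[str], list[str]],
          max_pages: int = 100) -> list[str]:
    """Visit pages breadth-first, starting from the seed URLs.

    fetch_links is a hypothetical callback that returns the outgoing
    links of one page; a real crawler would download and parse the page.
    """
    queue = deque(dict.fromkeys(seeds))   # de-duplicated seeds, order kept
    seen = set(queue)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:          # never enqueue a page twice
                seen.add(link)
                queue.append(link)
    return visited


# Toy link graph standing in for the web (illustrative data only).
LINKS = {
    "seed": ["a", "b"],
    "a": ["b", "c"],
    "b": ["seed"],
    "c": [],
}
order = crawl(["seed"], lambda url: LINKS.get(url, []))
print(order)  # ['seed', 'a', 'b', 'c']
```

Keeping the page fetcher behind a callback lets the traversal logic be exercised without network access; the cleaning step would then run on each visited page.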

Preferably, the corpus module comprises a corpus construction unit, a corpus, a Chinese word segmentation unit, a word-frequency data management unit, and a word-frequency database;

The corpus construction unit stores the cleaned text data in the corpus, which is based on a relational database;

The Chinese word segmentation unit performs Chinese word segmentation on the data in the corpus and removes stop words unrelated to the body text according to the stop-word list defined in this unit;

The word-frequency data management unit performs word-frequency statistics on the segmentation results produced by the Chinese word segmentation unit and stores the statistics in the word-frequency database. The word-frequency data stored there comprises the mapping between each word in the segmentation results and the text data in the corpus, together with the statistics produced by the word-frequency data management unit. The statistics comprise the number of times each word in the segmentation results occurs in each corresponding piece of text data, and the occurrence counts of the words contained in each piece of text data.
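The two kinds of word-frequency data described above (per-document counts and the word-to-document mapping) can be sketched as follows. The sketch assumes the texts have already been segmented into tokens; the function name `word_frequency`, the tiny stop-word list, and the sample documents are illustrative assumptions, and in-memory dictionaries stand in for the relational word-frequency database.

```python
from collections import Counter, defaultdict

STOP_WORDS = {"的", "了", "是"}  # illustrative stop-word list


def word_frequency(segmented_docs: dict[str, list[str]]):
    """Return (per-document word counts, word -> documents mapping)."""
    doc_counts = {}
    word_to_docs = defaultdict(set)
    for doc_id, tokens in segmented_docs.items():
        kept = [t for t in tokens if t not in STOP_WORDS]  # drop stop words
        doc_counts[doc_id] = Counter(kept)
        for token in kept:
            word_to_docs[token].add(doc_id)
    return doc_counts, dict(word_to_docs)


docs = {
    "d1": ["足球", "的", "比赛", "足球"],
    "d2": ["手机", "是", "足球"],
}
counts, mapping = word_frequency(docs)
print(counts["d1"]["足球"])       # 2
print(sorted(mapping["足球"]))    # ['d1', 'd2']
```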

Preferably, the topic analysis module comprises an LDA topic model construction unit, a Gibbs sampling calculation unit, a result vector set management unit, and a vector set database;

The LDA topic model construction unit builds an LDA topic model from the word-frequency data;

The Gibbs sampling calculation unit computes the LDA model using Gibbs sampling, obtaining the document-topic vector set, which describes the topics contained in each piece of text data, and the topic-word vector set, which describes the keywords contained in each topic.

The result vector set management unit saves the vector sets obtained by the Gibbs sampling calculation unit in the vector set database, which is based on a relational database.
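For orientation, the computation performed by the Gibbs sampling calculation unit can be sketched with a standard collapsed Gibbs sampler for LDA. The patent does not give the sampler's details, so everything here is an assumption for illustration: the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, the iteration count, and the toy corpus of word-id lists.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, iters=200,
              alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on documents given as word-id lists.

    Returns (theta, phi): document-topic and topic-word distributions.
    """
    rng = random.Random(seed)
    n_dk = [[0] * num_topics for _ in docs]               # doc-topic counts
    n_kw = [[0] * vocab_size for _ in range(num_topics)]  # topic-word counts
    n_k = [0] * num_topics                                # tokens per topic
    z = []                                                # topic of each token
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            k = rng.randrange(num_topics)                 # random initial topic
            z[d].append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the token's current assignment from the counts.
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Unnormalized conditional p(z = t | all other assignments).
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta)
                    / (n_k[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1

    theta = [[(n_dk[d][t] + alpha) / (len(doc) + num_topics * alpha)
              for t in range(num_topics)] for d, doc in enumerate(docs)]
    phi = [[(n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta)
            for w in range(vocab_size)] for t in range(num_topics)]
    return theta, phi


# Toy corpus: word ids 0-2 and 3-5 never co-occur in a document.
corpus = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 2], [5, 4, 3, 5]]
theta, phi = lda_gibbs(corpus, num_topics=2, vocab_size=6)
```

Here each row of `theta` plays the role of one document-topic vector and each row of `phi` one topic-word vector; both are probability distributions, so each row sums to 1.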

Preferably, the topic clustering module comprises a cluster analysis unit, a topic clustering data set management unit, and a document clustering database;

The cluster analysis unit performs cluster analysis on the document-topic vector set to obtain text cluster data, which comprises the texts contained in each document cluster and the document cluster to which each text belongs;

The topic clustering data set management unit saves the text cluster data in the document clustering database, which is based on a relational database.

Preferably, the data visualization module comprises a data integration unit, a visualization unit, and a human-computer interaction unit;

The data integration unit reads the document-topic vector set data and the topic-word vector set data from the vector set database, reads the text cluster data from the document clustering database, and converts the data it has read into the data schema defined by the visualization unit;

The visualization unit presents the integrated data to the end user graphically;

The human-computer interaction unit adjusts the variable parameters of the units with calculation and filtering functions in the corpus module, the topic analysis module, and the topic clustering module.

Preferably, the units with calculation and filtering functions comprise the corpus construction unit, the Chinese word segmentation unit, the LDA topic model construction unit, the Gibbs sampling calculation unit, the cluster analysis unit, and the topic clustering data set management unit.

Preferably, the result vector set management unit also has a topic hotness subunit, which calculates topic hotness and stores the result in the vector set database.

Preferably, the data visualization module generates the visual image as follows:

Step 1: obtain the topic hotness data set H = {h1, h2, h3, ..., hk} from the vector set database, where hi is the hotness value of the i-th topic;

Step 2: draw the topic area on the display screen, specifically:

Step 21: draw two concentric circles;

Step 22: normalize the topic hotness values in the data set H to obtain the normalized data set H' = {h1', h2', ..., hk'}, where hi' is the normalized hotness value of topic i;

Step 23: according to the proportion of each topic's hotness value hi, divide the region between the outer and inner circles drawn in step 21 into k sectors, each representing one topic; the sector angle of topic i in radians = 2*PI*hi';
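Steps 22 and 23 can be illustrated with a short sketch. It assumes that normalization means dividing each hotness value by the sum of all values, which is the choice that makes the k sector angles add up to a full circle; the text does not state the normalization explicitly. The function name `sector_angles` and the sample values are illustrative.

```python
import math


def sector_angles(hotness: list[float]) -> list[tuple[float, float]]:
    """Return (start, end) angles in radians for each topic's sector."""
    total = sum(hotness)
    normalized = [h / total for h in hotness]   # h_i' (assumed: share of total)
    sectors = []
    start = 0.0
    for share in normalized:
        sweep = 2 * math.pi * share             # sector angle = 2*PI*h_i'
        sectors.append((start, start + sweep))
        start += sweep
    return sectors


H = [30.0, 10.0, 60.0]    # illustrative hotness values
for i, (a0, a1) in enumerate(sector_angles(H), start=1):
    print(f"topic {i}: {math.degrees(a0):.0f} to {math.degrees(a1):.0f} degrees")
```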

Step 3: draw a word cloud in each sector, specifically:

Step 31: for topic i, access the vector set database and obtain the word vector of topic i, Wi = {{wi1, v1}, {wi2, v2}, ..., {win, vn}}, where wip is the p-th word contained in topic i and vp is the value of wip, that is, the importance of this word to topic i.

Step 32: normalize the values v in Wi to obtain Wi' = {{wi1, v1'}, {wi2, v2'}, ..., {win, vn'}}, where vp' is the normalized value of vp.

Step 33: generate a word cloud in the sector corresponding to topic i; the font size of the p-th word of the word cloud = preset original size * vp' * hi'. If the font size is smaller than size 2, the word is not displayed;

Step 34: place each word in the word cloud horizontally;
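The font-size rule of step 33 can be sketched as below. The normalization used for the word values is not specified in the text; this sketch assumes scaling by the maximum, so that the most important word of the hottest topic appears at the original size, and that choice is an assumption rather than the patented method. The base size 18 and minimum size 2 are the values given in the embodiment; `max_scale`, `font_sizes`, and the sample words are hypothetical.

```python
BASE_SIZE = 18.0   # preset original font size (value used in the embodiment)
MIN_SIZE = 2.0     # words smaller than this are not displayed


def max_scale(values: list[float]) -> list[float]:
    """Scale values so the largest becomes 1.0 (an assumed normalization)."""
    top = max(values)
    return [v / top for v in values]


def font_sizes(words: list[tuple[str, float]],
               topic_hotness: float) -> dict[str, float]:
    """Font size of each displayed word: BASE_SIZE * v_p' * h_i'."""
    scaled = max_scale([v for _, v in words])
    sizes = {}
    for (word, _), v_norm in zip(words, scaled):
        size = BASE_SIZE * v_norm * topic_hotness
        if size >= MIN_SIZE:          # drop words below the minimum size
            sizes[word] = size
    return sizes


topic_words = [("football", 0.40), ("league", 0.20), ("referee", 0.02)]
sizes = font_sizes(topic_words, topic_hotness=1.0)
print(sizes)
```

With these sample values the third word falls below the minimum size and is omitted from the result.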

Step 4: draw the document clusters, specifically:

Step 41: obtain the size information of the document clusters from the document clustering database: SC = {sc1, sc2, ..., scy}, where sci is the number of documents contained in the i-th document cluster;

Step 42: normalize SC to obtain SC' = {sc1', sc2', ..., scy'}, where sci' is the normalized number of documents contained in the i-th document cluster;

Step 43: inside the inner circle of the concentric circles drawn in step 21, draw one circle for each document cluster, with radius proportional to the normalized value sci'; the circles are arranged in descending order of radius along a spiral running from the outside inward.
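One possible reading of step 43 is sketched below: clusters are sorted by size, and their circles are placed along a spiral whose distance from the center shrinks with rank. The spiral shape, the `turns` parameter, the scaling by the largest cluster, and the function name `spiral_layout` are all assumptions; the sketch keeps every circle inside the inner circle but does not attempt to prevent circles from overlapping.

```python
import math


def spiral_layout(cluster_sizes: list[int], inner_radius: float,
                  max_circle_radius: float, turns: float = 2.0):
    """Place one circle per cluster along an inward spiral, largest first.

    Returns (cluster_index, x, y, radius) tuples relative to the center.
    """
    biggest = max(cluster_sizes)
    order = sorted(range(len(cluster_sizes)),
                   key=lambda i: cluster_sizes[i], reverse=True)
    placed = []
    for rank, idx in enumerate(order):
        # Radius proportional to the normalized cluster size sc_i'.
        radius = max_circle_radius * cluster_sizes[idx] / biggest
        t = rank / len(order)                  # 0 at the outside, growing inward
        dist = (inner_radius - max_circle_radius) * (1.0 - t)
        angle = 2 * math.pi * turns * t
        placed.append((idx, dist * math.cos(angle), dist * math.sin(angle), radius))
    return placed


placed = spiral_layout([5, 20, 10], inner_radius=100.0, max_circle_radius=20.0)
for idx, x, y, r in placed:
    print(f"cluster {idx}: center=({x:.1f}, {y:.1f}) radius={r:.1f}")
```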

Preferably, in the visual image, the sectors of the topic area have a trigger function: when the sector corresponding to topic i is triggered, the data visualization module obtains from the document clustering database the proportions of documents containing topic i in each document cluster, TC = {tc1, tc2, ..., tcy}, where tcs is the proportion of documents in document cluster s that contain topic i, and draws a sector in the corresponding document cluster with angle Ai = 2*PI*tcs in radians.

Preferably, in the visual image, the circular regions of the document clusters have a trigger function: when the circular region corresponding to a document cluster is triggered, the data visualization module reads from the document clustering database the proportions of the topics contained in that cluster, CT = {ct1, ct2, ..., ctk}, where cti is the share of topic i among all topics contained in the selected document cluster; the topic area is then re-divided into sectors according to CT, and a word cloud is generated in the sector corresponding to each topic.
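The data behind the sector trigger can be sketched as follows. The text does not say when a document counts as containing a topic, so this sketch assumes a simple weight threshold on the document-topic vector; the threshold value, the function name `topic_share_per_cluster`, and the sample data are hypothetical.

```python
import math


def topic_share_per_cluster(doc_topics, doc_cluster, topic, threshold=0.2):
    """tc_s: fraction of the documents in cluster s that contain the topic.

    A document is assumed to "contain" a topic when the topic's weight in
    its document-topic vector reaches `threshold` (a hypothetical rule).
    """
    totals, hits = {}, {}
    for vector, cluster in zip(doc_topics, doc_cluster):
        totals[cluster] = totals.get(cluster, 0) + 1
        if vector[topic] >= threshold:
            hits[cluster] = hits.get(cluster, 0) + 1
    return {s: hits.get(s, 0) / totals[s] for s in totals}


doc_topics = [[0.9, 0.1], [0.6, 0.4], [0.1, 0.9], [0.3, 0.7]]
doc_cluster = [0, 0, 1, 1]
tc = topic_share_per_cluster(doc_topics, doc_cluster, topic=0)
angles = {s: 2 * math.pi * share for s, share in tc.items()}   # A = 2*PI*tc_s
print(tc)       # {0: 1.0, 1: 0.5}
```

The cluster trigger works in the opposite direction: the shares CT of the topics within one selected cluster would replace the hotness values when the topic area is re-divided into sectors.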

Preferably, in step 34, each word in the word cloud is placed horizontally as follows: each word of the word cloud is rotated according to the rotation angle of its center point about the center of the circles, so that the words are displayed horizontally whatever the angle of sector k.
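One way to read this rule is that a word's position rotates together with its sector, while the word itself is turned back by the same angle about its own center. The sketch below follows that reading; it is an interpretation, and the function name `place_word`, its parameters, and the coordinate convention are assumptions.

```python
import math


def place_word(center, local_xy, sector_rotation):
    """Rotate a word's center with its sector and return the counter-rotation.

    center: (cx, cy) of the concentric circles.
    local_xy: word center in the sector's unrotated layout frame.
    sector_rotation: angle in radians by which the sector is rotated.
    Returns (x, y, counter_rotation); applying counter_rotation to the word
    about its own center cancels the sector rotation, so the text stays level.
    """
    cx, cy = center
    lx, ly = local_xy
    cos_a, sin_a = math.cos(sector_rotation), math.sin(sector_rotation)
    x = cx + lx * cos_a - ly * sin_a
    y = cy + lx * sin_a + ly * cos_a
    return x, y, -sector_rotation


x, y, counter = place_word((0.0, 0.0), (100.0, 0.0), math.pi / 2)
print(round(x, 6), round(y, 6), counter)
```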

The visualization analysis system based on a text topic model proposed by the invention presents internet text topic analysis and topic clustering intuitively and graphically through an interactive topic visualization model, and supports dynamic adjustment of variable parameters, which optimizes the analysis effect and improves analysis efficiency.

Brief description of the drawings

Fig. 1 shows the framework of the visualization analysis system based on a text topic model according to an embodiment of the invention;

Fig. 2 is a schematic diagram of the structure of the interactive topic visualization model according to an embodiment of the invention.

Detailed description

To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to specific embodiments and the accompanying drawings.

Fig. 1 shows the system framework of an embodiment of the invention. The visualization analysis system based on a text topic model of this embodiment comprises an internet text data acquisition module S101, a corpus module S102, a topic analysis module S103, a topic clustering module S104, and a data visualization module S105.

The internet text data acquisition module S101 collects web page text data from the internet and cleans each collected piece of text data. It comprises a webpage capture unit S114 and a data cleaning unit S115. The webpage capture unit S114 collects text data from web pages on the internet. It uses web crawler technology: starting from the seed websites provided, it follows their links to other websites and thus crawls web pages automatically. The data cleaning unit S115 cleans the text data collected by the webpage capture unit, removing data unrelated to the page content; the retained data includes the page's title, author, time, source, body text, and so on.

The corpus module S102 comprises a corpus construction unit S116, a corpus, a Chinese word segmentation unit S117, a word-frequency data management unit S118, and a word-frequency database. The corpus construction unit S116 stores the cleaned text data in the corpus, which is based on a relational database. The Chinese word segmentation unit S117 performs Chinese word segmentation on the data in the corpus and removes stop words unrelated to the body text according to the stop-word list defined in this unit. The word-frequency data management unit S118 performs word-frequency statistics on the segmentation results and stores the statistics in the word-frequency database; it also provides data access to the word-frequency database. The word-frequency data stored in the word-frequency database comprises the mapping between each word in the segmentation results and the text data in the corpus, together with the statistics produced by the word-frequency data management unit. The statistics comprise the number of times each word in the segmentation results occurs in each corresponding piece of text data, and the occurrence counts of the words contained in each piece of text data.

The topic analysis module S103 comprises an LDA topic model construction unit S119, a Gibbs sampling calculation unit S120, a result vector set management unit S121, and a vector set database. The LDA topic model construction unit S119 builds an LDA (Latent Dirichlet Allocation) topic model from the word-frequency data. The Gibbs sampling calculation unit S120 computes the LDA model using Gibbs sampling; the results are the document-topic vector set and the topic-word vector set, which describe, respectively, the topics contained in each piece of text data and the keywords contained in each topic. The result vector set management unit S121 saves the vector sets obtained by the Gibbs sampling calculation unit in the vector set database, which is based on a relational database, and provides a data access interface. The result vector set management unit S121 also has a topic hotness subunit, which calculates topic hotness and stores the result in the vector set database.

The topic clustering module S104 comprises a cluster analysis unit S122, a topic clustering data set management unit S123, and a document clustering database. The cluster analysis unit S122 contains several different clustering algorithms, such as K-means, OPTICS, and DBSCAN; the algorithm to use can be selected through the data visualization module. The unit performs cluster analysis on the document-topic vector set to obtain text cluster data, which comprises the texts contained in each document cluster and the document cluster to which each text belongs. In addition, a subunit computes statistics on the relationship between document clusters and topics, such as the topics contained in each document cluster and the document clusters associated with each topic. The topic clustering data set management unit S123 saves the cluster analysis results in the document clustering database, which is based on a relational database, and provides a data access interface.
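Of the algorithms named above, K-means is the simplest to sketch. The following minimal version clusters document-topic vectors by Euclidean distance; it is an illustration under stated assumptions, not the unit's actual implementation. The function name `kmeans`, the fixed iteration count, the random initialization, and the sample vectors are all hypothetical.

```python
import random


def kmeans(vectors, k, iters=20, seed=0):
    """Plain K-means; returns the cluster index assigned to each vector."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]   # random initial centers
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, v in enumerate(vectors):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centers[c])),
            )
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:                       # keep the old center if empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Illustrative document-topic vectors: two documents per dominant topic.
theta = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
labels = kmeans(theta, k=2)
print(labels)
```

The returned labels give the document cluster of each text; grouping document ids by label gives the texts contained in each cluster, which together form the text cluster data described above.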

The data visualization module S105 comprises a data integration unit S112, a visualization unit S113, and a human-computer interaction unit S114. The data integration unit S112 reads the document-topic vector set data and the topic-word vector set data from the vector set database, reads the text cluster data from the document clustering database, and converts the data it has read into the data schema defined by the visualization unit. The visualization unit S113 presents the integrated data to the end user graphically. The human-computer interaction unit S114 displays and adjusts the variable parameters of the corpus construction unit S116, the Chinese word segmentation unit S117, the LDA topic model construction unit S119, the Gibbs sampling calculation unit S120, the cluster analysis unit S122, and the topic clustering data set management unit S123, including the choice of clustering algorithm in the cluster analysis unit S122. After an adjustment, the results are recalculated and displayed on the screen by the data visualization module, replacing the old visualization.

Fig. 2 is a schematic diagram of the structure of the interactive topic visualization model of an embodiment of the invention. In this embodiment, the visual image of this structure is generated by the data visualization module and shown on a display. It is generated as follows:

Step 2: draw the topic area S201 on the display screen, specifically:

Step 21: draw two concentric circles. In this embodiment, the center c is the center point of the screen, the outer radius ro = screen height * 2/5, and the inner radius ri = screen height * 1/5;

Step 3: draw a word cloud S202 in each sector, specifically:

Step 31: for topic i, access the vector set database and obtain the word vector of topic i, Wi = {{wi1, v1}, {wi2, v2}, ..., {win, vn}}, where wip is the p-th word contained in topic i, such as "football" or "mobile phone", and vp is the value of wip, that is, the importance of this word to topic i;

Step 32: normalize the values v in Wi to obtain Wi' = {{wi1, v1'}, {wi2, v2'}, ..., {win, vn'}}, where vp' is the normalized value of vp;

Step 33: generate a word cloud in the sector corresponding to topic i; the font size of the p-th word of the word cloud = preset original size * vp' * hi'. If the font size is smaller than the set minimum threshold, the word is not displayed. In this embodiment, the original font is set to the Song typeface at size 18, and the minimum font size threshold is set to size 2;

Step 34: place each word of the word cloud horizontally. In this embodiment, the word cloud is placed in sector i, and each word is rotated according to the rotation angle of its center point about the center of the circles, so that the words are displayed horizontally whatever the angle of sector i.

Step 4: draw the document clusters S204, specifically:

In the visual image, the sectors of the topic area have a trigger function: when the sector corresponding to topic i is triggered, the data visualization module obtains from the document clustering database the proportions of documents containing topic i in each document cluster, TC = {tc1, tc2, ..., tcy}, where tcs is the proportion of documents in document cluster s that contain topic i, and draws a sector in the corresponding document cluster with angle Ai = 2*PI*tcs in radians.

In the visual image, the circular regions of the document clusters have a trigger function: when the circular region corresponding to a document cluster is triggered, the data visualization module reads from the document clustering database the proportions of the topics contained in that cluster, CT = {ct1, ct2, ..., ctk}, where cti is the share of topic i among all topics contained in the selected document cluster; the topic area is then re-divided into sectors according to CT, and a word cloud is generated in the sector corresponding to each topic by the same method and under the same constraints as in step 33.

The visualization structure shown in Fig. 2 is based on a pie chart and consists of two basic parts: the topic area S201 and the document clustering area S203.

The topic area S201, built on the pie chart, shows the topics of the whole corpus. The angle of each pie sector expresses the relative hotness of its topic: the hotter the topic, the larger the sector angle.

The word cloud S202 in each sector shows the content of the topic, namely the words it contains and their weights. The word cloud uses the tag cloud technique to express the weight of each word within its topic: the larger a word's font, the greater its weight in the topic, that is, the better it expresses the meaning of the topic. A word cloud is displayed only within the sector of its topic, and its size is determined by the sector's area. If the sector area is too small, the word cloud of that sector is not displayed.

The document clustering area S203 shows the result of topic clustering of the documents. It contains the document clusters S204 and the topic-distribution document clusters S205.

A document cluster S204 is drawn as a circle that represents a clustering result. The radius of the circle expresses the number of documents in the cluster: the larger the radius, the more documents the cluster contains. Within the document clustering area S203, the document clusters are arranged in descending order along a spiral, which makes them easy to compare.

In this embodiment, the displayed parts of the visualization structure, namely the topic area S201, the document clustering area S203, and the document clusters S204, are functional areas; in use, clicking them updates the data and redraws the image.

The above are only preferred embodiments of the present invention and are not intended to limit it; those skilled in the art may make various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A visualization analysis system based on a text topic model, characterized in that the system comprises an internet text data acquisition module, a corpus module, a topic analysis module, a topic clustering module, and a data visualization module;

2. The visualization analysis system based on a text topic model according to claim 1, characterized in that the internet text data acquisition module comprises a webpage capture unit and a data cleaning unit;

3. The visualization analysis system based on a text topic model according to claim 2, characterized in that the corpus module comprises a corpus construction unit, a corpus, a Chinese word segmentation unit, a word-frequency data management unit, and a word-frequency database;

4. The visualization analysis system based on a text topic model according to claim 3, characterized in that the topic analysis module comprises an LDA topic model construction unit, a Gibbs sampling calculation unit, a result vector set management unit, and a vector set database;

5. The visualization analysis system based on a text topic model according to claim 4, characterized in that the topic clustering module comprises a cluster analysis unit, a topic clustering data set management unit, and a document clustering database;

6. The visualization analysis system based on a text topic model according to claim 5, characterized in that the data visualization module comprises a data integration unit, a visualization unit, and a human-computer interaction unit;

7. The visualization analysis system based on a text topic model according to claim 6, characterized in that the units with calculation and filtering functions comprise the corpus construction unit, the Chinese word segmentation unit, the LDA topic model construction unit, the Gibbs sampling calculation unit, the cluster analysis unit, and the topic clustering data set management unit.

8. The visualization analysis system based on a text topic model according to any one of claims 1-7, characterized in that the result vector set management unit also has a topic hotness subunit, which calculates topic hotness and stores the result in the vector set database.

9. The visualization analysis system based on a text topic model according to claim 8, characterized in that the data visualization module generates the visual image as follows:

Step 1: obtain the topic hotness data set H = {h1, h2, h3, ..., hk} from the vector set database, where hi is the hotness value of the i-th topic;

Step 2: draw the topic area on the display screen, specifically:

Step 21: draw two concentric circles;

Step 3: draw a word cloud in each sector, specifically:

Step 34: place each word in the word cloud horizontally;

Step 4: draw the document clusters, specifically:

Step 41: obtain the size information of the document clusters from the document clustering database: SC = {sc1, sc2, ..., scy}, where sci is the number of documents contained in the i-th document cluster;

10. The visualization analysis system based on a text topic model according to claim 9, characterized in that, in the visual image, the sectors of the topic area have a trigger function: when the sector corresponding to topic i is triggered, the data visualization module obtains from the document clustering database the proportions of documents containing topic i in each document cluster, TC = {tc1, tc2, ..., tcy}, where tcs is the proportion of documents in document cluster s that contain topic i, and draws a sector in the corresponding document cluster with angle Ai = 2*PI*tcs in radians.

11. The visualization analysis system based on a text topic model according to claim 10, characterized in that, in the visual image, the circular regions of the document clusters have a trigger function: when the circular region corresponding to a document cluster is triggered, the data visualization module reads from the document clustering database the proportions of the topics contained in that cluster, CT = {ct1, ct2, ..., ctk}, where cti is the share of topic i among all topics contained in the selected document cluster; the topic area is then re-divided into sectors according to CT, and a word cloud is generated in the sector corresponding to each topic.

12. The visualization analysis system based on a text topic model according to claim 11, characterized in that, in step 34, each word in the word cloud is placed horizontally as follows: each word of the word cloud is rotated according to the rotation angle of its center point about the center of the circles, so that the words are displayed horizontally whatever the angle of sector k.