CN105550365A - Visualization analysis system based on text topic model - Google Patents

Visualization analysis system based on text topic model

Info

Publication number
CN105550365A
Authority
CN
China
Prior art keywords
data
word
theme
text
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610028107.6A
Other languages
Chinese (zh)
Inventor
王健
张桂刚
杨颐
黄卫星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongke Jun Sheng (shenzhen) Intelligent Data Science And Technology Development Co Ltd
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN201610028107.6A priority Critical patent/CN105550365A/en
Publication of CN105550365A publication Critical patent/CN105550365A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/358 Browsing; Visualisation therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a visualization analysis system based on a text topic model. The visualization analysis system comprises an internet text data acquisition module, a corpus module, a topic analysis module, a topic clustering module and a data visualization module, wherein the internet text data acquisition module is used for acquiring and cleaning web page text data from the Internet; the corpus module is used for performing Chinese word segmentation and word frequency statistics on the web page text data; the topic analysis module is used for generating a document-topic vector set and a topic-word vector set; the topic clustering module is used for performing cluster analysis on the document-topic vector set; and the data visualization module is used for displaying data and adjusting variable parameters. The invention optimizes the analysis effect and allows variable parameters to be adjusted dynamically during analysis, thereby increasing analysis efficiency.

Description

A visualization analysis system based on a text topic model
Technical field
The present invention relates to the field of Internet text topic analysis, and in particular to a visualization analysis system based on a text topic model.
Background
The Internet holds a massive amount of text information, such as news reports, literary criticism and popular-science writing, in a wide variety of forms such as news web pages, blogs and microblogs. Topic analysis of this text can reveal the hot topics currently being discussed online. Hot topics support various useful applications, for example industry trend prediction, recommendation of popular products, and network public opinion analysis.
Data visualization is an interdisciplinary field combining computer graphics, psychology and human-computer interaction. Through visualization algorithms, it implements graphical visualization models for displaying multidimensional or high-dimensional data. A visualization model that incorporates human-computer interaction supports dynamic analysis from multiple angles. The main purpose of data visualization is to use graphical presentation to help users understand complex data and to improve the efficiency of data analysis.
Visualizing the result data intuitively and clearly helps users understand the analysis results and greatly improves analysis efficiency. Analysis results can be understood from different angles: for example, one can start from a particular topic and examine its distribution across the documents, or start from a particular document cluster and analyze the topics it contains. A static visualization method can hardly show all of these cases at once. A static visualization model therefore needs to be combined with human-computer interaction techniques so as to present dynamically the analysis angle the user wants. In addition, because each analysis stage involves relatively independent data processing, the parameter settings of each sub-analysis module directly affect the overall result. The user should therefore be able to adjust parameters during topic analysis and clustering so as to optimize the overall analysis effect. An interactive visualization model allows the user to adjust parameters dynamically in a graphical interface and see the adjusted analysis results in real time.
Summary of the invention
In view of the above problems, the object of the present invention is to propose a visualization analysis system based on a text topic model that optimizes the analysis effect, allows variable parameters to be adjusted dynamically during analysis, and improves analysis efficiency.
To achieve this object, the invention discloses a visualization analysis system based on a text topic model, the system comprising an internet text data acquisition module, a corpus module, a topic analysis module, a topic clustering module and a data visualization module;
The internet text data acquisition module is used for collecting web page text data from the Internet and cleaning each collected text;
The corpus module is used for storing the text data cleaned by the internet text data acquisition module, performing Chinese word segmentation and word frequency statistics on the stored web page text data, and generating word frequency data comprising the mapping between words and the stored web page text data as well as the word frequency statistics;
The topic analysis module is used for building a topic model from the word frequency data generated by the corpus module, computing the topic model by Gibbs sampling, and storing and outputting the resulting document-topic vector set and topic-word vector set;
The topic clustering module is used for performing cluster analysis on the document-topic vector set output by the topic analysis module, and storing and outputting the cluster data;
The data visualization module is used for displaying, in graphical form, the data output by the topic analysis module and the topic clustering module; the data visualization module is also used for displaying and adjusting the variable parameters in the corpus module, the topic analysis module and the topic clustering module.
Preferably, the internet text data acquisition module comprises a web page capture unit and a data cleaning unit;
The web page capture unit is used for collecting the text data in web pages from the Internet; this unit uses web crawler technology: given seed websites, it follows the links on the seed websites to other websites, thereby crawling web pages automatically;
The data cleaning unit is used for cleaning the text data collected by the web page capture unit and removing data irrelevant to the web page content; the retained data comprise the title, author, time, source and body text of the web page.
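The crawl-and-clean behaviour described above can be illustrated with a minimal sketch. It is not the patented implementation: the names `LinkAndTextParser` and `crawl`, the `fetch` callable and the `max_pages` limit are assumptions introduced here, and only the title and body text are extracted (author, time and source would need site-specific rules).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTextParser(HTMLParser):
    """Collects hyperlinks, the page title, and visible text (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links, self.title, self.text = [], "", []
        self._skip = 0           # nesting depth inside <script>/<style>
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._skip:
            return               # drop data unrelated to the page content
        if self._in_title:
            self.title += data.strip()
        elif data.strip():
            self.text.append(data.strip())


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl from seed URLs; `fetch(url)` returns HTML or None."""
    queue, seen, records = deque(seeds), set(seeds), []
    while queue and len(records) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = LinkAndTextParser()
        parser.feed(html)
        records.append({"url": url, "title": parser.title,
                        "body": " ".join(parser.text)})
        for href in parser.links:
            target = urljoin(url, href)   # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return records
```

Passing `fetch` as a parameter keeps the sketch free of network code; in practice it would wrap an HTTP client and respect crawl-politeness rules.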
Preferably, the corpus module comprises a corpus construction unit, a corpus, a Chinese word segmentation unit, a word frequency data management unit and a word frequency database;
The corpus construction unit is used for storing the cleaned text data in the corpus, which is based on a relational database;
The Chinese word segmentation unit is used for performing Chinese word segmentation on the data in the corpus and removing stop words irrelevant to the body text according to the stop-word list defined in this unit;
The word frequency data management unit performs word frequency statistics on the segmentation result obtained by the Chinese word segmentation unit and stores the resulting statistics in the word frequency database; the word frequency data stored in the word frequency database comprise the mapping between each word in the segmentation result and the text data in the corpus, together with the statistics produced by the word frequency data management unit; the statistics comprise the number of times each word in the segmentation result occurs in each corresponding text, and the occurrence counts of the words contained in each text.
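As a minimal sketch of the word frequency statistics described above, the following assumes the texts have already been segmented into tokens by some Chinese word segmenter (not shown). The function name `word_frequencies` and the dictionary layout are assumptions made for illustration; the source stores these data in a relational database.

```python
from collections import Counter


def word_frequencies(segmented_docs, stop_words):
    """segmented_docs maps a document id to its list of tokens.

    Returns (per_doc, postings, corpus_totals):
      per_doc[doc_id][word] -> occurrences of word in that document
      postings[word]        -> set of document ids containing the word
      corpus_totals[word]   -> occurrences of word across the corpus
    """
    per_doc, postings = {}, {}
    for doc_id, tokens in segmented_docs.items():
        # Remove stop words before counting.
        counts = Counter(t for t in tokens if t not in stop_words)
        per_doc[doc_id] = counts
        for word in counts:
            postings.setdefault(word, set()).add(doc_id)
    corpus_totals = Counter()
    for counts in per_doc.values():
        corpus_totals.update(counts)
    return per_doc, postings, corpus_totals
```

Here `postings` plays the role of the word-to-text mapping, and `per_doc` holds the per-text occurrence counts.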
Preferably, the topic analysis module comprises an LDA topic model construction unit, a Gibbs sampling calculation unit, a result vector set management unit and a vector set database;
The LDA topic model construction unit is used for building an LDA topic model from the word frequency data;
The Gibbs sampling calculation unit is used for computing the LDA model by Gibbs sampling, obtaining a document-topic vector set describing the topics contained in each text and a topic-word vector set describing the keywords contained in each topic.
The result vector set management unit is used for saving the vector sets obtained by the Gibbs sampling calculation unit in the vector set database, which is based on a relational database.
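The source does not give the sampler's details, so the following is only a sketch of one standard choice, collapsed Gibbs sampling for LDA. The function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the representation of words as integer ids are assumptions. It returns a document-topic matrix `theta` and a topic-word matrix `phi`, corresponding to the two vector sets named above.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
              iterations=200, seed=0):
    """docs: list of documents, each a list of integer word ids."""
    rng = random.Random(seed)
    K, V = num_topics, vocab_size
    n_dk = [[0] * K for _ in docs]        # topic counts per document
    n_kw = [[0] * V for _ in range(K)]    # word counts per topic
    n_k = [0] * K                         # total words per topic
    z = []                                # topic assignment of every token
    for d, doc in enumerate(docs):        # random initialisation
        z_d = []
        for w in doc:
            k = rng.randrange(K)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]               # remove the token's current topic
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Conditional probability of each topic for this token.
                weights = [(n_dk[d][t] + alpha) * (n_kw[t][w] + beta)
                           / (n_k[t] + V * beta) for t in range(K)]
                k = rng.choices(range(K), weights=weights)[0]
                z[d][i] = k               # record the resampled topic
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    theta = [[(n_dk[d][k] + alpha) / (len(doc) + K * alpha)
              for k in range(K)] for d, doc in enumerate(docs)]
    phi = [[(n_kw[k][w] + beta) / (n_k[k] + V * beta)
            for w in range(V)] for k in range(K)]
    return theta, phi
```

Each row of `theta` and each row of `phi` sums to 1, so both can be stored directly as normalized vectors.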
Preferably, the topic clustering module comprises a cluster analysis unit, a topic clustering data set management unit and a document cluster database;
The cluster analysis unit is used for performing cluster analysis on the document-topic vector set to obtain text cluster data, the text cluster data comprising the texts contained in each document cluster and the document cluster to which each text belongs;
The topic clustering data set management unit is used for saving the text cluster data in the document cluster database, which is based on a relational database.
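As an illustration of clustering the document-topic vectors, here is a minimal K-means sketch. K-means is one of the algorithms the embodiment names; the function names, the squared Euclidean distance and the random initialisation are assumptions of this sketch.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=50, seed=0):
    """Cluster document-topic vectors; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(iterations):
        # Assign every vector to its nearest center.
        new_labels = [min(range(k), key=lambda c: sq_dist(v, centers[c]))
                      for v in vectors]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [v for v, l in zip(vectors, new_labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break                 # assignments stable: converged
        labels = new_labels
    return labels, centers
```

The returned `labels` give the document cluster of each text; grouping document indices by label gives the texts contained in each cluster.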
Preferably, the data visualization module comprises a data integration unit, a visualization unit and a human-computer interaction unit;
The data integration unit is used for reading the document-topic vector set data and the topic-word vector set data from the vector set database, reading the text cluster data from the document cluster database, and converting the data read into the data schema defined by the visualization unit;
The visualization unit is mainly used for presenting the integrated data to the end user in graphical form;
The human-computer interaction unit is used for adjusting the variable parameters of each unit having computing and screening functions in the corpus module, the topic analysis module and the topic clustering module.
Preferably, the units having computing and screening functions comprise the corpus construction unit, the Chinese word segmentation unit, the LDA topic model construction unit, the Gibbs sampling calculation unit, the cluster analysis unit and the topic clustering data set management unit.
Preferably, the result vector set management unit further has a topic heat subunit, which is used for calculating the heat (popularity) of each topic and storing the result in the vector set database.
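The source does not state how the heat value is calculated. Purely as an illustrative assumption, the sketch below takes a topic's heat to be its average proportion over all documents in the document-topic vector set; the function name `topic_heat` is hypothetical.

```python
def topic_heat(theta):
    """Assumed definition: mean proportion of each topic over all documents.

    theta is the document-topic matrix (one row per document).
    """
    num_docs = len(theta)
    num_topics = len(theta[0])
    return [sum(row[k] for row in theta) / num_docs
            for k in range(num_topics)]
```

Other definitions, such as counting the documents whose dominant topic is k, would fit the description equally well.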
Preferably, the data visualization module generates the visual image as follows:
Step 1, obtaining the topic heat data set H={h1, h2, h3, ..., hk} from the vector set database, wherein hi is the heat value of the i-th topic;
Step 2, drawing the topic region on the display screen, specifically:
Step 21, drawing two concentric circles;
Step 22, normalizing the topic heat values in the heat data set H to obtain the normalized data set H'={h1', h2', ..., hk'}, wherein hi' is the normalized heat value of topic i;
Step 23, dividing the region between the outer ring and the inner ring of the concentric circles drawn in step 21 into k sectors in proportion to the heat value hi of each topic, each sector representing one topic, the radian of the sector of topic i being 2*PI*hi';
Step 3, drawing a word cloud in each sector, specifically:
Step 31, for topic i, accessing the vector set database to obtain the word vector Wi={{wi1, v1}, {wi2, v2}, ..., {win, vn}} of topic i, wherein wip is the content of the p-th word contained in topic i and vp is the value of wip, that is, the importance of this word to topic i;
Step 32, normalizing the values v in Wi to obtain Wi'={{wi1, v1'}, {wi2, v2'}, ..., {win, vn'}}, wherein vp' is the normalized value of vp;
Step 33, generating the word cloud in the sector corresponding to topic i, the font size of the p-th word of the word cloud being the preset original size * vp' * hi'; if the font size is smaller than size 2, the word is not displayed;
Step 34, placing each word in the word cloud horizontally;
Step 4, drawing the document clusters, specifically:
Step 41, obtaining the size information of the document clusters from the document cluster database: SC={sc1, sc2, ..., scy}, wherein sci is the number of documents contained in the i-th document cluster;
Step 42, normalizing SC to obtain SC'={sc1', sc2', ..., scy'}, wherein sci' is the normalized number of documents contained in the i-th document cluster;
Step 43, drawing one circle for each document cluster inside the inner ring of the concentric circles drawn in step 21, the radius of each circle being proportional to the normalized value sci', the circles being arranged in descending order of radius along a spiral running from the outside inward.
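Steps 2 and 3 reduce to a few lines of arithmetic, sketched below. The source does not say which normalization is used: this sketch assumes heat values are divided by their sum (so the sector radians add up to 2*PI) and word values are divided by their maximum. The function names are hypothetical, and the default base size of 18 and minimum size of 2 are taken from the embodiment.

```python
import math


def normalize_sum(values):
    """Scale values so that they sum to 1 (assumed normalization for heat)."""
    total = sum(values)
    return [v / total for v in values]


def sector_angles(heat):
    """Radian of each topic's sector: 2*PI*hi'."""
    return [2 * math.pi * h for h in normalize_sum(heat)]


def font_sizes(word_weights, topic_heat_norm, base_size=18.0, min_size=2.0):
    """Font size per word: base_size * vp' * hi'; words below min_size are dropped.

    word_weights is a list of (word, value) pairs for one topic; values are
    normalized by the largest value (an assumption of this sketch).
    """
    peak = max(weight for _, weight in word_weights)
    sizes = {}
    for word, weight in word_weights:
        size = base_size * (weight / peak) * topic_heat_norm
        if size >= min_size:
            sizes[word] = size
    return sizes
```

For example, heat values of 1, 1 and 2 give the third topic a sector of PI radians, half of the ring.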
Preferably, in the visual image, the sectors of the topic region have a trigger function, specifically: after the sector corresponding to topic i is triggered, the data visualization module obtains from the document cluster database the proportions TC={tc1, tc2, ..., tcy} of documents containing topic i in their respective document clusters, wherein tcs is the proportion of documents in document cluster s that contain topic i; a sector is drawn in each corresponding document cluster, the sector radian for document cluster s being 2*PI*tcs.
Preferably, in the visual image, the circular regions of the document clusters have a trigger function, specifically: after the circular region corresponding to a document cluster is triggered, the data visualization module reads from the document cluster database the proportion information CT={ct1, ct2, ..., ctk} of the topics contained in that document cluster, wherein cti is the proportion of topic i among all the topics contained in the selected document cluster; the sectors of the topic region are re-divided according to CT, and a word cloud is generated in the sector corresponding to each topic.
Preferably, in step 34, each word in the word cloud is placed horizontally as follows: each word of the word cloud is rotated according to the rotation angle of its center point about the circle center, so that the words are displayed horizontally whatever the angle of the sector.
The visualization analysis system based on a text topic model proposed by the present invention presents the topic analysis and topic clustering of network text information intuitively and graphically through an interactive topic visualization model and allows variable parameters to be adjusted dynamically, which optimizes the analysis effect and improves analysis efficiency.
Brief description of the drawings
Fig. 1 is the framework of the visualization analysis system based on a text topic model according to the embodiment of the present invention;
Fig. 2 is a schematic structural diagram of the interactive topic visualization model according to the embodiment of the present invention.
Detailed description of the embodiments
To make the object, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to a specific embodiment and the accompanying drawings.
Fig. 1 shows the system framework of the embodiment of the present invention. The visualization analysis system based on a text topic model of this embodiment comprises an internet text data acquisition module S101, a corpus module S102, a topic analysis module S103, a topic clustering module S104 and a data visualization module S105.
The internet text data acquisition module S101 is used for collecting web page text data from the Internet and cleaning each collected text. It comprises a web page capture unit S114 and a data cleaning unit S115. The web page capture unit S114 is used for collecting the text data in web pages from the Internet; it adopts web crawler technology and, starting from the seed websites provided, follows their links to other websites, thereby crawling web pages automatically. The data cleaning unit S115 is used for cleaning the text data collected by the web page capture unit and removing data irrelevant to the web page content; the retained data comprise the title, author, time, source, body text and the like of the web page.
The corpus module S102 comprises a corpus construction unit S116, a corpus, a Chinese word segmentation unit S117, a word frequency data management unit S118 and a word frequency database. The corpus construction unit S116 is used for storing the cleaned text data in the corpus, which is based on a relational database. The Chinese word segmentation unit S117 is used for performing Chinese word segmentation on the data in the corpus and removing stop words irrelevant to the body text according to the stop-word list defined in this unit. The word frequency data management unit S118 performs word frequency statistics on the segmentation result and stores the resulting statistics in the word frequency database; it also provides data access functions for the word frequency database. The word frequency data stored in the word frequency database comprise the mapping between each word in the segmentation result and the text data in the corpus, together with the statistics produced by the word frequency data management unit. The statistics comprise the number of times each word in the segmentation result occurs in each corresponding text, and the occurrence counts of the words contained in each text.
The topic analysis module S103 comprises an LDA topic model construction unit S119, a Gibbs sampling calculation unit S120, a result vector set management unit S121 and a vector set database. The LDA topic model construction unit S119 is used for building an LDA (Latent Dirichlet Allocation) topic model from the word frequency data. The Gibbs sampling calculation unit S120 is used for computing the LDA model by Gibbs sampling; the results are a document-topic vector set and a topic-word vector set, which respectively describe the topics contained in each text and the keywords contained in each topic. The result vector set management unit S121 is used for saving the vector sets obtained by the Gibbs sampling calculation unit in the vector set database, which is based on a relational database, and provides data access interface functions. The result vector set management unit S121 also has a topic heat subunit, which is used for calculating the heat of each topic and storing the result in the vector set database.
The topic clustering module S104 comprises a cluster analysis unit S122, a topic clustering data set management unit S123 and a document cluster database. The cluster analysis unit S122 contains several different clustering algorithms, such as the K-means, OPTICS and DBSCAN clustering algorithms; the algorithm to be used can be selected through the data visualization module. The unit performs cluster analysis on the document-topic vector set to obtain text cluster data, which comprise the texts contained in each document cluster and the document cluster to which each text belongs. In addition, a subunit computes statistics on the relationship between document clusters and topics, such as the topics contained in each document cluster and the document clusters involved in each topic. The topic clustering data set management unit S123 is used for saving the cluster analysis results in the document cluster database, which is based on a relational database, and provides data access interface functions.
The data visualization module S105 comprises a data integration unit S112, a visualization unit S113 and a human-computer interaction unit S114. The data integration unit S112 is used for reading the document-topic vector set data and the topic-word vector set data from the vector set database, reading the text cluster data from the document cluster database, and converting the data read into the data schema defined by the visualization unit. The visualization unit S113 is mainly used for presenting the integrated data to the end user in graphical form. The human-computer interaction unit S114 is used for displaying and adjusting the variable parameters in the corpus construction unit S116, the Chinese word segmentation unit S117, the LDA topic model construction unit S119, the Gibbs sampling calculation unit S120, the cluster analysis unit S122 and the topic clustering data set management unit S123, including the choice of clustering algorithm in the cluster analysis unit S122. After an adjustment, the results are recalculated and shown on the screen by the data visualization module, replacing the old visualization.
Fig. 2 is a schematic structural diagram of the interactive topic visualization model of the embodiment of the present invention. In this embodiment, the visual image of this diagram is generated by the data visualization module and shown on a display; it is generated as follows:
Step 1, obtaining the topic heat data set H={h1, h2, h3, ..., hk} from the vector set database, wherein hi is the heat value of the i-th topic;
Step 2, drawing the topic region S201 on the display screen, specifically:
Step 21, drawing two concentric circles; in this embodiment, the circle center c is the center point of the screen, the outer radius ro = screen height * 2/5, and the inner radius ri = screen height * 1/5;
Step 22, normalizing the topic heat values in the heat data set H to obtain the normalized data set H'={h1', h2', ..., hk'}, wherein hi' is the normalized heat value of topic i;
Step 23, dividing the region between the outer ring and the inner ring of the concentric circles drawn in step 21 into k sectors in proportion to the heat value hi of each topic, each sector representing one topic, the radian of the sector of topic i being 2*PI*hi';
Step 3, drawing a word cloud S202 in each sector, specifically:
Step 31, for topic i, accessing the vector set database to obtain the word vector Wi={{wi1, v1}, {wi2, v2}, ..., {win, vn}} of topic i, wherein wip is the content of the p-th word contained in topic i, such as "football" or "mobile phone", and vp is the value of wip, that is, the importance of this word to topic i;
Step 32, normalizing the values v in Wi to obtain Wi'={{wi1, v1'}, {wi2, v2'}, ..., {win, vn'}}, wherein vp' is the normalized value of vp;
Step 33, generating the word cloud in the sector corresponding to topic i, the font size of the p-th word of the word cloud being the preset original size * vp' * hi'; if the font size is smaller than the preset minimum threshold, the word is not displayed; in this embodiment, the original font is set to the Song typeface at size 18, and the minimum font size threshold is set to size 2;
Step 34, placing each word of the word cloud horizontally; in this embodiment, the word cloud is placed in sector i, and each word is rotated according to the rotation angle of its center point about the circle center, so that the words are displayed horizontally whatever the angle of sector i.
Step 4, drawing the document clusters S204, specifically:
Step 41, obtaining the size information of the document clusters from the document cluster database: SC={sc1, sc2, ..., scy}, wherein sci is the number of documents contained in the i-th document cluster;
Step 42, normalizing SC to obtain SC'={sc1', sc2', ..., scy'}, wherein sci' is the normalized number of documents contained in the i-th document cluster;
Step 43, drawing one circle for each document cluster inside the inner ring of the concentric circles drawn in step 21, the radius of each circle being proportional to the normalized value sci', the circles being arranged in descending order of radius along a spiral running from the outside inward.
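Step 4 leaves the exact spiral unspecified. The sketch below is one possible reading, not the patented layout: circle centers are placed on an Archimedean spiral that starts near the inner ring and winds toward the center, largest cluster first. The function name and the parameters `max_circle_radius` and `turns` are assumptions, and the sketch does not check for overlapping circles.

```python
import math


def spiral_layout(cluster_sizes, inner_radius, max_circle_radius, turns=2.0):
    """Place one circle per document cluster inside the inner ring.

    Radii are proportional to the normalized cluster sizes sci'; circles are
    ordered by descending radius from the outside of the spiral inward.
    Coordinates are relative to the common circle center.
    """
    total = sum(cluster_sizes)
    order = sorted(range(len(cluster_sizes)),
                   key=lambda i: cluster_sizes[i], reverse=True)
    largest_share = cluster_sizes[order[0]] / total
    placed = []
    for rank, idx in enumerate(order):
        share = cluster_sizes[idx] / total            # sci'
        radius = max_circle_radius * share / largest_share
        t = rank / len(order)                         # 0 = outermost position
        dist = (inner_radius - max_circle_radius) * (1 - t)
        angle = 2 * math.pi * turns * t
        placed.append({"cluster": idx, "radius": radius,
                       "x": dist * math.cos(angle),
                       "y": dist * math.sin(angle)})
    return placed
```

With the embodiment's inner radius of screen height * 1/5, `inner_radius` would be that value in pixels; a production layout would also need collision handling.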
In the visual image, the sectors of the topic region have a trigger function, specifically: after the sector corresponding to topic i is triggered, the data visualization module obtains from the document cluster database the proportions TC={tc1, tc2, ..., tcy} of documents containing topic i in their respective document clusters, wherein tcs is the proportion of documents in document cluster s that contain topic i; a sector is drawn in each corresponding document cluster, the sector radian for document cluster s being 2*PI*tcs.
In the visual image, the circular regions of the document clusters have a trigger function, specifically: after the circular region corresponding to a document cluster is triggered, the data visualization module reads from the document cluster database the proportion information CT={ct1, ct2, ..., ctk} of the topics contained in that document cluster, wherein cti is the proportion of topic i among all the topics contained in the selected document cluster; the sectors of the topic region are re-divided according to CT, and a word cloud is generated in the sector corresponding to each topic, the generation method and constraints of the word cloud being the same as in step 33.
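The two proportion sets TC and CT can be derived from the document-topic vectors and the cluster labels. The source does not define when a document "contains" a topic or how topic shares within a cluster are measured, so the sketch below makes two labeled assumptions: a document contains topic i if its proportion of that topic is at least `threshold`, and a cluster's topic mix is the mean of its members' topic proportions. Both function names are hypothetical.

```python
def topic_share_per_cluster(theta, labels, topic, threshold=0.2):
    """TC: for each cluster s, the fraction of its documents containing `topic`.

    Assumption: a document contains the topic if its proportion >= threshold.
    """
    shares = {}
    for cluster in sorted(set(labels)):
        members = [row for row, l in zip(theta, labels) if l == cluster]
        hits = sum(1 for row in members if row[topic] >= threshold)
        shares[cluster] = hits / len(members)
    return shares


def cluster_topic_mix(theta, labels, cluster):
    """CT: share of each topic within one cluster (assumed: mean proportions)."""
    members = [row for row, l in zip(theta, labels) if l == cluster]
    sums = [sum(row[t] for row in members) for t in range(len(members[0]))]
    total = sum(sums)
    return [s / total for s in sums]
```

Multiplying each returned proportion by 2*PI gives the sector radians drawn inside a document cluster (for TC) or in the re-divided topic region (for CT).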
The visualization diagram shown in Fig. 2 is based on a pie chart and consists of two basic parts: the topic region S201 and the document clustering region S203.
The topic region S201 is based on a pie chart and shows the topics across the whole corpus. The radian of each pie sector expresses the quantified proportion of the topic's heat: the hotter the topic, the larger the sector radian.
The word cloud S202 shows the content of the topic in each sector, that is, the words it contains and their weights. The word cloud uses the tag cloud technique to express the weight of each word within its topic: the larger the font of a word, the greater its weight in the topic, that is, the better it expresses the meaning of the topic. The display range of a word cloud is limited to the sector of the topic it belongs to, and its size is determined by the area of the sector. If the sector area is too small, the word cloud of that sector is not displayed.
The document clustering region S203 shows the result of the topic clustering of the documents. It contains the document clusters S204 and the topic-distribution document clusters S205.
A document cluster S204 represents a clustering result as a circle. The radius of the circle expresses the number of documents in the document cluster: the larger the radius, the more documents the cluster contains. Within the document clustering region S203, the document clusters are arranged along a spiral in descending order, which makes the clusters easy to compare.
In this embodiment, the displayed parts of the visualization diagram, namely the topic region S201, the document clustering region S203 and the document clusters S204, are functional areas; in use, clicking them triggers a data update and a redraw of the image.
The foregoing is only a preferred embodiment of the present invention and is not intended to limit the present invention; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (12)

1. based on a Visualized Analysis System for text subject model, it is characterized in that, this system comprises internet text notebook data acquisition module, corpus module, subject analysis module, Subject Clustering module, data visualization module;
Internet text notebook data acquisition module is used for gathering web page text data from internet, and cleans collected each section text data;
Corpus module is for storing the text data after the cleaning of internet text notebook data acquisition module, and Chinese word segmentation and word frequency statistics are carried out to the web page text data stored, generate the word frequency data of mapping relations and the word frequency statistics data comprised between word and the web page text data stored;
Subject analysis module is used for setting up topic model according to the word frequency data of corpus CMOS macro cell, utilizes the Gibbs methods of sampling to calculate set up topic model, stores and export the document-theme vector collection and theme-word vector set that calculate;
The document that Subject Clustering module exports subject analysis module-theme vector collection carries out cluster analysis, stores and exports cluster data;
Data the showing with figure that subject analysis module and Subject Clustering module export by data visualization module; Data visualization module is also for showing and adjusting variable element in corpus module, subject analysis module, Subject Clustering module.
2. a kind of Visualized Analysis System based on text subject model as claimed in claim 1, it is characterized in that, described internet text notebook data acquisition module comprises webpage capture unit and data cleansing unit;
Webpage capture unit is for gathering the text data in webpage from internet; This unit uses web crawlers technology, after providing seed website, jumps to other websites by the link of seed website, realizes automatic web and creep;
Data cleansing unit is used for the text data of webpage capture unit collection to clean, and removes the data irrelevant with web page contents, the title of the packet purse rope page of reservation, author, time, source and body matter.
3. The visualization analysis system based on a text topic model according to claim 2, characterized in that the corpus module comprises a corpus construction unit, a corpus, a Chinese word segmentation unit, a word frequency data management unit, and a word frequency database;
The corpus construction unit is used to store the cleaned text data in the corpus, which is based on a relational database;
The Chinese word segmentation unit is used to perform Chinese word segmentation on the data in the corpus, and to remove stop words irrelevant to the body text according to the stop-word list defined in this unit;
The word frequency data management unit performs word frequency statistics on the segmentation results obtained by the Chinese word segmentation unit and stores the resulting statistics in the word frequency database; the word frequency data stored in the word frequency database comprise the mapping relations between each word in the segmentation results and the text data in the corpus, together with the word frequency statistics; the statistics comprise the number of times each word in the segmentation results occurs in each corresponding text, and the number of occurrences of each word contained in each text.
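As an illustration of the word frequency data described in claim 3, the following sketch counts words per text and records which texts each word maps to. It assumes the texts have already been segmented into word lists; the stop-word set, the function name `word_frequencies`, and the sample documents are hypothetical.

```python
from collections import Counter

# Hypothetical stop-word list; the claim only states that one is defined in the unit.
STOP_WORDS = {"的", "了", "是"}

def word_frequencies(segmented_docs):
    """Build per-text word counts and the word-to-text mapping.

    segmented_docs maps a text id to its list of already segmented words.
    """
    per_doc = {}        # text id -> Counter of word occurrences in that text
    word_to_docs = {}   # word -> set of text ids in which the word occurs
    for doc_id, words in segmented_docs.items():
        kept = [w for w in words if w not in STOP_WORDS]
        per_doc[doc_id] = Counter(kept)
        for word in per_doc[doc_id]:
            word_to_docs.setdefault(word, set()).add(doc_id)
    return per_doc, word_to_docs

docs = {
    "d1": ["主题", "模型", "的", "可视化"],
    "d2": ["主题", "聚类", "是", "主题", "分析"],
}
per_doc, word_to_docs = word_frequencies(docs)
```

In the claimed system these two structures would be stored in the word frequency database rather than held in memory.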
4. The visualization analysis system based on a text topic model according to claim 3, characterized in that the topic analysis module comprises an LDA topic model construction unit, a Gibbs sampling computation unit, a result vector set management unit, and a vector set database;
The LDA topic model construction unit is used to construct an LDA topic model from the word frequency data;
The Gibbs sampling computation unit is used to compute the LDA model using the Gibbs sampling method, obtaining a document-topic vector set that describes the topics contained in each text and a topic-word vector set that describes the keywords contained in each topic;
The result vector set management unit is used to save the vector sets obtained by the Gibbs sampling computation unit in the vector set database, which is based on a relational database.
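The computation in claim 4 can be sketched with a collapsed Gibbs sampler for LDA. This is a minimal illustration, not the patented implementation: the claim does not state which Gibbs variant is used, and the hyperparameters `alpha` and `beta`, the iteration count, and the integer word-id encoding are all assumptions.

```python
import random

def lda_gibbs(docs, vocab_size, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    ndk = [[0] * num_topics for _ in docs]                # document-topic counts
    nkw = [[0] * vocab_size for _ in range(num_topics)]   # topic-word counts
    nk = [0] * num_topics                                 # words assigned to each topic
    z = []                                                # topic assignment of every token
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            k = rng.randrange(num_topics)                 # random initial assignment
            zd.append(k)
            ndk[d][k] += 1
            nkw[k][w] += 1
            nk[k] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]                               # remove the current assignment
                ndk[d][k] -= 1
                nkw[k][w] -= 1
                nk[k] -= 1
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][n] = k                               # record the resampled topic
                ndk[d][k] += 1
                nkw[k][w] += 1
                nk[k] += 1
    # Document-topic vectors (theta) and topic-word vectors (phi) from the final counts.
    theta = [
        [(ndk[d][t] + alpha) / (len(doc) + num_topics * alpha) for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(nkw[t][w] + beta) / (nk[t] + vocab_size * beta) for w in range(vocab_size)]
        for t in range(num_topics)
    ]
    return theta, phi

corpus = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 5], [0, 1, 2, 2], [3, 5, 4, 4]]
theta, phi = lda_gibbs(corpus, vocab_size=6, num_topics=2)
```

Here `theta` plays the role of the document-topic vector set and `phi` the topic-word vector set; each row is a probability distribution.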
5. The visualization analysis system based on a text topic model according to claim 4, characterized in that the topic clustering module comprises a cluster analysis unit, a topic clustering data set management unit, and a document cluster database;
The cluster analysis unit is used to perform cluster analysis on the document-topic vector set to obtain text cluster data; the text cluster data comprise the texts contained in each document cluster and the document cluster to which each text belongs;
The topic clustering data set management unit is used to save the text cluster data in the document cluster database, which is based on a relational database.
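Claim 5 does not name a clustering algorithm. As one possible illustration, the sketch below applies plain k-means to document-topic vectors and returns both pieces of text cluster data mentioned in the claim. The choice of k-means, the naive initialisation from the first k vectors, and the names `kmeans`, `assign`, and `clusters` are assumptions.

```python
def kmeans(vectors, k, iters=20):
    """Cluster vectors with plain k-means (an assumed algorithm, not named in the claim)."""
    centers = [list(v) for v in vectors[:k]]   # naive initialisation: first k vectors
    assign = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: each vector joins its nearest center (squared Euclidean distance).
        for i, v in enumerate(vectors):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centers[c])),
            )
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [vectors[i] for i in range(len(vectors)) if assign[i] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    # clusters: texts contained in each cluster; assign: cluster of each text.
    clusters = {c: [i for i, a in enumerate(assign) if a == c] for c in range(k)}
    return assign, clusters

doc_topic = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
assign, clusters = kmeans(doc_topic, k=2)
```

With these four vectors the first two texts end up in one cluster and the last two in the other.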
6. The visualization analysis system based on a text topic model according to claim 5, characterized in that the data visualization module comprises a data integration unit, a visualization unit, and a human-computer interaction unit;
The data integration unit is used to read the document-topic vector set data and the topic-word vector set data from the vector set database, to read the text cluster data from the document cluster database, and to convert the data read into the data format defined by the visualization unit;
The visualization unit is mainly used to present the integrated data to the end user graphically;
The human-computer interaction unit is used to adjust the variable parameters of each unit having computing and filtering functions in the corpus module, the topic analysis module, and the topic clustering module.
7. The visualization analysis system based on a text topic model according to claim 6, characterized in that the units having computing and filtering functions comprise the corpus construction unit, the Chinese word segmentation unit, the LDA topic model construction unit, the Gibbs sampling computation unit, the cluster analysis unit, and the topic clustering data set management unit.
8. The visualization analysis system based on a text topic model according to any one of claims 1-7, characterized in that the result vector set management unit further has a topic popularity subunit, which is used to calculate topic popularity and to store the result in the vector set database.
9. The visualization analysis system based on a text topic model according to claim 8, characterized in that the data visualization module generates the visualization image as follows:
Step 1, obtain the topic popularity data set from the vector set database:
H = {h_1, h_2, h_3, ..., h_k}, where h_i is the popularity value of the i-th topic;
Step 2, draw the topic region on the display screen, specifically:
Step 21, draw two concentric circles;
Step 22, normalize the topic popularity values in the data set H to obtain the normalized data set H' = {h_1', h_2', ..., h_k'}, where h_i' is the normalized popularity value of topic i;
Step 23, according to the proportion of each topic's popularity value h_i, divide the region between the outer circle and the inner circle drawn in step 21 into k sectors, each sector representing one topic; the central angle of the sector of topic i = 2*PI*h_i';
Step 3, draw a word cloud in each sector, specifically:
Step 31, for topic i, access the vector set database and obtain the word vector contained in topic i, W_i = {{w_i1, v_1}, {w_i2, v_2}, ..., {w_in, v_n}}, where w_ip is the p-th word contained in topic i and v_p is the value of w_ip, that is, the importance of this word to topic i;
Step 32, normalize the values v in W_i to obtain W_i' = {{w_i1, v_1'}, {w_i2, v_2'}, ..., {w_in, v_n'}}, where v_p' is the normalized value of v_p;
Step 33, generate a word cloud in the sector corresponding to topic i; the font size of the p-th word of the word cloud = preset original size * v_p' * h_i'. If the font size is smaller than size 2, the word is not displayed;
Step 34, place each word in the word cloud horizontally;
Step 4, draw the document clusters, specifically:
Step 41, obtain the size information of the document clusters from the document cluster database:
SC = {sc_1, sc_2, ..., sc_y}, where sc_i is the number of documents contained in the i-th document cluster;
Step 42, normalize SC to obtain SC' = {sc_1', sc_2', ..., sc_y'}, where sc_i' is the normalized value of the number of documents contained in the i-th document cluster;
Step 43, inside the inner circle of the concentric circles drawn in step 21, draw one circle for each document cluster, with radius proportional to the normalized value sc_i'; the circles are arranged in a spiral from the outside inward, in descending order of radius.
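The arithmetic in steps 22-23, 32-33, and 42-43 can be sketched as follows. The sketch assumes that popularity values are normalized by their sum (so that the k sector angles add up to 2*PI), that word importance is normalized by its maximum, and that the size threshold is the number 2; the claim does not fix the normalization scheme, and all function names are hypothetical.

```python
import math

def normalize_sum(values):
    """Scale values so that they sum to 1 (assumed normalization for H and SC)."""
    total = sum(values)
    return [v / total for v in values]

def sector_angles(popularity):
    """Step 23: central angle of each topic sector, 2*PI*h_i'."""
    return [2 * math.pi * h for h in normalize_sum(popularity)]

def font_sizes(base_size, word_weights, topic_share, min_size=2.0):
    """Step 33: font size = base size * v_p' * h_i'; words below min_size are hidden.

    Word weights are normalized by their maximum, which is an assumption.
    """
    top = max(word_weights.values())
    sizes = {}
    for word, weight in word_weights.items():
        size = base_size * (weight / top) * topic_share
        if size >= min_size:
            sizes[word] = size
    return sizes

def cluster_radii(doc_counts, max_radius):
    """Steps 42-43: radii proportional to sc_i', returned in descending order."""
    shares = normalize_sum(doc_counts)
    biggest = max(shares)
    return sorted((max_radius * s / biggest for s in shares), reverse=True)

angles = sector_angles([2.0, 1.0, 1.0])
sizes = font_sizes(48, {"a": 1.0, "b": 0.5, "c": 0.05}, topic_share=0.5)
radii = cluster_radii([10, 40, 20], max_radius=50)
```

The spiral placement of the circles in step 43 is not modelled here; the sketch only produces the radii in the order in which they would be laid out.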
10. The visualization analysis system based on a text topic model according to claim 9, characterized in that, in the visualization image, the sectors of the topic region have a trigger function, specifically: after the sector corresponding to topic i is triggered, the data visualization module obtains from the document cluster database the proportion of documents containing topic i within each document cluster, TC = {tc_1, tc_2, ..., tc_y}, where tc_s is the proportion of documents in document cluster s that contain topic i; a sector is drawn in the corresponding document cluster, the central angle of the sector being A_s = 2*PI*tc_s.
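As an illustration of the proportions tc_s in claim 10, the sketch below counts, for each document cluster, the share of documents that contain a given topic and converts that share to a sector angle. The claim does not say when a document "contains" a topic; treating a topic weight at or above a `threshold` as containment is an assumption, as are the function names.

```python
import math

def topic_share_per_cluster(doc_topic, assign, topic, num_clusters, threshold=0.5):
    """Return tc_s for every cluster s: the share of its documents containing the topic."""
    totals = [0] * num_clusters
    hits = [0] * num_clusters
    for row, cluster in zip(doc_topic, assign):
        totals[cluster] += 1
        if row[topic] >= threshold:      # assumed definition of "contains the topic"
            hits[cluster] += 1
    return [hits[c] / totals[c] if totals[c] else 0.0 for c in range(num_clusters)]

def cluster_sector_angles(shares):
    """Central angle of the sector drawn inside each cluster circle, 2*PI*tc_s."""
    return [2 * math.pi * tc for tc in shares]

doc_topic = [[0.9, 0.1], [0.4, 0.6], [0.1, 0.9], [0.2, 0.8]]
assign = [0, 0, 1, 1]
tc = topic_share_per_cluster(doc_topic, assign, topic=0, num_clusters=2)
highlight = cluster_sector_angles(tc)
```

In this toy data half of the documents in cluster 0 contain topic 0, so that cluster's circle would show a half-circle sector.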
11. The visualization analysis system based on a text topic model according to claim 10, characterized in that, in the visualization image, the circular regions of the document clusters have a trigger function, specifically: after the circular region corresponding to a document cluster is triggered, the data visualization module reads from the document cluster database the proportion information of the topics contained in that document cluster, CT = {ct_1, ct_2, ..., ct_k}, where ct_i is the proportion that topic i accounts for among all topics contained in the selected document cluster; the sectors of the topic region are re-divided according to CT, and a word cloud is generated in the sector corresponding to each topic.
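The re-division in claim 11 can be sketched by computing the topic mix of the selected cluster and turning it into new sector angles. Averaging the document-topic vectors of the cluster's documents is an assumed way of obtaining CT; the claim only says the proportions are read from the document cluster database. The function names are hypothetical.

```python
import math

def cluster_topic_mix(doc_topic, assign, cluster):
    """Return ct_i for every topic i: the mean topic weight over the cluster's documents."""
    rows = [row for row, c in zip(doc_topic, assign) if c == cluster]
    num_topics = len(doc_topic[0])
    return [sum(row[t] for row in rows) / len(rows) for t in range(num_topics)]

def redivided_angles(ct):
    """New central angle of each topic sector after a cluster is selected."""
    total = sum(ct)
    return [2 * math.pi * value / total for value in ct]

doc_topic = [[0.9, 0.1], [0.4, 0.6], [0.1, 0.9], [0.2, 0.8]]
assign = [0, 0, 1, 1]
ct = cluster_topic_mix(doc_topic, assign, cluster=0)
new_angles = redivided_angles(ct)
```

Because each document-topic vector sums to 1, the averaged proportions also sum to 1 and the re-divided sectors again cover the full ring.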
12. The visualization analysis system based on a text topic model according to claim 11, characterized in that, in step 34, each word in the word cloud is placed horizontally as follows: each word of the word cloud is rotated according to the rotation angle of its center point about the center of the circle, which ensures that the words are displayed horizontally regardless of the angle of the sector.
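One way to read claim 12 is that the contents of a sector are drawn in a frame rotated about the circle center, so each word needs a counter-rotation equal to the polar angle of its center point. The sketch below follows that reading, which is an assumption; the function names and coordinates are hypothetical.

```python
import math

def polar_angle(center, point):
    """Angle of the word's center point as seen from the circle center."""
    return math.atan2(point[1] - center[1], point[0] - center[0])

def counter_rotation(center, word_center):
    """Rotation to apply to the word so that it ends up horizontal."""
    return -polar_angle(center, word_center)

def on_screen_rotation(frame_rotation, word_rotation):
    """Net rotation of the word on screen after both rotations are applied."""
    return frame_rotation + word_rotation

circle_center = (0.0, 0.0)
word_center = (0.0, 1.0)                       # a word placed straight "above" the center
frame = polar_angle(circle_center, word_center)
fix = counter_rotation(circle_center, word_center)
net = on_screen_rotation(frame, fix)
```

The net rotation is zero for any sector angle, which matches the stated goal that words stay horizontal wherever their sector lies.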
CN201610028107.6A 2016-01-15 2016-01-15 Visualization analysis system based on text topic model Pending CN105550365A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610028107.6A CN105550365A (en) 2016-01-15 2016-01-15 Visualization analysis system based on text topic model

Publications (1)

Publication Number Publication Date
CN105550365A true CN105550365A (en) 2016-05-04

Family

ID=55829554

Country Status (1)

Country Link
CN (1) CN105550365A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20180601

Address after: 518057 A, 6 yuan, Zhuyuan garden, No. 5 KELONG Road, Yuhai street, Nanshan District, Shenzhen, Guangdong.

Applicant after: Zhongke Jun Sheng (Shenzhen) intelligent data science and Technology Development Co., Ltd.

Address before: 100080 No. 95 East Zhongguancun Road, Beijing, Haidian District

Applicant before: Institute of Automation, Chinese Academy of Sciences

RJ01 Rejection of invention patent application after publication

Application publication date: 20160504