CN102567405A - Hotspot discovery method based on improved text space vector representation - Google Patents
- Publication number: CN102567405A
- Application number: CN2010106180993A
- Authority: CN (China)
- Prior art keywords: text, word, space vector, webpage, module
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a hotspot discovery method based on an improved text space vector representation. The method uses the improved representation to build a vector model, converting network text into a form that a computer can recognize and process, so that hotspot discovery can then be performed on it. The invention also provides a public opinion monitoring system that implements hotspot discovery.
Description
Technical field
The present invention relates to text mining and natural language processing, and in particular to a hotspot discovery method and a public opinion monitoring system based on an improved text space vector representation.
Background art
Data mining is the non-trivial process of discovering valid, novel, potentially useful and ultimately understandable patterns in massive data. It emerged as a research field precisely to resolve the current predicament of having massive data but lacking effective means of analyzing it. It already plays a major role in many areas, including bioinformatics and natural language processing. Internet public opinion analysis is based mainly on the text content published on the network, and therefore cannot do without text mining technology.
Text mining is mainly concerned with text feature extraction and text classification. Feature extraction is the basis of text classification: a good feature extraction method not only improves the accuracy of text processing but, more importantly, reduces the dimensionality of the vectors that represent the text, which increases efficiency and improves the overall performance of the system. At present, however, Chinese language processing systems do not treat the optimization of feature extraction as a research focus; they only try to improve classification accuracy by working on the processing algorithm (classification or clustering). Although some systems achieve fairly good results, they do so only when a large number of training samples is available, and they are poorly suited to the large amount of random information on the network. In recent years, feature extraction systems and methods have been widely applied in text processing and have accelerated its development. However, the document representations currently in use share one unsatisfactory property: the document feature vector has an extremely high dimensionality, which makes selecting a feature subset an indispensable step in text mining. Feature extraction performs this dimensionality reduction. Its purpose is mainly to improve program efficiency and running speed while also improving classification precision, by quickly screening out the set of feature items that matter for each class.
There are two main approaches to feature extraction. The first is independent evaluation, which rests on the basic assumption that words are mutually independent (the orthogonality assumption). Several criteria are available for adjusting feature weights, such as mutual information, expected cross entropy and information gain. The basic idea is to assess each feature in the feature set independently: an evaluation function is constructed, each feature is assigned a weight, the features are sorted by weight, and the optimal feature subset is chosen as the result of feature extraction according to a weight threshold or a predetermined number of features. The second is comprehensive evaluation. The words that occur in a text are often correlated, that is, the non-orthogonal case arises, and this affects the computed results to some extent. A comprehensive evaluation method can therefore be applied to transform the high-dimensional original feature set, whose features are not mutually independent, into a smaller number of composite indicators that describe those features. These composite indicators are mutually independent, and the feature set can be selected using them.
Since the 1990s, numerous statistical and machine learning methods have been applied to automatic text classification, and research on text classification has attracted great interest. Research on Chinese text classification has also begun in China, with preliminary applications in fields such as information retrieval, automatic classification of Web documents, digital libraries, automatic summarization, newsgroup classification, text filtering, word sense disambiguation, and document organization and management. Text classification technology has made great progress in recent years: a variety of feature extraction and classification methods have been proposed, such as regression models, support vector machines and maximum entropy models; a number of quite successful classification systems have been developed; and open classification corpora such as OHSUMED and Reuters have been established. Classification is an important data mining method, and there are almost as many methods for text classification as for classification in general. Among the many text classification algorithms, the most commonly used are the Rocchio algorithm, the naive Bayes classifier, the k-nearest neighbor algorithm, decision trees, neural networks and support vector machines.
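As an illustration of the independent evaluation approach described above, the following minimal sketch scores each word separately by information gain, sorts the words by score and keeps a predetermined number of them. It is not taken from the patent: the function names (`entropy`, `information_gain`, `select_features`), the representation of each document as a set of words, and the sample data are all assumptions made for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(term, docs, labels):
    """Reduction in class entropy obtained by knowing whether `term` occurs."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional


def select_features(docs, labels, k):
    """Score every word independently, sort by score, keep the top k."""
    vocabulary = sorted({t for d in docs for t in d})
    scored = [(information_gain(t, docs, labels), t) for t in vocabulary]
    # Highest score first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [t for _, t in scored[:k]]


# Hypothetical sample: each document is a set of words with a class label.
sample_docs = [{"goal", "match"}, {"goal", "team"},
               {"stock", "market"}, {"stock", "team"}]
sample_labels = ["sports", "sports", "finance", "finance"]
print(select_features(sample_docs, sample_labels, 2))
```

Replacing the fixed `k` with a score threshold gives the other selection rule mentioned in the text; either way each word is judged in isolation, which is exactly what the orthogonality assumption permits.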
Text mining makes it possible to perform similarity-based de-duplication of Internet text, hotspot discovery and tracking, association analysis and trend analysis. Hotspot discovery means tracking, across various information sources, the information fragments that discuss a target hotspot, finding each previously unknown hotspot in a collection of information fragments, and detecting new hotspots online. Association analysis mines association rules from massive data. Trend analysis, in turn, examines how matters such as online public opinion develop over time, so that the public opinion environment can be monitored and early warning given of harmful trends.
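The text does not say how similarity-based de-duplication is carried out. One common way to do it, shown here purely as a sketch under that assumption, is to compare the sets of overlapping character n-grams (shingles) of two texts with the Jaccard coefficient and drop a text when it is too similar to one already kept. The names `shingles`, `jaccard` and `deduplicate`, the shingle size and the threshold are hypothetical.

```python
def shingles(text, size=3):
    """Return the set of overlapping character n-grams of a text."""
    cleaned = "".join(text.split())  # ignore whitespace differences
    if len(cleaned) <= size:
        return {cleaned} if cleaned else set()
    return {cleaned[i:i + size] for i in range(len(cleaned) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(texts, threshold=0.8):
    """Return the indices of texts kept after dropping near-duplicates."""
    kept = []  # (index, shingle set) of every text kept so far
    for index, text in enumerate(texts):
        current = shingles(text)
        # Keep the text only if it is dissimilar to everything already kept.
        if all(jaccard(current, seen) < threshold for _, seen in kept):
            kept.append((index, current))
    return [index for index, _ in kept]


pages = [
    "flood warning issued for the river valley",
    "flood  warning issued for the river valley",  # same text, extra space
    "stock markets closed higher today",
]
print(deduplicate(pages))
```

Character shingles work without word segmentation, which is convenient for Chinese text; the pairwise comparison is quadratic, so a real system handling many pages would need an index or hashing scheme on top of this.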
Summary of the invention
A hotspot discovery method based on an improved text space vector representation is provided. The method comprises constructing a feature vector model from text information and using an improved text space vector representation. Constructing the feature vector model specifically comprises performing word segmentation on the structured data in the database, building a two-dimensional space vector with words as one dimension and documents as the other, and computing the frequency of each word in each document and storing it in the two-dimensional space vector.
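The two-dimensional structure described here, with words along one dimension, documents along the other and word frequencies as entries, can be sketched as follows. The sketch assumes the documents have already been segmented into word lists; the function name `build_term_document_matrix` and the sample tokens are hypothetical.

```python
from collections import Counter


def build_term_document_matrix(tokenized_docs):
    """Build a matrix with one row per word and one column per document.

    Entry [i][j] is the number of times word i occurs in document j.
    """
    vocabulary = sorted({word for doc in tokenized_docs for word in doc})
    row_of = {word: i for i, word in enumerate(vocabulary)}
    matrix = [[0] * len(tokenized_docs) for _ in vocabulary]
    for col, doc in enumerate(tokenized_docs):
        for word, count in Counter(doc).items():
            matrix[row_of[word]][col] = count
    return vocabulary, matrix


# Hypothetical, already-segmented documents.
segmented = [["earthquake", "rescue", "earthquake"], ["rescue", "supplies"]]
vocab, tf_matrix = build_term_document_matrix(segmented)
for word, row in zip(vocab, tf_matrix):
    print(word, row)
```

A dense list of lists keeps the example short; because most words are absent from most documents, a sparse representation would be the more practical choice at scale.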
Improved text space vector representation (the formula itself is not reproduced in this text):
The quantities defined for the formula are: the weight of the i-th feature word; the frequency of occurrence of word t in document d; the total number of documents, N; and the number of documents containing t.
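The quantities listed (term frequency in a document, total number of documents, number of documents containing the term) are the same ones used by classic TF-IDF weighting. Since the patent's improved formula is not available in this text, the sketch below implements only the standard form, weight = tf(t, d) × log(N / n_t), as a point of reference; it should not be read as the improved representation. The function name `tfidf_weights` and the input layout (rows are words, columns are documents) are assumptions.

```python
import math


def tfidf_weights(matrix):
    """Apply classic TF-IDF to a word-by-document frequency matrix.

    matrix[i][j] is the frequency of word i in document j. This is the
    standard formula, NOT the patent's improved weighting.
    """
    if not matrix:
        return []
    total_docs = len(matrix[0])  # N
    weighted = []
    for row in matrix:
        docs_with_term = sum(1 for tf in row if tf > 0)  # n_t
        idf = math.log(total_docs / docs_with_term) if docs_with_term else 0.0
        weighted.append([tf * idf for tf in row])
    return weighted


# Rows: three words; columns: two documents (hypothetical counts).
frequencies = [[2, 0], [1, 1], [0, 1]]
for row in tfidf_weights(frequencies):
    print(row)
```

In this form a word that occurs in every document receives weight zero, because it does nothing to distinguish one document from another.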
The invention also provides a public opinion monitoring system for implementing hotspot discovery. The device comprises:
A public opinion collection module, used to gather large amounts of public opinion information from the network into a database for later processing. It comprises a configuration module, which sets the scope of the web pages the crawler fetches by specifying the list of entry URLs, the crawl depth and the polling crawl interval; and a crawling module, which connects to the specified websites, fetches web pages according to the crawl depth and polling crawl interval set in the configuration module, and saves them to the server database;
A preprocessing module, comprising a web page denoising module, which extracts useful information from web pages by matching the page content with regular expressions and saves the extracted structured information to the database; and a de-duplication module, which removes duplicates among the fetched web pages;
A word segmentation module, used for natural language processing of Chinese text; it splits the text into individual words tagged with their part of speech, so that the system processes text with the word as its atomic unit;
A clustering module, used, once the feature vector library has been built, to group documents that share the same features, thereby achieving hotspot discovery.
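The patent does not name the clustering algorithm. As one possible reading, the sketch below uses single-pass clustering with cosine similarity: each document vector joins the most similar existing cluster if the similarity reaches a threshold and otherwise starts a new cluster, and the largest clusters are then treated as candidate hotspots. The names `cosine` and `single_pass_cluster`, the threshold value and the sample vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def single_pass_cluster(vectors, threshold=0.5):
    """Group vectors in one pass; return member indices, largest cluster first."""
    clusters = []  # each cluster: {"centroid": [...], "members": [indices]}
    for index, vector in enumerate(vectors):
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vector, cluster["centroid"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            # Nothing is similar enough: this document starts a new topic.
            clusters.append({"centroid": list(vector), "members": [index]})
        else:
            # Update the centroid as the running mean of its members.
            size = len(best["members"])
            best["centroid"] = [(c * size + x) / (size + 1)
                                for c, x in zip(best["centroid"], vector)]
            best["members"].append(index)
    return sorted((c["members"] for c in clusters), key=len, reverse=True)


# Hypothetical feature vectors for five documents.
doc_vectors = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [1, 0.1, 0], [0, 0, 1]]
print(single_pass_cluster(doc_vectors))
```

A single pass suits a monitoring setting where pages arrive continuously, but the result depends on the order in which documents are seen and on the chosen threshold.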
Brief description of the drawings
Fig. 1 is a schematic diagram of the public opinion collection module;
Fig. 2 is a schematic diagram of the preprocessing module;
Fig. 3 is a schematic diagram of the clustering module.
Claims (7)
1. A hotspot discovery method based on an improved text space vector representation, characterized in that the method comprises:
Constructing a feature vector model from text information;
Using an improved text space vector representation.
2. The method of claim 1, characterized in that constructing the feature vector model from text information specifically comprises:
Performing word segmentation on the structured data in the database, and building a two-dimensional space vector with words as one dimension and documents as the other;
Computing the frequency of each word in each document and storing it in the two-dimensional space vector.
3. A public opinion monitoring system for implementing hotspot discovery, characterized in that the device comprises:
A public opinion collection module, used to gather large amounts of public opinion information from the network into a database for later processing;
A preprocessing module, used to denoise and de-duplicate the large number of web pages in the database and to store the results in a structured database;
A word segmentation module, used for natural language processing of Chinese text, which splits the text into individual words tagged with their part of speech, so that the system processes text with the word as its atomic unit;
A clustering module, used, once the feature vector library has been built, to group documents that share the same features, thereby achieving hotspot discovery.
4. The device as claimed in claim 4, characterized in that the public opinion collection module comprises:
A configuration module, used to set the scope of the web pages the crawler fetches by specifying the list of entry URLs, the crawl depth and the polling crawl interval;
A crawling module, used to connect to the specified websites, fetch web pages according to the crawl depth and polling crawl interval set in the configuration module, and save them to the server database.
5. The device as claimed in claim 4, characterized in that the preprocessing module comprises:
A web page denoising module, used to extract useful information from web pages by matching the page content with regular expressions, and to save the extracted structured information to the database;
A de-duplication module, used to remove duplicates among the fetched web pages.
6. The device as claimed in claim 4, characterized in that the word segmentation module comprises:
Using a word segmentation system to split Chinese text, with the word as the smallest unit, in preparation for subsequent natural language processing.
7. The device as claimed in claim 4, characterized in that the clustering module comprises:
Using a clustering algorithm to process the feature vectors in the feature vector library, grouping highly similar texts into one class, thereby achieving hotspot discovery.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2010106180993A CN102567405A (en) | 2010-12-31 | 2010-12-31 | Hotspot discovery method based on improved text space vector representation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2010106180993A CN102567405A (en) | 2010-12-31 | 2010-12-31 | Hotspot discovery method based on improved text space vector representation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102567405A (en) | 2012-07-11 |
Family
ID=46412838
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2010106180993A Pending CN102567405A (en) | 2010-12-31 | 2010-12-31 | Hotspot discovery method based on improved text space vector representation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102567405A (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1873642A (en) * | 2006-04-29 | 2006-12-06 | 上海世纪互联信息系统有限公司 | Searching engine with automating sorting function |
CN101059805A (en) * | 2007-03-29 | 2007-10-24 | 复旦大学 | Network flow and delaminated knowledge library based dynamic file clustering method |
CN101751438A (en) * | 2008-12-17 | 2010-06-23 | 中国科学院自动化研究所 | Theme webpage filter system for driving self-adaption semantics |
CN101706807A (en) * | 2009-11-27 | 2010-05-12 | 清华大学 | Method for automatically acquiring new words from Chinese webpages |
CN101751458A (en) * | 2009-12-31 | 2010-06-23 | 暨南大学 | Network public sentiment monitoring system and method |
CN101727500A (en) * | 2010-01-15 | 2010-06-09 | 清华大学 | Text classification method of Chinese web page based on steam clustering |
CN101794311A (en) * | 2010-03-05 | 2010-08-04 | 南京邮电大学 | Fuzzy data mining based automatic classification method of Chinese web pages |
Non-Patent Citations (2)
Title |
---|
刘莹等: "基于数组的关联规则挖掘算法", 《计算机与数字工程》, no. 01, 31 January 2006 (2006-01-31) * |
宋驰等: "一种文本数据挖掘与可视化的新方法", 《北京生物医学工程》, no. 02, 30 April 2008 (2008-04-30) * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103150335A (en) * | 2013-01-25 | 2013-06-12 | 河南理工大学 | Co-clustering-based coal mine public sentiment monitoring system |
CN104281615A (en) * | 2013-07-08 | 2015-01-14 | 中国移动通信集团甘肃有限公司 | Complaint handling method and system |
CN104281615B (en) * | 2013-07-08 | 2018-05-15 | 中国移动通信集团甘肃有限公司 | A kind of method and system of complaint handling |
CN104794161A (en) * | 2015-03-24 | 2015-07-22 | 浪潮集团有限公司 | Method for monitoring network public opinions |
CN106156041A (en) * | 2015-03-26 | 2016-11-23 | 科大讯飞股份有限公司 | Hot information finds method and system |
CN106156041B (en) * | 2015-03-26 | 2019-05-28 | 科大讯飞股份有限公司 | Hot information finds method and system |
CN105447076A (en) * | 2015-11-04 | 2016-03-30 | 南京数律云信息科技有限公司 | Web page tag based security monitoring method and system |
CN106708926A (en) * | 2016-11-14 | 2017-05-24 | 北京赛思信安技术股份有限公司 | Realization method for analysis model supporting massive long text data classification |
CN106708926B (en) * | 2016-11-14 | 2020-10-30 | 北京赛思信安技术股份有限公司 | Implementation method of analysis model supporting massive long text data classification |
WO2019223153A1 (en) * | 2018-05-25 | 2019-11-28 | 平安科技(深圳)有限公司 | Big data structuring method, device, computer apparatus, and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20120711 |