CN102567405A - Hotspot discovery method based on improved text space vector representation - Google Patents

Info

Publication number
CN102567405A
Authority
CN
China
Prior art keywords
text
speech
space vector
webpage
module
Prior art date
Legal status
Pending
Application number
CN2010106180993A
Other languages
Chinese (zh)
Inventor
贺智明
宫哲
蒋琴琴
Current Assignee
BEIJING SAFE-CODE TECHNOLOGY Co Ltd
Original Assignee
BEIJING SAFE-CODE TECHNOLOGY Co Ltd
Priority date
Filing date
Publication date
Application filed by BEIJING SAFE-CODE TECHNOLOGY Co Ltd
Priority to CN2010106180993A
Publication of CN102567405A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a hotspot discovery method based on an improved text space vector representation. The method uses the improved text space vector representation to build a vector model, converting network text into a form that a computer can recognize and process, on which hotspot discovery can then be performed. The invention also provides a public opinion monitoring system for hotspot discovery.

Description

A hotspot discovery method based on improved text space vector representation
Technical field
The present invention relates to text mining and natural language processing, and in particular to a hotspot discovery method based on an improved text space vector representation and to a public opinion monitoring system.
Background art
Data mining is the non-trivial process of discovering valid, novel, potentially useful, and ultimately understandable patterns in large volumes of data. It emerged as a research field to address the present predicament of having massive amounts of data but lacking effective means of analysis. It now plays an important role in many areas, including bioinformatics and natural language processing. Internet public opinion analysis is based mainly on the text content published on the network, and therefore depends on text mining.
Text mining is concerned mainly with text feature extraction and text classification. Feature extraction is the basis of text classification: a good feature extraction method not only affects the accuracy of text processing but, more importantly, reduces the dimensionality of the vectors used to represent text, which increases efficiency and improves overall system performance. However, current Chinese language processing systems do not treat the study and optimization of feature extraction as a priority; they only try to improve classification accuracy through the processing algorithm (classification or clustering). Although some systems achieve fairly good results, they do so only when a large number of training samples is available, and they are poorly suited to the large volume of random information on the network. In recent years, feature extraction systems and methods have been widely applied in text processing and have accelerated its development. Nevertheless, the document representations currently in use share one drawback: the document feature vectors have extremely high dimensionality, which makes selecting a feature subset an indispensable step in text mining. Feature extraction performs this dimensionality reduction. Its purpose is mainly to improve program efficiency and running speed while also improving classification accuracy, by quickly screening out the set of feature items for a given class.
There are two main approaches to feature extraction. The first is independent evaluation. Based on the basic assumption that words are mutually independent (the orthogonality assumption), feature weights are adjusted according to one of several criteria, such as mutual information, expected cross entropy, or information gain. The basic idea is to evaluate each feature in the feature set independently: an evaluation function assigns each feature a weight, the features are sorted by weight, and the best feature subset is selected as the extraction result according to a weight threshold or a predetermined number of features. The second is comprehensive evaluation. Words that appear in a text are often correlated, that is, the non-orthogonal (oblique) case arises, which can affect the computed results to some extent. A comprehensive evaluation method therefore transforms the high-dimensional, mutually dependent original feature set to obtain a smaller number of composite indicators that describe these features. The composite indicators are mutually independent and can be used to select the feature set.
Since the 1990s, numerous statistical and machine learning methods have been applied to automatic text classification, and research on text classification has attracted great interest. Research on Chinese text classification has also begun in China, with preliminary applications in fields such as information retrieval, automatic classification of Web documents, digital libraries, automatic summarization, newsgroup classification, text filtering, word sense disambiguation, and document organization and management. Text classification has made great progress in recent years: various feature extraction and classification methods have been proposed, such as regression models, support vector machines, and maximum entropy models; several quite successful classification systems have been developed; and open classification corpora such as OHSUMED and Reuters have been established. Classification is an important data mining method, and text classification has almost as many methods as general classification. Among the many text classification algorithms, the commonly used ones are the Rocchio algorithm, the naive Bayes classifier, the k-nearest neighbor algorithm, decision trees, neural networks, and support vector machines.
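The independent evaluation approach described above (score each feature on its own, sort by weight, then keep features by threshold or by count) can be illustrated with a minimal sketch. It assumes pointwise mutual information between a term and a class as the scoring criterion, taking the maximum over classes; the function names `mutual_information_scores` and `select_features` are hypothetical and are not taken from the patent.

```python
import math
from collections import Counter, defaultdict


def mutual_information_scores(docs, labels):
    """Score each term by its maximum pointwise mutual information with any class.

    docs   -- list of token lists (text already segmented into words)
    labels -- list of class labels, one per document
    This scoring choice is an assumption for illustration only.
    """
    n_docs = len(docs)
    class_count = Counter(labels)        # number of documents per class
    term_count = Counter()               # number of documents containing each term
    joint_count = defaultdict(Counter)   # term -> class -> number of documents
    for tokens, label in zip(docs, labels):
        for term in set(tokens):         # presence in the document, not frequency
            term_count[term] += 1
            joint_count[term][label] += 1

    scores = {}
    for term, per_class in joint_count.items():
        best = -math.inf
        for label, joint in per_class.items():
            # PMI(t, c) = log( P(t, c) / (P(t) * P(c)) )
            pmi = math.log(joint * n_docs / (term_count[term] * class_count[label]))
            best = max(best, pmi)
        scores[term] = best
    return scores


def select_features(scores, top_k=None, threshold=None):
    """Sort terms by descending score, then filter by threshold and/or count."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    if threshold is not None:
        ranked = [(t, s) for t, s in ranked if s >= threshold]
    if top_k is not None:
        ranked = ranked[:top_k]
    return [t for t, _ in ranked]
```

A term that is spread evenly across classes receives a score near zero and falls below any positive threshold, which is the dimensionality reduction effect the background section describes.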
Text mining makes it possible to perform similarity-based deduplication of Internet text, hotspot discovery and tracking, association analysis, and trend analysis. Hotspot discovery means tracking, across various information sources, the pieces of information that discuss a target hotspot, discovering each previously unknown hotspot in a collection of information pieces, and detecting new hotspots online. Association analysis mines association rules from massive data. Trend analysis, in turn, examines how matters such as online public opinion develop over time, so that the public opinion environment can be monitored and early warning of harmful trends can be given.
  
Summary of the invention
A hotspot discovery method based on improved text space vector representation is provided. The method comprises constructing a feature vector model from text information and using an improved text space vector representation. Constructing the feature vector model specifically comprises: performing word segmentation on the structured data in the database; building a two-dimensional space vector with words as one dimension and documents as the other; and calculating the frequency of each word in each document and storing it in the two-dimensional space vector.
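The construction of the two-dimensional space vector can be sketched as follows. This is a minimal illustration that assumes the documents have already been segmented into word lists; the function name `build_term_document_matrix` and the sample words are hypothetical.

```python
from collections import Counter


def build_term_document_matrix(segmented_docs):
    """Build a word-by-document frequency matrix.

    segmented_docs -- list of documents, each a list of words
    Returns (vocabulary, matrix), where matrix[i][j] is the number of times
    vocabulary[i] occurs in document j.
    """
    vocabulary = sorted({word for doc in segmented_docs for word in doc})
    row_of = {word: i for i, word in enumerate(vocabulary)}
    matrix = [[0] * len(segmented_docs) for _ in vocabulary]
    for j, doc in enumerate(segmented_docs):
        for word, count in Counter(doc).items():
            matrix[row_of[word]][j] = count
    return vocabulary, matrix


# Two tiny, already segmented documents (hypothetical sample data).
vocabulary, matrix = build_term_document_matrix(
    [["地震", "救援", "地震"], ["救援", "捐款"]]
)
```

Each row corresponds to one word and each column to one document, so a column read top to bottom is the raw frequency vector of that document.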
The improved text space vector representation:
The terms of the weighting formula (the formula itself is not reproduced in this text) denote, respectively: the weight of the i-th feature word; the frequency of occurrence of word t in document d; N, the total number of documents; and the number of documents that contain t.
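The quantities named above (word frequency in a document, total number of documents, and number of documents containing the word) are the same ones used by the classic TF-IDF weighting. Because the formula is not reproduced in this text, the sketch below shows only the standard weight tf(t, d) * log(N / n_t) as an assumption; it should not be read as the patent's improved variant. The function name `tf_idf_weights` is hypothetical.

```python
import math


def tf_idf_weights(matrix):
    """Apply the standard TF-IDF weight to a word-by-document frequency matrix.

    matrix[i][j] -- frequency of word i in document j
    Returns a matrix of the same shape holding tf * log(N / n_t), where N is
    the number of documents and n_t the number of documents containing word i.
    This is the textbook formula, assumed here for illustration.
    """
    n_docs = len(matrix[0]) if matrix else 0
    weights = []
    for row in matrix:
        doc_freq = sum(1 for tf in row if tf > 0)
        idf = math.log(n_docs / doc_freq) if doc_freq else 0.0
        weights.append([tf * idf for tf in row])
    return weights
```

Under this weighting a word that occurs in every document receives weight zero, so it contributes nothing when documents are later compared.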
The invention also provides a public opinion monitoring system for hotspot discovery. The device comprises:
A public opinion collection module, used to collect the large volume of public opinion information on the network into a database for later processing. It comprises a configuration module, used to set the scope of the web pages fetched by the crawler, the scope being determined by a list of entry URLs, a crawl depth, and a polling interval; and a crawling module, used to connect to the specified websites, fetch web pages according to the crawl depth and polling interval set in the configuration module, and save them to the server database;
A preprocessing module, comprising a web page denoising module, used to extract useful information from web pages by matching page content with regular expressions and saving the extracted structured information to the database; and a deduplication module, used to remove duplicates among the fetched web pages;
A word segmentation module, used for natural language processing of Chinese text: it splits the text into individual words tagged with their parts of speech, so that the system processes text with the word as the atomic unit;
A clustering module, used, after the feature vector library has been built, to group documents with the same features, thereby achieving hotspot discovery.
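The clustering step can be sketched as follows. The text does not name a specific clustering algorithm, so this example assumes single-pass clustering with cosine similarity, a technique often used for online topic detection; the function names, the similarity threshold, and the minimum cluster size are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def single_pass_cluster(vectors, threshold=0.5):
    """Assign each vector to the most similar existing cluster, or start a new one.

    A vector joins a cluster only if its similarity to the cluster centroid
    reaches the threshold; the centroid is then updated as a running mean.
    """
    clusters = []  # each cluster: {"centroid": [...], "members": [document indices]}
    for idx, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"centroid": list(vec), "members": [idx]})
        else:
            n = len(best["members"])
            best["centroid"] = [(c * n + x) / (n + 1)
                                for c, x in zip(best["centroid"], vec)]
            best["members"].append(idx)
    return clusters


def hotspots(clusters, min_size=2):
    """Treat clusters with at least min_size documents as hotspots, largest first."""
    big = [c["members"] for c in clusters if len(c["members"]) >= min_size]
    return sorted(big, key=len, reverse=True)
```

Because each document is examined once and compared only against cluster centroids, this style of clustering can also absorb newly crawled documents incrementally, which fits the online detection of new hotspots mentioned in the background section.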
Description of drawings
Fig. 1 is a schematic diagram of the public opinion collection module;
Fig. 2 is a schematic diagram of the preprocessing module;
Fig. 3 is a schematic diagram of the clustering module.

Claims (7)

1. A hotspot discovery method based on improved text space vector representation, characterized in that the method comprises:
Constructing a feature vector model from text information;
Using an improved text space vector representation.
2. the method for claim 1 is characterized in that, said text message construction feature vector model method is specifically comprised:
Data library structure data are carried out word segmentation processing, are one dimension with the speech, and document is that one dimension is set up the two-dimensional space vector;
Calculate the word frequency of each speech in document and put into the two-dimensional space vector.
3. A public opinion monitoring system for hotspot discovery, characterized in that the device comprises:
A public opinion collection module, used to collect the large volume of public opinion information on the network into a database for later processing;
A preprocessing module, used to denoise and deduplicate the large number of web pages in the database and store the results in a structured database;
A word segmentation module, used for natural language processing of Chinese text: it splits the text into individual words tagged with their parts of speech, so that the system processes text with the word as the atomic unit;
A clustering module, used, after the feature vector library has been built, to group documents with the same features, thereby achieving hotspot discovery.
4. The device of claim 3, characterized in that the public opinion collection module comprises:
A configuration module, used to set the scope of the web pages fetched by the crawler, the scope being determined by a list of entry URLs, a crawl depth, and a polling interval;
A crawling module, used to connect to the specified websites, fetch web pages according to the crawl depth and polling interval set in the configuration module, and save them to the server database.
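The interaction between the configuration module and the crawling module can be sketched as a depth-limited, breadth-first crawl. To stay self-contained, the example performs no network access: `fetch_page` is a hypothetical callable, `FAKE_SITE` is made-up data standing in for real websites, and a plain dictionary stands in for the server database. The names `CrawlConfig` and `crawl_once` are likewise assumptions.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class CrawlConfig:
    entry_urls: list              # list of entry URLs
    max_depth: int = 2            # crawl depth, counted in links from an entry URL
    poll_interval_s: int = 3600   # seconds between crawl rounds (used by a scheduler)


def crawl_once(config, fetch_page, store):
    """Run one breadth-first crawl round limited by config.max_depth.

    fetch_page(url) -> (page_text, outgoing_urls)   # hypothetical callable
    store           -- dict standing in for the server database
    """
    queue = deque((url, 0) for url in config.entry_urls)
    seen = set(config.entry_urls)
    while queue:
        url, depth = queue.popleft()
        text, links = fetch_page(url)
        store[url] = text
        if depth >= config.max_depth:
            continue                      # do not follow links past the depth limit
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return store


# Made-up link graph used in place of real HTTP requests.
FAKE_SITE = {
    "http://example.com/": ("home", ["http://example.com/a", "http://example.com/b"]),
    "http://example.com/a": ("page a", ["http://example.com/c"]),
    "http://example.com/b": ("page b", []),
    "http://example.com/c": ("page c", []),
}
```

In a running system a scheduler would call `crawl_once` again every `poll_interval_s` seconds; that loop is omitted here so the sketch terminates.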
5. The device of claim 4, characterized in that the preprocessing module comprises:
A web page denoising module, used to extract useful information from web pages by matching page content with regular expressions and saving the extracted structured information to the database;
A deduplication module, used to remove duplicates among the fetched web pages.
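The denoising and deduplication steps can be sketched as follows. The regular expressions, the record fields, and the use of an exact content hash for deduplication are all assumptions made for illustration; the text does not specify which patterns are used or how duplicates are detected, and a similarity-based comparison could be substituted for the hash.

```python
import hashlib
import re

TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)
SCRIPT_RE = re.compile(r"<(script|style)\b.*?</\1>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def extract_record(html):
    """Extract a structured record (title and plain body text) from raw HTML."""
    title_match = TITLE_RE.search(html)
    title = title_match.group(1).strip() if title_match else ""
    body = SCRIPT_RE.sub(" ", html)        # drop script and style blocks
    body = TITLE_RE.sub(" ", body)         # the title is stored separately
    body = TAG_RE.sub(" ", body)           # strip the remaining tags
    body = re.sub(r"\s+", " ", body).strip()
    return {"title": title, "body": body}


def deduplicate(records):
    """Keep the first record for each distinct body text (exact-match hashing)."""
    seen, unique = set(), []
    for record in records:
        digest = hashlib.sha256(record["body"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique
```

Regular expressions are adequate for pages with a known, regular layout; for arbitrary HTML a real parser would be more robust.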
6. The device of claim 4, characterized in that the word segmentation module comprises:
Using a word segmentation system to split Chinese text, with the word as the smallest unit, in preparation for subsequent natural language processing.
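Splitting Chinese text into words can be illustrated with forward maximum matching against a dictionary. This technique is assumed here for illustration; the text does not say which word segmentation system is used. The function name `segment`, the tiny lexicon, and the part-of-speech tags (with "x" for unknown) are hypothetical.

```python
def segment(text, lexicon, max_word_len=4):
    """Segment text by forward maximum matching.

    lexicon -- dict mapping each known word to a part-of-speech tag
    Returns a list of (word, tag) pairs; characters not covered by the
    lexicon are emitted one at a time with the tag "x".
    """
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to a single character.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                tokens.append((candidate, lexicon.get(candidate, "x")))
                i += length
                break
    return tokens


# Hypothetical four-word lexicon: n = noun, v = verb.
LEXICON = {"网络": "n", "舆情": "n", "监控": "v", "系统": "n"}
```

Greedy matching of this kind is simple but can mis-segment ambiguous strings, which is why production systems typically combine a dictionary with statistical models.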
7. The device of claim 4, characterized in that the clustering module comprises:
Using a clustering algorithm to process the feature vectors in the feature vector library, grouping highly similar texts into one class, thereby achieving hotspot discovery.
CN2010106180993A 2010-12-31 2010-12-31 Hotspot discovery method based on improved text space vector representation Pending CN102567405A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010106180993A CN102567405A (en) 2010-12-31 2010-12-31 Hotspot discovery method based on improved text space vector representation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2010106180993A CN102567405A (en) 2010-12-31 2010-12-31 Hotspot discovery method based on improved text space vector representation

Publications (1)

Publication Number Publication Date
CN102567405A true CN102567405A (en) 2012-07-11

Family

ID=46412838

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010106180993A Pending CN102567405A (en) 2010-12-31 2010-12-31 Hotspot discovery method based on improved text space vector representation

Country Status (1)

Country Link
CN (1) CN102567405A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1873642A (en) * 2006-04-29 2006-12-06 上海世纪互联信息系统有限公司 Searching engine with automating sorting function
CN101059805A (en) * 2007-03-29 2007-10-24 复旦大学 Network flow and delaminated knowledge library based dynamic file clustering method
CN101751438A (en) * 2008-12-17 2010-06-23 中国科学院自动化研究所 Theme webpage filter system for driving self-adaption semantics
CN101706807A (en) * 2009-11-27 2010-05-12 清华大学 Method for automatically acquiring new words from Chinese webpages
CN101751458A (en) * 2009-12-31 2010-06-23 暨南大学 Network public sentiment monitoring system and method
CN101727500A (en) * 2010-01-15 2010-06-09 清华大学 Text classification method of Chinese web page based on steam clustering
CN101794311A (en) * 2010-03-05 2010-08-04 南京邮电大学 Fuzzy data mining based automatic classification method of Chinese web pages

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
刘莹等: "基于数组的关联规则挖掘算法", 《计算机与数字工程》, no. 01, 31 January 2006 (2006-01-31) *
宋驰等: "一种文本数据挖掘与可视化的新方法", 《北京生物医学工程》, no. 02, 30 April 2008 (2008-04-30) *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103150335A (en) * 2013-01-25 2013-06-12 河南理工大学 Co-clustering-based coal mine public sentiment monitoring system
CN104281615A (en) * 2013-07-08 2015-01-14 中国移动通信集团甘肃有限公司 Complaint handling method and system
CN104281615B (en) * 2013-07-08 2018-05-15 中国移动通信集团甘肃有限公司 A kind of method and system of complaint handling
CN104794161A (en) * 2015-03-24 2015-07-22 浪潮集团有限公司 Method for monitoring network public opinions
CN106156041A (en) * 2015-03-26 2016-11-23 科大讯飞股份有限公司 Hot information finds method and system
CN106156041B (en) * 2015-03-26 2019-05-28 科大讯飞股份有限公司 Hot information finds method and system
CN105447076A (en) * 2015-11-04 2016-03-30 南京数律云信息科技有限公司 Web page tag based security monitoring method and system
CN106708926A (en) * 2016-11-14 2017-05-24 北京赛思信安技术股份有限公司 Realization method for analysis model supporting massive long text data classification
CN106708926B (en) * 2016-11-14 2020-10-30 北京赛思信安技术股份有限公司 Implementation method of analysis model supporting massive long text data classification
WO2019223153A1 (en) * 2018-05-25 2019-11-28 平安科技(深圳)有限公司 Big data structuring method, device, computer apparatus, and storage medium

Similar Documents

Publication Publication Date Title
CN104537097B (en) Microblogging public sentiment monitoring system
CN103914478B (en) Webpage training method and system, webpage Forecasting Methodology and system
CN102567405A (en) Hotspot discovery method based on improved text space vector representation
CN101488150B (en) Real-time multi-view network focus event analysis apparatus and analysis method
CN104077377A (en) Method and device for finding network public opinion hotspots based on network article attributes
CN110543595B (en) In-station searching system and method
CN110929145A (en) Public opinion analysis method, public opinion analysis device, computer device and storage medium
CN105975478A (en) Word vector analysis-based online article belonging event detection method and device
CN102542061B (en) Intelligent product classification method
Shen et al. On robust image spam filtering via comprehensive visual modeling
CN101814083A (en) Automatic webpage classification method and system
CN108416034B (en) Information acquisition system based on financial heterogeneous big data and control method thereof
CN105447081A (en) Cloud platform-oriented government affair and public opinion monitoring method
CN105912524B (en) The article topic keyword extracting method and device decomposed based on low-rank matrix
CN104199833A (en) Network search term clustering method and device
Alzahrani et al. Comparative study of machine learning algorithms for SMS spam detection
CN106980651B (en) Crawling seed list updating method and device based on knowledge graph
CN109033281B (en) Intelligent pushing system of knowledge resource library
CN102402589A (en) Method and equipment for providing reference research information related to research request
CN103207864A (en) Online novel content similarity comparison method
CN108959329A (en) A kind of file classification method, device, medium and equipment
CN108595411B (en) Method for acquiring multiple text abstracts in same subject text set
CN103455597A (en) Distributed information hiding detection method facing mass web images
US11334592B2 (en) Self-orchestrated system for extraction, analysis, and presentation of entity data
CN103488741A (en) Online semantic excavation system of Chinese polysemic words and based on uniform resource locator (URL)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20120711