CN102567405A

CN102567405A - Hotspot discovery method based on improved text space vector representation

Info

Publication number: CN102567405A
Application number: CN2010106180993A
Authority: CN
Inventors: 贺智明; 宫哲; 蒋琴琴
Original assignee: BEIJING SAFE-CODE TECHNOLOGY Co Ltd
Current assignee: BEIJING SAFE-CODE TECHNOLOGY Co Ltd
Priority date: 2010-12-31
Filing date: 2010-12-31
Publication date: 2012-07-11

Abstract

The invention discloses a hotspot discovery method based on an improved text space vector representation. The method uses the improved representation to build a vector model, converting network text into a form that a computer can recognize and process, so that hotspot discovery can then be performed on it. The invention also provides a public opinion monitoring system that implements hotspot discovery.

Description

A hotspot discovery method based on an improved text space vector representation

Technical field

The present invention relates to text mining and natural language processing, and in particular to a hotspot discovery method based on an improved text space vector representation and to a public opinion monitoring system.

Background technology

Data mining is the non-trivial process of discovering valid, novel, potentially useful, and ultimately understandable patterns in large volumes of data. It emerged as a research field precisely to address the present predicament of having massive amounts of data but lacking effective means of analyzing it. Data mining now plays an important role in many areas, including bioinformatics and natural language processing. Internet public opinion analysis is based mainly on the textual content published on the network, and therefore depends on text mining technology.

Text mining is concerned mainly with text feature extraction and text classification. Feature extraction is the basis of text classification: a good feature extraction method not only improves the accuracy of text processing but, more importantly, reduces the dimensionality of the vectors used to represent text, which increases efficiency and improves overall system performance. At present, however, Chinese language processing systems do not treat the study and optimization of feature extraction as a priority; they attempt to improve classification accuracy only through the processing algorithm (classification or clustering). Although some systems achieve fairly good results, they do so only when a large number of training samples is available, and they are poorly suited to the large amount of random information on the network. In recent years, feature extraction systems and methods have been widely used in text processing and have accelerated its development. However, the document representation methods currently in use share a common drawback: the document feature vector has a very high dimensionality, which makes selecting a feature subset an indispensable step in text mining. Feature extraction performs this dimensionality reduction. Its main purpose is to improve program efficiency and running speed while also improving classification accuracy, by quickly screening out the set of feature items for a given class.

There are two main approaches to feature extraction. The first is independent evaluation, which rests on the basic assumption that words are mutually independent (the orthogonality assumption). Each feature in the feature set is evaluated on its own, using one of several criteria for adjusting feature weights, such as mutual information, expected cross entropy, or information gain. An algorithm is constructed to adjust the weight of each feature, the features are sorted by weight, and the best feature subset is selected according to a weight threshold or a predetermined number of features; that subset is the result of feature extraction. The second is comprehensive evaluation. The words that occur in a text are often correlated to some degree, so the orthogonality assumption does not hold (the oblique case), and this affects the computed results to some extent. A comprehensive evaluation method therefore transforms the high-dimensional original feature set, whose features are not mutually independent, into a smaller number of composite indicators that describe those features. These composite indicators are mutually independent, and the feature set can be selected using them.

Since the 1990s, numerous statistical and machine learning methods have been applied to automatic text classification, and research on text classification has attracted great interest. Research on Chinese text classification has also begun in China, with preliminary applications in fields such as information retrieval, automatic classification of Web documents, digital libraries, automatic summarization, newsgroup classification, text filtering, word sense disambiguation, and document organization and management. Text classification has made considerable progress in recent years: a variety of feature extraction and classification methods have been proposed, such as regression models, support vector machines, and maximum entropy models; some quite successful classification systems have been developed; and open classification corpora such as OHSUMED and Reuters have been established. Classification is an important data mining method, and text classification has almost as many methods as general classification. Among the many text classification algorithms, the more commonly used are the Rocchio algorithm, the naive Bayes classifier, the k-nearest neighbor algorithm, decision trees, neural networks, and support vector machines.

Text mining technology can be used to perform similarity-based deduplication of Internet text, hotspot discovery and tracking, association analysis, and trend analysis. Hotspot discovery means tracking, across various information sources, the information fragments that discuss target hotspots, finding each previously unknown hotspot in the collection of fragments, and detecting new hotspots online. Association analysis mines association rules from massive data. Trend analysis is used alongside it to analyze how network public opinion and similar phenomena develop over time, so as to monitor the public opinion environment and provide early warning of harmful trends.

Summary of the invention

A hotspot discovery method based on an improved text space vector representation is provided. The method comprises constructing a feature vector model for text information and applying an improved text space vector representation method. Constructing the feature vector model specifically comprises: performing word segmentation on the structured data in the database; building a two-dimensional space vector with words as one dimension and documents as the other; and computing the frequency of each word in each document and storing it in the two-dimensional space vector.
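The construction of the two-dimensional space vector can be illustrated with a minimal Python sketch. It assumes the documents have already been segmented into lists of words; the function name `build_term_document_matrix` and the sample documents are hypothetical and are not taken from the patent.

```python
from collections import Counter


def build_term_document_matrix(segmented_docs):
    """Build a words-by-documents matrix of raw word frequencies.

    segmented_docs: list of documents, each a list of already-segmented words.
    Returns (vocabulary, matrix) where matrix[i][j] is the number of times
    vocabulary[i] occurs in document j.
    """
    # One row per distinct word, in a stable (sorted) order.
    vocabulary = sorted({word for doc in segmented_docs for word in doc})
    row_of = {word: i for i, word in enumerate(vocabulary)}

    # One column per document, initialised to zero counts.
    matrix = [[0] * len(segmented_docs) for _ in vocabulary]
    for j, doc in enumerate(segmented_docs):
        for word, count in Counter(doc).items():
            matrix[row_of[word]][j] = count
    return vocabulary, matrix


# Hypothetical, already-segmented sample documents.
docs = [
    ["earthquake", "rescue", "earthquake"],
    ["rescue", "donation"],
]
vocab, tf = build_term_document_matrix(docs)
```

Here rows correspond to words and columns to documents, matching the "words as one dimension, documents as the other" layout described above. A sparse structure would be a more realistic choice for a large vocabulary.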

Improved text space vector representation method:

Here the symbols of the weighting formula, which did not survive in this text, denote respectively: the weight of the i-th feature word; the frequency of occurrence of word t in document d; N, the total number of documents; and the number of documents containing t.
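The quantities listed here (term frequency, total number of documents, and number of documents containing the term) are the same inputs used by the classical TF-IDF weight. The patent's improved formula is not available in this text, so the sketch below shows only the classical form, tf × log(N / df), as a point of reference. The function names are hypothetical, and this is an assumption about the general family of the weighting, not the patented formula.

```python
import math


def tf_idf(tf, n_docs, df):
    """Classical TF-IDF weight: tf * log(N / df).

    tf:     frequency of the word in the document
    n_docs: total number of documents (N)
    df:     number of documents containing the word
    """
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(n_docs / df)


def weight_matrix(term_doc_counts):
    """Turn a words-by-documents count matrix into TF-IDF weights."""
    n_docs = len(term_doc_counts[0]) if term_doc_counts else 0
    weights = []
    for row in term_doc_counts:
        df = sum(1 for count in row if count > 0)
        weights.append([tf_idf(count, n_docs, df) for count in row])
    return weights


# Hypothetical counts: 3 words (rows) across 2 documents (columns).
counts = [[0, 1], [2, 0], [1, 1]]
weights = weight_matrix(counts)
```

Note that a word occurring in every document (the third row) receives weight zero under the classical form, since log(N / df) = log(1) = 0.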

The invention also provides a public opinion monitoring system that implements hotspot discovery. The system comprises the following modules.

Public opinion acquisition module: collects the large volume of public opinion information available on the network into a database for later processing. It comprises a configuration module, which sets the scope of the pages the crawler captures by specifying a list of website entry points, the crawl depth, and the polling crawl interval; and a crawling module, which connects to the specified websites, captures pages according to the crawl depth and polling interval set in the configuration module, and saves them to the server database.
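As an illustration of how the acquisition module's configuration (entry list, crawl depth, polling interval) might bound a crawl, here is a minimal Python sketch of a single breadth-first pass. All names (`CrawlConfig`, `crawl_once`, `fetch`) are hypothetical. Page fetching is injected as a function so the sketch runs without network access, and a plain dict stands in for the server database.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class CrawlConfig:
    entry_urls: list                    # website entry list
    max_depth: int = 2                  # crawl depth
    poll_interval_seconds: int = 3600   # polling interval; a scheduler would use it


def crawl_once(config, fetch, store):
    """Run one breadth-first crawl pass, limited to config.max_depth.

    fetch(url) must return (html, outgoing_links).
    store is a dict standing in for the server database.
    """
    entries = list(dict.fromkeys(config.entry_urls))  # drop duplicate entries
    seen = set(entries)
    queue = deque((url, 0) for url in entries)
    while queue:
        url, depth = queue.popleft()
        html, links = fetch(url)
        store[url] = html
        if depth >= config.max_depth:
            continue  # do not follow links beyond the configured depth
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return store


# Hypothetical in-memory "web": url -> (html, outgoing links).
FAKE_SITE = {
    "a": ("<p>A</p>", ["b", "c"]),
    "b": ("<p>B</p>", ["d"]),
    "c": ("<p>C</p>", []),
    "d": ("<p>D</p>", []),
}
pages = crawl_once(CrawlConfig(entry_urls=["a"], max_depth=1),
                   FAKE_SITE.__getitem__, {})
```

With `max_depth=1`, page `d` is never fetched because it is two links away from the entry point. A real deployment would repeat this pass every `poll_interval_seconds`.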

Pre-processing module: comprises a web page denoising module, which extracts the useful information from a page by matching the page content against regular expressions and saves the extracted structured information to the database; and a deduplication module, which removes duplicates among the captured pages.
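The pre-processing module's two steps, regular-expression denoising and deduplication, can be sketched as follows. The patterns, field names, and the choice of an exact content hash for deduplication are assumptions made for illustration; the patent does not specify them, and a similarity-based method would be needed to catch near-duplicates.

```python
import hashlib
import re

# Hypothetical patterns; a real system would tune these per site.
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)
SCRIPT_RE = re.compile(r"<(script|style)\b.*?</\1>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def extract_record(html):
    """Denoise a page: pull out the title and the visible body text."""
    match = TITLE_RE.search(html)
    title = match.group(1).strip() if match else ""
    body = SCRIPT_RE.sub(" ", html)   # drop scripts and styles
    body = TITLE_RE.sub(" ", body)    # the title is stored separately
    text = " ".join(TAG_RE.sub(" ", body).split())  # strip tags, squeeze spaces
    return {"title": title, "text": text}


def deduplicate(records):
    """Keep the first record for each distinct body text (exact match only)."""
    seen = set()
    unique = []
    for record in records:
        digest = hashlib.sha256(record["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique


page = ("<html><head><title> Flood </title><script>var x=1;</script></head>"
        "<body><p>River  rises</p></body></html>")
record = extract_record(page)
unique_records = deduplicate([record, extract_record(page)])
```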

Word segmentation module: performs natural language processing on Chinese text, splitting the text into individual words tagged with their parts of speech, so that the system can process text with the word as its atomic unit.
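The patent does not say which segmentation algorithm the word segmentation module uses. As one simple possibility, the sketch below uses dictionary-based forward maximum matching, which is a swapped-in technique and not the patent's method. The lexicon, the part-of-speech tags, and the fallback tag `"x"` for unknown characters are all hypothetical.

```python
def segment(text, lexicon, max_word_len=4):
    """Forward maximum matching over a word -> part-of-speech lexicon.

    At each position, take the longest lexicon word that matches; if none
    matches, emit the single character with the unknown tag "x".
    Returns a list of (word, part_of_speech) pairs.
    """
    result = []
    i = 0
    while i < len(text):
        longest = min(max_word_len, len(text) - i)
        for size in range(longest, 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                result.append((candidate, lexicon.get(candidate, "x")))
                i += size
                break
    return result


# Hypothetical toy lexicon: n = noun, v = verb.
LEXICON = {"网络": "n", "舆情": "n", "监控": "v", "系统": "n"}
tokens = segment("网络舆情监控系统", LEXICON)
```

The resulting word list is what the feature vector construction step would consume, with the word as the atomic unit.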

Clustering module: after the feature vector library has been built, groups documents that share the same features, thereby achieving hotspot discovery.
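The clustering algorithm is not named in the text. As a hedged illustration of how grouping similar feature vectors can surface hotspots, the sketch below uses single-pass clustering with cosine similarity, an assumed choice that is common in topic detection. The threshold, the minimum cluster size, and all function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass_cluster(vectors, threshold=0.5):
    """Assign each vector to the most similar cluster, or start a new one."""
    clusters = []  # each cluster: {"centroid": [...], "members": [doc indices]}
    for idx, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"centroid": list(vec), "members": [idx]})
        else:
            n = len(best["members"])
            # Update the centroid as the running mean of its members.
            best["centroid"] = [(c * n + x) / (n + 1)
                                for c, x in zip(best["centroid"], vec)]
            best["members"].append(idx)
    return clusters


def hotspots(clusters, min_size=2):
    """Treat clusters with at least min_size documents as hotspots."""
    big = [c["members"] for c in clusters if len(c["members"]) >= min_size]
    return sorted(big, key=len, reverse=True)


# Hypothetical document feature vectors (one per document).
vectors = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [1, 0.1, 0]]
clusters = single_pass_cluster(vectors)
hot = hotspots(clusters)
```

Single-pass clustering suits the online setting mentioned in the background, since each new document is placed as it arrives, but its result depends on document order.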

Description of drawings

Fig. 1 is a schematic diagram of the public opinion acquisition module;

Fig. 2 is a schematic diagram of the pre-processing module;

Fig. 3 is a schematic diagram of the clustering module.

Claims

1. A hotspot discovery method based on an improved text space vector representation, characterized in that the method comprises:

constructing a feature vector model for text information;

applying an improved text space vector representation method.

2. the method for claim 1 is characterized in that, said text message construction feature vector model method is specifically comprised:

Data library structure data are carried out word segmentation processing, are one dimension with the speech, and document is that one dimension is set up the two-dimensional space vector;

Calculate the word frequency of each speech in document and put into the two-dimensional space vector.

3. A public opinion monitoring system that implements hotspot discovery, characterized in that the device comprises:

a public opinion acquisition module, for collecting the large volume of public opinion information available on the network into a database for later processing;

a pre-processing module, for denoising and deduplicating the large number of web pages in the database and storing the results in a structured database;

4. The device of claim 4, characterized in that the public opinion acquisition module comprises:

a configuration module, for setting the scope of the pages the crawler captures by specifying a list of website entry points, the crawl depth, and the polling crawl interval;

a crawling module, for connecting to the specified websites, capturing pages according to the crawl depth and polling crawl interval set in the configuration module, and saving them to the server database.

5. The device of claim 4, characterized in that the pre-processing module comprises:

a web page denoising module, for extracting the useful information from a page by matching the page content against regular expressions, and saving the extracted structured information to the database;

a deduplication module, for removing duplicates among the captured pages.

6. The device of claim 4, characterized in that the word segmentation module:

uses a word segmentation system to split Chinese text, with the word as the smallest unit, in preparation for subsequent natural language processing.

7. The device of claim 4, characterized in that the clustering module:

uses a clustering algorithm to process the feature vectors in the feature vector library, grouping highly similar texts into one class, thereby achieving hotspot discovery.