CN105447161A - Data feature based intelligent information classification method - Google Patents


Info

Publication number
CN105447161A
Authority
CN
China
Prior art keywords
word
classification
frequency
document
web page
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510866092.6A
Other languages
Chinese (zh)
Inventor
刘治
张胜
章云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN201510866092.6A priority Critical patent/CN105447161A/en
Publication of CN105447161A publication Critical patent/CN105447161A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Abstract

The invention belongs to the field of data mining and relates to an intelligent information classification method based on data features. The method comprises a training stage on labeled web pages and a classification stage for web pages to be classified. The training stage comprises: preprocessing the web pages; performing Chinese word segmentation and stop-word removal on the page content; creating a knowledge base according to data features; performing feature selection on the web pages and generating feature vectors; and generating an SVM classifier. The classification stage comprises: pre-classifying the web pages; and classifying pages that pre-classification cannot resolve with the SVM classifier. The method addresses the inability of existing information classification methods to classify Chinese web pages quickly and efficiently.

Description

An intelligent information classification method based on data features
Technical field
The invention belongs to the field of data mining and relates to an intelligent information classification method based on data features.
Background
With the rapid development of the Internet, the amount of information on the network is growing explosively. Faced with such a massive volume of Web information, obtaining useful information quickly and accurately is one of the challenges facing current Internet technology. Automatic web page classification is an important technique for processing massive Web information efficiently: a computer assigns each page to be classified to one of a set of predefined categories according to the page's content, using an automatic classification algorithm.
Many automatic text classification algorithms based on statistical theory and machine learning already exist. Compared with ordinary text documents, however, web pages have the following characteristics: (1) web pages are written as hypertext and contain HTML tags, which makes them more expressive than plain text and provides more structural and formatting information that can be exploited; (2) web pages on the Web are linked to one another by hyperlinks, and the content recommendations and content relationships implied by hyperlinks offer considerable heuristic information for classification; (3) web pages usually contain a large amount of noise, such as advertisements, navigation bars, recommendation columns, author information, and other content unrelated to the topic; (4) Chinese web pages are written in Chinese, which, unlike English, does not separate words with spaces, so Chinese web pages require word segmentation. These characteristics make web page classification considerably more complex than plain text classification.
Summary of the invention
To address the above problems, and after studying the characteristics of Chinese web pages in depth, the present invention exploits the fact that parts of a page such as the title and keywords carry higher weight for the classification result. It proposes pre-classifying pages against a knowledge base built from data features, namely preset keyword tables and title content, and, as a supplementary method, converting a page into a feature vector and classifying it with the SVM algorithm. The method greatly improves the overall performance of the classifier.
The specific technical scheme is as follows. An intelligent information classification method based on data features comprises two stages, training and classification.
The training stage proceeds as follows:
Step 1: preprocess the training web pages, removing HTML markup irrelevant to classification and extracting the body text.
Step 2: perform Chinese word segmentation on the extracted text and remove stop words that contribute little to classification. Chinese function words and characters with no practical meaning, as well as some rare characters and special symbols, must all be removed as stop words.
Step 3: compute word frequencies over the segmented, stop-word-filtered result.
Step 4: perform feature selection on the word frequency statistics by setting a word frequency threshold and filtering out words whose frequency is below it.
Step 5: compute weights for the remaining high-frequency words and generate feature vectors.
Step 6: create the domain knowledge base, that is, preset a keyword table for each category to be trained.
Step 7: create the SVM classifier.
The classification stage proceeds as follows:
Step 1: preprocess the web page to be classified, removing HTML markup irrelevant to classification and extracting the body text.
Step 2: perform Chinese word segmentation and stop-word removal on the extracted text, in the same way as during training.
Step 3: pre-classify. Extract the title content of the page, compare it with the keyword tables in the preset domain knowledge base, and determine the page's category. If pre-classification succeeds, return the classification result directly; if it fails, continue with the following steps.
Step 4: convert the segmented, stop-word-filtered body text into a feature vector.
Step 5: classify the feature vector with the SVM classifier and return the classification result.
Based on the technical scheme disclosed above, the present invention has the following beneficial effects:
1. Because parts of a page such as the title and keywords carry higher weight for the classification result, the invention pre-classifies pages using a knowledge base of preset keyword tables and title content, which greatly increases the speed of classifying Chinese web pages.
2. The invention builds a knowledge base from data features for pre-classification and uses SVM classification as a supplement, which greatly improves the overall performance of the classifier.
Brief description of the drawings
Fig. 1 is the system flowchart of the intelligent information classification method based on data features proposed by the present invention.
Fig. 2 is the flowchart of the training stage of the intelligent information classification method based on data features proposed by the present invention.
Fig. 3 is the flowchart of the classification stage of the intelligent information classification method based on data features proposed by the present invention.
Detailed description
Fig. 1 shows the system flowchart of the intelligent information classification method based on data features proposed by the present invention. With reference to Fig. 1, the method comprises: step S1, training on labeled web pages; and step S2, classifying the web pages to be classified.
With reference to Fig. 2, training on the labeled web pages in step S1 comprises:
Step S11: preprocess the training web pages, removing HTML markup irrelevant to classification and extracting the body text. First, remove the HTML source code enclosed by tags such as <style>, <script>, and <applet>. Second, extract the content of the <title> and <meta> tags and save it separately. Finally, after filtering out the above HTML tags, extract the body text of the page.
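The preprocessing in step S11 can be sketched with Python's standard `html.parser` module. This is only an illustration under stated assumptions: the class name `PageTextExtractor` and the function `preprocess` are hypothetical, and the patent does not say which parser or language is used.

```python
from html.parser import HTMLParser


class PageTextExtractor(HTMLParser):
    """Collects the title, meta keywords/description, and visible body text."""

    # Elements whose enclosed content is discarded, as listed in step S11.
    SKIPPED = {"style", "script", "applet"}

    def __init__(self):
        super().__init__()
        self.title = []
        self.meta = {}
        self.body = []
        self._skip_depth = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            name = (attr.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.meta[name] = attr.get("content") or ""

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return
        # Title text is saved separately from the body text.
        (self.title if self._in_title else self.body).append(text)


def preprocess(html):
    """Returns (title, meta fields, body text) for one page (hypothetical helper)."""
    parser = PageTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.title), parser.meta, " ".join(parser.body)
```

Keeping the title and meta fields apart from the body matters later, because those fields define the core words and feed the pre-classification step.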
Step S12: perform Chinese word segmentation on the extracted body text and remove stop words that contribute little to classification. Chinese function words and characters with no practical meaning, as well as some rare characters and special symbols, must all be removed as stop words.
Step S13: compute word frequencies over the segmented, stop-word-filtered result.
Step S14: perform feature selection on the word frequency statistics by setting a word frequency threshold and filtering out words whose frequency is below it.
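Steps S12 to S14 can be illustrated with a short Python sketch. It assumes the text has already been segmented into a list of tokens by some Chinese word segmenter (the patent does not name one); the function name `select_features`, the stop-word set, and the threshold value are all hypothetical.

```python
from collections import Counter


def select_features(tokens, stop_words, min_count):
    """Drops stop words, counts word frequencies, and keeps frequent words.

    tokens: words produced by a Chinese word segmenter (assumed input).
    stop_words: set of words to discard (step S12).
    min_count: word frequency threshold (step S14); words below it are dropped.
    """
    counts = Counter(t for t in tokens if t not in stop_words)  # step S13
    return {term: n for term, n in counts.items() if n >= min_count}


# Illustrative tokens and stop words, not taken from the patent.
tokens = ["球队", "的", "比赛", "球队", "了", "比赛", "球队", "天气"]
features = select_features(tokens, stop_words={"的", "了"}, min_count=2)
```

Each surviving word becomes one dimension of the feature vector, so raising `min_count` directly lowers the vector's dimension.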
Step S15: use the vector space model (Vector Space Model) to convert the body text of each training web page into a feature vector. In this model, each text document is represented as the following feature vector:
$$V(d) = (t_1, \omega_1(d);\ t_2, \omega_2(d);\ \dots;\ t_n, \omega_n(d))$$
where $t_i$ is a feature term and $\omega_i(d)$ is the weight of $t_i$ in document $d$.
For step S15, the dimension of the feature vector needs to be reduced to simplify subsequent computation. Steps S12, S13, and S14 reduce the number of feature terms and therefore the dimension of the feature vector.
For step S15, the weight $\omega_i(d)$ of a feature term in a document can be computed with the traditional TF-IDF weighting algorithm:
$$\omega_i(d) = \frac{tf_i(d) \times \log(N/n_k + 0.01)}{\sqrt{\sum_{i=1}^{n} (tf_i(d))^2 \times [\log(N/n_k + 0.01)]^2}}$$
where $tf_i(d)$ is the frequency with which $t_i$ occurs in document $d$, $N$ is the total number of documents in the document set, and $n_k$ is the number of documents in which feature term $t_k$ occurs.
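As a rough illustration of the TF-IDF weighting, the following Python sketch computes normalized weights for one document. It makes several assumptions: the logarithm is natural, the document frequency used for each term is that term's own count of containing documents, and the normalization is the square root of the sum of squared raw weights. The function name `tfidf_weights` and its parameters are hypothetical.

```python
import math


def tfidf_weights(tf, df, n_docs):
    """Computes length-normalized TF-IDF weights for one document.

    tf: term -> frequency of the term in this document.
    df: term -> number of documents containing the term (must be > 0).
    n_docs: total number of documents in the document set.
    """
    # Raw weight: term frequency times the smoothed inverse document frequency.
    raw = {t: f * math.log(n_docs / df[t] + 0.01) for t, f in tf.items()}
    # Normalize so that the squared weights of the document sum to 1.
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: (w / norm if norm else 0.0) for t, w in raw.items()}


# Illustrative numbers: "a" is rare in the collection, "b" is in every document.
weights = tfidf_weights({"a": 2, "b": 1}, {"a": 1, "b": 10}, n_docs=10)
```

A term that occurs in every document gets a weight close to zero, because its inverse document frequency is only log(1.01).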
For step S15, two factors affect the weight of a feature term in the representation of a web page: the frequency with which the term occurs in the document, and the position at which it occurs. Feature words at different positions are therefore given different weight factors, computed as follows:
$$\lambda = \frac{\bar{d_k}}{\bar{d_0}} = \frac{(\sum d_k)/N_k}{(\sum d_0)/N_0}$$
where $\bar{d_k}$ is the average word frequency of core words, $\bar{d_0}$ is the average word frequency of non-core words, $d_k$ and $N_k$ are the word frequency and number of core words, and $d_0$ and $N_0$ are the word frequency and number of non-core words. Core words are the words that occur in the <title> tag and in the keywords and description fields of <meta> tags; all other words are non-core words.
In general, core words are few in number but occur often and in concentrated positions, so $\lambda \ge 1$; when the computed value is less than 1, $\lambda = 1$ is used. For core words, the feature weight becomes:
$$\omega'_i(d) = \lambda \times \omega_i(d)$$
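The position weight factor can be sketched in Python as follows. The function names are hypothetical, and returning 1.0 when either word list is empty or the non-core average is zero is an assumption made here; the patent does not cover those cases.

```python
def position_weight_factor(core_freqs, other_freqs):
    """Ratio of the average core-word frequency to the average non-core frequency.

    core_freqs: frequencies of core words (title, meta keywords/description).
    other_freqs: frequencies of all remaining (non-core) words.
    """
    if not core_freqs or not other_freqs:
        return 1.0  # assumption: no boost when a group is empty
    core_mean = sum(core_freqs) / len(core_freqs)
    other_mean = sum(other_freqs) / len(other_freqs)
    if other_mean == 0:
        return 1.0  # assumption: avoid division by zero
    # Values below 1 are clamped to 1, as the text specifies.
    return max(1.0, core_mean / other_mean)


def apply_position_weight(weights, core_terms, lam):
    """Multiplies the weight of each core term by lam; other terms are unchanged."""
    return {t: (lam * w if t in core_terms else w) for t, w in weights.items()}


lam = position_weight_factor([6, 4], [2, 1, 3])  # (10/2) / (6/3)
boosted = apply_position_weight({"a": 0.5, "b": 0.2}, {"a"}, lam)
```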
Step S16: create the knowledge base by presetting a keyword table for each category to be trained.
For step S16, the knowledge base is created as follows. First, for each category, extract the content of the <title> tags of all web pages of that category in the training set, segment it into words, compute word frequencies per category, and sort the words by descending frequency. Next, select some of these words as the knowledge base for pre-classification. The selection principle is to take the highest-frequency words of each category and check whether each occurs in the other categories; a word that does not occur in any other category is added to the knowledge base of its category.
Preferably, the requirement is relaxed in practice, because strictly requiring a word to occur only in the titles of a single category yields too few keywords to improve classification efficiency noticeably. If a word has a high frequency in one category, and its number of occurrences in the other categories does not exceed a fixed threshold, or its occurrences in the other categories as a proportion of the total number of web pages stay within a certain range (for example, 1%), the word is still selected as a pre-classification keyword for that category.
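A minimal sketch of the keyword-table construction, assuming titles have already been segmented into token lists. The function name `build_keyword_tables`, the `min_count` cutoff for "high frequency", and the choice to sum occurrences over all other categories are assumptions; `max_other_count=0` corresponds to the strict rule and a positive value to the relaxed one.

```python
from collections import Counter


def build_keyword_tables(title_tokens_by_category, min_count, max_other_count=0):
    """Builds one pre-classification keyword list per category.

    title_tokens_by_category: category -> list of segmented titles (token lists).
    min_count: minimum frequency within the category for a word to be considered.
    max_other_count: largest total number of occurrences allowed in other categories.
    """
    counts = {
        cat: Counter(tok for title in titles for tok in title)
        for cat, titles in title_tokens_by_category.items()
    }
    tables = {}
    for cat, counter in counts.items():
        keywords = []
        for term, n in counter.most_common():  # descending word frequency
            if n < min_count:
                break
            other = sum(counts[o][term] for o in counts if o != cat)
            if other <= max_other_count:
                keywords.append(term)
        tables[cat] = keywords
    return tables


# Illustrative training titles, not taken from the patent.
titles = {
    "sports": [["football", "league"], ["football", "news"], ["football", "league"]],
    "finance": [["stock", "news"], ["stock", "market"], ["stock", "news"]],
}
strict = build_keyword_tables(titles, min_count=2)
relaxed = build_keyword_tables(titles, min_count=2, max_other_count=1)
```

In this toy data, "news" is frequent in finance but also appears once in sports, so only the relaxed rule admits it as a finance keyword.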
Step S17: train on the generated feature vectors and create the SVM classifier.
For step S17, the SVM classifier works as follows:
(1) Let the training set be
$$T = \{(x_1, y_1), (x_2, y_2), \dots, (x_l, y_l)\} \in (X \times Y)^l$$
where $x_i \in X = R^n$, $y_i \in Y = \{-1, 1\}$, $i = 1, 2, \dots, l$.
(2) Select a suitable kernel function $K(x, x')$ and penalty parameter $C$, then construct and solve the following optimization problem, whose objective is the standard dual of the C-support vector classifier:
$$\min_{\alpha}\ \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} y_i y_j \alpha_i \alpha_j K(x_i, x_j) - \sum_{j=1}^{l} \alpha_j$$
$$\text{s.t.}\ \sum_{i=1}^{l} y_i \alpha_i = 0$$
$$0 \le \alpha_i \le C,\quad i = 1, 2, \dots, l$$
to obtain the optimal solution $\alpha^* = (\alpha_1^*, \alpha_2^*, \dots, \alpha_l^*)^T$.
(3) Select a positive component $\alpha_j^*$ of $\alpha^*$ that is less than $C$, and compute accordingly
$$b^* = y_j - \sum_{i=1}^{l} y_i \alpha_i^* K(x_i, x_j)$$
(4) Construct the decision function
$$f(x) = \operatorname{sgn}\left(\sum_{i=1}^{l} y_i \alpha_i^* K(x_i, x) + b^*\right)$$
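Parts (3) and (4) of the SVM description can be illustrated in plain Python once a dual solution is available. The sketch below does not solve the optimization problem; it assumes the multipliers are already known, uses a linear kernel for simplicity, and treats a score of exactly zero as the positive class. All function names are hypothetical.

```python
def linear_kernel(x, z):
    """Inner product of two vectors; one possible choice of K(x, x')."""
    return sum(a * b for a, b in zip(x, z))


def compute_bias(xs, ys, alphas, j, kernel):
    """Computes b* from a component alphas[j] that lies strictly between 0 and C."""
    return ys[j] - sum(y * a * kernel(x, xs[j]) for x, y, a in zip(xs, ys, alphas))


def decide(x, xs, ys, alphas, b, kernel):
    """Evaluates the decision function and returns +1 or -1."""
    score = sum(y * a * kernel(xi, x) for xi, y, a in zip(xs, ys, alphas)) + b
    return 1 if score >= 0 else -1  # assumption: a zero score maps to +1


# Two training points on the x-axis; alpha = (0.5, 0.5) is the dual solution
# for this toy problem with the linear kernel and C = 1.
xs = [(1.0, 0.0), (-1.0, 0.0)]
ys = [1, -1]
alphas = [0.5, 0.5]
b = compute_bias(xs, ys, alphas, j=0, kernel=linear_kernel)
label_pos = decide((2.0, 1.0), xs, ys, alphas, b, linear_kernel)
label_neg = decide((-3.0, 0.0), xs, ys, alphas, b, linear_kernel)
```

The decision function above separates two classes only; classifying pages into more than two categories would need a multi-class scheme on top of it, which the text does not specify.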
With reference to Fig. 3, classifying the web pages to be classified in step S2 comprises:
Step S21: preprocess the web page to be classified, in the same way as step S11.
Step S22: perform Chinese word segmentation and stop-word removal on the extracted text, in the same way as step S12.
Step S23: pre-classify the web page.
For step S23, pre-classification is carried out as follows: (1) extract the title content of the page, compare it with the keyword tables in the knowledge base to determine which category each word belongs to, and count how often the words in the title occur in each category; (2) if the word frequency for one category is the highest, the page is assigned to that category; (3) if the word frequencies for two categories are equal, compare the category priorities and assign the page to the category with the higher priority; (4) if the priorities are also equal, pre-classification fails and the following steps are carried out.
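The pre-classification rules of step S23 can be sketched as below. The function name `pre_classify` and the `priority` mapping are hypothetical. Two behaviors are assumptions not stated in the text: a title that matches no keyword at all is treated as a failure, and ties among more than two categories are resolved by the same priority rule.

```python
def pre_classify(title_tokens, keyword_tables, priority):
    """Returns the category chosen from the title, or None if pre-classification fails.

    title_tokens: segmented words of the page title.
    keyword_tables: category -> list of keywords (the knowledge base).
    priority: category -> numeric priority; larger means higher priority.
    """
    hits = {}
    for cat, keywords in keyword_tables.items():
        kw = set(keywords)
        hits[cat] = sum(1 for tok in title_tokens if tok in kw)
    best = max(hits.values(), default=0)
    if best == 0:
        return None  # assumption: no keyword matched counts as failure
    tied = [cat for cat, n in hits.items() if n == best]
    if len(tied) == 1:
        return tied[0]
    # Tie on word frequency: fall back to category priority.
    top = max(priority.get(cat, 0) for cat in tied)
    winners = [cat for cat in tied if priority.get(cat, 0) == top]
    return winners[0] if len(winners) == 1 else None


# Illustrative knowledge base, not taken from the patent.
tables = {"sports": ["football", "league"], "finance": ["stock", "market"]}
```

A return value of `None` signals that the page must go on to the SVM classifier.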
Step S24: compute word frequencies over the segmented, stop-word-filtered result, in the same way as step S13.
Step S25: perform feature selection on the word frequency statistics, in the same way as step S14.
Step S26: use the vector space model (Vector Space Model) to convert the body text of the web page to be classified into a feature vector, in the same way as step S15.
Step S27: classify the generated feature vector with the SVM classifier to obtain the classification result.
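The control flow of the classification stage, with pre-classification first and the SVM as a fallback, can be summarized in a few lines of Python. The three callables passed in are placeholders for the components described above; their names and signatures are assumptions made for this sketch.

```python
def classify_page(title_tokens, body_tokens, pre_classifier, vectorize, svm_predict):
    """Runs pre-classification and falls back to the SVM only when it fails.

    pre_classifier: title tokens -> category, or None on failure (step S23).
    vectorize: body tokens -> feature vector (steps S24 to S26).
    svm_predict: feature vector -> category (step S27).
    Returns the category and the name of the stage that produced it.
    """
    label = pre_classifier(title_tokens)
    if label is not None:
        return label, "pre-classification"
    return svm_predict(vectorize(body_tokens)), "svm"


# Stand-in components for illustration only.
by_title = classify_page(["football"], [], lambda t: "sports", len, lambda v: "other")
by_svm = classify_page(["weather"], ["rain", "wind"], lambda t: None, len, lambda v: "other")
```

The speed benefit the text describes comes from this short circuit: a page resolved by its title never has its body converted to a feature vector.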
Based on the technical scheme disclosed above, the present invention has the following beneficial effects:
1. Because parts of a page such as the title and keywords carry higher weight for the classification result, the invention builds a knowledge base from data features, namely preset keyword tables and title content, and uses it for pre-classification, which greatly increases the speed of classifying Chinese web pages.
2. The invention pre-classifies with a knowledge base built from data features and uses SVM classification as a supplement, which greatly improves the overall performance of the classifier.
The above is only a preferred embodiment of the present invention, and the scope of protection of the present invention is not limited to it. Any equivalent replacement or modification made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical scheme and inventive concept, shall fall within the scope of protection of the present invention.

Claims (5)

1. An intelligent information classification method based on data features, characterized in that the method comprises:
creating a knowledge base according to data features and pre-classifying the Chinese web pages to be classified, thereby greatly accelerating web page classification.
2. The method according to claim 1, characterized in that the pre-classification method comprises:
(1) extracting the title content of the web page, comparing it with the keyword tables in the knowledge base to determine which category each word belongs to, and counting how often the words in the title occur in each category; (2) if the word frequency for one category is the highest, assigning the web page to that category; (3) if the word frequencies for two categories are equal, comparing the category priorities and assigning the web page to the category with the higher priority; (4) if the priorities are also equal, pre-classification fails and classification continues with the SVM classifier.
3. The method according to claim 1, characterized in that creating the knowledge base comprises:
(1) for each category, extracting the content of the <title> tags of all web pages of that category in the training set, segmenting it into words, computing word frequencies per category, and sorting the words by descending frequency; (2) selecting some of these words as the knowledge base for pre-classification, the selection principle being to take the highest-frequency words of each category and check whether each occurs in the other categories, a word that does not occur in any other category being added to the knowledge base of its category.
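4. The method according to claim 1, characterized in that the principle for creating the SVM classifier comprises:
(1) letting the training set be
$$T = \{(x_1, y_1), (x_2, y_2), \dots, (x_l, y_l)\} \in (X \times Y)^l$$
where $x_i \in X = R^n$, $y_i \in Y = \{-1, 1\}$, $i = 1, 2, \dots, l$;
(2) selecting a suitable kernel function $K(x, x')$ and penalty parameter $C$, then constructing and solving an optimization problem subject to
$$\sum_{i=1}^{l} y_i \alpha_i = 0$$
$$0 \le \alpha_i \le C,\quad i = 1, 2, \dots, l$$
to obtain the optimal solution $\alpha^* = (\alpha_1^*, \alpha_2^*, \dots, \alpha_l^*)^T$;
(3) selecting a positive component $\alpha_j^*$ of $\alpha^*$ that is less than $C$, and computing accordingly
$$b^* = y_j - \sum_{i=1}^{l} y_i \alpha_i^* K(x_i, x_j)$$
(4) constructing the decision function
$$f(x) = \operatorname{sgn}\left(\sum_{i=1}^{l} y_i \alpha_i^* K(x_i, x) + b^*\right)$$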
5. The method according to claim 4, characterized in that the method of constructing the feature vector comprises:
(1) using the vector space model (Vector Space Model) to convert the body text of each training web page into a feature vector; in this model, each text document is represented as the following feature vector:
$$V(d) = (t_1, \omega_1(d);\ t_2, \omega_2(d);\ \dots;\ t_n, \omega_n(d))$$
where $t_i$ is a feature term and $\omega_i(d)$ is the weight of $t_i$ in document $d$;
(2) computing the weight $\omega_i(d)$ of a feature term in a document with the traditional TF-IDF weighting algorithm:
$$\omega_i(d) = \frac{tf_i(d) \times \log(N/n_k + 0.01)}{\sqrt{\sum_{i=1}^{n} (tf_i(d))^2 \times [\log(N/n_k + 0.01)]^2}}$$
where $tf_i(d)$ is the frequency with which $t_i$ occurs in document $d$, $N$ is the total number of documents in the document set, and $n_k$ is the number of documents in which feature term $t_k$ occurs;
(3) in the representation of a web page, two factors affect the weight of a feature term: the frequency with which the term occurs in the document, and the position at which it occurs; feature words at different positions are given different weight factors, computed as follows:
$$\lambda = \frac{\bar{d_k}}{\bar{d_0}} = \frac{(\sum d_k)/N_k}{(\sum d_0)/N_0}$$
where $\bar{d_k}$ is the average word frequency of core words, $\bar{d_0}$ is the average word frequency of non-core words, $d_k$ and $N_k$ are the word frequency and number of core words, and $d_0$ and $N_0$ are the word frequency and number of non-core words; core words are the words that occur in the <title> tag and in the keywords and description fields of <meta> tags, and all other words are non-core words.
CN201510866092.6A 2015-11-26 2015-11-26 Data feature based intelligent information classification method Pending CN105447161A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510866092.6A CN105447161A (en) 2015-11-26 2015-11-26 Data feature based intelligent information classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510866092.6A CN105447161A (en) 2015-11-26 2015-11-26 Data feature based intelligent information classification method

Publications (1)

Publication Number Publication Date
CN105447161A true CN105447161A (en) 2016-03-30

Family

ID=55557337

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510866092.6A Pending CN105447161A (en) 2015-11-26 2015-11-26 Data feature based intelligent information classification method

Country Status (1)

Country Link
CN (1) CN105447161A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106934055A (en) * 2017-03-20 2017-07-07 南京大学 A kind of semi-supervised automatic webpage classification method based on insufficient modal information
CN107169523A (en) * 2017-05-27 2017-09-15 鹏元征信有限公司 Automatically determine method, storage device and the terminal of the affiliated category of employment of mechanism
CN107545179A (en) * 2017-07-11 2018-01-05 宁波大学 A kind of spam page recognition methods
CN107729334A (en) * 2016-08-11 2018-02-23 英业达科技有限公司 Data sorting system and data classification method
CN108228687A (en) * 2017-06-20 2018-06-29 上海吉贝克信息技术有限公司 Big data knowledge excavation and accurate tracking and system
CN108920492A (en) * 2018-05-16 2018-11-30 广州舜飞信息科技有限公司 A kind of Web page classification method, system, terminal and storage medium
CN109947947A (en) * 2019-03-29 2019-06-28 北京泰迪熊移动科技有限公司 A kind of file classification method, device and computer readable storage medium
CN109063217B (en) * 2018-10-29 2020-11-03 广东电网有限责任公司广州供电局 Work order classification method and device in electric power marketing system and related equipment thereof

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107729334A (en) * 2016-08-11 2018-02-23 英业达科技有限公司 Data sorting system and data classification method
CN106934055A (en) * 2017-03-20 2017-07-07 南京大学 A kind of semi-supervised automatic webpage classification method based on insufficient modal information
CN106934055B (en) * 2017-03-20 2020-05-19 南京大学 Semi-supervised webpage automatic classification method based on insufficient modal information
CN107169523A (en) * 2017-05-27 2017-09-15 鹏元征信有限公司 Automatically determine method, storage device and the terminal of the affiliated category of employment of mechanism
CN108228687A (en) * 2017-06-20 2018-06-29 上海吉贝克信息技术有限公司 Big data knowledge excavation and accurate tracking and system
CN107545179A (en) * 2017-07-11 2018-01-05 宁波大学 A kind of spam page recognition methods
CN107545179B (en) * 2017-07-11 2020-06-19 宁波大学 Junk web page identification method
CN108920492A (en) * 2018-05-16 2018-11-30 广州舜飞信息科技有限公司 A kind of Web page classification method, system, terminal and storage medium
CN109063217B (en) * 2018-10-29 2020-11-03 广东电网有限责任公司广州供电局 Work order classification method and device in electric power marketing system and related equipment thereof
CN109947947A (en) * 2019-03-29 2019-06-28 北京泰迪熊移动科技有限公司 A kind of file classification method, device and computer readable storage medium
CN109947947B (en) * 2019-03-29 2021-11-23 北京泰迪熊移动科技有限公司 Text classification method and device and computer readable storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20160330