CN105447161A

CN105447161A - Data feature based intelligent information classification method

Info

Publication number: CN105447161A
Application number: CN201510866092.6A
Authority: CN
Inventors: 刘治; 张胜; 章云
Original assignee: Guangdong University of Technology
Current assignee: Guangdong University of Technology
Priority date: 2015-11-26
Filing date: 2015-11-26
Publication date: 2016-03-30

Abstract

The invention belongs to the field of data mining and relates to a data feature based intelligent information classification method. The method mainly comprises a stage of training marked web pages and a stage of classifying to-be-classified web pages. The training stage comprises the main steps of: preprocessing the web pages; performing Chinese word segmentation and stop word removal on web page content; creating a knowledge base according to data features; performing feature selection on the web pages and generating eigenvectors; and generating an SVM classifier. The classification stage comprises the main steps of: pre-classifying the web pages; and performing accurate classification by using the SVM classifier. With the method, the deficiency that an existing information classification method cannot perform high-speed and efficient classification on Chinese web pages is overcome.

Description

An intelligent information classification method based on data features

Technical field

The invention belongs to the field of data mining and relates to an intelligent information classification method based on data features.

Background art

With the rapid development of the Internet, the volume of information on the network is growing explosively. Faced with such a massive amount of Web information, obtaining useful information quickly and accurately is one of the challenges confronting current Internet technology. Automatic web page classification is an important technique for processing massive Web information efficiently. It means that a computer assigns each web page to be classified to one of a set of predefined categories, based on the page content and according to some automatic classification algorithm.

At present there are many automatic text classification algorithms based on statistical theory and machine learning. Compared with ordinary text documents, however, web pages have the following characteristics: (1) web pages are hypertext and contain HTML tags, which makes them more expressive than plain text and provides more structural and editing information that can be exploited; (2) web pages on the Web are linked to one another by hyperlinks, and the content recommendation and content correlation implied by hyperlinks provide much heuristic information for web page classification; (3) web pages usually contain a large amount of noise, such as advertisements, navigation bars, recommendation bars, author information, and other information unrelated to the subject content; (4) Chinese web pages are written in Chinese, which, unlike English, does not separate words with spaces, so Chinese web pages require word segmentation. These characteristics make web page classification more complex than plain text classification.

Summary of the invention

In view of the above problems, and after an in-depth study of the characteristics of Chinese web pages, the present invention exploits the fact that parts of a page such as the title and keywords carry higher weight for the classification result. It proposes pre-classification based on data features, using a knowledge base built from preset keyword tables and title content, and then, as a supplement, converts the web page into a feature vector and classifies it with the SVM algorithm. The method greatly improves the overall performance of the classifier.

The specific technical solution is as follows: an intelligent information classification method based on data features, comprising two stages, training and classification:

The training stage proceeds as follows. Step 1: preprocess the training web pages, remove the HTML tags irrelevant to web page classification, and extract the body text. Step 2: perform Chinese word segmentation on the extracted text and, after segmentation, remove stop words that carry little meaning for web page classification. Characters or words with no practical meaning in Chinese, such as ' ' and ' ', as well as some rarely used characters and special symbols, must all be removed as stop words. Step 3: compute word frequency statistics on the result of segmentation and stop-word removal. Step 4: perform feature selection on the word frequency statistics; specifically, set a word frequency threshold and filter out words whose frequency is below it. Step 5: compute weights for the remaining high-frequency words and generate the feature vector. Step 6: create the industry knowledge base, that is, preset a keyword table of the field for each category to be trained. Step 7: create the SVM classifier.

The classification stage proceeds as follows. Step 1: preprocess the web page to be classified, remove the HTML tags irrelevant to web page classification, and extract the body text. Step 2: perform Chinese word segmentation and stop-word removal on the extracted text, in the same way as during training. Step 3: pre-classification. Extract the title content of the web page to be classified and compare it with the keyword tables in the preset industry knowledge base to determine the category of the page. If pre-classification succeeds, return the classification result directly; if it fails, continue with the following steps. Step 4: convert the segmented, stop-word-filtered body text into a feature vector. Step 5: classify the feature vector with the SVM classifier and return the classification result.

Based on the technical solution disclosed above, the present invention has the following beneficial effects:

1. Exploiting the fact that parts of a page such as the title and keywords carry higher weight for the classification result, the invention proposes pre-classification using preset keyword tables and title content as the knowledge base, which greatly increases the classification speed for Chinese web pages.

2. The invention proposes pre-classification with a knowledge base built from data features, supplemented by classification with the SVM algorithm, which greatly improves the overall performance of the classifier.

Brief description of the drawings

Fig. 1 is the system flowchart of the intelligent information classification method based on data features proposed by the present invention.

Fig. 2 is the flowchart of the training stage of the intelligent information classification method based on data features proposed by the present invention.

Fig. 3 is the flowchart of the classification stage of the intelligent information classification method based on data features proposed by the present invention.

Detailed description of the embodiments

Fig. 1 shows the system flowchart of the intelligent information classification method based on data features proposed by the present invention. With reference to Fig. 1, the method comprises: step S1, training on labeled web pages; and step S2, classifying the web pages to be classified.

With reference to Fig. 2, training on the labeled web pages in step S1 comprises:

Step S11: preprocess the training web pages, remove the HTML tags irrelevant to web page classification, and extract the body text. First, remove the HTML source code embedded in tags such as <style>, <script>, and <applet>. Second, extract the content of the <title> and <meta> tags and save it separately. Finally, after the above HTML tags have been filtered out, extract the body text of the web page.
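As an illustration only, step S11 could be sketched in Python as follows. The sketch assumes the standard-library `html.parser` module; the names `PageExtractor`, `SKIP_TAGS`, and `preprocess` are hypothetical and are not part of the disclosed method.

```python
from html.parser import HTMLParser

# Tags whose embedded content is discarded (taken from step S11).
SKIP_TAGS = {"style", "script", "applet"}


class PageExtractor(HTMLParser):
    """Collects the title, selected <meta> fields, and the visible body text."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.meta = {}          # e.g. {"keywords": "...", "description": "..."}
        self.body_parts = []
        self._skip_depth = 0    # > 0 while inside a skipped tag
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            name = (attr.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.meta[name] = attr.get("content") or ""

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return
        if self._in_title:
            self.title_parts.append(text)   # saved separately from the body
        else:
            self.body_parts.append(text)


def preprocess(html):
    """Return (title, meta, body_text) for one page."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.title_parts), parser.meta, " ".join(parser.body_parts)
```

Keeping the title and `<meta>` content apart from the body text matters later, because those fields feed both the core-word weighting and the pre-classification step.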

Step S12: perform Chinese word segmentation on the extracted body text and, after segmentation, remove stop words that carry little meaning for web page classification. Characters or words with no practical meaning in Chinese, such as ' ' and ' ', as well as some rarely used characters and special symbols, must all be removed as stop words.

Step S13: compute word frequency statistics on the result of segmentation and stop-word removal.

Step S14: perform feature selection on the word frequency statistics. Specifically, set a word frequency threshold and filter out words whose frequency is below the threshold.
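A minimal sketch of steps S12 to S14 is given below, assuming the text has already been split into tokens by some Chinese word segmenter (the source does not name one). The stop-word set, the threshold value, and the function names are illustrative assumptions.

```python
from collections import Counter

# Hypothetical stop-word list; the source does not enumerate one.
STOP_WORDS = {"的", "了", "，", "。"}


def count_terms(tokens, stop_words=STOP_WORDS):
    """Stop-word removal (step S12) followed by word frequency statistics (step S13)."""
    return Counter(t for t in tokens if t.strip() and t not in stop_words)


def select_features(freqs, min_freq):
    """Step S14: drop every word whose frequency is below min_freq."""
    return {word: n for word, n in freqs.items() if n >= min_freq}
```

For example, with `min_freq=2` a word that occurs only once in the page is discarded, which is what reduces the dimension of the feature vector built in the next step.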

Step S15: use the vector space model (Vector Space Model) to convert the body text of each training web page into a feature vector. In this model, each text document is represented by the following feature vector:

V(d) = (t_{1}, ω_{1}(d); t_{2}, ω_{2}(d); …; t_{n}, ω_{n}(d))

where t_i is a feature term and ω_i(d) is the weight of t_i in document d.

For step S15, the dimension of the feature vector needs to be reduced to simplify subsequent computation. Steps S12, S13, and S14 reduce the number of feature terms and therefore the dimension of the feature vector.

In step S15, the weight ω_i(d) of a feature term in a document can be computed with the traditional TF-IDF weighting algorithm, using the following formula:

ω_{i} (d) = \frac{{tf}_{i} (d) \times \log (N / n_{k} + 0.01)}{\sqrt{Σ_{i = 1}^{n} {({tf}_{i} (d))}^{2} \times {[\log (N / n_{k} + 0.01)]}^{2}}}

where tf_i(d) is the frequency with which t_i occurs in document d, N is the total number of documents in the document collection, and n_k is the number of documents in which feature term t_k occurs.
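The formula above can be sketched in Python as follows. This is an illustration under two assumptions: the document frequency is taken per feature term, both in the numerator and inside the sum, and the natural logarithm is used. Because of the normalization, the choice of logarithm base does not change the result. The function name `tfidf_weights` is hypothetical.

```python
import math


def tfidf_weights(tf, doc_freq, n_docs):
    """Normalized TF-IDF weights for one document.

    tf:       {term: frequency of the term in this document}
    doc_freq: {term: number of documents containing the term}
    n_docs:   total number of documents N in the collection
    """
    raw = {
        term: f * math.log(n_docs / doc_freq[term] + 0.01)
        for term, f in tf.items()
    }
    # Denominator of the formula: Euclidean norm of the raw weights.
    norm = math.sqrt(sum(w * w for w in raw.values()))
    if norm == 0.0:
        return {term: 0.0 for term in raw}
    return {term: w / norm for term, w in raw.items()}
```

After normalization the weight vector of every document has unit length, so long and short pages become comparable.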

In step S15, two factors affect the weight of a feature term in the web page representation: the frequency with which the term occurs in the document, and the position at which it occurs. Feature words at different positions are therefore given different weight factors, computed as follows:

λ = \frac{\bar{d}_{k}}{\bar{d}_{0}} = \frac{(Σ d_{k}) / N_{k}}{(Σ d_{0}) / N_{0}}

where \bar{d}_{k} is the average word frequency of core words, \bar{d}_{0} is the average word frequency of non-core words, d_k and N_k are the word frequency and the number of core words, and d_0 and N_0 are the word frequency and the number of non-core words. Core words are the words that occur in the <title> tag and in the keywords and description fields of <meta> tags; all other words are non-core words.

Optionally, since core words are generally few in number but occur often and in concentrated fashion, λ ≥ 1 in general; when a value less than 1 is obtained, λ = 1 is used instead. For core words, the weight formula becomes:

ω′_{i}(d) = λ × ω_{i}(d)
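The position weighting can be sketched as below. The function names are hypothetical, and returning 1.0 when either word group is empty or the non-core average is zero is an assumption made for the sketch, since the source does not cover those cases.

```python
def position_factor(core_freqs, other_freqs):
    """λ = average core-word frequency / average non-core-word frequency, floored at 1."""
    if not core_freqs or not other_freqs:
        return 1.0  # assumption: no boost when either group is empty
    core_avg = sum(core_freqs) / len(core_freqs)
    other_avg = sum(other_freqs) / len(other_freqs)
    if other_avg == 0:
        return 1.0  # assumption: avoid division by zero
    return max(1.0, core_avg / other_avg)


def apply_position_factor(weights, core_words, lam):
    """ω'_i(d) = λ × ω_i(d) for core words; other weights are unchanged."""
    return {t: (w * lam if t in core_words else w) for t, w in weights.items()}
```

With core-word frequencies `[6, 4]` and non-core frequencies `[2, 3, 1]`, the averages are 5 and 2, so λ = 2.5 and every core-word weight is multiplied by 2.5.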

Step S16: create the knowledge base by presetting, for each category to be trained, a keyword table for its field.

For step S16, the knowledge base is created as follows. First, for each category, extract the content of the <title> tags of all web pages of that category in the training set, perform word segmentation on it, compute the word frequencies for each category separately, and sort the words by descending frequency. Next, select a subset of these words as the knowledge base for pre-classification. The selection rule is to start from the most frequent words of each category and check whether each word has occurred in other categories; if it has not, it is selected into the knowledge base of that category.

Preferably, the above requirement is relaxed in practice: if a word were strictly required to occur only in the titles of one category, few keywords would be obtained and classification efficiency would not improve noticeably. Therefore, if a word has a high frequency in one category, and either the number of times it occurs in other categories does not exceed a fixed threshold or the ratio of its occurrences in other categories to the total number of web pages stays within a certain range (for example 1%), the word is still selected as a pre-classification keyword for that category.
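The keyword selection of step S16 might be sketched as follows. The parameters `top_k` and `max_other` and the function name are assumptions, as is counting a word's occurrences summed over all other categories. Only the count-based relaxation is shown; the ratio-based variant is omitted.

```python
from collections import Counter


def build_knowledge_base(title_tokens_by_class, top_k=50, max_other=0):
    """Build a pre-classification keyword table per category.

    title_tokens_by_class: {category: [token, ...]} from training-page titles
    top_k:     how many of the most frequent title words to consider per category
    max_other: largest total count allowed in the titles of other categories
               (0 reproduces the strict rule; larger values relax it)
    """
    freqs = {c: Counter(tokens) for c, tokens in title_tokens_by_class.items()}
    knowledge_base = {}
    for c, counter in freqs.items():
        keywords = []
        for word, _ in counter.most_common(top_k):  # descending word frequency
            elsewhere = sum(freqs[o][word] for o in freqs if o != c)
            if elsewhere <= max_other:
                keywords.append(word)
        knowledge_base[c] = keywords
    return knowledge_base
```

Raising `max_other` admits words that are strongly associated with one category but occasionally appear in others, which yields larger keyword tables at the cost of some ambiguity.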

Step S17: train on the generated feature vectors and create the SVM classifier.

For step S17, the principle of the SVM classifier is as follows:

(1) Let the given training set be

T = \{(x_{1}, y_{1}), (x_{2}, y_{2}), …, (x_{l}, y_{l})\} ∈ (X × Y)^{l}

where x_i ∈ X = R^n, y_i ∈ Y = {-1, 1}, i = 1, 2, ..., l.

(2) Select a suitable kernel function K(x, x′) and penalty parameter C, then construct and solve the following optimization problem

\begin{matrix} s.t. & Σ_{i = 1}^{l} y_{i} α_{i} = 0 \end{matrix}

0 ≤ α_{i} ≤ C, i = 1, 2, …, l

to obtain the optimal solution

α^{*} = {(α_{1}^{*}, α_{2}^{*}, ..., α_{l}^{*})}^{T}

(3) Select a positive component α_{j}^{*} of α^{*} that is less than C, and from it compute

b^{*} = y_{j} - Σ_{i = 1}^{l} y_{i} α_{i}^{*} K (x_{i}, x_{j})

(4) Construct the decision function

f (x) = sgn (Σ_{i = 1}^{l} y_{i} α_{i}^{*} K (x_{i}, x) + b^{*})
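Steps (2) to (4) can be illustrated with the short sketch below. It does not solve the optimization problem; it takes a solution α* as given. The objective in `dual_objective` is the standard soft-margin SVM dual, which is an assumption consistent with the constraints stated above rather than a quotation from the source. The linear kernel and all function names are likewise illustrative.

```python
def linear_kernel(u, v):
    """K(u, v) = <u, v>; one possible kernel choice."""
    return sum(a * b for a, b in zip(u, v))


def dual_objective(alpha, xs, ys, kernel=linear_kernel):
    """Assumed standard dual objective, to be minimized subject to the constraints:
    0.5 * sum_i sum_j y_i y_j a_i a_j K(x_i, x_j) - sum_j a_j
    """
    n = len(xs)
    quad = sum(
        ys[i] * ys[j] * alpha[i] * alpha[j] * kernel(xs[i], xs[j])
        for i in range(n) for j in range(n)
    )
    return 0.5 * quad - sum(alpha)


def bias(alpha, xs, ys, c, kernel=linear_kernel):
    """Step (3): pick a component with 0 < alpha_j < C and compute b*."""
    j = next(k for k, a in enumerate(alpha) if 0 < a < c)
    return ys[j] - sum(ys[i] * alpha[i] * kernel(xs[i], xs[j]) for i in range(len(xs)))


def decide(x, alpha, xs, ys, b, kernel=linear_kernel):
    """Step (4): f(x) = sgn(sum_i y_i alpha_i K(x_i, x) + b*)."""
    score = sum(ys[i] * alpha[i] * kernel(xs[i], x) for i in range(len(xs))) + b
    return 1 if score >= 0 else -1
```

For the two one-dimensional training points x = 1 (label +1) and x = -1 (label -1) with C = 1, the dual solution is α* = (0.5, 0.5), which gives b* = 0 and the decision function f(x) = sgn(x).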

With reference to Fig. 3, classifying the web page to be classified in step S2 comprises:

Step S21: preprocess the web page to be classified, in the same way as in step S11.

Step S22: perform Chinese word segmentation and stop-word removal on the extracted text, in the same way as in step S12.

Step S23: pre-classify the web page.

For step S23, pre-classification is carried out as follows: (1) extract the title content of the web page, compare it with the keyword tables in the knowledge base to determine the category of each word, and count how often the title words occur in each category; (2) if the word frequency for one category is the highest, the web page is considered to belong to that category; (3) if the word frequencies for two categories are equal, compare the category priorities and assign the page to the category with the higher priority; (4) if the priorities are the same, pre-classification fails and the following steps are carried out.
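The pre-classification rules (1) to (4) could be sketched as follows. The data layout (a dictionary of keyword lists and a dictionary of numeric priorities) and the function name are assumptions. Treating a title with no keyword hits as a failed pre-classification is also an assumption, since the source does not state that case explicitly.

```python
def preclassify(title_tokens, knowledge_base, priority):
    """Return the category chosen from the title, or None if pre-classification fails.

    knowledge_base: {category: iterable of keywords}
    priority:       {category: number}, larger means higher priority
    """
    # Rule (1): count title words that hit each category's keyword table.
    counts = {}
    for c, words in knowledge_base.items():
        keywords = set(words)
        counts[c] = sum(1 for t in title_tokens if t in keywords)

    best = max(counts.values(), default=0)
    if best == 0:
        return None  # assumption: no keyword hit means pre-classification fails

    # Rule (2): a single category with the highest count wins.
    tied = [c for c, n in counts.items() if n == best]
    if len(tied) == 1:
        return tied[0]

    # Rules (3) and (4): break ties by priority; equal priorities mean failure.
    top = max(priority.get(c, 0) for c in tied)
    winners = [c for c in tied if priority.get(c, 0) == top]
    return winners[0] if len(winners) == 1 else None
```

A return value of `None` signals that the page must go on to the feature-vector and SVM steps.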

Step S24: compute word frequency statistics on the result of segmentation and stop-word removal, in the same way as in step S13.

Step S25: perform feature selection on the word frequency statistics, in the same way as in step S14.

Step S26: use the vector space model (Vector Space Model) to convert the body text of the web page to be classified into a feature vector, in the same way as in step S15.

Step S27: classify the generated feature vector with the SVM classifier to obtain the classification result.
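The overall control flow of the classification stage, in which the SVM is consulted only when pre-classification fails, can be sketched as below. The three callables are hypothetical placeholders for steps S23, S24 to S26, and S27; they are passed in so that the sketch stays independent of any particular implementation.

```python
def classify_page(title_tokens, body_tokens, preclassify, vectorize, svm_predict):
    """Two-stage classification: title keywords first, SVM only as a fallback.

    preclassify(title_tokens) -> category or None   (step S23)
    vectorize(body_tokens)    -> feature vector     (steps S24 to S26)
    svm_predict(vector)       -> category           (step S27)
    """
    label = preclassify(title_tokens)
    if label is not None:
        return label, "preclassified"   # fast path: no feature vector is built
    return svm_predict(vectorize(body_tokens)), "svm"
```

The speed benefit described in the text comes from the fast path: pages resolved by the title skip word frequency statistics, feature selection, and vector construction entirely.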

1. Exploiting the fact that parts of a page such as the title and keywords carry higher weight for the classification result, the invention proposes building a knowledge base from preset keyword tables and title content as data features and using it for pre-classification, which greatly increases the classification speed for Chinese web pages.

2. The invention proposes pre-classification with a knowledge base built from data features, supplemented by classification with the SVM algorithm, which greatly improves the overall performance of the classifier.

The above is only a preferred embodiment of the present invention, but the scope of protection of the invention is not limited to it. Any equivalent replacement or change made by a person skilled in the art, within the technical scope disclosed by the invention and according to its technical solution and inventive concept, shall fall within the scope of protection of the invention.

Claims

1. An intelligent information classification method based on data features, characterized in that the method comprises:

creating a knowledge base according to data features and pre-classifying the Chinese web pages to be classified, thereby greatly accelerating web page classification.

2. The method according to claim 1, characterized in that the pre-classification method comprises:

(1) extracting the title content of the web page, comparing it with the keyword tables in the knowledge base to determine the category of each word, and counting how often the title words occur in each category; (2) if the word frequency for one category is the highest, considering the web page to belong to that category; (3) if the word frequencies for two categories are equal, comparing the category priorities and assigning the page to the category with the higher priority; (4) if the priorities are the same, pre-classification fails and classification continues with the SVM classifier.

3. The method according to claim 1, characterized in that creating the knowledge base comprises:

(1) for each category, extracting the content of the <title> tags of all web pages of that category in the training set, performing word segmentation on it, computing the word frequencies for each category separately, and sorting the words by descending frequency; (2) selecting a subset of these words as the knowledge base for pre-classification, the selection rule being to start from the most frequent words of each category and check whether each word has occurred in other categories; if it has not, it is selected into the knowledge base of that category.

4. The method according to claim 1, characterized in that the principle for creating the SVM classifier comprises:

(1) Let the given training set be

T = \{(x_{1}, y_{1}), (x_{2}, y_{2}), …, (x_{l}, y_{l})\} ∈ (X × Y)^{l}

where x_i ∈ X = R^n, y_i ∈ Y = {-1, 1}, i = 1, 2, ..., l.

(2) Select a suitable kernel function K(x, x′) and penalty parameter C, then construct and solve the following optimization problem:

\begin{matrix} s.t. & Σ_{i = 1}^{l} y_{i} α_{i} = 0 \end{matrix}

0 ≤ α_{i} ≤ C, i = 1, 2, …, l

to obtain the optimal solution

α^{*} = {(α_{1}^{*}, α_{2}^{*}, ..., α_{l}^{*})}^{T}

(4) Construct the decision function

f (x) = sgn (Σ_{i = 1}^{l} y_{i} α_{i}^{*} K (x_{i}, x) + b^{*}).

5. The method according to claim 4, characterized in that the method for constructing the feature vector comprises:

(1) using the vector space model (Vector Space Model) to convert the body text of each training web page into a feature vector; in this model, each text document is represented by the following feature vector:

V(d) = (t_{1}, ω_{1}(d); t_{2}, ω_{2}(d); …; t_{n}, ω_{n}(d))

where t_i is a feature term and ω_i(d) is the weight of t_i in document d;

(2) the weight ω_i(d) of a feature term in a document can be computed with the traditional TF-IDF weighting algorithm, using the following formula:

ω_{i} (d) = \frac{{tf}_{i} (d) \times \log (N / n_{k} + 0.01)}{\sqrt{Σ_{i = 1}^{n} {({tf}_{i} (d))}^{2} \times {[\log (N / n_{k} + 0.01)]}^{2}}}

where tf_i(d) is the frequency with which t_i occurs in document d, N is the total number of documents in the document collection, and n_k is the number of documents in which feature term t_k occurs;

(3) two factors affect the weight of a feature term in the web page representation: the frequency with which the term occurs in the document, and the position at which it occurs; feature words at different positions are given different weight factors, computed as follows:

λ = \frac{\bar{d}_{k}}{\bar{d}_{0}} = \frac{(Σ d_{k}) / N_{k}}{(Σ d_{0}) / N_{0}}

where \bar{d}_{k} is the average word frequency of core words, \bar{d}_{0} is the average word frequency of non-core words, d_k and N_k are the word frequency and the number of core words, and d_0 and N_0 are the word frequency and the number of non-core words; core words are the words that occur in the <title> tag and in the keywords and description fields of <meta> tags, and all other words are non-core words.