CN106339459A - Method for pre-classifying Chinese webpages based on keyword matching

Info

Publication number: CN106339459A
Application number: CN201610741134.8A
Authority: CN
Inventors: 张云; 冯多; 木伟民; 王伟平
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2016-08-26
Filing date: 2016-08-26
Publication date: 2017-01-18
Anticipated expiration: 2036-08-26
Also published as: CN106339459B

Abstract

The invention relates to a method for pre-classifying Chinese webpages based on keyword matching. The method comprises the following steps: while the training webpages are manually annotated to build the training set required by a classification algorithm, the keywords that characterize each webpage are also annotated, generating a keyword table; for each test webpage, the keywords occurring in the webpage are extracted according to the keyword table, and labels from the training set are transferred to the test webpage by keyword matching against the training set; if the pre-classification method yields no classification result for a test webpage, further classification calculation is performed on that webpage. The method shortens the running time of computationally expensive classification techniques such as SVM, KNN and naive Bayes classification, while increasing the accuracy and recall of the classification results.

Description

Method for pre-classifying Chinese webpages based on keyword matching

Technical field

The present invention relates to information processing in the computer field, and more particularly to a method for pre-classifying Chinese webpages based on keyword matching.

Background art

With the rapid development of the Internet, the amount of information stored in the form of webpages is still growing explosively, so webpage classification has become one of the indispensable means by which people obtain useful information. The mainstream classification algorithms at present are SVM, KNN and naive Bayes. Of these, SVM requires only a small training set and classifies English webpages very well. However, the accuracy and recall of Chinese webpage classification systems built around SVM fall short of requirements. This is because English has natural word separators, whereas Chinese text must first be segmented by a Chinese word segmenter before the webpage text can be vectorized. Even the best Chinese word segmenters cannot segment text with complete accuracy, which greatly degrades the effect of Chinese webpage classification.

Summary of the invention

To address the above problems, the present invention proposes a method for pre-classifying Chinese webpages based on keyword matching. The method can substantially reduce the running time of mainstream classification techniques while improving the accuracy and recall of the classification results.

The technical solution of the present invention is as follows.

A method for pre-classifying Chinese webpages based on keyword matching comprises the following steps:

1) annotating, for each training webpage tr in the training set trs, its class label tag and the keyword set kws that characterizes the category of that webpage, and generating a keyword table kwt;

2) extracting, according to kwt, the keywords contained in each test webpage te in the test set tes to form a keyword set tek;

3) computing each 2-tuple (i.e. keyword pair) tec of tek and traversing the training set, transferring the tag of every tr whose kws contains this tec to the te corresponding to this tec, and storing the tag in the tag set tags of this te;

4) performing frequency statistics on the labels in tags and taking the several most frequent labels as required, as the pre-classification labels of the test webpage.

On the basis of the above technical solution, the present invention can be further improved as follows.

Further, step 1) also includes deduplicating the keyword sets kws of all training webpages belonging to the same category before generating the keyword table kwt.

Further, the keyword table kwt is generated by the following steps:

1-1) create a new map m, use each keyword k in the kws of the first training webpage tr as a key of m, and set every corresponding initial value to 1;

1-2) for each keyword k in the kws of the second tr, first determine whether m already contains k; if so, add 1 to the value of the key-value pair whose key is k; if not, add the key-value pair <k, 1> to m;

1-3) repeat step 1-2) for the remaining tr, up to the last tr;

1-4) set a threshold s; when the value of a key-value pair is less than s, set the value to 0; otherwise set the value to 1.
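Steps 1-1) to 1-4) can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function name `build_keyword_table`, the sample keywords and the threshold value `s=2` are assumptions made for the example, and each training webpage is assumed to contribute each keyword at most once.

```python
from collections import Counter

def build_keyword_table(training_keyword_sets, s):
    """Return a dict mapping each keyword to 1 (count >= s) or 0 (count < s)."""
    counts = Counter()
    for kws in training_keyword_sets:
        # Steps 1-1) to 1-3): a new keyword starts at 1, a known one gains 1.
        counts.update(set(kws))
    # Step 1-4): values below the threshold s become 0, all others become 1.
    return {k: (0 if c < s else 1) for k, c in counts.items()}

# Hypothetical annotated keyword sets for three training webpages.
training_keyword_sets = [
    {"足球", "联赛", "进球"},   # football, league, goal
    {"足球", "转会", "教练"},   # football, transfer, coach
    {"股票", "基金", "市场"},   # stock, fund, market
]
kwt = build_keyword_table(training_keyword_sets, s=2)
```

With these inputs only "足球" occurs in two keyword sets, so it is the only keyword whose value is 1.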

Further, the 2-tuples tec of tek in step 2) are computed by the following steps:

3-1-1) for a tek containing n keywords k1, k2, ..., kn, look up keyword pairs in the following order, each pair being passed to step 3-1-2) for checking: the keyword pairs containing k1 are <k1, k2>, <k1, k3>, ..., <k1, kn>; the keyword pairs containing k2 are <k2, k3>, ..., <k2, kn>; and so on up to the keyword pair containing k(n-1), <k(n-1), kn>;

3-1-2) if at least one keyword of the current keyword pair has a corresponding value of 1 in m, traverse the training set trs with this tec; otherwise, return to step 3-1-1) and look up the next keyword pair.
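The enumeration order of step 3-1-1) and the filter of step 3-1-2) can be sketched as below. The name `candidate_pairs` and the sample data are hypothetical. `itertools.combinations` is used because it yields pairs in exactly the order <k1, k2>, <k1, k3>, ..., <k(n-1), kn> for an input sequence k1, ..., kn.

```python
from itertools import combinations

def candidate_pairs(tek, kwt):
    """Yield the keyword pairs of tek that pass the check of step 3-1-2)."""
    for a, b in combinations(tek, 2):
        # Keep the pair only if at least one keyword has the value 1 in the table.
        if kwt.get(a, 0) == 1 or kwt.get(b, 0) == 1:
            yield (a, b)

# Hypothetical keyword table: only "足球" has the value 1.
kwt = {"足球": 1, "联赛": 0, "进球": 0}
tek = ["足球", "联赛", "进球"]
pairs = list(candidate_pairs(tek, kwt))
```

Here the pair ("联赛", "进球") is skipped because neither keyword has the value 1, leaving two pairs to be matched against the training set.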

Further, in step 2), at least one keyword in every 2-tuple (i.e. keyword pair) tec of tek is an important keyword, an important keyword being a keyword with the highest frequency of occurrence.

Further, step 2) also includes performing frequency statistics on the keywords that occur in the test webpages. When the frequency of a secondary keyword exceeds the set threshold, it can be re-marked as an important keyword, so that more 2-tuples are obtained when computing the keyword 2-tuples of a test webpage, improving the accuracy and recall of the classification results.

Here, a "secondary keyword" is a keyword whose frequency of occurrence is not very high at the first count but may change as the statistics accumulate, so that it becomes a high-frequency keyword. An "important keyword" is a keyword with a high frequency of occurrence, and its level is 0. The division into secondary and important is initially made by manual annotation, but a threshold is also set to distinguish the two. For example, with a threshold of 0.2, a keyword annotated as secondary becomes an important keyword once its counted frequency rises above 0.2.
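The promotion of secondary keywords might look like the sketch below. The text does not say how the frequency is normalized, so this example assumes it is the share of test webpages whose text contains the keyword. The function name `promote_secondary_keywords` and the sample pages are likewise hypothetical, and the table uses the value 1 for important keywords, following the key-value convention of step 1-4).

```python
def promote_secondary_keywords(kwt, test_pages, threshold=0.2):
    """Return a copy of kwt in which frequent secondary keywords are marked important."""
    updated = dict(kwt)
    if not test_pages:
        return updated
    for k, value in kwt.items():
        if value == 1:
            continue  # already important
        # Assumption: frequency = share of test webpages whose text contains k.
        freq = sum(k in page for page in test_pages) / len(test_pages)
        if freq > threshold:
            updated[k] = 1
    return updated

kwt = {"足球": 1, "联赛": 0, "基金": 0}
test_pages = [
    "足球联赛开幕", "联赛积分榜", "基金净值公布",
    "天气预报", "今日新闻", "体育新闻",
]
promoted = promote_secondary_keywords(kwt, test_pages)
```

In this example "联赛" appears in 2 of 6 pages and is promoted, while "基金" appears in 1 of 6 and stays secondary.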

Further, in step 3), the process of traversing the training set trs for each tec specifically includes:

3-2-1) if the kws of tr contains the first keyword of tec, proceed to step 3-2-2); otherwise, compute the next tec of tek and start traversing the training set again;

3-2-2) if the kws of tr contains at least one important keyword of tec, add the tag of tr to the tag set tags of te: if tags already contains this tag, add 1 to the value of the key-value pair whose key is this tag; otherwise, add the key-value pair <tag, 1> to tags.
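A sketch of steps 3-2-1) and 3-2-2) for a single keyword pair follows. Several details are assumptions made for illustration: training webpages are represented as `(kws, tag)` tuples, the important keywords are passed in as a set named `important`, and a training webpage that fails the check of step 3-2-1) is simply skipped so that the traversal continues with the next one.

```python
def transfer_tags(tec, important, training_set):
    """Tally the tags of the training webpages matched by the keyword pair tec."""
    tags = {}
    first, second = tec
    for kws, tag in training_set:
        # Step 3-2-1): the training webpage must contain the first keyword.
        if first not in kws:
            continue
        # Step 3-2-2): it must also contain at least one important keyword of tec.
        if any(k in important and k in kws for k in (first, second)):
            # Existing tag: add 1; new tag: store <tag, 1>.
            tags[tag] = tags.get(tag, 0) + 1
    return tags

# Hypothetical training set of (keyword set, class label) tuples.
training_set = [
    ({"足球", "联赛", "进球"}, "sports"),
    ({"足球", "转会", "教练"}, "sports"),
    ({"股票", "基金", "市场"}, "finance"),
]
tags = transfer_tags(("足球", "联赛"), {"足球"}, training_set)
```

Both sports pages contain the important keyword "足球", so the tally for this pair is `{"sports": 2}`.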

Further, step 4) also includes performing classification calculation on the test webpages te that contain no keywords (i.e. the test webpages for which pre-classification failed).

The beneficial effects of the present invention are as follows.

1. While the class labels of the training set are manually annotated, the keywords characterizing each category are also given; frequency statistics are then performed on all keywords and a threshold is set, dividing the keywords into important and secondary and yielding the keyword table. Dividing the keywords into these two levels by frequency of occurrence makes full use of keyword frequency information, so that the keyword 2-tuples of a test webpage better reflect the attributes of the webpage itself, improving the accuracy of Chinese webpage classification.

2. For each test webpage in the test set, the keyword table is traversed first to obtain the keyword set contained in the webpage; all 2-tuples of the keyword set are then obtained, with at least one keyword in each 2-tuple required to be important. The candidate labels obtained for the test webpage by matching these keyword pairs are more accurate, which likewise improves the accuracy of the webpage classification results.

3. For each 2-tuple, the training set is traversed; if the keyword set of a training webpage contains the 2-tuple, the label of that training webpage is added to the tag set of the test webpage. Finally, frequency statistics are performed on the labels in the tag set of the test webpage, and the several most frequent labels are taken as required, as the pre-classification labels of the test webpage. Reasonably assigning multiple labels in this way improves the accuracy and recall of the classification results.

4. Because the training set is small, the time consumed by the whole process grows linearly with the size of the test set. This greatly reduces the running time of Chinese webpage classification while improving the accuracy and recall of the classification results.

Brief description of the drawings

Fig. 1 is a structural diagram of the training set and the test set.

Fig. 2 is a flowchart of the method for pre-classifying Chinese webpages based on keyword matching.

Detailed description of the embodiments

The principles and features of the present invention are described below with reference to the accompanying drawings. The examples serve only to explain the present invention and are not intended to limit its scope.

Suppose a Chinese webpage classification system has been implemented based on SVM. The training set and test set provided are trs and tes respectively, as shown in Fig. 1. trs is decomposed into several tr, and trid denotes the number of each tr; tes is decomposed into several te, teid denotes the number of each te, and tecs denotes the set of 2-tuples tec; kwt is the keyword table, comprising each keyword kw and the value of its corresponding key-value pair.

The steps of the method for pre-classifying Chinese webpages based on keyword matching, that is, of pre-classifying tes by means of trs, are shown in Fig. 2 and are as follows:

Step 1: for each training webpage tr in the training set trs, annotate its class label tag and the keyword set kws that characterizes the category of the webpage; repeat step 1 until the whole training set has been annotated;

Step 2: deduplicate the kws of all training webpages and store them in the keyword table kwt; proceed to step 3;

Step 3: for each test webpage te in the test set tes, traverse kwt and look up the keywords contained in te to form the keyword set tek. If tek is not empty, proceed to step 4; if tek is empty, proceed to step 8;

Step 4: compute the first 2-tuple tec of tek, i.e. a keyword pair; proceed to step 5;

Step 5: for tec, traverse the training set trs; if the kws of a tr contains tec, transfer the tag of that tr to this te by storing it in the tag set tags of this te, and at the same time count the number of occurrences of the label. Proceed to step 6;

Step 6: repeat steps 4 and 5 up to the last 2-tuple of tek; proceed to step 7;

Step 7: sort the labels in tags in descending order of occurrence count and take the top n of tags (n is an integer; one or more can be taken as required) as the pre-classification labels of this te; proceed to step 8;

Step 8: repeat steps 3 to 7 until the last te has been pre-classified. Chinese webpages for which pre-classification failed enter the classification calculation stage, and their classification is complete when that calculation ends; otherwise pre-classification ends directly.
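Steps 3 to 8 can be put together in a short end-to-end sketch. It is an illustration under stated assumptions, not the patented implementation: keywords are found by plain substring lookup, training webpages are `(kws, tag)` tuples, a page that has keywords but collects no labels is also treated as a pre-classification failure, and `fallback` is a hypothetical stand-in for the full classifier (for example an SVM model).

```python
from collections import Counter
from itertools import combinations

def preclassify(text, kwt, training_set, top_n=2):
    """Return up to top_n pre-classification labels, or [] if pre-classification fails."""
    # Step 3: substring lookup, so no Chinese word segmentation is required.
    tek = [k for k in kwt if k in text]
    tags = Counter()
    # Steps 4 to 6: every keyword pair with at least one keyword whose value is 1.
    for a, b in combinations(tek, 2):
        if kwt[a] != 1 and kwt[b] != 1:
            continue
        # Step 5: transfer the tag of each training webpage whose kws holds the pair.
        for kws, tag in training_set:
            if a in kws and b in kws:
                tags[tag] += 1
    # Step 7: labels in descending order of occurrence count.
    return [tag for tag, _ in tags.most_common(top_n)]

def classify(text, kwt, training_set, fallback):
    labels = preclassify(text, kwt, training_set)
    # Step 8: pages that fail pre-classification go to the full classifier.
    return labels if labels else [fallback(text)]

# Hypothetical keyword table and training set.
kwt = {"足球": 1, "联赛": 0, "股票": 1, "基金": 0}
training_set = [({"足球", "联赛"}, "sports"), ({"股票", "基金"}, "finance")]
result = classify("昨晚的足球联赛十分精彩", kwt, training_set, fallback=lambda t: "other")
```

Only pages that reach `fallback` incur the cost of the expensive classifier, which is where the reduction in running time comes from.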

The deduplication of the kws of all training webpages and their storage in the keyword table kwt described in step 2 proceeds as follows:

Step 2.1: create a new map m, use each keyword k in the kws of the first training webpage tr as a key of m, and set every corresponding value to 1;

Step 2.2: for each keyword k in the kws of the second tr, first determine whether m already contains k; if so, add 1 to the value of the key-value pair whose key is k; if not, add the key-value pair <k, 1> to m;

Step 2.3: repeat step 2.2 for the remaining tr, up to the last tr;

Step 2.4: set a threshold s; when the value of a key-value pair is less than s, set the value to 0; otherwise set the value to 1.

The computation of the 2-tuples tec of tek, i.e. the keyword pairs, described in step 4 proceeds as follows:

Step 4.1: tek contains n keywords k1, k2, ..., kn. Look up keyword pairs in the following order, each pair being passed to step 4.2 for checking: the keyword pairs containing k1 are <k1, k2>, <k1, k3>, ..., <k1, kn>; the keyword pairs containing k2 are <k2, k3>, ..., <k2, kn>; and so on up to the keyword pair containing k(n-1), <k(n-1), kn>;

Step 4.2: if at least one keyword of the current keyword pair has a corresponding value of 1 in m, proceed to step 5; otherwise, return to step 4.1 and look up the next keyword pair.

The transfer of the tag of a tr whose kws contains tec to this te, its storage in the tag set tags of this te, and the counting of label occurrences described in step 5 proceed as follows:

Step 5.1: if the kws of tr contains the first keyword of tec, proceed to step 5.2; otherwise, proceed to step 6;

Step 5.2: if the kws of tr contains the second keyword of tec, add the tag of tr to the tag set tags of te: if tags already contains this tag, add 1 to the value of the key-value pair whose key is this tag; otherwise, add the key-value pair <tag, 1> to tags.

The reason step 5.2 requires te to supply only two keywords is that frequency statistics have already been performed on the manually annotated keywords and a suitable threshold has been set, yielding the important keywords. If at least one important keyword in te also occurs in tr, the label of tr is considered transferable to te.

Embodiment

The following example uses a training set of 200 webpages in 7 categories and a test set of 1000 webpages.

Each training webpage in the training set is given one label, and 3 keywords that characterize its category are annotated (as its keyword set); all of these are stored in memory.

The threshold for identifying important keywords is set to 0.2. After the frequency calculation, a keyword table containing 30 important keywords and 40 secondary keywords is obtained.

For each test webpage in the test set, the keyword table is traversed first to find the keywords the webpage contains. On average there are 3 (excluding test webpages that contain no keywords, for which pre-classification fails), so there are at most 3 keyword 2-tuples that satisfy the condition.
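The bound of 3 follows from counting unordered keyword pairs. As a worked check, a keyword set of n keywords yields the following number of 2-tuples before the important-keyword filter is applied:

```latex
\binom{n}{2} = \frac{n(n-1)}{2}, \qquad \binom{3}{2} = \frac{3 \cdot 2}{2} = 3
```

The filter can only remove pairs, so 3 is an upper bound for a webpage with 3 keywords.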

For each 2-tuple, the training set is traversed, and the labels of the training webpages whose keyword sets contain the 2-tuple are added to the tag set of the test webpage. Finally, the tag set is sorted by frequency, and the two labels with the most occurrences are taken as the labels of the test webpage.

The method for pre-classifying Chinese webpages based on keyword matching according to the present invention was tested on real Chinese webpages. Compared with the Chinese webpage classification results obtained without pre-classification, the accuracy and recall of the final classification results improved by at least 10% and 15% respectively, bringing the classification effect of the whole Chinese webpage classification system up to the expected value. The specific formulas are as follows:

Recall = number of relevant documents retrieved by the system / total number of relevant documents

Precision = number of relevant documents retrieved by the system / total number of documents retrieved by the system

Original method: recall = 25%, precision = 18%;

Method described herein: recall = 40%, precision = 28%.
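The two formulas can be computed as in the following sketch. The function name `precision_recall` and the document ids are hypothetical and chosen only to make the arithmetic easy to follow; they are not the experimental data behind the figures reported above.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two sets of document ids."""
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical ids: 4 documents retrieved, 5 relevant, 2 in common.
retrieved = {1, 2, 3, 4}
relevant = {2, 4, 5, 6, 7}
p, r = precision_recall(retrieved, relevant)
```

Here 2 of the 4 retrieved documents are relevant (precision 2/4) and 2 of the 5 relevant documents were retrieved (recall 2/5).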

The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A method for pre-classifying Chinese webpages based on keyword matching, comprising the following steps:

1) annotating, for each training webpage tr in the training set trs, its class label tag and the keyword set kws that characterizes the category of that webpage, and generating a keyword table kwt;

2) extracting, according to kwt, the keywords contained in each test webpage te in the test set tes to form a keyword set tek;

3) computing each 2-tuple tec of tek and traversing the training set, transferring the tag of every tr whose kws contains this tec to the te corresponding to this tec, and storing the tag in the tag set tags of this te;

4) performing frequency statistics on the labels in tags and taking the several most frequent labels as required, as the pre-classification labels of the test webpage.

2. The method for pre-classifying Chinese webpages based on keyword matching according to claim 1, characterized in that step 1) also includes deduplicating the keyword sets kws of all training webpages belonging to the same category before generating the keyword table kwt.

3. The method for pre-classifying Chinese webpages based on keyword matching according to claim 1, characterized in that the keyword table kwt is generated by the following steps:

1-1) creating a new map m, using each keyword k in the kws of the first training webpage tr as a key of m, and setting every corresponding initial value to 1;

1-2) for each keyword k in the kws of the second tr, first determining whether m already contains k; if so, adding 1 to the value of the key-value pair whose key is k; if not, adding the key-value pair <k, 1> to m;

1-3) repeating step 1-2) for the remaining tr, up to the last tr.

4. The method for pre-classifying Chinese webpages based on keyword matching according to claim 3, characterized in that, in step 2), the 2-tuples tec of tek are computed by the following steps:

3-1-1) for a tek containing n keywords k1, k2, ..., kn, looking up keyword pairs in the following order, each pair being passed to step 3-1-2) for checking: the keyword pairs containing k1 are <k1, k2>, <k1, k3>, ..., <k1, kn>; the keyword pairs containing k2 are <k2, k3>, ..., <k2, kn>; and so on up to the keyword pair containing k(n-1), <k(n-1), kn>;

3-1-2) if at least one keyword of the current keyword pair has a corresponding value of 1 in m, traversing the training set trs with this tec; otherwise, returning to step 3-1-1) and looking up the next keyword pair.

5. The method for pre-classifying Chinese webpages based on keyword matching according to claim 1, characterized in that, in step 2), at least one keyword in every 2-tuple tec of tek is an important keyword, the important keyword being a keyword with the highest frequency of occurrence.

6. The method for pre-classifying Chinese webpages based on keyword matching according to claim 5, characterized in that step 2) also includes performing frequency statistics on the keywords that occur in the test webpages, and re-marking a secondary keyword as an important keyword when its frequency of occurrence exceeds the set threshold.

7. The method for pre-classifying Chinese webpages based on keyword matching according to claim 1, characterized in that, in step 3), the process of traversing the training set trs for each tec specifically includes:

3-2-1) if the kws of tr contains the first keyword of tec, proceeding to step 3-2-2); otherwise, computing the next tec of tek and starting to traverse the training set again;

3-2-2) if the kws of tr contains at least one important keyword of tec, the important keyword being a keyword with the highest frequency of occurrence, adding the tag of tr to the tag set tags of te: if tags already contains this tag, adding 1 to the value of the key-value pair whose key is this tag; otherwise, adding the key-value pair <tag, 1> to tags.

8. The method for pre-classifying Chinese webpages based on keyword matching according to claim 1, characterized in that step 4) also includes performing classification calculation on the test webpages te that contain no keywords.