CN103064984B

CN103064984B - Method and system for identifying spam web pages

Info

Publication number: CN103064984B
Application number: CN201310029963.XA
Authority: CN
Inventors: 刘奕群; 马少平; 张敏; 金奕江; 张阔
Original assignee: Tsinghua University; Beijing Sogou Technology Development Co Ltd
Current assignee: Tsinghua University; Beijing Sogou Technology Development Co Ltd
Priority date: 2013-01-25
Filing date: 2013-01-25
Publication date: 2016-08-10
Anticipated expiration: 2033-01-25
Also published as: CN103064984A

Abstract

The present invention proposes a method and a system for identifying spam web pages. The method includes: obtaining the query log of a search engine and preprocessing it to obtain a preprocessed query log; filtering, from the queries and result pages in the preprocessed query log, a query-result set in which the user click rate of each query and the number of occurrences of each result page exceed a threshold; manually screening the query-result set to extract multiple spam pages and generate a spam page sample set; calculating, from the query-result set and the spam page sample set, the spam score of each result page and the cheating score of each query in the query-result set; and, when the spam score of a result page exceeds a threshold, classifying that result page as a spam page and adding it to a spam page set. According to embodiments of the invention, discovering and identifying spam pages from search engine query logs reduces algorithmic complexity and offers good generalization and adaptability.

Description

Method and system for identifying spam web pages

Technical field

The present invention relates to the technical field of intelligent processing of network information, and in particular to a method and a system for identifying spam web pages.

Background art

The rapid growth in the volume of information on the Internet has made search engines an indispensable means of obtaining information in people's daily work and life. According to CNNIC statistics from December 2011, the number of search engine users among Chinese Internet users had reached 396 million, a usage rate of nearly 80%, making search one of the most widely used Internet services. Search engines serve as an important entry point when users go online; obtaining a favorable ranking in search results has therefore become the most effective way for an Internet resource to attract user attention quickly.

Under this mode of information acquisition, with search engines as the main entry point, the high traffic and high revenue brought by high search rankings tempt many Internet content providers to deceive search engine algorithms by cheating in order to obtain favorable rankings. Web pages that profit from such deception are spam pages. A spam page is defined as a web page that exploits defects in a search engine's algorithms and uses deceptive techniques aimed at the search engine to obtain a ranking higher than its information quality merits, in pursuit of direct or indirect gain.

In 2003, Fetterly et al. concluded from a sampling analysis of English web pages that at least 8.1% of them were spam pages; a separate study in 2004 estimated that about 10% to 15% of the content on the Web was spam. According to our own sampling analysis of about 800 million Chinese web pages, carried out with the assistance of the Sogou search engine, about 15% of Chinese Internet resources are spam pages.

Spam pages have significant adverse effects on Internet users, on the Internet resource environment and on search engines. For users, spam pages occupy high positions in the result list and trick users into clicking; this makes it harder for users to find the information they want and lowers the efficiency of information acquisition. Spam pages are also often combined with viruses, Trojan software and the like, seriously affecting users' information security. For the Internet resource environment, legal restrictions mean that search engines generally do not offer paid-placement advertising for illegal content such as pornography and gambling, so cheating to raise rankings becomes the main option for websites providing such content. Spam pages are consequently flooded with all kinds of illegal content, and illegal pages that use cheating techniques tend to spread their harmful influence widely, damaging the Internet resource environment even more seriously. For search engine systems, spam pages fill the index with useless pages and waste large amounts of storage space and processing time, which raises the cost of processing each query, lowers search efficiency and reduces users' trust in the search engine.

One conventional approach to spam page identification targets content-based cheating. Such work analyzes the URL features and common-phrase features of spam pages. In one study, content features were extracted from 105 million web pages crawled via MSN Search, and features such as title length, average word length, the proportion of visible content and the content compression ratio were used to distinguish spam pages from normal pages. Later work built on this with additional content features, including the amount of anchor text and the number of popular words in a page, and used a ranking-based learning method to combine the features for spam page identification.

Another approach identifies spam pages through link structure analysis. The TrustRank algorithm, proposed in 2004, opened a new way of using link structure information to identify spam pages and can be applied to various types of spam, including content cheating and link cheating. Although the method lacks a way of coping with noisy data in the link graph, a considerable number of researchers have proposed link analysis techniques for spam page identification based on improvements to TrustRank; these algorithms include Anti-TrustRank and Truncated PageRank.

The approaches above all achieve good results on relatively fixed web page test collections. Many results reported in the internationally known Web Spam Challenge evaluation reach an identification accuracy above 80%, and the experimental accuracy reported in many related papers often exceeds 90%. For various reasons, however, these algorithms still face great challenges when applied to the real Internet environment and struggle to deliver their full effect, which is why spam pages continue to have a major impact on search engine applications.

The shortcomings of the prior art are as follows:

(1) These algorithms can usually identify only one specific type of spam page and lack robustness. New forms of cheating keep emerging; an algorithm may perform very well on one class of spam page yet be unable to identify others, so once spam authors adopt a new form of cheating the algorithm often stops being effective.

(2) As cheating techniques have developed, many algorithms have come to consume large amounts of computing, storage or bandwidth resources, for example by building n-gram language models of page content, crawling pages repeatedly or deeply parsing page scripts. The efficiency of these algorithms therefore does not meet the demands of a search engine's online service, and they cannot be applied in real search engine services.

Summary of the invention

The present invention aims to solve at least one of the above technical defects.

To achieve the above object, an embodiment of one aspect of the invention proposes a method for identifying spam web pages, comprising the following steps: S1: obtaining the query log of a search engine and preprocessing the query log to obtain a preprocessed query log, wherein the preprocessed query log includes multiple queries and result pages; S2: filtering, from the queries and result pages of the preprocessed query log, a query-result set in which the user click rate of each query and the number of occurrences of each result page exceed a threshold; S3: manually screening the query-result set to extract multiple spam pages and generate a spam page sample set; S4: calculating, from the query-result set and the spam page sample set, the spam score of each result page and the cheating score of each query in the query-result set; and S5: if the spam score of a result page in the query-result set exceeds a threshold, classifying the result page as a spam page and adding it to the spam page set.

According to the method of embodiments of the invention, spam pages are discovered and identified from search engine query log data, which reduces algorithmic complexity; the structure and parameters are simple, the identification results are comprehensive and reliable, and the method generalizes and adapts well.

In one example of the invention, step S1 specifically comprises: S11: obtaining the query log of the search engine and converting the query log to GBK encoding; S12: arranging the converted query log to obtain the preprocessed query log.

In one example of the invention, step S2 specifically comprises: S21: segmenting each query of the preprocessed query log into multiple keywords, and building a first query-result set from each keyword and the result pages clicked by users; S22: calculating the user click frequency on result pages for each query in the first query-result set, and filtering out the queries and result pages whose user click rate exceeds a threshold to generate a second query-result set; S23: calculating the number of times each result occurs in the second query-result set, and filtering out the queries and result pages whose number of occurrences exceeds a threshold to generate the query-result set.

In one example of the invention, step S4 specifically comprises: S41: setting the initial cheating score of each query in the query-result set, and setting the initial spam score of each result page in the query-result set; S42: calculating the average of the spam scores of all result pages associated with each query in the query-result set and using it as the cheating score of that query; and S43: calculating the average of the cheating scores of all queries associated with each result page in the query-result set; if the result page is not among the spam pages, this average is used as the spam score of that page, otherwise the spam score is left unchanged.

To achieve the above object, an embodiment of another aspect of the invention proposes a system for identifying spam web pages, comprising: a preprocessing module for obtaining the query log of a search engine and preprocessing the query log to obtain a preprocessed query log, wherein the preprocessed query log includes multiple queries and result pages; a screening module for filtering, from the queries and result pages of the preprocessed query log, a query-result set in which the user click rate of each query and the number of occurrences of each result page exceed a threshold; an extraction module for manually screening the query-result set to extract multiple spam pages and generate a spam page sample set; a computing module for calculating, from the query-result set and the spam page sample set, the spam score of each result page and the cheating score of each query in the query-result set; a judging module for determining whether the spam score of a result page in the query-result set exceeds a threshold, the result page being a spam page if it does; and a processing module for adding the result page to the spam page set.

According to the system of embodiments of the invention, spam pages are discovered and identified from search engine query log data, which reduces algorithmic complexity; the structure and parameters are simple, the identification results are comprehensive and reliable, and the system generalizes and adapts well.

In one example of the invention, the preprocessing module comprises: an acquisition and conversion unit for obtaining the query log of the search engine and converting the query log to GBK encoding; and a preprocessing unit for arranging the converted query log to obtain the preprocessed query log.

In one example of the invention, the screening module comprises: a construction unit for segmenting each query of the preprocessed query log into multiple keywords and building a first query-result set from each keyword and the result pages clicked by users; a first computing unit for calculating the user click frequency on result pages for each query in the first query-result set and filtering out the queries and result pages whose user click rate exceeds a threshold to generate a second query-result set; and a second computing unit for calculating the number of times each result occurs in the second query-result set and filtering out the queries and result pages whose number of occurrences exceeds a threshold to generate the query-result set.

In one example of the invention, the computing module comprises: a setting unit for setting the initial cheating score of each query in the query-result set and the initial spam score of each result page in the query-result set; a third computing unit for calculating the average of the spam scores of all result pages associated with each query in the query-result set and using it as the cheating score of that query; and a fourth computing unit for calculating the average of the cheating scores of all queries associated with each result page in the query-result set, this average being used as the spam score of the page if the result page is not among the spam pages, and the spam score being left unchanged otherwise.

Additional aspects and advantages of the invention will be set forth in part in the description that follows, and in part will become apparent from that description or be learned through practice of the invention.

Brief description of the drawings

The above and/or additional aspects and advantages of the invention will become apparent and readily understood from the following description of embodiments in conjunction with the accompanying drawings, in which:

Fig. 1 is a flowchart of the method for identifying spam pages according to one embodiment of the invention;

Fig. 2 is a structural diagram of the preprocessed log according to one embodiment of the invention;

Fig. 3 is a schematic diagram of the calculation of spam scores over the query-result set according to one embodiment of the invention;

Fig. 4 is a block diagram of the system for identifying spam pages according to another embodiment of the invention.

Detailed description of the invention

Embodiments of the invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which identical or similar reference numerals throughout denote identical or similar elements or elements having identical or similar functions. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the invention and are not to be construed as limiting it.

In the description of the invention, it should be understood that the terms "first", "second", "third" and "fourth" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. A feature qualified by "first", "second", "third" or "fourth" may therefore explicitly or implicitly include one or more such features. In the description of the invention, "multiple" means two or more, unless specifically defined otherwise.

Fig. 1 is a flowchart of the method for identifying spam pages according to one embodiment of the invention. As shown in Fig. 1, the method for identifying spam pages according to an embodiment of the invention comprises the following steps:

Step S101: obtain the query log of the search engine and preprocess the query log to obtain a preprocessed query log, wherein the preprocessed query log includes multiple queries and result pages.

Specifically, the query log of the search engine is first obtained and converted to GBK encoding. The converted query log is then arranged to obtain the preprocessed query log, whose structure is shown in Fig. 2. Table 1 lists the content of the search engine query log after preprocessing.

Table 1
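The conversion and arranging described for step S101 can be sketched as follows. This is a minimal illustration only: the source encoding, the tab-separated two-field record layout and the function names `to_gbk` and `tidy` are assumptions made for the example, since the text does not specify the raw log format.

```python
from typing import Optional, Tuple


def to_gbk(raw: bytes, source_encoding: str = "utf-8") -> bytes:
    """Re-encode one raw log line as GBK.

    Bytes that cannot be decoded, and characters GBK cannot represent,
    are replaced rather than raising an error.
    """
    text = raw.decode(source_encoding, errors="replace")
    return text.encode("gbk", errors="replace")


def tidy(line: bytes) -> Optional[Tuple[str, str]]:
    """Hypothetical arranging step: parse a GBK line into (query, clicked_url).

    Assumes a tab-separated record with exactly two non-empty fields;
    anything else is treated as useless information and dropped.
    """
    fields = line.decode("gbk").rstrip("\n").split("\t")
    if len(fields) != 2 or not fields[0] or not fields[1]:
        return None
    return fields[0], fields[1]
```

Each raw line would be passed through `to_gbk` and then `tidy`, keeping only the records for which `tidy` returns a pair.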

In one embodiment of the invention, the log used covers all queries submitted to the Sogou search engine over the nine days from March 1 to March 9, 2011. It contains 8,443,963 distinct queries and 12,470,865 distinct clicked web pages, which belong to 1,055,001 distinct websites. The information included in the log is shown in Table 2.

Table 2

The log information in Table 2 contains enough items for the automatic evaluation of search engines, so this log can be used to evaluate the performance of Chinese search engines.

Step S102: filter, from the queries and result pages of the preprocessed query log, a query-result set in which the user click rate of each query and the number of occurrences of each result page exceed a threshold.

Specifically, each query in the preprocessed query log is segmented into multiple keywords, and a first query-result set is built from each keyword and the result pages clicked by users. The user click frequency on result pages is then calculated for each query in the first query-result set, and the queries and result pages whose user click rate exceeds a threshold are filtered out to generate a second query-result set. Finally, the number of times each result occurs in the second query-result set is calculated, and the queries and result pages whose number of occurrences exceeds a threshold are filtered out to generate the query-result set.
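The three filtering passes of step S102 can be sketched as below. The text does not define "click frequency" or "number of occurrences" precisely, so this sketch assumes the former is the total number of clicks recorded for a keyword and the latter is the number of distinct keywords a page is paired with. The function name, the `segment` callback (a stand-in for a real Chinese word segmenter) and both threshold parameters are hypothetical.

```python
from collections import Counter


def build_query_result_set(records, segment, min_query_clicks, min_url_occurrences):
    """Return {(keyword, url): click_count} after the two threshold filters.

    records: iterable of (query, clicked_url) pairs from the preprocessed log.
    segment: callable that splits a query string into keywords (assumed).
    """
    # First set: every (keyword, clicked page) pair with its click count.
    pair_clicks = Counter()
    for query, url in records:
        for keyword in segment(query):
            pair_clicks[(keyword, url)] += 1

    # Second set: keep pairs whose keyword has more clicks than the threshold.
    keyword_clicks = Counter()
    for (keyword, _url), clicks in pair_clicks.items():
        keyword_clicks[keyword] += clicks
    second = {
        pair: clicks
        for pair, clicks in pair_clicks.items()
        if keyword_clicks[pair[0]] > min_query_clicks
    }

    # Final set: keep pairs whose page occurs often enough in the second set.
    url_occurrences = Counter(url for (_keyword, url) in second)
    return {
        pair: clicks
        for pair, clicks in second.items()
        if url_occurrences[pair[1]] > min_url_occurrences
    }
```

The click counts kept as dictionary values could later serve as the association strength between a keyword and a page.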

Step S103: manually screen the query-result set to extract multiple spam pages and generate the spam page sample set.

Specifically, a number of search results, for example 1000 query-result pairs, are drawn at random from the query-result set, and each result page among them is labeled as spam or not spam. Labeling stops once the number of labeled spam pages reaches a predetermined quantity, for example 200. If the predetermined quantity has not been reached, a further 1000 pairs are drawn from the query-result set and labeled, and so on until the number of spam pages reaches the predetermined quantity. The labeled spam pages form the spam page sample set.
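The batch-wise labeling loop of step S103 might be organized as in the following sketch. The labeling itself is manual in the text, so `is_spam` here is only a hypothetical stand-in for a human annotator's judgment; the function name, the fixed random seed and the sampling-without-replacement behavior are likewise assumptions.

```python
import random


def collect_spam_samples(pairs, is_spam, target=200, batch_size=1000, seed=0):
    """Label random batches of (query, url) pairs until `target` spam pages are found.

    is_spam: callable standing in for the manual judgment on a URL (assumed).
    Returns the set of URLs labeled as spam; it may hold fewer than `target`
    entries if the whole pool is exhausted first.
    """
    rng = random.Random(seed)
    pool = list(pairs)
    rng.shuffle(pool)  # a shuffled pool gives random batches without replacement

    spam_samples = set()
    for start in range(0, len(pool), batch_size):
        batch = pool[start:start + batch_size]
        for _query, url in batch:
            if is_spam(url):
                spam_samples.add(url)
                if len(spam_samples) >= target:
                    return spam_samples  # stop labeling as soon as the quota is met
    return spam_samples
```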

Step S104: calculate, from the query-result set and the spam page sample set, the spam score of each result page and the cheating score of each query in the query-result set.

Specifically, the initial cheating score of each query in the query-result set is set to 0, and the initial spam score of each result page in the query-result set is set as follows: if the result page is in the spam page sample set, its initial spam score is set to 1; otherwise it is set to 0. Next, the average of the spam scores of all result pages associated with each query in the query-result set is calculated and used as the cheating score of that query. Finally, the average of the cheating scores of all queries associated with each result page in the query-result set is calculated; if the result page is not among the spam pages, this average is used as the spam score of that page, otherwise the spam score is left unchanged. In embodiments of the invention, these updates of the spam scores and cheating scores are repeated in sequence multiple times, typically 20 to 30 times, and the final spam score obtained is taken as the spam score of the result page.
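The alternating update of step S104 can be sketched as follows, using equal-weight averages. The function and variable names are illustrative, and the sketch assumes that "not among the spam pages" means "not in the manually labeled sample set", so that sample pages keep a fixed score of 1.

```python
from collections import defaultdict


def propagate_scores(pairs, seed_spam, iterations=20):
    """Alternately update query cheating scores and page spam scores.

    pairs: iterable of (query, url) associations in the query-result set.
    seed_spam: set of URLs in the labeled spam page sample set.
    Returns (spam_scores, cheat_scores) as two dictionaries.
    """
    urls_of = defaultdict(set)     # query -> associated result pages
    queries_of = defaultdict(set)  # result page -> associated queries
    for query, url in pairs:
        urls_of[query].add(url)
        queries_of[url].add(query)

    # Initial scores: 1 for sample pages, 0 for other pages and all queries.
    spam = {url: (1.0 if url in seed_spam else 0.0) for url in queries_of}
    cheat = {query: 0.0 for query in urls_of}

    for _ in range(iterations):
        # Cheating score of a query: mean spam score of its result pages.
        for query, urls in urls_of.items():
            cheat[query] = sum(spam[u] for u in urls) / len(urls)
        # Spam score of a non-sample page: mean cheating score of its queries.
        for url, queries in queries_of.items():
            if url not in seed_spam:
                spam[url] = sum(cheat[q] for q in queries) / len(queries)
    return spam, cheat
```

On a toy graph in which two queries both point to one sample page `u1` and one unlabeled page `u2`, the score of `u2` is 0.5 after one pass and 0.75 after two, which shows how the score spreads outward from the sample set.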

Fig. 3 is a schematic diagram of the calculation of spam scores over the query-result set according to one embodiment of the invention. As shown in Fig. 3, the query-result set contains the correspondence between queries and results, and the strength of the association between the two is recorded by the frequency of occurrence of the query-result pair (denoted w_ii in Fig. 3). Starting from the small, manually labeled spam page sample set, the spam score of every web page can be computed by step-by-step iteration. Suppose URL₁ is a page in the spam page sample set (its spam score is 1) and URL₂ is not (its initial spam score is 0). In the first iteration, the cheating scores of the query keywords Query₁ and Query₃ are the average of the spam scores of URL₁ and URL₂ (either an equal-weight average or an average weighted by association strength). The spam score of URL₂ is then the average of the cheating scores of Query₁ and Query₃ (again either equal-weight or weighted by association strength). In this way the spam score spreads from the sample set to other pages, and by repeating the process the spam scores of all pages can be calculated.
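The weighted variant mentioned above can be written out as a pair of update rules. The notation is introduced here for illustration and does not appear in the text: S is the spam page sample set, R(q) is the set of pages associated with query q, Q(u) is the set of queries associated with page u, w_{q,u} is the association strength of a pair, and t counts iterations.

```latex
% Cheating score of query q at iteration t:
\[
  \mathrm{cheat}^{(t)}(q) =
  \frac{\sum_{u \in R(q)} w_{q,u}\,\mathrm{spam}^{(t-1)}(u)}
       {\sum_{u \in R(q)} w_{q,u}}
\]
% Spam score of page u at iteration t (sample pages stay at 1):
\[
  \mathrm{spam}^{(t)}(u) =
  \begin{cases}
    1, & u \in S,\\[4pt]
    \dfrac{\sum_{q \in Q(u)} w_{q,u}\,\mathrm{cheat}^{(t)}(q)}
          {\sum_{q \in Q(u)} w_{q,u}}, & u \notin S.
  \end{cases}
\]
```

Setting every w_{q,u} to 1 gives the equal-weight average. In the example above, assuming Query₁ and Query₃ are each associated only with URL₁ and URL₂, the first iteration gives each query a cheating score of (1 + 0)/2 = 0.5, and URL₂ then receives a spam score of (0.5 + 0.5)/2 = 0.5.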

Step S105: classify each result page in the query-result set whose spam score exceeds a threshold as a spam page, and add it to the spam page set.

In one embodiment of the invention, the spam score threshold used as the criterion for spam pages can be chosen according to the circumstances, for example 0.8. The identified spam pages are added to the spam page set, where they serve as data for identifying spam pages.
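The thresholding of step S105 amounts to a simple filter, sketched below. The function name and the in-place update of an existing set are illustrative choices; the default threshold of 0.8 is the example value given in the text, and the comparison is strict because the text requires the score to exceed the threshold.

```python
def add_detected_spam(spam_scores, spam_page_set, threshold=0.8):
    """Add every page whose spam score exceeds `threshold` to `spam_page_set`.

    spam_scores: mapping of URL to final spam score.
    Returns the set of newly detected pages for convenience.
    """
    detected = {url for url, score in spam_scores.items() if score > threshold}
    spam_page_set.update(detected)
    return detected
```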

Fig. 4 is a block diagram of the system for identifying spam pages according to another embodiment of the invention. As shown in Fig. 4, the system for identifying spam pages according to an embodiment of the invention comprises a preprocessing module 100, a screening module 200, an extraction module 300, a computing module 400, a judging module 500 and a processing module 600.

The preprocessing module 100 obtains the query log of the search engine and preprocesses the query log to obtain a preprocessed query log, wherein the preprocessed query log includes multiple queries and result pages.

In one embodiment of the invention, the preprocessing module 100 comprises an acquisition and conversion unit 110 and a preprocessing unit 120.

The acquisition and conversion unit 110 obtains the query log of the search engine and converts the query log to GBK encoding.

The preprocessing unit 120 arranges the converted query log to obtain the preprocessed query log.

In one embodiment of the invention, the query log of the search engine is obtained and its encoding is uniformly converted to GBK. The converted query log is arranged and useless information is filtered out to obtain the preprocessed query log; Fig. 2 is a structural diagram of the preprocessed query log.

The screening module 200 filters, from the queries and result pages of the preprocessed query log, a query-result set in which the user click rate of each query and the number of occurrences of each result page exceed a threshold.

In one embodiment of the invention, the screening module 200 comprises a construction unit 210, a first computing unit 220 and a second computing unit 230.

The construction unit 210 segments each query of the preprocessed query log into multiple keywords and builds a first query-result set from each keyword and the result pages clicked by users.

The first computing unit 220 calculates the user click frequency on result pages for each query in the first query-result set and filters out the queries and result pages whose user click rate exceeds a threshold to generate a second query-result set.

The second computing unit 230 calculates the number of times each result occurs in the second query-result set and filters out the queries and result pages whose number of occurrences exceeds a threshold to generate the query-result set.

The extraction module 300 is used for manually screening the query-result set to extract multiple spam pages and generate the spam page sample set.

In one embodiment of the invention, a number of search results, for example 1000 query-result pairs, are drawn at random from the query-result set, and each result page among them is labeled as spam or not spam. Labeling stops once the number of labeled spam pages reaches a predetermined quantity, for example 200. If the predetermined quantity has not been reached, a further 1000 pairs are drawn from the query-result set and labeled, and so on until the number of spam pages reaches the predetermined quantity. The labeled spam pages form the spam page sample set.

The computing module 400 calculates, from the query-result set and the spam page sample set, the spam score of each result page and the cheating score of each query in the query-result set.

In one embodiment of the invention, the computing module 400 comprises a setting unit 410, a third computing unit 420 and a fourth computing unit 430.

The setting unit 410 sets the initial cheating score of each query in the query-result set and the initial spam score of each result page in the query-result set.

The third computing unit 420 calculates the average of the spam scores of all result pages associated with each query in the query-result set and uses it as the cheating score of that query.

The fourth computing unit 430 calculates the average of the cheating scores of all queries associated with each result page in the query-result set; if the result page is not among the spam pages, this average is used as the spam score of that page, otherwise the spam score is left unchanged.

In embodiments of the invention, the third computing unit and the fourth computing unit update the spam scores and cheating scores in sequence, repeating the update multiple times, typically 20 to 30 times; the final spam score obtained is taken as the spam score of the result page.

The judging module 500 determines whether the spam score of a result page in the query-result set exceeds a threshold; if it does, the result page is a spam page. In one embodiment of the invention, the spam score threshold used as the criterion for spam pages can be chosen according to the circumstances, for example 0.8.

The processing module 600 adds the result page to the spam page set. The identified spam pages are added to the spam page set, where they serve as data for identifying spam pages.

Although embodiments of the invention have been shown and described above, it should be understood that these embodiments are exemplary and are not to be construed as limiting the invention; a person of ordinary skill in the art may change, modify, replace and vary the above embodiments within the scope of the invention without departing from its principle and purpose.

Claims

1. A method for identifying spam web pages, characterized by comprising the following steps:

S1: obtaining the query log of a search engine and preprocessing the query log to obtain a preprocessed query log, wherein the preprocessed query log includes multiple queries and result pages;

S2: filtering, from the queries and result pages of the preprocessed query log, a query-result set in which the user click rate of each query and the number of occurrences of each result page exceed a threshold;

S3: manually screening the query-result set to extract multiple spam pages and generate a spam page sample set;

S4: calculating, from the query-result set and the spam page sample set, the spam score of each result page and the cheating score of each query in the query-result set; and

S5: if the spam score of a result page in the query-result set exceeds a threshold, the result page is a spam page, and the result page is added to the spam page set,

wherein step S2 specifically comprises:

S21: segmenting each query of the preprocessed query log into multiple keywords, and building a first query-result set from each of the multiple keywords and the result pages clicked by users;

S22: calculating the user click frequency on result pages for each query in the first query-result set, and filtering out the queries and result pages whose user click rate exceeds a threshold to generate a second query-result set;

S23: calculating the number of times each result occurs in the second query-result set, and filtering out the queries and result pages whose number of occurrences exceeds a threshold to generate the query-result set,

and wherein step S4 specifically comprises:

S41: setting an initial cheating score for each query in the query-result set, and setting an initial rubbish score for each result web page in the query-result set;

S42: calculating the average of the rubbish scores of all result web pages associated with each query in the query-result set, and taking the average as the cheating score of the corresponding query; and

S43: calculating the average of the cheating scores of all queries associated with each result web page in the query-result set; if the result web page is not in the spam page sample set, taking the average of the cheating scores as the rubbish score of the corresponding web page, and otherwise leaving the rubbish score unchanged.
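
Steps S41 to S43, followed by the threshold test of step S5, can be illustrated with the following minimal Python sketch. It is an illustration only, not the claimed implementation: the function names `propagate_scores` and `identify_spam`, the initial values (1.0 for sample pages, 0.0 otherwise), and the fixed number of iterations are assumptions that the claim does not specify.

```python
from collections import defaultdict

def propagate_scores(pairs, spam_samples, iterations=10):
    """pairs: iterable of (query, page); spam_samples: manually labelled spam pages.

    Returns (rubbish, cheating): page -> rubbish score, query -> cheating score.
    """
    pages_of = defaultdict(set)    # query -> associated result pages
    queries_of = defaultdict(set)  # page  -> associated queries
    for query, page in pairs:
        pages_of[query].add(page)
        queries_of[page].add(query)

    # S41: initial scores (assumed values; the claim leaves them open).
    rubbish = {p: (1.0 if p in spam_samples else 0.0) for p in queries_of}
    cheating = {q: 0.0 for q in pages_of}

    for _ in range(iterations):  # iteration count is an assumption
        # S42: cheating score of a query = mean rubbish score of its pages.
        for q, pages in pages_of.items():
            cheating[q] = sum(rubbish[p] for p in pages) / len(pages)
        # S43: rubbish score of a page = mean cheating score of its queries,
        # except that pages in the sample set keep their score unchanged.
        for p, queries in queries_of.items():
            if p not in spam_samples:
                rubbish[p] = sum(cheating[q] for q in queries) / len(queries)
    return rubbish, cheating

def identify_spam(rubbish, threshold):
    """S5: pages whose rubbish score exceeds the threshold."""
    return {p for p, score in rubbish.items() if score > threshold}
```

With a single labelled page, scores spread to pages that share queries with it, which is the effect the averaging in S42 and S43 is meant to achieve.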

2. The method for identifying spam pages according to claim 1, characterized in that step S1 specifically comprises:

S11: obtaining the query log of the search engine, and converting the query log to GBK format;

S12: tidying the converted query log to obtain the preprocessed query log.
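
Steps S11 and S12 can be sketched as follows. This is a hedged illustration: the source encoding (UTF-8), the tab-separated `query<TAB>url` record layout, and the function names `to_gbk` and `tidy` are assumptions, since the claim states only that the log is converted to GBK and then tidied.

```python
def to_gbk(raw: bytes, source_encoding: str = "utf-8") -> bytes:
    """S11: re-encode raw log bytes as GBK (source encoding is assumed)."""
    text = raw.decode(source_encoding, errors="replace")
    # Characters that GBK cannot represent are replaced with "?".
    return text.encode("gbk", errors="replace")

def tidy(gbk_log: bytes):
    """S12: keep only well-formed (query, url) records.

    The two-field, tab-separated layout is a hypothetical log format.
    """
    records = []
    for line in gbk_log.decode("gbk").splitlines():
        fields = [field.strip() for field in line.split("\t")]
        if len(fields) == 2 and all(fields):
            records.append((fields[0], fields[1]))
    return records
```

Blank lines and lines without exactly two non-empty fields are dropped, which is one plausible reading of "tidying".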

3. A system for identifying spam pages, characterized by comprising:

a preprocessing module, for obtaining a query log of a search engine and preprocessing the query log to obtain a preprocessed query log, wherein the preprocessed query log comprises a plurality of queries and result web pages;

a screening module, for selecting, from the plurality of queries and result web pages of the preprocessed query log, a query-result set in which the user click rate of the queries and the occurrence count of the result web pages exceed thresholds;

an extraction module, for manually screening the query-result set to extract a plurality of spam pages and generate a spam page sample set;

a computing module, for calculating, according to the query-result set and the spam page sample set, the rubbish score of each result web page and the cheating score of each query in the query-result set;

a judging module, for judging whether the rubbish score of a result web page in the query-result set exceeds a threshold, the result web page being a spam page if it does; and

a processing module, for adding the result web page to the spam page set,

wherein the screening module comprises:

a construction unit, for segmenting each query of the preprocessed query log into a plurality of keywords, and building a first query-result set from each keyword of the plurality of keywords and the result web pages clicked by users;

a first computing unit, for calculating the user click frequency of the result web pages for each query in the first query-result set, and selecting therefrom the queries and result web pages whose user click rate exceeds a threshold to generate a second query-result set;

a second computing unit, for counting the number of times each result in the second query-result set occurs in the second query-result set, and selecting therefrom the queries and result web pages whose occurrence count exceeds a threshold to generate the query-result set,

and wherein the computing module comprises:

a setting unit, for setting an initial cheating score for each query in the query-result set, and setting an initial rubbish score for each result web page in the query-result set;

a third computing unit, for calculating the average of the rubbish scores of all result web pages associated with each query in the query-result set, and taking the average as the cheating score of the corresponding query; and

a fourth computing unit, for calculating the average of the cheating scores of all queries associated with each result web page in the query-result set; if the result web page is not in the spam page sample set, taking the average of the cheating scores as the rubbish score of the corresponding web page, and otherwise leaving the rubbish score unchanged.
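
The three units of the screening module can be illustrated with the short Python sketch below. It rests on several assumptions not fixed by the claim: whitespace splitting stands in for a real Chinese word segmenter, the click rate of a (keyword, page) pair is taken to be its share of all clicks recorded for that keyword, and the function name `screen` and the default thresholds are hypothetical.

```python
from collections import Counter

def screen(log, segment=str.split, rate_threshold=0.3, count_threshold=1):
    """log: iterable of (query, clicked_page) records from the preprocessed log."""
    # Construction unit: first query-result set of (keyword, clicked page) pairs.
    clicks = Counter()
    for query, page in log:
        for keyword in segment(query):
            clicks[(keyword, page)] += 1

    # First computing unit: keep pairs whose click rate exceeds the threshold.
    total_per_keyword = Counter()
    for (keyword, _), n in clicks.items():
        total_per_keyword[keyword] += n
    second = {
        pair for pair, n in clicks.items()
        if n / total_per_keyword[pair[0]] > rate_threshold
    }

    # Second computing unit: keep pairs whose page occurs often enough
    # within the second set.
    occurrences = Counter(page for _, page in second)
    return {(kw, page) for kw, page in second if occurrences[page] > count_threshold}
```

Passing the segmenter as a parameter keeps the sketch independent of any particular word segmentation library.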

4. The system for identifying spam pages according to claim 3, characterized in that the preprocessing module comprises:

an acquisition and conversion unit, for obtaining the query log of the search engine, and converting the query log to GBK format;

a preprocessing unit, for tidying the converted query log to obtain the preprocessed query log.