CN101493819B

CN101493819B - Method for optimizing detection of search engine cheat

Info

Publication number: CN101493819B
Application number: CN2008100567261A
Authority: CN
Inventors: 耿光刚; 李秋丹; 王春恒; 戴汝为
Original assignee: Institute of Automation of Chinese Academy of Science
Current assignee: Institute of Automation of Chinese Academy of Science
Priority date: 2008-01-24
Filing date: 2008-01-24
Publication date: 2011-09-14
Anticipated expiration: 2028-01-24
Also published as: CN101493819A

Abstract

The invention discloses an optimization method for search engine spam (cheating) detection, comprising the following steps: Step S1, all web pages and hyperlinks are preprocessed, features are extracted, and preliminary spam detection is performed on the extracted features; Step S2, secondary feature extraction is performed on the basis of the preliminary detection results to extract cluster features, propagation features, and neighbor features; and Step S3, spam is detected again with a machine learning algorithm on the basis of the preliminary detection results and the secondary feature extraction results, and the detection result is generated. The method solves the instability of the heuristic detection methods of the prior art and optimizes the performance of search engine spam detection as far as possible.

Description

An optimization method for search engine spam detection

Technical field

The present invention relates to the field of information retrieval and search engine technology, and in particular to an optimization method for search engine spam detection.

Background art

The Internet is the largest information repository in human history, and its content is still growing exponentially. Internet search has become part of everyday life: a report issued by CNNIC in July 2007 states that 61.91% of Internet users use a search engine every day.

N. Eiron et al. ranked 100 million web pages with the well-known PageRank algorithm and found that 11 of the top 20 sites were pornographic sites that had obtained their high ranks by manipulating hyperlinks. According to a survey by the U.S. commerce survey bureau, U.S. e-commerce sales reached 114.1 billion dollars in 2006, an increase of 22.7% over the 93 billion dollars of 2005, and sales in the first quarter of 2007 alone reached 31.5 billion dollars, 18.4% more than in the same period of 2006. Research by Bernard J. Jansen and Amanda Spink shows that about 80% of users browse only the first 3 pages of results when using a search engine.

The huge profits and the gateway effect of search engines drive many site operators and page authors to do everything they can to make their sites and pages prominent on the Internet, hoping to appear at the top of the results when users query related content. Search engine spam (Web Spam), also called search engine cheating, refers to the use of misleading means to deceive a search engine so that a web page is ranked higher in the retrieval results than it actually deserves; it seriously degrades the quality of search engine results.

Web spam can be broadly divided into two classes: content spam and link spam. Content spam refers to a site using content information to deceive the search engine and raise the importance of certain pages; it includes keyword spam, title spam, and so on. Link spam refers to a spam site constructing link structures that target the PageRank algorithm in order to mislead the ranking algorithm and thereby raise the importance of certain pages.

A large number of countermeasures have been proposed against these forms of spam. For content-based detection, A. Ntoulas et al. examined statistics such as the average word length, the fraction of visible content, the compression ratio of the content, the amount of anchor text, and the fraction of popular words in the text for spam pages and normal pages, and summarized a series of heuristic features. Treating content spam detection as a binary classification problem, they trained a decision tree classifier that detects most content spam pages. For link-based detection, the earliest influential work is the TrustRank algorithm proposed by Gyongyi et al., which starts from the observation that "good pages seldom point to spam pages". A seed set of highly reputable pages is selected by hand, and trust is propagated along the hyperlinks of the web graph; this yields a trust score for each page, and all pages are then divided into two classes, Spam and Normal.

The war between search engines and spammers is like an arms race: after a search engine finds and deploys an effective method, spammers work out a countermeasure after some time and invent new forms of spam. Methods based on machine learning respond to new forms of spam by adding or deleting the corresponding features, which keeps the detection system effective without modifying its architecture. Machine-learning-based spam detection has therefore become a focus of recent research. Carlos Castillo et al. used a large number of heuristics from the earlier literature as detection features, forming a 236-dimensional feature vector that covers both content-related and link-related attributes, and detected spam with machine learning methods. Both detection accuracy and stability were far better than with earlier methods. In the end, however, Castillo returned to the old path of his predecessors: based on the detection confidence, heuristics such as graph clustering, link learning, and stacked graphical learning were used to optimize the accuracy of the first round of detection.

To optimize detection performance while avoiding, as far as possible, the instability introduced by these heuristics, we propose a spam detection optimization method based on secondary features.

Summary of the invention

(1) Technical problem to be solved

In view of this, the main purpose of the present invention is to provide an optimization method for search engine spam detection that solves the instability of the heuristic spam detection methods of the prior art and optimizes the performance of search engine spam detection as far as possible.

(2) Technical solution

To achieve the above purpose, the present invention provides an optimization method for search engine spam detection, the method comprising:

Step S1: preprocess all web pages and hyperlinks, perform feature extraction, and perform preliminary spam detection on the extracted features; wherein the preliminary spam detection comprises: web page crawling, web page content extraction, construction of the hyperlink graph, feature extraction, training set generation, test set generation, classifier learning and detection on the training set, and generation and storage of the preliminary detection results;

Step S2: on the basis of the preliminary spam detection results, perform secondary feature extraction to extract cluster features, propagation features, and neighbor features; wherein the detection confidence produced by the preliminary spam detection is the precondition for secondary feature extraction, and the preliminary detection results, together with the site-level hyperlink graph, serve as the input of secondary feature extraction. The cluster feature is calculated by the formula

$$\mathrm{cf}(H) = \frac{\sum_{h \in C(H)} \mathrm{spamicity}(h)}{|C(H)|},$$

where cf(H) is the cluster feature of site H, C(H) is the cluster containing H, and spamicity(h) is the spam degree of site h given by the first detection stage, with 0 <= spamicity(h) <= 1. The propagation feature is calculated by the formula

$$\mathrm{pf}(H)^{(t)} = (1 - \alpha)\,\mathrm{spamicity}(H) + \alpha \sum_{h:\, h \to H} \frac{\mathrm{pf}(h)^{(t-1)}}{\mathrm{outdegree}(h)},$$

where pf(H)^{(t)} is the propagation feature of site H, t is the iteration number, outdegree(h) denotes the number of out-links of h (the size of its out-link set), and α is a damping factor with a value between 0 and 1. The neighbor feature is calculated by the formula

$$\mathrm{nf}(H) = \frac{\sum_{h \in N(H)} \mathrm{spamicity}(h) \cdot \mathrm{weight}}{|N(H)|},$$

where nf(H) is the neighbor feature of site H, N(H) is the neighbor set of H, and weight is a weight whose value is determined by the number of links between the neighbors; taking weight = 1 means that no weight information is considered;

Step S3: on the basis of the preliminary spam detection results and the secondary feature extraction results, detect spam again with a machine learning algorithm and generate the detection result; wherein the features used for this second detection are formed by combining the features of the preliminary spam detection with the features obtained by secondary feature extraction, and the second detection specifically comprises: re-representing the training set and the test set in the expanded feature space, so that each sample is represented by both the preliminary features and the secondary features; training a classifier on the training set; and, after training is finished, using the trained classifier to detect spam among the site samples of the test set, thereby completing the optimization of the preliminary detection and generating the final spam detection result.

Preferably, the features extracted in said feature extraction comprise page-content features and hyperlink-related features, and the hyperlink-related features further comprise page-level link features and site-level link features.

Preferably, when the preliminary detection results are generated, the spam detection algorithm is the pattern classification algorithm SVM, AdaBoost, or C4.5.

Preferably, the site-level link features are calculated on the site-level link graph.

Preferably, the extraction of the cluster features in step S2 is based on graph partitioning; the partition-based clustering comprises graph partitioning based on Boolean links and graph partitioning based on weighted links, in each case dividing the graph into subgraphs of different sizes, and the arithmetic mean of the confidences of the nodes in a given subgraph is computed to generate the cluster features used for the second classification.

Preferably, the extraction of the propagation features in step S2 is based on confidence propagation over the directed graph, over the reversed directed graph, and over the undirected graph, so that three propagation features are generated for each site.

Preferably, the extraction of the neighbor features in step S2 comprises feature extraction based on first-level neighbor relations and feature extraction based on second-level neighbor relations; for each of these relations, different combinations of forward and reverse link directions generate multiple neighbor features.

Preferably, when the classifier is trained on the training set, the classifier is C4.5, Bagging, or AdaBoost.

Preferably, the secondary feature extraction in step S2 can be further extended to multi-level feature extraction.

(3) Beneficial effects

As can be seen from the above technical solution, the secondary-feature-based optimization method for search engine spam detection provided by the present invention overcomes the difficulty of feature extraction faced by machine-learning-based search engine spam detection. Compared with heuristic methods, it not only improves detection performance more effectively, optimizing the performance of search engine spam detection as far as possible, but also greatly improves the robustness of the detection system.

Description of drawings

Fig. 1 is a flowchart of the secondary-feature-based search engine spam detection method provided by the present invention;

Fig. 2 is a data flow diagram from preprocessing to preliminary detection provided by the present invention;

Fig. 3 is a schematic diagram of the secondary feature extraction provided by the present invention;

Fig. 4 is a schematic diagram of the graph partitioning (clustering) provided by the present invention;

Fig. 5 is a schematic diagram of the neighbor-relation feature extraction provided by the present invention;

Fig. 6 is a flowchart of spam detection based on the expanded feature space provided by the present invention.

Embodiment

To make the purpose, technical solution, and advantages of the present invention clearer, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.

The present invention is described in detail below with reference to the drawings. It should be noted that the described embodiments are intended only to facilitate understanding of the invention and do not limit it in any way.

Because the algorithm has to process the hyperlink graph of a massive number of web pages, if the method is implemented on a single machine it is preferable that the processor clock frequency be at least 2 GHz and the memory at least 2 GB; any common programming language may be used.

The overall flow of the secondary-feature-based optimization method for search engine spam detection proposed by the present invention is shown in Fig. 1. Step S1 is preliminary spam detection: all web pages and hyperlinks are preprocessed, features are extracted, and preliminary spam detection is performed on the extracted features; this step supplies the detection confidence of the preliminary detection to the secondary feature extraction of step S2. Step S2 performs secondary feature extraction on the basis of the preliminary detection results and extracts cluster features, propagation features, and neighbor features. For convenience, the "feature extraction" in step S1 is called "first-level feature extraction" and the "feature extraction" in step S2 is called "secondary feature extraction"; in the latter, three different classes of features are extracted from the hyperlink graph and the preliminary detection confidence. Step S3 uses a machine learning algorithm to detect spam again on the basis of the preliminary detection results and the secondary feature extraction results, and generates the detection result.

A large body of statistics shows that the site hosting a spam page is usually itself a spam site; the standard data set of this field, Webspam-UK2006, was built on exactly this observation. Unless otherwise stated, "spam sample" and "Spam" in the present invention both refer to spam sites. The main steps are described in detail below.

1. Preprocess all web pages and hyperlinks, perform feature extraction, and perform preliminary spam detection on the extracted features (step S1).

The work done in preliminary detection is shown in Fig. 2. It comprises web page crawling (step S11), web page content extraction (step S12), construction of the hyperlink graph (step S13), feature extraction (step S14), training set generation (step S15), test set generation (step S16), classifier learning and detection on the training set (step S17), and generation and storage of the preliminary detection results.

Mature methods already exist for web page crawling, web page content extraction, and hyperlink graph construction, so they are not described further here.

The features extracted in the feature extraction part comprise page-content features and hyperlink-related features; the hyperlink-related features can be further divided into page-level link features and site-level link features.

For the page-level link features and the page-content features, see [C. Castillo, D. Donato, A. Gionis: Know your Neighbors: Web Spam Detection using the Web Topology. SIGIR 2007]. The site-level link features comprise:

$$F_1(H) = \mathrm{Score}(H)$$

$$F_2(H) = \frac{1}{|\mathrm{Inlink}(H)|} \sum_{h \in \mathrm{Inlink}(H)} \mathrm{Score}(h)$$

$$F_3(H) = \frac{1}{|\mathrm{Outlink}(H)|} \sum_{h \in \mathrm{Outlink}(H)} \mathrm{Score}(h)$$

$$F_4(H) = \frac{1}{|\mathrm{Outlink}(\mathrm{Outlink}(H))|} \sum_{h \in \mathrm{Outlink}(\mathrm{Outlink}(H))} \mathrm{Score}(h)$$

$$F_5(H) = \frac{1}{|\mathrm{Inlink}(\mathrm{Inlink}(H))|} \sum_{h \in \mathrm{Inlink}(\mathrm{Inlink}(H))} \mathrm{Score}(h)$$

$$F_6(H) = \frac{1}{|\mathrm{Inlink}(\mathrm{Outlink}(H))|} \sum_{h \in \mathrm{Inlink}(\mathrm{Outlink}(H))} \mathrm{Score}(h)$$

$$F_7(H) = \frac{1}{|\mathrm{Outlink}(\mathrm{Inlink}(H))|} \sum_{h \in \mathrm{Outlink}(\mathrm{Inlink}(H))} \mathrm{Score}(h)$$

$$F_8(H) = \mathrm{SiteSupporters}_{D_i}(H), \quad D_i \in \{1, 2, 3, 4\}$$

where Score(h) ∈ {HostRank(h), TruncatedPageRank(h), TrustRank(h)}, i.e., the HostRank, TruncatedPageRank, and TrustRank values of the site, respectively; Inlink(H) and Outlink(H) denote the in-link set and the out-link set of site H, respectively. SiteSupporters_{D_i}(H) denotes the supporters of site H at distance D_i, i.e., the number of neighbors at each distance.
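The site-level link features above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation of the invention: the names `avg_score`, `two_hop`, `site_link_features`, and `site_supporters` are hypothetical, the link graph is assumed to be given as dictionaries mapping each site to a set of sites, an empty link set is assumed to yield `None`, and the supporters at distance d are assumed to be the sites whose shortest path to H over in-links has length exactly d.

```python
from collections import deque
from statistics import mean


def avg_score(hosts, score):
    """Mean Score(h) over a set of hosts; None when the set is empty (assumption)."""
    hosts = list(hosts)
    return mean(score[h] for h in hosts) if hosts else None


def two_hop(first, second, host):
    """Union of second[n] over n in first[host], e.g. Outlink(Inlink(H))."""
    reached = set()
    for n in first.get(host, ()):
        reached |= set(second.get(n, ()))
    return reached


def site_link_features(host, inlink, outlink, score):
    """F1..F7 for one site and one choice of Score (HostRank, TrustRank, ...)."""
    return {
        "F1": score[host],
        "F2": avg_score(inlink.get(host, ()), score),
        "F3": avg_score(outlink.get(host, ()), score),
        "F4": avg_score(two_hop(outlink, outlink, host), score),  # Outlink(Outlink(H))
        "F5": avg_score(two_hop(inlink, inlink, host), score),    # Inlink(Inlink(H))
        "F6": avg_score(two_hop(outlink, inlink, host), score),   # Inlink(Outlink(H))
        "F7": avg_score(two_hop(inlink, outlink, host), score),   # Outlink(Inlink(H))
    }


def site_supporters(host, inlink, max_distance=4):
    """F8: number of sites at each in-link distance 1..max_distance from host."""
    dist = {host: 0}
    queue = deque([host])
    counts = {d: 0 for d in range(1, max_distance + 1)}
    while queue:
        cur = queue.popleft()
        if dist[cur] == max_distance:
            continue  # do not expand beyond the largest distance of interest
        for src in inlink.get(cur, ()):
            if src not in dist:
                dist[src] = dist[cur] + 1
                counts[dist[src]] += 1
                queue.append(src)
    return counts
```

Calling `site_link_features` once per score type (HostRank, TruncatedPageRank, TrustRank) would give one group of F1 to F7 values per score, which matches the way Score(h) ranges over the three values in the text.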

When the preliminary detection results are generated in step S1, any mature pattern classification algorithm can be used as the spam detection algorithm, for example SVM, AdaBoost, or C4.5. The detection confidence produced by the preliminary spam detection is the precondition for secondary feature extraction; the preliminary detection results, together with the site-level link graph, serve as the input of secondary feature extraction.

2. On the basis of the preliminary spam detection results, perform secondary feature extraction to extract cluster features, propagation features, and neighbor features (step S2).

Step S2 extracts a series of new features from the preliminary detection results generated in step S1 (including the detection confidence) and the site-level Internet hyperlink graph; these features are used by the machine learning algorithm to improve detection accuracy and detection stability.

Steps S21, S22, and S23 in Fig. 3 extract three classes of features of different nature, namely cluster features, propagation features, and neighbor features. The extraction method of each class is discussed below.

The extraction of the cluster features in step S21 is based on graph partitioning. The partition-based clustering comprises graph partitioning based on Boolean links and graph partitioning based on weighted links; in each case the graph is divided into subgraphs of different sizes, and the arithmetic mean of the confidences of the nodes in a given subgraph is computed to generate the cluster features used for the second classification.

Because most mature graph partitioning algorithms are designed for undirected graphs, and in order to simplify computation, the site-level link graph is treated here as an undirected graph. Fig. 4 is a schematic diagram of a graph partition. The link graph can be written formally as G = (V, E, w), where V is the set of all sites, E is the set of edges of the undirected graph, and w is a weight function defined on V × V. The weight function w(u, v) is taken to be, respectively,

$$w(u, v) = \begin{cases} \log(N + 1), & \text{if } N > 0 \\ 0, & \text{if } N = 0 \end{cases}$$

$$w(u, v) = \begin{cases} 1, & \text{if } N > 0 \\ 0, & \text{if } N = 0 \end{cases}$$

or w(u, v) = N, where N is the number of hyperlinks between sites u and v. The graph G is clustered with the METIS graph clustering algorithm: for each of the three weight functions above, the sites in the link graph are clustered into K classes, and the cluster feature of site H is calculated by formula (1).

$$\mathrm{cf}(H) = \frac{\sum_{h \in C(H)} \mathrm{spamicity}(h)}{|C(H)|} \tag{1}$$

where cf(H) is the cluster feature of site H, C(H) is the cluster containing H, and spamicity(h) is the spam degree of site h given by the first detection stage, with 0 <= spamicity(h) <= 1; spamicity(h) equal to 0 indicates that h is a spam site, and spamicity(h) equal to 1 indicates that h is a non-spam site. By adjusting the value of K, multiple cluster features can be obtained from formula (1).
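The three edge-weight variants and formula (1) can be illustrated with the following Python sketch. It is a minimal sketch under stated assumptions: the function names are hypothetical, the cluster assignment `cluster_of` is assumed to come from an external partitioner such as METIS (which is not called here), and the hyperlink count N between two sites is assumed to be the sum of the links in both directions.

```python
import math
from collections import defaultdict


def undirected_counts(directed_counts):
    """Merge (u, v) and (v, u) hyperlink counts into one count N per site pair.

    Summing both directions is an assumption; self-links are ignored.
    """
    merged = defaultdict(int)
    for (u, v), n in directed_counts.items():
        if u != v:
            merged[tuple(sorted((u, v)))] += n
    return dict(merged)


def log_weight(n):
    """w(u, v) = log(N + 1) if N > 0, else 0."""
    return math.log(n + 1) if n > 0 else 0.0


def boolean_weight(n):
    """w(u, v) = 1 if N > 0, else 0 (Boolean link)."""
    return 1 if n > 0 else 0


def count_weight(n):
    """w(u, v) = N."""
    return n


def cluster_feature(cluster_of, spamicity):
    """Formula (1): cf(H) is the mean spamicity over the cluster containing H.

    cluster_of maps each site to a cluster id produced by a graph partitioner.
    """
    members = defaultdict(list)
    for host, cluster_id in cluster_of.items():
        members[cluster_id].append(host)
    cf = {}
    for hosts in members.values():
        average = sum(spamicity[h] for h in hosts) / len(hosts)
        for h in hosts:
            cf[h] = average
    return cf
```

Running the partitioner once per weight function and per value of K, and calling `cluster_feature` on each resulting assignment, would yield one cluster feature per (weight function, K) combination.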

The extraction of the propagation features in step S22 is based on confidence propagation over the directed graph, over the reversed directed graph, and over the undirected graph, so that three features are generated for each site. The calculation is given by formula (2):

$$\mathrm{pf}(H)^{(t)} = (1 - \alpha)\,\mathrm{spamicity}(H) + \alpha \sum_{h:\, h \to H} \frac{\mathrm{pf}(h)^{(t-1)}}{\mathrm{outdegree}(h)} \tag{2}$$

where pf(H)^{(t)} is the propagation feature of site H and t is the iteration number; in practice the number of iterations can be set manually, with pf(h)^{(0)} = spamicity(h). outdegree(h) denotes the number of out-links of h (the size of its out-link set); indegree(h) can be used correspondingly, or in-links and out-links can be considered together, so that at least 3 propagation features are obtained, namely confidence propagation over the directed graph, over the reversed directed graph, and over the undirected graph. α is a damping factor with a value between 0 and 1.
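Formula (2) can be sketched as a simple fixed-iteration loop. The following Python code is an illustrative sketch only: the function names are hypothetical, `links[h]` is assumed to be the set of sites that h points to, and the source gives no specific values for α or for the number of iterations, so both are left as parameters.

```python
def propagate(spamicity, links, alpha, iterations):
    """Iterate formula (2); links[h] is the set of sites that h links to."""
    pf = dict(spamicity)  # pf^(0)(h) = spamicity(h)
    for _ in range(iterations):
        received = {h: 0.0 for h in spamicity}
        for h, targets in links.items():
            if targets:
                share = pf[h] / len(targets)  # pf^(t-1)(h) / outdegree(h)
                for target in targets:
                    received[target] += share
        pf = {
            h: (1 - alpha) * spamicity[h] + alpha * received[h]
            for h in spamicity
        }
    return pf


def reversed_links(links):
    """Reverse every edge, for propagation over the reversed directed graph."""
    rev = {h: set() for h in links}
    for h, targets in links.items():
        for target in targets:
            rev.setdefault(target, set()).add(h)
    return rev


def undirected_links(links):
    """Union of forward and reversed edges, for the undirected variant."""
    rev = reversed_links(links)
    return {h: set(links.get(h, ())) | rev[h] for h in rev}
```

Calling `propagate` on `links`, `reversed_links(links)`, and `undirected_links(links)` would give the three propagation features per site described above.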

The extraction of the neighbor features in step S23 comprises feature extraction based on first-level neighbor relations and feature extraction based on second-level neighbor relations; for each of these relations, different combinations of forward and reverse link directions generate multiple features.

Experiments show that neighbor feature extraction needs to consider only two levels of neighbor relations to achieve fairly good results, namely the nearest neighbors (the in-link and out-link nodes, called first-level neighbors) and the nearest neighbors of the nearest neighbors (the in-link and out-link nodes of the in-link and out-link nodes, called second-level neighbors). As shown in Fig. 5, the white node in the innermost layer is the site H whose neighbor features are to be extracted, the gray nodes on the inner circle marked D1 are the first-level neighbors of H, and the dark nodes on the outermost circle are the second-level neighbors of H. The value of a neighbor feature is calculated by formula (3):

$$\mathrm{nf}(H) = \frac{\sum_{h \in N(H)} \mathrm{spamicity}(h) \cdot \mathrm{weight}}{|N(H)|} \tag{3}$$

where nf(H) is the neighbor feature of site H, N(H) is the neighbor set of H, and weight is a weight whose value is determined by the number of links between the neighbors; taking weight = 1 means that no weight information is considered. If a node has no neighbors, its nf(H) is set to 0.5, i.e., an undetermined value. The neighbor set can be chosen as the nearest neighbors, the second-level neighbors, or multi-level neighbors; Fig. 5 shows four different kinds of second-level neighbors, with arrows indicating the direction of the hyperlinks. Experiments show that the neighbor features are effective detection features.
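Formula (3) and the first-level and second-level neighbor sets can be sketched as follows. This is a hedged illustration rather than the patented implementation: the function names are hypothetical, the per-neighbor `weight` is assumed to be supplied as a dictionary, and excluding H from its own second-level neighbor set is an assumption that the source does not state.

```python
def neighbor_feature(neighbors, spamicity, weight=None):
    """Formula (3) over a neighbor set N(H); 0.5 (undetermined) if N(H) is empty.

    weight maps each neighbor to its weight; None means weight = 1 everywhere.
    """
    neighbors = list(neighbors)
    if not neighbors:
        return 0.5
    total = sum(
        spamicity[h] * (1.0 if weight is None else weight[h]) for h in neighbors
    )
    return total / len(neighbors)


def second_level(first_hop, second_hop, host):
    """Sites reached by following first_hop from host and then second_hop."""
    reached = set()
    for n in first_hop.get(host, ()):
        reached |= set(second_hop.get(n, ()))
    reached.discard(host)  # assumption: H is not counted as its own neighbor
    return reached


def neighbor_features(host, inlink, outlink, spamicity):
    """Two first-level and four second-level neighbor features for one site."""
    neighbor_sets = {
        "out": set(outlink.get(host, ())),
        "in": set(inlink.get(host, ())),
        "out_out": second_level(outlink, outlink, host),
        "out_in": second_level(outlink, inlink, host),
        "in_out": second_level(inlink, outlink, host),
        "in_in": second_level(inlink, inlink, host),
    }
    return {
        name: neighbor_feature(hosts, spamicity)
        for name, hosts in neighbor_sets.items()
    }
```

The four second-level keys correspond to the four combinations of link direction, in the spirit of the four second-level neighbor types shown in Fig. 5.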

The neighbor features, cluster features, and propagation features described above are the extracted secondary features; together with the first-level features extracted in step S1, they serve as the input of step S3 for the final spam detection optimization.

3. On the basis of the preliminary spam detection results and the secondary feature extraction results, detect spam again with a machine learning algorithm and generate the detection result (step S3).

The features used for the second detection in step S3 are formed by combining the features used in the preliminary spam detection of step S1 with the secondary features extracted in step S2.

The second detection specifically comprises: re-representing the training set and the test set in the expanded feature space (steps S31 and S32), so that each sample is represented by both the preliminary features and the secondary features; and, in step S33, training a classifier on the training set. Any existing pattern classifier can be chosen, for example C4.5, Bagging, or AdaBoost. After training is finished, the trained classifier is used to detect spam among the site samples of the test set, which completes the optimization of the preliminary detection and generates the final spam detection result.
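The overall two-stage flow of steps S1 to S3 can be outlined as below. This is a schematic sketch only: `two_stage_detect`, `fit`, and `secondary_fn` are hypothetical names, `fit(X, y)` stands for any training routine that returns a scoring function (for example a wrapper around a C4.5, Bagging, or AdaBoost implementation), and `secondary_fn` stands for whatever code computes the cluster, propagation, and neighbor features from the first-stage scores.

```python
def two_stage_detect(first_level, labels, train_ids, test_ids, fit, secondary_fn):
    """Sketch of steps S1-S3 with a pluggable classifier.

    first_level:  site -> list of first-level feature values
    labels:       site -> class label, for the sites in train_ids
    fit(X, y):    hypothetical trainer returning a function vector -> score
    secondary_fn: hypothetical function, spamicity dict -> site -> list of
                  secondary feature values
    """
    y_train = [labels[h] for h in train_ids]

    # Step S1: preliminary detection on the first-level features.
    model1 = fit([first_level[h] for h in train_ids], y_train)
    spamicity = {h: model1(first_level[h]) for h in first_level}

    # Step S2: secondary features derived from the preliminary confidences.
    secondary = secondary_fn(spamicity)

    # Step S3: re-represent every sample in the expanded feature space,
    # retrain on the training set, and score the test set.
    expanded = {h: list(first_level[h]) + list(secondary[h]) for h in first_level}
    model2 = fit([expanded[h] for h in train_ids], y_train)
    return {h: model2(expanded[h]) for h in test_ids}
```

Keeping the classifier and the secondary feature extractor as parameters reflects the statement in the text that any existing pattern classifier can be used in either stage.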

The secondary feature extraction described above can be extended in the same way to multi-level feature extraction, with the other steps being similar; however, experiments show that extracting multi-level features does not significantly improve the detection performance of the system compared with extracting secondary features only.

The specific embodiments described above further explain the purpose, technical solution, and beneficial effects of the present invention. It should be understood that the above are only specific embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. An optimization method for search engine spam detection, characterized in that the method comprises:

wherein cf(H) is the cluster feature of site H, C(H) is the cluster containing H, and spamicity(h) is the spam degree of site h given by the first detection stage, with 0 <= spamicity(h) <= 1; the propagation feature is calculated by the formula

$$\mathrm{pf}(H)^{(t)} = (1 - \alpha)\,\mathrm{spamicity}(H) + \alpha \sum_{h:\, h \to H} \frac{\mathrm{pf}(h)^{(t-1)}}{\mathrm{outdegree}(h)},$$

$$\mathrm{nf}(H) = \frac{\sum_{h \in N(H)} \mathrm{spamicity}(h) \cdot \mathrm{weight}}{|N(H)|},$$

2. The optimization method for search engine spam detection according to claim 1, characterized in that the features extracted in said feature extraction comprise page-content features and hyperlink-related features, the hyperlink-related features further comprising page-level link features and site-level link features.

3. The optimization method for search engine spam detection according to claim 1, characterized in that, when the preliminary detection results are generated, the spam detection algorithm is the pattern classification algorithm SVM, AdaBoost, or C4.5.

4. The optimization method for search engine spam detection according to claim 1 or 2, characterized in that the site-level link features are calculated on the site-level link graph.

5. The optimization method for search engine spam detection according to claim 1, characterized in that the extraction of the cluster features in step S2 is based on graph partitioning; the partition-based clustering comprises graph partitioning based on Boolean links and graph partitioning based on weighted links, in each case dividing the graph into subgraphs of different sizes, and the arithmetic mean of the confidences of the nodes in a given subgraph is computed to generate the cluster features used for the second classification.

6. The optimization method for search engine spam detection according to claim 1, characterized in that the extraction of the propagation features in step S2 is based on confidence propagation over the directed graph, over the reversed directed graph, and over the undirected graph, so that three propagation features are generated for each site.

7. The optimization method for search engine spam detection according to claim 1, characterized in that the extraction of the neighbor features in step S2 comprises feature extraction based on first-level neighbor relations and feature extraction based on second-level neighbor relations; for each of these relations, different combinations of forward and reverse link directions generate multiple neighbor features.

8. The optimization method for search engine spam detection according to claim 1, characterized in that, when the classifier is trained on the training set, the classifier is C4.5, Bagging, or AdaBoost.

9. The optimization method for search engine spam detection according to claim 1, characterized in that the secondary feature extraction in step S2 can be further extended to multi-level feature extraction.