CN101493819B - Method for optimizing detection of search engine cheat - Google Patents

Method for optimizing detection of search engine cheat

Info

Publication number
CN101493819B
Authority
CN
China
Prior art keywords
feature
cheating
detection
website
detects
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2008100567261A
Other languages
Chinese (zh)
Other versions
CN101493819A (en
Inventor
耿光刚
李秋丹
王春恒
戴汝为
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN2008100567261A priority Critical patent/CN101493819B/en
Publication of CN101493819A publication Critical patent/CN101493819A/en
Application granted granted Critical
Publication of CN101493819B publication Critical patent/CN101493819B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention discloses an optimization method for search engine cheating detection, comprising the following steps: Step S1, all web pages and hyperlinks are preprocessed, features are extracted, and preliminary cheating detection is performed on the extracted features; Step S2, secondary feature extraction is performed on the basis of the preliminary detection result to extract cluster features, transfer features and neighbor features; and Step S3, cheating is detected again by a machine learning algorithm on the basis of the preliminary detection result and the secondary features, and the detection result is generated. The method addresses the instability of the heuristic detection methods in the prior art and optimizes the performance of search engine cheating detection to the greatest extent.

Description

An optimization method for search engine cheating detection
Technical field
The present invention relates to the field of information retrieval and search engine technology, and in particular to an optimization method for search engine cheating detection.
Background art
The Internet is the largest information repository in human history, and its content is still growing exponentially. Internet search has become part of daily life: a report released by CNNIC in July 2007 states that 61.91% of Internet users use a search engine every day.
N. Eiron et al. ranked 100 million web pages with the well-known PageRank algorithm and found that 11 of the top 20 websites were pornographic sites that had obtained their high ranks by manipulating hyperlinks. According to a survey by the U.S. commerce survey bureau, U.S. e-commerce sales reached 114.1 billion dollars in 2006, an increase of 22.7% over the 93 billion dollars of 2005. In 2007 the first-quarter figure alone reached 31.5 billion dollars, 18.4% higher than in the same period of 2006. The research of Bernard J. Jansen and Amanda Spink shows that about 80% of users browse only the first 3 pages of results when using a search engine.
The huge profits and the portal effect of search engines drive many website administrators and web page authors to make their sites and pages prominent on the Internet by any means, hoping to appear at the top of the results when users issue related queries. Search engine cheating (Web spam), also called search engine spamming, refers to the use of deceptive means to mislead a search engine so that a web page ranks higher in the retrieval results than it actually deserves; it seriously degrades the quality of search results.
Web cheating can be broadly divided into two classes: content cheating and link cheating. Content cheating means that a website uses content information to deceive the search engine and raise the importance of certain pages; it includes keyword cheating, title cheating and the like. Link cheating means that a cheating website constructs link structures that target the PageRank algorithm to mislead the ranking algorithm and thereby raise the importance of certain pages.
Many countermeasures have been proposed against these forms of cheating. For content-based detection, A. Ntoulas et al. examined quantities such as the average word length, the fraction of visible content, the content compression ratio, the amount of anchor text, and the fraction of popular words in the text for cheating pages and normal pages, summarized a series of heuristic features, and treated content cheating detection as a two-class classification problem; a trained decision tree classifier can detect most content cheating pages. For link-based detection, the earliest influential work is the TrustRank algorithm proposed by Gyongyi et al., whose starting point is that "good pages seldom point to cheating pages". A set of highly reputable seeds is selected by hand, and trust is propagated along the hyperlinks of the web graph, yielding a trust score for each page; all pages are then divided into two classes, Spam and Normal.
The war between search engines and web cheaters resembles an arms race: once a search engine has found and deployed an effective method, cheaters find a countermeasure after some time and invent new forms of cheating. Machine-learning-based methods respond to new forms of cheating by adding or deleting the corresponding features, which keeps detection effective without modifying the system architecture. Machine-learning-based cheating detection has therefore become a focus of recent research. Carlos Castillo et al. took a large number of heuristics mentioned in earlier literature as detection features, forming a 236-dimensional feature vector that covers content-related and link-related attributes, and detected cheating with machine learning. Both the detection accuracy and the stability of that approach are far better than those of earlier methods. In the end, however, Castillo returned to the old path of his predecessors: based on the detection confidence, heuristics such as graph clustering, link learning and stacked graphical learning are used to improve the accuracy of the first round of detection.
To optimize detection performance while avoiding, as far as possible, the instability introduced by these heuristics, a cheating detection optimization method based on secondary features is proposed.
Summary of the invention
(1) Technical problem to be solved
In view of this, the main purpose of the present invention is to provide an optimization method for search engine cheating detection that solves the instability of heuristic cheating detection methods in the prior art and optimizes the performance of search engine cheating detection to the greatest extent.
(2) Technical solution
To achieve the above purpose, the invention provides an optimization method for search engine cheating detection, the method comprising:
Step S1: preprocess all web pages and hyperlinks, extract features, and perform preliminary cheating detection on the extracted features; wherein the preliminary cheating detection comprises: web page crawling, web page content extraction, construction of the web hyperlink graph, feature extraction, training set generation, test set generation, learning a classifier and detecting the training set, and generation and storage of the preliminary detection result;
Step S2: on the basis of the preliminary cheating detection result, perform secondary feature extraction to extract cluster features, transfer features and neighbor features; wherein the detection confidence of the preliminary cheating detection is the prerequisite for secondary feature extraction, and the preliminary detection result, together with the site-level hyperlink graph, serves as the input of secondary feature extraction. The cluster feature is calculated by the formula
$$cf(H) = \frac{\sum_{h \in C(H)} \mathrm{spamicity}(h)}{|C(H)|}$$
where cf(H) is the cluster feature of website H, C(H) denotes the cluster to which H belongs, and spamicity(h) is the cheating degree of website h given by the first detection stage, 0 <= spamicity(h) <= 1. The transfer feature is calculated by the formula
$$pf(H)^{(t)} = (1-\alpha)\,\mathrm{spamicity}(H) + \alpha \sum_{h:\, h \to H} \frac{pf(h)^{(t-1)}}{\mathrm{outdegree}(h)}$$
where pf(H)^(t) is the transfer feature of website H, t is the iteration count, outdegree(h) is the number of out-links of h, and α is a damping factor with a value between 0 and 1. The neighbor feature is calculated by the formula
$$nf(H) = \frac{\sum_{h \in N(H)} \mathrm{spamicity}(h) \cdot \mathrm{weight}}{|N(H)|}$$
where nf(H) denotes the neighbor feature of website H, N(H) denotes the neighbor set of H, and weight denotes a weight whose value is determined by the number of links between the neighbors; when weight is 1, no weight information is considered;
Step S3: on the basis of the preliminary cheating detection result and the secondary feature extraction result, detect cheating again with a machine learning algorithm and generate the detection result; wherein the features used for the repeated detection are formed by combining the features of the preliminary cheating detection with the features obtained by secondary feature extraction, and the repeated detection specifically comprises: representing the training set and the test set again in the expanded feature space, each sample being represented by both the preliminary features and the secondary features; training a classifier on the training set; and, after training, using the trained classifier to detect cheating among the website samples of the test set, which completes the optimization of the preliminary detection and generates the final cheating detection result.
Preferably, the features extracted in said feature extraction comprise page content features and hyperlink-related features, and the hyperlink-related features further comprise page-level link-related features and site-level link-related features.
Preferably, when the preliminary detection result is generated, the cheating detection algorithm is a pattern classification algorithm: SVM, AdaBoost or C4.5.
Preferably, the site-level link-related features are computed on the site-level link graph.
Preferably, the extraction of the cluster features in step S2 is feature extraction based on graph partitioning. The graph-partitioning-based clustering comprises partitioning based on Boolean links and partitioning based on weighted links; the graph is partitioned into subgraphs of different sizes, and the arithmetic mean of the confidences of the nodes in a given subgraph is computed to generate the cluster features used for reclassification.
Preferably, the extraction of the transfer features in step S2 is based on confidence propagation over the directed graph, over the reversed directed graph and over the undirected graph, so that three transfer features are generated for each website.
Preferably, the extraction of the neighbor features in step S2 comprises feature extraction based on first-level neighbor relations and feature extraction based on second-level neighbor relations; for these relations, multiple neighbor features are generated from the different combinations of forward-link and backward-link nodes.
Preferably, when the classifier is trained on the training set, the classifier is C4.5, Bagging or AdaBoost.
Preferably, the secondary feature extraction in step S2 can be further extended to multi-level feature extraction.
(3) Beneficial effects
As can be seen from the above technical solution, the optimization method for search engine cheating detection based on secondary features provided by the invention overcomes the difficulty of feature extraction faced by machine-learning-based search engine cheating detection. Compared with heuristic methods, it not only improves detection performance, optimizing the performance of search engine cheating detection to the greatest extent, but also greatly improves the robustness of the detection system.
Description of drawings
Fig. 1 is a flowchart of the search engine cheating detection method based on secondary features provided by the invention;
Fig. 2 is a data flow diagram from preprocessing to preliminary detection provided by the invention;
Fig. 3 is a schematic diagram of the secondary feature extraction provided by the invention;
Fig. 4 is a schematic diagram of the graph partitioning (clustering) provided by the invention;
Fig. 5 is a schematic diagram of the neighbor relation feature extraction provided by the invention;
Fig. 6 is a flowchart of the cheating detection method based on the expanded feature space provided by the invention.
Embodiment
To make the purpose, technical solution and advantages of the invention clearer, the invention is described in more detail below with reference to specific embodiments and the accompanying drawings.
The invention is described in detail below with reference to the drawings. It should be noted that the described embodiments are intended only to aid understanding of the invention and do not limit it in any way.
To implement the method of the invention, note that the algorithm must process the hyperlink graph of a massive number of web pages. If it is implemented on a single machine, the processor clock frequency is preferably at least 2 GHz and the memory at least 2 GB; any common programming language may be used.
The overall flow of the search engine cheating detection optimization method based on secondary features proposed by the invention is shown in Fig. 1. Step S1 is the preliminary cheating detection: all web pages and hyperlinks are preprocessed, features are extracted, and preliminary cheating detection is performed on the extracted features; this step supplies the detection confidence of the preliminary detection to the secondary feature extraction of step S2. Step S2 performs secondary feature extraction on the basis of the preliminary detection result, extracting cluster features, transfer features and neighbor features. For convenience, the "feature extraction" of step S1 is called "primary feature extraction" and the "feature extraction" of step S2 is called "secondary feature extraction"; in the latter, three different classes of features are extracted from the hyperlink graph and the preliminary detection confidence. Step S3 detects cheating again with a machine learning algorithm on the basis of the preliminary detection result and the secondary feature extraction result, and generates the detection result.
A large body of statistics shows that a website hosting cheating pages is usually itself a cheating website; the standard data set in this field, Webspam-UK2006, was built on precisely this observation. Unless otherwise stated, "cheating sample" and "Spam" in the present invention both refer to cheating websites. Each main step is described in detail below.
1. Preprocess all web pages and hyperlinks, extract features, and perform preliminary cheating detection on the extracted features (step S1).
The work of the preliminary detection, shown in Fig. 2, comprises web page crawling (step S11), web page content extraction (step S12), construction of the web hyperlink graph (step S13), feature extraction (step S14), training set generation (step S15), test set generation (step S16), learning a classifier and detecting the training set (step S17), and generation and storage of the preliminary detection result.
Mature methods already exist for web page crawling, web page content extraction and hyperlink graph construction, so they are not described further here.
The features extracted in the feature extraction part comprise page content features and hyperlink-related features; the hyperlink-related features can be further divided into page-level link-related features and site-level link-related features.
For the page-level link-related features and the page content features, see [C. Castillo, D. Donato, A. Gionis: Know your Neighbors: Web Spam Detection using the Web Topology. SIGIR 2007]. The site-level link-related features comprise:
$$F_1(H) = \mathrm{Score}(H)$$
$$F_2(H) = \frac{1}{|\mathrm{Inlink}(H)|} \sum_{h \in \mathrm{Inlink}(H)} \mathrm{Score}(h)$$
$$F_3(H) = \frac{1}{|\mathrm{Outlink}(H)|} \sum_{h \in \mathrm{Outlink}(H)} \mathrm{Score}(h)$$
$$F_4(H) = \frac{1}{|\mathrm{Outlink}(\mathrm{Outlink}(H))|} \sum_{h \in \mathrm{Outlink}(\mathrm{Outlink}(H))} \mathrm{Score}(h)$$
$$F_5(H) = \frac{1}{|\mathrm{Inlink}(\mathrm{Inlink}(H))|} \sum_{h \in \mathrm{Inlink}(\mathrm{Inlink}(H))} \mathrm{Score}(h)$$
$$F_6(H) = \frac{1}{|\mathrm{Inlink}(\mathrm{Outlink}(H))|} \sum_{h \in \mathrm{Inlink}(\mathrm{Outlink}(H))} \mathrm{Score}(h)$$
$$F_7(H) = \frac{1}{|\mathrm{Outlink}(\mathrm{Inlink}(H))|} \sum_{h \in \mathrm{Outlink}(\mathrm{Inlink}(H))} \mathrm{Score}(h)$$
$$F_8(H) = \mathrm{SiteSupporters}_{D_i}(H), \quad D_i \in \{1, 2, 3, 4\}$$
where Score(h) ∈ {HostRank(h), TruncatedPageRank(h), TrustRank(h)}, i.e. the HostRank, TruncatedPageRank and TrustRank values of the website respectively; Inlink(H) and Outlink(H) denote the in-link set and the out-link set of website H respectively. SiteSupporters_{D_i}(H) denotes the supporters of website H at distance D_i, i.e. the number of neighbors at each distance.
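As an illustration of the averaged link features, the following minimal Python sketch computes F1 to F5 for one website. It assumes the site-level graph is held as two dictionaries mapping a host to its in-link and out-link sets, and that one Score value per host is already available; the function names, the dictionary layout and the convention of returning 0.0 for an empty set are assumptions made for this example, not details given in the text.

```python
from typing import Dict, Set


def mean_score(hosts: Set[str], score: Dict[str, float]) -> float:
    """Average Score(h) over a set of hosts; 0.0 for an empty set (assumed convention)."""
    if not hosts:
        return 0.0
    return sum(score[h] for h in hosts) / len(hosts)


def site_link_features(H: str,
                       inlink: Dict[str, Set[str]],
                       outlink: Dict[str, Set[str]],
                       score: Dict[str, float]) -> Dict[str, float]:
    """Compute F1..F5 for website H from the site-level link graph."""
    ins = inlink.get(H, set())
    outs = outlink.get(H, set())
    # Two-hop sets: Outlink(Outlink(H)) is the union of the out-link sets
    # of H's out-neighbors, and likewise for Inlink(Inlink(H)).
    out_out = set().union(*(outlink.get(h, set()) for h in outs))
    in_in = set().union(*(inlink.get(h, set()) for h in ins))
    return {
        "F1": score[H],
        "F2": mean_score(ins, score),
        "F3": mean_score(outs, score),
        "F4": mean_score(out_out, score),
        "F5": mean_score(in_in, score),
    }


# Toy graph: a -> b, a -> c, b -> c
outlink = {"a": {"b", "c"}, "b": {"c"}, "c": set()}
inlink = {"a": set(), "b": {"a"}, "c": {"a", "b"}}
score = {"a": 0.2, "b": 0.4, "c": 0.6}
features_a = site_link_features("a", inlink, outlink, score)
features_c = site_link_features("c", inlink, outlink, score)
```

F6 and F7 follow the same pattern with the mixed two-hop sets, and the whole computation would be repeated once per choice of Score.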
When the preliminary detection result is generated in step S1, the cheating detection algorithm may be any mature pattern classification algorithm, such as SVM, AdaBoost or C4.5. The detection confidence of the preliminary cheating detection is the prerequisite for secondary feature extraction; the preliminary detection result, together with the site-level link graph, serves as the input of secondary feature extraction.
2. On the basis of the preliminary cheating detection result, perform secondary feature extraction to extract cluster features, transfer features and neighbor features (step S2).
Step S2 extracts a series of new features on the basis of the preliminary detection result generated by step S1 (including the detection confidence) and the site-level Internet hyperlink graph. These features are used by the machine learning algorithm to improve detection accuracy and detection stability.
Steps S21, S22 and S23 in Fig. 3 extract three classes of features of different nature: cluster features, transfer features and neighbor features. The extraction method of each class is discussed below.
The extraction of the cluster features in step S21 is feature extraction based on graph partitioning. The graph-partitioning-based clustering comprises partitioning based on Boolean links and partitioning based on weighted links; the graph is partitioned into subgraphs of different sizes, and the arithmetic mean of the confidences of the nodes in a given subgraph is computed to generate the cluster features used for reclassification.
Since most mature graph partitioning algorithms target undirected graphs, and to simplify computation, the whole site-level link graph is treated here as an undirected graph. Fig. 4 is a schematic diagram of graph partitioning. The link graph can be formally expressed as G = (V, E, w), where V is the set of all websites, E is the set of edges of the undirected graph, and w is a mapping function from V × V to the weights. Three mapping functions w(u, v) are used:
$$w(u,v) = \begin{cases} \log(N+1), & N > 0 \\ 0, & N = 0 \end{cases} \qquad w(u,v) = \begin{cases} 1, & N > 0 \\ 0, & N = 0 \end{cases} \qquad w(u,v) = N$$
where N is the number of hyperlinks between websites u and v. The graph G is clustered with the METIS graph clustering algorithm; for each of the three weight functions above, the websites in the link graph are grouped into K clusters, and the cluster feature of website H is calculated by formula (1):
$$cf(H) = \frac{\sum_{h \in C(H)} \mathrm{spamicity}(h)}{|C(H)|} \qquad (1)$$
where cf(H) is the cluster feature of website H, C(H) denotes the cluster to which H belongs, and spamicity(h) is the cheating degree of website h given by the first detection stage, 0 <= spamicity(h) <= 1; spamicity(h) equal to 0 indicates that h is a cheating website, and spamicity(h) equal to 1 indicates that h is a non-cheating website. By adjusting the value of K, multiple cluster features can be obtained through formula (1).
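The three weight functions and formula (1) can be sketched in Python as follows. The sketch does not call METIS: it assumes a cluster assignment (host to cluster id) has already been produced by some graph partitioner, and the names `edge_weight`, `cluster_features`, `cluster_of` and the scheme labels are hypothetical choices for this example.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable


def edge_weight(n_links: int, scheme: str) -> float:
    """Weight w(u, v) of an undirected edge with n_links hyperlinks between u and v."""
    if scheme == "log":
        return math.log(n_links + 1) if n_links > 0 else 0.0
    if scheme == "boolean":
        return 1.0 if n_links > 0 else 0.0
    if scheme == "count":
        return float(n_links)
    raise ValueError(f"unknown weighting scheme: {scheme}")


def cluster_features(cluster_of: Dict[str, Hashable],
                     spamicity: Dict[str, float]) -> Dict[str, float]:
    """Formula (1): cf(H) is the mean spamicity over the cluster containing H."""
    members = defaultdict(list)
    for host, cluster_id in cluster_of.items():
        members[cluster_id].append(host)
    mean = {c: sum(spamicity[h] for h in hosts) / len(hosts)
            for c, hosts in members.items()}
    return {host: mean[c] for host, c in cluster_of.items()}


# Assumed partition result: hosts a and b share a cluster, c is alone.
cluster_of = {"a": 0, "b": 0, "c": 1}
spamicity = {"a": 0.2, "b": 0.8, "c": 0.5}
cf = cluster_features(cluster_of, spamicity)
```

Running the partitioner once per weighting scheme and per value of K, and calling `cluster_features` on each result, would yield one cluster feature per (scheme, K) pair.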
The extraction of the transfer features in step S22 is based on confidence propagation over the directed graph, over the reversed directed graph and over the undirected graph, so that three features are generated for each website. The calculation is shown in formula (2):
$$pf(H)^{(t)} = (1-\alpha)\,\mathrm{spamicity}(H) + \alpha \sum_{h:\, h \to H} \frac{pf(h)^{(t-1)}}{\mathrm{outdegree}(h)} \qquad (2)$$
where pf(H)^(t) is the transfer feature of website H and t is the iteration count; in practice the number of iterations can be set manually, with pf(h)^(0) = spamicity(h). outdegree(h) is the number of out-links of h; indegree(h) can be used correspondingly, or in-links and out-links can be considered together, so that at least 3 transfer features are obtained: confidence propagation over the directed graph, over the reversed directed graph and over the undirected graph. α is a damping factor with a value between 0 and 1.
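A minimal Python sketch of the iteration in formula (2) is given below. The graph is assumed to be a dictionary from each host to the set of hosts it links to; hosts without out-links simply pass nothing on, which is an assumption of this sketch since the text does not say how they are handled. The helper names `propagate`, `reverse` and `undirected` are hypothetical.

```python
from typing import Dict, Set


def propagate(spamicity: Dict[str, float],
              out_links: Dict[str, Set[str]],
              alpha: float,
              iterations: int) -> Dict[str, float]:
    """Iterate formula (2), starting from pf(h)^(0) = spamicity(h)."""
    pf = dict(spamicity)
    for _ in range(iterations):
        # Every host keeps (1 - alpha) of its own first-stage spamicity.
        nxt = {H: (1 - alpha) * spamicity[H] for H in spamicity}
        for h, targets in out_links.items():
            if not targets:
                continue  # assumed: hosts without out-links contribute nothing
            share = alpha * pf[h] / len(targets)
            for H in targets:
                nxt[H] += share
        pf = nxt
    return pf


def reverse(out_links: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Reversed directed graph: every edge h -> H becomes H -> h."""
    rev: Dict[str, Set[str]] = {}
    for h, targets in out_links.items():
        for H in targets:
            rev.setdefault(H, set()).add(h)
    return rev


def undirected(out_links: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Undirected view: each edge is followed in both directions."""
    both = {h: set(targets) for h, targets in out_links.items()}
    for H, sources in reverse(out_links).items():
        both.setdefault(H, set()).update(sources)
    return both


spamicity = {"a": 1.0, "b": 0.0}
graph = {"a": {"b"}}
pf_forward = propagate(spamicity, graph, alpha=0.5, iterations=1)
pf_reverse = propagate(spamicity, reverse(graph), alpha=0.5, iterations=1)
pf_undirected = propagate(spamicity, undirected(graph), alpha=0.5, iterations=1)
```

Calling `propagate` on the three graph variants gives the three transfer features per website; the values of `alpha` and `iterations` used here are illustrative only.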
The extraction of the neighbor features in step S23 comprises feature extraction based on first-level neighbor relations and feature extraction based on second-level neighbor relations; for these relations, multiple features are generated from the different combinations of forward-link and backward-link nodes.
Experiments show that considering only two levels of neighbor relations is enough to obtain good results, namely the nearest neighbors (in-link and out-link nodes, called first-level neighbors) and the nearest neighbors of the nearest neighbors (in-link and out-link nodes of those nodes, called second-level neighbors). As shown in Fig. 5, the white node at the center is the website H whose neighbor features are to be extracted; the gray nodes on the inner circle marked D1 are the first-level neighbors of H, and the dark nodes on the outermost circle are the second-level neighbors of H. The value of a neighbor feature is calculated by formula (3):
$$nf(H) = \frac{\sum_{h \in N(H)} \mathrm{spamicity}(h) \cdot \mathrm{weight}}{|N(H)|} \qquad (3)$$
where nf(H) denotes the neighbor feature of website H, N(H) denotes the neighbor set of H, and weight denotes a weight whose value is determined by the number of links between the neighbors; when weight is 1, no weight information is considered. If a node has no neighbor nodes, its nf(H) is set to 0.5, i.e. an undetermined value. The neighbor set may consist of nearest neighbors, second-level neighbors or multi-level neighbors; Fig. 5 shows four different kinds of second-level neighbors, with arrows indicating the direction of the hyperlinks. Experiments show that the neighbor features are effective detection features.
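Formula (3) and the construction of second-level neighbor sets can be sketched in Python as below. The text only says that the weight depends on the number of links between neighbors; using the raw link count as the weight, and defaulting to 1 when no count is recorded, are assumptions of this sketch, as are the names `neighbor_feature`, `second_level` and `link_count`.

```python
from typing import Dict, Optional, Set, Tuple


def neighbor_feature(H: str,
                     neighbors: Dict[str, Set[str]],
                     spamicity: Dict[str, float],
                     link_count: Optional[Dict[Tuple[str, str], int]] = None) -> float:
    """Formula (3): weighted mean spamicity over N(H); 0.5 when H has no neighbors."""
    N = neighbors.get(H, set())
    if not N:
        return 0.5  # undetermined value, as specified in the text
    total = 0.0
    for h in N:
        # Assumed weighting: raw number of links between H and h, 1 if unknown.
        w = 1.0 if link_count is None else float(link_count.get((H, h), 1))
        total += spamicity[h] * w
    return total / len(N)


def second_level(H: str,
                 first: Dict[str, Set[str]],
                 second: Dict[str, Set[str]]) -> Set[str]:
    """Hosts reached by following `first` from H and then `second`.

    Passing the in-link or out-link map for each argument gives the four
    direction combinations. H itself is not filtered out here; the text
    does not say either way.
    """
    return {g for h in first.get(H, set()) for g in second.get(h, set())}


out_links = {"a": {"b", "c"}, "b": {"d"}}
spamicity = {"a": 0.1, "b": 0.2, "c": 0.6, "d": 0.9}
nf_unweighted = neighbor_feature("a", out_links, spamicity)
nf_weighted = neighbor_feature("a", out_links, spamicity, {("a", "b"): 3})
nf_isolated = neighbor_feature("d", out_links, spamicity)
out_out_of_a = second_level("a", out_links, out_links)
```

A second-level feature is obtained by wrapping the result of `second_level` in a one-entry dictionary, for example `neighbor_feature("a", {"a": out_out_of_a}, spamicity)`.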
The neighbor features, cluster features and transfer features described above are the extracted secondary features. Together with the primary features extracted in step S1, they serve as the input of step S3 for the final cheating detection optimization.
3. On the basis of the preliminary cheating detection result and the secondary feature extraction result, detect cheating again with a machine learning algorithm and generate the detection result (step S3).
The features used in step S3 for detecting cheating again with a machine learning algorithm are formed by combining the features of the preliminary cheating detection of step S1 with the features obtained by the secondary feature extraction of step S2.
The repeated detection specifically comprises the following. The training set and the test set are represented again in the expanded feature space (steps S31 and S32), each sample being represented by both the preliminary features and the secondary features. Step S33 trains a classifier on the training set; the classifier may be any existing pattern classifier, such as C4.5, Bagging or AdaBoost. After training, the trained classifier is used to detect cheating among the website samples of the test set, which completes the optimization of the preliminary detection and generates the final cheating detection result.
The secondary feature extraction described above can similarly be extended to multi-level feature extraction, with the other steps unchanged; experiments show, however, that extracting multi-level features does not significantly improve the detection performance of the system compared with extracting secondary features.
The specific embodiments described above further explain the purpose, technical solution and beneficial effects of the invention. It should be understood that the above are only specific embodiments of the invention and do not limit it; any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.
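The re-representation and retraining of steps S31 to S33 can be sketched in Python as follows. The sketch concatenates each sample's primary and secondary feature vectors and then trains a one-level decision stump as a deliberately simple stand-in for the classifiers named in the text (C4.5, Bagging, AdaBoost); the stump, the helper names and the toy data are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def expand(primary: Dict[str, List[float]],
           secondary: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Represent each host in the expanded space: primary features + secondary features."""
    return {h: primary[h] + secondary[h] for h in primary}


def train_stump(X: List[List[float]], y: List[int]) -> Tuple[int, float]:
    """Pick (feature index, threshold) minimising training errors of the rule x[j] > t -> 1."""
    best, best_err = (0, 0.0), len(y) + 1
    for j in range(len(X[0])):
        for t in sorted({x[j] for x in X}):
            err = sum((x[j] > t) != bool(label) for x, label in zip(X, y))
            if err < best_err:
                best_err, best = err, (j, t)
    return best


def predict(stump: Tuple[int, float], x: List[float]) -> int:
    """Apply the learned rule to one sample in the expanded feature space."""
    j, t = stump
    return int(x[j] > t)


# Toy data: one primary and one secondary feature per host, labels 0/1.
primary = {"a": [0.10], "b": [0.20], "c": [0.15], "d": [0.12]}
secondary = {"a": [0.9], "b": [0.8], "c": [0.1], "d": [0.2]}
labels = {"a": 1, "b": 1, "c": 0, "d": 0}

expanded = expand(primary, secondary)
train_hosts = ["a", "b", "c", "d"]
stump = train_stump([expanded[h] for h in train_hosts],
                    [labels[h] for h in train_hosts])
prediction_high = predict(stump, [0.3, 0.7])
prediction_low = predict(stump, [0.3, 0.05])
```

In this toy data the primary feature alone cannot separate the two classes, while the added secondary feature can, which is the effect the expanded feature space is meant to provide; a real system would substitute one of the classifiers named above for the stump.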

Claims (9)

1. An optimization method for search engine cheating detection, characterized in that the method comprises:
Step S1: preprocess all web pages and hyperlinks, extract features, and perform preliminary cheating detection on the extracted features; wherein the preliminary cheating detection comprises: web page crawling, web page content extraction, construction of the web hyperlink graph, feature extraction, training set generation, test set generation, learning a classifier and detecting the training set, and generation and storage of the preliminary detection result;
Step S2: on the basis of the preliminary cheating detection result, perform secondary feature extraction to extract cluster features, transfer features and neighbor features; wherein the detection confidence of the preliminary cheating detection is the prerequisite for secondary feature extraction, and the preliminary detection result, together with the site-level hyperlink graph, serves as the input of secondary feature extraction. The cluster feature is calculated by the formula
$$cf(H) = \frac{\sum_{h \in C(H)} \mathrm{spamicity}(h)}{|C(H)|}$$
where cf(H) is the cluster feature of website H, C(H) denotes the cluster to which H belongs, and spamicity(h) is the cheating degree of website h given by the first detection stage, 0 <= spamicity(h) <= 1. The transfer feature is calculated by the formula
$$pf(H)^{(t)} = (1-\alpha)\,\mathrm{spamicity}(H) + \alpha \sum_{h:\, h \to H} \frac{pf(h)^{(t-1)}}{\mathrm{outdegree}(h)}$$
where pf(H)^(t) is the transfer feature of website H, t is the iteration count, outdegree(h) is the number of out-links of h, and α is a damping factor with a value between 0 and 1. The neighbor feature is calculated by the formula
$$nf(H) = \frac{\sum_{h \in N(H)} \mathrm{spamicity}(h) \cdot \mathrm{weight}}{|N(H)|}$$
where nf(H) denotes the neighbor feature of website H, N(H) denotes the neighbor set of H, and weight denotes a weight whose value is determined by the number of links between the neighbors; when weight is 1, no weight information is considered;
Step S3: on the basis of the preliminary cheating detection result and the secondary feature extraction result, detect cheating again with a machine learning algorithm and generate the detection result; wherein the features used for the repeated detection are formed by combining the features of the preliminary cheating detection with the features obtained by secondary feature extraction, and the repeated detection specifically comprises: representing the training set and the test set again in the expanded feature space, each sample being represented by both the preliminary features and the secondary features; training a classifier on the training set; and, after training, using the trained classifier to detect cheating among the website samples of the test set, which completes the optimization of the preliminary detection and generates the final cheating detection result.
2. The optimization method for search engine cheating detection according to claim 1, characterized in that the features extracted in said feature extraction comprise page content features and hyperlink-related features, the hyperlink-related features further comprising page-level link-related features and site-level link-related features.
3. The optimization method for search engine cheating detection according to claim 1, characterized in that, when the preliminary detection result is generated, the cheating detection algorithm is one of the pattern classification algorithms SVM, AdaBoost or C4.5.
4. The optimization method for search engine cheating detection according to claim 1 or 2, characterized in that the site-level link-related features are computed on the site-level link graph.
5. The optimization method for search engine cheating detection according to claim 1, characterized in that the extraction of the cluster features in step S2 is based on graph partitioning; the graph-partitioning-based clustering method comprises graph partitioning based on Boolean links and graph partitioning based on weighted links, each dividing the graph into subgraphs of different sizes, and the arithmetic mean of the confidence of the nodes in a given subgraph is computed to generate the cluster features used for re-classification.
6. The optimization method for search engine cheating detection according to claim 1, characterized in that the extraction of the transfer features in step S2 is based on confidence propagation on the directed graph, confidence propagation on the reversed directed graph, and confidence propagation on the undirected graph, so that three transfer features are generated for each website.
7. The optimization method for search engine cheating detection according to claim 1, characterized in that the extraction of the neighbor features in step S2 comprises feature extraction based on first-level neighbor relationships and feature extraction based on second-level neighbor relationships; for the first-level and second-level neighbor relationships, a plurality of neighbor features are generated from the different combinations of nodes along the link direction and against the link direction.
8. The optimization method for search engine cheating detection according to claim 1, characterized in that, when the classifier is trained on the training set, the classifier is selected from C4.5, Bagging or AdaBoost.
9. The optimization method for search engine cheating detection according to claim 1, characterized in that the secondary feature extraction in step S2 can be further extended to multi-level feature extraction.
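The transfer, neighbor and cluster features recited in claims 1 and 5 to 7 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the edge-list and dictionary representations, the initialization pf^(0) = spamicity, the fixed iteration count, the default damping value and the handling of websites without neighbors are all assumptions made for illustration, and outdegree(h) is taken to be the number of out-links of h.

```python
from collections import defaultdict


def transfer_feature(links, spamicity, alpha=0.85, iterations=20):
    """Iterate pf(H)^(t) = (1 - alpha) * spamicity(H)
    + alpha * sum over h -> H of pf(h)^(t-1) / outdegree(h).

    links is a list of (source, target) site pairs; every site that
    appears in links must have an entry in spamicity.
    """
    outdegree = defaultdict(int)
    inlinks = defaultdict(list)
    for src, dst in links:
        outdegree[src] += 1
        inlinks[dst].append(src)

    pf = dict(spamicity)  # assumed starting point: pf^(0) = spamicity
    for _ in range(iterations):
        pf = {
            H: (1 - alpha) * spamicity[H]
            + alpha * sum(pf[h] / outdegree[h] for h in inlinks.get(H, []))
            for H in spamicity
        }
    return pf


def neighbor_feature(neighbors, spamicity, weight=None):
    """nf(H) = sum over h in N(H) of spamicity(h) * weight / |N(H)|.

    neighbors maps each site H to its neighbor list N(H). weight is an
    optional dict keyed by (H, h); a missing entry, or weight=None,
    means weight 1, i.e. no weight information is used.
    """
    nf = {}
    for H, ns in neighbors.items():
        if not ns:
            nf[H] = 0.0  # assumption: a site without neighbors gets 0
            continue
        total = 0.0
        for h in ns:
            w = weight.get((H, h), 1.0) if weight else 1.0
            total += spamicity[h] * w
        nf[H] = total / len(ns)
    return nf


def cluster_feature(clusters, spamicity):
    """Give every site the arithmetic mean spamicity of its cluster."""
    cf = {}
    for members in clusters:
        mean = sum(spamicity[h] for h in members) / len(members)
        for h in members:
            cf[h] = mean
    return cf


if __name__ == "__main__":
    # Hypothetical three-site graph, used only to exercise the functions.
    links = [("a", "b"), ("a", "c"), ("b", "c")]
    spamicity = {"a": 1.0, "b": 0.0, "c": 0.5}
    print(transfer_feature(links, spamicity, alpha=0.5, iterations=1))
    print(neighbor_feature({"c": ["a", "b"]}, spamicity))
    print(cluster_feature([["a", "b"], ["c"]], spamicity))
```

Under these assumptions, the three transfer features of claim 6 would be obtained by calling transfer_feature on the original edge list, on the edge list with every pair reversed, and on the union of the two. The resulting values would then be appended to each website's preliminary feature vector before the classifier of step S3 is trained.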
CN2008100567261A 2008-01-24 2008-01-24 Method for optimizing detection of search engine cheat Expired - Fee Related CN101493819B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2008100567261A CN101493819B (en) 2008-01-24 2008-01-24 Method for optimizing detection of search engine cheat

Publications (2)

Publication Number Publication Date
CN101493819A CN101493819A (en) 2009-07-29
CN101493819B true CN101493819B (en) 2011-09-14

Family

ID=40924423

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2008100567261A Expired - Fee Related CN101493819B (en) 2008-01-24 2008-01-24 Method for optimizing detection of search engine cheat

Country Status (1)

Country Link
CN (1) CN101493819B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102004764A (en) * 2010-11-04 2011-04-06 中国科学院计算机网络信息中心 Internet bad information detection method and system
CN102184208B (en) * 2011-04-29 2013-06-05 武汉慧人信息科技有限公司 Junk web page detection method based on multi-dimensional data abnormal cluster mining
CN102243659B (en) * 2011-07-18 2014-07-16 南京邮电大学 Webpage junk detection method based on dynamic Bayesian model
CN102521331A (en) * 2011-12-06 2012-06-27 中国科学院计算机网络信息中心 Webpage redirection cheating detection method and device
CN102591965B (en) * 2011-12-30 2014-07-09 奇智软件(北京)有限公司 Method and device for detecting black chain
CN102622435B (en) * 2012-02-29 2017-12-12 百度在线网络技术(北京)有限公司 A kind of method and apparatus for detecting black chain
CN103577487A (en) * 2012-08-07 2014-02-12 亿赞普(北京)科技有限公司 Method and device of testing index function of search engine
CN103684896B (en) * 2012-09-07 2017-02-01 中国科学院计算机网络信息中心 Method of detecting website cheating based on domain name resolution characteristics
CN103150369A (en) * 2013-03-07 2013-06-12 人民搜索网络股份公司 Method and device for identifying cheat web-pages
CN104239485B (en) * 2014-09-05 2018-05-01 中国科学院计算机网络信息中心 A kind of dark chain detection method in internet based on statistical machine learning
CN105373598B (en) * 2015-10-27 2017-03-15 广州神马移动信息科技有限公司 Cheating station recognition method and device
CN108304395B (en) * 2016-02-05 2022-09-06 北京迅奥科技有限公司 Webpage cheating detection
CN107909396A (en) * 2017-11-11 2018-04-13 霍尔果斯普力网络科技有限公司 The anti-cheat monitoring method that a kind of Internet advertising is launched
CN113723980A (en) * 2020-05-26 2021-11-30 北京达佳互联信息技术有限公司 Method and device for detecting advertisement landing page, electronic equipment and storage medium
CN113779559B (en) * 2021-09-13 2023-10-03 北京百度网讯科技有限公司 Method, device, electronic equipment and medium for identifying cheating website
CN113553288B (en) * 2021-09-18 2022-01-11 北京大学 Two-layer blocking multicolor parallel optimization method for HPCG benchmark test

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1770158A (en) * 2004-09-30 2006-05-10 微软公司 Content evaluation
CN1996299A (en) * 2006-12-12 2007-07-11 孙斌 Ranking method for web page and web site

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
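Zhang Zeming (张泽明). Personalized spam filtering algorithm based on immune principles. Pattern Recognition and Artificial Intelligence (《模式识别与人工智能》). 2007, Vol. 20 (No. 3), 406-414. *
Li Zhichao (李智超), Yu Huijia (余慧佳), Ma Shaoping (马少平). Identifying spam pages using support vector machines. The 3rd National Conference on Information Retrieval and Content Security (《第三届全国信息检索与内容安全学术会议》). 2007, 248-254. *
Jiang Tao (蒋涛), Zhang Bin (张彬). A method for combating Web spam pages. Microcomputer Applications (《微型电脑应用》). 2007, Vol. 23 (No. 4), 23-26. *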

Also Published As

Publication number Publication date
CN101493819A (en) 2009-07-29

Similar Documents

Publication Publication Date Title
CN101493819B (en) Method for optimizing detection of search engine cheat
US7809723B2 (en) Distributed hierarchical text classification framework
US8768960B2 (en) Enhancing keyword advertising using online encyclopedia semantics
CN101364239B (en) Method for auto constructing classified catalogue and relevant system
US7617176B2 (en) Query-based snippet clustering for search result grouping
CN101350011B (en) Method for detecting search engine cheat based on small sample set
CN102982153B (en) A kind of information retrieval method and device thereof
CN105045875B (en) Personalized search and device
Alguliev et al. Effective summarization method of text documents
CN101944099A (en) Method for automatically classifying text documents by utilizing body
CN102637192A (en) Method for answering with natural language
CN110888991B (en) Sectional type semantic annotation method under weak annotation environment
CN101763395A (en) Method for automatically generating webpage by adopting artificial intelligence technology
Santos et al. Integrating proximity to subjective sentences for blog opinion retrieval
CN103761286B (en) A kind of Service Source search method based on user interest
CN105975547A (en) Approximate web document detection method based on content and position features
JP2013168177A (en) Information provision program, information provision apparatus, and provision method of retrieval service
Zhang et al. Co-ranking multiple entities in a heterogeneous network: Integrating temporal factor and users’ bookmarks
JP5315726B2 (en) Information providing method, information providing apparatus, and information providing program
Pang et al. Query expansion and query fuzzy with large-scale click-through data for microblog retrieval
Rajkumar et al. Users’ click and bookmark based personalization using modified agglomerative clustering for web search engine
Batra et al. Content based hidden web ranking algorithm (CHWRA)
CN112214511A (en) API recommendation method based on WTP-WCD algorithm
CN106649537A (en) Search engine keyword optimization technology based on improved swarm intelligence algorithm
Wang et al. Knowledge graph-based semantic ranking for efficient semantic query

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20110914
