CN101350011B - Method for detecting search engine cheat based on small sample set - Google Patents


Info

Publication number
CN101350011B
Authority
CN
China
Prior art keywords
sample
cheating
training set
search engine
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2007101191966A
Other languages
Chinese (zh)
Other versions
CN101350011A (en)
Inventor
耿光刚
王春恒
戴汝为
李秋丹
朱远平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN2007101191966A priority Critical patent/CN101350011B/en
Publication of CN101350011A publication Critical patent/CN101350011A/en
Application granted granted Critical
Publication of CN101350011B publication Critical patent/CN101350011B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to Internet information retrieval and discloses a method for detecting Internet cheating from a small sample set, in order to counter increasingly severe search engine cheating. Because detection samples are costly to collect, the method iteratively runs classifier-based self-learning and link learning based on the Internet topology to keep expanding the training set, so that search engine cheating can be detected with only a small sample set. An ensemble down-sampling strategy is used during identification, making full use of the information contained in the high-reputation websites that are widespread on the Internet. Finally, labels are propagated along the Internet topology based on the predicted cheating degree to optimize the detection results. Experiments show that the method detects cheating effectively.

Description

A method for detecting search engine cheating based on a small sample set
Technical field
The present invention relates to the field of information retrieval and search engine technology, and is a method for detecting search engine cheating with a small sample set.
Background art
The Internet is the largest information repository in human history, and its content is still growing exponentially. Web search has become part of daily life: according to the report issued by CNNIC in July 2006, search engines are used by 66.3% of Internet users and rank first among frequently used network services.
N. Eiron and other scholars used the well-known PageRank algorithm to rank 100 million web pages and found that 11 of the top 20 sites were pornographic sites that had obtained high rankings by manipulating hyperlinks. According to a survey by the U.S. commerce survey bureau, U.S. e-commerce sales reached 114.1 billion dollars in 2006, an increase of 22.7% over the 93 billion dollars of 2005. In 2007 the first-quarter figure alone reached 31.5 billion dollars, 18.4% higher than in the same period of 2006. Research by Bernard J. Jansen and Amanda Spink shows that about 80% of users view only the first 3 pages of results when using a search engine.
The huge profits and the gateway effect of search engines drive many site administrators and page authors to use every means to make their sites and pages prominent on the Internet, hoping to appear at the top of the results when users issue related queries. Internet cheating (Web Spam), also called search engine cheating, refers to the use of deceptive means to mislead a search engine so that a Web page ranks higher in the retrieval results than it actually deserves; it seriously degrades the quality of search results.
Internet cheating can be divided into three classes: content cheating, link cheating and hiding cheating. Content cheating means that a site uses content information to deceive the search engine and raise the importance of certain pages; it includes keyword cheating, title cheating and so on. Link cheating means that a cheating site constructs link structures that target the PageRank algorithm in order to mislead the search engine's ranking algorithm and thereby raise the importance of certain pages. Hiding cheating means that the cheater uses various concealment techniques so that the use of the content and link cheating techniques above is not discovered by users.
Many countermeasures have been proposed against these forms of cheating. For detection based on content analysis, A. Ntoulas et al. examined the average word length, the fraction of visible content, the content compression ratio, the amount of anchor text and the fraction of popular words in the text of cheating pages and ordinary pages, summarized a series of heuristic features, treated the detection of content-cheating pages as a binary classification problem, and trained a decision tree classifier that detects most content-cheating pages. For link-based detection, the earliest influential work is the TrustRank algorithm proposed by Gyongyi et al., whose starting point is that "good pages seldom point to cheating pages". A high-reputation seed set is selected by hand and trust is propagated along the hyperlinks of the web graph, which yields a trust value for each page; all pages are then divided into Spam and Normal. B. Wu and Davison proposed a method for detecting cloaking: each URL is fetched twice by a crawler and once more by imitating a browser, and the content differences between the copies are computed to decide whether redirection cheating exists. The drawbacks of this method are that the repeated fetching increases the processing load of the search engine and consumes a large amount of bandwidth and, more importantly, that it requires the search engine's crawler to issue ordinary browser requests, which violates the Robots Exclusion Standard.
Cheating detection based on machine learning has become a focus of recent research. The war between search engines and cheaters resembles an arms race: after a search engine finds and deploys an effective method, cheaters find a countermeasure after some time and invent new forms of cheating. Faced with a new form of cheating, a machine-learning method keeps detection effective by adding or deleting the corresponding features, without any change to the system architecture. However, detection based on machine learning faces two difficulties: (1) obtaining the samples needed for learning consumes a great deal of manual labor and is costly; (2) in Internet data, high-reputation sites are easier to obtain than cheating sites, so the ratio of cheating to non-cheating sites is severely imbalanced, and traditional learning algorithms have difficulty performing well on imbalanced samples.
Summary of the invention
Existing machine-learning methods require samples that are costly to obtain, and traditional learning algorithms have difficulty performing well on imbalanced samples. To solve these problems, the objects of the invention are to reduce the human effort needed to obtain samples, thereby lowering cost, and to achieve good results when learning from imbalanced samples. To this end the invention provides a search engine Web cheating detection method based on a small sample set.
To achieve these objects, the technical scheme of the search engine Web cheating detection method based on a small sample set is as follows:
Step S1: preprocess all web page samples and divide the sample set into a training set, a test set and an unlabeled set;
Step S2: using the divided training set and unlabeled set, perform classifier-based self-learning and link learning based on the Internet topology to expand the training set;
Step S3: on the expanded training set, train classifiers with an ensemble down-sampling strategy, and use the trained classifiers to detect the samples in the test set;
Step S4: post-process the detection results by label propagation based on the predicted cheating degree, completing the search engine cheating detection.
According to an embodiment of the invention, the training-set expansion of step S2 comprises classifier-based self-learning and link learning based on the Internet topology. Both learning processes are iterated repeatedly to keep expanding the training set.
According to an embodiment of the invention, the ratio of cheating to non-cheating sites selected during the iterations is the same as the ratio in the original training set.
According to an embodiment of the invention, the classifier-based self-learning of step S2 trains a classifier on the training set samples, applies it to the unlabeled sample set, and uses a semi-supervised self-learning process to select the top $J_1$ samples with the highest prediction confidence and add them, with their predicted labels, to the training set.
According to an embodiment of the invention, the link learning based on the Internet topology of step S2 labels the Internet link graph with the training set samples, performs link learning according to the distribution of cheating and non-cheating sites in the link graph, selects the $J_2$ sites with the highest propagated confidence, and adds them, with their predicted labels, to the training set.
According to an embodiment of the invention, the down-sampling classification strategy of step S3 uses an adjustable down-sampling ratio and an ensemble of sub-classifiers based on the predicted cheating degree, and the algorithm itself is suitable for distributed computing.
According to an embodiment of the invention, the post-processing stage of step S4 is label propagation along the Internet topology based on the predicted cheating degree.
The training-set expansion algorithm of the invention makes effective use of the self-learning of the learning algorithm and of the link topology information of the Internet, and to some extent solves the scarcity of samples that machine-learning methods face in Web cheating detection. The classification strategy that ensembles random sampling makes effective use of the information of high-reputation websites (web pages) that are widespread on the Internet and overcomes the sample imbalance problem. In traditional classification problems the samples are independent of one another, whereas the websites being detected depend on one another; the optimization of detection results through the Internet hyperlink graph exploits exactly this point and further improves detection performance.
Description of drawings
Fig. 1 is the overall block diagram of Web cheating detection based on a small sample set according to the invention;
Fig. 2 is the data flow diagram of the preprocessing step;
Fig. 3 is the data flow diagram of the training-set expansion step of the invention;
Fig. 4 shows the cheating detection and result optimization steps based on the expanded training set;
Fig. 5 is an example diagram of simple site link topologies.
Embodiment
The invention is described in detail below with reference to the accompanying drawings. Note that the described embodiments are intended only to aid understanding of the invention and do not limit it in any way.
Because the algorithm involves repeated resampling and iteration, if the method is implemented on a single machine the processor clock frequency is preferably not lower than 2 GHz and the memory not less than 1 GB; any common programming language may be used.
The overall flow of the proposed Web cheating detection method based on a small sample set is shown in Fig. 1, and the data flow of each step is given in Figs. 2, 3 and 4. The preprocessing part (step S1) prepares the data for the whole detection process; step S2 is the iterative expansion of the training set, that is, classifier-based self-learning and link learning based on the network topology; step S3 trains classifiers on the expanded training set and uses them to detect the samples under test, using the ensemble random down-sampling learning strategy (ERUS) in the process; step S4 optimizes the detection results by label propagation based on the predicted cheating degree.
A large body of statistics shows that a site hosting cheating pages is usually itself a cheating site, and the construction of the Webspam-UK2006 standard dataset is based on exactly this point. Unless otherwise stated, "cheating sample" and "Spam" in this document both denote a cheating website. Each main step is described in detail below.
1. Preprocessing (step S1)
Preprocessing is the first step of the whole system. As the preparatory stage, its work is shown in Fig. 2 and comprises web page crawling (step S11), construction of the hyperlink graph (step S12), web page content extraction (step S13), feature extraction (step S14), manual labeling, in which some sites are manually classified as Spam or Normal (step S15), and division of the sample set into a training set, a test set and an unlabeled set (step S16). Mature methods exist for web page crawling, hyperlink graph construction and content extraction, which are not the emphasis of the invention. The web page classification follows the construction standard of Webspam-UK2006 (http://www.yr-bcn.es/webspam/datasets/uk2006-info/). The extracted features comprise features related to web page content and features based on link relations. The invention focuses on effective cheating detection strategies for a given feature set; feature extraction follows [C. Castillo, D. Donato, A. Gionis: Know your Neighbors: Web Spam Detection using the Web Topology. SIGIR 2007] and is not described further here.
According to the extracted features and the manual classification, all samples are divided into three parts: a training set, a test set and an unlabeled set.
2. Iterative expansion of the training set (step S2)
Step S2 is a semi-supervised learning process whose purpose is to gradually expand and improve the training set; the data it processes are the training sample set and the unlabeled sample set produced by the division in step S1. The expansion consists of two parts: classifier-based self-learning and link learning based on the network topology. The loop formed by steps S21, S22, S23, S24 and S25 in Fig. 3 is the self-learning process, which is repeated for a preset number of iterations. Step S21 learns from the training set using the ensemble down-sampling learning strategy. Step S22 receives the learned model parameters and saves them for classifying samples with unknown labels. Step S23 uses the trained classifier to predict the samples in the unlabeled set. Step S24 receives the predictions for the unlabeled samples and sorts them in descending order of prediction confidence. Step S25, according to the preset threshold I, adds the top $J_1$ samples together with their predicted labels to the training set and deletes them from the unlabeled set; condition 1 in Fig. 3 is the conditional expression that judges whether a sample ranks within the top I. The classifier in this step is learned with the ensemble down-sampling strategy, whose implementation is described in step 3. The ratio of cheating to non-cheating sites selected during the self-learning iterations is required to equal the ratio in the original training set.
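The self-learning loop (steps S21 to S25) can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the names `pick_confident`, `self_learning` and `fit_predict` are hypothetical, and the sketch assumes that the $J_1$ selected samples are split into `j_spam` cheating samples and `ratio_l * j_spam` normal samples so that the class ratio of the original training set is preserved.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Set, Tuple

Prediction = Tuple[str, float]  # (predicted label, prediction confidence)


def pick_confident(preds: Dict[Hashable, Prediction],
                   j_spam: int, ratio_l: float) -> List[Tuple[Hashable, str]]:
    """Pick the most confident predictions per class (steps S24 and S25).

    Assumption: the quota for the normal class is ratio_l * j_spam, which
    keeps the spam/normal ratio of the original training set.
    """
    chosen = []
    for label, quota in (("spam", j_spam), ("normal", round(ratio_l * j_spam))):
        ranked = sorted((sid for sid, (lab, _) in preds.items() if lab == label),
                        key=lambda sid: preds[sid][1], reverse=True)
        chosen.extend((sid, label) for sid in ranked[:quota])
    return chosen


def self_learning(train: Dict[Hashable, str], unlabeled: Iterable[Hashable],
                  fit_predict: Callable[[Dict[Hashable, str], Set[Hashable]],
                                        Dict[Hashable, Prediction]],
                  iterations: int, j_spam: int, ratio_l: float):
    """Expand the training set for a fixed number of iterations.

    fit_predict is a hypothetical callable that trains on `train` (S21, S22)
    and returns a prediction for every unlabeled sample (S23).
    """
    train, pool = dict(train), set(unlabeled)
    for _ in range(iterations):
        if not pool:
            break
        preds = fit_predict(train, pool)
        for sid, label in pick_confident(preds, j_spam, ratio_l):
            train[sid] = label   # add the sample with its predicted label
            pool.discard(sid)    # and remove it from the unlabeled set
    return train, pool
```

Passing the classifier in as `fit_predict` keeps the loop independent of the base learner, which matches the text's statement that the classifier itself is trained with the ensemble down-sampling strategy of step 3.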
Fig. 5 shows link topologies at the site level, in which a directed edge represents hyperlinks between sites, black circles represent cheating sites, white circles represent non-cheating sites, and gray circles represent unlabeled sites. Each number in the figure marks the nearest gray circle (site) for ease of description, and the letters below the figure (A-E) denote 6 different simple network topologies. To remove the influence of noisy hyperlinks, a directed edge exists between two sites, as shown in Fig. 5, if and only if the number of hyperlinks from one site to the other is not less than a threshold $T_{num}$; in our experiments $T_{num} = 5$.
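The edge rule above can be illustrated with a short Python sketch. The function name `build_site_graph` and the input format (one `(source_site, target_site)` pair per page-level hyperlink) are assumptions made for illustration; ignoring links from a site to itself is also an assumption, since the text does not discuss them.

```python
from collections import Counter
from typing import Hashable, Iterable, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def build_site_graph(page_links: Iterable[Edge], t_num: int = 5) -> Set[Edge]:
    """Keep a directed site-level edge only if at least t_num hyperlinks
    point from the source site to the target site."""
    # Count page-level hyperlinks per ordered site pair; self-links are
    # skipped (an assumption, not stated in the text).
    counts = Counter((src, dst) for src, dst in page_links if src != dst)
    return {edge for edge, n in counts.items() if n >= t_num}
```

With the default `t_num=5`, a site pair joined by only four hyperlinks contributes no edge, which is how the threshold filters out incidental links.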
Steps S26, S27 and S28 form the second stage of training-set expansion, namely link learning based on the Internet topology. First, in step S26 the Web link graph is labeled using the training set. The goal of link learning in step S27 is to predict the label of a node from the labels of its out-link and in-link nodes (its neighbor nodes) in the link graph. C. Castillo et al. [C. Castillo, D. Donato, A. Gionis: Know your Neighbors: Web Spam Detection using the Web Topology. SIGIR 2007] point out that "the in-links of non-cheating sites seldom come from cheating sites; they usually link to other non-cheating sites" and that "the in-links of cheating sites usually come from cheating sites"; our statistics on a real Internet dataset confirm this. The purpose of link learning is to expand the training set rather than merely to produce predictions, and the quality of the selected samples directly affects classification performance after the expansion, so the labeling of samples must satisfy stricter constraints. For convenience, let $inlink_{normal}(i)$ denote the number of non-cheating sites among the in-links of node $i$ and $inlink_{spam}(i)$ the number of cheating sites among its in-links. Similarly, $outlink_{normal}(i)$ and $outlink_{spam}(i)$ denote the numbers of normal sites and cheating sites among the out-links of node $i$. In step S27, link learning computes the confidence of a predicted label as follows:
$$Conf_{normal}(i) = K_1 \times \frac{outlink_{normal}(i) + inlink_{normal}(i) + \varepsilon}{outlink_{spam}(i) + inlink_{spam}(i) + \varepsilon} \tag{1}$$
$$Conf_{spam}(i) = K_2 \times \frac{outlink_{spam}(i) + inlink_{spam}(i) + \varepsilon}{outlink_{normal}(i) + inlink_{normal}(i) + \varepsilon} \tag{2}$$
where $0 < \varepsilon < 1$ is a smoothing factor, and $K_1$ and $K_2$ take their values according to the following rules:
IF (outlink_normal(i) > 0 & inlink_normal(i) > 0) THEN
    IF (inlink_spam(i) = 0 & outlink_spam(i) = 0) THEN
        K_1 = M_1
    ELSE
        K_1 = M_2
    END IF
ELSE
    K_1 = M_3
END IF
IF (outlink_spam(i) > 0 & inlink_spam(i) > 0) THEN
    IF (inlink_normal(i) = 0 & outlink_normal(i) = 0) THEN
        K_2 = N_1
    ELSE
        K_2 = N_2
    END IF
ELSE
    K_2 = N_3
END IF
where $M_1 > M_2 > M_3 \ge 1$ and $N_1 > N_2 > N_3 \ge 1$. According to formulas (1) and (2), the $J_{normal}$ nodes with the largest $Conf_{normal}(i)$ and the $J_{spam}$ nodes with the largest $Conf_{spam}(i)$ are selected, and their features and predicted labels are put into the training set in preparation for the next iteration; the selection is judged by condition 2 (step S28) in Fig. 3. In practice we take $M_1 = N_1 = 3 > M_2 = N_2 = 2 > M_3 = N_3 = 1$, $\varepsilon = 0.5$, $J_{spam} = 1$, $J_{normal} = L \cdot J_{spam}$ and $J_2 = J_{normal} + J_{spam}$, where $L$ is the ratio of Normal samples to Spam samples in the original training set and $J_2$ is the number of sites with the highest propagated confidence selected in one iteration. With these values, in the link graph of Fig. 5, $Conf_{spam}(1) > Conf_{spam}(2) > Conf_{spam}(3)$ and $Conf_{normal}(4) > Conf_{normal}(5) > Conf_{normal}(6)$, where the numbers refer to the numbered sites in Fig. 5. During the link-learning iterations, the values of $J_{normal}$ and $J_{spam}$ ensure that the ratio of selected cheating sites to non-cheating sites equals the ratio in the original training set.
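Formulas (1) and (2), together with the rules for $K_1$ and $K_2$, can be written as a short Python sketch. The function names `k_factor` and `link_confidence` are hypothetical; the default weights (3, 2, 1) and the smoothing factor 0.5 are the values quoted in the text.

```python
from typing import Tuple

Weights = Tuple[float, float, float]


def k_factor(out_same: int, in_same: int, out_other: int, in_other: int,
             weights: Weights = (3, 2, 1)) -> float:
    """Return K_1 (or K_2) following the IF/ELSE rules of the text.

    'same' counts neighbors of the class being scored, 'other' counts
    neighbors of the opposite class.
    """
    strong, medium, weak = weights
    if out_same > 0 and in_same > 0:
        if out_other == 0 and in_other == 0:
            return strong   # linked both ways, only to the same class
        return medium       # linked both ways, some opposite-class neighbors
    return weak


def link_confidence(out_normal: int, in_normal: int, out_spam: int, in_spam: int,
                    eps: float = 0.5, m: Weights = (3, 2, 1),
                    n: Weights = (3, 2, 1)) -> Tuple[float, float]:
    """Return (Conf_normal, Conf_spam) for one node, formulas (1) and (2)."""
    k1 = k_factor(out_normal, in_normal, out_spam, in_spam, m)
    k2 = k_factor(out_spam, in_spam, out_normal, in_normal, n)
    normal_links = out_normal + in_normal + eps
    spam_links = out_spam + in_spam + eps
    return k1 * normal_links / spam_links, k2 * spam_links / normal_links
```

For a node with two normal out-links, one normal in-link and no cheating neighbors, $K_1 = 3$ and $Conf_{normal} = 3 \times 3.5 / 0.5 = 21$, while $Conf_{spam} = 0.5 / 3.5 \approx 0.14$.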
3. Cheating detection (step S3)
As shown in Fig. 4, the training set used in step S3 is the training set expanded by step S2. Step S31 trains the classifier on the expanded training set; step S32 receives the learned model parameters and saves them for classifying unlabeled samples; step S33 uses the trained classifier to test the test samples; step S34 receives the preliminary detection results and prepares the data for step S4. The classifier is trained with the ensemble random down-sampling learning strategy (ERUS), under which the classification algorithm learns on the training set. The ERUS algorithm is implemented as follows:
-------------------------------------------------
Input:  M: samples of the small class (cheating class)
        S: samples of the large class (non-cheating class)
        n: number of samplings
        K: sampling ratio
        x: test sample
Output: detection result for sample x
-------------------------------------------------
1.  i = 0
2.  while i < n do
3.      randomly sample a subset S_i from S (|S_i| < |S|, |S_i| = K*|M|)
4.      train classifier C_i on S_i and M
5.      save the learned model of C_i
6.      i = i + 1
7.  end while
8.  spamicity = 0
9.  for i = 0 to n-1 do
10.     test sample x with the saved model of C_i
11.     compute PS(x, C_i) with formula (3)
12.     spamicity = spamicity + PS(x, C_i)
13. end for
14. spamicity = spamicity / n
15. if (spamicity >= 0.5) then
16.     sample x belongs to a cheating site
17. else
18.     sample x belongs to a non-cheating site
19. end if
-------------------------------------------------
Common learning algorithms such as Bayes, C4.5, bagging and AdaBoost can be used with this learning strategy. The ensemble in down-sampling classification is based on the predicted cheating degree (PS) of each sub-classifier. We first give the notion of PS, which is similar to the definition of prediction confidence in a learning algorithm; PS is defined as:
$$PS(x, C) = \frac{P_{spam}(x, C)}{P_{spam}(x, C) + P_{normal}(x, C)} \tag{3}$$
where $x$ is a test sample, $C$ is a specific classifier, and $P_{spam}(x, C)$ and $P_{normal}(x, C)$ denote the probability or distance value with which classifier $C$ predicts that sample $x$ belongs to the spam set and the normal set, respectively.
Down-sampling is an effective strategy for class-imbalanced learning. Given a small-class sample set $M$ and a large-class sample set $S$, where $|M| < |S|$ ($|\cdot|$ denotes the number of samples in a set), down-sampling randomly samples a subset $S_i$ of $S$. The strategy trains the classifier with only part of the large-class samples, so that the numbers of samples of the different classes become relatively balanced in the resampled training set; experiments show that this is very effective. However, using only part of the large-class samples means that much of the useful information in the discarded majority-class samples is not fully used. We therefore adopt an ensemble down-sampling strategy to remedy this defect and to exploit the useful information in the large-class samples as fully as possible. The down-sampling ratio $K$ is adjustable and is tuned for different datasets. In the ERUS algorithm, the results of the sub-classifiers are fused on the basis of their PS values rather than their predicted labels.
In the ERUS algorithm, lines 1 to 7 are the learning phase and lines 8 to 19 are the test phase. The learning and testing of each sub-classifier are mutually independent processes, so the algorithm is suitable for distributed computing.
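The ERUS procedure and formula (3) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation used in the patent: `centroid_fit` is a toy stand-in for a real base learner such as C4.5, it scores a sample by inverse distance to each class centroid, and the names `erus_train`, `spamicity` and `erus_predict` are hypothetical. The subset size is capped at $|S|$ so that sampling never fails on small inputs.

```python
import random
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Scorer = Callable[[Vector], Tuple[float, float]]  # x -> (P_spam, P_normal)


def centroid_fit(spam: List[Vector], normal: List[Vector]) -> Scorer:
    """Toy base learner (assumption): inverse distance to class centroids."""
    def centroid(rows: List[Vector]) -> List[float]:
        return [sum(col) / len(rows) for col in zip(*rows)]

    c_spam, c_normal = centroid(spam), centroid(normal)

    def score(x: Vector) -> Tuple[float, float]:
        d_spam = sum((a - b) ** 2 for a, b in zip(x, c_spam)) ** 0.5
        d_normal = sum((a - b) ** 2 for a, b in zip(x, c_normal)) ** 0.5
        return 1.0 / (1.0 + d_spam), 1.0 / (1.0 + d_normal)

    return score


def erus_train(spam: List[Vector], normal: List[Vector], n: int, k: float,
               fit: Callable[[List[Vector], List[Vector]], Scorer] = centroid_fit,
               seed: int = 0) -> List[Scorer]:
    """Learning phase (lines 1 to 7): n classifiers, each trained on all
    small-class samples plus a random large-class subset of size K*|M|."""
    rng = random.Random(seed)
    size = min(len(normal), max(1, round(k * len(spam))))
    return [fit(spam, rng.sample(normal, size)) for _ in range(n)]


def spamicity(models: List[Scorer], x: Vector) -> float:
    """Average PS(x, C_i) over all sub-classifiers (lines 8 to 14)."""
    total = 0.0
    for score in models:
        p_spam, p_normal = score(x)
        total += p_spam / (p_spam + p_normal)  # formula (3)
    return total / len(models)


def erus_predict(models: List[Scorer], x: Vector) -> str:
    """Final decision (lines 15 to 19)."""
    return "spam" if spamicity(models, x) >= 0.5 else "normal"
```

Because each call to `fit` depends only on its own sampled subset, the list comprehension in `erus_train` could be replaced by parallel workers, which is the property the text refers to when it says the algorithm suits distributed computing.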
4. Optimization of detection results (step S4)
Given the preliminary predictions from step S3 and a threshold $T$ ($0 < T < 0.5$), label propagation based on the Web topology is applied to every sample whose predicted cheating degree lies in the interval $[T, 1-T]$, where the labels in the Web graph are assigned from the training set samples (step S41). In step S42 the propagated cheating degree is computed with formula (2). Unlike the training-set expansion stage, which selects the top $J_{spam}$ samples with the highest propagated cheating degree, here the propagated cheating degree is used to make the final decision on whether a site is cheating, so $K_2$ in formula (2) need not follow the rules of the link learning in step S27; here $K_2 = 1$. If $Conf_{spam}(i) > 1$ the node is judged to be a cheating site, otherwise a non-cheating site. The rationale is to trust the high-confidence predictions produced by the learning algorithm while treating low-confidence predictions with suspicion; the final labels of those samples are predicted by propagation over the network topology. For network islands (sites without neighbors) whose prediction confidence lies in $[T, 1-T]$, the label is still decided by the classifier trained in the previous stage. Step S43 receives the results optimized by label propagation, completing the final cheating detection.
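The post-processing rule can be sketched in Python as follows. The function name `refine_labels`, the input layout and the default `t=0.3` are assumptions for illustration (the text only requires $0 < T < 0.5$); `link_counts` holds, for each site, the numbers of training-labeled neighbors in the order (out_normal, in_normal, out_spam, in_spam).

```python
from typing import Dict, Hashable, Tuple

Counts = Tuple[int, int, int, int]  # (out_normal, in_normal, out_spam, in_spam)


def refine_labels(ps: Dict[Hashable, float], link_counts: Dict[Hashable, Counts],
                  t: float = 0.3, eps: float = 0.5) -> Dict[Hashable, str]:
    """Re-label low-confidence sites from their labeled neighbors (step S4)."""
    labels = {}
    for site, score in ps.items():
        by_classifier = "spam" if score >= 0.5 else "normal"
        counts = link_counts.get(site)
        uncertain = t <= score <= 1.0 - t
        # Confident predictions and network islands keep the classifier label.
        if not uncertain or counts is None or sum(counts) == 0:
            labels[site] = by_classifier
            continue
        out_normal, in_normal, out_spam, in_spam = counts
        # Formula (2) with K_2 = 1.
        conf_spam = (out_spam + in_spam + eps) / (out_normal + in_normal + eps)
        labels[site] = "spam" if conf_spam > 1.0 else "normal"
    return labels
```

In this sketch a site scored 0.55 by the classifier but surrounded by four normal neighbors is flipped to normal, whereas a site scored 0.9 is left untouched regardless of its neighbors.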

Claims (7)

1. A method for detecting search engine cheating based on a small sample set, characterized by the following steps:
Step S1: preprocess all web page samples and divide the sample set into a training set, a test set and an unlabeled set;
Step S2: using the divided training set and unlabeled set, perform classifier-based self-learning and link learning based on the Internet topology to expand the training set, with the following specific steps:
Step S21 learns from the training set, using the ensemble down-sampling learning strategy;
Step S22 receives the learned model parameters and saves them for classifying samples with unknown labels;
Step S23 uses the trained classifier to predict the samples in the unlabeled sample set;
Step S24 receives the predictions for the unlabeled samples and sorts them in descending order of prediction confidence;
Step S25, according to a preset threshold I, adds the top $J_1$ samples together with their predicted labels to the training set and deletes them from the unlabeled sample set, condition 1 being the conditional expression that judges whether a sample ranks within the top I;
Step S26 labels the Web link graph using the training set;
Step S27 performs link learning, whose goal is to predict the label of a node from the labels of its out-link and in-link nodes in the link graph;
Step S28 selects the $J_{normal}$ nodes with the largest $Conf_{normal}(i)$ and the $J_{spam}$ nodes with the largest $Conf_{spam}(i)$ to expand the training set;
Step S3: on the expanded training set, train classifiers with an ensemble random down-sampling strategy, and use the trained classifiers to detect the samples in the test set; the ensemble random down-sampling strategy is: for a given small-class sample set $M$ and large-class sample set $S$, randomly sample subsets $S_i$ of $S$ multiple times, so that the numbers of samples of the different classes become relatively balanced in the resampled training set, where the down-sampling ratio $K$ is adjustable and is tuned for different datasets;
Step S4: in the post-processing stage of the detection results, apply label propagation based on the Web topology to all samples whose predicted cheating degree lies in the interval $[T, 1-T]$, completing the search engine cheating detection.
2. The method for detecting search engine cheating according to claim 1, characterized in that the training-set expansion of step S2 comprises classifier-based self-learning and link learning based on the Internet topology; both learning processes are iterated repeatedly to keep expanding the training set.
3. The method for detecting search engine cheating according to claim 2, characterized in that the ratio of cheating to non-cheating sites selected during the iterations is the same as the ratio in the original training set.
4. The method for detecting search engine cheating according to claim 1, characterized in that the classifier-based self-learning of step S2 trains a classifier on the training set samples, applies it to the unlabeled sample set, and uses a semi-supervised self-learning process to select the top $J_1$ samples with the highest prediction confidence and add them, with their predicted labels, to the training set.
5. The method for detecting search engine cheating according to claim 1, characterized in that the link learning based on the Internet topology of step S2 labels the Internet link graph with the training set samples, performs link learning according to the distribution of cheating and non-cheating sites in the link graph, selects the $J_2$ sites with the highest propagated confidence, and adds them, with their predicted labels, to the training set.
6. The method for detecting search engine cheating according to claim 1, characterized in that the down-sampling classification strategy of step S3 uses an adjustable down-sampling ratio and an ensemble of sub-classifiers based on the predicted cheating degree.
7. The method for detecting search engine cheating according to claim 1, characterized in that the post-processing stage of step S4 is label propagation along the Internet topology based on the predicted cheating degree.
CN2007101191966A 2007-07-18 2007-07-18 Method for detecting search engine cheat based on small sample set Expired - Fee Related CN101350011B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2007101191966A CN101350011B (en) 2007-07-18 2007-07-18 Method for detecting search engine cheat based on small sample set

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2007101191966A CN101350011B (en) 2007-07-18 2007-07-18 Method for detecting search engine cheat based on small sample set

Publications (2)

Publication Number Publication Date
CN101350011A (en) 2009-01-21
CN101350011B (en) 2011-09-07

Family

ID=40268806

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2007101191966A Expired - Fee Related CN101350011B (en) 2007-07-18 2007-07-18 Method for detecting search engine cheat based on small sample set

Country Status (1)

Country Link
CN (1) CN101350011B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102103598B (en) * 2009-12-21 2012-12-05 同济大学 Reliable search method base on content trust
CN102521331A (en) * 2011-12-06 2012-06-27 中国科学院计算机网络信息中心 Webpage redirection cheating detection method and device
CN103684896B (en) * 2012-09-07 2017-02-01 中国科学院计算机网络信息中心 Method of detecting website cheating based on domain name resolution characteristics
CN103150369A (en) * 2013-03-07 2013-06-12 人民搜索网络股份公司 Method and device for identifying cheat web-pages
CN104216920B (en) * 2013-06-05 2017-11-21 北京齐尔布莱特科技有限公司 Data classification method based on cluster and Hungary Algorithm
CN104239485B (en) * 2014-09-05 2018-05-01 中国科学院计算机网络信息中心 A kind of dark chain detection method in internet based on statistical machine learning
CN110147472B (en) * 2017-07-14 2021-10-15 北京搜狗科技发展有限公司 Detection method and device for cheating sites and detection device for cheating sites
CN107943856A (en) * 2017-11-07 2018-04-20 南京邮电大学 A kind of file classification method and system based on expansion marker samples
CN107909396A (en) * 2017-11-11 2018-04-13 霍尔果斯普力网络科技有限公司 The anti-cheat monitoring method that a kind of Internet advertising is launched
CN108510007A (en) * 2018-04-08 2018-09-07 北京知道创宇信息技术有限公司 A kind of webpage tamper detection method, device, electronic equipment and storage medium
CN110132390B (en) * 2019-05-22 2021-08-06 简刚 Electronic scale capable of reducing cheating force
CN110188262B (en) * 2019-07-23 2019-10-29 武汉斗鱼网络科技有限公司 A kind of abnormal object determines method, apparatus, equipment and medium
CN113536087B (en) * 2021-06-30 2022-05-17 北京百度网讯科技有限公司 Method, device, equipment, storage medium and program product for identifying cheating sites
CN113407804B (en) * 2021-07-14 2023-06-16 杭州雾联科技有限公司 Crawler-based externally hung accurate marking and identifying method and device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1996299A (en) * 2006-12-12 2007-07-11 孙斌 Ranking method for web page and web site

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
C. Castillo, D. Donato, A. Gionis, V. Murdock, F. Silvestri. Know your Neighbors: Web Spam Detection Using the Web Topology. Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval. 2007, 423-430. *
Dengyong Zhou, Christopher J.C. Burges, Tao Tao. Transductive Link Spam Detection. Proceedings of the 3rd international workshop on Adversarial information retrieval on the web. 2007, 215: 21-28. *
Luca Becchetti, Carlos Castillo, Debora Donato. Link-Based Characterization and Detection of Web Spam. Proc. of AIRWeb 2006. 2006, 1-8. *

Also Published As

Publication number Publication date
CN101350011A (en) 2009-01-21

Similar Documents

Publication Publication Date Title
CN101350011B (en) Method for detecting search engine cheat based on small sample set
CN101820366B (en) Pre-fetching-based phishing web page detection method
CN101493819B (en) Method for optimizing detection of search engine cheat
Karakatsanis et al. Data mining approach to monitoring the requirements of the job market: A case study
CN106815297B (en) Academic resource recommendation service system and method
CN103902597B (en) The method and apparatus for determining relevance of searches classification corresponding to target keyword
CN103425799A (en) Personalized research direction recommending system and method based on themes
CN105095187A (en) Search intention identification method and device
JP5543020B2 (en) Research mission identification
CN101216825A (en) Indexing key words extraction/ prediction method, on-line advertisement recommendation method and device
CN101609450A (en) Web page classification method based on training set
CN110555154B (en) Theme-oriented information retrieval method
CN111259219B (en) Malicious webpage identification model establishment method, malicious webpage identification method and malicious webpage identification system
CN103150369A (en) Method and device for identifying cheat web-pages
CN107844533A (en) A kind of intelligent Answer System and analysis method
CN101398832A (en) Image searching method and system by utilizing human face detection
CN103544307B (en) A kind of multiple search engine automation contrast evaluating method independent of document library
CN110728136A (en) Multi-factor fused textrank keyword extraction algorithm
CN112989215B (en) Sparse user behavior data-based knowledge graph enhanced recommendation system
CN105512224A (en) Search engine user satisfaction automatic assessment method based on cursor position sequence
CN103823847A (en) Keyword extension method and device
CN112328469B (en) Function level defect positioning method based on embedding technology
Wu et al. SOUA: Towards Intelligent Recommendation for Applying for Overseas Universities
CN109034908A (en) A kind of film ranking prediction technique of combination sequence study
JP5315726B2 (en) Information providing method, information providing apparatus, and information providing program

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20110907

Termination date: 20170718