CN101350011A - Method for detecting search engine cheat based on small sample set - Google Patents
- Publication number
- CN101350011A
- Authority
- CN
- China
- Prior art keywords
- cheating
- sample
- search engine
- training set
- internet
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to Internet information retrieval and discloses a method for detecting Internet cheating (Web spam) from a small sample set, to combat increasingly severe search engine cheating. Because detection samples are costly to collect, the invention iterates classifier-based self-learning and link learning based on the Internet topology to expand the training set continuously, so that search engine cheating can be detected from a small sample set. An ensemble down-sampling strategy is adopted in the detection process, which makes full use of the information contained in the high-reputation websites that are widespread on the Internet. Finally, labels are propagated along the Internet topology on the basis of the predicted cheating degree to optimize the detection results. Experiments show that the method detects cheating behavior effectively.
Description
Technical field
The present invention relates to the field of information retrieval and search engine technology, and specifically to a method for detecting search engine cheating from a small sample set.
Background art
The Internet is the largest information repository in human history, and its content is still growing exponentially. Internet search has become part of daily life: the report issued by CNNIC in July 2006 states that search engines, used by 66.3% of Internet users, rank first among the most frequently used network services.
N. Eiron and other researchers used the well-known PageRank algorithm to rank 100 million web pages and found that 11 of the top 20 websites were pornographic sites that had obtained their high rank by manipulating hyperlinks. According to a US commerce survey bureau, US e-commerce sales reached 114.1 billion dollars in 2006, an increase of 22.7% over the 93 billion dollars of 2005. In 2007 the first quarter alone reached 31.5 billion dollars, 18.4% more than in the same period of 2006. Research by Bernard J. Jansen and Amanda Spink shows that about 80% of users browse only the first 3 pages of results when using a search engine.
The huge profits and the gateway effect of search engines drive many site administrators and page authors to make their websites and pages prominent on the Internet by any means, hoping to appear at the top of the results when users query related content. Internet cheating (Web spam), also called search engine cheating, is the use of deceptive means to mislead a search engine so that a Web page ranks higher in the retrieval results than it actually deserves. It seriously degrades the quality of search results.
Internet cheating can be divided into three classes: content cheating, link cheating and hiding cheating. Content cheating means that a website uses content information to deceive the search engine and raise the importance of certain pages; it includes keyword cheating, title cheating and so on. Link cheating means that a cheating website constructs link structures aimed at the PageRank algorithm to mislead the ranking algorithm and thereby raise the importance of certain pages. Hiding cheating means that the cheater uses various concealment techniques so that users do not notice the use of the content and link cheating techniques above.
Many countermeasures have been proposed against these forms of cheating. For detection based on content analysis, A. Ntoulas et al. compared cheating pages and ordinary pages on properties such as average word length, the fraction of visible content, the content compression ratio, the amount of anchor text and the fraction of popular words in the text. They summarized a series of heuristic features, treated the detection of content-cheating pages as a binary classification problem, and trained a decision tree classifier that detects most content-cheating pages. For link-based detection, the earliest influential work is the TrustRank algorithm proposed by Gyongyi et al., whose starting point is that "good pages seldom point to cheating pages". A seed set of highly reputable pages is selected by hand and trust is propagated along the hyperlinks of the Web graph, which yields a trust score for every page; all pages are then divided into Spam and Normal. B. Wu and Davison proposed a method for detecting cloaking: each URL is fetched twice by a crawler and once more by imitating a browser, and the content differences among the copies are computed to decide whether redirection cheating is present. The drawbacks of this method are that the repeated fetches increase the processing burden of the search engine and consume considerable bandwidth and, more importantly, that it requires the search engine crawler to issue ordinary browser requests, which violates the Robots Exclusion Standard.
Cheating detection based on machine learning has become a focus of recent research. The war between search engines and cheaters resembles an arms race: once a search engine finds and deploys an effective method, cheaters work out a countermeasure after a while and invent new forms of cheating. Machine-learning methods cope with a new form of cheating by adding or deleting the corresponding features, so that detection stays effective without changes to the system architecture. However, these methods face two difficulties: 1. obtaining the samples needed for machine learning requires a great deal of manual labor and is costly; 2. in Internet data, reputable websites are easier to obtain than cheating websites, so the ratio of cheating to non-cheating websites is severely imbalanced, and traditional learning algorithms perform poorly when learning from imbalanced samples.
Summary of the invention
To solve the problems that the samples required by existing machine-learning methods are costly to obtain and that traditional learning algorithms perform poorly on imbalanced samples, the object of the invention is to reduce the human effort and cost of obtaining samples and to achieve good results when learning from imbalanced samples. To this end the invention provides a search engine Web cheating detection method based on a small sample set.
To achieve this object, the technical scheme of the search engine Web cheating detection method based on a small sample set is as follows:
Step S1: pre-process all web page samples and divide the sample set into a training set, a test set and an unlabeled set;
Step S2: use the training set and the unlabeled set to perform classifier-based self-learning and link learning based on the Internet topology, in order to expand the training set;
Step S3: on the expanded training set, train a classifier with the ensemble down-sampling strategy and use the trained classifier to detect the samples in the test set;
Step S4: post-process the detection results by label propagation based on the predicted cheating degree, completing the search engine cheating detection.
According to embodiments of the invention, the training set expansion of step S2 comprises classifier-based self-learning and link learning based on the Internet topology. Both learning processes are iterated continuously to complete the continuous expansion of the training set.
According to embodiments of the invention, the ratio of cheating to non-cheating websites selected in the iterative process is the same as the ratio in the original training set.
According to embodiments of the invention, the classifier-based self-learning of step S2 trains a classifier on the training set samples, applies it to the unlabeled sample set, and uses a semi-supervised self-learning process to select the J1 samples with the largest prediction confidence and add them, with their predicted labels, to the training set.
According to embodiments of the invention, the link learning based on the Internet topology of step S2 labels the Internet link graph with the training set samples, performs link learning according to the distribution of cheating and non-cheating websites in the link graph, and selects the J2 websites with the largest propagation confidence and adds them, with their predicted labels, to the training set.
According to embodiments of the invention, the down-sampling classification strategy of step S3 uses an adjustable down-sampling ratio and an ensemble of sub-classifiers based on the predicted cheating degree, and the algorithm itself is suitable for distributed computation.
According to embodiments of the invention, the post-processing stage of step S4 is label propagation along the Internet topology based on the predicted cheating degree.
The training set expansion algorithm of the invention makes effective use of the self-learning ability of the learning algorithm and of Internet link topology information, and to some extent solves the scarcity of samples that machine-learning methods face in Web cheating detection. The classification strategy that ensembles random sampling makes effective use of the information of the high-reputation websites (pages) that are widespread on the Internet and overcomes the sample imbalance problem. In traditional classification problems the samples are independent of one another, whereas the websites that are the objects of detection depend on one another. Optimizing the detection results over the Internet hyperlink graph exploits exactly this dependence and further improves detection performance.
Description of drawings
Fig. 1 is the overall block diagram of Web cheating detection based on a small sample set according to the invention;
Fig. 2 is the data flow diagram of the pre-processing step;
Fig. 3 is the data flow diagram of the training set expansion step of the invention;
Fig. 4 shows the cheating detection and result optimization steps based on the expanded training set;
Fig. 5 is an example diagram of simple site link topologies.
Detailed description of the embodiments
The invention is described in detail below with reference to the drawings. It should be noted that the described embodiments are intended only to aid understanding of the invention and do not limit it in any way.
To implement the method, given that the algorithm involves repeated resampling and iteration, a single-machine implementation should preferably have a processor clock of at least 2 GHz and at least 1 GB of memory. Any common programming language may be used.
The overall flow of the small-sample-set Web cheating detection method proposed by the invention is shown in Fig. 1, and the data flow of each step is given in Figs. 2, 3 and 4. Pre-processing (step S1) prepares the data for the whole detection task. Step S2 is the iterative expansion of the training set, i.e. classifier-based self-learning and link learning based on the network topology. Step S3 trains a classifier on the expanded training set and uses it to detect the samples under test, applying the ensemble random down-sampling learning strategy (ERUS). Step S4 optimizes the detection results by label propagation based on the predicted cheating degree.
Extensive statistics show that the website hosting a cheating page is usually itself a cheating website, and the Webspam-UK2006 standard data set was built on this observation. Unless otherwise stated, "cheating sample" and "Spam" in this description both denote cheating websites. Each main step is described in detail below.
1. Pre-processing (step S1)
Pre-processing is the first step of the whole system. As the preparatory stage, it performs the work shown in Fig. 2: web page crawling (step S11), construction of the hyperlink graph (step S12), web page content extraction (step S13), feature extraction (step S14), manual labeling, in which some websites are manually classified as Spam or Normal (step S15), and division of the sample set into a training set, a test set and an unlabeled set (step S16). Mature methods exist for web page crawling, hyperlink graph construction and content extraction, and they are not the emphasis of the invention. The Web page classification follows the construction standard of Webspam-UK2006 (http://www.yr-bcn.es/webspam/datasets/uk2006-info/). The extracted features include features related to page content and features based on link relations. The invention focuses on an effective cheating detection strategy under a given feature set; feature extraction follows [C. Castillo, D. Donato, A. Gionis: Know your Neighbors: Web Spam Detection using the Web Topology. SIGIR 2007] and is not described further here.
According to the extracted features and the manual classification, all samples are divided into three parts: a training set, a test set and an unlabeled set.
2. Iterative expansion of the training set (step S2)
Step S2 is a semi-supervised learning process whose purpose is to expand and improve the training set step by step. Its inputs are the training sample set and the unlabeled sample set produced by the division in step S1. The expansion consists of two parts: classifier-based self-learning and link learning based on the network topology. The loop formed by steps S21, S22, S23, S24 and S25 in Fig. 3 is the self-learning process, which is repeated for a set number of iterations. Step S21 learns from the training set using the ensemble down-sampling learning strategy. Step S22 receives the learned model parameters and stores them for classifying samples with unknown labels. Step S23 uses the trained classifier to predict the samples in the unlabeled sample set. Step S24 receives the predictions for the unlabeled samples and sorts them in descending order of prediction confidence. According to the preset threshold I, step S25 adds the top J1 samples, together with their predicted labels, to the training set and deletes them from the unlabeled sample set; condition 1 in Fig. 3 is the test of whether a sample ranks within the top I. The classifier in this step is trained with the ensemble down-sampling strategy, whose implementation is set out in step S3. The ratio of cheating to non-cheating websites selected during the self-learning iterations is required to be the same as in the original training set.
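One pass through the self-learning loop (steps S21 to S25) can be sketched as follows. This is only an illustration under stated assumptions: the names `self_training_round`, `toy_fit`, `j_spam` and `ratio` are hypothetical, the base learner is a stand-in, and splitting the J1 selected samples into `j_spam` Spam and `ratio * j_spam` Normal samples is one possible way to keep the class ratio required above.

```python
from typing import Callable, List, Tuple

Sample = Tuple[float, ...]          # feature vector of one website
Labeled = Tuple[Sample, int]        # (features, label); 1 = Spam, 0 = Normal
Model = Callable[[Sample], float]   # returns the predicted probability of Spam


def toy_fit(train: List[Labeled]) -> Model:
    """Stand-in learner (hypothetical): closeness to the class means of feature 0."""
    spam = [x[0] for x, y in train if y == 1]
    normal = [x[0] for x, y in train if y == 0]
    c_spam, c_normal = sum(spam) / len(spam), sum(normal) / len(normal)

    def predict(x: Sample) -> float:
        d_spam, d_normal = abs(x[0] - c_spam), abs(x[0] - c_normal)
        total = d_spam + d_normal
        return 0.5 if total == 0 else d_normal / total

    return predict


def self_training_round(
    train: List[Labeled],
    unlabeled: List[Sample],
    fit: Callable[[List[Labeled]], Model],
    j_spam: int,
    ratio: int,
) -> None:
    """One S21-S25 pass; both lists are modified in place."""
    predict = fit(train)                            # S21/S22: train and keep the model
    probs = [predict(x) for x in unlabeled]         # S23: score the unlabeled pool
    order = sorted(range(len(unlabeled)), key=lambda i: probs[i])   # S24: rank
    # S25: most confident Spam and most confident Normal predictions,
    # taken in the assumed ratio (ratio Normal samples per Spam sample).
    spam_idx = [i for i in reversed(order) if probs[i] >= 0.5][:j_spam]
    normal_idx = [i for i in order if probs[i] < 0.5][: ratio * j_spam]
    chosen = {i: 1 for i in spam_idx}
    chosen.update({i: 0 for i in normal_idx})
    train.extend((unlabeled[i], label) for i, label in chosen.items())
    unlabeled[:] = [x for i, x in enumerate(unlabeled) if i not in chosen]
```

Calling `self_training_round` repeatedly for the set number of iterations gives the loop of Fig. 3; any learner with the same `fit` interface could replace `toy_fit`.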
Fig. 5 shows link topologies at the website level: directed edges represent hyperlinks between sites, black circles represent cheating websites, white circles represent non-cheating websites, and gray circles are unlabeled websites. Each number in the figure marks the nearest gray circle (website) for ease of description, and the letters (A-E) below the figure denote 6 different simple network topologies. To remove the influence of noisy hyperlinks, a directed edge exists between two websites if and only if the number of hyperlinks from one website to the other is not less than a threshold T_num, as shown in Fig. 5; in our experiments T_num = 5.
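The edge rule above can be illustrated with a short sketch. It assumes that page-level hyperlinks have already been mapped to (source site, target site) pairs; the function name `build_site_graph` and the decision to ignore links within a single site are assumptions of this example, not details given in the description.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple


def build_site_graph(
    page_links: Iterable[Tuple[str, str]], t_num: int = 5
) -> Dict[str, Set[str]]:
    """Keep a directed edge src -> dst only if at least t_num hyperlinks
    point from site src to site dst."""
    # Count hyperlinks per ordered site pair; links inside one site are skipped
    # (an assumption of this sketch).
    counts = Counter((src, dst) for src, dst in page_links if src != dst)
    graph: Dict[str, Set[str]] = {}
    for (src, dst), n in counts.items():
        if n >= t_num:
            graph.setdefault(src, set()).add(dst)
    return graph
```

With `t_num = 5`, a site pair connected by only four hyperlinks produces no edge, which is how occasional noisy links are filtered out.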
Steps S26, S27 and S28 form the second stage of training set expansion, namely link learning based on the Internet topology. First, in step S26, the Web link graph is labeled using the training set. The goal of link learning in step S27 is to predict the label of a node from the labels of its out-link and in-link nodes (its neighbor nodes) in the link graph. C. Castillo et al. [C. Castillo, D. Donato, A. Gionis: Know your Neighbors: Web Spam Detection using the Web Topology. SIGIR 2007] point out that "the in-links of non-cheating websites rarely come from cheating websites; non-cheating websites usually link to other non-cheating websites" and that "the in-links of cheating websites often come from cheating websites". Our statistics on real Internet data sets confirm this. The purpose of link learning here is to expand the training set rather than only to produce predictions, and the quality of the selected samples directly affects classification performance after the expansion, so the labeling of samples must satisfy stricter constraints. For convenience, let inlink_normal(i) denote the number of non-cheating websites among the in-links of node i and inlink_spam(i) the number of cheating websites among the in-links of node i. Similarly, outlink_normal(i) and outlink_spam(i) denote the numbers of normal and cheating websites among the out-links of node i. In the link learning of step S27, the confidence of the predicted label is computed as follows:
[Formulas (1) and (2), which define Conf_normal(i) and Conf_spam(i), are not reproduced in this text.]
where 0 < ε < 1 is a smoothing factor, and K1 and K2 take their values according to the following rules:
IF (outlink_normal(i) > 0 & inlink_normal(i) > 0) THEN
    IF (inlink_spam(i) = 0 & outlink_spam(i) = 0) THEN
        K1 = M1
    ELSE
        K1 = M2
    END IF
ELSE
    K1 = M3
END IF
IF (outlink_spam(i) > 0 & inlink_spam(i) > 0) THEN
    IF (inlink_normal(i) = 0 & outlink_normal(i) = 0) THEN
        K2 = N1
    ELSE
        K2 = N2
    END IF
ELSE
    K2 = N3
END IF
where M1 > M2 > M3 >= 1 and N1 > N2 > N3 >= 1. According to formulas 1 and 2, the J_normal nodes with the largest Conf_normal(i) and the J_spam nodes with the largest Conf_spam(i) are selected, and their features and predicted labels are put into the training set in preparation for the next iteration; the selection is decided by condition 2 of step S28 in Fig. 3. In practice we take M1 = N1 = 3 > M2 = N2 = 2 > M3 = N3 = 1, ε = 0.5, J_spam = 1, J_normal = L*J_spam and J2 = J_normal + J_spam, where L is the ratio of Normal samples to Spam samples in the original training set and J2 is the number of websites with the largest propagation confidence selected in one iteration. With these values, in the link graph of Fig. 5, Conf_spam(1) > Conf_spam(2) > Conf_spam(3) and Conf_normal(4) > Conf_normal(5) > Conf_normal(6), where the numbers denote the correspondingly numbered websites in Fig. 5. In the link learning iterations, the values of J_normal and J_spam guarantee that the ratio of selected cheating to non-cheating websites is the same as in the original training set.
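The rules for K1 and K2 and the per-iteration selection sizes translate directly into code. The sketch below is illustrative only: `LinkCounts`, `k1`, `k2` and `selection_sizes` are hypothetical names, the defaults are the practical values quoted above (M1 = N1 = 3, M2 = N2 = 2, M3 = N3 = 1, J_spam = 1), and the confidence formulas themselves are not implemented because they are not reproduced in this text.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LinkCounts:
    """Labeled neighbor counts of one node i in the site link graph."""
    inlink_normal: int
    inlink_spam: int
    outlink_normal: int
    outlink_spam: int


def k1(c: LinkCounts, m: Tuple[int, int, int] = (3, 2, 1)) -> int:
    """Weight used for the Normal confidence; m = (M1, M2, M3)."""
    m1, m2, m3 = m
    if c.outlink_normal > 0 and c.inlink_normal > 0:
        # Normal neighbors on both sides; purest case has no spam neighbors.
        return m1 if (c.inlink_spam == 0 and c.outlink_spam == 0) else m2
    return m3


def k2(c: LinkCounts, n: Tuple[int, int, int] = (3, 2, 1)) -> int:
    """Weight used for the Spam confidence; n = (N1, N2, N3)."""
    n1, n2, n3 = n
    if c.outlink_spam > 0 and c.inlink_spam > 0:
        # Spam neighbors on both sides; purest case has no normal neighbors.
        return n1 if (c.inlink_normal == 0 and c.outlink_normal == 0) else n2
    return n3


def selection_sizes(ratio_l: int, j_spam: int = 1) -> Tuple[int, int, int]:
    """Return (J_normal, J_spam, J2) with J_normal = L * J_spam."""
    j_normal = ratio_l * j_spam
    return j_normal, j_spam, j_normal + j_spam
```

A node surrounded only by labeled Normal sites on both the in-link and out-link side gets the largest weight M1, which is why such nodes are the first candidates to be added to the training set.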
3. Cheating detection (step S3)
As shown in Fig. 4, the training set used in step S3 is the training set expanded by step S2. Step S31 trains a classifier on the expanded training set. Step S32 receives the learned model parameters and stores them for classifying unlabeled samples. Step S33 uses the trained classifier to test the test samples. Step S34 receives the preliminary detection results and prepares the data for step S4. The classifier is trained with the ensemble random down-sampling learning strategy (ERUS), under which the classification algorithm learns from the training set. The ERUS algorithm is implemented as follows:
--------------------------
Input:  M: samples of the small class (cheating class)
        S: samples of the large class (non-cheating class)
        n: number of sampling rounds
        K: sampling ratio
        x: test sample
Output: detection result for sample x
-------------
1. i = 0
2. while i < n do
3.     randomly sample a subset S_i from S (|S_i| < |S|, |S_i| = K*|M|)
4.     train classifier C_i on S_i and M
6.     i = i + 1
7. end while
8. spamicity = 0
9. for i = 0 to n-1 do
11.     compute PS(x, C_i) with formula 3
12.     spamicity = spamicity + PS(x, C_i)
13. end for
14.
15. if (spamicity >= 0.5) then
16.     sample x belongs to the cheating websites
17. else
18.     sample x belongs to the non-cheating websites
19. end if
--------------------------
Common learning algorithms such as Bayes, C4.5, bagging and AdaBoost can be used with this learning strategy. The ensemble of down-sampled classifiers is based on the predicted cheating degree (PS) of each sub-classifier. The notion of PS, which is similar to the definition of prediction confidence in learning algorithms, is given first. PS is described as:
[Formula (3), which defines PS(x, C), is not reproduced in this text.]
where x is a test sample, C is a particular classifier, and P_spam(x, C) and P_normal(x, C) are the probabilities or distance values, predicted by classifier C, that sample x belongs to the spam set and the normal set respectively.
Down-sampling is an effective strategy for class-imbalanced learning. Given a small-class sample set M and a large-class sample set S, down-sampling randomly draws a subset S_i from S, where |M| < |S| (|*| denotes the number of samples in set *). Down-sampling uses only part of the large-class samples to train the classifier, so that the numbers of samples of the different classes become relatively balanced in the resampled training set; experiments show that this method is very effective. However, using only part of the large-class samples means that much useful information in the discarded majority samples is not fully used. We therefore adopt an ensemble down-sampling strategy to remedy this defect and exploit the useful information in the large-class samples as fully as possible. The down-sampling ratio K is adjustable and is tuned for each data set. In the ERUS algorithm the results of the sub-classifiers are fused on the basis of their PS values rather than their predicted labels.
In the ERUS algorithm, lines 1-7 are the learning phase and lines 8-19 are the test phase. The learning and testing of each sub-classifier are independent of the others, so the algorithm is suitable for distributed computation.
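The two phases of ERUS can be sketched in a few lines. This is a hedged illustration, not the patented implementation: `centroid_fit` is a stand-in base learner chosen only to keep the example self-contained, the normalization PS = P_spam / (P_spam + P_normal) is an assumption because formula 3 is not reproduced in this text, and averaging the PS values before comparing with 0.5 is likewise an assumption about the unlisted step 14.

```python
import random
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Scorer = Callable[[Vector], Tuple[float, float]]   # returns (P_spam, P_normal)


def centroid_fit(spam: List[Vector], normal: List[Vector]) -> Scorer:
    """Stand-in base learner (hypothetical): closeness to class centroids."""
    def centroid(rows: List[Vector]) -> List[float]:
        return [sum(col) / len(rows) for col in zip(*rows)]

    def dist(a: Vector, b: Vector) -> float:
        return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5

    c_spam, c_normal = centroid(spam), centroid(normal)
    return lambda x: (1 / (1 + dist(x, c_spam)), 1 / (1 + dist(x, c_normal)))


def erus_train(minority: List[Vector], majority: List[Vector], n: int, k: float,
               fit: Callable[[List[Vector], List[Vector]], Scorer] = centroid_fit,
               seed: int = 0) -> List[Scorer]:
    """Learning phase (lines 1-7): n classifiers, each trained on M plus a
    random subset S_i of S with |S_i| = K * |M|."""
    rng = random.Random(seed)
    size = min(len(majority), int(k * len(minority)))
    return [fit(minority, rng.sample(majority, size)) for _ in range(n)]


def predicted_spamicity(x: Vector, scorer: Scorer) -> float:
    p_spam, p_normal = scorer(x)
    return p_spam / (p_spam + p_normal)      # assumed form of PS(x, C)


def erus_predict(x: Vector, scorers: List[Scorer]) -> Tuple[float, bool]:
    """Test phase (lines 8-19): fuse PS values, not labels; the mean is an
    assumed normalization so that the 0.5 threshold applies."""
    spamicity = sum(predicted_spamicity(x, c) for c in scorers) / len(scorers)
    return spamicity, spamicity >= 0.5
```

Because each sub-classifier is trained and evaluated independently, the list comprehension in `erus_train` and the sum in `erus_predict` could be distributed across machines without changing the result.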
4. Optimization of detection results (step S4)
Given the preliminary predictions of step S3 and a threshold T (0 < T < 0.5), label propagation based on the Web topology is performed for all samples whose predicted cheating degree lies in the interval [T, 1-T], where the labels in the Web graph are marked from the training set samples (step S41). In step S42 the propagated cheating degree is calculated by formula (2). Unlike the sample set expansion stage, which selects the J_spam samples with the largest propagated cheating degree, here the propagated cheating degree is used to make the final decision on whether a website cheats, so K2 in formula 2 need not follow the rules of link learning in step S27; K2 = 1 is used. If Conf_spam(i) > 1 the node is considered a cheating website, otherwise a non-cheating website. The rationale is to trust the high-confidence predictions of the learning algorithm while doubting the low-confidence ones, whose final labels are determined by propagation over the network topology. For network islands (websites without neighbors) whose prediction confidence lies in [T, 1-T], the label is still decided by the classifier trained in the previous stage. Step S43 receives the results optimized by label propagation and completes the final cheating detection.
Claims (7)
1. A method for detecting search engine cheating based on a small sample set, characterized by the following steps:
Step S1: pre-process all web page samples and divide the sample set into a training set, a test set and an unlabeled set;
Step S2: use the training set and the unlabeled set to perform classifier-based self-learning and link learning based on the Internet topology, in order to expand the training set;
Step S3: on the expanded training set, train a classifier with the ensemble down-sampling strategy and use the trained classifier to detect the samples in the test set;
Step S4: post-process the detection results by label propagation based on the predicted cheating degree, completing the search engine cheating detection.
2. The method for detecting search engine cheating according to claim 1, characterized in that the training set expansion of step S2 comprises classifier-based self-learning and link learning based on the Internet topology, both learning processes being iterated continuously to complete the continuous expansion of the training set.
3. The method for detecting search engine cheating according to claim 2, characterized in that the ratio of cheating to non-cheating websites selected in the iterative process is the same as the ratio in the original training set.
4. The method for detecting search engine cheating according to claim 1, characterized in that the classifier-based self-learning of step S2 trains a classifier on the training set samples, applies it to the unlabeled sample set, and uses a semi-supervised self-learning process to select the J1 samples with the largest prediction confidence and add them, with their predicted labels, to the training set.
5. The method for detecting search engine cheating according to claim 1, characterized in that the link learning based on the Internet topology of step S2 labels the Internet link graph with the training set samples, performs link learning according to the distribution of cheating and non-cheating websites in the link graph, and selects the J2 websites with the largest propagation confidence and adds them, with their predicted labels, to the training set.
6. The method for detecting search engine cheating according to claim 1, characterized in that the down-sampling classification strategy of step S3 uses an adjustable down-sampling ratio and an ensemble of sub-classifiers based on the predicted cheating degree, and the algorithm itself is suitable for distributed computation.
7. The method for detecting search engine cheating according to claim 1, characterized in that the post-processing stage of step S4 is label propagation along the Internet topology based on the predicted cheating degree.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2007101191966A CN101350011B (en) | 2007-07-18 | 2007-07-18 | Method for detecting search engine cheat based on small sample set |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2007101191966A CN101350011B (en) | 2007-07-18 | 2007-07-18 | Method for detecting search engine cheat based on small sample set |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101350011A (en) | 2009-01-21 |
CN101350011B (en) | 2011-09-07 |
Family
ID=40268806
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2007101191966A Expired - Fee Related CN101350011B (en) | 2007-07-18 | 2007-07-18 | Method for detecting search engine cheat based on small sample set |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101350011B (en) |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102521331A (en) * | 2011-12-06 | 2012-06-27 | 中国科学院计算机网络信息中心 | Webpage redirection cheating detection method and device |
CN102103598B (en) * | 2009-12-21 | 2012-12-05 | 同济大学 | Reliable search method base on content trust |
CN103150369A (en) * | 2013-03-07 | 2013-06-12 | 人民搜索网络股份公司 | Method and device for identifying cheat web-pages |
CN103684896A (en) * | 2012-09-07 | 2014-03-26 | 中国科学院计算机网络信息中心 | Method of detecting website cheating based on domain name resolution characteristics |
CN104216920A (en) * | 2013-06-05 | 2014-12-17 | 北京齐尔布莱特科技有限公司 | Data classification method based on clustering and Hungary algorithm |
CN104239485A (en) * | 2014-09-05 | 2014-12-24 | 中国科学院计算机网络信息中心 | Statistical machine learning-based internet hidden link detection method |
CN107909396A (en) * | 2017-11-11 | 2018-04-13 | 霍尔果斯普力网络科技有限公司 | The anti-cheat monitoring method that a kind of Internet advertising is launched |
CN107943856A (en) * | 2017-11-07 | 2018-04-20 | 南京邮电大学 | A kind of file classification method and system based on expansion marker samples |
CN108510007A (en) * | 2018-04-08 | 2018-09-07 | 北京知道创宇信息技术有限公司 | A kind of webpage tamper detection method, device, electronic equipment and storage medium |
CN110132390A (en) * | 2019-05-22 | 2019-08-16 | 查常财 | The electronic scale of cheating dynamics can be reduced |
CN110147472A (en) * | 2017-07-14 | 2019-08-20 | 北京搜狗科技发展有限公司 | Detection method, device and the detection device for website of practising fraud of cheating website |
CN110188262A (en) * | 2019-07-23 | 2019-08-30 | 武汉斗鱼网络科技有限公司 | A kind of abnormal object determines method, apparatus, equipment and medium |
CN113407804A (en) * | 2021-07-14 | 2021-09-17 | 杭州雾联科技有限公司 | External hanging accurate marking and identifying method and device based on crawler |
CN113536087A (en) * | 2021-06-30 | 2021-10-22 | 北京百度网讯科技有限公司 | Method, device, equipment, storage medium and program product for identifying cheating sites |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN100543744C (en) * | 2006-12-12 | 2009-09-23 | 孙斌 | Method to webpage and website grading |
2007-07-18: CN application CN2007101191966A, granted as CN101350011B (status: Expired - Fee Related)
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102103598B (en) * | 2009-12-21 | 2012-12-05 | 同济大学 | Reliable search method base on content trust |
CN102521331A (en) * | 2011-12-06 | 2012-06-27 | 中国科学院计算机网络信息中心 | Webpage redirection cheating detection method and device |
CN103684896A (en) * | 2012-09-07 | 2014-03-26 | 中国科学院计算机网络信息中心 | Method of detecting website cheating based on domain name resolution characteristics |
CN103684896B (en) * | 2012-09-07 | 2017-02-01 | 中国科学院计算机网络信息中心 | Method of detecting website cheating based on domain name resolution characteristics |
CN103150369A (en) * | 2013-03-07 | 2013-06-12 | 人民搜索网络股份公司 | Method and device for identifying cheat web-pages |
CN104216920B (en) * | 2013-06-05 | 2017-11-21 | 北京齐尔布莱特科技有限公司 | Data classification method based on cluster and Hungary Algorithm |
CN104216920A (en) * | 2013-06-05 | 2014-12-17 | 北京齐尔布莱特科技有限公司 | Data classification method based on clustering and Hungary algorithm |
CN104239485B (en) * | 2014-09-05 | 2018-05-01 | 中国科学院计算机网络信息中心 | A kind of dark chain detection method in internet based on statistical machine learning |
CN104239485A (en) * | 2014-09-05 | 2014-12-24 | 中国科学院计算机网络信息中心 | Statistical machine learning-based internet hidden link detection method |
CN110147472A (en) * | 2017-07-14 | 2019-08-20 | 北京搜狗科技发展有限公司 | Detection method, device and the detection device for website of practising fraud of cheating website |
CN110147472B (en) * | 2017-07-14 | 2021-10-15 | 北京搜狗科技发展有限公司 | Detection method and device for cheating sites and detection device for cheating sites |
CN107943856A (en) * | 2017-11-07 | 2018-04-20 | 南京邮电大学 | A kind of file classification method and system based on expansion marker samples |
CN107909396A (en) * | 2017-11-11 | 2018-04-13 | 霍尔果斯普力网络科技有限公司 | The anti-cheat monitoring method that a kind of Internet advertising is launched |
CN108510007A (en) * | 2018-04-08 | 2018-09-07 | 北京知道创宇信息技术有限公司 | A kind of webpage tamper detection method, device, electronic equipment and storage medium |
CN110132390A (en) * | 2019-05-22 | 2019-08-16 | 查常财 | The electronic scale of cheating dynamics can be reduced |
CN110132390B (en) * | 2019-05-22 | 2021-08-06 | 简刚 | Electronic scale capable of reducing cheating force |
CN110188262A (en) * | 2019-07-23 | 2019-08-30 | 武汉斗鱼网络科技有限公司 | A kind of abnormal object determines method, apparatus, equipment and medium |
CN113536087A (en) * | 2021-06-30 | 2021-10-22 | 北京百度网讯科技有限公司 | Method, device, equipment, storage medium and program product for identifying cheating sites |
CN113407804A (en) * | 2021-07-14 | 2021-09-17 | 杭州雾联科技有限公司 | External hanging accurate marking and identifying method and device based on crawler |
CN113407804B (en) * | 2021-07-14 | 2023-06-16 | 杭州雾联科技有限公司 | Crawler-based externally hung accurate marking and identifying method and device |
Also Published As
Publication number | Publication date |
---|---|
CN101350011B (en) | 2011-09-07 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN101350011B (en) | Method for detecting search engine cheat based on small sample set | |
CN101493819B (en) | Method for optimizing detection of search engine cheat | |
CN106815297B (en) | Academic resource recommendation service system and method | |
CN101820366B (en) | Pre-fetching-based fishing web page detection method | |
Karakatsanis et al. | Data mining approach to monitoring the requirements of the job market: A case study | |
CN103902597B (en) | The method and apparatus for determining relevance of searches classification corresponding to target keyword | |
CN106339502A (en) | Modeling recommendation method based on user behavior data fragmentation cluster | |
Joho et al. | Overview of NTCIR-11 Temporal Information Access (Temporalia) Task. | |
CN103425799A (en) | Personalized research direction recommending system and method based on themes | |
CN105095187A (en) | Search intention identification method and device | |
CN101609450A (en) | Web page classification method based on training set | |
CN110555154B (en) | Theme-oriented information retrieval method | |
CN111259219B (en) | Malicious webpage identification model establishment method, malicious webpage identification method and malicious webpage identification system | |
CN107844533A (en) | A kind of intelligent Answer System and analysis method | |
CN103150369A (en) | Method and device for identifying cheat web-pages | |
Amami et al. | A graph based approach to scientific paper recommendation | |
CN103544307B (en) | A kind of multiple search engine automation contrast evaluating method independent of document library | |
CN110728136A (en) | Multi-factor fused textrank keyword extraction algorithm | |
Archchitha et al. | Opinion spam detection in online reviews using neural networks | |
CN105512224A (en) | Search engine user satisfaction automatic assessment method based on cursor position sequence | |
CN105701167B (en) | Based on safety of coal mines event topic correlation method of discrimination | |
CN112989215A (en) | Knowledge graph enhanced recommendation system based on sparse user behavior data | |
CN103823847A (en) | Keyword extension method and device | |
CN102929977A (en) | Event tracing method aiming at news website | |
Wu et al. | SOUA: Towards Intelligent Recommendation for Applying for Overseas Universities |
Legal Events
Code | Title | Description |
---|---|---|
C06 | Publication | |
PB01 | Publication | |
C10 | Entry into substantive examination | |
SE01 | Entry into force of request for substantive examination | |
C14 | Grant of patent or utility model | |
GR01 | Patent grant | |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 2011-09-07; Termination date: 2017-07-18 |