Background technology
The internet is the largest information repository since the dawn of human civilization, and its content continues to grow exponentially. Internet search has become part of people's daily life: a report released by CNNIC in July 2006 found that 66.3% of netizens use search engines, ranking search first among regularly used network services.
Scholars such as N. Eiron used the well-known PageRank algorithm to rank 100,000,000 web pages and found that 11 of the top 20 websites were pornographic sites that had obtained their high rank by manipulating hyperlinks. According to a survey by the U.S. Department of Commerce, e-commerce sales in the United States reached 114.1 billion dollars in 2006, an increase of 22.7% over the 93 billion dollars of 2005; in the first quarter of 2007 alone, this figure reached 31.5 billion dollars, up 18.4% over the same period of 2006. Research by Bernard J. Jansen and Amanda Spink shows that about 80% of users browse only the first 3 pages of results when using a search engine.
The huge profits at stake and the gateway effect of search engines drive many portal operators and web page authors to make their websites and pages prominent on the internet by every possible means, so that they appear at the front of the results when users issue related queries. Web spam, also called search engine spamming, refers to the practice of using deceptive means to mislead search engines so that a web page is ranked higher in the retrieval results than it actually deserves; it seriously degrades the quality of search engine results.
Web spam can be divided into three classes: content spam, link spam, and hiding spam. Content spam refers to a website deceiving the search engine through its content information to inflate the importance of certain pages, and includes keyword spam, title spam, and so on. Link spam refers to spam websites constructing link structures aimed at the PageRank algorithm to mislead the search engine's ranking algorithm and thereby inflate the importance of certain pages. Hiding spam refers to spammers using various concealment techniques so that their use of the aforementioned content and link spam techniques is not discovered by the user.
A large number of countermeasures have been proposed against the above spamming forms. In content-based spam page detection, A. Ntoulas et al. examined the fraction of visible content, the content compression ratio, the amount of anchor text, the fraction of popular words, the average word length, and other statistics in spam and normal pages, summarized a series of heuristic features, cast content-spam detection as a binary classification problem, and trained a decision tree classifier that detects most content-spam pages. In link-based spam detection, the earliest influential work is the TrustRank algorithm proposed by Gyongyi et al., whose starting point is that "good pages seldom point to spam pages": by manually selecting a seed set of high reputation and propagating trust along the hyperlinks of the web graph, a trust score is obtained for each page, and all pages are then divided into two classes, Spam and Normal. B. Wu and Davison proposed a method for detecting cloaking spam: each URL is crawled twice by a crawler and once more by a client imitating a browser, and the content differences between the copies are then computed to judge whether redirection spam exists. The drawback of this method is that the repeated crawling increases the processing burden on the search engine and consumes massive bandwidth; more importantly, it requires the search engine's crawler to issue requests that look like ordinary browsing, which violates the Robots Exclusion Standard.
Spam detection methods based on machine learning have become the focus of recent research. The war between search engines and web spammers resembles an arms race: after a search engine discovers and deploys an effective method, the spammers soon find countermeasures and invent new spamming forms. A machine-learning-based method can respond to a new spamming form by adding or deleting the relevant features, maintaining the system's detection effectiveness without modifying the system architecture. However, machine-learning-based detection faces two difficult problems: 1. acquiring the samples required for machine learning consumes a great deal of manpower, so the cost is high; 2. websites of high reputation are much easier to obtain from internet data than spam websites, so the ratio of spam to non-spam websites is severely imbalanced, and traditional learning algorithms have difficulty achieving good results when learning from imbalanced samples.
Summary of the invention
To solve the high acquisition cost of the samples required by existing machine learning methods and the difficulty traditional learning algorithms have in obtaining good results from imbalanced samples, the objective of the present invention is to reduce the manpower required to obtain samples, thereby reducing cost, and to obtain good results when learning from imbalanced samples. To this end, the invention provides a search engine web spam detection method based on a small sample set.
To realize the described purpose, the technical scheme of the search engine web spam detection method based on a small sample set of the present invention is as follows:
Step S1: preprocess all web page samples and divide the sample set into a training set, a test set, and an unlabelled set;
Step S2: use the divided training set and unlabelled set to perform classifier-based self-learning and link learning based on the internet topology, in order to expand the training set;
Step S3: on the expanded training set, train a classifier with an ensemble under-sampling strategy, and use the trained classifier to detect the samples in the test set;
Step S4: post-process the detection results with label propagation based on predicted spamicity, completing the search engine spam detection.
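The four steps S1-S4 can be read as a pipeline. The sketch below is only an aid to understanding, not the claimed implementation: the function names are illustrative, and `classify` and `propagate` are hypothetical stand-ins for the ensemble under-sampling classifier of step S3 and the label propagation of step S4 described in the embodiment.

```python
def detect_web_spam(samples, classify, propagate):
    """Pipeline skeleton for steps S1-S4 (illustrative, hypothetical names).

    `classify(train, s)` returns (predicted_label, confidence);
    `propagate(s, pred, conf)` post-processes one preliminary result.
    """
    # Step S1: preprocessing -- split by presence of a manual label.
    labelled = [s for s in samples if "label" in s]
    unlabelled = [s for s in samples if "label" not in s]
    train, test = labelled[::2], labelled[1::2]
    # Step S2: training-set expansion -- confident predictions on the
    # unlabelled set join the training set with their predicted labels.
    for s in unlabelled:
        pred, conf = classify(train, s)
        if conf > 0.9:  # illustrative confidence threshold
            train.append({**s, "label": pred})
    # Step S3: preliminary detection on the test set.
    prelim = [(s, *classify(train, s)) for s in test]
    # Step S4: label propagation refines the preliminary results.
    return [propagate(s, pred, conf) for s, pred, conf in prelim]
```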
According to an embodiment of the invention, the training set expansion of step S2 comprises classifier-based self-learning and link learning based on the internet topology; both learning processes iterate continuously so as to expand the training set step by step.
According to an embodiment of the invention, the ratio of spam to non-spam websites selected in each iteration is kept identical to the ratio in the original training set.
According to an embodiment of the invention, the classifier-based self-learning of step S2 trains a classifier on the training set samples, applies it to the unlabelled sample set, and uses a semi-supervised self-learning process to select the J1 samples with the highest prediction confidence, adding them to the training set together with their predicted labels.
According to an embodiment of the invention, the link learning of step S2 based on the internet topology marks the internet link graph with the training set samples, learns from the distribution regularities of spam and non-spam websites in the link graph, and selects the J2 websites with the greatest propagated confidence, adding them and their predicted labels to the training set.
According to an embodiment of the invention, the under-sampling classification strategy of step S3 adopts an adjustable under-sampling ratio and an ensemble of sub-classifiers fused by predicted spamicity, and the algorithm itself is suitable for distributed computing.
According to an embodiment of the invention, the detection result post-processing stage of step S4 is label propagation along the internet topology based on predicted spamicity.
The training set expansion algorithm of the invention effectively exploits both the self-learning of the learning algorithm and the internet link topology information, solving to a certain extent the sample scarcity problem that machine learning methods face in web spam detection. The ensemble random under-sampling classification strategy effectively exploits the information of the highly reputable websites (pages) that exist abundantly on the internet, and overcomes the sample imbalance problem. In traditional classification problems the samples are mutually independent, whereas websites as detection objects are interdependent; the detection result optimization carried out over the internet hyperlink graph makes full use of exactly this point and further improves the detection performance.
Embodiment
The present invention is described in detail below in conjunction with the accompanying drawings. It should be noted that the described embodiments are only intended to facilitate understanding of the invention and do not limit it in any way.
To realize the method of the invention, considering that the algorithm involves repeated resampling and iteration, when implemented on a single machine it is preferable that the processor clock frequency is not lower than 2 GHz and the memory is not lower than 1 GB; any commonly used programming language can be used.
The overall flow of the web spam detection method based on a small sample set proposed by the invention is shown in Figure 1, and the data flow of each step is given by Figures 2, 3, and 4. The preprocessing part (step S1) prepares the data for the whole spam detection; step S2 is the iterative training set expansion process, namely classifier-based self-learning and link learning based on the network topology; step S3 trains a classifier on the expanded training set and uses the learned classifier to detect the samples to be detected, applying the ensemble random under-sampling learning strategy (ERUS) in this process; step S4 is the optimization step of the detection results, namely label propagation based on predicted spamicity.
A large number of statistics show that the website where a spam page resides is usually itself a spam website, and the formulation of the Webspam-UK2006 standard data set is based on exactly this point; unless otherwise specified, the spam samples and "Spam" in the present invention both denote spam websites. Each main step is described in detail below.
1. Preprocessing (step S1)
Preprocessing is the first step of the whole system. As the preparatory stage, it completes the work shown in Figure 2, including web page crawling (step S11), hyperlink graph construction (step S12), web page content extraction (step S13), feature extraction (step S14), manual labelling (manually classifying part of the websites as Spam or Normal) (step S15), and the sample set division carried out in step S16 (into training set, test set, and unlabelled set). Mature methods exist for web page crawling, hyperlink graph construction, and content extraction, and they are not the emphasis of the present invention. Following the standard formulated for web page classification in Webspam-UK2006 (http://www.yr-bcn.es/webspam/datasets/uk2006-info/), the features extracted in the feature extraction part include content-related features and link-relation-based features. The present invention focuses on effective spam detection strategies under given features; for feature extraction refer to [C. Castillo, D. Donato, A. Gionis: Know your Neighbors: Web Spam Detection using the Web Topology. SIGIR 2007], not repeated here.
According to the extracted features and the manual classification, all samples are divided into three parts: a training set, a test set, and an unlabelled set.
2. Iterative expansion of the training set (step S2)
Step S2 is a semi-supervised learning process whose purpose is to progressively expand and improve the training set; the data it processes are the training sample set and the unlabelled sample set divided in step S1. The expansion process consists of two parts: classifier-based self-learning and link learning based on the network topology. The loop formed by steps S21, S22, S23, S24, and S25 in Figure 3 is the self-learning process, which is repeated for a set number of iterations. Step S21 learns from the training set, adopting the ensemble under-sampling learning strategy; step S22 receives the learned model parameters and saves them for classifying samples with unknown labels; step S23 applies the trained classifier to predict the samples in the unlabelled sample set; step S24 receives the predictions for the unlabelled samples and sorts them in descending order of prediction confidence. Step S25, according to a preset threshold I, adds the top J1 samples together with their predicted labels to the training set and simultaneously deletes them from the unlabelled sample set; condition 1 in Figure 3 is the conditional expression judging whether a sample ranks in the top I. The classifier learning in this step adopts the ensemble under-sampling strategy, whose concrete implementation is set forth in step 3. The ratio of spam to non-spam websites selected in each self-learning iteration is required to be identical to the ratio in the original training set.
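The self-learning loop of steps S21-S25 can be sketched as follows. The classifier is abstracted as a scoring function `score` standing in for the ensemble under-sampling classifier of steps S21/S22; the class-ratio constraint stated above is left to the caller. This is a sketch under those assumptions, not the claimed implementation.

```python
def self_learning(train, unlabelled, score, iterations, j1):
    """Steps S21-S25: iteratively move the most confident predictions
    from the unlabelled set into the training set.

    `score(train, sample)` returns (predicted_label, confidence); it is an
    assumed interface for the trained classifier of steps S21/S22.
    """
    for _ in range(iterations):                 # fixed iteration count
        if not unlabelled:
            break
        # Steps S23/S24: predict all unlabelled samples, sort by confidence.
        preds = [(s, *score(train, s)) for s in unlabelled]
        preds.sort(key=lambda t: t[2], reverse=True)
        # Step S25: the top-J1 samples join the training set with their
        # predicted labels and leave the unlabelled set.
        for sample, label, _conf in preds[:j1]:
            train.append({**sample, "label": label})
            unlabelled.remove(sample)
    return train, unlabelled
```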
Figure 5 shows the website-level link topology, where directed edges represent hyperlinks between websites, dark circles represent spam websites, white circles represent non-spam websites, and grey circles are unlabelled websites; the grey circles (websites) nearest the numbers in the figure are numbered for convenience of description, and the letters (A-E) in the lower figure denote different simple network topologies. To remove the influence of noisy hyperlinks, a directed edge exists between two websites if and only if the number of hyperlinks from one website to the other is not less than a threshold T_num, as shown in Figure 5; in our tests we take T_num = 5.
Steps S26, S27, and S28 are the second phase of training set expansion, namely the link learning process based on the internet topology. First, in step S26, the Web link graph is marked using the training set. The goal of the link learning of step S27 is to predict the label of a node in the link graph according to the labels of its in- and out-neighbours. C. Castillo et al. [C. Castillo, D. Donato, A. Gionis: Know your Neighbors: Web Spam Detection using the Web Topology. SIGIR 2007] point out that "the in-links of non-spam websites are seldom spam websites; they usually link to other non-spam websites" and that "the in-links of spam websites are often spam websites"; our statistics on actual internet data also confirm this. The purpose of link learning is to expand the training set rather than merely to produce predictions, and the quality of the selected samples directly affects the classification performance after the training set expansion; this requires the labelling of samples to satisfy stricter restrictions. For convenience of description, let inlink_normal(i) denote the number of non-spam websites among the in-links of node i and inlink_spam(i) the number of spam websites among its in-links; similarly, let outlink_normal(i) and outlink_spam(i) denote the numbers of normal and spam websites among the out-links of node i. The prediction-label confidences of the link learning process of step S27 are computed by formulas 1 and 2:
where 0 < ε < 1 is a smoothing factor, and K1 and K2 take values according to the following rules:
IF (outlink_normal(i) > 0 AND inlink_normal(i) > 0) THEN
    IF (inlink_spam(i) = 0 AND outlink_spam(i) = 0) THEN
        K1 = M1
    ELSE
        K1 = M2
    END IF
ELSE
    K1 = M3
END IF

IF (outlink_spam(i) > 0 AND inlink_spam(i) > 0) THEN
    IF (inlink_normal(i) = 0 AND outlink_normal(i) = 0) THEN
        K2 = N1
    ELSE
        K2 = N2
    END IF
ELSE
    K2 = N3
END IF
where M1 > M2 > M3 >= 1 and N1 > N2 > N3 >= 1. According to formulas 1 and 2, the J_normal and J_spam nodes with the largest values of Conf_normal(i) and Conf_spam(i) respectively are selected and put into the training set with their features and predicted labels, in preparation for the next iteration; the selection is judged by condition 2 of step S28 in Figure 3. In practice we take M1 = N1 = 3 > M2 = N2 = 2 > M3 = N3 = 1, ε = 0.5, J_spam = 1, J_normal = L * J_spam, and J2 = J_normal + J_spam, where L is the ratio of Normal samples to Spam samples in the original training set and J2 denotes the number of websites with the greatest propagated confidence selected in one iteration. With these values, in the network link graph of Figure 5, Conf_spam(1) > Conf_spam(2) > Conf_spam(3) and Conf_normal(4) > Conf_normal(5) > Conf_normal(6), where the digits denote the correspondingly numbered websites of Figure 5. In the link learning iterations, the values of J_normal and J_spam guarantee that the ratio of selected spam to non-spam websites is identical to the ratio in the original training set.
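Since formulas 1 and 2 themselves are not reproduced in this text, the sketch below covers only the fully specified part of the link learning of step S27: the coefficient rules for K1 and K2, in terms of the neighbour counts inlink_normal, inlink_spam, outlink_normal, and outlink_spam defined above.

```python
def k_coefficients(in_norm, in_spam, out_norm, out_spam,
                   m=(3, 2, 1), n=(3, 2, 1)):
    """K1/K2 selection rules of step S27.

    `m` and `n` are (M1, M2, M3) and (N1, N2, N3) with M1 > M2 > M3 >= 1
    and N1 > N2 > N3 >= 1; the defaults are the values used in the text.
    """
    # K1: largest when the node has normal neighbours on both sides and
    # no spam neighbours at all.
    if out_norm > 0 and in_norm > 0:
        k1 = m[0] if in_spam == 0 and out_spam == 0 else m[1]
    else:
        k1 = m[2]
    # K2: the symmetric rule for spam neighbourhoods.
    if out_spam > 0 and in_spam > 0:
        k2 = n[0] if in_norm == 0 and out_norm == 0 else n[1]
    else:
        k2 = n[2]
    return k1, k2
```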
3. Spam detection (step S3)
As shown in Figure 4, the training set used by step S3 is the training set expanded by step S2. Step S31 trains a classifier on the expanded training set; step S32 receives the learned model parameters and saves them for classifying unlabelled samples; step S33 uses the trained classifier to test the test samples; step S34 receives the preliminary spam detection results and prepares the data for step S4. The training of the classifier on the training set adopts the ensemble random under-sampling learning strategy (ERUS), implemented as follows:
--------------------------
Input:  M: minority-class (spam) samples
        S: majority-class (non-spam) samples
        n: number of samplings
        K: sampling ratio
        x: test sample
Output: detection result for sample x
--------------------------
1.  i = 0
2.  while i < n do
3.      randomly sample a subset S_i from S (|S_i| < |S|, |S_i| = K*|M|)
4.      train classifier C_i on S_i and M
5.      save the learned model
6.      i = i + 1
7.  end while
8.  spamicity = 0
9.  for i = 0 to n do
10.     test sample x with model C_i
11.     compute PS(x, C_i) with formula 3
12.     spamicity = spamicity + PS(x, C_i)
13. end for
14. spamicity = spamicity / n
15. if (spamicity >= 0.5) then
16.     sample x belongs to the spam class
17. else
18.     sample x belongs to the non-spam class
19. end if
--------------------------
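A minimal runnable sketch of the ERUS algorithm follows. The `train` function and the `scores` method are assumed interfaces for the underlying learner; since formula 3 is not reproduced in this text, PS is taken here as the normalized spam score, which is an assumption.

```python
import random

def erus(minority, majority, n, k, train):
    """ERUS learning phase (algorithm lines 1-7): train n sub-classifiers,
    each on the full minority class plus a random majority subset of size
    k * |minority|.  `train(spam, normal)` returns a classifier exposing
    `scores(x) -> (p_spam, p_normal)` (an assumed interface).
    """
    size = min(len(majority), int(k * len(minority)))
    return [train(minority, random.sample(majority, size)) for _ in range(n)]

def erus_predict(models, x):
    """ERUS test phase (lines 8-19): fuse the sub-classifiers by their
    predicted spamicity rather than by predicted labels."""
    spamicity = 0.0
    for model in models:
        p_spam, p_normal = model.scores(x)
        spamicity += p_spam / (p_spam + p_normal)  # PS(x, C_i), assumed form
    spamicity /= len(models)                       # average over the ensemble
    return "spam" if spamicity >= 0.5 else "normal"
```

Because each sub-classifier is trained and tested independently, the two loops distribute naturally over machines, matching the text's remark on distributed computing.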
Commonly used learning algorithms such as Bayes, C4.5, bagging, and AdaBoost can all be applied within this learning strategy. The fusion of the under-sampled sub-classifiers is based on the predicted spamicity (PS) of each sub-classifier. Similar to the definition of prediction confidence in learning algorithms, PS is described by formula 3, where x is a test sample, C is a specific classifier, and P_spam(x, C) and P_normal(x, C) respectively denote the probability or distance value with which classifier C predicts that sample x belongs to the spam class and the normal class.
Under-sampling is an available strategy for solving class-imbalanced learning: given a minority-class sample set M and a majority-class sample set S, a subset S_i is randomly under-sampled from S, where |M| < |S| (|*| denotes the number of samples in set *). The under-sampling strategy uses only part of the majority-class samples to train the classifier, so after resampling the numbers of samples of the different classes in the training set become relatively balanced; experiments show this method is very effective. However, using only part of the majority-class samples means that a large amount of useful information contained in the discarded majority-class samples cannot be exploited. Here we adopt the ensemble under-sampling strategy to remedy this defect and to mine the useful information in the majority-class samples to the greatest extent. The under-sampling ratio K is adjustable and is tuned for different data sets. In the ERUS algorithm, the fusion of the sub-classifier results is based on the PS values rather than on the predicted labels.
In the ERUS algorithm, lines 1-7 are the learning phase and lines 8-19 are the test phase. The learning and testing of each sub-classifier are independent processes, so the algorithm is suitable for distributed computing.
4. Detection result optimization (step S4)
For the preliminary prediction results given by step S3, given a threshold T (0 < T < 0.5), all samples whose predicted spamicity lies in the interval [T, 1-T] undergo label propagation based on the Web topology, where the labels in the Web graph are marked by the training set samples (step S41). In step S42, the propagated spamicity value is computed by formula 2. Unlike the selection of the J_spam samples with the greatest propagated spamicity in the sample set expansion stage, the propagated spamicity here is used for the final judgement of whether a website is spam, so the value of K2 in formula 2 need not satisfy the rules of the link learning of step S27; here we take K2 = 1. If Conf_spam(i) > 1, the node is considered a spam website, otherwise a non-spam website. The rationale is to give a vote of confidence to the high-confidence predictions produced by the learning algorithm while remaining suspicious of the low-confidence predictions; the final labels of the latter samples are determined by propagation over the network topology. For network islands (websites without neighbours) whose prediction confidence lies in [T, 1-T], the label is still decided by the classifier trained in the previous stage. Step S43 receives the results optimized by label propagation, completing the final spam detection.