CN102521369A

CN102521369A - Multi-view web spam detection method

Info

Publication number: CN102521369A
Application number: CN2011104247014A
Authority: CN
Inventors: 张化祥
Original assignee: Shandong Normal University
Current assignee: Shandong Normal University
Priority date: 2011-12-16
Filing date: 2011-12-16
Publication date: 2012-06-27
Anticipated expiration: 2031-12-16
Also published as: CN102521369B

Abstract

The invention provides a multi-view web spam detection method. The method comprises the following steps: first, obtaining two views of every normal page and every spam page in the training data; then obtaining two views of the page to be detected; constructing a matrix for each of the obtained views; computing a normal norm and a spam norm; and comparing the normal norm with the spam norm. If the normal norm is less than the spam norm, the page to be detected is identified as a normal page; if the normal norm is greater than the spam norm, it is identified as a spam page; if the two norms are equal, it is identified at random as either a normal page or a spam page. The method is insensitive to imbalance in the training data, can detect multiple kinds of spam pages simultaneously, and has a simple detection process.

Description

A multi-view web spam page detection method

Technical field

The present invention relates to a multi-view web spam page detection method and belongs to the field of Internet information retrieval.

Background technology

Some website owners, seeking commercial gain, use improper means to deceive the ranking algorithms of search engines so that unimportant sites or pages rank higher, which corrupts search results. Related techniques include search engine optimization (SEO) and search engine marketing (SEM); collectively they are called search engine spam, i.e. Web spam (web spam pages). Web spam has become a major challenge for Web search and seriously degrades retrieval quality. It is also developing rapidly, and new spamming techniques keep appearing. Web spam takes three main forms: content-based, link-based, and page hiding. Current methods for detecting spam pages mostly use heuristic functions that detect one particular form of spam page. They cannot detect multiple kinds of spam page simultaneously, their detection time complexity is high, and they are sensitive to imbalance in the training data.

Training data imbalance means that the number of normal pages in the training data is much larger than the number of spam pages. The number of Web pages is enormous, and manually labelling each page as normal or spam is time-consuming and laborious, so only a portion of the pages can be labelled by hand. A classifier is trained on the labelled pages and then used to label the large number of unlabelled pages automatically, i.e. the learned classifier detects whether each unlabelled page is a normal page or a spam page.

The main techniques for detecting spam pages include content-based methods, link-based methods, statistical methods, and graph-theoretic methods. Content-based spam detection applies heuristic functions to the content features of spam pages, which makes it difficult to form a unified model. Some methods apply statistical techniques and analyse the distribution of page keywords to detect spam pages; these can address changes to search ranking caused by keyword repetition, modification of page content, and the like. Link-based page ranking algorithms such as PageRank and HITS ignore the influence of page content on page ranking, and so can also be used to detect content-based spam pages. When machine learning methods are applied to detect content spam pages, page content features are extracted first and a classification method is then used to perform the detection.

Link-based page ranking algorithms are widely used in search engines and can be used to detect link-based spam pages. Heuristic methods include the bipartite graph method, which judges whether a link is link spam according to whether a relevant subgraph exists in the link adjacency matrix. In addition, link spam pages can be detected by statistically analysing uncommon link structures and by new page ranking algorithms.

In recent years, machine learning techniques have been applied to Web spam detection: page features are extracted and a classifier is trained, giving fairly good detection performance. When the data dimension is large, however, these methods share the following problems: they are sensitive to imbalance in the training data, they cannot detect several different kinds of spam page simultaneously, and their detection time complexity is high.

Traditional classifiers such as decision trees, neural networks, and support vector machines all assume that the classes in the training data are roughly balanced, i.e. that the numbers of examples in the different classes differ little. Research shows that when the class sizes differ greatly, especially in two-class problems where one class has far more examples than the other, the learned classifier is much less accurate on the smaller class (the minority class). Improving the classification accuracy of the minority class is often the more important goal. Spam page data is scarce: the overwhelming majority of the pages collected are normal pages and only a minority are spam pages. Reducing the influence of training data imbalance on the classifier is therefore particularly important for improving the recognition rate of spam pages.
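The effect of imbalance described above can be illustrated with a small sketch. The page counts are hypothetical and chosen only for illustration; the "classifier" is a degenerate one that labels every page as normal, which is the behaviour an imbalanced training set tends to push a learner towards.

```python
# Hypothetical, illustrative counts: 990 normal pages and 10 spam pages.
n_normal, n_spam = 990, 10

# A degenerate classifier that labels every page as normal.
true_labels = ["normal"] * n_normal + ["spam"] * n_spam
predicted = ["normal"] * (n_normal + n_spam)

# Overall accuracy looks excellent ...
correct = sum(t == p for t, p in zip(true_labels, predicted))
accuracy = correct / len(true_labels)

# ... but no spam page is ever found.
spam_found = sum(t == "spam" and p == "spam"
                 for t, p in zip(true_labels, predicted))
spam_recall = spam_found / n_spam

print(f"accuracy = {accuracy:.2f}, spam recall = {spam_recall:.2f}")
# prints: accuracy = 0.99, spam recall = 0.00
```

High overall accuracy therefore says little about how well the minority (spam) class is recognised.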

Summary of the invention

The object of the invention is to solve the above problems by providing a multi-view web spam detection method. The method only needs to learn weight matrices from the training data and does not need to train a classifier, so it is insensitive to imbalance in the training data. It can detect multiple classes of spam page simultaneously, which is an advantage over existing methods that are effective only for one specific kind of spam page. The detection process is simple: learn the weight matrices from the training data, compute the norms of the differences, and decide from the relative size of the norms whether a new page is identified as a normal page or a spam page.

To achieve this object, the invention adopts the following technical solution:

A multi-view web spam detection method, comprising the following steps:

Step 1: first obtain the content view and the link view of every normal page and every spam page in the training data;

Step 2: then obtain the content view and the link view of the page to be detected;

Step 3: construct one matrix from the content views and one from the link views of all normal pages in step 1, giving the normal content matrix and the normal link matrix;

Step 4: construct one matrix from the content views and one from the link views of all spam pages in step 1, giving the spam content matrix and the spam link matrix;

Step 5: construct one matrix from the content view and one from the link view of the page to be detected, giving the content matrix to be detected and the link matrix to be detected;

Step 6: solve for the weight matrix $W_-$ using the normal content matrix and the content matrix to be detected, and solve for the weight matrix $W_+$ using the spam content matrix and the content matrix to be detected;

Step 7: solve for the approximate matrix $B_1$ using the normal link matrix and the weight matrix $W_-$, and solve for the approximate matrix $B_2$ using the spam link matrix and the weight matrix $W_+$;

Step 8: solve for the normal norm $E_-$ using the approximate matrix $B_1$ from step 7 and the link matrix to be detected, and solve for the spam norm $E_+$ using the approximate matrix $B_2$ and the link matrix to be detected;

Step 9: compare the normal norm $E_-$ with the spam norm $E_+$; if the normal norm is less than the spam norm, the page to be detected is a normal page; if the normal norm is greater than the spam norm, the page to be detected is a spam page; if the two are equal, the page to be detected is identified at random as a normal page or a spam page;

Step 10: if the page to be detected is identified as a normal page, it is retained; otherwise it is deleted from the page pool; detection then ends.
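Steps 1 to 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: each view is assumed to be a numeric feature vector, each page is assumed to become one column of a matrix, and the function name `build_view_matrices` and all feature values are hypothetical.

```python
import numpy as np

def build_view_matrices(pages):
    """Stack the two views of each page as columns of two matrices.

    `pages` is a list of (content_vector, link_vector) pairs.
    Returns (A, B): A is content_dim x n_pages, B is link_dim x n_pages.
    The column-per-page layout is an assumption made for this sketch.
    """
    A = np.column_stack([np.asarray(c, dtype=float) for c, _ in pages])
    B = np.column_stack([np.asarray(l, dtype=float) for _, l in pages])
    return A, B

# Hypothetical toy data: 3 content features and 2 link features per page.
normal_pages = [([0.9, 0.1, 0.0], [5.0, 1.0]),
                ([0.8, 0.2, 0.1], [4.0, 2.0])]
spam_pages = [([0.1, 0.9, 0.7], [0.0, 9.0])]
page_to_check = ([0.7, 0.3, 0.1], [4.5, 1.5])

A_minus, B_minus = build_view_matrices(normal_pages)   # step 3
A_plus, B_plus = build_view_matrices(spam_pages)       # step 4
A_x, B_x = build_view_matrices([page_to_check])        # step 5

print(A_minus.shape, B_minus.shape)  # (3, 2) (2, 2)
print(A_plus.shape, B_plus.shape)    # (3, 1) (2, 1)
print(A_x.shape, B_x.shape)          # (3, 1) (2, 1)
```

With this layout the content and link matrices of one class always have the same number of columns, which is what allows one set of weights to be reused across the two views in steps 6 and 7.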

In step 6, the weight matrix $W_-$ is solved by the following formula:

$$\min_{W_-} \; \bigl\| [A_x] - [A_-]\,W_- \bigr\|^2$$

$$\text{s.t.} \quad \| W_- \|^2 = 1$$

The formula states that, subject to the constraint $\|W_-\|^2 = 1$, one solves for the weight matrix $W_-$ that minimizes $\|[A_x] - [A_-]W_-\|^2$. The minimization makes the content matrix $[A_-]W_-$, constructed from the weights $W_-$ and $[A_-]$, differ as little as possible from the content matrix $[A_x]$ of the page to be detected.

In step 6, the weight matrix $W_+$ is solved by the following formula:

$$\min_{W_+} \; \bigl\| [A_x] - [A_+]\,W_+ \bigr\|^2$$

$$\text{s.t.} \quad \| W_+ \|^2 = 1$$

The formula states that, subject to the constraint $\|W_+\|^2 = 1$, one solves for the weight matrix $W_+$ that minimizes $\|[A_x] - [A_+]W_+\|^2$. The minimization makes the content matrix $[A_+]W_+$, constructed from the weights $W_+$ and $[A_+]$, differ as little as possible from the content matrix $[A_x]$ of the page P to be detected.
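The text does not say which solver is used for the two constrained problems of step 6. As one possible sketch, the code below uses projected gradient descent on the unit sphere: take a gradient step on the squared residual, then rescale the weights back to unit norm. The solver choice, the function name `solve_unit_norm_weights`, the iteration count, and the toy data are all assumptions; the norm is taken to be the Euclidean norm, and the iteration reaches a stationary point that is not guaranteed to be the global minimum.

```python
import numpy as np

def solve_unit_norm_weights(A, a_x, n_iter=500):
    """Approximately minimise ||a_x - A w||^2 subject to ||w|| = 1.

    Projected gradient descent: gradient step, then renormalise w.
    This solver is an illustrative assumption, not the patent's method.
    """
    A = np.asarray(A, dtype=float)
    a_x = np.asarray(a_x, dtype=float).reshape(-1)
    n = A.shape[1]
    w = np.full(n, 1.0 / np.sqrt(n))       # feasible start with ||w|| = 1
    sigma_max = np.linalg.norm(A, 2)       # largest singular value of A
    if sigma_max == 0.0:
        return w                           # objective does not depend on w
    for _ in range(n_iter):
        half_grad = A.T @ (A @ w - a_x)    # half of the gradient 2 A^T (A w - a_x)
        # Step size 1/L on the full gradient, with L = 2 * sigma_max**2.
        w_new = w - half_grad / sigma_max**2
        norm = np.linalg.norm(w_new)
        if norm == 0.0:
            break                          # cannot project the zero vector
        w = w_new / norm                   # project back onto the unit sphere
    return w

# Hypothetical toy data: two normal pages, three content features each.
A_minus = np.array([[0.9, 0.8],
                    [0.1, 0.2],
                    [0.0, 0.1]])
a_x = np.array([0.7, 0.3, 0.1])

w_minus = solve_unit_norm_weights(A_minus, a_x)
print(np.linalg.norm(w_minus))             # approximately 1.0
```

The same function would be called with the spam content matrix to obtain $W_+$. With step size $1/L$ each iteration cannot increase the objective, so the result is never worse than the uniform starting point.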

In step 7, the approximate matrices $B_1$ and $B_2$ are calculated by the following formulas:

$$B_1 = [B_-]\,W_- \tag{1}$$

$$B_2 = [B_+]\,W_+ \tag{2}$$

Formula (1) calculates the approximate matrix $B_1$ of the page to be detected from the transformation matrix $W_-$ and $[B_-]$; formula (2) calculates the approximate matrix $B_2$ of the page to be detected from the transformation matrix $W_+$ and $[B_+]$.
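The shapes involved in formulas (1) and (2) can be made explicit. The block below is a reading of the formulas under the assumption that each page is one column of its matrix; the symbols $d_c$ (number of content features), $d_l$ (number of link features), $n_-$ (number of normal training pages) and $n_+$ (number of spam training pages) are introduced here for illustration and do not appear in the original text.

```latex
\begin{aligned}
{[A_-]} &\in \mathbb{R}^{d_c \times n_-}, & [B_-] &\in \mathbb{R}^{d_l \times n_-}, & W_- &\in \mathbb{R}^{n_-}, & B_1 = [B_-]\,W_- &\in \mathbb{R}^{d_l},\\
{[A_+]} &\in \mathbb{R}^{d_c \times n_+}, & [B_+] &\in \mathbb{R}^{d_l \times n_+}, & W_+ &\in \mathbb{R}^{n_+}, & B_2 = [B_+]\,W_+ &\in \mathbb{R}^{d_l}.
\end{aligned}
```

Under this reading the weights are learned in the content view, where they have one entry per training page, and are then reused in the link view, so $B_1$ and $B_2$ have the same shape as the link matrix $[B_x]$ of the page to be detected.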

In step 8, the normal norm $E_-$ and the spam norm $E_+$ are solved by the following formulas:

$$E_- = \bigl\| [B_x] - B_1 \bigr\|^2$$

$$E_+ = \bigl\| [B_x] - B_2 \bigr\|^2$$

The size of each norm represents how much the corresponding approximate matrix differs from the link matrix to be detected: the larger the norm, the greater the difference between the approximate matrix and the link matrix to be detected; conversely, the smaller the norm, the smaller the difference.
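A small numeric sketch of steps 7 and 8 follows. It assumes the norm is the squared Euclidean norm, that each training page is one column of the link matrix, and that unit-norm weights are already available; the function name `squared_norm_gap` and all values are hypothetical.

```python
import numpy as np

def squared_norm_gap(B_train, w, b_x):
    """Return ||b_x - B_train @ w||^2 (squared Euclidean norm)."""
    approx = np.asarray(B_train, dtype=float) @ np.asarray(w, dtype=float)
    diff = np.asarray(b_x, dtype=float) - approx
    return float(diff @ diff)

# Hypothetical toy values; both weight vectors have unit norm.
B_minus = np.array([[5.0, 4.0],
                    [1.0, 2.0]])   # link views of two normal pages
w_minus = np.array([0.6, 0.8])
B_plus = np.array([[0.0],
                   [9.0]])         # link view of one spam page
w_plus = np.array([1.0])
b_x = np.array([4.5, 1.5])         # link view of the page to be detected

E_minus = squared_norm_gap(B_minus, w_minus, b_x)   # approximately 3.38
E_plus = squared_norm_gap(B_plus, w_plus, b_x)      # 76.5
print(E_minus, E_plus)
```

In this toy case the normal norm is much smaller than the spam norm, so the link view of the page is better approximated by the normal training pages.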

Beneficial effects of the invention: the invention proposes a multi-view spam page detection method that can detect multiple types of spam page simultaneously, which improves detection efficiency. The invention also needs no trained classifier, which avoids the influence of training data imbalance on classification, so the detection algorithm is insensitive to data imbalance.

Description of drawings

Fig. 1 shows the conversion of the views of the training data into matrices;

Fig. 2 shows the conversion of the views of the page to be detected into matrices;

Fig. 3a shows the process of solving for the weight matrix $W_-$;

Fig. 3b shows the process of solving for the weight matrix $W_+$;

Fig. 4a shows the calculation of the approximate matrix $B_1$;

Fig. 4b shows the calculation of the approximate matrix $B_2$;

Fig. 5 shows the page detection process.

Embodiment

The invention is further described below with reference to the accompanying drawings and an embodiment.

The object of the invention is to provide a general detection method for multiple kinds of spam page.

To achieve this object, the technical solution of the invention proposes a multi-view representation of page features, which differs from traditional page feature representations. The method represents each page with two views: the same web page is represented both by a content-based feature vector (called the content view) and by a hyperlink-based feature vector (called the link view), so every page corresponds to two views, the content view and the link view. The training data is page data that has been explicitly labelled as normal or as spam. The content views of all pages labelled normal in the training data form the normal content matrix, denoted $[A_-]$, and the link views of all pages labelled normal form the normal link matrix, denoted $[B_-]$. The content views of all pages labelled spam in the training data form the spam content matrix, denoted $[A_+]$, and the link views of all pages labelled spam form the spam link matrix, denoted $[B_+]$, as shown in Fig. 1. The content view of each page P to be detected forms the content matrix to be detected, denoted $[A_x]$, and the link view of each page P to be detected forms the link matrix to be detected, denoted $[B_x]$, as shown in Fig. 2. $[A_x]$ is reconstructed from $[A_-]$ and from $[A_+]$ respectively by matrix transformation, and the corresponding transformation matrices $W_-$ and $W_+$ are learned, as shown in Figs. 3a and 3b. The approximate matrix $B_1$ of the page to be detected is constructed from the transformation matrix $W_-$ and $[B_-]$, and the approximate matrix $B_2$ of the page to be detected is constructed from the transformation matrix $W_+$ and $[B_+]$, as shown in Figs. 4a and 4b.

The specific construction method is detailed below. The norms of the differences between $[B_x]$ and each of the matrices $B_1$ and $B_2$ are then computed, and the relative size of the two norms decides whether the page P to be detected is identified as a normal page or a spam page.

The learning of the transformation matrices $W_-$ and $W_+$ and the construction of the approximate matrices are described further below.

Specifically:

1: Learn the transformation matrices $W_-$ and $W_+$

The transformation matrix $W_-$ is solved as follows:

$$\min_{W_-} \; \bigl\| [A_x] - [A_-]\,W_- \bigr\|^2 \tag{1}$$

$$\text{s.t.} \quad \| W_- \|^2 = 1$$

Formula (1) states that, subject to the constraint $\|W_-\|^2 = 1$, one solves for the weight matrix $W_-$ that minimizes $\|[A_x] - [A_-]W_-\|^2$. The minimization makes the content matrix $[A_-]W_-$, constructed from the weights $W_-$ and $[A_-]$, differ as little as possible from the content matrix $[A_x]$ of the page P to be detected.

The transformation matrix $W_+$ is solved as follows:

$$\min_{W_+} \; \bigl\| [A_x] - [A_+]\,W_+ \bigr\|^2 \tag{2}$$

$$\text{s.t.} \quad \| W_+ \|^2 = 1$$

Formula (2) states that, subject to the constraint $\|W_+\|^2 = 1$, one solves for the weight matrix $W_+$ that minimizes $\|[A_x] - [A_+]W_+\|^2$. The minimization makes the content matrix $[A_+]W_+$, constructed from the weights $W_+$ and $[A_+]$, differ as little as possible from the content matrix $[A_x]$ of the page P to be detected.
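The text gives no solution procedure for formulas (1) and (2). One standard way to characterise their solutions, offered here only as a sketch, is the method of Lagrange multipliers. The multiplier $\lambda$, the identity matrix $I$, and the reading of $\|\cdot\|$ as the Euclidean norm are assumptions not found in the original text. For formula (1):

```latex
\begin{aligned}
\mathcal{L}(W_-, \lambda) &= \bigl\| [A_x] - [A_-]W_- \bigr\|^2 + \lambda \left( \| W_- \|^2 - 1 \right),\\
\frac{\partial \mathcal{L}}{\partial W_-} = 0
&\;\Longrightarrow\; \left( [A_-]^{\top}[A_-] + \lambda I \right) W_- = [A_-]^{\top}[A_x],\\
\frac{\partial \mathcal{L}}{\partial \lambda} = 0
&\;\Longrightarrow\; \| W_- \|^2 = 1 .
\end{aligned}
```

A candidate solution is therefore a regularised least-squares solution whose multiplier $\lambda$ is chosen so that the weights have unit norm. The conditions for formula (2) are identical with $[A_+]$ and $W_+$ in place of $[A_-]$ and $W_-$. These are necessary conditions only; a numerical solver is still needed to find $\lambda$.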

2: Calculate the approximate matrices $B_1$ and $B_2$

$B_1$ and $B_2$ are calculated as follows:

$$B_1 = [B_-]\,W_- \tag{3}$$

$$B_2 = [B_+]\,W_+ \tag{4}$$

Formula (3) calculates the approximate matrix $B_1$ of the page P to be detected from the transformation matrix $W_-$ and $[B_-]$; formula (4) calculates the approximate matrix $B_2$ of the page P to be detected from the transformation matrix $W_+$ and $[B_+]$.

3: Compute the norms of the differences between the link matrix to be detected $[B_x]$ of page P and the matrices $B_1$ and $B_2$

Compute the norm $E_- = \|[B_x] - B_1\|^2$ and the norm $E_+ = \|[B_x] - B_2\|^2$. The size of each norm represents how much the corresponding approximate matrix differs from the link matrix to be detected. The larger the norm, the greater the difference between the approximate matrix and the link matrix to be detected; conversely, the smaller the norm, the smaller the difference.

4: Decide the class of page P

If $E_- > E_+$, page P is identified as a spam page; if $E_- < E_+$, page P is identified as a normal page; if $E_- = E_+$, page P is identified at random as one of the two, as shown in Fig. 5. If page P is identified as a normal page, it is retained; otherwise P is deleted from the page pool.
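The decision rule and the page pool filtering can be sketched as follows. The function names `classify_page` and `filter_pool`, the page identifiers, and the norm values are hypothetical; the two norms are assumed to have been computed beforehand for each page.

```python
import random

def classify_page(E_minus, E_plus, rng=random):
    """Apply the decision rule: the smaller norm wins, ties are random."""
    if E_minus < E_plus:
        return "normal"
    if E_minus > E_plus:
        return "spam"
    return rng.choice(["normal", "spam"])

def filter_pool(pool, norms):
    """Keep only the pages identified as normal.

    `pool` is a list of page ids; `norms` maps page id -> (E_minus, E_plus).
    """
    return [p for p in pool if classify_page(*norms[p]) == "normal"]

# Hypothetical page ids and precomputed norm values.
norms = {"page_a": (3.38, 76.5), "page_b": (12.0, 0.5)}
print(filter_pool(["page_a", "page_b"], norms))  # ['page_a']
```

Passing a seeded `random.Random` instance as `rng` makes the tie-breaking reproducible, which can be useful when testing.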

Although specific embodiments of the invention have been described above with reference to the accompanying drawings, they do not limit the scope of protection of the invention. Those of ordinary skill in the art should understand that various modifications or variations that can be made on the basis of the technical solution of the invention without creative work still fall within the scope of protection of the invention.

Claims

1. A multi-view web spam detection method, characterized in that the method comprises the following steps:

Step 8: solve for the normal norm $E_-$ using the approximate matrix $B_1$ from step 7 and the link matrix to be detected, and solve for the spam norm $E_+$ using the approximate matrix $B_2$ and the link matrix to be detected;

Step 9: compare the normal norm $E_-$ with the spam norm $E_+$; if the normal norm is less than the spam norm, the page to be detected is a normal page; if the normal norm is greater than the spam norm, the page to be detected is a spam page; if the two are equal, the page to be detected is identified at random as a normal page or a spam page;

2. The multi-view web spam detection method according to claim 1, characterized in that, in step 6, the weight matrix $W_-$ is solved by the following formula:

$$\min_{W_-} \; \bigl\| [A_x] - [A_-]\,W_- \bigr\|^2$$

$$\text{s.t.} \quad \| W_- \|^2 = 1$$

The formula states that, subject to the constraint $\|W_-\|^2 = 1$, one solves for the weight matrix $W_-$ that minimizes $\|[A_x] - [A_-]W_-\|^2$. The minimization makes the content matrix $[A_-]W_-$, constructed from the weights $W_-$ and $[A_-]$, differ as little as possible from the content matrix $[A_x]$ of the page to be detected.

3. The multi-view web spam detection method according to claim 1, characterized in that, in step 6, the weight matrix $W_+$ is solved by the following formula:

$$\min_{W_+} \; \bigl\| [A_x] - [A_+]\,W_+ \bigr\|^2$$

$$\text{s.t.} \quad \| W_+ \|^2 = 1$$

The formula states that, subject to the constraint $\|W_+\|^2 = 1$, one solves for the weight matrix $W_+$ that minimizes $\|[A_x] - [A_+]W_+\|^2$. The minimization makes the content matrix $[A_+]W_+$, constructed from the weights $W_+$ and $[A_+]$, differ as little as possible from the content matrix $[A_x]$ of the page to be detected.

4. The multi-view web spam detection method according to claim 1, characterized in that, in step 7, the approximate matrices $B_1$ and $B_2$ are calculated by the following formulas:

$$B_1 = [B_-]\,W_- \tag{1}$$

$$B_2 = [B_+]\,W_+ \tag{2}$$

5. The multi-view web spam detection method according to claim 1, characterized in that, in step 8, the normal norm $E_-$ and the spam norm $E_+$ are solved by the following formulas:

$$E_- = \bigl\| [B_x] - B_1 \bigr\|^2$$

$$E_+ = \bigl\| [B_x] - B_2 \bigr\|^2$$

The invention discloses a multi-view web spam detection method comprising the following steps: first obtain two views of every normal page and every spam page in the training data; then obtain two views of the page to be detected; construct a matrix for each of the obtained views; compute the normal norm and the spam norm; and compare the normal norm with the spam norm. If the normal norm is less than the spam norm, the page to be detected is a normal page; if the normal norm is greater than the spam norm, the page to be detected is a spam page; if the two are equal, the page to be detected is identified at random as a normal page or a spam page. The method is insensitive to imbalance in the training data, can detect multiple kinds of spam page simultaneously, and has a simple detection process.