CN102521369A - Multi-view web spam detection method - Google Patents
- Publication number
- CN102521369A
- Authority
- CN
- China
- Prior art keywords
- page
- spam
- matrix
- norm
- detected
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Transfer Between Computers (AREA)
Abstract
The invention provides a multi-view web spam detection method. The method comprises the following steps: first, obtaining two views of all normal pages and spam pages in the training data; then obtaining two views of a page to be detected; constructing a matrix from each of the obtained views; solving for a normal norm and a spam norm; and comparing the normal norm with the spam norm. If the normal norm is less than the spam norm, the page to be detected is identified as a normal page; if the normal norm is greater than the spam norm, the page is identified as a spam page; if the two norms are equal, the page is identified at random as a normal page or a spam page. The method is insensitive to imbalance in the training data, can detect several kinds of spam pages at once, and has a simple detection process.
Description
Technical field
The present invention relates to a multi-view web spam page detection method and belongs to the field of Internet information retrieval.
Background technology
Some website owners, seeking commercial gain, use improper means to deceive search-engine ranking algorithms so that unimportant sites or pages rank higher, which corrupts search results. The related techniques, which include search engine optimization (SEO) and search engine marketing (SEM), are collectively referred to as search engine spam, i.e. web spam (spam pages). Web spam has become a major challenge for web search: it seriously degrades retrieval quality, it evolves rapidly, and new spamming techniques keep appearing. Web spam takes three main forms: content-based, link-based, and page hiding. Existing detection methods mostly rely on heuristic functions that target one particular form of spam page. They cannot detect several kinds of spam pages at once, their detection time complexity is high, and they are sensitive to imbalance in the training data.
Training data imbalance means that the normal pages in the training data far outnumber the spam pages. The number of web pages is enormous, and manually labeling pages as normal or spam is time-consuming and laborious, so only a portion of the pages can be labeled by hand. A classifier is trained on the labeled pages and then used to label the large number of unlabeled pages automatically, i.e. the trained classifier decides whether each unlabeled page is a normal page or a spam page.
The main techniques for detecting spam pages include content-based methods, link-based methods, statistical methods, and graph-theoretic methods. Content-based spam detection applies heuristic functions to the content features of spam pages, which makes it hard to form a unified model. Some methods use statistical techniques to analyze the distribution of page keywords; these can address changes to search ranking caused by keyword repetition, modification of page content, and similar manipulations. Link-based page ranking algorithms such as PageRank and HITS ignore the influence of page content on ranking and can therefore also be used to detect content-based spam pages. When machine learning is applied to detect content spam pages, page content features are extracted first and a classification method is then used to perform the detection.
Link-based page ranking algorithms are widely used in search engines and can be used to detect link-based spam pages. Heuristic approaches include the bipartite-graph method, which judges whether a link is link spam according to whether a relevant subgraph exists in the link adjacency matrix. In addition, link spam pages can be detected by statistically analyzing unusual link structures and by new page ranking algorithms.
In recent years machine learning has been applied to web spam detection: page features are extracted and a classifier is trained, which yields fairly good detection performance. When the data dimensionality is high, however, these methods share the following problems: sensitivity to imbalance in the training data, inability to detect several different kinds of spam pages at once, and high detection time complexity.
Traditional classifiers such as decision trees, neural networks, and support vector machines all assume that the classes in the training data are roughly balanced, i.e. that the class sizes differ little. Research shows that when the class sizes differ greatly, especially in two-class problems where one class far outnumbers the other, the learned classifier's accuracy on the smaller class (the minority class) drops sharply. Improving accuracy on the minority class is often the more important goal. Spam page data is scarce: the vast majority of collected pages are normal and only a minority are spam, so reducing the effect of training data imbalance on the classifier is particularly important for improving the recognition rate of spam pages.
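The effect of imbalance on the minority class can be illustrated with simple arithmetic. The sketch below uses a hypothetical labeled set (the 95/5 split is an assumption chosen for illustration, not a figure from the patent) and a degenerate classifier that calls every page normal: overall accuracy looks high while no spam page is caught.

```python
# Hypothetical labelled set: 95 normal pages and 5 spam pages.
labels = ["normal"] * 95 + ["spam"] * 5

# A degenerate "classifier" that calls every page normal.
predictions = ["normal"] * len(labels)

# Overall accuracy: fraction of pages whose prediction matches the label.
correct = sum(1 for y, p in zip(labels, predictions) if y == p)
accuracy = correct / len(labels)

# Recall on the minority class: fraction of spam pages actually flagged.
spam_total = sum(1 for y in labels if y == "spam")
spam_caught = sum(1 for y, p in zip(labels, predictions)
                  if y == "spam" and p == "spam")
spam_recall = spam_caught / spam_total

print(f"accuracy={accuracy:.2f}, spam recall={spam_recall:.2f}")
```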
Summary of the invention
The object of the invention is to solve the above problems by providing a multi-view web spam detection method. The method only needs to learn weight matrices from the training data and does not need to train a classifier, so it is insensitive to imbalance in the training data. It can detect several classes of spam pages at once, which is an advantage over existing methods that are effective only for one specific kind of spam page. The detection process is simple: learn the weight matrices from the training data, compute the norms of the differences, and decide from the relative size of the norms whether the new page is identified as a normal page or a spam page.
To achieve this object, the invention adopts the following technical scheme:
A multi-view web spam detection method comprising the following steps:
Step 1: first obtain the content view and the link view of all normal pages and spam pages in the training data;
Step 2: then obtain the content view and the link view of the page to be detected;
Step 3: construct a matrix from the content views and a matrix from the link views of all normal pages in step 1, obtaining the normal content matrix and the normal link matrix;
Step 4: construct a matrix from the content views and a matrix from the link views of all spam pages in step 1, obtaining the spam content matrix and the spam link matrix;
Step 5: construct a matrix from the content view and a matrix from the link view of the page to be detected, obtaining the content matrix to be detected and the link matrix to be detected;
Step 6: solve for the weight matrix $W_-$ using the normal content matrix and the content matrix to be detected, and solve for the weight matrix $W_+$ using the spam content matrix and the content matrix to be detected;
Step 7: solve for the approximate matrix $B_1$ using the normal link matrix and the weight matrix $W_-$, and solve for the approximate matrix $B_2$ using the spam link matrix and the weight matrix $W_+$;
Step 8: solve for the normal norm $E_-$ using the approximate matrix $B_1$ from step 7 and the link matrix to be detected, and solve for the spam norm $E_+$ using the approximate matrix $B_2$ and the link matrix to be detected;
Step 9: compare the normal norm $E_-$ with the spam norm $E_+$; if the normal norm is less than the spam norm, the page to be detected is a normal page; if the normal norm is greater than the spam norm, the page to be detected is a spam page; if the two are equal, the page to be detected is identified at random as a normal page or a spam page;
Step 10: if the page to be detected is identified as a normal page, it is retained; otherwise it is deleted from the page pool. Detection then ends.
In step 6, the weight matrix $W_-$ is solved for with the following formula:
$$\min \; \|[A_x] - [A_-]W_-\|_2$$
$$\text{s.t.} \; \|W_-\|_2 = 1$$
The formula states that, subject to the constraint $\|W_-\|_2 = 1$, the weight matrix $W_-$ is found that minimizes $\|[A_x] - [A_-]W_-\|_2$. The minimization makes the content matrix $[A_-]W_-$, constructed from the weights $W_-$ and $[A_-]$, differ as little as possible from the content matrix $[A_x]$ of the page under test.
In step 6, the weight matrix $W_+$ is solved for with the following formula:
$$\min \; \|[A_x] - [A_+]W_+\|_2$$
$$\text{s.t.} \; \|W_+\|_2 = 1$$
The formula states that, subject to the constraint $\|W_+\|_2 = 1$, the weight matrix $W_+$ is found that minimizes $\|[A_x] - [A_+]W_+\|_2$. The minimization makes the content matrix $[A_+]W_+$, constructed from the weights $W_+$ and $[A_+]$, differ as little as possible from the content matrix $[A_x]$ of the page P under test.
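The patent does not say how the constrained minimization in step 6 is solved. As one possible illustration, the sketch below uses projected gradient descent on the unit sphere, which is a technique substituted here and not the patent's stated method. It assumes that each column of the content matrix is the content view of one training page and that the page under test is a single vector; the function name `solve_unit_norm_weights` and the iteration count are hypothetical. Because the unit-norm constraint is not convex, the result should be read as an approximate stationary point, not a guaranteed global minimizer.

```python
import math

def solve_unit_norm_weights(A, a_x, iters=2000):
    """Approximately minimise ||a_x - A w||_2 subject to ||w||_2 = 1.

    A   : list of rows, shape (d, n); column j is assumed to be the content
          view of training page j
    a_x : list of length d, content view of the page under test
    """
    d, n = len(A), len(A[0])
    # 1 / ||A||_F^2 is no larger than 1 / lambda_max(A^T A), so it is a safe step.
    fro2 = sum(v * v for row in A for v in row)
    if fro2 == 0.0:
        raise ValueError("content matrix is all zeros")
    step = 1.0 / fro2
    w = [1.0 / math.sqrt(n)] * n              # start on the unit sphere
    for _ in range(iters):
        # residual = A w - a_x, gradient of 0.5 * ||residual||^2 is A^T residual
        residual = [sum(A[i][j] * w[j] for j in range(n)) - a_x[i] for i in range(d)]
        grad = [sum(A[i][j] * residual[i] for i in range(d)) for j in range(n)]
        candidate = [w[j] - step * grad[j] for j in range(n)]
        norm = math.sqrt(sum(v * v for v in candidate))
        if norm == 0.0:                       # cannot project the zero vector
            break
        w = [v / norm for v in candidate]     # project back onto ||w||_2 = 1
    return w

# Hypothetical toy data: two normal pages with two content features each.
A_minus = [[1.0, 0.0],
           [0.0, 1.0]]
a_x = [1.0, 0.0]
w_minus = solve_unit_norm_weights(A_minus, a_x)
```

The same function would be called a second time with the spam content matrix to obtain $W_+$.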
In step 7, the approximate matrices $B_1$ and $B_2$ are computed with the following formulas:
$$B_1 = [B_-]W_- \tag{1}$$
$$B_2 = [B_+]W_+ \tag{2}$$
Formula (1) computes the approximate matrix $B_1$ of the page to be detected from the transformation matrix $W_-$ and $[B_-]$. Formula (2) computes the approximate matrix $B_2$ of the page to be detected from the transformation matrix $W_+$ and $[B_+]$.
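Step 7 is a plain matrix product. A minimal sketch follows, assuming that the columns of the link matrices are training pages and that the weights are vectors, so each product is a single approximate link vector; the helper name `mat_vec` and all numbers are hypothetical.

```python
def mat_vec(M, w):
    """Multiply matrix M (a list of rows) by vector w."""
    return [sum(m * v for m, v in zip(row, w)) for row in M]

# Hypothetical data: 3 link features, 2 normal pages and 1 spam page.
B_minus = [[1.0, 0.0],
           [0.0, 1.0],
           [2.0, 2.0]]
B_plus = [[0.0],
          [5.0],
          [1.0]]
W_minus = [0.6, 0.8]   # unit 2-norm: 0.36 + 0.64 = 1
W_plus = [1.0]

B1 = mat_vec(B_minus, W_minus)   # approximation built from normal pages
B2 = mat_vec(B_plus, W_plus)     # approximation built from spam pages
```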
In step 8, the normal norm $E_-$ and the spam norm $E_+$ are solved for with the following formulas:
$$E_- = \|[B_x] - B_1\|_2$$
$$E_+ = \|[B_x] - B_2\|_2$$
The size of each norm indicates how much the corresponding approximate matrix differs from the link matrix to be detected. The larger the norm, the greater the difference between the approximate matrix and the link matrix to be detected; conversely, the smaller the norm, the smaller the difference.
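The two norms can be sketched as Euclidean distances, assuming the link view of the page under test and both approximations are vectors of equal length. The function name `residual_norm` and the numbers are hypothetical.

```python
import math

def residual_norm(b_x, approx):
    """2-norm of the difference between the test page's link view and an approximation."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(b_x, approx)))

b_x = [1.0, 1.0, 3.0]    # link view of the page under test
B1 = [1.0, 1.0, 0.0]     # differs by 3 in one coordinate
B2 = [1.0, 5.0, 0.0]     # differs by 4 and by 3

E_minus = residual_norm(b_x, B1)
E_plus = residual_norm(b_x, B2)
```

In this toy case $E_-$ is smaller than $E_+$, so the approximation built from normal pages is the closer one.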
Beneficial effects of the invention: the invention proposes a multi-view spam page detection method. Because it can detect multiple types of spam pages at once, detection efficiency is improved. The invention also does not need to train a classifier, which avoids the effect of training data imbalance on classification, so the detection algorithm is insensitive to data imbalance.
Description of drawings
Fig. 1 shows the views of the training data being converted into matrices;
Fig. 2 shows the views of the page to be detected being converted into matrices;
Fig. 3a shows the process of solving for the weight matrix $W_-$;
Fig. 3b shows the process of solving for the weight matrix $W_+$;
Fig. 4a shows the computation of the approximate matrix $B_1$;
Fig. 4b shows the computation of the approximate matrix $B_2$;
Fig. 5 shows the page detection process.
Embodiment
The invention is described further below with reference to the accompanying drawings and an embodiment.
The object of the invention is to provide a general detection method that covers multiple kinds of spam pages.
To achieve this object, the technical solution of the invention proposes a multi-view representation of page features, which differs from traditional page feature representations. The method represents a page with two views: the same web page is represented both by a content-based feature vector (the content view) and by a hyperlink-based feature vector (the link view), so each page corresponds to two views. The training data are the page data that have been explicitly labeled as normal or as spam. The content views of all pages labeled normal in the training data form the normal content matrix, denoted $[A_-]$, and their link views form the normal link matrix, denoted $[B_-]$. The content views of all pages labeled spam in the training data form the spam content matrix, denoted $[A_+]$, and their link views form the spam link matrix, denoted $[B_+]$, as shown in Fig. 1. The content view of each page P to be detected forms the content matrix to be detected, denoted $[A_x]$, and its link view forms the link matrix to be detected, denoted $[B_x]$, as shown in Fig. 2. $[A_x]$ is constructed from $[A_-]$ and from $[A_+]$ by matrix transformation, and the corresponding transformation matrices $W_-$ and $W_+$ are learned, as shown in Figs. 3a and 3b. The approximate matrix $B_1$ of the page to be detected is constructed from the transformation matrix $W_-$ and $[B_-]$, and the approximate matrix $B_2$ is constructed from the transformation matrix $W_+$ and $[B_+]$, as shown in Figs. 4a and 4b.
The specific construction is described in detail below. The norms of the differences between $[B_x]$ and each of $B_1$ and $B_2$ are then computed, and the relative size of the two norms decides whether the page P to be detected is identified as a normal page or a spam page.
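The construction of the four training matrices can be sketched as follows. The patent does not state whether pages are stored as rows or as columns; this sketch assumes one column per page, which makes the products $[A_-]W_-$ and $[B_-]W_-$ dimensionally consistent. The page dictionaries, their keys, and the function names are hypothetical.

```python
def stack_columns(vectors):
    """Place each feature vector in its own column; returns a list of rows."""
    return [list(row) for row in zip(*vectors)]

def build_training_matrices(pages):
    """Split labelled pages by class and build content and link matrices."""
    normal = [p for p in pages if p["label"] == "normal"]
    spam = [p for p in pages if p["label"] == "spam"]
    A_minus = stack_columns([p["content"] for p in normal])  # normal content matrix
    B_minus = stack_columns([p["link"] for p in normal])     # normal link matrix
    A_plus = stack_columns([p["content"] for p in spam])     # spam content matrix
    B_plus = stack_columns([p["link"] for p in spam])        # spam link matrix
    return A_minus, B_minus, A_plus, B_plus

# Hypothetical training data: 2 content features and 3 link features per page.
pages = [
    {"label": "normal", "content": [0.1, 0.9], "link": [3.0, 1.0, 0.0]},
    {"label": "normal", "content": [0.2, 0.8], "link": [2.0, 2.0, 0.0]},
    {"label": "spam",   "content": [0.9, 0.1], "link": [0.0, 0.0, 9.0]},
]
A_minus, B_minus, A_plus, B_plus = build_training_matrices(pages)
```

Note that the content matrix and the link matrix of a class share the same column order, so one weight per training page applies to both views.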
The learning of the transformation matrices $W_-$ and $W_+$ and the construction of the approximate matrices are described further below.
Specifically:
1: Learn the transformation matrices $W_-$ and $W_+$
The transformation matrix $W_-$ is solved for as follows:
$$\min \; \|[A_x] - [A_-]W_-\|_2 \tag{1}$$
$$\text{s.t.} \; \|W_-\|_2 = 1$$
Formula (1) states that, subject to the constraint $\|W_-\|_2 = 1$, the weight matrix $W_-$ is found that minimizes $\|[A_x] - [A_-]W_-\|_2$. The minimization makes the content matrix $[A_-]W_-$, constructed from the weights $W_-$ and $[A_-]$, differ as little as possible from the content matrix $[A_x]$ of the page P under test.
The transformation matrix $W_+$ is solved for as follows:
$$\min \; \|[A_x] - [A_+]W_+\|_2 \tag{2}$$
$$\text{s.t.} \; \|W_+\|_2 = 1$$
Formula (2) states that, subject to the constraint $\|W_+\|_2 = 1$, the weight matrix $W_+$ is found that minimizes $\|[A_x] - [A_+]W_+\|_2$. The minimization makes the content matrix $[A_+]W_+$, constructed from the weights $W_+$ and $[A_+]$, differ as little as possible from the content matrix $[A_x]$ of the page P under test.
2: Compute the approximate matrices $B_1$ and $B_2$
$B_1$ and $B_2$ are computed as follows:
$$B_1 = [B_-]W_- \tag{3}$$
$$B_2 = [B_+]W_+ \tag{4}$$
Formula (3) computes the approximate matrix $B_1$ of the page P to be detected from the transformation matrix $W_-$ and $[B_-]$. Formula (4) computes the approximate matrix $B_2$ of the page P to be detected from the transformation matrix $W_+$ and $[B_+]$.
3: Compute the norms of the differences between the link matrix $[B_x]$ of the page P to be detected and the matrices $B_1$ and $B_2$
Compute the norm $E_- = \|[B_x] - B_1\|_2$ and the norm $E_+ = \|[B_x] - B_2\|_2$. The size of each norm indicates how much the corresponding approximate matrix differs from the link matrix to be detected. The larger the norm, the greater the difference between the approximate matrix and the link matrix to be detected; conversely, the smaller the norm, the smaller the difference.
4: Decide the class of page P
If $E_- > E_+$, page P is identified as a spam page; if $E_- < E_+$, page P is identified as a normal page; if $E_- = E_+$, page P is identified at random as one of the two, as shown in Fig. 5. If page P is identified as a normal page, it is retained; otherwise P is deleted from the page pool.
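The decision rule and the page-pool filtering can be sketched as below. The function names `classify` and `filter_pool`, the page identifiers, and the norm values are hypothetical; the optional `rng` argument is an assumption added so that the random tie-break can be seeded.

```python
import random

def classify(e_minus, e_plus, rng=random):
    """Return 'normal' or 'spam' by comparing the normal norm with the spam norm."""
    if e_minus < e_plus:
        return "normal"
    if e_minus > e_plus:
        return "spam"
    return rng.choice(["normal", "spam"])   # tie: pick one of the two at random

def filter_pool(pool, norms, rng=random):
    """Keep only pages identified as normal; norms maps page id -> (E_minus, E_plus)."""
    return [page for page in pool if classify(*norms[page], rng=rng) == "normal"]

# Hypothetical pool of two pages with precomputed norms.
pool = ["p1", "p2"]
norms = {"p1": (0.2, 0.9), "p2": (1.4, 0.3)}
kept = filter_pool(pool, norms)
```

Here `p1` is kept because its normal norm is the smaller one, and `p2` is removed from the pool.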
Although specific embodiments of the invention have been described above with reference to the accompanying drawings, they do not limit the scope of protection of the invention. Those skilled in the art should understand that various modifications or variations that can be made on the basis of the technical scheme of the invention without creative effort remain within its scope of protection.
Claims (5)
1. A multi-view web spam detection method, characterized in that the method comprises the following steps:
Step 1: first obtain the content view and the link view of all normal pages and spam pages in the training data;
Step 2: then obtain the content view and the link view of the page to be detected;
Step 3: construct a matrix from the content views and a matrix from the link views of all normal pages in step 1, obtaining the normal content matrix and the normal link matrix;
Step 4: construct a matrix from the content views and a matrix from the link views of all spam pages in step 1, obtaining the spam content matrix and the spam link matrix;
Step 5: construct a matrix from the content view and a matrix from the link view of the page to be detected, obtaining the content matrix to be detected and the link matrix to be detected;
Step 6: solve for the weight matrix $W_-$ using the normal content matrix and the content matrix to be detected, and solve for the weight matrix $W_+$ using the spam content matrix and the content matrix to be detected;
Step 7: solve for the approximate matrix $B_1$ using the normal link matrix and the weight matrix $W_-$, and solve for the approximate matrix $B_2$ using the spam link matrix and the weight matrix $W_+$;
Step 8: solve for the normal norm $E_-$ using the approximate matrix $B_1$ from step 7 and the link matrix to be detected, and solve for the spam norm $E_+$ using the approximate matrix $B_2$ and the link matrix to be detected;
Step 9: compare the normal norm $E_-$ with the spam norm $E_+$; if the normal norm is less than the spam norm, the page to be detected is a normal page; if the normal norm is greater than the spam norm, the page to be detected is a spam page; if the two are equal, the page to be detected is identified at random as a normal page or a spam page;
Step 10: if the page to be detected is identified as a normal page, it is retained; otherwise it is deleted from the page pool. Detection then ends.
2. The multi-view web spam detection method according to claim 1, characterized in that, in step 6, the weight matrix $W_-$ is solved for with the following formula:
$$\min \; \|[A_x] - [A_-]W_-\|_2$$
$$\text{s.t.} \; \|W_-\|_2 = 1$$
The formula states that, subject to the constraint $\|W_-\|_2 = 1$, the weight matrix $W_-$ is found that minimizes $\|[A_x] - [A_-]W_-\|_2$. The minimization makes the content matrix $[A_-]W_-$, constructed from the weights $W_-$ and $[A_-]$, differ as little as possible from the content matrix $[A_x]$ of the page under test.
3. The multi-view web spam detection method according to claim 1, characterized in that, in step 6, the weight matrix $W_+$ is solved for with the following formula:
$$\min \; \|[A_x] - [A_+]W_+\|_2$$
$$\text{s.t.} \; \|W_+\|_2 = 1$$
The formula states that, subject to the constraint $\|W_+\|_2 = 1$, the weight matrix $W_+$ is found that minimizes $\|[A_x] - [A_+]W_+\|_2$. The minimization makes the content matrix $[A_+]W_+$, constructed from the weights $W_+$ and $[A_+]$, differ as little as possible from the content matrix $[A_x]$ of the page under test.
4. The multi-view web spam detection method according to claim 1, characterized in that, in step 7, the approximate matrices $B_1$ and $B_2$ are computed with the following formulas:
$$B_1 = [B_-]W_- \tag{1}$$
$$B_2 = [B_+]W_+ \tag{2}$$
Formula (1) computes the approximate matrix $B_1$ of the page to be detected from the transformation matrix $W_-$ and $[B_-]$. Formula (2) computes the approximate matrix $B_2$ of the page to be detected from the transformation matrix $W_+$ and $[B_+]$.
5. The multi-view web spam detection method according to claim 1, characterized in that, in step 8, the normal norm $E_-$ and the spam norm $E_+$ are solved for with the following formulas:
$$E_- = \|[B_x] - B_1\|_2$$
$$E_+ = \|[B_x] - B_2\|_2$$
The size of each norm indicates how much the corresponding approximate matrix differs from the link matrix to be detected. The larger the norm, the greater the difference between the approximate matrix and the link matrix to be detected; conversely, the smaller the norm, the smaller the difference.
The invention discloses a multi-view web spam detection method comprising the following steps: first obtaining two views of all normal pages and spam pages in the training data; then obtaining two views of the page to be detected; constructing a matrix from each of the obtained views; obtaining the normal norm and the spam norm; and comparing the normal norm with the spam norm. If the normal norm is less than the spam norm, the page to be detected is a normal page; if the normal norm is greater than the spam norm, the page to be detected is a spam page; if the two are equal, the page to be detected is identified at random as a normal page or a spam page. The method is insensitive to imbalance in the training data, can detect multiple kinds of spam pages at once, and has a simple detection process.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110424701.4A CN102521369B (en) | 2011-12-16 | 2011-12-16 | Multi-view web spam detection method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110424701.4A CN102521369B (en) | 2011-12-16 | 2011-12-16 | Multi-view web spam detection method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102521369A true CN102521369A (en) | 2012-06-27 |
CN102521369B CN102521369B (en) | 2014-01-22 |
Family
ID=46292282
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201110424701.4A Expired - Fee Related CN102521369B (en) | 2011-12-16 | 2011-12-16 | Multi-view web spam detection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102521369B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102750345A (en) * | 2012-06-07 | 2012-10-24 | 山东师范大学 | Method for identifying web spam through web page multi-view data association combination |
CN105930365A (en) * | 2016-04-11 | 2016-09-07 | 天津大学 | Network link topology reconstruction method based on content |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1728655A (en) * | 2004-11-25 | 2006-02-01 | 刘文印 | Method and system for detecting and discriminating counterfeit web page |
CN101324888A (en) * | 2007-06-13 | 2008-12-17 | 北京恒金恒泰信息技术有限公司 | Plug-in card for filtering eroticism software based on IE |
CN101393555A (en) * | 2008-09-09 | 2009-03-25 | 浙江大学 | Rubbish blog detecting method |
Also Published As
Publication number | Publication date |
---|---|
CN102521369B (en) | 2014-01-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105045875B (en) | Personalized search and device | |
CN103365910B (en) | Method and system for information retrieval | |
CN103412888B (en) | A kind of point of interest recognition methods and device | |
JP2012521598A5 (en) | ||
CN105975596A (en) | Query expansion method and system of search engine | |
CN101630327A (en) | Design method of theme network crawler system | |
CN105654144B (en) | A kind of social network ontologies construction method based on machine learning | |
CN108769079A (en) | A kind of Web Intrusion Detection Techniques based on machine learning | |
CN106407484A (en) | Video tag extraction method based on semantic association of barrages | |
CN102799814A (en) | Phishing website search system and method | |
CN106156372A (en) | The sorting technique of a kind of internet site and device | |
CN103020067A (en) | Method and device for determining webpage type | |
CN103544307B (en) | A kind of multiple search engine automation contrast evaluating method independent of document library | |
CN101251896B (en) | Object detecting system and method based on multiple classifiers | |
CN101980210A (en) | Marked word classifying and grading method and system | |
CN102591948A (en) | Method and system for improving search results based on user behavior analysis | |
CN102117339A (en) | Filter supervision method specific to unsecure web page texts | |
CN103778262A (en) | Information retrieval method and device based on thesaurus | |
CN103618744A (en) | Intrusion detection method based on fast k-nearest neighbor (KNN) algorithm | |
CN108319518A (en) | File fragmentation sorting technique based on Recognition with Recurrent Neural Network and device | |
CN102521369B (en) | Multi-view web spam detection method | |
CN111222031A (en) | Website distinguishing method and system | |
CN103684896A (en) | Method of detecting website cheating based on domain name resolution characteristics | |
CN103853771B (en) | A kind of method for pushing and system of search result | |
CN102063497A (en) | Open type knowledge sharing platform and entry processing method thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20140122 Termination date: 20201216 |