CN102521369A - Multi-view web spam detection method - Google Patents

Publication number
CN102521369A
CN102521369A CN2011104247014A CN201110424701A CN102521369A CN 102521369 A CN102521369 A CN 102521369A CN 2011104247014 A CN2011104247014 A CN 2011104247014A CN 201110424701 A CN201110424701 A CN 201110424701A CN 102521369 A CN102521369 A CN 102521369A
Authority
CN
China
Prior art keywords
page
spam
matrix
norm
detected
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011104247014A
Other languages
Chinese (zh)
Other versions
CN102521369B (en)
Inventor
张化祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Normal University
Original Assignee
Shandong Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Normal University
Priority to CN201110424701.4A
Publication of CN102521369A
Application granted
Publication of CN102521369B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention provides a multi-view web spam detection method. The method comprises the following steps: first, obtaining two views of every normal page and every spam page in the training data; then obtaining two views of a page to be detected; constructing a matrix for each of the obtained views; computing a normal norm and a spam norm; and comparing the normal norm with the spam norm. If the normal norm is less than the spam norm, the page to be detected is identified as a normal page; if the normal norm is greater than the spam norm, it is identified as a spam page; if the two norms are equal, it is randomly identified as either a normal page or a spam page. The method is insensitive to imbalance in the training data, can detect several kinds of spam page at the same time, and has a simple detection process.

Description

A multi-view web spam page detection method
Technical field
The present invention relates to a multi-view web spam page detection method and belongs to the field of Internet information retrieval.
Background
To gain commercial advantage, some website owners use improper means to deceive the ranking algorithms of search engines so that unimportant sites or pages rank higher, which corrupts search results. The related techniques, which include search engine optimization (SEO) and search engine marketing (SEM), are collectively called search engine spam, i.e. web spam (spam pages). Web spam has become a major challenge for web search: it seriously degrades retrieval quality, it evolves quickly, and new spamming techniques keep appearing. Web spam takes three main forms: content-based spam, link-based spam, and page hiding. Existing detection methods mostly rely on heuristic functions that target one particular form of spam page. They cannot detect several kinds of spam page at the same time, their detection time complexity is high, and they are sensitive to imbalance in the training data.
Training data imbalance means that the number of normal pages in the training data is far larger than the number of spam pages. The number of web pages is enormous, and manually labeling pages as normal or spam is time-consuming and laborious, so only a portion of the pages can be labeled by hand. A classifier is trained on the labeled pages and then used to label the large number of unlabeled pages automatically; that is, the trained classifier decides whether each unlabeled page is a normal page or a spam page.
The main techniques for detecting spam pages include content-based methods, link-based methods, statistical methods, and graph-theoretic methods. Content-based spam detection applies heuristic functions to the content features of spam pages, which makes it hard to form a unified model. Some methods apply statistical techniques and analyze the distribution of page keywords to detect spam pages; they can handle changes in search ranking caused by repeated keywords, modified page content, and similar manipulations. Link-based page ranking algorithms used by search engines, such as PageRank and HITS, ignore the influence of page content on ranking and can therefore also be used to detect content-based spam pages. When machine learning is applied to content spam detection, the content features of the page are extracted first, and a classification method then performs the detection.
Link-based page ranking algorithms are widely used in search engines and can be used to detect link-based spam pages. Heuristic approaches include the bipartite graph method, which decides whether a link is link spam according to whether a relevant subgraph exists in the link adjacency matrix. In addition, link spam pages can be detected by statistically analyzing unusual link structures and by new page ranking algorithms.
In recent years, machine learning techniques have been applied to web spam detection: page features are extracted and a classifier is trained, which yields good detection performance. However, when the data dimensionality is high, these methods share the following problems: they are sensitive to imbalance in the training data, they cannot detect several different kinds of spam page at the same time, and their detection time complexity is high.
Traditional classifiers such as decision trees, neural networks, and support vector machines all assume that the classes in the training data are roughly balanced, that is, that the class sizes differ little. Research shows that when the class sizes differ greatly, especially in two-class problems where one class has far more data than the other, the learned classifier's accuracy on the smaller class (the minority class) drops sharply. Yet improving the accuracy on the minority class is often what matters most. Spam page data is scarce: the overwhelming majority of collected pages are normal and only a few are spam, so reducing the effect of training data imbalance on the classifier is particularly important for improving the recognition rate of spam pages.
Summary of the invention
The object of the invention is to solve the above problems by providing a multi-view web spam detection method. The method only needs to learn weight matrices from the training data and does not need to train a classifier, so it is insensitive to training data imbalance. It can detect several classes of spam page at the same time, which makes it superior to existing methods that are effective only for a specific kind of spam page. The detection process is simple: learn the weight matrices from the training data, compute the difference norms, and decide from their relative size whether a new page is identified as a normal page or a spam page.
To achieve this object, the present invention adopts the following technical solution:
A multi-view web spam detection method comprising the following steps:
Step 1: obtain the content view and the link view of every normal page and every spam page in the training data;
Step 2: obtain the content view and the link view of the page to be detected;
Step 3: construct a matrix from the content views and a matrix from the link views of all normal pages in Step 1, obtaining the normal content matrix and the normal link matrix;
Step 4: construct a matrix from the content views and a matrix from the link views of all spam pages in Step 1, obtaining the spam content matrix and the spam link matrix;
Step 5: construct a matrix from the content view and a matrix from the link view of the page to be detected, obtaining the content matrix to be detected and the link matrix to be detected;
Step 6: solve for the weight matrix $W^-$ using the normal content matrix and the content matrix to be detected, and solve for the weight matrix $W^+$ using the spam content matrix and the content matrix to be detected;
Step 7: compute the approximation matrix $B_1$ from the normal link matrix and the weight matrix $W^-$, and compute the approximation matrix $B_2$ from the spam link matrix and the weight matrix $W^+$;
Step 8: compute the normal norm $E^-$ from the approximation matrix $B_1$ of Step 7 and the link matrix to be detected, and compute the spam norm $E^+$ from the approximation matrix $B_2$ and the link matrix to be detected;
Step 9: compare the normal norm $E^-$ with the spam norm $E^+$; if the normal norm is less than the spam norm, the page to be detected is a normal page; if the normal norm is greater than the spam norm, the page to be detected is a spam page; if the two are equal, the page to be detected is randomly identified as a normal page or a spam page;
Step 10: if the page to be detected is identified as a normal page, it is retained; otherwise it is deleted from the page pool. Detection ends.
In Step 6, the weight matrix $W^-$ is obtained by solving:
$\min \|[A_x] - [A^-]W^-\|_2$
$\text{s.t. } \|W^-\|_2 = 1$
That is, subject to the constraint $\|W^-\|_2 = 1$, find the weight matrix $W^-$ that minimizes $\|[A_x] - [A^-]W^-\|_2$. The minimization makes the content matrix $[A^-]W^-$, constructed from the weights $W^-$ and $[A^-]$, differ as little as possible from the content matrix $[A_x]$ of the page to be detected.
In Step 6, the weight matrix $W^+$ is obtained by solving:
$\min \|[A_x] - [A^+]W^+\|_2$
$\text{s.t. } \|W^+\|_2 = 1$
That is, subject to the constraint $\|W^+\|_2 = 1$, find the weight matrix $W^+$ that minimizes $\|[A_x] - [A^+]W^+\|_2$. The minimization makes the content matrix $[A^+]W^+$, constructed from the weights $W^+$ and $[A^+]$, differ as little as possible from the content matrix $[A_x]$ of the page P to be detected.
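The patent does not name a solver for these two constrained problems. As an illustration only, the sketch below uses projected gradient descent on the unit sphere, which is a substituted technique and finds a stationary point rather than a guaranteed global minimum. The function name `solve_unit_norm_weights`, the column-per-page matrix layout, and the treatment of the page to be detected as a single vector are all assumptions.

```python
import numpy as np

def solve_unit_norm_weights(A, a_x, n_iter=500, seed=0):
    """Approximately minimise ||a_x - A @ w||_2 subject to ||w||_2 = 1.

    A   : (d, n) content matrix, one training page per column (assumed layout).
    a_x : (d,)   content view of the page to be detected.
    Returns a unit-norm weight vector w of shape (n,).
    """
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(A.shape[1])
    w /= np.linalg.norm(w)                  # start on the unit sphere
    # Step size 1/L, where L is the largest eigenvalue of A^T A.
    L = np.linalg.norm(A, 2) ** 2
    if L == 0.0:
        return w                            # A is zero: every unit vector is optimal
    for _ in range(n_iter):
        grad = A.T @ (A @ w - a_x)          # gradient of 0.5 * ||a_x - A w||^2
        z = w - grad / L                    # plain gradient step
        norm_z = np.linalg.norm(z)
        if norm_z == 0.0:                   # projection undefined; keep current w
            break
        w = z / norm_z                      # project back onto the unit sphere
    return w
```

Calling the function once with the normal content matrix and once with the spam content matrix would give stand-ins for $W^-$ and $W^+$. With step size 1/L each iteration cannot increase the objective, which is why this simple scheme is used here.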
In Step 7, the approximation matrices $B_1$ and $B_2$ are computed as:
$B_1 = [B^-]W^-$ (1)
$B_2 = [B^+]W^+$ (2)
Formula (1) computes the approximation matrix $B_1$ of the page to be detected from the transformation matrix $W^-$ and $[B^-]$; Formula (2) computes the approximation matrix $B_2$ of the page to be detected from the transformation matrix $W^+$ and $[B^+]$.
In Step 8, the normal norm $E^-$ and the spam norm $E^+$ are computed as:
$E^- = \|[B_x] - B_1\|_2$
$E^+ = \|[B_x] - B_2\|_2$
The size of each norm measures how much the corresponding approximation matrix differs from the link matrix to be detected: the larger the norm, the greater the difference; the smaller the norm, the smaller the difference.
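Steps 7 and 8 amount to two matrix products and two norms. The following is a minimal sketch, assuming the weights have already been solved, that training pages are stored one per column, and that the page to be detected is a single link-feature vector. The name `link_view_norms` is hypothetical.

```python
import numpy as np

def link_view_norms(B_neg, W_neg, B_pos, W_pos, b_x):
    """Return (E_neg, E_pos) for one page to be detected.

    B_neg, B_pos : (k, n_neg) and (k, n_pos) link matrices, one page per column.
    W_neg, W_pos : weights learned from the content view, shapes (n_neg,), (n_pos,).
    b_x          : (k,) link view of the page to be detected.
    """
    B1 = B_neg @ W_neg                      # formula (1): approximation from normal pages
    B2 = B_pos @ W_pos                      # formula (2): approximation from spam pages
    # For 1-D input np.linalg.norm returns the Euclidean (2-) norm.
    E_neg = np.linalg.norm(b_x - B1)
    E_pos = np.linalg.norm(b_x - B2)
    return E_neg, E_pos
```

The weights come from the content view but are applied to the link matrices, so the two views are coupled only through $W^-$ and $W^+$.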
Beneficial effects of the invention: the invention proposes a multi-view spam page detection method. Because it can detect multiple types of spam page at the same time, detection efficiency is improved. Because it does not need to train a classifier, the effect of training data imbalance on classification is avoided, and the detection algorithm is insensitive to data imbalance.
Description of drawings
Fig. 1 shows the conversion of the views of the training data into matrices;
Fig. 2 shows the conversion of the views of the page to be detected into matrices;
Fig. 3a shows the process of solving for the weight matrix $W^-$;
Fig. 3b shows the process of solving for the weight matrix $W^+$;
Fig. 4a shows the computation of the approximation matrix $B_1$;
Fig. 4b shows the computation of the approximation matrix $B_2$;
Fig. 5 shows the page detection process.
Embodiment
The invention is further described below with reference to the drawings and an embodiment.
The object of the invention is to provide a general detection method for multiple kinds of spam page.
To achieve this object, the technical solution of the invention represents page features with multiple views, unlike traditional page feature representations. Each page is represented by two views: the same web page is represented both by a content-based feature vector (the content view) and by a hyperlink-based feature vector (the link view). The training data are page data explicitly labeled as normal or spam. The content views of all pages labeled normal in the training data form the normal content matrix, denoted $[A^-]$, and their link views form the normal link matrix, denoted $[B^-]$. The content views of all pages labeled spam form the spam content matrix, denoted $[A^+]$, and their link views form the spam link matrix, denoted $[B^+]$, as shown in Fig. 1. The content view of each page P to be detected forms the content matrix to be detected, denoted $[A_x]$, and its link view forms the link matrix to be detected, denoted $[B_x]$, as shown in Fig. 2. $[A_x]$ is reconstructed from $[A^-]$ and from $[A^+]$ by matrix transformation, and the corresponding transformation matrices $W^-$ and $W^+$ are learned, as shown in Figs. 3a and 3b. The approximation matrix $B_1$ of the page to be detected is constructed from $W^-$ and $[B^-]$, and the approximation matrix $B_2$ from $W^+$ and $[B^+]$, as shown in Figs. 4a and 4b.
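For illustration, the four training matrices could be assembled as sketched below. The input format (a list of `(content_vec, link_vec, label)` tuples), the label strings, and the function name `build_view_matrices` are assumptions; the patent does not say how the feature vectors are extracted or stored.

```python
import numpy as np

def build_view_matrices(pages):
    """Stack labeled training pages into [A-], [B-], [A+], [B+].

    pages: iterable of (content_vec, link_vec, label) tuples, where label is
    "normal" or "spam" (hypothetical input format, not from the patent).
    Each page becomes one column, so a content matrix is (content_dim, n_pages).
    Both classes must contain at least one page.
    """
    views = {"normal": ([], []), "spam": ([], [])}
    for content_vec, link_vec, label in pages:
        views[label][0].append(np.asarray(content_vec, dtype=float))
        views[label][1].append(np.asarray(link_vec, dtype=float))
    A_neg = np.column_stack(views["normal"][0])   # normal content matrix
    B_neg = np.column_stack(views["normal"][1])   # normal link matrix
    A_pos = np.column_stack(views["spam"][0])     # spam content matrix
    B_pos = np.column_stack(views["spam"][1])     # spam link matrix
    return A_neg, B_neg, A_pos, B_pos
```

Storing one page per column means that column j of the content matrix and column j of the link matrix describe the same page, so a weight vector learned on one view can be applied to the other.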
The norms of the differences between $[B_x]$ and each of $B_1$ and $B_2$ are then computed, and the relative size of the two norms decides whether page P is identified as a normal page or a spam page. The construction is detailed below.
The learning of the transformation matrices $W^-$ and $W^+$ and the construction of the approximation matrices are described further below.
Specifically:
1: Learn the transformation matrices $W^-$ and $W^+$
The transformation matrix $W^-$ is solved as follows:
$\min \|[A_x] - [A^-]W^-\|_2$ (1)
$\text{s.t. } \|W^-\|_2 = 1$
Formula (1) states that, subject to the constraint $\|W^-\|_2 = 1$, the weight matrix $W^-$ that minimizes $\|[A_x] - [A^-]W^-\|_2$ is sought. The minimization makes the content matrix $[A^-]W^-$, constructed from the weights $W^-$ and $[A^-]$, differ as little as possible from the content matrix $[A_x]$ of the page P to be detected.
The transformation matrix $W^+$ is solved as follows:
$\min \|[A_x] - [A^+]W^+\|_2$ (2)
$\text{s.t. } \|W^+\|_2 = 1$
Formula (2) states that, subject to the constraint $\|W^+\|_2 = 1$, the weight matrix $W^+$ that minimizes $\|[A_x] - [A^+]W^+\|_2$ is sought. The minimization makes the content matrix $[A^+]W^+$, constructed from the weights $W^+$ and $[A^+]$, differ as little as possible from the content matrix $[A_x]$ of the page P to be detected.
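One way to read problems (1) and (2) is through a Lagrange multiplier. The sketch below is not part of the patent: it assumes the norm is Euclidean (Frobenius when $W$ is a matrix), uses the squared norm because it has the same minimizer, and introduces the symbol $\lambda$ for the multiplier. It is written for $W^-$; the case of $W^+$ is identical with $[A^+]$ in place of $[A^-]$.

```latex
\mathcal{L}(W^-, \lambda)
  = \bigl\| [A_x] - [A^-]W^- \bigr\|_2^2
  + \lambda \bigl( \| W^- \|_2^2 - 1 \bigr)

\frac{\partial \mathcal{L}}{\partial W^-} = 0
  \;\Longrightarrow\;
  \bigl( [A^-]^{\top}[A^-] + \lambda I \bigr) W^- = [A^-]^{\top}[A_x],
  \qquad \| W^- \|_2 = 1
```

Under these assumptions, a stationary point is a regularized least-squares solution in which $\lambda$ is chosen so that $W^-$ has unit norm, rather than being fixed in advance as in ordinary ridge regression.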
2: Compute the approximation matrices $B_1$ and $B_2$
$B_1$ and $B_2$ are computed as follows:
$B_1 = [B^-]W^-$ (3)
$B_2 = [B^+]W^+$ (4)
Formula (3) computes the approximation matrix $B_1$ of the page P to be detected from the transformation matrix $W^-$ and $[B^-]$; Formula (4) computes the approximation matrix $B_2$ of the page P to be detected from the transformation matrix $W^+$ and $[B^+]$.
3: Compute the norms of the differences between the link matrix to be detected $[B_x]$ of page P and the matrices $B_1$ and $B_2$
Compute the norms $E^- = \|[B_x] - B_1\|_2$ and $E^+ = \|[B_x] - B_2\|_2$. The size of each norm measures how much the corresponding approximation matrix differs from the link matrix to be detected. The larger the norm, the greater the difference; the smaller the norm, the smaller the difference.
4: Decide the class of page P
If $E^- > E^+$, page P is identified as a spam page; if $E^- < E^+$, page P is identified as a normal page; if $E^- = E^+$, page P is randomly identified as one of the two, as shown in Fig. 5. If page P is identified as a normal page, it is retained; otherwise P is deleted from the page pool.
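The decision rule and the pool clean-up can be sketched as follows. The function names, the label strings, and the dictionary that maps page identifiers to precomputed norm pairs are hypothetical and serve only to illustrate the rule.

```python
import random

def classify_page(E_neg, E_pos, rng=random):
    """Apply the decision rule: the smaller norm wins, ties are broken at random."""
    if E_neg < E_pos:
        return "normal"
    if E_neg > E_pos:
        return "spam"
    return rng.choice(["normal", "spam"])

def filter_pool(pool, norms):
    """Keep only pages identified as normal.

    pool  : list of page identifiers.
    norms : dict mapping each identifier to its (E_neg, E_pos) pair.
    """
    return [page for page in pool if classify_page(*norms[page]) == "normal"]
```

Passing a seeded `random.Random` instance as `rng` makes the tie-breaking reproducible, which can help when comparing runs.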
Although specific embodiments of the invention have been described above with reference to the drawings, they do not limit the scope of protection of the invention. Those skilled in the art should understand that modifications or variations that can be made on the basis of the technical solution of the invention without creative effort remain within the scope of protection of the invention.

Claims (5)

1. A multi-view web spam detection method, characterized in that the method comprises the following steps:
Step 1: obtain the content view and the link view of every normal page and every spam page in the training data;
Step 2: obtain the content view and the link view of the page to be detected;
Step 3: construct a matrix from the content views and a matrix from the link views of all normal pages in Step 1, obtaining the normal content matrix and the normal link matrix;
Step 4: construct a matrix from the content views and a matrix from the link views of all spam pages in Step 1, obtaining the spam content matrix and the spam link matrix;
Step 5: construct a matrix from the content view and a matrix from the link view of the page to be detected, obtaining the content matrix to be detected and the link matrix to be detected;
Step 6: solve for the weight matrix $W^-$ using the normal content matrix and the content matrix to be detected, and solve for the weight matrix $W^+$ using the spam content matrix and the content matrix to be detected;
Step 7: compute the approximation matrix $B_1$ from the normal link matrix and the weight matrix $W^-$, and compute the approximation matrix $B_2$ from the spam link matrix and the weight matrix $W^+$;
Step 8: compute the normal norm $E^-$ from the approximation matrix $B_1$ of Step 7 and the link matrix to be detected, and compute the spam norm $E^+$ from the approximation matrix $B_2$ and the link matrix to be detected;
Step 9: compare the normal norm $E^-$ with the spam norm $E^+$; if the normal norm is less than the spam norm, the page to be detected is a normal page; if the normal norm is greater than the spam norm, the page to be detected is a spam page; if the two are equal, the page to be detected is randomly identified as a normal page or a spam page;
Step 10: if the page to be detected is identified as a normal page, it is retained; otherwise it is deleted from the page pool. Detection ends.
2. The multi-view web spam detection method according to claim 1, characterized in that, in Step 6, the weight matrix $W^-$ is obtained by solving:
$\min \|[A_x] - [A^-]W^-\|_2$
$\text{s.t. } \|W^-\|_2 = 1$
That is, subject to the constraint $\|W^-\|_2 = 1$, find the weight matrix $W^-$ that minimizes $\|[A_x] - [A^-]W^-\|_2$. The minimization makes the content matrix $[A^-]W^-$, constructed from the weights $W^-$ and $[A^-]$, differ as little as possible from the content matrix $[A_x]$ of the page to be detected.
3. The multi-view web spam detection method according to claim 1, characterized in that, in Step 6, the weight matrix $W^+$ is obtained by solving:
$\min \|[A_x] - [A^+]W^+\|_2$
$\text{s.t. } \|W^+\|_2 = 1$
That is, subject to the constraint $\|W^+\|_2 = 1$, find the weight matrix $W^+$ that minimizes $\|[A_x] - [A^+]W^+\|_2$. The minimization makes the content matrix $[A^+]W^+$, constructed from the weights $W^+$ and $[A^+]$, differ as little as possible from the content matrix $[A_x]$ of the page to be detected.
4. The multi-view web spam detection method according to claim 1, characterized in that, in Step 7, the approximation matrices $B_1$ and $B_2$ are computed as:
$B_1 = [B^-]W^-$ (1)
$B_2 = [B^+]W^+$ (2)
Formula (1) computes the approximation matrix $B_1$ of the page to be detected from the transformation matrix $W^-$ and $[B^-]$; Formula (2) computes the approximation matrix $B_2$ of the page to be detected from the transformation matrix $W^+$ and $[B^+]$.
5. The multi-view web spam detection method according to claim 1, characterized in that, in Step 8, the normal norm $E^-$ and the spam norm $E^+$ are computed as:
$E^- = \|[B_x] - B_1\|_2$
$E^+ = \|[B_x] - B_2\|_2$
The size of each norm measures how much the corresponding approximation matrix differs from the link matrix to be detected: the larger the norm, the greater the difference; the smaller the norm, the smaller the difference.
The invention discloses a multi-view web spam detection method comprising the following steps: first obtain two views of every normal page and every spam page in the training data; then obtain two views of the page to be detected; construct a matrix from each of the obtained views; compute the normal norm and the spam norm; compare the normal norm with the spam norm. If the normal norm is less than the spam norm, the page to be detected is a normal page; if the normal norm is greater than the spam norm, the page to be detected is a spam page; if the two are equal, the page to be detected is randomly identified as a normal page or a spam page. The method is insensitive to training data imbalance, can detect multiple kinds of spam page at the same time, and has a simple detection process.
CN201110424701.4A 2011-12-16 2011-12-16 Multi-view web spam detection method Expired - Fee Related CN102521369B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110424701.4A CN102521369B (en) 2011-12-16 2011-12-16 Multi-view web spam detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110424701.4A CN102521369B (en) 2011-12-16 2011-12-16 Multi-view web spam detection method

Publications (2)

Publication Number Publication Date
CN102521369A (en) 2012-06-27
CN102521369B (en) 2014-01-22

Family

ID=46292282

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110424701.4A Expired - Fee Related CN102521369B (en) 2011-12-16 2011-12-16 Multi-view web spam detection method

Country Status (1)

Country Link
CN (1) CN102521369B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102750345A (en) * 2012-06-07 2012-10-24 山东师范大学 Method for identifying web spam through web page multi-view data association combination
CN105930365A (en) * 2016-04-11 2016-09-07 天津大学 Network link topology reconstruction method based on content

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1728655A (en) * 2004-11-25 2006-02-01 刘文印 Method and system for detecting and discriminating counterfeit web page
CN101324888A (en) * 2007-06-13 2008-12-17 北京恒金恒泰信息技术有限公司 Plug-in card for filtering eroticism software based on IE
CN101393555A (en) * 2008-09-09 2009-03-25 浙江大学 Rubbish blog detecting method

Also Published As

Publication number Publication date
CN102521369B (en) 2014-01-22

Similar Documents

Publication Publication Date Title
CN105045875B (en) Personalized search and device
CN103365910B (en) Method and system for information retrieval
CN103412888B (en) A kind of point of interest recognition methods and device
JP2012521598A5 (en)
CN105975596A (en) Query expansion method and system of search engine
CN101630327A (en) Design method of theme network crawler system
CN105654144B (en) A kind of social network ontologies construction method based on machine learning
CN108769079A (en) A kind of Web Intrusion Detection Techniques based on machine learning
CN106407484A (en) Video tag extraction method based on semantic association of barrages
CN102799814A (en) Phishing website search system and method
CN106156372A (en) The sorting technique of a kind of internet site and device
CN103020067A (en) Method and device for determining webpage type
CN103544307B (en) A kind of multiple search engine automation contrast evaluating method independent of document library
CN101251896B (en) Object detecting system and method based on multiple classifiers
CN101980210A (en) Marked word classifying and grading method and system
CN102591948A (en) Method and system for improving search results based on user behavior analysis
CN102117339A (en) Filter supervision method specific to unsecure web page texts
CN103778262A (en) Information retrieval method and device based on thesaurus
CN103618744A (en) Intrusion detection method based on fast k-nearest neighbor (KNN) algorithm
CN108319518A (en) File fragmentation sorting technique based on Recognition with Recurrent Neural Network and device
CN102521369B (en) Multi-view web spam detection method
CN111222031A (en) Website distinguishing method and system
CN103684896A (en) Method of detecting website cheating based on domain name resolution characteristics
CN103853771B (en) A kind of method for pushing and system of search result
CN102063497A (en) Open type knowledge sharing platform and entry processing method thereof

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20140122

Termination date: 20201216