CN109902236B

CN109902236B - Junk web page degradation method based on non-probability model

Info

Publication number: CN109902236B
Application number: CN201910172890.7A
Authority: CN
Inventors: Not disclosed at the inventor's request
Original assignee: Chengdu Shuzhilian Technology Co Ltd
Current assignee: Chengdu Shuzhilian Technology Co Ltd
Priority date: 2019-03-07
Filing date: 2019-03-07
Publication date: 2021-06-11
Anticipated expiration: 2039-03-07
Also published as: CN109902236A

Abstract

The invention discloses a junk web page degradation method based on a non-probability model. The method comprises: crawling web pages with a web crawler and parsing their content to obtain a web page URL list; computing a node adjacency list from the URL list; constructing a node network graph from the node adjacency list; ranking the nodes in the graph with the PageRank algorithm and labeling the top-ranked web pages, in order, as normal web pages or junk web pages; assigning an initial score value and an initial jump probability value to the labeled web pages; propagating values with an iterative algorithm until it converges to obtain node score values; and sorting all nodes in the graph in descending order of node score value to obtain the final page ranking. The method demotes junk web pages, raising the ranking of normal web pages in a search engine as far as possible while lowering the ranking of junk web pages, and effectively improves the precision and speed of junk web page degradation.

Description

Junk web page degradation method based on non-probability model

Technical Field

The invention belongs to the technical field of webpage information, and particularly relates to a junk webpage degradation method based on a non-probability model.

Background

Junk web pages, also called spam web pages, are web pages that use deceptive means to raise their own rank in a search engine's results. Their presence poses a significant challenge to both ordinary search engine users and search engine companies. For ordinary users, junk web pages fill search results with useless information and increase the time needed to find useful information; for search engine companies, junk web pages consume additional resources for storage, analysis and indexing, which wastes storage and computing resources.

Existing junk web page degradation methods need to build a large and complex probability calculation model and rely on it to demote junk web pages; such methods waste network resources, identify junk web pages very inefficiently, and cannot demote them quickly and accurately. Existing junk web page degradation algorithms generally adopt the approximate isolation principle, namely that normal web pages rarely link to junk web pages; they ignore propagation among web page links, which greatly reduces the precision of junk web page degradation.

Disclosure of Invention

To solve the above problems, the invention provides a junk web page degradation method based on a non-probability model, which demotes junk web pages, raises the ranking of normal web pages in a search engine as far as possible, lowers the ranking of junk web pages, and effectively improves the precision and speed of junk web page degradation.

To achieve this purpose, the invention adopts the following technical scheme: a junk web page degradation method based on a non-probability model, comprising the following steps:

S100, crawling web pages with a web crawler and parsing their content to obtain a web page URL list;

S200, computing a node adjacency list from the obtained URL list;

S300, constructing a node network graph according to the node adjacency list;

S400, ranking the nodes in the node network graph with the PageRank algorithm, and classifying and labeling the top-ranked web pages in sequence, wherein the classification labels comprise normal web page and junk web page;

S500, assigning an initial score value and an initial jump probability value to the labeled web pages; propagating values with an iterative algorithm until the algorithm converges to obtain node score values;

S600, sorting all nodes in the node network graph in descending order of node score value to obtain the final page ranking.
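The parsing half of step S100 can be illustrated with a minimal sketch. It assumes that pages have already been downloaded as HTML strings and uses only the Python standard library; the names `LinkExtractor` and `extract_links` are hypothetical and do not come from the invention, and the crawler itself is omitted.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop "#fragment" parts.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                if absolute.startswith(("http://", "https://")):
                    self.links.append(absolute)


def extract_links(page_url, html_text):
    """Return the de-duplicated out-links of one page, in document order."""
    parser = LinkExtractor(page_url)
    parser.feed(html_text)
    return list(dict.fromkeys(parser.links))
```

Calling `extract_links` once per crawled page yields, for each page URL, the list of URLs it links out to, which is the input needed to compute the adjacency list in step S200.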

Further, the URL of each acquired web page and the URLs it links out to are stored in a database in the form of an adjacency list.
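One possible way to store such an adjacency list is sketched below. The invention does not name a database or a schema, so the use of SQLite, the table name `links`, and the functions `store_adjacency` and `load_adjacency` are all assumptions made for illustration.

```python
import sqlite3


def store_adjacency(conn, pages):
    """Store {page_url: [out_link, ...]} as (src, dst) rows (assumed schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS links ("
        "src TEXT NOT NULL, dst TEXT NOT NULL, PRIMARY KEY (src, dst))"
    )
    rows = [(src, dst) for src, outs in pages.items() for dst in outs]
    # INSERT OR IGNORE skips links that were already recorded.
    conn.executemany("INSERT OR IGNORE INTO links (src, dst) VALUES (?, ?)", rows)
    conn.commit()


def load_adjacency(conn):
    """Read the table back into an adjacency list {src: [dst, ...]}."""
    adjacency = {}
    for src, dst in conn.execute("SELECT src, dst FROM links ORDER BY src, dst"):
        adjacency.setdefault(src, []).append(dst)
    return adjacency
```

Storing one row per link keeps duplicate links out of the table through the primary key, and the adjacency list can be rebuilt with a single ordered query.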

Further, the node network graph constructed from the adjacency list is $G = (V, E)$, where $G$ is a directed unweighted graph;

wherein V is the set of all nodes, and E is the set of all edges;

if nodes $v_i$ and $v_j$ exist and there is a link pointing from node $v_i$ to node $v_j$, then $\langle v_i, v_j \rangle \in E$; for any node $v_i$, a link pointing to itself is not contained in $E$, i.e. $\langle v_i, v_i \rangle \notin E$.
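The graph definition above can be sketched as follows, assuming the adjacency list is available as a Python dictionary that maps each URL to the URLs it links to. The function name `build_graph` and the returned structures are hypothetical.

```python
def build_graph(adjacency):
    """Build G = (V, E) from {src: [dst, ...]}; return (nodes, out_sets, in_sets)."""
    nodes = set(adjacency)
    for targets in adjacency.values():
        nodes.update(targets)
    out_sets = {v: set() for v in nodes}
    in_sets = {v: set() for v in nodes}
    for src, targets in adjacency.items():
        for dst in targets:
            if src == dst:
                continue  # self-links <v_i, v_i> are excluded from E
            out_sets[src].add(dst)
            in_sets[dst].add(src)
    return nodes, out_sets, in_sets
```

Using sets for the link targets makes the graph unweighted: repeated links between the same two pages collapse into a single directed edge.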

Further, in the step S400, the classifying and labeling the top ranked web pages includes the steps of:

labeling the highest-ranked nodes in sequence until the number of labeled normal web pages and junk web pages is not less than 100; the labeled normal web page set is $S_n$ and the labeled junk web page set is $S_s$.
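The ranking and labeling of step S400 can be sketched as below. This is a plain power-iteration PageRank, not necessarily the variant used by the invention; the damping factor 0.85 and the 100 iterations for this step are assumptions, as is the reading that at least 100 pages of each class must be labeled. The callable `classify` is a hypothetical stand-in for the judgment that decides whether a page is normal.

```python
def pagerank(out_sets, damping=0.85, iterations=100):
    """Power-iteration PageRank; every node must appear as a key of out_sets."""
    nodes = list(out_sets)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread evenly over all nodes.
        dangling = sum(rank[v] for v in nodes if not out_sets[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            if out_sets[v]:
                share = damping * rank[v] / len(out_sets[v])
                for child in out_sets[v]:
                    new_rank[child] += share
        rank = new_rank
    return rank


def label_top_nodes(rank, classify, minimum=100):
    """Label nodes in descending PageRank order until both quotas are met."""
    normal, junk = set(), set()
    for v in sorted(rank, key=rank.get, reverse=True):
        if len(normal) >= minimum and len(junk) >= minimum:
            break
        # classify(v) returns True for a normal page (assumed interface).
        (normal if classify(v) else junk).add(v)
    return normal, junk
```

Walking the nodes from the highest PageRank downward concentrates the labeling effort on the pages that influence the ranking most.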

Further, in the step S500, an iterative algorithm is adopted to perform value propagation until the algorithm converges to obtain a node scoring value, including the steps of:

for each node $v_i$, let $G(v_i)$ denote its positive rank value, $B(v_i)$ its reverse rank value, $In(v_i)$ the set of parent nodes of $v_i$, and $Out(v_i)$ the set of child nodes of $v_i$;

calculating $G(v_i)$ and $B(v_i)$ of each node with an iterative algorithm, where the calculation formula is as follows:

wherein,

$G(v_i)$ and $B(v_i)$ are calculated from $IG(v_i)$ and $IB(v_i)$; $\lambda$ takes the value 0.85; the algorithm runs for 100 iterations.

Further, when all nodes in the node network graph are sorted in descending order of node score value, $G(v_i)$ and $B(v_i)$ of node $v_i$ serve as unnormalized approximations of the probabilities that the node is a normal web page and a junk web page, respectively; the larger $G(v_i)$ is, the more likely node $v_i$ is a normal web page; the larger $B(v_i)$ is, the more likely node $v_i$ is a junk web page.

Further, during propagation, the value propagated by the source node is attenuated twice: one attenuation uses information about the source node and the other uses information about the target node. The calculation process is as follows:

the unattenuated propagation value of node $v_i$ is:

the probability that node $v_i$ is a normal page is:

the probability that node $v_j$ is a normal page is:

if $\langle v_i, v_j \rangle \in E$, then with $p(v_i)$ and $p(v_j)$ as attenuation factors, the value propagated from node $v_i$ to node $v_j$ is:

wherein $|Out(v_i)|$ is the number of child nodes of node $v_i$.
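The formulas for this step are not reproduced in the text, so the following is only an illustrative sketch of a two-stage attenuation under explicit assumptions: the unattenuated value is taken to be $G(v_i) / |Out(v_i)|$, and the probability that a node is a normal page is taken to be $p(v) = G(v) / (G(v) + B(v))$. Both forms are guesses made for illustration and may differ from the formulas of the invention; the function names are hypothetical, and only the propagation of the positive rank value is shown.

```python
def normal_probability(g, b):
    """Assumed form p(v) = G(v) / (G(v) + B(v)); 0.5 when both scores are zero."""
    total = g + b
    return g / total if total > 0 else 0.5


def propagated_value(g_i, b_i, g_j, b_j, out_degree_i):
    """Value sent from v_i to v_j after both attenuations (illustrative only)."""
    if out_degree_i <= 0:
        raise ValueError("v_i must have at least one child node")
    # Assumed unattenuated value: G(v_i) split evenly over the children of v_i.
    unattenuated = g_i / out_degree_i
    # One attenuation uses source-node information, the other target-node information.
    p_i = normal_probability(g_i, b_i)
    p_j = normal_probability(g_j, b_j)
    return unattenuated * p_i * p_j
```

Under these assumptions, a source with $G = 0.9$, $B = 0.1$ and three children sends $0.3 \times 0.9 \times 0.5 = 0.135$ to a target with $G = B = 0.5$, so a link into a page that looks equally normal and junk carries half of what the source would otherwise pass on.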

The beneficial effects of the technical scheme are as follows:

the method uses the link structure among network nodes to calculate node score values and sorts web pages by those values; because score values propagate along the direction of links, junk web pages are demoted, which raises the ranking of normal web pages in a search engine as far as possible, lowers the ranking of junk web pages, and effectively improves the precision and speed of junk web page degradation;

the method of the invention does not need to establish a complex and huge probability model, saves network storage resources and computing resources, and greatly improves the processing speed and precision of the degradation of the junk web pages.

Drawings

FIG. 1 is a schematic flow chart of a method for degrading spam web pages based on a non-probabilistic model according to the present invention;

FIG. 2 is a node network diagram according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described with reference to the accompanying drawings.

In this embodiment, referring to fig. 1, the present invention provides a method for degrading spam web pages based on a non-probabilistic model, including the steps of:

S300, constructing a node network graph according to the node adjacency list;

As an optimization scheme of the above embodiment, the URL links of the acquired web page and the linked URL links thereof are stored in the database in the form of an adjacency list.

As an optimization scheme of the above embodiment, the node network graph constructed from the adjacency list is $G = (V, E)$, where $G$ is a directed unweighted graph;

wherein V is the set of all nodes, and E is the set of all edges;

As an optimization scheme of the above embodiment, in the step S400, classifying and labeling the top ranked web pages includes the steps of:

As an optimization scheme of the above embodiment, in step S500, an iterative algorithm is used to perform value propagation until the algorithm converges, and a node scoring value is obtained, including the steps of:

wherein,

When all nodes in the node network graph are sorted in descending order of node score value, $G(v_i)$ and $B(v_i)$ of node $v_i$ serve as unnormalized approximations of the probabilities that the node is a normal web page and a junk web page, respectively; the larger $G(v_i)$ is, the more likely node $v_i$ is a normal web page; the larger $B(v_i)$ is, the more likely node $v_i$ is a junk web page.

During propagation, the value propagated by the source node is attenuated twice: one attenuation uses information about the source node and the other uses information about the target node. The calculation process is as follows:

the unattenuated propagation value of node $v_i$ is:

the probability that node $v_i$ is a normal page is:

the probability that node $v_j$ is a normal page is:

wherein $|Out(v_i)|$ is the number of child nodes of node $v_i$.

As in the node network diagram example of fig. 2:

$Out(v_1) = \{v_2, v_3, v_4\}$, $|Out(v_1)| = 3$

the unattenuated propagation value of $v_1$ is:

the probability that $v_1$ is a normal page is:

the probability that $v_2$ is a normal page is:

with $p(v_1)$ and $p(v_2)$ as attenuation factors, the value propagated from node $v_1$ to node $v_2$ is:

Building on the PageRank ranking algorithm, the node importance algorithm based on a non-probability model can effectively lower the ranking of junk web pages in a search engine. On the public data sets WEBSPAM-UK2006 and WEBSPAM-UK2007, the algorithm provided by the invention performs better than classical algorithms such as TrustRank, PageRank and Anti-TrustRank.

The foregoing shows and describes the general principles and broad features of the present invention and advantages thereof. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims

1. A junk web page degradation method based on a non-probability model is characterized by comprising the following steps:

S300, constructing a node network graph according to the node adjacency list; the node network graph structure is $G = (V, E)$, where $G$ is a directed unweighted graph, $V$ is the set of all nodes, and $E$ is the set of all edges; if nodes $v_i$ and $v_j$ exist and there is a link pointing from node $v_i$ to node $v_j$, then $\langle v_i, v_j \rangle \in E$; for any node $v_i$, a link pointing to itself is not contained in $E$, i.e. $\langle v_i, v_i \rangle \notin E$;

S500, assigning an initial score value and an initial jump probability value to the labeled web pages; propagating values with an iterative algorithm until the algorithm converges to obtain node score values, comprising the following steps:

wherein,

$G(v_i)$ and $B(v_i)$ are calculated from $IG(v_i)$ and $IB(v_i)$; $\lambda$ takes the value 0.85; the algorithm runs for 100 iterations; $S_n$ is the normal web page set and $S_s$ is the junk web page set;

2. The method for degrading the spam web pages based on the non-probability model of claim 1, wherein the URL links of the obtained web pages and the URL links linked out of the obtained web pages are stored in a database in the form of an adjacency list.

3. The method for degrading spam webpages according to claim 1, wherein in the step S400, the webpages with the top rank are labeled by classification in sequence, comprising the steps of:

4. The method according to claim 1, wherein when all nodes in the node network graph are sorted in descending order of node score value, $G(v_i)$ and $B(v_i)$ of node $v_i$ serve as unnormalized approximations of the probabilities that the node is a normal web page and a junk web page, respectively; the larger $G(v_i)$ is, the more likely node $v_i$ is a normal web page; the larger $B(v_i)$ is, the more likely node $v_i$ is a junk web page.

5. The method for degrading the spam web pages based on the non-probability model according to claim 4, wherein in the process of value propagation by adopting an iterative algorithm, the value propagated by the source node is attenuated twice, one attenuation uses the information of the source node, and the other attenuation uses the information of the target node, and the calculation process is as follows:

the unattenuated propagation value of node $v_i$ is:

the probability that node $v_i$ is a normal page is:

the probability that node $v_j$ is a normal page is:

wherein $|Out(v_i)|$ is the number of child nodes of node $v_i$.