CN109902236B - Junk web page degradation method based on non-probability model - Google Patents

Publication number
CN109902236B
Authority
CN
China
Prior art keywords: node, webpage, junk, webpages, normal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910172890.7A
Other languages
Chinese (zh)
Other versions
CN109902236A (en)
Inventor
Not disclosed (不公告发明人)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Shuzhilian Technology Co Ltd
Original Assignee
Chengdu Shuzhilian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Shuzhilian Technology Co Ltd
Priority to CN201910172890.7A
Publication of CN109902236A
Application granted
Publication of CN109902236B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a junk web page degradation method based on a non-probability model, comprising: crawling web pages with a web crawler and parsing their content to obtain a web page URL list; computing a node adjacency list from the URL list; constructing a node network graph from the node adjacency list; ranking the nodes in the node network graph with the PageRank algorithm and labeling the top-ranked web pages, in order, as normal web pages or junk web pages; assigning an initial score value and an initial jump probability value to the labeled web pages; propagating values with an iterative algorithm until it converges to obtain node score values; and sorting all nodes in the node network graph in descending order of node score value to obtain the final page ranking. The method degrades junk web pages, raising the ranking of normal web pages in a search engine as far as possible and lowering the ranking of junk web pages, and improves the precision and speed of junk web page degradation.

Description

Junk web page degradation method based on non-probability model
Technical Field
The invention belongs to the technical field of webpage information, and particularly relates to a junk webpage degradation method based on a non-probability model.
Background
Spam web pages are web pages that use deceptive means to raise their rank in a search engine's results. They pose a significant challenge to both ordinary search engine users and search engine companies. For ordinary users, junk web pages fill search results with useless information and lengthen the time needed to find useful information; for search engine companies, storing, analyzing and indexing junk web pages consumes additional storage and computing resources.
Existing junk web page degradation methods need to build a large and complex probability model and degrade junk web pages by means of that model. This wastes network resources, identifies spam web pages very inefficiently, and cannot degrade them quickly and accurately. Existing degradation algorithms also generally rely on the approximate isolation principle, namely that normal web pages rarely link to junk web pages; they ignore propagation along web page links, which greatly reduces the precision of junk web page degradation.
Disclosure of Invention
In order to solve the problems, the invention provides a junk web page degradation method based on a non-probability model, which realizes junk web page degradation processing, improves the ranking of normal web pages in a search engine as much as possible, reduces the ranking of junk web pages, and effectively improves the processing precision and speed of junk web page degradation.
In order to achieve the purpose, the invention adopts the technical scheme that: a junk web page degradation method based on a non-probability model comprises the following steps:
S100, crawling web pages with a web crawler and parsing their content to obtain a web page URL list;
S200, computing a node adjacency list from the obtained URL list;
S300, constructing a node network graph from the node adjacency list;
S400, ranking the nodes in the node network graph with the PageRank algorithm, and labeling the top-ranked web pages in order, the labels being normal web page and junk web page;
S500, assigning an initial score value and an initial jump probability value to the labeled web pages, and propagating values with an iterative algorithm until it converges to obtain node score values;
S600, sorting all nodes in the node network graph in descending order of node score value to obtain the final page ranking.
Further, the URL of each acquired web page and its outgoing URL links are stored in the database in the form of an adjacency list.
Further, the node network graph constructed from the adjacency list is G = (V, E), where G is a directed unweighted graph;
wherein V is the set of all nodes and E is the set of all edges;
if nodes v_i and v_j exist and there is a link pointing from node v_i to node v_j, then ⟨v_i, v_j⟩ ∈ E; for any node v_i, a link pointing to itself is not contained in E, i.e. ⟨v_i, v_i⟩ ∉ E.
Further, in step S400, labeling the top-ranked web pages comprises the step of:
labeling the highest-ranked nodes in order until the number of labeled normal web pages and junk web pages is not less than 100; the set of labeled normal web pages is S_n and the set of labeled junk web pages is S_s.
Further, in step S500, propagating values with an iterative algorithm until it converges to obtain node score values comprises the steps of:
for each node v_i, let G(v_i) denote its positive rank value, B(v_i) its reverse rank value, In(v_i) the set of parent nodes of v_i, and Out(v_i) the set of child nodes of v_i;
computing G(v_i) and B(v_i) of each node with an iterative algorithm, the calculation formulas being:
Figure BDA0001988655060000022
Figure BDA0001988655060000023
wherein,
Figure BDA0001988655060000024
G(v_i) and B(v_i) are computed from IG(v_i) and IB(v_i); λ takes the value 0.85; the number of iterations of the algorithm is 100.
Further, when all nodes in the node network graph are sorted in descending order of node score value, G(v_i) and B(v_i) of node v_i serve as non-normalized approximations of the probability that the node is a normal web page and a junk web page, respectively; the larger G(v_i) is, the more likely node v_i is a normal web page; the larger B(v_i) is, the more likely node v_i is a junk web page.
Further, during propagation, the value propagated by the source node is attenuated twice: one attenuation uses information about the source node and the other uses information about the target node. The calculation process is as follows:
the unattenuated propagation value of node v_i is:
Figure BDA0001988655060000031
the probability that node v_i is a normal page is:
Figure BDA0001988655060000032
the probability that node v_j is a normal page is:
Figure BDA0001988655060000033
if ⟨v_i, v_j⟩ ∈ E, then with p(v_i) and p(v_j) as attenuation factors, the value propagated from node v_i to node v_j is:
Figure BDA0001988655060000034
wherein |Out(v_i)| is the number of child nodes of node v_i.
The beneficial effects of the technical scheme are as follows:
The method computes node score values from the link structure among network nodes, ranks web pages by those values, and degrades junk web pages through the directional propagation of score values, thereby raising the ranking of normal web pages in a search engine as far as possible, lowering the ranking of junk web pages, and improving the precision and speed of junk web page degradation;
The method does not need to build a large and complex probability model, which saves network storage and computing resources and improves the speed and precision of junk web page degradation.
Drawings
FIG. 1 is a schematic flow chart of a method for degrading spam web pages based on a non-probabilistic model according to the present invention;
FIG. 2 is a node network diagram according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described with reference to the accompanying drawings.
In this embodiment, referring to fig. 1, the present invention provides a method for degrading spam web pages based on a non-probabilistic model, including the steps of:
S100, crawling web pages with a web crawler and parsing their content to obtain a web page URL list;
S200, computing a node adjacency list from the obtained URL list;
S300, constructing a node network graph from the node adjacency list;
S400, ranking the nodes in the node network graph with the PageRank algorithm, and labeling the top-ranked web pages in order, the labels being normal web page and junk web page;
S500, assigning an initial score value and an initial jump probability value to the labeled web pages, and propagating values with an iterative algorithm until it converges to obtain node score values;
S600, sorting all nodes in the node network graph in descending order of node score value to obtain the final page ranking.
As an optimization of the above embodiment, the URL of each acquired web page and its outgoing URL links are stored in the database in the form of an adjacency list.
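A minimal sketch of how crawl output might be turned into such an adjacency list. The function name `build_adjacency`, the input shape (pairs of a page URL and its raw href strings) and the in-memory dict standing in for the database are assumptions for illustration; the patent does not specify them.

```python
from urllib.parse import urljoin, urldefrag


def build_adjacency(pages):
    """Map each crawled page URL to the set of absolute URLs it links to.

    `pages` is an iterable of (page_url, [href, ...]) pairs; this input
    shape is hypothetical and stands in for the crawler's real output.
    """
    adjacency = {}
    for page_url, hrefs in pages:
        # Drop any #fragment so the same page is not stored twice.
        source, _ = urldefrag(page_url)
        targets = adjacency.setdefault(source, set())
        for href in hrefs:
            # Resolve relative links against the page they appear on.
            target, _ = urldefrag(urljoin(source, href))
            # Keep only web links (skips mailto:, javascript:, and so on).
            if target.startswith(("http://", "https://")):
                targets.add(target)
    return adjacency
```

A dict of sets keeps each outgoing link once per page; a database table with (source, target) rows would carry the same information.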
As an optimization of the above embodiment, the node network graph constructed from the adjacency list is G = (V, E), where G is a directed unweighted graph;
wherein V is the set of all nodes and E is the set of all edges;
if nodes v_i and v_j exist and there is a link pointing from node v_i to node v_j, then ⟨v_i, v_j⟩ ∈ E; for any node v_i, a link pointing to itself is not contained in E, i.e. ⟨v_i, v_i⟩ ∉ E.
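The graph definition can be sketched as follows: every URL becomes a node, every link becomes a directed edge, and self-links are dropped. The function name `build_graph` and the returned triple of node set, child sets and parent sets are illustrative assumptions.

```python
def build_graph(adjacency):
    """Build a directed, unweighted graph without self-loops.

    `adjacency` maps a node to an iterable of nodes it links to.
    Returns (nodes, out_sets, in_sets), where out_sets[v] holds the
    children of v and in_sets[v] holds the parents of v.
    """
    nodes = set(adjacency)
    for targets in adjacency.values():
        nodes.update(targets)

    out_sets = {v: set() for v in nodes}
    in_sets = {v: set() for v in nodes}
    for source, targets in adjacency.items():
        for target in targets:
            if target == source:
                continue  # a link from a page to itself is not an edge
            out_sets[source].add(target)
            in_sets[target].add(source)
    return nodes, out_sets, in_sets
```

Storing both child and parent sets costs extra memory but lets later steps walk edges in either direction.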
As an optimization of the above embodiment, in step S400, labeling the top-ranked web pages comprises the step of:
labeling the highest-ranked nodes in order until the number of labeled normal web pages and junk web pages is not less than 100; the set of labeled normal web pages is S_n and the set of labeled junk web pages is S_s.
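A sketch of this step under stated assumptions: `pagerank` is a plain power-iteration PageRank, and `label_top_ranked` walks nodes from highest to lowest rank, asking a hypothetical `label_fn` (for example a human annotator) for each label. Reading the stopping rule as "at least 100 pages in each class" is an interpretation, as are the function names and the 50-iteration default.

```python
def pagerank(out_sets, damping=0.85, iterations=50):
    """Power-iteration PageRank; out_sets must have a key for every node."""
    nodes = list(out_sets)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is shared among all nodes.
        dangling = sum(rank[v] for v in nodes if not out_sets[v])
        nxt = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for w in out_sets[v]:
                nxt[w] += damping * rank[v] / len(out_sets[v])
        rank = nxt
    return rank


def label_top_ranked(rank, label_fn, min_per_class=100):
    """Label nodes in descending rank order until both sets are large enough.

    label_fn(node) is a hypothetical oracle returning "normal" or "spam".
    """
    normal, spam = set(), set()
    for v in sorted(rank, key=rank.get, reverse=True):
        if len(normal) >= min_per_class and len(spam) >= min_per_class:
            break
        (normal if label_fn(v) == "normal" else spam).add(v)
    return normal, spam
```

Labeling from the top of the PageRank order concentrates manual effort on the pages that matter most to search results.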
As an optimization of the above embodiment, in step S500, propagating values with an iterative algorithm until it converges to obtain node score values comprises the steps of:
for each node v_i, let G(v_i) denote its positive rank value, B(v_i) its reverse rank value, In(v_i) the set of parent nodes of v_i, and Out(v_i) the set of child nodes of v_i;
computing G(v_i) and B(v_i) of each node with an iterative algorithm, the calculation formulas being:
Figure BDA0001988655060000051
Figure BDA0001988655060000052
wherein,
Figure BDA0001988655060000053
G(v_i) and B(v_i) are computed from IG(v_i) and IB(v_i); λ takes the value 0.85; the number of iterations of the algorithm is 100.
When all nodes in the node network graph are sorted in descending order of node score value, G(v_i) and B(v_i) of node v_i serve as non-normalized approximations of the probability that the node is a normal web page and a junk web page, respectively; the larger G(v_i) is, the more likely node v_i is a normal web page; the larger B(v_i) is, the more likely node v_i is a junk web page.
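A small sketch of the final sort. The text does not state which value serves as the node score; this example assumes pages are ordered by G descending, with a lower B breaking ties. The function name `rank_pages` is hypothetical.

```python
def rank_pages(g_scores, b_scores):
    """Order nodes by good score, highest first (assumed sort key).

    Ties on G are broken in favor of the node with the lower B score;
    this tie-break is an assumption, not taken from the patent.
    """
    return sorted(g_scores, key=lambda v: (-g_scores[v], b_scores.get(v, 0.0)))
```

For example, with G = {a: 0.3, b: 0.3, c: 0.1} and B = {a: 0.2, b: 0.0, c: 0.0}, nodes a and b tie on G, so b is placed first because its B value is lower.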
During propagation, the value propagated by the source node is attenuated twice: the first attenuation uses information about the source node and the second uses information about the target node. The calculation process is as follows:
the unattenuated propagation value of node v_i is:
Figure BDA0001988655060000054
the probability that node v_i is a normal page is:
Figure BDA0001988655060000055
the probability that node v_j is a normal page is:
Figure BDA0001988655060000056
if ⟨v_i, v_j⟩ ∈ E, then with p(v_i) and p(v_j) as attenuation factors, the value propagated from node v_i to node v_j is:
Figure BDA0001988655060000057
wherein |Out(v_i)| is the number of child nodes of node v_i.
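The propagation with two attenuation factors can be sketched as below. Because the formula images are missing from this text, the update rule is an assumption: good score flows from parents to children, bad score flows from children back to parents, p(v) = G(v) / (G(v) + B(v)), initial values are uniform over each labeled set, and unscored nodes get p = 0.5. The name `propagate_scores` and its parameters are hypothetical.

```python
from collections import defaultdict


def propagate_scores(out_sets, normal_seeds, spam_seeds, lam=0.85, iterations=100):
    """Sketch of damped good/bad score propagation (assumed update rule)."""
    nodes = set(out_sets) | {w for ws in out_sets.values() for w in ws}
    in_sets = defaultdict(set)
    for u, ws in out_sets.items():
        for w in ws:
            in_sets[w].add(u)

    # Assumption: initial values are spread uniformly over each labeled set.
    ig = {v: 1.0 / len(normal_seeds) if v in normal_seeds else 0.0 for v in nodes}
    ib = {v: 1.0 / len(spam_seeds) if v in spam_seeds else 0.0 for v in nodes}
    g, b = dict(ig), dict(ib)

    for _ in range(iterations):
        # p(v): share of good score; 0.5 for unscored nodes is an assumption.
        p = {v: g[v] / (g[v] + b[v]) if g[v] + b[v] > 0 else 0.5 for v in nodes}
        new_g, new_b = {}, {}
        for v in nodes:
            # Good score flows along links, from each parent u to v,
            # attenuated by p of the source and p of the target.
            good = sum(g[u] / len(out_sets[u]) * p[u] * p[v] for u in in_sets[v])
            # Bad score flows against links, from each child w back to v.
            bad = sum(b[w] / len(in_sets[w]) * (1 - p[w]) * (1 - p[v])
                      for w in out_sets.get(v, ()))
            new_g[v] = lam * good + (1 - lam) * ig[v]
            new_b[v] = lam * bad + (1 - lam) * ib[v]
        g, b = new_g, new_b
    return g, b
```

On a chain a → b → c → s → t with a labeled normal and t labeled junk, this sketch gives b a higher good score than c, and s a higher bad score than c, which matches the intended effect of scores fading with distance from the labeled pages.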
As in the node network diagram example of FIG. 2:
Out(v_1) = {v_2, v_3, v_4}, |Out(v_1)| = 3
the unattenuated propagation value of v_1 is:
Figure BDA0001988655060000058
the probability that v_1 is a normal page is:
Figure BDA0001988655060000061
the probability that v_2 is a normal page is:
Figure BDA0001988655060000062
with p(v_1) and p(v_2) as attenuation factors, the value propagated from node v_1 to node v_2 is:
Figure BDA0001988655060000063
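A worked instance of this example with made-up numbers. The scores G(v_1) = 0.6, B(v_1) = 0.2, G(v_2) = 0.3 and B(v_2) = 0.3 are purely illustrative, and the forms used for the unattenuated value and for p are assumptions, since the formula images are not available in this text.

```latex
\begin{align}
\frac{G(v_1)}{|Out(v_1)|} &= \frac{0.6}{3} = 0.2 \\
p(v_1) &= \frac{G(v_1)}{G(v_1)+B(v_1)} = \frac{0.6}{0.8} = 0.75 \\
p(v_2) &= \frac{G(v_2)}{G(v_2)+B(v_2)} = \frac{0.3}{0.6} = 0.5 \\
\frac{G(v_1)}{|Out(v_1)|}\, p(v_1)\, p(v_2) &= 0.2 \times 0.75 \times 0.5 = 0.075
\end{align}
```

Under these assumptions, a target page that looks half as trustworthy receives half of what a fully trusted target would receive from the same source.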
The node importance algorithm based on the non-probability model builds on the PageRank ranking algorithm and can effectively lower the ranking of junk web pages in a search engine. On the public data sets WEBSPAM-UK2006 and WEBSPAM-UK2007, the algorithm provided by the invention performs better than classical algorithms such as TrustRank, PageRank and Anti-TrustRank.
The foregoing shows and describes the general principles, main features and advantages of the present invention. Those skilled in the art will understand that the present invention is not limited to the embodiments described above; the embodiments and the specification only illustrate the principle of the invention, and various changes and modifications may be made without departing from its spirit and scope, all of which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and their equivalents.

Claims (5)

1. A junk web page degradation method based on a non-probability model is characterized by comprising the following steps:
S100, crawling web pages with a web crawler and parsing their content to obtain a web page URL list;
S200, computing a node adjacency list from the obtained URL list;
S300, constructing a node network graph from the node adjacency list; the node network graph is G = (V, E), where G is a directed unweighted graph, V is the set of all nodes and E is the set of all edges; if nodes v_i and v_j exist and there is a link pointing from node v_i to node v_j, then ⟨v_i, v_j⟩ ∈ E; for any node v_i, a link pointing to itself is not contained in E, i.e. ⟨v_i, v_i⟩ ∉ E;
S400, ranking the nodes in the node network graph with the PageRank algorithm, and labeling the top-ranked web pages in order, the labels being normal web page and junk web page;
S500, assigning an initial score value and an initial jump probability value to the labeled web pages, and propagating values with an iterative algorithm until it converges to obtain node score values, comprising the steps of:
for each node v_i, let G(v_i) denote its positive rank value, B(v_i) its reverse rank value, In(v_i) the set of parent nodes of v_i, and Out(v_i) the set of child nodes of v_i;
computing G(v_i) and B(v_i) of each node with an iterative algorithm, the calculation formulas being:
Figure FDA0002946169090000012
Figure FDA0002946169090000013
wherein,
Figure FDA0002946169090000014
G(v_i) and B(v_i) are computed from IG(v_i) and IB(v_i); λ takes the value 0.85; the number of iterations of the algorithm is 100; S_n is the set of normal web pages and S_s is the set of junk web pages;
S600, sorting all nodes in the node network graph in descending order of node score value to obtain the final page ranking.
2. The junk web page degradation method based on a non-probability model of claim 1, wherein the URL of each obtained web page and its outgoing URL links are stored in a database in the form of an adjacency list.
3. The junk web page degradation method based on a non-probability model of claim 1, wherein in step S400, labeling the top-ranked web pages in order comprises the step of:
labeling the highest-ranked nodes in order until the number of labeled normal web pages and junk web pages is not less than 100; the set of labeled normal web pages is S_n and the set of labeled junk web pages is S_s.
4. The method according to claim 1, wherein, when all nodes in the node network graph are sorted in descending order of node score value, G(v_i) and B(v_i) of node v_i serve as non-normalized approximations of the probability that the node is a normal web page and a junk web page, respectively; the larger G(v_i) is, the more likely node v_i is a normal web page; the larger B(v_i) is, the more likely node v_i is a junk web page.
5. The junk web page degradation method based on a non-probability model according to claim 4, wherein, when values are propagated with the iterative algorithm, the value propagated by the source node is attenuated twice, one attenuation using information about the source node and the other using information about the target node, and the calculation process is as follows:
the unattenuated propagation value of node v_i is:
Figure FDA0002946169090000021
the probability that node v_i is a normal page is:
Figure FDA0002946169090000022
the probability that node v_j is a normal page is:
Figure FDA0002946169090000023
if ⟨v_i, v_j⟩ ∈ E, then with p(v_i) and p(v_j) as attenuation factors, the value propagated from node v_i to node v_j is:
Figure FDA0002946169090000024
wherein |Out(v_i)| is the number of child nodes of node v_i.
CN201910172890.7A 2019-03-07 2019-03-07 Junk web page degradation method based on non-probability model Active CN109902236B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910172890.7A CN109902236B (en) 2019-03-07 2019-03-07 Junk web page degradation method based on non-probability model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910172890.7A CN109902236B (en) 2019-03-07 2019-03-07 Junk web page degradation method based on non-probability model

Publications (2)

Publication Number Publication Date
CN109902236A CN109902236A (en) 2019-06-18
CN109902236B true CN109902236B (en) 2021-06-11

Family

ID=66946583

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910172890.7A Active CN109902236B (en) 2019-03-07 2019-03-07 Junk web page degradation method based on non-probability model

Country Status (1)

Country Link
CN (1) CN109902236B (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8352321B2 (en) * 2008-12-12 2013-01-08 Microsoft Corporation In-text embedded advertising

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102184208A (en) * 2011-04-29 2011-09-14 武汉慧人信息科技有限公司 Junk web page detection method based on multi-dimensional data abnormal cluster mining
CN103246677A (en) * 2012-02-13 2013-08-14 广州淘信互联网科技有限公司 Search method and search system on basis of social intercourse
CN102750345A (en) * 2012-06-07 2012-10-24 山东师范大学 Method for identifying web spam through web page multi-view data association combination
CN102750380A (en) * 2012-06-27 2012-10-24 山东师范大学 Page sorting method in combination with difference feature distribution and link feature
CN105447505A (en) * 2015-11-09 2016-03-30 成都数之联科技有限公司 Multilevel important email detection method
CN107423319A (en) * 2017-03-29 2017-12-01 天津大学 A kind of spam page detection method
CN108460158A (en) * 2018-03-28 2018-08-28 天津大学 Differentiation Web page sequencing method based on PageRank
CN108984630A (en) * 2018-06-20 2018-12-11 天津大学 Application method of the Node Contraction in Complex Networks importance in spam page detection

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
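Junk web page ranking detection combining topic similarity and link weight (主题相似度与链接权重相结合的垃圾网页排序检测); Wei Sha et al.; Journal of Computer Applications (《计算机应用》); 2016-03-10; Vol. 36, No. 3; Section 2.2 *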

Also Published As

Publication number Publication date
CN109902236A (en) 2019-06-18

Similar Documents

Publication Publication Date Title
CN107341270B (en) Social platform-oriented user emotion influence analysis method
Chakrabarti et al. Page-level template detection via isotonic smoothing
CN110543595B (en) In-station searching system and method
CN107885793A (en) A kind of hot microblog topic analyzing and predicting method and system
US20080256065A1 (en) Information Extraction System
CN1904886A (en) Method and apparatus for establishing link structure between multiple documents
Chatterjee et al. Single document extractive text summarization using genetic algorithms
CN102169496A (en) Anchor text analysis-based automatic domain term generating method
CN113239111B (en) Knowledge graph-based network public opinion visual analysis method and system
Gong et al. Chinese web text classification system model based on Naive Bayes
CN112905800A (en) Public character public opinion knowledge graph and XGboost multi-feature fusion emotion early warning method
CN104794209B (en) Chinese microblogging mood sorting technique based on Markov logical network and system
Song Sentiment analysis of Japanese text and vocabulary learning based on natural language processing and SVM
US8949254B1 (en) Enhancing the content and structure of a corpus of content
CN115438274A (en) False news identification method based on heterogeneous graph convolutional network
CN109902236B (en) Junk web page degradation method based on non-probability model
CN1766871A (en) The processing method of the semi-structured data extraction of semantics of based on the context
Marghny et al. Web mining based on genetic algorithm
CN110580280A (en) Method, device and storage medium for discovering new words
Dahiwale et al. Design of improved focused web crawler by analyzing semantic nature of URL and anchor text
Zheng Genetic and ant algorithms based focused crawler design
CN115599915A (en) Long text classification method based on TextRank and attention mechanism
Kannan et al. Text document clustering using statistical integrated graph based sentence sensitivity ranking algorithm
Sajeev A community based web summarization in near linear time
CN111797235A (en) Text real-time clustering method based on time attenuation factor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: 610000 No. 270, floor 2, No. 8, Jinxiu street, Wuhou District, Chengdu, Sichuan

Patentee after: Chengdu shuzhilian Technology Co.,Ltd.

Address before: 610000 No.2, 4th floor, building 1, Jule Road intersection, West 1st section of 1st ring road, Wuhou District, Chengdu City, Sichuan Province

Patentee before: CHENGDU SHUZHILIAN TECHNOLOGY Co.,Ltd.