CN109902236B - Junk web page degradation method based on non-probability model - Google Patents
- Publication number
- CN109902236B
- Authority
- CN
- China
- Prior art keywords
- node
- webpage
- junk
- webpages
- normal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a junk web page degradation method based on a non-probability model, comprising: crawling web pages with a web crawler and parsing their content to obtain a web page URL list; computing a node adjacency list from the URL list; constructing a node network graph from the node adjacency list; ranking the nodes in the node network graph with the PageRank algorithm and labeling the top-ranked web pages, in order, as normal web pages or junk web pages; assigning an initial score value and an initial jump probability value to the labeled web pages; propagating values with an iterative algorithm until it converges to obtain node score values; and sorting all nodes in the node network graph in descending order of node score value to obtain the final ranking of the pages. The method degrades junk web pages: it raises the ranking of normal web pages in a search engine as far as possible, lowers the ranking of junk web pages, and improves the precision and speed of junk web page degradation.
Description
Technical Field
The invention belongs to the technical field of webpage information, and particularly relates to a junk webpage degradation method based on a non-probability model.
Background
Junk web pages (spam web pages) are web pages that use cheating techniques to raise their rank in search engine results. Their presence poses a significant challenge to both ordinary search engine users and search engine companies. For ordinary users, junk web pages fill search results with useless information and lengthen the time needed to find useful information; for search engine companies, junk web pages consume additional resources for storage, analysis and indexing, which wastes storage and computing resources.
Existing junk web page degradation methods need to build a large and complex probability calculation model and degrade junk web pages by means of that model; such methods waste network resources, identify junk web pages inefficiently, and cannot degrade junk web pages quickly and accurately. Existing junk web page degradation algorithms also generally rely on the approximate isolation principle, namely that normal web pages rarely link to junk web pages, and ignore propagation along web page links, which greatly reduces the precision of junk web page degradation.
Disclosure of Invention
To solve the above problems, the invention provides a junk web page degradation method based on a non-probability model, which degrades junk web pages, raises the ranking of normal web pages in a search engine as far as possible, lowers the ranking of junk web pages, and improves the precision and speed of junk web page degradation.
To achieve this purpose, the invention adopts the following technical solution: a junk web page degradation method based on a non-probability model, comprising the steps of:
S100, crawling web pages with a web crawler and parsing their content to obtain a web page URL list;
S200, computing a node adjacency list from the obtained URL list;
S300, constructing a node network graph from the node adjacency list;
S400, ranking the nodes in the node network graph with the PageRank algorithm, and labeling the top-ranked web pages in order, the class labels being normal web page and junk web page;
S500, assigning an initial score value and an initial jump probability value to the labeled web pages, and propagating values with an iterative algorithm until the algorithm converges to obtain node score values;
S600, sorting all nodes in the node network graph in descending order of node score value to obtain the final ranking of the pages.
Further, the URL of each acquired web page and the URLs it links out to are stored in a database in the form of an adjacency list.
Further, the node network graph constructed from the adjacency list is $G = (V, E)$, where $G$ is a directed, unweighted graph;
where $V$ is the set of all nodes and $E$ is the set of all edges;
if there is a link pointing from node $v_i$ to node $v_j$, then $\langle v_i, v_j \rangle \in E$; for any node $v_i$, a link pointing to itself is not contained in $E$, i.e. $\langle v_i, v_i \rangle \notin E$.
Further, in step S400, labeling the top-ranked web pages comprises the step of:
labeling the highest-ranked nodes in order until the number of labeled normal web pages and junk web pages is not less than 100; the labeled normal web page set is $S_n$ and the labeled junk web page set is $S_s$.
Further, in step S500, propagating values with an iterative algorithm until the algorithm converges to obtain the node score values comprises the steps of:
for each node $v_i$, let $G(v_i)$ denote its positive rank value, $B(v_i)$ its reverse rank value, $In(v_i)$ the set of parent nodes of $v_i$, and $Out(v_i)$ the set of child nodes of $v_i$;
calculating $G(v_i)$ and $B(v_i)$ of each node with an iterative algorithm, the calculation formula being as follows:
wherein,
$G(v_i)$ and $B(v_i)$ are calculated from $IG(v_i)$ and $IB(v_i)$; $\lambda$ takes the value 0.85; the number of iterations of the algorithm is 100.
Further, when all nodes in the node network graph are sorted in descending order of node score value, $G(v_i)$ and $B(v_i)$ of node $v_i$ serve as non-normalized approximations of the probability that the node is a normal web page and a junk web page, respectively; the larger $G(v_i)$ is, the more likely node $v_i$ is a normal web page; the larger $B(v_i)$ is, the more likely node $v_i$ is a junk web page.
Further, during propagation, the value propagated by the source node is attenuated twice: one attenuation uses the information of the source node and the other uses the information of the target node. The calculation process is as follows:
the probability that node $v_j$ is a normal web page is:
if $\langle v_i, v_j \rangle \in E$, then, with $p(v_i)$ and $p(v_j)$ as attenuation factors, the value propagated from node $v_i$ to node $v_j$ is:
where $|Out(v_i)|$ is the number of child nodes of node $v_i$.
The beneficial effects of the above technical solution are as follows:
The method uses the link structure among network nodes to calculate node score values and ranks web pages by those values. Because score propagation is directional, junk web pages are degraded: the ranking of normal web pages in a search engine is raised as far as possible, the ranking of junk web pages is lowered, and the precision and speed of junk web page degradation are improved.
The method does not need to build a large and complex probability model, which saves network storage and computing resources and improves the speed and precision of junk web page degradation.
Drawings
FIG. 1 is a schematic flow chart of the junk web page degradation method based on a non-probability model according to the invention;
FIG. 2 is a node network graph according to an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described with reference to the accompanying drawings.
In this embodiment, referring to FIG. 1, the invention provides a junk web page degradation method based on a non-probability model, comprising the steps of:
S100, crawling web pages with a web crawler and parsing their content to obtain a web page URL list;
S200, computing a node adjacency list from the obtained URL list;
S300, constructing a node network graph from the node adjacency list;
S400, ranking the nodes in the node network graph with the PageRank algorithm, and labeling the top-ranked web pages in order, the class labels being normal web page and junk web page;
S500, assigning an initial score value and an initial jump probability value to the labeled web pages, and propagating values with an iterative algorithm until the algorithm converges to obtain node score values;
S600, sorting all nodes in the node network graph in descending order of node score value to obtain the final ranking of the pages.
As an optimization of the above embodiment, the URL of each acquired web page and the URLs it links out to are stored in a database in the form of an adjacency list.
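A minimal sketch of how steps S100 to S200 could produce such an adjacency list, under stated assumptions: the crawler output is modeled as a plain dict from page URL to the hrefs found on that page, and an in-memory dict stands in for the database. The function name `build_adjacency_list` and the example URLs are hypothetical and do not come from the patent.

```python
from urllib.parse import urldefrag, urljoin


def build_adjacency_list(pages):
    """Map each page URL to the ordered, de-duplicated list of URLs it links to.

    pages: dict of page URL -> iterable of raw href strings (assumed crawler output).
    """
    adjacency = {}
    for source, hrefs in pages.items():
        src = urldefrag(source).url
        targets = adjacency.setdefault(src, [])
        for href in hrefs:
            # Resolve relative links against the source page and drop #fragments.
            target = urldefrag(urljoin(src, href)).url
            if target not in targets:
                targets.append(target)
            # Every linked page also becomes a key, even if it was never crawled.
            adjacency.setdefault(target, [])
    return adjacency


if __name__ == "__main__":
    crawled = {
        "http://a.example/index.html": [
            "about.html",
            "http://b.example/",
            "about.html#team",
        ],
    }
    for url, links in build_adjacency_list(crawled).items():
        print(url, "->", links)
```

Self-links are deliberately kept at this stage; in this sketch they would be dropped later, when the edge set of the graph is built.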
As an optimization of the above embodiment, the node network graph constructed from the adjacency list is $G = (V, E)$, where $G$ is a directed, unweighted graph;
where $V$ is the set of all nodes and $E$ is the set of all edges;
if there is a link pointing from node $v_i$ to node $v_j$, then $\langle v_i, v_j \rangle \in E$; for any node $v_i$, a link pointing to itself is not contained in $E$, i.e. $\langle v_i, v_i \rangle \notin E$.
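The graph definition can be sketched as follows. This is an illustration only: `build_graph` is a hypothetical helper, and the toy adjacency list is made up. It returns the node set, the edge set without self-loops, and the child and parent sets of every node.

```python
def build_graph(adjacency):
    """Build a directed, unweighted graph without self-loops from an adjacency list.

    Returns (V, E, out_sets, in_sets), where out_sets[v] holds the children of v
    and in_sets[v] holds its parents.
    """
    V = set(adjacency)
    for targets in adjacency.values():
        V.update(targets)
    # A link from a page to itself is never added to E.
    E = {(u, v) for u, targets in adjacency.items() for v in targets if u != v}
    out_sets = {v: set() for v in V}
    in_sets = {v: set() for v in V}
    for u, v in E:
        out_sets[u].add(v)
        in_sets[v].add(u)
    return V, E, out_sets, in_sets


if __name__ == "__main__":
    toy = {"v1": ["v2", "v3", "v4", "v1"], "v2": ["v3"], "v3": [], "v4": ["v1"]}
    V, E, out_sets, in_sets = build_graph(toy)
    print(sorted(E))
    print(sorted(out_sets["v1"]), sorted(in_sets["v3"]))
```

Using sets for the edges makes the graph unweighted by construction, since repeated links between the same two pages collapse into one edge.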
As an optimization of the above embodiment, in step S400, labeling the top-ranked web pages comprises the step of:
labeling the highest-ranked nodes in order until the number of labeled normal web pages and junk web pages is not less than 100; the labeled normal web page set is $S_n$ and the labeled junk web page set is $S_s$.
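A hedged sketch of step S400. The PageRank routine is the standard power iteration with the conventional damping factor 0.85, which is an assumption because the text does not state the PageRank parameters. `is_spam` is a hypothetical callback standing in for whatever judgment assigns a label, and the stopping rule is read here as "each set reaches the minimum size", which is one possible reading of the text.

```python
def pagerank(out_sets, damping=0.85, iterations=100):
    """Plain power-iteration PageRank over a dict of node -> set of children."""
    nodes = sorted(out_sets)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread evenly over all nodes.
        dangling = sum(rank[v] for v in nodes if not out_sets[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for u in nodes:
            if out_sets[u]:
                share = damping * rank[u] / len(out_sets[u])
                for v in out_sets[u]:
                    new_rank[v] += share
        rank = new_rank
    return rank


def label_seeds(rank, is_spam, minimum=100):
    """Walk nodes from highest to lowest rank, labeling each one,
    and stop once both seed sets hold at least `minimum` nodes."""
    normal, spam = set(), set()
    for v in sorted(rank, key=lambda node: (-rank[node], node)):
        if len(normal) >= minimum and len(spam) >= minimum:
            break
        (spam if is_spam(v) else normal).add(v)
    return normal, spam


if __name__ == "__main__":
    graph = {"a": {"b"}, "b": {"c"}, "c": {"a"}, "d": {"a"}}
    ranks = pagerank(graph)
    print(label_seeds(ranks, is_spam=lambda v: v in {"c", "d"}, minimum=1))
```

If the graph runs out of nodes before both sets are full, the loop simply ends with smaller seed sets.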
As an optimization of the above embodiment, in step S500, propagating values with an iterative algorithm until the algorithm converges to obtain the node score values comprises the steps of:
for each node $v_i$, let $G(v_i)$ denote its positive rank value, $B(v_i)$ its reverse rank value, $In(v_i)$ the set of parent nodes of $v_i$, and $Out(v_i)$ the set of child nodes of $v_i$;
calculating $G(v_i)$ and $B(v_i)$ of each node with an iterative algorithm, the calculation formula being as follows:
wherein,
$G(v_i)$ and $B(v_i)$ are calculated from $IG(v_i)$ and $IB(v_i)$; $\lambda$ takes the value 0.85; the number of iterations of the algorithm is 100.
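The update formula itself is not reproduced in this text, so the following is only a stand-in sketch and not the patented formula. It substitutes a TrustRank-style forward update for $G$ and an Anti-TrustRank-style backward update for $B$, with $\lambda = 0.85$ and 100 iterations as stated above. The uniform initial values over the seed sets and the name `propagate_scores` are assumptions.

```python
def propagate_scores(out_sets, in_sets, normal_seeds, spam_seeds,
                     lam=0.85, iterations=100):
    """Stand-in iteration: G flows along links, B flows against them.

    ASSUMPTION: the update rules are TrustRank / Anti-TrustRank style,
    swapped in because the source formula is unavailable.
    """
    nodes = list(out_sets)
    # Assumed initial values IG and IB: uniform mass over each labeled seed set.
    ig = {v: (1.0 / len(normal_seeds) if v in normal_seeds else 0.0) for v in nodes}
    ib = {v: (1.0 / len(spam_seeds) if v in spam_seeds else 0.0) for v in nodes}
    g, b = dict(ig), dict(ib)
    for _ in range(iterations):
        new_g, new_b = {}, {}
        for v in nodes:
            # Good score arrives from parents, split over each parent's children.
            inflow = sum(g[u] / len(out_sets[u]) for u in in_sets[v])
            # Bad score arrives from children, split over each child's parents.
            backflow = sum(b[w] / len(in_sets[w]) for w in out_sets[v])
            new_g[v] = (1.0 - lam) * ig[v] + lam * inflow
            new_b[v] = (1.0 - lam) * ib[v] + lam * backflow
        g, b = new_g, new_b
    return g, b


if __name__ == "__main__":
    out_sets = {"good": {"mid"}, "mid": set(), "farm": {"spam"}, "spam": set()}
    in_sets = {"good": set(), "mid": {"good"}, "farm": set(), "spam": {"farm"}}
    g, b = propagate_scores(out_sets, in_sets, {"good"}, {"spam"})
    print(g)
    print(b)
```

In the toy graph, the page linked from a normal seed picks up a positive $G$, while the page linking to a junk seed picks up a positive $B$, which is the directional behavior the description relies on.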
When all nodes in the node network graph are sorted in descending order of node score value, $G(v_i)$ and $B(v_i)$ of node $v_i$ serve as non-normalized approximations of the probability that the node is a normal web page and a junk web page, respectively; the larger $G(v_i)$ is, the more likely node $v_i$ is a normal web page; the larger $B(v_i)$ is, the more likely node $v_i$ is a junk web page.
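Step S600 reduces to a descending sort. The text does not spell out how $G$ and $B$ combine into the final node score, so this sketch accepts any score mapping; the example scores are invented, and `final_ranking` is a hypothetical name.

```python
def final_ranking(scores):
    """Return node names sorted by descending score; ties break by name
    so the output is deterministic."""
    return sorted(scores, key=lambda v: (-scores[v], v))


if __name__ == "__main__":
    # Invented scores for illustration only.
    print(final_ranking({"a": 0.2, "b": 0.5, "c": 0.2}))
```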
During propagation, the value propagated by the source node is attenuated twice: the first attenuation uses the information of the source node and the second uses the information of the target node. The calculation process is as follows:
if $\langle v_i, v_j \rangle \in E$, then, with $p(v_i)$ and $p(v_j)$ as attenuation factors, the value propagated from node $v_i$ to node $v_j$ is:
where $|Out(v_i)|$ is the number of child nodes of node $v_i$.
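One way to read the double attenuation is sketched below. Both formulas in the code are assumptions made for illustration, because the source formulas are not reproduced in this text: $p(v) = G(v) / (G(v) + B(v))$ as the probability of being a normal page, and $p(v_i) \cdot p(v_j) \cdot G(v_i) / |Out(v_i)|$ as the propagated value. The scores in the example are invented.

```python
def normal_probability(g, b, v):
    """ASSUMED form of p(v): share of the positive score in the total score.
    Falls back to 0.5 when a node has no score at all (also an assumption)."""
    total = g[v] + b[v]
    return g[v] / total if total > 0 else 0.5


def propagated_value(g, b, out_sets, source, target):
    """ASSUMED double attenuation: the source's equal share G/|Out| is scaled
    once by p(source) and once by p(target)."""
    if target not in out_sets[source]:
        return 0.0
    share = g[source] / len(out_sets[source])
    return normal_probability(g, b, source) * normal_probability(g, b, target) * share


if __name__ == "__main__":
    # v1 has three children, so its share is split three ways.
    out_sets = {"v1": {"v2", "v3", "v4"}, "v2": set(), "v3": set(), "v4": set()}
    g = {"v1": 0.6, "v2": 0.1, "v3": 0.0, "v4": 0.0}  # invented scores
    b = {"v1": 0.2, "v2": 0.3, "v3": 0.0, "v4": 0.0}  # invented scores
    print(propagated_value(g, b, out_sets, "v1", "v2"))
```

With these numbers, $p(v_1) = 0.75$, $p(v_2) = 0.25$ and the share is $0.6 / 3 = 0.2$, so roughly $0.0375$ reaches $v_2$. A target that already looks like junk therefore receives less of the source's positive score.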
Taking the node network graph of FIG. 2 as an example:
$Out(v_1) = \{v_2, v_3, v_4\}$, $|Out(v_1)| = 3$
With $p(v_1)$ and $p(v_2)$ as attenuation factors, the value propagated from node $v_1$ to node $v_2$ is:
The node importance algorithm based on the non-probability model effectively lowers the ranking of junk web pages in a search engine that is based on the PageRank ranking algorithm. On the public data sets WEBSPAM-UK2006 and WEBSPAM-UK2007, the algorithm provided by the invention performs better than classical algorithms such as TrustRank, PageRank and Anti-TrustRank.
The foregoing shows and describes the general principles and broad features of the present invention and advantages thereof. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.
Claims (5)
1. A junk web page degradation method based on a non-probability model is characterized by comprising the following steps:
S100, crawling web pages with a web crawler and parsing their content to obtain a web page URL list;
S200, computing a node adjacency list from the obtained URL list;
S300, constructing a node network graph from the node adjacency list; the node network graph is $G = (V, E)$, where $G$ is a directed, unweighted graph, $V$ is the set of all nodes, and $E$ is the set of all edges; if there is a link pointing from node $v_i$ to node $v_j$, then $\langle v_i, v_j \rangle \in E$; for any node $v_i$, a link pointing to itself is not contained in $E$, i.e. $\langle v_i, v_i \rangle \notin E$;
S400, ranking the nodes in the node network graph with the PageRank algorithm, and labeling the top-ranked web pages in order, the class labels being normal web page and junk web page;
S500, assigning an initial score value and an initial jump probability value to the labeled web pages, and propagating values with an iterative algorithm until the algorithm converges to obtain node score values, comprising the steps of:
for each node $v_i$, let $G(v_i)$ denote its positive rank value, $B(v_i)$ its reverse rank value, $In(v_i)$ the set of parent nodes of $v_i$, and $Out(v_i)$ the set of child nodes of $v_i$;
calculating $G(v_i)$ and $B(v_i)$ of each node with an iterative algorithm, the calculation formula being as follows:
wherein,
$G(v_i)$ and $B(v_i)$ are calculated from $IG(v_i)$ and $IB(v_i)$; $\lambda$ takes the value 0.85; the number of iterations of the algorithm is 100; $S_n$ is the normal web page set and $S_s$ is the junk web page set;
S600, sorting all nodes in the node network graph in descending order of node score value to obtain the final ranking of the pages.
2. The method for degrading the spam web pages based on the non-probability model of claim 1, wherein the URL links of the obtained web pages and the URL links linked out of the obtained web pages are stored in a database in the form of an adjacency list.
3. The junk web page degradation method based on a non-probability model according to claim 1, wherein in step S400, labeling the top-ranked web pages in order comprises the step of:
labeling the highest-ranked nodes in order until the number of labeled normal web pages and junk web pages is not less than 100; the labeled normal web page set is $S_n$ and the labeled junk web page set is $S_s$.
4. The junk web page degradation method based on a non-probability model according to claim 1, wherein when all nodes in the node network graph are sorted in descending order of node score value, $G(v_i)$ and $B(v_i)$ of node $v_i$ serve as non-normalized approximations of the probability that the node is a normal web page and a junk web page, respectively; the larger $G(v_i)$ is, the more likely node $v_i$ is a normal web page; the larger $B(v_i)$ is, the more likely node $v_i$ is a junk web page.
5. The junk web page degradation method based on a non-probability model according to claim 4, wherein during value propagation with the iterative algorithm, the value propagated by the source node is attenuated twice, one attenuation using the information of the source node and the other using the information of the target node, the calculation process being as follows:
the probability that node $v_j$ is a normal web page is:
if $\langle v_i, v_j \rangle \in E$, then, with $p(v_i)$ and $p(v_j)$ as attenuation factors, the value propagated from node $v_i$ to node $v_j$ is:
where $|Out(v_i)|$ is the number of child nodes of node $v_i$.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910172890.7A CN109902236B (en) | 2019-03-07 | 2019-03-07 | Junk web page degradation method based on non-probability model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109902236A (en) | 2019-06-18 |
CN109902236B (en) | 2021-06-11 |
Family
ID=66946583
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910172890.7A (Active), CN109902236B (en) | 2019-03-07 | 2019-03-07 | Junk web page degradation method based on non-probability model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109902236B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102184208A (en) * | 2011-04-29 | 2011-09-14 | 武汉慧人信息科技有限公司 | Junk web page detection method based on multi-dimensional data abnormal cluster mining |
CN102750380A (en) * | 2012-06-27 | 2012-10-24 | 山东师范大学 | Page sorting method in combination with difference feature distribution and link feature |
CN102750345A (en) * | 2012-06-07 | 2012-10-24 | 山东师范大学 | Method for identifying web spam through web page multi-view data association combination |
CN103246677A (en) * | 2012-02-13 | 2013-08-14 | 广州淘信互联网科技有限公司 | Search method and search system on basis of social intercourse |
CN105447505A (en) * | 2015-11-09 | 2016-03-30 | 成都数之联科技有限公司 | Multilevel important email detection method |
CN107423319A (en) * | 2017-03-29 | 2017-12-01 | 天津大学 | A kind of spam page detection method |
CN108460158A (en) * | 2018-03-28 | 2018-08-28 | 天津大学 | Differentiation Web page sequencing method based on PageRank |
CN108984630A (en) * | 2018-06-20 | 2018-12-11 | 天津大学 | Application method of the Node Contraction in Complex Networks importance in spam page detection |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8352321B2 (en) * | 2008-12-12 | 2013-01-08 | Microsoft Corporation | In-text embedded advertising |
Non-Patent Citations (1)
Title |
---|
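Spam web page ranking detection combining topic similarity and link weight (主题相似度与链接权重相结合的垃圾网页排序检测); Wei Sha et al.; Journal of Computer Applications (《计算机应用》); 2016-03-10; Vol. 36, No. 3; Section 2.2 * |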
Also Published As
Publication number | Publication date |
---|---|
CN109902236A (en) | 2019-06-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107341270B (en) | Social platform-oriented user emotion influence analysis method | |
Chakrabarti et al. | Page-level template detection via isotonic smoothing | |
CN110543595B (en) | In-station searching system and method | |
CN107885793A (en) | A kind of hot microblog topic analyzing and predicting method and system | |
US20080256065A1 (en) | Information Extraction System | |
CN1904886A (en) | Method and apparatus for establishing link structure between multiple documents | |
Chatterjee et al. | Single document extractive text summarization using genetic algorithms | |
CN102169496A (en) | Anchor text analysis-based automatic domain term generating method | |
CN113239111B (en) | Knowledge graph-based network public opinion visual analysis method and system | |
Gong et al. | Chinese web text classification system model based on Naive Bayes | |
CN112905800A (en) | Public character public opinion knowledge graph and XGboost multi-feature fusion emotion early warning method | |
CN104794209B (en) | Chinese microblogging mood sorting technique based on Markov logical network and system | |
Song | Sentiment analysis of Japanese text and vocabulary learning based on natural language processing and SVM | |
US8949254B1 (en) | Enhancing the content and structure of a corpus of content | |
CN115438274A (en) | False news identification method based on heterogeneous graph convolutional network | |
CN109902236B (en) | Junk web page degradation method based on non-probability model | |
CN1766871A (en) | The processing method of the semi-structured data extraction of semantics of based on the context | |
Marghny et al. | Web mining based on genetic algorithm | |
CN110580280A (en) | Method, device and storage medium for discovering new words | |
Dahiwale et al. | Design of improved focused web crawler by analyzing semantic nature of URL and anchor text | |
Zheng | Genetic and ant algorithms based focused crawler design | |
CN115599915A (en) | Long text classification method based on TextRank and attention mechanism | |
Kannan et al. | Text document clustering using statistical integrated graph based sentence sensitivity ranking algorithm | |
Sajeev | A community based web summarization in near linear time | |
CN111797235A (en) | Text real-time clustering method based on time attenuation factor |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CP03 | Change of name, title or address | ||
Address after: 610000 No. 270, floor 2, No. 8, Jinxiu street, Wuhou District, Chengdu, Sichuan
Patentee after: Chengdu shuzhilian Technology Co.,Ltd.
Address before: 610000 No.2, 4th floor, building 1, Jule Road intersection, West 1st section of 1st ring road, Wuhou District, Chengdu City, Sichuan Province
Patentee before: CHENGDU SHUZHILIAN TECHNOLOGY Co.,Ltd.