CN108984630B - Application method of node importance in complex network in spam webpage detection - Google Patents


Info

Publication number
CN108984630B
Authority
CN
China
Prior art keywords
web page
value
webpage
importance
scores
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810637788.5A
Other languages
Chinese (zh)
Other versions
CN108984630A (en
Inventor
罗韬
刘伟
喻梅
徐天一
赵满坤
郭佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University
Priority to CN201810637788.5A
Publication of CN108984630A
Application granted
Publication of CN108984630B
Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

A method for applying node importance in complex networks to spam web page detection comprises the following steps: preprocessing the data by normalizing the known feature data and then using the PCA algorithm for feature extraction, which reduces the feature dimension so that the new space has a lower dimension than the original feature space; calculating the weight of the links between web pages and the betweenness index of each web page, and fusing the betweenness index with the weights to calculate an importance score for each web page; sorting the web pages by importance score, selecting the web pages with the highest and the lowest scores to jointly form a seed set, and assigning an initial trust value to each web page; calculating the aggregation (clustering) coefficient, and taking web pages whose aggregation coefficient is above a threshold as trusted web pages; calculating the confidence transfer matrix; and iteratively calculating the CTRank scores using the transfer matrix TC and ranking the web pages by their converged CTRank scores. The method can effectively detect spam web pages and lower their ranking.

Description

Application method of node importance in complex network in spam webpage detection
Technical Field
The invention relates to web page detection, and in particular to a method for applying node importance in complex networks to spam web page detection.
Background
The related art currently falls into two main categories. The first is the PageRank link-analysis algorithm, a classic ranking method based on link structure on which many existing link-based ranking algorithms build. PageRank ranks web pages according to the number and quality of their incoming links: a PageRank value is calculated for every web page, and the pages are then ordered by that value. The idea is to simulate an ordinary internet user who opens a web page at random, browses it, and then jumps to other pages linked from it until browsing ends; PageRank essentially estimates the probability that such a user visits each page. The PageRank value is usually computed iteratively until it converges. Although PageRank is widely used in search engines, it has drawbacks. First, it uses the ranking alone as the criterion, so the judgment is simple and coarse. Second, it clearly favors web pages that have existed for a longer time, because a page created early gains a large advantage in recommendation over time. These limitations mean that the final ranking is not necessarily accurate: the score is computed against a fixed criterion and may differ from the true quality of the page.
The second is the TrustRank algorithm, which is also a ranking algorithm based on link relationships. Ranking with TrustRank helps prevent cheating that manipulates the ranking, and thereby improves the quality of search results. With this technique it is difficult for spammers to change the ranking order in a short time, so ranking quality improves. The method uses the trust values of a subset of web pages to judge the others: the larger a page's TrustRank value, the better its quality is taken to be. However, the creators of spam web pages keep adapting their cheating methods as technology develops. For example, they paste the addresses of their spam pages into the comment sections of high-quality web pages, exploiting a weakness of the TrustRank algorithm to raise their own ranking.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a method for applying node importance in complex networks to spam web page detection, namely a spam web page detection algorithm based on the betweenness index and the aggregation coefficient.
The technical scheme adopted by the invention is as follows: an application method of node importance in spam webpage detection in a complex network comprises the following steps:
1) data preprocessing: the known feature data are normalized, which compresses the data into a common range, gives all attributes equal weight, and eliminates the influence of differing dimensions (units and scales) on subsequent calculation; after normalization, feature extraction is performed with the PCA algorithm to reduce the feature dimension, so that the new space has a lower dimension than the original feature space;
2) calculating the weight of links among the webpages and the betweenness index of the webpages, and fusing the betweenness index and the weight to calculate the importance score of each webpage;
3) sorting according to the importance scores, selecting the webpages with the highest scores and the lowest scores to jointly form a seed set, and assigning an initial value of the trust degree to each webpage;
4) calculating an aggregation coefficient, and taking the webpage with the aggregation coefficient higher than a threshold value as a trusted webpage;
5) calculating a confidence transfer matrix;
6) iteratively calculating the CTRank scores using the transfer matrix TC, and ranking the web pages by their converged CTRank scores.
The normalization in step 1) uses the z-score formula:
$$v'_i = \frac{v_i - \bar{A}}{\sigma_A}$$
where $\bar{A}$ is the mean of attribute A, $\sigma_A$ is the standard deviation of attribute A, $v_i$ is the value of the i-th data item on attribute A, and $v'_i$ is the normalized value of the i-th data item on attribute A;
after the feature vectors are normalized with z-score, every feature has zero mean and unit standard deviation, so all feature data lie on a comparable scale and the influence of differing dimensions on subsequent calculation is eliminated.
Calculating the weight of the links between the web pages in the step 2), and adopting the following formula:
Figure BDA0001701987040000023
where $dist_{i,j}$ is the Euclidean distance from web page i to web page j, and $w_{i,j}$ is the resulting weight of the link from web page i to web page j.
The betweenness index in step 2) is:
$$bc(v) = \sum_{s,t} \frac{\sigma_{st}(v)}{\sigma_{st}}$$
where $\sigma_{st}(v)$ is the number of shortest paths from s to t that pass through node v, and $\sigma_{st}$ is the total number of shortest paths from s to t; the end points of each path are also counted as part of the path; $bc(v)$ is the betweenness index value of node v. Applying this formula to every web page gives the betweenness index value of each web page.
The importance score in step 2) is the sum, over all out-linked web pages of a web page, of each out-linked page's betweenness index value multiplied by the weight of the corresponding link, plus the betweenness index value of the web page itself, as shown in the following formula:
$$BCW(i) = \sum_{j=1}^{n} w_{i,j} \cdot bc_j + bc_i$$
where BCW(i) is the final importance score of web page i, $w_{i,j}$ is the weight of the link from web page i to web page j, $bc_i$ is the betweenness index value of web page i, and n is the number of out-links of web page i.
Step 3) is specifically as follows: the web pages are sorted by importance score in descending order; the web pages with the highest importance scores are taken as trusted seeds and the web pages with the lowest importance scores as spam seeds, and together they form the seed set; an initial trust value is then assigned to every web page, positive for trusted seeds, negative for spam seeds, and 0 for all other web pages. The specific formula is:
Figure BDA0001701987040000031
where W is the set of spam seeds, v is the number of spam seeds, T is the set of trusted seeds, u is the number of trusted seeds, and $d_i$ is the initial trust value of web page i.
The calculation formula of the aggregation coefficient in the step 4) is as follows:
Figure BDA0001701987040000032
where k is the number of edges connecting the in-link nodes of node i to one another, IN(i) is the number of in-link nodes of node i, i.e. the in-degree of node i, and ClusterCoefficient(i) is the aggregation coefficient of node i.
Step 5): once the clustering coefficients have been evaluated, the trust value is split in a differentiated manner, i.e. the confidence transfer matrix is calculated:
Figure BDA0001701987040000033
where θ(u) is the set of out-links of web page u,
Figure BDA0001701987040000034
is the number of out-links of web page u that point to web pages with a high aggregation coefficient,
Figure BDA0001701987040000035
is the total number of the other out-links of web page u, O(i) is the number of out-links of web page i, CC is the set of web pages judged trustworthy according to the aggregation coefficient, and TC(i, u) is the entry in row i and column u of the confidence transfer matrix.
In step 6), the CTRank score is calculated iteratively with the transfer matrix TC using the following formula:
$$ct_i = \alpha \cdot TC(i) \cdot ct_{i-1} + (1-\alpha) \cdot d$$
where α is the attenuation factor, TC(i) is the transfer matrix for the i-th iteration, d is the initially assigned trust value, and $ct_i$ is the CTRank score of the web pages at the i-th iteration.
The method of the invention for applying node importance in complex networks to spam web page detection can effectively detect spam web pages and lower their ranking. It has the following advantages:
1. A new method for selecting the seed set is provided, which addresses the problem that creators of spam web pages can add out-links on high-quality web pages to raise their own ranking in the network.
2. A CTRank ranking algorithm is provided. In the clustering coefficient, the number of neighbors is replaced by the number of in-links, and the confidence transfer matrix is calculated in a differentiated way according to the modified clustering coefficient score and the out-links of each node. This addresses the problems that the TrustRank algorithm distributes link weights evenly and ignores web page importance when calculating the confidence transfer matrix.
Detailed Description
The following describes in detail an application method of node importance in spam web page detection in a complex network according to an embodiment of the present invention.
The application method of the node importance in the complex network in the spam webpage detection comprises the following steps:
1) data preprocessing: the known feature data are normalized, which compresses the data into a common range, gives all attributes equal weight, and eliminates the influence of differing dimensions (units and scales) on subsequent calculation; feature extraction is then performed with the PCA algorithm to reduce the feature dimension, so that the new space has a lower dimension than the original feature space, which simplifies subsequent calculation;
the normalization uses the z-score formula:
$$v'_i = \frac{v_i - \bar{A}}{\sigma_A}$$
where $\bar{A}$ is the mean of attribute A, $\sigma_A$ is the standard deviation of attribute A, $v_i$ is the value of the i-th data item on attribute A, and $v'_i$ is the normalized value of the i-th data item on attribute A;
after the feature vectors are normalized with z-score, every feature has zero mean and unit standard deviation, so all feature data lie on a comparable scale and the influence of differing dimensions on subsequent calculation is eliminated.
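The z-score step can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the function name `z_score_columns` and the sample data are hypothetical, the population standard deviation is assumed (the source does not say which estimator is used), and the PCA step is not shown.

```python
from statistics import mean, pstdev

def z_score_columns(rows):
    """Apply z-score normalization to each column (attribute) of a list of feature rows."""
    columns = list(zip(*rows))
    normalized_columns = []
    for col in columns:
        mu = mean(col)
        sigma = pstdev(col)  # assumption: population standard deviation
        if sigma == 0:
            # A constant attribute carries no information; map it to 0.
            normalized_columns.append([0.0] * len(col))
        else:
            normalized_columns.append([(v - mu) / sigma for v in col])
    # Transpose back to one row per web page.
    return [list(row) for row in zip(*normalized_columns)]

# Hypothetical feature data: two attributes on very different scales.
features = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]
normalized = z_score_columns(features)
```

After this step both attributes are centered on 0 and expressed in units of their own standard deviation, so neither dominates a later distance calculation merely because of its scale.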
2) Calculating the weight of the links between web pages and the betweenness index of each web page, and fusing the betweenness index with the weights to calculate the importance score of each web page, wherein:
when two web pages are linked, a small distance between them indicates high similarity, so the link should be given a larger weight; on this principle, the weight of the link between web pages is calculated with the following formula:
Figure BDA0001701987040000043
where $dist_{i,j}$ is the Euclidean distance from web page i to web page j, and $w_{i,j}$ is the resulting weight of the link from web page i to web page j.
The betweenness index is
$$bc(v) = \sum_{s,t} \frac{\sigma_{st}(v)}{\sigma_{st}}$$
where $\sigma_{st}(v)$ is the number of shortest paths from s to t that pass through node v, and $\sigma_{st}$ is the total number of shortest paths from s to t; the end points of each path are also counted as part of the path; $bc(v)$ is the betweenness index value of node v. Applying this formula to every web page gives the betweenness index value of each web page.
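As an illustration of how a betweenness value can be obtained for every page, the sketch below uses Brandes' algorithm on an unweighted directed link graph. It is not taken from the patent: the function name `betweenness`, the adjacency-dict input, and the sample graph are assumptions, and this standard variant does not credit a path to its own end points, whereas the text above states that end points are counted.

```python
from collections import deque

def betweenness(graph):
    """Betweenness of every node; graph maps a node to a list of its out-link targets."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    bc = {v: 0.0 for v in nodes}
    for s in nodes:
        # Breadth-first search from s, counting shortest paths (sigma).
        sigma = {v: 0 for v in nodes}
        dist = {v: -1 for v in nodes}
        preds = {v: [] for v in nodes}
        sigma[s], dist[s] = 1, 0
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph.get(v, []):
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies, farthest nodes first.
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc

# Hypothetical link graph: a reaches d through either b or c.
bc = betweenness({"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []})
```

In the sample graph the two shortest paths from a to d split evenly between b and c, so each of them receives a betweenness of 0.5.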
Fusing the betweenness index with the link weights defines an importance score for each web page: the sum, over all out-linked web pages of the web page, of each out-linked page's betweenness index value multiplied by the weight of the corresponding link, plus the betweenness index value of the web page itself, as shown in the following formula:
$$BCW(i) = \sum_{j=1}^{n} w_{i,j} \cdot bc_j + bc_i$$
where BCW(i) is the final importance score of web page i, $w_{i,j}$ is the weight of the link from web page i to web page j, $bc_i$ is the betweenness index value of web page i, and n is the number of out-links of web page i.
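The importance score described above can be sketched as follows. The function name `bcw_scores`, the dictionary layout, and the sample betweenness values and link weights are hypothetical; the weights are simply taken as given, since the weight formula itself is only available as an image in the source.

```python
def bcw_scores(bc, weights):
    """Importance score per page: own betweenness plus weighted betweenness of out-linked pages.

    bc      -- dict: page -> betweenness index value
    weights -- dict: (i, j) -> weight of the link from page i to page j
    """
    scores = dict(bc)  # start from each page's own betweenness value
    for (i, j), w in weights.items():
        scores[i] += w * bc[j]  # add the weighted betweenness of out-linked page j
    return scores

# Hypothetical inputs for a four-page graph.
bc_values = {"a": 0.0, "b": 0.5, "c": 0.5, "d": 0.0}
link_weights = {("a", "b"): 0.8, ("a", "c"): 0.4, ("b", "d"): 1.0, ("c", "d"): 1.0}
scores = bcw_scores(bc_values, link_weights)
```

Here page a has no betweenness of its own but links to two pages that do, so its score comes entirely from its out-links (0.8 · 0.5 + 0.4 · 0.5 = 0.6).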
3) Sorting according to the importance scores, selecting the webpages with the highest scores and the lowest scores to jointly form a seed set, and assigning an initial value of the trust degree to each webpage;
specifically, the web pages are sorted by importance score in descending order; the web pages with the highest importance scores are taken as trusted seeds and the web pages with the lowest importance scores as spam seeds, and together they form the seed set; an initial trust value is then assigned to every web page, positive for trusted seeds, negative for spam seeds, and 0 for all other web pages. The specific formula is:
Figure BDA0001701987040000051
where W is the set of spam seeds, v is the number of spam seeds, T is the set of trusted seeds, u is the number of trusted seeds, and $d_i$ is the initial trust value of web page i.
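A minimal sketch of the seed selection and initial trust assignment is given below. The exact values in the patent's formula are only available as an image, so the magnitudes used here (1/u for each of the u trusted seeds and −1/v for each of the v spam seeds) are an assumption that merely respects the stated signs; the function name `initial_trust` and the sample scores are also hypothetical.

```python
def initial_trust(scores, num_trusted, num_spam):
    """Pick seeds from the ranked importance scores and assign initial trust values."""
    if num_trusted < 1 or num_spam < 1 or num_trusted + num_spam > len(scores):
        raise ValueError("seed counts must be positive and fit within the page set")
    ranked = sorted(scores, key=scores.get, reverse=True)  # highest score first
    trusted = set(ranked[:num_trusted])   # trusted seeds T
    spam = set(ranked[-num_spam:])        # spam seeds W
    d = {}
    for page in ranked:
        if page in trusted:
            d[page] = 1.0 / num_trusted   # assumed positive value
        elif page in spam:
            d[page] = -1.0 / num_spam     # assumed negative value
        else:
            d[page] = 0.0
    return d

# Hypothetical importance scores.
trust = initial_trust({"a": 0.6, "b": 0.5, "c": 0.5, "d": 0.0, "e": 0.1}, 1, 1)
```

With one seed of each kind, page a (highest score) starts at 1.0, page d (lowest score) at −1.0, and every other page at 0.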
4) Calculating an aggregation coefficient, and taking the webpage with the aggregation coefficient higher than a threshold value as a trusted webpage;
After the initial trust value of each web page has been obtained, the trust is split. CTRank is a differentiated trust ranking algorithm: when the trust transfer matrix is calculated, good web pages and spam web pages should not receive equal shares of the trust value being split. The method uses the clustering coefficient to judge whether a web page can be trusted, and splits trust differently for different web pages when calculating the trust transfer matrix.
The aggregation coefficient is calculated with the following formula:
Figure BDA0001701987040000052
where k is the number of edges connecting the in-link nodes of node i to one another, IN(i) is the number of in-link nodes of node i, i.e. the in-degree of node i, and ClusterCoefficient(i) is the aggregation coefficient of node i.
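The following sketch computes an in-link based clustering coefficient and the resulting trusted set. The normalization is an assumption: the source formula is only available as an image, so the common directed form k / (IN(i) · (IN(i) − 1)) is used here. The function name `in_link_clustering`, the sample graph, and the threshold 0.1 are hypothetical.

```python
def in_link_clustering(graph):
    """Clustering coefficient over in-link neighbors; graph maps node -> set of out-link targets."""
    in_neighbors = {}
    for u, targets in graph.items():
        in_neighbors.setdefault(u, set())
        for t in targets:
            in_neighbors.setdefault(t, set()).add(u)
    coeff = {}
    for node, sources in in_neighbors.items():
        n = len(sources)  # in-degree IN(i)
        if n < 2:
            coeff[node] = 0.0  # no pair of in-link neighbors to connect
            continue
        # k: number of directed edges among the in-link neighbors of this node.
        k = sum(1 for a in sources for b in sources
                if a != b and b in graph.get(a, ()))
        coeff[node] = k / (n * (n - 1))  # assumed normalization (directed form)
    return coeff

# Hypothetical graph: c is linked from a, b and d, and a also links to b.
link_graph = {"a": {"b", "c"}, "b": {"c"}, "c": set(), "d": {"c"}}
coeff = in_link_clustering(link_graph)
threshold = 0.1  # hypothetical threshold
trusted_pages = {page for page, value in coeff.items() if value > threshold}
```

Page c has three in-link neighbors with one edge among them, giving 1/6 under the assumed normalization, which places it in the trusted set CC for this threshold.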
5) Calculating a confidence transfer matrix;
once the clustering coefficients have been evaluated, the trust value is split in a differentiated manner, i.e. the confidence transfer matrix is calculated:
Figure BDA0001701987040000053
where θ(u) is the set of out-links of web page u,
Figure BDA0001701987040000054
is the number of out-links of web page u that point to web pages with a high aggregation coefficient,
Figure BDA0001701987040000055
is the total number of the other out-links of web page u, O(i) is the number of out-links of web page i, CC is the set of web pages judged trustworthy according to the aggregation coefficient, and TC(i, u) is the entry in row i and column u of the confidence transfer matrix.
6) Iteratively calculating the CTRank scores using the transfer matrix TC, and ranking the web pages by their converged CTRank scores. The CTRank score is calculated iteratively with the transfer matrix TC using the following formula:
$$ct_i = \alpha \cdot TC(i) \cdot ct_{i-1} + (1-\alpha) \cdot d \qquad (8)$$
where α is the attenuation factor, TC(i) is the transfer matrix for the i-th iteration, d is the initially assigned trust value, and $ct_i$ is the CTRank score of the web pages at the i-th iteration.
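The iteration can be sketched as a simple fixed-point loop. Several things here are assumptions rather than details from the source: the transfer matrix is held fixed across iterations, α is set to 0.85, convergence is tested with a maximum absolute difference, and the 3-page matrix `tc` is a hand-written stand-in because the patent's TC formula is only available as an image. The function name `ctrank` is hypothetical.

```python
def ctrank(tc, d, alpha=0.85, tol=1e-10, max_iter=1000):
    """Iterate ct = alpha * TC * ct + (1 - alpha) * d until the scores stop changing.

    tc -- square matrix as a list of rows; tc[i][u] is the share of page u's trust passed to page i
    d  -- list of initial trust values
    """
    n = len(d)
    ct = list(d)  # start from the initial trust values
    for _ in range(max_iter):
        nxt = [alpha * sum(tc[i][u] * ct[u] for u in range(n)) + (1 - alpha) * d[i]
               for i in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, ct)) < tol:
            return nxt
        ct = nxt
    return ct

# Hypothetical example: page 0 is a trusted seed linking only to page 1; page 2 is a spam seed.
tc = [[0.0, 0.0, 0.0],
      [1.0, 0.0, 0.0],
      [0.0, 0.0, 0.0]]
d = [1.0, 0.0, -1.0]
ct_scores = ctrank(tc, d)
ranking = sorted(range(len(ct_scores)), key=lambda i: ct_scores[i], reverse=True)
```

In this toy case the page linked from the trusted seed inherits a positive score, while the spam seed stays negative and ends up last in the ranking.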
The following is the verification of the application method of the importance of the node in the complex network in the spam webpage detection.
The effectiveness of the BCW seed set selection algorithm and the CTRank ranking algorithm is evaluated using the number of spam web pages, the accuracy, the recall rate, and the F value.
Table 1 Evaluation criteria
Figure BDA0001701987040000061
(1) Accuracy (Precision)
The accuracy (precision) measures the exactness of the experiment, i.e. the ratio of the number of correctly detected spam web pages to the total number of web pages detected as spam. The calculation is shown in formula (1).
Figure BDA0001701987040000062
(2) Recall ratio (Recall)
The recall rate measures completeness, i.e. the ratio of the number of spam web pages detected to the total number of spam web pages in the data set. The calculation is shown in formula (2).
Figure BDA0001701987040000063
(3) F value (F-measure)
Accuracy and recall sometimes conflict, so they need to be considered together; the most common way to do so is the F value, which is the weighted harmonic mean of accuracy and recall. The calculation is shown in formula (3).
Figure BDA0001701987040000064
In formula (3), F-Measure denotes the F value, P denotes the accuracy (precision) of the algorithm, and R denotes its recall, i.e. the proportion of spam web pages that the algorithm detects correctly.
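The three measures can be sketched as below. The function name `precision_recall_f` and the sample page identifiers are hypothetical, and the weighting of the harmonic mean is an assumption: the source formula is only available as an image, so the general F-beta form is used with beta = 1 (equal weighting) as the default.

```python
def precision_recall_f(detected, actual_spam, beta=1.0):
    """Return (precision, recall, F value) for a set of pages flagged as spam."""
    detected, actual_spam = set(detected), set(actual_spam)
    true_positives = len(detected & actual_spam)  # spam pages correctly detected
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(actual_spam) if actual_spam else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Weighted harmonic mean; beta = 1 weights precision and recall equally (assumption).
    f_value = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
    return precision, recall, f_value

# Hypothetical detection result: 4 pages flagged, 3 pages are really spam, 2 overlap.
p, r, f = precision_recall_f({"p1", "p2", "p3", "p4"}, {"p1", "p2", "p5"})
```

For this sample, precision is 2/4, recall is 2/3, and the equally weighted F value is 4/7.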
The experiment first analyzes the effectiveness of the BCW seed set selection algorithm by comparing it with the Inverse PageRank and High PageRank algorithms, using accuracy, recall rate, and F value as evaluation criteria; the results are shown in Tables 2 to 4. It then verifies the effectiveness of the CTRank ranking algorithm by comparing it with the TrustRank algorithm on the same criteria, with both algorithms using the seed set selected by the BCW algorithm; the results are shown in Tables 5 to 7. The results indicate that the BCW algorithm and the CTRank algorithm can effectively detect spam web pages and lower their ranking.
TABLE 2 junk web page accuracy comparison for setting different thresholds
Figure BDA0001701987040000065
Figure BDA0001701987040000071
TABLE 3 comparison of spam web page recall rates for setting different thresholds
Figure BDA0001701987040000072
TABLE 4 comparison of spam web page F values when setting different thresholds
Figure BDA0001701987040000073
TABLE 5 comparison of spam web page accuracy for setting different thresholds
Figure BDA0001701987040000074
TABLE 6 comparison of spam web page recall rates for setting different thresholds
Figure BDA0001701987040000075
TABLE 7 comparison of spam web page F values for different threshold settings
Figure BDA0001701987040000076

Claims (4)

1. An application method of node importance in spam webpage detection in a complex network is characterized by comprising the following steps:
1) data preprocessing: the known feature data are normalized, which compresses the data into a common range, gives all attributes equal weight, and eliminates the influence of differing dimensions (units and scales) on subsequent calculation; after normalization, feature extraction is performed with the PCA algorithm to reduce the feature dimension, so that the new space has a lower dimension than the original feature space;
2) calculating the weight of links among the webpages and the betweenness index of the webpages, and fusing the betweenness index and the weight to calculate the importance score of each webpage;
3) sorting according to the importance scores, selecting the webpages with the highest scores and the lowest scores to jointly form a seed set, and assigning an initial value of the trust degree to each webpage;
4) calculating an aggregation coefficient, and taking the webpage with the aggregation coefficient higher than a threshold value as a trusted webpage;
the said aggregation coefficient is calculated by the following formula:
Figure FDA0003085293410000011
wherein k is the number of edges connecting the in-link nodes of node i to one another, IN(i) is the number of in-link nodes of node i, i.e. the in-degree of node i, and ClusterCoefficient(i) is the aggregation coefficient of node i;
5) calculating a confidence transfer matrix;
once the clustering coefficients have been evaluated, the trust value is split in a differentiated manner, i.e. the confidence transfer matrix is calculated:
Figure FDA0003085293410000012
wherein θ(u) is the set of out-links of web page u,
Figure FDA0003085293410000013
is the number of out-links of web page u that point to web pages with a high aggregation coefficient,
Figure FDA0003085293410000014
is the total number of the other out-links of web page u, O(i) is the number of out-links of web page i, CC is the set of web pages judged trustworthy according to the aggregation coefficient, and TC(i, u) is the entry in row i and column u of the confidence transfer matrix;
6) iteratively calculating the CTRank scores using the transfer matrix TC, ranking the web pages by their converged CTRank scores, and determining the spam web pages according to the ranking result;
the CTRank score is calculated iteratively with the transfer matrix TC using the following formula:
$$ct_i = \alpha \cdot TC(i) \cdot ct_{i-1} + (1-\alpha) \cdot d$$
where α is the attenuation factor, TC(i) is the transfer matrix for the i-th iteration, d is the initially assigned trust value, and $ct_i$ is the CTRank score of the web pages at the i-th iteration.
2. The method for applying node importance in spam web page detection in a complex network according to claim 1, wherein the normalization in step 1) uses the z-score formula:
$$v'_i = \frac{v_i - \bar{A}}{\sigma_A}$$
where $\bar{A}$ is the mean of attribute A, $\sigma_A$ is the standard deviation of attribute A, $v_i$ is the value of the i-th data item on attribute A, and $v'_i$ is the normalized value of the i-th data item on attribute A;
after the feature vectors are normalized with z-score, every feature has zero mean and unit standard deviation, so all feature data lie on a comparable scale and the influence of differing dimensions on subsequent calculation is eliminated.
3. The method for applying node importance in spam web page detection in a complex network according to claim 1, wherein the weight of the links between the web pages in step 2) is calculated by the following formula:
Figure FDA0003085293410000023
where $dist_{i,j}$ is the Euclidean distance from web page i to web page j, and $w_{i,j}$ is the resulting weight of the link from web page i to web page j;
the betweenness index in step 2) is:
$$bc(v) = \sum_{s,t} \frac{\sigma_{st}(v)}{\sigma_{st}}$$
where $\sigma_{st}(v)$ is the number of shortest paths from s to t that pass through node v, and $\sigma_{st}$ is the total number of shortest paths from s to t; the end points of each path are also counted as part of the path; $bc(v)$ is the betweenness index value of node v; applying this formula to every web page gives the betweenness index value of each web page;
the importance score in step 2) is the sum, over all out-linked web pages of a web page, of each out-linked page's betweenness index value multiplied by the weight of the corresponding link, plus the betweenness index value of the web page itself, as shown in the following formula:
$$BCW(i) = \sum_{j=1}^{n} w_{i,j} \cdot bc_j + bc_i$$
where BCW(i) is the final importance score of web page i, $w_{i,j}$ is the weight of the link from web page i to web page j, $bc_i$ is the betweenness index value of web page i, and n is the number of out-links of web page i.
4. The method for applying the importance of the nodes in the complex network to spam web page detection according to claim 1, wherein the step 3) specifically comprises the steps of sorting the web pages from large to small according to the importance scores, taking the web page with the highest importance score as a trusted seed, taking the web page with the lowest importance score as a spam seed, and jointly forming a seed set, then assigning an initial trust value to all the web pages, assigning a positive value to the trusted seed, assigning a negative value to the spam seed, and assigning 0 to other web pages, wherein the specific calculation formula is as follows:
Figure FDA0003085293410000026
where W is the set of spam seeds, v is the number of spam seeds, T is the set of trusted seeds, u is the number of trusted seeds, and $d_i$ is the initial trust value of web page i.
CN201810637788.5A 2018-06-20 2018-06-20 Application method of node importance in complex network in spam webpage detection Active CN108984630B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810637788.5A CN108984630B (en) 2018-06-20 2018-06-20 Application method of node importance in complex network in spam webpage detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810637788.5A CN108984630B (en) 2018-06-20 2018-06-20 Application method of node importance in complex network in spam webpage detection

Publications (2)

Publication Number Publication Date
CN108984630A CN108984630A (en) 2018-12-11
CN108984630B true CN108984630B (en) 2021-08-24

Family

ID=64541528

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810637788.5A Active CN108984630B (en) 2018-06-20 2018-06-20 Application method of node importance in complex network in spam webpage detection

Country Status (1)

Country Link
CN (1) CN108984630B (en)

Also Published As

Publication number Publication date
CN108984630A (en) 2018-12-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant