CN108984630B

CN108984630B - Application method of node importance in complex network in spam webpage detection

Info

Publication number: CN108984630B
Application number: CN201810637788.5A
Authority: CN
Inventors: 罗韬; 刘伟; 喻梅; 徐天一; 赵满坤; 郭佳
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2018-06-20
Filing date: 2018-06-20
Publication date: 2021-08-24
Anticipated expiration: 2038-06-20
Also published as: CN108984630A

Abstract

A method for applying node importance in a complex network to spam web page detection comprises the following steps: data preprocessing, in which the known feature data are normalized and the PCA algorithm is then used for feature extraction, reducing the feature dimension so that the new space has a lower dimension than the original feature space; calculating the weight of the links between web pages and the betweenness index of each web page, and fusing the betweenness index and the weights to calculate an importance score for each web page; sorting by importance score, selecting the web pages with the highest and the lowest scores to jointly form a seed set, and assigning an initial trust value to each web page; calculating the aggregation coefficient, and taking web pages whose aggregation coefficient is higher than a threshold as trusted web pages; calculating the confidence transfer matrix; and iteratively calculating the CTRank scores using the transfer matrix TC and ranking the converged CTRank scores. The method can effectively detect spam web pages and lower their ranking.

Description

Application method of node importance in complex network in spam webpage detection

Technical Field

The invention relates to web page detection, and in particular to a method for applying node importance in a complex network to spam web page detection.

Background

The related art currently falls into two main categories. The first is the PageRank link analysis algorithm, a classic ranking method based on link structure on which many existing link-structure ranking algorithms are built. PageRank ranks web pages according to the number and quality of their incoming links: a PageRank value is calculated for each web page, and the pages are then ranked by importance according to that value. The idea is to simulate an ordinary internet user who opens a web page at random, browses it, and then jumps to other web pages linked from it; PageRank essentially calculates the probability that such a user visits each web page while surfing. The algorithm is usually computed iteratively, and its values converge once the iteration is complete. Although popular in search engines, PageRank has certain drawbacks. First, it uses ranking as its only criterion, so the judgment process is simple and crude. Second, it clearly favors web pages that have existed for a longer time, since a page built early gains a large advantage in recommendation as time passes. These limitations mean that the final ranking of web pages is not necessarily accurate, because the evaluated quality of a web page is not necessarily its true quality: the true quality cannot be assessed directly and can only be calculated according to some criterion.

The second category is the TrustRank algorithm, which is also a ranking algorithm based on link relationships. Calculating the ranking with TrustRank can effectively hinder cheating techniques that manipulate rankings, and thereby improves the quality of search results. With this technique it is difficult for spammers to change the ranking order in a short time, so ranking quality improves. The method mainly uses the trust values of a subset of web pages to judge the other web pages: the larger a page's TrustRank value, the better its quality. However, with the rapid development of technology, the makers of spam web pages keep changing their cheating methods in step. For example, they paste the addresses of their spam web pages at random into the comment areas of high-quality web pages, exploiting an existing vulnerability of the TrustRank algorithm to raise their own ranking.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a method for applying node importance in a complex network to spam web page detection, namely a spam web page detection algorithm based on the betweenness index and the aggregation coefficient.

The technical scheme adopted by the invention is as follows: a method for applying node importance in a complex network to spam web page detection, comprising the following steps:

1) data preprocessing: the known feature data are normalized, i.e. compressed into a common range so that all attributes receive equal weight and the influence of dimension (units of measure) on subsequent calculations is eliminated; after normalization, the PCA algorithm is used for feature extraction, reducing the feature dimension so that the new space has a lower dimension than the original feature space;

2) calculating the weight of the links between web pages and the betweenness index of each web page, and fusing the betweenness index and the weights to calculate the importance score of each web page;

3) sorting by importance score, selecting the web pages with the highest scores and the lowest scores to jointly form a seed set, and assigning an initial trust value to each web page;

4) calculating the aggregation coefficient, and taking web pages whose aggregation coefficient is higher than a threshold as trusted web pages;

5) calculating the confidence transfer matrix;

6) iteratively calculating the CTRank scores using the transfer matrix TC, and ranking the converged CTRank scores.

The normalization in step 1) uses the z-score normalization formula:

$$v'_i = \frac{v_i - \bar{A}}{\sigma_A}$$

where $\bar{A}$ is the mean of attribute A, $\sigma_A$ denotes the standard deviation of attribute A, $v_i$ denotes the value of the i-th data item on attribute A, and $v'_i$ is the value of the i-th data item on attribute A after normalization;

after the feature vectors are subjected to normalization processing by using z-score, the value ranges of all feature data are between 0 and 1, and the influence of dimension on the subsequent calculation of the data is eliminated.
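As an illustration of the z-score step only (the PCA step is omitted), the following minimal Python sketch normalizes one attribute column. The function name `z_score` and the sample column are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say whether the population or sample form is intended. Note that z-scores are centered on 0 with unit standard deviation, so individual values can be negative or larger than 1.

```python
import math


def z_score(values):
    """Return a z-score normalized copy of one attribute column."""
    n = len(values)
    mean = sum(values) / n
    # Assumption: population standard deviation (divide by n, not n - 1).
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        # Constant attribute: every value equals the mean, so map all to 0.
        return [0.0] * n
    return [(v - mean) / std for v in values]


# Hypothetical attribute column with mean 5 and population std 2.
column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
normalized = z_score(column)
```

Each attribute (column) of the feature matrix would be passed through the same function before dimensionality reduction.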

The weight of the links between web pages in step 2) is calculated by the following formula:

where $dist_{i,j}$ denotes the Euclidean distance from web page i to web page j, and $w_{i,j}$ is the resulting weight from web page i to web page j.

The betweenness index in step 2) is:

$$bc(v) = \sum_{s \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}$$

where $\sigma_{st}(v)$ is the number of shortest paths from s to t that pass through point v, and $\sigma_{st}$ is the total number of shortest paths from s to t, the end points of each path also being counted as part of the path; $bc(v)$ is the betweenness index value of point v. Substituting every web page into the formula yields the betweenness index value of each web page.
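The betweenness computation can be sketched as follows for a small directed, unweighted link graph. This is a straightforward pairwise version rather than an optimized algorithm, and the names `bfs_counts`, `betweenness` and the `count_endpoints` flag are hypothetical. It relies on the fact that for a node v on a shortest path from s to t, the number of shortest s-t paths through v equals the number of shortest s-v paths times the number of shortest v-t paths. By default the end points are counted, following the statement in the text.

```python
from collections import deque


def bfs_counts(graph, source):
    """Return (dist, sigma): BFS distances and shortest-path counts from source."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in graph.get(u, ()):
            if w not in dist:
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:
                # u is a predecessor of w on some shortest path from source.
                sigma[w] += sigma[u]
    return dist, sigma


def betweenness(graph, count_endpoints=True):
    """Sum sigma_st(v) / sigma_st over all ordered pairs s != t with t reachable."""
    nodes = set(graph) | {w for targets in graph.values() for w in targets}
    info = {s: bfs_counts(graph, s) for s in nodes}
    bc = {v: 0.0 for v in nodes}
    for s in nodes:
        dist_s, sigma_s = info[s]
        for t in dist_s:
            if t == s:
                continue
            for v in dist_s:
                if not count_endpoints and v in (s, t):
                    continue
                dist_v, sigma_v = info[v]
                # v lies on a shortest s-t path iff the distances add up.
                if t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                    bc[v] += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return bc


# Hypothetical link graphs: adjacency lists of out-links.
chain = {"a": ["b"], "b": ["c"], "c": []}
diamond = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
chain_bc = betweenness(chain)
diamond_bc = betweenness(diamond, count_endpoints=False)
```

For large web graphs an optimized method such as Brandes' algorithm would be the usual choice; the pairwise form is kept here because it mirrors the definition directly.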

The importance score in step 2) is the sum of the betweenness index values of all out-linked web pages of a web page, each multiplied by the weight of the corresponding link, plus the betweenness index value of the web page itself, as shown in the following formula:

$$BCW(i) = \sum_{j=1}^{n} w_{i,j} \cdot bc_j + bc_i$$

where BCW(i) is the final importance score of web page i, $w_{i,j}$ is the weight of the link from web page i to web page j, $bc_i$ is the betweenness index value of web page i, and n is the number of out-links of web page i.
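The importance score described above (a page's own betweenness value plus the weighted sum of the betweenness values of its out-linked pages) can be sketched in a few lines. The function name `importance_scores`, the dictionary layout, and the sample numbers are all hypothetical; the link weights are taken as given inputs.

```python
def importance_scores(out_links, weights, bc):
    """BCW(i) = bc[i] + sum over out-links j of weights[(i, j)] * bc[j]."""
    scores = {}
    for i in bc:
        weighted_sum = sum(weights[(i, j)] * bc[j] for j in out_links.get(i, ()))
        scores[i] = weighted_sum + bc[i]
    return scores


# Hypothetical inputs: out-link lists, per-link weights, betweenness values.
out_links = {"a": ["b", "c"], "b": ["c"], "c": []}
weights = {("a", "b"): 0.5, ("a", "c"): 0.25, ("b", "c"): 1.0}
bc = {"a": 1.0, "b": 2.0, "c": 4.0}
bcw = importance_scores(out_links, weights, bc)
```

In this toy example page b outranks page a even though a has more out-links, because b links with full weight to the page with the highest betweenness value.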

In step 3), the web pages are sorted in descending order of importance score; the web pages with the highest importance scores are taken as trusted seeds and those with the lowest importance scores as spam seeds, and together they form the seed set. An initial trust value is then assigned to every web page: trusted seeds receive a positive value, spam seeds a negative value, and all other web pages 0. The specific calculation formula is as follows:

where W is the set of spam seeds, v is the number of spam seeds, T is the set of trusted seeds, u is the number of trusted seeds, and $d_i$ is the initial trust value of web page i.
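A minimal sketch of the seed selection and initial trust assignment follows. The text only fixes the signs (positive for trusted seeds, negative for spam seeds, 0 otherwise); the magnitudes 1/u and -1/v used here are an assumption borrowed from the usual TrustRank-style normalization, not a value confirmed by the text. The function name `initial_trust` and the seed counts are likewise hypothetical.

```python
def initial_trust(scores, num_trusted, num_spam):
    """Pick top/bottom pages by importance score and assign initial trust values."""
    if num_trusted + num_spam > len(scores):
        raise ValueError("seed sets would overlap")
    ranked = sorted(scores, key=scores.get, reverse=True)
    trusted = ranked[:num_trusted]
    spam = ranked[len(ranked) - num_spam:]
    d = {page: 0.0 for page in scores}
    for page in trusted:
        d[page] = 1.0 / len(trusted)   # assumed magnitude: 1/u
    for page in spam:
        d[page] = -1.0 / len(spam)     # assumed magnitude: -1/v
    return d, trusted, spam


# Hypothetical importance scores for five pages.
scores = {"a": 3.0, "b": 6.0, "c": 4.0, "d": 1.0, "e": 0.5}
d, trusted, spam = initial_trust(scores, num_trusted=2, num_spam=2)
```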

The aggregation coefficient in step 4) is calculated by the following formula:

where k is the number of edges connecting the in-link nodes of node i to one another, in(i) is the number of in-link nodes of node i, i.e. the in-degree of node i, and ClusterCoefficient(i) is the aggregation coefficient of node i.
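The sketch below shows one way to compute an in-link based aggregation coefficient from the quantities k and in(i) defined above. The normalization k / (in(i) · (in(i) − 1)), which treats links as directed, is an assumption made for illustration and may differ from the exact formula intended; the function name `in_link_clustering`, the sample graph and the threshold value are hypothetical.

```python
def in_link_clustering(graph, node):
    """Ratio of links among the in-link nodes of `node` to the possible directed links."""
    in_nodes = {u for u, targets in graph.items() if node in targets and u != node}
    n = len(in_nodes)
    if n < 2:
        return 0.0
    # k: directed links whose source and target are both in-link nodes of `node`.
    k = sum(1 for u in in_nodes for w in set(graph.get(u, ()))
            if w in in_nodes and w != u)
    # Assumed normalization: n * (n - 1) possible directed links.
    return k / (n * (n - 1))


# Hypothetical graph: a, b and c all link to x; a and b also link to each other.
graph = {"a": ["x", "b"], "b": ["x", "a"], "c": ["x"], "x": []}
coeff_x = in_link_clustering(graph, "x")

threshold = 0.3  # hypothetical threshold
credible = {p for p in graph if in_link_clustering(graph, p) > threshold}
```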

In step 5), once the aggregation coefficients have been evaluated, the trust values are split in a differentiated manner, i.e. the confidence transfer matrix is calculated:

where θ(u) is the set of out-links of web page u; the formula distinguishes the total number of out-links of those out-linked web pages of u that have a high aggregation coefficient from the sum of the out-link numbers of the other out-linked web pages of u; O(i) denotes the number of out-links of web page i; CC is the set of web pages judged trustworthy according to the aggregation coefficient; and TC(i, u) is the entry in row i and column u of the confidence transfer matrix.

In step 6), the CTRank scores are iteratively calculated using the transfer matrix TC according to the following formula:

$$ct_i = \alpha \cdot TC(i) \cdot ct_{i-1} + (1-\alpha) \cdot d$$

where α denotes the attenuation factor, TC(i) is the transfer matrix for the i-th iteration, d is the initially assigned trust value, and $ct_i$ is the CTRank score of the web pages at the i-th iteration.
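The iteration can be sketched as a simple fixed-point loop. Several choices here are assumptions rather than details given in the text: the transfer matrix is held fixed across iterations, the iteration starts from the initial trust vector d, α defaults to 0.85, and convergence is declared when the largest change falls below a tolerance. The function name `ctrank` and the two-page example are hypothetical.

```python
def ctrank(tc, d, alpha=0.85, tol=1e-10, max_iter=1000):
    """Iterate ct = alpha * TC * ct + (1 - alpha) * d until the scores stop changing."""
    n = len(d)
    ct = list(d)  # assumed starting point: the initial trust vector
    for _ in range(max_iter):
        nxt = [
            alpha * sum(tc[i][u] * ct[u] for u in range(n)) + (1 - alpha) * d[i]
            for i in range(n)
        ]
        if max(abs(a - b) for a, b in zip(nxt, ct)) < tol:
            return nxt
        ct = nxt
    return ct


# Hypothetical two-page example: page 0 passes all of its trust to page 1.
tc = [[0.0, 0.0],
      [1.0, 0.0]]
d = [1.0, 0.0]
scores = ctrank(tc, d)

# Rank pages from highest to lowest converged score.
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

Pages at the bottom of the resulting ranking, in particular those with negative scores, would be the candidates for spam.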

The method for applying node importance in a complex network to spam web page detection can effectively detect spam web pages and lower their ranking. The method has the following advantages:

1. A new method for selecting the seed set is provided, which addresses the problem that makers of spam web pages can add outgoing links inside high-quality web pages to raise their own ranking in the network.

2. A CTRank ranking algorithm is provided. The number of neighbors in the clustering coefficient is replaced by the number of in-links, and the confidence transfer matrix is calculated in a differentiated manner according to the modified clustering coefficient score and the out-link situation of each node. This addresses the problems that the TrustRank algorithm distributes link weights evenly and ignores web page importance when calculating the confidence transfer matrix.

Detailed Description

The following describes in detail an application method of node importance in spam web page detection in a complex network according to an embodiment of the present invention.

The application method of the node importance in the complex network in the spam webpage detection comprises the following steps:

1) Data preprocessing: the known feature data are normalized, i.e. compressed into a common range so that all attributes receive equal weight and the influence of dimension (units of measure) on subsequent calculations is eliminated; the PCA algorithm is then used for feature extraction, reducing the feature dimension so that the new space has a lower dimension than the original feature space, which simplifies subsequent calculations;

the normalization uses the z-score normalization formula:

$$v'_i = \frac{v_i - \bar{A}}{\sigma_A}$$

where $\bar{A}$ is the mean of attribute A, $\sigma_A$ is the standard deviation of attribute A, $v_i$ is the value of the i-th data item on attribute A, and $v'_i$ is its value after normalization.

2) Calculating the weight of the links between web pages and the betweenness index of each web page, and fusing the betweenness index and the weights to calculate the importance score of each web page, wherein:

when a link exists between two web pages, a small distance between them indicates high similarity, so the link should be given a larger weight; based on this principle, the weight of the link between web pages is calculated by the following formula:

where $dist_{i,j}$ denotes the Euclidean distance from web page i to web page j, and $w_{i,j}$ is the resulting weight from web page i to web page j.

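The betweenness index is:

$$bc(v) = \sum_{s \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}$$

where $\sigma_{st}(v)$ is the number of shortest paths from s to t that pass through point v, and $\sigma_{st}$ is the total number of shortest paths from s to t.

Fusing the betweenness index and the weights defines an importance score for each web page: the sum of the betweenness index values of all out-linked web pages of the page, each multiplied by the weight of the corresponding link, plus the betweenness index value of the page itself, as shown in the following formula:

$$BCW(i) = \sum_{j=1}^{n} w_{i,j} \cdot bc_j + bc_i$$

where $w_{i,j}$ is the weight of the link from web page i to web page j, $bc_i$ is the betweenness index value of web page i, and n is the number of out-links of web page i.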

3) Sorting by importance score, selecting the web pages with the highest scores and the lowest scores to jointly form a seed set, and assigning an initial trust value to each web page. Specifically, the web pages are sorted in descending order of importance score; the web pages with the highest importance scores are taken as trusted seeds and those with the lowest importance scores as spam seeds, and together they form the seed set. An initial trust value is then assigned to every web page: trusted seeds receive a positive value, spam seeds a negative value, and all other web pages 0. The specific calculation formula is as follows:

4) Calculating the aggregation coefficient, and taking web pages whose aggregation coefficient is higher than a threshold as trusted web pages. Trust splitting is performed once the initial trust value of each web page has been obtained. The CTRank algorithm is a differentiated trust ranking algorithm: it holds that good web pages and spam web pages should not split trust values equally when the confidence transfer matrix is calculated. The method uses the clustering coefficient to judge whether a web page can be trusted, and splits trust differently for different web pages when calculating the confidence transfer matrix.

The aggregation coefficient is calculated by the following formula:

5) Calculating the confidence transfer matrix.

Once the aggregation coefficients have been evaluated, the trust values are split in a differentiated manner, i.e. the confidence transfer matrix is calculated:

where θ(u) is the set of out-links of web page u.

6) Iteratively calculating the CTRank scores using the transfer matrix TC, and ranking the converged CTRank scores. The iteration uses the following formula:

$$ct_i = \alpha \cdot TC(i) \cdot ct_{i-1} + (1-\alpha) \cdot d \qquad (8)$$

The method for applying node importance in a complex network to spam web page detection is verified as follows.

The effectiveness of the BCW seed set selection algorithm and of the CTRank ranking algorithm is evaluated using the number of spam web pages detected, precision, recall and the F value.

Table 1. Evaluation criteria

(1) Precision

Precision measures how exact the detection is, namely the ratio of the number of spam web pages correctly detected to the total number of web pages detected as spam. The calculation formula is shown as formula (1).

(2) Recall

Recall measures how complete the detection is, namely the ratio of the number of spam web pages detected to the total number of spam web pages in the data set. The calculation formula is shown as formula (2).

(3) F value (F-measure)

Precision and recall sometimes conflict, so they need to be considered together; the most common way to do so is the F value, which is the weighted harmonic mean of precision and recall. The calculation formula is shown as formula (3).

In formula (3), F-Measure denotes the F value, P denotes the precision of the algorithm, and R denotes its recall.
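The three criteria can be sketched from sets of page identifiers as below. The function name `precision_recall_f` and the sample sets are hypothetical, and the `beta` weighting parameter is an assumption: with the default `beta = 1` the F value reduces to the plain harmonic mean 2PR / (P + R), which may or may not be the exact weighting used in the experiments.

```python
def precision_recall_f(detected, actual_spam, beta=1.0):
    """Return (precision, recall, F) for a set of pages flagged as spam."""
    detected, actual_spam = set(detected), set(actual_spam)
    true_positives = len(detected & actual_spam)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(actual_spam) if actual_spam else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Weighted harmonic mean of precision and recall (beta = 1 gives F1).
    f_value = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
    return precision, recall, f_value


# Hypothetical example: 4 pages flagged, 3 pages are truly spam, 2 overlap.
p, r, f = precision_recall_f({"a", "b", "c", "d"}, {"a", "b", "e"})
```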

The experiment first analyzes the effectiveness of the BCW seed set selection algorithm, comparing it with the Inverse PageRank algorithm and the highPageRank algorithm using precision, recall and the F value as evaluation criteria; the results, shown in Tables 2 to 4, demonstrate its effectiveness. The effectiveness of the CTRank ranking algorithm is then verified by comparing it with the TrustRank algorithm using the same criteria, with both algorithms selecting their seed sets by the BCW algorithm; the results are shown in Tables 5 to 7. The results show that the BCW algorithm and the CTRank algorithm can effectively detect spam web pages and lower their ranking.

Table 2. Comparison of spam web page precision under different thresholds

Table 3. Comparison of spam web page recall under different thresholds

Table 4. Comparison of spam web page F values under different thresholds

Table 5. Comparison of spam web page precision under different thresholds

Table 6. Comparison of spam web page recall under different thresholds

Table 7. Comparison of spam web page F values under different thresholds

Claims

1. A method for applying node importance in a complex network to spam web page detection, characterized by comprising the following steps:

the aggregation coefficient is calculated by the following formula:

wherein k is the number of edges connecting the in-link nodes of node i to one another, IN(i) is the number of in-link nodes of node i, namely the in-degree of node i, and ClusterCoefficient(i) is the aggregation coefficient of node i;

5) calculating the confidence transfer matrix;

wherein θ(u) is the set of out-links of web page u; the formula distinguishes the total number of out-links of those out-linked web pages of u that have a high aggregation coefficient from the sum of the out-link numbers of the other out-linked web pages of u; O(i) denotes the number of out-links of web page i; CC is the set of web pages judged trustworthy according to the aggregation coefficient; and TC(i, u) is the entry in row i and column u of the confidence transfer matrix;

6) iteratively calculating the CTRank scores using the transfer matrix TC, ranking the converged CTRank scores, and determining the spam web pages according to the ranking of the CTRank scores;

the CTRank scores are iteratively calculated using the transfer matrix TC according to the following formula:

$$ct_i = \alpha \cdot TC(i) \cdot ct_{i-1} + (1-\alpha) \cdot d$$

2. The method for applying node importance in a complex network to spam web page detection according to claim 1, wherein the normalization in step 1) uses the z-score normalization formula:

$$v'_i = \frac{v_i - \bar{A}}{\sigma_A}$$

wherein $\bar{A}$ is the mean of attribute A, $\sigma_A$ denotes the standard deviation of attribute A, $v_i$ denotes the value of the i-th data item on attribute A, and $v'_i$ is the value of the i-th data item on attribute A after normalization;

3. The method for applying node importance in a complex network to spam web page detection according to claim 1, wherein the weight of the links between web pages in step 2) is calculated by the following formula:

wherein $dist_{i,j}$ denotes the Euclidean distance from web page i to web page j, and $w_{i,j}$ is the resulting weight from web page i to web page j;

the betweenness index in step 2) is:

$$bc(v) = \sum_{s \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}$$

wherein $\sigma_{st}(v)$ is the number of shortest paths from s to t that pass through point v, and $\sigma_{st}$ is the total number of shortest paths from s to t, the end points of each path also being counted as part of the path; $bc(v)$ is the betweenness index value of point v; substituting every web page into the above formula yields the betweenness index value of each web page;

4. The method for applying node importance in a complex network to spam web page detection according to claim 1, wherein step 3) specifically comprises sorting the web pages in descending order of importance score, taking the web pages with the highest importance scores as trusted seeds and the web pages with the lowest importance scores as spam seeds to jointly form a seed set, and then assigning an initial trust value to every web page, the trusted seeds being assigned a positive value, the spam seeds a negative value, and all other web pages 0, wherein the specific calculation formula is as follows: