CN112257073A - Webpage duplicate removal method based on improved DBSCAN algorithm - Google Patents


Info

Publication number
CN112257073A
CN112257073A
Authority
CN
China
Prior art keywords
data, algorithm, minpts, webpage, deduplication
Prior art date
Legal status (assumption, not a legal conclusion; Google has not performed a legal analysis)
Pending
Application number
CN202011176217.XA
Other languages
Chinese (zh)
Inventor
徐光侠
王利
马创
刘俊
张家俊
Current Assignee (the listed assignees may be inaccurate)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (assumption, not a legal conclusion)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202011176217.XA priority Critical patent/CN112257073A/en
Publication of CN112257073A publication Critical patent/CN112257073A/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57 Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577 Assessing vulnerabilities and evaluating computer system security
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N3/006 Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]


Abstract

The invention belongs to the field of computers and specifically relates to a webpage deduplication method based on an improved DBSCAN algorithm, which comprises the following steps: acquiring website data in real time, inputting the acquired website data into a trained webpage deduplication algorithm model, and removing duplicate data from the website data set according to the training result to obtain the data set for subsequent vulnerability scanning. The webpage deduplication algorithm model searches for optimal parameters with an improved artificial bee colony algorithm, substitutes the optimal parameters for the two parameters of the DBSCAN algorithm, and improves the core-point selection process of the DBSCAN algorithm with a proximity search strategy. By selecting optimal parameters on the constructed artificial data set with the improved artificial bee colony algorithm and then using the parameters found to set the DBSCAN parameters, the invention improves the clustering effect of the DBSCAN algorithm.

Description

Webpage duplicate removal method based on improved DBSCAN algorithm
Technical Field
The invention belongs to the field of computers, and particularly relates to a webpage duplication removing method based on an improved DBSCAN algorithm.
Background
With the development of the information era, the internet has brought convenience to daily life and countless opportunities to the internet industry. However, its rapid growth has not been matched by adequate protection of network system security, and network vulnerabilities have become a major source of security breaches in the internet industry in recent years. How to respond to network attacks, prevent them so as to protect information assets from loss, and provide users with a safe internet environment has become a problem that society urgently needs to solve. At present, vulnerability scanners are mainly used to scan webpages for security flaws: vulnerability information is exposed in advance, and the exposed vulnerabilities are repaired before hackers can exploit them.
A vulnerability scanner distinguishes pages by their request information and scans for vulnerabilities page by page. Because a website contains a large number of pages, repeated pages need to be filtered out to increase the scanning rate. Existing page deduplication methods include a regular-matching-based technique and the DBSCAN algorithm. The regular-matching-based technique distinguishes pages by the similarity of their character strings: if the string contents of two pages are the same, the two pages belong to one page class and only one needs to be selected; otherwise they are divided into different classes. However, this approach cannot integrate multiple features, so its classification results are inaccurate. The DBSCAN algorithm is a classical density-based clustering algorithm: it does not require the number of classes to be specified and can identify outliers and clusters of arbitrary number and shape. When DBSCAN performs cluster analysis on the page data set, the data within each cluster are similar, i.e. they are repeated page-request data, so only one record per cluster needs to be selected for vulnerability scanning. However, DBSCAN is sensitive to the choice of the radius parameter ε and the density threshold parameter Minpts, and its nearest-data-point search is slow, which reduces webpage deduplication efficiency.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a webpage deduplication method based on an improved DBSCAN algorithm, which comprises the following steps: acquiring website data in real time, inputting the acquired website data into a trained webpage deduplication algorithm model to remove repeated data in a website data set, and obtaining a data set to be subjected to subsequent vulnerability scanning;
the process of constructing the webpage deduplication algorithm model comprises the following steps:
s1: acquiring website data, and performing feature extraction and feature quantization processing on the acquired website data;
s2: carrying out selective characteristic weighting processing on the quantized website data to obtain a page data set D;
s3: inputting the page data set D into an improved artificial bee colony algorithm for training to obtain an optimal parameter value; the improved artificial bee colony algorithm comprises the steps of optimizing a honey source selection process in the artificial bee colony algorithm by adopting a truncation selection mechanism;
s4: improving the DBSCAN algorithm according to the optimal parameter values and the proximity search strategy to obtain the webpage deduplication algorithm model;
s5: inputting the data in the page data set D into a webpage deduplication algorithm model for training to obtain cluster labels of all data points;
s6: and according to the difference of the cluster labels, selecting one piece of data in each cluster to construct a data set to be subjected to vulnerability scanning.
Preferably, the features extracted from the acquired website data include: the request method, request address, request parameter names, number of request parameters, request host name, and message entity transfer length; the extracted features are then quantized.
Further, the quantization process includes: directly assigning each distinct request method a numeric label in 0-9, taking each message entity's transfer length as a feature value, building a dictionary from the request address and the request parameter names, counting the request parameters, and taking the parameter count as a feature value.
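The quantization scheme above can be sketched in code. This is a minimal illustration, not the patent's implementation: the concrete method labels, the record field names (method, address, param_names, entity_length), and the dictionary layout are assumptions; the patent only specifies the general mapping.

```python
# Hedged sketch of the feature-quantization step described above.
# Method labels and field names are illustrative assumptions.

def quantize_request(req, url_dict, param_dict):
    """Map one HTTP request record to a numeric feature vector.

    req: dict with hypothetical keys 'method', 'address',
         'param_names', 'entity_length'.
    url_dict / param_dict: growing dictionaries assigning each
         distinct address / parameter-name string a numeric id.
    """
    methods = {"GET": 0, "POST": 1, "PUT": 2, "DELETE": 3,
               "HEAD": 4, "OPTIONS": 5, "PATCH": 6}  # labels in 0-9
    m = methods.get(req["method"], 9)
    # dictionary encoding: first-seen strings get the next id
    addr = url_dict.setdefault(req["address"], len(url_dict))
    pids = [param_dict.setdefault(p, len(param_dict))
            for p in req["param_names"]]
    n_params = len(req["param_names"])   # number-of-parameters feature
    length = req["entity_length"]        # message entity transfer length
    return [m, addr, sum(pids), n_params, length]

url_dict, param_dict = {}, {}
vec = quantize_request(
    {"method": "POST", "address": "/login",
     "param_names": ["user", "pwd"], "entity_length": 128},
    url_dict, param_dict)
print(vec)  # [1, 0, 1, 2, 128]
```

The resulting vectors would then be feature-weighted to form the page data set D.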
Preferably, the process of optimizing the artificial bee colony algorithm by the truncation selection mechanism comprises:
Step 1: calculating the individual fitness fval of each piece of data in the page data set D;
Step 2: arranging the data in the page data set D in descending order of the individual fitness fval; randomly selecting k individuals from the population to form a group, and selecting the individual i with the maximum fitness fval in the group;
Step 3: selecting the top t% of individuals in the population to generate the next generation; judging, according to the truncation selection mechanism, the probability that the individual at rank i generates offspring: when the selected individual i satisfies i ≤ M × t%, the follower bee searches the neighborhood of the current honey source and generates the offspring population;
Step 4: repeating step 3 M times to generate the new generation population.
Further, the individual fitness fval is calculated as follows:
[Fitness formula given as an image in the original; fval is computed from the inter-cluster similarity and the probability-weighted density, as detailed in the embodiment.]
Further, the probability that the individual at rank i generates the offspring population follows the truncation rule:
P_i = 1, if i ≤ M × t%; P_i = 0, otherwise,
where M is the population size and t% is the truncation threshold.
The truncation threshold is calculated as follows:
[Truncation-threshold formula given as an image in the original; it is computed from the maximum and minimum thresholds t_max and t_min, the current remaining search count cyc, and the maximum search count Miter.]
preferably, the process of solving the optimal parameters by using the improved artificial bee colony algorithm comprises the following steps:
s31: setting the maximum iteration number Mirater, the number M of bees and a parameter lit for judging whether to remove the solution or not; initializing a set of honey sources v in a solution space1,v2,…,vkOne honey source corresponds to one solution, and the initial honey source position is used as an initial solution; calculating the fitness and arranging according to a descending order, taking the bees on the honey source position corresponding to the first M/2 fitness values as honey collection peaks, and taking the bees corresponding to the last M/2 fitness values as follower bees;
s32: setting the current iteration number to be iratom as 1;
s33: the bee is adopted to carry out neighborhood search according to a search formula to generate a new solution new1_ xijCalculating the fitness value of the new solution if fval (new1_ x)ij)>fval(xij) If not, updating the honey source position, otherwise, keeping the honey source position unchanged;
s34: the follower bee adopts a truncation selection mechanism to select the honey source and search the neighborhood according to the honey source position obtained by S33 to obtain a new solution new2_ xijCalculating the fitness value of the new solution if fval (new2_ x)ij)>fval(xij) If not, updating the honey source position, otherwise, keeping the honey source position unchanged;
s35: recording the position of the current optimal honey source, wherein the position is the current optimal solution;
s36: if the fitness value of the ith honey source is unchanged, the mining frequency s (i) is s (i) + 1;
s37: if s (i) is not less than lit, discarding the solution, and the scout bee performing global search according to a new honey source search formula to generate a new solution to replace the solution, and setting s (i) to be 0;
s38: and if the maximum iteration number is reached, namely, irator +1, outputting the optimal solution, otherwise, returning to the step S33.
Preferably, the process of training the webpage deduplication algorithm model includes:
S51: initializing the category label C = 0, setting the category labels of all data points to 0, and inputting the page data set D, the radius ε, and the density threshold Minpts;
S52: calculating pairwise distances to obtain the 2ε neighborhood N_2ε(p) of each unvisited data point p in the page data set D; storing the results in the dists array, sorting dists in ascending order, and storing the sorted result in the distArr array;
S53: comparing the density value in the 2ε neighborhood of p with the density threshold Minpts; if |N_2ε(p)| ≥ Minpts, judging the relation between distArr[Minpts] and the radius ε, and executing the corresponding function according to that relation; if |N_2ε(p)| < Minpts, labeling the category labels of all points in the ε neighborhood of p as -1; here distArr[Minpts] represents the distance between p and its Minpts-th nearest point;
S54: outputting the category label of each data point in the page data set D once all data points have been visited; the algorithm then ends.
Further, the process of determining the relation between distArr[Minpts] and ε includes:
Step 1: if distArr[Minpts] ≤ ε, executing the cluster expansion function Expandcluster(p, distArr, Minpts), saving the result of the function into the resultPts array, incrementing the category label C by 1, and marking all points in the resultPts array with label C; the points in the resultPts array form a cluster with p as the core point;
Step 2: if distArr[Minpts] > ε, selecting data in the distArr array by binary search to obtain the non-core point set O = {o | dis_{p,o} < distArr[Minpts] - ε}, where dis_{p,o} represents the distance between point p and point o; the category labels of all points in the set O are marked -1.
Further, the process of executing the cluster expansion function Expandcluster(p, distArr, Minpts) includes: initializing a queue drPts, adding the points in the ε neighborhood of data point p into drPts, and judging the distance state of each unvisited data point q in drPts; the data in drPts are output once all points in drPts have been visited; the judgment process is as follows:
If the distance between p and q is less than ε, searching the data point q with the neighbor search NeighborQuery algorithm to obtain the neighborhood N_ε(q); if |N_ε(q)| ≥ Minpts, adding N_ε(q) to the queue drPts;
If q is the last point of drPts, obtaining the 2ε neighborhood N_2ε(p) of p using a range query, calculating the distances from p to all points in N_2ε(p), saving the results to dists, sorting dists in ascending order, and saving the result as drPts.
The invention has the following beneficial effects:
1) the improved artificial bee colony algorithm selects optimal parameters on the constructed artificial data set, and the optimal parameters found are used for the parameter setting of the DBSCAN algorithm, which improves the clustering effect of the DBSCAN algorithm;
2) the proximity search strategy improves the core-point selection process of the DBSCAN algorithm, which increases the clustering speed of the algorithm; deduplicating the website page data according to the clustering result then increases the scanning efficiency of the vulnerability scanning system.
Drawings
FIG. 1 is a flow diagram of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A webpage deduplication method based on an improved DBSCAN algorithm, as shown in fig. 1, the method includes: acquiring website data in real time, inputting the acquired website data into a trained webpage deduplication algorithm model to remove repeated data in a website data set, and obtaining a data set to be subjected to subsequent vulnerability scanning;
the process of constructing the webpage deduplication algorithm model comprises the following steps:
s1: acquiring website data, and performing feature extraction and feature quantization processing on the acquired website data;
s2: carrying out selective characteristic weighting processing on the quantized website data to obtain a page data set D;
s3: inputting the page data set D into an improved artificial bee colony algorithm for training to obtain an optimal parameter value; the improved artificial bee colony algorithm comprises the steps of optimizing a honey source selection process in the artificial bee colony algorithm by adopting a truncation selection mechanism;
s4: improving the DBSCAN algorithm according to the optimal parameter values and the proximity search strategy to obtain the webpage deduplication algorithm model;
s5: inputting the data in the page data set D into a webpage deduplication algorithm model for training to obtain cluster labels of all data points;
s6: and according to the difference of the cluster labels, selecting one piece of data in each cluster to construct a data set to be subjected to vulnerability scanning.
Preferably, the acquiring of the website data includes crawling the website data in a web crawler manner.
Preferably, the features extracted from the acquired website data include: the request method, request path, request parameter names, number of request parameters, request host name, and message entity transfer length; the feature quantization includes: directly assigning values to the different parameters, building dictionaries, and counting quantities as weights. The specific quantization process includes: assigning each distinct request method a numeric label in 0-9, taking each message entity's transfer length as a feature value, building a dictionary from the request address and the request parameter names, counting the request parameters, and taking the parameter count as a feature value.
Preferably, the process of obtaining the page data set D includes: website data acquisition, data feature extraction, data feature quantization, and data feature weighting; the feature-weighted website data form the page data set D, and the feature weights comprehensively consider the degree to which each feature in the webpage data distinguishes pages and the magnitude of the feature value.
The embodiment provides a specific implementation method for optimizing an artificial bee colony algorithm according to a truncation selection mechanism, and the process comprises the following steps:
step 1: calculating the individual fitness fval of each piece of data in the page data set D;
the specific process for calculating the individual fitness fval comprises the following steps:
step 11: calculating the existence probability of the cluster i and the weighted density of the cluster i according to the data in the page data set D; the probability formula for the existence of cluster i is:
p(i) = n_i / n
where n_i is the number of objects in cluster i and n is the total number of objects.
The weighted density formula for cluster i is:
[Weighted density formula given as an image in the original; it is computed from the pairwise distances ω(x, y) of the objects in cluster i.]
where ω (x, y) represents the distance between sample point x and sample point y.
Step 12: calculating the probability-weighted density from the existence probability of cluster i and the weighted density of cluster i; the probability-weighted density is formulated as:
[Probability-weighted density formula given as an image in the original.]
where c represents the number of clusters.
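The probability-weighted density can be illustrated with a small worked sketch. Only p(i) = n_i / n and the idea of weighting each cluster's density by its existence probability are taken from the surrounding text; the per-cluster weighted density formula appears only as an image in the original, so the D(i) values below are assumed inputs, and the weighted sum PWD = Σ_i p(i) · D(i) is a plausible reading, not the patent's exact formula.

```python
# Worked sketch of the probability-weighted density (PWD). The per-cluster
# weighted densities D(i) are assumed inputs; p(i) = n_i / n comes from
# the cluster-probability formula in the text.
sizes = {0: 6, 1: 3, 2: 1}            # n_i: objects per cluster (illustrative)
density = {0: 0.8, 1: 0.5, 2: 0.1}    # D(i): assumed weighted densities
n = sum(sizes.values())               # total number of objects
p = {i: sizes[i] / n for i in sizes}  # existence probability of cluster i
pwd = sum(p[i] * density[i] for i in sizes)  # assumed PWD = sum p(i)*D(i)
print(round(p[0], 2), round(pwd, 2))  # 0.6 0.64
```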
Step 13: and (3) calculating the similarity c (x, y) between the sample points in two different clusters according to the data in the page data set D, wherein the expression is as follows:
[Formula for c(x, y) given as an image in the original.]
where E represents the union of the two clusters and d(x, y) represents the similarity between two different sample points in E, calculated as:
[Formula for d(x, y) given as an image in the original.]
where ω(x, y) represents the distance between sample point x and sample point y, and ω_cen(I_i, I_j) represents the distance between the center points of cluster I_i and cluster I_j.
Step 14: calculating the similarity between clusters from the c(x, y) of step 13; the formula is:
[Formula for the inter-cluster similarity given as an image in the original.]
where Sim(I_i, I_j) represents the similarity between cluster I_i and cluster I_j.
Step 15: calculating the individual fitness fval from the inter-cluster similarity and the probability-weighted density; the formula is:
[Fitness formula given as an image in the original.]
where c represents the number of clusters, Sim(I_i, I_j) represents the similarity between cluster I_i and cluster I_j, and PWD represents the probability-weighted density.
Step 2: arranging the data in the page data set D in descending order of the individual fitness fval; randomly selecting k individuals from the population to form a group, and selecting the individual i with the maximum fitness fval in the group;
Step 3: selecting the top t% of individuals in the population to generate the next generation; judging, according to the truncation selection mechanism, the probability that the individual at rank i generates offspring: when the selected individual i satisfies i ≤ M × t%, the follower bee searches the neighborhood of the current honey source and generates the offspring population;
the probability formula for generating a population of offspring is:
P_i = 1, if i ≤ M × t%; P_i = 0, otherwise
where P_i represents the probability that the individual at rank i generates the next generation, M represents the population size, and t% represents the truncation threshold;
the calculation formula of the truncation threshold is as follows:
[Truncation-threshold formula given as an image in the original.]
where t_max and t_min are the maximum and minimum truncation thresholds, cyc is the current remaining search count, and Miter is the maximum search count of the honey-collecting and follower bees.
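The adaptive threshold can be sketched in code. The exact formula is given only as an image in the original; a linear interpolation between t_min and t_max driven by the remaining search budget cyc / Miter is one plausible reading of the variables listed, used here purely to illustrate how the threshold could tighten as the search proceeds.

```python
# Hedged sketch of an adaptive truncation threshold. The linear form is an
# assumption; the patent's exact formula is not reproduced in the text.
def truncation_threshold(t_min, t_max, cyc, miter):
    """Threshold in percent; cyc = remaining searches, miter = maximum."""
    return t_min + (t_max - t_min) * cyc / miter

early = truncation_threshold(10, 50, cyc=90, miter=100)  # search just begun
late = truncation_threshold(10, 50, cyc=10, miter=100)   # almost exhausted
print(early, late)  # 46.0 14.0
```

Under this reading, selection pressure increases over time: fewer individuals qualify for reproduction as the remaining budget cyc shrinks.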
Step 4: repeating step 3 M times to generate the new generation population.
The embodiment provides a specific implementation of finding the optimal parameters with the improved artificial bee colony algorithm.
S31: setting the maximum iteration count Miter, the number of bees M, and the parameter lit for judging whether a solution is to be abandoned; initializing a set of honey sources v_1, v_2, …, v_k within a range of the solution space (ε neighborhood range 0-1), where each honey source corresponds to one solution and the initial honey source positions serve as the initial solutions; calculating the individual fitness values and sorting them in descending order, taking the bees at the honey source positions with the top M/2 fitness values as honey-collecting bees and the bees with the bottom M/2 values as follower bees;
S32: setting the iteration count iter = 1;
S33: the honey-collecting bee performs a neighborhood search according to the search formula to generate a new solution new1_x_ij and calculates its fitness value; if fval(new1_x_ij) > fval(x_ij), the honey source position is updated, otherwise it remains unchanged;
the neighborhood search formula is:
new_x_ij = x_ij + r_ij(x_ij - x_kj)
where k ∈ {1, 2, …, N}, j ∈ {1, 2, …, d}, k and j are generated randomly with k ≠ i, and r_ij is a random number in (0, 1).
S34: the follower bee selects a honey source by the truncation selection mechanism from the positions obtained in S33 and searches its neighborhood to obtain a new solution new2_x_ij, calculating its fitness value; if fval(new2_x_ij) > fval(x_ij), the honey source position is updated, otherwise it remains unchanged;
S35: recording the position of the current optimal honey source, which is the current optimal solution; if the fitness value of the i-th honey source has not changed, incrementing its mining count: s(i) = s(i) + 1;
S36: if s(i) ≥ lit, abandoning the solution; the scout bee performs a global search according to the new honey source search formula to generate a replacement solution, and s(i) is set to 0;
the new honey source search formula is as follows:
new_pop(i)=(upbond-lbond)·rand+lbond
where upbond is the upper bound of the neighborhood, lbond is the lower bound of the neighborhood, and rand is a random number in (0, 1).
S38: if the maximum iteration count is reached, i.e. iter = Miter, outputting the optimal solution; otherwise incrementing the iteration count by 1 and returning to step S33.
The embodiment provides an improved DBSCAN algorithm model, which is used for training a page data set on the basis of the optimal parameters obtained in the embodiment 1, so as to realize duplication removal of website page data of vulnerability scanning.
In this embodiment, the process of training the webpage deduplication algorithm model includes:
S51: initializing the category label C = 0, setting the category labels of all data points to 0, and inputting the page data set D, the radius ε, and the density threshold Minpts;
S52: for each unvisited data point p in D: first obtaining the 2ε neighborhood N_2ε(p) of p by calculating pairwise distances, storing the results in the dists array, sorting dists in ascending order, and storing the sorted result in the distArr array;
S53: comparing the density value in the 2ε neighborhood of p with the density threshold Minpts; if |N_2ε(p)| ≥ Minpts, judging the relation between distArr[Minpts] and the radius ε, where distArr[Minpts] represents the distance between p and its Minpts-th nearest point;
If distArr[Minpts] ≤ ε, executing the cluster expansion function Expandcluster(p, distArr, Minpts), saving the result into the resultPts array, incrementing the category label C by 1, and marking all points in resultPts with label C; the points in the resultPts array form a cluster with p as the core point;
If distArr[Minpts] > ε, selecting data in the distArr array by binary search to obtain the non-core point set O = {o | dis_{p,o} < distArr[Minpts] - ε}, where dis_{p,o} represents the distance between point p and point o, and marking the category labels of all points in the set O as -1;
S54: outputting the category label of each data point in the page data set D once all data points have been visited, completing the construction of the algorithm model.
The process of executing the cluster expansion function Expandcluster (p, distArr, Minpts) includes:
initializing the queue drPts, i.e., drPts ═ Nε(p) |, adding points in the epsilon neighborhood of the data point p into the queue drPts, and judging the distance state of each unaccessed data point q in the drPts; outputting the data in drPts until all the points in drPts are visited; the judgment process is as follows:
if the distance between p and q is less than epsilon, search the data point q with the neighbor search NeighborQuery algorithm to obtain the neighborhood Nε(q); if |Nε(q)| ≥ Minpts, add Nε(q) to the queue drPts;
if q is the last point of drPts, obtain the 2 epsilon neighborhood N2ε(p) of p using a range query, calculate the distances from p to all points in N2ε(p), store the results in dists, sort dists in ascending order, and save the result as drPts;
until all points in drPts have been accessed, the data in drPts is output.
The process of searching the data point q by adopting the neighbor search NeighborQuery algorithm comprises the following steps:
input the data in the page data set D, the radius epsilon, the density threshold Minpts, the current query point p, and the array distArr;
perform a binary search to obtain the index value L satisfying distArr[L] > d(p,q) − ε, where d(p,q) denotes the distance between point p and point q; perform a binary search to obtain the index value U satisfying distArr[U] < d(p,q) + ε; save the points with index values from L to U in the sorted array distArr into the array possibleNeighbor;
find the points in the array possibleNeighbor whose distance to the point q is less than epsilon, and store the result in the array Neighbor;
the epsilon neighborhood Nε(q) is formed by the points before array index L−1 together with the points in the array Neighbor; output the epsilon neighborhood Nε(q) of q.
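The pruning behind NeighborQuery can be sketched directly: by the triangle inequality, any epsilon-neighbor o of q must satisfy d(p,q) − ε ≤ d(p,o) ≤ d(p,q) + ε, so only the slice of the sorted distance array between the two binary-search bounds needs an exact distance check. Variable names and the use of `bisect` for the two searches are illustrative assumptions.

```python
import bisect
import math

def neighbor_query(D, order, dist_arr, p, q, eps):
    """order[i] is the index of the i-th nearest point to p;
    dist_arr[i] = d(p, order[i]) is sorted ascending.
    Returns the indices of the eps-neighbors of q."""
    d_pq = math.dist(D[p], D[q])
    lo = bisect.bisect_left(dist_arr, d_pq - eps)    # index L
    hi = bisect.bisect_right(dist_arr, d_pq + eps)   # index U (exclusive)
    possible = order[lo:hi]                          # possibleNeighbor slice
    # exact check only on the pruned candidate slice
    return [o for o in possible if math.dist(D[q], D[o]) <= eps]
```

Points whose distance to p falls outside [d(p,q) − ε, d(p,q) + ε] are never compared against q, which is what makes the query sub-linear in practice on sorted distances.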
The above embodiments further illustrate the objects, technical solutions and advantages of the present invention. It should be understood that they are only preferred embodiments of the present invention and are not intended to limit it; any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention shall be included in its protection scope.

Claims (10)

1. A webpage deduplication method based on an improved DBSCAN algorithm, characterized by comprising the following steps: acquiring website data in real time, and inputting the acquired website data into a trained webpage deduplication algorithm model to remove duplicate data from the website data set, obtaining the data set to be subjected to subsequent vulnerability scanning;
the process of constructing the webpage deduplication algorithm model comprises the following steps:
s1: acquiring website data, and performing feature extraction and feature quantization processing on the acquired website data;
s2: carrying out selective characteristic weighting processing on the quantized website data to obtain a page data set D;
s3: inputting the page data set D into an improved artificial bee colony algorithm for training to obtain the optimal parameter values; the improvement to the artificial bee colony algorithm comprises optimizing the honey source selection process of the artificial bee colony algorithm with a truncation selection mechanism;
s4: improving the DBSCAN algorithm according to the optimal parameter value and the adjacent search strategy to obtain a webpage duplication elimination algorithm model;
s5: inputting the data in the page data set D into a webpage deduplication algorithm model for training to obtain cluster labels of all data points;
s6: and according to the difference of the cluster labels, selecting one piece of data in each cluster to construct a data set to be subjected to vulnerability scanning.
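Step S6 of claim 1 reduces to keeping one record per cluster label. A minimal sketch under assumptions the claim leaves open (noise points labeled −1 are each treated as unique pages and kept; the first record seen in a cluster is its representative):

```python
def select_representatives(records, labels):
    """Keep one record per cluster label; records and labels run in parallel.
    Noise points (label -1) are assumed unique pages and are all kept."""
    seen = set()
    kept = []
    for rec, lab in zip(records, labels):
        if lab == -1:
            kept.append(rec)          # noise: no duplicate found, keep as-is
        elif lab not in seen:
            seen.add(lab)
            kept.append(rec)          # first page seen in this cluster
    return kept
```

The output is the deduplicated data set handed to the vulnerability scanner.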
2. The web page deduplication method based on the improved DBSCAN algorithm as claimed in claim 1, wherein the features extracted from the acquired website data comprise: the request method, the request address, the request parameter names, the number of request parameters, the request host name, and the transmission length of the message entity; and the extracted features are subjected to quantization processing.
3. The method for webpage deduplication based on the improved DBSCAN algorithm as claimed in claim 2, wherein the quantization process comprises: directly assigning each distinct request method a numeric label from 0 to 9, taking the transmission length of each message entity as a feature value, establishing a dictionary from the request address and the request parameter names, and counting the number of request parameters and taking that count as a feature value.
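A hedged sketch of the claim-3 quantization: method names get small integer labels, the URL path and parameter names are looked up in (or appended to) running dictionaries, and the parameter count and entity length are used directly. The exact encodings, field names and feature ordering are assumptions; the patent fixes only the feature list.

```python
# Illustrative method-to-label table; the patent only says labels in 0-9.
METHODS = {"GET": 0, "POST": 1, "PUT": 2, "DELETE": 3, "HEAD": 4,
           "OPTIONS": 5, "PATCH": 6, "TRACE": 7, "CONNECT": 8}

def quantize(request, path_dict, name_dict):
    """request: dict with keys method, path, host, params (dict), content_length.
    path_dict/name_dict grow as new values are seen (dictionary encoding)."""
    path_id = path_dict.setdefault(request["path"], len(path_dict))
    host_id = name_dict.setdefault(request["host"], len(name_dict))
    names_key = ",".join(sorted(request["params"]))        # parameter-name set
    names_id = name_dict.setdefault(names_key, len(name_dict))
    return [METHODS.get(request["method"], 9),             # unknown methods -> 9
            path_id,
            names_id,
            len(request["params"]),                        # parameter count
            host_id,
            request["content_length"]]
```

Each request thus becomes a fixed-length numeric vector suitable for the distance computations in the clustering step.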
4. The method for webpage deduplication based on the improved DBSCAN algorithm of claim 1, wherein the process of optimizing the artificial bee colony algorithm by the truncation selection mechanism comprises:
step 1: calculating the individual fitness fval of each piece of data in the page data set D;
step 2: sort the data in the page data set D in descending order of individual fitness fval; randomly select k individuals from the population to form a group, and select the individual i with the largest fitness fval in the group;
step 3: select the top t% of better individuals in the population to generate the next-generation population; according to the truncation selection mechanism, judge the probability that the individual ranked i produces an offspring population: when the selected individual i satisfies i ≤ M × t%, the follower bee searches in the neighborhood of the current honey source and generates the offspring population;
step 4: repeat step 3 M times to generate a new generation of the population.
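The truncation mechanism of claim 4 can be sketched as follows: rank the population by fitness in descending order and let only the top t% reproduce, each with equal probability. The function signature and the uniform choice among the survivors are illustrative assumptions.

```python
import random

def truncation_select(population, fitness, t_percent, rng=random):
    """Return one parent chosen uniformly from the top t% by fitness."""
    M = len(population)
    ranked = sorted(range(M), key=lambda i: fitness[i], reverse=True)
    cutoff = max(1, int(M * t_percent / 100))
    # only individuals with rank i <= M * t% may produce offspring
    return population[rng.choice(ranked[:cutoff])]
```

With t = 25% and four individuals, only the single fittest individual can be selected, which matches the i ≤ M × t% condition of step 3.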
5. The method for webpage deduplication based on the improved DBSCAN algorithm of claim 4, wherein the formula for calculating the individual fitness fval is as follows:
Figure FDA0002748742390000021
wherein PWD denotes the probability weighted density, c denotes the number of clusters, and Sim(Ii, Ij) denotes the similarity between cluster Ii and cluster Ij.
6. The web page deduplication method based on the improved DBSCAN algorithm as claimed in claim 4, wherein the probability that the individual ranked i generates the offspring population is given by:
Pi = 1/(M × t%), if i ≤ M × t%;  Pi = 0, otherwise
wherein Pi denotes the probability that the individual ranked i can generate the next generation, M denotes the population size, and t% denotes the truncation threshold;
the calculation formula of the truncation threshold is as follows:
t% = tmin + (tmax − tmin) × cyc / Mirater
wherein tmax denotes the maximum truncation threshold, tmin denotes the minimum truncation threshold, cyc denotes the current remaining number of searches of the employed bees and follower bees, and Mirater denotes the maximum number of searches of the employed bees and follower bees.
7. The method for webpage deduplication based on the improved DBSCAN algorithm as claimed in claim 1, wherein the process of solving the optimal parameters by using the improved artificial bee colony algorithm comprises:
s31: set the maximum iteration number Mirater, the number of bees M, and the parameter lit for judging whether to abandon a solution; initialize a set of honey sources v1, v2, …, vk in the solution space, one honey source corresponding to one solution, with the initial honey source positions as the initial solutions; calculate the fitness values and sort them in descending order; the bees at the honey source positions corresponding to the first M/2 fitness values are taken as employed bees, and the bees corresponding to the last M/2 fitness values as follower bees;
s32: set the current iteration number iter = 1;
s33: each employed bee performs a neighborhood search according to the search formula to generate a new solution new1_xij and calculates its fitness value; if fval(new1_xij) > fval(xij), update the honey source position, otherwise keep the honey source position unchanged;
s34: according to the honey source positions obtained in S33, each follower bee selects a honey source using the truncation selection mechanism and searches its neighborhood to obtain a new solution new2_xij, then calculates its fitness value; if fval(new2_xij) > fval(xij), update the honey source position, otherwise keep the honey source position unchanged;
s35: recording the position of the current optimal honey source, wherein the position is the current optimal solution;
s36: if the fitness value of the i-th honey source is unchanged, increment its mining count: s(i) = s(i) + 1;
s37: if s(i) ≥ lit, abandon the solution; the scout bee performs a global search according to the new honey source search formula to generate a new solution to replace it, and s(i) is set to 0;
s38: if the maximum iteration number Mirater is reached, output the optimal solution; otherwise set iter = iter + 1 and return to step S33.
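The S31–S38 search loop can be sketched compactly: employed bees perturb each source, follower bees re-search sources chosen by truncation selection, and stagnant sources are replaced by scouts. The fitness callback, bounds, and the standard-ABC perturbation v_j = x_j + φ(x_j − x_kj) are assumptions; the patent's PWD-based fitness is not reproduced here.

```python
import random

def abc_optimize(fval, bounds, M=10, max_iter=50, limit=5, t=0.3, seed=0):
    """Maximize fval over box bounds with a truncation-selection ABC sketch."""
    rng = random.Random(seed)
    dim = len(bounds)
    rand_source = lambda: [rng.uniform(lo, hi) for lo, hi in bounds]
    clip = lambda v: [min(max(x, lo), hi) for x, (lo, hi) in zip(v, bounds)]
    sources = [rand_source() for _ in range(M)]
    fit = [fval(s) for s in sources]
    trials = [0] * M            # mining counts s(i)

    def search(i):
        # perturb one coordinate toward/away from a random other source
        k, j = rng.randrange(M), rng.randrange(dim)
        v = sources[i][:]
        v[j] += rng.uniform(-1, 1) * (v[j] - sources[k][j])
        v = clip(v)
        fv = fval(v)
        if fv > fit[i]:         # greedy update of the honey source (S33/S34)
            sources[i], fit[i], trials[i] = v, fv, 0
        else:
            trials[i] += 1

    for _ in range(max_iter):
        for i in range(M):      # employed-bee phase (S33)
            search(i)
        order = sorted(range(M), key=lambda i: fit[i], reverse=True)
        top = order[:max(1, int(M * t))]
        for _ in range(M):      # follower-bee phase with truncation selection (S34)
            search(rng.choice(top))
        for i in range(M):      # scout phase (S36/S37)
            if trials[i] >= limit:
                sources[i] = rand_source()
                fit[i] = fval(sources[i])
                trials[i] = 0
    best = max(range(M), key=lambda i: fit[i])
    return sources[best], fit[best]
```

In the patent's setting the solution vector would hold the DBSCAN parameters (ε, Minpts) and fval would score the resulting clustering.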
8. The method for webpage deduplication based on the improved DBSCAN algorithm of claim 1, wherein the process of training the webpage deduplication algorithm model comprises:
s51: initializing a category label C as 0, setting category labels of all data points as 0, and inputting a page data set D, a radius epsilon and a density threshold Minpts;
s52: obtain the 2 epsilon neighborhood N2ε(p) of the unvisited data point p in the page data set D by computing the distance between pairs of points; store the results in the dists array, sort the dists array in ascending order, and store the sorted result in the distArr array;
s53: compare the density value within the 2 epsilon neighborhood of p with the density threshold Minpts; if the density value within the 2 epsilon neighborhood of p is greater than or equal to Minpts, i.e. |N2ε(p)| ≥ Minpts, judge the relation between distArr[Minpts] and the radius epsilon, and execute the corresponding function according to that relation; if the density value within the 2 epsilon neighborhood of p is less than Minpts, i.e. |N2ε(p)| < Minpts, mark the category labels of all points in the epsilon neighborhood of p as −1; wherein distArr[Minpts] denotes the distance between point p and its Minpts-th nearest point;
s54: once all data points have been visited, output the category label of each data point in the page data set D; the algorithm ends.
9. The method of claim 8, wherein determining the relationship between distArr[Minpts] and the radius ε comprises:
step 1: if distArr[Minpts] ≤ ε, execute the cluster expansion function Expandcluster(p, distArr, Minpts), save the result of executing the function into the resultPts array, add 1 to the category label C, and mark all points in the resultPts array with category C; the points in the resultPts array form a cluster with p as its core point;
step 2: if distArr[Minpts] > ε, select data in the distArr array by binary search to obtain the non-core point set O = {o | d(p,o) < distArr[Minpts] − ε}, where d(p,o) denotes the distance between point p and point o, and mark the category labels of all points in the set O as −1.
10. The method for web page deduplication based on the improved DBSCAN algorithm as claimed in claim 9, wherein the process of executing the cluster expansion function Expandcluster(p, distArr, Minpts) comprises: initializing the queue drPts, adding the points in the epsilon neighborhood of the data point p to the queue drPts, and judging the distance state of each unvisited data point q in drPts; outputting the data in drPts once all points in drPts have been visited; the judgment process is as follows:
if the distance between p and q is less than epsilon, search the data point q with the neighbor search NeighborQuery algorithm to obtain its neighborhood Nε(q); if |Nε(q)| ≥ Minpts, add Nε(q) to the queue drPts;
if q is the last point of drPts, obtain the 2 epsilon neighborhood N2ε(p) of p using a range query, calculate the distances from p to all points in N2ε(p), store the results in dists, sort dists in ascending order, and save the result as drPts.
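The queue expansion of claim 10 amounts to a breadth-first search from the core point p, pulling in the epsilon-neighborhood of every core point reached. The sketch below uses plain O(n) scans in place of the pruned NeighborQuery for brevity; the function name and return value are illustrative assumptions.

```python
import math

def expand_cluster(D, p, eps, minpts):
    """BFS expansion from core point p: returns the sorted indices of the
    cluster grown from p (a simplified Expandcluster sketch)."""
    dist = lambda a, b: math.dist(D[a], D[b])
    dr_pts = [q for q in range(len(D)) if dist(p, q) <= eps]  # N_eps(p)
    visited = set()
    i = 0
    while i < len(dr_pts):              # until all points in drPts are visited
        q = dr_pts[i]
        i += 1
        if q in visited:
            continue
        visited.add(q)
        nq = [r for r in range(len(D)) if dist(q, r) <= eps]
        if len(nq) >= minpts:           # q is itself a core point
            dr_pts.extend(r for r in nq if r not in visited and r not in dr_pts)
    return sorted(set(dr_pts))
```

On a chain of points spaced 0.4 apart with one far outlier, expansion from the first point reaches the whole chain but not the outlier.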
CN202011176217.XA 2020-10-29 2020-10-29 Webpage duplicate removal method based on improved DBSCAN algorithm Pending CN112257073A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011176217.XA CN112257073A (en) 2020-10-29 2020-10-29 Webpage duplicate removal method based on improved DBSCAN algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011176217.XA CN112257073A (en) 2020-10-29 2020-10-29 Webpage duplicate removal method based on improved DBSCAN algorithm

Publications (1)

Publication Number Publication Date
CN112257073A true CN112257073A (en) 2021-01-22

Family

ID=74262750

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011176217.XA Pending CN112257073A (en) 2020-10-29 2020-10-29 Webpage duplicate removal method based on improved DBSCAN algorithm

Country Status (1)

Country Link
CN (1) CN112257073A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113779377A (en) * 2021-07-27 2021-12-10 浙江大学 Crawler searching method based on barrier-free detection result duplication removal

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104484600A (en) * 2014-11-18 2015-04-01 中国科学院深圳先进技术研究院 Intrusion detection method and device based on improved density clustering
US20200175158A1 (en) * 2018-11-29 2020-06-04 Atos Information Technology GmbH Method For Detecting Intrusions In An Audit Log
CN111291376A (en) * 2018-12-08 2020-06-16 南京慕测信息科技有限公司 Web vulnerability verification method based on crowdsourcing and machine learning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104484600A (en) * 2014-11-18 2015-04-01 中国科学院深圳先进技术研究院 Intrusion detection method and device based on improved density clustering
US20200175158A1 (en) * 2018-11-29 2020-06-04 Atos Information Technology GmbH Method For Detecting Intrusions In An Audit Log
CN111291376A (en) * 2018-12-08 2020-06-16 南京慕测信息科技有限公司 Web vulnerability verification method based on crowdsourcing and machine learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
K. SANTHISREE et al.: "Web Usage Data Clustering Using Dbscan Algorithm and Set Similarities", 2010 International Conference on Data Storage and Data Engineering *
汤盛宇: "Research on Fast Density Clustering Algorithms Based on Neighbor Search Technology" (in Chinese), China Master's Theses Full-text Database, Information Science and Technology *
胡健 et al.: "DBSCAN Clustering Algorithm Based on Adaptive Bee Colony Optimization" (in Chinese), Computer Engineering and Applications *
贾彦丰: "Research on a Deduplication Method for Web Vulnerability Detection Based on the DBSCAN Algorithm" (in Chinese), China Master's Theses Full-text Database, Information Science and Technology *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113779377A (en) * 2021-07-27 2021-12-10 浙江大学 Crawler searching method based on barrier-free detection result duplication removal
CN113779377B (en) * 2021-07-27 2024-03-22 浙江大学 Crawler searching method based on barrier-free detection result deduplication

Similar Documents

Publication Publication Date Title
Yang et al. Detecting malicious URLs via a keyword-based convolutional gated-recurrent-unit neural network
CN106528599B (en) A kind of character string Fast Fuzzy matching algorithm in magnanimity audio data
US7433869B2 (en) Method and apparatus for document clustering and document sketching
US8595204B2 (en) Spam score propagation for web spam detection
US8010614B1 (en) Systems and methods for generating signatures for electronic communication classification
CN112256939B (en) Text entity relation extraction method for chemical field
CN111767725B (en) Data processing method and device based on emotion polarity analysis model
CN108647322B (en) Method for identifying similarity of mass Web text information based on word network
CN110543595B (en) In-station searching system and method
CN108717408A (en) A kind of sensitive word method for real-time monitoring, electronic equipment, storage medium and system
CN110633365A (en) Word vector-based hierarchical multi-label text classification method and system
KR20190135129A (en) Apparatus and Method for Documents Classification Using Documents Organization and Deep Learning
CN113962293B (en) LightGBM classification and representation learning-based name disambiguation method and system
CN108763348A (en) A kind of classification improved method of extension short text word feature vector
CN107291895B (en) Quick hierarchical document query method
CN112100372B (en) Head news prediction classification method
Zhang et al. Unsupervised entity resolution with blocking and graph algorithms
CN104572720B (en) A kind of method, apparatus and computer readable storage medium of webpage information re-scheduling
Wilkins et al. Comparison of five clustering algorithms to classify phytoplankton from flow cytometry data
CN110851733A (en) Community discovery and emotion interpretation method based on network topology and document content
CN112257073A (en) Webpage duplicate removal method based on improved DBSCAN algorithm
CN117155701A (en) Network flow intrusion detection method
Zhu et al. PDHF: Effective phishing detection model combining optimal artificial and automatic deep features
CN114943285B (en) Intelligent auditing system for internet news content data
Chen et al. Community Detection Based on DeepWalk Model in Large‐Scale Networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210122
