CN112257073A - Webpage duplicate removal method based on improved DBSCAN algorithm - Google Patents


Info

Publication number
CN112257073A
CN112257073A
Authority
CN
China
Prior art keywords
data, algorithm, minpts, webpage, deduplication
Prior art date
Legal status (assumption, not a legal conclusion; Google has not performed a legal analysis)
Pending
Application number
CN202011176217.XA
Other languages
Chinese (zh)
Inventor
徐光侠
王利
马创
刘俊
张家俊
Current Assignee (the listed assignees may be inaccurate)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (assumption, not a legal conclusion)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202011176217.XA priority Critical patent/CN112257073A/en
Publication of CN112257073A publication Critical patent/CN112257073A/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57 Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577 Assessing vulnerabilities and evaluating computer system security
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N3/006 Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]


Abstract

The invention belongs to the field of computers and specifically relates to a webpage deduplication method based on an improved DBSCAN algorithm, which comprises the following steps: acquiring website data in real time, inputting the acquired website data into a trained webpage deduplication algorithm model, and removing duplicate data from the website data set according to the training result to obtain the data set for subsequent vulnerability scanning. The webpage deduplication algorithm model searches for optimal parameters with an improved artificial bee colony algorithm, substitutes the optimal parameters for the two parameters of the DBSCAN algorithm, and improves the core-point selection process of the DBSCAN algorithm with a proximity search strategy. By selecting optimal parameters on the constructed artificial data set with the improved artificial bee colony algorithm and then using the parameters found to set the DBSCAN parameters, the invention improves the clustering effect of the DBSCAN algorithm.

Description

Webpage duplicate removal method based on improved DBSCAN algorithm
Technical Field
The invention belongs to the field of computers, and particularly relates to a webpage duplication removing method based on an improved DBSCAN algorithm.
Background
With the development of the information era, the internet has brought convenience to daily life and countless opportunities to the internet industry. However, its rapid growth has not been matched by adequate protection of network system security, and network vulnerabilities have become a major source of security breaches in the internet industry in recent years. How to respond to network attacks, prevent them so as to protect information assets from loss, and provide users with a safe internet environment has become a problem that society urgently needs to solve. At present, vulnerability scanners are mainly used to scan webpages for security flaws: vulnerability information is exposed in advance, and the exposed vulnerabilities are repaired before hackers can exploit them.
A vulnerability scanner distinguishes pages by their request information and scans for vulnerabilities page by page. Because a website contains a large number of pages, repeated pages need to be filtered out to increase the scanning rate. Existing page deduplication methods include a regular-matching-based technique and the DBSCAN algorithm. The regular-matching-based technique distinguishes pages by the similarity of their character strings: if the string contents of two pages are the same, the two pages belong to one page class and only one needs to be selected; otherwise they are divided into different classes. However, this approach cannot integrate multiple features, so its classification results are inaccurate. The DBSCAN algorithm is a classical density-based clustering algorithm: it does not require the number of classes to be specified and can identify outliers and clusters of arbitrary number and shape. When DBSCAN performs cluster analysis on the page data set, the data within each cluster are similar, i.e. they are repeated page-request data, so only one record per cluster needs to be selected for vulnerability scanning. However, DBSCAN is sensitive to the choice of the radius parameter ε and the density threshold parameter Minpts, and its nearest-data-point search is slow, which reduces webpage deduplication efficiency.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a webpage deduplication method based on an improved DBSCAN algorithm, which comprises the following steps: acquiring website data in real time, inputting the acquired website data into a trained webpage deduplication algorithm model to remove repeated data in a website data set, and obtaining a data set to be subjected to subsequent vulnerability scanning;
the process of constructing the webpage deduplication algorithm model comprises the following steps:
s1: acquiring website data, and performing feature extraction and feature quantization processing on the acquired website data;
s2: carrying out selective characteristic weighting processing on the quantized website data to obtain a page data set D;
s3: inputting the page data set D into an improved artificial bee colony algorithm for training to obtain an optimal parameter value; the improved artificial bee colony algorithm comprises the steps of optimizing a honey source selection process in the artificial bee colony algorithm by adopting a truncation selection mechanism;
s4: improving the DBSCAN algorithm according to the optimal parameter values and the proximity search strategy to obtain the webpage deduplication algorithm model;
s5: inputting the data in the page data set D into a webpage deduplication algorithm model for training to obtain cluster labels of all data points;
s6: and according to the difference of the cluster labels, selecting one piece of data in each cluster to construct a data set to be subjected to vulnerability scanning.
Preferably, the features extracted from the acquired website data include: the request method, request address, request parameter names, number of request parameters, request host name, and message entity transfer length; the extracted features are then quantized.
Further, the quantization process includes: directly assigning each distinct request method a numeric label in 0-9, taking each message entity's transfer length as a feature value, building a dictionary from the request address and the request parameter names, counting the request parameters, and taking the parameter count as a feature value.
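The quantization scheme above can be sketched in code. This is a minimal illustration, not the patent's implementation: the concrete method labels, the record field names (method, address, param_names, entity_length), and the dictionary layout are assumptions; the patent only specifies the general mapping.

```python
# Hedged sketch of the feature-quantization step described above.
# Method labels and field names are illustrative assumptions.

def quantize_request(req, url_dict, param_dict):
    """Map one HTTP request record to a numeric feature vector.

    req: dict with hypothetical keys 'method', 'address',
         'param_names', 'entity_length'.
    url_dict / param_dict: growing dictionaries assigning each
         distinct address / parameter-name string a numeric id.
    """
    methods = {"GET": 0, "POST": 1, "PUT": 2, "DELETE": 3,
               "HEAD": 4, "OPTIONS": 5, "PATCH": 6}  # labels in 0-9
    m = methods.get(req["method"], 9)
    # dictionary encoding: first-seen strings get the next id
    addr = url_dict.setdefault(req["address"], len(url_dict))
    pids = [param_dict.setdefault(p, len(param_dict))
            for p in req["param_names"]]
    n_params = len(req["param_names"])   # number-of-parameters feature
    length = req["entity_length"]        # message entity transfer length
    return [m, addr, sum(pids), n_params, length]

url_dict, param_dict = {}, {}
vec = quantize_request(
    {"method": "POST", "address": "/login",
     "param_names": ["user", "pwd"], "entity_length": 128},
    url_dict, param_dict)
print(vec)  # [1, 0, 1, 2, 128]
```

The resulting vectors would then be feature-weighted to form the page data set D.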
Preferably, the process of optimizing the artificial bee colony algorithm by the truncation selection mechanism comprises:
Step 1: calculating the individual fitness fval of each piece of data in the page data set D;
Step 2: arranging the data in the page data set D in descending order of the individual fitness fval; randomly selecting k individuals from the population to form a group, and selecting the individual i with the maximum fitness fval in the group;
Step 3: selecting the top t% of individuals in the population to generate the next generation; judging, according to the truncation selection mechanism, the probability that the individual at rank i generates offspring: when the selected individual i satisfies i ≤ M × t%, the follower bee searches the neighborhood of the current honey source and generates the offspring population;
Step 4: repeating step 3 M times to generate the new generation population.
Further, the individual fitness fval is calculated as follows:
[Fitness formula given as an image in the original; fval is computed from the inter-cluster similarity and the probability-weighted density, as detailed in the embodiment.]
Further, the probability that the individual at rank i generates the offspring population follows the truncation rule:
P_i = 1, if i ≤ M × t%; P_i = 0, otherwise,
where M is the population size and t% is the truncation threshold.
The truncation threshold is calculated as follows:
[Truncation-threshold formula given as an image in the original; it is computed from the maximum and minimum thresholds t_max and t_min, the current remaining search count cyc, and the maximum search count Miter.]
preferably, the process of solving the optimal parameters by using the improved artificial bee colony algorithm comprises the following steps:
s31: setting the maximum iteration number Mirater, the number M of bees and a parameter lit for judging whether to remove the solution or not; initializing a set of honey sources v in a solution space1,v2,…,vkOne honey source corresponds to one solution, and the initial honey source position is used as an initial solution; calculating the fitness and arranging according to a descending order, taking the bees on the honey source position corresponding to the first M/2 fitness values as honey collection peaks, and taking the bees corresponding to the last M/2 fitness values as follower bees;
s32: setting the current iteration number to be iratom as 1;
s33: the bee is adopted to carry out neighborhood search according to a search formula to generate a new solution new1_ xijCalculating the fitness value of the new solution if fval (new1_ x)ij)>fval(xij) If not, updating the honey source position, otherwise, keeping the honey source position unchanged;
s34: the follower bee adopts a truncation selection mechanism to select the honey source and search the neighborhood according to the honey source position obtained by S33 to obtain a new solution new2_ xijCalculating the fitness value of the new solution if fval (new2_ x)ij)>fval(xij) If not, updating the honey source position, otherwise, keeping the honey source position unchanged;
s35: recording the position of the current optimal honey source, wherein the position is the current optimal solution;
s36: if the fitness value of the ith honey source is unchanged, the mining frequency s (i) is s (i) + 1;
s37: if s (i) is not less than lit, discarding the solution, and the scout bee performing global search according to a new honey source search formula to generate a new solution to replace the solution, and setting s (i) to be 0;
s38: and if the maximum iteration number is reached, namely, irator +1, outputting the optimal solution, otherwise, returning to the step S33.
Preferably, the process of training the webpage deduplication algorithm model includes:
S51: initializing the category label C = 0, setting the category labels of all data points to 0, and inputting the page data set D, the radius ε, and the density threshold Minpts;
S52: calculating pairwise distances to obtain the 2ε neighborhood N_2ε(p) of each unvisited data point p in the page data set D; storing the results in the dists array, sorting dists in ascending order, and storing the sorted result in the distArr array;
S53: comparing the density value in the 2ε neighborhood of p with the density threshold Minpts; if |N_2ε(p)| ≥ Minpts, judging the relation between distArr[Minpts] and the radius ε, and executing the corresponding function according to that relation; if |N_2ε(p)| < Minpts, labeling the category labels of all points in the ε neighborhood of p as -1; here distArr[Minpts] represents the distance between p and its Minpts-th nearest point;
S54: outputting the category label of each data point in the page data set D once all data points have been visited; the algorithm then ends.
Further, the process of determining the relation between distArr[Minpts] and ε includes:
Step 1: if distArr[Minpts] ≤ ε, executing the cluster expansion function Expandcluster(p, distArr, Minpts), saving the result of the function into the resultPts array, incrementing the category label C by 1, and marking all points in the resultPts array with label C; the points in the resultPts array form a cluster with p as the core point;
Step 2: if distArr[Minpts] > ε, selecting data in the distArr array by binary search to obtain the non-core point set O = {o | dis_{p,o} < distArr[Minpts] - ε}, where dis_{p,o} represents the distance between point p and point o; the category labels of all points in the set O are marked -1.
Further, the process of executing the cluster expansion function Expandcluster(p, distArr, Minpts) includes: initializing a queue drPts, adding the points in the ε neighborhood of data point p into drPts, and judging the distance state of each unvisited data point q in drPts; the data in drPts are output once all points in drPts have been visited; the judgment process is as follows:
If the distance between p and q is less than ε, searching the data point q with the neighbor search NeighborQuery algorithm to obtain the neighborhood N_ε(q); if |N_ε(q)| ≥ Minpts, adding N_ε(q) to the queue drPts;
If q is the last point of drPts, obtaining the 2ε neighborhood N_2ε(p) of p using a range query, calculating the distances from p to all points in N_2ε(p), saving the results to dists, sorting dists in ascending order, and saving the result as drPts.
The invention has the following beneficial effects:
1) the improved artificial bee colony algorithm selects optimal parameters on the constructed artificial data set, and the optimal parameters found are used for the parameter setting of the DBSCAN algorithm, which improves the clustering effect of the DBSCAN algorithm;
2) the proximity search strategy improves the core-point selection process of the DBSCAN algorithm, which increases the clustering speed of the algorithm; deduplicating the website page data according to the clustering result then increases the scanning efficiency of the vulnerability scanning system.
Drawings
FIG. 1 is a flow diagram of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A webpage deduplication method based on an improved DBSCAN algorithm, as shown in fig. 1, the method includes: acquiring website data in real time, inputting the acquired website data into a trained webpage deduplication algorithm model to remove repeated data in a website data set, and obtaining a data set to be subjected to subsequent vulnerability scanning;
the process of constructing the webpage deduplication algorithm model comprises the following steps:
s1: acquiring website data, and performing feature extraction and feature quantization processing on the acquired website data;
s2: carrying out selective characteristic weighting processing on the quantized website data to obtain a page data set D;
s3: inputting the page data set D into an improved artificial bee colony algorithm for training to obtain an optimal parameter value; the improved artificial bee colony algorithm comprises the steps of optimizing a honey source selection process in the artificial bee colony algorithm by adopting a truncation selection mechanism;
s4: improving the DBSCAN algorithm according to the optimal parameter values and the proximity search strategy to obtain the webpage deduplication algorithm model;
s5: inputting the data in the page data set D into a webpage deduplication algorithm model for training to obtain cluster labels of all data points;
s6: and according to the difference of the cluster labels, selecting one piece of data in each cluster to construct a data set to be subjected to vulnerability scanning.
Preferably, the acquiring of the website data includes crawling the website data in a web crawler manner.
Preferably, the features extracted from the acquired website data include: the request method, request path, request parameter names, number of request parameters, request host name, and message entity transfer length; the feature quantization includes: directly assigning values to the different parameters, building dictionaries, and counting quantities as weights. The specific quantization process includes: assigning each distinct request method a numeric label in 0-9, taking each message entity's transfer length as a feature value, building a dictionary from the request address and the request parameter names, counting the request parameters, and taking the parameter count as a feature value.
Preferably, the process of obtaining the page data set D includes: website data acquisition, data feature extraction, data feature quantization, and data feature weighting; the feature-weighted website data form the page data set D, and the feature weights comprehensively consider the degree to which each feature in the webpage data distinguishes pages and the magnitude of the feature value.
The embodiment provides a specific implementation method for optimizing an artificial bee colony algorithm according to a truncation selection mechanism, and the process comprises the following steps:
step 1: calculating the individual fitness fval of each piece of data in the page data set D;
the specific process for calculating the individual fitness fval comprises the following steps:
step 11: calculating the existence probability of the cluster i and the weighted density of the cluster i according to the data in the page data set D; the probability formula for the existence of cluster i is:
p(i) = n_i / n
where n_i is the number of objects in cluster i and n is the total number of objects.
The weighted density formula for cluster i is:
[Weighted density formula given as an image in the original; it is computed from the pairwise distances ω(x, y) of the objects in cluster i.]
where ω (x, y) represents the distance between sample point x and sample point y.
Step 12: calculating the probability-weighted density from the existence probability of cluster i and the weighted density of cluster i; the probability-weighted density is formulated as:
[Probability-weighted density formula given as an image in the original.]
where c represents the number of clusters.
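The probability-weighted density can be illustrated with a small worked sketch. Only p(i) = n_i / n and the idea of weighting each cluster's density by its existence probability are taken from the surrounding text; the per-cluster weighted density formula appears only as an image in the original, so the D(i) values below are assumed inputs, and the weighted sum PWD = Σ_i p(i) · D(i) is a plausible reading, not the patent's exact formula.

```python
# Worked sketch of the probability-weighted density (PWD). The per-cluster
# weighted densities D(i) are assumed inputs; p(i) = n_i / n comes from
# the cluster-probability formula in the text.
sizes = {0: 6, 1: 3, 2: 1}            # n_i: objects per cluster (illustrative)
density = {0: 0.8, 1: 0.5, 2: 0.1}    # D(i): assumed weighted densities
n = sum(sizes.values())               # total number of objects
p = {i: sizes[i] / n for i in sizes}  # existence probability of cluster i
pwd = sum(p[i] * density[i] for i in sizes)  # assumed PWD = sum p(i)*D(i)
print(round(p[0], 2), round(pwd, 2))  # 0.6 0.64
```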
Step 13: and (3) calculating the similarity c (x, y) between the sample points in two different clusters according to the data in the page data set D, wherein the expression is as follows:
[Formula for c(x, y) given as an image in the original.]
where E represents the union of the two clusters and d(x, y) represents the similarity between two different sample points in E, calculated as:
[Formula for d(x, y) given as an image in the original.]
where ω(x, y) represents the distance between sample point x and sample point y, and ω_cen(I_i, I_j) represents the distance between the center points of cluster I_i and cluster I_j.
Step 14: calculating the similarity between clusters from the c(x, y) of step 13; the formula is:
[Formula for the inter-cluster similarity given as an image in the original.]
where Sim(I_i, I_j) represents the similarity between cluster I_i and cluster I_j.
Step 15: calculating the individual fitness fval from the inter-cluster similarity and the probability-weighted density; the formula is:
[Fitness formula given as an image in the original.]
where c represents the number of clusters, Sim(I_i, I_j) represents the similarity between cluster I_i and cluster I_j, and PWD represents the probability-weighted density.
Step 2: arranging the data in the page data set D in descending order of the individual fitness fval; randomly selecting k individuals from the population to form a group, and selecting the individual i with the maximum fitness fval in the group;
Step 3: selecting the top t% of individuals in the population to generate the next generation; judging, according to the truncation selection mechanism, the probability that the individual at rank i generates offspring: when the selected individual i satisfies i ≤ M × t%, the follower bee searches the neighborhood of the current honey source and generates the offspring population;
the probability formula for generating a population of offspring is:
P_i = 1, if i ≤ M × t%; P_i = 0, otherwise
where P_i represents the probability that the individual at rank i generates the next generation, M represents the population size, and t% represents the truncation threshold;
the calculation formula of the truncation threshold is as follows:
[Truncation-threshold formula given as an image in the original.]
where t_max and t_min are the maximum and minimum truncation thresholds, cyc is the current remaining search count, and Miter is the maximum search count of the honey-collecting and follower bees.
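The adaptive threshold can be sketched in code. The exact formula is given only as an image in the original; a linear interpolation between t_min and t_max driven by the remaining search budget cyc / Miter is one plausible reading of the variables listed, used here purely to illustrate how the threshold could tighten as the search proceeds.

```python
# Hedged sketch of an adaptive truncation threshold. The linear form is an
# assumption; the patent's exact formula is not reproduced in the text.
def truncation_threshold(t_min, t_max, cyc, miter):
    """Threshold in percent; cyc = remaining searches, miter = maximum."""
    return t_min + (t_max - t_min) * cyc / miter

early = truncation_threshold(10, 50, cyc=90, miter=100)  # search just begun
late = truncation_threshold(10, 50, cyc=10, miter=100)   # almost exhausted
print(early, late)  # 46.0 14.0
```

Under this reading, selection pressure increases over time: fewer individuals qualify for reproduction as the remaining budget cyc shrinks.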
Step 4: repeating step 3 M times to generate the new generation population.
The embodiment provides a specific implementation of finding the optimal parameters with the improved artificial bee colony algorithm.
S31: setting the maximum iteration count Miter, the number of bees M, and the parameter lit for judging whether a solution is to be abandoned; initializing a set of honey sources v_1, v_2, …, v_k within a range of the solution space (ε neighborhood range 0-1), where each honey source corresponds to one solution and the initial honey source positions serve as the initial solutions; calculating the individual fitness values and sorting them in descending order, taking the bees at the honey source positions with the top M/2 fitness values as honey-collecting bees and the bees with the bottom M/2 values as follower bees;
S32: setting the iteration count iter = 1;
S33: the honey-collecting bee performs a neighborhood search according to the search formula to generate a new solution new1_x_ij and calculates its fitness value; if fval(new1_x_ij) > fval(x_ij), the honey source position is updated, otherwise it remains unchanged;
the neighborhood search formula is:
new_x_ij = x_ij + r_ij(x_ij - x_kj)
where k ∈ {1, 2, …, N}, j ∈ {1, 2, …, d}, k and j are generated randomly with k ≠ i, and r_ij is a random number in (0, 1).
S34: the follower bee selects a honey source by the truncation selection mechanism from the positions obtained in S33 and searches its neighborhood to obtain a new solution new2_x_ij, calculating its fitness value; if fval(new2_x_ij) > fval(x_ij), the honey source position is updated, otherwise it remains unchanged;
S35: recording the position of the current optimal honey source, which is the current optimal solution; if the fitness value of the i-th honey source has not changed, incrementing its mining count: s(i) = s(i) + 1;
S36: if s(i) ≥ lit, abandoning the solution; the scout bee performs a global search according to the new honey source search formula to generate a replacement solution, and s(i) is set to 0;
the new honey source search formula is as follows:
new_pop(i)=(upbond-lbond)·rand+lbond
where upbond is the upper bound of the neighborhood, lbond is the lower bound of the neighborhood, and rand is a random number in (0, 1).
S38: if the maximum iteration count is reached, i.e. iter = Miter, outputting the optimal solution; otherwise incrementing the iteration count by 1 and returning to step S33.
The embodiment provides an improved DBSCAN algorithm model, which is used for training a page data set on the basis of the optimal parameters obtained in the embodiment 1, so as to realize duplication removal of website page data of vulnerability scanning.
In this embodiment, the process of training the webpage deduplication algorithm model includes:
S51: initializing the category label C = 0, setting the category labels of all data points to 0, and inputting the page data set D, the radius ε, and the density threshold Minpts;
S52: for each unvisited data point p in D: first obtaining the 2ε neighborhood N_2ε(p) of p by calculating pairwise distances, storing the results in the dists array, sorting dists in ascending order, and storing the sorted result in the distArr array;
S53: comparing the density value in the 2ε neighborhood of p with the density threshold Minpts; if |N_2ε(p)| ≥ Minpts, judging the relation between distArr[Minpts] and the radius ε, where distArr[Minpts] represents the distance between p and its Minpts-th nearest point;
If distArr[Minpts] ≤ ε, executing the cluster expansion function Expandcluster(p, distArr, Minpts), saving the result into the resultPts array, incrementing the category label C by 1, and marking all points in resultPts with label C; the points in the resultPts array form a cluster with p as the core point;
If distArr[Minpts] > ε, selecting data in the distArr array by binary search to obtain the non-core point set O = {o | dis_{p,o} < distArr[Minpts] - ε}, where dis_{p,o} represents the distance between point p and point o, and marking the category labels of all points in the set O as -1;
S54: outputting the category label of each data point in the page data set D once all data points have been visited, completing the construction of the algorithm model.
The process of executing the cluster expansion function Expandcluster (p, distArr, Minpts) includes:
initializing the queue drPts, i.e., drPts ═ Nε(p) |, adding points in the epsilon neighborhood of the data point p into the queue drPts, and judging the distance state of each unaccessed data point q in the drPts; outputting the data in drPts until all the points in drPts are visited; the judgment process is as follows:
if the distance between p and q is less than epsilon, search the data point q with the neighbor search NeighborQuery algorithm to obtain the neighborhood Nε(q); if |Nε(q)| ≥ Minpts, add Nε(q) to the queue drPts;
if q is the last point of drPts, obtain the 2 epsilon neighborhood N2ε(p) of p using a range query, calculate the distances from p to all points in N2ε(p), store the results in dists, sort dists in ascending order, and save the result as drPts;
until all points in drPts have been accessed, the data in drPts is output.
The process of searching the data point q by adopting the neighbor search NeighborQuery algorithm comprises the following steps:
input the data in the page data set D, the radius epsilon, the density threshold Minpts, the current query point p, and the array distArr;
perform a binary search to obtain the index value L satisfying distArr[L] > d(p,q) − ε, where d(p,q) denotes the distance between point p and point q; perform a binary search to obtain the index value U satisfying distArr[U] < d(p,q) + ε; save the points with index values from L to U in the sorted array distArr into the array possibleNeighbor;
find the points in the array possibleNeighbor whose distance to the point q is less than epsilon, and store the result in the array Neighbor;
the epsilon neighborhood Nε(q) is formed by the points before array index L−1 together with the points in the array Neighbor; output the epsilon neighborhood Nε(q) of q.
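The pruning behind NeighborQuery can be sketched directly: by the triangle inequality, any epsilon-neighbor o of q must satisfy d(p,q) − ε ≤ d(p,o) ≤ d(p,q) + ε, so only the slice of the sorted distance array between the two binary-search bounds needs an exact distance check. Variable names and the use of `bisect` for the two searches are illustrative assumptions.

```python
import bisect
import math

def neighbor_query(D, order, dist_arr, p, q, eps):
    """order[i] is the index of the i-th nearest point to p;
    dist_arr[i] = d(p, order[i]) is sorted ascending.
    Returns the indices of the eps-neighbors of q."""
    d_pq = math.dist(D[p], D[q])
    lo = bisect.bisect_left(dist_arr, d_pq - eps)    # index L
    hi = bisect.bisect_right(dist_arr, d_pq + eps)   # index U (exclusive)
    possible = order[lo:hi]                          # possibleNeighbor slice
    # exact check only on the pruned candidate slice
    return [o for o in possible if math.dist(D[q], D[o]) <= eps]
```

Points whose distance to p falls outside [d(p,q) − ε, d(p,q) + ε] are never compared against q, which is what makes the query sub-linear in practice on sorted distances.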
The above embodiments further illustrate the objects, technical solutions and advantages of the present invention. It should be understood that they are only preferred embodiments of the present invention and are not intended to limit it; any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention shall be included in its protection scope.

Claims (10)

1. A webpage deduplication method based on an improved DBSCAN algorithm, characterized by comprising the following steps: acquiring website data in real time, and inputting the acquired website data into a trained webpage deduplication algorithm model to remove duplicate data from the website data set, obtaining the data set to be subjected to subsequent vulnerability scanning;
the process of constructing the webpage deduplication algorithm model comprises the following steps:
s1: acquiring website data, and performing feature extraction and feature quantization processing on the acquired website data;
s2: carrying out selective characteristic weighting processing on the quantized website data to obtain a page data set D;
s3: inputting the page data set D into an improved artificial bee colony algorithm for training to obtain the optimal parameter values; the improvement to the artificial bee colony algorithm comprises optimizing the honey source selection process of the artificial bee colony algorithm with a truncation selection mechanism;
s4: improving the DBSCAN algorithm according to the optimal parameter value and the adjacent search strategy to obtain a webpage duplication elimination algorithm model;
s5: inputting the data in the page data set D into a webpage deduplication algorithm model for training to obtain cluster labels of all data points;
s6: and according to the difference of the cluster labels, selecting one piece of data in each cluster to construct a data set to be subjected to vulnerability scanning.
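Step S6 of claim 1 reduces to keeping one record per cluster label. A minimal sketch under assumptions the claim leaves open (noise points labeled −1 are each treated as unique pages and kept; the first record seen in a cluster is its representative):

```python
def select_representatives(records, labels):
    """Keep one record per cluster label; records and labels run in parallel.
    Noise points (label -1) are assumed unique pages and are all kept."""
    seen = set()
    kept = []
    for rec, lab in zip(records, labels):
        if lab == -1:
            kept.append(rec)          # noise: no duplicate found, keep as-is
        elif lab not in seen:
            seen.add(lab)
            kept.append(rec)          # first page seen in this cluster
    return kept
```

The output is the deduplicated data set handed to the vulnerability scanner.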
2. The web page deduplication method based on the improved DBSCAN algorithm as claimed in claim 1, wherein the features extracted from the acquired website data comprise: the request method, the request address, the request parameter names, the number of request parameters, the request host name, and the transmission length of the message entity; and the extracted features are subjected to quantization processing.
3. The method for webpage deduplication based on the improved DBSCAN algorithm as claimed in claim 2, wherein the quantization process comprises: directly assigning each distinct request method a numeric label from 0 to 9, taking the transmission length of each message entity as a feature value, establishing a dictionary from the request address and the request parameter names, and counting the number of request parameters and taking that count as a feature value.
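A hedged sketch of the claim-3 quantization: method names get small integer labels, the URL path and parameter names are looked up in (or appended to) running dictionaries, and the parameter count and entity length are used directly. The exact encodings, field names and feature ordering are assumptions; the patent fixes only the feature list.

```python
# Illustrative method-to-label table; the patent only says labels in 0-9.
METHODS = {"GET": 0, "POST": 1, "PUT": 2, "DELETE": 3, "HEAD": 4,
           "OPTIONS": 5, "PATCH": 6, "TRACE": 7, "CONNECT": 8}

def quantize(request, path_dict, name_dict):
    """request: dict with keys method, path, host, params (dict), content_length.
    path_dict/name_dict grow as new values are seen (dictionary encoding)."""
    path_id = path_dict.setdefault(request["path"], len(path_dict))
    host_id = name_dict.setdefault(request["host"], len(name_dict))
    names_key = ",".join(sorted(request["params"]))        # parameter-name set
    names_id = name_dict.setdefault(names_key, len(name_dict))
    return [METHODS.get(request["method"], 9),             # unknown methods -> 9
            path_id,
            names_id,
            len(request["params"]),                        # parameter count
            host_id,
            request["content_length"]]
```

Each request thus becomes a fixed-length numeric vector suitable for the distance computations in the clustering step.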
4. The method for webpage deduplication based on the improved DBSCAN algorithm of claim 1, wherein the process of optimizing the artificial bee colony algorithm by the truncation selection mechanism comprises:
step 1: calculating the individual fitness fval of each piece of data in the page data set D;
step 2: sort the data in the page data set D in descending order of individual fitness fval; randomly select k individuals from the population to form a group, and select the individual i with the largest fitness fval in the group;
step 3: select the top t% of better individuals in the population to generate the next-generation population; according to the truncation selection mechanism, judge the probability that the individual ranked i produces an offspring population: when the selected individual i satisfies i ≤ M × t%, the follower bee searches in the neighborhood of the current honey source and generates the offspring population;
step 4: repeat step 3 M times to generate a new generation of the population.
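The truncation mechanism of claim 4 can be sketched as follows: rank the population by fitness in descending order and let only the top t% reproduce, each with equal probability. The function signature and the uniform choice among the survivors are illustrative assumptions.

```python
import random

def truncation_select(population, fitness, t_percent, rng=random):
    """Return one parent chosen uniformly from the top t% by fitness."""
    M = len(population)
    ranked = sorted(range(M), key=lambda i: fitness[i], reverse=True)
    cutoff = max(1, int(M * t_percent / 100))
    # only individuals with rank i <= M * t% may produce offspring
    return population[rng.choice(ranked[:cutoff])]
```

With t = 25% and four individuals, only the single fittest individual can be selected, which matches the i ≤ M × t% condition of step 3.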
5. The method for webpage deduplication based on the improved DBSCAN algorithm of claim 4, wherein the formula for calculating the individual fitness fval is as follows:
Figure FDA0002748742390000021
wherein PWD denotes the probability weighted density, c denotes the number of clusters, and Sim(Ii, Ij) denotes the similarity between cluster Ii and cluster Ij.
6. The web page deduplication method based on the improved DBSCAN algorithm as claimed in claim 4, wherein the probability that the individual ranked i generates the offspring population is given by:
Pi = 1/(M × t%), if i ≤ M × t%;  Pi = 0, otherwise
wherein Pi denotes the probability that the individual ranked i can generate the next generation, M denotes the population size, and t% denotes the truncation threshold;
the calculation formula of the truncation threshold is as follows:
t% = tmin + (tmax − tmin) × cyc / Mirater
wherein tmax denotes the maximum truncation threshold, tmin denotes the minimum truncation threshold, cyc denotes the current remaining number of searches of the employed bees and follower bees, and Mirater denotes the maximum number of searches of the employed bees and follower bees.
7. The method for webpage deduplication based on the improved DBSCAN algorithm as claimed in claim 1, wherein the process of solving the optimal parameters by using the improved artificial bee colony algorithm comprises:
s31: set the maximum iteration number Mirater, the number of bees M, and the parameter lit for judging whether to abandon a solution; initialize a set of honey sources v1, v2, …, vk in the solution space, one honey source corresponding to one solution, with the initial honey source positions as the initial solutions; calculate the fitness values and sort them in descending order; the bees at the honey source positions corresponding to the first M/2 fitness values are taken as employed bees, and the bees corresponding to the last M/2 fitness values as follower bees;
s32: set the current iteration number iter = 1;
s33: each employed bee performs a neighborhood search according to the search formula to generate a new solution new1_xij and calculates its fitness value; if fval(new1_xij) > fval(xij), update the honey source position, otherwise keep the honey source position unchanged;
s34: according to the honey source positions obtained in S33, each follower bee selects a honey source using the truncation selection mechanism and searches its neighborhood to obtain a new solution new2_xij, then calculates its fitness value; if fval(new2_xij) > fval(xij), update the honey source position, otherwise keep the honey source position unchanged;
s35: recording the position of the current optimal honey source, wherein the position is the current optimal solution;
s36: if the fitness value of the i-th honey source is unchanged, increment its mining count: s(i) = s(i) + 1;
s37: if s(i) ≥ lit, abandon the solution; the scout bee performs a global search according to the new honey source search formula to generate a new solution to replace it, and s(i) is set to 0;
s38: if the maximum iteration number Mirater is reached, output the optimal solution; otherwise set iter = iter + 1 and return to step S33.
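The S31–S38 search loop can be sketched compactly: employed bees perturb each source, follower bees re-search sources chosen by truncation selection, and stagnant sources are replaced by scouts. The fitness callback, bounds, and the standard-ABC perturbation v_j = x_j + φ(x_j − x_kj) are assumptions; the patent's PWD-based fitness is not reproduced here.

```python
import random

def abc_optimize(fval, bounds, M=10, max_iter=50, limit=5, t=0.3, seed=0):
    """Maximize fval over box bounds with a truncation-selection ABC sketch."""
    rng = random.Random(seed)
    dim = len(bounds)
    rand_source = lambda: [rng.uniform(lo, hi) for lo, hi in bounds]
    clip = lambda v: [min(max(x, lo), hi) for x, (lo, hi) in zip(v, bounds)]
    sources = [rand_source() for _ in range(M)]
    fit = [fval(s) for s in sources]
    trials = [0] * M            # mining counts s(i)

    def search(i):
        # perturb one coordinate toward/away from a random other source
        k, j = rng.randrange(M), rng.randrange(dim)
        v = sources[i][:]
        v[j] += rng.uniform(-1, 1) * (v[j] - sources[k][j])
        v = clip(v)
        fv = fval(v)
        if fv > fit[i]:         # greedy update of the honey source (S33/S34)
            sources[i], fit[i], trials[i] = v, fv, 0
        else:
            trials[i] += 1

    for _ in range(max_iter):
        for i in range(M):      # employed-bee phase (S33)
            search(i)
        order = sorted(range(M), key=lambda i: fit[i], reverse=True)
        top = order[:max(1, int(M * t))]
        for _ in range(M):      # follower-bee phase with truncation selection (S34)
            search(rng.choice(top))
        for i in range(M):      # scout phase (S36/S37)
            if trials[i] >= limit:
                sources[i] = rand_source()
                fit[i] = fval(sources[i])
                trials[i] = 0
    best = max(range(M), key=lambda i: fit[i])
    return sources[best], fit[best]
```

In the patent's setting the solution vector would hold the DBSCAN parameters (ε, Minpts) and fval would score the resulting clustering.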
8. The method for webpage deduplication based on the improved DBSCAN algorithm of claim 1, wherein the process of training the webpage deduplication algorithm model comprises:
s51: initializing a category label C as 0, setting category labels of all data points as 0, and inputting a page data set D, a radius epsilon and a density threshold Minpts;
s52: obtain the 2 epsilon neighborhood N2ε(p) of the unvisited data point p in the page data set D by computing the distance between pairs of points; store the results in the dists array, sort the dists array in ascending order, and store the sorted result in the distArr array;
s53: compare the density value within the 2 epsilon neighborhood of p with the density threshold Minpts; if the density value within the 2 epsilon neighborhood of p is greater than or equal to Minpts, i.e. |N2ε(p)| ≥ Minpts, judge the relation between distArr[Minpts] and the radius epsilon, and execute the corresponding function according to that relation; if the density value within the 2 epsilon neighborhood of p is less than Minpts, i.e. |N2ε(p)| < Minpts, mark the category labels of all points in the epsilon neighborhood of p as −1; wherein distArr[Minpts] denotes the distance between point p and its Minpts-th nearest point;
s54: once all data points have been visited, output the category label of each data point in the page data set D; the algorithm ends.
9. The method of claim 8, wherein determining the relationship between distArr[Minpts] and the radius ε comprises:
step 1: if distArr[Minpts] ≤ ε, execute the cluster expansion function Expandcluster(p, distArr, Minpts), save the result of executing the function into the resultPts array, add 1 to the category label C, and mark all points in the resultPts array with category C; the points in the resultPts array form a cluster with p as its core point;
step 2: if distArr[Minpts] > ε, select data in the distArr array by binary search to obtain the non-core point set O = {o | d(p,o) < distArr[Minpts] − ε}, where d(p,o) denotes the distance between point p and point o, and mark the category labels of all points in the set O as −1.
10. The method for web page deduplication based on the improved DBSCAN algorithm as claimed in claim 9, wherein the process of executing the cluster expansion function Expandcluster(p, distArr, Minpts) comprises: initializing the queue drPts, adding the points in the epsilon neighborhood of the data point p to the queue drPts, and judging the distance state of each unvisited data point q in drPts; outputting the data in drPts once all points in drPts have been visited; the judgment process is as follows:
if the distance between p and q is less than epsilon, search the data point q with the neighbor search NeighborQuery algorithm to obtain its neighborhood Nε(q); if |Nε(q)| ≥ Minpts, add Nε(q) to the queue drPts;
if q is the last point of drPts, obtain the 2 epsilon neighborhood N2ε(p) of p using a range query, calculate the distances from p to all points in N2ε(p), store the results in dists, sort dists in ascending order, and save the result as drPts.
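The queue expansion of claim 10 amounts to a breadth-first search from the core point p, pulling in the epsilon-neighborhood of every core point reached. The sketch below uses plain O(n) scans in place of the pruned NeighborQuery for brevity; the function name and return value are illustrative assumptions.

```python
import math

def expand_cluster(D, p, eps, minpts):
    """BFS expansion from core point p: returns the sorted indices of the
    cluster grown from p (a simplified Expandcluster sketch)."""
    dist = lambda a, b: math.dist(D[a], D[b])
    dr_pts = [q for q in range(len(D)) if dist(p, q) <= eps]  # N_eps(p)
    visited = set()
    i = 0
    while i < len(dr_pts):              # until all points in drPts are visited
        q = dr_pts[i]
        i += 1
        if q in visited:
            continue
        visited.add(q)
        nq = [r for r in range(len(D)) if dist(q, r) <= eps]
        if len(nq) >= minpts:           # q is itself a core point
            dr_pts.extend(r for r in nq if r not in visited and r not in dr_pts)
    return sorted(set(dr_pts))
```

On a chain of points spaced 0.4 apart with one far outlier, expansion from the first point reaches the whole chain but not the outlier.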
CN202011176217.XA 2020-10-29 2020-10-29 Webpage duplicate removal method based on improved DBSCAN algorithm Pending CN112257073A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011176217.XA CN112257073A (en) 2020-10-29 2020-10-29 Webpage duplicate removal method based on improved DBSCAN algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011176217.XA CN112257073A (en) 2020-10-29 2020-10-29 Webpage duplicate removal method based on improved DBSCAN algorithm

Publications (1)

Publication Number Publication Date
CN112257073A true CN112257073A (en) 2021-01-22

Family

ID=74262750

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011176217.XA Pending CN112257073A (en) 2020-10-29 2020-10-29 Webpage duplicate removal method based on improved DBSCAN algorithm

Country Status (1)

Country Link
CN (1) CN112257073A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113779377A (en) * 2021-07-27 2021-12-10 浙江大学 Crawler searching method based on barrier-free detection result duplication removal

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104484600A (en) * 2014-11-18 2015-04-01 中国科学院深圳先进技术研究院 Intrusion detection method and device based on improved density clustering
US20200175158A1 (en) * 2018-11-29 2020-06-04 Atos Information Technology GmbH Method For Detecting Intrusions In An Audit Log
CN111291376A (en) * 2018-12-08 2020-06-16 南京慕测信息科技有限公司 Web vulnerability verification method based on crowdsourcing and machine learning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104484600A (en) * 2014-11-18 2015-04-01 中国科学院深圳先进技术研究院 Intrusion detection method and device based on improved density clustering
US20200175158A1 (en) * 2018-11-29 2020-06-04 Atos Information Technology GmbH Method For Detecting Intrusions In An Audit Log
CN111291376A (en) * 2018-12-08 2020-06-16 南京慕测信息科技有限公司 Web vulnerability verification method based on crowdsourcing and machine learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
K. SANTHISREE et al.: "Web Usage Data Clustering Using Dbscan Algorithm and Set Similarities", 2010 International Conference on Data Storage and Data Engineering *
汤盛宇: "Research on Fast Density Clustering Algorithms Based on Neighbor Search Technology" (in Chinese), China Master's Theses Full-text Database, Information Science and Technology *
胡健 et al.: "DBSCAN Clustering Algorithm Based on Adaptive Bee Colony Optimization" (in Chinese), Computer Engineering and Applications *
贾彦丰: "Research on a Deduplication Method for Web Vulnerability Detection Based on the DBSCAN Algorithm" (in Chinese), China Master's Theses Full-text Database, Information Science and Technology *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113779377A (en) * 2021-07-27 2021-12-10 浙江大学 Crawler searching method based on barrier-free detection result duplication removal
CN113779377B (en) * 2021-07-27 2024-03-22 浙江大学 Crawler searching method based on barrier-free detection result deduplication

Similar Documents

Publication Publication Date Title
Yang et al. Detecting malicious URLs via a keyword-based convolutional gated-recurrent-unit neural network
CN106528599B (en) A kind of character string Fast Fuzzy matching algorithm in magnanimity audio data
US7433869B2 (en) Method and apparatus for document clustering and document sketching
US8595204B2 (en) Spam score propagation for web spam detection
US8010614B1 (en) Systems and methods for generating signatures for electronic communication classification
CN112256939B (en) Text entity relation extraction method for chemical field
CN111767725B (en) Data processing method and device based on emotion polarity analysis model
CN108647322B (en) Method for identifying similarity of mass Web text information based on word network
CN110543595B (en) In-station searching system and method
CN108717408A (en) A kind of sensitive word method for real-time monitoring, electronic equipment, storage medium and system
CN110633365A (en) Word vector-based hierarchical multi-label text classification method and system
KR20190135129A (en) Apparatus and Method for Documents Classification Using Documents Organization and Deep Learning
CN113962293B (en) LightGBM classification and representation learning-based name disambiguation method and system
CN108763348A (en) A kind of classification improved method of extension short text word feature vector
CN107291895B (en) Quick hierarchical document query method
CN112100372B (en) Head news prediction classification method
Zhang et al. Unsupervised entity resolution with blocking and graph algorithms
CN104572720B (en) A kind of method, apparatus and computer readable storage medium of webpage information re-scheduling
Wilkins et al. Comparison of five clustering algorithms to classify phytoplankton from flow cytometry data
CN110851733A (en) Community discovery and emotion interpretation method based on network topology and document content
CN112257073A (en) Webpage duplicate removal method based on improved DBSCAN algorithm
CN117155701A (en) Network flow intrusion detection method
Zhu et al. PDHF: Effective phishing detection model combining optimal artificial and automatic deep features
CN114943285B (en) Intelligent auditing system for internet news content data
Chen et al. Community Detection Based on DeepWalk Model in Large‐Scale Networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210122
