CN115766255A - Crawler behavior detection method, device, equipment and storage medium

Info

Publication number: CN115766255A
Application number: CN202211463745.2A
Authority: CN
Inventors: 王梦婷
Original assignee: Zhejiang Nuonuo Network Technology Co ltd
Current assignee: Zhejiang Nuonuo Network Technology Co ltd
Priority date: 2022-11-18
Filing date: 2022-11-18
Publication date: 2023-03-07

Abstract

The application discloses a crawler behavior detection method, device, equipment and storage medium, and relates to the technical field of data leakage prevention. The method comprises the following steps: acquiring original flow data, and preprocessing the original flow data to obtain a key information tuple; based on the key information tuple, determining the dense feature vector of each IP address in the original flow data by using a Count Sketch algorithm and a principal component analysis method; constructing a first mixed integer linear programming model based on the space dimension, and determining a plurality of suspected crawler IPs corresponding to the dense feature vectors by using the first mixed integer linear programming model to obtain a suspected crawler IP set; and constructing a hidden Markov model for the suspected crawler IP set based on the time dimension, so as to determine the crawler IPs in the suspected crawler IP set by using the hidden Markov model. The technical scheme of this application can improve the accuracy and real-time performance of crawler detection.

Description

Crawler behavior detection method, device, equipment and storage medium

Technical Field

The invention relates to the technical field of data leakage prevention, in particular to a crawler behavior detection method, a crawler behavior detection device, crawler behavior detection equipment and a storage medium.

Background

Data leakage prevention is a difficult problem that every enterprise faces in operation management. Once important information is leaked, it causes huge losses, damages the image of the enterprise, affects its stable operation, harms its customers, and can ultimately lead to security incidents such as fraud, extortion and privacy leakage. According to the leakage path and the leakage source, data leakage prevention can be divided into network-side, terminal-side and application-side leakage prevention. Enterprises are gradually building data leakage prevention capability, but strong capability is still lacking on the application side, where the leakage modes include credential stuffing (database collision), data crawling, account brute-forcing (account blasting) and the like. According to statistics, web crawlers account for over 90% of such data leakage, so web crawlers occupy a dominant position in application-side data leakage.

Conventional anti-crawler techniques include User-Agent (UA) request control, session-based access restriction, spider traps, verification-code checks, dynamic data loading, IP (Internet Protocol) restriction, parameter encryption, and the like. However, with these techniques a crawler program can still break through the barriers and obtain resources, for example by forging the header information of a legitimate program or by polling the target website through multiple sessions; this increases the resource overhead of the website and greatly degrades the experience of normal users. The conventional methods also suffer from poor real-time performance, poor practical deployability, and an inability to process large-scale data. Moreover, most of them judge the access object in only the space dimension or only the time dimension, so the detection accuracy is low and the misjudgment rate is high.

In summary, how to improve the accuracy and real-time performance of web crawler detection is a problem to be solved at present.

Disclosure of Invention

In view of this, the present invention provides a method, an apparatus, a device and a storage medium for detecting a crawler behavior, which can improve the accuracy and the real-time performance of the detection when performing web crawler detection. The specific scheme is as follows:

In a first aspect, the application discloses a crawler behavior detection method, including:

acquiring original flow data, and preprocessing the original flow data to obtain a key information tuple;

based on the key information tuple, determining the dense feature vector of each IP address in the original flow data by using a Count Sketch algorithm and a principal component analysis method;

constructing a first mixed integer linear programming model based on the space dimension, and determining a plurality of suspected crawler IPs corresponding to the dense feature vector by using the first mixed integer linear programming model to obtain a suspected crawler IP set;

and constructing a hidden Markov model for the suspected crawler IP set based on a time dimension so as to determine crawler IP in the suspected crawler IP set group by using the hidden Markov model.

Optionally, the obtaining the original traffic data and preprocessing the original traffic data to obtain a key information tuple includes:

acquiring original flow data, and performing data cleaning and de-duplication on the original flow data to obtain intermediate processing data;

and extracting key information in the intermediate processing data, and storing the key information in a tuple manner to obtain a key information tuple.

Optionally, the determining, based on the key information tuple, the dense feature vector of each IP address in the original traffic data by using a Count Sketch algorithm and a principal component analysis method includes:

constructing a feature extraction model based on a Count Sketch algorithm, and inputting the key information tuple into the feature extraction model to obtain a daily statistical result;

acquiring a corresponding continuous first preset number of daily statistical results for each IP address in the original flow data, and determining a mean value result, a variance result, a minimum value result and a maximum value result of each IP address by using the continuous first preset number of daily statistical results;

performing feature fusion on the mean result, the variance result, the minimum result and the maximum result in a splicing manner to obtain a sparse feature vector of each IP address;

and constructing a feature dimension reduction model based on a principal component analysis method, inputting the sparse feature vectors into the feature dimension reduction model, and acquiring the dense feature vectors of each IP address in an information aggregation mode.

Optionally, the constructing a first mixed integer linear programming model based on the spatial dimension, and determining a plurality of suspected crawler IPs corresponding to the dense feature vector by using the first mixed integer linear programming model to obtain a suspected crawler IP set includes:

constructing a decision variable group based on the dense feature vectors; the decision variable group comprises a first decision variable used for judging whether the IP address in the original flow data belongs to the suspected crawler IP, a second decision variable used for judging whether the IP address at least belongs to one suspected crawler IP in the suspected crawler IP set, a third decision variable used for judging whether the IP address belongs to a plurality of suspected crawler IPs in the suspected crawler IP set, and a fourth decision variable used for representing a second preset number of IP addresses in the suspected crawler IP set;

constructing an objective function based on the dense feature vectors; the objective function is

$$\min \sum_{i=1}^{N_p} t_i, \qquad t_i = N_i \times \mathrm{var}(p_i)$$

where $N_p$ is the preset number of suspected crawler IP sets, $t_i$ is the fourth decision variable, $N_i$ is the second preset number, and $\mathrm{var}(p_i)$ is the variance of the first $N_i$ IP addresses in set $p_i$;

constructing a constraint condition set based on the dense feature vectors; the constraint condition set comprises constraint conditions used for constraining similarity and adjacent relation between the suspected crawler IPs;

and solving the decision variable group, the objective function and the constraint condition set through preset solver software, and determining a plurality of suspected crawler IPs corresponding to the dense feature vectors to obtain a suspected crawler IP set.

Optionally, the constructing a first mixed integer linear programming model based on the spatial dimension, and determining a plurality of suspected crawler IPs in the dense feature vector by using the first mixed integer linear programming model to obtain a suspected crawler IP set further includes:

performing secondary mixed integer linear programming modeling based on the spatial dimension to obtain a second mixed integer linear programming model;

and reallocating the IP addresses corresponding to the dense feature vectors by using the second mixed integer linear programming model to obtain an optimized target suspected crawler IP set.

Optionally, the constructing a hidden markov model for the suspected crawler IP set based on a time dimension to determine crawler IPs in the suspected crawler IP set group by using the hidden markov model includes:

determining three elements of a model for constructing the hidden Markov model through a first preset algorithm;

and determining the crawler IP in the suspected crawler IP cluster group according to the three elements of the model through a second preset algorithm.

Optionally, the determining, by a first preset algorithm, three elements of a model used for building the hidden markov model includes:

determining three elements of a model for constructing the hidden Markov model through a Baum-Welch algorithm;

and, the determining, by a second preset algorithm, the crawler IP in the suspected crawler IP cluster according to the three elements of the model includes:

and determining the crawler IP in the suspected crawler IP cluster group according to the three elements of the model through a Viterbi algorithm.

In a second aspect, the application discloses a crawler behavior detection device, including:

the original flow data acquisition module is used for acquiring original flow data and preprocessing the original flow data to obtain a key information tuple;

the feature extraction module is used for determining dense feature vectors of each IP address in the original flow data by using a Count Sketch algorithm and a principal component analysis method based on the key information tuples;

the space dimension modeling module is used for constructing a first mixed integer linear programming model based on space dimensions, and determining a plurality of suspected crawler IPs corresponding to the dense feature vectors by using the first mixed integer linear programming model to obtain a suspected crawler IP set;

and the time dimension building module is used for building a hidden Markov model aiming at the suspected crawler IP set based on the time dimension so as to determine crawler IP in the suspected crawler IP set group by using the hidden Markov model.

In a third aspect, the present application discloses an electronic device comprising a processor and a memory; wherein the memory is used for storing a computer program which is loaded and executed by the processor to implement the crawler behavior detection method as described above.

In a fourth aspect, the present application discloses a computer readable storage medium for storing a computer program; wherein the computer program when executed by a processor implements the crawler behavior detection method as described above.

The application discloses a crawler behavior detection method, which comprises: first, acquiring original flow data and preprocessing it to obtain a key information tuple; then, based on the key information tuple, determining the dense feature vector of each IP address in the original flow data by using a Count Sketch algorithm and a principal component analysis method; next, constructing a first mixed integer linear programming model based on the space dimension, and determining a plurality of suspected crawler IPs corresponding to the dense feature vectors by using the first mixed integer linear programming model to obtain a suspected crawler IP set; and finally, constructing a hidden Markov model for the suspected crawler IP set based on the time dimension, so as to determine the crawler IPs in the suspected crawler IP set by using the hidden Markov model. The Count Sketch algorithm thus provides strong support for big data processing, and the feature extraction method based on the Count Sketch algorithm and principal component analysis yields a dense feature vector for the access traffic of each IP, so that deeper and more comprehensive access information is extracted. In addition, based on a stacking ensemble strategy, the multi-dimensional information of IP access is modeled and two rounds of analysis and judgment are carried out: first, space dimension modeling is performed through the mixed integer linear programming model to filter out the suspected crawler IP set from the dense feature vectors, which addresses the problems of large access traffic and unlabeled data; then, time dimension modeling is performed through the hidden Markov model to determine the crawler IPs in the suspected crawler IP set, which addresses the low accuracy and frequent misjudgment of conventional crawler detection. Compared with detecting crawler IPs with a single base model, the method has higher detection accuracy and real-time performance.

In addition, the crawler behavior detection device, the crawler behavior detection equipment and the crawler behavior detection storage medium correspond to the crawler behavior detection method, and the effects are the same as those of the crawler behavior detection method.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the prior art descriptions will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

FIG. 1 is a flow chart of a crawler behavior detection method disclosed in the present application;

FIG. 2 is a schematic diagram of a feature extraction structure based on Count Sketch and PCA disclosed in the present application;

FIG. 3 is a schematic diagram of crawler behavior detection based on multi-dimensional access information disclosed in the present application;

FIG. 4 is a flow chart of a particular crawler behavior detection method disclosed herein;

FIG. 5 is a schematic diagram illustrating modeling of spatiotemporal access patterns based on MILP and HMM disclosed herein;

FIG. 6 is a schematic structural diagram of a crawler behavior detection apparatus disclosed in the present application;

FIG. 7 is a block diagram of an electronic device disclosed in the present application.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.

At present, conventional anti-crawler techniques have many problems. During crawler detection, a crawler program can still break through the barriers and obtain resources, for example by forging the header information of a legitimate program or by polling the target website through multiple sessions; this increases the resource overhead of the website and greatly degrades the experience of normal users. The conventional methods also suffer from poor real-time performance, poor practical deployability, and an inability to process large-scale data. Moreover, most of them judge the access object in only the space dimension or only the time dimension, so the detection accuracy is low and the misjudgment rate is high.

Therefore, the crawler behavior detection scheme is provided, and the accuracy and the real-time performance of detection can be improved when the web crawler is detected.

The embodiment of the invention discloses a crawler behavior detection method, which is shown in figure 1 and comprises the following steps:

step S11: acquiring original flow data, and preprocessing the original flow data to obtain a key information tuple.

In the embodiment of the application, original IP access flow data is prepared and preprocessed to construct a standard data structure. Specifically, original flow data are obtained, and data cleaning and duplicate removal are carried out on the original flow data to obtain intermediate processing data; and extracting key information in the intermediate processing data, and storing the key information in a tuple manner to obtain a key information tuple.

It can be understood that after the original traffic data is acquired, the original traffic data is subjected to processing operations such as cleaning and deduplication, the original traffic data after being processed by the cleaning and deduplication operations is used as intermediate processing data, and key information is extracted from the intermediate processing data, where the extracted key information includes, but is not limited to, information such as a source address, a destination address, a source port, a destination port, a protocol, traffic, and time, and the extracted key information is represented by an N-tuple as a standard data structure, that is, a key information tuple.
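The preprocessing in step S11 can be sketched as follows. This is a minimal illustration, not the patented implementation: the record layout (a list of dicts), the field names such as `src_ip` and `bytes`, and the `FlowTuple` type are assumptions introduced here; the patent only states that the key information includes source and destination address, source and destination port, protocol, traffic and time.

```python
from typing import NamedTuple


class FlowTuple(NamedTuple):
    """Hypothetical key information tuple (field names are assumptions)."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str
    n_bytes: int
    timestamp: int


def preprocess(raw_records):
    """Clean, de-duplicate, and convert raw records into key information tuples."""
    seen = set()
    tuples = []
    for rec in raw_records:
        try:
            item = FlowTuple(
                src_ip=rec["src_ip"].strip(),
                dst_ip=rec["dst_ip"].strip(),
                src_port=int(rec["src_port"]),
                dst_port=int(rec["dst_port"]),
                protocol=rec["protocol"].strip().upper(),
                n_bytes=int(rec["bytes"]),
                timestamp=int(rec["timestamp"]),
            )
        except (KeyError, ValueError, AttributeError, TypeError):
            continue  # cleaning: drop malformed or incomplete records
        if item in seen:
            continue  # de-duplication: keep the first occurrence only
        seen.add(item)
        tuples.append(item)
    return tuples
```

Using an immutable tuple type makes each record hashable, so exact duplicates can be detected with a set in a single pass.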

Step S12: and based on the key information tuple, determining the dense feature vector of each IP address in the original flow data by using a Count Sketch algorithm and a principal component analysis method.

Because conventional methods have poor real-time performance and poor practical deployability during crawler detection and cannot process large-scale data, in the embodiment of the application data cleaning and feature extraction are first performed by using the Count Sketch algorithm, which provides strong support for big data processing. Specifically, a feature extraction model is constructed based on the Count Sketch algorithm, and the key information tuple is input into the feature extraction model to obtain a daily statistical result.

In the embodiment of the present application, the key information tuple obtained in step S11 is input to a feature extraction model constructed based on a Count Sketch algorithm, so as to obtain a daily statistical result Sketch. It should be noted that the daily statistic Sketch is a two-dimensional array of w columns and d rows, where the parameters w and d are hyper-parameters that are empirically determined in advance and are related to the final query error rate. In addition, a global statistical result may also be obtained through the feature extraction model, and a daily statistical result is used in this embodiment.
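For readers unfamiliar with the data structure, a generic Count Sketch with `d` rows and `w` columns can be sketched as below. The hashing scheme (BLAKE2b over a row-prefixed key) and the choice of key string are assumptions made for illustration; the patent does not specify how keys are hashed or which tuple fields form the key.

```python
import hashlib
from statistics import median


class CountSketch:
    """Generic Count Sketch: a d x w table of signed counters."""

    def __init__(self, w, d):
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]

    def _bucket_and_sign(self, key, row):
        # One deterministic hash per row; the low bit gives the sign,
        # the remaining bits give the column (an illustrative choice).
        digest = hashlib.blake2b(f"{row}:{key}".encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        return (h >> 1) % self.w, (1 if h & 1 else -1)

    def update(self, key, value=1):
        """Add `value` occurrences of `key` to every row."""
        for row in range(self.d):
            bucket, sign = self._bucket_and_sign(key, row)
            self.table[row][bucket] += sign * value

    def estimate(self, key):
        """Median of the per-row signed counters estimates the count of `key`."""
        estimates = []
        for row in range(self.d):
            bucket, sign = self._bucket_and_sign(key, row)
            estimates.append(sign * self.table[row][bucket])
        return median(estimates)
```

Larger `w` reduces collisions within a row and larger `d` makes the median more robust, which is why both hyper-parameters affect the query error rate.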

Further, acquiring a corresponding continuous first preset number of daily statistical results for each IP address in the original traffic data, and determining a mean result, a variance result, a minimum result and a maximum result of each IP address by using the continuous first preset number of daily statistical results; performing feature fusion on the mean result, the variance result, the minimum result and the maximum result in a splicing manner to obtain a sparse feature vector of each IP address; and constructing a feature dimension reduction model based on a principal component analysis method, inputting the sparse feature vector into the feature dimension reduction model, and acquiring the dense feature vector of each IP address in an information aggregation mode.

In this embodiment of the present application, for each IP address, a first preset number of consecutive daily statistical results are taken from the obtained daily Sketch information; for example, the sketches of five consecutive days are obtained, that is, five two-dimensional arrays. The mean, variance, minimum and maximum are then computed point by point across these arrays, generating 4 new sketches; the 4 new sketches (mean result, variance result, minimum result and maximum result) are fused by concatenation to obtain the sparse feature vector of each IP address. Finally, a feature dimension reduction model is constructed based on Principal Component Analysis (PCA), the sparse feature vector of each IP address is input into it, and the dense feature vector corresponding to each IP address is obtained through information aggregation.
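The fusion and dimension reduction described above can be sketched in pure Python. Several choices here are assumptions: population variance is used (the patent does not say which variance), and the PCA step is illustrated with power iteration that extracts only the first principal component, whereas a practical system would likely keep several components using a numerical library.

```python
from statistics import mean, pvariance


def fuse_sketches(daily_sketches):
    """Point-wise mean/variance/min/max over daily d x w sketches, concatenated."""
    d, w = len(daily_sketches[0]), len(daily_sketches[0][0])
    features = []
    for stat in (mean, pvariance, min, max):
        for r in range(d):
            for c in range(w):
                cell_over_days = [sketch[r][c] for sketch in daily_sketches]
                features.append(float(stat(cell_over_days)))
    return features  # sparse feature vector of length 4 * d * w


def first_principal_component(rows, iters=200):
    """Power iteration on X^T X of the centered data; returns (direction, scores)."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]
    v = [1.0] * m
    for _ in range(iters):
        proj = [sum(c[j] * v[j] for j in range(m)) for c in centered]
        w_vec = [sum(centered[i][j] * proj[i] for i in range(n)) for j in range(m)]
        norm = sum(x * x for x in w_vec) ** 0.5
        if norm == 0:
            break  # degenerate data: all rows identical
        v = [x / norm for x in w_vec]
    scores = [sum(c[j] * v[j] for j in range(m)) for c in centered]
    return v, scores
```

Each IP contributes one row (its fused vector) to `rows`; the returned `scores` are the one-dimensional dense representation under these assumptions.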

FIG. 2 is a schematic diagram of the feature extraction structure based on Count Sketch and PCA. The acquired original data is preprocessed into a fixed structure; sparse feature vectors are then obtained through Count Sketch feature extraction, and PCA dimension reduction yields the dense feature vector of the access traffic of each IP. Experiments show that this method can extract deeper and more comprehensive access information.

Step S13: and constructing a first mixed integer linear programming model based on the space dimension, and determining a plurality of suspected crawler IPs corresponding to the dense feature vector by using the first mixed integer linear programming model to obtain a suspected crawler IP set.

In the embodiment of the application, because judging the access object with a single model in only the space dimension or only the time dimension gives low accuracy and a high misjudgment rate, two stages of detection models are fused based on a stacking ensemble strategy; the resulting crawler behavior detection model has higher detection accuracy and real-time performance than a single model.

In the embodiment of the present application, a Mixed Integer Linear Programming (MILP) is first used to perform spatial dimension modeling on all visited IPs, so as to obtain a batch of suspected crawler IP sets. Namely, a primary crawler IP filter based on MILP is constructed, and a suspected crawler IP set is obtained by modeling the spatial dimension of IP access. The number of the suspected crawler sets can be multiple, and is generally set in advance, and the default number is 3, that is, a high suspected crawler IP set, a medium suspected crawler IP set, and a light suspected crawler IP set.

It can be understood that, when the suspected crawler IP sets are determined through MILP modeling, the distance settings may leave some suspected crawler IPs unassigned to any set, or cause some IP addresses to be assigned to several suspected crawler IP sets. Therefore, in the embodiment of the present application, the MILP approach is applied a second time for modeling and optimization, and the IP addresses are reallocated to obtain the final filtering result, that is, the optimized target suspected crawler IP set.

Step S14: and constructing a hidden Markov model for the suspected crawler IP set based on a time dimension so as to determine crawler IP in the suspected crawler IP set group by using the hidden Markov model.

In the embodiment of the application, after spatial dimension modeling is performed on all accessing IPs through MILP to obtain a batch of suspected crawler IP sets, time dimension modeling is performed on the IPs in the sets through a Hidden Markov Model (HMM) to obtain the detection results. In addition, ablation and comparison experiments show that the model has higher detection accuracy and real-time performance than a single model.
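The decoding step can be illustrated with a standard Viterbi implementation. The hidden states (`"normal"`, `"crawler"`), the observation symbols (`"low"`, `"high"`, standing for a discretized request rate) and all probabilities below are hypothetical; the patent names the Viterbi algorithm and the three model elements (initial distribution, transition matrix, emission matrix) but does not specify their contents. In practice the three elements would come from Baum-Welch training rather than being written by hand.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path for the observation sequence."""
    # best[t][s]: probability of the best path ending in state s at time t
    best = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, p = max(
                ((r, best[t - 1][r] * trans_p[r][s]) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = p * emit_p[s][obs[t]]
            back[t][s] = prev
    # Backtrack from the most probable final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Hypothetical model elements (illustrative values only).
STATES = ("normal", "crawler")
START = {"normal": 0.8, "crawler": 0.2}
TRANS = {"normal": {"normal": 0.9, "crawler": 0.1},
         "crawler": {"normal": 0.1, "crawler": 0.9}}
EMIT = {"normal": {"low": 0.9, "high": 0.1},
        "crawler": {"low": 0.2, "high": 0.8}}
```

With these values, a run of `"high"` observations decodes to all `"crawler"` and a run of `"low"` observations decodes to all `"normal"`; for long sequences, log-probabilities would avoid numerical underflow.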

Fig. 3 is a schematic diagram illustrating crawler behavior detection based on multidimensional access information in the present invention. Firstly, preparing original IP access flow data for preprocessing, and constructing a standard data structure; secondly, a feature extraction module based on Count Sketch is constructed, a flow characterization vector is obtained through PCA feature dimension reduction, and a dense feature vector of IP access is obtained; then constructing a primary crawler IP filter based on MILP, and modeling the spatial dimension of IP access to obtain a suspected crawler IP set; and then, a secondary crawler IP (Internet protocol) determiner based on an HMM (hidden Markov model) is constructed, and the crawler IP in the set is determined by modeling the time dimension of IP access. Therefore, the two-level models are fused based on the strategy of stacking integration, and the final crawler behavior detection model is obtained to determine the crawler IP.

The application discloses a crawler behavior detection method, which comprises: first, acquiring original flow data and preprocessing it to obtain a key information tuple; then, based on the key information tuple, determining the dense feature vector of each IP address in the original flow data by using a Count Sketch algorithm and a principal component analysis method; next, constructing a first mixed integer linear programming model based on the space dimension, and determining a plurality of suspected crawler IPs corresponding to the dense feature vectors by using the first mixed integer linear programming model to obtain a suspected crawler IP set; and finally, constructing a hidden Markov model for the suspected crawler IP set based on the time dimension, so as to determine the crawler IPs in the suspected crawler IP set by using the hidden Markov model. The Count Sketch algorithm thus provides strong support for big data processing, and the feature extraction method based on the Count Sketch algorithm and principal component analysis yields a dense feature vector for the access traffic of each IP, so that deeper and more comprehensive access information is extracted. In addition, based on a stacking ensemble strategy, the multi-dimensional information of IP access is modeled and two rounds of analysis and judgment are carried out: first, space dimension modeling is performed through the mixed integer linear programming model to filter out the suspected crawler IP set from the dense feature vectors, which addresses the problems of large access traffic and unlabeled data; then, time dimension modeling is performed through the hidden Markov model to determine the crawler IPs in the suspected crawler IP set, which addresses the low accuracy and frequent misjudgment of conventional crawler detection. Compared with detecting crawler IPs with a single base model, the method has higher detection accuracy and real-time performance.

The embodiment of the application discloses a specific crawler behavior detection method, and as shown in fig. 4, the method comprises the following steps:

step S21: acquiring original flow data, and preprocessing the original flow data to obtain a key information tuple.

Step S22: based on the key information tuple, determining the dense feature vector of each IP address in the original traffic data by using a Count Sketch algorithm and a principal component analysis method.

For more specific processing procedures of the step S21 and the step S22, reference may be made to corresponding contents disclosed in the foregoing embodiments, and details are not repeated here.

Step S23: constructing a decision variable group, an objective function and a constraint condition set based on the dense feature vectors; and solving the decision variable group, the objective function and the constraint condition set through preset solver software, and determining a plurality of suspected crawler IPs corresponding to the dense feature vectors to obtain a suspected crawler IP set.

In the embodiment of the application, modeling is performed twice based on the MILP approach, and each MILP model comprises decision variables, an objective function and constraint conditions. In the first MILP modeling pass, a decision variable group, an objective function and a constraint condition set are constructed based on the dense feature vectors.

In the embodiment of the application, the decision variable group contains four decision variables: (1) a first decision variable used for judging whether an IP address in the original traffic data belongs to a suspected crawler IP; for example, $x_{ij}$ indicates whether the j-th IP address belongs to the i-th suspected crawler IP in the suspected crawler IP set. (2) A second decision variable used for judging whether the IP address belongs to at least one suspected crawler IP in the suspected crawler IP set; for example, $x'_j$ indicates whether the j-th IP address belongs to at least one suspected crawler IP in the suspected crawler IP set. (3) A third decision variable used for judging whether the IP address belongs to a plurality of suspected crawler IPs in the suspected crawler IP set; for example, $x''_j$ indicates whether the j-th IP address belongs to a plurality of suspected crawler IPs in the suspected crawler IP set. (4) A fourth decision variable used for characterizing the first $N_i$ IP addresses (the second preset number) in the suspected crawler IP set; for example, $t_i$ represents the weighted variance of the first $N_i$ IP addresses in the i-th suspected crawler IP set.

In the embodiment of the present application, there may be multiple suspected crawler IP sets; their number is generally set in advance and defaults to 3, namely a high-suspicion crawler IP set, a medium-suspicion crawler IP set and a low-suspicion crawler IP set. The objective function minimizes the sum of the weighted variances of the feature vectors of the suspected crawler IP sets, and the specific formula is as follows:

$$\min \sum_{i=1}^{N_p} t_i$$

wherein $N_p$ is the preset number of suspected crawler IP sets; $t_i$ is the fourth decision variable, i.e., the weighted variance of the first $N_i$ IP addresses in the i-th suspected crawler IP set, $t_i = N_i \times \mathrm{var}(p_i)$; $N_i$ is the second preset number; and $\mathrm{var}(p_i)$ is the variance of the first $N_i$ IP addresses in set $p_i$.
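To make the objective concrete, the following sketch evaluates the sum of the weighted variances for a given candidate grouping. Two points are assumptions, since the patent does not define them: the variance of a set of feature vectors is taken as the sum of per-dimension population variances, and the "first $N_i$" addresses are simply the first $N_i$ entries of each list. The function only evaluates the objective; it does not solve the MILP.

```python
from statistics import pvariance


def weighted_variance(vectors, n_i):
    """t_i = N_i * var(p_i) over the first n_i feature vectors of one set."""
    top = vectors[:n_i]
    # Assumed definition: total variance = sum of per-dimension variances.
    var = sum(pvariance([v[k] for v in top]) for k in range(len(top[0])))
    return n_i * var


def objective(sets, sizes):
    """Sum of t_i over all suspected crawler IP sets (the quantity minimized)."""
    return sum(weighted_variance(p, n) for p, n in zip(sets, sizes))
```

A tight group of near-identical feature vectors contributes almost nothing to the objective, so minimizing it favors sets whose members behave alike.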

In the embodiment of the application, a constraint condition set is constructed based on the dense feature vectors; the constraint condition set comprises constraint conditions used for constraining similarity and adjacent relation between the suspected crawler IPs; therefore, the constraint of the initial MILP modeling needs to ensure that the selected suspected crawler IPs have high similarity and have a neighboring relationship, and the specific formula is as follows:

∑_{j∈ζ} x′_j ≥ a′ × N;

∑_{j∈ζ} x″_j ≤ a″ × N;

x_ij ∈ {0, 1}, x′_j ∈ {0, 1}, x″_j ∈ {0, 1}, t_i ∈ R;

wherein x_ij indicates whether the j-th IP address belongs to the i-th suspected crawler IP set; R_i(j) represents the j-th feature in the feature vector of the i-th IP address; N represents the number of IP addresses; x′_j indicates whether the j-th IP address belongs to at least one suspected crawler IP set; ζ represents the set of all IP addresses accessing the network; a′ represents the proportion of IP addresses allocated to at least one suspected crawler IP set among all IP addresses, and its value is generally 0.4; x″_j indicates whether the j-th IP address belongs to a plurality of suspected crawler IP sets; a″ represents the proportion of IP addresses allocated to a plurality of suspected crawler IP sets among all IP addresses, and its value is generally 0.1; N_min represents the minimum number of IP addresses contained in each suspected crawler IP set;

(j) denotes the position number of the j-th feature in the i-th IP address; var(P_{i,j}) represents the combined feature variance of the first j features of the i-th IP address; N_p is the number of suspected crawler IP sets.
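The two coverage constraints above can be illustrated with a small feasibility check. This is a minimal sketch, not the MILP itself: the function name `coverage_constraints_hold` and the nested-list layout of `x` are assumptions made for illustration, and x′_j and x″_j are derived directly from x_ij here, whereas in a real MILP they would be decision variables tied to x_ij by additional linear constraints.

```python
from typing import List


def coverage_constraints_hold(
    x: List[List[int]],
    a_prime: float = 0.4,         # a': minimum share of IPs in at least one set
    a_double_prime: float = 0.1,  # a'': maximum share of IPs in several sets
) -> bool:
    """Check sum(x'_j) >= a' * N and sum(x''_j) <= a'' * N.

    x[i][j] is 1 if the j-th IP address is assigned to the i-th
    suspected crawler IP set, else 0 (hypothetical layout).
    """
    n_ips = len(x[0])
    # Number of sets each IP address is assigned to.
    memberships = [sum(row[j] for row in x) for j in range(n_ips)]
    # x'_j: the IP belongs to at least one set.
    x_prime = [1 if m >= 1 else 0 for m in memberships]
    # x''_j: the IP belongs to more than one set.
    x_double_prime = [1 if m >= 2 else 0 for m in memberships]
    enough_covered = sum(x_prime) >= a_prime * n_ips
    few_overlaps = sum(x_double_prime) <= a_double_prime * n_ips
    return enough_covered and few_overlaps


def empty_assignment(n_sets: int, n_ips: int) -> List[List[int]]:
    """Helper for the examples: an assignment matrix with nothing assigned."""
    return [[0] * n_ips for _ in range(n_sets)]
```

With 10 IP addresses and the default ratios, at least 4 addresses must be assigned to some set and at most 1 address may appear in several sets.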

With the decision variables, objective function and constraint conditions of the primary MILP modeling in place, existing solver software is used for the optimization solution, yielding a preliminary suspected crawler IP set filtering result. Further, based on the preliminary filtering result, secondary modeling and optimization are carried out following the MILP approach, and the IP addresses that were allocated to a plurality of suspected crawler IP sets are reallocated, so that the optimized target suspected crawler IP sets are obtained as the final filtering result.

Step S24: determining, through a first preset algorithm, the three model elements for constructing the hidden Markov model; and determining, through a second preset algorithm, the crawler IPs in the suspected crawler IP set according to the three model elements.

In the embodiment of the application, a second-level crawler IP determiner based on a hidden Markov model is constructed, and time dimension modeling is performed on each IP in the suspected crawler IP set obtained in step S23, so that the crawler IPs in the set are finally determined. Specifically, the hidden Markov model-based modeling process first determines the three model elements through the Baum-Welch algorithm: λ = (A, B, π). A represents the state transition probability matrix for being judged a suspected crawler IP; B represents the observation probability matrix, whose elements are the probabilities of the directly observed access modes; π represents the initial state distribution for being judged a suspected crawler IP. Finally, the Viterbi algorithm is applied to the three model elements to obtain the final prediction result, namely whether the IP address is a crawler IP.
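The decoding step can be sketched as follows. This is a minimal illustration under stated assumptions: the hidden states ("normal", "crawler"), the observation symbols ("slow", "burst") and the numeric values of A, B and π are all hypothetical; in the described method the parameters would come from Baum-Welch training, which is not shown here.

```python
from typing import Dict, List, Sequence, Tuple

Prob = Dict[str, float]


def viterbi(
    obs: Sequence[str],
    states: Sequence[str],
    pi: Prob,
    A: Dict[str, Prob],
    B: Dict[str, Prob],
) -> Tuple[List[str], float]:
    """Return the most likely hidden-state path and its probability."""
    # delta[s]: probability of the best path that ends in state s so far.
    delta = {s: pi[s] * B[s][obs[0]] for s in states}
    backpointers: List[Dict[str, str]] = []
    for symbol in obs[1:]:
        prev = delta
        delta = {}
        pointer = {}
        for s in states:
            # Best predecessor state for reaching s at this step.
            best = max(states, key=lambda r: prev[r] * A[r][s])
            delta[s] = prev[best] * A[best][s] * B[s][symbol]
            pointer[s] = best
        backpointers.append(pointer)
    # Trace the best path backwards from the most likely final state.
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return path, delta[last]


# Hypothetical model elements (A, B, pi); illustrative values only.
STATES = ("normal", "crawler")
PI = {"normal": 0.8, "crawler": 0.2}
A = {
    "normal": {"normal": 0.95, "crawler": 0.05},
    "crawler": {"normal": 0.2, "crawler": 0.8},
}
B = {
    "normal": {"slow": 0.8, "burst": 0.2},
    "crawler": {"slow": 0.1, "burst": 0.9},
}

path, prob = viterbi(["burst", "burst", "burst"], STATES, PI, A, B)
```

For long observation sequences, the products of probabilities shrink quickly, so a practical version would usually work with log-probabilities instead.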

FIG. 5 is a diagram illustrating the modeling of spatiotemporal access patterns based on MILP and HMM. The dense feature vectors obtained with the Count Sketch algorithm and the principal component analysis method are input into the MILP-based model, which acts as the primary crawler IP filter; secondary modeling is performed in this process to obtain the suspected crawler IP set. An HMM is then constructed for the suspected crawler IP set to form the secondary crawler IP determiner, which judges the suspected crawler IP set to obtain the final crawler IPs.

The application discloses a crawler behavior detection method, which comprises the steps of first obtaining original traffic data, and preprocessing the original traffic data to obtain a key information tuple; then, based on the key information tuple, determining the dense feature vector of each IP address in the original traffic data by using a Count Sketch algorithm and a principal component analysis method; next, constructing a decision variable group, an objective function and a constraint condition set based on the dense feature vectors; solving the decision variable group, the objective function and the constraint condition set through preset solver software, and determining a plurality of suspected crawler IPs corresponding to the dense feature vectors to obtain a suspected crawler IP set; finally, determining the three model elements for constructing the hidden Markov model through a first preset algorithm, and determining the crawler IPs in the suspected crawler IP set according to the three model elements through a second preset algorithm. The Count Sketch algorithm thus provides strong support for big data processing, and the feature extraction method based on the Count Sketch algorithm and the principal component analysis method can provide a dense feature vector for the access traffic of each IP, so that deeper and more comprehensive access information is extracted.
In addition, a stacking approach integrates the multi-dimensional information of IP access and performs two rounds of analysis and judgment: first, spatial dimension modeling is performed through a mixed integer linear programming model to filter a suspected crawler IP set out of the dense feature vectors, which addresses the problems of large access traffic and unlabeled data; then, time dimension modeling is performed through a hidden Markov model to determine the crawler IPs in the suspected crawler IP set, which addresses the low accuracy and frequent misjudgment of crawler detection in traditional methods. Compared with detecting crawler IPs with a single basic model, the method has higher detection accuracy and better real-time performance.

Correspondingly, the embodiment of the present application further discloses a crawler behavior detection device. As shown in FIG. 6, the device includes:

an original traffic data acquisition module 11, configured to acquire original traffic data and preprocess the original traffic data to obtain a key information tuple;

a feature extraction module 12, configured to determine, based on the key information tuple, a dense feature vector of each IP address in the original traffic data by using a Count Sketch algorithm and a principal component analysis method;

a spatial dimension modeling module 13, configured to construct a first mixed integer linear programming model based on the spatial dimension, and determine a plurality of suspected crawler IPs corresponding to the dense feature vectors by using the first mixed integer linear programming model, so as to obtain a suspected crawler IP set;

and a time dimension modeling module 14, configured to construct a hidden Markov model for the suspected crawler IP set based on the time dimension, so as to determine the crawler IPs in the suspected crawler IP set by using the hidden Markov model.

For more specific working processes of the modules, reference may be made to corresponding contents disclosed in the foregoing embodiments, and details are not repeated here.
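The way the four modules hand data to one another can be pictured with a small skeleton. This is only a structural sketch: the class name `CrawlerDetectionDevice`, its field names and the data types are hypothetical, and each stage is supplied by the caller rather than implemented here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

KeyTuple = Tuple[str, str, int]  # assumed shape: (ip, request path, timestamp)
Vector = List[float]


@dataclass
class CrawlerDetectionDevice:
    # Module 11: acquire and preprocess traffic into key information tuples.
    acquire: Callable[[], List[KeyTuple]]
    # Module 12: Count Sketch + PCA feature extraction, one vector per IP.
    extract: Callable[[List[KeyTuple]], Dict[str, Vector]]
    # Module 13: MILP-based spatial filtering into suspected crawler IPs.
    spatial_filter: Callable[[Dict[str, Vector]], List[str]]
    # Module 14: HMM-based temporal decision on the suspected IPs.
    temporal_decide: Callable[[List[str]], List[str]]

    def run(self) -> List[str]:
        """Run the four stages in order and return the crawler IPs."""
        tuples = self.acquire()
        vectors = self.extract(tuples)
        suspects = self.spatial_filter(vectors)
        return self.temporal_decide(suspects)
```

Passing the stages in as callables keeps the skeleton independent of any particular solver or HMM library.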

Therefore, according to the scheme of the embodiment, the original traffic data is obtained first, and the original traffic data is preprocessed to obtain the key information tuple; then, based on the key information tuple, the dense feature vector of each IP address in the original traffic data is determined by using a Count Sketch algorithm and a principal component analysis method; next, a first mixed integer linear programming model is constructed based on the spatial dimension, and a plurality of suspected crawler IPs corresponding to the dense feature vectors are determined by using the first mixed integer linear programming model to obtain a suspected crawler IP set; finally, a hidden Markov model is constructed for the suspected crawler IP set based on the time dimension, so as to determine the crawler IPs in the suspected crawler IP set by using the hidden Markov model. The Count Sketch algorithm thus provides strong support for big data processing, and the feature extraction method based on the Count Sketch algorithm and the principal component analysis method can provide a dense feature vector for the access traffic of each IP, so that deeper and more comprehensive access information can be extracted. In addition, a stacking approach integrates the multi-dimensional information of IP access and performs two rounds of analysis and judgment: first, spatial dimension modeling is performed through a mixed integer linear programming model to filter a suspected crawler IP set out of the dense feature vectors, which addresses the problems of large access traffic and unlabeled data; then, time dimension modeling is performed through a hidden Markov model to determine the crawler IPs in the suspected crawler IP set, which addresses the low accuracy and frequent misjudgment of crawler detection in traditional methods.
Compared with detecting crawler IPs with a single basic model, the method has higher detection accuracy and better real-time performance.

Further, an electronic device is disclosed in the embodiments of the present application, and fig. 7 is a block diagram of an electronic device 20 according to an exemplary embodiment, which should not be construed as limiting the scope of the application.

Fig. 7 is a schematic structural diagram of an electronic device 20 according to an embodiment of the present disclosure. The electronic device 20 may specifically include: at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input output interface 25, and a communication bus 26. Wherein, the memory 22 is used for storing a computer program, and the computer program is loaded and executed by the processor 21 to implement the relevant steps in the crawler behavior detection method disclosed in any of the foregoing embodiments. In addition, the electronic device 20 in the present embodiment may be a computer.

In this embodiment, the power supply 23 is configured to provide a working voltage for each hardware device on the electronic device 20; the communication interface 24 can create a data transmission channel between the electronic device 20 and an external device, and a communication protocol followed by the communication interface is any communication protocol applicable to the technical solution of the present application, and is not specifically limited herein; the input/output interface 25 is configured to acquire external input data or output data to the outside, and a specific interface type thereof may be selected according to specific application requirements, which is not specifically limited herein.

In addition, the memory 22 is a carrier for storing resources, and may be a read-only memory, a random access memory, a magnetic disk or an optical disk, etc., the resources stored thereon may include an operating system 221, a computer program 222, data 223, etc., and the data 223 may include various data. The storage means may be a transient storage or a permanent storage.

The operating system 221 is used for managing and controlling each hardware device on the electronic device 20 and the computer program 222, and may be Windows Server, NetWare, Unix, Linux, or the like. In addition to the computer program that is executed by the electronic device 20 to perform the crawler behavior detection method disclosed in any of the foregoing embodiments, the computer program 222 may further include computer programs for performing other specific tasks.

Further, embodiments of the present application disclose a computer-readable storage medium for storing a computer program, where the computer-readable storage medium includes a Random Access Memory (RAM), a memory, a Read-Only Memory (ROM), an electrically programmable ROM, an electrically erasable programmable ROM, a register, a hard disk, a magnetic disk, an optical disk, or any other form of storage medium known in the art. The computer program, when executed by a processor, implements the aforementioned crawler behavior detection method. For the specific steps of the method, reference may be made to the corresponding contents disclosed in the foregoing embodiments, and details are not repeated here.

The embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another. Since the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is brief, and the relevant points can be found in the description of the method.

The steps of a crawler behavior detection method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in a process, method, article, or apparatus that comprises the element.

The crawler behavior detection method, device, equipment and storage medium provided by the present invention have been described in detail above. Specific examples are used herein to explain the principle and the implementation of the present invention, and the description of the above embodiments is only intended to help understand the method and the core idea of the present invention. Meanwhile, for a person skilled in the art, the specific embodiments and the application scope may vary according to the idea of the present invention. In summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims

1. A crawler behavior detection method, comprising:

constructing a first mixed integer linear programming model based on the space dimension, and determining a plurality of suspected crawler IPs corresponding to the dense feature vectors by using the first mixed integer linear programming model to obtain a suspected crawler IP set;

and constructing a hidden Markov model aiming at the suspected crawler IP set based on the time dimension so as to determine crawler IPs in the suspected crawler IP set cluster by utilizing the hidden Markov model.

2. The crawler behavior detection method according to claim 1, wherein the obtaining raw traffic data and preprocessing the raw traffic data to obtain a key information tuple comprises:

3. The crawler behavior detection method according to claim 1, wherein the determining, based on the key information tuple, a dense feature vector of each IP address in the original traffic data by using a Count Sketch algorithm and a principal component analysis method includes:

and constructing a feature dimension reduction model based on a principal component analysis method, inputting the sparse feature vector into the feature dimension reduction model, and acquiring the dense feature vector of each IP address in an information aggregation mode.

4. The crawler behavior detection method according to claim 1, wherein the constructing a first mixed integer linear programming model based on spatial dimensions, and determining a plurality of suspected crawler IPs corresponding to the dense feature vector using the first mixed integer linear programming model to obtain a suspected crawler IP set includes:

constructing a decision variable group based on the dense feature vectors; the decision variable group comprises a first decision variable used for judging whether an IP address in the original flow data belongs to the suspected crawler IP, a second decision variable used for judging whether the IP address at least belongs to one suspected crawler IP in the suspected crawler IP set, a third decision variable used for judging whether the IP address belongs to a plurality of suspected crawler IPs in the suspected crawler IP set, and a fourth decision variable used for representing a second preset number of IP addresses in the suspected crawler IP set;

N_p is the preset number of the suspected crawler IP sets, t_i is the fourth decision variable, N_i is the second preset number, and var(P_i) is the variance of the top N_i IP addresses in the set P_i;

5. The crawler behavior detection method according to claim 1, wherein the constructing a first mixed integer linear programming model based on spatial dimensions, and determining a plurality of suspected crawler IPs in the dense feature vector by using the first mixed integer linear programming model to obtain a suspected crawler IP set further comprises:

performing secondary modeling on the mixed integer linear programming model based on the spatial dimension to obtain a second mixed integer linear programming model;

6. The crawler behavior detection method according to any one of claims 1 to 5, wherein the constructing a hidden Markov model for the suspected crawler IP set based on a time dimension to determine crawler IPs in the suspected crawler IP set cluster by using the hidden Markov model comprises:

7. The crawler behavior detection method according to claim 6, wherein the determining of the three elements of the model for constructing the hidden Markov model by the first preset algorithm comprises:

and determining the crawler IP in the suspected crawler IP cluster according to the three elements of the model by a Viterbi algorithm.

8. A crawler behavior detection device, comprising:

9. An electronic device, comprising a processor and a memory; wherein the memory is for storing a computer program that is loaded and executed by the processor to implement the crawler behavior detection method according to any one of claims 1 to 7.

10. A computer-readable storage medium for storing a computer program; wherein the computer program when executed by a processor implements the crawler behavior detection method according to any one of claims 1 to 7.