CN111652284A

CN111652284A - Scanner identification method and device, electronic equipment and storage medium

Info

Publication number: CN111652284A
Application number: CN202010386219.5A
Authority: CN
Inventors: 张永
Original assignee: Hangzhou Dt Dream Technology Co Ltd
Current assignee: Hangzhou Dt Dream Technology Co Ltd
Priority date: 2020-05-09
Filing date: 2020-05-09
Publication date: 2020-09-11

Abstract

The application provides a scanner identification method and device, electronic equipment and a storage medium. The method may comprise the following steps: constructing, in a feature vector space, a feature vector to be detected associated with a source IP address of the traffic to be detected; determining a relative position relation between the feature vector to be detected and a first clustering center corresponding to a scanner and a second clustering center corresponding to a non-scanner in a pre-established scanner detection model, wherein the first clustering center and the second clustering center are obtained by performing cluster analysis on sample feature vectors, and the sample feature vectors are obtained based on sample traffic sent by the scanner and the non-scanner; and determining, according to the relative position relation, a target feature vector matched with the first clustering center among the feature vectors to be detected, and determining that the traffic, in the traffic to be detected, whose source IP address is the IP address associated with the target feature vector is sent by a scanner.

Description

Scanner identification method and device, electronic equipment and storage medium

Technical Field

The present application relates to the field of network security, and in particular, to a scanner identification method and apparatus, an electronic device, and a storage medium.

Background

The scanner is a program capable of automatically detecting the security weakness of the host, and can quickly and accurately find the loophole existing in the scanning target. Therefore, the scanner is often used by hackers to obtain vulnerabilities of the network devices, and then attack the corresponding network devices through the obtained vulnerabilities.

A hacker intruding by means of a scanner therefore poses a serious security risk to network devices, and effectively identifying the traffic sent by scanners is very important for preventing the security information of network devices from being stolen.

Disclosure of Invention

In view of this, the present application provides a scanner identification method and apparatus, an electronic device, and a storage medium, which can detect received traffic, and further identify traffic sent by a scanner from the detected traffic, so as to prevent a hacker from stealing security information of a network device through the scanner.

In order to achieve the above purpose, the present application provides the following technical solutions:

according to a first aspect of the present application, a scanner identification method is provided, including:

constructing a feature vector to be detected associated with a source IP address of the flow to be detected in a feature vector space;

determining a relative position relation between the feature vector to be detected and a first clustering center corresponding to a scanner and a second clustering center corresponding to a non-scanner in a pre-established scanner detection model, wherein the first clustering center and the second clustering center are obtained by clustering and analyzing sample feature vectors, and the sample feature vectors are obtained based on sample flow sent by the scanner and the non-scanner;

and determining, according to the relative position relationship, a target feature vector matched with the first clustering center among the feature vectors to be detected, and determining that the traffic, in the traffic to be detected, whose source IP address is the IP address associated with the target feature vector is sent by a scanner.

According to a second aspect of the present application, a method for training a scanner detection model is provided, including:

acquiring sample flow sent by a scanner and a non-scanner;

constructing a sample feature vector associated with a source IP address of the sample traffic in a feature vector space;

performing cluster analysis on the sample feature vectors in the feature vector space to obtain a first cluster center corresponding to a scanner and a second cluster center corresponding to a non-scanner; and using the first clustering center and the second clustering center as the training result of the scanner detection model.

According to a third aspect of the present application, there is provided a scanner identification apparatus comprising:

the construction unit is used for constructing a to-be-detected feature vector associated with a source IP address of the to-be-detected flow in the feature vector space;

the first determining unit is used for determining the relative position relation between the feature vector to be detected and a first clustering center corresponding to the scanner and a second clustering center corresponding to the non-scanner in a pre-established scanner detection model, wherein the first clustering center and the second clustering center are obtained by clustering and analyzing sample feature vectors, and the sample feature vectors are obtained based on sample flow sent by the scanner and the non-scanner;

and the second determining unit is used for determining, according to the relative position relationship, a target feature vector matched with the first clustering center among the feature vectors to be detected, and determining that the traffic, in the traffic to be detected, whose source IP address is the IP address associated with the target feature vector is sent by a scanner.

According to a fourth aspect of the present application, there is provided a training apparatus for a scanner detection model, comprising:

the acquisition unit is used for acquiring sample flow sent by a scanner and a non-scanner;

the construction unit is used for constructing a sample feature vector associated with the source IP address of the sample flow in a feature vector space;

the analysis unit is used for carrying out clustering analysis on the sample feature vectors in the feature vector space to obtain a first clustering center corresponding to a scanner and a second clustering center corresponding to a non-scanner; and using the first clustering center and the second clustering center as the training result of the scanner detection model.

According to a fifth aspect of the present application, there is provided an electronic device comprising:

a processor;

a memory for storing processor-executable instructions;

wherein the processor implements the method according to any one of the first and second aspects by executing the executable instructions.

According to a sixth aspect of the present application, there is provided a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, carry out the steps of the method according to any one of the first and second aspects as described above.

In the technical scheme of the application, the characteristic vector of the source IP address can be constructed in the characteristic vector space based on the obtained sample flow, and two clustering centers respectively corresponding to the scanner and the non-scanner are obtained by clustering the characteristic vector, so that when the vector to be detected is obtained, whether the flow to be detected is sent by the scanner or not can be judged according to the two predetermined clustering centers. In other words, according to the technical scheme, the acquired flow to be detected can be automatically identified according to the two predetermined clustering centers, so that a hacker is prevented from stealing the security information of the network equipment through a scanner.

Furthermore, after the sample feature vectors are constructed according to the sample feature set, the application does not adopt a cluster analysis method of the related art that leaves the number of cluster centers unrestricted. Instead, the number of cluster centers is fixed at two in advance, and the positions of the two cluster centers are then obtained through cluster analysis, so that after the traffic to be detected is obtained, whether it was sent by a scanner can be determined according to the positions of the two cluster centers. Because the number of cluster centers is preset, the cluster analysis process is more targeted, which improves the accuracy of the obtained cluster center positions and thus the accuracy of scanner identification.

Drawings

Fig. 1 is a schematic diagram of a network architecture shown in an exemplary embodiment of the present application.

Fig. 2 is a flowchart illustrating a method for training a scanner inspection model according to an exemplary embodiment of the present application.

Fig. 3 is a flowchart illustrating a scanner identification method according to an exemplary embodiment of the present application.

FIG. 4 is a flow chart illustrating a method of training a scanner inspection model in accordance with an exemplary embodiment of the present application.

Fig. 5 is a schematic diagram of a feature vector space including a cluster center according to an exemplary embodiment of the present application.

FIG. 6 is a flow chart illustrating a method for identifying a scanner based on a scanner inspection model in accordance with an exemplary embodiment of the present application.

Fig. 7 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present application.

Fig. 8 is a block diagram of a scanner identification device according to an exemplary embodiment of the present application.

Fig. 9 is a schematic structural diagram of another electronic device according to an exemplary embodiment of the present application.

Fig. 10 is a block diagram of a training apparatus for a scanner inspection model according to an exemplary embodiment of the present application.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.

It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. The word "if" as used herein may be interpreted as "at … …" or "when … …" or "in response to a determination", depending on the context.

Fig. 1 is a schematic diagram of a network architecture shown in an exemplary embodiment of the present application. As shown in fig. 1, the network architecture may include: device 11, network 12, and device 13. A hacker may send traffic to device 13 via a scanner on device 11 to obtain network security information associated with device 13. On this basis, the hacker may determine the security vulnerabilities of device 13 according to the acquired network security information, and then launch an attack on device 13 based on those vulnerabilities.

In the present application, the device 13 may be the execution subject of the present application. The device may be a common electronic device, such as a PC, a tablet device, a notebook computer, a handheld computer (PDA), a mobile phone, or a wearable device (such as smart glasses or a smart watch), and may even be an enterprise-level or industrial-level device; one or more embodiments of the present specification do not limit this.

Fig. 2 is a flowchart illustrating a method for training a scanner detection model according to an exemplary embodiment of the present application. As shown in fig. 2, the method may include the steps of:

Step 202, obtaining sample traffic sent by a scanner and a non-scanner.

The scanner is a detection-type application program that can automatically acquire security information of a local or remote device and then determine the security vulnerabilities of that device based on the acquired information. Because of these characteristics, a scanner can be used, on the one hand, to identify local security vulnerabilities so that they can be protected in time; on the other hand, it can be used by hackers to steal the security information of other network devices, determine their security vulnerabilities, and then attack those remote devices accordingly. In view of this, it is very important to identify the traffic sent by scanners among the received traffic, so as to effectively prevent the security information of network devices from being stolen and thereby improve their security.

Scanners may include Web scanners and port scanners. In the related art, several identification methods for Web scanners have been proposed, which identify received traffic by exploiting the fact that a Web scanner steals the security information of a network device by sending HTTP traffic. A port scanner, however, does not obtain the security information of a network device by sending HTTP traffic. It follows that the above technical solutions cannot identify port scanners, which can then only be identified by relying on the professional knowledge of security personnel.

Step 204, constructing a sample feature vector associated with the source IP address of the sample traffic in a feature vector space.

In the present application, received traffic including traffic sent by a real scanner is taken as sample traffic to train a scanner detection model. After sample traffic sent based on a scanner and a non-scanner is acquired, a sample feature vector associated with a source IP address of the sample traffic needs to be constructed in a feature vector space first to be used for training a scanner detection model.

Traffic sent by scanners and traffic sent by non-scanners usually carry different source IP addresses, and the feature data corresponding to those source IP addresses also differs. A sample feature vector can therefore be constructed from the feature data of the source IP address. Specifically, after the sample traffic is obtained, the feature data corresponding to each source IP address of the sample traffic may be counted, where the feature data includes: the number of different destination IPs, the number of different destination ports, the number of combinations between the destination IP and the destination port in the same traffic, and the number of requests initiated. After the feature data corresponding to the source IP address is obtained, a sample feature vector associated with the source IP address can be established from it; for example, each of the four pieces of feature data can be used as one dimension of a vector to construct the sample feature vector in the feature vector space.

For ease of understanding, the process of constructing the sample feature vector is illustrated, assuming that the source IP address, destination IP address, and destination port of the currently received sample traffic are as shown in table 1 below.

TABLE 1

Then, the 4 pieces of feature data obtained by statistics for the source IP address 1.1.1.1 are as follows: the number of different destination IP addresses is 3, which, as shown in Table 1, are 2.2.2.2, 3.3.3.3 and 4.4.4.4; the number of different destination ports is 4, which, as shown in Table 1, are 80, 22, 8080 and 4040; the number of combinations between the destination IP and the destination port in the same traffic is 6, which, as can be seen from Table 1, are: 2.2.2.2 and 80, 3.3.3.3 and 8080, 4.4.4.4 and 80, 4.4.4.4 and 22, 4.4.4.4 and 4040; and the number of requests initiated is 7, since each row in Table 1 corresponds to one request initiated by the source IP address 1.1.1.1.

Therefore, for the source IP address 1.1.1.1, the values of the obtained feature data are 3, 4, 6 and 7, from which a sample feature vector associated with the source IP address 1.1.1.1 can be created.
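The counting described above can be sketched as follows. This is a minimal illustration rather than the implementation of the application: the function name `build_feature_vectors`, the record layout (source IP, destination IP, destination port) and the sample records are all assumptions, and the records are hypothetical data, not the contents of Table 1.

```python
from collections import defaultdict

def build_feature_vectors(records):
    """Map each source IP to [distinct dst IPs, distinct dst ports,
    distinct (dst IP, dst port) pairs, number of requests]."""
    dst_ips = defaultdict(set)
    dst_ports = defaultdict(set)
    pairs = defaultdict(set)
    requests = defaultdict(int)
    for src_ip, dst_ip, dst_port in records:
        dst_ips[src_ip].add(dst_ip)
        dst_ports[src_ip].add(dst_port)
        pairs[src_ip].add((dst_ip, dst_port))
        requests[src_ip] += 1  # every record counts as one initiated request
    return {
        src: [len(dst_ips[src]), len(dst_ports[src]), len(pairs[src]), requests[src]]
        for src in requests
    }

# Hypothetical records: (source IP, destination IP, destination port).
records = [
    ("10.0.0.1", "10.0.1.1", 80),
    ("10.0.0.1", "10.0.1.1", 80),   # a repeated request to the same IP and port
    ("10.0.0.1", "10.0.1.1", 22),
    ("10.0.0.1", "10.0.1.2", 22),
    ("10.0.0.2", "10.0.1.1", 443),
]
vectors = build_feature_vectors(records)
print(vectors["10.0.0.1"])  # [2, 2, 3, 4]
```

Sets are used for the first three features so that repeated requests to the same destination only increase the request count.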

It should be understood that the multiple pieces of feature data obtained for the same source IP address characterize different features of that source IP address, so the sample feature vector established from them may take values in very different ranges in its different dimensions. If the sample feature vectors are not processed, this difference in scale affects the result of the cluster analysis. Therefore, to ensure that the values of the sample feature vectors in all dimensions lie in comparable ranges, the obtained sample feature vectors can be standardized, and the cluster analysis is then performed on the standardized sample feature vectors. A conventional standardization method may be used; for example, the sample feature vectors may be processed with RobustScaler. This example is merely illustrative, and a person skilled in the art may choose the standardization method according to actual needs, which is not limited in this application.
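As an illustration of such standardization, the sketch below centers each dimension on its median and divides by its interquartile range, which is the idea behind robust scaling. It uses only the Python standard library instead of a library scaler; the function names and the sample values are hypothetical, and the quartile convention (`method="inclusive"`) is an assumption.

```python
import statistics

def fit_robust_scaler(vectors):
    """Return one (median, interquartile range) pair per dimension."""
    params = []
    for column in zip(*vectors):
        q1, _, q3 = statistics.quantiles(column, n=4, method="inclusive")
        iqr = q3 - q1
        # Fall back to 1.0 so a constant dimension does not divide by zero.
        params.append((statistics.median(column), iqr if iqr != 0 else 1.0))
    return params

def robust_scale(vector, params):
    """Scale one vector with parameters fitted on the sample vectors."""
    return [(x - median) / iqr for x, (median, iqr) in zip(vector, params)]

# Hypothetical two-dimensional sample vectors; the last one is an outlier.
samples = [[1, 10], [2, 20], [3, 30], [4, 40], [5, 1000]]
params = fit_robust_scaler(samples)
print(robust_scale([5, 1000], params))  # [1.0, 48.5]
```

Because the median and interquartile range are barely affected by the outlier, the remaining vectors keep a usable spread after scaling. The same fitted parameters would also be applied to vectors constructed later from traffic to be detected.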

It should be noted that, although in the above process the sample feature vector is established from the feature data first and then standardized, what is standardized is essentially the value of the sample feature vector in each dimension, and those values are determined by the obtained feature data. Therefore, the obtained feature data may instead be standardized first, and the sample feature vector associated with the source IP address may then be established from the standardized feature data.

Step 206, performing cluster analysis on the sample feature vectors in the feature vector space to obtain a first cluster center corresponding to a scanner and a second cluster center corresponding to a non-scanner; and using the first clustering center and the second clustering center as the training result of the scanner detection model.

The obtained sample feature vectors can be subjected to clustering analysis through a common clustering analysis algorithm, for example, a K-Means clustering algorithm (K-Means algorithm), a K-center clustering algorithm (K-Medoids algorithm), and the like can be adopted. Of course, this example is only illustrative, and a person skilled in the art may determine, according to actual needs, which clustering algorithm to perform clustering analysis on the sample feature vectors, and the application is not limited herein.

In the application, the number of the clustering centers is preset to be 2, and then the sample feature vectors are subjected to clustering analysis, so that two clustering centers respectively corresponding to a scanner and a non-scanner are obtained. In other words, compared with the conventional cluster analysis, the number of the cluster centers is not used as the result of the cluster analysis, but the number of the cluster centers is limited in advance, and the finally obtained position of the cluster center is used as the result of the cluster analysis. By means of limiting the number of the clustering centers, the clustering analysis process has better pertinence, more accurate clustering centers corresponding to the scanner and the non-scanner are obtained, and accuracy of scanner identification through the clustering centers is improved.
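Clustering with the number of cluster centers fixed at two can be sketched as below, using a plain K-Means iteration. Several points are assumptions not stated in the application: the deterministic initialization, the function name `two_means`, the hypothetical sample vectors, and in particular the rule that the center with the larger coordinate sum is taken as the scanner center. In practice the vectors would be standardized before clustering.

```python
import math

def two_means(vectors, max_iter=100):
    """Lloyd-style K-Means with the number of cluster centers fixed at 2."""
    # Assumed deterministic start: the first vector and the vector farthest from it.
    centers = [list(vectors[0]),
               list(max(vectors, key=lambda v: math.dist(v, vectors[0])))]
    for _ in range(max_iter):
        groups = ([], [])
        for v in vectors:
            nearest = 0 if math.dist(v, centers[0]) <= math.dist(v, centers[1]) else 1
            groups[nearest].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        updated = [
            [sum(col) / len(group) for col in zip(*group)] if group else centers[i]
            for i, group in enumerate(groups)
        ]
        if updated == centers:
            break
        centers = updated
    return centers

# Hypothetical sample vectors: three low-activity sources, two high-activity ones.
samples = [[1, 1, 1, 2], [1, 2, 2, 3], [2, 1, 2, 3],
           [50, 40, 200, 300], [60, 45, 220, 310]]
# Assumption: the center with the larger coordinate sum corresponds to the scanner.
scanner_center, normal_center = sorted(two_means(samples), key=sum, reverse=True)
print(scanner_center)  # [55.0, 42.5, 210.0, 305.0]
```

Only the two center positions are kept as the training result; the cluster assignments of the individual samples are not needed afterwards.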

In the process of performing cluster analysis on the sample feature vectors, the result of the cluster analysis may be evaluated to judge whether the obtained cluster centers meet the expected standard.

In one case, the acquired sample feature vectors may be randomly divided into a training set and a test set according to a preset ratio, where the training set is used for training the scanner detection model and the test set is used for evaluating the obtained scanner detection model. After the training set is obtained, cluster analysis is performed on the sample feature vectors in the training set through a clustering algorithm to obtain a first candidate cluster center corresponding to the scanner and a second candidate cluster center corresponding to the non-scanner. The first candidate cluster center and the second candidate cluster center are then evaluated based on predetermined identification results for the sample feature vectors in the test set, and when the evaluation result reaches a preset standard, the first candidate cluster center and the second candidate cluster center are determined as the training result of the scanner detection model. In this process, the predetermined identification results for the sample feature vectors in the test set can be determined by maintenance personnel with professional knowledge. The evaluation may proceed as follows: the sample feature vectors in the test set are identified by means of the obtained candidate cluster centers to obtain corresponding identification results, and whether the candidate cluster centers are determined as the cluster centers contained in the scanner detection model is judged by comparing these identification results with the identification results obtained manually in advance.

In the application, the identification results obtained manually in advance can be used as the standard results, and the recall rate and/or the accuracy rate of the identification results obtained with the candidate cluster centers are calculated against the standard results. When the recall rate and/or the accuracy rate reach the preset standard, the candidate cluster centers can be used as the training result of the scanner detection model. In practical applications, in order to obtain an optimal scanner detection model, a recall rate threshold and an accuracy rate threshold can be set at the same time, and the obtained candidate cluster centers are used as the training result of the scanner detection model only when the recall rate of the identification results is higher than the recall rate threshold and the accuracy rate is higher than the accuracy rate threshold.
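The threshold check described above can be sketched as follows. The function names, threshold values and label lists are hypothetical. The "accuracy rate" is interpreted here as the fraction of test vectors identified correctly, which is an assumption about the intended metric.

```python
def recall_and_accuracy(predicted, standard):
    """Both arguments are lists of booleans, True meaning 'sent by a scanner'."""
    true_positives = sum(p and s for p, s in zip(predicted, standard))
    actual_positives = sum(standard)
    correct = sum(p == s for p, s in zip(predicted, standard))
    recall = true_positives / actual_positives if actual_positives else 0.0
    accuracy = correct / len(standard)
    return recall, accuracy

def accept_candidate(predicted, standard, recall_threshold, accuracy_threshold):
    """Accept the candidate cluster centers only if both thresholds are exceeded."""
    recall, accuracy = recall_and_accuracy(predicted, standard)
    return recall > recall_threshold and accuracy > accuracy_threshold

# Hypothetical test set: manually determined labels versus model output.
standard = [True, True, True, False, False, False, False, False]
predicted = [True, True, False, False, False, False, False, True]
recall, accuracy = recall_and_accuracy(predicted, standard)
print(round(recall, 3), accuracy)                       # 0.667 0.75
print(accept_candidate(predicted, standard, 0.6, 0.7))  # True
print(accept_candidate(predicted, standard, 0.7, 0.7))  # False
```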

In another case, the obtained sample feature vectors can be randomly split N times according to a preset ratio to obtain N groups of training sets and test sets. On this basis, the following steps may be performed for each group of training set and test set: performing cluster analysis on the sample feature vectors in the training set to obtain two candidate cluster centers respectively corresponding to the scanner and the non-scanner; performing scanner identification on the test set by means of the obtained candidate cluster centers to obtain an identification result; obtaining, in a manual mode, an identification result for the test set to serve as the standard identification result; and calculating, based on the standard identification result, the recall rate or the accuracy rate of the identification result obtained with the candidate cluster centers. After these steps have been executed for each group of training set and test set, N recall rates or accuracy rates are obtained, and by comparing them, the candidate cluster centers with the highest recall rate or accuracy rate are determined as the cluster centers contained in the scanner detection model.
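A minimal sketch of this repeated-split selection is given below, using the accuracy rate as the comparison metric. All names, the sample data, the split ratio and the seed are hypothetical. The clustering helper uses an assumed deterministic initialization and again assumes that the center with the larger coordinate sum is the scanner center.

```python
import math
import random

def two_means(vectors, max_iter=50):
    """K-Means with 2 centers; returns [scanner_center, normal_center]."""
    centers = [list(vectors[0]),
               list(max(vectors, key=lambda v: math.dist(v, vectors[0])))]
    for _ in range(max_iter):
        groups = ([], [])
        for v in vectors:
            groups[0 if math.dist(v, centers[0]) <= math.dist(v, centers[1]) else 1].append(v)
        updated = [[sum(c) / len(g) for c in zip(*g)] if g else centers[i]
                   for i, g in enumerate(groups)]
        if updated == centers:
            break
        centers = updated
    # Assumption: the center with the larger coordinate sum is the scanner center.
    return sorted(centers, key=sum, reverse=True)

def accuracy(centers, vectors, labels):
    """Fraction of test vectors whose nearest center matches the manual label."""
    scanner, normal = centers
    predicted = [math.dist(v, scanner) < math.dist(v, normal) for v in vectors]
    return sum(p == l for p, l in zip(predicted, labels)) / len(labels)

def select_best_centers(samples, labels, n_splits, test_ratio, seed=0):
    """Split N times, train on each training set, keep the best-scoring centers."""
    rng = random.Random(seed)
    best_score, best_centers = -1.0, None
    for _ in range(n_splits):
        indices = list(range(len(samples)))
        rng.shuffle(indices)
        n_test = max(1, int(len(indices) * test_ratio))
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        centers = two_means([samples[i] for i in train_idx])
        score = accuracy(centers, [samples[i] for i in test_idx],
                         [labels[i] for i in test_idx])
        if score > best_score:
            best_score, best_centers = score, centers
    return best_centers, best_score

# Hypothetical labelled sample vectors (True = sent by a scanner).
samples = [[1, 1, 1, 2], [1, 2, 2, 3], [2, 1, 2, 3], [1, 1, 1, 1], [2, 2, 3, 5],
           [1, 1, 2, 4], [50, 40, 200, 300], [60, 45, 220, 310],
           [40, 30, 150, 200], [55, 35, 180, 250]]
labels = [False] * 6 + [True] * 4
best_centers, best_score = select_best_centers(samples, labels, n_splits=5, test_ratio=0.3)
```

With clusters as well separated as in this hypothetical data, every split identifies its test set correctly, so the selection only becomes meaningful on noisier samples.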

It should be noted that, because the sample traffic obtained in step 202 includes both traffic sent by scanners and traffic sent by non-scanners, the present application can obtain the first cluster center corresponding to the scanner and the second cluster center corresponding to the non-scanner through cluster analysis. In addition, the sample traffic obtained in the present application may be TCP traffic. In practice, both WEB scanners and port scanners need to send TCP traffic to acquire the security information of a network device, so the application is not limited to HTTP traffic as the related art is, and can identify port scanners and WEB scanners at the same time. Still further, to avoid interference from irrelevant traffic, the sample traffic is traffic in the IN direction, which refers to traffic actively initiated by a non-local device; in other words, it is not response traffic returned for traffic sent by the local device.
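One way to select such traffic is sketched below. The application does not state how the IN direction is determined, so the rule used here is an assumption: a TCP packet from a non-local source to a local destination with SYN set and ACK clear is treated as an actively initiated request. The `Packet` fields and the addresses are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Packet:
    # All field names are hypothetical; adapt them to the capture format in use.
    protocol: str
    src_ip: str
    dst_ip: str
    dst_port: int
    syn: bool
    ack: bool

def is_inbound_request(packet, local_ips):
    """True for a TCP packet that a non-local device actively initiates toward a local one."""
    return (
        packet.protocol == "TCP"
        and packet.src_ip not in local_ips
        and packet.dst_ip in local_ips
        and packet.syn
        and not packet.ack  # assumption: SYN without ACK marks a newly initiated connection
    )

local_ips = {"192.0.2.10"}
packets = [
    Packet("TCP", "198.51.100.7", "192.0.2.10", 22, syn=True, ack=False),     # inbound request
    Packet("TCP", "198.51.100.7", "192.0.2.10", 50000, syn=True, ack=True),   # reply to a local request
    Packet("UDP", "198.51.100.7", "192.0.2.10", 53, syn=False, ack=False),    # not TCP
    Packet("TCP", "192.0.2.10", "198.51.100.7", 80, syn=True, ack=False),     # outbound request
]
sample_traffic = [p for p in packets if is_inbound_request(p, local_ips)]
print(len(sample_traffic))  # 1
```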

According to this technical scheme, the scanner detection model is trained by means of cluster analysis, which removes the need in the related art to manually summarize a feature rule set for scanners, improves the training efficiency of the scanner detection model, and reduces labor cost. In addition, traffic sent by scanners and non-scanners is used as the sample traffic from which the sample feature vectors are constructed, so that the scanner detection model trained on these sample feature vectors better matches the actual conditions of the network environment, giving high accuracy and wide applicability.

Further, received TCP traffic is used as the sample traffic in the present application, so that the scanner detection model obtained can identify traffic sent by port scanners in addition to the traffic sent by WEB scanners that the related art can identify.

Furthermore, in the process of training the scanner detection model, the number of the clustering centers is preset, and the obtained positions of the clustering centers are used as the training result of the scanner detection model, so that the process of training the clustering centers is more targeted, the accuracy of the obtained scanner detection model is improved, and the accuracy of scanner identification is further improved.

The application also provides a method for identifying a scanner based on the scanner detection model. How the scanner detection model is trained has been described above and is not repeated below.

Fig. 3 illustrates a scanner identification method according to an exemplary embodiment of the present application. As shown in fig. 3, the method may include the steps of:

Step 302, constructing a feature vector to be detected associated with the source IP address of the traffic to be detected in the feature vector space.

After receiving the traffic to be detected, first, a feature vector to be detected associated with the source IP address needs to be constructed in a feature vector space, so as to identify the traffic sent by the scanner based on a scanner detection model constructed in advance.

The way of constructing the feature vector to be detected is similar to the way of constructing the sample feature vector. Specifically, after the traffic to be detected is acquired, the feature data corresponding to each source IP address in the traffic to be detected can be counted, where the feature data includes: the number of different destination IPs, the number of different destination ports, the number of combinations between the destination IP and the destination port in the same traffic, and the number of requests initiated. After the feature data corresponding to the source IP address is obtained, a feature vector to be detected associated with the source IP address can be established from it; for example, each of the four pieces of feature data can be used as one dimension of a vector to construct the feature vector to be detected in the feature vector space. For the specific construction method, refer to the description of Table 1 above, which is not repeated here.

Similar to the sample feature vector, the normalization processing may be performed on the feature vector to be detected in this embodiment, and the manner of the normalization processing may refer to the description of the normalization processing performed on the sample feature vector in the previous embodiment, which is not described herein again.

Step 304, determining the relative position relationship between the feature vector to be detected and a first clustering center corresponding to the scanner and a second clustering center corresponding to the non-scanner in a pre-established scanner detection model, wherein the first clustering center and the second clustering center are obtained by performing cluster analysis on sample feature vectors, and the sample feature vectors are obtained based on sample traffic sent by the scanner and the non-scanner.

In the present application, the scanner detection model may be created in advance based on the method shown in fig. 2.

As described above, in the process of creating the scanner detection model, the feature data corresponding to the source IP address in the sample traffic may be counted, and a sample feature vector associated with the source IP address may be established based on the feature data. Wherein the characteristic data includes: the number of different destination IPs, the number of different destination ports, the number of combinations between the destination IPs and the destination ports in the same flow, and the number of times of initiating requests.

As described above, the result of the cluster analysis may be evaluated during the process of performing the cluster analysis on the sample feature data, so as to determine whether the obtained cluster center can meet the expected standard.

In one case, the acquired sample feature vectors may be randomly divided into a training set and a test set according to a preset ratio, where the training set is used for training the scanner detection model, and the test set is used for evaluating the obtained scanner detection model. After the training set is obtained, performing cluster analysis on the sample feature vectors in the training set to obtain a first candidate cluster center corresponding to the scanner and a second candidate cluster center corresponding to the non-scanner; and then, based on the recognition result of the sample feature vectors in the predetermined test set, evaluating the two candidate clustering centers, and when the evaluation result reaches a preset standard, determining the two obtained candidate clustering centers as the clustering centers of the scanner detection model.

In another case, the obtained sample feature vectors may be randomly divided N times according to a preset ratio to obtain N groups of training sets and test sets. On this basis, the following steps may be performed for each group: performing cluster analysis on the sample feature vectors in the training set to obtain two candidate cluster centers respectively corresponding to the scanner and the non-scanner; performing scanner identification on the test set using the obtained candidate cluster centers to obtain an identification result; obtaining an identification result for the test set manually to serve as the standard result for the test set; and calculating the recall rate or the accuracy rate based on the standard result and the identification result obtained using the candidate cluster centers. After these steps have been performed on each group, the N recall rates or accuracy rates are compared, and the candidate cluster centers with the highest recall rate or accuracy rate are determined as the cluster centers contained in the scanner detection model.

As described above, the sample traffic set used to train the scanner detection model contains both traffic sent by scanners and traffic sent by non-scanners. In addition, the sample traffic obtained in the present application may be TCP traffic. The sample traffic is IN-direction traffic, that is, traffic actively initiated by a non-local device, in other words, traffic that is not a response to traffic sent by the local device.

Step 306, determining, according to the relative positional relationship, a target feature vector matching the first cluster center among the feature vectors to be detected, and determining that the traffic to be detected whose source IP address is the IP address associated with the target feature vector is sent by the scanner.

In one case, the target feature vector may be determined based on the distance between the feature vector to be detected and the two cluster centers. For example, a first distance between an end point position of the feature vector to be detected and a first cluster center, and a second distance between the end point position and a second cluster center may be determined; and determining the vector to be detected with the first distance smaller than the second distance as the target characteristic vector by comparing the first distance with the second distance.
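The distance comparison described above can be sketched as follows. This is a minimal illustration only: the application does not specify the distance metric, so Euclidean distance is assumed here, and the function name `is_scanner` and the two cluster center values are hypothetical.

```python
import math

def is_scanner(vector, scanner_center, non_scanner_center):
    """Return True when the vector end point is closer to the scanner center."""
    first_distance = math.dist(vector, scanner_center)       # distance to first cluster center
    second_distance = math.dist(vector, non_scanner_center)  # distance to second cluster center
    return first_distance < second_distance

# Illustrative cluster centers (assumed values, not taken from a trained model).
scanner_center = (4.75, 6.0, 7.25, 8.75)
non_scanner_center = (1.5, 3.5, 3.0, 5.0)

high_activity = is_scanner((6, 7, 8, 9), scanner_center, non_scanner_center)
low_activity = is_scanner((1, 2, 2, 3), scanner_center, non_scanner_center)
print(high_activity, low_activity)
```

A feature vector equidistant from both centers is treated as a non-scanner in this sketch, because the comparison is strict.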

In another case, a vector range corresponding to the scanner and a vector range corresponding to the non-scanner may be determined based on the two cluster centers, and the target feature vector may be determined by determining a positional relationship between the feature vector to be detected and the two vector ranges. For example, a scanner radius may be preset; then, after determining two cluster centers, determining a spherical range corresponding to the scanner based on the cluster center corresponding to the scanner and the scanner radius; on the basis, the feature vector to be detected with the vector end point falling in the spherical range can be determined as the target feature vector. Of course, the above two ways of determining the target feature vector are only illustrative, and a person skilled in the art can set how to determine the target feature vector according to actual situations, and the method is not limited herein.
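The spherical-range variant described above can be sketched in the same way. This is a sketch under stated assumptions: Euclidean distance is assumed, points on the boundary are counted as inside the range, and the name `in_scanner_range`, the center, and the radius are hypothetical values chosen for illustration.

```python
import math

def in_scanner_range(vector, scanner_center, scanner_radius):
    """Return True when the vector end point lies within the scanner's spherical range."""
    return math.dist(vector, scanner_center) <= scanner_radius

scanner_center = (4.75, 6.0, 7.25, 8.75)  # assumed cluster center for the scanner
scanner_radius = 2.0                      # assumed preset scanner radius

inside = in_scanner_range((6, 7, 8, 9), scanner_center, scanner_radius)
outside = in_scanner_range((1, 2, 2, 3), scanner_center, scanner_radius)
print(inside, outside)
```

Compared with the distance comparison, this variant depends only on the scanner's cluster center, so a vector far from both centers is not classified as a scanner.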

Since the feature vector to be detected in the present application is established based on the source IP address, it should be understood that the target feature vector determined from the feature vector to be detected is also established based on the source IP address. In fact, after the target feature vector is determined, all the traffic sent by the source IP address associated with the target feature vector can be determined as the traffic sent by the scanner.

According to the technical scheme, after the flow to be detected is obtained, the feature vector to be detected associated with the source IP address can be established in the feature vector space, the target feature vector is determined from the feature vector to be detected based on the scanner detection model established in advance, and further, the flow of which the source IP address is the IP address associated with the target feature vector is determined to be sent by the scanner. Therefore, the flow sent by the scanner in the received flow can be automatically identified based on the scanner detection model established in advance, and the labor cost for identifying the scanner is reduced.

Furthermore, the received TCP traffic is used as the traffic to be detected, so that the application can identify the WEB scanner and the port scanner at the same time instead of only identifying the WEB scanner which acquires the equipment safety information by sending the HTTP traffic as in the related art.

Furthermore, in the process of training the scanner detection model, the number of cluster centers is preset, and the obtained positions of the cluster centers are used as the training result of the scanner detection model, so that the clustering process is more targeted, the accuracy of the obtained scanner detection model is improved, and the accuracy of scanner identification is further improved.

FIG. 4 is a flowchart illustrating a method of training a scanner recognition model according to an exemplary embodiment of the present application. As shown in fig. 4, the method may include the steps of:

Step 401, receiving TCP traffic within a preset time period as sample traffic.

In this embodiment, a time period may be preset, and the traffic received in one period is used as a sample traffic for training the scanner identification model. In addition, in order to eliminate the interference of accidental factors, all the flows received within N preset time periods can be acquired, and 1/N of all the flows is taken as a sample flow.

Step 402, counting characteristic data corresponding to each source IP address in the sample flow.

In this embodiment, the feature data corresponding to any source IP address is: the number of different destination IPs, the number of different destination ports, the number of combinations between the destination IPs and the destination ports in the same flow, and the number of times of initiating requests.
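The statistics described above can be sketched as follows. This is a minimal sketch: the record layout `(src_ip, dst_ip, dst_port)`, the function name `build_feature_vectors`, and the reading of "combinations" as distinct (destination IP, destination port) pairs are assumptions made for illustration.

```python
from collections import defaultdict

def build_feature_vectors(flows):
    """Map each source IP to (distinct dst IPs, distinct dst ports, distinct pairs, requests).

    `flows` is an iterable of (src_ip, dst_ip, dst_port) tuples, one per request.
    """
    dst_ips = defaultdict(set)
    dst_ports = defaultdict(set)
    pairs = defaultdict(set)
    requests = defaultdict(int)
    for src_ip, dst_ip, dst_port in flows:
        dst_ips[src_ip].add(dst_ip)
        dst_ports[src_ip].add(dst_port)
        pairs[src_ip].add((dst_ip, dst_port))
        requests[src_ip] += 1
    return {
        src: (len(dst_ips[src]), len(dst_ports[src]), len(pairs[src]), requests[src])
        for src in requests
    }

# Hypothetical records: the first source probes several targets, the second sends one request.
flows = [
    ("10.0.0.1", "192.168.1.1", 22),
    ("10.0.0.1", "192.168.1.1", 80),
    ("10.0.0.1", "192.168.1.2", 80),
    ("10.0.0.1", "192.168.1.2", 80),
    ("10.0.0.2", "192.168.1.1", 443),
]
vectors = build_feature_vectors(flows)
print(vectors)
```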

Step 403, establishing sample feature vectors associated with the source IP addresses in the feature vector space based on the feature data corresponding to the source IP addresses.

In this embodiment, the process of establishing the feature vector based on the feature data is similar to that in the above embodiment, and specifically, the description of table 1 in the embodiment shown in fig. 2 may be referred to, and is not repeated herein.

Step 404, dividing the established sample feature vectors associated with the source IP addresses into a training set and a test set according to a preset proportion.

In this embodiment, a division ratio may be preset to divide the established sample feature vectors into a training set and a test set. It should be noted that, in order to ensure the accuracy of the established scanner identification model, the number of sample feature vectors included in the training set is larger than the number included in the test set.
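The division described above can be sketched as follows. This is an illustration only: the function name `split_samples`, the default ratio of 0.8, and the `seed` parameter are assumptions, and the ratio is merely required to exceed 0.5 so that the training set receives the larger share.

```python
import random

def split_samples(samples, train_ratio=0.8, seed=None):
    """Randomly split samples into (training set, test set) by the given ratio."""
    if not 0.5 < train_ratio < 1.0:
        raise ValueError("train_ratio must lie strictly between 0.5 and 1.0")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # shuffle a copy; the input is left untouched
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Ten hypothetical four-dimensional sample feature vectors.
samples = [(i, i + 1, i + 2, i + 3) for i in range(10)]
train_set, test_set = split_samples(samples, train_ratio=0.8, seed=42)
print(len(train_set), len(test_set))
```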

Step 405A, performing cluster analysis on the sample feature vectors in the training set through a clustering algorithm.

In this embodiment, the algorithm used to perform cluster analysis on the sample feature vectors in the training set may be the K-means clustering algorithm. Unlike the conventional use of the K-means clustering algorithm, the value of K is preset to 2 in this embodiment, so the cluster analysis necessarily yields two cluster centers. In other words, what the cluster analysis determines in this embodiment is not the number of cluster centers, as in conventional use, but the positions of a fixed number of cluster centers.
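K-means clustering with K fixed at 2 can be sketched as follows. This is a minimal sketch rather than the implementation of the embodiment: the initialization (the first vector and the vector farthest from it), the Euclidean metric, and the function name `kmeans_two_centers` are assumptions, and the input values are illustrative.

```python
import math

def kmeans_two_centers(vectors, max_iterations=100):
    """Run Lloyd's K-means iteration with K = 2 and return the two cluster centers."""
    first = tuple(float(x) for x in vectors[0])
    farthest = max(vectors, key=lambda v: math.dist(v, first))
    centers = [first, tuple(float(x) for x in farthest)]
    for _ in range(max_iterations):
        # Assignment step: attach every vector to its nearest center.
        clusters = ([], [])
        for v in vectors:
            nearest = 0 if math.dist(v, centers[0]) <= math.dist(v, centers[1]) else 1
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its members.
        new_centers = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else old
            for old, members in zip(centers, clusters)
        ]
        if new_centers == centers:  # converged: assignments no longer change the centers
            break
        centers = new_centers
    return centers

# Illustrative training vectors forming a high-valued and a low-valued group.
training_set = [
    (5, 6, 7, 9), (4, 5, 7, 8), (6, 8, 9, 9), (4, 5, 6, 9),
    (1, 2, 3, 4), (2, 4, 3, 6), (1, 4, 2, 4), (2, 4, 4, 6),
]
centers = kmeans_two_centers(training_set)
print(centers)
```

The returned order carries no meaning by itself; which center corresponds to the scanner still has to be decided separately.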

Step 406A, determining the cluster centers respectively corresponding to the scanner and the non-scanner.

After two cluster centers are obtained based on step 405A, it is not yet possible to determine which cluster center corresponds to the scanner and which cluster center corresponds to the non-scanner.

In practice, since a hacker needs to obtain security information about network devices through a scanner, the scanner usually sends a large amount of traffic. Consequently, the number of requests initiated by the source IP address of traffic sent by a scanner is much higher than the number initiated by the source IP address of normal traffic. Similarly, the number of different destination IPs, the number of different destination ports, and the number of combinations between the destination IPs and the destination ports in the same traffic are all much higher for the source IP address of traffic sent by a scanner than for the source IP address of normal traffic. In other words, every item of feature data of a feature vector corresponding to the scanner has a higher value than the corresponding item of a feature vector corresponding to the non-scanner.

Based on the above characteristic, the cluster center corresponding to the scanner and the cluster center corresponding to the non-scanner can be determined by analyzing the feature data of the sample feature vectors respectively corresponding to the two cluster centers. For ease of understanding, assume for example that the feature data of the sample feature vectors corresponding to the two cluster centers currently obtained are as shown in table 2 below. In each characteristic data column, the values represent, from left to right: the number of different destination IPs, the number of different destination ports, the number of combinations between the destination IP and the destination port in the same traffic, and the number of times requests are initiated, all for the source IP address corresponding to the sample feature vector.

Vector corresponding to cluster center 1	Characteristic data	Vector corresponding to cluster center 2	Characteristic data
Sample feature vector 1	5,6,7,9	Sample feature vector 1'	1,2,3,4
Sample feature vector 2	4,5,7,8	Sample feature vector 2'	2,4,3,6
Sample feature vector 3	6,8,9,9	Sample feature vector 3'	1,4,2,4
Sample feature vector 4	4,5,6,9	Sample feature vector 4'	2,4,4,6

TABLE 2

As can be seen from table 2, the values of the feature data of the sample feature vectors corresponding to cluster center 1 are higher overall than those of the sample feature vectors corresponding to cluster center 2; therefore, cluster center 1 is determined as the cluster center corresponding to the scanner, and cluster center 2 is determined as the cluster center corresponding to the non-scanner.
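One possible way to automate this comparison is sketched below. It is a simplification under stated assumptions: instead of inspecting every sample feature vector, it compares the two cluster centers directly and treats the center with the larger coordinate sum as the scanner center. The function name `label_centers` and the center values are hypothetical.

```python
def label_centers(center_a, center_b):
    """Return (scanner_center, non_scanner_center), assuming scanners have higher values."""
    if sum(center_a) >= sum(center_b):
        return center_a, center_b
    return center_b, center_a

# Assumed centers, passed in arbitrary order.
scanner_center, non_scanner_center = label_centers(
    (1.5, 3.5, 3.0, 5.0),
    (4.75, 6.0, 7.25, 8.75),
)
print(scanner_center, non_scanner_center)
```

A stricter rule, for example requiring every coordinate of one center to exceed the other, could be substituted if the overall comparison proves ambiguous.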

Step 407A, identifying sample feature vectors in the test set based on the determined clustering centers respectively corresponding to the scanner and the non-scanner to obtain a first identification result.

In this embodiment, the distances between the end point of each sample feature vector in the test set and the two cluster centers may first be determined, and the sample feature vectors corresponding to the scanner may then be determined based on these distances.

For ease of understanding, the distance-based identification is described below taking the feature vector space shown in fig. 5 as an example. It is assumed that the cluster center 51 shown in fig. 5 is cluster center 1 in the example shown in table 2 and the cluster center 52 is cluster center 2 in the example shown in table 2; in other words, the cluster center 51 is the cluster center corresponding to the scanner and the cluster center 52 is the cluster center corresponding to the non-scanner.

Wherein, the vector 1 and the vector 2 are sample feature vectors in the test set, and as can be seen from the figure, the end point position of the vector 1 is obviously closer to the cluster center 51, so that it can be determined that the vector 1 corresponds to the scanner; similarly, vector 2 corresponds to a non-scanner.

It should be understood that the four-dimensional feature vector space is actually established based on the above feature data, and although fig. 5 illustrates a two-dimensional plane, this illustration is only schematic, and is intended to explain the process of obtaining the recognition result more clearly and intuitively, and the recognition result is obtained in the actual four-dimensional feature vector space by the method consistent with fig. 5.

Step 405B, displaying the feature data corresponding to each sample feature vector in the test set.

Step 406B, receiving a selection operation for the sample feature vectors in the test set.

Step 407B, determining a second recognition result for the test set based on the selection operation.

In addition to identifying the sample feature vectors in the test set using the two obtained cluster centers to obtain the first identification result, a second identification result also needs to be obtained manually. In this embodiment, the feature data corresponding to each sample feature vector in the test set may be presented to the user in step 405B, so that the user can identify the sample feature vectors corresponding to the scanner. In actual operation, the user only needs to click the corresponding touch areas to select sample feature vectors as those corresponding to the scanner.

It should be understood that the user performing the manual recognition operation is a professional having expertise in the field to ensure the accuracy of the obtained second recognition result.

Step 408, taking the second recognition result as a standard result, and determining the recall rate of the first recognition result.

In this embodiment, the accuracy of the two obtained clustering centers is evaluated using the second recognition result obtained by the manual method as a standard result. In this embodiment, the accuracy of the two clustering centers is evaluated by determining the recall rate of the first recognition result, but in an actual situation, the accuracy may also be evaluated by other methods, which is not limited herein.

For convenience of understanding, assuming that the test set includes 10 sample feature vectors, the first recognition result is: sample feature vectors 1-6 are sample feature vectors corresponding to a scanner, and sample feature vectors 7-10 are sample feature vectors corresponding to a non-scanner; and the second recognition result is: sample feature vectors 1-7 are sample feature vectors corresponding to scanners and sample feature vectors 8-10 are sample feature vectors corresponding to non-scanners.

Then the recall rate obtained is 6/7 ≈ 85.71%; if the accuracy rate is used instead, the accuracy rate obtained is (6+3)/10 = 90%. The recall rate may be understood as the proportion of the sample feature vectors actually corresponding to the scanner that are correctly identified. In the above example, 7 sample feature vectors actually correspond to the scanner, namely sample feature vectors 1-7, and 6 of them are correctly identified as corresponding to the scanner, namely sample feature vectors 1-6, so the recall rate is 6/7. The accuracy rate may be understood as the proportion of all sample feature vectors that are correctly identified. In the above example, sample feature vectors 1-6 and 8-10 are identified correctly and only sample feature vector 7 is identified incorrectly, so the accuracy rate is 9/10.
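The two metrics in the example above can be reproduced with the following sketch. The function name `recall_and_accuracy` and the representation of identification results as sets of sample indices are assumptions made for illustration.

```python
def recall_and_accuracy(predicted_scanners, actual_scanners, all_samples):
    """Compute recall and accuracy of a scanner identification result."""
    predicted = set(predicted_scanners)
    actual = set(actual_scanners)
    samples = set(all_samples)
    if not actual or not samples:
        raise ValueError("need at least one sample and one actual scanner sample")
    true_positives = len(predicted & actual)            # scanners identified as scanners
    true_negatives = len(samples - predicted - actual)  # non-scanners identified as non-scanners
    recall = true_positives / len(actual)
    accuracy = (true_positives + true_negatives) / len(samples)
    return recall, accuracy

samples = range(1, 11)       # sample feature vectors 1-10
first_result = range(1, 7)   # vectors 1-6 identified as scanner by the cluster centers
second_result = range(1, 8)  # vectors 1-7 marked as scanner manually (standard result)

recall, accuracy = recall_and_accuracy(first_result, second_result, samples)
print(f"recall={recall:.2%}, accuracy={accuracy:.2%}")  # recall=85.71%, accuracy=90.00%
```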

Step 409, judging whether the obtained recall rate is higher than a preset value; if yes, go to step 410; otherwise, go to step 411.

In this embodiment, a preset value for the recall rate may be set in advance. For example, if the preset value is 80%, the recall rate of 85.71% determined in the above example is higher than the preset value, so the two cluster centers obtained from the training set may be determined as the training result of the scanner recognition model and subsequently used for scanner identification on received traffic. Conversely, if the preset value is 90%, the recall rate determined in the above example is lower than the preset value, and the training is determined to have failed.

Step 410, determining the two determined cluster centers as the training result of the scanner recognition model.

Step 411, determining that the training has failed.

According to the above technical solution, the received TCP traffic is used as the sample traffic to establish the sample feature vectors, so that the scanner identification model obtained from the sample feature vectors identifies scanners based on TCP traffic rather than only on HTTP traffic as in the related art, and the WEB scanner and the port scanner can therefore be identified at the same time.

Further, in this embodiment, the sample feature vectors are divided into a training set and a test set, and on one hand, the candidate cluster centers respectively corresponding to the scanner and the non-scanner are obtained by performing cluster analysis on the sample feature vectors in the training set; and on the other hand, the obtained alternative clustering center is evaluated through the test set, and the alternative clustering center is determined as the clustering center contained in the scanner detection model only when the evaluation result is higher than the preset standard, so that the accuracy of the obtained scanner detection model is improved.

Fig. 6 is a flowchart illustrating a method for identifying a scanner based on a scanner identification model according to an exemplary embodiment of the present application. As shown in fig. 6, the method may include the steps of:

Step 601, taking the TCP traffic received within a preset time period as the traffic to be detected.

Step 602, counting characteristic data corresponding to each source IP address in the flow to be detected.

Step 603, establishing a feature vector to be detected associated with each source IP address in a feature vector space based on the feature data corresponding to each source IP address.

In this embodiment, steps 601 to 603 are similar to steps 401 to 403, and reference may be made to the description of the previous embodiment, which is not repeated herein.

Step 604, obtaining the clustering centers respectively corresponding to the scanner and the non-scanner in the scanner identification model established in advance.

By taking the example of the previous embodiment as a reference, the cluster center 1 corresponding to the scanner and the cluster center 2 corresponding to the non-scanner obtained in the previous embodiment can be obtained in this embodiment.

Step 605A, a first distance between an end point of each feature vector to be detected and a cluster center corresponding to the scanner is determined.

In this embodiment, the process of identifying the feature vectors to be detected is similar to the process of identifying the sample feature vectors in the test set using cluster center 1 and cluster center 2 in the previous embodiment. Continuing with the cluster centers shown in fig. 5 as an example, and assuming that vector 1 is a feature vector to be detected, the determined first distance is the distance shown by the dashed line 3, and the second distance is the distance shown by the dashed line 4.

Step 605B, determining a second distance between the end point of each feature vector to be detected and the cluster center corresponding to the non-scanner.

Step 606, judging whether the first distance of each feature vector to be detected is smaller than the second distance; if yes, go to step 607; if not, go to step 608.

For vector 1, the first distance is clearly smaller than the second distance, so vector 1 may be determined as a target feature vector; that is, the traffic whose source IP address is the IP address associated with vector 1 is determined to be sent by the scanner. Corresponding protective measures may then be taken, for example, adding the source IP address corresponding to vector 1 to a blacklist and intercepting all traffic from source IP addresses in the blacklist.
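The blacklist-based protection mentioned above can be sketched as follows. This is an illustration under stated assumptions: the cluster center values, the record layout `(src_ip, dst_ip, dst_port)`, and the function names `scanner_ips` and `intercept` are hypothetical, and Euclidean distance is assumed.

```python
import math

SCANNER_CENTER = (4.75, 6.0, 7.25, 8.75)  # assumed cluster center for the scanner
NON_SCANNER_CENTER = (1.5, 3.5, 3.0, 5.0)  # assumed cluster center for the non-scanner

def scanner_ips(vectors_by_ip):
    """Return the source IPs whose feature vector is closer to the scanner center."""
    return {
        ip for ip, vector in vectors_by_ip.items()
        if math.dist(vector, SCANNER_CENTER) < math.dist(vector, NON_SCANNER_CENTER)
    }

def intercept(flows, blacklist):
    """Keep only the flows whose source IP is not blacklisted."""
    return [flow for flow in flows if flow[0] not in blacklist]

vectors_by_ip = {"10.0.0.1": (6, 7, 8, 9), "10.0.0.2": (1, 2, 2, 3)}
blacklist = scanner_ips(vectors_by_ip)

flows = [("10.0.0.1", "192.168.1.1", 22), ("10.0.0.2", "192.168.1.1", 443)]
allowed = intercept(flows, blacklist)
print(blacklist, allowed)
```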

Step 607, determining the feature vectors to be detected whose first distance is smaller than the second distance as target feature vectors.

Step 608, determining the feature vectors to be detected whose first distance is not smaller than the second distance as non-target feature vectors.

As can be seen from the above technical solution, in this embodiment, based on the two clustering centers in the scanner detection model established in the previous embodiment, scanner identification is performed on the received traffic, so that the embodiment can automatically identify the traffic sent by the scanner.

Further, the embodiment uses the TCP traffic as the traffic to be detected, so that the application can identify the WEB scanner and the port scanner at the same time.

Fig. 7 is a schematic block diagram illustrating an electronic device according to an exemplary embodiment of the present application. Referring to fig. 7, at the hardware level, the electronic device includes a processor 702, an internal bus 704, a network interface 706, a memory 708, and a non-volatile memory 710, and may also include hardware required for other services. The processor 702 reads the corresponding computer program from the non-volatile memory 710 into the memory 708 and then runs it, forming the scanner identification apparatus at the logical level. Of course, besides the software implementation, the present application does not exclude other implementations, such as logic devices or a combination of software and hardware; that is, the execution subject of the following processing flow is not limited to logic units, and may also be hardware or logic devices.

Referring to fig. 8, in a software implementation, the scanner identification apparatus may include:

a constructing unit 801 that constructs a to-be-detected feature vector associated with a source IP address of a to-be-detected flow in a feature vector space;

a first determining unit 802, configured to determine a relative position relationship between the feature vector to be detected and a first clustering center corresponding to a scanner and a second clustering center corresponding to a non-scanner in a pre-created scanner detection model, where the first clustering center and the second clustering center are obtained by performing clustering analysis on sample feature vectors, and the sample feature vectors are obtained based on sample flows sent by the scanner and the non-scanner;

a second determining unit 803, configured to determine, according to the relative positional relationship, a target feature vector matching the first cluster center among the feature vectors to be detected, and to determine that the traffic to be detected whose source IP address is the IP address associated with the target feature vector is sent by the scanner.

Optionally, the constructing unit 801 is further configured to:

counting characteristic data corresponding to the source IP address of the flow to be detected, wherein the characteristic data comprises: the number of different destination IPs, the number of different destination ports, the number of combinations between the destination IPs and the destination ports in the same flow and the number of times of initiating requests;

and establishing a feature vector to be detected associated with the source IP address based on the feature data obtained by statistics.

Optionally, the second determining unit 803 is further configured to:

determining a first distance between an end point position of a feature vector to be detected and a first clustering center, and a second distance between the end point position and a second clustering center;

and determining the feature vector to be detected with the first distance smaller than the second distance as a target feature vector.

Fig. 9 is a schematic block diagram of another electronic device shown in an exemplary embodiment of the present application. Referring to fig. 9, at the hardware level, the electronic device includes a processor 902, an internal bus 904, a network interface 906, a memory 908, and a non-volatile memory 910, and may also include hardware required for other services. The processor 902 reads the corresponding computer program from the non-volatile memory 910 into the memory 908 and runs it, forming the training apparatus for the scanner detection model at the logical level. Of course, besides the software implementation, the present application does not exclude other implementations, such as logic devices or a combination of software and hardware; that is, the execution subject of the following processing flow is not limited to logic units, and may also be hardware or logic devices.

Referring to fig. 10, in a software implementation, the training apparatus for the scanner detection model may include:

an acquisition unit 1001 that acquires sample traffic sent based on a scanner and a non-scanner;

a constructing unit 1002, configured to construct a sample feature vector associated with a source IP address of the sample traffic in a feature vector space;

an analysis unit 1003, performing cluster analysis on the sample feature vectors in the feature vector space to obtain a first cluster center corresponding to the scanner and a second cluster center corresponding to the non-scanner; and using the first clustering center and the second clustering center as the training result of the scanner detection model.

Optionally, the constructing unit 1002 is further configured to:

counting feature data corresponding to the source IP address in the sample flow, wherein the feature data comprises: the number of different destination IPs, the number of different destination ports, the number of combinations between the destination IPs and the destination ports in the same flow and the number of times of initiating requests;

and establishing a sample feature vector associated with the source IP address based on the statistical feature data.

Optionally, the apparatus further includes:

a normalization unit 1004 for normalizing the sample feature vector;

the analysis unit 1003 is further configured to: perform cluster analysis on the normalized sample feature vectors.

Optionally, the analysis unit 1003 is further configured to:

randomly dividing the sample feature vectors into a training set and a testing set according to a preset proportion;

performing clustering analysis on the sample characteristic vectors in the training set through a clustering algorithm to obtain a first alternative clustering center corresponding to the scanner and a second alternative clustering center corresponding to the non-scanner;

evaluating the first candidate clustering center and the second candidate clustering center based on a predetermined recognition result for the sample feature vectors in the test set;

and when the evaluation result reaches a preset standard, determining the first candidate clustering center and the second candidate clustering center as the training result of the scanner detection model.

Optionally, the sample traffic comprises TCP traffic.
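The normalization performed by the normalization unit 1004 is not detailed in this description. One common choice, shown here purely as an assumption, is z-score standardization of each feature dimension so that no single dimension dominates the distance calculation; the function name `standardize` is hypothetical.

```python
import statistics

def standardize(vectors):
    """Scale each feature dimension to zero mean and unit population standard deviation."""
    columns = list(zip(*vectors))
    means = [statistics.fmean(col) for col in columns]
    # A constant dimension has zero deviation; divide by 1.0 instead to avoid division by zero.
    stdevs = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        tuple((x - m) / s for x, m, s in zip(vector, means, stdevs))
        for vector in vectors
    ]

# Two hypothetical two-dimensional vectors; the second dimension is constant.
normalized = standardize([(1, 10), (3, 10)])
print(normalized)
```

If this approach is used, the means and deviations computed from the sample feature vectors would also need to be applied to the feature vectors to be detected, so that both lie in the same scaled space as the cluster centers.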

The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.

For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the application. One of ordinary skill in the art can understand and implement it without inventive effort.

In an exemplary embodiment, there is also provided a non-transitory computer readable storage medium, e.g. a memory, comprising instructions executable by a processor of the scanner identification apparatus or of the training apparatus for the scanner detection model to implement the method described in any of the above embodiments. For example, the scanner identification method may comprise: constructing a feature vector to be detected associated with a source IP address of the traffic to be detected in a feature vector space; determining a relative positional relationship between the feature vector to be detected and both a first cluster center corresponding to a scanner and a second cluster center corresponding to a non-scanner in a pre-created scanner detection model, wherein the first cluster center and the second cluster center are obtained by performing cluster analysis on sample feature vectors, and the sample feature vectors are obtained based on sample traffic sent by the scanner and the non-scanner; and determining, according to the relative positional relationship, a target feature vector matching the first cluster center among the feature vectors to be detected, and determining that the traffic to be detected whose source IP address is the IP address associated with the target feature vector is sent by the scanner.

The non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, etc., which is not limited in this application.

The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the scope of protection of the present application.

Claims

1. A scanner identification method, comprising:

2. The method according to claim 1, wherein constructing the feature vector to be detected associated with the source IP address of the traffic to be detected in the feature vector space comprises:

3. The method according to claim 1, wherein the determining, according to the relative position relationship, a target feature vector matching the first cluster center in the feature vectors to be detected comprises:

4. A method for training a scanner detection model, comprising:

acquiring sample flow sent by a scanner and a non-scanner;

5. The method of claim 4, wherein constructing the sample feature vector associated with the source IP address of the sample traffic in a feature vector space comprises:

counting feature data corresponding to a source IP address of the sample flow, wherein the feature data comprises: the number of different destination IPs, the number of different destination ports, the number of combinations between the destination IPs and the destination ports in the same flow and the number of times of initiating requests;

6. The method of claim 4,

further comprising: carrying out standardization processing on the sample feature vector;

the performing cluster analysis on the sample feature vectors in the feature vector space includes: and performing cluster analysis on the normalized sample feature vectors.

7. The method of claim 4, wherein performing cluster analysis on the sample feature vectors in the feature vector space to obtain a first cluster center corresponding to a scanner and a second cluster center corresponding to a non-scanner comprises:

8. The method of claim 4, wherein the sample traffic comprises TCP traffic.

9. A scanner identification apparatus, comprising:

10. A training device for a scanner detection model, comprising:

11. An electronic device, comprising:

a processor;

a memory for storing processor-executable instructions;

wherein the processor implements the method of any one of claims 1-8 by executing the executable instructions.

12. A computer-readable storage medium having stored thereon computer instructions, which, when executed by a processor, carry out the steps of the method according to any one of claims 1 to 8.