CN107330010B - Background path blasting method based on machine learning

Info

Publication number: CN107330010B
Application number: CN201710447292.7A
Authority: CN
Inventors: 刘儒学
Original assignee: Beijing Know Future Information Technology Co ltd
Current assignee: Beijing Know Future Information Technology Co ltd
Priority date: 2017-06-14
Filing date: 2017-06-14
Publication date: 2020-10-16
Anticipated expiration: 2037-06-14
Also published as: CN107330010A

Abstract

The invention relates to a background path blasting method based on machine learning. The method comprises the following steps: 1) crawling URL paths of common websites that have background features to generate a common background dictionary; 2) vectorizing the URL paths in the common background dictionary and in a common non-background dictionary; 3) training a classification algorithm on the vectorized URL paths; 4) crawling the pages of a target website to obtain the set of all its URL paths, and combining URL paths having target-website features with URL paths having common background features to generate a background dictionary containing target-website features; 5) inputting the generated background dictionary into the trained classification algorithm for recognition and classification, so as to obtain an optimal dictionary; 6) blasting the background path of the target website with the optimal dictionary using a multi-threaded blasting technique. The invention can improve the efficiency and success rate of blasting a website's background.

Description

Background path blasting method based on machine learning

Technical Field

The invention belongs to the technical field of information, and particularly relates to a background path blasting method based on machine learning.

Background

In the prior art, a penetration tester may spend a long time testing a website and still be unable to use the user or administrator information obtained, because the background (administrative back-end) login interface cannot be found.

Existing background scanning techniques rely mainly on multi-threaded scanners driven by a large dictionary. For example, existing background scanning tools such as Yujian ("sword") and Yeshu ("coconut tree") scan with a fixed large dictionary and multiple threads. Because these tools depend on a fixed large dictionary, they are effective only against the many websites with poor security awareness; they pose no threat to websites with a higher security level and cannot detect the real sensitive directories and background interface of such websites.

Disclosure of Invention

To address the above problems, the invention provides a background path blasting method based on machine learning, which can improve the efficiency and success rate of blasting a website's background.

The technical scheme adopted by the invention is as follows:

A background path blasting method based on machine learning comprises the following steps:

1) crawling URL paths of common websites that have background features to generate a common background dictionary;

2) vectorizing the URL paths in the common background dictionary and in an existing common non-background dictionary;

3) training a classification algorithm on the resulting vectorized URL paths;

4) crawling the pages of a target website to obtain the set of all URL paths of the target website, and combining the URL paths having target-website features with the URL paths having common background features from the common background dictionary obtained in step 1) to generate a background dictionary containing target-website features;

5) inputting the background dictionary containing the target website characteristics generated in the step 4) into the classification algorithm trained in the step 3) for recognition and classification, and taking a background path set obtained according to a classification result as an optimal dictionary;

6) blasting the background path of the target website by using the optimal dictionary and adopting a multi-thread blasting technology.

Further, during the vectorization in step 2), URLs with background features are weighted up and URLs without background features are weighted down, so that the URLs are finally converted into a weight matrix.

Further, step 3) utilizes the weight matrix as a training set to train the classification algorithm.

Further, in step 4), the words with the target website characteristics and the words with the common background characteristics are randomly combined to generate a background dictionary containing the target website characteristics.

Further, during the recognition and classification in step 5), the URL paths obtained that both conform to the website's naming rules and have background path features are added to the background path set to form the optimal dictionary.

Further, step 6) adopts the conventional multi-threaded blasting technique and judges each path by the status code of the returned packet, thereby achieving background path blasting.

The key points of the invention are as follows: 1. The directory keywords are weighted with the TF-IDF vectorization algorithm or other vectorization algorithms to achieve text vectorization, which is then used to train the classification algorithm and perform recognition. 2. The method is applied to blasting background paths and sensitive directories.

Compared with the prior art, the invention has the following beneficial effects:

By exploiting the characteristics of machine learning, the method can generate a blasting dictionary tailored to the target, improving the efficiency and success rate of blasting a website's background. The method can be used in the field of information security; applying it when performing penetration tests on authorized projects can increase the penetration success rate. A penetration test is one in which, with authorization from one party, another party tests an agreed scope of the authorizing party's systems and attempts to obtain sensitive information or certain privileges, in order to assess the security of a website or product. Penetration testers test a specific network from different positions (such as the internal network or the external network) by various means in order to discover and exploit vulnerabilities in the system, and then produce a penetration test report and submit it to the network owner. From this report the network owner can clearly understand the security risks and problems present in the system.

Drawings

FIG. 1 is a flow chart of a background path blasting method based on machine learning.

FIG. 2 is a schematic diagram of a weight matrix.

FIG. 3 is a schematic diagram of training by SVM algorithm.

Detailed Description

The invention is further illustrated by the following specific examples and the accompanying drawings.

The core concept of the invention is to use machine learning to analyze the path naming rules of a single website and thereby find an optimal blasting dictionary.

Fig. 1 is a flowchart of a background path blasting method based on machine learning according to the present invention, and the steps are described as follows:

1. Crawl all URL paths of the target website.

The crawler can be written in Python. It crawls pages starting from the target domain name and uses the Beautiful Soup library to match the links in each page, so as to obtain the directory structure of the whole site. The crawler then crawls the links it has already found, and the crawl depth can be adjusted to make the site directory more complete. Finally, the crawled paths are deduplicated to obtain the set of all URLs.

In this example the step is performed by a Python crawler. The href attribute (the link attribute) is matched with Python's Beautiful Soup library or with a regular expression, and the whole site is covered by implementing a breadth-first algorithm with a configured crawl depth. For deduplication, a path is added to the list only if it is not already in the list of crawled paths.
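The breadth-first crawl with a depth limit and list-based deduplication described above can be sketched as follows. This is a minimal illustration rather than the crawler used in the embodiment: it uses the standard-library html.parser in place of Beautiful Soup so that the block has no dependencies, and the names HrefCollector, crawl_paths, fetch and max_depth are hypothetical. The fetch argument is assumed to be any callable that returns the HTML of a URL as a string.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl_paths(start_url, fetch, max_depth=2):
    """Breadth-first crawl of one site; returns its URL paths, deduplicated, in discovery order."""
    host = urlparse(start_url).netloc
    seen = [urlparse(start_url).path or "/"]   # list of paths already crawled
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue                            # do not follow links beyond the configured depth
        parser = HrefCollector()
        parser.feed(fetch(url))
        for href in parser.hrefs:
            target = urlparse(urljoin(url, href))
            if target.netloc != host:
                continue                        # stay on the target site
            path = target.path or "/"
            if path not in seen:                # deduplicate before adding to the list
                seen.append(path)
                queue.append((f"{target.scheme}://{host}{path}", depth + 1))
    return seen
```

For a large site, keeping a set alongside the list would make the membership check faster; the list is kept here only to mirror the description.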

2. Crawl paths of common websites that have background features to generate a common background dictionary.

A crawler collects URLs of common websites whose page title contains the words "background management", and generates a common background dictionary from them. For example, when a Baidu search is run for pages with "background" in the title, the first result is http://www.demlution.com/account/store_login/, and the page opened from that link is titled "background management system".

The common background dictionary generated in this step is vectorized together with a common non-background path dictionary (the "common non-background dictionary" in FIG. 1) and serves as the training set for the SVM algorithm in step 5 below. The common non-background dictionary in FIG. 1 may be an existing dictionary: the background directories contained in a large sensitive-directory dictionary circulated online can be removed, and what remains is the common non-background dictionary. For example, the Yujian ("sword") scanner includes a dir dictionary in its configuration files; removing the background directories from that dictionary and adding them to the common background dictionary yields a common non-background dictionary.
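Splitting an existing large dictionary into background and non-background entries might look like the sketch below. The text does not say how background directories are recognized inside the large dictionary, so the keyword test used here is purely an assumption; BACKGROUND_KEYWORDS and split_dictionary are hypothetical names, and the keywords are taken from the examples of background words given later in the description.

```python
# Assumed marker words; the description lists admin, login, manager and account as examples.
BACKGROUND_KEYWORDS = ("admin", "login", "manage", "account")


def split_dictionary(entries, keywords=BACKGROUND_KEYWORDS):
    """Separate dictionary entries into (background, non_background) by keyword match."""
    background, non_background = [], []
    for entry in entries:
        lowered = entry.lower()
        if any(keyword in lowered for keyword in keywords):
            background.append(entry)       # to be merged into the common background dictionary
        else:
            non_background.append(entry)   # remains in the common non-background dictionary
    return background, non_background
```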

3. Vectorize the common background dictionary and the common non-background dictionary with the TF-IDF algorithm.

TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and data mining. In a given document, the term frequency (TF) is the number of times a given word appears in that document. This count is usually normalized to avoid a bias toward long documents (the same word may have a higher raw count in a long document than in a short one, regardless of its importance). The inverse document frequency (IDF) measures the general importance of a word: the IDF of a particular term is obtained by dividing the total number of documents by the number of documents containing that term and taking the logarithm of the quotient.
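Written out, the definitions above take the following form. The symbols are introduced here for illustration only: n_{t,d} is the number of occurrences of term t in document d, D is the document collection and N its size. Normalizing TF by document length is one common convention and is an assumption, since the text does not fix a particular normalization.

```latex
\mathrm{tf}(t,d) = \frac{n_{t,d}}{\sum_{k} n_{k,d}}, \qquad
\mathrm{idf}(t) = \log\frac{N}{\left|\{\, d \in D : t \in d \,\}\right|}, \qquad
\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
```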

A high term frequency within a particular document, combined with a low document frequency for that term across the whole collection, yields a high TF-IDF weight. The text can therefore be vectorized with TF-IDF: URLs with background features are weighted up, URLs without background features are weighted down, and the URLs are finally converted into a weight matrix, as shown in FIG. 2. In the matrix, a non-zero value indicates that the corresponding word has a relatively high frequency and therefore a non-zero weight; a zero value indicates that the corresponding word occurs very rarely and has a weight of 0.
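As an illustration of how URL paths could be turned into such a weight matrix, the sketch below treats each path as a document and each path token as a term. It computes plain TF-IDF only and does not attempt any additional up-weighting or down-weighting by label, since the text does not specify how that is done. The names tokenize and tfidf_matrix, and the choice of separators used for splitting, are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(path):
    """Split a URL path into lowercase word tokens on any non-alphanumeric separator."""
    return [token for token in re.split(r"[^A-Za-z0-9]+", path.lower()) if token]


def tfidf_matrix(paths):
    """Return (vocabulary, matrix); matrix[i][j] is the TF-IDF weight of vocabulary[j] in paths[i]."""
    docs = [tokenize(path) for path in paths]
    vocabulary = sorted({token for doc in docs for token in doc})
    n_docs = len(docs)
    # Document frequency: in how many paths each token occurs at least once.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    idf = {token: math.log(n_docs / doc_freq[token]) for token in vocabulary}
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for an empty path
        matrix.append([counts[token] / total * idf[token] for token in vocabulary])
    return vocabulary, matrix
```

With this plain formula a token that occurs in every path receives a weight of 0, and most entries of each row are 0 because a path contains only a few of the vocabulary's tokens.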

4. Generate a background dictionary containing target-website features (the background dictionary to be classified) by combining target-website feature paths with common background feature paths.

The words with target-website features obtained in step 1 are randomly combined (spliced) with the common background-feature words from the common background dictionary of step 2; the resulting set of background paths is the background dictionary containing target-website features. Although this dictionary carries the target website's features, suitable background paths can be identified only after the classification in step 5 below. The common background dictionary obtained in step 2 has only common background features and serves as the training set from which the machine learns background features in step 5.

Specifically, words with target-website features are the parts of a URL that are characteristic of the target website. For example, in the URL http://www.gd-info.gov.cn/shtml/guangdong/sqsjk/njk/gdnj/, the segments shtml, guangdong, sqsjk, njk and gdnj are all words with the website's features. The URLs are obtained by crawling the whole site in step 1, and the words are then obtained by splitting the URLs.

Specifically, words with background features are words commonly found in the URLs of website background management interfaces, such as admin, login, manager and account. They are obtained by downloading background dictionaries and by crawling URLs with background features.
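The splitting and random combination described in this step can be sketched as follows. The text does not specify how the words are spliced, so joining one site word and one background word with "/" in either order is an assumption, as are the names site_words and combine and the count and seed parameters.

```python
import random
from urllib.parse import urlparse


def site_words(urls):
    """Split crawled URLs into path segments and return the distinct segments in first-seen order."""
    words = []
    for url in urls:
        for segment in urlparse(url).path.split("/"):
            if segment and segment not in words:
                words.append(segment)
    return words


def combine(site, background, count, seed=0):
    """Randomly pick up to `count` candidate paths, each splicing one site word with one background word."""
    # Enumerate both orders of every pair, then sample without replacement.
    pool = sorted({f"/{a}/{b}/" for s in site for w in background for a, b in ((s, w), (w, s))})
    rng = random.Random(seed)
    return rng.sample(pool, min(count, len(pool)))
```

Applied to the example URL above with the background words admin and login, this yields candidates such as /gdnj/admin/ or /login/njk/, which still have to pass the classification in step 5.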

5. Train with the SVM algorithm, then classify the background dictionary to find the optimal dictionary.

This step is divided into a training phase and a classification phase, explained in turn below.

1) Training phase

Before classification, the learning algorithm is trained on the vectorized paths; that is, the weight matrix generated in step 3 from the common background dictionary and the common non-background dictionary is used as the training set.

An SVM (Support Vector Machine) is a supervised learning model commonly used for pattern recognition, classification and regression analysis. FIG. 3 is a schematic diagram of training with the SVM algorithm, in which the untrained data set is drawn as scatter points. On the three coordinate axes, 0-12 denotes the 12 groups of data, 0-300000 the amount of data in each group, and 0-1 the weight of the data; that is, the units of the three axes are group, count and weight respectively. The vertical axis is the weight the algorithm assigns to the data. Most points have very low weights and are not shown, because displaying them would form a very dense plane of points; the points shown are the data with higher weights. The SVM learning algorithm finds an optimal separating surface, shown in the figure and called the optimal hyperplane, which clearly divides the data set into two parts: the upper layer of points with background features and the lower layer of points with general website features. The optimal hyperplane is sought so that the classification is more clear-cut and better matches the features.

2) Classification phase

The background dictionary containing target-website features generated in step 4 is input into the SVM algorithm for recognition and classification (it must also be vectorized before input), yielding the paths that satisfy the conditions, i.e., paths that both conform to the website's naming rules and have background path features. The paths classified as background paths with high probability are added to the background path set, which forms the optimal dictionary.
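The two phases can be illustrated with a small linear SVM trained by sub-gradient descent on the L2-regularized hinge loss. This is a sketch under several assumptions, not the classifier used in the embodiment: the text does not name a kernel, library or hyperparameters; the bias term is omitted for brevity; and binary token-presence features stand in for the TF-IDF weight matrix to keep the block short. All function and parameter names are hypothetical.

```python
import re


def tokenize(path):
    """Split a URL path into lowercase word tokens."""
    return [token for token in re.split(r"[^A-Za-z0-9]+", path.lower()) if token]


def featurize(path, vocabulary):
    """Binary presence vector over the vocabulary (a stand-in for TF-IDF weights)."""
    tokens = set(tokenize(path))
    return [1.0 if word in tokens else 0.0 for word in vocabulary]


def train_linear_svm(samples, labels, epochs=200, lr=0.1, reg=0.01):
    """Training phase: minimize regularized hinge loss; labels are +1 (background) or -1."""
    w = [0.0] * len(samples[0])
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            w = [wi - lr * reg * wi for wi in w]          # regularization always shrinks w
            if margin < 1:                                # hinge term acts only inside the margin
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
    return w


def best_dictionary(candidates, vocabulary, w):
    """Classification phase: keep candidates that fall on the background side of the hyperplane."""
    return [p for p in candidates
            if sum(wi * xi for wi, xi in zip(w, featurize(p, vocabulary))) > 0]
```

In this sketch a candidate made up only of tokens never seen in training scores exactly 0 and is rejected; a real setup would need a larger training set and a tuned decision threshold.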

6. Using the optimal dictionary and the conventional multi-threaded blasting technique, judge each path by the status code of the returned packet, and thereby blast the background path.

This step uses the conventional multi-threaded blasting technique and judges each path by the status code of the returned packet: if the status code is 200, 500 or the like, the path exists; if the status code is 404, 302 or the like, the path does not exist, and the next path is tested.
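The status-code rule and the multi-threaded loop can be sketched as below. The probe argument is a hypothetical callable that returns the HTTP status code for a path; in an authorized test it would wrap an HTTP request to the target, which is deliberately left out here. Treating every code other than 200 and 500 as "absent" is an assumption, since the text gives only examples of each class.

```python
from concurrent.futures import ThreadPoolExecutor

# Status codes the text gives as examples of an existing path.
PRESENT = {200, 500}


def path_exists(status):
    """Apply the rule from the text; any code outside PRESENT (e.g. 404, 302) counts as absent."""
    return status in PRESENT


def blast(paths, probe, workers=8):
    """Probe every path concurrently and return those judged present, in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        statuses = list(pool.map(probe, paths))  # map preserves the order of `paths`
    return [path for path, status in zip(paths, statuses) if path_exists(status)]
```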

The method can be used in the field of information security, and applying it when performing penetration tests on authorized projects can increase the penetration success rate. When a project authorized for advanced penetration testing is received, information is collected first, then a preliminary penetration is performed to make an initial assessment of the website's functions and possible vulnerabilities. If the background of the website can be found, two routes are available: one is to blast weak passwords to enter the background; the other is to look for injection points, obtain the background administrator's account and password through injection, and log in to the background. Either way, the next stage of the penetration test can then proceed. If the background cannot be found, the test can proceed only by attempting server-side vulnerabilities; even if an injection point is found, the tester cannot enter the background for deeper penetration and can obtain only partial data. Finding the website background is therefore an important step in a penetration test, and the method of the invention can be used to find it so that the test proceeds smoothly.

The method was tested on a number of websites. The results show that where a conventional scanner could not find a website's real background address, the method of the invention successfully found it by generating a background dictionary and blasting the background.

The TF-IDF algorithm used in the above embodiment may be replaced by other text vectorization methods such as N-Gram and VSM; the SVM classification algorithm used in the above embodiment may likewise be replaced by other classification algorithms of the same type, such as decision trees, Bayesian classifiers and artificial neural networks.

The above embodiments are only intended to illustrate the technical solution of the present invention and not to limit the same, and a person skilled in the art can modify the technical solution of the present invention or substitute the same without departing from the spirit and scope of the present invention, and the scope of the present invention should be determined by the claims.

Claims

1. A background path blasting method based on machine learning is characterized by comprising the following steps:

3) training a classification algorithm through the vectorized URL path;

5) inputting the generated background dictionary containing the target website characteristics into the trained classification algorithm in the step 3) to perform recognition and classification, and taking a background path set obtained according to a classification result as an optimal dictionary;

2. The method as claimed in claim 1, wherein, in the vectorization in step 2), the URLs with background features are weighted and the URLs without background features are weighted down, and finally the URLs are converted into the weight matrix.

3. The method of claim 2, wherein step 2) performs vectorization of text using one of the following algorithms: TF-IDF algorithm, N-Gram algorithm, VSM algorithm.

4. The method of claim 2, wherein step 3) utilizes the weight matrix as a training set for training of a classification algorithm.

5. The method of claim 3, wherein the classification algorithm employed in step 3) is one of the following algorithms: SVM classification algorithm, decision tree algorithm, Bayesian algorithm and artificial neural network algorithm.

6. The method of claim 1, wherein step 4) randomly combines words with target website features with words with common background features to generate a background dictionary containing target website features.

7. The method as claimed in claim 1, wherein, when the step 5) performs the identification and classification, the obtained URL path which both meets the website naming rule and has the background path characteristics is added into the background path set to form the optimal dictionary.

8. The method according to claim 1, wherein step 6) adopts a traditional multithread blasting technology, and the judgment is carried out according to the state value of the return packet, and finally the background path blasting is realized.