CN107330010B - Background path blasting method based on machine learning - Google Patents


Info

Publication number
CN107330010B
Authority
CN
China
Prior art keywords
background
dictionary
path
target website
common
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710447292.7A
Other languages
Chinese (zh)
Other versions
CN107330010A (en)
Inventor
刘儒学
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Know Future Information Technology Co ltd
Original Assignee
Beijing Know Future Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Know Future Information Technology Co ltd
Priority to CN201710447292.7A
Publication of CN107330010A
Application granted
Publication of CN107330010B
Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Abstract

The invention relates to a background path blasting method based on machine learning, that is, a method for brute-force discovery of a website's background (administrative back-end) paths. The method comprises the following steps: 1) crawling URL paths with background features from common websites to generate a common background dictionary; 2) vectorizing the URL paths in the common background dictionary and in a common non-background dictionary; 3) training a classification algorithm on the vectorized URL paths; 4) crawling the pages of a target website to obtain the set of all its URL paths, and combining URL paths having target-website features with URL paths having common background features to generate a background dictionary containing the target-website features; 5) inputting the generated background dictionary into the trained classification algorithm for recognition and classification, so as to obtain an optimal dictionary; 6) blasting the background path of the target website with the optimal dictionary, using a multi-thread blasting technique. The invention can improve the efficiency and the success rate of blasting a website's background.

Description

Background path blasting method based on machine learning
Technical Field
The invention belongs to the technical field of information, and particularly relates to a background path blasting method based on machine learning.
Background
In the prior art, the user or administrator information obtained after lengthy penetration testing of a website often cannot be used, because the background login interface cannot be found.
Existing background scanning techniques mainly rely on a multi-threaded scanner with a large dictionary. For example, existing background scanning tools such as Yujian (御剑, "sword") and Yeshu (椰树, "coconut tree") scan with a fixed large dictionary and multiple threads. Because they are built around a fixed large dictionary, these tools are effective only against websites with weak security awareness; they pose no threat to websites with a higher security level and cannot detect the real sensitive directories and background interfaces of those websites.
Disclosure of Invention
In view of the above problems, the invention provides a background path blasting method based on machine learning, which can improve the efficiency and the success rate of blasting a website's background.
The technical scheme adopted by the invention is as follows:
A background path blasting method based on machine learning comprises the following steps:
1) crawling a URL path of a common website with background features to generate a common background dictionary;
2) vectorizing URL paths in a common background dictionary and an existing common non-background dictionary;
3) training a classification algorithm on the resulting vectorized URL paths;
4) crawling a page of a target website to obtain a set of all URL paths of the target website, and combining the URL paths with the target website features with the URL paths with the common background features in the common background dictionary obtained in the step 1) to generate a background dictionary containing the target website features;
5) inputting the background dictionary containing the target website characteristics generated in the step 4) into the classification algorithm trained in the step 3) for recognition and classification, and taking a background path set obtained according to a classification result as an optimal dictionary;
6) blasting the background path of the target website by using the optimal dictionary and adopting a multi-thread blasting technology.
Further, during the vectorization in step 2), URLs with background features are up-weighted and URLs without background features are down-weighted, so that the URLs are finally converted into a weight matrix.
Further, step 3) utilizes the weight matrix as a training set to train the classification algorithm.
Further, in step 4), the words with the target website characteristics and the words with the common background characteristics are randomly combined to generate a background dictionary containing the target website characteristics.
Further, during the recognition and classification in step 5), each URL path that both conforms to the website's naming rules and has background path features is added to the background path set to form the optimal dictionary.
Further, step 6) adopts the traditional multi-thread blasting technique and judges each path by the status code of the response, thereby carrying out the background path blasting.
The key points of the invention are as follows: 1. Directory keywords are weighted with the TF-IDF vectorization algorithm or another vectorization algorithm to vectorize the text, which is then used to train the classification algorithm and to perform recognition. 2. The method is applied to blasting background paths and sensitive directories.
Compared with the prior art, the invention has the following beneficial effects:
The invention uses machine learning to generate a targeted blasting dictionary, which improves the efficiency and the success rate of blasting a website's background. The method can be used in the field of information security, where applying it during penetration tests of authorized projects can increase the penetration success rate. In a penetration test, one party, with the authorization of the other, tests an agreed scope and attempts to obtain the other party's sensitive information or certain privileges, in order to assess the security of a website or product. Penetration testers probe a specific network from different positions (such as the internal network and the external network) by various means to discover and exploit the vulnerabilities in the system, and then produce a penetration test report and submit it to the network owner. From that report the network owner can clearly understand the security risks and problems in the system.
Drawings
FIG. 1 is a flow chart of a background path blasting method based on machine learning.
FIG. 2 is a schematic diagram of a weight matrix.
FIG. 3 is a schematic diagram of training by SVM algorithm.
Detailed Description
The invention is further illustrated by the following specific examples and the accompanying drawings.
The core concept of the invention is as follows: machine learning is used to analyze the path naming rules of a single website and to seek an optimal blasting dictionary.
Fig. 1 is a flowchart of a background path blasting method based on machine learning according to the present invention, and the steps are described as follows:
1. Crawl all URL paths of the target website.
A crawler can be written in python: pages are crawled starting from the target domain name, and the links in each page are extracted with the Beautiful Soup library, so as to obtain the directory of the whole site. The crawler then follows the links it has already collected, and the crawl depth can be adjusted to make the site directory more complete. Finally, the crawled paths are deduplicated to obtain the set of all URLs.
In this example the step is carried out by a python crawler. href attributes (href is the link attribute) are matched with python's Beautiful Soup library or with a regular expression, and the whole site is covered by a breadth-first traversal with a configured crawl depth. For deduplication, a path is added to the list only if it is not already in the list of crawled paths.
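The breadth-first crawl with a depth limit and list-based deduplication might look roughly like the following. This is only a sketch: it uses the standard-library `html.parser` in place of Beautiful Soup, and the names `HrefParser`, `crawl_paths`, `fetch` and the in-memory `FAKE_SITE` are hypothetical stand-ins, not part of the patent.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class HrefParser(HTMLParser):
    """Collect the href attribute of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl_paths(start_url, fetch, max_depth=2):
    """Breadth-first crawl restricted to the start host; returns deduplicated paths."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    paths = []
    while queue:
        url, depth = queue.popleft()
        path = urlparse(url).path or "/"
        if path not in paths:  # add a path only if it is not already listed
            paths.append(path)
        if depth >= max_depth:  # do not follow links beyond the crawl depth
            continue
        parser = HrefParser()
        parser.feed(fetch(url))
        for href in parser.hrefs:
            link = urljoin(url, href)
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return paths


# Hypothetical in-memory pages standing in for real HTTP responses.
FAKE_SITE = {
    "http://example.test/": '<a href="/news/">n</a><a href="/about/">a</a>',
    "http://example.test/news/": '<a href="/news/2017/">y</a><a href="/">home</a>',
    "http://example.test/about/": '<a href="http://other.test/x">ext</a>',
    "http://example.test/news/2017/": "",
}

paths = crawl_paths("http://example.test/", lambda u: FAKE_SITE.get(u, ""), max_depth=2)
print(paths)
```

Passing the page fetcher in as `fetch` keeps the traversal logic separate from the HTTP client, so the crawl depth and the deduplication can be checked without touching a network.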
2. Crawl the paths of common websites that have background features to generate a common background dictionary.
A crawler collects, from common websites, the URLs (Uniform Resource Locators) of pages whose title contains the words "background management", and generates the common background dictionary from them. For example, when Baidu is searched with an intitle query for "background", the first search result is http://www.demlution.com/account/store_login/, and the page opened from that link is titled "background management system".
The common background dictionary generated in this step is vectorized together with a common non-background path dictionary (the "common non-background dictionary" in fig. 1), and the result serves as the training set of the SVM algorithm in step 5 below. The common non-background dictionary can be derived from an existing dictionary: the background directories contained in a large sensitive-directory dictionary circulating online are removed, and what remains is the common non-background dictionary. For example, the configuration files of the Yujian (御剑, "sword") scanner include a dir dictionary; removing the background directories from it and adding them to the common background dictionary leaves a common non-background dictionary.
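The description does not say how background directories are recognized when they are removed from an existing dictionary. One simple possibility, shown purely as an assumption, is keyword matching against the background words the description lists elsewhere (admin, login, manager, account); `split_dictionary` and the sample entries are hypothetical.

```python
# Background words named in the description; treating them as a keyword
# filter is an assumption of this sketch.
BACKGROUND_WORDS = ("admin", "login", "manager", "account")


def split_dictionary(entries):
    """Split dictionary entries into (background, non_background) lists."""
    background, non_background = [], []
    for entry in entries:
        lowered = entry.lower()
        if any(word in lowered for word in BACKGROUND_WORDS):
            background.append(entry)
        else:
            non_background.append(entry)
    return background, non_background


# Hypothetical entries standing in for a scanner's dir dictionary.
sample = ["/admin/", "/images/", "/user/login.php", "/news/list.html", "/Manager/index.asp"]
bg, non_bg = split_dictionary(sample)
print(bg)
print(non_bg)
```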
3. Vectorize the common background dictionary and the common non-background dictionary with the TF-IDF algorithm.
TF-IDF (term frequency-inverse document frequency) is a commonly used weighting technique for information retrieval and data mining. In a given document, Term Frequency (TF) refers to the number of times a given word appears in the document. This number is typically normalized to prevent it from being biased towards long documents (the same word is likely to have a higher raw count in a long document than in a short one, regardless of its importance). Inverse Document Frequency (IDF) is a measure of the general importance of a word. The IDF for a particular term may be obtained by dividing the total number of documents by the number of documents that contain that term and taking the logarithm of the resulting quotient.
A word that has a high frequency within a particular document but a low document frequency across the collection receives a high TF-IDF weight. The text can therefore be vectorized with TF-IDF: URLs with background features are up-weighted, URLs without background features are down-weighted, and the URLs are finally converted into a weight matrix, as shown in fig. 2. In the matrix, a non-zero value means that the corresponding word occurs frequently enough to receive a non-zero weight; a zero value means that the corresponding word occurs very rarely, so its weight is 0.
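As an illustration of the weighting just described, the sketch below treats each URL path as a document, uses a length-normalized term frequency, and computes the IDF as the logarithm of (total documents / documents containing the term). The tokenization rule and the names `tokenize` and `tfidf_matrix` are assumptions; the patent does not fix them, and this plain TF-IDF does not model any extra class-specific weighting.

```python
import math
import re
from collections import Counter


def tokenize(path):
    """Split a URL path into lowercase alphanumeric words (assumed rule)."""
    return [t for t in re.split(r"[^a-z0-9]+", path.lower()) if t]


def tfidf_matrix(paths):
    """Return (vocabulary, weight matrix) with one row per path."""
    docs = [tokenize(p) for p in paths]
    vocab = sorted({t for d in docs for t in d})
    n_docs = len(docs)
    # Document frequency: in how many paths each word occurs.
    doc_freq = Counter(t for d in docs for t in set(d))
    matrix = []
    for d in docs:
        counts = Counter(d)
        total = len(d) or 1  # normalize TF by document length
        row = [(counts[t] / total) * math.log(n_docs / doc_freq[t]) for t in vocab]
        matrix.append(row)
    return vocab, matrix


paths = ["/admin/login.php", "/admin/index.php", "/news/index.php"]
vocab, matrix = tfidf_matrix(paths)
for path, row in zip(paths, matrix):
    print(path, [round(w, 3) for w in row])
```

In this toy input the word `php` occurs in every path, so its IDF is log(1) = 0 and its column is all zeros, while `login`, which occurs in only one path, gets the largest weight in the first row.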
4. Generate a background dictionary containing the target-website features (the background dictionary to be classified) by combining target-website feature paths with common background feature paths.
The words with target-website features obtained in step 1 are randomly combined (spliced) with the common words with background features from the common background dictionary of step 2; the resulting set of background paths is the background dictionary containing the target-website features. This dictionary carries the target-website features, but suitable background paths can be found in it only after the classification in step 5 below. The common background dictionary obtained in step 2, by contrast, holds general background features and is the training set from which the machine learns background features in step 5.
Specifically, the words with target-website features are the parts of a URL that are characteristic of the target website. For example, in the URL http://www.gd-info.gov.cn/shtml/guangdong/sqsjk/njk/gdnj/, the segments shtml, guangdong, sqsjk, njk and gdnj are all words with the website's features. The URLs are obtained by crawling the whole site in step 1, and the words are then obtained by splitting the URLs.
Specifically, the words with background features are words commonly found in the URLs of the background management interfaces of various websites, such as admin, login, manager and account. They are obtained by downloading a background dictionary and by crawling URLs with background features.
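The splitting and splicing can be sketched as follows. The three join patterns and the exhaustive (rather than random) enumeration are assumptions made so the example is deterministic; `site_words` and `combine` are hypothetical names.

```python
from itertools import product
from urllib.parse import urlparse


def site_words(urls):
    """Split crawled URLs into path segments, keeping first-seen order."""
    words = []
    for url in urls:
        for segment in urlparse(url).path.split("/"):
            if segment and segment not in words:
                words.append(segment)
    return words


def combine(site, background):
    """Splice every site word with every background word (assumed patterns)."""
    candidates = []
    for s, b in product(site, background):
        for path in (f"/{s}/{b}/", f"/{b}/{s}/", f"/{s}_{b}/"):
            if path not in candidates:
                candidates.append(path)
    return candidates


words = site_words(["http://www.gd-info.gov.cn/shtml/guangdong/sqsjk/njk/gdnj/"])
candidates = combine(words, ["admin", "login"])
print(words)
print(len(candidates), candidates[:3])
```

Even this small input produces 30 candidates (5 site words × 2 background words × 3 patterns), which is why the description follows the combination step with a classifier that filters the candidates.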
5. Train with the SVM algorithm, then classify the background dictionary to seek the optimal dictionary.
This step is divided into a training stage and a classification stage, which are explained in turn:
1) Training stage
Before classification, the learning algorithm is trained on the vectorized paths; that is, the weight matrix generated in step 3 from the common background dictionary and the common non-background dictionary is used as the training set.
An SVM (Support Vector Machine) is a supervised learning model commonly used for pattern recognition, classification and regression analysis. FIG. 3 is a schematic diagram of training with the SVM algorithm, in which the untrained data set is shown as a scatter plot. On the three coordinate axes, 0-12 denotes the 12 groups of data, 0-300000 denotes the amount of data in each group, and 0-1 denotes the weight of the data; that is, the units of the three axes are group, count and weight. The vertical axis is the weight that the algorithm assigns to the data. Most points have very low weights and are not shown, because displaying them would form a very dense plane of points; the points shown are the data with higher weights. The SVM learning algorithm finds an optimal separating surface, called the optimal hyperplane and shown in the figure, which clearly divides the data set into two parts: the upper-layer points with background features and the lower-layer points with general website features. The optimal hyperplane is sought so that the classification is more clear-cut and better matches the features.
2) Classification stage
The background dictionary containing the target-website features generated in step 4 is input into the SVM algorithm for recognition and classification (it must also be vectorized before input), yielding the paths that satisfy the conditions, namely paths that both conform to the website's naming rules and have background path features. The paths classified as having a high probability of being background paths are added to the background path set to form the optimal dictionary.
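The two stages can be illustrated with a very small linear SVM written from scratch. This is a sketch under several assumptions: word counts stand in for the TF-IDF weights, the model has no bias term, training is plain full-batch subgradient descent on the L2-regularized hinge loss, and the training paths, hyperparameters and function names are all hypothetical. The patent does not specify the SVM implementation.

```python
import re


def tokenize(path):
    """Split a URL path into lowercase alphanumeric words (assumed rule)."""
    return [t for t in re.split(r"[^a-z0-9]+", path.lower()) if t]


def vectorize(path, vocab):
    """Word-count vector over the vocabulary; unknown words are ignored."""
    tokens = tokenize(path)
    return [float(tokens.count(w)) for w in vocab]


def train_linear_svm(samples, labels, epochs=200, lr=0.1, reg=0.01):
    """Full-batch subgradient descent on the L2-regularized hinge loss (no bias)."""
    dim = len(samples[0])
    n = len(samples)
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [reg * wi for wi in w]
        for x, y in zip(samples, labels):
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            if margin < 1:  # sample violates the margin, so it contributes
                for i in range(dim):
                    grad[i] -= y * x[i] / n
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


def predict(path, vocab, w):
    """Return +1 (background-like) or -1 (not background-like)."""
    score = sum(wi * xi for wi, xi in zip(w, vectorize(path, vocab)))
    return 1 if score > 0 else -1


# Training stage: hypothetical miniature dictionaries.
background = ["/admin/login.php", "/manager/index.php", "/admin/account/"]
non_background = ["/news/index.php", "/images/logo.png", "/about/contact.html"]
train_paths = background + non_background
labels = [1] * len(background) + [-1] * len(non_background)
vocab = sorted({t for p in train_paths for t in tokenize(p)})
w = train_linear_svm([vectorize(p, vocab) for p in train_paths], labels)

# Classification stage: keep only the candidates classified as background.
candidates = ["/shtml/admin/", "/shtml/news/", "/guangdong/login/"]
optimal = [p for p in candidates if predict(p, vocab, w) == 1]
print(optimal)
```

Words that occur only in background training paths (such as `admin` and `login`) can only be pushed towards positive weights, and words that occur only in non-background paths (such as `news`) only towards negative ones, so the candidates built on them fall on opposite sides of the learned hyperplane.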
6. Using the optimal dictionary and the traditional multi-thread blasting technique, judge each path by the status code of the response, and thereby carry out the background path blasting.
This step uses the traditional multi-thread blasting technique and judges by the status code of the response: if the status code is 200, 500 or the like, the path exists; if it is 404, 302 or the like, the path does not exist, and the next path is tested.
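The status-code judgment and the multi-threaded dispatch can be sketched as below. The request itself is deliberately left out: `get_status` is a hypothetical callable supplied by the caller, and `FAKE_STATUS` stands in for real responses. Only 200 and 500 are treated as "present" here; the description says "and the like" without listing further codes, so treating every other code as absent is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor

# Status codes the description names as meaning "the path exists".
PRESENT = {200, 500}


def path_exists(status):
    """Judge a path from the response status code."""
    return status in PRESENT


def probe_all(paths, get_status, workers=4):
    """Check every path on a thread pool; return the paths judged present."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        statuses = list(pool.map(get_status, paths))  # map() keeps input order
    return [p for p, s in zip(paths, statuses) if path_exists(s)]


# Hypothetical responses for an authorized test target.
FAKE_STATUS = {
    "/shtml/admin/": 200,
    "/shtml/news/": 404,
    "/guangdong/login/": 302,
    "/njk/manager/": 500,
}

found = probe_all(list(FAKE_STATUS), lambda p: FAKE_STATUS.get(p, 404))
print(found)
```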
The method can be used in the field of information security, where applying it during penetration tests of authorized projects can increase the penetration success rate. When a project authorized for an advanced penetration test is received, information is collected first, and then a preliminary penetration is carried out to form an initial judgment of the website's functions and possible vulnerabilities. If the background of a website can be found, two routes are available: one is to blast weak passwords to enter the background; the other is to search for injection points, obtain the account and password of a background administrator through injection, and log in to the background. The next stage of the penetration test can then proceed. If the background cannot be found, the penetration test can only attempt server-side vulnerabilities; even if an injection point is found, the tester cannot enter the background for deeper penetration and can obtain only part of the data. Finding the website background is therefore an important step in a penetration test, and the method of the invention can be used to find it so that the test proceeds smoothly.
A plurality of websites were tested with the method. The results show that, when the real background address of a website cannot be found with a traditional scanner, the address can be found successfully by generating a background dictionary and blasting the background with the method of the invention.
The TF-IDF algorithm adopted in the embodiment of the invention can be replaced by methods such as N-Gram and VSM to carry out text vectorization; the SVM classification algorithm used in the above embodiments may also be replaced by other classification algorithms of the same type, such as algorithms of decision tree, bayes, artificial neural networks, and the like.
The above embodiments are only intended to illustrate the technical solution of the present invention and not to limit the same, and a person skilled in the art can modify the technical solution of the present invention or substitute the same without departing from the spirit and scope of the present invention, and the scope of the present invention should be determined by the claims.

Claims (8)

1. A background path blasting method based on machine learning is characterized by comprising the following steps:
1) crawling a URL path of a common website with background features to generate a common background dictionary;
2) vectorizing URL paths in a common background dictionary and an existing common non-background dictionary;
3) training a classification algorithm through the vectorized URL path;
4) crawling a page of a target website to obtain a set of all URL paths of the target website, and combining the URL paths with the target website features with the URL paths with the common background features in the common background dictionary obtained in the step 1) to generate a background dictionary containing the target website features;
5) inputting the generated background dictionary containing the target website characteristics into the trained classification algorithm in the step 3) to perform recognition and classification, and taking a background path set obtained according to a classification result as an optimal dictionary;
6) blasting the background path of the target website by using the optimal dictionary and adopting a multi-thread blasting technology.
2. The method as claimed in claim 1, wherein, in the vectorization in step 2), the URLs with background features are weighted and the URLs without background features are weighted down, and finally the URLs are converted into the weight matrix.
3. The method of claim 2, wherein step 2) performs vectorization of text using one of the following algorithms: TF-IDF algorithm, N-Gram algorithm, VSM algorithm.
4. The method of claim 2, wherein step 3) utilizes the weight matrix as a training set for training of a classification algorithm.
5. The method of claim 3, wherein the classification algorithm employed in step 3) is one of the following algorithms: SVM classification algorithm, decision tree algorithm, Bayesian algorithm and artificial neural network algorithm.
6. The method of claim 1, wherein step 4) randomly combines words with target website features with words with common background features to generate a background dictionary containing target website features.
7. The method as claimed in claim 1, wherein, when the step 5) performs the identification and classification, the obtained URL path which both meets the website naming rule and has the background path characteristics is added into the background path set to form the optimal dictionary.
8. The method according to claim 1, wherein step 6) adopts a traditional multithread blasting technology, and the judgment is carried out according to the state value of the return packet, and finally the background path blasting is realized.
CN201710447292.7A 2017-06-14 2017-06-14 Background path blasting method based on machine learning Active CN107330010B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710447292.7A CN107330010B (en) 2017-06-14 2017-06-14 Background path blasting method based on machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710447292.7A CN107330010B (en) 2017-06-14 2017-06-14 Background path blasting method based on machine learning

Publications (2)

Publication Number Publication Date
CN107330010A CN107330010A (en) 2017-11-07
CN107330010B true CN107330010B (en) 2020-10-16

Family

ID=60194707

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710447292.7A Active CN107330010B (en) 2017-06-14 2017-06-14 Background path blasting method based on machine learning

Country Status (1)

Country Link
CN (1) CN107330010B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109409090B (en) * 2018-11-12 2020-09-29 北京知道创宇信息技术股份有限公司 Website background detection method and device and server
CN111723378B (en) * 2020-06-17 2023-03-10 浙江网新恒天软件有限公司 Website directory blasting method based on website map
CN114024729A (en) * 2021-10-29 2022-02-08 恒安嘉新(北京)科技股份公司 Website background detection method, device, equipment and storage medium
CN117112873B (en) * 2023-10-25 2024-01-26 北京华云安信息技术有限公司 API blasting method, device, equipment and storage medium based on code injection

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101692639A (en) * 2009-09-15 2010-04-07 西安交通大学 Bad webpage recognition method based on URL
CN102739679A (en) * 2012-06-29 2012-10-17 东南大学 URL(Uniform Resource Locator) classification-based phishing website detection method
CN104217160A (en) * 2014-09-19 2014-12-17 中国科学院深圳先进技术研究院 Method and system for detecting Chinese phishing website
CN104573033A (en) * 2015-01-15 2015-04-29 国家计算机网络与信息安全管理中心 Dynamic URL filtering method and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11461785B2 (en) * 2008-07-10 2022-10-04 Ron M. Redlich System and method to identify, classify and monetize information as an intangible asset and a production model based thereon
US20140165194A1 (en) * 2012-12-06 2014-06-12 International Business Machines Corporation Attack Protection Against XML Encryption Vulnerability
CN104200167B (en) * 2014-08-05 2017-08-18 杭州安恒信息技术有限公司 Automate penetration testing method and system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
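Tutorial on using the Pker multi-threaded background scanning tool (Pker多线程后台扫描工具使用教程); kingstone1258; 《https://jingyan.baidu.com/article/c910274b9e1263cd361d2d09.html》; 20150718; full text *
Phishing website detection system based on machine learning algorithms (基于机器学习算法的钓鱼网站检测系统); 王田峰; China Master's Theses Full-text Database (《中国优秀硕士学位论文全文数据库》); 20140918; full text *
Yujian web background sensitive directory scanning (御剑web后台敏感目录扫描); Unitue_逆流; 《https://blog.csdn.net/lijia111111/article/details/54694863》; 20170123; full text *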

Also Published As

Publication number Publication date
CN107330010A (en) 2017-11-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: Room 301, Unit 1, 3rd Floor, Building 15, No.1 Courtyard, Gaolizhang Road, Haidian District, Beijing, 100080

Patentee after: BEIJING KNOW FUTURE INFORMATION TECHNOLOGY CO.,LTD.

Address before: 100102 room 112102, unit 1, building 3, yard 1, Futong East Street, Chaoyang District, Beijing

Patentee before: BEIJING KNOW FUTURE INFORMATION TECHNOLOGY CO.,LTD.
