CN114169432A - Cross-site scripting attack identification method based on deep learning - Google Patents
- Publication number
- CN114169432A
- Authority
- CN
- China
- Prior art keywords
- cross
- site scripting
- scripting attack
- data
- related data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24133—Distances to prototypes
- G06F18/24137—Distances to cluster centroïds
- G06F18/2414—Smoothing the distance, e.g. radial basis function networks [RBFN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a cross-site scripting attack identification method based on deep learning, comprising the following steps: S1, web page data collection: building a target range (a deliberately vulnerable test environment) containing cross-site scripting vulnerabilities, collecting related data containing cross-site scripting attacks using a scanner and manual penetration testing, and classifying and labeling the related data; S2, data feature extraction: cleaning the related data, segmenting the related data into words with a word segmenter, and outputting the segmented feature data; S3, data set construction: constructing the data set using a Word2Vec model; S4, neural network training: using an LSTM neural network to perform cross-site scripting attack detection and outputting a detection result; S5, Softmax layer classification: applying the Softmax function to the detection result to obtain the probability distribution of cross-site scripting attack over all target information, and judging whether the related data is a cross-site scripting attack. The invention can effectively improve the efficiency of cross-site scripting attack identification and thereby improve security.
Description
Technical Field
The invention relates to the technical field of network data security, and in particular to a cross-site scripting attack identification method based on deep learning.
Background
Computer network technology is developing rapidly, and cybercrime is increasing day by day. Cybercrime mainly takes two forms: illegally obtaining system data, and rendering a system unable to provide service. Among attacks that illegally obtain system data, cross-site scripting (XSS) is a very typical means of maliciously stealing information by exploiting website vulnerabilities. Unlike most attacks, which involve only an attacker and a victim, a cross-site scripting vulnerability involves three parties: an attacker, a client, and a website. This undoubtedly makes cross-site scripting vulnerabilities harder both to defend against and to attack.
Traditional detection is carried out manually in two ways: dynamic detection and static detection. The first, dynamic detection, starts from black-box testing and combines it with penetration-attack techniques to detect XSS vulnerabilities. Current dynamic detection methods either use real XSS attack code or use a web crawler to crawl and analyze the target web pages. However, the time overhead of a web crawler is huge, the crawled page data cannot be guaranteed to cover every page of a website, the attack code stored in the database cannot cover every attack scenario, and the access overhead places very high demands on the server. The second, static detection, designs a filter in the browser based on HTML5 and CORS attribute rules to detect XSS attacks, and provides a system to determine whether an intercepted request has malicious intent. From the above it is clear that traditional cross-site scripting detection methods usually require a great deal of time and effort to extract features from attack data, and must be combined with a certain amount of experience to achieve good results. They depend heavily on personnel, their final effectiveness suffers from uneven skill levels among personnel, and they also consume considerable server resources.
Disclosure of Invention
The invention aims to solve the technical problem in the prior art that cross-site scripting attacks cannot be prevented effectively and in time when people use internet data, and provides a cross-site scripting attack identification method based on deep learning that can effectively improve the efficiency of cross-site scripting attack identification and improve security.
The invention provides a cross-site scripting attack identification method based on deep learning, which comprises the following steps:
S1, web page data collection: building a target range containing cross-site scripting vulnerabilities, collecting related data containing cross-site scripting attacks using a scanner and manual penetration testing, and classifying and labeling the related data;
S2, data feature extraction: cleaning the related data, segmenting the related data into words with a word segmenter, and outputting the segmented feature data;
S3, data set construction: constructing the data set using a Word2Vec model;
S4, neural network training: using an LSTM neural network to perform cross-site scripting attack detection and outputting a detection result;
S5, Softmax layer classification: applying the Softmax function to the detection result to obtain the probability distribution of cross-site scripting attack over all target information, and judging whether the related data is a cross-site scripting attack.
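The Softmax step in S5 can be illustrated with a minimal sketch. It assumes the network emits one raw score per class and that there are two classes; the label order, the `LABELS` list, and the function names are hypothetical and not taken from the patent.

```python
import math

def softmax(scores):
    """Map raw class scores to a probability distribution."""
    # Subtracting the maximum keeps exp() from overflowing on large scores.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical label order; the source names the two classes but not their order.
LABELS = ["non-XSS", "XSS"]

def classify(scores):
    """Return the most probable label together with the full distribution."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs
```

Calling `classify([0.2, 2.3])` returns the label with the larger score together with probabilities that sum to 1; a deployment would more likely threshold the XSS probability than take a plain argmax, but the source does not say which.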
LSTM is an improved model of the recurrent neural network (RNN) whose main feature is the ability to handle sequence features, that is, cases where earlier and later inputs are related. The LSTM neural network overcomes the RNN's inability to handle long-distance dependencies, has become the most popular recurrent neural network at present, and has achieved good results in fields such as image processing and text classification. The design idea of the LSTM neural network is to add to the hidden layer of the RNN a state C, called the cell state, that stores long-term information, and to unroll the modified RNN along the time sequence.
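To make the cell state C concrete, the following sketch computes one LSTM time step in plain Python. It uses the widely taught forget/input/output-gate formulation, which the patent does not spell out; the parameter layout (`params` mapping a gate name to a weight matrix and bias) and all names are assumptions for illustration only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(weights, bias, vec):
    """Compute weights @ vec + bias for a list-of-rows matrix."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step; params maps gate name -> (weights, bias)."""
    z = h_prev + x  # concatenate previous hidden state and current input
    f = [sigmoid(a) for a in affine(*params["forget"], z)]       # what to keep from C
    i = [sigmoid(a) for a in affine(*params["input"], z)]        # how much new info to admit
    g = [math.tanh(a) for a in affine(*params["candidate"], z)]  # candidate cell content
    o = [sigmoid(a) for a in affine(*params["output"], z)]       # what to expose as h
    # The cell state carries long-term information from step to step.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c
```

Unrolling along the time sequence then amounts to calling `lstm_step` once per token vector and feeding the returned `h` and `c` into the next call.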
In the deep-learning-based cross-site scripting attack identification method of the invention, as a preferred mode, the related data comprise request parameters, the request method, response content, and response status.
In the deep-learning-based cross-site scripting attack identification method of the invention, as a preferred mode, the classification labels comprise a cross-site scripting attack class and a non-cross-site scripting attack class.
In the deep-learning-based cross-site scripting attack identification method of the invention, as a preferred mode, the scanner is wvs or appscan.
In the deep-learning-based cross-site scripting attack identification method of the invention, as a preferred mode, step S2 further comprises the following steps:
S21, removing data containing missing values from the related data using a downsampling algorithm;
S22, removing the tags from the response content using XPath and keeping only the page content;
S23, distinguishing the request parameters, the IP address, and the port number using methods in the urlparse package;
S24, evaluating the relationship between the feature variables and the characteristics of the related data using the Pearson correlation coefficient, and removing data irrelevant to the final class;
S25, segmenting the related data into words with the word segmenter, and outputting the segmented feature data.
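Step S23 can be sketched with Python's standard `urllib.parse` module, the Python 3 home of the older `urlparse` package. The function name, the returned dictionary keys, and the example URL are hypothetical; the patent does not say which calls it uses.

```python
from urllib.parse import urlsplit, parse_qs

def split_request_url(url):
    """Separate host, port, path, and query parameters of a request URL."""
    parts = urlsplit(url)
    return {
        "host": parts.hostname,
        "port": parts.port,  # None when the URL has no explicit port
        "path": parts.path,
        # keep_blank_values retains parameters such as "q=" that carry no value;
        # parse_qs also percent-decodes each value.
        "params": parse_qs(parts.query, keep_blank_values=True),
    }
```

For a URL such as `http://192.0.2.10:8080/search?q=%3Cscript%3E&page=2`, the host, port, and decoded parameter values come back as separate fields, so that each can be handled as its own feature.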
XPath, the XML Path Language, is a language for locating parts of an XML document. XPath is based on the tree structure of XML and provides the ability to find nodes in the data structure tree.
The Pearson correlation coefficient measures whether two data sets lie on a line, that is, it measures the linear relationship between interval variables.
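As a rough illustration of step S22, the sketch below uses the limited XPath subset supported by the standard library's `xml.etree.ElementTree` to locate the body of a page and keep only its text. It assumes well-formed markup; real response content is often malformed HTML and would need a tolerant parser, which the patent does not name. The function name `page_text` is hypothetical.

```python
import xml.etree.ElementTree as ET

def page_text(markup):
    """Strip tags from well-formed markup and return the remaining text."""
    root = ET.fromstring(markup)
    # ".//body" is an XPath expression: any <body> element below the root.
    body = root.find(".//body")
    scope = body if body is not None else root
    # itertext() walks the subtree in document order and yields text only.
    chunks = (chunk.strip() for chunk in scope.itertext())
    return " ".join(chunk for chunk in chunks if chunk)
```

Applied to a small page whose body is `<h1>Hello</h1><p>Search results for <b>cats</b></p>`, the function returns `Hello Search results for cats`.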
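A minimal sketch of how the Pearson coefficient could drive the filtering in step S24 follows. It assumes numeric feature columns and a numeric (for example 0/1) class label; the `threshold` value, the treatment of constant columns, and the function names are assumptions rather than details from the patent.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant column as uncorrelated
    return cov / math.sqrt(var_x * var_y)

def select_features(columns, labels, threshold=0.1):
    """Keep feature names whose |r| with the label reaches the threshold."""
    return [name for name, values in columns.items()
            if abs(pearson(values, labels)) >= threshold]
```

A coefficient near +1 or -1 indicates a strong linear relationship with the label, while a value near 0 marks a feature that this linear measure cannot tie to the final class.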
In the deep-learning-based cross-site scripting attack identification method of the invention, as a preferred mode, the Word2Vec model comprises a continuous bag-of-words model and a Skip-gram model. Word vectors produced by the Word2Vec model not only represent words as distributed vectors but also capture the similarity between words; similar words can be found by applying algebraic operations to the word vectors.
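One common algebraic operation for comparing word vectors is cosine similarity; the patent does not say which operation it has in mind, so the sketch below is only one plausible reading. The toy two-dimensional vectors in the usage note are invented for illustration and are not trained embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)

def most_similar(word, embeddings):
    """Return the other word whose vector is closest to `word` by cosine."""
    target = embeddings[word]
    others = (w for w in embeddings if w != word)
    return max(others, key=lambda w: cosine_similarity(target, embeddings[w]))
```

With invented vectors such as `{"<script": [0.9, 0.1], "<img": [0.8, 0.3], "hello": [-0.2, 0.9]}`, `most_similar("<script", ...)` picks `"<img"`, because the two tag tokens point in nearly the same direction.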
In the deep-learning-based cross-site scripting attack identification method of the invention, as a preferred mode, the Skip-gram model consists of an input layer, a projection layer, and an output layer.
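The input to a Skip-gram model is a set of (center word, context word) pairs drawn from a sliding window over each token sequence. The sketch below shows only that pair generation and a simple vocabulary index; the window size and function names are assumptions, and the training of the projection and output layers is left out.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within the window."""
    pairs = []
    for idx, center in enumerate(tokens):
        lo = max(0, idx - window)
        hi = min(len(tokens), idx + window + 1)
        for ctx in range(lo, hi):
            if ctx != idx:  # a token is not its own context
                pairs.append((center, tokens[ctx]))
    return pairs

def build_vocab(sequences):
    """Assign each distinct token an integer id in first-seen order."""
    vocab = {}
    for seq in sequences:
        for tok in seq:
            vocab.setdefault(tok, len(vocab))
    return vocab
```

The input layer would receive the id of the center token, the projection layer would look up its vector, and the output layer would score every vocabulary entry as a possible context token.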
The invention has the following advantages:
(1) the identification of cross-site scripting is more flexible and diversified;
(2) the degree of dependence on personnel is reduced;
(3) the identification result does not depend on the experience of the relevant personnel;
(4) the identification accuracy is greatly improved;
(5) the identification efficiency for cross-site scripting is further improved.
Drawings
FIG. 1 is a flow chart of a cross-site scripting attack identification method based on deep learning;
FIG. 2 is a data feature extraction flow chart of a cross-site scripting attack identification method based on deep learning.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.
Example 1
As shown in FIG. 1, a cross-site scripting attack identification method based on deep learning comprises the following steps:
S1, web page data collection: building a target range containing cross-site scripting vulnerabilities, collecting related data containing cross-site scripting attacks using scanners such as wvs and appscan together with manual penetration testing, and classifying and labeling the related data; the related data comprise request parameters, the request method, response content, and response status; the classification labels comprise a cross-site scripting attack class and a non-cross-site scripting attack class;
S2, data feature extraction: cleaning the related data, segmenting the related data into words with a word segmenter, and outputting the segmented feature data; as shown in FIG. 2, this comprises the following steps:
S21, removing data containing missing values from the related data using a downsampling algorithm;
S22, removing the tags from the response content using XPath and keeping only the page content;
S23, distinguishing the request parameters, the IP address, and the port number using methods in the urlparse package;
S24, evaluating the relationship between the feature variables and the characteristics of the related data using the Pearson correlation coefficient, and removing data irrelevant to the final class;
S25, segmenting the related data into words with the word segmenter, and outputting the segmented feature data;
S3, data set construction: constructing the data set using a Word2Vec model; Word2Vec provides two models for training word vectors, the continuous bag-of-words model and the Skip-gram model, and the Skip-gram model is adopted to construct the data set, wherein the Skip-gram model consists of an input layer, a projection layer, and an output layer;
S4, neural network training: using an LSTM neural network to perform cross-site scripting attack detection and outputting a detection result;
S5, Softmax layer classification: applying the Softmax function to the detection result to obtain the probability distribution of cross-site scripting attack over all target information, and judging whether the related data is a cross-site scripting attack.
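The word segmentation in step S25 is attributed in the claims to a jieba or nltk word segmenter. As a dependency-free stand-in, and not the segmenter the patent uses, the sketch below splits a request payload with a regular expression; the token classes and the decision to percent-decode and lowercase first are assumptions.

```python
import re
from urllib.parse import unquote

# Alternatives are tried left to right, so the more specific shapes come first.
TOKEN_RE = re.compile(
    r"""
      </?[a-zA-Z][\w-]*     # tag start such as <script or </script
    | [a-zA-Z_]\w*\(        # function call start such as alert(
    | [a-zA-Z_][\w-]*=      # attribute or parameter assignment such as onerror=
    | \w+                   # plain words and numbers
    | [^\s\w]               # any other punctuation, one character at a time
    """,
    re.VERBOSE,
)

def tokenize(payload):
    """Percent-decode, lowercase, and split a payload into coarse tokens."""
    return TOKEN_RE.findall(unquote(payload).lower())
```

Keeping shapes like `<script` and `alert(` as single tokens gives the downstream embedding step units that carry more signal than isolated characters; whether jieba or nltk would produce the same split depends on their configuration.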
The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or change made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solutions and inventive concept, shall fall within the scope of protection of the present invention.
Claims (8)
1. A cross-site scripting attack identification method based on deep learning, characterized by comprising the following steps:
S1, web page data collection: building a target range containing cross-site scripting vulnerabilities, collecting related data containing cross-site scripting attacks using a scanner and manual penetration testing, and classifying and labeling the related data;
S2, data feature extraction: cleaning the related data, segmenting the related data into words with a word segmenter, and outputting the segmented feature data;
S3, data set construction: constructing the data set using a Word2Vec model;
S4, neural network training: using an LSTM neural network to perform cross-site scripting attack detection and outputting a detection result;
S5, Softmax layer classification: applying the Softmax function to the detection result to obtain the probability distribution of cross-site scripting attack over all target information, and judging whether the related data is a cross-site scripting attack.
2. The method for identifying cross-site scripting attack based on deep learning of claim 1, wherein: the related data comprises request parameters, request methods, response contents and response states.
3. The method for identifying cross-site scripting attack based on deep learning of claim 1, wherein: the classification labels comprise a cross-site scripting attack class and a non-cross-site scripting attack class.
4. The method for identifying cross-site scripting attack based on deep learning of claim 1, wherein: the scanner is wvs or appscan.
5. The deep learning-based cross-site scripting attack identification method according to claim 2, characterized in that: step S2 further includes the steps of:
S21, removing data containing missing values from the related data using a downsampling algorithm;
S22, removing the tags from the response content using XPath and keeping only the page content;
S23, distinguishing the request parameters, the IP address, and the port number using methods in the urlparse package;
S24, evaluating the relationship between the feature variables and the characteristics of the related data using the Pearson correlation coefficient, and removing data irrelevant to the final class;
S25, segmenting the related data into words with the word segmenter, and outputting the segmented feature data.
6. The deep learning-based cross-site scripting attack identification method according to claim 5, characterized in that: the word segmentation device is a jieba word segmentation device or an nltk word segmentation device.
7. The method for identifying cross-site scripting attack based on deep learning of claim 1, wherein: the Word2Vec model includes a continuous bag-of-words model and a Skip-gram model.
8. The method for identifying cross-site scripting attack based on deep learning of claim 1, wherein: the Skip-gram model is composed of an input layer, a projection layer and an output layer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111482584.7A CN114169432B (en) | 2021-12-06 | 2021-12-06 | Cross-site scripting attack recognition method based on deep learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111482584.7A CN114169432B (en) | 2021-12-06 | 2021-12-06 | Cross-site scripting attack recognition method based on deep learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114169432A (en) | 2022-03-11 |
CN114169432B (en) | 2024-06-18 |
Family
ID=80483682
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111482584.7A Active CN114169432B (en) | 2021-12-06 | 2021-12-06 | Cross-site scripting attack recognition method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114169432B (en) |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109308494A (en) * | 2018-09-27 | 2019-02-05 | 厦门服云信息科技有限公司 | LSTM Recognition with Recurrent Neural Network model and network attack identification method based on this model |
CN109561084A (en) * | 2018-11-20 | 2019-04-02 | 四川长虹电器股份有限公司 | URL parameter rejecting outliers method based on LSTM autoencoder network |
CN109766693A (en) * | 2018-12-11 | 2019-05-17 | 四川大学 | A kind of cross-site scripting attack detection method based on deep learning |
US20200336507A1 (en) * | 2019-04-17 | 2020-10-22 | Sew, Inc. | Generative attack instrumentation for penetration testing |
CN110266675A (en) * | 2019-06-12 | 2019-09-20 | 成都积微物联集团股份有限公司 | A kind of xss attack automated detection method based on deep learning |
CN111625838A (en) * | 2020-05-26 | 2020-09-04 | 北京墨云科技有限公司 | Vulnerability scene identification method based on deep learning |
CN112580050A (en) * | 2020-12-25 | 2021-03-30 | 嘉应学院 | XSS intrusion identification method based on semantic analysis and vectorization big data |
CN113596007A (en) * | 2021-07-22 | 2021-11-02 | 广东电网有限责任公司 | Vulnerability attack detection method and device based on deep learning |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115618283A (en) * | 2022-12-02 | 2023-01-17 | 中国汽车技术研究中心有限公司 | Cross-site script attack detection method, device, equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN114169432B (en) | 2024-06-18 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | Address after: Room 601-626, No. 70, Qingjiang South Road, Gulou District, Nanjing City, Jiangsu Province, 210036; Applicant after: Nanjing Moyun Technology Co.,Ltd.; Address before: 210036 room 402, 301 Hanzhongmen street, Gulou District, Nanjing City, Jiangsu Province; Applicant before: Nanjing mowang Yunrui Technology Co.,Ltd. |
GR01 | Patent grant | ||