CN114169432A - Cross-site scripting attack identification method based on deep learning - Google Patents

Cross-site scripting attack identification method based on deep learning

Publication number
CN114169432A
Authority
CN
China
Prior art keywords
cross
site scripting
scripting attack
data
related data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111482584.7A
Other languages
Chinese (zh)
Other versions
CN114169432B (en)
Inventor
任玉坤
何召阳
何晓刚
刘兵
董昊辰
李克萌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Mowang Yunrui Technology Co ltd
Original Assignee
Nanjing Mowang Yunrui Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Mowang Yunrui Technology Co ltd
Priority to CN202111482584.7A
Publication of CN114169432A
Application granted
Publication of CN114169432B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G06F18/24137 Distances to cluster centroïds
    • G06F18/2414 Smoothing the distance, e.g. radial basis function networks [RBFN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a cross-site scripting (XSS) attack identification method based on deep learning, comprising the following steps: S1, web page data collection: building a test range that contains cross-site scripting vulnerabilities, collecting related data containing cross-site scripting attacks by means of a scanner and manual penetration testing, and labeling the related data by class; S2, data feature extraction: cleaning the related data, tokenizing the related data with a tokenizer, and outputting the tokenized feature data; S3, data set construction: constructing a data set using a Word2Vec model; S4, neural network training: using an LSTM neural network to detect cross-site scripting attacks and output a detection result; S5, Softmax layer classification: applying the Softmax function to the detection result to obtain a probability distribution over all target classes, and judging whether the related data is a cross-site scripting attack. The invention can effectively improve the efficiency of cross-site scripting attack identification and improve security.

Description

Cross-site scripting attack identification method based on deep learning
Technical Field
The invention relates to the technical field of network data security, and in particular to a cross-site scripting attack identification method based on deep learning.
Background
Computer network technology is developing rapidly, and cybercrime is increasing day by day. Cybercrime mainly takes two forms: illegally obtaining system data, and rendering a system unable to provide service. For illegally obtaining system data, cross-site scripting (XSS) is a typical attack that exploits website vulnerabilities to steal information maliciously. Unlike most attacks, which involve only an attacker and a victim, a cross-site scripting vulnerability involves three parties: an attacker, a client, and a website. This increases the difficulty of both attacking and defending against cross-site scripting vulnerabilities.
Traditional detection relies on two manual approaches: dynamic detection and static detection. Dynamic detection starts from black-box testing and combines it with penetration-attack techniques to detect XSS vulnerabilities. Current dynamic methods either submit real XSS attack code or use a web crawler to crawl and analyze the target web pages. However, crawling is very time-consuming, the crawled pages are not guaranteed to cover every page of the website, the attack code stored in the database cannot cover all attack scenarios, and the access load places heavy demands on the server. Static detection uses HTML5 and CORS attribute rules to design a filter in the browser that detects XSS attacks, and provides a system that determines whether an intercepted request is malicious. As the above shows, traditional cross-site scripting detection usually requires a great deal of time and effort to extract features from attack data, and good results also require a certain amount of experience. These methods depend heavily on personnel, the final result suffers from uneven skill levels, and the cost in server resources is high.
Disclosure of Invention
The technical problem to be solved by the invention is that, in the prior art, cross-site scripting attacks cannot be effectively prevented in time when people use internet data. To this end, the invention provides a cross-site scripting attack identification method based on deep learning, which can effectively improve the efficiency of cross-site scripting attack identification and improve security.
The invention provides a cross-site scripting attack identification method based on deep learning, which comprises the following steps:
S1, web page data collection: building a test range that contains cross-site scripting vulnerabilities, collecting related data containing cross-site scripting attacks by means of a scanner and manual penetration testing, and labeling the related data by class;
S2, data feature extraction: cleaning the related data, tokenizing the related data with a tokenizer, and outputting the tokenized feature data;
S3, data set construction: constructing a data set using a Word2Vec model;
S4, neural network training: using an LSTM neural network to detect cross-site scripting attacks and output a detection result;
S5, Softmax layer classification: applying the Softmax function to the detection result to obtain a probability distribution over all target classes, and judging whether the related data is a cross-site scripting attack.
LSTM is an improved recurrent neural network (RNN) whose main feature is the ability to handle sequence features, that is, inputs whose earlier and later elements are related. The LSTM neural network overcomes the inability of the plain recurrent neural network to handle long-distance dependencies, has become the most popular recurrent neural network at present, and has achieved good results in fields such as image processing and text classification. Its design idea is to add to the hidden layer of the RNN a state C that stores long-term information, called the cell state, and to unroll the modified RNN along the time sequence.
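The role of the cell state C can be illustrated with a minimal sketch of a single LSTM cell. This is not the network used by the invention: it assumes a scalar input and a scalar hidden state, and the names `lstm_step`, `run_lstm`, and the `params` dictionary are hypothetical, chosen only for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a scalar LSTM cell (input size 1, hidden size 1)."""

    def gate(name, activation):
        # Each gate has an input weight w, a recurrent weight u and a bias b.
        w, u, b = params[name]
        return activation(w * x + u * h_prev + b)

    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    i = gate("input", sigmoid)         # how much new information to write
    g = gate("candidate", math.tanh)   # candidate values for the cell state
    o = gate("output", sigmoid)        # how much of the cell state to expose

    c = f * c_prev + i * g             # cell state C carries long-term information
    h = o * math.tanh(c)               # hidden state passed to the next time step
    return h, c


def run_lstm(sequence, params):
    """Unroll the cell along the time sequence, starting from zero states."""
    h, c = 0.0, 0.0
    for x in sequence:
        h, c = lstm_step(x, h, c, params)
    return h, c


# Illustrative (untrained) parameters: the same (w, u, b) for every gate.
params = {name: (0.5, 0.5, 0.0)
          for name in ("forget", "input", "candidate", "output")}
h, c = run_lstm([1.0, 0.0, -1.0], params)
```

When the forget gate is close to 1 and the input gate is close to 0, the cell state passes through a time step almost unchanged, which is how long-distance information survives.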
As a preferred mode of the cross-site scripting attack identification method based on deep learning, the related data comprise request parameters, the request method, the response content, and the response status.
As a preferred mode of the cross-site scripting attack identification method based on deep learning, the classification labels comprise a cross-site scripting attack class and a non-cross-site scripting attack class.
As a preferred mode of the cross-site scripting attack identification method based on deep learning, the scanner is wvs or appscan.
As a preferred mode of the cross-site scripting attack identification method based on deep learning, step S2 further comprises the following steps:
S21, removing data containing missing values from the related data by means of a downsampling algorithm;
S22, removing the tags from the response content using xpath, keeping only the page content;
S23, separating the request parameters, the IP address, and the port number using a function from the urlparse package;
S24, evaluating the relationship between the feature variables and the class of the related data using the Pearson correlation coefficient, and removing data unrelated to the final class;
S25, tokenizing the related data with the tokenizer and outputting the tokenized feature data.
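Step S23 can be sketched with the Python standard library. In Python 3 the former `urlparse` module lives in `urllib.parse`; the helper name `split_request_url` and the sample URL below are assumptions made for illustration, not part of the patent.

```python
from urllib.parse import urlparse, parse_qs


def split_request_url(url):
    """Separate the host (IP address), port number and request parameters."""
    parts = urlparse(url)
    return {
        "host": parts.hostname,   # e.g. the IP address of the target
        "port": parts.port,       # None when the URL carries no explicit port
        "path": parts.path,
        # parse_qs also percent-decodes the parameter values.
        "params": parse_qs(parts.query, keep_blank_values=True),
    }


record = split_request_url(
    "http://192.168.1.10:8080/search?q=%3Cscript%3Ealert(1)%3C/script%3E&page=2"
)
```

Decoding the parameters at this stage means an encoded payload such as `%3Cscript%3E` reaches the tokenizer in its plain `<script>` form.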
XPath (XML Path Language) is a language for locating parts of an XML document. It is based on the tree structure of XML and provides the ability to find nodes in that tree.
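As a minimal sketch of step S22, the standard-library `xml.etree.ElementTree` module can locate a node with an XPath expression and then keep only its text. This is an assumption-laden simplification: ElementTree supports only a subset of XPath and requires well-formed markup, so real-world HTML responses would need a more tolerant parser. The function name `page_text` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def page_text(xhtml):
    """Return the text content of the <body>, with all tags removed."""
    root = ET.fromstring(xhtml)
    body = root.find(".//body")   # XPath: first <body> anywhere below the root
    if body is None:
        body = root               # fall back to the whole document
    pieces = (piece.strip() for piece in body.itertext())
    return " ".join(piece for piece in pieces if piece)


sample = "<html><body><h1>Results</h1><p>You searched for <b>cats</b></p></body></html>"
text = page_text(sample)  # "Results You searched for cats"
```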
The Pearson correlation coefficient measures whether two data sets lie on a straight line, that is, the strength of the linear relationship between interval variables.
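The coefficient is the covariance of the two variables divided by the product of their standard deviations. A minimal sketch of how step S24 might use it to filter features follows; the threshold value, the feature names, and the decision to treat a constant feature as uncorrelated are assumptions for illustration, not values given in the patent.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant column is treated as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def select_features(columns, labels, threshold=0.1):
    """Keep the features whose |r| with the class label reaches the threshold."""
    return [name for name, values in columns.items()
            if abs(pearson(values, labels)) >= threshold]


# Hypothetical features for four samples; label 1 = XSS, 0 = non-XSS.
columns = {
    "has_script_tag": [1, 1, 0, 0],
    "url_length": [10, 12, 11, 11],
}
labels = [1, 1, 0, 0]
kept = select_features(columns, labels)
```

In this toy data `has_script_tag` correlates perfectly with the label and is kept, while `url_length` has zero correlation and is dropped.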
As a preferred mode of the cross-site scripting attack identification method based on deep learning, the tokenizer is the jieba tokenizer or the nltk tokenizer.
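To illustrate what the tokenization in step S25 produces, the sketch below uses a small regular-expression tokenizer. It is a hypothetical stand-in written for this example, not the jieba or nltk tokenizer named in the claims, and the token pattern is an assumption about what is useful for XSS payloads.

```python
import re
from urllib.parse import unquote

# Order matters: try tag openers, then "name=" pairs, then words, then symbols.
TOKEN_PATTERN = re.compile(r"</?\w+|\w+=|\w+|[^\w\s]")


def tokenize(payload):
    """Percent-decode, lowercase and split a request payload into tokens."""
    text = unquote(payload).lower()
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("q=%3Cscript%3Ealert(1)%3C/script%3E")
# ['q=', '<script', '>', 'alert', '(', '1', ')', '</script', '>']
```

Keeping `<script` and `</script` as single tokens preserves the HTML structure that distinguishes attack payloads from ordinary text.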
As a preferred mode, the Word2Vec model comprises a continuous bag-of-words (CBOW) model and a Skip-gram model. The Word2Vec model not only represents words as distributed word vectors but also captures the similarity between words, so that similar words can be found by algebraic operations on the word vectors.
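The difference between the two models lies in the direction of prediction: the continuous bag-of-words model predicts a center token from its context, while Skip-gram predicts each context token from the center. The sketch below only builds the training examples for both; the function names and the window size are illustrative assumptions.

```python
def context_window(tokens, i, window):
    """Tokens within `window` positions of index i, excluding tokens[i]."""
    lo = max(0, i - window)
    return tokens[lo:i] + tokens[i + 1:i + 1 + window]


def cbow_examples(tokens, window=2):
    """CBOW: (context tokens) -> center token."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]


def skipgram_examples(tokens, window=2):
    """Skip-gram: center token -> one context token per example."""
    return [(tokens[i], ctx)
            for i in range(len(tokens))
            for ctx in context_window(tokens, i, window)]


tokens = ["<script", ">", "alert", "(", "1", ")"]
cbow = cbow_examples(tokens, window=1)
skipgram = skipgram_examples(tokens, window=1)
```

With a window of 1, the token `alert` yields the CBOW example `([">", "("], "alert")` and the Skip-gram examples `("alert", ">")` and `("alert", "(")`.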
As a preferred mode of the cross-site scripting attack identification method based on deep learning, the Skip-gram model consists of an input layer, a projection layer, and an output layer.
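The three layers can be sketched as a forward pass in plain Python. This is an untrained toy under stated assumptions: the class name `SkipGram`, the embedding dimension, the random initialization, and the use of a full softmax output (rather than an approximation such as negative sampling) are choices made for illustration only.

```python
import math
import random


def softmax(scores):
    peak = max(scores)                        # subtract the max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class SkipGram:
    def __init__(self, vocab, dim=4, seed=0):
        rng = random.Random(seed)
        self.vocab = list(vocab)
        self.index = {word: i for i, word in enumerate(self.vocab)}
        # Projection weights: one embedding row per vocabulary word.
        self.w_in = [[rng.uniform(-0.5, 0.5) for _ in range(dim)]
                     for _ in self.vocab]
        # Output weights: one scoring row per vocabulary word.
        self.w_out = [[rng.uniform(-0.5, 0.5) for _ in range(dim)]
                      for _ in self.vocab]

    def forward(self, word):
        # Input layer: the word is a one-hot vector, i.e. just its index.
        i = self.index[word]
        # Projection layer: the one-hot product reduces to a row lookup.
        hidden = self.w_in[i]
        # Output layer: score every vocabulary word, then normalize.
        scores = [sum(h * w for h, w in zip(hidden, row)) for row in self.w_out]
        return softmax(scores)


model = SkipGram(["<script", "alert", "(", ")"])
probs = model.forward("alert")   # probability of each word being a context word
```

After training, the rows of `w_in` serve as the word vectors.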
The invention has the following advantages:
(1) cross-site scripts are identified in a more flexible and diversified way;
(2) dependence on personnel is reduced;
(3) the identification result does not depend on the experience of the personnel involved;
(4) identification accuracy is greatly improved;
(5) the efficiency of cross-site script identification is further improved.
Drawings
FIG. 1 is a flow chart of a cross-site scripting attack identification method based on deep learning;
FIG. 2 is a data feature extraction flow chart of a cross-site scripting attack identification method based on deep learning.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.
Example 1
As shown in FIG. 1, a cross-site scripting attack identification method based on deep learning comprises the following steps:
S1, web page data collection: building a test range that contains cross-site scripting vulnerabilities, collecting related data containing cross-site scripting attacks by means of scanners such as wvs and appscan and manual penetration testing, and labeling the related data by class; the related data comprise request parameters, the request method, the response content, and the response status; the classification labels comprise a cross-site scripting attack class and a non-cross-site scripting attack class;
S2, data feature extraction: cleaning the related data, tokenizing the related data with a tokenizer, and outputting the tokenized feature data; as shown in FIG. 2, this comprises the following steps:
S21, removing data containing missing values from the related data by means of a downsampling algorithm;
S22, removing the tags from the response content using xpath, keeping only the page content;
S23, separating the request parameters, the IP address, and the port number using a function from the urlparse package;
S24, evaluating the relationship between the feature variables and the class of the related data using the Pearson correlation coefficient, and removing data unrelated to the final class;
S25, tokenizing the related data with the tokenizer and outputting the tokenized feature data;
S3, data set construction: constructing a data set using a Word2Vec model; Word2Vec provides two models for training word vectors, namely the continuous bag-of-words model and the Skip-gram model; the Skip-gram model, which consists of an input layer, a projection layer, and an output layer, is used to construct the data set;
S4, neural network training: using an LSTM neural network to detect cross-site scripting attacks and output a detection result;
S5, Softmax layer classification: applying the Softmax function to the detection result to obtain a probability distribution over all target classes, and judging whether the related data is a cross-site scripting attack.
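Step S5 can be sketched as follows. The two-element score vector stands in for the output of the LSTM network, and the label order, the function names, and the sample scores are assumptions made for this illustration.

```python
import math

LABELS = ("non-XSS", "XSS")


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    peak = max(logits)                         # subtract the max for stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits):
    """Return the most probable label and the full probability distribution."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs


# Hypothetical detection result from the LSTM for one request.
label, probs = classify([0.2, 2.3])
```

The sample is judged to be a cross-site scripting attack when the XSS class receives the highest probability.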
The above describes only preferred embodiments of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solutions and inventive concept, shall fall within the scope of protection of the present invention.

Claims (8)

1. A cross-site scripting attack identification method based on deep learning, characterized by comprising the following steps:
S1, web page data collection: building a test range that contains cross-site scripting vulnerabilities, collecting related data containing cross-site scripting attacks by means of a scanner and manual penetration testing, and labeling the related data by class;
S2, data feature extraction: cleaning the related data, tokenizing the related data with a tokenizer, and outputting the tokenized feature data;
S3, data set construction: constructing a data set using a Word2Vec model;
S4, neural network training: using an LSTM neural network to detect cross-site scripting attacks and output a detection result;
S5, Softmax layer classification: applying the Softmax function to the detection result to obtain a probability distribution over all target classes, and judging whether the related data is a cross-site scripting attack.
2. The cross-site scripting attack identification method based on deep learning according to claim 1, characterized in that: the related data comprise request parameters, the request method, the response content, and the response status.
3. The cross-site scripting attack identification method based on deep learning according to claim 1, characterized in that: the classification labels comprise a cross-site scripting attack class and a non-cross-site scripting attack class.
4. The cross-site scripting attack identification method based on deep learning according to claim 1, characterized in that: the scanner is wvs or appscan.
5. The cross-site scripting attack identification method based on deep learning according to claim 2, characterized in that step S2 further comprises the following steps:
S21, removing data containing missing values from the related data by means of a downsampling algorithm;
S22, removing the tags from the response content using xpath, keeping only the page content;
S23, separating the request parameters, the IP address, and the port number using a function from the urlparse package;
S24, evaluating the relationship between the feature variables and the class of the related data using the Pearson correlation coefficient, and removing data unrelated to the final class;
S25, tokenizing the related data with the tokenizer and outputting the tokenized feature data.
6. The cross-site scripting attack identification method based on deep learning according to claim 5, characterized in that: the tokenizer is the jieba tokenizer or the nltk tokenizer.
7. The cross-site scripting attack identification method based on deep learning according to claim 1, characterized in that: the Word2Vec model comprises a continuous bag-of-words model and a Skip-gram model.
8. The cross-site scripting attack identification method based on deep learning according to claim 1, characterized in that: the Skip-gram model consists of an input layer, a projection layer, and an output layer.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111482584.7A CN114169432B (en) 2021-12-06 2021-12-06 Cross-site scripting attack recognition method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111482584.7A CN114169432B (en) 2021-12-06 2021-12-06 Cross-site scripting attack recognition method based on deep learning

Publications (2)

Publication Number Publication Date
CN114169432A (en) 2022-03-11
CN114169432B (en) 2024-06-18

Family

ID=80483682

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111482584.7A Active CN114169432B (en) 2021-12-06 2021-12-06 Cross-site scripting attack recognition method based on deep learning

Country Status (1)

Country Link
CN (1) CN114169432B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115618283A (en) * 2022-12-02 2023-01-17 中国汽车技术研究中心有限公司 Cross-site script attack detection method, device, equipment and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109308494A (en) * 2018-09-27 2019-02-05 厦门服云信息科技有限公司 LSTM Recognition with Recurrent Neural Network model and network attack identification method based on this model
CN109561084A (en) * 2018-11-20 2019-04-02 四川长虹电器股份有限公司 URL parameter rejecting outliers method based on LSTM autoencoder network
CN109766693A (en) * 2018-12-11 2019-05-17 四川大学 A kind of cross-site scripting attack detection method based on deep learning
CN110266675A (en) * 2019-06-12 2019-09-20 成都积微物联集团股份有限公司 A kind of xss attack automated detection method based on deep learning
CN111625838A (en) * 2020-05-26 2020-09-04 北京墨云科技有限公司 Vulnerability scene identification method based on deep learning
US20200336507A1 (en) * 2019-04-17 2020-10-22 Sew, Inc. Generative attack instrumentation for penetration testing
CN112580050A (en) * 2020-12-25 2021-03-30 嘉应学院 XSS intrusion identification method based on semantic analysis and vectorization big data
CN113596007A (en) * 2021-07-22 2021-11-02 广东电网有限责任公司 Vulnerability attack detection method and device based on deep learning

Also Published As

Publication number Publication date
CN114169432B (en) 2024-06-18

Similar Documents

Publication Publication Date Title
US11716347B2 (en) Malicious site detection for a cyber threat response system
CN110233849B (en) Method and system for analyzing network security situation
Rao et al. A computer vision technique to detect phishing attacks
CN104954372B (en) A kind of evidence obtaining of fishing website and verification method and system
CN112541476B (en) Malicious webpage identification method based on semantic feature extraction
CN114021040B (en) Method and system for alarming and protecting malicious event based on service access
Bai Phishing website detection based on machine learning algorithm
Zhang et al. Cross-site scripting (XSS) detection integrating evidences in multiple stages
CN113918936A (en) SQL injection attack detection method and device
Lee et al. Attacking logo-based phishing website detectors with adversarial perturbations
CN113904834B (en) XSS attack detection method based on machine learning
Gong et al. Model uncertainty based annotation error fixing for web attack detection
CN114169432B (en) Cross-site scripting attack recognition method based on deep learning
CN113688346A (en) Illegal website identification method, device, equipment and storage medium
CN114124448B (en) Cross-site script attack recognition method based on machine learning
KR102313414B1 (en) Hybrid system and method for detecting defaced homepage using artificial intelligence and pattern
CN115001763A (en) Phishing website attack detection method and device, electronic equipment and storage medium
KR20230046182A (en) Apparatus, method and computer program for detecting attack on network
CN113225343A (en) Risk website identification method and system based on identity characteristic information
Le-Nguyen et al. Hunting phishing websites using a hybrid fuzzy-semantic-visual approach
Ma et al. Phishsifter: An Enhanced Phishing Pages Detection Method Based on the Relevance of Content and Domain
Bozkır et al. Local image descriptor based phishing web page recognition as an open-set problem
WR et al. Web Extension for Phishing URL Identification
Sirisha et al. Phishing URL detection using machine learning techniques
CN116527373B (en) Back door attack method and device for malicious URL detection system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 601-626, No. 70, Qingjiang South Road, Gulou District, Nanjing City, Jiangsu Province, 210036

Applicant after: Nanjing Moyun Technology Co.,Ltd.

Address before: 210036 room 402, 301 Hanzhongmen street, Gulou District, Nanjing City, Jiangsu Province

Applicant before: Nanjing mowang Yunrui Technology Co.,Ltd.

GR01 Patent grant