CN117857090A - Multi-language-oriented remote code execution attack detection method and device

Info

Publication number: CN117857090A
Application number: CN202311632010.2A
Authority: CN
Inventors: 王炎; 张星; 乔晨星; 王浩辰; 宋涧山
Original assignee: Tianyi Cloud Technology Co Ltd
Current assignee: Tianyi Cloud Technology Co Ltd
Priority date: 2023-12-01
Filing date: 2023-12-01
Publication date: 2024-04-09

Abstract

The invention discloses a multi-language-oriented remote code execution attack detection method comprising the following steps. S1: acquiring the traffic to be detected and converting it into text data by a preprocessing method; S2: converting the text data into numerical form using a trained byte-level byte pair encoding model; S3: constructing a pre-training model, extracting features from the numericalized text data, and generating feature vectors; S4: constructing a classification network on top of the pre-training model, fine-tuning it, and classifying the feature vectors to determine whether the traffic to be detected contains a remote code execution attack; S5: model optimization and high-performance deployment. The invention recasts multi-language remote code execution detection as a multi-programming-language identification problem, which greatly alleviates the shortage of labeled data sets, provides wider detection coverage, and allows a single trained model to detect multiple classes of remote code execution attacks simultaneously.

Description

Multi-language-oriented remote code execution attack detection method and device

Technical Field

The invention relates to the technical field of network security, in particular to a multi-language-oriented remote code execution attack detection method and device.

Background

A remote code execution attack is one in which an attacker causes a remote machine to execute arbitrary code over a network. Its consequences range from the execution of malware to the attacker gaining complete control over the compromised machine.

Such an attack typically uses network communication packets as a carrier and induces the target machine to execute malicious code by uploading a carefully constructed attack payload to it. Because services on the target machine are typically implemented in various programming languages, the attack payload of a remote code execution attack may likewise involve multiple programming languages, such as Java, PHP and JavaScript.

Current security products, such as Web application firewalls (WAF), include detection of remote code execution attacks. The detection techniques are typically based on signatures or on machine learning. Signature-based detection responds to threats by implementing rules for specific vulnerabilities or attack behaviors in order to block malicious traffic. However, these rules must be adjusted continually to cope with evolving threats. Attack payloads in different programming languages usually require different rules, so the resulting rule sets become complex and difficult to maintain, and administrators need extensive attack and defense experience. In addition, when facing zero-day attacks, these detection methods may produce many false positives and false negatives, which degrades product performance. Machine-learning-based approaches have an advantage over signature- or rule-based approaches because they can address zero-day attacks and are easier to configure and update. However, their performance depends on labeled attack data, and detecting remote code execution attacks across multiple programming languages often requires training multiple models, which further increases the difficulty of acquiring training data and thus limits model performance.

CN115473734A proposes a remote code execution attack detection method based on one-class classification and federated learning. The input text data is first preprocessed, the preprocessed data packet is normalized according to a special keyword mapping table, and the normalized data packet is converted into vector form using pre-trained word vectors. A single-machine feature extraction model is then constructed based on TextCNN, a federated learning framework is built, and the single-machine feature extraction model is connected to it. Finally, an anomaly detection model is trained with the One-Class SVM algorithm on top of the single-machine feature extraction model to judge whether a data packet is a remote code execution attack packet.

CN114285641A proposes a network attack detection method and device, an electronic device, and a storage medium. A network request text to be detected is first obtained and subjected to generalized word segmentation to obtain the corresponding request text word sequence. The word sequence is then input into a multi-fragment question-answering model, which outputs the sensitive fragments in the network request text and the attack type corresponding to each sensitive fragment.

CN115766153A proposes an attack detection method, device, equipment and storage medium. The traffic to be detected is obtained and converted into text data by a preset data conversion operation. A trained BERT model then extracts features from the text data to obtain a feature vector matrix, which is input into a trained TextCNN model for traffic classification. Whether attack traffic exists is judged from the resulting traffic classification.

CN114297640A proposes an attack detection method, device, medium and apparatus. A request sequence to be detected is obtained and segmented into words, the words are encoded to obtain the corresponding sequence codes, and features are extracted from the sequence codes to obtain a first hidden layer feature. A pre-trained segment locator is then obtained, features are extracted from the first hidden layer feature and the segment locator to obtain a second hidden layer feature, and regression on the second hidden layer feature finally yields the start position, end position and type of each attack segment in the request sequence.

The existing detection methods still have several defects:

1. Attack types are numerous and complex in form, so rule matching can hardly cover all attacks and is easy to bypass; the rule base is large and costly to maintain, and enabling too many rules can slow the service response; unknown attacks are difficult to prevent, and protection lags behind new threats;

2. With anomaly detection methods, normal behavior is difficult to model and the false alarm rate is high;

3. The performance of machine-learning-based methods is often limited by the attack data set; for multi-language remote code execution attack detection in particular, the lack of attack data seriously affects model performance;

4. Attack payloads contain many special characters, so the word segmentation used by existing machine learning methods causes a serious out-of-vocabulary problem, and the context and word order of HTTP statements cannot be handled effectively.

Disclosure of Invention

This section is intended to outline some aspects of embodiments of the invention and to briefly introduce some preferred embodiments. Some simplifications or omissions may be made in this section as well as in the description summary and in the title of the application, to avoid obscuring the purpose of this section, the description summary and the title of the invention, which should not be used to limit the scope of the invention.

The present invention has been made in view of the above-mentioned problems occurring in the prior art.

Accordingly, the present invention is directed to a method and apparatus for detecting multi-language remote code execution attacks, which solve the existing problems.

The invention provides the following technical scheme:

A multi-language-oriented remote code execution attack detection method comprises the following steps:

S1: acquiring the traffic to be detected, and converting it into text data by a preprocessing method;

S2: converting the text data into numerical form using a trained byte-level byte pair encoding model;

S3: constructing a pre-training model, extracting features from the numericalized text data, and generating feature vectors;

S4: constructing a classification network based on the pre-training model, performing fine-tuning training, and classifying the feature vectors to determine whether the traffic to be detected contains a remote code execution attack;

S5: model optimization and high-performance deployment.

As a preferred scheme of the multi-language-oriented remote code execution attack detection method according to the present invention, the step S1 specifically includes the following steps:

S1.1: segmenting the acquired traffic to be detected, and extracting the content to be detected, such as request parameter values;

S1.2: performing a cyclic decoding operation on the content to be detected, including URL decoding, Base64 decoding, Unicode decoding and HTML decoding, and stopping once a decoding pass leaves the data unchanged.
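The cyclic decoding of step S1.2 can be sketched as a fixed-point loop. The following is a minimal illustration, not the patented implementation: the function names, the order of decoders, the Base64 heuristic (only accept a decode that yields printable UTF-8) and the `max_rounds` safety limit are all assumptions made here.

```python
import base64
import binascii
import html
import re
from urllib.parse import unquote

# Heuristic patterns (assumptions of this sketch, not taken from the patent).
_B64_RE = re.compile(r"^[A-Za-z0-9+/]{8,}={0,2}$")
_UNICODE_RE = re.compile(r"\\u([0-9a-fA-F]{4})")


def _try_base64(text):
    """Decode Base64 only when the result is printable UTF-8 text."""
    if len(text) % 4 != 0 or not _B64_RE.match(text):
        return text
    try:
        decoded = base64.b64decode(text, validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        return text
    return decoded if decoded.isprintable() else text


def _decode_unicode_escapes(text):
    """Replace literal \\uXXXX escapes with the characters they denote."""
    return _UNICODE_RE.sub(lambda m: chr(int(m.group(1), 16)), text)


def decode_once(text):
    """One pass of URL, Base64, Unicode-escape and HTML-entity decoding."""
    text = unquote(text)
    text = _try_base64(text)
    text = _decode_unicode_escapes(text)
    return html.unescape(text)


def cyclic_decode(text, max_rounds=10):
    """Repeat decoding until a pass leaves the data unchanged."""
    for _ in range(max_rounds):
        decoded = decode_once(text)
        if decoded == text:
            break
        text = decoded
    return text


print(cyclic_decode("%253C%253Fphp"))  # double URL-encoded "<?php"
```

The round limit guards against pathological inputs; a production decoder would also need a more careful Base64 heuristic, since short alphanumeric strings can be valid Base64 by accident.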

As a preferred scheme of the multi-language-oriented remote code execution attack detection method according to the present invention, the step S2 specifically includes the following steps:

S2.1: collecting a large amount of text and code data from public sources as a training data set;

S2.2: constructing the byte-level byte pair encoding model; compared with the byte pair encoding model, the byte-level model represents words as byte sequences rather than character sequences, which effectively solves the out-of-vocabulary problem;

S2.3: converting the text data to be detected into byte sequences, and aggregating the byte sequences with the merge rules of the trained byte-level byte pair encoding model to generate the corresponding token sequences;

S2.4: converting the token sequence into a numerical sequence according to the vocabulary mapping generated by the byte-level byte pair encoding model.

As a preferred scheme of the multi-language-oriented remote code execution attack detection method according to the present invention, the step S3 specifically includes the following steps:

S3.1: collecting code data in a plurality of programming languages and natural language text data as a training data set;

S3.2: constructing a multi-layer Transformer model with 6 layers, each layer having 12 self-attention heads of size 64, a hidden dimension of 768, a feed-forward inner layer size of 3072, and a total of 84M parameters;

S3.3: training the pre-training model with a replaced token detection mechanism; the model mainly trains two neural networks, a generator and a discriminator, each consisting mainly of an encoder that maps an input sequence into a corresponding vector representation.

As a preferred scheme of the multi-language-oriented remote code execution attack detection method according to the present invention, the step S4 specifically includes the following steps:

S4.1: collecting a data set for attack detection; attack detection is converted into code identification, and code data and text data are collected as the training data set; in order to identify multi-language remote code execution attacks, such as Java, PHP, ASP and JavaScript code execution, the training data set is built from five classes of samples: Java code, PHP code, ASP code, JavaScript code and text data;

S4.2: loading the pre-training model trained in step S3, and attaching a fully connected layer and a Softmax layer to the output of the pre-training model to form a classification network;

S4.3: training the classification network on the five-class sample data set to fine-tune the parameters, and finally outputting a detection model.

As a preferred scheme of the multi-language-oriented remote code execution attack detection method according to the present invention, the step S5 specifically includes the following steps:

S5.1: applying int16 quantization to the trained detection model to reduce the model size;

S5.2: exporting the model parameters and storing them in ONNX format;

S5.3: implementing the data processing and the invocation of the ONNX model in C/C++.

As a preferred scheme of the multi-language-oriented remote code execution attack detection method according to the present invention, in the step S2.2, the byte-level byte pair encoding model is built as follows:

S2.2-1: extracting the words and their frequencies from the given data set, and setting the vocabulary size to 51200;

S2.2-2: splitting each word into a sequence of bytes, the distinct bytes forming the initial vocabulary;

S2.2-3: adding the distinct bytes of all byte sequences to the vocabulary, then selecting the most frequent pair of adjacent bytes and merging it into a new vocabulary entry;

S2.2-4: repeating step S2.2-3 until the vocabulary reaches the set size.
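Steps S2.2-1 to S2.2-4 can be illustrated with a small training loop. This is a minimal sketch under assumptions of this example: whitespace word splitting, a deterministic tie-break between equally frequent pairs, and the names `train_bbpe` and `merge_pair`, none of which come from the patent. A real vocabulary of 51200 entries would be trained on a large corpus with an optimized library.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Merge every adjacent occurrence of `pair` in a tuple of byte strings."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bbpe(corpus, vocab_size):
    """Return (vocab, merges) learned from an iterable of text lines."""
    # S2.2-1: words and their frequencies.
    word_freq = Counter(word for line in corpus for word in line.split())
    # S2.2-2: each word becomes a sequence of single bytes.
    words = {
        tuple(bytes([b]) for b in word.encode("utf-8")): freq
        for word, freq in word_freq.items()
    }
    vocab = sorted({sym for symbols in words for sym in symbols})
    merges = []
    # S2.2-3 / S2.2-4: merge the most frequent adjacent pair until full.
    while len(vocab) < vocab_size:
        pair_freq = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_freq[pair] += freq
        if not pair_freq:
            break  # nothing left to merge
        best = max(pair_freq, key=lambda p: (pair_freq[p], p))
        merges.append(best)
        vocab.append(best[0] + best[1])
        merged_words = {}
        for symbols, freq in words.items():
            merged = merge_pair(symbols, best)
            merged_words[merged] = merged_words.get(merged, 0) + freq
        words = merged_words
    return vocab, merges


vocab, merges = train_bbpe(["eval eval eval", "val al"], vocab_size=7)
print(merges)
```

On this toy corpus the learned merges are ('a','l'), then ('v','al'), then ('e','val'), so the vocabulary ends up containing the whole word 'eval'.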

In the step S3.2, the input format of the Transformer model is "<s>, t1, t2, t3, ..., tn, </s>", where <s> marks the start of the input and </s> marks the end of the input.

As a preferred scheme of the multi-language-oriented remote code execution attack detection method, in the step S4.3, the detection model identifies five classes of samples with the label mapping ("Text": 0, "Java": 1, "PHP": 2, "JavaScript": 3, "ASP": 4). The model first predicts the probability that the detected sample belongs to each class. For example, if the predicted class probabilities are [9.3603414e-04, 6.2562805e-04, 9.9600178e-01, 1.0281029e-03, 1.4084997e-03], the largest probability is 9.9600178e-01, so the label of the sample is predicted to be 2, indicating that a PHP remote code execution attack is present.

A multi-language-oriented remote code execution attack detection device, comprising:

a traffic analysis module, used for parsing the input traffic data to extract the data to be detected, and cyclically decoding that data to output the text data finally used for detection;

a byte-level byte pair encoding module, used for tokenizing the text data to be detected and converting it into a numerical representation;

a pre-training module, used for converting the input data into vector representations;

a detection module, used for classifying the input vector to determine whether the input contains a remote code execution attack statement and, if so, the programming language of the attack payload;

a deployment module, used for optimizing the model to reduce its size and improve detection efficiency, so that a fast response can be achieved under resource-constrained conditions.

The invention has the following beneficial effects:

1. Multi-language remote code execution detection is converted into a multi-programming-language identification problem, which greatly alleviates the shortage of labeled data sets and widens detection coverage; a single trained model detects multiple classes of remote code execution attacks, including Java, PHP, ASP and JavaScript code execution;

2. Rules and features do not need to be defined manually, which greatly reduces labor cost, and the method can be used to detect attacks exploiting unknown vulnerabilities;

3. The combination of the byte-level byte pair encoding model and the pre-training model handles word ambiguity effectively, avoids the out-of-vocabulary problem, and processes context and word order information better, so that false positives and false negatives are greatly reduced;

4. The model optimization scheme and the general deployment scheme allow a fast response under various resource-constrained conditions, and a single detection can complete within 10 milliseconds.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. Wherein:

FIG. 1 is a flow chart of a method for detecting multi-language remote code execution attacks according to the present invention;

FIG. 2 is a schematic diagram of the aggregation operation of the byte-level byte pair encoding model according to the present disclosure;

FIG. 3 is a diagram of an exemplary process for detecting a remote code execution attack in accordance with the present disclosure;

FIG. 4 is a schematic diagram of a multi-language-oriented remote code execution attack detection device according to the present disclosure.

Detailed Description

In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways other than those described herein, and persons skilled in the art will readily appreciate that the present invention is not limited to the specific embodiments disclosed below.

Further, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic can be included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.

Further, in describing the embodiments of the present invention in detail, the cross-sectional view of the device structure is not partially enlarged to a general scale for convenience of description, and the schematic is only an example, which should not limit the scope of protection of the present invention. In addition, the three-dimensional dimensions of length, width and depth should be included in actual fabrication.

Referring to FIGS. 1-4, the present invention provides a multi-language-oriented remote code execution attack detection method, comprising the following steps:

S1: acquiring the traffic to be detected, and converting it into text data by a preprocessing method, specifically as follows:

S1.1: segmenting the acquired traffic to be detected, and extracting the content to be detected, such as request parameter values.
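As an illustration of step S1.1, the sketch below pulls parameter values out of a request target and a form-encoded body using the Python standard library. The function name and the assumption that the body is `application/x-www-form-urlencoded` are choices made for this example; the patent does not specify which parts of the traffic are extracted beyond "request parameter values".

```python
from urllib.parse import parse_qsl, urlsplit


def extract_parameter_values(request_target, body=""):
    """Collect parameter values from the query string and a form body.

    Note that parse_qsl already applies one round of percent-decoding,
    so any later cyclic decoding starts from that state.
    """
    query = urlsplit(request_target).query
    values = [value for _, value in parse_qsl(query, keep_blank_values=True)]
    values += [value for _, value in parse_qsl(body, keep_blank_values=True)]
    return values


print(extract_parameter_values("/index.php?id=1&cmd=%3C%3Fphp", "a=b"))
```

Each extracted value would then be passed on to the decoding and tokenization steps individually.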

S2: converting the text data into numerical form using the trained byte-level byte pair encoding model, specifically as follows:

S2.1: a large amount of text and code data is collected from public sources as a training data set.

S2.2: the byte-level byte pair encoding model is constructed. Compared with the byte pair encoding model, the byte-level model represents words as byte sequences rather than character sequences, which effectively solves the out-of-vocabulary problem.

The steps of the byte-level byte pair coding model are as follows:

s2.2-4: repeating the step S2.2-3 until the vocabulary size is satisfied.

S2.3: the text data to be detected is converted into byte sequences, and the byte sequences are aggregated with the merge rules of the trained byte-level byte pair encoding model to generate the corresponding token sequences.

FIG. 2 shows a schematic diagram of the aggregation operation of the byte-level byte pair encoding model. All operations are performed at the byte level, and aggregation follows the merge policy trained in step S2.2, which includes 51739 rules. The word 'eval' is first split into a sequence of four bytes ('e', 'v', 'a', 'l'). According to the rules, the bytes 'a' and 'l' are first merged into 'al', then 'v' and 'al' are merged into 'val', and finally 'e' and 'val' are merged into 'eval'. The input 'eval' is therefore tokenized into a sequence containing a single token ('eval').
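The 'eval' walk-through can be reproduced with a few lines of code. The sketch below applies an ordered list of merge rules to the bytes of a word, always choosing the applicable rule with the highest priority (lowest index). The three rules shown are hypothetical stand-ins for the trained rule set, and the function name is an assumption of this example.

```python
def bbpe_tokenize(word, merges):
    """Apply ordered merge rules to the UTF-8 bytes of `word`."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = [bytes([b]) for b in word.encode("utf-8")]
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Pick the adjacent pair whose rule was learned earliest.
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no rule applies to any remaining pair
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical rules matching the order described in the text.
merges = [(b"a", b"l"), (b"v", b"al"), (b"e", b"val")]
print(bbpe_tokenize("eval", merges))  # [b'eval']
print(bbpe_tokenize("evil", merges))  # [b'e', b'v', b'i', b'l']
```

Because every string decomposes into bytes, a word with no applicable rule still yields a valid token sequence, which is why this scheme has no out-of-vocabulary case.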

FIG. 3 shows an exemplary flow for detecting a remote code execution attack. The input to be detected, "<?php@eval($_POST['param']);?>", is first converted into the token sequence ['<s>', '<?', 'php', '@', 'eval', '($_', 'POST', "['", 'param', "']);", '?>', '</s>'], which is then converted into a numerical sequence according to the vocabulary mapping; the mapping contains 51200 entries. Before the conversion, the length of the token sequence is checked against a fixed sequence length of 512. If the sequence is shorter, the special token '<pad>' is appended until the sequence reaches that length; if it is longer, it is truncated. The numerical value of '<pad>' is 1, so when the token sequence is converted into a numerical sequence, the missing positions are all filled with the number 1.
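The mapping, truncation and padding described above can be sketched as follows. Only the fixed length of 512 and the '<pad>' id of 1 come from the text; the other ids in the toy vocabulary and the function name are hypothetical, and the text does not say whether the end marker is preserved when a sequence is truncated.

```python
MAX_LEN = 512  # fixed sequence length stated in the text
PAD_ID = 1     # numerical value of '<pad>' stated in the text


def encode(tokens, vocab, max_len=MAX_LEN, pad_id=PAD_ID):
    """Map tokens to ids, then truncate or pad to exactly `max_len`."""
    ids = [vocab[token] for token in tokens]
    ids = ids[:max_len]                      # truncate long sequences
    ids += [pad_id] * (max_len - len(ids))   # pad short sequences
    return ids


tokens = ["<s>", "<?", "php", "@", "eval", "($_", "POST",
          "['", "param", "']);", "?>", "</s>"]

# Hypothetical toy vocabulary: only the '<pad>' id is taken from the text.
vocab = {"<s>": 0, "<pad>": 1, "</s>": 2}
for token in tokens:
    vocab.setdefault(token, len(vocab))

ids = encode(tokens, vocab)
print(len(ids), ids[:14])
```

The first 12 positions hold the ids of the real tokens and the remaining 500 positions are filled with 1.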

S3: constructing a pre-training model, extracting features from the numericalized text data, and generating feature vectors, specifically as follows:

S3.1: code data in a plurality of programming languages and natural language text data are collected as a training data set.

S3.2: a multi-layer Transformer model is constructed. To balance detection accuracy and efficiency, the invention uses 6 layers, each with 12 self-attention heads of size 64, a hidden dimension of 768, a feed-forward inner layer size of 3072, and a total of 84M parameters.
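A back-of-the-envelope parameter count helps sanity-check the stated model size. The sketch below counts the weights of a standard Transformer encoder with the dimensions given above, plus a 51200-entry token embedding and 512 learned position embeddings. The exact layout (biases, layer norms, learned positions, no output head) is an assumption of this example, since the text does not itemize what the 84M figure includes.

```python
def transformer_encoder_params(vocab=51200, hidden=768, layers=6,
                               ffn=3072, max_pos=512):
    """Count parameters of a standard Transformer encoder (assumed layout)."""
    # Token embeddings, learned position embeddings, embedding layer norm.
    embeddings = vocab * hidden + max_pos * hidden + 2 * hidden
    # Query, key, value and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two feed-forward projections with biases.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    # Two layer norms per layer, each with scale and shift.
    layer_norms = 2 * (2 * hidden)
    per_layer = attention + feed_forward + layer_norms
    return embeddings + layers * per_layer


assert 12 * 64 == 768  # 12 heads of size 64 span the hidden dimension
total = transformer_encoder_params()
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions the count is 82,243,584, about 82M, which is of the same order as the stated 84M; the remaining difference may come from components not modeled here, such as output heads or extra special-token and position entries.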

The input format of the Transformer model is "<s>, t1, t2, t3, ..., tn, </s>", where <s> marks the start of the input and </s> marks the end of the input.

S3.3: the pre-training model is trained using a replaced token detection mechanism. The model mainly trains two neural networks, a generator and a discriminator. Each network consists essentially of one encoder that maps an input sequence into a corresponding vector representation.

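The generator encodes the input token sequence $x = [x_1, \ldots, x_n]$ into the contextual representations $h_G(x) = [h_1, \ldots, h_n]$. The probability that the generator outputs the token $x_t$ at position $t$ is then:

$$p_G(x_t \mid x) = \frac{\exp\left(e(x_t)^T h_G(x)_t\right)}{\sum_{x'} \exp\left(e(x')^T h_G(x)_t\right)}$$

where $e$ denotes the token embedding vector. The discriminator encodes the input token sequence $x = [x_1, \ldots, x_n]$ into $h_D(x) = [h_1, \ldots, h_n]$. The probability with which the discriminator predicts that the token at position $t$ has been replaced is then:

$$D(x, t) = \mathrm{sigmoid}\left(w^T h_D(x)_t\right)$$

where $w$ is a model parameter.

Let $x^{masked}$ denote the sentence obtained by masking randomly chosen token positions in the original input, and $x^{corrupt}$ the replaced sentence produced by the generator. The loss functions of the model are:

$$\mathcal{L}_G = \mathbb{E}\left[\sum_{i \in m} -\log p_G\left(x_i \mid x^{masked}\right)\right]$$

$$\mathcal{L}_D = \mathbb{E}\left[\sum_{t=1}^{n} -\mathbb{1}\left(x^{corrupt}_t \neq x_t\right) \log D\left(x^{corrupt}, t\right) - \mathbb{1}\left(x^{corrupt}_t = x_t\right) \log\left(1 - D\left(x^{corrupt}, t\right)\right)\right]$$

where $m$ is the set of masked positions, $\mathcal{L}_G$ is the loss function of the generator and $\mathcal{L}_D$ is the loss function of the discriminator. The optimization objective of the final model is

$$\min_{\theta_G, \theta_D} \; \mathcal{L}_G + \mathcal{L}_D$$

that is, the sum of the generator loss function and the discriminator loss function is minimized, with $\theta_G$ and $\theta_D$ the parameters of the generator and the discriminator.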
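To make the two loss terms concrete, the sketch below evaluates them for a single sequence in plain Python, assuming the usual cross-entropy forms: the generator is penalized by the negative log-probability it assigns to the original tokens at the masked positions, and the discriminator by a binary cross-entropy over "replaced" versus "original" at every position, using D(x, t) = sigmoid(w^T h_D(x)_t) as the probability of replacement. Function names and the toy inputs are assumptions of this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def generator_loss(probs_of_original_at_masked):
    """Sum of -log p_G(x_i | x_masked) over the masked positions."""
    return -sum(math.log(p) for p in probs_of_original_at_masked)


def discriminator_loss(logits, replaced):
    """Binary cross-entropy; logits[t] stands for w^T h_D(x_corrupt)_t."""
    loss = 0.0
    for z, is_replaced in zip(logits, replaced):
        d = sigmoid(z)  # predicted probability that token t was replaced
        loss += -math.log(d) if is_replaced else -math.log(1.0 - d)
    return loss


def total_loss(probs_of_original_at_masked, logits, replaced):
    """Objective for one sequence: generator loss plus discriminator loss."""
    return (generator_loss(probs_of_original_at_masked)
            + discriminator_loss(logits, replaced))


# Toy example: one masked position, two tokens seen by the discriminator.
print(total_loss([0.5], logits=[0.0, 0.0], replaced=[True, False]))
```

In practice both losses are averaged over a batch and minimized jointly by gradient descent; only the discriminator encoder is kept for the downstream classifier in the standard form of this scheme, although the text does not state which encoder is reused.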

S4: constructing a classification network based on the pre-training model, performing fine-tuning training, and classifying the feature vectors to determine whether the traffic to be detected contains a remote code execution attack, specifically as follows:

S4.1: a data set for attack detection is collected. Labeled attack data is often scarce. Considering that attack behavior arises because the traffic data contains attack code, the invention converts attack detection into code identification and collects code data and text data as the training data set. In order to identify multi-language remote code execution attacks, such as Java, PHP, ASP and JavaScript code execution, the training data set is built from five classes of samples: Java code, PHP code, ASP code, JavaScript code and text data.

S4.2: the pre-training model trained in step S3 is loaded, and a fully connected layer and a Softmax layer are attached to the output of the pre-training model to form a classification network.
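The classification head of step S4.2 amounts to one affine map followed by a softmax. The sketch below shows that computation in plain Python on a toy two-dimensional feature vector. In the described model the feature vector would have 768 dimensions and the weights would be learned during fine-tuning; how the encoder output is pooled into a single vector is not stated in the text and is assumed here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [value / total for value in exps]


def classify(feature, weights, bias):
    """Fully connected layer (one weight row per class) followed by softmax."""
    logits = [
        sum(w * x for w, x in zip(row, feature)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Toy parameters for 5 classes and a 2-dimensional feature (hypothetical).
weights = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
bias = [0.0] * 5
probs = classify([1.0, 2.0], weights, bias)
print([round(p, 4) for p in probs])
```

The output is a probability distribution over the five classes, which is the form consumed by the label decision in step S4.3.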

S4.3: the classification network is trained on the five-class sample data set to fine-tune the parameters, and a detection model is finally output.

The detection model identifies five classes of samples with the label mapping ("Text": 0, "Java": 1, "PHP": 2, "JavaScript": 3, "ASP": 4). The model first predicts the probability that the detected sample belongs to each class. As shown in FIG. 3, the predicted class probabilities for the input are [9.3603414e-04, 6.2562805e-04, 9.9600178e-01, 1.0281029e-03, 1.4084997e-03]. The largest probability is 9.9600178e-01, so the label of the sample is predicted to be 2, indicating that a PHP remote code execution attack is present.
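The decision rule in this example is an argmax over the class probabilities followed by a lookup in the label mapping. The sketch below applies it to the probabilities quoted above. Treating label 0 ("Text") as the benign class is an interpretation adopted here, following from the conversion of attack detection into code identification; the function name is an assumption.

```python
LABELS = {0: "Text", 1: "Java", 2: "PHP", 3: "JavaScript", 4: "ASP"}


def decide(probs):
    """Return (label id, label name, is_attack) for a probability vector."""
    label = max(range(len(probs)), key=probs.__getitem__)
    # Assumption: any code class (label != 0) is reported as an attack.
    return label, LABELS[label], label != 0


probs = [9.3603414e-04, 6.2562805e-04, 9.9600178e-01,
         1.0281029e-03, 1.4084997e-03]
print(decide(probs))  # (2, 'PHP', True)
```

A deployment could additionally require the winning probability to exceed a threshold before raising an alert, but no such threshold is described in the text.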

S5: model optimization and high-performance deployment, specifically as follows:

S5.1: int16 quantization is applied to the trained detection model, which reduces the model size by about half;

S5.2: the model parameters are exported and stored in ONNX format;

S5.3: the data processing and the invocation of the ONNX model are implemented in C/C++, which is more efficient than calling the model from Python and achieves millisecond-level response times.
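The size reduction from int16 quantization can be illustrated with a simple symmetric scheme: each 32-bit floating-point weight is replaced by a 16-bit integer plus one shared scale factor. The text does not specify which quantization scheme is used, so the per-tensor symmetric scaling and the function names below are assumptions of this sketch.

```python
from array import array


def quantize_int16(weights):
    """Symmetric per-tensor quantization of floats to int16 values."""
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid a zero scale
    scale = max_abs / 32767
    quantized = array("h", (round(w / scale) for w in weights))
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from int16 values."""
    return [value * scale for value in quantized]


weights = [0.5, -1.0, 0.25, 0.0]
quantized, scale = quantize_int16(weights)
restored = dequantize(quantized, scale)

float32_bytes = array("f", weights).itemsize * len(weights)
int16_bytes = quantized.itemsize * len(quantized)
print(float32_bytes, int16_bytes)  # storage before and after quantization
print(max(abs(a - b) for a, b in zip(weights, restored)))  # rounding error
```

Storage drops from 4 bytes to 2 bytes per weight, and the reconstruction error per weight is bounded by about half the scale factor.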

FIG. 4 shows a schematic diagram of the device of the present invention, which includes a traffic analysis module, a byte-level byte pair encoding module, a pre-training module, a detection module, and a deployment module.

The traffic analysis module is used for parsing the input traffic data to extract the data to be detected, and cyclically decoding that data to output the text data finally used for detection.

The byte-level byte pair encoding module is used for tokenizing the text data to be detected and converting it into a numerical representation.

The pre-training module is used for converting the input data into vector representations.

The detection module is used for classifying the input vector to determine whether the input contains a remote code execution attack statement and, if so, the programming language of the attack payload.

The deployment module is used for optimizing the model to reduce its size and improve detection efficiency, so that a fast response can be achieved under resource-constrained conditions.

It should be noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that the technical solution of the present invention may be modified or substituted without departing from the spirit and scope of the technical solution of the present invention, which is intended to be covered in the scope of the claims of the present invention.

Claims

1. The multi-language remote code execution attack detection method is characterized by comprising the following steps of:

s5: model optimization and high performance deployment.

2. The method for detecting the attack of the multi-language-oriented remote code execution according to claim 1, wherein the step S1 specifically comprises the following steps:

3. The method for detecting the attack by the multi-language remote code execution according to claim 1, wherein the step S2 specifically comprises the following steps:

4. The method for detecting the attack by the multi-language remote code execution according to claim 1, wherein the step S3 specifically comprises the following steps:

5. The method for detecting the attack by the multi-language remote code execution according to claim 1, wherein the step S4 specifically comprises the following steps:

6. The method for detecting the attack by the multilingual remote code according to claim 1, wherein the step S5 comprises the steps of:

7. A method for attack detection by multi-lingual remote code execution according to claim 3, wherein in said step S2.2, the step of byte-level byte pair coding model is as follows:

s2.2-4: repeating the step S2.2-3 until the vocabulary size is satisfied.

8. The method according to claim 4, wherein in the step S3.2, the input format of the Transformer model is "<s>, t1, t2, t3, ..., tn, </s>", where <s> represents the start flag of the input and </s> represents the end flag of the input.

9. The method for detecting the attack by the multi-language remote code execution according to claim 5, wherein in the step S4.3, the detection model identifies five classes of samples, and the label mapping relationship is: ("Text": 0, "Java": 1, "PHP": 2, "JavaScript": 3, "ASP": 4); the model first predicts the probability that the detected sample belongs to each class; when the predicted class probabilities of the input are [9.3603414e-04, 6.2562805e-04, 9.9600178e-01, 1.0281029e-03, 1.4084997e-03], the largest probability is 9.9600178e-01, so the label of the sample is predicted to be 2, indicating that a PHP remote code execution attack is present.

10. A multi-language oriented remote code execution attack detection device, comprising: