CN112968872A - Malicious flow detection method, system and terminal based on natural language processing

Info

Publication number: CN112968872A
Application number: CN202110127620.1A
Authority: CN
Inventors: 杨昊; 何琴; 文武; 谢安琪
Original assignee: Chengdu University of Information Technology
Current assignee: Chengdu University of Information Technology
Priority date: 2021-01-29
Filing date: 2021-01-29
Publication date: 2021-06-15
Anticipated expiration: 2041-01-29
Also published as: CN112968872B

Abstract

The invention belongs to the technical field of malicious traffic detection and discloses a malicious traffic detection method, system and terminal based on natural language processing. A pcap packet is extracted with the tshark tool to obtain an encrypted traffic data set; black and white labels are applied to the black and white sample data respectively; duplicate data are removed and the sample index is shuffled; a TF-IDF model is established and used to reconstruct the features of the encrypted traffic data set; machine learning models are established and trained on the positive and negative samples of the data set; a deep learning model is established; the parameters of each model are tuned and each model is trained; and each machine learning model is evaluated using the ROC curve and the AUC value, with encrypted malicious traffic detected by combining TF-IDF with ensemble learning. The method represents the encrypted traffic fields by a text classification method, generalizes better, and later model improvements are not constrained by how information is extracted from the encrypted traffic data.

Description

Malicious flow detection method, system and terminal based on natural language processing

Technical Field

The invention belongs to the technical field of malicious traffic detection, and particularly relates to a malicious traffic detection method, system and terminal based on natural language processing.

Background

In recent years, the proliferation of encrypted communications has changed the threat model, and many conventional rule-based methods are no longer as effective as before. As businesses become increasingly digital, a large number of services and applications use encryption as their primary means of protecting information. According to NetMarketShare data, encrypted Web traffic exceeded ninety percent of all Web traffic in October 2019. However, businesses are not the only ones that benefit from cryptographic techniques: adversaries can also use them to evade detection and to protect their malicious activities.

At present, encrypted malicious traffic data can be modeled and detected with various machine learning or deep learning methods. Before modeling, however, the handling of traffic packets is a crucial issue. A traffic packet typically contains a number of fields, such as IP address, port number, MAC address and various protocol fields. Some of these fields directly affect the later training of the model, while others are redundant for training. In an actual project, a fully qualified network professional therefore often has to be hired to analyze and process the malicious traffic data. In practice, processing the traffic packets is often one of the core tasks of a project: on the one hand, hiring the necessary manpower is costly; on the other hand, erroneous data processing directly degrades the detection performance of the model.

Detection of encrypted malicious traffic has long been a focus of attention in the field of network security. The current mainstream detection approaches are statistical methods, pattern matching methods, and machine learning methods. Statistical methods use metadata of the data stream, including packet length and inter-arrival time, and can detect malware on TLS connections without decrypting the traffic. However, methods based on statistical learning have low detection accuracy, cannot guarantee correct detection of most malicious traffic, and are slower than machine learning methods. Pattern matching methods have long been applied to network traffic classification. Because they need to read the contents of the data packet, and encrypted data is difficult to read, they face obstacles, and they must also cope with multi-GB connections and scale to a large number of signatures. Machine learning methods do not need to decrypt the traffic: features are extracted from the traffic data by feature engineering, and an algorithm is then selected for model training and detection. Although they are fast, the traffic data must first be analyzed and processed, which consumes labor and time.

Through the above analysis, the problems and defects of the prior art are as follows:

(1) In an actual project, detecting encrypted malicious traffic data often requires hiring a fully qualified network professional to analyze and process the data; hiring the necessary manpower is costly, and erroneous data processing directly degrades the detection performance of the model.

(2) The method based on statistical learning has low detection accuracy, cannot ensure the correct detection of most malicious traffic, and has lower speed compared with a machine learning method.

(3) The pattern matching method needs to read the content of the data packet, and it is difficult to read the encrypted data, so it faces some obstacles, and needs to overcome the problems of processing multi-GB connection and scalability supporting a large number of signatures.

(4) Although the machine learning method does not need to decrypt the encrypted traffic and is fast, it needs to analyze and process the traffic data, which consumes labor and time.

The significance of solving the problems and the defects is as follows:

During detection, the data fields themselves do not need close attention, which greatly saves manpower; the detection results are more accurate and suitable for practical engineering.

Disclosure of Invention

Aiming at the problems in the prior art, the invention provides a malicious flow detection method, a malicious flow detection system and a malicious flow detection terminal based on natural language processing.

The invention is realized as follows. A malicious traffic detection method based on natural language processing comprises the following steps:

step one, extracting a pcap packet with the tshark tool to obtain an encrypted traffic data set;

step two, marking black and white labels on the black and white sample data respectively;

step three, removing duplicate data and shuffling the index of the sample data;

step four, establishing a TF-IDF model and performing feature reconstruction on the encrypted traffic data set;

step five, establishing machine learning algorithm models and training them on the positive and negative samples of the data set;

step six, establishing a deep learning model, constructed using a one-dimensional convolutional CNN;

step seven, adjusting the parameters of each model and training each model;

step eight, evaluating each machine learning model using the ROC curve and the AUC value, and detecting encrypted malicious traffic by combining TF-IDF with ensemble learning.

Further, in step four, the establishing a TF-IDF model and performing feature reconstruction on the encrypted traffic data set includes:

(1) for data set feature processing, reconstructing the data set using a text classification method, namely the TF-IDF model, and establishing new features;

(2) after processing, obtaining a new data set that is used as the input of the subsequent machine learning algorithm models.

Further, in step five, the machine learning algorithms comprise Random Forest, AdaBoost and GradientBoosting, together with an ensemble learning model that combines these three models with an XGB model.

Further, GradientBoosting belongs to the Boosting family of ensemble learning algorithms; GradientBoosting generates weak learners serially, and each weak learner is trained to fit the negative gradient of the loss function of the previous cumulative model, so that the loss of the cumulative model decreases along the negative gradient direction after the weak learner is added.

Further, the random forest is a higher-level algorithm based on decision trees; it consists of a plurality of decision trees that do not influence one another. The forest is built from randomly constructed decision trees, each tree produces its own result in each round, and the category with the most votes is taken as the output.

Furthermore, AdaBoost belongs to the Boosting family of ensemble learning algorithms and consists of a plurality of weak learners; the model is adjusted by assigning different weights to the learners in each round.

Further, in step six, the CNN is a type of feedforward neural network that includes convolution calculation and has a deep structure.

Further, in step eight, the deep learning model is evaluated using the loss value and the accuracy value.

Another object of the present invention is to provide a malicious traffic detection system based on natural language processing, which includes:

the encryption flow data set acquisition module is used for extracting the pcap packet by utilizing a tshark tool to obtain an encryption flow data set;

the black and white label acquisition module is used for marking black and white labels on the black and white sample data respectively;

the repeated data removing module is used for removing repeated data and disturbing the sample data index;

the characteristic reconstruction module is used for establishing a TF-IDF model and performing characteristic reconstruction on the encrypted flow data set;

the positive and negative sample training module is used for establishing a machine learning algorithm model and training positive and negative samples of the data set;

the deep learning model building module is used for building a deep learning model and building the model by adopting one-dimensional convolution CNN;

the encrypted malicious traffic detection module is used for adjusting the parameters of each model and training each model; it is also used for evaluating each machine learning model using the ROC curve and the AUC value, and for detecting encrypted malicious traffic by combining TF-IDF with ensemble learning.

Another object of the present invention is to provide an information data processing terminal including a memory and a processor, the memory storing a computer program, the computer program, when executed by the processor, causing the processor to execute the malicious traffic detection method based on natural language processing.

Another object of the present invention is to provide a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to execute the malicious traffic detection method based on natural language processing.

By combining all the technical schemes, the invention has the advantages and positive effects that:

To address the problems that traditional detection methods cannot detect encrypted traffic and that machine learning methods require effort to extract features, the invention provides a method for detecting encrypted malicious traffic that combines machine learning, deep learning and natural language processing. It represents the encrypted traffic fields by a text classification method without decrypting the traffic, does not need to consider the meaning of the traffic data fields, and does not lose information from the encrypted traffic data. The method is suitable not only for detecting encrypted malicious traffic but also for related detection tasks such as malicious code detection; it generalizes better and is more accurate, and later model improvements are not constrained by how information is extracted from the encrypted traffic data.

With a general machine learning data processing method, namely re-encoding the data set and then selecting a classifier for training, the detection accuracy is only about 0.5, which is low. The invention retains the information contained in the data, and its detection accuracy reaches about 0.9.

Drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of a malicious traffic detection method based on natural language processing according to an embodiment of the present invention.

Fig. 2 is a schematic diagram of a malicious traffic detection method based on natural language processing according to an embodiment of the present invention.

Fig. 3 is a flow chart of encrypted traffic data processing according to an embodiment of the present invention.

Fig. 4 shows the ROC curve and AUC value for ensemble learning according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Aiming at the problems in the prior art, the invention provides a malicious flow detection method based on natural language processing, and the invention is described in detail below with reference to the accompanying drawings.

As shown in fig. 1, a malicious traffic detection method based on natural language processing according to an embodiment of the present invention includes the following steps:

S101, extracting a pcap packet with the tshark tool to obtain an encrypted traffic data set;

S102, marking black and white labels on the black and white sample data respectively;

S103, removing duplicate data and shuffling the index of the sample data;

S104, establishing a TF-IDF model and performing feature reconstruction on the encrypted traffic data set;

S105, establishing machine learning algorithm models and training them on the positive and negative samples of the data set;

S106, establishing a deep learning model, constructed using a one-dimensional convolutional CNN;

S107, adjusting the parameters of each model and training each model;

S108, evaluating each machine learning model using the ROC curve and the AUC value, and detecting encrypted malicious traffic by combining TF-IDF with ensemble learning.

The invention also provides a malicious flow detection system based on natural language processing, which comprises:

The technical solution of the present invention is further described below with reference to examples.

Example 1

As shown in Fig. 2, the technical solution of the present invention includes:

Step 1: Extract the pcap packet with the tshark tool to obtain encrypted traffic data.
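The extraction in Step 1 can be sketched as a tshark invocation that writes one delimited row per packet. This is only an illustration: the patent does not state which fields are exported, so the field list and the file name `sample.pcap` below are assumptions, and the command is built but not executed.

```python
import shlex

# Hypothetical field list: the patent does not say which fields are exported.
FIELDS = ["frame.len", "ip.src", "ip.dst", "tcp.srcport", "tcp.dstport"]

def tshark_command(pcap_path, fields=FIELDS):
    """Build a tshark invocation that dumps one comma-separated row per packet."""
    cmd = ["tshark", "-r", pcap_path, "-T", "fields",
           "-E", "header=y", "-E", "separator=,"]
    for name in fields:
        cmd += ["-e", name]          # one -e option per exported field
    return cmd

cmd = tshark_command("sample.pcap")  # "sample.pcap" is a placeholder path
print(shlex.join(cmd))
# To actually run it (requires tshark to be installed):
#   subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```

Each output row can then be treated as one "document" of tokens in the later TF-IDF step.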

Step 2: Mark black and white labels on the black and white sample data respectively.

Step 3: Remove duplicate data and shuffle the index of the sample data.
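Steps 2 and 3 amount to labelling, de-duplicating and shuffling the rows. A minimal sketch follows, assuming each sample is a whitespace-separated string of packet fields and that black (malicious) samples are encoded as 1 and white (benign) samples as 0; the encoding, the function name and the sample rows are illustrative and not taken from the patent.

```python
import random

def label_dedupe_shuffle(black_rows, white_rows, seed=0):
    """Attach labels (1 = black, 0 = white), drop exact duplicate rows,
    and shuffle the result reproducibly."""
    labelled = [(row, 1) for row in black_rows] + [(row, 0) for row in white_rows]
    unique = list(dict.fromkeys(labelled))   # keeps the first occurrence of each pair
    random.Random(seed).shuffle(unique)      # seeded so the order is repeatable
    return unique

# Made-up rows for illustration only.
black = ["443 10.0.0.5 1514", "443 10.0.0.5 1514", "8443 10.0.0.9 60"]
white = ["443 192.168.1.2 583"]
samples = label_dedupe_shuffle(black, white)
```

Shuffling after de-duplication keeps the black and white samples from appearing in two contiguous blocks when the data is later split.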

Step 4: Establish a TF-IDF model and perform feature reconstruction on the encrypted traffic data, so that the meaning and details of the specific fields of the data packet do not need to be considered or analyzed. For data set feature processing, the data set is reconstructed using a text classification method, namely the TF-IDF model, and new features are established. After processing, a new data set is obtained and used as the input of the subsequent machine learning models. TF-IDF (Term Frequency-Inverse Document Frequency) is a commonly used weighting technique in information retrieval and text mining; the larger the TF-IDF value, the more important a word is to a document.
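The feature reconstruction in Step 4 can be sketched by treating each packet row as a short document whose field values are the words. The version below uses the plain textbook weighting, term frequency times log(N / document frequency); whether the patent uses this exact variant or a smoothed one is not stated, and the sample rows are invented.

```python
import math
from collections import Counter

def tfidf(rows):
    """Return one {token: weight} dict per row using tf * log(N / df)."""
    docs = [row.split() for row in rows]
    n_docs = len(docs)
    # Document frequency: in how many rows each token appears at least once.
    df = Counter(tok for doc in docs for tok in set(doc))
    features = []
    for doc in docs:
        counts = Counter(doc)
        features.append({tok: (c / len(doc)) * math.log(n_docs / df[tok])
                         for tok, c in counts.items()})
    return features

rows = ["443 10.0.0.5 1514", "443 10.0.0.9 60", "443 192.168.1.2 583"]
vectors = tfidf(rows)
```

A token that appears in every row, such as `443` here, receives weight zero, while tokens unique to one row receive the largest weights. This is how the representation avoids any manual judgement about which fields matter.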

Step 5: Establish machine learning algorithm models and train them on the positive and negative samples of the data set. The machine learning algorithms here include Random Forest, AdaBoost and GradientBoosting, together with an ensemble learning model that combines these three models with an XGB model. GradientBoosting belongs to the Boosting family of ensemble learning algorithms. It generates weak learners serially, and each weak learner is trained to fit the negative gradient of the loss function of the previous cumulative model, so that the loss of the cumulative model decreases along the negative gradient direction after the weak learner is added. Random forest is a higher-level algorithm based on decision trees (CART trees by default); it consists of multiple decision trees that do not interact with one another. The forest is built from randomly constructed decision trees, each tree produces its own result in each round, and the category with the most votes is taken as the output. AdaBoost belongs to the Boosting family of ensemble learning algorithms and consists of a plurality of weak learners; the model is adjusted by assigning different weights to the learners in each round.
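The negative-gradient idea behind GradientBoosting can be illustrated with a toy example. The sketch below assumes squared loss, one-dimensional inputs and decision stumps as weak learners; these choices, the learning rate and the data are illustrative and are not the patent's configuration. For squared loss the negative gradient is simply the residual, so each stump is fitted to what the cumulative model still gets wrong.

```python
def fit_stump(xs, residuals):
    """Best single-threshold regressor on 1-D inputs (least squares)."""
    best = None
    for thr in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        lv = sum(left) / len(left)
        rv = sum(right) / len(right) if right else lv
        err = sum((r - (lv if x <= thr else rv)) ** 2
                  for x, r in zip(xs, residuals))
        if best is None or err < best[0]:
            best = (err, thr, lv, rv)
    _, thr, lv, rv = best
    return lambda x: lv if x <= thr else rv

def gradient_boost(xs, ys, rounds=20, lr=0.5):
    """Squared-loss boosting: each stump fits the residuals y - F(x), which are
    the negative gradient of 0.5 * (y - F(x))**2 with respect to F(x)."""
    base = sum(ys) / len(ys)
    stumps = []

    def predict(x):
        return base + lr * sum(s(x) for s in stumps)

    for _ in range(rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # toy feature values
ys = [0, 0, 0, 1, 1, 1]               # 0 = white, 1 = black
model = gradient_boost(xs, ys)
```

Each round halves the remaining residual on this toy data, so the cumulative prediction moves steadily toward 0 for the white samples and 1 for the black samples.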

Step 6: Establish a deep learning model, constructed using a one-dimensional convolutional CNN (convolutional neural network). A CNN is a feedforward neural network that includes convolution computations and has a deep structure.
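The convolution computation that a one-dimensional CNN layer performs can be shown without any framework. The sketch below runs a single forward pass through a convolution, a ReLU activation and a max-pooling step; the kernel weights and input values are made up, and this is not the patent's network, only the basic operation such a network stacks.

```python
def conv1d(signal, kernel):
    """Valid (no padding) 1-D cross-correlation, as used in CNN layers."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def relu(values):
    """Activation layer: negative responses are clipped to zero."""
    return [max(0.0, v) for v in values]

def max_pool1d(values, size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]

features = [0.0, 0.2, 0.9, 0.1, 0.0, 0.7, 0.3]   # e.g. one row of TF-IDF weights
kernel = [1.0, -1.0, 0.5]                         # hypothetical learned weights
activated = relu(conv1d(features, kernel))
pooled = max_pool1d(activated)
```

In a trained network the kernel weights are learned, and several such layers are followed by flatten and dense layers that produce the final classification.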

Step 7: Adjust the parameters of each model and train each model.

Step 8: Evaluate each machine learning model using the ROC curve and the AUC value, and finally detect encrypted malicious traffic by combining TF-IDF with ensemble learning. The deep learning model is evaluated using the loss value and the accuracy value.
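The AUC value used in Step 8 has a simple interpretation: it is the probability that a randomly chosen black sample receives a higher score than a randomly chosen white sample. A minimal sketch of that pairwise definition follows; the labels and scores are invented for illustration.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random
    negative, counting ties as one half (Mann-Whitney U statistic)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]               # 1 = black, 0 = white
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]   # hypothetical classifier outputs
auc = roc_auc(labels, scores)
```

Here 8 of the 9 black/white pairs are ordered correctly, so the AUC is 8/9. A value of 0.5 corresponds to random guessing and 1.0 to perfect separation.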

Example 2

For the processing of encrypted traffic data, a TF-IDF model from the field of natural language processing is used to reconstruct the data; the work of the TF-IDF model is to calculate the TF value and the IDF value of each word. TF-IDF is a statistical method for evaluating the importance of a word to one document in a document set or corpus: the importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to its frequency across the corpus. Fig. 3 shows the encrypted traffic data processing flow.
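In its common textbook form, which may differ in smoothing details from the variant actually used in the embodiment, the weighting described above can be written as follows. Here n_{t,d} is the number of times term t occurs in document d, D is the document set, and N is the number of documents in D.

```latex
\mathrm{tf}(t,d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}}, \qquad
\mathrm{idf}(t) = \log\frac{N}{\left|\{\, d \in D : t \in d \,\}\right|}, \qquad
\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
```

For traffic data, each packet record plays the role of a document d and each field value plays the role of a term t.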

The ensemble learning model consists of a random forest classifier, an AdaBoost classifier, a GradientBoost classifier and an XGB classifier, and the final result is determined by voting.
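One plausible reading of the voting step is hard (majority) voting over the predictions of the four classifiers. The sketch below assumes that reading; the prediction lists are invented, and the patent does not say whether hard or soft voting is used or how ties are broken.

```python
from collections import Counter

def majority_vote(predictions):
    """Hard voting: predictions is a list of per-classifier label lists.
    On a tie, the label that was seen first among the votes wins."""
    voted = []
    for sample_votes in zip(*predictions):
        voted.append(Counter(sample_votes).most_common(1)[0][0])
    return voted

# Hypothetical outputs of the four classifiers on five samples (1 = black).
rf = [1, 0, 1, 1, 0]
ada = [1, 0, 0, 1, 0]
gb = [1, 1, 1, 0, 0]
xgb = [0, 0, 1, 1, 0]
final = majority_vote([rf, ada, gb, xgb])
```

With an even number of voters, ties are possible, so a real implementation would need an explicit tie-breaking rule or probability-weighted (soft) voting.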

For the deep learning model, a one-dimensional convolutional CNN is used to construct the network structure. The invention uses a custom-built CNN network structure with 13 layers in total, including convolution, activation, pooling, dropout, flatten and dense layers.

Table 1 shows the layers of the custom-built CNN network structure.

Table 1: Layers of the custom-built CNN network structure

Table 2 is the evaluation index table of the invention.

Table 2: Evaluation index table

To address the problems that traditional detection methods cannot detect encrypted traffic and that machine learning methods require effort to extract features, the invention provides a method for detecting encrypted malicious traffic that combines machine learning, deep learning and natural language processing. It represents the encrypted traffic fields by a text classification method without decrypting the traffic, does not need to consider the meaning of the traffic data fields, and does not lose information from the encrypted traffic data. The method is suitable not only for detecting encrypted malicious traffic but also for related detection tasks such as malicious code detection, and it generalizes better and is more accurate. Later model improvements are not constrained by how information is extracted from the encrypted traffic data.

An experiment was carried out on an embodiment of the invention, as follows. The data was provided by Qi An Xin and is real packet-capture data. A total of 3000 data packets were provided, of which 1500 are black samples and 1500 are white samples. The data was split 80/20: 2400 packets were used as training data and 600 as test data. After the data was processed with TF-IDF, the classifiers were trained and used for detection. Fig. 4 shows the ROC curve and AUC value for ensemble learning; as can be seen from Fig. 4, the AUC of the ensemble learning detector is 0.929, which is a good result.
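The 80/20 split described above can be sketched as follows. The sketch assumes the split is stratified, that is, each class contributes the same fraction to the test set; the patent does not say how the split was drawn. The sample identifiers are placeholders, not the experimental data.

```python
import random

def stratified_split(samples, test_fraction=0.2, seed=0):
    """Split (row, label) pairs so each label keeps the same train/test ratio."""
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted({y for _, y in samples}):
        group = [s for s in samples if s[1] == label]
        rng.shuffle(group)
        cut = round(len(group) * test_fraction)   # test samples for this label
        test += group[:cut]
        train += group[cut:]
    return train, test

# Placeholder rows standing in for the 1500 black and 1500 white packets.
samples = ([(f"black-{i}", 1) for i in range(1500)]
           + [(f"white-{i}", 0) for i in range(1500)])
train, test = stratified_split(samples)
```

Under this assumption the 600 test packets contain 300 black and 300 white samples, which keeps the AUC estimate from being skewed by class imbalance in the test set.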

The above embodiments may be implemented wholly or partially in software, hardware, firmware, or any combination thereof. When implemented wholly or partially in software, they may take the form of a computer program product that includes one or more computer instructions. When the computer instructions are loaded or executed on a computer, the flows or functions according to the embodiments of the invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.

The above description covers only specific embodiments of the present invention and is not to be construed as limiting its scope of protection; any modification, equivalent replacement or improvement made within the spirit and principles of the invention is intended to fall within the scope defined by the appended claims.

Claims

1. A malicious traffic detection method based on natural language processing is characterized by comprising the following steps:

extracting the pcap packet by utilizing a tshark tool to obtain an encrypted flow data set;

respectively marking black and white labels on the black and white sample data;

removing repeated data and disturbing the index of the sample data;

establishing a TF-IDF model, and performing characteristic reconstruction on the encrypted flow data set;

establishing a machine learning algorithm model, and training positive and negative samples of a data set;

adopting one-dimensional convolution CNN to construct a deep learning model;

adjusting parameters of each model, and training each model;

and evaluating each model of machine learning by using an ROC curve and an AUC value, and detecting the encrypted malicious traffic by adopting a method combining TF-IDF and integrated learning.

2. The malicious traffic detection method based on natural language processing as claimed in claim 1, wherein the establishing of the TF-IDF model and the feature reconstruction of the encrypted traffic data set comprise:

3. The malicious traffic detection method based on natural language processing according to claim 1, wherein the machine learning algorithms comprise Random Forest, AdaBoost and GradientBoosting, together with an ensemble learning model that combines these three models with an XGB model.

4. The malicious traffic detection method based on natural language processing according to claim 3, wherein the GradientBoosting belongs to Boosting series algorithms of ensemble learning; GradientBoosting serially generates weak learners, each of which is targeted to fit the negative gradient of the loss function of the previous cumulative model such that the cumulative model loss after the weak learner is added decreases in the direction of the negative gradient.

5. The malicious traffic detection method based on natural language processing as claimed in claim 3, wherein the random forest is a higher-level algorithm based on decision trees and is composed of a plurality of decision trees that do not affect one another; the forest is built from randomly constructed decision trees, each tree produces its own result in each round, and the category with the most votes is taken as the output.

6. The malicious traffic detection method based on natural language processing according to claim 3, wherein the AdaBoost is a Boosting series algorithm belonging to ensemble learning, and comprises a plurality of weak learners, and the network is adjusted by giving different weights to the learners each time.

7. The malicious traffic detection method based on natural language processing according to claim 1, wherein the CNN is a feed-forward neural network including convolution calculation and having a deep structure;

the deep learning model was evaluated using the loss value and the accuracy value.

8. A malicious traffic detection system based on natural language processing, comprising:

9. An information data processing terminal, characterized in that the information data processing terminal comprises a memory and a processor, the memory stores a computer program, and the computer program, when executed by the processor, causes the processor to execute the malicious traffic detection method based on natural language processing according to any one of claims 1 to 7.

10. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to execute the malicious traffic detection method based on natural language processing according to any one of claims 1 to 7.