CN112968872A - Malicious flow detection method, system and terminal based on natural language processing - Google Patents

Publication number
CN112968872A
Authority
CN
China
Prior art keywords
model
encrypted
data set
natural language
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110127620.1A
Other languages
Chinese (zh)
Other versions
CN112968872B (en)
Inventor
杨昊
何琴
文武
谢安琪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu University of Information Technology
Original Assignee
Chengdu University of Information Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu University of Information Technology
Priority to CN202110127620.1A
Publication of CN112968872A
Application granted
Publication of CN112968872B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computer Security & Cryptography (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • Biomedical Technology (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Molecular Biology (AREA)
  • Computer Hardware Design (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Medical Informatics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention belongs to the technical field of malicious traffic detection, and discloses a malicious traffic detection method, system and terminal based on natural language processing. A pcap packet is extracted with the tshark tool to obtain an encrypted traffic data set; the black and white sample data are labelled black and white respectively; duplicate data is removed and the sample data index is shuffled; a TF-IDF model is established and used to reconstruct the features of the encrypted traffic data set; machine learning models are established and trained on the positive and negative samples of the data set; a deep learning model is established; the parameters of each model are tuned and each model is trained; each machine learning model is evaluated with the ROC curve and AUC value, and encrypted malicious traffic is detected by combining TF-IDF with ensemble learning. The method represents encrypted traffic fields by means of text classification, generalizes well, and is not constrained by information extraction from the encrypted traffic data when the model is improved later.

Description

Malicious flow detection method, system and terminal based on natural language processing
Technical Field
The invention belongs to the technical field of malicious traffic detection, and particularly relates to a malicious traffic detection method, system and terminal based on natural language processing.
Background
In recent years, the proliferation of encrypted communications has changed the threat model, and many conventional rule-based methods are no longer as effective as before. As businesses become increasingly digitized, a large number of services and applications use encryption as their primary means of protecting information. According to data from NetMarketShare, the proportion of encrypted Web traffic exceeded ninety percent in October 2019. However, businesses are not the only ones that benefit from encryption: adversaries can also use it to evade detection and protect their malicious activities.
At present, various machine learning or deep learning methods can be used to model and detect encrypted malicious traffic. Before modeling, however, the handling of traffic packets is a crucial issue. A traffic packet typically contains many fields, such as IP address, port number, MAC address and various protocol fields. Some of these fields directly affect the later training of the model, while others are redundant for training. In real projects, therefore, network specialists often have to be hired to analyze and process the malicious traffic data. In practice, processing the traffic packets is often one of the core tasks of a project: on the one hand, hiring a large amount of manpower is costly; on the other hand, incorrect data processing directly degrades the detection performance of the model.
Detection of encrypted malicious traffic has long been a focus of attention in the field of network security. The current mainstream detection approaches are statistical methods, pattern matching methods and machine learning methods. Statistical methods use metadata of the data stream, including packet length and inter-arrival time, to detect malware in TLS connections without decrypting the traffic. However, methods based on statistical learning have low detection accuracy, cannot guarantee correct detection of most malicious traffic, and are slower than machine learning methods. Neither pattern matching nor machine learning methods decrypt the traffic; after feature engineering on features extracted from the traffic data, an algorithm is selected for modeling, training and detection. Pattern matching methods are another group of methods that have long been applied in network traffic classification. However, because they need to read the contents of the data packet and encrypted data is difficult to read, they face obstacles, and problems such as handling multi-GB connections and scaling to a large number of signatures must be overcome. Although machine learning methods do not need to decrypt the encrypted traffic and are fast, they require the traffic data to be analyzed and processed, which consumes labor and time.
Through the above analysis, the problems and defects of the prior art are as follows:
(1) In real projects, detecting encrypted malicious traffic often requires hiring network specialists to analyze and process the malicious traffic data; hiring a large amount of manpower is costly, and incorrect data processing directly degrades the detection performance of the model.
(2) Methods based on statistical learning have low detection accuracy, cannot guarantee correct detection of most malicious traffic, and are slower than machine learning methods.
(3) Pattern matching methods need to read the contents of the data packet, and encrypted data is difficult to read, so they face obstacles and must overcome problems such as handling multi-GB connections and scaling to a large number of signatures.
(4) Although machine learning methods do not need to decrypt the encrypted traffic and are fast, they require the traffic data to be analyzed and processed, which consumes labor and time.
The significance of solving the above problems and defects is as follows:
The detection process does not require close attention to the content of the data, which greatly saves manpower; the detection results are more accurate and are suitable for practical engineering.
Disclosure of Invention
In view of the problems in the prior art, the invention provides a malicious traffic detection method, system and terminal based on natural language processing.
The invention is realized as follows: a malicious traffic detection method based on natural language processing comprises the following steps:
step one, extracting a pcap packet with the tshark tool to obtain an encrypted traffic data set;
step two, labelling the black and white sample data black and white respectively;
step three, removing duplicate data and shuffling the sample data index;
step four, establishing a TF-IDF model and reconstructing the features of the encrypted traffic data set;
step five, establishing machine learning models and training them on the positive and negative samples of the data set;
step six, establishing a deep learning model, the model being built with a one-dimensional convolutional CNN;
step seven, tuning the parameters of each model and training each model;
step eight, evaluating each machine learning model with the ROC curve and AUC value, and detecting encrypted malicious traffic by combining TF-IDF with ensemble learning.
Further, in step four, establishing the TF-IDF model and reconstructing the features of the encrypted traffic data set includes:
(1) for data set feature processing, a text classification method, namely the TF-IDF model, is used to reconstruct the data set and establish new features;
(2) after processing, a new data set is obtained and used as the input of the subsequent machine learning models.
Further, in step five, the machine learning algorithms comprise Random Forest, AdaBoost, GradientBoosting, and an ensemble learning model built from these three models and the XGB model.
Further, GradientBoosting belongs to the Boosting family of ensemble learning algorithms; it generates weak learners serially, and each weak learner aims to fit the negative gradient of the loss function of the previous cumulative model, so that the loss of the cumulative model decreases along the negative gradient direction once the weak learner is added.
Further, random forest is a more advanced algorithm based on decision trees; it consists of a plurality of decision trees that do not influence one another; the forest is built from randomly constructed decision trees, each tree produces a result in each round, and the category with the most votes is finally output.
Further, AdaBoost belongs to the Boosting family of ensemble learning algorithms and consists of a plurality of weak learners; the combined model is adjusted by assigning different weights to the learners in each round.
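The GradientBoosting description above (each weak learner fits the negative gradient of the loss of the cumulative model) can be illustrated with a minimal sketch. It assumes a single numeric feature, squared loss (for which the negative gradient is simply the residual) and one-threshold regression stumps as weak learners; the function names and the `rounds` and `learning_rate` values are hypothetical and are not taken from the patent, which does not specify its implementation.

```python
def fit_stump(xs, targets):
    """Fit the best single-threshold regression stump (least squares) on one feature."""
    best = None
    for threshold in sorted(set(xs)):
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        if not left or not right:
            continue
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        err = sum((t - lmean) ** 2 for t in left) + sum((t - rmean) ** 2 for t in right)
        if best is None or err < best[0]:
            best = (err, threshold, lmean, rmean)
    if best is None:  # all feature values identical: fall back to a constant
        mean = sum(targets) / len(targets)
        return (float("inf"), mean, mean)
    return best[1:]


def stump_predict(stump, x):
    threshold, lmean, rmean = stump
    return lmean if x <= threshold else rmean


def gradient_boost(xs, ys, rounds=20, learning_rate=0.5):
    """Squared-loss boosting: every new stump fits the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(rounds):
        # For L = (y - p)^2 / 2 the negative gradient with respect to p is y - p.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x) for p, x in zip(preds, xs)]
    return base, stumps, learning_rate


def predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)
```

Fitting regression stumps to 0/1 labels is a simplification chosen to keep the sketch short; a classifier would normally use a classification loss, but the serial residual-fitting loop is the same idea.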
Further, in step six, the CNN is a type of feedforward neural network that includes convolution calculation and has a deep structure.
Further, in step eight, the deep learning model is evaluated using the loss value and the accuracy value.
Another object of the present invention is to provide a malicious traffic detection system based on natural language processing, which includes:
an encrypted traffic data set acquisition module, used for extracting a pcap packet with the tshark tool to obtain an encrypted traffic data set;
a black and white label module, used for labelling the black and white sample data black and white respectively;
a duplicate removal module, used for removing duplicate data and shuffling the sample data index;
a feature reconstruction module, used for establishing a TF-IDF model and reconstructing the features of the encrypted traffic data set;
a positive and negative sample training module, used for establishing machine learning models and training them on the positive and negative samples of the data set;
a deep learning model building module, used for establishing a deep learning model built with a one-dimensional convolutional CNN;
an encrypted malicious traffic detection module, used for tuning the parameters of each model and training each model, and also for evaluating each machine learning model with the ROC curve and AUC value and detecting encrypted malicious traffic by combining TF-IDF with ensemble learning.
Another object of the present invention is to provide an information data processing terminal including a memory and a processor, the memory storing a computer program, the computer program, when executed by the processor, causing the processor to execute the malicious traffic detection method based on natural language processing.
Another object of the present invention is to provide a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to execute the malicious traffic detection method based on natural language processing.
Taking all the above technical solutions together, the advantages and positive effects of the invention are as follows:
In view of the problems that traditional detection methods cannot detect encrypted traffic and that machine learning methods require effort to extract features, the invention designs a method for detecting encrypted malicious traffic that combines machine learning, deep learning and natural language processing. It represents encrypted traffic fields by means of text classification without decrypting the traffic, does not need to consider the meaning of the traffic data fields, and does not lose the information in the encrypted traffic data. The method is suitable not only for detecting encrypted malicious traffic but also for related tasks such as malicious code detection; it generalizes well, achieves higher accuracy, and is not constrained by information extraction from the encrypted traffic data when the model is improved later.
With a general machine learning data processing approach, in which the data set is re-encoded and a classifier is then selected and trained, the detection accuracy is only about 0.5, which is low. The invention preserves the information contained in the data, and its detection accuracy reaches about 0.9, a clear improvement over the general machine learning approach.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a malicious traffic detection method based on natural language processing according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a malicious traffic detection method based on natural language processing according to an embodiment of the present invention.
Fig. 3 is a flow chart of encrypted traffic data processing according to an embodiment of the present invention.
Fig. 4 shows the ROC curve and AUC value for ensemble learning provided by an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In view of the problems in the prior art, the invention provides a malicious traffic detection method based on natural language processing, which is described in detail below with reference to the accompanying drawings.
As shown in Fig. 1, a malicious traffic detection method based on natural language processing according to an embodiment of the present invention includes the following steps:
S101, extracting a pcap packet with the tshark tool to obtain an encrypted traffic data set;
S102, labelling the black and white sample data black and white respectively;
S103, removing duplicate data and shuffling the sample data index;
S104, establishing a TF-IDF model and reconstructing the features of the encrypted traffic data set;
S105, establishing machine learning models and training them on the positive and negative samples of the data set;
S106, establishing a deep learning model, the model being built with a one-dimensional convolutional CNN;
S107, tuning the parameters of each model and training each model;
S108, evaluating each machine learning model with the ROC curve and AUC value, and detecting encrypted malicious traffic by combining TF-IDF with ensemble learning.
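Step S101 above can be sketched as a thin wrapper around the tshark command line. This is a minimal sketch under stated assumptions: the patent does not say which packet fields it exports, so the `FIELDS` list is a hypothetical selection, and the helper names are illustrative. The options used (`-r`, `-T fields`, `-e`, `-E header=y`, `-E separator=,`, `-E quote=d`) are standard tshark options; running `extract` requires tshark to be installed.

```python
import csv
import io
import subprocess

# Hypothetical field selection; the patent does not list the exported fields.
FIELDS = ["frame.len", "ip.src", "ip.dst", "tcp.srcport", "tcp.dstport"]


def build_tshark_command(pcap_path, fields=FIELDS):
    """Build a tshark invocation that dumps the chosen fields as quoted CSV."""
    cmd = ["tshark", "-r", pcap_path, "-T", "fields",
           "-E", "header=y", "-E", "separator=,", "-E", "quote=d"]
    for name in fields:
        cmd += ["-e", name]
    return cmd


def parse_tshark_csv(text):
    """Parse tshark's CSV output into a list of dicts, one per packet."""
    return list(csv.DictReader(io.StringIO(text)))


def extract(pcap_path):
    """Run tshark on a capture file and return the parsed packet rows."""
    result = subprocess.run(build_tshark_command(pcap_path),
                            capture_output=True, text=True, check=True)
    return parse_tshark_csv(result.stdout)
```

Each returned row keeps field values as strings, which suits the later text-style processing with TF-IDF.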
The invention also provides a malicious traffic detection system based on natural language processing, which comprises:
an encrypted traffic data set acquisition module, used for extracting a pcap packet with the tshark tool to obtain an encrypted traffic data set;
a black and white label module, used for labelling the black and white sample data black and white respectively;
a duplicate removal module, used for removing duplicate data and shuffling the sample data index;
a feature reconstruction module, used for establishing a TF-IDF model and reconstructing the features of the encrypted traffic data set;
a positive and negative sample training module, used for establishing machine learning models and training them on the positive and negative samples of the data set;
a deep learning model building module, used for establishing a deep learning model built with a one-dimensional convolutional CNN;
an encrypted malicious traffic detection module, used for tuning the parameters of each model and training each model, and also for evaluating each machine learning model with the ROC curve and AUC value and detecting encrypted malicious traffic by combining TF-IDF with ensemble learning.
The technical solution of the present invention is further described below with reference to examples.
Example 1
As shown in Fig. 2, the technical solution of the present invention includes:
Step 1: Extract the pcap packet with the tshark tool to obtain encrypted traffic data.
Step 2: Label the black and white sample data black and white respectively.
Step 3: Remove duplicate data and shuffle the sample data index.
Step 4: Establish a TF-IDF model and use it to reconstruct the features of the encrypted traffic data, so that the meaning and details of specific packet fields do not need to be considered or analyzed. For data set feature processing, a text classification method, namely the TF-IDF model, is used to reconstruct the data set and establish new features. After processing, a new data set is obtained and used as the input of the subsequent machine learning models. TF-IDF (Term Frequency-Inverse Document Frequency) is a weighting technique commonly used in information retrieval and text mining; the larger the TF-IDF value, the more important a word is to a document.
Step 5: Establish machine learning models and train them on the positive and negative samples of the data set. The machine learning algorithms here include Random Forest, AdaBoost, GradientBoosting, and an ensemble learning model built from these three models and the XGB model. GradientBoosting belongs to the Boosting family of ensemble learning algorithms. It generates weak learners serially, and each weak learner aims to fit the negative gradient of the loss function of the previous cumulative model, so that the loss of the cumulative model decreases along the negative gradient direction once the weak learner is added. Random forest is a more advanced algorithm based on decision trees (CART trees by default) that consists of multiple decision trees with no interaction between them. The forest is built from randomly constructed decision trees, each tree produces a result in each round, and the category with the most votes is finally output. AdaBoost belongs to the Boosting family of ensemble learning algorithms and consists of a plurality of weak learners; the combined model is adjusted by assigning different weights to the learners in each round.
Step 6: Establish a deep learning model built with a one-dimensional convolutional CNN (convolutional neural network); a CNN is a feedforward neural network that includes convolution computation and has a deep structure.
Step 7: Tune the parameters of each model and train each model.
Step 8: Evaluate each machine learning model with the ROC curve and AUC value, and finally detect encrypted malicious traffic by combining TF-IDF with ensemble learning. The deep learning model is evaluated with the loss value and the accuracy value.
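The labelling, duplicate removal and shuffling steps above (Steps 2 and 3) can be sketched in a few lines. This is a minimal sketch, assuming each sample has already been flattened into one string of field values, that label 1 stands for black (malicious) and 0 for white (benign), and that a fixed `seed` is acceptable for reproducibility; none of these choices is stated in the patent.

```python
import random


def label_dedupe_shuffle(black_rows, white_rows, seed=0):
    """Label samples (1 = black, 0 = white), drop exact duplicates, then shuffle."""
    labelled = [(row, 1) for row in black_rows] + [(row, 0) for row in white_rows]
    seen = set()
    unique = []
    for sample in labelled:
        if sample not in seen:  # keep only the first copy of each (row, label) pair
            seen.add(sample)
            unique.append(sample)
    random.Random(seed).shuffle(unique)  # shuffle so classes are interleaved
    return unique
```

Shuffling after labelling matters because the black and white samples are concatenated in blocks; without it, any ordered split would put one class mostly in training and the other mostly in testing.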
Example 2
To process the encrypted traffic data, a TF-IDF model from the field of natural language processing is used to reconstruct the data; the work of the TF-IDF model is to calculate the TF value and the IDF value of each word. TF-IDF is a statistical method for evaluating the importance of a word to one document in a document set or corpus: the importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to its frequency across the corpus. Fig. 3 shows the encrypted traffic data processing flow.
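The TF and IDF calculation described above can be written out directly. This is a minimal sketch, assuming each packet record is treated as a whitespace-separated "document" of field values and using the plain textbook weighting tf × log(N / df); library implementations often use smoothed variants, and the patent does not state which variant it uses.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {token: weight} dict per document, weight = tf * log(N / df)."""
    tokenised = [doc.split() for doc in documents]
    n_docs = len(tokenised)
    doc_freq = Counter()
    for tokens in tokenised:
        doc_freq.update(set(tokens))  # count each token once per document
    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({tok: (cnt / total) * math.log(n_docs / doc_freq[tok])
                        for tok, cnt in counts.items()})
    return weights
```

For the two toy records `"443 tls 443"` and `"80 tls"`, the token `tls` appears in every record and therefore gets weight 0, while `443` and `80` get positive weights, which is the behaviour the paragraph describes.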
The ensemble learning model consists of a random forest classifier, an AdaBoost classifier, a GradientBoost classifier and an XGB classifier, and the optimal classifier is finally selected by voting.
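The patent does not spell out the voting rule. One common reading is hard (majority) voting over the predictions of the four classifiers, sketched below; the function name and the tie-breaking rule (ties go to the larger label, here 1 = malicious) are assumptions made only for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one label per sample by majority."""
    combined = []
    for sample_votes in zip(*predictions):
        counts = Counter(sample_votes)
        top = max(counts.values())
        # Assumption: a tie is resolved toward the larger label (1 = malicious).
        combined.append(max(label for label, c in counts.items() if c == top))
    return combined
```

Each inner list holds one classifier's predictions for all samples, so four classifiers voting on three samples are passed as four lists of length three.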
For the deep learning model, a one-dimensional convolutional CNN is used to construct the network structure. The invention builds its own CNN network structure, which consists of 13 layers in total, including convolution, activation, pooling, dropout, flatten and dense layers.
Table 1 shows the layers of the CNN network structure constructed by the invention.
Table 1 Layers of the custom CNN network structure
[Table image: Figure BDA0002923992710000091]
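The layer types named above (convolution, activation, pooling) can be illustrated with a plain-Python forward pass. This is a minimal sketch of the operations only, not the 13-layer network of Table 1, whose exact layer sizes are not recoverable here; the kernel values and pool size are hypothetical.

```python
def conv1d(signal, kernel, bias=0.0):
    """Valid 1-D convolution (cross-correlation, as in most deep learning libraries)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k)) + bias
            for i in range(len(signal) - k + 1)]


def relu(values):
    """Activation layer: negative responses are clipped to zero."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]
```

Chaining the three functions on a feature vector, for example `max_pool1d(relu(conv1d(x, [-1.0, 1.0])))`, mirrors one convolution, activation and pooling stage of such a network.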
Table 2 is the evaluation index table of the invention.
Table 2 Evaluation indices
[Table image: Figure BDA0002923992710000092]
In view of the problems that traditional detection methods cannot detect encrypted traffic and that machine learning methods require effort to extract features, the invention designs a method for detecting encrypted malicious traffic that combines machine learning, deep learning and natural language processing. It represents encrypted traffic fields by means of text classification without decrypting the traffic, does not need to consider the meaning of the traffic data fields, and does not lose the information in the encrypted traffic data. The method is suitable not only for detecting encrypted malicious traffic but also for related tasks such as malicious code detection, and it generalizes well and achieves higher accuracy. It is also not constrained by information extraction from the encrypted traffic data when the model is improved later.
Experiments were carried out on an embodiment of the invention as follows. The data were provided by Qi An Xin and are real packet capture data. A total of 3000 data packets were provided, 1500 black and 1500 white. The data were split 80/20: 2400 packets were used as training data and 600 as test data. After the data were processed with TF-IDF, the classifiers were used for training and detection. Fig. 4 shows the ROC curve and AUC value for ensemble learning. As can be seen from Fig. 4, the AUC of the ensemble learning detector is 0.929, which is a good result.
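The AUC value reported above has a simple interpretation: it is the probability that a randomly chosen malicious sample receives a higher score than a randomly chosen benign one. The sketch below computes it by direct pairwise comparison; it is an illustration of the metric, not the evaluation code used in the experiment, and it assumes labels of 1 (black) and 0 (white).

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # tied scores count as half a correct ranking
    return wins / (len(pos) * len(neg))
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a test set of 600 packets; rank-based formulas give the same value more efficiently on large sets.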
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented wholly or partially in software, it may take the form of a computer program product that includes one or more computer instructions. When the computer instructions are loaded or executed on a computer, the flows or functions according to the embodiments of the invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
The above description merely illustrates specific embodiments of the present invention and is not to be construed as limiting its scope, which is intended to cover all modifications, equivalents and improvements within the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A malicious traffic detection method based on natural language processing is characterized by comprising the following steps:
extracting a pcap packet with the tshark tool to obtain an encrypted traffic data set;
labelling the black and white sample data black and white respectively;
removing duplicate data and shuffling the sample data index;
establishing a TF-IDF model and reconstructing the features of the encrypted traffic data set;
establishing machine learning models and training them on the positive and negative samples of the data set;
constructing a deep learning model with a one-dimensional convolutional CNN;
tuning the parameters of each model and training each model;
evaluating each machine learning model with the ROC curve and AUC value, and detecting encrypted malicious traffic by combining TF-IDF with ensemble learning.
2. The malicious traffic detection method based on natural language processing according to claim 1, wherein establishing the TF-IDF model and reconstructing the features of the encrypted traffic data set comprises:
(1) for data set feature processing, a text classification method, namely the TF-IDF model, is used to reconstruct the data set and establish new features;
(2) after processing, a new data set is obtained and used as the input of the subsequent machine learning models.
3. The malicious traffic detection method based on natural language processing according to claim 1, wherein the machine learning algorithms comprise Random Forest, AdaBoost, GradientBoosting, and an ensemble learning model built from these three models and the XGB model.
4. The malicious traffic detection method based on natural language processing according to claim 3, wherein the GradientBoosting belongs to Boosting series algorithms of ensemble learning; GradientBoosting serially generates weak learners, each of which is targeted to fit the negative gradient of the loss function of the previous cumulative model such that the cumulative model loss after the weak learner is added decreases in the direction of the negative gradient.
5. The malicious traffic detection method based on natural language processing as claimed in claim 3, wherein the random forest is a higher-level algorithm built on decision trees and is composed of a plurality of decision trees that do not affect one another; the forest is constructed from randomly built decision trees, each tree produces a result in each round, and the category with the most votes is finally taken as the output result by voting.
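The randomization and voting in claim 5 can be sketched as follows. The trees are represented as plain callables that return a category, and bootstrap sampling is assumed as the source of randomness; the claim itself only says that the trees are built randomly.

```python
import random
from collections import Counter


def bootstrap(rows, rng):
    """Draw a same-size sample with replacement, one per tree (assumed scheme)."""
    return [rng.choice(rows) for _ in rows]


def forest_predict(trees, sample):
    """Let every tree vote independently and return the most common category."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]
```

For example, three hypothetical trees that output 1, 0 and 1 for a sample yield the category 1.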
6. The malicious traffic detection method based on natural language processing according to claim 3, wherein AdaBoost belongs to the Boosting family of ensemble learning algorithms and comprises a plurality of weak learners, and the network is adjusted by assigning different weights to the learners in each round.
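The weighting in claim 6 can be illustrated with one round of the standard discrete AdaBoost update, assuming labels and predictions in {-1, +1}. The function name `adaboost_round` is hypothetical, and the claim does not say which AdaBoost variant is used.

```python
import math


def adaboost_round(weights, labels, predictions):
    """One AdaBoost round: return the learner weight and renormalized sample weights."""
    # Weighted error of the current weak learner; assumes 0 < error < 1.
    error = sum(w for w, y, p in zip(weights, labels, predictions) if y != p)
    alpha = 0.5 * math.log((1 - error) / error)
    # Misclassified samples (y * p = -1) are up-weighted, correct ones down-weighted.
    updated = [w * math.exp(-alpha * y * p)
               for w, y, p in zip(weights, labels, predictions)]
    total = sum(updated)
    return alpha, [w / total for w in updated]
```

A learner with a lower weighted error receives a larger `alpha`, and the samples it misclassified carry more weight in the next round.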
7. The malicious traffic detection method based on natural language processing according to claim 1, wherein the CNN is a feedforward neural network that includes convolution computation and has a deep structure;
the deep learning model is evaluated using the loss value and the accuracy value.
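A single one-dimensional convolution step and the two evaluation measures named in claim 7 can be sketched as below. The ReLU activation and the choice of binary cross-entropy as the loss are assumptions; the claim only mentions a loss value and an accuracy value.

```python
import math


def conv1d(signal, kernel, bias=0.0):
    """Valid-mode 1-D convolution (cross-correlation form) followed by ReLU."""
    width = len(kernel)
    out = []
    for start in range(len(signal) - width + 1):
        window = signal[start:start + width]
        activation = sum(s * k for s, k in zip(window, kernel)) + bias
        out.append(max(0.0, activation))
    return out


def accuracy(labels, probabilities, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the 0/1 label."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(labels, probabilities))
    return hits / len(labels)


def binary_cross_entropy(labels, probabilities, eps=1e-12):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)
```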
8. A malicious traffic detection system based on natural language processing, comprising:
the encrypted traffic data set acquisition module is used for extracting the pcap packets with the tshark tool to obtain an encrypted traffic data set;
the black and white label acquisition module is used for labeling the black sample data and the white sample data with black and white labels, respectively;
the duplicate data removal module is used for removing duplicate data and shuffling the index of the sample data;
the feature reconstruction module is used for establishing a TF-IDF model and performing feature reconstruction on the encrypted traffic data set;
the positive and negative sample training module is used for establishing machine learning algorithm models and training them on the positive and negative samples of the data set;
the deep learning model building module is used for building a deep learning model using a one-dimensional convolutional neural network (CNN);
the encrypted malicious traffic detection module is used for tuning the parameters of each model and training each model, and is further used for evaluating each machine learning model using the ROC curve and the AUC value and detecting encrypted malicious traffic by a method combining TF-IDF with ensemble learning.
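The AUC value used for model evaluation equals the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one. A minimal pairwise sketch, assuming 1/0 labels and at least one sample of each class, is:

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison; ties count as one half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

The pairwise form is quadratic in the number of samples; it is meant to show the definition rather than to be efficient on large data sets.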
9. An information data processing terminal, characterized in that the information data processing terminal comprises a memory and a processor, the memory stores a computer program, and the computer program, when executed by the processor, causes the processor to execute the malicious traffic detection method based on natural language processing according to any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to execute the malicious traffic detection method based on natural language processing according to any one of claims 1 to 7.
CN202110127620.1A 2021-01-29 2021-01-29 Malicious flow detection method, system and terminal based on natural language processing Active CN112968872B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110127620.1A CN112968872B (en) 2021-01-29 2021-01-29 Malicious flow detection method, system and terminal based on natural language processing

Publications (2)

Publication Number Publication Date
CN112968872A CN112968872A (en) 2021-06-15
CN112968872B CN112968872B (en) 2023-04-18

Family

ID=76273475

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110127620.1A Active CN112968872B (en) 2021-01-29 2021-01-29 Malicious flow detection method, system and terminal based on natural language processing

Country Status (1)

Country Link
CN (1) CN112968872B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113705619A (en) * 2021-08-03 2021-11-26 广州大学 Malicious traffic detection method, system, computer and medium
CN113824729A (en) * 2021-09-27 2021-12-21 杭州安恒信息技术股份有限公司 Encrypted flow detection method, system and related device
CN114124551A (en) * 2021-11-29 2022-03-01 中国电子科技集团公司第三十研究所 Malicious encrypted flow identification method based on multi-granularity feature extraction under WireGuard protocol

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704495A (en) * 2017-08-25 2018-02-16 平安科技(深圳)有限公司 Training method, device and the computer-readable recording medium of subject classification device
CN109379377A (en) * 2018-11-30 2019-02-22 极客信安(北京)科技有限公司 Encrypt malicious traffic stream detection method, device, electronic equipment and storage medium
CN109450860A (en) * 2018-10-16 2019-03-08 南京航空航天大学 A kind of detection method threatened based on entropy and the advanced duration of support vector machines
CN109714341A (en) * 2018-12-28 2019-05-03 厦门服云信息科技有限公司 A kind of Web hostile attack identification method, terminal device and storage medium
CN110149418A (en) * 2018-12-12 2019-08-20 国网信息通信产业集团有限公司 A kind of hidden tunnel detection method of DNS based on deep learning
US20200219005A1 (en) * 2019-01-09 2020-07-09 International Business Machines Corporation Device discovery and classification from encrypted network traffic
CN112256874A (en) * 2020-10-21 2021-01-22 平安科技(深圳)有限公司 Model training method, text classification method, device, computer equipment and medium

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
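Kushal Singla et al.: "A Privacy preserving Approach for Home Ownership Prediction" *
何琴: "Research on the Implementation of WCDMA UE Link Access and Decoding Demultiplexing" *
文武: "Non-Intrusive Load Identification Algorithm Based on Hybrid Neural Networks and Ensemble Learning" *
李坤明, 顾益军, 张培晶: "Malicious PDF File Detection Based on Ensemble Decision Trees in Adversarial Environments" *
杨昊: "Consulting Report on the Evolution and Construction of the Chengdu Unicom GSM Core Network" *
陈铁明 et al.: "Intelligent Detection Method for Malicious Network Traffic Based on Sample Enhancement", Journal on Communications *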

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113705619A (en) * 2021-08-03 2021-11-26 广州大学 Malicious traffic detection method, system, computer and medium
CN113705619B (en) * 2021-08-03 2023-09-12 广州大学 Malicious traffic detection method, system, computer and medium
CN113824729A (en) * 2021-09-27 2021-12-21 杭州安恒信息技术股份有限公司 Encrypted flow detection method, system and related device
CN113824729B (en) * 2021-09-27 2023-01-06 杭州安恒信息技术股份有限公司 Encrypted flow detection method, system and related device
CN114124551A (en) * 2021-11-29 2022-03-01 中国电子科技集团公司第三十研究所 Malicious encrypted flow identification method based on multi-granularity feature extraction under WireGuard protocol
CN114124551B (en) * 2021-11-29 2023-05-23 中国电子科技集团公司第三十研究所 Malicious encryption traffic identification method based on multi-granularity feature extraction under WireGuard protocol

Also Published As

Publication number Publication date
CN112968872B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
CN112968872B (en) Malicious flow detection method, system and terminal based on natural language processing
Liu et al. Machine learning and deep learning methods for intrusion detection systems: A survey
Wang et al. Deep and broad URL feature mining for android malware detection
Li et al. The weighted word2vec paragraph vectors for anomaly detection over HTTP traffic
CN111382434B (en) System and method for detecting malicious files
Mehtab et al. AdDroid: rule-based machine learning framework for android malware analysis
Chen et al. Research on intrusion detection method based on Pearson correlation coefficient feature selection algorithm
Lin et al. Using federated learning on malware classification
Hu et al. Deepstyle: User style embedding for authorship attribution of short texts
Zhang et al. An ensemble method for detecting shilling attacks based on ordered item sequences
Chebbi Mastering machine learning for penetration testing: develop an extensive skill set to break self-learning systems using Python
Mimura et al. Towards efficient detection of malicious VBA macros with LSI
JP2024513569A (en) Anomaly detection system and method
Yang et al. Hadoop-based dark web threat intelligence analysis framework
Yan et al. Cross-site scripting attack detection based on a modified convolution neural network
Kuang Research on network traffic anomaly detection method based on deep learning
Yu et al. Malicious documents detection for business process management based on multi-layer abstract model
Prilepok et al. Spam detection using data compression and signatures
Patil et al. Learning to Detect Phishing Web Pages Using Lexical and String Complexity Analysis
Zhang et al. A high performance intrusion detection system using lightgbm based on oversampling and undersampling
CN113259369A (en) Data set authentication method and system based on machine learning member inference attack
Xia et al. Malware classification based on graph neural network using control flow graph
Ding et al. A network intrusion detection algorithm based on outlier mining
Guo et al. Intelligent mining vulnerabilities in python code snippets
Guichang et al. CNNPayl: An Intrusion Detection System of Cross-site Script Detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant