CN111651642A - Improved TEXT-GAN-based flow data set generation method - Google Patents
Improved TEXT-GAN-based flow data set generation method
- Publication number
- CN111651642A (application CN202010298213.2A)
- Authority
- CN
- China
- Prior art keywords
- data
- gan
- data set
- text
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Artificial Intelligence (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
- Communication Control (AREA)
Abstract
The invention discloses a traffic data set generation method based on an improved TEXT-GAN, comprising the following steps: 1. Read a traffic data set. 2. Filter useless local area network packets out of the data set. 3. Convert the data set from hexadecimal to decimal. 4. Unify the data dimensions in the data set. 5. Normalize the data to the range [0, 1]. 6. Train a TEXT-GAN network improved with a Self-Attention mechanism on the processed data. 7. Generate traffic data with the trained generator. The improvement replaces the LSTM network in the generator of the original TEXT-GAN with a Self-Attention mechanism. Compared with the original TEXT-GAN network, the improved TEXT-GAN greatly improves the speed and stability of traffic data generation.
Description
Technical Field
The invention relates to a traffic data set generation method based on an improved TEXT-GAN, and belongs to the technical field of Internet data mining.
Background
Massive traffic data has great mining value. During traffic collection, however, applications differ widely in popularity, and it is difficult to capture a large volume of traffic for unpopular applications. The resulting data set is therefore easily imbalanced, with too little data for some applications, which hampers subsequent traffic analysis.
Existing data balancing techniques, such as random oversampling, down-sampling, and the original generative adversarial network (GAN), are slow, balance the data poorly, and produce low-quality generated data when applied to traffic data.
Disclosure of Invention
Purpose of the invention: to overcome the deficiencies of the prior art, the invention provides a traffic data set generation method based on an improved TEXT-GAN. Generating traffic data with the improved TEXT-GAN network efficiently produces high-quality traffic data, achieves data augmentation and data balancing, and solves the low speed, poor balancing effect, and low generated-data quality of existing traffic data balancing methods.
Technical scheme: to achieve the above purpose, the invention adopts the following technical scheme:
A traffic data set generation method based on an improved TEXT-GAN comprises the following steps:
Step one: read a traffic data set.
Step two: filter useless local area network packets out of the data set.
Step three: convert the data set from hexadecimal to decimal.
Step four: unify the data dimensions in the data set.
Step five: normalize the data to the range [0, 1].
Step six: train a TEXT-GAN network improved with a Self-Attention mechanism on the processed data.
The improved TEXT-GAN network replaces the long short-term memory (LSTM) network in the generator of the original TEXT-GAN network with a Self-Attention mechanism.
Step seven: generate traffic data with the trained generator.
Preferably, the data set in step one is traffic data in PCAP or PCAPNG format.
Preferably, the useless local area network packets filtered in step two are ARP and DHCP packets.
Compared with the prior art, the invention has the following beneficial effects:
The invention provides a TEXT-GAN traffic data set generation method based on a Self-Attention mechanism. The Self-Attention mechanism: 1. captures long-distance dependencies, as a long short-term memory network does; 2. can be computed in parallel on a GPU. The original TEXT-GAN network is thereby optimized so that it can better model the generation of encrypted traffic.
The invention replaces the LSTM network in the generator of the original TEXT-GAN network with a Self-Attention mechanism, yielding an improved TEXT-GAN for generating traffic data. Compared with the original TEXT-GAN network and other data balancing techniques, the improved TEXT-GAN greatly improves the speed, stability, and quality of traffic data generation.
Drawings
Fig. 1 is a schematic flow chart of the traffic data set generation method based on the improved TEXT-GAN according to the invention.
Fig. 2 compares the network structure of the improved TEXT-GAN network of the invention with that of the original TEXT-GAN.
Fig. 3 shows the training loss curve of the original TEXT-GAN network on Facebook traffic data.
Fig. 4 shows the training loss curve of the improved TEXT-GAN network of the invention on Facebook traffic data.
Fig. 5 compares the training time of the two networks over 500 epochs for each application.
Detailed Description
The invention is further illustrated below with reference to the accompanying drawings and specific embodiments. These embodiments are given solely for illustration and do not limit the scope of the invention; after reading the invention, equivalent modifications made by those skilled in the art fall within the scope defined by the appended claims.
As shown in Figs. 1-5, the invention provides a traffic data set generation method based on an improved TEXT-GAN, comprising the following steps:
Step one: read a traffic data set. The data set is traffic data in PCAP or PCAPNG format.
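As an illustrative sketch only (the patent does not specify a reader), the classic PCAP container can be parsed with the standard library: a 24-byte global header followed by 16-byte record headers, each preceding one captured packet. The function names `read_pcap_packets` and `build_pcap` are hypothetical, and the sketch assumes the classic microsecond-timestamp PCAP format; PCAPNG uses a different block layout and is not handled here.

```python
import struct

def read_pcap_packets(blob: bytes) -> list[bytes]:
    """Return the raw packet bytes stored in a classic PCAP byte string."""
    if len(blob) < 24:
        raise ValueError("truncated PCAP global header")
    magic = struct.unpack("<I", blob[:4])[0]
    if magic == 0xA1B2C3D4:
        endian = "<"   # file was written little-endian
    elif magic == 0xD4C3B2A1:
        endian = ">"   # file was written big-endian
    else:
        raise ValueError("not a classic PCAP file (PCAPNG needs another parser)")
    packets = []
    offset = 24        # skip the global header
    while offset + 16 <= len(blob):
        # Record header: ts_sec, ts_usec, captured length, original length.
        _sec, _usec, incl_len, _orig_len = struct.unpack(
            endian + "IIII", blob[offset:offset + 16])
        offset += 16
        if offset + incl_len > len(blob):
            break      # truncated final record
        packets.append(blob[offset:offset + incl_len])
        offset += incl_len
    return packets

def build_pcap(packets: list[bytes]) -> bytes:
    """Scaffolding for the demo: wrap packets in a little-endian PCAP container."""
    # magic, version 2.4, thiszone, sigfigs, snaplen, link type 1 (Ethernet)
    out = struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)
    for pkt in packets:
        out += struct.pack("<IIII", 0, 0, len(pkt), len(pkt)) + pkt
    return out

demo_packets = [b"\x01\x02\x03", b"\xff" * 5]
recovered = read_pcap_packets(build_pcap(demo_packets))
```

In practice a capture library would normally be used instead; the sketch only makes the container layout concrete.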
Step two: filter useless local area network packets out of the data set. The packets to be filtered are ARP and DHCP packets.
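A minimal sketch of such a filter follows. It is an assumption about how the filtering could be done, not the patent's implementation: it treats each packet as an untagged Ethernet II frame, recognizes ARP by EtherType 0x0806, and recognizes DHCP as IPv4 UDP traffic on ports 67 or 68 (which also matches BOOTP). The names `is_arp_or_dhcp` and `filter_lan_noise` are hypothetical.

```python
import struct

ETHERTYPE_IPV4 = 0x0800
ETHERTYPE_ARP = 0x0806
IPPROTO_UDP = 17
DHCP_PORTS = {67, 68}   # DHCP server and client UDP ports
ETH_HEADER_LEN = 14

def is_arp_or_dhcp(frame: bytes) -> bool:
    """True if an untagged Ethernet II frame carries ARP or IPv4 DHCP."""
    if len(frame) < ETH_HEADER_LEN:
        return False
    ethertype = struct.unpack("!H", frame[12:14])[0]
    if ethertype == ETHERTYPE_ARP:
        return True
    if ethertype != ETHERTYPE_IPV4 or len(frame) < ETH_HEADER_LEN + 20:
        return False
    ip_header_len = (frame[ETH_HEADER_LEN] & 0x0F) * 4   # IHL in 32-bit words
    protocol = frame[ETH_HEADER_LEN + 9]
    udp_start = ETH_HEADER_LEN + ip_header_len
    if protocol != IPPROTO_UDP or len(frame) < udp_start + 4:
        return False
    src_port, dst_port = struct.unpack("!HH", frame[udp_start:udp_start + 4])
    return src_port in DHCP_PORTS or dst_port in DHCP_PORTS

def filter_lan_noise(frames: list[bytes]) -> list[bytes]:
    """Keep only the frames that are neither ARP nor DHCP."""
    return [f for f in frames if not is_arp_or_dhcp(f)]
```

VLAN-tagged frames, IPv6 (DHCPv6), and fragmented IP packets would need extra handling that this sketch omits.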
Step three: convert the data set from hexadecimal to decimal. The converted format makes it convenient to feed the data into a neural network for training, since a neural network takes numeric input and cannot be trained on hexadecimal data directly.
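The conversion can be sketched as follows, assuming (the patent does not state the granularity) that each byte, written as two hexadecimal digits, becomes one decimal integer in the range 0 to 255. The function name `hex_to_decimal` is hypothetical.

```python
def hex_to_decimal(hex_payload: str) -> list[int]:
    """Convert a hex dump such as '16 03 01 ff' to one integer per byte."""
    # bytes.fromhex ignores ASCII whitespace between byte pairs.
    return list(bytes.fromhex(hex_payload))

values = hex_to_decimal("16 03 01 ff")
```

Here `values` holds one integer per payload byte, which later steps can pad and scale.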
Step four: unify the data dimensions in the data set.
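The patent does not say how the dimensions are unified. One common approach, shown here purely as an assumption, is to truncate long samples and zero-pad short ones to a fixed length. The function name `unify_length`, the padding value, and the target length are all hypothetical.

```python
def unify_length(values: list[int], target_len: int, pad_value: int = 0) -> list[int]:
    """Truncate or right-pad a sample so it has exactly target_len entries."""
    trimmed = list(values[:target_len])
    return trimmed + [pad_value] * (target_len - len(trimmed))

padded = unify_length([1, 2, 3], 5)
truncated = unify_length([1, 2, 3, 4], 2)
```

After this step every sample has the same length, so samples can be stacked into one fixed-shape batch.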
Step five: normalize the data to the range [0, 1].
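If each value is a byte in the range 0 to 255, one simple normalization (an assumption; the patent gives no formula) divides every value by 255 so the result lies in [0, 1]. The function name `normalize_bytes` is hypothetical.

```python
def normalize_bytes(values: list[int]) -> list[float]:
    """Scale byte values (0..255) linearly into the range [0, 1]."""
    return [v / 255.0 for v in values]

scaled = normalize_bytes([0, 51, 255])
```

A min-max normalization computed over the data set would be an equally plausible reading of this step.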
Step six: train a TEXT-GAN network improved with a Self-Attention mechanism on the processed data.
The improvement to the TEXT-GAN network consists of replacing the long short-term memory (LSTM) network in the generator of the original TEXT-GAN network with a Self-Attention mechanism.
The original TEXT-GAN network is taken from the paper "Adversarial Feature Matching for Text Generation".
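The patent does not specify which form of self-attention is used. As a hedged illustration, the NumPy sketch below implements single-head scaled dot-product self-attention, one widely used variant; it is not the patent's generator, and the projection matrices `w_q`, `w_k`, `w_v` and the sizes are hypothetical. It shows why the mechanism parallelizes well: every position attends to every other position through a few matrix products, with no step-by-step recurrence as in an LSTM.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a (seq_len, d) input."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Pairwise similarity between all positions, scaled by sqrt(key dimension).
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output position is a weighted mix of all value vectors.
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model = 6, 8                      # hypothetical sizes
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)
```

Each row of `weights` sums to 1 and can link distant positions directly, which corresponds to the long-distance dependency property the patent attributes to the mechanism.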
Step seven: generate traffic data with the trained generator.
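The patent does not describe how generator output is turned back into packet bytes. Assuming the generator emits values in [0, 1] and that normalization divided byte values by 255, a post-processing step could invert that scaling as sketched below. The function name `to_hex_payload` is hypothetical.

```python
def to_hex_payload(generated: list[float]) -> str:
    """Map generator outputs in [0, 1] back to bytes and render them as hex."""
    # Rescale, round to the nearest integer, and clip to the valid byte range.
    byte_values = [min(255, max(0, round(v * 255))) for v in generated]
    return bytes(byte_values).hex()

payload_hex = to_hex_payload([0.0, 1.0, 22 / 255])
```

Clipping guards against generator outputs that fall slightly outside [0, 1].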
The method of the invention exploits two properties of the Self-Attention mechanism: 1. it captures long-distance dependencies, as a long short-term memory network does; 2. it can be computed in parallel on a GPU. The LSTM network in the generator of the original TEXT-GAN network is replaced accordingly, giving an improved TEXT-GAN generation method for generating traffic data. Compared with the original TEXT-GAN network and other data balancing techniques, the improved TEXT-GAN greatly improves the speed, stability, and quality of traffic data generation.
The above describes only the preferred embodiments of the invention. It should be noted that those skilled in the art can make various modifications and adaptations without departing from the principles of the invention, and these are also intended to fall within the scope of the invention.
Claims (3)
1. A traffic data set generation method based on an improved TEXT-GAN, characterized by comprising the following steps:
step one, reading a traffic data set;
step two, filtering useless local area network packets out of the data set;
step three, converting the data set from hexadecimal to decimal;
step four, unifying the data dimensions in the data set;
step five, normalizing the data to the range [0, 1];
step six, training a TEXT-GAN network improved with a Self-Attention mechanism on the processed data;
wherein the improved TEXT-GAN network replaces the long short-term memory network in the generator of the original TEXT-GAN network with the Self-Attention mechanism;
and step seven, generating traffic data with the trained generator.
2. The method of claim 1, wherein the data set in step one is traffic data in PCAP or PCAPNG format.
3. The method of claim 1, wherein the useless local area network packets filtered in step two are ARP and DHCP packets.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010298213.2A CN111651642B (en) | 2020-04-16 | 2020-04-16 | Improved TEXT-GAN-based flow data set generation method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010298213.2A CN111651642B (en) | 2020-04-16 | 2020-04-16 | Improved TEXT-GAN-based flow data set generation method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111651642A (en) | 2020-09-11 |
CN111651642B (en) | 2022-10-04 |
Family
ID=72350431
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010298213.2A Active CN111651642B (en) | 2020-04-16 | 2020-04-16 | Improved TEXT-GAN-based flow data set generation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111651642B (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200111194A1 (en) * | 2018-10-08 | 2020-04-09 | Rensselaer Polytechnic Institute | Ct super-resolution gan constrained by the identical, residual and cycle learning ensemble (gan-circle) |
CN109902824A (en) * | 2019-03-02 | 2019-06-18 | 天津工业大学 | It is a kind of to generate confrontation network method with self adaptive control learning improvement |
CN110602078A (en) * | 2019-09-04 | 2019-12-20 | 南京邮电大学 | Application encryption traffic generation method and system based on generation countermeasure network |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112884075A (en) * | 2021-03-23 | 2021-06-01 | 北京天融信网络安全技术有限公司 | Traffic data enhancement method, traffic data classification method and related device |
CN113542271A (en) * | 2021-07-14 | 2021-10-22 | 西安电子科技大学 | Network background flow generation method based on generation of confrontation network GAN |
CN113542271B (en) * | 2021-07-14 | 2022-07-26 | 西安电子科技大学 | Network background flow generation method based on generation of confrontation network GAN |
CN114048494A (en) * | 2021-11-09 | 2022-02-15 | 四川大学 | Encryption flow data set balancing method based on transform domain |
CN114048494B (en) * | 2021-11-09 | 2023-04-07 | 四川大学 | Encryption flow data set balancing method based on transform domain |
Also Published As
Publication number | Publication date |
---|---|
CN111651642B (en) | 2022-10-04 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN111651642B (en) | Improved TEXT-GAN-based flow data set generation method | |
CN110247930B (en) | Encrypted network flow identification method based on deep neural network | |
CN108199863B (en) | Network traffic classification method and system based on two-stage sequence feature learning | |
CN108416495B (en) | Scoring card model establishing method and device based on machine learning | |
CN112115999B (en) | Wind turbine generator fault diagnosis method of space-time multi-scale neural network | |
CN110868404B (en) | Industrial control equipment automatic identification method based on TCP/IP fingerprint | |
CN107104988B (en) | IPv6 intrusion detection method based on probabilistic neural network | |
Wu et al. | Tdae: Autoencoder-based automatic feature learning method for the detection of dns tunnel | |
CN110222795A (en) | The recognition methods of P2P flow based on convolutional neural networks and relevant apparatus | |
CN111431607B (en) | Block matrix interference elimination method in WO-FTN transmission system | |
CN116030409A (en) | Photovoltaic panel dust accumulation state identification method based on self-adaptive image segmentation | |
CN115759365A (en) | Photovoltaic power generation power prediction method and related equipment | |
CN105554181B (en) | A kind of DNS log compression method and apparatus | |
CN113726561A (en) | Business type recognition method for training convolutional neural network by using federal learning | |
CN115473734B (en) | Remote code execution attack detection method based on single classification and federal learning | |
CN110995396A (en) | Compression method of communication messages of electricity consumption information acquisition system based on hierarchical structure | |
CN116506273A (en) | Novel MPSK modulation signal identification and classification method | |
CN114866246A (en) | Computer network security intrusion detection method based on big data | |
Hu et al. | Multi-Component Feature Extraction for Few-Sample Automatic Modulation Classification | |
CN109379401A (en) | Original flow storage device based on Kafka | |
CN111835720B (en) | VPN flow WEB fingerprint identification method based on feature enhancement | |
CN115589349A (en) | QAM signal modulation identification method based on deep learning channel self-attention mechanism | |
CN114461662A (en) | Method and system for screening high-potential users in response to demands of residents | |
CN113114664A (en) | Abnormal flow detection system and method based on hybrid convolutional neural network | |
CN115913792B (en) | DGA domain name identification method, system and readable medium |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
CB02 | Change of applicant information | Address after: 210000, 66 new model street, Gulou District, Jiangsu, Nanjing. Applicant after: Nanjing University of Posts and Telecommunications. Address before: Yuen Road Ya Dong Qixia District of Nanjing City, Jiangsu province 210000 New District No. 9. Applicant before: Nanjing University of Posts and Telecommunications |
GR01 | Patent grant | |