CN111651642A - Improved TEXT-GAN-based flow data set generation method - Google Patents

Improved TEXT-GAN-based flow data set generation method

Info

Publication number
CN111651642A
Authority
CN
China
Prior art keywords
data
gan
data set
text
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010298213.2A
Other languages
Chinese (zh)
Other versions
CN111651642B (en)
Inventor
王攀
刘芃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN202010298213.2A (patent CN111651642B)
Publication of CN111651642A
Application granted
Publication of CN111651642B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Communication Control (AREA)

Abstract

The invention discloses a traffic data set generation method based on an improved TEXT-GAN, comprising the following steps: 1. Read a traffic data set. 2. Filter useless local area network packets out of the data set. 3. Convert the data set from hexadecimal to decimal. 4. Unify the data dimensions in the data set. 5. Normalize the data to the range [0, 1]. 6. Feed the processed data into a TEXT-GAN network improved with a Self-Attention mechanism for training. 7. Generate traffic data using the trained generator. The improved TEXT-GAN replaces the LSTM network in the generator of the original TEXT-GAN with a Self-Attention mechanism. Compared with the original TEXT-GAN network, the improved TEXT-GAN greatly improves the speed and stability of traffic data generation.

Description

Improved TEXT-GAN-based flow data set generation method
Technical Field
The invention relates to a traffic data set generation method based on an improved TEXT-GAN, and belongs to the technical field of Internet data mining.
Background
Massive traffic data has great mining value. During traffic collection, however, applications differ widely in popularity, and it is difficult to capture a large volume of traffic for less popular applications. The resulting data set is easily imbalanced, with too little data for some applications, which affects subsequent traffic analysis.
Existing data balancing techniques, such as random oversampling, undersampling, and the original generative adversarial network (GAN), suffer from low speed, poor balancing effect, and low quality of generated data when applied to traffic data.
Disclosure of Invention
Purpose of the invention: to overcome the deficiencies of the prior art, the invention provides a traffic data set generation method based on an improved TEXT-GAN. By generating traffic data with the improved TEXT-GAN network, the method efficiently produces high-quality traffic data, achieves data augmentation and data balancing, and addresses the low speed, poor balancing effect, and low generated-data quality of existing traffic data balancing methods.
Technical solution: to achieve the above purpose, the invention adopts the following technical solution:
A traffic data set generation method based on an improved TEXT-GAN comprises the following steps:
Step one: read a traffic data set.
Step two: filter useless local area network packets out of the data set.
Step three: convert the data set from hexadecimal to decimal.
Step four: unify the data dimensions in the data set.
Step five: normalize the data to the range [0, 1].
Step six: feed the processed data into a TEXT-GAN network improved with a Self-Attention mechanism for training.
The TEXT-GAN network improved with the Self-Attention mechanism uses the Self-Attention mechanism in place of the long short-term memory (LSTM) network in the generator of the original TEXT-GAN.
Step seven: generate traffic data using the trained generator.
Preferably, the data set in step one is traffic data in PCAP or PCAPNG format.
Preferably, the useless local area network packets filtered in step two are packets of the ARP protocol and the DHCP protocol.
Compared with the prior art, the invention has the following beneficial effects:
The invention provides a TEXT-GAN traffic data set generation method based on the Self-Attention mechanism, which exploits two properties of Self-Attention: 1. Like a long short-term memory (LSTM) network, it captures long-distance dependencies. 2. It can be computed in parallel on a GPU. The original TEXT-GAN network is thereby optimized so that it can better represent and generate encrypted traffic.
By replacing the LSTM network in the generator of the original TEXT-GAN with a Self-Attention mechanism, the invention provides an improved TEXT-GAN method for generating traffic data. Compared with the original TEXT-GAN network and other data balancing techniques, the improved TEXT-GAN greatly improves the speed, stability, and quality of traffic data generation.
Drawings
Fig. 1 is a schematic flow chart of the traffic data set generation method based on the improved TEXT-GAN according to the invention.
Fig. 2 compares the network structure of the improved TEXT-GAN network of the invention with that of the original TEXT-GAN.
Fig. 3 shows the training loss curve of the original TEXT-GAN network on Facebook traffic data.
Fig. 4 shows the training loss curve of the improved TEXT-GAN network of the invention on Facebook traffic data.
Fig. 5 compares the training time of the two networks over 500 epochs for each application.
Detailed Description
The invention is further illustrated below with reference to the accompanying drawings and specific embodiments. These embodiments are given solely for illustration and do not limit the scope of the invention; after reading this disclosure, various equivalent modifications made by those skilled in the art fall within the scope defined by the appended claims.
As shown in Figs. 1 to 5, the invention provides a traffic data set generation method based on an improved TEXT-GAN, comprising the following steps:
Step one: read a traffic data set. The data set is traffic data in PCAP or PCAPNG format.
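Step one could be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes a classic PCAP capture with microsecond timestamps, and the function name `read_pcap_packets` is hypothetical. PCAPNG uses a different block structure and would need a separate parser or a capture library.

```python
import struct


def read_pcap_packets(data: bytes) -> list:
    """Return the raw bytes of every packet record in a classic PCAP capture."""
    if len(data) < 24:
        raise ValueError("too short for a PCAP global header")
    magic = data[:4]
    if magic == b"\xd4\xc3\xb2\xa1":      # file written little-endian
        endian = "<"
    elif magic == b"\xa1\xb2\xc3\xd4":    # file written big-endian
        endian = ">"
    else:
        raise ValueError("not a classic PCAP file")

    packets = []
    offset = 24  # skip the 24-byte global header
    while offset + 16 <= len(data):
        # Record header: ts_sec, ts_usec, captured length, original length.
        _sec, _usec, incl_len, _orig_len = struct.unpack_from(
            endian + "IIII", data, offset
        )
        offset += 16
        packets.append(data[offset:offset + incl_len])
        offset += incl_len
    return packets
```

Each returned item is one captured frame, which the later steps filter and convert.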
Step two: filter useless local area network packets out of the data set. The packets to be filtered are those of the ARP protocol and the DHCP protocol.
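One possible way to carry out step two is shown below. The sketch assumes untagged Ethernet II frames carrying IPv4, identifies ARP by EtherType 0x0806 and DHCP by UDP ports 67 and 68, and ignores VLAN tags and IPv6. The helper names are hypothetical and the patent does not state how the filtering is implemented.

```python
ETHERTYPE_ARP = 0x0806
ETHERTYPE_IPV4 = 0x0800
IPPROTO_UDP = 17
DHCP_PORTS = {67, 68}  # DHCP server and client ports


def is_arp_or_dhcp(frame: bytes) -> bool:
    """Return True if an Ethernet II frame carries ARP or IPv4 DHCP."""
    if len(frame) < 14:
        return False
    ethertype = int.from_bytes(frame[12:14], "big")
    if ethertype == ETHERTYPE_ARP:
        return True
    if ethertype != ETHERTYPE_IPV4 or len(frame) < 14 + 20:
        return False
    ip = frame[14:]
    header_len = (ip[0] & 0x0F) * 4  # IHL field, in 32-bit words
    if ip[9] != IPPROTO_UDP or len(ip) < header_len + 4:
        return False
    src_port = int.from_bytes(ip[header_len:header_len + 2], "big")
    dst_port = int.from_bytes(ip[header_len + 2:header_len + 4], "big")
    return src_port in DHCP_PORTS or dst_port in DHCP_PORTS


def filter_lan_packets(frames: list) -> list:
    """Keep only the frames that are neither ARP nor DHCP."""
    return [frame for frame in frames if not is_arp_or_dhcp(frame)]
```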
Step three: convert the data set from hexadecimal to decimal. The conversion makes it convenient to feed the data into the neural network for training, since the network cannot be trained directly on hexadecimal data.
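A minimal sketch of step three, assuming each packet is available either as a hexadecimal dump string or as raw bytes, and that "decimal" means one integer in 0..255 per byte. The function names are hypothetical.

```python
def hex_to_decimal(hex_string: str) -> list:
    """Convert a hex dump such as '45 00 ff' into decimal byte values."""
    # bytes.fromhex ignores whitespace between byte pairs.
    return list(bytes.fromhex(hex_string))


def packet_to_decimal(packet: bytes) -> list:
    """Convert raw packet bytes into a list of decimal byte values."""
    return list(packet)
```

For example, `hex_to_decimal("45 00 ff")` yields `[69, 0, 255]`.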
Step four: unify the data dimensions in the data set.
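The text does not say how the dimensions are unified. A common approach, assumed here purely for illustration, is to truncate long samples and zero-pad short ones to a fixed length; `target_len` and `pad_value` are hypothetical parameters.

```python
def unify_length(values: list, target_len: int, pad_value: int = 0) -> list:
    """Truncate or right-pad a sample so it has exactly target_len entries."""
    clipped = list(values[:target_len])
    return clipped + [pad_value] * (target_len - len(clipped))
```

With `target_len=5`, the sample `[1, 2, 3]` becomes `[1, 2, 3, 0, 0]`, and a longer sample keeps only its first five values.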
Step five: normalize the data to the range [0, 1].
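Step five can be illustrated with fixed-bound min-max scaling. The sketch assumes every value is a byte in 0..255, so dividing by 255 maps the data into [0, 1]; the patent does not state which normalization it uses.

```python
BYTE_MAX = 255.0  # assumption: each value is one byte (0..255)


def normalize(values: list) -> list:
    """Scale decimal byte values into the range [0, 1]."""
    return [value / BYTE_MAX for value in values]
```

A fixed divisor keeps the mapping identical for every sample, which makes it straightforward to invert on generated data.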
Step six: feed the processed data into a TEXT-GAN network improved with a Self-Attention mechanism for training.
The improvement to the TEXT-GAN network consists in replacing the long short-term memory (LSTM) network in the generator of the original TEXT-GAN with the Self-Attention mechanism.
The original network on which the improved TEXT-GAN is based comes from the paper "Adversarial Feature Matching for Text Generation".
[Figures BDA0002453012250000031 and BDA0002453012250000041: images not reproduced in this text]
Step seven: generate traffic data using the trained generator.
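The text does not describe how generator output is turned back into traffic bytes. Assuming the generator emits values in roughly [0, 1] and that normalization divided byte values by 255, the inverse mapping might look like the sketch below; `to_packet_bytes` is a hypothetical helper.

```python
def to_packet_bytes(generated: list) -> bytes:
    """Map generator outputs in [0, 1] back to byte values 0..255."""
    # Clamp first, because a generator may slightly overshoot the range.
    return bytes(min(255, max(0, round(x * 255))) for x in generated)
```

Calling `.hex()` on the result gives the hexadecimal form that step three started from.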
The method of the invention exploits two properties of the Self-Attention mechanism: 1. Like a long short-term memory (LSTM) network, it captures long-distance dependencies. 2. It can be computed in parallel on a GPU. On this basis, the LSTM network in the generator of the original TEXT-GAN is replaced, yielding an improved TEXT-GAN method for generating traffic data. Compared with the original TEXT-GAN network and other data balancing techniques, the improved TEXT-GAN greatly improves the speed, stability, and quality of traffic data generation.
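The two properties can be seen in a small sketch of scaled dot-product self-attention. The patent does not specify its attention formulation, so this form is an assumption; the code uses plain Python lists for clarity, whereas a real generator would use a deep-learning framework. All names are hypothetical.

```python
import math


def matmul(a: list, b: list) -> list:
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row: list) -> list:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x: list, w_q: list, w_k: list, w_v: list):
    """Return (output, weights) for a sequence x of shape [length][dim]."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(k[0]))
    k_transposed = [list(col) for col in zip(*k)]
    # One matrix product scores every position against every other position,
    # so distant positions interact directly and no step waits on the last.
    scores = matmul(q, k_transposed)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights
```

An LSTM, by contrast, must process position t before position t+1, which is why replacing it with attention can shorten training time on parallel hardware.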
The above describes only preferred embodiments of the invention. It should be noted that those skilled in the art may make various modifications and refinements without departing from the principles of the invention, and these are also considered to fall within its scope of protection.

Claims (3)

1. A traffic data set generation method based on an improved TEXT-GAN, characterized by comprising the following steps:
step one, reading a traffic data set;
step two, filtering useless local area network packets out of the data set;
step three, converting the data set from hexadecimal to decimal;
step four, unifying the data dimensions in the data set;
step five, normalizing the data to the range [0, 1];
step six, feeding the processed data into a TEXT-GAN network improved with a Self-Attention mechanism for training;
wherein the TEXT-GAN network improved with the Self-Attention mechanism uses the Self-Attention mechanism in place of the long short-term memory (LSTM) network in the generator of the original TEXT-GAN;
step seven, generating traffic data using the trained generator.
2. The method of claim 1, wherein the data set in step one is traffic data in PCAP or PCAPNG format.
3. The method of claim 1, wherein the useless local area network packets filtered in step two are packets of the ARP protocol and the DHCP protocol.
CN202010298213.2A 2020-04-16 2020-04-16 Improved TEXT-GAN-based flow data set generation method Active CN111651642B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010298213.2A CN111651642B (en) 2020-04-16 2020-04-16 Improved TEXT-GAN-based flow data set generation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010298213.2A CN111651642B (en) 2020-04-16 2020-04-16 Improved TEXT-GAN-based flow data set generation method

Publications (2)

Publication Number Publication Date
CN111651642A (en) 2020-09-11
CN111651642B (en) 2022-10-04

Family

ID=72350431

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010298213.2A Active CN111651642B (en) 2020-04-16 2020-04-16 Improved TEXT-GAN-based flow data set generation method

Country Status (1)

Country Link
CN (1) CN111651642B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200111194A1 (en) * 2018-10-08 2020-04-09 Rensselaer Polytechnic Institute Ct super-resolution gan constrained by the identical, residual and cycle learning ensemble (gan-circle)
CN109902824A (en) * 2019-03-02 2019-06-18 天津工业大学 It is a kind of to generate confrontation network method with self adaptive control learning improvement
CN110602078A (en) * 2019-09-04 2019-12-20 南京邮电大学 Application encryption traffic generation method and system based on generation countermeasure network

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112884075A (en) * 2021-03-23 2021-06-01 北京天融信网络安全技术有限公司 Traffic data enhancement method, traffic data classification method and related device
CN113542271A (en) * 2021-07-14 2021-10-22 西安电子科技大学 Network background flow generation method based on generation of confrontation network GAN
CN113542271B (en) * 2021-07-14 2022-07-26 西安电子科技大学 Network background flow generation method based on generation of confrontation network GAN
CN114048494A (en) * 2021-11-09 2022-02-15 四川大学 Encryption flow data set balancing method based on transform domain
CN114048494B (en) * 2021-11-09 2023-04-07 四川大学 Encryption flow data set balancing method based on transform domain

Also Published As

Publication number Publication date
CN111651642B (en) 2022-10-04

Similar Documents

Publication Publication Date Title
CN111651642B (en) Improved TEXT-GAN-based flow data set generation method
CN110247930B (en) Encrypted network flow identification method based on deep neural network
CN108199863B (en) Network traffic classification method and system based on two-stage sequence feature learning
CN108416495B (en) Scoring card model establishing method and device based on machine learning
CN112115999B (en) Wind turbine generator fault diagnosis method of space-time multi-scale neural network
CN110868404B (en) Industrial control equipment automatic identification method based on TCP/IP fingerprint
CN107104988B (en) IPv6 intrusion detection method based on probabilistic neural network
Wu et al. Tdae: Autoencoder-based automatic feature learning method for the detection of dns tunnel
CN110222795A (en) The recognition methods of P2P flow based on convolutional neural networks and relevant apparatus
CN111431607B (en) Block matrix interference elimination method in WO-FTN transmission system
CN116030409A (en) Photovoltaic panel dust accumulation state identification method based on self-adaptive image segmentation
CN115759365A (en) Photovoltaic power generation power prediction method and related equipment
CN105554181B (en) A kind of DNS log compression method and apparatus
CN113726561A (en) Business type recognition method for training convolutional neural network by using federal learning
CN115473734B (en) Remote code execution attack detection method based on single classification and federal learning
CN110995396A (en) Compression method of communication messages of electricity consumption information acquisition system based on hierarchical structure
CN116506273A (en) Novel MPSK modulation signal identification and classification method
CN114866246A (en) Computer network security intrusion detection method based on big data
Hu et al. Multi-Component Feature Extraction for Few-Sample Automatic Modulation Classification
CN109379401A (en) Original flow storage device based on Kafka
CN111835720B (en) VPN flow WEB fingerprint identification method based on feature enhancement
CN115589349A (en) QAM signal modulation identification method based on deep learning channel self-attention mechanism
CN114461662A (en) Method and system for screening high-potential users in response to demands of residents
CN113114664A (en) Abnormal flow detection system and method based on hybrid convolutional neural network
CN115913792B (en) DGA domain name identification method, system and readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 210000, 66 new model street, Gulou District, Jiangsu, Nanjing

Applicant after: Nanjing University of Posts and Telecommunications

Address before: Yuen Road Ya Dong Qixia District of Nanjing City, Jiangsu province 210000 New District No. 9

Applicant before: Nanjing University of Posts and Telecommunications

GR01 Patent grant