CN118101308A - Method, system and electronic equipment for accelerating machine learning feature engineering - Google Patents

Info

Publication number
CN118101308A
Authority
CN
China
Prior art keywords
data
data packet
module
characteristic
flowid
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410339486.5A
Other languages
Chinese (zh)
Inventor
刘亚萍
罗浪
张硕
陈世越
吴柏年
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou University
Original Assignee
Guangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2024-03-22
Filing date
2024-03-22
Publication date
2024-05-28
Application filed by Guangzhou University
Priority to CN202410339486.5A
Publication of CN118101308A

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00: Network architectures or network communication protocols for network security
    • H04L63/14: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425: Traffic logging, e.g. anomaly detection
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00: Traffic control in data switching networks
    • H04L47/10: Flow control; Congestion control
    • H04L47/24: Traffic characterised by specific attributes, e.g. priority or QoS
    • H04L47/2441: Traffic characterised by specific attributes, e.g. priority or QoS relying on flow classification, e.g. using integrated services [IntServ]
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00: Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/40: Network security protocols

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a method, a system and electronic equipment for accelerating machine learning feature engineering. The method comprises the following steps: S1: initializing a P4 network device and acquiring a data packet; S2: calculating the flowID and the state feature values of the data packet, and updating a first data characteristic flow table in a register of the P4 network device, using the flowID as an index; S3: extracting flow statistical features of the encapsulated data packet and storing them in a second data characteristic flow table in the register; S4: reading the first data characteristic flow table and the second data characteristic flow table from the register, processing the data, and storing it in a database for building a flow statistics model. By integrating the control plane and the data plane, the technical scheme of the invention markedly accelerates feature engineering for encrypted traffic classification.

Description

Method, system and electronic equipment for accelerating machine learning feature engineering
Technical Field
The invention relates to the technical field of network security, and in particular to a method for accelerating machine learning feature engineering for encrypted traffic.
Background
Machine learning is a core tool for data analysis and decision making, and its performance and efficiency are a focus of attention for businesses and research institutions. However, the success of a machine learning model depends largely on the quality and depth of its feature engineering, which involves complex processes such as data preprocessing, feature selection, and feature transformation. Conventional feature engineering is usually carried out on a central server or cloud computing platform, and this centralized processing mode faces challenges such as data transmission delay, bandwidth limitations, and privacy and security concerns. In this context, programmable switches, as an emerging class of network device, show great potential for machine learning feature engineering. A programmable switch not only has strong data processing and traffic management capabilities, but can also schedule and process data flows dynamically and in real time through a flexible programming interface, providing a new solution for the real-time extraction, optimization and transmission of machine learning features.
Because the content of encrypted traffic cannot be analyzed directly, traffic classification is currently performed mainly by extracting statistical features, behavioral features and other observable features of the traffic.
Flow statistical features: these include the packet size distribution, inter-arrival time characteristics, and changes in traffic direction. Examples are the average packet size, the average packet inter-arrival time, and the proportions of packets and bytes in each direction of bidirectional traffic. These statistics reflect the basic transmission characteristics of the traffic.
Flow behavior features: the flow duration, total number of bytes, total number of packets, and use of particular protocol ports reveal the behavior pattern of a flow. These features help to distinguish between different types of applications and services, especially when their communication behaviors differ significantly.
Features specific to encrypted traffic: even when traffic is encrypted, certain protocol-specific metadata remains visible, such as the cipher suite and TLS version exchanged during the TLS handshake. Such information can provide clues to the type of encrypted traffic.
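The statistical and behavioral features listed above can be illustrated with a minimal Python sketch. It is not part of the patent: the function name `flow_statistics`, the tuple layout `(timestamp, size, direction)` and the direction labels `"fwd"`/`"bwd"` are assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def flow_statistics(packets):
    """Compute basic per-flow features.

    packets: list of (timestamp_seconds, size_bytes, direction) tuples,
    where direction is "fwd" or "bwd" (hypothetical labels).
    """
    packets = sorted(packets, key=lambda p: p[0])
    sizes = [size for _, size, _ in packets]
    times = [ts for ts, _, _ in packets]
    # Inter-arrival times between consecutive packets.
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    total_bytes = sum(sizes)
    fwd_bytes = sum(size for _, size, d in packets if d == "fwd")
    return {
        "packet_count": len(packets),
        "total_bytes": total_bytes,
        "duration": times[-1] - times[0] if times else 0.0,
        "mean_packet_size": mean(sizes) if sizes else 0.0,
        "mean_interarrival": mean(gaps) if gaps else 0.0,
        "std_interarrival": pstdev(gaps) if gaps else 0.0,
        "fwd_byte_ratio": fwd_bytes / total_bytes if total_bytes else 0.0,
    }


example = flow_statistics([(0.0, 100, "fwd"), (1.0, 300, "bwd"), (3.0, 200, "fwd")])
```

None of these features requires access to the payload, which is why they remain usable for encrypted traffic.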
However, all of the above features are currently extracted on the CPU, and when a packet capture file is very large the extraction takes a very long time. In addition, as encryption technology continues to advance and network traffic patterns grow more complex, a single feature extraction approach may not be sufficient to address all classification challenges.
In summary, to solve the problem that extracting encrypted traffic features on a server is slow and inefficient, the invention offloads the machine learning feature engineering process to a P4 switch.
Disclosure of Invention
The technical problem to be solved by the invention is to improve the efficiency and accuracy of encrypted traffic classification and to provide a new solution for the field of network security.
The invention provides a method and system for accelerating machine learning feature engineering based on a programmable switch. The system uses the programmability of the P4 switch to perform preliminary analysis and feature extraction on traffic directly in the data plane, thereby markedly reducing dependence on the central processing unit (CPU) and reducing the time required for data transmission.
To achieve the above purpose, the invention adopts the following technical scheme:
S1: initializing a P4 network device and acquiring a data packet;
S2: calculating the flowID and the state feature values of the data packet, and updating a first data characteristic flow table in a register of the P4 network device, using the flowID as an index;
S3: extracting flow statistical features of the encapsulated data packet and storing them in a second data characteristic flow table in the register;
S4: reading the first data characteristic flow table and the second data characteristic flow table from the register, processing the data, and storing it in a database for building a flow statistics model.
Compared with the prior art, the invention has the following beneficial effects:
By integrating the control plane and the data plane, the technical scheme of the invention markedly accelerates feature engineering for encrypted traffic classification.
The P4 registers store flow statistics for the packets, including but not limited to packet size, transmission frequency, and flow duration. This allows the system to process and analyze data flows in real time at the network edge, which improves the efficiency of feature extraction.
A hash of the packet's five-tuple fields is used as the P4 register index, so that the feature values of each flow are correctly distinguished and stored, providing the information needed for encrypted traffic analysis. Flow features are thus extracted in real time, allowing the system to identify and process network traffic quickly.
The feature extraction mechanism of the data plane relies on registers and hash functions, and uses a Count-Min Sketch to optimize register usage. This enables real-time extraction of flow features and ensures that the feature data is timely and accurate. Optimizing register usage also reduces the memory requirement, improves the speed and accuracy of encrypted traffic classification, allows the system to respond quickly when a network attack occurs, and lets the system handle a large volume of network traffic with limited resources.
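The patent names the Count-Min Sketch but does not give its parameters. The following Python sketch shows the general data structure under stated assumptions: the width, the depth, and the use of a row-salted CRC32 as the per-row hash are illustrative choices, not values taken from the patent.

```python
import zlib


class CountMinSketch:
    """Approximate per-key counters in fixed memory (illustrative parameters)."""

    def __init__(self, width=1024, depth=3):
        self.width = width
        self.depth = depth
        # One counter array per hash row, analogous to one register array each.
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, key: bytes, row: int) -> int:
        # Salt the key with the row number so each row uses a different hash.
        return zlib.crc32(bytes([row]) + key) % self.width

    def add(self, key: bytes, amount: int = 1) -> None:
        for row in range(self.depth):
            self.rows[row][self._index(key, row)] += amount

    def estimate(self, key: bytes) -> int:
        # Collisions can only inflate a counter, so the minimum is the best estimate.
        return min(self.rows[row][self._index(key, row)] for row in range(self.depth))


sketch = CountMinSketch()
for _ in range(3):
    sketch.add(b"flow-a")
```

The estimate never undercounts; it may overcount when different flows collide in every row, which becomes less likely as the width and depth grow. This is the trade-off that lets a fixed number of registers track many flows.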
The control plane periodically collects key network metrics through a precise mechanism that ensures the security and integrity of the data. The collected data is preprocessed and stored in an efficient database, providing high-quality input data for training and inference of the machine learning model. This ensures that the machine learning model can run efficiently and can access accurate, consistent data during training and inference.
Feature extraction covers multidimensional network traffic characteristics such as packet size and transmission frequency. Through efficient data processing and security guarantees, this strategy supports machine learning training and inference and improves the efficiency and accuracy of encrypted traffic classification.
Drawings
The accompanying drawings are included to provide a further understanding of the invention, and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention. In the drawings:
Fig. 1 is a flow chart of the present invention.
Fig. 2 is a system block diagram illustrating the interrelationship and data flow of the control plane, the data plane, and the computation plane in accordance with the present invention.
Fig. 3 is a flow chart of packet processing by the P4 network device, detailing the complete path of a packet from entering the switch to being processed.
Fig. 4 is a schematic diagram of the P4 network device registers, illustrating how the registers store and manage flow feature information.
Fig. 5 is a diagram of the csv feature table, illustrating the storage format and content of the feature data.
Fig. 6 is a diagram of the computed features, showing how the computation plane further processes and analyzes flow statistics.
Detailed Description
The following description of the preferred embodiments of the present invention is provided in connection with the accompanying drawings, and it is to be understood that the preferred embodiments described herein are merely for the purpose of illustrating and explaining the technical aspects of the present invention, and are not to be construed as limiting the present invention.
Examples
As shown in Fig. 1 and Fig. 3, an embodiment of the present disclosure provides a method for accelerating machine learning feature engineering, including the following steps:
S1: Initializing the P4 network device and acquiring data packets:
First, the P4 network device is initialized, and the issuing module of the control plane issues control information to the P4 network device. When a data packet enters the P4 network device through an ingress port, the P4 network device forwards the packet to an egress port as normal, and S2 and S3 are executed along the way to complete the feature calculation.
S2: Calculating the flowID and the state feature values of the data packet, and updating a first data characteristic flow table in a register of the P4 network device, using the flowID as an index:
The registers of the network device are used as a temporary storage module, so that real-time data processing and flow analysis are performed at the network edge.
S201: Packet parsing. When a packet arrives, it is first parsed by the parser block. The headers preceding the transport layer are parsed, and the packet payload is traversed and analyzed by a pipeline of match-action blocks made up of multiple tables. Once this is complete, the packet is forwarded to a specific egress port according to the header information preceding the transport layer and a dynamic forwarding table.
S202: Judge whether the packet belongs to a previous data flow. If yes, execute S221; if not, execute S222.
S221: Calculate a hash value with CRC (cyclic redundancy check) over the source address, destination address, protocol field, source port and destination port fields of the packet's IP header. This hash value is the flowID of the packet and is used as the index for reading the P4 register. If the current flowID exists, update the feature values of the current flow and the first data characteristic flow table; if the flowID does not exist, execute S203.
The P4 register is shown in Fig. 4. The first data characteristic flow table contains the statistical feature information of the current flow.
S222: Copy the packet and send it out, then execute S203.
S203: Judge whether the entry in the current register is available. If it is not available, discard the packet; if it is available, execute S231.
Availability is judged by comparing the basic entry information of the current packet with that of the previous packet. The compared content includes, but is not limited to, the average packet size and the mean and standard deviation of the packet inter-arrival time. If the entry information is consistent, the current entry is judged unavailable and the packet is discarded; if the information is inconsistent, the entry is available and S231 is executed.
S231: Initialize a new feature entry and store it in the register with the flowID as the index value.
S204: Packet encapsulation. The parsed packet headers and the uninspected payload are encapsulated in order.
S205: Send the packet to the feature extraction module of the computation plane.
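The core of step S2, hashing the five-tuple into a flowID and using it to index a per-flow entry, can be sketched in Python as follows. This is a software illustration only, not P4 code from the patent: the register size, the entry fields and the function names are assumptions, CRC32 stands in for whichever CRC variant the device provides, and the availability check of S203 and hash-collision handling are deliberately omitted.

```python
import socket
import struct
import zlib

REGISTER_SIZE = 4096  # hypothetical number of register slots


def flow_id(src_ip, dst_ip, protocol, src_port, dst_port):
    """Hash the IPv4 five-tuple with CRC32 and fold it into the register range."""
    key = (
        socket.inet_aton(src_ip)
        + socket.inet_aton(dst_ip)
        + struct.pack("!BHH", protocol, src_port, dst_port)
    )
    return zlib.crc32(key) % REGISTER_SIZE


def update_flow_table(table, fid, size, timestamp):
    """Create the entry for a new flowID, or update the running features."""
    entry = table.get(fid)
    if entry is None:
        # Corresponds to initializing a new feature entry (S231).
        entry = {"packets": 0, "bytes": 0, "max_len": 0,
                 "first_seen": timestamp, "last_seen": timestamp}
        table[fid] = entry
    entry["packets"] += 1
    entry["bytes"] += size
    entry["max_len"] = max(entry["max_len"], size)
    entry["last_seen"] = timestamp
    return entry


table = {}
fid = flow_id("10.0.0.1", "10.0.0.2", 6, 51000, 443)
update_flow_table(table, fid, 100, 0.0)
update_flow_table(table, fid, 300, 0.5)
```

Because the index is derived only from header fields, every packet of a flow lands in the same slot without the switch having to store the five-tuple as a lookup key.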
S3: Extracting flow statistical features of the encapsulated data packet and storing them in a second data characteristic flow table in the register:
After the packet enters the feature extraction module of the computation plane, the feature extraction module extracts the flow statistical features of the packet and stores them in the temporary storage module, which resides on the network device, as shown in Fig. 4. The feature table records the packet's time, five-tuple, maximum packet length and other information; the specific features are shown in Fig. 6.
S4: Reading the register data in real time, processing the data, and storing it in a database for building a flow statistics model:
S400: The control plane reads the register values once every 5 s and stores them at a designated location on the server as a csv file.
S401: The control plane preprocesses the received data, including but not limited to decompression, formatting, and basic cleaning.
One data cleaning scheme is as follows. First, each data item is checked for completeness, and items missing more than 50% of their field information are discarded; the 50% threshold is a preset value that can be changed as needed. Next, for each item with missing fields, several similar items are selected from the complete items based on the item's remaining information, and the mean of the required field across those items is used to fill the missing field. The filled data is then checked and duplicate items are discarded. Finally, the item information is clustered and isolated points are discarded.
S402: The data preprocessed by the control plane is stored in an efficient, scalable time series database, providing high-quality input data for training and inference of the machine learning model and ensuring that the model can run efficiently.
Second, the disclosed embodiment provides a system architecture for accelerating machine learning feature engineering, shown in Fig. 2, which includes:
Control plane: includes an issuing module, a reading module, a storage module and a data processing module.
The issuing module is used for issuing control instructions that direct the data plane to calculate and extract the data features.
The reading module is used for reading, at a fixed period and according to the instructions issued by the issuing module, the flow table entries stored in the register, and for sending them to the data processing module.
The data processing module is used for preprocessing the data that the reading module reads from the register.
The storage module is used for storing the flow table entries read from the register and the preprocessed data produced by the data processing module.
Data plane: includes a calculation module and registers.
The calculation module is used for forwarding packets that enter through an ingress port to an egress port as normal, mirroring a copy of each packet, processing the copy, parsing the packet content, calculating the features specific to encrypted traffic, and updating the data characteristic flow table.
The registers are used for storing the flow table entries generated by the calculation module and the extraction module.
Computation plane: includes an extraction module.
The extraction module is used for extracting the flow statistical features and flow behavior features of the packets.
Third, this embodiment provides a terminal, including a processor, an input device, an output device, a controller, and a memory.
The processor, the input device, the output device, the controller and the memory are connected to each other: the processor is connected to the memory through the input device, the input device is connected to the network, and the controller is connected to the processor.
The controller configures the processor through a stored P4 network device program, which comprises P4 network device program instructions. The controller is also used for issuing instructions, reading feature data, storing feature data, preprocessing data and building a data model.
The processor is used for executing the P4 network device program instructions and the instructions issued by the controller, performing feature extraction calculations on the data, and generating a flow table for temporarily storing the data features.
Finally, it should be noted that the foregoing is only a preferred embodiment of the present invention and is not intended to limit it. Although the invention has been described in detail with reference to the foregoing embodiment, those skilled in the art may still modify the technical solutions described in that embodiment or substitute equivalents for some of their technical features. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within its scope of protection.

Claims (10)

1. A method of accelerating machine learning feature engineering, comprising the steps of:
S1: initializing a P4 network device and acquiring a data packet;
S2: calculating the flowID and the state feature values of the data packet, and updating a first data characteristic flow table in a register of the P4 network device, using the flowID as an index;
S3: extracting flow statistical features of the encapsulated data packet and storing them in a second data characteristic flow table in the register;
S4: reading the first data characteristic flow table and the second data characteristic flow table from the register, processing the data, and storing it in a database for building a flow statistics model.
2. The method of accelerating machine learning feature engineering according to claim 1, wherein in S1: the issuing module of the control plane issues control information to the P4 network device.
3. The method of accelerating machine learning feature engineering according to claim 1, wherein in S1: the data packet enters the P4 network device through an ingress port, the P4 network device forwards the data packet to an egress port as normal, and the feature calculation is completed in the process.
4. The method of accelerating machine learning feature engineering according to claim 1, wherein in S2: the flowID is a hash value calculated with a CRC algorithm over the source address, destination address, protocol field, source port and destination port fields of the IP header of the data packet.
5. The method of accelerating machine learning feature engineering according to claim 1, wherein in S2: it is judged whether the data packet belongs to a previous data flow; if the data packet does not belong to a previous data flow, the data packet is copied and sent; if the data packet belongs to a previous data flow, the flowID of the data packet is calculated and used to read information in the register, and if the flowID exists, the feature values of the current flow and the feature entry are updated; for a data packet whose flowID does not exist and for a data packet that does not belong to a previous data flow, it is judged whether the entry in the current temporary storage module is available; if it is not available, the data packet is discarded; if it is available, a new feature entry is initialized and stored in the temporary storage module with the flowID as the index value.
6. The method of accelerating machine learning feature engineering according to claim 5, wherein in S2: whether the entry for the current data packet in the temporary storage module is available is judged by comparing the basic feature entry information of the current data packet with that of the previous data packet, and if the entry information of the two is consistent, the current entry is unavailable.
7. The method of accelerating machine learning feature engineering according to claim 1, wherein in S4: the control plane reads the first data characteristic flow table and the second data characteristic flow table stored in the register once every preset time interval, generates a flow feature table from them, and sends the flow feature table to the server for storage.
8. The method of accelerating machine learning feature engineering according to claim 1, wherein in S4: the data read by the control plane is decompressed, formatted and given basic cleaning, and is then stored in a scalable time series database.
9. A system for accelerating machine learning feature engineering, comprising:
Data plane: comprising an input module, an output module, a calculation module and a register, wherein the calculation module is used for processing data packets and calculating the features of the data packets, and the register is used for storing the flow table entries generated by the calculation module and the extraction module;
Computation plane: comprising an extraction module, wherein the extraction module is used for extracting the flow statistical features of the data packets;
Control plane: comprising an issuing module, a reading module, a storage module and a data processing module, wherein the issuing module is used for issuing control instructions, the reading module is used for reading the flow table entries in the register in real time and sending them to the data processing module, the data processing module is used for preprocessing the data read by the reading module from the temporary storage module, and the storage module is used for storing the data read from the temporary storage module and the data processed by the data processing module.
10. A terminal, comprising a processor, an input device, an output device, a controller, and a memory, wherein the processor, the input device, the output device, the controller, and the memory are connected to each other, the controller is configured to store a P4 network device program, the P4 network device program includes P4 network device program instructions for configuring the processor, the controller is further configured to issue instructions, extract and store related data, preprocess the data, and build a data model, and the processor is configured to invoke the program instructions to perform the method according to any of claims 1-8.
CN202410339486.5A 2024-03-22 2024-03-22 Method, system and electronic equipment for accelerating machine learning feature engineering Pending CN118101308A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410339486.5A CN118101308A (en) 2024-03-22 2024-03-22 Method, system and electronic equipment for accelerating machine learning feature engineering

Publications (1)

Publication Number Publication Date
CN118101308A (en) 2024-05-28

Family

ID=91154772

Country Status (1)

Country Link
CN (1) CN118101308A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination