CN111585832A - Industrial control protocol reverse analysis method based on semantic pre-mining - Google Patents

Info

Publication number
CN111585832A
Authority
CN
China
Prior art keywords
semantic
length
message
field
industrial control
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010251465.XA
Other languages
Chinese (zh)
Inventor
王群
苏子漪
叶时平
王章权
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Shuren University
Original Assignee
Zhejiang Shuren University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Shuren University
Priority to CN202010251465.XA
Publication of CN111585832A

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/18 Protocol analysers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/24 Traffic characterised by specific attributes, e.g. priority or QoS
    • H04L47/2441 Traffic characterised by specific attributes, e.g. priority or QoS relying on flow classification, e.g. using integrated services [IntServ]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a reverse analysis method for industrial control protocols based on semantic pre-mining. Before the protocol format is reverse-analyzed and fields are divided, the method pre-mines semantics such as timestamps, lengths, and sequence numbers, which improves the reverse analysis result for industrial control data samples. The basic idea is as follows: when the protocol format of a target industrial control data sample is analyzed, the sample set is first clustered by message length; each class of messages is then examined for timestamp, length, sequence number, and similar fields, and every semantic field found is replaced with wildcards. After this semantic pre-analysis, the data sample is analyzed with the Needleman-Wunsch sequence alignment algorithm. Finally, the wildcards in the analysis result are replaced with the semantic results obtained during pre-analysis, which improves the accuracy of the analysis result. The invention offers accurate analysis results and a high semantic recognition rate.

Description

Industrial control protocol reverse analysis method based on semantic pre-mining
Technical Field
The invention relates to a method for reverse analysis of industrial control protocols, in particular to an industrial control protocol reverse analysis method based on semantic pre-mining, and belongs to the technical field of information security.
Background
Network-traffic-based reverse analysis of industrial control protocols analyzes the communication traffic between a host computer and industrial control equipment in order to infer the communication protocol that the traffic follows; the inferred protocol then serves as the basis for techniques such as fuzz testing and vulnerability mining. Traffic-based methods are highly general and do not depend on complex techniques, so they are widely used in protocol reverse engineering.
Among existing traffic-based methods for reversing industrial control protocols, the classical approach uses a sequence alignment algorithm such as Needleman-Wunsch to compute the relative distance between message byte sequences, clusters the message sequences with high similarity, and then segments fields according to how the message contents vary within each class. This approach often fails, however, because many fields consist neither purely of variable data nor purely of fixed data but of a mixture of both. In that case the sequence alignment algorithm splits the field into two parts, which breaks the actual protocol format, hampers semantic recognition of subsequent fields, and can invalidate the analysis result altogether. Typical examples are timestamp fields and multi-byte length or sequence number fields. In addition, the varying parts of such fields can cause the alignment algorithm to assign messages with the same structure to different classes when clustering the message set, which lowers the accuracy of the analysis result.
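As an illustration of the classical approach described above (not part of the original disclosure), the following Python sketch computes a Needleman-Wunsch global alignment score between two message byte sequences. The function name `nw_score`, the scoring values (match +1, mismatch -1, gap -1), and the sample messages are assumptions chosen for the example.

```python
def nw_score(a: bytes, b: bytes, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Return the Needleman-Wunsch global alignment score of two byte strings."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[len(b)]


# Hypothetical messages with the same layout: 2 header bytes,
# a 4-byte timestamp, and 2 trailing bytes.
m1 = bytes.fromhex("aa015e8400000010")
m2 = bytes.fromhex("aa015e84012c0010")
print(nw_score(m1, m1), nw_score(m1, m2))
```

The two sample messages differ only in the low-order bytes of the timestamp, yet their score is lower than that of two identical messages, and the constant high-order timestamp bytes align as if they were fixed header data. This is the field-splitting effect described in the background.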
Disclosure of Invention
Purpose of the invention: when a sequence alignment algorithm is used in the reverse analysis of an industrial control protocol, fields composed of both fixed and variable data are difficult to identify accurately. To address this problem, the invention provides an industrial control protocol reverse analysis method based on semantic pre-mining, which improves the accuracy of the protocol reverse analysis result. Research shows that some fields with specific semantics have distinctive characteristics. These characteristics can be used to pre-process the sample data, mine the corresponding semantics from it, and feed that information into the subsequent sequence alignment analysis, which improves the analysis. Semantic pre-mining therefore plays an important role in improving protocol reversal.
Technical scheme: an industrial control protocol reverse analysis method based on semantic pre-mining, which concerns the procedure used when reverse-analyzing an unknown industrial control protocol. When the protocol format of a target industrial control data sample is analyzed, the sample set to be analyzed is first clustered by message length; semantic fields are then found and marked for each class of messages, which completes the semantic pre-analysis; the data sample is then analyzed with the Needleman-Wunsch sequence alignment algorithm; and finally the semantic results obtained by pre-analysis are substituted back into the analysis result.
The method specifically comprises the following steps:
step 1, read the messages in the sample data set to be reverse-analyzed, store them in a message set Dataset, and go to step 2;
step 2, analyze the messages in Dataset with a clustering algorithm so that all messages of the same length are grouped into one class, and go to step 3;
step 3, perform timestamp semantic recognition on each class of messages; if a timestamp field is recognized, mark its position; then go to step 4;
step 4, perform length and sequence number recognition on each class of messages; if a length or sequence number field is recognized, mark its position as a length or sequence number field; then go to step 5;
step 5, replace the message content marked as timestamp, length, and sequence number fields in Dataset with wildcards, perform protocol reverse analysis on the messages in Dataset with the Needleman-Wunsch sequence alignment algorithm, and go to step 6;
step 6, using the earlier replacements, reverse-replace the wildcard parts of the analysis result to recover the original data content and the corresponding semantic labels, take the converted result as the final analysis result, and end the program.
In step 3, timestamp semantic recognition is performed on each class of messages in units of 4 bytes, starting from the message head; if a timestamp field is recognized, its position is marked.
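The disclosure does not state the criteria by which a 4-byte unit is recognized as a timestamp. The following Python sketch shows one possible heuristic under stated assumptions: the timestamp is a big-endian Unix epoch value in seconds, it lies between 2000-01-01 and the signed 32-bit limit, and it never decreases in capture order. The function name, the bounds, and the one-byte scanning step are all hypothetical.

```python
import struct
from typing import List


def find_timestamp_offsets(
    messages: List[bytes],
    lo: int = 946684800,   # 2000-01-01 00:00:00 UTC (assumed lower bound)
    hi: int = 2147483647,  # signed 32-bit limit (assumed upper bound)
    step: int = 1,         # scanning step in bytes (assumption)
) -> List[int]:
    """Return offsets whose 4-byte value looks like an epoch timestamp in every message."""
    if len(messages) < 2:
        return []
    size = min(len(m) for m in messages)
    offsets = []
    for off in range(0, size - 3, step):
        values = [struct.unpack_from(">I", m, off)[0] for m in messages]
        in_range = all(lo <= v <= hi for v in values)
        ordered = all(x <= y for x, y in zip(values, values[1:]))
        varies = len(set(values)) > 1  # a constant value is treated as fixed data
        if in_range and ordered and varies:
            offsets.append(off)
    return offsets


# Hypothetical class of 8-byte messages with a timestamp at offset 2.
sample = [b"\xaa\x01" + struct.pack(">I", 1585700000 + 3 * i) + b"\x00\x10" for i in range(5)]
print(find_timestamp_offsets(sample))
```

A heuristic of this kind can produce false positives when neighbouring bytes happen to satisfy the same conditions, so the bounds would need tuning for a real capture.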
In step 3, timestamp semantic recognition is performed on each class of messages; if a timestamp field is recognized, its position is marked, the content of the timestamp field is replaced with wildcards, and a record table is generated that records the content and position of the replaced timestamp field in the message.
In step 4, length and sequence number recognition is performed on each class of messages in units of 2 bytes and of 1 byte, starting from the message head; if a length or sequence number field is recognized, its position is marked as a length or sequence number field.
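The recognition criteria for length and sequence number fields are likewise not given in the disclosure. The sketch below is a hypothetical heuristic: a unit is taken as a length field if its big-endian value equals either the total message length or the number of bytes that follow the field, and as a sequence number if it increases by exactly 1 (modulo the field width) between consecutive messages. Two-byte units are tried before one-byte units, and bytes already assigned to a field are skipped.

```python
from typing import Dict, List, Tuple


def find_length_and_sequence_fields(messages: List[bytes]) -> Dict[str, List[Tuple[int, int]]]:
    """Return {"length": [(offset, width)], "sequence": [(offset, width)]} for one message class."""
    found: Dict[str, List[Tuple[int, int]]] = {"length": [], "sequence": []}
    if len(messages) < 2:
        return found
    size = min(len(m) for m in messages)
    claimed = set()  # byte offsets already assigned to a field
    for width in (2, 1):
        modulus = 1 << (8 * width)
        for off in range(size - width + 1):
            span = range(off, off + width)
            if any(i in claimed for i in span):
                continue
            values = [int.from_bytes(m[off:off + width], "big") for m in messages]
            if all(v in (len(m), len(m) - off - width) for v, m in zip(values, messages)):
                kind = "length"
            elif all((y - x) % modulus == 1 for x, y in zip(values, values[1:])):
                kind = "sequence"
            else:
                continue
            found[kind].append((off, width))
            claimed.update(span)
    return found


# Hypothetical 7-byte messages: header, 2-byte length of the remainder,
# 1-byte sequence number, 3 payload bytes.
sample = [b"\xaa" + (4).to_bytes(2, "big") + bytes([seq]) + b"\x10\x20\x30" for seq in (5, 6, 7)]
print(find_length_and_sequence_fields(sample))
```

Trying the wider unit first and skipping claimed bytes is a design choice made here to avoid reporting a constant byte followed by a one-byte counter as a two-byte sequence number.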
In step 4, length and sequence number recognition is performed on each class of messages; if a length or sequence number field is recognized, its position is marked as a length or sequence number field, the contents of the length and sequence number fields are each replaced with wildcards, and a record table is generated that records the content and position of each replaced length field and sequence number field in the message.
In step 5, the message content marked as timestamp, length, and sequence number fields in Dataset is replaced with wildcards.
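The wildcard replacement and the record table can be sketched as follows. This is a minimal illustration, assuming that a message is held as a list of byte values, that the wildcard is represented by `None` so that it cannot collide with a real byte value, and that each record-table entry is a dictionary with `offset`, `label`, and `content` keys. None of these representations is specified in the disclosure.

```python
from typing import Dict, List, Optional, Tuple

WILDCARD = None  # assumed wildcard token, distinct from every byte value 0..255


def mask_fields(
    message: bytes, fields: List[Tuple[int, int, str]]
) -> Tuple[List[Optional[int]], List[Dict[str, object]]]:
    """Replace each (offset, width, label) field with wildcards and record what was replaced."""
    tokens: List[Optional[int]] = list(message)
    record: List[Dict[str, object]] = []
    for off, width, label in fields:
        # Record the original content and position before overwriting it.
        record.append({"offset": off, "label": label, "content": message[off:off + width]})
        for i in range(off, off + width):
            tokens[i] = WILDCARD
    return tokens, record


tokens, record = mask_fields(bytes.fromhex("aa015e83dca00010"), [(2, 4, "timestamp")])
print(tokens)
print(record)
```

After masking, two messages that differ only in their timestamp bytes produce identical token lists, which is what raises their similarity in the subsequent alignment.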
Compared with the prior art, the invention has the following advantages:
(1) The recognition rate of protocol semantics can be improved: through semantic pre-mining, the method performs semantic analysis in advance, which avoids the semantic analysis failures caused by wrongly splitting fields such as timestamps and thereby improves the recognition rate of protocol semantics.
(2) The accuracy of the analysis result can be improved: before the widely used Needleman-Wunsch sequence alignment algorithm is applied to reverse-analyze the protocol, semantic pre-mining and content replacement raise the similarity between messages, so that messages with the same structure are grouped into one class and the accuracy of the analysis result is ensured.
Drawings
Fig. 1 is a flowchart of the industrial control protocol reverse analysis and processing applied to a sample data set according to an embodiment of the present invention.
Detailed Description
The present invention is further illustrated by the following embodiment. The embodiment is purely illustrative and does not limit the scope of the invention; equivalent modifications that occur to those skilled in the art upon reading this disclosure fall within the scope of the appended claims.
First, the operating environment required by the method is described. The method requires a PC with an Intel-Windows architecture and a sample data set in pcap format. The sample data set can be captured with a tool such as Wireshark, and all messages in it belong to the industrial control protocol to be analyzed.
The PC that runs the method is configured as follows: software implementing the method is installed and run on a PC with an Intel-Windows architecture; the hardware comprises an eight-core Core CPU with a clock frequency of 2.5 GHz or higher, at least 4 GB of memory, and a 500 GB hard disk; the operating system is Windows 7.
FIG. 1 shows the processing flow of the industrial control protocol reverse analysis method based on semantic pre-mining. The flow begins with step S101, in which the program reads a pcap file and adds all messages to the data set Dataset, and then goes to step S102.
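Step S101 can be sketched with the Python standard library alone. The example below assumes the classic pcap file format (a 24-byte global header followed by 16-byte record headers) and returns each captured frame whole. Removing the link, network, and transport headers to reach the industrial control payload is left out, and the function name `read_pcap_records` is hypothetical.

```python
import struct
from typing import BinaryIO, List


def read_pcap_records(stream: BinaryIO) -> List[bytes]:
    """Read every captured record from a classic pcap stream into a list."""
    header = stream.read(24)
    if len(header) < 24:
        raise ValueError("truncated pcap global header")
    magic = header[:4]
    # The magic number reveals the byte order used by the writer
    # (microsecond and nanosecond variants are both accepted).
    if magic in (b"\xd4\xc3\xb2\xa1", b"\x4d\x3c\xb2\xa1"):
        endian = "<"
    elif magic in (b"\xa1\xb2\xc3\xd4", b"\xa1\xb2\x3c\x4d"):
        endian = ">"
    else:
        raise ValueError("not a classic pcap file")
    dataset: List[bytes] = []
    while True:
        record_header = stream.read(16)
        if len(record_header) < 16:
            break  # end of file
        _ts_sec, _ts_frac, incl_len, _orig_len = struct.unpack(endian + "IIII", record_header)
        data = stream.read(incl_len)
        if len(data) < incl_len:
            break  # truncated final record
        dataset.append(data)
    return dataset
```

A file opened with `open(path, "rb")` can be passed directly as `stream`. Captures saved in the newer pcapng format would need a different reader.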
In step S102, a clustering algorithm is used to cluster the messages in Dataset so that messages of the same length fall into the same class, and the flow then goes to S103.
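Because the only clustering criterion named in the disclosure is equal message length, step S102 can be illustrated by a simple grouping. The sketch below is one way to do this; the function name `cluster_by_length` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def cluster_by_length(dataset: Iterable[bytes]) -> Dict[int, List[bytes]]:
    """Group messages so that each class contains messages of one length, in capture order."""
    clusters: Dict[int, List[bytes]] = defaultdict(list)
    for message in dataset:
        clusters[len(message)].append(message)
    return dict(clusters)


classes = cluster_by_length([b"\x01\x02", b"\x03\x04\x05", b"\x06\x07"])
print(classes)
```

Keeping capture order inside each class matters for later steps that compare consecutive messages, such as checks on increasing timestamps or sequence numbers.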
In step S103, timestamp semantic recognition is performed on each class of messages in units of 4 bytes, starting from the message head, and the flow then goes to S104.
In step S104, it is determined whether the message contains a timestamp field; if so, the flow goes to S105, otherwise to S106.
In step S105, the timestamp field is marked and its entire content is replaced with the wildcard "*"; a record table is generated that records the content and position of the replaced timestamp field in the message, and the flow then goes to S106.
In step S106, length and sequence number semantic recognition is performed on each class of messages in units of 2 bytes and of 1 byte, starting from the message head, and the flow then goes to S107.
In step S107, it is determined whether the message contains a length or sequence number field; if so, the flow goes to S108, otherwise to S109.
In step S108, the length and sequence number fields are marked and their content is replaced with the wildcard "*"; a record table is generated that records the content and position of each replaced length and sequence number field in the message, and the flow then goes to S109.
In step S109, the Needleman-Wunsch sequence alignment algorithm is used to perform protocol reverse analysis on the messages in Dataset, and the flow then goes to S110.
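To show how alignment can operate on wildcard-masked messages, the following sketch runs Needleman-Wunsch with traceback over token lists in which a wildcard is represented by `None`. Treating two wildcards as a match, the scoring values, and the `"-"` gap marker are assumptions, since the disclosure does not specify them.

```python
from typing import List, Optional, Tuple, Union

Token = Optional[int]  # a byte value, or None for a wildcard (assumed representation)
GAP = "-"              # assumed gap marker in the aligned output


def align(
    a: List[Token], b: List[Token], match: int = 1, mismatch: int = -1, gap: int = -1
) -> Tuple[List[Union[Token, str]], List[Union[Token, str]], int]:
    """Globally align two token lists; return both aligned lists and the score."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch  # None == None counts as a match
            score[i][j] = max(score[i - 1][j - 1] + s, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a: List[Union[Token, str]] = []
    out_b: List[Union[Token, str]] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                out_a.append(a[i - 1]); out_b.append(b[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append(GAP)
            i -= 1
        else:
            out_a.append(GAP); out_b.append(b[j - 1])
            j -= 1
    return out_a[::-1], out_b[::-1], score[n][m]


masked = [0xAA, 0x01, None, None, None, None, 0x00, 0x10]
print(align(masked, list(masked)))
```

When several alignments share the best score, this traceback prefers the diagonal move, so the returned alignment is one optimal alignment rather than the only one.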
In step S110, the wildcard "*" parts of the analysis result are reverse-replaced according to the record table, undoing the earlier replacement, so that the original data content and the corresponding semantic labels are recovered; the program then ends.
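The reverse replacement in step S110 can be sketched as a lookup in the record table. The example assumes that wildcards are represented by `None`, that alignment gaps appear as `"-"`, and that each record-table entry is a dictionary with `offset`, `label`, and `content` keys; these representations are illustrative and are not taken from the disclosure.

```python
from typing import Dict, List, Optional, Tuple, Union


def restore(
    aligned: List[Union[Optional[int], str]],
    record: List[Dict[str, object]],
    gap: str = "-",
) -> List[Tuple[Union[int, str], Optional[str]]]:
    """Replace wildcards with the recorded bytes and attach the semantic label to each."""
    by_offset: Dict[int, Tuple[int, str]] = {}
    for entry in record:
        for k, byte in enumerate(entry["content"]):
            by_offset[entry["offset"] + k] = (byte, entry["label"])
    restored: List[Tuple[Union[int, str], Optional[str]]] = []
    pos = 0  # offset in the original message; gaps do not advance it
    for token in aligned:
        if token == gap:
            restored.append((gap, None))
            continue
        if token is None:
            restored.append(by_offset[pos])  # wildcard: recover byte and label
        else:
            restored.append((token, None))   # ordinary byte: no semantic label
        pos += 1
    return restored


record = [{"offset": 1, "label": "length", "content": b"\x00\x04"}]
print(restore([0xAA, "-", None, None, 0x10], record))
```

Counting only non-gap tokens keeps the lookup aligned with the original offsets even when the alignment has inserted gaps.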
Results of the experiment
In this embodiment, software implementing the semantic pre-mining based industrial control protocol reverse analysis method was run on a PC to reverse-analyze the HTTP protocol, in order to observe how the method behaves in operation and to provide a basis for designing fast reverse analysis methods for industrial control protocols.
HTTP pcap files containing more than 1000 messages were selected and the software was run on the PC. Running the program showed that the system improves the recognition accuracy of semantic fields in the HTTP protocol.
The parts not involved in the present invention are the same as or can be implemented using the prior art.

Claims (9)

1. An industrial control protocol reverse analysis method based on semantic pre-mining, characterized in that: when the protocol format of a target industrial control data sample is analyzed, the sample set to be analyzed is first clustered by message length; semantic fields are then found and marked for each class of messages, which completes the semantic pre-analysis; the data sample is then analyzed with the Needleman-Wunsch sequence alignment algorithm; and finally the semantic results obtained by pre-analysis are substituted back into the analysis result.
2. The industrial control protocol reverse analysis method based on semantic pre-mining according to claim 1, characterized in that clustering the sample set to be analyzed by message length, finding and marking semantic fields for each class of messages, and completing the semantic pre-analysis specifically comprises the following steps:
step 1, read the messages in the sample data set to be reverse-analyzed, store them in a message set Dataset, and go to step 2;
step 2, analyze the messages in Dataset with a clustering algorithm so that all messages of the same length are grouped into one class, and go to step 3;
step 3, perform timestamp semantic recognition on each class of messages; if a timestamp field is recognized, mark its position; then go to step 4;
step 4, perform length and sequence number recognition on each class of messages; if a length or sequence number field is recognized, mark its position as a length or sequence number field; then go to step 5;
step 5, replace the message content marked as timestamp, length, and sequence number fields in Dataset with wildcards.
3. The industrial control protocol reverse analysis method based on semantic pre-mining as claimed in claim 2, wherein: protocol reverse analysis is performed on the messages in Dataset from step 5 with the Needleman-Wunsch sequence alignment algorithm; the wildcard parts of the analysis result are reverse-replaced according to the earlier replacement to recover the original data content and the corresponding semantic labels; and the converted result is taken as the final analysis result.
4. The industrial control protocol reverse analysis method based on semantic pre-mining as claimed in claim 1, wherein: in step 3, timestamp semantic recognition is performed on each class of messages in units of 4 bytes, starting from the message head, and if a timestamp field is recognized, its position is marked.
5. The industrial control protocol reverse analysis method based on semantic pre-mining as claimed in claim 1, wherein: in step 4, length and sequence number recognition is performed on each class of messages in units of 2 bytes and of 1 byte, starting from the message head, and if a length or sequence number field is recognized, its position is marked as a length or sequence number field.
6. The industrial control protocol reverse analysis method based on semantic pre-mining as claimed in claim 1, wherein: in step 5, the message content marked as timestamp, length, and sequence number fields in Dataset is replaced with wildcards.
7. The industrial control protocol reverse analysis method based on semantic pre-mining as claimed in claim 1, wherein: in step 3, timestamp semantic recognition is performed on each class of messages; if a timestamp field is recognized, its position is marked, the content of the timestamp field is replaced with wildcards, and a record table is generated that records the content and position of the replaced timestamp field in the message.
8. The industrial control protocol reverse analysis method based on semantic pre-mining as claimed in claim 1, wherein: in step 4, length and sequence number recognition is performed on each class of messages; if a length or sequence number field is recognized, its position is marked as a length or sequence number field, the contents of the length and sequence number fields are each replaced with wildcards, and a record table is generated that records the content and position of each replaced length field and sequence number field in the message.
9. The industrial control protocol reverse analysis method based on semantic pre-mining according to claim 7 or 8, characterized in that: protocol reverse analysis is performed on the messages in Dataset from step 5 with the Needleman-Wunsch sequence alignment algorithm; the wildcard parts of the analysis result are reverse-replaced according to the record table, undoing the earlier replacement, to recover the original data content and the corresponding semantic labels; and the converted result is taken as the final analysis result.
CN202010251465.XA 2020-04-01 2020-04-01 Industrial control protocol reverse analysis method based on semantic pre-mining Pending CN111585832A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010251465.XA CN111585832A (en) 2020-04-01 2020-04-01 Industrial control protocol reverse analysis method based on semantic pre-mining

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010251465.XA CN111585832A (en) 2020-04-01 2020-04-01 Industrial control protocol reverse analysis method based on semantic pre-mining

Publications (1)

Publication Number Publication Date
CN111585832A (en) 2020-08-25

Family

ID=72126082

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010251465.XA Pending CN111585832A (en) 2020-04-01 2020-04-01 Industrial control protocol reverse analysis method based on semantic pre-mining

Country Status (1)

Country Link
CN (1) CN111585832A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113194010A (en) * 2021-04-28 2021-07-30 浙江大学 Field semantic analysis method of non-public industrial communication protocol
CN114745417A (en) * 2022-04-12 2022-07-12 广东技术师范大学 Industrial control protocol semantic analysis method based on industrial side channel information
CN115134433A (en) * 2022-06-24 2022-09-30 国网数字科技控股有限公司 Semantic analysis method, system, equipment and storage medium of industrial control protocol
WO2023121878A1 (en) * 2021-12-21 2023-06-29 Forescout Technologies, Inc. Iterative development of protocol parsers

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109922087A (en) * 2019-04-23 2019-06-21 广东技术师范大学 Analytic method, device, system and the computer storage medium of industry control agreement
CN110061931A (en) * 2019-04-23 2019-07-26 广东技术师范大学 Clustering method, device, system and the computer storage medium of industry control agreement
CN110113332A (en) * 2019-04-30 2019-08-09 北京奇安信科技有限公司 A kind of detection industry control agreement whether there is the method and device of exception
CN110213130A (en) * 2019-06-03 2019-09-06 南京莱克贝尔信息技术有限公司 A kind of industry control protocol format analysis method based on iteration optimization

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
程必成: "Research on format reverse-engineering methods for non-standard industrial control protocols" (非标工业控制协议格式逆向方法研究), 《计算机技术与应用》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113194010A (en) * 2021-04-28 2021-07-30 浙江大学 Field semantic analysis method of non-public industrial communication protocol
WO2023121878A1 (en) * 2021-12-21 2023-06-29 Forescout Technologies, Inc. Iterative development of protocol parsers
US11777832B2 (en) 2021-12-21 2023-10-03 Forescout Technologies, Inc. Iterative development of protocol parsers
CN114745417A (en) * 2022-04-12 2022-07-12 广东技术师范大学 Industrial control protocol semantic analysis method based on industrial side channel information
CN114745417B (en) * 2022-04-12 2023-07-04 广东技术师范大学 Industrial control protocol semantic analysis method based on industrial side channel information
CN115134433A (en) * 2022-06-24 2022-09-30 国网数字科技控股有限公司 Semantic analysis method, system, equipment and storage medium of industrial control protocol
CN115134433B (en) * 2022-06-24 2024-03-29 国网数字科技控股有限公司 Semantic analysis method, system and equipment of industrial control protocol and storage medium

Similar Documents

Publication Publication Date Title
CN111585832A (en) Industrial control protocol reverse analysis method based on semantic pre-mining
CN107665191B (en) Private protocol message format inference method based on extended prefix tree
CN109040081B (en) Protocol field reverse analysis system and method based on BWT
US7392311B2 (en) System and method for throttling events in an information technology system
WO2020038353A1 (en) Abnormal behavior detection method and system
CN111277578A (en) Encrypted flow analysis feature extraction method, system, storage medium and security device
CN109525508B (en) Encrypted stream identification method and device based on flow similarity comparison and storage medium
CN111385297B (en) Wireless device fingerprint identification method, system, device and readable storage medium
US8682864B1 (en) Analyzing frequently occurring data items
CN111314279B (en) Unknown protocol reverse method based on network flow
CN113328985B (en) Passive Internet of things equipment identification method, system, medium and equipment
CN114553983B (en) Deep learning-based high-efficiency industrial control protocol analysis method
CN112732655B (en) Online analysis method and system for format-free log
CN109275045B (en) DFI-based mobile terminal encrypted video advertisement traffic identification method
CN113452672A (en) Method for analyzing abnormal flow of terminal of Internet of things of electric power based on reverse protocol analysis
CN109660656A (en) A kind of intelligent terminal method for identifying application program
Kleber et al. Message type identification of binary network protocols using continuous segment similarity
CN115622926A (en) Industrial control protocol reverse analysis method based on network traffic
Yujie et al. End-to-end android malware classification based on pure traffic images
Li et al. A lightweight intrusion detection model based on feature selection and maximum entropy model
CN111581057B (en) General log analysis method, terminal device and storage medium
CN111966339B (en) Buried point parameter input method and device, computer equipment and storage medium
CN114168610B (en) Distributed storage and query method and system based on line sequence division
Yang et al. Deep learning-based reverse method of binary protocol
CN108924002A (en) A kind of analytic method of performance data files, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200825