CN114189350B

CN114189350B - LightGBM-based train communication network intrusion detection method

Info

Publication number: CN114189350B
Application number: CN202111219056.2A
Authority: CN
Inventors: 聂晓波; 王登锐; 岳川; 闫海鹏; 王立德
Original assignee: Beijing Jiaotong University; China State Railway Group Co Ltd
Current assignee: Beijing Jiaotong University; China State Railway Group Co Ltd
Priority date: 2021-10-20
Filing date: 2021-10-20
Publication date: 2023-03-07
Anticipated expiration: 2041-10-20
Also published as: CN114189350A

Abstract

The invention relates to a LightGBM-based intrusion detection method for Ethernet-based train communication networks. Data packets in the train communication network are captured using the Scapy packet sniffing tool and traffic aggregation equipment. The captured data are processed according to TRDP (Train Real-time Data Protocol), the communication protocol specific to the train communication network, and the TRDP-specific features are parsed. Features are then selected in light of the particularities of the network to obtain a basic feature data set, which is further processed into a dedicated data set suitable for train communication network intrusion detection research. A LightGBM ensemble learning algorithm is trained on this data set to form a train intrusion detection model that balances real-time performance and accuracy. Finally, a mathematical quantitative evaluation model is constructed to quantify the classification results of the model, yielding in an intuitive form a quantified value of the threat that network attacks pose to the train communication network at the evaluation time, so that the security state of the train communication network can be better assessed.

Description

LightGBM-based train communication network intrusion detection method

Technical Field

The invention belongs to the technical field of train communication network security and relates to a LightGBM-based train communication network intrusion detection method.

Background

The conventional Train Communication Network (TCN) of rail transit vehicles consists of a two-level bus, comprising a Wire Train Bus (WTB) and a Multifunction Vehicle Bus (MVB). This bus technology is reliable and real-time, and it is the communication network standard used by most trains at present. However, as trains become more integrated and intelligent, train communication data traffic is growing rapidly, and the shortcomings of the traditional TCN in handling high-speed, high-volume data have become prominent. Against this background, on-board Ethernet is gradually becoming the mainstream on-board network communication system of the new generation, thanks to its high communication rate, simple networking and flexible networking modes.

However, owing to the openness of Ethernet itself and the deep application of new technologies such as cloud computing and the Internet of Things in the rail transit industry, the train communication network is no longer a closed "information island", and the "exposed surface" of its data is expanding rapidly. At present the rail transit industry still relies mostly on traditional network security technologies and equipment to build its protection systems, whereas intrusion detection technology can address these problems to a large extent. Constructing an intrusion detection method dedicated to the train communication network is therefore an important issue for the large-scale adoption of on-board Ethernet.

Intrusion detection is a network security protection method that monitors the state of a protected network by collecting and analyzing information from core devices in the network, determines whether behavior violating the security policy or unauthorized malicious behavior exists in the network, and takes measures to ensure the confidentiality, integrity and availability of system operation. The arrival of the big data era has brought explosive growth in network communication data, and new network attack techniques are increasingly varied. In response, researchers in China and abroad have introduced new technologies such as artificial intelligence and machine learning into the intrusion detection field and have achieved substantial results.

However, the existing intrusion detection research results are designed for other kinds of network systems, and no intrusion detection method dedicated to the train communication network currently exists. For the specific research topic of intrusion detection in Ethernet-based train communication networks, the prior art has the following shortcomings:

(1) To carry out intrusion detection research on a target network with machine learning methods, the raw data of network intrusions must first be obtained and a corresponding data preprocessing method designed. Train Ethernet differs from conventional Ethernet: it has a special network structure and uses a proprietary network protocol and proprietary data formats, so existing data acquisition and data preprocessing methods cannot be applied directly to a train communication network.

(2) Most intrusion detection algorithms, such as optimized support vector machines or neural networks and their variants, focus on the prediction accuracy of the algorithm and pay relatively little attention to the time cost of the model. The train communication network, however, carries strongly real-time information such as braking and traction commands, so intrusion detection for the train communication network places high demands on both real-time performance and accuracy, and existing intrusion detection algorithms cannot meet these demands well.

(3) The output of existing intrusion detection methods is generally a predefined attack type or the like, and a quantitative evaluation of the intrusion detection result is lacking. Because the train communication network is a scenario with frequent human-machine interaction, intrusion detection results need to be fed back to network maintenance personnel in an intuitive way, so a quantitative evaluation method that represents the network security state more intuitively is required.

Disclosure of Invention

The invention aims to provide a LightGBM-based train communication network intrusion detection method that solves the following problems in the prior art:

1) the lack of methods for acquiring and processing intrusion data for train communication networks;

2) the inability of existing intrusion detection algorithms to meet the requirements of a train communication network for both detection accuracy and real-time detection;

3) the lack of a quantitative evaluation method for train communication network intrusion detection results.

The technical solution adopted by the invention is as follows:

A LightGBM-based train communication network intrusion detection method specifically comprises the following steps:

Step 1: sniff the whole-network traffic with a network aggregator device to aggregate the data, then use the Python language and the interactive packet processing module Scapy to sniff the network-card traffic aggregated to the information processing terminal, thereby obtaining a binary packet file;

Step 2: perform feature parsing on the binary packet file obtained in step 1, including layered parsing of the Ethernet protocol stack and parsing of the data frame content of TRDP, the communication protocol specific to the train communication network, to obtain packet-level data features;

Step 3: select and extract features from the packet-level data features obtained in step 2, taking the characteristics of the train communication network into account, to obtain a basic feature data set;

Step 4: preprocess the features in the basic feature data set obtained in step 3, including null value filling and category feature designation, to obtain a dedicated data set for train communication network intrusion detection;

Step 5: randomly divide the dedicated data set for train communication network intrusion detection into a training set and a validation set at a ratio of 8:2, where the training set is used to train the train communication network intrusion detection model and the validation set is used to verify its training effect;

Step 6: build the train communication network intrusion detection model and train it iteratively with the training set; the training result of the model is a classification of the types of network intrusion attack suffered by the train communication network;

Step 7: construct a mathematical quantitative evaluation model to quantify the classification result of the network intrusion attack types, and output an evaluation value in percentage form.

On the basis of the above scheme, step 2 specifically comprises the following steps:

Step 2.1: first, parse the Ether, Dot1Q (802.1Q) and IP protocols in sequence; next, parse the UDP protocol and check its destination port number; go to step 2.2 when the port number is 17225, and to step 2.3 when the port number is 17224;

Step 2.2: when the port number is 17225, the message may be a TRDP-MD message; perform TRDP-MD protocol parsing and determine whether the message conforms to the TRDP-MD protocol specification; if it does, the message is a TRDP-MD message and step 2.4 follows; if not, the message is an ordinary UDP message;

Step 2.3: when the port number is 17224, the message may be a TRDP-PD message; perform TRDP-PD protocol parsing and determine whether the message conforms to the TRDP-PD protocol specification; if it does, the message is a TRDP-PD message and step 2.5 follows; if not, the message is an ordinary UDP message;

Step 2.4: perform TRDP-MD feature extraction on TRDP-MD messages that conform to the TRDP-MD protocol specification to obtain the TRDP-MD features;

Step 2.5: perform TRDP-PD feature extraction on TRDP-PD messages that conform to the TRDP-PD protocol specification to obtain the TRDP-PD features;

Step 2.6: perform TCP-layer protocol parsing on the TRDP-MD features and TRDP-PD features obtained in steps 2.4 and 2.5, and determine whether the port number is 17225; when the port number is 17225, return to step 2.2; when the port number is not 17225, parsing is complete and the packet-level data features are obtained.

On the basis of the above scheme, the basic feature data set in step 3 includes 36-dimensional data features, which are divided into eight categories: general features, Ether, 802.1Q, IP (IPv4), ICMP, UDP, TCP and TRDP;

wherein the general features include: Protocol and Len_total; Protocol denotes the protocol type and Len_total denotes the total message length;

Ether includes: Type_ETH, Src_ETH and Dst_ETH; Type_ETH denotes the Ethernet type, Src_ETH the source MAC address and Dst_ETH the destination MAC address;

802.1Q includes: Prio_Dot1Q, ID_Dot1Q, Vlan_Dot1Q and Type_Dot1Q; Prio_Dot1Q denotes the priority, ID_Dot1Q the canonical format indicator bit, Vlan_Dot1Q the VLAN number and Type_Dot1Q the frame type;

IP (IPv4) includes: Ver_IP, Src_IP, Dst_IP, Len_IP, IHL_IP, DSF_IP, ID_IP, Flag_IP, Frag_IP and TTL_IP; Ver_IP denotes the IP protocol version number, Src_IP the source IP address, Dst_IP the destination IP address, Len_IP the IP packet length, IHL_IP the header length, DSF_IP the differentiated services field, ID_IP the IP identifier, Flag_IP the IP flag bits, Frag_IP the fragment offset and TTL_IP the time to live;

ICMP includes: Type_ICMP, Code_ICMP, Id_ICMP and Seq_ICMP; Type_ICMP denotes the ICMP message type, Code_ICMP the ICMP message code, Id_ICMP the ICMP process identifier and Seq_ICMP the ICMP sequence number;

UDP includes: Src_Port_UDP, Dst_Port_UDP and Len_UDP; Src_Port_UDP denotes the UDP source port, Dst_Port_UDP the UDP destination port and Len_UDP the UDP length;

TCP includes: Src_Port_TCP, Dst_Port_TCP, Len_TCP, Seq_TCP, Ack_TCP, Flag_TCP and Win_val_TCP; Src_Port_TCP denotes the TCP source port, Dst_Port_TCP the TCP destination port, Len_TCP the TCP length, Seq_TCP the TCP sequence number, Ack_TCP the TCP acknowledgement number, Flag_TCP the TCP flags and Win_val_TCP the TCP window value;

TRDP includes: Seq_TRDP, Ver_TRDP and Type_TRDP; Seq_TRDP denotes the TRDP sequence number, Ver_TRDP the TRDP version and Type_TRDP the TRDP message type.
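The eight feature categories listed above can be written down as a small lookup structure, which also makes it easy to check that they add up to 36 dimensions. This is only an illustrative sketch: the dictionary layout, the category keys and the helper `empty_record` are assumptions made here, and only the feature names come from the listing.

```python
# Feature names follow the listing above; grouping them in a dict and the
# category keys are illustrative choices, not something the source prescribes.
FEATURES = {
    "General": ["Protocol", "Len_total"],
    "Ether": ["Type_ETH", "Src_ETH", "Dst_ETH"],
    "802.1Q": ["Prio_Dot1Q", "ID_Dot1Q", "Vlan_Dot1Q", "Type_Dot1Q"],
    "IPv4": ["Ver_IP", "Src_IP", "Dst_IP", "Len_IP", "IHL_IP", "DSF_IP",
             "ID_IP", "Flag_IP", "Frag_IP", "TTL_IP"],
    "ICMP": ["Type_ICMP", "Code_ICMP", "Id_ICMP", "Seq_ICMP"],
    "UDP": ["Src_Port_UDP", "Dst_Port_UDP", "Len_UDP"],
    "TCP": ["Src_Port_TCP", "Dst_Port_TCP", "Len_TCP", "Seq_TCP", "Ack_TCP",
            "Flag_TCP", "Win_val_TCP"],
    "TRDP": ["Seq_TRDP", "Ver_TRDP", "Type_TRDP"],
}

# Flat, ordered list of all feature names (one column per feature).
FEATURE_NAMES = [name for group in FEATURES.values() for name in group]


def empty_record():
    """Return a per-packet record with every feature unset (None).

    Hypothetical helper: a parser would fill in the fields that apply to
    the packet and leave the rest as None for later null value filling.
    """
    return dict.fromkeys(FEATURE_NAMES)
```

A packet that carries no TCP layer, for example, would simply leave the seven TCP fields unset in such a record.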

On the basis of the above scheme, the null value filling in step 4 means filling in the null-valued features that arise during communication protocol parsing. To keep null values from affecting other features as far as possible, the value "-1" is used to fill null feature items; the filled value "-1" carries no specific physical meaning;

the category feature designation means designating the features of categorical type in the basic feature data set obtained in step 3. A categorical feature is one in which a specific value does not stand for a numerical magnitude but serves as a symbol with a specific physical meaning. Based on an analysis of each feature in the basic feature data set obtained in step 3, four features, Src_ETH, Dst_ETH, Src_IP and Dst_IP, are designated as category features.

On the basis of the above scheme, building the train communication network intrusion detection model in step 6 specifically comprises:

The intrusion detection training set is denoted $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where

$x_i \in X \subseteq R^n$, in which $R^n$ denotes the n-dimensional real number set, $X$ is the input space and $x_i$ is the i-th input data;

$y_i \in Y \subseteq R$, in which $R$ is the real number set, $Y$ is the output space, $y_i$ is the label value of the i-th input data and $(x_i, y_i)$ is the i-th sample;

Step 6.1: first, initialize the regression tree.

Substitute $y_i$ and compute the constant value $c$ that minimizes the general loss function $L$ to obtain the initial regression tree $f_0(x)$:

$$f_0(x) = \arg\min_{c} \sum_{i=1}^{N} L(y_i, c)$$

where $L(y_i, c)$ is the value of the loss function for $y_i$ and the constant $c$, and $N$ is the number of samples;

Step 6.2: let the number of iterations of the train communication network intrusion detection model be $M$; for $m = 1, 2, \ldots, M$:

Step 6.2.1: for the defined general loss function $L$, compute the approximate residual $r_{mi}$ of the m-th tree:

$$r_{mi} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x) = f_{m-1}(x)}, \quad i = 1, 2, \ldots, N$$

where $f_{m-1}(x_i)$ is the predicted value of the (m-1)-th tree.

Step 6.2.2: fit a regression tree using the residuals obtained in step 6.2.1 as the new sample label values. The new training set in the m-th iteration is $D_m = \{(x_1, r_{m1}), (x_2, r_{m2}), \ldots, (x_N, r_{mN})\}$, where $x_i \in R^n$ is the sample data, $R^n$ denotes the n-dimensional real number set and $r_{mi}$ is the new label value; the leaf node regions of the regression tree are denoted $R_{mj}$, $j = 1, 2, \ldots, J$, where $J$ is the number of leaf nodes;

Step 6.2.3: for each leaf node region $R_{mj}$, use a line search to find the minimum of the general loss function and compute the best fitting value $c_{mj}$:

$$c_{mj} = \arg\min_{c} \sum_{x_i \in R_{mj}} L(y_i, f_{m-1}(x_i) + c)$$

Step 6.2.4: update the m-th tree $f_m(x)$:

$$f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J} c_{mj}\, I(x \in R_{mj})$$

where $f_{m-1}(x)$ is the (m-1)-th tree and $I(x \in R_{mj})$ is the indicator function, which takes the value 1 if the condition $x \in R_{mj}$ holds and 0 otherwise.

Step 6.3: obtain the final train communication network intrusion detection model:

$$f_M(x) = f_0(x) + \sum_{m=1}^{M} \sum_{j=1}^{J} c_{mj}\, I(x \in R_{mj})$$

The training result of the train communication network intrusion detection model is a classification of the types of network intrusion attack suffered by the train communication network.

On the basis of the above scheme, the classification result of the network intrusion attack types in step 6 is graded into three levels, high, medium and low, on a scale of 3, 2 and 1, where level 1 denotes scanning and probing attacks, level 2 denotes denial-of-service attacks and level 3 denotes man-in-the-middle attacks.

On the basis of the above scheme, step 7 specifically comprises:

Step 7.1: quantitative calculation of the node-level intrusion detection result. A mathematical model of the security situation of a network node is defined. Assume that node i in the network suffers p kinds of network attack within time period t; the security situation value of the node is then $T_i(t)$, whose expression is as follows:

[equation not reproduced in the source text]

where $c_k$ is the number of packets of the k-th network attack, $l_k$ is the threat level of that attack, $Q$ is the total number of data packets, and $k_a$ is the attack threat level balance factor, used to adjust the relative threat severity of different attacks: one attack of threat level 3 poses a threat to the node equivalent to a certain number of level-2 attacks or of level-1 attacks, with the two counts determined by $k_a$. The default value of $k_a$ is 1.

The attack threat situation is normalized by formula (5), which maps the security situation value onto the interval [0, 1].

Step 7.2: evaluation of the network-level intrusion detection result. For a given node data set $Z = [z_1, z_2, \ldots, z_n]$, $z_i \in R$, where $R$ is the set of real numbers, the data weight vector is $W = [w_1, w_2, \ldots, w_n]^T$, where $w_i \in [0, 1]$ and $\sum_{i=1}^{n} w_i = 1$.

The weighted geometric mean operator is defined as follows:

$$\mathrm{WGM}(Z) = \prod_{i=1}^{n} z_i^{w_i}$$

To prevent a zero value in the data from driving the final result to zero, a zero-removed weighted geometric mean operator is constructed:

[equation not reproduced in the source text]

where $z_{min}$ is the smallest non-zero value in the node data set $Z$;

Taking the characteristics of the train communication network into account, the zero-removed weighted geometric mean operator is used for data fusion to obtain the final evaluation value of the whole network in time period t, according to the following formula:

[equation not reproduced in the source text]

where $n$ is the number of nodes in the network, $w_i$ is the weight of the i-th node, $T_i(t)$ is the quantized intrusion detection result of the i-th node in time period t, and $S(t)$ is the final evaluation value of the whole network in time period t.
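The plain weighted geometric mean used as the starting point in step 7.2 is easy to check numerically. The sketch below implements only that standard operator and shows why a single zero-valued node score collapses the fused result, which is the motivation the text gives for the zero-removed variant. The function name and the tolerance on the weight sum are assumptions made here; the zero-removed operator itself is not implemented, since its formula is not given in the text.

```python
import math


def weighted_geometric_mean(values, weights):
    """Weighted geometric mean: product of z_i ** w_i with sum(w_i) == 1.

    Illustrative helper (hypothetical name). Raises ValueError when the
    inputs do not match in length or the weights do not sum to 1.
    """
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return math.prod(z ** w for z, w in zip(values, weights))


# Two nodes with equal weight: sqrt(0.25 * 1.0) = 0.5
fused = weighted_geometric_mean([0.25, 1.0], [0.5, 0.5])

# One node with score 0 forces the fused value to 0, whatever the others are.
collapsed = weighted_geometric_mean([0.0, 0.8], [0.5, 0.5])
```

The second call illustrates the problem: a node that saw no attack in the period would hide attacks on every other node if the plain operator were used for fusion.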

The beneficial effects of the invention are as follows:

(1) A targeted data acquisition scheme is designed for the network structure of the train communication network, solving the problem of traffic sniffing in the train communication network. The characteristics of communication data in the train communication network are analyzed in detail, a packet-based communication traffic feature selection scheme is proposed, and a corresponding data feature parsing scheme is designed according to the feature selection scheme.

(2) A LightGBM-based train communication network intrusion detection algorithm is designed, which uses the idea of ensemble learning to solve the problem of recognizing intrusion behavior in the train communication network. While detection accuracy is maintained, the training and prediction time of the model is greatly reduced, so that the real-time requirements of the train communication network can be met.

(3) A quantitative evaluation scheme for intrusion detection results suited to the train communication network is designed. By analyzing in detail the particularities of the train communication network and the degree of threat that different network attacks pose to it, a quantized threat value is computed, on the basis of the intrusion detection result, for each network node under attack, and the quantized values of all nodes in the network are fused to obtain an evaluation value for the whole network. This better reflects the security condition of the whole train network and helps the personnel concerned keep track of the security state of the whole train communication network.

Drawings

FIG. 1 is a flow chart of the train communication network intrusion detection method;

FIG. 2 is a flow chart of the train communication network traffic feature parsing scheme.

Detailed Description

The present invention will be described in further detail with reference to FIGS. 1 to 2.

The invention relates to a LightGBM-based train communication network intrusion detection method aimed mainly at Ethernet-based train communication networks. Data packets in the train communication network are captured using the Scapy packet sniffing tool and traffic aggregation equipment. The captured data are processed according to TRDP (Train Real Time Data Protocol), the communication protocol specific to the train communication network, and the TRDP-specific features are parsed. Features are selected in light of the particularities of the network to obtain a basic feature data set, which is further processed into a dedicated data set suitable for train communication network intrusion detection research. A LightGBM ensemble learning algorithm is then trained to form a train intrusion detection model that balances real-time performance and accuracy. Finally, a mathematical quantitative evaluation model is constructed to quantify the classification results of the model, yielding in an intuitive form a quantified value of the threat that network attacks pose to the train communication network at the evaluation time, so that the security state of the train communication network can be better assessed.

The specific implementation steps of the method are shown in FIG. 1. The core comprises three parts: data processing, model construction and quantitative evaluation of results.

(I) Data processing is carried out through the following steps:

(1) Data acquisition. Train communication data acquisition comprises two steps: physical connection and traffic capture. The train communication network can be physically connected to the information processing terminal in two ways. The first uses the mirror port function of a switch to converge the whole-network traffic of the train communication network onto the information processing terminal, which then processes the data. The second uses a network aggregator to sniff the whole-network traffic and aggregate the data. Because the devices in a train communication network are varied in kind and large in number, the second way is adopted.

The physical connection for network traffic collection is thus determined as follows: a network aggregator device sniffs the whole-network traffic to aggregate the data. Then, using the Python language and the interactive packet processing module Scapy, the network-card traffic aggregated to the information processing terminal is sniffed to obtain a binary packet file.
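The capture step on the information processing terminal could look roughly like the following sketch. It relies on Scapy's `sniff` and `wrpcap` functions; the function name `capture_to_pcap`, the interface name, the packet count and the injectable `sniff_fn`/`write_fn` parameters are assumptions added here for illustration (the injection points only exist so the logic can be exercised without a live network card).

```python
def capture_to_pcap(iface, out_path, count=1000, sniff_fn=None, write_fn=None):
    """Capture `count` packets on `iface` and store them in a pcap file.

    Hypothetical helper. By default it uses Scapy's sniff() and wrpcap();
    Scapy is imported lazily so the module loads even where it is absent.
    Live capture normally requires administrator privileges.
    """
    if sniff_fn is None:
        from scapy.all import sniff as sniff_fn
    if write_fn is None:
        from scapy.all import wrpcap as write_fn

    # store=True keeps the captured packets in memory so they can be written out.
    packets = sniff_fn(iface=iface, count=count, store=True)
    write_fn(out_path, packets)  # binary packet file for later parsing
    return len(packets)
```

On the terminal attached to the aggregator, a call such as `capture_to_pcap("eth0", "tcn_capture.pcap", count=5000)` would produce the binary packet file, where `"eth0"` and the file name are placeholders.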

(2) Perform feature parsing on the binary packet file obtained in step (1), including layered parsing of the Ethernet protocol stack and parsing of the data frame content of TRDP, the communication protocol specific to the train communication network.

The layered parsing of the Ethernet protocol stack is done directly by calling library functions of the interactive packet processing module. The TRDP data frame content is parsed by byte-by-byte inspection, based on a detailed analysis of the protocol frame format, to obtain the packet-level data features.

The specific analysis steps are shown in the attached figure 2 in the specification:

Step 2.1: first, parse the Ether, Dot1Q (802.1Q) and IP protocols in sequence; next, parse the UDP protocol and check its destination port number; go to step 2.2 when the port number is 17225, and to step 2.3 when the port number is 17224;

Step 2.2: when the port number is 17225, the message may be a TRDP-MD message; perform TRDP-MD protocol parsing and determine whether the message conforms to the TRDP-MD protocol specification; if it does, the message is a TRDP-MD message and step 2.4 follows; if not, the message is an ordinary UDP message;

Step 2.3: when the port number is 17224, the message may be a TRDP-PD message; perform TRDP-PD protocol parsing and determine whether the message conforms to the TRDP-PD protocol specification; if it does, the message is a TRDP-PD message and step 2.5 follows; if not, the message is an ordinary UDP message;

Step 2.4: perform TRDP-MD feature extraction on TRDP-MD messages that conform to the TRDP-MD protocol specification to obtain the TRDP-MD features;

Step 2.5: perform TRDP-PD feature extraction on TRDP-PD messages that conform to the TRDP-PD protocol specification to obtain the TRDP-PD features;

Step 2.6: perform TCP-layer protocol parsing on the TRDP-MD features and TRDP-PD features obtained in steps 2.4 and 2.5, and determine whether the port number is 17225; when the port number is 17225, return to step 2.2; when the port number is not 17225, parsing is complete and the packet-level data features are obtained.
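The port-based dispatch in the steps above can be sketched with byte-level parsing from the Python standard library. This is a simplified illustration, not the parser of the patent: the function name `classify_frame`, the result dictionary and, in particular, the conformance check are assumptions. The check only verifies that the payload is long enough for the first three TRDP header fields (sequence counter, protocol version, message type) and that the message type begins with `P` for port 17224 or `M` for port 17225; a real parser would validate the full header. Frames are assumed to be complete and untruncated.

```python
import struct

TRDP_PD_PORT = 17224  # process data (checked on UDP)
TRDP_MD_PORT = 17225  # message data (checked on UDP and TCP)


def classify_frame(frame: bytes) -> dict:
    """Classify one raw Ethernet frame and extract the leading TRDP fields."""
    eth_type = struct.unpack_from("!H", frame, 12)[0]
    offset = 14
    if eth_type == 0x8100:  # 802.1Q tag present: skip it, read the inner type
        eth_type = struct.unpack_from("!H", frame, 16)[0]
        offset = 18
    if eth_type != 0x0800:
        return {"kind": "non-IPv4"}

    ihl = (frame[offset] & 0x0F) * 4  # IPv4 header length in bytes
    protocol = frame[offset + 9]      # 17 = UDP, 6 = TCP
    l4 = offset + ihl
    if protocol == 17:
        kind, header_len = "UDP", 8
    elif protocol == 6:
        kind, header_len = "TCP", (frame[l4 + 12] >> 4) * 4
    else:
        return {"kind": "other-IP", "Protocol": protocol}

    dst_port = struct.unpack_from("!H", frame, l4 + 2)[0]
    payload = frame[l4 + header_len:]
    result = {"kind": kind, "Protocol": protocol, "Dst_Port": dst_port}

    # Simplified conformance check (assumption, see the lead-in).
    expected = {TRDP_PD_PORT: (b"P", "TRDP-PD"), TRDP_MD_PORT: (b"M", "TRDP-MD")}
    if dst_port in expected and len(payload) >= 8:
        prefix, name = expected[dst_port]
        seq, ver, msg_type = struct.unpack_from("!IH2s", payload, 0)
        pd_over_tcp = name == "TRDP-PD" and kind == "TCP"
        if msg_type[:1] == prefix and not pd_over_tcp:
            result.update(kind=name, Seq_TRDP=seq, Ver_TRDP=ver,
                          Type_TRDP=msg_type.decode("ascii", "replace"))
    return result
```

A frame addressed to port 17224 whose payload fails the check falls through as an ordinary UDP message, mirroring the "if not, the message is a UDP message" branch of the steps.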

(3) Select and extract features from the packet-level data features obtained in step (2), taking the characteristics of the train communication network into account, to obtain the basic feature data set. The basic feature data set consists of 36-dimensional data features divided into eight categories: general features, Ether, 802.1Q, IP (IPv4), ICMP, UDP, TCP and TRDP. These features include not only the conventional five-tuple information but also header information from the different protocols and TRDP protocol features, as shown in Table 1:

Table 1. Feature selection scheme for the train communication network intrusion detection method

(4) Preprocess the features in the basic feature data set: null value filling and category feature designation are applied to the features obtained in step (3) to obtain the dedicated data set for train communication network intrusion detection.

Null value filling means filling in the null-valued features that arise during communication protocol parsing. To keep null values from affecting other features as far as possible, the value "-1" is used to fill null feature items; the filled value "-1" carries no specific physical meaning.

Category feature designation means designating the features of categorical type among the basic features obtained in step (3). A categorical feature is one in which a specific value does not stand for a numerical magnitude but serves as a symbol with a specific physical meaning; for example, the value "17" of the "Protocol" feature does not denote the decimal number 17 but the UDP protocol. Based on an analysis of each feature item in Table 1, four features, Src_ETH, Dst_ETH, Src_IP and Dst_IP, are designated as category features. Both null value filling and category feature designation are data preprocessing operations; their purpose is to standardize the feature data obtained in step (3) so that it can be fed into, and used to train, the train communication network intrusion detection model.
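The two preprocessing operations can be sketched in a few lines of plain Python. The function name `preprocess`, the record format (one dict per packet) and the integer encoding of the category features are assumptions made for illustration; the text only specifies that nulls are filled with -1 and that the four address features are designated as category features. Integer codes are used here because LightGBM expects category features to be integer-encoded (or to carry a pandas categorical dtype).

```python
CATEGORICAL = ("Src_ETH", "Dst_ETH", "Src_IP", "Dst_IP")


def preprocess(records, feature_names, categorical=CATEGORICAL):
    """Fill nulls with -1 and integer-encode the designated category features.

    records: list of dicts mapping feature name -> value (None or missing
    means the feature did not occur in that packet).
    Returns (rows, codebooks), where codebooks maps each category feature
    to its value -> integer code table.
    """
    codebooks = {name: {} for name in categorical}
    rows = []
    for rec in records:
        row = []
        for name in feature_names:
            value = rec.get(name)
            if value is None:
                row.append(-1)  # null value filling, no physical meaning
            elif name in codebooks:
                book = codebooks[name]
                # Assign the next free integer code to an unseen symbol.
                row.append(book.setdefault(value, len(book)))
            else:
                row.append(value)
        rows.append(row)
    return rows, codebooks
```

With LightGBM's Python API, the column names in `CATEGORICAL` would then typically be passed through the `categorical_feature` argument when the data set is constructed.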

(5) Construct the data set. The dedicated data set for train communication network intrusion detection serves as the basis for building the train communication network intrusion detection model.

The dedicated data set is randomly divided into a training set and a validation set at a ratio of 8:2, where the training set is used to train the train communication network intrusion detection model and the validation set is used to verify its training effect.
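A random 8:2 split of this kind can be sketched as follows. The function name `split_dataset` and the fixed seed are illustrative assumptions; the seed is only there to make the split reproducible.

```python
import random


def split_dataset(samples, train_ratio=0.8, seed=0):
    """Randomly split samples into (training set, validation set)."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)      # reproducible random order
    cut = int(round(len(samples) * train_ratio))
    train = [samples[i] for i in indices[:cut]]
    valid = [samples[i] for i in indices[cut:]]
    return train, valid
```

Every sample ends up in exactly one of the two sets, so the validation set remains unseen during training.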

(II) Building the train communication network intrusion detection model:

To improve the accuracy of intrusion detection and the generalization ability of the model, a LightGBM-based train communication network intrusion detection algorithm is designed and a train communication network intrusion detection model is constructed.

The intrusion detection training set is denoted as D = { (x) ₁ ，y ₁ )，(x ₂ ，y ₂ )，…，(x _N ，y _N ) Therein of

(R ⁿ Representing an n-dimensional real number set), x being the input space, x _i Represents the ith input data;

(R is a real number set), Y is an output space, Y _i Is the tag value of the ith input data, (x) _i ，y _i ) The ith sample data;

1) First, initializing a regression tree

Will y is _i Substituting to calculate constant value c for minimizing general loss function L to obtain regression tree f ₀ (x)：

Wherein, L (y) _i And c) is y _i And the loss function value of the constant c, wherein N is the number of sample data.

2) And setting the iteration number of the intrusion detection model of the train communication network as M, wherein for M =1,2, \8230;, M, there are:

a. for a defined general loss function L, the approximate residual r of the mth regression tree is calculated _mi ：

Wherein, f _m-1 (x _i ) Refers to the predicted value of the m-1 th tree.

b. Fitting a regression tree by taking the residual error obtained in the step a as a new sample label value, wherein a new training set in the mth iteration is D _m ＝{(x ₁ ，r _m1 )，(x ₂ ，r _m2 )，…，(x _N ，r _mN ) In which x _i ∈R ⁿ ，R ⁿ Representing an n-dimensional real number set, representing sample data, r _mi Is the new tag value. The leaf node region of the regression tree is marked as R _mj J =1,2, \ 8230, J, where J denotes the number of leaf nodes.

c. For leaf node region R _mj Calculating by adopting a linear search mode, finding out the minimum value of a general loss function, and calculating the best fitting value c _mj ：

d. Updating the mth tree f _m (x)：

Wherein f is _m-1 (x) Refers to the m-1 st tree; i (x is belonged to R) _mj ) To indicate a function, if the condition x ∈ R _mj If true, the value is 1, otherwise it is 0.

3) Obtain the final train communication network intrusion detection model:

$$f(x) = f_M(x) = f_0(x) + \sum_{m=1}^{M} \sum_{j=1}^{J} c_{mj} I(x \in R_{mj})$$

The training result of the train communication network intrusion detection model is a classification result of network intrusion attack behavior types suffered by the train communication network, and the classification result is classified into three grades of high, medium and low according to scales of 3, 2 and 1, wherein the grade 1 represents scanning detection type attack, the grade 2 represents denial of service attack, and the grade 3 represents man-in-the-middle attack.
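Steps 1) to 3) can be illustrated with a minimal sketch. It is not the LightGBM implementation: it assumes squared-error loss (so the residual is the prediction error and the best leaf value is the mean residual in the leaf), a single numeric feature, and depth-1 trees (J = 2). The names `fit_stump`, `fit_gbdt`, `predict` and the toy data are hypothetical.

```python
def fit_stump(xs, targets):
    """Fit a depth-1 regression tree (two leaves) on one feature by least squares."""
    best = None
    distinct = sorted(set(xs))
    for lo, hi in zip(distinct, distinct[1:]):
        thr = (lo + hi) / 2.0
        left = [t for x, t in zip(xs, targets) if x <= thr]
        right = [t for x, t in zip(xs, targets) if x > thr]
        c_left = sum(left) / len(left)      # best constant for the left leaf
        c_right = sum(right) / len(right)   # best constant for the right leaf
        sse = (sum((t - c_left) ** 2 for t in left)
               + sum((t - c_right) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, thr, c_left, c_right)
    _, thr, c_left, c_right = best
    return thr, c_left, c_right


def fit_gbdt(xs, ys, n_rounds):
    """Gradient boosting with squared-error loss; xs needs at least two distinct values."""
    f0 = sum(ys) / len(ys)                  # step 1: constant minimising the loss
    preds = [f0] * len(ys)
    stumps = []
    for _ in range(n_rounds):               # step 2: m = 1, ..., M
        residuals = [y - p for y, p in zip(ys, preds)]       # step a
        thr, c_left, c_right = fit_stump(xs, residuals)      # steps b and c
        stumps.append((thr, c_left, c_right))
        preds = [p + (c_left if x <= thr else c_right)       # step d
                 for x, p in zip(xs, preds)]
    return f0, stumps                       # step 3: f_0 plus all leaf corrections


def predict(model, x):
    f0, stumps = model
    return f0 + sum(cl if x <= thr else cr for thr, cl, cr in stumps)


# Toy data (hypothetical): one feature, two clearly separated label values.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.0, 1.0, 3.0, 3.0, 3.0]
model = fit_gbdt(xs, ys, n_rounds=2)
```

With squared-error loss the line search of step c coincides with the leaf means found while fitting the tree in step b, which is why the two steps share one call here. A multi-class detector such as the one described would use a classification loss instead.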

For the TCN intrusion detection problem, the features are physical quantities parsed from each protocol layer of the captured data packets; they are numerous and take various forms. After the features are sorted, a histogram optimization strategy is applied: continuous floating-point feature values are discretized into k integers, a histogram of width k is constructed, and the discrete values are stored as bin values. In subsequent training, decision trees are built using only the discretized histogram features.

This strategy avoids repeated traversal of the sample data, significantly improves model training speed, and also enhances the generalization capability of the model.
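The histogram idea can be sketched as follows. This is a simplified illustration rather than LightGBM's own binning: it assumes equal-width bins, unit second-order gradients, and a split score of the form G_L²/n_L + G_R²/n_R. The function names and the toy values are hypothetical.

```python
def build_bins(values, k):
    """Map each continuous value to an integer bin index in [0, k - 1] (equal-width bins)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0            # fall back to 1.0 if all values are equal
    return [min(int((v - lo) / width), k - 1) for v in values]


def gradient_histogram(bin_ids, grads, k):
    """Accumulate per-bin gradient sums and sample counts in a single pass."""
    grad_sum = [0.0] * k
    count = [0] * k
    for b, g in zip(bin_ids, grads):
        grad_sum[b] += g
        count[b] += 1
    return grad_sum, count


def best_split(grad_sum, count):
    """Scan the k - 1 bin boundaries instead of every distinct feature value."""
    total_g, total_n = sum(grad_sum), sum(count)
    best_bin, best_score = None, float("-inf")
    g_left, n_left = 0.0, 0
    for b in range(len(count) - 1):
        g_left += grad_sum[b]
        n_left += count[b]
        n_right = total_n - n_left
        if n_left == 0 or n_right == 0:
            continue                         # a split must leave samples on both sides
        g_right = total_g - g_left
        score = g_left ** 2 / n_left + g_right ** 2 / n_right
        if score > best_score:
            best_bin, best_score = b, score
    return best_bin                          # samples with bin index <= best_bin go left


# Toy feature column and gradients (hypothetical values).
values = [0.1, 0.2, 0.3, 0.4, 2.6, 2.7, 2.8, 2.9]
grads = [-1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
bin_ids = build_bins(values, k=4)
grad_sum, count = gradient_histogram(bin_ids, grads, k=4)
split_bin = best_split(grad_sum, count)
```

After binning, the split search touches k histogram entries per feature rather than every sample, which is where the reduction in traversal cost comes from.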

Second, a depth-first (leaf-wise) splitting strategy is adopted: each leaf split refers to the global sample, which largely avoids both the interference of locally optimal solutions that arises under the traditional level-wise (breadth-first) strategy and the subsequent pruning cost. A maximum tree depth limit is also introduced so that the user can set the depth of the tree, avoiding the over-fitting caused by excessively deep trees.
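Leaf-wise growth with a depth limit can be sketched with a priority queue: at every step the leaf with the largest split gain is split next, regardless of its level. This is an illustrative sketch under assumed simplifications (one numeric feature, squared-error gain); `best_split`, `grow_leaf_wise`, `num_leaves`, `max_depth` and the toy data are hypothetical names and values used here for illustration.

```python
import heapq


def best_split(points):
    """Return (gain, threshold) of the best least-squares split of a leaf, or None."""
    points = sorted(points)
    n = len(points)
    total = sum(y for _, y in points)
    best = None
    left_sum = 0.0
    for i in range(1, n):
        left_sum += points[i - 1][1]
        if points[i - 1][0] == points[i][0]:
            continue                          # cannot split between equal feature values
        right_sum = total - left_sum
        gain = left_sum ** 2 / i + right_sum ** 2 / (n - i) - total ** 2 / n
        if best is None or gain > best[0]:
            best = (gain, (points[i - 1][0] + points[i][0]) / 2.0)
    return best


def grow_leaf_wise(points, num_leaves, max_depth):
    """Always split the leaf with the largest gain, bounded by num_leaves and max_depth."""
    leaves = []       # finished leaves as (depth, points)
    heap = []         # splittable leaves ordered by negative gain
    counter = 0       # tie-breaker so heapq never has to compare point lists

    def push(depth, pts):
        nonlocal counter
        split = best_split(pts) if depth < max_depth and len(pts) > 1 else None
        if split is None or split[0] <= 0:
            leaves.append((depth, pts))       # no useful split, or depth limit reached
        else:
            heapq.heappush(heap, (-split[0], counter, depth, split[1], pts))
            counter += 1

    push(0, points)
    while heap and len(leaves) + len(heap) < num_leaves:
        _, _, depth, thr, pts = heapq.heappop(heap)
        push(depth + 1, [p for p in pts if p[0] <= thr])
        push(depth + 1, [p for p in pts if p[0] > thr])
    leaves.extend((depth, pts) for _, _, depth, _, pts in heap)
    return leaves


# Toy (feature, target) pairs (hypothetical values).
data = [(1, 0.0), (2, 0.0), (3, 0.0), (4, 0.0), (5, 10.0), (6, 10.0), (7, 12.0), (8, 12.0)]
leaves = grow_leaf_wise(data, num_leaves=3, max_depth=3)
leaf_depths = sorted(depth for depth, _ in leaves)
```

On this toy data the tree ends up unbalanced: the homogeneous left half stays a single leaf at depth 1 while the right half is split again, which is the behaviour that distinguishes leaf-wise from level-wise growth.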

Meanwhile, because the data set contains many samples and computing the gain of the objective function is costly, a one-side gradient sampling strategy is used. The gradient of each sample is treated as analogous to its weight, and samples with small gradients are discarded in a controlled way, reducing the data volume while preserving computational accuracy. The reduction proceeds as follows: first, sort the samples in descending order of the absolute value of their gradients and take the top a × 100% (a is the sampling rate of large-gradient data); then randomly select b × 100% of the samples from the remainder (b is the sampling rate of small-gradient data) and multiply the selected small-gradient data by the weight (1 − a)/b; finally, use the resulting (a + b) × 100% of the data for the next round of training. This sampling scheme reduces the amount of data while preserving the original data distribution as far as possible.
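The sampling procedure can be sketched as below. The sketch follows the one-side gradient sampling scheme as published for LightGBM, in which the retained small-gradient samples are amplified by (1 − a)/b; the function name `goss_sample`, the use of `random.Random` for reproducibility, and the toy gradients are assumptions made for illustration.

```python
import random


def goss_sample(grads, a, b, rng):
    """Return (indices, weights) selected by one-side gradient sampling."""
    n = len(grads)
    # Sort sample indices by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(grads[i]), reverse=True)
    top_n = int(a * n)
    rand_n = int(b * n)
    top = order[:top_n]                       # keep every large-gradient sample
    rest = order[top_n:]
    sampled = rng.sample(rest, rand_n)        # random subset of small-gradient samples
    amplify = (1.0 - a) / b                   # compensates for the discarded samples
    indices = top + sampled
    weights = [1.0] * len(top) + [amplify] * len(sampled)
    return indices, weights


# Toy gradients (hypothetical): sample i has absolute gradient i.
grads = [(-1) ** i * float(i) for i in range(20)]
indices, weights = goss_sample(grads, a=0.2, b=0.1, rng=random.Random(0))
```

Here 20 samples are reduced to 6: the four with the largest absolute gradients at weight 1, plus two randomly chosen small-gradient samples at weight 8.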

(III) quantitative evaluation of the results

To display the intrusion detection result intuitively to train communication network maintainers, a mathematical quantitative evaluation model is constructed to quantify the result and output an evaluation value on a percentage scale.

The quantification performed by the mathematical model means the following: on the basis of the training result of the train communication network intrusion detection model, the devices in the train communication network on which network attacks occur are taken into account, a comprehensive quantitative evaluation is carried out, and the severity of the network attacks on the train communication network during the evaluation period is finally displayed as a numerical value.

In this part, the threat degree of each specific device is quantified first, and the quantitative evaluation results of all devices in the train communication network are then fused to obtain the overall evaluation value of the network. The procedure therefore has the following two steps:

1) Node level intrusion detection result quantitative calculation

Defining a mathematical model of the security posture of the network node:

Assuming that node i in the network suffers p kinds of network attack in a time period t, the security situation value $T_i(t)$ of the node is defined in terms of the following quantities: $c_k$ is the number of packets of the kth network attack; $l_k$ is the threat level of that attack; $Q$ is the total number of data packets; and $k_a$ is the attack threat level balancing factor, which adjusts the relative threat severity of different attacks by relating one attack of threat level 3 to the number of level-2 attacks, or of level-1 attacks, that pose an equivalent threat to the node.

2) Network level intrusion detection result evaluation

In the process of evaluating the intrusion detection result of the train communication network, different network nodes have different influence degrees on the whole network, different priority weights can be distributed to the nodes according to the priority standards of related industries, and then the whole evaluation value of the network can be obtained by combining the quantized values of the different nodes.

The invention adopts the weighted geometric averaging algorithm (WGA) idea in the multi-attribute decision theory to replace the traditional information aggregation algorithm to calculate the integral quantitative value of the train communication network.

For a given node data set $Z = [z_1, z_2, \dots, z_n]$, $z_i \in R$, where $R$ is the real number set, the data weight vector is $W = [w_1, w_2, \dots, w_n]^T$, where $w_i \in [0, 1]$ and $\sum_{i=1}^{n} w_i = 1$.

The weighted geometric mean operator is defined as follows:

$$\mathrm{WGA}_W(z_1, z_2, \dots, z_n) = \prod_{i=1}^{n} z_i^{w_i}$$

To prevent zero values in the data from zeroing the final result, the zero values are treated as follows, giving a zero-removed weighted geometric averaging operator (DZ-WGA):

where $z_{min}$ is the smallest nonzero value in the node data set $Z$; it may also be taken as the smallest value that can occur in $Z$.
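The effect of zero values on a weighted geometric mean, and one way of removing it, can be sketched as follows. The `wga` function is the standard weighted geometric mean. The `dz_wga` function is only one plausible reading of the zero-removal step, assumed here for illustration: each zero entry is replaced by $z_{min}$ before the mean is taken. Both function names, the non-negativity of the inputs, and the sample values are assumptions.

```python
import math


def wga(values, weights):
    """Weighted geometric mean: product of z_i ** w_i, with weights summing to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return math.prod(z ** w for z, w in zip(values, weights))


def dz_wga(values, weights, z_min=None):
    """WGA after replacing zeros by z_min (assumed interpretation of DZ-WGA)."""
    if z_min is None:
        nonzero = [z for z in values if z != 0]
        if not nonzero:
            return 0.0                        # nothing to substitute: every entry is zero
        z_min = min(nonzero)                  # smallest nonzero value in the data set
    patched = [z if z != 0 else z_min for z in values]
    return wga(patched, weights)


# Hypothetical node values in [0, 1] and node priority weights.
node_values = [0.25, 0.0, 1.0]
node_weights = [0.5, 0.25, 0.25]
plain = wga(node_values, node_weights)        # a single zero wipes out the product
fused = dz_wga(node_values, node_weights)
```

With these inputs the plain operator returns 0 because of the single zero entry, whereas the zero-removed variant returns about 0.354, so the other nodes still contribute to the fused value.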

Finally, the intrusion detection result of the train communication network is the security state of the whole network, obtained by fusing the results of all nodes in the network and displayed as a quantitative value, giving the relevant operating personnel an intuitive reference for the network's intrusion detection result at the current moment.

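Taking the characteristics of the train communication network into account, the invention uses the DZ-WGA operator for data fusion to obtain the final evaluation value $S(t)$ of the whole network in time period t:

$$S(t) = \text{DZ-WGA}_W\big(T_1(t), T_2(t), \dots, T_n(t)\big)$$

where $n$ is the number of nodes in the network, $w_i$ is the weight of the ith node, and $T_i(t)$ is the quantized intrusion detection result of the ith node in time period t.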

Matters not described in detail in this specification are well known to those skilled in the art.

Claims

1. A LightGBM-based train communication network intrusion detection method, characterized by specifically comprising the following steps:

step 1: using network aggregator equipment to sniff the traffic of the whole network and aggregate the data, then using the Python language and an interactive data packet processing module to sniff the network card traffic passing from the aggregation equipment to the information processing terminal, thereby obtaining a binary data packet file;

step 2: performing feature parsing on the binary data packet file obtained in step 1, wherein the feature parsing comprises layered parsing of the Ethernet protocol stack and parsing of the data frame content of TRDP, the dedicated communication protocol of the train communication network, to obtain packet-level data features;

step 3: selecting and extracting features from the packet-level data features obtained in step 2, and combining them with the characteristics of the train communication network to obtain a basic feature data set;

step 5: randomly dividing the data set dedicated to train communication network intrusion detection, in a proportion of …:2, into a training set and a verification set, wherein the training set is used to train the train communication network intrusion detection model and the verification set is used to verify the training effect of the model;

step 6: combining the particularity of the network, selecting features to obtain the basic feature data set, further processing the basic feature data set to obtain a dedicated data set suitable for train communication network intrusion detection research, constructing the train communication network intrusion detection model, and training the model on the training set using the LightGBM intrusion detection method, wherein the training result of the model is a classification of the types of network intrusion attack behavior suffered by the train communication network;

the construction of the intrusion detection model of the train communication network comprises the following steps:

the training set of the train communication network intrusion detection model is represented as $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$, where $x_i$ is the ith input data and $y_i$ is the label value of the ith input data;

step 6.1: first, initializing the regression tree:

substituting $y_i$ and computing the constant $c$ that minimizes the general loss function $L$ to obtain the regression tree $f_0(x)$:

$$f_0(x) = \arg\min_{c} \sum_{i=1}^{N} L(y_i, c)$$

where $L(y_i, c)$ is the loss between $y_i$ and the constant $c$, and $N$ is the number of samples;

step 6.2: setting the number of iterations of the train communication network intrusion detection model to $M$, and for $m = 1, 2, \dots, M$:

step 6.2.1: for the defined general loss function $L$, computing the approximate residual $r_{mi}$ of the mth regression tree:

$$r_{mi} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x) = f_{m-1}(x)}$$

where $f_{m-1}(x_i)$ is the predicted value of the (m-1)th tree;

step 6.2.2: fitting a regression tree using the residuals obtained in step 6.2.1 as the new sample label values, the new training set in the mth iteration being $D_m = \{(x_1, r_{m1}), (x_2, r_{m2}), \dots, (x_N, r_{mN})\}$, where $x_i \in R^n$ is the sample data, $R^n$ denotes the n-dimensional real number set, and $r_{mi}$ is the new label value; the leaf node regions of the regression tree are denoted $R_{mj}$, $j = 1, 2, \dots, J$, where $J$ is the number of leaf nodes;

step 6.2.3: for each leaf node region $R_{mj}$, using a line search to find the minimum of the general loss function and computing the best fitting value $c_{mj}$:

$$c_{mj} = \arg\min_{c} \sum_{x_i \in R_{mj}} L\big(y_i, f_{m-1}(x_i) + c\big)$$

step 6.2.4: updating the mth tree $f_m(x)$:

$$f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J} c_{mj} I(x \in R_{mj})$$

where $f_{m-1}(x)$ is the (m-1)th tree and $I(x \in R_{mj})$ is the indicator function, equal to 1 if the condition $x \in R_{mj}$ holds and 0 otherwise;

The training result of the intrusion detection model of the train communication network is a classification result of network intrusion attack behavior types suffered by the train communication network;

step 7: establishing a mathematical quantitative evaluation model to quantify the classification result of the network intrusion attack behavior, and outputting an evaluation value on a percentage scale.

2. The LightGBM-based train communication network intrusion detection method of claim 1, wherein the step 2 specifically comprises:

step 2.6: performing protocol analysis of a TCP layer on the TRDP-MD characteristics and the TRDP-PD characteristics obtained in the step 2.4 and the step 2.5, and judging whether the port number is 17225; when the port number is 17225, returning to step 2.2; when the port number is not 17225, the parsing is completed to obtain the data characteristics of the packet level.

3. The LightGBM-based train communication network intrusion detection method of claim 1, wherein the basic feature data set of step 3 comprises: 36-dimensional data features, wherein the 36-dimensional data features are divided into eight categories, and the categories specifically comprise: global features, ether, 802.1q, IPv4, ICMP, UDP, and TRDP;

the IPv4 features include: Ver_IP, Src_IP, Dst_IP, Len_IP, IHL_IP, DSF_IP, ID_IP, Flag_IP, Frag_IP and TTL_IP; wherein Ver_IP represents the IP protocol version number, Src_IP the source IP address, Dst_IP the destination IP address, Len_IP the IP packet length, IHL_IP the header length, DSF_IP the differentiated services field, ID_IP the IP identifier, Flag_IP the IP flag bits, Frag_IP the fragment offset, and TTL_IP the time to live;

4. The LightGBM-based train communication network intrusion detection method of claim 1, wherein the null stuffing of step 4 is: filling the null value characteristics in the communication protocol analysis process, and filling null value characteristic items by adopting a numerical value "-1" in order to avoid the influence of null values on other characteristics as much as possible, wherein the numerical value "-1" filled in the null value characteristic items does not have any specific physical meaning at the moment;

the category feature designation means: specifying the type of the features in the basic feature data set obtained in step 3, wherein the specific value of certain features does not represent a numerical magnitude but serves as a symbol with a specific physical meaning; based on analysis of the features in the basic feature data set obtained in step 3, the four features Src_ETH, Dst_ETH, Src_IP and Dst_IP are designated as category features.

5. The LightGBM-based train communication network intrusion detection method of claim 1, wherein the classification result of the network intrusion attack behavior type in step 6 is classified into three levels, i.e. high, medium and low, according to scales of 3, 2 and 1, wherein level 1 represents a scanning probe type attack, level 2 represents a denial of service attack, and level 3 represents a man-in-the-middle attack.

6. The LightGBM-based train communication network intrusion detection method of claim 5, wherein step 7 comprises:

wherein $c_k$ is the number of packets of the kth network attack, $l_k$ is the threat level of that attack, $Q$ is the total number of packets, and $k_a$ is the attack threat level balancing factor, which adjusts the relative threat severity of different attacks; specifically, it expresses that one attack of threat level 3 poses a threat to the node equivalent to a corresponding number of level-2 attacks or of level-1 attacks, and its default value is 1;

the attack threat situation is normalized through the formula, and the security situation value is mapped to the interval (0, 1);

step 7.2: evaluating the network-level intrusion detection result: for a given node data set $Z = [z_1, z_2, \dots, z_n]$, $z_i \in R$, where $R$ is the real number set, the data weight vector is represented as $W = [w_1, w_2, \dots, w_n]^T$, where $w_i \in [0, 1]$ and $\sum_{i=1}^{n} w_i = 1$;

the weighted geometric mean operator is defined as follows:

$$\mathrm{WGA}_W(z_1, z_2, \dots, z_n) = \prod_{i=1}^{n} z_i^{w_i}$$

to prevent zero values in the data from zeroing the final result, a zero-removed weighted geometric mean operator is constructed:

where n is the number of nodes in the network, w _i Is the weight of the ith node, T _i (t) is the quantized value of the intrusion detection result of the ith node in the t time period, and S (t) is the final evaluation value of the whole network in the t time period.