CN117176442A

CN117176442A - Illegal network access detection method and system based on DNA spatial information weight

Info

Publication number: CN117176442A
Application number: CN202311194014.7A
Authority: CN
Inventors: 行鸿彦; 侯天浩; 梁欣怡; 王心怡
Original assignee: Nanjing University of Information Science and Technology
Current assignee: Nanjing University of Information Science and Technology
Priority date: 2023-09-15
Filing date: 2023-09-15
Publication date: 2023-12-05

Abstract

The invention discloses an illegal network access detection method and system based on DNA spatial information weight, relating to the technical field of network security. The method comprises the following steps: receiving network traffic data, classifying the feature types of the network traffic data, and integrating the data to generate a network traffic data set; generating a DNA translation rule dictionary according to the feature types of the network traffic data; translating and encoding the network traffic data set with the DNA translation rule dictionary to obtain a DNA sequence set; extracting deep features of the DNA sequence set to obtain an information weight matrix, wherein both the types and the positions of the DNA fragments in the DNA sequence set are taken into account during extraction; and classifying the information weight matrix with a random forest algorithm to obtain a network intrusion detection result.

Description

Illegal network access detection method and system based on DNA spatial information weight

Technical Field

The invention relates to the technical field of network security, in particular to an illegal network access detection method and system based on DNA spatial information weight.

Background

Network terminals and server nodes are now widespread, which leaves networks exposed to many kinds of malicious attack that disrupt normal operation and threaten the security of data in transit and in storage. In application scenarios such as the Internet of Vehicles and the Internet of Things in particular, the data leakage, network hijacking or communication delay caused by a single intrusion may have disastrous results. An Intrusion Detection System (IDS) therefore plays an important role in the current network environment: by monitoring network traffic data in real time, it can discover intrusions and raise alarms promptly, so its detection performance matters greatly for the safe operation of the network.

Network intrusion detection is generally treated as a classification problem in which network behavior is identified from network traffic data, and classification algorithms based on machine learning or deep learning were the first to be applied in this field. However, the data features are complex and typically include symbols, continuous real numbers and discrete real numbers, so applying only a simple data normalization method can cause some loss of feature information. Deep learning, by contrast, can integrate an encoding network into the training algorithm and optimize the encoding of traffic data over many iterations to mine deeper data features. Deep learning algorithms can also strengthen feature extraction for minority attack types through a data augmentation network, which addresses the problem of class imbalance. For these reasons, detection based on deep learning generally performs better in network intrusion detection.

In recent years, more researchers have begun to try machine learning algorithms that incorporate the idea of encoding. Several general problems remain, however:

1. Traditional feature encoding methods treat discrete features and continuous features separately, so different features are evaluated against inconsistent standards and differences in scale remain after standardization. This lowers the accuracy of intrusion detection.

2. The amount of computation is large, so real-time requirements cannot be met.

Disclosure of Invention

In order to solve the above-mentioned shortcomings in the background art, the present invention aims to provide a method and a system for detecting illegal network access based on DNA spatial information weight.

The aim of the invention can be achieved by the following technical scheme: an illegal network access detection method based on DNA space information weight comprises the following steps:

receiving network traffic data, classifying the feature types of the network traffic data, and integrating the network traffic data to generate a network traffic data set;

generating a DNA translation rule dictionary according to the feature types of the network traffic data;

translating and encoding the network traffic data set with the DNA translation rule dictionary to obtain a DNA sequence set;

extracting deep features of the DNA sequence set to obtain an information weight matrix, wherein both the types and the positions of the DNA fragments in the DNA sequence set are taken into account during extraction;

and classifying the information weight matrix with a random forest algorithm to obtain a network intrusion detection result.

Preferably, the feature types into which the network traffic data set is classified include: digital features and character features.

Preferably, the DNA translation rule dictionary includes character feature dictionaries and digital feature dictionaries; the character feature dictionaries include: a flag feature dictionary, a protocol feature dictionary and a service feature dictionary; the digital feature dictionaries include: a digital feature dictionary and a long digital feature dictionary.

Preferably, the translation process of the DNA translation rule dictionary is as follows:

the protocol features are translated using 3 mutually distinct 1-base DNA fragments; the service features are translated using 71 mutually distinct 4-base DNA fragments; the flag features are translated using 11 mutually distinct 2-base DNA fragments; and the digital features are translated using 11 mutually distinct 2-base DNA fragments.

For long digital features, 8 mutually distinct 2-base DNA fragments are used, selected according to the interval into which the long digital feature falls.

Preferably, the encoding rule for the translation encoding of the network traffic data set using the DNA translation rule dictionary is as follows:

for protocol features, service features, flag features and long digital features, DNA translation is completed by direct lookup in the DNA translation rule dictionary;

for the other digital features, the value is first split into its individual digits, and each digit is then looked up in the DNA translation rule dictionary in turn to complete the DNA translation.

Preferably, the process of extracting the DNA sequence set to obtain the information weight matrix includes: constructing a base position frequency matrix, calculating information weight and reconstructing the information weight matrix.

Preferably, the calculation model of the base position frequency matrix PFM is:

wherein: k represents a base type, and p_{k,j} represents the frequency of occurrence of base k in the j-th column in the context of the DNA sequence set M;

the calculation model of p_{k,j} is as follows:

wherein b_{i,j} is the base in row i and column j of the DNA sequence set, and I is a base presence determination function defined as follows:

Preferably, the information weight calculation model is as follows:

wherein f_k is the probability distribution of base k over the whole sequence set, and w_{k,j} is the information weight of base k in column j;

the information weight matrix obtained by combining the above formulas is as follows:

In a second aspect, in order to achieve the above object, the present invention discloses an illegal network access detection system based on DNA spatial information weight, comprising:

a feature classification module, configured to receive network traffic data, classify the feature types of the network traffic data, and integrate the network traffic data to generate a network traffic data set;

a translation module, configured to generate a DNA translation rule dictionary according to the feature types of the network traffic data;

an encoding module, configured to translate and encode the network traffic data set with the DNA translation rule dictionary to obtain a DNA sequence set;

an information weight extraction module, configured to extract deep features of the DNA sequence set to obtain an information weight matrix, wherein both the types and the positions of the DNA fragments in the DNA sequence set are taken into account during extraction;

an identification and classification module, configured to classify the information weight matrix with a random forest algorithm to obtain a network intrusion detection result.

In another aspect of the present invention, in order to achieve the above object, there is disclosed an apparatus, comprising:

one or more processors;

a memory for storing one or more programs;

when one or more of the programs are executed by one or more of the processors, the one or more of the processors implement an illegal network access detection method based on DNA spatial information weights as described above.

The invention has the beneficial effects that:

First, a DNA encoding strategy is designed: network traffic data is reconstructed as DNA sequences, with bases at specific spatial positions mapping the original features, which gives a standardized representation of the data. Then, by constructing an information weight matrix, deep features of the network traffic data are extracted from the DNA sequence set, which ensures the accuracy of intrusion detection. Finally, the information weight matrix is classified with a random forest algorithm to identify network intrusions. Experiments show that the method has high detection efficiency and improves the recognition accuracy for minority attack samples while maintaining a high overall detection rate and a low false alarm rate.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described, and it will be obvious to those skilled in the art that other drawings can be obtained according to these drawings without inventive effort;

FIG. 1 is a schematic flow chart of the method of the present invention;

FIG. 2 is a schematic of the overall workflow of the present invention;

FIG. 3 is a graph comparing the detection effect of the present invention example with that of the most advanced class-based intrusion detection methods;

FIG. 4 is a graph comparing the detection effect of the present invention with that of the most advanced data-based intrusion detection methods;

fig. 5 is a schematic diagram of the system structure of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

As shown in fig. 1, an illegal network access detection method based on DNA spatial information weight includes the following steps:

In this embodiment, the feature types into which the network traffic data set is classified include digital features and character features, and the 1st, 5th and 6th features of each original sample are further classed as long digital features.

The network traffic data set TR consists of n samples Tr_i, as defined in formula (1):

TR = {Tr_1, Tr_2, ..., Tr_i, ..., Tr_n}, i ∈ [1, n] (1)

Each sample Tr_i consists of 41 features X, including 3 character features (protocol type P, service type S and flag type F) and 38 digital features N, as defined in formula (2):

Tr_i = {X_1, X_2, ..., X_j, ..., X_41}

where: j ∈ [1, 41], X_2 ∈ P, X_3 ∈ S, X_4 ∈ F, X_else ∈ N (2)
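The grouping in formula (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `classify_features`, the group labels and the sample values are hypothetical, and positions follow the 1-based numbering used in the text (2, 3 and 4 for the character features; 1, 5 and 6 for the long digital features).

```python
# Hypothetical helper: split one 41-feature record into the groups of formula (2).
# Positions are 1-based, as in the text.
CHAR_POSITIONS = {2: "protocol", 3: "service", 4: "flag"}
LONG_DIGIT_POSITIONS = {1, 5, 6}

def classify_features(sample):
    """Return the feature values of one sample grouped by feature type."""
    if len(sample) != 41:
        raise ValueError("expected 41 features per sample")
    groups = {"protocol": [], "service": [], "flag": [],
              "long_digit": [], "digit": []}
    for pos, value in enumerate(sample, start=1):
        if pos in CHAR_POSITIONS:
            groups[CHAR_POSITIONS[pos]].append(value)
        elif pos in LONG_DIGIT_POSITIONS:
            groups["long_digit"].append(value)
        else:
            groups["digit"].append(value)
    return groups

# Made-up record: six leading fields followed by 35 zero-valued digital features.
sample = ["0", "tcp", "http", "SF", "181", "5450"] + ["0"] * 35
groups = classify_features(sample)
```

With this split, 3 of the 38 digital features land in the long digital group and the remaining 35 in the ordinary digital group.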

The DNA translation rule dictionary includes character feature dictionaries, namely a flag feature dictionary, a protocol feature dictionary and a service feature dictionary, and digital feature dictionaries, namely a digital feature dictionary and a long digital feature dictionary.

The translation process of the DNA translation rule dictionary is as follows:

To control the length and dimension of the translated network traffic data, the translation strategy should be as compact as possible: the protocol features are translated using 3 mutually distinct 1-base DNA fragments; the service features using 71 mutually distinct 4-base DNA fragments; the flag features using 11 mutually distinct 2-base DNA fragments; and the digital features using 11 mutually distinct 2-base DNA fragments.

In this embodiment, the encoding rule for translating and encoding the network traffic data set with the DNA translation rule dictionary is as follows: for protocol features, service features, flag features and long digital features, DNA translation is completed by direct lookup in the DNA translation rule dictionary.

To DNA-encode Tr, DNA-SE establishes coding rule dictionaries for four kinds of features: digits, protocols, services and flags.

A. Digital feature dictionary

In the digital feature dictionary, the set N to be encoded consists of eleven characters, '0' to '9' and '.', which can be described losslessly by random, mutually distinct combinations of two bases. The function Random(x_1, x_2, x_3, x_4, N) defines this procedure: N elements are selected at random from x_1, x_2, x_3, x_4 and arranged without repetition.

Formula (3) defines the generation logic of the digital feature dictionary.

EncodeDigit(X_j) = Random(A, G, C, T, 2)

where: X_j ∈ N; EncodeDigit(X_j) ≠ EncodeDigit(X_else(j)) (3)
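The generation logic of formula (3) can be sketched as follows. This is a minimal sketch, not the patented implementation: `build_dictionary` is a hypothetical name, and it assumes that "without repetition" means no two symbols share a fragment (the ≠ constraint in formula (3)), while a base may recur inside one fragment. That reading is needed for the service dictionary in any case, because four distinct bases admit only 4! = 24 orderings, fewer than the 71 service types.

```python
import itertools
import random

BASES = "AGCT"

def build_dictionary(symbols, length, rng):
    """Assign every symbol a distinct, randomly chosen DNA fragment."""
    candidates = ["".join(p) for p in itertools.product(BASES, repeat=length)]
    if len(symbols) > len(candidates):
        raise ValueError("fragment length too short for this symbol set")
    chosen = rng.sample(candidates, len(symbols))  # sampling without replacement
    return dict(zip(symbols, chosen))

rng = random.Random(0)  # fixed seed only so that the sketch is reproducible
digit_dict = build_dictionary(list("0123456789."), 2, rng)
```

The same helper would serve the protocol, service and flag dictionaries by changing the symbol list and the fragment length.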

B. Protocol feature dictionary

The content to be encoded by the protocol dictionary is the three protocol types TCP, UDP and ICMP in the set P. A single base is sufficient for translation, and the rule is shown in formula (4).

EncodeProtocol(X_j) = Random(A, G, C, T, 1)

where: X_j ∈ P; EncodeProtocol(X_j) ≠ EncodeProtocol(X_else(j)) (4)

C. Service feature dictionary

Since the service type S contains 71 elements, fragments of four bases are required for complete translation (three bases give only 4^3 = 64 distinct fragments, whereas four give 4^4 = 256), and the rule is shown in formula (5).
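The fragment lengths quoted for each dictionary follow from a simple counting argument: a fragment of L bases over {A, G, C, T} can take 4^L distinct values. The short sketch below checks this; the function name `min_fragment_length` is hypothetical, and the dictionary sizes are the ones stated in the text.

```python
def min_fragment_length(n_symbols, n_bases=4):
    """Smallest L such that n_bases ** L >= n_symbols."""
    length = 1
    while n_bases ** length < n_symbols:
        length += 1
    return length

# Dictionary sizes as stated in the description.
sizes = {"protocol": 3, "digit": 11, "flag": 11, "service": 71, "long_digit": 8}
lengths = {name: min_fragment_length(n) for name, n in sizes.items()}
```

The computed lengths agree with the 1-base, 2-base and 4-base fragments used in the translation rules.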

D. Mark feature dictionary

The flag feature set F contains 11 elements, which are translated using mutually distinct combinations of two bases, as shown in formula (6).

EncodeFlag(X_j) = Random(A, G, C, T, 2)

where: X_j ∈ F; EncodeFlag(X_j) ≠ EncodeFlag(X_else(j)) (6)

E. Long digital feature dictionary

In the NSL-KDD data set, the 1st, 5th and 6th features are digital, but their value range is large, [0, 1.38 × 10^9]. If EncodeDigit(N_i) were applied to them digit by digit, different samples Tr would yield DNA sequences of different lengths, which is unfavorable for subsequent processing. For these three features, therefore, a new translation rule is formulated according to the data length, as shown in formula (7).
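The motivation for a separate long digital rule can be illustrated with a small sketch. Everything specific here is an assumption made for illustration: the fragment table `LONG_FRAGMENTS`, the function name `encode_long_number` and the choice of bins (one bin per digit count, with eight or more digits sharing the last bin) are hypothetical and are not taken from formula (7).

```python
# Hypothetical table of 8 distinct 2-base fragments, one per length bin.
LONG_FRAGMENTS = ["AA", "AG", "AC", "AT", "GA", "GG", "GC", "GT"]

def encode_long_number(value):
    """Map a non-negative value to one fixed-width fragment by its digit count."""
    digits = len(str(int(value)))
    index = min(digits, len(LONG_FRAGMENTS)) - 1  # 8+ digits share the last bin
    return LONG_FRAGMENTS[index]

# Digit-by-digit encoding at 2 bases per digit would give variable lengths:
per_digit_lengths = [2 * len(str(v)) for v in (0, 5450, 1380000000)]

# The interval rule always yields exactly 2 bases:
fixed = [encode_long_number(v) for v in (0, 5450, 1380000000)]
```

Here the digit-by-digit lengths are 2, 8 and 20 bases, while the interval rule yields one 2-base fragment per value, so every encoded sample keeps the same length.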

The complete coding rule dictionary obtained from formulas (3) to (7) is shown in Table 1.

Table 1 translation rules dictionary
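The two encoding rules (direct lookup for character features, digit-by-digit lookup for ordinary digital features) can be sketched on a shortened record. The dictionaries below are hypothetical stand-ins, not the entries of Table 1; `SERVICE` and `FLAG` list only two entries each, and the field order is simplified for illustration.

```python
# Hypothetical dictionary entries for illustration only.
PROTOCOL = {"tcp": "A", "udp": "G", "icmp": "C"}
SERVICE = {"http": "AAAA", "ftp": "AAAG"}  # 2 of the 71 service types
FLAG = {"SF": "AA", "S0": "AG"}            # 2 of the 11 flag types
DIGIT = {"0": "AA", "1": "AG", "2": "AC", "3": "AT", "4": "GA", "5": "GG",
         "6": "GC", "7": "GT", "8": "CA", "9": "CG", ".": "CC"}

def encode_digits(text):
    """Split a number into characters and translate each one in turn."""
    return "".join(DIGIT[ch] for ch in text)

def encode_sample(protocol, service, flag, numbers):
    """Concatenate the fragments of one shortened record into a DNA sequence."""
    parts = [PROTOCOL[protocol], SERVICE[service], FLAG[flag]]
    parts += [encode_digits(n) for n in numbers]
    return "".join(parts)

seq = encode_sample("tcp", "http", "SF", ["0.25", "9"])
```

In this toy case the sequence has 1 + 4 + 2 + 8 + 2 = 17 bases; because each feature occupies a fixed span, a base position identifies the feature it came from.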

The process of extracting features from the DNA sequence set to obtain the information weight matrix comprises: constructing a base position frequency matrix, calculating the information weights, and reconstructing the information weight matrix.

After Tr_i has been DNA-encoded, a DNA sequence m_i consisting of 169 bases b is obtained, as shown in formula (8):

m_i = Encode(Tr_i) = b_{i,1}, b_{i,2}, ..., b_{i,j}, ..., b_{i,169}

where: j ∈ [1, 169] (8)

For the whole data set TR, a DNA sequence set M consisting of n sequences m can be obtained through DNA-SE, as shown in formula (9):

Then, a base position frequency matrix (PFM) is constructed to represent the frequency with which each base occurs at the same position across the sequences m:

In formula (10), k represents a base type, and p_{k,j} is the frequency of occurrence of base k in the j-th column in the context of the DNA sequence set M, which can be defined by formula (11):

In formula (11), I is a base presence determination function, defined as shown in formula (12):
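Read together, the definitions attached to formulas (10) to (12) suggest the following form. It is a reconstruction from those definitions, not a quotation of the original formulas, and it assumes that "frequency" means the fraction of the n sequences whose j-th base equals k:

```latex
\mathrm{PFM} =
\begin{bmatrix}
p_{A,1} & p_{A,2} & \cdots & p_{A,169} \\
p_{G,1} & p_{G,2} & \cdots & p_{G,169} \\
p_{C,1} & p_{C,2} & \cdots & p_{C,169} \\
p_{T,1} & p_{T,2} & \cdots & p_{T,169}
\end{bmatrix},
\qquad k \in \{A, G, C, T\},\; j \in [1, 169]

p_{k,j} = \frac{1}{n} \sum_{i=1}^{n} I\left(b_{i,j} = k\right)

I\left(b_{i,j} = k\right) =
\begin{cases}
1, & b_{i,j} = k \\
0, & \text{otherwise}
\end{cases}
```

Under this reading each column of the PFM sums to 1, since every sequence contributes exactly one base per position.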

The amount of information represented by the presence of base k in the j-th column can be calculated from formula (11). However, the overall distribution probability f_k of each base across the entire sequence set also influences the amount of information at a specific position. On the basis of the information content model [36], the information weight w_{k,j} is therefore redefined as shown in formula (13).

In formula (13), f_k is the probability distribution of base k over the whole sequence set, and w_{k,j} is the information weight of base k in column j, which characterizes the information carried by base k appearing at position j.
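A pure-Python sketch of the position frequency matrix and an information weight follows. The exact form of formula (13) is not given in this text, so the sketch assumes a common log-odds form, w_{k,j} = log2(p_{k,j} / f_k), with a small pseudo-count to avoid taking the logarithm of zero. The function names, the pseudo-count and the toy sequences are all hypothetical.

```python
import math
from collections import Counter

BASES = "AGCT"

def position_frequency_matrix(sequences):
    """p[k][j]: fraction of sequences whose j-th base is k."""
    n = len(sequences)
    width = len(sequences[0])
    if any(len(s) != width for s in sequences):
        raise ValueError("all sequences must have the same length")
    pfm = {k: [0.0] * width for k in BASES}
    for seq in sequences:
        for j, base in enumerate(seq):
            pfm[base][j] += 1.0 / n
    return pfm

def background_frequencies(sequences):
    """f[k]: share of base k among all bases of the sequence set."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    return {k: counts[k] / total for k in BASES}

def information_weights(pfm, background, pseudo=1e-6):
    # ASSUMPTION: log-odds weight; the source's formula (13) may differ.
    return {k: [math.log2((p + pseudo) / (background[k] + pseudo)) for p in row]
            for k, row in pfm.items()}

M = ["AGCT", "AGGT", "ACCT", "AGCA"]  # toy sequence set
pfm = position_frequency_matrix(M)
f = background_frequencies(M)
ifm = information_weights(pfm, f)
```

In this toy set, base A fills the first column completely, so its weight there is positive, while T never occurs in that column and receives a strongly negative weight.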

Information feature matrices can be constructed in combination with equations (8), (9) and (13), as shown in equation (14).

From formula (14), it can be seen that every element of M maps uniquely to an information weight in the IFM, so the IFM is a unique representation of the statistical characteristics of M. By looking up the IFM, each feature of Tr_i can be translated into the information weights corresponding to its DNA code. The sum of the information weights at the corresponding positions in the IFM is therefore used to form a specific-position base information weight s_i that represents the features of Tr_i.

The mapping of each feature X_{i,j} of Tr_i to positions in m_i is known from the coding rules; the s_i corresponding to Tr_i is then expressed as:

Similarly, a specific-position base information weight matrix is constructed.
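The summation described above can be sketched as follows. The function name `feature_weights`, the toy IFM values and the column spans are hypothetical; in practice the spans would follow from the coding rules, which fix how many bases each feature occupies.

```python
def feature_weights(sequence, ifm, spans):
    """Sum the information weights of the bases inside each feature's span.

    spans holds half-open, 0-based column ranges (start, end), one per feature.
    """
    result = []
    for start, end in spans:
        total = sum(ifm[sequence[j]][j] for j in range(start, end))
        result.append(total)
    return result

# Toy IFM over 3 columns: ifm[base][column] is the weight of that base there.
ifm = {"A": [1.0, -1.0, 0.5],
       "G": [0.0, 2.0, -0.5],
       "C": [-2.0, 0.0, 1.5],
       "T": [0.25, 0.75, -1.0]}

# Two features: the first occupies column 0, the second columns 1 and 2.
spans = [(0, 1), (1, 3)]
s = feature_weights("AGC", ifm, spans)
```

Stacking one such vector per sample gives a matrix with one row per sample and one column per original feature, which is the kind of input a random forest classifier expects.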

The actions and effects of the example implementation are as follows:

Table 2 Test results of the example

As can be seen from Table 2, the detection accuracy of the example of the invention for Normal traffic is 86.92%, and the overall accuracy for Attack traffic is 97.70%. The detection accuracy for the DoS and Probe attack types is 98.73% and 100.00%, respectively. For the R2L and U2R attack types, which are generally recognized as having few training samples and being harder to detect, the detection accuracy reaches 93.14% and 94.95%, respectively. This indicates that DNA-SIF generalizes well and is little affected by small sample sizes.

As can be seen from FIG. 3 and FIG. 4, compared with the most advanced intrusion detection methods at present, the method shows a clear improvement on the four metrics Accuracy, Precision, Recall and F1-score, and overcomes problems such as the low detection precision caused by uneven sample distribution.
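For reference, the four metrics named above have standard definitions in terms of true and false positives and negatives. The sketch below uses made-up counts purely for illustration; they are not taken from Table 2 or from the figures, and the function name `binary_metrics` is hypothetical.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard Accuracy, Precision, Recall and F1-score for a binary task."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts only.
acc, prec, rec, f1 = binary_metrics(tp=90, fp=10, tn=80, fn=20)
```

The sketch does not guard against empty classes; a production version would need to handle zero denominators.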

TABLE 3 training time and detection time for intrusion detection methods

As can be seen from table 3: the inventive examples exhibit shorter training times and detection times than the current most advanced intrusion detection methods.

In a second aspect, as shown in FIG. 5, an embodiment of the present invention discloses an illegal network access detection system based on DNA spatial information weight, including:

Based on the same inventive concept, the present invention also provides a computer apparatus comprising one or more processors and a memory for storing one or more computer programs; the programs include program instructions, and the processor is configured to execute the program instructions stored in the memory. The processor may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The processor is the computational core and control core of the terminal and is adapted to implement one or more instructions, in particular to load and execute one or more instructions in a computer storage medium to implement the method described above.

It should be further noted that, based on the same inventive concept, the present invention also provides a computer storage medium having a computer program stored thereon, which, when executed by a processor, performs the above method. The storage medium may take the form of any combination of one or more computer-readable media. The computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. The computer-readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.

In the description of the present specification, the descriptions of the terms "one embodiment," "example," "specific example," and the like, mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present disclosure. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

The foregoing has shown and described the basic principles, principal features, and advantages of the present disclosure. It will be understood by those skilled in the art that the present disclosure is not limited to the embodiments described above; the embodiments and the description merely illustrate the principles of the disclosure, and various changes and modifications may be made without departing from the spirit and scope of the disclosure, which is defined in the appended claims.

Claims

1. The illegal network access detection method based on the DNA spatial information weight is characterized by comprising the following steps:

2. The illegal network access detection method based on DNA spatial information weight according to claim 1, wherein the feature types into which the network traffic data set is classified include: digital features and character features.

3. The illegal network access detection method based on DNA spatial information weight according to claim 1, wherein the DNA translation rule dictionary includes character feature dictionaries and digital feature dictionaries; the character feature dictionaries include: a flag feature dictionary, a protocol feature dictionary, and a service feature dictionary; the digital feature dictionaries include: a digital feature dictionary and a long digital feature dictionary.

4. The illegal network access detection method based on DNA spatial information weight according to claim 3, wherein the DNA translation rule dictionary translation process:

5. The illegal network access detection method based on DNA spatial information weight according to claim 1, wherein the encoding rule for the translation encoding of the network traffic data set using the DNA translation rule dictionary is as follows:

6. The method for detecting illegal network access based on DNA space information weight according to claim 1, wherein the process of extracting the DNA sequence set to obtain the information weight matrix comprises the following steps: constructing a base position frequency matrix, calculating information weight and reconstructing the information weight matrix.

7. The illegal network access detection method based on the DNA spatial information weight according to claim 6, wherein the calculation model of the base position frequency matrix PFM is:

wherein: k ∈ {A, G, C, T} represents a base type, and p_{k,j} represents the frequency of occurrence of base k in the j-th column in the context of the DNA sequence set M;

the calculation model of p_{k,j} is as follows:

8. The illegal network access detection method based on DNA spatial information weight according to claim 6, wherein the information weight calculation model is as follows:

the information weight matrix obtained by combining the above formulas is as follows:

9. An illegal network access detection system based on DNA spatial information weight, comprising:

10. An apparatus, comprising:

one or more processors;

a memory for storing one or more programs;

when one or more of the programs are executed by one or more of the processors, the one or more of the processors implement a method for illegal network access detection based on DNA spatial information weights as claimed in any one of claims 1 to 8.