CN113743542B - Network asset identification method and system based on encrypted flow - Google Patents

Publication number
CN113743542B
Authority
CN
China
Prior art keywords
network
flow
encrypted
tls
network asset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111302660.1A
Other languages
Chinese (zh)
Other versions
CN113743542A (en)
Inventor
刘东海
徐育毅
庞辉富
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Youyun Software Co ltd
Beijing Guangtong Youyun Technology Co ltd
Original Assignee
Hangzhou Youyun Software Co ltd
Beijing Guangtong Youyun Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Youyun Software Co ltd and Beijing Guangtong Youyun Technology Co ltd
Priority to CN202111302660.1A
Publication of CN113743542A
Application granted
Publication of CN113743542B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155Bayesian classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00Traffic control in data switching networks
    • H04L47/10Flow control; Congestion control
    • H04L47/24Traffic characterised by specific attributes, e.g. priority or QoS
    • H04L47/2441Traffic characterised by specific attributes, e.g. priority or QoS relying on flow classification, e.g. using integrated services [IntServ]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00Traffic control in data switching networks
    • H04L47/10Flow control; Congestion control
    • H04L47/24Traffic characterised by specific attributes, e.g. priority or QoS
    • H04L47/2483Traffic characterised by specific attributes, e.g. priority or QoS involving identification of individual flows

Abstract

The invention provides a network asset identification method and system based on encrypted traffic. The method first obtains information about the historical network assets in an organization, and the attributes and other necessary information of those assets are labeled manually. A network asset feature extraction algorithm based on encrypted traffic then extracts the encrypted-traffic fingerprint features of the network assets. Feature accuracy is calculated under different sensitivity parameter values, the sensitivity parameter values of the model are determined, and these values are fed back to a machine learning model to complete model training. When the network assets need to be updated or iterated, the existing model is used to map the traffic data of the new architecture in the organization, and a network asset classification and identification result is formed from the model output. The beneficial effects of the invention are as follows: a network asset fingerprint vector is extracted and generated from encrypted traffic, and the network assets are then identified automatically through classification by a machine learning algorithm, so that network operation and maintenance personnel can gain an in-depth, real-time understanding of the network asset architecture and its dynamics within the organization, making operation and maintenance more convenient and efficient.

Description

Network asset identification method and system based on encrypted flow
Technical Field
The invention relates to the technical field of network asset operation and maintenance in IT operation and maintenance systems, and in particular to a network asset identification method and system based on encrypted traffic.
Background
Network asset identification in IT operation and maintenance means taking inventory of all hardware assets in an enterprise or organization. When the organization is large and complex, the large number of business system devices, database devices, network devices and security protection devices creates many management problems. Network assets that sit idle and unsupervised for long periods are easily attacked or become security hazards. When events such as an adjustment of the organization's internal architecture or a network asset update cycle occur, the IT operation and maintenance workload is enormous. Traditional networks rely on techniques such as network probing and fingerprint identification, but with the widespread adoption of encryption these techniques identify network assets poorly, and improvement is urgently needed.
To identify network assets in IT systems automatically and intelligently, some patents have tried to introduce artificial intelligence algorithms. For example, patent CN109033471 discloses an information network asset identification method and device. It mainly uses passive detection, analyzing the fingerprint characteristics of special fields such as the banner in application-layer protocol packets (HTTP, FTP, SMTP and the like) or in protocol packets such as IP, the TCP three-way handshake and DHCP, to passively detect network asset information. The method first obtains network asset feature data for each logical entity in an information system and determines training samples from that data; feature vectors are constructed directly from features such as the five-tuple of the original network asset traffic and network identifiers, and a machine learning model is then trained on the samples to identify and classify the network assets. Compared with conventional manual statistics, this greatly improves efficiency and can identify network assets at the logical level. However, its way of combining feature data is too simple, it does not consider normalization of network asset fingerprint feature vectors across different network environments, and it does not consider that the features cannot be extracted in an encrypted traffic environment. As encrypted traffic scenarios become increasingly common, the method therefore struggles to identify network assets effectively and comprehensively.
In the area of encrypted traffic detection, CN111885083A provides a method for extracting encrypted traffic features. It converts statistical features such as the protocol version, acceptable ciphers, extension list and elliptic curve ciphers of the encrypted traffic into a first feature vector and feeds these features directly into a model algorithm for subsequent detection. This method has several problems: it is mainly concerned with modeling the detection of malicious traffic among large volumes of normal and abnormal traffic; some of its feature choices are not general enough; and it requires complex parsing of the traffic protocol, which places extremely high demands on performance. Most importantly, the scheme does not consider the application scenario of network asset fingerprint construction, so it cannot identify, operate and maintain large numbers of network assets in a complex environment.
With the rapid expansion of IT infrastructure, the scale of assets managed by IT operation and maintenance keeps growing, and the requirements on response timeliness keep rising. For example, operation and maintenance frequently faces update and migration problems caused by version changes, service changes, code logic or network fluctuations, and when the network assets are numerous and complex it is very difficult to survey and check them again. An error in network asset inventory or identification can have a major business impact on an enterprise and cause heavy business losses. With the steady growth of encrypted traffic in the cloud era, more and more network assets are difficult to identify by traditional means designed for unencrypted networks. Effective network asset identification is the foundation of network security construction, and it gives enterprises with high operation and maintenance maturity an effective means of managing the security life cycle of network assets. For the operation and maintenance teams of most enterprises, however, this is an almost impossible task. Likewise, for some national government departments and regulatory units, the range of network assets under local supervision is too large for the assets to be managed quickly, comprehensively and accurately.
When a fault occurs, it is difficult to troubleshoot network assets quickly by hand; updates occur on short notice, and network assets are migrated and changed, so that assets cannot be seen, located or traced, causing many problems and pain points. First, many enterprises have no dedicated network asset management department responsible for taking stock of network assets (indeed, even where such a department exists, it is often independent of the security team and pays little or no attention to the security status of the assets it manages). Second, the operation and maintenance team often has to combine multiple identification modes, such as active probing, passive traffic monitoring, the Configuration Management Database (CMDB) and financial approval information, to identify network asset information comprehensively. In practice, however, the team is often constrained by factors such as the network assets themselves, personnel and time, and cannot cover everything. A scheme that effectively helps the operation and maintenance team to identify, locate and manage network assets quickly is urgently needed.
Disclosure of Invention
In the IT operation and maintenance process, network assets whose traffic is encrypted are difficult to detect and sniff, and traditional methods such as fingerprint identification and network asset mapping suffer from low accuracy and efficiency. To address these shortcomings, the invention provides a network asset identification method and system based on encrypted traffic.
The object of the invention is achieved by the following technical means. A network asset identification method based on encrypted traffic first obtains information about the historical network assets in an organization, and the attributes and other necessary information of those assets are labeled manually. A network asset feature extraction algorithm based on encrypted traffic then extracts the encrypted-traffic fingerprint features of the network assets. Feature accuracy is calculated under different sensitivity parameter values, the sensitivity parameter values of the model are determined, and these values are fed back to a machine learning model to complete model training. When the network assets need to be updated or iterated, the existing model is used to map the traffic data of the new architecture in the organization, and a network asset classification and identification result is formed from the model output.
Preferably, the network asset feature extraction algorithm based on encrypted traffic collects the encrypted session data of each network asset in the organization's network, constructs network asset fingerprints from the TLS handshake raw bytes and the TLS handshake sequence data of the encrypted sessions, applies one-dimensional convolution and pooling operations, and then classifies the result with a machine learning algorithm, thereby identifying the network assets automatically.
Furthermore, the method comprises the following specific steps:
(1) before the network asset fingerprint is generated, the traffic data in the organization is preprocessed by data cleaning, detection unit division and normalization; the network data cleaning must interface with the equipment that supplies the initial network traffic, and once the initial traffic is obtained, the traffic data is processed at the granularity of bidirectional flows;
(2) after the reassembled encrypted traffic data stream is obtained, feature vectors of the encrypted traffic are extracted and identified; the traffic representation based on TLS handshake raw bytes and the traffic representation based on the TLS record length sequence are merged into a single fingerprint vector representation of the network asset;
(3) parameter sensitivity is compared, covering the choice of the TLS handshake raw byte size and the number of TLS record lengths;
(4) all training and classification processes are integrated: features are generated by the fingerprint feature vector generation module according to the labels of the organization's network assets and their correspondence to the traffic, completing training on the encrypted traffic in the organization; when the network assets change, the encrypted traffic is classified and the network asset class corresponding to each encrypted flow is determined.
The network data cleaning comprises three steps: filtering, splitting and reassembly:
(1) all unencrypted sessions are filtered out, together with encrypted sessions whose connection was not established successfully, since this traffic contains noise and abnormal traffic;
(2) the captured continuous traffic is divided into independent detection units, and each detection unit is parsed into network five-tuple information, where the network five-tuple comprises the source IP, source port, destination IP, destination port and protocol; each basic detection unit is thus resolved into the bidirectional flow packets that share the same network five-tuple;
(3) the encrypted traffic is reassembled on the basis of the detection units; a single TCP segment may contain several TLS records, and a single TLS record may be spread across several TCP segments; during reassembly, the TCP session and the TLS records are reconstructed from discrete TCP segments, and each received TCP message is placed according to its sequence number and direction.
The invention also provides an operation and maintenance asset identification system based on device network behavior, which mainly comprises four modules: a traffic data cleaning module, a fingerprint vector generation module, a sensitive parameter tuning module and a system classification display module, wherein:
the traffic data cleaning module performs the preprocessing operations of data cleaning, detection unit division and normalization on the traffic data in the organization;
the fingerprint vector generation module extracts and identifies the feature vectors of the encrypted traffic after the reassembled encrypted traffic data stream is obtained;
the sensitive parameter tuning module compares parameter sensitivity, covering the choice of the TLS handshake raw byte size and the number of TLS record lengths;
the system classification display module integrates all training and classification processes: it generates features with the fingerprint feature vector generation module according to the labels of the organization's network assets and their correspondence to the traffic, completing training on the encrypted traffic in the organization, and, when the network assets change, classifies the encrypted traffic and determines the network asset class corresponding to each encrypted flow.
The beneficial effects of the invention are as follows: a network asset fingerprint vector is extracted and generated from encrypted traffic, and the network assets are then identified automatically through classification by a machine learning algorithm, so that network operation and maintenance personnel can gain an in-depth, real-time understanding of the network asset architecture and its dynamics within the organization, making operation and maintenance more convenient and efficient.
Drawings
FIG. 1 is a schematic diagram of the system of the present invention.
FIG. 2 is a flow chart of traffic data cleaning according to the present invention.
FIG. 3 is a schematic flow chart of extracting and identifying a feature vector of encrypted traffic according to the present invention.
FIG. 4 is a flow chart of the sensitive parameter tuning module according to the present invention.
FIG. 5 is a flow chart of the system classification display module according to the present invention.
Detailed Description
The invention is described in detail below with reference to the drawings.
The invention discloses a network asset identification method and system based on encrypted traffic analysis. A network asset fingerprint vector is extracted and generated from encrypted traffic, and the network assets are then classified by a machine learning algorithm to identify them automatically, so that network operation and maintenance personnel can gain an in-depth, real-time understanding of the network asset architecture and its dynamics within the organization, making operation and maintenance more convenient and efficient. As shown in FIG. 1, the system mainly comprises four modules: a traffic data cleaning module, a fingerprint vector generation module, a sensitive parameter tuning module and a system classification display module.
As shown in FIG. 2, before the network asset fingerprint is generated, preprocessing operations such as data cleaning, detection unit division and normalization are performed on the traffic data in the organization; in the invention this is done by the traffic data cleaning module. The invention does not limit the source of the initial traffic, which may be obtained by bypass mirroring at a core switch, by operator traffic diversion, by capturing traffic directly on a host, and so on. Once the initial traffic is obtained, it is processed at the granularity of bidirectional flows: each detection unit is an encrypted session sharing the same five-tuple (source IP, source port, destination IP, destination port and protocol), and the source and destination IP/port may be interchanged during processing. This makes full use of the advantages of the bidirectional flow representation for characterizing traffic:
(1) the method can describe fine-grained interaction behaviors between the client and the server;
(2) the method can fuse the flow information and does not cause information loss in the fusion process;
(3) the method can provide convenience for the correlation analysis work among the data streams;
(4) the method does not require aggregation or division of different time windows during the analysis.
Specifically, the traffic data cleaning module used by the invention comprises three steps: filtering, splitting and reassembly. First, to improve the quality of the traffic to be processed, the invention focuses on encrypted traffic in which interaction actually takes place. All unencrypted sessions can therefore be filtered out; that traffic can be identified directly with traditional network asset fingerprint generation methods, including but not limited to analyzing the fingerprint characteristics of the banner in application-layer protocol packets (HTTP, FTP, SMTP and the like) or of protocol packets such as IP, the TCP three-way handshake and DHCP. Encrypted sessions whose connection was not established successfully can also be filtered out, since this traffic contains noise and abnormal traffic. Filtering reduces the memory and computation overhead of the system and improves efficiency and space utilization in large-scale network environments.
Next, the captured continuous traffic is divided into independent detection units. The invention does not limit the technical scheme used for this division; it can be done with published schemes such as tcpdump and tcdisplay or with commercial schemes such as network backtracking and deep thinking network backtracking. Each detection unit is parsed into network five-tuple information, where the network five-tuple comprises exactly five fields: source IP, source port, destination IP, destination port and protocol. Each basic detection unit is thus resolved into the bidirectional flow packets that share the same network five-tuple.
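The division into bidirectional detection units can be illustrated with a minimal sketch. It assumes each packet is already available as a tuple `(src_ip, src_port, dst_ip, dst_port, proto, payload)`; the names `flow_key` and `split_into_units` and the tuple layout are hypothetical and do not come from the patent. The two endpoints are sorted so that both directions of a session map to the same key.

```python
from collections import defaultdict


def flow_key(src_ip, src_port, dst_ip, dst_port, proto):
    """Return a direction-independent key for a five-tuple.

    Sorting the two (ip, port) endpoints makes client->server and
    server->client packets of one session share the same key.
    """
    low, high = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    return (low, high, proto)


def split_into_units(packets):
    """Group packets into detection units keyed by bidirectional five-tuple."""
    units = defaultdict(list)
    for pkt in packets:
        units[flow_key(*pkt[:5])].append(pkt)
    return dict(units)


packets = [
    ("10.0.0.1", 50000, "10.0.0.2", 443, "TCP", b"a"),  # client -> server
    ("10.0.0.2", 443, "10.0.0.1", 50000, "TCP", b"b"),  # server -> client
    ("10.0.0.3", 50001, "10.0.0.2", 443, "TCP", b"c"),  # a different session
]
units = split_into_units(packets)
```

Here the first two packets fall into one detection unit and the third into another.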
Finally, the encrypted traffic is reassembled on the basis of the detection units. Given the limit imposed by the network MTU (maximum transmission unit) and the variety of TLS (Transport Layer Security) records, a single TCP segment may contain several TLS records, and a single TLS record may be spread across several TCP segments. During reassembly, the TCP session and the TLS records are reconstructed from discrete TCP segments, and each received TCP message is placed according to its sequence number and direction. The invention does not limit the underlying implementation used for reassembly; Snort, Suricata and the Linux kernel all contain concrete implementations of TCP reassembly. The reassembly process also sorts out and resolves TCP-level problems such as retransmission, out-of-order delivery and packet loss.
As shown in FIG. 3, after the reassembled encrypted traffic data stream is obtained, feature vectors of the encrypted traffic must be extracted and identified. In the invention this is done by the fingerprint vector generation module, which merges two representations, one based on TLS handshake raw bytes and one based on the TLS record length sequence, into a single fingerprint vector representation of the network asset. Because the payload of encrypted traffic is invisible, feature fingerprints can only be generated from the handshake messages transmitted in clear text and from the inherent statistical characteristics of the network traffic (such as the packet length sequence). Feature extraction for encrypted traffic normally requires a great deal of expert knowledge, and this strong dependence on experts makes it impractical in large, complex and disorderly organizations. The invention combines the two TLS-based representations, learns the representation of an encrypted session automatically, and also takes the correlation between encrypted sessions into account. On this basis, the invention characterizes and defines an encrypted session from two aspects.
The first is the traffic representation based on TLS handshake raw bytes. TLS (Transport Layer Security), the successor of SSL (Secure Sockets Layer), provides confidentiality and integrity guarantees for network application communication. Because the application data of a network asset is invisible, the clear-text information negotiated before encrypted communication is established can be used instead, that is, the TLS record data of the handshake phase. The raw bytes of this phase contain information used in the encrypted communication such as versions, extensions, cipher suites and certificates. Because each network asset behaves in a secure and regular manner, with characteristic certificates and communication patterns, the fields negotiated in the TLS handshake can be used to generate a fingerprint feature vector for the network asset. By contrast, data below the session layer, such as the IP address at the network layer and the various TCP control fields at the transport layer, does not effectively reflect the specificity of an encrypted network flow. For this reason, the invention does not process data below the session layer (network layer and transport layer data) and retains only the first N bytes of the TLS records in the TLS handshake phase. The choice of N is crucial to the detection result. On the one hand, N must be large enough that the first N bytes contain the TLS ClientHello, the TLS ServerHello and part of the Certificate information; on the other hand, N should not be so large that too much irrelevant data is included, which would reduce detection efficiency. In the invention, the sensitive parameter tuning module can generate N dynamically according to the network topology; alternatively, the value N = 1800, determined through extensive analysis and experiments during the development of the invention, can be used directly.
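The reassembly step can be sketched as follows. This is a simplified illustration, not the reassembly logic of any of the implementations named above: it handles one direction of one session, assumes segments are given as hypothetical `(seq, payload)` pairs, ignores sequence-number wrap-around, and stops at the first gap. The TLS record header layout used in `parse_tls_records` (1-byte content type, 2-byte version, 2-byte length) is the standard TLS record format.

```python
import struct


def reassemble(segments):
    """Rebuild one direction of a TCP byte stream from (seq, payload) pairs.

    Segments are ordered by sequence number, retransmitted bytes are
    dropped, and reassembly stops at the first gap (lost data).
    """
    stream = bytearray()
    if not segments:
        return bytes(stream)
    ordered = sorted(segments, key=lambda s: s[0])
    next_seq = ordered[0][0]
    for seq, payload in ordered:
        if seq > next_seq:          # gap: a segment is missing
            break
        skip = next_seq - seq       # bytes already seen (retransmission/overlap)
        if skip < len(payload):
            stream += payload[skip:]
            next_seq = seq + len(payload)
    return bytes(stream)


def parse_tls_records(stream):
    """Return (content_type, length, body) for each complete TLS record."""
    records = []
    offset = 0
    while offset + 5 <= len(stream):
        content_type, _version, length = struct.unpack_from("!BHH", stream, offset)
        end = offset + 5 + length
        if end > len(stream):       # record body not fully captured
            break
        records.append((content_type, length, stream[offset + 5:end]))
        offset = end
    return records


# Two TLS records split across segments that arrive out of order, with one retransmission.
data = b"\x16\x03\x03\x00\x04abcd" + b"\x17\x03\x03\x00\x02xy"
segments = [(1007, data[7:]), (1000, data[:7]), (1000, data[:7])]
records = parse_tls_records(reassemble(segments))
```

In this toy input, the first record is split across two TCP segments and the second segment also carries a complete second record, which are the two cases the paragraph above describes.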
Accordingly, the original byte data of a single encryption session can be represented as in equation (1).
$$\mathrm{RawBytes}(i) = \left(b_1^i, b_2^i, \dots, b_N^i\right) \tag{1}$$
where RawBytes(i) represents the $i$-th encrypted network flow and $b_n^i$ denotes the $n$-th byte (two hexadecimal digits) of the $i$-th encrypted record, each byte taking a value in the range [0, 255].
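As a minimal sketch of equation (1), the function below takes the handshake-phase TLS records of one session as byte strings and returns a fixed-length list of N integers in [0, 255]. The function name `raw_bytes_feature` is hypothetical, and zero-padding of sessions shorter than N is an assumption; the patent does not state how short sessions are handled.

```python
def raw_bytes_feature(handshake_records, n=1800, pad=0):
    """Return the first n bytes of the handshake as a list of ints in [0, 255].

    handshake_records: iterable of bytes objects, in capture order.
    Shorter inputs are right-padded with `pad` (an assumption, see lead-in).
    """
    data = b"".join(handshake_records)[:n]
    return list(data) + [pad] * (n - len(data))


short = raw_bytes_feature([b"\x16\x03", b"\x01"], n=5)
truncated = raw_bytes_feature([bytes(range(10))], n=4)
```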
In subsequent processing, each raw byte is mapped to a fixed-length feature vector by a word embedding operation, and the resulting vectors are processed by a one-dimensional convolutional network. This captures the immediate context between each byte and the bytes that follow it, as well as the relationship of each byte to the whole byte vector, and thereby yields a richer semantic representation of the TLS handshake.
The second is the traffic representation based on the TLS record length sequence. The packet length sequence of an encrypted session not only characterizes the communication pattern of the session but also reflects the type of application it carries, and the TLS record length sequences of different network assets vary greatly. During preprocessing, TCP reassembly resolves the packet retransmission and reordering caused by network problems, removes the limit imposed by the MTU (1500), and restores the TLS records, recovering the original form of the TLS encrypted session. The packet length sequence is therefore replaced by the TLS record length sequence, which is better suited to the task of detecting network asset traffic.
Based on the above analysis, the invention selects the first M TLS record lengths of the encrypted session. M must be chosen so that the selected records cover the ClientHello, ServerHello, Certificate and part of the Application Data, so that they effectively reflect the communication pattern of the encrypted session. In the invention, the sensitive parameter tuning module can generate M dynamically according to the network topology; alternatively, the value M = 10, determined through extensive analysis and experiments, can be used. The traffic representation based on the TLS record length sequence can be expressed as equation (2):
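The embedding, one-dimensional convolution and pooling steps can be sketched in plain Python as follows. This is an untrained toy illustration: the embedding table and kernel are hand-picked placeholders rather than learned weights, and the embedding dimension, kernel width, ReLU activation and max pooling are all assumptions, since the patent does not specify the network architecture.

```python
def embed(byte_seq, table):
    """Map each byte value to its embedding vector (table has 256 rows)."""
    return [table[b] for b in byte_seq]


def conv1d(vectors, kernel, bias=0.0):
    """Valid 1-D convolution along the byte axis, one output channel, with ReLU.

    kernel[j] is the weight vector applied to the j-th position of the window.
    """
    width = len(kernel)
    out = []
    for start in range(len(vectors) - width + 1):
        total = bias
        for j in range(width):
            total += sum(w * x for w, x in zip(kernel[j], vectors[start + j]))
        out.append(max(0.0, total))
    return out


def max_pool(values, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


# Placeholder weights (in a real model these would be learned).
table = [[float(b), 1.0] for b in range(256)]   # embedding dimension 2
kernel = [[1.0, 0.0], [0.0, 1.0]]               # window of 2 bytes

features = max_pool(conv1d(embed([1, 2, 3, 4, 5], table), kernel), size=2)
```

Each convolution output depends on a byte and its immediate successor, which is the local context relationship the paragraph above refers to.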
$$\mathrm{Sequence}(i) = \left(l_1^i, l_2^i, \dots, l_M^i\right) \tag{2}$$
where $l_n^i$ denotes the length of the $n$-th TLS record of the $i$-th encrypted network flow. The sign of $l_n^i$ encodes the direction of the TLS record: upstream traffic (client -> server) is positive and downstream traffic (server -> client) is negative.
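A minimal sketch of the signed length sequence of equation (2) is shown below. It assumes the TLS records of a session are available as `(direction, length)` pairs with the hypothetical direction labels `"c2s"` and `"s2c"`; padding sessions with fewer than M records with zeros is an assumption not stated in the patent.

```python
def length_sequence(records, m=10, pad=0):
    """Return the first m TLS record lengths, signed by direction.

    records: list of (direction, length); "c2s" (client -> server) is
    positive, anything else is treated as server -> client and negated.
    """
    signed = [length if direction == "c2s" else -length
              for direction, length in records[:m]]
    return signed + [pad] * (m - len(signed))


seq = length_sequence([("c2s", 512), ("s2c", 1400), ("c2s", 64)], m=5)
```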
Furthermore, when the relationships between encrypted sessions are modeled, the TLS record length sequence can be used in constructing encrypted traffic fingerprints, because it helps to identify more related encrypted sessions with similar communication patterns. During detection, the differences among related encrypted sessions reflect how stable their communication patterns are. The TLS record length sequence is therefore standardized with the z-score to eliminate the effect of the differing record lengths of different types of encrypted session.
Figure 71540DEST_PATH_IMAGE005
(3)
Wherein lnFor the length of TLS after normalizationSnAnd UnThe standard deviation and mean of the length is recorded for the nth TLS for all encrypted sessions.
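The z-score standardization of equation (3) can be sketched per record position as follows. The function name is hypothetical; using the population standard deviation and mapping positions with zero variance to 0.0 are assumptions made so that the example is well defined.

```python
from statistics import mean, pstdev


def zscore_by_position(sequences):
    """Standardize the n-th record length across all sessions.

    sequences: list of equal-length lists of signed TLS record lengths.
    """
    columns = list(zip(*sequences))
    means = [mean(col) for col in columns]    # U_n
    stds = [pstdev(col) for col in columns]   # S_n (population std, an assumption)
    return [
        [(x - u) / s if s > 0 else 0.0 for x, u, s in zip(seq, means, stds)]
        for seq in sequences
    ]


normalized = zscore_by_position([[100, -200], [300, -200]])
```

In this toy input the first position has mean 200 and standard deviation 100, while the second position is constant and is mapped to 0.0.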
Finally, the invention aggregates the TLS handshake raw byte features and the TLS record length sequence features as in equation (4), where Sig(i) is the final traffic feature, RawBytes(i) is the raw byte feature and Sequence(i) is the TLS record length sequence feature.
Sig(i)=RawBytes(i)+Sequence(i) (4)
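A minimal sketch of formula (4) follows. Reading "+" as vector concatenation is an assumption made here, on the grounds that the two features generally have different lengths; the function name `aggregate_fingerprint` is hypothetical.

```python
from typing import List


def aggregate_fingerprint(raw_bytes_feat: List[float],
                          sequence_feat: List[float]) -> List[float]:
    """Join the two feature vectors into a single fingerprint Sig(i).

    Interpreting '+' in formula (4) as concatenation is an assumption.
    """
    return list(raw_bytes_feat) + list(sequence_feat)


sig = aggregate_fingerprint([0.1, 0.2], [1.0, -1.0])
print(sig)  # [0.1, 0.2, 1.0, -1.0]
```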
As shown in Fig. 4, in the sensitive parameter tuning module, the present invention designs a method for comparing and analyzing parameter sensitivity, covering the choice of the original byte size of the TLS handshake and the length of the TLS record sequence.
In existing feature-based work, the TLS handshake records Client Hello, Server Hello and Certificate are the most commonly used encrypted traffic information. Here, features are not extracted manually; instead, a one-dimensional convolutional neural network automatically learns the best feature representation from the original bytes. In particular, the original bytes of the TLS handshake contain the security parameters negotiated in the TLS handshake phase for subsequent encrypted communication, which is the most valuable information in the network asset fingerprinting algorithm. The original byte size determines how much handshake information is used, and different original byte sizes yield different performance. The TLS record length sequence better reflects the application type carried by the encrypted traffic and the communication pattern of the TLS session, and has less impact on performance than the original byte size. In the invention, the two parameters are tuned adaptively and separately. The number N of leading TLS bytes is searched by traversal over the range 300 to 3000 bytes with a step of 100, and each candidate can be evaluated with a basic algorithm such as SVM. The number of TLS records is likewise searched by traversal over the range 5 to 20 with a step of 1. After both representations have been searched, the optimal values are selected and passed to the classification detection module to generate the optimal training model parameters.
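The two traversal searches can be sketched as below. Only the candidate ranges come from the text; the helper `traverse_search` and the two toy scoring functions are hypothetical placeholders for "train a basic classifier such as an SVM and return its validation score", and their peaks are arbitrary, not reported results.

```python
from typing import Callable, Iterable, Tuple


def traverse_search(candidates: Iterable[int],
                    score: Callable[[int], float]) -> Tuple[int, float]:
    """Evaluate every candidate and return (best_value, best_score).

    On ties the first candidate seen is kept, an assumption of this sketch.
    """
    best_value, best_score = None, float("-inf")
    for value in candidates:
        current = score(value)
        if current > best_score:
            best_value, best_score = value, current
    if best_value is None:
        raise ValueError("no candidates to search")
    return best_value, best_score


# Ranges from the text: N in 300..3000 with step 100, M in 5..20 with step 1.
N_CANDIDATES = range(300, 3001, 100)
M_CANDIDATES = range(5, 21)


# Toy scorers with arbitrary peaks; a real scorer would train and validate
# a classifier on features built with the given parameter value.
def toy_score_n(n: int) -> float:
    return -abs(n - 1200) / 3000.0


def toy_score_m(m: int) -> float:
    return -abs(m - 10) / 20.0


best_n, _ = traverse_search(N_CANDIDATES, toy_score_n)
best_m, _ = traverse_search(M_CANDIDATES, toy_score_m)
print(best_n, best_m)  # 1200 10
```

Searching the two parameters independently costs 28 + 16 evaluations, whereas a joint grid over both would cost 28 × 16.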
As shown in Fig. 5, the system classification display module finally integrates all training and classification processes. It first generates features with the fingerprint feature vector generation module according to the labels of the organization's network assets and their corresponding traffic, completing training on the organization's internal encrypted traffic. When network assets change, it then performs classification prediction on the encrypted traffic to determine the network asset class corresponding to each encrypted flow. Fingerprint models for network assets of the same type can be reused: a network asset fingerprint model trained in unit A can also be applied to network assets of the same type in unit B, and richer encrypted traffic network asset fingerprint models can be obtained in a larger organization network.
The data set is divided so that 60% of all labeled data is used as training data, 20% as validation data and 20% as test data, and the data is classified with the packaged classifiers provided by the Python-based scikit-learn library. Multiple algorithms are combined: the three algorithms with the highest scores are each run 10 times with fixed parameters, and the averages of precision, recall and Micro-F1 are taken as the final results. It should be noted that the machine learning algorithms supported by the invention are not limited to those above; LightGBM, XGBoost, neural networks, autoencoders, or temporal recurrent neural networks can also be adapted to the solution of the invention.
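The 60/20/20 division and the Micro-F1 score can be sketched with the standard library alone. The function names, the seeded shuffle, and the single-label multi-class setting are assumptions of this example; the patent itself relies on scikit-learn classifiers, which are not reproduced here.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_60_20_20(samples: Sequence[T], seed: int = 0
                   ) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle and split samples into 60% train, 20% validation, 20% test."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 6 // 10
    n_val = len(shuffled) * 2 // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder, so no sample is dropped
    return train, val, test


def micro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Micro-averaged F1 for single-label multi-class predictions.

    In this setting every error counts as one false positive and one
    false negative, so micro-F1 equals accuracy.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


train, val, test = split_60_20_20(list(range(100)))
print(len(train), len(val), len(test))  # 60 20 20
print(micro_f1(["a", "b", "a", "c"], ["a", "b", "c", "c"]))  # 0.75
```

Repeating a run 10 times, as the text describes, would amount to calling the split with ten different seeds and averaging the resulting scores.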
The main functions of the system classification display module included in the network asset identification system are as follows. The original, unprocessed multi-dimensional network asset traffic information is displayed in an interface so that operation and maintenance personnel can conveniently check the trend of the raw data, and network asset identification is performed on the encrypted traffic in real time. In addition, to help operation and maintenance personnel debug the algorithm, the module can also provide an interactive interface through which they can enter parameter configurations for the different algorithms in the network asset identification algorithm; the entered configurations are stored in a back-end configuration file for parameter selection in the next round of algorithm improvement.
It should be understood that equivalent substitutions and changes made by those skilled in the art according to the technical solution and inventive concept of the present invention shall all fall within the protection scope of the appended claims.

Claims (4)

1. A network asset identification method based on encrypted traffic, characterized in that: first, information on historical network assets in an organization is acquired; the attributes and necessary information of the network assets are then manually labeled; encrypted traffic fingerprint features of the network assets are next extracted using a network asset feature extraction algorithm based on encrypted traffic; feature accuracy is then calculated under different sensitive values, and the sensitive values of the model are finally determined and fed back to the machine learning model to complete its training; when the network assets need to be updated and iterated, the existing model is used to map the traffic data of the new architecture in the organization, and a network asset classification and identification result is formed from the model output;
the network asset feature extraction algorithm based on encrypted traffic first collects the encrypted session data of each network asset in the organization network, assembles it into a single fingerprint vector representation of the network asset through two methods, namely the traffic representation based on TLS handshake original bytes and the traffic representation based on the TLS record length sequence, and then, after one-dimensional convolution and pooling operations, uses a machine learning algorithm for classification to realize automatic identification of the network asset; the specific steps are as follows:
(1) traffic representation based on TLS handshake original bytes: the first N bytes of the TLS records in the TLS handshake phase are retained, each original byte is mapped to a fixed-length feature vector by a word embedding operation, and the vectors are then processed with a one-dimensional convolutional network architecture to capture the direct contextual association between each byte and its neighbouring bytes in the sequence, as well as the relation of each byte to the whole byte vector;
(2) traffic representation based on the TLS record length sequence: the lengths of the first M TLS records of the encrypted session are selected, and the traffic representation based on the TLS record length sequence is expressed as formula (2):
$\mathrm{Sequence}(i) = \left(l^{i}_{1}, l^{i}_{2}, \ldots, l^{i}_{M}\right)$ (2)
where $l^{i}_{n}$ denotes the length of the nth TLS record of the ith encrypted network stream, and the sign of $l^{i}_{n}$ encodes the direction of the TLS record: uplink traffic is positive, and downlink traffic is negative;
z-score normalization of the length sequences of the TLS records;
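$l_{n} = \dfrac{l^{i}_{n} - u_{n}}{s_{n}}$ (3)
where $l_{n}$ is the normalized TLS record length, $l^{i}_{n}$ is the raw length of the nth TLS record of the ith encrypted network stream, and $s_{n}$ and $u_{n}$ are the standard deviation and mean of the nth TLS record length over all encrypted sessions;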
(3) the TLS handshake original byte features and the TLS record length sequence features are aggregated, where Sig(i) is the final traffic feature, RawBytes(i) is the TLS handshake original byte feature, and Sequence(i) is the TLS record length sequence feature;
Sig(i)=RawBytes(i)+Sequence(i) (4)。
2. The encrypted traffic-based network asset identification method according to claim 1, characterized in that the method comprises the following specific steps:
(1) before the network asset fingerprint is generated, the traffic data in the organization is first preprocessed by data cleaning, detection unit division and normalization; network data cleaning requires interfacing with the initial network traffic equipment, and after the initial traffic is obtained, the traffic data is processed at the granularity of bidirectional flows;
(2) after the reassembled encrypted traffic data stream is obtained, the feature vector of the encrypted traffic is extracted and identified; the traffic representation based on TLS handshake original bytes and the traffic representation based on the TLS record length sequence are combined into a single fingerprint vector representation of the network asset;
(3) parameter sensitivity is compared, including the choice of the original byte size of the TLS handshake and the length of the TLS record sequence;
(4) all training and classification processes are integrated: features are generated by the fingerprint feature vector generation module according to the labels of the organization's network assets and their corresponding traffic, completing training on the encrypted traffic in the organization; when network assets change, classification prediction is performed on the encrypted traffic to determine the network asset class corresponding to each encrypted flow.
3. The encrypted traffic-based network asset identification method according to claim 2, characterized in that the network data cleaning comprises the following three steps of filtering, splitting and reassembly:
(1) filtering out all unencrypted sessions, and also filtering out encrypted sessions whose connection was not successfully established;
(2) dividing the captured continuous traffic into independent detection units according to network five-tuple information, where the network five-tuple comprises the source IP, source port, destination IP, destination port and protocol, so that each basic detection unit is finally resolved into the bidirectional flow packets sharing the same network five-tuple;
(3) reassembling the encrypted traffic on the basis of the detection units, since a single TCP segment may contain multiple TLS records and one TLS record may be spread across multiple TCP segments; during reassembly, the TCP session and the TLS records are reconstructed from the discrete TCP segments, and each received TCP segment is reassembled according to its sequence number and direction.
4. A network asset identification system using the encrypted traffic-based network asset identification method according to any one of claims 1 to 3, characterized in that the system mainly comprises four modules: a traffic data cleaning module, a fingerprint vector generation module, a sensitive parameter tuning module and a system classification display module; wherein:
the traffic data cleaning module is used for preprocessing the traffic data in the organization by data cleaning, detection unit division and normalization;
the fingerprint vector generation module is used for extracting and identifying the feature vector of the encrypted traffic after the reassembled encrypted traffic data stream is obtained;
the sensitive parameter tuning module is used for comparing parameter sensitivity, including the choice of the original byte size of the TLS handshake and the length of the TLS record sequence;
the system classification display module is used for integrating all training and classification processes, generating features with the fingerprint feature vector generation module according to the labels of the organization's network assets and their corresponding traffic, and completing training on the encrypted traffic in the organization; when network assets change, it performs classification prediction on the encrypted traffic to determine the network asset class corresponding to each encrypted flow.
CN202111302660.1A 2021-11-05 2021-11-05 Network asset identification method and system based on encrypted flow Active CN113743542B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111302660.1A CN113743542B (en) 2021-11-05 2021-11-05 Network asset identification method and system based on encrypted flow

Publications (2)

Publication Number Publication Date
CN113743542A CN113743542A (en) 2021-12-03
CN113743542B true CN113743542B (en) 2022-03-01

Family

ID=78727534

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111302660.1A Active CN113743542B (en) 2021-11-05 2021-11-05 Network asset identification method and system based on encrypted flow

Country Status (1)

Country Link
CN (1) CN113743542B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114422174B (en) * 2021-12-09 2023-07-25 绿盟科技集团股份有限公司 Network traffic filtering method, device, medium and equipment
CN114553939B (en) * 2022-04-25 2022-07-19 北京广通优云科技股份有限公司 Encryption flow-based resource stable switching method in IT intelligent operation and maintenance system
CN115174147A (en) * 2022-06-01 2022-10-11 中国科学院信息工程研究所 Real-time network connection privacy protection method and system based on anti-disturbance
CN115242463B (en) * 2022-06-30 2023-06-09 北京华顺信安科技有限公司 Method, system and computer equipment for monitoring dynamic change of network asset
CN115589362B (en) * 2022-12-08 2023-03-14 中国电子科技网络信息安全有限公司 Method for generating and identifying device type fingerprint, device and medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016010872A1 (en) * 2014-07-16 2016-01-21 Microsoft Technology Licensing, Llc Recognition of behavioural changes of online services
CN105871832A (en) * 2016-03-29 2016-08-17 北京理工大学 Network application encrypted traffic recognition method and device based on protocol attributes
CN109726763A (en) * 2018-12-29 2019-05-07 北京神州绿盟信息安全科技股份有限公司 A kind of information assets recognition methods, device, equipment and medium
CN110909224A (en) * 2019-11-22 2020-03-24 浙江大学 Sensitive data automatic classification and identification method and system based on artificial intelligence
CN110991509A (en) * 2019-11-25 2020-04-10 杭州安恒信息技术股份有限公司 Asset identification and information classification method based on artificial intelligence technology
CN112671757A (en) * 2020-12-22 2021-04-16 无锡江南计算技术研究所 Encrypted flow protocol identification method and device based on automatic machine learning
CN113162908A (en) * 2021-03-04 2021-07-23 中国科学院信息工程研究所 Encrypted flow detection method and system based on deep learning

Also Published As

Publication number Publication date
CN113743542A (en) 2021-12-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant