CN113300977B - Application flow identification and classification method based on multi-feature fusion analysis - Google Patents

Application flow identification and classification method based on multi-feature fusion analysis

Info

Publication number
CN113300977B
Authority
CN
China
Prior art keywords
application
flow
sequence
ciphertext
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110584098.XA
Other languages
Chinese (zh)
Other versions
CN113300977A (en
Inventor
司成祥
李应博
李胜男
毛蔚轩
张建松
刘云昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Computer Network and Information Security Management Center
Original Assignee
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Computer Network and Information Security Management Center
Priority to CN202110584098.XA
Publication of CN113300977A
Application granted
Publication of CN113300977B
Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/24 Traffic characterised by specific attributes, e.g. priority or QoS
    • H04L47/2483 Traffic characterised by specific attributes, e.g. priority or QoS involving identification of individual flows
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425 Traffic logging, e.g. anomaly detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

An application traffic identification and classification method based on multi-feature fusion analysis. Description features and traffic samples of an application are collected by means such as web crawlers and automatic traffic triggering; the plaintext and ciphertext features of the application are then extracted and stored as graph data, so that application traffic can be accurately identified by fusing multiple features on the graph structure. The method comprises four parts: acquisition of the application and its description information; automatic traffic triggering and collection; extraction of plaintext and ciphertext traffic features; and graph-based feature storage and retrieval. It provides corresponding identification methods for encrypted and unencrypted traffic and analyzes the network activity of an application at fine granularity. The method addresses the high false-alarm rate and the overly coarse behavior-identification granularity caused by the reliance of traditional traffic identification methods on a single feature, and provides a methodological basis and technical support for further work such as network resource scheduling, identification of and protection against malicious applications, and user profiling.

Description

Application flow identification and classification method based on multi-feature fusion analysis
Technical Field
The invention belongs to the technical field of network traffic management, and particularly relates to an application traffic identification and classification method based on multi-feature fusion analysis.
Background
With the development of Internet technology, network traffic classification plays an important role in network security, user profiling, carrier-grade traffic optimization, and similar areas.
Because of the rapid development of Internet applications and the heavy use of the HTTP/HTTPS protocols, traditional port-based traffic classification is no longer meaningful. As major vendors have placed growing emphasis on traffic encryption in recent years, identification methods based on extracting plaintext payload features are no longer effective. The advent of cloud computing platforms has likewise made traditional IP-based identification methods unsuitable.
The effectiveness of the various recently proposed traffic identification methods based on neural networks and machine learning depends heavily on training data, and the training data in turn depends heavily on preprocessing of the traffic data by specialists, including collection, feature engineering, and data cleaning. With application versions iterating rapidly and new applications appearing constantly, this labor-intensive model is difficult to adapt to practice.
Meanwhile, identification methods based on traffic features do not handle third-party API calls well. As the open platforms of major vendors mature, more and more applications use third-party vendors' APIs to implement common functions (such as third-party social account login, map display, and payment). Traffic classification methods based on traffic feature identification do not take this additional information into account, so a high misidentification rate can occur in real scenarios where applications call the interfaces of different vendors.
In summary, traditional DPI (deep packet inspection) and DFI (deep flow inspection) methods that classify traffic on a single feature are no longer adequate and cannot achieve good efficiency and high accuracy in real scenarios.
Disclosure of Invention
To overcome the defects of the prior art, the invention aims to provide an application traffic identification and classification method based on multi-feature fusion analysis, so as to solve the problem that application traffic is difficult to identify accurately because applications call one another's APIs.
To achieve this purpose, the invention adopts the following technical scheme:
An application traffic identification and classification method based on multi-feature fusion analysis comprises the following steps:
Step 1, collecting the executable file of an application and its description information.
Specifically, web crawler technology can be used: after the user specifies the applications to be collected, the application software is retrieved and downloaded from the application's official website and from Internet software libraries, and application-related web pages are collected from the same sources. The application-related web pages comprise at least the application description pages, download pages, and usage instruction pages of the official website and the Internet software libraries.
Step 2, extracting application description features and collecting application traffic samples, specifically comprising:
Step 2.1, before collecting application traffic samples, analyzing the application and extracting application description features by a semantic analysis method, with the following specific steps:
Step 2.1.1, statically analyzing the downloaded binary package to obtain application description information.
The static analysis comprises: calculating the file hash; determining the executable file type from the file header; recursively decompressing archive files; for APK files, scanning the AndroidManifest.xml file to obtain the declared version information; for ELF files, determining the running platform and dependency libraries from the ELF import table; for PE files, parsing the DOS header, NT header, section table, and specific sections in sequence; and discarding files of any other type and marking them as unavailable. The application description information obtained is a set of software version information comprising: software running platform information, software version information, and the software binary file.
Step 2.1.2, obtaining application description features by using the Word2Vector algorithm.
Specifically: the vector representations of the descriptive sentences in the application description information obtained above and in the application description information collected in step 1 are learned by the Word2Vector algorithm, the cosine similarity between the vector representations is calculated, and the 20 information vectors with the highest cosine similarity are retained as the software description features.
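The ranking in step 2.1.2 can be illustrated with a minimal sketch. The text does not state what each vector is compared against, so the sketch assumes a single reference vector (for example, one derived from the statically extracted description information) and ranks the crawled sentence vectors against it. The function names are hypothetical, and the vectors are taken as already learned.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_similar(reference, candidates, k=20):
    """Return the k candidate vectors most similar to the reference vector."""
    ranked = sorted(
        candidates,
        key=lambda v: cosine_similarity(reference, v),
        reverse=True,
    )
    return ranked[:k]
```

The default k=20 mirrors the 20 vectors retained in the text; everything else is an assumption made for illustration.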
Step 2.2, when collecting application traffic samples, automatically triggering the application by means of virtualization and automated testing technology so as to obtain the traffic samples automatically. This step is implemented with a controlled module running inside a virtual machine and a control module running outside it, as follows:
Step 2.2.1, using the information about the application's running environment, the control module searches a pre-prepared virtual machine snapshot library for a matching platform and starts the virtual machine corresponding to that platform;
Step 2.2.2, the controlled module calls the corresponding operating system interface to install and start the application in the virtual machine, and at the same time notifies the control module to start capturing traffic at the IP layer;
Step 2.2.3, the control module, acting from outside through virtual user operations, selects a predefined trigger strategy and traverses the application to simulate user operations;
Step 2.2.4, once the traffic data collected in a single run exceeds a threshold, the control module shuts down the virtual machine and restores it to the initial snapshot;
Step 2.2.5, repeating the above steps several times to provide a sufficiently large data set for subsequent analysis.
The controlled module is platform-dependent, with a different back end on each operating system platform; the snapshots in the virtual machine snapshot library cover the Windows, Linux, iOS, and Android operating systems and emulators; the trigger strategies include depth-first, breadth-first, and component triggering.
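The depth-first and breadth-first trigger strategies named above can be sketched as two traversal orders over a graph of user-interface screens. The graph representation, the screen names, and the function name are hypothetical; component triggering is not covered by this sketch.

```python
from collections import deque


def traverse_ui(ui_graph, start, strategy="depth-first"):
    """Return the order in which screens are visited.

    ui_graph maps a screen name to the screens reachable from it by one
    simulated user action; the graph and its names are hypothetical.
    """
    if strategy not in ("depth-first", "breadth-first"):
        raise ValueError(f"unknown strategy: {strategy}")
    pending = deque([start])
    visited = []
    while pending:
        # A stack gives depth-first order, a queue gives breadth-first order.
        screen = pending.pop() if strategy == "depth-first" else pending.popleft()
        if screen in visited:
            continue
        visited.append(screen)
        neighbours = ui_graph.get(screen, [])
        if strategy == "depth-first":
            # Reverse so that the first-listed child is explored first.
            neighbours = reversed(neighbours)
        pending.extend(neighbours)
    return visited
```

In a real controller each visit would correspond to a simulated click or input rather than a list append.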
Step 3, extracting the plaintext features of the application by a sequence pattern analysis method, based on the traffic samples and the application description features, specifically comprising:
Step 3.1, after obtaining a traffic sample, dividing the input sequence of IP-layer packets into TCP traffic and non-TCP traffic according to the OSI layer-4 protocol header and discarding invalid packets; for TCP traffic, grouping the IP packets that belong to the same layer-4 TCP session according to their five-tuple, wherein the invalid packets comprise handshake packets, retransmitted packets, acknowledgement packets, and incomplete sessions;
Step 3.2, treating each session as a sequence, extracting the positions at which plaintext information appears in the session and the regularities of its appearance by a sequence pattern analysis method, and consolidating them into Snort rules as the plaintext traffic features of the session, wherein the sequence pattern analysis method comprises a longest common subsequence mining algorithm and a frequent subsequence mining algorithm.
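As an illustration of the session grouping in step 3.1, the sketch below groups already-parsed packets by a direction-independent five-tuple and skips packets flagged as invalid. The packet dictionaries, their keys, and the "invalid" flag are assumptions; real packets would first have to be parsed from a capture file.

```python
from collections import defaultdict


def session_key(pkt):
    """Direction-independent five-tuple: both directions of a session share a key."""
    a = (pkt["src_ip"], pkt["src_port"])
    b = (pkt["dst_ip"], pkt["dst_port"])
    return (pkt["proto"],) + tuple(sorted([a, b]))


def group_sessions(packets):
    """Drop packets flagged as invalid and group the rest by five-tuple."""
    sessions = defaultdict(list)
    for pkt in packets:
        if pkt.get("invalid"):  # e.g. handshake, retransmission, bare ACK
            continue
        sessions[session_key(pkt)].append(pkt)
    return dict(sessions)
```

Sorting the two endpoints is one simple way to make requests and responses of the same session land in the same group.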
Step 4, extracting the ciphertext features of the application with an autoencoder (automatic encoding and decoding) neural network, based on the traffic samples, specifically comprising:
Step 4.1, after obtaining a traffic sample, for ciphertext traffic, merging the IP-layer packets that belong to the same session in order of sending time;
Step 4.2, converting the merged packet sequence into a multi-dimensional feature sequence, taking the feature sequence of any one dimension, and feeding it into the autoencoder neural network to obtain the ciphertext feature of that dimension, wherein the multi-dimensional feature sequence comprises a packet type feature sequence, a packet size feature sequence, and a packet interval feature sequence;
Step 4.3, performing the above operation on every dimension of the multi-dimensional feature sequence and combining the ciphertext features of all dimensions into the ciphertext features of the application.
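The conversion in steps 4.1 and 4.2 from a session to per-dimension feature sequences can be sketched as follows. The autoencoder itself is not shown. The packet keys ("time", "size", "type") and the integer encoding of the packet type are assumptions made for illustration.

```python
def to_feature_sequences(packets):
    """Turn a session into one sequence per feature dimension.

    Each packet is a dict with hypothetical keys: "time" (seconds),
    "size" (bytes) and "type" (a small integer code for the packet type).
    """
    # Step 4.1: order the packets of the session by sending time.
    ordered = sorted(packets, key=lambda p: p["time"])
    types = [p["type"] for p in ordered]
    sizes = [p["size"] for p in ordered]
    # Inter-packet gap; the first packet has no predecessor, so its gap is 0.
    intervals = [
        0.0 if i == 0 else ordered[i]["time"] - ordered[i - 1]["time"]
        for i in range(len(ordered))
    ]
    return {"type": types, "size": sizes, "interval": intervals}
```

Each of the three returned sequences would then be fed to the network separately, and the resulting per-dimension features combined as in step 4.3.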
Step 5, storing the application description features, plaintext features, and ciphertext features as graph data. Specifically, the graph structure can be stored in a Neo4j graph database, with the extracted application description features, plaintext features, and ciphertext features as nodes in the graph, each connected to the application node to which it belongs.
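The node-and-edge layout of step 5 can be illustrated with a small in-memory stand-in for the graph database. This is not Neo4j code; the class, its methods, and the example feature values are hypothetical and only show how feature nodes link to application nodes.

```python
from collections import defaultdict


class FeatureGraph:
    """Tiny in-memory stand-in for a graph database."""

    def __init__(self):
        self.app_features = defaultdict(set)  # application -> feature nodes
        self.feature_apps = defaultdict(set)  # feature node -> applications

    def add_feature(self, app, kind, value):
        """Create the feature node (kind, value) if needed and link it to app."""
        node = (kind, value)
        self.app_features[app].add(node)
        self.feature_apps[node].add(app)
        return node

    def apps_with(self, kind, value):
        """Applications linked to a given feature node."""
        return set(self.feature_apps.get((kind, value), ()))
```

Because a feature node can be linked to several applications, a lookup such as `apps_with` makes traffic shared between applications (for example, calls to the same third-party API) directly visible.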
Step 6, when identifying a segment of network traffic, accurately identifying the application traffic by fusing multiple features on the graph structure. Specifically, for the traffic to be identified, its ciphertext features are calculated; the traffic is then matched against the stored plaintext features, and its ciphertext features are compared with the stored ciphertext features. If the probability of one or more candidate nodes is higher than a set threshold, the application name of the traffic can be determined and information about the application is output, comprising at least the application version, the application type, and the application behavior corresponding to the traffic.
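A minimal sketch of the thresholded decision in step 6 follows. The text does not specify how the plaintext match and the ciphertext comparison are combined into a probability, so the weighted sum used here is purely an illustrative assumption, as are the profile layout, the parameter names, and the default values.

```python
import math


def _cosine(u, v):
    """Cosine similarity, 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def identify(observed_plain, observed_cipher, app_profiles,
             threshold=0.8, plain_weight=0.5):
    """Return {application: score} for every profile scoring above threshold.

    app_profiles maps an application name to {"plain": set of plaintext
    signatures, "cipher": reference ciphertext feature vector}. The weighted
    sum is an assumed fusion rule, not the one used by the invention.
    """
    results = {}
    for app, profile in app_profiles.items():
        signatures = profile["plain"]
        # Fraction of the profile's plaintext signatures seen in the traffic.
        plain_score = (
            len(signatures & observed_plain) / len(signatures) if signatures else 0.0
        )
        cipher_score = max(0.0, _cosine(observed_cipher, profile["cipher"]))
        score = plain_weight * plain_score + (1 - plain_weight) * cipher_score
        if score > threshold:
            results[app] = score
    return results
```

More than one application can exceed the threshold, which matches the wording "one or more candidate nodes" in the step.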
Compared with the prior art, the invention has the following beneficial effects:
1. Relationships between applications are established and traffic is analyzed across applications, so that identification accuracy is not severely degraded by calls to third-party vendors' APIs.
By extracting the traffic features of many applications and screening out traffic of the same type, the invention can remove third-party API call traffic.
2. Traffic triggering and collection are automated, avoiding the heavy manual effort that traditional methods spend on processing such data.
Using web crawler and automated testing technology, the invention automatically downloads the binary package of an application from an application store and captures and monitors packets in a suitable virtual environment. The application is operated inside the virtual machine by virtual clicks through automated testing technology, and its inbound and outbound traffic is monitored, so that traffic is collected automatically.
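The screening described under the first benefit above, where traffic features shared by several applications are treated as third-party API traffic and removed, can be sketched as follows. The cut-off of one application and the use of host names as features are assumptions made for illustration.

```python
from collections import Counter


def drop_shared_features(app_features, max_apps=1):
    """Keep only features seen in at most max_apps applications.

    app_features maps an application name to its set of traffic features.
    Features shared by more applications are treated as third-party API
    traffic; the cut-off value is an assumption.
    """
    counts = Counter(f for features in app_features.values() for f in features)
    return {
        app: {f for f in features if counts[f] <= max_apps}
        for app, features in app_features.items()
    }
```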
Drawings
FIG. 1 is an overall flow chart of the invention.
FIG. 2 is a flow chart of feature text extraction and application download.
FIG. 3 is a flow chart of automatic traffic triggering and collection.
FIG. 4 is a flow chart of traffic feature extraction.
FIG. 5 is a flow chart of information storage and retrieval.
FIG. 6 is a diagram of the data structure.
Detailed Description
The embodiments of the present invention will be described in detail below with reference to the drawings and examples.
As shown in FIG. 1, the application traffic identification and classification method based on multi-feature fusion analysis of the invention comprises the following steps:
Step 1, collecting the executable file of an application and its description information.
Specifically, referring to FIG. 2, web crawler technology may be used: after the user designates the applications to be collected, the application software is retrieved and downloaded from the application's official website and from Internet software libraries, and application-related web pages are collected from the same sources. The Internet software libraries include, but are not limited to, libraries such as an intelligent market, an Apple software store, and a warrior software park; the application-related web pages include, but are not limited to, the application description pages, download pages, and usage instruction pages of the official website and the Internet software libraries.
Step 2, extracting application description features and collecting application traffic samples; referring to FIG. 3, this specifically comprises:
Step 2.1, before collecting application traffic samples, analyzing the application and extracting application description features by a semantic analysis method, with the following specific steps:
Step 2.1.1, statically analyzing the downloaded binary package to obtain application description information.
The static analysis of the binary package comprises: calculating the file hash; determining the executable file type from the file header; recursively decompressing archive files; for APK files, scanning the AndroidManifest.xml file to obtain the declared version information; for ELF files, determining the running platform and dependency libraries from the ELF import table; for PE files, parsing the DOS header, NT header, section table, and specific sections in sequence; and discarding files of any other type and marking them as unavailable. The application description information obtained is a set of software version information, specifically comprising: software running platform information, software version information, and the software binary file.
Step 2.1.2, obtaining application description features by using the Word2Vector algorithm in combination with the collected application description information.
Specifically: the vector representations of the descriptive sentences in the application description information obtained above and in the application description information collected in step 1 are learned by the Word2Vector algorithm, the cosine similarity between the vector representations is calculated, and the 20 information vectors with the highest cosine similarity are retained as the software description features.
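The recursive decompression of archive files mentioned in step 2.1.1 might look like the following minimal sketch. It handles only ZIP containers, works entirely in memory, and caps the nesting depth; the function name, the path convention, and the depth limit are assumptions. Because an APK is itself a ZIP container, a real implementation would need to recognize APK files before unpacking them this way.

```python
import io
import zipfile


def extract_recursively(data: bytes, name="package", depth=0, max_depth=5):
    """Yield (path, bytes) for every non-archive file nested inside data.

    Only ZIP containers are unpacked here; max_depth guards against
    maliciously deep nesting. Both choices are assumptions.
    """
    if depth < max_depth and zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            for info in archive.infolist():
                if info.is_dir():
                    continue
                inner = archive.read(info)
                # An entry may itself be an archive, so recurse into it.
                yield from extract_recursively(
                    inner, f"{name}/{info.filename}", depth + 1, max_depth
                )
    else:
        yield name, data
```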
Step 2.2, when collecting application traffic samples, automatically triggering the application by means of virtualization and automated testing technology so as to obtain the traffic samples automatically. This step is implemented with a controlled module running inside a virtual machine and a control module running outside it; the controlled module is platform-dependent (with a different back end on each operating system platform). The specific implementation steps are as follows:
Step 2.2.1, according to the acquired information about the application's running environment, the control module searches a pre-prepared virtual machine snapshot library for a matching platform and starts the virtual machine corresponding to that platform, wherein the snapshots in the library include, but are not limited to, the Windows, Linux, iOS, and Android operating systems and emulators;
Step 2.2.2, the controlled module calls the corresponding operating system interface to install and start the application in the virtual machine, and at the same time notifies the control module to start capturing traffic at the IP layer;
Step 2.2.3, the control module, acting from outside through virtual user operations, selects a predefined trigger strategy and traverses the application to simulate user operations, wherein the trigger strategies include, but are not limited to, depth-first, breadth-first, and component triggering;
Step 2.2.4, once the traffic data collected in a single run exceeds a threshold, the control module shuts down the virtual machine and restores it to the initial snapshot;
Step 2.2.5, repeating the above steps several times to provide a sufficiently large data set for subsequent analysis.
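The cycle formed by steps 2.2.1 to 2.2.5 can be summarized as a loop. The sketch below assumes a hypothetical `vm` object that hides all virtualization details; its method names are invented for illustration and do not correspond to any particular hypervisor API.

```python
def collect_samples(vm, runs, byte_threshold):
    """Run the trigger-and-capture cycle `runs` times and return the captures.

    `vm` is any object offering start(), trigger_next(), shutdown() and
    restore_snapshot(); this interface is hypothetical. trigger_next()
    performs one simulated user action and returns the bytes captured.
    """
    samples = []
    for _ in range(runs):
        vm.start()
        captured = bytearray()
        # Keep triggering until the captured traffic exceeds the threshold.
        while len(captured) <= byte_threshold:
            chunk = vm.trigger_next()
            if not chunk:  # nothing left to trigger
                break
            captured.extend(chunk)
        vm.shutdown()
        vm.restore_snapshot()  # every run starts from the same clean state
        samples.append(bytes(captured))
    return samples
```

Restoring the snapshot after each run keeps the samples independent of one another, which is the purpose of step 2.2.4.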
Step 3, extracting the plaintext features of the application by a sequence pattern analysis method, based on the traffic samples and the application description features; referring to FIG. 4, this specifically comprises:
Step 3.1, after obtaining a traffic sample, dividing the input sequence of IP-layer packets into TCP traffic and non-TCP traffic according to the OSI layer-4 protocol header and discarding 'invalid' packets; for TCP traffic, grouping the IP packets that belong to the same layer-4 TCP session according to their five-tuple, and likewise grouping non-TCP traffic by five-tuple, wherein the 'invalid' packets include, but are not limited to, handshake packets, retransmitted packets, acknowledgement packets, and incomplete sessions;
Step 3.2, treating each session as a sequence, extracting the positions at which plaintext information appears in the session and the regularities of its appearance by a sequence pattern analysis method, and consolidating them into Snort rules as the plaintext traffic features of the session, wherein the sequence pattern analysis method includes, but is not limited to, a longest common subsequence mining algorithm, a frequent subsequence mining algorithm, and other sequence pattern analysis algorithms.
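The longest common subsequence algorithm named in step 3.2 can be illustrated on two payloads. This is the textbook dynamic-programming formulation, shown as a sketch of how a plaintext pattern shared by two sessions might be recovered; the generation of Snort rules from the result is not shown, and the function name is hypothetical.

```python
def longest_common_subsequence(a: bytes, b: bytes) -> bytes:
    """Classic dynamic-programming LCS over two byte strings."""
    rows, cols = len(a) + 1, len(b) + 1
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Walk back from the bottom-right corner to recover one LCS.
    out = bytearray()
    i, j = len(a), len(b)
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return bytes(reversed(out))
```

For two request lines that differ in a single byte, the result is the shared text with that byte removed, which is the kind of invariant a signature could be built from.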
Step 4, extracting the ciphertext features of the application with an autoencoder (automatic encoding and decoding) neural network, based on the traffic samples, specifically comprising:
Step 4.1, after obtaining a traffic sample, for ciphertext traffic, merging the IP-layer packets that belong to the same session in order of sending time;
Step 4.2, converting the merged packet sequence into a multi-dimensional feature sequence, taking the feature sequence of any one dimension, and feeding it into the autoencoder neural network to obtain the ciphertext feature of that dimension, wherein the multi-dimensional feature sequence includes, but is not limited to, a packet type feature sequence, a packet size feature sequence, and a packet interval feature sequence;
Step 4.3, performing the above operation on every dimension of the multi-dimensional feature sequence and combining the ciphertext features of all dimensions into the ciphertext features of the application.
Step 5, storing the application description features, plaintext features, and ciphertext features as graph data. Specifically, referring to FIG. 5, the graph structure may be stored in a Neo4j graph database, with the extracted application description features, plaintext features, and ciphertext features as nodes in the graph, each connected to the application node to which it belongs.
Step 6, when identifying a segment of network traffic, accurately identifying the application traffic by fusing multiple features on the graph structure. Specifically, referring to FIG. 6, for the traffic to be identified, its ciphertext features are calculated; the traffic is then matched against the plaintext features, and its ciphertext features are compared with the ciphertext features in the graph database. If the probability of one or more candidate nodes is higher than a set threshold, the application name of the traffic can be determined and information about the application is output, including, but not limited to, the application version, the application type, and the application behavior corresponding to the traffic.
Correspondingly, the invention also provides a traffic classification system based on multi-feature fusion identification, comprising four subsystems: a feature text extraction and application download subsystem based on web crawler technology and the Word2Vector algorithm; an automatic traffic triggering and collection subsystem based on virtualization and automated testing technology; a traffic feature extraction subsystem based on plaintext feature extraction and ciphertext feature learning; and an information storage and retrieval subsystem based on a graph structure. The technical details are as follows:
The feature text extraction and application download subsystem based on web crawler technology and the Word2Vector algorithm:
For a given application name, the web crawler retrieves application-related information and obtains download links from predefined application market websites. Specifically, the subsystem is divided into two modules: a feature text extraction module and an application download module.
The feature text extraction module performs the following processing:
1. Pages such as the application description page, download page, and usage instructions are loaded in a simulated browser; the visible text in each page is extracted and stored according to the paragraph of the page in which it appears.
2. Within each paragraph, the vector representations of the descriptive sentences are learned by the Word2Vector algorithm, yielding a group of software description semantic vectors.
The application download module performs the following processing:
1. A download link is matched on the download page according to a predefined application market site template.
2. A download request is issued through the obtained download link.
3. The downloaded binary package is statically analyzed according to the following procedure:
a. Its file hash is computed.
b. The type of the executable file is determined from the file header. Archive files are recursively decompressed. For APK files, the AndroidManifest.xml file is scanned to obtain the declared version information. For ELF files, the running platform (Linux or OSX) and dependency libraries are determined from the ELF import table. For PE files, the DOS header, NT header, section table, and specific sections are parsed in sequence. Files of any other type are discarded and marked as unavailable.
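Steps a and b of the static analysis above can be sketched with the standard library. The text only says "file hash", so the choice of SHA-256 is an assumption, as are the function names and the returned labels. The magic-number prefixes themselves are the well-known signatures of the ELF, PE (DOS header), ZIP, and gzip formats.

```python
import hashlib

# Magic-number prefixes for the file families named in the text.
# The labels are illustrative assumptions.
MAGIC_PREFIXES = [
    (b"\x7fELF", "elf"),
    (b"MZ", "pe"),            # DOS header at the start of a PE file
    (b"PK\x03\x04", "zip"),   # ZIP container; an APK is a ZIP with a manifest
    (b"\x1f\x8b", "gzip"),
]


def file_sha256(data: bytes) -> str:
    """Return the hex SHA-256 digest of the package (hash choice is assumed)."""
    return hashlib.sha256(data).hexdigest()


def classify_header(data: bytes, member_names=()) -> str:
    """Guess the file family from its leading bytes.

    member_names lists the archive entries, if already known; a ZIP
    containing AndroidManifest.xml is reported as an APK.
    """
    for prefix, label in MAGIC_PREFIXES:
        if data.startswith(prefix):
            if label == "zip" and "AndroidManifest.xml" in member_names:
                return "apk"
            return label
    return "unavailable"  # other types are discarded and marked unavailable
```

Parsing the ELF import table or the PE section table goes beyond this sketch and would normally rely on a dedicated parser.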
After the two modules have run, the subsystem has obtained the following information:
1. A group of software description semantic vectors
2. A set of software version information, comprising:
a. Software running platform information
b. Software version information
c. Software binary files
The automatic traffic triggering and collection subsystem based on virtualization and automated testing technology:
The subsystem uses a client/server (C/S) architecture and comprises two modules: a controlled module running inside the virtual machine (hereinafter the Client) and a control module running outside it.
The Client is platform-dependent (with a different back end on each operating system platform). After starting, it listens on a fixed TCP port for control signaling. On receiving control signaling, it calls the platform-specific API to read the requested information or perform the requested operation. The Client abstracts the complex platform-specific operations into general operation signaling such as "get interface layout information", "click a button", and "input text", and exposes these to the control module through a socket.
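The abstraction of platform operations into general signaling can be sketched as a small dispatcher. The JSON message format, the operation names, and the stub back end are all assumptions; the socket transport is left out so that the dispatch logic stands on its own.

```python
import json


class ClientBackend:
    """Platform-specific operations; this stub only records what was asked."""

    def __init__(self):
        self.log = []

    def get_layout(self):
        self.log.append("get_layout")
        return {"buttons": ["login"]}

    def click(self, target):
        self.log.append(f"click:{target}")
        return True

    def input_text(self, target, text):
        self.log.append(f"input:{target}:{text}")
        return True


def handle_signal(backend, raw: bytes) -> bytes:
    """Decode one JSON control message, run it, and encode the reply."""
    message = json.loads(raw.decode("utf-8"))
    handlers = {
        "get_layout": backend.get_layout,
        "click": backend.click,
        "input_text": backend.input_text,
    }
    handler = handlers.get(message.get("op"))
    if handler is None:
        reply = {"ok": False, "error": "unknown op"}
    else:
        reply = {"ok": True, "result": handler(**message.get("args", {}))}
    return json.dumps(reply).encode("utf-8")
```

A different operating system would supply its own back end with the same three methods, while the signaling seen by the control module stays unchanged.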
The specific working principle of the subsystem is as follows:
1. and searching the same platform in a pre-prepared virtual machine snapshot library according to the relevant information of the operating environment acquired in the characteristic character extraction and application download subsystem, and exiting and recording a log if no error is reported. If the virtual machine which is consistent with the platform is started on the basis of the existing snapshot.
2. The control module continuously attempts to connect the virtual machine IP and the predefined port through the socket until the connection is successful (i.e., the virtual machine successfully starts and the Client starts running).
3. The control module transmits the acquired binary file to the virtual machine over the established socket channel.
4. The Client checks whether the dynamic link dependencies that the application needs in order to run are satisfied (for ELF-type files) and downloads the relevant dependency packages from the distribution's software source to satisfy them.
5. The Client calls the corresponding operating system interface to install and start the application in the virtual machine, and at the same time notifies the control module, over the established socket channel, to start capturing traffic at the IP layer.
6. The Client monitors the user interface and reports its state in real time to a monitoring process outside the virtual machine.
7. Operating as a virtual user from outside the virtual machine, the control module selects a suitable predefined trigger strategy according to information such as the application category obtained from the characteristic character extraction and application download subsystem, and thereby simulates user operation. For example, a shopping App is triggered by traversing all of its user interfaces depth-first, while a communication application is triggered function point by function point.
8. Once the traffic data collected in a single run exceeds a threshold, the control module shuts down the virtual machine and restores it to the initial snapshot.
The above steps are repeated several times to provide a sufficiently large data set for subsequent analysis.
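As an illustration of the trigger strategies in step 7, the following minimal sketch orders the screens of an application for depth-first or breadth-first triggering. The `layout` dictionary (screen name to reachable screens) is a hypothetical simplification of the interface layout information reported by the Client; it is not a format defined by the patent.

```python
from collections import deque
from typing import Dict, List


def depth_first_order(layout: Dict[str, List[str]], root: str) -> List[str]:
    """Visit screens depth-first, following the first-listed link first."""
    order, seen, stack = [], set(), [root]
    while stack:
        screen = stack.pop()
        if screen in seen:
            continue
        seen.add(screen)
        order.append(screen)
        # Push children reversed so the first-listed child is popped first.
        stack.extend(reversed(layout.get(screen, [])))
    return order


def breadth_first_order(layout: Dict[str, List[str]], root: str) -> List[str]:
    """Visit screens level by level, starting from the root screen."""
    order, seen, queue = [], {root}, deque([root])
    while queue:
        screen = queue.popleft()
        order.append(screen)
        for child in layout.get(screen, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return order


# Hypothetical shopping-app layout used only for illustration.
layout = {
    "home": ["cart", "search"],
    "cart": ["checkout"],
    "search": ["item"],
    "item": ["cart"],
}
print(depth_first_order(layout, "home"))
print(breadth_first_order(layout, "home"))
```

A control module along these lines would turn each visited screen into "click" signaling sent to the Client; how screens map to concrete UI actions is left open here.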
The flow characteristic extraction subsystem based on plaintext characteristic extraction and ciphertext characteristic learning comprises:
An incoming sequence of IP layer packets is first split into TCP traffic and non-TCP traffic (UDP traffic and other traffic, such as ICMP) according to the OSI layer-4 protocol header.
For TCP traffic, the subsystem classifies packets according to TCP handshake information. IP packets that belong to the same layer-4 TCP session are grouped into one class (hereinafter, each such group of packets is simply called a "session"). Meaningless data packets, such as handshake packets, retransmission packets and acknowledgement packets, are discarded, as are incomplete sessions produced when capture ends. The traffic data collected by running the software multiple times is then analyzed as follows:
In each session, the positions at which plaintext information appears and the regular-expression features of that plaintext are extracted and integrated into Snort rules, which serve as the plaintext traffic features of the session. This approach extracts the plaintext features of the traffic samples more effectively.
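A minimal sketch of this idea follows. It assumes the plaintext feature is the longest byte string common to the payloads captured over several runs, found here by brute force; this is a simplification chosen for brevity, and the patent's own sequence pattern mining is not reproduced. The function names, the rule message and the `sid` value are hypothetical. The generated rule uses only the standard Snort options `msg`, `content` (in hex notation), `offset`, `depth`, `sid` and `rev`.

```python
from typing import List, Optional


def longest_common_substring(payloads: List[bytes]) -> bytes:
    """Brute-force search for the longest byte string present in every payload."""
    if not payloads:
        return b""
    base = min(payloads, key=len)
    for length in range(len(base), 0, -1):
        for start in range(len(base) - length + 1):
            candidate = base[start:start + length]
            if all(candidate in p for p in payloads):
                return candidate
    return b""


def common_offset(payloads: List[bytes], token: bytes) -> Optional[int]:
    """Return the token's position if it is identical in every payload."""
    offsets = {p.find(token) for p in payloads}
    return offsets.pop() if len(offsets) == 1 else None


def to_snort_rule(app: str, token: bytes, offset: Optional[int], sid: int) -> str:
    """Render the token as a Snort content rule, using hex byte notation."""
    hex_bytes = " ".join(f"{b:02X}" for b in token)
    options = [f'msg:"{app} plaintext feature"', f'content:"|{hex_bytes}|"']
    if offset is not None:
        # Pin the match to the observed position within the payload.
        options += [f"offset:{offset}", f"depth:{len(token)}"]
    options += [f"sid:{sid}", "rev:1"]
    return "alert tcp any any -> any any (" + "; ".join(options) + ";)"


payloads = [b"GET /api/v1/a HTTP/1.1", b"GET /api/v1/bb HTTP/1.1"]
token = longest_common_substring(payloads)
rule = to_snort_rule("demo-app", token, common_offset(payloads, token), 1000001)
print(rule)
```

Encoding the content in hex avoids having to escape characters that are special inside a Snort `content` string.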
For ciphertext traffic, the IP layer data packets that were assigned to the same session in the first step are combined in order of packet sending time. Feature vectors are then extracted from the sequences obtained over multiple acquisitions, based on two characteristics, packet length and packet frequency, and integrated into the ciphertext traffic features of the session. This approach extracts the ciphertext features of the traffic samples more effectively.
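The following sketch shows one possible way to turn a time-ordered packet sequence into a feature vector built from packet length and packet frequency. The particular statistics (mean, standard deviation, maximum length, packets per second) and the element-wise averaging across runs are assumptions made for illustration; they stand in for the feature learning that the claims assign to an autoencoder network.

```python
from statistics import mean, pstdev
from typing import List, Tuple

Packet = Tuple[float, int]  # (send time in seconds, length in bytes)


def ciphertext_features(packets: List[Packet]) -> List[float]:
    """Summarize one session by packet-length statistics and packet rate."""
    ordered = sorted(packets)  # order by send time
    lengths = [length for _, length in ordered]
    duration = ordered[-1][0] - ordered[0][0]
    # A zero-length capture has no meaningful rate, so report 0.0.
    rate = len(ordered) / duration if duration > 0 else 0.0
    return [float(mean(lengths)), float(pstdev(lengths)),
            float(max(lengths)), rate]


def average_features(runs: List[List[Packet]]) -> List[float]:
    """Average the per-run feature vectors element by element."""
    vectors = [ciphertext_features(run) for run in runs]
    return [mean(column) for column in zip(*vectors)]


run = [(1.0, 300), (0.0, 100), (2.0, 200)]
print(ciphertext_features(run))
```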
For all sessions in each application lifecycle, the subsystem integrates the feature information of all sessions and records the start time of the session (according to the connection initiation time), forming a session feature vector group.
For non-TCP traffic, the subsystem classifies packets by IP address and, for UDP traffic, by source/destination port number: traffic of the same kind (same target IP and, for UDP traffic, same source/destination port numbers) is treated as a "virtual session", so that the non-TCP traffic forms multiple virtual sessions. Each virtual session is processed in a similar way to a TCP session as described above, and a session feature vector group is likewise formed according to session initiation time.
Finally, the subsystem integrates the traffic feature information of the TCP sessions and of the virtual sessions formed from non-TCP traffic into a single vector sequence ordered by session initiation time. This vector sequence serves as the traffic feature vector group of the application, for retrieval and use in subsequent steps.
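The session grouping described in this section can be sketched as follows. The `Packet` record and its fields are hypothetical; a zero-length TCP payload is used as a simple proxy for handshake and pure acknowledgement packets, and the time of the first retained packet stands in for the connection initiation time. Incomplete-session detection is omitted.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Packet:
    time: float
    proto: str          # "TCP", "UDP", "ICMP", ...
    src: str
    sport: int          # 0 when the protocol has no ports
    dst: str
    dport: int
    payload_len: int
    retransmit: bool = False


def session_key(p: Packet) -> Tuple:
    """Direction-independent key: protocol plus the two sorted endpoints."""
    endpoints = sorted([(p.src, p.sport), (p.dst, p.dport)])
    return (p.proto, endpoints[0], endpoints[1])


def group_sessions(packets: List[Packet]) -> List[Tuple[Tuple, List[Packet]]]:
    """Group packets into (virtual) sessions ordered by start time."""
    sessions: Dict[Tuple, List[Packet]] = defaultdict(list)
    for p in packets:
        # Assumption: empty TCP segments are handshake or pure ACK packets.
        if p.proto == "TCP" and (p.payload_len == 0 or p.retransmit):
            continue
        sessions[session_key(p)].append(p)
    ordered = []
    for key, pkts in sessions.items():
        pkts.sort(key=lambda pkt: pkt.time)
        ordered.append((pkts[0].time, key, pkts))
    ordered.sort(key=lambda item: item[0])
    return [(key, pkts) for _, key, pkts in ordered]


packets = [
    Packet(0.0, "TCP", "10.0.0.2", 40000, "1.2.3.4", 443, 0),
    Packet(0.2, "TCP", "10.0.0.2", 40000, "1.2.3.4", 443, 120),
    Packet(0.3, "TCP", "1.2.3.4", 443, "10.0.0.2", 40000, 800),
    Packet(0.4, "TCP", "1.2.3.4", 443, "10.0.0.2", 40000, 800, retransmit=True),
    Packet(0.1, "UDP", "10.0.0.2", 5353, "8.8.8.8", 53, 40),
]
sessions = group_sessions(packets)
```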
The information storage and retrieval subsystem based on the graph structure comprises:
The subsystem comprises two modules: a graph storage module and a graph retrieval module.
For the storage module of the graph structure, the invention adopts the neo4j graph database to realize the storage of the graph structure. In order to establish the relationship between nodes, the invention adopts the following method to realize the fusion of the structural features of the graph, and the specific realization is as follows:
1. Take the collected application data as nodes in the graph, and link each node to the application node to which it belongs.
2. Traverse the nodes. For each node, compare the traffic features of each of its sessions with the traffic features of each session of the other applications; if the similarity is higher than a threshold, establish a calling relationship between the two applications.
3. Match the description semantic vector group against the description information of the predefined application types (such as the video class and the chat class) to obtain a normalized application type vector. The invention uses vectors to represent application tags, as follows:
$$\mathrm{type} = (x_1, x_2, \ldots, x_n)$$
where $0 \le x_i \le 1$ ($i \in [1, n]$) represents the correlation between the application and the corresponding type. For example, with video as the first dimension and chat as the second, the type vector of a pure video application is:
$$\mathrm{type} = (1, 0, \ldots, 0)$$
For a short-video application with social functions, the type vector is:
$$\mathrm{type} = (0.8, 0.2, \ldots, 0)$$
4. After matching, establish links between nodes of the same application type.
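A minimal sketch of how a normalized type vector could be produced is given below. It assumes that a raw similarity score between the application's description and each predefined type is already available; how those scores are computed is not shown, and the normalization rule (clip negatives, then divide by the sum) is an assumption rather than the patent's stated method.

```python
from typing import List


def type_vector(similarities: List[float]) -> List[float]:
    """Normalize raw per-type similarity scores so that they sum to 1."""
    clipped = [max(0.0, s) for s in similarities]  # ignore negative scores
    total = sum(clipped)
    if total == 0:
        return [0.0] * len(clipped)
    return [s / total for s in clipped]


# Hypothetical scores for the (video, chat, other) types.
print(type_vector([0.4, 0.1, 0.0]))
```

With these hypothetical scores the result corresponds to the short-video example, with a weight of 0.8 on video and 0.2 on chat.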
Based on the four steps above, the graph structure can be built from the data returned by the other related subsystems.
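The similarity-based linking of applications can be sketched in plain Python as below. Cosine similarity is an assumed choice, since the description only states that similarity is checked against a threshold; the application names and vectors are made up for illustration. In a deployment the resulting pairs would be written to the graph database as relationships, which is not shown here.

```python
import math
from typing import Dict, List, Set, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_edges(app_sessions: Dict[str, List[List[float]]],
                threshold: float) -> Set[Tuple[str, str]]:
    """Link two applications if any pair of their sessions is similar enough."""
    edges = set()
    apps = sorted(app_sessions)
    for i, a in enumerate(apps):
        for b in apps[i + 1:]:
            if any(cosine(u, v) > threshold
                   for u in app_sessions[a] for v in app_sessions[b]):
                edges.add((a, b))
    return edges


# Hypothetical per-session feature vectors for three applications.
app_sessions = {
    "video": [[1.0, 0.0]],
    "chat": [[0.0, 1.0]],
    "shortvideo": [[0.9, 0.1]],
}
edges = build_edges(app_sessions, threshold=0.95)
print(edges)
```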
The specific implementation of the retrieval module of the graph is as follows:
1. Preprocess the input information as follows:
a. Calculate the traffic feature vector group according to the method described above.
b. Calculate the description semantic vector group according to the method described above.
2. Match the two vector groups obtained against the corresponding nodes in the graph (traffic features against traffic features, description semantics against description semantics), and mark each entity according to its similarity.
3. Traverse all nodes in the graph. If the likelihood of one or more candidate nodes is higher than a threshold, the feature sample is judged to belong to a known application; the application name can then be determined and information related to the application can be output. If the likelihood is below the threshold, the sample is treated as new, unknown traffic. The application type entities are also traversed, so that the probability that the sample belongs to each known class can be determined.
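The threshold decision in step 3 can be sketched as follows. The `scores` dictionary (application name to match likelihood) and the threshold value are hypothetical inputs; how the likelihoods are derived from the traffic and description similarities is not specified here.

```python
from typing import Dict, List, Tuple


def identify(scores: Dict[str, float], threshold: float) -> Tuple[str, List[str]]:
    """Return ("known", ranked matches) or ("unknown", []) for a sample."""
    matches = [name for name, score in scores.items() if score > threshold]
    if not matches:
        return ("unknown", [])
    # Rank the candidates from most to least likely.
    matches.sort(key=lambda name: scores[name], reverse=True)
    return ("known", matches)


print(identify({"app_a": 0.91, "app_b": 0.35, "app_c": 0.88}, threshold=0.8))
print(identify({"app_a": 0.40, "app_b": 0.35}, threshold=0.8))
```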
In conclusion, the method collects the description features and traffic samples of an application through web crawling and automatic traffic triggering, extracts the plaintext and ciphertext features of the application, stores these features as graph data, and accurately identifies application traffic by fusing multiple features on the graph structure. The method comprises four parts: application and description information acquisition; automatic traffic triggering and collection; plaintext and ciphertext traffic feature extraction; and graph-based feature storage and retrieval. It provides corresponding identification methods for encrypted and unencrypted traffic and analyzes the network activity of an application at fine granularity. It addresses problems of traditional traffic identification methods that stem from their reliance on a single feature, such as high false-alarm rates and overly coarse identification of application behavior, and it provides a methodological basis and technical support for further work such as network resource scheduling, malicious application identification and protection, and user profiling.

Claims (8)

1. A multi-feature fusion analysis-based application flow identification and classification method is characterized by comprising the following steps:
step 1, acquiring an executable file of an application and description information thereof;
step 2, extracting application description characteristics and collecting application flow samples;
step 3, extracting plaintext characteristics of the application by adopting a sequence pattern analysis method based on the flow sample and the application description characteristics;
step 4, extracting the ciphertext characteristics of the application by using an automatic coding and decoding neural network based on the flow sample;
step 5, storing the application description characteristics, the plaintext characteristics and the ciphertext characteristics in a graph data form, wherein a neo4j graph database is adopted to realize graph structure storage, and the extracted application description characteristics, the plaintext characteristics and the ciphertext characteristics are used as nodes in a graph to establish a connection with the application nodes to which the application description characteristics, the plaintext characteristics and the ciphertext characteristics belong;
step 6, when identifying a segment of network traffic, accurately identifying the application traffic by fusing multiple features based on the graph structure; this comprises calculating the ciphertext features of the traffic to be identified, matching the traffic to be identified against the plaintext features, and comparing its ciphertext features with the ciphertext features in the graph database; if the likelihood of one or more candidate nodes is higher than a set threshold, the application name is determined and information related to the application program is output.
2. The method for identifying and classifying the application traffic based on the multi-feature fusion analysis according to claim 1, wherein in the step 1, a web crawler technology is adopted, after a user specifies a collected application, application software is retrieved and downloaded in an application official website and an internet software library, and application official websites, the internet software library and application-related webpages are collected, wherein the application-related webpages at least comprise an application description page, a download page and an application use description page of the application official website and the internet software library.
3. The method for identifying and classifying application traffic based on multi-feature fusion analysis according to claim 1, wherein the step 2 comprises:
step 2.1, before collecting application flow samples, analyzing applications and extracting application description features by adopting a semantic analysis method;
step 2.2, when collecting application flow samples, automatically triggering the application by adopting a virtualization technology and an automatic testing technology, and automatically acquiring the application flow samples.
4. The method for identifying and classifying application traffic based on multi-feature fusion analysis according to claim 3, wherein the step 2.1 comprises:
step 2.1.1, statically analyzing the downloaded binary package to obtain application description information;
step 2.1.2, obtaining application description characteristics by using a Word2Vector algorithm;
the step 2.2 is implemented by using a controlled module running in the virtual machine and a control module running outside, and the implementation steps are as follows:
step 2.2.1, the control module searches the same platform in a pre-prepared virtual machine snapshot library by combining the relevant information of the application running environment, and starts a virtual machine corresponding to the platform;
step 2.2.2, the controlled module calls a corresponding operating system interface to complete the installation and the starting of the application program in the virtual machine, and simultaneously informs the control module to start the flow capturing of an IP layer;
step 2.2.3, the control module selects a predefined trigger strategy and, operating as a virtual user from outside the virtual machine, simulates user operation to trigger the application;
step 2.2.4, after the flow data collected once exceeds a threshold value, the control module closes the virtual machine and restores the virtual machine to the initial snapshot;
step 2.2.5, repeat the above steps several times to provide a sufficiently large data set for subsequent analysis.
5. The method for identifying and classifying application traffic based on multi-feature fusion analysis according to claim 4, wherein the method of step 2.1.1 is: calculating the file hash; determining the executable file type from the file header; recursively decompressing archive-type files; scanning the AndroidManifest.xml file of APK-type files to obtain their declared version information; analyzing the running platform and dependency libraries of ELF-type files through the ELF import table; parsing, in sequence, the DOS header, the NT header, the section table and specific sections of PE files; and discarding files of other types and marking them as unavailable; the acquired application description information is a set of software version information, comprising: software running platform information, software version information and the software binary file;
the method of step 2.1.2 is: learning, through a Word2Vector algorithm, the vector representations of the descriptive sentences in the application description information and in the application description information collected in step 1, calculating the cosine similarity between the vector representations, and retaining the 20 information vectors with the highest cosine similarity to obtain the software description features;
the controlled module is platform-specific, with a different back end on each operating system platform; the snapshots in the virtual machine snapshot library comprise Windows, Linux, iOS and Android operating systems and simulators; the trigger strategies include depth-first, breadth-first and component triggering.
6. The method for identifying and classifying application traffic based on multi-feature fusion analysis according to claim 1, wherein the step 3 comprises:
step 3.1, after obtaining a flow sample, dividing the input IP layer data packet sequence into TCP traffic and non-TCP traffic according to the OSI layer-4 protocol header and discarding invalid data packets, wherein, for the TCP traffic, the IP packets belonging to one TCP session at the fourth layer are grouped according to the five-tuple, and the invalid data packets comprise handshake packets, retransmission packets, acknowledgement packets and incomplete sessions;
step 3.2, in each session, taking the session as a sequence, extracting the positions of the plaintext information in the session and the regular-expression features of the plaintext information by a sequence pattern analysis method, and integrating them into Snort rules as the plaintext traffic features of the session.
7. The method for identifying and classifying application traffic based on multi-feature fusion analysis according to claim 6, wherein the step 4 comprises:
step 4.1, after the flow sample is obtained, for ciphertext traffic, combining the IP layer data packets belonging to the same session in order of packet sending time;
step 4.2, converting the combined data packet sequence into a multidimensional feature sequence, taking any one dimension of the feature sequence, and feeding it into the automatic coding and decoding neural network to obtain the ciphertext feature of that dimension;
step 4.3, performing the above operation on each dimension of the multidimensional feature sequence, and integrating the ciphertext features of all dimensions into the ciphertext features of the application.
8. The method of claim 7, wherein the sequence pattern analysis method comprises a longest common subsequence mining algorithm and a frequent subsequence mining algorithm, and the multidimensional feature sequence comprises a packet type feature sequence, a packet size feature sequence and a packet interval feature sequence.
CN202110584098.XA 2021-05-27 2021-05-27 Application flow identification and classification method based on multi-feature fusion analysis Active CN113300977B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110584098.XA CN113300977B (en) 2021-05-27 2021-05-27 Application flow identification and classification method based on multi-feature fusion analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110584098.XA CN113300977B (en) 2021-05-27 2021-05-27 Application flow identification and classification method based on multi-feature fusion analysis

Publications (2)

Publication Number Publication Date
CN113300977A CN113300977A (en) 2021-08-24
CN113300977B true CN113300977B (en) 2022-10-21

Family

ID=77325605

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110584098.XA Active CN113300977B (en) 2021-05-27 2021-05-27 Application flow identification and classification method based on multi-feature fusion analysis

Country Status (1)

Country Link
CN (1) CN113300977B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114520774B (en) * 2021-12-28 2024-02-23 武汉虹旭信息技术有限责任公司 Deep message detection method and device based on intelligent contract
CN115086043B (en) * 2022-06-17 2023-03-21 电子科技大学 Encryption network flow classification and identification method based on minimum public subsequence

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103281213A (en) * 2013-04-18 2013-09-04 西安交通大学 Method for extracting, analyzing and searching network flow and content
CN108289093A (en) * 2017-12-29 2018-07-17 北京拓明科技有限公司 The construction method and structure system in App application condition codes library
CN108897739A (en) * 2018-07-20 2018-11-27 西安交通大学 A kind of intelligentized application traffic identification feature automatic mining method and system

Also Published As

Publication number Publication date
CN113300977A (en) 2021-08-24

Similar Documents

Publication Publication Date Title
CN109922052B (en) Malicious URL detection method combining multiple features
CN111866016B (en) Log analysis method and system
CN113300977B (en) Application flow identification and classification method based on multi-feature fusion analysis
CN111488577B (en) Model building method and risk assessment method and device based on artificial intelligence
CN110221977A (en) Website penetration test method based on ai
CN112333706B (en) Internet of things equipment anomaly detection method and device, computing equipment and storage medium
CN110768875A (en) Application identification method and system based on DNS learning
US20230252145A1 (en) Cyber threat information processing apparatus, cyber threat information processing method, and storage medium storing cyber threat information processing program
CN111447224A (en) Web vulnerability scanning method and vulnerability scanner
CN114528457A (en) Web fingerprint detection method and related equipment
Huang et al. Protocol reverse-engineering methods and tools: A survey
CN114090406A (en) Electric power Internet of things equipment behavior safety detection method, system, equipment and storage medium
CN115314268B (en) Malicious encryption traffic detection method and system based on traffic fingerprint and behavior
CN115314291A (en) Model training method and assembly, safety detection method and assembly
CN117130870B (en) Transparent request tracking and sampling method and device for Java architecture micro-service system
US20240054215A1 (en) Cyber threat information processing apparatus, cyber threat information processing method, and storage medium storing cyber threat information processing program
CN112436980A (en) Method, device and equipment for reading test data packet and storage medium
CN115051874B (en) Multi-feature CS malicious encrypted traffic detection method and system
CN113849810B (en) Identification method, device, equipment and storage medium for risk operation behavior
CN114329466A (en) Cross-site script vulnerability attack detection method and system
CN115392238A (en) Equipment identification method, device, equipment and readable storage medium
Said et al. Attention-based CNN-BiLSTM deep learning approach for network intrusion detection system in software defined networks
CN116502226B (en) Firmware simulation-based high-interaction Internet of things honeypot deployment method and system
CN117763547B (en) Malicious application program detection method and system and electronic equipment
CN117708813B (en) Security detection method and system for software development environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant