CN111835542B - Method for automatically extracting and checking application program characteristics - Google Patents

Info

Publication number
CN111835542B
CN111835542B CN201910317742.XA CN201910317742A CN111835542B CN 111835542 B CN111835542 B CN 111835542B CN 201910317742 A CN201910317742 A CN 201910317742A CN 111835542 B CN111835542 B CN 111835542B
Authority
CN
China
Prior art keywords
flow
uplink
application program
data
target application
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910317742.XA
Other languages
Chinese (zh)
Other versions
CN111835542A (en)
Inventor
刘亮
梁登高
郑荣锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University
Priority to CN201910317742.XA
Publication of CN111835542A
Application granted
Publication of CN111835542B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Environmental & Geological Engineering (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a method for automatically extracting and checking application program features, and relates to the technical field of computer network traffic analysis. The method comprises the following steps: 1) collecting the traffic generated by the same operation of a target application program on multiple devices logged in with different accounts; 2) grouping the data packets into streams, and keeping only the first stream generated by each operation; 3) separating the uplink and downlink data packets of each stream, and sorting and numbering the data packets in each direction; 4) gathering the packets that have the same sequence number in the same direction and comparing their character strings to obtain bidirectional character string features; 5) setting a matching rule that combines the bidirectional character string features, and checking the validity and the false alarm rate of the features. The invention can be used to identify applications that exhibit distinctive communication features during a particular activity.

Description

Method for automatically extracting and checking application program characteristics
Technical Field
The invention relates to the technical field of computer network traffic analysis.
Background
The rapid development of computer technology has steadily raised the level of informatization, and the number of application programs is growing quickly. Some applications generate data packets containing fixed strings under certain specific operations. This claim rests mainly on the following observations: 1) different application programs implement their logic differently, so distinctive features may appear in the communication process; 2) different application programs may add extra interactive behaviors according to business requirements; 3) most application programs use independently developed private protocols at the application layer, which have unique protocol features; 4) even if an application program uses encrypted communication, special information is exchanged during the handshake between the client and the server, which produces fixed character features at the traffic level.
From the above points, it can be concluded that applications in general (especially applications with unencrypted communication) have traffic features that differ from those of other applications. Features of a specific behavior of the target application can therefore be extracted and used to identify the communication behavior of the target application. At present, methods for extracting the traffic features of an application program generally fall into the following categories: training machine learning models on a large number of data packet features to build several prediction models; computing statistics such as the payload length of the traffic generated by an application program and the packets generated in a fixed time unit; extracting the server information that the client requests to access, that is, identifying the application program through keywords in the uplink message; and extracting the character strings common to all data packets of the same stream. These methods essentially analyze all the traffic generated by a target application program. They do not perform a targeted comparative analysis of the packets of a specific behavior, and they do not perform bidirectional feature matching that combines the features of uplink and downlink data packets. In view of this situation, the present invention provides a method for automatically extracting and checking application program features, which automatically extracts the character strings inherent in the uplink and downlink data packets from the data packets of a specific behavior of the target application program, so that the target application program can be identified by combining the bidirectional features.
Disclosure of Invention
The invention provides a method for automatically extracting and checking application program features. The method processes the set of traffic generated by the target application program under the same operation, and automatically extracts from that set the character string features of the target application program in the uplink and downlink data packets. By collecting a sufficient number of labeled data sets, the method uses an automated program to verify the validity and the false alarm rate of the features. Features that pass the check can be used to identify the target application program in a real environment. The specific technical scheme is as follows.
A method for automatically extracting and checking application program features is provided, and the method comprises the following steps:
A. collecting the traffic generated by the same operation of a target application program on multiple devices logged in with different accounts, to obtain a traffic set of the same operation of the target application;
B. extracting the first complete interactive stream of each traffic capture in the traffic set by using packet capturing software, to obtain an interactive stream set of the same operation of the target application;
C. separating the uplink and downlink data packets of each stream in the interactive stream set by using a protocol analyzer, and sorting and numbering the data packets of each stream in the two directions by using the same sorting rule, to finally obtain an uplink data packet set and a downlink data packet set with sequence numbers;
D. acquiring the application message data of all the packets in the uplink data packet set by using a protocol analyzer, and extracting the uplink features of the target application program from the message data; the downlink features of the target application program are obtained by the same method.
As a further optimization, the method further comprises the steps of:
E. building a feature library from the extracted uplink and downlink features of the target application program;
F. deploying the feature library into inspection equipment, so that inspection personnel can check the validity and the false alarm rate of the features in a large number of scenarios.
Drawings
To further clarify the objects, methods and features of the present invention, a more particular description of the invention will be rendered by reference to the appended drawings, in which:
FIG. 1 is a general flow chart showing a specific implementation of the method proposed by the present invention;
FIG. 2 is a flow chart illustrating data acquisition and preprocessing in the method proposed by the present invention;
FIG. 3 is a flow chart illustrating the processing of the collected traffic in the method proposed by the present invention;
FIG. 4 is a flow chart illustrating the processing of the streams in the method proposed by the present invention;
FIG. 5 is a flow chart illustrating feature extraction in the method proposed by the present invention;
FIG. 6 is an exemplary diagram illustrating the storage form of the extracted byte features in the method proposed by the present invention;
FIG. 7 is a block diagram illustrating the procedure for checking the bidirectional features in the method proposed by the present invention.
Detailed Description
The invention analyzes and extracts the traffic features of an application program under a specific operation, and mainly targets Ethernet data packets of a computer network. Before each analysis, a certain amount of traffic generated by the same operation of the target application program on different devices and under different accounts needs to be collected manually. The embodiments of the present invention are described in detail below with reference to the accompanying drawings. The invention provides a method for automatically extracting and checking application program identification features, which automatically extracts the uplink and downlink data packet features of a behavior from the data packets of that specific behavior of the target application program, so that the target application program can be identified through these features.
As shown in FIG. 1, the overall process of the present invention is divided into 5 steps. Each step is a processing unit, and the steps are executed in turn. The first step obtains the traffic generated by the same operation of the target application program on different devices and under different accounts, yielding a traffic set. The second step groups the data packets of each capture into streams and keeps only the first stream of each capture; its main purpose is to obtain data packets that contain the complete interactive process. The third step divides the data packets of each stream into uplink and downlink, and sorts and numbers the data packets in each direction using the same sorting rule. Its main purpose is to separate the client packets from the server packets; the numbering allows packets with the same sequence number in different streams to be brought together later for character string comparison. The fourth step compares the packets with the same sequence number in the same direction across all streams, obtains the set of character strings shared by the same-numbered packets in the uplink and downlink directions of the target application program, and selects the most representative features from that set as the uplink and downlink features of the target application program. The fifth step is indispensable: a large amount of data is captured in a wide range of environments and replayed on a test machine to check the validity of the features and to measure their false alarm rate.
FIG. 2 shows the first step in detail. First, several devices of the same type running the same operating system are needed (device type refers to mobile terminal, PC, and so on), in order to eliminate the influence of the device on the experiment. The same version of the target application is installed on every device, which eliminates the influence of application version differences. Different accounts are used to log in to the target application on different devices, to eliminate the influence of account information carried in the transmitted traffic (if the account itself is to be extracted from the data packets, the same account can be used to log in). Finally, the same operation, such as login or logout, is performed on the target application on every device. The data packets generated by the target application program are extracted from the packets captured on each device by IP filtering and stored in a separate file.
As shown in FIG. 3, the second step groups the collected traffic into streams. The four-tuple {srcIP, srcPort, dstIP, dstPort} is used as the grouping condition, the streams are separated in time order, and only the first stream in each file is finally retained.
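The stream grouping of the second step can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Packet` record, the `flow_key` helper and the `first_flow` function are hypothetical names, and treating both directions of one connection as the same stream (by ordering the two endpoints of the four-tuple) is an assumption that the text does not spell out.

```python
from typing import NamedTuple


class Packet(NamedTuple):
    """Hypothetical, already-parsed packet record."""
    ts: float        # capture timestamp in seconds
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    payload: bytes   # application-layer bytes


def flow_key(pkt):
    """Direction-independent four-tuple key (assumption: both directions
    of one connection belong to the same stream)."""
    a = (pkt.src_ip, pkt.src_port)
    b = (pkt.dst_ip, pkt.dst_port)
    return (a, b) if a <= b else (b, a)


def first_flow(packets):
    """Group packets by four-tuple in time order and keep the earliest stream."""
    flows = {}
    for pkt in sorted(packets, key=lambda p: p.ts):
        flows.setdefault(flow_key(pkt), []).append(pkt)
    if not flows:
        return []
    # dicts keep insertion order, so the first value is the stream
    # whose first packet was captured earliest
    return next(iter(flows.values()))
```

Calling `first_flow` on the packets of one capture file returns the time-ordered packets of the stream that started first; all later streams in the file are dropped.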
As shown in FIG. 4, the third step divides the stream data into uplink and downlink data. Uplink data refers to the data packets sent by the client, and downlink data refers to the data packets received by the client. The data packets of each stream are sorted and numbered separately in the two directions using the same sorting rule.
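A small sketch of the direction split and numbering is given here for illustration. It assumes the client IP address of the capture is known and that packets are numbered from 1 in capture-time order within each direction; the function name `split_and_number` and the tuple layout are hypothetical.

```python
def split_and_number(flow, client_ip):
    """Split one stream into uplink and downlink packets and number each
    direction from 1 in time order.

    flow: iterable of (timestamp, src_ip, payload) tuples.
    Returns (uplink, downlink), each a dict {sequence_number: payload}.
    """
    uplink, downlink = {}, {}
    for _ts, src_ip, payload in sorted(flow, key=lambda p: p[0]):
        # packets sent by the client are uplink, everything else is downlink
        target = uplink if src_ip == client_ip else downlink
        target[len(target) + 1] = payload
    return uplink, downlink
```

Because every stream is numbered with the same rule, packet number 1 of the uplink direction in one stream can later be compared with packet number 1 of the uplink direction in every other stream.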
As shown in FIG. 5, the fourth step extracts features from the data packet set of one direction (taking the uplink data packet set as an example) and finally obtains the features of the target application in that direction. The operation is divided into three substeps.
The first substep gathers all the packets with the same sequence number in the uplink data packet set.
The second substep extracts the feature strings from the packets with the same sequence number. To eliminate interference from replayed packets or other special cases in the data packets, a threshold (min_sup, with 0 < min_sup <= n) needs to be set when extracting features, and a character string can be used as a feature string only when its occurrence frequency exceeds the threshold. As shown in FIG. 5, the specific implementation is as follows:
(a) taking the application message of packet No. 1 in Dev_1_flow_up as one long character string, and extracting its substrings sub_str (each containing at least two characters) in turn;
(b) calculating the occurrence frequency of sub_str, namely the number of packets containing sub_str among the No. 1 packets of the remaining n-1 streams (Dev_2_flow_up, …, Dev_n_flow_up), recorded as m;
(c) when m is larger than min_sup and the occurrence frequency of every longer substring containing sub_str is smaller than min_sup, marking sub_str as a feature string; otherwise, sub_str is not a feature string;
(d) combining all the feature strings of packet No. 1 in Dev_1_flow_up with their offsets and occurrence frequencies to form a feature report;
(e) repeating steps (a), (b), (c) and (d) on the No. 1 packets of the remaining n-1 streams to generate a feature report for each packet, and selecting a representative feature report as the feature of packet No. 1;
(f) repeating operations (a), (b), (c), (d) and (e) for the packets with the remaining sequence numbers (2, 3, …) to obtain the feature reports of the packets of all sequence numbers.
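Steps (a) to (d) can be sketched for one reference packet as follows. This is a simplified illustration under stated assumptions, not the patented implementation: a substring counts as frequent when m >= min_sup (the text leaves the boundary case open), a substring is searched anywhere in the other packets, and maximality is checked by trying to extend the substring by one byte on either side within the reference packet. The names `count_support` and `feature_report` are hypothetical.

```python
def count_support(sub, others):
    """m: number of the other same-numbered packets that contain sub."""
    return sum(1 for payload in others if sub in payload)


def feature_report(reference, others, min_sup):
    """Return [(offset, substring, m)] for the maximal substrings of
    `reference` (length >= 2) found in at least `min_sup` of `others`."""
    def frequent(start, end):
        return count_support(reference[start:end], others) >= min_sup

    report = []
    n = len(reference)
    for start in range(n):
        for end in range(start + 2, n + 1):
            if not frequent(start, end):
                break  # longer substrings from this start cannot be frequent
            # a substring is kept only if no one-byte extension is frequent
            grows_left = start > 0 and frequent(start - 1, end)
            grows_right = end < n and frequent(start, end + 1)
            if not grows_left and not grows_right:
                sub = reference[start:end]
                report.append((start, sub, count_support(sub, others)))
    return report
```

Checking only one-byte extensions is sufficient because any packet that contains a longer string also contains all of its substrings, so support can only shrink as a substring grows. The brute-force enumeration is quadratic in the payload length, which is acceptable for a sketch but would likely need a suffix structure for large payloads.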
The third substep compares the features of the packets of each sequence number, selects the sequence number whose packets have the most numerous and most complete feature strings, and takes the features of that packet as the uplink (up) features.
FIG. 6 shows an example of the features derived in the fourth step. packet_1 indicates that the packet is the first packet in the corresponding direction of the stream, and "offset_feature" refers to a character string feature at a fixed offset in the packet. For example, the first entry, "0-3 (3): 16030102", means that the application-layer data of the packet at offsets 0 to 3 (i.e., the first four bytes) has the value "16030102" (in hexadecimal), and "(3)" means that three of the identically numbered packets that took part in the comparison contain this feature, i.e., its frequency.
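The storage form described for FIG. 6 can be reproduced with a short formatter. This is only a sketch of the textual layout quoted above; the function name `format_offset_feature` is hypothetical, and the end offset is assumed to be inclusive, as the "0-3 ... first four bytes" example suggests.

```python
def format_offset_feature(offset, value, frequency):
    """Render one feature as 'start-end (frequency): hex',
    where end is the inclusive offset of the last byte."""
    end = offset + len(value) - 1
    return f"{offset}-{end} ({frequency}): {value.hex()}"
```

For the four bytes 16 03 01 02 at offset 0, seen in three packets, this yields the entry quoted in the example.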
As shown in FIG. 7, the fifth step verifies the extracted features. First, new data packets of the target application are generated by the method of FIG. 1 and replayed directly to test the validity of the features. Second, the false alarm rate of the features is tested by collecting a large number of data packets that are generated by non-target applications and cover as many other kinds of applications as possible, and replaying these packets on the test machine.
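The bidirectional check can be sketched as below. The data layout is an assumption made for illustration: a stream is a dict with numbered "up" and "down" payloads, a rule is a pair (sequence number, list of (offset, bytes) features), and validity and false alarm rate are computed as simple match ratios. The function names are hypothetical, and replay on a test machine is replaced here by iterating over already-parsed streams.

```python
def packet_matches(payload, features):
    """True when every (offset, value) feature sits at its offset in payload."""
    return all(payload[off:off + len(val)] == val for off, val in features)


def flow_matches(flow, up_rule, down_rule):
    """A stream matches only when BOTH directions match their rule.

    flow: {"up": {seq: payload}, "down": {seq: payload}}
    rule: (seq, [(offset, value), ...])
    """
    up_seq, up_features = up_rule
    down_seq, down_features = down_rule
    up = flow["up"].get(up_seq)
    down = flow["down"].get(down_seq)
    return (up is not None and down is not None
            and packet_matches(up, up_features)
            and packet_matches(down, down_features))


def evaluate(target_flows, other_flows, up_rule, down_rule):
    """Return (validity, false_alarm_rate) as fractions of matched streams."""
    hits = sum(flow_matches(f, up_rule, down_rule) for f in target_flows)
    false_hits = sum(flow_matches(f, up_rule, down_rule) for f in other_flows)
    validity = hits / len(target_flows) if target_flows else 0.0
    false_alarm = false_hits / len(other_flows) if other_flows else 0.0
    return validity, false_alarm
```

Requiring both directions to match is what distinguishes this check from an uplink-only keyword match: a non-target stream that happens to share the uplink bytes is still rejected unless the server response also carries the downlink feature.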

Claims (5)

1. A method for automatically extracting and verifying features for identifying an application, comprising the steps of:
A. collecting the traffic generated by the same operation of a target application program on multiple devices logged in with different accounts, to obtain a traffic set of the same operation of the target application;
B. extracting the first complete interactive stream of each traffic capture in the traffic set by using packet capturing software, to obtain an interactive stream set of the same operation of the target application;
C. separating the uplink and downlink data packets of each stream in the interactive stream set by using a corresponding protocol analyzer, and sorting and numbering the data packets of each stream in the two directions by using the same sorting rule, to finally obtain an uplink data packet set and a downlink data packet set with sequence numbers;
D. acquiring the application message data of all packets in the uplink data packet set by using a corresponding protocol analyzer, extracting the uplink features of the target application program from the message data, and acquiring the downlink features of the target application program by the same method;
E. building a feature library from the extracted uplink and downlink features of the target application program;
F. deploying the feature library into inspection equipment, so that inspection personnel can check the validity and the false alarm rate of the features in testing.
2. The method according to claim 1, wherein in step B, extracting the first complete interactive stream of each traffic capture in the traffic set by using the packet capturing software to obtain an interactive stream set of the same operation of the target application specifically comprises:
grouping each traffic capture in the traffic set into streams by using the four-tuple {srcIP, srcPort, dstIP, dstPort}, wherein, when a capture contains a plurality of streams, only the first stream is kept, and, when the transport-layer protocol is TCP, incomplete streams must be discarded, a complete TCP stream comprising the TCP three-way handshake packets, the communication data packets and the packets exchanged at disconnection, so that the interactive stream set is finally obtained.
3. The method according to claim 1, wherein in step C, separating the uplink and downlink data packets of each stream in the interactive stream set by using the protocol analyzer, and sorting and numbering the data packets of each stream in the two directions by using the same sorting rule to obtain an uplink data packet set and a downlink data packet set with sequence numbers specifically comprises:
numbering the data packets in each of the two directions in time order, to finally obtain an uplink data packet set and a downlink data packet set with sequence numbers.
4. The method according to claim 1, wherein in step D, acquiring the application message data of all packets in the uplink data packet set by using the protocol analyzer, extracting the uplink features of the target application program from the message data, and acquiring the downlink features of the target application program by the same method specifically comprises:
putting together the data packets that have the same position in the ordering of each stream in the uplink data packet set to obtain sets of same-numbered uplink packets, extracting the content shared by the application messages of the same-numbered data packets to obtain the features of the packets of that sequence number, taking the features of the sequence number whose packets share the most content as the final uplink features, and performing the same operation on the downlink data packet set to obtain the downlink features.
5. The method according to claim 1, wherein in step F, deploying the feature library into inspection equipment so that inspection personnel can check the validity and the false alarm rate of the features in a test environment specifically comprises:
identifying the application program by combining the uplink and downlink features, a match being counted as successful only when both the uplink data packet and the downlink data packet of a stream match the uplink and downlink features of the application in the feature library, wherein replaying data packets that contain no traffic of the target application program checks the false alarm rate of the features, and collecting new traffic of the target application program according to step A and replaying it tests the validity of the features.
CN201910317742.XA 2019-04-19 2019-04-19 Method for automatically extracting and checking application program characteristics Active CN111835542B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910317742.XA CN111835542B (en) 2019-04-19 2019-04-19 Method for automatically extracting and checking application program characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910317742.XA CN111835542B (en) 2019-04-19 2019-04-19 Method for automatically extracting and checking application program characteristics

Publications (2)

Publication Number Publication Date
CN111835542A (en) 2020-10-27
CN111835542B (en) 2022-02-11

Family

ID=72911395

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910317742.XA Active CN111835542B (en) 2019-04-19 2019-04-19 Method for automatically extracting and checking application program characteristics

Country Status (1)

Country Link
CN (1) CN111835542B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101741908A (en) * 2009-12-25 2010-06-16 青岛朗讯科技通讯设备有限公司 Identification method for application layer protocol characteristic
CN106559281A (en) * 2015-09-29 2017-04-05 中国电信股份有限公司 Generate method and apparatus, virtual machine and the terminal for applying feature database
CN107707549A (en) * 2017-09-30 2018-02-16 迈普通信技术股份有限公司 A kind of device and method automatically extracted using feature
CN108234345A (en) * 2016-12-21 2018-06-29 中国移动通信集团湖北有限公司 A kind of traffic characteristic recognition methods of terminal network application, device and system
CN109190371A (en) * 2018-07-09 2019-01-11 四川大学 A kind of the Android malware detection method and technology of Behavior-based control figure
CN109194756A (en) * 2018-09-12 2019-01-11 网宿科技股份有限公司 Application features information extracting method and device
CN109327357A (en) * 2018-11-29 2019-02-12 杭州迪普科技股份有限公司 Feature extracting method, device and the electronic equipment of application software

Also Published As

Publication number Publication date
CN111835542A (en) 2020-10-27

Similar Documents

Publication Publication Date Title
CN110597734B (en) Fuzzy test case generation method suitable for industrial control private protocol
CN109117634B (en) Malicious software detection method and system based on network traffic multi-view fusion
CN111277578A (en) Encrypted flow analysis feature extraction method, system, storage medium and security device
CN107733851A (en) DNS tunnels Trojan detecting method based on communication behavior analysis
CN112953971B (en) Network security flow intrusion detection method and system
CN109861957A (en) A kind of the user behavior fining classification method and system of the privately owned cryptographic protocol of mobile application
CN104506484A (en) Proprietary protocol analysis and identification method
CN113206860B (en) DRDoS attack detection method based on machine learning and feature selection
CN108241580B (en) Client program testing method and terminal
CN110868409A (en) Passive operating system identification method and system based on TCP/IP protocol stack fingerprint
CN110245273B (en) Method for acquiring APP service feature library and corresponding device
CN110764980A (en) Log processing method and device
CN111222547B (en) Traffic feature extraction method and system for mobile application
CN108491717A (en) A kind of xss systems of defense and its implementation based on machine learning
CN108234345A (en) A kind of traffic characteristic recognition methods of terminal network application, device and system
CN110460611A (en) Full flow attack detecting technology based on machine learning
CN114244564A (en) Attack defense method, device, equipment and readable storage medium
CN107707549B (en) Device and method for automatically extracting application characteristics
CN110011860A (en) Android application and identification method based on network traffic analysis
CN110858837B (en) Network management and control method and device and electronic equipment
CN111835542B (en) Method for automatically extracting and checking application program characteristics
US20150150132A1 (en) Intrusion detection system false positive detection apparatus and method
CN108073803A (en) For detecting the method and device of malicious application
CN115941555B (en) APP personal information collection behavior detection method and system based on flow fingerprint
CN106101061A (en) The automatic classification method of rogue program and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant