CN107682225B - Method for automatically generating fine-grained network program function flow fingerprint - Google Patents

Publication number
CN107682225B
Authority
CN
China
Prior art keywords
segments
frequent
segment
fragments
occurrence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201710948442.2A
Other languages
Chinese (zh)
Other versions
CN107682225A (en)
Inventor
唐亚哲
李勋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2017-10-12
Filing date
2017-10-12
Publication date
2020-05-22

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00: Arrangements for monitoring or testing data switching networks
    • H04L43/04: Processing captured monitoring data, e.g. for logfile generation
    • H04L43/16: Threshold monitoring

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a method for automatically generating fine-grained network program function traffic fingerprints. It relates to the technical field of network communication, in particular to network service identification, and mainly aims to identify the different functional modules within a specific application. The invention treats the underlying traffic as long character strings, and computes and records the frequently occurring character segments together with their occurrence counts. The corresponding fingerprints are then obtained through subsequent intersection-removal, merging, and final fingerprint-purification operations. In this way, the invention can automatically and effectively produce fingerprints for network program functions, and the different functions of a network program can be identified by these fingerprints, so that the network state can be accurately controlled, network quality optimized, and valuable information mined.

Description

Method for automatically generating fine-grained network program function flow fingerprint
Technical Field
The invention belongs to the field of traffic service type identification in computer networks, and particularly relates to a method for automatically generating fine-grained network program function traffic fingerprints.
Background
With the rapid development of networks, new service applications keep appearing and the communication protocols they use are increasingly complex. In addition, malicious attacks on networks are increasing, so network traffic identification is very important: it lets network managers monitor and manage the various service flows in real time, and it helps network service providers understand the state of the various service flows when planning and building networks. In particular, fine-grained traffic identification can reveal how much users favor a given function within a piece of software and the corresponding user behavior. This provides a basis for later network program design and development, and allows the network to be measured and monitored at a deeper level, guiding developers toward a better user experience.
Currently, most research in the field of network traffic identification focuses on the classification of network protocols and applications. Many researchers have used variations of existing classification methods, or techniques that combine multiple classification methods, to improve on the efficiency and accuracy of previous studies. However, as networks develop rapidly, network traffic keeps growing in scale and its characteristics become more and more complex, so it is difficult to find a traffic classification method with 100% accuracy. Rather than improving accuracy by 1%-2% over existing classification algorithms, it is more meaningful to study the traffic classification problem at a new classification level. Fine-grained network program function traffic identification is such a new classification idea: it means identifying the different functional modules within a specific application. The key problem is therefore to generate fingerprints for the different functional modules of a specific application so that they can later be identified and distinguished. Fine-grained traffic identification has so far received little research attention, and most existing work relies on statistical traffic information, such as packet length, packet interval, and bit rate, as feature values. Such methods are strongly affected by network conditions, are sensitive to noise, and can hardly guarantee precision.
Disclosure of Invention
The invention aims to provide a method for automatically generating fine-grained network program function traffic fingerprints, so as to solve the problems described above.
To achieve this purpose, the invention adopts the following technical scheme:
A method for automatically generating fine-grained network program function traffic fingerprints comprises the following steps:
Step one: collect the corresponding traffic data packets for each program function;
Step two: treat the traffic data packets of each function as long character strings and divide each long string into segments of fixed length; if a segment's number of occurrences in the data packets is greater than or equal to a preset threshold, it is selected as a frequent segment;
Step three: after the frequent segments are obtained, delete the segments shared between the frequent segment sets of different program functions. This step has two stages: stage one, find the intersection among the frequent segment sets; stage two, delete the identical segments from the frequent segment sets of the different functions;
Step four: merge the segments within the frequent segment set of each program function, keeping each merged segment and its occurrence count and deleting the two shorter segments that were merged together with their occurrence counts; repeat this step until no further merge is possible, and take the resulting frequent segment set as a preliminary fingerprint;
Step five: repeat steps one to four several times, intersect the multiple preliminary fingerprints produced for the same function, and finally generate the final fingerprints of all functions of the program.
Further, "the same segment" in stage two of step three means that the characters of the two segments are identical.
Further, merging in step four requires two conditions: first, the two segments must be adjacent or have overlapping characters at their boundaries; second, the merge score must meet a preset threshold.
Further, in step five, when the same character segment has different occurrence counts, the final occurrence count is the minimum value.
Further, in step two, the combined information of each frequent segment and its occurrence count is recorded; segments are used as keys and occurrence counts as values, and they are finally combined into a set of key-value pairs.
Further, the segment merging method in step four comprises the following steps:
1) first, judge whether the segments to be merged are adjacent or have overlapping boundaries;
2) then decide whether to merge according to whether the occurrence count of the merged segment divided by the sum of the occurrence counts of the two segments before merging is less than or equal to 0.5; the merge is carried out only when the ratio is less than or equal to 0.5.
Compared with the prior art, the invention has the following technical effects:
The invention can automatically produce fingerprints corresponding to the different functions of an application program. This prepares for the next stage of fine-grained traffic identification, that is, identifying the traffic that corresponds to the operation of each application function. Traffic identification based on fine-grained fingerprints ultimately also provides information for upper-layer data analysis. With fine-grained fingerprints, it is possible to learn whether the different functions of a program were triggered and for how long they ran, which provides a necessary data source for subsequent user analysis and user-experience work.
The automatic fingerprint acquisition method greatly reduces manual involvement, lightens the labor burden, and improves accuracy and working efficiency; meanwhile, the method needs no prior information such as the traffic protocol format, and is insensitive to traffic noise.
Drawings
FIG. 1 is a schematic illustration of a specific embodiment;
FIG. 2 is a fine-grained function flow diagram;
FIG. 3 is a schematic block diagram of the process of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings:
Referring to FIGS. 1 to 3, a method for automatically generating fine-grained network program function traffic fingerprints includes the following steps:
Step one: collect the corresponding traffic data packets for each program function;
Step two: treat the traffic data packets of each function as long character strings and divide each long string into segments of fixed length; if a segment's number of occurrences in the data packets is greater than or equal to a preset threshold, it is selected as a frequent segment. The pseudocode of the processing algorithm for this step is given in the following figure:
[Figure: pseudocode for frequent-segment extraction; image not reproduced]
Step three: after the frequent segments are obtained, delete the segments shared between the frequent segment sets of different program functions. This step has two stages: stage one, find the intersection among the frequent segment sets; stage two, delete the identical segments from the frequent segment sets of the different functions;
Step four: merge the segments within the frequent segment set of each program function, keeping each merged segment and its occurrence count and deleting the two shorter segments that were merged together with their occurrence counts; repeat this step until no further merge is possible, and take the resulting frequent segment set as a preliminary fingerprint. The pseudocode of the processing algorithm for this step is given in the following figure:
[Figure: pseudocode for frequent-segment merging; image not reproduced]
Step five: repeat steps one to four several times, intersect the multiple preliminary fingerprints produced for the same function, and finally generate the final fingerprints of all functions of the program. The iteration for this step is given in the following figure:
[Figure: iteration formula for the fingerprint set; image not reproduced]
The process ends when new fingerprint set (i) = new fingerprint set (i-1).
"The same segment" in stage two of step three means that the characters of the two segments are identical.
Merging in step four requires two conditions: first, the two segments must be adjacent or have overlapping characters at their boundaries; second, the merge score must meet a preset threshold.
In step five, when the same character segment has different occurrence counts, the final occurrence count is the minimum value.
In step two, the combined information of each frequent segment and its occurrence count is recorded; segments are used as keys and occurrence counts as values, and they are finally combined into a set of key-value pairs.
The segment merging method in step four comprises the following steps:
1) first, judge whether the segments to be merged are adjacent or have overlapping boundaries;
2) then decide whether to merge according to whether the occurrence count of the merged segment divided by the sum of the occurrence counts of the two segments before merging is less than or equal to 0.5.
The present invention is described in more detail below with reference to FIGS. 1 and 2.
The general implementation of the invention is shown in FIG. 1; the whole process is divided into offline fingerprint generation and online identification.
The cooperation between the parts of the process is as follows:
1. Collect online traffic for offline use: run a given function of a program online while capturing packets, and store the collected traffic data offline.
2. Referring to FIG. 2, collect the traffic corresponding to the different functions of the same program separately; noise in the traffic data packets is allowed.
3. The above process can be repeated N times.
4. The data collected each time are processed as follows:
Referring to FIG. 3, the data packets in the traffic data of each function are split into segments and the frequent segments are found. The packet data is treated as long character strings, which are first divided into fixed-length segments. For example, given the string 'xjtuabc' and a segment length of 2, the string is divided into {xj, jt, tu, ua, ab, bc}. If the occurrence count of a segment is greater than or equal to a preset threshold, the segment and its occurrence count are recorded, yielding a set of key-value pairs for each fine-grained function. According to experimental analysis, the method works best with a segment size of 4 and an occurrence threshold of 6.
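The segmentation and counting just described can be sketched in Python as follows. This is a minimal illustration, not the patent's own pseudocode: the function names `split_segments` and `frequent_segments` are hypothetical, payloads are assumed to be already decoded to strings, and occurrences are assumed to be counted across all packets collected for one function.

```python
from collections import Counter


def split_segments(payload, seg_len):
    """Slide a window of seg_len characters over payload, one character at a time."""
    return [payload[i:i + seg_len] for i in range(len(payload) - seg_len + 1)]


def frequent_segments(packets, seg_len=4, threshold=6):
    """Return {segment: count} for segments seen at least `threshold` times.

    Assumption: counts are accumulated over every packet of one program function.
    """
    counts = Counter()
    for payload in packets:
        counts.update(split_segments(payload, seg_len))
    return {seg: n for seg, n in counts.items() if n >= threshold}


print(split_segments("xjtuabc", 2))
# ['xj', 'jt', 'tu', 'ua', 'ab', 'bc']
```

The defaults of 4 and 6 simply mirror the values the text reports as working best; both would normally be tuned per data set.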
5. Referring to FIG. 3, the key-value pairs corresponding to all functions of a program undergo intersection removal: segments that are the same across the frequent segment sets of different program functions are removed. Here "the same segment" means that the keys of the two segments are identical; the corresponding occurrence counts (values) need not be the same. For example, suppose a program has four functions with the following frequent segment sets:
Function 1: {HTTP1.1, 8; server, 9; 0x00, 24; t%3B%, 17}
Function 2: {HTTP1.1, 13; 0x02000, 11; B7A3F8A2, 16; t%3B%, 17}
Function 3: {HTTP1.1, 8; server, 13; XX1pZ, 17; jS5gH, 16; D-DtH, 6; AC4DF, 5}
Function 4: {HTTP1.1, 8; server, 9; 0x02, 14; JSSES, 4; XX1pZ, 17}
After this step, the resulting sets should be:
Function 1: {0x00, 24}
Function 2: {0x02000, 11; B7A3F8A2, 16}
Function 3: {jS5gH, 16; D-DtH, 6; AC4DF, 5}
Function 4: {0x02, 14; JSSES, 4}
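The intersection-removal step can be sketched as below, using the four example sets from the text. The function name `remove_shared_segments` and the dictionary layout are assumptions made for illustration; the only rule taken from the text is that a key appearing in more than one function's set is deleted everywhere, regardless of its counts.

```python
from collections import Counter


def remove_shared_segments(per_function):
    """Drop every segment key that appears in more than one function's set."""
    owners = Counter()
    for segments in per_function.values():
        owners.update(segments.keys())  # each key is counted once per function
    return {
        name: {seg: n for seg, n in segments.items() if owners[seg] == 1}
        for name, segments in per_function.items()
    }


sets = {
    "Function 1": {"HTTP1.1": 8, "server": 9, "0x00": 24, "t%3B%": 17},
    "Function 2": {"HTTP1.1": 13, "0x02000": 11, "B7A3F8A2": 16, "t%3B%": 17},
    "Function 3": {"HTTP1.1": 8, "server": 13, "XX1pZ": 17, "jS5gH": 16,
                   "D-DtH": 6, "AC4DF": 5},
    "Function 4": {"HTTP1.1": 8, "server": 9, "0x02": 14, "JSSES": 4, "XX1pZ": 17},
}
unique = remove_shared_segments(sets)
print(unique["Function 1"])
# {'0x00': 24}
```

Keys are compared as exact strings, so '0x00', '0x02' and '0x02000' are treated as three distinct segments, as in the example output above.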
6. Referring to FIG. 3, using the frequent segment sets obtained in the previous step directly to distinguish different program functions would take too much time, because the frequent segment set of a function may contain a large number of redundant segments. For example, if a data packet contains the string 'xjtu_homepage' and the segment size is 4, 10 segments are obtained: 'xjtu', 'jtu_', 'tu_h', 'u_ho', '_hom', 'home', 'omep', 'mepa', 'epag' and 'page'. These segments can be merged into longer ones: for instance, if 'xjtu' occurs 7 times and 'jtu_' also occurs 7 times, then 'xjtu_' in fact occurs 7 times as well. One long segment can thus replace the original two, which saves search time in the matching stage.
7. Referring to FIG. 3, when merging frequent segments, first judge whether one segment completely contains another; if so, keep the longer segment and delete the contained one. If not, check whether the two segments are adjacent or overlap at their boundaries. If they do, judge whether the merge score meets the preset threshold; if it does, keep the key-value pair of the merged segment and its occurrence count, and delete the two segments that were merged. The merge score is the ratio of the number of occurrences of the merged segment in the traffic to the sum of the numbers of occurrences of the two pre-merge segments in the traffic; the method checks whether this ratio is less than or equal to 0.5.
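A single merge attempt, as described in items 6 and 7, might look like the sketch below. It is an illustration under stated assumptions rather than the patent's pseudocode: the names `count_occurrences`, `overlap_merge` and `try_merge` are hypothetical, the traffic is assumed to be available as one string so the merged segment can be recounted, only boundary overlap and containment are handled (pure adjacency would need position information), and a merged segment that never occurs in the traffic is rejected. The 0.5 comparison follows the text as written.

```python
def count_occurrences(traffic, segment):
    """Count possibly overlapping occurrences of segment in traffic."""
    return sum(
        1 for i in range(len(traffic) - len(segment) + 1)
        if traffic.startswith(segment, i)
    )


def overlap_merge(a, b):
    """Join a and b when a suffix of a equals a prefix of b; otherwise return None."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a[-k:] == b[:k]:
            return a + b[k:]
    return None


def try_merge(segments, a, b, traffic, max_score=0.5):
    """Return a new segment dict with a and b merged, or None if no merge applies."""
    if a == b:
        return None
    result = dict(segments)
    if b in a or a in b:
        # Containment: keep the longer segment, delete the contained one.
        result.pop(b if b in a else a)
        return result
    merged = overlap_merge(a, b)
    if merged is None:
        return None
    merged_count = count_occurrences(traffic, merged)
    if merged_count == 0:  # assumption: never keep a segment absent from the traffic
        return None
    score = merged_count / (segments[a] + segments[b])
    if score > max_score:
        return None
    del result[a], result[b]
    result[merged] = merged_count
    return result


traffic = "xjtu_homepage;" * 7
print(try_merge({"xjtu": 7, "jtu_": 7}, "xjtu", "jtu_", traffic))
# {'xjtu_': 7}
```

`try_merge` only tests the order "a followed by b", so a caller would try both orders for each pair and repeat until no call returns a new dictionary, which corresponds to looping step four until no merge is possible.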
8. Referring to FIG. 3, the steps above produce the frequent segment set for each function. To counter the influence of noisy data packets in the traffic on the final result, the intersection of the frequent segment sets produced for the same function over N runs is computed next, which removes the noise and yields the final fine-grained fingerprint. In the invention, the intersection is defined as follows: for frequent segments whose keys are the same but whose values differ, the final value is the smallest one. For example, intersecting (HTTP, 8) and (HTTP, 6) gives (HTTP, 6). An example of intersecting three frequent segment sets of one function of a program follows. The three frequent segment sets are:
Function 1-1: {HTTP1.1, 8; server, 9; 0x02, 14; B7A3F8A2, 16; t%3B%, 17}
Function 1-2: {HTTP1.1, 7; server, 9; 0x02, 14; c7A8K3C7, 14; t%3B%, 17}
Function 1-3: {HTTP1.1, 8; server, 7; 0x02, 14; 37B2B7C6, 7; t%3B%, 17}
After this module, the final fingerprint of the function is:
Function 1 fingerprint: {HTTP1.1, 7; server, 7; 0x02, 14; t%3B%, 17}.
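The minimum-value intersection defined above can be sketched as follows, reusing the three example runs. The names `intersect_min` and `final_fingerprint` are hypothetical, and the sketch assumes each run's preliminary fingerprint is held as a dictionary mapping segment to occurrence count.

```python
from functools import reduce


def intersect_min(a, b):
    """Keep keys present in both dicts; where counts differ, keep the smaller."""
    return {seg: min(a[seg], b[seg]) for seg in a.keys() & b.keys()}


def final_fingerprint(runs):
    """Intersect the preliminary fingerprints from all runs of one function."""
    return reduce(intersect_min, runs)  # runs must be non-empty


runs = [
    {"HTTP1.1": 8, "server": 9, "0x02": 14, "B7A3F8A2": 16, "t%3B%": 17},
    {"HTTP1.1": 7, "server": 9, "0x02": 14, "c7A8K3C7": 14, "t%3B%": 17},
    {"HTTP1.1": 8, "server": 7, "0x02": 14, "37B2B7C6": 7, "t%3B%": 17},
]
fingerprint = final_fingerprint(runs)
```

Segments that appear in only one run, such as the noise keys in the example, drop out because they are missing from at least one other dictionary.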
As described above, the invention can automatically generate fine-grained program function fingerprints, which facilitates their later use in identification.
The method is not limited to network function identification; it can also generate fingerprints for the text of different traffic data packets as the situation requires.
Although specific embodiments of, and examples for, the invention are disclosed in the accompanying drawings for illustrative purposes and to aid in the understanding of the contents of the invention and the manner in which the same may be practiced, those skilled in the art will understand that: various substitutions, changes and modifications are possible without departing from the spirit and scope of the present invention and the appended claims. The present invention should not be limited to the disclosure of the embodiments and drawings described in the specification, and the scope of the present invention is defined by the scope of the claims.

Claims (4)

1. A method for automatically generating fine-grained network program function traffic fingerprints, characterized by comprising the following steps:
step one: collecting the corresponding traffic data packets for each program function;
step two: treating the traffic data packets of each function as long character strings and dividing each long character string into segments of fixed length; if the number of occurrences of a segment in the data packets is greater than or equal to a preset threshold, the segment is selected as a frequent segment;
step three: after the frequent segments are obtained, deleting the segments shared between the frequent segment sets of different program functions, this step comprising two stages: stage one, finding the intersection among the frequent segment sets; stage two, deleting the identical segments from the frequent segment sets corresponding to the different functions;
step four: merging the segments in the frequent segment set corresponding to each program function, keeping each merged segment and its occurrence count, then deleting the two shorter segments that were merged and their occurrence counts, repeating this step until no further merge is possible, and finally taking the frequent segment set as a preliminary fingerprint; merging requires two conditions: first, the two segments must be adjacent or have overlapping characters at their boundaries; second, the merge score must meet a preset threshold;
when the frequent segments are merged, it is first judged whether one segment completely contains another; if so, the longer segment is kept and the contained segment is deleted; if not, it is checked whether the two segments are adjacent or overlap at their boundaries; if this condition is met, it is judged whether the merge score meets the preset threshold, and if so, the key-value pair of the merged segment and its occurrence count is kept and the two segments before merging are deleted; the merge score is the ratio of the number of occurrences of the merged segment in the traffic to the sum of the numbers of occurrences of the two pre-merge segments in the traffic, and it is judged whether this ratio is less than or equal to 0.5;
step five: repeating steps one to four several times, intersecting the multiple preliminary fingerprints produced for the same function, and finally generating the final fingerprints of all functions of the program.
2. The method for automatically generating fine-grained network program function traffic fingerprints according to claim 1, characterized in that "the same segment" in stage two of step three means that the characters of the two segments are identical.
3. The method for automatically generating fine-grained network program function traffic fingerprints according to claim 1, characterized in that, in step five, when the same character segment has different occurrence counts, the final occurrence count is the minimum value.
4. The method for automatically generating fine-grained network program function traffic fingerprints according to claim 1, characterized in that, in step two, the combined information of each frequent segment and its occurrence count is recorded; the segments are used as keys and the occurrence counts as values, and they are finally combined into a set of key-value pairs.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710948442.2A CN107682225B (en) 2017-10-12 2017-10-12 Method for automatically generating fine-grained network program function flow fingerprint

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710948442.2A CN107682225B (en) 2017-10-12 2017-10-12 Method for automatically generating fine-grained network program function flow fingerprint

Publications (2)

Publication Number Publication Date
CN107682225A CN107682225A (en) 2018-02-09
CN107682225B true CN107682225B (en) 2020-05-22

Family

ID=61139990

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710948442.2A Expired - Fee Related CN107682225B (en) 2017-10-12 2017-10-12 Method for automatically generating fine-grained network program function flow fingerprint

Country Status (1)

Country Link
CN (1) CN107682225B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104468273A (en) * 2014-12-12 2015-03-25 北京百度网讯科技有限公司 Method and system for recognizing application type of flow data
CN106452868A (en) * 2016-10-12 2017-02-22 中国电子科技集团公司第三十研究所 Network traffic statistics implement method supporting multi-dimensional aggregation classification

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11087006B2 (en) * 2014-06-30 2021-08-10 Nicira, Inc. Method and apparatus for encrypting messages based on encryption group association

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
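SktTracer: Towards Fine-Grained Identification for Skype Traffic via Sequence Signatures; Zhenlong Yuan et al.; 2014 International Conference on Computing, Networking and Communications, Communications and Information Security Symposium; 20140203; entire document *
Fine-grained network application behavior identification based on the Tilera platform (基于Tilera平台的网络细粒度应用行为识别); Wu Shun (吴舜) et al.; Telecommunications Science (《电信科学》); 20131120; entire document *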

Also Published As

Publication number Publication date
CN107682225A (en) 2018-02-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200522