CN107682225B - Method for automatically generating fine-grained network program function flow fingerprint - Google Patents
Method for automatically generating fine-grained network program function flow fingerprint
- Publication number
- CN107682225B
- Authority
- CN
- China
- Prior art keywords
- segments
- frequent
- segment
- fragments
- occurrence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/04—Processing captured monitoring data, e.g. for logfile generation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/16—Threshold monitoring
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Data Mining & Analysis (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
The invention discloses a method for automatically generating fine-grained network program function traffic fingerprints. It relates to the technical field of network communication, in particular to network service identification, and mainly aims to identify the different functional modules within a specific application. The invention treats the underlying traffic as long character strings, and counts and records the frequently occurring character segments together with their occurrence counts. The fingerprints are then obtained through a subsequent intersection-deletion operation, a merging operation, and a final fingerprint purification operation. In this way, the invention can automatically and effectively produce fingerprints for network program functions. The fingerprints allow the different functions of a network program to be identified, so that the network state can be controlled accurately, network quality can be optimized, and valuable information can be mined.
Description
Technical Field
The invention belongs to the field of identifying service types in computer network traffic, and particularly relates to a method for automatically generating fine-grained network program function traffic fingerprints.
Background
With the rapid development of networks, new service applications keep appearing and the communication protocols they use are increasingly complex. Malicious attacks on networks are also increasing. Network traffic identification is therefore very important: it allows network administrators to monitor and manage the various service flows in real time, and it helps network service providers understand the state of the various service flows when planning and building networks. Fine-grained traffic identification in particular reveals how much users favor a given function within a piece of software and the corresponding user behavior. This provides a basis for later network program design and development and allows the network to be measured and monitored in depth, guiding developers toward a better user experience.
Currently, most research in the field of network traffic identification focuses on classifying network protocols and applications. Many researchers have used variants of existing classification methods, or techniques that combine several classification methods, to improve on the efficiency and accuracy of previous studies. However, as networks develop rapidly, traffic volume keeps growing and traffic characteristics become more and more complex, so it is difficult to find a traffic classification method with 100% accuracy. Compared with raising the accuracy of existing classification algorithms by 1%-2%, it is more meaningful to study the traffic classification problem at a new classification level. Fine-grained network program function traffic identification is such a new classification idea: it identifies the different functional modules within a particular application. Its key problem is therefore to generate fingerprints for the different functional modules of a specific application, so that they can later be identified and distinguished. Fine-grained traffic identification has so far received little research attention, and most existing work relies on statistical traffic information, for example using packet length, packet interval, and bit rate as feature values. Such methods are strongly affected by network conditions, are sensitive to noise, and cannot easily guarantee precision.
Disclosure of Invention
The invention aims to provide a method for automatically generating fine-grained network program function traffic fingerprints, so as to solve the above problems.
In order to achieve the purpose, the invention adopts the following technical scheme:
A method for automatically generating fine-grained network program function traffic fingerprints comprises the following steps:
Step one: collect the corresponding traffic data packets for each program function;
Step two: treat the traffic data packets of each function as long character strings and divide each long string into fixed-length segments; if a segment occurs in the data packets a number of times greater than or equal to a preset threshold, select it as a frequent segment;
Step three: after obtaining the frequent segments, delete the intersection between the frequent segment sets of the different program functions. This step comprises two stages: stage one, find the intersection among the frequent segment sets; stage two, delete the shared segments from the frequent segment sets of the different functions;
Step four: merge the segments in the frequent segment set of each program function: keep the merged segment and its occurrence count, then delete the two shorter segments and their occurrence counts. Repeat this step until no further merging is possible; the resulting frequent segment set serves as the preliminary fingerprint;
Step five: repeat steps one to four N times, intersect the multiple preliminary fingerprints produced for the same function, and finally generate the final fingerprint of every function of the program.
Further, a "same segment" in stage two of step three means two segments whose characters are identical.
Further, merging in step four requires two conditions: first, the two segments must be adjacent or overlap at their boundaries (the tail of one matches the head of the other); second, the merge score must meet a preset threshold.
Further, in step five, when the same character segment has different occurrence counts, the final occurrence count is the minimum of those values.
Further, in step two, each frequent segment is recorded together with its occurrence count: the segment serves as the key and the occurrence count as the value, and the pairs are collected into a set of key-value pairs.
Further, the segment merging method in step four comprises the following steps:
1) first, judge whether the segments to be merged are adjacent or overlap at their boundaries;
2) then decide whether to merge according to the ratio of the occurrence count of the merged segment to the sum of the occurrence counts of the two segments before merging; the merge is carried out only when this ratio is less than or equal to 0.5.
Compared with the prior art, the invention has the following technical effects:
The invention can automatically produce fingerprints corresponding to the different functions of an application program. This prepares for the subsequent fine-grained traffic identification, that is, identifying the traffic that corresponds to the operation of a given application function. Traffic identification based on fine-grained fingerprints in turn provides information for upper-layer data analysis. Using the fine-grained fingerprints, it is possible to determine whether the different functions of a program were triggered and for how long, which provides a necessary data source for subsequent user analysis and user experience work.
The automatic fingerprint generation method greatly reduces manual involvement, lightens the labor burden, and improves accuracy and working efficiency. The method also requires no prior information, such as the traffic protocol format, and is insensitive to traffic noise.
Drawings
FIG. 1 is a schematic illustration of a specific embodiment;
FIG. 2 is a fine-grained function traffic diagram;
FIG. 3 is a schematic block diagram of the method of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings:
Referring to figs. 1 to 3, a method for automatically generating fine-grained network program function traffic fingerprints includes the following steps:
Step one: collect the corresponding traffic data packets for each program function;
Step two: treat the traffic data packets of each function as long character strings and divide each long string into fixed-length segments; if a segment occurs in the data packets a number of times greater than or equal to a preset threshold, select it as a frequent segment; specifically, the pseudo code of the processing algorithm in this step is expressed as follows:
Step three: after obtaining the frequent segments, delete the intersection between the frequent segment sets of the different program functions. This step comprises two stages: stage one, find the intersection among the frequent segment sets; stage two, delete the shared segments from the frequent segment sets of the different functions;
Step four: merge the segments in the frequent segment set of each program function: keep the merged segment and its occurrence count, then delete the two shorter segments and their occurrence counts. Repeat this step until no further merging is possible; the resulting frequent segment set serves as the preliminary fingerprint; specifically, the pseudo code of the processing algorithm in this step is expressed as follows:
Step five: repeat steps one to four N times, intersect the multiple preliminary fingerprints produced for the same function, and finally generate the final fingerprint of every function of the program. The iteration in this step can be expressed as follows:
The process ends when new fingerprint set(i) = new fingerprint set(i-1).
A "same segment" in stage two of step three means two segments whose characters are identical.
Merging in step four requires two conditions: first, the two segments must be adjacent or overlap at their boundaries (the tail of one matches the head of the other); second, the merge score must meet a preset threshold.
In step five, when the same character segment has different occurrence counts, the final occurrence count is the minimum of those values.
In step two, each frequent segment is recorded together with its occurrence count: the segment serves as the key and the occurrence count as the value, and the pairs are collected into a set of key-value pairs.
The segment merging method in step four comprises the following steps:
1) first, judge whether the segments to be merged are adjacent or overlap at their boundaries;
2) then decide whether to merge according to whether the occurrence count of the merged segment, divided by the sum of the occurrence counts of the two segments before merging, is less than or equal to 0.5.
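The merge criterion in 2) can be written compactly as a formula. The notation is introduced here only for illustration and does not appear in the original text: c(s) denotes the occurrence count of segment s in the traffic, and s_1 ⊕ s_2 denotes the segment obtained by merging s_1 and s_2.

```latex
\[
\mathrm{score}(s_1, s_2) \;=\; \frac{c(s_1 \oplus s_2)}{c(s_1) + c(s_2)},
\qquad
\text{merge only if } \mathrm{score}(s_1, s_2) \le 0.5 .
\]
```

For instance, if 'xjtu' and 'jtu_' each occur 7 times and the merged segment 'xjtu_' also occurs 7 times, the score is 7 / (7 + 7) = 0.5.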
The present invention is described in more detail below with reference to figs. 1 and 2.
The overall implementation of the invention is shown in fig. 1. The whole process is divided into offline fingerprint generation and online identification.
The way the stages of the process work together is as follows:
1. Collect online traffic for offline use: operate a given function of a program online while capturing packets, and store the collected traffic data offline.
2. Referring to fig. 2, collect the traffic corresponding to each function of the same program separately; noise packets are allowed in the captured traffic.
3. The above process can be repeated N times.
4. The data collected in each round are processed as follows:
Referring to fig. 3, the data packets in the traffic data of each function are split into segments and the frequent segments are found. The packet data is treated as long strings, which are first divided into fixed-length segments. For example, for the character string 'xjtuabc' and a segment length of 2, the string is divided into {xj, jt, tu, ua, ab, bc}. If the occurrence count of a segment is greater than or equal to a preset threshold, the segment and its occurrence count are recorded, which yields the set of "key-value" pairs for one fine-grained function. Experimental analysis shows that the method works best with a segment size of 4 and an occurrence count threshold of 6.
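The segmentation and counting described above can be sketched in Python as follows. This is a minimal illustration, not the patent's own pseudocode: the function names are hypothetical, payloads are assumed to be available as text strings, and counts are assumed to be accumulated over all packets captured for one function.

```python
from collections import Counter


def split_segments(data: str, size: int) -> list[str]:
    """Slide a window of `size` characters over `data`, one character at a time."""
    return [data[i:i + size] for i in range(len(data) - size + 1)]


def frequent_segments(packets: list[str], size: int = 4, threshold: int = 6) -> dict[str, int]:
    """Return {segment: count} for segments seen at least `threshold` times.

    Assumption: counts are summed over every packet captured for one function.
    The defaults (size 4, threshold 6) are the values the text reports as best.
    """
    counts = Counter()
    for payload in packets:
        counts.update(split_segments(payload, size))
    return {seg: n for seg, n in counts.items() if n >= threshold}


print(split_segments("xjtuabc", 2))
# ['xj', 'jt', 'tu', 'ua', 'ab', 'bc']
```

The returned dictionary plays the role of the "key-value" pair set: the segment is the key and its occurrence count is the value.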
5. Referring to fig. 3, the "key-value" pairs of all functions of a program undergo intersection deletion: segments that appear in the frequent segment sets of more than one program function are removed. Two segments count as the same when their "keys" are identical; their occurrence count "values" need not be the same. For example, suppose the four functions of a program have the following frequent segment sets:
Function 1 {HTTP1.1, 8; server, 9; 0x00, 24; t%3B%, 17}
Function 2 {HTTP1.1, 13; 0x02000, 11; B7A3F8A2, 16; t%3B%, 17}
Function 3 {HTTP1.1, 8; server, 13; XX1pZ, 17; jS5gH, 16; D-DtH, 6; AC4DF, 5}
Function 4 {HTTP1.1, 8; server, 9; 0x02, 14; JSSES, 4; XX1pZ, 17}
After this step, the resulting sets should be:
Function 1 {0x00, 24}
Function 2 {0x02000, 11; B7A3F8A2, 16}
Function 3 {jS5gH, 16; D-DtH, 6; AC4DF, 5}
Function 4 {0x02, 14; JSSES, 4}
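A short Python sketch of this intersection deletion, using the four example sets above. The function name and the dictionary layout are assumptions made for illustration; a segment is treated as shared when its key appears in the sets of two or more functions, regardless of the counts.

```python
from collections import Counter


def remove_shared_segments(sets_by_function: dict[str, dict[str, int]]) -> dict[str, dict[str, int]]:
    """Drop every segment whose key appears in more than one function's set."""
    owners = Counter()
    for segments in sets_by_function.values():
        for seg in segments:
            owners[seg] += 1  # number of functions that contain this key
    return {
        name: {seg: n for seg, n in segments.items() if owners[seg] == 1}
        for name, segments in sets_by_function.items()
    }


example = {
    "Function 1": {"HTTP1.1": 8, "server": 9, "0x00": 24, "t%3B%": 17},
    "Function 2": {"HTTP1.1": 13, "0x02000": 11, "B7A3F8A2": 16, "t%3B%": 17},
    "Function 3": {"HTTP1.1": 8, "server": 13, "XX1pZ": 17, "jS5gH": 16, "D-DtH": 6, "AC4DF": 5},
    "Function 4": {"HTTP1.1": 8, "server": 9, "0x02": 14, "JSSES": 4, "XX1pZ": 17},
}
unique = remove_shared_segments(example)
print(unique["Function 1"])
# {'0x00': 24}
```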
6. Referring to fig. 3, if the frequent segment sets obtained in the previous step were used directly to distinguish program functions, matching would take too much time. This is because the frequent segment set of a function may contain a large number of redundant segments. For example, if a data packet contains the character string 'xjtu_homepage' and the segment size is 4, 10 segments are obtained: 'xjtu', 'jtu_', 'tu_h', 'u_ho', '_hom', 'home', 'omep', 'mepa', 'epag' and 'page'. These segments can be merged into longer ones. For instance, if 'xjtu' occurs 7 times and 'jtu_' also occurs 7 times, then in fact 'xjtu_' occurs 7 times. One long segment can thus replace the two original segments, which saves search time in the matching stage.
7. Referring to fig. 3, when merging frequent segments, first judge whether one segment completely contains another segment; if so, keep the longer segment and delete the contained one. If not, check whether the two segments are adjacent or overlap at their boundaries. If they are, judge whether the merge score meets the preset threshold; if it does, keep the merged segment and its occurrence count as a key-value pair, and delete the two segments that were merged. The merge score is calculated as the ratio of the number of times the merged segment occurs in the traffic to the sum of the numbers of times the two original segments occur in the traffic; the method checks whether this ratio is less than or equal to 0.5.
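The merging procedure can be sketched in Python as below. This is an illustrative sketch under stated assumptions, not the patent's pseudocode: all function names are hypothetical; the count of a merged segment is obtained by recounting it in the captured payloads; a merged candidate is only considered if it actually occurs in the traffic (an added assumption); and the acceptance test uses the ratio and the 0.5 bound exactly as the text states them, isolated in `merge_score_ok` so that a different rule can be substituted.

```python
def count_occurrences(segment: str, packets: list[str]) -> int:
    """Count occurrences of `segment` in all payloads, overlapping ones included."""
    total = 0
    for payload in packets:
        pos = payload.find(segment)
        while pos != -1:
            total += 1
            pos = payload.find(segment, pos + 1)
    return total


def join_candidates(a: str, b: str):
    """Yield strings with b placed after a: largest boundary overlap first,
    plain adjacency (no shared characters) last."""
    for k in range(min(len(a), len(b)) - 1, -1, -1):
        if k == 0 or a.endswith(b[:k]):
            yield a + b[k:]


def merge_score_ok(merged: int, count_a: int, count_b: int, max_ratio: float = 0.5) -> bool:
    """Ratio test as stated in the text: merged / (a + b) <= 0.5."""
    return merged / (count_a + count_b) <= max_ratio


def merge_frequent_segments(segments: dict[str, int], packets: list[str]) -> dict[str, int]:
    """Repeat containment removal and pairwise merging until nothing changes."""
    result = dict(segments)
    changed = True
    while changed:
        changed = False
        # Containment: keep the longer segment, delete the contained one.
        for short in sorted(result, key=len):
            if any(short != other and short in other for other in result):
                del result[short]
                changed = True
        # Adjacent or overlapping pairs: replace both by the merged segment.
        for a in list(result):
            for b in list(result):
                if a == b or a not in result or b not in result:
                    continue
                for merged in join_candidates(a, b):
                    n = count_occurrences(merged, packets)
                    # Assumption: the merged string must really occur in the traffic.
                    if n > 0 and merge_score_ok(n, result[a], result[b]):
                        result[merged] = n
                        del result[a], result[b]
                        changed = True
                        break
    return result
```

With seven payloads that each contain 'xjtu_homepage' and the ten 4-character segments from the example (each with count 7), repeated merging collapses the set into the single entry 'xjtu_homepage' with count 7.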
8. Referring to fig. 3, the above steps generate the frequent segment set for each function. To counter the influence of noise packets in the traffic on the final result, the intersection of the frequent segment sets produced for the same function over the N rounds is computed next, so that the noise is removed and the fine-grained fingerprint is finally generated. In the invention, the intersection is defined as follows: for frequent segments whose "key" is the same but whose "value" differs, the final "value" is the smallest one. For example, intersecting (HTTP, 8) and (HTTP, 6) gives (HTTP, 6). An example of intersecting three frequent segment sets of one function of a program is shown below. The three frequent segment sets are:
Function 1-1 {HTTP1.1, 8; server, 9; 0x02, 14; B7A3F8A2, 16; t%3B%, 17}
Function 1-2 {HTTP1.1, 7; server, 9; 0x02, 14; C7A8K3C7, 14; t%3B%, 17}
Function 1-3 {HTTP1.1, 8; server, 7; 0x02, 14; 37B2B7C6, 7; t%3B%, 17}
After this module, the final fingerprint of the function is:
Function 1 fingerprint: {HTTP1.1, 7; server, 7; 0x02, 14; t%3B%, 17}.
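The intersection with the minimum rule can be sketched in Python as follows, using the three example sets above. The function names are hypothetical and each preliminary fingerprint is assumed to be a dictionary mapping a segment to its count.

```python
from functools import reduce


def intersect_fingerprints(a: dict[str, int], b: dict[str, int]) -> dict[str, int]:
    """Keep segments present in both sets; a shared key takes the smaller count."""
    return {seg: min(a[seg], b[seg]) for seg in a.keys() & b.keys()}


def final_fingerprint(runs: list[dict[str, int]]) -> dict[str, int]:
    """Intersect the preliminary fingerprints of all rounds for one function."""
    if not runs:
        raise ValueError("at least one preliminary fingerprint is required")
    return reduce(intersect_fingerprints, runs)


runs = [
    {"HTTP1.1": 8, "server": 9, "0x02": 14, "B7A3F8A2": 16, "t%3B%": 17},
    {"HTTP1.1": 7, "server": 9, "0x02": 14, "C7A8K3C7": 14, "t%3B%": 17},
    {"HTTP1.1": 8, "server": 7, "0x02": 14, "37B2B7C6": 7, "t%3B%": 17},
]
fingerprint = final_fingerprint(runs)
print(sorted(fingerprint.items()))
# [('0x02', 14), ('HTTP1.1', 7), ('server', 7), ('t%3B%', 17)]
```

Segments that appear in only one round, such as noise from unrelated packets, drop out because they are missing from at least one of the other sets.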
As described above, the invention can automatically generate fine-grained program function fingerprints, and is convenient for later-stage identification and use.
The method is not limited to network function identification; depending on the situation, it can also be used to generate fingerprints from the text of other kinds of traffic data packets.
Although specific embodiments of, and examples for, the invention are disclosed in the accompanying drawings for illustrative purposes and to aid in the understanding of the contents of the invention and the manner in which the same may be practiced, those skilled in the art will understand that: various substitutions, changes and modifications are possible without departing from the spirit and scope of the present invention and the appended claims. The present invention should not be limited to the disclosure of the embodiments and drawings described in the specification, and the scope of the present invention is defined by the scope of the claims.
Claims (4)
1. A method for automatically generating fine-grained network program function traffic fingerprints, characterized by comprising the following steps:
step one: collecting the corresponding traffic data packets for each program function;
step two: treating the traffic data packets of each function as long character strings and dividing each long string into fixed-length segments; if a segment occurs in the data packets a number of times greater than or equal to a preset threshold, selecting it as a frequent segment;
step three: after obtaining the frequent segments, deleting the intersection between the frequent segment sets of the different program functions, the step comprising two stages: stage one, finding the intersection among the frequent segment sets; stage two, deleting the shared segments from the frequent segment sets of the different functions;
step four: merging the segments in the frequent segment set of each program function, keeping the merged segment and its occurrence count, then deleting the two shorter segments and their occurrence counts, repeating this step until no further merging is possible, the resulting frequent segment set serving as the preliminary fingerprint; merging requires two conditions: first, the two segments must be adjacent or overlap at their boundaries; second, the merge score must meet a preset threshold;
when merging frequent segments, first judging whether one segment completely contains another segment; if so, keeping the longer segment and deleting the contained one; if not, checking whether the two segments are adjacent or overlap at their boundaries; if they are, judging whether the merge score meets the preset threshold; if it does, keeping the merged segment and its occurrence count as a key-value pair and deleting the two segments that were merged; the merge score being calculated as the ratio of the number of times the merged segment occurs in the traffic to the sum of the numbers of times the two original segments occur in the traffic, and it being judged whether this ratio is less than or equal to 0.5;
step five: repeating steps one to four N times, intersecting the multiple preliminary fingerprints produced for the same function, and finally generating the final fingerprint of every function of the program.
2. The method according to claim 1, characterized in that a same segment in stage two of step three means two segments whose characters are identical.
3. The method according to claim 1, characterized in that, in step five, when the same character segment has different occurrence counts, the final occurrence count is the minimum of those values.
4. The method according to claim 1, characterized in that, in step two, each frequent segment is recorded together with its occurrence count; the segment serves as the key and the occurrence count as the value, and the pairs are collected into a set of key-value pairs.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710948442.2A CN107682225B (en) | 2017-10-12 | 2017-10-12 | Method for automatically generating fine-grained network program function flow fingerprint |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710948442.2A CN107682225B (en) | 2017-10-12 | 2017-10-12 | Method for automatically generating fine-grained network program function flow fingerprint |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107682225A (en) | 2018-02-09 |
CN107682225B (en) | 2020-05-22 |
Family
ID=61139990
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710948442.2A Expired - Fee Related CN107682225B (en) | 2017-10-12 | 2017-10-12 | Method for automatically generating fine-grained network program function flow fingerprint |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107682225B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104468273A (en) * | 2014-12-12 | 2015-03-25 | 北京百度网讯科技有限公司 | Method and system for recognizing application type of flow data |
CN106452868A (en) * | 2016-10-12 | 2017-02-22 | 中国电子科技集团公司第三十研究所 | Network traffic statistics implement method supporting multi-dimensional aggregation classification |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11087006B2 (en) * | 2014-06-30 | 2021-08-10 | Nicira, Inc. | Method and apparatus for encrypting messages based on encryption group association |
Non-Patent Citations (2)
Title |
---|
SktTracer: Towards Fine-Grained Identification for Skype Traffic via Sequence Signatures; Zhenlong Yuan et al.; 2014 International Conference on Computing, Networking and Communications, Communications and Information Security Symposium; 2014-02-03; entire document * |
Fine-grained network application behavior identification based on the Tilera platform; Wu Shun et al.; Telecommunications Science; 2013-11-20; entire document * |
Also Published As
Publication number | Publication date |
---|---|
CN107682225A (en) | 2018-02-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111506599B (en) | Industrial control equipment identification method and system based on rule matching and deep learning | |
CN103838754B (en) | Information retrieval device and method | |
CN103076892A (en) | Method and equipment for providing input candidate items corresponding to input character string | |
CN103336766A (en) | Short text garbage identification and modeling method and device | |
CN111324801B (en) | Hot event discovery method in judicial field based on hot words | |
CN111460153A (en) | Hot topic extraction method and device, terminal device and storage medium | |
CN113505826B (en) | Network flow anomaly detection method based on joint feature selection | |
CN103077163B (en) | Data preprocessing method, device and system | |
CN113032557A (en) | Microblog hot topic discovery method based on frequent word set and BERT semantics | |
CN102184201B (en) | Equipment and method used for selecting recommended sequence of query sequence | |
CN108173876B (en) | Dynamic rule base construction method based on maximum frequent pattern | |
CN112231700B (en) | Behavior recognition method and apparatus, storage medium, and electronic device | |
CN107832611B (en) | Zombie program detection and classification method combining dynamic and static characteristics | |
CN112235254B (en) | Rapid identification method for Tor network bridge in high-speed backbone network | |
CN114510615A (en) | Fine-grained encrypted website fingerprint classification method and device based on graph attention pooling network | |
CN111400617B (en) | Social robot detection data set extension method and system based on active learning | |
WO2021248707A1 (en) | Operation verification method and apparatus | |
CN103455754A (en) | Regular expression-based malicious search keyword recognition method | |
CN107682225B (en) | Method for automatically generating fine-grained network program function flow fingerprint | |
CN110555170B (en) | System and method for optimizing user experience | |
CN111538839A (en) | Real-time text clustering method based on Jacobsard distance | |
CN110929506A (en) | Junk information detection method, device and equipment and readable storage medium | |
KR101621959B1 (en) | Apparatus for extracting and analyzing log pattern and method thereof | |
CN116662557A (en) | Entity relation extraction method and device in network security field | |
CN111159996B (en) | Short text set similarity comparison method and system based on text fingerprint algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20200522 |