CN107682225B - Method for automatically generating fine-grained network program function flow fingerprint

Info

Publication number: CN107682225B
Application number: CN201710948442.2A
Authority: CN
Inventors: 唐亚哲; 李勋
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2017-10-12
Filing date: 2017-10-12
Publication date: 2020-05-22
Anticipated expiration: 2037-10-12
Also published as: CN107682225A

Abstract

The invention discloses a method for automatically generating fine-grained network program function traffic fingerprints. It relates to the technical field of network communication, in particular to network service identification, and mainly aims to identify the different functional modules within a specific application. The invention treats the underlying traffic as long character strings, and computes and records the frequently occurring character fragments together with their occurrence counts. The corresponding fingerprint is then obtained through a subsequent intersection-deletion operation, a merging operation, and a final fingerprint purification operation. In this way, the invention can automatically and effectively generate fingerprints for network program functions. The fingerprints allow the different functions of a network program to be identified, so that the network state can be accurately controlled, network quality optimized, and valuable information mined.

Description

Method for automatically generating fine-grained network program function flow fingerprint

Technical Field

The invention belongs to the field of identification of computer network flow service types, and particularly relates to a method for automatically generating a fine-grained network program function flow fingerprint.

Background

With the rapid development of networks, new service applications keep appearing, and the communication protocols they use are increasingly complex. In addition, malicious attacks on networks are increasing. Network traffic identification is therefore very important: it lets network managers monitor and manage the various service flows in real time, and it helps network service providers understand the state of the various service flows when planning and building networks. In particular, fine-grained traffic identification can reveal how much users like a given function of a piece of software and the corresponding user behavior, which provides a basis for later network program design and development. It also enables in-depth measurement and monitoring of the network, thereby guiding developers toward a better user experience.

Currently, most research in the field of network traffic identification focuses on the classification of network protocols and applications. Many researchers use variants of existing classification methods, or combine several classification methods, to improve on the efficiency and accuracy of earlier studies. However, as networks develop rapidly, traffic volume keeps growing and traffic characteristics become more and more complex, so it is difficult to find a traffic classification method with 100% accuracy. Rather than raising accuracy by 1%-2% over existing classification algorithms, it is more meaningful to study the traffic classification problem at a new classification level. Fine-grained network program function traffic identification is such a new classification idea: it identifies the different functional modules within a specific application. The key problem of fine-grained network program function traffic identification is therefore to generate fingerprints for the different functional modules of a specific application so that they can later be identified and distinguished. Fine-grained traffic identification has so far received little research attention, and most existing work relies on traffic statistics, for example using packet length, packet interval, bit rate and the like as feature values. Such methods are strongly affected by network conditions, sensitive to noise, and have difficulty guaranteeing precision.

Disclosure of Invention

The invention aims to provide a method for automatically generating a fine-grained network program function flow fingerprint so as to solve the above problems.

In order to achieve the purpose, the invention adopts the following technical scheme:

A method for automatically generating fine-grained network program function flow fingerprints comprises the following steps:

Step one: collect the corresponding traffic data packets for each program function;

Step two: treat the traffic data packets of each function as long character strings and divide each long string into fixed-length fragments; if a fragment occurs frequently in the data packets, with an occurrence count greater than or equal to a preset threshold, select it as a frequent fragment;

Step three: after obtaining the frequent fragments, delete the fragments shared between the frequent-fragment sets of different program functions; this step comprises two stages: stage one, find the intersection among the frequent-fragment sets; stage two, delete the identical fragments from the frequent fragments of the different functions;

Step four: merge the fragments in the frequent-fragment set of each program function, keeping each merged fragment and its occurrence count and then deleting the two shorter fragments that were merged together with their occurrence counts; repeat this step until no further merging is possible, and take the resulting set of frequent fragments as a preliminary fingerprint;

Step five: repeat steps one to four N times, intersect the multiple preliminary fingerprints generated for the same function, and finally generate the final fingerprints of all functions of the program.

Further, in stage two of step three, identical fragments are fragments whose characters are the same.

Further, merging in step four requires two conditions: first, the two fragments must be adjacent or their boundary characters must overlap; second, the merge score must meet a preset threshold.

Further, in step five, when the same character fragment has different occurrence counts, the minimum is taken as the final occurrence count.

Further, in step two, the frequent fragments are recorded together with their occurrence counts: each fragment serves as a key and its occurrence count as the value, and the key-value pairs are finally combined into a set.

Further, the fragment merging method in step four comprises the following steps:

1) first, determine whether the fragments to be merged are adjacent or have overlapping boundaries;

2) decide whether to merge according to whether the occurrence count of the merged fragment divided by the sum of the occurrence counts of the two fragments before merging is less than or equal to 0.5; the merge is performed only when this ratio is less than or equal to 0.5.
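
As a worked form of the criterion in 2), with symbols introduced here only for illustration (they are not part of the original text): let $c_a$ and $c_b$ be the occurrence counts of the two fragments before merging and $c_{ab}$ the occurrence count of the merged fragment.

```latex
s = \frac{c_{ab}}{c_a + c_b}, \qquad \text{merge only if } s \le 0.5
% e.g. c_a = 7, c_b = 7, c_{ab} = 7 gives s = 7 / (7 + 7) = 0.5, so the two fragments are merged
```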

Compared with the prior art, the invention has the following technical effects:

The invention can automatically generate fingerprints for the different functions of an application. This prepares for the subsequent fine-grained traffic identification, that is, identifying the traffic that corresponds to a functional operation of the application; traffic identification based on fine-grained fingerprints in turn provides information for upper-layer data analysis. With the fine-grained fingerprints, it is possible to determine whether the different functions of the program have been triggered and for how long each function is used, which provides a necessary data source for subsequent user analysis and user-experience work.

The automatic fingerprint generation method greatly reduces manual involvement, lightens the labor burden, and improves accuracy and working efficiency; moreover, the method requires no prior information such as the traffic protocol format and is insensitive to traffic noise.

Drawings

FIG. 1 is a schematic diagram of a specific embodiment;

FIG. 2 is a schematic diagram of fine-grained function traffic;

FIG. 3 is a schematic block diagram of the flow of the method of the present invention.

Detailed Description

The invention is further described below with reference to the accompanying drawings:

Referring to FIGS. 1 to 3, a method for automatically generating a fine-grained network program function flow fingerprint comprises the following steps:

Step one: collect the corresponding traffic data packets for each program function;

Step two: treat the traffic data packets of each function as long character strings and divide each long string into fixed-length fragments; if a fragment occurs frequently in the data packets, with an occurrence count greater than or equal to a preset threshold, select it as a frequent fragment; specifically, the pseudocode of the processing algorithm for this step is expressed as follows:
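
The following is a minimal Python sketch of step two, not the patent's own pseudocode. It assumes that packet payloads have already been decoded into strings and that fragments are taken with a one-character sliding window, as the 'xjtuabc' example in the embodiment suggests. The function names `split_fragments` and `frequent_fragments` are hypothetical; the defaults (fragment size 4, threshold 6) are the values the description reports as working best.

```python
from collections import Counter


def split_fragments(payload: str, size: int) -> list[str]:
    """Slide a window of `size` characters over the payload, one character at a time."""
    return [payload[i:i + size] for i in range(len(payload) - size + 1)]


def frequent_fragments(payloads: list[str], size: int = 4, threshold: int = 6) -> dict[str, int]:
    """Return {fragment: count} for fragments seen at least `threshold` times."""
    counts: Counter[str] = Counter()
    for payload in payloads:
        counts.update(split_fragments(payload, size))
    # Keep only fragments whose occurrence count reaches the threshold.
    return {fragment: n for fragment, n in counts.items() if n >= threshold}
```

The returned dictionary plays the role of the key-value set described in the text, with the fragment as key and its occurrence count as value.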

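Step three: after obtaining the frequent fragments, delete the fragments shared between the frequent-fragment sets of different program functions; this step comprises two stages: stage one, find the intersection among the frequent-fragment sets; stage two, delete the identical fragments from the frequent fragments of the different functions;

Step four: merge the fragments in the frequent-fragment set of each program function, keeping each merged fragment and its occurrence count and then deleting the two shorter fragments that were merged together with their occurrence counts; repeat this step until no further merging is possible, and take the resulting set of frequent fragments as a preliminary fingerprint; specifically, the pseudocode of the processing algorithm for this step is expressed as follows: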
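
The following is a minimal Python sketch of the merging loop in step four, not the patent's own pseudocode. Several points are assumptions made for illustration: the occurrence count of a merged fragment is obtained by recounting it in the collected payloads, a merged fragment is kept only if it still reaches the frequency threshold (`min_count`), and the names `count_occurrences`, `join` and `merge_fragments` are hypothetical. The score test (ratio less than or equal to 0.5) and the containment rule follow the description.

```python
def count_occurrences(payloads: list[str], fragment: str) -> int:
    """Count occurrences of `fragment` across all payloads, overlaps included."""
    total = 0
    for payload in payloads:
        start = payload.find(fragment)
        while start != -1:
            total += 1
            start = payload.find(fragment, start + 1)
    return total


def join(a: str, b: str) -> str:
    """Join a and b on their longest boundary overlap; concatenate if they only touch."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a[-k:] == b[:k]:
            return a + b[k:]
    return a + b  # adjacent case: no shared boundary characters


def merge_fragments(frequent: dict[str, int], payloads: list[str],
                    min_count: int = 6, max_score: float = 0.5) -> dict[str, int]:
    """Repeat pairwise merging until no further merge is possible."""
    merged = dict(frequent)
    changed = True
    while changed:
        changed = False
        for a in list(merged):
            for b in list(merged):
                if a == b or a not in merged or b not in merged:
                    continue
                if b in a:  # b is fully contained in a: keep only the longer one
                    del merged[b]
                    changed = True
                    continue
                candidate = join(a, b)
                count = count_occurrences(payloads, candidate)
                if count < min_count:  # assumption: merged fragment must stay frequent
                    continue
                if count / (merged[a] + merged[b]) <= max_score:
                    del merged[a], merged[b]
                    merged[candidate] = count
                    changed = True
    return merged
```

With seven copies of the payload 'xjtu_homepage' and its ten 4-character fragments, each counted 7 times, this sketch collapses the set to the single entry 'xjtu_homepage' with count 7.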

Step five: repeat steps one to four N times, intersect the multiple preliminary fingerprints generated for the same function, and finally generate the final fingerprints of all functions of the program. The iteration of this step can be expressed as follows:

The process ends when new fingerprint set(i) = new fingerprint set(i-1).
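
One plausible way to write this iteration, offered as an assumption rather than the patent's own expression: let $P_f^{(i)}$ denote the preliminary fingerprint of function $f$ from the $i$-th round of steps one to four, and $F_f^{(i)}$ the new fingerprint set after round $i$ (both symbols are introduced here for illustration). The intersection keeps only fragments present in both sets and takes the smaller occurrence count.

```latex
F_f^{(1)} = P_f^{(1)}, \qquad
F_f^{(i)} = F_f^{(i-1)} \cap P_f^{(i)} \quad (i \ge 2), \qquad
\text{stop when } F_f^{(i)} = F_f^{(i-1)}
```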

In stage two of step three, identical fragments are fragments whose characters are the same.

Merging in step four requires two conditions: first, the two fragments must be adjacent or their boundary characters must overlap; second, the merge score must meet a preset threshold.

In step five, when the same character fragment has different occurrence counts, the minimum is taken as the final occurrence count.

In step two, the frequent fragments are recorded together with their occurrence counts: each fragment serves as a key and its occurrence count as the value, and the key-value pairs are finally combined into a set.

The fragment merging method in step four comprises the following steps:

1) first, determine whether the fragments to be merged are adjacent or have overlapping boundaries;

2) decide whether to merge according to whether the occurrence count of the merged fragment divided by the sum of the occurrence counts of the two fragments before merging is less than or equal to 0.5.

The present invention is described in more detail below with reference to FIGS. 1 and 2.

The overall implementation of the invention is shown in FIG. 1; the whole process is divided into offline fingerprint generation and online identification.

The cooperation between these stages is as follows:

1. Collect online traffic for offline use: operate a given function of a program online while capturing packets, and store the collected traffic data offline.

2. With reference to FIG. 2, collect the traffic corresponding to each function of the same program separately; noise packets are allowed in the traffic data.

3. The above process can be repeated N times.

4. The data collected each time are processed as follows:

Referring to FIG. 3, the data packets in the traffic data of each function are split into fragments and the frequent fragments are found. The packet data are treated as long character strings, which are first divided into fixed-length fragments. For example, given the string 'xjtuabc' and a fragment length of 2, the string is divided into {xj, jt, tu, ua, ab, bc}. If the occurrence count of a fragment is greater than or equal to a preset threshold, the fragment and its occurrence count are recorded, yielding the set of key-value pairs for one fine-grained function. Experimental analysis shows that the method works best with a fragment size of 4 and an occurrence threshold of 6.

5. Referring to FIG. 3, the key-value pairs of all functions of a program undergo intersection deletion: fragments that are identical across the frequent-fragment key-value pairs of different program functions are removed. Two fragments are identical when their keys are the same; their values (occurrence counts) need not be the same. For example, suppose the frequent-fragment sets of four different functions of a program are:

Function 1 {HTTP1.1, 8; server, 9; 0x00, 24; t%3B%, 17}

Function 2 {HTTP1.1, 13; 0x02000, 11; b7A3F8a2, 16; t%3B%, 17}

Function 3 {HTTP1.1, 8; server, 13; XX1pZ, 17; jS5gH, 16; D-DtH, 6; AC4DF, 5}

Function 4 {HTTP1.1, 8; server, 9; 0x02, 14; JSSES, 4; XX1pZ, 17}

After this step, the resulting sets should be:

Function 1 {0x00, 24}

Function 2 {0x02000, 11; b7A3F8a2, 16}

Function 3 {jS5gH, 16; D-DtH, 6; AC4DF, 5}

Function 4 {0x02, 14; JSSES, 4}
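
The intersection deletion above can be sketched in Python as follows. This is an illustrative reading, assuming each function's frequent fragments are held in a dictionary and that a fragment counts as shared when its key appears in two or more functions, which is consistent with the example. The function name `remove_shared_fragments` is hypothetical.

```python
from collections import Counter


def remove_shared_fragments(per_function: dict[str, dict[str, int]]) -> dict[str, dict[str, int]]:
    """Drop every fragment whose key occurs in more than one function's set."""
    # Stage one: find the keys that appear in two or more functions.
    seen = Counter(key for frags in per_function.values() for key in frags)
    shared = {key for key, n in seen.items() if n > 1}
    # Stage two: delete those keys from every function, whatever their counts.
    return {
        name: {key: count for key, count in frags.items() if key not in shared}
        for name, frags in per_function.items()
    }


functions = {
    "Function 1": {"HTTP1.1": 8, "server": 9, "0x00": 24, "t%3B%": 17},
    "Function 2": {"HTTP1.1": 13, "0x02000": 11, "b7A3F8a2": 16, "t%3B%": 17},
    "Function 3": {"HTTP1.1": 8, "server": 13, "XX1pZ": 17, "jS5gH": 16, "D-DtH": 6, "AC4DF": 5},
    "Function 4": {"HTTP1.1": 8, "server": 9, "0x02": 14, "JSSES": 4, "XX1pZ": 17},
}
unique = remove_shared_fragments(functions)
```

Applied to the four example sets, `unique` contains exactly the per-function sets listed above.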

6. Referring to FIG. 3, using the frequent-fragment sets obtained in the previous step directly to distinguish program functions would take too much time, because the frequent-fragment set of a function may contain many redundant fragments. For example, if a data packet contains the character string 'xjtu_homepage' and the fragment size is 4, ten fragments are obtained: 'xjtu', 'jtu_', 'tu_h', 'u_ho', '_hom', 'home', 'omep', 'mepa', 'epag' and 'page'. Such fragments can be merged into longer ones: if 'xjtu' occurs 7 times and 'jtu_' also occurs 7 times, then in fact 'xjtu_' occurs 7 times. One long fragment can thus replace the original two, which saves search time in the matching stage.

7. Referring to FIG. 3, when merging frequent fragments, first determine whether one fragment completely contains another; if so, keep the longer fragment and delete the contained one. If not, check whether the two fragments are adjacent or their boundaries overlap. If they are, determine whether the merge score meets the preset threshold; if it does, keep the key-value pair of the merged fragment and its occurrence count, and delete the two fragments that were merged. The merge score is the ratio of the number of times the merged fragment occurs in the traffic to the sum of the numbers of times the two original fragments occur in the traffic; the method checks whether this ratio is less than or equal to 0.5.

8. Referring to FIG. 3, the above steps produce a frequent-fragment set for each function. To remove the influence of noise packets in the traffic on the final result, the frequent-fragment sets generated for the same function over the N runs are then intersected, which eliminates the noise and yields the final fine-grained fingerprint. In the invention, the intersection is defined as follows: for frequent fragments with the same key but different values, the final value is the smallest one. For example, intersecting (HTTP, 8) and (HTTP, 6) gives (HTTP, 6). An example of intersecting three frequent-fragment sets of one function of a program follows. The three sets are:

Function 1-1 {HTTP1.1, 8; server, 9; 0x02, 14; b7A3F8a2, 16; t%3B%, 17}

Function 1-2 {HTTP1.1, 7; server, 9; 0x02, 14; c7A8K3C7, 14; t%3B%, 17}

Function 1-3 {HTTP1.1, 8; server, 7; 0x02, 14; 37B2B7C6, 7; t%3B%, 17}

The final fingerprint of this function produced by this module is:

Function 1 fingerprint: {HTTP1.1, 7; server, 7; 0x02, 14; t%3B%, 17}.
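
The intersection rule (same key kept, smallest value taken) can be sketched in Python as follows. The function names `intersect_fingerprints` and `final_fingerprint` are hypothetical, and the sketch assumes at least one run is supplied.

```python
from functools import reduce


def intersect_fingerprints(a: dict[str, int], b: dict[str, int]) -> dict[str, int]:
    """Keep fragments present in both sets; for differing counts take the minimum."""
    return {key: min(a[key], b[key]) for key in a if key in b}


def final_fingerprint(runs: list[dict[str, int]]) -> dict[str, int]:
    """Intersect the frequent-fragment sets of all runs of one function."""
    return reduce(intersect_fingerprints, runs)


runs = [
    {"HTTP1.1": 8, "server": 9, "0x02": 14, "b7A3F8a2": 16, "t%3B%": 17},
    {"HTTP1.1": 7, "server": 9, "0x02": 14, "c7A8K3C7": 14, "t%3B%": 17},
    {"HTTP1.1": 8, "server": 7, "0x02": 14, "37B2B7C6": 7, "t%3B%": 17},
]
fingerprint = final_fingerprint(runs)
```

For the three example sets, `fingerprint` equals the Function 1 fingerprint given above; the fragments that appear in only one run are dropped as noise.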

As described above, the invention can automatically generate fine-grained program function fingerprints, and is convenient for later-stage identification and use.

The method is not limited to network function identification; fingerprints can also be generated for the text of other kinds of traffic data packets as the situation requires.

Although specific embodiments of, and examples for, the invention are disclosed in the accompanying drawings for illustrative purposes and to aid in the understanding of the contents of the invention and the manner in which the same may be practiced, those skilled in the art will understand that: various substitutions, changes and modifications are possible without departing from the spirit and scope of the present invention and the appended claims. The present invention should not be limited to the disclosure of the embodiments and drawings described in the specification, and the scope of the present invention is defined by the scope of the claims.

Claims

1. A method for automatically generating fine-grained network program function flow fingerprints is characterized by comprising the following steps:

step four: merge the fragments in the frequent-fragment set of each program function, keeping each merged fragment and its occurrence count and then deleting the two shorter fragments that were merged together with their occurrence counts; repeat this step until no further merging is possible, and take the resulting set of frequent fragments as a preliminary fingerprint; merging requires two conditions: first, the two fragments must be adjacent or their boundary characters must overlap; second, the merge score must meet a preset threshold;

when merging frequent fragments, first determine whether one fragment completely contains another; if so, keep the longer fragment and delete the contained one; if not, check whether the two fragments are adjacent or their boundaries overlap; if they are, determine whether the merge score meets the preset threshold, and if it does, keep the key-value pair of the merged fragment and its occurrence count and delete the two fragments that were merged; the merge score is the ratio of the number of times the merged fragment occurs in the traffic to the sum of the numbers of times the two original fragments occur in the traffic, and the method checks whether this ratio is less than or equal to 0.5;

2. The method of claim 1, wherein in stage two of step three, identical fragments are fragments whose characters are the same.

3. The method of claim 1, wherein in step five, when the same character fragment has different occurrence counts, the minimum is taken as the final occurrence count.

4. The method of claim 1, wherein in step two, the frequent fragments are recorded together with their occurrence counts; each fragment serves as a key and its occurrence count as the value, and the key-value pairs are finally combined into a set.