CN108667685B

CN108667685B - Mobile application network flow clustering device

Info

Publication number: CN108667685B
Application number: CN201810309715.3A
Authority: CN
Inventors: 何高峰; 朱海婷; 孙雁飞; 王堃
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2018-04-08
Filing date: 2018-04-08
Publication date: 2020-10-02
Anticipated expiration: 2038-04-08
Also published as: CN108667685A

Abstract

A mobile application network traffic clustering device includes: an acquisition unit adapted to acquire network traffic generated by a mobile terminal; a preprocessing unit adapted to preprocess the acquired network traffic to generate a DNS traffic record set corresponding to the DNS traffic and an other network traffic record set corresponding to the network traffic other than the DNS traffic; a first clustering unit adapted to cluster the DNS traffic record set to obtain a plurality of DNS network traffic classes corresponding to the DNS traffic; a second clustering unit adapted to cluster the other network traffic record set by using the obtained DNS network traffic classes to obtain a plurality of other network traffic classes corresponding to the other network traffic; and a merging unit adapted to merge the DNS network traffic classes with the other network traffic classes to obtain a final mobile application network traffic clustering result. With this scheme, cluster analysis of mobile application network traffic can be realized simply.

Description

Mobile application network flow clustering device

Technical Field

The invention relates to the technical field of data processing, in particular to a mobile application network flow clustering device.

Background

With the rapid development of mobile internet and Internet of Things technologies, mobile terminals such as smartphones, tablets, smart glasses, and smart watches have become important tools for daily social activities.

However, the widespread use of mobile terminals also presents an unprecedented challenge to network management and network security protection. For example, within an enterprise intranet, a malicious application may be installed on a mobile terminal carried by an employee; the application calls the camera to photograph confidential information and then transmits it to an attacker's server over a mobile communication network or a wireless network, so that the confidential information is leaked. Therefore, detecting malicious mobile applications, or analyzing through network traffic whether the operating behavior of a mobile terminal is abnormal, has become a hot topic in current theoretical research and practical application.

When studying how to detect malicious mobile applications through network traffic, or how to analyze whether the operating behavior of a mobile terminal is abnormal, an important premise is that the correspondence between network traffic and mobile applications is known. For example, when analyzing network behavior to detect malicious mobile applications, for different network flows F1, F2, F3, and F4, existing work assumes it is known that F1 and F2 are generated by one mobile application and that F3 and F4 are generated by another, and then analyzes the network behavior characteristics of {F1, F2} and {F3, F4} to detect malicious mobile applications. To satisfy this assumption, an agent program may be installed on the mobile terminal to determine the correspondence between network traffic and mobile applications. In practice, however, installing the agent program directly on the mobile terminal faces many difficulties: for example, the user must be forced to install the agent program, and the user can uninstall it at will.

However, no public research and report on automatic clustering of mobile application network traffic has appeared in the prior art.

Disclosure of Invention

The technical problem solved by the invention is how to realize cluster analysis of mobile application network traffic in a simple manner.

In order to solve the above technical problem, an embodiment of the present invention provides a mobile application network traffic clustering device, where the device includes:

the acquisition unit is suitable for acquiring network traffic generated by the mobile terminal;

the preprocessing unit is suitable for preprocessing the acquired network traffic to generate a DNS traffic record set corresponding to the DNS traffic and other network traffic record sets corresponding to other network traffic except the DNS traffic;

the first clustering unit is suitable for clustering the DNS traffic record set to obtain a plurality of DNS network traffic classes corresponding to the DNS traffic;

the second clustering unit is suitable for clustering the other network traffic record sets by adopting the obtained multiple DNS network traffic classes to obtain multiple other network traffic classes corresponding to the other network traffic;

and the merging unit is suitable for merging the plurality of DNS network traffic classes corresponding to the DNS traffic and the plurality of other network traffic classes corresponding to the other network traffic to obtain a final mobile application network traffic clustering result.

Optionally, the DNS network traffic includes DNS response traffic and DNS query traffic;

the preprocessing unit is suitable for traversing the DNS traffic in the acquired network traffic one by one; when the traversed current DNS traffic is determined to be DNS response traffic, extracting the domain name and the IP address in the current DNS traffic to form a corresponding response record; when the traversed current DNS traffic is determined to be DNS query traffic, extracting the domain name and the capture time in the current DNS traffic to form a corresponding query record; and acquiring the next DNS traffic until all the DNS traffic in the network traffic has been traversed, thereby generating the DNS traffic record set.

Optionally, the preprocessing unit is further adapted to traverse the other network traffic one by one, and to set for each piece of other network traffic a corresponding six-tuple flow identifier comprising a source address, a destination address, a source port, a destination port, a domain name or null domain name, and an upper-layer protocol; when the destination address in the six-tuple flow identifier has a corresponding domain name, the domain name field holds the corresponding domain name information; otherwise, the domain name field holds a null domain name; and to append, in sequence, the message capture time and the message length of the other network traffic having the same six-tuple flow identifier to that six-tuple flow identifier, so as to obtain the other network traffic record set.

Optionally, the first clustering unit is adapted to: create a variable j, set the initial value of j to 1, and take the first item in the DNS query record set as the first element of a first DNS query record class T1; traverse the DNS query record set in order to obtain the traversed current DNS query record Di; calculate the message capture time difference between the current DNS query record Di and the previous DNS query record Di-1; when the message capture time difference is determined to be smaller than a preset time threshold, merge the current DNS query record Di into the DNS query record class in which the previous DNS query record Di-1 is located; when the message capture time difference is determined to be greater than or equal to the time threshold, create a new DNS query record class and take the current DNS query record Di as the first element of the new DNS query record class; obtain the next DNS query record Di+1 in the DNS query record set until all DNS query records in the DNS query record set have been traversed, so as to obtain a plurality of DNS query record classes T1-Tm; create variables a and b, where a and b each range from 1 to m, and set the initial values of a and b to 1 and 2, respectively; traverse the DNS query record classes T1-Tm to obtain the traversed current DNS query record classes Ta and Tb; when it is determined that the DNS query record class Tb exists, calculate the domain name similarity between the current DNS query record classes Ta and Tb; when the domain name similarity is determined to be greater than a preset similarity threshold, merge the current DNS query record classes Ta and Tb to obtain a merged DNS query record class Ta; when the domain name similarity is determined to be smaller than or equal to the similarity threshold, keep the current DNS query record classes Ta and Tb unchanged; then set b to b+1 and repeat the judging and calculating steps until b equals m; set a to a+1, continuing to increment until a DNS query record class Ta exists, set b to a+1, and repeat the judging and calculating steps until b equals m; repeat the above steps until a equals m, so as to obtain a plurality of final DNS query record classes; and classify each DNS response record in the DNS response record set into the DNS query record class in which the corresponding DNS query record is located, so as to obtain the plurality of DNS network traffic classes.

Optionally, the first clustering unit is adapted to calculate a domain name similarity between traversed current DNS query record classes Ta and Tb by using the following formula:

where S_ab represents the domain name similarity between the DNS query record classes Ta and Tb, K_Ta represents the keyword set of the DNS query record class Ta, and K_Tb represents the keyword set of the DNS query record class Tb; the keyword sets of the DNS query record classes Ta and Tb are formed by splitting the domain names in Ta and Tb, respectively, at the dots ("."), with the second-level domain name left unsplit.

Optionally, the second clustering unit is adapted to: extract the network flow features corresponding to each other network traffic record in the other network traffic record set; classify other network traffic records whose six-tuple flow identifier contains domain name information, together with the obtained DNS network traffic class having the same domain name information, into the same other network traffic class, so as to form one or more other network traffic classes; extract the network flow features of the one or more other network traffic classes; calculate, based on the extracted network flow features, the distance between each other network traffic record having a null domain name in its six-tuple flow identifier and each of the one or more other network traffic classes; and merge each other network traffic record having a null domain name in its six-tuple flow identifier into the other network traffic class with the smallest calculated distance, so as to obtain a plurality of final other network traffic classes.

Optionally, the extracted network flow features of the other network traffic records having a null domain name in the six-tuple flow identifier include: f1: capture time of the first message in the flow; f2: total number of messages in the flow; f3: sum of the message lengths in the flow; f4: maximum message length in the flow; f5: minimum message length in the flow; f6: average message length in the flow; f7: variance of the message lengths in the flow; f8: maximum message time interval in the flow; f9: minimum message time interval in the flow; f10: average message time interval in the flow; f11: variance of the message time intervals in the flow.

Optionally, the second clustering unit is adapted to calculate, by using the following formula, the distance between each other network traffic record having a null domain name in its six-tuple flow identifier and each of the one or more other network traffic classes:

where d_p represents the Euclidean distance between the other network traffic record having a null domain name in its six-tuple flow identifier and the p-th other network traffic class C_p among the one or more other network traffic classes, f_null represents the other network traffic record having a null domain name in its six-tuple flow identifier, F_q represents the q-th network flow feature, s represents the number of other network traffic records in the other network traffic class C_p, and f_r represents the r-th other network traffic record in the other network traffic class C_p.

An embodiment of the present invention further provides a computer-readable storage medium storing computer instructions which, when executed, perform the steps of any one of the above mobile application network traffic clustering methods.

An embodiment of the present invention further provides a terminal comprising a memory and a processor, wherein the memory stores computer instructions executable on the processor, and the processor, when running the computer instructions, performs the steps of any one of the above mobile application network traffic clustering methods.

Compared with the prior art, the technical scheme of the embodiment of the invention has the following beneficial effects:

According to the above scheme, the acquired network traffic is preprocessed to generate a DNS traffic record set corresponding to the DNS traffic and an other network traffic record set corresponding to the network traffic other than the DNS traffic; the DNS traffic record set is clustered; the other network traffic record set is clustered by using the obtained DNS network traffic classes; and finally the DNS network traffic classes and the other network traffic classes are merged to obtain the final mobile application network traffic clustering result. The correspondence between network traffic and mobile applications can thus be determined without installing an agent program on the mobile terminal, which reduces the operational complexity of clustering mobile application network traffic and improves the user experience.

Drawings

Fig. 1 is a flowchart of a mobile application network traffic clustering method according to an embodiment of the present invention;

Fig. 2 is a flowchart of clustering DNS traffic in an embodiment of the present invention;

Fig. 3 is a flowchart of further merging the DNS query record classes T1-Tm by calculating domain name similarity;

Fig. 4 is a flowchart of a method for clustering the other network traffic record set by using the obtained DNS network traffic classes in an embodiment of the present invention;

Fig. 5 is a schematic diagram of system deployment when the mobile application network traffic clustering method is applied to cluster the network traffic of a mobile terminal;

Fig. 6 is a schematic structural diagram of a mobile application network traffic clustering device according to an embodiment of the present invention.

Detailed Description

In the technical solution of the embodiment of the invention, the acquired network traffic is preprocessed to generate a DNS traffic record set corresponding to the DNS traffic and an other network traffic record set corresponding to the network traffic other than the DNS traffic; the DNS traffic record set is clustered; the other network traffic record set is clustered by using the obtained DNS network traffic classes; and finally the DNS network traffic classes and the other network traffic classes are merged to obtain the final mobile application network traffic clustering result. The correspondence between network traffic and mobile applications can thus be determined without installing an agent program on the mobile terminal, which reduces the operational complexity of clustering mobile application network traffic and improves the user experience.

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.

Fig. 1 is a flowchart of a mobile application network traffic clustering method according to an embodiment of the present invention. Referring to fig. 1, a mobile application network traffic clustering method may include the following steps:

Step S101: acquiring the network traffic generated by the mobile terminal.

In a specific implementation, the network traffic generated by the mobile terminal can be captured while the mobile terminal is running, for example by router mirroring or the like.

Step S102: preprocessing the acquired network traffic to generate a DNS traffic record set corresponding to the DNS traffic and an other network traffic record set corresponding to the network traffic other than the DNS traffic.

In a specific implementation, the captured network traffic generated by the mobile terminal may be divided into Domain Name System (DNS) traffic and other, non-DNS network traffic.

When performing DNS traffic preprocessing, the DNS traffic may be further divided into DNS query traffic and DNS response traffic, and then the DNS query traffic and the DNS response traffic are preprocessed respectively.

Specifically, for each piece of DNS query traffic, the DNS query content, that is, the queried domain name, is extracted and combined with the capture time of the DNS query traffic to form a DNS query record, expressed as <capture time, domain name>. All DNS query records are arranged in order of capture time to form the DNS query record set, as follows:

<capture time 1, domain name 1>

<capture time 2, domain name 2>

<capture time 3, domain name 3>

……

When the DNS response traffic is preprocessed, the domain name and the corresponding IP address in each DNS response message may be extracted to form a DNS response record, expressed as <domain name, IP address>.
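
The DNS preprocessing described above can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: it assumes the DNS messages have already been captured and decoded into dictionaries with the hypothetical keys `time`, `is_response`, `domain`, and `ips`; packet capture and DNS parsing are out of scope.

```python
from typing import List, Tuple

QueryRecord = Tuple[float, str]    # <capture time, domain name>
ResponseRecord = Tuple[str, str]   # <domain name, IP address>


def preprocess_dns(messages: List[dict]) -> Tuple[List[QueryRecord], List[ResponseRecord]]:
    """Split decoded DNS messages into query records and response records."""
    query_records: List[QueryRecord] = []
    response_records: List[ResponseRecord] = []
    for msg in messages:
        if msg["is_response"]:
            # One response record per resolved IP address.
            for ip in msg.get("ips", []):
                response_records.append((msg["domain"], ip))
        else:
            query_records.append((msg["time"], msg["domain"]))
    # Query records are arranged in order of capture time.
    query_records.sort(key=lambda record: record[0])
    return query_records, response_records
```

Emitting one response record per resolved address is a choice made for this sketch; the text only states that the domain name and the corresponding IP address are extracted.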

When other network traffic is preprocessed, the five-tuple flow identifier <source address, destination address, source port, destination port, upper-layer protocol> may be used as the flow identifier of each piece of other network traffic. Then, whether the destination address in the five-tuple flow identifier has a corresponding domain name may be determined from the DNS response records. When the destination address has corresponding domain name information, that information is appended to the five-tuple flow identifier to form the six-tuple flow identifier <source address, destination address, source port, destination port, upper-layer protocol, domain name>; otherwise, a null domain name (null) is appended to form the six-tuple flow identifier <source address, destination address, source port, destination port, upper-layer protocol, null>. Finally, the capture times and message lengths of the other network traffic sharing the same six-tuple flow identifier are extracted and appended to that identifier in order of capture time, forming the corresponding other network traffic record. Depending on whether the domain name information is null, the resulting other network traffic records take one of the following two forms:

<source address, destination address, source port, destination port, upper-layer protocol, domain name, capture time of message 1, length of message 1, capture time of message 2, length of message 2>, referred to herein as other network traffic records with domain name information in the six-tuple flow identifier;

<source address, destination address, source port, destination port, upper-layer protocol, null, capture time of message 1, length of message 1, capture time of message 2, length of message 2>, referred to herein as other network traffic records with a null domain name in the six-tuple flow identifier.
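
The construction of these records can be sketched in Python as below. It is an illustration under stated assumptions: packets are assumed to be pre-parsed dictionaries with the hypothetical keys `time`, `src`, `dst`, `sport`, `dport`, `proto`, and `length`, and `None` stands in for the null domain name. If several domain names resolve to the same IP address, this sketch simply keeps the last one seen, a simplification the text does not address.

```python
from typing import Dict, List, Optional, Tuple

# <source address, destination address, source port, destination port,
#  upper-layer protocol, domain name or None>
FlowKey = Tuple[str, str, int, int, str, Optional[str]]


def build_flow_records(packets: List[dict],
                       response_records: List[Tuple[str, str]]
                       ) -> Dict[FlowKey, List[Tuple[float, int]]]:
    """Group non-DNS packets into six-tuple flow records."""
    # Reverse lookup built from the <domain name, IP address> response records.
    ip_to_domain = {ip: domain for domain, ip in response_records}
    flows: Dict[FlowKey, List[Tuple[float, int]]] = {}
    for pkt in sorted(packets, key=lambda p: p["time"]):
        domain = ip_to_domain.get(pkt["dst"])  # None acts as the null domain name
        key = (pkt["src"], pkt["dst"], pkt["sport"], pkt["dport"], pkt["proto"], domain)
        # Append <capture time, message length> in order of capture time.
        flows.setdefault(key, []).append((pkt["time"], pkt["length"]))
    return flows
```

Only capture times and lengths are kept per message, which matches the point that payload contents need not be stored.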

As can be seen from the above description, when capturing network traffic generated by the mobile terminal, it is not necessary to store specific communication contents of other network traffic except DNS traffic, so that it is possible to prevent privacy information of the mobile user from being leaked, and ensure data security.

Step S103: clustering the DNS traffic record set to obtain a plurality of DNS network traffic classes corresponding to the DNS traffic.

In a specific implementation, after the DNS traffic record set and the other network traffic record set are formed, the DNS traffic record set may be clustered to obtain a plurality of DNS network traffic classes corresponding to the DNS traffic; for details, see Fig. 2.

Step S104: clustering the other network traffic record set by using the obtained DNS network traffic classes to obtain a plurality of other network traffic classes corresponding to the other network traffic.

In a specific implementation, once the DNS traffic record set and the other network traffic record set have been formed and the clustering of the DNS traffic is complete, the obtained DNS network traffic classes may be used to cluster the other network traffic record set; for details, see Fig. 4.

Step S105: merging the plurality of DNS network traffic classes corresponding to the DNS traffic with the plurality of other network traffic classes corresponding to the other network traffic to obtain a final mobile application network traffic clustering result.

In a specific implementation, once the DNS network traffic classes and the other network traffic classes have been obtained by clustering, they are merged to form the final application network traffic clusters of the mobile terminal.

Fig. 2 shows a flowchart of clustering DNS traffic in an embodiment of the present invention. Referring to fig. 2, the method for clustering DNS traffic in the embodiment of the present invention may specifically include the following operations:

Step S201: taking the first item in the DNS query record set as the first element of the first DNS query record class T1, creating a variable j, and setting the initial value of j to 1.

Step S202: traversing the DNS query record set in order, starting from the second DNS query record, to obtain the traversed current DNS query record Di.

In a specific implementation, traversing is performed on a plurality of DNS query records in the DNS query record set according to the sequence of the capturing time of the DNS query records.

It should be noted that, because the first item in the DNS query record set is initially taken as the first element of the first DNS query record class T1, the traversal of the DNS query record set starts from the second DNS query record, that is, i is an integer greater than or equal to 2.

Step S203: calculating the message capture time difference between the traversed current DNS query record Di and the previous DNS query record Di-1.

In a specific implementation, the traversed current DNS query record Di carries its capture time; subtracting the message capture time of the previous DNS query record Di-1 from it yields the message capture time difference ti between the current DNS query record Di and the previous DNS query record Di-1.

Step S204: judging whether the message capture time difference ti is smaller than a preset time threshold t; when the judgment result is yes, step S205 may be executed; otherwise, step S206 may be executed.

In a specific implementation, the time threshold t may be set according to an actual requirement, such as 1s, and is not limited herein.

Step S205: merging the traversed current DNS query record Di into the DNS query record class in which the previous DNS query record Di-1 is located.

In a specific implementation, when the message capture time difference ti is determined to be smaller than the preset time threshold t, the current DNS query record Di is merged into the DNS query record class in which the previous DNS query record Di-1 is located, that is, into the DNS query record class T_j.

Step S206: setting j to j+1, creating a new DNS query record class T_{j+1}, and taking the traversed current DNS query record Di as the first element of the new DNS query record class T_{j+1}.

In a specific implementation, when the message capture time difference ti is determined to be greater than or equal to the time threshold t, a new DNS query record class T_{j+1} is created, and the traversed current DNS query record Di is taken as the first element of the new DNS query record class T_{j+1}.

Step S207: judging whether all DNS query records in the DNS query record set have been traversed; when the judgment result is yes, the operation may be ended; otherwise, step S208 is executed.

Step S208: acquiring the next DNS query record Di+1 in the DNS query record set.

In a specific implementation, when not all DNS query records in the DNS query record set have been traversed, the next DNS query record Di+1 may be obtained as the traversed current DNS query record Di, and execution restarts from step S203 until all DNS query records in the DNS query record set have been traversed, yielding a plurality of DNS query record classes T1-Tm.
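
Steps S201 to S208 amount to a single pass over the time-sorted query records, opening a new class whenever the gap to the previous record reaches the threshold. A minimal Python sketch follows; the function name `cluster_by_time` and the record layout `(capture_time, domain)` are assumptions for illustration, and the default threshold of 1.0 s mirrors the example value given for t.

```python
from typing import List, Tuple

QueryRecord = Tuple[float, str]  # <capture time, domain name>


def cluster_by_time(query_records: List[QueryRecord],
                    threshold: float = 1.0) -> List[List[QueryRecord]]:
    """Group time-sorted DNS query records into classes T1..Tm."""
    classes: List[List[QueryRecord]] = []
    for i, record in enumerate(query_records):
        if i == 0:
            classes.append([record])        # S201: first record opens T1
            continue
        gap = record[0] - query_records[i - 1][0]   # S203: time difference ti
        if gap < threshold:
            classes[-1].append(record)      # S205: join the previous record's class
        else:
            classes.append([record])        # S206: open a new class
    return classes
```

For example, records captured at 0.0, 0.4, 0.9, 2.5, and 2.6 seconds with a 1.0 s threshold fall into two classes, split at the 1.6 s gap.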

In a specific implementation, when the corresponding DNS query record classes T1-Tm are obtained, the DNS query record classes T1-Tm may be further merged by calculating domain name similarity, which may specifically include the following operations:

Step S301: creating variables a and b, where a and b each range from 1 to m, and setting the initial values of a and b to 1 and 2, respectively.

Step S302: traversing the DNS query record classes T1-Tm to obtain traversed current DNS query record classes Ta and Tb.

In a specific implementation, the initial DNS query record classes Ta and Tb are T1 and T2, respectively.

Step S303: judging whether a DNS query record class Tb exists or not; when the judgment result is yes, step S304 may be performed; otherwise, step S308 may be performed.

Step S304: calculating the domain name similarity between the traversed current DNS query record classes Ta and Tb.

In an embodiment of the present invention, when calculating the domain name similarity between the traversed current DNS query record classes Ta and Tb, all the domain names in the DNS query record classes Ta and Tb are first split at the dots ("."), with the second-level domain name left unsplit, to form keyword sets denoted K_Ta and K_Tb. Then, the domain name similarity between the current DNS query record classes Ta and Tb is calculated by the following formula:

where S_ab represents the domain name similarity between the DNS query record classes Ta and Tb, K_Ta represents the keyword set of the DNS query record class Ta, and K_Tb represents the keyword set of the DNS query record class Tb.

Step S305: judging whether the calculated domain name similarity S_ab is greater than a preset similarity threshold S; when the judgment result is yes, step S306 may be executed; otherwise, step S307 may be executed.
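
The keyword-set construction can be illustrated with the Python sketch below. Two points are assumptions rather than statements from the text: the "second-level domain name" is taken naively as the last two labels (so `baidu.com` stays whole, while multi-part suffixes such as `co.uk` are not handled), and, because the similarity formula is not reproduced here, a Jaccard-style ratio of shared keywords to all keywords is used as a stand-in. The function names are hypothetical.

```python
from typing import Iterable, Set


def keyword_set(domains: Iterable[str]) -> Set[str]:
    """Split domain names at dots, keeping the second-level domain unsplit."""
    keywords: Set[str] = set()
    for domain in domains:
        labels = domain.lower().strip(".").split(".")
        if len(labels) <= 2:
            keywords.add(".".join(labels))
        else:
            keywords.update(labels[:-2])          # sub-domain labels
            keywords.add(".".join(labels[-2:]))   # e.g. "baidu.com" kept whole
    return keywords


def domain_similarity(domains_a: Iterable[str], domains_b: Iterable[str]) -> float:
    """ASSUMED form of S_ab: |K_Ta & K_Tb| / |K_Ta | K_Tb| (Jaccard index)."""
    k_a, k_b = keyword_set(domains_a), keyword_set(domains_b)
    union = k_a | k_b
    return len(k_a & k_b) / len(union) if union else 0.0
```

Under these assumptions, `api.map.baidu.com` yields the keywords `api`, `map`, and `baidu.com`, and its similarity to `map.baidu.com` is 2/3, which would exceed the example threshold of 0.3.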

In a specific implementation, the similarity threshold may be set according to an actual requirement, such as 0.3, and is not limited herein.

Step S306: merging the current DNS query record classes Ta and Tb to obtain a merged DNS query record class Ta.

In a specific implementation, when the domain name similarity is determined to be greater than the preset similarity threshold, the current DNS query record classes Ta and Tb are merged, that is, the content of the DNS query record class Tb is copied into the DNS query record class Ta and the DNS query record class Tb is deleted, yielding the merged DNS query record class Ta.

Step S307: keeping the current DNS query record classes Ta and Tb unchanged.

In specific implementation, when the domain name similarity is determined to be less than or equal to the similarity threshold, the DNS query record classes Ta and Tb do not need to be merged, and the DNS query record classes Ta and Tb are kept unchanged.

Step S308: judging whether b is less than or equal to m; when the judgment result is yes, step S309 may be executed; otherwise, step S310 may be executed.

Step S309: setting b to b+1, and executing from step S303.

Step S310: setting a to a+1, repeatedly if necessary, until a DNS query record class Ta exists.

In a specific implementation, after the merge and delete operations between DNS query record classes performed in the foregoing steps, the DNS query record class Ta obtained by setting a to a+1 may or may not exist, so it may first be determined whether the DNS query record class Ta exists. When Ta exists, step S311 may be executed; otherwise, a continues to be incremented by 1 until the DNS query record class Ta exists.

Step S311: when a DNS query record class Ta exists, judging whether a is equal to m or not; when the judgment result is yes, ending the operation; otherwise, step S312 may be performed.

In a specific implementation, when a is equal to m, the last DNS query record class Tm among the DNS query record classes T1-Tm has been reached, and no other DNS query record class remains to be compared with Tm for domain name similarity; that is, the clustering of the DNS query record classes T1-Tm is complete, so the operation may be ended.

Step S312: setting b to a+1, and executing from step S303.

In a specific implementation, when the DNS query record class Ta exists and a is smaller than m, the domain name similarity between Ta and each DNS query record class arranged after it continues to be calculated to determine whether to merge, that is, execution restarts from step S303.

In a specific implementation, once the final DNS query record classes have been obtained, each DNS response record in the DNS response record set is classified into the DNS query record class in which the corresponding DNS query record is located, so that the plurality of final DNS network traffic classes are obtained.
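
The pairwise merging loop of steps S301 to S312 can be sketched in Python as follows. This is an illustrative reading of the flow, not the patented implementation: classes are represented simply as lists of domain names, deleted classes are tracked with a flag list instead of being removed in place, and the similarity measure is passed in as a caller-supplied function because its exact formula is not assumed here. The name `merge_classes` and the default threshold of 0.3 (the example value given above) are illustrative.

```python
from typing import Callable, List


def merge_classes(classes: List[List[str]],
                  similarity: Callable[[List[str], List[str]], float],
                  threshold: float = 0.3) -> List[List[str]]:
    """Merge DNS query record classes T1..Tm whose similarity exceeds the threshold."""
    merged = [list(c) for c in classes]   # work on copies, leave the input untouched
    exists = [True] * len(merged)         # False marks a class deleted by a merge
    for a in range(len(merged)):
        if not exists[a]:                 # S310: advance a until Ta exists
            continue
        for b in range(a + 1, len(merged)):
            if not exists[b]:             # S303: Tb no longer exists, move on
                continue
            if similarity(merged[a], merged[b]) > threshold:   # S305
                merged[a].extend(merged[b])                    # S306: copy Tb into Ta
                exists[b] = False                              # ... and delete Tb
    return [c for c, alive in zip(merged, exists) if alive]
```

Note that Ta grows as classes are merged into it, so later comparisons in the same pass use the enlarged class, which is how the step description reads.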

Fig. 4 is a flowchart illustrating a method for clustering the other network traffic record sets by using the obtained multiple DNS network traffic classes in the embodiment of the present invention. Referring to fig. 4, in the embodiment of the present invention, a method for clustering the other network traffic record sets by using the obtained multiple DNS network traffic classes may specifically include the following steps:

Step S401: extracting the network flow features corresponding to each other network traffic record in the other network traffic record set.

In a specific implementation, the corresponding network flow features are extracted for each other network traffic record whose six-tuple traffic identifier carries a null domain name. In an embodiment of the present invention, the network flow features of such records include the following eleven items:

f1: capturing time of a first message in a stream;

f2: the total number of messages in the stream;

f3: the sum of the length of the messages in the stream;

f4: maximum message length in the stream;

f5: minimum message length in the stream;

f6: average message length value in the flow;

f7: variance of length values of messages in the stream;

f8: maximum message time interval in the stream;

f9: minimum message time interval in the stream;

f10: average message time interval in the stream;

f11: the variance of the message time intervals in the stream.

Those skilled in the art will appreciate that the network flow features of other network traffic records with a null domain name may include more or fewer items than the eleven listed above, and may be set according to actual needs without limitation.
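The eleven features f1-f11 can be computed from the per-message capture times and lengths of one flow, as in the following sketch. The function name `flow_features`, the use of population variance, and the choice to treat a single-message flow as having one interval of zero are assumptions; the source does not specify them.

```python
from statistics import mean, pvariance
from typing import Dict, List, Tuple


def flow_features(packets: List[Tuple[float, int]]) -> Dict[str, float]:
    """packets: non-empty list of (capture_time_seconds, length) in capture order."""
    times = [t for t, _ in packets]
    lengths = [n for _, n in packets]
    # Time intervals between consecutive messages; assumed [0.0] for one message.
    gaps = [later - earlier for earlier, later in zip(times, times[1:])] or [0.0]
    return {
        "f1": times[0],             # capture time of the first message
        "f2": float(len(packets)),  # total number of messages
        "f3": float(sum(lengths)),  # sum of message lengths
        "f4": float(max(lengths)),  # maximum message length
        "f5": float(min(lengths)),  # minimum message length
        "f6": mean(lengths),        # average message length
        "f7": pvariance(lengths),   # variance of message lengths (population)
        "f8": max(gaps),            # maximum message time interval
        "f9": min(gaps),            # minimum message time interval
        "f10": mean(gaps),          # average message time interval
        "f11": pvariance(gaps),     # variance of message time intervals
    }
```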

Step S402: classifying the other network traffic records whose six-tuple traffic identifier carries domain name information, and whose domain name information is the same as that of one of the obtained DNS network traffic classes, into one class, so as to form one other network traffic class and thereby obtain one or more other network traffic classes.

Before the other network traffic records are clustered, the other network traffic record set can be divided into two subsets according to whether the six-tuple traffic identifier carries domain name information. Records whose identifier carries domain name information, of the form <source address, destination address, source port, destination port, upper-layer protocol, domain name, capture time of message 1, length of message 1, capture time of message 2, length of message 2, ...>, form one subset; records whose identifier carries a null domain name, of the form <source address, destination address, source port, destination port, upper-layer protocol, null, capture time of message 1, length of message 1, capture time of message 2, length of message 2, ...>, form the other subset.

When clustering is performed on other network traffic record sets, each other network traffic record in other network traffic record subsets having domain name information in the six-element group traffic identifier may be re-partitioned according to the above clustering result of DNS traffic. Specifically, the domain name information in each of the other network traffic records with the domain name information is respectively compared with the obtained domain name information in the plurality of DNS network traffic classes, so that the other network traffic records having the same domain name information as the DNS traffic records in the DNS network traffic class are classified into the same other network traffic class, and thus, more than one corresponding other network traffic classes are obtained.
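The domain-based assignment of step S402 can be sketched as below. The record layout (a dict with a `domain` key that is `None` for a null domain name), the function name `assign_by_domain`, and the class labels are hypothetical. Records whose domain matches no DNS class are also left pending here, which is an assumption, since the source does not describe that case.

```python
from typing import Dict, List, Optional, Set, Tuple

Flow = Dict[str, Optional[str]]


def assign_by_domain(flows: List[Flow],
                     dns_classes: Dict[str, Set[str]]
                     ) -> Tuple[Dict[str, List[Flow]], List[Flow]]:
    """Group flows by the DNS class that contains their domain name."""
    # Reverse index: domain name -> label of the DNS class that holds it.
    domain_to_class = {d: label for label, domains in dns_classes.items()
                       for d in domains}
    classes: Dict[str, List[Flow]] = {}
    pending: List[Flow] = []   # null-domain (or unmatched) flows, handled later
    for flow in flows:
        label = domain_to_class.get(flow.get("domain"))
        if label is None:
            pending.append(flow)
        else:
            classes.setdefault(label, []).append(flow)
    return classes, pending
```

The pending list corresponds to the null-domain subset that is assigned afterwards by feature distance.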

Step S403: extracting the network flow features of each of the one or more other network traffic classes.

In a specific implementation, each network flow feature of an other network traffic class is the average of that feature over the other network traffic records in the class. The per-record network flow features extracted in step S401 can therefore be used, and the network flow features of each of the one or more other network traffic classes are calculated by the following formula:

$$F_q = \frac{1}{s}\sum_{r=1}^{s} f_{r,q}$$

where $F_q$ denotes the q-th network flow feature of the p-th other network traffic class $C_p$ among the one or more other network traffic classes, $s$ denotes the number of other network traffic records in $C_p$, $f_r$ denotes the r-th other network traffic record in $C_p$, and $f_{r,q}$ denotes the q-th network flow feature of $f_r$.

Step S404: calculating, based on the extracted network flow features, the distance between each other network traffic record whose six-tuple traffic identifier carries a null domain name and each of the one or more other network traffic classes.

In an embodiment of the present invention, the following formula is adopted to calculate the distance between an other network traffic record whose six-tuple traffic identifier carries a null domain name and each of the one or more other network traffic classes:

$$d_p = \sqrt{\sum_{q}\left(f_{\mathrm{null},q} - F_q\right)^2}$$

where $d_p$ denotes the Euclidean distance between the other network traffic record with a null domain name in the six-tuple traffic identifier and the p-th other network traffic class $C_p$ among the one or more other network traffic classes, $f_{\mathrm{null}}$ denotes the other network traffic record with a null domain name in the six-tuple traffic identifier, $f_{\mathrm{null},q}$ denotes its q-th network flow feature, $F_q$ denotes the q-th network flow feature of $C_p$, and the sum runs over the extracted network flow features.

Of course, other distance calculation methods in the prior art may also be adopted to obtain the distances between the other network traffic records with a null domain name in the six-tuple traffic identifier and the one or more other network traffic classes.

Step S405: merging each other network traffic record whose six-tuple traffic identifier carries a null domain name into the other network traffic class corresponding to the minimum of the calculated distances, to obtain the final other network traffic classes.

In a specific implementation, once the distance between each other network traffic record with a null domain name and each of the one or more other network traffic classes has been calculated, the other network traffic class corresponding to the minimum distance is taken as the class to which that record belongs, so as to obtain the final other network traffic classes.
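The three steps above (class feature averages, Euclidean distance, assignment to the nearest class) can be sketched as follows. Feature vectors are plain lists of numbers, the function names are hypothetical, and the features are used unscaled because the source does not describe any normalization.

```python
from math import sqrt
from typing import Dict, List, Sequence


def class_mean(records: List[Sequence[float]]) -> List[float]:
    """Average each feature over the records of one class (non-empty list)."""
    s = len(records)
    return [sum(column) / s for column in zip(*records)]


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def nearest_class(record: Sequence[float],
                  class_means: Dict[str, Sequence[float]]) -> str:
    """Return the label of the class whose mean is closest to the record."""
    return min(class_means, key=lambda label: euclidean(record, class_means[label]))
```

For instance, with class means `{"C1": [1, 1], "C2": [5, 5]}` a record `[0, 0]` is assigned to `C1`. These numbers are illustrative only and are not taken from the patent.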

The mobile application network traffic clustering method in the embodiment of the present invention will be described with reference to specific examples.

Referring to fig. 5, an enterprise employee runs a mobile office application AppN on his or her own mobile intelligent terminal, and AppN connects to the company server company.com. Meanwhile, the device is infected with a malicious program AppM, which establishes a network connection with the server malware.com. AppM is a background program, and the user cannot observe its running from the interface of the mobile terminal.

When the method in the embodiment of the present invention is used for clustering mobile applications, in the system deployment diagram shown in fig. 5, network traffic generated by a mobile terminal is captured through a router mirroring function.

Next, traffic preprocessing is performed on the captured network traffic. The DNS traffic record set and other network traffic record sets obtained by preprocessing the captured 4 DNS requests, 4 DNS responses, and other 6 TCP flows are as follows:

Then, DNS traffic clustering is performed. The time threshold t is set to 1 second, and according to the differences in capture time, the clustering result is: T1 = {2018.1.28 9:25:30, work.company.com}, T2 = {2018.1.28 9:25:32, control.malware.com}, T3 = {2018.1.28 9:25:35, person.company.com}, and T4 = {2018.1.28 9:25:45, data.malware.com}.
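The time-threshold step can be sketched as follows, taking the four example queries as captured at 9:25:30, 9:25:32, 9:25:35 and 9:25:45 on 2018.1.28 (this reading of the timestamps is an assumption). The function name `cluster_by_time` and the tuple layout are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Query = Tuple[datetime, str]


def cluster_by_time(queries: List[Query], threshold: timedelta) -> List[List[Query]]:
    """Start a new class whenever the gap to the previous query reaches the threshold."""
    classes: List[List[Query]] = []
    previous = None
    for when, domain in queries:
        if previous is not None and when - previous < threshold:
            classes[-1].append((when, domain))   # same class as the previous query
        else:
            classes.append([(when, domain)])     # gap >= threshold: new class
        previous = when
    return classes


queries = [
    (datetime(2018, 1, 28, 9, 25, 30), "work.company.com"),
    (datetime(2018, 1, 28, 9, 25, 32), "control.malware.com"),
    (datetime(2018, 1, 28, 9, 25, 35), "person.company.com"),
    (datetime(2018, 1, 28, 9, 25, 45), "data.malware.com"),
]
initial_classes = cluster_by_time(queries, timedelta(seconds=1))
```

With a 1-second threshold every gap (2, 3 and 10 seconds) is at least the threshold, so each query starts its own class, giving the four classes T1-T4.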

The domain names in T1, T2, T3 and T4 are split on the dot symbol ".", with the second-level domain name left unsplit, to form keyword sets: K_T1 = {work, company.com}, K_T2 = {control, malware.com}, K_T3 = {person, company.com}, and K_T4 = {data, malware.com}.

Next, the similarity threshold is set to 0.3.

The domain name similarity between T1 and T2 is calculated as |{work, company.com} ∩ {control, malware.com}| / |{control, malware.com}| = 0/2 = 0, which is less than the threshold, so T1 and T2 are not merged.

The domain name similarity between T1 and T3 is calculated as 1/2 = 0.5, which is greater than the threshold, so T3 is merged into T1 and then deleted. The keyword set of the merged class T1 becomes K_T1 = {work, person, company.com}.

The domain name similarity between the merged T1 and T4 is calculated as 0, which is less than the threshold, so T1 and T4 are not merged.

Because T3 has been deleted, the domain name similarity between T2 and T4 is calculated next. The result is 1/2 = 0.5, which is greater than the threshold, so T4 is merged into T2 and then deleted. The keyword set of the merged class T2 is K_T2 = {control, data, malware.com}.

The calculation is then finished, and only the classes T1 and T2 remain: T1 = {2018.1.28 9:25:30, work.company.com; 2018.1.28 9:25:35, person.company.com} and T2 = {2018.1.28 9:25:32, control.malware.com; 2018.1.28 9:25:45, data.malware.com}.
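The keyword sets and similarity values of this example can be reproduced with the sketch below. Two points are assumptions: the second-level domain is taken to be the last two labels of the name (which would not hold for suffixes such as co.uk), and the denominator is the size of the second keyword set, as in the worked computation for T1 and T2. The function names are hypothetical.

```python
from typing import Set


def keywords(domain: str) -> Set[str]:
    """Split a domain on dots, keeping the last two labels together."""
    labels = domain.split(".")
    if len(labels) <= 2:
        return {domain}
    # Leading labels become individual keywords; the tail stays unsplit.
    return set(labels[:-2]) | {".".join(labels[-2:])}


def similarity(k_a: Set[str], k_b: Set[str]) -> float:
    """Share of the keywords of the second class that also occur in the first."""
    return len(k_a & k_b) / len(k_b)


k_t1 = keywords("work.company.com")
k_t2 = keywords("control.malware.com")
k_t3 = keywords("person.company.com")
k_t4 = keywords("data.malware.com")

s_12 = similarity(k_t1, k_t2)          # no shared keyword
s_13 = similarity(k_t1, k_t3)          # company.com is shared
s_14 = similarity(k_t1 | k_t3, k_t4)   # merged T1 against T4
s_24 = similarity(k_t2, k_t4)          # malware.com is shared
```

Against the threshold of 0.3, only `s_13` and `s_24` exceed it, which is why T3 is merged into T1 and T4 into T2.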

Each DNS response record is then merged directly into class T1 or T2 according to its domain name.

Then, clustering of the other network traffic is performed. According to the DNS traffic clustering result:

f1 = {192.168.1.100, 3124, work.company.com, 80, tcp, 2018.1.28 9:25:31, 100, 2018.1.28 9:25:32, 200, ...} and f5 = {192.168.1.100, 3128, person.company.com, 80, tcp, 2018.1.28 9:25:37, 150, 2018.1.28 9:25:39, 200, ...} are classified as C1;

f2 = {192.168.1.100, 3125, control.malware.com, 8080, tcp, 2018.1.28 9:25:32, 500, 2018.1.28 9:25:33, 1500, ...} and f4 = {192.168.1.100, 3127, data.malware.com, 80, tcp, 2018.1.28 9:25:35, 500, 2018.1.28 9:25:36, 1500, ...} are classified as C2.

The average value of the network flow characteristics of C1 and C2 was calculated, and the calculation results and the network traffic flow characteristic values of f3 and f6 were as follows:

The distances from f3 and f6 to C1 and C2 are then calculated. The distances from f3 to C1 and C2 are 173 and 1277, respectively, and the distances from f6 to C1 and C2 are 824 and 419, respectively, so f3 is clustered into class C1 and f6 into class C2.

Finally, the clustering results of the DNS traffic and the other network traffic are merged to form two classes, namely:

Class 1 = {2018.1.28 9:25:30, work.company.com; 2018.1.28 9:25:35, person.company.com; work.company.com, 10.3.125.6; person.company.com, 10.3.125.87; 192.168.1.100, 3124, work.company.com, 80, tcp, 2018.1.28 9:25:31, 100, 2018.1.28 9:25:32, 200, ...; 192.168.1.100, 3128, person.company.com, 80, tcp, 2018.1.28 9:25:37, 150, 2018.1.28 9:25:39, 200, ...; 192.168.1.100, 3126, 10.3.245.8, 443, tcp, 2018.1.28 9:25:33, 200, 2018.1.28 9:25:34, 500, ...}.

Class 2 = {2018.1.28 9:25:32, control.malware.com; 2018.1.28 9:25:45, data.malware.com; control.malware.com, 183.45.6.8; data.malware.com, 183.45.6.9; 192.168.1.100, 3125, control.malware.com, 8080, tcp, 2018.1.28 9:25:32, 500, 2018.1.28 9:25:33, 1500, ...; 192.168.1.100, 3127, data.malware.com, 80, tcp, 2018.1.28 9:25:35, 500, 2018.1.28 9:25:36, 1500, ...; 192.168.1.100, 3129, 143.5.8.10, 443, tcp, 2018.1.28 9:25:55, 400, 2018.1.28 9:25:56, 1500, ...}.

According to the clustering result, in addition to the mobile office application AppN, another application program is running in the background, and its domain names are control.malware.com and data.malware.com. The information contained in Class 2 may provide a basis for further malicious mobile application detection.

The method in the embodiment of the present invention is described in detail above, and the apparatus corresponding to the method will be described below.

Fig. 6 shows a structure of a mobile application network traffic clustering apparatus in an embodiment of the present invention. Referring to fig. 6, the apparatus 60 may include an obtaining unit 601, a preprocessing unit 602, a first clustering unit 603, a second clustering unit 604, and a merging unit 605, wherein:

the obtaining unit 601 is adapted to obtain network traffic generated by the mobile terminal.

The preprocessing unit 602 is adapted to preprocess the acquired network traffic, and generate a DNS traffic record set corresponding to the DNS traffic and other network traffic record sets corresponding to other network traffic except the DNS traffic.

The first clustering unit 603 is adapted to cluster the DNS traffic record set to obtain a plurality of DNS network traffic classes corresponding to the DNS traffic.

The second clustering unit 604 is adapted to cluster the other network traffic record sets by using the obtained multiple DNS network traffic classes to obtain multiple other network traffic classes corresponding to the other network traffic.

The merging unit 605 is adapted to merge the plurality of DNS network traffic classes corresponding to the DNS traffic with the plurality of other network traffic classes corresponding to the other network traffic to obtain a final mobile application network traffic clustering result.

In a specific implementation, the DNS network traffic includes DNS response traffic and DNS query traffic. The preprocessing unit 602 is adapted to traverse the DNS traffic in the acquired network traffic item by item; when the traversed current DNS traffic is determined to be DNS response traffic, extract the IP address and the domain name in the current DNS traffic to form a corresponding response record; when the traversed current DNS traffic is determined to be DNS query traffic, extract the domain name and the capture time in the current DNS traffic to form a corresponding query record; and acquire the next DNS traffic in the network traffic until all DNS traffic in the network traffic has been traversed, so as to generate the DNS traffic record set.
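A minimal sketch of this DNS preprocessing follows, using the pairing stated in claim 1: responses yield (domain name, IP address) response records and queries yield (capture time, domain name) query records. The packet representation (a dict with `is_response`, `domain`, `time` and `ip` keys) and the function name are hypothetical, and real DNS parsing is deliberately left out.

```python
from typing import Dict, List, Tuple, Union

Packet = Dict[str, Union[bool, float, str]]


def build_dns_records(dns_packets: List[Packet]
                      ) -> Tuple[List[Tuple[float, str]], List[Tuple[str, str]]]:
    """Split already-parsed DNS packets into query records and response records."""
    queries: List[Tuple[float, str]] = []    # (capture time, domain name)
    responses: List[Tuple[str, str]] = []    # (domain name, IP address)
    for pkt in dns_packets:
        if pkt["is_response"]:
            responses.append((pkt["domain"], pkt["ip"]))
        else:
            queries.append((pkt["time"], pkt["domain"]))
    return queries, responses
```

The response records give the IP-to-domain mapping that can later be used to fill the domain name field of the six-tuple traffic identifier.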

In a specific implementation, the preprocessing unit 602 is further adapted to traverse the other network traffic item by item, and set, for each item of the other network traffic, a corresponding six-tuple traffic identifier comprising a source address, a destination address, a source port, a destination port, a domain name/null domain name, and an upper-layer protocol. When the destination address in the six-tuple traffic identifier has a corresponding domain name, the domain name/null domain name field holds the corresponding domain name information; otherwise, the field holds a null domain name. The message capture time and message length of the other network traffic having the same six-tuple traffic identifier are appended in sequence to the corresponding six-tuple traffic identifier, so as to obtain the other network traffic record sets.
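The construction of the other network traffic records can be sketched as below. The packet field names, the `ip_to_domain` lookup (assumed to be built from the DNS response records) and the function name are hypothetical; `None` stands for the null domain name.

```python
from typing import Dict, List, Optional, Tuple

SixTuple = Tuple[str, str, int, int, Optional[str], str]


def build_flow_records(packets: List[dict],
                       ip_to_domain: Dict[str, str]
                       ) -> Dict[SixTuple, List[Tuple[float, int]]]:
    """Group packets by six-tuple and append (capture time, length) in order."""
    flows: Dict[SixTuple, List[Tuple[float, int]]] = {}
    for pkt in packets:
        key = (pkt["src"], pkt["dst"], pkt["sport"], pkt["dport"],
               ip_to_domain.get(pkt["dst"]),   # None when no domain is known
               pkt["proto"])
        flows.setdefault(key, []).append((pkt["time"], pkt["length"]))
    return flows
```

Each dictionary entry corresponds to one record: the key is the six-tuple traffic identifier and the value is the ordered list of message capture times and lengths.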

In an embodiment of the present invention, the first clustering unit 603 is adapted to: create a variable j, set the initial value of j to 1, and use the first item in the DNS query record set as the first element of a first DNS query record class T1; traverse the DNS query record set in order to obtain the traversed current DNS query record Di; calculate the message capture time difference between the current DNS query record Di and the previous DNS query record Di-1; when the message capture time difference is determined to be smaller than a preset time threshold, merge the current DNS query record Di into the DNS query record class in which the previous DNS query record Di-1 is located; when the message capture time difference is determined to be greater than or equal to the time threshold, create a new DNS query record class and use the current DNS query record Di as the first element of the new class; obtain the next DNS query record Di+1 in the DNS query record set until all DNS query records in the set have been traversed, so as to obtain a plurality of DNS query record classes T1-Tm; create variables a and b, each ranging from 1 to m, with initial values 1 and 2 respectively; traverse the DNS query record classes T1-Tm to obtain the traversed current DNS query record classes Ta and Tb; when it is determined that the DNS query record class Tb exists, calculate the domain name similarity between Ta and Tb; when the domain name similarity is determined to be greater than a preset similarity threshold, merge Ta and Tb to obtain a merged DNS query record class Ta; when the domain name similarity is determined to be smaller than or equal to the similarity threshold, keep Ta and Tb unchanged; then set b to b + 1 and repeat the judging and calculating steps until b equals m; set a to a + 1, repeating until a DNS query record class Ta exists, set b to a + 1, and repeat the judging and calculating steps until b equals m; repeat the above steps until a equals m, so as to obtain a plurality of final DNS query record classes; and classify each DNS response record in the DNS response record set into the DNS query record class in which its corresponding DNS query record is located, so as to obtain the plurality of DNS network traffic classes.

In an embodiment of the present invention, the first clustering unit 603 is adapted to calculate the domain name similarity between the traversed current DNS query record classes Ta and Tb by using the following formula:

where S_ab denotes the domain name similarity between the DNS query record classes Ta and Tb, K_Ta denotes the keyword set of the DNS query record class Ta, and K_Tb denotes the keyword set of the DNS query record class Tb. The keyword sets of Ta and Tb are formed by splitting the domain names in Ta and Tb on the dot symbol while leaving the second-level domain names unsplit.

In an embodiment of the present invention, the second clustering unit 604 is adapted to: extract the network flow features corresponding to each other network traffic record in the other network traffic record set; classify the other network traffic records that carry domain name information in the six-tuple traffic identifier, and that have the same domain name information as one of the obtained DNS network traffic classes, into the same other network traffic class, so as to form one or more other network traffic classes; extract the network flow features of each of the one or more other network traffic classes; calculate, based on the extracted network flow features, the distances between each other network traffic record with a null domain name in the six-tuple traffic identifier and the one or more other network traffic classes; and merge each other network traffic record with a null domain name in the six-tuple traffic identifier into the other network traffic class corresponding to the minimum of the calculated distances, so as to obtain a plurality of final other network traffic classes.

In an embodiment of the present invention, the network flow features extracted for other network traffic records with a null domain name in the six-tuple traffic identifier include: f1: capture time of the first message in the flow; f2: the total number of messages in the flow; f3: the sum of the message lengths in the flow; f4: maximum message length in the flow; f5: minimum message length in the flow; f6: average message length in the flow; f7: variance of the message lengths in the flow; f8: maximum message time interval in the flow; f9: minimum message time interval in the flow; f10: average message time interval in the flow; f11: variance of the message time intervals in the flow.

In an embodiment of the present invention, the second clustering unit 604 is adapted to calculate the distances between the other network traffic records with a null domain name in the six-tuple traffic identifier and the one or more other network traffic classes by using the following formulas:

$$d_p = \sqrt{\sum_{q}\left(f_{\mathrm{null},q} - F_q\right)^2}$$

and

$$F_q = \frac{1}{s}\sum_{r=1}^{s} f_{r,q}$$

where $d_p$ denotes the Euclidean distance between the other network traffic record with a null domain name in the six-tuple traffic identifier and the p-th other network traffic class $C_p$ among the one or more other network traffic classes, $f_{\mathrm{null}}$ denotes the other network traffic record with a null domain name in the six-tuple traffic identifier, $f_{\mathrm{null},q}$ denotes its q-th network flow feature, $F_q$ denotes the q-th network flow feature of $C_p$, $s$ denotes the number of other network traffic records in $C_p$, $f_r$ denotes the r-th other network traffic record in $C_p$, and $f_{r,q}$ denotes the q-th network flow feature of $f_r$.

The embodiment of the invention also provides a computer readable storage medium, which stores computer instructions, and the computer instructions execute the steps of the mobile application network flow clustering method when running. Please refer to the description of the relevant parts herein before for the steps of the mobile application network traffic clustering method, which is not described again.

The embodiment of the invention also provides a terminal, which comprises a memory and a processor, wherein the memory is stored with a computer instruction capable of running on the processor, and the processor executes the steps of the mobile application network flow clustering method when running the computer instruction. Please refer to the description of the relevant parts herein before for the steps of the mobile application network traffic clustering method, which is not described again.

In the above scheme of the embodiment of the present invention, the acquired network traffic is preprocessed to generate a DNS traffic record set corresponding to the DNS traffic and other network traffic record sets corresponding to the network traffic other than the DNS traffic. The DNS traffic record set is clustered, the other network traffic record sets are clustered by using the obtained DNS network traffic classes, and the DNS network traffic classes are finally merged with the other network traffic classes to obtain the final mobile application network traffic clustering result. The correspondence between network traffic and mobile applications can thus be determined without installing a proxy program on the mobile terminal, which reduces the operational complexity of clustering mobile application network traffic and improves the user experience.

Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may include ROM, RAM, magnetic disks, optical disks, and the like.

Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications may be effected therein by one skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A mobile application network traffic clustering apparatus, comprising:

the preprocessing unit is adapted to preprocess the acquired network traffic to generate a DNS traffic record set corresponding to the DNS traffic and other network traffic record sets corresponding to other network traffic except the DNS traffic; the preprocessing unit is adapted to traverse the DNS traffic in the acquired network traffic item by item; when the traversed current DNS traffic is determined to be DNS response traffic, extract the IP address and the domain name in the current DNS traffic to form a corresponding response record; when the traversed current DNS traffic is determined to be DNS query traffic, extract the domain name and the capture time in the current DNS traffic to form a corresponding query record; acquire the next DNS traffic in the network traffic until all DNS traffic in the network traffic has been traversed, and generate the DNS traffic record set; the preprocessing unit is further adapted to traverse the other network traffic item by item, and set, for each item of the other network traffic, a corresponding six-tuple traffic identifier comprising a source address, a destination address, a source port, a destination port, a domain name/null domain name and an upper-layer protocol; when the destination address in the six-tuple traffic identifier has a corresponding domain name, the domain name/null domain name field is the corresponding domain name information; otherwise, the domain name/null domain name field is a null domain name; and append, in sequence, the message capture time and the message length of the other network traffic having the same six-tuple traffic identifier to the corresponding six-tuple traffic identifier, to obtain the other network traffic record sets;

the second clustering unit is suitable for clustering the other network traffic record sets by adopting the obtained multiple DNS network traffic classes to obtain multiple other network traffic classes corresponding to the other network traffic; and the merging unit is suitable for merging the plurality of DNS network traffic classes corresponding to the DNS traffic and the plurality of other network traffic classes corresponding to the other network traffic to obtain a final mobile application network traffic clustering result.

2. The mobile application network traffic clustering device according to claim 1, wherein the first clustering unit is adapted to create a variable j, set an initial value of j to 1, and use the first DNS query record in the DNS traffic record set as the first element of a first DNS query record class T1; traverse the DNS query records in the DNS traffic record set in order to obtain the traversed current DNS query record Di; calculate the message capture time difference between the traversed current DNS query record Di and the previous DNS query record Di-1; when the message capture time difference is determined to be smaller than a preset time threshold, merge the traversed current DNS query record Di into the DNS query record class in which the previous DNS query record Di-1 is located; when the message capture time difference is determined to be greater than or equal to the time threshold, create a new DNS query record class and use the traversed current DNS query record Di as the first element of the new DNS query record class; obtain the next DNS query record Di+1 in the DNS traffic record set until all DNS query records in the DNS traffic record set have been traversed, to obtain a plurality of DNS query record classes T1-Tm; create variables a and b, each ranging from 1 to m, with the initial values of a and b set to 1 and 2 respectively; traverse the DNS query record classes T1-Tm to obtain the traversed current DNS query record classes Ta and Tb; judge whether a DNS query record class Tb exists; when it is determined that the DNS query record class Tb exists, calculate the domain name similarity between the traversed current DNS query record classes Ta and Tb; judge whether the domain name similarity between the traversed current DNS query record classes Ta and Tb is greater than a preset similarity threshold; when the domain name similarity is determined to be greater than the preset similarity threshold, merge the current DNS query record classes Ta and Tb to obtain a merged DNS query record class Ta; when the domain name similarity is determined to be smaller than or equal to the similarity threshold, keep the current DNS query record classes Ta and Tb unchanged; judge whether b is less than or equal to m; when b is determined to be smaller than m, set b to b + 1, and resume execution from the judgment of whether a DNS query record class Tb exists, until b equals m; when b is determined to be equal to m, set a to a + 1 until a DNS query record class Ta exists, and judge whether a is smaller than m; when a is determined to be smaller than m, set b to a + 1, and resume execution from the judgment of whether the DNS query record class Tb exists, until a equals m, so as to obtain a plurality of final DNS query record classes; and classify each DNS response record in the DNS traffic record set into the DNS query record class in which its corresponding DNS query record is located, so as to obtain the plurality of DNS network traffic classes.

3. The mobile application network traffic clustering device according to claim 2, wherein the first clustering unit is adapted to calculate the domain name similarity between the traversed current DNS query record classes Ta and Tb by using the following formula:

4. The mobile application network traffic clustering device according to claim 3, wherein the second clustering unit is adapted to extract the network flow features corresponding to each other network traffic record in the other network traffic record set; classify the other network traffic records that carry domain name information in the six-tuple traffic identifier, and that have the same domain name information as one of the obtained DNS network traffic classes, into the same other network traffic class, to form one or more other network traffic classes; extract the network flow features of each of the one or more other network traffic classes; calculate, based on the extracted network flow features, the distances between each other network traffic record with a null domain name in the six-tuple traffic identifier and the one or more other network traffic classes; and merge each other network traffic record with a null domain name in the six-tuple traffic identifier into the other network traffic class corresponding to the minimum of the calculated distances, to obtain a plurality of final other network traffic classes.

5. The mobile application network traffic clustering device according to claim 4, wherein the network flow features of other network traffic records with a null domain name in the six-tuple traffic identifier include: