CN111478861A

CN111478861A - Traffic identification method and device, electronic equipment and storage medium

Info

Publication number: CN111478861A
Application number: CN202010254366.7A
Authority: CN
Inventors: 苑晓鹏; 崔渊博; 周忠义; 傅强; 阿曼太; 梁彧; 田野; 王杰; 杨满智; 蔡琳; 金红; 陈晓光
Original assignee: Eversec Beijing Technology Co Ltd
Current assignee: Eversec Beijing Technology Co Ltd
Priority date: 2020-04-02
Filing date: 2020-04-02
Publication date: 2020-07-31
Anticipated expiration: 2040-04-02
Also published as: CN111478861B

Abstract

Embodiments of the disclosure provide a traffic identification method and device, an electronic device, and a storage medium. The method comprises: acquiring, as an unknown data stream, a data stream in a network whose application program has not been determined; backtracking a plurality of data streams that contain no domain name information and whose application programs have been determined, as a plurality of reference data streams; calculating the similarity between each of the reference data streams and the unknown data stream to determine the maximum similarity and the reference data stream corresponding to it; and, if the maximum similarity is greater than or equal to a predetermined similarity threshold, determining that the unknown data stream belongs to the same application program as the reference data stream corresponding to the maximum similarity. The technical solution of the embodiments can improve the accuracy and precision of identifying malicious traffic.

Description

Traffic identification method and device, electronic equipment and storage medium

Technical Field

Embodiments of the present disclosure relate to the technical field of computer networks, and in particular to a traffic identification method and device, an electronic device, and a storage medium.

Background

Traffic identification aims to identify network traffic in real time at three levels (protocol, application, and web service), classifying it at as fine a granularity as possible and providing a decision reference for network monitoring. On the basis of traffic identification, network monitoring can take a number of measures. Traffic identification can be used for traffic-based charging, improving user experience, and network security assurance. It can also be used in daily operation and maintenance: abnormal changes in network traffic can be discovered as early as possible, so that safeguards can be taken before services are affected.

Current traffic identification techniques include port identification, deep packet inspection (DPI), deep flow inspection, and machine learning or artificial intelligence techniques. Machine learning can extract features directly from raw data, which saves labor, can uncover patterns that are difficult for humans to find, and can handle encrypted traffic. Traffic identification based on machine learning and artificial intelligence is therefore the mainstream direction of current research. However, existing machine-learning-based methods classify traffic relatively coarsely and are not very accurate at identifying malicious traffic.

Disclosure of Invention

In view of this, embodiments of the present disclosure provide a traffic identification method, apparatus, electronic device, and storage medium, so as to improve accuracy and precision of identifying malicious traffic.

Additional features and advantages of the disclosed embodiments will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosed embodiments.

In a first aspect of the present disclosure, an embodiment of the present disclosure provides a traffic identification method, including:

acquiring, as an unknown data stream, a data stream in a network whose application program has not been determined;

backtracking a plurality of data streams that contain no domain name information and whose application programs have been determined, as a plurality of reference data streams;

calculating the similarity between each of the plurality of reference data streams and the unknown data stream, so as to determine the maximum similarity and the reference data stream corresponding to the maximum similarity;

and, if the maximum similarity is greater than or equal to a predetermined similarity threshold, determining that the application program to which the unknown data stream belongs is the same as the application program of the reference data stream corresponding to the maximum similarity.

In one embodiment, acquiring, as an unknown data stream, a data stream in a network whose application program has not been determined includes:

acquiring a data stream in the network and determining the application program to which it belongs according to predetermined hard-coded rules; if the determination succeeds, marking the data stream with that application program; and if the determination fails, acquiring the data stream as the unknown data stream.

In an embodiment, the method further includes, if the maximum similarity is smaller than the predetermined similarity threshold:

backtracking a plurality of DNS response records and, if they contain at least one DNS record that takes the IP address of the unknown data stream as a destination address, acquiring at least one domain name corresponding to the at least one DNS record;

and backtracking a plurality of data streams that contain the at least one domain name and whose application programs have been determined, calculating the domain name text similarity between each backtracked data stream and the unknown data stream, and, if the domain name text similarity is greater than a second predetermined similarity threshold, determining the application program of the unknown data stream according to the application program of that data stream and the application programs of the plurality of reference data streams.

In one embodiment, determining the application program of the unknown data stream according to the application program of the data stream and the application programs of the plurality of reference data streams comprises:

if, among the plurality of reference data streams, there is at least one reference data stream whose application program is the application program of the data stream, and the similarity between that at least one reference data stream and the unknown data stream is greater than a second predetermined similarity threshold, determining that the application program of the unknown data stream is the application program of the data stream, wherein the second predetermined similarity threshold is less than the predetermined similarity threshold.

In one embodiment, calculating the similarity between the reference data stream and the unknown data stream includes:

calculating a stream feature distance vector between the reference data stream and the unknown data stream;

and inputting the stream characteristic distance vector into a pre-trained stream similarity calculation model, and obtaining the similarity output by the stream similarity calculation model, wherein the similarity is used for representing the probability that two data streams corresponding to the input stream characteristic distance vector belong to the same application program.

In one embodiment, calculating the stream feature distance vector between the reference data stream and the unknown data stream comprises:

calculating a stream feature distance vector between the reference data stream and the unknown data stream according to predetermined stream features of the data stream, wherein the predetermined stream features of the data stream include at least one of:

the median of the uplink packet length sequence of the data stream, the standard deviation of the time interval sequence of the data stream, the median of the downlink packet length sequence of the data stream, the packet lengths of the first N packets of the data stream, and the domain name characteristics of the predetermined field of the data stream.

In an embodiment, the flow similarity calculation model is obtained by training through the following steps:

acquiring a training sample set, wherein the training sample comprises a stream characteristic distance vector between two data streams and marking information used for indicating whether the two data streams belong to the same application program, the marking information is 1 to indicate that the two data streams belong to the same application program, and the marking information is 0 to indicate that the two data streams do not belong to the same application program;

determining an initialized flow similarity calculation model, wherein the initialized flow similarity calculation model comprises a target layer for outputting a probability that two data flows belong to the same application;

and training to obtain the flow similarity calculation model by using a machine learning method and taking the flow characteristic distance vector in the training samples in the training sample set as the input of the initialized flow similarity calculation model, and taking the marking information corresponding to the input flow characteristic distance vector as the expected output of the initialized flow similarity calculation model.

In a second aspect of the present disclosure, an embodiment of the present disclosure further provides a traffic identification device, including:

an unknown stream acquiring unit, configured to acquire, as an unknown data stream, a data stream in a network whose application program has not been determined;

a backtracking unit, configured to backtrack a plurality of data streams that contain no domain name information and whose application programs have been determined, as a plurality of reference data streams;

a similar data stream determining unit, configured to calculate the similarity between each of the plurality of reference data streams and the unknown data stream, so as to determine the maximum similarity and the reference data stream corresponding to the maximum similarity;

and a first determining unit, configured to determine, if the maximum similarity is greater than or equal to a predetermined similarity threshold, that the application program of the unknown data stream is the same as the application program of the reference data stream corresponding to the maximum similarity.

In an embodiment, the unknown stream acquiring unit is configured to:

In an embodiment, the apparatus further includes a second determining unit, where the second determining unit is configured to, if the maximum similarity is smaller than the predetermined similarity threshold:

In an embodiment, the second determining unit is configured to determine the application program of the unknown data stream with the application programs of the plurality of reference data streams according to the application program of the data stream, and includes:

In an embodiment, the calculating the similarity between the reference data stream and the unknown data stream by the similar data stream determining unit includes:

In an embodiment, the similar data stream determining unit is configured to calculate a stream feature distance vector between the reference data stream and the unknown data stream, and includes:

In an embodiment, the flow similarity calculation model is obtained by training through the following modules:

a sample acquisition module, configured to acquire a training sample set, wherein each training sample comprises a stream feature distance vector between two data streams and label information indicating whether the two data streams belong to the same application program, a label of 1 indicating that the two data streams belong to the same application program and a label of 0 indicating that they do not;

a model determination module, configured to determine an initialized flow similarity calculation model, wherein the initialized flow similarity calculation model comprises a target layer for outputting the probability that two data streams belong to the same application program;

and a model training module, configured to train the initialized flow similarity calculation model by a machine learning method to obtain the flow similarity calculation model, taking the stream feature distance vectors of the training samples in the training sample set as input and the label information corresponding to each input stream feature distance vector as the expected output.

In a third aspect of the disclosure, an electronic device is provided. The electronic device includes: a processor; and a memory for storing executable instructions that, when executed by the processor, cause the electronic device to perform the method of the first aspect.

In a fourth aspect of the disclosure, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the method in the first aspect.

The technical scheme provided by the embodiment of the disclosure has the beneficial technical effects that:

According to the embodiments of the disclosure, after the unknown data stream is acquired, a plurality of data streams that contain no domain name information and whose application programs have been determined are traced back as a plurality of reference data streams, and the similarity between each reference data stream and the unknown data stream is calculated to determine the maximum similarity and the reference data stream corresponding to it. If the maximum similarity is greater than or equal to a predetermined similarity threshold, the unknown data stream is determined to belong to the same application program as the reference data stream corresponding to the maximum similarity. In this way, the accuracy and precision of identifying malicious traffic can be improved.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present disclosure, the drawings needed to be used in the description of the embodiments of the present disclosure will be briefly described below, and it is obvious that the drawings in the following description are only a part of the embodiments of the present disclosure, and for those skilled in the art, other drawings can be obtained according to the contents of the embodiments of the present disclosure and the drawings without creative efforts.

Fig. 1 is a schematic flow chart of a traffic identification method provided according to an embodiment of the present disclosure;

Fig. 2 is a schematic flow chart of a method for training a flow similarity calculation model provided according to an embodiment of the present disclosure;

Fig. 3 is a schematic flow chart of another traffic identification method provided according to an embodiment of the present disclosure;

Fig. 4 is a schematic flow chart of yet another traffic identification method provided according to an embodiment of the present disclosure;

Fig. 5 is a schematic structural diagram of a traffic identification device provided according to an embodiment of the present disclosure;

Fig. 6 is a schematic structural diagram of a training apparatus for a flow similarity calculation model provided according to an embodiment of the present disclosure;

Fig. 7 is a schematic structural diagram of another traffic identification device provided according to an embodiment of the present disclosure;

Fig. 8 is a schematic structural diagram of an electronic device suitable for implementing embodiments of the present disclosure.

Detailed Description

In order to make the technical problems solved, technical solutions adopted and technical effects achieved by the embodiments of the present disclosure clearer, the technical solutions of the embodiments of the present disclosure will be described in further detail below with reference to the accompanying drawings, and it is obvious that the described embodiments are only some embodiments, but not all embodiments, of the embodiments of the present disclosure. All other embodiments, which can be obtained by a person skilled in the art without making creative efforts based on the embodiments of the present disclosure, belong to the protection scope of the embodiments of the present disclosure.

It should be noted that the terms "system" and "network" are often used interchangeably in the embodiments of the present disclosure. Reference to "and/or" in embodiments of the present disclosure is meant to include any and all combinations of one or more of the associated listed items. The terms "first", "second", and the like in the description and claims of the present disclosure and in the drawings are used for distinguishing between different objects and not for limiting a particular order.

It should also be noted that, in the embodiments of the present disclosure, each of the following embodiments may be executed alone, or may be executed in combination with each other, and the embodiments of the present disclosure are not limited specifically.

The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.

The technical solutions of the embodiments of the present disclosure are further described by the following detailed description in conjunction with the accompanying drawings.

Fig. 1 shows a schematic flow chart of a traffic identification method provided in an embodiment of the present disclosure. This embodiment is applicable to identifying the application program to which unknown traffic in a network belongs, and the method may be executed by a traffic identification device configured in an electronic device. As shown in Fig. 1, the traffic identification method of this embodiment includes:

In step S110, a data stream in the network whose application program has not been determined is acquired as an unknown data stream.

For example, a data stream in the network may be acquired and the application program to which it belongs determined according to predetermined hard-coded rules. If the determination succeeds, the data stream is marked with that application program; if the determination fails, the data stream is acquired as the unknown data stream.
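By way of non-limiting illustration, the pre-classification of step S110 might be sketched as follows. The disclosure does not specify the content of the hard-coded rules, so the `Flow` fields, the rule table, and the application names below are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Flow:
    dst_port: int
    host: Optional[str] = None  # domain name seen in the flow, if any
    app: Optional[str] = None   # label; None means not yet determined


# Hypothetical hard-coded rules: (predicate, application name).
RULES = [
    (lambda f: f.host is not None and f.host.endswith(".example-video.com"), "ExampleVideo"),
    (lambda f: f.dst_port == 5222, "ExampleChat"),
]


def pre_classify(flows):
    """Label flows matched by a rule; return the unmatched ones as unknown flows."""
    unknown = []
    for flow in flows:
        for predicate, app in RULES:
            if predicate(flow):
                flow.app = app  # determination succeeded: mark the flow
                break
        else:
            unknown.append(flow)  # determination failed: treat as unknown
    return unknown
```

Only the flows returned by `pre_classify` would proceed to the similarity-based steps that follow.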

In step S120, a plurality of data streams that contain no domain name information and whose application programs have been determined are traced back as a plurality of reference data streams.

In step S130, the similarity between each of the plurality of reference data streams and the unknown data stream is calculated to determine the maximum similarity and the reference data stream corresponding to the maximum similarity.

The similarity between a reference data stream and the unknown data stream can be calculated in various ways. For example, a pre-trained stream similarity calculation model may be used: a stream feature distance vector between the reference data stream and the unknown data stream is calculated and input into the pre-trained stream similarity calculation model, and the similarity output by the model is obtained, wherein the similarity represents the probability that the two data streams corresponding to the input stream feature distance vector belong to the same application program.

The stream feature distance vector between the reference data stream and the unknown data stream may be calculated from predetermined stream features of the data streams. The predetermined stream features include the median of the uplink packet length sequence of the data stream, the standard deviation of its time interval sequence, the median of its downlink packet length sequence, the packet lengths of its first N packets, the domain name features of predetermined fields of the data stream, and the like.
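As a non-limiting sketch, a stream feature distance vector could be formed as the element-wise absolute difference of the two streams' features. The disclosure does not fix the distance measure, so the absolute difference, the dictionary keys, and the value of `n_first` are assumptions made for illustration; domain name features are omitted because the reference data streams contain none.

```python
def distance_vector(ref, unk, n_first=3):
    """Element-wise absolute differences between the features of two flows.

    Each flow is a dict of precomputed features; `first_lens` is assumed to
    hold at least `n_first` entries (zero-padded if the flow is shorter).
    """
    scalar_keys = ("up_len_median", "interval_std", "down_len_median")
    vec = [abs(ref[key] - unk[key]) for key in scalar_keys]
    # Compare the lengths of the first N packets position by position.
    vec += [
        abs(a - b)
        for a, b in zip(ref["first_lens"][:n_first], unk["first_lens"][:n_first])
    ]
    return vec
```

With three scalar features and `n_first=3`, the resulting vector has six components, which is the input dimension the similarity model would then expect.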

The flow similarity calculation model may be obtained by training in a number of ways. Fig. 2 shows a schematic flow chart of a training method for the flow similarity calculation model. As shown in Fig. 2, the flow similarity calculation model may be obtained by training through the following steps:

In step S210, a training sample set is obtained, where each training sample includes a stream feature distance vector between two data streams and label information indicating whether the two data streams belong to the same application program, a label of 1 indicating that the two data streams belong to the same application program and a label of 0 indicating that they do not.

In step S220, an initialized flow similarity calculation model is determined, wherein the initialized flow similarity calculation model includes a target layer for outputting a probability that two data flows belong to the same application.

In step S230, the flow similarity calculation model is obtained by training using a machine learning method, with the flow characteristic distance vectors in the training samples in the training sample set as an input of the initialized flow similarity calculation model, and the label information corresponding to the input flow characteristic distance vectors as an expected output of the initialized flow similarity calculation model.
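Steps S210 to S230 might be sketched as below. The disclosure leaves the model type open, so logistic regression trained by stochastic gradient descent is an assumption chosen here for its low complexity, with the sigmoid playing the role of the target layer that outputs a probability. The learning rate, epoch count, and function names are hypothetical, and the distance vectors are assumed to be scaled to a small range.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, labels, lr=0.1, epochs=500):
    """Fit weights and bias by stochastic gradient descent on the log loss.

    samples: list of stream feature distance vectors (equal length).
    labels:  1 if the two streams belong to the same application, else 0.
    """
    dim = len(samples[0])
    w, b = [0.0] * dim, 0.0  # initialized model
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the log loss with respect to z
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def similarity(model, x):
    """Probability that the two streams behind distance vector x share an app."""
    w, b = model
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
```

After training, small distance vectors should map to probabilities near 1 and large ones to probabilities near 0, which is the behavior the similarity threshold in step S140 relies on.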

In step S140, if the maximum similarity is greater than or equal to a predetermined similarity threshold, it is determined that the unknown data stream belongs to the same application program as the reference data stream corresponding to the maximum similarity.

In this embodiment, after the unknown data stream is acquired, a plurality of data streams that contain no domain name information and whose application programs have been determined are traced back as a plurality of reference data streams, and the similarity between each reference data stream and the unknown data stream is calculated to determine the maximum similarity and the reference data stream corresponding to it. If the maximum similarity is greater than or equal to a predetermined similarity threshold, the unknown data stream is determined to belong to the same application program as the reference data stream corresponding to the maximum similarity, so that the accuracy and precision of identifying malicious traffic can be improved.
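Steps S120 to S140 can be summarized in a short sketch. The function and parameter names are hypothetical; `similarity_fn` stands in for whatever similarity calculation is used, such as the pre-trained model described above.

```python
def classify_by_max_similarity(unknown, references, similarity_fn, threshold):
    """Return the app of the most similar reference flow, or None.

    references: identified, domain-free flows, each a dict with an 'app' key.
    similarity_fn(ref, unknown): similarity in [0, 1] between two flows.
    """
    if not references:
        return None
    # Score every reference flow once, then keep the best match.
    scored = [(similarity_fn(ref, unknown), ref) for ref in references]
    best_sim, best_ref = max(scored, key=lambda pair: pair[0])
    if best_sim >= threshold:  # step S140
        return best_ref["app"]
    return None
```

A return value of `None` corresponds to the case in which the unknown data stream cannot be attributed by this step alone.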

Fig. 3 is a schematic flow chart of another traffic identification method provided in an embodiment of the present disclosure. This embodiment is an optimization based on the foregoing embodiment. As shown in Fig. 3, the traffic identification method of this embodiment includes:

In step S310, a data stream in the network whose application program has not been determined is acquired as an unknown data stream. For example, a data stream in the network may be acquired and the application program to which it belongs determined according to predetermined hard-coded rules. If the determination succeeds, the data stream is marked with that application program; if the determination fails, the data stream is acquired as the unknown data stream.

In step S320, a plurality of data streams that contain no domain name information and whose application programs have been determined are traced back as a plurality of reference data streams.

In step S330, similarities between the reference data streams and the unknown data stream are respectively calculated to determine a maximum similarity and a reference data stream corresponding to the maximum similarity.

In step S340, it is determined whether the maximum similarity is smaller than a predetermined similarity threshold; if so, step S360 is performed; otherwise, step S350 is performed.

In step S350, the application program of the unknown data stream is determined to be the same as the application program of the reference data stream corresponding to the maximum similarity.

In step S360, a plurality of DNS response records are traced back, and if they include at least one DNS record that takes the IP address of the unknown data stream as the destination address, at least one domain name corresponding to the at least one DNS record is obtained.

In step S370, a plurality of data streams that contain the at least one domain name and whose application programs have been determined are traced back, and the domain name text similarity between each backtracked data stream and the unknown data stream is calculated. If the domain name text similarity is greater than a second predetermined similarity threshold, the application program of the unknown data stream is determined according to the application program of that data stream and the application programs of the plurality of reference data streams.

Determining the application program of the unknown data stream according to the application program of the data stream and the application programs of the plurality of reference data streams may be implemented as follows: if, among the plurality of reference data streams, there is at least one reference data stream whose application program is the application program of the data stream, and the similarity between that at least one reference data stream and the unknown data stream is greater than a second predetermined similarity threshold, it is determined that the application program of the unknown data stream is the application program of the data stream, wherein the second predetermined similarity threshold is less than the predetermined similarity threshold.

On the basis of the above embodiment, the present embodiment further discloses that, if the maximum similarity is smaller than the predetermined similarity threshold, the unknown stream is further identified by backtracking the DNS response record, so that the identification rate of the unknown data stream can be further improved.
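The DNS backtracking of steps S360 and S370 might look like the following sketch. Several points are assumptions: a DNS record is taken to match when its resolved address equals the IP address of the unknown stream, the text similarity is computed with `difflib.SequenceMatcher` from the Python standard library (the disclosure does not name a measure), and the record and flow dictionaries are hypothetical structures.

```python
from difflib import SequenceMatcher


def domains_for_ip(dns_records, ip):
    """Domain names whose DNS response resolved to the given IP address."""
    return {rec["domain"] for rec in dns_records if rec["ip"] == ip}


def domain_similarity(a, b):
    """Text similarity in [0, 1]; SequenceMatcher is an assumed choice."""
    return SequenceMatcher(None, a, b).ratio()


def candidate_apps(unknown_ip, dns_records, identified_flows, threshold):
    """Apps of identified flows whose domain resembles one resolved for the IP."""
    domains = domains_for_ip(dns_records, unknown_ip)
    apps = set()
    for flow in identified_flows:
        if any(domain_similarity(flow["domain"], d) > threshold for d in domains):
            apps.add(flow["app"])
    return apps
```

Each candidate application returned here would still need to be confirmed against the domain-free reference data streams using the lower second threshold, as described in the preceding paragraph.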

Fig. 4 is a schematic flow chart of yet another traffic identification method provided in an embodiment of the present disclosure. This embodiment is an optimization based on the foregoing embodiments. As shown in Fig. 4, the traffic identification method of this embodiment includes:

In step S401, an unknown flow appears.

In step S402, K identified flows that do not contain domain name information are traced back.

That is, when an unknown flow appears, information on a number of preceding pure TCP or pure UDP flows is traced back. Before this, identified data flow samples of the test APPs need to be collected.

For example, the samples may be generated from the flow records (tickets) produced by DPI playback of pcap packet captures collected while a test APP runs on a mobile device, together with script analysis of those captures. The APP under test is run on the mobile device, and the traffic it generates is captured to form a pcap file. The pcap file is played back through a DPI engine to form the corresponding flow records. Scripts are written to extract flow features such as packet length sequences and time sequences from the pcap file, and these are used in combination with the flow features in the flow records.

In step S403, the flow features are substituted into the model to calculate K similarities.

For example, a flow characteristic distance vector between the reference data flow and the unknown data flow may be calculated, the flow characteristic distance vector is input into a flow similarity calculation model obtained by training according to the machine learning model training method described in fig. 2, and the similarities between the unknown flow and the K identified flows are output through the flow similarity calculation model, so as to determine the probabilities that the unknown flow and the K identified flows belong to the same application APP.

To calculate the flow feature distance vector between two data flows, the flow features obtained from the flow records and the script analysis can be screened. The features come mainly from the packet length sequence and the time interval sequence, from which new statistical features are obtained by simple calculation. The specific features are: the median of the uplink packet length sequence, the standard deviation of the time interval sequence, the mean of the time interval sequence, the median of the downlink packet length sequence, the packet lengths of the first n packets of the flow, and the domain name features of the DNS query and host fields. The uplink direction is the direction in which the local IP sends bytes to the server IP; the downlink direction is the direction in which the server IP sends bytes to the local IP. The aggregate features may be calculated over the first n packets or time intervals, and after they are calculated, zero padding is applied to flows with fewer than n packets.
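As a non-limiting sketch, the statistical features listed above could be derived from a packet sequence as follows. The packet tuple layout, the dictionary keys, the default `n`, and the use of the population standard deviation are assumptions; domain name features are left out of this sketch.

```python
from statistics import mean, median, pstdev


def extract_features(packets, n=5):
    """Compute aggregate features over the first n packets of a flow.

    packets: list of (timestamp, direction, length), direction 'up' or 'down'.
    """
    head = sorted(packets, key=lambda p: p[0])[:n]  # first n packets by time
    up = [length for _, d, length in head if d == "up"]
    down = [length for _, d, length in head if d == "down"]
    times = [t for t, _, _ in head]
    gaps = [b - a for a, b in zip(times, times[1:])]  # time interval sequence
    first_lens = [length for _, _, length in head]
    first_lens += [0] * (n - len(first_lens))  # zero-pad flows shorter than n
    return {
        "up_len_median": median(up) if up else 0,
        "down_len_median": median(down) if down else 0,
        "interval_std": pstdev(gaps) if gaps else 0.0,
        "interval_mean": mean(gaps) if gaps else 0.0,
        "first_lens": first_lens,
    }
```

Zero padding keeps `first_lens` at a fixed length, so that feature vectors of flows with different packet counts remain comparable position by position.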

When the flow similarity calculation model is trained, samples can be collected and labeled in a number of ways. For example, the samples may be generated from the flow records produced by DPI playback of pcap packet captures collected while a test APP runs on a mobile device, together with script analysis of those captures. For example, the APP under test may be run on the mobile device, and the traffic it generates captured to form a pcap file. The pcap file is played back through a DPI engine to form the corresponding flow records. Scripts are written to extract flow features such as packet length sequences and time sequences from the pcap file, and these are used in combination with the flow features in the flow records.

In step S404, it is determined whether the maximum similarity is greater than the threshold Y; if so, step S405 is performed; otherwise, step S406 is performed.

In step S405, a classification result is obtained, and the process ends.

In step S406, N DNS response records are traced back.

In step S407, it is determined whether a DNS record for the IP address exists; if so, step S408 is performed; otherwise, step S412 is performed.

In step S408, M identified flows containing domain name information are traced back.

In step S409, it is determined whether the domain name text similarity is greater than the threshold X; if so, step S410 is performed; otherwise, step S412 is performed.

In step S410, it is determined whether any of the K domain-name-free flows belongs to the APP; if so, step S411 is performed; otherwise, step S412 is performed.

In step S411, it is determined whether the similarity of that domain-name-free flow is greater than the threshold Z; if so, step S405 is performed; otherwise, step S412 is performed.

In step S412, the unknown stream is classified as unidentified.
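Steps S404 to S412 above form a two-stage decision cascade, which can be sketched as follows. This is only an illustration under stated assumptions: the function name `classify_unknown`, the input shapes, and the example threshold values are hypothetical, and the similarities are assumed to have been computed beforehand.

```python
from typing import Optional, Sequence, Tuple

Scored = Tuple[str, float]  # (app label, similarity score)


def classify_unknown(flow_sims: Sequence[Scored],
                     domain_sims: Sequence[Scored],
                     y: float, x: float, z: float) -> Optional[str]:
    """Return the app label for an unknown flow, or None if unidentified.

    flow_sims:   stream-feature similarity to each backtracked domain-free flow
    domain_sims: domain text similarity to each backtracked domain-bearing
                 flow; empty when no DNS record matched the flow's IP (S407)
    """
    # S404/S405: accept the most similar domain-free flow if it exceeds Y.
    if flow_sims:
        best_app, best_sim = max(flow_sims, key=lambda p: p[1])
        if best_sim > y:
            return best_app
    # S406/S407: no DNS record for the IP address, so unidentified (S412).
    if not domain_sims:
        return None
    # S408/S409: keep apps whose domain text similarity exceeds X.
    candidates = {app for app, sim in domain_sims if sim > x}
    if not candidates:
        return None
    # S410: do the backtracked domain-free flows include a candidate app?
    matches = [(app, sim) for app, sim in flow_sims if app in candidates]
    if not matches:
        return None
    # S411: accept if the stream similarity exceeds the threshold Z.
    app, sim = max(matches, key=lambda p: p[1])
    return app if sim > z else None
```

For example, with the hypothetical thresholds Y = 0.8, X = 0.7 and Z = 0.5, a flow whose best stream similarity is only 0.6 is still classified if a DNS-derived domain name points to the same app.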

Observation of the stream features of different apps shows that two streams of the same app may have extremely low feature similarity; such streams may be generated by different actions or behaviors of the app. Streams generated by the same behavior of the same app have the highest similarity.

Since traffic that is difficult for the DPI engine to recognize is mostly pure TCP and UDP flows without domain name information, the training process may first sort the TCP flows and the UDP flows separately by start time. Then, for each pure TCP or pure UDP flow, several preceding pure TCP or pure UDP flows are backtracked, and the distance between the features of each pair of flows is calculated to obtain a distance vector. If the two flows come from the same app, the pair is labeled 1; otherwise it is labeled 0. The labeled distance vectors are fed into a low-complexity machine learning model for training. The probability value output by the trained machine learning model is the probability that the two flows come from the same app, and can also be regarded as the flow feature association degree of the two flows.
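The pair construction and training described above can be sketched as follows. The text only says "a machine learning model with low complexity"; logistic regression trained by plain stochastic gradient descent is an assumption chosen here for illustration, as are the names `build_pairs`, `train_logistic` and `similarity` and the hyperparameter values.

```python
import math
from typing import List, Sequence, Tuple

Flow = Tuple[float, str, List[float]]  # (start time, app label, features)
Sample = Tuple[List[float], int]       # (distance vector, label)


def build_pairs(flows: Sequence[Flow], k: int) -> List[Sample]:
    """Sort flows by start time and pair each with up to k preceding flows."""
    ordered = sorted(flows, key=lambda f: f[0])
    samples = []
    for i, (_, app, feats) in enumerate(ordered):
        for _, prev_app, prev_feats in ordered[max(0, i - k):i]:
            dist = [abs(a - b) for a, b in zip(feats, prev_feats)]
            samples.append((dist, 1 if app == prev_app else 0))
    return samples


def _sigmoid(t: float) -> float:
    t = max(-60.0, min(60.0, t))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-t))


def train_logistic(samples: Sequence[Sample], epochs: int = 500,
                   lr: float = 0.1) -> Tuple[List[float], float]:
    """Fit weights and bias by stochastic gradient descent (assumed model)."""
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in samples:
            err = _sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def similarity(model: Tuple[List[float], float], dist: Sequence[float]) -> float:
    """Probability that the two flows behind `dist` come from the same app."""
    w, b = model
    return _sigmoid(b + sum(wi * xi for wi, xi in zip(w, dist)))
```

In this sketch a small distance vector yields a higher output probability than a large one, so the output can be read directly as the association degree between two flows. Features on very different scales would normally be normalized first, which is omitted here.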

In the model application process, once an unknown flow occurs, several identified flows are backtracked. The flow features of each identified flow are compared with those of the unknown flow to obtain flow feature distance vectors, and the vectors are fed into the trained machine learning model to obtain the similarities. The maximum similarity is compared with the flow feature association threshold: if it is greater than the threshold, the unknown flow is given the label of the flow with the highest similarity; otherwise the unknown flow remains marked as unknown.

Apart from DNS, HTTP and HTTPS flows, which themselves carry domain name features, some pure TCP and pure UDP flows contain no domain name information of their own, but their IP addresses appear in the reply field of an earlier DNS response. Through this association, part of these flows can be backfilled with domain name information.

For each unknown flow, several DNS replies are backtracked; if the IP address in a DNS reply is the same as the destination IP of the flow, the domain name is backfilled into the unknown flow.
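The backfilling step can be sketched as a lookup over recently observed DNS replies. The record layout `DnsReply` and the function `backfill_domains` are hypothetical names introduced for illustration, and returning the newest match first is a design choice of this sketch rather than something the text prescribes.

```python
from typing import List, NamedTuple, Sequence, Tuple


class DnsReply(NamedTuple):
    timestamp: float       # time the reply was observed
    domain: str            # queried domain name
    ips: Tuple[str, ...]   # IP addresses contained in the reply


def backfill_domains(dest_ip: str, flow_start: float,
                     replies: Sequence[DnsReply], n: int) -> List[str]:
    """Domains from the n most recent earlier replies that contain dest_ip."""
    earlier = sorted((r for r in replies if r.timestamp <= flow_start),
                     key=lambda r: r.timestamp)
    recent = earlier[-n:] if n > 0 else []
    domains: List[str] = []
    for reply in reversed(recent):  # newest reply first
        if dest_ip in reply.ips and reply.domain not in domains:
            domains.append(reply.domain)
    return domains
```

An empty result corresponds to the case in which no DNS record exists for the IP address, so the flow stays unidentified.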

After backfilling succeeds, for a TCP flow, the known flows with domain name features are backtracked for each backfilled domain name, and the text similarity between the domain names of the two flows is calculated. If the domain name text similarity is greater than the threshold, the known flow is marked as a domain-name-similar flow and the text similarity threshold is updated. This operation is repeated for the next backfilled domain name until all backfilled domain names have been traversed. After the traversal, if a domain-name-similar flow exists, the backtracked known pure TCP flows are traversed again; if one of them belongs to the same app as the domain-name-similar flow, the flow feature similarity between that flow and the unknown flow is calculated using the machine learning model of the flow feature association module, and if the similarity is greater than the flow feature similarity limit threshold, the unknown flow is given the label of the similar flow.
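The domain name comparison can be sketched as follows. The text does not name a text similarity measure, so the ratio from Python's standard `difflib.SequenceMatcher` is used purely as a stand-in; the function names and the example threshold are hypothetical, and the threshold update mentioned above is omitted because its rule is not specified.

```python
from difflib import SequenceMatcher
from typing import List, Sequence, Tuple


def domain_similarity(a: str, b: str) -> float:
    """Text similarity in [0, 1]; SequenceMatcher is an assumed measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_domain_similar_flows(backfilled: Sequence[str],
                              known: Sequence[Tuple[str, str]],
                              threshold: float) -> List[Tuple[str, str]]:
    """Return the (domain, app) pairs of known flows marked as similar.

    known holds (domain, app) for each backtracked flow with a domain name.
    """
    similar: List[Tuple[str, str]] = []
    for domain in backfilled:          # traverse every backfilled domain
        for known_domain, app in known:
            if (domain_similarity(domain, known_domain) > threshold
                    and (known_domain, app) not in similar):
                similar.append((known_domain, app))
    return similar
```

The apps returned here are the candidates against which the backtracked pure TCP flows are then checked with the flow feature similarity model.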

For a UDP flow, if domain name backfilling of the unknown flow succeeds, it is only necessary to check whether all the backfilled domain names come from the same app; if they do, the known flow is marked as a domain-name-similar flow. The flow feature similarity check is the same as for TCP flows.

During flow processing, a fixed number of identified domain name flows and the DNS response information within a fixed time are cached, for use by the flow feature association module and the domain name association module respectively. Adjusting the cache size and the three thresholds, namely the flow feature similarity threshold, the initial text similarity threshold, and the flow feature similarity limit threshold, allows the algorithm to trade off performance, coverage, and identification accuracy.
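The two caches described above can be sketched with standard-library deques: one bounded by count, the other by age. The class name `FlowCache` and its parameters are hypothetical, and the sketch assumes DNS replies arrive with non-decreasing timestamps.

```python
from collections import deque
from typing import Any, Deque, Tuple


class FlowCache:
    """Fixed-count cache of identified flows plus a time-windowed DNS cache."""

    def __init__(self, max_flows: int, dns_window_seconds: float) -> None:
        # deque(maxlen=...) silently drops the oldest flow when full.
        self.flows: Deque[Any] = deque(maxlen=max_flows)
        self.dns: Deque[Tuple[float, str, str]] = deque()
        self.window = dns_window_seconds

    def add_flow(self, flow: Any) -> None:
        self.flows.append(flow)

    def add_dns(self, timestamp: float, domain: str, ip: str) -> None:
        # Assumes timestamps are non-decreasing across calls.
        self.dns.append((timestamp, domain, ip))
        self._expire(timestamp)

    def _expire(self, now: float) -> None:
        # Drop DNS replies older than the configured time window.
        while self.dns and now - self.dns[0][0] > self.window:
            self.dns.popleft()
```

A larger `max_flows` or time window gives more candidates to compare against, which tends to raise coverage at the cost of more similarity computations per unknown flow.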

When an unknown stream is identified, the information of some preceding pure TCP or pure UDP streams is backtracked to calculate the stream feature similarity, and the domain name information of some preceding streams that carry domain names is backtracked to calculate the domain name text similarity. The two are combined to determine the associated stream and thereby identify the unknown stream, which can improve the accuracy and precision of identifying malicious traffic.

As an implementation of the methods shown in the above figures, the present application provides an embodiment of a traffic identification device, and fig. 5 shows a schematic structural diagram of a traffic identification device provided in this embodiment, where the embodiment of the device corresponds to the method embodiments shown in fig. 1 to 4, and the device may be specifically applied to various electronic devices. As shown in fig. 5, the traffic identification apparatus according to this embodiment includes an unknown flow acquiring unit 510, a backtracking unit 520, a similar data flow determining unit 530, and a first determining unit 540.

The unknown flow obtaining unit 510 is configured to obtain a data flow in the network whose application program has not been determined as an unknown data flow;

the backtracking unit 520 is configured to backtrack a plurality of data streams which do not contain domain name information and whose application programs have been determined as a plurality of reference data streams;

the similar data stream determining unit 530 is configured to calculate similarities between the multiple reference data streams and the unknown data stream respectively to determine a maximum similarity and a reference data stream corresponding to the maximum similarity;

the first determining unit 540 is configured to determine, if the maximum similarity is greater than or equal to a predetermined similarity threshold, that the application to which the unknown data stream belongs is the same as the application to which the reference data stream corresponding to the maximum similarity belongs.

According to one or more embodiments of the present disclosure, the unknown stream acquiring unit 510 is configured to: acquiring data stream in a network, determining an application program of the data stream according to a preset hard coding rule, marking the application program for the data stream if the determination is successful, and acquiring the data stream as the unknown data stream if the determination is failed.

According to one or more embodiments of the present disclosure, the similar data stream determining unit 530 is configured to calculate a stream feature distance vector between the reference data stream and the unknown data stream; and inputting the stream characteristic distance vector into a pre-trained stream similarity calculation model, and obtaining the similarity output by the stream similarity calculation model, wherein the similarity is used for representing the probability that two data streams corresponding to the input stream characteristic distance vector belong to the same application program.

Further, the similar data stream determining unit 530 configured for calculating a stream feature distance vector between the reference data stream and the unknown data stream comprises: calculating a stream feature distance vector between the reference data stream and the unknown data stream according to predetermined stream features of the data stream, wherein the predetermined stream features of the data stream include at least one of: the median of the uplink packet length sequence of the data stream, the standard deviation of the time interval sequence of the data stream, the median of the downlink packet length sequence of the data stream, the packet lengths of the first N packets of the data stream, and the domain name characteristics of the predetermined field of the data stream.

Fig. 6 is a schematic structural diagram of a training apparatus for a flow similarity calculation model provided according to an embodiment of the present disclosure, and according to one or more embodiments of the present disclosure, the flow similarity calculation model used in the similar data flow determination unit 530 is obtained by training through the sample acquisition module 610, the model determination module 620, and the model training module 630 shown in fig. 6.

The sample obtaining module 610 is configured to obtain a training sample set, where a training sample includes a stream characteristic distance vector between two data streams, and label information used for indicating whether the two data streams belong to the same application program, where a label information of 1 indicates that the two data streams belong to the same application program, and a label information of 0 indicates that the two data streams do not belong to the same application program.

The model determination module 620 is configured for determining an initialized flow similarity calculation model, wherein the initialized flow similarity calculation model comprises a target layer for outputting probabilities that two data flows belong to the same application.

The model training module 630 is configured to train the flow similarity calculation model by using a machine learning method, using the flow characteristic distance vectors in the training samples in the training sample set as an input of the initialized flow similarity calculation model, and using the label information corresponding to the input flow characteristic distance vectors as an expected output of the initialized flow similarity calculation model.

The traffic identification device provided by this embodiment can execute the traffic identification method provided by the method embodiments of the present disclosure, and has the functional modules and beneficial effects corresponding to the executed method.

Fig. 7 shows a schematic structural diagram of another traffic identification device provided in the embodiment of the present disclosure, and as shown in fig. 7, the traffic identification device described in this embodiment includes an unknown flow obtaining unit 710, a backtracking unit 720, a similar data flow determining unit 730, a first determining unit 740, and a second determining unit 750.

The unknown flow obtaining unit 710 is configured to obtain a data flow in the network whose application program has not been determined as an unknown data flow.

The trace-back unit 720 is configured to trace back a plurality of data streams that do not include domain name information and that have been determined to belong to the application as a plurality of reference data streams.

The similar data stream determining unit 730 is configured to calculate similarities between the plurality of reference data streams and the unknown data stream respectively to determine a maximum similarity and a reference data stream corresponding to the maximum similarity.

The first determining unit 740 is configured to determine, if the maximum similarity is greater than or equal to a predetermined similarity threshold, that the application to which the unknown data stream belongs is the same as the application to which the reference data stream corresponding to the maximum similarity belongs.

The second determining unit 750 is configured to, if the maximum similarity is smaller than the predetermined similarity threshold: backtrack a plurality of DNS response records and, if the plurality of DNS response records contain at least one DNS record which takes the IP address of the unknown data stream as a destination address, acquire at least one domain name corresponding to the at least one DNS record; and backtrack a plurality of data streams which contain the at least one domain name and whose application programs have been determined, calculate the domain name text similarity between each backtracked data stream and the unknown data stream, and, if the domain name text similarity is greater than a second predetermined similarity threshold, determine the application program to which the unknown data stream belongs according to the application program to which that data stream belongs and the application programs to which the plurality of reference data streams belong.

In an embodiment, the unknown stream acquiring unit 710 is configured to further acquire a data stream in a network, determine an application program to which the data stream belongs according to a predetermined hard coding rule, mark the application program to which the data stream belongs if the determination is successful, and acquire the data stream as the unknown data stream if the determination is failed.

In an embodiment, the second determining unit 750 is configured to determine that the application program to which the unknown data stream belongs is the application program to which the data stream belongs if at least one reference data stream among the plurality of reference data streams belongs to the same application program as the data stream, and the similarity between the at least one reference data stream and the unknown data stream is greater than a second predetermined similarity threshold, where the second predetermined similarity threshold is smaller than the predetermined similarity threshold.

In an embodiment, the similar data stream determining unit 730 is configured to calculate a stream feature distance vector between the reference data stream and the unknown data stream; and inputting the stream characteristic distance vector into a pre-trained stream similarity calculation model, and obtaining the similarity output by the stream similarity calculation model, wherein the similarity is used for representing the probability that two data streams corresponding to the input stream characteristic distance vector belong to the same application program.

Further, the similar data stream determining unit 730 is configured for calculating a stream feature distance vector between the reference data stream and the unknown data stream according to a predetermined stream feature of a data stream, wherein the predetermined stream feature of the data stream comprises at least one of: the median of the uplink packet length sequence of the data stream, the standard deviation of the time interval sequence of the data stream, the median of the downlink packet length sequence of the data stream, the packet lengths of the first N packets of the data stream, and the domain name characteristics of the predetermined field of the data stream.

Referring now to FIG. 8, shown is a schematic diagram of an electronic device 800 suitable for use in implementing embodiments of the present disclosure. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 8 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.

As shown in fig. 8, an electronic device 800 may include a processing means (e.g., central processing unit, graphics processor, etc.) 801 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)802 or a program loaded from a storage means 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data necessary for the operation of the electronic apparatus 800 are also stored. The processing apparatus 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.

In general, the electronic device 800 may include input devices 806 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 807 including, for example, a liquid crystal display (LCD), speaker, vibrator, etc.; storage devices 808 including, for example, magnetic tape, hard disk, etc.; and communication devices 809. The communication devices 809 may allow the electronic device 800 to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 8 illustrates the electronic device 800 with various devices, it is to be understood that not all of the illustrated devices are required to be implemented or provided; more or fewer devices may alternatively be implemented or provided.

In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication means 809, or installed from the storage means 808, or installed from the ROM 802. The computer program, when executed by the processing apparatus 801, performs the above-described functions defined in the methods of the embodiments of the present disclosure.

It should be noted that the computer readable medium described above in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the disclosed embodiments, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the disclosed embodiments, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.

The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.

The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring data streams of undetermined affiliated application programs in a network as unknown data streams; backtracking a plurality of data streams which do not contain domain name information and are determined to belong to the application program as a plurality of reference data streams; respectively calculating the similarity between the plurality of reference data streams and the unknown data stream to determine the maximum similarity and the reference data stream corresponding to the maximum similarity; and if the maximum similarity is larger than or equal to a preset similarity threshold, determining the application program of the unknown data stream, which is the same as the application program of the reference data stream corresponding to the maximum similarity.

Computer program code for carrying out operations for embodiments of the present disclosure may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units described in the embodiments of the present disclosure may be implemented by software or hardware. The name of a unit does not, in some cases, constitute a limitation of the unit itself; for example, the first retrieving unit may also be described as a "unit for retrieving at least two internet protocol addresses".

The foregoing description is only a preferred embodiment of the present disclosure and an illustration of the technical principles employed. It will be appreciated by those skilled in the art that the scope of the disclosure in the embodiments of the present disclosure is not limited to technical solutions formed by the particular combination of the above-described features, but also encompasses other technical solutions formed by any combination of the above-described features or their equivalents without departing from the concept of the present disclosure, for example, technical solutions formed by replacing the above features with (but not limited to) features having similar functions disclosed in the embodiments of the present disclosure.

Claims

1. A traffic identification method, comprising:

2. The method of claim 1, wherein obtaining data streams of applications in the network to which the applications are not determined to belong as unknown data streams comprises:

3. The method of claim 1, further comprising, if the maximum similarity is less than the predetermined similarity threshold:

4. The method of claim 3, wherein determining the application to which the unknown data stream belongs based on the application to which the data stream belongs and the applications to which the plurality of reference data streams belong comprises:

5. The method of claim 1, wherein computing the similarity between a reference data stream and the unknown data stream comprises:

6. The method of claim 5, wherein computing the stream feature distance vector between the reference data stream and the unknown data stream comprises:

7. The method according to any one of claims 1 to 6, wherein the flow similarity calculation model is trained by:

8. A traffic identification device, comprising:

9. An electronic device, comprising:

a processor; and

a memory to store executable instructions that, when executed by the one or more processors, cause the electronic device to perform the method of any of claims 1-7.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-7.