WO2018047027A1 - A method for exploring traffic passive traces and grouping similar urls - Google Patents

Publication number
WO2018047027A1
Authority
WO
WIPO (PCT)
Prior art keywords
urls
strings
distance
similar
pairs
Prior art date
Application number
PCT/IB2017/054786
Other languages
French (fr)
Inventor
Marco Mellia
Hassan METWALLEY
Enrico BOCCHI
Andrea MORICHETTA
Original Assignee
Politecnico Di Torino
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Politecnico Di Torino
Publication of WO2018047027A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • H04L63/1425 Traffic logging, e.g. anomaly detection
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/145 Countermeasures against malicious traffic the attack involving the propagation of malware through the network, e.g. viruses, trojans or worms

Abstract

Computer security method for the analysis of passive traces of HTTP and HTTPS traffic on the Internet, with extraction and grouping of similar Web transactions automatically generated by malware, malicious services, unsolicited advertising or other sources, comprising at least the following processing and control steps: a) extraction of URLs from an operational network, using passive exploration of the HTTP and HTTPS traffic data, and subsequent collection of the extracted URLs into batches; b) detection of similar URLs, by calculating a metric based on the distance among URLs, namely on a measure of the degree of diversity among pairs of character strings composing the URLs; c) activation of one or more clustering algorithms used to group the URLs based on the similarity metric and to obtain, within each group of URLs, elements with similar/homogeneous features, adapted to be analyzed as a single entity; d) visualization of the elements according to a sorting based on the degree of cohesion of the URLs contained in each grouping.

Description

A METHOD FOR EXPLORING TRAFFIC PASSIVE TRACES AND GROUPING SIMILAR URLS
DESCRIPTION
The present invention relates to a computer security method for analyzing traces of HTTP traffic on the Internet (Hyper Text Transfer Protocol, a standard application protocol used as the main system for the transmission of information on the Web), aimed at extracting and grouping similar Web transactions generated automatically by malware, malicious services, unsolicited advertising or other sources. Web transactions are understood here as HTTP and HTTPS requests and responses containing the URL (Uniform Resource Locator, a unique address of a resource on the Internet, by which transactions are identified).
In the current state of the art there are some prior documents (US7680858, US7962487, US7376752, EP2291812 and WO2013009713), but none of them uses the innovative features of the present invention described below, which make it possible to obtain better performance and greater benefits.
Specifically, US7680858 normalizes URLs by dividing them into "levels" of information; the measure of the variation between two URLs is calculated on the basis of the "differences" of keywords (search keys); it also uses information about the "content" of the page.
US7962487 is oriented only towards the improvement of search engines; it relies on clustering (grouping) of tokens (categorized text blocks) associated with search queries.
US7376752 divides the URL into two parts; the distance among URLs is calibrated so as to recognize typing mistakes.
EP2291812 is based on the page "content"; it creates a set of features from every page, on which it calculates the "distance" among URLs.
WO2013009713 aims at the recognition of phishing pages; it searches for "relationships" among phishing page files to determine their similarity.
In the scientific literature, there are two types of works relating to the subject matter of the present invention. The first type includes all those works that aim to classify a page by processing only the "content" present in it, or the Web address of the page (URL). In this case, only "text recognition" algorithms are used, which represent only a part of the present invention. The methodologies of these works, however, incur a high computational cost for processing the text of billions of Web pages and aim to recognize the "subject" of each page; consequently, their objectives are totally different from those of the present invention.
The second type, on the other hand, comprises all those works that apply data-mining techniques (data extraction and processing) to URLs to detect only "some types" of cyber-attacks, such as phishing or spam.
The present invention is therefore more complete and general than the current state of the art. By using various suitably adapted "text recognition" algorithms and "clustering" algorithms (unsupervised techniques developed in the field of data mining to extract information from large amounts of data), a considerably greater quantity of "artificial" and/or "malicious" traffic may be detected.
The present invention thus helps network administrators and/or computer security analysts to extract information from the Web traffic generated by networks of thousands of computers. Without tools to assist analysts, detecting problems or faults becomes very difficult when the data to be considered include billions of Web transactions.
The present method inspects traces of Web traffic generated by real users or automatic bots. For each pair of network transactions present in a trace, the "degree of lexical similarity" is calculated, and "similar" transactions are subsequently "grouped" together to form homogeneous groups that are presented to the network analyst or to the security expert, sorted by "importance".
In particular, the present method makes it possible to detect automatically, and make easily visible, all traffic that is generated not by human users but by "automatic systems", also called bots (robots) in technical jargon. This type of traffic is often generated by malware or other malicious services, so a methodology of this kind can be crucial for reducing the time that passes between a cyber-attack and its discovery (on average, about 150-180 days) or for recognizing faults that cause malfunctions in networks.
The present invention differs from the prior art for the following reasons:
- it is based solely on the analysis of URLs (addresses of Internet resources) and their syntax, ignoring the "content" of the page or other information;
- it neither analyzes nor uses particular structural features of URLs, but maintains a neutral point of view, checking only the "similarity" among pairs of URLs;
- it uses techniques based on "unsupervised algorithms", and therefore does not require any a priori knowledge or information;
- it is based solely on the calculation of the "syntactic similarity" among the various URLs, avoiding the need for a set of pre-labelled elements and thereby also preventing problems of overfitting (excessive adaptation) of the algorithm used.
Inspired by text-mining algorithms (text extraction and processing), the concept of "distance" among URLs is introduced and used to compose "groups" of URLs by means of the well-known clustering algorithm DBSCAN (Density-Based Spatial Clustering of Applications with Noise), which is "density"-based in that it connects regions of points with a sufficiently high density.
To illustrate how "density"-based clustering algorithms work, consider a set of points in a sample space to be clustered, and let D(x1,x2) be the distance between the points x1 and x2. Consider now the sphere of radius E centred on x1. If at least a minimum number of points (minPoints) lies within the distance E from x1, the point x1 is classified as a "central point", and the points inside the sphere are defined as "directly reachable" from x1. A generic point xk is "reachable" from x1 if there is a path x1,x2,...,xk such that xi+1 is directly reachable from xi. The points reachable from x1 form a "cluster", i.e. a "dense" region. The points that are not reachable from x1 are called "anomalous values"; they may form a separate cluster, if they belong to another dense region, or be included in the so-called "noise" region. The parameters minPoints and E are adjustable and can be set by a domain expert. The parameter minPoints defines the minimum size of a cluster and has little impact on the final results. The parameter E, on the other hand, is fundamental. If it is set too small, it leads to a high number of small groups and to many points that cannot be clustered. If it is set too large, it leads to a few groups containing a multitude of heterogeneous points. A sensitivity analysis is therefore essential to choose the value of the radius E correctly.
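The density-based procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used by the method: the function name dbscan, the argument names eps (standing for the radius E) and min_points, the use of a precomputed distance matrix, and the convention that a point counts itself among its neighbours are all choices made for the example.

```python
def dbscan(dist, eps, min_points):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise.

    dist is a symmetric matrix (list of lists) of pairwise distances.
    Assumption: a point counts itself among its neighbours.
    """
    n = len(dist)
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * n

    def neighbours(p):
        # All points within distance eps of p, p itself included.
        return [q for q in range(n) if dist[p][q] <= eps]

    cluster = -1
    for p in range(n):
        if labels[p] is not UNVISITED:
            continue
        seeds = neighbours(p)
        if len(seeds) < min_points:
            labels[p] = NOISE           # may later be claimed by a cluster
            continue
        cluster += 1                    # p is a central point: start a cluster
        labels[p] = cluster
        queue = [q for q in seeds if q != p]
        while queue:
            q = queue.pop()
            if labels[q] == NOISE:
                labels[q] = cluster     # reachable, but not central itself
            if labels[q] is not UNVISITED:
                continue
            labels[q] = cluster
            q_neigh = neighbours(q)
            if len(q_neigh) >= min_points:
                queue.extend(q_neigh)   # q is central: keep expanding
    return labels


# Toy one-dimensional data: two dense groups and one isolated point.
points = [0, 1, 2, 10, 11, 12, 30]
dist = [[abs(a - b) for b in points] for a in points]
labels = dbscan(dist, eps=1, min_points=2)
```

With a larger eps the two groups would eventually merge, and with a smaller one every point would be left as noise, which is the sensitivity to the radius discussed in the text.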
The groupings thus generated are subsequently sorted to help the visualization for the network administrator or the security expert. The sorting is done by considering the cohesion degree of the elements inside each grouping.
The present invention therefore solves the problem of processing the input data, of aggregating them syntactically and semantically, and showing them to the analyst coherently and consistently, and sorted by importance.
The method of the present invention also offers a tool for the aggregate analysis of Web traffic, making it possible to detect in a simple and direct way Web transactions that are linked to malicious services, that are produced by automatic systems such as advertising or tracking systems, or that are otherwise of interest to the network administrator or the security expert.
The above and other objects and advantages of the invention, as will appear from the following description, are achieved with the method described in claim 1.
Preferred embodiments and non-trivial variations of the present invention form the subject matter of the dependent claims.
It is understood that all the appended claims form an integral part of the present description.
It will be immediately apparent that numerous variations and modifications could be made to what is described, without departing from the scope of protection of the invention as defined by the appended claims.
The invention relates to a computer security method for analyzing traces of HTTP and HTTPS traffic on the Internet, aimed at extracting and grouping "similar" Web transactions generated "automatically" by malware, malicious services, unsolicited advertising or other sources.
The main objectives of the present method are essentially:
- reducing the number of elements that the analyst must visualize and process, from hundreds of millions of single transactions to a few hundred clusters (groups of similar/internally homogeneous elements);
- identifying the transactions generated "automatically", for example transactions generated by advertising platforms, polymorphic malware and/or wiki-like systems.
Specifically, the subject method of the invention comprises at least the following processing and control steps:
a) extraction of transactions from an operational network, by means of exploration of the HTTP and HTTPS traffic data, and subsequent collection of the extracted transactions into batches (groups of elements);
b) detection of similar transactions, by calculating a metric based on the "similarity" among pairs of transactions, namely on a measure of the degree of "diversity" among pairs of character strings composing the URLs;
c) activation of one or more "clustering" algorithms, used to group the transactions on the basis of a similarity metric, so that each group contains elements with similar/homogeneous features which can be analyzed as a "single" entity; this considerably reduces the number of elements to be analyzed and facilitates and accelerates the analysis of, and search for, malicious and/or unwanted Internet traffic generated artificially/automatically;
d) sorting of the transaction groups on the basis of their importance, i.e. of the degree of cohesion of the transactions contained in each grouping.
The extraction of transactions takes place via a passive network probe for the extraction and filtering of traffic, located on a specific link, which processes the data packets in real time, extracts the transactions and then groups them into batches for subsequent processing. Once a batch of transactions is formed, the "distance" among all pairs of transactions is calculated, i.e. their level of likeness/similarity. The distance is calculated by considering the entire URL as a single string of characters, composed of both the "hostname" (the name identifying a device within a network of computers) and the "path".
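As an illustration of the batching step and of treating hostname plus path as one string, a minimal Python sketch follows. The helper names url_to_string, batches and all_pairs, the batch size and the sample URLs are hypothetical and chosen only for the example; dropping the scheme and the query string is also an assumption, since the text only states that hostname and path are used.

```python
from itertools import combinations, islice
from urllib.parse import urlsplit


def url_to_string(url):
    """Return hostname + path as one string (hypothetical helper).

    Assumption: scheme, port and query string are discarded.
    """
    parts = urlsplit(url)
    return (parts.hostname or "") + parts.path


def batches(urls, size):
    """Yield consecutive batches containing at most `size` URL strings."""
    it = iter(urls)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def all_pairs(batch):
    """Every unordered pair in a batch: n * (n - 1) / 2 pairs for n items."""
    return list(combinations(batch, 2))


# Sample URLs invented for the example.
seen = [
    "http://ads.example.com/track/a1",
    "http://ads.example.com/track/a2",
    "http://www.example.org/index.html",
]
strings = [url_to_string(u) for u in seen]
first_batch = next(batches(strings, size=2))
```

Because the number of pairs grows quadratically with the batch size, the batch size directly bounds the cost of the subsequent distance calculation.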
To detect similar URLs, a distance among pairs of strings is used that belongs to the "edit-distance" class and is suitable for calculating the dissimilarity of pairs of character strings composing the URLs. The "distance" between two character strings is taken to be the minimum number of steps required to convert one of the two strings into the other.
In the state of the art, the most popular technique is the so-called Levenshtein distance, which assigns a unit cost to all editing operations, i.e. insertion, deletion and substitution of one character. It yields an absolute distance between two strings, which is at most equal to the length of the longer string. This, however, makes the Levenshtein distance scarcely convenient when comparing a short URL with a long one (URL length may range from a few to hundreds of characters).
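To make the limitation of an absolute distance concrete, the following Python sketch implements the classic Levenshtein distance with unit costs. The function name levenshtein and the sample strings are assumptions made for the example.

```python
def levenshtein(s, t):
    """Classic Levenshtein distance: every edit operation costs 1."""
    prev = list(range(len(t) + 1))      # distances from "" to prefixes of t
    for i, cs in enumerate(s, start=1):
        curr = [i]                      # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cs
                curr[j - 1] + 1,           # insert ct
                prev[j - 1] + (cs != ct),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


# Two short, completely different strings ...
short_pair = levenshtein("ab", "cd")
# ... and two long, nearly identical strings (sample data for the example).
long_pair = levenshtein("a" * 98 + "xy", "a" * 98 + "zw")
```

Both calls give the same absolute value although the first pair shares nothing and the second pair shares 98 of 100 characters, which is why an unnormalized distance is hard to compare across URLs of different lengths.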
Unlike various known techniques, in the present method the following conditions apply when calculating the "distance" among the character strings composing URLs:
- the "insertion" of a character has a value of 1;
- the "deletion" of a character has a value of 1;
- the "substitution" of a character has a value of 2, the substitution being equivalent to a deletion plus an insertion;
- the obtained value is normalized to the range between 0 and 1 by summing the values of all the operations needed to match the two strings (i.e. insertions, deletions and/or substitutions) and dividing this sum by the sum of the lengths of the two strings;
- the distance between two URL strings thus varies in a normalized range between 0 and 1, so that a pair of identical strings has a distance equal to 0 and a pair of completely different strings has a distance equal to 1. A pair of similar URLs has a small distance, while a pair of different URLs has a large distance.
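The conditions listed above can be sketched as a small dynamic-programming routine in Python. This is an illustrative sketch rather than the reference implementation: the function name url_distance and the sample strings are assumptions, as is the choice to return 0 when both strings are empty (a case the text does not cover).

```python
def url_distance(s, t):
    """Normalized edit distance in [0, 1].

    Insertion and deletion cost 1, substitution costs 2, and the total
    is divided by len(s) + len(t).
    """
    if not s and not t:
        return 0.0                      # assumption: two empty strings match
    prev = list(range(len(t) + 1))      # cost of inserting prefixes of t
    for i, cs in enumerate(s, start=1):
        curr = [i]                      # cost of deleting s[:i]
        for j, ct in enumerate(t, start=1):
            substitution = 0 if cs == ct else 2
            curr.append(min(
                prev[j] + 1,                 # deletion
                curr[j - 1] + 1,             # insertion
                prev[j - 1] + substitution,  # match or substitution
            ))
        prev = curr
    return prev[-1] / (len(s) + len(t))


identical = url_distance("a.com/x1", "a.com/x1")   # 0.0
similar = url_distance("a.com/x1", "a.com/x2")     # 2 / 16 = 0.125
different = url_distance("abc", "xyz")             # 6 / 6 = 1.0
```

With a substitution cost of 2, the worst case (deleting every character of one string and inserting every character of the other) costs exactly the sum of the two lengths, so dividing by that sum keeps the result between 0 and 1.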
Said one or more "clustering" algorithms, used for grouping the URLs on the basis of the similarity metric, place URLs in the same set when they have a high similarity (i.e. a low distance).
For the purposes of the present invention, the known clustering algorithm DBSCAN is preferably used, which is based on the calculation of the "density" of the elements present within a certain area.
The network administrator or the security expert is then provided with a visualization of these groupings of transactions, sorted according to their degree of cohesion, starting from the most cohesive grouping.
In detail, an analysis tool called the "silhouette coefficient" is used for this task. This coefficient is based on the concepts of cohesion and separation: a cluster is cohesive if its elements are mutually very close, and it is well separated if its points are distant from those of other clusters. The silhouette coefficient thus evaluates how well each point fits in its cluster.
Given a point i, a(i) is the average distance between that point and all the other points of the cluster it belongs to. This measures how well the point i fits in its own grouping. On the other hand, b(i) is defined as the lowest average distance between i and the points of any other cluster. The silhouette is then defined as the ratio between the difference of b(i) and a(i) and the maximum of a(i) and b(i), that is s(i) = (b(i) - a(i)) / max(a(i), b(i)), which yields values in the range between -1 and 1. The higher s(i) is, the more similar i is to its own cluster. In particular, if the silhouette value is > 0, the mean distance between i and the other objects in its grouping is lower than the minimum average distance to the elements of any other cluster. For s(i) < 0 the opposite applies.
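The silhouette calculation and the resulting ordering of groupings can be sketched as follows in Python. Several details are assumptions made for the example and are not stated in the text: the function names silhouette_values and rank_clusters, the use of the label -1 for noise points (which are skipped), the value 0 assigned to single-element clusters, and the ranking of each cluster by the mean silhouette of its points.

```python
def silhouette_values(dist, labels):
    """Return {point index: s(i)} for clustered points (label -1 is skipped)."""
    clusters = {}
    for i, lab in enumerate(labels):
        if lab != -1:
            clusters.setdefault(lab, []).append(i)

    def mean_dist(i, members):
        return sum(dist[i][j] for j in members) / len(members)

    s = {}
    for lab, members in clusters.items():
        for i in members:
            others = [j for j in members if j != i]
            if not others or len(clusters) < 2:
                s[i] = 0.0              # assumption: undefined cases score 0
                continue
            a = mean_dist(i, others)    # cohesion: own cluster
            b = min(mean_dist(i, m)     # separation: nearest other cluster
                    for other, m in clusters.items() if other != lab)
            s[i] = (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return s


def rank_clusters(dist, labels):
    """Cluster ids sorted from most to least cohesive (mean silhouette)."""
    s = silhouette_values(dist, labels)
    totals = {}
    for i, value in s.items():
        total, count = totals.get(labels[i], (0.0, 0))
        totals[labels[i]] = (total + value, count + 1)
    means = {lab: total / count for lab, (total, count) in totals.items()}
    return sorted(means, key=means.get, reverse=True)


# Toy data: cluster 0 is tight, cluster 1 is more spread out.
points = [0, 1, 2, 10, 14, 18]
labels = [0, 0, 0, 1, 1, 1]
dist = [[abs(a - b) for b in points] for a in points]
order = rank_clusters(dist, labels)
```

In this toy data the tight cluster is ranked first, matching the idea of presenting the most cohesive grouping to the analyst before the others.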
The method of the present invention is therefore based solely and advantageously on the URLs "syntax", ignoring the "content" of pages or other information.

Claims

1) Computer security method for the analysis of passive traces of HTTP and HTTPS traffic on the Internet, with extraction and grouping of similar Web transactions automatically generated by malware, malicious services, unsolicited advertising or other, characterized in that it comprises at least the following processing and control steps:
a) URLs extraction from an operational network, using passive exploration of the traffic data and subsequent collection into batches of the extracted URLs;
b) detection of similar URLs, by means of metrics calculation based on the similarity among URLs, namely based on a measure of the degree of diversity among pairs of character strings composing said URLs;
c) activation of one or more clustering algorithms used to group the URLs based on a similarity metrics, and to obtain, within each group of URLs, elements with similar/ homogeneous features adapted to be analyzed as a single entity;
d) sorting said URLs into groups according to their importance, namely the degree of cohesion of the URLs contained in said groupings.
2) Method according to claim 1, characterized in that said extraction of URLs is performed by network/ passive probe for exploration and filtering, located in a specific link, adapted to process the data packets in real time, for extracting and downloading the URLs in specific batches for subsequent processing.
3) Method according to claim 2, characterized in that when an HTTP/ HTTPS transaction is detected, the contained URL is recorded in a specific file.
4) Method according to claims 2 and 3, characterized in that, once a batch of URLs is formed, the distance between all pairs of the various URLs is calculated, namely the level of likeness/ similarity, said distance being calculated considering the entire URL as a single string of characters, composed of both hostname and path.
5) Method according to one or more of the preceding claims 1 to 4, characterized in that, to detect similar URLs, a similarity metric among pairs of strings is used, adapted to calculate the dissimilarity of pairs of strings of characters composing the URLs, the distance among pairs of strings of characters being considered as the minimum number of steps needed to convert one of the two strings into the other.
6) Method according to one of the preceding claims 1 to 5, characterized in that, for the calculation of the distance among the pairs of strings of characters composing the URLs, the following conditions apply:
- the insertion of a character has a value of 1;
- the deletion of a character has a value of 1;
- the substitution of a character has a value of 2, the substitution being equivalent to a deletion plus an insertion;
- the value obtained as the sum of the costs of the operations needed to match the two strings is normalized between 0 and 1 by dividing it by the sum of the lengths of the two strings;
- the resulting distance between a pair of URL character strings varies in the normalized range between 0 and 1, so that a pair of identical strings has a distance equal to 0, and a pair of completely different strings has a distance equal to 1.
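The conditions of claim 6 can be sketched with a standard dynamic-programming edit distance, using cost 1 for insertion and deletion and cost 2 for substitution, then dividing by the sum of the two lengths. The function name `url_distance` is hypothetical, and returning 0 for two empty strings is an assumption made here to avoid a division by zero; the claim does not address that case.

```python
def url_distance(a: str, b: str) -> float:
    """Normalized edit distance: insert = 1, delete = 1, substitute = 2."""
    if not a and not b:
        return 0.0  # assumption: two empty strings are treated as identical
    # prev[j] holds the cost of converting the processed prefix of a into b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # converting a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            delete = prev[j] + 1
            insert = curr[j - 1] + 1
            substitute = prev[j - 1] + (0 if char_a == char_b else 2)
            curr.append(min(delete, insert, substitute))
        prev = curr
    # The raw cost never exceeds len(a) + len(b): delete all of a, insert all of b.
    return prev[-1] / (len(a) + len(b))


if __name__ == "__main__":
    print(url_distance("a.com/x1", "a.com/x1"))  # 0.0
    print(url_distance("a.com/x1", "a.com/x2"))  # 0.125 (one substitution, 2 / 16)
    print(url_distance("abc", "xyz"))            # 1.0
```

Because a substitution costs the same as a deletion plus an insertion, the raw cost is bounded by the sum of the two lengths, which is why dividing by that sum keeps the result between 0 and 1.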
7) Method according to one of the preceding claims 1 to 6, characterized in that a pair of similar URLs has a small distance, while a pair of different URLs has a great distance.
8) Method according to claim 1, characterized in that said one or more clustering algorithms are adapted to be used for grouping the URLs based on a similarity metric.
9) Method according to claim 8, characterized in that the DBSCAN clustering algorithm is preferably used, based on the calculation of the density of elements present within a certain area.
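A compact, pure-Python sketch of DBSCAN over a precomputed distance matrix illustrates the density-based grouping named in claim 9. The parameters `eps` (neighborhood radius) and `min_pts` (minimum neighbors, counting the point itself) are the usual DBSCAN parameters; the claims give no values for them, so the numbers below are purely illustrative, as is the hand-written matrix.

```python
from collections import deque
from typing import List, Optional, Sequence

NOISE = -1  # label for points that belong to no dense region


def dbscan(dist: Sequence[Sequence[float]], eps: float, min_pts: int) -> List[int]:
    """Cluster points given their pairwise distance matrix; return labels."""
    n = len(dist)
    labels: List[Optional[int]] = [None] * n
    cluster = -1
    for p in range(n):
        if labels[p] is not None:
            continue
        neighbors = [q for q in range(n) if dist[p][q] <= eps]
        if len(neighbors) < min_pts:
            labels[p] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[p] = cluster
        queue = deque(neighbors)
        while queue:
            q = queue.popleft()
            if labels[q] == NOISE:
                labels[q] = cluster  # border point: joins, but does not expand
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_neighbors = [r for r in range(n) if dist[q][r] <= eps]
            if len(q_neighbors) >= min_pts:  # q is a core point: keep expanding
                queue.extend(q_neighbors)
    return [label if label is not None else NOISE for label in labels]


if __name__ == "__main__":
    # Illustrative matrix: URLs 0-2 are mutually close, URLs 3-4 are close to each other.
    dist = [
        [0.0, 0.1, 0.1, 0.9, 0.9],
        [0.1, 0.0, 0.1, 0.9, 0.9],
        [0.1, 0.1, 0.0, 0.9, 0.9],
        [0.9, 0.9, 0.9, 0.0, 0.1],
        [0.9, 0.9, 0.9, 0.1, 0.0],
    ]
    print(dbscan(dist, eps=0.2, min_pts=3))  # [0, 0, 0, -1, -1]
```

With `min_pts=3` the pair of URLs 3 and 4 is too sparse to form a group and is reported as noise; lowering `min_pts` to 2 would turn it into a second cluster.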
10) Method according to claim 9, characterized in that said groupings generated using the clustering algorithm DBSCAN are sorted according to the degree of cohesion among the URLs contained therein.
11) Method according to claim 10, characterized in that a silhouette coefficient is used, based on the calculation of both cohesion and degree of separation for all the elements of each grouping.
12) Method according to one or more of the preceding claims 1 to 11, characterized in that it is based solely on the URL syntax.
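The ranking of groupings by silhouette coefficient (claims 10 and 11) can be sketched as follows, using the standard definition: for each element, a is the mean distance to the other members of its own cluster (cohesion), b is the smallest mean distance to the members of any other cluster (separation), and the silhouette is (b - a) / max(a, b). Averaging per cluster, ignoring noise points labeled -1, and scoring singleton clusters as 0 are assumptions of this sketch; the function names are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Sequence


def silhouette_by_cluster(
    dist: Sequence[Sequence[float]], labels: Sequence[int]
) -> Dict[int, float]:
    """Return the mean silhouette of each cluster (noise label -1 is skipped)."""
    clusters: Dict[int, List[int]] = {}
    for index, label in enumerate(labels):
        if label != -1:
            clusters.setdefault(label, []).append(index)
    scores: Dict[int, float] = {}
    for label, members in clusters.items():
        values = []
        for i in members:
            own = [j for j in members if j != i]
            if not own or len(clusters) < 2:
                values.append(0.0)  # assumed convention for degenerate cases
                continue
            a = mean(dist[i][j] for j in own)  # cohesion
            b = min(  # separation from the nearest other cluster
                mean(dist[i][j] for j in other)
                for other_label, other in clusters.items()
                if other_label != label
            )
            denom = max(a, b)
            values.append((b - a) / denom if denom > 0 else 0.0)
        scores[label] = mean(values)
    return scores


def rank_clusters(dist: Sequence[Sequence[float]], labels: Sequence[int]) -> List[int]:
    """Sort cluster labels from most to least cohesive."""
    scores = silhouette_by_cluster(dist, labels)
    return sorted(scores, key=lambda label: scores[label], reverse=True)


if __name__ == "__main__":
    dist = [
        [0.0, 0.2, 0.2, 0.9, 0.9],
        [0.2, 0.0, 0.2, 0.9, 0.9],
        [0.2, 0.2, 0.0, 0.9, 0.9],
        [0.9, 0.9, 0.9, 0.0, 0.05],
        [0.9, 0.9, 0.9, 0.05, 0.0],
    ]
    print(rank_clusters(dist, [0, 0, 0, 1, 1]))  # [1, 0]
```

In the example, cluster 1 is ranked first because its members are closer to each other (0.05) than those of cluster 0 (0.2), while both are equally far from the other cluster.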
PCT/IB2017/054786 2016-09-12 2017-08-04 A method for exploring traffic passive traces and grouping similar urls WO2018047027A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IT102016000091521 2016-09-12
IT102016000091521A IT201600091521A1 (en) 2016-09-12 2016-09-12 METHOD FOR THE EXPLORATION OF PASSIVE TRAFFIC TRACKS AND GROUPING OF SIMILAR URLS.

Publications (1)

Publication Number Publication Date
WO2018047027A1 true WO2018047027A1 (en) 2018-03-15

Family

ID=58606411

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2017/054786 WO2018047027A1 (en) 2016-09-12 2017-08-04 A method for exploring traffic passive traces and grouping similar urls

Country Status (2)

Country Link
IT (1) IT201600091521A1 (en)
WO (1) WO2018047027A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110283361A1 (en) * 2010-01-19 2011-11-17 Damballa, Inc. Method and system for network-based detecting of malware from behavioral clustering
US20140297640A1 (en) * 2013-03-27 2014-10-02 International Business Machines Corporation Clustering based process deviation detection

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ANDREA MORICHETTA ET AL: "CLUE: Clustering for Mining Web URLs", 2016 28TH INTERNATIONAL TELETRAFFIC CONGRESS (ITC 28), 12 September 2016 (2016-09-12), pages 286 - 294, XP055386135, ISBN: 978-0-9883045-1-2, DOI: 10.1109/ITC-28.2016.146 *
ANTHONY VEREZ: "On the Use of Data Mining Techniques for the Clustering of URLs Extracted from Network-based Malware Traces", 18 February 2014 (2014-02-18), XP055386323, Retrieved from the Internet <URL:http://verez.net/docs/malwurl_paper.pdf> [retrieved on 20170629] *
PIOTR KIJEWSKI: "Automated Extraction of Threat Signatures from Network Flows", 18TH ANNUAL FIRST CONFERENCE, 25 June 2006 (2006-06-25), Baltimore, Maryland, XP055386193, Retrieved from the Internet <URL:https://www.first.org/resources/papers/conference2006/kijewski-piotr-papers.pdf> [retrieved on 20170628] *
ROBERTO PERDISCI ET AL: "Behavioral Clustering of HTTP-Based Malware and Signature Generation Using Malicious Network Traces", USENIX,, 18 March 2010 (2010-03-18), pages 1 - 14, XP061010768 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10834214B2 (en) 2018-09-04 2020-11-10 At&T Intellectual Property I, L.P. Separating intended and non-intended browsing traffic in browsing history
US11228655B2 (en) 2018-09-04 2022-01-18 At&T Intellectual Property I, L.P. Separating intended and non-intended browsing traffic in browsing history
US11652900B2 (en) 2018-09-04 2023-05-16 At&T Intellectual Property I, L.P. Separating intended and non-intended browsing traffic in browsing history
CN110399485A (en) * 2019-07-01 2019-11-01 上海交通大学 The data source tracing method and system of word-based vector sum machine learning
CN110399485B (en) * 2019-07-01 2022-04-08 上海交通大学 Data tracing method and system based on word vector and machine learning
CN113556308A (en) * 2020-04-23 2021-10-26 深信服科技股份有限公司 Method, system, equipment and computer storage medium for detecting flow security
CN112291089A (en) * 2020-10-23 2021-01-29 全知科技(杭州)有限责任公司 Application system identification and definition method based on flow

Also Published As

Publication number Publication date
IT201600091521A1 (en) 2018-03-12

Similar Documents

Publication Publication Date Title
US10050986B2 (en) Systems and methods for traffic classification
Mahajan et al. Phishing website detection using machine learning algorithms
US9003529B2 (en) Apparatus and method for identifying related code variants in binaries
CN108156131B (en) Webshell detection method, electronic device and computer storage medium
Pouget et al. Honeypot-based forensics
CN108737423B (en) Phishing website discovery method and system based on webpage key content similarity analysis
Chiew et al. Leverage website favicon to detect phishing websites
JP6503141B2 (en) Access classification device, access classification method and access classification program
Joshi et al. Using lexical features for malicious URL detection--a machine learning approach
CN109905288B (en) Application service classification method and device
WO2018047027A1 (en) A method for exploring traffic passive traces and grouping similar urls
Chiew et al. Building standard offline anti-phishing dataset for benchmarking
Van Dooremaal et al. Combining text and visual features to improve the identification of cloned webpages for early phishing detection
He et al. Malicious domain detection via domain relationship and graph models
Khan Detection of phishing websites using deep learning techniques
Pradeepa et al. Lightweight approach for malicious domain detection using machine learning
Yazhmozhi et al. Natural language processing and Machine learning based phishing website detection system
Kozik et al. Solution to data imbalance problem in application layer anomaly detection systems
Wang et al. TSMWD: a high-speed malicious web page detection system based on two-step classifiers
Han Detection of web application attacks with request length module and regex pattern analysis
Morichetta et al. Clustering and evolutionary approach for longitudinal web traffic analysis
CN104063491B (en) A kind of method and device that the detection page is distorted
Shibahara et al. POSTER: Detecting Malicious Web Pages based on Structural Similarity of Redirection Chains
Wedyan et al. An Associative Classification Data Mining Approach for Detecting Phishing Websites
Ponmaniraj et al. Intrusion Detection: Spider Content Analysis to Identify Image-Based Bogus URL Navigation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17762219

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17762219

Country of ref document: EP

Kind code of ref document: A1