IT201600091521A1

IT201600091521A1 - METHOD FOR THE EXPLORATION OF PASSIVE TRAFFIC TRACES AND GROUPING OF SIMILAR URLS.

Info

Publication number: IT201600091521A1
Application number: IT102016000091521A
Authority: IT
Inventors: Marco Mellia; Hassan Metwalley; Enrico Bocchi; Andrea Morichetta
Original assignee: Torino Politecnico
Priority date: 2016-09-12
Filing date: 2016-09-12
Publication date: 2018-03-12
Also published as: WO2018047027A1

Description

METHOD FOR EXPLORING PASSIVE TRAFFIC TRACES AND GROUPING SIMILAR URLs.

DESCRIPTION

The present invention relates to a computer security method for analyzing traces of HTTP traffic on the Internet (HyperText Transfer Protocol, the standard application protocol used as the main system for transmitting information on the Web), aimed at extracting and grouping similar Web transactions generated automatically by malware, malicious services, unwanted advertising or other sources. Web transactions are understood as HTTP and HTTPS requests and responses containing URLs (Uniform Resource Locator, the unique address of a resource on the Internet), by which the transactions are identified.

The current state of the art includes several prior documents (US7680858, US7962487, US7376752, EP2291812, WO2013009713), but none of them uses the innovative features of the present invention described below, which allow better performance and greater advantages.

Specifically, US7680858 normalizes URLs by dividing them into "levels" of information; it measures the variation between two URLs on the basis of keyword "differences"; and it also uses information about the "content" of the page.

US7962487 is aimed solely at improving search engines; it is based on clustering the tokens (categorized blocks of text) associated with search queries.

US7376752 divides the URL into two parts; the distance between URLs is calibrated to recognize typing errors.

EP2291812 is based on the "content" of the page; it builds a set of features from each page and computes the "distance" between URLs on that set.

WO2013009713 aims at recognizing phishing pages; it searches for "relationships" between the files of phishing pages to determine their similarity.

The scientific literature therefore contains two types of work on the subject matter of the present invention. The first includes all works that aim to classify a page by processing only the "content" it holds or the Web address of the page (URL). These works use only "text recognition" algorithms, which are just one part of the present invention. Their methodologies, however, incur a high computational cost to process the text of billions of Web pages, and they aim to recognize the "topic" of each page, so their objectives are entirely different from those of the present invention.

The second type includes all works that apply data-mining techniques (data extraction and processing) to URLs to detect only "certain types" of cyber attacks, such as phishing or spam.

The present invention is therefore considerably more complete and general than the current state of the art. By using several suitably adapted or modified "text recognition" algorithms together with "clustering" algorithms (unsupervised techniques developed in the field of data mining to extract knowledge from large amounts of data), it can detect a considerably larger amount of "artificial" and/or "malicious" traffic.

The present invention was thus created to help network administrators and/or computer security analysts extract information from the Web traffic generated by networks with thousands of computers. Without supporting tools, it becomes very difficult for analysts to detect problems or anomalies by looking at blocks of data made up of billions of Web transactions.

The present method inspects traces of Web traffic generated by real users or automated bots. For each pair of network transactions in a trace, the "degree of lexical similarity" is computed; "similar" transactions are then "grouped" together to form homogeneous groups, which are presented to the network analyst or security expert sorted by "importance".

In particular, the present method makes it possible to automatically detect, and make easily visible, all traffic that is generated not by human users but by "automatic systems", also known in jargon as bots (robots). This type of traffic is often generated by malware or other malicious services, so such a methodology can be essential to reduce the time that elapses between a cyber attack and its discovery (about 150-180 days on average) or to recognize anomalies that cause network malfunctions.

The present invention differs from the prior art for the following reasons:

- it is based solely on the analysis of URLs (the address of an Internet resource) and of their syntax, ignoring the "content" of the page or other information;

- it does not analyze or use particular structural characteristics of URLs, but keeps a neutral point of observation, checking only the "similarity" between pairs of URLs;

- it uses techniques based on "unsupervised algorithms" and therefore does not require any prior knowledge or information;

- it is based exclusively on computing the "syntactic similarity" between URLs, which avoids the need for a set of pre-labeled elements and thereby also prevents overfitting of the algorithm used.

Inspired by text-mining algorithms (text extraction and processing), the method introduces the concept of "distance" between URLs, which is used to form "groups" of URLs by means of the well-known clustering algorithm DBSCAN (Density-Based Spatial Clustering of Applications with Noise). DBSCAN is "density"-based because it connects regions of points with sufficiently high density.

To better illustrate how "density"-based clustering algorithms work, consider a set of points in a sample space to be clustered. Let D(x1, x2) be the distance between two points x1 and x2, and consider the sphere of radius E centered at x1. Formally, a given point x1 is a "core point" if at least a minimum number of points (minPoints) lie within distance E of it. These points are defined as "directly reachable" from x1. A generic point xk is "reachable" from x1 if there is a path x1, x2, ..., xk such that each xi+1 is directly reachable from xi. The points reachable from x1 form a "cluster", i.e. a "dense" region. Points that are not reachable from x1 are called "outliers"; they can either form a separate cluster, if they belong to another dense region, or fall into the so-called "noise" region. The parameters minPoints and E are adjustable and can be set by a domain expert. The minPoints parameter defines the minimum size of a cluster and has little impact on the final results. The parameter E, on the other hand, is fundamental. If it is set too small, it leads to a large number of small groups and to many points that cannot be clustered. If it is set too large, it leads to a few groups containing a multitude of heterogeneous points. A sensitivity analysis is therefore essential to choose the value of the radius E correctly.
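The core-point and reachability rules above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `dbscan`, the use of a precomputed distance matrix, the label -1 for noise, and the convention that a point counts itself among its own neighbors are all assumptions made for this example.

```python
def dbscan(dist, eps, min_points):
    """Cluster points given a precomputed distance matrix.

    dist[i][j] is the distance between points i and j.
    Returns one label per point; -1 marks noise (assumed convention).
    """
    n = len(dist)
    # Neighborhood of each point: everything within radius eps (includes itself).
    neighbors = [[j for j in range(n) if dist[i][j] <= eps] for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_points:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1  # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_points:
                queue.extend(neighbors[j])  # j is also a core point: keep expanding
    return labels
```

Running this with several values of `eps` on the same distance matrix is one simple way to carry out the sensitivity analysis on the radius E mentioned above.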

The groups thus generated are then sorted to make them easier for the network administrator or security expert to view. The sorting considers the degree of cohesion of the elements within each group.

The present invention therefore solves the problem of processing the incoming data, aggregating them syntactically and semantically, and showing them to the analyst in a cohesive and coherent way, ordered by importance.

The method of the present invention also provides a tool for the aggregate analysis of Web traffic, making it possible to identify, simply and directly, Web transactions linked to malicious services, generated by automatic systems such as advertising generators and tracking systems, or, in general, of interest to the network administrator or security expert.

The aforementioned and other objects and advantages of the invention, which will emerge from the following description, are achieved with the method described in claim 1.

Preferred embodiments and non-trivial variants of the present invention form the subject matter of the dependent claims.

It is understood that all the attached claims form an integral part of the present description.

It will be immediately obvious that innumerable variations and modifications can be made to what has been described without departing from the scope of protection of the invention as set out in the attached claims.

The invention relates to a computer security method for analyzing traces of HTTP and HTTPS traffic on the Internet, aimed at extracting and grouping "similar" Web transactions generated "automatically" by malware, malicious services, unwanted advertising or other sources.

The main objectives of the present method are essentially to:

- reduce the number of elements that the analyst has to view and process, from hundreds of millions of individual transactions to a few hundred clusters (groups whose elements are similar/homogeneous);

- identify transactions generated "automatically", for example transactions generated by advertising platforms, polymorphic malware and/or wiki-like systems.

Specifically, the method comprises at least the following processing and control steps:

a) extraction of transactions from an operational network, by exploring HTTP and HTTPS traffic data and then collecting the extracted transactions in batches (groups of elements);

b) identification of "similar" transactions, by means of a metric based on the "similarity" between pairs of transactions, that is, on a measure of the degree of "diversity" between the pairs of character strings that make up the URLs;

c) activation of one or more "clustering" algorithms, used to group the transactions according to the similarity metric, so that each group of transactions contains elements with similar/homogeneous characteristics that can be analyzed as a "single" entity; this considerably reduces the number of elements to analyze and makes it easier and faster to analyze and search for malicious and/or unwanted Internet traffic generated artificially/automatically;

d) ordering of the transaction groups by their importance, that is, by the degree of cohesion of the transactions contained in each group.

Transactions are extracted by a passive network probe that extracts and filters traffic. The probe is located on a specific link; it processes data packets in real time, extracts the transactions, and then groups them into batches for subsequent processing.

Once a batch of transactions has been formed, the "distance" between all pairs of transactions, i.e. their degree of similarity, is computed. This distance is calculated by treating the entire URL as a single character string made up of both the "hostname" (the name identifying a device within a computer network) and the "path".
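As an illustration of this step, the sketch below reduces each URL to a single hostname-plus-path string and fills a symmetric matrix with the distance of every pair in a batch. The helper names `url_key` and `pairwise_distances` are hypothetical, keeping the query string as part of the path is an assumption (the description does not say how it is handled), and the `distance` argument stands in for whichever string metric is chosen.

```python
from itertools import combinations
from urllib.parse import urlsplit


def url_key(url: str) -> str:
    """Return hostname + path (+ query, by assumption) as one string."""
    parts = urlsplit(url)
    query = "?" + parts.query if parts.query else ""
    return (parts.hostname or "") + parts.path + query


def pairwise_distances(items, distance):
    """Build a symmetric matrix with distance(items[i], items[j]) for all pairs."""
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]
    # A batch of n URLs requires n * (n - 1) / 2 distance computations.
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = distance(items[i], items[j])
    return matrix
```

Because the number of pairs grows quadratically with the batch size, the batch size is the main lever on the cost of this step.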

To detect URLs that are similar to each other, a distance between pairs of strings is used. It belongs to the class of "edit distances" and is suitable for computing the dissimilarity of the pairs of character strings that make up the URLs: the "distance" between two character strings is the minimum number of steps required to convert one string into the other.

In the state of the art, the most popular technique is the so-called Levenshtein distance, which assigns a unit cost to every editing operation, i.e. insertion, deletion and substitution of a character. It computes an absolute distance between pairs of strings that is at most equal to the length of the longer string. This, however, makes the Levenshtein distance inconvenient when comparing a short URL with a long one (URL lengths can range from a few characters to hundreds). Unlike the known techniques, the present method computes the "distance" between the character strings that make up the URLs under the following conditions:

- "insertion" of a character has a cost of 1;

- "deletion" of a character has a cost of 1;

- "substitution" of a character has a cost of 2, since a substitution is equivalent to a deletion plus an insertion;

- the resulting value is normalized to the interval between 0 and 1 by summing the costs of all the operations (insertions, deletions and/or substitutions) needed to make the two strings coincide and dividing this sum by the sum of the lengths of the two strings;

- the distance between two URL character strings therefore varies in a normalized range between 0 and 1, so that a pair of identical strings has a distance of 0 and a pair of completely different strings has a distance of 1.

A pair of similar URLs has a small distance, while a pair of different URLs has a large distance.
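The cost rules listed above can be sketched with a standard dynamic-programming edit distance. This is a minimal illustration under the stated costs (insertion 1, deletion 1, substitution 2, normalization by the sum of the two lengths); the function name `url_distance` and the choice to return 0 for two empty strings are assumptions of this example.

```python
def url_distance(a: str, b: str) -> float:
    """Normalized edit distance in [0, 1]: insert = 1, delete = 1, substitute = 2."""
    if not a and not b:
        return 0.0  # assumed convention: two empty strings are identical
    # prev[j] holds the cost of turning the current prefix of a into b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            sub_cost = 0 if ca == cb else 2
            curr.append(min(prev[j] + 1,             # delete ca
                            curr[j - 1] + 1,         # insert cb
                            prev[j - 1] + sub_cost)) # keep or substitute
        prev = curr
    return prev[-1] / (len(a) + len(b))
```

For example, `url_distance("a.com/x?id=1", "a.com/x?id=2")` needs one substitution (cost 2) over a combined length of 24 characters, giving 2/24, while two strings of equal length with no character in common give exactly 1.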

Said one or more "clustering" algorithms, used to group the URLs according to the similarity metric, place URLs in the same set when they have a high similarity (i.e. a low distance).

For the purposes of the present invention, the well-known clustering algorithm DBSCAN is preferably used; it is based on computing the "density" of elements within a certain area.

The network administrator or security expert is then given a view of these transaction groups, sorted by degree of cohesion, starting from the most cohesive group.

In detail, an analysis tool called the "silhouette coefficient" was used for this task. This coefficient is based on the concepts of cohesion and separation: a cluster is cohesive if the elements inside it are very close to each other, and it is well separated if its points are far from those of other clusters. The silhouette coefficient therefore evaluates how well each point fits in its cluster.

Given a point i, let a(i) be the average distance between that point and all the other points of the cluster to which it belongs. This measures how well point i fits in its own group. Let b(i) be the lowest of the average distances between i and the points of each of the remaining clusters. The silhouette is then defined as the ratio between the difference b(i) - a(i) and the maximum of a(i) and b(i), i.e. s(i) = (b(i) - a(i)) / max(a(i), b(i)), which yields values in the interval between -1 and 1. The higher s(i) is, the better i matches its own cluster. In particular, a silhouette value > 0 means that the average distance between i and the other objects in its group is lower than the minimum average distance to the elements of the other clusters. For s(i) < 0 the opposite holds.
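The following sketch shows one way to compute s(i) from a precomputed distance matrix and to rank clusters by their average silhouette, most cohesive first. It is an illustration only: the names `silhouette` and `rank_clusters`, the use of the label -1 for noise points, and returning 0 for points in single-element clusters or when no other cluster exists are assumptions of this example, and averaging s(i) per cluster is one plausible reading of "ordering by degree of cohesion".

```python
from statistics import mean


def silhouette(i, labels, dist):
    """s(i) = (b - a) / max(a, b), with a and b as defined in the text."""
    own = [j for j, lab in enumerate(labels) if lab == labels[i] and j != i]
    others = {lab for lab in labels if lab != labels[i] and lab != -1}
    if not own or not others:
        return 0.0  # assumed convention when s(i) is undefined
    a = mean(dist[i][j] for j in own)
    # b: lowest average distance from i to the points of any other cluster.
    b = min(mean(dist[i][j] for j, lab in enumerate(labels) if lab == c)
            for c in others)
    denom = max(a, b)
    return (b - a) / denom if denom > 0 else 0.0


def rank_clusters(labels, dist):
    """Return (cluster, mean silhouette) pairs, most cohesive cluster first."""
    scores = {}
    for c in sorted(set(labels) - {-1}):  # skip noise points (assumed label -1)
        members = [i for i, lab in enumerate(labels) if lab == c]
        scores[c] = mean(silhouette(i, labels, dist) for i in members)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

With this ordering, tight groups of near-identical URLs, which are the most likely to be machine-generated, appear at the top of the list shown to the analyst.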

The method of the present invention is therefore based solely, and advantageously, on the "syntax" of the URLs, ignoring the "content" of the pages or other information.

Claims

1) Computer security method for the analysis of passive traces of HTTP and HTTPS traffic on the Internet, with extraction and grouping of similar Web transactions generated automatically by malware, malicious services, unwanted advertising or other sources, characterized in that it comprises at least the following processing and control steps: a) extraction of URLs from an operational network, by passive exploration of traffic data and subsequent collection of the extracted URLs in batches; b) identification of similar URLs, by means of a metric based on the similarity between URLs, that is, on a measure of the degree of diversity between the pairs of character strings that make up the URLs; c) activation of one or more clustering algorithms used to group the URLs according to the similarity metric and to obtain, within each group of URLs, elements with similar/homogeneous characteristics that can be analyzed as a single entity; d) ordering of said groups of URLs by their importance, that is, by the degree of cohesion of the URLs contained in said groups.

2) Method according to claim 1, characterized in that said URL extraction takes place via a passive network probe for exploration and filtering, located on a specific link and suitable for processing data packets in real time, extracting the URLs and collecting them in batches for subsequent processing.

3) Method according to claim 2, characterized in that, when an HTTP or HTTPS transaction is found, the URL it contains is recorded in a specific file.

4) Method according to claims 2 and 3, characterized in that, once a batch of URLs has been formed, the distance between all pairs of URLs, i.e. their degree of similarity, is calculated, said distance being calculated by treating the entire URL as a single character string consisting of both hostname and path.

5) Method according to one or more of the preceding claims from 1 to 4, characterized in that to detect URLs similar to each other, a similarity metric between pairs of strings is used, suitable for calculating the dissimilarity of pairs of character strings making up the URLs, considering the distance between pairs of character strings as the minimum number of steps necessary to convert one of the two strings into the other.

6) Method according to one of the preceding claims 1 to 5, characterized in that, for the calculation of the distance between pairs of character strings making up the URLs, the following conditions apply: - insertion of a character has a cost of 1; - deletion of a character has a cost of 1; - substitution of a character has a cost of 2, a substitution being equivalent to a deletion plus an insertion; - the value is normalized between 0 and 1 by dividing the sum of the costs of the operations needed to make the two strings coincide by the sum of the lengths of the two strings; - the distance between a pair of URL character strings varies in a normalized range of values between 0 and 1, so that a pair of identical strings has a distance of 0 and a pair of completely different strings has a distance of 1.

7) Method according to one of the preceding claims 1 to 6, characterized in that a pair of similar URLs has a small distance, while a pair of different URLs has a large distance.

8) Method according to claim 1, characterized in that said one or more clustering algorithms are adapted to be used to group the URLs on the basis of similarity metrics.

9) Method according to claim 8, characterized in that a DBSCAN clustering algorithm is preferably used, based on the calculation of the density of elements present within a certain area.

10) Method according to claim 9, characterized by the fact that said groupings generated using the DBSCAN clustering algorithm are ordered according to the degree of cohesion between the URLs contained therein.

11) Method according to claim 10, characterized in that the silhouette coefficient is preferably used, based on the calculation of the cohesion and the degree of separation for all the elements of each grouping.

12) Method according to one or more of the preceding claims from 1 to 11, characterized in that it is based solely on the syntax of the URLs.