WO2024105843A1 - Aggregation device, aggregation method, and program - Google Patents

Aggregation device, aggregation method, and program

Info

Publication number
WO2024105843A1
Authority
WO
WIPO (PCT)
Prior art keywords
fqdns
community
similarity
aggregation
graph
Prior art date
Application number
PCT/JP2022/042683
Other languages
English (en)
Japanese (ja)
Inventor
雅季 小林
真尚 岩本
正裕 小林
Original Assignee
日本電信電話株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電信電話株式会社
Priority to PCT/JP2022/042683
Publication of WO2024105843A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/41 Flow control; Congestion control by acting on aggregated flows or links

Definitions

  • This disclosure relates to an aggregation device, an aggregation method, and a program.
  • FQDN: fully qualified domain name
  • DNS: Domain Name System
  • FQDNs are often associated with the service names provided by the destination server.
  • Time-series changes in the number of DNS queries can be used as an indicator of traffic demand for services.
  • To analyze the time-series changes in the number of DNS queries accurately, a technique is required that classifies DNS queries based on their time-series change patterns.
  • Because DNS queries occur before traffic is generated and the two can be associated with each other, the classification of DNS queries and the classification of traffic can be considered the same.
  • Known conventional technologies for traffic classification include, for example, the technologies described in Patent Document 1 and Non-Patent Document 1.
  • Because DNS queries generally involve a huge number of FQDNs, it is necessary to aggregate the FQDNs in order to satisfy the constraints on calculation time and memory capacity.
  • Known conventional technologies for FQDN aggregation include, for example, the technology described in Non-Patent Document 2.
  • However, the technology described in Non-Patent Document 2 aggregates FQDNs by focusing on the similarity of their character strings, so FQDNs whose DNS queries follow different patterns of change over time may be aggregated together.
  • This disclosure has been made in consideration of the above points and provides a technology for aggregating FQDNs.
  • An aggregation device includes a first calculation unit that calculates at least the number of co-occurrences between different FQDNs using the number of DNS queries for each FQDN by each user in each time period, a second calculation unit that calculates the similarity between the different FQDNs using the number of co-occurrences, a graph construction unit that constructs a graph consisting of nodes representing FQDNs and edges having the similarity as a weight, a community analysis unit that detects the community to which each node included in the graph belongs, and an aggregation unit that aggregates FQDNs represented by nodes that belong to the same community.
  • FIG. 1 is a diagram illustrating an example of an overall configuration of a system including an aggregation device according to an embodiment of the present invention.
  • FIG. 2 is a diagram illustrating an example of a hardware configuration of an aggregation device according to the present embodiment.
  • FIG. 3 is a diagram illustrating an example of a functional configuration of an aggregation device according to the present embodiment.
  • FIG. 4 is a flowchart illustrating an example of an aggregation process according to the embodiment.
  • FIG. 5 illustrates an example of a graph and its community detection.
  • FQDN name resolution is performed by a DNS query, and the FQDN is often associated with the service name provided by the communication destination server.
  • name resolution means obtaining a corresponding IP (Internet Protocol) address from the FQDN.
  • IP Internet Protocol
  • The fact that the FQDN is often associated with the service name means that, for example, if the communication destination server provides service A, the FQDN of the server is often an FQDN associated with the service name, such as "www.service-a.co.jp". Note that name resolution may also be called "DNS name resolution".
  • Time-series changes in the number of DNS queries can be used as an indicator of traffic demand for a service.
  • a characteristic of time-series changes in the number of DNS queries is that there are various patterns depending on the user sending the DNS query and the FQDN being resolved. For example, the number of DNS queries generated when accessing a social networking service (SNS) from a smartphone tends to increase during the day, whereas the number of DNS queries generated mechanically by a server tends to remain constant.
  • SNS social networking service
  • If DNS queries can be classified (i.e., traffic can be classified) based on the pattern of time-series changes in the number of DNS queries, it becomes possible to distinguish between DNS queries with different patterns even in an environment in which DNS queries with various patterns coexist. As a result, the time-series changes in the number of DNS queries can be analyzed accurately (i.e., traffic demand can be analyzed accurately based on the pattern of time-series changes in the number of DNS queries), and resources can be increased in a way that accurately reflects traffic demand.
  • Among methods based on supervised learning, there is a method that uses a classification model based on a convolutional neural network, which is one of the deep learning models (for example, Non-Patent Document 1). This method attempts to improve the accuracy of traffic classification by convolving the traffic data of nearby base stations, which gives the classification model additional information.
  • NTF non-negative tensor factorization
  • the system inputs data that records the number of DNS queries for each (time, user, service) pair, and, following steps 1-1 to 1-4, classifies the (user, service) pairs into predefined K (where K is an integer equal to or greater than 1) patterns, and predicts the total number of future DNS queries (i.e., future traffic demand).
  • K is an integer equal to or greater than 1
  • a service corresponds to an FQDN
  • a user corresponds to a terminal that has made a DNS query related to that FQDN.
  • Step 1-1: Create a tensor X with (time, user, service) as axes and the number of DNS queries as components.
  • Step 1-2: Apply NTF to tensor X to calculate matrices A, B, and C indicating the similarity to K types of patterns for each element of time, user, and service. For example, $B = (b_{ik})$, where $b_{ik}$ represents the similarity of user i to pattern k.
  • Step 1-3: Using matrices B and C, find the pattern for each pair (user i, service j).
  • Step 1-4: For each group, a time series prediction model is constructed from the (user, service) pairs classified into that group, and future traffic demand is predicted using this time series prediction model.
  • Any model used for time series prediction can serve as the time series prediction model; examples include the ARIMA (autoregressive integrated moving average) model and the LSTM (Long Short-Term Memory) model.
  • The above method uses NTF to calculate the pattern of time-series changes in the number of DNS queries for each pair (user, service). This makes it possible to capture patterns such as "the number of DNS queries generated when accessing an SNS (service) from a smartphone (user) increases during the day."
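The factorization in step 1-2 can be written out as a worked formula. This is a sketch under the assumption that the NTF takes the usual rank-K form; the symbols $x_{tij}$ (the (t, i, j) component of X), $a_{tk}$, and $c_{jk}$ are introduced here only by analogy with $b_{ik}$, and the assignment rule in the second line is one plausible reading of step 1-3, not a rule stated in the text.

```latex
% Assumed rank-K non-negative factorization of the tensor X
\[
  x_{tij} \;\approx\; \sum_{k=1}^{K} a_{tk}\, b_{ik}\, c_{jk},
  \qquad a_{tk},\; b_{ik},\; c_{jk} \ge 0
\]
% Hypothetical reading of step 1-3: pick the pattern with the largest joint weight
\[
  \mathrm{pattern}(i, j) \;=\; \arg\max_{1 \le k \le K} \; b_{ik}\, c_{jk}
\]
```

Under this reading, the size of X grows with the product of the numbers of time periods, users, and services, which is why the number of services matters for calculation time and memory.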
  • issue 1 issues with methods based on supervised learning
  • issue 2 issues with methods based on supervised learning
  • issue 3 packets are analyzed using DPI (Deep Packet Inspection) and the association between traffic and services is collected as learning data.
  • DPI Deep Packet Inspection
  • traffic encryption has progressed, and the fact that communication carriers cannot know the contents of the traffic is also one of the reasons why it is difficult to prepare the learning data.
  • issue 2 issues with the method based on unsupervised learning
  • Suppose the method described in Patent Document 1 is applied to DNS queries (logs) using FQDNs as services.
  • In that case, the number of FQDNs is enormous, so the tensor X created in step 1-1 above becomes extremely large.
  • As a result, there is an issue (hereinafter referred to as issue 2) in that the method cannot be applied to large-scale data sets due to constraints on calculation time and memory capacity.
  • Non-Patent Document 2 describes a method of aggregating FQDNs that assumes FQDNs with a common upper domain provide similar services. This method focuses on the character string similarity of FQDNs; for example, "www.example.co.jp" and "www.example-abc.co.jp" are both aggregated to "co.jp".
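For contrast with the proposed method, string-based aggregation of this kind can be sketched as follows. This is a minimal illustration, not the procedure of Non-Patent Document 2 itself: the function name `aggregate_by_suffix` and the `keep_labels` parameter are hypothetical, and keeping the last two labels is simply an assumption chosen to reproduce the "co.jp" example.

```python
def aggregate_by_suffix(fqdn: str, keep_labels: int = 2) -> str:
    """Return the last `keep_labels` DNS labels of `fqdn` (hypothetical rule)."""
    labels = fqdn.rstrip(".").lower().split(".")
    return ".".join(labels[-keep_labels:])


names = ["www.example.co.jp", "www.example-abc.co.jp"]
aggregated = {name: aggregate_by_suffix(name) for name in names}
# Both names collapse to the same key, regardless of how their
# DNS query counts change over time.
```

Because the rule looks only at character strings, two FQDNs with entirely different time-series patterns end up in the same group whenever they share a suffix.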
  • the above problem 1 can be solved by using a method based on unsupervised learning. Furthermore, the above problem 2 can be solved by aggregating FQDNs as preprocessing. Problem 2-1 when aggregating FQDNs can be solved by extracting a set of co-occurring FQDNs from (the log of) a DNS query, rather than focusing on the character string similarity of the FQDNs, and defining the similarity between FQDNs based on the FQDN set.
  • A set of co-occurring FQDNs refers to "a set consisting of FQDNs for which a specific user performed DNS name resolution during a specific time period."
  • Hereinafter, the log of DNS queries is referred to as a "DNS query log."
  • The method for solving problem 2-1 above is called the "proposed method", and in the following embodiment, an aggregation device 10 that aggregates FQDNs using this proposed method is described. Note that since FQDN aggregation can be considered a preprocessing step for traffic classification, the aggregation device 10 may also be called a "preprocessing device" that performs preprocessing for traffic classification.
  • the proposed method described above can easily extract a set of FQDNs for services that tend to be accessed at the same time. Therefore, when classifying traffic using the method described in Patent Document 1, by using the proposed method as preprocessing, it is possible to reduce the impact on the subsequent NTF compared to using the method described in Non-Patent Document 2 as preprocessing.
  • An example of the overall configuration of a system including an aggregation device 10 according to this embodiment is shown in FIG. 1.
  • the system includes the aggregation device 10, a classification device 20, a DNS cache server 30, and one or more terminals 40.
  • the aggregation device 10 and the classification device 20, the aggregation device 10 and the DNS cache server 30, and the DNS cache server 30 and the terminals 40 are each connected to be able to communicate with each other.
  • The aggregation device 10 aggregates FQDNs using, as input data, the number of DNS queries $c_{tij}$ obtained from the DNS query log held by the DNS cache server 30.
  • $c_{tij}$ represents the number of DNS queries for FQDN $j$ by user $i$ in time period $t$.
  • $T$, $I$, and $J$ are the number of time periods, the number of users, and the number of FQDNs, respectively, where $1 \le t \le T$, $1 \le i \le I$, and $1 \le j \le J$.
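The input data can be pictured as a sparse count table built from the DNS query log. The following is a minimal sketch, assuming each log record has already been reduced to a (time period, user, FQDN) triple; the record layout, the sample values, and the name `count_queries` are hypothetical.

```python
from collections import Counter

# Hypothetical, already-parsed log records: (time_period, user, fqdn)
log = [
    (1, "u1", "www.service-a.co.jp"),
    (1, "u1", "www.service-a.co.jp"),
    (1, "u1", "cdn.service-a.co.jp"),
    (1, "u2", "www.service-b.co.jp"),
    (2, "u1", "www.service-a.co.jp"),
]


def count_queries(records):
    """Return a sparse mapping (t, i, j) -> number of DNS queries.

    Keys that never appear in the log are simply absent, which stands
    for a count of zero.
    """
    return Counter(records)


c = count_queries(log)
```

A sparse mapping is used here because most (time period, user, FQDN) combinations never occur in practice; a dense T × I × J array is an equally valid representation for small data.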
  • the classification device 20 classifies traffic by the method described in Patent Document 1, using the FQDNs aggregated by the aggregation device 10. That is, the classification device 20 inputs data recording the number of DNS queries for each pair (time, user, service) using the FQDNs aggregated by the aggregation device 10 as services, and classifies the pairs (user, service) into K types of patterns by steps 1-1 to 1-3 above. As a result, traffic corresponding to (user, service) is classified into K types of patterns.
  • The classification device 20 may also construct a time series prediction model for each group by step 1-4 above, and predict future traffic demand using this time series prediction model.
  • the DNS cache server 30 receives a DNS query from the terminal 40 and performs name resolution of the FQDN related to the DNS query.
  • the DNS cache server 30 is connected to the Internet 50, and if it cannot perform name resolution using the information it caches, it sends the DNS query to another DNS cache server.
  • The terminal 40 is any of various terminals used by users (e.g., a PC (personal computer), a smartphone, a tablet terminal, a wearable device, or a game machine).
  • the terminal 40 is connected to the Internet 50, and when using a service via the Internet 50, the terminal 40 sends a DNS query for the FQDN of the server that provides the service to the DNS cache server 30. This allows name resolution of the FQDN, and the terminal 40 can access the server that provides the service.
  • the overall configuration of the system shown in FIG. 1 is just an example, and the overall configuration of a system including the aggregation device 10 is not limited to this.
  • the aggregation device 10 and the classification device 20 may be configured integrally, or the aggregation device 10, the classification device 20, and the DNS cache server 30 may be configured integrally.
  • An example of a hardware configuration of the aggregation device 10 according to this embodiment is shown in FIG. 2.
  • the aggregation device 10 according to this embodiment has an input device 101, a display device 102, an external I/F 103, a communication I/F 104, a RAM (Random Access Memory) 105, a ROM (Read Only Memory) 106, an auxiliary storage device 107, and a processor 108.
  • Each of these pieces of hardware is connected to each other so as to be able to communicate with each other via a bus 109.
  • the input device 101 is, for example, a keyboard, a mouse, a touch panel, a physical button, etc.
  • the display device 102 is, for example, a display, a display panel, etc. Note that the aggregation device 10 does not have to have at least one of the input device 101 and the display device 102, for example.
  • the external I/F 103 is an interface with external devices such as a recording medium 103a.
  • recording media 103a include a CD (Compact Disc), a DVD (Digital Versatile Disk), an SD memory card (Secure Digital memory card), and a USB (Universal Serial Bus) memory card.
  • the communication I/F 104 is an interface through which the aggregation device 10 communicates with the classification device 20, the DNS cache server 30, etc.
  • the RAM 105 is a volatile semiconductor memory (storage device) that temporarily stores programs and data.
  • the ROM 106 is a non-volatile semiconductor memory (storage device) that can store programs and data even when the power is turned off.
  • the auxiliary storage device 107 is a non-volatile storage device such as a HDD (Hard Disk Drive), SSD (Solid State Drive), flash memory, etc.
  • The processor 108 is, for example, any of various arithmetic devices such as a CPU (Central Processing Unit).
  • the hardware configuration shown in FIG. 2 is an example, and the hardware configuration of the aggregation device 10 is not limited to this.
  • the aggregation device 10 may have multiple auxiliary storage devices 107 and multiple processors 108, may not have some of the hardware shown in the figure, or may have various hardware other than the hardware shown in the figure.
  • the aggregation device 10 includes a co-occurrence count calculation unit 201, a similarity calculation unit 202, a graph construction unit 203, a community analysis unit 204, and an FQDN aggregation unit 205.
  • Each of these units is realized by, for example, processing in which one or more programs installed in the aggregation device 10 are executed by the processor 108 or the like.
  • The co-occurrence count calculation unit 201 uses the number of DNS queries $c_{tij}$ ($1 \le t \le T$, $1 \le i \le I$, $1 \le j \le J$) to calculate the number of co-occurrences between FQDNs and the number of occurrences of each FQDN.
  • the similarity calculation unit 202 calculates the similarity between FQDNs using the number of co-occurrences between FQDNs and the number of occurrences of each FQDN.
  • the graph construction unit 203 uses the similarity between FQDNs to construct a graph (weighted graph) in which the FQDNs are nodes and the similarity between the FQDNs is the weight of the edges.
  • the community analysis unit 204 applies a technique called community analysis to the graph to calculate the community to which each FQDN belongs.
  • the FQDN aggregation unit 205 aggregates FQDNs that belong to the same community.
  • Aggregation process: The aggregation process according to this embodiment will be described below with reference to FIG. 4.
  • The number of DNS queries $c_{tij}$ ($1 \le t \le T$, $1 \le i \le I$, $1 \le j \le J$) is used as input data, and the set thereof (the input data set) is $\{c_{tij} \mid 1 \le t \le T,\ 1 \le i \le I,\ 1 \le j \le J\}$.
  • The co-occurrence count calculation unit 201 uses the number of DNS queries $c_{tij}$ ($1 \le t \le T$, $1 \le i \le I$, $1 \le j \le J$) to calculate the number of co-occurrences between FQDNs and the number of occurrences of each FQDN by the following steps 2-1 to 2-3.
  • $1 \le t \le T,\ 1 \le i \le I,\ j_1, j_2 \in u_{ti}\}$. Note that $s(j_1, j_2) = s(j_2, j_1)$.
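The counting step can be sketched as follows. Since steps 2-1 to 2-3 are not reproduced in this text, the sketch rests on assumptions: the set for user i in time period t is taken to contain every FQDN with at least one query, the number of occurrences of an FQDN is the number of (t, i) pairs whose set contains it, and the number of co-occurrences of two FQDNs is the number of (t, i) pairs whose set contains both. The function name and sample data are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_counts(c):
    """c maps (t, i, j) -> query count. Returns (s_single, s_pair)."""
    # Assumed step: set of FQDNs resolved by user i in time period t
    u = defaultdict(set)
    for (t, i, j), n in c.items():
        if n > 0:
            u[(t, i)].add(j)

    s_single = Counter()  # occurrences per FQDN
    s_pair = Counter()    # co-occurrences per unordered FQDN pair
    for fqdns in u.values():
        s_single.update(fqdns)
        # Store each pair once with sorted members, since the count is symmetric
        for j1, j2 in combinations(sorted(fqdns), 2):
            s_pair[(j1, j2)] += 1
    return s_single, s_pair


c = {
    (1, "u1", "a.example"): 3,
    (1, "u1", "b.example"): 1,
    (1, "u2", "a.example"): 2,
    (2, "u1", "a.example"): 1,
    (2, "u1", "b.example"): 4,
    (2, "u2", "c.example"): 1,
}
s_single, s_pair = cooccurrence_counts(c)
```

Note that the query counts themselves are used only to decide set membership in this sketch; whether the actual steps weight co-occurrences by query volume is not stated in the surviving text.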
  • the similarity calculation unit 202 calculates the similarity between FQDNs using the number of co-occurrences between FQDNs calculated in step S101 and the number of occurrences of each FQDN (step S102).
  • The similarity between FQDNs $j_1$ and $j_2$ (where $1 \le j_1, j_2 \le J$ and $j_1 \ne j_2$) is denoted $\mathrm{sim}(j_1, j_2)$. Since the similarity between FQDNs in the proposed method is intended to reflect the tendency of co-occurrence, the Jaccard index or PMI (pointwise mutual information) described in Reference 1 can be used, for example.
  • When a similarity $\mathrm{sim}(j_1, j_2)$ that does not use the number of occurrences $s(j)$, such as the co-occurrence frequency itself, is used, the number of occurrences of each FQDN does not need to be calculated in step S101.
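The two similarity measures named above can be sketched from the counts as follows. The formulas are the standard textbook definitions, assumed here because the text does not spell them out; in particular, estimating probabilities by dividing by `n_slots` (taken as the number of (time period, user) pairs) is an assumption, and the function names are hypothetical.

```python
import math


def jaccard(s_pair: int, s1: int, s2: int) -> float:
    """Jaccard index: co-occurrences divided by the size of the union."""
    union = s1 + s2 - s_pair
    return s_pair / union if union > 0 else 0.0


def pmi(s_pair: int, s1: int, s2: int, n_slots: int) -> float:
    """Pointwise mutual information with probabilities estimated as count / n_slots."""
    if s_pair == 0 or s1 == 0 or s2 == 0:
        return float("-inf")  # the pair never co-occurs
    p_pair = s_pair / n_slots
    return math.log(p_pair / ((s1 / n_slots) * (s2 / n_slots)))


# Example: one FQDN occurs in 3 slots, another in 2, both together in 2, out of 4 slots
sim_jaccard = jaccard(2, 3, 2)
sim_pmi = pmi(2, 3, 2, 4)
```

The Jaccard index is bounded between 0 and 1, which makes it convenient as an edge weight; PMI can be negative or unbounded, so some rescaling or thresholding would likely be needed before using it as a weight.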
  • the graph construction unit 203 uses the similarity between FQDNs calculated in step S102 to construct a graph in which the FQDNs are nodes and the similarity between FQDNs is the weight of the edge (step S103).
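  • This results in the construction of a graph $G = (V, E)$.
  • Here, $V$ is the set of nodes, each representing an FQDN $j$, and $E$ is the set of edges, each weighted by the similarity $\mathrm{sim}(j_1, j_2)$.
  • In the example, $V = \{1, 2, 3, 4, 5\}$.
  • The edges $\mathrm{sim}(1,4)$, $\mathrm{sim}(2,4)$, $\mathrm{sim}(2,5)$, and $\mathrm{sim}(3,5)$ have been deleted.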
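Graph construction can be sketched with a plain adjacency mapping. The text says only that some edges have been deleted, so the rule used here (drop an edge when its similarity does not exceed a threshold) is an assumption, as are the function name, the threshold, and the similarity values; the example mirrors the five nodes and four deleted edges mentioned above.

```python
def build_graph(similarities, threshold=0.0):
    """similarities maps (j1, j2) -> sim. Returns {node: {neighbor: weight}}."""
    graph = {}
    for (j1, j2), sim in similarities.items():
        # Every FQDN becomes a node, even if all of its edges are dropped
        graph.setdefault(j1, {})
        graph.setdefault(j2, {})
        if sim > threshold:  # assumed deletion rule
            graph[j1][j2] = sim
            graph[j2][j1] = sim
    return graph


# Hypothetical similarities: the four pairs below are given low values
weak_pairs = {(1, 4), (2, 4), (2, 5), (3, 5)}
similarities = {
    (j1, j2): (0.05 if (j1, j2) in weak_pairs else 0.5)
    for j1 in range(1, 6)
    for j2 in range(j1 + 1, 6)
}
graph = build_graph(similarities, threshold=0.1)
```

Dropping weak edges keeps the graph sparse, which matters when the number of FQDNs is large, since a complete graph has a number of edges quadratic in the number of nodes.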
  • the community analysis unit 204 applies a community detection algorithm to the graph constructed in step S103 above, and detects the community to which each node (FQDN) belongs (step S104).
  • the community detection algorithm is an algorithm that detects a subset of nodes in a graph in which the nodes are closely connected to each other.
  • the community analysis unit 204 may apply, for example, a community detection algorithm described in Reference 2 to the graph G constructed in step S103 above.
  • the community to which each FQDN j belongs is calculated (detected), and a set (subset of V) consisting of FQDNs having high similarity to each other is obtained.
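Community detection on the weighted graph can be sketched as follows. Reference 2 describes a greedy modularity-maximization algorithm; the code below is a naive version of that greedy criterion (repeatedly merge the pair of connected communities with the largest modularity gain, stopping when no merge increases modularity) without the efficient data structures of Reference 2. It is an illustration under these assumptions, not the implementation used by the aggregation device, and all names are hypothetical.

```python
def greedy_modularity(nodes, weights):
    """nodes: iterable of node ids. weights: {(u, v): w} for undirected edges, u != v.

    Returns a list of sets, one per detected community.
    """
    total = sum(weights.values())
    if total == 0:
        return [{n} for n in nodes]
    two_m = 2.0 * total

    comm = {n: {n} for n in nodes}   # community id -> members
    a = {n: 0.0 for n in nodes}      # fraction of edge ends attached to each community
    e = {}                           # frozenset({c, d}) -> inter-community weight / 2m
    for (u, v), w in weights.items():
        a[u] += w / two_m
        a[v] += w / two_m
        key = frozenset((u, v))
        e[key] = e.get(key, 0.0) + w / two_m

    while True:
        best, best_gain = None, 0.0
        for key, e_cd in e.items():
            c, d = tuple(key)
            gain = 2.0 * (e_cd - a[c] * a[d])  # modularity change if c and d merge
            if gain > best_gain + 1e-12:
                best, best_gain = (c, d), gain
        if best is None:
            break  # no merge improves modularity
        c, d = best
        comm[c] |= comm.pop(d)
        a[c] += a.pop(d)
        merged = {}
        for key, val in e.items():
            if key == frozenset((c, d)):
                continue  # this edge is now internal to the merged community
            x, y = (c if n == d else n for n in key)
            k2 = frozenset((x, y))
            merged[k2] = merged.get(k2, 0.0) + val
        e = merged
    return list(comm.values())


# Two tightly connected triangles joined by one weak edge
weights = {(1, 2): 1.0, (1, 3): 1.0, (2, 3): 1.0,
           (4, 5): 1.0, (4, 6): 1.0, (5, 6): 1.0,
           (3, 4): 0.1}
communities = greedy_modularity(range(1, 7), weights)
```

This version rescans all community pairs on every merge, so it suits only small graphs. For large graphs, a library routine such as `greedy_modularity_communities` in NetworkX may be a more practical choice.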
  • $K$ communities are calculated in this step, and the community to which each FQDN $j$ belongs is identified by an index between 1 and $K$.
  • The FQDN aggregation unit 205 aggregates FQDNs belonging to the same community (step S105). That is, the FQDN aggregation unit 205 replaces each $j$ in the input data set with the index of the community to which FQDN $j$ belongs to create a new data set. This aggregates $J$ FQDNs into $K$ FQDNs, thereby solving the above problem 2-1. Therefore, the classification device 20 can classify traffic with high accuracy while satisfying constraints on calculation time and memory capacity.
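The replacement of FQDNs by community indices can be sketched as follows. The text states only that each FQDN is replaced by its community; summing the query counts of FQDNs that fall into the same community is an assumption made here, and the names `aggregate_counts` and `community_of` are hypothetical.

```python
from collections import Counter


def aggregate_counts(c, community_of):
    """c maps (t, i, j) -> query count; community_of maps FQDN j -> community index.

    Returns a mapping (t, i, community index) -> summed query count.
    """
    aggregated = Counter()
    for (t, i, j), n in c.items():
        # Assumed rule: counts of FQDNs in the same community are added together
        aggregated[(t, i, community_of[j])] += n
    return aggregated


c = {
    (1, "u1", "a.example"): 3,
    (1, "u1", "b.example"): 1,
    (1, "u2", "c.example"): 2,
}
community_of = {"a.example": 1, "b.example": 1, "c.example": 2}
aggregated = aggregate_counts(c, community_of)
```

The resulting data set has the same (time, user, service) shape as the original, so it could be passed to the classification step unchanged, with the service axis reduced from the number of FQDNs to the number of communities.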
  • the aggregation device 10 defines a similarity between FQDNs based on a set of co-occurring FQDNs, and aggregates the FQDNs based on the similarity. This makes it possible to aggregate FQDNs without destroying the pattern of time-series changes in the number of DNS queries. Therefore, by performing FQDN aggregation by the aggregation device 10 according to the present embodiment as a preprocessing step, even if a large set of DNS query numbers is given as an input data set, it is possible to classify traffic with high accuracy while satisfying constraints on calculation time and memory capacity, for example, by the method described in Patent Document 1.
  • the main purpose of the proposed method is to aggregate FQDNs as a pre-processing for the NTF method (for example, the method described in Patent Document 1) that classifies traffic by (user, service), but the present invention is not limited to this, and the proposed method may be used alone.
  • In that case, the classification is a coarse one by service, and the classification accuracy is lower than that of the existing method, but the amount of calculation can be reduced.
  • Reference 1 Akiko Aizawa, “Similarity Measures Based on Co-occurrence,” Operations Research: Science of Management, vol. 52, no. 11, pp. 706-712, 2007.
  • Reference 2 A. Clauset, M. Newman and C. Moore, "Finding community structure in very large networks," Physical Review E, vol. 70, p. 066111, 2004.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

An aggregation device according to one aspect of the present invention comprises: a first calculation unit that calculates at least the number of co-occurrences between different FQDNs, using the number of DNS queries for each FQDN by each user in each time period; a second calculation unit that calculates the similarity between the different FQDNs, using the number of co-occurrences; a graph construction unit that constructs a graph consisting of nodes representing the FQDNs and edges each having the similarity as a weight; a community analysis unit that detects the communities to which the respective nodes included in the graph belong; and an aggregation unit that aggregates the FQDNs represented by nodes belonging to the same community.
PCT/JP2022/042683 2022-11-17 2022-11-17 Aggregation device, aggregation method, and program WO2024105843A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/042683 WO2024105843A1 (fr) 2022-11-17 2022-11-17 Aggregation device, aggregation method, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/042683 WO2024105843A1 (fr) 2022-11-17 2022-11-17 Aggregation device, aggregation method, and program

Publications (1)

Publication Number Publication Date
WO2024105843A1 (fr) 2024-05-23

Family

ID=91084088

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/042683 WO2024105843A1 (fr) 2022-11-17 2022-11-17 Aggregation device, aggregation method, and program

Country Status (1)

Country Link
WO (1) WO2024105843A1 (fr)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011015047A (ja) * 2009-06-30 2011-01-20 Nippon Telegr & Teleph Corp <Ntt> Traffic characteristic measurement method and device
JP2017050827A (ja) * 2015-09-04 2017-03-09 日本電信電話株式会社 Access count estimation device, access count estimation method, and program
WO2020170852A1 (fr) * 2019-02-19 2020-08-27 日本電信電話株式会社 Prediction device, prediction method, and program

Similar Documents

Publication Publication Date Title
US20210385236A1 (en) System and method for the automated detection and prediction of online threats
US20200349430A1 (en) System and method for predicting domain reputation
Kim et al. Design of network threat detection and classification based on machine learning on cloud computing
Rosa et al. Intrusion and anomaly detection for the next-generation of industrial automation and control systems
Kaur A comparison of two hybrid ensemble techniques for network anomaly detection in spark distributed environment
Sarabi et al. Characterizing the internet host population using deep learning: A universal and lightweight numerical embedding
Sriramoju Review on Big Data and Mining Algorithm
Luo et al. Acceleration of decision tree searching for IP traffic classification
Kepner et al. Hypersparse neural network analysis of large-scale internet traffic
Tang et al. HSLF: HTTP header sequence based lsh fingerprints for application traffic classification
Nuojua et al. DNS tunneling detection techniques–classification, and theoretical comparison in case of a real APT campaign
Li et al. Network intrusion detection via tri-broad learning system based on spatial-temporal granularity
CN112348041B (zh) 日志分类、日志分类训练方法及装置、设备、存储介质
WO2024105843A1 (fr) Aggregation device, aggregation method, and program
Feng et al. An efficient caching mechanism for network-based url filtering by multi-level counting bloom filters
Wei et al. Age: authentication graph embedding for detecting anomalous login activities
Long et al. Deep encrypted traffic detection: An anomaly detection framework for encryption traffic based on parallel automatic feature extraction
Fahd et al. A framework for real-time sentiment analysis of big data generated by social media platforms
Magalhães et al. Adopting machine learning to support the detection of malicious domain names
Darwish et al. Bio-inspired machine learning mechanism for detecting malicious URL through passive DNS in big data platform
Qin et al. Network traffic classification based on SD sampling and hierarchical ensemble learning
Kotenko et al. Combining spark and snort technologies for detection of network attacks and anomalies: assessment of performance for the big data framework
Tian et al. Towards revealing parallel adversarial attack on politician socialnet of graph structure
Al Musawi et al. Examining indicators of complex network vulnerability across diverse attack scenarios
Li et al. Measuring and classifying IP usage scenarios: a continuous neural trees approach

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application
Ref document number: 22965819; Country of ref document: EP; Kind code of ref document: A1