WO2022239235A1 - Feature quantity calculation device, feature quantity calculation method, and feature quantity calculation program - Google Patents

Feature quantity calculation device, feature quantity calculation method, and feature quantity calculation program

Info

Publication number
WO2022239235A1
WO2022239235A1 (PCT/JP2021/018420)
Authority
WO
WIPO (PCT)
Prior art keywords
nodes
feature amount
unit
graph
feature
Prior art date
Application number
PCT/JP2021/018420
Other languages
French (fr)
Japanese (ja)
Inventor
博 胡
和憲 神谷
Original Assignee
日本電信電話株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電信電話株式会社
Priority to PCT/JP2021/018420 priority Critical patent/WO2022239235A1/en
Priority to JP2023520722A priority patent/JPWO2022239235A1/ja
Publication of WO2022239235A1 publication Critical patent/WO2022239235A1/en


Abstract

In the present invention, a generation unit (15b) generates, using communication information between network nodes, a graph that represents communication between the nodes. A selection unit (15c) selects associated nodes, among the nodes of the generated graph, that are connected by a path having a prescribed length. A classification unit (15d) classifies the selected nodes that are within a prescribed distance on the path into a group that corresponds to the node-to-node distance. A learning unit (15e) learns, for each of the classified groups, a model (14a) that represents a feature quantity in the graph of the nodes within the group. A calculation unit (15f) synthesizes, for each of the selected nodes, feature quantities estimated using the model (14a) that has been learned for each group, thereby calculating a feature quantity.

Description

Feature quantity calculation device, feature quantity calculation method, and feature quantity calculation program
The present invention relates to a feature amount calculation device, a feature amount calculation method, and a feature amount calculation program.
Networks called botnets, which consist of malicious servers hijacked by malicious programs, have diverse structures, and the shortest distance between malicious servers likewise varies with the structure. In recent years, technology for detecting the structure of such botnets has been anticipated. The feature amount of a graph in which IP hosts are nodes and end-to-end communications between IP hosts are edges is useful information for detecting the structure of a botnet.
A technique called graph embedding is therefore known that learns node feature amounts from a communication graph built from network flow information. For example, it can generate node paths from the graph and learn the similarity between nodes within a predetermined number of hops on each path (see Non-Patent Document 1).
With conventional techniques, however, it is difficult to learn high-quality feature amounts. For example, when focusing on a given node, the context carried by adjacent nodes within a certain distance differs depending on that distance, so learning nodes with different contexts at the same time degrades the quality of the obtained feature amounts.
The present invention has been made in view of the above, and aims to learn high-quality feature amounts from a graph representing a communication network.
To solve the above problems and achieve the object, a feature amount learning apparatus according to the present invention includes: a generation unit that generates a graph representing communication between nodes, using communication information between nodes of a network; a selection unit that selects, from among the nodes of the generated graph, related nodes connected by a path of a predetermined length; a classification unit that classifies, for the selected nodes, nodes within a predetermined distance on the path into groups according to the distance between the nodes; a learning unit that learns, for each of the classified groups, a model representing the feature amount in the graph of each node in the group; and a calculation unit that calculates, for each of the selected nodes, a feature amount by synthesizing the feature amounts estimated using the models learned for the respective groups.
According to the present invention, it is possible to learn high-quality feature amounts from a graph representing a communication network.
FIG. 1 is a schematic diagram illustrating the schematic configuration of a feature amount calculation device. FIG. 2 is a diagram for explaining processing of the feature amount calculation device. FIG. 3 is a diagram for explaining processing of the feature amount calculation device. FIG. 4 is a diagram for explaining processing of the feature amount calculation device. FIG. 5 is a flowchart showing the feature amount calculation processing procedure. FIG. 6 is a diagram illustrating a computer that executes a feature amount calculation program.
An embodiment of the present invention will be described in detail below with reference to the drawings. Note that the present invention is not limited by this embodiment. In the description of the drawings, the same parts are denoted by the same reference numerals.
[Configuration of the feature amount calculation device]
FIG. 1 is a schematic diagram illustrating the schematic configuration of the feature amount calculation device, and FIGS. 2 to 4 are diagrams for explaining its processing. First, as illustrated in FIG. 1, the feature amount calculation device 10 is implemented by a general-purpose computer such as a personal computer, and includes an input unit 11, an output unit 12, a communication control unit 13, a storage unit 14, and a control unit 15.
The input unit 11 is implemented using input devices such as a keyboard and a mouse, and inputs various instruction information, such as an instruction to start processing, to the control unit 15 in response to input operations by the operator. The output unit 12 is implemented by a display device such as a liquid crystal display, a printing device such as a printer, or the like.
The communication control unit 13 is implemented by a NIC (Network Interface Card) or the like, and controls communication between the control unit 15 and an external device such as a server via a network. For example, the communication control unit 13 controls communication between the control unit 15 and a management device or the like that collects and manages network communication information.
The storage unit 14 is implemented by a semiconductor memory device such as a RAM (Random Access Memory) or flash memory, or by a storage device such as a hard disk or an optical disk. The storage unit 14 stores in advance a processing program for operating the feature amount calculation device 10 and data used while the processing program runs, or stores such data temporarily each time processing is performed. For example, the storage unit 14 stores the model 14a produced by the learning unit described later. The storage unit 14 may also be configured to communicate with the control unit 15 via the communication control unit 13.
The control unit 15 is implemented using a CPU (Central Processing Unit) or the like, and executes a processing program stored in memory. The control unit 15 thereby functions as an acquisition unit 15a, a generation unit 15b, a selection unit 15c, a classification unit 15d, a learning unit 15e, a calculation unit 15f, and an extraction unit 15g, as illustrated in FIG. 1. These functional units may each, in whole or in part, be implemented on different hardware; for example, the learning unit 15e and the calculation unit 15f may be implemented on different hardware. The control unit 15 may also include other functional units.
The acquisition unit 15a acquires the collected communication information of the nodes of the network. For example, the acquisition unit 15a acquires flow information and the like of the IP hosts to be processed in the feature amount calculation processing described later, via the input unit 11 or the communication control unit 13, from a management device or the like that collects and manages network communication information. The acquisition unit 15a may store the acquired data in the storage unit 14, or may instead transfer the information to the generation unit 15b described below without storing it in the storage unit 14.
The generation unit 15b uses the communication information between the nodes of the network to generate a graph representing the communication between the nodes. For example, as shown in FIG. 2, the generation unit 15b uses the acquired flow information of IP hosts to create a graph in which IP hosts are nodes and communications between IP hosts are edges. FIG. 2 illustrates a communication graph between malicious (bot) servers and a C&C (Command and Control) server.
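As a minimal sketch, this graph construction can be expressed in Python as follows. The sample flow records, the helper name build_graph, and the use of the networkx library are assumptions for illustration only; the patent does not prescribe a data format or library.

```python
import networkx as nx

# Hypothetical flow records: (source IP, destination IP) pairs drawn
# from the collected communication information of the network.
flows = [
    ("10.0.0.1", "10.0.0.2"),
    ("10.0.0.1", "10.0.0.3"),
    ("10.0.0.2", "10.0.0.4"),
    ("10.0.0.3", "10.0.0.4"),
    ("10.0.0.4", "10.0.0.5"),
]

def build_graph(flows):
    """Build a graph with IP hosts as nodes and observed
    host-to-host communications as edges (generation unit 15b)."""
    graph = nx.Graph()
    graph.add_edges_from(flows)
    return graph

graph = build_graph(flows)
print(graph.number_of_nodes(), graph.number_of_edges())  # 5 5
```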
The selection unit 15c selects, from among the nodes of the generated graph, related nodes connected by a path of a predetermined length. For example, the selection unit 15c executes a random walk a predetermined number of times from each node as the starting point, generating, for each starting node, paths of a predetermined length that contain that node.
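Continuing the sketch above, a uniform random walk might look as follows; walk_length and walks_per_node stand in for the predetermined path length and number of executions, whose concrete values the patent leaves open.

```python
import random

def random_walks(graph, walk_length=5, walks_per_node=10, seed=0):
    """Run uniform random walks of a predetermined length from every
    node, yielding the paths used by the selection unit 15c."""
    rng = random.Random(seed)
    walks = []
    for start in graph.nodes():
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbors = list(graph.neighbors(walk[-1]))
                if not neighbors:  # isolated node: end the walk early
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

walks = random_walks(graph)
```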
The classification unit 15d classifies, for the selected nodes, nodes within a predetermined distance of each other on the path into groups according to the distance between the nodes. For example, for the graph shown in FIG. 2, the classification unit 15d classifies node pair A-C into the distance-1 group, node pair A-B into the distance-2 group, and node pair A-E into the distance-3 group, as shown in FIG. 3.
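One way to realize this grouping, continuing the sketch, is to scan each walk and collect co-occurring node pairs keyed by their distance on the walk; max_distance here plays the role of the predetermined distance and is an assumed value.

```python
from collections import defaultdict

def group_pairs_by_distance(walks, max_distance=3):
    """Classify node pairs within max_distance of each other on a walk
    into per-distance groups (classification unit 15d)."""
    groups = defaultdict(list)  # distance -> list of (node, context) pairs
    for walk in walks:
        for i, node in enumerate(walk):
            upper = min(i + max_distance, len(walk) - 1)
            for j in range(i + 1, upper + 1):
                groups[j - i].append((node, walk[j]))
    return groups

groups = group_pairs_by_distance(walks)
```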
As shown in FIG. 4(a), the learning unit 15e learns, for each classified group, a model 14a representing the feature amount in the graph of each node in the group. In this embodiment, the learning unit 15e learns a different model 14a for each classified group.
The learning unit 15e may further learn a common model 14a for a plurality of the classified groups that fall within a predetermined distance range. For example, among the groups illustrated in FIG. 4(a), a common model 14a may be learned for the groups with a distance of 2 or less, that is, for the distance-1 group and the distance-2 group. In this case, the learning unit 15e may, in principle, learn a different model 14a for each group while allowing a plurality of groups to be selected for which a common model 14a is learned.
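The patent does not fix the model class. As one hedged illustration, the sketch below trains a separate skip-gram embedding per distance group with gensim, in the spirit of DeepWalk-style graph embedding, feeding each (node, context) pair as a two-token sentence; merging, say, the distance-1 and distance-2 pair lists before training would correspond to the common model 14a described above.

```python
from gensim.models import Word2Vec

def train_group_models(groups, dim=16):
    """Learn one embedding model per distance group (learning unit 15e).
    Skip-gram is an assumed choice; the patent leaves the model open."""
    models = {}
    for distance, pairs in groups.items():
        sentences = [[a, b] for a, b in pairs]  # each pair as a 2-token sentence
        models[distance] = Word2Vec(
            sentences, vector_size=dim, window=1,
            min_count=1, sg=1, workers=1, seed=0,
        )
    return models

models = train_group_models(groups)
```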
The calculation unit 15f calculates, for each of the selected nodes, a feature amount by synthesizing the feature amounts estimated using the models 14a learned for the respective groups. For example, as shown in FIG. 4(b), the calculation unit 15f concatenates, for each node, all the feature amounts output by the models 14a learned for the respective groups, and uses the result as the feature amount of that node.
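A corresponding sketch of this synthesis step concatenates each node's per-group vectors into one feature vector; the zero-vector fallback for a node missing from some group's vocabulary is an added assumption, not something the patent specifies.

```python
import numpy as np

def combine_features(models, nodes):
    """Concatenate per-group embeddings into one vector per node
    (calculation unit 15f)."""
    features = {}
    for node in nodes:
        parts = []
        for distance in sorted(models):
            wv = models[distance].wv
            parts.append(wv[node] if node in wv
                         else np.zeros(wv.vector_size, dtype=np.float32))
        features[node] = np.concatenate(parts)
    return features

features = combine_features(models, graph.nodes())
```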
The extraction unit 15g extracts, from the calculated feature amounts, the values of dimensions whose importance is equal to or greater than a predetermined threshold. Specifically, the extraction unit 15g uses teacher data and the learned models 14a to calculate the importance of each dimension of the feature vector representing the feature amount of each node. For example, the extraction unit 15g calculates the importance of each dimension with a Random Forest. Then, as shown in FIG. 4(c), the extraction unit 15g selects only the important dimensions whose importance is equal to or greater than the predetermined threshold, and uses them as the feature amount of the node.
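Since Random Forest is named for the importance calculation, this step can be sketched with scikit-learn as follows; the teacher labels and the threshold value are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def select_important_dims(features, labels, threshold=0.01):
    """Keep only dimensions whose Random Forest importance is at or
    above the threshold (extraction unit 15g)."""
    nodes = sorted(features)
    X = np.stack([features[n] for n in nodes])
    y = np.array([labels[n] for n in nodes])
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    forest.fit(X, y)
    keep = forest.feature_importances_ >= threshold
    return {n: features[n][keep] for n in nodes}

# Hypothetical teacher data: 1 = known bot or C&C host, 0 = benign.
labels = {n: int(n in {"10.0.0.1", "10.0.0.4"}) for n in features}
reduced = select_important_dims(features, labels)
```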
The extraction unit 15g also outputs the calculated feature amount of each node via the output unit 12. Instead of, or in addition to, this output, the feature amount of each node calculated by the calculation unit 15f may be output.
[Feature amount calculation processing]
Next, the feature amount calculation processing performed by the feature amount calculation device 10 according to this embodiment will be described with reference to FIG. 5. FIG. 5 is a flowchart showing the feature amount calculation processing procedure. The flowchart of FIG. 5 is started, for example, upon an operation input instructing the start of the feature amount calculation processing.
First, the generation unit 15b generates a graph representing communication between nodes, using the communication information of the nodes of the network acquired by the acquisition unit 15a (step S1).
Next, the selection unit 15c selects, from among the nodes of the generated graph, related nodes connected by a path of a predetermined length (step S2). The classification unit 15d then classifies, for the selected nodes, nodes within a predetermined distance on the path into groups according to the distance between the nodes (step S3).
Next, the learning unit 15e learns, for each classified group, a model 14a representing the feature amount in the graph of each node in the group (step S4).
At that time, the learning unit 15e learns a different model 14a for each classified group. Alternatively, the learning unit 15e may learn a common model 14a for a plurality of the classified groups that fall within a predetermined distance range.
Then, the calculation unit 15f calculates, for each of the selected nodes, a feature amount by synthesizing the feature amounts estimated using the models 14a learned for the respective groups (step S5).
The extraction unit 15g then uses the teacher data and the learned models 14a to calculate the importance of each dimension of the feature vector representing the feature amount of each node, extracts only the important dimensions whose importance is equal to or greater than a predetermined threshold, and uses them as the feature amount of the node (step S6).
Finally, the extraction unit 15g outputs the feature amount of each node via the output unit 12 (step S7). This completes the series of feature amount calculation processing.
As described above, in the feature amount calculation device 10, the generation unit 15b generates a graph representing communication between nodes, using communication information between the nodes of a network. The selection unit 15c selects, from among the nodes of the generated graph, related nodes connected by a path of a predetermined length. The classification unit 15d classifies, for the selected nodes, nodes within a predetermined distance on the path into groups according to the distance between the nodes. The learning unit 15e learns, for each classified group, a model 14a representing the feature amount in the graph of each node in the group. The calculation unit 15f then calculates, for each of the selected nodes, a feature amount by synthesizing the feature amounts estimated using the models 14a learned for the respective groups.
In this way, the feature amount calculation device 10 divides the teacher data according to the distance between nodes, learns the similarity between nodes at each distance, and calculates the feature amount of each node by synthesizing the feature amounts learned at the different distances. This makes it possible to calculate the feature amount of a node while taking into account how the context of adjacent nodes differs with distance. The feature amount calculation device 10 can therefore learn high-quality feature amounts from a graph representing a communication network.
In addition, the learning unit 15e learns a different model 14a for each classified group, which allows the feature amount calculation device 10 to learn the models 14a with higher accuracy.
The learning unit 15e also learns a common model 14a for a plurality of the classified groups that fall within a predetermined distance range, which allows the feature amount calculation device 10 to learn the models 14a efficiently.
Furthermore, the extraction unit 15g uses the teacher data and the learned models 14a to calculate the importance of each dimension of the feature vector representing the feature amount of each node, and extracts from the calculated feature amounts the values of dimensions whose importance is equal to or greater than a predetermined threshold. As a result, the feature amount calculation device 10 can efficiently calculate a high-quality feature amount for each node.
[Program]
A program describing, in a computer-executable language, the processing executed by the feature amount calculation device 10 according to the above embodiment can also be created. In one embodiment, the feature amount calculation device 10 can be implemented by installing, on a desired computer, a feature amount calculation program that executes the above feature amount calculation processing as packaged software or online software. For example, an information processing apparatus can be made to function as the feature amount calculation device 10 by causing it to execute the above feature amount calculation program. Such information processing apparatuses include mobile communication terminals such as smartphones, mobile phones, and PHS (Personal Handyphone System) terminals, as well as slate terminals such as PDAs (Personal Digital Assistants). The functions of the feature amount calculation device 10 may also be implemented in a cloud server.
FIG. 6 is a diagram showing an example of a computer that executes the feature amount calculation program. The computer 1000 includes, for example, a memory 1010, a CPU 1020, a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.
The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012. The ROM 1011 stores a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1031, and the disk drive interface 1040 is connected to a disk drive 1041, into which a removable storage medium such as a magnetic disk or an optical disk is inserted. A mouse 1051 and a keyboard 1052 are connected to the serial port interface 1050, for example, and a display 1061 is connected to the video adapter 1060.
The hard disk drive 1031 stores, for example, an OS 1091, application programs 1092, program modules 1093, and program data 1094. Each piece of information described in the above embodiment is stored, for example, in the hard disk drive 1031 or the memory 1010.
The feature amount calculation program is stored in the hard disk drive 1031 as, for example, a program module 1093 in which commands to be executed by the computer 1000 are described. Specifically, the hard disk drive 1031 stores a program module 1093 describing each process executed by the feature amount calculation device 10 explained in the above embodiment.
Data used for information processing by the feature amount calculation program is stored, for example, in the hard disk drive 1031 as program data 1094. The CPU 1020 reads the program module 1093 and the program data 1094 stored in the hard disk drive 1031 into the RAM 1012 as necessary, and executes each of the procedures described above.
The program module 1093 and the program data 1094 related to the feature amount calculation program are not limited to being stored in the hard disk drive 1031; for example, they may be stored in a removable storage medium and read by the CPU 1020 via the disk drive 1041 or the like. Alternatively, they may be stored in another computer connected via a network such as a LAN (Local Area Network) or a WAN (Wide Area Network), and read by the CPU 1020 via the network interface 1070.
Although an embodiment to which the invention made by the present inventors is applied has been described above, the present invention is not limited by the descriptions and drawings that form part of this disclosure. That is, other embodiments, examples, operational techniques, and the like made by those skilled in the art on the basis of this embodiment are all included within the scope of the present invention.
REFERENCE SIGNS LIST
10 feature amount calculation device
11 input unit
12 output unit
13 communication control unit
14 storage unit
14a model
15 control unit
15a acquisition unit
15b generation unit
15c selection unit
15d classification unit
15e learning unit
15f calculation unit
15g extraction unit

Claims (7)

  1.  A feature amount calculation device comprising:
     a generation unit that generates a graph representing communication between nodes, using communication information between nodes of a network;
     a selection unit that selects, from among the nodes of the generated graph, related nodes connected by a path of a predetermined length;
     a classification unit that classifies, for the selected nodes, nodes within a predetermined distance on the path into groups according to the distance between the nodes;
     a learning unit that learns, for each of the classified groups, a model representing a feature amount in the graph of each node in the group; and
     a calculation unit that calculates, for each of the selected nodes, a feature amount by synthesizing the feature amounts estimated using the models learned for the respective groups.
  2.  The feature amount calculation device according to claim 1, wherein the learning unit learns a different model for each of the classified groups.
  3.  The feature amount calculation device according to claim 2, wherein the learning unit learns a common model for a plurality of groups within a predetermined distance range among the classified groups.
  4.  The feature amount calculation device according to claim 1, further comprising an extraction unit that extracts, from the calculated feature amounts, the values of dimensions whose importance is equal to or greater than a predetermined threshold.
  5.  The feature amount calculation device according to claim 4, wherein the extraction unit uses teacher data and the learned models to calculate the importance of each dimension of a feature vector representing the feature amount of each node.
  6.  A feature amount calculation method executed by a feature amount calculation device, the method comprising:
     a generation step of generating a graph representing communication between nodes, using communication information between nodes of a network;
     a selection step of selecting, from among the nodes of the generated graph, related nodes connected by a path of a predetermined length;
     a classification step of classifying, for the selected nodes, nodes within a predetermined distance on the path into groups according to the distance between the nodes;
     a learning step of learning, for each of the classified groups, a model representing a feature amount in the graph of each node in the group; and
     a calculation step of calculating, for each of the selected nodes, a feature amount by synthesizing the feature amounts estimated using the models learned for the respective groups.
  7.  A feature amount calculation program that causes a computer to execute:
     a generation step of generating a graph representing communication between nodes, using communication information between nodes of a network;
     a selection step of selecting, from among the nodes of the generated graph, related nodes connected by a path of a predetermined length;
     a classification step of classifying, for the selected nodes, nodes within a predetermined distance on the path into groups according to the distance between the nodes;
     a learning step of learning, for each of the classified groups, a model representing a feature amount in the graph of each node in the group; and
     a calculation step of calculating, for each of the selected nodes, a feature amount by synthesizing the feature amounts estimated using the models learned for the respective groups.
PCT/JP2021/018420 2021-05-14 2021-05-14 Feature quantity calculation device, feature quantity calculation method, and feature quantity calculation program WO2022239235A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2021/018420 WO2022239235A1 (en) 2021-05-14 2021-05-14 Feature quantity calculation device, feature quantity calculation method, and feature quantity calculation program
JP2023520722A JPWO2022239235A1 (en) 2021-05-14 2021-05-14

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2021/018420 WO2022239235A1 (en) 2021-05-14 2021-05-14 Feature quantity calculation device, feature quantity calculation method, and feature quantity calculation program

Publications (1)

Publication Number Publication Date
WO2022239235A1 2022-11-17

Family

ID=84028959

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/018420 WO2022239235A1 (en) 2021-05-14 2021-05-14 Feature quantity calculation device, feature quantity calculation method, and feature quantity calculation program

Country Status (2)

Country Link
JP (1) JPWO2022239235A1 (en)
WO (1) WO2022239235A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019168072A1 (en) * 2018-02-27 2019-09-06 日本電信電話株式会社 Traffic anomaly sensing device, traffic anomaly sensing method, and traffic anomaly sensing program

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019168072A1 (en) * 2018-02-27 2019-09-06 日本電信電話株式会社 Traffic anomaly sensing device, traffic anomaly sensing method, and traffic anomaly sensing program

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DAYA ABBAS ABOU; SALAHUDDIN MOHAMMAD A.; LIMAM NOURA; BOUTABA RAOUF: "BotChase: Graph-Based Bot Detection Using Machine Learning", IEEE TRANSACTIONS ON NETWORK AND SERVICE MANAGEMENT, IEEE, USA, vol. 17, no. 1, 6 February 2020 (2020-02-06), USA , pages 15 - 29, XP011777410, DOI: 10.1109/TNSM.2020.2972405 *
MASAYA KUMAGAI, RYOSUKE MATSUMOTO: "Graph Based Machine Learning and Visualization for Intrusion Detection System", IPSJ SIG TECHNICAL REPORT (IEICE TECHNICAL REPORT), vol. 118 (2019-IOT-44), no. 480 (52), 28 February 2019 (2019-02-28), pages 359 - 362, XP009541114 *

Also Published As

Publication number Publication date
JPWO2022239235A1 (en) 2022-11-17

Similar Documents

Publication Publication Date Title
RU2697955C2 (en) System and method for training harmful container detection model
CN110677433B (en) Method, system, equipment and readable storage medium for predicting network attack
JP6870508B2 (en) Learning programs, learning methods and learning devices
US11494614B2 (en) Subsampling training data during artificial neural network training
US11321625B2 (en) Quantum circuit optimization using machine learning
US10834183B2 (en) Managing idle and active servers in cloud data centers
CN113010896B (en) Method, apparatus, device, medium and program product for determining abnormal object
CN111435461B (en) Antagonistic input recognition using reduced accuracy deep neural networks
WO2020166311A1 (en) Preparation device, preparation system, preparation method, and preparation program
JP2021192286A (en) Model training, image processing method and device, storage medium, and program product
JP6725452B2 (en) Classification device, classification method, and classification program
US11196633B2 (en) Generalized correlation of network resources and associated data records in dynamic network environments
JP6864610B2 (en) Specific system, specific method and specific program
JP6888737B2 (en) Learning devices, learning methods, and programs
US11777979B2 (en) System and method to perform automated red teaming in an organizational network
WO2022239235A1 (en) Feature quantity calculation device, feature quantity calculation method, and feature quantity calculation program
US20210367956A1 (en) Cyber attack coverage
US20220207388A1 (en) Automatically generating conditional instructions for resolving predicted system issues using machine learning techniques
WO2022239222A1 (en) Feature calculating device, feature calculating method and feature calculating program
WO2019244446A1 (en) System configuration derivation device, method, and program
WO2023238246A1 (en) Integrated model generation method, integrated model generation device, and integrated model generation program
AU2020468806B2 (en) Learning device, learning method, and learning program
WO2022254729A1 (en) Analyzing device, analyzing method, and analyzing program
US20210311843A1 (en) System verification program generation device, system verification program generation method, and recording medium storing system verification program generation program
CN114116151A (en) Big data frame configuration parameter optimization method based on priori knowledge

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 2023520722

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE