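WO2016194909A1 - Access classification device, access classification method, and access classification program - Google Patents
Access classification device, access classification method, and access classification program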
- Publication number
- WO2016194909A1 (application PCT/JP2016/066054)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- trees
- access
- similarity
- server
- classification
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/51—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems at application loading time, e.g. accepting, rejecting, starting or inhibiting executable software based on integrity or source reliability
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1408—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
- H04L63/1425—Traffic logging, e.g. anomaly detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
- G06F16/2246—Trees, e.g. B+trees
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1408—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
- H04L63/1416—Event detection, e.g. attack signature detection
Definitions
- the present invention relates to an access classification device, an access classification method, and an access classification program.
- Conventionally, there are malicious web site identification methods that identify, among the web sites accessed by a user terminal, malicious web sites that infect the terminal with malware.
- Such malicious web site identification methods are classified into dynamic analysis and static analysis.
- Dynamic analysis executes content such as HTML (HyperText Markup Language) or JavaScript (registered trademark) acquired from a server such as a web server, and identifies malicious web sites by detecting attacks against vulnerabilities from the behavior on the host at that time.
- Static analysis, in contrast, only obtains content from the server, analyzes the code of the content, and uses information related to program features, domains, and URLs (Uniform Resource Locators) to identify a malicious web site from its differences from the characteristics observed on benign web sites.
- There is a technique (Non-Patent Document 1) in which a feature vector is built from the number of lines and characters of a script and the number of appearances of keywords in the script, and identification is performed using a machine learning method.
- Non-Patent Document 2 Non-Patent Document 2
- There is also a technique (Non-Patent Document 3) that creates a sequence composed of the node types of abstract syntax trees created from JavaScript (registered trademark) and performs identification by the similarity of such sequences.
- Non-Patent Document 4 creates a decision tree for determining the malignancy of content from the number of appearances of iframe and script tags, the size of the tags, and the like, and uses it for identification.
- For URL and host information, malignancy can be determined using the keywords contained in the URL, DNS (Domain Name System) query results, and geographic information associated with the IP (Internet Protocol) address.
- Non-patent Document 6 a technique (Non-Patent Document 7) is proposed in which a difference is extracted from contents acquired when accessing the same page at different times, and a malignancy determination is performed based on the difference.
- the malicious web site is identified by various methods.
- The method of Matsunaka et al. detects forwarding caused by an attack on a vulnerability, using the HTTP (Hypertext Transfer Protocol) header at the time an executable file is downloaded and the fact that the content acquired before the download contains no information about the download.
- In the method of Stringhini et al. (Non-Patent Document 9), access groups having the same last page are created from the time series of pages that each user accessed through automatic transfer, and a feature vector, such as the number of IP addresses or the number of redirects, is created from those access groups to identify a malicious web site.
- Furthermore, in the method of Rafique et al. (Non-Patent Document 10), the part necessary for downloading malware is extracted from a series of pages accessed by redirection, by individually accessing a plurality of pages in the series, and a signature is created from it to identify the malicious web site.
- The above-described content-based malicious web site identification methods use content acquired from a server, URLs, and host information, and are therefore easily evaded by an attacker. For example, if an attacker changes the tendency of the HTML tags and JavaScript (registered trademark) functions used on a malicious web site so that it resembles a benign site, a site that is actually malicious may be misidentified as a benign web site. As a result, the undetected malicious web site cannot be written in the blacklist, and users may be allowed to access the malicious web site.
- the above-described method focusing on redirection also requires a plurality of accesses, and the malignant web site cannot be identified from a single access, and therefore the scope of application of the method is limited. Therefore, it is desired to construct a malicious web site identification method that is not easily affected by changes in content or the like by an attacker and can be identified by one access.
- The embodiment of the disclosure has been made in view of the above, and its purpose is to provide an access classification device, an access classification method, and an access classification program that are difficult for an attacker to evade and can easily detect a malicious web site.
- In one aspect, an access classification device disclosed in the present application includes: a creation unit that creates a plurality of trees having a first server and a second server as nodes and an instruction to transfer access from the first server to the second server as an edge; a calculation unit that calculates a similarity between the plurality of trees based on a degree of coincidence between the subtrees constituting each of the created trees; and a classification unit that classifies the access based on the calculated similarity.
- The access classification method disclosed in the present application includes: a creation step of creating a plurality of trees having the first server and the second server as nodes and an instruction for transferring access from the first server to the second server as an edge; a calculation step of calculating a similarity between the plurality of trees based on a degree of coincidence between the subtrees constituting each of the created trees; and a classification step of classifying the access based on the calculated similarity.
- The access classification program disclosed in the present application causes a computer to execute creation, calculation, and classification steps on trees that have the first server and the second server as nodes and an instruction for transferring access from the first server to the second server as an edge.
- the access classification device, the access classification method, and the access classification program disclosed in the present application have an effect that it is difficult for an attacker to avoid and that a malignant web site can be easily detected.
- FIG. 1 is a diagram illustrating a configuration of an access classification apparatus.
- FIG. 2 is a diagram illustrating an example of analysis target access input to the access classification device.
- FIG. 3 is a diagram illustrating an example of known access input to the access classification device.
- FIG. 4 is a diagram illustrating a structure of a tree constructed by the access classification device.
- FIG. 5 is a diagram illustrating a process of extracting a partial tree from a tree.
- FIG. 6 is a diagram illustrating a method for calculating the similarity of trees based on the ratio of common subtrees.
- FIG. 7 is a diagram illustrating a method for calculating the similarity of trees based on the number of common subtrees.
- FIG. 8 is a diagram illustrating a method for calculating the similarity of trees based on the size of the common tree.
- FIG. 9 is a diagram illustrating a method of classifying a plurality of trees into a plurality of sets.
- FIG. 10 is a diagram illustrating a method for creating a representative tree from a set of trees.
- FIG. 11 is a flowchart for explaining the identification model creation process when the similarity is used as the inner product value.
- FIG. 12 is a flowchart for explaining the access identification processing when the similarity is used as the inner product value.
- FIG. 13 is a flowchart for explaining an identification model creation process in the case of using a similarity with a representative tree.
- FIG. 14 is a flowchart for explaining an access identification process when using a similarity with a representative tree.
- FIG. 15 is a diagram illustrating that information processing by the access classification program is specifically realized using a computer.
- FIG. 1 is a diagram showing the configuration of the access classification device 10.
- the access classification device 10 includes a target access input unit 11, a known access input unit 12, a tree construction unit 13, a similarity calculation unit 14, a representative tree creation unit 15, and a classification unit 16. Each of these components is connected so that various signals and data can be input and output in one or both directions.
- The target access input unit 11 accepts access to the analysis target server as input.
- The known access input unit 12 accepts as input known malignant accesses, which are known to be accesses to servers that provide malignant web sites, and, conversely, known benign accesses, which are known to be accesses to servers that provide benign web sites.
- From each access input by the target access input unit 11 and the known access input unit 12, the tree construction unit 13 constructs a tree in which the servers that are the access source (automatic transfer source) and access destination (automatic transfer destination) for the analysis target server are “nodes” and the automatic transfer instructions are “edges”.
- The similarity calculation unit 14 calculates the similarity of a plurality of trees. In addition, the similarity calculation unit 14 calculates the similarity between the trees created by the tree construction unit 13.
- The representative tree creation unit 15 divides the accesses input by the known access input unit 12 into a plurality of sets based on the similarity calculated by the similarity calculation unit 14, and creates, as a representative tree, a partial tree common to the trees in each set.
- the classification unit 16 determines whether the access input by the target access input unit 11 is an access to a malignant web site, using the similarity calculated by the similarity calculation unit 14.
- FIG. 2 is a diagram illustrating an example of the analysis target access 11 a input to the access classification device 10.
- the transfer destination and transfer source of the analysis target access 11a are URLs, but are not limited thereto, and may be, for example, an FQDN (Fully Qualified Domain Name), a domain, a host name, or the like.
- As transfer instructions, FIG. 2 shows “SRC-IFRAME”, representing a link by the HTML iframe tag, and “SRC-SCRIPT-SRC”, representing a link by the HTML script tag, but the transfer instructions are not limited to these.
- For example, “SRC-OBJECT-CODEBASE”, representing a link by the HTML object tag, may be used.
- FIG. 3 is a diagram illustrating an example of the known access 12a input to the access classification device 10.
- The transfer destination and transfer source of the known access 12a are URLs, but are not limited to this and may be, for example, an FQDN, a domain, a host name, or the like.
- As shown in FIG. 3, the transfer instructions include “SRC-IFRAME” and “SRC-SCRIPT-SRC” as described above, but are not limited thereto.
- The known access 12a is provided with a label for identifying the property of the transfer destination web site or the like, but the label is not limited to the “benign” or “malignant” shown in FIG. 3 and may be, for example, “Drive-by-Download”, “Phishing”, and the like.
- FIG. 4 is a diagram showing a tree structure constructed by the access classification device 10.
- Based on the access transfer information shown in FIGS. 2 and 3, the tree construction unit 13 of the access classification device 10 sets the transfer source and the transfer destination as “nodes” and the transfer instruction as an “edge” to build a tree.
- the tree building unit 13 sets the URLs of the web sites as nodes N1 to N4, and creates edges E1 to E3 corresponding to transfer instructions between the URLs between the transfer source URL and the transfer destination URL.
- the tree construction unit 13 removes the URL information attached to the nodes N1 to N4. As a result, it is possible to identify a web site focusing on a URL-independent redirect structure.
- Although FIG. 4 shows an example in which the URLs once assigned to the nodes N1 to N4 are removed, the URLs need not be removed.
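The tree construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the device's actual implementation: the record layout `(source URL, transfer instruction, destination URL)`, the function name `build_tree`, the `keep_urls` flag, and the example URLs are all hypothetical, since the source does not specify a data format.

```python
def build_tree(transfers, keep_urls=False):
    """Build a redirect tree from (source_url, instruction, destination_url) records.

    Returns (nodes, edges): nodes maps a node id to its URL (or None when
    URLs are removed); edges is a list of (source_id, instruction, dest_id).
    """
    ids = {}

    def node_id(url):
        # Assign N1, N2, ... in order of first appearance.
        if url not in ids:
            ids[url] = f"N{len(ids) + 1}"
        return ids[url]

    edges = [(node_id(src), instruction, node_id(dst))
             for src, instruction, dst in transfers]
    # Dropping the URLs leaves only the URL-independent redirect structure.
    nodes = {nid: (url if keep_urls else None) for url, nid in ids.items()}
    return nodes, edges


# Hypothetical transfer records (assumed layout, not taken from FIG. 2).
transfers = [
    ("http://a.example/", "SRC-IFRAME", "http://b.example/"),
    ("http://b.example/", "SRC-SCRIPT-SRC", "http://c.example/"),
    ("http://a.example/", "SRC-IFRAME", "http://d.example/"),
]
nodes, edges = build_tree(transfers)
```

Passing `keep_urls=True` corresponds to the variant in which the URLs are left on the nodes.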
- FIG. 5 is a diagram showing a process of extracting a partial tree from a tree.
- The tree construction unit 13 of the access classification device 10 extracts the partial trees constituting the tree from the constructed tree (see FIG. 4). For example, as shown in FIG. 5A, the tree construction unit 13 extracts the paths from the node N1, which corresponds to the first accessed server in the series of accesses, to the end nodes N3 and N4. Next, as shown in FIG. 5B, the tree construction unit 13 extracts all partial paths included in those paths. Then, as shown in FIG. 5C, the tree construction unit 13 decomposes the extracted partial paths into partial trees. At this time, if there are overlapping subtrees, the tree construction unit 13 may delete one of the overlapping subtrees.
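The three extraction stages can be sketched as below. This is only an illustration under assumptions: URLs are taken to be already removed, so each subtree is represented as a tuple of transfer-instruction labels along a contiguous path segment, and the function name `extract_subtrees` and the edge layout are hypothetical. The input is assumed to be a tree (no cycles).

```python
def extract_subtrees(edges, root):
    """Return the set of partial paths (as tuples of edge labels) of a tree.

    edges: list of (source_id, instruction, dest_id); root: id of the first
    accessed node. Assumes the edges form a tree without cycles.
    """
    children = {}
    for src, label, dst in edges:
        children.setdefault(src, []).append((label, dst))

    # Stage (a): paths from the root to every end node.
    paths = []

    def walk(node, labels):
        if node not in children:
            if labels:
                paths.append(tuple(labels))
            return
        for label, child in children[node]:
            walk(child, labels + [label])

    walk(root, [])

    # Stages (b) and (c): every contiguous partial path; the set drops duplicates.
    subtrees = set()
    for path in paths:
        for i in range(len(path)):
            for j in range(i + 1, len(path) + 1):
                subtrees.add(path[i:j])
    return subtrees


edges = [
    ("N1", "SRC-IFRAME", "N2"),
    ("N2", "SRC-SCRIPT-SRC", "N3"),
    ("N1", "SRC-IFRAME", "N4"),
]
subtrees = extract_subtrees(edges, "N1")
```

Using a set mirrors the optional deletion of overlapping subtrees; a list would keep them.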
- FIG. 6 is a diagram showing a method for calculating the similarity of trees based on the ratio of the common subtree T3.
- the similarity calculation unit 14 calculates the similarity of a plurality of trees based on the extracted subtree (see FIG. 5C).
- First, the similarity calculation unit 14 takes as the common subtrees the set of subtrees that are included in at least a threshold proportion (for example, about 40 to 60%) of the comparison target trees.
- the similarity calculation unit 14 sets a set of subtrees obtained by removing overlapping subtrees from the subtrees of all the comparison target trees as all subtrees.
- the similarity calculation unit 14 sets the value obtained by dividing the number of common subtrees by the number of all subtrees as the similarity.
- The similarity calculation unit 14 takes the subtree (N1-E3-N4), which is included in both of the comparison target trees T1 and T2 (a proportion equal to or greater than the threshold), as the common subtree T3. Next, the similarity calculation unit 14 takes the set of subtrees obtained by removing the overlapping subtrees (N5-E6-N8) from the subtrees of all the comparison target trees T1 and T2 as the all-subtree set T4. Then, the similarity calculation unit 14 sets the value obtained by dividing “1”, the number of common subtrees T3, by “7”, the number of subtrees in T4, as the similarity. Therefore, in the example shown in FIG. 6, the similarity is “1/7”.
- When determining match or mismatch for extracting the common subtree T3 or for removing overlapping subtrees, the similarity calculation unit 14 may take into account the URL information as well as the transfer instruction. In addition, when creating the all-subtree set T4, overlapping subtrees need not be removed. Furthermore, the number of comparison target trees is not limited to two, and may be two or more. The ratio of the number of common subtrees to the number of all subtrees has been exemplified as the parameter used for calculating the similarity, but any measure that compares the number of common subtrees with the number of all subtrees can be used.
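A minimal sketch of the ratio-based similarity follows. The function name `ratio_similarity`, the default threshold of 0.6, and the placeholder subtree names are assumptions chosen for illustration; each tree is represented simply as a set of hashable subtree identifiers.

```python
from fractions import Fraction


def ratio_similarity(subtree_sets, threshold=0.6):
    """Number of common subtrees divided by the number of all subtrees.

    A subtree counts as common when it occurs in at least `threshold`
    (a proportion between 0 and 1) of the compared trees.
    """
    all_subtrees = set().union(*subtree_sets)  # duplicates removed
    if not all_subtrees:
        return Fraction(0)
    common = {
        s for s in all_subtrees
        if sum(s in t for t in subtree_sets) >= threshold * len(subtree_sets)
    }
    return Fraction(len(common), len(all_subtrees))


# Two hypothetical trees with four subtrees each, one of which is shared.
t1 = {"a", "b", "c", "shared"}
t2 = {"d", "e", "f", "shared"}
similarity = ratio_similarity([t1, t2])
```

With these inputs the union holds seven subtrees and one is common, which reproduces the 1/7 of the FIG. 6 example. Note that with only two trees, a threshold of 0.5 or lower makes every subtree count as common.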
- FIG. 7 is a diagram showing a method for calculating the similarity of trees based on the number of common subtrees T3.
- the similarity calculation unit 14 calculates the similarity of a plurality of trees based on the extracted subtree (see FIG. 5C). First, the similarity calculation unit 14 sets a set of subtrees included in all trees of the comparison target trees as a common subtree. Then, the similarity calculation unit 14 sets the value obtained by counting the number of common subtrees as the similarity.
- The similarity calculation unit 14 takes the subtree (N1-E3-N4), which is included in both (all) of the comparison target trees T1 and T2, as the common subtree T3. Then, the similarity calculation unit 14 sets “1”, the number of common subtrees T3, as the similarity. Therefore, in the example shown in FIG. 7, the similarity is “1”.
- When determining match or mismatch for extracting the common subtree T3, the similarity calculation unit 14 may take into account the URL information as well as the transfer instruction.
- The number of comparison target trees is not limited to two and may be two or more.
- The number of common subtrees has been exemplified as the parameter used for calculating the similarity, but the parameter need not be the count itself. Anything based on the common subtrees may be used, for example the number of nodes or edges included in the common subtrees.
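The count-based similarity reduces to the size of a set intersection, as in this small sketch. The function name `count_similarity` and the placeholder subtree names are hypothetical.

```python
def count_similarity(subtree_sets):
    """Number of subtrees contained in every one of the compared trees."""
    sets = [set(s) for s in subtree_sets]
    if not sets:
        return 0
    return len(set.intersection(*sets))


# Hypothetical subtree sets; only "shared" occurs in all of them.
t1 = {"a", "b", "c", "shared"}
t2 = {"d", "e", "f", "shared"}
t3 = {"a", "shared"}
similarity = count_similarity([t1, t2])
```

Because the intersection takes any number of sets, the same function covers comparisons of more than two trees.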
- FIG. 8 is a diagram showing a method for calculating the similarity of trees based on the size of the common tree T8.
- the similarity calculation unit 14 calculates the similarity of a plurality of trees based on the extracted tree (see FIG. 4). First, the similarity calculation unit 14 extracts a partial tree common to a plurality of trees. Next, the similarity calculation unit 14 extracts, from the extracted common partial trees, a common partial tree having the maximum number of nodes as a “common tree”. Then, the similarity calculation unit 14 sets the value obtained by counting the number of extracted common tree nodes as the similarity.
- The similarity calculation unit 14 extracts the subtrees (N1-E1-N2-E2-N3, N1-E1-N2) common to both of the comparison target trees T5 and T6 and sets them as the common subtrees T7.
- The similarity calculation unit 14 extracts, from the common subtrees T7, the common subtree (N1-E1-N2-E2-N3) having the maximum number of nodes, “3”, and sets it as the common tree T8.
- The similarity calculation unit 14 sets “3”, the number of nodes of the common tree T8, as the similarity. Therefore, in the example shown in FIG. 8, the similarity is “3”.
- When determining match or mismatch for extracting the common subtrees T7, the similarity calculation unit 14 may take into account the URL information as well as the transfer instruction.
- The number of comparison target trees is not limited to two and may be two or more.
- Although the number of nodes of the common tree has been exemplified as the parameter used for calculating the similarity, anything related to the size of the common tree may be used, such as the number of edges of the common tree.
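The size-based similarity can be sketched as follows, assuming each subtree is a path represented as a tuple of edge labels, so that a path with k edges has k + 1 nodes. That representation, the function name `common_tree_similarity`, and the edge labels in the example are assumptions for illustration only.

```python
def common_tree_similarity(subtree_sets):
    """Node count of the largest subtree shared by all compared trees.

    Each subtree is a tuple of edge labels along a path, so a subtree
    with k edges has k + 1 nodes. Returns 0 when nothing is shared.
    """
    sets = [set(s) for s in subtree_sets]
    if not sets:
        return 0
    common = set.intersection(*sets)
    if not common:
        return 0
    largest = max(common, key=len)  # the "common tree"
    return len(largest) + 1


# Hypothetical trees sharing the path N1-E1-N2-E2-N3 and its parts.
t5 = {("E1",), ("E2",), ("E1", "E2"), ("E4",)}
t6 = {("E1",), ("E2",), ("E1", "E2"), ("E5",)}
similarity = common_tree_similarity([t5, t6])
```

Returning `len(largest)` instead would give the edge-count variant mentioned above.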
- FIG. 9 is a diagram showing a method of classifying a plurality of trees into a plurality of sets.
- the access classification device 10 classifies the plurality of trees (accesses) illustrated in FIG. 4 into a plurality of sets composed of trees with high similarity by the classification unit 16.
- Starting from a state in which each set contains only one tree, the classification unit 16 combines two sets when the maximum similarity between the trees belonging to them is equal to or greater than a threshold value.
- the classification unit 16 repeatedly executes this combining process until there is no set to be combined.
- each of the sets C1 to C5 is composed of only one tree (trees T11 to T15).
- the classification unit 16 classifies the plurality of trees T11 to T15 into a plurality of sets C1 'to C3' composed of trees with high similarity.
- the set C1 and the set C2 to which the trees T11 and T12 having a maximum similarity equal to or greater than the threshold value belong are combined and classified into the same set C1 '.
- the set C3 and the set C5 to which the trees T13 and T15 having a maximum similarity equal to or greater than the threshold value belong are combined and classified into the same set C2 '.
- The classification unit 16 uses the maximum value of similarity as the criterion for combining sets, but is not limited to this and may use the minimum value or the average value of similarity. When the maximum value is used, the resulting sets consist of trees that share only some of the subtrees commonly included in multiple trees; when the minimum value is used instead, the classification unit 16 can create sets of trees in which many subtrees are common. When the average value is used, the classification unit 16 can create sets intermediate between these.
- Alternatively, without setting a threshold, the classification unit 16 may preferentially combine the sets having the maximum similarity, repeat the combining process until everything forms a single set, and afterwards decide which stage of the combining process to adopt. Furthermore, the number of sets to be combined is not limited to two but may be two or more.
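The threshold-based combining process can be sketched as a simple agglomerative loop. The function name `cluster`, the `linkage` parameter, and the example subtree sets are hypothetical; the similarity used in the example is the count of common subtrees, chosen only for illustration.

```python
from itertools import combinations


def cluster(items, similarity, threshold, linkage=max):
    """Merge sets until no pair of sets reaches the threshold.

    linkage decides how the pairwise similarities between two sets are
    summarized: max, min, or e.g. statistics.mean.
    """
    clusters = [[item] for item in items]  # start with one tree per set
    merged = True
    while merged:
        merged = False
        for a, b in combinations(range(len(clusters)), 2):
            score = linkage([similarity(x, y)
                             for x in clusters[a] for y in clusters[b]])
            if score >= threshold:
                clusters[a] = clusters[a] + clusters[b]
                del clusters[b]  # b > a, so index a stays valid
                merged = True
                break  # restart the scan over the updated sets
    return clusters


# Hypothetical subtree sets for five trees.
trees = {
    "T11": {"a", "b"},
    "T12": {"a", "b", "c"},
    "T13": {"x", "y"},
    "T14": {"q"},
    "T15": {"x", "y", "z"},
}


def common_count(x, y):
    return len(trees[x] & trees[y])


sets = cluster(list(trees), common_count, threshold=2)
```

Here T11 joins T12 and T13 joins T15, while T14 stays alone, giving three sets as in the FIG. 9 description. Passing `linkage=min` or `linkage=statistics.mean` gives the minimum and average criteria.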
- FIG. 10 is a diagram showing a method for creating a representative tree from a set of trees.
- The access classification device 10 uses the representative tree creation unit 15 to create a representative tree from each set of trees classified by the classification unit 16 (see FIG. 9), based on the subtrees extracted by the tree construction unit 13 (see FIG. 5).
- the representative tree creating unit 15 sets a partial tree common to all trees in the set as a representative tree.
- the representative tree creating unit 15 sets a subtree (N1-E3-N4) common to the trees T1 and T2 in the same set as the representative tree T9.
- Although the representative tree creation unit 15 uses a partial tree common to all trees in the set as the representative tree, the representative tree may instead be the set of subtrees included in at least a predetermined proportion of the trees in the set.
- When determining match or mismatch for creating the representative tree T9, the representative tree creation unit 15 may take into account the URL information as well as the transfer instruction.
- the number of comparison target trees is not limited to two, and may be two or more.
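Both variants of representative tree creation can be sketched with one function. The name `representative_tree`, the `min_ratio` parameter, and the placeholder subtree names are assumptions for illustration; each tree is again a set of subtree identifiers.

```python
def representative_tree(subtree_sets, min_ratio=1.0):
    """Subtrees that occur in at least min_ratio of the trees in one set.

    min_ratio=1.0 keeps only subtrees common to all trees; a smaller
    value gives the "predetermined proportion or more" variant.
    """
    candidates = set().union(*subtree_sets)
    needed = min_ratio * len(subtree_sets)
    return {s for s in candidates
            if sum(s in t for t in subtree_sets) >= needed}


# Hypothetical set of two trees sharing one subtree.
t1 = {"a", "b", "c", "shared"}
t2 = {"d", "e", "f", "shared"}
rep = representative_tree([t1, t2])
```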
- FIG. 11 is a flowchart for explaining the identification model creation process when the similarity is used as the inner product value.
- the known access input unit 12 inputs a known benign access and a known malignant access (see FIG. 3).
- the tree construction unit 13 constructs a tree from the input access and extracts a partial tree from the constructed tree (see FIGS. 4 and 5).
- the similarity calculation unit 14 calculates the tree similarity based on the degree of matching of the extracted subtrees (see FIGS. 6 to 8).
- The classification unit 16 applies the access input in S1 and the similarity calculated in S3 to supervised machine learning that uses the inner product value of the inputs after conversion into a high-dimensional space. That is, the classification unit 16 creates an identification model by supervised machine learning in which the known benign access and known malignant access input in S1 serve as “teacher data” and the similarity calculated in S3 serves as the “inner product value” after the teacher data are converted into vectors in the feature amount space.
- the supervised machine learning method is, for example, a support vector machine, but is not limited thereto.
- the classification unit 16 outputs the created identification model to the hard disk drive 108 described later.
- the output identification model is stored as data in the hard disk drive 108.
- FIG. 12 is a flowchart for explaining the access identification process when the similarity is used as the inner product value.
- the target access input unit 11 inputs an analysis target access (see FIG. 2).
- the tree construction unit 13 constructs a tree from the input access, and extracts a partial tree from the constructed tree (see FIGS. 4 and 5).
- the similarity calculation unit 14 calculates the tree similarity based on the degree of matching of the extracted subtrees (see FIGS. 6 to 8).
- The classification unit 16 applies the access input in S11 and the similarity calculated in S13 to supervised machine learning that uses the inner product value of the inputs after conversion into a high-dimensional space. That is, the classification unit 16 creates an identification result by supervised machine learning in which the analysis target access input in S11 serves as “test data” and the similarity calculated in S13 serves as the “inner product value after the test data are converted into vectors in the feature amount space”.
- the supervised machine learning technique is, for example, a support vector machine, but is not limited to this as long as it is the same technique as that used in the above-described identification model creation process.
- the classification unit 16 outputs the created identification result to a display device such as the display 112 described later.
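The two flows above (model creation and identification with the similarity used as an inner product value) can be illustrated with a small kernel method. To stay dependency-free, this sketch swaps in a kernel perceptron in place of the support vector machine named in the text; it is not the method of the source. The similarity used is the number of common subtrees, which equals the inner product of the trees' subtree indicator vectors. All function names and data are hypothetical.

```python
def train_kernel_perceptron(gram, labels, epochs=10):
    """Learn per-example weights from a precomputed similarity (Gram) matrix.

    gram[i][j] is the similarity between training trees i and j;
    labels[i] is +1 (malignant) or -1 (benign).
    """
    n = len(labels)
    alpha = [0] * n
    for _ in range(epochs):
        errors = 0
        for i in range(n):
            score = sum(alpha[j] * labels[j] * gram[j][i] for j in range(n))
            if labels[i] * score <= 0:  # misclassified: strengthen example i
                alpha[i] += 1
                errors += 1
        if errors == 0:
            break
    return alpha


def predict(alpha, labels, kernel_row):
    """kernel_row[j] is the similarity between the target tree and training tree j."""
    score = sum(a * y * k for a, y, k in zip(alpha, labels, kernel_row))
    return 1 if score > 0 else -1


# Hypothetical teacher data: subtree sets with labels (+1 malignant, -1 benign).
known = [{"a", "b"}, {"a", "b", "c"}, {"x"}, {"x", "y"}]
labels = [1, 1, -1, -1]
gram = [[len(s & t) for t in known] for s in known]
alpha = train_kernel_perceptron(gram, labels)

# Identification: similarities between one analysis target and the teacher data.
target = {"a", "c"}
result = predict(alpha, labels, [len(target & t) for t in known])
```

The learned `alpha` plays the role of the stored identification model; only similarities, never explicit feature vectors, are needed at both stages.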
- FIG. 13 is a flowchart for explaining the identification model creation process in the case where the similarity with the representative tree is used. Since FIG. 13 includes a plurality of steps similar to those in FIG. 11, common steps are given the same reference numerals at the end, and detailed descriptions thereof are omitted. Specifically, steps S21 to S23 and S25 in FIG. 13 correspond to steps S1 to S3 and S5 shown in FIG. 11, respectively.
- the classification unit 16 classifies the plurality of trees constructed in S22 into a plurality of sets composed of trees with high similarity based on the similarity calculated in S23 (see FIG. 9).
- the representative tree creation unit 15 creates, as a representative tree, a subtree representing the characteristics of each set (for example, a common subtree in the same set) for each set obtained by the classification in S26 (see FIG. 10).
- Using the method shown in any of FIGS. 6 to 8, the similarity calculation unit 14 calculates the similarity between the representative tree created in S27 and the tree (including its partial trees) created from the known benign access or known malignant access input in S21.
- the classification unit 16 applies the access input in S21 and the similarity calculated in S23 to supervised machine learning. That is, the classification unit 16 creates an identification model by supervised machine learning using a vector in which similarities with the representative tree are arranged as the access feature vector.
- the supervised machine learning methods are, for example, linear discriminant analysis, support vector machine, random forest, etc., but are not limited to these methods.
- FIG. 14 is a flowchart for explaining an access identification process when using a similarity with a representative tree. Since FIG. 14 includes a plurality of steps similar to those in FIG. 12, common steps are denoted by the same reference numerals at the end and detailed description thereof is omitted. Specifically, steps S31, S32, and S35 in FIG. 14 correspond to steps S11, S12, and S15 shown in FIG. 12, respectively.
- Using the method shown in any of FIGS. 6 to 8, the similarity calculation unit 14 calculates the similarity between the representative tree created in S27 and the tree (including its partial trees) created from the analysis target access input in S31.
- the classification unit 16 applies the access input in S31 and the similarity calculated in S36 to supervised machine learning. That is, the classification unit 16 creates an identification result by supervised machine learning using a vector in which similarities to the representative tree are arranged as the access feature vector.
- The supervised machine learning methods are, for example, linear discriminant analysis, support vector machine, random forest, etc., but are not limited to these methods as long as the method is the same as that used in the above-described identification model creation process.
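The representative-tree variant turns each access into a feature vector of similarities to the representative trees and then applies a supervised learner. In the sketch below a nearest-centroid classifier stands in for the methods named in the text (linear discriminant analysis, support vector machine, random forest); that substitution, the function names, and the data are assumptions made to keep the example self-contained.

```python
def feature_vector(subtrees, representatives):
    """Similarities (common-subtree counts) to each representative tree."""
    return [len(subtrees & rep) for rep in representatives]


def train_centroids(vectors, labels):
    """Identification model: the mean feature vector of each label."""
    sums, counts = {}, {}
    for vec, label in zip(vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for k, value in enumerate(vec):
            acc[k] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc]
            for label, acc in sums.items()}


def classify(vec, centroids):
    """Return the label whose centroid is closest to vec."""
    def squared_distance(center):
        return sum((a - b) ** 2 for a, b in zip(vec, center))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Hypothetical representative trees and labeled known accesses.
representatives = [{"a", "b"}, {"x"}]
known = [{"a", "b", "c"}, {"a", "b"}, {"x", "y"}, {"x"}]
labels = ["malignant", "malignant", "benign", "benign"]

model = train_centroids([feature_vector(t, representatives) for t in known], labels)
result = classify(feature_vector({"a", "b", "z"}, representatives), model)
```

The feature vector has one entry per representative tree, so its length is fixed by the number of sets produced during model creation, regardless of how many known accesses were used.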
- the access classification device 10 includes the tree construction unit 13, the similarity calculation unit 14, and the classification unit 16.
- The tree construction unit 13 creates a plurality of trees in which a first server (for example, a web server) and a plurality of second servers (for example, servers of a malicious web site) are nodes, and the instructions that automatically transfer a series of accesses from the first server to the plurality of second servers are edges.
- the similarity calculation unit 14 calculates the similarity between the plurality of trees based on the degree of coincidence of the partial trees that constitute each of the created trees.
- the classification unit 16 classifies the access based on the calculated similarity.
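The tree built by the tree construction unit 13 can be pictured with a small Python sketch. It assumes, purely for illustration, that each automatic transfer is available as a `(source, destination, method)` record listed in the order it was observed; the host names, the method labels, and the `build_tree` function are hypothetical.

```python
from collections import defaultdict


def build_tree(root, redirects):
    """Build a redirect tree as {server: [(redirect_method, next_server), ...]}.

    Servers are nodes and each automatic transfer is a labelled edge.
    Records are assumed to arrive in observation order, starting from `root`.
    """
    tree = defaultdict(list)
    seen = {root}
    for src, dst, method in redirects:
        # Only attach a destination once so that the structure stays a tree.
        if src in seen and dst not in seen:
            tree[src].append((method, dst))
            seen.add(dst)
    return dict(tree)


# Hypothetical series of accesses starting at one landing server.
redirects = [
    ("landing.example", "hop.example", "iframe"),
    ("landing.example", "ads.example", "script"),
    ("hop.example", "exploit.example", "http302"),
]
tree = build_tree("landing.example", redirects)
```

Leaf servers such as `exploit.example` have no outgoing edges and therefore do not appear as keys.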
- The similarity calculation unit 14 calculates, as the similarity, the ratio of the number of subtrees common to the plurality of trees (common subtrees) to the number of all subtrees constituting the plurality of trees (all subtrees).
- the similarity calculation unit 14 may calculate the number of subtrees (common subtrees) common to the plurality of trees as the similarity.
- The similarity calculation unit 14 may calculate, as the similarity, the number of nodes of the subtree having the largest number of nodes (common tree) among the subtrees common to the plurality of trees (common subtrees).
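As a rough illustration of the three similarity measures above, the following Python sketch treats each tree as a nested list of `(redirect_method, children)` pairs. As a simplifying assumption that is not taken from the patent text, "subtrees" are approximated by downward paths of edge labels. The function names and the two sample trees are hypothetical.

```python
def label_paths(tree):
    """Return the set of downward edge-label paths in a tree (assumed subtree proxy)."""
    found = set()

    def walk(children, open_paths):
        for label, grandchildren in children:
            # Extend every path that ends at the parent, and start a new path here.
            extended = [p + (label,) for p in open_paths] + [(label,)]
            found.update(extended)
            walk(grandchildren, extended)

    walk(tree, [])
    return found


def ratio_similarity(a, b):
    """Number of common subtrees divided by the number of all subtrees."""
    pa, pb = label_paths(a), label_paths(b)
    union = pa | pb
    return len(pa & pb) / len(union) if union else 0.0


def count_similarity(a, b):
    """Number of common subtrees."""
    return len(label_paths(a) & label_paths(b))


def max_common_nodes(a, b):
    """Node count of the largest common subtree (a path of k edges has k + 1 nodes)."""
    common = label_paths(a) & label_paths(b)
    return max((len(p) + 1 for p in common), default=0)


# Two hypothetical redirect trees.
T1 = [("iframe", [("http302", [])]), ("script", [("iframe", [])])]
T2 = [("iframe", [("http302", [("script", [])])])]

ratio = ratio_similarity(T1, T2)
count = count_similarity(T1, T2)
largest = max_common_nodes(T1, T2)
```

For these two trees the path sets have 5 and 6 elements with 4 in common, so the ratio is 4/7, the count is 4, and the largest common path has 3 nodes.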
- the classification unit 16 may classify the access by calculating the inner product value in the space of the feature amount of the plurality of trees using the similarity.
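One way to read the inner product in the feature space is sketched below in Python. It relies on a general fact rather than on anything specific to this document: if each tree is mapped to a 0/1 vector indicating which subtrees it contains, the inner product of two such vectors equals the number of common subtrees. The path-based notion of "subtree", the function names, and the sample trees are assumptions made for illustration.

```python
def label_paths(tree):
    """Return the set of downward edge-label paths in a tree (assumed subtree proxy)."""
    found = set()

    def walk(children, open_paths):
        for label, grandchildren in children:
            # Extend every path that ends at the parent, and start a new path here.
            extended = [p + (label,) for p in open_paths] + [(label,)]
            found.update(extended)
            walk(grandchildren, extended)

    walk(tree, [])
    return found


def feature_vector(tree, vocabulary):
    """0/1 indicator vector over every subtree seen in the data set."""
    paths = label_paths(tree)
    return [1 if p in paths else 0 for p in vocabulary]


def inner_product(u, v):
    return sum(x * y for x, y in zip(u, v))


# Hypothetical trees: nested lists of (redirect_method, children) pairs.
trees = [
    [("iframe", [("http302", [])]), ("script", [("iframe", [])])],
    [("iframe", [("http302", [("script", [])])])],
    [("meta", [("script", [])])],
]
vocabulary = sorted(set().union(*(label_paths(t) for t in trees)))
vectors = [feature_vector(t, vocabulary) for t in trees]

# Gram matrix of pairwise inner products; entry [i][j] is the common-subtree count.
gram = [[inner_product(u, v) for v in vectors] for u in vectors]
```

A kernel-based classifier such as a support vector machine needs only a matrix of this kind, so the explicit vectors never have to be built in practice; they are constructed here only to make the equivalence visible.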
- the access classification device 10 includes a tree construction unit 13, a similarity calculation unit 14, a classification unit 16, and a representative tree creation unit 15.
- the tree construction unit 13 creates a plurality of trees.
- the similarity calculation unit 14 calculates the similarity between the plurality of trees based on the degree of coincidence of the partial trees that constitute each of the created trees.
- The classification unit 16 classifies, based on the calculated similarity, the plurality of trees into a plurality of sets, each composed of a plurality of trees with high similarity to one another.
- the representative tree creation unit 15 creates, as a representative tree, one or a plurality of subtrees (for example, common subtrees in the same set) representing the characteristics of each set for each set obtained by the above classification.
- the classification unit 16 may classify the access based on the similarity between the representative tree and the access.
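The grouping of trees into sets and the creation of a representative tree for each set can be sketched in Python as follows. The greedy single-pass grouping, the threshold value, and the path-based approximation of "subtrees" are assumptions chosen for brevity; the text does not prescribe a particular clustering procedure. The representative tree is taken here to be the subtrees common to every tree in a set, which is the example the text itself gives.

```python
def label_paths(tree):
    """Return the set of downward edge-label paths in a tree (assumed subtree proxy)."""
    found = set()

    def walk(children, open_paths):
        for label, grandchildren in children:
            # Extend every path that ends at the parent, and start a new path here.
            extended = [p + (label,) for p in open_paths] + [(label,)]
            found.update(extended)
            walk(grandchildren, extended)

    walk(tree, [])
    return found


def ratio_similarity(a, b):
    """Number of common subtrees divided by the number of all subtrees."""
    pa, pb = label_paths(a), label_paths(b)
    union = pa | pb
    return len(pa & pb) / len(union) if union else 0.0


def cluster(trees, threshold):
    """Greedy grouping: a tree joins the first set whose first member is similar enough."""
    sets = []
    for tree in trees:
        for group in sets:
            if ratio_similarity(group[0], tree) >= threshold:
                group.append(tree)
                break
        else:
            sets.append([tree])  # no similar set found, so start a new one
    return sets


def representative(group):
    """Subtrees shared by every tree in the set."""
    common = label_paths(group[0])
    for tree in group[1:]:
        common &= label_paths(tree)
    return common


# Hypothetical trees: nested lists of (redirect_method, children) pairs.
T1 = [("iframe", [("http302", [])]), ("script", [("iframe", [])])]
T2 = [("iframe", [("http302", [("script", [])])])]
T3 = [("meta", [("script", [])])]

sets = cluster([T1, T2, T3], threshold=0.5)
reps = [representative(group) for group in sets]
```

With a threshold of 0.5, `T1` and `T2` (similarity 4/7) fall into one set and `T3` forms its own set.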
- the access classification device 10 classifies a series of accesses to the server including automatic transfer.
- The access classification device 10 can identify a malicious web site from the characteristics of the redirect pattern. Therefore, the access classification device 10 can prevent the user from being infected with malware by blocking the user's access to a web site determined to be malicious. As a result, it is possible to construct a malicious web site identification method that is not easily affected by an attacker's changes to content and the like, and that can identify a site from a single access.
- The access classification device 10 can identify a malicious web site without depending on information such as content, URL, and host obtained from the server. For this reason, the access classification device 10 can detect an attack on the user via the malicious web site even when the content is modified or the URL is intentionally changed. Detection of malicious web sites and of attacks is therefore realized in a way that is difficult for an attacker to evade.
- FIG. 15 is a diagram illustrating that the information processing by the access classification program is specifically realized by using the computer 100.
- The computer 100 includes, for example, a memory 101, a CPU (Central Processing Unit) 102, a hard disk drive interface 103, a disk drive interface 104, a serial port interface 105, a video adapter 106, and a network interface 107. These units are connected by a bus B.
- The memory 101 includes a ROM (Read Only Memory) 101a and a RAM (Random Access Memory) 101b, as shown in FIG. 15.
- the ROM 101a stores a boot program such as BIOS (Basic Input Output System).
- The hard disk drive interface 103 is connected to the hard disk drive 108, as shown in FIG. 15.
- The disk drive interface 104 is connected to the disk drive 109, as shown in FIG. 15.
- a removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 109.
- the serial port interface 105 is connected to, for example, a mouse 110 and a keyboard 111.
- The video adapter 106 is connected to, for example, a display 112, as shown in FIG. 15.
- the hard disk drive 108 stores, for example, an OS (Operating System) 108a, an application program 108b, a program module 108c, program data 108d, a tree including a partial tree and a representative tree, access related information, and the like.
- The access classification program according to the disclosed technique is stored in, for example, the hard disk drive 108 as the program module 108c in which instructions to be executed by the computer 100 are described.
- Specifically, the program module 108c, in which procedures for executing the same information processing as each of the target access input unit 11, the known access input unit 12, the tree construction unit 13, the similarity calculation unit 14, the representative tree creation unit 15, and the classification unit 16 described in the above embodiment are written, is stored in the hard disk drive 108.
- Data used for information processing by the access classification program is stored as program data 108d, for example, in the hard disk drive 108. Then, the CPU 102 reads the program module 108c and the program data 108d stored in the hard disk drive 108 to the RAM 101b as necessary, and executes the above various procedures.
- The program module 108c and the program data 108d related to the access classification program are not limited to being stored in the hard disk drive 108; they may be stored in, for example, a removable storage medium and read out by the CPU 102 via the disk drive 109 or the like.
- Alternatively, the program module 108c and the program data 108d related to the access classification program may be stored in another computer connected via a network (a LAN (Local Area Network), a WAN (Wide Area Network), or the like) and read out by the CPU 102 via the network interface 107.
- Each component of the access classification device 10 described above does not necessarily need to be physically configured as illustrated. That is, the specific form of distribution and integration of each device is not limited to the illustrated one, and all or a part thereof can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like.
- the target access input unit 11 and the known access input unit 12, or the similarity calculation unit 14 and the representative tree creation unit 15 may be integrated as one component.
- The classification unit 16 may be divided into a part that classifies accesses and a part that classifies a plurality of trees into sets.
- The hard disk drive 108, which stores trees including subtrees and representative trees, access-related information, and the like, may be connected as an external device of the access classification device 10 via a network or a cable.
- 10 Access classification device
- 11 Target access input unit
- 11a Analysis target access
- 12 Known access input unit
- 12a Known access
- 13 Tree construction unit
- 14 Similarity calculation unit
- 15 Representative tree creation unit
- 16 Classification unit
- 100 Computer
- 101 Memory
- 101a ROM
- 101b RAM
- 102 CPU
- 103 Hard disk drive interface
- 104 Disk drive interface
- 105 Serial port interface
- 106 Video adapter
- 107 Network interface
- 108 Hard disk drive
- 108a OS
- 108b Application program
- 108c Program module
- 108d Program data
- 109 Disk drive
- 110 Mouse
- 111 Keyboard
- 112 Display
- B Bus
- C1 to C5, C1′ to C3′, C1″, C2″ Sets
- E1 to E6 Edges
- N1 to N8 Nodes
- T1, T2, T5, T6, T11 to T15 Trees
- T3, T7 Common subtrees
- T4 All subtrees
- T8 Common tree
- T9 Representative tree
Claims (11)
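- An access classification device comprising:
a creation unit that creates a plurality of trees in which a first server and a second server are nodes and an instruction to transfer an access from the first server to the second server is an edge;
a calculation unit that calculates a similarity between the plurality of trees based on a degree of matching of the subtrees constituting each of the created trees; and
a classification unit that classifies the access based on the calculated similarity.
- The access classification device according to claim 1, wherein the calculation unit calculates, as the similarity, the ratio of the number of subtrees common to the plurality of trees to the number of all subtrees constituting the plurality of trees.
- The access classification device according to claim 1, wherein the calculation unit calculates, as the similarity, the number of subtrees common to the plurality of trees.
- The access classification device according to claim 1, wherein the calculation unit calculates, as the similarity, the number of nodes of the subtree having the largest number of nodes among the subtrees common to the plurality of trees.
- The access classification device according to claim 1, wherein the classification unit uses the similarity to calculate an inner product value in the feature space of the plurality of trees, and classifies the access.
- An access classification device comprising:
a tree creation unit that creates a plurality of trees in which a first server and a second server are nodes and an instruction to transfer an access from the first server to the second server is an edge;
a calculation unit that calculates a similarity between the plurality of trees based on a degree of matching of the subtrees constituting each of the created trees;
a classification unit that classifies, based on the calculated similarity, the plurality of trees into a plurality of sets each composed of a plurality of trees with high similarity; and
a representative tree creation unit that creates, for each set obtained by the classification, a subtree representing the features of that set as a representative tree.
- The access classification device according to claim 6, wherein the classification unit classifies an access to a server based on the similarity between the representative tree and the access.
- An access classification method comprising:
a creation step of creating a plurality of trees in which a first server and a second server are nodes and an instruction to transfer an access from the first server to the second server is an edge;
a calculation step of calculating a similarity between the plurality of trees based on a degree of matching of the subtrees constituting each of the created trees; and
a classification step of classifying the access based on the calculated similarity.
- An access classification method comprising:
a tree creation step of creating a plurality of trees in which a first server and a second server are nodes and an instruction to transfer an access from the first server to the second server is an edge;
a calculation step of calculating a similarity between the plurality of trees based on a degree of matching of the subtrees constituting each of the created trees;
a classification step of classifying, based on the calculated similarity, the plurality of trees into a plurality of sets each composed of a plurality of trees with high similarity; and
a representative tree creation step of creating, for each set obtained by the classification, a subtree representing the features of that set as a representative tree.
- An access classification program for causing a computer to execute:
a creation step of creating a plurality of trees in which a first server and a second server are nodes and an instruction to transfer an access from the first server to the second server is an edge;
a calculation step of calculating a similarity between the plurality of trees based on a degree of matching of the subtrees constituting each of the created trees; and
a classification step of classifying the access based on the calculated similarity.
- An access classification program for causing a computer to execute:
a tree creation step of creating a plurality of trees in which a first server and a second server are nodes and an instruction to transfer an access from the first server to the second server is an edge;
a calculation step of calculating a similarity between the plurality of trees based on a degree of matching of the subtrees constituting each of the created trees;
a classification step of classifying, based on the calculated similarity, the plurality of trees into a plurality of sets each composed of a plurality of trees with high similarity; and
a representative tree creation step of creating, for each set obtained by the classification, a subtree representing the features of that set as a representative tree.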
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2017521950A JP6557334B2 (ja) | 2015-06-02 | 2016-05-31 | Access classification device, access classification method, and access classification program |
US15/577,938 US10462168B2 (en) | 2015-06-02 | 2016-05-31 | Access classifying device, access classifying method, and access classifying program |
EP16803343.9A EP3287909B1 (en) | 2015-06-02 | 2016-05-31 | Access classification device, access classification method, and access classification program |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2015-112227 | 2015-06-02 | ||
JP2015112227 | 2015-06-02 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2016194909A1 true WO2016194909A1 (ja) | 2016-12-08 |
Family
ID=57441256
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2016/066054 WO2016194909A1 (ja) | 2015-06-02 | 2016-05-31 | Access classification device, access classification method, and access classification program |
Country Status (4)
Country | Link |
---|---|
US (1) | US10462168B2 (ja) |
EP (1) | EP3287909B1 (ja) |
JP (1) | JP6557334B2 (ja) |
WO (1) | WO2016194909A1 (ja) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3647982A1 (en) | 2018-10-31 | 2020-05-06 | Fujitsu Limited | Cyber attack evaluation method and cyber attack evaluation device |
CN116127079A (zh) * | 2023-04-20 | 2023-05-16 | 中电科大数据研究院有限公司 | Text classification method |
US20230394021A1 (en) * | 2022-06-07 | 2023-12-07 | Oracle International Corporation | Computing similarity of tree data structures using metric functions defined on sets |
JP7574424B2 (ja) | 2022-01-29 | 2024-10-28 | ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド | Web page identification method, apparatus, electronic device, medium, and computer program |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP6866322B2 (ja) * | 2018-02-13 | 2021-04-28 | 日本電信電話株式会社 | Access source classification device, access source classification method, and program |
US20230034914A1 (en) * | 2021-07-13 | 2023-02-02 | Fortinet, Inc. | Machine Learning Systems and Methods for API Discovery and Protection by URL Clustering With Schema Awareness |
US20240022585A1 (en) * | 2022-07-15 | 2024-01-18 | HiddenLayer Inc. | Detecting and responding to malicious acts directed towards machine learning model |
CN116186628B (zh) * | 2023-04-23 | 2023-07-07 | 广州钛动科技股份有限公司 | Method and system for automatic labeling of apps |
US12105844B1 (en) | 2024-03-29 | 2024-10-01 | HiddenLayer, Inc. | Selective redaction of personally identifiable information in generative artificial intelligence model outputs |
US12107885B1 (en) | 2024-04-26 | 2024-10-01 | HiddenLayer, Inc. | Prompt injection classifier using intermediate results |
US12111926B1 (en) | 2024-05-20 | 2024-10-08 | HiddenLayer, Inc. | Generative artificial intelligence model output obfuscation |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
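JP2000172699A (ja) * | 1998-12-04 | 2000-06-23 | Fuji Xerox Co Ltd | Hypertext structure change support device and method, and storage medium storing a hypertext structure change support program |
JP2010072727A (ja) * | 2008-09-16 | 2010-04-02 | Nippon Telegr & Teleph Corp <Ntt> | History processing device, history processing method, and history processing program |
JP2010122737A (ja) * | 2008-11-17 | 2010-06-03 | Ntt Docomo Inc | Content distribution server, content distribution method, and communication system |
WO2015141665A1 (ja) * | 2014-03-19 | 2015-09-24 | 日本電信電話株式会社 | Website information extraction device, system, website information extraction method, and website information extraction program |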
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7933946B2 (en) | 2007-06-22 | 2011-04-26 | Microsoft Corporation | Detecting data propagation in a distributed system |
CN104303152B (zh) | 2012-03-22 | 2017-06-13 | 洛斯阿拉莫斯国家安全股份有限公司 | Method, apparatus, and system for detecting anomalies in an internal network to identify coordinated group attacks |
US20130304677A1 (en) * | 2012-05-14 | 2013-11-14 | Qualcomm Incorporated | Architecture for Client-Cloud Behavior Analyzer |
WO2014110281A1 (en) * | 2013-01-11 | 2014-07-17 | Db Networks, Inc. | Systems and methods for detecting and mitigating threats to a structured data storage system |
-
2016
- 2016-05-31 JP JP2017521950A patent/JP6557334B2/ja active Active
- 2016-05-31 EP EP16803343.9A patent/EP3287909B1/en active Active
- 2016-05-31 US US15/577,938 patent/US10462168B2/en active Active
- 2016-05-31 WO PCT/JP2016/066054 patent/WO2016194909A1/ja active Application Filing
Non-Patent Citations (1)
Title |
---|
See also references of EP3287909A4 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3647982A1 (en) | 2018-10-31 | 2020-05-06 | Fujitsu Limited | Cyber attack evaluation method and cyber attack evaluation device |
US11290474B2 (en) | 2018-10-31 | 2022-03-29 | Fujitsu Limited | Cyber attack evaluation method and cyber attack evaluation device |
JP7574424B2 (ja) | 2022-01-29 | 2024-10-28 | ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド | Web page identification method, apparatus, electronic device, medium, and computer program |
US20230394021A1 (en) * | 2022-06-07 | 2023-12-07 | Oracle International Corporation | Computing similarity of tree data structures using metric functions defined on sets |
CN116127079A (zh) * | 2023-04-20 | 2023-05-16 | 中电科大数据研究院有限公司 | Text classification method |
CN116127079B (zh) * | 2023-04-20 | 2023-06-20 | 中电科大数据研究院有限公司 | Text classification method |
Also Published As
Publication number | Publication date |
---|---|
JP6557334B2 (ja) | 2019-08-07 |
JPWO2016194909A1 (ja) | 2018-04-05 |
EP3287909B1 (en) | 2019-07-03 |
US20180176242A1 (en) | 2018-06-21 |
EP3287909A1 (en) | 2018-02-28 |
US10462168B2 (en) | 2019-10-29 |
EP3287909A4 (en) | 2018-12-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP6557334B2 (ja) | Access classification device, access classification method, and access classification program | |
US11089040B2 (en) | Cognitive analysis of security data with signal flow-based graph exploration | |
US11212297B2 (en) | Access classification device, access classification method, and recording medium | |
Xiao et al. | Malware detection based on deep learning of behavior graphs | |
Shibahara et al. | Efficient dynamic malware analysis based on network behavior using deep learning | |
US11314862B2 (en) | Method for detecting malicious scripts through modeling of script structure | |
CN109074454B (zh) | Automatic grouping of malware based on artifacts | |
JP6674036B2 (ja) | Classification device, classification method, and classification program | |
JP6687761B2 (ja) | Combining device, combining method, and combining program | |
Li et al. | Towards fine-grained fingerprinting of firmware in online embedded devices | |
KR101806118B1 (ko) | Method and apparatus for identifying vulnerability information through open port banner keyword analysis | |
Mimura et al. | Filtering malicious javascript code with doc2vec on an imbalanced dataset | |
WO2017077847A1 (ja) | Analysis device, analysis method, and analysis program | |
WO2018143097A1 (ja) | Determination device, determination method, and determination program | |
Gordeychik et al. | SD-WAN internet census | |
US20210273963A1 (en) | Generation device, generation method, and generation program | |
Kumar et al. | Detection of malware using deep learning techniques | |
JP6666475B2 (ja) | Analysis device, analysis method, and analysis program | |
JP6787861B2 (ja) | Classification device | |
Aung et al. | ATLAS: A Practical Attack Detection and Live Malware Analysis System for IoT Threat Intelligence | |
Acosta et al. | Automatic data generation and rule creation for network scanning tools | |
JP2016170524A (ja) | Malicious URL candidate acquisition device, malicious URL candidate acquisition method, and program | |
KR20240019738A (ko) | Cyber threat information processing device, cyber threat information processing method, and storage medium storing a program for processing cyber threat information | |
KR20180050205A (ko) | Method and apparatus for identifying vulnerability information through open port banner keyword analysis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 16803343; Country of ref document: EP; Kind code of ref document: A1 |
| ENP | Entry into the national phase | Ref document number: 2017521950; Country of ref document: JP; Kind code of ref document: A |
| WWE | Wipo information: entry into national phase | Ref document number: 2016803343; Country of ref document: EP |
| WWE | Wipo information: entry into national phase | Ref document number: 15577938; Country of ref document: US |
| NENP | Non-entry into the national phase | Ref country code: DE |