WO2024089860A1 - Classification device, classification method, and classification program - Google Patents


Info

Publication number
WO2024089860A1
Authority
WO
WIPO (PCT)
Prior art keywords
post
feature
tweets
classification
unit
Prior art date
Application number
PCT/JP2022/040260
Other languages
English (en)
Japanese (ja)
Inventor
弘樹 中野
大紀 千葉
駿 小出
直翼 福士
Original Assignee
Nippon Telegraph and Telephone Corporation (日本電信電話株式会社)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corporation
Priority to PCT/JP2022/040260
Publication of WO2024089860A1


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content

Definitions

  • the present invention relates to a classification device, a classification method, and a classification program for classifying posts related to security threat information.
  • Security blogs, security reports, social platforms, etc. are sources from which information on security threats such as phishing attacks can be extracted.
  • according to Non-Patent Documents 3 and 4, natural language processing technology can be applied to blogs and reports that summarize threat information analyzed by security experts, extracting the data in a structured format so that it can be used mechanically.
  • Non-Patent Document 5 compares and evaluates Twitter (registered trademark), Facebook (registered trademark), news sites, security blogs, security forums, etc. as sources of threat information, and reports that Twitter is superior in terms of both the quantity and quality of information that can be collected.
  • Non-Patent Documents 6, 7, and 8 propose technology that focuses on specific users and keywords on Twitter and extracts threat-related URLs, domain names, hash values, IP addresses, vulnerability information, and other information from each user's tweets. It has been reported that this technology can obtain a large amount of useful threat information.
  • Non-Patent Document 1: "Phishing attacks continue at a rapid pace - about 270 unique URLs per day on average", Security NEXT, [online], [retrieved October 13, 2022], Internet <URL: https://www.security-next.com/134607>
  • Non-Patent Document 2: "2022/02 Phishing Report Status", Council of Anti-Phishing Japan, [online], [retrieved October 13, 2022], Internet <URL: https://www.antiphishing.jp/report/monthly/202202.html>
  • Non-Patent Document 3: Zhu, Ziyun and Dumitras, Tudor, "ChainSmith: Automatically Learning the Semantics of Malicious Campaigns by Mining Threat Intelligence Reports", 2018 IEEE European Symposium on Security and Privacy
  • Non-Patent Document 4: Satvat, Kiavash, Gjomemo, Rigel and Venkatakrishnan, V.N., "EXTRACTOR: Extracting Attack Behavior from Threat Reports", IEEE EuroS&P 2021
  • the objective of the present invention is to solve the above-mentioned problem and extract useful security threat information.
  • the present invention is characterized by comprising: a feature extraction unit that extracts features of the text and images contained in posts related to security threats on an SNS (Social Networking Service); a learning unit that uses the features, together with training data in which each post is labeled with a correct answer as to whether or not it is a security threat, to learn a machine learning model for classifying whether an input post is a security threat; a classification unit that uses the trained machine learning model to classify whether an input post is a security threat; and an output processing unit that outputs the results of the classification.
  • the present invention makes it possible to extract useful security threat information.
  • FIG. 1 is a diagram illustrating an example of a system configuration.
  • FIG. 2A is a diagram illustrating an example of the configuration of a collection device.
  • FIG. 2B is a flowchart illustrating an example of a processing procedure executed by the collection device.
  • FIG. 3 is a diagram for explaining a specific example of a processing procedure executed by the collection device.
  • FIG. 4 is a diagram showing an example of security keywords.
  • FIG. 5 is a diagram illustrating an example of generating co-occurrence keywords.
  • FIG. 6 is a diagram showing an example of a Tweet that is the subject of data collection.
  • FIG. 7 is a diagram for explaining the process of extracting a URL and a domain name from the text and image of a Tweet.
  • FIG. 8A is a diagram illustrating an example of the configuration of a classification device.
  • FIG. 8B is a flowchart illustrating an example of a processing procedure executed by the classification device.
  • FIG. 9 is a diagram for explaining a specific example of a processing procedure executed by the classification device.
  • FIG. 10 is a diagram showing an example of feature quantities generated from a Tweet.
  • FIG. 11 is a diagram showing an example of an Account Feature of a Tweet.
  • FIG. 12 is a diagram showing an example of a Content Feature of a Tweet.
  • FIG. 13 is a diagram showing an example of a URL Feature of a Tweet.
  • FIG. 14 is a diagram showing an example of an OCR Feature of a Tweet.
  • FIG. 15 is a diagram showing an example of a Visual Feature of a Tweet.
  • FIG. 16 is a diagram showing an example of a Context Feature of a Tweet.
  • FIG. 17 is a diagram showing an example of feature amounts selected by the selection unit in FIG. 8A.
  • FIG. 18 shows the evaluation results of the classification accuracy of the system.
  • FIG. 19 is a diagram showing the number of phishing attack reports and URLs related to phishing attacks extracted by the system during a given period.
  • FIG. 20 is a diagram showing the results of comparing the system with OpenPhish.
  • FIG. 21 is a diagram showing the comparison results between the system and PhishTank.
  • FIG. 22 is a diagram showing the survey results of the number of reports by users and the number of phishing URLs.
  • FIG. 23 is a diagram showing the effect of dynamically selecting keywords.
  • FIG. 24 is a diagram illustrating a computer that executes a program.
  • SNS posts may be in either Japanese or English.
  • the system, for example, quickly and accurately extracts tweets reporting phishing attacks from each user's tweets.
  • the system includes a collection device 10 and a classification device 20.
  • the collection device 10 and the classification device 20 may be connected to each other so as to be able to communicate with each other via a network such as the Internet, or may be installed in the same device.
  • the collection device 10 collects a wide range of Tweets that may be reports of phishing attacks. For example, the collection device 10 extracts keywords that co-occur in reports of phishing attacks (Co-occurrence Keywords). The collection device 10 then uses keywords related to security threats (Security Keywords) and the above-mentioned Co-occurrence Keywords to collect a wide range of Tweets that may be reports of phishing attacks (Screened Tweets in FIG. 1).
  • the classification device 20 classifies tweets reporting phishing attacks from among the tweets collected by the collection device 10. For example, the classification device 20 extracts text and image features of tweets reporting phishing attacks through machine learning, and uses the extracted features to classify each tweet as either a tweet reporting a phishing attack or another tweet.
  • the collection device 10 may extract Co-occurrence Keywords from the group of Tweets classified as Tweets reporting phishing attacks. The collection device 10 may then use the extracted Co-occurrence Keywords to collect Tweets that may be reports of phishing attacks. In this way, the system can dynamically expand/reduce the keywords for collecting Tweets that may be reports of phishing attacks, and collect Tweets that should be collected at the appropriate time.
  • the system can also accurately extract reports of phishing attacks from the large amount of collected Tweets. Furthermore, the system extracts information about phishing attacks from both the text and images contained in Tweets, making it possible to extract useful information that could not be obtained by simply analyzing the text of Tweets.
  • This system provides the following benefits in countering phishing attacks: (1) It becomes possible to collect threat information from a wider range than the limited monitoring targets of conventional technology, making it possible to provide threat information from a new perspective.
  • the collection device 10 includes, for example, an input/output unit 11, a storage unit 12, and a control unit 13.
  • the input/output unit 11 is an interface that handles the input and output of various data. For example, the input/output unit 11 accepts input of Tweets collected from Twitter. In addition, the input/output unit 11 outputs Tweets that may be reports of phishing attacks extracted by the control unit 13 (Screened Tweets in FIG. 1 ).
  • the storage unit 12 stores data, programs, etc. that are referenced when the control unit 13 executes various processes.
  • the storage unit 12 is realized, for example, by a semiconductor memory element such as a RAM (Random Access Memory) or a flash memory, or by a storage device such as a hard disk or an optical disk.
  • the storage unit 12 stores, for example, the security keywords, co-occurrence keywords, etc. extracted by the control unit 13.
  • the control unit 13 is responsible for controlling the entire collection device 10.
  • the functions of the control unit 13 are realized, for example, by a CPU (Central Processing Unit) executing a program stored in the storage unit 12.
  • the control unit 13 includes, for example, a first collection unit 131, a keyword extraction unit 132, a second collection unit 133, and a data collection unit 134.
  • the URL/domain name extraction unit 135 and the selection unit 136, shown by dashed lines, may or may not be provided; cases in which they are provided will be described later.
  • the first collection unit 131 uses Security Keywords, which are keywords related to security threats, to collect Tweets reporting phishing attacks from each user's Tweets.
  • the keyword extraction unit 132 extracts co-occurrence keywords, which are keywords that co-occur with more than a predetermined frequency, from tweets reporting phishing attacks collected by the first collection unit 131. Note that these co-occurrence keywords may be extracted from tweets classified by the classification device 20 as tweets reporting phishing attacks.
  • the second collection unit 133 uses the Co-occurrence Keywords to collect Tweets that may be reports of phishing attacks from the Tweets of each user. For example, the second collection unit 133 collects Tweets that contain Security Keywords and Co-occurrence Keywords in the text of the Tweet or in images linked to the Tweet from the Tweets of each user. The collected Tweets are stored, for example, in the storage unit 12.
  • the data collection unit 134 collects data necessary for input to the classification device 20.
  • the data collection unit 134 collects the following data from Tweets collected by the second collection unit 133: (1) Tweet character strings (e.g., hashtags, number of characters, etc.), (2) meta information linked to the Tweet (e.g., application information, presence or absence of defang, etc.), (3) information related to the Tweet's account (e.g., number of followers of the account, period of account registration, etc.), and (4) images included in the Tweet (e.g., up to four images linked to the Tweet, etc.).
  • the collected data is stored, for example, in the storage unit 12.
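The four categories of collected data can be modeled, for example, as a simple record type. This is only an illustration; the field names below are assumptions, not the patent's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CollectedTweet:
    """Illustrative record for the four kinds of data the data
    collection unit 134 gathers; field names are assumptions."""
    text: str                                   # (1) Tweet character strings
    hashtags: List[str] = field(default_factory=list)
    application: Optional[str] = None           # (2) meta information
    is_defanged: bool = False
    follower_count: int = 0                     # (3) account information
    account_age_days: int = 0
    image_urls: List[str] = field(default_factory=list)  # (4) images

    def __post_init__(self):
        # the text notes that up to four images are linked to a Tweet
        self.image_urls = self.image_urls[:4]

tweet = CollectedTweet(
    text="Phishing site impersonating Company d", hashtags=["#phishing"],
    image_urls=["a.png", "b.png", "c.png", "d.png", "e.png"])
```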
  • the first collection unit 131 of the collection device 10 collects tweets reporting phishing attacks using, for example, security keywords (S1: collection of tweets using security keywords). Then, the keyword extraction unit 132 extracts co-occurrence keywords, which are keywords that co-occur with a predetermined frequency or more, from the tweets reporting phishing attacks collected in S1 (S2: extraction of co-occurrence keywords).
  • the second collection unit 133 uses the Security Keywords and Co-occurrence Keywords to collect Tweets that may be reports of phishing attacks from each user's Tweets (S3).
  • the data collection unit 134 collects data necessary for input to the classification device 20 from the Tweets collected in S3 (S4).
  • the collection device 10 can collect tweets that may be reports of phishing attacks.
  • the collection device 10 may also include a URL/domain name extraction unit 135 and a selection unit 136 as shown in FIG. 2A.
  • the URL/domain name extraction unit 135 extracts URLs and domain names from the text and images of the Tweets collected by the second collection unit 133.
  • the selection unit 136 selects Tweets that are likely to be reports of phishing attacks from the Tweets collected by the second collection unit 133, based on the URLs or domain names extracted by the URL/domain name extraction unit 135.
  • for example, if a URL or domain name extracted from a Tweet does not match a list of legitimate sites, the selection unit 136 selects the Tweet as likely to be a report of a phishing attack. Likewise, if a domain name has been registered in WHOIS for less than a predetermined number of days, the selection unit 136 selects the Tweet containing it as likely to be a report of a phishing attack.
  • the data collection unit 134 collects data (e.g., Tweet character strings, etc.) necessary for input to the classification device 20 from the Tweets selected by the selection unit 136.
  • the collection device 10 can collect tweets and their data that are more likely to be reports of phishing attacks from the collected tweets.
  • the collection device 10 generates two types of keywords (Security Keywords and Co-occurrence Keywords) for searching for Tweets containing reports of phishing attacks.
  • the collection device 10 generates, as security keywords, keywords related to security threats and the media through which they are spread, such as "SMS” and “fake site,” and keywords for sharing security threat information, such as "#phishing” and "#fraud” (see FIG. 4). Note that existing keywords related to security threats may be used as the security keywords.
  • the collection device 10 extracts co-occurring keywords (co-occurrence keywords) with a frequency exceeding a predetermined value only from reports of phishing attacks collected using security keywords as keys.
  • the first collection unit 131 of the collection device 10 uses Security Keywords to collect Tweets reporting phishing attacks from each user's Tweets.
  • the keyword extraction unit 132 then extracts Co-occurrence Keywords from the collected Tweets.
  • the keyword extraction unit 132 newly extracts Co-occurrence Keywords from the Tweets collected during each specified period.
  • for example, the keyword extraction unit 132 extracts proper nouns from the character strings of Tweets for a given period of time, and calculates the PMI (Pointwise Mutual Information) using formula (1), where X and Y are proper nouns contained in the Tweets.
  • the keyword extraction unit 132 then calculates the SoA (Strength of Association) using formula (2), where W is a proper noun contained in the Tweet and L is a label (security threat information or other).
  • the keyword extraction unit 132 extracts proper nouns whose SoA exceeds a predetermined threshold.
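The keyword scoring above can be sketched as follows. Since formulas (1) and (2) are not reproduced in this text, the standard definition of PMI and the common definition of SoA as a PMI difference are assumed here; the sample tweets are illustrative.

```python
import math

def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information; the standard definition, used as a
    stand-in for formula (1)."""
    return math.log2(p_xy / (p_x * p_y))

def soa(tweets, word, label):
    """Strength of association of `word` with `label`, computed as
    PMI(word, label) - PMI(word, not-label) -- the usual SoA definition,
    assumed here for formula (2).  Counts use add-one smoothing so empty
    cells do not blow up the logarithms."""
    n = len(tweets)
    n_w = sum(word in t["nouns"] for t in tweets)
    n_l = sum(t["label"] == label for t in tweets)
    n_wl = sum(word in t["nouns"] and t["label"] == label for t in tweets)
    p = lambda count: (count + 1) / (n + 2)
    return (pmi(p(n_wl), p(n_w), p(n_l))
            - pmi(p(n_w - n_wl), p(n_w), p(n - n_l)))

tweets = [
    {"nouns": {"Company d", "SMS", "fraud"}, "label": "phish"},
    {"nouns": {"Company d", "fraud"}, "label": "phish"},
    {"nouns": {"movie", "fraud"}, "label": "other"},
]
score_cooccur = soa(tweets, "Company d", "phish")   # co-occurs only with reports
score_unrelated = soa(tweets, "movie", "phish")     # appears only elsewhere
```

A proper noun like "Company d" that appears only in phishing reports gets a positive SoA and would be kept as a co-occurrence keyword.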
  • tweets containing the security keyword "fraud” include tweets related to phishing reports shown in FIG. 5 (1) and tweets unrelated to phishing reports shown in FIG. 5 (2).
  • the keyword extraction unit 132 extracts "Company d” and "SMS,” proper nouns that appear frequently (whose SoA exceeds a predetermined threshold) only in tweets ((1)) related to phishing reports that contain "fraud,” as co-occurrence keywords.
  • the collection device 10 collects data necessary for input to the classification device 20 from Twitter.
  • the second collection unit 133 collects Tweets that may be reports of phishing attacks from Tweets of each user by using the co-occurrence keywords extracted by the keyword extraction unit 132. In this way, the second collection unit 133 can collect Tweets that include URLs and domains of Potentially Phishing Sites, for example, as shown in FIG. 3.
  • the second collection unit 133 can collect Tweets (Screened Tweets) from among the Tweets of each user, excluding Tweets (Unrelated Tweets) related to Legitimate Sites.
  • the data collection unit 134 collects the following data related to the Tweets collected by the second collection unit 133 (see FIG. 6).
  • (1) Tweet character strings (e.g., hashtags, number of characters, etc.)
  • (2) meta information associated with the Tweet (e.g., application information, whether or not defanged, etc.)
  • (3) information about the Tweet's account (e.g., number of followers, period of account registration, etc.)
  • (4) images included in the Tweet (e.g., up to four images associated with the Tweet, etc.)
  • the URL/domain name extraction unit 135 of the collection device 10 extracts URLs and domain names from the text and images of the Tweets (Screened Tweets) collected by the second collection unit 133 .
  • the URL/domain name extraction unit 135 applies optical character recognition to the image of the Tweet to extract a character string.
  • if a character string has been defanged (e.g., "https" rewritten as "ttps"), the URL/domain name extraction unit 135 restores it to its original state.
  • the URL/domain name extraction unit 135 then extracts URLs and domain names from the character strings in the text and image of the Tweet using regular expressions.
  • the URL/domain name extraction unit 135 checks whether the extracted domain name exists in the Public Suffix List (see Reference 1) or the like.
  • if the URL/domain name extraction unit 135 confirms that the extracted domain name exists, it extracts the domain name and a URL that includes the domain name. For example, the URL/domain name extraction unit 135 extracts the following URL and domain name from the Tweet shown in FIG. 7.
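A minimal sketch of the defang restoration and regular-expression extraction described above. The defang patterns are illustrative (real posts use many more variants), and the Public Suffix List check is omitted.

```python
import re

# illustrative defang patterns, not an exhaustive list
_HXXP = re.compile(r"hxxp", re.IGNORECASE)        # hxxps:// -> https://
_TTPS = re.compile(r"\bttps?://", re.IGNORECASE)  # leading "h" removed
_DOTS = re.compile(r"\[\.\]|\(\.\)|\{\.\}")       # example[.]com -> example.com

URL_RE = re.compile(r"https?://\S+")
DOMAIN_RE = re.compile(r"[a-z0-9-]+(?:\.[a-z0-9-]+)+", re.IGNORECASE)

def refang(text):
    """Restore defanged character strings to their original state."""
    text = _HXXP.sub("http", text)
    text = _TTPS.sub(lambda m: "h" + m.group(0).lower(), text)
    return _DOTS.sub(".", text)

def extract_urls_and_domains(text):
    """Extract URLs and domain names from refanged text with regexes."""
    text = refang(text)
    urls = URL_RE.findall(text)
    domains = [m.group(0) for m in map(DOMAIN_RE.search, urls) if m]
    return urls, domains

urls, domains = extract_urls_and_domains(
    "warning: hxxps://evil[.]example[.]com/login spread via SMS")
```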
  • the selection unit 136 screens the URLs and domain names extracted by the URL/domain name extraction unit 135 for URLs and domain names related to phishing.
  • if the extracted URL or domain name matches neither an Allowlist (e.g., a list of URLs or domain names of legitimate websites) nor a Long-lived Domain Name (e.g., a domain name that has been registered in WHOIS for a predetermined number of days or more), the selection unit 136 determines that the extracted URL or domain name is a Potentially Phishing Site. The selection unit 136 then selects Tweets that include URLs or domain names determined to be Potentially Phishing Sites as Tweets that are likely to be reports of phishing attacks.
  • conversely, if the extracted URL or domain name matches the Allowlist or is a Long-lived Domain Name, the selection unit 136 determines that the URL or domain name is a Legitimate Site.
  • for example, if the extracted domain name does not match the Allowlist, the selection unit 136 passes the domain name. In addition, if the extracted domain name matches the Tranco List (see Reference 2), the selection unit 136 excludes the domain name as a domain name that is not related to phishing attacks.
  • the selection unit 136 also queries WHOIS for the extracted domain name, and if no information can be obtained, passes the domain name. Furthermore, based on the WHOIS information, the selection unit 136 excludes a domain name if it has been more than 365 days since it was registered, and passes the domain name if it has not been 365 days since it was registered. The selection unit 136 then selects, for example, a Tweet that contains at least one URL or domain name that has been passed in the above process as a Tweet that is likely to be a report of a phishing attack.
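The screening chain above (allowlist, Tranco List, WHOIS registration age) might be sketched as follows. The list contents are placeholders, and the WHOIS lookup is injected as a function, standing in for a live WHOIS query.

```python
from datetime import datetime, timedelta, timezone

ALLOWLIST = {"google.com"}      # illustrative allowlist entry
TRANCO_TOP = {"example.org"}    # illustrative popular-domain entry

def is_potentially_phishing(domain, whois_registered, now,
                            allowlist=ALLOWLIST, tranco=TRANCO_TOP,
                            max_age_days=365):
    """Decision chain sketched from the text above: exclude allowlisted
    and Tranco-listed domains, pass domains with no WHOIS record, and
    exclude domains registered more than max_age_days ago.
    `whois_registered` returns the registration datetime or None."""
    if domain in allowlist or domain in tranco:
        return False
    registered = whois_registered(domain)
    if registered is None:
        return True             # no WHOIS data: keep the domain for review
    return (now - registered).days <= max_age_days

NOW = datetime(2022, 10, 1, tzinfo=timezone.utc)
fresh = is_potentially_phishing(
    "evil.example", lambda d: NOW - timedelta(days=10), NOW)
old = is_potentially_phishing(
    "evil.example", lambda d: NOW - timedelta(days=400), NOW)
unknown = is_potentially_phishing("evil.example", lambda d: None, NOW)
listed = is_potentially_phishing(
    "google.com", lambda d: NOW - timedelta(days=10), NOW)
```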
  • the collection device 10 can extract tweets from each user that are likely to be reports of phishing attacks.
  • the classification device 20 includes, for example, an input/output unit 21, a storage unit 22, and a control unit 23.
  • the input/output unit 21 is an interface that handles the input and output of various data.
  • the input/output unit 21 accepts input of tweets that may be reports of phishing attacks collected by the collection device 10 and the associated data.
  • the input/output unit 21 also outputs the classification results obtained by the control unit 23.
  • the storage unit 22 stores data, programs, etc. referenced when the control unit 23 executes various processes.
  • the storage unit 22 is realized by a semiconductor memory element such as a RAM or a flash memory, or a storage device such as a hard disk or an optical disk.
  • the storage unit 22 stores tweets that are likely to be reports of phishing attacks received by the input/output unit 21 and the data (collected data), etc.
  • the storage unit 22 stores parameters of the classification model after the control unit 23 has learned the classification model.
  • the control unit 23 is responsible for controlling the entire classification device 20.
  • the functions of the control unit 23 are realized, for example, by the CPU executing a program stored in the storage unit 22.
  • the control unit 23 includes, for example, a data acquisition unit 231, a feature extraction unit 232, a feature selection unit 233, a learning unit 234, a classification unit 235, and an output processing unit 236.
  • the data acquisition unit 231 acquires tweets and their data that are likely to be reports of phishing attacks from the collection device 10.
  • the feature extraction unit 232 extracts features from the Tweet and its data acquired by the data acquisition unit 231. For example, the feature extraction unit 232 extracts features from the text and image of the Tweet acquired by the data acquisition unit 231.
  • the feature extraction unit 232 extracts, from a Tweet acquired by the data acquisition unit 231, features of the account of the Tweet, features of the content of the Tweet, features of the URL or domain name included in the Tweet, features of a character string obtained by optical character recognition of an image included in the post, features of an image included in the Tweet, features of the context of the text included in the Tweet, etc. Details of the extraction of Tweet features by the feature extraction unit 232 will be described later using specific examples.
  • the feature selection unit 233 selects, from among the features extracted by the feature extraction unit 232, features that are effective in classifying whether or not a tweet is related to a report of a phishing attack.
  • for example, Boruta-SHAP (see References 3 and 4) is used as the feature selection method.
  • the feature selection unit 233 selects, from among the features extracted by the feature extraction unit 232, features that are effective for classifying whether or not a tweet is related to a report of a phishing attack, using the following procedure.
  • (1) the feature selection unit 233 generates false (shadow) features that contain random values, in addition to the features to be selected.
  • (2) the feature selection unit 233 classifies the features to be selected together with the false features using a decision-tree-based algorithm, and calculates the variable importance of each feature.
  • (3) if the variable importance of a feature to be selected, calculated in (2), is greater than that of the false features, the feature selection unit 233 counts it as a hit.
  • (4) the feature selection unit 233 repeats (1) to (3) multiple times and selects the features that are determined to be statistically significant as features that are effective for classification.
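Steps (1) to (4) above can be sketched as follows. This is a simplification: the patent cites Boruta-SHAP, whereas this sketch uses Random Forest impurity importances in place of SHAP values and a plain hit-rate cutoff in place of the statistical test.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def shadow_feature_select(X, y, n_trials=10, hit_rate=0.6):
    """Simplified shadow-feature selection (Boruta-style sketch)."""
    n_samples, n_features = X.shape
    rng = np.random.default_rng(0)
    hits = np.zeros(n_features)
    for trial in range(n_trials):
        # (1) add "false" features: each real column with its rows
        # independently shuffled, so any predictive signal is destroyed
        shadows = rng.permuted(X, axis=0)
        X_aug = np.hstack([X, shadows])
        # (2) classify with a decision-tree-based algorithm and read off
        # the variable importance of every feature
        clf = RandomForestClassifier(n_estimators=50, random_state=trial)
        clf.fit(X_aug, y)
        importance = clf.feature_importances_
        # (3) count a hit when a real feature beats the best shadow
        hits += importance[:n_features] > importance[n_features:].max()
    # (4) keep features that win often enough across the repetitions
    return np.flatnonzero(hits / n_trials > hit_rate)

rng = np.random.default_rng(42)
y = rng.integers(0, 2, 200)
X = np.column_stack([
    y + 0.1 * rng.standard_normal(200),  # informative feature (index 0)
    rng.standard_normal(200),            # noise
    rng.standard_normal(200),            # noise
])
selected = shadow_feature_select(X, y)
```

The informative column survives the procedure, while pure-noise columns rarely beat their shadow counterparts often enough to be kept.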
  • the learning unit 234 learns a machine learning model (classification model) for classifying whether an input Tweet is a Tweet reporting a phishing attack or not through supervised learning using the features selected by the feature selection unit 233.
  • the learning unit 234 learns a classification model through supervised learning using the features selected by the feature selection unit 233 for teacher data related to phishing attacks (data to which each Tweet is assigned a correct answer label indicating whether it is a phishing attack or not).
  • the classification unit 235 uses the classification model learned by the learning unit 234 to classify whether the input Tweet is a Tweet reporting a phishing attack.
  • the output processing unit 236 outputs the result of the classification of the Tweet by the classification unit 235.
  • the data acquisition unit 231 of the classification device 20 acquires Tweets and their data that are likely to be reports of phishing attacks collected by the collection device 10 (S11: Acquisition of collected data).
  • the feature extraction unit 232 extracts features from the Tweets and their data acquired by the data acquisition unit 231 (S12: Extraction of Tweet features).
  • the feature selection unit 233 selects, from the features extracted in S12, features that are effective for classifying whether or not a Tweet is a report of a phishing attack (S13). Then, the learning unit 234 uses the features selected in S13 for the teacher data related to phishing attacks to learn a classification model for classifying whether or not an input Tweet is a report of a phishing attack (S14).
  • the classification unit 235 uses the classification model learned in S14 to classify whether the input Tweet is a Tweet reporting a phishing attack (S15). Then, the output processing unit 236 outputs the result of the classification in S15 (S16).
  • the data acquisition unit 231 of the classification device 20 acquires Tweets (Screened Tweets) and their data collected by the collection device 10. Then, the feature extraction unit 232 extracts features from the Tweets and their data acquired by the data acquisition unit 231.
  • the feature extraction unit 232 generates a total of 27 features of six types: (1) an Account Feature from the account of the Tweet, (2) a Content Feature from information linked to the Tweet, (3) a URL Feature from the extracted URL, (4) a Context Feature from the context of the Tweet, (5) an OCR Feature from character strings extracted by OCR, and (6) a Visual Feature from the appearance of the image.
  • (5-1) Account Feature: In order to capture the characteristics of a Twitter user, the feature extraction unit 232 generates an Account Feature for each Tweet from information about the user's account (e.g., number of followings, number of followers, number of Tweets, number of media, number of lists, account registration date, etc.), as shown in FIG. 11.
  • (5-2) Content Feature: In order to capture the characteristics of content that frequently appears in Tweets reporting phishing attacks, the feature extraction unit 232 generates a Content Feature for each Tweet from information linked to the Tweet itself (e.g., a character string, a mentioned user, a hashtag, an image, a URL or domain name, an application used in the Tweet, a defang type, etc.), as shown in FIG. 12.
  • (5-3) URL Feature: In order to capture features related to the abuse of subdomains specific to phishing URLs and the abuse of specific top-level domains, the feature extraction unit 232 generates a URL Feature for each Tweet from the URL (or domain name) extracted from both the character string and the images of the Tweet, as shown in FIG. 13.
  • the URL Feature is, for example, the character string of the URL, the domain name, the path, the numbers included in the URL, the top-level domain, etc.
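For illustration, a few URL-derived features of this kind could be computed as follows; the feature names are assumptions, not the patent's exact list.

```python
from urllib.parse import urlparse

def url_features(url):
    """Illustrative URL Feature values in the spirit of FIG. 13:
    subdomain depth, top-level domain, digit count, path length."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    labels = host.split(".") if host else []
    return {
        "url_length": len(url),
        # labels to the left of the registrable domain (e.g. example.com)
        "subdomain_count": max(len(labels) - 2, 0),
        "tld": labels[-1] if labels else "",
        "digit_count": sum(c.isdigit() for c in url),
        "path_length": len(parsed.path),
    }

url_feats = url_features("https://login.secure.example-bank.top/verify/account01")
```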
  • (5-4) OCR Feature: In order to capture characteristics of similar character strings in Tweets related to phishing attacks, the feature extraction unit 232 generates an OCR Feature for each Tweet from character strings extracted by optical character recognition (OCR), as shown in FIG. 14.
  • the OCR Feature is, for example, a character string, a word, a symbol, a number, a URL, a domain name, etc.
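Features of this kind over an OCR-extracted string could be computed, for example, as follows; the feature names are illustrative, not the patent's exact set.

```python
import re

def ocr_features(ocr_text):
    """Illustrative OCR Feature values (cf. FIG. 14): counts of words,
    symbols, numbers, and embedded URLs in an OCR-extracted string."""
    return {
        "char_count": len(ocr_text),
        "word_count": len(re.findall(r"[A-Za-z]+", ocr_text)),
        "digit_count": sum(c.isdigit() for c in ocr_text),
        "symbol_count": sum((not c.isalnum()) and (not c.isspace())
                            for c in ocr_text),
        "url_count": len(re.findall(r"https?://\S+", ocr_text)),
    }

ocr_feats = ocr_features("Warning! https://evil.example/login asks for PIN 1234")
```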
  • (5-5) Visual Feature: In order to capture the commonality in the appearance of images contained in Tweets related to reports of phishing attacks, the feature extraction unit 232 generates a Visual Feature for each Tweet from the images associated with the Tweet.
  • for example, the feature extraction unit 232 uses the EfficientNet model (see Reference 5), which has produced excellent results in image classification, to generate a fixed-dimensional vector from each image linked to the Tweet.
  • the feature extraction unit 232 then compresses the dimensions of the vector using Truncated SVD (see Reference 6), which converts a sparse vector into a dense vector.
  • the feature extraction unit 232 then treats the compressed vector as the Visual Feature of the image included in the Tweet.
  • for example, the feature extraction unit 232 converts images associated with Tweets into fixed-dimensional vectors using an EfficientNet model that has been pre-trained on a large number of images from ImageNet, as shown in FIG. 15, and then compresses the converted vectors to a cumulative contribution rate of 99% on the training data using Truncated SVD.
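The 99% cumulative-contribution compression can be sketched with a plain SVD. The random matrix below stands in for EfficientNet embeddings, and the centering step is a simplification for clarity (library truncated-SVD implementations may omit it).

```python
import numpy as np

def compress_to_99pct(vectors):
    """Compress fixed-dimensional embeddings by truncated SVD, keeping
    just enough components to reach a 99% cumulative contribution
    (explained variance) on the training data."""
    centered = vectors - vectors.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    contribution = np.cumsum(s**2) / np.sum(s**2)
    k = int(np.searchsorted(contribution, 0.99)) + 1  # components for 99%
    return centered @ vt[:k].T  # project onto the top-k components

rng = np.random.default_rng(0)
# 100 fake 64-dimensional image embeddings that really occupy a
# 5-dimensional subspace, so heavy compression is possible
embeddings = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 64))
compressed = compress_to_99pct(embeddings)
```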
  • the feature extraction unit 232 In order to grasp the commonality of context in Tweets related to reports of phishing attacks, the feature extraction unit 232 generates a Context Feature for each Tweet from character strings in the Tweet.
  • the feature extraction unit 232 generates a fixed-dimensional vector from the character strings in the Tweet, for example, using the BERT model, which has shown excellent results in sentence classification.
  • the feature extraction unit 232 then compresses the dimensionality of the vector using Truncated SVD.
  • the feature extraction unit 232 then sets the compressed vector as the Context Feature of the Tweet.
  • the feature extraction unit 232 converts the character strings in the Tweet into fixed-dimensional vectors using a BERT model that has been pre-trained on a large number of strings from Wikipedia in English and Japanese, as shown in FIG. 16. The feature extraction unit 232 then compresses the converted vectors, using Truncated SVD, to the number of dimensions that achieves a cumulative contribution rate of 99% on the training data.
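Producing a real Context Feature requires a pretrained BERT model. As a self-contained stand-in, the sketch below maps a string to a fixed-dimensional vector by averaging hash-seeded random token vectors; it mimics only the "fixed-dimensional vector from text" interface, not BERT itself, and everything about it is an illustrative assumption:

```python
import hashlib
import numpy as np

def text_embedding(text: str, dim: int = 64) -> np.ndarray:
    """Stand-in sentence vector: the average of per-token random vectors,
    each seeded by a hash of the token. Deterministic and fixed-dimensional."""
    vec = np.zeros(dim)
    tokens = text.lower().split()
    for tok in tokens:
        seed = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16) % (2 ** 32)
        vec += np.random.default_rng(seed).normal(size=dim)
    return vec / max(len(tokens), 1)

v1 = text_embedding("phishing site reported")
v2 = text_embedding("phishing site reported")
print(v1.shape, np.allclose(v1, v2))  # same text -> identical vector
```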
  • the feature selection unit 233 selects, from the group of features generated by the feature extraction unit 232 in (5), features that are effective (important) for distinguishing tweets reporting phishing attacks from other tweets.
  • Figure 17 shows examples of features that were determined to be important for classification as a result of feature selection.
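The patent does not fix a specific selection algorithm. As one simple stand-in, features can be ranked by the absolute correlation of each feature column with the binary label, keeping the top k (the data and threshold below are illustrative):

```python
import numpy as np

def select_features(X: np.ndarray, y: np.ndarray, k: int) -> list:
    """Keep the k features most correlated (in absolute value) with the
    binary label. A simple stand-in for the feature selection step."""
    y = np.asarray(y, dtype=float)
    scores = []
    for j in range(X.shape[1]):
        col = X[:, j]
        if col.std() == 0 or y.std() == 0:
            scores.append(0.0)  # constant column carries no signal
        else:
            scores.append(abs(np.corrcoef(col, y)[0, 1]))
    return sorted(np.argsort(scores)[::-1][:k].tolist())

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=100)
X = rng.normal(size=(100, 5))
X[:, 2] += 3.0 * y  # make feature 2 strongly label-dependent
print(select_features(X, y, k=2))
```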
  • the learning unit 234 learns a classification model (machine learning model) using the features (feature vectors) selected by the feature selection unit 233 in (6) and training data (Ground-Truth Dataset) to which correct labels indicating whether or not each tweet reports a phishing attack have been assigned.
  • Algorithms that can be used to train classification models include, for example, Random Forest, Neural Network, Decision Tree, Support Vector Machine, Logistic Regression, Naive Bayes, Gradient Boosting, and Stochastic Gradient Descent. After evaluating these algorithms on the training data, it was confirmed that Random Forest is preferable for the following three reasons.
  • Random Forest had better classification accuracy than any of the other algorithms. - Random Forest performed at a stable speed in both the learning and estimation (classification) phases. - Random Forest's feature importance was distributed across all six types of features.
  • the classification unit 235 classifies the Tweets collected by the collection device 10 into Tweets related to reports of phishing attacks (positive) or not (negative) using the machine learning model (classification model) learned in (7). Then, the output processing unit 236 outputs the result of the classification.
  • the classification device 20 may extract proper nouns that appear in tweets classified as reports of phishing attacks, and the collection device 10 may use the proper nouns when extracting co-occurrence keywords.
  • each component of each part shown in the figure is a functional concept, and does not necessarily have to be physically configured as shown in the figure.
  • the specific form of distribution and integration of each device is not limited to that shown in the figure, and all or a part of it can be functionally or physically distributed and integrated in any unit depending on various loads, usage conditions, etc.
  • each processing function performed by each device can be realized in whole or in any part by a CPU and a program executed by the CPU, or can be realized as hardware using wired logic.
  • the above-mentioned system can be implemented by installing a program as package software or online software on a desired computer.
  • the above-mentioned program can be executed by an information processing device, causing the information processing device to function as the above-mentioned system.
  • the information processing device referred to here includes mobile communication terminals such as smartphones, mobile phones, and PHS (Personal Handyphone System), as well as terminals such as PDAs (Personal Digital Assistants).
  • FIG. 24 is a diagram showing an example of a computer that executes a program.
  • the computer 1000 has, for example, a memory 1010 and a CPU 1020.
  • the computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. Each of these components is connected by a bus 1080.
  • the memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM (Random Access Memory) 1012.
  • the ROM 1011 stores a boot program such as a BIOS (Basic Input Output System).
  • the hard disk drive interface 1030 is connected to a hard disk drive 1090.
  • the disk drive interface 1040 is connected to a disk drive 1100.
  • a removable storage medium such as a magnetic disk or optical disk is inserted into the disk drive 1100.
  • the serial port interface 1050 is connected to a mouse 1110 and a keyboard 1120, for example.
  • the video adapter 1060 is connected to a display 1130, for example.
  • the hard disk drive 1090 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. That is, the programs that define each process executed by the above-mentioned system are implemented as program modules 1093 in which computer-executable code is written.
  • the program modules 1093 are stored, for example, in the hard disk drive 1090.
  • a program module 1093 for executing processes similar to the functional configuration of the system is stored in the hard disk drive 1090.
  • the hard disk drive 1090 may be replaced by an SSD (Solid State Drive).
  • the data used in the processing of the above-described embodiment is stored as program data 1094, for example, in memory 1010 or hard disk drive 1090. Then, the CPU 1020 reads the program module 1093 or program data 1094 stored in memory 1010 or hard disk drive 1090 into RAM 1012 as necessary and executes it.
  • the program module 1093 and program data 1094 are not limited to being stored in the hard disk drive 1090, but may be stored in, for example, a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and program data 1094 may be stored in another computer connected via a network (such as a LAN (Local Area Network), WAN (Wide Area Network)). The program module 1093 and program data 1094 may then be read by the CPU 1020 from the other computer via the network interface 1070.

Abstract

A classification device extracts, from tweets related to reports of phishing attacks collected by a collection device, features for each of the text and the images included in the Tweets. The classification device then performs learning with the features on training data labeled with correct labels indicating whether or not the tweets relate to reports of phishing attacks, thereby training a classification model for classifying input tweets according to whether or not they relate to reports of phishing attacks. The classification device then classifies tweets according to whether or not they relate to reports of phishing attacks using the trained classification model, and outputs the result of the classification.
PCT/JP2022/040260 2022-10-27 2022-10-27 Classification device, classification method, and classification program WO2024089860A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/040260 WO2024089860A1 (fr) 2022-10-27 2022-10-27 Classification device, classification method, and classification program


Publications (1)

Publication Number Publication Date
WO2024089860A1 true WO2024089860A1 (fr) 2024-05-02

Family

ID=90830373

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/040260 WO2024089860A1 (fr) 2022-10-27 2022-10-27 Classification device, classification method, and classification program

Country Status (1)

Country Link
WO (1) WO2024089860A1 (fr)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2015072614A (ja) * 2013-10-03 2015-04-16 International Business Machines Corporation Method for detecting expressions that can become risky expressions depending on a specific theme, and electronic device and program for the electronic device for detecting such expressions
WO2020240834A1 (fr) * 2019-05-31 2020-12-03 Rakuten, Inc. Illicit activity inference system, illicit activity inference method, and program
JP2021193545A (ja) * 2020-06-08 2021-12-23 Asahi Kasei Homes Corporation Information linking server, information linking system, information linking method, and program

