WO2017086992A1 - Malicious web content discovery through graphical model inference - Google Patents

Malicious web content discovery through graphical model inference

Info

Publication number
WO2017086992A1
WO2017086992A1 (PCT/US2015/061899)
Authority
WO
WIPO (PCT)
Prior art keywords
graphical model
malicious
random variable
content
probability
Prior art date
Application number
PCT/US2015/061899
Other languages
French (fr)
Inventor
Manish Marwah
Original Assignee
Hewlett Packard Enterprise Development Lp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hewlett Packard Enterprise Development Lp
Priority to PCT/US2015/061899 priority Critical patent/WO2017086992A1/en
Publication of WO2017086992A1 publication Critical patent/WO2017086992A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55Detecting local intrusion or implementing counter-measures
    • G06F21/56Computer malware detection or handling, e.g. anti-virus arrangements

Definitions

  • Figure 1 shows an example of a system that supports discovery of malicious web content through graphical model inference.
  • Figure 2 shows an example of a graphical model that a graphical model construction engine may construct.
  • Figure 3 shows an example of graphical model seeding based on a blacklist of malicious web content that the graphical model construction engine may perform.
  • Figure 4 shows an example of graphical model seeding based on a content-based classifier that the graphical model construction engine may perform.
  • Figure 5 shows an example of graphical model inference that an inference engine may perform to discover malicious content from the graphical model.
  • Figure 6 shows an example of logic that a system or device may implement to provide malicious content discovery through graphical model inference.
  • Figure 7 shows another example of logic that a system or device may implement to provide malicious content discovery through graphical model inference.
  • Figure 8 shows an example of a device that supports discovery of malicious web content through graphical model inference.
  • a graphical model may be constructed using a web graph, which may capture the hyperlinked structure of interconnected web resources (e.g., interlinked websites, the Internet, or the World Wide Web).
  • the constructed graphical model may be seeded with probability factors determined through a blacklist of malicious websites, generated by a content-based classifier trained through content extraction of the malicious websites specified in the blacklist, or through combinations of both.
  • the discovery features described herein may account for global factors through mapping of the hyperlinked structure of interconnected web resources and location-specific factors through content-based classification, and may methodically do so in combination to infer malicious web content.
  • the malicious web content discovery features described herein may support identification of malicious web content with increased accuracy and efficiency.
  • Figure 1 shows an example of a system 100 that supports discovery of malicious web content through graphical model inference.
  • the system 100 may take the form of a computing system, including a single or multiple computing devices such as application servers, compute nodes, desktop or laptop computers, smart phones or other mobile devices, tablet devices, embedded controllers, and more.
  • the system 100 may discover malicious web content.
  • Malicious web content may refer to any website, web domain, web page, or web host that provides, propagates, or includes malicious content.
  • Malicious content may include any program or file that is intended to damage or disable computer operations, gather sensitive information, or gain unauthorized access to a computer or computer system.
  • a malicious website may refer to a website through which malicious software, viruses, worms, trojan horses, spyware, spam content, phishing mechanisms, or any other malicious content is propagated or linked.
  • a malicious web page may refer to any particular web page through which malicious content is propagated.
  • a malicious web host may refer to any web host that hosts a malicious web domain, malicious website, or malicious web page.
  • the system 100 may discover malicious web content through graphical model inference performed on a graphical model constructed to reflect a hyperlinked structure of a web system and seeded with probability factors determined through a blacklist and content-based classifier.
  • the system 100 shown in Figure 1 includes the content-based classification engine 108, the graphical model construction engine 110, and the inference engine 112, through which the system 100 may discover malicious web content through graphical model inference.
  • the system 100 may implement the engines 108, 110, and 112 (and components thereof) in various ways, for example as hardware and programming.
  • the programming for the engines 108, 110, and 112 may take the form of processor-executable instructions stored on a non-transitory machine-readable storage medium and the hardware for the engines 108, 110, and 112 may include a processing resource to execute those instructions.
  • a processing resource may include a number of processors and may be implemented through a single processor or multi-processor architecture.
  • the system 100 implements multiple engines using the same system features or hardware components (e.g., a common processing resource).
  • the content-based classification engine 108 includes an engine component to generate a probability factor for a particular website, and the content-based classification engine 108 may be trained through content extraction from malicious websites specified in a blacklist.
  • the graphical model construction engine 110 includes engine components to construct a graphical model from a web graph, wherein random variable nodes in the graphical model represent websites and links between the random variable nodes in the graphical model represent hyperlinks between the websites and seed a random variable node representing a particular website not specified in the blacklist with the probability factor generated from the content-based classification engine.
  • the inference engine 112 includes engine components to perform a graphical model inference on the graphical model constructed from the web graph and seeded with the probability factor generated by the content-based classification engine to adjust probability factors for the random variable nodes of the graphical model and generate a list of discovered malicious websites from the adjusted probability factors.
  • Figure 2 shows an example of a graphical model that the graphical model construction engine 110 may construct.
  • a graphical model may also be referred to as a probabilistic graphical model, and may be used to express conditional dependence structure between random variables.
  • the graphical model construction engine 110 may construct a graphical model to model a hyperlinked structure of interconnected web resources, such as the World Wide Web, an enterprise intranet, or various other types of interconnected resources (or portions thereof). To do so, the graphical model construction engine 110 may obtain a web graph 202, which may be any graph or data structure that indicates hyperlinks between web pages, websites, or other web resources of a structure of interconnected web resources, such as the World Wide Web.
  • the graphical model construction engine 110 itself may perform crawling operations to map out links of selected portions of the web (e.g., including particular websites or pages). As another example, the graphical model construction engine 110 may otherwise obtain the web graph 202 from a web crawler or other information source. Then, the graphical model construction engine 110 may construct the graphical model from the web graph by associating random variables with the websites specified in the web graph (or web pages, depending on the web abstraction level for the malicious content discovery).
  • the graphical model construction engine 110 constructs the graphical model 210.
  • a graphical model may include nodes for the random variables mapped in the graphical model, which may also be referred to as random variable nodes.
  • the graphical model construction engine 110 may plot the webpages and websites as the random variable nodes of the graphical model and plot the hyperlinks between the webpages and websites as edges between the random variable nodes.
  • the random variable of each random variable node may indicate the probability that a particular website is malicious (and may thus be referred to as a probability factor). That is, a probability factor of a random variable node may refer to a probability function or value that particular web content represented by the random variable node is malicious.
  • the probability factor for a node of the graphical model 210 may be represented in various ways and include any number of data types, such as a probability function, a probability value (e.g., in the range 0-1), a probability distribution, and the like.
  • the random variable nodes of the graphical model 210 each represent a particular website.
  • the graphical model 210 includes the random variable nodes labeled as 211 and 212, which correspond to and represent the website 221 and the website 222 respectively (and the websites 221 and 222 may include multiple web pages). Edges between random variable nodes in the graphical model 210 may represent hyperlinks between the websites represented by the random variable nodes. As such, the random variable nodes 211 and 212 are joined by an edge in the graphical model 210, indicating that the websites 221 and 222 are hyperlinked to one another, e.g., at least one webpage of the website 221 links to at least one webpage of the website 222 or vice versa.
  • the graphical model construction engine 110 uses an undirected graphical model, such as a Markov random field, in modeling the random variable nodes and edges in the graphical model 210.
  • the graphical model construction engine 110 may account for an impact that other websites linked to a particular website may have in terms of hosting or propagating malicious content.
  • the graphical model 210 may exploit the concept of homophily that a web entity is likely (in probabilistic terms) to be associated with similar entities.
  • the graphical model 210 may be used to imply, probabilistically express, or infer that malicious websites are likely to have hyperlinks to other malicious websites and non-malicious websites are likely to have hyperlinks to other non-malicious websites.
  • a system 100 may exploit global information to determine (e.g., infer) a maliciousness probability for websites represented in the graphical model 210.
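The construction step described above can be sketched in plain Python. This is an illustrative reduction, not the patent's implementation: the function name, the site names, and the dict-of-sets adjacency representation are all assumptions made for the example.

```python
# Sketch: collapse a directed web graph (hyperlink pairs) into the
# undirected structure of the graphical model, where each node is a
# website and an edge means a hyperlink exists in either direction.

def build_graphical_model(hyperlinks):
    """Return an undirected adjacency map from directed hyperlink pairs."""
    adjacency = {}
    for src, dst in hyperlinks:
        if src == dst:
            continue  # self-links add no dependence structure
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set()).add(src)
    return adjacency

# Example web graph: site221 links to site222 twice, site222 links back
# once and also links to site223; duplicate hyperlinks yield one edge.
web_graph = [("site221", "site222"), ("site221", "site222"),
             ("site222", "site221"), ("site222", "site223")]
model = build_graphical_model(web_graph)
```

Treating the edges as undirected matches the Markov random field modeling the description mentions: maliciousness dependence between two linked sites is symmetric, regardless of which site contains the hyperlink.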
  • the graphical model construction engine 110 may seed the random variable nodes of the graphical model 210 with probability factors, including according to a blacklist and according to a content-based classifier. Examples by which the graphical model 210 is seeded with probability factors are described in greater detail next through Figures 3 and 4.
  • Figure 3 shows an example of graphical model seeding based on a blacklist of malicious web content that the graphical model construction engine 110 may perform. To do so, the graphical model construction engine 110 may access a blacklist 310, which may indicate specific web content (e.g., websites, web pages, domains, web hosts, etc.) determined or known to include malicious content.
  • the blacklist 310 may, for example, specify the Uniform Resource Locator (URL) or otherwise identify web content that has been determined to include malware, is a phishing website, or according to any other maliciousness categorization.
  • the graphical model construction engine 110 may receive the blacklist 310 from any listing source, such as various security organizations that distribute such blacklists.
  • the blacklist 310 specifies two malicious websites shown as the malicious website 311 and the malicious website 312.
  • the graphical model construction engine 110 may seed particular random variable nodes in the graphical model 210 according to the specification of the malicious websites 311 and 312 in the blacklist 310. That is, the graphical model construction engine 110 may seed particular random variable nodes in the graphical model that represent the malicious websites 311 and 312 specified in the blacklist 310, which in Figure 3 are the random variable nodes 331 and 332 respectively.
  • the graphical model construction engine 110 may seed a random variable node in the graphical model 210 with a random variable referred to as a probability factor.
  • the probability factor may specify a probability that the website represented by the random variable node is malicious (also referred to as a maliciousness probability).
  • the graphical model construction engine 110 may seed a probability factor as a vector of two values, a first probability that the website represented by a random variable node is malicious and a second probability that the website represented by the random variable node is not malicious.
  • the sum of the first and second values may be 1.
  • the probability factor may specify a single maliciousness probability value, a maliciousness probability distribution, or any other probabilistic expression.
  • the probability factor is a probability distribution over several values, categories, or classifications of maliciousness, such as a distribution of probabilities that a website is non-malicious, includes malware, is a phishing site, etc.
  • the graphical model construction engine 110 may seed the corresponding random variable nodes 331 and 332 with a probability factor indicative of a high probability of maliciousness.
  • the graphical model construction engine 110 may seed random variable nodes representing malicious websites identified in a blacklist within a predetermined high-maliciousness probability range indicative of a high probability of including malicious content, for example a high-maliciousness probability range of .95-.99.
  • the graphical model construction engine 110 may determine the particular value of the probability factor according to a confidence level of the source of the blacklist (which may be based on a reputation of the blacklist source, other information about the malicious website, or any number of other factors). For each malicious website specified in a blacklist, the graphical model construction engine 110 may seed the random variable node that represents the malicious website accordingly.
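A minimal sketch of this seeding step follows. The .95-.99 and .01-.05 ranges come from the description; the linear scaling by source confidence, the function name, and the two-value [P(benign), P(malicious)] layout are assumptions made for illustration.

```python
# Sketch: assign a prior probability factor to one random variable node,
# based on blacklist/whitelist membership or, failing that, a
# content-based classifier's output. Confidence scaling is an assumption.

def seed_prior(site, blacklist, whitelist, source_confidence=1.0,
               classifier_prob=None):
    """Return a two-value probability factor [P(benign), P(malicious)]."""
    if site in blacklist:
        # Interpolate within the high-maliciousness range (.95-.99)
        # according to the confidence level of the blacklist source.
        p_mal = 0.95 + 0.04 * source_confidence
    elif site in whitelist:
        # Interpolate within the low-maliciousness range (.01-.05).
        p_mal = 0.05 - 0.04 * source_confidence
    elif classifier_prob is not None:
        p_mal = classifier_prob  # fall back to the content-based classifier
    else:
        p_mal = 0.5  # no information: uninformative prior
    return [1.0 - p_mal, p_mal]
```

The two values sum to 1, matching the vector-of-two-values representation described above; a multi-category distribution (malware, phishing, etc.) would extend the returned list rather than change the structure.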
  • the graphical model construction engine 110 seeds a random variable node according to multiple blacklists.
  • the graphical model construction engine 110 may access multiple blacklists, for example through retrieval from multiple, different sources (e.g., different security organizations).
  • the graphical model construction engine 110 may weight the impact upon the probability factor value determination according to a confidence level for a particular source.
  • the graphical model construction engine 110 may account for, as example factors, the number of blacklists a particular malicious website appears in, the confidence level of sources of the blacklists in which the malicious website appears, and more.
  • the graphical model construction engine 110 may seed random variable nodes according to a whitelist.
  • a whitelist may refer to any listing or identification of non-malicious web content, which may be verified or authenticated by a security organization or other entity as to not include or propagate malicious content.
  • the graphical model construction engine 110 may seed random variable nodes representing web content identified in the whitelist with a probability factor indicative of a low probability of maliciousness. For instance, the graphical model construction engine 110 may assign a probability factor to such random variable nodes within a low-maliciousness probability range (for example, between .01 and .05).
  • the graphical model construction engine 110 may determine a probability factor value within the low-maliciousness probability range accounting for any number of factors based on the whitelist(s), source(s) of the whitelist, or various other factors.
  • the graphical model construction engine 110 may seed random variable nodes in the graphical model that represent malicious websites identified in a blacklist, non-malicious websites specified in a whitelist, or combinations of both.
  • the graphical model construction engine 110 may seed the graphical model 210 with probability factors for web content previously known or identified as being malicious or non-malicious, e.g., with a known or determined maliciousness characterization.
  • some of the random variable nodes in a graphical model may represent websites not previously identified as malicious or non-malicious (e.g., not specified in a blacklist or a whitelist accessed by the graphical model construction engine 110).
  • the graphical model construction engine 110 may seed such random variable nodes with a probability factor generated by a content-based classifier, as described next.
  • Figure 4 shows an example of graphical model seeding based on a content-based classifier that the graphical model construction engine 110 may perform.
  • the content-based classification engine 108 implements a content-based classifier which generates a probability that a particular input website is malicious (e.g., a probability factor).
  • the content-based classification engine 108 may train a content-based classifier through extracting content characteristics of malicious websites, such as malicious websites known, determined, or identified through a blacklist.
  • the content-based classification engine 108 accesses the blacklist 310 and extracts content of the malicious websites 311 and 312 specified in the blacklist 310. By extracting the local content features of identified or known malicious websites, the content-based classification engine 108 may track specific attributes, characteristics, and content of malicious websites to predict the maliciousness of other websites not specified in the blacklist 310.
  • the content-based classification engine 108 may extract various types of content from the malicious websites 311 and 312.
  • the content-based classification engine 108 may extract lexical features of the malicious websites 311 and 312, such as specific web page content, URL characteristics, images or visual characteristics, etc.
  • the content-based classification engine 108 may do so based on a bag of words model, for example.
  • the content-based classification engine 108 may extract host features of the malicious websites 311 and 312, which may include host information obtained through Domain Name Service (DNS) requests such as a host name, domain registration time, owner information, and the like.
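The lexical side of this extraction can be sketched as a simple bag-of-words over URL components and page text. The tokenization rules and example URL here are illustrative assumptions, not the patent's exact feature set.

```python
import re

# Sketch: extract lexical features (bag of words) from a website's URL
# and page content, as input to a content-based classifier.

def lexical_features(url, page_text=""):
    """Return a token-count dict over URL components and page content."""
    tokens = re.split(r"[^a-z0-9]+", url.lower())
    tokens += re.split(r"[^a-z0-9]+", page_text.lower())
    features = {}
    for tok in tokens:
        if tok:
            features[tok] = features.get(tok, 0) + 1
    return features

feats = lexical_features("http://free-prizes.example/win.html",
                         "Click here to claim your prize")
```

Host features (registration time, owner information, and so on) would be gathered separately via DNS and WHOIS lookups and appended to the same feature dict.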
  • the malicious websites 311 and 312 specified in the blacklist 310 and the extracted content features may provide a training set by which the content-based classification engine 108 trains a classifier.
  • the content-based classification engine 108 accesses a whitelist of web content known to not contain any malicious content or verified as authentic and non-malicious.
  • the content-based classification engine 108 may extract content from whitelisted websites identified as non-malicious to extract lexical and/or host features of non-malicious websites, for example to include in the training set for the content-based classifier.
  • the content-based classification engine 108 may train a content-based classifier to generate a probability that an input website is malicious.
  • the content-based classification engine 108 may employ any number of machine learning models, including classifiers trained using naïve Bayes methods, support vector machine techniques, logistic regression, neural networks, and more.
  • the content-based classification engine 108 obtains labels for training from the blacklist 310 and a whitelist, and subsampling is used to ensure that no imbalance occurs in the training set (e.g., the number of samples per label are within a numerical threshold from one another).
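One of the model families named above, naïve Bayes over bag-of-words features, can be sketched in a few dozen lines. The tiny training set, helper names, and Laplace smoothing are all illustrative assumptions; an actual engine would train on full blacklist/whitelist extractions.

```python
import math
import re

# Sketch: naive Bayes content-based classifier. Label 1 = malicious.

def tokenize(text):
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def train_nb(samples):
    """samples: list of (text, label) pairs; returns the fitted counts."""
    counts = {0: {}, 1: {}}   # per-label token counts
    totals = {0: 0, 1: 0}     # per-label total tokens
    docs = {0: 0, 1: 0}       # per-label document counts
    vocab = set()
    for text, label in samples:
        docs[label] += 1
        for tok in tokenize(text):
            counts[label][tok] = counts[label].get(tok, 0) + 1
            totals[label] += 1
            vocab.add(tok)
    return counts, totals, docs, vocab

def predict_malicious(model, text):
    """Return P(malicious | text) with Laplace (add-one) smoothing."""
    counts, totals, docs, vocab = model
    n_docs = docs[0] + docs[1]
    log_post = {}
    for label in (0, 1):
        lp = math.log(docs[label] / n_docs)
        for tok in tokenize(text):
            lp += math.log((counts[label].get(tok, 0) + 1) /
                           (totals[label] + len(vocab)))
        log_post[label] = lp
    m = max(log_post.values())
    z = sum(math.exp(v - m) for v in log_post.values())
    return math.exp(log_post[1] - m) / z

# Illustrative training set: texts extracted from blacklisted (1) and
# whitelisted (0) websites.
samples = [
    ("free prize click here malware download", 1),
    ("phishing login verify your free prize", 1),
    ("local weather forecast and news", 0),
    ("sports scores and league news", 0),
]
model = train_nb(samples)
```

The returned probability is exactly the per-site probability factor used to seed unlabeled random variable nodes in the graphical model.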
  • the content-based classification engine 108 may generate probability factors for random variable nodes in the graphical model 210.
  • the content-based classifier may examine content attributes of the input website and predict the maliciousness of the input website in the form of a maliciousness probability.
  • the content-based classification engine 108 may thus generate a maliciousness probability for a website corresponding to a random variable node in the graphical model, which the graphical model construction engine 110 may use as a probability factor to seed the random variable node in the graphical model 210.
  • the content-based classification engine 108 may generate probability factors for websites not specified in the blacklist 310, not specified in a whitelist, or both. That is, the content-based classification engine 108 may generate probability factors for websites with an unknown or undetermined malicious characterization, and the graphical model construction engine 110 may seed the probability factors generated by the content-based classification engine 108 for these websites in the graphical model 210.
  • the graphical model 210 includes random variable nodes (representing malicious websites specified in the blacklist 310) seeded according to the blacklist 310, random variable nodes (representing non-malicious websites specified in a whitelist) seeded according to the whitelist, as well as random variable nodes (representing websites not specified in the blacklist 310 or the whitelist) seeded according to the content-based classification engine 108.
  • the probability factors seeded in the graphical model 210 may also be referred to as priors, e.g., probability factors determined prior to any graphical model inference or prior to accounting for the dependence structure of the graphical model 210.
  • the graphical model construction engine 110 may incorporate content-based classification of websites into the graphical model 210. Doing so may allow a system to discover malicious content that incorporates both global and local web content considerations, which may increase the accuracy of malicious web content identification. Moreover, the local content extraction and classification may be specifically combined with the global hyperlink modeling through the graphical model, providing an efficient and accurate mechanism to infer malicious web content.
  • Figure 5 shows an example of graphical model inference that the inference engine 112 may perform to discover malicious web content from a graphical model.
  • the inference engine 112 may perform a graphical model inference on the graphical model 210 to discover malicious web content.
  • the inference engine 112 may determine a marginal probability distribution of random variable nodes in the graphical model 210 and may apply any inference method to do so.
  • Example inference methods the inference engine 112 may apply include belief propagation methods, exact inference, MCMC, Gibbs sampling, junction tree methods, variational methods, and the like.
  • the inference engine 112 may adjust the probability factors for the random variable nodes, for example through the determination or adjusting of marginal probabilities, probability distribution adjustments, or via maximum a posteriori (MAP) probabilities.
  • the inference engine 112 may, from the adjusted probability factors, discover malicious content. For instance, the inference engine 112 may identify any website represented by a random variable node with an adjusted probability factor (e.g., marginal probability, probability distribution, or MAP probability) that meets a malicious criterion.
  • Example malicious criteria include exceeding a particular threshold probability value (e.g., greater than a 50% probability or a probability factor value greater than 0.5), falling within a particular maliciousness probability range, a probability distribution with a threshold lower or upper range, or any other configurable criterion to categorize a website as malicious based on an adjusted probability factor.
  • the inference engine 112 generates a ranked list of discovered malicious websites 510, for example ranked according to the adjusted probability factors.
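The inference step can be sketched with sum-product belief propagation, one of the methods listed above. The homophily edge potential (0.9 agreement vs. 0.1 disagreement), the seed priors, and the three-node example are assumed values for illustration, not the patent's parameters.

```python
# Sketch: loopy belief propagation on a pairwise model whose nodes take
# state 0 (benign) or 1 (malicious). Linked sites are assumed likely to
# share the same state (homophily).

HOMOPHILY = [[0.9, 0.1], [0.1, 0.9]]  # psi[x_i][x_j]: agreement favored

def infer(adjacency, priors, iterations=10):
    """Return adjusted marginals {node: [P(benign), P(malicious)]}."""
    # messages[(i, j)][x_j]: message from node i to its neighbor j
    messages = {(i, j): [1.0, 1.0]
                for i in adjacency for j in adjacency[i]}
    for _ in range(iterations):
        new = {}
        for (i, j) in messages:
            msg = []
            for xj in (0, 1):
                total = 0.0
                for xi in (0, 1):
                    prod = priors[i][xi] * HOMOPHILY[xi][xj]
                    for k in adjacency[i]:
                        if k != j:
                            prod *= messages[(k, i)][xi]
                    total += prod
                msg.append(total)
            z = msg[0] + msg[1]
            new[(i, j)] = [msg[0] / z, msg[1] / z]
        messages = new
    beliefs = {}
    for i in adjacency:
        b = [priors[i][0], priors[i][1]]
        for k in adjacency[i]:
            b[0] *= messages[(k, i)][0]
            b[1] *= messages[(k, i)][1]
        z = b[0] + b[1]
        beliefs[i] = [b[0] / z, b[1] / z]
    return beliefs

# Two blacklisted sites (A, B) both link to an unknown site C.
adj = {"A": {"C"}, "B": {"C"}, "C": {"A", "B"}}
priors = {"A": [0.03, 0.97], "B": [0.03, 0.97], "C": [0.5, 0.5]}
beliefs = infer(adj, priors)
ranked = sorted(beliefs, key=lambda n: beliefs[n][1], reverse=True)
```

In this example C's malicious marginal rises well above its uninformative 0.5 prior purely through its hyperlinks to blacklisted neighbors, which is the global, structure-driven discovery the description emphasizes; `ranked` is the ranked list of candidates by adjusted probability factor.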
  • the graphical model inference may support discovery of malicious web content (e.g., web content that was not previously known, identified, or categorized as malicious).
  • the inference engine 112 may filter websites already known as malicious from any listing of discovered malicious websites. For instance, the inference engine 112 may filter, from the ranked list of discovered malicious websites 510, the malicious websites 311 and 312 specified in the blacklist 310 or any other known malicious websites.
  • a system 100 may support discovery of malicious web content through a content-based classification engine 108, graphical model construction engine 110, and inference engine 112.
  • the system 100 may map the hyperlinked structure of the web, by which the system 100 may account for how the linked nature of the web and dependencies, homophily, and malicious websites linking to other malicious websites impact the probability that a particular website is malicious.
  • the system 100 may also consider content-based features of the particular website, specifically through seeding the random variable node representing the particular website through a content-based classifier. As such, through the particular combination of a graphical model seeded with a content-based classifier, the system 100 may support determination of malicious web content with increased accuracy and efficiency.
  • Figure 6 shows an example of logic 600 that a system or device may implement to provide malicious content discovery through graphical model inference.
  • a system may implement the logic 600 as hardware, executable instructions stored on a machine-readable medium, or as combinations of both.
  • the system implements the logic 600 through the content-based classification engine 108, the graphical model construction engine 110, and the inference engine 112, by which the system may perform or execute the logic 600 as a method to discover malicious web content through graphical model inference.
  • the system may access a blacklist and a whitelist (602), for example by accessing or retrieving the blacklist or whitelist from a security organization or other source.
  • the system may extract content features from the malicious websites specified in the blacklist and the non-malicious websites specified in the whitelist (604), through which the system may train a content-based classifier to predict the maliciousness of an input website through a probability factor (606). That is, the system may train the content-based classifier according to the content extracted from the malicious websites specified in the blacklist, according to content extracted from whitelisted or non-malicious websites, or a combination of both.
  • the content-based classifier may generate a probability that a particular input website is malicious, which may be provided as a probability factor for seeding a graphical model (as discussed below regarding 614).
  • the system may obtain a web graph (608) and construct a graphical model from the web graph (610).
  • the graphical model may reflect the hyperlinked structure of the web, and also convey a dependence structure for random variable nodes of the graphical model.
  • the system may seed the random variable nodes in the graphical model with probability factors (which, prior to inference, may also be referred to as priors).
  • the system may seed random variable nodes of the graphical model representing any of the malicious websites specified in the blacklist (612).
  • the system may also seed the random variable nodes of the graphical model representing websites not specified in the blacklist, and do so specifically using the probability factors generated by the content-based classifier (614).
  • the system may perform graphical model inference on the graphical model to adjust the probability factors (616) and generate a ranked list of discovered malicious websites (618) from the graphical model inference.
  • Figure 7 shows another example of logic 700 that a system or device may implement to provide malicious content discovery through graphical model inference.
  • a system may implement the logic 700 as hardware, executable instructions stored on a machine-readable medium, or as combinations of both.
  • the system implements the logic 700 through the content-based classification engine 108, the graphical model construction engine 110, and the inference engine 112, by which the system may perform or execute the logic 700 as a method to discover malicious web content through graphical model inference.
  • the system may construct a graphical model from a web graph, wherein random variable nodes in the graphical model represent websites and links between the random variable nodes in the graphical model represent hyperlinks between the websites (702).
  • the system may seed a first random variable node in the graphical model that represents a malicious website specified in a blacklist with a probability factor based on the blacklist (704), e.g., based on the fact that the malicious website is specified in the blacklist.
  • the system may also seed a second random variable node in the graphical model that represents a different website not specified in the blacklist with a probability factor generated by a content-based classifier trained through content extraction of malicious websites specified in a blacklist (706).
  • the system may perform a graphical model inference on the graphical model constructed from the web graph to adjust probability factors for the random variable nodes of the graphical model and identify the different website represented by the second random variable node as a discovered malicious website when the adjusted probability factor of the second random variable node exceeds a maliciousness probability threshold (708).
  • the maliciousness probability threshold may be configurable or user-specified, for example.
  • FIG 8 shows an example of a device 800 that supports discovery of malicious web content through graphical model inference.
  • the device 800 may include a processing resource 810, which may take the form of a single or multiple processors.
  • the processors may include a central processing unit (CPU), microprocessor, or any hardware device suitable for executing instructions stored on a machine-readable medium, such as the machine-readable medium 820 shown in Figure 8.
  • the machine-readable medium 820 may be any non-transitory electronic, magnetic, optical, or other physical storage device that stores executable instructions, such as the instructions 822, 824, 826, 828, 830, and 832 shown in Figure 8.
  • the machine-readable medium 820 may be, for example, Random Access Memory (RAM) such as dynamic RAM (DRAM), flash memory, memristor memory, spin-transfer torque memory, an Electrically-Erasable Programmable Read-Only Memory (EEPROM), a storage drive, an optical disk, and the like.
  • the device 800 may execute instructions stored on the machine-readable medium 820 through the processing resource 810. Executing the instructions may cause the device 800 to perform any of the malicious web content discovery features described herein, including according to any features of the content-based classification engine 108, graphical model construction engine 110, inference engine 112, logic 600 and 700, or any combination thereof.
  • execution of the instructions 822, 824, 826, 828, 830, and 832 by the processing resource 810 may cause the device 800 to train a content-based classifier to generate a probability factor that a web page is malicious through content extraction from malicious web pages specified in a blacklist, the content extraction including hyperlinks of the malicious web pages, domain information of the malicious web pages, page content of the malicious web pages, or any combination thereof; construct a graphical model from a web graph, wherein random variable nodes in the graphical model represent web pages and links between the random variable nodes in the graphical model represent hyperlinks between the web pages; seed random variable nodes in the web graph that represent web pages not specified in the blacklist through probability factors generated by the content-based classifier; seed random variable nodes in the web graph that represent the malicious web pages specified in the blacklist with probability factors determined without use of the content-based classifier; and perform a graphical model inference on the graphical model constructed from the web graph and seeded with the probability factors to adjust the probability factors for the random variable nodes of the graphical model and generate a list of discovered malicious web pages from the adjusted probability factors.
  • the machine-readable medium 820 may further include instructions executable by the processing resource 810 to access multiple blacklists from different sources and seed the random variable nodes in the web graph that represent malicious web pages specified in the blacklist with probability factors determined based on the multiple blacklists.
  • the machine-readable medium 820 may further include instructions executable by the processing resource 810 to generate a ranked list of discovered malicious web pages ranked according to the adjusted probability factors, and further to filter the malicious web pages specified in the blacklist from the ranked list of discovered malicious web pages.
  • the systems, methods, devices, and logic described above, including the content-based classification engine 108, graphical model construction engine 110, and inference engine 112 may be implemented in many different ways in many different combinations of hardware, logic, circuitry, and executable instructions stored on a machine-readable medium.
  • the content-based classification engine 108, graphical model construction engine 110, and inference engine 112, or combinations thereof may include circuitry in a controller, a microprocessor, or an application specific integrated circuit (ASIC), or may be implemented with discrete logic or components, or a combination of other types of analog or digital circuitry, combined on a single integrated circuit or distributed among multiple integrated circuits.
  • a product such as a computer program product, may include a storage medium and machine readable instructions stored on the medium, which when executed in an endpoint, computer system, or other device, cause the device to perform operations according to any of the description above, including according to any features of the content-based classification engine 108, graphical model construction engine 110, and inference engine 112.
  • the processing capability of the systems, devices, and engines described herein, including the content-based classification engine 108, graphical model construction engine 110, and inference engine 112, may be distributed among multiple system components, such as among multiple processors and memories, optionally including multiple distributed processing systems.
  • Parameters, databases, and other data structures may be separately stored and managed, may be incorporated into a single memory or database, may be logically and physically organized in many different ways, and may be implemented in many ways, including data structures such as linked lists, hash tables, or implicit storage mechanisms.
  • Programs may be parts (e.g., subroutines) of a single program, separate programs, distributed across several memories and processors, or implemented in many different ways, such as in a library (e.g., a shared library).


Abstract

In some examples, a method includes constructing a graphical model from a web graph. Random variable nodes in the graphical model may represent websites, and links between the random variable nodes in the graphical model may represent hyperlinks between the websites. The method may also include seeding a first random variable node in the graphical model that represents a malicious website specified in a blacklist with a probability factor, and seeding a second random variable node in the graphical model that represents a different website not specified in the blacklist with a probability factor generated by a content-based classifier trained through content extraction of malicious websites specified in the blacklist. The method may further include performing graphical model inference on the constructed graphical model and identifying the different website represented by the second random variable node as a discovered malicious website.

Description

MALICIOUS WEB CONTENT DISCOVERY
THROUGH GRAPHICAL MODEL INFERENCE
BACKGROUND
[0001] With rapid advances in technology, electronic devices have become increasingly prevalent in society today. Laptop computers, desktop computers, mobile phones, and tablet devices are but a few examples of electronic devices allowing a user to access digital data, communicate across vast interconnected networks (such as the Internet), and execute web-based applications. Increasing the security of computing devices will further improve user experience.
BRIEF DESCRIPTION OF THE DRAWINGS
[0002] Certain examples are described in the following detailed description and in reference to the drawings.
[0003] Figure 1 shows an example of a system that supports discovery of malicious web content through graphical model inference.
[0004] Figure 2 shows an example of a graphical model that a graphical model construction engine may construct.
[0005] Figure 3 shows an example of graphical model seeding based on a blacklist of malicious web content that the graphical model construction engine may perform.
[0006] Figure 4 shows an example of graphical model seeding based on a content-based classifier that the graphical model construction engine may perform.
[0007] Figure 5 shows an example of graphical model inference that an inference engine may perform to discover malicious content from the graphical model.
[0008] Figure 6 shows an example of logic that a system or device may implement to provide malicious content discovery through graphical model inference.
[0009] Figure 7 shows another example of logic that a system or device may implement to provide malicious content discovery through graphical model inference.
[0010] Figure 8 shows an example of a device that supports discovery of malicious web content through graphical model inference.
DETAILED DESCRIPTION
[0011] The disclosure herein may provide systems, methods, devices, and logic that support discovery of malicious web content, such as malicious websites, malicious web domains, malicious web pages, malicious web hosts, or combinations thereof. As described in greater detail below, a graphical model may be constructed using a web graph, which may capture the hyperlinked structure of interconnected web resources (e.g., interlinked websites, the Internet, or the World Wide Web). The constructed graphical model may be seeded with probability factors determined through a blacklist of malicious websites, generated by a content-based classifier trained through content extraction of the malicious websites specified in the blacklist, or through combinations of both. Thus, the discovery features described herein may account for global factors through mapping of the hyperlinked structure of interconnected web resources and location-specific factors through content-based classification, and may methodically do so in combination to infer malicious web content. As such, the malicious web content discovery features described herein may support identification of malicious web content with increased accuracy and efficiency.
[0012] Figure 1 shows an example of a system 100 that supports discovery of malicious web content through graphical model inference. The system 100 may take the form of a computing system, including a single or multiple computing devices such as application servers, compute nodes, desktop or laptop computers, smart phones or other mobile devices, tablet devices, embedded controllers, and more.
[0013] The system 100 may discover malicious web content. Malicious web content may refer to any website, web domain, web page, or web host that provides, propagates, or includes malicious content. Malicious content may include any program or file that is intended to damage or disable computer operations, gather sensitive information, or gain unauthorized access to a computer or computer system. Thus, a malicious website may refer to a website through which malicious software, viruses, worms, trojan horses, spyware, spam content, phishing mechanisms, or any other malicious content is propagated or linked. Likewise, a malicious web page may refer to any particular web page through which malicious content is propagated. A malicious web host may refer to any web host that hosts a malicious web domain, malicious website, or malicious web page.
[0014] As described in greater detail below, the system 100 may discover malicious web content through graphical model inference performed on a graphical model constructed to reflect a hyperlinked structure of a web system and seeded with probability factors determined through a blacklist and a content-based classifier. As one example, the system 100 shown in Figure 1 includes the content-based classification engine 108, the graphical model construction engine 110, and the inference engine 112, through which the system 100 may discover malicious web content through graphical model inference.
[0015] The system 100 may implement the engines 108, 110, and 112 (and components thereof) in various ways, for example as hardware and programming. The programming for the engines 108, 110, and 112 may take the form of processor-executable instructions stored on a non-transitory machine-readable storage medium and the hardware for the engines 108, 110, and 112 may include a processing resource to execute those instructions. A processing resource may include a number of processors and may be implemented through a single processor or multi-processor architecture. In some examples, the system 100 implements multiple engines using the same system features or hardware components (e.g., a common processing resource).
[0016] In the example shown in Figure 1 , the content-based classification engine 108 includes an engine component to generate a probability factor for a particular website, and the content-based classification engine 108 may be trained through content extraction from malicious websites specified in a blacklist. The graphical model construction engine 110 includes engine components to construct a graphical model from a web graph, wherein random variable nodes in the graphical model represent websites and links between the random variable nodes in the graphical model represent hyperlinks between the websites and seed a random variable node representing a particular website not specified in the blacklist with the probability factor generated from the content-based classification engine. The inference engine 112 includes engine components to perform a graphical model inference on the graphical model constructed from the web graph and seeded with the probability factor generated by the content-based classification engine to adjust probability factors for the random variable nodes of the graphical model and generate a list of discovered malicious websites from the adjusted probability factors.
[0017] Some example features relating to malicious content discovery through graphical model inference are described in greater detail next. Many of the following examples are described specifically with regards to discovery of malicious websites. However, any of malicious web content discovery features described herein may be consistently applied to any other abstraction levels of malicious web content, such as for malicious web pages or malicious web hosts as examples.
[0018] Figure 2 shows an example of a graphical model that the graphical model construction engine 110 may construct. A graphical model may also be referred to as a probabilistic graphical model, and may be used to express conditional dependence structure between random variables. The graphical model construction engine 110 may construct a graphical model to model a hyperlinked structure of interconnected web resources, such as the World Wide Web, an enterprise intranet, or various other types of interconnected resources (or portions thereof). To do so, the graphical model construction engine 110 may obtain a web graph 202, which may be any graph or data structure that indicates hyperlinks between web pages, websites, or other web resources of a structure of interconnected web resources, such as the World Wide Web.
[0019] To obtain the web graph 202, the graphical model construction engine 110 itself may perform crawling operations to map out links of selected portions of the web (e.g., including particular websites or pages). As another example, the graphical model construction engine 110 may otherwise obtain the web graph 202 from a web crawler or other information source. Then, the graphical model construction engine 110 may construct the graphical model from the web graph by associating random variables with the websites specified in the web graph (or web pages, depending on the web abstraction level for the malicious content discovery).
[0020] In the example shown in Figure 2, the graphical model construction engine 110 constructs the graphical model 210. A graphical model may include nodes for the random variables mapped in the graphical model, which may also be referred to as random variable nodes. In particular, the graphical model construction engine 110 may plot the webpages and websites as the random variable nodes of the graphical model and plot the hyperlinks between the webpages and websites as edges between the random variable nodes.
[0021] For the graphical model 210 constructed by the graphical model construction engine 110, the random variable of each random variable node may indicate the probability that a particular website is malicious (and may thus be referred to as a probability factor). That is, a probability factor of a random variable node may refer to a probability function or value that particular web content represented by the random variable node is malicious. The probability factor for a node of the graphical model 210 may be represented in various ways and include any number of data types, such as a probability function, a probability value (e.g., within a range of 0-1), a probability distribution, and the like.
[0022] In some examples, the random variable nodes of the graphical model 210 each represent a particular website. To illustrate through the example shown in Figure 2, the graphical model 210 includes the random variable nodes labeled as 211 and 212, which correspond to and represent the website 221 and the website 222 respectively (and the websites 221 and 222 may include multiple web pages). Edges between random variable nodes in the graphical model 210 may represent hyperlinks between the websites represented by the random variable nodes. As such, the random variable nodes 211 and 212 are joined by an edge in the graphical model 210, indicating that the websites 221 and 222 are hyperlinked to one another, e.g., at least one webpage of the website 221 links to at least one webpage of the website 222 or vice versa. In some examples, the graphical model construction engine 110 uses an undirected model, such as a Markov random field, in modeling the random variable nodes and edges in the graphical model 210.
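By way of illustration only, the node-and-edge construction described above may be sketched as follows. This is a minimal, non-limiting example: the function name `build_graphical_model`, the adjacency-dictionary input format, and the placeholder probability factors are illustrative assumptions rather than the engine's actual implementation.

```python
def build_graphical_model(web_graph):
    """Turn a web graph (site -> set of hyperlinked sites) into an
    undirected pairwise model: one random variable node per website,
    one edge per hyperlink. Each node carries a probability factor
    (seeded later) expressing [P(benign), P(malicious)]."""
    nodes = {site: {"factor": None} for site in web_graph}
    edges = set()
    for site, links in web_graph.items():
        for target in links:
            if target in nodes and target != site:
                # undirected edge: order the pair so each link is stored once
                edges.add(tuple(sorted((site, target))))
    return nodes, edges
```

Because the pair is sorted before insertion into a set, a hyperlink in either direction yields the same single undirected edge, consistent with a Markov random field formulation.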
[0023] By modeling the hyperlinked structure of the web through a graphical model, the graphical model construction engine 110 may account for an impact that other websites linked to a particular website may have in terms of hosting or propagating malicious content. The graphical model 210 may exploit the concept of homophily that a web entity is likely (in probabilistic terms) to be associated with similar entities. In the context of a hyperlinked structure of web resources, the graphical model 210 may be used to imply, probabilistically express, or infer that malicious websites are likely to have hyperlinks to other malicious websites and non-malicious websites are likely to have hyperlinks to other non-malicious websites. Thus, through the graphical model 210, a system 100 may exploit global information to determine (e.g., infer) a maliciousness probability for websites represented in the graphical model 210.
[0024] In discovering malicious web content, the graphical model construction engine 110 may seed the random variable nodes of the graphical model 210 with probability factors, including according to a blacklist and according to a content-based classifier. Examples by which the graphical model 210 is seeded with probability factors are described in greater detail next through Figures 3 and 4.
[0025] Figure 3 shows an example of graphical model seeding based on a blacklist of malicious web content that the graphical model construction engine 110 may perform. To do so, the graphical model construction engine 110 may access a blacklist 310, which may indicate specific web content (e.g., websites, web pages, domains, web hosts, etc.) determined or known to include malicious content. The blacklist 310 may, for example, specify the Uniform Resource Locator (URL) or otherwise identify web content that has been determined to include malware, to be a phishing website, or to be malicious according to any other maliciousness categorization. The graphical model construction engine 110 may receive the blacklist 310 from any listing source, such as various security organizations that distribute such blacklists.
[0026] In Figure 3, the blacklist 310 specifies two malicious websites shown as the malicious website 311 and the malicious website 312. The graphical model construction engine 110 may seed particular random variable nodes in the graphical model 210 according to the specification of the malicious websites 311 and 312 in the blacklist 310. That is, the graphical model construction engine 110 may seed particular random variable nodes in the graphical model that represent the malicious websites 311 and 312 specified in the blacklist 310, which in Figure 3 are the random variable nodes 331 and 332 respectively.
[0027] The graphical model construction engine 110 may seed a random variable node in the graphical model 210 with a random variable referred to as a probability factor. As noted above, the probability factor may specify a probability that the website represented by the random variable node is malicious (also referred to as a maliciousness probability). As an example, the graphical model construction engine 110 may seed a probability factor as a vector of two values, a first probability that the website represented by a random variable node is malicious and a second probability that the website represented by the random variable node is not malicious. In this example, the sum of the first and second values may be a value of 1. As other examples, the probability factor may specify a single maliciousness probability value, a maliciousness probability distribution, or any other probabilistic expression. In some examples, the probability factor is a probability distribution over several values, categories, or classifications of maliciousness, such as a distribution of probabilities that a website is non- malicious, includes malware, is a phishing site, etc.
[0028] For the malicious websites 311 and 312 specified in the blacklist 310, the graphical model construction engine 110 may seed the corresponding random variable nodes 331 and 332 with a probability factor indicative of a high probability of maliciousness. The graphical model construction engine 110 may seed random variable nodes representing malicious websites identified in a blacklist with probability factors within a predetermined high-maliciousness probability range indicative of a high probability of including malicious content, for example a high-maliciousness probability range of .95-.99. Within the high-maliciousness probability range, the graphical model construction engine 110 may determine the particular value of the probability factor according to a confidence level of the source of the blacklist (which may be based on a reputation of the blacklist source, other information about the malicious website, or any number of other factors). For each malicious website specified in a blacklist, the graphical model construction engine 110 may seed the random variable node that represents the malicious website accordingly.
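The blacklist seeding just described might be sketched as follows. The two-value probability factor [P(benign), P(malicious)] and the .95-.99 range come from the description above; the linear interpolation of that range by a source-confidence value is one possible policy, not necessarily the one the engine uses, and the function name is illustrative.

```python
HIGH_MALICIOUSNESS = (0.95, 0.99)  # example range from the description

def seed_from_blacklist(nodes, blacklist, confidence):
    """Seed blacklisted sites with a two-value probability factor
    [P(benign), P(malicious)], placing P(malicious) inside the
    high-maliciousness range according to source confidence in [0, 1]."""
    lo, hi = HIGH_MALICIOUSNESS
    for site in blacklist:
        if site in nodes:
            # higher-confidence sources push the factor toward the top of the range
            p_mal = lo + confidence * (hi - lo)
            nodes[site]["factor"] = [1.0 - p_mal, p_mal]
```

Whitelist seeding would mirror this with a low-maliciousness range (e.g., .01-.05), and a multi-blacklist policy could combine per-source confidences before interpolating.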
[0029] In some examples, the graphical model construction engine 110 seeds a random variable node according to multiple blacklists. The graphical model construction engine 110 may access multiple blacklists, for example through retrieval from multiple, different sources (e.g., different security organizations). The graphical model construction engine 110 may weight the impact upon the probability factor value determination according to a confidence level for a particular source. In determining a probability factor for a random variable node that represents a particular malicious website, the graphical model construction engine 110 may account for, as example factors, the number of blacklists a particular malicious website appears in, the confidence level of sources of the blacklists in which the malicious website appears, and more.
[0030] Additionally or alternatively to the blacklist 310, the graphical model construction engine 110 may seed random variable nodes according to a whitelist. A whitelist may refer to any listing or identification of non-malicious web content, which may be verified or authenticated by a security organization or other entity as to not include or propagate malicious content. In such examples, the graphical model construction engine 110 may seed random variable nodes representing web content identified in the whitelist with a probability factor indicative of a low probability of maliciousness. For instance, the graphical model construction engine 110 may assign a probability factor to such random variable nodes within a low-maliciousness probability range (for example, between .01- .05). Along similar lines as malicious web-content identified in the blacklist 310, the graphical model construction engine 110 may determine a probability factor value within the low-maliciousness probability range accounting for any number of factors based on the whitelist(s), source(s) of the whitelist, or various other factors.
[0031] As described above, the graphical model construction engine 110 may seed random variable nodes in the graphical model that represent malicious websites identified in a blacklist, non-malicious websites specified in a whitelist, or combinations of both. Thus, the graphical model construction engine 110 may seed the graphical model 210 with probability factors for web content previously known or identified as being malicious or non-malicious, e.g., with a known or determined maliciousness characterization. However, some of the random variable nodes in a graphical model may represent websites not previously identified as malicious or non-malicious (e.g., not specified in a blacklist or a whitelist accessed by the graphical model construction engine 110). For these random variable nodes that represent websites with an undetermined maliciousness characterization, the graphical model construction engine 110 may seed such random variable nodes with a probability factor generated by a content-based classifier, as described next.
[0032] Figure 4 shows an example of graphical model seeding based on a content-based classifier that the graphical model construction engine 110 may perform. In some examples, the content-based classification engine 108 implements a content-based classifier which generates a probability that a particular input website is malicious (e.g., a probability factor). The content-based classification engine 108 may train a content-based classifier through extracting content characteristics of malicious websites, such as malicious websites known, determined, or identified through a blacklist.
[0033] In the example shown in Figure 4, the content-based classification engine 108 accesses the blacklist 310 and extracts content of the malicious websites 311 and 312 specified in the blacklist 310. By extracting the local content features of identified or known malicious websites, the content-based classification engine 108 may track specific attributes, characteristics, and content of malicious websites to predict the maliciousness of other websites not specified in the blacklist 310.
[0034] The content-based classification engine 108 may extract various types of content from the malicious websites 311 and 312. As one example, the content-based classification engine 108 may extract lexical features of the malicious websites 311 and 312, such as specific web page content, URL characteristics, images or visual characteristics, etc. The content-based classification engine 108 may do so based on a bag of words model, for example. As another example, the content-based classification engine 108 may extract host features of the malicious websites 311 and 312, which may include host information obtained through Domain Name Service (DNS) requests such as a host name, domain registration time, owner information, and the like. The malicious websites 311 and 312 specified in the blacklist 310 and the extracted content features may provide a training set by which the content-based classification engine 108 trains a classifier.
[0035] In some examples, the content-based classification engine 108 accesses a whitelist of web content known to not contain any malicious content or verified as authentic and non-malicious. In a similar manner as described above, the content-based classification engine 108 may extract content from whitelisted websites identified as non-malicious to extract lexical and/or host features of non-malicious websites, for example to include in the training set for the content-based classifier.
[0036] From the content extraction from malicious websites, non-malicious websites, or both, the content-based classification engine 108 may train a content-based classifier to generate a probability that an input website is malicious. To do so, the content-based classification engine 108 may employ any number of machine learning models, including classifiers trained using naïve Bayes methods, support vector machine techniques, logistic regression, neural networks, and more. In some examples, the content-based classification engine 108 obtains labels for training from the blacklist 310 and a whitelist, and subsampling is used to ensure that no imbalance occurs in the training set (e.g., the number of samples per label is within a numerical threshold from one another).
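One of the classifier families named above, a naïve Bayes model over bag-of-words lexical features, can be sketched in plain Python. The class name, the token-list input format, and the use of Laplace smoothing are illustrative choices, not details drawn from the description.

```python
import math
from collections import Counter

class ContentClassifier:
    """Bag-of-words naive Bayes: labels are 0 (benign) and 1 (malicious)."""

    def train(self, samples):
        # samples: list of (token_list, label) pairs extracted from pages
        self.counts = {0: Counter(), 1: Counter()}
        self.docs = Counter()
        for tokens, label in samples:
            self.counts[label].update(tokens)
            self.docs[label] += 1
        self.vocab = set(self.counts[0]) | set(self.counts[1])

    def malicious_probability(self, tokens):
        # log P(label) + sum of log P(token | label), Laplace-smoothed
        scores = {}
        total_docs = sum(self.docs.values())
        for label in (0, 1):
            total = sum(self.counts[label].values())
            score = math.log(self.docs[label] / total_docs)
            for tok in tokens:
                score += math.log(
                    (self.counts[label][tok] + 1) / (total + len(self.vocab)))
            scores[label] = score
        # convert the two log scores back to a normalized probability
        m = max(scores.values())
        exp = {k: math.exp(v - m) for k, v in scores.items()}
        return exp[1] / (exp[0] + exp[1])
```

The returned value is exactly the kind of maliciousness probability the description says the content-based classifier supplies as a seed (prior) for unlabeled websites.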
[0037] Upon training a content-based classifier, the content-based classification engine 108 may generate probability factors for random variable nodes in the graphical model 210. The content-based classifier may examine content attributes of the input website and predict the maliciousness of the input website in the form of a maliciousness probability. The content-based classification engine 108 may thus generate a maliciousness probability for a website corresponding to a random variable node in the graphical model, which the graphical model construction engine 110 may use as a probability factor to seed the random variable node in the graphical model 210. Accordingly, the content-based classification engine 108 may generate probability factors for websites not specified in the blacklist 310, not specified in a whitelist, or both. That is, the content-based classification engine 108 may generate probability factors for websites with an unknown or undetermined malicious characterization, and the graphical model construction engine 110 may seed the probability factors generated by the content-based classification engine 108 for these websites in the graphical model 210.
[0038] In the example shown in Figure 4, the graphical model 210 includes random variable nodes (representing malicious websites specified in the blacklist 310) seeded according to the blacklist 310, random variable nodes (representing non-malicious websites specified in a whitelist) seeded according to the whitelist, as well as random variable nodes (representing websites not specified in the blacklist 310 or the whitelist) seeded according to the content-based classification engine 108. The probability factors seeded in the graphical model 210 may also be referred to as priors, e.g., probability factors determined prior to any graphical model inference or prior to accounting for the dependence structure of the graphical model 210.
[0039] Through the probability factor seeding features described above, the graphical model construction engine 110 may incorporate content-based classification of websites into the graphical model 210. Doing so may allow a system to discover malicious content that incorporates both global and local web content considerations, which may increase the accuracy of malicious web content identification. Moreover, the local content extraction and classification may be specifically combined with the global hyperlink modeling through the graphical model, providing an efficient and accurate mechanism to infer malicious web content. Some graphical model inference features are described next in Figure 5.
[0040] Figure 5 shows an example of graphical model inference that the inference engine 112 may perform to discover malicious web content from a graphical model. Upon seeding of the graphical model 210 with probability factors by the graphical model construction engine 110, the inference engine 112 may perform a graphical model inference on the graphical model 210 to discover malicious web content. For example, the inference engine 112 may determine a marginal probability distribution of random variable nodes in the graphical model 210 and may apply any suitable inference method to do so. Example inference methods the inference engine 112 may apply include belief propagation methods, exact inference, Markov chain Monte Carlo (MCMC) sampling, Gibbs sampling, junction tree methods, variational methods, and the like.
[0041] Through the graphical model inference, the inference engine 112 may adjust the probability factors for the random variable nodes, for example through the determination or adjusting of marginal probabilities, probability distribution adjustments, or via maximum a posteriori (MAP) probabilities. The inference engine 112 may, from the adjusted probability factors, discover malicious content. For instance, the inference engine 112 may identify any website represented by a random variable node with an adjusted probability factor (e.g., marginal probability, probability distribution, or MAP probability) that meets a malicious criterion.
[0042] Example malicious criteria include exceeding a particular threshold probability value (e.g., greater than a 50% probability or a probability factor value that is greater than a value of 0.5), falling within a particular maliciousness probability range, having a probability distribution with a threshold lower range or upper range, or any other configurable criterion to categorize a website as malicious based on an adjusted probability factor. In some examples, the inference engine 112 generates a ranked list of discovered malicious websites 510, for example ranked according to the adjusted probability factors.
[0043] The graphical model inference may support discovery of malicious web content (e.g., web content that was not previously known, identified, or categorized as malicious). As such, the inference engine 112 may filter websites already known as malicious from any listing of discovered malicious websites. For instance, the inference engine 112 may filter, from the ranked list of discovered malicious websites 510, the malicious websites 311 and 312 specified in the blacklist 310 or any other known malicious websites.
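The thresholding, ranking, and filtering steps of paragraphs [0042] and [0043] can be sketched together. The 0.5 threshold and the site names are illustrative assumptions; the disclosure describes the threshold as configurable.

```python
def discovered_malicious(marginals, blacklist, threshold=0.5):
    """Rank sites whose adjusted P(malicious) meets the criterion,
    filtering out sites already known malicious from the blacklist."""
    hits = [(p, site) for site, p in marginals.items()
            if p > threshold and site not in blacklist]
    # Descending order of adjusted probability factor.
    return [site for p, site in sorted(hits, reverse=True)]

marginals = {"known-bad": 0.97, "new-bad": 0.82, "benign": 0.12}
ranked = discovered_malicious(marginals, blacklist={"known-bad"})
```

Only `new-bad` survives: `benign` falls below the threshold, and `known-bad` is filtered because it was already specified in the blacklist, leaving a list of newly discovered malicious sites.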
[0044] As described above, a system 100 may support discovery of malicious web content through a content-based classification engine 108, graphical model construction engine 110, and inference engine 112. Through construction of a graphical model from a web graph, the system 100 may map the hyperlinked structure of the web, by which the system 100 may account for how the linked nature of the web, including dependencies, homophily, and the tendency of malicious websites to link to other malicious websites, impacts the probability that a particular website is malicious. The system 100 may also consider content-based features of the particular website, specifically through seeding the random variable node representing the particular website through a content-based classifier. As such, through the particular combination of a graphical model seeded with a content-based classifier, the system 100 may support determination of malicious web content with increased accuracy and efficiency.
[0045] Figure 6 shows an example of logic 600 that a system or device may implement to provide malicious content discovery through graphical model inference. A system may implement the logic 600 as hardware, executable instructions stored on a machine-readable medium, or as combinations of both. In some examples, the system implements the logic 600 through the content-based classification engine 108, the graphical model construction engine 110, and the inference engine 112, by which the system may perform or execute the logic 600 as a method to discover malicious web content through graphical model inference.
[0046] The system may access a blacklist and a whitelist (602), for example by accessing or retrieving the blacklist or whitelist from a security organization or other source. The system may extract content features from the malicious websites specified in the blacklist and the non-malicious websites specified in the whitelist (604), through which the system may train a content-based classifier to predict the maliciousness of an input website through a probability factor (606). That is, the system may train the content-based classifier according to the content extracted from the malicious websites specified in the blacklist, according to content extracted from whitelisted or non-malicious websites, or a combination of both. As noted above, the content-based classifier may generate a probability that a particular input website is malicious, which may be provided as a probability factor for seeding a graphical model (as discussed below regarding 614).
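The training steps 604 and 606 can be sketched with a toy content-based classifier. This is a hedged illustration only: a naive Bayes model over bag-of-words features is one possible choice (the disclosure does not fix a classifier type), and the training pages and tokens are made up. A real system would extract hyperlinks, domain information, and page content as described above.

```python
from collections import Counter

def train(malicious_pages, benign_pages):
    """Train a naive Bayes classifier with add-one smoothing; returns a
    function mapping a page's text to P(malicious)."""
    mal = Counter(t for page in malicious_pages for t in page.split())
    ben = Counter(t for page in benign_pages for t in page.split())
    vocab = set(mal) | set(ben)
    mal_total, ben_total = sum(mal.values()), sum(ben.values())

    def prob_malicious(page):
        # Equal class priors; multiply smoothed per-token likelihoods.
        pm = pb = 1.0
        for t in page.split():
            pm *= (mal[t] + 1) / (mal_total + len(vocab))
            pb *= (ben[t] + 1) / (ben_total + len(vocab))
        return pm / (pm + pb)

    return prob_malicious

classifier = train(
    malicious_pages=["free casino winner click", "winner prize click now"],
    benign_pages=["weather news today", "local news report"],
)
score = classifier("casino winner")  # probability factor for seeding (614)
```

The returned probability can then be used directly as the seeded probability factor for a random variable node representing a website not specified in the blacklist.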
[0047] Turning to graphical model construction, the system may obtain a web graph (608) and construct a graphical model from the web graph (610). The graphical model may reflect the hyperlinked structure of the web, and also convey a dependence structure for random variable nodes of the graphical model. The system may seed the random variable nodes in the graphical model with probability factors (which, prior to inference, may also be referred to as priors). In that regard, the system may seed random variable nodes of the graphical model representing any of the malicious websites specified in the blacklist (612). The system may also seed the random variable nodes of the graphical model representing websites not specified in the blacklist, and do so specifically using the probability factors generated by the content-based classifier (614). Then, the system may perform graphical model inference on the graphical model to adjust the probability factors (616) and generate a ranked list of discovered malicious websites (618) from the graphical model inference.
[0048] Figure 7 shows another example of logic 700 that a system or device may implement to provide malicious content discovery through graphical model inference. A system may implement the logic 700 as hardware, executable instructions stored on a machine-readable medium, or as combinations of both. In some examples, the system implements the logic 700 through the content-based classification engine 108, the graphical model construction engine 110, and the inference engine 112, by which the system may perform or execute the logic 700 as a method to discover malicious web content through graphical model inference.
[0049] The system may construct a graphical model from a web graph, wherein random variable nodes in the graphical model represent websites and links between the random variable nodes in the graphical model represent hyperlinks between the websites (702). In seeding the graphical model, the system may seed a first random variable node in the graphical model that represents a malicious website specified in a blacklist with a probability factor based on the blacklist (704), e.g., based on the fact that the malicious website is specified in the blacklist. The system may also seed a second random variable node in the graphical model that represents a different website not specified in the blacklist with a probability factor generated by a content-based classifier trained through content extraction of malicious websites specified in a blacklist (706).
[0050] Then, the system may perform a graphical model inference on the graphical model constructed from the web graph to adjust probability factors for the random variable nodes of the graphical model and identify the different website represented by the second random variable node as a discovered malicious website when the adjusted probability factor of the second random variable node exceeds a maliciousness probability threshold (708). The maliciousness probability threshold may be configurable or user-specified, for example.
[0051] Figure 8 shows an example of a device 800 that supports discovery of malicious web content through graphical model inference. The device 800 may include a processing resource 810, which may take the form of a single or multiple processors. The processor(s) may include a central processing unit (CPU), microprocessor, or any hardware device suitable for executing instructions stored on a machine-readable medium, such as the machine-readable medium 820 shown in Figure 8. The machine-readable medium 820 may be any non-transitory electronic, magnetic, optical, or other physical storage device that stores executable instructions, such as the instructions 822, 824, 826, 828, 830, and 832 shown in Figure 8. As such, the machine-readable medium 820 may be, for example, Random Access Memory (RAM) such as dynamic RAM (DRAM), flash memory, memristor memory, spin-transfer torque memory, an Electrically-Erasable Programmable Read-Only Memory (EEPROM), a storage drive, an optical disk, and the like.
[0052] The device 800 may execute instructions stored on the machine-readable medium 820 through the processing resource 810. Executing the instructions may cause the device 800 to perform any of the malicious web content discovery features described herein, including according to any features of the content-based classification engine 108, graphical model construction engine 110, inference engine 112, logic 600 and 700, or any combination thereof.
[0053] For example, execution of the instructions 822, 824, 826, 828, 830, and 832 by the processing resource 810 may cause the device 800 to train a content-based classifier to generate a probability factor that a web page is malicious through content extraction from malicious web pages specified in a blacklist, the content extraction including hyperlinks of the malicious web pages, domain information of the malicious web pages, page content of the malicious web pages, or any combination thereof; construct a graphical model from a web graph, wherein random variable nodes in the graphical model represent web pages and links between the random variable nodes in the graphical model represent hyperlinks between the web pages; seed random variable nodes in the web graph that represent web pages not specified in the blacklist through probability factors generated by the content-based classifier; seed random variable nodes in the web graph that represent the malicious web pages specified in the blacklist with probability factors determined without use of the content-based classifier; perform a graphical model inference on the graphical model constructed from the web graph and seeded with the probability factors to adjust the probability factors for the random variable nodes of the graphical model; and determine a discovered malicious web page from the adjusted probability factors. The probability factor of a particular random variable node in the graphical model may include, for example, a probability that a particular web page represented by the particular node is malicious.
[0054] In some examples, the machine-readable medium 820 may further include instructions executable by the processing resource 810 to access multiple blacklists from different sources and seed the random variable nodes in the web graph that represent malicious web pages specified in the blacklist with probability factors determined based on the multiple blacklists. As another example, the machine-readable medium 820 may further include instructions executable by the processing resource 810 to generate a ranked list of discovered malicious web pages ranked according to the adjusted probability factors, and further to filter the malicious web pages specified in the blacklist from the ranked list of discovered malicious web pages.
[0055] The systems, methods, devices, and logic described above, including the content-based classification engine 108, graphical model construction engine 110, and inference engine 112, may be implemented in many different ways in many different combinations of hardware, logic, circuitry, and executable instructions stored on a machine-readable medium. For example, the content-based classification engine 108, graphical model construction engine 110, and inference engine 112, or combinations thereof, may include circuitry in a controller, a microprocessor, or an application specific integrated circuit (ASIC), or may be implemented with discrete logic or components, or a combination of other types of analog or digital circuitry, combined on a single integrated circuit or distributed among multiple integrated circuits. A product, such as a computer program product, may include a storage medium and machine readable instructions stored on the medium, which when executed in an endpoint, computer system, or other device, cause the device to perform operations according to any of the description above, including according to any features of the content-based classification engine 108, graphical model construction engine 110, and inference engine 112.
[0056] The processing capability of the systems, devices, and engines described herein, including the content-based classification engine 108, graphical model construction engine 110, and inference engine 112, may be distributed among multiple system components, such as among multiple processors and memories, optionally including multiple distributed processing systems. Parameters, databases, and other data structures may be separately stored and managed, may be incorporated into a single memory or database, may be logically and physically organized in many different ways, and may be implemented in many ways, including data structures such as linked lists, hash tables, or implicit storage mechanisms. Programs may be parts (e.g., subroutines) of a single program, separate programs, distributed across several memories and processors, or implemented in many different ways, such as in a library (e.g., a shared library).
[0057] While various examples have been described above, many more implementations are possible.

Claims

1. A system comprising:
a content-based classification engine to:
generate a probability factor for a particular website, the content-based classification engine trained through content extraction from malicious websites specified in a blacklist;
a graphical model construction engine to:
construct a graphical model from a web graph, wherein random variable nodes in the graphical model represent websites and links between the random variable nodes in the graphical model represent hyperlinks between the websites;
seed a random variable node representing a particular website not specified in the blacklist with the probability factor generated from the content-based classification engine; and
an inference engine to:
perform a graphical model inference on the graphical model constructed from the web graph and seeded with the probability factor generated by the content-based classification engine to adjust probability factors for the random variable nodes of the graphical model; and
generate a list of discovered malicious websites from the adjusted probability factors.
2. The system of claim 1, wherein the graphical model construction engine is further to seed a different random variable node representing a particular malicious website specified in the blacklist with a probability factor based on the blacklist and not generated by the content-based classification engine.
3. The system of claim 1, wherein the graphical model construction engine is further to:
access multiple blacklists from different sources; and seed a different random variable node representing a particular malicious website specified in the multiple blacklists with a probability factor that is:
based on the multiple blacklists; and
not generated by the content-based classification engine.
4. The system of claim 1, wherein the probability factor of the random variable node in the graphical model includes a probability that the particular website represented by the node is malicious.
5. The system of claim 1, wherein the inference engine is further to filter a malicious website specified in the blacklist from the list of discovered malicious websites.
6. The system of claim 1, wherein the inference engine is further to rank the list of discovered malicious websites in a descending order of the adjusted probability factors.
7. A method comprising:
constructing a graphical model from a web graph, wherein random variable nodes in the graphical model represent websites and links between the random variable nodes in the graphical model represent hyperlinks between the websites;
seeding a first random variable node in the graphical model that represents a malicious website specified in a blacklist with a probability factor based on the blacklist;
seeding a second random variable node in the graphical model that represents a different website not specified in the blacklist with a probability factor generated by a content-based classifier trained through content extraction of malicious websites specified in a blacklist;
performing a graphical model inference on the graphical model constructed from the web graph to adjust probability factors for the random variable nodes of the graphical model; and
identifying the different website represented by the second random variable node as a discovered malicious website when the adjusted probability factor of the second random variable node exceeds a maliciousness probability threshold.
8. The method of claim 7, further comprising:
accessing multiple blacklists from different sources; and
wherein seeding the first random variable node that represents the malicious website specified in the multiple blacklists comprises seeding the first random variable node with a probability factor that is:
based on the multiple blacklists; and
not generated by the content-based classifier.
9. The method of claim 7, further comprising:
accessing the blacklist from a particular source; and
wherein seeding the first random variable node comprises determining the probability factor accounting for a confidence level for the particular source.
10. The method of claim 7, wherein the probability factor of a particular random variable node in the graphical model includes a probability that a particular website represented by the particular node is malicious.
11. A non-transitory machine-readable medium comprising instructions executable by a processing resource to:
train a content-based classifier to generate a probability factor that a web page is malicious through content extraction from malicious web pages specified in a blacklist, the content extraction including hyperlinks of the malicious web pages, domain information of the malicious web pages, page content of the malicious web pages, or any combination thereof;
construct a graphical model from a web graph, wherein random variable nodes in the graphical model represent web pages and links between the random variable nodes in the graphical model represent hyperlinks between the web pages;
seed random variable nodes in the web graph that represent web pages not specified in the blacklist through probability factors generated by the content- based classifier;
seed random variable nodes in the web graph that represent the malicious web pages specified in the blacklist with probability factors determined without use of the content-based classifier;
perform a graphical model inference on the graphical model constructed from the web graph and seeded with the probability factors to adjust the probability factors for the random variable nodes of the graphical model; and determine a discovered malicious web page from the adjusted probability factors.
12. The non-transitory machine-readable medium of claim 11, further comprising instructions executable by the processing resource to:
access multiple blacklists from different sources; and
wherein the instructions are executable by the processing resource to seed the random variable nodes in the web graph that represent malicious web pages specified in the blacklist with probability factors determined based on the multiple blacklists.
13. The non-transitory machine-readable medium of claim 11, wherein the probability factor of a particular random variable node in the graphical model includes a probability that a particular web page represented by the particular node is malicious.
14. The non-transitory machine-readable medium of claim 11, further comprising instructions executable by the processing resource to:
generate a ranked list of discovered malicious web pages ranked according to the adjusted probability factors.
15. The non-transitory machine-readable medium of claim 14, wherein the instructions are executable by the processing resource further to:
filter the malicious web pages specified in the blacklist from the ranked list of discovered malicious web pages.
PCT/US2015/061899 2015-11-20 2015-11-20 Malicious web content discovery through graphical model inference WO2017086992A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2015/061899 WO2017086992A1 (en) 2015-11-20 2015-11-20 Malicious web content discovery through graphical model inference

Publications (1)

Publication Number Publication Date
WO2017086992A1 true WO2017086992A1 (en) 2017-05-26

Family

ID=58717637

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2015/061899 WO2017086992A1 (en) 2015-11-20 2015-11-20 Malicious web content discovery through graphical model inference

Country Status (1)

Country Link
WO (1) WO2017086992A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8341745B1 (en) * 2010-02-22 2012-12-25 Symantec Corporation Inferring file and website reputations by belief propagation leveraging machine reputation
US8381294B2 (en) * 2005-07-14 2013-02-19 Imation Corp. Storage device with website trust indication
US20130179974A1 (en) * 2012-01-11 2013-07-11 Pratyusa Kumar Manadhata Inferring a state of behavior through marginal probability estimation
US8572740B2 (en) * 2009-10-01 2013-10-29 Kaspersky Lab, Zao Method and system for detection of previously unknown malware
US20150281244A1 (en) * 2012-10-25 2015-10-01 Beijing Qihoo Technology Company Limited Method And Apparatus For Determining Phishing Website

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109033409A (en) * 2018-08-03 2018-12-18 华北水利水电大学 A kind of pair is randomly selected method
CN109033409B (en) * 2018-08-03 2022-03-01 华北水利水电大学 Double random extraction method
CN111274507A (en) * 2020-01-21 2020-06-12 腾讯科技(深圳)有限公司 Method, device and equipment for browsing webpage content and storage medium
US20210266292A1 (en) * 2020-02-14 2021-08-26 At&T Intellectual Property I, L.P. Scoring domains and ips using domain resolution data to identify malicious domains and ips
US11533293B2 (en) * 2020-02-14 2022-12-20 At&T Intellectual Property I, L.P. Scoring domains and IPS using domain resolution data to identify malicious domains and IPS
US11711393B2 (en) 2020-10-19 2023-07-25 Saudi Arabian Oil Company Methods and systems for managing website access through machine learning
CN114553555A (en) * 2022-02-24 2022-05-27 北京字节跳动网络技术有限公司 Malicious website identification method and device, storage medium and electronic equipment
CN114553555B (en) * 2022-02-24 2023-11-07 抖音视界有限公司 Malicious website identification method and device, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
Iqbal et al. Adgraph: A graph-based approach to ad and tracker blocking
Khan et al. Defending malicious script attacks using machine learning classifiers
Ramesh et al. An efficacious method for detecting phishing webpages through target domain identification
US8972376B1 (en) Optimized web domains classification based on progressive crawling with clustering
US8229930B2 (en) URL reputation system
US7974970B2 (en) Detection of undesirable web pages
Mohammad et al. Predicting phishing websites using neural network trained with back-propagation
WO2017086992A1 (en) Malicious web content discovery through graphical model inference
CN107346326A (en) For generating the method and system of neural network model
US9922129B2 (en) Systems and methods for cluster augmentation of search results
JP2019517088A (en) Security vulnerabilities and intrusion detection and remediation in obfuscated website content
US20170039483A1 (en) Factorized models
US20180131708A1 (en) Identifying Fraudulent and Malicious Websites, Domain and Sub-domain Names
Kaytan et al. Effective classification of phishing web pages based on new rules by using extreme learning machines
Mourtaji et al. Hybrid Rule‐Based Solution for Phishing URL Detection Using Convolutional Neural Network
RU2658878C1 (en) Method and server for web-resource classification
Li et al. A minimum enclosing ball-based support vector machine approach for detection of phishing websites
US11048738B2 (en) Records search and management in compliance platforms
EP3309701A1 (en) Systems and methods for anonymous construction and indexing of visitor databases using first-party cookies
CN104239582A (en) Method and device for identifying phishing webpage based on feature vector model
Wu et al. TrackerDetector: A system to detect third-party trackers through machine learning
CN103440454B (en) A kind of active honeypot detection method based on search engine keywords
Jha et al. Intelligent phishing website detection using machine learning
US11108802B2 (en) Method of and system for identifying abnormal site visits
Tchakounte et al. Crawl-shing: A focused crawler for fetching phishing contents based on graph isomorphism

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15908979

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15908979

Country of ref document: EP

Kind code of ref document: A1