WO2017086992A1 - Malicious web content discovery through graphical model inference - Google Patents

Malicious web content discovery through graphical model inference

Info

Publication number
WO2017086992A1
WO2017086992A1 (PCT/US2015/061899)
Authority
WO
WIPO (PCT)
Prior art keywords
graphical model
malicious
random variable
content
probability
Prior art date
Application number
PCT/US2015/061899
Other languages
French (fr)
Inventor
Manish Marwah
Original Assignee
Hewlett Packard Enterprise Development Lp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hewlett Packard Enterprise Development Lp
Priority to PCT/US2015/061899 priority Critical patent/WO2017086992A1/en
Publication of WO2017086992A1 publication Critical patent/WO2017086992A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55Detecting local intrusion or implementing counter-measures
    • G06F21/56Computer malware detection or handling, e.g. anti-virus arrangements

Definitions

  • Figure 1 shows an example of a system that supports discovery of malicious web content through graphical model inference.
  • Figure 2 shows an example of a graphical model that a graphical model construction engine may construct.
  • Figure 3 shows an example of graphical model seeding based on a blacklist of malicious web content that the graphical model construction engine may perform.
  • Figure 4 shows an example of graphical model seeding based on a content-based classifier that the graphical model construction engine may perform.
  • Figure 5 shows an example of graphical model inference that an inference engine may perform to discover malicious content from the graphical model.
  • Figure 6 shows an example of logic that a system or device may implement to provide malicious content discovery through graphical model inference.
  • Figure 7 shows another example of logic that a system or device may implement to provide malicious content discovery through graphical model inference.
  • Figure 8 shows an example of a device that supports discovery of malicious web content through graphical model inference.
  • a graphical model may be constructed using a web graph, which may capture the hyperlinked structure of interconnected web resources (e.g., interlinked websites, the Internet, or the World Wide Web).
  • the constructed graphical model may be seeded with probability factors determined through a blacklist of malicious websites, generated by a content-based classifier trained through content extraction of the malicious websites specified in the blacklist, or through combinations of both.
  • the discovery features described herein may account for global factors through mapping of the hyperlinked structure of interconnected web resources and location-specific factors through content-based classification, and may methodically do so in combination to infer malicious web content.
  • the malicious web content discovery features described herein may support identification of malicious web content with increased accuracy and efficiency.
  • Figure 1 shows an example of a system 100 that supports discovery of malicious web content through graphical model inference.
  • the system 100 may take the form of a computing system, including a single or multiple computing devices such as application servers, compute nodes, desktop or laptop computers, smart phones or other mobile devices, tablet devices, embedded controllers, and more.
  • the system 100 may discover malicious web content.
  • Malicious web content may refer to any website, web domain, web page, or web host that provides, propagates, or includes malicious content.
  • Malicious content may include any program or file that is intended to damage or disable computer operations, gather sensitive information, or gain unauthorized access to a computer or computer system.
  • a malicious website may refer to a website through which malicious software, viruses, worms, trojan horses, spyware, spam content, phishing mechanisms, or any other malicious content is propagated or linked.
  • a malicious web page may refer to any particular web page through which malicious content is propagated.
  • a malicious web host may refer to any web host that hosts a malicious web domain, malicious website, or malicious web page.
  • the system 100 may discover malicious web content through graphical model inference performed on a graphical model constructed to reflect a hyperlinked structure of a web system and seeded with probability factors determined through a blacklist and content-based classifier.
  • the system 100 shown in Figure 1 includes the content-based classification engine 108, the graphical model construction engine 110, and the inference engine 112, through which the system 100 may discover malicious web content through graphical model inference.
  • the system 100 may implement the engines 108, 110, and 112 (and components thereof) in various ways, for example as hardware and programming.
  • the programming for the engines 108, 110, and 112 may take the form of processor-executable instructions stored on a non-transitory machine-readable storage medium and the hardware for the engines 108, 110, and 112 may include a processing resource to execute those instructions.
  • a processing resource may include a number of processors and may be implemented through a single processor or multi-processor architecture.
  • the system 100 implements multiple engines using the same system features or hardware components (e.g., a common processing resource).
  • the content-based classification engine 108 includes an engine component to generate a probability factor for a particular website, and the content-based classification engine 108 may be trained through content extraction from malicious websites specified in a blacklist.
  • the graphical model construction engine 110 includes engine components to construct a graphical model from a web graph, wherein random variable nodes in the graphical model represent websites and links between the random variable nodes in the graphical model represent hyperlinks between the websites and seed a random variable node representing a particular website not specified in the blacklist with the probability factor generated from the content-based classification engine.
  • the inference engine 112 includes engine components to perform a graphical model inference on the graphical model constructed from the web graph and seeded with the probability factor generated by the content-based classification engine to adjust probability factors for the random variable nodes of the graphical model and generate a list of discovered malicious websites from the adjusted probability factors.
  • Figure 2 shows an example of a graphical model that the graphical model construction engine 110 may construct.
  • a graphical model may also be referred to as a probabilistic graphical model, and may be used to express conditional dependence structure between random variables.
  • the graphical model construction engine 110 may construct a graphical model to model a hyperlinked structure of interconnected web resources, such as the World Wide Web, an enterprise intranet, or various other types of interconnected resources (or portions thereof). To do so, the graphical model construction engine 110 may obtain a web graph 202, which may be any graph or data structure that indicates hyperlinks between web pages, websites, or other web resources of a structure of interconnected web resources, such as the World Wide Web.
  • the graphical model construction engine 110 itself may perform crawling operations to map out links of selected portions of the web (e.g., including particular websites or pages). As another example, the graphical model construction engine 110 may otherwise obtain the web graph 202 from a web crawler or other information source. Then, the graphical model construction engine 110 may construct the graphical model from the web graph by associating random variables with the websites specified in the web graph (or web pages, depending on the web abstraction level for the malicious content discovery).
  • the graphical model construction engine 110 constructs the graphical model 210.
  • a graphical model may include nodes for the random variables mapped in the graphical model, which may also be referred to as random variable nodes.
  • the graphical model construction engine 110 may plot the webpages and websites as the random variable nodes of the graphical model and plot the hyperlinks between the webpages and websites as edges between the random variable nodes.
  • the random variable of each random variable node may indicate the probability that a particular website is malicious (and may thus be referred to as a probability factor). That is, a probability factor of a random variable node may refer to a probability function or value that particular web content represented by the random variable node is malicious.
  • the probability factor for a node of the graphical model 210 may be represented in various ways and include any number of data types, such as a probability function, a probability value (e.g., in the range 0-1), a probability distribution, and the like.
  • the random variable nodes of the graphical model 210 each represent a particular website.
  • the graphical model 210 includes the random variable nodes labeled as 211 and 212, which correspond to and represent the website 221 and the website 222 respectively (and the websites 221 and 222 may include multiple web pages). Edges between random variable nodes in the graphical model 210 may represent hyperlinks between the websites represented by the random variable nodes. As such, the random variable nodes 211 and 212 are joined by an edge in the graphical model 210, indicating that the websites 221 and 222 are hyperlinked to one another, e.g., at least one webpage of the website 221 links to at least one webpage of the website 222 or vice versa.
  • the graphical model construction engine 110 uses an undirected graphical model, such as a Markov random field, in modeling the random variable nodes and edges in the graphical model 210.
  • the graphical model construction engine 110 may account for an impact that other websites linked to a particular website may have in terms of hosting or propagating malicious content.
  • the graphical model 210 may exploit the concept of homophily that a web entity is likely (in probabilistic terms) to be associated with similar entities.
  • the graphical model 210 may be used to imply, probabilistically express, or infer that malicious websites are likely to have hyperlinks to other malicious websites and non-malicious websites are likely to have hyperlinks to other non-malicious websites.
  • a system 100 may exploit global information to determine (e.g., infer) a maliciousness probability for websites represented in the graphical model 210.
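The construction step described above can be sketched in plain Python. This is an illustrative reduction, not the patent's implementation: the function name, the site names, and the dict-of-sets adjacency representation are all assumptions made for the example.

```python
# Sketch: collapse a directed web graph (hyperlink pairs) into the
# undirected structure of the graphical model, where each node is a
# website and an edge means a hyperlink exists in either direction.

def build_graphical_model(hyperlinks):
    """Return an undirected adjacency map from directed hyperlink pairs."""
    adjacency = {}
    for src, dst in hyperlinks:
        if src == dst:
            continue  # self-links add no dependence structure
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set()).add(src)
    return adjacency

# Example web graph: site221 links to site222 twice, site222 links back
# once and also links to site223; duplicate hyperlinks yield one edge.
web_graph = [("site221", "site222"), ("site221", "site222"),
             ("site222", "site221"), ("site222", "site223")]
model = build_graphical_model(web_graph)
```

Treating the edges as undirected matches the Markov random field modeling the description mentions: maliciousness dependence between two linked sites is symmetric, regardless of which site contains the hyperlink.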
  • the graphical model construction engine 110 may seed the random variable nodes of the graphical model 210 with probability factors, including according to a blacklist and according to a content-based classifier. Examples by which the graphical model 210 is seeded with probability factors are described in greater detail next through Figures 3 and 4.
  • Figure 3 shows an example of graphical model seeding based on a blacklist of malicious web content that the graphical model construction engine 110 may perform. To do so, the graphical model construction engine 110 may access a blacklist 310, which may indicate specific web content (e.g., websites, web pages, domains, web hosts, etc.) determined or known to include malicious content.
  • the blacklist 310 may, for example, specify the Uniform Resource Locator (URL) or otherwise identify web content that has been determined to include malware, is a phishing website, or according to any other maliciousness categorization.
  • the graphical model construction engine 110 may receive the blacklist 310 from any listing source, such as various security organizations that distribute such blacklists.
  • the blacklist 310 specifies two malicious websites shown as the malicious website 311 and the malicious website 312.
  • the graphical model construction engine 110 may seed particular random variable nodes in the graphical model 210 according to the specification of the malicious websites 311 and 312 in the blacklist 310. That is, the graphical model construction engine 110 may seed particular random variable nodes in the graphical model that represent the malicious websites 311 and 312 specified in the blacklist 310, which in Figure 3 are the random variable nodes 331 and 332 respectively.
  • the graphical model construction engine 110 may seed a random variable node in the graphical model 210 with a random variable referred to as a probability factor.
  • the probability factor may specify a probability that the website represented by the random variable node is malicious (also referred to as a maliciousness probability).
  • the graphical model construction engine 110 may seed a probability factor as a vector of two values, a first probability that the website represented by a random variable node is malicious and a second probability that the website represented by the random variable node is not malicious.
  • the sum of the first and second values may be 1.
  • the probability factor may specify a single maliciousness probability value, a maliciousness probability distribution, or any other probabilistic expression.
  • the probability factor is a probability distribution over several values, categories, or classifications of maliciousness, such as a distribution of probabilities that a website is non-malicious, includes malware, is a phishing site, etc.
  • the graphical model construction engine 110 may seed the corresponding random variable nodes 331 and 332 with a probability factor indicative of a high probability of maliciousness.
  • the graphical model construction engine 110 may seed random variable nodes representing malicious websites identified in a blacklist within a predetermined high-maliciousness probability range indicative of a high probability of including malicious content, for example a high-maliciousness probability range of .95-.99.
  • the graphical model construction engine 110 may determine the particular value of the probability factor according to a confidence level of the source of the blacklist (which may be based on a reputation of the blacklist source, other information about the malicious website, or any number of other factors). For each malicious website specified in a blacklist, the graphical model construction engine 110 may seed the random variable node that represents the malicious website accordingly.
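A minimal sketch of this seeding step follows. The .95-.99 and .01-.05 ranges come from the description; the linear scaling by source confidence, the function name, and the two-value [P(benign), P(malicious)] layout are assumptions made for illustration.

```python
# Sketch: assign a prior probability factor to one random variable node,
# based on blacklist/whitelist membership or, failing that, a
# content-based classifier's output. Confidence scaling is an assumption.

def seed_prior(site, blacklist, whitelist, source_confidence=1.0,
               classifier_prob=None):
    """Return a two-value probability factor [P(benign), P(malicious)]."""
    if site in blacklist:
        # Interpolate within the high-maliciousness range (.95-.99)
        # according to the confidence level of the blacklist source.
        p_mal = 0.95 + 0.04 * source_confidence
    elif site in whitelist:
        # Interpolate within the low-maliciousness range (.01-.05).
        p_mal = 0.05 - 0.04 * source_confidence
    elif classifier_prob is not None:
        p_mal = classifier_prob  # fall back to the content-based classifier
    else:
        p_mal = 0.5  # no information: uninformative prior
    return [1.0 - p_mal, p_mal]
```

The two values sum to 1, matching the vector-of-two-values representation described above; a multi-category distribution (malware, phishing, etc.) would extend the returned list rather than change the structure.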
  • the graphical model construction engine 110 seeds a random variable node according to multiple blacklists.
  • the graphical model construction engine 110 may access multiple blacklists, for example through retrieval from multiple, different sources (e.g., different security organizations).
  • the graphical model construction engine 110 may weight the impact upon the probability factor value determination according to a confidence level for a particular source.
  • the graphical model construction engine 110 may account for, as example factors, the number of blacklists a particular malicious website appears in, the confidence level of sources of the blacklists in which the malicious website appears, and more.
  • the graphical model construction engine 110 may seed random variable nodes according to a whitelist.
  • a whitelist may refer to any listing or identification of non-malicious web content, which may be verified or authenticated by a security organization or other entity as to not include or propagate malicious content.
  • the graphical model construction engine 110 may seed random variable nodes representing web content identified in the whitelist with a probability factor indicative of a low probability of maliciousness. For instance, the graphical model construction engine 110 may assign a probability factor to such random variable nodes within a low-maliciousness probability range (for example, between .01 and .05).
  • the graphical model construction engine 110 may determine a probability factor value within the low-maliciousness probability range accounting for any number of factors based on the whitelist(s), source(s) of the whitelist, or various other factors.
  • the graphical model construction engine 110 may seed random variable nodes in the graphical model that represent malicious websites identified in a blacklist, non-malicious websites specified in a whitelist, or combinations of both.
  • the graphical model construction engine 110 may seed the graphical model 210 with probability factors for web content previously known or identified as being malicious or non-malicious, e.g., with a known or determined maliciousness characterization.
  • some of the random variable nodes in a graphical model may represent websites not previously identified as malicious or non-malicious (e.g., not specified in a blacklist or a whitelist accessed by the graphical model construction engine 110).
  • the graphical model construction engine 110 may seed such random variable nodes with a probability factor generated by a content-based classifier, as described next.
  • Figure 4 shows an example of graphical model seeding based on a content-based classifier that the graphical model construction engine 110 may perform.
  • the content-based classification engine 108 implements a content-based classifier which generates a probability that a particular input website is malicious (e.g., a probability factor).
  • the content-based classification engine 108 may train a content-based classifier through extracting content characteristics of malicious websites, such as malicious websites known, determined, or identified through a blacklist.
  • the content-based classification engine 108 accesses the blacklist 310 and extracts content of the malicious websites 311 and 312 specified in the blacklist 310. By extracting the local content features of identified or known malicious websites, the content-based classification engine 108 may track specific attributes, characteristics, and content of malicious websites to predict the maliciousness of other websites not specified in the blacklist 310.
  • the content-based classification engine 108 may extract various types of content from the malicious websites 311 and 312.
  • the content-based classification engine 108 may extract lexical features of the malicious websites 311 and 312, such as specific web page content, URL characteristics, images or visual characteristics, etc.
  • the content-based classification engine 108 may do so based on a bag of words model, for example.
  • the content-based classification engine 108 may extract host features of the malicious websites 311 and 312, which may include host information obtained through Domain Name Service (DNS) requests such as a host name, domain registration time, owner information, and the like.
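The lexical side of this extraction can be sketched as a simple bag-of-words over URL components and page text. The tokenization rules and example URL here are illustrative assumptions, not the patent's exact feature set.

```python
import re

# Sketch: extract lexical features (bag of words) from a website's URL
# and page content, as input to a content-based classifier.

def lexical_features(url, page_text=""):
    """Return a token-count dict over URL components and page content."""
    tokens = re.split(r"[^a-z0-9]+", url.lower())
    tokens += re.split(r"[^a-z0-9]+", page_text.lower())
    features = {}
    for tok in tokens:
        if tok:
            features[tok] = features.get(tok, 0) + 1
    return features

feats = lexical_features("http://free-prizes.example/win.html",
                         "Click here to claim your prize")
```

Host features (registration time, owner information, and so on) would be gathered separately via DNS and WHOIS lookups and appended to the same feature dict.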
  • the malicious websites 311 and 312 specified in the blacklist 310 and the extracted content features may provide a training set by which the content-based classification engine 108 trains a classifier.
  • the content-based classification engine 108 accesses a whitelist of web content known to not contain any malicious content or verified as authentic and non-malicious.
  • the content-based classification engine 108 may extract content from whitelisted websites identified as non-malicious to extract lexical and/or host features of non-malicious websites, for example to include in the training set for the content-based classifier.
  • the content-based classification engine 108 may train a content-based classifier to generate a probability that an input website is malicious.
  • the content-based classification engine 108 may employ any number of machine learning models, including classifiers trained using naïve Bayes methods, support vector machine techniques, logistic regression, neural networks, and more.
  • the content-based classification engine 108 obtains labels for training from the blacklist 310 and a whitelist, and subsampling is used to ensure that no imbalance occurs in the training set (e.g., the number of samples per label are within a numerical threshold from one another).
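One of the model families named above, naïve Bayes over bag-of-words features, can be sketched in a few dozen lines. The tiny training set, helper names, and Laplace smoothing are all illustrative assumptions; an actual engine would train on full blacklist/whitelist extractions.

```python
import math
import re

# Sketch: naive Bayes content-based classifier. Label 1 = malicious.

def tokenize(text):
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def train_nb(samples):
    """samples: list of (text, label) pairs; returns the fitted counts."""
    counts = {0: {}, 1: {}}   # per-label token counts
    totals = {0: 0, 1: 0}     # per-label total tokens
    docs = {0: 0, 1: 0}       # per-label document counts
    vocab = set()
    for text, label in samples:
        docs[label] += 1
        for tok in tokenize(text):
            counts[label][tok] = counts[label].get(tok, 0) + 1
            totals[label] += 1
            vocab.add(tok)
    return counts, totals, docs, vocab

def predict_malicious(model, text):
    """Return P(malicious | text) with Laplace (add-one) smoothing."""
    counts, totals, docs, vocab = model
    n_docs = docs[0] + docs[1]
    log_post = {}
    for label in (0, 1):
        lp = math.log(docs[label] / n_docs)
        for tok in tokenize(text):
            lp += math.log((counts[label].get(tok, 0) + 1) /
                           (totals[label] + len(vocab)))
        log_post[label] = lp
    m = max(log_post.values())
    z = sum(math.exp(v - m) for v in log_post.values())
    return math.exp(log_post[1] - m) / z

# Illustrative training set: texts extracted from blacklisted (1) and
# whitelisted (0) websites.
samples = [
    ("free prize click here malware download", 1),
    ("phishing login verify your free prize", 1),
    ("local weather forecast and news", 0),
    ("sports scores and league news", 0),
]
model = train_nb(samples)
```

The returned probability is exactly the per-site probability factor used to seed unlabeled random variable nodes in the graphical model.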
  • the content-based classification engine 108 may generate probability factors for random variable nodes in the graphical model 210.
  • the content-based classifier may examine content attributes of the input website and predict the maliciousness of the input website in the form of a maliciousness probability.
  • the content-based classification engine 108 may thus generate a maliciousness probability for a website corresponding to a random variable node in the graphical model, which the graphical model construction engine 110 may use as a probability factor to seed the random variable node in the graphical model 210.
  • the content-based classification engine 108 may generate probability factors for websites not specified in the blacklist 310, not specified in a whitelist, or both. That is, the content-based classification engine 108 may generate probability factors for websites with an unknown or undetermined malicious characterization, and the graphical model construction engine 110 may seed the probability factors generated by the content-based classification engine 108 for these websites in the graphical model 210.
  • the graphical model 210 includes random variable nodes (representing malicious websites specified in the blacklist 310) seeded according to the blacklist 310, random variable nodes (representing non-malicious websites specified in a whitelist) seeded according to the whitelist, as well as random variable nodes (representing websites not specified in the blacklist 310 or the whitelist) seeded according to the content-based classification engine 108.
  • the probability factors seeded in the graphical model 210 may also be referred to as priors, e.g., probability factors determined prior to any graphical model inference or prior to accounting for the dependence structure of the graphical model 210.
  • the graphical model construction engine 110 may incorporate content-based classification of websites into the graphical model 210. Doing so may allow a system to discover malicious content that incorporates both global and local web content considerations, which may increase the accuracy of malicious web content identification. Moreover, the local content extraction and classification may be specifically combined with the global hyperlink modeling through the graphical model, providing an efficient and accurate mechanism to infer malicious web content.
  • Figure 5 shows an example of graphical model inference that the inference engine 112 may perform to discover malicious web content from a graphical model.
  • the inference engine 112 may perform a graphical model inference on the graphical model 210 to discover malicious web content.
  • the inference engine 112 may determine a marginal probability distribution of random variable nodes in the graphical model 210 and may apply any inference method to do so.
  • Example inference methods the inference engine 112 may apply include belief propagation methods, exact inference, MCMC, Gibbs sampling, junction tree methods, variational methods, and the like.
  • the inference engine 112 may adjust the probability factors for the random variable nodes, for example through the determination or adjusting of marginal probabilities, probability distribution adjustments, or via maximum a posteriori (MAP) probabilities.
  • the inference engine 112 may, from the adjusted probability factors, discover malicious content. For instance, the inference engine 112 may identify any website represented by a random variable node with an adjusted probability factor (e.g., marginal probability, probability distribution, or MAP probability) that meets a malicious criterion.
  • Example malicious criteria include exceeding a particular threshold probability value (e.g., greater than a 50% probability or a probability factor value greater than 0.5), falling within a particular maliciousness probability range, a probability distribution with a threshold lower or upper range, or any other configurable criterion to categorize a website as malicious based on an adjusted probability factor.
  • the inference engine 112 generates a ranked list of discovered malicious websites 510, for example ranked according to the adjusted probability factors.
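The inference step can be sketched with sum-product belief propagation, one of the methods listed above. The homophily edge potential (0.9 agreement vs. 0.1 disagreement), the seed priors, and the three-node example are assumed values for illustration, not the patent's parameters.

```python
# Sketch: loopy belief propagation on a pairwise model whose nodes take
# state 0 (benign) or 1 (malicious). Linked sites are assumed likely to
# share the same state (homophily).

HOMOPHILY = [[0.9, 0.1], [0.1, 0.9]]  # psi[x_i][x_j]: agreement favored

def infer(adjacency, priors, iterations=10):
    """Return adjusted marginals {node: [P(benign), P(malicious)]}."""
    # messages[(i, j)][x_j]: message from node i to its neighbor j
    messages = {(i, j): [1.0, 1.0]
                for i in adjacency for j in adjacency[i]}
    for _ in range(iterations):
        new = {}
        for (i, j) in messages:
            msg = []
            for xj in (0, 1):
                total = 0.0
                for xi in (0, 1):
                    prod = priors[i][xi] * HOMOPHILY[xi][xj]
                    for k in adjacency[i]:
                        if k != j:
                            prod *= messages[(k, i)][xi]
                    total += prod
                msg.append(total)
            z = msg[0] + msg[1]
            new[(i, j)] = [msg[0] / z, msg[1] / z]
        messages = new
    beliefs = {}
    for i in adjacency:
        b = [priors[i][0], priors[i][1]]
        for k in adjacency[i]:
            b[0] *= messages[(k, i)][0]
            b[1] *= messages[(k, i)][1]
        z = b[0] + b[1]
        beliefs[i] = [b[0] / z, b[1] / z]
    return beliefs

# Two blacklisted sites (A, B) both link to an unknown site C.
adj = {"A": {"C"}, "B": {"C"}, "C": {"A", "B"}}
priors = {"A": [0.03, 0.97], "B": [0.03, 0.97], "C": [0.5, 0.5]}
beliefs = infer(adj, priors)
ranked = sorted(beliefs, key=lambda n: beliefs[n][1], reverse=True)
```

In this example C's malicious marginal rises well above its uninformative 0.5 prior purely through its hyperlinks to blacklisted neighbors, which is the global, structure-driven discovery the description emphasizes; `ranked` is the ranked list of candidates by adjusted probability factor.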
  • the graphical model inference may support discovery of malicious web content (e.g., web content that was not previously known, identified, or categorized as malicious).
  • the inference engine 112 may filter websites already known as malicious from any listing of discovered malicious websites. For instance, the inference engine 112 may filter, from the ranked list of discovered malicious websites 510, the malicious websites 311 and 312 specified in the blacklist 310 or any other known malicious websites.
  • a system 100 may support discovery of malicious web content through a content-based classification engine 108, graphical model construction engine 110, and inference engine 112.
  • the system 100 may map the hyperlinked structure of the web, by which the system 100 may account for how the linked nature of the web and dependencies, homophily, and malicious websites linking to other malicious websites impact the probability that a particular website is malicious.
  • the system 100 may also consider content-based features of the particular website, specifically through seeding the random variable node representing the particular website through a content-based classifier. As such, through the particular combination of a graphical model seeded with a content-based classifier, the system 100 may support determination of malicious web content with increased accuracy and efficiency.
  • Figure 6 shows an example of logic 600 that a system or device may implement to provide malicious content discovery through graphical model inference.
  • a system may implement the logic 600 as hardware, executable instructions stored on a machine-readable medium, or as combinations of both.
  • the system implements the logic 600 through the content-based classification engine 108, the graphical model construction engine 110, and the inference engine 112, by which the system may perform or execute the logic 600 as a method to discover malicious web content through graphical model inference.
  • the system may access a blacklist and a whitelist (602), for example by accessing or retrieving the blacklist or whitelist from a security organization or other source.
  • the system may extract content features from the malicious websites specified in the blacklist and the non-malicious websites specified in the whitelist (604), through which the system may train a content-based classifier to predict the maliciousness of an input website through a probability factor (606). That is, the system may train the content-based classifier according to the content extracted from the malicious websites specified in the blacklist, according to content extracted from whitelisted or non-malicious websites, or a combination of both.
  • the content-based classifier may generate a probability that a particular input website is malicious, which may be provided as a probability factor for seeding a graphical model (as discussed below regarding 614).
  • the system may obtain a web graph (608) and construct a graphical model from the web graph (610).
  • the graphical model may reflect the hyperlinked structure of the web, and also convey a dependence structure for random variable nodes of the graphical model.
  • the system may seed the random variable nodes in the graphical model with probability factors (which, prior to inference, may also be referred to as priors).
  • the system may seed random variable nodes of the graphical model representing any of the malicious websites specified in the blacklist (612).
  • the system may also seed the random variable nodes of the graphical model representing websites not specified in the blacklist, and do so specifically using the probability factors generated by the content-based classifier (614).
  • the system may perform graphical model inference on the graphical model to adjust the probability factors (616) and generate a ranked list of discovered malicious websites (618) from the graphical model inference.
  • Figure 7 shows another example of logic 700 that a system or device may implement to provide malicious content discovery through graphical model inference.
  • a system may implement the logic 700 as hardware, executable instructions stored on a machine-readable medium, or as combinations of both.
  • the system implements the logic 700 through the content-based classification engine 108, the graphical model construction engine 110, and the inference engine 112, by which the system may perform or execute the logic 700 as a method to discover malicious web content through graphical model inference.
  • the system may construct a graphical model from a web graph, wherein random variable nodes in the graphical model represent websites and links between the random variable nodes in the graphical model represent hyperlinks between the websites (702).
  • the system may seed a first random variable node in the graphical model that represents a malicious website specified in a blacklist with a probability factor based on the blacklist (704), e.g., based on the fact that the malicious website is specified in the blacklist.
  • the system may also seed a second random variable node in the graphical model that represents a different website not specified in the blacklist with a probability factor generated by a content-based classifier trained through content extraction of malicious websites specified in a blacklist (706).
  • the system may perform a graphical model inference on the graphical model constructed from the web graph to adjust probability factors for the random variable nodes of the graphical model and identify the different website represented by the second random variable node as a discovered malicious website when the adjusted probability factor of the second random variable node exceeds a maliciousness probability threshold (708).
  • the maliciousness probability threshold may be configurable or user-specified, for example.
  • FIG 8 shows an example of a device 800 that supports discovery of malicious web content through graphical model inference.
  • the device 800 may include a processing resource 810, which may take the form of a single or multiple processors.
  • the processors may include a central processing unit (CPU), microprocessor, or any hardware device suitable for executing instructions stored on a machine-readable medium, such as the machine-readable medium 820 shown in Figure 8.
  • the machine-readable medium 820 may be any non-transitory electronic, magnetic, optical, or other physical storage device that stores executable instructions, such as the instructions 822, 824, 826, 828, 830, and 832 shown in Figure 8.
  • the machine-readable medium 820 may be, for example, Random Access Memory (RAM) such as dynamic RAM (DRAM), flash memory, memristor memory, spin-transfer torque memory, an Electrically-Erasable Programmable Read-Only Memory (EEPROM), a storage drive, an optical disk, and the like.
  • the device 800 may execute instructions stored on the machine-readable medium 820 through the processing resource 810. Executing the instructions may cause the device 800 to perform any of the malicious web content discovery features described herein, including according to any features of the content-based classification engine 108, graphical model construction engine 110, inference engine 112, logic 600 and 700, or any combination thereof.
  • execution of the instructions 822, 824, 826, 828, 830, and 832 by the processing resource 810 may cause the device 800 to train a content-based classifier to generate a probability factor that a web page is malicious through content extraction from malicious web pages specified in a blacklist, the content extraction including hyperlinks of the malicious web pages, domain information of the malicious web pages, page content of the malicious web pages, or any combination thereof; construct a graphical model from a web graph, wherein random variable nodes in the graphical model represent web pages and links between the random variable nodes in the graphical model represent hyperlinks between the web pages; seed random variable nodes in the web graph that represent web pages not specified in the blacklist through probability factors generated by the content-based classifier; seed random variable nodes in the web graph that represent the malicious web pages specified in the blacklist with probability factors determined without use of the content-based classifier; and perform a graphical model inference on the graphical model constructed from the web graph and seeded with the probability factors to adjust the probability factors for the random variable nodes of the graphical model and generate a list of discovered malicious web pages from the adjusted probability factors.
  • the machine-readable medium 820 may further include instructions executable by the processing resource 810 to access multiple blacklists from different sources and seed the random variable nodes in the web graph that represent malicious web pages specified in the blacklist with probability factors determined based on the multiple blacklists.
  • the machine-readable medium 820 may further include instructions executable by the processing resource 810 to generate a ranked list of discovered malicious web pages ranked according to the adjusted probability factors, and further to filter the malicious web pages specified in the blacklist from the ranked list of discovered malicious web pages.
  • the systems, methods, devices, and logic described above, including the content-based classification engine 108, graphical model construction engine 110, and inference engine 112 may be implemented in many different ways in many different combinations of hardware, logic, circuitry, and executable instructions stored on a machine-readable medium.
  • the content-based classification engine 108, graphical model construction engine 110, and inference engine 112, or combinations thereof may include circuitry in a controller, a microprocessor, or an application specific integrated circuit (ASIC), or may be implemented with discrete logic or components, or a combination of other types of analog or digital circuitry, combined on a single integrated circuit or distributed among multiple integrated circuits.
  • a product such as a computer program product, may include a storage medium and machine readable instructions stored on the medium, which when executed in an endpoint, computer system, or other device, cause the device to perform operations according to any of the description above, including according to any features of the content-based classification engine 108, graphical model construction engine 110, and inference engine 112.
  • the processing capability of the systems, devices, and engines described herein, including the content-based classification engine 108, graphical model construction engine 110, and inference engine 112, may be distributed among multiple system components, such as among multiple processors and memories, optionally including multiple distributed processing systems.
  • Parameters, databases, and other data structures may be separately stored and managed, may be incorporated into a single memory or database, may be logically and physically organized in many different ways, and may be implemented in many ways, including data structures such as linked lists, hash tables, or implicit storage mechanisms.
  • Programs may be parts (e.g., subroutines) of a single program, separate programs, distributed across several memories and processors, or implemented in many different ways, such as in a library (e.g., a shared library).


Abstract

In some examples, a method includes constructing a graphical model from a web graph. Random variable nodes in the graphical model may represent websites, and links between the random variable nodes in the graphical model may represent hyperlinks between the websites. The method may also include seeding a first random variable node in the graphical model that represents a malicious website specified in a blacklist with a probability factor, and seeding a second random variable node in the graphical model that represents a different website not specified in the blacklist with a probability factor generated by a content-based classifier trained through content extraction of malicious websites specified in the blacklist. The method may further include performing graphical model inference on the constructed graphical model and identifying the different website represented by the second random variable node as a discovered malicious website.

Description

MALICIOUS WEB CONTENT DISCOVERY
THROUGH GRAPHICAL MODEL INFERENCE
BACKGROUND
[0001] With rapid advances in technology, electronic devices have become increasingly prevalent in society today. Laptop computers, desktop computers, mobile phones, and tablet devices are but a few examples of electronic devices allowing a user to access digital data, communicate across vast interconnected networks (such as the Internet), and execute web-based applications. Increasing the security of computing devices will further improve user experience.
BRIEF DESCRIPTION OF THE DRAWINGS
[0002] Certain examples are described in the following detailed description and in reference to the drawings.
[0003] Figure 1 shows an example of a system that supports discovery of malicious web content through graphical model inference.
[0004] Figure 2 shows an example of a graphical model that a graphical model construction engine may construct.
[0005] Figure 3 shows an example of graphical model seeding based on a blacklist of malicious web content that the graphical model construction engine may perform.
[0006] Figure 4 shows an example of graphical model seeding based on a content-based classifier that the graphical model construction engine may perform.
[0007] Figure 5 shows an example of graphical model inference that an inference engine may perform to discover malicious content from the graphical model.
[0008] Figure 6 shows an example of logic that a system or device may implement to provide malicious content discovery through graphical model inference.
[0009] Figure 7 shows another example of logic that a system or device may implement to provide malicious content discovery through graphical model inference.
[0010] Figure 8 shows an example of a device that supports discovery of malicious web content through graphical model inference.
DETAILED DESCRIPTION
[0011] The disclosure herein may provide systems, methods, devices, and logic that support discovery of malicious web content, such as malicious websites, malicious web domains, malicious web pages, malicious web hosts, or combinations thereof. As described in greater detail below, a graphical model may be constructed using a web graph, which may capture the hyperlinked structure of interconnected web resources (e.g., interlinked websites, the Internet, or the World Wide Web). The constructed graphical model may be seeded with probability factors determined through a blacklist of malicious websites, generated by a content-based classifier trained through content extraction of the malicious websites specified in the blacklist, or through combinations of both. Thus, the discovery features described herein may account for global factors through mapping of the hyperlinked structure of interconnected web resources and location-specific factors through content-based classification, and may methodically do so in combination to infer malicious web content. As such, the malicious web content discovery features described herein may support identification of malicious web content with increased accuracy and efficiency.
[0012] Figure 1 shows an example of a system 100 that supports discovery of malicious web content through graphical model inference. The system 100 may take the form of a computing system, including a single or multiple computing devices such as application servers, compute nodes, desktop or laptop computers, smart phones or other mobile devices, tablet devices, embedded controllers, and more.
[0013] The system 100 may discover malicious web content. Malicious web content may refer to any website, web domain, web page, or web host that provides, propagates, or includes malicious content. Malicious content may include any program or file that is intended to damage or disable computer operations, gather sensitive information, or gain unauthorized access to a computer or computer system. Thus, a malicious website may refer to a website through which malicious software, viruses, worms, trojan horses, spyware, spam content, phishing mechanisms, or any other malicious content is propagated or linked. Likewise, a malicious web page may refer to any particular web page through which malicious content is propagated. A malicious web host may refer to any web host that hosts a malicious web domain, malicious website, or malicious web page.
[0014] As described in greater detail below, the system 100 may discover malicious web content through graphical model inference performed on a graphical model constructed to reflect a hyperlinked structure of a web system and seeded with probability factors determined through a blacklist and a content-based classifier. As one example, the system 100 shown in Figure 1 includes the content-based classification engine 108, the graphical model construction engine 110, and the inference engine 112, through which the system 100 may discover malicious web content through graphical model inference.
[0015] The system 100 may implement the engines 108, 110, and 112 (and components thereof) in various ways, for example as hardware and programming. The programming for the engines 108, 110, and 112 may take the form of processor-executable instructions stored on a non-transitory machine-readable storage medium and the hardware for the engines 108, 110, and 112 may include a processing resource to execute those instructions. A processing resource may include a number of processors and may be implemented through a single processor or multi-processor architecture. In some examples, the system 100 implements multiple engines using the same system features or hardware components (e.g., a common processing resource).
[0016] In the example shown in Figure 1 , the content-based classification engine 108 includes an engine component to generate a probability factor for a particular website, and the content-based classification engine 108 may be trained through content extraction from malicious websites specified in a blacklist. The graphical model construction engine 110 includes engine components to construct a graphical model from a web graph, wherein random variable nodes in the graphical model represent websites and links between the random variable nodes in the graphical model represent hyperlinks between the websites and seed a random variable node representing a particular website not specified in the blacklist with the probability factor generated from the content-based classification engine. The inference engine 112 includes engine components to perform a graphical model inference on the graphical model constructed from the web graph and seeded with the probability factor generated by the content-based classification engine to adjust probability factors for the random variable nodes of the graphical model and generate a list of discovered malicious websites from the adjusted probability factors.
[0017] Some example features relating to malicious content discovery through graphical model inference are described in greater detail next. Many of the following examples are described specifically with regards to discovery of malicious websites. However, any of malicious web content discovery features described herein may be consistently applied to any other abstraction levels of malicious web content, such as for malicious web pages or malicious web hosts as examples.
[0018] Figure 2 shows an example of a graphical model that the graphical model construction engine 110 may construct. A graphical model may also be referred to as a probabilistic graphical model, and may be used to express conditional dependence structure between random variables. The graphical model construction engine 110 may construct a graphical model to model a hyperlinked structure of interconnected web resources, such as the World Wide Web, an enterprise intranet, or various other types of interconnected resources (or portions thereof). To do so, the graphical model construction engine 110 may obtain a web graph 202, which may be any graph or data structure that indicates hyperlinks between web pages, websites, or other web resources of a structure of interconnected web resources, such as the World Wide Web.
[0019] To obtain the web graph 202, the graphical model construction engine 110 itself may perform crawling operations to map out links of selected portions of the web (e.g., including particular websites or pages). As another example, the graphical model construction engine 110 may otherwise obtain the web graph 202 from a web crawler or other information source. Then, the graphical model construction engine 110 may construct the graphical model from the web graph by associating random variables with the websites specified in the web graph (or web pages, depending on the web abstraction level for the malicious content discovery).
[0020] In the example shown in Figure 2, the graphical model construction engine 110 constructs the graphical model 210. A graphical model may include nodes for the random variables mapped in the graphical model, which may also be referred to as random variable nodes. In particular, the graphical model construction engine 110 may plot the webpages and websites as the random variable nodes of the graphical model and plot the hyperlinks between the webpages and websites as edges between the random variable nodes.
[0021] For the graphical model 210 constructed by the graphical model construction engine 110, the random variable of each random variable node may indicate the probability that a particular website is malicious (and may thus be referred to as a probability factor). That is, a probability factor of a random variable node may refer to a probability function or value that particular web content represented by the random variable node is malicious. The probability factor for a node of the graphical model 210 may be represented in various ways and include any number of data types, such as a probability function, a probability value (e.g., within a range of 0-1), a probability distribution, and the like.
[0022] In some examples, the random variable nodes of the graphical model 210 each represent a particular website. To illustrate through the example shown in Figure 2, the graphical model 210 includes the random variable nodes labeled as 211 and 212, which correspond to and represent the website 221 and the website 222 respectively (and the websites 221 and 222 may include multiple web pages). Edges between random variable nodes in the graphical model 210 may represent hyperlinks between the websites represented by the random variable nodes. As such, the random variable nodes 211 and 212 are joined by an edge in the graphical model 210, indicating that the websites 221 and 222 are hyperlinked to one another, e.g., at least one webpage of the website 221 links to at least one webpage of the website 222 or vice versa. In some examples, the graphical model construction engine 110 uses an undirected model, such as a Markov random field, in modeling the random variable nodes and edges in the graphical model 210.
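By way of illustration only, the node-and-edge construction described above may be sketched as follows. This is a minimal, non-limiting example: the function name `build_graphical_model`, the adjacency-dictionary input format, and the placeholder probability factors are illustrative assumptions rather than the engine's actual implementation.

```python
def build_graphical_model(web_graph):
    """Turn a web graph (site -> set of hyperlinked sites) into an
    undirected pairwise model: one random variable node per website,
    one edge per hyperlink. Each node carries a probability factor
    (seeded later) expressing [P(benign), P(malicious)]."""
    nodes = {site: {"factor": None} for site in web_graph}
    edges = set()
    for site, links in web_graph.items():
        for target in links:
            if target in nodes and target != site:
                # undirected edge: order the pair so each link is stored once
                edges.add(tuple(sorted((site, target))))
    return nodes, edges
```

Because the pair is sorted before insertion into a set, a hyperlink in either direction yields the same single undirected edge, consistent with a Markov random field formulation.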
[0023] By modeling the hyperlinked structure of the web through a graphical model, the graphical model construction engine 110 may account for an impact that other websites linked to a particular website may have in terms of hosting or propagating malicious content. The graphical model 210 may exploit the concept of homophily that a web entity is likely (in probabilistic terms) to be associated with similar entities. In the context of a hyperlinked structure of web resources, the graphical model 210 may be used to imply, probabilistically express, or infer that malicious websites are likely to have hyperlinks to other malicious websites and non-malicious websites are likely to have hyperlinks to other non-malicious websites. Thus, through the graphical model 210, a system 100 may exploit global information to determine (e.g., infer) a maliciousness probability for websites represented in the graphical model 210.
[0024] In discovering malicious web content, the graphical model construction engine 110 may seed the random variable nodes of the graphical model 210 with probability factors, including according to a blacklist and according to a content-based classifier. Examples by which the graphical model 210 is seeded with probability factors are described in greater detail next through Figures 3 and 4.
[0025] Figure 3 shows an example of graphical model seeding based on a blacklist of malicious web content that the graphical model construction engine 110 may perform. To do so, the graphical model construction engine 110 may access a blacklist 310, which may indicate specific web content (e.g., websites, web pages, domains, web hosts, etc.) determined or known to include malicious content. The blacklist 310 may, for example, specify the Uniform Resource Locator (URL) or otherwise identify web content that has been determined to include malware, to be a phishing website, or to be malicious according to any other maliciousness categorization. The graphical model construction engine 110 may receive the blacklist 310 from any listing source, such as various security organizations that distribute such blacklists.
[0026] In Figure 3, the blacklist 310 specifies two malicious websites shown as the malicious website 311 and the malicious website 312. The graphical model construction engine 110 may seed particular random variable nodes in the graphical model 210 according to the specification of the malicious websites 311 and 312 in the blacklist 310. That is, the graphical model construction engine 110 may seed particular random variable nodes in the graphical model that represent the malicious websites 311 and 312 specified in the blacklist 310, which in Figure 3 are the random variable nodes 331 and 332 respectively.
[0027] The graphical model construction engine 110 may seed a random variable node in the graphical model 210 with a random variable referred to as a probability factor. As noted above, the probability factor may specify a probability that the website represented by the random variable node is malicious (also referred to as a maliciousness probability). As an example, the graphical model construction engine 110 may seed a probability factor as a vector of two values, a first probability that the website represented by a random variable node is malicious and a second probability that the website represented by the random variable node is not malicious. In this example, the sum of the first and second values may be a value of 1. As other examples, the probability factor may specify a single maliciousness probability value, a maliciousness probability distribution, or any other probabilistic expression. In some examples, the probability factor is a probability distribution over several values, categories, or classifications of maliciousness, such as a distribution of probabilities that a website is non- malicious, includes malware, is a phishing site, etc.
[0028] For the malicious websites 311 and 312 specified in the blacklist 310, the graphical model construction engine 110 may seed the corresponding random variable nodes 331 and 332 with a probability factor indicative of a high probability of maliciousness. The graphical model construction engine 110 may seed random variable nodes representing malicious websites identified in a blacklist with probability factors within a predetermined high-maliciousness probability range indicative of a high probability of including malicious content, for example a high-maliciousness probability range of .95-.99. Within the high-maliciousness probability range, the graphical model construction engine 110 may determine the particular value of the probability factor according to a confidence level of the source of the blacklist (which may be based on a reputation of the blacklist source, other information about the malicious website, or any number of other factors). For each malicious website specified in a blacklist, the graphical model construction engine 110 may seed the random variable node that represents the malicious website accordingly.
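The blacklist seeding just described might be sketched as follows. The two-value probability factor [P(benign), P(malicious)] and the .95-.99 range come from the description above; the linear interpolation of that range by a source-confidence value is one possible policy, not necessarily the one the engine uses, and the function name is illustrative.

```python
HIGH_MALICIOUSNESS = (0.95, 0.99)  # example range from the description

def seed_from_blacklist(nodes, blacklist, confidence):
    """Seed blacklisted sites with a two-value probability factor
    [P(benign), P(malicious)], placing P(malicious) inside the
    high-maliciousness range according to source confidence in [0, 1]."""
    lo, hi = HIGH_MALICIOUSNESS
    for site in blacklist:
        if site in nodes:
            # higher-confidence sources push the factor toward the top of the range
            p_mal = lo + confidence * (hi - lo)
            nodes[site]["factor"] = [1.0 - p_mal, p_mal]
```

Whitelist seeding would mirror this with a low-maliciousness range (e.g., .01-.05), and a multi-blacklist policy could combine per-source confidences before interpolating.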
[0029] In some examples, the graphical model construction engine 110 seeds a random variable node according to multiple blacklists. The graphical model construction engine 110 may access multiple blacklists, for example through retrieval from multiple, different sources (e.g., different security organizations). The graphical model construction engine 110 may weight the impact upon the probability factor value determination according to a confidence level for a particular source. In determining a probability factor for a random variable node that represents a particular malicious website, the graphical model construction engine 110 may account for, as example factors, the number of blacklists a particular malicious website appears in, the confidence level of sources of the blacklists in which the malicious website appears, and more.
[0030] Additionally or alternatively to the blacklist 310, the graphical model construction engine 110 may seed random variable nodes according to a whitelist. A whitelist may refer to any listing or identification of non-malicious web content, which may be verified or authenticated by a security organization or other entity as to not include or propagate malicious content. In such examples, the graphical model construction engine 110 may seed random variable nodes representing web content identified in the whitelist with a probability factor indicative of a low probability of maliciousness. For instance, the graphical model construction engine 110 may assign a probability factor to such random variable nodes within a low-maliciousness probability range (for example, between .01- .05). Along similar lines as malicious web-content identified in the blacklist 310, the graphical model construction engine 110 may determine a probability factor value within the low-maliciousness probability range accounting for any number of factors based on the whitelist(s), source(s) of the whitelist, or various other factors.
[0031] As described above, the graphical model construction engine 110 may seed random variable nodes in the graphical model that represent malicious websites identified in a blacklist, non-malicious websites specified in a whitelist, or combinations of both. Thus, the graphical model construction engine 110 may seed the graphical model 210 with probability factors for web content previously known or identified as being malicious or non-malicious, e.g., with a known or determined maliciousness characterization. However, some of the random variable nodes in a graphical model may represent websites not previously identified as malicious or non-malicious (e.g., not specified in a blacklist or a whitelist accessed by the graphical model construction engine 110). For these random variable nodes that represent websites with an undetermined maliciousness characterization, the graphical model construction engine 110 may seed such random variable nodes with a probability factor generated by a content-based classifier, as described next.
[0032] Figure 4 shows an example of graphical model seeding based on a content-based classifier that the graphical model construction engine 110 may perform. In some examples, the content-based classification engine 108 implements a content-based classifier which generates a probability that a particular input website is malicious (e.g., a probability factor). The content-based classification engine 108 may train a content-based classifier through extracting content characteristics of malicious websites, such as malicious websites known, determined, or identified through a blacklist.
[0033] In the example shown in Figure 4, the content-based classification engine 108 accesses the blacklist 310 and extracts content of the malicious websites 311 and 312 specified in the blacklist 310. By extracting the local content features of identified or known malicious websites, the content-based classification engine 108 may track specific attributes, characteristics, and content of malicious websites to predict the maliciousness of other websites not specified in the blacklist 310.
[0034] The content-based classification engine 108 may extract various types of content from the malicious websites 311 and 312. As one example, the content-based classification engine 108 may extract lexical features of the malicious websites 311 and 312, such as specific web page content, URL characteristics, images or visual characteristics, etc. The content-based classification engine 108 may do so based on a bag of words model, for example. As another example, the content-based classification engine 108 may extract host features of the malicious websites 311 and 312, which may include host information obtained through Domain Name Service (DNS) requests such as a host name, domain registration time, owner information, and the like. The malicious websites 311 and 312 specified in the blacklist 310 and the extracted content features may provide a training set by which the content-based classification engine 108 trains a classifier.
[0035] In some examples, the content-based classification engine 108 accesses a whitelist of web content known to not contain any malicious content or verified as authentic and non-malicious. In a similar manner as described above, the content-based classification engine 108 may extract content from whitelisted websites identified as non-malicious to extract lexical and/or host features of non-malicious websites, for example to include in the training set for the content-based classifier.
[0036] From the content extraction from malicious websites, non-malicious websites, or both, the content-based classification engine 108 may train a content-based classifier to generate a probability that an input website is malicious. To do so, the content-based classification engine 108 may employ any number of machine learning models, including classifiers trained using naïve Bayes methods, support vector machine techniques, logistic regression, neural networks, and more. In some examples, the content-based classification engine 108 obtains labels for training from the blacklist 310 and a whitelist, and subsampling is used to ensure that no imbalance occurs in the training set (e.g., the number of samples per label is within a numerical threshold from one another).
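One of the classifier families named above, a naïve Bayes model over bag-of-words lexical features, can be sketched in plain Python. The class name, the token-list input format, and the use of Laplace smoothing are illustrative choices, not details drawn from the description.

```python
import math
from collections import Counter

class ContentClassifier:
    """Bag-of-words naive Bayes: labels are 0 (benign) and 1 (malicious)."""

    def train(self, samples):
        # samples: list of (token_list, label) pairs extracted from pages
        self.counts = {0: Counter(), 1: Counter()}
        self.docs = Counter()
        for tokens, label in samples:
            self.counts[label].update(tokens)
            self.docs[label] += 1
        self.vocab = set(self.counts[0]) | set(self.counts[1])

    def malicious_probability(self, tokens):
        # log P(label) + sum of log P(token | label), Laplace-smoothed
        scores = {}
        total_docs = sum(self.docs.values())
        for label in (0, 1):
            total = sum(self.counts[label].values())
            score = math.log(self.docs[label] / total_docs)
            for tok in tokens:
                score += math.log(
                    (self.counts[label][tok] + 1) / (total + len(self.vocab)))
            scores[label] = score
        # convert the two log scores back to a normalized probability
        m = max(scores.values())
        exp = {k: math.exp(v - m) for k, v in scores.items()}
        return exp[1] / (exp[0] + exp[1])
```

The returned value is exactly the kind of maliciousness probability the description says the content-based classifier supplies as a seed (prior) for unlabeled websites.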
[0037] Upon training a content-based classifier, the content-based classification engine 108 may generate probability factors for random variable nodes in the graphical model 210. The content-based classifier may examine content attributes of the input website and predict the maliciousness of the input website in the form of a maliciousness probability. The content-based classification engine 108 may thus generate a maliciousness probability for a website corresponding to a random variable node in the graphical model, which the graphical model construction engine 110 may use as a probability factor to seed the random variable node in the graphical model 210. Accordingly, the content-based classification engine 108 may generate probability factors for websites not specified in the blacklist 310, not specified in a whitelist, or both. That is, the content-based classification engine 108 may generate probability factors for websites with an unknown or undetermined malicious characterization, and the graphical model construction engine 110 may seed the probability factors generated by the content-based classification engine 108 for these websites in the graphical model 210.
[0038] In the example shown in Figure 4, the graphical model 210 includes random variable nodes (representing malicious websites specified in the blacklist 310) seeded according to the blacklist 310, random variable nodes (representing non-malicious websites specified in a whitelist) seeded according to the whitelist, as well as random variable nodes (representing websites not specified in the blacklist 310 or the whitelist) seeded according to the content-based classification engine 108. The probability factors seeded in the graphical model 210 may also be referred to as priors, e.g., probability factors determined prior to any graphical model inference or prior to accounting for the dependence structure of the graphical model 210.
[0039] Through the probability factor seeding features described above, the graphical model construction engine 110 may incorporate content-based classification of websites into the graphical model 210. Doing so may allow a system to discover malicious content that incorporates both global and local web content considerations, which may increase the accuracy of malicious web content identification. Moreover, the local content extraction and classification may be specifically combined with the global hyperlink modeling through the graphical model, providing an efficient and accurate mechanism to infer malicious web content. Some graphical model inference features are described next in Figure 5.
[0040] Figure 5 shows an example of graphical model inference that the inference engine 112 may perform to discover malicious web content from a graphical model. Upon seeding of the graphical model 210 with probability factors by the graphical model construction engine 110, the inference engine 112 may perform a graphical model inference on the graphical model 210 to discover malicious web content. For example, the inference engine 112 may determine a marginal probability distribution of random variable nodes in the graphical model 210 and may apply any suitable inference method to do so. Example inference methods the inference engine 112 may apply include belief propagation methods, exact inference, Markov chain Monte Carlo (MCMC) sampling, Gibbs sampling, junction tree methods, variational methods, and the like.
[0041] Through the graphical model inference, the inference engine 112 may adjust the probability factors for the random variable nodes, for example through the determination or adjusting of marginal probabilities, probability distribution adjustments, or via maximum a posteriori (MAP) probabilities. The inference engine 112 may, from the adjusted probability factors, discover malicious content. For instance, the inference engine 112 may identify any website represented by a random variable node with an adjusted probability factor (e.g., marginal probability, probability distribution, or MAP probability) that meets a malicious criterion.
[0042] Example malicious criteria include exceeding a particular threshold probability value (e.g., greater than a 50% probability or a probability factor value that is greater than a value of 0.5), falling within a particular maliciousness probability range, having a probability distribution with a threshold lower range or upper range, or any other configurable criterion to categorize a website as malicious based on an adjusted probability factor. In some examples, the inference engine 112 generates a ranked list of discovered malicious websites 510, for example ranked according to the adjusted probability factors.
[0043] The graphical model inference may support discovery of malicious web content (e.g., web content that was not previously known, identified, or categorized as malicious). As such, the inference engine 112 may filter websites already known as malicious from any listing of discovered malicious websites. For instance, the inference engine 112 may filter, from the ranked list of discovered malicious websites 510, the malicious websites 311 and 312 specified in the blacklist 310 or any other known malicious websites.
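The thresholding, ranking, and filtering steps of paragraphs [0042] and [0043] can be sketched together. The 0.5 threshold and the site names are illustrative assumptions; the disclosure describes the threshold as configurable.

```python
def discovered_malicious(marginals, blacklist, threshold=0.5):
    """Rank sites whose adjusted P(malicious) meets the criterion,
    filtering out sites already known malicious from the blacklist."""
    hits = [(p, site) for site, p in marginals.items()
            if p > threshold and site not in blacklist]
    # Descending order of adjusted probability factor.
    return [site for p, site in sorted(hits, reverse=True)]

marginals = {"known-bad": 0.97, "new-bad": 0.82, "benign": 0.12}
ranked = discovered_malicious(marginals, blacklist={"known-bad"})
```

Only `new-bad` survives: `benign` falls below the threshold, and `known-bad` is filtered because it was already specified in the blacklist, leaving a list of newly discovered malicious sites.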
[0044] As described above, a system 100 may support discovery of malicious web content through a content-based classification engine 108, graphical model construction engine 110, and inference engine 112. Through construction of a graphical model from a web graph, the system 100 may map the hyperlinked structure of the web, by which the system 100 may account for how the linked nature of the web, including dependencies, homophily, and the tendency of malicious websites to link to other malicious websites, impacts the probability that a particular website is malicious. The system 100 may also consider content-based features of the particular website, specifically through seeding the random variable node representing the particular website through a content-based classifier. As such, through the particular combination of a graphical model seeded with a content-based classifier, the system 100 may support determination of malicious web content with increased accuracy and efficiency.
[0045] Figure 6 shows an example of logic 600 that a system or device may implement to provide malicious content discovery through graphical model inference. A system may implement the logic 600 as hardware, executable instructions stored on a machine-readable medium, or as combinations of both. In some examples, the system implements the logic 600 through the content-based classification engine 108, the graphical model construction engine 110, and the inference engine 112, by which the system may perform or execute the logic 600 as a method to discover malicious web content through graphical model inference.
[0046] The system may access a blacklist and a whitelist (602), for example by accessing or retrieving the blacklist or whitelist from a security organization or other source. The system may extract content features from the malicious websites specified in the blacklist and the non-malicious websites specified in the whitelist (604), through which the system may train a content-based classifier to predict the maliciousness of an input website through a probability factor (606). That is, the system may train the content-based classifier according to the content extracted from the malicious websites specified in the blacklist, according to content extracted from whitelisted or non-malicious websites, or a combination of both. As noted above, the content-based classifier may generate a probability that a particular input website is malicious, which may be provided as a probability factor for seeding a graphical model (as discussed below regarding 614).
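The training steps 604 and 606 can be sketched with a toy content-based classifier. This is a hedged illustration only: a naive Bayes model over bag-of-words features is one possible choice (the disclosure does not fix a classifier type), and the training pages and tokens are made up. A real system would extract hyperlinks, domain information, and page content as described above.

```python
from collections import Counter

def train(malicious_pages, benign_pages):
    """Train a naive Bayes classifier with add-one smoothing; returns a
    function mapping a page's text to P(malicious)."""
    mal = Counter(t for page in malicious_pages for t in page.split())
    ben = Counter(t for page in benign_pages for t in page.split())
    vocab = set(mal) | set(ben)
    mal_total, ben_total = sum(mal.values()), sum(ben.values())

    def prob_malicious(page):
        # Equal class priors; multiply smoothed per-token likelihoods.
        pm = pb = 1.0
        for t in page.split():
            pm *= (mal[t] + 1) / (mal_total + len(vocab))
            pb *= (ben[t] + 1) / (ben_total + len(vocab))
        return pm / (pm + pb)

    return prob_malicious

classifier = train(
    malicious_pages=["free casino winner click", "winner prize click now"],
    benign_pages=["weather news today", "local news report"],
)
score = classifier("casino winner")  # probability factor for seeding (614)
```

The returned probability can then be used directly as the seeded probability factor for a random variable node representing a website not specified in the blacklist.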
[0047] Turning to graphical model construction, the system may obtain a web graph (608) and construct a graphical model from the web graph (610). The graphical model may reflect the hyperlinked structure of the web, and also convey a dependence structure for random variable nodes of the graphical model. The system may seed the random variable nodes in the graphical model with probability factors (which, prior to inference, may also be referred to as priors). In that regard, the system may seed random variable nodes of the graphical model representing any of the malicious websites specified in the blacklist (612). The system may also seed the random variable nodes of the graphical model representing websites not specified in the blacklist, and do so specifically using the probability factors generated by the content-based classifier (614). Then, the system may perform graphical model inference on the graphical model to adjust the probability factors (616) and generate a ranked list of discovered malicious websites (618) from the graphical model inference.
[0048] Figure 7 shows another example of logic 700 that a system or device may implement to provide malicious content discovery through graphical model inference. A system may implement the logic 700 as hardware, executable instructions stored on a machine-readable medium, or as combinations of both. In some examples, the system implements the logic 700 through the content-based classification engine 108, the graphical model construction engine 110, and the inference engine 112, by which the system may perform or execute the logic 700 as a method to discover malicious web content through graphical model inference.
[0049] The system may construct a graphical model from a web graph, wherein random variable nodes in the graphical model represent websites and links between the random variable nodes in the graphical model represent hyperlinks between the websites (702). In seeding the graphical model, the system may seed a first random variable node in the graphical model that represents a malicious website specified in a blacklist with a probability factor based on the blacklist (704), e.g., based on the fact that the malicious website is specified in the blacklist. The system may also seed a second random variable node in the graphical model that represents a different website not specified in the blacklist with a probability factor generated by a content-based classifier trained through content extraction of malicious websites specified in a blacklist (706).
[0050] Then, the system may perform a graphical model inference on the graphical model constructed from the web graph to adjust probability factors for the random variable nodes of the graphical model and identify the different website represented by the second random variable node as a discovered malicious website when the adjusted probability factor of the second random variable node exceeds a maliciousness probability threshold (708). The maliciousness probability threshold may be configurable or user-specified, for example.
[0051] Figure 8 shows an example of a device 800 that supports discovery of malicious web content through graphical model inference. The device 800 may include a processing resource 810, which may take the form of a single or multiple processors. The processor(s) may include a central processing unit (CPU), microprocessor, or any hardware device suitable for executing instructions stored on a machine-readable medium, such as the machine-readable medium 820 shown in Figure 8. The machine-readable medium 820 may be any non-transitory electronic, magnetic, optical, or other physical storage device that stores executable instructions, such as the instructions 822, 824, 826, 828, 830, and 832 shown in Figure 8. As such, the machine-readable medium 820 may be, for example, Random Access Memory (RAM) such as dynamic RAM (DRAM), flash memory, memristor memory, spin-transfer torque memory, an Electrically-Erasable Programmable Read-Only Memory (EEPROM), a storage drive, an optical disk, and the like.
[0052] The device 800 may execute instructions stored on the machine-readable medium 820 through the processing resource 810. Executing the instructions may cause the device 800 to perform any of the malicious web content discovery features described herein, including according to any features of the content-based classification engine 108, graphical model construction engine 110, inference engine 112, logic 600 and 700, or any combination thereof.
[0053] For example, execution of the instructions 822, 824, 826, 828, 830, and 832 by the processing resource 810 may cause the device 800 to train a content-based classifier to generate a probability factor that a web page is malicious through content extraction from malicious web pages specified in a blacklist, the content extraction including hyperlinks of the malicious web pages, domain information of the malicious web pages, page content of the malicious web pages, or any combination thereof; construct a graphical model from a web graph, wherein random variable nodes in the graphical model represent web pages and links between the random variable nodes in the graphical model represent hyperlinks between the web pages; seed random variable nodes in the web graph that represent web pages not specified in the blacklist through probability factors generated by the content-based classifier; seed random variable nodes in the web graph that represent the malicious web pages specified in the blacklist with probability factors determined without use of the content-based classifier; perform a graphical model inference on the graphical model constructed from the web graph and seeded with the probability factors to adjust the probability factors for the random variable nodes of the graphical model; and determine a discovered malicious web page from the adjusted probability factors. The probability factor of a particular random variable node in the graphical model may include, for example, a probability that a particular web page represented by the particular node is malicious.
[0054] In some examples, the machine-readable medium 820 may further include instructions executable by the processing resource 810 to access multiple blacklists from different sources and seed the random variable nodes in the web graph that represent malicious web pages specified in the blacklist with probability factors determined based on the multiple blacklists. As another example, the machine-readable medium 820 may further include instructions executable by the processing resource 810 to generate a ranked list of discovered malicious web pages ranked according to the adjusted probability factors, and further to filter the malicious web pages specified in the blacklist from the ranked list of discovered malicious web pages.
[0055] The systems, methods, devices, and logic described above, including the content-based classification engine 108, graphical model construction engine 110, and inference engine 112, may be implemented in many different ways in many different combinations of hardware, logic, circuitry, and executable instructions stored on a machine-readable medium. For example, the content-based classification engine 108, graphical model construction engine 110, and inference engine 112, or combinations thereof, may include circuitry in a controller, a microprocessor, or an application specific integrated circuit (ASIC), or may be implemented with discrete logic or components, or a combination of other types of analog or digital circuitry, combined on a single integrated circuit or distributed among multiple integrated circuits. A product, such as a computer program product, may include a storage medium and machine readable instructions stored on the medium, which when executed in an endpoint, computer system, or other device, cause the device to perform operations according to any of the description above, including according to any features of the content-based classification engine 108, graphical model construction engine 110, and inference engine 112.
[0056] The processing capability of the systems, devices, and engines described herein, including the content-based classification engine 108, graphical model construction engine 110, and inference engine 112, may be distributed among multiple system components, such as among multiple processors and memories, optionally including multiple distributed processing systems. Parameters, databases, and other data structures may be separately stored and managed, may be incorporated into a single memory or database, may be logically and physically organized in many different ways, and may be implemented in many ways, including data structures such as linked lists, hash tables, or implicit storage mechanisms. Programs may be parts (e.g., subroutines) of a single program, separate programs, distributed across several memories and processors, or implemented in many different ways, such as in a library (e.g., a shared library).
[0057] While various examples have been described above, many more implementations are possible.

Claims

1. A system comprising:
a content-based classification engine to:
generate a probability factor for a particular website, the content-based classification engine trained through content extraction from malicious websites specified in a blacklist;
a graphical model construction engine to:
construct a graphical model from a web graph, wherein random variable nodes in the graphical model represent websites and links between the random variable nodes in the graphical model represent hyperlinks between the websites;
seed a random variable node representing a particular website not specified in the blacklist with the probability factor generated from the content-based classification engine; and
an inference engine to:
perform a graphical model inference on the graphical model constructed from the web graph and seeded with the probability factor generated by the content-based classification engine to adjust probability factors for the random variable nodes of the graphical model; and
generate a list of discovered malicious websites from the adjusted probability factors.
2. The system of claim 1, wherein the graphical model construction engine is further to seed a different random variable node representing a particular malicious website specified in the blacklist with a probability factor based on the blacklist and not generated by the content-based classification engine.
3. The system of claim 1, wherein the graphical model construction engine is further to:
access multiple blacklists from different sources; and seed a different random variable node representing a particular malicious website specified in the multiple blacklists with a probability factor that is:
based on the multiple blacklists; and
not generated by the content-based classification engine.
4. The system of claim 1, wherein the probability factor of the random variable node in the graphical model includes a probability that the particular website represented by the node is malicious.
5. The system of claim 1, wherein the inference engine is further to filter a malicious website specified in the blacklist from the list of discovered malicious websites.
6. The system of claim 1, wherein the inference engine is further to rank the list of discovered malicious websites in a descending order of the adjusted probability factors.
7. A method comprising:
constructing a graphical model from a web graph, wherein random variable nodes in the graphical model represent websites and links between the random variable nodes in the graphical model represent hyperlinks between the websites;
seeding a first random variable node in the graphical model that represents a malicious website specified in a blacklist with a probability factor based on the blacklist;
seeding a second random variable node in the graphical model that represents a different website not specified in the blacklist with a probability factor generated by a content-based classifier trained through content extraction of malicious websites specified in a blacklist;
performing a graphical model inference on the graphical model constructed from the web graph to adjust probability factors for the random variable nodes of the graphical model; and
identifying the different website represented by the second random variable node as a discovered malicious website when the adjusted probability factor of the second random variable node exceeds a maliciousness probability threshold.
8. The method of claim 7, further comprising:
accessing multiple blacklists from different sources; and
wherein seeding the first random variable node that represents the malicious website specified in the multiple blacklists comprises seeding the first random variable node with a probability factor that is:
based on the multiple blacklists; and
not generated by the content-based classifier.
9. The method of claim 7, further comprising:
accessing the blacklist from a particular source; and
wherein seeding the first random variable node comprises determining the probability factor accounting for a confidence level for the particular source.
10. The method of claim 7, wherein the probability factor of a particular random variable node in the graphical model includes a probability that a particular website represented by the particular node is malicious.
11. A non-transitory machine-readable medium comprising instructions executable by a processing resource to:
train a content-based classifier to generate a probability factor that a web page is malicious through content extraction from malicious web pages specified in a blacklist, the content extraction including hyperlinks of the malicious web pages, domain information of the malicious web pages, page content of the malicious web pages, or any combination thereof;
construct a graphical model from a web graph, wherein random variable nodes in the graphical model represent web pages and links between the random variable nodes in the graphical model represent hyperlinks between the web pages;
seed random variable nodes in the web graph that represent web pages not specified in the blacklist through probability factors generated by the content- based classifier;
seed random variable nodes in the web graph that represent the malicious web pages specified in the blacklist with probability factors determined without use of the content-based classifier;
perform a graphical model inference on the graphical model constructed from the web graph and seeded with the probability factors to adjust the probability factors for the random variable nodes of the graphical model; and determine a discovered malicious web page from the adjusted probability factors.
12. The non-transitory machine-readable medium of claim 11, further comprising instructions executable by the processing resource to:
access multiple blacklists from different sources; and
wherein the instructions are executable by the processing resource to seed the random variable nodes in the web graph that represent malicious web pages specified in the blacklist with probability factors determined based on the multiple blacklists.
13. The non-transitory machine-readable medium of claim 11, wherein the probability factor of a particular random variable node in the graphical model includes a probability that a particular web page represented by the particular node is malicious.
14. The non-transitory machine-readable medium of claim 11, further comprising instructions executable by the processing resource to:
generate a ranked list of discovered malicious web pages ranked according to the adjusted probability factors.
15. The non-transitory machine-readable medium of claim 14, wherein the instructions are executable by the processing resource further to:
filter the malicious web pages specified in the blacklist from the ranked list of discovered malicious web pages.
PCT/US2015/061899 2015-11-20 2015-11-20 Malicious web content discovery through graphical model inference WO2017086992A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2015/061899 WO2017086992A1 (en) 2015-11-20 2015-11-20 Malicious web content discovery through graphical model inference

Publications (1)

Publication Number Publication Date
WO2017086992A1 true WO2017086992A1 (en) 2017-05-26

Family

ID=58717637

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2015/061899 WO2017086992A1 (en) 2015-11-20 2015-11-20 Malicious web content discovery through graphical model inference

Country Status (1)

Country Link
WO (1) WO2017086992A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8341745B1 (en) * 2010-02-22 2012-12-25 Symantec Corporation Inferring file and website reputations by belief propagation leveraging machine reputation
US8381294B2 (en) * 2005-07-14 2013-02-19 Imation Corp. Storage device with website trust indication
US20130179974A1 (en) * 2012-01-11 2013-07-11 Pratyusa Kumar Manadhata Inferring a state of behavior through marginal probability estimation
US8572740B2 (en) * 2009-10-01 2013-10-29 Kaspersky Lab, Zao Method and system for detection of previously unknown malware
US20150281244A1 (en) * 2012-10-25 2015-10-01 Beijing Qihoo Technology Company Limited Method And Apparatus For Determining Phishing Website

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109033409A (en) * 2018-08-03 2018-12-18 华北水利水电大学 A kind of pair is randomly selected method
CN109033409B (en) * 2018-08-03 2022-03-01 华北水利水电大学 Double random extraction method
CN111274507A (en) * 2020-01-21 2020-06-12 腾讯科技(深圳)有限公司 Method, device and equipment for browsing webpage content and storage medium
US20210266292A1 (en) * 2020-02-14 2021-08-26 At&T Intellectual Property I, L.P. Scoring domains and ips using domain resolution data to identify malicious domains and ips
US11533293B2 (en) * 2020-02-14 2022-12-20 At&T Intellectual Property I, L.P. Scoring domains and IPS using domain resolution data to identify malicious domains and IPS
US11711393B2 (en) 2020-10-19 2023-07-25 Saudi Arabian Oil Company Methods and systems for managing website access through machine learning
CN114553555A (en) * 2022-02-24 2022-05-27 北京字节跳动网络技术有限公司 Malicious website identification method and device, storage medium and electronic equipment
CN114553555B (en) * 2022-02-24 2023-11-07 抖音视界有限公司 Malicious website identification method and device, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
Iqbal et al. Adgraph: A graph-based approach to ad and tracker blocking
Khan et al. Defending malicious script attacks using machine learning classifiers
Ramesh et al. An efficacious method for detecting phishing webpages through target domain identification
US8972376B1 (en) Optimized web domains classification based on progressive crawling with clustering
US8229930B2 (en) URL reputation system
US7974970B2 (en) Detection of undesirable web pages
Mohammad et al. Predicting phishing websites using neural network trained with back-propagation
WO2017086992A1 (en) Malicious web content discovery through graphical model inference
CN107346326A (en) For generating the method and system of neural network model
US9922129B2 (en) Systems and methods for cluster augmentation of search results
JP2019517088A (en) Security vulnerabilities and intrusion detection and remediation in obfuscated website content
US20170039483A1 (en) Factorized models
US20180131708A1 (en) Identifying Fraudulent and Malicious Websites, Domain and Sub-domain Names
Kaytan et al. Effective classification of phishing web pages based on new rules by using extreme learning machines
Mourtaji et al. Hybrid Rule‐Based Solution for Phishing URL Detection Using Convolutional Neural Network
RU2658878C1 (en) Method and server for web-resource classification
Li et al. A minimum enclosing ball-based support vector machine approach for detection of phishing websites
US11048738B2 (en) Records search and management in compliance platforms
EP3309701A1 (en) Systems and methods for anonymous construction and indexing of visitor databases using first-party cookies
CN104239582A (en) Method and device for identifying phishing webpage based on feature vector model
Wu et al. TrackerDetector: A system to detect third-party trackers through machine learning
CN103440454B (en) A kind of active honeypot detection method based on search engine keywords
Jha et al. Intelligent phishing website detection using machine learning
US11108802B2 (en) Method of and system for identifying abnormal site visits
Tchakounte et al. Crawl-shing: A focused crawler for fetching phishing contents based on graph isomorphism

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15908979

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15908979

Country of ref document: EP

Kind code of ref document: A1