WO2016168428A1

WO2016168428A1 - Cross-site scripting detection method

Info

Publication number: WO2016168428A1
Application number: PCT/US2016/027482
Authority: WO
Inventors: Karem A. Sakallah; Thierry Sans; Sherin HAZBOUN
Original assignee: Qatar Foundation For Education, Science And Community Development
Priority date: 2015-04-15
Filing date: 2016-04-14
Publication date: 2016-10-20

Abstract

The cross-site scripting detection method (2700) inputs a web page application program into a virtual machine. Output of the virtual machine produces bytecode and a control flow graph. A state transition graph encoder uses the bytecode and control flow graph to produce transition, initial, and property (Τ,Ι,Ρ) state information. The Τ,Ι,Ρ state information is then parsed by a parser. Output of the parser is fed to a reachability verifier that produces a verified flag (OK) or alternatively a positive counter-example (CEX). The CEX is transformed into the web page application program and fed as HTML to a browser which checks for feasibility of a XSS find. A feasibility indication suggests a true positive find of the XSS malicious code. An infeasibility indication suggests a false positive find upon which, via feedback of the infeasibility indication, the reachability verifier is retuned to refine the solution.

Description

CROSS-SITE SCRIPTING DETECTION METHOD

TECHNICAL FIELD

The present invention relates to code error detection methods, and particularly to a cross-site scripting detection method that checks software, particularly web server software and web applications, to ensure that the software has no bugs and has not been altered by malicious software in a manner that would alter program flow to produce unwanted and undesirable effects.

BACKGROUND ART

Many cybersecurity problems can be traced back to flawed and/or malicious software. Systematic checking of such software is thus a potentially effective method for detecting vulnerabilities that can be exploited by an adversary to hijack control flow for malicious purposes. Despite progress that has been made in this area, there is still a need for systems and software to maintain the integrity of web sites, web servers, and client machines that access remote servers over a network.

Thus, a cross-site scripting detection method solving the aforementioned problems is desired.

DISCLOSURE OF INVENTION

The cross-site scripting detection method analyzes code in a systematic manner; i.e., in a manner which is a comprehensive analysis of the software to make sure that, under all possible inputs, it never reaches a bad state (i.e., control is hijacked) and, failing that, to explicitly identify the conditions under which it does reach a bad state. This second outcome can then be used by the software developer(s) to patch the code to eliminate the particular bug/vulnerability that led to the bad state. This process can be repeated until the code is certified to be free of errors and, thus, secure.

These and other features of the present invention will become readily apparent upon further review of the following specification and drawings.

BRIEF DESCRIPTION OF DRAWINGS

Fig. 1 is a simplified block diagram of communications between a client side browser and server side web application according to the prior art. Fig. 2 is a simplified block diagram of an XSS (cross-site scripting) attack between a client side browser and a server side web application according to the prior art.

Fig. 3 is a simplified block diagram showing typical web application processing of string input and string output according to the prior art.

Fig. 4 is a simplified block diagram showing web application processing of unsafe string input producing string output with script according to the prior art.

Fig. 5 is a logic diagram showing transition system verification for reachable states being a proper subset of good states.

Fig. 6 is a logic diagram showing transition system verification for reachable states not being a proper subset of good states.

Fig. 7 is a block diagram of a client browser interfacing with a website according to the prior art.

Fig. 8 is a flow chart showing control flow of a request to a web application.

Fig. 9 is a directed control state transition graph (STG) associated with the request in the example of Fig. 8.

Fig. 10 is a directed graph showing the full STG program variables associated with the request in the example of Fig. 8.

Fig- 11 is a control STG showing ; initial transition relation associated with the request.

Fig. 12 is a control STG showing ; first transition relation associated with the request.

Fig. 13 is a control STG showing ; second and third transition relations associated with the request.

Fig- 14 is a control STG showing ; fourth transition relation associated with the request.

Fig. 15 is a control STG showing ; fifth transition relation associated with the request.

Fig. 16 is a control STG showing ; sixth transition relation associated with the request.

Fig. 17 is a control STG showing ; seventh transition relation associated with the request.

Fig. 18 is a control STG showing ; eighth transition relation associated with the request.

Fig. 19 is a control STG showing ; ninth transition relation associated with the request.

Fig. 20 is a control STG showing ; tenth transition relation associated with the request.

Fig. 21 is a control STG showing ; eleventh transition relation associated with the request.

Fig. 22 is a control STG showing ; twelfth transition relation associated with the request. Fig. 23 is a control STG showing thirteenth and fourteenth transition relations associated with the request.

Fig. 24 is a control STG showing fifteenth transition relation associated with the request.

Fig. 25 is a control STG showing result including the fifteenth transition relation associated with the request.

Fig. 26 is a control STG showing result comprising initial state and safety property with the fifteenth transition relation associated with the request.

Fig . 27 is a circuit diagram representing the circuit representation of an abstract php program.

Fig . 28 is a flow diagram representing control and data flow of the circuit of Fig. 27.

Fig . 29 is a Boolean truth table showing first analysis of the input string.

Fig . 30 is a Boolean truth table showing second analysis of the input string.

Fig . 31 is a Boolean truth table showing third analysis of the input string.

Fig . 32 is a control flow graph showing analysis flow of the input string.

Fig . 33 is a control STG showing control state transitions associated with the request.

Fig . 34 is a block diagram showing the full STG associated with the request.

Fig . 35 is a Boolean truth table showing first analysis of the input string.

Fig . 36 is a Boolean truth table showing second analysis of the input string.

Fig . 37 is a Boolean truth table showing third analysis of the input string.

Fig . 38 is a block diagram showing the XSS catcher.

Fig . 39 is a block diagram showing the XSS catcher framework.

Fig . 40 is a flow diagram of CounterExample-Guided Abstraction Refinement

(CEGAR).

Similar reference characters denote corresponding features consistently throughout the attached drawings.

BEST MODES FOR CARRYING OUT THE INVENTION

At the outset, it should be understood by one of ordinary skill in the art that embodiments of the present method can comprise software or firmware code executing on a computer, a microcontroller, a microprocessor, or a DSP processor; state machines implemented in application specific or programmable logic; or numerous other forms without departing from the spirit and scope of the method described herein. The present method can be provided as a computer program, which includes a non-transitory machine-readable medium having stored thereon instructions that can be used to program a computer (or other electronic devices) to perform a process according to the method. The machine -readable medium can include, but is not limited to, floppy diskettes, optical disks, CD-ROMs, and magneto-optical disks, ROMs, RAMs, EPROMs, EEPROMs, magnetic or optical cards, flash memory, or other type of media or machine-readable medium suitable for storing electronic instructions.

The cross-site scripting detection method analyzes code in a systematic manner. In this context, "systematic" refers to a comprehensive analysis of the software to make sure that, under all possible inputs, it never reaches a bad state (i.e., control is hijacked), and failing that, to explicitly identify the conditions under which it does reach a bad state. This second outcome can then be used by the software developer(s) to patch the code to eliminate the particular bug/vulnerability that led to the bad state. This process can be repeated until the code is certified to be free of errors and, thus, secure.

The present cross-site scripting detection method inputs a web page application program into a virtual machine. Output of the virtual machine produces bytecode and a control flow graph. A state transition graph encoder uses the bytecode and control flow graph to produce transition, initial, and property (Τ,Ι,Ρ) state information. The Τ,Ι,Ρ state information is then parsed by a parser. Output of the parser is fed to a reachability verifier that produces a verified flag (OK), or alternatively, a positive counter-example (CEX). The CEX is transformed into the web page application program and fed as HTML to a browser, which checks for feasibility of a XSS find. A feasibility indication suggests a true positive find of the XSS malicious code. An infeasibility indication suggests a false positive find, upon which, via feedback of the infeasibility indication, the reachability verifier is retuned to refine the solution.

To accomplish this type of code checking, the code must be formally modeled as a transition system to (a) enable the automatic checking of its safety from errors, i.e., that all states reachable from the initial state(s) are "good" states; or (b) to give an input sequence that shows how a bad state is reachable in case the code is buggy. This type of reachability checking has been applied in several other contexts. Specifically, a great deal of progress has been achieved in formally checking hardware systems with exponentially- sized state spaces using a variety of techniques to mitigate the inherent computational complexity in such models and achieve remarkable scalability. A key enabler for this success is the use of abstraction (over-approximation) to automatically eliminate irrelevant detail in the transition models and to automatically refine the abstraction based on false positives yielded by the approximate analysis. This so-called CounterExample- Guided Abstraction Refinement (CEGAR for short) framework is now standard in many formal verification tool chains for both hardware and software verification. In our previous work on hardware verification (equivalence checking of sequential designs, architectural compliance verification of microprocessor implementations to their instruction set architectures, etc.), we successfully demonstrated the scalability of this approach both commercially in the Reveal product (www.reveal-da.com) and in the prototype Averroes system that further extends Reveal's modeling and computational capabilities.

As shown in Fig. 1, the client side browser 100 has javascript (JS) 104 running on it, while the server side 102 hosts the web application. When the user is looking through a web application and wants to request something from the server (by filling in some fields and clicking on buttons, boxes, or the like) an http request is initiated carrying the user supplied parameters (strings) in the form of a URL, for example, and then sent to the server 102, which processes the request and sends back the response as HTML code that the browser 100 renders. The browser 100 can also execute JS code 104 sent from the server 102. If this code originated from the web application, there's no vulnerability. However, if this code was injected as a parameter in the http request and escapes through the response, then a vulnerability exists. The XSS vulnerability is a security exploit in which the attacker injects arbitrary (executable) javascript code into Web Pages.

As shown in Fig. 2, the XSS attack scenario 200 is an illustrative example of an XSS attack. The attacker sends a URL (Uniform Resource Locator) with traps (malicious parameters) to Alice, who clicks on it, which triggers an http request to the server carrying these parameters along. If the web application is vulnerable, these traps can escape through the output and execute when the server sends a response back to Alice. The attack often results in sending some sensitive information back to the attacker.

The present cross-site scripting detection method adapts the CEGAR approach to the so-called cross-site scripting (XSS) vulnerability, a major security hole in many web server codes that causes URL hijacking. It is estimated that 70% of existing web applications contain some type of XSS vulnerability that can be exploited to compromise security.

Existing methods for detecting and mitigating XSS vulnerabilities have not been very effective, either because they entail a significant performance overhead, or they lead to both false positives and false negatives, rendering them mostly unusable. The cross-site scripting catcher (XSSCatcher) approach introduced here differs in that it is a sound and complete solution that either guarantees the absence of XSS vulnerabilities from a web application or demonstrates the existence of one or more vulnerabilities by providing a counterexample that can be run on the code. XSSCatcher thus eliminates both false positives and false negatives and helps to improve the security of web applications. XSSCatcher achieves this by converting a web application to a set of logical equations that capture the application's behavior as a formal transition system, which is then analyzed using techniques similar to those that have been successfully demonstrated in automated hardware verification.

The encoding of the web application as a transition system involves abstracting the application's inputs and variables as binary quantities, denoting whether they are safe (data) or unsafe (executable scripts). It also involves abstracting the application's control flow by ignoring the application's functional behavior (since the goal is to simply track the propagation of safe and unsafe strings through the code). Such an encoding yields a transition system expressed by formulas in propositional logic, which is readily analyzable by an approximate reachability algorithm to determine if unsafe inputs can escape to the application's output and end up being executing by the client-side browser. If the analysis shows that the application's formal transition model is safe, one can conclude that the application itself is safe, since the transition model over-approximates the application's behavior. Otherwise, the analysis yields a counterexample that must be checked to determine if it is a false positive. An automatic refinement procedure eliminates such spurious counterexamples and tightens the abstract transition system in an iterative CEGAR framework until it becomes sufficiently precise to prove the absence of XSS vulnerabilities or to produce a feasible counterexample demonstrating their existence.

The architecture, when fully functional, will read a web application program (we are currently targeting programs written in the PHP language), process it with a compiler front- end to obtain its bytecode representation, and generate its control flow graph (CFG). A novel encoder transforms these representations to an abstract formal transition system that over- approximates the program' s behavior. The transition system consists of three formulas in a suitable logic: a transition formula T, an initial state formula I, and a property formula P.

These three formulas are passed on to a reachability verifier using a suitable parser. The verifier automatically checks the formulas to determine if the original program is free of XSS vulnerabilities or to produce a program trace (also known as a counterexample CEX) that shows the existence of one or more XSS vulnerabilities. The potential existence of vulnerabilities triggers a back-end process to check if the vulnerabilities represent false positives (due to the over-approximation inherent in the formal model). This is done by executing the counterexample on the actual web application and rendering the result on the client-side browser. A feasible execution indicates a real vulnerability (true positive).

Otherwise, the counterexample is a false positive and is used to generate a set of constraints that refine (i.e., tighten) the over- approximation, which is now re-checked by the reachability verifier. This iterative process is carried out until all false positives are ruled out and the program is declared safe, or if a true positive is generated, indicating that the program has one or more XSS vulnerabilities.

Document Object Model (DOM) is a programming interface for HTML, XHTML, and XML. It provides a representation of the document as a structured group of nodes and objects that have properties and methods and event handlers attached (if they exist).

Essentially, it connects web pages to scripts or programming languages. Asynchronous JavaScript and XML (AJAX) is a technique for creating fast and dynamic web pages. AJAX allows web pages to be updated asynchronously by exchanging small amounts of data with the server behind the scenes. This means that it is possible to update parts of a web page without reloading the whole page. Types of XSS attacks include reflected, which is an unsafe input data contained in the URL that trickles down to the output. There also is a stored attack in which the unsafe input is stored in the server' s database. Additionally there is the DOM-based attack, where unsafe input in the URL is directly used by legitimate Javascript code to modify the document. Also there is the AJAX-based attack, where an AJAX request retrieves corrupted data from the server and uses this data to modify the DOM. All types of XSS attacks share a common thread, which is executing data.

Reflected XSS attacks are the ones where unsafe input data contained in the URL trickles down to the output. With such type of XSS attack, the attacker has to forge the URL, send it to the victim through some channel, and trick her into opening it. Stored XSS attacks are the ones where the unsafe input is stored in the server' s database. The attack is executed when the victim visits a page that retrieves this malicious information from the database.

DOM-based XSS attacks are attacks where unsafe input in the URL is directly used by legitimate Javascript code. As an example, a malicious Javascript code can extract the user' s preferred language from the URL and modify the current document to mention this information. Like a reflected XSS attack, the attacker can forge a malicious URL and deliver it to the user. However, the difference between a DOM-based attack and the two others described above is that this attack does not come from a malicious HTTP response forged by the server. AJAX-based XSS attacks are the ones where an AJAX request retrieves corrupted data from the server (similarly to the reflected or stored XSS attack) and uses this data to modify the DOM. An AJAX request is a Javascript command allowing the application to retrieve data from the server by doing an HTTP request in the background, i.e., without the need to reload the page. Because AJAX can retrieve small amounts of data from the server, it has been one of the key innovations that brought responsiveness to web applications.

As shown in Fig. 3, the web application 102 has a string input from an http request and a string output in the form of HTML, JS, or the like. As shown in Fig. 4, when there is an XSS vulnerability, an unsafe string input (script) is processed by the web application 102 and a string output with the script is the result. The web applications are abstracted with ignoring its data manipulation, since we're only interested in tracking unsafe input parameters through the program, allowing web applications to be viewed as a function that takes input and produces output. The input to web applications comprises user-supplied strings sent by http request. The output of these applications comprises mostly HTML code and JS. Sometimes the input can be unsafe (malicious scripts). If web applications are designed correctly with proper escaping and sanitization of input, no unsafe input will end up being executed. If the executable script manages to finds its way to the output, then there is a XSS vulnerability. The execution of such input can also be called "executing data", which is very dangerous. Thus, an XSS is a manifestation of data execution, similar to a typical buffer overflow in C.

The present cross-site scripting detection method combines taint analysis (data flow) with control flow in a systematic formal procedure to obtain a sound and complete solution to detecting the existence of XSS vulnerabilities or proving their absence. This will be performed by abstracting web applications while ignoring its data manipulation, since the user is only interested in tracking unsafe input parameters through the program. That allows us to view web applications as a function that takes input and produces output The input to web applications are user-supplied strings sent by http request, the output of these applications being mostly HTML code and JS.

Referring to Fig. 5, diagram 500 shows reachable states R as a proper subset of good states P, and Fig. 6, diagram 600 shows reachable states R as an improper subset of good states P. Web applications are formally modeled as transition systems. Defining a transition system, the present state variables are called X, and next state variables are called X+. A predicate on the state variables that defines the initial states is called I(X), and a transition relation T specifies how transition between states occurs as a function of inputs. The transitive closure of T starting from I yields the set of reachable states R. Finally, the predicate P on the states defines what is considered good or safe states.

Referring to exemplary web application communication 700 shown in Fig. 7, consider the general case of two transition systems, one that satisfies its property, and one that doesn't. In an exemplary web application, Text is the parameter coming for an http request (input) and read by a Get command, which an associative array that takes the text value and assigns it to a [var str echo str] that is sent through an http response back to browser (output).

The program is modeled as a transition system (TS), so first one needs to extract the state variables. This program has two state variables that are of two kinds, Ip being the instruction pointer that navigates through the executable statements of the program. Here it has 5 values: s, 2,4,7, e, and Str, and other variables are contained in the code. See Table 1 for the exemplary code and variables.

Table 1 : Exemplary Code and Variables

Code Variable

S: <?php State: ip e {S,2,4,7,E}

$str e {Unsafe, Safe}

1:

2: $str = $_GET['texf ] ; Input: text e {Unsafe,Safe}

3: if ($str == "0"){ Output: out e {Unsafe,Safe}

4: echo "";

5: } Control Flow: C≡ ($str == "0") e

{False, True}

6: else{

7: echo $str;

A Boolean abstraction of the program is performed, where all of the program variables and output are treated as either safe→ 1 or unsafe→ 0. This is performed systematically so as to define the state variables of two kinds: ip, and other state variables in the program.

Next one needs to identify the inputs and classify them into safe/unsafe. Whenever one has an output of the program, it is represented in a variable called "out" (an implicit variable that receives the string of an echo cmd). The system also abstracts conditions into fresh conditions (i.e., fresh Boolean variables) cl,c2 .... These variables help to model control flow. See Table 2 for the abstraction: Table 2: Abstraction

Code → Abstraction

S: <?php S: <?php

1: 1:

2: $str = $_GET['texf ] ; 2: $str = text;

3: if ($str == "0"){ 3: if ( C ){

4: echo ""; 4: out = l;

5: } 5: }

6: else{ 6: else{

7: echo $str; 7: out = $str;

8: } 8: }

E:?> E:?>

Referring to Fig. 8, CFG 800, Fig. 9, Control state transition graph (STG) 900, and Fig. 10, Full STG 1000, one then analyzes this abstraction of the code by constructing a CFG 800, which follows the sequential control flow of the program. In each ip state, one can describe the updates to other program variables, such as str=text in ip state 2. Update to out happens in ip state 4 and 7. To view this in a more traditional STG, one can map this CFG into STG (see control STG 900) where states are circles and transitions between them are labeled with conditions. Unlabeled edges are unconditional. This is still not a complete STG. In Full STG 1000, the program variables (in this case $str and ip, so 2x5=10 states) are considered, along with taking all possibilities of str being safe 1 or unsafe 0, and taking all ip domain. At the beginning of the program, one assumes that these internal variables are safe, so the system cannot start from S state being 0. This Full STG 1000 is not constructed explicitly, rather it is constructed implicitly, through the analysis.

The present cross-site scripting detection method's goal is to encode the transition system as a logical formula, as shown in the transition relation diagram 1100 of Fig. 11. The present cross-site scripting detection method formally models this program in logic so that it could be analyzed automatically by Boolean satisfiability (SAT) or Satisfiability Modulo Theory (SMT) solvers. Referring to the transition relation diagrams 1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900, 2000, 2100, 2200,2300, 2400, and 2500 of Figs 12 through 25, respectively, for every starting ip state, a constraint that captures its transition can be written. Moreover, the opposite can be done until, as shown in transition Relation diagram 2600 of Fig. 26, the initial state lis defined and the safety property P is defined. Starting from a destination, one must see where / could come from. This set of equations is passed to SAT or SMT solver to verify it. Even if text is unsafe, an integer (int) cast will turn it into something harmless "safe" (see Table 5). The "One-Hot Encoding of ip states" is illustrated in Table 3. The equivalent sequential circuit variables are illustrated in Table 4.

Table 3: IP State Boolean Location Variable Encoding

IP State Boolean location variable L

ip = S: Ls = l

ip = 2: L₂ = l

ip = 4: k.= l

ip = 7: L₇ = 1

ip = E: =l

Table 4: IP State Boolean Location Variable Encoding

Bit-level sequential circuit corresponding to abstract

Table 5: Abstraction with int Cast

Code → Abstraction

S: <?php S: <?php

1 : 1 :

2: $str = 2: $str = 1 ;

int$_GET['text'] ;

3: if ($str == "0") { 3 : if ( C ) {

4: echo ""; 4: out = l ;

5: } 5: }

6: else{ 6: else{

7: echo $str; 7: out = $str;

8: } 8: }

E:?> E:?>

As shown in Fig. 27, the abstract php program may be represented as a sequential circuit 2700. A C input comprises an input to AND gate 272a and an inverted input to AND gate 272b. Their remaining non-inverted inputs are connected together. An input includes the D input to a D type flip-flop 270a. Text feeds the "1" input to multiplexer 2705. Q output of flip-flop 270a feeds D input to flip-flop 270b (L_s = L ), and also the select input to the multiplexer 2705. The Q output of flip-flop 270b represents L₂ in the php program abstraction and is fed to the aforementioned remaining non-inverted inputs of AND gates 272a and 272b. Output of multiplexer 2705 represents $str⁺and is fed to D input of flip-flop 270e. Q output of flip-flop 270e represents $strand feeds "0" input of multiplexer 2705, as well as non-inverting input to OR gate 277. Output of OR gate 277 represents P, the output of the php program abstraction. Output of AND gate 272a represents L , which is fed to D input of flip-flop 270d. Q output of flip-flop 270c represents L₄, which is fed to a first input of three input OR gate 275. Output of AND gate 272b represents L₇ , which is fed to D input of flip-flop 270d. Q output of flip-flop 270d represents L₇, which is fed to a second input of three input OR gate 275. Output of the OR gate 275 represents L and is fed to the D input of flip-flop 27 Of. Q output of flip-flop 270f represents L_E and is fed back to third input of three input OR gate 275.

In the process of developing XSSCatcher, the web application's source code, written in php, is run through HHVM (hiphop virtual machine) that produces bytecode. The present cross-site scripting detection method works with the bytecode and not with php directly because the bytecode is simpler because it has single static assignment and small number of primitives. So it's more trackable to work with it. A control flow graph is then built, as shown before, which represents basic blocks (sequential instructions and branches). Next, some basic blocks are merged and represent branching nodes as conditions on edges in a state transition graph 2800 (shown in Fig. 28).

The transition system can be modeled in logical formulas stating the transition relation, initial state, and safety property that is then parsed and passed to a reachability verifier to see if a bad state can be reached from the initial state. If a bad state be not reached, then there is no vulnerability. If a bad state is reached, a counter-example has been discovered. Project framework flow 3900 is illustrated in Fig. 39. CounterExample-Guided Abstraction Refinement, also known as CEGAR 4000 (see flow diagram of Fig. 40) is the main advantage of the present cross-site scripting detection method, with the protocol shown in Table 1.

To complete the state transition graph 2800, reachability is checked and a counter example is obtained. The counter example is concretized. The original php code is run through the browser and observed whether it gets executed. If yes, then there is an XSS vulnerability. If not, it is a false alarm and refinement is done. The present cross-site scripting detection method learns from this counter example and adds it as a lemma to its knowledge base. Truth tables 2900, 3000, and 3100 of Figs 29, 30, and 31, respectively, illustrate the learning process. Control flow graph (CFG) 3200 is developed. Next control STG 3300 is developed. Subsequently, the full STG 3400 is derived from aforementioned CFG 3200 and STG 3300. Truth tables 3500, 3600, and 3700 of Figs 35, 36, and 37, respectively, illustrate this process.

The present cross-site scripting detection method does complete abstraction of the program (over- approximation). Moreover, the control flow of the entire program is considered, thus not only are the inputs tracked. The present cross-site scripting detection method refines so that false positives are removed, i.e., (sound), complete (no false negatives). Fig. 38 provides an overview 3800 of the cross-site scripting catcher

(XSScatcher).

While the present cross-site scripting detection method is based on php language, the main concept of it can handle all other languages after making some modifications. The present cross-site scripting detection method works on detecting the reflected type of XSS attacks, but with further adaptation, it will be able to handle other types.

It is to be understood that the present invention is not limited to the embodiments described above, but encompasses any and all embodiments within the scope of the following claims.

Claims

CLAIMS We claim:

1. A cross-site scripting (XSS) detection method for detecting XSS vulnerabilities in a web page, comprising the steps of:

inputting a web page application program into a virtual machine, the virtual machine producing bytecode and a control flow graph (CFG) as output;

encoding the bytecode and control flow graph output using a state transition graph encoder, the encoder providing transition, initial, and property (Τ,Ι,Ρ) state information as output;

parsing the Τ,Ι,Ρ state information using a parser, said parser providing a Τ,Ι,Ρ-parsed output;

verifying the Τ,Ι,Ρ-parsed output using a reachability verifier, the verifier providing negative (OK) or positive counter-example (CEX) cross-site scripting discovery results; feeding the web page application program based on the positive result as HTML to a browser;

via the browser, outputting a feasibility indication or infeasibility indication of the cross-site scripting discovery, the infeasibility indication signifying a false positive cross-site scripting discovery, the feasibility indication signifying a true positive cross-site scripting discovery; and

using feedback of the infeasibility indication to tune the reachability verifier for future elimination of the false positive cross-site scripting discovery.

2. The cross-site scripting detection method according to claim 1, further comprising the step of abstracting the application's control flow by ignoring the application's functional behavior.

3. The cross-site scripting detection method according to claim 2, wherein the Τ,Ι,Ρ verification step further comprises the step of determining whether all reachable states (R states) are a proper subset of the good states (P states).

4. The cross-site scripting detection method according to claim 3, wherein the Τ,Ι,Ρ verification step further comprises the step of determining whether all reachable states (R states) are an improper subset of the good states (P states).

5. The cross-site scripting detection method according to claim 4, further comprising the step of extracting state variables of the web application program to facilitate modeling the web application program as a transition system (TS).

6. The cross-site scripting detection method according to claim 5, further comprising the step of performing a Boolean abstraction of the web application program, wherein all of the program variables and output are treated as being either safe→ 1 or unsafe→ 0.

7. The cross-site scripting detection method according to claim 6, further comprising the steps of:

identifying the inputs; and

classifying the inputs into safe/unsafe.

8. The cross-site scripting detection method according to claim 7, further comprising the step of abstracting conditions into fresh conditions (fresh Boolean variables), wherein the fresh Boolean variables assist in modeling the control flow of the CFG.

9. The cross-site scripting detection method according to claim 8, further comprising the step of describing a program variable update in each IP state, where I represents an initial state and P represents a safety property.

10. The cross-site scripting detection method according to claim 9, further comprising the step of mapping the CFG into a full state transition graph STG.

11. The cross-site scripting detection method according to claim 10, further comprising the step of encoding the transition system as a logical formula that can be analyzed automatically by SAT or SMT solvers.

12. A computer software product, comprising a non-transitory medium readable by a processor, the non-transitory medium having stored thereon a set of instructions for performing a cross-site scripting detection method, the set of instructions including:

(a) a first sequence of instructions which, when executed by the processor, causes said processor to input a web page application program into a virtual machine, the virtual machine producing bytecode and a control flow graph (CFG) as output;

(b) a second sequence of instructions which, when executed by the processor, causes said processor to encode the bytecode and control flow graph output using a state transition graph encoder, the encoder providing transition, initial, and property (Τ,Ι,Ρ) state information as output;

(c) a third sequence of instructions which, when executed by the processor, causes said processor to parse the Τ,Ι,Ρ state information using a parser, said parser providing a Τ,Ι,Ρ-parsed output; (d) a fourth sequence of instructions which, when executed by the processor, causes said processor to verify the Τ,Ι,Ρ-parsed output using a reachability verifier, the verifier providing negative (OK) or positive counter-example (CEX) cross-site scripting discovery results;

(e) a fifth sequence of instructions which, when executed by the processor, causes said processor to feed the web page application program based on the positive result as HTML to a browser;

(f) a sixth sequence of instructions which, when executed by the processor, causes said processor to via the browser, output a feasibility indication or infeasibility indication of the cross-site scripting discovery, the infeasibility indication signifying a false positive cross- site scripting discovery, the feasibility indication signifying a true positive cross-site scripting discovery; and

(g) a seventh sequence of instructions which, when executed by the processor, causes said processor to use feedback of the infeasibility indication to tune the reachability verifier for future elimination of the false positive cross-site scripting discovery.

13. The computer software product according to claim 12, further comprising an eighth sequence of instructions which, when executed by the processor, causes said processor to abstract the application's control flow by ignoring the application's functional behavior.

14. The computer software product according to claim 13, further comprising a ninth sequence of instructions which, when executed by the processor, causes said processor to determine whether all reachable states (R states) are a proper subset of the good states (P states).

15. The computer software product according to claim 14, further comprising a tenth sequence of instructions which, when executed by the processor, causes said processor to determine whether all reachable states (R states) are an improper subset of the good states (P states).

16. The computer software product according to claim 15, further comprising a eleventh sequence of instructions which, when executed by the processor, causes said processor to extract state variables of the web application program to facilitate modeling the web application program as a transition system (TS).

17. The computer software product according to claim 16, further comprising a twelfth sequence of instructions which, when executed by the processor, causes said processor to perform a Boolean abstraction of the web application program, wherein all of the program variables and output are treated as being either safe→ 1 or unsafe→ 0.

18. The computer software product according to claim 17, further comprising:

a thirteenth sequence of instructions which, when executed by the processor, causes said processor to identify the inputs; and

a fourteenth sequence of instructions which, when executed by the processor, causes said processor to classify the inputs into safe/unsafe.

19. The computer software product according to claim 18, further comprising a fifteenth sequence of instructions which, when executed by the processor, causes said processor to abstract conditions into fresh conditions (fresh Boolean variables), wherein the fresh Boolean variables assist in modeling the control flow of the CFG.

20. The computer software product according to claim 18, further comprising:

a sixteenth sequence of instructions which, when executed by the processor, causes said processor to describe a program variable update in each IP state, where I represents an initial state and P represents a safety property;

a seventeenth sequence of instructions which, when executed by the processor, causes said processor to map the CFG into a full state transition graph STG;

an eighteenth sequence of instructions which, when executed by the processor, causes said processor to encode the transition system as a logical formula that can be analyzed automatically by SAT or SMT solvers.