WO2024214265A1 - Analysis device, analysis method, and analysis program - Google Patents
- Publication number
- WO2024214265A1 (PCT/JP2023/015096)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- instruction
- branch
- virtual machine
- analysis
- unit
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
Definitions
- the present invention relates to an analysis device, an analysis method, and an analysis program.
- Script analysis techniques are used for a variety of purposes, such as compiler optimization for just-in-time (JIT) compilation, software testing and debugging, fuzzing, and malware analysis.
- Control flow graph: one method for expressing the structure of a program is the control flow graph (CFG).
- A CFG is a graphical representation of the paths that may be taken when a program is executed.
- A CFG is a directed graph in which basic blocks are nodes and branches from one basic block to another are edges.
- Dynamic CFG is a method of constructing a CFG by monitoring and executing a program to be analyzed and plotting the control flow that is actually observed. Specifically, every time a branch instruction is executed, a basic block currently being executed is defined, and the actual branch source and branch destination are observed and edges are drawn between them to construct a CFG.
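- The dynamic construction described above can be sketched as follows. This is a minimal illustration, not the method of the publication: it assumes the observations are already available as (branch source, branch destination) address pairs, treats each address as a node, and omits basic block splitting. The function name and the sample addresses are hypothetical.

```python
def build_dynamic_cfg(branch_events):
    """Build a CFG (adjacency map) from observed (source, destination) pairs."""
    cfg = {}
    for src, dst in branch_events:
        cfg.setdefault(src, set()).add(dst)  # one edge per distinct observed transfer
        cfg.setdefault(dst, set())           # the destination is a node even without successors
    return {node: sorted(successors) for node, successors in cfg.items()}


# Hypothetical observations: a loop between 0x10 and 0x30 that finally exits to 0x50.
observed = [(0x10, 0x30), (0x30, 0x10), (0x10, 0x30), (0x30, 0x50)]
dynamic_cfg = build_dynamic_cfg(observed)
```

- Only transfers that were actually executed appear as edges, which is why a dynamic CFG can miss paths that a given run never takes.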
- Static CFG involves statically scanning the code portion of the program being analyzed, and creating a graph of the control flow observed during that process. Specifically, the scan begins at the program's entry point, and each time a branch instruction is discovered, the scan continues accordingly. For each branch instruction, the instruction's operands are then examined to obtain the branch destination, and a CFG is constructed by drawing edges from the branch source to the branch destination.
- obfuscation involves applying conversions to programs that make them difficult to interpret, primarily to hinder static analysis.
- parts of the script are encoded or encrypted, and then dynamically decoded or decrypted at run time before execution. In such cases, the type of script that will be executed is not clear until it is executed. This makes static analysis of the script difficult.
- Script execution method: scripts are executed by a script engine (also called an interpreter). Scripts are generally converted into bytecode at runtime, and the bytecode is interpreted and executed by a virtual machine (VM). For this reason, what is analyzed before execution is the script, and what is analyzed at execution is the bytecode.
- CFGs are used for a variety of purposes in tasks involving program analysis, such as compiler optimization, software testing, fuzzing, and malware analysis.
- In compiler optimization, CFGs are used, for example, for reachability analysis.
- In software testing, CFGs are used, for example, for calculating code coverage.
- In fuzzing, CFGs are used, for example, for gray-box fuzzing.
- In malware analysis, CFGs are used, for example, for similarity analysis that compares control flow graphs.
- Constructing a CFG is therefore a fundamental analysis that can be widely applied to technologies that involve program analysis. This is true not only for executable binary programs but also for scripts. In other words, analyzing a script and constructing a CFG is important for further analysis of the script.
- analyzing an obfuscated script requires analyzing the bytecode. Therefore, analyzing a script and building a CFG also requires analyzing the bytecode.
- Bytecode is composed of VM instructions, which are instructions that the VM can interpret, and analyzing the bytecode to build a CFG requires that the specifications of the VM instructions, especially those related to branching, be clear.
- A method has been proposed in which the VM instructions of an unknown script engine are analyzed using a test script, branch VM instructions are identified, and a CFG is constructed for the bytecode (Non-Patent Document 1).
- A method has also been proposed in which the VM instructions in unknown bytecode are analyzed using data tracing with taint analysis to identify branch VM instructions and build a CFG for the bytecode (Non-Patent Document 2).
- Non-Patent Document 1 only determines the opcode of the branch VM instruction, and does not disclose a method for analyzing the operands. This means that the only way to build a CFG is based on the actual changes in the virtual program counter during execution, which means that only a dynamic CFG can be built.
- Non-Patent Document 2 only targets information obtained from the actually executed parts of the VM and its bytecode used by malware for obfuscation, and has the problem that it cannot be directly applied to the VM used by the script engine and the bytecode generated from the script.
- Another problem with the technique described in Non-Patent Document 2 is that it does not show how to analyze complex control flows, such as calls using function name symbols, which can be seen in script bytecode.
- the present invention has been made in consideration of the above, and aims to provide an analysis device, an analysis method, and an analysis program that can construct a control flow graph by statically scanning a bytecode, even for bytecodes with unknown instruction set architectures.
- the analysis device of the present invention has a first analysis unit that analyzes the virtual machine of the script engine based on an execution trace obtained by executing a test script while monitoring the binary of the script engine; a second analysis unit that analyzes the instruction set architecture, which is the instruction system of the virtual machine, to collect virtual machine instructions, and based on the analysis results of the virtual machine of the script engine and the analysis results of the instruction set architecture, determines a branch virtual machine instruction that causes a branch in the script among the collected virtual machine instructions, and analyzes the addressing method of the branch destination of the branch virtual machine instruction and the operand of the branch virtual machine instruction; and a construction unit that statically scans the bytecode extracted by the script engine by analyzing the script to be analyzed based on the analysis results of the virtual machine and the analysis results of the instruction set architecture, obtains the branch destination of the branch virtual machine instruction, and constructs a control flow graph of the bytecode.
- a control flow graph is constructed by statically scanning the bytecodes.
- FIG. 1 is a diagram illustrating an example of the configuration of a script engine.
- FIG. 2 is a diagram showing pseudo code of a VM included in the script engine.
- FIG. 3 is a diagram illustrating an example of a configuration of an analysis device according to an embodiment.
- FIG. 4 is a diagram showing an example of a first test script used for detecting a virtual program counter (VPC).
- FIG. 5 is a diagram showing an example of a second test script used for determining a branch VM instruction.
- FIG. 6 is a diagram illustrating an example of an execution trace.
- FIG. 7 illustrates an example of a VM execution trace.
- FIG. 8 is a diagram illustrating the process of the bytecode extraction unit.
- FIG. 9 is a diagram illustrating the process of the bytecode analysis unit.
- FIG. 10 is a diagram illustrating the process of the VM instruction boundary detection unit.
- FIG. 11 is a diagram illustrating the process of the virtual program counter detection unit.
- FIG. 12 is a diagram illustrating the process of the dispatcher detection unit.
- FIG. 13 is a diagram illustrating the process of the code cache detection unit.
- FIG. 14 is a diagram illustrating the process of the branch VM instruction determination unit.
- FIG. 15 is a diagram illustrating the process of the branch VM instruction analyzing unit.
- FIG. 16 is a diagram illustrating the process of the branch VM instruction analyzing unit.
- FIG. 17 is a diagram illustrating the process of the branch VM instruction analyzing unit.
- FIG. 18 is a diagram showing an example of a script to be analyzed and a driver generated by the driver generating unit.
- FIG. 19 is a flowchart illustrating a processing procedure of the analysis process according to the embodiment.
- FIG. 20 is a flowchart illustrating the procedure of the execution trace acquisition process shown in FIG.
- FIG. 21 is a flowchart illustrating a processing procedure of the VM instruction boundary detection processing illustrated in FIG.
- FIG. 22 is a diagram for explaining the process of the virtual program counter detection unit shown in FIG.
- FIG. 23 is a diagram for explaining the process of the dispatcher detection unit shown in FIG.
- FIG. 24 is a flowchart illustrating the processing procedure of the code cache detection processing shown in FIG.
- FIG. 25 is a diagram illustrating the VM execution trace acquisition process shown in FIG.
- FIG. 26 is a flowchart illustrating the procedure of the VM instruction collection process illustrated in FIG.
- FIG. 27 is a flowchart illustrating the processing procedure of the branch VM instruction determination processing shown in FIG.
- FIG. 28 is a flowchart illustrating the procedure of the branch VM instruction analysis process shown in FIG.
- FIG. 29 is a flowchart showing the procedure of the direct addressing method analysis process shown in FIG.
- FIG. 30 is a flowchart showing the procedure of the indirect addressing method analysis process shown in FIG.
- FIG. 31 is a flowchart showing the procedure of the relative addressing method analysis process shown in FIG.
- FIG. 32 is a flowchart of the driver generation process shown in FIG.
- FIG. 33 is a flowchart of the bytecode extraction process shown in FIG. 19.
- FIG. 34 is a flowchart showing the procedure of the bytecode analysis process shown in FIG.
- FIG. 35 is a diagram illustrating an example of a computer that realizes the analysis device by executing a program.
- An analysis device executes a first test script while monitoring the binary of a script engine, and acquires a branch trace and a memory access trace as an execution trace.
- the analysis device analyzes a virtual machine (VM) based on the execution trace, and acquires, as architecture information, a VM instruction boundary, a virtual program counter (VPC), and a code cache in which executed VM instructions are stored.
- the analysis device executes the second test script while monitoring the VPC and the dispatcher to obtain a VM execution trace.
- the analysis device analyzes the VM execution trace to collect VM instructions and to determine which of the collected VM instructions are branch VM instructions.
- the analysis device analyzes the addressing method of the branch destination of the branch VM instruction and the operands of the branch VM instruction.
- the analysis device then constructs a control flow graph (CFG) by statically scanning the bytecode, even for bytecode extracted by a script engine whose internal VM specifications and instruction set architecture are unknown.
- the bytecode is extracted by the script engine by analyzing the script to be analyzed.
- the CFG is a graphical representation of the possible paths that may be taken when a program is executed.
- the analysis device collects VM instructions, and obtains, from among the VM instructions, branch VM instructions, the addressing method of the branch destination of the branch VM instruction, and the operands of the branch VM instruction. In this way, the analysis device can statically scan the VM instructions that make up the bytecode in order, and if the current VM instruction is a branch VM instruction, transition the scan to the branch destination instruction, thereby constructing a CFG.
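- The static scan described above can be sketched as follows, under stated assumptions. The toy instruction set, the `HALT` opcode, the one-slot-per-instruction layout, and the `branch_info` table (opcode mapped to operand index, addressing mode, and a conditional flag) are all hypothetical stand-ins for the architecture information the analysis device recovers. Edges are drawn between instructions rather than basic blocks to keep the sketch short.

```python
def build_static_cfg(code, branch_info, entry=0):
    """Worklist scan from the entry point, following fall-through and branch targets."""
    edges = set()
    visited = set()
    worklist = [entry]
    while worklist:
        pc = worklist.pop()
        if pc in visited or not 0 <= pc < len(code):
            continue
        visited.add(pc)
        opcode, operands = code[pc]
        successors = []
        if opcode in branch_info:
            operand_index, mode, conditional = branch_info[opcode]
            raw = operands[operand_index]
            # The addressing mode decides how the operand maps to a destination.
            successors.append(pc + raw if mode == "relative" else raw)
            if conditional:
                successors.append(pc + 1)  # not-taken path falls through
        elif opcode != "HALT":
            successors.append(pc + 1)      # ordinary instruction: fall through
        for nxt in successors:
            edges.add((pc, nxt))
            worklist.append(nxt)
    return sorted(edges)


# Hypothetical bytecode: one instruction per slot, given as (opcode, operands).
code = [
    ("PUSH", (1,)),
    ("JZ", (3,)),    # conditional, relative: destination is 1 + 3 = 4
    ("PUSH", (2,)),
    ("JMP", (5,)),   # unconditional, absolute: destination is 5
    ("PUSH", (3,)),
    ("HALT", ()),
]
branch_info = {"JZ": (0, "relative", True), "JMP": (0, "absolute", False)}
static_cfg = build_static_cfg(code, branch_info)
```

- Because the scan never executes the bytecode, both outcomes of the conditional branch appear as edges, including the one a particular run might not take.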
- Figure 1 is a diagram for explaining an example of the configuration of a script engine.
- script engine 1 has a bytecode compiler 2 and a VM 3.
- bytecode compiler 2 has a syntax analysis unit 4 and a bytecode generation unit 5.
- VM 3 has a code cache unit 6, a fetch unit 7, a decode unit 8, and an execution unit 9. The fetch unit 7, decode unit 8, and execution unit 9 run repeatedly, and this cycle is called the interpreter loop. The script engine 1 accepts a script as input.
- the syntax analysis unit 4 receives the script as input, and through lexical and syntactic analysis generates an Abstract Syntax Tree (AST), which it outputs to the bytecode generation unit 5.
- the bytecode generation unit 5 receives the AST as input, converts it into bytecode, and stores it in the code cache unit 6.
- the fetch unit 7 fetches the VM opcode from the code cache unit 6 and outputs it to the decode unit 8.
- the VM opcode refers to the opcode portion of the VM instruction.
- the decode unit 8 receives the VM opcode as input, interprets the VM opcode using a decoder/dispatcher, and dispatches it to the corresponding program.
- the execution unit 9 executes the program corresponding to the VM instruction. The contents written in the script are executed by executing the VM instructions one after another through a repeated interpreter loop.
- FIG. 2 is a diagram showing pseudocode of the VM of the script engine.
- the pseudocode first initializes the VPC (line 1).
- the while loop is the interpreter loop (lines 1 to 6).
- the fetch unit 7 obtains the VM opcode of the VM instruction at the position pointed to by the VPC in the code cache that holds the bytecode (line 2).
- the decoder uses a Switch statement to interpret the VM instruction (line 3), and the dispatcher calls the instruction handler based on the VM opcode (lines 4 onwards).
- the instruction handler then performs the operation corresponding to the instruction. Input and output are performed using a virtual stack and virtual registers (line 5), and constants and variables are referenced via a symbol table (line 6).
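- An interpreter loop of the kind described above can be sketched as follows. This is not the pseudocode of FIG. 2, which is not reproduced here. The opcodes `LOAD`, `ADD`, `JMP`, and `HALT`, the (opcode, operand) encoding, and the symbol table contents are hypothetical.

```python
def run_vm(code_cache, symbols):
    """Fetch, decode, dispatch, and execute until HALT or the end of the code cache."""
    vpc = 0      # virtual program counter
    stack = []   # virtual stack used for handler input and output
    while 0 <= vpc < len(code_cache):
        opcode, operand = code_cache[vpc]      # fetch the VM instruction at the VPC
        if opcode == "LOAD":                   # decode and dispatch to a handler
            stack.append(symbols[operand])     # variables are resolved via the symbol table
            vpc += 1
        elif opcode == "ADD":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
            vpc += 1
        elif opcode == "JMP":                  # a branch VM instruction rewrites the VPC
            vpc = operand
        elif opcode == "HALT":
            break
        else:
            raise ValueError(f"unknown VM opcode: {opcode}")
    return stack


program = [
    ("LOAD", "a"),
    ("LOAD", "b"),
    ("ADD", None),
    ("JMP", 5),
    ("LOAD", "a"),   # skipped by the jump
    ("HALT", None),
]
result = run_vm(program, {"a": 2, "b": 3})
```

- Non-branch handlers advance the VPC by a fixed amount, while the branch handler sets it from its operand. The later analysis steps rely on this difference.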
- a branch VM instruction is a VM instruction that causes a branch within a script.
- Fig. 3 is a diagram illustrating an example of the configuration of the analysis device 10 according to the embodiment.
- the analysis device 10 has an input unit 11, a control unit 12, a storage unit 13, and an output unit 14.
- the analysis device 10 accepts inputs of a test script, a script engine binary, and a seed script.
- the input unit 11 is composed of input devices such as a keyboard and a mouse, and accepts information input from the outside and inputs it to the control unit 12.
- the input unit 11 also has a communication interface for sending and receiving various information to and from other devices connected via a wired connection or a network, etc., and accepts input of information sent from other devices.
- the input unit 11 accepts input of test scripts and script engine binaries, and outputs them to the control unit 12.
- the test script is a script that is input when dynamically analyzing the script engine to obtain an execution trace and a VM execution trace, and includes a first test script for VM analysis and a second test script for branch VM instruction detection. Details of the test scripts will be described later.
- the script engine binary is an executable file that constitutes the script engine.
- the script engine binary may be composed of multiple executable files.
- Test script configuration: the test scripts are described below.
- a test script is a script that is input when dynamically analyzing a script engine. This test script focuses on the number of branch instruction executions and memory reads and writes, and is used to capture the difference in the behavior of the script engine that occurs when the test script is executed a different number of times. This test script is prepared before the analysis and is created manually. Creating it requires knowledge of the specifications of the target script language.
- FIG. 4 shows an example of a first test script used to detect VPCs.
- the first test script uses a repetitive process (line 2).
- the first test script changes the execution conditions and generates differences by increasing or decreasing the number of repetitions (line 2) and the number of repeated statements (lines 3 to 5) in the test script.
- the first test script is subject to processing from the execution trace acquisition process to the code cache detection process, which will be described later.
- FIG. 5 is a diagram showing an example of a second test script used to detect a branch VM instruction.
- the second test script uses multiple conditional branches (lines 4 to 8).
- the branch conditions are controlled so that the multiple conditional branches are either taken or not taken in a specific sequence pattern (lines 1 and 5).
- the number of conditional branches and the sequence pattern of branch success or failure are changed to generate differences.
- the second test script is subject to the execution trace acquisition process, VM execution trace acquisition process, branch VM instruction determination process, and branch VM instruction analysis process described below.
- the storage unit 13 is realized by a semiconductor memory element such as a RAM (Random Access Memory) or a flash memory, or a storage device such as a hard disk or an optical disk, and stores the processing program that operates the analysis device 10 and data used during execution of the processing program.
- the storage unit 13 has an execution trace database (DB) 131, a VM execution trace DB 133, and an architecture information DB 132 that stores architecture information acquired by the virtual machine analysis unit 121 and the instruction set architecture analysis unit 122 (described later).
- the execution trace DB 131 and the VM execution trace DB 133 store the execution traces and VM execution traces acquired by the execution trace acquisition unit 1211 and the VM execution trace acquisition unit 1221, respectively.
- the execution trace DB 131 and the VM execution trace DB 133 are managed by the analysis device 10.
- the execution trace DB 131 and the VM execution trace DB 133 may be managed by another device (such as a server), in which case the execution trace acquisition unit 1211 (described later) and the VM execution trace acquisition unit 1221 (described later) output the acquired execution traces and VM execution traces to a management server or the like for the execution trace DB 131 and the VM execution trace DB 133 via the communication interface of the output unit 14, and store them in the execution trace DB 131 and the VM execution trace DB 133.
- Fig. 6 is a diagram showing an example of an execution trace. As described above, an execution trace is composed of a branch trace and a memory access trace. Fig. 6 shows an excerpt of an execution trace. The structure of an execution trace will be described below with reference to Fig. 6.
- Trace indicates whether the log line is a branch trace or a memory access trace.
- a branch trace log line has the format shown, for example, in lines 1 to 10 of Figure 6, and consists of three elements: type, src, and dst.
- type indicates whether the executed branch instruction was a call instruction, a jmp instruction, or a ret instruction.
- src indicates the address of the branch source, and dst indicates the address of the branch destination.
- a log line of a memory access trace has the format shown, for example, in lines 11 to 13 of Figure 6, and consists of three elements: type, target, and value.
- Type indicates whether the memory access is a read or write.
- Target indicates the memory address that is the target of the memory access. Value stores the result of the memory access.
- Fig. 7 is a diagram showing an example of a VM execution trace.
- a VM execution trace is a record of a VM opcode and a VPC.
- Fig. 7 shows a part of a VM execution trace. The configuration of a VM execution trace will be described below with reference to Fig. 7.
- a log line of a VM execution trace is, for example, in the format shown in Figure 7, and consists of two elements: vpc and vmop (VM opcode).
- vpc indicates the value of the VPC.
- vmop indicates the value of the VM opcode that is virtually assigned to each pointer that points to the beginning of the VM instruction handler to be executed, obtained from the pointer cache.
- the control unit 12 has an internal memory for storing programs that define various processing procedures and the necessary data, and executes various processes using these.
- the control unit 12 is an electronic circuit such as a CPU (Central Processing Unit) or an MPU (Micro Processing Unit).
- the control unit 12 has a virtual machine analysis unit 121 (first analysis unit) and an instruction set architecture analysis unit 122 (second analysis unit).
- the virtual machine analysis unit 121 analyzes the VM of the script engine.
- the virtual machine analysis unit 121 obtains multiple execution traces by changing the execution conditions, analyzes the multiple execution traces using differential execution analysis, and obtains the VPC.
- the virtual machine analysis unit 121 also analyzes the script engine binary to obtain the boundaries and dispatchers of VM instructions.
- the virtual machine analysis unit 121 detects a code cache from the VM execution trace.
- the VM instructions to be executed are stored in the code cache.
- the virtual machine analysis unit 121 has an execution trace acquisition unit 1211 (first acquisition unit), a VM instruction boundary detection unit 1212 (first detection unit), a virtual program counter detection unit 1213 (second detection unit), a dispatcher detection unit 1214 (third detection unit), and a code cache detection unit 1215 (fourth detection unit).
- the execution trace acquisition unit 1211 receives the first test script and the script engine binary as input.
- the execution trace acquisition unit 1211 acquires an execution trace by executing the first test script while monitoring the execution of the script engine binary.
- An execution trace consists of a branch trace and a memory access trace.
- a branch trace records the type of branch instruction at the time of execution, the branch source address, and the branch destination address.
- a memory access trace records the type of memory operation at the time of execution (read/write), and the memory address and value of the operation target. It is known that branch traces and memory access traces can be obtained by hooking a memory operation instruction, inserting code for log output, and executing it.
- the execution trace obtained by the execution trace acquisition unit 1211 is stored in the execution trace DB 131.
- the VM instruction boundary detection unit 1212 clusters the execution trace to detect the boundaries of each VM instruction.
- the VM instruction boundary detection unit 1212 clusters the execution trace to detect clusters with a threshold or more of execution count as VM instructions. In clustering, consecutive code regions that are executed multiple times are detected. For example, executed instructions that are close in distance to each other in the code may be grouped together, common subsequences of executed code blocks may be searched for, or other methods may be used.
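- One of the clustering options mentioned above, grouping executed instructions that are close to each other in the code, can be sketched as follows. The gap limit, the threshold, and the choice to take a region's execution count as the minimum count over its addresses are assumptions made for this illustration.

```python
from collections import Counter


def detect_instruction_regions(executed_addresses, max_gap=4, min_executions=2):
    """Group nearby executed addresses and keep regions executed often enough."""
    counts = Counter(executed_addresses)
    regions = []  # each entry: [start, end, execution count]
    for address in sorted(counts):
        if regions and address - regions[-1][1] <= max_gap:
            regions[-1][1] = address                                # extend the current region
            regions[-1][2] = min(regions[-1][2], counts[address])
        else:
            regions.append([address, address, counts[address]])     # start a new region
    # The start and end of each surviving region serve as the boundaries.
    return [(start, end) for start, end, count in regions if count >= min_executions]


# Hypothetical trace: two code regions executed twice and one executed once.
trace = [0x100, 0x104, 0x108, 0x200, 0x204,
         0x100, 0x104, 0x108, 0x300, 0x200, 0x204]
boundaries = detect_instruction_regions(trace)
```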
- the analysis device 10 detects the start and end points of consecutive instruction sequences that make up the detected VM instruction as boundaries.
- the VM instruction boundaries detected here are used in VPC detection and dispatcher detection.
- the virtual program counter detection unit 1213 extracts and analyzes the execution trace for the first test script stored in the execution trace DB 131 to detect the VPC.
- the virtual program counter detection unit 1213 analyzes multiple execution traces using differential execution analysis focusing on the number of memory reads and the boundaries of each VM instruction detected by the VM instruction boundary detection unit 1212 to detect the VPC.
- the virtual program counter detection unit 1213 uses the fact that a read of the memory that holds the VPC always occurs after the execution of each VM instruction, and detects the VPC by finding the memory location that this read targets.
- the virtual program counter detection unit 1213 uses differential execution analysis that focuses on the number of memory reads to detect the VPC.
- the virtual program counter detection unit 1213 compares the execution traces acquired by running variants of the first test script, and finds memory locations whose read count changes in proportion to both the number of repetitions and the number of repeated statements.
- the virtual program counter detection unit 1213 then refers to the boundaries of each VM instruction detected by the VM instruction boundary detection unit 1212, and narrows the candidates down to memory locations whose read values always point to the start of a VM instruction.
- the virtual program counter detection unit 1213 detects the remaining memory location as the VPC.
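- The two filters described above can be sketched as follows. This is a simplified illustration: it assumes read counts are exactly proportional to the product of repetitions and repeated statements (real traces would also contain constant overhead), and the data layout, tolerance, and sample addresses are hypothetical.

```python
def find_vpc_candidates(runs, read_values, instruction_starts, tolerance=0.05):
    """runs: list of (repetitions, statements, {address: read count})."""
    candidates = []
    for address in sorted(runs[0][2]):
        ratios = [counts.get(address, 0) / (repetitions * statements)
                  for repetitions, statements, counts in runs]
        if min(ratios) <= 0:
            continue
        # Filter 1: the read count scales with both loop parameters.
        proportional = (max(ratios) - min(ratios)) / max(ratios) <= tolerance
        # Filter 2: every value read from this address is a VM instruction start.
        values = read_values.get(address, set())
        points_to_starts = bool(values) and values <= instruction_starts
        if proportional and points_to_starts:
            candidates.append(address)
    return candidates


runs = [
    (10, 3, {0xA0: 30, 0xB0: 7, 0xC0: 30}),
    (20, 3, {0xA0: 60, 0xB0: 7, 0xC0: 60}),
    (10, 6, {0xA0: 60, 0xB0: 7, 0xC0: 60}),
]
read_values = {0xA0: {0x1000, 0x1004}, 0xB0: {0x1000}, 0xC0: {0x1002, 0x1004}}
instruction_starts = {0x1000, 0x1004, 0x1008}
vpc_candidates = find_vpc_candidates(runs, read_values, instruction_starts)
```

- In this sample, 0xB0 fails the first filter because its read count does not change, and 0xC0 fails the second because it holds a value that is not an instruction start.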
- the dispatcher detection unit 1214 extracts each VM instruction portion from the script engine binary based on the boundaries of the VM instructions detected by the VM instruction boundary detection unit 1212, and detects the portions with high similarity between each VM instruction as dispatchers.
- the dispatcher is realized by referencing the pointer cache and jumping to the pointer of the next VM instruction handler.
- Dispatchers are placed in a distributed manner at the rear of each VM instruction handler, and the code therein is generally highly identical.
- the analysis device detects dispatchers using a specified method by searching for code with high similarity that exists at the rear of such VM instruction handlers. To detect the portions with high similarity, for example, a sequence alignment algorithm or other methods may be used.
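- As a simple stand-in for the similarity search described above, the following sketch takes the longest instruction sequence shared by the tail of every handler. It is not a sequence alignment algorithm, and it assumes the handlers are already disassembled into normalized text. The pseudo-assembly strings are hypothetical.

```python
def common_tail(handlers):
    """Return the longest instruction sequence shared by the end of every handler."""
    tail = []
    # Walk all handlers backwards in lockstep until they disagree.
    for column in zip(*(reversed(handler) for handler in handlers)):
        if len(set(column)) != 1:
            break
        tail.append(column[0])
    return list(reversed(tail))


handlers = [
    ["pop rax", "add rax, rbx", "push rax", "mov rcx, [vpc]", "jmp [table + rcx*8]"],
    ["push imm", "mov rcx, [vpc]", "jmp [table + rcx*8]"],
    ["pop rax", "mov rcx, [vpc]", "jmp [table + rcx*8]"],
]
dispatcher = common_tail(handlers)
```

- An exact suffix match breaks as soon as register allocation or operands differ between handlers, which is one reason an alignment-based similarity measure may be preferable in practice.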
- the code cache detection unit 1215 detects a code cache, which is a cache in which virtual machine instructions to be executed are stored, from the VM execution trace based on the execution trace, VPC, and VM execution trace.
- the code cache detection unit 1215 detects the memory area pointed to by the VPC as a code cache from the VM execution trace.
- the code cache detection unit 1215 detects the code location from which the memory allocation function that allocated this code cache was called from the execution trace.
- the code cache detection unit 1215 detects all memory areas allocated at this code location from the VM execution trace as code caches.
- the code cache detection unit 1215 detects code locations that are writing to the code cache from the execution trace.
- the code cache detection unit 1215 detects writing by these code locations in the VM execution trace as updates to the code cache.
- the instruction set architecture analysis unit 122 analyzes the instruction set architecture, which is the system of VM instructions.
- the instruction set architecture analysis unit 122 collects VM instructions. Based on the results of the script engine's VM analysis and the results of the instruction set architecture analysis, the instruction set architecture analysis unit 122 determines, from among the collected VM instructions, a branch VM instruction that causes a branch within the script, and analyzes the address designation method for the branch destination of the branch VM instruction and the operands of the branch VM instruction.
- the instruction set architecture analysis unit 122 has a VM execution trace acquisition unit 1221 (second acquisition unit), a VM instruction collection unit 1222 (first collection unit), a branch VM instruction determination unit 1223 (first determination unit), and a branch VM instruction analysis unit 1224 (third analysis unit).
- the VM execution trace acquisition unit 1221 receives as input a second test script using values characteristic of the operation target and a script engine binary.
- the VM execution trace acquisition unit 1221 acquires a VM execution trace by monitoring the VPC and the pointer of the VM instruction handler dispatched by the dispatcher.
- the VM execution trace acquisition unit 1221 acquires a VM execution trace, which is an execution trace executed on the VM, by executing the second test script while monitoring the execution of the script engine binary.
- the VM execution trace acquisition unit 1221 executes a large number of second test scripts to acquire a VM execution trace.
- the VM execution trace acquisition unit 1221 links a pointer to the VM instruction with the VM instruction, and virtually assigns a VM opcode as an identifier to each.
- a VM execution trace is an execution trace of the VM. For each executed VM instruction, it records the VPC and the pointer to the executed VM instruction handler, to which a VM opcode is virtually assigned as an identifier.
- the recording of a VPC can be achieved by monitoring the memory of the VPC detected by the virtual program counter detection unit 1213.
- a VM opcode is an identifier virtually assigned to each of a pointer to a VM instruction and a VM instruction that are linked together.
- the VM execution trace acquired by the VM execution trace acquisition unit 1221 is stored in the VM execution trace DB 133.
- the VM instruction collection unit 1222 receives the VPC and dispatcher as input, executes the second test script while monitoring the VPC and dispatcher, and obtains a VM execution trace.
- the VM instruction collection unit 1222 collects VM instructions from the VM execution trace.
- branching is done by a branch VM instruction that handles the branching.
- a branch VM instruction generally has information about the branch destination in an operand.
- VPC stands for virtual program counter.
- in direct addressing, the memory area in which the branch destination address is stored is specified. This memory area may be a variable, a virtual stack, or a virtual register. The address stored in the specified memory area becomes the value of the VPC.
- the first is a list of the opcodes of the branch VM instruction.
- the second is information indicating the operands of the branch VM instruction, especially when the VM instruction has multiple operands, which operand holds information about the branch destination.
- the third is the addressing method for the branch destination.
- the branch VM instruction determination unit 1223 determines which VM instructions are branch VM instructions from among the VM instructions collected by the VM instruction collection unit 1222.
- the branch VM instruction determination unit 1223 determines which VM instructions are branch VM instructions from among the VM instructions collected by the VM instruction collection unit 1222 based on the variation in the amount of change in VPC for each VM opcode in the VM execution trace.
- the branch VM instruction determination unit 1223 retrieves and analyzes the VM execution traces stored in the VM execution trace DB 133 to determine branch VM instructions. For each VM opcode assigned as an identifier, the branch VM instruction determination unit 1223 collects the amount of change in VPC before and after its execution. If the VM opcode is other than a branch VM instruction, the amount of change in VPC is almost constant. On the other hand, if the VM opcode is a branch VM instruction, the VPC varies depending on the branch destination.
- the branch VM instruction determination unit 1223 therefore determines whether an instruction is a branch VM instruction based on the variance in the amount of change in the virtual program counter for each VM opcode in the VM execution trace.
- the branch VM instruction determination unit 1223 focuses on the fact that the degree of variation in the VPC value differs between branch VM instructions and other VM instructions, determines a threshold value, and determines instructions with greater variation in the VPC value to be branch VM instructions.
- the branch VM instruction determination unit 1223 uses the statistical variance to evaluate the variation in the amount of change in the VPC for each VM opcode, and determines instructions whose variance is equal to or greater than a certain threshold to be branch VM instructions.
- the branch VM instruction analysis unit 1224 analyzes the addressing method of the branch destination of the branch VM instruction determined by the branch VM instruction determination unit 1223 and the operands of the branch VM instruction based on the analysis results of the VM of the script engine and the analysis results of the instruction set architecture.
- the branch VM instruction analysis unit 1224 analyzes whether the addressing method of the branch destination of the branch VM instruction is one of the three methods of immediate addressing, direct addressing, or relative addressing, and determines the operand of the branch destination of the branch VM instruction being analyzed.
- the construction unit 123 statically scans the bytecode extracted from the script to be analyzed based on the analysis results of the VM and the instruction set architecture, obtains the branch destination of the branch VM instruction, and constructs a CFG.
- the construction unit 123 has a driver generation unit 1231 (generation unit), a bytecode extraction unit 1232 (extraction unit), and a bytecode analysis unit 1233 (fourth analysis unit).
- the driver generation unit 1231 generates a script, called a driver, that actually calls the subroutines from the subroutine definitions so that all subroutines defined in the script to be analyzed are called at least once.
- FIG. 8 is a diagram explaining the processing of the bytecode extraction unit 1232.
- the bytecode extraction unit 1232 executes the analysis target script and the driver while monitoring writing and execution of the code cache.
- the bytecode extraction unit 1232 determines whether a VM instruction is present when the analysis target script is executed by the script engine, based on whether writing to the code cache occurs when the analysis target script and the driver are executed.
- the bytecode extraction unit 1232 extracts the written data as bytecode ((1) in FIG. 8). This is because when writing to the code cache occurs, the written data corresponds to the bytecode generated by the script engine executing the script to be analyzed.
- the bytecode extraction unit 1232 extracts the start point of execution as an entry point ((2) in FIG. 8).
- the start point of execution corresponds to the entry point of a subroutine in the bytecode.
- the execution of bytecode written to the code cache is equivalent to the execution of a VM instruction.
- Each VM instruction generally performs a small operation (such as addition), and a bytecode is composed of a combination of multiple VM instructions. For this reason, when a bytecode is executed, the VM instructions that make it up are executed one by one. In other words, the starting point of bytecode execution can be said to be the first VM instruction executed in the bytecode.
- the bytecode is generated by the script engine based on the input script to be analyzed when a subroutine is called by the driver, and is extracted from the code cache by the bytecode extraction process of the bytecode extraction unit 1232.
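The extraction described above can be sketched as a pass over a stream of monitored events. This is a minimal illustration only: the event tuples (`"write"`, `"exec"`), the function name `extract_bytecode`, and the rule that an entry point is the first in-cache execution after a write or after execution outside the cache are all assumptions made for the example, not details given in the text.

```python
def extract_bytecode(events, cache_start, cache_end):
    """Collect bytecode and entry points from monitored code cache events.

    events: ordered list of ("write", address, data_bytes) or ("exec", address).
    The event format is hypothetical; a real monitor would supply its own.
    Returns (bytecode, entry_points) where bytecode maps address -> byte value.
    """
    bytecode = {}
    entry_points = []
    executing = False  # True while execution stays inside the code cache
    for event in events:
        kind, addr = event[0], event[1]
        in_cache = cache_start <= addr < cache_end
        if kind == "write" and in_cache:
            # Data written to the code cache is treated as generated bytecode.
            for offset, byte in enumerate(event[2]):
                bytecode[addr + offset] = byte
            executing = False
        elif kind == "exec":
            # The first in-cache execution is treated as a subroutine entry point.
            if in_cache and not executing:
                entry_points.append(addr)
            executing = in_cache
    return bytecode, entry_points


events = [
    ("write", 0x100, b"\x01\x02\x03"),
    ("exec", 0x100),
    ("exec", 0x101),
    ("write", 0x200, b"\x04\x05"),
    ("exec", 0x200),
]
bytecode, entry_points = extract_bytecode(events, 0x100, 0x300)
```

In this toy run, two writes produce five bytes of bytecode, and the executions that begin at `0x100` and `0x200` are reported as entry points.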
- FIG. 9 is a diagram explaining the processing of the bytecode analysis unit 1233.
- the bytecode analysis unit 1233 statically scans the bytecode based on the bytecode and entry point extracted by the bytecode extraction unit 1232, and the opcode, operand, and addressing method information of the branch VM instruction, to construct a CFG.
- the bytecode analysis unit 1233 scans the bytecode from the entry point in the same flow as the execution flow.
- the bytecode analysis unit 1233 linearly scans the bytecode one VM instruction at a time from the entry point ((1) in FIG. 9).
- the instruction set architecture analysis unit 122 has already analyzed which parts of the bytecode are VM instructions by executing the VM instruction collection process. In other words, it is possible to determine the boundaries of each VM instruction that constitutes the bytecode. This allows the bytecode analysis unit 1233 to scan from the entry point to the next VM instruction, and then the VM instruction after that.
- the instruction set architecture analysis unit 122 analyzes whether the type of VM instruction is a branch VM instruction or another VM instruction. Therefore, the bytecode analysis unit 1233 determines whether the current VM instruction is a branch VM instruction. Whether or not it is a branch VM instruction is determined using the determination result by the branch VM instruction determination unit 1223. Specifically, based on the VM opcode of the branch VM instruction acquired by the branch VM instruction determination unit 1223, the bytecode analysis unit 1233 determines whether or not the current VM instruction is a branch VM instruction.
- the bytecode analysis unit 1233 obtains the branch destination based on the addressing method, opcode, and operand information of the branch destination of this branch VM instruction, and transitions scanning to the branch destination ((2) in FIG. 9).
- the addressing method, opcode, and operand information of the branch destination of the branch VM instruction is analyzed by the branch VM instruction analysis unit 1224.
- the bytecode analysis unit 1233 adds the currently scanned basic block as a node to the CFG ((3) in FIG. 9). Furthermore, the bytecode analysis unit 1233 adds an edge from the currently scanned basic block to the branch destination to the CFG ((4) in FIG. 9). The bytecode analysis unit 1233 linearly scans the bytecode one VM instruction at a time ((5) in FIG. 9) and continues scanning up to the return VM instruction ((6) in FIG. 9).
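The static scan can be illustrated with a toy instruction set. The sketch below makes several assumptions that are not in the text: opcodes are one byte, branch operands are one byte, relative offsets are signed and measured from the address of the branch instruction, every branch may also fall through, and all branch destinations lie within the scanned range. Direct addressing is omitted because it requires runtime memory contents. The names `build_cfg`, `sizes`, and `branches` are hypothetical.

```python
def build_cfg(bytecode, entry, sizes, branches, return_opcodes):
    """Scan bytecode linearly from entry and return (nodes, edges).

    sizes: {opcode: instruction length}, standing in for VM instruction boundaries.
    branches: {opcode: (operand_offset, addressing)}, addressing is
              "immediate" or "relative" (assumed one-byte operand).
    """
    instructions = []  # (address, opcode) in scan order
    targets = {}       # branch instruction address -> branch destination
    pc = entry
    while pc < len(bytecode):
        opcode = bytecode[pc]
        instructions.append((pc, opcode))
        if opcode in branches:
            operand_offset, addressing = branches[opcode]
            raw = bytecode[pc + operand_offset:pc + operand_offset + 1]
            if addressing == "immediate":
                targets[pc] = int.from_bytes(raw, "little")
            else:  # relative: destination is current address plus signed offset
                targets[pc] = pc + int.from_bytes(raw, "little", signed=True)
        if opcode in return_opcodes:
            break  # scanning stops at the return VM instruction
        pc += sizes[opcode]

    # A basic block starts at the entry, at a branch destination,
    # or at the instruction following a branch.
    leaders = {entry}
    for addr, opcode in instructions:
        if addr in targets:
            leaders.add(targets[addr])
            leaders.add(addr + sizes[opcode])

    nodes, edges, block = [], set(), None
    for i, (addr, opcode) in enumerate(instructions):
        if addr in leaders:
            block = addr
            nodes.append(block)
        next_addr = instructions[i + 1][0] if i + 1 < len(instructions) else None
        if addr in targets:
            edges.add((block, targets[addr]))
            if next_addr is not None:
                edges.add((block, next_addr))  # assumed fall-through
        elif next_addr in leaders:
            edges.add((block, next_addr))
    return nodes, sorted(edges)


# Toy program: 0x01 and 0x02 are ordinary instructions, 0x10 is a
# relative branch, 0xFF is the return instruction.
program = bytes([0x01, 0x10, 0x03, 0x01, 0x02, 0x00, 0x10, 0xFA, 0xFF])
nodes, edges = build_cfg(
    program,
    entry=0,
    sizes={0x01: 1, 0x02: 2, 0x10: 2, 0xFF: 1},
    branches={0x10: (1, "relative")},
    return_opcodes={0xFF},
)
```

Treating every branch as able to fall through is a conservative choice for this example; it can only add edges, never miss one, since the text does not distinguish conditional from unconditional branches.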
- the output unit 14 is, for example, a liquid crystal display or a printer, and outputs various information including information related to the analysis device 10.
- the output unit 14 may also be an interface that handles the input and output of various data between an external device and the output unit 14, and may output various information to the external device.
- the VM instruction boundary detection unit 1212 detects the boundaries of each VM instruction. At this time, the VM instruction boundary detection unit 1212 detects VM instructions and their boundaries for threaded code type VMs, which do not have an interpreter loop and therefore make it difficult to grasp the boundaries of VM instructions. Specifically, the VM instruction boundary detection unit 1212 extracts execution traces from the execution trace DB 131. Then, as shown in FIG. 10, the VM instruction boundary detection unit 1212 clusters the execution traces using a predetermined method, and detects clusters with a threshold or more of execution counts as VM instructions (e.g., VM instruction handlers 1 to 3). The VM instruction boundary detection unit 1212 detects the start and end points of the consecutive instruction strings that make up a VM instruction as boundaries.
- the virtual program counter detection unit 1213 detects the VPC and the pointer cache. The detection of the virtual program counter is realized by analyzing the memory access trace log of the acquired execution trace. The virtual program counter detection unit 1213 uses differential execution analysis focusing on the number of times memory is read.
- FIG. 11 is a diagram for explaining the processing of the virtual program counter detection unit 1213.
- the virtual program counter detection unit 1213 extracts one execution trace by the first test script from the execution trace DB 131.
- the number of times the VPC is read is proportional to the number of repetitions in the test script and the number of statements in the repetitive process. If the number of repetitions is N and the number of repeated statements is M, then approximately MN VPC reads will occur. For this reason, the virtual program counter detection unit 1213 extracts memory whose read count grows to approximately 4MN and 9MN in the execution traces for first test scripts in which N and M have been increased to 2N and 2M, and to 3N and 3M, respectively. Specifically, as shown in FIG. 11, the virtual program counter detection unit 1213 extracts memory areas whose read/write count increases monotonically with each VM instruction execution ((1) in FIG. 11).
- the virtual program counter detection unit 1213 detects as a VPC a memory value that has been read and that always points to the start point of a VM instruction. Specifically, the virtual program counter detection unit 1213 compares the VPC's pointing destination with the address of the VM instruction handler, and narrows it down to matching memory areas ((2) in FIG. 11).
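The two narrowing steps can be sketched as follows. This is an illustrative sketch under assumptions: read counts are given as `Counter` objects keyed by memory address, a 10% tolerance stands in for "approximately", and the second filter simply checks that every value read from a candidate lies in a known set of VM instruction start points. The function names and the tolerance are hypothetical.

```python
from collections import Counter


def vpc_candidates(base, doubled, tripled, tolerance=0.1):
    """Keep addresses whose read count grows roughly 4x and 9x.

    base, doubled, tripled: reads per address for the traces taken with
    (N, M), (2N, 2M) and (3N, 3M) respectively.
    """
    candidates = []
    for addr, n_base in base.items():
        if n_base == 0:
            continue
        ratio_2 = doubled.get(addr, 0) / n_base
        ratio_3 = tripled.get(addr, 0) / n_base
        if abs(ratio_2 - 4) <= 4 * tolerance and abs(ratio_3 - 9) <= 9 * tolerance:
            candidates.append(addr)
    return sorted(candidates)


def filter_by_instruction_start(candidates, values_read, instruction_starts):
    """Keep candidates whose read values always point to a VM instruction start."""
    return [
        addr for addr in candidates
        if values_read.get(addr)
        and all(value in instruction_starts for value in values_read[addr])
    ]


base = Counter({0x1000: 12, 0x2000: 12, 0x3000: 5})
doubled = Counter({0x1000: 48, 0x2000: 24, 0x3000: 5})
tripled = Counter({0x1000: 108, 0x2000: 36, 0x3000: 5})
candidates = vpc_candidates(base, doubled, tripled)

values_read = {0x1000: [0x500, 0x504]}
vpc = filter_by_instruction_start(candidates, values_read, {0x500, 0x504, 0x508})
```

Here only address `0x1000` scales with MN; `0x2000` grows linearly and `0x3000` stays constant, so both are discarded.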
- the dispatcher detection unit 1214 detects a dispatcher by analyzing the binary of the script engine using a predetermined method.
- FIG. 12 is a diagram for explaining the process of the dispatcher detection unit 1214.
- the dispatcher detection unit 1214 detects dispatchers. Based on the boundaries of VM instructions detected by the VM instruction boundary detection unit 1212, the dispatcher detection unit 1214 extracts each VM instruction portion from the script engine binary. Then, based on the assumption that the similarity of dispatcher code is high (FIG. 12 (1)), the dispatcher detection unit 1214 calculates the similarity between the codes of each VM instruction and detects the portion with high similarity between all VM instructions as the dispatcher. The dispatcher detection unit 1214 can detect the code that is commonly executed in the latter half of the VM instructions as the dispatcher (FIG. 12 (1)).
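One simple way to realize "code commonly executed in the latter half of the VM instructions" is a longest common suffix over the instruction sequences of the handlers. The text leaves the similarity method open, so the exact-match suffix used here is an assumption chosen for brevity, and the handler listings are invented for illustration.

```python
def common_suffix(sequences):
    """Return the longest run of items shared at the end of every sequence."""
    if not sequences:
        return []
    shortest = min(len(seq) for seq in sequences)
    suffix = []
    for i in range(1, shortest + 1):
        item = sequences[0][-i]
        if all(seq[-i] == item for seq in sequences):
            suffix.append(item)
        else:
            break
    suffix.reverse()
    return suffix


# Hypothetical disassembly of three VM instruction handlers.
handlers = {
    "handler_1": ["pop rbx", "add rax, rbx", "mov rcx, [rsi]", "add rsi, 8", "jmp rcx"],
    "handler_2": ["push rax", "mov rcx, [rsi]", "add rsi, 8", "jmp rcx"],
    "handler_3": ["mov rsi, [rsi+8]", "mov rcx, [rsi]", "add rsi, 8", "jmp rcx"],
}
dispatcher = common_suffix(list(handlers.values()))
```

Exact matching is brittle when a compiler allocates different registers per handler; a fuzzy similarity measure would tolerate that at the cost of more code.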
- the code cache detection unit 1215 detects the memory area pointed to by the VPC as a code cache from the VM execution trace ((1) in FIG. 13).
- the code cache detection unit 1215 detects the code location that called the memory allocation function that allocated this code cache from the execution trace ((2) in FIG. 13). The code cache detection unit 1215 detects all memory areas allocated at this code location from the VM execution trace as code caches ((3) in FIG. 13).
- the code cache detection unit 1215 detects the code location that is writing to the code cache from the execution trace ((4) in FIG. 13). The code cache detection unit 1215 detects the writing by this code location from the VM execution trace as an update to the code cache ((5) in FIG. 13).
- the branch VM instruction determination unit 1223 first analyzes the acquired VM execution trace log to determine the branch VM instruction.
- the test script (second test script) here is only required to include a branch VM instruction, and may be any script including a branch control syntax, not limited to the example shown in FIG. 5.
- the test script is prepared by collecting information from the Internet or obtaining information from official documents.
- the branch VM instruction determination unit 1223 associates a pointer to a VM instruction with a VM instruction for each VM execution trace in the VM execution trace DB 133, and virtually assigns a VM opcode as an identifier to each.
- Figure 14 is a diagram explaining the processing of the branch VM instruction determination unit 1223.
- if a VM instruction is a branch instruction, the advancement of the VPC changes depending on the branch destination.
- otherwise, the advancement of the VPC depends on the size of the VM instruction. For this reason, when pairs of VM instruction opcodes and pointers to VM instructions are collected and the advancement of the VPC is examined for each opcode, the advancement for a branch instruction varies depending on the branch destination.
- the branch VM instruction determination unit 1223 therefore uses the statistical variance to evaluate this variation in the advancement of the VPC.
- the branch VM instruction determination unit 1223 calculates the variance of the amount of change in the VPC for each VM opcode, and narrows it down to only VM opcodes whose calculated variance is greater than a threshold value. In this way, the branch VM instruction determination unit 1223 associates the pointer with the VM instruction, and determines that a VM instruction with variance in the advance of the VPC (VM instruction handler 3 in the example of FIG. 14) is a branch VM instruction ((1) in FIG. 14).
- the threshold value is set to a value that can divide the two groups that result by plotting the obtained variance value on a number line, for example.
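The variance-based determination can be sketched as below. The trace format (a list of `(vpc, opcode)` pairs in execution order) and the rule for choosing the threshold (the midpoint of the largest gap between sorted variance values, as one way to "divide the two groups" on a number line) are assumptions for this example; the function names are hypothetical.

```python
from collections import defaultdict
from statistics import pvariance


def variance_by_opcode(traces):
    """Compute the variance of the VPC change for each VM opcode.

    traces: list of VM execution traces, each a list of (vpc, opcode) pairs.
    """
    deltas = defaultdict(list)
    for trace in traces:
        # Pair each executed instruction with the one executed next.
        for (vpc, opcode), (next_vpc, _) in zip(trace, trace[1:]):
            deltas[opcode].append(next_vpc - vpc)
    return {opcode: pvariance(values) for opcode, values in deltas.items()}


def threshold_from_largest_gap(variances):
    """Pick the midpoint of the widest gap between sorted variance values."""
    values = sorted(set(variances.values()))
    if len(values) < 2:
        raise ValueError("need at least two distinct variance values")
    gaps = [(high - low, (low + high) / 2) for low, high in zip(values, values[1:])]
    return max(gaps)[1]


def branch_opcodes(variances, threshold):
    """Return the opcodes whose variance is greater than the threshold."""
    return sorted(op for op, var in variances.items() if var > threshold)


# Toy trace: opcode 1 is 2 bytes, opcode 2 is 4 bytes, opcode 3 branches.
trace = [(0, 1), (2, 2), (6, 3), (0, 1), (2, 2), (6, 3),
         (10, 1), (12, 3), (20, 2), (24, 4)]
variances = variance_by_opcode([trace])
threshold = threshold_from_largest_gap(variances)
detected = branch_opcodes(variances, threshold)
```

In the toy trace, opcodes 1 and 2 always advance the VPC by their own size and have zero variance, while opcode 3 moves the VPC by -6, +4, and +8 and is the only one detected.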
- Branch VM instruction analysis unit 1224 analyzes whether the addressing method of the branch destination of the branch VM instruction is one of the three methods of immediate addressing, direct addressing, or relative addressing, and determines the operand of the branch destination of the branch VM instruction to be analyzed.
- the immediate addressing method is a method in which the address of the branch destination is directly specified. With the immediate addressing method, the VPC after the branch becomes this specified value.
- Figures 15 to 17 are diagrams explaining the processing of the branch VM instruction analysis unit 1224.
- the branch VM instruction analysis unit 1224 then assigns a taint tag T1 (see FIG. 15) to the code cache of the branch VM instruction being analyzed, and executes the branch VM instruction while propagating the taint tag in accordance with the movement of data. Execution of the branch VM instruction changes the VPC, causing a transition in execution ((1) in FIG. 15).
- the branch VM instruction analysis unit 1224 determines that the addressing method of the branch destination of this branch VM instruction is the immediate addressing method. In other words, if taint tag T1 is propagated to the VPC by execution of the branch VM instruction, the branch VM instruction analysis unit 1224 determines that the addressing method of the branch destination of the branch VM instruction is the immediate addressing method ((2) in FIG. 15). The branch VM instruction analysis unit 1224 then determines that the original data portion of the data moved to the VPC with taint tag T1 added is the operand of the branch destination of the branch VM instruction being analyzed.
- the direct addressing method is a method of specifying the memory area in which the branch destination address is stored. With direct addressing, the address stored in the specified memory area becomes the value of the VPC.
- the branch VM instruction analysis unit 1224 therefore assigns a first tag T21 (see FIG. 16) to the code cache of the branch VM instruction being analyzed, and executes the branch VM instruction while propagating the first tag T21 in accordance with the movement of data. If data with the first tag T21 is referenced by a pointer during execution of the branch VM instruction being analyzed, the branch VM instruction analysis unit 1224 assigns a second tag T22 to the referenced data ((1) in FIG. 16).
- the branch VM instruction analysis unit 1224 determines that the addressing method for the branch destination of this branch VM instruction is the direct addressing method ((2) in FIG. 16). Then, for the first tag T21 that triggered the assignment of the second tag T22 to the data moved to the VPC, the branch VM instruction analysis unit 1224 determines that the original data portion that assigned the tag is the operand of the branch destination.
- the relative addressing method specifies an offset from the current VPC. Therefore, with the relative addressing method, the value of the VPC after branching is the current VPC value plus this offset.
- the branch VM instruction analysis unit 1224 then assigns a first tag T31 (see FIG. 17) to the VPC of the branch VM instruction being analyzed, and assigns a second tag T32 (see FIG. 17) to the code cache of the branch VM instruction being analyzed.
- the branch VM instruction analysis unit 1224 then executes the branch VM instruction being analyzed while propagating the tags (first tag T31 and second tag T32) in accordance with the movement of data.
- the branch VM instruction analysis unit 1224 determines that the addressing method of the branch destination of the branch VM instruction to be analyzed is the relative addressing method ((1) in FIG. 17). Then, the branch VM instruction analysis unit 1224 determines that the original data portion of the second tag added to the first tag is the offset operand of the branch destination.
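The three addressing methods differ only in how the operand relates to the VPC after the branch. The sketch below does not implement the taint propagation described above; it substitutes a much simpler value-matching check to show the three relationships side by side. The function name, the `memory` dictionary, and the choice to measure relative offsets from the VPC before the branch are assumptions.

```python
def classify_addressing(vpc_before, vpc_after, operands, memory):
    """Guess the addressing method by matching values (not taint analysis).

    operands: candidate operand values of the branch VM instruction.
    memory: {address: stored value}, a stand-in for variables, the virtual
            stack, or virtual registers.
    Returns (method, operand_index), or (None, None) if nothing matches.
    """
    # Immediate: the operand itself becomes the VPC.
    for index, value in enumerate(operands):
        if vpc_after == value:
            return "immediate", index
    # Direct: the value stored at the operand's address becomes the VPC.
    for index, value in enumerate(operands):
        if memory.get(value) == vpc_after:
            return "direct", index
    # Relative: the VPC becomes the previous VPC plus the operand.
    for index, value in enumerate(operands):
        if vpc_after == vpc_before + value:
            return "relative", index
    return None, None


immediate = classify_addressing(100, 40, [7, 40], {})
direct = classify_addressing(100, 40, [0x2000], {0x2000: 40})
relative = classify_addressing(100, 116, [16], {})
```

Value matching can be fooled by coincidental equality between unrelated values, which is one reason to track actual data flow with taint tags instead.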
- driver generation unit 1231 accepts an input.
- the script to be analyzed will be described.
- FIG. 18 is a diagram showing an example of a script to be analyzed and a driver generated by the driver generation unit 1231. As shown in FIG. 18, the script to be analyzed may include multiple subroutines.
- Bytecode is dynamically generated at run time based on the input script to be analyzed. Depending on the design of the script engine, it may not generate all bytecode at once, but rather, for example, generate each subroutine just before it is executed. In such a case, bytecode for subroutines that were not called will not be generated, and a CFG cannot be constructed.
- the driver generation unit 1231 ensures that all subroutines defined in the script to be analyzed are called at least once. Specifically, the driver generation unit 1231 generates a script that actually calls the subroutine, called a driver, from the definition of the subroutine. The driver generation unit 1231 scans the script to be analyzed to extract all function definitions, and generates a driver that calls the function based on the extracted function definitions.
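Driver generation can be sketched as a scan for function definitions followed by emitting one call per definition. The example assumes a JavaScript-like syntax (`function name(params)`) and passes a placeholder argument for each parameter; the regular expression, the placeholder `0`, and the function name `generate_driver` are all illustrative assumptions, and a real target language would need its own parser.

```python
import re

# Matches definitions such as "function add(a, b)" in a JavaScript-like script.
FUNCTION_DEF = re.compile(r"\bfunction\s+([A-Za-z_$][\w$]*)\s*\(([^)]*)\)")


def generate_driver(script_source, dummy_argument="0"):
    """Generate a driver that calls every function defined in the script once."""
    calls = []
    for name, params in FUNCTION_DEF.findall(script_source):
        # Supply one placeholder argument per declared parameter.
        count = len([p for p in params.split(",") if p.strip()])
        arguments = ", ".join([dummy_argument] * count)
        calls.append(f"{name}({arguments});")
    return "\n".join(calls)


script = (
    "function add(a, b) { return a + b; }\n"
    "function hello() { print('hi'); }\n"
)
driver = generate_driver(script)
```

For the two-function script above, the generated driver contains the lines `add(0, 0);` and `hello();`.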
- Fig. 19 is a flowchart showing the procedure of the analysis process according to the embodiment.
- the input unit 11 receives a test script and a script engine binary as input (step S1).
- the test script includes a first test script and a second test script.
- the execution trace acquisition unit 1211 performs an execution trace acquisition process in which the first test script is executed while monitoring the binary of the script engine to acquire a branch trace and a memory access trace (step S2).
- the VM instruction boundary detection unit 1212 detects VM instructions and performs VM instruction boundary detection processing to detect VM instruction boundaries (step S3).
- the virtual program counter detection unit 1213 extracts and analyzes the execution trace for the first test script stored in the execution trace DB 131, and performs virtual program counter detection processing to discover the VPC (step S4).
- the dispatcher detection unit 1214 performs dispatcher detection processing to extract each VM instruction portion from the script engine binary and detect the portion with high similarity across the VM instructions as a dispatcher (step S5).
- the code cache detection unit 1215 performs a code cache detection process based on the execution trace and VPC to detect, as code caches, the memory areas allocated at the code location that called the memory allocation function, and to detect writes made by the code location that writes to the code cache as updates to the code cache (step S6).
- the VM execution trace acquisition unit 1221 receives the test script and the script engine binary as input, and executes the test script while monitoring the execution of the script engine binary, thereby performing a VM execution trace acquisition process to acquire a VM execution trace (step S7).
- the VM instruction collection unit 1222 performs a VM instruction collection process to collect VM instructions from the VM execution trace (step S8).
- the branch VM instruction determination unit 1223 performs a branch VM instruction determination process to determine branch VM instructions from among the VM instructions collected by the VM instruction collection unit 1222 (step S9).
- the branch VM instruction analysis unit 1224 performs a branch VM instruction analysis process to analyze the addressing method of the branch destination of the branch VM instruction being analyzed and the operands of the branch VM instruction (step S10).
- the driver generation unit 1231 performs a driver generation process to generate a script, called a driver, that actually calls the subroutines from the subroutine definitions so that all subroutines defined in the script to be analyzed are called at least once (step S11).
- the bytecode extraction unit 1232 executes the script to be analyzed and the driver generated in step S11 while monitoring the writing and execution of the code cache, and performs a bytecode extraction process to extract the bytecode and entry point (step S12).
- the bytecode analysis unit 1233 performs a bytecode analysis process to statically scan the bytecode and construct a CFG based on the bytecode and entry point extracted by the bytecode extraction unit 1232, and the opcode, operand, and addressing method information of the branch VM instruction (step S13).
- the output unit 14 outputs the CFG constructed in step S13 (step S14).
- Fig. 20 is a flowchart showing the processing procedure of the execution trace acquisition process shown in Fig. 19.
- the execution trace acquisition unit 1211 receives the first test script and the script engine binary as input (step S21). Then, the execution trace acquisition unit 1211 hooks the received script engine to acquire a branch trace (step S22). The execution trace acquisition unit 1211 also hooks the received script engine to acquire a memory access trace (step S23).
- the execution trace acquisition unit 1211 inputs the first test script received in this state into the script engine and executes it (step S24), and stores the execution trace acquired thereby in the execution trace DB 131 (step S25).
- the execution trace acquisition unit 1211 determines whether or not all of the input first test scripts have been executed (step S26). If all of the input first test scripts have been executed (step S26: Yes), the execution trace acquisition unit 1211 ends the process. On the other hand, if not all of the input first test scripts have been executed (step S26: No), the execution trace acquisition unit 1211 returns to the execution of the first test script in step S24 and continues the process.
- Fig. 21 is a flowchart showing the processing procedure of the VM instruction boundary detection process shown in Fig. 19.
- the VM instruction boundary detection unit 1212 extracts execution traces from the execution trace DB 131 (step S31).
- the VM instruction boundary detection unit 1212 clusters the execution traces using a predetermined method (step S32). Any method may be used for clustering.
- the VM instruction boundary detection unit 1212 detects clusters whose execution count is equal to or exceeds a threshold as VM instructions (step S33). Then, the VM instruction boundary detection unit 1212 determines the start and end points of the continuous instruction sequence that constitutes the VM instruction as boundaries (step S34). The VM instruction boundary detection unit 1212 outputs the VM instruction boundary as a return value (step S35), and ends the VM instruction boundary detection process.
- Fig. 22 is a flowchart showing the processing procedure of the virtual program counter detection process shown in Fig. 19.
- the virtual program counter detection unit 1213 extracts one execution trace by the first test script from the execution trace DB 131 (step S41). Next, the virtual program counter detection unit 1213 focuses on memory access traces among the execution traces, and counts up the number of reads for each memory read destination (step S42).
- the virtual program counter detection unit 1213 receives as input the first test script used to obtain the execution trace (step S43), and analyzes the first test script to obtain the number of repetitions and the number of repeated statements (step S44).
- the virtual program counter detection unit 1213 extracts from the execution trace DB 131 another execution trace by the first test script, which has a different number of repetitions and number of repeated statements (step S45). Then, the virtual program counter detection unit 1213 focuses on the memory access trace and counts the number of reads for each memory read destination (step S46). The virtual program counter detection unit 1213 also receives as input the first test script used to obtain the execution trace (step S47), and analyzes the first test script to obtain the number of repetitions and the number of repeated statements (step S48).
- the virtual program counter detection unit 1213 narrows down the memory read destinations to only those whose read counts change in proportion to the number of repetitions or the increase or decrease in the number of repeated statements (step S49). Furthermore, the virtual program counter detection unit 1213 narrows down the memory read destinations narrowed down in step S49 to those whose read memory values always point to the start point of the VM instruction (step S50).
- the virtual program counter detection unit 1213 determines whether the memory read destinations have been narrowed down to only one (step S51). If the virtual program counter detection unit 1213 has not narrowed down the memory read destinations to only one (step S51: No), the process returns to step S45, where the virtual program counter detection unit 1213 retrieves the next execution trace and continues processing. On the other hand, if the virtual program counter detection unit 1213 has narrowed down the memory read destinations to only one (step S51: Yes), the virtual program counter detection unit 1213 stores the narrowed down memory read destination in the architecture information DB 132 as a virtual program counter (step S52), and ends processing.
- Fig. 23 is a flowchart showing the processing procedure of the dispatcher detection process shown in Fig. 19.
- the dispatcher detection unit 1214 receives the script engine binary as input (step S61).
- the dispatcher detection unit 1214 receives the boundaries of VM instructions from the VM instruction boundary detection unit 1212 (step S62).
- the dispatcher detection unit 1214 extracts each VM instruction portion from the script engine binary based on the boundaries of the VM instructions received from the VM instruction boundary detection unit 1212 (step S63).
- the dispatcher detection unit 1214 calculates the similarity between the codes of each VM instruction using a predetermined method (step S64). Any method for calculating the similarity may be used as long as it is a method that can calculate the similarity between codes.
- the dispatcher detection unit 1214 extracts the part with high similarity among all VM instructions based on the similarity calculated in step S64 (step S65). The dispatcher detection unit 1214 then determines whether the extracted part is the end part of the VM instruction (step S66).
- if it is not the end of the VM instruction (step S66: No), the dispatcher detection unit 1214 returns to step S65 and continues processing. If it is the end of the VM instruction (step S66: Yes), the dispatcher detection unit 1214 outputs the extracted part as a dispatcher (step S67) and ends processing.
- Fig. 24 is a flowchart showing the processing procedure of the code cache detection process shown in Fig. 19.
- when the code cache detection unit 1215 receives an execution trace and a VM execution trace as input (step S71), it acquires the memory area pointed to by the VPC from the VM execution trace (step S72). The VM execution trace is acquired by the VM execution trace acquisition unit 1221.
- the code cache detection unit 1215 obtains from the execution trace the code location of the caller of the memory allocation function that allocated the memory area obtained in step S72 (step S73).
- the code cache detection unit 1215 detects, from the VM execution trace, all areas allocated at the code location obtained in step S73 as code caches (step S74).
- the code cache detection unit 1215 acquires the code location that is writing to the code cache from the execution trace (step S75). The code cache detection unit 1215 detects all areas in the VM execution trace that are written to at the code location acquired in step S75 as code cache updates (step S76). The code cache detection unit 1215 returns the detected code cache and its updated location (step S77), and ends the code cache detection process.
- Fig. 25 is a flowchart showing the procedure of the VM execution trace acquisition process shown in Fig. 19.
- the VM execution trace acquisition unit 1221 receives the second test script and the script engine binary as input (step S81). Then, the VM execution trace acquisition unit 1221 hooks the received script engine to record the VPC and VM opcode (step S82).
- the VM execution trace acquisition unit 1221 inputs the second test script received in this state into the script engine and executes it (step S83), and stores the VM execution trace acquired thereby in the VM execution trace DB 133 (step S84).
- the VM execution trace acquisition unit 1221 determines whether or not all of the input second test scripts have been executed (step S85). If all of the input second test scripts have been executed (step S85: Yes), the VM execution trace acquisition unit 1221 ends the process. If all of the input second test scripts have not been executed (step S85: No), the VM execution trace acquisition unit 1221 returns to the execution of the second test script in step S83 and continues the process.
- Fig. 26 is a flowchart showing the processing procedure of the VM instruction collection process shown in Fig. 19.
- the VM instruction collection unit 1222 receives the VPC and dispatcher as input (step S91) and acquires various scripts from the Internet (step S92). The VM instruction collection unit 1222 executes the scripts while monitoring the VPC and dispatcher, and acquires a VM execution trace (step S93).
- the VM instruction collection unit 1222 acquires VM instructions from the VM execution trace (step S94) and adds them to a list of VM instructions (step S95). If the VM instruction collection unit 1222 finds a VM instruction that is not in the list (step S96: No), it returns to step S92. If the VM instruction collection unit 1222 finds no VM instructions that are not in the list (step S96: Yes), it returns the list of VM instructions (step S97) and ends the VM instruction collection process.
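The collection loop of steps S92 to S97 amounts to running batches of scripts until a batch contributes no VM instruction that is not already in the list. The sketch below assumes a hypothetical callable `run_and_trace(script)` that returns a VM execution trace as a list of `(vpc, opcode)` records; script acquisition and engine monitoring are outside its scope.

```python
def collect_vm_instructions(script_batches, run_and_trace):
    """Accumulate VM opcodes until a batch of scripts yields nothing new."""
    known = set()
    for batch in script_batches:               # S92: obtain more scripts
        seen = set()
        for script in batch:                   # S93: execute and trace
            seen.update(opcode for _vpc, opcode in run_and_trace(script))
        new = seen - known                     # S94: instructions in this trace
        known |= seen                          # S95: add them to the list
        if not new:                            # S96: nothing outside the list
            break
    return known                               # S97
```

Stopping at the first batch without new opcodes is a heuristic: rarely used VM instructions can still be missed if no collected script exercises them.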
- Fig. 27 is a flowchart showing the processing procedure of the branch VM instruction determination process shown in Fig. 19.
- the branch VM instruction determination unit 1223 extracts one VM execution trace from the VM execution trace DB 133 (step S101).
- the branch VM instruction determination unit 1223 links a pointer to the VM instruction with the VM instruction, and assigns a VM opcode to each as an identifier (step S102). Then, the branch VM instruction determination unit 1223 counts the amount of change in VPC before and after execution for each VM opcode (step S103).
- the branch VM instruction determination unit 1223 determines whether all VM execution traces in the VM execution trace DB 133 have been processed (step S104). If all VM execution traces in the VM execution trace DB 133 have not been processed (step S104: No), the branch VM instruction determination unit 1223 returns to step S101 and retrieves and processes the next VM execution trace.
- If all VM execution traces in the VM execution trace DB 133 have been processed (step S104: Yes), the branch VM instruction determination unit 1223 calculates the variance of the amount of change in VPC for each VM opcode (step S105). Then, the branch VM instruction determination unit 1223 receives a threshold value as an input (step S106). The branch VM instruction determination unit 1223 narrows down to only VM opcodes whose variance is greater than the threshold value (step S107), stores them as branch VM instructions in the architecture information DB 132 (step S108), and ends the process.
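The variance criterion of steps S103 to S107 can be sketched as follows. The trace format (lists of `(vpc, opcode)` records in execution order) and the use of the population variance are assumptions; the description only states that the variance of the VPC change is compared with a threshold.

```python
from collections import defaultdict
from statistics import pvariance


def find_branch_opcodes(traces, threshold):
    """Return the opcodes whose VPC change varies by more than threshold."""
    deltas = defaultdict(list)
    for trace in traces:
        # S103: VPC change across each executed VM instruction, measured as
        # the difference between its VPC and the VPC of the next record.
        for (vpc, opcode), (next_vpc, _) in zip(trace, trace[1:]):
            deltas[opcode].append(next_vpc - vpc)
    # S105-S107: keep opcodes whose variance exceeds the threshold.
    return {op for op, changes in deltas.items() if pvariance(changes) > threshold}
```

A non-branching VM instruction always advances the VPC by its own length, so its variance is zero, whereas a branch moves the VPC by a different amount depending on its destination.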
- Fig. 28 is a flowchart showing the processing procedure of the branch VM instruction analysis process shown in Fig. 19.
- the branch VM instruction analysis unit 1224 performs an immediate addressing method analysis process to analyze whether the addressing method of the branch destination of the branch VM instruction is the immediate addressing method (step S111).
- the branch VM instruction analysis unit 1224 performs a direct addressing method analysis process to analyze whether the addressing method of the branch destination of the branch VM instruction is a direct addressing method (step S112).
- the branch VM instruction analysis unit 1224 performs a relative addressing method analysis process to analyze whether the addressing method of the branch destination of the branch VM instruction is a relative addressing method (step S113).
- the branch VM instruction analysis unit 1224 outputs the analysis results, i.e., the addressing method for the branch destination of the branch VM instruction and the operands of the branch VM instruction (step S114).
- FIG. 29 is a flowchart showing the procedure of the immediate addressing method analysis process shown in Fig. 28.
- the branch VM instruction analysis unit 1224 receives the VPC and the code cache as input (step S121).
- the branch VM instruction analysis unit 1224 receives the script acquired in the VM instruction collection process as input (step S122).
- the branch VM instruction analysis unit 1224 executes the script (step S123).
- the branch VM instruction analysis unit 1224 stops execution when the bytecode is written to the code cache (step S124).
- the branch VM instruction analysis unit 1224 assigns a taint tag to the bytecode in the code cache of the branch VM instruction to be analyzed (step S125).
- the branch VM instruction analysis unit 1224 resumes execution of the branch VM instruction being analyzed while propagating the taint tag in accordance with the data movement (step S126).
- the branch VM instruction analysis unit 1224 determines whether or not there is tagged data that has been moved to the VPC during execution of the branch VM instruction being analyzed (step S127).
- If there is tagged data that has been moved to the VPC during execution of the branch VM instruction (step S127: Yes), the branch VM instruction analysis unit 1224 determines that the addressing method for the branch destination of the branch VM instruction being analyzed is the immediate addressing method (step S128).
- the branch VM instruction analysis unit 1224 determines that the original data portion to which the taint tag has been added and which has been moved to the VPC is the branch destination operand of the branch VM instruction being analyzed (step S129).
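- If there is no tagged data that has been moved to the VPC during execution of the branch VM instruction (step S127: No), the branch VM instruction analysis unit 1224 determines that the addressing method for the branch destination of the branch VM instruction being analyzed is not the immediate addressing method (step S130).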
- the branch VM instruction analysis unit 1224 outputs the addressing method of the branch VM instruction resulting from the analysis and the operands of the branch VM instruction (step S131).
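The taint check of steps S125 to S129 can be illustrated on a recorded list of data movements. This is a simplified sketch: the `(dst, src)` move records and the location names such as `"code[1]"` and `"vpc"` are hypothetical, and a real implementation would propagate tags inside an instrumented script engine rather than over a list.

```python
def immediate_operand_sources(moves, bytecode_locs, vpc="vpc"):
    """Return the bytecode locations whose data ended up in the VPC.

    moves: (dst, src) data movements observed while the branch VM
           instruction under analysis executes.
    """
    # S125: tag each bytecode location with its own origin.
    taint = {loc: {loc} for loc in bytecode_locs}
    for dst, src in moves:                     # S126: propagate on movement
        if src in taint:
            taint[dst] = set(taint[src])
        else:
            taint.pop(dst, None)               # overwritten by untagged data
    # S127-S129: a non-empty result means immediate addressing, and the
    # returned locations are the branch destination operand.
    return taint.get(vpc, set())
```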
- FIG. 30 is a flowchart showing the processing procedure of the direct addressing method analysis process shown in Fig. 28.
- the branch VM instruction analysis unit 1224 receives the VPC and the code cache as input (step S141).
- the branch VM instruction analysis unit 1224 receives the script acquired in the VM instruction collection process as input (step S142).
- the branch VM instruction analysis unit 1224 executes the script (step S143).
- the branch VM instruction analysis unit 1224 stops execution when the bytecode is written to the code cache (step S144).
- the branch VM instruction analysis unit 1224 assigns a first tag to the bytecode in the code cache of the branch VM instruction to be analyzed (step S145).
- the branch VM instruction analysis unit 1224 resumes execution of the branch VM instruction being analyzed while propagating the tag in accordance with the movement of data (step S146).
- the branch VM instruction analysis unit 1224 determines whether the first tagged data is referenced by a pointer when the branch VM instruction to be analyzed is executed (step S147).
- If the data with the first tag is referenced by a pointer (step S147: Yes), the branch VM instruction analysis unit 1224 assigns a second tag to the referenced data (step S148).
- If the first tagged data is not pointer referenced (step S147: No), or after processing of step S148, the branch VM instruction analysis unit 1224 determines whether there is second tagged data that has been moved to the VPC during execution of the branch VM instruction to be analyzed (step S149).
- If there is second tagged data that was moved to the VPC during execution of the branch VM instruction being analyzed (step S149: Yes), the branch VM instruction analysis unit 1224 determines that the addressing method for the branch destination of this branch VM instruction is the direct addressing method (step S150).
- the branch VM instruction analysis unit 1224 determines that the original data portion to which the first tag was added, which triggered the addition of the second tag to the data moved to the VPC, is the branch destination operand (step S151).
- If there is no second tagged data moved to the VPC during execution of the branch VM instruction (step S149: No), the branch VM instruction analysis unit 1224 determines that the addressing method for the branch destination of this branch VM instruction is not the direct addressing method (step S152).
- the branch VM instruction analysis unit 1224 outputs the addressing method of the branch VM instruction resulting from the analysis and the operands of the branch VM instruction (step S153).
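The two-tag scheme of steps S145 to S151 can be sketched in the same style. The event format is an assumption: `("mov", dst, src)` copies data, and `("load", dst, src)` stores into `dst` the data found at the address held in `src`, which stands in for a pointer reference.

```python
def direct_operand_source(events, bytecode_locs, vpc="vpc"):
    """Return the bytecode location used as a pointer to the value that
    reached the VPC, or None if no second-tagged data reached it."""
    first = {loc: loc for loc in bytecode_locs}    # S145: first tag
    second = {}                                    # location -> pointer origin
    for kind, dst, src in events:
        if kind == "load":                         # dst = memory[value in src]
            first.pop(dst, None)
            if src in first:                       # S147: Yes -> S148
                second[dst] = first[src]
            else:
                second.pop(dst, None)
        else:                                      # S146: tags follow the data
            for tags in (first, second):
                if src in tags:
                    tags[dst] = tags[src]
                else:
                    tags.pop(dst, None)
    return second.get(vpc)                         # S149-S151
```

A return value other than `None` corresponds to the direct addressing method, and the returned location is the branch destination operand in the sense of step S151.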
- FIG. 31 is a flowchart showing the processing procedure of the relative addressing method analysis process shown in Fig. 28.
- the branch VM instruction analysis unit 1224 receives the VPC and the code cache as input (step S161).
- the branch VM instruction analysis unit 1224 receives the script acquired in the VM instruction collection process as input (step S162).
- the branch VM instruction analysis unit 1224 executes the script (step S163).
- the branch VM instruction analysis unit 1224 stops execution when writing the bytecode to the code cache (step S164).
- the branch VM instruction analysis unit 1224 assigns a first tag to the VPC of the branch VM instruction to be analyzed (step S165).
- the branch VM instruction analysis unit 1224 assigns a second tag to the bytecode in the code cache of the branch VM instruction being analyzed (step S166).
- the branch VM instruction analysis unit 1224 resumes execution of the branch VM instruction being analyzed while propagating the tag in accordance with the movement of data (step S167).
- the branch VM instruction analysis unit 1224 determines whether data carrying the first tag and data carrying the second tag are added together and the result is moved to the VPC during execution of the branch VM instruction to be analyzed (step S168).
- If data carrying the first tag and data carrying the second tag are added together and moved to the VPC during execution of the branch VM instruction (step S168: Yes), the branch VM instruction analysis unit 1224 determines that the addressing method for the branch destination of the branch VM instruction being analyzed is the relative addressing method (step S169).
- the branch VM instruction analysis unit 1224 determines that the original data portion to which the second tag was assigned, and which was added to the data carrying the first tag, is the offset operand of the branch destination (step S170).
- If no such data is moved to the VPC during execution of the branch VM instruction (step S168: No), the branch VM instruction analysis unit 1224 determines that the addressing method for the branch destination of this branch VM instruction is not a relative addressing method (step S171).
- the branch VM instruction analysis unit 1224 outputs the addressing method of the branch VM instruction resulting from the analysis and the operands of the branch VM instruction (step S172).
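For the relative addressing check of steps S165 to S170, the sketch below tracks sets of tags so that a value computed from several sources carries all of their tags. The event format `(dst, src1, src2, ...)`, meaning that `dst` is computed from the listed sources, is an assumption made for illustration.

```python
def relative_offset_sources(events, bytecode_locs, vpc="vpc"):
    """Return the bytecode locations combined with the old VPC value to
    form the new VPC, or None if that pattern was not observed."""
    tags = {vpc: {("first", vpc)}}                                   # S165
    tags.update({loc: {("second", loc)} for loc in bytecode_locs})   # S166
    for dst, *srcs in events:                                        # S167
        merged = set()
        for src in srcs:
            merged |= tags.get(src, set())
        if merged:
            tags[dst] = merged
        else:
            tags.pop(dst, None)
    final = tags.get(vpc, set())
    # S168: the VPC must hold data derived from both tagged sources.
    if {kind for kind, _ in final} == {"first", "second"}:
        return sorted(loc for kind, loc in final if kind == "second")  # S170
    return None                                                        # S171
```

An ordinary sequential advance such as `vpc = vpc + constant` leaves only the first tag on the VPC, so it is not reported as relative addressing.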
- Fig. 32 is a flowchart showing the processing procedure of the driver generation processing shown in Fig. 19.
- the driver generation unit 1231 receives an analysis target script as an input (step S181), scans the analysis target script, and extracts all function definitions (step S182).
- the driver generation unit 1231 extracts one function definition (step S183).
- the driver generation unit 1231 searches for a call to that function in the script to be analyzed (step S184).
- the driver generation unit 1231 determines whether a function call has been found (step S185). If a function call has been found (step S185: Yes), the driver generation unit 1231 uses the function call found in the search as a driver (step S186).
- If no function call is found (step S185: No), the driver generation unit 1231 generates a driver that assigns random values to each argument based on the extracted function definition and calls the function (step S187). The driver generation unit 1231 calls the function using the driver (step S188).
- the driver generation unit 1231 determines whether or not execution stopped due to an error when the function was called (step S189). If execution stopped due to an error when the function was called (step S189: Yes), the driver generation unit 1231 generates a driver that calls the function by providing another random value as an argument (step S190). Then, the process proceeds to step S188.
- If execution did not stop due to an error when the function was called (step S189: No), the driver generation unit 1231 determines whether all function definitions have been processed (step S191).
- If not all function definitions have been processed (step S191: No), the driver generation unit 1231 extracts the next function definition (step S192) and proceeds to step S184. If all function definitions have been processed (step S191: Yes), the driver generation unit 1231 outputs the generated driver (step S193).
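The driver generation procedure of steps S182 to S193 can be illustrated using Python source as a stand-in for the script to be analyzed. Everything specific here is an assumption: the use of the `ast` module, the `CANDIDATES` pool of argument values, and the `max_tries` bound, which the description does not mention.

```python
import ast
import random

CANDIDATES = (0, "a", 2.5)  # hypothetical pool of random argument values


def generate_drivers(source, max_tries=64, seed=0):
    """Map each top-level function name to a call expression that runs it."""
    rng = random.Random(seed)
    tree = ast.parse(source)
    # S182: all top-level function definitions.
    funcs = [n for n in tree.body if isinstance(n, ast.FunctionDef)]
    # S184: existing calls found in top-level statements of the script.
    existing = {}
    for stmt in tree.body:
        if isinstance(stmt, ast.FunctionDef):
            continue
        for node in ast.walk(stmt):
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                existing.setdefault(node.func.id, ast.unparse(node))
    env = {}
    exec(source, env)  # defines the functions; only run trusted or sandboxed scripts
    drivers = {}
    for func in funcs:
        if func.name in existing:                      # S185: Yes -> S186
            drivers[func.name] = existing[func.name]
            continue
        for _ in range(max_tries):                     # S187 to S190
            args = ", ".join(repr(rng.choice(CANDIDATES)) for _ in func.args.args)
            driver = f"{func.name}({args})"
            try:
                eval(driver, env)                      # S188: call through the driver
            except Exception:
                continue                               # S189: Yes -> other values
            drivers[func.name] = driver
            break
    return drivers
```

The retry bound keeps the sketch from looping forever on a function that rejects every candidate value; such a function is simply left without a driver.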
- FIG. 33 is a flowchart of the bytecode extraction process shown in FIG. 19.
- the bytecode extraction unit 1232 receives the script to be analyzed and the driver generated by the driver generation unit 1231 (step S201).
- the bytecode extraction unit 1232 executes the script to be analyzed and the driver (step S203) while monitoring the writing and execution of the code cache (step S202).
- the bytecode extraction unit 1232 determines whether or not a write was made to the code cache during the execution process of step S203 (step S204).
- If a write has been made to the code cache (step S204: Yes), the bytecode extraction unit 1232 extracts the written data as a bytecode (step S205).
- If no write has been made to the code cache (step S204: No), or after processing of step S205, the bytecode extraction unit 1232 determines whether the written bytecode has been executed (step S206).
- If the written bytecode is executed (step S206: Yes), the bytecode extraction unit 1232 extracts the start point of execution as an entry point (step S207).
- If the written bytecode has not been executed (step S206: No), or after processing of step S207, the bytecode extraction unit 1232 determines whether the execution of the script to be analyzed and the driver has ended (step S208).
- If the execution of the script to be analyzed and the driver has not finished (step S208: No), the bytecode extraction unit 1232 continues execution (step S209) and proceeds to step S204.
- When the execution of the script to be analyzed and the driver is completed (step S208: Yes), the bytecode extraction unit 1232 outputs the extracted bytecode and entry point (step S210).
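The monitoring loop of steps S202 to S210 can be sketched over a recorded event stream. The event tuples `("write", address, data)` and `("exec_start", address, b"")` and the `cache` range are hypothetical; in the described device these observations come from monitoring the script engine while the script and driver run.

```python
def extract_bytecode(events, cache):
    """Collect bytes written to the code cache and the addresses at which
    execution of written bytecode started."""
    low, high = cache                      # half-open address range
    image, entry_points = {}, []
    for kind, addr, data in events:
        if not low <= addr < high:
            continue                       # not a code cache access
        if kind == "write":                # S204: Yes -> S205
            for offset, byte in enumerate(data):
                image[addr + offset] = byte
        elif kind == "exec_start":         # S206: Yes -> S207
            if addr in image and addr not in entry_points:
                entry_points.append(addr)
    return image, entry_points             # S210
```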
- FIG. 34 is a flowchart showing the procedure of the bytecode analysis process shown in FIG.
- the bytecode analysis unit 1233 receives the bytecode and entry point extracted by the bytecode extraction unit 1232 (step S211).
- the bytecode analysis unit 1233 receives information on the opcode, operands, and addressing method of the branch VM instruction (step S212).
- the bytecode analysis unit 1233 scans the bytecode from the entry point in the same flow as the execution flow (step S213).
- the bytecode analysis unit 1233 determines whether or not there is a branch VM instruction (step S214). If there is no branch VM instruction (step S214: No), the bytecode analysis unit 1233 continues scanning in step S214.
- If there is a branch VM instruction (step S214: Yes), the bytecode analysis unit 1233 obtains the branch destination based on the opcode, operand, and addressing method (step S215).
- the bytecode analysis unit 1233 adds the currently scanned basic block as a node to the CFG (step S216). The bytecode analysis unit 1233 then adds an edge from the currently scanned address to the branch destination address to the CFG (step S217).
- After processing in step S217, the bytecode analysis unit 1233 determines whether scanning has reached the return VM instruction at the end of the bytecode (step S218).
- If the bytecode has not yet been scanned up to the return VM instruction at the end of the bytecode (step S218: No), the bytecode analysis unit 1233 continues scanning (step S219) and proceeds to step S214.
- If the bytecode has been scanned up to the return VM instruction at the end of the bytecode (step S218: Yes), the bytecode analysis unit 1233 outputs the constructed CFG (step S220).
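The static scan of steps S213 to S220 can be sketched for a toy bytecode format. The format is entirely assumed: `code` is a list of integers, `lengths` maps each opcode to its instruction length, `branches` maps each branch opcode to its addressing method and operand position, relative offsets are taken from the address of the branch instruction, and `table` models the memory consulted under direct addressing.

```python
def build_cfg(code, entry, lengths, branches, ret_opcode, table=None):
    """Linear scan from the entry point; returns (nodes, edges).

    nodes: (start, end) address pairs of basic blocks
    edges: (branch address, branch destination) pairs
    """
    nodes, edges = [], []
    block_start = pc = entry
    while pc < len(code):                          # S213 / S219: scan
        op = code[pc]
        if op in branches:                         # S214: Yes
            mode, operand_index = branches[op]
            operand = code[pc + operand_index]
            if mode == "relative":                 # S215: resolve destination
                target = pc + operand
            elif mode == "direct":
                target = table[operand]
            else:                                  # immediate
                target = operand
            nodes.append((block_start, pc))        # S216
            edges.append((pc, target))             # S217
            block_start = pc + lengths[op]
        elif op == ret_opcode:                     # S218: Yes
            nodes.append((block_start, pc))
            break
        pc += lengths[op]
    return nodes, edges                            # S220
```

For example, with opcodes 1 (no-op), 2 (immediate jump), 3 (relative jump), and 0 (return), the bytecode `[1, 2, 4, 1, 3, 2, 0]` yields three basic blocks and two edges. Fall-through edges of conditional branches are not modeled here, since the flowchart only mentions the edge to the branch destination.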
- the analysis device 10 executes the first test script while monitoring the binary of the script engine, and acquires a memory access trace as an execution trace.
- the analysis device 10 analyzes the VM of the script engine based on the execution trace.
- the analysis device 10 acquires architecture information of the VPC, dispatcher, and code cache.
- the analysis device 10 analyzes the instruction set architecture, which is the system of instructions for a virtual machine, collects VM instructions, determines which of the collected VM instructions are branch VM instructions, and analyzes the addressing method for the branch destination of the branch VM instruction and the operands of the branch VM instruction.
- the analysis device 10 executes the second test script while monitoring the VPC and the dispatcher to obtain a VM execution trace. By analyzing this VM execution trace, the analysis device 10 collects VM instructions, determines whether the VM instruction is a branch VM instruction, analyzes the branch destination, and obtains the addressing method of the branch destination of the branch VM instruction and the operands of the branch VM instruction as information on the instruction set architecture.
- the analysis device 10 generates and executes a driver that calls all subroutines of the script, thereby comprehensively generating and extracting bytecodes.
- the analysis device 10 constructs a CFG by statically scanning the bytecodes extracted in this manner based on information about branch VM instructions. Specifically, the analysis device 10 statically scans in order the VM instructions that constitute the bytecodes extracted by the script engine analyzing the script to be analyzed, and if the current VM instruction is a branch VM instruction, it is able to construct a CFG by obtaining information about the branch destination.
- the analysis device 10 can detect various architectural information by analyzing the execution trace and VM execution trace obtained, even for script engines whose VM internal specifications are unknown, and can build a CFG without requiring manual reverse engineering.
- the analysis device 10 can automatically analyze branch VM instructions for a variety of script engines as long as a test script is prepared, making it possible to build a CFG without requiring individual design or execution. As a result, the analysis device 10 makes it possible to build a CFG for scripts written in a variety of script languages, making it possible to grasp behavior in more detail.
- the analysis device 10 can analyze a script engine and obtain information about branch VM instructions, thereby enabling the construction of a CFG for script engines in a wide variety of script languages.
- this embodiment is useful for constructing CFGs in a wide variety of script engines, and is also suitable for implementing analysis for scripts for which constructing a CFG is difficult due to the absence of analysis support functions such as a debugger or unknown internal specifications of the VM.
- Each component of the analysis device 10 shown in Fig. 3 is a functional concept, and does not necessarily have to be physically configured as shown in the figure.
- the specific form of distribution and integration of the functions of the analysis device 10 is not limited to that shown in the figure, and all or part of it can be functionally or physically distributed or integrated in any unit depending on various loads, usage conditions, etc.
- each process performed by the analysis device 10 may be realized, in whole or in part, by a CPU and a program that is analyzed and executed by the CPU. Furthermore, each process performed by the analysis device 10 may be realized as hardware using wired logic.
- [Program] Fig. 35 is a diagram showing an example of a computer in which the analysis device 10 is realized by executing a program.
- Computer 1000 has, for example, memory 1010 and CPU 1020.
- Computer 1000 also has hard disk drive interface 1030, disk drive interface 1040, serial port interface 1050, video adapter 1060, and network interface 1070. Each of these components is connected by bus 1080.
- the memory 1010 includes a ROM 1011 and a RAM 1012.
- the ROM 1011 stores a boot program such as a BIOS (Basic Input Output System).
- the hard disk drive interface 1030 is connected to a hard disk drive 1090.
- the disk drive interface 1040 is connected to a disk drive 1100.
- a removable storage medium such as a magnetic disk or optical disk is inserted into the disk drive 1100.
- the serial port interface 1050 is connected to a mouse 1110 and a keyboard 1120, for example.
- the video adapter 1060 is connected to a display 1130, for example.
- the hard disk drive 1090 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. That is, the program that defines each process of the analysis device 10 is implemented as a program module 1093 in which code executable by the computer 1000 is written.
- the program module 1093 is stored, for example, in the hard disk drive 1090.
- a program module 1093 for executing processes similar to the functional configuration of the analysis device 10 is stored in the hard disk drive 1090.
- the hard disk drive 1090 may be replaced by an SSD (Solid State Drive).
- the setting data used in the processing of the above-mentioned embodiment is stored as program data 1094, for example, in memory 1010 or hard disk drive 1090.
- the CPU 1020 reads the program module 1093 or program data 1094 stored in memory 1010 or hard disk drive 1090 into RAM 1012 as necessary and executes it.
- the program module 1093 and program data 1094 may not necessarily be stored in the hard disk drive 1090, but may be stored in a removable storage medium, for example, and read by the CPU 1020 via the disk drive 1100 or the like.
- the program module 1093 and program data 1094 may be stored in another computer connected via a network (such as a LAN (Local Area Network), WAN (Wide Area Network)).
- the program module 1093 and program data 1094 may then be read by the CPU 1020 from the other computer via the network interface 1070.
Landscapes
- Engineering & Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- Computer Hardware Design (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Virology (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Debugging And Monitoring (AREA)
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2023/015096 WO2024214265A1 (ja) | 2023-04-13 | 2023-04-13 | Analysis device, analysis method, and analysis program |
| JP2025513736A JPWO2024214265A1 | 2023-04-13 | 2023-04-13 | |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2023/015096 WO2024214265A1 (ja) | 2023-04-13 | 2023-04-13 | Analysis device, analysis method, and analysis program |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024214265A1 (ja) | 2024-10-17 |
Family
ID=93058899
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/JP2023/015096 (Ceased) WO2024214265A1 (ja) | Analysis device, analysis method, and analysis program | 2023-04-13 | 2023-04-13 |
Country Status (2)
| Country | Link |
|---|---|
| JP (1) | JPWO2024214265A1 |
| WO (1) | WO2024214265A1 |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
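| US20130333033A1 (en) * | 2012-06-06 | 2013-12-12 | Empire Technology Development Llc | Software protection mechanism |
| WO2021070393A1 (ja) * | 2019-10-11 | 2021-04-15 | 日本電信電話株式会社 | Analysis function imparting device, analysis function imparting method, and analysis function imparting program |
| WO2023067668A1 (ja) * | 2021-10-18 | 2023-04-27 | 日本電信電話株式会社 | Analysis function imparting method, analysis function imparting device, and analysis function imparting program |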
- 2023-04-13: JP application JP2025513736A (JPWO2024214265A1, ja), status: pending
- 2023-04-13: WO application PCT/JP2023/015096 (WO2024214265A1, ja), status: ceased
Also Published As
| Publication number | Publication date |
|---|---|
| JPWO2024214265A1 | 2024-10-17 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Mirsky et al. | | VulChecker: Graph-based vulnerability localization in source code |
| Wen et al. | | Automatically inspecting thousands of static bug warnings with large language model: How far are we? |
| He et al. | | Sofi: Reflection-augmented fuzzing for javascript engines |
| JP7517585B2 (ja) | | Analysis function imparting device, analysis function imparting program, and analysis function imparting method |
| US8171551B2 (en) | | Malware detection using external call characteristics |
| Ahmadi et al. | | Finding bugs using your own code: detecting functionally-similar yet inconsistent code |
| JP7287480B2 (ja) | | Analysis function imparting device, analysis function imparting method, and analysis function imparting program |
| US10650145B2 (en) | | Method for testing computer program product |
| AU2004232058A1 (en) | | Method and system for detecting vulnerabilities in source code |
| Sun et al. | | Osprey: A fast and accurate patch presence test framework for binaries |
| JP7568131B2 (ja) | | Analysis function imparting method, analysis function imparting device, and analysis function imparting program |
| CN113626823B (zh) | | Method and device for detecting inter-component interaction threats based on reachability analysis |
| KR101583133B1 (ko) | | Method and apparatus for evaluating stack-based software similarity |
| JP7838662B2 (ja) | | Vulnerability discovery device, vulnerability discovery method, and vulnerability discovery program |
| Xu et al. | | A review of code vulnerability detection techniques based on static analysis |
| WO2024214265A1 (ja) | | Analysis device, analysis method, and analysis program |
| WO2024214263A1 (ja) | | Analysis function imparting device, analysis function imparting method, and analysis function imparting program |
| WO2024214264A1 (ja) | | Analysis device, analysis method, and analysis program |
| WO2023067663A1 (ja) | | Analysis function imparting method, analysis function imparting device, and analysis function imparting program |
| Jin et al. | | Current and future research of machine learning based vulnerability detection |
| WO2024214262A1 (ja) | | Analysis function imparting device, analysis function imparting method, and analysis function imparting program |
| Liu et al. | | Automatic Software Vulnerability Detection in Binary Code |
| WO2023067665A1 (ja) | | Analysis function imparting method, analysis function imparting device, and analysis function imparting program |
| WO2024214261A1 (ja) | | Analysis device, analysis method, and analysis program |
| WO2024214260A1 (ja) | | Analysis device, analysis method, and analysis program |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23933041; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 2025513736; Country of ref document: JP; Kind code of ref document: A |
| | WWE | Wipo information: entry into national phase | Ref document number: 2025513736; Country of ref document: JP |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 23933041; Country of ref document: EP; Kind code of ref document: A1 |