CN113836023B

CN113836023B - Compiler security testing method based on architecture cross check

Info

Publication number: CN113836023B
Application number: CN202111128084.3A
Authority: CN
Inventors: 徐坚皓; 丁柱; 茅兵
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2021-09-26
Filing date: 2021-09-26
Publication date: 2023-06-27
Anticipated expiration: 2041-09-26
Also published as: CN113836023A

Abstract

The invention provides a compiler security testing method based on architecture cross check, which aims to detect software security vulnerabilities introduced by a compiler when compiling common open-source software into binary code for different architectures. Because the changes to security-related code before and after compilation satisfy architecture consistency, vulnerabilities introduced by the compiler can be detected by comparing those changes and performing an architecture cross check. The method comprises: modeling and locating security-related instructions, establishing the correspondence between IR (Intermediate Representation) code and binary code, determining changes in the semantic state of the security-related instructions, and performing the architecture cross check. The method locates compiler-introduced software security vulnerabilities efficiently and accurately.

Description

Compiler security testing method based on architecture cross check

Technical Field

The invention relates to a compiler security testing method based on architecture cross checking.

Background

A software vulnerability is a security flaw in a computer system that threatens the confidentiality, integrity, availability, access control, etc. of the system or its application data. A compiler is a widely used computer program that converts source code written in a programming language into binary program files for actual execution.

Software vulnerabilities often stem from problems in the source program itself and can be traced to errors in the software source code. In recent years, however, more and more software security vulnerabilities have no vulnerable point in the source code and are instead introduced by the compiler at the compilation stage.

Each architecture may have problems introduced by its own compiler back end. The compiler back end of each architecture is implemented and designed independently and contains many architecture-specific code generation and optimization strategies; in software this complex, design and implementation problems in these modules are inevitable. These architecture-related problems have an increasingly prominent real-world impact. With the spread of the mobile internet, the internet of things and embedded devices, a large number of systems and applications need to run on different architecture platforms. As important software infrastructure, existing compilers support more and more architectures; for example, GCC and Clang both support more than 40 architectures, and the compiler back ends for these architectures are widely used. System software that supports a large number of architectures is also increasingly important; for example, the Linux kernel, the most popular open-source operating system kernel, can run on more than 25 architecture platforms.

Traditional vulnerability discovery is mainly performed with dynamic and static techniques. Dynamic vulnerability discovery finds vulnerabilities by executing the program on specific inputs; according to how the program is run, it can be divided into fuzzing and dynamic symbolic execution. Static vulnerability discovery detects possible vulnerable points by summarizing the features that vulnerabilities exhibit in programs, with the help of program analysis techniques. According to how vulnerabilities are summarized, it can be divided into manually defined vulnerability models and vulnerability models learned by machine learning. According to the object of study, it can be divided into static vulnerability mining on source code and static vulnerability mining on binaries.

The conventional vulnerability discovery work above generally considers only the security of the software program itself and does not specifically consider security problems that the compiler introduces into the binary program. Recently, some vulnerability detection work has taken the compiler into account and targets compiler-introduced security problems; it falls mainly into two categories.

The first category detects vulnerabilities introduced by the compiler through static analysis based on their semantic features.

The second category is compiler correctness testing, which examines the compiler through a series of methods to ensure that it behaves correctly during compilation (i.e. conforms to the language specification). A core problem of this approach is the test criterion (test oracle), i.e. how to determine that the behavior of the compiler is correct. There are generally two solutions to this problem: differential testing and metamorphic testing.

Although the above approaches have found many security problems in which the compiler introduces a software vulnerability that is absent from the source program, have received positive responses from developers and have had a large impact, existing detection methods still face some common problems:

1. Generality of the detection method and focus on security problems cannot both be achieved. On the one hand, general detection methods such as CSmith and EMI target the correctness of the compiler rather than its security. As a result, some potential security problems receive too little attention: for example, compiler developers do not handle them in time, and compiler users do not apply the compiler's vulnerability patches in time. As a result, both of these methods require prior knowledge and cannot discover problems of unknown types.

2. Architecture-related problems are difficult to address. On the one hand, existing targeted static methods do not consider architecture-related compiler optimization behavior and cannot find architecture-related problems at all. On the other hand, dynamic methods suffer from the following problems: (1) multi-architecture deployment is difficult: many methods rely heavily on architecture-dependent infrastructure and are hard to migrate to multi-architecture environments; (2) the cost of repeated detection is too high: binaries of different architectures built from the same source code must be analyzed dynamically and independently, so considering multiple architectures greatly increases the amount of analysis (much of it duplicated) and makes the coverage problem inherent in dynamic methods more serious.

Disclosure of Invention

Purpose of the invention: in view of the shortcomings of the prior art, the invention provides a compiler security testing method based on architecture cross check, which can detect software vulnerabilities introduced by a given compiler (one supporting multiple architectures) when it generates binary code.

The core idea of the invention is as follows: because the changes to security-related code before and after compilation satisfy architecture consistency, software vulnerabilities introduced by the compiler can be detected by comparing those changes and performing an architecture cross check.

The changes to security-related code before and after compilation satisfy architecture consistency because the security-related semantic fragments in a system program all have specific functions and are generally architecture independent; the architecture independence of these security functions means that the security-related semantic fragments should remain consistent from architecture to architecture. If an inconsistency occurs, the binary code of some architecture has very probably had a security problem introduced into it.

Specifically, the invention comprises the following steps:

step 1, compiling a test program to architecture-independent IR (intermediate representation) code using a target compiler (the compiler under test);

step 2, instrumenting the target compiler so that it collects new compile-time information for the code correspondence of step 5 below;

step 3, defining security-related code by means of error-handling code and dedicated models, and analyzing and locating the security-related code in the IR code;

step 4, continuing to compile the IR code with the instrumented target compiler to obtain binary code for different architectures (namely different central processing unit instruction set architectures, commonly x86_64, arm64, arm, riscv64 and the like);

step 5, based on the new compile-time information, mapping the IR-level security-related code to the binary code;

step 6, performing an architecture cross check on the semantic state changes of the security-related code.
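Steps 1 and 4 can be pictured as one front-end invocation followed by one back-end invocation per architecture, all consuming the same IR file. The sketch below only builds the command lines and does not run them. It assumes stock `clang` and `llc` stand in for the instrumented compiler, and the target triples and file names are illustrative choices, not taken from the patent. Stock clang also records a target triple in the IR it emits, which `-mtriple` overrides, so this is a simplification of "architecture-independent IR".

```python
from typing import Dict, List

# Hypothetical target triples for illustration; the text only names the architectures.
TRIPLES: Dict[str, str] = {
    "x86_64": "x86_64-linux-gnu",
    "arm64": "aarch64-linux-gnu",
    "riscv64": "riscv64-linux-gnu",
}


def ir_command(source: str, ir_path: str) -> List[str]:
    """Step 1: lower one source file to textual LLVM IR."""
    return ["clang", "-O2", "-S", "-emit-llvm", source, "-o", ir_path]


def backend_commands(ir_path: str) -> Dict[str, List[str]]:
    """Step 4: one back-end invocation per architecture, all from the same IR."""
    commands: Dict[str, List[str]] = {}
    for arch, triple in TRIPLES.items():
        commands[arch] = [
            "llc", f"-mtriple={triple}", "-filetype=obj",
            ir_path, "-o", f"{ir_path}.{arch}.o",
        ]
    return commands


print(" ".join(ir_command("check.c", "check.ll")))
for arch, cmd in backend_commands("check.ll").items():
    print(arch, ":", " ".join(cmd))
```

Keeping a single IR file as the shared input is what later allows every binary to be compared against the same reference.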

Step 2 comprises the following:

the compilation flow of the compiler is instrumented to collect new compile-time information at the compilation stage, the new compile-time information comprising: representative instruction information (instruction type, instruction operands, the number of the basic block containing the instruction and its position within the basic block), IR-level control flow information (the numbers of the predecessor and successor basic blocks of each basic block), and the basic-block-level correspondence between IR and binary code (the number of the IR basic block corresponding to each machine-dependent basic block); the collected compile-time information is stored in the debug information of the relevant instructions (custom debug information newly created by the invention); the representative instructions are of the following three kinds: compare-and-branch instructions, function call instructions and memory access instructions.

In step 2, a new data structure is created in the target compiler for each level of intermediate code used during compilation, to store custom debug information that is updated dynamically during compilation (distinct from the original debug information maintained by the compiler);

the compiler is instrumented at the stage before the architecture-dependent compilation process begins, so as to acquire the required IR-level compile-time information (which is available from the compiler's existing analysis passes) and store it in the data structure (this is IR-level compile-time information, and its propagation to the binary level must be guaranteed);

the architecture-dependent compilation process of the compiler is instrumented to track how the compiler handles the original debug information and to handle the custom debug information in the same way; that is, the transfer of debug information is tracked when one intermediate code is converted into another and when compiler optimizations merge or generate instructions, so that the IR-level compile-time information is finally transferred to the corresponding data structure at the binary code level;

the binary code generation process of the compiler is instrumented to supplement the compile-time information at the binary code level, and the stored binary-level compile-time information is written, as custom debug information, into a data section of the binary file in a uniform format (such as Protobuf). The specified data section of the binary file is then read according to that format to obtain the compile-time information of each instruction in the binary code.
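The compile-time information listed above can be sketched as two record types plus a serialization round trip. This is a minimal illustration under stated assumptions: the class and field names are hypothetical, and JSON from the Python standard library is swapped in for the Protobuf format mentioned in the text so that the example stays self-contained.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Tuple


@dataclass
class InstrInfo:
    """Representative instruction information (hypothetical field names)."""
    kind: str            # "cmp_branch", "call" or "mem_access"
    operands: List[str]  # instruction operands
    ir_block: int        # number of the IR basic block containing the instruction
    position: int        # index of the instruction within that basic block


@dataclass
class BlockInfo:
    """IR-level control flow plus the IR-to-machine basic block mapping."""
    ir_block: int
    preds: List[int]           # predecessor basic block numbers
    succs: List[int]           # successor basic block numbers
    machine_blocks: List[int]  # machine-level blocks generated from this IR block


def dump_section(instrs: List[InstrInfo], blocks: List[BlockInfo]) -> bytes:
    """Serialize the records as they might be written into a custom data section."""
    payload = {
        "instrs": [asdict(i) for i in instrs],
        "blocks": [asdict(b) for b in blocks],
    }
    return json.dumps(payload, sort_keys=True).encode("utf-8")


def load_section(raw: bytes) -> Tuple[List[InstrInfo], List[BlockInfo]]:
    """Read the records back from the bytes of the data section."""
    payload = json.loads(raw.decode("utf-8"))
    instrs = [InstrInfo(**d) for d in payload["instrs"]]
    blocks = [BlockInfo(**d) for d in payload["blocks"]]
    return instrs, blocks


instrs = [InstrInfo("cmp_branch", ["%p", "null"], ir_block=0, position=1)]
blocks = [BlockInfo(ir_block=0, preds=[], succs=[1, 2], machine_blocks=[0])]
restored_instrs, restored_blocks = load_section(dump_section(instrs, blocks))
print(restored_instrs[0], restored_blocks[0])
```

A uniform on-disk format matters here because the same reader is then applied to the binaries of every architecture.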

Step 3 comprises the following steps:

step 3-1, for the test source code, collecting the interfaces of the error-handling code according to general programming conventions, so as to locate the error-handling code;

step 3-2, identifying security checks based on the error-handling code: a conditional check is determined to be a security check when exactly one of its branches leads to error handling;

step 3-3, taking the variable checked by the security check as the key variable (for example, the pointer variable in a pointer security check or the divisor variable in a division-by-zero check), traversing the IR code, constructing definition-use chains, and marking the instructions that use the key variable as security operations, so as to locate the security-related code.
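Steps 3-2 and 3-3 can be illustrated on a toy IR. The sketch below assumes a deliberately simplified instruction record (`Instr`) and a precomputed set of error-handling block labels; both are hypothetical and much coarser than real LLVM IR. A conditional branch counts as a security check when exactly one of its two targets is an error-handling block, and every instruction that uses the checked variable is then marked through a definition-use map.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Instr:
    idx: int
    op: str                        # e.g. "load", "icmp", "br", "add"
    dest: Optional[str]            # value defined by the instruction, if any
    uses: Tuple[str, ...]          # values used by the instruction
    targets: Tuple[str, ...] = ()  # branch target labels ("br" only)


def find_security_checks(instrs: List[Instr], error_blocks: Set[str]):
    """Step 3-2: a conditional branch with exactly one error-handling target."""
    defs = {i.dest: i for i in instrs if i.dest}
    checks = []
    for i in instrs:
        if i.op != "br" or len(i.targets) != 2:
            continue
        if sum(t in error_blocks for t in i.targets) != 1:
            continue
        cmp = defs.get(i.uses[0])
        if cmp is not None and cmp.op == "icmp":
            checks.append((cmp, i))
    return checks


def def_use_map(instrs: List[Instr]) -> Dict[str, List[int]]:
    """Map each value to the indices of the instructions that use it."""
    users: Dict[str, List[int]] = {}
    for i in instrs:
        for value in i.uses:
            users.setdefault(value, []).append(i.idx)
    return users


def security_related(instrs: List[Instr], error_blocks: Set[str]) -> Set[int]:
    """Step 3-3: mark the check itself and every user of its key variable."""
    users = def_use_map(instrs)
    marked: Set[int] = set()
    for cmp, br in find_security_checks(instrs, error_blocks):
        marked.update({cmp.idx, br.idx})
        for key in cmp.uses:
            if key.startswith("%"):  # skip constants such as "null" or "0"
                marked.update(users.get(key, []))
    return marked


program = [
    Instr(0, "load", "%p", ("%pp",)),
    Instr(1, "icmp", "%c", ("%p", "null")),
    Instr(2, "br", None, ("%c",), ("err", "ok")),
    Instr(3, "load", "%v", ("%p",)),       # dereference guarded by the check
    Instr(4, "add", "%q", ("%x", "1")),    # unrelated arithmetic
]
print(sorted(security_related(program, {"err"})))
```

In this toy program the compare, the branch and the guarded dereference are marked, while the unrelated addition is not.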

The detection framework of the invention can also define various dedicated kinds of security-related code according to actual requirements, which can be used to detect a specific class of security problem. As examples, two semantic models and their corresponding locating methods are given here (i.e. the "dedicated models" of step 3):

Racy memory access. Racy memory access is a common type of sensitive operation; because compilers model concurrency inadequately, a compiler is likely to introduce new racy memory accesses and cause concurrency problems. Following the Lockset analysis of RacerX and RELAY, racy objects (memory objects subject to racy access) can be located in such system software.

Dedicated security-related function calls. Common system software provides a well-developed interface of security-related functions, which encapsulate measures for handling specific security problems. The definitions of these functions can be found in fixed code modules (such as errno.h and errno-base.h of the Linux kernel). On this basis, by constructing a function call graph, the relevant calls in a program and the wrapper functions of these functions can be located according to the graph relationships, and these function calls are treated as security-related instructions.
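The call-graph model for dedicated security-related functions can be sketched as a small fixed-point computation. The text does not say how a wrapper is recognized, so the criterion used here is an assumption: a function is treated as a wrapper when every function it calls is already known to be a security function or a wrapper. The function names in the example are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def locate_wrappers(call_graph: Dict[str, List[str]],
                    security_funcs: Set[str]) -> Set[str]:
    """Return functions whose callees are all security functions or wrappers.

    The 'all callees' criterion is an illustrative assumption, not the
    patent's definition of a wrapper function.
    """
    known = set(security_funcs)
    changed = True
    while changed:  # iterate until no new wrapper is discovered
        changed = False
        for func, callees in call_graph.items():
            if func not in known and callees and set(callees) <= known:
                known.add(func)
                changed = True
    return known - set(security_funcs)


def security_call_sites(call_graph: Dict[str, List[str]],
                        security_funcs: Set[str]) -> List[Tuple[str, str]]:
    """List (caller, callee) pairs whose callee is a security function or wrapper."""
    targets = set(security_funcs) | locate_wrappers(call_graph, security_funcs)
    return sorted(
        (caller, callee)
        for caller, callees in call_graph.items()
        for callee in callees
        if callee in targets
    )


# Hypothetical call graph: caller -> callees.
graph = {
    "check_ptr": ["sec_fail"],
    "check_twice": ["check_ptr", "sec_fail"],
    "driver_init": ["check_twice", "alloc"],
    "alloc": [],
}
print(locate_wrappers(graph, {"sec_fail"}))
print(security_call_sites(graph, {"sec_fail"}))
```

The resulting call sites are the instructions that would be treated as security-related under this model.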

In step 4, when the instrumented target compiler continues to compile the IR code, the compile-time information is stored as custom debug information.

Step 5 comprises the following steps:

step 5-1, analyzing the binary code obtained in step 4 and, according to its custom debug information, obtaining the correspondence between the three kinds of representative instructions and the IR code (namely, which binary instruction corresponds to a given IR instruction): if the custom debug information of an IR instruction is consistent with that of an instruction in the binary code, the two instructions are determined to correspond one to one. Code is composed of instructions; the IR instructions are part of the IR code.

step 5-2, using general static analysis methods (which can be implemented independently, or for which the compiler's analysis framework can provide analysis interfaces), analyzing the control flow relations (the number of the containing basic block, relative position within the same basic block, control flow dominance between instructions, and the like) and data flow relations (use-definition chains and definition-use chains of the relevant data, and the like) between each instruction to be matched and the representative instructions already matched; then obtaining the correspondence between the target binary code and the IR code from these control flow and data flow relations, the correspondence between the three kinds of representative instructions and the IR code, and the type information of the target instruction (which can be obtained in several ways: for example, the type of an IR instruction through a compiler analysis interface, and the type of a binary instruction through a common disassembly tool such as objdump): if the control flow and data flow relations of an IR instruction are consistent with those of a binary instruction and their types match, the two instructions are determined to correspond one to one.
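The two-stage matching of step 5 can be sketched as follows. Representative instructions are paired by identical custom debug tags (step 5-1); the remaining instructions are paired by instruction type, IR basic block, and their position relative to the already matched anchors (a stand-in for step 5-2). The record types and the "number of anchors preceding the instruction in its block" signature are simplifying assumptions; a real implementation would also compare dominance and definition-use relations.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class IRInstr:
    name: str
    kind: str                  # "cmp_branch", "call", "mem_access", "arith", ...
    block: int                 # IR basic block number
    pos: int                   # position within the block
    tag: Optional[str] = None  # custom debug info (representative instructions only)


@dataclass(frozen=True)
class BinInstr:
    addr: int
    kind: str
    ir_block: int              # IR block recovered from the custom debug info
    pos: int
    tag: Optional[str] = None


def match_representatives(ir: List[IRInstr], binary: List[BinInstr]) -> Dict[str, int]:
    """Step 5-1: pair instructions whose custom debug tags are identical."""
    by_tag = {b.tag: b.addr for b in binary if b.tag is not None}
    return {i.name: by_tag[i.tag] for i in ir if i.tag is not None and i.tag in by_tag}


def anchors_before(block: int, pos: int, anchors: List[Tuple[int, int]]) -> int:
    """Count matched anchors that precede position `pos` inside `block`."""
    return sum(1 for b, p in anchors if b == block and p < pos)


def match_remaining(ir: List[IRInstr], binary: List[BinInstr],
                    matched: Dict[str, int]) -> Dict[str, int]:
    """Step 5-2 (simplified): same type, same block, same position relative to anchors."""
    anchor_addrs = set(matched.values())
    ir_anchors = [(i.block, i.pos) for i in ir if i.name in matched]
    bin_anchors = [(b.ir_block, b.pos) for b in binary if b.addr in anchor_addrs]
    result = dict(matched)
    for i in ir:
        if i.name in result:
            continue
        signature = anchors_before(i.block, i.pos, ir_anchors)
        candidates = [
            b for b in binary
            if b.addr not in anchor_addrs
            and b.ir_block == i.block
            and b.kind == i.kind
            and anchors_before(b.ir_block, b.pos, bin_anchors) == signature
        ]
        if len(candidates) == 1:  # accept only unambiguous matches
            result[i.name] = candidates[0].addr
    return result


ir = [
    IRInstr("%v", "mem_access", 0, 0, tag="t0"),
    IRInstr("%s", "arith", 0, 1),
    IRInstr("%c", "cmp_branch", 0, 2, tag="t1"),
    IRInstr("%d", "arith", 0, 3),
]
binary = [
    BinInstr(0x10, "mem_access", 0, 0, tag="t0"),
    BinInstr(0x14, "arith", 0, 1),
    BinInstr(0x18, "cmp_branch", 0, 2, tag="t1"),
]
anchors = match_representatives(ir, binary)
print(match_remaining(ir, binary, anchors))
```

In this example `%d` ends up without a counterpart, which is exactly the situation that the later removal check looks for.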

Step 6 comprises the following steps:

step 6-1, based on the correspondence between the target binary instructions and the IR code, determining whether the security-related code located in step 3 undergoes a semantic state change during the compilation of step 4, the semantic state changes comprising removal of a security-related instruction and reordering of security-related instructions, determined as follows: if no corresponding binary instruction can be found for an IR instruction, the instruction is determined to have been removed; if the control flow order between binary instructions differs from the order between the corresponding IR instructions, the instructions are determined to have been reordered;

step 6-2, comparing the binary code of the different architectures compiled in step 4 and determining whether the semantic state changes of the security-related instructions are consistent; if not, reporting a suspected security problem and submitting a report to the open-source community whose code served as the test program.
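The two kinds of semantic state change in step 6-1, and the consistency test of step 6-2, can be sketched as below. The inputs are assumptions made for illustration: the security-related IR instructions are given in control flow order, the IR-to-binary mapping comes from step 5, and each binary is summarized as a list of addresses in control flow order.

```python
from typing import Dict, List, Set, Tuple

StateChange = Dict[str, set]


def state_changes(ir_order: List[str], mapping: Dict[str, int],
                  bin_order: List[int]) -> StateChange:
    """Step 6-1: find removed and reordered security-related instructions.

    ir_order : security-related IR instruction names, in control flow order
    mapping  : IR name -> binary address (absent means no counterpart)
    bin_order: binary instruction addresses, in control flow order
    """
    removed: Set[str] = {name for name in ir_order if name not in mapping}
    rank = {addr: k for k, addr in enumerate(bin_order)}
    kept = [name for name in ir_order if name in mapping]
    reordered: Set[Tuple[str, str]] = set()
    for a_idx in range(len(kept)):
        for b_idx in range(a_idx + 1, len(kept)):
            a, b = kept[a_idx], kept[b_idx]
            if rank[mapping[a]] > rank[mapping[b]]:  # order flipped in the binary
                reordered.add((a, b))
    return {"removed": removed, "reordered": reordered}


def is_consistent(per_arch: Dict[str, StateChange]) -> bool:
    """Step 6-2: every architecture must show the same state changes."""
    results = list(per_arch.values())
    return all(r == results[0] for r in results)


ir_order = ["chk", "use"]
per_arch = {
    "arm64": state_changes(ir_order, {"chk": 0x10, "use": 0x14}, [0x10, 0x14]),
    "x86_64": state_changes(ir_order, {"use": 0x20}, [0x20]),
}
print(per_arch)
print("consistent:", is_consistent(per_arch))
```

Here the check survives on one architecture and disappears on the other, so the two results disagree and a suspected problem would be reported.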

In order to perform the architecture cross check on security-related semantic state changes in binary code, the invention uses IR code as a bridge for comparing the binary code of different architectures and provides a method for establishing the correspondence from IR code to binary code.

IR code can serve as this bridge because, as the common intermediate language form of modern compilers, it is architecture independent (so it does not affect the architecture cross check) and is easier to analyze than source code.

The architecture cross check is feasible for the following reason: after the compiler compiles the security-related semantic fragments of the same source code to binary code for different architectures, the state changes of those fragments before and after compilation should be consistent; if there is a discrepancy, a suspected security problem must exist on some architecture.

Beneficial effects: the invention can automatically and efficiently (without dynamically running the compiled binaries) find, in important system software, a class of potential security vulnerabilities that is widespread, has serious security impact and is difficult to discover, namely vulnerabilities introduced by the compiler. The invention is the first to introduce general modeling of security code into compiler testing, overcoming the limitation that prior methods either cannot target security problems or can only target specific security problems. The invention can detect, in a targeted manner, security problems introduced by the compiler on different architectures, which prior technical solutions either cannot detect or cannot detect in a targeted manner (that is, can only detect inefficiently and with repeated effort).

Drawings

The foregoing and/or other advantages of the invention will become more apparent from the following detailed description taken in conjunction with the accompanying drawings.

FIG. 1 is a schematic diagram of the instrumented compiler.

FIG. 2 is a schematic diagram of determining changes in the semantic state of security-related instructions and performing the architecture cross check.

FIG. 3 is a flow chart of the method of the present invention.

Detailed Description

The invention provides a compiler security testing method based on architecture cross check. Its steps 1 to 6, and the sub-steps of steps 2, 3, 5 and 6, are as set out in the Disclosure of Invention above.

Examples

As shown in FIG. 1, the compilation flow of the instrumented compiler collects representative instruction information, IR-level control flow information and the basic-block-level correspondence between IR and binary code at the compilation stage, and stores this information in the custom debug information of the relevant instructions.

As shown in FIG. 2, changes in the semantic state of security-related instructions are determined and the architecture cross check is performed. Here a BISF (back-end independent semantic fragment) is a security-related instruction modeled and located according to the method of the invention. Once the security-related code has been located, it is matched in the binary code of the different architectures, and the change in the semantic state of the security-related instructions is determined. Finally, the changes in semantic state are cross checked, and real security problems are analyzed and confirmed.

The test cases used to test the compiler are the source code of common large-scale open-source software, such as the Linux kernel and Chromium. Such software typically has well-defined error code conventions, to which the invention's modeling of security-related code can be applied.

The compiler tested by the invention is an open-source optimizing compiler that supports multiple architectures; common examples, covering several languages, include GCC, Clang, G++ and Clang++.

The architectures referred to herein are the various central processing unit instruction set architectures, commonly x86_64, arm64, arm, riscv64, etc.

IR (Intermediate Representation) code refers to the intermediate files produced during compilation, expressed in the compiler's intermediate representation language; common examples are LLVM IR and GCC IR.

The embodiment compiles the Linux kernel source code into binaries for several different architectures (x86_64, arm64, arm, riscv64, mips, ppc64) using Clang.

Referring to FIG. 3, the implementation flow of this example is as follows:

Preparation stage. The Linux kernel source code is compiled into IR code (here, LLVM IR) using Clang, with the "allyesconfig" configuration option selected. Because the Linux kernel source contains a small amount of architecture-dependent assembly code, the related source code is excluded.

Instrumentation stage. See FIG. 1. Representative instruction information, IR-level control flow information and the basic-block-level correspondence between IR and binary code are collected at the compilation stage and stored in the custom debug information of the relevant instructions. The representative instructions are compare-and-branch instructions, function call instructions and memory access instructions; their instruction information includes the instruction type, the instruction operands, the number of the basic block containing the instruction and its position within the basic block. The IR-level control flow information includes the numbers of the predecessor and successor basic blocks of each basic block. The basic-block-level correspondence between IR and binary code includes the correspondence between Machine IR basic blocks and IR basic blocks as recorded by LLVM.

Modeling and analysis stage. The IR code is analyzed according to the modeling of the invention to obtain the specific security-related instructions.

Correspondence stage. Taking the analysis of whether a security check is removed as an example, as shown in FIG. 2, the security check is first matched between the IR code and the binary code. The state change of the security check before and after compilation is then determined from the matching result: if no counterpart of an IR-level security check can be found in the binary code, the security check is considered to have been removed by the compiler.

Architecture cross-check stage. Again taking removal of a security check as the example: if the cross check shows that the security check is removed in the binary of one architecture but not in the binaries of the other architectures, the compiler is considered to have removed the security check by mistake on that architecture, and a security problem therefore exists. The removed security checks and the architectures involved are analyzed and reported.
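The removal example of the cross-check stage can be sketched as a per-check comparison across architectures. The check identifiers and the per-architecture removal sets below are made-up inputs for illustration; in the described flow they would come from the correspondence stage. A check is reported only when some architectures removed it and others kept it, so a check removed everywhere is treated as consistent.

```python
from typing import Dict, List, Set


def cross_check_removed(removed_by_arch: Dict[str, Set[str]],
                        all_checks: Set[str]) -> List[dict]:
    """Report security checks removed on some architectures but kept on others."""
    archs = sorted(removed_by_arch)
    reports = []
    for check in sorted(all_checks):
        removed_in = [a for a in archs if check in removed_by_arch[a]]
        kept_in = [a for a in archs if check not in removed_by_arch[a]]
        if removed_in and kept_in:  # inconsistent across architectures
            reports.append(
                {"check": check, "removed_in": removed_in, "kept_in": kept_in}
            )
    return reports


# Hypothetical results of the correspondence stage.
checks = {"null_check@foo", "div0_check@bar", "dead_check@baz"}
removed = {
    "x86_64": {"dead_check@baz"},
    "arm64": {"dead_check@baz", "null_check@foo"},
    "riscv64": {"dead_check@baz"},
}
for report in cross_check_removed(removed, checks):
    print(report)
```

Listing both the architectures that removed the check and those that kept it gives the analyst the comparison needed to confirm whether the removal is a real security problem.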

The invention provides a compiler security testing method based on architecture cross check, and there are many methods and ways of implementing this technical solution; the above description is only a preferred embodiment of the invention. It should be noted that those skilled in the art can make several improvements and modifications without departing from the principles of the invention, and these improvements and modifications should also be regarded as falling within the protection scope of the invention. Components not explicitly described in this embodiment can be implemented using the prior art.

Claims

1. A compiler security testing method based on architecture cross check is characterized by comprising the following steps:

step 1, compiling a test program to architecture-independent IR code using a target compiler;

step 2, instrumenting the target compiler so that it collects new compile-time information;

step 4, continuing to compile the IR code with the instrumented target compiler to obtain binary code for different architectures;

step 6, performing an architecture cross check on the semantic state changes of the security-related code;

the step 2 comprises the following steps:

the compilation flow of the compiler is instrumented to collect new compile-time information at the compilation stage, the new compile-time information comprising: representative instruction information, IR-level control flow information, and the basic-block-level correspondence between IR and binary code; the collected new compile-time information is stored in the debug information of the relevant instructions; the representative instructions are of the following three kinds: compare-and-branch instructions, function call instructions and memory access instructions;

in step 2, a data structure is created in the target compiler for each level of intermediate code used during compilation, to store custom debug information that is updated dynamically during compilation;

the compiler is instrumented at the stage before the architecture-dependent compilation process begins, so as to acquire the required IR-level compile-time information and store it in the data structure;

the binary code generation process of the compiler is instrumented to supplement the compile-time information at the binary code level; the stored binary-level compile-time information is written, as custom debug information, into a data section of the binary file in a uniform format, and the specified data section of the binary file is read according to that format to obtain the compile-time information of each instruction in the binary code;

the step 3 comprises the following steps:

step 3-3, taking the variable checked by the security check as the key variable, traversing the IR code, constructing definition-use chains, and marking the instructions that use the key variable as security operations, so as to locate the security-related code.

2. The method of claim 1, wherein in step 4, when the instrumented target compiler continues to compile the IR code, the compile-time information is stored as custom debug information.

3. The method of claim 2, wherein step 5 comprises:

step 5-1, analyzing the binary code obtained in step 4 and, according to its custom debug information, obtaining the correspondence between the three kinds of representative instructions and the IR code: if the custom debug information of an IR instruction is consistent with that of an instruction in the binary code, the two instructions are determined to correspond one to one;

step 5-2, using a static analysis method, analyzing the control flow and data flow relations between each instruction to be matched and the representative instructions already matched, and obtaining the correspondence between the target binary code and the IR code from these control flow and data flow relations, the correspondence between the three kinds of representative instructions and the IR code, and the type information of the target instruction: if the control flow and data flow relations of an IR instruction are consistent with those of a binary instruction and their types match, the two instructions are determined to correspond one to one.

4. A method according to claim 3, wherein step 6 comprises: