KR20180057317A - Apparatus and method for intermediate language transformation of binary data - Google Patents

Apparatus and method for intermediate language transformation of binary data

Info

Publication number
KR20180057317A
Authority
KR
South Korea
Prior art keywords
intermediate language
language
generated
preprocessing
information
Prior art date
Application number
KR1020160155803A
Other languages
Korean (ko)
Inventor
차상길
류찬호
정승일
오동엽
Original Assignee
한국과학기술원
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 한국과학기술원
Priority to KR1020160155803A
Publication of KR20180057317A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Stored Programmes (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The present invention provides a stepwise intermediate language conversion apparatus and method that enable efficient binary analysis by constructing an intermediate language for each step. The intermediate language conversion apparatus comprises: a preprocessing unit that receives assembly language disassembled from a binary file and generates a preprocessing intermediate language from the assembly language; and a post-processing unit that generates a first stage intermediate language from the preprocessing intermediate language, takes the first stage intermediate language as input to generate a second stage intermediate language, and outputs the stepwise intermediate language generated after a predetermined number of steps as the final intermediate language.

Description

Apparatus and method for intermediate language transformation

The present invention relates to a stepwise intermediate language conversion apparatus and method.

An intermediate language is a language that a compiler generates while translating source code into binary code. In other words, it is an intermediate form that represents binaries of various platforms in one unified way.

However, unlike compilation, converting binary code into an intermediate language is very difficult, because abstract information present in the source, such as type information and variable names, disappears during compilation. The lost abstract information therefore has to be restored during the intermediate language conversion.

Converting binary code into an intermediate language is the most important underlying technology for binary analysis. Without this conversion, existing program analysis techniques cannot be applied. In addition, for reverse engineering, analysis needs to be carried out on a high-level language rather than on low-level code.

Most existing intermediate language conversion is done in a single step: the binary code is disassembled and expressed as one intermediate language. This approach is well suited to expressing low-level machine language, but it is inefficient for expressing a high-level language.

For example, to express a For statement that exists in a high-level language, a low-level intermediate language has to chain several intermediate language statements together. In addition, a low-level abstract syntax tree (AST) cannot hold the type information that exists in a high-level language.
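
The point about the For statement can be illustrated with a minimal Python sketch. The names used here (`For`, `lower_for`, and the tuple-based low-level statements) are hypothetical and do not come from the patent; the sketch only shows that a single high-level loop node corresponds to several chained low-level intermediate language statements.

```python
from dataclasses import dataclass
from typing import List, Tuple

Stmt = Tuple[str, ...]  # hypothetical low-level IL statement: (opcode, operands...)

@dataclass
class For:
    """Hypothetical high-level IL node: for var in [start, stop): body."""
    var: str
    start: int
    stop: int
    body: List[Stmt]

def lower_for(loop: For) -> List[Stmt]:
    """Express one high-level For node as a chain of low-level statements."""
    return [
        ("assign", loop.var, str(loop.start)),                  # var := start
        ("label", "loop_head"),
        ("jump_if_ge", loop.var, str(loop.stop), "loop_end"),   # exit test
        *loop.body,
        ("add", loop.var, loop.var, "1"),                       # var := var + 1
        ("jump", "loop_head"),
        ("label", "loop_end"),
    ]

loop = For("i", 0, 10, [("add", "total", "total", "i")])
low_level = lower_for(loop)
print(len(low_level))  # number of low-level statements behind one For node
```

A one-step converter only ever sees the chained form, which is why recovering the single `For` node requires further, higher-level stages.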

Therefore, the present invention provides a stepwise intermediate language conversion apparatus and method that construct an intermediate language step by step to enable efficient binary analysis.

According to one aspect of the present invention, there is provided an intermediate language conversion apparatus comprising:

a preprocessing unit that receives assembly language disassembled from a binary file and generates a preprocessing intermediate language from the assembly language; and a post-processing unit that generates a first stage intermediate language from the preprocessing intermediate language, receives the first stage intermediate language as input to generate a second stage intermediate language, and outputs the stepwise intermediate language generated after a predetermined number of steps as a final intermediate language.

The post-processing unit comprises: a first post-processor that receives the preprocessing intermediate language as input and generates the first stage intermediate language from it; and a second post-processor that receives the first stage intermediate language as input and generates the second stage intermediate language from it.

The level of the first stage intermediate language may be higher than the level of the preprocessing intermediate language, and the level of the second stage intermediate language may be higher than the level of the first stage intermediate language.

The preprocessing unit may receive binary information, which is information on the environment in which the binary file was generated, and, when generating the preprocessing intermediate language, may first derive abstraction information based on the binary information.

According to another aspect of the present invention, there is provided a method for converting an assembly language into an intermediate language, the method comprising:

generating a preprocessing intermediate language from the assembly language; generating a first stage intermediate language from the generated preprocessing intermediate language; generating a second stage intermediate language from the first stage intermediate language; and setting the second stage intermediate language as a final intermediate language if the generated second stage intermediate language is the stepwise intermediate language of the last of the preset steps.

Generating the preprocessing intermediate language includes: receiving binary information, which is information on the environment in which the binary file for the assembly language was generated; and deriving abstraction information from the assembly language based on the binary information.

Generating the second stage intermediate language may further include receiving the second stage intermediate language as input and generating a third stage intermediate language if the second stage intermediate language is not the intermediate language of the last of the preset steps.

The first stage intermediate language, the second stage intermediate language, and the third stage intermediate language are each generated in units of preset blocks, and the block unit corresponds to either a function or a module constituting the binary file.

According to the present invention, binary code can be expressed in languages ranging from a low level to a high level.

In addition, a high-level intermediate language improves the efficiency of reverse engineering, and various abstract information that does not exist in the binary can be inferred.

FIG. 1 is a diagram illustrating an exemplary program generation process.
FIG. 2 is an exemplary diagram of an environment including an intermediate language conversion apparatus according to an embodiment of the present invention.
FIG. 3 is a structural diagram of an intermediate language conversion apparatus according to an embodiment of the present invention.
FIG. 4 is a flowchart illustrating a method of restoring a program through an intermediate language conversion method according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings so that those skilled in the art can easily carry out the present invention. The present invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. In order to clearly illustrate the present invention, parts not related to the description are omitted, and similar parts are denoted by like reference characters throughout the specification.

Throughout the specification, when an element is referred to as "comprising" another element, it means that it can further include other elements rather than excluding them, unless specifically stated otherwise.

Hereinafter, an intermediate language conversion apparatus and method according to an embodiment of the present invention will be described with reference to the drawings. Before describing an embodiment of the present invention, a general program creation process will be described with reference to FIG. 1.

FIG. 1 is a diagram illustrating an exemplary program generation process.

As shown in FIG. 1, in a general program generation process, a program developer turns an idea for a program into source code (S10), and the input source code is translated through an intermediate language into assembly language (S20, S30). The generated assembly language is then converted into binary code that the computer can understand, and the program runs (S40).

Thus, when a program is created, only one intermediate language generation step is needed to convert source code into assembly language. General reverse engineering likewise uses a single intermediate language generation step, which is well suited to representing low-level machine language. However, it is inefficient for expressing the various loops and conditional statements of a high-level language, because each of them can only be expressed by chaining several intermediate language statements.

Accordingly, constructing a stepwise intermediate language in the reverse engineering process provides a basis for efficient binary analysis. It also yields an intermediate language that can express binary code from multiple platforms, which eases the expression of, and extension to, various platforms, and it allows high-level code information to be extended. An intermediate language conversion apparatus and method with these properties will be described with reference to FIGS. 2 to 4.

FIG. 2 is an exemplary diagram of an environment including an intermediate language conversion apparatus according to an embodiment of the present invention.

As shown in FIG. 2, binary information is obtained based on various types of binary codes, operating system (OS) information of an environment in which binary codes are written, and instruction set architecture (ISA) information. The assembly language generated by converting the binary code is input to the intermediate language conversion apparatus 100 together with the determined binary information.

Here, an example of determining the binary information is described. When a binary file is stored in the PE format, one of several binary file formats (for example, PE, ELF, and Mach-O), a PE viewer tool can readily identify the operating system information and the instruction set information. There are various ways to determine the binary information, and the present invention is not limited to any one of them.
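
As one possible way to obtain part of the binary information without a viewer tool, the file header can be read directly. The following Python sketch is an assumption-laden illustration, not the patent's method: the function name `detect_binary_info` and the returned dictionary layout are hypothetical, only a handful of machine codes are listed, and operating system information is left out. The header offsets and machine codes follow the published PE/COFF and ELF formats.

```python
import struct

# Subset of machine codes from the PE/COFF and ELF formats.
PE_MACHINES = {0x014C: "x86", 0x8664: "x86-64", 0x01C0: "ARM", 0xAA64: "AArch64"}
ELF_MACHINES = {3: "x86", 62: "x86-64", 40: "ARM", 183: "AArch64"}

def detect_binary_info(data: bytes) -> dict:
    """Return the file format and instruction set read from the file header."""
    if data[:4] == b"\x7fELF" and len(data) >= 20:
        # e_ident[EI_DATA] (byte 5): 1 = little endian, 2 = big endian
        endian = "<" if data[5] == 1 else ">"
        (machine,) = struct.unpack_from(endian + "H", data, 18)  # e_machine
        return {"format": "ELF", "isa": ELF_MACHINES.get(machine, "unknown")}
    if data[:2] == b"MZ" and len(data) >= 0x40:
        (pe_offset,) = struct.unpack_from("<I", data, 0x3C)      # e_lfanew
        has_sig = data[pe_offset:pe_offset + 4] == b"PE\x00\x00"
        if has_sig and len(data) >= pe_offset + 6:
            # The COFF Machine field follows the 4-byte PE signature.
            (machine,) = struct.unpack_from("<H", data, pe_offset + 4)
            return {"format": "PE", "isa": PE_MACHINES.get(machine, "unknown")}
    return {"format": "unknown", "isa": "unknown"}
```

A caller would pass the first bytes of the binary file, for example `detect_binary_info(open(path, "rb").read(4096))`, and hand the result to the conversion apparatus together with the assembly language.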

The intermediate language conversion apparatus 100 generates a plurality of intermediate languages based on the assembly language and the binary information. The intermediate language generated at each of the plurality of stages is assembled in basic block units into an intermediate representation. Each generated intermediate representation is used to generate the intermediate representation of the next step, as described in detail later.

The structure of the intermediate language conversion apparatus 100, which converts an assembly language into an intermediate language using the assembly language and the binary information in the above-described environment, will be described with reference to FIG. 3. The intermediate language conversion apparatus 100 corresponds to an intermediate language translator in a general program, and is referred to as an intermediate language conversion apparatus for convenience of explanation.

FIG. 3 is a structural diagram of an intermediate language conversion apparatus according to an embodiment of the present invention.

As shown in FIG. 3, the intermediate language conversion apparatus 100 includes a preprocessing unit 110 and a post-processing unit 120.

The preprocessing unit 110 receives the assembly language from a disassembler (not shown) that is connected to, and located at the front end of, the intermediate language conversion apparatus 100. The preprocessing unit 110 preprocesses the received assembly language to generate an intermediate language.

That is, the preprocessing unit 110 derives abstraction information from the assembly language and then generates a low-level intermediate language (hereinafter referred to as a preprocessing intermediate language for convenience of explanation). Here, the abstraction information means data information such as variables and types contained in the assembly language. The generated preprocessing intermediate language includes flow information such as the structure of logical operations.

When preprocessing the assembly language, the preprocessing unit 110 derives the abstraction information by referring to binary information input from the outside and then generates the preprocessing intermediate language. It refers to the binary information because binary code takes different forms depending on the OS, the ISA, and the supported hardware.

Therefore, to preprocess the assembly language converted from the binary code, the preprocessing unit 110 needs the binary information, that is, information on the environment in which the binary code was generated. In the embodiment of the present invention, the binary information is assumed to include information on the operating system in which the binary file was generated and information on the instruction set needed to interpret the binary file.

The preprocessing unit 110 can preprocess the assembly language with reference to the binary information, derive the abstraction information from the assembly language, and generate the preprocessing intermediate language by various methods, and a detailed description thereof is omitted in the embodiment of the present invention.
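
Since the patent leaves the preprocessing method open, the following Python sketch shows only one conceivable shape for it. Everything in it is an assumption made for illustration: the function name `preprocess`, the tiny two-operand assembly subset, the tuple-based statements, and the use of the instruction set's word size as the only environment detail. It shows how binary information could let the preprocessing step attach a simple type to each lifted statement.

```python
from typing import Dict, List, Tuple

Stmt = Tuple[str, ...]  # hypothetical low-level IL statement

# Assumption: the word size per ISA is the only environment detail used here.
WORD_BITS = {"x86": 32, "x86-64": 64}

def preprocess(assembly: List[str], binary_info: Dict[str, str]) -> List[Stmt]:
    """Lift a tiny assembly subset to a low-level IL annotated with a type."""
    bits = WORD_BITS[binary_info["isa"]]
    il: List[Stmt] = []
    for line in assembly:
        mnemonic, _, operands = line.strip().partition(" ")
        ops = [op.strip() for op in operands.split(",")] if operands else []
        if mnemonic == "mov":
            il.append(("assign", ops[0], ops[1], f"u{bits}"))          # dst := src
        elif mnemonic in ("add", "sub"):
            il.append((mnemonic, ops[0], ops[0], ops[1], f"u{bits}"))  # dst := dst op src
        elif mnemonic == "jmp":
            il.append(("jump", ops[0]))
        else:
            il.append(("unknown", line.strip()))  # kept verbatim for later stages
    return il

print(preprocess(["mov eax, 1", "add eax, ebx"], {"isa": "x86"}))
```

The same assembly text lifted with `{"isa": "x86-64"}` would carry `u64` annotations instead, which is the kind of environment dependence the binary information is meant to resolve.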

The post-processing unit 120, which includes a plurality of stepwise post-processors, repeatedly translates the preprocessing intermediate language generated by the preprocessing unit 110 for a preset number of steps, generating a stepwise intermediate language in basic block units at each step. Once translation has been repeated for the preset number of steps, a high-level intermediate language (hereinafter referred to as the final intermediate language) is obtained.

Here, the final intermediate language is the intermediate language at the stage just before source code is produced, and its expressiveness matches that of a high-level programming language. Source code can be generated from the final intermediate language once language-specific features (e.g., grammar) are taken into account.

The basic block unit is a block delimited by a reference point such as a branch instruction, and it is the basic unit in which the intermediate language of each step is expressed in the embodiment of the present invention. When a binary file is converted into assembly language, it is difficult to understand the flow of the binary file from the generated assembly language alone. Therefore, to make the overall structure of the binary file easy to grasp, the stepwise intermediate language can be expressed in units of the functions or modules constituting the binary file, so that a high-level language construct can be expressed as an intermediate language without chaining several intermediate language statements.
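
Splitting at branch instructions can be sketched as follows. This is a generic basic-block split written in Python for illustration, not code from the patent; the branch mnemonic set `BRANCHES`, the `name:` label convention, and the function name `split_basic_blocks` are assumptions.

```python
from typing import List

BRANCHES = {"jmp", "je", "jne", "ret"}  # assumed branch mnemonics

def split_basic_blocks(instructions: List[str]) -> List[List[str]]:
    """Cut the instruction stream before labels and after branch instructions."""
    blocks: List[List[str]] = []
    current: List[str] = []
    for ins in instructions:
        if ins.endswith(":") and current:   # a label (branch target) opens a new block
            blocks.append(current)
            current = []
        current.append(ins)
        if ins.split()[0] in BRANCHES:      # a branch closes the current block
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks

code = ["mov eax, 0", "loop:", "add eax, 1", "cmp eax, 10", "jne loop", "ret"]
for block in split_basic_blocks(code):
    print(block)
```

Blocks produced this way can then be grouped per function or module, which is the coarser unit the description mentions.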

The first stage intermediate language, which the first-stage post-processor 120-1 creates by translating the preprocessing intermediate language, is at a higher level than the preprocessing intermediate language. The second-stage post-processor 120-2 in turn translates the first stage intermediate language to generate the second stage intermediate language at a higher level than the first stage intermediate language.

The post-processing unit 120 repeats this procedure for the preset number of steps to generate the stepwise intermediate languages, and the intermediate language translated last becomes the final intermediate language. At each step of generating a stepwise intermediate language, the post-processing unit 120 abstracts further information.
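
The chain of post-processors can be pictured with a short Python sketch. The class names `PostProcessor` and `PostProcessingUnit`, the list-of-strings stand-in for an intermediate language, and the placeholder translations are all hypothetical; the sketch only captures the structure described above, in which each stage consumes the previous stage's output and must sit at a higher level.

```python
from dataclasses import dataclass
from typing import Callable, List

IL = List[str]  # placeholder representation of one stepwise intermediate language

@dataclass
class PostProcessor:
    level: int                     # abstraction level of the IL this stage produces
    translate: Callable[[IL], IL]  # stage-specific translation (hypothetical)

class PostProcessingUnit:
    def __init__(self, processors: List[PostProcessor]):
        # Each stage must produce a higher-level IL than the one before it.
        levels = [p.level for p in processors]
        if levels != sorted(set(levels)):
            raise ValueError("post-processor levels must strictly increase")
        self.processors = processors

    def run(self, preprocessing_il: IL) -> List[IL]:
        """Return every stepwise IL in order; the last entry is the final IL."""
        stages: List[IL] = []
        current = preprocessing_il
        for processor in self.processors:
            current = processor.translate(current)
            stages.append(current)
        return stages

unit = PostProcessingUnit([
    PostProcessor(1, lambda il: [f"L1({s})" for s in il]),
    PostProcessor(2, lambda il: [f"L2({s})" for s in il]),
])
stages = unit.run(["t0 := 1"])
final_il = stages[-1]
print(final_il)
```

Keeping every stage's output, rather than only the last one, is a design choice of this sketch that makes each intermediate result available for inspection.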

To this end, in the embodiment of the present invention, a plurality of post-processors 120-1 to 120-n form the post-processing unit 120. However, in the post-processing unit 120, .

An intermediate language grammar for generating a stepwise intermediate language is defined in each of the post-processors 120-1 to 120-n. The form of the intermediate language grammars stored in the post-processors 120-1 to 120-n, and the way they are defined, are not limited to any one method.

In addition, the post-processors 120-1 to 120-n may be created in a number matching the number of steps once that number has been defined, or only some of a set of existing post-processors may be used; the present invention is not limited to either method.

In addition, the stepwise intermediate language can be expressed in various ways depending on how the intermediate language is used. For example, it can be expressed with characteristics suited to the purpose of the reverse engineering, such as vulnerability analysis or malicious code analysis. The stepwise intermediate language can be expressed by various methods, and a detailed description thereof is omitted in the embodiment of the present invention.

A method for converting an assembly language into an intermediate language using the above-described intermediate language conversion apparatus 100 will be described with reference to FIG. 4.

FIG. 4 is a flowchart illustrating a method of restoring a program through an intermediate language conversion method according to an embodiment of the present invention.

As shown in FIG. 4, when binary code is generated in any one of systems having various instruction sets or operating systems (S100), a disassembler (not shown) converts the generated binary code into assembly language (S110). How binary code is generated and how a disassembler converts binary code into assembly language are already known, and a detailed description thereof is omitted in the embodiment of the present invention.

When the intermediate language conversion apparatus 100 receives the assembly language generated by the disassembler in step S110, the preprocessing unit 110 converts the assembly language into a preprocessing intermediate language using the binary information (S120). Here, the binary information is information on the environment in which the binary code was generated in step S100, and includes instruction set information, operating system information, and information on the hardware supported by the binary code.

When the preprocessing unit 110 has generated the preprocessing intermediate language in step S120, the post-processing unit 120 receives it and translates it into an intermediate language a predetermined number of times, generating a stepwise intermediate language each time (S130). That is, the post-processing unit 120 first receives the preprocessing intermediate language and interprets it to generate the first stage intermediate language. The first stage intermediate language is at a higher level than the preprocessing intermediate language; a higher level here means closer to a language that a person can understand.

The post-processing unit 120 derives abstraction information from the stepwise intermediate language generated in step S130 (S140). The abstraction information can be derived from the stepwise intermediate language in various ways, and what is derived can also vary with the purpose of the reverse engineering, so the present invention is not limited to any one embodiment.

Here, the abstraction information includes data abstraction information and flow abstraction information. The data abstraction information is used to define an intermediate language that restores complex data from unit-size data in the order of simple data, continuous data, complex data, and data confidentiality. The flow abstraction information allows changes in program flow, such as simple logical values, repetition, branches, and functions, to be restored.
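
The patent does not say how flow abstraction information is recovered. As one commonly used technique, repetition can be spotted by looking for edges in the control flow graph that point back to a block still being explored in a depth-first search. The Python sketch below assumes a hypothetical control flow graph given as a dictionary from block name to successor names; the function name `find_back_edges` is likewise an assumption.

```python
from typing import Dict, List, Set, Tuple

def find_back_edges(cfg: Dict[str, List[str]], entry: str) -> Set[Tuple[str, str]]:
    """Return edges whose target is still on the depth-first search stack."""
    back_edges: Set[Tuple[str, str]] = set()
    visited: Set[str] = set()
    on_stack: Set[str] = set()

    def visit(block: str) -> None:
        visited.add(block)
        on_stack.add(block)
        for succ in cfg.get(block, []):
            if succ in on_stack:
                back_edges.add((block, succ))  # jump back to an open block: repetition
            elif succ not in visited:
                visit(succ)
        on_stack.discard(block)

    visit(entry)
    return back_edges

# Hypothetical graph of a single loop: entry -> head -> body -> head, head -> exit
cfg = {"entry": ["head"], "head": ["body", "exit"], "body": ["head"], "exit": []}
print(find_back_edges(cfg, "entry"))
```

Each reported edge marks a candidate loop whose header is the edge's target, which a higher stage could then rewrite as a loop construct.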

In step S150, the post-processing unit 120 determines whether the stepwise intermediate language generated in step S130 is the intermediate language of the preset step. If it is, that stepwise intermediate language is output as the final intermediate language (S160). In the embodiment of the present invention only the final intermediate language is output in step S160, but the intermediate language of each step may also be output.

The final intermediate language output in step S160 is then turned into source code by code printing (S170). Code printing is already known, and a detailed description thereof is omitted in the embodiment of the present invention.

If it is determined in step S150 that the generated intermediate language is not the intermediate language of the preset step, the procedure from step S130 onward is performed again, interpreting the current stepwise intermediate language to generate the next one.

For example, if the stepwise intermediate language is set, by user input or otherwise, to be generated in three steps in total, the post-processing unit 120 generates the first stage intermediate language by interpreting the preprocessing intermediate language. It then analyzes the first stage intermediate language to generate the second stage intermediate language at a higher level than the first stage intermediate language.

Finally, the second stage intermediate language is interpreted once more to create the third stage intermediate language at a higher level than the second stage intermediate language, and this third stage intermediate language becomes the final intermediate language. By splitting what used to be a single intermediate language generation step into a plurality of steps, a high-level language becomes easy to express. In addition, since each step takes a lower-level intermediate language as input and produces a higher-level one, the intermediate language generated at each stage can be verified.
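
The flow of FIG. 4 for a preset number of steps can be written as a short loop. In this Python sketch the function name `convert`, its parameters, and the string-tagging stand-ins for preprocessing and translation are hypothetical; only the step labels in the comments come from the description.

```python
from typing import Callable, List, Tuple

IL = List[str]  # placeholder representation of an intermediate language

def convert(assembly: List[str],
            preprocess: Callable[[List[str]], IL],
            translate: Callable[[IL, int], IL],
            preset_steps: int) -> Tuple[IL, List[IL]]:
    """Follow S120 to S160: preprocess, then translate until the preset step."""
    current = preprocess(assembly)            # S120: preprocessing IL
    stepwise: List[IL] = []
    step = 0
    while step < preset_steps:                # S150: preset step reached yet?
        step += 1
        current = translate(current, step)    # S130/S140: next, higher-level IL
        stepwise.append(current)
    return current, stepwise                  # S160: final IL, plus every stage

final_il, stepwise = convert(
    ["x"],
    preprocess=lambda asm: [f"pre({a})" for a in asm],
    translate=lambda il, n: [f"{n}:{s}" for s in il],
    preset_steps=3,
)
print(final_il)
```

Returning the per-stage list alongside the final result mirrors the option of outputting the intermediate language of each step, so that each stage can be checked against the one before it.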

While the present invention has been particularly shown and described with reference to exemplary embodiments thereof, it is to be understood that the invention is not limited to the disclosed exemplary embodiments, and that modifications and improvements of those embodiments also fall within the scope of rights of the present invention.

Claims (12)

1. An intermediate language conversion apparatus comprising:
a preprocessing unit that receives assembly language disassembled from a binary file and generates a preprocessing intermediate language from the assembly language; and
a post-processing unit that generates a first stage intermediate language from the preprocessing intermediate language, generates a second stage intermediate language by receiving the first stage intermediate language as input, and outputs the stepwise intermediate language generated after a predetermined number of steps as a final intermediate language.
2. The apparatus of claim 1, wherein the post-processing unit comprises:
a first post-processor that receives the preprocessing intermediate language as input and generates the first stage intermediate language from the preprocessing intermediate language; and
a second post-processor that receives the first stage intermediate language as input and generates the second stage intermediate language from the first stage intermediate language.
3. The apparatus of claim 2, wherein the post-processing unit includes a predetermined number of post-processors to generate the stepwise intermediate languages.
4. The apparatus of claim 2, wherein the level of the first stage intermediate language is higher than the level of the preprocessing intermediate language, and the level of the second stage intermediate language is higher than the level of the first stage intermediate language.
5. The apparatus of claim 1, wherein the preprocessing unit receives binary information, which is information on the environment in which the binary file was generated, and, when generating the preprocessing intermediate language, derives abstraction information based on the binary information and then generates the preprocessing intermediate language.
6. The apparatus of claim 5, wherein the binary information includes information on the operating system in which the binary file was generated and information on the instruction set needed to interpret the binary file.
7. A method for converting an assembly language into an intermediate language by an intermediate language conversion apparatus, the method comprising:
generating a preprocessing intermediate language from the assembly language;
generating a first stage intermediate language from the generated preprocessing intermediate language;
generating a second stage intermediate language from the first stage intermediate language; and
setting the second stage intermediate language as a final intermediate language if the generated second stage intermediate language is the stepwise intermediate language of the last of the preset steps.
8. The method of claim 7, wherein generating the preprocessing intermediate language comprises:
receiving binary information, which is information on the environment in which a binary file for the assembly language was generated; and
deriving abstraction information from the assembly language based on the binary information.
9. The method of claim 8, wherein the binary information includes information on the operating system in which the binary file was generated and information on the instruction set needed to interpret the binary file.
10. The method of claim 7, wherein the level of the first stage intermediate language is higher than the level of the preprocessing intermediate language, and the level of the second stage intermediate language is higher than the level of the first stage intermediate language.
11. The method of claim 7, wherein generating the second stage intermediate language further comprises:
generating a third stage intermediate language by receiving the second stage intermediate language as input if the second stage intermediate language is not the intermediate language of the last of the preset steps.
12. The method of claim 11, wherein the first stage intermediate language, the second stage intermediate language, and the third stage intermediate language are each generated in units of preset blocks, and
the block unit corresponds to a function or a module constituting the binary file.
KR1020160155803A 2016-11-22 2016-11-22 Apparatus and method for intermediate language transformation of binary data KR20180057317A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020160155803A KR20180057317A (en) 2016-11-22 2016-11-22 Apparatus and method for intermediate language transformation of binary data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020160155803A KR20180057317A (en) 2016-11-22 2016-11-22 Apparatus and method for intermediate language transformation of binary data

Related Child Applications (1)

Application Number Title Priority Date Filing Date
KR1020180100655A Division KR20180098213A (en) 2018-08-27 2018-08-27 Apparatus and method for intermediate language transformation of binary data

Publications (1)

Publication Number Publication Date
KR20180057317A 2018-05-30

Family

ID=62300110

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020160155803A KR20180057317A (en) 2016-11-22 2016-11-22 Apparatus and method for intermediate language transformation of binary data

Country Status (1)

Country Link
KR (1) KR20180057317A (en)

Legal Events

Date Code Title Description
A201 Request for examination
A302 Request for accelerated examination
E902 Notification of reason for refusal
AMND Amendment
E601 Decision to refuse application
AMND Amendment
A107 Divisional application of patent