CN116501378B - Implementation method and device for reverse engineering reduction source code and electronic equipment - Google Patents

Implementation method and device for reverse engineering reduction source code and electronic equipment

Publication number
CN116501378B
Authority
CN
China
Prior art keywords
information
binary file
target program
analysis
control flow
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310766247.3A
Other languages
Chinese (zh)
Other versions
CN116501378A (en)
Inventor
柯志杰
徐斌
何怀兵
王骏涛
明小民
胡亚林
杨琰
潘爱平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Big Data Industry Development Co ltd
Original Assignee
Wuhan Big Data Industry Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Big Data Industry Development Co ltd
Priority to CN202310766247.3A
Publication of CN116501378A
Application granted
Publication of CN116501378B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/74 Reverse engineering; Extracting design information from source code
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/52 Binary to binary
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/53 Decompilation; Disassembly
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application relates to a method and a device for restoring source code through reverse engineering, and to electronic equipment. The method comprises the following steps: acquiring a binary file of a target program and extracting information from the binary file; performing static analysis on the binary file to obtain control flow information and the type and scope information of the function variables of the target program; performing dynamic analysis on the binary file to obtain memory distribution information of the target program, operation information of the target program on registers, and interaction information between the target program and the operating system; disassembling the binary file based on the static information and dynamic information to obtain the assembly language code of the binary file; and converting the assembly language code into high-level language code, then performing structural optimization and grammar correction on the high-level language code to obtain the source code of the target program. The application improves the quality and efficiency of reverse code restoration.

Description

Implementation method and device for reverse engineering reduction source code and electronic equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for implementing reverse engineering reduction source code, and an electronic device.
Background
Reverse engineering refers to understanding the internal operating mechanisms and design principles of a software system by analyzing its binary code. In practice, reverse engineering is often used in software debugging, software security analysis, software copyright protection and similar fields. During reverse analysis, a disassembler is usually required to convert the binary code into assembly code, which is then analyzed manually or with tools to finally restore the source code of the software.
However, in actual reverse analysis, owing to the complexity of assembly code and the limits of manual analysis, usually only part of the source code can be restored, and the quality of the restored code often fails to meet developers' requirements. Therefore, how to improve the quality and efficiency of reverse code restoration has become an urgent problem in the field of reverse engineering.
Disclosure of Invention
In view of the foregoing, it is necessary to provide a method, a device and electronic equipment for restoring source code through reverse engineering, so as to improve the quality and efficiency of reverse code restoration.
In order to achieve the above object, the present application provides a method for implementing reverse engineering reduction source code, including:
acquiring a binary file of a target program, and extracting information in the binary file;
performing structural analysis on the binary file to obtain structural information of the target program, performing control flow analysis on the binary file to obtain control flow information of the target program, and performing variable analysis on the binary file to obtain type and scope information of function variables of the target program;
performing memory analysis on the binary file to obtain memory distribution information of the target program, performing register analysis on the binary file to obtain operation information of the target program on a register, and performing system call analysis on the binary file to obtain interaction information of the target program and an operating system;
disassembling the binary file based on the information in the binary file, the structure information, the control flow information, the type of the target program function variable, the scope information, the memory distribution information, the operation information on the register and the interaction information to obtain an assembly language code of the binary file;
and converting the assembly language code into a high-level language code, and performing structural optimization and grammar correction on the high-level language code to obtain the source code of the target program.
In some possible implementations, the obtaining the binary file of the target program and extracting the information in the binary file include:
acquiring a binary file of the target program through a binary file analysis tool;
extracting an assembly instruction sequence of the binary file through a disassembly tool;
and obtaining information in the binary file according to the assembly instruction sequence of the binary file, wherein the information in the binary file comprises an entry point and a function address of a target program.
In some possible implementations, the performing structural analysis on the binary file to obtain structural information of the target program includes:
dividing the binary file into a plurality of basic blocks;
performing instruction analysis on the basic blocks to obtain the types, operands and operation types of the basic block instructions;
and sequencing the basic blocks according to the type, the operand and the operation type of the basic block instruction to obtain the structure information of the target program.
In some possible implementations, the performing control flow analysis on the binary file to obtain control flow information of the target program includes:
performing control flow analysis on the binary file to obtain the call relationships, conditional jump relationships and loop branch relationships of the functions in the binary file;
adding the call relationships, conditional jump relationships and loop branch relationships of the functions in the binary file to a control flow graph;
and obtaining control flow information of the target program based on the control flow graph.
In some possible implementations, the performing memory analysis on the binary file to obtain memory distribution information of the target program, performing register analysis on the binary file to obtain operation information of the target program on a register, and performing system call analysis on the binary file to obtain interaction information of the target program and an operating system includes:
acquiring the memory distribution information of the target program through a memory monitoring tool;
monitoring the access behavior of the target program on a register through a debugger, and acquiring the operation information of the target program on the register;
and acquiring the interaction information of the target program and the operating system through a program debugging tool.
In some possible implementations, the disassembling the binary file based on the information in the binary file, the structure information, the control flow information, the type of the target program function variable, the scope information, the memory distribution information, the operation information on the register and the interaction information to obtain the assembly language code of the binary file includes:
determining a starting point of code reconstruction according to the entry point and the function address of the target program;
performing code reconstruction based on the structure information, the control flow information, the type of the target program function variable, the scope information, the memory distribution information, the operation information on the register and the interaction information to obtain an assembly instruction;
restoring the pseudo instruction in the binary file;
obtaining the entry, the exit and the internal execution sequence of the function in the binary file based on the reconstructed starting point, the assembler instruction and the pseudo instruction;
and determining the assembly language code of the binary file according to the entry, the exit and the internal execution sequence of the function in the binary file.
In some possible implementations, the method further includes testing the source code of the target program with an automated test tool, so that the functions of the restored source code are consistent with those of the target program.
On the other hand, the application also provides a device for realizing reverse engineering reduction source codes, which comprises:
a binary file information acquisition unit for acquiring a binary file of a target program and extracting information in the binary file;
the binary file static analysis unit is used for carrying out structural analysis on the binary file to obtain the structural information of the target program, carrying out control flow analysis on the binary file to obtain the control flow information of the target program, and carrying out variable analysis on the binary file to obtain the type and scope information of the function variables of the target program;
the binary file dynamic analysis unit is used for carrying out memory analysis on the binary file to obtain the memory distribution information of the target program, carrying out register analysis on the binary file to obtain the operation information of the target program on a register, and carrying out system call analysis on the binary file to obtain the interaction information of the target program and an operating system;
the assembly language code acquisition unit is used for disassembling the binary file based on the information in the binary file, the structure information, the control flow information, the type of the target program function variable, the scope information, the memory distribution information, the operation information on the register and the interaction information to obtain the assembly language code of the binary file;
the source code acquisition unit is used for converting the assembly language code into a high-level language code, and carrying out structural optimization and grammar correction on the high-level language code to obtain the source code of the target program.
In another aspect, the application also provides an electronic device comprising a memory and a processor, wherein,
the memory is used for storing programs;
the processor is coupled to the memory, and is configured to execute the program stored in the memory, so as to implement the steps of the implementation method of reverse engineering reduction source code in any one of the foregoing implementations.
In another aspect, the present application further provides a computer readable storage medium, configured to store a computer readable program or instructions, where the program or instructions, when executed by a processor, implement the steps in a method for implementing reverse engineering reduction source code in any one of the foregoing implementation manners.
The beneficial effects of the above embodiments are as follows. The application provides a method for restoring source code through reverse engineering. The method first obtains the binary file of the target program and extracts the information in it; then statically analyzes the binary file to obtain the structure information, the control flow information, and the type and scope information of the target program's function variables; then dynamically analyzes the binary file to obtain the memory distribution information, the operation information on the registers and the interaction information; then disassembles the binary file based on the information in the binary file, the static analysis data and the dynamic analysis data to obtain the assembly language code of the binary file; and finally converts the assembly language code into high-level language code and performs structural optimization and grammar correction to obtain the source code of the target program. By performing file analysis, static analysis, dynamic analysis, assembly language restoration and semantic restoration on the target program, the method can restore the original source code effectively and accurately.
Drawings
FIG. 1 is a schematic flow chart of an embodiment of a method for implementing reverse engineering reduction source code according to the present application;
FIG. 2 is a schematic structural diagram of an embodiment of a device for implementing reverse engineering reduction source code according to the present application;
FIG. 3 is a schematic structural diagram of an embodiment of an electronic device according to the present application.
Detailed Description
The following detailed description of preferred embodiments of the application is made in connection with the accompanying drawings, which form a part hereof, and together with the description of the embodiments of the application, are used to explain the principles of the application and are not intended to limit the scope of the application.
Fig. 1 is a schematic flow chart of an embodiment of a method for implementing reverse engineering reduction source code provided by the present application, as shown in fig. 1, and the method for implementing reverse engineering reduction source code includes:
s101, acquiring a binary file of a target program, and extracting information in the binary file;
s102, carrying out structural analysis on the binary file to obtain structural information of the target program, carrying out control flow analysis on the binary file to obtain control flow information of the target program, and carrying out variable analysis on the binary file to obtain type and scope information of function variables of the target program;
s103, performing memory analysis on the binary file to obtain memory distribution information of the target program, performing register analysis on the binary file to obtain operation information of the target program on a register, and performing system call analysis on the binary file to obtain interaction information of the target program and an operating system;
s104, disassembling the binary file based on the information in the binary file, the structure information, the control flow information, the type of the target program function variable, the scope information, the memory distribution information, the operation information on the register and the interaction information to obtain an assembly language code of the binary file;
s105, converting the assembly language code into a high-level language code, and performing structural optimization and grammar correction on the high-level language code to obtain the source code of the target program.
Compared with the prior art, the method for restoring source code through reverse engineering provided by this embodiment first obtains the binary file of the target program and extracts the information in it; then statically analyzes the binary file to obtain the structure information, the control flow information, and the type and scope information of the target program's function variables; then dynamically analyzes the binary file to obtain the memory distribution information, the operation information on the registers and the interaction information; then disassembles the binary file based on the information in the binary file, the static analysis data and the dynamic analysis data to obtain the assembly language code of the binary file; and finally converts the assembly language code into high-level language code and performs structural optimization and grammar correction to obtain the source code of the target program. By performing file analysis, static analysis, dynamic analysis, assembly language restoration and semantic restoration on the target program, the method can restore the original source code effectively and accurately.
It should be noted that, in the binary file, the entry point is the address at which the program starts running, and the function addresses are important reference information during program execution. By analyzing the header information and the import table of the binary file, we can obtain the entry point and the function addresses. In some embodiments of the present application, in step S101, the obtaining a binary file of the target program, and extracting information in the binary file includes:
acquiring a binary file of the target program through a binary file analysis tool;
extracting an assembly instruction sequence of the binary file through a disassembly tool;
and obtaining information in the binary file according to the assembly instruction sequence of the binary file, wherein the information in the binary file comprises an entry point and a function address of a target program.
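As an illustration of reading the entry point from header information, the following is a minimal sketch that assumes the target is a Windows PE image (the patent does not fix a file format, and the function name `read_pe_entry_point` is hypothetical). It reads only the entry point; function addresses from the import table are not covered here.

```python
import struct

def read_pe_entry_point(data: bytes) -> int:
    """Return ImageBase + AddressOfEntryPoint for a PE image held in memory."""
    if data[:2] != b"MZ":
        raise ValueError("not an MZ/PE image")
    # e_lfanew (offset 0x3C) holds the file offset of the PE signature
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # Skip the 4-byte signature and the 20-byte COFF file header
    opt = pe_offset + 4 + 20
    (magic,) = struct.unpack_from("<H", data, opt)
    # AddressOfEntryPoint sits at offset 16 of the optional header
    (entry_rva,) = struct.unpack_from("<I", data, opt + 16)
    if magic == 0x10B:    # PE32: 4-byte ImageBase at offset 28
        (image_base,) = struct.unpack_from("<I", data, opt + 28)
    elif magic == 0x20B:  # PE32+: 8-byte ImageBase at offset 24
        (image_base,) = struct.unpack_from("<Q", data, opt + 24)
    else:
        raise ValueError("unknown optional header magic")
    return image_base + entry_rva
```

A caller would typically pass the bytes of the file, for example `read_pe_entry_point(open("test.exe", "rb").read())`.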
In a specific embodiment of the present application, the binary code of the target program is analyzed using a binary file analysis tool. Binary files are typically composed of machine code that requires the use of specialized disassembly tools or libraries to convert them into sequences of assembler instructions. For example, we can use a disassembly tool such as IDA Pro to analyze the binary code of the target program and obtain its sequence of assembler instructions. Assuming the binary file of the target program is test.exe, we can use IDA Pro to analyze test.exe and obtain its assembler instruction sequence. The following is code for using IDA Pro to obtain the sequence of assembler instructions:
# Import the IDAPython modules (run this script inside IDA Pro with test.exe already loaded)
import idautils
import idc
# Get the address of the program entry point
entry_point = idc.get_entry(idc.get_entry_ordinal(0))
# Traverse the function table to collect the start address of every function
functions = list(idautils.Functions())
# Collect the assembly instruction sequence
disasm = []
for function_address in functions:
    function_name = idc.get_func_name(function_address)
    function_start = idc.get_func_attr(function_address, idc.FUNCATTR_START)
    function_end = idc.get_func_attr(function_address, idc.FUNCATTR_END)
    # Heads() yields item start addresses instead of every byte address
    for address in idautils.Heads(function_start, function_end):
        instruction = idc.GetDisasm(address)
        disasm.append(instruction)
# Output the assembly instruction sequence
for instruction in disasm:
    print(instruction)
By analyzing the sequence of assembler instructions, we can obtain the entry point of the program and the addresses of the individual functions. In an assembler instruction sequence, the program entry point is typically the first instruction, while the address of the function is typically the target address of a jump instruction or call instruction. By parsing these instructions we can get the addresses of the program entry points and the individual functions.
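The idea of collecting function addresses from call targets can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes each disassembly line is a plain string and that direct call targets are printed as hexadecimal literals with a `0x` prefix; the name `find_function_addresses` is hypothetical. Indirect calls such as `call eax` are skipped because their targets are not known statically.

```python
import re

# Matches a direct call whose target is a hexadecimal literal, e.g. "call 0x401020"
CALL_RE = re.compile(r"^\s*call\s+0x([0-9A-Fa-f]+)\s*$")

def find_function_addresses(disasm_lines):
    """Return the sorted, de-duplicated targets of direct call instructions."""
    targets = set()
    for line in disasm_lines:
        match = CALL_RE.match(line)
        if match:
            targets.add(int(match.group(1), 16))
    return sorted(targets)
```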
Suppose the assembly instruction sequence of the target program test.exe has been obtained with the example code above; analyzing that sequence yields the program entry point and the addresses of the individual functions.
After the entry point and function addresses of the program are obtained, static analysis is performed. Static analysis is divided into structural analysis of the assembly code, control flow analysis and variable analysis. Structural analysis of the assembly code refers to analyzing the basic blocks, instructions, data structures and the like of the assembly code to acquire the structure information of the program. Control flow analysis refers to analyzing the control flow of the program, such as function calls, loops and branches, to obtain the control flow information of the program. Variable analysis refers to analyzing the variables used in the program to obtain information such as their types and scopes.
Structural analysis of assembly code: basic block analysis is performed on the assembly code, dividing the code into basic blocks, that is, code segments that contain no jump instructions, and acquiring the starting address and length of each basic block. Instruction analysis is then performed on each basic block to acquire information such as the instruction type and operand types, and the instructions are classified by type, for example MOV instructions and ADD instructions. Finally, the instruction sequences are arranged in the order of the basic blocks to form the structure information of the program.
In some embodiments of the present application, in step S102, the performing structural analysis on the binary file to obtain structural information of the target program includes:
dividing the binary file into a plurality of basic blocks;
performing instruction analysis on the basic blocks to obtain the types, operands and operation types of the basic block instructions;
and sequencing the basic blocks according to the type, the operand and the operation type of the basic block instruction to obtain the structure information of the target program.
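The instruction analysis step above (obtaining type, operands and operation type) might look like the following minimal sketch. The category table and the name `parse_instruction` are illustrative assumptions, and the input is assumed to be a non-empty Intel-syntax instruction string.

```python
# Illustrative, incomplete mapping from mnemonic to operation type (an assumption)
CATEGORIES = {
    "mov": "data transfer", "push": "data transfer", "pop": "data transfer",
    "lea": "data transfer",
    "add": "arithmetic", "sub": "arithmetic", "inc": "arithmetic", "dec": "arithmetic",
    "and": "logic", "or": "logic", "xor": "logic",
    "cmp": "comparison", "test": "comparison",
    "call": "control transfer", "ret": "control transfer",
}

def parse_instruction(text):
    """Split an instruction string into mnemonic, operands and operation type."""
    parts = text.split(None, 1)
    mnemonic = parts[0].lower()
    rest = parts[1] if len(parts) > 1 else ""
    operands = [op.strip() for op in rest.split(",")] if rest.strip() else []
    if mnemonic in CATEGORIES:
        category = CATEGORIES[mnemonic]
    elif mnemonic.startswith("j"):
        category = "control transfer"  # jmp and conditional jumps
    else:
        category = "other"
    return {"mnemonic": mnemonic, "operands": operands, "category": category}
```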
Control flow analysis converts the assembly code into a control flow graph and thereby determines the control flow of the program. In some embodiments of the present application, in step S102, the performing control flow analysis on the binary file to obtain control flow information of the target program includes:
performing control flow analysis on the binary file to obtain the call relationships, conditional jump relationships and loop branch relationships of the functions in the binary file;
adding the call relationships, conditional jump relationships and loop branch relationships of the functions in the binary file to a control flow graph;
and obtaining control flow information of the target program based on the control flow graph.
In a specific embodiment of the present application, control flow analysis is performed on the assembly code, the assembly code is converted into a control flow graph, and the control flow of the program is determined. The code is as follows:
# Assumption: `functions` here maps each function address to its list of instruction strings
control_flow_graph = {}
# Analyze function calls
for function_address in functions:
    function_instructions = functions[function_address]
    for i, instruction in enumerate(function_instructions):
        if instruction.startswith("call"):
            # Resolve the function call
            # ...
            # Add the function call to the control flow graph
            # ...
            pass
# Analyze conditional jumps
for function_address in functions:
    function_instructions = functions[function_address]
    for i, instruction in enumerate(function_instructions):
        if instruction.startswith("j") and not instruction.startswith("jmp"):
            # Resolve the conditional jump
            # ...
            # Add the conditional jump to the control flow graph
            # ...
            pass
# Analyze loops (unconditional jumps are treated as loop candidates)
for function_address in functions:
    function_instructions = functions[function_address]
    for i, instruction in enumerate(function_instructions):
        if instruction.startswith("jmp"):
            # Resolve the loop
            # ...
            # Add the loop to the control flow graph
            # ...
            pass
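To make the skeleton above concrete, the following is a minimal sketch of one common way to build a control flow graph for a single function. It is not taken from the patent: it assumes instructions are `(address, text)` pairs sorted by address, that jump targets are printed as `0x` hexadecimal literals, and it treats any edge to an equal or lower address as a loop candidate, which is only a heuristic. The names `split_branch` and `build_cfg` are hypothetical.

```python
def split_branch(text):
    """Return (mnemonic, target); target is set only for jumps with a hex literal operand."""
    parts = text.split()
    mnemonic = parts[0].lower()
    target = None
    if mnemonic.startswith("j") and len(parts) == 2 and parts[1].startswith("0x"):
        target = int(parts[1], 16)
    return mnemonic, target

def build_cfg(instructions):
    """Return (blocks, edges, back_edges) for a list of (address, text) pairs."""
    addresses = {addr for addr, _ in instructions}
    # Leaders: first instruction, in-function jump targets, and instructions after a jump or ret
    leaders = {instructions[0][0]}
    for i, (_, text) in enumerate(instructions):
        mnemonic, target = split_branch(text)
        if mnemonic.startswith("j") or mnemonic == "ret":
            if target in addresses:
                leaders.add(target)
            if i + 1 < len(instructions):
                leaders.add(instructions[i + 1][0])
    # Group instructions into basic blocks keyed by leader address
    blocks, current = {}, None
    for addr, text in instructions:
        if addr in leaders:
            current = addr
            blocks[current] = []
        blocks[current].append((addr, text))
    # Connect each block to its successors
    order = sorted(blocks)
    edges = {}
    for idx, leader in enumerate(order):
        mnemonic, target = split_branch(blocks[leader][-1][1])
        fallthrough = order[idx + 1] if idx + 1 < len(order) else None
        successors = set()
        if mnemonic != "ret":
            if mnemonic.startswith("j") and target in blocks:
                successors.add(target)
            if mnemonic != "jmp" and fallthrough is not None:
                successors.add(fallthrough)
        edges[leader] = sorted(successors)
    # Heuristic: an edge to an equal or lower address suggests a loop
    back_edges = [(src, dst) for src in order for dst in edges[src] if dst <= src]
    return blocks, edges, back_edges
```

Call instructions are deliberately not treated as block terminators here; call relationships between functions would be recorded in a separate call graph.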
In some embodiments of the present application, in step S103, the performing memory analysis on the binary file to obtain memory distribution information of the target program, performing register analysis on the binary file to obtain operation information of the target program on a register, and performing system call analysis on the binary file to obtain interaction information of the target program and an operating system includes:
acquiring the memory distribution information of the target program through a memory monitoring tool;
monitoring the access behavior of the target program on a register through a debugger, and acquiring the operation information of the target program on the register;
and acquiring the interaction information of the target program and the operating system through a program debugging tool.
In an embodiment of the present application, a memory monitoring tool, such as Memwatch or Valgrind, is used to monitor the memory access behavior of a program, thereby obtaining the memory distribution information of the program. The following is example code for memory analysis using Valgrind:
$ valgrind --tool=memcheck ./test.exe
Register analysis: a debugger can be used to monitor the program's access to registers and thus obtain the operation information of the program on the registers; GDB, for example, can be used for this purpose.
System call analysis: a tool such as strace is used to monitor the system call behavior of the program, so as to acquire the interaction information between the program and the operating system. The following is example code for system call analysis using strace:
$ strace ./test.exe
I/O analysis: an I/O monitoring tool, such as dtrace, may be used to monitor the input/output behavior of a program to obtain its input/output information. The following is example code for I/O analysis using dtrace:
$ sudo dtrace -n 'syscall::read:entry { trace(arg0); }' -c ./test.exe
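The traces produced by such tools still have to be turned into interaction information. As a minimal sketch, assuming the strace output has been saved as text (for instance with strace's `-o` option) and uses the default `name(args) = result` line layout, the system calls can be tallied as follows; the name `count_syscalls` is hypothetical.

```python
import re
from collections import Counter

# A syscall line starts with the call name, optionally preceded by "[pid N]"
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+)?(\w+)\(")

def count_syscalls(strace_text):
    """Count how often each system call appears in saved strace output."""
    counts = Counter()
    for line in strace_text.splitlines():
        match = SYSCALL_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Lines that do not describe a system call, such as signal notices or the final exit status, are ignored by the pattern.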
After the complete information of the program is obtained, the binary file is disassembled to obtain the assembly language code of the binary file. In some embodiments of the present application, in step S104, the disassembling the binary file based on the information in the binary file, the structure information, the control flow information, the type of the target program function variable, the scope information, the memory distribution information, the operation information on the register and the interaction information to obtain an assembly language code of the binary file includes:
determining a starting point of code reconstruction according to the entry point and the function address of the target program;
performing code reconstruction based on the structure information, the control flow information, the type of the target program function variable, the scope information, the memory distribution information, the operation information on the register and the interaction information to obtain an assembly instruction;
restoring the pseudo instruction in the binary file;
obtaining the entry, the exit and the internal execution sequence of the function in the binary file based on the reconstructed starting point, the assembler instruction and the pseudo instruction;
and determining the assembly language code of the binary file according to the entry, the exit and the internal execution sequence of the function in the binary file.
In a specific embodiment of the application, after the complete information of the program is obtained, the binary file is disassembled into assembly code and the assembly language code of the target program is restored. First, a control flow graph of the program is constructed. The control flow graph reflects the basic blocks, branches, loops and other structures in the program, and can be used to further analyze and restore the assembly code. The specific steps are as follows:
Step one: identify basic blocks and instructions. The basic blocks and instructions in the target program are identified on the basis of the control flow graph. A basic block is a sequence of consecutive instructions that contains no branch or jump statement. An instruction is the smallest execution unit in the program and may be an assembly instruction or a pseudo-instruction.
Example code:
# Assume that `instructions` is a list of already separated instruction strings
# Determine whether the instruction is a jump (or call) instruction
def is_branch(instruction):
    return instruction.startswith("j") or instruction.startswith("call")
# Determine whether the instruction is a return instruction
def is_return(instruction):
    return instruction.startswith("ret")
basic_blocks = []  # stores the completed basic blocks
current_block = []  # stores the basic block being built
# Traverse the instruction list
for i, instruction in enumerate(instructions):
    current_block.append(instruction)
    if is_branch(instruction) or is_return(instruction):
        # A jump or return instruction ends the current basic block; start a new one
        basic_blocks.append(current_block)
        current_block = []
    elif i == len(instructions) - 1:
        # The last instruction closes the final basic block
        basic_blocks.append(current_block)
Step two: reconstruct the assembly instructions. The identified assembly instructions need to be reconstructed to restore the original assembly instructions. Reconstruction typically requires analyzing the operands and registers of each instruction in combination with the structure information and control flow information of the target program and the type and scope information of its function variables.
Step three: recover pseudo-instructions. In addition to assembly instructions, the program may contain pseudo-instructions, such as data definition instructions and symbol definition instructions. These pseudo-instructions must also be restored when the assembly code is restored, to ensure the integrity of the program.
Step four: reconstruct the functions. The functions in the target program need to be reconstructed to restore their assembly code. The control flow graph and the results of the data flow analysis are combined to determine the entry, the exit, and the internal execution order of each function.
Step five: generate the assembly code. After the previous steps are completed, the assembly code of the program can be generated from the identified basic blocks and instructions and from the reconstructed assembly instructions and functions. Assembly code is typically presented in text form and can be used to further analyze and modify the program.
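As an illustration of step five, the sketch below renders recovered pseudo-instructions and reconstructed functions as assembly text. It is only a sketch under stated assumptions: the input layout (tuples of symbol, directive, and value for data, and labeled instruction lists for functions), the `.data`/`.text` section layout, and the function name `emit_assembly` are all hypothetical and not tied to any particular assembler.

```python
def emit_assembly(data_items, functions):
    """Render data directives and function bodies as assembly text.

    data_items: list of (symbol, directive, value) tuples.
    functions: dict mapping a function name to a list of (label, instructions).
    """
    lines = [".data"]
    for symbol, directive, value in data_items:
        lines.append(f"{symbol}: {directive} {value}")
    lines.append(".text")
    for name, blocks in functions.items():
        lines.append(f"{name}:")
        for label, instructions in blocks:
            if label != name:  # the entry block reuses the function label
                lines.append(f"{label}:")
            lines.extend(f"    {instruction}" for instruction in instructions)
    return "\n".join(lines)


listing = emit_assembly(
    data_items=[("counter", ".long", "0")],
    functions={
        "main": [
            ("main", ["mov eax, 0"]),
            ("loop_head", ["cmp eax, 10", "jge done"]),
            ("loop_body", ["add eax, 1", "jmp loop_head"]),
            ("done", ["ret"]),
        ]
    },
)
print(listing)
```

Keeping one label per basic block means the jump operands in the listing can refer to block labels instead of raw addresses, which makes the text easier to modify later.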
After the assembly code has been restored, semantic restoration is performed. The restored assembly code is converted into high-level language code by semantic restoration techniques, and structural optimization and grammar correction are applied to produce source code that is clearly structured and easy to understand.
First step: identify the functions in the program and determine their parameters, return values, and internally executed statement blocks. This can be done by examining the call instructions, return instructions, branch instructions, and so on in the assembly code.
Second step: within each function, identify the variables used by the program and determine their type and scope information. The type of a variable can be determined from the operands and registers used in the assembly instructions, and its scope can be determined by data flow analysis. Example code:
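One simple way to locate functions from call instructions can be sketched as follows. The sketch assumes that instructions are `(address, text)` pairs sorted by address, that direct call operands are hexadecimal addresses, and that each function occupies a contiguous address range up to the next function entry. These assumptions, and the name `find_functions`, are illustrative only.

```python
def find_functions(instructions, entry_point):
    """Group instructions into functions keyed by their entry address."""
    entries = {entry_point}
    for _, text in instructions:
        parts = text.split()
        if parts[0] == "call" and len(parts) == 2:
            try:
                entries.add(int(parts[1], 16))  # direct call target
            except ValueError:
                pass  # indirect call such as "call rax": target unknown
    starts = sorted(entries)
    functions = {}
    for index, start in enumerate(starts):
        end = starts[index + 1] if index + 1 < len(starts) else None
        functions[start] = [
            (address, text)
            for address, text in instructions
            if address >= start and (end is None or address < end)
        ]
    return functions


program = [
    (0x00, "call 0x10"),
    (0x05, "ret"),
    (0x10, "mov eax, 1"),
    (0x15, "ret"),
]
functions = find_functions(program, entry_point=0x00)
for start, body in functions.items():
    print(hex(start), [text for _, text in body])
```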
# Traverse each basic block inside the function.
for basic_block in function.basic_blocks:
    # Traverse each instruction in the basic block.
    for instruction in basic_block.instructions:
        # Determine whether the instruction is a variable definition instruction.
        if is_variable_definition(instruction):
            # Obtain the variable name and type information.
            variable_name = get_variable_name(instruction)
            variable_type = get_variable_type(instruction)
            # Add the variable to the symbol table.
            symbol_table.add_variable(variable_name, variable_type, basic_block)
        # Determine whether the instruction is a variable use instruction.
        elif is_variable_usage(instruction):
            # Obtain the variable name and scope information.
            variable_name = get_variable_name(instruction)
            variable_scope = symbol_table.get_variable_scope(variable_name)
            # Update the variable scope information.
            if basic_block not in variable_scope:
                symbol_table.update_variable_scope(variable_name, basic_block)
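The example above relies on a `symbol_table` object that is not defined in the text. A minimal sketch of a class offering the three methods it calls is given below. The class name `SymbolTable`, the extra `get_variable_type` method, and the choice to represent a scope as the list of basic blocks in which the variable appears are assumptions made for illustration.

```python
class SymbolTable:
    """Hypothetical symbol table: maps each variable to a type and a scope."""

    def __init__(self):
        self._types = {}   # variable name -> type
        self._scopes = {}  # variable name -> list of basic blocks

    def add_variable(self, name, variable_type, basic_block):
        """Record a definition and extend the scope with the defining block."""
        self._types[name] = variable_type
        self.update_variable_scope(name, basic_block)

    def get_variable_scope(self, name):
        """Return the blocks seen so far for a variable (empty if unknown)."""
        return self._scopes.setdefault(name, [])

    def update_variable_scope(self, name, basic_block):
        """Add a block to the scope of a variable, without duplicates."""
        scope = self._scopes.setdefault(name, [])
        if basic_block not in scope:
            scope.append(basic_block)

    def get_variable_type(self, name):
        return self._types.get(name)


symbol_table = SymbolTable()
symbol_table.add_variable("counter", "int32", "block_0")
symbol_table.update_variable_scope("counter", "block_1")
symbol_table.update_variable_scope("counter", "block_1")  # ignored duplicate
print(symbol_table.get_variable_scope("counter"))
```

A list is used for the scope so that basic blocks do not need to be hashable objects.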
Third step: after the functions and variables are determined, the statements in the program need to be identified and converted into the syntactic structures of a high-level language. This can be done by identifying the operands and registers used in the assembly instructions, as well as structures such as branches and loops.
Fourth step: while the statements are being identified, data flow analysis is required to determine how the values of variables flow through the program. This helps determine the actual role of each statement in the program, as well as the value range and scope of each variable.
Example code:
# First, perform data flow analysis on each basic block in the program to obtain the active state
# (whether used and defined) of each variable in that basic block and its range of values (if any).
for each basic block in the program:
    # For each basic block, initialize the variable states to undefined.
    for each variable in the program:
        set variable as undefined in the basic block
    # Data flow analysis: update the variable states.
    for each instruction in the basic block:
        if instruction defines a variable:
            set the variable as defined in the basic block
            # Update the value range of the variable.
            update the value range of the variable based on the instruction
        if instruction uses a variable:
            set the variable as used in the basic block
            # If a variable is used, its value range can be updated according to the definition of the variable in the predecessor basic blocks.
            for each predecessor of the basic block:
                if the variable is defined in the predecessor:
                    update the value range of the variable based on the definition in the predecessor
# Then, perform data flow analysis on each function in the program to obtain the active state and
# the value range of each variable in the function, and determine the scope of the variable.
for each function in the program:
    # For each function, initialize the variable states to undefined.
    for each variable in the function:
        set variable as undefined in the function
    # Data flow analysis: update the variable states and value ranges.
    for each basic block in the function:
        for each instruction in the basic block:
            # Similar to the analysis in basic blocks, update the variable states and value ranges.
            ...
    # Determine the scope of each variable.
    for each variable in the function:
        # If a variable is defined at the function entry, it is a global variable.
        if the variable is defined at the entry of the function:
            set the variable as a global variable
        # If the variable is undefined at the function entry, it is a local variable.
        else:
            set the variable as a local variable in the function
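The scope decision at the end of the pseudocode can be approximated with a standard backward liveness analysis: a variable that is read before any definition inside the function must already hold a value at the function entry, so it is classified as global, and every other variable is classified as local. The sketch below is one possible reading, not the implementation of the application. It omits value ranges, models each instruction as a `(defined_variables, used_variables)` pair, and uses the hypothetical names `block_def_use`, `live_in_sets`, and `classify_variables`.

```python
def block_def_use(instructions):
    """Return (defined, used_before_defined) for one basic block."""
    defined, upward_exposed = set(), set()
    for defs, uses in instructions:
        upward_exposed |= {name for name in uses if name not in defined}
        defined |= set(defs)
    return defined, upward_exposed


def live_in_sets(blocks, successors):
    """Iterate the backward liveness equations until nothing changes."""
    summary = {name: block_def_use(body) for name, body in blocks.items()}
    live_in = {name: set() for name in blocks}
    changed = True
    while changed:
        changed = False
        for name in blocks:
            defined, exposed = summary[name]
            live_out = set()
            for successor in successors.get(name, []):
                live_out |= live_in[successor]
            new_in = exposed | (live_out - defined)
            if new_in != live_in[name]:
                live_in[name] = new_in
                changed = True
    return live_in


def classify_variables(blocks, successors, entry):
    """Label variables live at the function entry as global, others as local."""
    live_at_entry = live_in_sets(blocks, successors)[entry]
    names = set()
    for body in blocks.values():
        for defs, uses in body:
            names |= set(defs) | set(uses)
    return {name: "global" if name in live_at_entry else "local" for name in names}


# i = 0; while i < limit: i = i + 1; return i
blocks = {
    "entry": [(["i"], [])],
    "loop": [([], ["i", "limit"]), (["i"], ["i"])],
    "exit": [([], ["i"])],
}
successors = {"entry": ["loop"], "loop": ["loop", "exit"], "exit": []}
scopes = classify_variables(blocks, successors, entry="entry")
print(scopes)
```

In the sample, `limit` is read in the loop but never assigned inside the function, so it is classified as global, while `i` is assigned in the entry block before any use and is classified as local. A value that is live at the entry could also be a function parameter; telling the two apart would need calling-convention information that this sketch does not model.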
Fifth step: after the previous steps are completed, semantic restoration can begin. Semantic restoration converts the assembly code into high-level language code; the syntactic structure and meaning of the high-level language code are determined from the semantics of the assembly instructions. Finally, the high-level language code is generated from the completed semantic restoration work. The generated high-level language code is typically presented in text form and can be used to further analyze and modify the program.
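To make the conversion concrete, the sketch below translates a few x86-style instructions into C-like statements one instruction at a time. It is a deliberately small illustration: the instruction subset, the assumption that the return value is held in `eax`, and the names `BINARY_OPERATORS`, `lift_instruction`, and `lift_block` are hypothetical, and a real semantic restoration step would also recover control structures, types, and variable names.

```python
# Assumed mapping from a small set of mnemonics to C operators.
BINARY_OPERATORS = {"add": "+", "sub": "-", "imul": "*"}


def lift_instruction(text):
    """Translate one instruction string into a C-like statement."""
    mnemonic, _, operand_text = text.partition(" ")
    operands = [op.strip() for op in operand_text.split(",")] if operand_text else []
    if mnemonic == "mov" and len(operands) == 2:
        return f"{operands[0]} = {operands[1]};"
    if mnemonic in BINARY_OPERATORS and len(operands) == 2:
        operator = BINARY_OPERATORS[mnemonic]
        return f"{operands[0]} = {operands[0]} {operator} {operands[1]};"
    if mnemonic == "ret":
        return "return eax;"  # assumption: the return value is held in eax
    return f"/* unhandled: {text} */"


def lift_block(instructions):
    """Translate a basic block instruction by instruction."""
    return [lift_instruction(text) for text in instructions]


statements = lift_block(["mov eax, 2", "add eax, 3", "imul eax, ebx", "ret", "nop"])
for statement in statements:
    print(statement)
```

Unknown instructions are kept as comments rather than dropped, so the generated text never silently loses behavior.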
It should be noted that automated testing is an efficient and accurate way to verify source code: test cases and scripts are written so that the program runs automatically and its output is compared with the output of the target program. Automated testing can greatly shorten verification time, reduce human error, and be repeated to ensure the correctness of the program. Some embodiments of the application further include testing the source code of the target program with an automated test tool so that the functions of the source code are consistent with those of the target program.
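A comparison of this kind can be sketched as a small differential test: the original program and a build of the restored source code are run on the same inputs, and their exit codes and standard output are compared. In the sketch below the two "programs" are stand-in Python one-liners started through `sys.executable`; the command lines, the inputs, and the names `run_program` and `compare_programs` are assumptions for illustration.

```python
import subprocess
import sys


def run_program(command, stdin_text):
    """Run a command with the given standard input; return (exit code, stdout)."""
    completed = subprocess.run(
        command, input=stdin_text, capture_output=True, text=True, timeout=10
    )
    return completed.returncode, completed.stdout


def compare_programs(original_command, restored_command, test_inputs):
    """Return the inputs on which the two programs behave differently."""
    mismatches = []
    for stdin_text in test_inputs:
        expected = run_program(original_command, stdin_text)
        actual = run_program(restored_command, stdin_text)
        if expected != actual:
            mismatches.append(stdin_text)
    return mismatches


# Stand-ins for the target program and for a build of the restored source code.
original = [sys.executable, "-c", "print(int(input()) * 2)"]
restored = [sys.executable, "-c", "n = int(input()); print(n + n)"]
mismatches = compare_programs(original, restored, ["1\n", "21\n", "-4\n"])
print("mismatching inputs:", mismatches)
```

An empty result only shows agreement on the inputs that were tried, so the quality of the check depends on how well the test inputs cover the behavior of the target program.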
In order to better implement the method for implementing reverse engineering reduction source code in the embodiments of the present application, as shown in fig. 2, an embodiment of the present application correspondingly further provides an apparatus for implementing reverse engineering reduction source code. The apparatus 200 for implementing reverse engineering reduction source code includes:
a binary file information obtaining unit 201, configured to obtain a binary file of a target program, and extract information in the binary file;
a binary file static analysis unit 202, configured to perform structural analysis on the binary file to obtain structural information of the target program, perform control flow analysis on the binary file to obtain control flow information of the target program, and perform variable analysis on the binary file to obtain type and scope information of function variables of the target program;
a binary file dynamic analysis unit 203, configured to perform memory analysis on the binary file to obtain memory distribution information of the target program, perform register analysis on the binary file to obtain operation information of the target program on a register, and perform system call analysis on the binary file to obtain interaction information of the target program and an operating system;
an assembly language code obtaining unit 204, configured to disassemble the binary file based on the information in the binary file, the structure information, the control flow information, the type and scope information of the function variables of the target program, the memory distribution information, the operation information on the register, and the interaction information, to obtain an assembly language code of the binary file;
and a source code obtaining unit 205, configured to convert the assembly language code into a high-level language code, and perform structural optimization and grammar correction on the high-level language code to obtain the source code of the target program.
The implementation device 200 of the reverse engineering reduction source code provided in the foregoing embodiment may implement the technical solution described in the foregoing embodiment of the implementation method of the reverse engineering reduction source code, and the specific implementation principle of each module or unit may refer to the corresponding content in the foregoing embodiment of the implementation method of the reverse engineering reduction source code, which is not described herein again.
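The way the five units hand results to one another can be pictured with the following skeleton. It only illustrates the order of the units and the sharing of intermediate results; the class names `AnalysisContext` and `ReverseEngineeringPipeline`, the unit keys, and the placeholder lambdas (which return short descriptions instead of performing any analysis) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AnalysisContext:
    """Shared state passed between units; `results` is keyed by unit name."""
    binary: bytes
    results: dict = field(default_factory=dict)


class ReverseEngineeringPipeline:
    """Runs registered units in order, mirroring units 201 to 205."""

    def __init__(self):
        self._units = []

    def register(self, name, unit):
        self._units.append((name, unit))
        return self

    def run(self, binary):
        context = AnalysisContext(binary=binary)
        for name, unit in self._units:
            # Each unit may read the results of the units that ran before it.
            context.results[name] = unit(context)
        return context


# Placeholder units: each returns a short description instead of real analysis.
pipeline = (
    ReverseEngineeringPipeline()
    .register("information", lambda ctx: {"size": len(ctx.binary)})
    .register("static_analysis", lambda ctx: "structure, control flow, variables")
    .register("dynamic_analysis", lambda ctx: "memory, registers, system calls")
    .register("assembly", lambda ctx: f"disassembly of {ctx.results['information']['size']} bytes")
    .register("source", lambda ctx: "high-level code from: " + ctx.results["assembly"])
)
context = pipeline.run(b"\x55\x48\x89\xe5\xc3")
print(list(context.results))
```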
As shown in fig. 3, the present application further provides an electronic device 300 accordingly. The electronic device 300 comprises a processor 301, a memory 302 and a display 303. Fig. 3 shows only some of the components of the electronic device 300, but it should be understood that not all of the illustrated components are required to be implemented and that more or fewer components may be implemented instead.
In some embodiments, the processor 301 may be a central processing unit (CPU), a microprocessor, or another data processing chip, and is used to execute program code or process data stored in the memory 302, for example to run the program implementing the reverse engineering reduction source code method of the present application.
In some embodiments, the processor 301 may be a single server or a group of servers. The server group may be centralized or distributed. In some embodiments, the processor 301 may be local or remote. In some embodiments, the processor 301 may be implemented in a cloud platform. In an embodiment, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an inter-cloud, a multi-cloud, or the like, or any combination thereof.
In some embodiments, the memory 302 may be an internal storage unit of the electronic device 300, such as a hard disk or internal memory of the electronic device 300. In other embodiments, the memory 302 may be an external storage device of the electronic device 300, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the electronic device 300.
Further, the memory 302 may also include both an internal storage unit and an external storage device of the electronic device 300. The memory 302 is used for storing the application software installed on the electronic device 300 and various types of data.
In some embodiments, the display 303 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch display, or the like. The display 303 is used for displaying information processed by the electronic device 300 and for displaying a visual user interface. The components 301-303 of the electronic device 300 communicate with each other via a system bus.
In one embodiment, when the processor 301 executes an implementation program of reverse engineering restore source code in the memory 302, the following steps may be implemented:
acquiring a binary file of a target program, and extracting information in the binary file;
performing structural analysis on the binary file to obtain structural information of the target program, performing control flow analysis on the binary file to obtain control flow information of the target program, and performing variable analysis on the binary file to obtain type and scope information of function variables of the target program;
performing memory analysis on the binary file to obtain memory distribution information of the target program, performing register analysis on the binary file to obtain operation information of the target program on a register, and performing system call analysis on the binary file to obtain interaction information of the target program and an operating system;
disassembling the binary file based on the information in the binary file, the structure information, the control flow information, the type and scope information of the function variables of the target program, the memory distribution information, the operation information on the register, and the interaction information to obtain an assembly language code of the binary file;
and converting the assembly language code into a high-level language code, and performing structural optimization and grammar correction on the high-level language code to obtain the source code of the target program.
It should be understood that, when executing the program for implementing reverse engineering reduction source code in the memory 302, the processor 301 may perform other functions in addition to those above; for details, refer to the foregoing description of the corresponding method embodiments.
Further, the type of the electronic device 300 is not particularly limited; the electronic device 300 may be a portable electronic device such as a mobile phone, a tablet computer, a personal digital assistant (PDA), a wearable device, or a laptop computer. Exemplary embodiments of portable electronic devices include, but are not limited to, portable electronic devices running iOS, Android, Microsoft, or other operating systems. The portable electronic device may also be another portable electronic device, such as a laptop computer having a touch-sensitive surface (e.g., a touch panel). It should also be appreciated that in other embodiments of the application, the electronic device 300 may not be a portable electronic device but a desktop computer having a touch-sensitive surface (e.g., a touch panel).
Those skilled in the art will appreciate that all or part of the flow of the methods of the embodiments described above may be accomplished by a computer program instructing the associated hardware, and the program may be stored on a computer-readable storage medium. The computer-readable storage medium is a magnetic disk, an optical disk, a read-only memory, a random access memory, or the like.
The present application is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present application are intended to be included in the scope of the present application.

Claims (10)

1. The implementation method of reverse engineering reduction source codes is characterized by comprising the following steps:
acquiring a binary file of a target program, and extracting information in the binary file;
performing structural analysis on the binary file to obtain structural information of the target program, performing control flow analysis on the binary file to obtain control flow information of the target program, and performing variable analysis on the binary file to obtain type and scope information of function variables of the target program;
performing memory analysis on the binary file to obtain memory distribution information of the target program, performing register analysis on the binary file to obtain operation information of the target program on a register, and performing system call analysis on the binary file to obtain interaction information of the target program and an operating system;
disassembling the binary file based on the information in the binary file, the structure information, the control flow information, the type of the target program function variable, the scope information, the memory distribution information, the operation information on the register and the interaction information to obtain an assembly language code of the binary file;
and converting the assembly language code into a high-level language code, and performing structural optimization and grammar correction on the high-level language code to obtain the source code of the target program.
2. The method for implementing reverse engineering restoration source code according to claim 1, wherein the obtaining the binary file of the target program and extracting the information in the binary file comprise:
acquiring a binary file of the target program through a binary file analysis tool;
extracting an assembly instruction sequence of the binary file through a disassembly tool;
and obtaining information in the binary file according to the assembly instruction sequence of the binary file, wherein the information in the binary file comprises an entry point and a function address of a target program.
3. The method for implementing reverse engineering restoring source code according to claim 1, wherein the performing structural analysis on the binary file to obtain structural information of the target program includes:
dividing the binary file into a plurality of basic blocks;
performing instruction analysis on the basic blocks to obtain the types, operands and operation types of the basic block instructions;
and sequencing the basic blocks according to the type, the operand and the operation type of the basic block instruction to obtain the structure information of the target program.
4. The method for implementing reverse engineering reduction source code according to claim 1, wherein the performing control flow analysis on the binary file to obtain control flow information of the target program includes:
performing control flow analysis on the binary file to obtain a calling relationship, a conditional jump relationship and a loop branch relationship of the functions in the binary file;
adding the calling relationship, the conditional jump relationship and the loop branch relationship of the functions in the binary file into a control flow graph;
and obtaining control flow information of the target program based on the control flow graph.
5. The method for implementing reverse engineering restoring source code according to claim 3, wherein the performing memory analysis on the binary file to obtain the memory distribution information of the target program, performing register analysis on the binary file to obtain the operation information of the target program on a register, and performing system call analysis on the binary file to obtain the interaction information of the target program and an operating system includes:
acquiring the memory distribution information of the target program through a memory monitoring tool;
monitoring the access behavior of the target program on a register through a debugger, and acquiring the operation information of the target program on the register;
and acquiring the interaction information of the target program and the operating system through a program debugging tool.
6. The method for implementing reverse engineering restoring source code according to claim 2, wherein the disassembling the binary file based on the information in the binary file, the structure information, the control flow information, the type of the target program function variable and the scope information, the memory distribution information, the operation information on the register, and the interaction information to obtain the assembly language code of the binary file includes:
determining a starting point of code reconstruction according to the entry point and the function address of the target program;
performing code reconstruction based on the structure information, the control flow information, the type of the target program function variable, the scope information, the memory distribution information, the operation information on the register and the interaction information to obtain an assembly instruction;
restoring the pseudo instruction in the binary file;
obtaining the entry, the exit and the internal execution sequence of the function in the binary file based on the reconstructed starting point, the assembler instruction and the pseudo instruction;
and determining the assembly language code of the binary file according to the entry, the exit and the internal execution sequence of the function in the binary file.
7. The method of claim 1, further comprising testing the source code of the target program by an automated test tool so that the functions of the source code are consistent with those of the target program.
8. An implementation device for reverse engineering reduction source code is characterized by comprising:
a binary file information acquisition unit for acquiring a binary file of a target program and extracting information in the binary file;
the binary file static analysis unit is used for carrying out structural analysis on the binary file to obtain the structural information of the target program, carrying out control flow analysis on the binary file to obtain the control flow information of the target program, and carrying out variable analysis on the binary file to obtain the type and scope information of the function variables of the target program;
the binary file dynamic analysis unit is used for carrying out memory analysis on the binary file to obtain the memory distribution information of the target program, carrying out register analysis on the binary file to obtain the operation information of the target program on a register, and carrying out system call analysis on the binary file to obtain the interaction information of the target program and an operating system;
the assembly language code acquisition unit is used for disassembling the binary file based on the information in the binary file, the structure information, the control flow information, the type of the target program function variable, the scope information, the memory distribution information, the operation information on the register and the interaction information to obtain the assembly language code of the binary file;
the source code acquisition unit is used for converting the assembly language code into a high-level language code, and carrying out structural optimization and grammar correction on the high-level language code to obtain the source code of the target program.
9. An electronic device comprising a memory and a processor, wherein,
the memory is used for storing programs;
the processor is coupled to the memory for executing the program stored in the memory to implement the steps in a method for implementing reverse engineering reduction source code according to any one of claims 1 to 7.
10. A computer readable storage medium storing a computer readable program or instructions which when executed by a processor is capable of carrying out the steps of a method of implementing reverse engineering reduction source code according to any one of claims 1 to 7.
Application number: CN202310766247.3A. Priority date: 2023-06-27. Filing date: 2023-06-27. Title: Implementation method and device for reverse engineering reduction source code and electronic equipment. Status: Active. Publication: CN116501378B (en).


Publications (2)

Publication Number Publication Date
CN116501378A (en) 2023-07-28
CN116501378B (en) 2023-09-12

Family ID: 87330524






Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant