CN114528015B - Method for analyzing homology of binary executable file, computer device and storage medium - Google Patents
Method for analyzing homology of binary executable file, computer device and storage medium
- Publication number
- CN114528015B
- Authority
- CN
- China
- Prior art keywords
- function
- vector
- vectors
- binary executable
- executable file
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/70—Software maintenance or management
- G06F8/75—Structural analysis for program understanding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/53—Decompilation; Disassembly
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/70—Software maintenance or management
- G06F8/75—Structural analysis for program understanding
- G06F8/751—Code clone detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Software Systems (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Complex Calculations (AREA)
Abstract
The invention discloses a method for analyzing the homology of binary executable files, a computer device, and a storage medium. The method comprises: generating a function control flow graph corresponding to the binary executable file; generating corresponding instruction vectors and basic block vectors in a high-dimensional vector space; obtaining function vector matrices of the same dimension from a preset neural network model; and multiplying the function vector matrices, sorting the products in descending order to obtain similar function pairs, and searching with the similar function pairs to obtain a matching result. The invention applies a bidirectional multi-layer Transformer encoder from natural language processing to the semantic representation of assembly code, generates the instruction vectors and basic block vectors in a high-dimensional vector space of the same dimension, and combines the function control flow graph with the basic block vectors to obtain the corresponding function semantic vectors, thereby realizing homology analysis based on similar-function search. The method offers strong robustness, high running speed, and good test results.
Description
Technical Field
The invention relates to the technical field of computer software, and in particular to a method for analyzing the homology of binary executable files, a computer device, and a storage medium.
Background
Binary code homology detection is an important technique in software engineering and program security. It determines whether two given binary code fragments are similar and is widely used in software security analysis tasks such as vulnerability search, malicious code identification, patch analysis, and plagiarism detection. Existing approaches fall mainly into the following categories:
1. Approaches represented by the commercial tool BinDiff, which use features such as hashes of a function's raw bytes, the function call graph, the control flow graph structure, and strings, and match functions with a number of heuristic algorithms. These approaches mainly compute similarity through isomorphism or graph decomposition algorithms on control flow or data flow graphs and perform matching with the Hungarian algorithm. When the features they rely on shift, for example because of different compiler optimization options, the results can be severely affected.
2. Recent machine learning approaches, which match functions by feeding manually selected features into a graph neural network. These approaches require the binary features to be selected by hand, which demands a high level of expertise from the user; the selected features cover a limited range and struggle to capture the semantics of binary code fragments fully.
3. Approaches that apply natural language processing to binary code homology detection. These require no prior knowledge for feature selection; instead, an unsupervised learning algorithm automatically extracts the semantics of binary code fragments and generates semantic representations for homology detection. However, they apply natural language processing techniques to binary code directly, ignore the differences between binary code and natural language, and therefore introduce errors.
In view of this, developing a binary executable file homology analysis method, computer device, and storage medium with good test results, high running speed, and strong robustness is a technical problem that those skilled in the art urgently need to solve.
Disclosure of Invention
In order to solve the above technical problem, the present invention provides a method for analyzing the homology of a binary executable file, including the following steps:
S1, converting the binary executable file into assembly code by using a disassembly tool, and generating a function control flow graph corresponding to the binary executable file based on the assembly code;
S2, mapping the assembly code and the basic blocks of the function control flow graph to high-dimensional vectors by using a bidirectional multi-layer Transformer encoder from natural language processing to generate corresponding instruction vectors, and obtaining the basic block vectors of the corresponding function control flow graph based on the generated instruction vectors;
S3, generating corresponding function semantic vectors based on the basic block vectors of the function control flow graph in step S2, and inputting the function semantic vectors into a preset neural network model to obtain a function vector matrix of the function control flow graph in a high-dimensional vector space;
S4, performing matrix multiplication on the function vector matrices of the function control flow graphs, sorting the results in descending order to obtain similar function pairs, and searching with the similar function pairs to obtain a matching result.
In the method for analyzing the binary executable file homology, the disassembly tool is an IDA Pro disassembler.
In the method for analyzing the homology of the binary executable file, the basic block of the function control flow graph consists of an instruction sequence of the binary executable file.
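How a function's instruction sequence breaks into basic blocks and edges can be illustrated with a small sketch. It does not use IDA Pro or its API; the `Insn` record, its fields, and the toy instruction listing are hypothetical and exist only for this illustration, and the instructions are assumed to be sorted by address.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Insn:
    addr: int                       # address of the instruction
    text: str                       # disassembly text, e.g. "jne 0x0a"
    target: Optional[int] = None    # branch target inside the function, if any
    falls_through: bool = True      # False for an unconditional jmp or ret


def build_cfg(insns: List[Insn]) -> Tuple[Dict[int, List[Insn]], Set[Tuple[int, int]]]:
    """Split one function's instructions into basic blocks and connect them."""
    addr_set = {ins.addr for ins in insns}
    # A leader starts a block: the entry point, every branch target, and the
    # instruction that follows a branch or a non-fall-through instruction.
    leaders = {insns[0].addr}
    for idx, ins in enumerate(insns):
        if ins.target is not None or not ins.falls_through:
            if ins.target is not None and ins.target in addr_set:
                leaders.add(ins.target)
            if idx + 1 < len(insns):
                leaders.add(insns[idx + 1].addr)

    # Group consecutive instructions under the most recent leader.
    blocks: Dict[int, List[Insn]] = {}
    current = insns[0].addr
    for ins in insns:
        if ins.addr in leaders:
            current = ins.addr
            blocks[current] = []
        blocks[current].append(ins)

    # Add a taken-branch edge and/or a fall-through edge per block.
    edges: Set[Tuple[int, int]] = set()
    starts = sorted(blocks)
    for pos, start in enumerate(starts):
        last = blocks[start][-1]
        if last.target is not None and last.target in blocks:
            edges.add((start, last.target))
        if last.falls_through and pos + 1 < len(starts):
            edges.add((start, starts[pos + 1]))
    return blocks, edges


demo = [
    Insn(0x00, "cmp eax, 0"),
    Insn(0x03, "jne 0x0a", target=0x0A),
    Insn(0x05, "mov ebx, 1"),
    Insn(0x08, "jmp 0x0d", target=0x0D, falls_through=False),
    Insn(0x0A, "mov ebx, 2"),
    Insn(0x0D, "ret", falls_through=False),
]
blocks, edges = build_cfg(demo)
```

Each key of `blocks` is the start address of a basic block, and `edges` holds the directed control-flow edges between block start addresses.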
In the above method for analyzing the homology of the binary executable file, the high-dimensional vector is a 128-dimensional vector.
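The text does not state how instruction vectors are combined into a basic block vector. The following is a minimal sketch that assumes mean pooling, with a hypothetical `embed_instruction` stub standing in for the trained encoder: the stub only produces a deterministic pseudo-random 128-dimensional vector per instruction string and carries no learned semantics.

```python
import hashlib
from typing import List

import numpy as np

DIM = 128  # vector dimension stated in the text


def embed_instruction(text: str) -> np.ndarray:
    """Hypothetical stand-in for the trained encoder (not a real model)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "little")
    return np.random.default_rng(seed).standard_normal(DIM)


def block_vector(instructions: List[str]) -> np.ndarray:
    """Mean-pool instruction vectors into one block vector (assumed pooling)."""
    if not instructions:
        return np.zeros(DIM)
    return np.mean([embed_instruction(t) for t in instructions], axis=0)


block = ["push rbp", "mov rbp, rsp", "sub rsp, 0x10"]
vec = block_vector(block)
```

Whatever pooling is actually used, the point of the sketch is that instruction vectors and block vectors share one 128-dimensional space, so blocks of any length yield vectors of the same shape.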
In the above method for analyzing the homology of the binary executable file, the preset neural network model in step S3 is a neural network model trained with supervision, and the training process is as follows: first, a binary executable file with debugging information is compiled; the corresponding function names are obtained from the debugging information and used to generate a training set; the neural network model is then trained on this training set until the vectors mapped from functions that come from the same source code and have the same name are close to each other in the 128-dimensional vector space, thereby yielding the preset neural network model.
In the above method for analyzing the homology of the binary executable file, step S4 is implemented as follows: all vectors in the function vector matrix of each function control flow graph are unitized so that each vector has a modulus (length) of 1; the unitized function vector matrices are then multiplied; each column of the product matrix is sorted in descending order to obtain similar function pairs; and finally a search is performed with the generated similar function pairs to obtain a matching result.
In the method for analyzing the homology of the binary executable file, the matrix multiplication in step S4 is formulated as:
$$S = \hat{M}_i \cdot \hat{M}_j^{T} \tag{1}$$
In formula (1), $\hat{M}_i$ denotes the unitized function vector matrix corresponding to the $i$-th binary file, $\hat{M}_j$ denotes the unitized function vector matrix corresponding to the $j$-th binary file, $S$ denotes the result of multiplying the two unitized function vector matrices, and $T$ denotes matrix transposition.
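Why unitizing before multiplying yields a similarity score can be spelled out in one line. The symbols $u_p$, $v_q$, and $S_{pq}$ are introduced here for illustration only, and the derivation assumes that each function vector is stored as a row of its matrix: $u_p$ is the $p$-th function vector of the first file and $v_q$ the $q$-th function vector of the second.

```latex
\hat{u}_p = \frac{u_p}{\lVert u_p \rVert_2}, \qquad
\hat{v}_q = \frac{v_q}{\lVert v_q \rVert_2}, \qquad
S_{pq} = \hat{u}_p \, \hat{v}_q^{T}
       = \frac{u_p \, v_q^{T}}{\lVert u_p \rVert_2 \, \lVert v_q \rVert_2}
       = \cos\theta_{pq} \in [-1, 1]
```

Under that assumption, every entry of the product matrix is the cosine of the angle between one function vector from each file, so sorting a column in descending order ranks the functions of the first file by similarity to one function of the second.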
The invention also provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the above binary executable file homology analysis method when executing the computer program.
The present invention also provides a computer readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps of the above-mentioned binary executable file homology analysis method.
Compared with the prior art, the invention applies a bidirectional multi-layer Transformer encoder from natural language processing to the semantic representation of assembly code, generates the corresponding instruction vectors and basic block vectors in a high-dimensional vector space of the same dimension, uses a preset neural network model to combine the function control flow graph with the basic block vectors into the function semantic vectors corresponding to the binary executable file, obtains similar function pairs through matrix multiplication, and realizes homology analysis of the binary executable file by searching with the similar function pairs.
Drawings
FIG. 1 is a flow chart of a method for homology analysis of binary executables in the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.
FIG. 1 shows the specific flow of the binary executable file homology analysis method.
In one embodiment, a method for performing homology analysis on a binary executable file comprises the following steps:
S1, converting the binary executable file into assembly code by using a disassembly tool, and generating a function control flow graph (CFG) corresponding to the binary executable file based on the assembly code, wherein the disassembly tool is the IDA Pro disassembler (Interactive Disassembler Professional, a typical recursive-descent disassembler);
S2, mapping the assembly code and the basic blocks of the function control flow graph to high-dimensional vectors by using a bidirectional multi-layer Transformer encoder from natural language processing to generate corresponding instruction vectors, and obtaining the basic block vectors of the corresponding function control flow graph based on the generated instruction vectors, wherein each basic block of the function control flow graph consists of an instruction sequence of the binary executable file, and the high-dimensional vectors are 128-dimensional vectors;
S3, generating corresponding function semantic vectors based on the basic block vectors of the function control flow graph in step S2, and inputting the function semantic vectors into a preset neural network model to obtain a function vector matrix of the function control flow graph in a high-dimensional vector space;
In this step, the preset neural network model is used directly without retraining, which effectively increases running speed. The preset neural network model is trained with supervision, and the training process is as follows: first, a binary executable file with debugging information is compiled (using the -g compile option); the corresponding function names are obtained from the debugging information and used to generate a training set; the neural network model is then trained on this training set until the vectors mapped from functions that come from the same source code and have the same name are close to each other in the 128-dimensional vector space, thereby yielding the preset neural network model.
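One way to turn function names recovered from debugging information into supervised training labels is sketched below. The text only states that same-named functions from the same source code should map to nearby vectors; pairing functions across differently optimized builds, the +1/-1 labels, and every build label, function name, and address in the sample are assumptions made for illustration.

```python
from itertools import combinations
from typing import List, Tuple

# Hypothetical record: (build label, function name from debug info, function id)
Record = Tuple[str, str, str]


def make_pairs(records: List[Record]) -> List[Tuple[str, str, int]]:
    """Label a cross-build pair +1 if the names match, -1 otherwise."""
    pairs = []
    for (b1, n1, f1), (b2, n2, f2) in combinations(records, 2):
        if b1 == b2:
            continue  # only pair functions that come from different builds
        pairs.append((f1, f2, 1 if n1 == n2 else -1))
    return pairs


records = [
    ("gcc-O0", "parse_header", "a:0x401000"),
    ("gcc-O0", "checksum", "a:0x401200"),
    ("gcc-O2", "parse_header", "b:0x400f80"),
    ("gcc-O2", "checksum", "b:0x401100"),
]
pairs = make_pairs(records)
```

A training loop could then pull the vectors of +1 pairs together and push those of -1 pairs apart; the loss function itself is not specified in the text and is left out of the sketch.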
It should be noted that, in this step, after the function semantic vectors are input into the preset neural network, the network maps each function control flow graph, together with its basic block vectors, to a numerical vector in a high-dimensional vector space of the same dimension, yielding the function vector matrix of the function control flow graph in that dimension.
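The architecture of the preset neural network is not given in the text. As a rough illustration of how a control flow graph carrying basic block vectors can be reduced to a single fixed-size function vector, the sketch below assumes a simple message-passing scheme followed by sum pooling. The update rule, the number of rounds, and the random (untrained) weight matrices are all assumptions.

```python
import numpy as np


def function_vector(block_vecs: np.ndarray, adj: np.ndarray,
                    w_self: np.ndarray, w_nbr: np.ndarray,
                    rounds: int = 2) -> np.ndarray:
    """Aggregate basic-block vectors over the CFG into one function vector."""
    h = block_vecs
    for _ in range(rounds):
        # Each block mixes its own state with the sum of its predecessors' states.
        h = np.tanh(h @ w_self + adj @ h @ w_nbr)
    return h.sum(axis=0)  # sum pooling: independent of block ordering


rng = np.random.default_rng(0)
dim = 128
blocks = rng.standard_normal((4, dim))        # four basic-block vectors
adj = np.zeros((4, 4))
for src, dst in [(0, 1), (0, 2), (1, 3), (2, 3)]:
    adj[dst, src] = 1.0                       # dst receives a message from src
w_self = rng.standard_normal((dim, dim)) / np.sqrt(dim)
w_nbr = rng.standard_normal((dim, dim)) / np.sqrt(dim)
fvec = function_vector(blocks, adj, w_self, w_nbr)
```

Stacking one such vector per function, row by row, gives a function vector matrix for the file; functions with different numbers of basic blocks still produce vectors of the same dimension.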
S4, performing matrix multiplication on the function vector matrices of the function control flow graphs, sorting the results in descending order to obtain similar function pairs, and searching with the similar function pairs to obtain a matching result, specifically: all vectors in the function vector matrix of each function control flow graph are unitized so that each vector has a modulus (length) of 1; the unitized function vector matrices are then multiplied; each column of the product matrix is sorted in descending order to obtain similar function pairs; and finally a search is performed with the generated similar function pairs to obtain a matching result.
In this step, the matrix multiplication is expressed by the formula:
$$S = \hat{M}_i \cdot \hat{M}_j^{T} \tag{1}$$
In formula (1), $\hat{M}_i$ denotes the unitized function vector matrix corresponding to the $i$-th binary file, $\hat{M}_j$ denotes the unitized function vector matrix corresponding to the $j$-th binary file, $S$ denotes the result of multiplying the two unitized function vector matrices, and $T$ denotes matrix transposition;
For example, given two binary executable files a and b, $M_a$ is the vector matrix composed of the vectors of all functions in binary executable file a, and $M_b$ is the vector matrix composed of the vectors of all functions in binary executable file b. The vectors in $M_a$ and $M_b$ are unitized (unitization means dividing the value of each dimension of a vector by the modulus of the vector, so that the modulus of the unitized vector is 1), giving $\hat{M}_a$ and $\hat{M}_b$. The unitized matrices $\hat{M}_a$ and $\hat{M}_b$ are then multiplied to obtain the product matrix $S$. Each column of $S$ is sorted in descending order, that is, from highest to lowest function similarity; the corresponding functions are then looked up by their indices to obtain similar function pairs, and searching with the similar function pairs yields the matching result.
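The unitize, multiply, and sort procedure of this example can be sketched with NumPy. The sketch assumes each function vector is a row of its matrix, and it uses three-dimensional toy vectors rather than 128-dimensional ones; the function names and values are invented for illustration.

```python
import numpy as np


def unitize(m: np.ndarray) -> np.ndarray:
    """Divide each row (one function vector) by its Euclidean norm."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    return m / np.where(norms == 0, 1.0, norms)  # leave all-zero rows as zeros


def rank_similar(m_a: np.ndarray, m_b: np.ndarray):
    """Return the similarity matrix and, per column, row indices sorted high to low."""
    s = unitize(m_a) @ unitize(m_b).T   # s[p, q]: cosine similarity of a's p and b's q
    order = np.argsort(-s, axis=0)      # sort each column from large to small
    return s, order


names_a = ["init", "parse", "send"]     # functions of file a (made-up names)
names_b = ["f1", "f2"]                  # functions of file b (made-up names)
m_a = np.array([[1.0, 0.0, 0.0],
                [0.0, 2.0, 0.0],
                [0.0, 1.0, 1.0]])
m_b = np.array([[0.0, 3.0, 0.1],
                [4.0, 0.0, 0.0]])
s, order = rank_similar(m_a, m_b)
# The top entry of each column gives the most similar function in file a.
pairs = [(names_a[order[0, q]], names_b[q]) for q in range(len(names_b))]
```

Here `pairs` evaluates to `[("parse", "f1"), ("init", "f2")]`. Keeping more than the first row of `order` gives a ranked candidate list per function instead of a single best match.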
In summary, in this embodiment the binary executable file homology analysis method applies a bidirectional multi-layer Transformer encoder from natural language processing to the semantic representation of assembly code, generates the corresponding instruction vectors and basic block vectors in a high-dimensional vector space of the same dimension, uses a preset neural network model to combine the function control flow graph with the basic block vectors into the function semantic vectors corresponding to the binary executable file, obtains similar function pairs through matrix multiplication, and realizes homology analysis of the binary executable file by searching with the similar function pairs.
In another aspect, the computer device provided in this embodiment comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the binary executable file homology analysis method when executing the computer program. The computer device may be any computer device capable of performing binary executable file homology analysis, such as a mobile phone, a tablet, or a mobile computer.
Finally, the present embodiment provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the above-mentioned binary executable file homology analysis method.
Since the computer device and the computer-readable storage medium both include the binary executable file homology analysis method, the computer device and the computer-readable storage medium also have the beneficial effects of the binary executable file homology analysis method, and are not described herein again.
The above description details a binary executable file homology analysis method, a computer device, and a storage medium provided by the present invention. The principles and embodiments of the present invention are explained herein using specific examples, which are presented only to assist in understanding the core concepts of the present invention. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.
Claims (5)
1. A method for homology analysis of binary executable files, the method comprising the steps of:
S1, converting the binary executable file into assembly code by using a disassembly tool, and generating a function control flow graph corresponding to the binary executable file based on the assembly code;
S2, mapping the assembly code and the basic blocks of the function control flow graph to high-dimensional vectors by using a bidirectional multi-layer Transformer encoder from natural language processing to generate corresponding instruction vectors, and obtaining the basic block vectors of the corresponding function control flow graph based on the generated instruction vectors;
S3, generating corresponding function semantic vectors based on the basic block vectors of the function control flow graph in step S2, and inputting the function semantic vectors into a preset neural network model to obtain a function vector matrix of the function control flow graph in a high-dimensional vector space;
S4, performing matrix multiplication on the function vector matrices of the function control flow graphs, sorting the results in descending order to obtain similar function pairs, and searching with the similar function pairs to obtain a matching result,
the disassembly tool is an IDA Pro disassembler,
the basic blocks of the function control flow graph are composed of sequences of instructions of a binary executable file,
the high-dimensional vector is a 128-dimensional vector,
the method is characterized in that the preset neural network model in the step S3 is a neural network model with supervised training, and the training process is as follows: firstly, compiling and generating a binary executable file with debugging information, obtaining corresponding function names according to the debugging information of the binary executable file and generating a training set, and training a neural network model by using the generated training set until vectors obtained by mapping functions from the same source code and having the same name are close to each other in a 128-dimensional vector space, thereby obtaining a preset neural network model.
2. The method for homology analysis of binary executable files according to claim 1, wherein the step S4 is implemented in a specific manner as follows: unitizing all vectors in a function vector matrix of each function control flow graph to enable the modular length of the corresponding vector to be 1, then performing matrix multiplication on the function vector matrix after unitization, sequencing each column in the matrix after multiplication from large to small to obtain a similar function pair, and finally searching by using the generated similar function pair to obtain a matching result.
3. The binary executable file homology analysis method according to claim 2, wherein the matrix multiplication in the step S4 is formulated as:
$$S = \hat{M}_i \cdot \hat{M}_j^{T} \tag{1}$$
In formula (1), $\hat{M}_i$ denotes the unitized function vector matrix corresponding to the $i$-th binary file, $\hat{M}_j$ denotes the unitized function vector matrix corresponding to the $j$-th binary file, $S$ denotes the result of multiplying the two unitized function vector matrices, and $T$ denotes matrix transposition.
4. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the method for homology analysis of binary executable files according to any one of claims 1 to 3 are implemented when the computer program is executed by the processor.
5. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method for homology analysis of binary executable files according to any one of claims 1 to 3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210434518.0A CN114528015B (en) | 2022-04-24 | 2022-04-24 | Method for analyzing homology of binary executable file, computer device and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210434518.0A CN114528015B (en) | 2022-04-24 | 2022-04-24 | Method for analyzing homology of binary executable file, computer device and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114528015A (en) | 2022-05-24 |
CN114528015B (en) | 2022-07-29 |
Family
ID=81628207
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210434518.0A Active CN114528015B (en) | 2022-04-24 | 2022-04-24 | Method for analyzing homology of binary executable file, computer device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114528015B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117473494A (en) * | 2023-06-06 | 2024-01-30 | 兴华永恒(北京)科技有限责任公司 | Method and device for determining homologous binary files, electronic equipment and storage medium |
CN116501378B (en) * | 2023-06-27 | 2023-09-12 | 武汉大数据产业发展有限公司 | Implementation method and device for reverse engineering reduction source code and electronic equipment |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101315599A (en) * | 2007-05-29 | 2008-12-03 | 北京航空航天大学 | Method and device for detecting similarity of source codes |
CN112613040A (en) * | 2020-12-14 | 2021-04-06 | 中国科学院信息工程研究所 | Vulnerability detection method based on binary program and related equipment |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10474455B2 (en) * | 2017-09-08 | 2019-11-12 | Devfactory Fz-Llc | Automating identification of code snippets for library suggestion models |
CN108416768B (en) * | 2018-03-01 | 2021-05-25 | 南开大学 | Binary-based foreground image similarity evaluation method |
CN110135157B (en) * | 2019-04-04 | 2021-04-09 | 国家计算机网络与信息安全管理中心 | Malicious software homology analysis method and system, electronic device and storage medium |
CN111159223B (en) * | 2019-12-31 | 2021-09-03 | 武汉大学 | Interactive code searching method and device based on structured embedding |
CN112163226B (en) * | 2020-11-30 | 2021-02-26 | 中国人民解放军国防科技大学 | Binary function similarity detection method based on graph automatic encoder |
CN113254934B (en) * | 2021-06-29 | 2021-09-24 | 湖南大学 | Binary code similarity detection method and system based on graph matching network |
CN113887215A (en) * | 2021-10-18 | 2022-01-04 | 平安科技(深圳)有限公司 | Text similarity calculation method and device, electronic equipment and storage medium |
CN114115894A (en) * | 2021-11-22 | 2022-03-01 | 中国工程物理研究院计算机应用研究所 | Cross-platform binary code similarity detection method based on semantic space alignment |
Also Published As
Publication number | Publication date |
---|---|
CN114528015A (en) | 2022-05-24 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |