CN113010209A - Binary code similarity comparison technology for resisting compiling difference - Google Patents


Info

Publication number
CN113010209A
CN113010209A (application CN202011117765.5A)
Authority
CN
China
Prior art keywords
graph
vector
function
embedding
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011117765.5A
Other languages
Chinese (zh)
Inventor
刘嘉勇
王炎
贾鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN202011117765.5A priority Critical patent/CN113010209A/en
Publication of CN113010209A publication Critical patent/CN113010209A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 - Arrangements for software engineering
    • G06F 8/70 - Software maintenance or management
    • G06F 8/75 - Structural analysis for program understanding
    • G06F 8/751 - Code clone detection
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 - Arrangements for software engineering
    • G06F 8/40 - Transformation of program code
    • G06F 8/41 - Compilation
    • G06F 8/42 - Syntactic analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 - Arrangements for software engineering
    • G06F 8/40 - Transformation of program code
    • G06F 8/41 - Compilation
    • G06F 8/43 - Checking; Contextual analysis
    • G06F 8/436 - Semantic checking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of binary code similarity detection and to graph convolutional neural networks, and aims to provide a binary code similarity comparison technique that resists compilation differences. The core of the technique is to convert a binary function into a graph network and to learn the semantic, syntactic and structural information of the function with a graph embedding network, so as to generate a graph embedding vector representation of the function. The workflow is to convert a binary function into an attribute control flow graph and then extract the syntactic and semantic features of each node. A graph embedding network is trained separately on the syntax attribute control flow graph and the semantic attribute control flow graph of the function, finally generating a graph embedding representation of the function. An attention mechanism is applied during embedding vector aggregation at the instruction, basic block and function levels, respectively. Finally, the generated embedding vectors are used for similarity comparison across compilers, optimization levels, program versions and obfuscated code. The technique provides a new solution for binary code similarity detection.

Description

Binary code similarity comparison technology for resisting compiling difference
Technical Field
The invention relates to the technical field of binary code similarity detection and to graph convolutional neural networks. Its core is to combine the syntactic, semantic and structural features of a function: an attention-based graph convolutional neural network produces a syntax graph embedding vector and a semantic graph embedding vector of the function, a multi-layer attention mechanism aggregates them into the final function graph embedding vector, and similarity detection across compilers, optimization levels, program versions and obfuscated code is performed on the basis of this vector.
Background
Code similarity detection compares two or more code segments to determine whether similar code exists between them, and can be divided into source code similarity detection and binary code similarity detection. Binary code lacks much of the symbolic information present in source code, and the use of different compilers, different optimization levels, different program versions and obfuscation techniques makes binary code similarity detection considerably harder. Binary code similarity detection can be applied in many fields, such as code clone detection, cryptographic function identification, malicious code detection, malicious code family classification, vulnerability search and security patch analysis. Most traditional methods are based on fuzzy hash matching, minimum subgraph matching, or techniques such as symbolic execution and taint analysis. In recent years, machine learning-based methods have been proposed for similarity detection, which mainly learn the syntactic, semantic and structural information of a function to construct a high-dimensional vector representation. However, existing detection techniques still have the following problems.
Firstly, the information extracted from a binary function is not comprehensive enough; detection that uses only the syntactic information or only the semantic information of a function yields low precision.
Secondly, existing assembly instruction normalization methods suffer from a serious Out-of-Vocabulary (OOV) problem.
Thirdly, existing methods have low robustness and cannot effectively resist the interference introduced by cross-version, cross-compiler, cross-optimization-level, cross-program and obfuscation differences.
At present, binary code checks that assume the same compiler version, the same optimization level, the same program version and no obfuscation cannot meet current similarity detection requirements; in particular, the detection precision of prior techniques is low when the code is protected by obfuscation. A new method is needed that can overcome these differences under the many influencing factors present today, so as to effectively improve the precision of binary code similarity detection.
Disclosure of Invention
The invention provides a binary code similarity comparison technique that resists compilation differences, devised to solve the problems encountered by the prior art in binary code similarity detection. The invention addresses the low detection precision of existing methods under the large differences introduced by different compilers, different optimization levels, different versions and obfuscation techniques, and provides a binary function embedding generation technique based on a hierarchical attention graph neural network that effectively resists the influence of these difference factors and improves detection precision. The method offers a new detection approach: a new word vector generation model is constructed by combining the multi-dimensional features of a binary function, and graph convolutional neural network technology preserves more of the function's hidden information, so that the influence of the various compilation differences on detection accuracy is effectively resisted. The method can be widely used in various binary code similarity detection scenarios, and compared with traditional methods its detection precision under large differences is significantly improved.
To achieve the above object, the invention provides a binary code similarity comparison technique that resists compilation differences. It extracts syntactic, semantic and structural features from a given binary function, generates basic block embedding vectors based on a pre-trained word vector model, trains a graph convolutional neural network on the attribute control flow graph of the function to generate the final function embedding vector representation, and uses that embedding vector for function similarity detection. The invention proposes a hierarchical attention graph embedding network, reflecting the fact that different instructions, different basic blocks, and the syntactic and semantic information of a function have different importance for similarity comparison. The network contains three levels of attention: instruction level, basic block level and function level. The technical framework comprises five modules: feature extraction, an instruction embedding generation model, a basic block embedding vector generation model, a graph embedding network, and a similarity comparison model.
The feature extraction module extracts the syntactic features, semantic features and control flow graph of the function and normalizes the extracted features. The instruction embedding generation model trains an embedding vector for each word of the normalized assembly instructions and generates an embedding vector representation of each instruction. The basic block embedding vector generation module uses the syntactic and semantic features of each basic block to generate a basic block syntax embedding vector and a basic block semantic embedding vector, respectively, where the semantic embedding representation of a basic block is generated by aggregating instruction embedding vectors with an attention mechanism. The graph embedding network module trains on the function's syntax attribute control flow graph and semantic attribute control flow graph, generates the function's syntax embedding vector and semantic embedding vector using a basic-block-level attention mechanism, and finally generates the function embedding vector by aggregating the syntax and semantic embedding vectors with a function-level attention mechanism. The similarity comparison module computes the similarity of the two generated function graph embeddings to obtain the final similarity result.
Drawings
The objects, implementations, advantages and features of the present invention will be more clearly understood from the following detailed description taken in conjunction with the accompanying drawings.
FIG. 1 is an architectural diagram showing the overall structure of the detection technique of the present invention.
FIG. 2 is a diagram illustrating an example of instruction normalization in the detection technique of the invention.
FIG. 3 is a flow diagram illustrating a word embedding training model in the detection technique of the present invention.
FIG. 4 is a flow chart illustrating the basic block embedding vector generation process in the detection technique of the present invention.
FIG. 5 is an architectural diagram illustrating a graph embedding a network model in the detection technique of the present invention.
FIG. 6 is a flow chart illustrating attention-based embedded vector aggregation in the detection technique of the present invention.
FIG. 7 is a flow chart illustrating a comparison of binary function similarity in the detection technique of the present invention.
Detailed Description
The similarity detection technique can be used for cross-compiler, cross-optimization-level and cross-version code clone checking, and can also be applied in scenarios where different code obfuscation techniques are used. The invention is further described below with reference to the accompanying drawings. The invention aims to provide a binary function similarity comparison technique for a variety of complex compilation scenarios: based on the information contained in a binary function, graph embedding network training yields a high-precision vector representation that can be used effectively for scenarios such as code clone detection, vulnerability search and malicious code family classification.
FIG. 1 is an architectural diagram illustrating the technique of the present invention.
As shown in fig. 1, the detection technique first extracts the attribute control flow graph of a function: each basic block is a node, the jump relationships between basic blocks are the edges between nodes, and the attributes are the assembly instructions of the basic blocks. After the attribute control flow graph is extracted, two types of features are extracted for each basic block: syntactic features and semantic features. The syntactic features mainly count statistical and structural properties of the instructions in each basic block. For the semantic features, the assembly instructions are treated as text, and a vector representation of each instruction is obtained using natural language processing techniques. The extracted syntactic and semantic features are then input into the basic block embedding generation model to obtain the syntax embedding representation and the semantic embedding representation of each basic block in the attribute control flow graph. At this point each function has a syntax attribute control flow graph and a semantic attribute control flow graph; these two graphs are input into an attention-based graph embedding network and trained to obtain the function's semantic graph embedding vector and syntax graph embedding vector, and finally an attention mechanism aggregates the two vectors into the final function graph embedding representation. The detection technique computes the distance between function embedding representations with a Siamese network to represent the similarity between functions.
FIG. 2 is a schematic diagram illustrating the normalization process of an assembly instruction.
As shown in fig. 2, an assembly instruction is mainly composed of an opcode and operands. The number of opcodes on each platform is not large, but the number of possible operand combinations is large. Even with the same compiler, the same optimization level and the same source code, compiling at different times can result in operands using different registers. To avoid these differences more effectively, the operands of assembly instructions must be normalized. The normalization scheme adopted by the invention covers three cases. If the operand is of register type, it is normalized to 'reg1', 'reg2', 'reg3' or 'reg4' according to the size of the register. If the operand is of immediate type, it is checked whether the immediate is a string: if so, it is represented by 'STR', otherwise by 'HIMM'. If the operand is of memory type, it is checked whether it is addressed via a base register: if not, it is replaced by '[MEM]'; if it is addressed via a base register without an index, it is represented as '[normalized register + HIMM]'; otherwise it is represented as '[normalized register + index + HIMM]'.
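The three rules above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the register table is a small assumed x86-64 subset, and `string_imms` (the set of immediates known to be strings) is a hypothetical input.

```python
import re

# Assumed, illustrative register-size table (x86-64 subset).
REG_SIZES = {"rax": 8, "rbx": 8, "rbp": 8, "rsp": 8,
             "eax": 4, "ebx": 4, "ax": 2, "al": 1}
SIZE_TO_TOKEN = {8: "reg1", 4: "reg2", 2: "reg3", 1: "reg4"}

def normalize_operand(op: str, string_imms: set) -> str:
    """Normalize one operand following the three rules described above."""
    op = op.strip()
    if op in REG_SIZES:                          # rule 1: register type
        return SIZE_TO_TOKEN[REG_SIZES[op]]
    if op.startswith("[") and op.endswith("]"):  # rule 3: memory type
        parts = [p.strip() for p in re.split(r"[+*]", op[1:-1])]
        regs = [SIZE_TO_TOKEN[REG_SIZES[p]] for p in parts if p in REG_SIZES]
        if not regs:
            return "[MEM]"                       # no base register
        if len(regs) == 1:
            return f"[{regs[0]}+HIMM]"           # base register only
        return f"[{regs[0]}+{regs[1]}+HIMM]"     # base register + index
    # rule 2: immediate type
    return "STR" if op in string_imms else "HIMM"
```

For example, `normalize_operand("rax", set())` yields `"reg1"` and `normalize_operand("[rbp+8]", set())` yields `"[reg1+HIMM]"` under this assumed register table.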
Fig. 3 is a diagram describing the word embedding model.
After the assembly instructions are normalized, the opcodes and operands are used as tokens for word vector training. To construct a training corpus whose tokens carry more semantic information, the invention performs random walks on the inter-procedural function call graph and combines the tokens along each walk path into a sentence. The entire binary file is then used as a document for word vector training; the training process is illustrated in FIG. 3. When the vector of a target token is trained, the previous and next instructions of the target instruction are taken as context. If the target instruction is the first instruction, only the next instruction is taken; if it is the last instruction, only the previous instruction is taken. Each instruction is processed as follows: every token has an initialization vector; the average of the token vectors in the instruction is taken first, then the opcode token vector is divided by the number of tokens in the instruction and subtracted from that average. The result is concatenated with the opcode token vector to obtain the vector representation of the instruction. The vectors of the previous and next instructions of the target instruction are averaged, and NCE loss is then used to compute the vector representation of the tokens in the target instruction. Repeating these steps finally yields a vector representation of each token.
Considering that opcode tokens differ in importance, the invention obtains the importance of each opcode token with TF-IDF, multiplies the opcode token vector by this importance parameter, and concatenates the result with the average of the operand token vectors to obtain the vector representation of each assembly instruction.
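A sketch of this weighting step, assuming the weighted opcode vector is concatenated with the mean of the operand vectors (the TF-IDF weight itself would come from a separately fitted model):

```python
import numpy as np

def instruction_vector(opcode_vec, operand_vecs, tfidf_weight):
    """Combine the TF-IDF-weighted opcode embedding with the mean of the
    operand embeddings to form one instruction vector (sketch)."""
    weighted = tfidf_weight * np.asarray(opcode_vec)
    if operand_vecs:
        operand_part = np.mean(np.stack(operand_vecs), axis=0)
    else:
        operand_part = np.zeros_like(weighted)   # e.g. `ret` has no operands
    return np.concatenate([weighted, operand_part])
```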
Fig. 4 is a flow chart describing the basic block embedding vector generation process.
As shown in fig. 4, the two types of features extracted from a basic block are processed in different ways to obtain the corresponding basic block embedding vector representations. The syntactic features count, for each basic block, the number of constants, the number of strings, the number of transfer instructions, the number of call instructions, the number of instructions in the basic block, the number of arithmetic instructions, the number of successor nodes, and the betweenness centrality. The values of these 8 features are combined into a 1×8 array, which is used directly as the syntax embedding representation of the basic block. For the semantic features, each instruction is first normalized according to the normalization rules to obtain normalized tokens. The normalized tokens of the basic block are input into the instruction embedding model to obtain the embedding vector representation of each instruction. The instructions in a basic block differ in importance; to highlight the effect of important instructions and weaken the contribution of others, the invention uses an attention mechanism to aggregate the instruction vectors of the basic block. The aggregation yields the semantic embedding representation of the basic block.
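The attention-based aggregation of instruction vectors can be sketched as a weighted sum. The patent names the mechanism but not its exact form, so the dot-product scoring against a learned `query` vector below is an assumption:

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - np.max(scores))  # subtract max for numerical stability
    return e / e.sum()

def attention_aggregate(vectors, query):
    """Weighted sum of instruction vectors; each instruction's weight is its
    softmax-normalized dot product with a learned query vector (assumed form)."""
    V = np.stack(vectors)          # (n_instructions, d)
    weights = softmax(V @ query)   # (n_instructions,)
    return weights @ V             # (d,) basic-block semantic embedding
```

The same routine serves at any level of the hierarchy: instructions into a basic block here, and (with different learned parameters) basic blocks into a function later.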
FIG. 5 is an architectural diagram depicting the graph embedding network model.
As shown in FIG. 5, the invention uses a struc2vec graph embedding network as the training network, but improves on the basic network. After T rounds of iteration, the basic struc2vec network aggregates node embeddings by summation. The invention instead takes into account the difference in importance between basic blocks and introduces an attention mechanism: in the improved struc2vec network, aggregation after T rounds of iteration is performed through attention, so that important basic blocks receive more weight in the aggregate.
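The iterative part of a structure2vec-style network can be sketched as below. The weight matrices `W1`, `W2`, the `tanh` nonlinearity and the iteration count are assumptions for illustration; the patent's improvement replaces the final summation over rows with the attention aggregation described above.

```python
import numpy as np

def struc2vec_embed(X, adj, W1, W2, T=5):
    """structure2vec-style iteration (sketch): for T rounds, each node's
    embedding is refreshed from its own features and the sum of its
    neighbours' current embeddings.
    X:   (n_nodes, d_in) basic-block feature vectors
    adj: (n_nodes, n_nodes) adjacency matrix of the control flow graph
    W1:  (d_out, d_in), W2: (d_out, d_out) assumed learned weights."""
    mu = np.zeros((X.shape[0], W1.shape[0]))   # node embeddings start at zero
    for _ in range(T):
        mu = np.tanh(X @ W1.T + adj @ mu @ W2.T)
    return mu  # per-node embeddings, to be aggregated (here: by attention)
```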
FIG. 6 is a flow chart describing embedded vector aggregation based on an attention mechanism.
As shown in FIG. 6, after graph embedding network training, each attribute control flow graph can be represented by a graph embedding vector. In the invention, a syntax attribute control flow graph and a semantic attribute control flow graph are extracted from each function, and a function syntax graph embedding vector and a function semantic graph embedding vector are generated correspondingly. To aggregate the two vector representations more effectively and reflect their different contributions to the function graph embedding, the invention aggregates them with an attention mechanism. First, the function semantic graph embedding is concatenated with the function syntax graph embedding, and the semantic graph embedding is also concatenated with itself; parameters are then introduced to compute the attention weights. The attention weights are multiplied by the corresponding embedding vectors and summed to obtain the final function graph embedding representation.
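This two-vector fusion can be sketched as follows, mirroring the pairing in the text. The scoring vector `w` stands in for the introduced parameters and is an assumption; the patent does not give the exact scoring function.

```python
import numpy as np

def fuse_graph_embeddings(sem, syn, w):
    """Attention fusion of the semantic and syntax graph embeddings:
    score [sem; sem] and [sem; syn] with an assumed learned vector w,
    softmax the scores, and take the weighted sum of the two embeddings."""
    scores = np.array([np.concatenate([sem, sem]) @ w,
                       np.concatenate([sem, syn]) @ w])
    e = np.exp(scores - scores.max())
    a = e / e.sum()                 # attention weights for (semantic, syntax)
    return a[0] * sem + a[1] * syn  # final function graph embedding
```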
FIG. 7 is a flow chart of binary function similarity comparison.
As shown in fig. 7, the invention adopts a Siamese architecture for similarity detection. To use this architecture, the function graph embedding generation network is duplicated into two copies that share parameters, and a pair of binary functions is given as input. After passing through the graph embedding network, each function is represented by a function graph embedding vector, and the cosine distance between the two vectors is computed to obtain the final similarity.
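The final comparison reduces to cosine similarity between the two embeddings produced by the twin branches:

```python
import numpy as np

def cosine_similarity(u, v):
    """Similarity score between two function graph embedding vectors:
    1.0 for identical directions, 0.0 for orthogonal vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```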
As described above, the invention performs binary code similarity detection by generating a graph embedding representation of a binary function, with the following advantages: 1. Features of different dimensions of the function are extracted, improving detection precision. 2. Compared with existing methods, the proposed normalization method for assembly instructions is more effective at avoiding the OOV problem. 3. Attention mechanisms are introduced at three points: instruction aggregation, basic block aggregation and function embedding aggregation. This lets the more important vectors have a greater impact on the final similarity comparison during aggregation. 4. The detection technique effectively improves similarity detection precision across different compilers, different optimization levels, different compilation platforms and obfuscated code scenarios.
Although the preferred embodiments of the present invention have been described for illustrative purposes, those skilled in the art will appreciate that various modifications, additions and substitutions are possible, without departing from the scope and spirit of the invention as disclosed in the accompanying claims.

Claims (7)

1. A binary code similarity comparison technique against compile differences, the method comprising the steps of:
A. disassembling the binary file, constructing a binary function attribute control flow graph, and extracting the grammar and semantic information of the basic block;
B. constructing a basic block embedded vector generation model, and converting the grammar and semantic information of the extracted basic block into a numerical vector;
C. respectively training a grammatical attribute control flow graph and a semantic attribute control flow graph of a function by using a graph embedding network to generate a function grammatical graph embedding vector and a function semantic graph embedding vector;
D. embedding vectors by using a grammar graph of the attention mechanism aggregation function and embedding vectors by using a semantic graph to generate final function graph embedding;
E. and performing similarity detection between functions based on the generated graph embedding vector.
2. The binary code similarity comparison technique against compile differences according to claim 1, wherein said step a further comprises the steps of:
a1, the extracted grammatical features include 8 types: the number of constants, the number of character strings, the number of transfer instructions, the number of call instructions, the number of basic block instructions, the number of arithmetic instructions, the number of subsequent nodes and the centrality of betweenness;
a2, the extracted semantic features are assembly instructions of the functions.
3. The binary code similarity comparison technique against compile differences as claimed in claim 2, wherein the assembler instruction extracted in step a2 further needs to be normalized as follows:
a21, if the operand of the instruction is of register type, normalizing it into 'reg1', 'reg2', 'reg3' or 'reg4' according to the size of the register;
a22, if the operand of the instruction is of immediate type, judging whether the immediate is a string: if so, representing it by 'STR', otherwise by 'HIMM';
a23, if the operand is of memory type, judging whether it is addressed via a base register: if not, replacing it by '[MEM]'; if it is addressed via a base register without an index, representing it as '[normalized register + HIMM]'; otherwise representing it as '[normalized register + index + HIMM]'.
4. The binary code similarity comparison technique against compile differences as claimed in claim 1, wherein said step B further comprises the steps of:
b1, counting the 8 syntactic features in each basic block of the function and directly combining the values into a vector that represents the syntax vector of the basic block;
b2, performing random walks on the inter-procedural function call graph, converting the normalized assembly instructions along each walk path into a token corpus, and then performing word vector training with the entire binary program as a document;
b3, multiplying the opcode token vector by a term frequency-inverse document frequency (TF-IDF) parameter, and then combining it with the average of the operand token vectors to obtain the vector representation of each assembly instruction;
and B4, aggregating the vector of each assembly instruction obtained in the basic block through an attention mechanism to obtain a semantic vector representation of the basic block.
5. The binary code similarity comparison technique against compile differences as claimed in claim 1, wherein said step C further comprises the steps of:
c1, constructing a grammatical attribute control flow graph by taking the grammatical vector of each basic block generated by the B1 as the attribute of the control flow graph, taking the grammatical control flow graph as input, embedding a network training model by using a Struc2vec graph, and performing aggregation by adopting an attention mechanism during node aggregation;
and C2, constructing a semantic control flow graph by taking the semantic vector of each basic block generated by the B4 as the attribute of the control flow graph, taking the semantic control flow graph as input, embedding a network training model by using a Struc2vec graph, and performing aggregation by adopting an attention mechanism during node aggregation.
6. The binary code similarity comparison technique against compile differences as claimed in claim 1, wherein said step D further comprises the steps of:
d1, splicing the generated function grammar graph embedding vector and the function semantic graph embedding vector to train attention parameters;
and D2, respectively multiplying the trained attention parameters by the corresponding graph embedding vectors, and aggregating to generate a final function graph embedding vector.
7. The binary code similarity comparison technique against compilation difference as claimed in claim 1, wherein the step E is specifically as follows:
e1, combining the proposed graph embedding network with a Siamese network, with input given as pairs, wherein the model comprises two identical network structures that share the same parameters;
and E2, calculating the distance between the two graph embedding vectors by using cosine similarity to obtain the similarity between the functions.
CN202011117765.5A 2020-10-19 2020-10-19 Binary code similarity comparison technology for resisting compiling difference Pending CN113010209A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011117765.5A CN113010209A (en) 2020-10-19 2020-10-19 Binary code similarity comparison technology for resisting compiling difference

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011117765.5A CN113010209A (en) 2020-10-19 2020-10-19 Binary code similarity comparison technology for resisting compiling difference

Publications (1)

Publication Number Publication Date
CN113010209A true CN113010209A (en) 2021-06-22

Family

ID=76383603

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011117765.5A Pending CN113010209A (en) 2020-10-19 2020-10-19 Binary code similarity comparison technology for resisting compiling difference

Country Status (1)

Country Link
CN (1) CN113010209A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105868108A (en) * 2016-03-28 2016-08-17 中国科学院信息工程研究所 Instruction-set-irrelevant binary code similarity detection method based on neural network
CN107357566A (en) * 2017-06-06 2017-11-17 上海交通大学 More framework binary system similar codes detecting systems and method
US20180285101A1 (en) * 2017-03-29 2018-10-04 Technion Research & Development Foundation Limited Similarity of binaries
CN110704103A (en) * 2019-09-04 2020-01-17 中国人民解放军战略支援部队信息工程大学 Binary file semantic similarity comparison method and device based on software genes

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XIAOJUN XU et al.: "Neural Network-based Graph Embedding for Cross-Platform Binary Code Similarity Detection", CCS '17: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security *
ZHAO PENGLEI: "Research and Implementation of a Binary Function Similarity Detection Algorithm Based on Graph Neural Networks", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113254934A (en) * 2021-06-29 2021-08-13 湖南大学 Binary code similarity detection method and system based on graph matching network
CN113535229A (en) * 2021-06-30 2021-10-22 中国人民解放军战略支援部队信息工程大学 Anti-confusion binary code clone detection method based on software gene
CN113535229B (en) * 2021-06-30 2022-12-02 中国人民解放军战略支援部队信息工程大学 Anti-confusion binary code clone detection method based on software gene
CN113918171A (en) * 2021-10-19 2022-01-11 哈尔滨理工大学 Novel disassembling method using extended control flow graph
CN115758164A (en) * 2022-10-12 2023-03-07 清华大学 Binary code similarity detection method, model training method and device
CN116166538A (en) * 2022-12-28 2023-05-26 山东大学 Cross-version prediction variation testing method and system
CN116166538B (en) * 2022-12-28 2023-08-22 山东大学 Cross-version prediction variation testing method and system
CN115858002A (en) * 2023-02-06 2023-03-28 湖南大学 Binary code similarity detection method and system based on graph comparison learning and storage medium
CN115858002B (en) * 2023-02-06 2023-04-25 湖南大学 Binary code similarity detection method and system based on graph comparison learning and storage medium
CN116578979A (en) * 2023-05-15 2023-08-11 软安科技有限公司 Cross-platform binary code matching method and system based on code features
CN116578979B (en) * 2023-05-15 2024-05-31 软安科技有限公司 Cross-platform binary code matching method and system based on code features
CN117608539A (en) * 2023-11-02 2024-02-27 清华大学 Binary code representation vector generation method, binary code representation vector generation device, binary code representation vector generation equipment and storage medium

Similar Documents

Publication Publication Date Title
CN113010209A (en) Binary code similarity comparison technology for resisting compiling difference
CN111428044B (en) Method, device, equipment and storage medium for acquiring supervision and identification results in multiple modes
Ashizawa et al. Eth2vec: learning contract-wide code representations for vulnerability detection on ethereum smart contracts
CN112733137B (en) Binary code similarity analysis method for vulnerability detection
CN110581864B (en) Method and device for detecting SQL injection attack
CN114065199B (en) Cross-platform malicious code detection method and system
Gui et al. Cross-language binary-source code matching with intermediate representations
Shakya et al. Smartmixmodel: machine learning-based vulnerability detection of solidity smart contracts
Liang et al. Neutron: an attention-based neural decompiler
CN110990058A (en) Software similarity measurement method and device
Zhao et al. Semantics-aware obfuscation scheme prediction for binary
Wang et al. Gvd-net: Graph embedding-based machine learning model for smart contract vulnerability detection
CN117032717A (en) Java compiler security risk detection method based on byte code similarity
CN116595537A (en) Vulnerability detection method of generated intelligent contract based on multi-mode features
Artuso et al. Binbert: Binary code understanding with a fine-tunable and execution-aware transformer
Huang et al. Deep Smart Contract Intent Detection
Wang et al. Hierarchical attention graph embedding networks for binary code similarity against compilation diversity
Khodadadi et al. HyMo: Vulnerability Detection in Smart Contracts using a Novel Multi-Modal Hybrid Model
Li et al. A simple function embedding approach for binary similarity detection
Li et al. Adabot: Fault-tolerant java decompiler
He et al. Enhancing smart contract security: Leveraging pre‐trained language models for advanced vulnerability detection
Jia et al. FuncFooler: A Practical Black-box Attack Against Learning-based Binary Code Similarity Detection Methods
Song et al. Milo: Attacking Deep Pre-trained Model for Programming Languages Tasks with Anti-analysis Code Obfuscation
Lian et al. A Universal and Efficient Multi-modal Smart Contract Vulnerability Detection Framework for Big Data
Xing et al. HGE-BVHD: Heterogeneous Graph Embedding Scheme of Complex Structure Functions for Binary Vulnerability Homology Discrimination

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20210622)