CN112596736B - Semantic-based cross-instruction architecture binary code similarity detection method - Google Patents

Semantic-based cross-instruction architecture binary code similarity detection method

Info

Publication number
CN112596736B
Authority
CN
China
Prior art keywords
basic block
embedding
semantic
arm
assembly code
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011552657.0A
Other languages
Chinese (zh)
Other versions
CN112596736A (en
Inventor
王莘
姜训智
董少波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology
Priority to CN202011552657.0A
Publication of CN112596736A
Application granted
Publication of CN112596736B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/43 Checking; Contextual analysis
    • G06F8/436 Semantic checking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a semantic-based method for detecting the similarity of binary code across instruction architectures and relates to the technical field of vulnerability detection. The method constructs assembly language data sets for different instruction architectures and extracts assembly code pairs that have the same semantics but different instruction architectures; normalizes the assembly code basic block pairs; pre-trains an assembly code encoder by converting the assembly code semantic similarity task into a natural language neural machine translation task; performs negative sampling; embeds the string semantics of each basic block, aligning and concatenating the string embedding vector with the embedding vector produced by the embedding network to form a new basic block embedding vector; and trains the embedding network, which outputs embedding vectors for x86 and ARM. The invention can improve the accuracy and efficiency of tasks that require comparing basic blocks of different instruction architectures, such as vulnerability detection and copyright disputes.

Description

Semantic-based cross-instruction architecture binary code similarity detection method
Technical Field
The invention relates to the technical field of vulnerability detection, and in particular to a semantic-based method for detecting the similarity of binary code across instruction architectures.
Background
Existing research on binary code similarity across instruction architectures, such as Gemini and Genius, generally selects binary code features manually for basic block embedding. Selecting these features requires expert knowledge, and the resulting embeddings carry little information and cannot fully express the semantics of the binary code.
To address this problem, methods such as SAFE and Asm2vec apply static word representations to binary code. Opcodes and operands in a basic block are represented as fixed-dimension vectors, and each element of the basic block embedding is an automatically computed real number, which greatly increases the information capacity of the embedding. However, these approaches are not suitable for basic block embedding across instruction architectures, because tokens from different instruction architectures never appear in the same context.
Disclosure of Invention
The invention detects the similarity of binary code between different instruction architectures (such as x86 and ARM) and can be used in fields such as vulnerability detection, copyright disputes and malware analysis. It provides a semantic-based cross-instruction architecture binary code similarity detection method with the following technical scheme:
a semantic-based cross-instruction architecture binary code similarity detection method comprises the following steps:
step 1: constructing assembly language data sets for different instruction architectures, and extracting assembly code pairs that have the same semantics but different instruction architectures;
step 2: normalizing the assembly code basic block pairs;
step 3: pre-training an assembly code encoder by converting the assembly code semantic similarity task into a natural language neural machine translation task;
step 4: performing negative sampling so that the model can distinguish basic blocks that have different semantics but similar embedding vectors;
step 5: combining basic block embedding with string semantics, by aligning and concatenating the string embedding vector with the embedding vector produced by the embedding network to form a new basic block embedding vector;
step 6: training the embedding network, which outputs the embedding vectors of x86 and ARM;
step 7: comparing binary code similarity, and determining whether the binary code of two different instruction architectures is similar.
Preferably, the step 1 specifically comprises:
training basic block embeddings requires assembly code basic block pairs with the same semantics; source code is compiled by a modified LLVM compiler into intermediate representation (IR) with basic block boundary marks; after optimization by the optimizer, different LLVM back ends generate assembly code for the different instruction architectures, in which every basic block has a unique identifier and marked boundaries; and assembly code pairs with the same semantics but different instruction architectures are extracted by regular-expression filtering.
Preferably, the step 2 specifically comprises:
registers in x86 are divided into 14 classes: a pointer register class, a floating-point register class, four general-purpose register classes, four data register classes and four address register classes, where the general-purpose, data and address registers are each divided into four classes according to the width of the data they hold, namely 8 bits, 16 bits, 32 bits and 64 bits; registers in ARM are divided into two classes, general-purpose registers and pointer registers; the assembly code of the x86 basic block is thereby normalized.
Preferably, the step 3 specifically comprises:
the assembly code semantic similarity task is converted into a natural language neural machine translation task, and the currently mainstream neural machine translation model Transformer is trained on the generated assembly code that has the same semantics but different instruction architectures; the intermediate vectors generated by the model contain rich assembly code semantic information;
the x86 basic block and the ARM basic block are denoted $S = (s_1, \dots, s_n)$ and $T = (t_1, \dots, t_m)$ respectively, where each $s_i$ and $t_j$ is the index of a token in the x86 vocabulary $V_{x86}$ and the ARM vocabulary $V_{ARM}$ respectively; the objective of the Transformer model is to predict the next token $t_k$ of the ARM basic block, and its final output is the probability distribution of the next token
Figure BDA0002857465920000021
The loss function L is given by the following formula:
Figure BDA0002857465920000022
where
Figure BDA0002857465920000023
is the one-hot encoding of $t_k$;
an x86 encoder is thereby obtained, which embeds x86 assembly code into a vector containing semantic features.
Preferably, the step 4 specifically includes:
on the basis of the pre-trained model, the similarity of two x86 basic blocks is judged by measuring the Euclidean distance between their embedding vectors $E_1$ and $E_2$, which is given by the following formula:
$$D(E_1, E_2) = \sqrt{\sum_{i=1}^{d} (e_{1i} - e_{2i})^2}$$
where $d$ is the embedding dimension, $e_{1i} \in E_1$ and $e_{2i} \in E_2$;
training the embedding network requires triplets, each consisting of an anchor, a positive sample and a negative sample; an x86-ARM basic block pair A is selected at random to serve as the anchor and the positive sample, and another x86-ARM basic block pair B, whose embedding vector is close to that of pair A but whose semantics differ, is found by computing the distance; when the anchor is the ARM basic block of pair A, the positive sample is the x86 basic block of pair A and the negative sample is the x86 basic block of pair B; when the anchor is the x86 basic block of pair A, the positive sample is the ARM basic block of pair A and the negative sample is the ARM basic block of pair B.
Preferably, the step 5 specifically comprises:
the strings used in each basic block are collected; each string is embedded as a vector using an LSTM; the final string embedding of each basic block is computed from all of its string vectors by sum pooling; and the string embedding vector is aligned and concatenated with the embedding vector produced by the embedding network to form a new basic block embedding vector.
Preferably, the step 6 specifically includes:
the triplet samples obtained by negative sampling are passed through the embedding network model to obtain the semantic embedding vectors of the anchor, the positive sample and the negative sample;
the string vector of each basic block is concatenated with its semantic embedding vector to obtain an embedding vector that carries both semantics and strings; the embedding of the anchor should be closer to the embedding of the positive sample than to the embedding of the negative sample, so a margin-based triplet loss function is used:
$$L = \max\{D(E_1, E_2) - D(E_1, E_3) + \gamma,\ 0\}$$
where $E_1$, $E_2$ and $E_3$ are the embeddings of the anchor, the positive sample and the negative sample respectively, $\gamma > 0$ is a margin parameter and $D$ is the Euclidean distance; the loss encourages $D(E_1, E_2)$ to be smaller than $D(E_1, E_3)$ by at least the margin $\gamma$; the final embedding model is thereby obtained, which takes x86 and ARM assembly code basic blocks as input and outputs x86 and ARM embedding vectors.
Preferably, the step 7 specifically includes:
given two binaries of different instruction architectures, their assembly code is obtained by disassembly, the semantic embedding vectors of the assembly code basic blocks are obtained, and the Euclidean distance between the embedding vectors is computed to determine whether the two binaries are similar.
The invention has the following beneficial effects:
The invention combines the semantic features and the string features of a basic block, and can quickly compare basic blocks of different instruction architectures using the pre-trained embedding model. It can improve the accuracy and efficiency of tasks that require comparing basic blocks of different instruction architectures, such as vulnerability detection and copyright disputes.
Drawings
FIG. 1 is a schematic diagram of LLVM optimization;
FIG. 2 is a schematic illustration of a normalization operation;
FIG. 3 is a flow chart of a semantic-based method for detecting similarity of binary codes across instruction architectures.
Detailed Description
The present invention will be described in detail with reference to specific examples.
The first embodiment is as follows:
As shown in FIGS. 1 to 3, the present invention provides a semantic-based cross-instruction architecture binary code similarity detection method comprising the following steps:
step 1: constructing assembly language data sets for different instruction architectures, and extracting assembly code pairs that have the same semantics but different instruction architectures;
the step 1 specifically comprises the following steps:
training basic block embeddings requires assembly code basic block pairs with the same semantics; source code is compiled by a modified LLVM compiler into intermediate representation (IR) with basic block boundary marks; after optimization by the optimizer, different LLVM back ends generate assembly code for the different instruction architectures, in which every basic block has a unique identifier and marked boundaries; and assembly code pairs with the same semantics but different instruction architectures are extracted by regular-expression filtering.
step 2: normalizing the assembly code basic block pairs;
the step 2 specifically comprises the following steps:
registers in x86 are divided into 14 classes: a pointer register class, a floating-point register class, four general-purpose register classes, four data register classes and four address register classes, where the general-purpose, data and address registers are each divided into four classes according to the width of the data they hold, namely 8 bits, 16 bits, 32 bits and 64 bits; registers in ARM are divided into two classes, general-purpose registers and pointer registers; the assembly code of the x86 basic block is thereby normalized.
step 3: pre-training an assembly code encoder by converting the assembly code semantic similarity task into a natural language neural machine translation task;
the step 3 specifically comprises the following steps:
the assembly code semantic similarity task is converted into a natural language neural machine translation task, and the currently mainstream neural machine translation model Transformer is trained on the generated assembly code that has the same semantics but different instruction architectures; the intermediate vectors generated by the model contain rich assembly code semantic information;
the x86 basic block and the ARM basic block are denoted $S = (s_1, \dots, s_n)$ and $T = (t_1, \dots, t_m)$ respectively, where each $s_i$ and $t_j$ is the index of a token in the x86 vocabulary $V_{x86}$ and the ARM vocabulary $V_{ARM}$ respectively; the objective of the Transformer model is to predict the next token $t_k$ of the ARM basic block, and its final output is the probability distribution of the next token
Figure BDA0002857465920000041
The loss function L is given by the following formula:
Figure BDA0002857465920000042
where
Figure BDA0002857465920000043
is the one-hot encoding of $t_k$;
an x86 encoder is thereby obtained, which embeds x86 assembly code into a vector containing semantic features.
step 4: performing negative sampling so that the model can distinguish basic blocks that have different semantics but similar embedding vectors;
the step 4 specifically comprises the following steps:
on the basis of the pre-trained model, the similarity of two x86 basic blocks is judged by measuring the Euclidean distance between their embedding vectors $E_1$ and $E_2$, which is given by the following formula:
$$D(E_1, E_2) = \sqrt{\sum_{i=1}^{d} (e_{1i} - e_{2i})^2}$$
where $d$ is the embedding dimension, $e_{1i} \in E_1$ and $e_{2i} \in E_2$;
training the embedding network requires triplets, each consisting of an anchor, a positive sample and a negative sample; an x86-ARM basic block pair A is selected at random to serve as the anchor and the positive sample, and another x86-ARM basic block pair B, whose embedding vector is close to that of pair A but whose semantics differ, is found by computing the distance; when the anchor is the ARM basic block of pair A, the positive sample is the x86 basic block of pair A and the negative sample is the x86 basic block of pair B; when the anchor is the x86 basic block of pair A, the positive sample is the ARM basic block of pair A and the negative sample is the ARM basic block of pair B.
step 5: combining basic block embedding with string semantics, by aligning and concatenating the string embedding vector with the embedding vector produced by the embedding network to form a new basic block embedding vector;
the step 5 specifically comprises the following steps:
the strings used in each basic block are collected; each string is embedded as a vector using an LSTM; the final string embedding of each basic block is computed from all of its string vectors by sum pooling; and the string embedding vector is aligned and concatenated with the embedding vector produced by the embedding network to form a new basic block embedding vector.
step 6: training the embedding network, which outputs the embedding vectors of x86 and ARM;
the step 6 specifically comprises the following steps:
the triplet samples obtained by negative sampling are passed through the embedding network model to obtain the semantic embedding vectors of the anchor, the positive sample and the negative sample;
the string vector of each basic block is concatenated with its semantic embedding vector to obtain an embedding vector that carries both semantics and strings; the embedding of the anchor should be closer to the embedding of the positive sample than to the embedding of the negative sample, so a margin-based triplet loss function is used:
$$L = \max\{D(E_1, E_2) - D(E_1, E_3) + \gamma,\ 0\}$$
where $E_1$, $E_2$ and $E_3$ are the embeddings of the anchor, the positive sample and the negative sample respectively, $\gamma > 0$ is a margin parameter and $D$ is the Euclidean distance; the loss encourages $D(E_1, E_2)$ to be smaller than $D(E_1, E_3)$ by at least the margin $\gamma$; the final embedding model is thereby obtained, which takes x86 and ARM assembly code basic blocks as input and outputs x86 and ARM embedding vectors.
step 7: comparing binary code similarity, and determining whether the binary code of two different instruction architectures is similar.
The step 7 specifically comprises the following steps:
given two binaries of different instruction architectures, their assembly code is obtained by disassembly, the semantic embedding vectors of the assembly code basic blocks are obtained, and the Euclidean distance between the embedding vectors is computed to determine whether the two binaries are similar.
The second embodiment is as follows:
The invention mainly provides a basic block embedding model. Binary code from different instruction architectures is disassembled to obtain the assembly code basic blocks of each architecture, the basic blocks are embedded into vectors using the proposed embedding model, and the similarity of the binary code is determined by comparing the cosine similarity of the embedding vectors.
The two embedding modules of the proposed basic block embedding model can be used for basic blocks of any other architecture. x86 and ARM are used here only for ease of description.
Step one: assembly language data sets of different instruction architectures are constructed.
Training basic block embeddings requires a large number of assembly code basic block pairs with the same semantics, which are prepared here for the subsequent training of basic block semantic embeddings. Assembly code compiled from the same source code is considered semantically equivalent. The source code is therefore compiled by a modified LLVM compiler into intermediate representation (IR) with basic block boundary marks, and after optimization by the optimizer, different LLVM back ends generate assembly code for the different instruction architectures. Every basic block has a unique identifier and marked boundaries, so assembly code pairs with the same semantics but different instruction architectures can be extracted by simple regular-expression filtering.
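The pairing by basic block identifier can be sketched as follows. This is a minimal illustration, not the patent's implementation: the marker comments `BB_START <id>` and `BB_END <id>`, the helper names and the sample listings are all assumptions, since the text does not specify how the modified compiler writes the boundary marks.

```python
import re

# Hypothetical marker format: a comment line "BB_START <id>" opens a block and
# "BB_END <id>" closes it. The patent does not specify the real format.
BLOCK_RE = re.compile(
    r"^[#@;][ \t]*BB_START[ \t]+(?P<id>\S+)[ \t]*\n"
    r"(?P<body>.*?)"
    r"^[#@;][ \t]*BB_END[ \t]+(?P=id)[ \t]*$",
    re.MULTILINE | re.DOTALL,
)

def extract_blocks(asm_text):
    """Map each basic block identifier to its list of instruction lines."""
    blocks = {}
    for match in BLOCK_RE.finditer(asm_text):
        body = match.group("body")
        blocks[match.group("id")] = [ln.strip() for ln in body.splitlines() if ln.strip()]
    return blocks

def pair_blocks(x86_asm, arm_asm):
    """Pair the blocks whose identifier appears in both listings."""
    x86_blocks = extract_blocks(x86_asm)
    arm_blocks = extract_blocks(arm_asm)
    return {bid: (x86_blocks[bid], arm_blocks[bid]) for bid in x86_blocks if bid in arm_blocks}

# Toy listings, invented for this example.
X86_ASM = """\
# BB_START f.0
movl $1, %eax
addl %edi, %eax
# BB_END f.0
# BB_START f.1
retq
# BB_END f.1
"""

ARM_ASM = """\
@ BB_START f.0
add r0, r0, #1
@ BB_END f.0
"""

pairs = pair_blocks(X86_ASM, ARM_ASM)
print(pairs)
```

Only `f.0` is paired here, because `f.1` has no counterpart in the ARM listing. A real pipeline would need the markers to survive each back end unchanged, which is presumably what the compiler modification ensures.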
Step two: the assembly code basic block pairs are normalized.
If the original assembly code is fed directly into the neural network to learn semantics, some tokens may be out of vocabulary (OOV). In addition, some registers can be replaced by other registers of the same class without changing the semantics. The assembly code therefore needs to be normalized.
Constants in the assembly code are divided into five types: immediate values, addresses, variable names, function names and basic block labels, which are replaced with the tokens 'IMM', 'ADDRESS', 'VAR', 'FUNC' and 'BB' respectively.
Registers in x86 are divided into 14 classes: a pointer register class, a floating-point register class, four general-purpose register classes, four data register classes and four address register classes. The general-purpose, data and address registers are each divided into four classes according to the width of the data they hold, namely 8 bits, 16 bits, 32 bits and 64 bits. Registers in ARM are divided into two classes: general-purpose registers and pointer registers.
As shown in FIG. 2, the normalization operation transforms the x86 basic block assembly code on the left into the form on the right.
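The normalization described above can be sketched as follows for AT&T-syntax x86. This is a simplified illustration under stated assumptions: the register table, the class names `PTR` and `GEN8` to `GEN64`, the `.LBB` label prefix and the function `normalize_x86` are hypothetical, and the sketch covers only a few registers and the 'IMM', 'FUNC' and 'BB' replacements, not the full 14-class scheme or the 'ADDRESS' and 'VAR' cases.

```python
import re

# Illustrative subset of register classes; names are assumptions for this sketch.
X86_REG_CLASSES = {
    "rsp": "PTR", "rbp": "PTR", "esp": "PTR", "ebp": "PTR",
    "rax": "GEN64", "rbx": "GEN64", "rcx": "GEN64", "rdx": "GEN64",
    "eax": "GEN32", "ebx": "GEN32", "ecx": "GEN32", "edx": "GEN32",
    "ax": "GEN16", "bx": "GEN16", "cx": "GEN16", "dx": "GEN16",
    "al": "GEN8", "bl": "GEN8", "cl": "GEN8", "dl": "GEN8",
}

# Splits an instruction into names/registers, numeric literals and punctuation.
TOKEN_RE = re.compile(r"%?[A-Za-z_.][\w.]*|\$?-?(?:0x[0-9a-fA-F]+|\d+)|[^\s\w]")
NUMBER_RE = re.compile(r"-?(?:0x[0-9a-fA-F]+|\d+)")

def normalize_x86(instruction, function_names=frozenset()):
    """Replace registers and constants with class tokens."""
    out = []
    for tok in TOKEN_RE.findall(instruction):
        bare = tok.lstrip("%$")
        if tok.startswith("%") and bare.lower() in X86_REG_CLASSES:
            out.append(X86_REG_CLASSES[bare.lower()])  # register -> class token
        elif NUMBER_RE.fullmatch(bare):
            out.append("IMM")                          # numeric literal
        elif bare in function_names:
            out.append("FUNC")                         # known function name
        elif bare.startswith(".LBB"):
            out.append("BB")                           # basic block label
        else:
            out.append(tok)                            # mnemonic or punctuation
    return " ".join(out)

print(normalize_x86("movl $1, %eax"))
print(normalize_x86("callq printf", {"printf"}))
print(normalize_x86("jne .LBB0_2"))
```

After normalization, `movl $1, %eax` and `movl $7, %ebx` map to the same token sequence, which shrinks the vocabulary the encoder has to learn.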
Step three: the assembly code encoder is pre-trained.
The assembly code semantic similarity task is converted into a natural language neural machine translation task. The large number of assembly code pairs with the same semantics but different instruction architectures generated in steps one and two are used to train the currently mainstream neural machine translation model Transformer, and the intermediate vectors generated by the model contain rich assembly code semantic information. The x86 basic block and the ARM basic block are denoted $S = (s_1, \dots, s_n)$ and $T = (t_1, \dots, t_m)$ respectively, where $s_i$ and $t_i$ are the indices of tokens in the x86 vocabulary $V_{x86}$ and the ARM vocabulary $V_{ARM}$ respectively. The objective of the Transformer model is to predict the next token $t_k$ of the ARM basic block, and its final output is the probability distribution of the next token
Figure BDA0002857465920000071
The loss function is given by:
Figure BDA0002857465920000072
where
Figure BDA0002857465920000073
is the one-hot encoding of $t_k$. An x86 encoder is thereby obtained, which can embed x86 assembly code into a vector containing semantic features.
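The formula images for this step are not reproduced in the text. A common form for a next-token objective of this kind, given here as an assumption rather than as the patent's own formula, is the cross-entropy between the predicted distribution and the one-hot target. The symbols $\hat{y}_k$ and $y_k$ and the index $j$ are introduced only for this illustration.

```latex
\hat{y}_k = P\left(t_k \mid t_1, \dots, t_{k-1}, S\right), \qquad
L = -\sum_{k=1}^{m} \sum_{j=1}^{|V_{ARM}|} y_{k,j} \,\log \hat{y}_{k,j}
```

Because $y_k$ is one-hot, the inner sum keeps only the term for the correct token, so each position contributes $-\log \hat{y}_{k,t_k}$ to the loss.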
Step four: negative sampling.
The samples added by negative sampling enable the model to distinguish basic blocks that have different semantics but similar embedding vectors, which improves the performance of the model. On the basis of the pre-trained model, the similarity of two x86 basic blocks can be judged by measuring the Euclidean distance between their embedding vectors $E_1$ and $E_2$:
$$D(E_1, E_2) = \sqrt{\sum_{i=1}^{d} (e_{1i} - e_{2i})^2}$$
where
Figure BDA0002857465920000075
$d$ is the embedding dimension, $e_{1i} \in E_1$ and $e_{2i} \in E_2$. The smaller the Euclidean distance, the higher the similarity of the two basic blocks. Triplets (anchor, positive sample and negative sample) are used to train the embedding network. An x86-ARM basic block pair A is selected at random to serve as the anchor and the positive sample. Another x86-ARM basic block pair B, whose embedding vector is close to that of pair A but whose semantics differ, is found by computing the distance. If the anchor is the ARM basic block of pair A, the positive sample is the x86 basic block of pair A and the negative sample is the x86 basic block of pair B; if the anchor is the x86 basic block of pair A, the positive sample is the ARM basic block of pair A and the negative sample is the ARM basic block of pair B.
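The triplet construction above can be sketched as follows. This is a minimal sketch under stated assumptions: the data layout (a list of dictionaries with `"x86"` and `"arm"` embeddings), the function names and the toy vectors are hypothetical, and every pair other than A is assumed to have different semantics from A, which a real data set would have to guarantee.

```python
import math

def closest_other_pair(pairs, a_index):
    """Index of the pair whose x86 embedding is nearest to pair A's x86 embedding."""
    a_vec = pairs[a_index]["x86"]
    others = [i for i in range(len(pairs)) if i != a_index]
    return min(others, key=lambda i: math.dist(a_vec, pairs[i]["x86"]))

def make_triplet(pairs, a_index, anchor_arch):
    """Build (anchor, positive, negative) for pair A.

    The anchor comes from one architecture of pair A; the positive is the other
    architecture of pair A; the negative is that same other architecture taken
    from the hard negative pair B.
    """
    if anchor_arch not in ("x86", "arm"):
        raise ValueError("anchor_arch must be 'x86' or 'arm'")
    b_index = closest_other_pair(pairs, a_index)
    sample_arch = "x86" if anchor_arch == "arm" else "arm"
    return (
        pairs[a_index][anchor_arch],
        pairs[a_index][sample_arch],
        pairs[b_index][sample_arch],
    )

# Toy two-dimensional embeddings, invented for this example.
PAIRS = [
    {"x86": [0.0, 0.0], "arm": [0.1, 0.0]},
    {"x86": [0.2, 0.1], "arm": [0.3, 0.1]},
    {"x86": [5.0, 5.0], "arm": [5.1, 5.0]},
]

anchor, positive, negative = make_triplet(PAIRS, 0, "arm")
print(anchor, positive, negative)
```

In training, `a_index` would be drawn at random (for example with `random.randrange(len(PAIRS))`); it is fixed here so the result is easy to follow. Pair 1 is chosen as B because its x86 embedding lies closest to that of pair 0.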
Step five: basic block embedding in conjunction with string semantics.
Some variables in the assembly code represent strings, and the strings in basic blocks of different instruction architectures are mostly identical, so strings are very helpful for matching two basic blocks. The strings used in each basic block are first collected, each string is then embedded as a vector using an LSTM, and the final string embedding of each basic block is computed from all of its string vectors by sum pooling. Finally, the string embedding vector is aligned and concatenated with the embedding vector produced by the embedding network to form a new basic block embedding vector.
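The pooling and concatenation can be sketched as follows. The per-string vectors are assumed to have been produced already by the LSTM; the function names are hypothetical, and treating "alignment" as padding a block without strings with a zero vector is an interpretation made for this sketch, not something the text states.

```python
def sum_pool(vectors, dim):
    """Element-wise sum of string vectors; a zero vector if the block has no strings."""
    pooled = [0.0] * dim
    for vec in vectors:
        if len(vec) != dim:
            raise ValueError("every string vector must have length dim")
        for i, value in enumerate(vec):
            pooled[i] += value
    return pooled

def combine_embeddings(semantic_vec, string_vecs, string_dim):
    """Concatenate the semantic embedding with the pooled string embedding."""
    return list(semantic_vec) + sum_pool(string_vecs, string_dim)

# Toy vectors, invented for this example.
semantic = [0.5, -1.0]
strings = [[1.0, 2.0], [3.0, 4.0]]

with_strings = combine_embeddings(semantic, strings, 2)
without_strings = combine_embeddings(semantic, [], 2)
print(with_strings, without_strings)
```

Sum pooling keeps the combined vector at a fixed length regardless of how many strings a block references, so embeddings of different blocks remain directly comparable.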
Step six: the embedding network is trained.
First, the triplet samples obtained by negative sampling are passed through the embedding network model to obtain the semantic embedding vectors of the anchor, the positive sample and the negative sample. The string vector of each basic block is then concatenated with its semantic embedding vector to obtain an embedding vector that carries both semantics and strings. The embedding of the anchor should be closer to the embedding of the positive sample than to the embedding of the negative sample, so a margin-based triplet loss function is used:
$$L = \max\{D(E_1, E_2) - D(E_1, E_3) + \gamma,\ 0\}$$
where $E_1$, $E_2$ and $E_3$ are the embeddings of the anchor, the positive sample and the negative sample respectively, $\gamma > 0$ is a margin parameter and $D$ is the Euclidean distance. The loss encourages $D(E_1, E_2)$ to be smaller than $D(E_1, E_3)$ by at least the margin $\gamma$. The final embedding model is thereby obtained; it takes x86 and ARM assembly code basic blocks as input and outputs x86 and ARM embedding vectors.
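The margin-based triplet loss can be evaluated on concrete vectors as follows. The function name `triplet_loss`, the toy embeddings and the margin value 1.0 are chosen for this illustration only; the text does not give a value for the margin.

```python
import math

def triplet_loss(anchor, positive, negative, margin):
    """max(D(anchor, positive) - D(anchor, negative) + margin, 0) with Euclidean D."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(d_pos - d_neg + margin, 0.0)

anchor = [0.0, 0.0]
positive = [3.0, 4.0]        # distance 5 from the anchor
far_negative = [6.0, 8.0]    # distance 10 from the anchor
near_negative = [0.0, 5.0]   # distance 5 from the anchor

satisfied = triplet_loss(anchor, positive, far_negative, margin=1.0)
violated = triplet_loss(anchor, positive, near_negative, margin=1.0)
print(satisfied, violated)
```

With the far negative, the positive is already closer by more than the margin (5 - 10 + 1 < 0), so the loss is zero and the triplet contributes no gradient. With the near negative, the distances are equal and the loss equals the margin, which is why hard negatives from the sampling step are the informative ones.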
Step seven: the similarity of the binary code is compared.
Given two binaries of different instruction architectures, their assembly code is obtained by disassembly. Using the embedding model obtained in the six steps above, the semantic embedding vectors of the assembly code basic blocks are obtained, and the Euclidean distance between the embedding vectors is computed to determine whether the two binaries are similar.
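The final comparison can be sketched for a single pair of basic blocks as follows. The decision threshold, the function name `is_similar` and the toy embeddings are assumptions: the text says the Euclidean distance is used to judge similarity but gives no threshold and does not describe how block-level results are aggregated over a whole binary.

```python
import math

def is_similar(embedding_a, embedding_b, threshold):
    """Treat two basic blocks as similar when their embeddings are within threshold."""
    return math.dist(embedding_a, embedding_b) <= threshold

# Toy embeddings standing in for the output of the trained model.
x86_block = [0.2, 0.9, 0.1]
arm_block_close = [0.25, 0.85, 0.1]
arm_block_far = [0.9, 0.1, 0.7]

THRESHOLD = 0.5  # hypothetical value; would be tuned on validation data

print(is_similar(x86_block, arm_block_close, THRESHOLD))
print(is_similar(x86_block, arm_block_far, THRESHOLD))
```

In practice the threshold would be chosen on held-out pairs so that it separates matching from non-matching blocks at an acceptable error rate.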
The above is only a preferred embodiment of the semantic-based cross-instruction architecture binary code similarity detection method. The protection scope of the method is not limited to the above embodiments, and all technical solutions that follow the same idea fall within the protection scope of the invention. It should be noted that modifications and variations made by those skilled in the art without departing from the principle of the invention are also intended to fall within the protection scope of the invention.

Claims (6)

1. A semantic-based cross-instruction architecture binary code similarity detection method, characterized by comprising the following steps:
step 1: constructing assembly language data sets for different instruction architectures, and extracting assembly code pairs that have the same semantics but different instruction architectures;
step 2: normalizing the assembly code basic block pairs;
step 3: pre-training an assembly code encoder by converting the assembly code semantic similarity task into a natural language neural machine translation task;
step 4: performing negative sampling so that the model can distinguish basic blocks that have different semantics but similar embedding vectors;
the step 4 specifically comprises the following steps:
on the basis of the pre-trained model, the similarity of two x86 basic blocks is judged by measuring the Euclidean distance between their embedding vectors $E_1$ and $E_2$, which is given by the following formula:
$$D(E_1, E_2) = \sqrt{\sum_{i=1}^{d} (e_{1i} - e_{2i})^2}$$
where $d$ is the embedding dimension, $e_{1i} \in E_1$ and $e_{2i} \in E_2$;
training the embedding network requires triplets, each consisting of an anchor, a positive sample and a negative sample; an x86-ARM basic block pair A is selected at random to serve as the anchor and the positive sample, and another x86-ARM basic block pair B, whose embedding vector is close to that of pair A but whose semantics differ, is found by computing the distance; when the anchor is the ARM basic block of pair A, the positive sample is the x86 basic block of pair A and the negative sample is the x86 basic block of pair B; when the anchor is the x86 basic block of pair A, the positive sample is the ARM basic block of pair A and the negative sample is the ARM basic block of pair B;
step 5: combining basic block embedding with string semantics, by aligning and concatenating the string embedding vector with the embedding vector produced by the embedding network to form a new basic block embedding vector;
step 6: training the embedding network, which outputs the embedding vectors of x86 and ARM;
the step 6 specifically comprises the following steps:
using the triple samples obtained by negative sampling, and respectively obtaining the positive samples, the negative samples and the semantic embedded vectors of the anchor points after embedding the triple samples into the network model;
respectively splicing character string vectors corresponding to the basic blocks to obtain embedded vectors with semantics and character strings, wherein the embedding of the anchor points is closer to the embedding of positive samples than the embedding of negative samples, and a ternary loss function based on allowance is adopted by the following expression:
$$L = \max\{D(E_1, E_2) - D(E_1, E_3) + \gamma,\ 0\}$$
wherein $E_1$, $E_2$ and $E_3$ are the embeddings of the anchor, the positive sample and the negative sample respectively, $\gamma > 0$ is the margin parameter, and $D$ is the Euclidean distance; the loss encourages $D(E_1, E_2)$ to be smaller than $D(E_1, E_3)$ by at least the margin $\gamma$; the final embedding model is thereby obtained, which takes x86 and ARM assembly code basic blocks as input and outputs x86 and ARM embedding vectors;
step 7: performing binary code similarity comparison, and judging whether the binary codes of two different instruction architectures are similar.
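The triplet construction of step 4 and the margin-based loss of step 6 can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the patented implementation: the function names, the list-of-floats representation of embeddings, and the simplification that a different pair index implies different semantics are all illustrative.

```python
import math


def euclidean(e1, e2):
    """Euclidean distance D(E1, E2) between two equal-length embeddings."""
    if len(e1) != len(e2):
        raise ValueError("embeddings must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(e1, e2)))


def pick_hard_negative(anchor_idx, x86_embeddings):
    """Return the index of the pair whose x86 embedding is nearest to pair A.

    Assumption: every index denotes one x86-ARM pair, and pairs with
    different indices have different semantics.
    """
    best_idx, best_dist = None, math.inf
    for idx, emb in enumerate(x86_embeddings):
        if idx == anchor_idx:
            continue
        dist = euclidean(x86_embeddings[anchor_idx], emb)
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def build_triplets(pair_a, pair_b):
    """Build both (anchor, positive, negative) triplets from pairs A and B.

    Each pair is a dict with the hypothetical keys "x86" and "arm".
    """
    return [
        # ARM anchor: positive and negative are x86 blocks.
        (pair_a["arm"], pair_a["x86"], pair_b["x86"]),
        # x86 anchor: positive and negative are ARM blocks.
        (pair_a["x86"], pair_a["arm"], pair_b["arm"]),
    ]


def triplet_loss(e_anchor, e_pos, e_neg, margin):
    """Margin-based triplet loss L = max{D(E1,E2) - D(E1,E3) + margin, 0}."""
    return max(euclidean(e_anchor, e_pos) - euclidean(e_anchor, e_neg) + margin, 0.0)
```

The loss is zero once the negative sample is farther from the anchor than the positive sample by at least the margin, so only triplets that violate the margin contribute to training.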
2. The semantic-based cross-instruction architecture binary code similarity detection method according to claim 1, characterized by: the step 1 specifically comprises the following steps:
to obtain assembly code basic block pairs with the same semantics for basic block embedding, the source code is compiled by a modified LLVM compiler into the intermediate representation (IR) with basic block boundary marks; after optimization by the optimizer, assembly languages of different instruction architectures are generated by different LLVM back ends, wherein each basic block has a unique identifier and its boundary is marked; the assembly code pairs with the same semantics and different instruction architectures are then extracted through regular-expression filtering.
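The regular-expression filtering described in this claim might look like the sketch below. The claim does not disclose how boundaries are written into the assembly output, so the `BB_BEGIN <id>` and `BB_END <id>` marker lines, as well as the function names, are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical marker format: a line "BB_BEGIN <id>" opens a basic block and
# a line "BB_END <id>" with the same identifier closes it.
BLOCK_RE = re.compile(
    r"^BB_BEGIN (\S+)\n(.*?)^BB_END \1$",
    re.MULTILINE | re.DOTALL,
)


def extract_blocks(asm_text):
    """Map each basic block identifier to its list of instruction lines."""
    blocks = {}
    for block_id, body in BLOCK_RE.findall(asm_text):
        blocks[block_id] = [line.strip() for line in body.splitlines() if line.strip()]
    return blocks


def pair_blocks(x86_asm, arm_asm):
    """Pair x86 and ARM basic blocks that share the same identifier."""
    x86_blocks = extract_blocks(x86_asm)
    arm_blocks = extract_blocks(arm_asm)
    shared = sorted(x86_blocks.keys() & arm_blocks.keys())
    return [(bid, x86_blocks[bid], arm_blocks[bid]) for bid in shared]
```

Because both back ends compile the same marked IR, a shared identifier is what ties an x86 block to its ARM counterpart; blocks that appear on only one side are dropped.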
3. The semantic-based cross-instruction architecture binary code similarity detection method according to claim 1, characterized by: the step 2 specifically comprises the following steps:
for registers, the registers in x86 are classified into 14 types: a pointer register, a floating-point register, four types of general registers, four types of data registers and four types of address registers, wherein the general registers, the data registers and the address registers are each divided into four types according to the length of the data they store, namely 8 bits, 16 bits, 32 bits and 64 bits; the registers in ARM are divided into two categories, general-purpose registers and pointer registers; the assembly code of the x86 basic block is thereby normalized.
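Register normalization of this kind can be sketched as a token substitution. The claim fixes only the number of register classes, not which concrete register falls into which class, so the tables below are deliberately incomplete and their contents, the class token names such as `reg_gen_32`, and the function names are assumptions made for illustration.

```python
import re

# Illustrative, incomplete tables. The assignment of concrete registers to
# classes is an assumption; the claim only states how many classes exist.
X86_CLASSES = {
    "reg_gen_64": {"rax", "rbx", "rcx", "rdx"},
    "reg_gen_32": {"eax", "ebx", "ecx", "edx"},
    "reg_gen_16": {"ax", "bx", "cx", "dx"},
    "reg_gen_8": {"al", "bl", "cl", "dl"},
    "reg_addr_64": {"rsi", "rdi"},
    "reg_addr_32": {"esi", "edi"},
    "reg_pointer": {"rsp", "rbp", "esp", "ebp"},
}
ARM_CLASSES = {
    "reg_gen": {f"r{i}" for i in range(13)},
    "reg_pointer": {"sp", "lr", "pc"},
}

# Whole identifiers only, so hex immediates such as 0x10 are left untouched.
TOKEN_RE = re.compile(r"\b[A-Za-z_][A-Za-z0-9_]*\b")


def build_lookup(classes):
    """Invert a class table into a register-name -> class-token mapping."""
    return {reg: cls for cls, regs in classes.items() for reg in regs}


def normalize(instruction, classes):
    """Replace every known register name in one instruction by its class token."""
    table = build_lookup(classes)
    return TOKEN_RE.sub(lambda m: table.get(m.group(0).lower(), m.group(0)), instruction)
```

Collapsing registers into a small set of class tokens keeps the vocabulary compact, which is the usual motivation for this kind of normalization before training a sequence model.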
4. The semantic-based cross-instruction architecture binary code similarity detection method according to claim 1, characterized by: the step 3 specifically comprises the following steps:
the assembly code semantic similarity task is converted into a natural language neural machine translation task, and the current mainstream neural machine translation model, the Transformer, is trained on the generated assembly code of different instruction architectures with the same semantics, wherein the intermediate vector generated by the model contains rich assembly code semantic information;
the x86 basic block and the ARM basic block are denoted as $S = (s_1, \ldots, s_n)$ and $T = (t_1, \ldots, t_m)$ respectively, wherein $s_n$ and $t_m$ are the indices of characters in the x86 vocabulary $V_{x86}$ and the ARM vocabulary $V_{ARM}$ respectively; the objective of the Transformer model is to predict the next character $t_k$ of the ARM basic block, and the final output is the next character
Figure FDA0003223588850000021
the loss function $L$ is represented by the following formula:
Figure FDA0003223588850000022
wherein
Figure FDA0003223588850000023
is the one-hot encoding of $t_k$;
an x86 encoder is obtained, and the x86 assembly code is embedded into a vector containing semantic features.
5. The semantic-based cross-instruction architecture binary code similarity detection method according to claim 1, characterized by: the step 5 specifically comprises the following steps:
the character strings used in the basic blocks are collected, each character string is embedded into a vector using an LSTM, the final string embedding of each basic block is computed by Sum Pooling over all of its string vectors, and the string embedding vector is aligned and concatenated with the embedding vector obtained by the embedding network to form a new basic block embedding vector.
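The pooling and concatenation in this claim can be illustrated with a short sketch. It assumes the per-string vectors have already been produced (the LSTM itself is not modeled here), and it interprets "aligned" as giving every block a string embedding of the same fixed dimension, using a zero vector when a block contains no strings; both points and the function names are assumptions.

```python
def sum_pool(string_vectors, dim):
    """Element-wise sum of the string vectors of one basic block.

    Returns a zero vector of length dim when the block uses no strings
    (an assumed way of keeping all blocks at the same dimension).
    """
    pooled = [0.0] * dim
    for vec in string_vectors:
        if len(vec) != dim:
            raise ValueError("every string vector must have length dim")
        for i, value in enumerate(vec):
            pooled[i] += value
    return pooled


def combine(semantic_embedding, string_vectors, string_dim):
    """Concatenate the semantic embedding with the pooled string embedding."""
    return list(semantic_embedding) + sum_pool(string_vectors, string_dim)
```

With a semantic embedding of length 2 and string vectors of length 3, every combined basic block embedding has length 5, whether or not the block references any strings.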
6. The semantic-based cross-instruction architecture binary code similarity detection method according to claim 1, characterized by: the step 7 specifically comprises the following steps:
given two binary codes of different instruction architectures, their respective assembly codes are obtained through disassembly, the semantic embedding vectors of the assembly code basic blocks are obtained respectively, and the Euclidean distance between the embedding vectors is calculated to judge whether the binary codes of the two different instruction architectures are similar.
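The final comparison can be sketched as a distance test over embeddings that have already been computed. Disassembly and the embedding model are outside this sketch, and the decision threshold is an assumption: the claim states only that the Euclidean distance is used to judge similarity, not how the cut-off is chosen. The function names are likewise illustrative.

```python
import math


def blocks_similar(x86_embedding, arm_embedding, threshold):
    """Judge two basic blocks similar when their distance is within threshold.

    threshold is a hypothetical tuning parameter, not a value from the claim.
    """
    return math.dist(x86_embedding, arm_embedding) <= threshold


def rank_candidates(query_embedding, candidates):
    """Sort (name, embedding) candidates by ascending distance to the query."""
    scored = [(math.dist(query_embedding, emb), name) for name, emb in candidates]
    return sorted(scored)
```

Ranking is often more practical than a hard threshold when searching a corpus, since the nearest candidates can be inspected even if no single cut-off suits every block.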
CN202011552657.0A 2020-12-24 2020-12-24 Semantic-based cross-instruction architecture binary code similarity detection method Active CN112596736B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011552657.0A CN112596736B (en) 2020-12-24 2020-12-24 Semantic-based cross-instruction architecture binary code similarity detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011552657.0A CN112596736B (en) 2020-12-24 2020-12-24 Semantic-based cross-instruction architecture binary code similarity detection method

Publications (2)

Publication Number Publication Date
CN112596736A CN112596736A (en) 2021-04-02
CN112596736B true CN112596736B (en) 2021-10-08

Family

ID=75201980

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011552657.0A Active CN112596736B (en) 2020-12-24 2020-12-24 Semantic-based cross-instruction architecture binary code similarity detection method

Country Status (1)

Country Link
CN (1) CN112596736B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113535229B (en) * 2021-06-30 2022-12-02 中国人民解放军战略支援部队信息工程大学 Anti-confusion binary code clone detection method based on software gene
CN113569251A (en) * 2021-07-05 2021-10-29 哈尔滨工业大学 Binary executable file vulnerability detection method based on assembly instruction sequence
CN113656066B (en) * 2021-08-16 2022-08-05 南京航空航天大学 Clone code detection method based on feature alignment
CN115951931B (en) * 2023-03-14 2023-05-16 山东大学 Binary code similarity detection method based on BERT
CN116578979A (en) * 2023-05-15 2023-08-11 软安科技有限公司 Cross-platform binary code matching method and system based on code features

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111522581A (en) * 2020-04-22 2020-08-11 山东师范大学 Enhanced code annotation automatic generation method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8195447B2 (en) * 2006-10-10 2012-06-05 Abbyy Software Ltd. Translating sentences between languages using language-independent semantic structures and ratings of syntactic constructions

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111522581A (en) * 2020-04-22 2020-08-11 山东师范大学 Enhanced code annotation automatic generation method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
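Research on machine-learning-based malicious code detection and classification techniques; 刘浏 (Liu Liu); China Doctoral Dissertations Full-text Database; 20200215 (No. 2); pp. 15-73 *
Algorithm identification technique for malicious programs based on offline assembly instruction flow analysis; 赵晶玲 (Zhao Jingling) et al.; Journal of Tsinghua University (Science and Technology); 20160531; Vol. 56 (No. 5); pp. 484-492 *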

Also Published As

Publication number Publication date
CN112596736A (en) 2021-04-02

Similar Documents

Publication Publication Date Title
CN112596736B (en) Semantic-based cross-instruction architecture binary code similarity detection method
CN108446540B (en) Program code plagiarism type detection method and system based on source code multi-label graph neural network
CN113312500B (en) Method for constructing event map for safe operation of dam
CN109657239A (en) The Chinese name entity recognition method learnt based on attention mechanism and language model
CN115168856B (en) Binary code similarity detection method and Internet of things firmware vulnerability detection method
CN110309511B (en) Shared representation-based multitask language analysis system and method
CN113282713B (en) Event trigger detection method based on difference neural representation model
CN113010209A (en) Binary code similarity comparison technology for resisting compiling difference
CN115357719B (en) Power audit text classification method and device based on improved BERT model
CN112231472A (en) Judicial public opinion sensitive information identification method integrated with domain term dictionary
CN113900923A (en) System and method for checking similarity of binary functions of cross-instruction set architecture
CN113742733A (en) Reading understanding vulnerability event trigger word extraction and vulnerability type identification method and device
CN115033895B (en) Binary program supply chain safety detection method and device
CN114416159B (en) API recommendation method and device based on information enhancement calling sequence
CN114742069A (en) Code similarity detection method and device
CN112818698B (en) Fine-grained user comment sentiment analysis method based on dual-channel model
CN114564953A (en) Emotion target extraction model based on multiple word embedding fusion and attention mechanism
CN114115894A (en) Cross-platform binary code similarity detection method based on semantic space alignment
CN116595189A (en) Zero sample relation triplet extraction method and system based on two stages
CN116595537A (en) Vulnerability detection method of generated intelligent contract based on multi-mode features
Kurach et al. Better text understanding through image-to-text transfer
CN115238115A (en) Image retrieval method, device and equipment based on Chinese data and storage medium
Vu-Manh et al. Improving Vietnamese dependency parsing using distributed word representations
CN112861131A (en) Library function identification detection method and system based on convolution self-encoder
CN112597299A (en) Text entity classification method and device, terminal equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant