CN113204764B - Unsigned binary indirect control flow identification method based on deep learning - Google Patents


Info

Publication number
CN113204764B
Authority
CN
China
Prior art keywords: indirect, layer, call, ith, sample
Prior art date
Legal status: Active
Application number
CN202110363702.6A
Other languages
Chinese (zh)
Other versions
CN113204764A (en)
Inventor
王鹃
王蕴茹
杨梦达
王杰
钟璟
Current Assignee
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202110363702.6A
Publication of CN113204764A
Application granted
Publication of CN113204764B

Classifications

    • G06F 21/563 — Static detection of computer malware by source code analysis
    • G06F 18/2415 — Classification techniques relating to the classification model, based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F 8/436 — Semantic checking (compilation; checking; contextual analysis)
    • G06N 3/044 — Neural networks: recurrent networks, e.g. Hopfield networks
    • G06N 3/047 — Neural networks: probabilistic or stochastic networks
    • G06N 3/08 — Neural networks: learning methods

Abstract

The invention relates to an unsigned binary indirect control flow identification method based on deep learning, which identifies the target basic blocks of indirect jump instructions in a binary through deep learning. The method constructs indirect call branches and function sequences from the instructions, basic blocks and function code blocks in a binary code file, so as to construct triple samples for indirect jumps and indirect calls and to generate an indirect jump training set and an indirect call training set; it constructs neural network indirect jump and indirect call target identification classification models, together with the corresponding classification loss function models; the binary file to be detected is then preprocessed, indirect jump and indirect call samples are generated for its indirect jump and indirect call instructions, target identification is carried out with the trained classification models, and the indirect control flow targets are restored from the classification results. The invention improves the accuracy of identification.

Description

Unsigned binary indirect control flow identification method based on deep learning
Technical Field
The invention belongs to the technical field of software analysis, and particularly relates to an unsigned binary indirect control flow identification method based on deep learning.
Background
Reconstructing a control flow graph from an unsigned binary file is a prerequisite for many problems in software analysis, such as instruction identification and function identification in disassembly. In addition, binary-level control flow graph reconstruction plays an important role in control-flow integrity research, malware classification, provenance tracing and similar problems. In general, statically reconstructing a control flow graph from a binary is a recursive process; however, the process is often hampered by indirect control flow. For direct control flow, the operand of the jump/call instruction is the target address of the instruction in the control flow; for indirect branches, the operands of the corresponding instructions are usually registers or memory locations that store the target addresses, so the targets of indirect control flow are difficult to determine statically. Given the low coverage and low processing efficiency of dynamic analysis methods, statically identifying indirect control flow in unsigned binaries has become an urgent problem to solve.
Existing binary analysis tools typically employ different techniques to handle indirect jumps and indirect calls. Indirect jumps mainly come from jump tables (compiled from switch-case and if-else statements). Existing static jump-table processing methods can be divided into a) heuristic methods based on backward slicing and pattern matching, and b) deeper analysis techniques such as data-flow analysis or value set analysis (VSA). The heuristic methods determine the target basic blocks of a jump table by searching for particular patterns that reveal the base address of the jump table and the bounds of its index. However, pattern-matching-based methods require manually defining different patterns for different compilers and architectures, and therefore lack scalability. Deeper analysis techniques preserve some semantic information and improve identification precision, but their computational cost is high and they are difficult to apply to large-scale applications. For indirect calls, an effective binary-level static analysis technique is still lacking. Indirect calls are mainly produced by compiling function pointers and virtual functions, which implement the dynamic behavior of a program. Mainstream analysis tools typically use constant propagation to resolve the targets of indirect calls: when a constant flows to an indirect call instruction, the constant is regarded as a target of that instruction. However, only a small fraction of indirectly called target functions can be identified in this way.
In view of the above, the invention aims to solve the problem that the targets of indirect control flow instructions in unsigned binaries are difficult to obtain statically, by constructing a semantics-based binary indirect control flow identification scheme. The invention exploits the semantic association between the source and the target of an indirect jump (or call) and automatically identifies the targets of indirect control flow with a deep-learning-based method. In addition, the framework of the invention does not need different techniques to handle indirect jumps and indirect calls, and for indirect jumps it achieves accuracy similar to mainstream binary analysis tools.
Disclosure of Invention
Aiming at the above problems, the invention provides an unsigned binary indirect control flow identification method based on deep learning. The method acquires semantic information among the bytes of a binary and, based on the context association between the source and the target of an indirect control flow, constructs a deep-learning-centered binary indirect control flow target recognition scheme, comprising the following specific steps:
step 1: an original binary code file is introduced, a plurality of bytes in the original binary code file form a plurality of instruction code blocks, the plurality of instruction code blocks form a plurality of basic block code blocks, the plurality of basic block code blocks form a plurality of function code blocks, an indirect call branch and a function sequence are constructed according to the basic block code blocks and the function code blocks, an indirect jump triple sample and an indirect call triple sample are further constructed, the indirect jump triple sample and the indirect call triple sample are respectively marked, and an indirect jump training set and an indirect call training set are generated;
step 2: establishing a neural network indirect jump target identification classification model, sequentially inputting each indirect jump triple sample in an indirect jump training set into the neural network indirect jump target identification classification model, further classifying to obtain a corresponding triple sample prediction result, further combining an indirect jump sample label and a prediction label of the classification model to establish a neural network indirect jump target identification classification loss function model, obtaining an optimization parameter set of a network through optimization training, and establishing the trained neural network indirect jump target identification classification model according to the network optimization parameter set; establishing a neural network indirect call target identification classification model, sequentially inputting each indirectly called triple sample in an indirect call training set into the neural network indirect call target identification classification model, further classifying to obtain a corresponding triple sample prediction result, further establishing a neural network indirect call target identification classification loss function model by combining an indirect call sample label and a label predicted by the classification model, obtaining an optimization parameter set of the network through optimization training, and establishing the trained neural network indirect call target identification classification model according to the network optimization parameter set;
Step 3: extracting, through step 1, the instruction code blocks, basic block code blocks and function code blocks of the binary to be detected, and judging whether each instruction code block in the binary to be detected is an indirect jump instruction code block or an indirect call instruction code block;
preferably, the original binary code file in step 1 is:
text_i = {c_{i,1}, c_{i,2}, ..., c_{i,L}},  i ∈ [1, K]

where text_i represents the ith original binary code file, K represents the number of original binary code files, L represents the number of bytes in the ith original binary code file, and c_{i,j} represents the jth byte in the ith original binary code file, j ∈ [1, L];
Step 1, a plurality of bytes in the original binary code file form a plurality of instruction code blocks, specifically expressed as:

Ins_{i,k} = {c_{i,sins_k}, c_{i,sins_k+1}, ..., c_{i,sins_k+nins_k-1}},  k ∈ [1, N_ins]

where Ins_{i,k} represents the kth instruction code block in the ith original binary code file, N_ins represents the number of instruction code blocks in the ith original binary code file, sins_k is the index of the byte at which the kth instruction code block starts, nins_k represents the number of bytes in the kth instruction code block, and c_{i,sins_k+j} represents the (j+1)th byte of the kth instruction code block in the ith original binary code file, j ∈ [0, nins_k-1];
Step 1, the plurality of instruction code blocks form a plurality of basic block code blocks, specifically expressed as:

B_{i,m} = {Ins_{i,sbb_m}, Ins_{i,sbb_m+1}, ..., Ins_{i,sbb_m+nbb_m-1}},  m ∈ [1, N_bb]

where B_{i,m} represents the mth basic block code block in the ith original binary code file, N_bb represents the number of basic block code blocks in the ith original binary code file, sbb_m is the index of the instruction code block at which the mth basic block code block starts, nbb_m represents the number of instruction code blocks in the mth basic block code block, and Ins_{i,sbb_m+j} represents the (j+1)th instruction code block of the mth basic block code block in the ith original binary code file, j ∈ [0, nbb_m-1];
Step 1, the plurality of basic block code blocks form a plurality of function code blocks, specifically expressed as:

F_{i,n} = {B_{i,sfunc_n}, B_{i,sfunc_n+1}, ..., B_{i,sfunc_n+nfunc_n-1}},  n ∈ [1, N_func]

where F_{i,n} represents the nth function code block in the ith original binary code file, N_func represents the number of function code blocks in the ith original binary code file, sfunc_n is the index of the basic block code block at which the nth function code block starts, nfunc_n represents the number of basic block code blocks in the nth function code block, and B_{i,sfunc_n+j} represents the (j+1)th basic block code block of the nth function code block in the ith original binary code file, j ∈ [0, nfunc_n-1];
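For illustration only, the byte / instruction / basic block / function hierarchy defined above can be represented with simple container types. The sketch below is an assumption made for readability; the class and field names are hypothetical and do not appear in the patent:

    from dataclasses import dataclass, field
    from typing import List

    # Illustrative containers for the hierarchy text_i -> Ins -> B -> F.
    @dataclass
    class Instruction:            # Ins_{i,k}: a run of consecutive bytes
        raw: bytes

    @dataclass
    class BasicBlock:             # B_{i,m}: a run of consecutive instructions
        instructions: List[Instruction] = field(default_factory=list)

    @dataclass
    class Function:               # F_{i,n}: a run of consecutive basic blocks
        blocks: List[BasicBlock] = field(default_factory=list)

    @dataclass
    class Binary:                 # text_i: one original binary code file
        functions: List[Function] = field(default_factory=list)

    # Example: one function containing one basic block with two instructions.
    f = Function(blocks=[BasicBlock(instructions=[Instruction(b"\x55"),
                                                  Instruction(b"\xc3")])])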
The step 1 of constructing the indirect call branch and the function sequence according to the basic block code block and the function code block is as follows:
the indirect call branch:

Br_{i,m} = {B_{i,entry_m}, e, B_{i,entry_m+1}, ..., e, B_{i,call_m}},  m ∈ [1, N_call]

where Br_{i,m} is the indirect call branch sequence of the mth indirect call instruction code block in the ith original binary code file, N_call represents the number of indirect call instruction code blocks in the ith original binary code file, entry_m is the index of the entry basic block of the mth indirect call branch sequence, B_{i,entry_m+1} is the basic block code block that follows B_{i,entry_m} on the branch, and call_m is the index of the basic block code block in which the mth indirect call instruction code block is located;
the function sequence:

Fs_{i,n} = {B_{i,sfunc_n}, e, B_{i,sfunc_n+1}, ..., e, B_{i,sfunc_n+nfunc_n-1}},  n ∈ [1, N_func]

where Fs_{i,n} is the function sequence corresponding to the function F_{i,n}, and e is the control flow inside the function;
Step 1, the indirect jump triple samples are further constructed as follows:

Jdata_{i,k} = (B_{i,m}, e, B_{i,n}),  k ∈ [1, N_data_jmp]

where Jdata_{i,k} represents the kth indirect jump data sample generated from the ith original binary code file, i.e. the sample corresponding to the kth jump table in the ith original binary code file, N_data_jmp represents the number of indirect jump samples in the ith original binary code file, e represents the control flow inside the function code block, B_{i,m} is the basic block code block in which the indirect jump instruction code block of the kth jump table is located, and B_{i,n} is any basic block code block other than B_{i,m} of the function code block in which the kth jump table is located; that is, assuming B_{i,m} ∈ F_{i,l}, then B_{i,n} ∈ F_{i,l} - {B_{i,m}}, m, n ∈ [1, N_bb];
the kth jump table corresponding to Jdata_{i,k} is structured as follows:

JTable_{i,k} = {B_{i,m} : {B_{i,sjt_k}, B_{i,sjt_k+1}, ..., B_{i,sjt_k+njt_k-1}}}

where sjt_k is the index of the basic block code block at which the kth jump table starts, njt_k represents the number of jump entries in the kth jump table, and B_{i,sjt_k+j} represents the (j+1)th jump entry of the kth jump table in the ith original binary code file, j ∈ [0, njt_k-1];
Step 1, further constructing indirect calling triple samples as follows:
Cdata_{i,k} = (Br_{i,k}, E, Fs_{i,n}),  k ∈ [1, N_data_call]

where Cdata_{i,k} represents the kth indirect call data sample generated from the ith original binary code file, i.e. the sample corresponding to the kth indirect call instruction code block, denoted Ins_{i,l}, in the ith original binary code file; N_data_call represents the number of indirect call samples in the ith original binary code file; E represents the control flow between function code blocks; Br_{i,k} is the indirect call branch of the kth indirect call instruction code block Ins_{i,l} in the ith original binary code file, and Br_{i,k} is constructed with a breadth-first search algorithm; Fs_{i,n} is the function sequence corresponding to the nth function F_{i,n} in the ith original binary code file, F_{i,n} being any address-taken function in the binary code;
CTarget(Ins_{i,l}) is defined as the list of function code blocks actually called by Ins_{i,l}, namely:

CTarget(Ins_{i,l}) = {F_{i,ct1}, F_{i,ct2}, ..., F_{i,ctn}}

where F_{i,ct1}, F_{i,ct2}, ..., F_{i,ctn} are the actual target functions of Ins_{i,l}.
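The indirect call branch Br_{i,k} above is built with a breadth-first search over the control flow graph of the function, from the function entry basic block to the basic block containing the indirect call instruction (see also the algorithm of Fig. 5). The following is only a rough sketch of such a search under assumed data structures; the CFG is a plain successor map and all names are hypothetical:

    from collections import deque
    from typing import Dict, List, Optional

    def indirect_call_branch(cfg: Dict[str, List[str]],
                             entry: str,
                             call_block: str) -> Optional[List[str]]:
        """Breadth-first search from the entry basic block to the basic block
        containing the indirect call; returns the block sequence
        [B_entry, ..., B_call], or None if the call block is unreachable."""
        parent = {entry: None}
        queue = deque([entry])
        while queue:
            block = queue.popleft()
            if block == call_block:
                branch = []                     # walk parents back to the entry
                while block is not None:
                    branch.append(block)
                    block = parent[block]
                return branch[::-1]
            for succ in cfg.get(block, []):
                if succ not in parent:
                    parent[succ] = block
                    queue.append(succ)
        return None

    # Example: entry -> b1 -> b2 (indirect call) in a small CFG.
    cfg = {"entry": ["b1", "b3"], "b1": ["b2"], "b2": [], "b3": []}
    print(indirect_call_branch(cfg, "entry", "b2"))   # ['entry', 'b1', 'b2']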
Step 1, marking the indirect jump triple sample and the indirect call triple sample respectively, and generating an indirect jump training set and an indirect call training set as follows:
For an indirect jump triple sample Jdata_{i,k} = (B_{i,m}, e, B_{i,n}):
if B_{i,n} ∈ JTable_{i,k}[B_{i,m}], then Jdata_{i,k} is labeled Jlabel_{i_k,1}; otherwise it is labeled Jlabel_{i_k,0}.
For an indirect call triple sample Cdata_{i,k} = (Br_{i,k}, E, Fs_{i,n}):
if F_{i,n} ∈ CTarget(Ins_{i,l}), then the sample is labeled Clabel_{i_k,1}; otherwise it is labeled Clabel_{i_k,0}.
Step 1, the indirect jump training set is constructed as:

JDATA = {(Jdata_{1,1}, Jlabel_{1_1,k1}), (Jdata_{1,2}, Jlabel_{1_2,k2}), ..., (Jdata_{K,Ndata_jmp_K}, Jlabel_{K_Ndata_jmp_K,kNjmp})}

where JDATA is the indirect jump training set; (Jdata_{1,1}, Jlabel_{1_1,k1}) is the first sample in the data set, Jdata_{1,1} being the 1st sample in the 1st original binary file and Jlabel_{1_1,k1} its label, with k1 taking the value 0 or 1; (Jdata_{i,j}, Jlabel_{i_j,km}) is the mth sample in the data set, Jdata_{i,j} being the jth sample in the ith original binary file and Jlabel_{i_j,km} its corresponding label, m being the index of the sample in the data set, where i ∈ [1, K], j ∈ [1, Ndata_jmp_i], K is the number of original binary code files, Ndata_jmp_i represents the total number of indirect jump samples of the ith binary, and Njmp is the total number of samples in the indirect jump training set.
Step 1, constructing an indirect call training set, namely:
CDATA = {(Cdata_{1,1}, Clabel_{1_1,k1}), (Cdata_{1,2}, Clabel_{1_2,k2}), ..., (Cdata_{K,Ndata_call_K}, Clabel_{K_Ndata_call_K,kNcall})}

where CDATA is the indirect call training set; (Cdata_{1,1}, Clabel_{1_1,k1}) is the first sample in the data set, Cdata_{1,1} being the 1st sample in the 1st original binary file and Clabel_{1_1,k1} its label, with k1 taking the value 0 or 1; (Cdata_{i,j}, Clabel_{i_j,km}) is the mth sample in the data set, Cdata_{i,j} being the jth sample in the ith original binary file and Clabel_{i_j,km} its corresponding label, m being the index of the sample in the data set, where i ∈ [1, K], j ∈ [1, Ndata_call_i], K is the number of original binary code files, Ndata_call_i represents the total number of indirect call samples of the ith binary, and Ncall is the total number of samples in the indirect call training set.
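As a concrete illustration of the triple samples and labels defined above, the sketch below builds labeled indirect jump samples for one jump table. The token encoding (basic blocks as lists of byte tokens joined by the separator "e") and the function name are assumptions for illustration only; the patent does not prescribe this exact representation:

    from typing import List, Tuple

    def make_jump_samples(jump_block: List[str],
                          function_blocks: List[List[str]],
                          jtable_targets: List[List[str]]) -> List[Tuple[List[str], int]]:
        """Build indirect jump triple samples Jdata = (B_m, e, B_n) with labels:
        1 if B_n is an entry of the jump table (Jlabel_{i_k,1}), else 0."""
        samples = []
        for candidate in function_blocks:
            if candidate is jump_block:          # B_n ranges over F_l - {B_m}
                continue
            triple = jump_block + ["e"] + candidate
            label = 1 if candidate in jtable_targets else 0
            samples.append((triple, label))
        return samples

    # Toy example: one jump-table block and three candidate blocks,
    # two of which are real jump-table entries.
    b_jmp = ["ff", "24", "c5"]
    b1, b2, b3 = ["55", "48"], ["c3"], ["90", "90"]
    print(make_jump_samples(b_jmp, [b_jmp, b1, b2, b3], [b1, b2]))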
Preferably, the neural network indirect jump target identification classification model in the step 2 is formed by serially cascading an embedded layer, a deep bidirectional long-short term memory network, an attention layer and a batch normalization layer;
the embedding layer converts each word from a high-dimensional sparse one-hot vector into a low-dimensional dense vector; the low-dimensional dense vector jmp_embedding_vector of each word is a parameter to be optimized.
The deep bidirectional long short-term memory network is formed by serially cascading a first bidirectional long short-term memory layer, a second bidirectional long short-term memory layer and a random deactivation (dropout) layer in sequence;
the ith bidirectional long short-term memory layer selectively discards data through a gating mechanism, then updates the data by combining it with the old state value memorized by the network to obtain the updated value, and outputs the updated value to the next layer;
the weight of the forget gate of the ith bidirectional long short-term memory layer is jmp_weightsf_lstm_i, which is a parameter to be optimized;
the bias of the forget gate of the ith bidirectional long short-term memory layer is jmp_biasf_lstm_i, which is a parameter to be optimized;
the weight of the input gate of the ith bidirectional long short-term memory layer is jmp_weightsi_lstm_i, which is a parameter to be optimized;
the bias of the input gate of the ith bidirectional long short-term memory layer is jmp_biasi_lstm_i, which is a parameter to be optimized;
the weight of the output gate of the ith bidirectional long short-term memory layer is jmp_weightsc_lstm_i, which is a parameter to be optimized;
the bias of the output gate of the ith bidirectional long short-term memory layer is jmp_biasc_lstm_i, which is a parameter to be optimized;
the weight of the cell state of the ith bidirectional long short-term memory layer is jmp_weightso_lstm_i, which is a parameter to be optimized;
the bias of the cell state of the ith bidirectional long short-term memory layer is jmp_biaso_lstm_i, which is a parameter to be optimized;
the random deactivation (dropout) layer discards the output data of the bidirectional long short-term memory layers with a certain probability to avoid overfitting.
The attention layer alleviates the loss of context caused by vanishing gradients on long sample sequences by giving greater weight to important words;
the weight of the attention layer is jmp_weights_attention, which is a parameter to be optimized;
the bias of the attention layer is jmp_bias_attention, which is a parameter to be optimized;
the context vector of the attention layer is jmp_u_attention, which is a parameter to be optimized.
The batch standardization layer comprises a full connection layer, a batch standardization layer and a normalization index layer;
the fully-connected layer outputs a one-dimensional matrix with the size of W x H, W256 and H1, and is used for integrating the output data of the attention layer and mapping the output data to the sample space of the next batch normalization layer;
the weight of the full connection layer is jmp _ weights _ dense, which is a parameter to be optimized;
the bias of the full connection layer is jmp _ bias _ dense, which is a parameter to be optimized.
The batch normalization layer is used for accelerating the optimization training convergence in the step 2;
the translation parameter of the batch normalization layer is jmp _ shift _ bn which is a parameter to be optimized;
the scaling parameter of the batch normalization layer is jmp _ scale _ bn, which is the parameter to be optimized.
The normalization index layer is used for converting continuous output characteristics of batch normalization layer intoDiscrete predictive features; the layer firstly carries out sigmoid operation on output characteristics of a batch standardization layer, then uses a cross entropy loss function which is more suitable for measuring two probability distribution differences as a measurement function, and optimizes a learning result of an upper layer, so that a final result is a label Jlabel predicted for an ith samplei,1*、Jlabeli,2A probability distribution of i ∈ [1, N ]]N represents the number of samples in the deep learning training set, and the problem is classified into two categories, so that the labels are classified into two categories;
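The architecture described above (embedding layer, two stacked bidirectional LSTM layers with dropout, a word-level attention layer with learned weight, bias and context vector, a fully connected layer of width 256, batch normalization and a sigmoid output) can be sketched roughly as follows in Keras. This is only an illustrative sketch, not the patent's implementation; vocabulary size, embedding width, LSTM width, sequence length and dropout rate are assumed hyperparameters:

    import tensorflow as tf
    from tensorflow.keras import layers, Model

    class WordAttention(layers.Layer):
        """Word-level attention with learned weights, bias and context vector
        (the jmp_weights_attention / jmp_bias_attention / jmp_u_attention
        parameters described above)."""
        def build(self, input_shape):
            d = int(input_shape[-1])
            self.w = self.add_weight(name="weights", shape=(d, d), initializer="glorot_uniform")
            self.b = self.add_weight(name="bias", shape=(d,), initializer="zeros")
            self.u = self.add_weight(name="context", shape=(d, 1), initializer="glorot_uniform")

        def call(self, h):                                # h: (batch, time, d)
            v = tf.tanh(tf.tensordot(h, self.w, axes=1) + self.b)
            score = tf.nn.softmax(tf.tensordot(v, self.u, axes=1), axis=1)
            return tf.reduce_sum(score * h, axis=1)       # weighted sum over time

    def build_classifier(vocab_size=260, embed_dim=64, lstm_units=128, seq_len=512):
        inputs = layers.Input(shape=(seq_len,), dtype="int32")
        x = layers.Embedding(vocab_size, embed_dim)(inputs)               # embedding layer
        x = layers.Bidirectional(layers.LSTM(lstm_units, return_sequences=True))(x)
        x = layers.Bidirectional(layers.LSTM(lstm_units, return_sequences=True))(x)
        x = layers.Dropout(0.5)(x)                                        # random deactivation (dropout) layer
        x = WordAttention()(x)                                            # attention layer
        x = layers.Dense(256)(x)                                          # fully connected layer, W = 256
        x = layers.BatchNormalization()(x)                                # batch normalization layer
        outputs = layers.Dense(1, activation="sigmoid")(x)                # sigmoid output for the two-class labels
        return Model(inputs, outputs)

    model = build_classifier()
    model.summary()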
The neural network indirect jump target identification classification loss function model is a cross-entropy loss function, specifically defined as follows. For the ith sample, the network predicts a probability distribution ŷ^(i) = (ŷ^(i)_1, ŷ^(i)_2) over whether the indirect control flow of the ith sample is correct, where ŷ^(i)_j is the predicted probability value corresponding to the label Jlabel_{i,j}, i ∈ [1, N]. The true label probability distribution y^(i) encodes, for step 2, whether the indirect control flow of the ith sample is correct via the labels Jlabel_{i,1} and Jlabel_{i,2}: if the label of the ith sample is Jlabel_{i,j}, the corresponding probability value y^(i)_j is set to one, and the probability value y^(i)_k of the other label Jlabel_{i,k} (k ≠ j) is zero.
The loss function is defined as:

l(Θ) = -(1/N) Σ_{i=1}^{N} Σ_{j} y^(i)_j · log ŷ^(i)_j

where N is the total number of training samples; the cross-entropy loss function l(Θ) requires computing the per-sample term for all training samples and averaging them. The training objective of the neural network is to make the predicted probability distribution ŷ^(i) as close as possible to the true label probability distribution y^(i), i.e. to minimize the cross-entropy loss function l(Θ); finally, the probability of the predicted classification is obtained by calculation.
The network parameters are optimized with the Adam optimization algorithm, and the network optimization parameter set of step 2 is obtained as follows:
the vector representation of each word is jmp_embedding_vector_best;
for the ith bidirectional long short-term memory layer:
the optimized weight parameters are jmp_weightsf_lstm_best_i, jmp_weightsi_lstm_best_i, jmp_weightsc_lstm_best_i and jmp_weightso_lstm_best_i;
the optimized bias parameters are jmp_biasf_lstm_best_i, jmp_biasi_lstm_best_i, jmp_biasc_lstm_best_i and jmp_biaso_lstm_best_i;
for the attention layer:
the optimized parameters comprise the weight jmp_weights_attention_best, the bias jmp_bias_attention_best and the context vector jmp_u_attention_best;
the optimized weight parameter of the fully connected layer is jmp_weights_dense_best;
the optimized bias parameter of the fully connected layer is jmp_bias_dense_best;
the optimized translation parameter of the batch normalization layer is jmp_shift_bn_best;
the optimized scaling parameter of the batch normalization layer is jmp_scale_bn_best;
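Given those definitions, training the classifier amounts to minimizing the cross-entropy loss l(Θ) over the labeled triple samples with the Adam optimizer. The sketch below continues the illustrative Keras model above; the placeholder data, batch size, epoch count and file name are assumptions:

    import numpy as np
    import tensorflow as tf

    model = build_classifier()   # classifier sketched after the architecture description above

    # Placeholder stand-ins for the encoded JDATA samples and their Jlabel values.
    x_train = np.random.randint(0, 260, size=(1024, 512))
    y_train = np.random.randint(0, 2, size=(1024, 1))

    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  loss="binary_crossentropy",              # cross-entropy loss l(Θ)
                  metrics=["accuracy"])
    model.fit(x_train, y_train, batch_size=64, epochs=10, validation_split=0.1)

    # The trained weights (embedding vectors, LSTM gate weights and biases,
    # attention weights, dense and batch-normalization parameters) play the role
    # of the optimized "..._best" parameter set described above.
    model.save_weights("jmp_classifier.weights.h5")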
Step 2, the neural network indirect call target identification classification model is formed by serially cascading an embedding layer, a deep bidirectional long short-term memory network, an attention layer and a batch normalization block;
the embedding layer converts each word from a high-dimensional sparse one-hot vector into a low-dimensional dense vector; the low-dimensional dense vector call_embedding_vector of each word is a parameter to be optimized.
The bidirectional long short-term memory network is formed by serially cascading a first bidirectional long short-term memory layer, a second bidirectional long short-term memory layer and a random deactivation (dropout) layer in sequence;
the ith bidirectional long short-term memory layer selectively discards data through a gating mechanism, then updates the data by combining it with the old state value memorized by the network to obtain the updated value, and outputs the updated value to the next layer;
the weight of the forget gate of the ith bidirectional long short-term memory layer is call_weightsf_lstm_i, which is a parameter to be optimized;
the bias of the forget gate of the ith bidirectional long short-term memory layer is call_biasf_lstm_i, which is a parameter to be optimized;
the weight of the input gate of the ith bidirectional long short-term memory layer is call_weightsi_lstm_i, which is a parameter to be optimized;
the bias of the input gate of the ith bidirectional long short-term memory layer is call_biasi_lstm_i, which is a parameter to be optimized;
the weight of the output gate of the ith bidirectional long short-term memory layer is call_weightsc_lstm_i, which is a parameter to be optimized;
the bias of the output gate of the ith bidirectional long short-term memory layer is call_biasc_lstm_i, which is a parameter to be optimized;
the weight of the cell state of the ith bidirectional long short-term memory layer is call_weightso_lstm_i, which is a parameter to be optimized;
the bias of the cell state of the ith bidirectional long short-term memory layer is call_biaso_lstm_i, which is a parameter to be optimized;
the random deactivation (dropout) layer discards the output data of the bidirectional long short-term memory layers with a certain probability to avoid overfitting.
The attention layer alleviates the loss of context caused by vanishing gradients on long sample sequences by giving greater weight to important words;
the weight of the attention layer is call_weights_attention, which is a parameter to be optimized;
the bias of the attention layer is call_bias_attention, which is a parameter to be optimized;
the context vector of the attention layer is call_u_attention, which is a parameter to be optimized.
The batch normalization block comprises a fully connected layer, a batch normalization layer and a normalized exponential layer;
the fully connected layer outputs a one-dimensional matrix of size W x H, with W = 256 and H = 1, and integrates the output data of the attention layer and maps it to the sample space of the following batch normalization layer;
the weight of the fully connected layer is call_weights_dense, which is a parameter to be optimized;
the bias of the fully connected layer is call_bias_dense, which is a parameter to be optimized.
The batch normalization layer accelerates the convergence of the optimization training of step 2;
the translation parameter of the batch normalization layer is call_shift_bn, which is a parameter to be optimized;
the scaling parameter of the batch normalization layer is call_scale_bn, which is a parameter to be optimized.
The normalized exponential layer converts the continuous output features of the batch normalization layer into discrete prediction features; this layer first performs a sigmoid operation on the output features of the batch normalization layer, and then uses a cross-entropy loss function, which is better suited to measuring the difference between two probability distributions, as the metric function to optimize the learning result of the previous layer, so that the final result is a probability distribution over the labels Clabel_{i,1}* and Clabel_{i,2}* predicted for the ith sample, i ∈ [1, N], where N represents the number of samples in the deep learning training set; since the problem is a binary classification problem, the labels fall into two classes;
The neural network indirect call target identification classification loss function model is a cross-entropy loss function, specifically defined as follows. For the ith sample, the network predicts a probability distribution ŷ^(i) = (ŷ^(i)_1, ŷ^(i)_2) over whether the indirect control flow of the ith sample is correct, where ŷ^(i)_j is the predicted probability value corresponding to the label Clabel_{i,j}, i ∈ [1, N]. The true label probability distribution y^(i) encodes, for step 2, whether the indirect control flow of the ith sample is correct via the labels Clabel_{i,1} and Clabel_{i,2}: if the label of the ith sample is Clabel_{i,j}, the corresponding probability value y^(i)_j is set to one, and the probability value y^(i)_k of the other label Clabel_{i,k} (k ≠ j) is zero.
The loss function is defined as:

l(Θ) = -(1/N) Σ_{i=1}^{N} Σ_{j} y^(i)_j · log ŷ^(i)_j

where N is the total number of training samples; the cross-entropy loss function l(Θ) requires computing the per-sample term for all training samples and averaging them. The training objective of the neural network is to make the predicted probability distribution ŷ^(i) as close as possible to the true label probability distribution y^(i), i.e. to minimize the cross-entropy loss function l(Θ); finally, the probability of the predicted classification is obtained by calculation.
The network parameters are optimized with the Adam optimization algorithm, and the network optimization parameter set of step 2 is obtained as follows:
the vector representation of each word is call_embedding_vector_best;
for the ith bidirectional long short-term memory layer:
the optimized weight parameters are call_weightsf_lstm_best_i, call_weightsi_lstm_best_i, call_weightsc_lstm_best_i and call_weightso_lstm_best_i;
the optimized bias parameters are call_biasf_lstm_best_i, call_biasi_lstm_best_i, call_biasc_lstm_best_i and call_biaso_lstm_best_i;
for the attention layer:
the optimized parameters comprise the weight call_weights_attention_best, the bias call_bias_attention_best and the context vector call_u_attention_best;
the optimized weight parameter of the fully connected layer is call_weights_dense_best;
the optimized bias parameter of the fully connected layer is call_bias_dense_best;
the optimized translation parameter of the batch normalization layer is call_shift_bn_best;
the optimized scaling parameter of the batch normalization layer is call_scale_bn_best;
Preferably, step 3 judges whether an instruction code block in the binary to be detected is an indirect jump instruction code block or an indirect call instruction code block, specifically:
if the instruction code block in the binary to be detected is an indirect jump instruction code block, the function code block in which the indirect jump instruction code block is located, the basic block code blocks and the corresponding jump table are preprocessed through step 1 to obtain indirect jump triple samples of the binary to be detected; the trained neural network indirect jump target identification classification model defined in step 2 then predicts the sample labels of the binary to be detected, and the target basic block code blocks of each indirect jump instruction code block are restored from the sample labels according to the triple definition of step 1: if the kth indirect jump data sample Jdata_{i,k} = (B_{i,m}, e, B_{i,n}) in the ith original binary file is predicted to have the sample label Jlabel_{i_k,1}, then B_{i,n} is a target basic block code block;
if the instruction code block in the binary to be detected is an indirect call instruction code block, the indirect call instruction code block, the original binary file in which it is located and the function code blocks in which it is located are preprocessed through step 1 to obtain the indirect call branch and the function sequences of the binary to be detected, from which the indirect call triple samples are formed as in step 1; the trained neural network indirect call target identification classification model defined in step 2 then predicts the sample labels of the binary to be detected, and the target function code blocks of each indirect call instruction code block are restored from the sample labels according to the triple definition of step 1: if the kth indirect call data sample Cdata_{i,k} = (Br_{i,k}, E, Fs_{i,n}) in the ith original binary file is predicted to have the sample label Clabel_{i_k,1}, then the function F_{i,n} corresponding to Fs_{i,n} is a target function code block.
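A minimal sketch of this detection step: the candidate triples extracted from the binary to be detected are scored by the trained classifier, and the positively classified candidates are taken as the recovered targets. The threshold, the assumed encode_triple helper and all variable names are illustrative only:

    import numpy as np

    def recover_targets(model, encoded_candidates, candidate_ids, threshold=0.5):
        """Score encoded triple samples (one per candidate target) with the
        trained classifier; return the candidates predicted as true targets
        (label Jlabel_{i_k,1} or Clabel_{i_k,1})."""
        probs = model.predict(np.asarray(encoded_candidates), verbose=0).ravel()
        return [cid for cid, p in zip(candidate_ids, probs) if p >= threshold]

    # Usage sketch: encode_triple (assumed) turns a (source, e/E, candidate)
    # triple into a fixed-length token id sequence matching the training encoding.
    # targets = recover_targets(jmp_model,
    #                           [encode_triple(b_jmp, b) for b in candidate_blocks],
    #                           candidate_ids=[b.id for b in candidate_blocks])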
Different from traditional methods that recover indirect control flow targets with techniques such as data-flow analysis, the method starts from the semantic association between the source of an indirect control flow and its target object, treats binary code analogously to natural language, and constructs an automatic, deep-learning-based identification framework for the targets of binary control flow instructions.
In order to fully retain the context association information in a binary, the invention provides a triple sample encoding scheme, an indirect call branch construction algorithm based on breadth-first search, and a function sequence construction method.
The invention provides a method for acquiring basic indirect control flow information based on intermediate code, and constructs a mapping mechanism from the intermediate code to the binary.
The invention learns the semantic association features between the source and the target of binary indirect control flow through a Bi-LSTM + Attention neural network model, and can thereby identify the target basic blocks or functions of jump tables and function pointers in a binary. The model has been verified to provide high identification accuracy.
Different from existing static analysis methods, the method does not need to disassemble the binary, and can recover the target objects of indirect control flow instructions through a pre-trained classifier after only a small amount of data preprocessing.
Drawings
FIG. 1 is a layout of the system design framework of the present invention.
FIG. 2 is a flow diagram of a binary indirect control flow extraction module of an embodiment of the present invention.
FIG. 3 is a schematic structural diagram of a Bi-LSTM + attention neural network constructed according to an embodiment of the present invention.
FIG. 4 is a flow diagram of an indirect control flow instruction target classification detection module of an embodiment of the invention.
Fig. 5 is a flowchart of an indirect call branch generation algorithm according to an embodiment of the present invention.
Detailed Description
The technical scheme of the invention is explained in detail below with reference to the drawings and an embodiment.
The following describes the embodiments of the present invention with reference to fig. 1 to 5:
step 1: introducing an original binary code file, wherein a plurality of bytes in the original binary code file form a plurality of instruction code blocks, the plurality of instruction code blocks form a plurality of basic block code blocks, the plurality of basic block code blocks form a plurality of function code blocks, an indirect call branch and a function sequence are constructed according to the basic block code blocks and the function code blocks, an indirect jump triple sample and an indirect call triple sample are further constructed, the indirect jump triple sample and the indirect call triple sample are respectively marked, and an indirect jump training set and an indirect call training set are generated;
step 1, the original binary code file is:
text_i = {c_{i,1}, c_{i,2}, ..., c_{i,L}},  i ∈ [1, K]

where text_i represents the ith original binary code file, K represents the number of original binary code files, L represents the number of bytes in the ith original binary code file, and c_{i,j} represents the jth byte in the ith original binary code file, j ∈ [1, L];
Step 1, a plurality of bytes in the original binary code file form a plurality of instruction code blocks, specifically expressed as:

Ins_{i,k} = {c_{i,sins_k}, c_{i,sins_k+1}, ..., c_{i,sins_k+nins_k-1}},  k ∈ [1, N_ins]

where Ins_{i,k} represents the kth instruction code block in the ith original binary code file, N_ins represents the number of instruction code blocks in the ith original binary code file, sins_k is the index of the byte at which the kth instruction code block starts, nins_k represents the number of bytes in the kth instruction code block, and c_{i,sins_k+j} represents the (j+1)th byte of the kth instruction code block in the ith original binary code file, j ∈ [0, nins_k-1];
Step 1, the plurality of instruction code blocks form a plurality of basic block code blocks, specifically expressed as:

B_{i,m} = {Ins_{i,sbb_m}, Ins_{i,sbb_m+1}, ..., Ins_{i,sbb_m+nbb_m-1}},  m ∈ [1, N_bb]

where B_{i,m} represents the mth basic block code block in the ith original binary code file, N_bb represents the number of basic block code blocks in the ith original binary code file, sbb_m is the index of the instruction code block at which the mth basic block code block starts, nbb_m represents the number of instruction code blocks in the mth basic block code block, and Ins_{i,sbb_m+j} represents the (j+1)th instruction code block of the mth basic block code block in the ith original binary code file, j ∈ [0, nbb_m-1];
Step 1, the plurality of basic block code blocks form a plurality of function code blocks, specifically expressed as:

F_{i,n} = {B_{i,sfunc_n}, B_{i,sfunc_n+1}, ..., B_{i,sfunc_n+nfunc_n-1}},  n ∈ [1, N_func]

where F_{i,n} represents the nth function code block in the ith original binary code file, N_func represents the number of function code blocks in the ith original binary code file, sfunc_n is the index of the basic block code block at which the nth function code block starts, nfunc_n represents the number of basic block code blocks in the nth function code block, and B_{i,sfunc_n+j} represents the (j+1)th basic block code block of the nth function code block in the ith original binary code file, j ∈ [0, nfunc_n-1];
The step 1 of constructing the indirect call branch and the function sequence according to the basic block code block and the function code block is as follows:
the indirect call branch:

Br_{i,m} = {B_{i,entry_m}, e, B_{i,entry_m+1}, ..., e, B_{i,call_m}},  m ∈ [1, N_call]

where Br_{i,m} is the indirect call branch sequence of the mth indirect call instruction code block in the ith original binary code file, N_call represents the number of indirect call instruction code blocks in the ith original binary code file, entry_m is the index of the entry basic block of the mth indirect call branch sequence, B_{i,entry_m+1} is the basic block code block that follows B_{i,entry_m} on the branch, and call_m is the index of the basic block code block in which the mth indirect call instruction code block is located;
the function sequence:

Fs_{i,n} = {B_{i,sfunc_n}, e, B_{i,sfunc_n+1}, ..., e, B_{i,sfunc_n+nfunc_n-1}},  n ∈ [1, N_func]

where Fs_{i,n} is the function sequence corresponding to the function F_{i,n}, and e is the control flow inside the function;
Step 1, the indirect jump triple samples are further constructed as follows:

Jdata_{i,k} = (B_{i,m}, e, B_{i,n}),  k ∈ [1, N_data_jmp]

where Jdata_{i,k} represents the kth indirect jump data sample generated from the ith original binary code file, i.e. the sample corresponding to the kth jump table in the ith original binary code file, N_data_jmp represents the number of indirect jump samples in the ith original binary code file, e represents the control flow inside the function code block, B_{i,m} is the basic block code block in which the indirect jump instruction code block of the kth jump table is located, and B_{i,n} is any basic block code block other than B_{i,m} of the function code block in which the kth jump table is located; that is, assuming B_{i,m} ∈ F_{i,l}, then B_{i,n} ∈ F_{i,l} - {B_{i,m}}, m, n ∈ [1, N_bb];
the kth jump table corresponding to Jdata_{i,k} is structured as follows:

JTable_{i,k} = {B_{i,m} : {B_{i,sjt_k}, B_{i,sjt_k+1}, ..., B_{i,sjt_k+njt_k-1}}}

where sjt_k is the index of the basic block code block at which the kth jump table starts, njt_k represents the number of jump entries in the kth jump table, and B_{i,sjt_k+j} represents the (j+1)th jump entry of the kth jump table in the ith original binary code file, j ∈ [0, njt_k-1];
Step 1, further constructing indirect calling triple samples as follows:
Cdata_{i,k} = (Br_{i,k}, E, Fs_{i,n}),  k ∈ [1, N_data_call]

where Cdata_{i,k} represents the kth indirect call data sample generated from the ith original binary code file, i.e. the sample corresponding to the kth indirect call instruction code block, denoted Ins_{i,l}, in the ith original binary code file; N_data_call represents the number of indirect call samples in the ith original binary code file; E represents the control flow between function code blocks; Br_{i,k} is the indirect call branch of the kth indirect call instruction code block Ins_{i,l} in the ith original binary code file, and Br_{i,k} is constructed with a breadth-first search algorithm; Fs_{i,n} is the function sequence corresponding to the nth function F_{i,n} in the ith original binary code file, F_{i,n} being any address-taken function in the binary code;
CTarget(Ins_{i,l}) is defined as the list of function code blocks actually called by Ins_{i,l}, namely:

CTarget(Ins_{i,l}) = {F_{i,ct1}, F_{i,ct2}, ..., F_{i,ctn}}

where F_{i,ct1}, F_{i,ct2}, ..., F_{i,ctn} are the actual target functions of Ins_{i,l}.
Step 1, marking the indirect jump triple sample and the indirect call triple sample respectively, and generating an indirect jump training set and an indirect call training set as follows:
For an indirect jump triple sample Jdata_{i,k} = (B_{i,m}, e, B_{i,n}):
if B_{i,n} ∈ JTable_{i,k}[B_{i,m}], then Jdata_{i,k} is labeled Jlabel_{i_k,1}; otherwise it is labeled Jlabel_{i_k,0}.
For an indirect call triple sample Cdata_{i,k} = (Br_{i,k}, E, Fs_{i,n}):
if F_{i,n} ∈ CTarget(Ins_{i,l}), then the sample is labeled Clabel_{i_k,1}; otherwise it is labeled Clabel_{i_k,0}.
Step 1, the indirect jump training set is constructed as:

JDATA = {(Jdata_{1,1}, Jlabel_{1_1,k1}), (Jdata_{1,2}, Jlabel_{1_2,k2}), ..., (Jdata_{K,Ndata_jmp_K}, Jlabel_{K_Ndata_jmp_K,kNjmp})}

where JDATA is the indirect jump training set; (Jdata_{1,1}, Jlabel_{1_1,k1}) is the first sample in the data set, Jdata_{1,1} being the 1st sample in the 1st original binary file and Jlabel_{1_1,k1} its label, with k1 taking the value 0 or 1; (Jdata_{i,j}, Jlabel_{i_j,km}) is the mth sample in the data set, Jdata_{i,j} being the jth sample in the ith original binary file and Jlabel_{i_j,km} its corresponding label, m being the index of the sample in the data set, where i ∈ [1, K], j ∈ [1, Ndata_jmp_i], K is the number of original binary code files, Ndata_jmp_i represents the total number of indirect jump samples of the ith binary, and Njmp is the total number of samples in the indirect jump training set.
Step 1, constructing an indirect call training set, namely:
CDATA = {(Cdata_{1,1}, Clabel_{1_1,k1}), (Cdata_{1,2}, Clabel_{1_2,k2}), ..., (Cdata_{K,Ndata_call_K}, Clabel_{K_Ndata_call_K,kNcall})}

where CDATA is the indirect call training set; (Cdata_{1,1}, Clabel_{1_1,k1}) is the first sample in the data set, Cdata_{1,1} being the 1st sample in the 1st original binary file and Clabel_{1_1,k1} its label, with k1 taking the value 0 or 1; (Cdata_{i,j}, Clabel_{i_j,km}) is the mth sample in the data set, Cdata_{i,j} being the jth sample in the ith original binary file and Clabel_{i_j,km} its corresponding label, m being the index of the sample in the data set, where i ∈ [1, K], j ∈ [1, Ndata_call_i], K is the number of original binary code files, Ndata_call_i represents the total number of indirect call samples of the ith binary, and Ncall is the total number of samples in the indirect call training set.
Step 2: establishing a neural network indirect jump target identification classification model, sequentially inputting each indirect jump triple sample in an indirect jump training set into the neural network indirect jump target identification classification model, further classifying to obtain a corresponding triple sample prediction result, further combining an indirect jump sample label and a prediction label of the classification model to establish a neural network indirect jump target identification classification loss function model, obtaining an optimization parameter set of a network through optimization training, and establishing the trained neural network indirect jump target identification classification model according to the network optimization parameter set; establishing a neural network indirect call target identification classification model, sequentially inputting each indirectly called triple sample in an indirect call training set into the neural network indirect call target identification classification model, further classifying to obtain a corresponding triple sample prediction result, further establishing a neural network indirect call target identification classification loss function model by combining an indirect call sample label and a label predicted by the classification model, obtaining an optimization parameter set of the network through optimization training, and establishing the trained neural network indirect call target identification classification model according to the network optimization parameter set;
Step 2, the neural network indirect jump target identification classification model is formed by serially cascading an embedding layer, a deep bidirectional long short-term memory network, an attention layer and a batch normalization block;
the embedding layer converts each word from a high-dimensional sparse one-hot vector into a low-dimensional dense vector; the low-dimensional dense vector jmp_embedding_vector of each word is a parameter to be optimized.
The deep bidirectional long short-term memory network is formed by serially cascading a first bidirectional long short-term memory layer, a second bidirectional long short-term memory layer and a random deactivation (dropout) layer in sequence;
the ith bidirectional long short-term memory layer selectively discards data through a gating mechanism, then updates the data by combining it with the old state value memorized by the network to obtain the updated value, and outputs the updated value to the next layer;
the weight of the forget gate of the ith bidirectional long short-term memory layer is jmp_weightsf_lstm_i, which is a parameter to be optimized;
the bias of the forget gate of the ith bidirectional long short-term memory layer is jmp_biasf_lstm_i, which is a parameter to be optimized;
the weight of the input gate of the ith bidirectional long short-term memory layer is jmp_weightsi_lstm_i, which is a parameter to be optimized;
the bias of the input gate of the ith bidirectional long short-term memory layer is jmp_biasi_lstm_i, which is a parameter to be optimized;
the weight of the output gate of the ith bidirectional long short-term memory layer is jmp_weightsc_lstm_i, which is a parameter to be optimized;
the bias of the output gate of the ith bidirectional long short-term memory layer is jmp_biasc_lstm_i, which is a parameter to be optimized;
the weight of the cell state of the ith bidirectional long short-term memory layer is jmp_weightso_lstm_i, which is a parameter to be optimized;
the bias of the cell state of the ith bidirectional long short-term memory layer is jmp_biaso_lstm_i, which is a parameter to be optimized;
the random deactivation (dropout) layer discards the output data of the bidirectional long short-term memory layers with a certain probability to avoid overfitting.
The attention layer alleviates the loss of context caused by vanishing gradients on long sample sequences by giving greater weight to important words;
the weight of the attention layer is jmp_weights_attention, which is a parameter to be optimized;
the bias of the attention layer is jmp_bias_attention, which is a parameter to be optimized;
the context vector of the attention layer is jmp_u_attention, which is a parameter to be optimized.
The batch standardization layer comprises a full connection layer, a batch standardization layer and a normalization index layer;
the fully-connected layer outputs a one-dimensional matrix with the size of W x H, W256 and H1, and is used for integrating the output data of the attention layer and mapping the output data to the sample space of the next batch normalization layer;
the weight of the full connection layer is jmp _ weights _ dense, which is a parameter to be optimized;
the bias of the full connection layer is jmp _ bias _ dense, which is a parameter to be optimized.
The batch normalization layer is used for accelerating the optimization training convergence in the step 2;
the translation parameter of the batch normalization layer is jmp _ shift _ bn which is a parameter to be optimized;
the scaling parameter of the batch normalization layer is jmp _ scale _ bn, which is the parameter to be optimized.
The normalization index layer is used for converting continuous output characteristics of the batch normalization layer into discrete prediction characteristics; the method comprises the steps of firstly carrying out sigmoid operation on output characteristics of a batch standardization layer, then using a cross entropy loss function which is more suitable for measuring two probability distribution differences as a measurement function, and optimizing a learning result of an upper layer, so that a final result is a label Jlabel predicted according to an ith samplei,1*、Jlabeli,2A probability distribution of i ∈ [1, N ]]N represents the number of samples in the deep learning training set, and the problem is classified into two categories, so that the labels are classified into two categories;
the neural network indirect jump target identification classification loss function model is a cross entropy loss function, which is specifically defined as follows: the cross entropy of the ith sample is

l^{(i)}(\Theta) = -\sum_{j} y^{(i)}_{j} \log \hat{y}^{(i)}_{j}

wherein N is the total number of training samples; \hat{y}^{(i)} is the predicted probability distribution over the labels indicating whether the indirect control flow of the ith sample is correct, and \hat{y}^{(i)}_{j} is the probability value predicted for the jth label; the true label probability distribution is y^{(i)}: for the labels Jlabeli,1, Jlabeli,2 used in step 2 to indicate whether the indirect control flow of the ith sample is correct, if the label of the ith sample is Jlabeli,j, the corresponding probability value y^{(i)}_{j} is set to one, and the probability value y^{(i)}_{k} of the other label Jlabeli,k (k ≠ j) is zero;

the loss function is defined as:

l(\Theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j} y^{(i)}_{j} \log \hat{y}^{(i)}_{j}

that is, the cross entropy loss function l(\Theta) computes the value l^{(i)}(\Theta) for all training samples and averages them; the training target of the neural network is set so that the predicted probability distribution \hat{y}^{(i)} is as close as possible to the real label probability distribution y^{(i)}, i.e. so as to minimize the cross entropy loss function l(\Theta); finally, the probability of the predicted classification is calculated;
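A minimal numerical sketch of this objective, assuming the two-class label distributions described above, might read as follows (illustrative only, not the patented code):

import numpy as np

def cross_entropy_loss(y_true, y_pred, eps=1e-12):
    # y_true: (N, 2) one-hot true label distributions y^(i)
    # y_pred: (N, 2) predicted probability distributions y_hat^(i)
    y_pred = np.clip(y_pred, eps, 1.0)
    per_sample = -(y_true * np.log(y_pred)).sum(axis=1)   # l^(i)(Theta) for each sample
    return per_sample.mean()                              # average over the N samples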
optimizing the network parameters by using the Adam optimization algorithm, the network optimization parameter set of step 2 is obtained as follows:
the vector representation of each word is jmp_embedding_vector_best;
for the ith bidirectional long-short term memory layer:
the optimized weight parameters are jmp_weightsf_lstm_best_i, jmp_weightsi_lstm_best_i, jmp_weightsc_lstm_best_i and jmp_weightso_lstm_best_i respectively;
the optimized bias parameters are jmp_biasf_lstm_best_i, jmp_biasi_lstm_best_i, jmp_biasc_lstm_best_i and jmp_biaso_lstm_best_i respectively;
for the attention layer:
the optimized parameters comprise the weight jmp_weights_attention_best, the bias jmp_bias_attention_best and the context vector jmp_u_attention_best;
the optimized weight parameter of the fully-connected layer is jmp_weights_dense_best;
the optimized bias parameter of the fully-connected layer is jmp_bias_dense_best;
the optimized translation parameter of the batch normalization layer is jmp_shift_bn_best;
the optimized scaling parameter of the batch normalization layer is jmp_scale_bn_best;
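Purely as a hedged sketch of the cascade described above (embedding, two bidirectional LSTM layers, dropout, attention, fully-connected layer, batch normalization, sigmoid output, trained with Adam and cross entropy), a Keras-style construction could look as follows. All hyperparameter values (embedding size, LSTM units, sequence length, dropout rate) are assumptions, integer byte indices replace the one-hot input described in the text, and the layer names only echo the parameter names above:

import tensorflow as tf
from tensorflow.keras import layers, models

class AttentionPool(layers.Layer):
    # Context-vector attention pooling with a weight matrix, bias and context vector.
    def build(self, input_shape):
        d = int(input_shape[-1])
        self.W = self.add_weight(name="weights_attention", shape=(d, d))
        self.b = self.add_weight(name="bias_attention", shape=(d,))
        self.u = self.add_weight(name="u_attention", shape=(d,))
    def call(self, h):
        v = tf.tanh(tf.tensordot(h, self.W, axes=1) + self.b)
        alpha = tf.nn.softmax(tf.tensordot(v, self.u, axes=1), axis=1)
        return tf.reduce_sum(h * tf.expand_dims(alpha, -1), axis=1)

def build_jmp_model(vocab_size=258, max_len=512, embed_dim=64, lstm_units=128):
    # Illustrative only: embedding -> 2x BiLSTM -> dropout -> attention ->
    # fully-connected -> batch normalization -> sigmoid, trained with Adam.
    inp = layers.Input(shape=(max_len,), dtype="int32")
    x = layers.Embedding(vocab_size, embed_dim)(inp)              # role of jmp_embedding_vector
    x = layers.Bidirectional(layers.LSTM(lstm_units, return_sequences=True))(x)
    x = layers.Bidirectional(layers.LSTM(lstm_units, return_sequences=True))(x)
    x = layers.Dropout(0.5)(x)                                    # random deactivation layer
    x = AttentionPool()(x)
    x = layers.Dense(256)(x)                                      # fully-connected layer, W = 256
    x = layers.BatchNormalization()(x)                            # shift/scale parameters
    out = layers.Dense(1, activation="sigmoid")(x)                # normalization index layer
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model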
The neural network indirect call target identification classification model of step 2 is formed by serially cascading an embedding layer, a deep bidirectional long-short term memory network, an attention layer and a batch normalization layer;
the embedding layer converts words from high-dimensional sparse one-hot vectors into low-dimensional dense vectors; the low-dimensional dense vector call_embedding_vector of each word is a parameter to be optimized.
The deep bidirectional long-short term memory network is formed by sequentially cascading a first bidirectional long-short term memory layer, a second bidirectional long-short term memory layer and a random deactivation layer in series;
the ith bidirectional long-short term memory layer is used for selectively discarding data through a gating mechanism, and then updating the data by combining the old state value memorized by the network to obtain a determined updated value, which is output to the next layer (the standard gate equations are sketched after this list);
the weight of the forget gate of the ith bidirectional long-short term memory layer is call_weightsf_lstm_i, which is a parameter to be optimized;
the bias of the forget gate of the ith bidirectional long-short term memory layer is call_biasf_lstm_i, which is a parameter to be optimized;
the weight of the input gate of the ith bidirectional long-short term memory layer is call_weightsi_lstm_i, which is a parameter to be optimized;
the bias of the input gate of the ith bidirectional long-short term memory layer is call_biasi_lstm_i, which is a parameter to be optimized;
the weight of the output gate of the ith bidirectional long-short term memory layer is call_weightsc_lstm_i, which is a parameter to be optimized;
the bias of the output gate of the ith bidirectional long-short term memory layer is call_biasc_lstm_i, which is a parameter to be optimized;
the weight of the computing unit state of the ith bidirectional long-short term memory layer is call_weightso_lstm_i, which is a parameter to be optimized;
the bias of the computing unit state of the ith bidirectional long-short term memory layer is call_biaso_lstm_i, which is a parameter to be optimized.
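For reference only, a textbook LSTM cell update of the kind these gate weights and biases parameterize can be written as below; the mapping of the standard symbols onto the parameter names above (call_weightsf_lstm_i, call_weightsi_lstm_i, call_weightsc_lstm_i, call_weightso_lstm_i and the corresponding biases) is an assumption of this sketch, not a statement of the patented formulation:

f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)        (forget gate)
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)        (input gate)
\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c) (candidate cell state)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t (cell state update)
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)        (output gate)
h_t = o_t \odot \tanh(c_t)                     (hidden state output to the next layer)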
The random deactivation (dropout) layer is used for discarding the output data of the bidirectional long-short term memory layer with a certain probability so as to avoid overfitting.
The attention layer is used for alleviating the context loss caused by gradient vanishing on long sample sequences by giving greater weight to the important words;
the weight of the attention layer is call_weights_attention, which is a parameter to be optimized;
the bias of the attention layer is call_bias_attention, which is a parameter to be optimized;
the context vector of the attention layer is call_u_attention, which is a parameter to be optimized.
The batch standardization layer comprises a fully-connected layer, a batch normalization layer and a normalization index layer;
the fully-connected layer outputs a one-dimensional matrix of size W × H, with W = 256 and H = 1, and is used for integrating the output data of the attention layer and mapping it into the sample space of the following batch normalization layer;
the weight of the fully-connected layer is call_weights_dense, which is a parameter to be optimized;
the bias of the fully-connected layer is call_bias_dense, which is a parameter to be optimized.
The batch normalization layer is used for accelerating the convergence of the optimization training in step 2;
the translation (shift) parameter of the batch normalization layer is call_shift_bn, which is a parameter to be optimized;
the scaling parameter of the batch normalization layer is call_scale_bn, which is a parameter to be optimized.
The normalization index layer is used for converting the continuous output features of the batch normalization layer into discrete prediction features; it first applies a sigmoid operation to the output features of the batch normalization layer, then uses a cross entropy loss function, which is better suited to measuring the difference between two probability distributions, as the measurement function to optimize the learning result of the upper layers, so that the final result is a probability distribution over the labels Clabeli,1* and Clabeli,2* predicted for the ith sample, i ∈ [1, N], where N represents the number of samples in the deep learning training set; since the problem is a binary classification, the labels fall into two categories;
the neural network indirect call target identification classification loss function model is a cross entropy loss function, which is specifically defined as follows: the cross entropy of the ith sample is

l^{(i)}(\Theta) = -\sum_{j} y^{(i)}_{j} \log \hat{y}^{(i)}_{j}

wherein N is the total number of training samples; \hat{y}^{(i)} is the predicted probability distribution over the labels indicating whether the indirect control flow of the ith sample is correct, and \hat{y}^{(i)}_{j} is the probability value predicted for the jth label; the true label probability distribution is y^{(i)}: for the labels Clabeli,1, Clabeli,2 used in step 2 to indicate whether the indirect control flow of the ith sample is correct, if the label of the ith sample is Clabeli,j, the corresponding probability value y^{(i)}_{j} is set to one, and the probability value y^{(i)}_{k} of the other label Clabeli,k (k ≠ j) is zero;

the loss function is defined as:

l(\Theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j} y^{(i)}_{j} \log \hat{y}^{(i)}_{j}

that is, the cross entropy loss function l(\Theta) computes the value l^{(i)}(\Theta) for all training samples and averages them; the training target of the neural network is set so that the predicted probability distribution \hat{y}^{(i)} is as close as possible to the real label probability distribution y^{(i)}, i.e. so as to minimize the cross entropy loss function l(\Theta); finally, the probability of the predicted classification is calculated;
optimizing the network parameters by using the Adam optimization algorithm, the network optimization parameter set of step 2 is obtained as follows:
the vector representation of each word is call_embedding_vector_best;
for the ith bidirectional long-short term memory layer:
the optimized weight parameters are call_weightsf_lstm_best_i, call_weightsi_lstm_best_i, call_weightsc_lstm_best_i and call_weightso_lstm_best_i respectively;
the optimized bias parameters are call_biasf_lstm_best_i, call_biasi_lstm_best_i, call_biasc_lstm_best_i and call_biaso_lstm_best_i respectively;
for the attention layer:
the optimized parameters comprise the weight call_weights_attention_best, the bias call_bias_attention_best and the context vector call_u_attention_best;
the optimized weight parameter of the fully-connected layer is call_weights_dense_best;
the optimized bias parameter of the fully-connected layer is call_bias_dense_best;
the optimized translation parameter of the batch normalization layer is call_shift_bn_best;
the optimized scaling parameter of the batch normalization layer is call_scale_bn_best;
To sum up, the data set to be tested is input in the form of a three-dimensional matrix, in which the first dimension is the length of the longest sample sequence, the second dimension is the total number of sample sequences to be tested, and the third dimension is the dictionary dimension, which is 258 in the present invention. Each sample passes through the embedding layer and the two bidirectional long-short term memory layers in sequence. The result is passed through the random deactivation (dropout) layer and then into the attention layer. The output of the attention layer is processed by the fully-connected layer and the batch normalization layer. The final output is passed through the sigmoid to obtain the probability that the sample is a normal indirect control flow.
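As a hedged illustration of this input form (the token-to-index mapping, the zero padding convention and the helper name are assumptions, not part of the original disclosure), the three-dimensional matrix could be materialized as:

import numpy as np

def encode_samples(samples, dict_dim=258):
    # samples: list of token-index sequences, one per triple sample to be tested.
    # Returns a matrix of shape (longest sample length, number of samples, dict_dim),
    # matching the three dimensions described above; unused positions stay zero.
    max_len = max(len(s) for s in samples)
    x = np.zeros((max_len, len(samples), dict_dim), dtype=np.float32)
    for j, seq in enumerate(samples):
        for t, tok in enumerate(seq):
            x[t, j, tok] = 1.0            # one-hot over the 258-entry dictionary
    return x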
And step 3: extracting the instruction code blocks in the binary system to be detected, the basic block code blocks in the binary system to be detected and the function code blocks in the binary system to be detected from the binary system to be detected through step 1, and judging whether an instruction code block in the binary system to be detected is an indirect jump instruction code block or an indirect call instruction code block;
if an instruction code block in the binary system to be detected belongs to the indirect jump instruction code blocks, the function code block, basic block code block and corresponding jump table where the indirect jump instruction code block is located are preprocessed through step 1 to obtain the indirect jump triple samples of the binary system to be detected; the sample labels of the binary system to be detected are then predicted through the trained neural network indirect jump target identification classification model defined in step 2, and the target basic block code block of each indirect jump instruction code block is restored from the sample labels based on the triple definition of step 1, i.e. for the kth indirect jump data sample Jdatai,k=(Bi,m,e,Bi,n) in the ith original binary file, if the predicted sample label is Jlabeli_k,1, then Bi,n is a target basic block code block;
if the instruction code block in the binary system to be detected belongs to the indirect call instruction code blocks, the original binary file where the indirect call instruction code block is located and the function code block where it is located are preprocessed through step 1 to obtain the indirect call branches of the binary system to be detected and the function sequences of the binary system to be detected, so as to further form the indirect call triple samples of step 1; the sample labels of the binary system to be detected are then predicted through the trained neural network indirect call target identification classification model defined in step 2, and the target function code block of each indirect call instruction code block is restored from the sample labels based on the triple definition of step 1, i.e. for the kth indirect call data sample Cdatai,k=(Bri,k,E,Fsi,n) in the ith original binary file, if the corresponding predicted sample label is Clabeli_k,1, then the function code block Fi,n corresponding to Fsi,n is a target function code block. A sketch of this recovery logic follows below.
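The recovery step can be sketched as follows; the helper names, data shapes and the 0.5 decision threshold are hypothetical, and the real preprocessing and classification models are those defined in steps 1 and 2:

def recover_jump_targets(jmp_triples, probs, threshold=0.5):
    # jmp_triples: list of (B_m, e, B_n) triples built in step 1 for one indirect jump.
    # probs: per-triple probability from the indirect jump classification model.
    # Returns the basic block code blocks B_n predicted as jump targets.
    return [b_n for (b_m, e, b_n), p in zip(jmp_triples, probs) if p >= threshold]

def recover_call_targets(call_triples, probs, threshold=0.5):
    # call_triples: list of (Br_k, E, Fs_n) triples for one indirect call site.
    # Returns the function sequences Fs_n (hence functions F_n) predicted as call targets.
    return [fs_n for (br_k, e_flow, fs_n), p in zip(call_triples, probs) if p >= threshold]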
The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments or alternatives may be employed by those skilled in the art without departing from the spirit or ambit of the invention as defined in the appended claims.

Claims (4)

1. An unsigned binary indirect control flow identification method based on deep learning is characterized by comprising the following steps:
step 1: introducing an original binary code file, wherein a plurality of bytes in the original binary code file form a plurality of instruction code blocks, the plurality of instruction code blocks form a plurality of basic block code blocks, the plurality of basic block code blocks form a plurality of function code blocks, an indirect call branch and a function sequence are constructed according to the basic block code blocks and the function code blocks, an indirect jump triple sample and an indirect call triple sample are further constructed, the indirect jump triple sample and the indirect call triple sample are respectively marked, and an indirect jump training set and an indirect call training set are generated;
step 2: establishing a neural network indirect jump target identification classification model, sequentially inputting each indirect jump triple sample in an indirect jump training set into the neural network indirect jump target identification classification model, further classifying to obtain a corresponding triple sample prediction result, further combining an indirect jump sample label and a prediction label of the classification model to establish a neural network indirect jump target identification classification loss function model, obtaining an optimization parameter set of a network through optimization training, and establishing the trained neural network indirect jump target identification classification model according to the network optimization parameter set; establishing a neural network indirect call target identification classification model, sequentially inputting each indirectly called triple sample in an indirect call training set into the neural network indirect call target identification classification model, further classifying to obtain a corresponding triple sample prediction result, further establishing a neural network indirect call target identification classification loss function model by combining an indirect call sample label and a label predicted by the classification model, obtaining an optimization parameter set of the network through optimization training, and establishing the trained neural network indirect call target identification classification model according to the network optimization parameter set;
and step 3: and (2) extracting the instruction code block in the binary system to be detected, the basic block code block in the binary system to be detected and the function code block in the binary system to be detected from the binary system to be detected through the step 1, and judging whether the instruction code block in the binary system to be detected is an indirect jump instruction code block or an indirect call instruction code block.
2. The deep learning based unsigned binary indirect control flow identification method of claim 1,
step 1, the original binary code file is:
texti={ci,1,ci,2,...,ci,L}
i∈[1,K]
wherein texti represents the ith original binary code file, K represents the number of original binary code files, L represents the number of bytes in the ith original binary code file, and ci,j represents the jth byte in the ith original binary code file, j ∈ [1, L];
Step 1, a plurality of bytes in the original binary code file form a plurality of instruction code blocks, which is specifically expressed as:
Insi,k={ci,sins_k,ci,sins_k+1,...,ci,sins_k+nins_k-1}
k∈[1,Nins]
wherein Insi,k represents the kth instruction code block in the ith original binary code file, Nins denotes the number of instruction code blocks in the ith original binary code file, sins_k is the index of the starting byte of the kth instruction code block, nins_k denotes the number of bytes in the kth instruction code block, and ci,sins_k+j represents the (sins_k + j + 1)th byte in the kth instruction code block in the ith original binary code file, where j ∈ [0, nins_k - 1];
Step 1, the plurality of instruction code blocks form a plurality of basic block code blocks, which are specifically expressed as:
Bi,m={Insi,sbb_m,Insi,sbb_m+1,...,Insi,sbb_m+nbb_m-1}
m∈[1,Nbb]
wherein Bi,m represents the mth basic block code block in the ith original binary code file, Nbb denotes the number of basic block code blocks in the ith original binary code file, sbb_m is the subscript of the starting instruction code block of the mth basic block code block, nbb_m denotes the number of instruction code blocks in the mth basic block code block, and Insi,sbb_m+j represents the (sbb_m + j + 1)th instruction code block in the mth basic block code block in the ith original binary code file, where j ∈ [0, nbb_m - 1];
Step 1, the specific expression that the plurality of basic block code blocks form a plurality of function code blocks is as follows:
Fi,n={Bi,sfunc_n,Bi,sfunc_n+1,...,Bi,sfunc_n+nfunc_n-1}
n∈[1,Nfunc]
wherein Fi,n represents the nth function code block in the ith original binary code file, Nfunc indicates the number of function code blocks in the ith original binary code file, sfunc_n is the subscript of the starting basic block code block of the nth function code block, nfunc_n represents the number of basic block code blocks in the nth function code block, and Bi,sfunc_n+j represents the (sfunc_n + j + 1)th basic block code block in the nth function code block in the ith original binary code file, where j ∈ [0, nfunc_n - 1];
The step 1 of constructing the indirect call branch and the function sequence according to the basic block code block and the function code block is as follows:
the indirect call branch:
Bri,m={Bi,entry_m,e,Bi,entry_m+1,...,e,Bi,call_m}
m∈[1,Ncall]
wherein Bri,m is the indirect call branch sequence of the mth indirect call instruction code block in the ith original binary code file, Ncall indicates the number of indirect call instruction code blocks in the ith original binary code file, entry_m is the subscript of the entry basic block of the mth indirect call branch sequence, entry_m+1 is the subscript of the basic block code block following Bi,entry_m, and call_m is the subscript of the basic block code block where the mth indirect call instruction code block is located;
the function sequence is as follows:
Fsi,n={Bi,sfunc_n,e,Bi,sfunc_n+1,...,e,Bi,sfunc_n+nfunc_n-1}
n∈[1,Nfunc]
wherein Fsi,n is the function sequence corresponding to the function Fi,n, and e is the control flow inside the function;
step 1, further constructing indirect jump triple samples as follows:
Jdatai,k=(Bi,m,e,Bi,n)
k∈[1,Ndata_jmp]
wherein Jdatai,k represents the kth indirect jump data sample generated from the ith original binary code file, i.e. the sample corresponding to the kth jump table in the ith original binary code file, Ndata_jmp represents the number of indirect jump samples in the ith original binary code file, e represents the control flow inside a function code block, Bi,m is the basic block code block in which the indirect jump instruction code block of the kth jump table is located, and Bi,n is any basic block code block, other than Bi,m, of the function code block in which the kth jump table is located, i.e. assuming Bi,m∈Fi,l, then Bi,n∈Fi,l-{Bi,m}, m,n∈[1,Nbb];
the kth jump table corresponding to the above Jdatai,k is constructed as follows:
JTablei,k={Bi,m:{Bi,sjt_k,Bi,sjt_k+1,...,Bi,sjt_k+njt_k-1}}
wherein sjt_k is the subscript of the starting basic block code block of the kth jump table, njt_k indicates the number of jump entries in the kth jump table, and Bi,sjt_k+j represents the (sjt_k + j + 1)th jump entry in the kth jump table in the ith original binary code file, where j ∈ [0, njt_k - 1];
Step 1, further constructing indirect calling triple samples as follows:
Cdatai,k=(Bri,k,E,Fsi,n)
k∈[1,Ndata_call]
wherein Cdatai,k represents the kth indirect call data sample generated from the ith original binary code file, i.e. the sample corresponding to the kth indirect call instruction code block in the ith original binary code file, which is assumed to be Insi,l; Ndata_call represents the number of indirect call samples in the ith original binary code file, E represents the control flow between function code blocks, Bri,k is the indirect call branch of the kth indirect call instruction code block Insi,l in the ith original binary code file, Bri,k being constructed on the basis of a breadth-first search algorithm; Fsi,n is the function sequence corresponding to the nth function Fi,n in the ith original binary code file, Fi,n being any address-taken function in the binary code;
CTarget(Insi,l) is defined as the list of function code blocks actually called by Insi,l, i.e.:
CTarget(Insi,l)={Fi,ct1,Fi,ct2,...,Fi,ctn}
wherein Fi,ct1, Fi,ct2, ..., Fi,ctn are the actual target functions of Insi,l;
step 1, marking the indirect jump triple sample and the indirect call triple sample respectively, and generating an indirect jump training set and an indirect call training set as follows:
for an indirect jump triple sample Jdatai,k=(Bi,m,e,Bi,n):
if Bi,n∈JTablei,k[Bi,m], then Jdatai,k is labeled Jlabeli_k,1, otherwise Jlabeli_k,0;
for an indirect call triple sample Cdatai,k=(Bri,k,E,Fsi,n):
if Fi,n∈CTarget(Insi,l), then the sample is labeled Clabeli_k,1, otherwise Clabeli_k,0;
Step 1, generating an indirect jump training set, namely:
JDATA={(Jdata1,1,Jlabel1_1,k1),(Jdata1,2,Jlabel1_2,k2),......,(JdataK,Ndata_jmp_k,JlabelK_Ndata_jmp_k,kNjmp)}
wherein JDATA is the indirect jump training set; (Jdata1,1, Jlabel1_1,k1) is the first sample in the data set, where Jdata1,1 is the 1st sample in the 1st original binary file, Jlabel1_1,k1 is the label of Jdata1,1, and k1 is 0 or 1; (Jdatai,j, Jlabeli_j,km) is the mth sample in the data set, where Jdatai,j is the jth sample in the ith original binary file, Jlabeli_j,km is its corresponding label, and m is the subscript of the sample in the data set, with i ∈ [1, K], j ∈ [1, Ndata_jmp_i]; K is the number of original binary code files, Ndata_jmp_i represents the total number of indirect jump samples of the ith binary, and Njmp is the total number of samples in the indirect jump training set;
step 1, generating an indirect call training set, namely:
CDATA={(Cdata1,1,Clabel1_1,k1),(Cdata1,2,Clabel1_2,k2),......,(CdataK,Ndata_call_k,ClabelK_Ndata_call_k,kNcall)}
wherein CDATA is the indirect call training set; (Cdata1,1, Clabel1_1,k1) is the first sample in the data set, where Cdata1,1 is the 1st sample in the 1st original binary file, Clabel1_1,k1 is the label of Cdata1,1, and k1 is 0 or 1; (Cdatai,j, Clabeli_j,km) is the mth sample in the data set, where Cdatai,j is the jth sample in the ith original binary file, Clabeli_j,km is its corresponding label, and m is the subscript of the sample in the data set, with i ∈ [1, K], j ∈ [1, Ndata_call_i]; K is the number of original binary code files, Ndata_call_i represents the total number of indirect call samples of the ith binary, and Ncall is the total number of samples in the indirect call training set.
3. The deep learning based unsigned binary indirect control flow identification method of claim 1,
step 2, the neural network indirect jump target identification classification model is formed by serially cascading an embedding layer, a deep bidirectional long-short term memory network, an attention layer and a batch normalization layer;
the embedding layer converts words from high-dimensional sparse one-hot vectors into low-dimensional dense vectors;
the low-dimensional dense vector representation of each word is jmp_embedding_vector;
the deep bidirectional long-short term memory network is formed by sequentially cascading a 1st bidirectional long-short term memory layer, a 2nd bidirectional long-short term memory layer and a random deactivation layer in series;
the ith bidirectional long-short term memory layer is used for selectively discarding data through a gating mechanism, and then updating the data by combining the old state value memorized by the network to obtain a determined updated value, which is output to the next layer;
the weight of the forget gate of the ith bidirectional long-short term memory layer is jmp_weightsf_lstm_i;
the bias of the forget gate of the ith bidirectional long-short term memory layer is jmp_biasf_lstm_i;
the weight of the input gate of the ith bidirectional long-short term memory layer is jmp_weightsi_lstm_i;
the bias of the input gate of the ith bidirectional long-short term memory layer is jmp_biasi_lstm_i;
the weight of the output gate of the ith bidirectional long-short term memory layer is jmp_weightsc_lstm_i;
the bias of the output gate of the ith bidirectional long-short term memory layer is jmp_biasc_lstm_i;
the weight of the computing unit state of the ith bidirectional long-short term memory layer is jmp_weightso_lstm_i;
the bias of the computing unit state of the ith bidirectional long-short term memory layer is jmp_biaso_lstm_i;
i∈[1,2];
the random deactivation layer is used for discarding the output data of the bidirectional long-short term memory layer with a certain probability to avoid overfitting;
the attention layer is used for alleviating the context loss caused by gradient vanishing on long sample sequences by giving greater weight to the important words;
the weight of the attention layer is jmp_weights_attention;
the bias of the attention layer is jmp_bias_attention;
the context vector of the attention layer is jmp_u_attention;
the batch standardization layer comprises a fully-connected layer, a batch normalization layer and a normalization index layer;
the fully-connected layer outputs a one-dimensional matrix of size W × H, with W = 256 and H = 1, and is used for integrating the output data of the attention layer and mapping it into the sample space of the following batch normalization layer;
the weight of the fully-connected layer is jmp_weights_dense;
the bias of the fully-connected layer is jmp_bias_dense;
the batch normalization layer is used for accelerating the convergence of the optimization training in step 2;
the translation parameter of the batch normalization layer is jmp_shift_bn;
the scaling parameter of the batch normalization layer is jmp_scale_bn;
the normalization index layer is used for converting the continuous output features of the batch normalization layer into discrete prediction features; this layer first applies a sigmoid operation to the output features of the batch normalization layer, then uses a cross entropy loss function, which is better suited to measuring the difference between two probability distributions, as the measurement function to optimize the learning result of the upper layers, so that the final result is a probability distribution over the labels Jlabeli,1* and Jlabeli,2* predicted for the ith sample, i ∈ [1, N], where N represents the number of samples in the deep learning training set; since the problem is a binary classification, the labels fall into two categories;
the neural network indirect jump target identification classification loss function model is a cross entropy loss function, which is specifically defined as follows: the cross entropy of the ith sample is

l^{(i)}(\Theta) = -\sum_{j} y^{(i)}_{j} \log \hat{y}^{(i)}_{j}

wherein N is the total number of training samples; \hat{y}^{(i)} is the predicted probability distribution over the labels indicating whether the indirect control flow of the ith sample is correct, and \hat{y}^{(i)}_{j} is the probability value predicted for the jth label; the true label probability distribution is y^{(i)}: for the indirect control flow correctness labels Jlabeli,1, Jlabeli,2 of the ith sample, if the label of the ith sample is Jlabeli,j, the corresponding probability value y^{(i)}_{j} is set to one, and the probability value y^{(i)}_{k} of the other label Jlabeli,k (k ≠ j) is zero;

the loss function is defined as:

l(\Theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j} y^{(i)}_{j} \log \hat{y}^{(i)}_{j}

that is, the cross entropy loss function l(\Theta) computes the value l^{(i)}(\Theta) for all training samples and averages them; the training target of the neural network is set so that the predicted probability distribution \hat{y}^{(i)} is as close as possible to the real label probability distribution y^{(i)}, i.e. so as to minimize the cross entropy loss function l(\Theta); finally, the probability of the predicted classification is calculated;
optimizing the network parameters by using the Adam optimization algorithm, the network optimization parameter set of step 2 is obtained as follows:
the vector representation of each word is jmp_embedding_vector_best;
for the ith bidirectional long-short term memory layer:
the optimized weight parameters are jmp_weightsf_lstm_best_i, jmp_weightsi_lstm_best_i, jmp_weightsc_lstm_best_i and jmp_weightso_lstm_best_i respectively;
the optimized bias parameters are jmp_biasf_lstm_best_i, jmp_biasi_lstm_best_i, jmp_biasc_lstm_best_i and jmp_biaso_lstm_best_i respectively;
for the attention layer:
the optimized parameters comprise the weight jmp_weights_attention_best, the bias jmp_bias_attention_best and the context vector jmp_u_attention_best;
the optimized weight parameter of the fully-connected layer is jmp_weights_dense_best;
the optimized bias parameter of the fully-connected layer is jmp_bias_dense_best;
the optimized translation parameter of the batch normalization layer is jmp_shift_bn_best;
the optimized scaling parameter of the batch normalization layer is jmp_scale_bn_best;
step 2, the neural network indirect call target identification classification model is formed by serially cascading an embedding layer, a deep bidirectional long-short term memory network, an attention layer and a batch normalization layer;
the embedding layer converts words from high-dimensional sparse one-hot vectors into low-dimensional dense vectors; the low-dimensional dense vector of each word is call_embedding_vector;
the deep bidirectional long-short term memory network is formed by sequentially cascading a first bidirectional long-short term memory layer, a second bidirectional long-short term memory layer and a random deactivation layer in series;
the ith bidirectional long-short term memory layer is used for selectively discarding data through a gating mechanism, and then updating the data by combining the old state value memorized by the network to obtain a determined updated value, which is output to the next layer;
the weight of the forget gate of the ith bidirectional long-short term memory layer is call_weightsf_lstm_i;
the bias of the forget gate of the ith bidirectional long-short term memory layer is call_biasf_lstm_i;
the weight of the input gate of the ith bidirectional long-short term memory layer is call_weightsi_lstm_i;
the bias of the input gate of the ith bidirectional long-short term memory layer is call_biasi_lstm_i;
the weight of the output gate of the ith bidirectional long-short term memory layer is call_weightsc_lstm_i;
the bias of the output gate of the ith bidirectional long-short term memory layer is call_biasc_lstm_i;
the weight of the computing unit state of the ith bidirectional long-short term memory layer is call_weightso_lstm_i;
the bias of the computing unit state of the ith bidirectional long-short term memory layer is call_biaso_lstm_i;
the random deactivation layer is used for discarding the output data of the bidirectional long-short term memory layer with a certain probability to avoid overfitting;
the attention layer is used for alleviating the context loss caused by gradient vanishing on long sample sequences by giving greater weight to the important words;
the weight of the attention layer is call_weights_attention;
the bias of the attention layer is call_bias_attention;
the context vector of the attention layer is call_u_attention;
the batch standardization layer comprises a fully-connected layer, a batch normalization layer and a normalization index layer;
the fully-connected layer outputs a one-dimensional matrix of size W × H, with W = 256 and H = 1, and is used for integrating the output data of the attention layer and mapping it into the sample space of the following batch normalization layer;
the weight of the fully-connected layer is call_weights_dense;
the bias of the fully-connected layer is call_bias_dense;
the batch normalization layer is used for accelerating the convergence of the optimization training in step 2;
the translation parameter of the batch normalization layer is call_shift_bn;
the scaling parameter of the batch normalization layer is call_scale_bn;
the normalization index layer is used for converting the continuous output features of the batch normalization layer into discrete prediction features; this layer first applies a sigmoid operation to the output features of the batch normalization layer, then uses a cross entropy loss function, which is better suited to measuring the difference between two probability distributions, as the measurement function to optimize the learning result of the upper layers, so that the final result is a probability distribution over the labels Clabeli,1* and Clabeli,2* predicted for the ith sample, i ∈ [1, N], where N represents the number of samples in the deep learning training set; since the problem is a binary classification, the labels fall into two categories;
the neural network indirect call target identification classification loss function model is a cross entropy loss function, which is specifically defined as follows: the cross entropy of the ith sample is

l^{(i)}(\Theta) = -\sum_{j} y^{(i)}_{j} \log \hat{y}^{(i)}_{j}

wherein N is the total number of training samples; \hat{y}^{(i)} is the predicted probability distribution over the labels indicating whether the indirect control flow of the ith sample is correct, and \hat{y}^{(i)}_{j} is the probability value predicted for the jth label; the true label probability distribution is y^{(i)}: for the labels Clabeli,1, Clabeli,2 used in step 2 to indicate whether the indirect control flow of the ith sample is correct, if the label of the ith sample is Clabeli,j, the corresponding probability value y^{(i)}_{j} is set to one, and the probability value y^{(i)}_{k} of the other label Clabeli,k (k ≠ j) is zero;

the loss function is defined as:

l(\Theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j} y^{(i)}_{j} \log \hat{y}^{(i)}_{j}

that is, the cross entropy loss function l(\Theta) computes the value l^{(i)}(\Theta) for all training samples and averages them; the training target of the neural network is set so that the predicted probability distribution \hat{y}^{(i)} is as close as possible to the real label probability distribution y^{(i)}, i.e. so as to minimize the cross entropy loss function l(\Theta); finally, the probability of the predicted classification is calculated;
optimizing the network parameters by using the Adam optimization algorithm, the network optimization parameter set of step 2 is obtained as follows:
the vector representation of each word is call_embedding_vector_best;
for the ith bidirectional long-short term memory layer:
the optimized weight parameters are call_weightsf_lstm_best_i, call_weightsi_lstm_best_i, call_weightsc_lstm_best_i and call_weightso_lstm_best_i respectively;
the optimized bias parameters are call_biasf_lstm_best_i, call_biasi_lstm_best_i, call_biasc_lstm_best_i and call_biaso_lstm_best_i respectively;
for the attention layer:
the optimized parameters comprise the weight call_weights_attention_best, the bias call_bias_attention_best and the context vector call_u_attention_best;
the optimized weight parameter of the fully-connected layer is call_weights_dense_best;
the optimized bias parameter of the fully-connected layer is call_bias_dense_best;
the optimized translation parameter of the batch normalization layer is call_shift_bn_best;
the optimized scaling parameter of the batch normalization layer is call_scale_bn_best.
4. The deep learning based unsigned binary indirect control flow identification method of claim 1,
step 3, judging whether the instruction code block in the binary system to be detected is an indirect jump instruction code block or an indirect call instruction code block, specifically:
if an instruction code block in the binary system to be detected belongs to the indirect jump instruction code blocks, the function code block, basic block code block and corresponding jump table where the indirect jump instruction code block is located are preprocessed through step 1 to obtain the indirect jump triple samples of the binary system to be detected; the sample labels of the binary system to be detected are then predicted through the trained neural network indirect jump target identification classification model defined in step 2, and the target basic block code block of each indirect jump instruction code block is restored from the sample labels based on the triple definition of step 1, i.e. for the kth indirect jump data sample Jdatai,k=(Bi,m,e,Bi,n) in the ith original binary file, if the predicted sample label is Jlabeli_k,1, then Bi,n is a target basic block code block;
if the instruction code block in the binary system to be detected belongs to the indirect call instruction code blocks, the original binary file where the indirect call instruction code block is located and the function code block where it is located are preprocessed through step 1 to obtain the indirect call branches of the binary system to be detected and the function sequences of the binary system to be detected, so as to further form the indirect call triple samples of step 1; the sample labels of the binary system to be detected are then predicted through the trained neural network indirect call target identification classification model defined in step 2, and the target function code block of each indirect call instruction code block is restored from the sample labels based on the triple definition of step 1, i.e. for the kth indirect call data sample Cdatai,k=(Bri,k,E,Fsi,n) in the ith original binary file, if the corresponding predicted sample label is Clabeli_k,1, then the function code block Fi,n corresponding to Fsi,n is a target function code block.




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant