CN107229563A

CN107229563A - A cross-architecture method for associating vulnerable functions in binary programs

Info

Publication number: CN107229563A
Application number: CN201610178368.6A
Authority: CN
Inventors: 石志强; 常青; 陈昱; 王猛涛; 孙利民; 朱红松
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2016-03-25
Filing date: 2016-03-25
Publication date: 2017-10-03
Anticipated expiration: 2036-03-25
Also published as: CN107229563B

Abstract

The invention discloses a cross-architecture method for associating vulnerable functions in binary programs. The method comprises: 1) performing reverse analysis on the binary file of the binary program under test to obtain a library of functions under test, and then obtaining the function call graph, the function control flow graphs, and the basic function attributes from that library; 2) extracting the features of each function under test from the function call graph, the control flow graphs, and the basic function attributes, and then calculating the numerical similarity between each function under test and the vulnerable function from the extracted features and the features of the vulnerable function; 3) for each function under test, constructing weighted bipartite graphs of the function under test and the vulnerable function, and calculating their overall similarity with a bipartite graph matching algorithm; 4) if the overall similarity between a function under test and the vulnerable function exceeds a set decision threshold, judging the function under test to be a suspected vulnerable function, and otherwise a normal function. The method is simple to implement and easy to popularize.

Description

A cross-architecture method for associating vulnerable functions in binary programs

Technical field

The present invention relates to the field of binary program vulnerability discovery and reverse analysis, in particular to a cross-architecture method for associating vulnerable functions in binary programs, and belongs to the technical field of computer program detection.

Background art

With the rapid development of information technology worldwide and the rapid spread of information systems and networked information systems, computer software has become an important part of the world economy, science and technology, military affairs, and social development. Practice has shown that most information security incidents are launched by attackers through software vulnerabilities. Security vulnerabilities are therefore a decisive factor that directly affects the security of information systems, and it is necessary to analyze software vulnerabilities and their exploitation. By the object analyzed, vulnerability analysis can be divided into source-code-level analysis and binary-level analysis. Source-code-level vulnerability analysis works directly on programs written in a high-level language. Analysts can use the rich and complete semantic information in the source code, together with a range of vulnerability analysis techniques, to find coding errors and design defects in a program. In practice, however, a large amount of commercial software exists only in binary form, and its source code is difficult to obtain. Binary program vulnerability analysis has therefore gradually become an important branch of the information security field.

Function association techniques are mainly based on binary code similarity detection. The early application scenario was to associate functions by calculating the similarity of two binary files compiled for the same architecture. Because both files are compiled for the same architecture, the assembly obtained by disassembly uses the same instruction set, so the assembly can be treated as character strings and analyzed for similarity directly. In 2013, Arun Lakhotia proposed a semantic template method for quickly locating similar code fragments. In 2014, Yaniv David used string edit distance to calculate the similarity of basic blocks. Researchers found, however, that if different compiler optimization options are used when the binary files are built, the assembly obtained by disassembling even the same piece of source code differs greatly. Methods that depend strongly on the surface form of the assembly are therefore sensitive to compiler optimization options. Researchers consequently turned to semantic information that depends less on the surface form of the assembly, and began to extract the semantic information of program fragments as features. In 2014, Jannik Pewny proposed a vulnerability association algorithm based on semantic signatures, which converts the instructions in a basic block into expressions stored as trees, calculates similarity with tree edit distance, and was implemented as the prototype TEDEM. In the same year, Manuel Egele proposed a binary code similarity detection method based on dynamic instrumentation, which uses the simulated dynamic execution environment of a function as the function's features for code retrieval; it ensures that every basic block is executed at least once by repeatedly executing the function, starting from the entry basic block and following a particular execution path, and was implemented as the prototype BLEX. Later, more and more IoT vendors compiled third-party code libraries and deployed them on different CPU platforms, so the demand for finding vulnerable functions in binary files compiled for any architecture keeps growing. Existing function association techniques cannot be applied directly to the cross-architecture scenario, either because of limitations of the method (for example, techniques based on the string similarity of assembly) or because of limitations of the tools (for example, the dynamic instrumentation tool PIN supports only the x86 platform). In 2015, Jannik Pewny published Cross-Architecture Bug Search in Binary Executables at IEEE S&P. The paper was the first to propose the cross-architecture application scenario, and it extracts semantic features of basic blocks across architectures (x86, ARM, MIPS) by lifting to an intermediate language representation, numerical sampling, and min-hashing. The accuracy of that method is unsatisfactory, however: when it is used to compare function similarity between OpenSSL firmware built for ARM and for MIPS, the rank-1 accuracy reaches only 32.4%. It is therefore necessary to study cross-architecture vulnerability association and to propose an association method with higher accuracy.

At present, there is no cross-architecture binary program vulnerability association technique that is both simple to implement and highly accurate.

Summary of the invention

The object of the present invention is to provide a cross-architecture method for associating vulnerable functions in binary programs. The method flow mainly includes: performing reverse analysis on the binary file to obtain a library of functions under test, and calculating the numerical similarity between each function under test and the vulnerable function; extracting from the function call graphs the local structural information of the two functions being compared to form two structure subgraphs; abstracting the two structure subgraphs layer by layer into weighted bipartite graphs, computing the maximum-weight matching of each weighted bipartite graph with a bipartite graph matching algorithm, taking the weighted sum as the overall similarity of the two functions, and ranking accordingly; and calculating a decision threshold from the ROC curve, where a function whose similarity exceeds the decision threshold is judged to be a suspected vulnerable function and passed on for further analysis, and otherwise is judged to be a normal function and not processed further.

The innovations of the present invention are the function control flow graph reconstruction algorithm used when calculating the numerical similarity and the structure matching algorithm used when calculating the overall similarity. The invention fuses the numerical information and the structural information of a function, and its feature extraction does not depend on a specific instruction set, so it can associate functions across binary files built for different architectures, with high accuracy and a simple implementation.

To achieve the above object, the present invention adopts the following technical solution:

A cross-architecture method for associating vulnerable functions in binary programs, mainly comprising the following three steps:

1) Calculate the numerical similarity between each function under test and the vulnerable function. First, reverse analysis is performed on the binary file to obtain the library of functions under test. Three kinds of information are extracted for each function under test: the call relations between functions (i.e., the function call graph), the control flow graph within the function, and the basic attributes of the function. This information is converted to numerical form and used as the feature vector of the function. An ensemble classifier is trained on a self-compiled, multi-platform set of functions with symbol tables as training samples. The similarity of each feature of the function under test and the vulnerable function is calculated to form a similarity vector, which is fed into the ensemble classifier for prediction to obtain the numerical similarity.

2) Construct weighted bipartite graphs and calculate the overall similarity with a bipartite graph matching algorithm. The local structural information of the two functions being compared is extracted from the function call graphs to form two structure subgraphs; the number of layers extracted can be chosen as needed. The two structure subgraphs are abstracted layer by layer into weighted bipartite graphs, in which the node set consists of the functions contained in the corresponding layers of the two structure subgraphs, the edge set is the similarity relation between any two functions, and each edge weight is the numerical similarity obtained in the previous step. The maximum-weight matching of each weighted bipartite graph is then computed layer by layer with a bipartite graph matching algorithm, and the weighted sum is taken as the overall similarity between the function under test and the vulnerable function.

3) Make the judgment according to a decision threshold calculated from the ROC curve. An ROC curve is drawn from the overall similarities between the set of functions under test and the vulnerable function, and the threshold corresponding to the peak of the Y-X curve is taken as the decision threshold. A function whose similarity exceeds the decision threshold is judged to be a suspected vulnerable function, and otherwise a normal function. If each point of the ROC curve is (x, y), then the curve formed by the points (x, y-x) is the Y-X curve based on the ROC curve, where the domain of x is M.

The present invention achieves the following beneficial effects:

When calculating the numerical similarity between a function under test and the vulnerable function, the present invention mainly considers nine features: the call relation feature, stack space feature, string feature, code size feature, path sequence feature, path basic feature, degree sequence feature, degree basic feature, and graph scale feature. Together these reflect the characteristic properties of a function fairly completely. Feature extraction does not depend on a specific instruction set, so the invention can perform vulnerability association between binary files compiled for two different architectures. Features are extracted from the IDA analysis results by writing an IDA plug-in. Because IDA itself constructs function control flow graphs differently when it reverse-analyzes binary files of different architectures, the invention proposes a control flow graph reconstruction algorithm that restores the real structure of the control flow graph to a certain extent and improves the accuracy of function feature extraction.

To fuse the numerical information and the structural information of a function, the present invention truncates the function call graph and constructs weighted bipartite graphs on which a maximum-weight matching is computed. On the assumption that function nodes closer to the function being queried contribute more to the match, the function nodes are layered by their hop count from the queried function. The Kuhn-Munkres algorithm is applied to the function nodes of each layer, and the maximum-weight bipartite matching gives the similarity of that layer. The layer similarities are finally summed with weights to give the overall function similarity. When calculating the overall similarity of the functions to be matched, the method uses the call information between functions and thereby takes into account the influence that the similarity of other function pairs has on the pair being matched. It is more objective and accurate than methods that use only numerical values.

Compared with the prior art, the present invention does not depend on a specific instruction set, can perform vulnerability association on binary files of different architectures, is simple to implement, and is easy to popularize.

Brief description of the drawings

Fig. 1 is a flow chart of the method;

Fig. 2 shows the large differences between the CFGs that IDA produces for the same function under different architectures, wherein

(a) is the CFG of the mencap_main function of busybox-1.20.0 compiled for the ARM architecture,

(b) is the CFG of the mencap_main function of busybox-1.20.0 compiled for the MIPS architecture;

Fig. 3 is a schematic diagram of function control flow graph reconstruction;

Fig. 4 is a schematic diagram of structure subgraph layering;

Fig. 5 is a schematic diagram of weighted bipartite graph construction;

Fig. 6 is a schematic diagram of determining the optimal threshold from the ROC curve.

Detailed description of the embodiments

A cross-architecture method for associating vulnerable functions in binary programs is implemented as follows:

1) An IDA plug-in is written to perform reverse analysis on the binary file, obtaining the library of functions under test together with the basic function attributes, the function call graph, and the function control flow graphs.

2) Calculate the numerical similarity between each function under test and the vulnerable function. The process comprises three steps: numerical feature extraction, similarity calculation, and similarity prediction by neural networks.

In the numerical feature extraction stage, numerical features are extracted from three sources: the basic function attributes, the function call graph, and the function control flow graph. Nine features of each function under test are extracted: the call relation feature, string feature, stack space feature, code size feature, path sequence feature, path basic feature, degree sequence feature, degree basic feature, and graph scale feature. These nine features reflect the characteristic properties of a function fairly completely.

Analyze the function call graph. For each function under test, count the number of times it is called by other functions, the number of times it calls other functions, and the number of distinct functions it calls (the call count after deduplication). These counts constitute the call relation feature.
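The three counts can be sketched as follows. This is a minimal illustration that assumes the call graph is available as a list of (caller, callee) pairs, one per call site; the function name `call_relation_features` and the tuple layout of the result are hypothetical and not taken from the original.

```python
from collections import defaultdict

def call_relation_features(call_edges):
    """Return {function: (times_called, calls_made, distinct_callees)}.

    call_edges is an assumed input format: one (caller, callee) pair per call site.
    """
    called_by = defaultdict(int)   # how often each function is called
    calls = defaultdict(int)       # how many call sites each function contains
    distinct = defaultdict(set)    # distinct callees of each function
    for caller, callee in call_edges:
        called_by[callee] += 1
        calls[caller] += 1
        distinct[caller].add(callee)
    functions = set(called_by) | set(calls)
    return {f: (called_by[f], calls[f], len(distinct[f])) for f in functions}
```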

Analyze the basic function attributes. Calculate the stack space, which constitutes the stack space feature. Calculate the number of jump instructions, the number of instructions, and the code size, which constitute the code size feature. Calculate the number of strings referenced and the set of strings referenced, which constitute the string feature.

The control flow graphs (CFGs) produced by IDA cannot be used directly for feature extraction, so they are processed before analysis. In a few cases the CFGs of the same function under different architectures differ considerably. For example, the CFGs of the memcap_main function of busybox under the ARM and MIPS architectures differ greatly, as shown in Fig. 2. The reason is that the CPU instruction set of each platform is handled by its own IDA processor module, and the processor modules do not follow the same strategy when generating CFGs. In the rmdir_main function of busybox, for example, the bl instruction on the ARM platform splits a basic block, whereas the jal instruction on the MIPS platform (also a function call instruction) does not. To unify the basic block division rules, the CFG must be rebuilt. The reconstruction algorithm is as follows:

a) Identify the head and tail addresses of all basic blocks of the function and the endpoint addresses of the original edges.

b) Sort all basic blocks in ascending order of head address, and count the in-degree and out-degree of each basic block.

c) Scan the basic blocks in ascending order of head address. If the out-degree of the n-th basic block is 0 and the in-degree of the (n+1)-th basic block is 0, merge the two basic blocks into a new n-th basic block, delete the original n-th and (n+1)-th basic blocks, and rewrite every edge that has the head address of the original (n+1)-th basic block as an endpoint address so that it uses the head address of the n-th basic block instead. If the out-degree of the n-th basic block is 0 and the in-degree of the (n+1)-th basic block is not 0, add an edge from the n-th basic block to the (n+1)-th basic block, whose endpoints are the head address of the n-th basic block and the head address of the (n+1)-th basic block.

d) When the scan reaches the last basic block, the reconstruction ends.

The CFG reconstruction algorithm is implemented in Python. Its input parameters are bbList, the list of head and tail addresses of all basic blocks; edgeList, the list of all original edges from the IDA analysis; and startPoint, the entry address of the function. Its outputs are toDic, a dictionary of all edges of the rebuilt CFG, and bbDic, a dictionary of all basic blocks of the rebuilt CFG. The effect of reconstruction on the memcap_main function of busybox is shown in Fig. 3.
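A minimal sketch of steps a) to d) is given below. It is not the original listing. It assumes that bbList holds (head, tail) address pairs and that edgeList holds (source head, target head) pairs; the function name `rebuild_cfg`, the layouts of toDic and bbDic, and the use of startPoint to keep the entry block from being merged into its predecessor are all assumptions made for illustration.

```python
def rebuild_cfg(bbList, edgeList, startPoint):
    """Rebuild a CFG by merging or linking address-adjacent basic blocks.

    Returns (toDic, bbDic):
      toDic maps a block head address to the sorted heads of its successors,
      bbDic maps a block head address to the block tail address.
    """
    blocks = sorted(bbList)            # step b: ascending head address
    edges = set(edgeList)
    i = 0
    while i < len(blocks) - 1:         # step c: scan in address order
        head, _tail = blocks[i]
        nxt_head, nxt_tail = blocks[i + 1]
        out_deg = sum(1 for s, _ in edges if s == head)
        in_deg_next = sum(1 for _, d in edges if d == nxt_head)
        if out_deg == 0 and in_deg_next == 0 and nxt_head != startPoint:
            # Merge block n+1 into block n and re-source its outgoing edges.
            blocks[i] = (head, nxt_tail)
            del blocks[i + 1]
            edges = {(head if s == nxt_head else s, d) for s, d in edges}
            continue                   # re-examine the merged block
        if out_deg == 0 and in_deg_next != 0:
            # Add the missing edge from block n to block n+1.
            edges.add((head, nxt_head))
        i += 1
    bbDic = dict(blocks)
    toDic = {h: sorted(d for s, d in edges if s == h) for h in bbDic}
    return toDic, bbDic
```

The degrees are recounted on every iteration for clarity rather than speed, which keeps the sketch short at the cost of quadratic running time.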

Analyze the function control flow graph. Calculate the out-degree and in-degree of each node (i.e., basic block) and construct the adjacency matrix of the directed CFG. Convert the control flow graph to an undirected graph, calculate the degree of each node, and construct the adjacency matrix of the undirected CFG. Perform degree analysis on both adjacency matrices: the ascending in-degree sequence and the ascending out-degree sequence are computed from the directed adjacency matrix, and the ascending degree sequence is computed from the undirected adjacency matrix. The three sequences constitute the degree sequence feature.
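The degree sequence feature can be sketched as follows, assuming the directed CFG is given as a 0/1 adjacency matrix in which `adj[i][j]` is nonzero when there is an edge from block i to block j. The function name `degree_sequences` and the return layout are hypothetical.

```python
def degree_sequences(adj):
    """Return (in_seq, out_seq, deg_seq, und) for a directed 0/1 adjacency matrix.

    in_seq, out_seq, deg_seq are ascending degree sequences; und is the
    adjacency matrix of the undirected version of the graph.
    """
    n = len(adj)
    out_deg = [sum(1 for j in range(n) if adj[i][j]) for i in range(n)]
    in_deg = [sum(1 for i in range(n) if adj[i][j]) for j in range(n)]
    # An undirected edge exists if either direction exists in the directed graph.
    und = [[1 if (adj[i][j] or adj[j][i]) else 0 for j in range(n)]
           for i in range(n)]
    deg = [sum(row) for row in und]
    return sorted(in_deg), sorted(out_deg), sorted(deg), und
```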

Based on the ascending degree sequence, calculate the maximum degree, the average degree, and the probability sequence of the degrees. From the probability sequence, calculate the entropy of the graph; together these constitute the degree basic feature. Perform path analysis on the undirected CFG adjacency matrix: the shortest distance between any two nodes (i.e., basic blocks) is computed with the Floyd algorithm or Dijkstra's algorithm to construct the path sequence feature, and the average path length, diameter, and radius of the graph are computed to constitute the path basic feature. Perform basic attribute analysis on the directed CFG adjacency matrix: the number of nodes, the number of edges, the connectivity rate of the graph, the graph density, and the clustering coefficient of the graph constitute the CFG graph scale feature.
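The degree basic feature and the path features can be sketched as follows. Two points are assumptions, since the text does not define them: the probability sequence is taken to be the fraction of nodes having each distinct degree, and the undirected graph is taken to be connected. The function names are hypothetical, and the shortest paths use the Floyd algorithm named in the text.

```python
import math
from collections import Counter

def degree_basic_features(degrees):
    """Return (max_degree, average_degree, probability_sequence, entropy)."""
    n = len(degrees)
    # Assumed definition: p(k) = share of nodes whose degree equals k.
    probs = [count / n for _, count in sorted(Counter(degrees).items())]
    entropy = -sum(p * math.log2(p) for p in probs)
    return max(degrees), sum(degrees) / n, probs, entropy

def path_features(und):
    """Return (path_sequence, average_path_length, diameter, radius)."""
    n = len(und)
    inf = float("inf")
    dist = [[0 if i == j else (1 if und[i][j] else inf) for j in range(n)]
            for i in range(n)]
    for k in range(n):                      # Floyd all-pairs shortest paths
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    pairs = [dist[i][j] for i in range(n) for j in range(i + 1, n)]
    eccentricity = [max(row) for row in dist]
    return sorted(pairs), sum(pairs) / len(pairs), max(eccentricity), min(eccentricity)
```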

Through the above steps, nine features of a function are extracted in total: the call relation feature, string feature, stack space feature, code size feature, path sequence feature, path basic feature, degree sequence feature, degree basic feature, and graph scale feature.

In the feature similarity calculation stage, depending on the form of each feature, a numerical similarity calculation method, a sequence similarity calculation method based on the string edit distance algorithm, or a set similarity calculation method based on Jaccard similarity is used. The similarity of each feature of the two functions being compared is calculated, and the results form the input vector of the ensemble classifier.
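The three kinds of similarity can be sketched as follows. The edit distance and Jaccard similarity are the standard definitions; the numerical similarity formula and the normalization of the edit distance by the longer sequence length are assumptions, since the text does not give exact formulas. All function names are hypothetical.

```python
def numeric_similarity(a, b):
    """Assumed formula: 1 - |a - b| / max(|a|, |b|), with 1.0 for equal values."""
    if a == b:
        return 1.0
    return 1.0 - abs(a - b) / max(abs(a), abs(b))

def edit_distance(s, t):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(t) + 1))
    for i, x in enumerate(s, 1):
        cur = [i]
        for j, y in enumerate(t, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def sequence_similarity(s, t):
    """Edit distance normalized by the longer length (assumed normalization)."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - edit_distance(s, t) / longest

def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    return 1.0 if not (a | b) else len(a & b) / len(a | b)
```

A degree sequence such as [0, 1, 1, 2] would go through `sequence_similarity`, a stack size through `numeric_similarity`, and a set of referenced strings through `jaccard_similarity`.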

In the stage where the ensemble classifier predicts the similarity, the ensemble classifier is first trained on a self-compiled, multi-platform set of functions with symbol tables as training samples. The specific method is as follows. The same source code is compiled with different compilers and different optimization options for different architectures, producing multiple binary executable files. Reverse analysis is performed on each binary executable file to obtain a function library, and the multi-dimensional features of each function are extracted. Based on these features, the similarity of every pair of functions from different function libraries is calculated and used as an input vector of the ensemble classifier. If the two function names are the same, the label is 1 and the pair is a positive sample; if the two function names differ, the label is 0 and the pair is a negative sample. Several initial classifiers are set up. Sampling with replacement, 80% of the samples are drawn from the original training set to construct several independent and identically distributed training subsets, one for each classifier. Each training subset is fed to its classifier for training, and the parameters of the classifier are adjusted according to the prediction results until the results meet the requirements, at which point the training of the classifier is complete. The trained ensemble classifier is then used to predict the numerical similarity. Features are extracted from the vulnerable function and from each function under test, and the similarity vectors are calculated and used as test samples. The classifiers in the trained ensemble each produce a predicted value, and the weighted average of these values is taken as the final prediction, i.e., the numerical similarity.
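The sampling and averaging scheme can be sketched as follows. The classifier itself is replaced here by a deliberately trivial stand-in (`train_mean_threshold`) so that the example stays self-contained; it is not the neural network of the embodiment. All function names, the default of five classifiers, and the equal default weights are assumptions.

```python
import random

def bootstrap_subsets(training_set, n_classifiers, fraction=0.8, seed=0):
    """Draw, with replacement, `fraction` of the training set for each classifier."""
    rng = random.Random(seed)
    size = int(len(training_set) * fraction)
    return [[rng.choice(training_set) for _ in range(size)]
            for _ in range(n_classifiers)]

def train_ensemble(training_set, train_fn, n_classifiers=5):
    """Train one classifier per bootstrap subset; train_fn returns a predictor."""
    return [train_fn(subset)
            for subset in bootstrap_subsets(training_set, n_classifiers)]

def predict_similarity(classifiers, similarity_vector, weights=None):
    """Weighted average of the individual predictions (equal weights by default)."""
    weights = weights or [1.0] * len(classifiers)
    scores = [clf(similarity_vector) for clf in classifiers]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

def train_mean_threshold(subset):
    """Toy stand-in classifier: thresholds the mean of the similarity vector."""
    pos = [sum(v) / len(v) for v, label in subset if label == 1] or [1.0]
    neg = [sum(v) / len(v) for v, label in subset if label == 0] or [0.0]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda vec: 1.0 if sum(vec) / len(vec) >= threshold else 0.0
```

Any real classifier can be substituted for `train_mean_threshold` as long as `train_fn` takes a list of (similarity vector, label) pairs and returns a callable that maps a similarity vector to a score.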

For example, suppose training samples are needed for the MIPS-O2 → ARM-O2 matching mode.

Step 1: Compile the openssl source code for the MIPS architecture with the -O2 optimization option to produce a binary file named openssl-MIPS-O2. Compile the openssl source code for the ARM architecture with the -O2 optimization option to produce a binary file named openssl-ARM-O2.

Step 2: Perform reverse analysis on each of the two binary files to obtain two function libraries. Suppose the function library of openssl-MIPS-O2 has m functions, named X₁-MIPS-O2, X₂-MIPS-O2, ..., X_m-MIPS-O2, and the function library of openssl-ARM-O2 has n functions, named Y₁-ARM-O2, Y₂-ARM-O2, ..., Y_n-ARM-O2. Features are calculated for all functions in the two libraries, giving m+n feature records in total.

Step 3: Calculate the function similarity vectors between the two libraries, giving m × n similarity vectors. If X_i = Y_j, the function X_i-MIPS-O2 in the openssl-MIPS-O2 library and the function Y_j-ARM-O2 in the openssl-ARM-O2 library are considered to be the same function; the label is 1 and the pair is a positive sample. Otherwise the pair is a negative sample.

Step 4: To balance the positive and negative samples and to speed up computation, 100 functions from openssl-MIPS-O2 and 100 functions from openssl-ARM-O2 are compared pairwise and labeled at a time, which yields 100 positive samples and 9900 negative samples. All positive samples are kept, and 100 of the 9900 negative samples are selected at random as negative samples.

In this way, min(m, n) positive samples and the same number of negative samples are obtained and used as the original training set for the MIPS-O2 → ARM-O2 matching mode.
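Steps 3 and 4 for one batch can be sketched as follows, assuming each function library is a dictionary from function name to a feature record and that a `similarity_vector` function is supplied by the caller. The function name `build_training_batch`, its parameters, and the fixed random seed are hypothetical.

```python
import random

def build_training_batch(funcs_a, funcs_b, similarity_vector, seed=0):
    """Label all cross-library pairs of one batch and balance the classes.

    funcs_a, funcs_b: {function_name: feature_record} for the two libraries.
    Returns a list of (similarity_vector, label) pairs.
    """
    rng = random.Random(seed)
    positives, negatives = [], []
    for name_a, feat_a in funcs_a.items():
        for name_b, feat_b in funcs_b.items():
            vec = similarity_vector(feat_a, feat_b)
            if name_a == name_b:
                positives.append((vec, 1))   # same name: same function
            else:
                negatives.append((vec, 0))
    # Keep every positive pair and an equally sized random sample of negatives.
    keep = min(len(positives), len(negatives))
    return positives + rng.sample(negatives, keep)
```

With batches of 100 functions from each library whose names all match, this keeps the 100 positive pairs and 100 of the 9900 negative pairs, as in Step 4.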

3) Construct weighted bipartite graphs and calculate the overall similarity with a bipartite graph matching algorithm (such as the Kuhn-Munkres algorithm). The steps of the algorithm are as follows:

a) Extract from the function call graphs the local structural information of the functions being compared to form two structure subgraphs. The number of layers extracted can be chosen according to experimental results.

b) Layer each extracted structure subgraph by hop count from the function being compared, and assign a weight to each layer according to its importance to the function being compared, as shown in Fig. 4. If the structure subgraph comes from the function call graph of the binary file containing the vulnerable function, the function being compared is the vulnerable function; if it comes from the function call graph of the binary file containing the function under test, the function being compared is the function under test.

c) Abstract the corresponding layers of the two subgraphs into a weighted complete bipartite graph, in which the node set consists of the functions in the corresponding layers, the edge set is the similarity relation between any two functions in the node set, and each edge weight is the numerical similarity of the two functions it connects, as shown in Fig. 5. This yields multiple weighted bipartite graphs.

d) For each weighted bipartite graph, compute the maximum-weight matching with a bipartite graph matching algorithm, layer by layer, and take it as the similarity of the corresponding layer.

e) Take the weighted sum of the layer similarities as the overall similarity of the functions being compared.
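Steps c) to e) can be sketched as follows. To stay dependency-free, the maximum-weight matching is found by exhaustive search over all pairings, which is a stand-in for the Kuhn-Munkres algorithm and is practical only for small layers; it returns the same optimal weight. The function names, the `sim` callback, and the layer weights are hypothetical.

```python
from itertools import permutations

def max_weight_matching(left, right, sim):
    """Best total edge weight over all one-to-one pairings of two node lists.

    Exhaustive search: a stand-in for Kuhn-Munkres on small layers only.
    sim(a, b) is the numerical similarity of a (left side) and b (right side).
    """
    if not left or not right:
        return 0.0
    if len(left) <= len(right):
        return max(sum(sim(a, b) for a, b in zip(left, perm))
                   for perm in permutations(right, len(left)))
    return max(sum(sim(a, b) for a, b in zip(perm, right))
               for perm in permutations(left, len(right)))

def overall_similarity(layers_a, layers_b, layer_weights, sim):
    """Weighted sum of the per-layer maximum-weight matchings."""
    return sum(w * max_weight_matching(layer_a, layer_b, sim)
               for layer_a, layer_b, w in zip(layers_a, layers_b, layer_weights))
```

Here `layers_a[k]` and `layers_b[k]` would hold the functions at hop count k from the vulnerable function and from the function under test respectively, with layer 0 containing only the two functions themselves.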

4) Make the judgment according to a decision threshold calculated from the ROC curve. An ROC curve is drawn from the overall similarities between the set of functions under test and the vulnerable function. The horizontal axis of the ROC curve is the false positive rate, i.e., the proportion of false positives (FP/(FP+TN)); the vertical axis is the true positive rate, i.e., the proportion of true positives (TP/(TP+FN)). The ROC curve shows how the false positive rate and the true positive rate change as the threshold changes, and it can be used to compare the performance of classifiers. Ideally, the best classifier lies in the upper left corner, which means that it achieves a high true positive rate at a very low false positive rate: real vulnerable functions are detected, and normal functions are rarely misjudged as vulnerable. The point on the ROC curve closest to the upper left corner gives the optimal threshold with the fewest errors: the sum of false positives and false negatives on the training set is smallest there, which is the point where Y-X is largest, as shown in Fig. 6. The threshold corresponding to the peak of the Y-X curve is therefore used as the decision threshold. A function whose similarity exceeds the decision threshold is judged to be a suspected vulnerable function, and otherwise a normal function.
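The threshold selection can be sketched as follows, assuming labeled overall similarities are available (label 1 for a true vulnerable function, 0 otherwise) and that both classes are present. Each distinct score is tried as a candidate threshold, and a function is predicted positive when its similarity is strictly greater than the threshold, as in the text. The function name `roc_threshold` is hypothetical.

```python
def roc_threshold(scores, labels):
    """Return (threshold, y_minus_x) at the peak of the Y-X curve.

    Y is the true positive rate TP/(TP+FN); X is the false positive rate
    FP/(FP+TN). Assumes at least one positive and one negative label.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    best_t, best_gap = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s > t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s > t and y == 0)
        gap = tp / pos - fp / neg     # one point (x, y - x) of the Y-X curve
        if gap > best_gap:
            best_t, best_gap = t, gap
    return best_t, best_gap
```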

In summary, the invention discloses a cross-architecture technique for associating vulnerabilities in binary programs. The application scenarios and embodiments described above are not intended to limit the present invention, and any person skilled in the art may make various changes and refinements without departing from the spirit and scope of the present invention. The scope of protection of the present invention is therefore defined by the claims.

Claims

1. A cross-architecture method for associating vulnerable functions in binary programs, comprising the steps of:

1) performing reverse analysis on the binary file of the binary program under test to obtain a library of functions under test, and then obtaining the function call graph, the function control flow graphs, and the basic function attributes from the library of functions under test;

2) extracting the features of each function under test from the function call graph, the function control flow graphs, and the basic function attributes, and then calculating the numerical similarity between each function under test and the vulnerable function from the extracted features and the features of the vulnerable function;

3) for each function under test, constructing weighted bipartite graphs of the function under test and the vulnerable function, and calculating the overall similarity between the function under test and the vulnerable function with a bipartite graph matching algorithm;

4) if the overall similarity between the function under test and the vulnerable function exceeds a set decision threshold, judging the function under test to be a suspected vulnerable function, and otherwise a normal function.

2. The method as claimed in claim 1, characterized in that the numerical similarity is calculated as follows:

21) the same source code is compiled into multiple binary executable files for different architectures; reverse analysis is then performed on each binary executable file to obtain a function library, and the features of each function are extracted;

22) based on the extracted features, the similarity of every pair of functions from different function libraries is calculated as an input vector of an ensemble classifier; if the two function names are the same, the label is 1 and the corresponding input vector is a positive sample, and otherwise it is a negative sample, whereby an original training set is obtained; wherein the ensemble classifier comprises multiple classifiers;

23) multiple samples are drawn from the original training set with replacement each time to construct several independent and identically distributed training subsets, which serve as the training samples of the individual classifiers in the ensemble classifier;

24) each training subset is input into its corresponding classifier for training; then, based on the features of the vulnerable function and of each function under test, the trained classifiers are used to predict on the vulnerable function and the function under test, and the weighted average of the several predicted values obtained is taken as the numerical similarity.

3. The method as claimed in claim 1 or 2, characterized in that the function control flow graph obtained in step 1) is rebuilt as follows:

a) identifying the head and tail addresses of all basic blocks of the function in the function control flow graph and the endpoint addresses of the original edges;

b) sorting all basic blocks in ascending order of head address, and counting the in-degree and out-degree of each basic block;

c) scanning the basic blocks in ascending order of head address: if the out-degree of the n-th basic block is 0 and the in-degree of the (n+1)-th basic block is 0, merging the two basic blocks into a new n-th basic block, deleting the original n-th and (n+1)-th basic blocks, and changing every edge that has the head address of the original (n+1)-th basic block as an endpoint address so that it uses the head address of the n-th basic block as the endpoint address; if the out-degree of the n-th basic block is 0 and the in-degree of the (n+1)-th basic block is not 0, adding an edge from the n-th basic block to the (n+1)-th basic block, one endpoint of which is the head address of the n-th basic block and the other endpoint of which is the head address of the (n+1)-th basic block.

4. The method as claimed in claim 1, characterized in that the overall similarity is calculated as follows:

a) a structure subgraph formed from the local structural information of the function under test is extracted from the function call graph, and a structure subgraph formed from the local structural information of the vulnerable function is extracted from the function call graph in which the vulnerable function is located;

b) each extracted structure subgraph is layered by hop count from the function being compared, and weights are assigned according to importance to the function being compared, so that the corresponding layers of the two structure subgraphs are each abstracted into a weighted bipartite graph, in which the node set consists of the functions in the corresponding layers, the edge set is the similarity relation between any two functions in the node set, and each edge weight is the numerical similarity of the two functions it connects;

c) for each weighted bipartite graph, the maximum-weight matching is computed layer by layer with a bipartite graph matching algorithm and taken as the similarity of that layer;

d) the weighted sum of the layer similarities is taken as the overall similarity of the functions being compared.

5. The method as claimed in claim 1, characterized in that the decision threshold is determined as follows: an ROC curve is drawn from the obtained overall similarities between the functions under test and the vulnerable function, and the threshold corresponding to the peak of the Y-X curve is taken as the decision threshold.