CN110362343A

CN110362343A - Method for detecting bytecode similarity based on N-Gram

Info

Publication number: CN110362343A
Application number: CN201910653076.7A
Authority: CN
Inventors: 彭艳茹; 陈雨亭; 沈备军
Original assignee: Shanghai Jiaotong University
Current assignee: Shanghai Jiaotong University
Priority date: 2019-07-19
Filing date: 2019-07-19
Publication date: 2019-10-22

Abstract

A method for detecting bytecode similarity based on N-Gram: the executable binary files to be compared are converted into bytecode for use with an N-gram model, the bytecode is analyzed with an N-Gram hash algorithm to obtain the corresponding hash values, features are extracted from these values with the winnowing algorithm, and the similarity is then calculated. The invention can judge the similarity of Java executable files at the bytecode level, and its use of hash-based analysis improves execution efficiency, so it can be widely applied to evaluating the degree of Java bytecode obfuscation, code clone detection, and similar tasks.

Description

Method for detecting bytecode similarity based on N-Gram

Technical field

The present invention relates to a technology in the field of computer information processing, specifically a method for detecting bytecode similarity based on a language model (N-Gram).

Background art

Bytecode similarity calculation is a research direction of program analysis. It is significant for code clone detection and obfuscation evaluation: it helps programmers reduce redundant code, improve code efficiency, raise the level of code security, and protect intellectual property in code. The greatest advantage of the Java language is its platform independence (compile once, run anywhere), but this feature also makes Java programs easy to decompile. Obfuscation emerged as a way to protect the intellectual property of Java programs and safeguard the interests of programmers.

Obfuscation is the technique of transforming a bytecode program, while keeping its original semantics unchanged, into a form that is harder to decompile or harder to understand after decompilation. It is generally considered that the lower the similarity to the source code, the better the obfuscation effect. Code clone detection, a long-standing research direction of program analysis, is highly developed: it has progressed from the initial text-based detection, to detection based on control flow graphs, to the current combined use of multiple approaches. In general, however, whenever static analysis of a program is involved, structural representations and similarity comparisons over them (for example, comparing abstract syntax trees) are required, and the running speed of the whole procedure drops substantially.

Summary of the invention

In view of the above shortcomings of the prior art, the present invention proposes a method for detecting bytecode similarity based on N-Gram. The method can judge the similarity of Java executable files at the bytecode level, and its use of hash-based analysis improves execution efficiency, so it can be widely applied to evaluating the degree of Java bytecode obfuscation, code clone detection, and similar tasks.

The present invention is achieved by the following technical solutions:

The present invention converts the executable binary files to be compared into bytecode for use with an N-gram model, analyzes the bytecode with an N-Gram hash algorithm to obtain the corresponding hash values, extracts features from these values with the winnowing algorithm, and finally calculates the similarity.

The conversion parses the formats of stack-based .class files and register-based .dex files and converts them into readable bytecode files that can be processed with character operations.

The conversion shields the differences between decompiling different file types behind a unified Binary2Bytecode interface. This interface identifies whether the current executable file is a .class file or a .dex file and calls the corresponding reverse-engineering tool, that is, the file is converted into hash values based on the corresponding instruction set.
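The type identification performed by such an interface can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names and the `decoders` registry are hypothetical, and each decoder stands in for a real disassembler. Only the magic numbers are standard facts: a .class file starts with the bytes `CA FE BA BE` and a .dex file starts with `dex\n`.

```python
from typing import Callable, Dict, List

CLASS_MAGIC = b"\xca\xfe\xba\xbe"  # first four bytes of every Java .class file
DEX_MAGIC = b"dex\n"               # first four bytes of every Dalvik .dex file


def detect_executable_type(data: bytes) -> str:
    """Return 'class' or 'dex' based on the file's leading magic bytes."""
    if data.startswith(CLASS_MAGIC):
        return "class"
    if data.startswith(DEX_MAGIC):
        return "dex"
    raise ValueError("unsupported executable format")


def binary2bytecode(
    data: bytes, decoders: Dict[str, Callable[[bytes], List[int]]]
) -> List[int]:
    """Dispatch to the decoder registered for the detected file type.

    Each decoder is a hypothetical callable that turns the raw file into a
    sequence of per-instruction integer codes; a real one would wrap a
    disassembler for the corresponding instruction set.
    """
    return decoders[detect_executable_type(data)](data)
```

Keeping the decoders in a registry means a further file type can be supported by adding one magic-number check and one decoder, without changing the callers.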

The analysis maps the hash values in the bytecode file with the N-Gram hash expression $H(c_1 \ldots c_N) = c_1 b^{N-1} + c_2 b^{N-2} + \cdots + c_{N-1} b + c_N$, where $H$ is the mapping, $c_1 \ldots c_N$ is an N-gram, and $b$ is the base.

The feature extraction uses the winnowing algorithm to select features through a sliding window, which guarantees uniformity; randomness is guaranteed by retaining the feature with the minimum (or maximum) value in each window. The specific steps include:

1) set the sliding window size and traverse the hash values in sequence;

2) record the minimum or maximum hash value in each window, together with its index, as a feature value pair;

3) delete identical feature value pairs, which completes the feature extraction.

The similarity is computed by a formula (not reproduced in this text) in which: List_N(s) is the feature list of the N-Gram representation of bytecode s, G_N(s) is the feature set of the N-Gram representation of bytecode s, and num is the number of times the elements of the intersection of G_N(s) and G_N(t) occur in List_N(s) and List_N(t). That is, when this method measures the similarity of two bytecode files, the number of times each shared feature occurs is also incorporated into the calculation.
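The quantities defined above can be illustrated with a short sketch. Because the formula that combines them into the final similarity is not available in this text, the sketch stops at the occurrence counts and does not compute a similarity. The function name is hypothetical, and returning the count for each list separately (rather than a single combined num) is an assumption.

```python
from collections import Counter
from typing import Hashable, Sequence, Tuple


def shared_feature_counts(
    list_s: Sequence[Hashable], list_t: Sequence[Hashable]
) -> Tuple[int, int]:
    """Count occurrences of shared features in each feature list.

    G_N(s) and G_N(t) are taken as the sets of distinct features; the result
    is how often members of their intersection occur in List_N(s) and in
    List_N(t), respectively.
    """
    count_s, count_t = Counter(list_s), Counter(list_t)
    shared = set(count_s) & set(count_t)  # G_N(s) intersected with G_N(t)
    return (
        sum(count_s[f] for f in shared),
        sum(count_t[f] for f in shared),
    )
```

For example, with `list_s = [83, 156, 83, 7]` and `list_t = [83, 83, 83, 9]`, the only shared feature is 83, which occurs twice in the first list and three times in the second.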

The present invention also relates to a system for implementing the above method, comprising an executable file conversion module, a hash value calculation module, a feature value extraction module, and a similarity calculation module connected in series, wherein: the executable file conversion module converts the binary file into a text-parsable bytecode file and outputs it to the hash value calculation module; the hash value calculation module converts the bytecode file into vector form and outputs it to the feature value extraction module; the feature value extraction module uses winnowing to extract, from the hash value vector, the final representation vector of the executable file and outputs it to the similarity calculation module; and the similarity calculation module applies the improved N-Gram similarity formula to the representation vectors to calculate the final similarity result.

Technical effect

Compared with the prior art, the present invention is simple to operate, fast to execute, and highly scalable. Simplicity and scalability come from the interface that shields the differences between the executable file types and their bytecode files; this interface can also be extended further. Fast execution comes from the fact that the method requires neither complex static analysis of the files nor complex comparison of static analysis results. The overhead of the method is very low: execution generally completes within 1 s (according to the data in the specific embodiment), although this depends on the size of the executable files. The overhead is dominated by the final similarity calculation module, whose worst case is O(nm), where n and m are the instruction counts of the two bytecode files.

Brief description of the drawings

Fig. 1 is a schematic diagram of converting a bytecode file into a hash vector;

Fig. 2 is a schematic diagram of the Binary2Bytecode interface parsing executable files into the corresponding bytecode files;

Fig. 3 is a schematic diagram of converting a bytecode hash value vector into 3-Gram hash values;

Fig. 4 is a flow diagram of obtaining the set of similarity values;

Fig. 5 is a schematic diagram of N-Gram similarity versus relative executable file size.

Specific embodiment

This embodiment uses Android R8 (ver. 1.4.9). Android R8 is a user-customizable bytecode obfuscator newly released by Google that can obfuscate a runnable Java program and convert it directly into Dalvik VM bytecode, i.e., a .dex file. The operating system is Ubuntu 16.04, with JDK 1.8.

This embodiment includes the following steps:

Step 1) Data preparation: first, 50 Android R8 obfuscation configuration files are generated at random; using different obfuscation configuration files leads to differences in the resulting .dex files. Second, the source Java program is converted into a .dex file using Android R8 without any obfuscation configuration.

The conversion uses the Binary2Bytecode interface shown in Fig. 2, which calls different decompilation tools according to the identified type of the executable file. This design shields the differences between executable files and is transparent to users of the method.

Step 2) The 50 randomly obfuscated .dex files from step 1 are each compared with the unobfuscated .dex file using the N-Gram-based similarity calculation. The resulting 50 similarity values are stored in a table for later experimental comparison, as shown in Fig. 4. The specific steps include:

2.1) convert the .dex file into hash values based on the corresponding instruction set;

2.2) apply the N-Gram hash algorithm to the bytecode file. This embodiment uses a 3-Gram hash algorithm; the bytecode hash value vector shown in Fig. 1 is mapped using the following hash expression:

$H(c_1 \ldots c_N) = c_1 b^{N-1} + c_2 b^{N-2} + \cdots + c_{N-1} b + c_N$, where $H$ denotes the mapping, $c_1 \ldots c_N$ denotes an N-gram, and $b$ denotes the base, which is 8 in this embodiment.
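The hash expression can be sketched in a few lines. This is an illustrative reading rather than the embodiment's code: the function names are hypothetical, and the input is assumed to be a sequence of small integer instruction codes. The defaults `n=3` and `base=8` follow the values stated for this embodiment.

```python
from typing import List, Sequence


def ngram_hash(gram: Sequence[int], base: int = 8) -> int:
    """Evaluate H(c_1...c_N) = c_1*b^(N-1) + ... + c_(N-1)*b + c_N by Horner's rule."""
    h = 0
    for c in gram:
        h = h * base + c
    return h


def ngram_hashes(codes: Sequence[int], n: int = 3, base: int = 8) -> List[int]:
    """Hash every window of n consecutive codes, in order of appearance."""
    return [ngram_hash(codes[i:i + n], base) for i in range(len(codes) - n + 1)]
```

For instance, the 3-gram (1, 2, 3) hashes to 1·8² + 2·8 + 3 = 83, and the code sequence 1 2 3 4 yields the two hash values 83 and 156.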

2.3) use the winnowing algorithm to extract features from the hash values and calculate the similarity from the features. Features are selected with a sliding window to guarantee uniformity, and randomness is guaranteed by retaining the feature with the minimum (or maximum) value in each window. The detailed process is as follows:

Suppose there is the following group of feature values and the window size is set to 4:

17 28 75 15 56 89 78 32 8 69 35 30 87 101 203 99

The minimum value in the first window is 15 (the current window is shown in brackets):

[17 28 75 15] 56 89 78 32 8 69 35 30 87 101 203 99

The second window still retains 15:

17 [28 75 15 56] 89 78 32 8 69 35 30 87 101 203 99

This continues until the 5th window, where the selection is updated to 32 (this is also the case in which two selected feature values are farthest apart):

17 28 75 15 [56 89 78 32] 8 69 35 30 87 101 203 99

The feature value pairs finally selected are: [15,3] [32,7] [8,8] [30,11] [87,12], where the first item is the feature value and the second is its index (counted from 0). When features with the same hash value but different indices occur, the index distinguishes the different feature values.
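The worked example above can be reproduced with a short sketch of minimum-based winnowing. The function name is hypothetical, and the tie-breaking rule (keep the rightmost of several equal minima) is an assumption, since the example contains no ties.

```python
from typing import List, Sequence, Tuple


def winnow(hashes: Sequence[int], window: int = 4) -> List[Tuple[int, int]]:
    """Select (value, index) pairs: the minimum of each sliding window, deduplicated."""
    selected: List[Tuple[int, int]] = []
    for start in range(len(hashes) - window + 1):
        chunk = list(hashes[start:start + window])
        smallest = min(chunk)
        # Tie-break assumption: on equal minima keep the rightmost position.
        offset = window - 1 - chunk[::-1].index(smallest)
        pair = (smallest, start + offset)
        # A pair can only repeat in consecutive windows, so comparing with
        # the most recent selection is enough to drop identical pairs.
        if not selected or selected[-1] != pair:
            selected.append(pair)
    return selected


values = [17, 28, 75, 15, 56, 89, 78, 32, 8, 69, 35, 30, 87, 101, 203, 99]
print(winnow(values, window=4))
# [(15, 3), (32, 7), (8, 8), (30, 11), (87, 12)]
```

Because every window contributes exactly one selection, no two consecutive selected features can be more than one window length apart, which is the uniformity property mentioned in step 2.3.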

Step 3) The 50 similarity values obtained in step 2 are compared with the relative size change rates of the 50 .dex files, as shown in Fig. 5. When Android R8 obfuscates and converts a program, the size of the final executable file is sensitive to the degree of obfuscation, so the experiment uses the size of the obfuscated executable relative to the unobfuscated executable as the comparison index.

As can be seen from the figure, the variation of the similarity calculated by N-Gram is almost consistent with the variation of the relative size of the executable files, which demonstrates the validity of the method; the method also runs quickly.

In a concrete experiment, the above method was run on a sample of size 50 in a JDK 1.8, Ubuntu 16.04 environment. The experimental result is that the method can effectively describe the similarity of bytecode files, and the similarity calculation for a single file completes at the millisecond level.

The above specific implementation may be locally adjusted in different ways by those skilled in the art without departing from the principle and purpose of the invention. The protection scope of the invention is defined by the claims and is not limited by the above specific implementation; every implementation within that scope is bound by the invention.

Claims

1. A method for detecting bytecode similarity based on N-Gram, characterized in that the executable binary files to be compared are converted into bytecode for use with an N-gram model, the bytecode is analyzed with an N-Gram hash algorithm to obtain the corresponding hash values, features are extracted from these values with the winnowing algorithm, and the similarity is finally calculated.

2. The method for detecting bytecode similarity based on N-Gram according to claim 1, characterized in that the conversion parses the formats of stack-based .class files and register-based .dex files and converts them into readable bytecode files that can be processed with character operations.

3. The method for detecting bytecode similarity based on N-Gram according to claim 1, characterized in that the conversion shields the differences between decompiling different file types behind a unified Binary2Bytecode interface, which identifies whether the current executable file is a .class file or a .dex file and calls the corresponding reverse-engineering tool, that is, the file is converted into hash values based on the corresponding instruction set.

4. The method for detecting bytecode similarity based on N-Gram according to claim 1, characterized in that the analysis maps the hash values in the bytecode file with the N-Gram hash expression $H(c_1 \ldots c_N) = c_1 b^{N-1} + c_2 b^{N-2} + \cdots + c_{N-1} b + c_N$, where $H$ is the mapping, $c_1 \ldots c_N$ is an N-gram, and $b$ is the base.

5. The method for detecting bytecode similarity based on N-Gram according to claim 1, characterized in that the feature extraction uses the winnowing algorithm to select features through a sliding window, which guarantees uniformity, and randomness is guaranteed by retaining the feature with the minimum (or maximum) value in each window, the specific steps including:

1) set the sliding window size and traverse the hash values in sequence;

2) record the minimum or maximum hash value in each window, together with its index, as a feature value pair;

3) delete identical feature value pairs, which completes the feature extraction.

6. The method for detecting bytecode similarity based on N-Gram according to claim 1, characterized in that the similarity is computed by a formula (not reproduced in this text) in which: List_N(s) is the feature list of the N-Gram representation of bytecode s, G_N(s) is the feature set of the N-Gram representation of bytecode s, and num is the number of times the elements of the intersection of G_N(s) and G_N(t) occur in List_N(s) and List_N(t); that is, when this method measures the similarity of two bytecode files, the number of times each shared feature occurs is also incorporated into the calculation.

7. A system for implementing the method of any of the above claims, characterized by comprising an executable file conversion module, a hash value calculation module, a feature value extraction module, and a similarity calculation module connected in series, wherein: the executable file conversion module converts the binary file into a text-parsable bytecode file and outputs it to the hash value calculation module; the hash value calculation module converts the bytecode file into vector form and outputs it to the feature value extraction module; the feature value extraction module uses winnowing to extract, from the hash value vector, the final representation vector of the executable file and outputs it to the similarity calculation module; and the similarity calculation module applies the improved N-Gram similarity formula to the representation vectors to calculate the final similarity result.