CN110362343A - The method of the detection bytecode similarity of N-Gram - Google Patents

Info

Publication number
CN110362343A
Authority
CN
China
Prior art keywords
gram
bytecode
similarity
file
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910653076.7A
Other languages
Chinese (zh)
Inventor
彭艳茹
陈雨亭
沈备军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-07-19
Filing date
2019-07-19
Publication date
2019-10-22
Application filed by Shanghai Jiaotong University
Priority to CN201910653076.7A
Publication of CN110362343A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/75 Structural analysis for program understanding
    • G06F8/751 Code clone detection

Abstract

A method for detecting bytecode similarity based on N-Gram: the executable binary files to be compared are converted into bytecode, the bytecode is analyzed under an N-gram model using an N-Gram hash algorithm to obtain the corresponding hash values, features are extracted from those values with the winnowing algorithm, and the similarity is finally calculated. The invention can judge the similarity of Java executable files at the bytecode level, and its use of hash analysis improves execution efficiency, so it can be widely applied to evaluating the degree of Java bytecode obfuscation, code clone detection, and similar tasks.

Description

Method for detecting bytecode similarity based on N-Gram
Technical field
The present invention relates to a technology in the field of computer information processing, specifically a method for detecting bytecode similarity based on a language model (N-Gram).
Background art
Bytecode similarity calculation is a research direction in program analysis. It is significant for code clone detection and for obfuscation assessment: it helps programmers reduce redundant code, improve code efficiency, raise the security of code, and protect intellectual property in code. The greatest advantage of the Java language is its platform independence (compile once, run everywhere), but this characteristic also makes Java programs easy to decompile. Obfuscation techniques emerged to protect the intellectual property of Java programs and the interests of programmers.
Obfuscation is a technique that, while keeping the semantics of a bytecode program unchanged, transforms the program so that it is harder to decompile or harder to understand after decompilation. It is generally held that the lower the similarity to the source code, the better the obfuscation effect. Code clone detection, a long-standing research direction in program analysis, is already quite mature: it has developed from the initial text-based detection, to detection based on control flow graphs, to today's combined use of multiple approaches. In general, however, whenever static program analysis is involved and structural representations must be built and compared for similarity (for example, comparing abstract syntax trees), the running speed of the whole procedure drops substantially.
Summary of the invention
In view of the above shortcomings of the prior art, the present invention proposes a method for detecting bytecode similarity based on N-Gram. It can judge the similarity of Java executable files at the bytecode level, and its use of hash analysis improves execution efficiency, so it can be widely applied to evaluating the degree of Java bytecode obfuscation, code clone detection, and similar tasks.
The present invention is achieved by the following technical solutions:
The present invention converts the executable binary files to be compared into bytecode, analyzes the bytecode under an N-gram model using an N-Gram hash algorithm to obtain the corresponding hash values, extracts features from them with the winnowing algorithm, and finally calculates the similarity.
The conversion parses the formats of the stack-based .class file and the register-based .dex file and converts them into bytecode files that are readable and can be processed with character operations.
The conversion hides the differences between decompiling different file types behind a unified Binary2Bytecode interface. The interface identifies whether the current executable file is a .class or a .dex file and invokes the corresponding reverse-engineering tool, that is, it converts the file into hash values based on the corresponding instruction set.
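As a rough illustration of such a unified entry point, the sketch below tells the two formats apart by their leading magic bytes (`CA FE BA BE` for .class, `dex\n` for .dex) and dispatches to a decoder. The names `detect_kind`, `binary_to_bytecode`, and the `decoders` mapping are hypothetical and not taken from the patent; the actual decompilation tools are left as caller-supplied functions.

```python
CLASS_MAGIC = bytes.fromhex("cafebabe")  # first four bytes of a Java .class file
DEX_MAGIC = b"dex\n"                     # first four bytes of a Dalvik .dex file


def detect_kind(header: bytes) -> str:
    """Return "class" or "dex" based on the file's leading magic bytes."""
    if header.startswith(CLASS_MAGIC):
        return "class"
    if header.startswith(DEX_MAGIC):
        return "dex"
    raise ValueError("unsupported executable format")


def binary_to_bytecode(data: bytes, decoders: dict) -> list:
    """Hypothetical unified entry point: run the decoder registered for the detected kind."""
    return decoders[detect_kind(data[:4])](data)
```

A caller would register one decoder per format, so supporting a further executable type only means adding a magic number and a decoder.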
The analysis maps the hash values in the bytecode file through the N-Gram hash expression $H(c_1 \ldots c_N) = c_1 b^{N-1} + c_2 b^{N-2} + \cdots + c_{N-1} b + c_N$, where $H$ is the mapping, $c_1 \ldots c_N$ is an N-gram, and $b$ is the base.
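A minimal sketch of this polynomial hash, evaluated with Horner's rule over every window of N consecutive values. The function name `ngram_hashes` and the input values are illustrative assumptions; the defaults N = 3 and b = 8 follow the embodiment described later in the document.

```python
def ngram_hashes(codes, n=3, base=8):
    """Hash every window of n consecutive codes as a polynomial in `base`."""
    hashes = []
    for start in range(len(codes) - n + 1):
        h = 0
        for c in codes[start:start + n]:
            h = h * base + c  # Horner's rule for c1*b^(n-1) + ... + cn
        hashes.append(h)
    return hashes


# Illustrative input: four per-instruction values yield two 3-grams.
print(ngram_hashes([1, 2, 3, 4]))  # [83, 156]
```

Here 83 = 1*64 + 2*8 + 3 and 156 = 2*64 + 3*8 + 4.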
The feature extraction uses the winnowing algorithm: features are selected with a sliding window to guarantee uniformity, and randomness is guaranteed by retaining the feature with the minimum (or maximum) value in each window. The specific steps are:
1) set the sliding window size and traverse the discrete values in order;
2) record the minimum or maximum discrete value in each window, together with its index, as a feature value pair;
3) delete identical feature value pairs to complete feature extraction.
The similarity is computed by a formula (not reproduced in this text), where: $\mathrm{List}_N(s)$ is the feature list of the N-Gram representation of bytecode $s$, $G_N(s)$ is the feature set of the N-Gram representation of bytecode $s$, and $num$ is the number of times the elements in the intersection of $G_N(s)$ and $G_N(t)$ occur in $\mathrm{List}_N(s)$ and $\mathrm{List}_N(t)$. That is, when measuring the similarity of two bytecode files, this method also takes into account how many times each shared feature occurs.
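Because the formula itself is missing from this text, the sketch below is only one plausible reading of the definitions above, not the patent's formula: it counts how often the shared features occur in both lists and divides by the combined list length. The normalization, the function name `ngram_similarity`, and the handling of empty lists are all assumptions.

```python
from collections import Counter


def ngram_similarity(list_s, list_t):
    """Occurrence-weighted overlap of two feature lists (assumed normalization)."""
    count_s, count_t = Counter(list_s), Counter(list_t)
    shared = set(count_s) & set(count_t)  # intersection of the two feature sets
    # num: occurrences of the shared features in both lists
    num = sum(count_s[f] + count_t[f] for f in shared)
    total = len(list_s) + len(list_t)
    # Two empty lists are treated as identical (a choice made for this sketch).
    return num / total if total else 1.0
```

Under this reading, identical lists score 1.0, disjoint lists score 0.0, and a feature that repeats in both files weighs more than one that occurs once.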
The present invention also relates to a system implementing the above method, comprising an executable file conversion module, a hash value calculation module, a feature value extraction module, and a similarity calculation module connected in series, wherein: the executable file conversion module converts a binary file into a text-parsable bytecode file and outputs it to the hash value calculation module; the hash value calculation module converts the bytecode file into vector form and outputs it to the feature value extraction module; the feature value extraction module extracts features from the hash value vector by winnowing to obtain the final representation vector of the executable file and outputs it to the similarity calculation module; and the similarity calculation module calculates the final similarity result from the representation vectors using the improved N-Gram similarity formula.
Technical effect
Compared with the prior art, the present invention is simple to operate, fast to execute, and highly extensible. Simplicity and extensibility come from the interface, which hides the differences between the executable file types and their bytecode files and whose functionality can be extended further. Speed comes from the fact that the method requires neither complex static analysis of the files nor complex comparison of static analysis results. The cost of the method is very low: a run generally completes within 1 s (according to the data in the embodiment), although this depends on the size of the executable files. The cost lies mainly in the final similarity calculation module, whose worst case is O(nm), where n and m are the instruction counts of the two bytecode files.
Brief description of the drawings
Fig. 1 is a schematic diagram of converting a bytecode file into a hash vector;
Fig. 2 is a schematic diagram of the Binary2Bytecode interface parsing executable files into the corresponding bytecode files;
Fig. 3 is a schematic diagram of converting a bytecode hash value vector into 3-Gram hash values;
Fig. 4 is a flow diagram of obtaining the set of similarity values;
Fig. 5 is a schematic diagram of N-Gram similarity versus relative executable file size.
Detailed description of the embodiment
This embodiment uses Android R8 (ver. 1.4.9). Android R8 is a user-customizable bytecode obfuscator newly released by Google; it can obfuscate a Java program directly and convert it into Dalvik VM bytecode, i.e. a .dex file. The operating system is Ubuntu 16.04 with JDK 1.8.
This embodiment comprises the following steps:
Step 1) Data preparation: first, 50 Android R8 obfuscation configuration files are generated at random; using different obfuscation configuration files leads to differences in the .dex files finally generated. Second, Android R8 without any obfuscation configuration is used to convert the source Java program into a .dex file.
The conversion uses the Binary2Bytecode interface shown in Fig. 2, which calls different decompilation tools according to the identified type of the executable file. This design hides the differences between executable files and is transparent to users of the method.
Step 2) Each of the 50 randomly obfuscated .dex files from step 1 is compared with the unobfuscated .dex file using the N-Gram-based similarity calculation. The resulting 50 similarity values are stored in a table for the later experimental comparison, as shown in Fig. 4. The specific steps are:
2.1) convert the .dex file into hash values based on the corresponding instruction set;
2.2) apply the N-Gram hash algorithm to the bytecode file. This embodiment applies a 3-Gram hash algorithm to the bytecode hash values; the vector shown in Fig. 1 is mapped using the following hash expression:
$H(c_1 \ldots c_N) = c_1 b^{N-1} + c_2 b^{N-2} + \cdots + c_{N-1} b + c_N$, where $H$ denotes the mapping, $c_1 \ldots c_N$ denotes an N-gram, and $b$ denotes the base, taken as 8 in this embodiment.
2.3) use the winnowing algorithm to extract features from the hash values and calculate the similarity from the features. Features are selected with a sliding window to guarantee uniformity, and randomness is guaranteed by retaining the feature with the minimum (or maximum) value in each window. The detailed process is as follows:
Assume the following group of feature values and a window size of 4:
17 28 75 15 56 89 78 32 8 69 35 30 87 101 203 99
The minimum value in the first window is 15:
[17 28 75 15] 56 89 78 32 8 69 35 30 87 101 203 99
The second window still retains 15:
17 [28 75 15 56] 89 78 32 8 69 35 30 87 101 203 99
This continues until the fifth window, where the minimum is updated to 32 (this is also the largest possible spacing between two selected feature values):
17 28 75 15 [56 89 78 32] 8 69 35 30 87 101 203 99
The feature value pairs finally selected are: [15,3] [32,7] [8,8] [30,11] [87,12], where the first item is the feature value and the second is its index (counted from 0). When features have the same hash value but different indices, the index distinguishes them as different feature values.
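The worked example can be reproduced with a short sketch. The function name `winnow` is illustrative, and taking the rightmost minimum on ties is an assumption (the example data contains no ties within a window).

```python
def winnow(values, window=4):
    """Keep the minimum of each sliding window as a (value, index) pair, without repeats."""
    features = []
    for start in range(len(values) - window + 1):
        chunk = values[start:start + window]
        smallest = min(chunk)
        # On ties, take the rightmost minimum (an assumption of this sketch).
        index = start + window - 1 - chunk[::-1].index(smallest)
        pair = (smallest, index)
        if not features or features[-1] != pair:
            features.append(pair)  # skip the pair already recorded for the previous window
    return features


values = [17, 28, 75, 15, 56, 89, 78, 32, 8, 69, 35, 30, 87, 101, 203, 99]
print(winnow(values))  # [(15, 3), (32, 7), (8, 8), (30, 11), (87, 12)]
```

The output matches the five feature value pairs listed above, with indices counted from 0.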
Step 3) The 50 similarity values obtained in step 2 are compared with the relative size change rates of the 50 .dex files, as shown in Fig. 5. When Android R8 obfuscates and converts a program, the size of the final executable file is sensitive to changes in the degree of obfuscation, so the experiment uses the size of the obfuscated executable relative to the unobfuscated executable as the comparison index.
As the figure shows, the variation of the similarity calculated by N-Gram is almost consistent with the variation of the relative size of the executable files, which demonstrates the validity of the method; the method also runs very quickly.
In a concrete experiment, the above method was run on a sample of size 50 in a JDK 1.8 / Ubuntu 16.04 environment. The experimental result is that the method can effectively describe the similarity of bytecode files, and the similarity calculation for a single file can be completed in milliseconds.
Those skilled in the art may locally adjust the above specific implementation in different ways without departing from the principle and purpose of the invention. The protection scope of the invention is defined by the claims and is not limited by the above specific implementation; every implementation within that scope is bound by the invention.

Claims (7)

1. A method for detecting bytecode similarity based on N-Gram, characterized in that the executable binary files to be compared are converted into bytecode, the bytecode is analyzed under an N-gram model using an N-Gram hash algorithm to obtain the corresponding hash values, features are extracted from them with the winnowing algorithm, and the similarity is finally calculated.
2. The method for detecting bytecode similarity based on N-Gram according to claim 1, characterized in that the conversion parses the formats of the stack-based .class file and the register-based .dex file and converts them into bytecode files that are readable and can be processed with character operations.
3. The method for detecting bytecode similarity based on N-Gram according to claim 1, characterized in that the conversion hides the differences between decompiling different file types behind a unified Binary2Bytecode interface, the interface identifying whether the current executable file is a .class or a .dex file and invoking the corresponding reverse-engineering tool, that is, converting the file into hash values based on the corresponding instruction set.
4. The method for detecting bytecode similarity based on N-Gram according to claim 1, characterized in that the analysis maps the hash values in the bytecode file through the N-Gram hash expression $H(c_1 \ldots c_N) = c_1 b^{N-1} + c_2 b^{N-2} + \cdots + c_{N-1} b + c_N$, where $H$ is the mapping, $c_1 \ldots c_N$ is an N-gram, and $b$ is the base.
5. The method for detecting bytecode similarity based on N-Gram according to claim 1, characterized in that the feature extraction uses the winnowing algorithm, selecting features with a sliding window to guarantee uniformity and guaranteeing randomness by retaining the feature with the minimum (or maximum) value in each window, the specific steps comprising:
1) setting the sliding window size and traversing the discrete values in order;
2) recording the minimum or maximum discrete value in each window, together with its index, as a feature value pair;
3) deleting identical feature value pairs to complete feature extraction.
6. The method for detecting bytecode similarity based on N-Gram according to claim 1, characterized in that the similarity is computed by a formula (not reproduced in this text), where: $\mathrm{List}_N(s)$ is the feature list of the N-Gram representation of bytecode $s$, $G_N(s)$ is the feature set of the N-Gram representation of bytecode $s$, and $num$ is the number of times the elements in the intersection of $G_N(s)$ and $G_N(t)$ occur in $\mathrm{List}_N(s)$ and $\mathrm{List}_N(t)$; that is, when measuring the similarity of two bytecode files, the method also takes into account how many times each shared feature occurs.
7. A system implementing the method of any of the preceding claims, characterized by comprising an executable file conversion module, a hash value calculation module, a feature value extraction module, and a similarity calculation module connected in series, wherein: the executable file conversion module converts a binary file into a text-parsable bytecode file and outputs it to the hash value calculation module; the hash value calculation module converts the bytecode file into vector form and outputs it to the feature value extraction module; the feature value extraction module extracts features from the hash value vector by winnowing to obtain the final representation vector of the executable file and outputs it to the similarity calculation module; and the similarity calculation module calculates the final similarity result from the representation vectors using the improved N-Gram similarity formula.
CN201910653076.7A 2019-07-19 2019-07-19 The method of the detection bytecode similarity of N-Gram Pending CN110362343A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910653076.7A CN110362343A (en) 2019-07-19 2019-07-19 The method of the detection bytecode similarity of N-Gram

Publications (1)

Publication Number Publication Date
CN110362343A true CN110362343A (en) 2019-10-22

Family

ID=68220358

Country Status (1)

Country Link
CN (1) CN110362343A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112579155A (en) * 2021-02-23 2021-03-30 北京北大软件工程股份有限公司 Code similarity detection method and device and storage medium
WO2023028721A1 (en) * 2021-08-28 2023-03-09 Huawei Technologies Co.,Ltd. Systems and methods for detection of code clones

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101546320A (en) * 2008-03-27 2009-09-30 林兆祥 Data difference analysis method based on sliding window
CN105871619A (en) * 2016-04-18 2016-08-17 中国科学院信息工程研究所 Method for n-gram-based multi-feature flow load type detection
US20180143979A1 (en) * 2016-11-21 2018-05-24 Université de Lausanne Method for segmenting and indexing features from multidimensional data
CN109101479A (en) * 2018-06-07 2018-12-28 苏宁易购集团股份有限公司 A kind of clustering method and device for Chinese sentence
CN109151218A (en) * 2018-08-21 2019-01-04 平安科技(深圳)有限公司 Call voice quality detecting method, device, computer equipment and storage medium
CN109977668A (en) * 2017-12-27 2019-07-05 哈尔滨安天科技股份有限公司 The querying method and system of malicious code

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
GCYXF: "《Jaccard系数(Jaccard Coefficient)和tf-idf方法》", 《HTTPS://BLOG.CSDN.NET/GCYXF/ARTICLE/DETAILS/39480425》 *
君的名字: "《【代码克隆检测】基于K-gram hash 分析特征提取技术(代码篇)》", 《 HTTPS://BLOG.CSDN.NET/GRACE_0642/ARTICLE/DETAILS/53128303》 *
君的名字: "《基于K-gram的winnowing特征提取剽窃查重检测技术(概念篇)》", 《HTTPS://BLOG.CSDN.NET/GRACE_0642/ARTICLE/DETAILS/53115067》 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191022