CN110362343A - The method of the detection bytecode similarity of N-Gram - Google Patents
- Publication number
- CN110362343A (application CN201910653076.7A)
- Authority
- CN
- China
- Prior art keywords
- gram
- bytecode
- similarity
- file
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/70—Software maintenance or management
- G06F8/75—Structural analysis for program understanding
- G06F8/751—Code clone detection
Abstract
A method for detecting bytecode similarity based on N-Gram: the executable binary files to be compared are converted into bytecode according to an N-gram model, the bytecode is analyzed with an N-Gram hash algorithm to obtain the corresponding hash values, features are extracted from the hash values with the winnowing algorithm, and the similarity is finally calculated. The invention can judge the similarity of Java executable files at the bytecode level, and the use of hash analysis improves the efficiency of the method, so that it can be widely applied to evaluating the degree of Java bytecode obfuscation, code clone detection, and similar tasks.
Description
Technical field
The present invention relates to a technology in the field of computer information processing, specifically a method for detecting bytecode similarity based on a language model (N-Gram).
Background art
Bytecode similarity calculation is a research direction in program analysis. It is significant for code clone detection and for obfuscation evaluation: it can help programmers reduce redundant code, improve code efficiency, raise code security, and protect the intellectual property of code. The greatest advantage of the Java language is its platform independence (compile once, run anywhere), but this feature also makes Java programs easy to decompile. Obfuscation techniques therefore emerged to protect the intellectual property of Java programs and the interests of programmers.
Obfuscation means transforming a bytecode program, while keeping its original semantics unchanged, into a form that is harder to decompile or harder to understand after decompilation. It is generally held that the lower the similarity to the original code, the better the obfuscation effect. Code clone detection, a long-standing research direction in program analysis, is already well developed: it has evolved from the initial text-based detection, to detection based on control flow graphs, to today's combinations of multiple approaches. In general, however, whenever static analysis of the program is involved, structural representations and structural similarity comparison (for example, comparing abstract syntax trees) are required, and the running speed of the whole procedure drops substantially.
Summary of the invention
In view of the above shortcomings of the prior art, the present invention proposes a method for detecting bytecode similarity based on N-Gram. It can judge the similarity of Java executable files at the bytecode level and uses hash analysis to improve execution efficiency, so that it can be widely applied to evaluating the degree of Java bytecode obfuscation, code clone detection, and similar areas.
The present invention is achieved by the following technical solutions:
The invention converts the executable binary files to be compared into bytecode according to an N-gram model, analyzes the bytecode with an N-Gram hash algorithm to obtain the corresponding hash values, extracts features from the hash values with the winnowing algorithm, and finally calculates the similarity.
The conversion parses files according to the formats of the stack-based .class file and the register-based .dex file and converts them into readable bytecode files that can be processed with character operations.
The conversion shields the differences between decompiling different file types through a unified Binary2Bytecode interface. The interface identifies whether the current executable file is a .class file or a .dex file and calls the corresponding reverse-engineering tool, i.e. it converts the file into hash values based on the corresponding instruction set.
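As a rough illustration of how such a unified interface might dispatch on file type, the sketch below classifies a file by its leading magic bytes (.class files start with the bytes CA FE BA BE, .dex files with the ASCII bytes "dex" followed by a newline). The function names `detect_kind` and `binary2bytecode` and the stand-in backends are hypothetical; the patent does not disclose how Binary2Bytecode is implemented or which disassemblers it wraps.

```python
from typing import Callable, Dict

CLASS_MAGIC = bytes.fromhex("cafebabe")  # first four bytes of a .class file
DEX_MAGIC = b"dex\n"                     # first four bytes of a .dex file


def detect_kind(header: bytes) -> str:
    """Classify an executable by its leading magic bytes."""
    if header.startswith(CLASS_MAGIC):
        return "class"
    if header.startswith(DEX_MAGIC):
        return "dex"
    raise ValueError("unsupported executable format")


def binary2bytecode(data: bytes, backends: Dict[str, Callable[[bytes], str]]) -> str:
    """Route the raw file content to the backend registered for its type."""
    return backends[detect_kind(data[:8])](data)


# Hypothetical stand-in backends; a real setup would wrap one disassembler
# for .class files and one for .dex files.
demo_backends = {
    "class": lambda data: "<text listing of JVM instructions>",
    "dex": lambda data: "<text listing of Dalvik instructions>",
}
```

Because callers only see `binary2bytecode`, supporting another executable format in this sketch would mean adding one magic-byte check and one backend entry.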
The analysis maps the hash values in the bytecode file through the N-Gram hash expression $H(c_1 \ldots c_N) = c_1 b^{N-1} + c_2 b^{N-2} + \cdots + c_{N-1} b + c_N$, where $H$ is the mapping, $c_1 \ldots c_N$ is one N-gram, and $b$ is the base.
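The hash expression can be evaluated for every run of N consecutive codes with Horner's rule. The following is a minimal sketch, assuming the bytecode has already been reduced to a sequence of integers; the function name `ngram_hashes` and the sample codes are illustrative and do not come from the patent.

```python
from typing import List, Sequence


def ngram_hashes(codes: Sequence[int], n: int, base: int) -> List[int]:
    """Return H(c1..cN) = c1*b^(N-1) + ... + cN for each run of n codes."""
    hashes = []
    for start in range(len(codes) - n + 1):
        h = 0
        for c in codes[start:start + n]:
            h = h * base + c  # Horner's rule: one multiply and one add per code
        hashes.append(h)
    return hashes


# Hypothetical codes; with n = 3 and base = 8:
# H(1, 2, 3) = 1*64 + 2*8 + 3 = 83 and H(2, 3, 4) = 2*64 + 3*8 + 4 = 156
print(ngram_hashes([1, 2, 3, 4], n=3, base=8))  # [83, 156]
```

A sequence shorter than N yields an empty list, since it contains no complete N-gram.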
Feature extraction uses the winnowing algorithm: features are selected through a sliding window to guarantee uniformity, and randomness is guaranteed by retaining the feature with the minimum (or maximum) value in each window. The specific steps are:
1) set the sliding-window size and traverse the discrete values in order;
2) record the minimum or maximum discrete value in each window, together with its subscript, as a feature value pair;
3) delete identical feature value pairs to complete the feature extraction.
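The three steps above can be sketched as follows. This is an illustrative reading rather than the patent's implementation: the function name `winnow` is hypothetical, and ties inside a window are broken by Python's tuple ordering (smallest subscript for the minimum, largest for the maximum), a detail the text does not specify.

```python
from typing import Callable, List, Sequence, Tuple


def winnow(values: Sequence[int], window: int,
           pick: Callable[..., Tuple[int, int]] = min) -> List[Tuple[int, int]]:
    """Select one (value, subscript) pair per sliding window and drop repeats."""
    selected: List[Tuple[int, int]] = []
    # Step 1: slide a window of the given size over the discrete values.
    for start in range(len(values) - window + 1):
        candidates = [(values[i], i) for i in range(start, start + window)]
        # Step 2: keep the minimum (or maximum) value together with its subscript.
        pair = pick(candidates)
        # Step 3: skip a pair identical to one that is already recorded.
        if pair not in selected:
            selected.append(pair)
    return selected


# Hypothetical input: the value 3 occurs twice and is kept once per subscript.
print(winnow([5, 3, 9, 3, 7], window=3))            # [(3, 1), (3, 3)]
print(winnow([5, 3, 9, 3, 7], window=3, pick=max))  # [(9, 2)]
```

Passing `max` instead of `min` switches to the maximum-value variant mentioned in the text.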
The similarity is calculated by a formula whose terms are defined as follows: $\mathrm{List}_N(s)$ is the list of N-Gram features of bytecode $s$; $G_N(s)$ is the set of N-Gram features of bytecode $s$; and $num$ is the number of times the elements of the intersection of $G_N(s)$ and $G_N(t)$ occur in $\mathrm{List}_N(s)$ and $\mathrm{List}_N(t)$. In other words, when this method measures the similarity of two bytecode files, the number of times an identical feature occurs is also incorporated into the calculation.
The invention also relates to a system implementing the above method, comprising an executable file conversion module, a hash value calculation module, a feature value extraction module, and a similarity calculation module connected in series, wherein: the executable file conversion module converts a binary file into a text-parsable bytecode file and outputs it to the hash value calculation module; the hash value calculation module converts the bytecode file into vector form and outputs it to the feature value extraction module; the feature value extraction module applies winnowing to the hash value vector to extract the final representation vector of the executable file and outputs it to the similarity calculation module; and the similarity calculation module computes the final similarity result from the representation vectors using the improved N-Gram similarity formula.
Technical effect
Compared with the prior art, the present invention is easy to operate, fast to execute, and highly extensible, and it has industrial applicability. Ease of operation and extensibility come from the interface, which shields the differences between the various executable files and bytecode files and whose functionality can continue to be extended. The method executes quickly because it requires neither complex static analysis of the files nor complex comparison of static-analysis results. The cost of the method is very low: execution generally completes within 1 s (according to the data in the specific embodiment), although this depends on the size of the executable files. The cost lies mainly in the final similarity calculation module, whose worst case is O(nm), where n and m are the instruction counts of the two bytecode files.
Brief description of the drawings
Fig. 1 is a schematic diagram of converting a bytecode file into a hash vector;
Fig. 2 is a schematic diagram of the Binary2Bytecode interface parsing executable files into the corresponding bytecode files;
Fig. 3 is a schematic diagram of converting a bytecode hash value vector into 3-Gram hash values;
Fig. 4 is a flow diagram of obtaining the set of similarity values;
Fig. 5 is a schematic diagram of N-Gram similarity versus relative executable file size.
Specific embodiment
This embodiment relies on Android R8 (ver. 1.4.9). Android R8 is a user-customizable bytecode obfuscator newly released by Google; it can directly obfuscate a runnable Java program and convert it into Dalvik VM bytecode, i.e. a .dex file. The operating system is Ubuntu 16.04 with JDK 1.8.
This embodiment includes the following steps:
Step 1) Data preparation: first, 50 Android R8 obfuscation configuration files are generated randomly; using different obfuscation configuration files leads to differences in the .dex files that are finally generated. Second, the source Java program is converted into a .dex file using Android R8 without any obfuscation configuration.
The conversion uses the Binary2Bytecode interface shown in Fig. 2, which calls a different decompilation tool depending on the identified type of the executable file. This design shields the differences between executable files and is transparent to users of the method.
Step 2) The 50 randomly obfuscated .dex files from step 1 are each compared with the unobfuscated .dex file using the N-Gram-based similarity calculation. The resulting 50 similarity values are stored in a table for later experimental comparison, as shown in Fig. 4. The specific steps are:
2.1) the .dex file is converted into hash values based on the corresponding instruction set;
2.2) the N-Gram hash algorithm is applied to the bytecode file. This embodiment applies a 3-Gram hash algorithm to the bytecode hash values; the vector shown in Fig. 1 is mapped using the following hash expression: $H(c_1 \ldots c_N) = c_1 b^{N-1} + c_2 b^{N-2} + \cdots + c_{N-1} b + c_N$, where $H$ denotes the mapping, $c_1 \ldots c_N$ denotes one N-gram, and $b$ denotes the base, which is 8 in this embodiment.
2.3) the winnowing algorithm is used to extract features from the hash values, and the similarity is calculated from those features. Features are selected through a sliding window to guarantee uniformity, and randomness is guaranteed by retaining the feature with the minimum (or maximum) value in each window. The detailed process is as follows:
Suppose there is the following group of feature values and the window size is set to 4:
17 28 75 15 56 89 78 32 8 69 35 30 87 101 203 99
The minimum value in the first window is 15:
[17 28 75 15] 56 89 78 32 8 69 35 30 87 101 203 99
The minimum in the second window is still 15:
17 [28 75 15 56] 89 78 32 8 69 35 30 87 101 203 99
This continues until the fifth window, where the minimum is updated to 32 (this is also the case in which two selected feature values are farthest apart):
17 28 75 15 [56 89 78 32] 8 69 35 30 87 101 203 99
The feature value pairs finally selected are [15,3] [32,7] [8,8] [30,11] [87,12], where the first item is the feature value and the second item is its subscript (counted from 0). When features with the same hash value but different subscripts occur, the subscript distinguishes them.
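Steps 2.2 and 2.3 can be chained in a short sketch that uses the embodiment's parameters (3-Gram, base 8) and the window size of 4 from the worked example. The function names and the sample instruction codes are hypothetical, and the per-window minimum with leftmost tie-breaking is an assumption; running the sketch on the feature values listed above reproduces the five pairs of the example.

```python
from typing import List, Sequence, Tuple

N, BASE, WINDOW = 3, 8, 4  # N and base from step 2.2, window size from the example


def hash_vector(codes: Sequence[int]) -> List[int]:
    """3-Gram hash of every run of N consecutive codes (step 2.2)."""
    return [sum(c * BASE ** (N - 1 - k) for k, c in enumerate(codes[i:i + N]))
            for i in range(len(codes) - N + 1)]


def fingerprints(hashes: Sequence[int]) -> List[Tuple[int, int]]:
    """Winnowing with the per-window minimum (step 2.3) as (value, subscript)."""
    picked: List[Tuple[int, int]] = []
    for start in range(len(hashes) - WINDOW + 1):
        pair = min((hashes[i], i) for i in range(start, start + WINDOW))
        # A pair selected by several windows is selected by consecutive ones,
        # so comparing with the last recorded pair is enough to drop repeats.
        if not picked or picked[-1] != pair:
            picked.append(pair)
    return picked


example = [17, 28, 75, 15, 56, 89, 78, 32, 8, 69, 35, 30, 87, 101, 203, 99]
print(fingerprints(example))
# [(15, 3), (32, 7), (8, 8), (30, 11), (87, 12)]

codes = [1, 2, 3, 4, 5, 6, 7]  # hypothetical instruction codes
print(fingerprints(hash_vector(codes)))
# [(83, 0), (156, 1)]
```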
Step 3) The 50 similarity values obtained in step 2 are compared with the relative size change rates of the 50 .dex files, as shown in Fig. 5. When Android R8 obfuscates and converts a given program, the size of the final executable file is sensitive to changes in the degree of obfuscation, so the experiment uses the size of the obfuscated executable file relative to the unobfuscated executable file as the comparison index.
As can be seen from the figure, the change in the similarity computed with N-Gram is largely consistent with the change in the relative size of the executable files, which demonstrates the validity of the method; the method also runs quickly.
In a concrete experiment, the above method was run on a sample of 50 files under JDK 1.8 on Ubuntu 16.04. The results show that the method can effectively describe the similarity of bytecode files, and that the similarity calculation for a single file completes at the millisecond level.
Those skilled in the art may make local adjustments to the above specific implementation in different ways without departing from the principle and purpose of the invention. The scope of protection of the invention is defined by the claims and is not limited by the above specific implementation; every implementation within that scope is bound by the invention.
Claims (7)
1. A method for detecting bytecode similarity based on N-Gram, characterized in that the executable binary files to be compared are converted into bytecode according to an N-gram model, the bytecode is analyzed with an N-Gram hash algorithm to obtain the corresponding hash values, features are extracted from the hash values with the winnowing algorithm, and the similarity is finally calculated.
2. the method for the detection bytecode similarity according to claim 1 based on N-Gram, characterized in that described turns
Change, with based on stack .class file and the .dex file based on register format parsed and be converted into can read, can
The byte code files handled using character manipulation.
3. The method for detecting bytecode similarity based on N-Gram according to claim 1, characterized in that the conversion shields the differences between decompiling different file types through a unified Binary2Bytecode interface, which identifies whether the current executable file is a .class file or a .dex file and calls the corresponding reverse-engineering tool, i.e. converts the file into hash values based on the corresponding instruction set.
4. The method for detecting bytecode similarity based on N-Gram according to claim 1, characterized in that the analysis maps the hash values in the bytecode file through the N-Gram hash expression $H(c_1 \ldots c_N) = c_1 b^{N-1} + c_2 b^{N-2} + \cdots + c_{N-1} b + c_N$, where $H$ is the mapping, $c_1 \ldots c_N$ is one N-gram, and $b$ is the base.
5. The method for detecting bytecode similarity based on N-Gram according to claim 1, characterized in that the feature extraction uses the winnowing algorithm to select features through a sliding window to guarantee uniformity, and guarantees randomness by retaining the feature with the minimum (or maximum) value in each window, the specific steps comprising:
1) setting the sliding-window size and traversing the discrete values in order;
2) recording the minimum or maximum discrete value in each window, together with its subscript, as a feature value pair;
3) deleting identical feature value pairs to complete the feature extraction.
6. The method for detecting bytecode similarity based on N-Gram according to claim 1, characterized in that the similarity is calculated by a formula whose terms are defined as follows: $\mathrm{List}_N(s)$ is the list of N-Gram features of bytecode $s$; $G_N(s)$ is the set of N-Gram features of bytecode $s$; and $num$ is the number of times the elements of the intersection of $G_N(s)$ and $G_N(t)$ occur in $\mathrm{List}_N(s)$ and $\mathrm{List}_N(t)$, i.e. when the similarity of two bytecode files is measured with this method, the number of times an identical feature occurs is also incorporated into the calculation.
7. A system implementing the method of any one of the preceding claims, characterized by comprising an executable file conversion module, a hash value calculation module, a feature value extraction module, and a similarity calculation module connected in series, wherein: the executable file conversion module converts a binary file into a text-parsable bytecode file and outputs it to the hash value calculation module; the hash value calculation module converts the bytecode file into vector form and outputs it to the feature value extraction module; the feature value extraction module applies winnowing to the hash value vector to extract the final representation vector of the executable file and outputs it to the similarity calculation module; and the similarity calculation module computes the final similarity result from the representation vectors using the improved N-Gram similarity formula.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910653076.7A CN110362343A (en) | 2019-07-19 | 2019-07-19 | The method of the detection bytecode similarity of N-Gram |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110362343A true CN110362343A (en) | 2019-10-22 |
Family ID: 68220358
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910653076.7A Pending CN110362343A (en) | 2019-07-19 | 2019-07-19 | The method of the detection bytecode similarity of N-Gram |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110362343A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112579155A (en) * | 2021-02-23 | 2021-03-30 | 北京北大软件工程股份有限公司 | Code similarity detection method and device and storage medium |
WO2023028721A1 (en) * | 2021-08-28 | 2023-03-09 | Huawei Technologies Co.,Ltd. | Systems and methods for detection of code clones |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101546320A (en) * | 2008-03-27 | 2009-09-30 | 林兆祥 | Data difference analysis method based on sliding window |
CN105871619A (en) * | 2016-04-18 | 2016-08-17 | 中国科学院信息工程研究所 | Method for n-gram-based multi-feature flow load type detection |
US20180143979A1 (en) * | 2016-11-21 | 2018-05-24 | Université de Lausanne | Method for segmenting and indexing features from multidimensional data |
CN109101479A (en) * | 2018-06-07 | 2018-12-28 | 苏宁易购集团股份有限公司 | A kind of clustering method and device for Chinese sentence |
CN109151218A (en) * | 2018-08-21 | 2019-01-04 | 平安科技(深圳)有限公司 | Call voice quality detecting method, device, computer equipment and storage medium |
CN109977668A (en) * | 2017-12-27 | 2019-07-05 | 哈尔滨安天科技股份有限公司 | The querying method and system of malicious code |
Non-Patent Citations (3)
Title |
---|
GCYXF: "《Jaccard系数(Jaccard Coefficient)和tf-idf方法》", 《HTTPS://BLOG.CSDN.NET/GCYXF/ARTICLE/DETAILS/39480425》 * |
君的名字: "《【代码克隆检测】基于K-gram hash 分析特征提取技术(代码篇)》", 《 HTTPS://BLOG.CSDN.NET/GRACE_0642/ARTICLE/DETAILS/53128303》 * |
君的名字: "《基于K-gram的winnowing特征提取剽窃查重检测技术(概念篇)》", 《HTTPS://BLOG.CSDN.NET/GRACE_0642/ARTICLE/DETAILS/53115067》 * |
Ullah et al. | Efficient features for function matching in multi-architecture binary executables |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20191022 |