CN106649218A - Quick binary file comparing method based on SimHash algorithm - Google Patents

Info

Publication number
CN106649218A
Authority
CN
China
Prior art keywords
binary file
function
simhash
keyword
basic block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611009372.6A
Other languages
Chinese (zh)
Inventor
唐勇
俞昕
王宝生
王毅
喻波
解炜
李�根
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN201611009372.6A
Publication of CN106649218A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a binary file comparison method based on the SimHash algorithm. The method comprises the following steps: first, a plug-in written with the extension facilities of IDA Pro extracts information from the binary file, including its assembly instruction sequence, control flow graph, and call graph; second, the extracted information is preprocessed; third, keywords are defined on the preprocessed information; fourth, the extracted keywords are weighted; fifth, the keywords and their weights are used to compute and store a SimHash fingerprint for each function; sixth, starting from the similar candidates returned by a fingerprint query, exact matching is performed with a classical structure-matching algorithm. The method has the advantages of good generality, high efficiency, and high accuracy.

Description

A fast binary file comparison method based on the SimHash algorithm
Technical field
The present invention relates to the field of computer system security, and in particular to a fast binary file comparison method based on the SimHash algorithm.
Background
With the rapid development of computer technology and the widespread use of the Internet, software has grown steadily larger as its functions have diversified. Richer functionality improves the user experience but also introduces many security problems. Moreover, software of any significant size inevitably uses third-party components, which often ship without source code, for example the dynamic link library files of Microsoft systems. To audit the code of such software, reverse engineering is almost the only option.
In reverse engineering, the main task of static analysis is to recover, in whole or in part, the function and data information that has been stripped from a binary file, laying the groundwork for subsequent work such as system analysis, locating the key points for attacking a device, and exploitation for control. Because software today exists in many versions and at very large scale, this work can no longer be done by hand, which creates the need for automated reverse engineering. Binary file comparison is one of the main techniques used for this purpose.
Four binary file comparison approaches are in common use: comparison of the raw binary bytes of the files, comparison of the assembly instructions obtained by disassembling them, similarity graph comparison based on assembly instructions, and structured graph comparison based on assembly instructions. Each approach, in the order listed, overcomes shortcomings of the previous ones, but all of them still fall short on large, complex binary files.
Apart from direct comparison of binary content, most existing binary comparison methods follow a three-step flow: extract features, traverse the function set to find matching functions, and iteratively correct the matching result. Consider two binary files A and B. Let N(A) and N(B) denote the number of functions in A and B, t the time needed to compare two feature signatures, and n the number of loop iterations. The time complexity of this flow is then roughly O(n * N(A) * N(B) * t). Later methods have refined it and gradually narrow the comparison set, but the overall time complexity has not improved significantly. For large software, the disassembled function set often contains hundreds of thousands or even millions of functions, so conventional methods need on the order of 10^10 to 10^12 comparisons (ignoring feature-matching time and the number of iterations). Such a time cost is unacceptable. A new method is needed that narrows the comparison set and matches only functions that are likely to be similar.
Summary of the invention
The technical problem to be solved by the present invention is as follows: in view of the technical problems in the prior art, the present invention provides a binary file comparison method based on the SimHash algorithm that is general and offers higher efficiency and accuracy.
To solve the above technical problem, the present invention adopts the following technical solution:
A binary file comparison method based on the SimHash algorithm, comprising the following steps:
S1: Write a plug-in using the extension facilities of IDA Pro to extract information from the binary file; the information includes the assembly instruction sequence, control flow graph, and call graph of the binary file;
S2: Preprocess the extracted binary file information;
S3: Define keywords on the preprocessed binary file information;
S4: Assign weights to the extracted keywords;
S5: Using the extracted keywords and their weights, compute the SimHash fingerprint of each function and store the fingerprints;
S6: Starting from the similar candidates returned by the query, perform exact matching with a classical structure-matching algorithm.
As a further improvement of the present invention: in step S3, keywords are defined at three levels: single instruction, basic block, and basic path. The opcode and operands of each single instruction are taken as keywords; the SPP value of each basic block is computed with the SPP algorithm; and the SPP value of each basic path is used as a keyword.
As a further improvement of the present invention: in step S4, keyword importance is ordered as basic path > basic block > single instruction. For a single instruction, the weight of its opcode and of its operands is 1; for a basic block, the weight is its SPP value; for a basic path, the weight is the number of instructions the path passes through.
As a further improvement of the present invention, the detailed process of step S3 is:
S301: Merge split blocks in the control flow graph;
S302: Normalize the registers in the instruction sequence;
S303: Redirect the address information in the instruction sequence.
As a further improvement of the present invention: in step S301, when a basic block in the CFG has exactly one child block, and that child block has exactly one parent block, the two basic blocks are defined as split blocks. Let B be the set of all basic blocks in function F, p(a) the set of parent nodes of basic block a in the CFG, c(a) the set of child nodes of basic block a in the CFG, and e(a, b) the edge in the CFG from a to b. The flow is:
i. Traverse the basic blocks in set B; if the traversal is complete, go to step v; otherwise take a basic block b from B and go to step ii;
ii. If the child node set c(b) of basic block b has size 1, go to step iii; otherwise return to step i;
iii. Let a be the only element of c(b); if the parent node set p(a) of basic block a has size 1, go to step iv; otherwise return to step i;
iv. Merge b and a into a new basic block c; remove the edge e(b, a) and the edge sets {e(x, b) | x ∈ p(b)} and {e(a, x) | x ∈ c(a)}; create the new edge sets {e(x, c) | x ∈ p(b)} and {e(c, x) | x ∈ c(a)}; add basic block c to set B and return to step i;
v. End.
As a further improvement of the present invention, the flow of step S303 is:
i. For jump instructions in the code segment, ignore the offset address that follows the instruction when extracting feature values, and use the hash of the destination address block as the feature;
ii. For data pointers into the data segment, extract the opcode of the data value that the pointer refers to as the feature.
As a further improvement of the present invention, the flow of step S5 is:
S501: Compute the SimHash fingerprint of each function;
S502: Store the SimHash fingerprints of the functions;
S503: Query the SimHash fingerprints of the functions.
As a further improvement of the present invention, the flow of step S501 is:
i. Create a 32-dimensional vector V and a 32-bit binary number S, and initialize both to 0;
ii. Let K be the keyword set, s(k) = hash(k) the 32-bit hash of keyword k ∈ K, and v(k) the weight of keyword k;
iii. For each keyword k ∈ K and for i = 0 to 31: if bit i of s(k) is 1, add v(k) to the i-th element of V; otherwise subtract v(k) from it;
iv. After step iii has been applied to all keywords, for each i set bit i of S to 1 if the i-th element of V is greater than 0, and to 0 otherwise. The resulting S is the fingerprint of the function.
As a further improvement of the present invention, the flow of step S502 is:
i. Traverse the function set F of the binary file and compute the fingerprint S(f) of each function f ∈ F;
ii. Build an 8-bit index table with index range 0 to 255;
iii. Split the 32-bit binary representation of each fingerprint S(f) into 4 segments of 8 bits each (bits 31-24, 23-16, 15-8, and 7-0), and store the complete 32-bit fingerprint under the index entry given by each of the 4 segments.
As a further improvement of the present invention, the flow of step S503 is:
i. Traverse the function set F of the file to be matched and compute the fingerprint S(f) of each function f ∈ F;
ii. Split the 32-bit binary representation of each fingerprint S(f) into 4 segments of 8 bits each (bits 31-24, 23-16, 15-8, and 7-0), denoted S1, S2, S3, S4;
iii. Using S1, S2, S3, S4 as index entries, query the index of the binary file that serves as the existing knowledge base.
Compared with the prior art, the advantages of the present invention are:
1. The binary file comparison method based on the SimHash algorithm of the present invention is general. It focuses on the fundamental problem of time efficiency, which makes comparison of large software feasible. In addition, when extracting features from a function it pays more attention to the function's structural features and relationships, and uses the SimHash algorithm to evaluate the function's different features jointly, which makes it possible to compare complex software that uses techniques such as code obfuscation.
2. The binary file comparison method based on the SimHash algorithm of the present invention compares binary files accurately. In the proposed method, the SPP algorithm is first applied to basic blocks and basic paths to extract their keywords, which solves the problem of instruction reordering inside them. Furthermore, when the SimHash fingerprint of a function is extracted, the order-independence of SimHash and the structural nature of the keywords further reduce the mismatches caused by minor changes and instruction reordering, improving the matching accuracy of functions.
3. The binary file comparison method based on the SimHash algorithm of the present invention is efficient. Suppose binary files A and B contain 2^x and 2^y functions respectively. With the proposed method, the feature fingerprint (32 bits) of each function in A is split into 4 segments (8 bits each), and these 4 segments are used as indexes to look up entries in the fingerprint index table of B. Each index lookup then returns about 2^(y-8) candidate results on average, assuming the fingerprints are spread evenly over the 256 index entries. The number of comparisons drops from 2^(x+y) to 4 * 2^(x+y-8) = 2^(x+y-6). For large programs this greatly reduces comparison time and achieves fast lookup.
Description of the drawings
Fig. 1 is a schematic flow chart of the method of the present invention.
Detailed description
The present invention is described in further detail below with reference to the accompanying drawing and specific embodiments.
As shown in Fig. 1, a binary file comparison method based on the SimHash (similarity hashing) algorithm according to the present invention comprises the following steps:
S1: Write a plug-in using the extension facilities of IDA Pro to extract information from the binary file; the information includes the assembly instruction sequence, control flow graph, and call graph of the binary file. IDA Pro is the Interactive Disassembler Professional.
S2: Preprocess the extracted binary file information.
S3: Define keywords on the preprocessed binary file information.
In a specific application example, keywords are defined mainly at the following levels: single instruction, basic block, and basic path.
a) Single instruction: as the basic unit through which a function implements its behavior, the single instruction is one of the essential features of a function. The opcode and operands of each single instruction are taken as keywords;
b) Basic block: as a node of the control flow graph, the basic block reflects the internal structure of a function. The present invention uses the SPP algorithm to compute the SPP value of each basic block. Because the SPP algorithm is order-independent and repeatable, it solves the problem of differing instruction sequences caused by instruction reordering, so the SPP value can be used as a keyword;
c) Basic path: a basic path is a route by which the function's behavior is realized. Functions with essentially the same behavior have essentially the same basic paths, so the SPP value of a basic path can be used as a keyword.
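The source does not define SPP. Assuming it refers to the small primes product signature (each instruction mnemonic is mapped to a small prime and the primes are multiplied modulo 2^64), the following minimal sketch illustrates why reordering instructions inside a block leaves the value unchanged. The prime table and the function name `spp` are hypothetical.

```python
# Hypothetical mnemonic-to-prime table; a real table would cover the whole instruction set.
PRIMES = {"mov": 2, "add": 3, "sub": 5, "cmp": 7, "jmp": 11, "push": 13, "pop": 17}


def spp(mnemonics, modulus=2**64):
    """Small-primes-product signature of a sequence of instruction mnemonics."""
    value = 1
    for mnemonic in mnemonics:
        value = (value * PRIMES[mnemonic]) % modulus
    return value


block = ["mov", "add", "cmp"]
reordered = ["add", "cmp", "mov"]
# Multiplication commutes, so both orderings give the same signature.
print(spp(block) == spp(reordered))
```

Under the same assumption, the signature of a basic path could be computed by feeding the mnemonics of every block on the path to the same function.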
S4: Assign weights to the extracted keywords.
In a specific application example, keyword importance is ordered as basic path > basic block > single instruction. For a single instruction, the weight of its opcode and of its operands is 1. For a basic block, the weight is its SPP value. For a basic path, the weight is the number of instructions the path passes through.
S5: Using the extracted keywords and their weights, compute the SimHash fingerprint of each function and store the fingerprints.
After fingerprints have been computed and stored for all functions in both the binary file to be matched and the reference binary file, the fingerprints of the functions to be matched are used to query for similar functions.
S6: Starting from the similar candidates returned by the query, perform exact matching with a classical structure-matching algorithm.
As a preferred embodiment, the detailed process of step S3 of the present invention is:
S301: Merge split blocks in the control flow graph. Split blocks are defined as follows: when a basic block in the CFG has exactly one child block, and that child block has exactly one parent block, the two basic blocks are defined as split blocks, where CFG is the control flow graph (Control Flow Graph) of the function. Let B be the set of all basic blocks in function F, p(a) the set of parent nodes of basic block a in the CFG, c(a) the set of child nodes of basic block a in the CFG, and e(a, b) the edge in the CFG from a to b. The merge algorithm is as follows:
i. Traverse the basic blocks in set B. If the traversal is complete, go to step v. Otherwise take a basic block b from B and go to step ii.
ii. If the child node set c(b) of basic block b has size 1, go to step iii; otherwise return to step i.
iii. Let a be the only element of c(b). If the parent node set p(a) of basic block a has size 1, go to step iv; otherwise return to step i.
iv. Merge b and a into a new basic block c. Remove the edge e(b, a) and the edge sets {e(x, b) | x ∈ p(b)} and {e(a, x) | x ∈ c(a)}. Create the new edge sets {e(x, c) | x ∈ p(b)} and {e(c, x) | x ∈ c(a)}. Add basic block c to set B and return to step i.
v. The algorithm ends.
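The merge procedure can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the CFG is represented as a dictionary from block id to the set of child ids, and the merged block reuses the id of b instead of introducing a new block c, which is equivalent up to naming. The function name `merge_split_blocks` is hypothetical.

```python
def merge_split_blocks(succ):
    """Merge each block b with its only child a when b is the only parent of a.

    succ maps a block id to the set of its child block ids (assumed CFG layout).
    Returns (new_succ, members); members maps every surviving block id to the
    ordered list of original blocks that were merged into it.
    """
    succ = {b: set(children) for b, children in succ.items()}
    for children in list(succ.values()):
        for child in children:
            succ.setdefault(child, set())
    pred = {b: set() for b in succ}
    for b, children in succ.items():
        for child in children:
            pred[child].add(b)
    members = {b: [b] for b in succ}

    changed = True
    while changed:
        changed = False
        for b in list(succ):
            if len(succ[b]) != 1:
                continue  # step ii: b must have exactly one child
            (a,) = succ[b]
            if a == b or len(pred[a]) != 1:
                continue  # step iii: a must have exactly one parent
            # Step iv: b keeps its incoming edges and takes over a's outgoing edges.
            succ[b] = succ.pop(a)
            for x in succ[b]:
                pred[x].discard(a)
                pred[x].add(b)
            del pred[a]
            members[b] += members.pop(a)
            changed = True
            break  # restart the traversal, as in "return to step i"
    return succ, members


cfg = {"A": {"B"}, "B": {"C"}, "C": {"D", "E"}, "D": {"F"}, "E": {"F"}, "F": set()}
merged, members = merge_split_blocks(cfg)
print(merged, members)
```

In the example, the chain A, B, C collapses into one block, while D and E stay separate because their common child F has two parents.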
S302: Normalize the registers in the instruction sequence. Which general-purpose register is used often depends on compiler optimization options, so the registers are normalized: EAX, EBX, ECX, and EDX are treated as equivalent (EAX = EBX = ECX = EDX), and the same is done for the 16-bit registers (AX, BX, CX, DX), the high-byte registers (AH, BH, CH, DH), and the low-byte registers (AL, BL, CL, DL).
S303: Redirect the address information in the instruction sequence.
The specific operations are as follows:
i. For jump instructions in the code segment, ignore the offset address that follows the instruction when extracting feature values, and use the hash of the destination address block as the feature.
ii. For data pointers into the data segment, extract the opcode of the data value that the pointer refers to as the feature.
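Steps S302 and S303 (item i) can be sketched as text rewriting on disassembled instructions. The placeholder names (`REG32`, `REG16`, `REG8H`, `REG8L`), the use of a truncated MD5 as the block hash, and the `blocks` mapping from start address to instruction list are all assumptions made for illustration; the source does not name a hash function or a data layout. Item ii (data pointers) is not covered.

```python
import hashlib
import re

# One placeholder per register class, following the grouping in S302.
REG_CLASSES = [
    (re.compile(r"\be[abcd]x\b", re.IGNORECASE), "REG32"),
    (re.compile(r"\b[abcd]x\b", re.IGNORECASE), "REG16"),
    (re.compile(r"\b[abcd]h\b", re.IGNORECASE), "REG8H"),
    (re.compile(r"\b[abcd]l\b", re.IGNORECASE), "REG8L"),
]


def normalize_registers(instruction):
    """Replace each general-purpose register with the placeholder of its class."""
    for pattern, placeholder in REG_CLASSES:
        instruction = pattern.sub(placeholder, instruction)
    return instruction


def block_hash(instructions):
    """Short hash of a block's instruction text (truncated MD5, an assumption)."""
    text = "\n".join(instructions).encode()
    return hashlib.md5(text).hexdigest()[:8]


def redirect_jump(instruction, blocks):
    """Replace the target address of a jump with the hash of the destination block.

    blocks maps a block start address (int) to that block's instruction list.
    Instructions that are not jumps to a known block are returned unchanged.
    """
    mnemonic, _, operand = instruction.partition(" ")
    operand = operand.strip()
    if mnemonic.lower().startswith("j") and operand.lower().startswith("0x"):
        target = int(operand, 16)
        if target in blocks:
            return f"{mnemonic} {block_hash(blocks[target])}"
    return instruction


print(normalize_registers("mov eax, [ebx+4]"))
print(redirect_jump("jz 0x401000", {0x401000: ["ret"]}))
```

With this rewriting, two jumps to blocks with identical content produce the same feature even when the blocks sit at different addresses.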
As a preferred embodiment, the detailed process of step S5 of the present invention is:
S501: Compute the SimHash fingerprint of each function:
i. Create a 32-dimensional vector V and a 32-bit binary number S, and initialize both to 0.
ii. Let K be the keyword set, s(k) = hash(k) the 32-bit hash of keyword k ∈ K, and v(k) the weight of keyword k.
iii. For each keyword k ∈ K and for i = 0 to 31: if bit i of s(k) is 1, add v(k) to the i-th element of V; otherwise subtract v(k) from it.
iv. After step iii has been applied to all keywords, for each i set bit i of S to 1 if the i-th element of V is greater than 0, and to 0 otherwise. The resulting S is the fingerprint of the function.
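Steps i to iv translate almost directly into code. In the sketch below, the 32-bit keyword hash is taken from the first four bytes of an MD5 digest; that choice, and the names `hash32` and `simhash32`, are assumptions, since the source only requires some 32-bit hash.

```python
import hashlib


def hash32(keyword):
    """32-bit hash s(k) of a keyword (truncated MD5, an assumed choice)."""
    return int.from_bytes(hashlib.md5(keyword.encode()).digest()[:4], "big")


def simhash32(weighted_keywords):
    """Compute a 32-bit SimHash from an iterable of (keyword, weight) pairs."""
    v = [0] * 32                      # step i: vector V
    for keyword, weight in weighted_keywords:
        h = hash32(keyword)           # step ii: s(k)
        for i in range(32):           # step iii: add or subtract v(k) per bit
            if (h >> i) & 1:
                v[i] += weight
            else:
                v[i] -= weight
    s = 0
    for i in range(32):               # step iv: keep the sign of each element
        if v[i] > 0:
            s |= 1 << i
    return s


keywords = [("mov", 1), ("REG32", 1), ("block:42", 42), ("path:1001", 7)]
print(f"{simhash32(keywords):032b}")
```

Because the per-bit sums do not depend on the order in which keywords are processed, shuffling the keyword list leaves the fingerprint unchanged.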
S502: Store the SimHash fingerprints of the functions:
i. Traverse the function set F of the binary file and compute the fingerprint S(f) of each function f ∈ F.
ii. Build an 8-bit index table with index range 0 to 255.
iii. Split the 32-bit binary representation of each fingerprint S(f) into 4 segments of 8 bits each (bits 31-24, 23-16, 15-8, and 7-0), and store the complete 32-bit fingerprint under the index entry given by each of the 4 segments.
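A minimal sketch of the storage step, assuming the index table is a dictionary keyed by the 8-bit segment value and that each entry holds (function name, fingerprint) pairs. The names `segments` and `build_index` are hypothetical.

```python
from collections import defaultdict


def segments(fingerprint):
    """Split a 32-bit fingerprint into bits 31-24, 23-16, 15-8, and 7-0."""
    return [(fingerprint >> shift) & 0xFF for shift in (24, 16, 8, 0)]


def build_index(fingerprints):
    """Build the 256-entry index table from a {function name: fingerprint} mapping."""
    index = defaultdict(set)  # key 0..255 -> set of (name, fingerprint)
    for name, fingerprint in fingerprints.items():
        for segment in segments(fingerprint):
            # The complete fingerprint is stored under each of its four segments.
            index[segment].add((name, fingerprint))
    return index


index = build_index({"sub_401000": 0x12345678, "sub_401200": 0x12AB5678})
print(sorted(index))
```

Each fingerprint is therefore reachable through up to four index entries, which is what allows a query to succeed when only one segment matches.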
S503: Query the SimHash fingerprints of the functions:
i. Traverse the function set F of the file to be matched and compute the fingerprint S(f) of each function f ∈ F.
ii. Split the 32-bit binary representation of each fingerprint S(f) into 4 segments of 8 bits each (bits 31-24, 23-16, 15-8, and 7-0), denoted S1, S2, S3, S4.
iii. Using S1, S2, S3, S4 as index entries, query the index of the binary file that serves as the existing knowledge base.
For each function entry returned by the query, compute the Hamming distance between its complete SimHash fingerprint and the complete SimHash fingerprint of the function to be matched. If the Hamming distance is less than or equal to 3, the two functions are considered similar.
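The query and the Hamming-distance filter can be sketched together. The index layout (a dictionary from 8-bit segment value to a set of (name, fingerprint) pairs) and all function names are assumptions for illustration. Note that two 32-bit fingerprints that differ in at most 3 bits must agree exactly on at least one of their 4 segments, so the four lookups cannot miss a function within the threshold.

```python
from collections import defaultdict


def segments(fingerprint):
    """Split a 32-bit fingerprint into four 8-bit segments S1..S4."""
    return [(fingerprint >> shift) & 0xFF for shift in (24, 16, 8, 0)]


def build_index(fingerprints):
    """Index every fingerprint under each of its four segment values."""
    index = defaultdict(set)
    for name, fingerprint in fingerprints.items():
        for segment in segments(fingerprint):
            index[segment].add((name, fingerprint))
    return index


def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def query(index, fingerprint, max_distance=3):
    """Return the names of indexed functions within max_distance of fingerprint."""
    candidates = set()
    for segment in segments(fingerprint):
        candidates |= index.get(segment, set())
    return sorted(
        name for name, other in candidates
        if hamming(fingerprint, other) <= max_distance
    )


knowledge_base = build_index({"f": 0x12345678, "g": 0x12345679, "h": 0xFFFFFFFF})
print(query(knowledge_base, 0x1234567B))
```

The functions returned here would then be passed to the exact structure-matching stage of step S6.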
Current mainstream binary file comparison tools do not support comparison of large-scale software, and their accuracy is low on software that uses code obfuscation. The main reason is that large-scale software has many functions and complex changes, so time efficiency and matching accuracy are hard to maintain. The method proposed by the present invention focuses on the fundamental problem of time efficiency, which makes comparison of large software feasible. In addition, when extracting features from a function it pays more attention to the function's structural features and relationships, and uses the SimHash algorithm to evaluate the function's different features jointly, which makes it possible to compare complex software that uses techniques such as code obfuscation.
In conventional methods, both function signatures and basic-block instruction signatures must be extracted in a way that tolerates the effects of the compiler on the code. When basic blocks are matched with structural signatures, matches are sometimes missed because of instruction reordering and similar causes, and these problems lead to misjudgments. Likewise, a function signature based on structural comparison that considers only structural features such as the number of internal basic blocks, call instructions, and jump instructions, and ignores instruction content, may match two functions that have the same structure but different semantics, for example a maximum function and a minimum function. In the method proposed by the present invention, the SPP algorithm is first applied to basic blocks and basic paths to extract their keywords, which solves the problem of instruction reordering inside them. Furthermore, when the SimHash fingerprint of a function is extracted, the order-independence of SimHash and the structural nature of the keywords further reduce the mismatches caused by minor changes and instruction reordering, improving the matching accuracy of functions.
Conventional binary file comparison methods determine matching objects by blind one-by-one comparison. Suppose two binary files A and B contain 2^x and 2^y functions respectively; a conventional method then needs at least 2^(x+y) comparisons. The time cost is enormous, and many of the comparisons are useless. With the method proposed by the present invention, the feature fingerprint (32 bits) of each function in A is split into 4 segments (8 bits each), and these 4 segments are used as indexes to look up entries in the fingerprint index table of B. Each index lookup then returns about 2^(y-8) candidate results on average, assuming the fingerprints are spread evenly over the 256 index entries. The number of comparisons drops from 2^(x+y) to 4 * 2^(x+y-8) = 2^(x+y-6). For large programs this greatly reduces comparison time and achieves fast lookup.
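The comparison counts can be written out step by step. The derivation below takes the source's per-lookup figure of 2^(y-8) candidates at face value, which assumes an even spread of fingerprints over the 256 index entries; the symbols C_pairwise and C_indexed are introduced here only for illustration.

```latex
\begin{align*}
C_{\text{pairwise}} &= 2^{x} \cdot 2^{y} = 2^{x+y} \\
C_{\text{indexed}}  &= 2^{x} \cdot 4 \cdot 2^{y-8} = 2^{x} \cdot 2^{2} \cdot 2^{y-8} = 2^{x+y-6} \\
\frac{C_{\text{pairwise}}}{C_{\text{indexed}}} &= 2^{6} = 64
\end{align*}
```

For example, with x = y = 20 (about one million functions in each file), the pairwise count is 2^40, roughly 1.1 * 10^12, while the indexed count is 2^34, roughly 1.7 * 10^10.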
The above is only a preferred embodiment of the present invention. The scope of protection of the present invention is not limited to the above embodiment; all technical solutions within the idea of the present invention fall within its scope of protection. It should be noted that, for those of ordinary skill in the art, improvements and modifications that do not depart from the principle of the present invention shall also be regarded as within the scope of protection of the present invention.

Claims (10)

1. A binary file comparison method based on the SimHash algorithm, characterized in that the steps are:
S1: Write a plug-in using the extension facilities of IDA Pro to extract information from the binary file; the information includes the assembly instruction sequence, control flow graph, and call graph of the binary file;
S2: Preprocess the extracted binary file information;
S3: Define keywords on the preprocessed binary file information;
S4: Assign weights to the extracted keywords;
S5: Using the extracted keywords and their weights, compute the SimHash fingerprint of each function and store the fingerprints;
S6: Starting from the similar candidates returned by the query, perform exact matching with a classical structure-matching algorithm.
2. The binary file comparison method based on the SimHash algorithm according to claim 1, characterized in that, in step S3, keywords are defined at three levels: single instruction, basic block, and basic path; the opcode and operands of each single instruction are taken as keywords, the SPP value of each basic block is computed with the SPP algorithm, and the SPP value of each basic path is used as a keyword.
3. The binary file comparison method based on the SimHash algorithm according to claim 1, characterized in that, in step S4, keyword importance is ordered as basic path > basic block > single instruction; for a single instruction, the weight of its opcode and of its operands is 1; for a basic block, the weight is its SPP value; for a basic path, the weight is the number of instructions the path passes through.
4. The binary file comparison method based on the SimHash algorithm according to claim 1, 2, or 3, characterized in that the detailed process of step S3 is:
S301: Merge split blocks in the control flow graph;
S302: Normalize the registers in the instruction sequence;
S303: Redirect the address information in the instruction sequence.
5. The binary file comparison method based on the SimHash algorithm according to claim 4, characterized in that, in step S301, when a basic block in the CFG has exactly one child block, and that child block has exactly one parent block, the two basic blocks are defined as split blocks; let B be the set of all basic blocks in function F, p(a) the set of parent nodes of basic block a in the CFG, c(a) the set of child nodes of basic block a in the CFG, and e(a, b) the edge in the CFG from a to b; the flow is:
i. Traverse the basic blocks in set B; if the traversal is complete, go to step v; otherwise take a basic block b from B and go to step ii;
ii. If the child node set c(b) of basic block b has size 1, go to step iii; otherwise return to step i;
iii. Let a be the only element of c(b); if the parent node set p(a) of basic block a has size 1, go to step iv; otherwise return to step i;
iv. Merge b and a into a new basic block c; remove the edge e(b, a) and the edge sets {e(x, b) | x ∈ p(b)} and {e(a, x) | x ∈ c(a)}; create the new edge sets {e(x, c) | x ∈ p(b)} and {e(c, x) | x ∈ c(a)}; add basic block c to set B and return to step i;
v. End.
6. The binary file comparison method based on the SimHash algorithm according to claim 5, characterized in that the flow of step S303 is:
i. For jump instructions in the code segment, ignore the offset address that follows the instruction when extracting feature values, and use the hash of the destination address block as the feature;
ii. For data pointers into the data segment, extract the opcode of the data value that the pointer refers to as the feature.
7. The binary file comparison method based on the SimHash algorithm according to claim 1, 2, or 3, characterized in that the flow of step S5 is:
S501: Compute the SimHash fingerprint of each function;
S502: Store the SimHash fingerprints of the functions;
S503: Query the SimHash fingerprints of the functions.
8. The binary file comparison method based on the SimHash algorithm according to claim 7, characterized in that the flow of step S501 is:
i. Create a 32-dimensional vector V and a 32-bit binary number S, and initialize both to 0;
ii. Let K be the keyword set, s(k) = hash(k) the 32-bit hash of keyword k ∈ K, and v(k) the weight of keyword k;
iii. For each keyword k ∈ K and for i = 0 to 31: if bit i of s(k) is 1, add v(k) to the i-th element of V; otherwise subtract v(k) from it;
iv. After step iii has been applied to all keywords, for each i set bit i of S to 1 if the i-th element of V is greater than 0, and to 0 otherwise; the resulting S is the fingerprint of the function.
9. The binary file comparison method based on the SimHash algorithm according to claim 8, characterized in that the flow of step S502 is:
i. Traverse the function set F of the binary file and compute the fingerprint S(f) of each function f ∈ F;
ii. Build an 8-bit index table with index range 0 to 255;
iii. Split the 32-bit binary representation of each fingerprint S(f) into 4 segments of 8 bits each (bits 31-24, 23-16, 15-8, and 7-0), and store the complete 32-bit fingerprint under the index entry given by each of the 4 segments.
10. The binary file comparison method based on the SimHash algorithm according to claim 8, characterized in that the flow of step S503 is:
i. Traverse the function set F of the file to be matched and compute the fingerprint S(f) of each function f ∈ F;
ii. Split the 32-bit binary representation of each fingerprint S(f) into 4 segments of 8 bits each (bits 31-24, 23-16, 15-8, and 7-0), denoted S1, S2, S3, S4;
iii. Using S1, S2, S3, S4 as index entries, query the index of the binary file that serves as the existing knowledge base.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611009372.6A CN106649218A (en) 2016-11-16 2016-11-16 Quick binary file comparing method based on SimHash algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611009372.6A CN106649218A (en) 2016-11-16 2016-11-16 Quick binary file comparing method based on SimHash algorithm

Publications (1)

Publication Number Publication Date
CN106649218A true CN106649218A (en) 2017-05-10

Family

ID=58807152

Country Status (1)

Country Link
CN (1) CN106649218A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704501A (en) * 2017-08-28 2018-02-16 中国科学院信息工程研究所 A kind of method and system for identifying homologous binary file
CN108280197A (en) * 2018-01-29 2018-07-13 中国科学院信息工程研究所 A kind of method and system of the homologous binary file of identification
CN109670317A (en) * 2018-12-24 2019-04-23 中国科学院软件研究所 A kind of internet of things equipment inheritance bug excavation method based on atom controlling stream graph
CN112100318A (en) * 2020-11-12 2020-12-18 北京智慧星光信息技术有限公司 Multi-dimensional information merging method, device, equipment and storage medium
CN115016843A (en) * 2022-05-23 2022-09-06 北京计算机技术及应用研究所 High-precision binary code similarity comparison method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105718506A (en) * 2016-01-04 2016-06-29 胡新伟 Duplicate-checking comparison method for science and technology projects
CN106055602A (en) * 2016-05-24 2016-10-26 腾讯科技(深圳)有限公司 File verification method and apparatus

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
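Liu Chunhong (刘春红) et al.: "Structured similarity computation for homology detection of binary files" (二进制文件同源性检测的结构化相似度计算), Journal of Beijing University of Posts and Telecommunications (《北京邮电大学学报》) *
Zhang Guangqing (张广庆) et al.: "SimHash-based optimization method for fast search of massive similar documents" (基于simhash的海量相似文档快速搜索优化方法), Command Information System and Technology (《指挥信息系统与技术》) *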

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704501A (en) * 2017-08-28 2018-02-16 中国科学院信息工程研究所 A kind of method and system for identifying homologous binary file
CN107704501B (en) * 2017-08-28 2020-04-24 中国科学院信息工程研究所 Method and system for identifying homologous binary file
CN108280197A (en) * 2018-01-29 2018-07-13 中国科学院信息工程研究所 A kind of method and system of the homologous binary file of identification
CN108280197B (en) * 2018-01-29 2020-09-11 中国科学院信息工程研究所 Method and system for identifying homologous binary file
CN109670317A (en) * 2018-12-24 2019-04-23 中国科学院软件研究所 A kind of internet of things equipment inheritance bug excavation method based on atom controlling stream graph
CN112100318A (en) * 2020-11-12 2020-12-18 北京智慧星光信息技术有限公司 Multi-dimensional information merging method, device, equipment and storage medium
CN115016843A (en) * 2022-05-23 2022-09-06 北京计算机技术及应用研究所 High-precision binary code similarity comparison method
CN115016843B (en) * 2022-05-23 2024-03-26 北京计算机技术及应用研究所 High-precision binary code similarity comparison method

Similar Documents

Publication Publication Date Title
Chua et al. Neural nets can learn function type signatures from binaries
CN109697162B (en) Software defect automatic detection method based on open source code library
CN108446540B (en) Program code plagiarism type detection method and system based on source code multi-label graph neural network
CN106649218A (en) Quick binary file comparing method based on SimHash algorithm
CN112733137B (en) Binary code similarity analysis method for vulnerability detection
CN111459799B (en) Software defect detection model establishing and detecting method and system based on Github
WO2020215563A1 (en) Training sample generation method and device for text classification, and computer apparatus
CN109344230B (en) Code library file generation, code search, coupling, optimization and migration method
CN106407809A (en) A Linux platform malicious software detection method
CN104142822A (en) Source code flow analysis using information retrieval
Yu et al. A feature selection approach based on a similarity measure for software defect prediction
CN113326187A (en) Data-driven intelligent detection method and system for memory leakage
CN103534696A (en) Exploiting query click logs for domain detection in spoken language understanding
JP5780036B2 (en) Extraction program, extraction method and extraction apparatus
JP6588661B2 (en) Information retrieval accuracy evaluation method, system, apparatus, and computer-readable storage medium
CN113886832A (en) Intelligent contract vulnerability detection method, system, computer equipment and storage medium
CN115373737B (en) Code clone detection method based on feature fusion
CN110554952B (en) Search-based hierarchical regression test data generation method
CN110928550A (en) Method for eliminating redundancy of GCC abstract syntax tree based on keyword Trie tree
CN112328710B (en) Entity information processing method, device, electronic equipment and storage medium
Li et al. A Deep Learning Based Approach to Detect Code Clones
CN114398069B (en) Method and system for identifying accurate version of public component library based on cross fingerprint analysis
CN113536077B (en) Mobile APP specific event content detection method and device
JP5514682B2 (en) Batch processing program analysis method and apparatus
Yang et al. RouAlign: Cross-Version Function Alignment and Routine Recovery with Graphlet Edge Embedding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170510
