CN106126235B - Method for constructing a reused-code library, and method and system for fast tracing of reused code - Google Patents

Method for constructing a reused-code library, and method and system for fast tracing of reused code

Info

Publication number
CN106126235B
Authority
CN
China
Prior art keywords
function
code
similar
block
source
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610474461.1A
Other languages
Chinese (zh)
Other versions
CN106126235A (en)
Inventor
张永铮
乔延臣
云晓春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS filed Critical Institute of Information Engineering of CAS
Priority to CN201610474461.1A priority Critical patent/CN106126235B/en
Publication of CN106126235A publication Critical patent/CN106126235A/en
Application granted granted Critical
Publication of CN106126235B publication Critical patent/CN106126235B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/30 Creation or generation of source code
    • G06F8/36 Software reuse
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/43 Checking; Contextual analysis

Abstract

The invention discloses a method for constructing a reused-code library, and a method and system for fast tracing of reused code. The system comprises a preprocessing module, a code library construction module and a function tracing module. The preprocessing module obtains the assembly code of each sample and extracts the functions in each assembly code; it splits each function into several code blocks according to the jump instructions and jump addresses in the function, and calculates the simhash value of each code block. The code library construction module builds a three-level inverted index in which a simhash value indexes the corresponding code blocks, a code block indexes the functions that contain it, and a function indexes the samples that contain it. The function tracing module retrieves from the code library the code blocks similar to those of the function to be traced, obtains the potential similar functions corresponding to each similar code block, and then determines from the jump relations between the similar code blocks whether each similar function is similar to the function to be traced. The invention raises the degree of automation of homology determination.

Description

Method for constructing a reused-code library, and method and system for fast tracing of reused code
Technical field
The present invention relates to the fields of reverse analysis and malicious code analysis, and in particular to a reused-code library construction method, a fast tracing method and a system based on simhash and an inverted index.
Background art
Code reuse usually takes the function as its basic unit. Even after heavy compiler optimization, most functions remain similar as a whole, so tracing at function granularity fits the reuse scenario well. Homology determination for malicious code relies mainly on an author's reuse of self-written code across different malicious programs: for example, the homology of Sasser and Netsky, and of Flame and Gauss, was established from special functions they share. To speed up development, however, malicious code authors often reuse public or semi-public code written by others; Chthonic, for instance, is malicious code developed by modifying the Zeus source code. It has also been reported that a sample used in the Equation APT (Advanced Persistent Threat) attack was determined to belong to the Zlob family, which shows that APT groups reuse open-source code as well. In addition, compilers insert a large amount of code at compile time to meet runtime needs. In tests, when compiling C source code containing only one function, the VC6.0 compiler on Windows inserted 103 functions and the GCC 4.7.2 compiler on Linux inserted 18 functions. Different compilers insert different functions at different positions, and identifying them takes considerable experience and skill. Reused functions interfere heavily with malicious code analysis and homology determination. At present they are identified mainly through the experience of malicious code analysts, which makes homology determination inefficient. Identifying reused functions quickly would greatly improve efficiency and raise the credibility of homology conclusions.
Tracing reused functions rests on similar-function determination: if a function has a similar function in some binary sample, the function is a reused function. Current similar-function determination techniques achieve high precision and recall, but they are slow and do not scale to tracing reused functions in massive amounts of code. A small modification to a function's source code, or a difference in compilation options or code position, changes the instruction order, registers, jump locations and so on in the disassembled code, so tracing with an ordinary hash or a similar method yields very low recall. Within a function, the jump structure of the code blocks is an important feature for similarity determination, but extracting jump relations and comparing structure graphs is time-consuming. This is a major reason why current similarity determination cannot achieve precision, recall and speed at the same time.
Summary of the invention
To address the technical problems in the prior art and to trace and determine reused code quickly, the present invention discloses a reused-code library construction method, a fast tracing method and a system based on simhash and an inverted index.
Based on simhash and inverted-index techniques, the invention can quickly trace similar functions in massive amounts of code, with the function as the unit. First, unpacked samples and samples that were never packed are disassembled to obtain their assembly code. Each function in it is divided into multiple code blocks according to its jump instructions, the simhash value of each code block is calculated, and a three-level inverted index is built between simhash values and code blocks, code blocks and functions, and functions and samples. To trace a function, similar code blocks are found quickly from the simhash values of its code blocks; the inverted index then yields the potential similar functions and the samples that contain them.
The invention discloses a method, based on simhash and an inverted index, for constructing a code library for tracing reused code. With this code library, the tracing method disclosed by the invention can quickly locate similar functions and the samples that contain them. The method comprises four steps:
(1) Disassemble each executable program sample to obtain its assembly code;
(2) Extract the functions in the assembly code according to the call instruction (call) and the call addresses, and split each function into multiple code blocks according to the jump instructions and jump addresses in it;
(3) Calculate the simhash value of each code block using the simhash algorithm (proposed in 2002 in "Similarity estimation techniques from rounding algorithms");
(4) Build a three-level inverted index: a simhash value indexes the corresponding code blocks, a code block indexes the functions that contain it, and a function indexes the samples that contain it.
The invention also discloses a fast tracing method for reused code based on simhash and an inverted index. For a function to be traced, the method quickly finds, in a massive code library, the functions similar to it and the samples that contain them. If no similar function is found, the function is considered not to be a reused function with respect to the code library. The method comprises five steps:
(1) Split the function to be traced into multiple code blocks according to its jump instructions and jump addresses, and calculate the simhash value of each code block;
(2) Search the code library for simhash values whose Hamming distance from a code block's simhash value is at most 3, then use the inverted index to find the code blocks corresponding to each such simhash value; these are the similar code blocks;
(3) Find the potential similar functions through the similar code blocks and the inverted index, assign each potential similar function a weight according to the number of code blocks it shares with the function to be traced, and keep the functions whose weight exceeds a threshold. For example, suppose function A has 3 code blocks and two potential similar functions B and C are retrieved. If B has one code block similar to a block of A, the weight of B is 1/3; if C has two, the weight of C is 2/3. With a threshold of 1/2, only C is considered similar to A;
(4) Compare the jump relations between the similar code blocks of each function kept in (3) to make the final determination. In the present invention, functions are considered similar only when the jump relations between their code blocks are fully consistent. For example, suppose functions A and B have 2 similar code blocks (1, 2). If block 1 jumps to block 2 in A but block 2 jumps to block 1 in B, then A and B are not similar; A and B are considered similar only when the jump relations between the code blocks are identical;
(5) Use the inverted index to locate the samples that contain the similar functions.
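The weighting in step (3) can be illustrated with a minimal Python sketch. The function name `score_candidates` and the input layout (pairs of a query block id and a candidate function) are assumptions made for illustration only; the patent does not prescribe a data layout.

```python
from collections import Counter

def score_candidates(similar_hits, n_blocks, threshold=0.5):
    """similar_hits: (query_block_id, candidate_function) pairs (hypothetical layout)."""
    # Count each (block, function) pair once so repeated hits do not inflate a weight.
    per_function = Counter(func for _, func in set(similar_hits))
    weights = {func: count / n_blocks for func, count in per_function.items()}
    # Keep only candidates whose weight exceeds the threshold.
    kept = sorted(func for func, w in weights.items() if w > threshold)
    return weights, kept

# Function A has 3 code blocks; B matches one of them and C matches two.
weights, kept = score_candidates([(1, "B"), (2, "C"), (3, "C")], n_blocks=3)
print(weights, kept)
```

With the threshold of 1/2 from the example, only C survives the filter, matching the text.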
The present invention also discloses a fast tracing system for reused code based on simhash and an inverted index, which consists mainly of three modules: a preprocessing module, a code library construction module and a function tracing module.
Compared with the prior art, the positive effects of the present invention are as follows:
The invention can quickly trace, in a large number of samples, the function code similar to a given function and the samples that contain it, with high precision and recall. Tools such as a code search engine can be developed on this method to help reverse analysts work more efficiently and to raise the degree of automation of homology determination.
Experiments show that the invention achieves high precision and recall and a fast tracing speed:
(1) A code library was built from all PE files under the "Program Files" and "Windows" folders of a 32-bit WinXP system;
(2) A C source file containing only a main function with a single printf statement was compiled with VC6.0 into a Release executable, which was disassembled to obtain its assembly code. IDA Pro automatically identifies and removes library functions, so apart from main, 19 compiler-inserted functions remained in the code;
(3) Because a WinXP system contains many files compiled with VC6.0, the 19 compiler-inserted functions were expected to have some probability of being traced to similar functions in the code library built from WinXP files. Tracing the 19 functions found similar functions for 16 of them; the other three, sub_401010, sub_4057BC and sub_402AD1, were not traced to any similar function. The experiment ran on a Dell PowerEdge R410 server with a 16-core Intel(R) Xeon(R) CPU E562 and 16 GB of memory, and the average tracing time per function was about 0.149 seconds.
The following table lists some of the similar functions that were traced:
Description of the drawings
Fig. 1 is a flowchart of the code library construction method based on simhash and an inverted index;
Fig. 2 is a structure diagram of the three-level inverted index;
Fig. 3 is a flowchart of function tracing;
Fig. 4 is an architecture diagram of the fast reused-code tracing system based on simhash and an inverted index.
Detailed description of the embodiments
The invention is described in detail below with reference to specific embodiments.
Fig. 1 shows the flow of the code library construction method provided by the invention. The implementation steps are as follows:
(1) Unpack the packed samples
1) Use the PEiD packer detection tool to determine which packer a sample uses;
2) Unpack each sample with the unpacking tool appropriate to its packer;
3) Discard samples that cannot be unpacked because they use special packers.
The final samples are all unpacked.
(2) Obtain the assembly code of each sample with a disassembly tool
The present invention takes IDA Pro as an example.
(3) Extract the functions in the assembly code
In the disassembled code, "proc near" marks the beginning of a function and "endp" marks its end; functions are extracted from the assembly code according to these two markers.
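As a rough illustration of the marker-based extraction, the following Python sketch scans a listing for "proc near" and "endp". It assumes a simplified listing with one instruction per line and no address prefixes; real IDA Pro output carries more decoration, and the name `extract_functions` is hypothetical.

```python
import re

# Assumed line shapes: "<name> proc near" opens a function, "<name> endp" closes it.
PROC_RE = re.compile(r"^\s*(\S+)\s+proc\s+near\b")
ENDP_RE = re.compile(r"^\s*(\S+)\s+endp\b")

def extract_functions(listing):
    """Return {function_name: [instruction lines]} from a simplified listing."""
    functions, name, body = {}, None, []
    for line in listing.splitlines():
        start = PROC_RE.match(line)
        if start:
            name, body = start.group(1), []
            continue
        if name is not None and ENDP_RE.match(line):
            functions[name] = body
            name = None
            continue
        if name is not None and line.strip():
            body.append(line.strip())
    return functions

sample = """sub_401000 proc near
push ebp
mov ebp, esp
retn
sub_401000 endp

sub_401010 proc near
xor eax, eax
retn
sub_401010 endp
"""
funcs = extract_functions(sample)
print(list(funcs))
```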
(4) Extract the code blocks in each function
According to the jump instructions in the function (jnz, jz and so on) and the addresses they point to, the function code is divided into multiple code blocks, such that each code block either contains no jump instruction or ends with one.
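The splitting rule can be sketched in Python as follows. The sketch assumes that jump targets appear in the listing as label lines ending in a colon (for example `loc_4023E7:`) and that the set of jump mnemonics shown is sufficient; both are simplifications, and `split_blocks` is a hypothetical name.

```python
# A partial set of x86 jump mnemonics, enough for the illustration.
JUMPS = {"jmp", "jz", "jnz", "je", "jne", "ja", "jb", "jg", "jl",
         "jge", "jle", "jae", "jbe"}

def split_blocks(instructions):
    """Split a function body so that each block has no jump or ends with one."""
    blocks, current = [], []
    for ins in instructions:
        if ins.endswith(":"):           # a label: a jump target starts a new block
            if current:
                blocks.append(current)
            current = []
            continue
        current.append(ins)
        if ins.split()[0] in JUMPS:     # a jump instruction closes the current block
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks

body = ["cmp eax, 0", "jz short loc_1", "inc eax", "loc_1:", "retn"]
blocks = split_blocks(body)
print(blocks)
```

The label lines themselves are dropped here; an implementation that later compares jump relations would need to keep the mapping from labels to blocks.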
(5) Normalize the code
A slight change to the source code can cause large changes in the registers, immediates and memory addresses in the assembly code. To ignore the effect of such differences, the assembly code is normalized according to the following rules:
● Registers such as eax, ax and al are normalized to REG32, REG16 and REG8 respectively, according to their bit width;
● Memory operands such as [eax] and [edi+4] are represented as MEM;
● Immediates such as 0 and 5A4Dh are represented as VAL;
● A call instruction that calls an external system library function is left unchanged, while a call to an internal function, such as "call sub_105A8", is normalized to "call INNER";
● Jump instructions such as "jz short loc_4023E7" are normalized to "jz loc".
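The normalization rules above can be approximated with regular expressions. The following Python sketch covers only the general-purpose registers and assumes that internal functions are recognizable by the `sub_` prefix that IDA Pro assigns; the name `normalize` and the 8-bit label REG8 are assumptions of this sketch.

```python
import re

REG32 = r"\b(?:eax|ebx|ecx|edx|esi|edi|ebp|esp)\b"
REG16 = r"\b(?:ax|bx|cx|dx|si|di|bp|sp)\b"
REG8 = r"\b(?:al|ah|bl|bh|cl|ch|dl|dh)\b"
IMM = r"\b(?:[0-9][0-9A-Fa-f]*h|[0-9]+)\b"   # hex with an "h" suffix, or decimal

def normalize(ins):
    """Normalize one instruction according to the rules listed above."""
    mnemonic, _, operands = ins.strip().partition(" ")
    if mnemonic == "call":
        # Internal calls collapse to INNER; external library calls are kept as-is.
        return "call INNER" if operands.strip().startswith("sub_") else ins.strip()
    if mnemonic.startswith("j"):
        return f"{mnemonic} loc"
    operands = re.sub(r"\[[^\]]*\]", "MEM", operands)   # memory operands first
    operands = re.sub(REG32, "REG32", operands)
    operands = re.sub(REG16, "REG16", operands)
    operands = re.sub(REG8, "REG8", operands)
    operands = re.sub(IMM, "VAL", operands)
    return f"{mnemonic} {operands}".strip()

for line in ["mov eax, [edi+4]", "cmp ax, 5A4Dh", "call sub_105A8",
             "jz short loc_4023E7"]:
    print(normalize(line))
```

Memory operands are replaced before registers so that a register inside brackets does not leave a stray REG32 behind.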
(6) Calculate the simhash value of each code block
Simhash is a fuzzy hash that is often used to detect similar texts and web pages that differ only slightly.
1) Create a 64-dimensional vector and initialize it to 0;
2) Tokenize the normalized code block into 2-grams of the normalized instruction sequence;
3) Assign a weight to each token, using the token's frequency as the base weight. Because the APIs a function calls largely determine what it does, call instructions play an important role in a code block, so the weight of any token containing a call instruction is doubled;
4) Hash each token with the MD5 algorithm and take 64 bits of the MD5 value as the token's hash;
5) Weighted merge: for each bit of a token's hash, if the bit is 1, add the token's weight to the corresponding element of the vector; otherwise subtract it;
6) Dimension reduction: for each element of the vector, set the corresponding bit to 1 if the element is greater than 0 and to 0 otherwise, forming a 64-bit simhash value.
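The six sub-steps can be sketched in Python with the standard `hashlib` module. The text does not say which 64 bits of the MD5 value are taken or how the two instructions of a 2-gram are joined; this sketch assumes the first 8 bytes of the digest and a `" | "` separator. The helper `hamming` is added only to compare results.

```python
import hashlib
from collections import Counter

def simhash(block, bits=64):
    """Simhash of a normalized code block given as a list of instruction strings."""
    # 2-grams of consecutive instructions; a one-instruction block yields one token.
    grams = [" | ".join(block[i:i + 2]) for i in range(max(len(block) - 1, 1))]
    vector = [0] * bits
    for gram, freq in Counter(grams).items():
        # Frequency is the base weight; tokens containing a call count double.
        weight = freq * 2 if "call" in gram.split() else freq
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        value = int.from_bytes(digest[:bits // 8], "big")  # assumed: first 64 bits
        for i in range(bits):
            vector[i] += weight if (value >> i) & 1 else -weight
    # Dimension reduction: positive elements become 1 bits.
    return sum(1 << i for i in range(bits) if vector[i] > 0)

def hamming(a, b):
    """Number of differing bits between two simhash values."""
    return bin(a ^ b).count("1")

block = ["push REG32", "call INNER", "cmp REG32, VAL", "jz loc"]
value = simhash(block)
print(f"{value:016x}")
```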
(7) Build the three-level inverted index
The preceding six steps extract the functions contained in the massive sample set and the code blocks contained in those functions, and calculate the simhash value of each code block. A three-level inverted index is built over these elements, as shown in Fig. 2:
1) A simhash value indexes the code blocks that have that value. Simhash collisions are fairly likely, so several code blocks may share the same simhash value;
2) A code block indexes the functions that contain it. Normalization makes code blocks more likely to coincide, so dissimilar functions may also contain identical normalized code blocks;
3) A function indexes the samples that contain it.
In this way a code library for function tracing is built from the massive sample set, based on simhash and an inverted index.
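The three index levels map naturally onto three dictionaries. The Python sketch below is one possible in-memory layout, not the storage format of the invention; the class name `CodeLibrary`, the use of the normalized instruction tuple as the block key, and the `sample:function` identifiers are all assumptions for illustration.

```python
from collections import defaultdict

class CodeLibrary:
    """Three-level inverted index: simhash -> blocks -> functions -> samples."""

    def __init__(self):
        self.hash_to_blocks = defaultdict(set)       # simhash value -> block keys
        self.block_to_functions = defaultdict(set)   # block key -> function ids
        self.function_to_samples = defaultdict(set)  # function id -> sample ids

    def add(self, sample, function, block, simhash_value):
        block_key = tuple(block)  # the normalized instructions identify a block
        self.hash_to_blocks[simhash_value].add(block_key)
        self.block_to_functions[block_key].add(function)
        self.function_to_samples[function].add(sample)

    def trace(self, simhash_value):
        """Walk the three levels and return {function: set of samples}."""
        result = {}
        for block_key in self.hash_to_blocks.get(simhash_value, ()):
            for function in self.block_to_functions[block_key]:
                result[function] = set(self.function_to_samples[function])
        return result

library = CodeLibrary()
block = ["push REG32", "call INNER", "jz loc"]
library.add("a.exe", "a.exe:sub_401000", block, 0x1F)
library.add("b.dll", "b.dll:sub_10001000", block, 0x1F)
print(library.trace(0x1F))
```

Because the same normalized block is stored once and points to both functions, the sketch also shows the situation described in item 2), where different functions share an identical normalized block.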
Fig. 3 shows the flow for fast function tracing on the constructed code library. The implementation steps are as follows:
(1) Split the code of the function to be traced, P, into code blocks according to its jump instructions. Assuming there are n code blocks, calculate the simhash value of each:
$P \rightarrow \{sh_1, sh_2, \ldots, sh_n\}$ [Formula 1]
(2) Using the simhash multi-table index method, quickly retrieve for each $sh_i$, $i \in [1, n]$, the similar simhash values within Hamming distance 3, forming the similar simhash sets:
$shSet_i = \{\, sh : d(sh_i, sh) \le 3 \,\},\ i \in [1, n]$ [Formula 2]
where $d(sh_i, sh)$ denotes the Hamming distance between $sh_i$ and $sh$. If all similar simhash sets are empty, no function in the library has a code block similar to a block of this function; otherwise, continue to the next step;
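The text names a multi-table index but does not give its layout. One common construction, assumed here, splits each 64-bit value into 4 bands of 16 bits: two values that differ in at most 3 bits must agree exactly on at least one band, so a lookup per band yields every candidate, and an exact Hamming check then filters them. The names `SimhashIndex` and `bands` are hypothetical.

```python
from collections import defaultdict

BANDS, BAND_BITS = 4, 16   # assumed split of a 64-bit simhash value

def bands(value):
    """Return (band index, band value) keys for one simhash value."""
    mask = (1 << BAND_BITS) - 1
    return [(b, (value >> (b * BAND_BITS)) & mask) for b in range(BANDS)]

class SimhashIndex:
    def __init__(self):
        self.tables = defaultdict(set)   # (band index, band value) -> simhash values

    def add(self, value):
        for key in bands(value):
            self.tables[key].add(value)

    def query(self, value, max_distance=3):
        """All stored values within max_distance bits (guaranteed for <= 3)."""
        candidates = set()
        for key in bands(value):
            candidates |= self.tables.get(key, set())
        return {c for c in candidates if bin(c ^ value).count("1") <= max_distance}

x = 0x0123456789ABCDEF
near = x ^ 0b111                        # 3 bits differ, all in one band
spread = x ^ (1 | 1 << 16 | 1 << 32)    # 3 bits differ, in three bands
far = x ^ 0b1111                        # 4 bits differ
index = SimhashIndex()
for v in (near, spread, far):
    index.add(v)
print(sorted(hex(v) for v in index.query(x)))
```

The pigeonhole guarantee only holds while the distance bound is smaller than the number of bands, which is why 4 bands suit the bound of 3 used in the text.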
(3) Using the inverted index built between simhash values and code blocks, retrieve the code blocks L whose simhash value belongs to $shSet_i$, forming the similar code block sets:
$LSet_i = \{\, L : simhash(L) \in shSet_i \,\},\ i \in [1, n]$ [Formula 3]
where $simhash(L)$ denotes the simhash value of code block L;
(4) Using the inverted index built between code blocks and the functions that contain them, retrieve the functions P′ that contain the similar code blocks, forming a function set;
(5) For each function P′ retrieved, calculate the code overlap of P′ with the function to be traced P, that is, the proportion of P's code that is similar to code in P′, denoted iSim(P, P′). With len(P) denoting the total number of instructions in P and len(L_i) the number of instructions in code block L_i, iSim(P, P′) is the summed instruction count of the code blocks of P that have a similar block in P′, divided by len(P). Experiments showed that a function is most likely to be similar to the function to be traced when its code overlap is not less than 0.5, so the final potential similar function set is:
$PSet = \{\, P' : iSim(P, P') \ge 0.5 \,\}$ [Formula 6]
If the potential similar function set PSet is empty, no function has enough code similar to the function to be traced, so it is hard to assert similarity, and the function to be traced is regarded as original with respect to the existing code set.
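The code overlap can be sketched in Python as below. The computation is reconstructed from the definitions of len(P) and len(L_i) in the text, so it should be read as an interpretation rather than the exact formula of the invention; the names `code_overlap` and `potential_similar` and the dictionary layouts are assumptions.

```python
def code_overlap(query_blocks, matched_ids):
    """iSim(P, P'): share of P's instructions lying in blocks matched by P'.

    query_blocks: {block_id: instruction_count} for the function to be traced P.
    matched_ids:  ids of the blocks of P that have a similar block in P'.
    """
    total = sum(query_blocks.values())                      # len(P)
    matched = sum(n for block_id, n in query_blocks.items()
                  if block_id in matched_ids)               # sum of len(L_i)
    return matched / total if total else 0.0

def potential_similar(query_blocks, matches_by_function, threshold=0.5):
    """Keep candidate functions whose overlap is not less than the threshold."""
    return {f for f, ids in matches_by_function.items()
            if code_overlap(query_blocks, ids) >= threshold}

# P has three blocks of 6, 3 and 1 instructions.
P = {1: 6, 2: 3, 3: 1}
matches = {"B": {3}, "C": {1}, "D": {2, 3}}
pset = potential_similar(P, matches)
print(pset)
```

Weighting by instruction count rather than block count means a single long matched block can outweigh several short ones, as candidate C does against candidate D here.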
(6) If a potential similar function and the function to be traced share only one similar code block, the potential similar function is judged similar. If they share multiple similar code blocks, it is judged similar only when the jump relations between those blocks are identical.
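The final check can be sketched in Python by treating jump relations as directed edges between blocks. The sketch assumes a one-to-one mapping from each matched block of P to its similar block in P′, which the text does not spell out; the name `same_jump_relations` is hypothetical.

```python
def same_jump_relations(query_edges, candidate_edges, block_map):
    """Compare jump relations restricted to the matched blocks.

    query_edges / candidate_edges: sets of (source_block, target_block) jumps.
    block_map: maps each matched block of P to its similar block in P'.
    """
    if len(block_map) == 1:
        return True   # a single similar block is accepted without a structural check
    matched_q = set(block_map)
    matched_c = set(block_map.values())
    # Translate P's jumps into P' block ids, keeping only jumps between matched blocks.
    q = {(block_map[a], block_map[b]) for a, b in query_edges
         if a in matched_q and b in matched_q}
    c = {(a, b) for a, b in candidate_edges
         if a in matched_c and b in matched_c}
    return q == c

block_map = {1: "b1", 2: "b2"}
# In A block 1 jumps to block 2, but in B block 2 jumps to block 1: not similar.
reversed_case = same_jump_relations({(1, 2)}, {("b2", "b1")}, block_map)
# Same direction in both functions: similar.
same_case = same_jump_relations({(1, 2)}, {("b1", "b2")}, block_map)
print(reversed_case, same_case)
```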
The fast reused-code tracing system disclosed by the invention, based on simhash and an inverted index, can be used to quickly trace similar functions and to determine whether a function is reused and which samples reuse it, which helps improve the efficiency of malicious code analysis, reuse-relationship detection and malicious code homology determination. The system consists mainly of three modules: a preprocessing module, a code library construction module and a function tracing module.
The system structure is shown in Fig. 4. The system is implemented as follows:
(1) Preprocessing
1) For each sample, unpack and disassemble it to obtain its assembly code;
2) For each assembly code, extract all functions according to the special markers;
3) Split each function into multiple code blocks according to its jump instructions and jump relations;
4) Normalize each code block and calculate its simhash value.
(2) Code library construction
From the massive sample set and the functions, code blocks and simhash values obtained by preprocessing, build a code library containing the three-level inverted index: a simhash value indexes the code blocks with that value, a code block indexes the functions that contain it, and a function indexes the samples that contain it.
(3) Function tracing
The preprocessing module splits the function to be traced into multiple code blocks and obtains their simhash values. First, similar code blocks are retrieved from the code library by simhash value; next, the potential similar functions are found from the similar code blocks through the inverted index; finally, similarity is decided from the code overlap and the jump relations between the similar code blocks.

Claims (10)

1. A method for constructing a reused-code library, comprising the steps of:
1) obtaining the assembly code of each executable program sample;
2) extracting the functions in each assembly code, splitting each function into several code blocks according to the jump instructions and jump addresses in the function, and saving the code blocks in the code library;
3) calculating the simhash value of each code block;
4) building a three-level inverted index between simhash values and code blocks, code blocks and functions, and functions and samples.
2. The method of claim 1, wherein in the three-level inverted index a simhash value indexes the corresponding code blocks, a code block indexes the functions that contain it, and a function indexes the samples that contain it.
3. The method of claim 1 or 2, wherein the functions in the assembly code are extracted according to call instructions and call addresses.
4. A method for fast tracing of reused code, comprising the steps of:
1) splitting the function to be traced into several code blocks according to jump instructions and jump addresses, and calculating the simhash value of each code block;
2) for each code block, searching the code library for code blocks whose simhash value is within a set Hamming distance of that code block's simhash value, as the similar code blocks of that code block;
3) searching the code library for the potential similar functions corresponding to each similar code block, and assigning each potential similar function a weight according to the number of code blocks it shares with the function to be traced; then selecting the potential similar functions whose weight exceeds a set threshold as similar functions;
4) determining, from the jump relations between the similar code blocks, whether each similar function is similar to the function to be traced; if so, the similar function is reused code.
5. The method of claim 4, wherein the code library is searched for the samples containing the similar functions determined in step 4), and the similar functions and the samples containing them are returned.
6. The method of claim 4, wherein the code library contains multiple code blocks and their simhash values, together with a three-level inverted index in which a simhash value indexes the corresponding code blocks, a code block indexes the functions that contain it, and a function indexes the samples that contain it.
7. The method of claim 4, 5 or 6, wherein the simhash value is calculated by:
71) creating an N-dimensional vector and initializing it to 0;
72) tokenizing the normalized code block into 2-grams of the normalized instruction sequence;
73) assigning a weight to each token, using the token's frequency as the base weight and doubling the weight of tokens containing a call instruction;
74) hashing each token with the MD5 algorithm and taking N bits of the MD5 value as the token's hash;
75) for each bit of a token's hash, adding the token's weight to the corresponding element of the vector if the bit is 1, and subtracting it otherwise;
76) for each element of the vector, setting the corresponding bit to 1 if the element is greater than 0 and to 0 otherwise, forming an N-bit simhash value.
8. The method of claim 4, 5 or 6, wherein in step 3) the functions corresponding to each similar code block are first looked up in the code library, and each such function is then determined to be a potential similar function or not according to its code overlap with the function to be traced.
9. The method of claim 4, 5 or 6, wherein in step 4) whether a similar function is similar to the function to be traced is determined from the jump relations between the similar code blocks as follows: if the similar function and the function to be traced share only one similar code block, the similar function is judged similar to the function to be traced; if they share multiple similar code blocks, the similar function is judged similar to the function to be traced when the jump relations between those similar code blocks in the similar function are identical to the jump relations between the corresponding code blocks in the function to be traced.
10. A system for fast tracing of reused code, comprising a preprocessing module, a code library construction module and a function tracing module, wherein:
the preprocessing module is configured to obtain the assembly code of each sample, extract the functions in each assembly code, split each function into several code blocks according to the jump instructions and jump addresses in the function, and calculate the simhash value of each code block;
the code library construction module is configured to save the functions, the code blocks, the samples that contain them and the simhash values of the code blocks, and to build a three-level inverted index in which a simhash value indexes the corresponding code blocks, a code block indexes the functions that contain it, and a function indexes the samples that contain it;
the function tracing module is configured to retrieve from the code library, according to the code blocks of the function to be traced and their simhash values, the code blocks similar to those of the function to be traced; to search the code library for the potential similar functions corresponding to each similar code block; to assign each potential similar function a weight according to the number of code blocks it shares with the function to be traced; to select the potential similar functions whose weight exceeds a set threshold as similar functions; and to determine, from the jump relations between the similar code blocks, whether each similar function is similar to the function to be traced; if so, the similar function is reused code.
CN201610474461.1A 2016-06-24 2016-06-24 A kind of multiplexing code base construction method, the quick source tracing method of multiplexing code and system Active CN106126235B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610474461.1A CN106126235B (en) 2016-06-24 2016-06-24 A kind of multiplexing code base construction method, the quick source tracing method of multiplexing code and system

Publications (2)

Publication Number Publication Date
CN106126235A CN106126235A (en) 2016-11-16
CN106126235B true CN106126235B (en) 2019-05-07

Family

ID=57266110

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610474461.1A Active CN106126235B (en) 2016-06-24 2016-06-24 A kind of multiplexing code base construction method, the quick source tracing method of multiplexing code and system

Country Status (1)

Country Link
CN (1) CN106126235B (en)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106557332A (en) * 2016-11-30 2017-04-05 上海寒武纪信息科技有限公司 A kind of multiplexing method and device of instruction generating process
CN107590385B (en) * 2017-09-15 2020-03-17 湖南大学 Hardware-assisted code reuse attack resisting defense system and method
CN107885503B (en) * 2017-11-11 2021-01-08 湖南大学 Iterative compilation optimization method based on program characteristic analysis
CN108763486A (en) * 2018-05-30 2018-11-06 湖南写邦科技有限公司 Paper duplicate checking method, terminal and storage medium based on terminal
CN109445844A (en) * 2018-11-05 2019-03-08 浙江网新恒天软件有限公司 Code Clones detection method based on cryptographic Hash, electronic equipment, storage medium
CN109815996B (en) * 2019-01-07 2021-05-04 北京首钢自动化信息技术有限公司 Scene self-adaptation method and device based on recurrent neural network
CN110647666B (en) * 2019-09-03 2023-12-19 平安科技(深圳)有限公司 Intelligent matching method and device for templates and formulas and computer readable storage medium
CN110569629A (en) * 2019-09-10 2019-12-13 北京计算机技术及应用研究所 Binary code file tracing method
CN111290784B (en) * 2020-01-21 2021-08-24 北京航空航天大学 Program source code similarity detection method suitable for large-scale samples
CN111241497A (en) * 2020-02-13 2020-06-05 北京高质系统科技有限公司 Open source code tracing detection method based on software multiplexing feature learning
CN113360134B (en) * 2020-03-06 2022-06-17 武汉斗鱼网络科技有限公司 Method, device, equipment and storage medium for generating security verification program
US20220129417A1 (en) * 2020-10-22 2022-04-28 Google Llc Code Similarity Search
CN112257068A (en) * 2020-11-17 2021-01-22 南方电网科学研究院有限责任公司 Program similarity detection method and device, electronic equipment and storage medium
CN114840204A (en) * 2021-02-02 2022-08-02 华为技术有限公司 Function jump method and device for non-homonymous symbols and computer readable storage medium
CN113703773B (en) * 2021-08-26 2022-07-19 北京计算机技术及应用研究所 NLP-based binary code similarity comparison method
CN113590192B (en) * 2021-09-26 2022-01-04 北京迪力科技有限责任公司 Quality analysis method and related equipment
CN113722238B (en) * 2021-11-01 2022-04-26 北京大学 Method and system for realizing rapid open source component detection of source code file
CN114995880B (en) * 2022-05-23 2024-04-05 北京计算机技术及应用研究所 Binary code similarity comparison method based on SimHash

Citations (3)

Publication number Priority date Publication date Assignee Title
CN101425008A (en) * 2007-11-01 2009-05-06 北京航空航天大学 Method for measuring similarity of source code based on edit distance
CN102063446A (en) * 2009-11-13 2011-05-18 中国移动通信集团四川有限公司 Method for creating inverted index and inverted indexing device
CN103646080A (en) * 2013-12-12 2014-03-19 北京京东尚科信息技术有限公司 Microblog duplication-eliminating method and system based on reverse-order index

Patent Citations (3)

Publication number Priority date Publication date Assignee Title
CN101425008A (en) * 2007-11-01 2009-05-06 北京航空航天大学 Method for measuring similarity of source code based on edit distance
CN102063446A (en) * 2009-11-13 2011-05-18 中国移动通信集团四川有限公司 Method for creating inverted index and inverted indexing device
CN103646080A (en) * 2013-12-12 2014-03-19 北京京东尚科信息技术有限公司 Microblog duplication-eliminating method and system based on reverse-order index

Non-Patent Citations (1)

Title
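"Duplicate Code Detection Based on Abstract Syntax Trees"; Wu Chong (吴冲); China Master's Theses Full-text Database (Electronic Journal), Information Science and Technology Series; 2015-12-15; Vol. 2015 (No. 12); pp. I138-154 *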

Also Published As

Publication number Publication date
CN106126235A (en) 2016-11-16

Similar Documents

Publication Publication Date Title
CN106126235B (en) A kind of multiplexing code base construction method, the quick source tracing method of multiplexing code and system
CN107798136B (en) Entity relation extraction method and device based on deep learning and server
Wang et al. Efficient approximate entity extraction with edit distance constraints
RU2016113791A (en) METHOD AND DEVICE FOR CONSTRUCTION OF PATTERN AND METHOD AND DEVICE FOR IDENTIFICATION OF INFORMATION
Crim et al. Automatically annotating documents with normalized gene lists
CN101339547A (en) Apparatus and method for machine translation
CN111581355A (en) Method, device and computer storage medium for detecting subject of threat intelligence
Nguyen et al. Text classification of technical papers based on text segmentation
Yang et al. Smart library: Identifying books on library shelves using supervised deep learning for scene text reading
CN110191096A (en) A kind of term vector homepage intrusion detection method based on semantic analysis
CN102339294A (en) Searching method and system for preprocessing keywords
CN106528527A (en) Identification method and identification system for out-of-vocabulary words
CN102314464A (en) Lyrics searching method and lyrics searching engine
CN113297580B (en) Code semantic analysis-based electric power information system safety protection method and device
Nguyen et al. Extracting bacteria biotopes with semi-supervised named entity recognition and coreference resolution
CN109543846B (en) Mine water inrush spectrum identification method based on MVO-improved DBSCAN
CN103049434A (en) System and method for identifying anagrams
KR20100105080A (en) Query processing method and apparatus based on n-gram
WO2015080559A2 (en) A method and system for automated word sense disambiguation
CN111078227B (en) Binary code and source code similarity analysis method and device based on code characteristics
US20230075290A1 (en) Method for linking a cve with at least one synthetic cpe
CN109472145A (en) A kind of code reuse recognition methods and system based on graph theory
CN111090859B (en) Malicious software detection method based on graph edit distance
CN109783607B (en) Method for matching and identifying massive keywords in arbitrary text
EP3138033B1 (en) Method and apparatus for performing block retrieval on block to be processed of urine sediment image

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant