CN106126235B - Method for constructing a reused-code library, and method and system for rapidly tracing reused code - Google Patents
- Publication number
- CN106126235B (application number CN201610474461.1A)
- Authority
- CN
- China
- Prior art keywords
- function
- code
- similar
- block
- source
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/30—Creation or generation of source code
- G06F8/36—Software reuse
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
- G06F8/43—Checking; Contextual analysis
Abstract
The invention discloses a method for constructing a reused-code library, and a method and system for rapidly tracing reused code to its source. The system comprises a preprocessing module, a code library construction module, and a function tracing module. The preprocessing module obtains the assembly code of each sample, extracts the functions in each assembly listing, splits each function into code blocks according to its jump instructions and jump addresses, and computes the simhash value of each code block. The code library construction module builds a three-level inverted index in which a simhash value indexes the corresponding code blocks, a code block indexes the functions that contain it, and a function indexes the samples that contain it. The function tracing module retrieves from the code library the code blocks similar to those of the function to be traced, looks up the potentially similar function behind each similar code block, and then decides from the jump relationships between the similar code blocks whether each candidate is similar to the function to be traced. The invention raises the degree of automation of homology determination.
Description
Technical field
The present invention relates to the fields of reverse analysis and malicious code analysis, and in particular to a reused-code library construction method, a rapid tracing method, and a system based on simhash and an inverted index.
Background art
Code is usually reused with the function as the basic unit. Even after heavy compiler optimization, most functions remain intact as a whole, so tracing with the function as the unit fits the reuse scenario and suits similarity determination. Homology determination for malicious code rests mainly on a malicious code author reusing personally written code across different malicious programs; for example, the homology of Sasser and Netsky, and of Flame and Gauss, was determined from special functions they share. To speed up development, however, malicious code authors often reuse public or semi-public code written by others; Chthonic, for instance, is malicious code developed by modifying the Zeus source code. It has also been reported that a sample used in an Equation APT (Advanced Persistent Threat) attack was determined to belong to the Zlob family, which shows that APT attack groups also reuse open source code. In addition, to meet execution requirements, a compiler typically inserts a large amount of code during compilation. In tests, when compiling C source code containing only one function, the VC6.0 compiler under Windows inserted 103 functions and the GCC 4.7.2 compiler under Linux inserted 18 functions. Different compilers insert different functions at different positions, and identifying them takes considerable experience and skill. Reused functions interfere heavily with malicious code analysis and homology determination. Identifying them currently depends mainly on the experience of malicious code analysts, which makes homology determination inefficient. Quickly recognizing reused functions would greatly improve efficiency and raise the credibility of homology conclusions.
The basis of reused-function tracing is similar-function determination: if a binary sample contains a function similar to a given function, that function is a reused function. Current similar-function determination techniques achieve high accuracy and recall, but they are slow and do not scale to reused-function tracing over massive amounts of code. A small modification to a function's source code, or a difference in compiler options or position, changes the instruction order, registers, jump locations, and so on in the disassembled code, so tracing with ordinary hashing gives very low recall. Within a function, the jump structure of its code blocks is an important feature for similarity determination, but extracting jump relationships and comparing structure graphs takes considerable time. This is a major reason why current similarity determination struggles to combine accuracy, recall, and speed.
Summary of the invention
To address the technical problems of the prior art and enable rapid tracing and determination of reused code, the present invention discloses a reused-code library construction method, a rapid tracing method, and a system based on simhash and an inverted index.
Based on simhash and inverted-index techniques, the invention can rapidly trace similar functions in massive amounts of code, with the function as the unit. First, samples that were never packed or that have been unpacked are disassembled to obtain assembly code. The functions in it are divided into code blocks according to jump instructions, the simhash value of each code block is computed, and a three-level inverted index is built between simhash values and code blocks, code blocks and functions, and functions and samples. To trace a function, similar code blocks are found quickly from the simhash values of its code blocks, potentially similar functions are then looked up through the inverted index, and finally the samples containing them are located.
The invention discloses a method for constructing a code library for reused-code tracing based on simhash and an inverted index. With this code library, the tracing method disclosed by the invention can rapidly locate similar functions and the samples that contain them.
The method comprises 4 steps:
(1) disassemble each executable program sample to obtain its assembly code;
(2) extract the functions in the assembly code according to the call instruction and the call address, and split each function into multiple code blocks according to the jump instructions and jump addresses within it;
(3) compute the simhash value of each code block using the simhash algorithm (proposed in 2002 in "Similarity estimation techniques from rounding algorithms");
(4) build a three-level inverted index: a simhash value indexes the corresponding code blocks, a code block indexes the functions that contain it, and a function indexes the samples that contain it.
The invention also discloses a rapid reused-code tracing method based on simhash and an inverted index. For a function to be traced, the method can rapidly find, in a massive code library, the functions similar to it and the samples containing them. If no similar function is found, the function is considered not to be a reused function with respect to the code library. The method comprises 5 steps:
(1) split the function to be traced into multiple code blocks according to jump instructions and jump addresses, and compute the simhash value of each code block;
(2) search the code library for simhash values whose Hamming distance from a code block's simhash value is within 3, then use the inverted index to find the code blocks corresponding to each such simhash value; these are the similar code blocks;
(3) find potentially similar functions through the similar code blocks and the inverted index, assign each potentially similar function a weight according to the number of code blocks it has in common with the function to be traced, and keep the functions whose weight exceeds a threshold. For example, suppose the 3 code blocks of function A retrieve two potentially similar functions, B and C. If function B has one code block similar to A, the weight of B is 1/3; if function C has two code blocks similar to A, the weight of C is 2/3. With a threshold of 1/2, only function C is considered similar to function A;
(4) compare the jump relationships between the similar code blocks of the functions kept in (3) to make the final determination. In the present invention, functions are considered similar only when the jump relationships between the code blocks are exactly the same. For example, suppose functions A and B have 2 similar code blocks (1, 2). If block 1 jumps to block 2 in function A but block 2 jumps to block 1 in function B, then A and B are not similar; only when the jump relationships between the code blocks are identical are A and B considered similar;
(5) locate, through the inverted index, the samples containing the similar functions.
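The two worked examples in steps (3) and (4) can be sketched in a few lines of Python. This is an illustrative sketch only: the helper names `candidate_weights` and `same_jump_relations`, and the representation of jump relationships as sets of (source block, target block) pairs, are assumptions made for the example and do not come from the patent text.

```python
def candidate_weights(total_blocks: int, matches: dict[str, int]) -> dict[str, float]:
    """Weight of a candidate = blocks it shares with the traced function / blocks in the traced function."""
    return {name: count / total_blocks for name, count in matches.items()}


def same_jump_relations(edges_a: set, edges_b: set, mapping: dict) -> bool:
    """Check that jumps between matched blocks agree.

    edges_a / edges_b are sets of (source block, target block) pairs;
    mapping sends each matched block of A to its similar block in B.
    """
    renamed = {(mapping[s], mapping[d]) for s, d in edges_a
               if s in mapping and d in mapping}
    matched_b = set(mapping.values())
    restricted_b = {(s, d) for s, d in edges_b
                    if s in matched_b and d in matched_b}
    return renamed == restricted_b


# Step (3): function A has 3 blocks; B shares 1 of them, C shares 2.
weights = candidate_weights(3, {"B": 1, "C": 2})
kept = [name for name, w in weights.items() if w > 0.5]   # threshold 1/2

# Step (4): block 1 jumps to block 2 in A, but block 2 jumps to block 1 in B.
a_vs_b = same_jump_relations({(1, 2)}, {(2, 1)}, {1: 1, 2: 2})
```

With the numbers from the text, only C passes the weight filter, and the reversed jump direction makes A and B dissimilar.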
The invention further discloses a rapid reused-code tracing system based on simhash and an inverted index, consisting mainly of 3 modules: a preprocessing module, a code library construction module, and a function tracing module.
Compared with the prior art, the positive effects of the invention are as follows:
The invention can rapidly trace, in a large number of samples, the function code similar to a given function and the samples containing it, with high accuracy and recall. Tools such as code search engines can be developed on this basis to help reverse analysts work more efficiently and to raise the degree of automation of homology determination.
Experiments show that the invention achieves high accuracy and recall and a fast tracing speed:
(1) a code library was built from all PE files under the "Program Files" and "Windows" folders of a 32-bit WinXP system;
(2) a C source file containing only a main function with a single printf statement was compiled with VC6.0 into a Release executable, and the file was disassembled to obtain its assembly code. IDA Pro automatically identifies and excludes library functions, so besides main the code finally contained 19 compiler-inserted functions;
(3) because a WinXP system contains many files compiled with VC6.0, the 19 compiler-inserted functions were expected to have a fair chance of matching similar functions in the code library built from the WinXP files. The 19 functions were therefore traced: 16 of them had similar functions, while the other three, sub_401010, sub_4057BC and sub_402AD1, did not. The experiment was run on a Dell PowerEdge R410 server with a 16-core Intel(R) Xeon(R) CPU E562 and 16 GB of memory, and the average tracing time per function was about 0.149 seconds.
The following table lists some of the similar functions that were traced:
Description of the drawings
Fig. 1 is a flow chart of the code library construction method based on simhash and an inverted index;
Fig. 2 is a structure diagram of the three-level inverted index;
Fig. 3 is a flow chart of function tracing;
Fig. 4 is an architecture diagram of the rapid reused-code tracing system based on simhash and an inverted index.
Detailed description of the embodiments
The invention is described in detail below with reference to specific embodiments.
Fig. 1 shows the flow of the code library construction method provided by the invention. The specific implementation steps are as follows:
(1) unpack the packed samples
1) use the PEiD packer-detection tool to determine the packer used by each sample;
2) unpack each sample with the unpacking tool appropriate to its packer;
3) discard samples that cannot be unpacked because they use a special packer.
The final samples are all unpacked samples.
(2) obtain the assembly code of each sample with a disassembly tool
The invention takes IDA Pro as an example.
(3) extract the functions in the assembly code
In the disassembled code, "proc near" marks the beginning of a function and "endp" marks its end; functions are extracted from the assembly code according to these two markers.
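As a rough illustration of marker-based extraction, the following Python sketch scans a listing for "proc near" and "endp" lines. The helper name `extract_functions` and the simplified listing layout (one instruction per line, no address prefixes or comments) are assumptions for the example; real IDA Pro output carries more decoration and would need extra parsing.

```python
import re

# Hypothetical patterns for a simplified listing: "<name> proc near" / "<name> endp".
PROC_START = re.compile(r"^\s*(\S+)\s+proc\s+near\b")
PROC_END = re.compile(r"^\s*(\S+)\s+endp\b")


def extract_functions(listing: str) -> dict[str, list[str]]:
    """Return {function name: [instruction lines]} for each proc/endp pair."""
    functions: dict[str, list[str]] = {}
    current = None
    for line in listing.splitlines():
        start = PROC_START.match(line)
        if start:
            current = start.group(1)          # a new function begins
            functions[current] = []
        elif PROC_END.match(line):
            current = None                    # the current function ends
        elif current is not None and line.strip():
            functions[current].append(line.strip())
    return functions


sample = """\
sub_401000 proc near
    push ebp
    mov ebp, esp
    retn
sub_401000 endp

sub_401010 proc near
    xor eax, eax
    retn
sub_401010 endp
"""
funcs = extract_functions(sample)
```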
(4) extract the code blocks in each function
The function code is divided into multiple code blocks according to the jump instructions in the function (jnz, jz, etc.) and the addresses they point to, so that each code block either contains no jump instruction or ends with one.
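A minimal Python sketch of this splitting rule follows. It assumes a simplified input in which every label sits on its own line ending with ":" and is treated as a jump target; the function name `split_blocks` and the short `JUMPS` list are illustrative choices, not part of the patent.

```python
# A small, non-exhaustive set of x86 jump mnemonics, for illustration only.
JUMPS = {"jmp", "jz", "jnz", "je", "jne", "ja", "jb", "jg", "jl", "jge", "jle"}


def split_blocks(lines: list[str]) -> list[list[str]]:
    """Split a function body into code blocks.

    A block is closed right after a jump instruction, and a new block is
    started at every label line, since labels are where jumps land.
    Label lines themselves are not stored in the blocks.
    """
    blocks, current = [], []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        if text.endswith(":"):                 # jump target: start a new block
            if current:
                blocks.append(current)
            current = []
            continue
        current.append(text)
        if text.split()[0].lower() in JUMPS:   # jump: close the block
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


body = ["cmp eax, 0", "jz short loc_A", "mov ebx, 1", "loc_A:", "add ebx, 2", "retn"]
blocks = split_blocks(body)
```

Each resulting block either has no jump instruction or has exactly one, as its last instruction.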
(5) normalize the code
A slight change to the source code can cause large changes to the registers, immediates, and memory addresses in the assembly code. To ignore the effect of such differences, the assembly code is normalized according to the following rules:
● registers such as eax, ax, and al are normalized to REG32, REG16, and REG8 respectively, according to their bit width;
● memory operands such as [eax] and [edi+4] are represented as MEM;
● immediates such as 0 and 5A4Dh are represented as VAL;
● a call instruction that calls an external system library function is left unchanged, while a call to an internal function such as "call sub_105A8" is normalized to "call INNER";
● jump instructions such as "jz short loc_4023E7" are normalized to "jz loc".
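The rules above can be approximated with a few regular expressions. The Python sketch below is an assumption-laden illustration: the function name `normalize`, the register lists (general-purpose registers only), the use of the `sub_` prefix to recognize internal functions, and the label REG8 for 8-bit registers are choices made for this example.

```python
import re

REG32 = r"\b(?:eax|ebx|ecx|edx|esi|edi|ebp|esp)\b"
REG16 = r"\b(?:ax|bx|cx|dx|si|di|bp|sp)\b"
REG8 = r"\b(?:al|ah|bl|bh|cl|ch|dl|dh)\b"
IMM = r"\b(?:[0-9][0-9A-Fa-f]*h|\d+)\b"      # hex like 5A4Dh, or decimal


def normalize(instr: str) -> str:
    """Normalize one assembly instruction according to the listed rules."""
    instr = instr.strip()
    parts = instr.split(None, 1)
    mnemonic = parts[0].lower()
    operands = parts[1] if len(parts) > 1 else ""
    if mnemonic == "call":
        # Internal functions (sub_XXXX names) collapse to INNER;
        # calls to external library functions are kept unchanged.
        return "call INNER" if operands.startswith("sub_") else instr
    if mnemonic.startswith("j"):              # jmp, jz, jnz, ...
        return f"{mnemonic} loc"
    operands = re.sub(r"\[[^\]]*\]", "MEM", operands)   # memory operands first
    operands = re.sub(REG32, "REG32", operands, flags=re.I)
    operands = re.sub(REG16, "REG16", operands, flags=re.I)
    operands = re.sub(REG8, "REG8", operands, flags=re.I)
    operands = re.sub(IMM, "VAL", operands)
    return f"{mnemonic} {operands}".strip()
```

Memory operands are replaced before registers so that the registers inside brackets do not leak into the output.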
(6) compute the simhash value of each code block
Simhash is a fuzzy hash, often used to detect texts and web pages that differ only slightly.
1) create a 64-dimensional vector and initialize it to 0;
2) tokenize the normalized code block into 2-grams of normalized instructions;
3) assign each token a weight, with the token's frequency as the base weight. Because the APIs that a function calls largely determine what it does, call instructions are important in a code block, so the weight of tokens containing a call instruction is doubled;
4) hash each token with the MD5 algorithm and take 64 bits of the MD5 value as the token's hash;
5) weighted merge: for each bit of a token's hash, if the bit is 1, add the token's weight to the corresponding vector component; otherwise subtract it;
6) dimensionality reduction: for each component of the vector, output 1 if it is greater than 0 and 0 otherwise, forming a 64-bit simhash value.
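The six sub-steps map directly onto a short Python function. This is a sketch under stated assumptions: the text does not say which 64 bits of the MD5 value are used, so the first 8 bytes of the digest are taken here; the 2-gram separator, the fallback for single-instruction blocks, and the names `simhash64` and `hamming` are likewise illustrative.

```python
import hashlib
from collections import Counter


def simhash64(instructions: list[str]) -> int:
    """64-bit simhash over 2-grams of already normalized instructions."""
    # Tokens are 2-grams of consecutive instructions; a block with fewer than
    # two instructions falls back to a single token so the hash stays defined.
    if len(instructions) < 2:
        tokens = [" | ".join(instructions)]
    else:
        tokens = [f"{a} | {b}" for a, b in zip(instructions, instructions[1:])]
    vector = [0] * 64
    for token, freq in Counter(tokens).items():
        # Base weight is the token frequency, doubled when a call is involved.
        weight = freq * 2 if "call" in token.split() else freq
        # Assumption: use the first 8 bytes (64 bits) of the MD5 digest.
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for bit in range(64):
            vector[bit] += weight if (h >> bit) & 1 else -weight
    # Keep a 1 wherever the accumulated component is positive.
    return sum(1 << bit for bit in range(64) if vector[bit] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two simhash values."""
    return bin(a ^ b).count("1")
```

Two code blocks can then be compared with `hamming(simhash64(x), simhash64(y))`; blocks that share most of their weighted 2-grams tend to yield a small distance.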
(7) build the three-level inverted index
The first 6 steps extract the functions contained in the massive samples and the code blocks contained in those functions, and compute the simhash value of each code block. A three-level inverted index is built over these elements, as shown in Fig. 2:
1) a simhash value indexes the code blocks that have that value. Simhash collisions are fairly likely, so several code blocks may share the same simhash value;
2) a code block indexes the functions that contain it. Normalization makes code blocks more likely to coincide, so dissimilar functions may also contain identical normalized code blocks;
3) a function indexes the samples that contain it.
The result is a code library for function tracing, built on massive samples using simhash and an inverted index.
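One way to picture the three levels is as three dictionaries of sets. The Python sketch below is an in-memory illustration only: the class name `CodeLibrary`, its method names, and the use of string identifiers for blocks, functions, and samples are assumptions, and the patent does not prescribe a storage format.

```python
from collections import defaultdict


class CodeLibrary:
    """Toy three-level inverted index: simhash -> blocks -> functions -> samples."""

    def __init__(self):
        self.simhash_to_blocks = defaultdict(set)     # level 1
        self.block_to_functions = defaultdict(set)    # level 2
        self.function_to_samples = defaultdict(set)   # level 3

    def add_block(self, sample: str, function: str, block_id: str, simhash: int) -> None:
        """Register one code block together with its function and sample."""
        self.simhash_to_blocks[simhash].add(block_id)
        self.block_to_functions[block_id].add(function)
        self.function_to_samples[function].add(sample)

    def samples_for_simhash(self, simhash: int) -> set[str]:
        """Walk all three levels from a simhash value down to samples."""
        samples: set[str] = set()
        for block in self.simhash_to_blocks.get(simhash, ()):
            for function in self.block_to_functions[block]:
                samples |= self.function_to_samples[function]
        return samples


lib = CodeLibrary()
lib.add_block("a.exe", "f1", "blk1", 0x1234)
lib.add_block("b.dll", "f1", "blk1", 0x1234)   # same function found in two samples
lib.add_block("b.dll", "f2", "blk2", 0x1234)   # different block, same simhash
```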
Fig. 3 shows the flow of rapid function tracing on the constructed code library. The specific implementation steps are as follows:
(1) split the code of the function to be traced, P, into code blocks according to jump instructions (n blocks, say), then compute the simhash value of each code block:
$$P \rightarrow \{sh_1, sh_2, \ldots, sh_n\}$$ [Formula 1]
(2) using the simhash multi-table indexing method, quickly retrieve for each $sh_i$, $i \in [1, n]$, the similar simhash values whose Hamming distance from it is within 3, forming the similar-simhash sets:
$$shSet_i = \{sh : d(sh_i, sh) \le 3\}, \quad i \in [1, n]$$ [Formula 2]
where $d(sh_i, sh)$ is the Hamming distance between $sh_i$ and $sh$. If every similar-simhash set is empty, no function shares a similar code block with the function to be traced; otherwise, continue to the next step;
(3) using the constructed inverted index from simhash values to code blocks, retrieve the code blocks L whose simhash value belongs to $shSet_i$, forming the similar-code-block sets:
$$LSet_i = \{L : simhash(L) \in shSet_i\}, \quad i \in [1, n]$$ [Formula 3]
where simhash(L) is the simhash value of code block L;
(4) using the constructed inverted index from code blocks to their containing functions, retrieve the functions P' that contain the similar code blocks; together these form the candidate function set;
(5) for each retrieved function P', compute its code overlap with the function to be traced, P, that is, the proportion of P's code that is similar to code in P', denoted iSim(P, P') [Formula 5], where len(P) is the total number of instructions in function P and len(L_i) is the number of instructions in code block L_i. Experiments showed that a function is most likely similar to the function to be traced when its code overlap is at least 0.5, so the final set of potentially similar functions is:
$$PSet = \{P' : iSim(P, P') \ge 0.5\}$$ [Formula 6]
If PSet is empty, no function has enough code in common with the function to be traced to be considered similar, and the function to be traced is regarded as original with respect to the existing code set.
(6) if a potentially similar function and the function to be traced share only one similar code block, it is determined to be a similar function; if they share multiple similar code blocks, it is determined to be a similar function only when the jump relationships are identical.
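Step (2) refers to a simhash multi-table index without spelling it out. A common scheme, sketched below in Python as an assumption rather than as the patent's implementation, splits each 64-bit value into 4 bands of 16 bits: if two values differ in at most 3 bits, at least one band must be identical, so candidates can be fetched by exact band lookup and then verified. The class name `SimhashIndex` and the band layout are illustrative.

```python
from collections import defaultdict

BANDS, BAND_BITS, MAX_DIST = 4, 16, 3


def bands(value: int) -> list[tuple[int, int]]:
    """Split a 64-bit value into (band index, 16-bit band value) keys."""
    mask = (1 << BAND_BITS) - 1
    return [(i, (value >> (i * BAND_BITS)) & mask) for i in range(BANDS)]


class SimhashIndex:
    """One lookup table per band; values sharing any band become candidates."""

    def __init__(self):
        self.tables = defaultdict(set)

    def add(self, value: int) -> None:
        for key in bands(value):
            self.tables[key].add(value)

    def query(self, value: int) -> set[int]:
        candidates: set[int] = set()
        for key in bands(value):
            candidates |= self.tables.get(key, set())
        # Verify the real Hamming distance of every candidate.
        return {c for c in candidates if bin(c ^ value).count("1") <= MAX_DIST}


index = SimhashIndex()
stored = 0xFFFF0000FFFF0000
index.add(stored)
near = index.query(stored ^ 0b111)     # 3 bits flipped: still within distance 3
far = index.query(stored ^ 0b1111)     # 4 bits flipped: too far
```

Because 3 differing bits can touch at most 3 of the 4 bands, the lookup never misses a value within distance 3; the final filter removes candidates that only happen to share a band.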
The rapid reused-code tracing system based on simhash and an inverted index disclosed by the invention can be used to rapidly trace similar functions and to determine reused functions and the samples that reuse them, which helps improve the efficiency of malicious code analysis, reuse-relationship detection, and malicious code homology determination. The system consists mainly of 3 modules: a preprocessing module, a code library construction module, and a function tracing module.
The system structure is shown in Fig. 4. The specific implementation steps are as follows:
(1) preprocessing
1) each sample is unpacked and disassembled to obtain its assembly code;
2) for each assembly listing, all functions are extracted according to the special markers;
3) each function is divided into multiple code blocks according to its jump instructions and jump relationships;
4) each code block is normalized and its simhash value is computed.
(2) code library construction
From the massive samples and the functions, code blocks, and simhash values obtained by preprocessing, a code library containing a three-level inverted index is built: a simhash value indexes the code blocks that have that value, a code block indexes the functions that contain it, and a function indexes the samples that contain it.
(3) function tracing
The preprocessing module divides the function to be traced into multiple code blocks and computes their simhash values. Similar code blocks are first retrieved from the code library by simhash value, the inverted index then leads from the similar code blocks to potentially similar functions, and similar functions are finally determined from the code overlap and the jump relationships between similar code blocks.
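The code-overlap filter used during function tracing can be illustrated as follows. The exact formula for the overlap is not reproduced in this text, so the Python sketch assumes one plausible reading of its description: the instructions of the traced function's blocks that have a similar block in the candidate, divided by the traced function's total instruction count. The function name `code_overlap` and the sample data are hypothetical.

```python
def code_overlap(block_lengths: dict[str, int], similar_blocks: set[str]) -> float:
    """Assumed overlap: instructions in matched blocks / all instructions of the traced function."""
    total = sum(block_lengths.values())
    if total == 0:
        return 0.0
    return sum(block_lengths[b] for b in similar_blocks) / total


# Traced function P: three blocks with 4, 6 and 10 instructions.
P = {"L1": 4, "L2": 6, "L3": 10}
# Blocks of P that have a similar block in each candidate function.
matches = {"P1": {"L1", "L2"}, "P2": {"L1"}}

# Keep candidates whose overlap is at least 0.5, as in the description.
pset = [name for name, blocks in matches.items() if code_overlap(P, blocks) >= 0.5]
```

Here candidate P1 covers 10 of 20 instructions and is kept, while P2 covers only 4 of 20 and is dropped.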
Claims (10)
1. A method for constructing a reused-code library, comprising the steps of:
1) obtaining the assembly code of each executable program sample;
2) extracting the functions in each assembly code, splitting each function into several code blocks according to the jump instructions and jump addresses in the function, and saving them in a code library;
3) computing the simhash value of each code block;
4) building a three-level inverted index between simhash values and code blocks, code blocks and functions, and functions and samples.
2. The method of claim 1, wherein in the three-level inverted index a simhash value indexes the corresponding code blocks, a code block indexes the functions that contain it, and a function indexes the samples that contain it.
3. The method of claim 1 or 2, wherein the functions in the assembly code are extracted according to the call instruction and the call address.
4. A method for rapidly tracing reused code, comprising the steps of:
1) splitting the function to be traced into several code blocks according to jump instructions and jump addresses, and computing the simhash value of each code block;
2) for each code block, searching the code library for code blocks whose simhash value is within a set Hamming distance of that code block's simhash value, and taking them as the similar code blocks of that code block;
3) searching the code library for the potentially similar function corresponding to each similar code block, assigning each potentially similar function a weight according to the number of similar code blocks it shares with the function to be traced, and selecting the potentially similar functions whose weight exceeds a set threshold as similar functions;
4) determining from the jump relationships between the similar code blocks whether each similar function is similar to the function to be traced; if so, the similar function is reused code.
5. The method of claim 4, wherein the code library is searched for the samples containing the similar functions determined in step 4), and the similar functions and the samples containing them are returned.
6. The method of claim 4, wherein the code library comprises multiple code blocks and their simhash values, together with a three-level inverted index in which a simhash value indexes the corresponding code blocks, a code block indexes the functions that contain it, and a function indexes the samples that contain it.
7. The method of claim 4, 5 or 6, wherein the simhash value is computed as follows:
71) create an N-dimensional vector and initialize it to 0;
72) tokenize the normalized code block into 2-grams of normalized instructions;
73) assign each token a weight, with the token's frequency as the base weight, and double the weight of tokens containing a call instruction;
74) hash each token with the MD5 algorithm and take N bits of the MD5 value as the token's hash;
75) for each bit of a token's hash, add the token's weight to the corresponding vector component if the bit is 1, and subtract it otherwise;
76) for each component of the vector, output 1 if it is greater than 0 and 0 otherwise, forming an N-bit simhash value.
8. The method of claim 4, 5 or 6, wherein in step 3) the function corresponding to each similar code block is first looked up in the code library, and whether that function is a potentially similar function is then determined from its code overlap with the function to be traced.
9. The method of claim 4, 5 or 6, wherein in step 4) whether a similar function is similar to the function to be traced is determined from the jump relationships between similar code blocks as follows: if the similar function and the function to be traced share only one similar code block, the similar function is determined to be similar to the function to be traced; if they share multiple similar code blocks, the similar function is determined to be similar to the function to be traced when the jump relationships between those similar code blocks in the similar function are identical to the jump relationships between the corresponding code blocks in the function to be traced.
10. A system for rapidly tracing reused code, comprising a preprocessing module, a code library construction module, and a function tracing module, wherein:
the preprocessing module is configured to obtain the assembly code of each sample, extract the functions in each assembly code, split each function into several code blocks according to the jump instructions and jump addresses in the function, and compute the simhash value of each code block;
the code library construction module is configured to save the functions, the code blocks, the samples containing them, and the simhash values of the code blocks, and to build a three-level inverted index in which a simhash value indexes the corresponding code blocks, a code block indexes the functions that contain it, and a function indexes the samples that contain it;
the function tracing module is configured to retrieve from the code library, according to the code blocks into which the function to be traced is split and their simhash values, the similar code blocks of the function to be traced; to look up in the code library the potentially similar function corresponding to each similar code block; to assign each potentially similar function a weight according to the number of similar code blocks it shares with the function to be traced; to select the potentially similar functions whose weight exceeds a set threshold as similar functions; and to determine from the jump relationships between the similar code blocks whether each similar function is similar to the function to be traced; if so, the similar function is reused code.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610474461.1A CN106126235B (en) | 2016-06-24 | 2016-06-24 | A kind of multiplexing code base construction method, the quick source tracing method of multiplexing code and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106126235A CN106126235A (en) | 2016-11-16 |
CN106126235B true CN106126235B (en) | 2019-05-07 |
Family
ID=57266110
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610474461.1A Active CN106126235B (en) | 2016-06-24 | 2016-06-24 | A kind of multiplexing code base construction method, the quick source tracing method of multiplexing code and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106126235B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101425008A (en) * | 2007-11-01 | 2009-05-06 | 北京航空航天大学 | Method for measuring similarity of source code based on edition distance |
CN102063446A (en) * | 2009-11-13 | 2011-05-18 | 中国移动通信集团四川有限公司 | Method for creating inverted index and inverted indexing device |
CN103646080A (en) * | 2013-12-12 | 2014-03-19 | 北京京东尚科信息技术有限公司 | Microblog duplication-eliminating method and system based on reverse-order index |
- 2016-06-24: CN application CN201610474461.1A, granted as CN106126235B, status: Active
Non-Patent Citations (1)
Title |
---|
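"Duplicate Code Detection Based on Abstract Syntax Trees"; Wu Chong; China Master's Theses Full-text Database (Electronic Journal), Information Science and Technology Series; 2015-12-15; Vol. 2015, No. 12; pp. I138-154 * |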
Also Published As
Publication number | Publication date |
---|---|
CN106126235A (en) | 2016-11-16 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN106126235B (en) | A kind of multiplexing code base construction method, the quick source tracing method of multiplexing code and system | |
CN107798136B (en) | Entity relation extraction method and device based on deep learning and server | |
Wang et al. | Efficient approximate entity extraction with edit distance constraints | |
RU2016113791A (en) | METHOD AND DEVICE FOR CONSTRUCTION OF PATTERN AND METHOD AND DEVICE FOR IDENTIFICATION OF INFORMATION | |
Crim et al. | Automatically annotating documents with normalized gene lists | |
CN101339547A (en) | Apparatus and method for machine translation | |
CN111581355A (en) | Method, device and computer storage medium for detecting subject of threat intelligence | |
Nguyen et al. | Text classification of technical papers based on text segmentation | |
Yang et al. | Smart library: Identifying books on library shelves using supervised deep learning for scene text reading | |
CN110191096A (en) | A kind of term vector homepage invasion detection method based on semantic analysis | |
CN102339294A (en) | Searching method and system for preprocessing keywords | |
CN106528527A (en) | Identification method and identification system for out of vocabularies | |
CN102314464A (en) | Lyrics searching method and lyrics searching engine | |
CN113297580B (en) | Code semantic analysis-based electric power information system safety protection method and device | |
Nguyen et al. | Extracting bacteria biotopes with semi-supervised named entity recognition and coreference resolution | |
CN109543846B (en) | Mine water inrush spectrum identification method based on MVO-improved DBSCAN | |
CN103049434A (en) | System and method for identifying anagrams | |
KR20100105080A (en) | Query processing method and apparatus based on n-gram | |
WO2015080559A2 (en) | A method and system for automated word sense disambiguation | |
CN111078227B (en) | Binary code and source code similarity analysis method and device based on code characteristics | |
US20230075290A1 (en) | Method for linking a cve with at least one synthetic cpe | |
CN109472145A (en) | A kind of code reuse recognition methods and system based on graph theory | |
CN111090859B (en) | Malicious software detection method based on graph editing distance | |
CN109783607B (en) | Method for matching and identifying massive keywords in arbitrary text | |
EP3138033B1 (en) | Method and apparatus for performing block retrieval on block to be processed of urine sediment image | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||