CN109062792A - A kind of Open Source Code detection method based on String matching and characteristic matching - Google Patents

A kind of Open Source Code detection method based on String matching and characteristic matching

Info

Publication number
CN109062792A
Authority
CN
China
Prior art keywords
source code
matching
string
open source
code
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810807404.XA
Other languages
Chinese (zh)
Inventor
李必信
杨安奇
周颖
王璐璐
廖力
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University
Priority to CN201810807404.XA
Publication of CN109062792A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/36: Preventing errors by testing or debugging software
    • G06F11/3604: Software analysis for verifying properties of programs
    • G06F11/3608: Software analysis for verifying properties of programs using formal methods, e.g. model checking, abstract interpretation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention proposes an open source code detection method based on string matching and feature matching, used to detect the open source code in mixed-source software so that developers can reuse and redevelop it. The method weighs the respective advantages and disadvantages of attribute-counting and structure-metric approaches: it first narrows the database search space by feature matching, and then, working on token strings and following the idea of string matching algorithms, finds the maximal common substrings. This further reduces matching time and memory consumption when the method is applied to open source code detection in large-scale software projects.

Description

An open source code detection method based on string matching and feature matching
Technical field
The present invention relates to a detection method, and in particular to an open source code detection method based on string matching and feature matching. It belongs to the technical field of pattern matching and data mining.
Background
As the volume of open source code grows, techniques for detecting it are attracting more and more research. The relevant prior-art methods are as follows.
1. Text-based comparison. The program is split into strings, and repeated string sequences are found by comparing them. The advantage is that matching is independent of the programming language, simple to implement, and highly flexible and adaptable. The disadvantage is that this simple line-matching approach has a huge search space, so it scales poorly to large systems. Moreover, it cannot detect code that has been changed even slightly; it finds only exactly identical code.
2. Detection based on parameterized matching. The main idea is to capture the content of a program that stays fixed, such as operators and expressions. Its advantage is that it detects duplicated code that differs only in variable names. Its disadvantage is that it fragments the code, so the duplicated blocks it detects are very small, and its space complexity is also high.
3. Methods based on abstract syntax trees. The source is parsed, a complete abstract syntax tree is built, and duplicate subtrees are detected with a standard algorithm. Because a large software system may contain a great many subtrees, the search space can be very large, so all subtrees are stored in a hash container and only the hashed entries are compared. Like the text-based method, detection based on abstract syntax trees handles slightly modified duplicated code poorly.
4. Methods based on the program dependence graph (PDG). These detect duplicated code by finding isomorphic subgraphs in the dependence graph. Although they can detect code whose statements have been reordered, their time complexity is as high as O(n^4), so they are still unsuitable for large software.
A new scheme is therefore urgently needed to solve this technical problem.
Summary of the invention
To improve the speed of code matching in large-scale projects, the invention adopts the idea of token sequence matching and also proposes a fast matching mode based on statistical features of the code. Combining the two further reduces the time and space complexity of matching. The scheme detects duplicated code by token string matching: the program is first split into a token stream, and duplicated code is then found by identifying repeated token sequences. Because the method uses the syntactic information of the source code, its analysis is more accurate, and its comparatively low time complexity makes it suitable for analyzing large software code bases.
To achieve the above object, the technical solution of the invention is as follows: an open source code detection method based on string matching and feature matching, characterized in that the method comprises the following steps: step 1: code feature extraction; step 2: open source code detection based on feature matching; step 3: open source code detection based on string matching.
As an improvement of the invention, step 1, code feature extraction, operates as follows: obtain source code files, analyze them, split each file by function, extract the file-level statistical features and the function-level structural features, and store the corresponding basic statistical features in the feature database for that code file.
As an improvement of the invention, obtaining source code files in step 1 includes obtaining source code files from an open source code library.
As an improvement of the invention, step 2, open source code detection based on feature matching, is as follows:
Obtain the code file to be matched;
Search the feature database for it and obtain the corresponding feature information;
Select features by iterative calculation combined with manual comparison;
Assemble the selected features into a feature vector and compute its Euclidean distance to the corresponding feature vectors in the feature database;
Take the code files whose distance is below the set threshold as the candidate set for the search.
As an improvement of the invention, step 3, open source code detection based on string matching, uses the candidate set as its search space and is as follows:
Step 1: Perform lexical analysis on the source program and convert it into a token string;
Step 2: Using some syntax rules, divide the token sequence into segments, with the basic program block as the unit, and create a registration database;
Step 3: Use the RKR-GST string matching algorithm to obtain the matching strings of each segment;
Step 4: Merge smaller adjacent segments into larger duplicated code segments, and judge whether each is a false positive;
Step 5: Output the open source code finally detected.
In this scheme, the method comprises the following parts.
Attribute statistics based on code files: among the code files in a project, those whose suffix denotes a common programming language are selected for analysis. Lexical and syntactic analysis is performed, and the relevant attributes of each file are counted, such as total number of lines, number of statements, branch statement ratio, comment ratio, number of functions, average number of statements per function, function cyclomatic complexity, and function depth. These serve as candidate feature values for the subsequent feature matching.
Lexically based unified model construction: code that can be changed arbitrarily without affecting program function is mapped to a unified form; for example, variable names are all mapped to a fixed value. Code in which only variable names were modified can then be detected.
Feature selection and feature matching: of the attributes counted, some have little relevance to code similarity and others a great deal. Useless attributes are removed through comparative tests, leaving a feature vector that measures code similarity, and the similarity of two pieces of code is judged by the Euclidean distance between their vectors: the larger the distance, the lower the similarity, and the smaller the distance, the higher the similarity. Notably, this distance-based search is fast, with time complexity O(n), so the range of similar code is located quickly and the search space for token sequence matching shrinks.
Token sequence similarity calculation: the similarity between two program segments is the similarity between their corresponding token strings. Each string can be regarded as made up of several substrings; the substrings that two strings share are their common substrings, so their similarity can be expressed as the proportion of the whole string that the common substrings occupy. Following the RKR-GST string matching algorithm, the duplicated code is obtained and the similarity is calculated as described above.
Compared with the prior art, the beneficial effects of the invention are as follows: 1) the accuracy of open source code detection is further improved;
2) the speed of open source code detection is greatly increased.
Brief description of the drawings
Fig. 1 is a schematic diagram of attribute statistics based on lexical and syntactic analysis;
Fig. 2 is the detailed data flow diagram for converting source code into a token string;
Fig. 3 is a schematic diagram of feature selection and feature matching;
Fig. 4 is a schematic flow diagram of matching based on token sequences;
Fig. 5 is the implementation flow chart of open source code detection.
Specific embodiments
To aid understanding of the invention, it is further described below with reference to the drawings and specific embodiments.
Embodiment 1: referring to Fig. 1 to Fig. 5, an open source code detection method based on string matching and feature matching comprises the following steps: step 1: code feature extraction; step 2: open source code detection based on feature matching; step 3: open source code detection based on string matching.
Step 1, code feature extraction, operates as follows: obtain source code files, analyze them, split each file by function, extract the file-level statistical features and the function-level structural features, and store the corresponding basic statistical features in the feature database for that code file. Obtaining source code files in step 1 includes obtaining them from an open source code library.
Step 2, open source code detection based on feature matching, is as follows:
Obtain the code file to be matched;
Search the feature database for it and obtain the corresponding feature information;
Select features by iterative calculation combined with manual comparison;
Assemble the selected features into a feature vector and compute its Euclidean distance to the corresponding feature vectors in the feature database;
Take the code files whose distance is below the set threshold as the candidate set for the search.
Step 3, open source code detection based on string matching, uses the candidate set as its search space and is as follows:
Step 1: Perform lexical analysis on the source program and convert it into a token string;
Step 2: Using some syntax rules, divide the token sequence into segments, with the basic program block as the unit, and create a registration database;
Step 3: Use the RKR-GST string matching algorithm to obtain the matching strings of each segment;
Step 4: Merge smaller adjacent segments into larger duplicated code segments, and judge whether each is a false positive;
Step 5: Output the open source code finally detected.
The specific embodiment of the invention is explained further below with reference to the drawings.
The lexical and syntactic analysis used by the method differs slightly between programming languages, but the underlying idea stays the same; the mainstream languages Java and C are used here for illustration. In Fig. 1, a Java source file is the input. Lexical analysis and keyword matching yield basic statistics such as the number of classes, methods, and operators in the source, the number of lines of code, the keyword count, the branch statement ratio, and the comment ratio. Syntactic analysis then yields structural statistics such as function cyclomatic complexity and function depth. This information is arranged into an attribute analysis report for the source file. The analysis program can process all open source code files and store the statistics for each file of each project in a database, which serves as the feature database for subsequent feature matching.
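The kind of per-file statistics described above can be illustrated with a minimal Python sketch. It is not the patent's analyzer: the function name `file_statistics`, the regular expressions, and the choice of which keywords count as branches or decision points are all assumptions made for illustration, and the counting is deliberately crude (keywords inside comments are counted, and statements are approximated by semicolons).

```python
import re

# Keywords treated as branch statements in this sketch (an assumption).
BRANCH_RE = re.compile(r"\b(?:if|else|switch|case)\b")
# Decision points used for a rough cyclomatic estimate (also an assumption).
DECISION_RE = re.compile(r"\b(?:if|for|while|case)\b|&&|\|\|")


def file_statistics(source: str) -> dict:
    """Collect a few basic statistics for C/Java-like source text."""
    # Only non-blank lines are counted.
    lines = [ln.strip() for ln in source.splitlines() if ln.strip()]
    comment_lines = 0
    in_block = False
    for ln in lines:
        if in_block:                      # inside a /* ... */ comment
            comment_lines += 1
            in_block = "*/" not in ln
        elif ln.startswith("//"):
            comment_lines += 1
        elif ln.startswith("/*"):
            comment_lines += 1
            in_block = "*/" not in ln
    statements = source.count(";")        # crude statement count
    branches = len(BRANCH_RE.findall(source))
    return {
        "total_lines": len(lines),
        "statement_count": statements,
        "comment_ratio": comment_lines / len(lines) if lines else 0.0,
        "branch_ratio": branches / statements if statements else 0.0,
        "cyclomatic_estimate": 1 + len(DECISION_RE.findall(source)),
    }
```

A real implementation would work on a token stream or syntax tree rather than on raw text, and would compute the structural measures per function rather than per file.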
To detect copied code correctly, the detection process must ignore surface differences between programs and concentrate on their more essential content. The source code therefore undergoes lexical analysis before comparison, since the lexical structure is fixed, followed by normalization: attributes unrelated to program structure, such as variable names, comments, and whitespace, are stripped out, and features that reflect program structure are counted. Only then is the result of the similarity comparison accurate. Normalization, the unified model construction mentioned above, replaces certain lexical elements with a predefined generic identifier string while leaving unchanged the elements that directly affect statement semantics or program structure. For a C source program, for example, the token string can be generated as follows.
1. Identifiers (variable names, method names, and so on) only describe the data a program uses, so each identifier can be replaced with a single special token.
2. Numeric and character constants do not change during execution but may well be changed during design. Such changes have little effect on the analysis of code logic, so constants are usually blurred during analysis: all constants are replaced with one token, or simply ignored.
3. Synonym mapping is another common form: basic data types such as int, long, and short are all replaced with a common tag such as type.
4. Comments, spaces, tabs, newlines, and similar symbols are ignored.
5. Other keywords, operators, logical symbols, separators, and other symbols that directly affect the internal logic of the program are kept unchanged.
The detailed data flow of this source code preprocessing is shown in Fig. 2.
After the token sequences and feature vectors have been extracted, matching starts from the feature vector, that is, from several statistical attributes. Note that the minimum granularity of the statistical attributes is the function: statement-level statistics are not enough to show that code was not written independently, so comparing at function granularity is the clearer choice. There are many statistical attributes. Some may not reflect code similarity, such as the number of classes or the number of methods, while others are extremely important for judging similarity, such as cyclomatic complexity and function depth. As shown in Fig. 3, suitable feature values are therefore first chosen from experience and assembled into a feature vector. The feature vectors are normalized and the distance is computed with the Euclidean distance formula

$$d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

where X = (x_1, ..., x_n) and Y = (y_1, ..., y_n) are the two normalized feature vectors. The smaller the distance, the more similar the code; the larger the distance, the less similar. A threshold p is set. If the measured distance is greater than the threshold, the feature vector in use is not sufficient to match similar code, and the features need to be recombined. If the distance is less than the threshold, the similarity of the two pieces of code is further compared manually. If the manual result agrees with the measured distance and the set threshold, the chosen feature vector is considered suitable; otherwise the weights of the feature attributes are adjusted, or the features are screened further, and the distance is recomputed, until the most suitable matching feature vector is found. This is in effect iteration combined with manual inspection: the results are compared manually to judge the quality of the features. Alternatively, samples could be labeled manually and used as a data set for statistical learning on a binary classification problem, with the training metrics evaluated by cross-validation to tune the feature parameters; this would also achieve feature selection. That approach carries a manual labeling cost, however, so iteration with manual judgment was chosen here, and tests show that it also works well.
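The candidate-set step can be sketched in a few lines of Python. The sketch assumes min-max scaling as the normalization (the text does not say which normalization is used), and the names `normalize`, `euclidean`, and `candidate_set` as well as the example feature values are hypothetical.

```python
import math


def normalize(vectors):
    """Min-max scale each feature dimension to [0, 1] (assumed normalization)."""
    dims = list(zip(*vectors))
    lows = [min(d) for d in dims]
    # A constant dimension gets span 1.0 to avoid division by zero.
    spans = [(max(d) - min(d)) or 1.0 for d in dims]
    return [[(v - lo) / sp for v, lo, sp in zip(vec, lows, spans)]
            for vec in vectors]


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def candidate_set(query, library, threshold):
    """Names of library entries whose normalized distance to query is below threshold."""
    names = list(library)
    scaled = normalize([query] + [library[n] for n in names])
    q, rest = scaled[0], scaled[1:]
    return [n for n, vec in zip(names, rest) if euclidean(q, vec) < threshold]


# Hypothetical feature vectors: (statement count, cyclomatic complexity).
library = {"a.c": [10, 3], "b.c": [50, 9], "c.c": [12, 3]}
print(candidate_set([11, 3], library, threshold=0.1))
```

One pass over the library computes one fixed-size distance per entry, which is where the linear cost mentioned in the text comes from. The threshold here is arbitrary; the text describes tuning it together with the feature selection.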
The feature matching of the previous step quickly narrows the database search range. The method only needs to compute distances between feature vectors, and the number of extracted feature dimensions is bounded, so the matching time complexity is O(n). Feature-based matching, however, captures only roughly similar structure; it cannot determine accurately whether code is truly similar or identical. Its role is to shrink the search space for token sequence matching considerably. It is like looking for an acquaintance in a vast crowd: we first filter by outward traits such as height and weight, and then examine only the people who fit, which speeds up the search. The specific token sequence matching method is shown in Fig. 4, and its steps are as follows:
Step 1: Perform lexical analysis on the source program and convert it into a token string;
Step 2: Using some syntax rules, divide the token sequence into segments, with the basic program block as the unit, and create a registration database;
Step 3: Use the RKR-GST string matching algorithm to obtain the matching strings of each segment;
Step 4: Merge smaller adjacent segments into larger duplicated code segments, and judge whether each is a false positive;
Step 5: Output the open source code finally detected.
First, a batch file conversion program is designed. It performs lexical analysis on every programming language file in the target folder and generates a corresponding text file that stores the token string produced from that file. Each text file records both the path of its source file under the target folder and the line number of each token in the original program, so that detected duplicated code can be located later.
To adapt a traditional lexical analyzer to duplicated code detection, the invention improves it by building a unified description model for all tokens. The normalization rules are as follows:
1. Keywords: all keywords are converted from lowercase to uppercase; for example, "if" becomes "IF".
2. Delimiter symbols: all delimiter symbols are left unchanged by the conversion; for example, "++" remains "++".
3. Constants: these fall into three classes, numeric, character, and string. All numeric constants are converted to "<language name>_fig", for example C_fig for C; all character constants are converted to "<language name>_har"; all string constants are converted to "<language name>_str".
4. Variable names: all variable names are converted to the format B_X1X2. B_ marks the token as a variable name and never changes; X1 denotes the data type of the variable (integer, character, and so on); X2 denotes the kind of variable (ordinary variable, array, pointer, pointer array, and so on).
5. Function names: all function names are converted to the format H_T, where H_ marks the token as a function name and T denotes the type of the function's return value.
Normalizing the token string makes detection tolerant of renamed identifiers in duplicated code segments. The rules above map certain similar kinds of identifiers to the same token because, after code is copied and pasted, the data types of some variables may need to change, for example from an integer variable to a floating-point variable or another data type. Because variable names of close but different data types map to the same token, the two pieces of code are still found as duplicates in the later matching, rather than being missed because a variable was renamed or its type changed. Likewise, to tolerate renamed functions, all function names with similar return types are mapped to the same token.
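A small Python sketch can show how these rules make renamed code compare equal. It is a simplification under stated assumptions: only a handful of C keywords are listed, every identifier becomes a bare "B_" (the X1X2 type and kind suffixes and the H_T function-name form would need a symbol table and are omitted), and characters the patterns do not recognize are silently skipped. `TOKEN_SPEC`, `C_KEYWORDS`, and `normalize_tokens` are hypothetical names.

```python
import re

# A small subset of C keywords, for illustration only.
C_KEYWORDS = {"if", "else", "for", "while", "return",
              "int", "long", "short", "char", "float", "double", "void"}

TOKEN_SPEC = [
    ("comment", r"//[^\n]*|/\*.*?\*/"),
    ("string",  r'"(?:\\.|[^"\\])*"'),
    ("char",    r"'(?:\\.|[^'\\])'"),
    ("number",  r"\d+(?:\.\d+)?"),
    ("word",    r"[A-Za-z_]\w*"),
    ("op",      r"\+\+|--|==|!=|<=|>=|&&|\|\||[-+*/%=<>!&|^~?:;,.(){}\[\]]"),
    ("space",   r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC), re.DOTALL)


def normalize_tokens(source, lang="C"):
    """Return (token, line_number) pairs after applying the normalization rules."""
    out = []
    for m in TOKEN_RE.finditer(source):
        kind, text = m.lastgroup, m.group()
        if kind in ("comment", "space"):
            continue                          # comments and whitespace are dropped
        line = source.count("\n", 0, m.start()) + 1
        if kind == "string":
            tok = f"{lang}_str"               # rule 3: string constant
        elif kind == "char":
            tok = f"{lang}_har"               # rule 3: character constant
        elif kind == "number":
            tok = f"{lang}_fig"               # rule 3: numeric constant
        elif kind == "word":
            # rule 1 for keywords; simplified rules 4 and 5 for identifiers
            tok = text.upper() if text in C_KEYWORDS else "B_"
        else:
            tok = text                        # rule 2: symbols kept unchanged
        out.append((tok, line))
    return out


a = "int total = 0; // sum\nif (count >= 10) total++;\n"
b = "int sum = 5;\nif (n >= 99) sum++;\n"
print([t for t, _ in normalize_tokens(a)] == [t for t, _ in normalize_tokens(b)])
```

The two snippets differ in every identifier and constant yet produce the same token sequence, and each token keeps its source line so a match can be mapped back to the original file.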
Once all source programs have been converted, each becomes one very long token string. The sequence is divided at function-level granularity to form the registration database. It is divided because open source code is sometimes reused one module at a time rather than as an entire file; matching and merging by segment makes open source code matching more complete. Using simple syntactic information in the source, the program is split into basic modules, each of which serves as a unit of sequence segmentation. The rationale is that duplicated code usually consists of one or more consecutive basic segments. With the basic block as the segmentation unit, basic duplicated segments can be found first and then combined into larger code segments.
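One simple way to approximate function-level segmentation on a C-like token stream is to cut at every closing brace that returns to nesting depth zero. The sketch below assumes that convention; the patent does not spell out its syntax rules, and `split_into_blocks` is a hypothetical name.

```python
def split_into_blocks(tokens):
    """Split a flat token list into segments ending at each top-level closing brace."""
    blocks, current, depth = [], [], 0
    for tok in tokens:
        current.append(tok)
        if tok == "{":
            depth += 1
        elif tok == "}":
            depth -= 1
            if depth == 0:            # a top-level body has just closed
                blocks.append(current)
                current = []
    if current:
        blocks.append(current)        # trailing tokens outside any body
    return blocks


tokens = ["INT", "H_", "(", ")", "{", "RETURN", "C_fig", ";", "}",
          "VOID", "H_", "(", ")", "{", "IF", "(", "B_", ")", "{",
          "B_", "++", ";", "}", "}"]
print([len(b) for b in split_into_blocks(tokens)])
```

With this convention, declarations that sit between two functions end up attached to the following segment, and a top-level struct or initializer body would be cut as if it were a function; a real segmenter would use the grammar to tell these apart.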
Next comes matching on the segmented token strings, for which the RKR-GST string matching algorithm is used. RKR-GST is the detection algorithm used by the well-known program plagiarism detection software YAP3 and by JPlag, an online program code copy detection service. It was proposed in 1993 by M. J. Wise of the University of Sydney, who improved the GST (Greedy String Tiling) algorithm by introducing the idea of the KR (Karp-Rabin) string matching algorithm. The string matching problem that RKR-GST solves is stated as follows:
Input: a text string T = "t_0 t_1 … t_{n-1}", a character string of length n; a pattern string P = "p_0 p_1 … p_{m-1}", a character string of length m. Here n and m are positive integers with n ≥ m, and every element of T and P is drawn from the character set Σ.
Output: the set substring of maximal matching substrings of T and P, in which no two elements overlap or contain one another.
A maximal matching substring of T and P is defined as follows. Suppose the substrings of length k that start at position i in T and at position j in P are equal element by element, and either t_{i+k} ≠ p_{j+k}, or at least one of T and P has ended at that point, or the following element already belongs to another maximal match. Then "t_i t_{i+1} … t_{i+k-1}" (equivalently "p_j p_{j+1} … p_{j+k-1}") is called a maximal matching substring of T and P of length k, written max_match(i, j, k).
When applied to similarity detection of program code, RKR-GST requires a parameter called the minimum match length. It sets the shortest match permitted between the token strings obtained from the two programs being compared; matches shorter than this are ignored. For example, it is normal for two programs to both contain the string "while", so such a short match says nothing about copying.
Its main steps are as follows:
Step 1: Initialize substring to the empty set and set the initial value of the search length s.
Step 2: Compute the hash value of every unmarked substring of length s in text string T and store it in the hash structure Hash_structure.
Step 3: Compute the hash value of each unmarked substring of length s in pattern string P and compare it with the hash values in Hash_structure. If they are equal, extend the match: if "t_i t_{i+1} … t_{i+s-1}" and "p_j p_{j+1} … p_{j+s-1}" have equal hash values, compare t_{i+s} with p_{j+s}; if these are equal, continue with t_{i+s+1} and p_{j+s+1}, and so on, until the two characters being compared differ, one of them is already marked, or a string ends. Record the substring obtained by extension as a candidate maximal matching substring.
Step 4: Process all candidate maximal matching substrings obtained in Step 3 in descending order of length, marking them as follows. If the text substring and pattern substring corresponding to the current candidate are equal character by character, handle the candidate as described below; otherwise remove it from the candidate queue and move on to the next candidate.
If the current candidate does not overlap any element of the set substring, add it to substring and mark it in both text string T and pattern string P;
If the current candidate overlaps some element of substring, check whether the substring remaining after the overlap is removed has length greater than or equal to the minimum match length s. If so, add the remainder to the candidate queue; otherwise discard it.
Step 5: Repeat Step 2 to Step 4 until no new candidate maximal matching substring of length greater than or equal to s can be found.
Step 6: Return the set substring.
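The steps above can be sketched compactly in Python. This is a simplified illustration rather than the patent's or YAP3's implementation: a dictionary keyed by token tuples stands in for the Karp-Rabin rolling hash, the search length stays fixed at the minimum match length, and a candidate that overlaps an existing tile is simply dropped and rediscovered on the next pass instead of being trimmed and re-queued. The name `rkr_gst` and the tile format `(i, j, k)` are assumptions.

```python
def rkr_gst(text, pattern, min_len):
    """Greedy tiling of two token sequences; returns tiles (i, j, k) meaning
    text[i:i+k] == pattern[j:j+k], with no token used by more than one tile."""
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    marked_t = [False] * len(text)
    marked_p = [False] * len(pattern)
    tiles = []
    while True:
        # Step 2: index every fully unmarked window of length min_len in the text.
        index = {}
        for i in range(len(text) - min_len + 1):
            if not any(marked_t[i:i + min_len]):
                index.setdefault(tuple(text[i:i + min_len]), []).append(i)
        # Step 3: look up pattern windows and extend each hit as far as possible.
        candidates = []
        for j in range(len(pattern) - min_len + 1):
            if any(marked_p[j:j + min_len]):
                continue
            for i in index.get(tuple(pattern[j:j + min_len]), []):
                k = min_len
                while (i + k < len(text) and j + k < len(pattern)
                       and text[i + k] == pattern[j + k]
                       and not marked_t[i + k] and not marked_p[j + k]):
                    k += 1
                candidates.append((k, i, j))
        if not candidates:
            break                     # Step 5: no new match of length >= min_len
        # Step 4: longest candidates first; skip any that overlap an existing tile.
        for k, i, j in sorted(candidates, key=lambda c: (-c[0], c[1], c[2])):
            if not any(marked_t[i:i + k]) and not any(marked_p[j:j + k]):
                marked_t[i:i + k] = [True] * k
                marked_p[j:j + k] = [True] * k
                tiles.append((i, j, k))
    return tiles


print(rkr_gst(list("ABCDEFGH"), list("EFGHxABCD"), 3))
```

In the example the two halves of the text appear in swapped order in the pattern, and both are still recovered as tiles, which is the property that makes tiling robust to reordered code blocks. The loop terminates because every pass with at least one candidate marks new tokens.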
The set substring holds all maximal matching substrings of text string T and pattern string P whose length is not less than s, together with their positions and lengths in T and P. The similarity of T and P is calculated from these matches as the proportion of the strings covered by common substrings,
where |substring| denotes the sum of the lengths of all maximal matching substrings in the set substring, and |T| and |P| denote the lengths of text string T and pattern string P respectively.
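The similarity formula itself is not reproduced in this text. As a hedged reconstruction, one form that is consistent with the definitions of |substring|, |T|, and |P| above, and that is commonly used with Greedy String Tiling, is the share of both strings covered by matches. This exact form is an assumption, not a quotation from the patent:

```latex
\mathrm{sim}(T, P) = \frac{2 \cdot |substring|}{|T| + |P|}
```

For instance, with |T| = 8, |P| = 9, and matched substrings totalling 8 tokens, this gives 16/17 ≈ 0.94. Identical strings give 1, and strings with no match of length at least s give 0.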
The RKR-GST string matching algorithm detects well despite evasion techniques such as modified comments, reformatting, renamed identifiers, reordered code blocks, reordered order-independent statements, merged procedures, and removed function calls.
In summary, a feature vector is built from the extracted statistical features of the program code, similar code is screened by Euclidean distance to narrow the database search range, and the final code matching is performed on token strings with the RKR-GST algorithm. Matching and search are thus faster while the results remain accurate, and the detection result for the open source code is finally obtained from the open source library.
The above is only an embodiment of the invention as applied to the C and Java languages. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the invention, and these improvements and modifications also fall within the protection scope of the invention.

Claims (5)

1. An open source code detection method based on string matching and feature matching, characterized in that the method comprises the following steps: step 1: code feature extraction; step 2: open source code detection based on feature matching; step 3: open source code detection based on string matching.
2. The open source code detection method based on string matching and feature matching according to claim 1, characterized in that step 1, code feature extraction, operates as follows: obtain source code files, analyze them, split each file by function, extract the file-level statistical features and the function-level structural features, and store the corresponding basic statistical features in the feature database for that code file.
3. The open source code detection method based on string matching and feature matching according to claim 2, characterized in that obtaining source code files in step 1 includes obtaining source code files from an open source code library.
4. The open source code detection method based on string matching and feature matching according to claim 3, characterized in that step 2, open source code detection based on feature matching, is performed as follows:
Obtain the code file to be matched;
Search its feature database to obtain the corresponding feature information;
Perform feature selection by combining iterative calculation with manual comparison;
Compose the selected features into a feature vector and compute its Euclidean distance to the corresponding feature vectors in the feature database;
Select the code files whose distance is less than the given threshold as the candidate set for the search.
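The candidate-set selection of claim 4 can be sketched as below. The function names, the dictionary layout of the feature database and the example threshold are assumptions made for illustration. The feature selection step (iterative calculation plus manual comparison) is not modeled: the list of selected feature names is simply passed in.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def to_vector(features, selected):
    """Build a vector from a feature dict, in the order given by `selected`."""
    return [float(features.get(name, 0)) for name in selected]


def candidate_set(query_features, feature_db, selected, threshold):
    """Return ids of stored files whose distance to the query is below threshold.

    `feature_db` maps a file id to its feature dict (hypothetical layout).
    Results are ordered from nearest to farthest.
    """
    q = to_vector(query_features, selected)
    scored = []
    for doc_id, feats in feature_db.items():
        d = euclidean(q, to_vector(feats, selected))
        if d < threshold:  # strict comparison: "less than the given threshold"
            scored.append((d, doc_id))
    return [doc_id for _, doc_id in sorted(scored)]
```

For example, with stored vectors (10, 2), (13, 6) and (40, 9) over the features `lines` and `if`, a query of (10, 2) and a threshold of 6 keeps the first two files (distances 0 and 5) and discards the third.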
5. The open source code detection method based on string matching and feature matching according to claim 4, characterized in that step 3, open source code detection based on string matching, is performed as follows, using the candidate set as the search space:
Step 1: perform lexical analysis on the source program and convert it into a token string;
Step 2: using a set of syntax rules, divide the token sequence into segments, with the basic program block as the unit, and create a registration database;
Step 3: use the RKR-GST string matching algorithm to obtain the matching strings of each segment;
Step 4: merge smaller adjacent segments into larger duplicated code segments, and judge whether each such code segment is a false detection;
Step 5: output the finally detected open source code.
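Steps 1 and 3 of claim 5 can be illustrated with a small sketch. It uses plain Greedy String Tiling with direct token comparison; the Running Karp-Rabin hashing that gives RKR-GST its speed is omitted for brevity, so this shows which tiles are found, not how quickly. The token classes (`ID`, `NUM`), the keyword set, the `min_match` default and the similarity formula are assumptions for illustration. The segmentation of step 2 and the merging and false-detection check of step 4 are not modeled.

```python
import re

TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")
# Hypothetical, incomplete keyword set for a C-like language.
KEYWORDS = {"if", "else", "for", "while", "return", "int", "float", "char", "void"}


def tokenize(source):
    """Convert source text into a token string; identifiers and numbers are abstracted."""
    tokens = []
    for lexeme in TOKEN_RE.findall(source):
        if lexeme in KEYWORDS:
            tokens.append(lexeme)
        elif lexeme[0].isalpha() or lexeme[0] == "_":
            tokens.append("ID")
        elif lexeme[0].isdigit():
            tokens.append("NUM")
        else:
            tokens.append(lexeme)  # punctuation and operators kept as-is
    return tokens


def greedy_string_tiling(a, b, min_match=3):
    """Return tiles (start_a, start_b, length) of non-overlapping common runs."""
    marked_a = [False] * len(a)
    marked_b = [False] * len(b)
    tiles = []
    while True:
        max_len, matches = min_match, []
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not marked_a[i + k] and not marked_b[j + k]):
                    k += 1
                if k > max_len:
                    max_len, matches = k, [(i, j, k)]
                elif k == max_len:
                    matches.append((i, j, k))
        if not matches:
            break
        for i, j, k in matches:
            # Skip matches that overlap a tile already placed in this round.
            if not any(marked_a[i:i + k]) and not any(marked_b[j:j + k]):
                for t in range(k):
                    marked_a[i + t] = marked_b[j + t] = True
                tiles.append((i, j, k))
    return tiles


def similarity(a, b, tiles):
    """Fraction of tokens covered by tiles, in the range 0.0 to 1.0."""
    covered = sum(k for _, _, k in tiles)
    return 2.0 * covered / (len(a) + len(b)) if (a or b) else 0.0
```

Because identifiers are abstracted to `ID`, two functions that differ only in variable and function names produce identical token strings and are covered by a single tile. A candidate file would be reported when the similarity against the code under test exceeds some chosen cutoff, which the claims do not specify.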
CN201810807404.XA 2018-07-21 2018-07-21 A kind of Open Source Code detection method based on String matching and characteristic matching Pending CN109062792A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810807404.XA CN109062792A (en) 2018-07-21 2018-07-21 A kind of Open Source Code detection method based on String matching and characteristic matching

Publications (1)

Publication Number Publication Date
CN109062792A true CN109062792A (en) 2018-12-21

Family

ID=64835012

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810807404.XA Pending CN109062792A (en) 2018-07-21 2018-07-21 A kind of Open Source Code detection method based on String matching and characteristic matching

Country Status (1)

Country Link
CN (1) CN109062792A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110188104A (en) * 2019-05-30 2019-08-30 中森云链(成都)科技有限责任公司 A kind of Python program code method for fast searching towards K12 programming
CN110399729A (en) * 2019-04-11 2019-11-01 国家计算机网络与信息安全管理中心 A kind of binary software analysis method based on module diagnostic weight
CN110704308A (en) * 2019-09-11 2020-01-17 无锡江南计算技术研究所 Multistage feature extraction method
CN110909363A (en) * 2019-11-25 2020-03-24 中国人寿保险股份有限公司 Software third-party component vulnerability emergency response system and method based on big data
CN110955758A (en) * 2019-12-18 2020-04-03 中国电子技术标准化研究院 Code detection method, code detection server and index server
CN110990256A (en) * 2019-10-29 2020-04-10 中移(杭州)信息技术有限公司 Open source code detection method, device and computer readable storage medium
CN111367566A (en) * 2019-06-27 2020-07-03 北京关键科技股份有限公司 Mixed source code feature extraction and matching method
CN111858322A (en) * 2020-07-10 2020-10-30 中国科学技术大学 Python language feature automatic identification system and method
CN112328743A (en) * 2020-11-03 2021-02-05 北京嘀嘀无限科技发展有限公司 Code searching method and device, readable storage medium and electronic equipment
CN114385492A (en) * 2021-12-30 2022-04-22 大连理工大学 Advanced comprehensive tool optimization option defect detection method based on differential test
CN115145633A (en) * 2022-07-25 2022-10-04 杭州师范大学 Code error automatic detection method based on control flow graph

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101894236A (en) * 2010-07-28 2010-11-24 北京华夏信安科技有限公司 Software homology detection method and device based on abstract syntax tree and semantic matching
CN102750379A (en) * 2012-06-25 2012-10-24 华南理工大学 Fast character string matching method based on filtering type
CN107169321A (en) * 2017-06-10 2017-09-15 西安交通工程学院 The program plagiarism detection method and system being combined based on attribute count and structure measurement technology

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110399729A (en) * 2019-04-11 2019-11-01 国家计算机网络与信息安全管理中心 A kind of binary software analysis method based on module diagnostic weight
CN110399729B (en) * 2019-04-11 2021-04-27 国家计算机网络与信息安全管理中心 Binary software analysis method based on component characteristic weight
CN110188104A (en) * 2019-05-30 2019-08-30 中森云链(成都)科技有限责任公司 A kind of Python program code method for fast searching towards K12 programming
CN111367566A (en) * 2019-06-27 2020-07-03 北京关键科技股份有限公司 Mixed source code feature extraction and matching method
CN110704308A (en) * 2019-09-11 2020-01-17 无锡江南计算技术研究所 Multistage feature extraction method
CN110704308B (en) * 2019-09-11 2022-09-09 无锡江南计算技术研究所 Multistage feature extraction method
CN110990256A (en) * 2019-10-29 2020-04-10 中移(杭州)信息技术有限公司 Open source code detection method, device and computer readable storage medium
CN110990256B (en) * 2019-10-29 2023-09-05 中移(杭州)信息技术有限公司 Open source code detection method, device and computer readable storage medium
CN110909363A (en) * 2019-11-25 2020-03-24 中国人寿保险股份有限公司 Software third-party component vulnerability emergency response system and method based on big data
CN110955758A (en) * 2019-12-18 2020-04-03 中国电子技术标准化研究院 Code detection method, code detection server and index server
CN111858322A (en) * 2020-07-10 2020-10-30 中国科学技术大学 Python language feature automatic identification system and method
CN111858322B (en) * 2020-07-10 2022-01-11 中国科学技术大学 Python language feature automatic identification system and method
CN112328743A (en) * 2020-11-03 2021-02-05 北京嘀嘀无限科技发展有限公司 Code searching method and device, readable storage medium and electronic equipment
CN114385492A (en) * 2021-12-30 2022-04-22 大连理工大学 Advanced comprehensive tool optimization option defect detection method based on differential test
CN115145633A (en) * 2022-07-25 2022-10-04 杭州师范大学 Code error automatic detection method based on control flow graph

Similar Documents

Publication Publication Date Title
CN109062792A (en) A kind of Open Source Code detection method based on String matching and characteristic matching
Agichtein et al. Mining reference tables for automatic text segmentation
CN112800201B (en) Natural language processing method and device and electronic equipment
US6178417B1 (en) Method and means of matching documents based on text genre
CN109190092A (en) The consistency checking method of separate sources file
KR100627195B1 (en) System and method for searching electronic documents created with optical character recognition
JP2010501096A (en) Cooperative optimization of wrapper generation and template detection
CN111124487B (en) Code clone detection method and device and electronic equipment
CN1936892A (en) Image content semanteme marking method
CN109885641B (en) Method and system for searching Chinese full text in database
AU2018102145A4 (en) Method of establishing English geographical name index and querying method and apparatus thereof
Soori et al. Text similarity based on data compression in Arabic
CN110705285B (en) Government affair text subject word library construction method, device, server and readable storage medium
CN110688461B (en) Online text education resource label generation method integrating multi-source knowledge
CN114238735B (en) Intelligent internet data acquisition method
CN112925874B (en) Similar code searching method and system based on case marks
CN115617965A (en) Rapid retrieval method for language structure big data
CN111858908A (en) Method and device for generating newspaper picking text, server and readable storage medium
Mishra et al. Multimodal machine learning for extraction of theorems and proofs in the scientific literature
CN113722421A (en) Contract auditing method and system and computer readable storage medium
CN1253814C (en) Automatic pick-up method of key features of digital document
Viyanon et al. A system for detecting xml similarity in content and structure using relational database
Nagasudha et al. Key word spotting using HMM in printed Telugu documents
Goslin et al. English Language Spelling Correction as an Information Retrieval Task Using Wikipedia Search Statistics
CN114492419B (en) Text labeling method, system and device based on newly added key words in labeling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181221