CN110990017B - Credible tree based feature storage and matching method - Google Patents
- Publication number
- CN110990017B (application CN201910857073.5A; earlier publication CN110990017A)
- Authority
- CN
- China
- Prior art keywords
- matching
- layer
- logic
- characteristic value
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
- G06F8/43—Checking; Contextual analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
- G06F8/42—Syntactic analysis
- G06F8/425—Lexical analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/70—Software maintenance or management
- G06F8/75—Structural analysis for program understanding
- G06F8/751—Code clone detection
Landscapes
- Engineering & Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to the technical field of computer applications, and in particular to a feature storage and matching method based on a credible tree. The method comprises the following steps: a hierarchy establishing step, in which the whole project is divided into a plurality of logic layers; a feature generation step, in which feature values reflecting the parameters of each logic element of each logic layer are extracted; an upper and lower layer feature association step, in which each logic element of an upper logic layer holds a self feature value and a merged feature value, the merged feature value being the sum of the feature values of all logic elements of the next lower logic layer that belong to that logic element; and a matching step. During storage, data is stored in the credible tree from bottom to top; during matching, features are matched from top to bottom, so that matching is fast and accurate.
Description
Technical Field
The invention relates to the technical field of computer applications, and in particular to a feature storage and matching method based on a credible tree.
Background
Code similarity detection is currently used mainly to detect code plagiarism and is an important task in the development and maintenance of computer software. It is widely applied in fields such as software copyright and intellectual property protection, source code plagiarism detection, software component library queries, and program comprehension. It helps determine whether software code is copied or original, and is of practical significance for upholding software copyright.
For example, Chinese patent document No. CN109918218 discloses a code similarity detection method and system based on a relation variable graph, comprising an identifier determining module, a similarity calculating module, and the like, which determine matching query results for code similarity among different documents. Other solutions for detecting code plagiarism exist in China and abroad, such as the MOSS system of Stanford University in the United States, the SIM system of Wichita State University, and the YAP3 system of the University of Sydney in Australia.
However, in both the cited patent document and the existing comparison systems, code features are usually stored flat and matched sequentially. Such matching is cumbersome. In massive code matching scenarios in particular, it suffers from large data volume, low efficiency, and high algorithmic complexity.
Disclosure of Invention
The invention aims to provide a feature storage and matching method based on a credible tree.
The technical purpose of the invention is achieved by the following technical scheme:
A feature storage and matching method based on a credible tree comprises the following steps:
S01, a hierarchy establishing step:
dividing the whole project into a plurality of logic layers;
S02, a feature generation step:
extracting feature values reflecting the parameters of each logic element of each logic layer;
S03, an upper and lower layer feature association step:
each logic element of an upper logic layer holds a self feature value and a merged feature value, wherein the merged feature value is the sum of the feature values of all logic elements of the next lower logic layer that belong to that logic element;
S04, a matching step:
comparing the feature values of the code package under comparison with the feature values of the code packages in the library.
Preferably, there are five logic layers, which are, from bottom to top, a code segment layer, a function layer, a file layer, a folder layer, and a project layer.
Preferably, the matching step S04 proceeds from top to bottom: if the feature value of a layer of the code package under comparison is the same as the feature value of the corresponding layer of a code package in the library, the match is considered successful and the matching step ends.
Preferably, if the feature value of a layer of the code package under comparison differs from the feature value of the corresponding layer of the code package in the library, the match is considered unsuccessful and a lower-layer matching step is performed, in which the feature values of the next layer of the code package under comparison are compared one by one with the feature values of the corresponding next layer of the code package in the library.
Preferably, in the feature generation step S02, the feature value includes a capacity (size) value of the logic element.
Preferably, in step S02, the feature value includes the word frequency with which a given variable appears in the logic element.
Preferably, in the word frequency statistics for a variable, the definition (value assignment) and use contexts are counted: a variable is defined when it is located on the left side of "=", "+=", or "-=", and used when it is located on the right side of "=", "+=", or "-=".
Preferably, in the word frequency statistics for a variable, ordinary statement contexts are counted; an ordinary statement context is a conditional statement and/or a calculation statement and/or an array access and/or a constant assignment.
Preferably, in the word frequency statistics for a variable, nested statement contexts are counted; a nested statement is the outermost loop and/or the second outermost loop and/or the third-level loop and/or a deeper inner loop.
In conclusion, the invention has the following beneficial effects:
1. The credible tree is built from bottom to top and is matched and searched from top to bottom, which greatly improves matching efficiency and reduces algorithmic complexity.
2. The feature value of each logic element includes the feature values of the logic elements of the layer below it, so matching accuracy is high.
Description of the drawings:
FIG. 1 is a schematic diagram of the layers of a trusted tree;
FIG. 2 is a simplified schematic diagram;
FIG. 3 is a diagram illustrating a calculation method of the feature codes in the file layer and the folder layer.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
This embodiment only explains the present invention and does not limit it. After reading this specification, those skilled in the art may modify the embodiment as needed without inventive contribution, and all such modifications are protected by patent law within the scope of the claims of the present invention.
In the present application, as shown in FIG. 1, a credible tree is established that includes five logic layers, which are, from top to bottom: the first layer, the project layer; the second layer, the folder layer; the third layer, the file layer; the fourth layer, the function layer; and the fifth layer, the code segment layer.
Before the code is formally compared and matched, features must first be extracted from the project, so that each logic element of each logic layer has a feature value reflecting its characteristics.
The logic layers are established from bottom to top. As shown in FIG. 1, the code segment layer is established first, then the function layer, then the file layer, and so on upward. This is because the feature value of any logic element includes both its own features and the sum of the corresponding features of the layer below. For example, in FIG. 2, where A to H each represent a logic element, upper-layer features are derived from lower-layer features through corresponding mapping rules: the feature values of F reflect (A, B, C), and the feature values of H reflect (F, G).
Specifically, as shown in FIG. 3, take the second layer (the folder layer) and the third layer (the file layer) as an example. The folder contains two files, file 1 and file 2, whose own feature values have already been extracted: for example, the feature values of file 1 are a1, b1, … m1, and the feature values of file 2 are a2, b2, … m2. Once the file layer has been established and the feature values of the files have been generated, generation of the logic element of the layer above, namely the folder, begins, and the feature value of the folder is generated.
The feature values of the folder in FIG. 3 consist of two parts: self feature values and merged feature values. The first part, the self feature values (A, B … N in FIG. 3), is feature information generated automatically from parameters of the folder itself, such as its capacity and the number of files it contains. The second part, the merged feature values (a, b, … m in FIG. 3), is computed by summing over the corresponding logic elements of the layer below; that is, the folder in FIG. 3 sums file 1 and file 2, so that a = a1 + a2, b = b1 + b2, and m = m1 + m2. If the folder contains three or four files, then a = a1 + a2 + a3 or a = a1 + a2 + a3 + a4, respectively.
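The summation of the merged feature values can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the description does not fix a data format, so representing each element's feature values as a dictionary keyed by feature name, the helper name `merge_features`, and the sample numbers are all hypothetical.

```python
from collections import Counter


def merge_features(children):
    """Sum the feature values of all lower-layer elements, feature by feature."""
    merged = Counter()
    for child in children:
        merged.update(child)  # Counter.update adds values for matching keys
    return dict(merged)


# Hypothetical feature values of file 1 and file 2 (a1, b1, m1 and a2, b2, m2).
file_1 = {"a": 3, "b": 5, "m": 1}
file_2 = {"a": 4, "b": 0, "m": 7}

# The folder keeps its self feature values (A, B, ...) next to the merged
# values (a, b, ... m) computed from the files it contains.
folder = {
    "self": {"A": 2, "B": 8},  # assumed values, e.g. file count and capacity
    "merged": merge_features([file_1, file_2]),
}
```

With these sample values the merged part of the folder is a = 3 + 4, b = 5 + 0, and m = 1 + 7. Adding a third file only means passing one more dictionary to `merge_features`.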
Each letter here denotes a particular feature of the logic element. For example, in a1, b1 … m1 of file 1, a1 and b1 are each the quantity of a particular feature.
Different embodiments may extract the feature quantities in different ways, and this technical scheme imposes no particular limitation. In general, a feature quantity may be the capacity of the element itself or the word frequency of a given variable.
The word frequency calculation covers the following three statistical contexts:
(1) Definition and use. Count the number of times a variable is defined (assigned a value) and used throughout a code segment. A definition is on the left side of "=" (or "+=", "-=", etc.), and a use is on the right side of such an operator.
(2) Ordinary statements. Count the number of times the variable occurs in ordinary statements. Ordinary statement contexts of interest include conditional statements, calculation (add, subtract, multiply, divide) statements, array accesses (the variable as a subscript), constant assignments, and the like.
(3) Nested statements. Count the number of times a variable appears in nested statements, such as the outermost loop, the second outermost loop, the third-level loop, or deeper inner loops.
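As an illustration of context (1) only, the following Python sketch counts definitions and uses of one variable in a C-like code segment. It is a rough approximation under stated assumptions, not the method's actual extractor: the function name `count_def_use` is hypothetical, statements are split naively on `;`, `{` and `}`, only the operators "=", "+=" and "-=" are recognised, and comments and string literals are not handled.

```python
import re

# An assignment operator: "=", "+=" or "-=", but not "==", "!=", "<=", ">="
# or any other compound operator such as "*=".
ASSIGN = re.compile(r"(?<![=!<>+\-*/%&|^])(\+=|-=|=)(?!=)")


def count_def_use(code: str, var: str) -> dict:
    """Count occurrences of `var` left (definition) and right (use) of an assignment."""
    name = re.compile(rf"\b{re.escape(var)}\b")
    counts = {"def": 0, "use": 0}
    for stmt in re.split(r"[;{}]", code):
        m = ASSIGN.search(stmt)
        if m is None:
            continue  # no assignment here, e.g. a bare condition
        counts["def"] += len(name.findall(stmt[:m.start()]))
        counts["use"] += len(name.findall(stmt[m.end():]))
    return counts


segment = "int s = 0; s += a; a = a - s; if (s == a) { s -= 1; }"
s_counts = count_def_use(segment, "s")
a_counts = count_def_use(segment, "a")
```

For this segment, `s` is counted as defined three times and used once, and `a` as defined once and used twice. The comparison `s == a` is skipped because it contains no assignment operator. Contexts (2) and (3) would need statement classification and loop-depth tracking, which this sketch does not attempt.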
At this point, as shown in FIG. 1, every layer of the project has been established from bottom to top, and the feature value of every logic element in each layer has been generated. The matching step then begins.
If a single file is being compared, its feature value is compared directly, one by one, with the feature value of each file in the third logic layer (the file layer) of the projects in the database; if a folder is being compared, its feature value is compared one by one with the feature value of each folder in the second logic layer (the folder layer). Comparison runs in the opposite order to construction: the credible tree is built layer by layer from the bottom up, whereas comparison proceeds layer by layer from the top down.
Specifically, as shown in FIG. 3, the folder in the database is referred to as folder one, and the folder to be compared is referred to as folder two.
The feature code of folder two is compared with the feature code of folder one. If they are consistent, the match succeeds and the matching step ends.
If not, the next layer is compared one by one. For example, suppose folder two contains three files: file 3, file 4, and file 5. The feature code of file 3 is compared with the feature codes of file 1 and file 2 in folder one, and file 4 and file 5 are handled in the same way.
Similarly, if the feature codes are consistent, the match succeeds; if not, matching continues at the next layer, where the feature codes at the function level are compared.
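The top-down procedure can be sketched as a short recursive function in Python. This is a minimal sketch under stated assumptions: the `Node` class, the `match` function, the use of a tuple as the feature code, and all sample values are hypothetical and chosen only to mirror the folder one / folder two example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """A logic element (project, folder, file, function or code segment)."""
    name: str
    feature: Tuple[int, ...]  # feature code of this element (assumed format)
    children: List["Node"] = field(default_factory=list)


def match(query: Node, stored: Node) -> List[Tuple[str, str]]:
    """Return (query, stored) name pairs whose feature codes are identical."""
    if query.feature == stored.feature:
        return [(query.name, stored.name)]  # success: stop, do not descend
    hits = []
    # Mismatch: compare the elements of the next layer one by one.
    for q in query.children:
        for s in stored.children:
            hits.extend(match(q, s))
    return hits


# Folder one is stored in the database; folder two is being compared.
folder_one = Node("folder 1", (7, 12), [
    Node("file 1", (3, 5)),
    Node("file 2", (4, 7)),
])
folder_two = Node("folder 2", (13, 15), [
    Node("file 3", (3, 5)),
    Node("file 4", (9, 9)),
    Node("file 5", (1, 1)),
])
result = match(folder_two, folder_one)
```

Here the folder-level feature codes differ, so the files are compared pairwise and only file 3 and file 1 are reported as a match. When the top-level codes are equal, the function returns immediately without visiting any lower layer, which is where the saving over flat sequential comparison comes from.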
Storing data from bottom to top in the credible tree and matching from top to bottom greatly improves matching efficiency. The traditional matching approach handles too much data and is inefficient: the traditional matching algorithm has complexity O(n^2), whereas the complexity after switching to the Merkle tree algorithm is O(log n).
As shown in FIG. 1, how the code is divided into code segments is left to the software engineer. For example, C code may be divided by braces, while code in other languages may be divided by newline and tab characters; this document imposes no particular limitation.
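One possible way to divide C-like code into segments is sketched below in Python, assuming that the division is made at curly braces. The function name `brace_segments` is hypothetical, and the sketch ignores braces inside string literals and comments, which a real tool would have to handle.

```python
def brace_segments(source: str) -> list:
    """Return every brace-delimited block in `source`, inner blocks first."""
    segments = []
    stack = []  # positions of "{" characters that are still open
    for i, ch in enumerate(source):
        if ch == "{":
            stack.append(i)
        elif ch == "}" and stack:
            start = stack.pop()
            segments.append(source[start:i + 1])
    return segments


code = "int f() { if (x) { y = 1; } return y; }"
blocks = brace_segments(code)
```

For the sample function, `blocks` holds the inner `if` body first and the whole function body second, so nested blocks become separate code segments that can each receive their own feature values.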
Claims (6)
1. A feature storage and matching method based on a credible tree, characterized by comprising the following steps: S01, a hierarchy establishing step: dividing the whole project into a plurality of logic layers; S02, a feature generation step: extracting feature values reflecting the parameters of each logic element of each logic layer; S03, an upper and lower layer feature association step: each logic element of an upper logic layer holds a self feature value and a merged feature value, wherein the merged feature value is the sum of the feature values of all logic elements of the next lower logic layer that belong to that logic element; S04, a matching step: comparing the feature values of the code package under comparison with the feature values of the code packages in the library; wherein there are five logic layers, which are, from bottom to top, a code segment layer, a function layer, a file layer, a folder layer, and a project layer; and wherein the matching step S04 proceeds from top to bottom: if the feature value of a layer of the code package under comparison is the same as the feature value of the corresponding layer of the code package in the library, the match is considered successful and the matching step ends; if the feature value of a layer of the code package under comparison differs from the feature value of the corresponding layer of the code package in the library, the match is considered unsuccessful and a lower-layer matching step is performed, in which the feature values of the next layer of the code package under comparison are compared one by one with the feature values of the corresponding next layer of the code package in the library.
2. The method for storing and matching features based on the trusted tree as claimed in claim 1, wherein: in step S02, the feature value includes a capacity value of the logical element.
3. The method for storing and matching features based on the trusted tree as claimed in claim 1, wherein: in the step of S02, the feature value includes a word frequency of a variable appearing in the logical element.
4. The method for storing and matching features based on the trusted tree as claimed in claim 3, wherein: in the word frequency statistics for a variable, the definition (value assignment) and use contexts are counted; a variable is defined when it is located on the left side of "=", "+=", or "-=", and used when it is located on the right side of "=", "+=", or "-=".
5. The method for storing and matching features based on the trusted tree as claimed in claim 3, wherein: in the word frequency statistics for a variable, ordinary statement contexts are counted, and an ordinary statement context is a conditional statement and/or a calculation statement and/or an array access and/or a constant assignment.
6. The method for storing and matching features based on the trusted tree as claimed in claim 3, wherein: in the word frequency statistics for a variable, nested statement contexts are counted, and a nested statement is the outermost loop and/or the second outermost loop and/or the third-level loop and/or a deeper inner loop.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910857073.5A CN110990017B (en) | 2019-09-11 | 2019-09-11 | Credible tree based feature storage and matching method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910857073.5A CN110990017B (en) | 2019-09-11 | 2019-09-11 | Credible tree based feature storage and matching method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110990017A (en) | 2020-04-10 |
CN110990017B (en) | 2022-09-09 |
Family
ID=70081738
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910857073.5A Active CN110990017B (en) | 2019-09-11 | 2019-09-11 | Credible tree based feature storage and matching method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110990017B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1517914A (en) * | 2003-01-06 | 2004-08-04 | Searching of structural file | |
TW200935257A (en) * | 2008-02-14 | 2009-08-16 | Univ Nat Taiwan Science Tech | Method for detecting similarity or plagiarism in computer programs |
CN108345468A (en) * | 2018-01-29 | 2018-07-31 | 华侨大学 | Programming language code duplicate checking method based on tree and sequence similarity |
CN110032500A (en) * | 2019-03-01 | 2019-07-19 | 阿里巴巴集团控股有限公司 | Multilayer nest data analysis method and equipment |
Also Published As
Publication number | Publication date |
---|---|
CN110990017A (en) | 2020-04-10 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||