CN110990017B - Credible tree based feature storage and matching method - Google Patents

Credible tree based feature storage and matching method

Info

Publication number
CN110990017B
Authority
CN
China
Prior art keywords
matching
layer
logic
characteristic value
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910857073.5A
Other languages
Chinese (zh)
Other versions
CN110990017A (en)
Inventor
程华
周诚淇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuxi Jiangnan Computing Technology Institute
Original Assignee
Wuxi Jiangnan Computing Technology Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuxi Jiangnan Computing Technology Institute
Priority to CN201910857073.5A
Publication of CN110990017A
Application granted
Publication of CN110990017B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/43 Checking; Contextual analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/42 Syntactic analysis
    • G06F8/425 Lexical analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/75 Structural analysis for program understanding
    • G06F8/751 Code clone detection

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of computer applications, and in particular to a feature storage and matching method based on a credible tree. The invention is realized by the following technical scheme: a feature storage and matching method based on a credible tree, comprising: a hierarchy establishing step, in which the whole project is divided into a plurality of logic layers; a feature generation step, in which characteristic values reflecting the parameters of each logic element of each logic layer are extracted; an upper and lower layer feature association step, in which each logic element of an upper logic layer comprises a self characteristic value and a combined characteristic value, the combined characteristic value being the sum of the characteristic values of all logic elements of the next lower logic layer corresponding to that logic element; and a matching step. The invention aims to provide a credible tree based feature storage and matching method that stores data in the credible tree from bottom to top and performs feature matching from top to bottom, so that matching is fast and accurate.

Description

Credible tree based feature storage and matching method
Technical Field
The invention relates to the technical field of computer applications, and in particular to a feature storage and matching method based on a trusted tree.
Background
Code similarity detection is currently used mainly to detect code plagiarism and is an important task in the development and maintenance of computer software. It is widely applied in fields such as the protection of software copyright and intellectual property, source code plagiarism detection, software component library queries, and program understanding. It helps determine whether software code is copied or original, and is of practical significance for enforcing software copyright.
For example, Chinese patent document No. CN109918218 discloses a code similarity detection method and system based on a relation variable diagram, comprising an identifier determining module, a similarity calculating module and other modules, used to determine matching query results for code similarity between different documents. Other solutions for detecting code plagiarism exist in China and abroad, such as the MOSS system of Stanford University in the United States, the SIM system of Wichita State University, and the YAP3 system of the University of Sydney in Australia.
However, both in the cited document and in the prior-art comparison systems, code features are usually stored flat and matched sequentially. Such a matching process is very cumbersome. Particularly when matching against massive amounts of code, it suffers from large data volume, low efficiency and high algorithmic complexity.
Disclosure of Invention
The invention aims to provide a characteristic storage and matching method based on a credible tree.
The technical purpose of the invention is realized by the following technical scheme:
A feature storage and matching method based on a credible tree comprises the following steps:
S01, a hierarchy establishing step:
dividing the whole project into a plurality of logic layers;
S02, a feature generation step:
extracting characteristic values reflecting the parameters of each logic element of each logic layer;
S03, an upper and lower layer feature association step:
each logic element of an upper logic layer comprises a self characteristic value and a combined characteristic value, wherein the combined characteristic value is the sum of the characteristic values of all logic elements of the next lower logic layer corresponding to that logic element;
S04, a matching step:
comparing the characteristic value of the code packet under comparison with the characteristic values of the code packets in the library.
Preferably, there are five logic layers, which are, from bottom to top, a code segment layer, a function layer, a file layer, a folder layer and a project layer.
Preferably, step S04 matches from top to bottom: if the characteristic value of a layer of the code packet under comparison is the same as the characteristic value of the corresponding layer of a code packet in the library, the match is considered successful and the matching step ends.
Preferably, if the characteristic value of a layer of the code packet under comparison differs from the characteristic value of the corresponding layer of the code packet in the library, the match is considered unsuccessful and a lower-layer matching step is performed, in which the characteristic values of the next layer of the code packet under comparison are compared one by one with the characteristic values of the corresponding next layer of the code packet in the library.
Preferably, in the feature generation step S02, the characteristic value includes a capacity value of the logic element.
Preferably, in step S02, the characteristic value includes the word frequency with which a given variable appears in the logic element.
Preferably, in the word frequency statistics of a variable, the definition (value assignment) and use contexts are counted: a variable is being defined when it is on the left side of "=", "+=" or "-=", and is being used when it is on the right side of "=", "+=" or "-=".
Preferably, in the word frequency statistics of a variable, the contexts of common statements are counted, a common statement context being a conditional statement and/or a calculation statement and/or an array access and/or a constant assignment.
Preferably, in the word frequency statistics of a variable, the contexts of nested statements are counted, a nested statement being an outermost loop and/or a second-outermost loop and/or a third-level loop and/or a deeper inner loop.
In conclusion, the invention has the following beneficial effects:
1. The trusted tree is established from bottom to top and is matched and searched from top to bottom, so that matching efficiency is greatly improved and algorithmic complexity is reduced.
2. The characteristic value of each logic element of each layer includes the characteristic values of the logic elements of the layer below, so the matching accuracy is high.
Description of the drawings:
FIG. 1 is a schematic diagram of the layers of a trusted tree;
FIG. 2 is a simplified schematic diagram;
FIG. 3 is a diagram illustrating a calculation method of the feature codes in the file layer and the folder layer.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
This embodiment only explains the present invention and does not limit it. After reading this specification, those skilled in the art may modify the embodiment as needed without making an inventive contribution, and all such modifications are protected by patent law within the scope of the claims of the present invention.
Embodiment 1: a feature storage and matching method based on a trusted tree, namely a Merkle tree. The Merkle tree was originally devised to solve the authentication problem for multiple one-time signatures; its structure can authenticate a large number of one-time signatures, which gives it a clear advantage in authentication. Today the Merkle tree structure is widely applied in many fields of information security, such as certificate revocation, source multicast authentication and group key agreement. Moreover, the security of a digital signature scheme based on the Merkle tree depends only on the security of the hash function and does not require many theoretical assumptions, which makes such signatures safer and more practical.
In the present application, as shown in FIG. 1, a trusted tree comprising five logic layers is established. From top to bottom these are: the first layer, the project layer; the second layer, the folder layer; the third layer, the file layer; the fourth layer, the function layer; and the fifth layer, the code segment layer.
Before the code is formally compared and matched, features must first be extracted from the project, so that each logic element of each logic layer has a characteristic value reflecting its features.
The logic layers are established from bottom to top. As shown in FIG. 1, the code segment layer is established first, then the function layer, then the file layer, and so on. This is because the characteristic value of any logic element includes both its own features and the sum of the corresponding features of the layer below. For example, as shown in FIG. 2, where A-H each denote a logic element, the upper-level features are derived from the lower-level features through corresponding mapping rules: the characteristic values of F reflect (A, B, C) and the characteristic values of H reflect (F, G).
Specifically, as shown in FIG. 3, take the second-layer folder layer and the third-layer file layer as an example. The folder contains two files, file 1 and file 2, whose own characteristic values have already been extracted: the characteristic values of file 1 are a1, b1, … m1 and those of file 2 are a2, b2, … m2. Once the file layer has been established and the characteristic values of the files have been generated, generation of the logic element of the layer above, namely the folder, begins, and the characteristic value of the folder is generated.
The characteristic values of the folder in FIG. 3 comprise two parts: self characteristic values and combined characteristic values. The first part, the self characteristic values, are A, B, … N in FIG. 3; they are feature information generated automatically from parameters of the folder itself, such as its capacity and the number of files it contains. The second part, the combined characteristic values, are a, b, … m in FIG. 3. They are calculated by summing over the corresponding logic elements of the next layer down; that is, the folder in FIG. 3 sums file 1 and file 2, so that a = a1 + a2, b = b1 + b2, and m = m1 + m2. If the folder has three or four files, then a = a1 + a2 + a3 or a = a1 + a2 + a3 + a4.
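The summation described for FIG. 3 can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `merged_features`, the dictionary layout and all numeric values are assumptions made for the example and do not come from the patent.

```python
from collections import Counter


def merged_features(child_features):
    """Sum the feature values of all lower-layer elements, key by key."""
    total = Counter()
    for features in child_features:
        total.update(features)  # adds each value to the running total for its key
    return dict(total)


# Hypothetical feature values of the two files (a1, b1, m1 and a2, b2, m2).
file1 = {"a": 3, "b": 5, "m": 1}
file2 = {"a": 4, "b": 0, "m": 2}

folder = {
    # Self characteristic values (A, B, ...), e.g. file count and capacity; made-up numbers.
    "own": {"A": 2, "B": 4096},
    # Combined characteristic values: a = a1 + a2, b = b1 + b2, m = m1 + m2.
    "merged": merged_features([file1, file2]),
}

print(folder["merged"])
```

Adding a third file only extends the list passed to `merged_features`, which mirrors the a1 + a2 + a3 case in the text.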
Each letter here denotes a particular feature of the logic element. For example, a1, b1, … m1 of file 1 each denote the quantity of one particular feature.
The feature quantities may be extracted in different ways in different embodiments, and this technical scheme does not particularly limit them. In general, a feature quantity may be the element's own capacity or the word frequency of a given variable.
For the word frequency calculation, the following three kinds of context are counted:
(1) Definition and use. Count the number of times a variable is defined (assigned a value) and used throughout a code segment. A definition is on the left side of "=" (or "+=", "-=", etc.), and a use is on the right side of these operators.
(2) Common statements. Count the number of times the variable occurs in common statements. Common statement contexts of interest include conditional statements, calculation (add, subtract, multiply, divide) statements, array accesses (the variable as a subscript), constant assignments, and the like.
(3) Nested statements. Count the number of times the variable appears in nested statements, such as an outermost loop, a second-outermost loop, a third-level loop, or a deeper inner loop.
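As a rough illustration of context (1), the following Python sketch counts how often a variable appears on the left and on the right of "=", "+=" or "-=". It is a simplification under stated assumptions, not the patent's extraction procedure: the function name `count_def_use` is hypothetical, the input is assumed to consist of plain assignment statements separated by semicolons, and contexts (2) and (3) are not covered.

```python
import re

# Matches "=", "+=" or "-=", but not "==", "<=", ">=", "!=" or other compound operators.
ASSIGN = re.compile(r"(?:\+=|-=|(?<![=!<>+\-*/%&|^])=)(?!=)")
IDENT = re.compile(r"[A-Za-z_]\w*")


def count_def_use(code, variable):
    """Count definitions (left of the operator) and uses (right of it) of a variable."""
    defs = uses = 0
    for statement in code.split(";"):
        m = ASSIGN.search(statement)
        if m is None:
            continue  # not an assignment statement
        left, right = statement[:m.start()], statement[m.end():]
        defs += IDENT.findall(left).count(variable)
        uses += IDENT.findall(right).count(variable)
    return {"def": defs, "use": uses}


code = "x = 0; y = x + 1; x += y; y -= x;"
print(count_def_use(code, "x"))
```

A real extractor would need a proper parser, since statements such as `if (x == y) z = x;` place non-target identifiers to the left of the assignment operator.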
At this point, as shown in FIG. 1, every layer of the project has been established from bottom to top, and the characteristic value of every logic element in each layer has been generated. The matching step then begins.
If a file is to be compared, its characteristic value is compared one by one directly with the characteristic value of each file in the third logic layer, namely the file layer, of the projects in the database; if a folder is to be compared, its characteristic value is compared one by one with the characteristic value of each folder in the second logic layer, namely the folder layer. Comparison proceeds in the opposite order to establishment: the trusted tree is built layer by layer from the bottom up, but compared layer by layer from the top down.
Specifically, as shown in FIG. 3, the folder in the database is referred to as folder one, and the folder to be compared is referred to as folder two.
The feature code of folder two is compared with the feature code of folder one; if they are consistent, the match is successful and the matching step ends.
If not, the next layer must be compared item by item. For example, suppose folder two contains three files: file 3, file 4 and file 5. The feature code of file 3 is then compared with the feature codes of file 1 and file 2 in folder one, and file 4 and file 5 are handled in the same way.
Likewise, if the feature codes are consistent the match is successful; if they are inconsistent, matching continues at the next layer, where the feature codes of all functions are compared.
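The top-down procedure for folder one and folder two can be sketched as a small recursive comparison. This is a minimal sketch under assumptions: the `Element` class, the `match` function, the use of tuple equality as the consistency test, and all feature codes are invented for illustration and are not specified by the patent.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Element:
    name: str
    features: Tuple[int, ...]  # feature code of this logic element
    children: List["Element"] = field(default_factory=list)


def match(candidate, stored):
    """Return (candidate name, stored name) pairs whose feature codes are identical."""
    if candidate.features == stored.features:
        return [(candidate.name, stored.name)]  # match found, no need to descend
    hits = []
    # Mismatch at this layer: compare the elements of the next layer one by one.
    for c in candidate.children:
        for s in stored.children:
            hits.extend(match(c, s))
    return hits


# Folder one is in the database; folder two is the folder under comparison.
folder_one = Element("folder1", (7, 5, 3), [
    Element("file1", (3, 5, 1)),
    Element("file2", (4, 0, 2)),
])
folder_two = Element("folder2", (9, 9, 9), [
    Element("file3", (3, 5, 1)),
    Element("file4", (1, 1, 1)),
    Element("file5", (5, 3, 7)),
])

print(match(folder_two, folder_one))
```

In this made-up data the folder codes differ, so the files are compared pairwise and only file 3 and file 1 agree; if the files had function-level children, mismatching pairs would descend one more layer in the same way.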
Data is stored bottom-up in the trusted tree and matched top-down, which greatly improves matching efficiency. The traditional matching approach handles too much data and is inefficient: the complexity of the traditional matching algorithm is O(n^2), whereas the complexity after switching to the Merkle tree algorithm is O(log n).
As shown in FIG. 1, how code is divided into code segments is left to the software engineer. For example, C code may be divided according to brackets, and code in other languages may be divided according to newline and tab characters; this document imposes no particular limitation.
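One possible way to divide C code by brackets is sketched below. It assumes that the brackets in question are curly braces and that no braces occur inside string literals or comments; the function name `brace_segments` is hypothetical, and the patent leaves the actual division rule to the engineer.

```python
def brace_segments(source):
    """Return the text of every {...} block, innermost blocks first."""
    segments, stack = [], []
    for i, ch in enumerate(source):
        if ch == "{":
            stack.append(i)  # remember where this block opens
        elif ch == "}" and stack:
            start = stack.pop()
            segments.append(source[start:i + 1])
    return segments


src = "int f(){ if(x){ y=1; } return y; }"
for segment in brace_segments(src):
    print(segment)
```

Each returned block could then serve as one logic element of the code segment layer.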

Claims (6)

1. A feature storage and matching method based on a credible tree, characterized by comprising the following steps: S01, a hierarchy establishing step: dividing the whole project into a plurality of logic layers; S02, a feature generation step: extracting characteristic values reflecting the parameters of each logic element of each logic layer; S03, an upper and lower layer feature association step: each logic element of an upper logic layer comprises a self characteristic value and a combined characteristic value, wherein the combined characteristic value is the sum of the characteristic values of all logic elements of the next lower logic layer corresponding to that logic element; S04, a matching step: comparing the characteristic value of the code packet under comparison with the characteristic values of the code packets in the library; wherein there are five logic layers, which are, from bottom to top, a code segment layer, a function layer, a file layer, a folder layer and a project layer; and wherein the matching step S04 proceeds from top to bottom: if the characteristic value of a layer of the code packet under comparison is the same as the characteristic value of the corresponding layer of a code packet in the library, the match is considered successful and the matching step ends; if the characteristic value of a layer of the code packet under comparison differs from the characteristic value of the corresponding layer of the code packet in the library, the match is considered unsuccessful and a lower-layer matching step is entered, in which the characteristic values of the next layer of the code packet under comparison are compared one by one with the characteristic values of the corresponding next layer of the code packet in the library.
2. The method for storing and matching features based on the trusted tree as claimed in claim 1, wherein: in step S02, the characteristic value includes a capacity value of the logic element.
3. The method for storing and matching features based on the trusted tree as claimed in claim 1, wherein: in step S02, the characteristic value includes the word frequency with which a variable appears in the logic element.
4. The method for storing and matching features based on the trusted tree as claimed in claim 3, wherein: in the word frequency statistics of a variable, the definition (value assignment) and use contexts are counted; a variable is being defined when it is on the left side of "=", "+=" or "-=", and is being used when it is on the right side of "=", "+=" or "-=".
5. The method for storing and matching features based on the trusted tree as claimed in claim 3, wherein: in the word frequency statistics of a variable, the contexts of common statements are counted, a common statement context being a conditional statement and/or a calculation statement and/or an array access and/or a constant assignment.
6. The method for storing and matching features based on the trusted tree as claimed in claim 3, wherein: in the word frequency statistics of a variable, the contexts of nested statements are counted, a nested statement being an outermost loop and/or a second-outermost loop and/or a third-level loop and/or a deeper inner loop.
CN201910857073.5A 2019-09-11 2019-09-11 Credible tree based feature storage and matching method Active CN110990017B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910857073.5A CN110990017B (en) 2019-09-11 2019-09-11 Credible tree based feature storage and matching method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910857073.5A CN110990017B (en) 2019-09-11 2019-09-11 Credible tree based feature storage and matching method

Publications (2)

Publication Number Publication Date
CN110990017A (en) 2020-04-10
CN110990017B (en) 2022-09-09

Family

ID=70081738

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910857073.5A Active CN110990017B (en) 2019-09-11 2019-09-11 Credible tree based feature storage and matching method

Country Status (1)

Country Link
CN (1) CN110990017B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1517914A (en) * 2003-01-06 2004-08-04 Searching of structural file
TW200935257A (en) * 2008-02-14 2009-08-16 Univ Nat Taiwan Science Tech Method for detecting similarity or plagiarism in computer programs
CN108345468A (en) * 2018-01-29 2018-07-31 华侨大学 Programming language code duplicate checking method based on tree and sequence similarity
CN110032500A (en) * 2019-03-01 2019-07-19 阿里巴巴集团控股有限公司 Multilayer nest data analysis method and equipment

Also Published As

Publication number Publication date
CN110990017A (en) 2020-04-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant