CN116028112A - Small program clone detection method based on complex network analysis - Google Patents

Small program clone detection method based on complex network analysis

Info

Publication number
CN116028112A
Authority
CN
China
Prior art keywords
applet
file
detected
feature
pairs
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310045745.9A
Other languages
Chinese (zh)
Inventor
范铭
鄢子强
王寅
石吉飞
刘峻峰
陶俊杰
刘烃
晋武侠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University
Priority to CN202310045745.9A
Publication of CN116028112A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Stored Programmes (AREA)

Abstract

The invention discloses an applet (mini-program) clone detection method based on complex network analysis, which aims to address the shortcomings of existing applet clone detection. The applet to be detected is first preprocessed, and coarse-grained statistical features, fine-grained layout features and code features are extracted from its source code by static analysis. The applet is assigned to a clone applet family cluster according to its statistical features; similarity vectors between the applet to be detected and each applet in that family cluster are then calculated from the layout features and code features, and a classifier judges whether the two applets are clones. The method detects cloned applets, reduces the false-positive and false-negative rates, and improves the security of users of mobile applications. It provides a new method for applet clone detection.

Description

Small program clone detection method based on complex network analysis
Technical Field
The invention relates to the field of program clone detection in mobile application programs, in particular to a WeChat applet clone detection method based on complex network analysis.
Background
With the development of internet technology, applications hosted on third-party platforms, i.e., applets, offered by vendors such as WeChat, Alipay and Baidu have become an indispensable service mode in people's lives. In 2021 there were more than 7 million applets across the whole network, the number of WeChat applet developers exceeded 3 million, and daily active users (DAU) exceeded 450 million; average daily usage grew by 32% year over year, and the number of active applets grew by 41%. At the same time, the tremendous growth of applets and their potential profits make them targets for malicious developers, i.e., plagiarists.
Plagiarism can severely infringe the rights of original developers, for example by replacing advertising links to gain economic benefit, or by misleading users in order to drive traffic to malicious applets. This behavior harms the applet ecosystem. A plagiarist may also introduce malicious code that steals users' private data or performs other malicious actions.
Currently, code clone detection techniques fall into four major categories: text-based, lexical (token-based), syntax-based and semantic-based.
Text-based detection methods are of two types: one treats the task as a string similarity problem and compares the code purely character by character; the other extracts fine-grained features, such as special characters, from the character stream. Text-based detection is fast but has low accuracy and resists obfuscation poorly.
Lexical detection methods can be divided into Token-based and API-based methods. A Token-based method processes the identifiers in the code and converts the source code into tokens for comparison, which effectively handles the replacement of variable names and function names. Because APIs are functions that the system or framework provides to developers in advance, the most basic API calls change little no matter how a plagiarist obfuscates the control flow and data flow of the code, so an API-based method extracts API call counts as features. Lexical detection is slightly more accurate than text-based detection, but it is prone to misjudgment because it lacks syntactic and semantic analysis.
Syntax-based detection assumes that similar code segments have similar syntactic structures. The method uses a parser to parse the source code into an abstract syntax tree (AST) and measures similarity either by comparing tree structures directly or by extracting features from them; the source code segments corresponding to similar subtrees are clone code. Syntax-based detection has high accuracy but also high computational overhead.
Semantic-based detection derives a control flow graph and a data flow graph from the source code and combines them into a program dependence graph (PDG); some methods rely on the control flow graph or the data flow graph alone. The graphs are compared using a subgraph isomorphism algorithm or an algorithm that extracts graph features, and the source code corresponding to similar graphs is clone code. Semantic-based detection resists obfuscation well and is highly accurate, but its computational cost is likewise large.
Disclosure of Invention
The invention provides an applet clone detection method based on complex network analysis, which aims to solve the problem that the single-technique detection methods used in existing applet clone detection cannot achieve both detection speed and detection precision. The applet to be detected is first preprocessed. Static analysis then extracts from the source code the coarse-grained statistical features SF, the fine-grained layout features LF, and the code-dimension features, namely the custom function features CFF, the file dependency features FDF and the double-layer dependency features TLDF. The applet is assigned to a clone applet family cluster according to its statistical features SF. Similarity vectors between the applet to be detected and each applet in that family cluster are calculated from the layout features LF, the custom function features CFF, the file dependency features FDF and the double-layer dependency features TLDF, and a classifier judges whether the two applets are clones. The method detects cloned applets, increases detection speed, reduces the false-positive and false-negative rates, and improves the security of users of mobile applications. It fills a gap in applet clone detection methods.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
step S101: preprocessing the applet S to be detected, including decompiling, deobfuscating and extracting the main package;
step S102: from the preprocessed source code of the applet S to be detected obtained in step S101, extracting the statistical features SF, layout features LF, custom function features CFF, file dependency features FDF and double-layer dependency features TLDF by analyzing file types and the abstract syntax trees of the source code;
step S103: according to the statistical features SF of the applet S to be detected obtained in step S102, calculating the distance between the applet S and the center of each clone applet family cluster, and assigning the applet S to the closest clone applet family cluster;
step S104: according to the clone applet family cluster obtained in step S103, pairing the applet S to be detected with each applet in that family cluster to form applet pairs, wherein the similarity vector of an applet pair consists of the similarities of the layout features LF, custom function features CFF, file dependency features FDF and double-layer dependency features TLDF;
step S105: constructing a classifier by a machine learning method from the similarity vectors of applet pairs that have been labeled in advance as cloned or non-cloned;
step S106: inputting into the classifier obtained in step S105 the similarity vectors of the applet pairs formed in step S104 by the applet S to be detected and each applet in the clone applet family cluster, and classifying each pair as a cloned applet pair or a non-cloned applet pair, thereby finding the applets that are clones of the applet S to be detected.
Further, the step S101 specifically includes:
step S201: decompiling the packaged file of the applet S to be detected according to the applet packaging rules to obtain the source code of the applet S;
step S202: extracting the obfuscation pattern from the source code of the applet S obtained in step S201 and matching it against known obfuscation methods; if the pattern matches a known obfuscation method, deobfuscating the source code with the corresponding deobfuscation method;
step S203: collecting and organizing the third-party libraries commonly used by applets into a set T. For each file f in the deobfuscated source code of the applet S obtained in step S202, the file name and file size attributes are matched against the known third-party library files in T. If a file f in the source code of S matches, f is treated as a third-party library file and filtered out.
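The filtering in step S203 can be sketched as a lookup on (file name, file size) pairs. This is a minimal illustration only: the `FileInfo` type, the function name and the sample sizes are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileInfo:
    name: str  # base file name, e.g. "lodash.min.js"
    size: int  # file size in bytes


def filter_third_party(files, known_library_files):
    """Keep only the files of S that match no (name, size) pair in T."""
    known = {(f.name, f.size) for f in known_library_files}
    return [f for f in files if (f.name, f.size) not in known]


# Hypothetical library set T and applet file list (sizes are made up).
T = [FileInfo("lodash.min.js", 73015), FileInfo("moment.min.js", 58890)]
S_files = [
    FileInfo("index.js", 2048),
    FileInfo("lodash.min.js", 73015),  # same name and size: filtered out
    FileInfo("moment.min.js", 100),    # same name, different size: kept
]
kept = filter_third_party(S_files, T)
```

Matching on both attributes means a developer file that merely shares a name with a library file is not discarded.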
Further, in step S102, the statistical feature SF is a 241-dimensional vector: the first 3 dimensions are, respectively, the number of pages of the applet, the average number of static resource files, and the number of developer-defined (custom) functions; dimensions 4 to 137 are the call counts of the source APIs, and dimensions 138 to 241 are the call counts of the sink APIs.
Further, the step S102 of extracting the statistical feature SF specifically includes:
step S301: from the preprocessed source code of the applet S to be detected obtained in step S101, counting the number of pages of the applet using the routing components, the routing functions, and the pages and tabbar fields of app.json;
step S302: from the preprocessed source code obtained in step S101, counting the average number of static resource files per static resource directory by comparing file types, wherein a static resource directory is a directory that contains static resource files, and a static resource file is a file that is not a json, js, html-like or css file;
step S303: from the preprocessed source code obtained in step S101, counting the number of developer-defined functions, wherein a developer-defined function is a function defined neither by the system nor by a third-party library;
step S304: from the preprocessed source code obtained in step S101, counting the call counts of the source APIs and sink APIs, wherein a source API is an API that returns sensitive data and a sink API is an API that takes sensitive data as a parameter;
step S305: forming a feature vector from the data obtained in steps S301, S302, S303 and S304 as the statistical feature SF of the applet S to be detected.
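The layout of the 241-dimensional vector assembled in step S305 follows from the dimension ranges stated above: 3 counts, then 134 source API counts (dimensions 4 to 137), then 104 sink API counts (dimensions 138 to 241). A minimal sketch, with a hypothetical function name and made-up input values:

```python
N_SOURCE = 137 - 4 + 1   # 134 source API dimensions
N_SINK = 241 - 138 + 1   # 104 sink API dimensions


def build_sf(page_count, avg_static_files, custom_function_count,
             source_calls, sink_calls):
    """Concatenate the step S301-S304 results into one SF vector."""
    if len(source_calls) != N_SOURCE or len(sink_calls) != N_SINK:
        raise ValueError("unexpected number of API call counters")
    head = [float(page_count), float(avg_static_files),
            float(custom_function_count)]
    return head + [float(c) for c in source_calls] + [float(c) for c in sink_calls]


# Made-up values for illustration only.
sf = build_sf(12, 3.5, 87, [0] * N_SOURCE, [0] * N_SINK)
```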
Further, in step S102, the layout feature LF is a hash sequence, denoted $LF(S) = \langle fh_1, \ldots, fh_N \rangle$, where $fh_i$ is the i-th hash value.
Further, the step S102 of extracting the layout feature LF specifically includes:
step S401: parsing the html-like files used for page layout in the preprocessed source code of the applet S to be detected obtained in step S101, and extracting the component sequence;
step S402: splitting the component sequence obtained in step S401 into chunks using a weak hash, for example Adler-32, a chunk boundary being placed where the weak hash value of the current chunk modulo 64 equals 63;
step S403: calculating the hash value of each chunk of the component sequence obtained in step S402 using a strong hash, for example FNV-64, and taking the resulting hash sequence as the layout feature LF.
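Steps S402 and S403 amount to content-defined chunking followed by per-chunk hashing. The sketch below is one possible reading under stated assumptions: the patent does not say how components are serialized or which FNV variant is used, so the newline separator, the running Adler-32 reset at each boundary, and the choice of FNV-1a (64-bit) are all assumptions.

```python
import zlib

FNV_OFFSET = 0xCBF29CE484222325  # FNV 64-bit offset basis
FNV_PRIME = 0x100000001B3        # FNV 64-bit prime


def fnv1a_64(data: bytes) -> int:
    """FNV-1a 64-bit hash (assumed variant of the 'FNV64' strong hash)."""
    h = FNV_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h


def fuzzy_hash_sequence(tokens):
    """Chunk a token sequence with Adler-32, then hash each chunk."""
    hashes, chunk, weak = [], [], 1  # 1 is the Adler-32 start value
    for tok in tokens:
        data = tok.encode("utf-8") + b"\n"
        chunk.append(data)
        weak = zlib.adler32(data, weak)  # running checksum of the chunk
        if weak % 64 == 63:              # chunk boundary condition
            hashes.append(fnv1a_64(b"".join(chunk)))
            chunk, weak = [], 1
    if chunk:                            # trailing partial chunk
        hashes.append(fnv1a_64(b"".join(chunk)))
    return hashes


layout_feature = fuzzy_hash_sequence(
    ["view", "text", "image", "button", "view", "scroll-view", "text"])
```

Because boundaries depend only on the content of the current chunk, a local edit changes the hashes of the chunks around it, while chunks that end before the edit keep their hashes.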
Further, in step S102, the custom function feature CFF is a hash sequence, denoted $CFF(S) = \langle fh_1, \ldots, fh_N \rangle$, where $fh_i$ is the i-th hash value.
Further, the step S102 of extracting the custom function feature CFF specifically includes:
step S501: parsing the js files that implement the functional logic in the preprocessed source code of the applet S to be detected obtained in step S101 into abstract syntax trees (ASTs);
step S502: extracting the type sequence of each developer-defined function from the AST of the js file obtained in step S501, wherein the type sequence is the sequence formed by traversing the AST in preorder and taking the type attribute value of each node in turn;
step S503: splitting the type sequence obtained in step S502 into chunks using a weak hash, a chunk boundary being placed where the weak hash value of the current chunk modulo 64 equals 63;
step S504: calculating the hash value of each chunk of the type sequence obtained in step S503 using a strong hash, and taking the resulting hash sequence as the custom function feature CFF.
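The type sequence of step S502 can be illustrated on an AST held as nested dictionaries. The sketch assumes an ESTree-style JSON shape (each node is a dict with a `type` key), which is what common JavaScript parsers emit; the patent does not name a parser, and the sample tree for `function add(a, b) { return a + b; }` is written by hand.

```python
def type_sequence(node):
    """Preorder traversal that collects the `type` of every AST node."""
    seq = []

    def visit(n):
        if isinstance(n, dict):
            if "type" in n:
                seq.append(n["type"])   # visit the node itself first
            for value in n.values():    # then its children, in key order
                visit(value)
        elif isinstance(n, list):
            for item in n:
                visit(item)

    visit(node)
    return seq


# Hand-written ESTree-style AST of: function add(a, b) { return a + b; }
func_ast = {
    "type": "FunctionDeclaration",
    "id": {"type": "Identifier", "name": "add"},
    "params": [{"type": "Identifier", "name": "a"},
               {"type": "Identifier", "name": "b"}],
    "body": {"type": "BlockStatement", "body": [
        {"type": "ReturnStatement", "argument": {
            "type": "BinaryExpression", "operator": "+",
            "left": {"type": "Identifier", "name": "a"},
            "right": {"type": "Identifier", "name": "b"}}},
    ]},
}
types = type_sequence(func_ast)
```

Only node types enter the sequence, so renaming `add`, `a` or `b` leaves it unchanged, which is why the feature tolerates identifier renaming.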
Further, in step S102, the file dependency feature FDF is a file dependency graph, denoted $FDF(S) = (FileV(S), FileE(S))$, where $FileV(S)$ is the node set of the file dependency graph of the applet S to be detected, the node attribute being the file name, and $FileE(S)$ is the edge set of the file dependency graph of the applet S, with $FileE(S) = \{\langle v_i, v_k \rangle \mid v_i, v_k \in FileV(S)\}$.
Further, the step S102 of extracting the file dependent feature FDF specifically includes:
step S601: taking all js files under the source code directory of the applet S to be detected as the node set of the file dependency graph;
step S602: traversing the js files, parsing each into an AST, and extracting the import (dependency) statements from the AST; if an imported file is in the node set of the file dependency graph, the pair (importing file, imported file) is added to the edge set of the file dependency graph.
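Steps S601 and S602 can be sketched as follows. Note the substitution: the patent extracts import statements from the AST, whereas this illustration uses a regular expression over the source text as a simplified stand-in, and the path resolution rule (relative to the importing file, `.js` appended when missing) is an assumption.

```python
import posixpath
import re

# Simplified stand-in for AST-based extraction of require()/import targets.
IMPORT_RE = re.compile(r"""(?:require\(\s*|from\s+|import\s+)['"]([^'"]+)['"]""")


def build_file_dependency_graph(sources):
    """sources maps a js file path to its text; returns (nodes, edges)."""
    nodes = set(sources)
    edges = set()
    for path, text in sources.items():
        for target in IMPORT_RE.findall(text):
            resolved = posixpath.normpath(
                posixpath.join(posixpath.dirname(path), target))
            if not resolved.endswith(".js"):
                resolved += ".js"
            # Only files inside the node set produce an edge.
            if resolved in nodes and resolved != path:
                edges.add((path, resolved))
    return nodes, edges


# Hypothetical applet sources.
sources = {
    "app.js": 'const util = require("./utils/util");\nimport api from "./api.js";',
    "utils/util.js": "module.exports = {};",
    "api.js": 'import { fmt } from "./utils/util.js";\nimport _ from "lodash";',
}
nodes, edges = build_file_dependency_graph(sources)
```

The `lodash` import yields no edge because it does not resolve to a file in the node set.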
Further, in step S102, the double-layer dependency feature TLDF is a two-layer graph structure in which the upper-layer graph is a file dependency graph and the lower-layer graphs are function call graphs. It is denoted $TLDF(S) = (FileV'(S), FileE'(S))$, where $FileV'(S)$ is the node set of the file dependency graph of the applet S to be detected, the node attribute being the function call graph of the file, and $FileE'(S)$ is the edge set of the file dependency graph of the applet S, with $FileE'(S) = \{\langle v_i, v_k \rangle \mid v_i, v_k \in FileV'(S)\}$.
Further, the step S102 of extracting the double-layer dependency feature TLDF specifically includes:
step S701: taking all js files under the source code directory of the applet S to be detected as the node set of the file dependency graph;
step S702: traversing the js files, parsing each into an AST, extracting the function call relations within the file from the AST to construct a function call graph, and taking the function call graph as the attribute of that js file;
step S703: extracting the import statements from the AST obtained in step S702; if an imported file is in the node set of the file dependency graph, the pair (importing file, imported file) is added to the edge set of the file dependency graph.
Further, in step S103, the center of each clone applet family cluster is computed in advance as the mean of the statistical features SF of the applets in that family cluster; the distance between the statistical feature SF of the applet S to be detected and the center of each family cluster is then calculated, and the applet S is assigned to the closest family cluster.
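The assignment in step S103 is a nearest-centroid rule. The sketch below assumes Euclidean distance, which the patent does not specify, and uses 3-dimensional toy vectors and made-up family names in place of the 241-dimensional SF.

```python
import math


def centroid(vectors):
    """Component-wise mean of the SF vectors of one family cluster."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def assign_cluster(sf, centers):
    """Return the name of the family cluster whose center is closest to sf."""
    return min(centers, key=lambda name: math.dist(sf, centers[name]))


# Toy data; real SF vectors have 241 dimensions.
families = {
    "shopping": [[10, 2, 50], [12, 4, 70]],
    "tools": [[1, 0, 5], [3, 2, 7]],
}
centers = {name: centroid(vs) for name, vs in families.items()}
assigned = assign_cluster([9, 3, 55], centers)
```

Only the applets of the assigned family cluster are compared pairwise afterwards, which is where the speed-up over all-pairs comparison comes from.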
Further, the similarity of the layout features LF of an applet pair in step S104 is calculated as follows: the layout features LF of both applets are traversed, the hash values are treated as character strings, and the Levenshtein ratio is calculated to compare two hash values; if the Levenshtein ratio is greater than a certain threshold, the count of similar hash values is incremented by one. The similarity is the count of similar hash values divided by the larger of the lengths of the two applets' layout features LF. The custom function feature CFF of an applet is also a hash sequence and uses the same similarity calculation method as the layout feature LF.
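A possible reading of this calculation is sketched below. Several details are assumptions because the text leaves them open: hash values are rendered as 16-digit hexadecimal strings, the ratio is normalized as 1 minus the edit distance over the longer length, each hash of one applet is matched greedily to at most one hash of the other, and the threshold value 0.8 is arbitrary.

```python
def levenshtein(a, b):
    """Classic edit distance with unit insert/delete/substitute costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def levenshtein_ratio(a, b):
    """Assumed normalization: 1 - distance / longer length."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def hash_sequence_similarity(seq_a, seq_b, threshold=0.8):
    """Count of similar hashes divided by the longer sequence length."""
    if not seq_a or not seq_b:
        return 0.0
    unmatched = [format(h, "016x") for h in seq_b]
    similar = 0
    for h in seq_a:
        s = format(h, "016x")
        for k, t in enumerate(unmatched):
            if levenshtein_ratio(s, t) > threshold:
                similar += 1
                del unmatched[k]  # each hash of seq_b is matched once
                break
    return similar / max(len(seq_a), len(seq_b))


# Made-up 64-bit hash values.
A, B, C = 0x1111111111111111, 0x2222222222222222, 0x3333333333333333
lf_similarity = hash_sequence_similarity([A, B], [A, B, C])
```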
Further, the similarity of the file dependency features FDF of an applet pair in step S104 is calculated by iterating the Weisfeiler-Lehman algorithm on the file dependency graphs of both applets and computing the similarity from the number of labels that the two output label sets have in common.
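The Weisfeiler-Lehman comparison can be sketched as below. The patent gives no parameters, so the following are assumptions: node degree is used as the initial label (file names differ between applets and would never match), edges are treated as undirected, two refinement rounds are run, and the score is the number of shared labels divided by the larger label count.

```python
import hashlib
from collections import Counter


def wl_labels(nodes, edges, iterations=2):
    """Multiset of node labels collected over all WL refinement rounds."""
    neighbors = {v: [] for v in nodes}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    labels = {v: str(len(neighbors[v])) for v in nodes}  # degree as label
    collected = Counter(labels.values())
    for _ in range(iterations):
        new_labels = {}
        for v in nodes:
            signature = labels[v] + "|" + ",".join(
                sorted(labels[n] for n in neighbors[v]))
            new_labels[v] = hashlib.sha1(signature.encode()).hexdigest()
        labels = new_labels
        collected.update(labels.values())
    return collected


def wl_similarity(g1, g2, iterations=2):
    """Shared labels divided by the larger label count (assumed score)."""
    c1 = wl_labels(g1[0], g1[1], iterations)
    c2 = wl_labels(g2[0], g2[1], iterations)
    shared = sum((c1 & c2).values())
    total = max(sum(c1.values()), sum(c2.values()))
    return shared / total if total else 0.0


# Two star-shaped graphs with different file names, and a triangle.
star_a = ({"a.js", "b.js", "c.js"}, {("a.js", "b.js"), ("a.js", "c.js")})
star_b = ({"x.js", "y.js", "z.js"}, {("y.js", "x.js"), ("y.js", "z.js")})
triangle = ({"p", "q", "r"}, {("p", "q"), ("q", "r"), ("r", "p")})
same_shape = wl_similarity(star_a, star_b)
```

Because labels depend only on structure, renaming or moving files does not change the score as long as the dependency structure is preserved.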
Further, for the double-layer dependency feature TLDF of an applet pair in step S104, because the node attribute is a function call graph, the similarity is calculated as follows: the file dependency graphs of both applets are traversed, and the Weisfeiler-Lehman algorithm is used to calculate the similarity of the nodes' function call graphs; if the similarity is greater than a certain threshold, the node is placed in the anchor set. The TLDF similarity is the number of nodes in the anchor set divided by the larger of the node counts of the two file dependency graphs.
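The anchor-set ratio can be sketched with the call-graph comparison passed in as a function. The greedy one-to-one matching of files and the threshold value are assumptions, and the Jaccard function used in the example is only a placeholder for the Weisfeiler-Lehman comparison named in the text.

```python
def tldf_similarity(call_graphs_a, call_graphs_b, graph_similarity,
                    threshold=0.8):
    """Anchor count divided by the larger number of file nodes.

    call_graphs_* map a file name to that file's function call graph;
    graph_similarity is any function returning a score in [0, 1].
    """
    if not call_graphs_a or not call_graphs_b:
        return 0.0
    anchors = set()
    unmatched = dict(call_graphs_b)
    for file_a, cg_a in call_graphs_a.items():
        for file_b, cg_b in list(unmatched.items()):
            if graph_similarity(cg_a, cg_b) > threshold:
                anchors.add((file_a, file_b))  # file pair becomes an anchor
                del unmatched[file_b]
                break
    return len(anchors) / max(len(call_graphs_a), len(call_graphs_b))


def jaccard(edges_a, edges_b):
    """Placeholder similarity on edge sets (not the WL algorithm)."""
    union = edges_a | edges_b
    return len(edges_a & edges_b) / len(union) if union else 1.0


# Hypothetical call graphs given as sets of (caller, callee) edges.
applet_a = {"a.js": {("f", "g")}, "b.js": {("h", "i"), ("i", "j")}}
applet_b = {"x.js": {("f", "g")}, "y.js": {("p", "q")}, "z.js": {("r", "s")}}
tldf = tldf_similarity(applet_a, applet_b, jaccard)
```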
Further, the similarity vector of an applet pair in step S105 consists of the layout feature LF similarity, the custom function feature CFF similarity, the file dependency feature FDF similarity and the double-layer dependency feature TLDF similarity of the two applets. If manual review finds that two applets are similar in function and layout, and the plagiarism is corroborated by official forums or news reports, a clone relationship is considered to exist and the two applets are labeled as a cloned applet pair; otherwise they are labeled as a non-cloned applet pair. The classifier takes the similarity vector of an applet pair as input and outputs 0 or 1, where 0 indicates a non-cloned applet pair and 1 indicates a cloned applet pair.
A further improvement of the invention is that the layout feature extraction in step S102 parses the WXML files to extract the component sequence and then converts the component sequence into a fuzzy hash sequence; that is, the sequence is first split into chunks with a weak hash, the hash value of each chunk is then calculated with a strong hash, and the chunk hash values are concatenated.
A further improvement of the invention is that the double-layer dependency feature in step S102 uses the function call graph as the attribute of each file dependency graph node; the nodes of a function call graph are the functions of the file, and its edges are the function call relations within the file.
A further improvement of the invention is that step S103 assigns applets to different clone applet family clusters according to the statistical features of the applet to be detected, and step S106 uses a classifier to judge whether applets are clones according to the similarity vectors of the applet pairs.
Compared with the prior art, the invention has the following advantages:
1) The invention first clusters on coarse-grained features and then calculates similarity vectors of fine-grained features for analysis, which is more efficient than calculating similarity vectors for all pairs directly;
2) The layout features provided by the invention combine a weak hash with a strong hash, and the code features use the file dependency graph and the function call graph, so code obfuscation can be effectively resisted and robustness and accuracy are improved;
3) The invention uses a machine learning classifier to judge applet clones, which is more accurate than manually set thresholds and generalizes better.
Drawings
FIG. 1 is an overall flow chart of the method for detecting small-program clones based on complex network analysis of the present invention;
FIG. 2 is a flow chart of a method for extracting layout features from applet source codes to be detected by static analysis in the present invention;
FIG. 3 is a flow chart of a method for extracting custom function features from applet source codes to be detected by static analysis in accordance with the present invention.
Detailed Description
Specific embodiments of the small-program clone detection method based on complex network analysis of the present invention are described in detail below with reference to the accompanying drawings.
FIG. 1 is an overall flow chart of the method for detecting small-program clones based on complex network analysis of the present invention;
the invention discloses a small program clone detection method based on complex network analysis, which comprises the following steps:
step S101: preprocessing the applet S to be detected, including decompiling, deobfuscating and extracting the main package.
Specifically, the method can be divided into the following steps:
step S201: decompiling the packaged file of the applet S to be detected according to the applet packaging rules to obtain the source code of the applet S;
step S202: extracting the obfuscation pattern from the source code of the applet S obtained in step S201 and matching it against known obfuscation methods; if the pattern matches a known obfuscation method, deobfuscating the source code with the corresponding deobfuscation method;
step S203: collecting and organizing the third-party libraries commonly used by applets into a set T. For each file f in the deobfuscated source code of the applet S obtained in step S202, the file name and file size attributes are matched against the known third-party library files in T. If a file f in the source code of S matches, f is treated as a third-party library file and filtered out.
Step S102: from the preprocessed source code of the applet S to be detected obtained in step S101, extracting the statistical features SF, layout features LF, custom function features CFF, file dependency features FDF and double-layer dependency features TLDF by analyzing file types and the abstract syntax trees of the source code;
specifically, the statistical feature SF is a 241-dimensional vector: the first 3 dimensions are, respectively, the number of pages of the applet, the average number of static resource files, and the number of developer-defined functions; dimensions 4 to 137 are the call counts of the source APIs, and dimensions 138 to 241 are the call counts of the sink APIs. The statistical feature extraction can be divided into the following steps:
step S301: from the preprocessed source code of the applet S to be detected obtained in step S101, counting the number of pages of the applet using the routing components, the routing functions, and the pages and tabbar fields of app.json. A routing component is a component with a page jump function, such as <navigator>, and a routing function is an API function with a page jump function, such as navigator;
step S302: from the preprocessed source code obtained in step S101, counting the average number of static resource files per static resource directory by comparing file types, wherein a static resource directory is a directory that contains static resource files, and a static resource file is a file that is not a json, js, html-like or css file;
step S303: from the preprocessed source code obtained in step S101, counting the number of developer-defined functions, wherein a developer-defined function is a function defined neither by the system nor by a third-party library;
step S304: from the preprocessed source code obtained in step S101, counting the call counts of the source APIs and sink APIs, wherein a source API is an API that returns sensitive data and a sink API is an API that takes sensitive data as a parameter;
step S305: forming a feature vector from the data obtained in steps S301, S302, S303 and S304 as the statistical feature SF of the applet S to be detected.
Still within step S102, the layout feature LF is extracted by static analysis from the preprocessed source code of the applet S to be detected obtained in step S101;
the layout feature LF is a hash sequence, denoted $LF(S) = \langle fh_1, \ldots, fh_N \rangle$, where $fh_i$ is the i-th hash value.
FIG. 2 is a flow chart of a method for extracting layout features from applet source codes to be detected by static analysis in accordance with the present invention.
Specifically, the method can be divided into the following steps:
step S401: parsing the html-like files used for page layout in the preprocessed source code of the applet S to be detected obtained in step S101, and extracting the component sequence;
step S402: splitting the component sequence obtained in step S401 into chunks using a weak hash, the weak hash being Adler-32, a chunk boundary being placed where the weak hash value of the current chunk modulo 64 equals 63;
step S403: calculating the hash value of each chunk of the component sequence obtained in step S402 using a strong hash, the strong hash being FNV-64, and taking the resulting hash sequence as the layout feature LF.
The custom function feature CFF is a hash sequence, denoted $CFF(S) = \langle fh_1, \ldots, fh_N \rangle$, where $fh_i$ is the i-th hash value.
FIG. 3 is a flow chart of a method for extracting custom function features from applet source codes to be detected by static analysis in accordance with the present invention.
Specifically, the method can be divided into the following steps:
step S501: parsing the js files that implement the functional logic in the preprocessed source code of the applet S to be detected obtained in step S101 into abstract syntax trees (ASTs);
step S502: extracting the type sequence of each developer-defined function from the AST of the js file obtained in step S501, wherein the type sequence is the sequence formed by traversing the AST in preorder and taking the type attribute value of each node in turn;
step S503: splitting the type sequence obtained in step S502 into chunks using a weak hash, the weak hash being Adler-32, a chunk boundary being placed where the weak hash value of the current chunk modulo 64 equals 63;
step S504: calculating the hash value of each chunk of the type sequence obtained in step S503 using a strong hash, the strong hash being FNV-64, and taking the resulting hash sequence as the custom function feature CFF.
The file dependency feature FDF is a file dependency graph, denoted $FDF(S) = (FileV(S), FileE(S))$, where $FileV(S)$ is the node set of the file dependency graph of the applet S to be detected, the node attribute being the file name, and $FileE(S)$ is the edge set of the file dependency graph of the applet S, with $FileE(S) = \{\langle v_i, v_k \rangle \mid v_i, v_k \in FileV(S)\}$. Specifically, the extraction can be divided into the following steps:
step S601: taking all js files under the source code directory of the applet S to be detected as the node set of the file dependency graph;
step S602: traversing the js files, parsing each into an AST, and extracting the import (dependency) statements from the AST; if an imported file is in the node set of the file dependency graph, the pair (importing file, imported file) is added to the edge set of the file dependency graph.
The double-layer dependency feature TLDF is a two-layer graph structure in which the upper-layer graph is a file dependency graph and the lower-layer graphs are function call graphs. It is denoted $TLDF(S) = (FileV'(S), FileE'(S))$, where $FileV'(S)$ is the node set of the file dependency graph of the applet S to be detected, the node attribute being the function call graph of the file, and $FileE'(S)$ is the edge set of the file dependency graph of the applet S, with $FileE'(S) = \{\langle v_i, v_k \rangle \mid v_i, v_k \in FileV'(S)\}$. Specifically, the extraction can be divided into the following steps:
step S701: taking all js files under the source code directory of the applet S to be detected as the node set of the file dependency graph;
step S702: traversing the js files, parsing each into an AST, extracting the function call relations within the file from the AST to construct a function call graph, and taking the function call graph as the attribute of that js file;
step S703: extracting the import statements from the AST obtained in step S702; if an imported file is in the node set of the file dependency graph, the pair (importing file, imported file) is added to the edge set of the file dependency graph.
Step S103: according to the statistical features SF of the small program S to be detected obtained in the step S102, calculating the distance between the small program S to be detected and the center of the family cluster of the small program, and dividing the small program S to be detected into the family cluster of the small program which is the closest to the family cluster of the small program;
specifically, in step S105, the center of the family cluster of the applet is solved in advance, that is, the statistical features SF of the applet in the family cluster of the applet are averaged, and the distance between the statistical features SF of the applet S to be detected and the center of each family cluster of the applet is calculated and divided into the closest family cluster of the applet.
Step S104: according to the clone applet family cluster obtained in step S103, forming applet pairs from the applet S to be detected and each applet in that clone applet family cluster, and calculating the similarity of the layout features LF, the custom function features CFF, the file dependency features FDF and the double-layer dependency features TLDF of each applet pair to obtain the similarity vector of the applet pair;
specifically, in step S104, the similarity of the layout features LF of an applet pair is calculated by traversing the layout features LF of both applets, treating the hash values as character strings, and calculating the Levenshtein ratio to compare two hash values; if the Levenshtein ratio is greater than a certain threshold, the number of similar hash values is incremented by one. The similarity is the ratio of the number of similar hash values to the maximum of the lengths of the two applets' layout features LF. The custom function feature CFF of an applet is also a hash sequence, and its similarity is calculated in the same way as for the layout feature LF.
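A minimal sketch of this hash-sequence similarity follows. Several details are assumptions, since the text leaves them open: the Levenshtein ratio is normalised as 1 - distance / max(length), each hash of the second sequence may be matched at most once, and the default threshold of 0.8 is an arbitrary placeholder.

```python
def levenshtein(a, b):
    """Classic edit distance (insert, delete, substitute, each cost 1)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def levenshtein_ratio(a, b):
    """Assumed normalisation: 1.0 for identical strings, 0.0 for disjoint."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def hash_sequence_similarity(seq_a, seq_b, threshold=0.8):
    """Number of similar hash values divided by the longer sequence length."""
    if not seq_a or not seq_b:
        return 0.0
    remaining = list(seq_b)  # hashes of seq_b that are not matched yet
    similar = 0
    for hash_a in seq_a:
        for index, hash_b in enumerate(remaining):
            if levenshtein_ratio(hash_a, hash_b) > threshold:
                similar += 1
                del remaining[index]  # one-to-one matching (assumption)
                break
    return similar / max(len(seq_a), len(seq_b))
```

Matching each hash at most once keeps the result within [0, 1]; the same function would serve for both the LF and the CFF sequences.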
The similarity of the file dependency features FDF of an applet pair is calculated by iterating over the file dependency graphs of both applets with the Weisfeiler-Lehman algorithm and computing the similarity from the number of similar labels in the output label sets.
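The Weisfeiler-Lehman comparison can be sketched as follows. The text does not fix the initial node labels, the number of iterations, or how the label sets are turned into a score, so this sketch makes three assumptions: initial labels are supplied by the caller, edges are treated as undirected, and the score is a multiset Jaccard index over all labels produced. The function names are hypothetical.

```python
from collections import Counter


def wl_labels(nodes, edges, iterations=2):
    """Run Weisfeiler-Lehman relabelling and collect every label produced.

    `nodes` maps a node to its initial string label; `edges` is an iterable
    of (u, v) pairs whose endpoints must both be keys of `nodes`.
    """
    neighbours = {node: [] for node in nodes}
    for u, v in edges:
        neighbours[u].append(v)
        neighbours[v].append(u)  # undirected view of the graph (assumption)
    labels = dict(nodes)
    collected = Counter(labels.values())
    for _ in range(iterations):
        # New label = own label plus the sorted labels of all neighbours.
        labels = {
            node: labels[node] + "|" + ",".join(
                sorted(labels[other] for other in neighbours[node]))
            for node in nodes
        }
        collected.update(labels.values())
    return collected


def label_similarity(labels_a, labels_b):
    """Multiset Jaccard index of two WL label collections (assumption)."""
    union = sum((labels_a | labels_b).values())
    return sum((labels_a & labels_b).values()) / union if union else 0.0
```

Giving every file the same initial label makes the comparison purely structural, so renamed files in a cloned applet still produce matching labels; using file names as initial labels would be stricter.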
The similarity of the double-layer dependency features TLDF of an applet pair is calculated by traversing the file dependency graphs of both applets and calculating the similarity of the function call graphs of nodes with the Weisfeiler-Lehman algorithm; if the similarity is greater than a certain threshold, the node is put into an anchor set. The similarity is the ratio of the number of nodes in the anchor set to the maximum of the node counts of the two applets' file dependency graphs.
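The anchor-set computation can be sketched independently of the graph similarity measure. In the sketch below the call-graph comparison is passed in as a function; `edge_jaccard` is only a simple stand-in used for illustration and is not the Weisfeiler-Lehman comparison named in the text. The one-to-one matching of files and the default threshold are likewise assumptions.

```python
def edge_jaccard(graph_a, graph_b):
    """Stand-in call-graph similarity: Jaccard index of the edge sets."""
    union = graph_a | graph_b
    return len(graph_a & graph_b) / len(union) if union else 1.0


def tldf_similarity(call_graphs_a, call_graphs_b, graph_sim, threshold=0.8):
    """Anchor-set size divided by the larger file dependency graph size.

    Each argument maps a js file (a node of the file dependency graph) to
    its function call graph, given as a set of (caller, callee) edges.
    """
    if not call_graphs_a or not call_graphs_b:
        return 0.0
    unmatched = dict(call_graphs_b)
    anchors = []
    for file_a, graph_a in call_graphs_a.items():
        for file_b, graph_b in list(unmatched.items()):
            if graph_sim(graph_a, graph_b) > threshold:
                anchors.append((file_a, file_b))
                del unmatched[file_b]  # a file anchors at most once (assumption)
                break
    return len(anchors) / max(len(call_graphs_a), len(call_graphs_b))
```

A call such as `tldf_similarity(a, b, edge_jaccard)` yields a value in [0, 1]; replacing `edge_jaccard` with a Weisfeiler-Lehman based measure would bring the sketch closer to the described method.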
Step S105: constructing a classifier by a machine learning method from similarity vectors of applet pairs that have been labeled in advance as clone or non-clone pairs;
specifically, the similarity vector of an applet pair in step S104 consists of the layout feature LF similarity, the custom function feature CFF similarity, the file dependency feature FDF similarity and the double-layer dependency feature TLDF similarity of the two applets. If manual review finds that two applets are similar in function and layout, and plagiarism between the two has been reported on official forums or in news reports, a clone relationship is considered to exist and the pair is labeled a clone applet pair; otherwise it is labeled a non-clone applet pair. The classifier takes the similarity vector of an applet pair as input and outputs 0 or 1, where 0 indicates a non-clone applet pair and 1 indicates a clone applet pair. The machine learning method may be a random forest, an SVM, or the like.
Step S106: inputting into the classifier obtained in step S105 the similarity vectors of the applet pairs formed in step S104 from the applet S to be detected and each applet in the clone applet family cluster, and classifying the applet pairs into clone applet pairs and non-clone applet pairs, thereby finding the applets that have a clone relationship with the applet S to be detected.

Claims (10)

1. The small program clone detection method based on complex network analysis is characterized by comprising the following steps:
step S101: preprocessing the applet S to be detected, including decompilation, deobfuscation and main-package extraction;
step S102: extracting the statistical feature SF, layout feature LF, custom function feature CFF, file dependency feature FDF and double-layer dependency feature TLDF from the preprocessed source code of the applet S to be detected obtained in step S101, by analyzing file types and the abstract syntax trees of the source code;
step S103: according to the statistical feature SF of the applet S to be detected obtained in step S102, calculating the distance between the applet S to be detected and the center of each applet family cluster, and assigning the applet S to be detected to the closest applet family cluster;
step S104: according to the clone applet family cluster obtained in step S103, forming applet pairs from the applet S to be detected and each applet in the clone applet family cluster, wherein the similarity vector of each applet pair consists of the similarities of the layout features LF, custom function features CFF, file dependency features FDF and double-layer dependency features TLDF;
step S105: constructing a classifier by a machine learning method from similarity vectors of applet pairs that have been labeled in advance as clone or non-clone pairs;
step S106: inputting into the classifier obtained in step S105 the similarity vectors of the applet pairs formed in step S104 from the applet S to be detected and each applet in the clone applet family cluster, and classifying the applet pairs into clone applet pairs and non-clone applet pairs, thereby finding the applets that have a clone relationship with the applet S to be detected.
2. The method according to claim 1, wherein the step S101 is specifically:
step S201: decompiling the package file of the applet S to be detected according to the applet packaging rules to obtain the source code of the applet S to be detected;
step S202: extracting the obfuscation pattern from the source code of the applet S to be detected obtained in step S201 and matching it against existing obfuscation methods, and, if a match is found, deobfuscating the source code with the corresponding deobfuscation method;
step S203: collecting and organizing a set T of third-party libraries commonly used by applets, matching the file name and file size attributes of each file f in the deobfuscated source code of the applet S obtained in step S202 against the known third-party library files in T, and filtering out the file f if it matches.
3. The method according to claim 1, wherein the statistical feature SF in step S102 is a 241-dimensional vector, in which the first 3 dimensions are, respectively, the number of pages of the applet, the average number of static files and the number of developer-defined functions, the 4th to 137th dimensions are the sourceAPI call counts, and the 138th to 241st dimensions are the sinkAPI call counts.
4. A method according to claim 1 or 3, wherein the step S102 of extracting the statistical feature SF is specifically:
step S301: counting the number of pages of the applet from the routing components, the routing functions, and the pages and tabbar fields of app.json in the preprocessed source code of the applet S to be detected obtained in step S101;
step S302: counting, by comparing file types, the average number of static resource files per static resource directory in the preprocessed source code of the applet S to be detected obtained in step S101, wherein a static resource directory is a directory that contains static resource files, and a static resource file is a file that is not a json, js, html-like or css file;
step S303: counting the number of developer-defined functions in the preprocessed source code of the applet S to be detected obtained in step S101, wherein a developer-defined function is a function that is defined neither by the system nor by a third-party library;
step S304: counting the sourceAPI call counts and the sinkAPI call counts in the preprocessed source code of the applet S to be detected obtained in step S101, wherein a sourceAPI is an API that returns sensitive data as its return value, and a sinkAPI is an API that receives sensitive data as a parameter;
step S305: forming a feature vector from the data obtained in steps S301, S302, S303 and S304 as the statistical feature SF of the applet S to be detected.
5. The method according to claim 1, wherein the layout feature LF in step S102 is a hash sequence, denoted $\mathrm{LF}(S) = \langle fh_1, \ldots, fh_N \rangle$, where $fh_i$ is the i-th hash value;
the step S102 of extracting the layout feature LF specifically includes:
step S401: parsing the html-like files used for page layout display in the preprocessed source code of the applet S to be detected obtained in step S101, and extracting the component sequence;
step S402: fragmenting the component sequence obtained in step S401 using a weak hash;
step S403: calculating the hash value of each fragment of the fragmented component sequence obtained in step S402 using a strong hash, and taking the resulting hash sequence as the layout feature LF.
6. The method according to claim 1, wherein the custom function feature CFF in step S102 is a hash sequence, denoted $\mathrm{CFF}(S) = \langle fh_1, \ldots, fh_N \rangle$, where $fh_i$ is the i-th hash value;
the step S102 of extracting the custom function feature CFF specifically includes:
step S501: parsing the js files that implement the functional logic in the preprocessed source code of the applet S to be detected obtained in step S101 into abstract syntax trees (ASTs);
step S502: extracting the type sequence of each developer-defined function from the AST of the js file obtained in step S501, wherein the type sequence is the sequence formed by traversing the AST in pre-order and taking out the type attribute value of each node in turn;
step S503: fragmenting the type sequence obtained in step S502 using a weak hash;
step S504: calculating the hash value of each fragment of the fragmented type sequence obtained in step S503 using a strong hash, and taking the resulting hash sequence as the custom function feature CFF.
7. The method according to claim 1, wherein the file dependency feature FDF in step S102 is a file dependency graph, denoted $\mathrm{FDF}(S) = (\mathrm{FileV}(S), \mathrm{FileE}(S))$, where $\mathrm{FileV}(S)$ is the node set of the file dependency graph of the applet S to be detected, the node attribute being the file name, and $\mathrm{FileE}(S)$ is the edge set of the file dependency graph of the applet S to be detected, $\mathrm{FileE}(S) = \{\langle v_i, v_k \rangle \mid v_i, v_k \in \mathrm{FileV}(S)\}$;
the step S102 of extracting the file dependency feature FDF specifically includes:
step S601: taking all js files under the source code directory of the applet S to be detected as the node set of the file dependency graph;
step S602: traversing the js files, parsing each js file into an abstract syntax tree (AST), extracting the import dependency statements from the AST, and, if an imported file is in the node set of the file dependency graph, adding the file pair to the edge set of the file dependency graph;
in step S102, the double-layer dependency feature TLDF is a two-layer graph structure in which the upper layer is the file dependency graph and the lower layer consists of function call graphs, denoted $\mathrm{TLDF}(S) = (\mathrm{FileV}'(S), \mathrm{FileE}'(S))$, where $\mathrm{FileV}'(S)$ is the node set of the file dependency graph of the applet S to be detected, the node attribute being the function call graph of the file, and $\mathrm{FileE}'(S)$ is the edge set of the file dependency graph of the applet S to be detected, $\mathrm{FileE}'(S) = \{\langle v_i, v_k \rangle \mid v_i, v_k \in \mathrm{FileV}'(S)\}$;
the step S102 of extracting the double-layer dependency feature TLDF specifically includes:
step S701: taking all js files under the source code directory of the applet S to be detected as the node set of the file dependency graph;
step S702: traversing the js files, parsing each js file into an abstract syntax tree (AST), extracting the function call relations within the file from the AST to construct a function call graph, and taking the function call graph as the attribute of that js file;
step S703: extracting the import dependency statements from the AST obtained in step S702, and, if an imported file is in the node set of the file dependency graph, adding the file pair to the edge set of the file dependency graph.
8. The method according to claim 1, wherein in step S103 the center of each applet family cluster is computed in advance by averaging the statistical features SF of the applets in that cluster, the distance between the statistical feature SF of the applet S to be detected and the center of each applet family cluster is calculated, and the applet S to be detected is assigned to the closest applet family cluster.
9. The method according to claim 1, wherein in step S104 the similarity of the layout features LF of an applet pair is calculated by traversing the layout features LF of both applets, treating the hash values as character strings, and calculating the Levenshtein ratio to compare two hash values; if the Levenshtein ratio is greater than a certain threshold, the number of similar hash values is incremented by one, and the similarity is the ratio of the number of similar hash values to the maximum of the lengths of the two applets' layout features LF; the custom function feature CFF of an applet is also a hash sequence, and its similarity is calculated in the same way as for the layout feature LF;
the similarity of the file dependency features FDF of an applet pair is calculated by iterating over the file dependency graphs of both applets with the Weisfeiler-Lehman algorithm and computing the similarity from the number of similar labels in the output label sets;
the similarity of the double-layer dependency features TLDF of an applet pair is calculated by traversing the file dependency graphs of both applets and calculating the similarity of the function call graphs of nodes with the Weisfeiler-Lehman algorithm; if the similarity is greater than a certain threshold, the node is put into an anchor set, and the similarity is the ratio of the number of nodes in the anchor set to the maximum of the node counts of the two applets' file dependency graphs.
10. The method according to claim 1, wherein the similarity vector of an applet pair in step S105 consists of the layout feature LF similarity, the custom function feature CFF similarity, the file dependency feature FDF similarity and the double-layer dependency feature TLDF similarity of the two applets; if manual review finds that two applets are similar in function and layout, and plagiarism between the two has been reported on official forums or in news reports, the pair is labeled a clone applet pair, and otherwise a non-clone applet pair; the classifier takes the similarity vector of an applet pair as input and outputs 0 or 1, where 0 indicates a non-clone applet pair and 1 indicates a clone applet pair.
CN202310045745.9A 2023-01-30 2023-01-30 Small program clone detection method based on complex network analysis Pending CN116028112A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310045745.9A CN116028112A (en) 2023-01-30 2023-01-30 Small program clone detection method based on complex network analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310045745.9A CN116028112A (en) 2023-01-30 2023-01-30 Small program clone detection method based on complex network analysis

Publications (1)

Publication Number Publication Date
CN116028112A true CN116028112A (en) 2023-04-28

Family

ID=86079263

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310045745.9A Pending CN116028112A (en) 2023-01-30 2023-01-30 Small program clone detection method based on complex network analysis

Country Status (1)

Country Link
CN (1) CN116028112A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116432125A (en) * 2023-06-01 2023-07-14 中南大学 Code classification method based on hash algorithm
CN116432125B (en) * 2023-06-01 2023-09-05 中南大学 Code Classification Method Based on Hash Algorithm

Similar Documents

Publication Publication Date Title
CN111428044B (en) Method, device, equipment and storage medium for acquiring supervision and identification results in multiple modes
CN106709345B (en) Method, system and equipment for deducing malicious code rules based on deep learning method
CN110737899B (en) Intelligent contract security vulnerability detection method based on machine learning
CN108965245B (en) Phishing website detection method and system based on self-adaptive heterogeneous multi-classification model
CN103177215B (en) Based on the computer malware new detecting method of software control stream feature
CN108959924A (en) A kind of Android malicious code detecting method of word-based vector sum deep neural network
CN110192210A (en) Building and processing are used for the calculating figure of dynamic, structured machine learning model
CN109800575B (en) Security detection method for Android application program
CN109325193A (en) WAF normal discharge modeling method and device based on machine learning
CN113596007A (en) Vulnerability attack detection method and device based on deep learning
CN116028112A (en) Small program clone detection method based on complex network analysis
CN114417865A (en) Method, device and equipment for processing description text of disaster event and storage medium
CN114490953B (en) Method for training event extraction model, method, device and medium for extracting event
CN115495744A (en) Threat information classification method, device, electronic equipment and storage medium
CN108509794A (en) A kind of malicious web pages defence detection method based on classification learning algorithm
CN110226179A (en) Contextual information is integrated by neural network to detect the fraud in payment transaction stream automatically
CN112817877B (en) Abnormal script detection method and device, computer equipment and storage medium
CN103036848A (en) Reverse engineering method and system of protocol
CN107688594B (en) The identifying system and method for risk case based on social information
CN112084095B (en) Energy network connection monitoring method and system based on block chain and storage medium
Shaik et al. Fake news detection using NLP
CN110674288A (en) User portrait method applied to network security field
CN110598115A (en) Sensitive webpage identification method and system based on artificial intelligence multi-engine
CN115964489A (en) Supply and demand event extraction method and system based on stacked pointer network
CN113919544B (en) Crime early warning method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination