CN113721978B - Method and system for detecting open source component in mixed source software - Google Patents

Method and system for detecting open source component in mixed source software

Info

Publication number
CN113721978B
Authority
CN
China
Prior art keywords
source code
code file
source
simhash
minhash
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111286072.3A
Other languages
Chinese (zh)
Other versions
CN113721978A (en)
Inventor
张涛
陈钟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University
Priority to CN202111286072.3A
Publication of CN113721978A
Application granted
Publication of CN113721978B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/77 Software metrics

Abstract

The embodiments disclosed in this application provide a method and a system for detecting open source components in mixed-source software. The method comprises: acquiring the source code files in target mixed-source software as first source code files, classifying the first source code files, and performing the corresponding homology analysis on each class; performing homology analysis based on a Simhash algorithm on those first source code files whose size exceeds a first threshold; and performing homology analysis based on a Minhash algorithm on those first source code files whose size does not exceed the first threshold. Compared with the prior art, the scheme balances the conflict between efficiency and accuracy in open source component detection for mixed-source software, and obtains an acceptable detection result while preserving detection efficiency.

Description

Method and system for detecting open source component in mixed source software
Technical Field
The embodiments disclosed in the present application relate generally to the field of open source governance, more particularly to the sub-field of Software Composition Analysis (SCA) security testing, and still more particularly to a method and system for detecting open source components in mixed-source software.
Background
In recent years, the proportion of open source components used in software development has been increasing. Introducing open source components can greatly improve the efficiency of software development. Today, almost all software developers choose to use open source frameworks, open source libraries, open source components, and the like to simplify the development process and shorten the development cycle. However, introducing open source components inevitably risks introducing vulnerabilities, which cause security problems, as well as intellectual property compliance problems. This is especially true when open source code files are reused by direct copying or introduced with only simple modification: because open source content is publicly, widely, and frequently used, it readily becomes a priority target for attackers.
At present, although many SCA tools support the analysis of open source components, most of them analyze a project's open source components based on the project's feature files. Code-based analysis of open source components is rare, mainly because analysis against a massive body of open source code is difficult and its detection efficiency falls short of expectations.
Disclosure of Invention
According to the embodiments disclosed in this application, a method and a system for detecting open source components in mixed-source software are provided. They realize open source component detection at code granularity and obtain an acceptable detection result at that granularity while preserving detection efficiency.
In a first aspect of the present disclosure, a method for detecting open source components in mixed-source software is provided. The method comprises: acquiring a source code file in target mixed-source software as a first source code file; and performing the corresponding homology analysis on the first source code file according to its size, wherein homology analysis based on a Simhash algorithm is performed on first source code files whose size exceeds a first threshold, and homology analysis based on a Minhash algorithm is performed on first source code files whose size does not exceed the first threshold.
Specifically, the homology analysis based on the Simhash algorithm may include: defining a Simhash function h1; generating, based on h1 and the code of a first source code file to be detected whose size exceeds the first threshold, a first Simhash value of that file; and matching the first Simhash value against the second Simhash values in a fingerprint library one by one to determine whether the first source code file is homologous with the source code file corresponding to a second Simhash value in the fingerprint library. The second Simhash values are generated from the code of each second source code file in a source code library based on the same Simhash function h1 and are stored in the fingerprint library.
The homology analysis based on the Minhash algorithm may include: acquiring third source code files from the source code library; generating, from the code of a first source code file to be detected whose size does not exceed the first threshold and from the code of the third source code files, the signature set of each file, and constructing a feature matrix of the signature sets; and defining a Minhash function h2, estimating the Jaccard similarity between the signature sets of the first source code file and the third source code files based on the feature matrix and h2, and determining whether the first source code file is homologous with one or more of the third source code files.
Here, the source code library is a source code repository generated by collecting source code files and storing them by category; a second source code file is a source code file in the source code library whose size exceeds the first threshold; a third source code file is a source code file in the source code library whose size does not exceed the first threshold; and the Simhash function h1 and the Minhash function h2 used in the homology analysis may be any Simhash function and any Minhash function, respectively.
In a second aspect of the present disclosure, a system for detecting open source components in mixed-source software is provided. The system comprises a parser, a classifier, a Simhash identifier, and a Minhash identifier. The parser is configured to parse the target mixed-source software and acquire first source code files from it. The classifier is configured to classify the first source code files obtained by the parser and assign each of them to the corresponding identifier for homology analysis. The Simhash identifier is configured to perform homology analysis based on a Simhash algorithm on first source code files whose size exceeds a first threshold. The Minhash identifier is configured to perform homology analysis based on a Minhash algorithm on first source code files whose size does not exceed the first threshold.
Specifically, the parser may parse a software package of the target mixed-source software to obtain the source code files of the target mixed-source software, that is, the first source code files. The classifier may classify each first source code file according to its size and assign an identifier accordingly: a first source code file whose size exceeds the first threshold is assigned to the Simhash identifier, and a first source code file whose size does not exceed the first threshold is assigned to the Minhash identifier.
The Simhash identifier is configured to hash, with a Simhash function h1, the code of a first source code file assigned by the classifier whose size exceeds the first threshold, thereby generating a first Simhash value of that file, and to match the first Simhash value against the second Simhash values one by one to determine whether the first source code file is homologous with the source code file corresponding to a second Simhash value. The second Simhash values are generated from the code of each second source code file in the source code library based on the same Simhash function h1.
The Minhash identifier is configured to acquire third source code files from the source code library and to estimate, based on a signature set feature matrix of the first source code file and the third source code files and on a Minhash function h2, the Jaccard similarity between the signature sets of the first source code file and the third source code files, so as to determine whether the first source code file is homologous with one or more of the third source code files. The signature set feature matrix is constructed from the signature sets of the first source code file and the third source code files, and each signature set is generated from the code of the corresponding file.
Here, the source code library is a source code repository generated by collecting source code files and storing them by category; a second source code file is a source code file in the source code library whose size exceeds the first threshold; a third source code file is a source code file in the source code library whose size does not exceed the first threshold; and the Simhash function h1 and the Minhash function h2 used in the homology analysis may be any Simhash function and any Minhash function, respectively.
In a third aspect of the present disclosure, a system for detection of an open source component in mixed source software is provided. The system comprises: at least one processor, a memory coupled to the at least one processor, and a computer program stored in the memory; wherein the processor executes the computer program to implement the method for open source component detection in the mixed source software according to the first aspect.
In a fourth aspect of the disclosure, a computer-readable storage medium is provided. The medium having stored thereon computer instructions related to detection of open source components of the software; the computer instructions are capable of, when executed by a computer processor, implementing the method for open source component detection in mixed source software according to the first aspect.
In a fifth aspect of the disclosure, a computer program product is provided. The program product comprises a computer program that, when executed by a computer processor, implements the method for open source component detection in mixed source software according to the first aspect.
It should be understood that the content described in this section is not intended to identify key or essential features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, like or similar reference characters designate like or similar elements, and wherein:
FIG. 1 shows a schematic diagram of a process for open source component detection in mixed-source software according to an embodiment of the present disclosure;
FIG. 2 illustrates a block diagram of a system for open source component detection in mixed-source software, according to an exemplary implementation of the present disclosure;
FIG. 3 illustrates a block diagram of a computing device capable of implementing various embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
In the description of the embodiments of the present disclosure, the term "include" and similar terms are to be understood as open-ended, i.e., "including but not limited to". The term "based on" should be understood as "based at least in part on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The terms "first," "second," and the like may refer to different or the same objects. Other explicit and implicit definitions may also appear below.
In the description of the embodiments of the present disclosure, the technical term "target mixed-source software" refers to any mixed-source software that is the object to be detected. In a specific implementation of the embodiments, the target mixed-source software to be detected is generally submitted in the form of a software package to the method, system, and the like of the disclosed scheme for open source detection and analysis.
To maximize software development efficiency, software developers use third-party resources, especially open source third-party libraries, to improve the efficiency of each project under development along multiple dimensions, such as simplifying design and reducing the amount of code to be written. In practice, almost all software developers choose to use third-party libraries, especially open source third-party resources. This trend toward structured reuse, ranging from frameworks and components down to code segments that implement a particular function, is becoming more and more apparent. According to Gartner statistics, in recent years third-party library code has accounted for more than 80% of the total code in software, and the share of self-developed code keeps falling. However, introducing an open source component also introduces any vulnerabilities hidden in it. Because zero-day vulnerabilities are difficult for attackers to find, attacking software systems through known vulnerabilities is a common approach, and the public and frequent reuse of open source resources increases this risk. In addition, every piece of open source code carries the open source license it is released under; these licenses define obligations for users, and different open source licenses may conflict with one another. Thus, the blind introduction of open source components is likely to create intellectual property risks, whether through inadvertent license violations or through the introduction of open source code whose licenses conflict with each other.
According to the embodiments of the disclosure, a scheme for detecting open source components in mixed-source software is provided. In the scheme, the source code files in the target mixed-source software are obtained as first source code files, and the corresponding homology analysis is performed on each first source code file according to its size: homology analysis based on a Simhash algorithm is performed on source code files whose size exceeds the first threshold, and homology analysis based on a Minhash algorithm is performed on source code files whose size does not exceed the first threshold. The disclosed scheme balances the conflict between efficiency and accuracy in open source component detection for mixed-source software and obtains an acceptable detection result while preserving detection efficiency.
Embodiments of the present disclosure will be described below in detail with reference to the accompanying drawings. Fig. 1 shows a schematic diagram of a process 100 for open source component detection in mixed-source software according to an embodiment of the present disclosure. As shown in fig. 1, the process 100 mainly includes the following. At block 101, the package of the target mixed-source software to be detected is parsed, and all of its source code files are acquired as first source code files. At block 102, the corresponding homology analysis is performed on each first source code file according to the size of that file. At block 102a, a first source code file whose size exceeds a first threshold is processed with a Simhash algorithm, and the result is matched against a fingerprint library to realize homology analysis of the file. At block 102b, for a first source code file whose size does not exceed the first threshold, a Minhash algorithm is used to estimate the similarity between the code of the first source code file and the code of the third source code files to realize homology analysis of the file.
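The size-based routing of block 102 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the threshold value, the choice of measuring size in UTF-8 bytes, and the names `SourceFile` and `route_by_size` are all assumptions, since the text does not specify them.

```python
from dataclasses import dataclass

# Hypothetical value: the text does not state the first threshold or its unit.
FIRST_THRESHOLD_BYTES = 4096


@dataclass
class SourceFile:
    path: str
    code: str


def route_by_size(files, threshold=FIRST_THRESHOLD_BYTES):
    """Split first source code files into (simhash_queue, minhash_queue).

    Files whose size exceeds the threshold go to the Simhash branch (block 102a);
    all others go to the Minhash branch (block 102b).
    """
    simhash_queue, minhash_queue = [], []
    for f in files:
        size = len(f.code.encode("utf-8"))  # assumption: size is measured in bytes
        if size > threshold:
            simhash_queue.append(f)
        else:
            minhash_queue.append(f)
    return simhash_queue, minhash_queue
```

A file whose size equals the threshold falls into the Minhash branch, matching the "does not exceed" wording.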
Specifically, in some embodiments, the homology analysis based on the Simhash algorithm and fingerprint library matching at block 102a may proceed as follows. A Simhash function is selected and fixed (it may be any Simhash function). For each source code file in the source code library whose size exceeds the first threshold (that is, each second source code file), the full code of the file is processed with this Simhash function to compute its hash value, yielding a second Simhash value, which is stored in the fingerprint library so that it is available for comparison and matching whenever open source component detection invokes the Simhash-based homology analysis. The full code of a first source code file to be detected whose size exceeds the first threshold is processed with the same Simhash function to compute its hash value, that is, the first Simhash value. The first Simhash value is then compared with the second Simhash values in the fingerprint library one by one, and the matching analysis finally determines whether the first source code file is homologous with the source code file corresponding to a second Simhash value in the fingerprint library.
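As a hedged sketch of the fingerprinting step, the following computes a 64-bit Simhash over code tokens and builds a fingerprint library from a source code library. The text allows any Simhash function, so the specifics here are assumptions: the regular-expression tokenizer, the use of MD5 as the per-token hash, the 64-bit width, and the names `simhash` and `build_fingerprint_library`.

```python
import hashlib
import re


def tokens(code):
    # Assumption: identifiers and integer literals serve as features.
    return re.findall(r"[A-Za-z_]\w*|\d+", code)


def simhash(code, bits=64):
    """Simhash of a code text; each token occurrence votes +1/-1 on every bit."""
    counts = [0] * bits
    for tok in tokens(code):
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # 64-bit hash of the token
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    value = 0
    for i in range(bits):
        if counts[i] > 0:  # a bit is set when the positive votes win
            value |= 1 << i
    return value


def build_fingerprint_library(second_source_files):
    """Map each second source code file path to its second Simhash value."""
    return {path: simhash(code) for path, code in second_source_files.items()}
```

Because the library and the file under test must be hashed with the same function, both paths go through the single `simhash` function here.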
In block 102b, the homology analysis based on the Minhash algorithm may proceed as follows. The source code files in the source code library whose size does not exceed the first threshold (that is, the third source code files) are acquired, and the first source code file to be detected, whose size does not exceed the first threshold, is processed together with the third source code files to judge the similarity between them and thereby realize homology analysis. After the third source code files are obtained (a source code library usually contains many of them), a signature set is generated for the first source code file and for each third source code file from the code of the respective file, and the signature sets together form a feature matrix; conventionally, the signature set of each source code file forms one column of the feature matrix. A Minhash function is selected and fixed (it may be any Minhash function), and the feature matrix is processed with it. For example, the rows of the feature matrix described above are shuffled, and after each row shuffle the minimum hash value of each column is computed with the selected Minhash function. The Jaccard similarity between the signature sets of the first source code file and each third source code file is then estimated from these minimum hash values, and it is determined whether the first source code file is homologous with each third source code file.
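The reason the minimum hash values can stand in for the signature sets is a standard property of min-hashing, stated here as background rather than taken from the text. The symbols $A$, $B$, $\pi$, and $k$ are notation introduced for this illustration: $A$ and $B$ are two signature sets, $\pi$ is a uniformly random permutation of the rows of the feature matrix, $h_\pi(S)$ is the position of the first row under $\pi$ in which the column of $S$ holds a 1, and $k$ is the number of independent permutations.

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|},
\qquad
\Pr_{\pi}\bigl[\, h_\pi(A) = h_\pi(B) \,\bigr] = J(A, B),
\qquad
\hat{J}(A, B) = \frac{1}{k} \sum_{i=1}^{k} \mathbf{1}\bigl[\, h_{\pi_i}(A) = h_{\pi_i}(B) \,\bigr].
```

The middle identity holds because the first row under $\pi$ that contains a 1 in either column is equally likely to be any element of $A \cup B$, and the two minimum hash values agree exactly when that element lies in $A \cap B$. The estimator $\hat{J}$ is therefore unbiased, and its variance shrinks as $k$ grows.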
In addition, in some embodiments, the following may also be done when computing the various hashes over the code of each source code file: before the hash calculation, the code text of the source code file is segmented into words (tokens) to produce a form such as the signature set described above, which improves recognition efficiency and accuracy.
In some embodiments, the homology analysis based on the Simhash algorithm and fingerprint library matching at block 102a may, more specifically, proceed as follows: when the first Simhash value is compared with the second Simhash values in the fingerprint library one by one, the Hamming distance between the first Simhash value and each second Simhash value is calculated, and the degree of similarity between them is judged from that distance. For example, a Hamming distance threshold is set, and a source code file whose second Simhash value lies at a Hamming distance below the threshold from the first Simhash value is judged to be homologous with the first source code file.
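The matching step can be sketched as a linear scan over the fingerprint library. This is an illustrative sketch only: the threshold value of 4 is a hypothetical default (the text gives no number), and `hamming_distance` and `match_fingerprints` are names introduced here.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two Simhash values differ."""
    return bin(a ^ b).count("1")


def match_fingerprints(first_value, fingerprint_library, threshold=4):
    """Return (path, distance) pairs whose distance is below the threshold.

    fingerprint_library maps a source file path to its second Simhash value.
    The default threshold is a hypothetical choice for illustration.
    """
    hits = []
    for path, second_value in fingerprint_library.items():
        d = hamming_distance(first_value, second_value)
        if d < threshold:  # "lower than the threshold" means homologous
            hits.append((path, d))
    return sorted(hits, key=lambda hit: hit[1])
```

For example, with a library `{"a": 0b1111, "b": 0b0000, "c": 0b0111}` and a threshold of 2, the value `0b1111` matches `a` at distance 0 and `c` at distance 1, while `b` at distance 4 is rejected.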
In some embodiments, in block 102b, the process of estimating similarity based on the signature set feature matrix and the selected Minhash function to realize the homology analysis may specifically be as follows. The feature matrix is shuffled in the manner that suits its layout: if the feature matrix follows the conventional layout described in the previous example, with each signature set as a column, the shuffling is row shuffling; if each signature set is instead a row, the shuffling is column shuffling. The (random) shuffling is performed multiple times, and after each shuffle the minimum hash values are computed with the selected Minhash function, yielding the minimum hash signature matrix of the first source code file and the third source code files. Based on the minimum hash signature matrix, the Jaccard similarity between the signature set of the first source code file and the signature set of each third source code file in the feature matrix is estimated, to determine for each third source code file whether it is homologous with the first source code file.
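The row-shuffling procedure can be sketched directly, assuming the conventional layout in which each signature set is a column. The function names, the default of 128 shuffles, and the fixed seed are assumptions for illustration. Explicit shuffling is shown because it mirrors the description; practical implementations usually simulate the permutations with a family of hash functions to avoid materializing the matrix.

```python
import random


def feature_matrix(signature_sets):
    """Rows are the distinct elements, columns are files; 1 marks membership."""
    universe = sorted(set().union(*signature_sets))
    return [[1 if element in s else 0 for s in signature_sets] for element in universe]


def minhash_signature_matrix(matrix, num_shuffles=128, seed=0):
    """One signature row per shuffle: for each column, the first shuffled
    row position that holds a 1 (the column's minimum hash value)."""
    rng = random.Random(seed)
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    signatures = []
    for _ in range(num_shuffles):
        order = list(range(n_rows))
        rng.shuffle(order)
        mins = [n_rows] * n_cols  # n_rows marks a column with no 1 at all
        for position, row in enumerate(order):
            for col in range(n_cols):
                if matrix[row][col] and position < mins[col]:
                    mins[col] = position
        signatures.append(mins)
    return signatures


def estimate_jaccard(signatures, col_a, col_b):
    """Fraction of shuffles on which the two columns share a minimum hash value."""
    agree = sum(1 for row in signatures if row[col_a] == row[col_b])
    return agree / len(signatures)
```

With column 0 holding the first source code file, calling `estimate_jaccard(signatures, 0, j)` for each other column `j` gives the per-file similarity estimates that the homology decision is based on.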
In some embodiments, in block 102b, the process of estimating similarity based on the signature set feature matrix and the selected Minhash function to realize the homology analysis may instead be as follows. The feature matrix is shuffled in the manner that suits its layout (as above, row shuffling when each signature set is a column); the (random) shuffling is performed multiple times, and the minimum hash values of the signature sets of the first source code file and the third source code files are computed with the selected Minhash function to generate a minimum hash signature matrix. Based on the minimum hash signature matrix, a locality-sensitive hashing (LSH) algorithm is used so that only highly similar signature sets are processed: signature sets that do not form candidate pairs are ignored outright, only candidate pairs with high similarity are retained, and by comparing the candidate pairs, the third source code files whose signature sets have a Jaccard similarity with the signature set of the first source code file exceeding a second threshold are found quickly. If such a third source code file is found, its code is considered similar to that of the first source code file, and the two files are judged to be homologous. If no third source code file has a signature set whose Jaccard similarity with the signature set of the first source code file exceeds the second threshold, it is determined that the source code library contains no third source code file homologous with the first source code file.
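The text names LSH without fixing a scheme. The sketch below uses the common banding technique as one possible instance: the signature matrix is cut into bands, columns that agree on every row of at least one band become candidate pairs, and only candidates are compared against the second threshold. The band count, the threshold values, and the function names are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidate_pairs(signatures, bands):
    """signatures: one row per shuffle, one entry per column (file).
    Returns the set of column pairs (i, j), i < j, that share a bucket in some band."""
    n_hashes = len(signatures)
    if n_hashes % bands != 0:
        raise ValueError("number of signature rows must be divisible by bands")
    rows_per_band = n_hashes // bands
    n_cols = len(signatures[0])
    candidates = set()
    for b in range(bands):
        band_rows = signatures[b * rows_per_band:(b + 1) * rows_per_band]
        buckets = defaultdict(list)
        for col in range(n_cols):
            buckets[tuple(row[col] for row in band_rows)].append(col)
        for cols in buckets.values():
            candidates.update(combinations(cols, 2))
    return candidates


def homologous_columns(signatures, first_col, bands, second_threshold):
    """Columns whose estimated Jaccard similarity with first_col exceeds the threshold,
    checking candidate pairs only."""
    result = []
    for i, j in sorted(lsh_candidate_pairs(signatures, bands)):
        if first_col not in (i, j):
            continue
        other = j if i == first_col else i
        agree = sum(1 for row in signatures if row[first_col] == row[other])
        if agree / len(signatures) > second_threshold:
            result.append(other)
    return result
```

More bands with fewer rows each make the filter more permissive, and fewer bands with more rows make it stricter, so the band count would be tuned together with the second threshold.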
In some embodiments, when the signature sets of the first source code file and the third source code files are generated during the Minhash-based homology analysis at block 102b, the following may also be done: a keyword dictionary is set up according to programming experience, and the code of the first source code file and the third source code files is segmented based on the keyword dictionary, so that word segmentation is efficient and accurate and yields the signature sets of the first source code file and the third source code files.
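The text does not say what the keyword dictionary contains or how it steers segmentation. One plausible reading, sketched below purely as an assumption, is a dictionary of multi-character operators that the tokenizer must keep whole, so that `==` is not split into two `=` tokens. The dictionary contents and the names `build_pattern` and `signature_set` are hypothetical.

```python
import re

# Hypothetical dictionary: multi-character operators to be kept as single tokens.
KEYWORD_DICTIONARY = ["==", "!=", "<=", ">=", "->", "++", "--", "&&", "||", "::"]


def build_pattern(dictionary):
    """Compile a tokenizer that tries dictionary entries first (longest first),
    then identifiers, then integer literals, then any other non-space character."""
    entries = sorted(dictionary, key=len, reverse=True)
    parts = [re.escape(entry) for entry in entries]
    parts += [r"[A-Za-z_]\w*", r"\d+", r"\S"]
    return re.compile("|".join(parts))


def signature_set(code, pattern):
    """Signature set of a code text: the set of distinct tokens it contains."""
    return set(pattern.findall(code))
```

For instance, `signature_set("if (a == b) return p->x;", build_pattern(KEYWORD_DICTIONARY))` contains `==` and `->` as single elements alongside `if`, `return`, the identifiers, and the punctuation.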
In some embodiments, to improve the efficiency of the Minhash-based homology analysis, the third source code files in the source code library are managed separately: a third source code file library is generated and used to store and manage the third source code files. When the Minhash-based homology analysis is performed at block 102b, the third source code files can then be acquired directly from the third source code file library instead of being searched for in the massive source code library, which greatly improves the efficiency of acquiring them.
In this way, open source component detection for the source code files in mixed-source software is targeted according to file characteristics. Compared with prior-art detection based on summary information about source code files, the scheme realizes open source component analysis of source code files at code granularity while taking into account the strengths and weaknesses of the individual detection techniques, forming a detection scheme that balances efficiency and precision.
FIG. 2 illustrates a block diagram of a system 200 for detecting open source components in mixed-source software, in accordance with an embodiment of the present disclosure. As shown in fig. 2, the system 200 comprises a parser 210, a classifier 220, a Simhash identifier 230, and a Minhash identifier 240. The parser 210 is configured to parse the target mixed-source software and obtain first source code files from it. The classifier 220 is configured to classify the first source code files obtained by the parser and assign each to the corresponding identifier for homology analysis. The Simhash identifier 230 is configured to perform Simhash-based homology analysis on first source code files whose size exceeds a first threshold. The Minhash identifier 240 is configured to perform Minhash-based homology analysis on first source code files whose size does not exceed the first threshold. In some embodiments, the parser 210 may parse the target mixed-source software package (submitted for detection, for example, by upload) to obtain all source code files in the target mixed-source software; these are the first source code files. The classifier 220 may classify each first source code file obtained by the parser 210 according to its size and assign the corresponding identifier accordingly: a first source code file whose size exceeds the first threshold is assigned to the Simhash identifier 230 for homology analysis, and a first source code file whose size does not exceed the first threshold is assigned to the Minhash identifier 240 for homology analysis.
The Simhash identifier 230 may be configured to hash, with a fixed Simhash function (which may be any Simhash function), the code of a first source code file assigned by the classifier 220 whose size exceeds the first threshold, computing the first Simhash value of that file, and to match the first Simhash value against each second Simhash value one by one to determine whether the first source code file is homologous with the source code file corresponding to a second Simhash value. The second Simhash values are computed from the code of the second source code files in the source code library with the same fixed Simhash function (a second Simhash value is generated for every second source code file in the source code library). The Minhash identifier 240 may be configured to obtain the third source code files from the source code library and to estimate, based on the signature set feature matrix of the first source code file and the third source code files and on a fixed Minhash function (which may be any Minhash function), the Jaccard similarity between the signature sets of the first source code file and the third source code files, so as to determine whether the first source code file is homologous with one or more third source code files. The signature set feature matrix is constructed from the signature sets of the first source code file and the third source code files, and each signature set is generated from the code of the corresponding file.
In addition, in some embodiments, before performing any of the hash calculations on the code of a source code file, each identifier may first perform word segmentation on the code text of the file, for example to produce a signature set, so as to improve identification efficiency and accuracy.
In some embodiments, when performing homology analysis on the first source code file by matching the first Simhash value against the second Simhash values, the Simhash identifier 230 may be specifically configured to calculate the Hamming distance between the first Simhash value and each second Simhash value, use that distance to judge the similarity between the first source code file and the source code file corresponding to the second Simhash value, and determine from the similarity whether the two files are homologous. More specifically, for example, a Hamming distance threshold is set, and a source code file whose second Simhash value lies at a Hamming distance below the threshold from the first Simhash value is determined to be homologous with the first source code file.
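The Simhash matching step can be sketched in Python as follows. This is an illustration under stated assumptions rather than the disclosed implementation: the disclosure allows any Simhash function, so the 64-bit fingerprint built from BLAKE2b token hashes, the regular-expression tokenizer, the function names, and the distance threshold of 3 are all hypothetical choices.

```python
import hashlib
import re
from typing import Dict, Iterable, List


def tokenize(code: str) -> List[str]:
    """Very rough word segmentation: identifiers, numbers, and single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def simhash(tokens: Iterable[str], bits: int = 64) -> int:
    """Compute a Simhash fingerprint; `bits` must be a multiple of 8."""
    weights = [0] * bits
    for token in tokens:
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=bits // 8).digest()
        value = int.from_bytes(digest, "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (value >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def find_homologous(first_value: int, fingerprint_library: Dict[str, int],
                    distance_threshold: int = 3) -> List[str]:
    """Return names whose second Simhash value is closer than the threshold."""
    return [name for name, second_value in fingerprint_library.items()
            if hamming_distance(first_value, second_value) < distance_threshold]
```

For instance, `find_homologous(simhash(tokenize(code)), library)` compares one first Simhash value against every second Simhash value in `library`, one by one, as the paragraph above describes.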
In some embodiments, when performing homology analysis on the first source code file based on the signature set feature matrix and a Minhash function, the Minhash identifier 240 may be configured to estimate the Jaccard similarity between the signature sets of the first source code file and the third source code files as follows: scramble the feature matrix (generally by permuting its rows; see the related description in the preceding section for details), repeating the random scrambling multiple times and, based on the selected Minhash function, calculating the minimum hash value after each scrambling, so as to obtain a minimum hash signature matrix of the first source code file and the third source code files; then, based on the minimum hash signature matrix, estimate the Jaccard similarity between the signature set of the first source code file and the signature set of each third source code file in the feature matrix, and determine for each third source code file whether it is homologous with the first source code file.
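The row-permutation procedure described above can be sketched in Python. The function names, the number of permutations, and the fixed seed are assumptions chosen for illustration; the disclosure leaves the Minhash function open. Each column of the feature matrix corresponds to one file's signature set, and the estimate is the fraction of permutations on which two columns share the same minimum hash value.

```python
import random
from typing import List, Sequence, Set


def feature_matrix(signature_sets: Sequence[Set[str]]) -> List[List[int]]:
    """Rows are elements of the union of all sets; columns are files (1 = present)."""
    universe = sorted(set().union(*signature_sets))
    return [[1 if element in s else 0 for s in signature_sets] for element in universe]


def minhash_signature_matrix(matrix: List[List[int]], num_permutations: int = 128,
                             seed: int = 0) -> List[List[int]]:
    """For each random row permutation, record per column the first position holding a 1."""
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    rng = random.Random(seed)
    order = list(range(n_rows))
    signature: List[List[int]] = []
    for _ in range(num_permutations):
        rng.shuffle(order)                 # one random scrambling of the rows
        mins = [n_rows] * n_cols           # n_rows acts as "no 1 seen yet"
        for position, row in enumerate(order):
            for col in range(n_cols):
                if matrix[row][col] and mins[col] == n_rows:
                    mins[col] = position   # minimum hash value for this column
        signature.append(mins)
    return signature


def estimate_jaccard(signature: List[List[int]], col_a: int, col_b: int) -> float:
    """Fraction of permutations on which the two columns agree."""
    agree = sum(1 for row in signature if row[col_a] == row[col_b])
    return agree / len(signature)
```

Placing the first source code file in column 0 and the third source code files in the remaining columns lets `estimate_jaccard(signature, 0, j)` be evaluated for each third file in turn. Explicit permutations are simple to follow but scale poorly; hashing row indices instead of shuffling them is a common substitute.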
Alternatively, the similarity estimation based on the signature set feature matrix and the Minhash function may be carried out in a more efficient manner, as follows. The feature matrix is scrambled randomly multiple times (generally by permuting its rows; see the related description in the preceding section), and the minimum hash values of the signature sets of the first source code file and the third source code files are calculated based on the selected Minhash function to generate a minimum hash signature matrix. Based on the minimum hash signature matrix, an LSH (locality-sensitive hashing) algorithm is then used so that only highly similar signature sets are processed: signature sets that do not form candidate pairs are ignored outright, only candidate pairs with high similarity are screened out, and, by comparing the candidate pairs, the third source code files whose signature sets have a Jaccard similarity with the signature set of the first source code file exceeding a second threshold are found quickly. If such a third source code file is found, its code is considered similar to that of the first source code file, and the two files are determined to be homologous. If no third source code file has a signature set whose Jaccard similarity with the signature set of the first source code file exceeds the second threshold, it is determined that the source code library contains no third source code file homologous with the first source code file.
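One common way to realize the LSH step is the banding technique, sketched below in Python. The disclosure does not say which LSH scheme is used, so banding, the function names, and the parameters `bands` and `rows` are assumptions. Here each file's Minhash signature is stored as a list keyed by file name, and two files become a candidate pair only if they agree on every value within at least one band.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def lsh_candidates(signatures: Dict[str, List[int]], query: str,
                   bands: int, rows: int) -> Set[str]:
    """Return files sharing at least one identical band with the query file."""
    buckets: Dict[Tuple[int, Tuple[int, ...]], Set[str]] = defaultdict(set)
    for name, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(name)
    candidates: Set[str] = set()
    query_sig = signatures[query]
    for b in range(bands):
        candidates |= buckets[(b, tuple(query_sig[b * rows:(b + 1) * rows]))]
    candidates.discard(query)  # a file is not a candidate for itself
    return candidates


def homologous_by_lsh(signatures: Dict[str, List[int]], query: str, bands: int,
                      rows: int, second_threshold: float) -> List[str]:
    """Compare only candidate pairs; keep those whose estimate exceeds the threshold."""
    query_sig = signatures[query]
    result = []
    for name in sorted(lsh_candidates(signatures, query, bands, rows)):
        sig = signatures[name]
        estimate = sum(x == y for x, y in zip(query_sig, sig)) / len(query_sig)
        if estimate > second_threshold:
            result.append(name)
    return result
```

An empty result corresponds to the case in which no homologous third source code file exists in the source code library. Signatures are assumed to have length `bands * rows`.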
In some embodiments, the system further comprises a keyword dictionary (generally, according to a programming verification setting). The Minhash identifier 240 is configured so that, when it performs Minhash-based homology analysis on the first source code file and generates the signature sets of the first source code file and the third source code files, it processes the code of those files according to the keyword dictionary, thereby obtaining efficiently and accurately segmented signature sets for the first source code file and the third source code files.
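The disclosure does not detail how the keyword dictionary drives word segmentation, so the following Python sketch shows only one plausible reading, and every choice in it is an assumption: tokens found in a small hypothetical dictionary of C keywords are kept verbatim, other identifiers and numbers are normalized to placeholders, and the signature set is formed from overlapping runs of `k` consecutive tokens.

```python
import re
from typing import FrozenSet, List, Set, Tuple

# Hypothetical keyword dictionary; a real one would be far larger.
C_KEYWORDS: FrozenSet[str] = frozenset({"int", "return", "if", "else", "for", "while"})


def signature_set(code: str, keywords: FrozenSet[str] = C_KEYWORDS,
                  k: int = 3) -> Set[Tuple[str, ...]]:
    """Segment code with a keyword dictionary and return its set of k-token shingles."""
    normalized: List[str] = []
    for token in re.findall(r"[A-Za-z_]\w*|\d+|\S", code):
        if token in keywords:
            normalized.append(token)       # keywords are kept as they are
        elif re.fullmatch(r"[A-Za-z_]\w*", token):
            normalized.append("ID")        # any other identifier
        elif token.isdigit():
            normalized.append("NUM")       # numeric literal
        else:
            normalized.append(token)       # operators and punctuation
    if len(normalized) < k:
        return {tuple(normalized)} if normalized else set()
    return {tuple(normalized[i:i + k]) for i in range(len(normalized) - k + 1)}
```

Under this normalization, `int a = 1;` and `int b = 2;` produce the same signature set, which makes the comparison insensitive to renamed variables and changed constants.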
In some embodiments, the system further comprises a fingerprint library. The fingerprint library is used to store the second Simhash values. The code of every second source code file in the source code library is hashed with the same Simhash function to generate a corresponding second Simhash value; the second Simhash values are stored in the fingerprint library and made available for the Simhash identifier 230 to query when it performs the comparison and matching of its homology analysis.
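A fingerprint library of this kind can be sketched as a small in-memory Python class. The class name, its methods, and the dictionary-based storage are assumptions for illustration; a deployed system would more likely use a database. The point the sketch makes is that one and the same Simhash function produces both the stored second Simhash values and the first Simhash value of a queried file.

```python
from typing import Callable, Dict, Iterator, Tuple


class FingerprintLibrary:
    """Stores second Simhash values, all computed with one shared Simhash function."""

    def __init__(self, simhash_fn: Callable[[str], int]) -> None:
        self._simhash_fn = simhash_fn
        self._values: Dict[str, int] = {}

    def build(self, second_source_files: Dict[str, str]) -> None:
        """Precompute a second Simhash value for every second source code file."""
        for name, code in second_source_files.items():
            self._values[name] = self._simhash_fn(code)

    def fingerprint(self, code: str) -> int:
        """Compute a first Simhash value with the same function as the stored values."""
        return self._simhash_fn(code)

    def items(self) -> Iterator[Tuple[str, int]]:
        """Iterate over (file name, second Simhash value) pairs for matching."""
        return iter(self._values.items())
```

Because the values are precomputed once, each query only requires hashing the first source code file and comparing it against the stored values.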
In some embodiments, the system further comprises a third source code file library, which is dedicated to storing and managing the third source code files. Specifically, the third source code files are collected into a third source code file library so that they are managed separately, and the Minhash identifier 240 is configured to obtain the third source code files directly from the third source code file library when performing Minhash-based homology analysis.
According to some embodiments of the present disclosure, a system for detecting open source components in mixed source software is also provided, which may in particular be implemented by a computing device. FIG. 3 illustrates a block diagram of a computing device 300 that may be used to implement some embodiments of the present disclosure. As shown in FIG. 3, the computing device 300 includes a Central Processing Unit (CPU) 301 capable of performing various appropriate operations and processes according to computer program instructions stored in a Read Only Memory (ROM) 302 or computer program instructions loaded from a storage unit 308 into a Random Access Memory (RAM) 303. The RAM 303 may also store various program code and data required for the operation of the computing device 300. The CPU 301, ROM 302, and RAM 303 are connected to each other via a bus 304, and an input/output (I/O) interface 305 is also connected to the bus 304. Several components of the computing device 300 are connected to the I/O interface 305, including: an input unit 306 such as a keyboard and mouse; an output unit 307 such as a display; a storage unit 308 such as a magnetic disk, an optical disk, or a Solid State Disk (SSD); and a communication unit 309 such as a network card or a modem. The communication unit 309 enables the computing device 300 to exchange information and data with other devices through a computer network. The CPU 301 is capable of executing the various methods and processes described in the above embodiments, such as the process 100. In some embodiments, the process 100 may be implemented as a computer software program embodied on a computer-readable medium, such as the storage unit 308. In some embodiments, part or all of the computer program is loaded or installed into the computing device 300. When loaded into the RAM 303 and executed by the CPU 301, the computer program can perform some or all of the operations of the process 100.
The functions described above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a System on a Chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (10)

1. A method for detecting open source components in mixed source software, characterized by comprising the following steps:
acquiring a first source code file; the first source code file refers to a source code file in target mixed-source software;
respectively executing corresponding homologous analysis on the first source code file according to the size of the first source code file;
and for the first source code file with the size exceeding a first threshold, carrying out homologous analysis on the first source code file based on a Simhash algorithm:
defining a Simhash function h1; generating a first Simhash value of the first source code file based on the code of the first source code file; matching the first Simhash value of the first source code file against the second Simhash values in a fingerprint library one by one; and determining whether the first source code file is homologous with a source code file corresponding to a second Simhash value in the fingerprint library; wherein the second Simhash values are Simhash values respectively generated from the codes of the second source code files based on the Simhash function h1 and stored in the fingerprint library;
and for the first source code file with the size not exceeding a first threshold, carrying out homologous analysis on the first source code file based on a Minhash algorithm:
acquiring a third source code file; generating signature sets of the first source code file and the third source code file according to the codes of the first source code file and the third source code file, and constructing a feature matrix of the signature sets according to the signature sets; defining a Minhash function h2, estimating the Jaccard similarity between the signature sets of the first source code file and the third source code file based on the feature matrix and the Minhash function h2, and determining whether the first source code file is homologous with the third source code file; wherein the process of estimating similarity based on the feature matrix and the Minhash function h2 to determine whether the first source code file is homologous with a third source code file specifically includes: performing corresponding scrambling processing on the feature matrix and obtaining, based on the Minhash function h2, a minimum hash signature matrix of the first source code file and the third source code file; and respectively estimating, based on the minimum hash signature matrix, the Jaccard similarity between the signature sets of the first source code file and each third source code file, to respectively determine whether the first source code file is homologous with each third source code file; or, performing corresponding scrambling processing on the feature matrix and obtaining, based on the Minhash function h2, a minimum hash signature matrix of the first source code file and the third source code file; finding out, based on the minimum hash signature matrix and by adopting an LSH algorithm, a third source code file whose signature set has a Jaccard similarity with the signature set of the first source code file exceeding a second threshold; if a third source code file whose signature set has a Jaccard similarity with the signature set of the first source code file exceeding the second threshold exists, judging that the third source code file and the first source code file are homologous; if no third source code file whose signature set has a Jaccard similarity with the signature set of the first source code file exceeding the second threshold exists, determining that no third source code file homologous with the first source code file exists in the source code library;
the second source code file refers to a source code file in a source code library whose size exceeds the first threshold; the third source code file refers to a source code file in the source code library whose size does not exceed the first threshold; the source code library is a source code repository generated by collecting source code files and storing the collected source code files in a classified manner; the Simhash function h1 and the Minhash function h2 refer to any Simhash function and any Minhash function, respectively.
2. The method of claim 1,
determining whether the first source code file is homologous to a source code file corresponding to a second Simhash value based on a match of the first Simhash value with the second Simhash value in a fingerprint library, including: and respectively calculating hamming distances between the first Simhash value and each second Simhash value in a fingerprint library to judge the similarity between the first source code file and the source code file corresponding to the second Simhash value, and judging whether the first source code file and the second source code file are homologous according to the similarity.
3. The method of claim 1,
in the process of generating the signature set, the method comprises the following steps: and performing word segmentation processing on the codes of the first source code file and the third source code file based on the keyword dictionary to obtain signature sets of the first source code file and the third source code file which are beneficial to improving the efficiency and the precision of homologous analysis.
4. The method of claim 1,
the third source code file is managed independently to generate a third source code file library;
and when the Minhash algorithm-based homologous analysis is executed, the third source code file is directly acquired from the third source code file library.
5. A system for detecting open source components in mixed source software, the system comprising:
an analyzer, a classifier, a Simhash recognizer and a Minhash recognizer; wherein,
the analyzer is used for analyzing the target mixed source software and acquiring a first source code file; the parser can parse a software package of the target source-mixed software to obtain a source code file of the target source-mixed software, wherein the source code file is the first source code file;
the classifier is used for classifying the first source code file obtained by the analyzer and distributing the first source code file to the corresponding recognizer for homologous analysis; the classifier can allocate an identifier for the first source code file according to the size of the first source code file; the method comprises the steps that a first source code file with the size exceeding a first threshold value is distributed to a Simhash recognizer for homologous analysis; allocating a first source code file with the size not exceeding a first threshold value to a Minhash recognizer for carrying out homologous analysis;
the Simhash recognizer is used for executing homologous analysis based on a Simhash algorithm on the first source code file with the size exceeding a first threshold value; the Simhash recognizer is configured to perform hash processing on the codes of the first source code file through a Simhash function h1 to generate a first Simhash value of the first source code file, and to perform matching analysis on the first Simhash value and the second Simhash values one by one to determine whether the first source code file is homologous with a source code file corresponding to a second Simhash value; wherein the second Simhash values are Simhash values respectively generated from the codes of the second source code files based on the Simhash function h1;
the Minhash recognizer is used for performing Minhash algorithm-based homologous analysis on the first source code file with the size not exceeding a first threshold; wherein the Minhash recognizer is configured to obtain a third source code file and to estimate, based on a signature set feature matrix of the first source code file and the third source code file and on a Minhash function h2, the Jaccard similarity between the signature sets of the first source code file and the third source code file, to determine whether the first source code file is homologous with the third source code file; the signature set feature matrix is constructed according to the signature sets of the first source code file and the third source code file; the signature sets of the first source code file and the third source code file are generated according to the codes of the first source code file and the third source code file; to estimate similarity based on the feature matrix and the Minhash function h2 and thereby determine whether the first source code file is homologous with a third source code file, the Minhash recognizer is configured to: perform corresponding scrambling processing on the feature matrix and obtain, based on the Minhash function h2, a minimum hash signature matrix of the first source code file and the third source code file; and respectively estimate, based on the minimum hash signature matrix, the Jaccard similarity between the signature sets of the first source code file and each third source code file, to respectively determine whether the first source code file is homologous with each third source code file; or, perform corresponding scrambling processing on the feature matrix and obtain, based on the Minhash function h2, a minimum hash signature matrix of the first source code file and the third source code file; find out, based on the minimum hash signature matrix and by adopting an LSH algorithm, a third source code file whose signature set has a Jaccard similarity with the signature set of the first source code file exceeding a second threshold; if a third source code file whose signature set has a Jaccard similarity with the signature set of the first source code file exceeding the second threshold exists, judge that the third source code file and the first source code file are homologous; if no third source code file whose signature set has a Jaccard similarity with the signature set of the first source code file exceeding the second threshold exists, determine that no third source code file homologous with the first source code file exists in the source code library;
the second source code file refers to a source code file in a source code library whose size exceeds the first threshold; the third source code file refers to a source code file in the source code library whose size does not exceed the first threshold; the source code library is a source code repository generated by collecting source code files and storing the collected source code files in a classified manner; the Simhash function h1 and the Minhash function h2 refer to any Simhash function and any Minhash function, respectively.
6. The system of claim 5,
the Simhash identifier, when performing a homologous analysis on the first source code file based on matching of the first Simhash value and the second Simhash value, is configured to include: and calculating the hamming distance between the first Simhash value and each second Simhash value respectively to judge the similarity between the first source code file and the source code file corresponding to the second Simhash value, and judging whether the first source code file and the second source code file are homologous according to the similarity.
7. The system of claim 5,
the Minhash identifier is configured to perform word segmentation processing on codes of the first source code file and the third source code file based on a keyword dictionary during the process of performing the Minhash algorithm-based homologous analysis on the first source code file until the signature sets of the first source code file and the third source code file are generated, so as to obtain the signature sets of the first source code file and the third source code file which are beneficial to improving the efficiency and the precision of the homologous analysis.
8. The system of claim 5,
the system further comprises a fingerprint library and/or a third source code file library;
wherein the fingerprint library is used for storing the second Simhash values; the codes of the second source code files in the source code library are hashed based on the Simhash function h1 to respectively generate corresponding second Simhash values, which are stored in the fingerprint library;
the third source code file library is dedicated to storing and managing the third source code files; the third source code files are collected to generate the third source code file library and are managed independently; and the Minhash identifier is configured to acquire the third source code files directly from the third source code file library when executing the Minhash algorithm-based homologous analysis.
9. A system for detecting open source components in mixed source software, the system comprising:
at least one processor, a memory coupled to the at least one processor, and a computer program stored in the memory;
wherein a processor executes the computer program to implement the method of open source component detection in mixed source software as claimed in any one of claims 1 to 4.
10. A computer-readable storage medium, characterized in that,
the medium having stored thereon computer instructions related to open source component detection;
the computer instructions, when executed by a computer processor, are capable of performing the method of open source component detection in mixed source software of any of claims 1-4.
CN202111286072.3A 2021-11-02 2021-11-02 Method and system for detecting open source component in mixed source software Active CN113721978B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111286072.3A CN113721978B (en) 2021-11-02 2021-11-02 Method and system for detecting open source component in mixed source software

Publications (2)

Publication Number Publication Date
CN113721978A CN113721978A (en) 2021-11-30
CN113721978B true CN113721978B (en) 2022-02-11

Family

ID=78686420

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111286072.3A Active CN113721978B (en) 2021-11-02 2021-11-02 Method and system for detecting open source component in mixed source software

Country Status (1)

Country Link
CN (1) CN113721978B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106649221A (en) * 2016-12-06 2017-05-10 北京锐安科技有限公司 Method and device for detecting duplicated texts
CN111241497A (en) * 2020-02-13 2020-06-05 北京高质系统科技有限公司 Open source code tracing detection method based on software multiplexing feature learning
CN112579155A (en) * 2021-02-23 2021-03-30 北京北大软件工程股份有限公司 Code similarity detection method and device and storage medium
CN112988217A (en) * 2021-03-10 2021-06-18 北京大学 Code library design method and detection method for rapid full-network code traceability detection

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10564939B2 (en) * 2017-06-05 2020-02-18 Devfactory Fz-Llc Method and system for arbitrary-granularity execution clone detection
US10705837B2 (en) * 2017-09-12 2020-07-07 Devfactory Innovations Fz-Llc Method and apparatus for finding long methods in code
CN111367566A (en) * 2019-06-27 2020-07-03 北京关键科技股份有限公司 Mixed source code feature extraction and matching method
CN112698861A (en) * 2021-03-25 2021-04-23 深圳开源互联网安全技术有限公司 Source code clone identification method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
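Research progress on set similarity estimation for large-scale data; He Anna et al.; Wireless Communication Technology (《无线通信技术》); 2017-04-30 (No. 4); 1-5 *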

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant