CN115480823A - Source code homologous detection method, device, equipment and storage medium - Google Patents


Info

Publication number
CN115480823A
Authority
CN
China
Prior art keywords
code
source code
fingerprint
function
base
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211010931.0A
Other languages
Chinese (zh)
Inventor
韩烨
孙治
毛得明
陈剑锋
赵童
王炳文
权赵恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 30 Research Institute
Original Assignee
CETC 30 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 30 Research Institute
Priority to CN202211010931.0A
Publication of CN115480823A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/75 Structural analysis for program understanding
    • G06F8/751 Code clone detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G06F16/137 Hash-based
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/14 Details of searching files based on file metadata
    • G06F16/148 File search processing
    • G06F16/152 File search processing using file content signatures, e.g. hash values

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Library & Information Science (AREA)
  • Collating Specific Patterns (AREA)

Abstract

The invention discloses a source code homology detection method, device, equipment and storage medium. The method comprises: constructing a code base; performing function-level slicing on the code in the code base, extracting fingerprint features of the code segments, and storing the fingerprint features in a code fingerprint information base; and performing function slicing on the source code to be detected, extracting fingerprint features of its code segments, and matching those fingerprint features against the code fingerprint information base to obtain a detection result. By combining a code fingerprint generation technique with a similar hash technique, the invention provides an efficient and accurate homologous code retrieval algorithm. It addresses the difficulty of applying existing source code similarity detection techniques to similar-code search in a massive code base, achieves high retrieval efficiency and accuracy, can quickly retrieve different types of similar code from a massive code base, and on that basis can find known defects and vulnerabilities that may exist in the code.

Description

Source code homologous detection method, device, equipment and storage medium
Technical Field
The invention relates to the technical field of code clone detection, in particular to a source code homologous detection method, a source code homologous detection device, source code homologous detection equipment and a storage medium.
Background
Source code homology analysis is an active topic in software engineering. During software development, developers reuse existing code by copying, pasting and modifying it (that is, code cloning) in order to speed up development. Studies have shown that cloned code accounts for approximately 7-24% of a typical software system, and for as much as 50% in some software. Excessive cloned code causes many problems for the maintenance of a software project; for example, reusing a code segment that contains an undiscovered bug introduces unknown security risks into the software. Efficiently detecting cloned code in a software system is therefore important for maintaining the security of the software supply chain.
With the rapid development of computer and Internet technologies, the number of software systems and mobile applications has grown explosively, which places higher demands on source code clone detection. Existing source code clone detection methods are fast and efficient on small-scale source code but become impractical on massive code. It is therefore of great significance to provide a homology analysis method suited to massive source code. Existing source code similarity detection techniques can accurately analyze the similarity of two or more code segments, but most of their analysis processes are complex and slow, so it is difficult to quickly search for homologous code segments in a massive code base and difficult to meet the requirements of source code homology analysis for the software supply chain.
The above is only for the purpose of assisting understanding of the technical aspects of the present invention, and does not represent an admission that the above is prior art.
Disclosure of Invention
The main object of the invention is to provide a source code homology detection method, device, equipment and storage medium, so as to solve the following technical problem: existing source code similarity detection techniques can accurately analyze the similarity of two or more code segments, but their analysis processes are mostly complex and slow, so that homologous code segments are difficult to retrieve quickly from a massive code base and the requirements of source code homology analysis for the software supply chain are difficult to meet.
In order to achieve the above object, the present invention provides a source code homology detection method, which includes the following steps:
constructing a code base; the code library stores codes taking functions as units;
executing function level slicing on codes in the code base, extracting fingerprint characteristics of code segments, and storing the fingerprint characteristics into a code fingerprint information base;
and executing function slicing on a source code to be detected, extracting fingerprint characteristics of the code segments, and executing fingerprint information matching in the code fingerprint information base by utilizing the fingerprint characteristics to obtain a detection result.
Optionally, the step of constructing a code base specifically includes:
acquiring a source code on an open source platform by using a source code acquisition service, and executing code division with a function as a unit on the source code;
and uploading the codes taking the function as a unit to an HDFS distributed file system to construct a code base.
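As an illustration of code division with a function as the unit, the following minimal sketch slices a Python source file into its functions using the standard `ast` module. The source does not describe a concrete slicing strategy for any language, so the function name `slice_functions` and the choice of `ast` are assumptions made for illustration; other languages would need their own parsers.

```python
import ast


def slice_functions(source):
    """Return (function name, function source text) pairs for a Python module.

    Hypothetical helper: one possible function-level slicing strategy for
    Python input; the patent does not prescribe a parser.
    """
    tree = ast.parse(source)
    functions = []
    for node in ast.walk(tree):
        # Both plain and async function definitions count as function units.
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            segment = ast.get_source_segment(source, node)
            if segment is not None:
                functions.append((node.name, segment))
    return functions


SAMPLE = '''
def add(a, b):
    return a + b

class Greeter:
    def greet(self, name):
        return "hello " + name
'''

units = slice_functions(SAMPLE)
for name, text in units:
    print(name, len(text.splitlines()))
```

Each returned pair could then be stored as one unit of the code base; the upload to HDFS is omitted here because it depends on the deployment.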
Optionally, after the step of performing function-level slicing, the method further includes: preprocessing the code segment; wherein the preprocessing includes removing comments, removing blank lines, and removing keywords.
Optionally, the step of extracting fingerprint features of the code segment specifically includes:
executing token conversion on the extracted code segments to obtain a token sequence;
and coding the token sequence by using a similar hash algorithm to obtain the fingerprint characteristics of the code segment.
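The preprocessing and token conversion steps can be sketched as follows for C-like input. The source only says that tokens are produced "based on a series of rules", so the regular expressions, the small `KEYWORDS` set and the function names below are all assumptions for illustration. The comment pattern is deliberately naive and does not handle comment markers inside string literals.

```python
import re

# Hypothetical, deliberately small keyword list for C-like code.
KEYWORDS = {"int", "return", "if", "else", "for", "while", "void", "char"}

# Block comments and line comments (naive: ignores string literals).
COMMENT_RE = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)
# Identifiers, integer literals, or any single punctuation character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\sA-Za-z_\d]")


def preprocess(code):
    """Remove comments and blank lines from a function body."""
    no_comments = COMMENT_RE.sub("", code)
    lines = [line.strip() for line in no_comments.splitlines()]
    return "\n".join(line for line in lines if line)


def tokenize(code):
    """Convert a function into a token sequence with keywords removed."""
    return [t for t in TOKEN_RE.findall(preprocess(code)) if t not in KEYWORDS]


SNIPPET = """
int add(int a, int b) {
    /* sum two values */

    return a + b;  // result
}
"""

tokens = tokenize(SNIPPET)
print(tokens)
```

The same functions would have to be applied both when building the fingerprint base and when processing code to be detected, since differing preprocessing would yield fingerprints that cannot be compared.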
Optionally, the step of encoding the token sequence by using a similar hash algorithm to obtain the fingerprint features of the code segments includes:
taking each token in the token sequence as a feature of the function, and assigning a weight to each feature by using a TF-IDF method;
converting each feature into a hash value by using an MD5 hash algorithm;
weighting the hash value of each feature bit by bit by using the weight of the feature, wherein a bit whose value is 0 is assigned the feature weight multiplied by -1, and a bit whose value is 1 is assigned the feature weight multiplied by +1;
accumulating the hash values bit by bit according to the weighted result;
and binarizing the bitwise accumulation result, wherein each bit whose accumulated value is positive is replaced with 1 and each bit whose accumulated value is negative is replaced with 0, the result serving as the characteristic fingerprint of the function.
Optionally, the step of performing fingerprint information matching in the code fingerprint information base by using the fingerprint features specifically includes:
writing query statements in the DSL query language provided by ElasticSearch to filter the code fingerprint information in the code fingerprint information base segment by segment;
writing a Hamming distance calculation script in the Painless scripting language provided by ElasticSearch, and calculating the actual Hamming distance between the function segment to be analyzed and each function segment in the code base that has not been filtered out;
and sorting the function segments in the code fingerprint information base whose fingerprints have a Hamming distance smaller than d from the fingerprint of the function segment to be analyzed, in ascending order of that Hamming distance, and returning them to the user as the retrieval result of homologous function segments of the function to be analyzed.
Optionally, the search result further includes a file name and a project name of the homologous function fragment returned to the user.
In addition, in order to achieve the above object, the present invention provides a source code homology detection apparatus, including:
the construction module is used for constructing a code base; the code base stores codes with functions as units;
the storage module is used for executing function level slicing on codes in the code base, extracting fingerprint characteristics of code segments and storing the fingerprint characteristics and codes of the code segments into a code fingerprint information base;
and the matching module is used for executing function slicing on the source code to be detected, extracting the fingerprint characteristics of the code segments, and executing fingerprint information matching in the code fingerprint information base by utilizing the fingerprint characteristics to obtain a detection result.
In addition, in order to achieve the above object, the present invention also provides a source code homology detection device, including: the system comprises a memory, a processor and a source code homologous detection method program which is stored on the memory and can run on the processor, wherein when the source code homologous detection method program is executed by the processor, the steps of the source code homologous detection method are realized.
In addition, in order to achieve the above object, the present invention further provides a storage medium, on which a source code homology detection method program is stored, and the source code homology detection method program realizes the steps of the source code homology detection method when being executed by a processor.
The embodiments of the invention provide a source code homology detection method, device, equipment and storage medium. The method comprises: constructing a code base; performing function-level slicing on the code in the code base, extracting fingerprint features of the code segments, and storing the fingerprint features in a code fingerprint information base; and performing function slicing on the source code to be detected, extracting fingerprint features of its code segments, and matching those fingerprint features against the code fingerprint information base to obtain a detection result. By combining a code fingerprint generation technique with a similar hash technique, the invention provides an efficient and accurate homologous code retrieval algorithm. It addresses the difficulty of applying existing source code similarity detection techniques to similar-code search in a massive code base, achieves high retrieval efficiency and accuracy, can quickly retrieve different types of similar code from a massive code base, and on that basis can find known defects and vulnerabilities that may exist in the code.
Drawings
FIG. 1 is a schematic structural diagram of a source code homology detection device according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart illustrating a source code homology detection method according to an embodiment of the present invention;
FIG. 3 is a block diagram of an embodiment of a source code homology detection method according to the present invention;
FIG. 4 is a diagram illustrating a code fingerprint generation process based on similar hashing according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating a flow of function-level homologous code retrieval in an embodiment of the present invention;
FIG. 6 is a block diagram of a source code homology detection apparatus according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
At present, in the related technical field, a source code similarity detection technology can accurately analyze the similarity of two or more code segments, but the analysis process is mostly complex, the detection speed is slow, homologous code segments are difficult to quickly search from a massive code library, and the requirement of software supply chain open source code homology analysis is difficult to meet.
To solve this problem, various embodiments of the source code homology detection method of the present invention are proposed. By combining a code fingerprint generation technique with a similar hash technique, the method provides an efficient and accurate homologous code retrieval algorithm. It addresses the difficulty of applying existing source code similarity detection techniques to similar-code search in a massive code base, achieves high retrieval efficiency and accuracy, can quickly retrieve different types of similar code from a massive code base, and on that basis can find known defects and vulnerabilities that may exist in the code.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a source code homology detection device according to an embodiment of the present invention.
The device may be a User Equipment (UE) such as a mobile phone, smartphone, laptop, digital broadcast receiver, Personal Digital Assistant (PDA), tablet computer (PAD), handheld device, vehicular device, wearable device, computing device or other processing device connected to a wireless modem, Mobile Station (MS), or the like. The device may be referred to as a user terminal, portable terminal, desktop terminal, etc.
Generally, the apparatus comprises: at least one processor 301, a memory 302, and a source code homology detection method program stored on the memory and executable on the processor, the source code homology detection method program being configured to implement the steps of the source code homology detection method as described above.
The processor 301 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so on. The processor 301 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 301 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 301 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. Processor 301 may also include an AI (Artificial Intelligence) processor for processing relevant source code homology detection method operations such that a source code homology detection method model may be trained autonomously, improving efficiency and accuracy.
Memory 302 may include one or more computer-readable storage media, which may be non-transitory. Memory 302 may also include high speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer-readable storage medium in memory 302 is used to store at least one instruction for execution by processor 301 to implement the source code homology detection method provided by the method embodiments herein.
In some embodiments, the terminal may further include: a communication interface 303 and at least one peripheral device. The processor 301, the memory 302 and the communication interface 303 may be connected by a bus or signal lines. Various peripheral devices may be connected to communication interface 303 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 304, a display screen 305, and a power source 306.
The communication interface 303 may be used to connect at least one peripheral device related to I/O (Input/Output) to the processor 301 and the memory 302. The communication interface 303 is used for receiving the movement tracks of the plurality of mobile terminals uploaded by the user and other data through the peripheral device. In some embodiments, processor 301, memory 302, and communication interface 303 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 301, the memory 302 and the communication interface 303 may be implemented on a single chip or circuit board, which is not limited by the embodiment.
The Radio Frequency circuit 304 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 304 communicates with a communication network and other communication devices through electromagnetic signals, so as to obtain the movement tracks and other data of a plurality of mobile terminals. The rf circuit 304 converts an electrical signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 304 comprises: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuitry 304 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, various generation mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 304 may further include NFC (Near Field Communication) related circuits, which are not limited in this application.
The display screen 305 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 305 is a touch display screen, the display screen 305 also has the ability to capture touch signals on or above the surface of the display screen 305. The touch signal may be input to the processor 301 as a control signal for processing. In this case, the display screen 305 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 305, disposed on the front panel of the electronic device; in other embodiments, there may be at least two display screens 305, respectively disposed on different surfaces of the electronic device or in a folded design; in still other embodiments, the display screen 305 may be a flexible display disposed on a curved surface or a folded surface of the electronic device. The display screen 305 may even be arranged in a non-rectangular irregular shape, that is, a shaped screen. The display screen 305 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), and the like.
The power supply 306 is used to supply power to various components in the electronic device. The power source 306 may be alternating current, direct current, disposable or rechargeable. When power source 306 comprises a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.
Those skilled in the art will appreciate that the configuration shown in FIG. 1 does not constitute a limitation of the source code homology detection device, which may include more or fewer components than shown, combine some components, or use a different arrangement of components.
An embodiment of the present invention provides a source code homology detection method, and referring to fig. 2, fig. 2 is a schematic flow diagram of an embodiment of the source code homology detection method according to the present invention.
In this embodiment, the source code homology detection method includes the following steps:
step S100: constructing a code base; the code base stores codes with functions as units;
step S200: executing function level slicing on codes in the code base, extracting fingerprint characteristics of code segments, and storing the fingerprint characteristics into a code fingerprint information base;
step S300: and executing function slicing on a source code to be detected, extracting fingerprint characteristics of the code segments, and executing fingerprint information matching in the code fingerprint information base by utilizing the fingerprint characteristics to obtain a detection result.
It should be noted that the source code homology detection method provided by this embodiment mainly includes three components, namely massive code library construction, code fingerprint generation, and similar code search, as shown in fig. 3. The massive code base construction part firstly collects source codes from various open source platforms through a source code acquisition service, and then uploads the processed codes to the HDFS distributed file system by taking a function as a unit. The code fingerprint generating part firstly carries out a series of preprocessing operations such as function level slicing on codes in a code base, and then extracts fingerprint characteristics aiming at function level code segments by using a code fingerprint characteristic extraction technology based on similar Hash, expresses the code segments by a section of binary code and stores the code segments into a code fingerprint information base. The similar code searching part firstly carries out function slicing and preprocessing on source codes uploaded by a user, generates fingerprint characteristics for each code segment, further queries function level clone segments of the codes uploaded by the user in a code fingerprint information base by utilizing the fingerprint characteristics, and returns query results to the user.
The source code homology detection of this embodiment realizes large-scale source code homology analysis based on fingerprint features. For open source projects written in programming languages such as C, C++, Java and Python, it aims to quickly retrieve Type-1, Type-2 and Type-3 similar code from a massive code base and, on that basis, to find known defects and vulnerabilities that may exist in the code.
(I) Massive code base construction technology
Large-scale source code homology analysis depends on obtaining massive amounts of source code. Many open source code hosting platforms, such as GitHub, Gitee, Bitbucket and SourceForge, currently exist on the Internet, and these platforms support the construction of a massive source code base. Subject to compliance with the applicable open source licenses (such as BSD, GPL, LGPL, MIT, MPL and EPL), open source projects in various programming languages are obtained from these platforms and used to construct the massive code base. The open source projects acquired from each platform are uploaded to the HDFS distributed file system in real time through the source code acquisition service.
(II) code fingerprint generation technology based on similar hash
After a large amount of open source project source codes are collected in real time by using a source code collecting service and the construction of a massive code base is completed, in order to realize function-level quick retrieval of the same-source codes, fingerprint features need to be generated for each function segment in the code base to serve as a basis for code query. The method and the device generate the fingerprint information of the code by using the code fingerprint generation technology based on the similar hash, and fully ensure the accuracy of a returned result while realizing quick retrieval.
The process of generating function level code fingerprint information for open source items in a massive code base by using a similar hash technology is shown in fig. 4.
First, the source code to be processed is read from HDFS at project granularity, and function-level slicing is performed on each project that is read. Projects written in different languages use different function slicing strategies. The function blocks produced by slicing are then preprocessed (removing comments, blank lines, keywords and the like) and converted into tokens based on a series of rules, so that each function is expressed as a token sequence that serves as the input of the similar hash algorithm. The similar hash algorithm then encodes the token sequence of the function into a 64-bit binary number, that is, the fingerprint of the function. It specifically includes the following 5 steps:
(1) Take each token in the function's token sequence as a feature of the function, and assign a weight to each feature using the TF-IDF method.
(2) Convert each feature from step (1) into a 64-bit hash value using the MD5 hash algorithm.
(3) Weight the hash value of each feature bit by bit using the feature weight from step (1): a bit whose value is 0 is assigned the feature weight multiplied by -1, and a bit whose value is 1 is assigned the feature weight multiplied by +1.
(4) Accumulate the hash values generated in step (2) bit by bit according to the weighted results of step (3).
(5) Binarize the bitwise accumulation result of step (4): replace each bit whose accumulated value is positive with 1 and each bit whose accumulated value is negative with 0. The result is the characteristic fingerprint of the function.
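The five steps above follow the pattern commonly known as SimHash and can be sketched as below. Several details are assumptions, since the source does not specify them: MD5 yields a 128-bit digest, so this sketch keeps its first 8 bytes to obtain 64 bits; the TF-IDF variant is a smoothed one; repeated tokens are merged into one feature whose weight reflects their frequency; and an accumulated value of exactly zero maps to 0. The names `tfidf_weights`, `hash64` and `simhash` are hypothetical.

```python
import hashlib
import math
from collections import Counter


def tfidf_weights(tokens, doc_freq, n_docs):
    """Step (1): one TF-IDF weight per distinct token.

    doc_freq maps a token to the number of functions containing it;
    n_docs is the number of functions in the corpus (both assumed inputs).
    """
    counts = Counter(tokens)
    total = len(tokens)
    weights = {}
    for tok, count in counts.items():
        tf = count / total
        # Smoothed IDF, always >= 1 so every weight stays positive.
        idf = math.log((1 + n_docs) / (1 + doc_freq.get(tok, 0))) + 1.0
        weights[tok] = tf * idf
    return weights


def hash64(token):
    """Step (2): 64-bit hash from MD5 (first 8 digest bytes, an assumption)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(weights, bits=64):
    """Steps (3)-(5): weight each bit, accumulate, then binarize."""
    acc = [0.0] * bits
    for tok, w in weights.items():
        h = hash64(tok)
        for i in range(bits):
            # Bit 1 contributes +w, bit 0 contributes -w.
            acc[i] += w if (h >> i) & 1 else -w
    fingerprint = 0
    for i, value in enumerate(acc):
        if value > 0:
            fingerprint |= 1 << i
    return fingerprint


tokens = ["add", "(", "a", ",", "b", ")", "{", "a", "+", "b", ";", "}"]
fp = simhash(tfidf_weights(tokens, doc_freq={}, n_docs=1))
print(f"{fp:064b}")
```

Because each output bit is decided by a weighted vote, small edits to a function change only a few votes, so near-identical functions tend to produce fingerprints that differ in few bits.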
Using this method, a characteristic fingerprint is generated for each function of the open source projects in the massive code base and stored in the code fingerprint information base. The code fingerprint information base is built on the ElasticSearch platform so that homologous code can be searched in the massive code base by fingerprint features. The code fingerprint information of each function is saved as a document in ElasticSearch, and separate indexes are created in ElasticSearch for different programming languages. In addition to the fingerprint information of the function, each document contains key information such as the file name and the project name of the function.
(III) homologous code retrieval technology based on ElasticSearch
When searching for homologous code, the Hamming distance between code fingerprint features is used as the measure of code similarity: the smaller the Hamming distance between a pair of code segments, the higher their similarity.
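A minimal sketch of this similarity measure: the Hamming distance between two fingerprints is the number of bit positions in which they differ, which is the population count of their XOR. The helper names and the toy 4-bit fingerprints are illustrative only, and the sketch keeps candidates whose distance does not exceed the threshold d, which is one reading of the threshold.

```python
def hamming(a, b):
    """Number of differing bits between two integer fingerprints."""
    return bin(a ^ b).count("1")


def rank_by_distance(query_fp, candidates, d):
    """Return (name, distance) pairs within distance d, closest first.

    candidates maps a hypothetical function identifier to its fingerprint.
    """
    scored = [(name, hamming(query_fp, fp)) for name, fp in candidates.items()]
    kept = [pair for pair in scored if pair[1] <= d]
    return sorted(kept, key=lambda pair: pair[1])


query = 0b1011
base = {"f1": 0b1011, "f2": 0b1001, "f3": 0b0100}
print(rank_by_distance(query, base, d=1))  # [('f1', 0), ('f2', 1)]
```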
When homology analysis is performed on program source code, a user uploads the project source code to be analyzed to the system through a front-end page. The back end of the system analyzes and processes the code, and finally returns to the user the function-level code segments in the massive code base that have a clone relationship with the uploaded code. The specific process is shown in fig. 5 and comprises the following steps:
(1) Perform function-level slicing on the project uploaded by the user to generate function-level code segments.
(2) Preprocess the function-level code segments by removing comments, blank lines, keywords and the like. The method is the same as the one applied to the function segments of the massive code base when the code fingerprint information base was constructed, which ensures that similar codes are retrieved correctly.
(3) Perform token conversion on the preprocessed function segments, so that each function segment in the project uploaded by the user is represented by a token sequence.
(4) Generate a fingerprint feature for each function segment using the code fingerprint generation technique based on similar hash.
(5) Divide the 64-bit hash code corresponding to the fingerprint feature into 8 segments from the high bits to the low bits, and write a query statement in the DSL query language provided by ElasticSearch to filter the code fingerprint information in the code fingerprint information base segment by segment, excluding function fingerprints whose Hamming distance from the function segment to be analyzed is larger than d. The value of d can be chosen by the user to accommodate different homology analysis requirements. If only type-1 clones of the function segment to be analyzed (similar at the text level) need to be retrieved, d can be 0 or 1; if type-2 and type-3 clones (similar at the lexical or syntactic level) need to be retrieved, d can be 8 to 10.
(6) Write a Hamming distance calculation script in the Painless scripting language provided by ElasticSearch, and calculate the actual Hamming distance between the function segment to be analyzed and each function segment in the code base that was not filtered out.
(7) Redefine the relevance sorting rule of the ElasticSearch query results: the function segments in the code fingerprint information base whose fingerprints are at a Hamming distance smaller than d from the fingerprint of the function segment to be analyzed are sorted in ascending order of that Hamming distance and returned to the user as the homologous function segments of the function to be analyzed. The returned information includes key information such as the file name and the project name of each homologous function segment.
(8) Summarize the analysis results at the project level to generate a homology analysis report.
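Steps (5) to (7) can be illustrated in plain Python, without ElasticSearch. The following is a minimal sketch under stated assumptions, not the patented implementation: it assumes that segmented filtering keeps a candidate only when at least one of its 8 segments equals the corresponding segment of the query fingerprint, and it treats the threshold as inclusive (distance ≤ d) so that d = 0 still returns exact matches. The record fields `fingerprint`, `file` and `project` and all function names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

SEGMENTS = 8            # number of segments, as in step (5)
SEGMENT_BITS = 8        # 64 bits / 8 segments
MASK64 = (1 << 64) - 1  # keeps values inside 64 bits


def split_segments(fingerprint: int) -> List[int]:
    """Split a 64-bit fingerprint into 8 segments, from high bits to low bits."""
    return [
        (fingerprint >> (SEGMENT_BITS * (SEGMENTS - 1 - i))) & 0xFF
        for i in range(SEGMENTS)
    ]


def hamming_distance(a: int, b: int) -> int:
    """Count the differing bits between two 64-bit fingerprints."""
    return bin((a ^ b) & MASK64).count("1")


def find_homologous(
    query_fp: int, records: Iterable[Dict], d: int
) -> List[Tuple[int, Dict]]:
    """Return (distance, record) pairs sorted by ascending Hamming distance."""
    query_segments = split_segments(query_fp)
    hits = []
    for record in records:
        candidate_segments = split_segments(record["fingerprint"])
        # Step (5), assumed reading: keep only candidates that share
        # at least one identical segment with the query fingerprint.
        if not any(q == c for q, c in zip(query_segments, candidate_segments)):
            continue
        # Step (6): exact Hamming distance for the surviving candidates.
        distance = hamming_distance(query_fp, record["fingerprint"])
        if distance <= d:  # inclusive threshold is an assumption of this sketch
            hits.append((distance, record))
    # Step (7): closest fingerprints first.
    hits.sort(key=lambda hit: hit[0])
    return hits
```

Under this reading, two fingerprints that differ in at most 7 bits must agree on at least one of the 8 segments (pigeonhole principle), so the coarse filter loses no match for d ≤ 7. For larger d it acts as a heuristic that may drop some candidates.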
This embodiment provides a source code homology detection method. The fingerprint-feature-based large-scale source code homology analysis method can retrieve similar code in a massive code base, which solves the technical problem that traditional source code similarity detection techniques scale poorly. Meanwhile, constructing the massive code base greatly expands the range over which similar codes can be queried, which helps obtain a more comprehensive homology analysis result. In addition, the code fingerprint generation technique based on similar hash greatly improves the speed of data processing and searching while maintaining the accuracy of similar code retrieval.
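The similar-hash fingerprint mentioned above (an algorithm commonly known as SimHash) can be sketched as follows. This is an illustrative sketch rather than the authors' code: it assumes the 64-bit fingerprint is built from the first 8 bytes of each token's MD5 digest, and it falls back to raw term frequency as the weight when no inverse-document-frequency table is supplied. The names `token_hash64` and `simhash64` and the `idf` parameter are hypothetical.

```python
import hashlib
from collections import Counter
from typing import Dict, Optional, Sequence

FINGERPRINT_BITS = 64


def token_hash64(token: str) -> int:
    """Hash a token with MD5 and keep the first 8 bytes (assumed truncation)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash64(tokens: Sequence[str], idf: Optional[Dict[str, float]] = None) -> int:
    """Compute a 64-bit similar-hash fingerprint for a token sequence."""
    term_frequency = Counter(tokens)
    accumulator = [0.0] * FINGERPRINT_BITS
    for token, count in term_frequency.items():
        # TF-IDF style weight; IDF defaults to 1.0 when no table is given.
        weight = count * (idf.get(token, 1.0) if idf else 1.0)
        token_hash = token_hash64(token)
        for bit in range(FINGERPRINT_BITS):
            # A 1 bit contributes +weight, a 0 bit contributes -weight.
            if (token_hash >> bit) & 1:
                accumulator[bit] += weight
            else:
                accumulator[bit] -= weight
    fingerprint = 0
    for bit in range(FINGERPRINT_BITS):
        # Positive accumulated value -> 1, otherwise 0.
        if accumulator[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint
```

Because every token only nudges each bit up or down by its weight, functions that share most of their heavily weighted tokens tend to receive fingerprints that differ in few bits, which is what makes a Hamming distance threshold usable for retrieval.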
Referring to fig. 6, fig. 6 is a block diagram illustrating an embodiment of a source code homology detection apparatus according to the present invention.
As shown in fig. 6, the source code homology detection apparatus according to the embodiment of the present invention includes:
a building module 10, configured to build a code base, wherein the code base stores code in units of functions;
a storage module 20, configured to perform function-level slicing on the code in the code base, extract fingerprint features of the code segments, and store the fingerprint features in a code fingerprint information base; and
a matching module 30, configured to perform function-level slicing on the source code to be detected, extract fingerprint features of the code segments, and perform fingerprint information matching in the code fingerprint information base using the fingerprint features to obtain a detection result.
Other embodiments or specific implementation manners of the source code homology detection device of the present invention may refer to the above method embodiments, and are not described herein again.
In addition, an embodiment of the present invention further provides a storage medium on which a source code homology detection method program is stored. When executed by a processor, the program implements the steps of the source code homology detection method described above, so a detailed description is omitted here. Likewise, the beneficial effects shared with the method are not described again. For technical details not disclosed in the embodiments of the computer-readable storage medium of the present application, refer to the description of the method embodiments of the present application. By way of example, the program instructions may be deployed to be executed on one computing device, or on multiple computing devices located at one site or distributed across multiple sites and interconnected by a communication network.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
It should be noted that the above-described embodiments of the apparatus are merely schematic, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. In addition, in the drawings of the embodiment of the apparatus provided by the present invention, the connection relationship between the modules indicates that there is a communication connection between them, and may be specifically implemented as one or more communication buses or signal lines. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present invention may be implemented by software plus necessary general hardware, and may also be implemented by special purpose hardware including special purpose integrated circuits, special purpose CPUs, special purpose memories, special purpose components and the like. Generally, functions performed by computer programs can be easily implemented by corresponding hardware, and specific hardware structures for implementing the same functions may be various, such as analog circuits, digital circuits, or dedicated circuits. However, the implementation of a software program is a more preferable embodiment for the present invention. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a readable storage medium, such as a floppy disk, a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk of a computer, and includes instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.

Claims (10)

1. A source code homology detection method, the method comprising the steps of:
constructing a code base; the code library stores codes taking functions as units;
executing function level slicing on codes in the code base, extracting fingerprint characteristics of code segments, and storing the fingerprint characteristics into a code fingerprint information base;
and executing function slicing on a source code to be detected, extracting fingerprint characteristics of the code segments, and executing fingerprint information matching in the code fingerprint information base by utilizing the fingerprint characteristics to obtain a detection result.
2. The source code homology detection method according to claim 1, wherein the step of constructing a code base specifically includes:
acquiring a source code on an open source platform by using a source code acquisition service, and executing code division with a function as a unit on the source code;
and uploading the codes taking the function as a unit to an HDFS distributed file system to construct a code base.
3. The source code homology detection method of claim 1, wherein after the step of performing function-level slicing, the method further comprises: preprocessing the code segment; wherein the preprocessing includes removing comments, removing blank lines, and removing keywords.
4. The method for detecting the homology of a source code according to claim 1, wherein the step of extracting the fingerprint feature of the code segment specifically comprises:
executing token conversion on the extracted code segments to obtain a token sequence;
and coding the token sequence by using a similar hash algorithm to obtain the fingerprint characteristics of the code segment.
5. The source code homology detection method according to claim 4, wherein the step of encoding the token sequence by using a similar hash algorithm to obtain the fingerprint characteristics of the code segment specifically includes:
taking each token in the token sequence as a feature of the function, and assigning a weight to each feature by using a TF-IDF method;
converting each feature into a hash value by using an MD5 hash algorithm;
weighting the hash value of each feature bit by bit with the weight of the feature, wherein each bit equal to 0 is assigned the feature weight multiplied by minus 1 and each bit equal to 1 is assigned the feature weight multiplied by plus 1;
accumulating the weighted results of all features bit by bit;
and binarizing the bitwise accumulation result, replacing each bit whose accumulated value is positive with 1 and each bit whose accumulated value is negative with 0, the result serving as the feature fingerprint of the function.
6. The source code homology detection method according to claim 5, wherein the step of performing fingerprint information matching in the code fingerprint information base by using the fingerprint features specifically comprises:
writing query statements by using a DSL query language provided by ElasticSearch to carry out segmented filtration on the code fingerprint information in a code fingerprint information base;
compiling a Hamming distance calculation script by using a Painless script compiling language provided by ElasticSearch, and calculating the actual Hamming distance between a function segment to be analyzed and a function segment which is not filtered in a code base;
and sorting function segments with the Hamming distance smaller than d from the fingerprint information of the function segments to be analyzed in the code fingerprint information base according to the sequence of the Hamming distances from the function segments to be analyzed from small to large, and returning the function segments to the user as the retrieval result of the homologous function segments of the functions to be analyzed.
7. The method of claim 6, wherein the retrieval result returned to the user further comprises the file name and the project name of each homologous function segment.
8. A source code homology detection apparatus, comprising:
the construction module is used for constructing a code base; the code library stores codes taking functions as units;
the storage module is used for executing function level slicing on codes in the code base, extracting fingerprint characteristics of code segments and storing the fingerprint characteristics into a code fingerprint information base;
and the matching module is used for executing function slicing on the source code to be detected, extracting the fingerprint characteristics of the code segments, and executing fingerprint information matching in the code fingerprint information base by utilizing the fingerprint characteristics to obtain a detection result.
9. A source code homology detection device, characterized in that the source code homology detection device comprises: a memory, a processor, and a source code homology detection method program stored on the memory and executable on the processor, wherein the program, when executed by the processor, implements the steps of the source code homology detection method according to any one of claims 1 to 7.
10. A storage medium, on which a source code homology detection method program is stored, the source code homology detection method program implementing the steps of the source code homology detection method according to any one of claims 1 to 7 when executed by a processor.
CN202211010931.0A 2022-08-22 2022-08-22 Source code homologous detection method, device, equipment and storage medium Pending CN115480823A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211010931.0A CN115480823A (en) 2022-08-22 2022-08-22 Source code homologous detection method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115480823A true CN115480823A (en) 2022-12-16

Family

ID=84421814

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211010931.0A Pending CN115480823A (en) 2022-08-22 2022-08-22 Source code homologous detection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115480823A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116340185A (en) * 2023-05-19 2023-06-27 国网数字科技控股有限公司 Method, device and equipment for analyzing software open source code components
CN116340185B (en) * 2023-05-19 2023-09-01 国网数字科技控股有限公司 Method, device and equipment for analyzing software open source code components

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination