CN111666101A - Software homologous analysis method and device - Google Patents

Software homologous analysis method and device

Publication number
CN111666101A
Authority
CN
China
Prior art keywords
source code
file
homologous
code file
software
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010335325.0A
Other languages
Chinese (zh)
Inventor
高庆
张世琨
肖华
马森
岳贯集
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University
Priority to CN202010335325.0A
Publication of CN111666101A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00: Arrangements for software engineering
    • G06F8/70: Software maintenance or management
    • G06F8/74: Reverse engineering; Extracting design information from source code
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00: Arrangements for software engineering
    • G06F8/70: Software maintenance or management
    • G06F8/75: Structural analysis for program understanding
    • G06F8/751: Code clone detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Stored Programmes (AREA)

Abstract

Embodiments of the invention provide a software homology analysis method and device. The method comprises: obtaining a source code database that includes feature information of reference source code files and the creation time of each reference source code file; obtaining a target source code file of target software, the target source code file having its own feature information; determining candidate homologous files of the target source code file based on the result of matching the feature information of the reference source code files against the feature information of the target source code file; taking the candidate homologous file with the earliest creation time as the final homologous file corresponding to the target source code file; and determining a software homology analysis result according to the final homologous file. The method addresses the analysis errors caused by software propagation and improves the precision of software homology analysis.

Description

Software homologous analysis method and device
Technical Field
The invention relates to the technical field of computers, in particular to a software homology analysis method and device.
Background
With the development of software technology, copying and referencing software has become common. At the code level, copying a piece of code and using it in a new context, with or without modification, is called code cloning; at the software level, copying and referencing a software system is called software homology. Software homology analysis examines the source code and functionality of the software under test to find identical or similar software and thereby obtain a homology analysis result.
Prior-art software homology analysis methods search directly for homologous source code files, then collect all of the homologous source code files found and analyze their parameter information to obtain the software homology analysis result.
The prior art cannot address the errors that software propagation introduces into the analysis. Software propagation means that reference relationships among software projects are not one-to-one but one-to-many or many-to-one, and that these relationships are transitive. Software propagation makes homology analysis difficult. For example, if project A references project C and project B also references project C, then projects A and B share some code that originates from project C. If an existing software homology analysis method is used to analyze project A, similar components will be detected in every project that directly or indirectly references project C; if all of these projects are treated as sources of project A's code components, the homology analysis result will contain large errors.
Disclosure of Invention
Embodiments of the present invention provide a method and apparatus for software homology analysis that overcome the above-mentioned problems, or at least partially solve the above-mentioned problems.
In a first aspect, an embodiment of the present invention provides a software homology analysis method, including: obtaining a source code database, the source code database comprising: the characteristic information of the reference source code file and the creation time of the reference source code file; acquiring a target source code file of target software, wherein the target source code file comprises: characteristic information of the target source code file; determining alternative homologous files of the target source code file based on the matching result of the characteristic information of the reference source code file and the characteristic information of the target source code file; taking the corresponding alternative homologous file with the earliest creation time as a final homologous file corresponding to the target source code file; and determining a software homology analysis result according to the final homology file.
In some embodiments, determining the candidate homologous files of the target source code file based on the result of matching the feature information of the reference source code file against the feature information of the target source code file includes: determining a first type of candidate homologous files of the target source code file based on the result of matching the original MD5 feature information of the reference source code file against the original MD5 feature information of the target source code file; determining a second type of candidate homologous files of the target source code file based on the result of matching the comment- and whitespace-stripped MD5 feature information of the reference source code file against the comment- and whitespace-stripped MD5 feature information of the target source code file; and determining the candidate homologous files according to the first type of candidate homologous files and the second type of candidate homologous files.
In some embodiments, the comment- and whitespace-stripped MD5 feature information is generated according to the comment rules of the respective programming languages.
In some embodiments, the source code database further comprises: the project name of the reference source code file and the version information of the reference source code file; determining a software homology analysis result according to the final homology file, wherein the determining comprises the following steps: and determining a software homology analysis result according to the creation time of the final homologous file, the project name of the final homologous file and the version information of the final homologous file.
In some embodiments, obtaining the source code database comprises: acquiring the one million top-ranked open source projects on GitHub; extracting the feature information, creation time, project name and version information of the open source projects; taking the source code files in the open source projects as the reference source code files; and constructing the source code database based on the reference source code files.
In some embodiments, determining a software homology analysis result according to the final homologous file includes: where the target software comprises multiple target source code files, aggregating the final homologous files corresponding to the multiple target source code files and determining the software homology analysis result.
In a second aspect, an embodiment of the present invention provides a software homology analysis apparatus, including: a source code database obtaining unit, configured to obtain a source code database, the source code database including the feature information of the reference source code file and the creation time of the reference source code file; a target source code file obtaining unit, configured to obtain a target source code file of target software, the target source code file including feature information of the target source code file; a candidate homologous file determining unit, configured to determine candidate homologous files of the target source code file based on the result of matching the feature information of the reference source code file against the feature information of the target source code file; a final homologous file determining unit, configured to take the candidate homologous file with the earliest creation time as the final homologous file corresponding to the target source code file; and a software homology analysis result determining unit, configured to determine a software homology analysis result according to the final homologous file.
In some embodiments, the candidate homologous file determining unit includes: a first candidate homologous file determining subunit, configured to determine a first type of candidate homologous files of the target source code file based on the result of matching the original MD5 feature information of the reference source code file against the original MD5 feature information of the target source code file; a second candidate homologous file determining subunit, configured to determine a second type of candidate homologous files of the target source code file based on the result of matching the comment- and whitespace-stripped MD5 feature information of the reference source code file against the comment- and whitespace-stripped MD5 feature information of the target source code file; and a third candidate homologous file determining subunit, configured to determine the candidate homologous files according to the first type of candidate homologous files and the second type of candidate homologous files.
In a third aspect, an embodiment of the present invention provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the software homology analysis method provided in any one of the possible implementation schemes of the first aspect when executing the program.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the software homology analysis method provided in any one of the possible implementations of the first aspect.
According to the software homology analysis method, the software homology analysis device, the electronic device and the non-transitory computer-readable storage medium provided by the embodiments of the present invention, the candidate homologous file with the earliest creation time is taken as the final homologous file corresponding to the target source code file and is then used as the object of software homology analysis. This addresses the analysis errors caused by software propagation and improves the precision of software homology analysis.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flow chart of a software homology analysis method according to an embodiment of the present invention;
FIG. 2 is a flowchart of determining alternative files for software homology analysis according to an embodiment of the present invention;
FIG. 3 is a flowchart of constructing a source code database according to the software homology analysis method of the embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a software homology analysis device according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an alternative homologous file determining unit of the software homology analysis device according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The software homology analysis method according to the embodiment of the present invention is described below with reference to fig. 1 to 3.
As shown in fig. 1, the software homology analysis method according to the embodiment of the present invention includes the following steps S100 to S500.
Step S100, obtaining a source code database, wherein the source code database comprises: the characteristic information of the reference source code file and the creation time of the reference source code file.
It can be understood that the source code database is obtained by extracting reference source code files from a large number of open source projects and serves as the comparison library for software homology analysis. The source code database includes the feature information of each reference source code file; the feature information acts as an identifier for the file and is the basis for retrieval and matching. The source code database also includes the creation time of each reference source code file.
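As a minimal illustrative sketch (not part of the disclosed embodiments), the comparison library described above can be pictured as an index from feature information to the reference files that carry it. The class and field names `ReferenceRecord`, `SourceCodeDatabase`, `feature`, `created` and `path` are hypothetical, and an in-memory dictionary stands in for whatever storage an actual implementation would use.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ReferenceRecord:
    """One reference source code file (all field names are hypothetical)."""
    feature: str       # feature information, e.g. a hex digest of the file
    created: datetime  # creation time of the reference source code file
    path: str          # location of the file inside its project


class SourceCodeDatabase:
    """In-memory comparison library keyed by feature information."""

    def __init__(self):
        self._by_feature = defaultdict(list)

    def add(self, record):
        # Several reference files may share the same feature information.
        self._by_feature[record.feature].append(record)

    def lookup(self, feature):
        # Return every reference file whose feature information matches.
        return list(self._by_feature.get(feature, []))
```

Keying the index by feature information makes the later matching step a plain dictionary lookup.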
Step S200, obtaining a target source code file of the target software, wherein the target source code file comprises: characteristic information of the target source code file.
It can be understood that the target software is the software on which homology analysis is to be performed. The target source code files are extracted according to their file name suffixes, such as .c, .cpp, .java, .js, .php and .py, and each target source code file has feature information that acts as an identifier for that file.
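A minimal sketch of suffix-based extraction is given below, assuming the target software is available as a directory tree on disk. The function name `collect_source_files` and the constant `SOURCE_SUFFIXES` are hypothetical; the suffix set contains only the examples named in the text and would need extending in practice.

```python
from pathlib import Path

# Suffixes named in the description; a real deployment would likely add more.
SOURCE_SUFFIXES = {".c", ".cpp", ".java", ".js", ".php", ".py"}


def collect_source_files(root):
    """Return all files under `root` whose suffix marks them as source code."""
    root = Path(root)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in SOURCE_SUFFIXES
    )
```

Lower-casing the suffix is a design choice of this sketch so that files such as `tool.PY` are not missed.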
Step S300, determining candidate homologous files of the target source code file based on the result of matching the feature information of the reference source code files against the feature information of the target source code file.
It should be noted that the source code database may contain reference source code files whose feature information matches that of the target source code file. The feature information of the target source code file is used to search the source code database, and the search results, that is, the matching reference source code files, are taken as the candidate homologous files of the target source code file.
Step S400, taking the candidate homologous file with the earliest creation time as the final homologous file corresponding to the target source code file.
It should be noted that step S300 may match multiple candidate homologous files and that, because of software propagation, these candidates may reference one another directly or indirectly. A time tracing algorithm is therefore applied: among the multiple candidate homologous files, the one with the earliest creation time is taken as the final homologous file corresponding to the target source code file, which eliminates the errors caused by multiple references.
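The time tracing step can be sketched as a minimum over creation times. This is only an illustration under stated assumptions: candidates are represented as hypothetical `(project_name, created)` tuples, and the function name `pick_final_homologous` is invented for this example.

```python
from datetime import datetime


def pick_final_homologous(candidates):
    """Return the candidate with the earliest creation time, or None if empty.

    `candidates` is a list of (project_name, created) tuples (hypothetical shape).
    """
    if not candidates:
        return None
    return min(candidates, key=lambda candidate: candidate[1])
```

For instance, if a target file matches files in projects A, B and C, and the file in project C was created first, only project C is reported as the origin; projects A and B, which merely reference C, are discarded. The dates used to exercise the function would be made-up values, not data from the patent.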
Step S500, determining a software homology analysis result according to the final homologous file.
That is, the software homology analysis result is determined according to the final homologous file, namely the matched reference source code file with the earliest creation time.
In practical applications, the source code database further includes the project name of the reference source code file and the version information of the reference source code file. Accordingly, determining the software homology analysis result according to the final homologous file includes: determining the software homology analysis result according to the creation time of the final homologous file, the project name of the final homologous file and the version information of the final homologous file.
It can be understood that the final homologous file of the target source code file carries parameters such as creation time, project name and version information. Once the final homologous file has been obtained in step S400, these parameters are analyzed further, and the creation time, project name and version information of the final homologous file are presented as text or as a table to serve as the software homology analysis result.
According to the embodiment of the invention, the corresponding alternative homologous file with the earliest creation time is used as the final homologous file corresponding to the target source code file and further used as the object of the software homologous analysis, so that the analysis result error caused by software propagation is solved and the precision of the software homologous analysis is improved in the software homologous analysis process.
As shown in FIG. 2, in some embodiments, step S300, determining candidate homologous files of the target source code file based on the result of matching the feature information of the reference source code files against the feature information of the target source code file, comprises steps S310 to S330.
Step S310, a first type of alternative homologous files of the target source code file are determined based on the matching result of the original MD5 characteristic information of the reference source code file and the original MD5 characteristic information of the target source code file.
It should be noted that MD5 (the MD5 Message-Digest Algorithm) is a widely used cryptographic hash function that produces a 128-bit (16-byte) hash value and is commonly used to verify the integrity of transmitted messages. The original MD5 feature information of the target source code file is used to search the source code database for matching reference source code files, and the search results, that is, the matching reference source code files, are taken as the first type of candidate homologous files.
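As a minimal sketch of the original MD5 feature information, the digest can be computed over the unmodified bytes of a file with Python's standard `hashlib` module. The function name `raw_md5` is hypothetical.

```python
import hashlib


def raw_md5(data: bytes) -> str:
    """Return the MD5 digest of the unmodified file bytes as 32 hex characters."""
    return hashlib.md5(data).hexdigest()
```

Because the digest covers every byte, two files match on this feature only if they are byte-for-byte identical; even a changed space produces a different value.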
Step S320, determining a second type of candidate homologous files of the target source code file based on the result of matching the comment- and whitespace-stripped MD5 feature information of the reference source code file against the comment- and whitespace-stripped MD5 feature information of the target source code file.
The comment- and whitespace-stripped MD5 feature information is generated according to the comment rules of the respective programming languages.
It should be noted that the comment- and whitespace-stripped MD5 feature information is generated after the whitespace characters and comments in the code file have been removed according to the comment rules of the file's programming language. The comment- and whitespace-stripped MD5 feature information of the target source code file is used to search the source code database for matching reference source code files, and the search results, that is, the matching reference source code files, are taken as the second type of candidate homologous files.
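The normalization described above might be sketched as follows. This is a deliberately simplified illustration, not the patented rule set: it assumes only two comment styles (C-style `//` and `/* */`, and `#` line comments), ignores the case of comment markers inside string literals, and treats PHP as C-style only. The names `normalized_md5`, `_COMMENT_RULES` and `_SUFFIX_TO_RULE` are hypothetical.

```python
import hashlib
import re

# Simplified per-language comment rules (an assumption of this sketch).
_COMMENT_RULES = {
    "c_style": re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL),
    "hash_style": re.compile(r"#[^\n]*"),
}
_SUFFIX_TO_RULE = {
    ".c": "c_style", ".cpp": "c_style", ".java": "c_style",
    ".js": "c_style", ".php": "c_style", ".py": "hash_style",
}


def normalized_md5(text, suffix):
    """MD5 of the source text after removing comments and all whitespace."""
    rule = _COMMENT_RULES[_SUFFIX_TO_RULE[suffix]]
    without_comments = rule.sub("", text)
    without_whitespace = re.sub(r"\s+", "", without_comments)
    return hashlib.md5(without_whitespace.encode("utf-8")).hexdigest()
```

With this feature, two copies of a file that differ only in comments or formatting produce the same digest, so a reformatted or re-commented copy can still be matched.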
Step S330, determining the candidate homologous files according to the first type of candidate homologous files and the second type of candidate homologous files.
It can be understood that, once the first type of candidate homologous files has been obtained in step S310 and the second type of candidate homologous files has been obtained in step S320, the two types are merged to obtain the candidate homologous files.
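Steps S310 to S330 can be sketched together as two lookups followed by a merge. The sketch assumes two hypothetical dictionaries, `raw_index` and `normalized_index`, that map digests to lists of reference file identifiers; the merge is implemented as an order-preserving union, which is one reasonable reading of "merged" rather than a detail confirmed by the source.

```python
def find_candidates(raw_digest, normalized_digest, raw_index, normalized_index):
    """Look up both feature types and merge the results without duplicates."""
    first_type = raw_index.get(raw_digest, [])                  # step S310
    second_type = normalized_index.get(normalized_digest, [])   # step S320
    merged, seen = [], set()
    for file_id in list(first_type) + list(second_type):        # step S330
        if file_id not in seen:
            seen.add(file_id)
            merged.append(file_id)
    return merged
```

A file that matches on both features appears only once in the merged list.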
According to the embodiment of the invention, the original MD5 characteristic information and the de-annotated and de-blank symbol MD5 characteristic information are respectively searched in the source code database, so that the searching process is more accurate, and the software homology analysis precision is higher.
As shown in FIG. 3, in some embodiments, step S100, obtaining a source code database, comprises the following steps S110 to S140.
Step S110, acquiring the one million top-ranked open source projects on GitHub.
It is worth mentioning that GitHub is a hosting platform for open source and private software projects; it is named GitHub because it supports Git as its only repository format. Users can readily find a massive amount of open source code on GitHub. Here, the one million top-ranked open source projects on GitHub are obtained.
Step S120, extracting the feature information, creation time, project name and version information of the open source projects.
It can be understood that the open source projects on GitHub have feature information, creation times, project names and version information, and these parameters are extracted here.
Step S130, taking the source code files in the open source projects as the reference source code files.
Step S140, constructing the source code database based on the reference source code files.
It can be understood that the source code files in the open source projects on GitHub are used as the reference source code files mentioned in step S100, and the source code database is constructed on the basis of these reference source code files.
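Steps S120 to S140 might be sketched as below. The sketch does not download anything from GitHub; it assumes the projects have already been fetched and are described by hypothetical in-memory dictionaries with the keys `name`, `version`, `created` and `files`. Using the project's creation time for each of its files, and the raw MD5 digest as the only feature, are simplifying assumptions of this example.

```python
import hashlib
from datetime import datetime


def build_database(projects):
    """Index every source file of every project by its raw MD5 digest."""
    database = {}
    for project in projects:
        for path, content in project["files"].items():
            digest = hashlib.md5(content).hexdigest()
            database.setdefault(digest, []).append({
                "project": project["name"],
                "version": project["version"],
                "created": project["created"],
                "path": path,
            })
    return database
```

A digest shared by several projects then maps to several records, which is exactly the situation the earliest-creation-time rule is meant to resolve.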
According to the embodiment of the invention, the open source project is acquired from the GitHub, so that the establishment process of the source code database is more convenient and quicker, the acquired source code database is richer, and the software homologous analysis precision is higher.
In some embodiments, step S500, determining the software homology analysis result according to the final homologous file, includes: where the target software comprises multiple target source code files, aggregating the final homologous files corresponding to the multiple target source code files and determining the software homology analysis result.
It can be understood that the target software is composed of multiple target source code files and that, on the basis of the above embodiments, each target source code file corresponds to one final homologous file. The final homologous files corresponding to the multiple target source code files are aggregated here, and their relevant parameters are presented as text or as a table to serve as the software homology analysis result.
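The aggregation step could be sketched as follows. The input shape is an assumption: `final_by_file` is a hypothetical dictionary mapping each target file path to its final homologous file (a dictionary with `project`, `version` and `created` keys), or to `None` when no match was found. The per-project count is an extra summary added for illustration and is not described in the source.

```python
from collections import Counter


def summarize(final_by_file):
    """Build table rows and a per-project count from per-file final matches."""
    rows = []
    for target_path, final in sorted(final_by_file.items()):
        if final is None:
            continue  # target file with no homologous reference file
        rows.append(
            (target_path, final["project"], final["version"], final["created"])
        )
    per_project = Counter(row[1] for row in rows)
    return rows, per_project
```

The rows can be printed directly as a table, one line per target source code file.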
According to the embodiment of the invention, the final homologous files of the target software can be completely displayed by summarizing the final homologous files corresponding to the multiple sections of target source code files, so that the information quantity presented by the software homologous analysis result is richer.
To evaluate the technical effect of the embodiment of the invention, 7 open source projects on GitHub were selected, and software homology analysis was performed both with and without the time tracing algorithm. The feature information used was the original MD5 feature information and the comment- and whitespace-stripped MD5 feature information. The comparison of results is shown in Table 1:
TABLE 1
[Table 1 appears as an image in the original publication (Figure BDA0002466354180000091); its contents are not reproduced here.]
The software homology analysis device provided by the embodiment of the invention is described below with reference to fig. 4 and 5, and the software homology analysis device described below and the software homology analysis method described above may be referred to correspondingly.
As shown in fig. 4, the software homology analysis apparatus according to the embodiment of the present invention includes a source code database obtaining unit 410, a target source code file obtaining unit 420, an alternative homology file determining unit 430, a final homology file determining unit 440, and a software homology analysis result determining unit 450.
A source code database obtaining unit 410, configured to obtain a source code database, where the source code database includes: the characteristic information of the reference source code file and the creation time of the reference source code file.
A target source code file obtaining unit 420, configured to obtain a target source code file of the target software, where the target source code file includes: characteristic information of the target source code file.
An alternative homologous file determining unit 430, configured to determine an alternative homologous file of the target source code file based on a matching result of the feature information of the reference source code file and the feature information of the target source code file.
A final homologous file determining unit 440, configured to use the candidate homologous file with the earliest creation time as the final homologous file corresponding to the target source code file.
A software homology analysis result determining unit 450, configured to determine a software homology analysis result according to the final homologous file.
The software homology analysis device provided by the embodiment of the invention is used for executing the software homology analysis method, and the specific implementation mode of the software homology analysis device is consistent with the implementation mode of the method, which is not described herein again.
As shown in fig. 5, in some embodiments, the alternative homology file determining unit 430 in the software homology analyzing apparatus includes: a first alternative homologous file determining sub-unit 431, a second alternative homologous file determining sub-unit 432, and a third alternative homologous file determining sub-unit 433.
A first alternative homologous file determining subunit 431, configured to determine a first type of alternative homologous file of the target source code file based on a matching result of the original MD5 feature information of the reference source code file and the original MD5 feature information of the target source code file.
A second alternative homologous file determining subunit 432, configured to determine a second type of alternative homologous file of the target source code file based on a matching result of the comment- and whitespace-stripped MD5 feature information of the reference source code file and the comment- and whitespace-stripped MD5 feature information of the target source code file.
A third alternative homologous file determining subunit 433, configured to determine the alternative homologous file according to the first type of alternative homologous file and the second type of alternative homologous file.
The software homology analysis device provided by the embodiment of the invention is used for executing the software homology analysis method, and the specific implementation mode of the software homology analysis device is consistent with the implementation mode of the method, which is not described herein again.
FIG. 6 illustrates the physical structure of an electronic device. As shown in FIG. 6, the electronic device may include: a processor 610, a communications interface 620, a memory 630 and a communication bus 640, wherein the processor 610, the communications interface 620 and the memory 630 communicate with each other via the communication bus 640. The processor 610 may invoke logic instructions in the memory 630 to perform the software homology analysis method, the method comprising: obtaining a source code database, the source code database comprising: the characteristic information of the reference source code file and the creation time of the reference source code file; acquiring a target source code file of target software, wherein the target source code file comprises: characteristic information of the target source code file; determining alternative homologous files of the target source code file based on the matching result of the characteristic information of the reference source code file and the characteristic information of the target source code file; taking the alternative homologous file with the earliest creation time as the final homologous file corresponding to the target source code file; and determining the software homology analysis result according to the final homologous file.
It should be noted that, when being implemented specifically, the electronic device in this embodiment may be a server, a PC, or other devices, as long as the structure includes the processor 610, the communication interface 620, the memory 630, and the communication bus 640 shown in fig. 6, where the processor 610, the communication interface 620, and the memory 630 complete mutual communication through the communication bus 640, and the processor 610 may call the logic instruction in the memory 630 to execute the above method. The embodiment does not limit the specific implementation form of the electronic device.
In addition, the logic instructions in the memory 630 may be implemented in software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Further, an embodiment of the present invention discloses a computer program product. The computer program product includes a computer program stored on a non-transitory computer-readable storage medium, and the computer program includes program instructions. When the program instructions are executed by a computer, the computer can execute the software homology analysis method provided by the above method embodiments, the method comprising: obtaining a source code database, the source code database comprising the characteristic information of a reference source code file and the creation time of the reference source code file; acquiring a target source code file of target software, the target source code file comprising characteristic information of the target source code file; determining alternative homologous files of the target source code file based on the matching result of the characteristic information of the reference source code file and the characteristic information of the target source code file; taking the alternative homologous file with the earliest creation time as the final homologous file corresponding to the target source code file; and determining the software homology analysis result according to the final homologous file.
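Claims 2 and 3 name two kinds of characteristic information: an MD5 of the original file, and an MD5 computed after comments and blanks are removed according to the comment rules of each programming language. The sketch below shows one way such fingerprints could be computed. It is an assumption-laden illustration: it interprets the blank removal as deleting all whitespace, it uses simple regular expressions that do not recognize comment markers inside string literals, and the names `COMMENT_PATTERNS`, `raw_md5`, and `normalized_md5` are hypothetical.

```python
import hashlib
import re

# Illustrative comment rules keyed by file extension; a real system would cover
# more languages and would need a lexer to handle string literals correctly.
C_STYLE = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)
COMMENT_PATTERNS = {
    ".c": C_STYLE,
    ".java": C_STYLE,
    ".py": re.compile(r"#[^\n]*"),
}


def raw_md5(source: str) -> str:
    """MD5 of the file content exactly as written."""
    return hashlib.md5(source.encode("utf-8")).hexdigest()


def normalized_md5(source: str, extension: str) -> str:
    """MD5 after stripping comments (per language) and all whitespace."""
    pattern = COMMENT_PATTERNS.get(extension)
    if pattern is not None:
        source = pattern.sub("", source)
    source = re.sub(r"\s+", "", source)
    return hashlib.md5(source.encode("utf-8")).hexdigest()
```

Two copies of a file that differ only in comments or layout produce different raw fingerprints but the same normalized fingerprint, which is why matching on both widens the set of alternative homologous files.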
In another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program performs the software homology analysis method provided in the foregoing embodiments, the method comprising: obtaining a source code database, the source code database comprising the characteristic information of a reference source code file and the creation time of the reference source code file; acquiring a target source code file of target software, the target source code file comprising characteristic information of the target source code file; determining alternative homologous files of the target source code file based on the matching result of the characteristic information of the reference source code file and the characteristic information of the target source code file; taking the alternative homologous file with the earliest creation time as the final homologous file corresponding to the target source code file; and determining the software homology analysis result according to the final homologous file.
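Claims 4 and 6 describe the analysis result as being derived from the creation time, project name, and version information of the final homologous files, aggregated over all target source code files of the target software. The sketch below shows one possible aggregation. The `Origin` record, the `summarize` function, and in particular the `homology_ratio` figure are illustrative assumptions and are not specified by the source.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Optional


@dataclass(frozen=True)
class Origin:
    """Final homologous file metadata for one target file (hypothetical record)."""
    project: str
    version: str
    created: datetime


def summarize(final_files: Dict[str, Optional[Origin]]) -> Dict[str, object]:
    """Aggregate per-file results into one report for the target software.

    `final_files` maps each target file path to its final homologous file,
    or to None when no reference file matched.
    """
    matched = {path: o for path, o in final_files.items() if o is not None}
    per_project = Counter((o.project, o.version) for o in matched.values())
    total = len(final_files)
    return {
        "total_files": total,
        "matched_files": len(matched),
        # Share of target files with a known origin (an assumed summary metric).
        "homology_ratio": len(matched) / total if total else 0.0,
        "by_project": dict(per_project),
    }
```

Grouping by project name and version makes it visible which open source releases the target software appears to draw from, and how many files come from each.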
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A software homology analysis method is characterized by comprising the following steps:
obtaining a source code database, the source code database comprising: the characteristic information of the reference source code file and the creation time of the reference source code file;
acquiring a target source code file of target software, wherein the target source code file comprises: characteristic information of the target source code file;
determining alternative homologous files of the target source code file based on the matching result of the characteristic information of the reference source code file and the characteristic information of the target source code file;
taking the corresponding alternative homologous file with the earliest creation time as a final homologous file corresponding to the target source code file;
and determining a software homology analysis result according to the final homologous file.
2. The software homology analysis method according to claim 1,
the determining, based on a matching result of the feature information of the reference source code file and the feature information of the target source code file, an alternative homologous file of the target source code file includes:
determining a first type of alternative homologous files of the target source code file based on a matching result of the original MD5 characteristic information of the reference source code file and the original MD5 characteristic information of the target source code file;
determining a second type of alternative homologous files of the target source code file based on a matching result of the comment-stripped and whitespace-stripped MD5 characteristic information of the reference source code file and the comment-stripped and whitespace-stripped MD5 characteristic information of the target source code file;
and determining the alternative homologous files according to the first type of alternative homologous files and the second type of alternative homologous files.
3. The software homology analysis method according to claim 2,
the comment-stripped and whitespace-stripped MD5 characteristic information is generated according to the comment rules of different programming languages.
4. The software homology analysis method according to claim 1,
the source code database further comprises: the project name of the reference source code file and the version information of the reference source code file;
determining a software homology analysis result according to the final homology file, wherein the determining comprises the following steps:
and determining a software homology analysis result according to the creation time of the final homologous file, the project name of the final homologous file and the version information of the final homologous file.
5. The software homology analysis method according to claim 4,
the acquiring of the source code database comprises:
acquiring the one million top-ranked open source projects on GitHub;
extracting feature information, creation time, project name and version information of the open source projects;
taking the source code files in the open source projects as the reference source code files;
and constructing the source code database based on the reference source code file.
6. The software homology analysis method according to any one of claims 1 to 5,
determining a software homology analysis result according to the final homology file, wherein the determining comprises the following steps:
the target software comprises a plurality of target source code files; the final homologous files corresponding to the plurality of target source code files are aggregated, and the software homology analysis result is determined from the aggregate.
7. A software homology analysis apparatus, comprising:
a source code database acquisition unit configured to acquire a source code database, the source code database including: the characteristic information of the reference source code file and the creation time of the reference source code file;
a target source code file obtaining unit, configured to obtain a target source code file of target software, where the target source code file includes: characteristic information of the target source code file;
the alternative homologous file determining unit is used for determining an alternative homologous file of the target source code file based on a matching result of the characteristic information of the reference source code file and the characteristic information of the target source code file;
a final homologous file determining unit, configured to use the candidate homologous file with the earliest creation time as a final homologous file corresponding to the target source code file;
and the software homologous analysis result determining unit is used for determining a software homologous analysis result according to the final homologous file.
8. The software homology analysis apparatus according to claim 7, wherein the alternative homologous file determining unit comprises:
a first candidate homologous file determining subunit, configured to determine a first type of candidate homologous file of the target source code file based on a matching result of the original MD5 feature information of the reference source code file and the original MD5 feature information of the target source code file;
a second alternative homologous file determining subunit, configured to determine a second type of alternative homologous file of the target source code file based on a matching result of the comment-stripped and whitespace-stripped MD5 characteristic information of the reference source code file and the comment-stripped and whitespace-stripped MD5 characteristic information of the target source code file;
a third candidate homologous file determining subunit, configured to determine the candidate homologous file according to the first class of candidate homologous files and the second class of candidate homologous files.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the software homology analysis method according to any one of claims 1 to 7 when executing the program.
10. A non-transitory computer readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the software homology analysis method according to any one of claims 1 to 7.
CN202010335325.0A 2020-04-24 2020-04-24 Software homologous analysis method and device Pending CN111666101A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010335325.0A CN111666101A (en) 2020-04-24 2020-04-24 Software homologous analysis method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010335325.0A CN111666101A (en) 2020-04-24 2020-04-24 Software homologous analysis method and device

Publications (1)

Publication Number Publication Date
CN111666101A true CN111666101A (en) 2020-09-15

Family

ID=72382867

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010335325.0A Pending CN111666101A (en) 2020-04-24 2020-04-24 Software homologous analysis method and device

Country Status (1)

Country Link
CN (1) CN111666101A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160103831A1 (en) * 2014-10-14 2016-04-14 Adobe Systems Incorporated Detecting homologies in encrypted and unencrypted documents using fuzzy hashing
CN106990956A (en) * 2017-03-10 2017-07-28 苏州棱镜七彩信息科技有限公司 Code file clone's detection method based on suffix tree
CN108229170A (en) * 2018-02-02 2018-06-29 中科软评科技(北京)有限公司 Utilize big data and the software analysis method and device of neural network
CN109710299A (en) * 2018-12-14 2019-05-03 平安普惠企业管理有限公司 A kind of open source class libraries monitoring method, device, equipment and computer storage medium
CN110334248A (en) * 2019-06-26 2019-10-15 京东数字科技控股有限公司 A kind of system configuration information treating method and apparatus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
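Li Suo et al.: "Code provenance analysis method based on code clone detection" (基于代码克隆检测的代码来源分析方法), Computer Applications and Software (《计算机应用与软件》) *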

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112579155A (en) * 2021-02-23 2021-03-30 北京北大软件工程股份有限公司 Code similarity detection method and device and storage medium
CN114385231A (en) * 2021-12-20 2022-04-22 杭州安恒信息安全技术有限公司 Data processing method, data processing device, storage medium and electronic equipment
CN114385231B (en) * 2021-12-20 2024-05-28 杭州安恒信息安全技术有限公司 Data processing method and device, storage medium and electronic equipment
CN115238102A (en) * 2022-06-28 2022-10-25 北京关键科技股份有限公司 Code data feature extraction and retrieval method and device
CN115686623A (en) * 2022-11-03 2023-02-03 苏州棱镜七彩信息科技有限公司 Homologous detection method of closed-source software

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200915