CN112988217B - Code base design method and detection method for rapid full-network code traceability detection - Google Patents
- Publication number
- CN112988217B CN112988217B CN202110278117.6A CN202110278117A CN112988217B CN 112988217 B CN112988217 B CN 112988217B CN 202110278117 A CN202110278117 A CN 202110278117A CN 112988217 B CN112988217 B CN 112988217B
- Authority
- CN
- China
- Prior art keywords
- code
- git
- database
- commit
- file
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/70—Software maintenance or management
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/70—Software maintenance or management
- G06F8/71—Version control; Configuration management
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Security & Cryptography (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a code library design method for rapid full-network code traceability detection. A code library is obtained by efficiently storing the Git objects of all open source projects on the network that use Git, through processes of project discovery, data extraction, data storage, code information mapping construction and data updating, and the library can be updated efficiently. The method comprises: a storage scheme that partitions storage by Git object type; a relation mapping between code files and code file information that enables fast lookup of a code file's full-network information; and an efficient update scheme for the constructed ultra-large-scale code library, in which a customized git fetch protocol built on the Libgit2 function library uses the constructed library as the back end to efficiently obtain the newly added Git objects of a remote repository. The code library generated by the method can be updated regularly and efficiently, and supports rapid full-network traceability detection of code at file granularity with high detection efficiency.
Description
Technical Field
The invention provides a code base design method for quick full-network code traceability detection and a quick full-network code traceability detection method based on the code base, and belongs to the technical field of software engineering.
Background
With the vigorous development of open source software, a massive amount of excellent open source software has accumulated on the network, and open source code is increasingly used in software development. Using open source code improves development efficiency but also introduces risks: if the source of a piece of open source code is unknown, subsequent bug fixes to it cannot be synchronized, and the code may expose its users to legal risks such as license compliance and intellectual property issues, bringing security threats and varying degrees of economic or reputational loss. A well-known open source risk case is Heartbleed, a security vulnerability in the encryption library OpenSSL, which is widely used to implement the Transport Layer Security protocol of the internet. It was introduced into OpenSSL in 2012 and first disclosed to the public in April 2014. Wherever a defective OpenSSL instance is used, both the server and the client may be attacked. Thus, traceability detection of the code in a software product is critical.
To realize code traceability detection for software products, a code library for code matching and search must be constructed, and the number of codes it contains and the way it is built directly affect the accuracy and efficiency of detection. Existing code traceability detection techniques mostly propose efficient code detection algorithms under the assumption that a large code library already exists; for example, some research selects, from a large number of open source software libraries, the open source software most likely to be the source of reused code to participate in traceability comparison. However, efficient techniques for building the code library itself are lacking. The prior art commonly downloads several open source projects to the local machine to form a code library. Such a library covers too few projects to support full-network traceability detection, and a poorly designed code library architecture leads to low detection efficiency.
Disclosure of Invention
Aiming at the technical problems existing in the prior art, the invention provides a code base design method for quick full-network code traceability detection, and realizes the quick full-network code traceability detection based on the code base. Meanwhile, the code library can be updated regularly and efficiently.
In the present invention, "full-network code" refers to code data collected from most open source code hosting platforms. A repository hosted on a code hosting platform is referred to as a remote repository; cloning a remote repository to the local machine yields a local repository. The code base designed by the invention for rapid full-network code traceability detection is a database formed by cloning remote repositories locally and extracting data from the local repositories.
The code base design for rapid full-network code traceability detection provided by the invention utilizes the internal principles and hash values of Git, specifically:
1) A remote repository can be downloaded to the local machine via the git clone command, and updates to the remote repository can be transferred back via the git fetch command. git fetch determines which objects the local repository is missing compared to the remote repository by comparing the heads of the two repositories, and the remote repository then transmits these missing objects back to the local machine.
2) Git uses four types of data objects for version control; the references to these objects are SHA1 values computed from the object contents. A commit object represents a change to the project and contains the SHA1 of the parent commit (if any), the root folder (a tree object), the author ID and timestamp, the committer ID and timestamp, and the commit message. A tree object represents a folder within a project: it is a list containing the SHA1 of the files (blobs) and subfolders (other tree objects) in the folder, together with their mode, type and name. A blob object is a compressed version of one version of a file's content (source code). A tag object is a string used to associate a readable name with a particular version of the repository. One commit represents a code change and typically involves modifications to several files (blobs).
3) The hash value is a fixed-length value computed from the file content; different file contents produce different hash values, so a file can be uniquely indexed by its hash value.
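The unique indexing described above follows from how Git derives a blob's object ID: it prepends a `blob <size>\0` header to the raw file content and applies SHA1, so identical contents always map to the same ID. A minimal sketch (standard library only):

```python
import hashlib

def git_blob_sha1(content: bytes) -> str:
    """Compute the Git object ID of a blob: SHA1 over the
    header "blob <size>\\0" followed by the raw file content.
    Identical contents always produce the same ID, which is
    what lets the code library index files uniquely."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# Matches `echo "hello world" | git hash-object --stdin`
print(git_blob_sha1(b"hello world\n"))
# -> 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```

Any change to the content, however small, changes the ID, so the hash also serves as a presence check against the code library.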
The invention adopts a code base design method oriented to full-network code traceability detection: the Git objects of all open source projects on the network that use Git are stored efficiently to obtain a code library that can be used for code traceability detection and analysis, together with an efficient update scheme for the library. Specifically, the design and construction of the code base comprises the following steps: project discovery, data extraction, data storage, code information mapping construction and data updating. The data storage part designs a storage mode, different from the traditional Git storage mode, in which Git objects are stored classified by type and partitioned into blocks; this greatly reduces the storage space of the code base and improves full-network retrieval efficiency, and is first proposed by the invention. The code information mapping construction part is also first proposed by the invention: a relation mapping is built between a code file and its information (including the projects and commits containing it, the author and time that created it, and its file names), so that the full-network information of a code file can be retrieved quickly. The invention further provides an efficient update mode for the constructed ultra-large-scale code library: a customized git fetch protocol is implemented on the Libgit2 function library with the constructed library as the back end, and this customized protocol can accurately obtain the newly added Git objects of a remote repository at extremely low time and space cost. Finally, the invention also provides a rapid full-network traceability detection scheme for code at file granularity.
The technical scheme of the invention is as follows:
A code base design method for rapid full-network code traceability detection: the Git objects of all open source projects on the network that use Git are stored efficiently to obtain a code library, and the library can be updated efficiently. The method provides a storage mode that partitions storage by Git object type, to reduce the storage space of the code base and improve full-network retrieval efficiency; builds the relation mapping between code files and code file information, so that the full-network information of a code file can be retrieved quickly; and adopts an efficient update mode for the constructed ultra-large-scale code library, in which a customized git fetch protocol based on the Libgit2 function library uses the constructed library as the back end to efficiently obtain the newly added Git objects of a remote repository. The code base design comprises the following processes: project discovery, data extraction, data storage, code information mapping construction and data updating. The method specifically comprises the following steps:
A. Acquire the full-network open source software project list through multiple project discovery methods;
Open source software projects are mostly hosted on popular development collaboration platforms such as GitHub, Bitbucket, GitLab and SourceForge. The invention discovers projects using various methods, including the APIs provided by the development collaboration platforms and parsing the platforms' web pages, and then takes the union of the discovered project sets as the final open source project list.
In implementation, this can be completed on an ordinary server (e.g. one with an Intel E5-2670 CPU) with low hardware requirements. The invention packages the scripts of the project discovery process into a Docker image.
B. Data extraction: download the projects in the open source project list acquired in step A to the local machine and extract the Git objects therein;
In practice, a copy of the remote repository is created locally via the git clone command. After the open source projects are cloned in batches, all the Git objects in the cloned projects are extracted in batches through Git.
The data extraction may be done in parallel on (cloud) servers. The invention uses Libgit2, the C-language interface of Git, to list all the Git objects in a project, then classifies the objects by type, and finally extracts the content of each object. The invention uses a cluster of 36 nodes, each node having a 16-core Intel E5-2670 CPU and 256GB of memory, and each node starts 16 threads to complete the Git object extraction. One node can process about 50,000 projects in 2 hours. After the Git data in the cloned projects has been extracted, the cloned projects are deleted and a new clone-and-extract round is started.
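The patent performs the listing with Libgit2; as an illustration, the same enumeration can be obtained with `git cat-file --batch-all-objects --batch-check`, whose output is one `<sha1> <type> <size>` line per object. The sketch below (hypothetical helper name) parses that format and buckets objects by type, mirroring the classification that precedes writing the per-type databases:

```python
def classify_objects(batch_check: str) -> dict:
    """Bucket `git cat-file --batch-all-objects --batch-check`
    output lines ("<sha1> <type> <size>") by object type."""
    buckets = {"commit": [], "tree": [], "blob": [], "tag": []}
    for line in batch_check.splitlines():
        sha1, otype, size = line.split()
        if otype in buckets:  # ignore any other entry kinds
            buckets[otype].append((sha1, int(size)))
    return buckets

# Fabricated sample output (dummy 40-hex-char object IDs):
sample = "\n".join([
    "a" * 40 + " commit 230",
    "b" * 40 + " tree 96",
    "c" * 40 + " blob 1024",
    "d" * 40 + " blob 57",
])
print({t: len(v) for t, v in classify_objects(sample).items()})
# -> {'commit': 1, 'tree': 1, 'blob': 2, 'tag': 0}
```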
C. Git object data storage: the Git object data is stored classified by Git object type and partitioned into blocks, which reduces the data storage space and improves parallel processing efficiency; specifically:
a. Binary files (such as PDFs and pictures) included in open source projects are not saved;
b. Git object data is stored classified by Git object type, i.e., the databases comprise a commit database, a tree database, a blob database (excluding binary blobs), and a tag database. This storage mode reduces the data storage space to the hundred-TB level while also enabling quick checks of whether a given object is already stored in the code library.
c. The database of each type of Git object comprises cache data and content data, stored respectively in a cache database and a content database to accelerate retrieval; the cache database and the content database of each class (i.e., the commit, tree, blob and tag databases) can each be divided into multiple parts (e.g. 128) for parallel operation. The cache database is used to quickly determine whether a given Git object is already stored in the database, which is necessary for data extraction (if the object is present it is not extracted again, saving time). The cache database also helps determine whether a repository needs to be cloned at all: if the heads of a repository (the commit objects pointed to by the branches stored in .git/refs/heads) are already in the cache database, no cloning is required.
d. The cache database is a key-value database; the content database is saved by splicing, to facilitate updates.
The cache database is a key-value database, where the key is the SHA1 value of the Git object (20 bytes) and the value is the offset and size, within the content database, of the Git object compressed using a Perl compression library. The content database contains the compressed contents of the Git objects, spliced together continuously. Because the content database is saved by splicing, it can be updated quickly: new content only needs to be appended to the end of the corresponding file. For commit and tree objects, an additional random-lookup key-value database is created for each, where the key is the SHA1 of the Git object and the value is its compressed content. Random queries on the key-value database are fast: each thread can query more than 170K Git objects per second.
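The cache/content pair for one shard can be sketched as follows. This is only an illustration: Python's zlib and in-memory structures stand in for the on-disk Tokyo Cabinet files and the Perl compression library, and the class name is hypothetical.

```python
import zlib

class ObjectStore:
    """One shard: an append-only content store plus a
    sha1 -> (offset, size) cache index."""
    def __init__(self):
        self.content = bytearray()   # stands in for the content file
        self.cache = {}              # sha1 -> (offset, size)

    def put(self, sha1: str, data: bytes):
        if sha1 in self.cache:       # already stored: skip re-extraction
            return
        comp = zlib.compress(data)
        self.cache[sha1] = (len(self.content), len(comp))
        self.content += comp         # splice onto the end -> cheap updates

    def get(self, sha1: str) -> bytes:
        off, size = self.cache[sha1]
        return zlib.decompress(bytes(self.content[off:off + size]))
```

The cache lookup is what makes both deduplicated extraction and the "is this head already stored?" clone check O(1) per object.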
e. Parallelization is achieved with SHA1 values.
The present invention uses the last 7 bits of the first byte of the SHA1 value of a Git object to split each type of database into 128 shares. Thus the four types of Git objects have 128 cache databases and 128 content databases each; in addition, the commit and tree objects each have 128 random-lookup key-value databases, for a total of 128 × (4 + 4 + 2) databases, which can be placed across servers to accelerate parallelism. In implementation, the size of a single content database ranges from 20MB for tag objects to 0.8TB for blob objects, and the largest single cache database, for tree objects, is 2GB.
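The shard selection above is a simple bit mask. A sketch, with the function name chosen for illustration:

```python
def shard_of(sha1_hex: str, bits: int = 7) -> int:
    """Shard index from the low `bits` bits of the first byte
    of a Git object's SHA1 (7 bits -> 128 shards)."""
    first_byte = int(sha1_hex[:2], 16)
    return first_byte & ((1 << bits) - 1)

print(shard_of("ff" * 20))        # 0xff & 0x7f -> 127
print(shard_of("80" + "0" * 38))  # 0x80 & 0x7f -> 0
```

Because SHA1 output is effectively uniform, this spreads objects evenly over the 128 shards, so threads working on different shards rarely contend.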
f. The present invention uses Tokyo Cabinet (similar to Berkeley DB), a database written in the C language.
Tokyo Cabinet uses hashing as an index and provides read-query performance roughly ten times faster than common key-value databases such as MongoDB or Cassandra. Its read and query speed and strong portability exactly meet the construction requirements of a code base for full-network code traceability detection, so the invention adopts Tokyo Cabinet rather than a more fully featured NoSQL database.
D. Code information mapping construction:
The code library designed by the invention aims at full-network traceability detection of code and at supporting analysis of the security and compliance of software projects. To this end, the invention constructs relation mappings from a code file (blob) to the projects containing it, to the commits containing it, to its authors, to its file names and to its creation times; these mappings are stored as databases, so that the full-network information of a code file, such as the projects and commits containing it, the author who created it and the time it was created, can be obtained quickly. This realizes the construction of the code information mapping. Obtaining such information for a code file is useful for a comprehensive assessment of the security and compliance of a software project.
The invention builds the relation mappings with the commit as the center, specifically:
Build the mutual mapping between commit and project, the relation mappings from commit to author and time, the relation mapping from author to commit, the mutual mapping between commit and code file (blob), and the mutual mapping between commit and file name.
The list of projects containing a code file (blob) can be determined by combining the blob-to-commit and commit-to-project relations; the creation time of a blob can be determined by combining the blob-to-commit and commit-to-time relations; and the author of a blob can be determined by combining the blob-to-commit and commit-to-author relations.
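The derived relations above are all compositions of two stored maps. A minimal sketch of that composition, with illustrative data:

```python
from collections import defaultdict

def compose(a2b: dict, b2c: dict) -> dict:
    """Compose two relation maps, e.g. blob->commits with
    commit->projects to obtain blob->projects."""
    a2c = defaultdict(set)
    for a, bs in a2b.items():
        for b in bs:
            a2c[a] |= set(b2c.get(b, ()))
    return dict(a2c)

# Fabricated example: blob b1 appears in two commits,
# which belong to projects p1 and p2.
blob2commit = {"b1": ["c1", "c2"]}
commit2proj = {"c1": ["p1"], "c2": ["p1", "p2"]}
print(compose(blob2commit, commit2proj))
# -> {'b1': {'p1', 'p2'}}
```

Storing only the commit-centered maps and composing on demand avoids materializing every pairwise relation while still answering blob-to-project, blob-to-author and blob-to-time queries.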
A mapping of the interrelationship between code files and filenames is also constructed to support the tracing of specific code segments.
These relation mappings are saved in Tokyo Cabinet databases for quick retrieval. The invention again uses partitioned storage to increase retrieval efficiency; specifically, each class of relation mapping is divided into 32 sub-databases. For commits and (code file) blobs, the last 5 bits of the first byte of their SHA1 are used for partitioning. For authors, projects, and file names, the invention uses the last 5 bits of the first byte of their FNV-1 hash.
E. Data update
Git objects are immutable (existing Git objects remain unchanged; only new Git objects appear), so only the new Git objects need to be acquired. The invention uses two methods to update the code library:
a. Identify new Git projects, clone them, and extract the Git objects therein.
b. Identify updated projects by obtaining the latest commits of the branches of the remote repositories of the collected repositories; then, by modifying the git fetch protocol so that it can use the constructed code library as the back end without a local Git repository (the cloned Git repositories were deleted to save space after data extraction in step B), obtain the newly added Git objects of a remote repository and extract them into the code library. The invention reconstructs the flow of git fetch from the source code of the fetch function implemented in Libgit2, as follows:
b1) A remote repository is added to the local repository. A remote repository is represented by the git_remote struct in Libgit2; when this struct is created, all branch references in the .git/refs folder of the local repository are filled into one of its member variables (refs);
b2) The local repository establishes a connection to the remote repository;
b3) After the connection is established, the remote repository responds and sends all of its branch references (the content of its .git/refs/heads folder) to the local machine;
b4) After the local repository receives the references sent back by the remote repository, it checks one by one whether the objects these references point to are in the local repository; if an object is present locally, the branch is marked as not updated, so the remote repository need not send an update. These references are then inserted into the member variable mentioned in step b1);
b5) After the local repository has checked all of these references, the member variable (including the marked references) is sent back to the remote repository to "negotiate" with it, and the local side waits for an ACK signal from the remote repository. Libgit2 waits as follows: the commit objects in the local repository are sorted in chronological order and traversed from the most recent commit; each commit object is sent to the remote repository to tell it that this object exists locally, followed by a checked reference. This is repeated up to 256 times until an ACK signal from the remote repository is received.
b6) After negotiating with the remote repository (i.e., telling it the latest commits of the local branches and what is wanted), the remote repository can calculate which Git objects to send back. It packages these objects into a file in the pack file format and sends it back to the local machine.
b7) After receiving the returned data, the local repository parses it according to the pack file format and constructs a corresponding index file to facilitate retrieval. Constructing the index file requires the Git objects in the local Git repository.
As can be seen from these steps, apart from steps b5) and b7), no step involves Git objects other than the branch references, and git fetch determines whether the remote repository has been updated by comparing the branch references of the remote repository with the local ones. The invention proposes the following modifications to git fetch:
1) Modify step b3) of the original git fetch: save the branch references sent back by the remote repository locally and check whether they are already stored in the local code library. If so, the remote repository has not been updated; if not, it has been updated, and the process enters the next step.
2) Modify step b5) of the original git fetch: the original protocol sorts the commits and sends them to the remote repository merely to wait for the remote repository's ACK signal, which has no special effect here, so the invention changes the waiting method: the latest commit object of the primary branch is sent each time, repeated up to 256 times, until an ACK signal is received from the remote repository.
3) Modify step b6) of the original git fetch: save the file in pack file format sent back by the remote repository locally and parse the pack file against the Git objects in the code library; step b7) is not performed.
After these modifications, git fetch can perform updates with the constructed code library as the back end; a complete repository no longer needs to be cloned for each update, reducing network bandwidth and time overhead.
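The decision made by modified step b3) reduces to a set-membership test against the cache database: a repository needs fetching only if it advertises at least one unknown branch head. A sketch, with illustrative names and data:

```python
def repo_needs_update(remote_heads: list, cache: set) -> bool:
    """Modified step b3): if every branch head the remote
    advertises is already in the code library's cache database,
    the repository has produced no new objects."""
    return any(head not in cache for head in remote_heads)

cache = {"c1", "c2"}                      # heads already stored
print(repo_needs_update(["c1", "c2"], cache))  # -> False (skip)
print(repo_needs_update(["c1", "c3"], cache))  # -> True  (fetch)
```

Because Git objects are immutable, a known head implies every object reachable from it is already stored, so this single check safely prunes unchanged repositories before any negotiation traffic.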
In specific implementation, the invention also provides a method for rapid full-network code traceability detection at file granularity based on the code library, comprising the following steps:
1) For a code file, calculate its SHA1 value;
2) Using the code information mapping constructed in step D, query the full-network information of the code file with its SHA1 as the key, including the project list, commit list, corresponding file names and authors of the code file, and feed this information back to the user.
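The two steps above amount to hashing the file the way Git would and doing one keyed lookup. A sketch, where `blob_info` is a stand-in dictionary for the mapping databases of step D:

```python
import hashlib

def trace_file(content: bytes, blob_info: dict) -> dict:
    """Look up a file's full-network information by the SHA1 of
    its Git blob; `blob_info` stands in for the step-D mapping
    databases (projects, commits, authors, filenames, times)."""
    header = b"blob %d\x00" % len(content)
    sha1 = hashlib.sha1(header + content).hexdigest()
    return blob_info.get(sha1, {})
```

An unknown file simply returns an empty record, which distinguishes original code from code traceable to the open source corpus.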
Compared with the prior art, the invention has the beneficial effects that:
the code library design provided by the invention can support efficient full-network tracing detection of codes. According to the technical scheme and the embodiment provided by the invention, a local code library can be built for open source Git warehouses on a plurality of code hosting platforms of the whole network including Github without a large number of servers; incremental updating of the code library can be accomplished without requiring extra bandwidth.
The technical scheme and embodiments provided by the invention give detailed guidance for constructing a code base oriented to full-network code traceability detection, filling the gap in massive code base construction technology in the field of code traceability detection.
Drawings
FIG. 1 is a flow diagram of the code library design method for rapid full-network code traceability detection in an embodiment of the invention.
FIG. 2 is a block diagram illustrating a code library update strategy according to an embodiment of the present invention.
FIG. 3 is a flow diagram of the customized git fetch process in an embodiment of the present invention.
FIG. 4 is a flow diagram of obtaining remote repository updates based on the customized git fetch protocol in an embodiment of the present invention.
Fig. 5 is a flow chart of a fast full-network code tracing detection method based on a constructed code base in an embodiment of the invention.
Detailed Description
The invention is further described by way of examples below with reference to the accompanying drawings, which in no way limit the scope of the invention.
The invention provides a code base design method for quick full-network code traceability detection, which specifically comprises the following steps:
A. Acquire the full-network open source software project list through multiple project discovery methods. The implementation is as follows:
Currently, open source software projects are mostly hosted on popular development collaboration platforms such as GitHub, Bitbucket, GitLab and SourceForge. Another portion of open source projects is hosted on personal or project-specific websites. Thus, to support full-network traceability detection of code, an open source project list that is as complete as possible must be obtained. In view of this challenge, the invention combines various methods to discover projects, such as using the APIs provided by the platforms and parsing the platforms' web pages, and finally takes the union of the project sets discovered by these methods as the final open source project list.
B. Data extraction: download the projects in the open source project list of step A to the local machine and extract the Git objects therein;
This step is responsible for downloading the projects found in step A locally and extracting the Git objects therein. A copy of the remote repository is created locally via the git clone command. After the projects are cloned in batches, all the Git objects in the cloned projects are extracted in batches through Git. This step may be done in parallel on (cloud) servers.
C. Data storage: the Git object data is stored classified by type and partitioned into blocks, reducing the data storage space and improving parallel processing efficiency;
there may be many repetitive Git objects between open source projects due to the multiplexed code, pull-request development schema, etc. Meanwhile, the open source item may also include many binary files, such as PDF and pictures. Without such redundancy and binary files, it is estimated that the required data storage space would exceed 1.5PB, and such a large amount of data would make code tracing tasks almost impossible to implement. In order to avoid the redundancy of the Git objects among the warehouses, the design of the code base is oriented to the tracing detection of the whole network codes, so that the invention does not store binary files, and stores the binary files according to the types of the Git objects, namely a commit database, a tree database, a blob database (not containing binary blobs) and a tag database. The storage mode can quickly search whether the data is stored in the code base while reducing the data storage space to the hundred TB level.
D. Code information mapping construction:
The objective of the code library is to quickly carry out full-network traceability detection on code and to support analysis of the security and compliance of software projects. For this purpose, the invention constructs relation mappings from a code file (blob) to the projects containing it, from a code file to the commits containing it, from a code file to its author, from a code file to its filename, and from a code file to its creation time. These relation mappings are stored in the form of databases, so that the full-network information of a code file (such as the projects and commits containing it, the author who created it and the time it was created) can be quickly obtained, realizing the construction of the code information mapping. Obtaining such information for a code file is useful for a comprehensive assessment of the security and compliance of a software project.
E. Data update
Keeping the code library up to date is critical to the code traceability detection task. With the growth of existing repositories and the advent of new ones, cloning all repositories takes longer and longer. Currently, to clone all Git repositories (more than 130 million, including forks), the estimated total time requires six hundred single-threaded servers running for one week, and the result would occupy more than 1.5PB of disk space. Fortunately, Git objects are immutable (i.e., existing Git objects remain unchanged; only new Git objects are added), so only these new Git objects need to be acquired. Specifically, the present invention proposes to update the code library using two strategies:
1. New Git projects are identified, cloned, and the Git objects therein are extracted.
2. Updated projects are identified by obtaining the latest commits of all branches of the remote repositories of the collected repositories; then, by modifying the git fetch protocol so that it can use the built code library as the back end in the absence of a local Git repository (cloned Git repositories are deleted to save space after the data is extracted in step B), the updates of the remote repository are obtained and the newly added Git objects are extracted into the code library. The invention reconstructs the flow of git fetch from the source code of the git fetch function implemented in Libgit2, as shown in Figure 2, which comprises the following 7 steps:
1) The remote repository is added to the local repository. A remote repository is represented by a git_remote struct in Libgit2; when this struct is created, all branch references in the .git/refs folder of the local repository are filled into a member variable (refs) of the struct;
2) The local repository establishes a connection to the remote repository;
3) After the connection is established, the remote repository replies (response) and sends all of its branch references (the contents of the files under .git/refs/heads) to the local side;
4) After the local repository receives the references sent back by the remote repository, it checks one by one whether the objects to which these references point are in the local repository; if an object is in the local repository, the branch is marked as not updated, and the remote repository is not asked to send updates. These references are then inserted into the member variable mentioned in step 1);
5) After the local repository has checked all of these references, it sends this member variable (including the marked references) back to the remote repository to "negotiate" with it, then waits for an ACK signal sent back by the remote repository. Libgit2 waits as follows: the commit objects within the local repository are sorted in chronological order and traversed from the most recent commit backwards; each commit object is sent to the remote repository to announce that this object exists locally, followed by a checked reference. This is repeated up to 256 times until an ACK signal sent back by the remote repository is received.
6) After negotiating with the remote repository (i.e., telling it what the latest commits of the local repository's branches are and which branches are wanted), the remote repository can calculate which Git objects need to be sent back. The remote repository packages these objects into a file in the packfile format and sends it back to the local side.
7) After receiving the returned data, the local repository parses it according to the packfile format and constructs a corresponding index file for convenient retrieval. Constructing the index file requires information recovered from the Git objects in the local Git repository.
As can be seen from these steps, except for steps 5) and 7), the process involves no Git objects other than those the branch references point to, and git fetch determines whether the remote repository has been updated by comparing the branch references of the remote repository with those of the local repository. The invention proposes the following modifications of git fetch:
1) Modify step 3) of the original git fetch: the branch references sent back by the remote repository are saved locally, and it is determined whether they are already stored in the local code base; if so, the remote repository has no newly added Git object data; if not, proceed to the next step.
2) Modify step 5) of the original git fetch: the original git fetch protocol sorts the commits and sends them to the remote repository merely to wait for the remote repository's ACK signal, which serves no special purpose here, so the invention changes the waiting method: the most recent commit object of the main branch is sent each time, repeated up to 256 times until an ACK signal is received from the remote repository.
3) Modify step 6) of the original git fetch: the file in packfile format sent back by the remote repository is saved locally, and the packfile is parsed against the Git objects in the code library, without performing step 7).
After these modifications to git fetch, updates can be fetched using the constructed code base as the back end, without cloning a complete repository for each update, reducing network bandwidth overhead and time overhead.
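The "want"/"have" lines and ACK signals exchanged in the fetch steps above are framed in Git's pkt-line format: each line carries a four-hex-digit length prefix that counts the prefix itself plus the payload. A minimal sketch (illustrative, not the patent's Libgit2 code):

```python
def pkt_line(payload: str) -> bytes:
    """Frame a payload in Git pkt-line format: a 4-hex-digit length
    prefix counting the prefix (4 bytes) plus the payload bytes."""
    data = payload.encode()
    return b"%04x" % (len(data) + 4) + data

# A flush-pkt terminates a block of pkt-lines.
FLUSH_PKT = b"0000"

# e.g. a 'have' line sent during negotiation (the SHA1 is illustrative)
line = pkt_line("have " + "a" * 40 + "\n")
```

"have 40-hex-SHA1\n" is 46 bytes, so its frame begins with the prefix 0032 (50 in hex, including the 4 prefix bytes).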
Finally, the invention provides a rapid full-network traceability detection scheme for code at file granularity, comprising the following two steps:
1. For a code file, its SHA1 value is calculated;
2. Using the SHA1 of the code file as the key, the full-network information of the code file is queried from the code information mapping database constructed in step D, including the list of projects containing the code file, the list of commits containing it, the corresponding file names, the authors, etc., and this information is fed back to the user.
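A subtlety in step 1: the SHA1 Git assigns to a file is not the SHA1 of the raw bytes, but of the bytes prefixed with a "blob <length>\0" header, so the lookup key must be computed the same way. A minimal sketch:

```python
import hashlib

def git_blob_sha1(content: bytes) -> str:
    """Compute the Git blob SHA1: sha1(b'blob <len>\\0' + content),
    which is the key used in the code information mapping databases."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# Sanity check against the well-known SHA1 of the empty blob.
assert git_blob_sha1(b"") == "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
```

Any file's key can then be computed locally, without contacting any hosting platform, before querying the mappings.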
As a preferred solution, step B uses Git's C-language interface Libgit2 (since C is more efficient) to complete the extraction task.
As a preferred embodiment, step C and step D use the TokyoCabinet database.
As a preferred solution, step E uses Git's C-language interface Libgit2 to implement a customized git fetch protocol.
Fig. 2 is a flowchart of a code library design method for rapid full-network code traceability detection according to an embodiment of the present invention, including the following specific implementation steps:
A. Project discovery:
In order to obtain as complete a list of open source projects as possible, the present invention combines a variety of heuristics to discover projects, including using the APIs of development collaboration platforms and parsing the platforms' web pages. Finally, the union of the project sets discovered by these methods is taken as the final open source project list. The invention packages the scripts of the project discovery process into a Docker image. Specifically, the project discovery methods adopted by the invention are as follows:
1. Using the APIs of development collaboration platforms. Some code-hosting platforms, such as GitHub, provide APIs that can be used to discover the complete set of open source projects on the platform. These APIs are platform-specific and may have different usage patterns, so different API queries need to be designed for different platforms. These APIs typically have access rate limits per user or IP address, which can be overcome by building a pool of user IDs. For the GitHub platform, GitHub's GraphQL API is used to obtain the list of updated GitHub repositories. Specifically, the time period for which repositories need to be acquired is divided equally according to the number of user IDs in the user ID pool, and each user ID is then responsible for the repositories updated in one sub-period, with query conditions of the form {is:public fork:false pushed:start_time..end_time}, where start_time and end_time step through each sub-period at 10-minute intervals, yielding the repositories updated in each 10-minute interval. For the Bitbucket platform, the API query used is https://api.bitbucket.org/2.0/repositories/?pagelen=100&after=date, where replacing date with a specific time, such as 2017-11-18, returns the Bitbucket repositories created after 2017-11-18. For the SourceForge platform, the platform provides a project list in XML format at https://sourceforge.net/sitemap.xml, and all project lists on SourceForge can be obtained by downloading and parsing the XML. For the GitLab platform, the API query used is https://gitlab.com/api/v4/projects?archived=false&membership=false&order_by=created_at&owned=false&page={}&per_page=99&simple=false&sort=desc&starred=false&statistics=false&with_custom_attributes=false&with_issues_enabled=false, where the page parameter starts at 1 and is then incremented to retrieve all projects on GitLab.
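The 10-minute time-window splitting described for the GitHub queries can be sketched as follows; the query-string format mirrors the description above and is an assumption, as is the function naming:

```python
from datetime import datetime, timedelta

def ten_minute_windows(start: datetime, end: datetime):
    """Yield (window_start, window_end) pairs covering [start, end)
    in 10-minute steps, one per repository-search query."""
    step = timedelta(minutes=10)
    t = start
    while t < end:
        yield t, min(t + step, end)
        t += step

def search_qualifier(ws: datetime, we: datetime) -> str:
    """Build the search qualifier for one window (illustrative shape)."""
    return "is:public fork:false pushed:%s..%s" % (
        ws.isoformat(), we.isoformat())
```

Each user ID in the pool would then be assigned a disjoint sub-range of windows, keeping every query under the per-ID rate limit.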
2. Parsing the web pages of websites. For the Bioconductor platform, all projects on the website can be obtained by parsing the https://git.bioconductor.org web page; for the repo.or.cz platform, all projects on the website can be obtained by parsing the https://repo.or.cz/?a=project_list web page; for the Android platform, all projects on the website can be obtained by parsing the https://android.googlesource.com web page.
For the ZX2C4 platform, all projects on the platform are obtained by parsing the https://git.zx2c4.com web page; for the Eclipse platform, by parsing the https://git.eclipse.org web page; for the PostgreSQL platform, by parsing the https://git.postgresql.org web page;
for the kernel.org platform, all projects on the platform are obtained by parsing the https://git.kernel.org web page; for the Savannah platform, by parsing the https://git.savannah.gnu.org/cgit web page.
This step can be accomplished on a common server (e.g., an Intel E5-2670 CPU server), with very low hardware requirements. As of September 2020, we have retrieved more than 130 million distinct repositories (excluding GitHub repositories marked as forks and repositories without content).
B. Data extraction:
This step can be done in parallel on a very large number of servers, but requires a large amount of network bandwidth and storage space. Remote repositories are cloned locally in batches through the git clone command; by measurement, a single-threaded shell process on an Intel E5-2670 CPU server without network bandwidth limitation can clone 20,000 to 50,000 randomly selected projects in 24 hours (the time varies greatly with repository size and platform). In order to clone all projects (over 130 million) in a week, about 400-800 servers would be required, which is costly. Thus, the present invention optimizes this by running multiple threads on each server and cloning only the small fraction of repositories that have changed since the last retrieval. The invention uses 5 data transmission nodes on a computing cluster platform with 300 nodes and bandwidth up to 56Gb/s to finish the cloning task. Alternatively, this step can be accomplished using cloud servers instead of a computing cluster: customized cloud service resources meeting the requirements can be purchased at cloning time and released after the batch cloning is finished. Cloud servers can reach higher bandwidth, so cloning is faster.
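The multithreaded batch cloning can be sketched with a thread pool; the function names are illustrative, and the use of `--bare` is an assumption (only the Git objects are needed, not a working tree):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def clone_cmd(url: str, dest: str):
    """Build a bare-clone command; --bare skips the working tree,
    since only the repository's Git objects will be extracted."""
    return ["git", "clone", "--bare", url, dest]

def clone_all(urls, dest_dir, workers=16):
    """Clone a batch of repositories with `workers` concurrent threads."""
    def clone(url):
        name = url.rstrip("/").split("/")[-1]
        subprocess.run(clone_cmd(url, "%s/%s" % (dest_dir, name)),
                       check=False)  # a failed clone should not abort the batch
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(clone, urls))
```

Since cloning is network-bound rather than CPU-bound, threads (rather than processes) suffice to keep the link saturated.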
After a project is cloned locally, all of the Git objects within the project need to be extracted. The Git client can only display the content of Git objects one at a time, which is not amenable to automated batch processing. The invention uses Git's C-language interface Libgit2 to list all the Git objects in a project, then classifies the objects according to their type, and finally extracts the content of each object. On a cluster with 36 nodes, each node having a 16-core Intel E5-2670 CPU and 256GB of memory, each node starts 16 threads to perform the Git object extraction work. One node can process about 50,000 projects in 2 hours. After the Git data in the cloned projects is extracted, the cloned projects are deleted and a new clone-extract cycle is started.
C. Data storage: the Git objects are divided and stored in blocks according to their type, and binary files are not stored, thereby reducing the data storage space and speeding up parallel processing.
The invention stores Git objects separately by type to avoid redundancy and reduce storage cost; oriented to code traceability detection, binary files are not stored; each Git object database comprises cache data and content data, stored respectively in a cache database and a content database, to speed up retrieval; for parallelism, the cache database and content database of each Git object type can be divided into multiple parts (e.g., 128); the content database is saved in appended (spliced) form for convenient updating.
Specifically, the present invention stores Git objects separately by type to avoid redundancy, so there are 4 types of databases in total: a commit database, a blob database, a tree database, and a tag database. Each database contains cache data and content data, stored in the cache database and the content database respectively. The cache database is used to quickly determine whether a particular object is already stored in our database, and is necessary for the data extraction described above (if the object is already present, it is not extracted again, thus saving time). In addition, the cache database also helps to determine whether a repository needs to be cloned. If the heads of a repository (the commit objects pointed to by each branch, held in .git/refs/heads) are already in our cache database, the repository has not been updated and there is no need to clone it.
The cache database is a key-value database, where the key is the SHA1 value (20 bytes) of the Git object, and the value is the offset position and size, within the content database, of the Git object compressed using Perl's compression library. The content database contains the compressed content of the Git objects, continuously spliced together. Because the content database is saved by splicing, updates can be completed quickly: new content only needs to be appended to the end of the corresponding file. Although this storage method allows fast scans of the whole database, it is not optimal for the random lookups required. For example, when computing the modification made by a commit, we need to traverse the commit database twice to obtain the tree object pointed to by the commit object and the tree object pointed to by its parent commit, then traverse the tree database multiple times to obtain the contents of the two tree objects and find the differing files, and finally traverse the blob database once to compute the modification; each traversal incurs repeated additional time overhead. Thus, for commits and trees, the present invention additionally creates random-lookup key-value databases, where the key is the SHA1 of the Git object and the value is the compressed content of the Git object. These key-value databases have relatively fast random query performance; testing shows that a single thread on a server with an Intel E5-2623 CPU can randomly query 1 million Git objects within 6 seconds, i.e., each thread queries over 170K Git objects per second.
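The cache-database / content-database split can be sketched as an append-only byte store plus a key-to-(offset, size) index. Here zlib stands in for the Perl compression library, an in-memory dict and bytearray stand in for the TokyoCabinet files, and the class name is illustrative:

```python
import zlib

class ObjectStore:
    """Sketch of the split storage: an append-only content database
    plus a cache database mapping SHA1 -> (offset, size)."""
    def __init__(self):
        self.content = bytearray()   # content database (spliced compressed objects)
        self.cache = {}              # cache database: sha1 -> (offset, size)

    def put(self, sha1: str, data: bytes):
        if sha1 in self.cache:       # already stored: skip re-extraction
            return
        compressed = zlib.compress(data)
        self.cache[sha1] = (len(self.content), len(compressed))
        self.content += compressed   # splice onto the end: updates are appends

    def get(self, sha1: str) -> bytes:
        off, size = self.cache[sha1]
        return zlib.decompress(bytes(self.content[off:off + size]))
```

Membership tests against `cache` are what let extraction and clone decisions skip objects already in the code base.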
Currently, the present invention has retrieved more than 20 billion Git objects (including 2.3 billion commit objects, 9.1 billion blob objects, 9.4 billion tree objects and 18 million tag objects), with a data storage space of about 150TB. Processing such large volumes of data would be particularly inefficient without parallelism. The invention realizes parallelization using SHA1 values: the low 7 bits of the first byte of the SHA1 value of a Git object are used to split each type of database into 128 shares. Thus, each of the four types of Git objects has 128 cache databases and 128 content databases. In addition, the commit objects and tree objects each have 128 random-lookup key-value databases, for a total of 128 × (4+4+2) databases, which can be placed on one server to accelerate parallel processing. Currently, the size of a single content database ranges from 20MB (tag objects) to 0.8TB (blob objects), and the largest single cache database is that of the tree objects, at 2GB.
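The 128-way sharding rule reduces to a bit mask on the first byte of the SHA1; the function name is illustrative:

```python
def shard_of(sha1_hex: str, shards: int = 128) -> int:
    """Return the database shard for a Git object: the low 7 bits of
    the first byte of its SHA1 (shards must be a power of two)."""
    first_byte = int(sha1_hex[:2], 16)
    return first_byte & (shards - 1)
```

Because SHA1 output is uniformly distributed, this yields 128 near-equal shards, and any worker can locate an object's shard from its key alone, with no coordination.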
Nevertheless, the volume of data limits the choice of database. For example, a graph database such as Neo4j is very useful for storing and querying relationships, including transitive relationships, but it cannot handle billions of relationships (at least on common servers). Besides Neo4j, the present invention also considered a number of conventional databases, evaluating the common relational databases MySQL and PostgreSQL and the key-value (NoSQL) databases MongoDB, Redis and Cassandra. SQL databases, like all centralized databases, have limitations in processing PB-level data. The present invention therefore turned to NoSQL databases, which are designed for large-scale data storage and massively parallel data processing on large numbers of commodity servers.
Through testing, the present invention adopted a database written in the C language named TokyoCabinet (similar to Berkeley DB). TokyoCabinet uses hashing as its index and provides read query performance about ten times faster than common key-value databases such as MongoDB or Cassandra. Its faster read query speed and strong portability exactly meet the construction requirements of a code library for full-network code traceability detection, so it is used in place of more fully featured NoSQL databases.
D. Code information mapping construction, comprising the following steps:
Designing and generating relation mappings for quickly retrieving the full-network information of a code file (blob), including the projects and commits containing it, its author and creation time, and its file name.
The code library aims to rapidly perform full-network traceability detection on code and to support analysis of the security and compliance of software projects. The invention thus generates mappings from a code file (blob) to its information (including the projects and commits containing it, the author and time of its creation, and its filename) and saves them in the form of databases, allowing retrieval of the code file's full-network information. The full-network information of a code file is useful for comprehensive evaluation of the security and compliance of software projects, and is the key content of full-network code traceability detection.
The information of a code file includes the projects and commits containing it, its file names, its author and the time it was created. The author and creation time are contained in the commit that created it, and the commit-to-project and project-to-commit relation mappings are completed during the data extraction of step B. Therefore, the present invention builds the relation mappings centered on commits, specifically: a mutual mapping between commits and projects, a mapping from commits to authors and times, a mapping from authors to commits, a mutual mapping between commits and code files (blobs), and a mutual mapping between commits and filenames. Then, the list of projects containing a code file (blob) can be determined by combining the blob-to-commit and commit-to-project relationships; similarly, the creation time of a code file can be determined by combining the blob-to-commit and commit-to-time relationships, and its author by combining the blob-to-commit and commit-to-author relationships.
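The composition of mappings described above can be sketched with plain dicts standing in for the TokyoCabinet databases; the function names are illustrative:

```python
def projects_of_blob(blob, blob_to_commits, commit_to_projects):
    """Compose blob->commit with commit->project to list every
    project that contains a given code file (blob)."""
    projects = set()
    for commit in blob_to_commits.get(blob, ()):
        projects.update(commit_to_projects.get(commit, ()))
    return sorted(projects)

def earliest_author_time(blob, blob_to_commits, commit_to_meta):
    """Compose blob->commit with commit->(time, author) and take the
    earliest entry: the creation time and original author of the blob."""
    metas = [commit_to_meta[c] for c in blob_to_commits.get(blob, ())]
    return min(metas) if metas else None
```

Taking the minimum over all introducing commits matters for provenance: the earliest commit identifies where the code was first authored, and every later project is a downstream copy.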
The mappings from commits to authors, times and projects are not difficult to build, because author and time are part of the commit object, and the mapping between commits and projects is obtained during the data extraction of step B. However, a code file (blob) introduced or deleted by a commit has no direct relationship to the commit, and must be calculated by recursively traversing the tree objects of the commit and of its parent commit. A commit contains a snapshot of the repository, containing all trees (folders) and blobs (code files). To calculate the difference between a commit and its parent, i.e., the newly added code files (blobs), we start with the tree object pointed to by each commit object, traverse each child tree, and extract all code files (blobs). By comparing all code files (blobs) of the two commits, the new code files (blobs) introduced by the commit are obtained. On average, it takes about 1 minute in a single thread to acquire the file names and code files (blobs) changed by ten thousand commits. It is estimated that for 2.3 billion commits, the total single-threaded time would be 104 days, which can be reduced to one week by running 16 threads on a server with a 16-core Intel E5-2623 CPU. In addition, these relationships are incremental: they are generated in full only once, after which the operations described above are performed on each updated commit and the results are inserted into the existing databases. The correspondence between code files (blobs) and filenames cannot be determined from the combination of the blob-to-commit and commit-to-filename relationships, because one commit may modify multiple files. The invention therefore also constructs the mutual mapping between code files and file names, to support the tracing of specific code segments. For example, if a piece of Python code is to be trace-checked, then all Python files need to be checked: the mapping from file names to code files can retrieve all Python files ending with .py, and code tracing checks are then performed on these files.
Similar to the data storage of step C, the present invention uses TokyoCabinet databases to store these relation mappings for quick retrieval, and likewise uses partitioned storage to increase retrieval efficiency. Specifically, the present invention divides each class of relation mapping into 32 sub-databases. For commits and code files (blobs), the low 5 bits of the first byte of the SHA1 are used for partitioning. For authors, projects, and filenames, the low 5 bits of the first byte of the FNV-1 hash are used for partitioning.
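Keys such as author names and filenames have no SHA1, so a cheap non-cryptographic hash supplies the partition bits. A sketch of the FNV-1-based 32-way partitioning; the 32-bit hash width and the function names are assumptions, since the document does not state them:

```python
def fnv1_32(data: bytes) -> int:
    """32-bit FNV-1: multiply by the FNV prime, then XOR in each byte."""
    h = 0x811C9DC5                            # FNV-1 offset basis
    for byte in data:
        h = (h * 0x01000193) & 0xFFFFFFFF     # FNV prime, mod 2^32
        h ^= byte
    return h

def partition_of(key: str, parts: int = 32) -> int:
    """Low 5 bits of the first (most significant) byte of the FNV-1
    hash select one of 32 sub-databases (authors, projects, filenames)."""
    first_byte = fnv1_32(key.encode()) >> 24
    return first_byte & (parts - 1)
```

As with the SHA1 sharding, any worker can compute a key's partition locally, so lookups and updates for different partitions proceed fully in parallel.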
E. Data update
Keeping the code library up to date is critical to the code trace-source detection task. In order to obtain acceptable update times, the present invention accomplishes the update of the data in the following manner:
1. New Git projects are identified, cloned, and the Git objects therein extracted. By comparing the newly discovered open source project list of step A with the previous open source project list, the new projects are determined; the newly added projects are cloned locally, and the Git objects in them are extracted.
2. Updated projects are identified; then only the updated projects are fetched, and the newly added Git objects are extracted. The invention modifies the git fetch protocol based on Libgit2 as follows:
1) Step 3, modifying the original gate feed: the branch references sent back by the remote repository are saved locally. After the filter_ways function in the src/fetch.c file of Libgit2 calls the git_remote_ls function, the SHA1 value of the heads sent back by the remote warehouse received by the git_remote_ls is saved into the file.
2) Modifying the original gate feed step 5: modify src/transport/smart_protocol.c file of Libgit2, modify the git_smart __ new_fetch function: the call for git_revwalk_next is annotated, the git_reference_name_to_id call is added so that the most up-to-date commit object of the main branch is sent each time, repeating up to 256 times until an ACK signal is received from the remote repository.
3) Step 6, modifying the original gate feed: modifying the gate_smart __ new_fetch function in the Libgit 2/src/transport/smart_protocol. C file to send the remote repository back data (gate_pkt_progress)
* p) saving the file in a local file, returning the file directly, and not carrying out the step 7.
After these modifications, the Libgit2 library is recompiled, and the modified git fetch protocol is then used to obtain the newly added Git object data of remote repositories. The specific steps are as follows:
1. Initialize an empty Git repository.
2. The SHA1 values and contents of all branch references of a repository are extracted from the constructed code library and used to populate the empty Git repository. The population procedure is: construct the header information of each object in the format object type + space + object content length + one null byte, such as "blob 12\u0000". The header information and the original data are then concatenated, and the concatenated content is compressed with zlib's compress function. Finally, a subdirectory named with the first two hex characters of the SHA1 is created in the .git/objects folder of the empty Git repository, a file named with the last 38 hex characters of the SHA1 is created in that subdirectory, and the compressed content is written into that file.
3. A file named after the branch (e.g., master) is created in the .git/refs/heads folder of this empty Git repository, and the SHA1 value of the commit referenced by the branch is written into this file.
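The repository-population steps above follow Git's loose object format and can be sketched as follows; `git_dir` points at the `.git` directory, and the function names are illustrative:

```python
import hashlib, os, zlib

def write_loose_object(git_dir: str, otype: str, content: bytes) -> str:
    """Write a loose Git object: header '<type> <len>\\0' + content,
    zlib-compressed, stored at <git_dir>/objects/<sha1[:2]>/<sha1[2:]>."""
    store = b"%s %d\0" % (otype.encode(), len(content)) + content
    sha1 = hashlib.sha1(store).hexdigest()
    obj_dir = os.path.join(git_dir, "objects", sha1[:2])
    os.makedirs(obj_dir, exist_ok=True)
    with open(os.path.join(obj_dir, sha1[2:]), "wb") as f:
        f.write(zlib.compress(store))
    return sha1

def write_branch_ref(git_dir: str, branch: str, commit_sha1: str):
    """Point <git_dir>/refs/heads/<branch> at a commit SHA1."""
    ref_dir = os.path.join(git_dir, "refs", "heads")
    os.makedirs(ref_dir, exist_ok=True)
    with open(os.path.join(ref_dir, branch), "w") as f:
        f.write(commit_sha1 + "\n")
```

A minimal repository populated this way is enough for the modified fetch to present the code library's branch heads to the remote side.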
The invention directly splices the data of each newly added Git object into the corresponding content database according to its type and SHA1 value, records its SHA1 value in the cache database along with its offset and size in the content file, and updates the corresponding relation mapping databases.
After the code library is constructed, full-network traceability detection of code can be rapidly performed at file granularity. The method comprises the following steps:
1. The SHA1 value of the code file is calculated, here using the sha1 function of Python's hashlib library. For example, the https://github.com/fchollet/deep-learning-models/blob/master/resnet50.py file contains an implementation of the deep learning model ResNet50; its SHA1 value is calculated to be e8cf3d7c248fbf6608c4947dc53cf368449c8c5f.
2. Using the SHA1 of the code file as the key, the full-network information of the code file is queried from the code information mapping databases constructed in step D, including the list of projects containing the code file, the list of commits containing it, the corresponding file names, the authors, etc., and this information is fed back to the user. For the example above, 192 commits containing the blob are obtained from the blob-to-commit mapping, and 377 projects containing the blob are obtained from the commit-to-project mapping. The above process takes only 0.831s.
It should be noted that the purpose of the disclosed embodiments is to aid further understanding of the present invention, but those skilled in the art will appreciate that: various alternatives and modifications are possible without departing from the scope of the invention and the appended claims. Therefore, the invention should not be limited to the disclosed embodiments, but rather the scope of the invention is defined by the appended claims.
Claims (10)
1. A code base design method for rapid full-network code traceability detection, used for obtaining a code base by efficiently storing the Git objects of the open source projects using Git across the whole network, through the processes of project discovery, data extraction, data storage, code information mapping construction and data updating, and realizing efficient updating of the code base;
comprising: adopting a storage mode in which Git objects are divided and stored by type; constructing relation mappings between code files and code file information, for rapid retrieval of the full-network information of code files; adopting an efficient updating mode for the constructed ultra-large-scale code library, in which a customized git fetch protocol is proposed based on the Libgit2 function library and the constructed ultra-large-scale code library is used as the back end to efficiently obtain the newly added Git object data of remote repositories;
the method specifically comprises the following steps:
A. A server is utilized to obtain a full-network open source software project list through a plurality of project discovery methods, and the scripts of the project discovery process are packed into a Docker image;
B. Data extraction: downloading the projects in the open source project list acquired in step A to the local machine and extracting the Git objects within them; the extraction is completed in parallel in a multithreaded manner on a server cluster;
C. Git object data storage: the Git objects are divided and stored in blocks according to their type, which reduces the data storage space and improves parallel processing efficiency; specifically comprising:
a. The binary files included in the open source projects are not saved;
b. The Git object data is classified and stored according to Git object type, i.e., the databases comprise a commit database, a tree database, a blob database and a tag database, reducing the data storage space to the hundred-TB level while allowing fast lookup of whether data is already stored in the code base;
c. The database of each Git object type comprises cache data and content data, stored respectively in a cache database and a content database, speeding up retrieval; the cache database and content database contained in each type of database are divided into a plurality of parts for parallelization; the cache database is used for quickly determining whether a certain Git object is already stored in the database, and is necessary for data extraction: if a certain Git object exists in the database, it is not extracted; the cache database is also used for determining whether a repository needs to be cloned: if the commit objects pointed to by the heads of a repository are already in the cache database, no cloning is required;
d. the cache database is a key-value database; the content database is saved in an append-only (spliced) manner so as to be convenient to update;
the key in the cache database is the SHA1 value of the Git object, and the value is the offset position and size of the Git object in the content database after compression using the Perl Compress library;
the content database contains the compressed contents of the Git objects, spliced continuously together; because the content database is saved in this append-only manner, new content is simply spliced onto the end of the corresponding file;
random-lookup key-value databases are additionally created for the commit and tree objects respectively, where the key is the SHA1 of the Git object and the value is the compressed content of the corresponding Git object;
e. each type of database is divided into a plurality of parts by SHA1 value, realizing parallelization acceleration;
f. TokyoCabinet, a hash-indexed database, is used;
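As an illustrative sketch (not part of the claims), the cache/content pair of steps b–d can be modeled in a few lines of Python; the dict and bytearray below are stand-ins for the TokyoCabinet cache database and the append-only content file, and object naming follows Git's standard `<type> <length>\0<data>` hashing:

```python
import hashlib
import zlib

class ObjectShardStore:
    """Toy model of one (cache, content) pair for a single Git object
    type. The real system uses TokyoCabinet key-value files; a dict
    and a bytearray stand in here purely for illustration."""

    def __init__(self):
        self.cache = {}             # SHA1 (hex) -> (offset, size)
        self.content = bytearray()  # compressed objects, spliced end to end

    @staticmethod
    def git_sha1(obj_type: str, data: bytes) -> str:
        # Git object id: SHA1 over "<type> <len>\0<data>"
        header = f"{obj_type} {len(data)}".encode() + b"\x00"
        return hashlib.sha1(header + data).hexdigest()

    def add(self, obj_type: str, data: bytes) -> str:
        sha = self.git_sha1(obj_type, data)
        if sha in self.cache:       # already stored: skip extraction
            return sha
        compressed = zlib.compress(data)
        offset = len(self.content)
        self.content += compressed  # append-only splice
        self.cache[sha] = (offset, len(compressed))
        return sha

    def get(self, sha: str) -> bytes:
        offset, size = self.cache[sha]
        return zlib.decompress(bytes(self.content[offset:offset + size]))

store = ObjectShardStore()
sha = store.add("blob", b"hello world\n")
assert store.get(sha) == b"hello world\n"
```

Because the cache is consulted before appending, re-adding an object that is already stored is a no-op, which is exactly the property used above to skip extraction and cloning.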
D. constructing a mapping of code information relationships centered on commits, comprising: mappings from a code file to the projects containing it, from a code file to the commits containing it, from a code file to its authors, from a code file to its file names, and from a code file to its creation time; the relation mappings are stored in TokyoCabinet databases using block storage so as to allow fast retrieval;
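For illustration only, the commit-centered mappings of step D compose as follows; the dict contents are hypothetical stand-ins for the TokyoCabinet relation databases:

```python
# Toy commit-centered relationship maps (illustrative data only).
blob_to_commits = {"b1": ["c1", "c2"]}
commit_to_project = {"c1": "projA", "c2": "projB"}
commit_to_author = {"c1": "alice", "c2": "bob"}
commit_to_time = {"c1": 1609459200, "c2": 1612137600}

def projects_of_blob(blob):
    # blob -> commit -> project composition
    return sorted({commit_to_project[c] for c in blob_to_commits[blob]})

def creation_time_of_blob(blob):
    # creation time: earliest commit containing the blob
    return min(commit_to_time[c] for c in blob_to_commits[blob])

def authors_of_blob(blob):
    # blob -> commit -> author composition
    return sorted({commit_to_author[c] for c in blob_to_commits[blob]})
```

With the sample data, `projects_of_blob("b1")` composes the blob-to-commit and commit-to-project relations to recover every project containing the file.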
E. acquiring new Git objects and updating the data of the code library; this comprises two methods:
a. identifying a new Git project, cloning it, and extracting the Git objects in it;
b. identifying updated projects by obtaining the latest commits of the branches of the remote repositories of the collected repositories, and then modifying git fetch so that, when no local Git repository is available, the constructed code library serves as the back end, the newly added Git objects of the remote repository are obtained, and the newly added Git objects are extracted into the code library; this specifically comprises the following steps:
b1 Adding the remote repository to the local repository; the remote warehouse is represented in Libgit2 by the git _ remote structure,
all branch references within the local warehouse. Git/refs/heads folder are filled into one member variable ref within the structure when the structure is created;
b2 A connection from the local warehouse to the remote warehouse is established;
b3 After the connection is established, the remote warehouse replies a response, and all branch references of the remote warehouse, namely, contents in the file folders of the git/refs/heads, are sent to the local;
the branch references sent back by the remote repository are stored locally, and it is judged whether they are already stored in the local code library; if so, the remote repository has no updates; if not, the next step is entered;
b4 After the local warehouse receives the references sent back by the remote warehouse, checking whether the objects pointed by the references are in the local warehouse one by one; if in the local repository, a flag is made indicating that the remote repository need not be requested to send updates, then these references are inserted into the member variables;
b5 The local warehouse orders the commit, sends the member variable, including the marked references, back to the remote warehouse, and negotiates with the remote warehouse; locally waiting for an ACK signal sent back by the remote warehouse; specifically, the most up-to-date commit object of the main branch is sent each time, and the execution is repeated for a plurality of times until an ACK signal of the remote warehouse is received;
b6 After negotiating with the remote repository, the remote repository may calculate the Git object to be sent back to the local; the remote warehouse packages the objects into files in a pack format and sends the files back to the local;
the pack-format file sent back by the remote repository is stored locally, and the pack file is parsed into Git objects for the code library;
through the above steps, after the modification of git fetch, the update is performed with the constructed code library as the back end; a complete repository does not need to be cloned for each update, reducing network bandwidth overhead and time overhead.
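The decision logic of steps b3)–b4) — which remote branch heads actually require fetching — can be sketched as follows; this is a simplification of the negotiation, not the real pkt-line wire protocol, and all names are illustrative:

```python
def refs_needing_update(remote_refs, local_objects):
    """remote_refs: branch name -> commit SHA1 (the remote's
    .git/refs/heads contents, step b3).
    local_objects: set of SHA1s already in the local code library.
    Returns (wants, haves): refs that must be requested from the
    remote vs. refs flagged as already present (step b4)."""
    wants, haves = [], []
    for branch, sha in sorted(remote_refs.items()):
        (haves if sha in local_objects else wants).append(sha)
    return wants, haves

def remote_has_updates(remote_refs, local_objects):
    # If every branch head is already stored locally, the remote
    # repository has no updates and no fetch is needed at all.
    return any(sha not in local_objects for sha in remote_refs.values())
```

A repository whose every branch head is already in the cache database is skipped outright, which is the same check used to avoid re-cloning in step C.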
2. The code library design method for rapid full-network code traceability detection according to claim 1, wherein the method for rapid full-network code traceability detection at file granularity based on the code library comprises the following steps:
1) Calculating the SHA1 value of a code file;
2) According to the code information mapping constructed in step D, querying the full-network information of the code file with its SHA1 as the key, the information comprising the list of projects, the list of commits, the corresponding file names and the author information of the code file, and feeding these back to the user.
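Steps 1) and 2) amount to hashing the file the way Git names a blob and using the digest as a key into the step-D mapping; the sketch below assumes the mapping is exposed as a simple key-value lookup, and the entry data are hypothetical:

```python
import hashlib

def file_sha1(content: bytes) -> str:
    # Step 1): SHA1 of a code file, computed the way Git names a blob:
    # SHA1 over "blob <length>\0<content>"
    header = f"blob {len(content)}".encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()

def trace_file(content: bytes, blob_info: dict):
    # Step 2): query the full-network information with the SHA1 as key;
    # returns None when the file is unknown to the code library
    return blob_info.get(file_sha1(content))

# Hypothetical mapping entry, keyed by the file's Git blob SHA1
blob_info = {
    file_sha1(b"test content\n"): {
        "projects": ["user/projA"],
        "commits": ["c1"],
        "filenames": ["util.c"],
        "authors": ["alice"],
    }
}
```

A hit returns the project list, commit list, file names and authors in one lookup; a miss means the file never appeared in any collected repository.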
3. The code library design method for rapid full-network code traceability detection according to claim 1, wherein in step A, the server comprises servers with Intel E5-2670 CPUs; the development collaboration platforms hosting open-source software projects comprise GitHub, Bitbucket, GitLab and SourceForge; obtaining the full-network open-source software project list through multiple project discovery methods comprises: using the APIs provided by the development collaboration platforms and parsing the platforms' web pages, then taking the union of the discovered project sets as the final open-source project list.
4. The code library design method for rapid full-network code traceability detection according to claim 1, wherein step B performs data extraction, specifically by creating a local copy of the remote repository through the git clone command, and then batch-extracting all Git objects in the cloned open-source project through Libgit2.
5. The code library design method for rapid full-network code traceability detection according to claim 4, wherein the data extraction specifically adopts a cluster of 36 nodes with Intel E5-2670 CPUs, each node having a 16-core CPU and 256 GB of memory, with 16 threads started on each node; Libgit2, the Git C-language interface, is used to list all Git objects in a project, classify them by object type, and extract the content of each object.
6. The code library design method for rapid full-network code traceability detection according to claim 1, wherein in step C, the cache database and the content database contained in each class of database can be divided into 128 parts for parallelism;
the parallelization is realized using the SHA1 value: specifically, the last 7 bits of the first byte of the SHA1 value of a Git object are used to divide each type of database into 128 parts, so that the four types of Git objects have 128 cache databases and 128 content databases each, the commit and tree objects additionally have 128 random-lookup key-value databases each, giving 128 × (4 + 4 + 2) databases in total; these databases may be distributed over servers to accelerate parallelism.
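A minimal sketch of the sharding rule in claim 6: the low 7 bits of the SHA1's first byte select one of 128 shards, and the stated database count works out to 1280:

```python
def shard_of(sha1_hex: str) -> int:
    # last 7 bits of the first byte of the SHA1 value -> shard 0..127
    return int(sha1_hex[:2], 16) & 0x7F

# Four object types each have 128 cache and 128 content databases;
# commit and tree objects each add 128 random-lookup databases.
TOTAL_DATABASES = 128 * (4 + 4 + 2)  # 1280
```

Masking with 0x7F means two SHA1 values whose first bytes differ only in the top bit land in the same shard, so the split is uniform over the 128 parts.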
7. The method for rapid full-network code traceability detection according to claim 6, wherein the size of a single content database ranges from 20 MB for tag objects to 0.8 TB for blob objects, and the size of a single cache database is up to 2 GB.
8. The code library design method for rapid full-network code traceability detection according to claim 1, wherein step D builds relationship mappings centered on commits, comprising:
constructing a mutual mapping between commits and projects, a relationship mapping from commits to authors and times, a relationship mapping from authors to commits, a mutual mapping between commits and code file blobs, and a mutual mapping between commits and file names;
determining the list of projects containing a code file blob by combining the relationships from code file blob to commit and from commit to project; determining the creation time of a code file blob by combining the relationships from code file blob to commit and from commit to time; determining the authors of a code file blob by combining the relationships from code file blob to commit and from commit to author;
further constructing a relation mapping between code files and file names to support the tracing of specific code segments;
the relation mappings are stored in TokyoCabinet databases using block storage, each type of relation mapping being divided into 32 sub-databases; for commits and code file blobs, partitioning uses the last 5 bits of the first byte of the SHA1 value; for authors, projects and file names, the last 5 bits of the first byte of the FNV-1 hash are used for partitioning.
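The 32-way partitioning of claim 8 can be sketched as follows; the claim does not pin down which byte of the FNV-1 hash counts as "first", so the most significant byte of a 32-bit FNV-1 hash is assumed here:

```python
def fnv1_32(data: bytes) -> int:
    # 32-bit FNV-1: multiply by the FNV prime, then XOR in each byte
    h = 0x811C9DC5                      # FNV offset basis
    for b in data:
        h = (h * 0x01000193) & 0xFFFFFFFF
        h ^= b
    return h

def partition_sha(sha1_hex: str) -> int:
    # commits / code file blobs: last 5 bits of the SHA1's
    # first byte -> sub-database 0..31
    return int(sha1_hex[:2], 16) & 0x1F

def partition_name(name: str) -> int:
    # authors / projects / file names: last 5 bits of the first
    # (here: most significant) byte of the FNV-1 hash -> 0..31
    return (fnv1_32(name.encode()) >> 24) & 0x1F
```

Keys that are not SHA1 values (author names, project names, file names) get a uniform split only after hashing, which is why FNV-1 is applied to them first.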
9. The code library design method for rapid full-network code traceability detection according to claim 1, wherein in step E the Libgit2 library is recompiled after git fetch is modified by method b, so as to acquire the newly added Git object data of the remote repository; the specific steps are as follows:
E1. initializing an empty Git repository;
E2. extracting the SHA1 values and contents of all branch references of a repository from the constructed code library, and filling them into the empty Git repository;
E3. creating a file named after the branch in the .git/refs/heads folder of the empty Git repository, and writing into the file the SHA1 value of the commit referenced by the branch;
the data of a newly added Git object are spliced directly onto the end of the corresponding content database according to its type and SHA1 value, and its SHA1 value together with its offset and size in the content file are recorded in the cache database, i.e., the corresponding relation mapping databases are updated.
10. The code library design method for rapid full-network code traceability detection according to claim 9, wherein in step E2 the filling is performed as follows:
the header information of a branch reference is constructed in the format: object type + space + object content length + a null byte;
the header information is then spliced with the original data, and the spliced content is compressed using the zlib compression function;
finally, a subdirectory named with the first two characters of the SHA1 value is created in the .git/objects folder of the empty Git repository, a file named with the last 38 characters of the SHA1 value is created in that subdirectory, and the compressed content is written into the file.
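The steps of claim 10 correspond to Git's standard loose-object layout; a minimal sketch (the helper name and directory handling are illustrative):

```python
import hashlib
import os
import tempfile
import zlib

def write_loose_object(repo_dir: str, obj_type: str, data: bytes) -> str:
    """Fill one Git object into an empty repository as a loose object:
    build the header '<type> <length>\\0', splice it with the raw data,
    zlib-compress the result, and write it to
    .git/objects/<first 2 SHA1 chars>/<last 38 SHA1 chars>."""
    header = f"{obj_type} {len(data)}".encode() + b"\x00"
    store = header + data
    sha = hashlib.sha1(store).hexdigest()
    obj_dir = os.path.join(repo_dir, ".git", "objects", sha[:2])
    os.makedirs(obj_dir, exist_ok=True)
    with open(os.path.join(obj_dir, sha[2:]), "wb") as f:
        f.write(zlib.compress(store))
    return sha

repo = tempfile.mkdtemp()
sha = write_loose_object(repo, "blob", b"test content\n")
```

Because the SHA1 is taken over the header plus content, the resulting object is byte-identical to what `git hash-object -w` would produce, so the filled repository is readable by unmodified Git tooling.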
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110278117.6A CN112988217B (en) | 2021-03-10 | 2021-03-10 | Code base design method and detection method for rapid full-network code traceability detection |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112988217A CN112988217A (en) | 2021-06-18 |
CN112988217B true CN112988217B (en) | 2023-11-17 |
Family
ID=76335615
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107315689A (en) * | 2017-07-04 | 2017-11-03 | 上海爱数信息技术股份有限公司 | The Automation regression testing method of granularity is retrieved based on Git code files |
CN108563444A (en) * | 2018-03-22 | 2018-09-21 | 福州瑞芯微电子股份有限公司 | A kind of Android system firmware source code restoring method and storage medium |
CN109697362A (en) * | 2018-12-13 | 2019-04-30 | 西安四叶草信息技术有限公司 | Network hole detection method and device |
CN109800018A (en) * | 2019-01-10 | 2019-05-24 | 郑州云海信息技术有限公司 | A kind of code administration method and system based on Gerrit |
CN110334326A (en) * | 2019-09-02 | 2019-10-15 | 宁波均胜普瑞智能车联有限公司 | A kind of method and system for identifying recipe file and being converted into XML file |
CN111753149A (en) * | 2020-06-28 | 2020-10-09 | 深圳前海微众银行股份有限公司 | Sensitive information detection method, device, equipment and storage medium |
CN111813378A (en) * | 2020-07-08 | 2020-10-23 | 北京迪力科技有限责任公司 | Code base construction system, method and related device |
CN111813412A (en) * | 2020-06-28 | 2020-10-23 | 中国科学院计算机网络信息中心 | Method and system for constructing test data set for evaluating binary code comparison tool |
Non-Patent Citations (1)
Title |
---|
A code provenance analysis method based on code clone detection; Li Suo; Wu Yijian; Zhao Wenyun; Computer Applications and Software (No. 02); full text *
Legal Events
Date | Code | Title | Description
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||