CN112988217B - Code base design method and detection method for rapid full-network code traceability detection
- Publication number: CN112988217B
- Application number: CN202110278117.6A
- Authority: CN (China)
- Prior art keywords: code, git, database, commit, file
- Legal status: Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/70—Software maintenance or management
- G06F8/71—Version control; Configuration management
Description
Technical Field
The present invention provides a code base design method for fast network-wide code traceability detection and a fast network-wide code traceability detection method based on the code base, and belongs to the technical field of software engineering.
Background
With the vigorous development of open source software, a massive amount of high-quality open source resources has accumulated on the Internet, and software development increasingly uses open source code. While the use of open source code improves development efficiency, it also introduces risks: if the origin of a piece of open source code is unknown, subsequent vulnerability fixes to that code cannot be applied in sync, and the user is exposed to legal risks such as license compliance and intellectual property issues, bringing varying degrees of security threats and economic or reputational loss. A well-known open source risk case is the Heartbleed vulnerability, a security flaw in the cryptographic library OpenSSL, which is widely used to implement the Internet's Transport Layer Security protocol. It was introduced into OpenSSL in 2012 and first disclosed to the public in April 2014. Any server or client using a defective OpenSSL instance was open to attack. Traceability detection of the code in software products is therefore crucial.
To implement code traceability detection for software products, a code base must be built for code matching and search; the amount of code the code base contains and the way it is constructed directly affect the accuracy and efficiency of detection. Owing to the difficulty of building large code bases, most existing code traceability detection techniques propose efficient detection algorithms under the assumption that a massive code base already exists (for example, research on selecting, from massive open source repositories, the open source software whose code is most likely to have been reused, to participate in traceability comparison), while effective techniques for building such a code base are lacking. In the prior art, a code base is usually formed by downloading a number of open source projects locally. Such code bases, however, cover too few projects to support network-wide code traceability detection, and their poorly designed architecture makes detection inefficient.
Summary of the Invention
In view of the technical problems in the prior art, the present invention provides a code base design method for fast network-wide code traceability detection, and realizes fast network-wide code traceability detection based on the code base. The code base generated by the method of the present invention supports fast network-wide traceability detection of code at file granularity with high detection efficiency, and can be updated regularly and efficiently.
In the present invention, "network-wide code" refers to code data collected from the great majority of open source code hosting platforms. A repository hosted on a code hosting platform is called a remote repository; a remote repository cloned locally becomes a local repository. The code base designed by the present invention for fast network-wide code traceability detection is a database formed by cloning remote repositories into local repositories and then extracting data from those local repositories.
The code base design provided by the present invention for fast network-wide code traceability detection exploits Git's internal principles and hash values, specifically including:
1) The git clone command downloads a remote repository to the local machine, and the git fetch command transfers updates from the remote repository back to the local machine. git fetch compares the heads of the local and remote repositories to work out which objects the local repository is missing compared with the remote, and the remote repository then transfers those missing objects back.
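The head comparison in step 1) can be illustrated as a simple set difference. This is a simplified sketch with hypothetical SHA1s, not the actual Git wire protocol:

```python
def missing_objects(remote_heads, local_objects):
    """Return the remote head commits absent from the local object store.

    remote_heads: dict mapping branch name -> commit SHA1 on the remote.
    local_objects: set of SHA1s already stored locally.
    A head already present locally marks a branch needing no transfer.
    """
    return {branch: sha for branch, sha in remote_heads.items()
            if sha not in local_objects}

remote = {"main": "a1b2c3", "dev": "d4e5f6"}  # hypothetical SHA1s
local = {"a1b2c3"}                            # main is already up to date
assert missing_objects(remote, local) == {"dev": "d4e5f6"}
```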
2) Git uses four types of data objects for version control; these objects are referenced by SHA1 values computed from their content. A commit object represents one change to the project and contains the SHA1 of its parent commit(s) (if any) and of the top-level folder (tree object), the author ID and timestamp, the committer ID and timestamp, and the commit message. A tree object represents a folder within the project: it is a list containing the SHA1s of the files (blobs) and subfolders (other tree objects) in that folder, together with their modes, types and names. A blob object is the compressed content (source code) of one version of a file. A tag object is a string that associates a human-readable name with a specific version of the repository. A commit represents one code change and usually includes modifications to several files (blobs).
3) A hash value is a fixed-length digest computed from file content; different file contents produce different hash values, so a hash value can uniquely index a file.
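The SHA1 that Git assigns to an object is computed over a header of the form `<type> <size>\0` followed by the object content; a blob's hash (as produced by `git hash-object`) can be reproduced like this:

```python
import hashlib

def git_object_sha1(obj_type: str, content: bytes) -> str:
    """SHA1 of a Git object: hash of b"<type> <len>\\0" + content."""
    header = f"{obj_type} {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()

# Matches `echo 'hello' | git hash-object --stdin`
assert git_object_sha1("blob", b"hello\n") == \
    "ce013625030ba8dba906f756967f9e9ca394464a"
```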
The present invention adopts a code base design method oriented toward network-wide code traceability detection: the Git objects inside open source projects across the network that use Git are stored efficiently to obtain the code base, which can be used for code traceability detection and analysis, and an efficient update scheme for the code base is provided at the same time. Specifically, the design and construction of the code base for network-wide code traceability detection comprises: project discovery, data extraction, data storage, code information mapping construction, and data update. For data storage, the present invention designs a storage mode, different from Git's traditional storage, in which Git objects are stored in blocks by object type; this mode greatly reduces the storage space of the code base and improves the efficiency of network-wide retrieval, and is an original contribution of the present invention. The code information mapping construction is likewise original to the present invention: mappings are built from a code file to its information (the projects and commits containing it, the author and time of its creation, and its file names), so that the network-wide information of a code file can be retrieved quickly. The present invention further proposes an efficient way to update the constructed ultra-large-scale code base: a customized git fetch protocol based on the Libgit2 library which, with the constructed code base as its back end, correctly obtains the newly added Git objects of a remote repository at minimal time and space cost. Finally, the present invention also provides a fast network-wide traceability detection scheme for code at file granularity.
The technical solution of the present invention is as follows:
A code base design method for fast network-wide code traceability detection: the Git objects inside open source projects across the network that use Git are stored efficiently to obtain the code base, and the code base is updated efficiently. A storage mode in which Git objects are stored in blocks by object type is proposed, reducing the storage space of the code base and improving the efficiency of network-wide retrieval. Mappings from code files to code file information are constructed, enabling fast retrieval of the network-wide information of a code file. The constructed ultra-large-scale code base is updated efficiently: a customized git fetch protocol based on the Libgit2 library, with the constructed code base as its back end, efficiently obtains the newly added Git objects of remote repositories. The code base design for fast network-wide code traceability detection comprises project discovery, data extraction, data storage, code information mapping construction and data update, and specifically includes the following steps:
A. Obtain a list of open source software projects across the entire network through multiple project discovery methods;
Most open source software projects are hosted on popular development collaboration platforms such as GitHub, Bitbucket, GitLab and SourceForge. The present invention discovers projects by several methods, including using the APIs provided by the collaboration platforms and parsing the platforms' web pages, and takes the union of the discovered project sets as the final open source project list.
In a specific implementation, this can be completed on an ordinary server (e.g. one with an Intel E5-2670 CPU), with low hardware requirements. The present invention packages the scripts of the project discovery process into a Docker image.
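Taking the union of the project sets found by the different discovery methods can be sketched as follows (the platform URLs are illustrative):

```python
def merge_project_lists(*discovered):
    """Union of project lists from several discovery methods (platform
    API, web page parsing, ...), deduplicated into one final list."""
    projects = set()
    for source in discovered:
        projects.update(source)
    return sorted(projects)

from_api = ["github.com/a/x", "github.com/b/y"]
from_html = ["github.com/b/y", "gitlab.com/c/z"]
assert merge_project_lists(from_api, from_html) == [
    "github.com/a/x", "github.com/b/y", "gitlab.com/c/z"]
```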
B. Data extraction: download the projects in the open source project list obtained in step A and extract the Git objects from them;
In a specific implementation, a local copy of each remote repository is created with the git clone command. After open source projects have been copied in batches, all Git objects inside the cloned projects are extracted in batches via Git.
Data extraction can be performed in parallel on (cloud) servers. The present invention uses Libgit2, Git's C-language interface, to first list all Git objects in a project, then classify them by object type, and finally extract the content of each object. A specific implementation uses a cluster of 36 nodes, each with a 16-core Intel E5-2670 CPU and 256 GB of memory; each node runs 16 threads to perform the Git object extraction. One node can process about 50,000 projects in 2 hours. After the Git data of a cloned project has been extracted, the clone is deleted and a new clone-and-extract cycle begins.
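As an alternative sketch of the list-and-classify step described above (the method itself uses the Libgit2 C API), the output of stock Git's `git cat-file --batch-all-objects --batch-check` can be parsed and grouped by object type:

```python
from collections import defaultdict

def classify_objects(batch_check_output: str):
    """Group object SHA1s by type from lines of the form
    '<sha1> <type> <size>', the format printed by
    `git cat-file --batch-all-objects --batch-check`."""
    by_type = defaultdict(list)
    for line in batch_check_output.splitlines():
        sha, obj_type, _size = line.split()
        by_type[obj_type].append(sha)
    return by_type

sample = ("ce013625030ba8dba906f756967f9e9ca394464a blob 6\n"
          "4b825dc642cb6eb9a060e54bf8d69288fbee4904 tree 0\n")
groups = classify_objects(sample)
assert groups["blob"] == ["ce013625030ba8dba906f756967f9e9ca394464a"]
assert groups["tree"] == ["4b825dc642cb6eb9a060e54bf8d69288fbee4904"]
```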
C. Git object data storage: store Git object data in blocks by Git object type, reducing data storage space and improving parallel processing efficiency; specifically:
a. Binary files included in open source projects (such as PDFs and images) are not stored;
b. Git object data is stored by Git object type, i.e. the databases comprise a commit database, a tree database, a blob database (excluding binary blobs) and a tag database. This storage scheme reduces the data storage space to the hundred-terabyte level while still allowing fast checks of whether a given piece of data is already stored in the code base.
c. The database for each type of Git object comprises cache data and content data, stored in a cache database and a content database respectively, to speed up retrieval. Each type of database (i.e. the commit, tree, blob and tag databases) can have its cache database and content database split into multiple shards (e.g. 128) for parallelism. The cache database is used to determine quickly whether a Git object is already stored in the database, and is essential for data extraction (if the object already exists, it is not extracted again, saving time). The cache database also helps determine whether a repository needs to be cloned at all: if a repository's heads (the commit objects pointed to by each branch in .git/refs/heads) are already in the cache database, cloning is unnecessary.
d. The cache database is a key-value database; the content database is stored in concatenated, append-only form to facilitate updates.
In the cache database, the key is the SHA1 of a Git object (20 bytes) and the value is the offset and size of that object within the content database, compressed using Perl's Compress library. The content database contains the compressed contents of Git objects concatenated contiguously; because it is stored by concatenation, it can be updated quickly, simply by appending new content to the end of the corresponding file. For commit and tree objects, an additional random-lookup key-value database is created for each, in which the key is the object's SHA1 and the value is the object's compressed content. Random lookups in the key-value database are fast: each thread can query more than 170K Git objects per second.
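The cache-database/content-database split can be sketched in memory as follows, with zlib standing in for Perl's Compress library and a plain dict standing in for the key-value store:

```python
import zlib

class ObjectStore:
    """Sketch of one shard: an append-only content store plus a
    key-value cache mapping SHA1 -> (offset, size)."""

    def __init__(self):
        self.content = bytearray()  # concatenated compressed objects
        self.cache = {}             # sha1 -> (offset, size)

    def put(self, sha1: str, data: bytes):
        if sha1 in self.cache:      # already stored: skip extraction
            return
        compressed = zlib.compress(data)
        self.cache[sha1] = (len(self.content), len(compressed))
        self.content += compressed  # an update is a simple append

    def get(self, sha1: str) -> bytes:
        offset, size = self.cache[sha1]
        return zlib.decompress(bytes(self.content[offset:offset + size]))

store = ObjectStore()
store.put("aaaa", b"int main() { return 0; }")
store.put("bbbb", b"print('hi')")
assert store.get("bbbb") == b"print('hi')"
assert "cccc" not in store.cache  # fast membership check before cloning
```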
e. SHA1 values are used to achieve parallelization.
The present invention uses the last 7 bits of the first byte of a Git object's SHA1 value to split each type of database into 128 shards. Thus each of the four types of Git object has 128 cache databases and 128 content databases; commit and tree objects additionally each have 128 random-lookup key-value databases, giving 128*(4+4+2) databases in total, which can be placed on a single server to accelerate parallelism. In a specific implementation, single content databases range from 20 MB (tag objects) to 0.8 TB (blob objects), and the largest single cache database is that for tree objects, at 2 GB.
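Taking the shard number from the last 7 bits of the first byte of a hex SHA1 (yielding a value in 0..127) might look like:

```python
def shard_index(sha1_hex: str, bits: int = 7) -> int:
    """Shard number from the low `bits` bits of the first byte of a
    hex SHA1 (7 bits -> 128 shards)."""
    first_byte = int(sha1_hex[:2], 16)
    return first_byte & ((1 << bits) - 1)

# First byte 0xce: 0xce & 0x7f == 0x4e, i.e. shard 78
assert shard_index("ce013625030ba8dba906f756967f9e9ca394464a") == 0x4e
assert all(0 <= shard_index(f"{b:02x}" + "0" * 38) < 128 for b in range(256))
```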
f. The present invention uses TokyoCabinet, a database written in C (similar to Berkeley DB).
TokyoCabinet uses hashing as its index and provides read-query performance roughly ten times faster than common key-value databases such as MongoDB or Cassandra. Faster read queries and strong portability exactly match the requirements for building a code base oriented toward network-wide code traceability detection, so the present invention uses TokyoCabinet rather than a more feature-complete NoSQL database.
D. Code information mapping construction:
The goal of the code base designed by the present invention is to enable fast network-wide traceability detection of code and to support analysis of the security and compliance of software projects. The present invention constructs mappings from a code file (blob) to the projects containing it, to the commits containing it, to its author, to its file names, and to its creation time. These mappings are stored in the form of databases, so that the network-wide information of a code file, such as the projects and commits that contain it, the author who created it and the time of creation, can be obtained quickly. For a given code file, this information is useful for a comprehensive assessment of a software project's security and compliance.
The present invention builds the relationship mappings around commits, specifically including:
building the mutual mapping between commits and projects; the mappings from commits to authors and to timestamps; the mapping from authors to commits; the mutual mapping between commits and code files (blobs); and the mutual mapping between commits and file names.
The list of projects containing a code file (blob) is determined by composing the blob-to-commit and commit-to-project mappings; the creation time of a blob is determined by composing the blob-to-commit and commit-to-time mappings; and the author of a blob is determined by composing the blob-to-commit and commit-to-author mappings.
A mapping between code files and file names is also constructed, to support traceability of specific code fragments.
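Composing two mappings to answer, for instance, "which projects contain this blob?" can be sketched with plain dictionaries (the SHA1s and project names are hypothetical):

```python
def projects_of_blob(blob_sha, blob_to_commits, commit_to_projects):
    """Union of projects over all commits that contain the blob:
    the composition of the blob-to-commit and commit-to-project maps."""
    projects = set()
    for commit in blob_to_commits.get(blob_sha, ()):
        projects.update(commit_to_projects.get(commit, ()))
    return projects

b2c = {"blob1": ["c1", "c2"]}
c2p = {"c1": ["org/projA"], "c2": ["org/projA", "org/projB"]}
assert projects_of_blob("blob1", b2c, c2p) == {"org/projA", "org/projB"}
```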
The TokyoCabinet database is used to store these relationship mappings for fast retrieval. Block storage is again used to improve retrieval efficiency: specifically, each type of relationship mapping is split into 32 sub-databases. For commits and code files (blobs), the last 5 bits of the first byte of their SHA1 are used for partitioning; for authors, projects and file names, the last 5 bits of the first byte of their FNV-1 hash are used.
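A 32-bit FNV-1 hash and the 32-way partition can be sketched as follows; note that which byte of the 32-bit value counts as the "first byte" is an assumption here (the most significant byte is used):

```python
FNV_OFFSET, FNV_PRIME = 0x811C9DC5, 0x01000193

def fnv1_32(data: bytes) -> int:
    """32-bit FNV-1: multiply by the prime, then XOR in each byte."""
    h = FNV_OFFSET
    for byte in data:
        h = (h * FNV_PRIME) & 0xFFFFFFFF
        h ^= byte
    return h

def author_shard(name: str) -> int:
    # Assumption: "first byte" = most significant byte of the 32-bit hash.
    first_byte = fnv1_32(name.encode()) >> 24
    return first_byte & 0x1F  # last 5 bits -> 32 sub-databases

assert fnv1_32(b"a") == 0x050C5D7E  # standard FNV-1 test vector
assert 0 <= author_shard("Jane Doe <jane@example.com>") < 32
```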
E. Data update
Git objects are immutable (existing Git objects never change; only new Git objects come into existence), so only the new Git objects need to be obtained. The present invention updates the code base in two ways:
a. Identify new Git projects, clone them, and extract their Git objects.
b. Identify updated projects by obtaining the latest commits of the branches of the remote repositories already collected, and then modify the git fetch protocol so that, without a local Git repository (the cloned repositories were deleted after data extraction in step B to save space), it can use the constructed code base as its back end, obtain the newly added Git objects from the remote repository, and extract them into the code base. The present invention reconstructs the git fetch workflow from the source code implementing the git fetch function in Libgit2; the workflow comprises the following steps:
b1) The remote repository is added to the local repository. Libgit2 represents a remote repository with the git_remote struct; when this struct is created, all branch references in the local repository's .git/refs/heads folder are filled into one of its member variables (ref);
b2) The local repository establishes a connection to the remote repository;
b3) After the connection is established, the remote repository responds, sending all of its branch references (the contents of its .git/refs/heads folder) to the local side;
b4) After receiving the references sent back by the remote repository, the local repository checks one by one whether the objects they point to exist locally; if an object is present, the reference is marked to indicate that the branch has not been updated and the remote need not send an update. These references are then inserted into the member variable mentioned in step b1);
b5) After checking all these references, the local repository sends the member variable (including the marked references) back to the remote repository to "negotiate" with it, then waits for the remote's ACK signal. Libgit2 waits as follows: it sorts the commit objects in the local repository chronologically and traverses them starting from the most recent; for each commit object it tells the remote repository that this object exists locally, and then sends the set of checked references again. This is repeated up to 256 times, until the ACK signal from the remote repository is received;
b6) Once negotiation is complete (i.e. the remote repository has been told what the latest commits of the local branches are and which objects are wanted), the remote repository can compute which Git objects to send back. It packs these objects into a file in packfile format and sends it to the local side;
b7) After receiving the data, the local repository parses the packfile according to its format and builds the corresponding index file to facilitate retrieval. Building the index file requires resolving against the Git objects in the local Git repository.
As these steps show, apart from steps b5) and b7), no part of the process involves Git objects other than those pointed to by branch references; git fetch decides whether the remote repository has updates by comparing the remote's branch references with the local ones. We propose the following modifications to git fetch:
1) Modify step b3) of the original git fetch: save the branch references sent back by the remote repository locally and check whether they are already stored in the local code base. If they are, the remote repository has no updates; if not, it does, and the process proceeds to the next step.
2) Modify step b5) of the original git fetch: the original protocol sorts the commits and sends them to the remote repository only in order to wait for the remote's ACK signal; the sorting itself serves no special purpose. The present invention therefore uses a different waiting method: the latest commit object of the main branch is sent each time, repeated up to 256 times until the ACK signal from the remote repository is received.
3) Modify step b6) of the original git fetch: save the packfile sent back by the remote repository locally and parse it against the Git objects in the code base; step b7) is not performed.
With these modifications, git fetch can perform updates with the constructed code base as its back end, so the complete repository no longer needs to be cloned for every update, reducing network bandwidth and time overhead.
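Modification 1), deciding from the advertised branch references alone whether the remote has anything new, can be sketched as:

```python
def needs_update(remote_refs, commit_cache):
    """remote_refs: dict branch -> commit SHA1 advertised by the remote
    in step b3). If every advertised head is already in the commit
    cache database, the remote has nothing new and fetch stops early."""
    return any(sha not in commit_cache for sha in remote_refs.values())

cache = {"c1", "c2"}                                  # hypothetical SHA1s
assert needs_update({"main": "c1", "dev": "c9"}, cache) is True
assert needs_update({"main": "c1"}, cache) is False   # no update needed
```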
In a specific implementation, the present invention also provides a fast network-wide code traceability detection method at file granularity based on the code base, comprising the following steps:
1)针对一个代码文件,计算其SHA1值1) For a code file, calculate its SHA1 value
2)根据步骤D构建的代码信息映射,以代码文件的SHA1为键,查询该代码文件的全网信息,包括包含该代码文件的项目列表、commit列表和对应的文件名和作者等信息,反馈给用户。2) Based on the code information mapping constructed in step D, use the SHA1 of the code file as the key to query the entire network information of the code file, including the project list, commit list, corresponding file name, author and other information containing the code file, and feedback to user.
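The two steps above can be sketched as follows. The SHA1 computation matches how Git hashes blob objects (the header `blob <size>\0` followed by the raw content); the mapping databases are stood in for by plain dictionaries, and `query_blob_info` is a hypothetical helper name, not part of the patent:

```python
import hashlib

def git_blob_sha1(content: bytes) -> str:
    # Git hashes a blob as sha1(b"blob <size>\0" + content).
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

def query_blob_info(sha1, blob_to_commits, commit_to_projects):
    # Step 2: compose the blob->commit and commit->project mappings of step D.
    commits = blob_to_commits.get(sha1, [])
    projects = sorted({p for c in commits for p in commit_to_projects.get(c, [])})
    return {"commits": commits, "projects": projects}
```

Because the key is the content hash, any byte-identical copy of the file, in any repository, maps to the same database entry.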
Compared with the prior art, the beneficial effects of the present invention are:
The code base design provided by the present invention supports efficient network-wide traceability detection of code. With the technical solution and the embodiments provided, a local code base can be built from the open-source Git repositories on multiple code hosting platforms across the network, including GitHub, without a large number of servers, and incremental updates of the code base can be completed without particularly large bandwidth.
The technical solution and embodiments of the present invention provide detailed guidance for constructing a code base for network-wide code traceability detection, filling the gap in techniques for constructing massive code bases in the field of code traceability detection.
Description of the drawings
Figure 1 is a flow chart of the code base design method for fast network-wide code traceability detection in an embodiment of the present invention.
Figure 2 is a flow chart of the code base update strategy in an embodiment of the present invention.
Figure 3 is a flow chart of the customized git fetch process in an embodiment of the present invention.
Figure 4 is a flow chart of obtaining remote repository updates based on the customized git fetch protocol in an embodiment of the present invention.
Figure 5 is a flow chart of the fast network-wide code traceability detection method based on the constructed code base in an embodiment of the present invention.
Detailed description of the embodiments
The present invention is further described below through embodiments in conjunction with the accompanying drawings, without limiting the scope of the invention in any way.
The present invention provides a code base design method for fast network-wide code traceability detection, comprising the following steps:
A. Obtain a list of open-source software projects across the network through multiple project discovery methods. The implementation is as follows:
At present, most open-source software projects are hosted on popular development collaboration platforms such as GitHub, Bitbucket, GitLab and SourceForge, while some are hosted on personal or project-specific websites. To support network-wide traceability detection of code, a list of open-source projects that is as complete as possible must therefore be obtained. To meet this challenge, the present invention combines multiple methods, such as using the APIs provided by the platforms and parsing the platforms' web pages, to discover projects. The union of the project sets discovered by these methods is taken as the final open-source project list.
B. Data extraction: download the projects in the open-source project list from step A and extract the Git objects they contain.
This step downloads the projects discovered in step A and extracts their Git objects. The git clone command creates a local copy of each remote repository. After the projects are copied in batches, all Git objects inside the cloned projects are extracted in batches through Git. This step can be completed in parallel on (cloud) servers.
C. Data storage: store Git objects in blocks, separated by object type, reducing data storage space and improving parallel processing efficiency.
Open-source projects may share many duplicate Git objects because of code reuse, the pull-request development model, and other reasons. They also include many binary files, such as PDFs and images. Without removing this redundancy and these binary files, the required data storage space is estimated to exceed 1.5PB, a volume that would make the code traceability task nearly impossible. To avoid redundant Git objects across repositories, and because the code base is designed for network-wide code traceability detection, the present invention does not store binary files and stores Git objects separated by type, i.e., a commit database, a tree database, a blob database (excluding binary blobs) and a tag database. This storage scheme reduces the data storage space to the hundreds-of-terabytes level while still allowing fast checks of whether a piece of data is already stored in the code base.
D. Code information mapping construction:
The goal of this code base is to support fast network-wide traceability detection of code and analysis of the security and compliance of software projects. To this end, the present invention constructs mappings from a code file (blob) to the projects containing it, to the commits containing it, to its author, to its file names, and to its creation time. These mappings are stored as databases, so the network-wide information of a code file, such as the projects and commits containing it, the author who created it and the time it was created, can be obtained quickly. For a given code file, this information is useful for a comprehensive assessment of the security and compliance of a software project.
E. Data update
Keeping the code base up to date is critical for the code traceability detection task. As existing repositories grow and new repositories appear, cloning all repositories takes longer and longer. Currently, cloning all Git repositories (more than 130 million, forks included) is estimated to require 600 single-threaded servers running for a week, and the result would occupy more than 1.5PB of disk space. Fortunately, Git objects are immutable (existing Git objects remain unchanged; only new Git objects are added), so only the new Git objects need to be fetched. Specifically, the present invention uses two strategies to update the code base:
1. Identify new Git projects, clone them, and extract their Git objects.
2. Identify updated projects by obtaining the latest commits of all branches of the remote repositories already collected, and then modify the git fetch protocol so that, without a local Git repository (the cloned Git repositories are deleted after the data is extracted in step B to save space), it can use the constructed code base as the backend, obtain the updates of the remote repository, and extract the new Git objects into the code base. From the source code implementing the git fetch function in Libgit2, the present invention reconstructs the git fetch flow, as shown in Figure 2, which comprises the following 7 steps:
1) Add the remote repository to the local repository. Libgit2 represents a remote repository with the git_remote structure; when this structure is created, all branch references in the local repository's .git/refs/heads folder are filled into a member variable (ref) of the structure.
2) The local repository establishes a connection to the remote repository.
3) Once the connection is established, the remote repository responds by sending all of its branch references (the contents of its .git/refs/heads folder) to the local side.
4) After receiving the references sent back by the remote repository, the local repository checks one by one whether the objects they point to exist locally. If an object is in the local repository, the reference is marked to indicate that this branch has no updates and the remote repository need not send any. These references are then inserted into the member variable mentioned in step 1).
5) After checking all the references, the local repository sends this member variable (including the marked references) back to the remote repository to "negotiate" with it, and waits for the remote repository's ACK signal. Libgit2 waits as follows: it sorts the commit objects in the local repository chronologically and traverses them starting from the most recent; for each commit object, it sends the commit to the remote repository to announce that the object exists locally, and then sends the checked references again. This is repeated up to 256 times until the ACK signal from the remote repository is received.
6) Once the negotiation is complete (i.e., the remote repository has been told the latest commits of the local branches and which objects are wanted), the remote repository can compute which Git objects to send back. It packs these objects into a packfile and sends it to the local side.
7) After receiving the returned data, the local repository parses it according to the packfile format and builds the corresponding index file to facilitate retrieval. Building the index file requires the Git objects in the local Git repository.
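The ref advertisement and negotiation messages exchanged in steps 3)-5) are framed as "pkt-lines" in Git's smart transfer protocol. A minimal sketch of that framing (the 4-digit lowercase hex prefix gives the total length including its own 4 bytes; "0000" is the flush packet that ends a section); the example `want` line is illustrative, not taken from the patent:

```python
def pkt_line(payload: str) -> bytes:
    # One pkt-line: 4 lowercase hex digits giving the total length
    # (payload bytes + the 4 length bytes themselves), then the payload.
    data = payload.encode()
    return b"%04x" % (len(data) + 4) + data

FLUSH_PKT = b"0000"  # terminates a section of the request

# e.g. announcing a wanted branch head during negotiation:
want = pkt_line("want e69de29bb2d1d6434b8b29ae775ad8c2e48c5391\n")
```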
As the steps of git fetch show, apart from steps 5) and 7), no part of the process touches Git objects other than those pointed to by branch references; git fetch determines whether the remote repository has updates by comparing the remote repository's branch references with the local repository's. The present invention proposes the following modifications to git fetch:
1) Modify step 3 of the original git fetch: save the branch references returned by the remote repository locally and check whether they are already stored in the local code base. If they all are, the remote repository has no new Git object data; if any is missing, the remote repository has new Git object data, so proceed to the next step.
2) Modify step 5 of the original git fetch: the original protocol sorts the commits and sends them to the remote repository only as a way of waiting for the remote repository's ACK signal; the sorting itself serves no special purpose. The present invention therefore uses a different waiting method: send the latest commit object of the main branch each time, repeating up to 256 times until the remote repository's ACK signal is received.
3) Modify step 6 of the original git fetch: save the packfile returned by the remote repository locally and parse it against the Git objects in the code base; step 7 is not performed.
With these modifications, git fetch can perform updates with the constructed code base as its backend, so the complete repository no longer needs to be cloned for every update, reducing both network bandwidth and time overhead.
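The modified update check (deciding from the advertised branch heads alone whether the remote has new Git object data) can be sketched as follows; a Python set stands in for the code base's commit cache database, and `needs_update` is a hypothetical helper name:

```python
def needs_update(remote_refs: dict, commit_cache: set) -> bool:
    # The remote repository has new Git object data iff any advertised
    # branch head (ref name -> commit SHA1) is absent from the local
    # code base's commit cache database.
    return any(head not in commit_cache for head in remote_refs.values())
```

Only when this returns True does the customized fetch proceed to negotiation and packfile transfer, which is what removes the need for a full clone per update.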
Finally, the present invention provides a fast network-wide traceability detection scheme for code at file granularity, comprising two steps:
1. For a given code file, compute its SHA1 value.
2. Using the code information mapping database built in step D, with the code file's SHA1 as the key, query the network-wide information of the code file, including the list of projects containing it, the list of commits, and the corresponding file names and authors, and return the results to the user.
As a preferred solution, step B uses Libgit2, Git's C-language interface (C being more efficient and faster), to complete the extraction task.
As a preferred solution, steps C and D use the TokyoCabinet database.
As a preferred solution, step E uses Libgit2, Git's C-language interface, to implement the customized git fetch protocol.
Figure 1 shows the flow of the code base design method for fast network-wide code traceability detection in an embodiment of the present invention, which comprises the following specific implementation steps:
A. Project discovery:
To obtain a list of open-source projects that is as complete as possible, the present invention combines multiple heuristics, including using the APIs of development collaboration platforms and parsing the platforms' web pages. The union of the project sets discovered by these methods is taken as the final open-source project list. The present invention packages the scripts of the project discovery process into a Docker image. Specifically, the project discovery methods adopted are as follows:
1. Use the APIs of development collaboration platforms. Some code hosting platforms, such as GitHub, provide APIs that can be used to discover the platform's complete set of open-source projects. These APIs are platform-specific and used in different ways, so different API queries must be designed for different platforms. The APIs generally impose access-rate limits per user or IP address, which can be overcome by building a pool of user IDs. For the GitHub platform, we use GitHub's GraphQL API to obtain the list of updated GitHub repositories: the time span to be covered is divided evenly by the number of user IDs in the pool, and each user ID is responsible for the repositories updated in one sub-span. The query condition is {is:public archived:false pushed:start_time..end_time}, where start_time and end_time are stepped through each sub-span in 10-minute intervals to obtain the repositories updated in every 10-minute interval. For the Bitbucket platform, the API query used is https://api.bitbucket.org/2.0/repositories/?pagelen=100&after=date, where date is replaced with a specific time such as 2017-11-18 to obtain the Bitbucket repositories created after 2017-11-18. For the SourceForge platform, the platform provides a project list in XML format at https://sourceforge.net/sitemap.xml; downloading and parsing the XML yields a list of all projects on SourceForge. For the GitLab platform, the API query used is https://gitlab.com/api/v4/projects?archived=false&membership=false&order_by=created_at&owned=false&page={}&per_page=99&simple=false&sort=desc&starred=false&statistics=false&with_custom_attributes=false&with_issues_enabled=false&with_merge_requests_enabled=false, where the page parameter starts at 1 and is incremented to obtain all projects on GitLab.
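As an illustration of the paginated API discovery described above, the following sketch walks Bitbucket-style result pages. `fetch_json` is an injected hypothetical helper (e.g. wrapping an HTTP client) so the paging logic can be tested offline; the "values"/"next" fields follow the Bitbucket 2.0 API's pagination format:

```python
def discover_repos(fetch_json, first_url):
    # Walk a paginated listing: each page carries its items in "values"
    # and the following page's URL in "next" (absent on the last page).
    repos, url = [], first_url
    while url:
        page = fetch_json(url)
        repos.extend(r["full_name"] for r in page.get("values", []))
        url = page.get("next")
    return repos
```

The same loop shape applies to the GitLab query above, with the incremented `page` parameter playing the role of the "next" link.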
2. Parse the websites' web pages. For the Bioconductor platform, all projects on the site can be obtained by parsing the http://git.bioconductor.org web page; for the repo.or.cz platform, by parsing the https://repo.or.cz/?a=project_list web page; for the Android platform, by parsing the https://android.googlesource.com/ web page;
for the ZX2C4 platform, by parsing the https://git.zx2c4.com web page; for the Eclipse platform, by parsing the http://git.eclipse.org/ web page; for the PostgreSQL platform, by parsing the http://git.postgresql.org web page;
for the Kernel.org platform, by parsing the http://git.kernel.org web page; and for the Savannah platform, by parsing the http://git.savannah.gnu.org/cgit web page.
This step can be completed on an ordinary server (such as one with an Intel E5-2670 CPU) and has very low hardware requirements. As of September 2020, we had retrieved more than 130 million distinct repositories (excluding GitHub repositories marked as forks and repositories with no content).
B. Data extraction:
This step can be completed in parallel on a very large number of servers, but requires substantial network bandwidth and storage space. Remote repositories are cloned locally in batches with the git clone command. By our measurements, on a server with an Intel E5-2670 CPU, a single-threaded shell process without network bandwidth limits can clone 20,000 to 50,000 randomly selected projects in 24 hours (the time varies greatly with repository size and platform). Cloning all projects (more than 130 million) within one week would require roughly 400-800 servers, at a very high cost. The present invention therefore optimizes retrieval by running multiple threads on each server and retrieving only the small portion of repositories that have changed since the last retrieval. The cloning task is currently completed on 5 data-transfer nodes of a computing cluster platform with 300 nodes and bandwidth of up to 56Gb/s. Alternatively, this step can be completed with cloud servers instead of a computing cluster: customized cloud resources matching one's needs can be purchased at cloning time and released after the batch cloning finishes. Cloud servers can reach higher bandwidth and clone faster.
After a project is cloned locally, all Git objects inside it must be extracted. The Git client can only display the content of one Git object at a time, which is unsuitable for automated batch processing. The present invention uses Libgit2, Git's C-language interface, to first list all Git objects in a project, then classify them by object type, and finally extract the content of each object. This Git object extraction currently runs on a cluster of 36 nodes, each with a 16-core Intel E5-2670 CPU and 256GB of memory, with 16 threads per node. One node can process about 50,000 projects in 2 hours. After the Git data of the cloned projects is extracted, the cloned projects are deleted and a new clone-extract cycle begins.
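The per-repository extraction (list all Git objects, then classify by type) can equivalently be driven by Git's plumbing command `git cat-file --batch-all-objects --batch-check`, whose output is one `<sha1> <type> <size>` record per line. The invention itself uses Libgit2, so this output parser is only an illustrative sketch:

```python
def parse_batch_check(output: str) -> dict:
    # Group "<sha1> <type> <size>" records by object type so each object
    # can be routed to its commit/tree/blob/tag database.
    by_type = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        sha, otype, size = line.split()
        by_type.setdefault(otype, []).append((sha, int(size)))
    return by_type
```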
C. Data storage: store Git objects in blocks, separated by object type, without storing binary files, reducing data storage space and increasing parallel processing speed.
The present invention stores Git objects separated by type to avoid redundancy and reduce storage overhead; since the code base is oriented toward code traceability detection, binary files are not stored. Each Git object database consists of cache data and content data, stored in a cache database and a content database respectively, to speed up retrieval. To enable parallelism, the cache database and content database for each Git object type can be split into multiple parts (e.g., 128); the content database is stored by concatenation to make updates easy.
Specifically, the present invention stores Git objects separately by type to avoid redundancy, giving four types of databases: a commit database, a blob database, a tree database and a tag database. Each consists of cache data and content data, kept in the cache database and content database respectively. The cache database is used to determine quickly whether a particular object is already stored in our database, and is necessary for the data extraction described above (if the object exists, it is not extracted again, saving time). In addition, the cache database helps determine whether a repository needs to be cloned at all: if a repository's heads (the commit objects pointed to by each branch stored in .git/refs/heads) are already in our cache database, the repository has no updates and need not be cloned.
The cache database is a key-value database whose key is the SHA1 value of a Git object (20 bytes) and whose value is the object's offset and size in the content database, compressed with Perl's Compress library. The content database contains the compressed contents of Git objects concatenated contiguously. Storing the content database by concatenation guarantees fast updates: new content only needs to be appended to the end of the corresponding file. Although this layout allows the entire database to be scanned quickly, it is not optimal for the random lookups required. For example, to compute the modifications made by a commit, we must traverse the commit database twice to obtain the tree object pointed to by the commit object and the tree object pointed to by its parent commit, then traverse the tree database several times to obtain the contents of these two tree objects and find the differing files, and finally traverse the blob database once to compute the modifications; each traversal incurs repeated extra time overhead. Therefore, for commits and trees, the present invention additionally creates random-lookup key-value databases, in which the key is the Git object's SHA1 and the value is the object's compressed content. Random queries on these key-value databases are fast: in testing, a single thread on a server with an Intel E5-2623 CPU can randomly query 1 million Git objects in 6 seconds, i.e., more than 170K Git objects per second per thread.
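A minimal in-memory sketch of one shard's cache database and append-only content database, together with the 128-way shard selection from the low 7 bits of the first SHA1 byte described below; a Python dict and bytearray stand in for TokyoCabinet and the on-disk content file, and the class name is hypothetical:

```python
import zlib

def shard(sha1_hex: str) -> int:
    # 128-way split: low 7 bits of the first byte of the object's SHA1.
    return int(sha1_hex[:2], 16) & 0x7F

class ObjectStore:
    def __init__(self):
        self.content = bytearray()  # content database: concatenated compressed objects
        self.cache = {}             # cache database: sha1 -> (offset, size)

    def put(self, sha1: str, data: bytes):
        if sha1 in self.cache:      # already stored: skip, saving extraction time
            return
        blob = zlib.compress(data)
        self.cache[sha1] = (len(self.content), len(blob))
        self.content += blob        # updates only ever append at the end

    def get(self, sha1: str) -> bytes:
        off, size = self.cache[sha1]
        return zlib.decompress(bytes(self.content[off:off + size]))
```

The membership test in `put` is exactly the cache-database role described above: deciding in O(1) whether an object (or a repository head) is already in the code base.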
Currently, the present invention has retrieved more than 20 billion Git objects (including more than 2.3 billion commit objects, more than 9.1 billion blob objects, more than 9.4 billion tree objects and more than 18 million tag objects), occupying about 150TB of data storage. Without parallel processing, handling such a volume of data would be extremely inefficient. The present invention uses the SHA1 values to achieve parallelization: the low 7 bits of the first byte of a Git object's SHA1 value split each type of database into 128 parts. Thus each of the four Git object types has 128 cache databases and 128 content databases; in addition, commit objects and tree objects each have 128 random-lookup key-value databases, for a total of 128*(4+4+2) databases, which can be placed on one server to accelerate parallelism. Currently, single content databases range in size from 20MB for tag objects to 0.8TB for blob objects, and the largest single cache database is for tree objects, at 2GB.
Nonetheless, the scale of the data limits the choice of database. For example, a graph database such as neo4j is very useful for storing and querying relationships, including transitive relationships, but it cannot (at least on an ordinary server) handle relationships at the hundred-billion scale. Besides neo4j, the present invention also tried many traditional databases, evaluating the common relational databases MySQL and PostgreSQL and the key-value (NoSQL) databases MongoDB, Redis and Cassandra. SQL databases, like all centralized databases, have limitations in handling petabyte-scale data. The present invention therefore focuses on NoSQL databases, which are designed for large-scale data storage and massively parallel data processing on large numbers of commodity servers.
After testing, the present invention uses a database written in C called TokyoCabinet (similar to Berkeley DB). TokyoCabinet uses hashes as indexes and provides read-query performance about ten times faster than common key-value databases such as MongoDB or Cassandra. Its faster read queries and strong portability exactly match the requirements for building a code base for network-wide code traceability detection, so we use it in place of more full-featured NoSQL databases.
D.代码信息映射构建,包括:D. Code information mapping construction, including:
设计并生成可以快速对代码文件(blob)到它的信息间的关系映射,代码文件的信息包括包含它的项目和commit,创建它的作者和时间,它的文件名,这些关系映射以数据库的形式保存,可以对代码文件的全网信息进行快速检索Design and generate a relationship mapping that can quickly map a code file (blob) to its information. The information of the code file includes the project and commit that contains it, the author and time when it was created, and its file name. These relationships are mapped to the database's Save in form, you can quickly retrieve the entire network information of code files
The goal of this code base is to support fast network-wide traceability detection of code, and thereby analysis of the security and compliance of software projects. The present invention therefore generates relationship mappings from a code file (blob) to its information (including the projects and commits that contain it, the author and time of its creation, and its file names) and stores them as databases, so that the network-wide information of a code file can be retrieved. This network-wide information is useful for a comprehensive assessment of the security and compliance of software projects and is an essential part of network-wide code traceability detection.
The information of a code file includes the projects and commits that contain it, its file names, and the author and time of its creation. The author and creation time are recorded in the commit that created the file, and the commit-to-project and project-to-commit mappings are already produced during data extraction in step B. The present invention therefore builds the relationship mappings around commits, specifically: the bidirectional mapping between commits and projects, the mappings from commits to authors and times, the mapping from authors to commits, the bidirectional mapping between commits and code files (blobs), and the bidirectional mapping between commits and file names. The list of projects containing a given blob can then be determined by composing the blob-to-commit and commit-to-project mappings; similarly, the creation time of a blob is obtained by composing the blob-to-commit and commit-to-time mappings, and its author by composing the blob-to-commit and commit-to-author mappings.
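The composition of mappings described above can be sketched as follows. The plain dictionaries and the sample keys are illustrative stand-ins for the TokyoCabinet sub-databases, not the actual implementation:

```python
def blob_projects(blob_sha, blob_to_commits, commit_to_projects):
    """Projects containing a blob = union of projects over all commits containing it."""
    projects = set()
    for commit in blob_to_commits.get(blob_sha, ()):
        projects.update(commit_to_projects.get(commit, ()))
    return sorted(projects)

def blob_author_times(blob_sha, blob_to_commits, commit_to_author_time):
    """Author and time of a blob, via the commits that introduced it."""
    return [commit_to_author_time[c] for c in blob_to_commits.get(blob_sha, ())]

# Toy data (hypothetical, truncated identifiers for readability).
b2c = {"e8cf3d7c": ["c1", "c2"]}
c2p = {"c1": ["fchollet/deep-learning-models"], "c2": ["someuser/fork"]}
c2ta = {"c1": ("1489000000", "Author A"), "c2": ("1490000000", "Author B")}

print(blob_projects("e8cf3d7c", b2c, c2p))
# ['fchollet/deep-learning-models', 'someuser/fork']
```

A blob that appears in no known commit simply yields an empty result, which is the expected answer for code never published in any indexed repository.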
The mappings from a commit to its author, time, and projects are easy to obtain, because the author and time are part of the commit object, and the commit-to-project mapping is produced during data extraction in step B. However, the code files (blobs) introduced or deleted by a commit are not directly recorded in the commit; they must be computed by recursively traversing the tree objects of the commit and of its parent commit. A commit contains a snapshot of the repository, including all trees (folders) and blobs (code files). To compute the difference between a commit and its parent — that is, the new blobs — we start from the tree object pointed to by each commit, traverse every subtree, and extract all blobs. Comparing the full sets of blobs of the two commits yields the new blobs introduced by the commit. On average, extracting the changed file names and blobs of ten thousand commits takes about one minute in a single thread. For more than 2.3 billion commits, the estimated single-threaded time is 104 days; by running 16 threads on a server with a 16-core Intel E5-2623 CPU, it can be completed within a week.
In addition, these relationships are incremental: they need to be generated only once, after which the operations above are applied to each newly updated commit and the results are inserted into the existing databases. The correspondence between a blob and its file name cannot be determined by composing the blob-to-commit and commit-to-filename mappings, because a single commit may modify multiple files. The present invention therefore also builds bidirectional mappings between code files and file names to support traceability of specific code fragments. For example, to run a traceability check on a fragment of Python code, all Python files must be examined; the filename-to-blob mapping yields all files ending in .py, against which the code traceability check is then performed.
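The recursive tree comparison used above to attribute blobs to a commit can be sketched as follows. The dictionary `store`, mapping a tree's identifier to its entries, is an illustrative stand-in for reading real Git tree objects:

```python
def collect_blobs(tree_sha, store):
    """Recursively gather all blob SHA1s reachable from a tree object."""
    blobs = set()
    for kind, sha in store[tree_sha]:
        if kind == "blob":
            blobs.add(sha)
        else:  # "tree": descend into the subtree
            blobs |= collect_blobs(sha, store)
    return blobs

def blobs_introduced(commit_tree, parent_tree, store):
    """New blobs of a commit = blobs in its tree but not in its parent's tree."""
    return collect_blobs(commit_tree, store) - collect_blobs(parent_tree, store)

# Toy object store: tree identifier -> list of (kind, identifier) entries.
store = {
    "t_parent": [("blob", "b_readme")],
    "t_commit": [("blob", "b_readme"), ("tree", "t_src")],
    "t_src":    [("blob", "b_main")],
}
print(blobs_introduced("t_commit", "t_parent", store))  # {'b_main'}
```

The set difference is what makes the relation incremental: only blobs absent from the parent snapshot are attributed to the new commit.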
Similar to the data storage in step C, the present invention stores these relationship mappings in TokyoCabinet databases for fast retrieval, and again uses partitioned storage to improve retrieval efficiency: each type of relationship mapping is split into 32 sub-databases. Commits and code files (blobs) are partitioned by the last 5 bits of the first byte of their SHA1; authors, projects, and file names are partitioned by the last 5 bits of the first byte of their FNV-1 hash.
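The partitioning scheme can be sketched as follows. The exact bit selection ("last 5 bits of the first byte") is read literally here as an assumption, and the FNV-1 constants are the standard 32-bit ones:

```python
FNV_OFFSET_32 = 0x811C9DC5  # standard 32-bit FNV offset basis
FNV_PRIME_32 = 0x01000193   # standard 32-bit FNV prime

def fnv1_32(data: bytes) -> int:
    """Classic 32-bit FNV-1: multiply first, then XOR each input byte."""
    h = FNV_OFFSET_32
    for b in data:
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF
        h ^= b
    return h

def shard_by_sha1(sha1_hex: str) -> int:
    """Sub-database index for commits/blobs: last 5 bits of the first SHA1 byte."""
    return int(sha1_hex[:2], 16) & 0x1F

def shard_by_name(name: str) -> int:
    """Sub-database index for authors/projects/filenames via FNV-1."""
    first_byte = (fnv1_32(name.encode()) >> 24) & 0xFF  # first byte of the 32-bit hash
    return first_byte & 0x1F

# The blob from the detection example in this document lands in shard 8: 0xE8 & 0x1F == 8.
print(shard_by_sha1("e8cf3d7c248fbf6608c4947dc53cf368449c8c5f"))  # 8
```

Because SHA1 and FNV-1 outputs are close to uniformly distributed, masking 5 bits spreads keys roughly evenly over the 32 sub-databases.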
E. Data update
Keeping the code base up to date is critical for the code traceability detection task. To achieve an acceptable update time, the present invention updates the data as follows:
1. Identify new Git projects, clone them, and extract their Git objects. The list of open-source projects discovered in step A is compared with the previous list to determine the newly added projects, which are then cloned locally and their Git objects extracted.
2. Identify updated projects, clone only those projects, and extract the newly added Git objects. Based on Libgit2, the present invention modifies the git fetch protocol as follows:
1) Modify step 3 of the original git fetch: save the branch references returned by the remote repository locally. After the filter_wants function in Libgit2's src/fetch.c calls the git_remote_ls function, the SHA1 values of the heads returned by the remote repository are written to a file.
2) Modify step 5 of the original git fetch: in Libgit2's src/transports/smart_protocol.c, modify the git_smart__negotiate_fetch function — comment out the call to git_revwalk_next and add a call to git_reference_name_to_id, so that the latest commit object of the main branch is sent each time, repeating at most 256 times until an ACK is received from the remote repository.
3) Modify step 6 of the original git fetch: in the git_smart__negotiate_fetch function in Libgit2's src/transports/smart_protocol.c, save the data returned by the remote repository (git_pkt_progress *p) to a local file and return directly, without proceeding to step 7.
After making the above modifications, recompile the Libgit2 library and use the modified git fetch protocol to obtain the newly added Git object data of the remote repository. The specific steps are as follows:
1. Initialize an empty Git repository.
2. Extract from the constructed code base the SHA1 values and contents of all branch references of a repository, and fill them into the empty Git repository as follows. Construct the object header in the format object type + space + content length + a null byte, e.g. "blob 12\u0000". Concatenate the header with the raw data and compress the result with zlib's compress function. Finally, in the .git/objects folder of the empty repository, create a subdirectory named after the first two hex digits of the SHA1, create a file inside it named after the remaining 38 digits, and write the compressed content into that file.
3. In the .git/refs/heads folder of this empty Git repository, create a file named after the branch (e.g. master), and write into it the SHA1 value of the commit referenced by that branch.
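Step 2 above can be sketched in Python. Hashing the same "header + content" byte string reproduces the object's Git SHA1, so the file lands under the path Git itself would use (the function name is illustrative):

```python
import hashlib
import os
import zlib

def write_loose_object(repo_dir, obj_type, data):
    """Write a Git loose object: header + content, zlib-compressed,
    stored under .git/objects/<first 2 hex digits>/<remaining 38 digits>."""
    header = f"{obj_type} {len(data)}".encode() + b"\x00"  # e.g. b"blob 12\x00"
    store = header + data
    sha1 = hashlib.sha1(store).hexdigest()
    obj_dir = os.path.join(repo_dir, ".git", "objects", sha1[:2])
    os.makedirs(obj_dir, exist_ok=True)
    with open(os.path.join(obj_dir, sha1[2:]), "wb") as f:
        f.write(zlib.compress(store))
    return sha1

# Reproduces `echo 'test content' | git hash-object --stdin`:
# write_loose_object(".", "blob", b"test content\n")
# -> "d670460b4b4aece5915caf5c68d12f560a9fe3e4"
```

A standard git client can then read such objects directly, which is what allows the modified git fetch negotiation to run against this reconstructed repository.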
The present invention appends the data of each newly added Git object directly to the corresponding content database according to its type and SHA1 value, records in the cache database its SHA1 value together with its offset and size within the content file, and updates the corresponding relationship-mapping databases.
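The append-and-index update can be sketched as follows. The on-disk layout (one flat content file per object type plus an in-memory cache dict) is a simplified stand-in for the actual content and cache databases:

```python
import os

def append_object(content_path, cache, sha1, data):
    """Append an object's raw data to the content file and record
    (offset, size) under its SHA1 in the cache index."""
    offset = os.path.getsize(content_path) if os.path.exists(content_path) else 0
    with open(content_path, "ab") as f:
        f.write(data)
    cache[sha1] = (offset, len(data))

def read_object(content_path, cache, sha1):
    """Random-access read of one object using its recorded offset and size."""
    offset, size = cache[sha1]
    with open(content_path, "rb") as f:
        f.seek(offset)
        return f.read(size)
```

Appending keeps updates cheap — existing offsets never move — while the (offset, size) index preserves constant-time random access to any single object.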
After the code base is built, fast network-wide traceability detection can be performed on code at file granularity. The steps are as follows:
1. Compute the SHA1 value of the code file, here using the sha1 function of Python 2's hashlib library. For example, the file https://github.com/fchollet/deep-learning-models/blob/master/resnet50.py contains an implementation of the deep-learning model ResNet50; its SHA1 value is computed to be e8cf3d7c248fbf6608c4947dc53cf368449c8c5f.
2. Using the code information mapping database built in step D, query the network-wide information of the code file with its SHA1 as the key — including the list of projects containing the file, the list of commits, and the corresponding file names and authors — and return it to the user. Through the blob-to-commit mapping, 192 commits containing this blob are found; through the commit-to-project mapping, 377 projects containing it are found. The whole process takes only 0.831 s.
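The two detection steps can be combined into a single lookup. The mapping dictionaries here stand in for the step-D databases, and the sample data are hypothetical:

```python
import hashlib

def trace_file(content, blob_to_commits, commit_to_projects):
    """Network-wide traceability query for one code file:
    the SHA1 of its content is the key into the blob-level mappings."""
    sha1 = hashlib.sha1(content).hexdigest()
    commits = blob_to_commits.get(sha1, [])
    projects = set()
    for c in commits:
        projects.update(commit_to_projects.get(c, []))
    return sha1, commits, sorted(projects)

# Hypothetical mappings keyed by the hash of the file being checked.
content = b"def resnet50():\n    pass\n"
key = hashlib.sha1(content).hexdigest()
b2c = {key: ["c1", "c2"]}
c2p = {"c1": ["origin/project"], "c2": ["someuser/fork"]}
print(trace_file(content, b2c, c2p))
```

Since every step is a hash lookup, the query cost is independent of the total size of the code base, which is what makes sub-second network-wide detection feasible.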
It should be noted that the purpose of disclosing the embodiments is to aid further understanding of the present invention, but those skilled in the art will understand that various substitutions and modifications are possible without departing from the scope of the present invention and the appended claims. Therefore, the present invention should not be limited to what the embodiments disclose; the scope of protection claimed is defined by the claims.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110278117.6A CN112988217B (en) | 2021-03-10 | 2021-03-10 | Code base design method and detection method for rapid full-network code traceability detection |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112988217A CN112988217A (en) | 2021-06-18 |
CN112988217B true CN112988217B (en) | 2023-11-17 |
Family
ID=76335615
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110278117.6A Active CN112988217B (en) | 2021-03-10 | 2021-03-10 | Code base design method and detection method for rapid full-network code traceability detection |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112988217B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113468529B (en) * | 2021-06-30 | 2022-08-09 | 建信金融科技有限责任公司 | Data searching method and device |
CN113590192B (en) * | 2021-09-26 | 2022-01-04 | 北京迪力科技有限责任公司 | Quality analysis method and related equipment |
CN113721978B (en) * | 2021-11-02 | 2022-02-11 | 北京大学 | Method and system for detecting open source component in mixed source software |
CN114637512A (en) * | 2022-03-11 | 2022-06-17 | 北京达佳互联信息技术有限公司 | Program code optimization method, device, electronic device and storage medium |
CN115640324A (en) * | 2022-12-23 | 2023-01-24 | 深圳开源互联网安全技术有限公司 | Information query method, device, terminal equipment and storage medium |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107315689A (en) * | 2017-07-04 | 2017-11-03 | 上海爱数信息技术股份有限公司 | The Automation regression testing method of granularity is retrieved based on Git code files |
CN108563444A (en) * | 2018-03-22 | 2018-09-21 | 福州瑞芯微电子股份有限公司 | A kind of Android system firmware source code restoring method and storage medium |
CN109697362A (en) * | 2018-12-13 | 2019-04-30 | 西安四叶草信息技术有限公司 | Network hole detection method and device |
CN109800018A (en) * | 2019-01-10 | 2019-05-24 | 郑州云海信息技术有限公司 | A kind of code administration method and system based on Gerrit |
CN110334326A (en) * | 2019-09-02 | 2019-10-15 | 宁波均胜普瑞智能车联有限公司 | A kind of method and system for identifying recipe file and being converted into XML file |
CN111753149A (en) * | 2020-06-28 | 2020-10-09 | 深圳前海微众银行股份有限公司 | Detection method, device, equipment and storage medium for sensitive information |
CN111813412A (en) * | 2020-06-28 | 2020-10-23 | 中国科学院计算机网络信息中心 | Method and system for constructing test data set for evaluating binary code comparison tool |
CN111813378A (en) * | 2020-07-08 | 2020-10-23 | 北京迪力科技有限责任公司 | Code base construction system, method and related device |
Non-Patent Citations (1)
Title |
---|
基于代码克隆检测的代码来源分析方法;李锁;吴毅坚;赵文耘;;计算机应用与软件(第02期);全文 * |
Also Published As
Publication number | Publication date |
---|---|
CN112988217A (en) | 2021-06-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112988217B (en) | Code base design method and detection method for rapid full-network code traceability detection | |
RU2740865C1 (en) | Methods and device for efficient implementation of database supporting fast copying | |
US11475034B2 (en) | Schemaless to relational representation conversion | |
US10176225B2 (en) | Data processing service | |
US9805079B2 (en) | Executing constant time relational queries against structured and semi-structured data | |
EP2572289B1 (en) | Data storage and processing service | |
KR101785959B1 (en) | Columnar storage representations of records | |
US11468031B1 (en) | Methods and apparatus for efficiently scaling real-time indexing | |
US20230359627A1 (en) | Sharing compiled code for executing queries across query engines | |
US11657051B2 (en) | Methods and apparatus for efficiently scaling result caching | |
US11144580B1 (en) | Columnar storage and processing of unstructured data | |
EP4124969A1 (en) | Compacted table data files validation | |
US11797485B2 (en) | Frameworks for data source representation and compression | |
RU2785613C2 (en) | Methods and device for effective implementation of database supporting fast copying | |
Aggarwal | High Performance Document Store Implementation in Rust | |
CN118642962A (en) | A component analysis implementation method, system and storage medium | |
Mostafa et al. | Investigation cloud data storage |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||