CN113987427A - Tracing method of homologous codes

Info

Publication number: CN113987427A
Application number: CN202111259702.8A
Authority: CN
Inventors: 石澳; 种衍斌; 易焕腾; 但吉兵; 罗峋; 梁大功
Original assignee: Suzhou Lengjing Qicai Information Technology Co ltd
Current assignee: Suzhou Lengjing Qicai Information Technology Co ltd
Priority date: 2021-10-28
Filing date: 2021-10-28
Publication date: 2022-01-28

Abstract

The invention relates to a method for tracing the source of homologous code, comprising the following steps: establishing a standardized vulnerability knowledge base by means of a crawler, acquiring open source code that meets the specification, and preprocessing and storing the open source code; constructing a Neo4j database; and performing service processing. Detection speed is improved by relying on the graph-database association between the content hash values and the directory hash values of the source code. The problem of full-volume detection, performed by colliding binary files against files, is solved and speed is improved by more than 10 times, which provides engine support for further security detection of the open source software supply chain. Through multi-dimensional clustering, the calculated scores make it possible to distinguish whether a given module or a subclass thereof is used, which solves the problem of inaccurate supply chain detection results caused by inclusion relationships among the modules of open source projects. The method identifies open source projects and versions more accurately, so that vulnerability detection in the software security industry becomes more accurate.

Description

Tracing method of homologous codes

Technical Field

The invention relates to a source tracing method, in particular to a source tracing method of homologous codes.

Background

With the spread of networks and the rapid development of software, security vulnerabilities emerge continuously. This has given rise to vulnerability information organizations, such as authoritative vulnerability bodies and security vendors like Snyk, NVD, CNVD, WhiteSource and Black Duck, all of which publish software vulnerabilities for use in securing the software supply chain. When a program is written, external dependencies (such as jar packages or particular modules) are usually introduced. Third-party jar packages and modules are very large, most of them are referenced indirectly, and in some cases part of an open source project is packaged and used directly. Confirming exactly which open source software is in use is therefore a difficult problem. If the open source software and versions used by a specific project can be confirmed, vulnerabilities can be found and patched in time and the security of the software supply chain can be improved.

To confirm which open source project is actually used, a data source of the relevant open source projects is first acquired and a directory and content fingerprint feature library is created; weights are then calculated for the matched projects; finally, similarity clustering is applied and the most likely project and version are selected according to the detection rate after clustering.

However, because the prior art uses a full fingerprint database, every file match requires a full-text search at the bottom layer, and retrieval takes too long when the project is large. The prior art also offers no solution to the problem of project inclusion relationships. For example, Spark has many modules, and file detection returns several projects such as spark-core, spark-streaming and spark, whereas the true detection result should be spark alone.

In view of the above drawbacks, the designer has actively researched and innovated to create a method for tracing the source of homologous code that has industrial application value.

Disclosure of Invention

To solve the above technical problems, an object of the present invention is to provide a method for tracing a source of a homologous code.

The source tracing method of the homologous code is characterized by comprising the following steps:

Step one: establish a standardized vulnerability knowledge base by means of a crawler, acquire open source code that meets the specification, and preprocess and store the open source code.

Step two: construct a Neo4j database.

Step three: perform service processing.

Further, in the above method for tracing the source of homologous code, the processing procedure of step one is as follows:

a. open source code that meets the specification is crawled by a web crawler from source code hosting platforms and communities, where the source code hosting platforms and communities include GitHub, Gitee and Linux, and the specification includes criteria on star rating and number of branches;

b. after preprocessing, the aggregate values of the project to which the source code files belong are stored, where preprocessing includes removing whitespace, removing special characters and removing files that do not help detection (such as ReadMe.txt), and the aggregate values include the number of lines per file and the number of files.
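The preprocessing in item b can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the skip list and the choice of which characters count as "special" are assumptions, since the text names only whitespace, special characters and ReadMe.txt.

```python
import re
from pathlib import PurePosixPath

# Assumption: names of files that do not help detection. The text gives only ReadMe.txt.
SKIPPED_NAMES = {"readme.txt"}


def normalize_source(text: str) -> str:
    """Remove whitespace and special characters, keeping letters, digits and underscores."""
    return re.sub(r"[^0-9A-Za-z_]", "", text)


def preprocess_project(files: dict[str, str]) -> dict:
    """Normalize each file and collect the aggregate values of the project.

    `files` maps a relative path to the raw text of that file.
    """
    kept = {}
    total_lines = 0
    for path, text in files.items():
        if PurePosixPath(path).name.lower() in SKIPPED_NAMES:
            continue  # skip files that do not help detection
        total_lines += len(text.splitlines())  # count lines before normalization
        kept[path] = normalize_source(text)
    return {"files": kept, "file_count": len(kept), "total_lines": total_lines}
```

Line counts are taken before normalization, because removing whitespace also removes the line breaks.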

Furthermore, in the above method for tracing the source of homologous code, in step one the anti-crawler mechanism is bypassed through dynamic proxies, decoding and random crawling, and the crawled files are stored locally.

Furthermore, in the above method for tracing the source of homologous code, the processing procedure of step two is as follows:

a. the open source code is traversed and every directory that contains files is extracted; the content of each extracted file is converted into an MD5 hash; the directories are deduplicated, associated with the file hashes and themselves converted into MD5 hashes; the open source project name and version are then associated with the set of file contents, yielding a set of hashed directories, hashed file contents and projects;

b. the directory is added to Neo4j while ensuring uniqueness: if the directory already exists in Neo4j, the file content is added to it, and if it does not exist, the directory is created;

c. the file content is added while ensuring uniqueness: if Neo4j already has the file, the project node is added, and if not, the file node is created;

d. the project node is added while ensuring uniqueness: if Neo4j already has the project, the relationship to the file is added, and if not, the project node is created;

finally, a graph database is constructed in which directory hashes, file hashes, projects and versions are associated with one another.
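Items b to d each follow a create-if-absent pattern. The sketch below models that pattern with an in-memory stand-in rather than a real Neo4j instance; the class name, attribute names and sample versions are hypothetical. In Neo4j itself, create-if-absent behavior is commonly expressed with the Cypher MERGE clause.

```python
from collections import defaultdict


class FingerprintGraph:
    """In-memory stand-in for the directory -> file -> project graph."""

    def __init__(self):
        # directory hash -> set of file hashes found under that directory
        self.dir_to_files = defaultdict(set)
        # file hash -> set of (project, version) pairs that contain the file
        self.file_to_projects = defaultdict(set)

    def add(self, dir_hash: str, file_hash: str, project: str, version: str) -> None:
        # Sets make every insert idempotent: an existing node or relationship
        # is reused, and a missing one is created on first access.
        self.dir_to_files[dir_hash].add(file_hash)
        self.file_to_projects[file_hash].add((project, version))
```

Adding the same fingerprint twice leaves the graph unchanged, which is the uniqueness guarantee the items above call for.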

Furthermore, in the above method for tracing the source of homologous code, in step two all files are traversed, their paths are saved and their contents are read; each directory is associated with the files it contains; the contents and the directories are converted into hashes through MD5; each hash is stored in Neo4j as a rowkey; and two associations are established, one from directory hash to file hash and one from directory and file hash to project.
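A minimal sketch of the traversal and hashing described above, using only the Python standard library. The text does not state which string is hashed for a directory, so hashing the path relative to the project root is an assumption, as are the function names.

```python
import hashlib
import os


def md5_hex(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of the given bytes."""
    return hashlib.md5(data).hexdigest()


def fingerprint_tree(root: str):
    """Yield (directory_hash, file_hash, relative_path) for every file under root."""
    for current, _dirs, names in os.walk(root):
        # Assumption: the directory fingerprint is the MD5 of its root-relative path.
        rel_dir = os.path.relpath(current, root).replace(os.sep, "/")
        dir_hash = md5_hex(rel_dir.encode("utf-8"))
        for name in sorted(names):
            with open(os.path.join(current, name), "rb") as handle:
                file_hash = md5_hex(handle.read())
            yield dir_hash, file_hash, f"{rel_dir}/{name}"
```

Directories without files yield nothing, which matches the requirement that only directories containing files are extracted.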

Further, in the above method for tracing the source of homologous code, in step two, if the same file belongs to the same project, dynamic expansion is performed, and Neo4j adds nodes and relationships to form a relationship library.

Furthermore, in the above method for tracing the source of homologous code, the processing procedure of step three is as follows:

a. the project to be detected is read and traversed to obtain the directories that contain files and the files themselves; the directories and files are converted into hash values through MD5, and their basic information is stored in the task library of the project to be detected, forming a MySQL service library;

b. the directories in the Neo4j graph database are collided with the directories of the files to be detected; after traversal is complete, the different versions of the same project in the project version set are grouped, the project and version with the highest weight are selected and stored in a new project version set, the basic information of the corresponding project version is looked up in the MySQL library and associated, and the result is stored in a MySQL cache library;

c. clustering and grouping are performed: the contents of the cache library and the task library are read; the detection rate, the number of matched lines and similar figures are calculated from the number of files, the total number of lines and the number of lines per file on both sides; scores are assigned; finally, the highest-scoring project in each group is selected as the result and the result set is returned.

Furthermore, in the above method for tracing the source of homologous code, the basic information includes the number of lines per file, the number of files and the total number of lines. During the collision, if a directory matches 100%, the file hash values under that directory in Neo4j are obtained and collided with the hash values of the files to be detected. If they are consistent, the weight of the project and version corresponding to the file content is accumulated: if no weight exists yet for the project and version, an entry is created and stored in the project version set; if one already exists, the weight value keyed by the project and version is incremented.
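The collision and weight accumulation can be sketched as below. Plain dictionaries stand in for the Neo4j lookups, and the function and parameter names are hypothetical. The directory lookup comes first, so file hashes are compared only within a matching directory rather than against the whole fingerprint set.

```python
from collections import Counter


def accumulate_weights(target, dir_to_files, file_to_projects):
    """Accumulate a weight per (project, version) for the project under test.

    target: iterable of (dir_hash, file_hash) pairs from the project to be detected.
    dir_to_files: directory hash -> set of known file hashes under it.
    file_to_projects: file hash -> set of (project, version) pairs.
    """
    weights = Counter()
    for dir_hash, file_hash in target:
        candidates = dir_to_files.get(dir_hash)
        if candidates is None:
            continue  # directory unknown: no file comparison is needed
        if file_hash in candidates:
            for key in file_to_projects.get(file_hash, ()):
                # Counter starts missing keys at 0, so a first match gives weight 1
                weights[key] += 1
    return weights
```

A file whose directory is unknown never reaches the file-level comparison, which is where the time saving over a full scan would come from.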

Further, in the above method for tracing the source of homologous code, the clustering and grouping is performed along the dimensions of name similarity, directory similarity and the file names under the project, and the optimal group of data is selected according to the detection rate and the number of matched lines.

Still further, in the above method for tracing the source of homologous code, in step b, the selected result, together with the number of lines matched against the project and the detection rate, is stored in the MySQL service library.

By the above scheme, the invention has at least the following advantages:

1. Detection speed is improved by relying on the graph-database association between the content hash values and the directory hash values of the source code.

2. The problem of full-volume detection, performed by colliding binary files against files, is solved and speed is improved by more than 10 times, which provides engine support for further security detection of the open source software supply chain.

3. Through multi-dimensional clustering, the calculated scores make it possible to distinguish whether a given module or a subclass thereof is used, which solves the problem of inaccurate supply chain detection results caused by inclusion relationships among the modules of open source projects. The method identifies open source projects and versions more accurately, so that vulnerability detection in the software security industry becomes more accurate.

4. The implementation process is simple: clustering and grouping can be performed along dimensions such as the directory fingerprint library, name similarity and path inclusion relationships, and for each group of data, result sets for dimensions such as the number of matched code lines and the file matching rate are calculated and cached, from which the final result is obtained.

The foregoing description is only an overview of the technical solutions of the present invention, and in order to make the technical solutions of the present invention more clearly understood and to implement them in accordance with the contents of the description, the following detailed description is given with reference to the preferred embodiments of the present invention and the accompanying drawings.

Drawings

FIG. 1 is a schematic flow chart diagram of an implementation of a tracing method of homologous code.

Detailed Description

The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.

The method for tracing the source of homologous code shown in FIG. 1 comprises the following steps:

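Step one: a standardized vulnerability knowledge base is established by means of a crawler, open source code that meets the specification is acquired, and the open source code is preprocessed and stored. Specifically, the processing procedure is as follows:

First, open source code that meets the specification is crawled by a web crawler from source code hosting platforms and communities, including GitHub, Gitee and Linux; the specification includes criteria on star rating and number of branches.

Then, after preprocessing, the aggregate values of the project to which the source code files belong are stored. Preprocessing includes removing whitespace, removing special characters and removing files that do not help detection (such as ReadMe.txt); the aggregate values include the number of lines per file and the number of files.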

Meanwhile, during implementation, the anti-crawler mechanism can be bypassed through dynamic proxies, decoding and random crawling, and the crawled files are stored locally.

Step two: a Neo4j graph database is constructed. The processing procedure is as follows:

First, the open source code is traversed and every directory that contains files is extracted. The content of each extracted file is converted into an MD5 hash. The directories are deduplicated, associated with the file hashes and themselves converted into MD5 hashes. The open source project name and version are then associated with the set of file contents, yielding a set of hashed directories, hashed file contents and projects.

Then, the directory is added to Neo4j while ensuring uniqueness: if the directory already exists in Neo4j, the file content is added to it, and if not, the directory is created.

Then, the file content is added while ensuring uniqueness: if Neo4j already has the file, the project node is added, and if not, the file node is created.

Subsequently, the project node is added while ensuring uniqueness: if Neo4j already has the project, the relationship to the file is added, and if not, the project node is created.

Step three: service processing is performed. Specifically, the processing procedure is as follows:

Step a: the project to be detected is read and traversed to obtain the directories that contain files and the files themselves. The directories and files are converted into hash values through MD5, and the basic information of the project to be detected (such as the number of lines per file, the number of files and the total number of lines) is stored in the task library of the project to be detected, forming a MySQL service library.

Step b: the directories in the Neo4j graph database are collided with the directories of the files to be detected. After traversal is complete, the different versions of the same project in the project version set are grouped, the project and version with the highest weight are selected and stored in a new project version set, the basic information of the corresponding project version is looked up in the MySQL library and associated, and the result is stored in the MySQL cache library. The basic information comprises the number of lines per file, the number of files and the total number of lines.

Specifically, during the collision, if a directory matches 100%, the file hash values under that directory in Neo4j are obtained and collided with the hash values of the files to be detected. If they are consistent, the weight of the project and version (in Neo4j) corresponding to the file content is accumulated: if no weight exists yet for the project and version, an entry is created and stored in the project version set; if one already exists, the weight value keyed by the project and version is incremented. The next file is then processed in the same manner. The selected result, together with the number of lines matched against the project and the detection rate, can be stored in the MySQL service library.

Step c: through clustering and grouping, the contents of the cache library and the task library are read; the detection rate, the number of matched lines and similar figures are calculated from the number of files, the total number of lines and the number of lines per file on both sides; scores are assigned; finally, the highest-scoring project in each group is selected as the result and the result set is returned. For example, in the scoring method adopted by the invention, the file matching detection rate is worth 30 points: if an open source project in the graph database matches the project to be detected 100%, it receives 30 points, and if the file detection rate is 60%, it receives 18 points. The line-count detection rate is likewise worth 30 points. The number of lines and the number of files are scored in the same way. The results can thus be clustered along multiple dimensions and inclusion relationships removed.
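The proportional scoring in this example can be written out as a short sketch. Only the two 30-point dimensions stated in the text are included; the point values of the remaining dimensions are not given, so they are left out rather than guessed. The function names are hypothetical.

```python
def dimension_score(rate: float, max_points: float) -> float:
    """Scale a rate in [0, 1] to its proportional share of max_points."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    return rate * max_points


def partial_score(file_rate: float, line_rate: float) -> float:
    """Sum the two dimensions named in the text, each worth 30 points.

    Assumption: dimensions combine by simple addition. Other dimensions
    (number of lines, number of files) would be added in the same way.
    """
    return dimension_score(file_rate, 30) + dimension_score(line_rate, 30)
```

With a file detection rate of 60%, `dimension_score(0.6, 30)` reproduces the 18 points of the example.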

Because the results may contain inclusion relationships, which easily make them inaccurate, the invention adopts clustering and grouping. Specifically, the results are clustered and grouped along the dimensions of name similarity, directory similarity and the file names under the project, and the optimal group of data is selected according to the detection rate and the number of matched lines.
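A minimal sketch of this grouping, restricted to the name-similarity dimension. It uses `difflib.SequenceMatcher` from the Python standard library as the similarity measure, which is a substitute chosen for illustration; the text does not name a measure. The 0.4 threshold, the record fields and the sample figures are likewise assumptions.

```python
from difflib import SequenceMatcher


def cluster_by_name(results, threshold=0.4):
    """Group result records whose names resemble the first record of a group.

    results: list of dicts with 'name', 'detection_rate' and 'matched_lines'.
    """
    groups = []
    for item in results:
        for group in groups:
            ratio = SequenceMatcher(None, item["name"], group[0]["name"]).ratio()
            if ratio >= threshold:
                group.append(item)
                break
        else:
            groups.append([item])  # no similar group found: start a new one
    return groups


def pick_best(groups):
    """Keep one record per group: highest detection rate, then most matched lines."""
    return [max(g, key=lambda r: (r["detection_rate"], r["matched_lines"])) for g in groups]
```

Applied to candidates named spark, spark-core and spark-streaming, the three fall into one group and only the best-scoring record of that group is returned, which is how the inclusion relationship is removed.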

Taking a database framework (Hibernate) as an example, the folder path of the framework is passed into the service processing interface.

The program then reads the files and directories under that path and converts them to hashes, and queries Neo4j by directory to obtain the hash values under each corresponding directory (the file entries of all open source code corresponding to that directory).

The file hash values of Hibernate are then matched against the content hash values of the project files in the matched directories. If a match succeeds, the project and version associated with the file and directory are obtained, and the project and version together form a unique identifier. If other files and directories have already matched that unique identifier, its weight is increased by 1; if not, the unique identifier is created with a default weight of 1.

Finally, the unique identifiers are grouped by project name, and within each group the version with the highest weight is selected.
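The final selection can be sketched as follows, assuming the weights are held in a mapping keyed by the (project, version) unique identifier. The function name and the sample versions are hypothetical.

```python
def best_version_per_project(weights):
    """Return project -> (version, weight) for the highest-weight version.

    weights maps (project, version) -> accumulated weight.
    On a tie, the version seen first is kept.
    """
    best = {}
    for (project, version), weight in weights.items():
        if project not in best or weight > best[project][1]:
            best[project] = (version, weight)
    return best
```

For example, weights of 120 and 95 for two Hibernate versions would leave only the version with weight 120 in the result.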

During implementation, information such as the matched directories, the matched files and the number of lines per file is kept in memory. This information is then associated with the final unique identifiers to obtain the basic information of each unique identifier matched against an open source project (for example, the number of matched lines, the proportion of lines and the proportion of files). The results (the projects with the highest weights) are then grouped by association according to name and directory similarity. The projects in each group are scored according to the basic information of each open source project, and the results that contain one another are merged (the optimal project in each group is selected and stored in the service library).

As the above description and the accompanying drawing show, the invention has the advantages set out in the Disclosure of Invention.

Furthermore, the indication of the orientation or the positional relationship described in the present invention is based on the orientation or the positional relationship shown in the drawings, and is only for convenience of describing the present invention and simplifying the description, but does not indicate or imply that the indicated device or configuration must have a specific orientation or be operated in a specific orientation configuration, and thus, should not be construed as limiting the present invention.

The terms "primary" and "secondary" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "primary" or "secondary" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless specifically limited otherwise.

Also, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.

In the present invention, unless otherwise expressly stated or limited, the terms "connected" and "disposed" are to be construed broadly, e.g., as meaning fixedly connected, detachably connected, or integrally formed; can be mechanically or electrically connected; the two components can be directly connected or indirectly connected through an intermediate medium, and the two components can be communicated with each other or mutually interacted. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations. And it may be directly on the other component or indirectly on the other component. When an element is referred to as being "connected to" another element, it can be directly connected to the other element or be indirectly connected to the other element.

It will be understood that the terms "length," "width," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like, refer to an orientation or positional relationship illustrated in the drawings, which are used for convenience in describing the invention and to simplify the description, and do not indicate or imply that the device or component being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, are not to be construed as limiting the invention.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. It should be noted that those skilled in the art can make many modifications and variations without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as falling within the protection scope of the present invention.

Claims

1. A method for tracing the source of homologous code, characterized by comprising the following steps:

step one, establishing a standardized vulnerability knowledge base by means of a crawler, acquiring open source code that meets the specification, and preprocessing and storing the open source code;

step two, constructing a Neo4j database;

and step three, performing service processing.

2. The method for tracing the source of homologous code according to claim 1, wherein the processing procedure of step one is as follows:

b. after preprocessing, the aggregate values of the project to which the source code files belong are stored, wherein the preprocessing includes removing whitespace, removing special characters and removing files that do not help detection, and the aggregate values include the number of lines per file and the number of files.

3. The method for tracing the source of homologous code according to claim 1, wherein: in step one, the anti-crawler mechanism is bypassed through dynamic proxies, decoding and random crawling, and the crawled files are stored locally.

4. The method for tracing the source of the homologous code according to claim 1, wherein: the processing procedure of the second step is that,

5. The method of claim 4, wherein: in step two, if the same file belongs to the same project, dynamic expansion is performed, and Neo4j adds nodes and relationships to form a relationship library.

6. The method for tracing the source of the homologous code according to claim 1, wherein: the processing procedure of the third step is that,

7. The method of claim 6, wherein: the basic information comprises the number of lines per file, the number of files and the total number of lines; during the collision, if a directory matches 100%, the file hash values under that directory in Neo4j are obtained and collided with the hash values of the files to be detected; if they are consistent, the weight of the project and version corresponding to the file content is accumulated; if no weight exists yet for the project and version, an entry is created and stored in the project version set; and if one already exists, the weight value keyed by the project and version is incremented.

8. The method of claim 6, wherein: the clustering and grouping is performed along the dimensions of name similarity, directory similarity and the file names under the project, and the optimal group of data is selected according to the detection rate and the number of matched lines.

9. The method of claim 6, wherein: in step b, the selected result, together with the number of lines matched against the project and the detection rate, is stored in the MySQL service library.