CN117520169A - High-precision npm software supply chain dependency analysis method

Info

Publication number
CN117520169A
Authority
CN
China
Prior art keywords
node
dependency
npm
metadata
version
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311489487.XA
Other languages
Chinese (zh)
Inventor
申文博
王明森
常瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN202311489487.XA
Publication of CN117520169A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3604 Software analysis for verifying properties of programs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/42 Syntactic analysis
    • G06F8/427 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/60 Software deployment
    • G06F8/61 Installation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Stored Programmes (AREA)

Abstract

The invention discloses a high-precision npm software supply chain dependency analysis method, comprising the following steps: acquiring the metadata of all npm open-source packages, with real-time tracking and incremental updating; extracting package version and dependency information from the metadata; performing concurrent dependency resolution on the npm packages according to the version and dependency information, using an improved dependency resolution algorithm, to obtain the dependency resolution results and to-be-installed results of all packages; and storing the dependency resolution results of all npm packages and providing corresponding query interfaces. The method can automatically acquire and synchronize the metadata of the official npm registry, can complete dependency resolution for all versions of all open-source packages in the ecosystem in a short time, and keeps the resolution results updated in real time.

Description

High-precision npm software supply chain dependency analysis method
Technical Field
The invention belongs to the field of open-source software supply chain security, and particularly relates to a high-precision npm software supply chain dependency analysis method.
Background
With the rapid development of the internet and the software industry, the functional requirements and complexity of software keep increasing, and introducing open-source third-party components has become an indispensable part of software development. Under this trend, JavaScript has become one of the core technologies of the internet thanks to its low barrier to entry, its huge number of open-source components, and its mature ecosystem. As of 2022, JavaScript was used as the front-end programming language of 98% of websites, and with the support of technologies such as Node.js and Electron it has also become one of the main programming languages for back-end and client development. npm is the dominant package manager for JavaScript; its official registry currently hosts more than three million open-source packages and is still growing explosively.
Because JavaScript software depends so heavily on third-party components, and because the npm registry is huge and still growing, software supply chain security problems are particularly serious in npm. In July 2018, the npm credentials of a maintainer of the eslint-scope package were compromised, and the attacker published a malicious version that stole npm credentials from affected computers. In November 2018, malicious code was added to the event-stream package in order to steal bitcoin from a specific application. In 2022, an internal problem in the is-process package caused a large number of well-known tools and serverless applications to fail to build or run. In January 2022, the developer of the colors package planted malicious code in it, causing a denial-of-service (DoS) attack on its users. In March 2022, malicious code in the node-ipc package deleted files on the machines of its users in Belarus and Russia.
In the face of such frequent and far-reaching supply chain security incidents in the npm ecosystem, many researchers and vendors have proposed npm software supply chain analysis methods and tools. However, these tools have accuracy problems in the analysis step, which propagate to all subsequent analyses. For example:
1. Many existing works use outdated official data or third-party data, which cannot guarantee the completeness, timeliness, and authority of the data and therefore degrades the timeliness and accuracy of the analysis results.
2. JavaScript runtimes such as Node.js manage and load dependencies through the dependency file tree, so resolving npm dependency relationships cannot be separated from constructing that file tree: whether a dependency is installed, and which version is installed, are both affected by the file tree. In addition, npm has multiple dependency types. Those resolved by default are dependencies, peerDependencies, and optionalDependencies, and each type follows different resolution and installation rules, which further complicates the selection of dependencies and their versions. The algorithms used in existing research and tools either consider only direct dependency relationships and ignore the construction of the dependency file tree, or consider only ordinary dependencies and ignore complex types such as peer dependencies. As a result, their output deviates substantially from that of the official npm tool, and such erroneous results cannot reflect how npm packages are actually installed and used.
3. Some security tools invoke the official npm tool directly. The official npm tool is slow and is not practical for ecosystem-wide analysis. Such tools therefore analyze only specified packages, are slow, and give one-sided results; they cannot provide comprehensive analysis and statistics and cannot be used for open-source software supply chain analysis.
Disclosure of Invention
To address the shortcomings of the prior art, embodiments of the invention provide a high-precision npm software supply chain dependency analysis method. The method automatically acquires and synchronizes metadata from the official npm registry, extracts version and dependency information from the metadata in real time, resolves all dependency types for all package versions using an improved dependency resolution algorithm, stores the resolution results, and provides a query interface for users.
According to a first aspect of embodiments of the present application, there is provided a high-precision npm software supply chain dependency analysis method, comprising:
(1) Acquiring metadata of all npm open source software packages, and carrying out real-time tracking and incremental updating;
(2) Extracting version and dependency information of a software package from the metadata;
(3) Performing concurrent dependency resolution on the npm packages according to the version and dependency information, using an improved dependency resolution algorithm, to obtain the dependency resolution results and to-be-installed results of all packages;
(4) Storing the dependency resolution results of all npm packages and providing corresponding query interfaces.
Further, step (1) includes:
(1.1) acquiring all metadata of the npm open-source registry and storing it in a local metadata database;
(1.2) periodically repeating data tracking and metadata downloading to acquire the updates of the npm open-source registry, and updating the local metadata database so that the local data stays up to date.
Further, step (2) includes:
(2.1) extracting the tag information, the version information, and the metadata of each version of all npm packages from the metadata;
(2.2) storing the tag information, the version information, and the metadata of each version in three key-value in-memory databases, respectively.
Further, in step (3), the dependency resolution result includes the dependency items, dependency versions, dependency levels, and dependency paths, and the to-be-installed result includes the relative paths of all packages that must be installed, according to the dependency relationships, when the target npm package is installed into an empty directory.
Further, in step (3), the improved dependency resolution algorithm includes:
converting the npm package version to be analyzed into a node and adding it to the queue of nodes to be analyzed;
while the queue of nodes to be analyzed is not empty, repeating the following operations:
(i) Dequeuing the head node;
(ii) For each invalid dependency edge of the dequeued node, loading the target node of the edge, recursively loading the peer dependency group of the target node, and adding the edge and the target node to a task queue, wherein the target node of an edge is the package being depended on, an invalid dependency edge is an edge that does not point to any node or whose target node has a name or version that does not satisfy the name and version range of the edge, and the peer dependency group refers to all direct or indirect dependencies of the node;
(iii) For each edge and corresponding node in the task queue, determining the positions of the node and its peer dependency group in the node tree by means of a node tree position determination algorithm, and adding the node, together with any nodes whose dependency edges have become invalid again, to the queue of nodes to be analyzed, wherein the node tree is formed by the parent-child relationships among nodes and represents the file tree to be installed.
Further, the node tree position determination algorithm executes the following steps for each node and corresponding edge in the peer dependency group of the algorithm's input node:
determining an initial target node: for an input edge and the corresponding input node, if the edge represents a peer dependency, the initial target node is the parent node of the source node of the input edge; otherwise, the initial target node is the source node, wherein the source node of an edge is the package that has the dependency represented by the edge;
traversing upward from the initial target node through its successive parent nodes, executing a placement type judgment algorithm at each traversed node to judge the placement type of the input node under that node, wherein the placement types are OK, KEEP, REPLACE, and CONFLICT;
taking the last traversed node whose placement type is not CONFLICT and performing the node placement operation under that node according to the placement type.
Further, the placement type judgment algorithm includes:
(I) If the target node has a child node with the same name as the node to be placed:
(i) If the node with the same name is the same as the node version to be placed, judging that the placement type is KEEP;
(ii) If the node to be placed can REPLACE the same-name node, judging that the placement type is REPLACE;
(iii) Otherwise, judging the placement type as CONFLICT;
(II) if the target node does not have a child node with the same name as the node to be placed:
(i) If the node to be placed is placed as a peer dependency and would cause a conflict by shadowing a dependency along the node path, judging the placement type as CONFLICT;
(ii) If the peer dependency group of the node to be placed can be placed recursively, judging the placement type as OK;
(iii) Otherwise, the placement type is judged to be CONFLICT.
According to a second aspect of embodiments of the present application, there is provided a high-precision npm software supply chain dependency analysis apparatus comprising:
a metadata acquisition module, configured to acquire the metadata of all npm open-source packages and to perform real-time data tracking and incremental updating;
an information extraction module, configured to extract package version and dependency information from the metadata;
a dependency resolution module, configured to perform concurrent dependency resolution on the npm packages according to the version and dependency information, using an improved dependency resolution algorithm, to obtain the dependency resolution results and to-be-installed results of all packages;
a result storage module, configured to store the dependency resolution results of all npm packages and to provide corresponding query interfaces.
According to a third aspect of embodiments of the present application, there is provided an electronic device, including:
one or more processors;
a memory for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of the first aspect.
According to a fourth aspect of embodiments of the present application, there is provided a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps of the method according to the first aspect.
The technical solutions provided by the embodiments of the present application may have the following beneficial effects:
The invention automatically acquires and synchronizes npm data. Technical means designed for the npm database, such as recording the progress through the update stream and verifying data entries, ensure that data is tracked correctly and thereby guarantee the completeness and timeliness of the analyzed data.
The algorithm design and implementation of the invention focus on time efficiency. Through parallel execution of the improved dependency resolution algorithm, targeted database design optimization, and the use of in-memory databases, the invention can complete dependency resolution for all versions of all open-source packages in the ecosystem in a short time.
The invention provides a feedback mechanism that updates resolution results after data synchronization: npm data updates trigger incremental re-resolution and result updates, which keeps the resolution results current and further improves the timeliness and accuracy of the analysis.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
FIG. 1 is a flow chart of a high-precision npm software supply chain dependency analysis method.
FIG. 2 is a schematic diagram of step (1).
FIG. 3 is a schematic diagram of step (2).
FIG. 4 is a schematic diagram of step (3).
FIG. 5 is a pseudocode diagram of the improved dependency resolution algorithm.
FIG. 6 is a pseudocode diagram of the node tree position determination algorithm.
FIG. 7 is a pseudocode diagram of the placement type judgment algorithm.
FIG. 8 is a schematic diagram of step (4).
FIG. 9 is a block diagram of a high-precision npm software supply chain dependency analysis apparatus.
FIG. 10 is a schematic diagram of an electronic device.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application.
The terminology used in the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the present application. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. The word "if" as used herein may be interpreted as "when", "upon", or "in response to determining", depending on the context.
FIG. 1 is a flow chart of a high-precision npm software supply chain dependency analysis method according to an exemplary embodiment. As shown in FIG. 1, the method may include the following steps:
(1) Acquiring metadata of all npm open source software packages, and carrying out real-time tracking and incremental updating;
Specifically, as shown in FIG. 2, this step comprises two processes, a data tracking process and a metadata downloading process, which are carried out in sequence in the following sub-steps:
(1.1) acquiring all metadata of the npm open-source registry and storing it in a local metadata database;
Specifically, historical data tracking is performed first: the query API of the official npm database is used to read the update stream of the official database, and from this update stream the indexes of all data entries recorded since the database was created are obtained; metadata downloading is then triggered to obtain all metadata. The metadata downloading process downloads data according to these indexes, saves the acquired metadata to the local metadata database, and additionally stores information such as the source revision and checksum of each data entry for checking and verification. The local metadata database is a document database, which keeps it compatible with the official npm database and preserves the integrity of the data.
(1.2) periodically repeating data tracking and metadata downloading to acquire the updates of the npm open-source registry, and updating the local metadata database so that the local data stays up to date;
Specifically, data tracking uses the query API of the official npm database to obtain the data update indexes of the official database; metadata downloading then fetches the metadata of the npm packages to be updated according to these indexes, stores it in the database, and records the index numbers of the updated package metadata. More specifically, after downloading and updating are completed, information such as the update progress and the index number is saved and used as the query starting point the next time the process runs.
It should be noted that step (1.2) is repeated continuously, which implements real-time tracking and incremental updating of the database. Whenever data is updated, the update actions of the subsequent steps, such as metadata extraction and dependency resolution, are triggered.
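The checkpointed incremental update of step (1.2) can be sketched as follows. This is a minimal illustration, not the patented implementation: `LocalMirror`, `sync_once`, the `(sequence number, package name)` feed format, and the toy registry data are all assumptions made for the example, and no real registry API is called.

```python
from dataclasses import dataclass, field

@dataclass
class LocalMirror:
    """Hypothetical local metadata store plus the saved progress marker."""
    docs: dict = field(default_factory=dict)  # package name -> metadata document
    last_seq: int = 0                         # index number of the last applied update

def sync_once(mirror, change_feed, fetch_metadata):
    """Apply every change newer than mirror.last_seq and return the updated names."""
    updated = []
    for seq, name in change_feed:             # feed is assumed ordered by sequence number
        if seq <= mirror.last_seq:
            continue                          # already applied in an earlier run
        mirror.docs[name] = fetch_metadata(name)
        mirror.last_seq = seq                 # record progress after each entry
        updated.append(name)
    return updated

# Toy data standing in for the registry (not real API output).
feed = [(1, "a"), (2, "b"), (3, "a")]
registry = {"a": {"rev": "2-x"}, "b": {"rev": "1-y"}, "c": {"rev": "1-z"}}
mirror = LocalMirror()
first = sync_once(mirror, feed, registry.__getitem__)
second = sync_once(mirror, feed + [(4, "c")], registry.__getitem__)
```

In this sketch the second run only processes entry 4, because the saved `last_seq` serves as the query starting point; the names returned by `sync_once` are what would trigger re-extraction and re-resolution downstream.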
(2) Extracting version and dependency information of a software package from the metadata;
Specifically, all data required for dependency resolution is extracted from the local metadata database and stored in a more efficient database. As shown in FIG. 3, this step may include the following sub-steps:
(2.1) extracting the tag information, the version information, and the metadata of each version of all npm packages from the metadata;
Specifically, fields such as the name, dist-tags, and versions of each npm package are extracted from the metadata. The name-version information is split, recombined, and written to a version information database; the name-tag information is split, recombined, and written to a tag information database; and the version-metadata information is split, cleaned, and written to a dependency information database.
(2.2) storing the tag information, the version information, and the metadata of each version in three key-value in-memory databases, respectively;
Specifically, the data extracted in this step is the data source used directly by the subsequent dependency resolution. Splitting the metadata as described above reduces the amount of text processing during resolution; storing the data in key-value databases reduces the computational complexity of queries; and keeping the databases in memory reduces the hardware execution time of queries. Together, these measures keep the extracted data compact and make subsequent reads efficient.
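The split into three key-value stores can be illustrated with the sketch below. Plain dictionaries stand in for the in-memory databases, and the key formats (`name`, `name@tag`, `name@version`) and the function name `extract` are assumptions of this example; the source does not specify them. The input mimics the shape of a registry metadata document with `name`, `dist-tags`, and `versions` fields.

```python
def extract(packument, version_db, tag_db, dependency_db):
    """Split one package metadata document into three key-value stores."""
    name = packument["name"]
    versions = packument.get("versions", {})
    version_db[name] = list(versions)                      # name -> known versions
    for tag, version in packument.get("dist-tags", {}).items():
        tag_db[f"{name}@{tag}"] = version                  # name@tag -> version
    kept = ("dependencies", "peerDependencies", "optionalDependencies")
    for version, meta in versions.items():
        # Cleaning step: keep only the fields that dependency resolution needs.
        dependency_db[f"{name}@{version}"] = {k: meta[k] for k in kept if k in meta}

version_db, tag_db, dependency_db = {}, {}, {}
extract(
    {
        "name": "demo",
        "dist-tags": {"latest": "1.1.0"},
        "versions": {
            "1.0.0": {"description": "x", "dependencies": {"left": "^1.0.0"}},
            "1.1.0": {"peerDependencies": {"host": "*"}},
        },
    },
    version_db, tag_db, dependency_db,
)
```

Each later lookup during resolution is then a single key access rather than a parse of the full metadata document.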
(3) Performing concurrent dependency resolution on the npm packages according to the version and dependency information, using an improved dependency resolution algorithm, to obtain the dependency resolution results and to-be-installed results of all packages;
As shown in FIG. 4, this step may include the following sub-steps:
(3.1) concurrently executing the improved dependency resolution algorithm on all npm packages to be analyzed, obtaining the information needed for resolution by querying the name-version information database, the name-tag information database, and the version-metadata information database built in step (2);
Specifically, dependency resolution is implemented by the improved dependency resolution algorithm. The algorithm is executed on every npm package version in the npm ecosystem and, from the metadata stored in the databases, constructs the file tree to be installed (the node tree) and the dependency graph (the node graph).
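The concurrent execution over all package versions might be organized as in the following sketch. It uses a thread pool only for brevity; the embodiment is described as a C++ implementation, and `analyze_one` is a placeholder that merely lists direct dependencies rather than running the real algorithm. All names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_one(package_version, dependency_db):
    """Placeholder for one resolution run; it only lists direct dependency names."""
    return package_version, sorted(dependency_db.get(package_version, {}))

def analyze_all(package_versions, dependency_db, workers=4):
    """Run analyze_one for every package version on a pool of workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda pv: analyze_one(pv, dependency_db), package_versions)
        return dict(results)

dependency_db = {"app@1.0.0": {"a": "^1.0.0", "b": "*"}, "a@1.0.0": {}}
results = analyze_all(["app@1.0.0", "a@1.0.0"], dependency_db)
```

Because every run only reads from the shared key-value stores, the runs are independent of one another, which is what makes this kind of parallelization straightforward.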
The main concepts used by the improved dependency resolution algorithm are defined as follows:
1. Node: represents an npm package and contains its name and version information.
2. Edge: represents a dependency relationship between nodes and contains the name and version range of the dependency.
1) The source node of an edge is the package that has the dependency, and the target node of an edge is the package being depended on. The outgoing edges of a node represent its dependencies, and the incoming edges of a node represent the packages that depend on it.
2) An edge is valid when it points to a target node whose name and version satisfy the name and version range of the edge. An edge is invalid when it does not point to any node, or when the name or version of the target node it points to does not satisfy the name and version range of the edge.
3. Node tree: the tree formed by the nodes and the parent-child relationships between them. It represents the file tree to be installed, i.e. the node_modules directory tree that exists after the dependencies have actually been installed.
4. Node graph: the graph formed by nodes and edges. It represents the dependency relationships between packages.
5. Peer dependency group: all direct or indirect dependencies of a node are referred to as the peer dependency group of the node.
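The node and edge concepts, and the validity rule for edges, can be written down as a small sketch. The `satisfies` function below is a deliberately tiny stand-in for real semantic-version range matching: it only understands `*`, exact versions, and caret ranges with a major version of at least 1, and it ignores pre-release tags. The class and function names are assumptions of this example.

```python
from dataclasses import dataclass
from typing import Optional

def parse(version: str) -> tuple:
    """Turn '1.4.0' into (1, 4, 0); pre-release tags are not handled here."""
    return tuple(int(part) for part in version.split("."))

def satisfies(version: str, version_range: str) -> bool:
    """Toy range check: '*', exact versions, and '^X.Y.Z' with X >= 1 only."""
    if version_range == "*":
        return True
    if version_range.startswith("^"):
        low = parse(version_range[1:])
        return parse(version)[0] == low[0] and parse(version) >= low
    return version == version_range

@dataclass
class Node:
    name: str
    version: str

@dataclass
class Edge:
    name: str                      # name of the package being depended on
    version_range: str             # acceptable versions of that package
    target: Optional[Node] = None  # node the edge currently points to, if any

    def is_valid(self) -> bool:
        """Valid only if the edge points to a node with a matching name and version."""
        return (self.target is not None
                and self.target.name == self.name
                and satisfies(self.target.version, self.version_range))
```

For instance, `Edge("b", "^1.2.0", Node("b", "1.4.0")).is_valid()` holds, while the same edge with no target, with `b@2.0.0`, or with a node named `c` is invalid.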
The algorithm consists of three sub-algorithms, each of which calls the next:
Core logic sub-algorithm: the main body of the resolution algorithm, an improved breadth-first search.
Node tree position determination sub-algorithm: called while the resolution algorithm runs; given an edge and its target node, it places the node at a suitable position in the node tree.
Placement type judgment sub-algorithm: called while the node tree position determination algorithm runs; given an edge and its target node, it judges how the node can be placed at a specific position. The placement types and their meanings are as follows:
1) OK: no existing node is present, and the node can be placed directly;
2) KEEP: the existing node satisfies the edge, and the node does not need to be placed;
3) REPLACE: the existing node does not satisfy the edge, but the node can replace the existing node;
4) CONFLICT: the existing node does not satisfy the edge, the node cannot replace the existing node, and the node cannot be placed.
Specifically, the core logic of the resolution algorithm is an improved breadth-first search. It follows the actual installation process of npm packages and keeps the resolution order consistent with that of the official npm tool, which improves accuracy; it also handles ordinary dependencies and peer dependencies separately, which further improves accuracy. The main logic of this part is shown in FIG. 5:
(I) Converting the npm package version to be analyzed into a node and adding it to the queue of nodes to be analyzed;
(II) While the queue of nodes to be analyzed is not empty, repeating the following operations:
(i) Dequeuing the head node;
(ii) For each invalid dependency edge of the dequeued node, loading the target node of the edge, recursively loading the peer dependency group of the target node, and adding the edge and the target node to a task queue;
(iii) For each edge and corresponding node in the task queue, executing the node tree position determination algorithm to determine the positions of the node and its peer dependency group in the node tree, and adding the node, together with any nodes whose dependency edges have become invalid again, to the queue of nodes to be analyzed. A node's dependency edge becomes invalid again when the node tree position determination algorithm causes an edge of a node already in the tree to change from valid to invalid.
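The queue-driven structure of steps (I) and (II) can be sketched as below. This is a heavily simplified illustration and not the algorithm of FIG. 5: version ranges are treated as exact versions, the node tree is flattened into a single `installed` dictionary, peer dependency groups are omitted, and a conflicting version is simply skipped instead of being nested by the position determination algorithm. The function name and data layout are assumptions.

```python
from collections import deque

def resolve(root, dependency_db):
    """Breadth-first resolution over a flat stand-in for the node tree."""
    installed = {}                 # simplified node tree: name -> version
    visited = []                   # order in which nodes are dequeued
    queue = deque([root])          # (I) the root package version is the first node
    while queue:                   # (II) loop until nothing is left to analyze
        name, version = queue.popleft()            # (i) dequeue the head node
        visited.append(f"{name}@{version}")
        tasks = []
        for dep, wanted in dependency_db.get(f"{name}@{version}", {}).items():
            if installed.get(dep) != wanted:       # (ii) the edge is invalid
                tasks.append((dep, wanted))        # "load" its target node
        for dep, wanted in tasks:                  # (iii) place and enqueue
            if dep not in installed:               # conflicts are skipped in this sketch
                installed[dep] = wanted
                queue.append((dep, wanted))
    return installed, visited

dependency_db = {
    "app@1.0.0": {"a": "1.0.0", "b": "1.0.0"},
    "a@1.0.0": {"b": "1.0.0"},
    "b@1.0.0": {},
}
installed, visited = resolve(("app", "1.0.0"), dependency_db)
```

When `a@1.0.0` is dequeued, its edge to `b` is already valid because `b@1.0.0` was placed while processing the root, so no new task is created; this is the behaviour that keeps already satisfied dependencies from being loaded twice.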
Specifically, the node tree position determination algorithm places nodes in the node tree according to edges. It simulates the path lookup logic of the Node.js dependency loading process, which ensures that positions are determined correctly; it also handles ordinary dependencies and peer dependencies separately and recursively, which improves the robustness of the resolution process and the accuracy for different dependency types. The main logic of this part is shown in FIG. 6; the following steps are executed for each node and corresponding edge in the peer dependency group of the algorithm's input node:
(I) Determining an initial target node: for an input edge and the corresponding input node, if the edge represents a peer dependency, the initial target node is the parent node of the source node of the input edge; otherwise, the initial target node is the source node;
(II) Traversing upward from the initial target node through its successive parent nodes, executing the placement type judgment algorithm at each traversed node to judge the placement type of the input node under that node;
(III) Taking the last traversed node whose placement type is not CONFLICT and performing the placement under that node according to the placement type: for OK, the node is placed directly; for KEEP, the node is not placed; for REPLACE, the node is placed by replacing the existing node.
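Steps (I) to (III) amount to walking up the node tree and remembering the highest position that still works. The sketch below illustrates this under several assumptions that are not stated in the source: the walk stops at the first CONFLICT, the placement judgment is reduced to OK, KEEP, and CONFLICT (no REPLACE and no peer dependency groups), and a peer dependency of the root falls back to the root itself. `TreeNode`, `judge`, and `place` are hypothetical names.

```python
class TreeNode:
    """A node in the node tree; children are keyed by package name."""
    def __init__(self, name, version, parent=None):
        self.name, self.version, self.parent = name, version, parent
        self.children = {}
        if parent is not None:
            parent.children[name] = self

def judge(target, name, version):
    """Reduced placement judgment: OK, KEEP, or CONFLICT only."""
    existing = target.children.get(name)
    if existing is None:
        return "OK"
    return "KEEP" if existing.version == version else "CONFLICT"

def place(source, name, version, is_peer=False):
    # (I) peer dependencies start one level above the source node.
    start = source.parent if is_peer and source.parent is not None else source
    best, best_type = None, None
    target = start
    while target is not None:                  # (II) walk up through the parents
        kind = judge(target, name, version)
        if kind == "CONFLICT":
            break                              # assumption: stop at the first conflict
        best, best_type = target, kind
        target = target.parent
    if best is None:
        raise ValueError(f"cannot place {name}@{version}")
    if best_type == "OK":                      # (III) place under the last good node
        return TreeNode(name, version, parent=best)
    return best.children[name]                 # KEEP: reuse the existing node

root = TreeNode("app", "1.0.0")
a = TreeNode("a", "1.0.0", parent=root)
b = TreeNode("b", "1.0.0", parent=root)
c1 = place(root, "c", "1.0.0")                 # placed directly under the root
c2 = place(a, "c", "2.0.0")                    # root already holds c@1.0.0, so nest under a
c_for_b = place(b, "c", "1.0.0")               # the root copy already satisfies b
```

The result mirrors a familiar node_modules layout: one shared `c@1.0.0` at the top level and a nested `c@2.0.0` under `a`.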
Specifically, the placement type judgment algorithm determines the placement type of the node to be placed under a target node. It covers all possible cases of the node tree without duplication or omission, and it handles ordinary dependencies and peer dependencies separately and recursively, which improves the accuracy of the judgment for different dependency types. The main logic of this part is shown in FIG. 7:
(I) If the target node has a child node with the same name as the node to be placed:
(i) If the same-name node has the same version as the node to be placed, judging the placement type as KEEP;
(ii) If the node to be placed can replace the same-name node (that is, the node to be placed satisfies all incoming edges of the same-name node, and the peer dependency group of the node to be placed can be placed or replaced recursively), judging the placement type as REPLACE;
(iii) Otherwise, judging the placement type as CONFLICT;
(II) If the target node does not have a child node with the same name as the node to be placed:
(i) If the node to be placed is placed as a peer dependency and would cause a conflict by shadowing a dependency along the node path, judging the placement type as CONFLICT;
(ii) If the peer dependency group of the node to be placed can be placed recursively, judging the placement type as OK;
(iii) Otherwise, judging the placement type as CONFLICT.
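The case analysis above can be expressed as a single decision function. In this sketch the recursive checks on the peer dependency group and the shadowing check of case (II)(i) are not implemented; they are passed in as the boolean arguments `peers_placeable` and `shadows_dependency`, and each incoming edge is represented simply as a set of acceptable versions. These simplifications and all names are assumptions of the example.

```python
from enum import Enum

class Placement(Enum):
    OK = "OK"
    KEEP = "KEEP"
    REPLACE = "REPLACE"
    CONFLICT = "CONFLICT"

def judge(existing_version, new_version, incoming_ranges=(),
          peers_placeable=True, shadows_dependency=False):
    """Return the placement type of new_version under one target node.

    existing_version: version of the same-name child, or None if there is none.
    incoming_ranges: for each incoming edge of the existing child, the set of
                     versions that edge accepts (stand-in for a version range).
    """
    if existing_version is not None:                       # case (I)
        if existing_version == new_version:
            return Placement.KEEP
        fits_all = all(new_version in allowed for allowed in incoming_ranges)
        if fits_all and peers_placeable:
            return Placement.REPLACE
        return Placement.CONFLICT
    if shadows_dependency:                                 # case (II)(i)
        return Placement.CONFLICT
    return Placement.OK if peers_placeable else Placement.CONFLICT
```

For example, `judge("1.0.0", "1.1.0", [{"1.0.0", "1.1.0"}])` yields REPLACE because every dependent of the existing node also accepts the new version, whereas `judge("1.0.0", "2.0.0", [{"1.0.0"}])` yields CONFLICT.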
In addition, the improved dependency analysis algorithm uses tag and alias information, together with preset information such as os, cpu, and glibc, to specify and filter the versions selected during dependency analysis; the loading logic for package nodes is improved, which makes loading more robust, reduces repeated loading of data, and avoids problems such as null pointers and infinite loops found in existing tools; redundant judgments and ineffective steps are reduced, and data reading and processing are optimized, which greatly improves time efficiency.
In addition, the algorithm is implemented in C++ in this embodiment, with attention paid to optimizing time efficiency so as to meet the time requirements of ecosystem-wide dependency analysis; high-precision dependency analysis of all npm open-source software can be completed within 100 hours.
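As an illustration of the os and cpu filtering mentioned above, the following Python sketch filters candidate version metadata against a preset platform. It assumes the `package.json` convention in which `os` and `cpu` are lists whose entries either allow a value or, when prefixed with `!`, block it. The function names and the shape of the candidate dictionaries are hypothetical, and glibc filtering is not covered.

```python
from typing import Dict, List


def matches_platform(declared: List[str], current: str) -> bool:
    """Check one os/cpu list: '!'-prefixed entries block, plain entries allow."""
    if not declared:
        return True                      # no restriction declared
    blocked = [entry[1:] for entry in declared if entry.startswith("!")]
    allowed = [entry for entry in declared if not entry.startswith("!")]
    if current in blocked:
        return False
    if allowed and current not in allowed:
        return False
    return True


def filter_versions(candidates: List[Dict], os_name: str, cpu: str) -> List[Dict]:
    """Keep only the version metadata entries usable on the preset platform."""
    return [
        meta for meta in candidates
        if matches_platform(meta.get("os", []), os_name)
        and matches_platform(meta.get("cpu", []), cpu)
    ]
```

For example, with a preset of `linux` and `arm64`, a version declaring `"os": ["win32"]` or `"cpu": ["!arm64"]` would be dropped before version selection.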
(3.2) returning the analysis result data from each parallel dependency analysis process after its analysis is completed;
specifically, each piece of analysis result data includes a dependency analysis result and a to-be-installed result. The dependency analysis result comprises dependency items, dependency versions, dependency levels, dependency paths and the like; the to-be-installed result includes the relative paths of all packages that need to be installed according to the dependencies when the target npm software package is installed in an empty directory.
(4) Storing the dependency analysis results of all npm software packages, and setting corresponding query interfaces;
specifically, the dependency items, dependency versions, dependency relationships, dependency levels, and the like of all npm packages are stored in a relational database, the dependency paths, and the like of all npm packages are stored in a document database, and a packaged query interface for the two databases is provided.
In one embodiment, as shown in FIG. 8, the results are processed and stored in a dependency information database and a dependency path database, respectively. The processing includes computing dependency paths and dependency levels from the dependency relationships, compressing the file tree to be installed, serializing the dependency graph, and the like. The storing process comprises storing the dependency items, the dependency versions and the dependency levels into the dependency information database, and storing the dependency paths and the to-be-installed results into the dependency path database as documents.
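The step of deriving dependency paths and dependency levels from the dependency relationships can be illustrated with a breadth-first traversal. The Python sketch below is a minimal example, not the embodiment's implementation; it assumes that a dependency's level is its shortest distance from the analyzed package and that one shortest path per dependency is recorded. The function name and the adjacency-list input format are hypothetical.

```python
from collections import deque
from typing import Dict, List, Tuple


def levels_and_paths(graph: Dict[str, List[str]],
                     root: str) -> Tuple[Dict[str, int], Dict[str, List[str]]]:
    """Return the level and one shortest dependency path for every reachable package."""
    level = {root: 0}
    path = {root: [root]}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for dep in graph.get(current, []):
            if dep not in level:             # first visit is the shortest route; also breaks cycles
                level[dep] = level[current] + 1
                path[dep] = path[current] + [dep]
                queue.append(dep)
    return level, path
```

Direct dependencies come out at level 1, and dependencies that are only reachable through them come out at level 2 or deeper.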
When a query request is received, the packaged database query is executed according to the request, and the query result is returned. Structured queries against the dependency information database return information such as the dependency versions, dependency relationships and dependency levels of dependencies and dependents, as well as statistics such as the number of dependencies, the number of dependents and dependency rankings; document queries against the dependency path database return dependency path information.
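A small sketch of such a two-store layout follows, using Python's standard `sqlite3` module as a stand-in for the relational dependency information database and a dictionary of JSON strings as a stand-in for the document database. The table schema, function names, and result format are assumptions made for illustration and are not taken from the embodiment.

```python
import json
import sqlite3
from typing import Dict, List, Tuple


def build_store(results: List[dict]) -> Tuple[sqlite3.Connection, Dict[str, str]]:
    """Split analysis results into a relational table and per-package documents."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE dependency ("
        "package TEXT, dep_name TEXT, dep_version TEXT, level INTEGER)"
    )
    docs: Dict[str, str] = {}
    for result in results:
        db.executemany(
            "INSERT INTO dependency VALUES (?, ?, ?, ?)",
            [(result["package"], d["name"], d["version"], d["level"])
             for d in result["dependencies"]],
        )
        # Paths and the to-be-installed result are stored as one document per package.
        docs[result["package"]] = json.dumps(
            {"paths": result["paths"], "to_install": result["to_install"]}
        )
    db.commit()
    return db, docs


def dependencies_of(db: sqlite3.Connection, package: str) -> List[tuple]:
    """Structured query: dependencies of one package, ordered by level."""
    return db.execute(
        "SELECT dep_name, dep_version, level FROM dependency "
        "WHERE package = ? ORDER BY level, dep_name",
        (package,),
    ).fetchall()


def dependents_ranking(db: sqlite3.Connection, limit: int = 10) -> List[tuple]:
    """Statistical query: packages ranked by how many packages depend on them."""
    return db.execute(
        "SELECT dep_name, COUNT(DISTINCT package) AS n FROM dependency "
        "GROUP BY dep_name ORDER BY n DESC, dep_name ASC LIMIT ?",
        (limit,),
    ).fetchall()


def dependency_paths(docs: Dict[str, str], package: str) -> dict:
    """Document query: dependency paths recorded for one package."""
    return json.loads(docs[package])["paths"]
```

Keeping the tabular facts and the path documents apart lets counting and ranking queries run as plain SQL aggregates, while the larger, irregular path data is fetched by package key only when needed.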
In conclusion, the method has good prospects for popularization and application:
The invention can automatically acquire and synchronize npm data. Technical means designed for the npm database, such as recording the progress of the update flow and verifying data on entry, ensure that the data is correct and traceable, and thereby ensure the integrity and timeliness of the analyzed data;
The invention improves the existing supply chain dependency analysis algorithm. Through targeted analysis of the npm-specific dependency types dependencies, peerDependencies and optionalDependencies and of the tag, alias and semver version types, and through targeted filtering on os, cpu and other information, it improves the accuracy and customizability of npm software supply chain analysis tools;
The algorithm design and specific implementation of the invention focus on time efficiency. Through parallel execution of the improved dependency analysis algorithm, targeted database design optimization and the use of in-memory databases, the invention can complete the dependency analysis of all versions of all open-source software packages in the ecosystem in a short time;
The invention designs a feedback update mechanism for analysis results after data synchronization: npm data updates trigger incremental re-analysis and result updates, which keeps the analysis results up to date in real time and further improves the timeliness and accuracy of the analysis.
Corresponding to the foregoing embodiments of the high-precision npm software supply chain dependency analysis method, the present application also provides embodiments of the high-precision npm software supply chain dependency analysis apparatus.
FIG. 9 is a block diagram of a high-precision npm software supply chain dependency analysis apparatus, according to an example embodiment. Referring to fig. 9, the apparatus may include:
the metadata acquisition module 21 is used for acquiring metadata of all npm open source software packages and carrying out real-time tracking and incremental updating on the data;
an information extraction module 22 for extracting version and dependency information of the software package from the metadata;
the dependency analysis module 23 is configured to perform concurrent dependency analysis on the npm software package by using an improved dependency analysis algorithm according to the version and the dependency information of the software package, so as to obtain a dependency analysis result and a to-be-installed result of all software packages;
the result storage module 24 is configured to store the dependency analysis results of all npm software packages, and set a corresponding query interface.
The specific manner in which the various modules of the apparatus in the above embodiments perform their operations has been described in detail in connection with the embodiments of the method, and will not be described in detail herein.
Since the device embodiments essentially correspond to the method embodiments, reference is made to the description of the method embodiments for the relevant points. The device embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purposes of the present application. Those of ordinary skill in the art can understand and implement the present invention without creative effort.
Correspondingly, the application also provides electronic equipment, which comprises: one or more processors; a memory for storing one or more programs; the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the high-precision npm software supply chain dependency analysis method as described above. FIG. 10 is a hardware structure diagram of a device with data processing capability on which the high-precision npm software supply chain dependency analysis method provided by the embodiment of the present invention runs. In addition to the processor, the memory and the network interface shown in fig. 10, the device may further include other hardware according to its actual function, which will not be described herein.
Accordingly, the present application also provides a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the high-precision npm software supply chain dependency analysis method as described above. The computer readable storage medium may be an internal storage unit, such as a hard disk or a memory, of the device with data processing capability described in any of the previous embodiments. The computer readable storage medium may also be an external storage device provided on the device, such as a plug-in hard disk, a Smart Media Card (SMC), an SD Card, a Flash memory Card (Flash Card), or the like. Further, the computer readable storage medium may include both an internal storage unit and an external storage device of the device with data processing capability. The computer readable storage medium is used for storing the computer program and other programs and data required by the device with data processing capability, and may also be used for temporarily storing data that has been output or is to be output.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains.
It is to be understood that the present application is not limited to the precise steps that have been described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof.

Claims (10)

1. A high-precision npm software supply chain dependency analysis method, comprising:
(1) Acquiring metadata of all npm open source software packages, and carrying out real-time tracking and incremental updating;
(2) Extracting version and dependency information of a software package from the metadata;
(3) According to the version and the dependency information of the software package, performing concurrent dependency analysis on the npm software package by utilizing an improved dependency analysis algorithm to obtain dependency analysis results and to-be-installed results of all the software packages;
(4) Storing the dependency analysis results of all npm software packages, and setting corresponding query interfaces.
2. The method of claim 1, wherein step (1) comprises:
(1.1) acquiring all metadata information of the npm open-source repository and storing the metadata information into a local metadata database;
(1.2) repeating the data tracking and metadata downloading at regular intervals, acquiring update data from the npm open-source repository, and updating the local metadata database to keep the local data in the latest state.
3. The method of claim 1, wherein step (2) comprises:
(2.1) extracting tag information, version information and metadata of each version of all npm software packages from the metadata;
(2.2) respectively storing the tag information, version information and metadata of each version into three key-value type in-memory databases.
4. The method of claim 1, wherein in step (3), the dependency analysis result includes dependency items, dependency versions, dependency levels, and dependency paths, and the to-be-installed result includes the relative paths of all packages that need to be installed according to the dependencies when the target npm software package is installed in an empty directory.
5. The method of claim 1, wherein in step (3), the improved dependency analysis algorithm comprises:
(I) converting the npm software package version to be analyzed into a node, and adding the node to the queue of nodes to be analyzed;
(II) while the queue of nodes to be analyzed is not empty, repeatedly executing the following operations:
(i) Dequeuing the head node;
(ii) For each invalid dependency edge of the dequeued node, loading the target node of the edge, recursively loading the peer dependency group of the target node, and adding the edge and the target node into a task queue, wherein the target node of the edge is the software package depended upon, an invalid dependency edge is an edge that does not point to any node, or whose target node's name or version does not satisfy the name and version range of the edge, and the peer dependency group refers to all direct or indirect dependencies of the node;
(iii) For each edge and corresponding node in the task queue, determining the position of the node and its peer dependency group in a node tree by a node tree position determining algorithm, and adding the node, together with any nodes for which invalid dependency edges are newly generated, into the queue of nodes to be analyzed, wherein the node tree is composed of parent-child relations among the nodes and represents the file tree to be installed.
6. The method of claim 5, wherein the node tree position determination algorithm performs the following steps for each node and corresponding edge in the peer dependency group of algorithm input nodes:
determining an initial target node: for an input edge and a corresponding input node, if the edge represents a peer dependency, the initial target node is the parent node of the source node of the input edge, otherwise, the initial target node is the source node, wherein the source node of an edge is the software package that has the dependency corresponding to the edge;
traversing upward, level by level, through the parent nodes of the initial target node, executing a placement type judgment algorithm, and judging the placement type of the input node under the node currently being traversed, wherein the placement type comprises OK, KEEP, REPLACE and CONFLICT;
taking the last traversed node whose placement type is not CONFLICT, and performing the node placement operation under that node according to the placement type.
7. The method of claim 6, wherein the placement type determination algorithm comprises:
(I) If the target node has a child node with the same name as the node to be placed:
(i) If the same-name node has the same version as the node to be placed, the placement type is judged to be KEEP;
(ii) If the node to be placed can replace the same-name node, the placement type is judged to be REPLACE;
(iii) Otherwise, the placement type is judged to be CONFLICT;
(II) If the target node does not have a child node with the same name as the node to be placed:
(i) If the node to be placed is being placed as a peer dependency and placing it here would block a dependency on the node path and thereby cause a conflict, the placement type is judged to be CONFLICT;
(ii) If the peer dependency group of the node to be placed can be placed recursively, the placement type is judged to be OK;
(iii) Otherwise, the placement type is judged to be CONFLICT.
8. A high-precision npm software supply chain dependency analysis apparatus, comprising:
the metadata acquisition module is used for acquiring metadata of all npm open source software packages and carrying out real-time data tracking and incremental updating;
the information extraction module is used for extracting the version and the dependency information of the software package from the metadata;
the dependency analysis module is used for carrying out concurrent dependency analysis on the npm software package by utilizing an improved dependency analysis algorithm according to the version and the dependency information of the software package to obtain a dependency analysis result and a to-be-installed result of all the software packages;
and the result storage module is used for storing the dependency analysis results of all npm software packages and setting corresponding query interfaces.
9. An electronic device, comprising:
one or more processors;
a memory for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-7.
10. A computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps of the method of any of claims 1-7.

Priority Applications (1)

Application Number: CN202311489487.XA (CN117520169A, en)
Priority Date: 2023-11-08
Filing Date: 2023-11-08
Title: High-precision npm software supply chain dependency analysis method

Publications (1)

Publication Number: CN117520169A (en)
Publication Date: 2024-02-06

Family

ID=89757977

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311489487.XA Pending CN117520169A (en) 2023-11-08 2023-11-08 High-precision npm software supply chain dependency analysis method

Country Status (1)

Country Link
CN (1) CN117520169A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination