CN116302919A

CN116302919A - A multi-language scalable code dependency parsing model and parsing method

Info

Publication number: CN116302919A
Application number: CN202211516728.0A
Authority: CN
Inventors: 晋武侠; 丁紫凡; 陈大为; 刘烃; 范铭
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2022-11-29
Filing date: 2022-11-29
Publication date: 2023-06-23

Abstract

The present invention discloses a multi-language scalable code dependency parsing model and parsing method. First, a parser generates an abstract syntax tree; by traversing the tree, information is obtained at the appropriate nodes, and entities are extracted and saved. Second, during this process, the dependencies that can be determined directly are retained, together with incomplete dependencies for which only half of the information can be determined. Finally, the incomplete dependencies are processed using the symbol table, and dependency backfilling is completed in the order "Export first, then Import, then others". Incomplete dependencies that are successfully backfilled are stored as complete dependencies, and those for which backfilling fails are kept as unresolved dependencies.

Description

A multi-language scalable code dependency parsing model and parsing method

Technical field

The invention relates to the field of trusted software and static code analysis, and in particular to static code dependency parsing that is language-independent, independent of the application scenario of the analyzed project, and extensible.

Background art

Static code analysis is an important part of software engineering. Software products require static code analysis throughout development and maintenance. The dependency parsing results it produces provide a solid foundation for higher-level software and architecture analysis, can be widely applied to vulnerability detection, architecture metrics, code smell detection and other fields, and help ensure the reliability and security of software products. Compared with dynamic analysis, static code analysis does not need to run the software product and performs better in terms of time and space complexity.

Academic research tends to focus on a few specific languages and leans towards the construction of call dependency graphs and analysis in specific scenarios (such as the Linux operating system). Entity granularity is relatively coarse, usually at the function level, and the dependency kinds are relatively few, concentrated on call and inheritance relationships. Finer-grained entities and more kinds of dependency can provide richer information and solid support for higher-level applications such as architecture metrics. In industry, tools are usually designed around an organization's own needs. The mature tools that are open to external users are mostly closed-source, paid projects with high cost and poor openness. Because of the diversity of language syntax specifications and the inability to determine the type of a receiver object at compile time, such tools have low accuracy in some scenarios and cannot be supplemented or iteratively developed. The remaining projects are mostly in early stages of development, with few supported features and limited capability. Some protocols for describing the intermediate results of static analysis also exist, but they are not widely adopted or used, so their feasibility is hard to verify.

Summary of the invention

The purpose of the present invention is to provide a multi-language, extensible code dependency parsing method that strikes a balance between sorting performance and time overhead, achieves advantages in both performance and time overhead on large-scale software, and uses static code analysis to overcome the drawbacks of dynamic code analysis, namely that the program must run without errors and that the analysis consumes computing power and time, thereby solving the above technical problems.

To achieve the above purpose, the present invention adopts the following technical solution:

S1: Traverse the file system

S2: Generate the abstract syntax tree

S3: Extract entities by traversing the abstract syntax tree

S4: Perform intermediate processing on the entity extraction results

S5: Backfill dependencies based on the symbol table

S6: Generate and output the results

A multi-language scalable code dependency parsing model, comprising:

A code dependency graph model composed of nodes and edges. A node represents an entity object and stores all information related to that entity, including but not limited to the entity's kind, unique identifying index, full name, attribute information and location information. An edge is the bridge linking two entities between which a dependency occurs. Edges are directed: the direction of an edge is the direction of the dependency, that is, a dependency relationship from the source entity to the destination entity. An edge also retains other dependency-related information, including but not limited to the kind of dependency and the location where it occurs.
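The node-and-edge model described above can be sketched as a small data structure. This is a minimal illustration, not the patented implementation: the class names (`Location`, `Entity`, `Dependency`, `DependencyGraph`), the field names and the example kinds such as "Function" and "Call" are all assumptions chosen for readability.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Location:
    file: str
    line: int


@dataclass
class Entity:
    """A node: kind, unique index, full name, attributes and location."""
    id: int
    kind: str              # hypothetical kinds, e.g. "File", "Class", "Function"
    qualified_name: str
    location: Location
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class Dependency:
    """A directed edge from a source entity to a destination entity."""
    src: int               # id of the source entity
    dest: int              # id of the destination entity
    kind: str              # hypothetical kinds, e.g. "Call", "Inherit", "Import"
    location: Location     # where the dependency occurs


class DependencyGraph:
    def __init__(self) -> None:
        self.entities: Dict[int, Entity] = {}
        self.dependencies: List[Dependency] = []

    def add_entity(self, kind: str, qualified_name: str,
                   location: Location, **attributes: str) -> Entity:
        entity = Entity(len(self.entities), kind, qualified_name,
                        location, dict(attributes))
        self.entities[entity.id] = entity
        return entity

    def add_dependency(self, src: Entity, dest: Entity,
                       kind: str, location: Location) -> Dependency:
        dep = Dependency(src.id, dest.id, kind, location)
        self.dependencies.append(dep)  # parallel edges are kept, not merged
        return dep

    def outgoing(self, entity: Entity) -> List[Dependency]:
        return [d for d in self.dependencies if d.src == entity.id]


graph = DependencyGraph()
main = graph.add_entity("Function", "app.main", Location("app.py", 10))
helper = graph.add_entity("Function", "util.helper", Location("util.py", 3))
graph.add_dependency(main, helper, "Call", Location("app.py", 11))
graph.add_dependency(main, helper, "Call", Location("app.py", 14))
```

Storing edges in a list rather than a set of node pairs keeps every occurrence of a dependency, each with its own location.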

A multi-language scalable code dependency parsing method, comprising: a file system traversal module, which traverses the input project path to obtain the information of the files to be processed and their processing order;

An abstract syntax tree generation module, which sets appropriate compilation options according to the file list and processing order and generates the abstract syntax tree;

An abstract-syntax-tree-based entity extraction module, which uses the visitor design pattern to extract entity information from the corresponding nodes of the generated abstract syntax tree;

An intermediate processing module, which identifies and resolves the processing information left unfinished in the entity list;

A symbol-table-based dependency backfilling module, which, based on the entity information in the entity repository and the incomplete dependencies obtained during processing, fills in the missing entity information and obtains the final entity dependency graph;

A result output module.

Each node in the dependency graph identifies a unique entity. Between two nodes, one dependency may occur multiple times, or several kinds of dependency may occur at once, so edges may overlap. Dependency information is also interrelated: circular and indirect dependencies can be derived from it. The location where a dependency occurs may or may not be in the file that defines the entity; the occurrence location and the definition locations of the source and destination entities may all be in different files.
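How circular and indirect dependencies might be derived from the stored edges can be illustrated with a plain reachability search. The text does not say which algorithm is used; the depth-first search below, the `Edge` tuple layout and the function names are assumptions made for this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str, str]  # (source entity, dependency kind, destination entity)


def build_adjacency(edges: Iterable[Edge]) -> Dict[str, Set[str]]:
    """Collapse overlapping edges into a source -> destinations map."""
    adjacency: Dict[str, Set[str]] = defaultdict(set)
    for src, _kind, dest in edges:
        adjacency[src].add(dest)
    return adjacency


def reachable_from(start: str, adjacency: Dict[str, Set[str]]) -> Set[str]:
    """Return every entity reachable from `start` through one or more edges."""
    seen: Set[str] = set()
    stack: List[str] = list(adjacency.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(adjacency.get(node, ()))
    return seen


def indirect_dependencies(start: str, adjacency: Dict[str, Set[str]]) -> Set[str]:
    """Entities reached only through intermediate entities."""
    return reachable_from(start, adjacency) - adjacency.get(start, set()) - {start}


def in_cycle(node: str, adjacency: Dict[str, Set[str]]) -> bool:
    """An entity is part of a circular dependency if it can reach itself."""
    return node in reachable_from(node, adjacency)


edges = [
    ("A", "Call", "B"),
    ("A", "Use", "B"),      # a second, overlapping edge between the same pair
    ("B", "Call", "C"),
    ("C", "Import", "A"),   # closes the cycle A -> B -> C -> A
    ("C", "Call", "D"),
]
adjacency = build_adjacency(edges)
```

In this example `A` depends indirectly on `C` and `D`, and `A`, `B` and `C` form a circular dependency while `D` does not.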

The method optimizes its internal storage structures and processing logic; for large-scale software products and those with complex dependency relationships, it can complete parsing with favorable time and space cost.

When other forms need to be parsed or other related kinds of dependency are required, the abstract syntax tree generated during the method can be processed by extension modules. The framework's functions are clearly isolated, so module extension is permitted and well supported.

The method places no restriction on the language of the input software product, and it is robust in tolerating syntax errors and semantic errors in that product.

The extracted entities are at symbol-level granularity, which is finer: the retained entity information takes the variable as the smallest entity level, and the retained dependency information takes the dependency occurring between entities as the smallest dependency level.

The information format is finer-grained: an entity is an object structure with attributes including but not limited to name, code location and type. Besides its type, an entity may also have a subtype, which reflects finer-grained entity characteristics;

The dependency kinds are broad, including dynamic dependencies and static dependencies;

Dependency backfilling is ordered: using the symbol table, backfilling is completed in the order "Export first, then Import, then others", and incomplete dependencies that are successfully backfilled are stored as complete dependencies.

Compared with the prior art, the present invention has the following beneficial effects:

In the multi-language scalable code dependency parsing method of the present invention, first, a parser generates an abstract syntax tree; by traversing the tree, information is obtained at the appropriate nodes, and entities are extracted and saved. Second, during this process, the dependencies that can be determined directly are retained, together with incomplete dependencies for which only half of the information can be determined. Finally, the incomplete dependencies obtained in the preceding analysis are processed: using the symbol table, dependency backfilling is completed in the order "Export first, then Import, then others". Incomplete dependencies that are successfully backfilled are stored as complete dependencies, and those for which backfilling fails are kept as unresolved dependencies.

Brief description of the drawings

FIG. 1 is the entity and dependency classification diagram of the multi-language scalable code dependency parsing method of an embodiment of the present invention;

FIG. 2 is the flowchart of the multi-language scalable code dependency parsing method of an embodiment of the present invention, comprising three sub-figures: sub-figure (a) is the overall flowchart, sub-figure (b) is the dependency backfilling flowchart, and sub-figure (c) is the flowchart of an extension module, implicit dependency extraction;

FIG. 3 is the multi-language scalable entity dependency model diagram of an embodiment of the present invention.

Detailed description

To help those skilled in the art better understand the solution of the present invention, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

It should be noted that the terms "first", "second" and the like in the description, the claims and the above drawings are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments described here can be implemented in orders other than those illustrated or described. In addition, the terms "comprise" and "have" and any variants of them are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device comprising a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.

The present invention is described in further detail below with reference to the accompanying drawings:

Referring to FIG. 1, the embodiment of the present invention divides all possible entities and dependencies, according to their nature and extraction method, into the three kinds of entity shown in (a)-(c) and the three kinds of dependency shown in (d)-(f). For entities: FIG. 1(a) shows some of the structural entities, which are generally the constituent files of a software project and certain folders that follow a specific pattern; FIG. 1(b) shows some of the code entities, which are generally elements in the project's code files that conform to a specific syntactic form and carry concrete software functional logic; FIG. 1(c) shows some of the scope entities, which are generally elements in the code files that conform to a specific syntactic form but serve only to switch logical control flow.
For dependencies: FIG. 1(d) shows some of the binding-free dependencies, which can generally be fully resolved at the same time as entities are extracted during traversal of the code's abstract syntax tree; FIG. 1(e) shows some of the binding-required dependencies, which generally cannot be fully resolved while traversing the abstract syntax tree and usually need a subsequent second analysis pass to find, from a code string, the entity the dependency points to; FIG. 1(f) shows some of the implicit dependencies, which are generally constructed in programming languages without a type annotation system by inferring a variable's possible type from its subsequent attribute usage. Compared with the preceding dependencies, implicit dependency resolution requires additional steps for solving and constraint refinement.

Referring to FIG. 2(a), the overall flowchart of the embodiment: the inputs of the method include the project source code, the project configuration and the list of third-party libraries; the outputs are the entity repository, the dependency repository and custom analysis results (if any). The method comprises the following steps:

Preparation: select a suitable parser. The method requires surveying existing parsers. The selected parser must be able to generate abstract syntax trees fault-tolerantly with high efficiency, high accuracy and low energy consumption, and it must be a tool that is accepted by consensus within the industry, widely validated, and maintained promptly as the language version changes. For example, the present invention uses Eclipse JDT and CDT for Java and C++ respectively, the language's official parser for Python, and Babel for JavaScript and TypeScript.

S1: Traverse the software project directories to obtain all structural entities. Specifically, this comprises the following steps:

S101: Given multiple directory paths of a software project, first traverse all files in the directories with a tree traversal algorithm, and create a File entity for each programming language source file, each configuration file of a specific programming language or framework, and each other file that affects dependency analysis;

S102: For directory structures that carry a specific functional meaning in certain programming languages, additionally create the corresponding structural entity; for example, a directory containing an __init__.py file in Python is specialized as a Python Module, and a directory containing a package.json file in JavaScript is specialized as a Node.js Package. If a structural entity has a corresponding configuration file, the information in that configuration file that affects dependency analysis is saved in the entity as attributes.
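Steps S101 and S102 can be sketched with the standard library's `os.walk`. The suffix and configuration-file sets, the dictionary layout of an entity and the kind labels "PythonModule" and "NodePackage" are assumptions for illustration; a real implementation would also read attributes out of the configuration files.

```python
import os
import tempfile
from typing import Dict, List

SOURCE_SUFFIXES = {".py", ".js", ".ts", ".java", ".cpp", ".h"}  # assumed set
CONFIG_NAMES = {"package.json", "tsconfig.json"}                # assumed set


def collect_structural_entities(roots: List[str]) -> List[Dict[str, str]]:
    """Walk each project root and create structural entities (S101, S102)."""
    entities: List[Dict[str, str]] = []
    for root in roots:
        for dirpath, dirnames, filenames in os.walk(root):
            dirnames.sort()  # make the traversal order deterministic
            # S102: directories with a special meaning get their own entity
            if "__init__.py" in filenames:
                entities.append({"kind": "PythonModule", "path": dirpath})
            if "package.json" in filenames:
                entities.append({"kind": "NodePackage", "path": dirpath})
            # S101: one File entity per relevant source or configuration file
            for name in sorted(filenames):
                suffix = os.path.splitext(name)[1]
                if suffix in SOURCE_SUFFIXES or name in CONFIG_NAMES:
                    entities.append(
                        {"kind": "File", "path": os.path.join(dirpath, name)}
                    )
    return entities


# Small demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    pkg = os.path.join(tmp, "pkg")
    os.makedirs(pkg)
    for name in ("__init__.py", "core.py", "notes.txt"):
        open(os.path.join(pkg, name), "w").close()
    found = collect_structural_entities([tmp])
    kinds = sorted(entity["kind"] for entity in found)
```

Here `notes.txt` is skipped because it does not affect dependency analysis under the assumed filter, while the `pkg` directory yields both a directory-level entity and its File entities.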

S2: Traverse the nodes of the code abstract syntax tree to obtain all code entities and scope entities. Specifically, this comprises the following steps:

S201: Language-specific preprocessing. If a specific programming language involves rewriting of code files, the embodiment performs the same rewriting here according to the relevant definitions. For example, C/C++ requires all header files (.h files) to be processed first and then the other code files to be processed (macro expansion), and specific program environment variables also need to be accessed and resolved.

S202: Use the parser to parse the contents of the code files corresponding to the File entities obtained in S101 into code abstract syntax trees. Depending on the parser chosen, this step can either parse files one by one or directly use an API, provided by a specific parser, that builds the complete abstract syntax tree of a whole project in one pass;

S203: Use the visitor-pattern node traversal interface provided by the parser to traverse all nodes of the code abstract syntax tree in order, and register the corresponding extraction method for the abstract syntax tree nodes that correspond to each code entity and scope entity. When called, an extraction method creates an entity of its kind, extracts the required information from the corresponding abstract syntax tree node and saves it in the entity. During this process, all entities are stored hierarchically according to the containment relationships of code scopes;

S204: Initialize a cross-node context information stack and maintain it throughout the node traversal of S203. Some programming language features make the attribute computation of the current node depend on related attributes of its sibling nodes at the same level, or of parent and child nodes several levels away. The node traversal interface provided by the parser is usually stateless, so an additional stack structure is needed that can hold and pass specific information as the traversal moves between nodes;

S205: Immediate resolution of some binding-free dependencies. The syntax of specific programming languages makes the entity to be bound actually appear after the syntactic identifier of a particular dependency. When the abstract syntax tree nodes are traversed in order, the dependency can be created and bound at the same time as the entity to be bound is created under S203, making it a complete dependency;

S206: Recording of binding-required dependencies (incomplete dependencies). In contrast to S205, most syntax patterns make the entity to be bound appear before the syntactic identifier of a particular dependency. When the traversal reaches that identifier, only a string reference to the entity is available. The string must be resolved in a later step to the entity saved in S203; in this step, the incomplete dependency and any other information that helps binding can only be stored temporarily in a separate structure.
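Steps S203 to S206 can be illustrated for Python source with the standard `ast.NodeVisitor`. This is only a sketch under assumptions: the class name `EntityExtractor`, the tuple layouts, and the choice of "Define" as the immediately resolved dependency and "Call" and "Inherit" as the recorded incomplete ones are illustrative, and only plain-name calls and base classes are handled.

```python
import ast
from typing import List, Tuple


class EntityExtractor(ast.NodeVisitor):
    def __init__(self, module_name: str) -> None:
        self.scope: List[str] = [module_name]               # context stack (S204)
        self.entities: List[Tuple[str, str]] = []           # (kind, qualified name)
        self.complete: List[Tuple[str, str, str]] = []      # (src, kind, dest)
        self.incomplete: List[Tuple[str, str, str]] = []    # (src, kind, textual name)

    def _enter(self, kind: str, node) -> None:
        parent = ".".join(self.scope)
        qualified = f"{parent}.{node.name}"
        self.entities.append((kind, qualified))             # extraction (S203)
        # The enclosing scope is already known, so this edge is bound at once (S205).
        self.complete.append((parent, "Define", qualified))
        self.scope.append(node.name)                        # push while inside the body
        self.generic_visit(node)
        self.scope.pop()

    def visit_ClassDef(self, node: ast.ClassDef) -> None:
        owner = ".".join(self.scope + [node.name])
        for base in node.bases:
            if isinstance(base, ast.Name):
                # Only the base's name is known here; bind it later (S206).
                self.incomplete.append((owner, "Inherit", base.id))
        self._enter("Class", node)

    def visit_FunctionDef(self, node: ast.FunctionDef) -> None:
        self._enter("Function", node)

    def visit_Call(self, node: ast.Call) -> None:
        if isinstance(node.func, ast.Name):
            # The callee is just a string at this point (S206).
            self.incomplete.append((".".join(self.scope), "Call", node.func.id))
        self.generic_visit(node)


SOURCE = '''
class Base:
    pass

class Child(Base):
    def run(self):
        helper()

def helper():
    pass
'''

extractor = EntityExtractor("demo")
extractor.visit(ast.parse(SOURCE))
```

The call to `helper()` is seen before `helper` is defined, so it can only be recorded by name; the scope stack supplies the qualified name `demo.Child.run` of the caller without any state in the visitor interface itself.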

S3: Perform intermediate processing on the entity extraction results. Complete the entity attribute information that was saved during S2 but could not be processed there because not all entities had yet been extracted. For example: (1) bind type information to objects defined by data aggregation entities; (2) identify and record function overloading information.

S4: Dependency binding based on the hierarchical symbol table. Traverse the incomplete dependencies recorded in S2 in the order "Export -> Import -> others", and complete the dependency relationships using methods such as module path resolution rules and scope-based hierarchical search of the symbol table. Specifically, this comprises the following steps:

S401: Handle the binding information that the abstract syntax tree can resolve: backfill dependencies within the same file according to the binding information provided by the abstract syntax tree, and save them in the dependency repository;

S402: Handle dependencies across files and dependencies within a file that remain unbound; this step looks them up in the symbol table and backfills them. Two cases are distinguished. First, where the language requires it, for each Import dependency check whether a corresponding Export dependency exists. Second, classify by lifetime. Where the target entity to be bound has a full lifetime, such as a function or class, first search upward level by level through the scopes; if that fails, search the files that have a reference relationship (Import, Include, etc.); finally search the programming language's built-in entities. For entities with a limited lifetime, such as local variables, the memory is released when the local scope ends and cannot be accessed by external entities, so only the short name is used, and the search covers only the current scope or enclosing scopes before binding. Dependencies bound successfully in this way are saved in the entity repository as complete dependencies.
Dependencies that fail to bind are classified as unknown dependencies; causes of unknown dependencies include but are not limited to calls to standard APIs and inconsistent versions of third-party libraries. These are likewise added to the dependency repository: the unbound entity can be marked as unresolved, and the dependency is likewise marked as unresolved.
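The lookup order in S402 can be sketched as a function over a flat set of qualified names. The representation is an assumption made for brevity: a real symbol table would be hierarchical, and `BUILTINS`, `resolve`, `backfill`, the `imports` map and the `local_only` flag are hypothetical names.

```python
from typing import Dict, List, Optional, Set, Tuple

BUILTINS: Set[str] = {"print", "len"}  # stand-in for a language's built-in list

Incomplete = Tuple[str, str, str]      # (source scope, dependency kind, textual name)


def resolve(name: str, scope: str, symbols: Set[str],
            imports: Dict[str, List[str]], local_only: bool = False) -> Optional[str]:
    parts = scope.split(".")
    # 1. Search the enclosing scopes from innermost to outermost.
    for depth in range(len(parts), 0, -1):
        candidate = ".".join(parts[:depth] + [name])
        if candidate in symbols:
            return candidate
    if local_only:
        return None  # limited-lifetime entities are never sought elsewhere
    # 2. Search the modules that the current module imports or includes.
    for module in imports.get(parts[0], []):
        candidate = f"{module}.{name}"
        if candidate in symbols:
            return candidate
    # 3. Fall back to the language's built-in entities.
    if name in BUILTINS:
        return f"<builtin>.{name}"
    return None


def backfill(incomplete: List[Incomplete], symbols: Set[str],
             imports: Dict[str, List[str]]):
    resolved, unresolved = [], []
    for src, kind, name in incomplete:
        dest = resolve(name, src, symbols, imports)
        if dest is None:
            unresolved.append((src, kind, name))  # kept, marked as unresolved
        else:
            resolved.append((src, kind, dest))
    return resolved, unresolved


symbols = {"app.main", "app.main.count", "app.helper", "util.parse"}
imports = {"app": ["util"]}
pending = [
    ("app.main", "Call", "helper"),
    ("app.main", "Call", "parse"),
    ("app.main", "Call", "print"),
    ("app.main", "Call", "missing"),
]
resolved, unresolved = backfill(pending, symbols, imports)
```

In this example `helper` is found one scope up, `parse` through the imported module, `print` in the built-in list, and `missing` stays unresolved rather than being dropped.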

Referring to FIG. 2(b), the flowchart of dependency binding for incomplete dependencies in this embodiment. Overall, the steps that dependency binding may involve are local scope search, module path resolution, Exported scope search, Imported scope search, third-party library list search and language built-in list search; different kinds of dependency may use one or several of these steps. All incomplete dependencies recorded in S206 are first sorted by kind into a specific processing sequence, which on the whole follows the order "Export first, then Import, then others". The steps needed to handle each kind of dependency are marked in the figure with a "√" (single step) or with numbers (steps executed in numerical order).
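The processing sequence can be expressed as a stable sort on the dependency kind. This is a minimal sketch; the tuple layout and the names `PRIORITY` and `order_for_backfill` are assumptions.

```python
from typing import List, Tuple

Incomplete = Tuple[str, str, str]  # (source entity, dependency kind, textual name)

# Export is handled first, Import second, every other kind afterwards.
PRIORITY = {"Export": 0, "Import": 1}


def order_for_backfill(incomplete: List[Incomplete]) -> List[Incomplete]:
    # sorted() is stable, so dependencies of equal priority keep their
    # original recording order.
    return sorted(incomplete, key=lambda dep: PRIORITY.get(dep[1], 2))


recorded = [
    ("a.f", "Call", "g"),
    ("a", "Import", "b.g"),
    ("b", "Export", "g"),
    ("a.f", "Use", "x"),
]
ordered = order_for_backfill(recorded)
```

Handling exports before imports means that by the time an Import is bound, the exported names it may refer to are already available; the other kinds can then rely on both.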

The following describes and explains the extensibility of the method:

Referring to FIG. 2(c), an example of the extensible part of this embodiment: an implicit dependency extraction module. Implicit dependencies are dependency relationships that are hard to index through conventional static code analysis and can normally be obtained only through dynamic analysis. In this method, the generated AST can be used to abstract the semantics; for example, x = y can be represented as Move(x, y), and x.f = y as Store(x, f, y). From the semantic abstraction, the explicit dependencies described earlier and a constructed heap model, code dependencies resolved through points-to relationships can be obtained and then reduced to implicit dependency relationships constrained by member calls and the like.
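The semantic abstraction step can be illustrated on Python source with the standard `ast` module. The sketch covers only the two forms named in the text, with a plain variable on the right-hand side; the function name and the tuple encoding of `Move` and `Store` are assumptions, and the heap model and points-to solving are not shown.

```python
import ast
from typing import List, Tuple


def abstract_assignments(source: str) -> List[Tuple[str, ...]]:
    """Turn `x = y` into ("Move", x, y) and `x.f = y` into ("Store", x, f, y)."""
    facts: List[Tuple[str, ...]] = []
    for node in ast.walk(ast.parse(source)):
        # Only assignments whose right-hand side is a plain variable are handled.
        if not isinstance(node, ast.Assign) or not isinstance(node.value, ast.Name):
            continue
        for target in node.targets:
            if isinstance(target, ast.Name):
                facts.append(("Move", target.id, node.value.id))
            elif isinstance(target, ast.Attribute) and isinstance(target.value, ast.Name):
                facts.append(("Store", target.value.id, target.attr, node.value.id))
    return facts


facts = abstract_assignments("x = y\nx.f = y\n")
```

Facts of this shape are the usual input to a points-to analysis, which is one way the pointing relationships mentioned above could be computed.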

Referring to FIG. 3, the entity dependency model diagram of this embodiment. An entity is an object structure with attributes including but not limited to name (Name), code location (Location) and type (Type); besides its type, an entity may also have a subtype (Sub-type), which reflects finer-grained entity characteristics. A dependency D = <E_src, Type, E_dest> is a triple comprising a source entity, a dependency type and a destination entity, and it too has attributes including but not limited to code location (Location). The model refines the traditional simple enumeration model {E = {E_a, E_b, …}, D = {D_a, D_b, …}}: using the source entity type and destination entity type of the triple, it places same-named dependencies that may arise between several kinds of entity pair under different two-dimensional coordinates (E_src, E_dest) according to the entity types. The model not only presents the complete classification of entity and dependency types, but also clearly shows which dependencies can arise between which entities, and so serves as a reminder against omissions when guiding implementation.
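The two-dimensional arrangement can be sketched as a table keyed by the pair (source entity type, destination entity type). The entries below are illustrative assumptions, since the contents of FIG. 3 are not reproduced in the text; `TypedDependency`, `ALLOWED` and `is_allowed` are hypothetical names.

```python
from typing import Dict, FrozenSet, NamedTuple, Tuple


class TypedDependency(NamedTuple):
    """The triple <E_src, Type, E_dest>, reduced to the entity types involved."""
    src_type: str
    dep_type: str
    dest_type: str


# Each coordinate (E_src type, E_dest type) lists the dependency types that
# may occur there. "Define" appears under two coordinates, which is how a
# same-named dependency is separated by the entity types it connects.
ALLOWED: Dict[Tuple[str, str], FrozenSet[str]] = {
    ("File", "File"): frozenset({"Import"}),
    ("File", "Class"): frozenset({"Define"}),
    ("Class", "Class"): frozenset({"Inherit"}),
    ("Class", "Function"): frozenset({"Define"}),
    ("Function", "Function"): frozenset({"Call"}),
    ("Function", "Variable"): frozenset({"Use", "Set"}),
}


def is_allowed(dep: TypedDependency) -> bool:
    return dep.dep_type in ALLOWED.get((dep.src_type, dep.dest_type), frozenset())


def coordinates_of(dep_type: str):
    """All (source type, destination type) cells in which a dependency type occurs."""
    return sorted(cell for cell, kinds in ALLOWED.items() if dep_type in kinds)
```

A table like this can double as a checklist: every cell that should hold a dependency type needs a matching extraction rule in the implementation.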

本发明实施例中对模块的划分是示意性的，仅仅为一种逻辑功能划分，实际实现时可以有另外的划分方式，另外，在本发明各个实施例中的各功能模块可以集成在一个处理器中，也可以是单独物理存在，也可以两个或两个以上模块集成在一个模块中。上述集成的模块既可以采用硬件的形式实现，也可以采用软件功能模块的形式实现。The division of modules in the embodiments of the present invention is schematic, and is only a logical function division. In actual implementation, there may be other division methods. In addition, each functional module in each embodiment of the present invention can be integrated into a processing In the controller, it can also be physically present separately, or two or more modules can be integrated into one module. The above-mentioned integrated modules can be implemented in the form of hardware or in the form of software function modules.

本领域内的技术人员应明白，本发明的实施例可提供为方法、系统、或计算机程序产品。因此，本发明可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且，本发明可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。本发明是参照根据本发明实施例的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器，使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。Those skilled in the art should understand that the embodiments of the present invention may be provided as methods, systems, or computer program products. Accordingly, the present invention can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein. The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It should be understood that each procedure and/or block in the flowchart and/or block diagram, and a combination of procedures and/or blocks in the flowchart and/or block diagram can be realized by computer program instructions. These computer program instructions may be provided to a general purpose computer, special purpose computer, embedded processor, or processor of other programmable data processing equipment to produce a machine such that the instructions executed by the processor of the computer or other programmable data processing equipment produce a An apparatus for realizing the functions specified in one or more procedures of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising an instruction apparatus, and the instruction apparatus implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that the specific embodiments of the present invention may still be modified or replaced with equivalents, and any modification or equivalent replacement that does not depart from the spirit and scope of the present invention shall fall within the protection scope of the claims of the present invention.

Claims

1. A multi-language scalable code dependency parsing model, characterized in that it includes:

A code dependency graph model composed of nodes and edges. Each node represents an entity object and stores all information related to that entity, including but not limited to the entity type, unique identification index, full name, attribute information, and location information. Each edge is a bridge linking two entities between which a dependency exists. Edges are directed, and the direction of an edge is the direction in which the dependency occurs, that is, a dependency relationship of some type exists from the source entity to the target entity. In addition, an edge retains other dependency-related information, including but not limited to the dependency type and the location where the dependency occurs.
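For illustration only, the graph model of claim 1 could be represented along the following lines. This is a minimal sketch, not the claimed implementation: the class names `Entity`, `Dependency`, and `DependencyGraph`, their field names, and the `(file, line)` form of a location are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Entity:
    """A node: one entity object (all field names are illustrative)."""
    index: int                  # unique identification index
    kind: str                   # entity type, e.g. "function" or "variable"
    qualified_name: str         # full name
    location: Tuple[str, int]   # assumed form: (file, line) of the definition
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class Dependency:
    """A directed edge from a source entity to a target entity."""
    source: int                 # index of the source entity
    target: int                 # index of the target entity
    kind: str                   # dependency type, e.g. "call" or "import"
    location: Tuple[str, int]   # where the dependency occurs


@dataclass
class DependencyGraph:
    entities: Dict[int, Entity] = field(default_factory=dict)
    edges: List[Dependency] = field(default_factory=list)

    def add_entity(self, entity: Entity) -> None:
        # The index must identify exactly one entity.
        if entity.index in self.entities:
            raise ValueError(f"duplicate entity index {entity.index}")
        self.entities[entity.index] = entity

    def add_dependency(self, dep: Dependency) -> None:
        # Both endpoints must already be known entities.
        if dep.source not in self.entities or dep.target not in self.entities:
            raise KeyError("dependency refers to an unknown entity")
        self.edges.append(dep)

    def outgoing(self, index: int) -> List[Dependency]:
        """All dependencies whose source is the given entity."""
        return [e for e in self.edges if e.source == index]
```

Keeping edges in a list rather than a set of node pairs means the same pair of entities can be linked by several edges, each with its own type and location.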

2. A multi-language scalable code dependency parsing method using the model as claimed in claim 1, characterized in that it comprises: a file system traversal module, which traverses the input project path to obtain the information on the files to be processed and their processing order;

an abstract syntax tree generation module, which sets appropriate compilation options according to the file list and processing order and generates an abstract syntax tree;

an abstract-syntax-tree-based entity extraction module, which uses the visitor design pattern to extract entity information from the corresponding nodes of the generated abstract syntax tree;

an intermediate processing module, which identifies and resolves incompletely processed information in the entity list;

a symbol-table-based dependency backfill module, which, based on the entity information in the entity repository and the incomplete dependencies obtained during processing, fills in the missing entity information and obtains the final entity dependency graph; and

a result output module.
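As a hedged illustration of the tree generation and visitor-based extraction steps above, the sketch below uses Python's standard `ast` module as a stand-in parser for a single language. The names `EntityExtractor` and `extract`, the in-memory `sources` mapping (used here in place of a real project path), and the tuple layouts are assumptions of this sketch. Calls are recorded as incomplete dependencies: the source entity is known, but only the name of the target is.

```python
import ast
from typing import Dict, List, Tuple


class EntityExtractor(ast.NodeVisitor):
    """Collects entities and half-resolved dependencies from one file's AST."""

    def __init__(self) -> None:
        self.scope: List[str] = []                      # enclosing definitions
        self.entities: List[Tuple[str, str, int]] = []  # (kind, name, line)
        self.pending: List[Tuple[str, str, int]] = []   # (source, target name, line)

    def _define(self, kind: str, node) -> None:
        self.scope.append(node.name)
        self.entities.append((kind, ".".join(self.scope), node.lineno))
        self.generic_visit(node)  # descend into the body
        self.scope.pop()

    def visit_ClassDef(self, node: ast.ClassDef) -> None:
        self._define("class", node)

    def visit_FunctionDef(self, node: ast.FunctionDef) -> None:
        self._define("function", node)

    def visit_Call(self, node: ast.Call) -> None:
        # Only the callee's name is known here; the target entity is
        # left for a later backfilling step.
        if isinstance(node.func, ast.Name) and self.scope:
            self.pending.append((".".join(self.scope), node.func.id, node.lineno))
        self.generic_visit(node)


def extract(sources: Dict[str, str]):
    """sources maps a file name to its text (assumed stand-in for a project)."""
    entities, pending = [], []
    for filename in sorted(sources):  # deterministic processing order
        tree = ast.parse(sources[filename], filename=filename)
        visitor = EntityExtractor()
        visitor.visit(tree)
        entities.extend((filename, *e) for e in visitor.entities)
        pending.extend((filename, *p) for p in visitor.pending)
    return entities, pending
```

A multi-language tool would swap the `ast` parser for a per-language front end while keeping the visitor and the entity and dependency records unchanged.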

3. The multi-language scalable code dependency parsing method according to claim 2, characterized in that each node in the dependency graph identifies a unique entity; a dependency between two nodes may occur multiple times, or multiple dependencies may occur between them at the same time, that is, edges may overlap; in addition, dependency information is interrelated, and circular dependencies and indirect dependencies are derived from it; a dependency may or may not occur in the file that defines an entity, that is, the location where the dependency occurs and the definition locations of the source entity and the target entity may be in different files.
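One way to derive indirect and circular dependencies from stored direct dependencies is plain graph reachability, sketched below. The edge layout `(source, target, kind)` and the function names are assumptions, and "indirect" is taken here to mean reachable only through two or more edges; the claim itself does not fix a definition.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str, str]  # assumed layout: (source, target, dependency kind)


def successors(edges: List[Edge]) -> Dict[str, Set[str]]:
    """Direct targets of each source; parallel edges collapse to one target."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for source, target, _kind in edges:
        graph[source].add(target)
    return graph


def reachable(edges: List[Edge], start: str) -> Set[str]:
    """Entities reachable from start through one or more edges."""
    graph = successors(edges)
    seen: Set[str] = set()
    stack = list(graph.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return seen


def indirect_dependencies(edges: List[Edge], start: str) -> Set[str]:
    """Reachable entities that are not direct targets of start."""
    direct = successors(edges).get(start, set())
    return reachable(edges, start) - direct - {start}


def is_circular(edges: List[Edge], entity: str) -> bool:
    """True if the entity can reach itself through its dependencies."""
    return entity in reachable(edges, entity)


def edges_between(edges: List[Edge], source: str, target: str) -> List[Edge]:
    """All overlapping edges from source to target."""
    return [e for e in edges if e[0] == source and e[1] == target]
```

Because the edge list keeps every occurrence, `edges_between` can return several edges for one pair of entities, while `successors` deliberately ignores multiplicity for the reachability queries.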

4. The multi-language scalable code dependency parsing method according to claim 2, characterized in that the method optimizes the internal storage structure and processing logic, so that for large-scale software products with complex dependencies the method completes the analysis within acceptable time and space.

5. The multi-language scalable code dependency parsing method according to claim 2, characterized in that, when other forms need to be parsed or other related types of dependencies are needed, the abstract syntax tree generated during the method can be processed further as an extension; the functions of the framework are clearly isolated, so that module extensions are permitted and well supported.

6. The multi-language scalable code dependency parsing method according to claim 2, characterized in that there is no restriction on the language of the input software product, and the method is robust to syntax errors and semantic errors in the software product.

7. The multi-language scalable code dependency parsing method according to claim 2, characterized in that the extracted entities are fine-grained, symbol-level entities; the retained entity information takes variables as the lowest entity level, and the retained dependency information takes dependencies occurring between entities as the lowest dependency level.

8. The multi-language scalable code dependency parsing method according to claim 2, characterized in that, in the finer-grained information format, an entity is an object structure with attributes including but not limited to name, code location, and type; in addition to its type, an entity may also have a subdivision type, which reflects finer-grained entity characteristics.
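As a small, hedged illustration of an entity that carries both a type and a subdivision type, consider the sketch below. The `Kind` values, the `SymbolEntity` class, and the sub-kind strings such as "parameter" are assumptions chosen for the example and are not taken from the claims.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Kind(Enum):
    """Coarse entity types (an illustrative, not exhaustive, set)."""
    FILE = "file"
    CLASS = "class"
    FUNCTION = "function"
    VARIABLE = "variable"   # the lowest entity level


@dataclass(frozen=True)
class SymbolEntity:
    name: str
    kind: Kind
    file: str                       # code location: file
    line: int                       # code location: line
    sub_kind: Optional[str] = None  # subdivision type, e.g. "parameter"

    def label(self) -> str:
        """Type, refined by the subdivision type when one is present."""
        if self.sub_kind:
            return f"{self.kind.value}/{self.sub_kind}"
        return self.kind.value
```

For example, `SymbolEntity("count", Kind.VARIABLE, "a.py", 3, "parameter").label()` yields `"variable/parameter"`, while an entity without a subdivision type is labelled by its type alone.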

9. The multi-language scalable code dependency parsing method according to claim 2, characterized in that the dependency types covered include dynamic dependencies and static dependencies.

10. The multi-language scalable code dependency parsing method according to claim 2, characterized in that dependency backfilling draws on a comprehensive symbol table and is completed in the order "Export first, then Import, and finally others"; incomplete dependencies that are successfully backfilled are supplemented and stored as complete dependencies.
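The backfilling order can be sketched as follows. This is one possible reading, under assumptions introduced here: a per-file symbol table, a `PendingDependency` record whose `from_file` field names the file an import refers to, and the rule that an import can resolve only against names the other file has already exported. None of these details is specified by the claims.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class PendingDependency:
    """An incomplete dependency: source known, target known only by name."""
    source: str                      # qualified name of the source entity
    target_name: str                 # unresolved target name
    kind: str                        # "export", "import", or any other type
    file: str                        # file in which the dependency occurs
    from_file: Optional[str] = None  # for imports: the file imported from


PRIORITY = {"export": 0, "import": 1}  # every other kind sorts last


def backfill(
    pending: List[PendingDependency],
    definitions: Dict[str, Dict[str, str]],
) -> Tuple[List[Tuple[str, str, str]], List[PendingDependency]]:
    """definitions maps file -> {local name -> qualified entity name}."""
    # Per-file symbol table, seeded with each file's own definitions.
    symbols = {f: dict(names) for f, names in definitions.items()}
    exports: Dict[str, Dict[str, str]] = {}
    resolved, unresolved = [], []
    # sorted() is stable, so items of equal priority keep their input order.
    for dep in sorted(pending, key=lambda d: PRIORITY.get(d.kind, 2)):
        if dep.kind == "import":
            target = exports.get(dep.from_file, {}).get(dep.target_name)
        else:
            target = symbols.get(dep.file, {}).get(dep.target_name)
        if target is None:
            unresolved.append(dep)   # stays stored as an unresolved dependency
            continue
        if dep.kind == "export":
            exports.setdefault(dep.file, {})[dep.target_name] = target
        elif dep.kind == "import":
            symbols.setdefault(dep.file, {})[dep.target_name] = target
        resolved.append((dep.source, target, dep.kind))
    return resolved, unresolved
```

Processing exports first makes every file's exported names available before any import is examined, and processing imports next makes imported names visible before calls and other uses in the importing file are resolved.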