CN111949800A

CN111949800A - Method and system for establishing knowledge graph of open source project

Info

Publication number: CN111949800A
Application number: CN202010643011.7A
Authority: CN
Inventors: 孙艳春; 黄罡; 孙志玉
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2020-07-06
Filing date: 2020-07-06
Publication date: 2020-11-17

Abstract

An embodiment of the invention provides a method and a system for establishing an open source project knowledge graph. A data schema for the knowledge graph is defined in advance. Knowledge information about the program code itself is acquired from the open source project code, and knowledge information related to the project is acquired from the open source community where the project is located and from the project's remote repository. The knowledge information from all sources is analyzed and a plurality of triples is extracted. All triples are unified and disambiguated, the knowledge graph of the open source project is constructed based on the data schema, and the knowledge graph is finally analyzed and displayed visually. The resulting knowledge graph lets a developer quickly and accurately find the project code to be learned and understand it through the related code knowledge, thereby meeting the need of newly joined developers to learn an open source project quickly.

Description

Method and system for establishing knowledge graph of open source project

Technical Field

The invention relates to the technical field of open source projects, in particular to an open source project knowledge graph establishing method and an open source project knowledge graph establishing system.

Background

An open source project is a software project whose source code is open; developers can modify the source code through an open source community to build their own customized products.

A large open source project is usually developed by many developers and attracts many more to study its source code. After continued learning and practice, these developers may begin contributing to the main branch of the project.

Most open source projects lack project architecture documents and lack management and retrieval functions for project code knowledge. The main functions of current open source communities focus on version management of projects and serve only a project's existing developers and users who do not contribute to development. Therefore, when a newly joined developer first encounters an open source project, the developer can only read the source code step by step to understand it; finding the code related to a given requirement directly is difficult, and learning efficiency is very low.

Currently, most research on code analysis in open source communities focuses on the code itself, mainly using information such as syntax trees and static analysis results. Research on intelligent assistance for developers focuses on helping them program, for example by recommending code snippets, inferring developer intent, and building intelligent programming tools. In general, existing work does not start from developers' learning needs to provide targeted code-related knowledge that helps them quickly understand and join an open source project.

Therefore, a newly joined developer can neither quickly find the needed project code nor quickly acquire the related knowledge required to understand it, which ultimately results in low learning efficiency.

Disclosure of Invention

In view of the above, embodiments of the present invention provide an open-source project knowledge-graph establishing method and a corresponding open-source project knowledge-graph establishing system, which overcome or at least partially solve the above problems.

To solve the above problem, an embodiment of the present invention provides a method for establishing an open source project knowledge graph, the method comprising: predefining a data schema of the open source project knowledge graph; acquiring knowledge information of the program code itself from the open source project code by a static code analysis method, wherein the knowledge information of the program code comprises functions and files; acquiring knowledge information related to the open source project from the open source community where the open source project is located and from a remote repository of the open source project, wherein the knowledge information related to the open source project comprises project submission records, code merge requests, and problem sets; analyzing the knowledge information of the program code and the knowledge information related to the open source project, and extracting a plurality of triples; unifying the data format of each knowledge entity in all the triples according to the different structural characteristics of each data source, and disambiguating each triple, so that in the triple set each valid knowledge entity has one and only one corresponding entity name; constructing the knowledge graph of the open source project from the triple set based on the data schema; and visually analyzing and displaying the open source project knowledge graph using the visualization tool Gephi.

Optionally, predefining the data schema of the open source project knowledge graph comprises: abstracting, from multiple perspectives of the open source project, the knowledge information that makes up the knowledge graph, including basic element relationships and entities.

Optionally, the entities comprise: functions, files, project submission records, problem sets, and code merge requests; and the basic element relationships comprise: call relationships, containment relationships, modification relationships, and involvement relationships.

Optionally, acquiring the knowledge information of the program code itself from the open source project code comprises: acquiring the knowledge information by a static code analysis method; and analyzing the open source project code with a static analysis tool for each file and each project module to generate local relationship subgraphs, wherein the relationship subgraphs are described and output in the DOT language.

Optionally, the static analysis tool comprises: Doxygen, Cppcheck, and FindBugs.

Optionally, the plurality of triples comprises: a call-relationship triple from a sub function to an obj function; a containment-relationship triple from a sub file to an obj function; a modification-relationship triple from a sub project submission record to an obj file; an involvement-relationship triple from a sub problem set to an obj project submission record; an involvement-relationship triple from a sub problem set to an obj problem set; an involvement-relationship triple from a sub problem set to an obj code merge request; an involvement-relationship triple from a sub code merge request to an obj project submission record; and an involvement-relationship triple from a sub code merge request to an obj file.

Optionally, constructing the knowledge graph of the open source project from the plurality of triples based on the data schema comprises: based on the data schema, performing cleaning, triple extraction, and disambiguation for each data source; extracting separately the triple set that represents each type of relationship to form a triple-set subgraph for that relationship; running the construction of the triple-set subgraphs concurrently; and finally aggregating all triple-set subgraphs to construct the knowledge graph of the open source project.

An embodiment of the invention also provides a system for establishing an open source project knowledge graph, the system comprising:

the definition module is used for predefining a data schema of the open source project knowledge graph;

the knowledge acquisition module is used for acquiring knowledge information of the program code itself from the open source project code by a static code analysis method, wherein the knowledge information of the program code comprises: functions, files, and the call and containment relationships among them; and for acquiring knowledge information related to the open source project from the open source community where the open source project is located and from a remote repository of the open source project, wherein the knowledge information related to the open source project comprises: project submission records, code merge requests, problem sets, and the involvement relationships among them;

the knowledge analysis module is used for analyzing the knowledge information of the program code and the knowledge information related to the open source project and extracting a plurality of triples; unifying the data format of each knowledge entity in the triples according to the different structural characteristics of each data source; and disambiguating each triple, so that in the triple set each valid knowledge entity has one and only one corresponding entity name;

a construction module for constructing the knowledge graph of the open source project from the plurality of triples based on the data schema;

and the display module is used for carrying out visual analysis and display on the open source project knowledge graph by using a visualization tool Gephi.

Optionally, the definition module abstracts, from multiple perspectives of the open source project, the knowledge information that makes up the knowledge graph, including basic element relationships and entities; wherein the entities comprise: functions, files, project submission records, problem sets, and code merge requests; and wherein the basic element relationships comprise: call relationships, containment relationships, modification relationships, and involvement relationships.

Optionally, the static analysis tool comprises: Doxygen, Cppcheck, and FindBugs.

As can be seen from the above technical solutions, the embodiments of the present invention provide a method and a system for establishing an open source project knowledge graph. Starting from developers' need to learn an open source project, the knowledge information that developers require in order to learn the project and participate in its development, namely the knowledge information of the project code itself and the knowledge information related to the project code, is extracted from multiple data sources: the open source project, the open source community where the project is located, and the project's remote repository. The open source project knowledge graph is constructed from this information and presents the code knowledge of the project comprehensively and effectively, which helps improve developers' learning efficiency and encourages them to participate in, and contribute to, the development of the open source project.

Drawings

FIG. 1 is a flowchart of the steps of an embodiment of the method for establishing an open source project knowledge graph according to the present invention;

FIG. 2 is a diagram of the SPO triple structure provided in an embodiment of the present invention;

FIG. 3 is a diagram of the data schema designed for the open source project knowledge graph according to an embodiment of the present invention;

FIG. 4 is a partial function call relationship subgraph from a static code analysis result according to an embodiment of the present invention;

FIG. 5 is a block diagram of an embodiment of the open source project knowledge graph establishing system provided by the present invention;

FIG. 6 is a visualization of an open source project knowledge graph constructed by an embodiment of the invention;

FIG. 7 is a framework diagram of an embodiment of the method for establishing an open source project knowledge graph provided by the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example one

FIG. 1 is a flowchart illustrating steps of a method for establishing an open source project knowledge-graph according to an embodiment of the present invention.

Referring to FIG. 1, the method for establishing an open source project knowledge graph provided by this embodiment is applied to an open source project in an open source community. Starting from developers' code learning needs, this embodiment acquires the code knowledge information of the open source project and the related knowledge information in the open source community and uses them to establish the knowledge graph of the open source project, thereby meeting developers' need to learn unfamiliar code. The method includes the following steps:

Step S101, predefining a data schema of the open source project knowledge graph.

When a developer wants to understand a specific function in an unfamiliar system, the function usually cannot simply be located and read in isolation. The developer needs to start from the outermost layer of the program's call relationships and follow function calls step by step down to the target function, learning the function as a whole and its place in the system along the call path. For example, starting from the relevant functions in a unit test, the developer works inward to the specific function under test and learns how it is called and how it runs under that test case.

Most current analytical research on open source projects targets code search or feature location. Its goal is mainly to help users make better use of open source projects or to analyze the structure of project code. For an actual open source community and its contributors and learners, however, a low barrier to contributing to an open source project is very important. Most open source projects in today's open source communities do not maintain developer-oriented system design documentation, which means that a newly joined developer who wants to develop some functionality must spend a great deal of time reading and learning the project code. If the project organization or code comments are incomplete, this creates significant difficulties for developers trying to take part in actual development. Therefore, the embodiment constructs a knowledge graph for the open source project according to developers' learning needs, helping developers learn unfamiliar code.

A knowledge graph is essentially a knowledge base in the form of a semantic network; it aims to describe the entities found in the real world and the relationships among them. An entity may be a real-world object or an abstract concept, and a relationship is a connection between entities together with its semantic description. A knowledge graph can generally be viewed as a graph structure in which the entities are the nodes and the relationships are the edges.

Knowledge graphs were first applied in search engines: when a user searches for a piece of knowledge, the search engine can identify the specific object the user is referring to. For example, when a user looks up the air time of a certain drama, the results are not confused with those for the novel of the same name.

Generally, a knowledge graph can be displayed on a web page as a knowledge panel, which not only shows links to websites matching the user's search but also aggregates and displays structured information about the search topic.

It can be seen that a knowledge graph is composed of many interconnected relationships and their attributes, and each relationship is usually represented as an SPO (Subject-Predicate-Object) triple. As shown in FIG. 2, in a triple, Subject denotes the subject of the relationship, Predicate denotes the relationship itself, and Object denotes the object the relationship points to; both the subject and the object are entities of the knowledge graph.
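As a minimal sketch of the SPO structure described above, a triple can be modeled as a three-field record. The class name `Triple`, the field names, and the sample values (`main`, `parse_args`, `cli.c`) are illustrative assumptions and do not come from the patent; the relation names follow the text with the stray spaces around underscores removed.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One SPO statement: subject --predicate--> object."""
    subject: str
    predicate: str
    obj: str  # named `obj` to avoid shadowing Python's built-in `object`


# Hypothetical example: function `main` calls function `parse_args`.
call = Triple("main", "func_call", "parse_args")

# A knowledge graph is a set of triples; its entities are every
# subject and every object that appears in some triple.
graph = {call, Triple("cli.c", "file_content_func", "main")}
entities = {name for t in graph for name in (t.subject, t.obj)}
```

Because `main` appears as the subject of one triple and the object of another, the two triples share a node, which is how separate statements join into one graph.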

A data schema (Schema) in a knowledge graph is a distillation and specification of knowledge. Designing the schema in advance and conforming to it promotes standardization and simplifies the subsequent processing and querying of knowledge triples. Therefore, to construct the knowledge graph of an open source project, the data schema of the knowledge graph is defined first. Constructing a schema for a knowledge graph is equivalent to building an ontology (Ontology) for it. The ontology includes concepts, concept hierarchies, attributes, attribute value types, relationships, the set of domain (Domain) concepts over which each relationship is defined, and the set of range (Range) concepts from which each relationship takes its values. On this basis, rules (Rules) or axioms (Axioms) can additionally be added to express more complex constraints at the schema layer.

FIG. 3 illustrates a data schema designed for an open source project knowledge-graph according to an embodiment of the present invention.

The data schema abstracts, from multiple perspectives of the open source project, the basic element relationships and entities that make up the knowledge graph. The entities comprise: functions, files, project submission records, problem sets, and code merge requests. The basic element relationships comprise: call relationships, containment relationships, modification relationships, and involvement relationships.

It should be noted that the knowledge information extracted from multiple perspectives of the open source project is chosen to meet developers' need to learn the project and take part in its development. It is neither the knowledge a user needs in order to use the project nor the code snippet recommendation knowledge used in mature intelligent programming tools for developers. For example, the method extracts more of the project's internal information so that a developer can understand the internals and then contribute code to them; it does not extract API information for the purpose of helping a developer use the API. Likewise, the method extracts the problem sets of the target open source project in the open source community; that is, from the text of developers' discussions about the project it extracts knowledge entities and the basic element relationships between them, so as to connect functions, files, project submission records, problem sets, code merge requests, and other information with one another and help developers quickly understand and learn the project code and related code knowledge. It does not merely extract the descriptive text of a discussion or the question-and-answer relationships within it. As those skilled in the art will understand, the knowledge graph provided by the embodiment of the invention serves the need of newly joined developers to learn an open source project quickly and participate in its development.

Table 1 below shows entities to be extracted and corresponding descriptions of the entities for constructing the knowledge graph in the data schema of the knowledge graph designed in the embodiment of the present invention.

Table 1. Entities included in the knowledge graph and their descriptions

Specifically: the entity "Func" (function) represents a function in the open source project code; the entity "File" (file) represents a file in the open source project; the entity "Commit" (project submission record) represents one record in the submission history of the open source project; the entity "Issue" (problem set) represents an issue of the open source project in the open source community together with its comments; and the entity "Pull Request" (code merge request) represents a code merge request for the open source project in the open source community.

Table 2 below lists the basic element relationships to be extracted for constructing the knowledge graph under the data schema designed in the embodiment of the present invention, together with their descriptions.

Table 2. Basic element relationships included in the knowledge graph

Specifically: the relation "(sub, func_call, obj)" indicates a call relationship of the sub function to the obj function; the relation "(sub, file_content_func, obj)" indicates a containment relationship of the sub file to the obj function; the relation "(sub, commit_change_file, obj)" indicates a modification relationship of the sub project submission record to the obj file; the relation "(sub _ relation _ commit, obj)" indicates an involvement relationship of the sub problem set to the obj project submission record; the relation "(sub, issue_related_issue, obj)" indicates an involvement relationship of the sub problem set to the obj problem set; the relation "(sub, issue_related_pr, obj)" indicates a containment relationship of the sub problem set to the obj code merge request; the relation "(sub, pr_relative_commit, obj)" indicates an involvement relationship of the sub code merge request to the obj project submission record; and the relation "(sub, pr_relative_file, obj)" indicates an involvement relationship of the sub code merge request to the obj file.

Taking the containment relationship of the sub problem set to the obj code merge request as an example: in the discussion information of the open source community, a developer may mention a code merge request of an open source project in a discussion related to that project; in that case, the problem set under discussion is regarded as having a containment relationship with that code merge request.

Taking the involvement relationship of the sub problem set to the obj problem set as an example: in the discussion information of the open source community, a developer may mention another discussion thread in a discussion related to an open source project; in that case, the problem set under discussion is regarded as having an involvement relationship with the problem set of the other thread.
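The entities of Table 1 and the relationships of Table 2 can be written down as a small schema that fixes, for each relation, the entity type allowed as subject (domain) and as object (range). The following is only a sketch under assumptions: relation names follow the text with the stray spaces around underscores removed, `issue_related_commit` is an assumed name because the corresponding name is garbled in the text, `PullRequest` is spelled without a space, and `conforms` is a hypothetical helper.

```python
# Entity types of Table 1 ("Pull Request" written without a space here).
ENTITY_TYPES = {"Func", "File", "Commit", "Issue", "PullRequest"}

# Relation name -> (subject entity type, object entity type).
SCHEMA = {
    "func_call": ("Func", "Func"),
    "file_content_func": ("File", "Func"),
    "commit_change_file": ("Commit", "File"),
    "issue_related_commit": ("Issue", "Commit"),  # assumed name, see lead-in
    "issue_related_issue": ("Issue", "Issue"),
    "issue_related_pr": ("Issue", "PullRequest"),
    "pr_relative_commit": ("PullRequest", "Commit"),
    "pr_relative_file": ("PullRequest", "File"),
}


def conforms(subject_type: str, predicate: str, object_type: str) -> bool:
    """Return True if a typed triple matches the domain and range of its relation."""
    return SCHEMA.get(predicate) == (subject_type, object_type)
```

A check of this kind is one way to perform the schema verification mentioned later for step S105: triples whose types do not match the schema are the ones that need mapping rules.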

Step S102, acquiring knowledge information of the program code from the open source project code.

In this embodiment, a static code analysis method is used to acquire the knowledge information of the program code itself from the open source project code. This knowledge information includes the functions and files in the project code and the call and containment relationships among them. Specifically, these may be the call relationship of a sub function to an obj function, the containment relationship of a sub file to an obj function, and the call relationship of a sub file to an obj file.

Compared with dynamic code analysis, static code analysis is fast, general, convenient, and has few dependencies. From the standpoint of a developer reading and learning an open source project, running an unfamiliar large project requires setting up a runtime environment, consulting documentation, and choosing parameters, all of which cost the developer time and effort. Because static analysis does not require running the project, it is well suited to the scenario of a developer reading and learning an open source project and can fully or partially solve the above problems.

Optionally, the static analysis tool comprises: Doxygen, Cppcheck, and FindBugs.

Further, this embodiment selects Doxygen as the static analysis tool for the open source project code. Doxygen is a cross-platform static code analysis tool that supports multiple languages; its advantages are broad language support, good generality, and varied analysis output. It can analyze the input project code and extract code information such as function call relationships, file structure, and function attributes. It supports many common languages, such as C++, C, Java, Objective-C, and Python, and runs on many operating systems.

Preferably, the static analysis tool Doxygen is used to analyze the open source project code for each file and each project module of the open source project, generating a local relationship subgraph for each.

Preferably, the relationship subgraphs are described in the DOT language, and the knowledge information in each DOT-described subgraph is mapped to the actual function names for output. DOT is a plain-text graph description language that provides a simple way to describe graphs readable by both humans and computer programs.

FIG. 4 shows a partial function call relationship subgraph for one module, taken from a static code analysis result produced by Doxygen and output in the DOT language; it illustrates the structure and representation of a relationship subgraph.
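The mapping from a DOT-described subgraph to call triples with actual function names can be sketched as follows. This is an illustration under assumptions, not the patented implementation: it assumes one statement per line, node lines of the form `Node1 [label="name"]`, and edge lines of the form `Node1 -> Node2`; the function name `dot_to_call_triples` and the sample graph are hypothetical, and real tool output may need a full DOT parser.

```python
import re

# Node statement: an identifier followed by an attribute list with a label.
NODE_RE = re.compile(r'^\s*(\w+)\s*\[.*?label="([^"]*)"')
# Edge statement: identifier -> identifier (trailing attributes are ignored).
EDGE_RE = re.compile(r'^\s*(\w+)\s*->\s*(\w+)')


def dot_to_call_triples(dot_text):
    """Map node ids to their labels and emit (caller, func_call, callee) triples."""
    labels, edges = {}, []
    for line in dot_text.splitlines():
        edge = EDGE_RE.match(line)
        if edge:
            edges.append(edge.groups())
            continue
        node = NODE_RE.match(line)
        if node:
            labels[node.group(1)] = node.group(2)
    # Fall back to the raw node id when a node has no label line.
    return [(labels.get(s, s), "func_call", labels.get(d, d)) for s, d in edges]


sample = '''digraph "main" {
  Node1 [label="main"];
  Node2 [label="parse_args"];
  Node3 [label="run"];
  Node1 -> Node2;
  Node1 -> Node3 [color="midnightblue"];
}'''

triples = dot_to_call_triples(sample)
```

Edges are resolved only after the whole text has been read, so a label that appears after the edge using it is still applied.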

Step S103, acquiring knowledge information related to the open source project from the open source community and the remote repository of the open source project.

In this embodiment, the widely used open source community GitHub is taken as an example. The GitHub API is used together with the PyGithub and pygit2 framework tools to obtain information from the GitHub open source community and the Git repository, and valid knowledge information related to the target open source project is extracted by keyword.

The knowledge information related to the open source project comprises: project submission records, code merge requests, problem sets, and the involvement relationships among them.

Specifically, in the embodiment of the present invention, the PyGithub framework, a wrapper around GitHub API v3, is used to extract knowledge information related to the target open source project from the GitHub open source community by keyword; this includes extracting information on the problem sets and code merge requests related to the target project from developers' discussions of the project. The pygit2 framework is used to acquire knowledge information related to the target open source project from the Git repository: the project's submission record tree is traversed and data are extracted from it to obtain information on the project submission records.

The GitHub API is an open query interface officially provided by the open source community GitHub, giving simple access to the community's functions. The pygit2 package is a Python binding for libgit2 that exposes the attributes of repository objects through a concise interface for performing operations on a Git repository. The libgit2 library is a portable C implementation of the core methods of the open source version control system Git; it is built as a reliable, stable C link library, so that developers can perform routine Git operations by calling its API from code. It can thus be understood as a shared library for Git; compared with Git as a stand-alone application, libgit2 omits complex optimizations and non-core functionality.
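To illustrate what traversing the submission records yields, the sketch below turns a textual commit history into modification triples. It deliberately swaps the library-based traversal described above for plain-text parsing so that it has no dependencies: it assumes input shaped like the output of `git log --name-only --pretty=format:"commit %H"`, that is, a `commit <hash>` line followed by the paths changed in that commit. The function name `commit_file_triples` and the sample hashes and paths are hypothetical.

```python
def commit_file_triples(log_text):
    """Turn commit-history text into (commit, commit_change_file, file) triples."""
    triples = []
    current = None  # hash of the commit whose file list is being read
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines only separate commits
        if line.startswith("commit "):
            current = line.split()[1]
        elif current is not None:
            triples.append((current, "commit_change_file", line))
    return triples


sample_log = """commit 1a2b3c4
src/main.c
src/util.c

commit 5d6e7f8
README.md
"""

triples = commit_file_triples(sample_log)
```

A file whose path itself begins with `commit ` would be misread as a commit line; a library-based traversal avoids that ambiguity.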

Step S104, analyzing the knowledge information of the program code and the knowledge information related to the open source project, and extracting a plurality of triples.

On the basis of step S103, the embodiment of the present invention further analyzes the knowledge information of the program code itself and the knowledge information related to the open source project based on heuristic rules and extracts a plurality of triples. This includes extracting, from text information, the knowledge entities and basic element relationships related to the target open source project by text matching and natural language analysis.
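One simple form of text matching is to look for references inside discussion text. The sketch below assumes GitHub-style references, where `#123` points to an issue or a pull request in the same repository and a run of 7 to 40 hexadecimal characters is treated as a commit hash. The function `extract_references`, the `issue#N` and `pr#N` naming, and the relation name `issue_related_commit` are assumptions made for illustration.

```python
import re

ISSUE_REF = re.compile(r'(?<![\w&])#(\d+)\b')  # "#42", but not "abc#42"
SHA_REF = re.compile(r'\b[0-9a-f]{7,40}\b')    # heuristic: may over-match hex-like words


def extract_references(issue_number, text, pr_numbers):
    """Emit triples for the issues, pull requests, and commits an issue mentions."""
    subject = f"issue#{issue_number}"
    triples = []
    for match in ISSUE_REF.findall(text):
        number = int(match)
        if number == issue_number:
            continue  # ignore self-references
        if number in pr_numbers:
            triples.append((subject, "issue_related_pr", f"pr#{number}"))
        else:
            triples.append((subject, "issue_related_issue", f"issue#{number}"))
    for sha in SHA_REF.findall(text):
        triples.append((subject, "issue_related_commit", sha))
    return triples


text = "Fixed by #42, see also #7. Regression introduced in 1a2b3c4d."
triples = extract_references(10, text, pr_numbers={42})
```

The set `pr_numbers` is needed because issues and pull requests share one number sequence on GitHub, so the text alone cannot tell them apart.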

Heuristic rules are widely used in this technical field and come in many forms, so their specific use in the embodiment of the invention is not described in detail here. For example, an ant colony algorithm, a neural network algorithm, or the like may be selected.

In the embodiment of the invention, the ternary relationships in the knowledge graph data schema are expressed in RDF form; that is, the knowledge information of the program code and the knowledge information related to the open source project are analyzed, extracted, and converted into a plurality of triples. RDF (Resource Description Framework) is a framework defined by the W3C (World Wide Web Consortium); it is a technical specification used mainly to describe and express entities and resources more richly.
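As a sketch of what expressing triples in RDF form can look like, the code below writes each triple as one line of N-Triples, a line-based RDF syntax in which a statement is three IRIs followed by a period. The namespace `http://example.org/oskg/`, the `entity/` and `relation/` path segments, and the helper names are hypothetical choices, not part of the patent.

```python
from urllib.parse import quote

BASE = "http://example.org/oskg/"  # hypothetical namespace, not from the source


def iri(kind, name):
    """Build an IRI; percent-encode the name so paths and spaces stay legal."""
    return f"<{BASE}{kind}/{quote(name, safe='')}>"


def to_ntriples(triples):
    """Serialize (subject, predicate, object) tuples, one N-Triples line each."""
    return "\n".join(
        f"{iri('entity', s)} {iri('relation', p)} {iri('entity', o)} ."
        for s, p, o in triples
    )


doc = to_ntriples([("src/main.c", "file_content_func", "main")])
```

Percent-encoding keeps a file path such as `src/main.c` inside a single IRI segment, so the same entity name always yields the same IRI.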

In a preferred embodiment of the present invention, a plurality of triples is obtained based on the basic element relationships and entities included in the data schema of step S101, and the categories of triples include: the call-relationship triple from a sub function to an obj function; the containment-relationship triple from a sub file to an obj function; the modification-relationship triple from a sub project submission record to an obj file; the involvement-relationship triple from a sub problem set to an obj project submission record; the involvement-relationship triple from a sub problem set to an obj problem set; the involvement-relationship triple from a sub problem set to an obj code merge request; the involvement-relationship triple from a sub code merge request to an obj project submission record; and the involvement-relationship triple from a sub code merge request to an obj file.

Step S105, unifying the data format of each knowledge entity in all the triples according to the different structural characteristics of each data source, and disambiguating each triple, so that in the triple set each valid knowledge entity has one and only one corresponding entity name.

Since the knowledge information of the program code and the knowledge information related to the open source project are respectively obtained from the open source project code and the open source community, and the data come from different sources, the expression form, the data format and the data consistency of the knowledge entity in the knowledge information are very likely to be different.

In the embodiment of the invention, all knowledge entities that may carry different suffixes are unified and disambiguated by rules, using text analysis and heuristic rules. Specifically: first, all extracted knowledge triples are checked against the predefined knowledge graph schema; for triples that do not match the schema, different mapping rules are defined using a natural language analysis method, and the knowledge information extracted in different formats from different sources is mapped to the same formatted data, so that each valid knowledge entity has one and only one corresponding entity name in the triple set consisting of all triples.

For example, when file knowledge information is extracted, a file name may appear in knowledge from different sources as an absolute path, a relative path, or a bare file name. These cases need to be disambiguated so that the names and relationships of the same file map onto the same entity, which facilitates fusing relationships from different sources.
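The file name case above can be sketched as a small normalization rule. This is one possible heuristic under stated assumptions, not the rule set of the embodiment: the function `canonical_file`, the checkout root, and the file list are hypothetical, and a bare name is accepted only when exactly one known file matches it.

```python
import posixpath
from typing import Optional, Set


def canonical_file(name: str, repo_root: str, known_files: Set[str]) -> Optional[str]:
    """Map an absolute path, relative path, or bare file name to one repo-relative path."""
    path = posixpath.normpath(name.replace("\\", "/"))
    root = posixpath.normpath(repo_root)
    # Absolute path inside the checkout: strip the checkout prefix.
    if path.startswith(root + "/"):
        path = path[len(root) + 1:]
    if path in known_files:
        return path
    # Bare name or partial path: accept only an unambiguous suffix match.
    matches = [f for f in known_files if f.endswith("/" + path)]
    return matches[0] if len(matches) == 1 else None


# Hypothetical repository contents.
known = {"src/net/http.c", "src/util/log.c", "tests/log.c"}
root = "/home/dev/proj"
print(canonical_file("/home/dev/proj/src/net/http.c", root, known))
print(canonical_file("./src/util/log.c", root, known))
print(canonical_file("http.c", root, known))
```

Returning `None` for an ambiguous name such as `log.c` leaves the decision to a later rule or to manual review instead of silently merging two different files into one entity.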

Step S106, constructing a knowledge graph of the open source project from the triple set based on the data schema.

In the embodiment of the invention, the triple set of all knowledge information expressed in RDF form is obtained through the above steps, and the unification and disambiguation of the knowledge entities is complete. At this point, the triples are merged on their shared knowledge entities to generate the knowledge graph of the open source project.
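Merging triples on shared entities can be illustrated with an adjacency map in which every entity name is a single node. This is a minimal sketch; the function `build_graph` and the `commit:`, `file:`, and `pr:` identifier prefixes are assumptions for illustration only.

```python
from collections import defaultdict


def build_graph(triples):
    """Merge triples into an adjacency map keyed by entity name."""
    graph = defaultdict(set)
    for sub, pred, obj in triples:
        graph[sub].add((pred, obj))
        graph.setdefault(obj, set())  # every entity appears as a node, even without outgoing edges
    return dict(graph)


triples = [
    ("pr:7", "pr_relative_commit", "commit:abc"),
    ("commit:abc", "commit_change_file", "file:src/main.c"),
]
graph = build_graph(triples)
print(sorted(graph))
```

Because both triples name the same entity `commit:abc`, they join at one node, which is why the earlier disambiguation step must guarantee a single name per entity.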

In a serial construction process, each triple must be verified against the knowledge information already in the graph before it is fused into the graph, each verification usually requires several triples to be checked together, and the next triple can be added only after the fusion is completed. This way of constructing the graph is therefore inefficient.

Therefore, in a preferred embodiment provided by the present invention, each data source is cleaned and its triples are extracted and disambiguated based on the predefined data schema, and the triple set representing each type of relationship is extracted separately to form a triple-set subgraph for that relationship. The data pipelines that build these subgraphs run concurrently, that is, multiple triple-set subgraphs are constructed at the same time, and finally all triple-set subgraphs are aggregated to construct the knowledge graph of the open source project. Compared with the serial processing of the original construction process, the parallel construction method of this preferred embodiment is more efficient and also promotes uniformity in development.
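The concurrent construction described above can be sketched as follows, assuming the triples have already been cleaned and disambiguated. The function names are hypothetical, and a thread pool merely stands in for the parallel data pipelines; the embodiment does not say which concurrency mechanism is used.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def build_subgraph(triples):
    """Build the adjacency map for the triple set of one relation type."""
    subgraph = defaultdict(set)
    for sub, pred, obj in triples:
        subgraph[sub].add((pred, obj))
    return dict(subgraph)


def build_graph_concurrently(triples):
    """Group triples by relation type, build subgraphs concurrently, then merge."""
    by_relation = defaultdict(list)
    for triple in triples:
        by_relation[triple[1]].append(triple)

    # One task per relation type; threads stand in for the parallel pipelines.
    with ThreadPoolExecutor() as pool:
        subgraphs = list(pool.map(build_subgraph, by_relation.values()))

    # Aggregate all relation subgraphs into one graph.
    merged = defaultdict(set)
    for subgraph in subgraphs:
        for node, edges in subgraph.items():
            merged[node] |= edges
    return dict(merged)


triples = [
    ("func:main", "func_call", "func:parse"),
    ("func:main", "func_call", "func:run"),
    ("commit:abc", "commit_change_file", "file:src/main.c"),
]
graph = build_graph_concurrently(triples)
```

Since each subgraph covers exactly one relation type, the tasks share no state and the final merge is a plain set union. For CPU-bound extraction in Python, separate processes would be a more realistic choice than threads.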

Step S107, performing visual analysis and display of the open source project knowledge graph using the visualization tool Gephi.

FIG. 6 shows a visualization effect diagram of an open source project knowledge graph constructed by an embodiment of the invention.

Visualizing the knowledge graph of the open source project with the visualization tool Gephi helps a system administrator find problems in the knowledge graph, such as missing knowledge or entities that fail to align, so that the construction process of the knowledge graph can be further optimized and improved accordingly.
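One way to hand the triples to Gephi is an edge table in CSV form, which Gephi's spreadsheet import reads through columns such as Source, Target, Type and Label. The sketch below is an assumption about how the export could be done, not the export used in the embodiment; `write_edge_csv` is a hypothetical helper.

```python
import csv
import io


def write_edge_csv(triples, fh):
    """Write triples as a directed edge table, one row per triple."""
    writer = csv.writer(fh, lineterminator="\n")
    writer.writerow(["Source", "Target", "Type", "Label"])
    for sub, pred, obj in triples:
        # The predicate becomes the edge label shown in the visualization.
        writer.writerow([sub, obj, "Directed", pred])


buf = io.StringIO()
write_edge_csv([("func:main", "func_call", "func:parse")], buf)
print(buf.getvalue())
```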

In addition, based on the visualization, the relevant knowledge information of the open source project can be displayed according to the developer's learning needs, and the relevant subgraph can be displayed for a knowledge node queried by the developer, providing the developer with knowledge related to the target code to be learned. Prompt information is given to the user to improve the experience, and click and other interaction events are added to provide richer information.
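Displaying the subgraph around a queried knowledge node can be sketched as a breadth-first search limited to a number of hops. The function `neighborhood`, the hop limit, and the choice to follow edges in both directions are assumptions made for this illustration.

```python
from collections import deque


def neighborhood(triples, start, max_hops=1):
    """Return the triples reachable from `start` within `max_hops`, ignoring edge direction."""
    adjacent = {}
    for triple in triples:
        adjacent.setdefault(triple[0], []).append(triple)
        adjacent.setdefault(triple[2], []).append(triple)

    seen = {start}
    result = []
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for triple in adjacent.get(node, []):
            if triple not in result:
                result.append(triple)
            other = triple[2] if triple[0] == node else triple[0]
            if other not in seen:
                seen.add(other)
                queue.append((other, depth + 1))
    return result


triples = [
    ("issue:1", "issue_related_pr", "pr:7"),
    ("pr:7", "pr_relative_file", "file:a.c"),
    ("file:a.c", "file_content_func", "func:f"),
    ("func:f", "func_call", "func:g"),
]
one_hop = neighborhood(triples, "pr:7", max_hops=1)
two_hops = neighborhood(triples, "pr:7", max_hops=2)
```

A query for the code merge request `pr:7` thus returns the problem set that mentions it and the file it touches, and a wider hop limit also brings in the functions contained in that file.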

As shown in fig. 7, in the embodiment of the present invention, for the learning requirement of the developer on the open source project, knowledge required for the developer to learn and participate in the development of the open source project is extracted from the open source project code itself, the open source community where the open source project is located, and the remote warehouse of the open source project, so as to construct a knowledge graph of the open source project, thereby helping to improve the learning efficiency of the developer, and further enabling the developer to better participate in the development of the open source project.

It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.

Example two

Referring to fig. 5, a structural block diagram of an embodiment of a system for establishing an open source project knowledge graph according to the present invention is shown. The system may be applied to an open source project. Starting from a developer's need to learn the code, this embodiment acquires the code knowledge information of the open source project and the related knowledge information in the open source community, and establishes the knowledge graph of the open source project so as to meet the developer's need to learn unfamiliar code. The establishing system specifically includes:

the defining module 201 is used for predefining a data mode of the open source project knowledge graph.

It can be seen that the knowledge graph is composed of a number of interconnected relationships and their attributes, and these relationships are usually represented as SPO triples (Subject-Predicate-Object). As shown in fig. 2, in a triple, Subject represents the subject of the relationship, Predicate represents the relationship itself, and Object represents the object to which the relationship points.

It is noted that the knowledge information extracted from multiple angles of the open source project is driven by the requirement that a developer learn the open source project and be able to participate in its development. It is neither the knowledge a user needs in order to use the open source project, nor the code snippet recommendation knowledge needed for intelligent programming by experienced developers. For example, the method extracts more information internal to the open source project, so that a developer can understand the project and then contribute and develop code inside it; it does not extract API information for the purpose of letting developers use the API. In addition, the problem sets of the target open source project in the open source community are extracted; that is, the knowledge entities and the basic element relationships between entities are extracted from the developers' discussion text about the target open source project, so that functions, files, project submission records, problem sets, code merge requests, and other information are connected to one another, helping developers quickly understand and learn the project code and the related code knowledge. The descriptive content of the discussions and the question-answer relationships within them are not extracted.

Referring to table 1, in the data schema of the knowledge graph designed in the embodiment of the present invention, the entities to be extracted for constructing the knowledge graph specifically include: the entity "Func", i.e., a function, represents a function in the open source project code; the entity "File", i.e., a file, represents one file in the open source project; the entity "Commit", i.e., a project submission record, represents one record in the submission history of the open source project; the entity "Issue", i.e., a problem set, represents a problem of the open source project in the open source community together with its comments; the entity "Pull Request", i.e., a code merge request, represents a code merge request for the open source project in the open source community.

Referring to table 2, in the data schema of the knowledge graph designed in the embodiment of the present invention, the basic element relationships to be extracted for constructing the knowledge graph specifically include: the relation "(sub, func_call, obj)", indicating the call relation of the sub function to the obj function; the relation "(sub, file_content_func, obj)", indicating the containment relation of the sub file to the obj function; the relation "(sub, commit_change_file, obj)", indicating the modification relation of the sub project submission record to the obj file; the relation "(sub_relation_commit, obj)", indicating the related relation of the sub problem set to the obj project submission record; the relation "(sub, issue_related_issue, obj)", indicating the related relation of the sub problem set to the obj problem set; the relation "(sub, issue_related_pr, obj)", indicating the related relation of the sub problem set to the obj code merge request; the relation "(sub, pr_relative_commit, obj)", indicating the related relation of the sub code merge request to the obj project submission record; and the relation "(sub, pr_relative_file, obj)", indicating the related relation of the sub code merge request to the obj file.
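The entities and relations listed above can be written down as a small schema table against which extracted triples are checked. This is a sketch under assumptions: the relation labels follow the names in the text, except `issue_related_commit`, which is an assumed label because the name of the problem-set-to-submission-record relation is not cleanly legible here; the function `conforms` and the entity type map are hypothetical.

```python
# Relation label -> (subject entity type, object entity type).
SCHEMA = {
    "func_call": ("Func", "Func"),
    "file_content_func": ("File", "Func"),
    "commit_change_file": ("Commit", "File"),
    "issue_related_commit": ("Issue", "Commit"),  # assumed label, see lead-in
    "issue_related_issue": ("Issue", "Issue"),
    "issue_related_pr": ("Issue", "Pull Request"),
    "pr_relative_commit": ("Pull Request", "Commit"),
    "pr_relative_file": ("Pull Request", "File"),
}


def conforms(triple, entity_type):
    """Check that the predicate is known and both ends have the expected entity types."""
    sub, pred, obj = triple
    expected = SCHEMA.get(pred)
    if expected is None:
        return False
    return (entity_type.get(sub), entity_type.get(obj)) == expected


# Hypothetical entities with their types.
types = {"main": "Func", "helper": "Func", "src/a.c": "File"}
print(conforms(("main", "func_call", "helper"), types))
print(conforms(("src/a.c", "func_call", "helper"), types))
```

Triples that fail such a check are the ones that would be passed to the mapping rules for unification and disambiguation.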

A knowledge obtaining module 202, configured to obtain knowledge information of a program code itself from an open source project code by using a static code analysis method, where the knowledge information of the program code itself includes: functions, files and calling and containing relations among the functions and the files; and acquiring knowledge information related to the open source project from the open source community where the open source project is located and a remote warehouse of the open source project, wherein the knowledge information related to the open source project comprises: the project submission records, the code merge requests, and the problem sets and the related relationships between them.

Optional static analysis tools include: Doxygen, CppCheck and FindBugs.
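The claims describe the static analysis output as relation subgraphs written in the Dot language. As a simplified sketch, call edges of the form `"caller" -> "callee"` can be read from such text and turned into call relation triples. The regular expression and the function `dot_to_call_triples` are assumptions; real tool output (for example graphs that use numbered nodes with separate labels) would need a fuller parser.

```python
import re

# Matches edges such as  "main" -> "parse_args";  with or without quotes.
EDGE = re.compile(r'"?([\w:.]+)"?\s*->\s*"?([\w:.]+)"?')


def dot_to_call_triples(dot_text):
    """Turn Dot edges into (caller, func_call, callee) triples."""
    return [(m.group(1), "func_call", m.group(2)) for m in EDGE.finditer(dot_text)]


# Hypothetical call graph in Dot syntax.
dot = 'digraph calls {\n  "main" -> "parse_args";\n  "parse_args" -> "print_usage";\n}\n'
call_triples = dot_to_call_triples(dot)
```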

In this embodiment, the widely used open source community GitHub is taken as an example. The GitHub API is used in combination with the PyGithub and Pygit2 framework tools to obtain information from the GitHub open source community and the Git warehouse, and effective knowledge information related to the target open source project is extracted through keywords. The knowledge information related to the open source project comprises: project submission records, code merge requests, problem sets, and the related relationships between them.

Specifically, in the embodiment of the present invention, the PyGithub framework, which wraps the GitHub API v3, is used to extract knowledge information related to the target open source project from the GitHub open source community through keywords, including extracting information about the problem sets and code merge requests related to the target open source project from the developers' discussions of the project. The Pygit2 framework is used to acquire knowledge information related to the target open source project from the Git warehouse: the submission record tree of the target open source project is traversed, data is extracted, and the information of the project submission records is obtained.
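The embodiment traverses the submission record tree through a Git library. As a dependency-free stand-in (a substitute technique, not the framework named above), the sketch below parses text assumed to come from the command `git log --name-only --pretty=format:"commit %H"` and emits modification triples. The function `parse_git_log` and the sample hashes are hypothetical.

```python
def parse_git_log(log_text):
    """Turn `git log --name-only` text into (commit, commit_change_file, file) triples."""
    triples = []
    current = None
    for line in log_text.splitlines():
        line = line.strip()
        if line.startswith("commit "):
            # Header line produced by the assumed --pretty format.
            current = line.split(" ", 1)[1]
        elif line and current is not None:
            # Any other non-empty line is a path changed by the current commit.
            triples.append((current, "commit_change_file", line))
    return triples


# Hypothetical log excerpt with shortened hashes.
sample = "commit a1b2c3d\nsrc/main.c\nsrc/util/log.c\n\ncommit e4f5a6b\nREADME.md\n"
commit_triples = parse_git_log(sample)
```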

The GitHub API is an open query interface officially provided by the open source community GitHub, offering simple access to the functions of the open source community. Pygit2 is a Python binding for libgit2 that exposes the attributes of repository objects through a concise interface and performs various operations on a Git warehouse. libgit2 is a portable C implementation of the core methods of the open source version control system Git; it is built as a reliable and stable C link library, so that developers can easily carry out conventional Git operations by calling its API from code. libgit2 can thus be understood as a shared library of Git, from which the complex optimizations and non-core functionality of Git as a stand-alone application have been removed.

The knowledge analysis module 203 is configured to analyze the knowledge information of the program code itself and the knowledge information related to the open source project, and extract a plurality of triples; and unifying the data format of each knowledge entity in the triples according to the different structural characteristics of each data source in the triples, and disambiguating each triplet so as to ensure that each effective knowledge entity has one and only one entity name corresponding to the effective knowledge entity in the triples.

In the embodiment of the present invention, building on the knowledge acquisition module 202, the knowledge information of the program code itself and the knowledge information related to the open source project are further analyzed based on heuristic rules, and a plurality of triples are extracted. This includes extracting, from the text information, a plurality of knowledge entities and basic element relationships related to the target open source project by text matching and natural language analysis.
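A text matching rule of the kind mentioned above can be sketched with regular expressions over discussion text. The patterns, the function `extract_refs`, and the `issue:`, `pr:`, and `commit:` identifier prefixes are illustrative assumptions, and `issue_related_commit` is an assumed relation label; the heuristic rules actually used by the embodiment are not given in the text.

```python
import re

ISSUE_REF = re.compile(r"(?<![\w&])#(\d+)\b")   # references such as #12
SHA_REF = re.compile(r"\b[0-9a-f]{7,40}\b")      # abbreviated or full commit hashes


def extract_refs(issue_id, text, pr_numbers):
    """Extract relation triples for one problem set from its discussion text."""
    sub = f"issue:{issue_id}"
    triples = []
    for num in ISSUE_REF.findall(text):
        n = int(num)
        if n == issue_id:
            continue  # skip self references
        if n in pr_numbers:
            triples.append((sub, "issue_related_pr", f"pr:{n}"))
        else:
            triples.append((sub, "issue_related_issue", f"issue:{n}"))
    for sha in SHA_REF.findall(text):
        # "issue_related_commit" is an assumed label for this relation.
        triples.append((sub, "issue_related_commit", f"commit:{sha}"))
    return triples


# Hypothetical comment text; 34 is assumed to be a known code merge request number.
refs = extract_refs(5, "Duplicate of #12, fixed by #34 in commit 3f2a9c1b.", {34})
```

The hash pattern is deliberately loose and would also match any hexadecimal-looking word of seven or more characters, so in practice candidates would be checked against the commits actually present in the repository.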

In a preferred embodiment of the present invention, the ternary relationships in the knowledge graph data schema are expressed in RDF form; that is, the knowledge information of the program code itself and the knowledge information related to the open source project are analyzed, extracted, and converted into a plurality of triples. RDF (Resource Description Framework) is a framework defined by the W3C (World Wide Web Consortium), a technical specification used mainly to describe and express entities/resources more richly.

In a preferred embodiment of the present invention, a plurality of triples is further obtained based on the basic element relationships and entities that constitute the knowledge graph and are included in the data schema defined by the definition module 201. The categories of the triples include: call relationship triples of a sub function to an obj function; containment relationship triples of a sub file to an obj function; modification relationship triples of a sub project submission record to an obj file; related-relationship triples of a sub problem set to an obj project submission record; related-relationship triples of a sub problem set to an obj problem set; related-relationship triples of a sub problem set to an obj code merge request; related-relationship triples of a sub code merge request to an obj project submission record; and related-relationship triples of a sub code merge request to an obj file.

A construction module 204, configured to construct a knowledge graph of the open source project using the plurality of triples based on the data schema.

In a preferred embodiment of the present invention, each data source is cleaned and its triples are extracted and disambiguated based on the predefined data schema, and the triple set representing each type of relationship is extracted separately to form a triple-set subgraph for that relationship. The data pipelines that build these subgraphs run concurrently, that is, multiple triple-set subgraphs are constructed at the same time, and finally all triple-set subgraphs are aggregated to construct the knowledge graph of the open source project. Compared with the serial processing of the original construction process, the parallel construction method of this preferred embodiment is more efficient and also promotes uniformity in development.

And the display module 205 is configured to perform visual analysis and display on the open source project knowledge graph by using a visualization tool Gephi.

The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.

The method for establishing the open source project knowledge graph and the system for establishing the open source project knowledge graph provided by the invention are described in detail, specific examples are applied in the text to explain the principle and the implementation mode of the invention, and the description of the examples is only used for helping to understand the method and the core idea of the invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims

1. A method for establishing an open source project knowledge graph, the method comprising:

predefining a data mode of an open source project knowledge graph;

acquiring knowledge information of a program code from an open source project code by a static code analysis method, wherein the knowledge information of the program code comprises: functions, files and calling and containing relations among the functions and the files;

acquiring knowledge information related to the open source project from an open source community where the open source project is located and a remote warehouse of the open source project, wherein the knowledge information related to the open source project comprises: project submission records, code merge requests, problem sets, and related relationships between them;

analyzing the knowledge information of the program code and the knowledge information related to the open source project, extracting a plurality of triples, unifying the data format of each knowledge entity in all the triples according to the different structural characteristics of each data source, and disambiguating each triple so as to ensure that, in the triple set, each valid knowledge entity has one and only one corresponding entity name;

constructing a knowledge graph of the open source project by utilizing the triple set based on the data mode;

and carrying out visual analysis and display on the open source project knowledge graph by using a visualization tool Gephi.

2. The method of claim 1, wherein predefining data patterns of the open source project knowledge-graph comprises:

extracting knowledge information forming a knowledge graph from a plurality of angles of an open source project, wherein the knowledge information comprises: basic element relationships and entities; wherein the entities comprise: functions, files, project submission records, problem sets and code merging requests; wherein the basic element relationships comprise: a calling relationship, a containing relationship, a modifying relationship, and a related relationship.

3. The method of claim 1, wherein obtaining knowledge information of the program code itself from the open source project code by a static code analysis method comprises:

analyzing the open source project codes by using a static analysis tool aiming at each file and each project module of the open source project, and respectively generating local relation subgraphs, wherein the relation subgraphs are described and output by a Dot language;

the static analysis tool, comprising: doxygen, CppCheck and FindBugs.

4. The method of claim 1, wherein the categories of the plurality of triples comprise:

call relationship triples of a sub function to an obj function, containment relationship triples of a sub file to an obj function, modification relationship triples of a sub submission record to an obj file, related-relationship triples of a sub problem set to an obj project submission record, related-relationship triples of a sub problem set to an obj problem set, related-relationship triples of a sub problem set to an obj merging request, related-relationship triples of a sub merging request to an obj project submission record, and related-relationship triples of a sub code merging request to an obj file.

5. The method of claim 1, wherein the step of constructing a knowledge-graph of open source items comprises:

based on the data mode, cleaning each data source, extracting and disambiguating triples, then separately extracting the triple set representing each type of relationship to form a triple-set subgraph, running the data pipelines that construct the triple-set subgraphs concurrently, and finally aggregating all the triple-set subgraphs to construct a knowledge graph of the open source project.

6. A system for establishing an open source project knowledge graph, the system comprising:

7. The system of claim 6, wherein the definition module comprises:

extracting knowledge information forming a knowledge graph from a plurality of angles of an open source project, wherein the knowledge information comprises: basic element relationships and entities; wherein the entities comprise: functions, files, project submission records, problem sets and code merging requests; wherein the basic element relationships comprise: a calling relationship, a containing relationship, a modifying relationship, and a related relationship.

8. The system of claim 6, wherein the knowledge acquisition module comprises:

the static analysis tool, comprising: doxygen, CppCheck and FindBugs.

9. The system of claim 6, wherein in the knowledge analysis module, the categories of the plurality of triples comprise:

10. The system of claim 6, the build module, comprising: