CN114840657A

CN114840657A - API knowledge graph self-adaptive construction and intelligent question-answering method based on mixed mode

Info

Publication number: CN114840657A
Application number: CN202210589958.3A
Authority: CN
Inventors: 王伟东; 王冠; 陆思陶
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2022-05-27
Filing date: 2022-05-27
Publication date: 2022-08-02

Abstract

The invention discloses a mixed-mode API (application programming interface) knowledge graph self-adaptive construction and intelligent question-answering method. The method comprises the following steps: collecting API information from Stack Overflow question discussions, GitHub open-source projects, online software library API description documents, and local projects, and performing knowledge extraction to construct an API knowledge graph; adaptively training a mixed-mode API knowledge graph query model; allowing a querying user to query API information through natural language descriptions by means of mixed-mode knowledge graph query; and constructing a visual API information display page with structured partitions. The invention addresses the low accuracy, low speed, limited content richness, and non-visual presentation of natural language queries in traditional API lookup. Based on the mixed-mode knowledge graph API query method, it provides querying users with more accurate, richer, and more visual API information and improves the development efficiency of programmers.

Description

API knowledge graph self-adaptive construction and intelligent question-answering method based on mixed mode

Technical Field

The invention relates to the field of API recommendation and information search. More particularly, the invention relates to an API knowledge graph self-adaptive construction and intelligent question-answering method based on a mixed mode.

Background

As software projects evolve, their size and complexity increase. In the face of increasingly complex software development projects, the programming interfaces in software libraries play a vital role, and programmers can complete software development tasks more efficiently with the aid of APIs. However, with the frequent updates of APIs and API documentation in software libraries on the Internet, and the dramatic growth in the volume of API documentation in projects and Internet software packages, it is almost impossible for programmers to be familiar with all the knowledge associated with each API. Programmers therefore need to learn the relevant knowledge and functional usage of the APIs required in a project, and also need to search for a suitable API according to the implementation requirement in order to complete the development task. In this context, searching API documents and related information naturally becomes a critical link in determining the efficiency of a programmer's work.

However, traditional methods of retrieving API documents and related information are strongly limited. Taking the Chinese API documentation of the C++ standard library as an example, the search page only classifies APIs by category and only supports searching by API name. When we already know the name of the API, or can retrieve its information directly, and only need to look up its functionality or other details, we call this a query for the "known unknown".

Often, a programmer needs to find a class function whose owning class is known but whose name is not. For example, to find a function in the String class of the C++ standard library that converts numbers into strings, the programmer has to go through every function under the String class and read each description to find one that meets the requirement. Further, the programmer may need to find a class or function that satisfies a programming task, e.g., "a function that converts a value to string form", without any known information at all. In this case there is no obvious place to start, and the programmer has neither the time nor the option to read the introduction of every API. In practice, programmers either rely on their own experience to quickly screen the APIs in the categories they consider plausible, or turn to an Internet forum and wait for other people's answers. The first approach takes a large amount of time, and there is a risk that the correct answer is overlooked while browsing, so that the needed result is not found even though the API documentation could solve the problem. The second approach involves an unpredictable wait for answers and carries the risk that nobody answers or that the answers are incorrect. When only part of the information is known and the target API cannot be retrieved directly, or when nothing is known at all, we refer to the query as a query for the "unknown". Clearly, querying "unknown" APIs through traditional API document search consumes a lot of time during programming work and causes unpredictable fluctuations in the completion time and quality of programming tasks.

Therefore, an API question-answering method is urgently needed, which can make up for the above defects in the conventional API information retrieval mode, and improve the efficiency of programmers in software development.

Disclosure of Invention

The invention aims to provide a mixed-mode API knowledge graph self-adaptive construction and intelligent question-answering method that combines the structural advantages of a knowledge graph built from API-related information with natural language processing of query statements, thereby enlarging the search range of queries and the richness of search results. Meanwhile, support for Chinese is achieved through the processing of the query statement, and the method can be extended to other languages. In addition, the classification of query statements and the mixed query mode based on rule templates and machine learning improve the accuracy of query results while preserving query speed, which keeps the time programmers spend on searching during development under control.

In order to achieve the above purpose, the invention adopts the following method:

collecting API information from Stack Overflow question discussions, GitHub open-source projects, online software library API description documents, and local projects, and performing knowledge extraction to construct an API knowledge graph;

adaptively training the API knowledge graph query model of the mixed mode;

querying the API information through natural language description by a querying user through mixed-mode knowledge graph query;

and constructing a visual API information display page of the structure partition.

Specifically, collecting API information from Stack Overflow question discussions, GitHub open-source projects, online software library API description documents, and local projects and extracting knowledge to construct the API knowledge graph includes the following. For API document pages in online software libraries and for local API description documents, API names are extracted as API entities according to the tags of the HTML pages and the labels of the description documents, and the relationships among APIs, the API attributes, and the sample code are extracted as entity relationships and stored in triple form. Software projects are acquired from GitHub in order of star count and their source code files are split out; together with the source code of local projects, the API entities in the source code and the relationships between them are obtained according to the structure of the abstract syntax tree; the completeness of each entity's sample code is verified, sample code is extracted from the source code for API entities that lack it as a supplement to the API relationships, and the API entity relationships are stored in triple form. For the unstructured information in API description documents and in the question-and-answer text of GitHub projects, knowledge is extracted from the text. Named entity recognition is performed by a long short-term memory (LSTM) artificial neural network assisted by a conditional random field (CRF), which tags the text at the character level. Based on the Chinese vocabulary library and the extracted entity vocabulary, the text is segmented into words and tagged with parts of speech. Dependency syntax analysis is performed on the word segmentation result to generate a dependency syntax tree. Semantic role analysis is performed on the dependency syntax tree to obtain the semantic role label of each word. Relation triples are then extracted from the semantic roles of the words according to preset rules.
The API knowledge graph is constructed by taking the extracted API entities as nodes and the relation triples as edges.
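As an illustration of the final construction step, the following is a minimal Python sketch of how extracted triples could be assembled into a graph with entities as nodes and relations as labeled edges. The entity names (`StringUtils`, `StringUtils.to_text`) and the in-memory dictionary representation are assumptions made for this example; the specification does not prescribe a storage format.

```python
from collections import defaultdict

# Hypothetical triples of the form (head entity, relation, tail entity).
# The names are invented for illustration only.
TRIPLES = [
    ("StringUtils", "class function", "StringUtils.to_text"),
    ("StringUtils.to_text", "parameter", "value"),
    ("StringUtils.to_text", "return value", "text"),
]


def build_graph(triples):
    """Return (nodes, edges); edges maps a head to its (relation, tail) pairs."""
    nodes = set()
    edges = defaultdict(list)
    for head, relation, tail in triples:
        nodes.update((head, tail))           # every entity becomes a node
        edges[head].append((relation, tail))  # every triple becomes a labeled edge
    return nodes, dict(edges)


nodes, edges = build_graph(TRIPLES)
```

Looking up `edges["StringUtils.to_text"]` then lists the parameter and return-value relations of that entity.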

Further, the method for training the mixed-mode knowledge graph query model comprises the following steps: acquiring popular API query questions and their accepted answers from the community platform, constructing query statement classification rules, and classifying the query statements into regular query questions and irregular query questions according to the rules; answering regular questions with a query mode based on rule templates, and falling back to similarity matching for queries that match no rule; and answering irregular questions with a query mode based on machine learning.

Furthermore, the method by which the user queries with natural language statements through the mixed-mode API knowledge graph query method comprises the following steps: the user inputs a query statement; the query is classified as a regular question or an irregular question by the query statement classification model; and, according to the classification, either the knowledge graph query mode based on rule templates or the knowledge graph query mode based on machine learning is adaptively selected to perform the query, and the query result is returned.
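The adaptive selection between the two query modes can be sketched as a small dispatcher. In this hedged Python example, a regular expression stands in for the trained classification model, and the template pattern, the slot names, and the two callback functions are all hypothetical; they only show the control flow of classifying first and routing second.

```python
import re

# Hypothetical rule template: a query that matches it is treated as "regular".
# A real system would use the trained classification model instead.
RULE_TEMPLATES = [
    re.compile(
        r"^what (?:is|are) the (?P<relation>parameters?|return value) "
        r"of (?P<entity>[\w.:]+)\??$",
        re.IGNORECASE,
    ),
]


def classify(query):
    """Return ("regular", slots) if a template matches, else ("irregular", {})."""
    for pattern in RULE_TEMPLATES:
        match = pattern.match(query.strip())
        if match:
            return "regular", match.groupdict()
    return "irregular", {}


def answer(query, template_query, learned_query):
    """Route the query to the template-based or the learning-based back end."""
    kind, slots = classify(query)
    if kind == "regular":
        return template_query(slots)
    return learned_query(query)
```

The two back ends are passed in as functions so that either one can be replaced without touching the routing logic.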

Preferably, the method for building the visual API information presentation page includes: building a front-end display interface based on Django, displaying the API information through an interactive tree diagram and a page with structured partitions, and analyzing and highlighting the core answer API information according to the content of the query question.

The invention has the following beneficial effects:

(1) The mixed-mode API knowledge graph self-adaptive construction and question-answering method disclosed in the embodiments of this specification supports querying API information with natural language query statements, meets programmers' need to search API information through natural language descriptions of functionality, bridges the gap between knowledge graph search and natural language questions, improves the accuracy and speed of natural language API queries compared with the traditional method, solves the difficulty programmers face in querying unknown APIs during programming tasks, and improves programmers' working efficiency.

(2) Supplementing the knowledge graph with external data, such as the API documents and source code of local projects and the source code of projects on the Internet, enriches the information in the knowledge graph, expands the query search range, and broadens the content presented in API query results, which helps programmers when an API query result alone is insufficient to solve their problem.

(3) The specification provides a mixed-mode API knowledge graph self-adaptive construction and mixed question-answering method. Query statements are classified and then answered by the mixed-mode query method, which improves query accuracy while safeguarding query time, ensuring the time efficiency of API queries during programming tasks.

(4) The visual and interactive API information presentation method enriches the ways in which knowledge graph information is presented, highlights the core query content, helps programmers quickly and accurately obtain the core information for solving their problems, and improves programmers' query efficiency.

Drawings

The following describes embodiments of the present invention in further detail with reference to the accompanying drawings.

FIG. 1 is a layout diagram showing the overall process of the present invention.

FIG. 2 is a diagram illustrating the origin of data collection in accordance with the present invention.

FIG. 3 is a diagram illustrating source code knowledge extraction according to the present invention.

FIG. 4 is a diagram illustrating knowledge graph construction by textual knowledge extraction according to the present invention.

FIG. 5 is a flow chart illustrating the mixed-mode knowledge-graph query model training of the present invention.

FIG. 6 is a flow chart illustrating a mixed-mode knowledge-graph query in accordance with the present invention.

Detailed Description

In order to more clearly illustrate the invention, the invention is further described below with reference to preferred embodiments and the accompanying drawings. It is to be understood by persons skilled in the art that the following detailed description is illustrative and not restrictive, and is not to be taken as limiting the scope of the invention.

First, the technical concept of the technical solution disclosed in the present invention will be explained. The traditional way of querying API documents is highly limited. In particular, an index query of API documents requires the API name to be known, and a functional description often fails to return the expected search results because the natural language description of the problem differs from the wording of the API documents. Meanwhile, the query results of API documents offer little content, and their text-only presentation is not intuitive, so a programmer can spend a large amount of time reading API query results without finding the information needed to solve the problem.

The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.

Example one

FIG. 1 is a flow diagram of the mixed-mode API knowledge graph construction and intelligent question-answering method. Referring to FIG. 1, this embodiment provides a mixed-mode API knowledge graph adaptive construction and intelligent question-answering method, including:

adaptively training an API (application programming interface) knowledge graph query model in a mixed mode;

First, referring to FIG. 2, for API document pages in online software libraries and for local API description documents, API names are extracted as API entities according to the tags of the HTML pages and the labels of the description documents, the relationships among APIs, the API attributes, and the sample code are extracted as entity relationships, and the API entity relationships are stored in triple form. Software projects are acquired from the GitHub platform in order of star count, their source code files are split out, and these are combined with the source code files of local projects to construct a source code database. API query questions and the corresponding accepted answer texts are selected from Stack Overflow question discussions according to popularity, question-and-answer texts are collected from GitHub open-source projects, and the collected texts are combined and stored as unstructured API question-answer pair texts.
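The tag-driven extraction from an API document page can be illustrated with Python's standard `html.parser` module. This is a sketch under stated assumptions: the CSS class names (`api-name`, `api-desc`, `api-sample`), the relation labels, and the sample page are hypothetical, since the specification does not describe the markup of any particular documentation site.

```python
from html.parser import HTMLParser


class ApiDocParser(HTMLParser):
    """Collect (entity, relation, value) triples from tagged page elements."""

    # Hypothetical mapping from an element's CSS class to a relation name.
    FIELDS = {
        "api-name": "name",
        "api-desc": "description",
        "api-sample": "sample code",
    }

    def __init__(self):
        super().__init__()
        self.triples = []
        self._field = None   # relation implied by the element being read
        self._entity = None  # most recently seen API name

    def handle_starttag(self, tag, attrs):
        self._field = self.FIELDS.get(dict(attrs).get("class"))

    def handle_endtag(self, tag):
        self._field = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self._field is None:
            return
        if self._field == "name":
            self._entity = text  # start a new API entity
        elif self._entity is not None:
            self.triples.append((self._entity, self._field, text))


# Invented page fragment used only to exercise the parser.
PAGE = (
    '<h2 class="api-name">to_text</h2>'
    '<p class="api-desc">Converts a number to a string.</p>'
    '<pre class="api-sample">to_text(42)</pre>'
)

parser = ApiDocParser()
parser.feed(PAGE)
parser.close()
```

After parsing, `parser.triples` holds one description triple and one sample-code triple for the entity `to_text`.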

Further, referring to FIG. 3, the source code files are processed. Source code in different languages is compiled to obtain the abstract syntax tree of the code. An abstract syntax tree is a tree-like representation of the abstract syntactic structure of the source code, with each node of the tree representing a construct in the source code. Converting the code from text form into this structured form allows information about class and function definitions to be extracted quickly. According to the structure of the abstract syntax tree, the following information defined in the source code is obtained and stored in triple form: the name of each class (qualified by the file to which it belongs), the initialization parameters of the class, the functions of the class, the public variables of the class, the inheritance relationships of the class, the full name of each function (qualified by the file and class to which it belongs), the parameters of the function, the return value of the function, the calling relationships of the function, and other related information. For each class and function entity, the completeness of its sample code, functional description, and other relationships is verified. For a function whose functional description is missing, an attempt is made to find a comment near the definition of the class or function. Referring then to FIG. 4, for the unstructured information in the API description documents and in the question-and-answer text of GitHub projects, 5 types of entities (project, file, class, function, variable) and 9 types of relations are extracted: use class (file-class), use function (file-function), use variable (file-variable), inherit (class-class), class function (class-function), initialize parameter (class-variable), call (function-function), parameter (function-variable), and return value (function-variable).
A long short-term memory (LSTM) artificial neural network with an added conditional random field (CRF) performs character-level entity recognition on the text to obtain a correctly ordered entity tag sequence. Based on the Chinese vocabulary library and the extracted entity vocabulary, the text is segmented into words and tagged with parts of speech, and dependency syntax analysis is then performed on the segmentation result to generate a dependency syntax tree. Semantic role analysis is performed on the dependency syntax tree to obtain the semantic role label of each word, and relation triples are extracted from the semantic roles of the words according to preset rules to supplement the knowledge. The knowledge graph is constructed from the entities and relations obtained in this way.
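The source-code extraction step described above can be sketched for a single language with Python's built-in `ast` module. This is only an illustration: the specification covers source code in several languages, whereas this example handles Python alone, and the sample source, the file name `demo.py`, and the choice to walk only top-level definitions are assumptions. The relation names follow those listed in the text.

```python
import ast

# Invented source file used only to exercise the extractor.
SOURCE = '''
class Base:
    pass

class Converter(Base):
    def to_text(self, value):
        return str(value)

def run(value):
    return format_value(value)
'''


def extract_triples(source, filename):
    """Walk top-level definitions and emit (head, relation, tail) triples."""
    triples = []
    for node in ast.parse(source).body:
        if isinstance(node, ast.ClassDef):
            triples.append((filename, "use class", node.name))
            for base in node.bases:  # inheritance (class-class)
                if isinstance(base, ast.Name):
                    triples.append((node.name, "inherit", base.id))
            for item in node.body:   # class function (class-function)
                if isinstance(item, ast.FunctionDef):
                    full_name = f"{node.name}.{item.name}"
                    triples.append((node.name, "class function", full_name))
        elif isinstance(node, ast.FunctionDef):
            triples.append((filename, "use function", node.name))
            for arg in node.args.args:  # parameter (function-variable)
                triples.append((node.name, "parameter", arg.arg))
            for sub in ast.walk(node):  # call (function-function)
                if isinstance(sub, ast.Call) and isinstance(sub.func, ast.Name):
                    triples.append((node.name, "call", sub.func.id))
    return triples


triples = extract_triples(SOURCE, "demo.py")
```

Each emitted triple can be stored directly as an edge of the knowledge graph; other languages would need their own parser front end.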

Further, referring to FIG. 5, the models used in mixed-mode knowledge graph API query are trained. First, the query statement classification model is trained: question descriptions and the corresponding accepted answers are collected from Stack Overflow and labeled as regular query questions or irregular query questions, and an LSTM query statement classification model is trained on them so that, once trained, it classifies a user's API query as a regular question or an irregular question. Then the machine learning based query model is trained. The entities and relations in the knowledge graph are converted into two matrices in which each row is the vector of one entity or one relation. The entity vectors and relation vectors are trained with the TransE knowledge graph model, so that the triple relations existing in the knowledge graph are represented by the entity vectors. Furthermore, a GRU query statement mapping model is trained to convert a query statement into the vector of the corresponding API entity node in the knowledge graph.
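To make the TransE step concrete, the following Python sketch shows the scoring function and margin ranking loss as TransE is commonly formulated: a true triple (h, r, t) should satisfy h + r ≈ t, so its distance should be smaller than that of a corrupted triple by at least a margin. The two-dimensional toy vectors and the margin value are assumptions for illustration; the specification does not state the embedding dimension, the distance norm, or the training hyperparameters.

```python
import math


def transe_score(head, relation, tail):
    """L2 distance ||h + r - t||; a lower score means a more plausible triple."""
    return math.sqrt(
        sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail))
    )


def margin_loss(positive, negative, margin=1.0):
    """Margin ranking loss for one (true triple, corrupted triple) pair."""
    return max(0.0, margin + transe_score(*positive) - transe_score(*negative))


# Toy embeddings (assumed values, not trained).
h = [1.0, 0.0]
r = [0.0, 1.0]
t_true = [1.0, 1.0]     # exactly h + r, so the distance is zero
t_corrupt = [4.0, 5.0]  # far from h + r

loss_ok = margin_loss((h, r, t_true), (h, r, t_corrupt))
loss_bad = margin_loss((h, r, t_corrupt), (h, r, t_true))
```

Training would adjust the rows of the entity and relation matrices by gradient descent until this loss is small across the triples in the graph.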

Further, referring to FIG. 6, the user inputs a natural language query statement in the front-end query interface to query API information. The program classifies the query statement as a regular question or an irregular question according to the classification rules. For a regular question, the query statement is segmented into words using the Chinese vocabulary and the entity vocabulary, and the segmentation result is tagged with parts of speech. Dependency syntax analysis is then performed on the word tuple sequence to obtain the dependency relations among the words, and the specific dependency relations that involve words in the entity vocabulary are selected and converted into a knowledge graph query statement according to preset rules. If this query returns no result, API information is retrieved from the knowledge graph with the entity in the query statement as the core to form a candidate set, the similarity between each candidate entity and the query statement is calculated, the API entity with the highest similarity is selected as the query result, and the related API information is returned. For an irregular question, the query statement is mapped to an entity node vector in the knowledge graph vector space by the query statement mapping model, the API entity node is found in the knowledge graph through similarity matching, and the relevant information is returned.
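The similarity matching used for irregular questions can be sketched as a nearest-neighbor lookup in the entity vector space. In this Python example, cosine similarity is an assumed choice of measure (the specification does not name one), and the entity names and vectors are invented; the query vector stands in for the output of the query statement mapping model.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_entity(query_vec, entity_vecs):
    """Return the entity whose vector is most similar to the query vector."""
    return max(entity_vecs, key=lambda name: cosine(query_vec, entity_vecs[name]))


# Hypothetical entity embeddings and a hypothetical mapped query vector.
ENTITY_VECS = {
    "to_text": [1.0, 0.1],
    "parse_int": [0.0, 1.0],
}
best = nearest_entity([0.9, 0.2], ENTITY_VECS)
```

The same function could rank the candidate set in the fallback path for regular questions, given a vector or other numeric representation for each candidate.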

The API information presentation structure is then organized according to the content of the query question. The API-related information is returned to the front end and presented in labeled partitions according to the query question content and the structure of the returned information; the core information answering the question is highlighted according to the semantics of the query question, and the API relationship information is presented visually as an interactive tree diagram.
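One way to prepare relation information for a tree-style display is to group the triples of the answer entity by relation into a nested structure. The Python sketch below assumes a `name`/`children` layout, which many tree visualization components accept but which the specification does not mandate; the entity and triples are invented.

```python
import json


def to_tree(root, triples):
    """Group the root entity's triples by relation into a nested tree."""
    by_relation = {}
    for head, relation, tail in triples:
        if head == root:
            by_relation.setdefault(relation, []).append({"name": tail})
    return {
        "name": root,
        "children": [
            {"name": relation, "children": leaves}
            for relation, leaves in by_relation.items()
        ],
    }


# Hypothetical query result for the entity "to_text".
RESULT_TRIPLES = [
    ("to_text", "parameter", "value"),
    ("to_text", "return value", "text"),
    ("other_function", "call", "to_text"),
]
tree = to_tree("to_text", RESULT_TRIPLES)
payload = json.dumps(tree)  # serialized form that a front end could render
```

Each relation becomes one branch, which lines up with the idea of presenting information in labeled partitions.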

Those skilled in the art can understand that all or part of the processes in the methods according to the embodiments described above can be implemented by instructing relevant hardware by a computer program, where the program can be stored in a computer-readable storage medium or deployed in a cloud server, and when executed, the program can include the processes according to the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a ROM, a RAM, etc.

The foregoing is illustrative of the present application and is not to be construed as limiting thereof. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims

1. An API knowledge graph self-adaptive construction and intelligent question-answering method based on a mixed mode is characterized by comprising the following steps:

2. The method of claim 1, wherein collecting structured API information and unstructured API text information comprises:

taking Internet API documents and local project description documents as sources, acquiring structured API information through the tags of the API documents, and extracting the API entities and the relationships among APIs;

collecting unstructured software code files by taking GitHub project code repositories and local projects as sources, and extracting the API entities and the relationships among APIs through abstract syntax tree analysis;

taking Stack Overflow platform question discussions, GitHub questions and answers, local project source code, and API description documents as sources, and extracting the API entities and the relationships among APIs from unstructured API text information in the form of API question-answer pairs through named entity recognition, dependency syntax analysis, and semantic role recognition.

3. The method of claim 1, wherein the method for adaptively training a mixed-mode API knowledge-graph query model comprises:

training a classification model that classifies query statements into regular query questions and irregular query questions;

and training a word vector matrix of entities and relations in the knowledge graph and a query sentence mapping vector space model.

4. The method of claim 1, wherein the mixed-mode API knowledge graph query method comprises:

completing the classification of query statements through a rule algorithm and a neural network, dividing them into regular query questions and irregular query questions;

and according to the problem classification, selecting a rule template-based method or a machine learning-based method to complete subsequent query.

5. The method of claim 1, wherein constructing a visualization API information presentation page comprises:

building a front-end display interface based on Django, and displaying the API information through an interactive tree diagram and a page with structured partitions.