CN110874431B - JAVA Doc knowledge graph-based multidimensional evaluation recommendation method

Info

Publication number
CN110874431B
Authority
CN
China
Prior art keywords
class
classes
data
java
knowledge graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911142972.3A
Other languages
Chinese (zh)
Other versions
CN110874431A (en)
Inventor
贾力
杨明
高提雷
杨棣
解婉誉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yunnan University of Finance and Economics
Kunming Metallurgy College
Original Assignee
Yunnan University of Finance and Economics
Kunming Metallurgy College
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yunnan University of Finance and Economics, Kunming Metallurgy College
Priority to CN201911142972.3A
Publication of CN110874431A
Application granted
Publication of CN110874431B
Current legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9532 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology

Abstract

The invention discloses a multidimensional evaluation recommendation method based on a Java Doc knowledge graph, and belongs to the technical field of database recommendation. The method builds a Java class knowledge graph by crawling and parsing Java Doc documents; analyzes the path relations between entities according to the relations between each class and external entities, establishes a recommendation function to mine the data, and determines a recommendation domain from the relations between classes; clusters the classes with a K-means-based clustering method and uses the clustering result as a supplement to the recommendation domain mined from the knowledge graph; and performs multidimensional quantitative scoring on the classes in the recommendation domain, performs similarity quantitative scoring on the multi-dimensional isomorphic models, and finally returns the quantitative scores of the classes in the recommendation domain to the user. The invention lets users discover new Java built-in classes and understand the differences between them, so that they can choose more appropriate API functions.

Description

JAVA Doc knowledge graph-based multidimensional evaluation recommendation method
Technical Field
The invention belongs to the technical field of database recommendation, and particularly relates to a multi-dimensional evaluation recommendation method based on a JAVA Doc knowledge graph.
Background
A knowledge graph is a knowledge base that Google uses to enhance its search engine. Its main feature is that it can express formalized knowledge and establish an associated topological structure between pieces of knowledge. The basic unit of a knowledge graph is an entity-relationship-entity RDF (Resource Description Framework) triple. A large number of knowledge graphs have emerged, of which KnowItAll, YAGO, DBpedia, Freebase, NELL and Probase are representative. Because the data is stored as a graph structure, operations such as search are no longer limited to string matching but draw on the semantics and contextual connections of entities. Knowledge reasoning over the knowledge graph makes the internal relations between entities easier to discover.
Syntax and semantics define a language, while application programming interfaces (APIs) make the language easier to use; most current software relies on APIs. Java is an object-oriented programming language used in many applications. Its simple syntax, stable performance, and support for encapsulation, inheritance and polymorphism make software easier to maintain and extend. The continually updated Java software development kit (JDK) provides many powerful built-in classes for basic data structures and their related operations, and these classes are heavily reused in software development. The JDK plays an irreplaceable role in software development; as it is updated, the Java built-in classes become more diverse and the function of each one becomes more specific and targeted. Because the JDK contains hundreds of built-in data structures and interfaces, and the relations between classes and interfaces are very complex, people have neither the opportunity nor the ability to learn the relations and differences between classes one by one, and many more suitable Java built-in classes and data structures go undiscovered and unused. Users mostly pick the built-in classes they are familiar with, and these are often used in unsuitable situations.
Therefore, a multidimensional evaluation recommendation method based on the Java Doc knowledge graph is needed, so that users can discover new Java built-in classes and understand the differences between them, choose more appropriate API functions, and make full use of the built-in classes; the method also helps Java API developers to comprehensively analyze and improve their own code bases.
Disclosure of Invention
The invention aims to provide a multidimensional evaluation recommendation method based on a Java Doc knowledge graph, so that users can discover new Java built-in classes, see clearly how the built-in classes differ, and choose more appropriate API functions.
The technical scheme adopted by the invention is a multidimensional evaluation recommendation method based on a Java Doc knowledge graph, characterized by comprising the following steps:
S1, building a Java class knowledge graph by crawling and parsing the Java Doc documents;
S2, according to the established Java Doc knowledge graph, analyzing the path relations between entities through the relations between each class and external entities, establishing a recommendation function to mine the data, and determining a recommendation domain through the relations between classes;
S3, clustering the classes with a text-based K-means clustering method, and using the clustering result as a supplement to the recommendation domain mined from the knowledge graph;
S4, performing multidimensional quantitative scoring on the classes in the recommendation domain, using scoring criteria based on the characteristics of the Java language;
S5, establishing a comprehensive quantitative scoring model from the scores of the items in S4, performing similarity quantitative scoring on the multi-dimensional isomorphic models, scoring the candidates of each recommendation domain, and returning the quantitative scores of the classes in the recommendation domain to the user.
Further, building the Java class knowledge graph in S1 includes the following steps:
S11, data extraction: crawling the data in the HTML files of Java Doc with the Beautiful Soup toolkit in Python; specifically, the internal data is obtained by crawling the head tag, the implemented-interface tag and the direct-subclass tag;
S12, data normalization: screening the crawled data; specifically, the acquired items are filtered so that only items with at least one non-empty attribute are kept as the data set used in the experiment;
S13, establishing an RDF model: analyzing the data, establishing an RDF (Resource Description Framework) model of the data, converting the data into RDF, and storing the RDF in a graph database, with Neo4j as the storage medium;
S14, data visualization: visualizing the knowledge graph after the data has been stored in the database; specifically, during data import the py2neo toolkit is used to connect the program to the database and to operate on the database, and the data is stored in the Neo4j database system in the form of RDF.
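The screening rule of S12 can be illustrated with a minimal sketch. The record layout (a `name` key plus attribute keys) and the `Placeholder` entry are assumptions made for illustration; the source does not specify how crawled items are represented.

```python
def keep_nonempty(items):
    """Keep crawled items that have at least one non-empty attribute."""
    kept = []
    for item in items:
        # The name is an identifier, not an attribute, so it is ignored here.
        attributes = {key: value for key, value in item.items() if key != "name"}
        # An attribute counts as empty when it is falsy (None, "", [], ...).
        if any(attributes.values()):
            kept.append(item)
    return kept


# Hypothetical crawl results: the second record has only empty attributes.
crawled = [
    {"name": "ArrayList", "interfaces": ["List", "RandomAccess"], "subclasses": []},
    {"name": "Placeholder", "interfaces": [], "subclasses": []},
]

print([item["name"] for item in keep_nonempty(crawled)])
```

Only the `ArrayList` record survives the filter, because it is the only one with a non-empty attribute.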
Further, the scoring criteria in S4 are based on the following three dimensions:
1) the relation between classes and interfaces, measured by the PageRank algorithm;
2) the closeness centrality between classes in the knowledge graph;
3) the betweenness centrality of classes in the knowledge graph.
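Two of these three dimensions can be sketched in plain Python on a small fragment of the java.util hierarchy. This is an illustration under stated assumptions, not the scoring model of the invention: the edge direction (child to parent), the damping factor 0.85, and the choice to compute closeness on the undirected graph are all assumptions, and betweenness centrality is left out for brevity.

```python
from collections import deque

# Assumed sample graph: each class points to the class it extends.
edges = {
    "ArrayList": ["AbstractList"],
    "Vector": ["AbstractList"],
    "AbstractSequentialList": ["AbstractList"],
    "LinkedList": ["AbstractSequentialList"],
}


def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank on a directed graph given as adjacency lists."""
    nodes = sorted(set(graph) | {t for targets in graph.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = graph.get(node, [])
            # A node without outgoing edges spreads its rank over all nodes.
            receivers = targets if targets else nodes
            share = damping * rank[node] / len(receivers)
            for receiver in receivers:
                new_rank[receiver] += share
        rank = new_rank
    return rank


def closeness(graph, source):
    """Closeness centrality of `source`, treating every edge as undirected."""
    neighbours = {}
    for node, targets in graph.items():
        for target in targets:
            neighbours.setdefault(node, set()).add(target)
            neighbours.setdefault(target, set()).add(node)
    distance = {source: 0}
    queue = deque([source])
    while queue:  # breadth-first search gives shortest path lengths
        node = queue.popleft()
        for other in neighbours.get(node, ()):
            if other not in distance:
                distance[other] = distance[node] + 1
                queue.append(other)
    total = sum(distance.values())
    return (len(distance) - 1) / total if total else 0.0


ranks = pagerank(edges)
print(max(ranks, key=ranks.get))
print(closeness(edges, "AbstractList"), closeness(edges, "ArrayList"))
```

In this sample, `AbstractList` receives the highest PageRank and the highest closeness, which matches the intuition that a widely extended parent class is central in the graph.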
The beneficial effects of the invention are as follows: first, the recommendation algorithm lets users discover new Java built-in classes and understand the differences between them, so that they can choose more appropriate API functions and make full use of the built-in classes; second, beginners gain a more comprehensive view of the Java built-in classes and can learn Java in greater depth; third, Java API developers can better analyze and improve their own code bases and adapt them for future extensions.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a flow chart of the establishment of a code map of Java Doc.
FIG. 2 is a diagram of a method RDF model.
FIG. 3 is a class RDF model diagram.
FIG. 4 is a diagram of an Interface RDF model.
FIG. 5 is a diagram of the knowledge-graph overall RDF structure.
FIG. 6 is a diagram of the structure of the ArrayList class stored in Neo4j.
FIG. 7 is a diagram of the relationship of classes, interfaces, cross-references and extensions within a small area.
Fig. 8 is an overall structure and distribution diagram of a JAVA Doc code graph.
FIG. 9 is a diagram of the recommended domains of ArrayList.
FIG. 10 is a diagram of the recommended domains for ArrayList refinement.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention discloses a multidimensional evaluation recommendation method based on a Java Doc knowledge graph, which comprises the following steps:
S1, building a Java class knowledge graph by crawling and parsing Java Doc documents:
S11, data extraction: crawling the data in the HTML files of Java Doc with the Beautiful Soup toolkit in Python; specifically, the internal data is obtained by crawling the head tag, the implemented-interface tag and the direct-subclass tag;
S12, data normalization: screening the crawled data; specifically, the acquired items are filtered so that only items with at least one non-empty attribute are kept as the data set used in the experiment;
S13, establishing an RDF model: analyzing the data, establishing an RDF (Resource Description Framework) model of the data, converting the data into RDF, and storing the RDF in a graph database, with Neo4j as the storage medium;
S14, data visualization: visualizing the knowledge graph after the data has been stored in the database; specifically, during data import the py2neo toolkit is used to connect the program to the database and to operate on the database, and the data is stored in the Neo4j database system in the form of RDF.
S2, according to the established Java Doc knowledge graph, analyzing the path relations between entities through the relations between each class and external entities, and establishing a recommendation function to mine the data; the recommendation domain is determined by the relations between classes.
S3, clustering the classes with a text-based K-means clustering method, and using the clustering result as a supplement to the recommendation domain mined from the knowledge graph.
S4, quantitatively scoring the candidates in the recommendation domain; based on the properties of the Java language, the scoring criteria cover three dimensions: 1) the relation between classes and interfaces, measured by the PageRank algorithm, 2) the closeness centrality (Closeness Centrality) between classes in the knowledge graph, and 3) the betweenness centrality (Betweenness Centrality) of classes in the knowledge graph; the three scoring items are combined to give a multidimensional quantitative score for the classes in the recommendation domain.
S5, establishing a comprehensive quantitative scoring model from the scores in S4, and performing similarity quantitative scoring on the multi-dimensional isomorphic models to score the candidates of each recommendation domain. The quantitative scores of the classes in the recommendation domain are returned to the user.
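The text-based K-means clustering named in S3 can be sketched in plain Python. The bag-of-words features, the Euclidean distance, the deterministic choice of the first k vectors as initial centroids, and the four short descriptions are all assumptions for illustration; the source does not state which text features or initialization the invention uses.

```python
import math
from collections import Counter

# Made-up one-line descriptions, not the actual Java Doc text.
descriptions = {
    "ArrayList": "resizable array list implementation",
    "HashMap": "hash table map implementation with keys",
    "LinkedList": "doubly linked list implementation",
    "TreeMap": "sorted tree map with keys",
}


def vectorize(texts):
    """Turn each text into a bag-of-words count vector over a shared vocabulary."""
    vocabulary = sorted({word for text in texts for word in text.lower().split()})
    vectors = []
    for text in texts:
        counts = Counter(text.lower().split())
        vectors.append([float(counts[word]) for word in vocabulary])
    return vectors


def kmeans(vectors, k, iterations=20):
    """Plain K-means; the first k vectors serve as the starting centroids."""
    centroids = [list(vector) for vector in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach every vector to its nearest centroid.
        for i, vector in enumerate(vectors):
            labels[i] = min(range(k), key=lambda c: math.dist(vector, centroids[c]))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(column) / len(members) for column in zip(*members)]
    return labels


names = list(descriptions)
labels = kmeans(vectorize(list(descriptions.values())), k=2)
print(dict(zip(names, labels)))
```

With these sample texts the two list classes end up in one cluster and the two map classes in the other, so a class such as `LinkedList` could be offered as a supplement to the candidates found for `ArrayList`.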
JAVA Doc-based code map establishment and JAVA Doc source data analysis
Java Doc consists of static web pages in HTML format. After analyzing Java Doc, the invention acquires its content with the Beautiful Soup toolkit in Python; Beautiful Soup is a toolkit that can extract information from HTML or XML files. The invention selects the java.util package in Java Doc as the data source. The java.util package is a utility package used heavily in software development; it contains the key data structures such as arrays, queues and hash tables, together with related tools. The HTML documents detail the role of each class, the methods the class contains, and the functional description, parameter description and so on of each method. They also record the relations of a class to other external classes, such as inheritance or the implementation of an interface.
By locating and traversing the tags in the HTML files, the crawler obtains the description of the class, the name of the class, all methods contained in the class, and the description, declaration, parameters, return values and exceptions of each method, i.e., all data contained in the HTML files.
1. The invention acquires the class name and the relations between the class and external entities; this data is obtained by crawling the head tag in Java Doc together with the <dt> labels "Direct Known Subclasses" and "All Implemented Interfaces".
2. The invention acquires the internal methods of each class and their attributes: the crawler obtains the description of the class, the name of the class, all methods contained in the class, and the description, declaration, parameters, return values and thrown exceptions of each method, i.e., all data entities contained in the HTML document.
Data normalization
The data acquired in the previous step is raw data that contains many impurities and much noise, as well as a large amount of repeated data, so it needs to be filtered.
The impurities fall into the following cases:
1. The declaration of a class repeats the information about implemented interfaces, the inherited parent class and the owned child classes. For example, the interfaces and parent class of the ArrayList class appear in both the class declaration and the class introduction sections, and the two occurrences are not consistent. For this case, the invention records each class or interface name only once, however many times it appears.
2. Because tags repeat in the HTML document, locating the content at a specific position can fail or can match several times. For this case, a regular expression is used to screen out all the tags, and the tags that are matched several times because of nesting are then filtered. An example is <dt><dl><dt>…</dt></dl></dt>, where a query for <dt> tags matches <dt> more than once.
3. Some functions have no parameters or exceptions, while others have several. Each parameter and exception appears in the HTML document as a tag, an indefinite number of times and at an indefinite position; a specific tag may appear many times while the parameter tag does not appear at all. In this case it is not possible to tell from a <dd> tag alone which heading its content belongs to.
Therefore, the invention captures the heading label together with the content, records the most recent heading, stores each piece of content in the array that corresponds to that heading, and processes the data with a filter.
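The "remember the last heading" idea can be sketched as follows. To keep the example dependency-free it uses the standard-library `html.parser` module rather than Beautiful Soup, which is a substitution made here, and the HTML fragment is a simplified stand-in for a Java Doc method detail block rather than a verbatim copy.

```python
from html.parser import HTMLParser


class DefinitionCollector(HTMLParser):
    """Group the text of each <dd> under the most recently seen <dt> heading."""

    def __init__(self):
        super().__init__()
        self.sections = {}         # heading -> list of <dd> texts
        self.current_title = None  # last heading seen so far
        self.open_tag = None       # "dt" or "dd" while inside one of them
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self.open_tag = tag
            self.buffer = []

    def handle_data(self, data):
        if self.open_tag:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self.open_tag:
            return  # closing tag of a nested element such as <span> or <code>
        text = "".join(self.buffer).strip()
        if tag == "dt":
            self.current_title = text.rstrip(":")
            self.sections.setdefault(self.current_title, [])
        elif self.current_title is not None:
            self.sections[self.current_title].append(text)
        self.open_tag = None


# Simplified sample fragment (assumed layout).
SAMPLE = """
<dl>
<dt><span>Parameters:</span></dt>
<dd><code>index</code> - index of the element to replace</dd>
<dd><code>element</code> - element to be stored at the specified position</dd>
<dt><span>Returns:</span></dt>
<dd>the element previously at the specified position</dd>
<dt><span>Throws:</span></dt>
<dd><code>IndexOutOfBoundsException</code> - if the index is out of range</dd>
</dl>
"""

collector = DefinitionCollector()
collector.feed(SAMPLE)
for title, entries in collector.sections.items():
    print(title, entries)
```

Because every `<dd>` is filed under the last heading seen, a method with two parameters and one exception is handled the same way as a method with none.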
Data error correction
Because of the way references are written in the Java Doc documents, class names appear in inconsistent forms. For example, ArrayList<T> and ArrayList point to the same class. For this case, the invention records all related class names; when <T> occurs, the first part of the class name is compared using a regular expression, and matching names are unified under the same class name, as in formula (1):
CN={X|X∈prePart(CN)} (1)
where CN is the class name to be matched, X is the group of classes to be matched, and prePart is the class matching function.
In the data crawl result, the first row holds the class name, followed by the interfaces implemented by the class and its direct subclasses, and then the super-interfaces and the extended class (which are empty for ArrayList). The description of the class and all the method data come last.
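The class-name correction of formula (1) can be sketched with a regular expression that strips generic type parameters. The function name `pre_part` mirrors prePart from the formula; its implementation and the grouping helper are assumptions, since the source gives only the formula.

```python
import re

# Matches a trailing generic parameter list such as "<T>" or "<K,V>".
GENERIC_SUFFIX = re.compile(r"\s*<.*>\s*$")


def pre_part(class_name):
    """Return the part of a class name before any generic type parameters."""
    return GENERIC_SUFFIX.sub("", class_name.strip())


def group_by_class(names):
    """Collect all spellings that refer to the same class under one key."""
    groups = {}
    for name in names:
        groups.setdefault(pre_part(name), []).append(name)
    return groups


print(group_by_class(["ArrayList<T>", "ArrayList", "HashMap<K,V>", "ArrayList<E>"]))
```

All three spellings of `ArrayList` land under the same key, so later steps can treat them as one class.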
RDF model establishment
The knowledge graph uses RDF as its most basic data structure for storing data and builds a knowledge network from RDF. RDF data can be represented as a labeled graph in which nodes correspond to the subjects and objects of triples and predicates correspond to edges. The components of an RDF triple can be written as:
s∈R∪B, p∈R, o∈R∪B∪L (2)
where s is the subject, p is the predicate, o is the object, R is the set of URIs, B is the set of blank nodes, and L is the set of literals (textual descriptions); an RDF triple can then be represented as:
RDF={X|X∈(s,o,p)} (3)
i.e., a triple is a vector with s and o as vertices and p as the directed edge.
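The view of triples as a labeled directed graph can be sketched with plain tuples. The predicate names `extends`, `implements` and `describedBy` are hypothetical labels chosen for this example, and the tuples are written in (subject, predicate, object) order.

```python
from collections import defaultdict

# Each triple is (subject, predicate, object); the object may be a literal.
triples = [
    ("ArrayList", "extends", "AbstractList"),
    ("ArrayList", "implements", "List"),
    ("ArrayList", "describedBy", "Resizable-array implementation of the List interface."),
]


def to_graph(triples):
    """Build adjacency lists: subject -> [(predicate, object), ...]."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph


def objects(graph, subject, predicate):
    """Follow every edge labeled `predicate` that leaves `subject`."""
    return [obj for edge, obj in graph[subject] if edge == predicate]


graph = to_graph(triples)
print(objects(graph, "ArrayList", "extends"))
```

Each subject is a vertex, each predicate is the label of a directed edge, and the object is the vertex (or literal) the edge points to.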
Because the knowledge graph is a directed graph established by taking RDF as a framework, the invention needs to model the extracted JAVA Doc data and place the JAVA Doc data into a model described by the RDF.
1. Establishing a class model:
the establishment of the class model is divided into two parts, wherein the first part is the establishment of the relation model of the class and the internal components of the class, and the second part is the establishment of the class and the external relation model of the class.
First part: a Java class is an entity that has both attributes and methods, and the attributes serve the methods. The internal relations of a Java class can therefore be expressed as the relations between the class and its internal methods. Each method has its own parameters, return value, exceptions and function declaration, and these are also contained within the class.
To keep the data hierarchy clear, the invention takes the smallest unit in the class as a node, rather than the entire class or the entire method. Each piece of data can then have its own attributes and content, and the model becomes a 1-N relationship structure.
Data model for the methods of a class: in building the model, the invention needs to define the label, attributes and content of each node, and the label and content of each relationship between nodes. Each method has a function declaration, parameters, parameter descriptions, a function description, a return value, exceptions, and so on. The method can then be expressed as:
[Formula (4): image BDA0002281452700000071 in the original publication, not reproduced in this text]
Based on the above analysis, the invention establishes a separate node for each data unit; the function description and the parameter descriptions are merged into the function node and the corresponding parameter nodes as attribute values. On this basis, the invention models the method data as shown in FIG. 2.
After the data model of the class methods is built, the model of the class body is built. The properties of a class in Java Doc are: the functional description of the class, the name of the class, the package in which it is located, the inheritance relation to its parent class, the implementation relations to its interfaces, the inheritance relations to its subclasses, and so on. Because the invention first models the internal relations of the Java class, the class can be expressed by the following formula:
[Formula (5): image BDA0002281452700000072 in the original publication, not reproduced in this text]
the present invention initially models the interior of Class according to the attributes contained in Class.
Second part: after modeling the interior of the class, the invention models the external relations between the class node and other nodes. In Java, the relations between classes are parent-class (inheritance) and subclass relations, and the relations between classes and interfaces are implementation and extension. These relations are defined by the following formulas:
X:Class A,Y:Class B,Z:Class C,I:Interface
[Formula (6): image BDA0002281452700000073 in the original publication, not reproduced in this text]
Here the parent-child relation is bidirectional: if X is the parent of Y, then Y is a child of X. To reduce the time complexity of later retrieval, the invention reduces the complexity of each node by modeling the external relations between a class node and other nodes as unidirectional vectors, i.e., formula (7):
[Formula (7): image BDA0002281452700000074 in the original publication, not reproduced in this text]
according to the formula, the invention constructs a class model, and adds the relationship between classes and interfaces into RDF modeling of class, as shown in FIG. 3.
2. Establishing an interface model:
The interface model is very similar to the class model, except that the relation between interfaces is defined as "super interface", i.e., the inheritance relation between interfaces. Because the invention designs all relations as unidirectional vectors, the relations between classes and interfaces have already been defined in the class section. When the interface model is established, the invention only needs to consider the data inside the interface, including its description, its methods, the data inside the methods, and the relations between interfaces.
The interface node is represented as:
[Formula (8): image BDA0002281452700000081 in the original publication, not reproduced in this text]
the relationship of the interface comprises the super-interface relationship of the interface and the interface, and the realization relationship of the interface and the class.
As in equation 9:
X:Interface A,Y:Interface B,I:Class A
[Formula (9): image BDA0002281452700000082 in the original publication, not reproduced in this text]
wherein, X represents interface A, Y represents interface B, and I represents class A.
According to the formula (8) and the formula (9), the interface node is modeled by the present invention, and the modeling is shown in fig. 4.
3. Establishment of integral RDF structure model
RDF modeling has so far been carried out on each element and node of the knowledge graph individually, but the nodes have not yet been combined. To better analyze the efficiency and integrity of the overall model, the individual models are combined into a whole RDF structure and analyzed.
In the knowledge graph established by the invention, the main node types are Class, Interface and Method. The secondary node types are Description, Parameter, return and throw. The relationship types are include, described by, has Super Interfaces, has sub classes, and instantiated Interfaces.
And establishing a whole knowledge graph RDF structure diagram for all the elements, as shown in FIG. 5.
Storing code knowledge graph and knowledge graph visualization
After RDF modeling, the acquired Java Doc data is stored in a knowledge base according to the RDF model. A knowledge graph is graph-structured data, and a graph database matches RDF data well, so the data can be stored without any change of format. After comparison, the invention selects Neo4j as the storage database of the knowledge graph. Neo4j follows the property graph data model, whose primitives are nodes, relationships and properties. It can represent all entities and the relations among them through the RDF structure, and data nodes can be flexibly modified, deleted and added. Through graph pattern matching, Neo4j quickly locates the subgraph being searched for and finds the required nodes by depth-first and breadth-first traversal of that subgraph, and it fully supports ACID transactions.
Since the data is crawled by a Python crawler program, the py2neo toolkit is used during data import to connect the program to the database and to operate on the database. The data import process is itself the process of building the knowledge graph, and the invention stores the data in the Neo4j database system in the form of RDF. To avoid duplicate nodes and lost relationships, the invention divides the construction of the knowledge graph into two parts: 1. establishing independent nodes, and 2. establishing the relationships among the nodes.
First traversal of the data: independent nodes are established for all classes and interfaces. Because each method belongs to a specific class, many methods with the same name exist across classes. The ownership relation between a class and its methods is therefore also created during the first traversal, together with the class and interface nodes, so that methods with the same name in different classes are associated with the correct class.
Because overloaded methods with the same name can exist within one class, the invention creates separate nodes for such methods that share a name but have different node.id values (node.id is the unique identifier of each node in the database).
Therefore, in the first traversal the invention establishes the class or interface nodes, the method nodes and the member data nodes of each method, and associates the methods with their classes and interfaces. The storage structure of the ArrayList class node in the database is shown in FIG. 6.
Second traversal of the data: the first traversal established nodes for a total of 117 classes and interfaces in the java.util package. In the second traversal, the invention establishes the relations between classes, between classes and interfaces, and between interfaces. Because the relations between nodes are designed as unidirectional vectors, only one direction needs to be considered when parent-child class relations are processed, and the invention chooses to express the child-class relation. The ArrayList class and its neighbors are selected to show the relations between classes, between classes and interfaces, and between interfaces within a small area, as shown in FIG. 7.
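The two-pass construction can be sketched with a tiny in-memory graph. This is a stand-in for the graph database and does not use the py2neo API; the record layout, the class `CodeGraph`, and the edge directions are assumptions, while the relationship names are taken from the list of relationship types given earlier. The sample records are truncated (for instance, the real parent of AbstractList is omitted).

```python
import itertools


class CodeGraph:
    """Tiny in-memory stand-in for the graph database (not the py2neo API)."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self.nodes = {}     # node id -> (label, name)
        self.type_ids = {}  # class or interface name -> node id
        self.edges = set()  # (source id, relation, target id)

    def add_type(self, label, name):
        # Reuse the existing node so a class or interface is never duplicated.
        if name not in self.type_ids:
            node_id = next(self._next_id)
            self.nodes[node_id] = (label, name)
            self.type_ids[name] = node_id
        return self.type_ids[name]

    def add_method(self, owner_id, name):
        # Every method gets a fresh id, so overloads sharing a name stay distinct.
        node_id = next(self._next_id)
        self.nodes[node_id] = ("Method", name)
        self.edges.add((owner_id, "include", node_id))
        return node_id

    def relate(self, source, relation, target):
        self.edges.add((self.type_ids[source], relation, self.type_ids[target]))


# Hypothetical, truncated crawl records.
records = [
    {"label": "Interface", "name": "List", "methods": ["add"],
     "parent": None, "interfaces": []},
    {"label": "Class", "name": "AbstractList", "methods": ["add"],
     "parent": None, "interfaces": ["List"]},
    {"label": "Class", "name": "ArrayList", "methods": ["add", "add"],
     "parent": "AbstractList", "interfaces": ["List"]},
]

graph = CodeGraph()

# Pass 1: independent type nodes plus the method nodes they own.
for record in records:
    owner = graph.add_type(record["label"], record["name"])
    for method in record["methods"]:
        graph.add_method(owner, method)

# Pass 2: relationships, each stored in one direction only.
for record in records:
    if record["parent"]:
        graph.relate(record["parent"], "has sub classes", record["name"])
    for interface in record["interfaces"]:
        graph.relate(record["name"], "instantiated Interfaces", interface)

print(len(graph.nodes), len(graph.edges))
```

Creating all type nodes before any relationship means the second pass can never refer to a missing node, and the two `add` overloads of `ArrayList` remain separate nodes with the same name.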
To this end, the present invention has completed building the entire knowledge-graph and stored the graph in the Neo4j database. The overall code knowledge graph is shown in fig. 8.
After the knowledge graph is visualized, the invention can carry out intuitive analysis on the knowledge graph.
The complete knowledge graph shows that the references between classes and interfaces are quite complex and irregular, yet these relations are the threads that run through the whole API function library. The invention aims to mine this data with the knowledge graph through knowledge reasoning. The pseudocode of the knowledge graph construction algorithm is as follows:
establishing a knowledge graph algorithm:
the method is divided into two parts: 1. establishing class nodes and internal method nodes, 2. establishing the relation between class and external nodes
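[Pseudocode of the knowledge graph construction algorithm: image BDA0002281452700000101 in the original publication, not reproduced in this text]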
Similar built-in recommendation algorithm based on JAVA Doc code map
1. Determining class recommendation domains
Because the set of Java API classes is very large, scoring every class for recommendation would waste computation time, and many classes have no connection with each other at all. The recommendation function therefore needs a limited candidate set. The invention determines a recommendation domain of similar functionality according to the name input by the user, i.e., it determines a set T:
T={X|X∈relate(input Class)} (10)
To discover similarities between entities, it is necessary to find the associations between entities or the similarities inherent in the entities themselves. Because an entity in a knowledge graph carries little information by itself and most information is distributed along the paths of the network, i.e., the edges of the graph, the invention determines the recommendation domain through the external relations of the entity.
In the Java language, associations between classes take two forms: 1. inheritance, 2. composition. Inheritance is an important feature of object-oriented programming. With inheritance, a new class can be created on the basis of an existing class. The new class has, and can access and modify, the attributes of the original class, can override the methods defined in the original class (except the member variables and methods declared with the private keyword), and can also add attributes and methods of its own. Through inheritance, the child class fully acquires the attributes and methods of the parent class, so the subclasses of the same parent class have similar functions and nature, and an inheriting class is constrained by the attributes of its parent class. Classes with similar functions and attributes can therefore be found through inheritance relations. On this basis the invention defines a relationship function:
Figure BDA0002281452700000111
Here "→" denotes an inheritance relationship. One complication is that classes may be inherited over multiple levels. Analysis shows that class2 and class3 both inherit from the parent class1, so class2 and class3 have similar functions and attributes. Class4 in turn inherits from class2, at which point class4 should be incorporated into the similar domain of class2. Class5 is the parent of class1, at which point class5 should be incorporated into the similar domain of class2. To ensure the comprehensiveness of the recommendation domain, the invention incorporates only the subclasses one level below the input class and the parents one level above it, namely:
Figure BDA0002281452700000112
At this point, the recommendation domain includes the classes one level away from the input class: the parent class, the sibling classes that share that parent, and the direct subclasses one level down. Taking ArrayList as an example, when ArrayList is input the invention needs to find the recommendation domain of ArrayList, as shown in FIG. 9. The invention obtains the recommendation domain by searching for the sibling nodes of ArrayList and the immediate parent and child nodes of ArrayList.
Since ArrayList has no children, the graph includes only the parent node AbstractList and the sibling nodes Vector and AbstractSequentialList. However, analysis shows that LinkedList is also a class whose functions are very similar to those of ArrayList, so the recommendation domain obtained according to equation (12) is incomplete. On comparison, the invention finds that LinkedList is in fact a direct subclass of AbstractSequentialList among the built-in classes.
2. Improvements to class recommendation domains
Since the recommendation domain is found to be incomplete, the invention improves the recommendation domain determination algorithm twice: once within the algorithm itself, and once by supplementing its results with those of an additional algorithm.
1. An improvement internal to the recommendation domain determination algorithm:
The invention first improves the recommendation domain determination algorithm itself. Since LinkedList is found to be a direct subclass of a sibling node, the invention also incorporates all direct subclasses of the nodes under the same parent into the recommendation domain, that is:
Figure BDA0002281452700000121
The invention incorporates class_1, class_3, class_5 and class_4 into the recommendation domain of class_2. This makes the recommendation domain more comprehensive. The improved recommendation domain of ArrayList is shown in FIG. 10.
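The domain construction described above can be sketched in Python as follows. This is a minimal illustration, not the patent's own algorithm listing: the edge list `EXTENDS` is a hypothetical excerpt of the inheritance graph, and the function name `recommendation_domain` and the flag `include_sibling_children` are assumptions introduced here. With the flag off, the domain holds the parent, the siblings and the direct subclasses; with it on, the direct subclasses of the siblings are added as in the first improvement.

```python
from collections import defaultdict

# Hypothetical excerpt of (child, parent) inheritance edges around ArrayList.
EXTENDS = [
    ("AbstractList", "AbstractCollection"),
    ("ArrayList", "AbstractList"),
    ("Vector", "AbstractList"),
    ("AbstractSequentialList", "AbstractList"),
    ("LinkedList", "AbstractSequentialList"),
    ("Stack", "Vector"),
]


def build_index(edges):
    """Index the edges as child -> parent and parent -> set of children."""
    parent_of = {}
    children_of = defaultdict(set)
    for child, parent in edges:
        parent_of[child] = parent
        children_of[parent].add(child)
    return parent_of, children_of


def recommendation_domain(cls, edges, include_sibling_children=False):
    """Collect parent, siblings and direct subclasses of cls.

    With include_sibling_children=True, the direct subclasses of each
    sibling are added as well (the first improvement in the text).
    """
    parent_of, children_of = build_index(edges)
    domain = set(children_of.get(cls, ()))      # direct subclasses
    siblings = set()
    parent = parent_of.get(cls)
    if parent is not None:
        domain.add(parent)                      # direct parent
        siblings = children_of[parent] - {cls}  # classes sharing the parent
        domain |= siblings
    if include_sibling_children:
        for sibling in siblings:
            domain |= children_of.get(sibling, set())
    return domain


basic = recommendation_domain("ArrayList", EXTENDS)
improved = recommendation_domain("ArrayList", EXTENDS, include_sibling_children=True)
print(sorted(basic))
print(sorted(improved))
```

In this toy graph the basic domain misses LinkedList, while the improved domain reaches it through AbstractSequentialList.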
After the first improvement, the recommendation domain becomes more complete. The invention selects several commonly used classes and compares their recommendation domains before and after the improvement; the comparison result is shown in Table 1.
TABLE 1 recommended Domain Algorithm improvement comparison
Figure BDA0002281452700000122
As can be seen from Table 1, the improvement gathers more similar classes together. However, by comparing LinkedList with ArrayList, Vector and the like, the invention finds that the recommendation domain of LinkedList should also contain ArrayList, Vector and so on, yet it contains only the parent node of LinkedList. This is because the algorithm explores only sibling nodes and nodes further down, while the grandparent and the child nodes branching from the grandparent cannot be reached. The invention cannot determine how many levels should be explored upward, because all Java built-in classes share a common ancestor, the Object class; if the exploration continued upward, all classes of the entire graph would be returned. Second, some classes are functionally similar but no path connects them (for example, there is no path between the two similar classes HashMap and Hashtable). The invention therefore extends the algorithm externally to address these problems.
2. External algorithm-assisted augmentation for recommended domains
The essence of determining the recommendation domain is to find a set of classes of a similar kind, as in equation (10). Apart from the associations between entities, the similarity between classes can only be determined from the attributes of the classes themselves. The most important attribute of a class is its methods, and the similarity of classes is the similarity of the functions realized by their methods. On this basis, the invention selects a clustering algorithm to cluster the classes by the methods each class contains. Clustering is an unsupervised machine learning method that analyzes and explores the internal connections and nature of things by grouping similar things together. There are many kinds of clustering algorithms, the most important of which are density-based clustering, clustering based on the distance between feature vectors, hierarchical clustering, and the like. The data used in the invention is the text data of JAVA Doc, which contains the natural language descriptions of all methods. The invention therefore extracts text feature vectors and uses the K-means algorithm to carry out the cluster analysis.
The K-means clustering algorithm is a partition-based clustering algorithm. The data is divided into K subsets, and data with similar characteristics are placed in the same subset. Its main objective function is shown in formula (14):
X={x1,x2,...xn}
C={ck|K=1,2,3...n}
Figure BDA0002281452700000131
Figure BDA0002281452700000132
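As an illustration of the partitioning just described, the following is a minimal pure-Python sketch of the standard K-means iteration (Lloyd's algorithm). It is not taken from the patent: the function names, the random initialization and the toy points are assumptions, and the returned score is the usual within-cluster sum of squared distances, which is assumed here to correspond to the K-means result score used later in the text.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def kmeans(points, k, iterations=100, seed=0):
    """Return (labels, centers, score) after Lloyd iterations.

    score is the within-cluster sum of squared distances (lower is tighter).
    """
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]  # k distinct start points
    labels = None
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centers[j] = mean(members)
    score = sum(squared_distance(p, centers[l]) for p, l in zip(points, labels))
    return labels, centers, score


# Toy data: two well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers, score = kmeans(points, k=2)
print(labels, round(score, 3))
```

In the setting of the text, each point would be the TF-IDF feature vector of one class rather than a 2-D coordinate.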
The invention uses the K-means algorithm to cluster the Java built-in classes. Each class is represented by the natural language descriptions of its methods. The text needs to be abstracted into feature vectors so that the distance between texts can be calculated. The invention adopts the TF-IDF method to extract the feature vector of the text of each class. The TF-IDF algorithm computes the term frequency of each word in the text and weights the word by its inverse document frequency, which yields a feature value for each word. Together these values form the text feature vector of the class.
The TF-IDF formula is:
Figure BDA0002281452700000133
Figure BDA0002281452700000134
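A minimal sketch of TF-IDF feature extraction is given below. Since the patent's formula is only available as an image, the sketch assumes the common textbook variant, tf = count / document length and idf = log(N / document frequency); the function name `tfidf_vectors` and the token lists are hypothetical stand-ins for the preprocessed method descriptions of each class.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs maps a class name to its list of tokens.

    Returns (vocabulary, vectors), where vectors maps each class name to a
    list of TF-IDF weights aligned with the sorted vocabulary.
    Assumed variant: tf = count / len(tokens), idf = log(N / df).
    """
    vocabulary = sorted({t for tokens in docs.values() for t in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for tokens in docs.values() for t in set(tokens))
    idf = {t: math.log(n_docs / df[t]) for t in vocabulary}
    vectors = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against an empty description
        vectors[name] = [counts[t] / total * idf[t] for t in vocabulary]
    return vocabulary, vectors


# Hypothetical token lists after word segmentation and stop-word removal.
docs = {
    "ArrayList": ["append", "element", "list", "resize", "array"],
    "LinkedList": ["append", "element", "list", "link", "node"],
    "HashMap": ["map", "key", "value", "hash", "element"],
}
vocabulary, vectors = tfidf_vectors(docs)
for name, vector in vectors.items():
    print(name, [round(w, 3) for w in vector])
```

A term that occurs in every document, such as "element" here, receives weight zero under this variant, so it does not help to separate the classes.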
To reduce the dimensionality of the feature vectors calculated by TF-IDF, the invention preprocesses the input text. After word segmentation, stop-word removal, special-symbol removal and similar operations, all texts are input into TF-IDF, and a feature vector (878 dimensions in total) is obtained for each text. The feature vector of each class is taken as the input of K-means. The clustering result changes as the value of K changes. Experiments were performed over the interval K = 20 to 50 and compared by K-means score (see Table 2). Four main K values were extracted for comparison.
TABLE 2 K values and clustering results
K value    K-means result score
20         35.86
30         26.8
40         17.05
50         8.85
A lower K-means score indicates tighter aggregation among the class points. By comparing the results, the invention uses the clustering result for K = 40 as the supplement to the class recommendation domain. Because the number of results is too large, not all of them are listed. The comparison of the recommendation domain ranges after the second improvement is shown in Table 3 (5 classes were drawn as samples).
TABLE 3 recommended Domain comparison after second refinement
Figure BDA0002281452700000141
Figure BDA0002281452700000151
After two improvements to the algorithm, the recommendation domain for the input class is comprehensive, and the desired result can be fully included in the candidate domain. However, each class in the recommendation domain differs in its relevance to the input class. To obtain the desired class recommendation from the recommendation domain, the invention establishes a graph-theory-based scoring model for classes. The model scores the similarity of every class in the recommendation domain to the input class and finally gives the user a quantitative result.
The recommended domain determination algorithm is as follows:
Figure BDA0002281452700000152
3. Establishment of the similarity scoring model based on graph theory
In establishing the scoring model, the invention selects three features that make full use of the advantages of the code graph to score the similarity between classes: 1. closeness centrality; 2. betweenness centrality; 3. the PageRank algorithm.
1. Closeness centrality:
Closeness centrality is the average distance from a node to all other nodes in the graph, i.e. formula (16):
Figure BDA0002281452700000161
In the established code graph, a higher closeness centrality indicates that a class is closer to the center of the entire knowledge graph. The invention measures the distance between two classes by comparing their relative centrality distance in the graph, i.e. formula (17):
Figure BDA0002281452700000162
By quantifying this distance, the invention compares classes by their relative positions in the graph. The smaller the relative distance, the more similar the two classes are.
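The following sketch shows one way this comparison could be computed. It rests on assumptions that are not confirmed by the source, because formulas (16) and (17) are only available as images: closeness is taken in its standard form (number of reachable nodes divided by the sum of shortest-path distances), the graph is treated as undirected, and the relative distance Dis(X, Y) is assumed to be the absolute difference of the two closeness values. All function names and the edge list are hypothetical.

```python
from collections import deque


def undirected(edges):
    """Build an undirected adjacency map from (a, b) pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def bfs_distances(adj, source):
    """Shortest-path hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def closeness(adj, node):
    """Standard closeness: reachable nodes / sum of distances (assumed form)."""
    dist = bfs_distances(adj, node)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0


def relative_distance(adj, x, y):
    """Assumed form of Dis(X, Y): absolute difference of closeness values."""
    return abs(closeness(adj, x) - closeness(adj, y))


# Hypothetical excerpt of the code graph.
edges = [
    ("ArrayList", "AbstractList"),
    ("Vector", "AbstractList"),
    ("AbstractSequentialList", "AbstractList"),
    ("LinkedList", "AbstractSequentialList"),
]
adj = undirected(edges)
print(closeness(adj, "ArrayList"))
print(relative_distance(adj, "ArrayList", "Vector"))
print(relative_distance(adj, "ArrayList", "LinkedList"))
```

In this toy graph ArrayList and Vector occupy symmetric positions, so their relative distance is zero, while LinkedList sits one level further from the center.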
2. Betweenness centrality
Betweenness centrality is the number of times a given node is passed through over all paths between any two points in the graph. The more often a node is passed through, the higher its betweenness centrality, i.e. formula (18):
Figure BDA0002281452700000163
For example, given the paths X → Z → N, X → Z → Y and X → Y → M, the betweenness centrality of Z is higher than that of Y. The higher the betweenness centrality, the greater the role the node plays in the graph and the more paths it controls. In the knowledge graph, a class that has a large number of relations plays an important role. The role of a class is therefore assessed quantitatively through its betweenness centrality:
Figure BDA0002281452700000164
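The counting idea behind this measure can be illustrated with the three example paths from the text. The sketch below is a deliberate simplification: it counts how often a node appears as an interior node of an explicitly given list of paths, whereas the usual definition of betweenness centrality considers the shortest paths between all pairs of nodes. The function name `pass_through_counts` is an assumption.

```python
from collections import Counter


def pass_through_counts(paths):
    """Count how often each node occurs strictly inside a path.

    Endpoints are excluded, since a path does not pass through its own
    start or end node.
    """
    counts = Counter()
    for path in paths:
        counts.update(path[1:-1])
    return counts


# The three paths used as the example in the text.
paths = [
    ["X", "Z", "N"],
    ["X", "Z", "Y"],
    ["X", "Y", "M"],
]
counts = pass_through_counts(paths)
print(counts["Z"], counts["Y"])
```

Z is interior to two of the paths and Y to one, which matches the statement that Z has the higher betweenness centrality.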
3. The PageRank algorithm:
The PageRank algorithm is a web page ranking algorithm used by Google's search engine, which borrows the idea of traditional citation analysis: when web page A has a link pointing to web page B, B is considered to receive a contribution score from A, and the size of that score depends on the importance of A. That is, the more important page A is, the higher the contribution that page B receives. In the knowledge graph, nodes play the role of web pages, and the PageRank value of a node can be obtained by analyzing the in-degree and out-degree of the nodes, i.e. formula (20):
Figure BDA0002281452700000165
Using the PageRank algorithm, the invention calculates the PR values of all interfaces in the graph.
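A minimal power-iteration sketch of PageRank over a class-to-interface graph is shown below. The damping factor 0.85, the iteration count, the function name `pagerank` and the handling of nodes without outgoing links (their rank is spread evenly over all nodes) are conventional choices assumed here, not details given in the source. The small "implements" edge list is an illustrative excerpt.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    out_links maps a node to the list of nodes it links to. Nodes with no
    outgoing links distribute their rank uniformly (assumed convention).
    """
    nodes = set(out_links) | {t for targets in out_links.values() for t in targets}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        dangling = sum(rank[v] for v in nodes if not out_links.get(v))
        base = (1 - damping) / n + damping * dangling / n
        new_rank = {v: base for v in nodes}
        for v, targets in out_links.items():
            for t in targets:
                new_rank[t] += damping * rank[v] / len(targets)
        rank = new_rank
    return rank


# Illustrative excerpt: each class links to the interfaces it implements.
implements = {
    "ArrayList": ["List", "RandomAccess", "Cloneable", "Serializable"],
    "LinkedList": ["List", "Deque", "Cloneable", "Serializable"],
    "HashMap": ["Map", "Cloneable", "Serializable"],
}
pr = pagerank(implements)
for name in ("Serializable", "List", "RandomAccess"):
    print(name, round(pr[name], 4))
```

Interfaces implemented by many classes, such as Serializable in this excerpt, end up with a higher PR value than rarely implemented ones such as RandomAccess.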
In Java, the interfaces that a class implements are also an important basis for determining the functionality of the class. The invention therefore analyzes the similarity of classes by analyzing the interfaces that each class implements. Because an interface can be implemented by any class, an interface with more powerful functionality that is implemented by more classes often fails as a feature for distinguishing between classes. The invention therefore evaluates classes using the inverse PR value of each interface. It takes the intersection of the interfaces implemented by a class in the recommendation domain and the interfaces implemented by the input class, and then scores the class on the basis of the shared interfaces.
Formula (21):
Figure BDA0002281452700000171
where X is the class in the recommendation domain to be evaluated, Y is the class input by the user, and I is the intersection of the interfaces implemented by X and Y.
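The interface-based score can be sketched as follows. Because formula (21) is only available as an image, the aggregation used here, the sum of 1 / PR(i) over the shared interfaces I, is an assumption, as are the function name `interface_score` and the PR values, which are made-up numbers chosen for illustration rather than results from the patent.

```python
def interface_score(x_interfaces, y_interfaces, interface_pr):
    """Score candidate X against input Y through their shared interfaces.

    Assumed form: sum of inverse PR values over the intersection I, so that
    rarely implemented (low-PR) interfaces contribute more.
    """
    shared = set(x_interfaces) & set(y_interfaces)
    return sum(1.0 / interface_pr[i] for i in shared)


# Hypothetical PR values for a few interfaces.
interface_pr = {
    "List": 0.05,
    "RandomAccess": 0.02,
    "Cloneable": 0.10,
    "Serializable": 0.20,
    "Map": 0.04,
}

array_list = ["List", "RandomAccess", "Cloneable", "Serializable"]
vector = ["List", "RandomAccess", "Cloneable", "Serializable"]
hash_map = ["Map", "Cloneable", "Serializable"]

vector_score = interface_score(vector, array_list, interface_pr)
hash_map_score = interface_score(hash_map, array_list, interface_pr)
print(vector_score, hash_map_score)
```

Under these assumed values, Vector shares the distinctive interfaces List and RandomAccess with ArrayList and therefore scores well above HashMap, which shares only the widely implemented Cloneable and Serializable.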
At this point, three scoring items have been determined. The invention combines the three items to score the classes in the recommendation domain quantitatively along multiple dimensions. The degree of similarity between a class in the recommendation domain and the input class is determined by the score. The invention therefore builds a model Score() as follows:
Score(X)=βDis(X,Y)+αBe(X)+εPR(X) (22)
where X is a class in the recommendation domain and Y is the class input by the user.
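Formula (22) is a weighted sum of the three scoring items, which can be sketched as below. The weights β, α and ε are not specified in the source, so the default value 1.0, the feature values and the descending sort order are all assumptions made only for illustration.

```python
def score(dis, be, pr, beta=1.0, alpha=1.0, epsilon=1.0):
    """Score(X) = beta * Dis(X, Y) + alpha * Be(X) + epsilon * PR(X)."""
    return beta * dis + alpha * be + epsilon * pr


def rank_candidates(features, **weights):
    """Score every candidate and sort by score, highest first (assumed order).

    features maps a class name to its (Dis, Be, PR) triple.
    """
    scored = {name: score(*values, **weights) for name, values in features.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


# Hypothetical (Dis, Be, PR) values for two candidates of the input class.
features = {
    "Vector": (0.0, 2.0, 85.0),
    "HashMap": (0.25, 1.0, 15.0),
}
ranking = rank_candidates(features)
print(ranking)
```

Since the three items live on different scales, the weights would in practice need to be chosen so that no single item dominates the sum.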
This establishes the multidimensional similarity scoring model based on graph theory. The algorithm pseudo-code is as follows:
Figure BDA0002281452700000172
The above description is only the preferred embodiment of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (2)

1. A multidimensional evaluation recommendation method based on a Java Doc knowledge graph, characterized by comprising the following steps:
S1, building a Java class knowledge graph by crawling and parsing the Java Doc documentation;
S2, according to the established Java Doc knowledge graph, analyzing the path relations between entities through the external relations between classes, establishing a recommendation function to mine the data, and determining a recommendation domain through the relations between classes;
S3, selecting a text-based K-means clustering method to cluster the classes, and taking the clustering result as a supplementary set to the recommendation domain mined from the knowledge graph;
S4, carrying out multidimensional quantitative scoring of the classes in the recommendation domain based on scoring criteria derived from Java language characteristics;
S5, establishing a comprehensive quantitative scoring model from the scores of all items in S4, using the model to score the similarity of each candidate in the recommendation domain quantitatively, and returning the quantitative scores of the classes in the recommendation domain to the user;
the scoring criteria in S4 are based on the following three dimensions:
1) measuring the relations between classes and interfaces through the PageRank algorithm;
using the PageRank algorithm, the PR values of all interfaces in the graph are calculated; in Java, the interfaces that a class implements are also an important basis for determining the functions of the class, so the classes are analyzed through the interfaces that each class implements; the classes are evaluated using the inverse PR value of each interface; the intersection of the interfaces implemented by a class in the recommendation domain and the interfaces implemented by the input class is taken, and the class is then scored on the basis of the shared interfaces;
2) the closeness centrality between classes in the knowledge graph;
3) the betweenness centrality of classes in the knowledge graph;
wherein,
the closeness centrality refers to the average distance between a given node and all other nodes in the graph; the distance between two classes is measured by comparing the relative centrality distance of the two classes in the graph, and the classes are compared by their relative positions in the graph by quantifying this distance; the smaller the relative distance, the more similar the two classes are;
the betweenness centrality refers to the number of times a given node is passed through over all paths between any two points in the graph; the more often the node is passed through, the higher its betweenness centrality.
2. The method as claimed in claim 1, wherein establishing the Java class knowledge graph in S1 includes the following steps:
S11, data extraction: crawling the data in the HTML-format JavaDoc files using the Beautiful Soup toolkit in Python; specifically, the internal data is obtained by crawling the head tag, the implemented-interfaces tag and the direct-subclasses tag;
S12, data normalization: screening the crawled data; specifically, through simple screening of the acquired items, the items with at least one non-empty attribute are selected as the data set used in the experiment;
S13, establishing an RDF model: analyzing the data, establishing an RDF (Resource Description Framework) model of the data, converting the data into RDF, and storing the RDF in a graph database, with Neo4j adopted as the storage medium of the data;
S14, data visualization: after the data is stored in the database, the knowledge graph is visualized; specifically, the py2neo toolkit is used to connect the program to the database and to perform database operations during data import, and the data is stored in the Neo4j database system in the form of RDF.
CN201911142972.3A 2019-11-20 2019-11-20 JAVA Doc knowledge graph-based multidimensional evaluation recommendation method Active CN110874431B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911142972.3A CN110874431B (en) 2019-11-20 2019-11-20 JAVA Doc knowledge graph-based multidimensional evaluation recommendation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911142972.3A CN110874431B (en) 2019-11-20 2019-11-20 JAVA Doc knowledge graph-based multidimensional evaluation recommendation method

Publications (2)

Publication Number Publication Date
CN110874431A CN110874431A (en) 2020-03-10
CN110874431B true CN110874431B (en) 2022-04-26

Family

ID=69718109

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911142972.3A Active CN110874431B (en) 2019-11-20 2019-11-20 JAVA Doc knowledge graph-based multidimensional evaluation recommendation method

Country Status (1)

Country Link
CN (1) CN110874431B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110929165A (en) * 2019-12-17 2020-03-27 云南大学 JAVA Doc knowledge graph-based multidimensional evaluation recommendation method
CN112100314B (en) * 2020-08-16 2022-07-22 复旦大学 API course compilation generation method based on software development question-answering website

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110148043A (en) * 2019-03-01 2019-08-20 安徽省优质采科技发展有限责任公司 The bid and purchase information recommendation system and recommended method of knowledge based map
CN110413795A (en) * 2019-06-21 2019-11-05 厦门美域中央信息科技有限公司 A kind of professional knowledge map construction method of data-driven

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104615425A (en) * 2015-01-16 2015-05-13 贾志东 Method and system for developing software system based on features and feature tree
WO2017212268A1 (en) * 2016-06-08 2017-12-14 Blippar.Com Limited Data processing system and data processing method
CN106919689B (en) * 2017-03-03 2018-05-11 中国科学技术信息研究所 Professional domain knowledge mapping dynamic fixing method based on definitions blocks of knowledge
CN107391542B (en) * 2017-05-16 2021-01-01 浙江工业大学 Open source software community expert recommendation method based on file knowledge graph
US11334692B2 (en) * 2017-06-29 2022-05-17 International Business Machines Corporation Extracting a knowledge graph from program source code
US20190303141A1 (en) * 2018-03-29 2019-10-03 Elasticsearch B.V. Syntax Based Source Code Search
CN109739994B (en) * 2018-12-14 2023-05-02 复旦大学 API knowledge graph construction method based on reference document
CN110321482B (en) * 2019-06-11 2023-04-18 创新先进技术有限公司 Information recommendation method, device and equipment

Also Published As

Publication number Publication date
CN110874431A (en) 2020-03-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Jia Li

Inventor after: Yang Ming

Inventor after: Gao Tilei

Inventor after: Yang Dai

Inventor after: Xie Wanyu

Inventor before: Gao Tilei

Inventor before: Tao Ye

Inventor before: Yang Dai

Inventor before: Zhou Ronghua

Inventor before: Yang Ming

Inventor before: Jia Li

Inventor before: Xie Wanyu

Inventor before: Zhang Tao

Inventor before: Li Ying

Inventor before: Du Shirong

Inventor before: Liu Fen

Inventor before: He Feng

GR01 Patent grant