Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention discloses a multi-dimensional evaluation recommendation method based on a JAVA Doc knowledge graph, which comprises the following steps of:
S1, building a Java class knowledge graph by crawling and analyzing Java Doc documents:
S11, data extraction: crawling the data in the HTML-format Java Doc files with the Beautiful Soup toolkit in Python; specifically, the internal data is obtained by crawling the head tag, the implemented-interface tag and the direct-subclass tag;
S12, data normalization: screening the crawled data; specifically, the acquired items are screened and the items with at least one non-empty attribute are selected as the data set used in the experiment;
S13, establishing an RDF model: analyzing the data, establishing an RDF (Resource Description Framework) model of the data, converting the data into RDF and storing it in a graph database, with Neo4j as the storage medium;
S14, data visualization: after the data is stored in the database, visualizing the knowledge graph; specifically, the py2neo toolkit is used to connect the program to the database and to operate the database during data import, and the data is stored in the Neo4j database system in the form of RDF.
S2, according to the established Java Doc knowledge graph, analyzing the path relations between entities through the relations between each class and external classes, and establishing a recommendation function to mine the data; the recommendation domain is determined by the classes and the relationships between the classes.
S3, clustering the classes by using a text-based K-means clustering method, and taking the clustering result as a supplement to the recommendation domain mined from the knowledge graph.
S4, quantitatively scoring the candidate items of the recommendation domain; based on the properties of the Java language, the scoring criteria cover three dimensions: 1) the relation between classes and interfaces, measured by the PageRank algorithm, 2) the closeness centrality of the classes in the knowledge graph, and 3) the betweenness centrality of the classes in the knowledge graph; the three scoring items are combined to perform multidimensional quantitative scoring on the classes in the recommendation domain.
S5, establishing a comprehensive quantitative scoring model from the scores of step S4, and performing quantitative similarity scoring over the multiple dimensions to score the candidate items of each recommendation domain; the quantitative scores of the classes in the recommendation domain are returned to the user.
JAVA Doc-based code map establishment and JAVA Doc source data analysis
The Java Doc is a set of static web pages in HTML format. The invention analyzes the Java Doc and acquires its content with the Beautiful Soup toolkit in Python; Beautiful Soup is a toolkit for extracting information from HTML or XML files. The invention selects the java.util package in the Java Doc as the data source. The java.util package is heavily used in software development and contains the key data structures, such as arrays, queues and hash tables, together with related tools. The HTML documents detail the role of each class, the methods contained in the class, and the functional description, parameter description, etc. of each method. They also include the relationships between a class and other external classes, such as inheritance or the implementation of an interface.
By locating and traversing the tags in the HTML files, the crawler obtains the description of each class, the name of the class, all methods contained in the class, the description, declaration, parameters, return values and exceptions of each method, and all other data contained in the HTML files.
1. Acquiring the class name and the relationships between the class and the outside: the internal data is acquired by crawling the head tag in the Java Doc together with the <dt> tags labelled "Direct Known Subclasses" and "All Implemented Interfaces".
2. Acquiring the internal methods of each class and their attributes: the crawler obtains the description of the class, the name of the class, all methods contained in the class, the description, declaration, parameters, return values and thrown exceptions of each method, and all other data entities contained in the HTML document.
Data normalization
Since the data acquired in the previous step is raw data, it contains a large amount of impurities and noise as well as a large amount of repeated data, so the acquired data needs to be filtered.
Impurities fall into the following cases:
1. The declaration of a class repeats the information about implemented interfaces, the inherited parent class and owned subclasses. For example, the interfaces and parent class implemented by the ArrayList class appear in both the class declaration and the class introduction sections, and the two occurrences are not written consistently. For this case, the present invention records each class or interface name that has appeared only once, rather than many times.
2. Because tags recur in the HTML document, content at a specific position may be located incorrectly or located several times. For this case, a regular expression is used to screen out all the tags, and the tags that are located several times because of nesting are then filtered. For example, in <dt><dl><dt>…</dt></dl></dt>, a query for <dt> tags locates <dt> more than once.
3. Some functions have no parameters or exceptions, while others have several. Each parameter and exception appears in the HTML document as a tag, an indefinite number of times and at an indefinite position, and a heading tag may appear several times without a parameter tag appearing at all. In this case the content under a <dd> tag cannot be attributed to a heading from the tag alone.
Therefore, the invention captures the content together with the heading tag, records the most recent heading, stores the content in the array corresponding to that heading, and processes the data with a filter.
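The heading-tracking rule described above can be sketched as follows. This is a minimal illustration, not the invention's crawler: it uses Python's standard html.parser module in place of Beautiful Soup so that it runs without third-party packages, and the class name DefinitionListParser and the sample HTML fragment are assumptions made up for the example.

```python
from html.parser import HTMLParser


class DefinitionListParser(HTMLParser):
    """Groups the text of each <dd> under the most recently seen <dt> heading."""

    def __init__(self):
        super().__init__()
        self.sections = {}    # heading text -> list of <dd> contents
        self._current = None  # last <dt> heading seen so far
        self._tag = None      # "dt" or "dd" while its text is being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            # A nested <dt> simply restarts collection, so it is recorded once.
            self._tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return  # inner tags such as </code>, or an outer closing tag
        text = " ".join("".join(self._buffer).split())
        if tag == "dt":
            self._current = text.rstrip(":")
            self.sections.setdefault(self._current, [])
        elif self._current is not None and text:
            self.sections[self._current].append(text)
        self._tag = None


# Hypothetical fragment in the style of a Java Doc method detail block.
SAMPLE = """
<dl>
<dt>Parameters:</dt>
<dd><code>index</code> - index of the element to return</dd>
<dt>Returns:</dt>
<dd>the element at the specified position</dd>
<dt>Throws:</dt>
<dd><code>IndexOutOfBoundsException</code> - if the index is out of range</dd>
</dl>
"""

parser = DefinitionListParser()
parser.feed(SAMPLE)
parser.close()
sections = parser.sections
```

Each <dd> entry ends up in the list of the heading that preceded it, which is the attribution that the tag alone cannot provide.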
Data error correction
Because the Java Doc documents write references in different ways, class names are written inconsistently. For example, ArrayList<T> and ArrayList point to the same specific class. For this case, the invention records all related class names; when <T> occurs, the first half of the class name is compared by using a regular expression, and the names are unified under the same class name, as in formula (1):
CN={X|X∈prePart(CN)} (1)
where CN is the class name to be matched, X ranges over the class names in the group to be matched, and prePart is the class-name matching function.
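As an illustration of formula (1), the sketch below groups spellings of a class name by the part that precedes the generic type parameters. The regular expression and the helper names pre_part and group_by_class are assumptions for this example; the text does not give the actual expression used.

```python
import re

# Matches a trailing generic parameter list such as "<T>" or " < K, V >".
GENERIC_SUFFIX = re.compile(r"\s*<.*>\s*$")


def pre_part(name: str) -> str:
    """Return the part of a class name before any generic type parameters."""
    return GENERIC_SUFFIX.sub("", name).strip()


def group_by_class(names):
    """Map each normalized class name to the set of spellings seen for it."""
    groups = {}
    for name in names:
        groups.setdefault(pre_part(name), set()).add(name)
    return groups


groups = group_by_class(
    ["ArrayList<T>", "ArrayList", "ArrayList < E >", "HashMap<K,V>"]
)
```

All three spellings of ArrayList fall into one group, so they can be stored under a single node name.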
In the data crawl result, the first row contains the class name, the interfaces implemented by the class and its direct subclasses, and the super interfaces and extended class; for ArrayList some of these fields are empty. These are followed by the description of the class and all the data of its methods.
RDF model establishment
The knowledge graph adopts RDF as its most basic data structure and establishes a knowledge network through RDF. RDF data can be represented as a labeled graph in which the nodes correspond to the subjects and objects of triples and the predicates are the edges. The components of an RDF triple satisfy:
s∈R∪B, p∈R, o∈R∪B∪L (2)
where s is the subject, p is the predicate, o is the object, R is the set of URIs, B is the set of blank nodes, and L is the set of literals (textual descriptions). The RDF data can then be represented as:
RDF={X|X∈(s,o,p)} (3)
i.e., a triple is a vector with s and o as the vertices and p as the directed edge.
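To make the triple view concrete, the following sketch stores a few triples and turns them into a labeled directed graph. The predicate names and the Triple type are hypothetical choices for the example and are not taken from the invention's actual schema.

```python
from collections import defaultdict, namedtuple

# One RDF statement: subject --predicate--> object.
Triple = namedtuple("Triple", ["subject", "predicate", "obj"])

# Illustrative triples; the predicate names are assumptions.
triples = [
    Triple("ArrayList", "extends", "AbstractList"),
    Triple("ArrayList", "implements", "List"),
    Triple("ArrayList", "include", "ArrayList.get"),
    Triple("ArrayList.get", "described_by",
           "Returns the element at the specified position."),
]


def to_labelled_graph(items):
    """Build an adjacency list: subject -> [(edge label, object), ...]."""
    graph = defaultdict(list)
    for t in items:
        graph[t.subject].append((t.predicate, t.obj))
    return dict(graph)


graph = to_labelled_graph(triples)
```

Subjects and objects become vertices, and each predicate becomes the label of a directed edge between them.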
Because the knowledge graph is a directed graph established by taking RDF as a framework, the invention needs to model the extracted JAVA Doc data and place the JAVA Doc data into a model described by the RDF.
1. Establishing a class model:
The establishment of the class model is divided into two parts: the first part models the relations between a class and its internal components, and the second part models the relations between the class and the outside.
First part: a Java class is a collection of entities that has both attributes and methods, and the attributes serve the methods. The internal relationships of a Java class can therefore be expressed as the relationships between the class and its internal methods. Each method has its own parameters, return value, exceptions and function declaration, which are also contained within the class.
To keep the data hierarchy clear, the invention takes the smallest unit in the class as a node, rather than the entire class or the entire method. This allows each data item to have its own properties and content and gives the model a 1-N relationship structure.
Data model of class methods: in establishing the model, the invention needs to define the label, attributes and content of each node, and the label and content of each relationship between nodes. Each method has a function declaration, parameters, parameter descriptions, a function description, a return value, exceptions, and the like. The method can then be expressed as:
Based on the above analysis, the present invention establishes a separate node for each data unit; the description of the function and the descriptions of the parameters are merged into the function and parameter nodes as attribute values. On this basis, the present invention models the method data, as shown in FIG. 2.
After the data model of the class methods is built, the model of the class body is built. The properties of a class in the Java Doc are: the functional description of the class, the name of the class, the package in which it is located, the inheritance relationship to its parent class, the implementation relationship to its interfaces, the inheritance relationship to its subclasses, and the like. Because the invention models the internal relations of the Java class first, the class can be expressed as the following formula:
The present invention first models the interior of Class according to the attributes contained in Class.
Second part: after modeling the interior of the class, the invention models the external relations between the class node and other nodes. In Java, the relations between classes are divided into parent-class and subclass relations, and the relations between a class and an interface are divided into implementation and extension. These relationships are defined as the following equations:
X:Class A,Y:Class B,Z:Class C,I:Interface
where the parent-child relationship is bidirectional, X being the parent of Y. To reduce the time complexity of later retrieval, the invention reduces the complexity of each node and models the external relations between a class node and other nodes as unidirectional vectors, i.e., equation (7):
According to the formula, the invention constructs the class model and adds the relationships between classes and interfaces into the RDF modeling of the class, as shown in FIG. 3.
2. Establishing an interface model:
The interface model is very similar to the class model, except that the relationships between interfaces are defined as "super interfaces", i.e., inheritance relationships between interfaces. Since the present invention designs all relationships as unidirectional vectors, the relationships between classes and interfaces have already been defined in the class section. When the interface model is established, the invention only needs to consider the data inside the interface, including its description, its methods, the data inside the methods, and the relationships between interfaces.
The interface node is represented as:
The relationships of an interface comprise the super-interface relationship between interfaces and the implementation relationship between an interface and a class,
as in equation (9):
X:Interface A,Y:Interface B,I:Class A
where X represents interface A, Y represents interface B, and I represents class A.
According to formula (8) and formula (9), the present invention models the interface node, as shown in FIG. 4.
3. Establishment of integral RDF structure model
Each element and node of the knowledge graph has so far been modeled in RDF independently, without the nodes being combined. To better analyze the efficiency and integrity of the overall modeling, the whole RDF structure is now combined and analyzed.
In the knowledge graph established by the invention, the main node types are Class, Interface and Method. The secondary node types are Description, Parameter, return, throw. The relationship types are include, described by, has Super Interfaces, has sub classes, and instantiated Interfaces.
An overall RDF structure diagram of the knowledge graph is established for all the elements, as shown in FIG. 5.
Storing code knowledge graph and knowledge graph visualization
After RDF modeling, the acquired Java Doc data is stored in a knowledge base according to the RDF model. A knowledge graph is graph-structured data, so a graph database matches RDF data well and the data can be stored without any format change. After comparison, the invention selects Neo4j as the storage database of the knowledge graph. Neo4j follows the property graph data model, whose primitives are nodes, relationships and properties. It can represent all entities and the relations among them through the RDF structure, and data nodes can be flexibly modified, deleted and added. Through graph pattern matching, Neo4j quickly locates the required subgraph and finds the required nodes by depth-first and breadth-first traversal of the subgraph, and it fully supports ACID database operations.
Since the data is crawled by a Python crawler program, the py2neo toolkit is used to connect the program to the database and to operate the database during data import. The data import process is itself the process of establishing the knowledge graph, and the invention stores the data in the Neo4j database system in the form of RDF. To avoid generating duplicate nodes and losing relationships, the invention divides the establishment of the whole knowledge graph into two parts: 1. establishing the independent nodes, and 2. establishing the relationships among the nodes.
In the first traversal of the data, the independent nodes of all classes and interfaces are established. Since each method belongs to a specific class, a large number of methods with the same name exist across classes. The attribution of methods to classes is therefore also generated in the first traversal, together with the class nodes and interface nodes, so that methods with the same name in different classes are associated with the correct class.
Since overloaded methods with the same name exist within one class, the invention establishes for such methods separate nodes with the same name but different node.id (node.id is the unique identifier of each node in the database).
Therefore, in the first traversal the invention establishes the class or interface nodes, the method nodes and the member data nodes of the methods, and associates the methods with their classes and interfaces. The storage structure of the ArrayList class node in the database is shown in FIG. 6.
Second traversal of the data: the first traversal has established the nodes of a total of 117 classes and interfaces in the util package. In the second traversal, the invention establishes the relationships between classes, between classes and interfaces, and between interfaces. Because the relationships between nodes are designed as unidirectional vectors, only one direction needs to be considered when parent-child relationships are processed, and the invention chooses to express them from the child class. The ArrayList class and its neighbors are selected to show, on a small region, the relationships between classes, between classes and interfaces, and between interfaces, as shown in FIG. 7.
At this point, the present invention has completed building the entire knowledge graph and has stored it in the Neo4j database. The overall code knowledge graph is shown in FIG. 8.
After the knowledge graph is visualized, the invention can carry out intuitive analysis on the knowledge graph.
As the complete knowledge graph shows, the references between classes and interfaces are quite complex and irregular, but these relations are the clues to the whole API function library. The invention aims to mine this data through knowledge reasoning over the knowledge graph. The pseudo code of the knowledge graph establishment algorithm is as follows:
Knowledge graph establishment algorithm:
The method is divided into two parts: 1. establishing the class nodes and their internal method nodes; 2. establishing the relations between each class and external nodes.
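The two-part construction can be sketched in memory as follows. This is not the original pseudo code and does not use py2neo or Neo4j; the record layout, the relation names and the "name#index" method identifiers are assumptions chosen so that the example runs on its own.

```python
def build_graph(records):
    """Two-pass construction: nodes first, then relationships between them."""
    nodes = {}     # node id -> properties
    edges = set()  # (source id, relation, target id)

    # Pass 1: class/interface nodes and their method nodes. Each method gets
    # its own id, so overloaded methods with the same name stay distinct.
    for rec in records:
        nodes[rec["name"]] = {"label": rec["kind"]}
        for i, signature in enumerate(rec.get("methods", [])):
            method_id = f'{rec["name"]}#{i}'
            nodes[method_id] = {"label": "Method", "signature": signature}
            edges.add((rec["name"], "include", method_id))

    # Pass 2: one-directional relationships, expressed from the child side.
    # A relationship is only added when its target node already exists.
    for rec in records:
        for relation in ("extends", "implements"):
            for target in rec.get(relation, []):
                if target in nodes:
                    edges.add((rec["name"], relation, target))
    return nodes, edges


# Hypothetical crawl records for three java.util types.
records = [
    {"name": "AbstractList", "kind": "Class", "implements": ["List"]},
    {"name": "List", "kind": "Interface"},
    {"name": "ArrayList", "kind": "Class",
     "extends": ["AbstractList"], "implements": ["List"],
     "methods": ["add(E e)", "add(int index, E element)"]},
]

nodes, edges = build_graph(records)
```

Because all nodes exist before any relationship is created, the second pass never has to create a placeholder node, which is one way to avoid the duplicate nodes and lost relationships mentioned above.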
Similar built-in class recommendation algorithm based on the JAVA Doc code map
1. Determining class recommendation domains
Since the set of Java API classes is very large, recommending over all classes would waste computation time, and many classes have no connection with each other at all; a limited candidate set therefore has to be established for the recommendation function. The invention determines the recommendation domain of similar classes according to the class name input by the user, namely, a set T is determined:
T={X|X∈relate(input Class)} (10)
To discover similarities between entities, it is necessary to find the associations between entities or the similarities inherent in the entities themselves. Since an entity in the knowledge graph carries little value by itself and a large amount of information is distributed on the paths of the network, i.e., the edges of the graph, the present invention determines the recommendation domain through the external relations of the entity.
In the Java language, associations between classes take two forms: 1. inheritance, 2. composition. Inheritance is an important feature of object-oriented programming: a new class can be created on the basis of an existing class. The new class has and can access the attributes of the original class, can override the methods defined in it (except the member variables and methods declared with the private keyword), and can also add attributes and methods of its own. In inheritance, the attributes and methods of the parent class are fully acquired by the child class, so the subclasses of the same parent class have similar functions and nature, and each subclass is constrained by the attributes of the parent class. Classes with similar functions and attributes can therefore be found through inheritance relationships. On this basis the invention defines a relationship function:
where "→" represents an inheritance relationship. Classes may be inherited over multiple layers. Analysis shows that class2 and class3 both inherit the parent class1, so class2 and class3 have similar functions and attributes. Class4 in turn inherits class2, so class4 should be incorporated into the similar domain of class2. Class5 is the parent of class1, so class5 should also be incorporated into the similar domain of class2. To ensure the comprehensiveness of the recommendation domain, the invention incorporates only the next level of subclasses and the upper level of parents of the input class, namely:
At this point, the recommendation domain includes the classes at the same level as the input class, their common parent class, and the next-level child classes of the input class. Taking ArrayList as an example, when ArrayList is input the invention needs to find the recommendation domain of ArrayList, as shown in FIG. 9. The invention obtains the recommendation domain by searching for the sibling nodes of ArrayList and the immediate parent and child nodes of ArrayList.
Since ArrayList has no children, the graph only includes the parent node AbstractList and the sibling nodes Vector and AbstractSequentialList. However, analysis shows that LinkedList is also a class whose functions are very similar to those of ArrayList, so the recommendation domain obtained according to equation (12) is incomplete. By comparison, the invention finds that LinkedList is in fact an immediate subclass of AbstractSequentialList among the built-in classes.
2. Improvements to class recommendation domains
Since the recommendation domain is found to be incomplete, the present invention improves the recommendation domain determination algorithm twice: once from within the algorithm itself, and once by supplementing its results with those of another algorithm.
1. An improvement internal to the recommendation domain determination algorithm:
The invention first improves the recommendation domain determination algorithm itself. Since LinkedList is found among the direct subclasses of a same-level node, the invention also incorporates all direct subclasses of the nodes under the same parent node into the recommendation domain, that is:
The invention incorporates class_1, class_3, class_5 and class_4 into the recommendation domain of class_2, which makes the recommendation domain more comprehensive. The improved recommendation domain of ArrayList is shown in FIG. 10.
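The basic and improved recommendation domains can be sketched on a small slice of the java.util inheritance tree. The PARENT table and the function names are assumptions for the illustration; the original formulas are not reproduced here.

```python
# Child -> parent for a slice of java.util (Java classes have one superclass).
PARENT = {
    "ArrayList": "AbstractList",
    "Vector": "AbstractList",
    "AbstractSequentialList": "AbstractList",
    "LinkedList": "AbstractSequentialList",
    "Stack": "Vector",
    "AbstractList": "AbstractCollection",
}


def children(cls):
    """Direct subclasses of cls."""
    return {c for c, p in PARENT.items() if p == cls}


def basic_domain(cls):
    """Parent, siblings and direct children of the input class."""
    parent = PARENT.get(cls)
    domain = children(cls)
    if parent is not None:
        domain |= {parent} | children(parent)
    domain.discard(cls)
    return domain


def improved_domain(cls):
    """Basic domain plus the direct subclasses of every sibling."""
    domain = basic_domain(cls)
    parent = PARENT.get(cls)
    if parent is not None:
        for sibling in children(parent) - {cls}:
            domain |= children(sibling)
    return domain


basic = basic_domain("ArrayList")
improved = improved_domain("ArrayList")
```

For ArrayList the basic domain holds AbstractList, Vector and AbstractSequentialList, and the improved domain additionally picks up LinkedList and Stack. Calling improved_domain("LinkedList") on the same table returns only AbstractSequentialList, because this rule does not climb above the direct parent.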
After the first improvement, the recommendation domain becomes more complete. The invention extracts several common classes to compare their recommendation domains, and the comparison result is shown in Table 1.
TABLE 1 recommended Domain Algorithm improvement comparison
As can be seen from Table 1, the improvement gathers more similar classes together. However, by comparing LinkedList with ArrayList, Vector and the like, the invention finds that the recommendation domain of LinkedList should also contain ArrayList, Vector and so on, but it contains only LinkedList's own parent node. This is because the algorithm only explores sibling nodes and nodes below them, so the parent of the parent node (the grandparent) and the children descending from the grandparent are not reachable. The invention cannot determine how many levels should be explored upwards, because all Java built-in classes share the ancestor Object; if the exploration continued upwards, all classes of the entire graph would be returned. In addition, some classes are functionally similar but have no path between them (e.g., there is no path between the two similar classes HashMap and Hashtable). The invention therefore extends the algorithm externally to address these problems.
2. External algorithm-assisted augmentation for recommended domains
In essence, determining the recommendation domain means finding a set of classes of a similar category, as in equation (10). Apart from the associations between entities, the similarity between classes can only be determined from the attributes of the classes themselves. The most important attribute of a class is its methods, and the similarity of classes is the similarity of the functions realized by their methods. On this basis, the invention selects a clustering algorithm and clusters the classes by the methods contained in each class. Clustering is an unsupervised machine learning method that analyzes and explores the internal connections and nature of things by grouping similar things together. There are many kinds of clustering algorithms, the most important being density-based clustering, clustering based on the distance between feature vectors, hierarchical clustering, and the like. Since the data used in the present invention is the text data of the Java Doc, which contains the natural language descriptions of all methods, the invention extracts text feature vectors and uses the K-means algorithm for cluster analysis.
The K-means clustering algorithm is a partition-based clustering algorithm. The data is divided into K subsets, and data with similar characteristics are placed in the same subset. Its main objective function is shown in formula (14):
X={x1,x2,...,xn}
C={ck|k=1,2,...,K}
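For orientation, a plain K-means loop over the sets X and C can be sketched as below. The score it returns is the usual within-cluster sum of squared distances; treating this as the "K-means score" used later in the text is an assumption, as are the toy points and the function names.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=50, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment and averaging."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k distinct points as initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = mean of the cluster; an empty cluster keeps its old one.
        new_centroids = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    # Within-cluster sum of squared distances (lower means tighter clusters).
    score = sum(min(squared_distance(p, c) for c in centroids) for p in points)
    return clusters, centroids, score


# Two well-separated toy groups standing in for class feature vectors.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters, centroids, score = k_means(points, k=2)
```

In the setting of this section the points would be the TF-IDF feature vectors of the classes rather than 2-D coordinates.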
The invention uses the K-means algorithm to cluster the Java built-in classes. Each class is represented by the natural language descriptions of its methods. The text needs to be abstracted into feature vectors so that the distance between texts can be calculated. The invention adopts the TF-IDF method to extract the feature vector of the text of each class: the TF-IDF algorithm computes the term frequency of each word in the text and weights it by the inverse document frequency to obtain the feature value of each word, and these values finally form the text feature vector of the class.
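A minimal TF-IDF sketch is given below. It uses one common variant (relative term frequency multiplied by log(N/df)); the exact weighting used by the invention is not stated in the surviving text, and the stop-word list and the three sample descriptions are assumptions for the example.

```python
import math
import re
from collections import Counter

# Small illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "at", "this", "is"}


def tokenize(text):
    """Lower-case, keep alphabetic tokens only, drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOP_WORDS]


def tf_idf(documents):
    """Return the vocabulary and one TF-IDF vector per document name."""
    tokenized = {name: tokenize(text) for name, text in documents.items()}
    vocabulary = sorted({w for words in tokenized.values() for w in words})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each word.
    doc_freq = Counter(w for words in tokenized.values() for w in set(words))
    vectors = {}
    for name, words in tokenized.items():
        counts = Counter(words)
        total = len(words) or 1
        vectors[name] = [
            (counts[w] / total) * math.log(n_docs / doc_freq[w])
            for w in vocabulary
        ]
    return vocabulary, vectors


# Hypothetical method descriptions standing in for the crawled text.
docs = {
    "ArrayList": "Appends the specified element to the end of this list",
    "LinkedList": "Appends the specified element to the end of this list",
    "HashMap": "Associates the specified value with the specified key in this map",
}

vocabulary, vectors = tf_idf(docs)
```

A word that occurs in every document, such as "specified" here, receives weight zero, while a word unique to one class, such as "map", receives a positive weight and therefore helps separate that class from the others.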
The objective function of TF-IDF is:
To reduce the dimensionality of the feature vectors calculated by TF-IDF, the invention preprocesses the input text. After word segmentation, stop-word removal, special symbol removal and similar operations, all texts are input into TF-IDF, and a feature vector (878 dimensions in total) is obtained for each text. The feature vector of each class is taken as the input of K-means. The clustering result changes with the value of K. Experiments were performed for K in the interval from 20 to 50 and evaluated by the K-means score (see Table 2); 4 representative K values were extracted for comparison.
TABLE 2 K values and clustering results
| Value of K | K-means outcome score |
| --- | --- |
| 20 | 35.86 |
| 30 | 26.8 |
| 40 | 17.05 |
| 50 | 8.85 |
A lower K-means score represents better cohesion among the points of each cluster. After comparing the results, the invention uses the clustering result for K = 40 as the supplement to the class recommendation domain. Since the number of results is too large, not all of them are listed. The comparison of the recommendation domains after the second improvement is shown in Table 3 (5 classes were drawn as samples).
TABLE 3 recommended Domain comparison after second refinement
After these two improvements to the algorithm, the recommendation domain of the input class is comprehensive and the desired result is fully included in the candidate domain. However, each class in the recommendation domain has a different relevance to the input class. To obtain the desired class recommendation from the recommendation domain, the invention establishes a graph-theory-based scoring model for classes. The model scores the similarity of every class in the recommendation domain to the input class and finally gives the user a quantitative result.
The recommended domain determination algorithm is as follows:
3. Similarity scoring model establishment based on graph theory
In establishing the scoring model, the invention selects three features that make full use of the advantages of the code map to score the similarity between classes: 1. closeness centrality; 2. betweenness centrality; 3. the PageRank algorithm.
1. Closeness centrality:
Closeness centrality measures the average distance from a node to all other nodes in the graph, i.e., formula (16):
For the established code map, a higher closeness centrality indicates that a node is closer to the center of the entire knowledge graph. The invention measures the distance between two classes by comparing their relative centrality distance in the graph, i.e., formula (17):
By quantifying distance, the invention compares classes by their relative positions in the graph. The closer the relative distance, the more similar the two classes.
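A sketch of closeness centrality and of a centrality-based distance follows. It uses the standard definition (number of reachable nodes divided by the sum of shortest-path distances) and takes the distance between two classes to be the absolute difference of their closeness values; both choices are assumptions, since formulas (16) and (17) are not reproduced in the text.

```python
from collections import deque


def bfs_distances(graph, source):
    """Shortest-path length (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def closeness(graph, node):
    """(reachable nodes) / (sum of distances); 0.0 for an isolated node."""
    dist = bfs_distances(graph, node)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0


def centrality_distance(graph, x, y):
    """Assumed form of Dis(X, Y): difference of the two closeness values."""
    return abs(closeness(graph, x) - closeness(graph, y))


# Undirected view of a small inheritance neighbourhood (illustrative only).
GRAPH = {
    "AbstractList": ["ArrayList", "Vector", "AbstractSequentialList"],
    "ArrayList": ["AbstractList"],
    "Vector": ["AbstractList", "Stack"],
    "AbstractSequentialList": ["AbstractList", "LinkedList"],
    "Stack": ["Vector"],
    "LinkedList": ["AbstractSequentialList"],
}

center = closeness(GRAPH, "AbstractList")
gap = centrality_distance(GRAPH, "ArrayList", "Vector")
```

AbstractList, which sits in the middle of this small graph, obtains the highest closeness value, and two nodes in symmetric positions (Vector and AbstractSequentialList) have a centrality distance of zero.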
2. Betweenness centrality:
Betweenness centrality is the number of times a node is passed through by the paths between any two points in the graph. The more times, the higher the betweenness centrality, namely formula (18):
For example, given the paths X → Z → N, X → Z → Y and X → Y → M, the betweenness centrality of Z is higher than that of Y. The higher the betweenness centrality, the greater the role the node plays in the graph and the more paths it controls. In the knowledge graph, a class with a large number of relations plays an important role. Therefore, the role of a class is quantitatively assessed through its betweenness centrality:
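Betweenness centrality in its standard shortest-path form can be sketched as below. The graph is an assumed undirected example that reuses the node names X, Z, N, Y and M; it is not the graph of the example paths above, and the function names are illustrative. Ordered pairs (s, t) are counted, so each undirected pair contributes twice.

```python
from collections import deque


def shortest_path_counts(graph, source):
    """BFS returning distances and the number of shortest paths from source."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                sigma[nxt] = 0
                queue.append(nxt)
            if dist[nxt] == dist[node] + 1:
                sigma[nxt] += sigma[node]
    return dist, sigma


def betweenness(graph):
    """For each v: sum over ordered pairs (s, t) of the share of shortest
    s-t paths that pass through v (v different from s and t)."""
    info = {s: shortest_path_counts(graph, s) for s in graph}
    score = {v: 0.0 for v in graph}
    for s in graph:
        dist_s, sigma_s = info[s]
        for t in dist_s:
            if t == s:
                continue
            for v in graph:
                if v in (s, t) or v not in dist_s:
                    continue
                dist_v, sigma_v = info[v]
                # v lies on a shortest s-t path when the distances add up.
                if t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                    score[v] += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return score


# Undirected tree with edges X-Z, Z-N, Z-Y, Y-M.
GRAPH = {
    "X": ["Z"],
    "Z": ["X", "N", "Y"],
    "N": ["Z"],
    "Y": ["Z", "M"],
    "M": ["Y"],
}

scores = betweenness(GRAPH)
```

In this tree Z lies on more shortest paths than Y, and the leaf nodes lie on none, so Z receives the highest score.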
3. The PageRank algorithm:
The PageRank algorithm is a web page ranking algorithm used by Google's search engine, which draws on the idea of traditional citation analysis: when web page A has a link pointing to web page B, B is considered to receive a contribution score from A, and the size of this score depends on the importance of A, i.e., the more important A is, the higher the contribution B obtains. In the knowledge graph, nodes are equivalent to web pages, and the PageRank value of a node can be obtained by analyzing the in-degree and out-degree of the nodes, namely formula (20):
Using the PageRank algorithm, the PR values of all interfaces in the graph are calculated.
In Java, the interfaces that a class implements are also an important basis for determining the functionality of the class. The invention therefore analyzes the similarity of classes by analyzing the interfaces implemented by each class. Since an interface can be implemented by any class, an interface that is implemented by more classes is often less useful as a feature for distinguishing between classes. The invention therefore evaluates classes using the inverse of the PR value of each interface: it intersects the interfaces implemented by a class in the recommendation domain with the interfaces implemented by the input class, and then scores the class on the basis of the shared interfaces.
Formula (21):
where X is the class to be evaluated in the recommendation domain, Y is the class input by the user, and I is the intersection of the interfaces implemented by X and Y.
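The interface-based score can be sketched with a plain power-iteration PageRank over "class implements interface" edges. The damping factor of 0.85 and the scoring rule (sum of 1/PR over the shared interfaces) are assumptions, since formulas (20) and (21) are not reproduced in the text; the function names are illustrative.

```python
def page_rank(out_links, damping=0.85, iterations=100):
    """Power-iteration PageRank; rank of dangling nodes is spread uniformly."""
    nodes = set(out_links) | {t for ts in out_links.values() for t in ts}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        dangling = sum(rank[v] for v in nodes if not out_links.get(v))
        new_rank = {v: (1 - damping) / n + damping * dangling / n
                    for v in nodes}
        for v, targets in out_links.items():
            for t in targets:
                new_rank[t] += damping * rank[v] / len(targets)
        rank = new_rank
    return rank


# Class -> implemented interfaces for four java.util classes.
IMPLEMENTS = {
    "ArrayList": ["List", "RandomAccess", "Cloneable", "Serializable"],
    "Vector": ["List", "RandomAccess", "Cloneable", "Serializable"],
    "LinkedList": ["List", "Deque", "Cloneable", "Serializable"],
    "HashMap": ["Map", "Cloneable", "Serializable"],
}


def interface_score(candidate, user_class, rank):
    """Assumed scoring rule: sum of inverse PR values of shared interfaces."""
    shared = set(IMPLEMENTS[candidate]) & set(IMPLEMENTS[user_class])
    return sum(1.0 / rank[i] for i in shared)


rank = page_rank(IMPLEMENTS)
vector_score = interface_score("Vector", "ArrayList", rank)
linked_score = interface_score("LinkedList", "ArrayList", rank)
```

Widely implemented interfaces such as Serializable obtain a high PR value and therefore contribute little through the inverse, whereas a rarer shared interface such as RandomAccess contributes more, so Vector scores higher than LinkedList against ArrayList in this toy data.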
At this point, the 3 scoring items have been determined. The invention combines the three scoring items and performs multidimensional quantitative scoring on the classes in the recommendation domain. The degree of similarity between a class in the recommendation domain and the input class is determined by the score. The invention therefore builds the model Score() as follows:
Score(X)=βDis(X,Y)+αBe(X)+εPR(X) (22)
where X is a class in the recommendation domain and Y is the class input by the user.
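Formula (22) is a weighted sum, which can be written directly as below. The weights default to 1.0 and the three term values per candidate are made-up numbers for illustration only; the text does not state the weights that were used or how the terms are normalized.

```python
def combined_score(dis, be, pr, beta=1.0, alpha=1.0, epsilon=1.0):
    """Score(X) = beta * Dis(X, Y) + alpha * Be(X) + epsilon * PR(X)."""
    return beta * dis + alpha * be + epsilon * pr


# Hypothetical, pre-computed term values for two candidate classes.
candidates = {
    "Vector": {"dis": 0.10, "be": 0.30, "pr": 0.90},
    "LinkedList": {"dis": 0.20, "be": 0.10, "pr": 0.60},
}

scores = {name: combined_score(**terms) for name, terms in candidates.items()}

# Candidates ordered from highest to lowest combined score.
ranked = sorted(scores, key=scores.get, reverse=True)
```

Sorting in descending order assumes that a higher combined score means a closer match; whether that holds depends on how the individual terms and the signs of the weights are defined.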
The graph-theory-based multidimensional similarity scoring model is thus established. The algorithm pseudo-code is as follows:
The above description is only the preferred embodiment of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.