CN116108191A - Deep learning model recommendation method based on knowledge graph - Google Patents

Deep learning model recommendation method based on knowledge graph Download PDF

Info

Publication number
CN116108191A
CN116108191A (application CN202211416498.0A)
Authority
CN
China
Prior art keywords
model
component
components
knowledge
code
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211416498.0A
Other languages
Chinese (zh)
Inventor
刘名威 (Liu Mingwei)
陈碧欢 (Chen Bihuan)
彭鑫 (Peng Xin)
赵文耘 (Zhao Wenyun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN202211416498.0A priority Critical patent/CN116108191A/en
Publication of CN116108191A publication Critical patent/CN116108191A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F 16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 Arrangements for software engineering
    • G06F 8/30 Creation or generation of source code
    • G06F 8/31 Programming languages or programming paradigms
    • G06F 8/315 Object-oriented languages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 Arrangements for software engineering
    • G06F 8/30 Creation or generation of source code
    • G06F 8/35 Creation or generation of source code model driven
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/02 Knowledge representation; Symbolic representation
    • G06N 5/022 Knowledge engineering; Knowledge acquisition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/02 Knowledge representation; Symbolic representation
    • G06N 5/022 Knowledge engineering; Knowledge acquisition
    • G06N 5/025 Extracting rules from data
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Stored Programmes (AREA)

Abstract

The invention belongs to the technical field of software engineering and specifically relates to a knowledge-graph-based deep learning model recommendation method. The method comprises the following parts: a model knowledge graph construction method that takes open-source code repositories and component-related text corpora as input and outputs a model knowledge graph fusing knowledge about AI repositories, models, components, implementations, and so on; and a model recommendation method that takes the implementation code of an existing model to be modified as input and outputs the several most similar model reference implementations together with corresponding interpretation information. The interpretation information includes the components used by the model, the relationships between the components, descriptive knowledge about the components, and so on. Based on a model knowledge graph that fuses multi-source knowledge, the invention matches model architectures at a high semantic level against the model code entered by the user to obtain similar models. The invention makes full use of background knowledge in the artificial intelligence field, promotes the reuse of deep learning model implementations, and improves the development efficiency of AI application developers.

Description

Deep learning model recommendation method based on knowledge graph
Technical Field
The invention belongs to the technical field of software engineering and specifically relates to a knowledge-graph-based deep learning model recommendation method.
Background
With the development of artificial intelligence (AI) technology, deep learning models are being integrated into more and more software applications to provide various intelligent capabilities, such as intelligent human-computer interaction, intelligent recommendation, and intelligent decision-making. In general, application developers first need to find a suitable deep learning model and then write code or reuse an existing model. Since many application developers are not themselves AI experts, a common development approach is to find similar models and then make customized modifications. To do so, developers need to refer to several similar models in order to learn how to adjust certain model components and how those components are used.
However, it is not easy for AI application developers to find similar models and learn about the components they use and how they are used. AI has long been a very active research area: every year researchers publish a large number of AI papers proposing new AI models and components, open-source their AI models, and publish them to code hosting platforms such as GitHub. For example, Google Research published the code of the BERT model in a GitHub repository. More than 120,000 peer-reviewed AI papers were published at conferences and in journals in 2019 alone, and the number keeps growing; the number of repositories on GitHub exceeded 100 million in 2018. Despite this massive body of open-source repositories on GitHub, screening out the AI-related repositories that contain AI models and identifying the models in them remains a major challenge for AI application developers. The artificial intelligence community has built, in a crowdsourced way, a free open platform, PapersWithCode, that collects AI-related resources such as AI papers, AI repositories, AI models, and components. PapersWithCode supports searching for a variety of AI resources such as models and components. However, AI application developers still cannot use PapersWithCode to find models that are similar to an existing model, because PapersWithCode does not support searching by model; it only supports searching by AI paper titles, AI models, components, and the like. Moreover, PapersWithCode cannot parse a model and therefore has no information about the components inside a model.
GitHub, a well-known open-source code hosting platform, allows users to search repositories or code by keyword and thus supports model search to some extent. However, finding similar models through GitHub is still difficult, because GitHub only accepts keywords of limited length, while a model implementation is usually long and cannot be fed to GitHub directly as a query. Moreover, GitHub's search results are based on simple keyword matching and contain a lot of noise, such as repositories unrelated to AI or irrelevant code.
In academia, most research on deep learning models focuses on the model architecture itself or on properties such as safety, interpretability, and fairness; there is very little research specifically on deep learning model recommendation, and in particular a lack of research on similar-model recommendation. Some general techniques, such as code clone detection and sample code search, could be applied to similar-model recommendation. Code clone detection can find code that is syntactically identical or highly similar in a code base, and sample code search can retrieve similar code snippets for an input piece of code. However, these methods are designed for general-purpose code search and do not fit the specific task of model recommendation, because model implementations are not necessarily similar at the code level; rather, the model architectures or components are similar at a high semantic level. Model recommendation therefore needs to match models at the level of their architectures, and it needs relevant background knowledge to reason about the semantic relationships between components. Furthermore, customizing a model often requires integrating several models and learning different components from different models, so the final modified model may use components taken from multiple models. In addition, selecting a model requires considerable AI background knowledge, which AI application developers do not necessarily have, so it is also important to provide the necessary interpretation information for a model, such as the components it uses, the relationships between the components, and the characteristics of the components.
Disclosure of Invention
The aim of the invention is to provide a knowledge-graph-based deep learning model recommendation method, so as to promote the reuse of deep learning model implementations and improve the development efficiency of AI application developers.
The knowledge-graph-based deep learning model recommendation method provided by the invention comprises an offline model knowledge graph construction method and an online model recommendation method, which together realize deep learning model recommendation; the overall flow is shown in Fig. 1. The model knowledge graph construction method takes open-source code repositories and component-related text corpora as input and outputs a model knowledge graph that fuses multi-source knowledge (including knowledge about AI repositories, models, components, implementations, and so on). The high-level conceptual model of the constructed knowledge graph is shown in Fig. 2. The model recommendation method takes the implementation code of an existing model that needs to be modified as input and outputs the several most similar model references together with the corresponding interpretation information. The interpretation information includes the components used by the model, the relationships between the components, descriptive knowledge about the components, and so on.
Based on the model knowledge graph that fuses multi-source knowledge, the invention can take the model implementation code entered by the user and match model architectures at a high semantic level, thereby obtaining similar model references. By fusing multi-source knowledge into a model knowledge graph, the invention makes full use of background knowledge in the artificial intelligence field, promotes the reuse of deep learning models, and improves the development efficiency of AI application developers.
(I) Offline model knowledge graph construction method
First, a trained classifier is used to identify AI-related open-source repositories; model implementation classes are extracted from the repository code, and the dependencies between components are extracted from the model implementation code through abstract syntax tree analysis and heuristic rules. Then, a prefix- and suffix-based pattern mining method is used to mine high-level semantic concepts from the models and components, and the high-level component concepts are linked to the component types provided by PapersWithCode. In addition, from the component text corpus, dependency parsing and part-of-speech tagging are used to extract component characteristics and component descriptions, and heuristic rules together with open information extraction techniques are used to extract open relations between components. The specific steps are as follows.
(1) AI repository identification.
A text classifier is trained on a manually collected ReadMe dataset and used to classify the ReadMe text of open-source code repositories so as to identify AI-related repositories. Almost every open-source code repository contains a ReadMe file, which often records important information about the project, such as its background, main content, dependency configuration, and reference documents. This information is very useful for understanding and using an open-source code repository. The specific operation is as follows:
First, a labeled ReadMe dataset is constructed manually using the data on PapersWithCode (https://github.com/paperswithcode/paperswithcode-data); then, a text classifier model is trained on this dataset, which can predict from the ReadMe text of an open-source repository whether it is an AI repository; finally, the trained text classifier is used to identify AI repositories from large-scale open-source repositories.
(2) Training a text classifier and extracting a model.
A text classifier (the classifier of model class code) is trained on a manually constructed dataset and used to extract models from open-source code. An AI repository contains a large amount of code, only a small part of which implements models. In a code file, a model is usually implemented as a complete class code block, and the content of such a class code block (e.g., the class name and the method names it contains) differs markedly from non-model code, so a text classification approach is adopted to extract model implementation code. The specific operation is as follows:
First, code files are identified from the open-source code repository using heuristic rules (e.g., files ending in ".py"), and regular expressions are further used to segment class code blocks from them; then, a labeled model dataset is constructed manually using the data on PapersWithCode, containing randomly sampled class code blocks together with labels indicating whether each block is a model implementation class; next, a supervised learning method is used to train a classifier that can predict from a class code block whether it is a model implementation class; finally, the trained text classifier is used to classify the class code blocks extracted from large-scale open-source code, and models are extracted based on the classification results.
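The concrete regular expression appears only as a formula placeholder in the original text and is not reproduced here. Purely as an illustration, the following Python sketch shows one possible way to segment top-level class code blocks from a ".py" file; the pattern and function names are assumptions, not the filing's actual expression.

```python
import re

# Hypothetical sketch: match a top-level "class" header together with its indented
# body, stopping at the next top-level definition or at the end of the file.
CLASS_BLOCK_RE = re.compile(
    r"^class\s+\w+.*?(?=^\S|\Z)",
    re.MULTILINE | re.DOTALL,
)

def extract_class_blocks(source: str):
    """Return the text of every top-level class block in a Python source file."""
    return [m.group(0) for m in CLASS_BLOCK_RE.finditer(source)]

if __name__ == "__main__":
    demo = (
        "import torch.nn as nn\n\n"
        "class MyModel(nn.Module):\n"
        "    def __init__(self):\n"
        "        super().__init__()\n"
        "        self.lstm = nn.LSTM(10, 20)\n\n"
        "def helper():\n"
        "    pass\n"
    )
    for block in extract_class_blocks(demo):
        print(block)
        print("---")
```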
(3) Component and dependency extraction.
Component instances, components, and the dependencies between components are extracted from the code of model implementation classes using abstract syntax tree analysis and heuristic rules. Components typically pass inputs and outputs to one another, which appears in code as parameter passing, so components and their dependencies can be extracted by analyzing the abstract syntax tree of the code. In addition, the code in a model only uses components, so the implementation code of each component is looked up in the code repository where the model is implemented. The specific operation is as follows:
First, a static code analysis tool is used to convert the model implementation class code into an abstract syntax tree; the tree is traversed, and the class objects involved in assignment statements in the constructor are identified as component instances, with their class names identified as components. Then, based on the abstract syntax tree analysis, the parameter passing relations between local variables in the code are obtained and used as the dependency relations between the corresponding components. Finally, since a code repository often implements components as classes, the class names of all class code blocks extracted from the repository are string-matched against the identified component names; if a class name equals a component name, that class is taken as the implementation code of the component.
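The following is a minimal Python sketch of this kind of extraction, assuming PyTorch-style model classes whose components are instantiated in __init__ and wired together in forward; the helper names are illustrative and are not the tooling used by the invention.

```python
import ast

def extract_components(class_source: str):
    """Sketch: find component instances assigned in __init__ and the classes they instantiate."""
    tree = ast.parse(class_source)
    instances = {}   # instance name (e.g. "self.lstm") -> component name (e.g. "nn.LSTM")
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == "__init__":
            for stmt in ast.walk(node):
                if isinstance(stmt, ast.Assign) and isinstance(stmt.value, ast.Call):
                    name = ast.unparse(stmt.targets[0])
                    component = ast.unparse(stmt.value.func)
                    instances[name] = component
    return instances

def extract_dependencies(class_source: str):
    """Sketch: treat 'output of one component fed into another' in forward() as a dependency edge."""
    tree = ast.parse(class_source)
    var_source = {}   # local variable -> component instance that produced it
    edges = set()     # (producer instance, consumer instance)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == "forward":
            for stmt in node.body:
                if isinstance(stmt, ast.Assign) and isinstance(stmt.value, ast.Call):
                    callee = ast.unparse(stmt.value.func)      # e.g. "self.fc"
                    for arg in stmt.value.args:
                        for used in ast.walk(arg):
                            if isinstance(used, ast.Name) and used.id in var_source:
                                edges.add((var_source[used.id], callee))
                    for t in stmt.targets:
                        if isinstance(t, ast.Name):
                            var_source[t.id] = callee
    return edges
```

Feeding the class block of a PyTorch module to these two helpers yields the instance-to-component map and a set of parameter-passing edges that can populate the component dependency graph.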
(4) High-level concept mining.
To give the model knowledge graph higher-level concept understanding and deeper semantic reasoning capabilities, high-level semantic concepts need to be extracted from the models and components. The invention adopts a prefix- and suffix-based pattern mining method to mine high-level semantic concepts from the extracted models and components. The specific operation is as follows:
First, the names of all model or component nodes are split by camel case to obtain token lists; then, N-gram prefix and suffix word sets are derived from the token lists; next, the prefix and suffix word sets are matched against the set of models or components, and the successfully matched prefix or suffix words are added to the set of high-level concepts; finally, generic (is-a) relations are added between each model or component containing a high-level concept and the corresponding high-level concept.
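A minimal sketch of such prefix/suffix pattern mining, assuming camel-cased model and component names; the N-gram length and support threshold below are illustrative choices, not values from the filing.

```python
import re
from collections import Counter

def camel_split(name: str):
    """Split a camel-case name into lowercase tokens, e.g. 'BiLSTMEncoder' -> ['bi', 'lstm', 'encoder']."""
    tokens = re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", name)
    return [t.lower() for t in tokens]

def mine_high_level_concepts(names, max_n=2, min_support=2):
    """Collect N-gram prefixes/suffixes of the token lists and keep those shared by enough names."""
    counts = Counter()
    for name in names:
        tokens = camel_split(name)
        for n in range(1, max_n + 1):
            if len(tokens) > n:                        # a concept must be a proper prefix/suffix
                counts[" ".join(tokens[:n])] += 1      # prefix
                counts[" ".join(tokens[-n:])] += 1     # suffix
    return {c for c, k in counts.items() if k >= min_support}

concepts = mine_high_level_concepts(["BiLSTMEncoder", "LSTMDecoder", "TreeLSTM", "CNNEncoder"])
# e.g. {'lstm', 'encoder'} become candidate high-level concepts, and an is-a edge is
# added from each matching model or component node to the mined concept node.
```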
(5) Component type links.
The component type information in PapersWithCode includes the component type, the component type description, the components belonging to it, and so on. The invention links the extracted high-level component concepts to the component types in PapersWithCode. First, 350 component types and 1,802 components are obtained by preprocessing the component dataset provided by PapersWithCode and converted into a component-to-component-type mapping, which serves as the input of the algorithm. Then, each component in the high-level component concepts and the component type information is traversed in turn and keyword matching is performed; if the match succeeds, the high-level concept is linked directly to the corresponding component type. Otherwise, the pre-trained Wikipedia word vectors provided by Google (https://code.google.com/archive/p/word2vec/) are used to vectorize the high-level concepts and component types, and their similarity is measured by computing the cosine similarity between the vectors together with the word-level Jaccard text similarity. If the sum of the two similarities is greater than a custom similarity threshold, the high-level concept is linked to the corresponding component type.
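A hedged sketch of this linking step, assuming the pre-trained vectors are loaded with gensim's KeyedVectors as a stand-in for the word2vec release referenced above; the file path, threshold, and function names are assumptions.

```python
import numpy as np
from gensim.models import KeyedVectors

# Assumption: a pre-trained word2vec file has been downloaded locally; the path is illustrative.
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def phrase_vector(phrase: str):
    """Average the word vectors of the in-vocabulary tokens of a phrase."""
    vecs = [kv[w] for w in phrase.lower().split() if w in kv]
    return np.mean(vecs, axis=0) if vecs else None

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def jaccard(a: str, b: str):
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def link_concept_to_type(concept: str, component_types, threshold=1.0):
    """Link a mined high-level concept to the best-matching component type, if the summed similarity clears the threshold."""
    best, best_score = None, threshold
    for ctype in component_types:
        va, vb = phrase_vector(concept), phrase_vector(ctype)
        score = (cosine(va, vb) if va is not None and vb is not None else 0.0) + jaccard(concept, ctype)
        if score >= best_score:
            best, best_score = ctype, score
    return best
```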
(6) Component descriptive knowledge extraction.
To enrich the model knowledge graph and enhance the interpretability of models, the invention extracts descriptive knowledge about components. First, component-related text corpora are collected from PapersWithCode, Wikipedia, Wikidata, and AI-related paper abstracts. Then, the text corpora are analyzed in three different ways to obtain three types of descriptive knowledge for the components: component characteristic extraction, component description extraction, and component open relation extraction. In component characteristic extraction, a natural language processing tool is used to perform dependency parsing and part-of-speech tagging on the sentences in the component corpus, the subject of each sentence is identified, and if the subject is a component, adjectives, adverbs, or noun phrases are extracted from the sentence as characteristics of that component. In component description extraction, sentences that begin with a component name, or that contain both a component name and a component characteristic, are extracted as descriptions of the component. In component open relation extraction, an open relation extraction tool is used to extract scored (head entity, relation, tail entity) triples from the component corpus, and the triples whose score is above a threshold and whose head and tail entities are both components are kept to supplement the relations between components in the knowledge graph.
(II) Online model reference implementation recommendation method
First, the components used by the model and the dependencies between them are identified from the input implementation code of the model to be modified. Then, the identified components are mapped to high-level concepts in the model knowledge graph, and model implementations sharing the same high-level concepts are retrieved as candidate model implementations. Finally, a graph kernel method is used to compute the similarity between the high-level concept relation graph of each candidate model and that of the input model; the candidates are sorted and clustered by similarity, and the top k most similar model implementations are selected as reference implementations of the model, together with the corresponding interpretation. The specific steps are as follows:
(1) Component and dependency extraction.
The components used in the user-input model implementation and the dependencies between them are extracted from the user-input model implementation code through static code analysis, and a component dependency graph is generated. The overall procedure is similar to the component and dependency extraction step in the model knowledge graph construction method. The specific operation is as follows:
First, a static code analysis tool is used to convert the user-input model implementation class code into an abstract syntax tree; the tree is traversed, and the class objects involved in assignment statements in the constructor are identified as component instances, with their class names identified as components. Then, based on the abstract syntax tree analysis, the parameter passing relations between local variables in the code are obtained and used as the dependency relations between the corresponding components.
(2) Candidate result generation.
To obtain results that are similar at a high semantic level, the components are abstracted to a higher level. First, the list of components used in the user-input model implementation code is obtained. Then, concept mapping is performed to map the components to concept nodes in the model knowledge graph; next, the concept nodes are expanded upward until the most abstract concept node reachable from each component is obtained; finally, model implementations whose abstract concepts intersect with those of the input are selected according to the obtained set of abstract concept nodes, yielding candidate results that are similar at a high semantic level. The specific operation is as follows:
First, the components extracted in the component and dependency extraction step are mapped directly, by name, to concrete component nodes in the model knowledge graph. If direct text matching fails to map a component to a component node, fuzzy matching is performed between the component and the high-level concept nodes in the model knowledge graph: if the component name contains a component concept, the component is mapped to that component concept node. If the mapping is still unsuccessful, the component is considered to be either an extraction error or so rarely used that it has little impact on the result, and it can be discarded directly.
Then, after a component has obtained its corresponding node in the model knowledge graph through concept mapping, that node is used as a starting point for upward expansion in the knowledge graph, repeatedly taking a node in a hypernym (is-a) relation with the current starting point as the new starting point. The expansion stops when it reaches a component type node or the topmost reachable concept node, and that node is taken as the abstract concept node of the component. For example, the node "nn.LSTM" obtained by concept mapping is expanded to the higher-level concept node "LSTM", and the expansion continues to the component type node "Recurrent Neural Networks"; the component "nn.LSTM" therefore obtains "Recurrent Neural Networks" as its most abstract concept. After concept mapping and concept expansion, a list of abstract concepts corresponding to the list of components in the user-input model implementation is obtained.
To further select similar model implementations as candidate results, the invention applies the same concept expansion to the components used by each model implementation in the model knowledge graph, obtaining a model-implementation-to-abstract-concept-list mapping; this is matched against the abstract concept list of the components in the user-input model implementation, and the model implementations with a non-empty intersection are selected as candidate results.
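A minimal sketch of concept mapping, upward expansion, and candidate filtering, assuming the relevant part of the model knowledge graph is available as a simple child-to-parent map; all node names and data structures are illustrative.

```python
# Illustrative fragment of the knowledge graph's is-a hierarchy (child concept -> parent concept).
PARENT = {
    "nn.LSTM": "LSTM",
    "LSTM": "Recurrent Neural Networks",
    "nn.Linear": "Linear Layer",
}

def expand_to_abstract_concept(component: str):
    """Follow is-a edges upward until no parent remains (the component-type level)."""
    node = component
    while node in PARENT:
        node = PARENT[node]
    return node  # e.g. "nn.LSTM" -> "Recurrent Neural Networks"

def candidate_models(input_components, model_to_components):
    """Keep model implementations whose abstract concepts intersect with the input's."""
    input_concepts = {expand_to_abstract_concept(c) for c in input_components}
    return {
        model: comps
        for model, comps in model_to_components.items()
        if input_concepts & {expand_to_abstract_concept(c) for c in comps}
    }

# Example: a toy knowledge-graph index of two model implementations.
index = {"seq_tagger.py": ["nn.LSTM", "nn.Linear"], "image_cls.py": ["nn.Conv2d"]}
print(candidate_models(["nn.LSTM"], index))   # only "seq_tagger.py" survives the filter
```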
(3) Candidate result ranking.
A model can be regarded as a subgraph consisting of components and the dependencies between them. On the basis of the selected candidate models, the invention computes the subgraph similarity between the user-input model and each candidate model, sorts the candidate results by this similarity, and returns similar models that resemble the user-input model at a high semantic level. The specific operation is as follows:
First, using the model-implementation-to-abstract-concept-list mapping obtained in the previous step, both the user-input model and each candidate model are converted into a subgraph composed of component abstract concepts. Each node in the subgraph is the abstract concept of a component in the model implementation, and each edge corresponds to a component dependency in the model implementation. Then, the Weisfeiler-Lehman graph kernel algorithm [1] is used to compute the graph similarity between the user-input model subgraph and each candidate model subgraph; the Weisfeiler-Lehman graph kernel maps graph data to fixed-length vector representations in the same high-dimensional space, and graphs with similar structural information have very close vector representations, i.e., a higher cosine similarity. Candidate models whose similarity exceeds a custom threshold are selected and clustered by graph similarity with Single-Pass clustering [2]; highly similar candidate models fall into the same cluster, and only the candidate model at the center of each cluster is retained. Finally, the model sample code of the top k most similar clusters and their interpretations are returned as the result, where the interpretation includes the components used by the model as obtained from the model knowledge graph, the relations between the components, descriptive knowledge about the components, and so on. The invention supports a user interface for interactive model recommendation, as shown in Fig. 3. First, the model implementation code entered by the user is parsed to obtain the components used in the model and the dependencies between them; then, through the concept understanding and semantic reasoning capabilities of the model knowledge graph, model sample code with high similarity in high-level component concepts and component dependencies is retrieved as a reference; finally, the model sample code is supplemented with various kinds of interpretability information, such as the components, the component dependencies, descriptive knowledge about the components, and their correspondence to the components in the input. In addition, the user can filter the search results and select models that use specific components.
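A hand-rolled sketch of this ranking step under simplifying assumptions: Weisfeiler-Lehman subtree features are computed over an undirected view of the component-concept graph, compared by cosine similarity, and candidates are grouped by a greedy single pass. A production system would more likely rely on an existing graph-kernel library, and the iteration count and thresholds below are illustrative.

```python
from collections import Counter
from math import sqrt

def wl_features(nodes, edges, iterations=3):
    """Weisfeiler-Lehman subtree features of a component-concept graph.
    nodes: {node_id: concept label}; edges: iterable of (src, dst), treated as undirected."""
    neigh = {n: [] for n in nodes}
    for s, d in edges:
        neigh[s].append(d)
        neigh[d].append(s)
    labels = dict(nodes)
    features = Counter(labels.values())           # iteration 0: raw concept labels
    for _ in range(iterations):
        labels = {                                # relabel each node with its sorted neighbourhood
            n: labels[n] + "|" + ",".join(sorted(labels[m] for m in neigh[n]))
            for n in nodes
        }
        features.update(labels.values())
    return features

def cosine(f1: Counter, f2: Counter):
    dot = sum(f1[k] * f2[k] for k in f1.keys() & f2.keys())
    n1 = sqrt(sum(v * v for v in f1.values()))
    n2 = sqrt(sum(v * v for v in f2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def single_pass_cluster(scored_models, threshold=0.9):
    """Greedy single-pass clustering: a model joins the first cluster whose centre is
    similar enough, otherwise it starts a new cluster."""
    clusters = []   # list of (centre features, [model names])
    for name, feats in scored_models:
        for centre, members in clusters:
            if cosine(centre, feats) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((feats, [name]))
    return clusters
```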
The method has the following characteristics:
(1) A model knowledge graph construction method is designed, which takes open-source code repositories and component-related text corpora as input and outputs a model knowledge graph fusing knowledge about AI repositories, models, components, and implementation code; it specifically includes key steps such as AI repository identification, model and implementation extraction, component and dependency extraction, high-level concept mining, component type linking, and component descriptive knowledge extraction;
(2) A model recommendation method is designed, which takes the implementation code of an existing model that needs to be modified as input and, based on the model knowledge graph fusing multi-source knowledge, matches model architectures at a high semantic level to obtain similar model references. Along with the recommendation, interpretation information is provided, containing the components used by the model, the relationships between the components, descriptive knowledge about the components, and so on.
Drawings
Fig. 1 is a general flow chart of the present invention.
Fig. 2 is a high-level conceptual model diagram of the model knowledge graph according to the present invention.
Fig. 3 is the model reference implementation recommendation user interface of the present invention.
Detailed Description
The invention is further described below by way of examples with reference to the accompanying drawings.
Using the method described above, a model knowledge graph was constructed from approximately 70,000 AI repositories on GitHub, and deep learning model recommendation based on this model knowledge graph was realized. The specific steps are as follows:
(1) Classifier for AI repositories.
A labeled ReadMe dataset was constructed manually using the data on PapersWithCode, and several text classification models were compared; a convolutional neural network (CNN) was finally selected as the text classification model for identifying AI repositories. The CNN text classifier consists of an embedding layer (vector layer), a convolutional layer, a pooling layer, a fully connected layer, and a Softmax layer (normalization layer). First, the embedding layer encodes the input ReadMe text of an open-source code repository and converts it into a vector representation; then, the convolutional layer extracts different features from the vectors and passes them to the pooling layer; the pooling layer selects the important features and feeds them into the fully connected layer; the fully connected layer integrates the features extracted in the previous steps and feeds them into the Softmax layer; finally, the Softmax layer outputs the result, i.e., whether the input ReadMe text is AI-related, which determines whether the corresponding code repository is an AI repository.
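A PyTorch sketch of the described architecture (embedding, convolution, pooling, fully connected, Softmax). The embodiment itself is built with the Kashgari framework, so this block is only an illustration, and the vocabulary size, dimensions, and kernel sizes are assumptions.

```python
import torch
import torch.nn as nn

class ReadMeCNN(nn.Module):
    """Illustrative CNN ReadMe classifier: Embedding -> Conv -> Pool -> FC -> Softmax."""
    def __init__(self, vocab_size=30000, embed_dim=128, num_filters=100, kernel_sizes=(3, 4, 5)):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList([nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes])
        self.fc = nn.Linear(num_filters * len(kernel_sizes), 2)   # AI repository vs. not

    def forward(self, token_ids):                        # token_ids: (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)    # (batch, embed_dim, seq_len)
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]  # global max pooling
        logits = self.fc(torch.cat(pooled, dim=1))
        return torch.softmax(logits, dim=1)              # probability that the ReadMe is AI-related
```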
A supervised learning approach was used to train the CNN text classifier. Using the repository links provided by PapersWithCode, the ReadMe texts of the code repositories were crawled and 1,000 ReadMes were randomly selected as positive samples; another 1,000 ReadMes were randomly selected from large-scale open-source repositories as negative samples. The concrete model implementation uses the Kashgari framework, an NLP transfer learning framework commonly used for text labeling and text classification.
(2) Classifier for model class code.
Several text classification models were built and compared with the Kashgari framework, and a two-layer bidirectional long short-term memory network (BiLSTM) was finally selected as the classifier for model class code, used to identify model implementation classes. The model consists of an embedding layer, a two-layer BiLSTM layer, a Dropout layer (random deactivation), a fully connected layer, and a Softmax layer. First, the embedding layer encodes the input class code to be identified and converts it into a vector representation; then, the two-layer BiLSTM extracts contextual information; next, to prevent overfitting, the Dropout layer randomly disables some hidden-layer node weights before passing the result to the fully connected layer; the fully connected layer integrates the information extracted in the previous steps and feeds it into the Softmax layer; finally, the Softmax layer outputs the result, i.e., whether the input class code is a model implementation class.
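A PyTorch sketch of the described two-layer BiLSTM classifier; as above, the embodiment uses Kashgari, and all dimensions here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ModelClassBiLSTM(nn.Module):
    """Illustrative class-code classifier: Embedding -> 2-layer BiLSTM -> Dropout -> FC -> Softmax."""
    def __init__(self, vocab_size=30000, embed_dim=128, hidden=128, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden, num_layers=2,
                              bidirectional=True, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden * 2, 2)    # model-implementation class vs. not

    def forward(self, token_ids):              # token_ids: (batch, seq_len)
        x = self.embedding(token_ids)
        _, (h_n, _) = self.bilstm(x)            # h_n: (num_layers * 2, batch, hidden)
        last = torch.cat([h_n[-2], h_n[-1]], dim=1)   # final forward + backward states of top layer
        return torch.softmax(self.fc(self.dropout(last)), dim=1)
```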
To train the model class code classifier, based on the models in the PapersWithCode dataset and their corresponding code repositories, 18,033 class code blocks whose class names equal a model name were selected from the class code blocks of those repositories as positive samples. The same number of non-model implementation class code blocks were manually selected from the remaining class code as negative samples, and the labeled class code dataset was split into training, validation, and test sets in a 6:2:2 ratio.
(3) Component descriptive knowledge extraction.
Component descriptive knowledge extraction consists of three parts: component characteristic extraction, component description extraction, and component open relation extraction. The invention uses the natural language processing tool spaCy (https://spacy.io/) to perform dependency parsing and part-of-speech tagging on the sentences in the component corpus. The subject of each sentence is identified, and if the subject is a component, adjectives, adverbs, or noun phrases are extracted from the sentence as characteristics of that component. Using these component characteristics and heuristic rules, sentences that begin with a component name, or that contain both a component name and a component characteristic, are extracted from the component corpus as descriptions of the component. Finally, the OpenIE tool of Stanford CoreNLP (https://nlp.stanford.edu/software/openie.html) is used to extract scored (head entity, relation, tail entity) triples from the component corpus, and the high-scoring open relations between component entities are selected to supplement the relations between components in the knowledge graph.
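A minimal spaCy sketch of the component characteristic extraction described above; the pipeline name and the example sentence are assumptions, and the exact rules used in the filing may differ.

```python
import spacy

# Assumes the small English pipeline has been installed:  python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def component_characteristics(sentence: str, component_names):
    """Sketch: if the sentence's subject is a known component, collect adjectives/adverbs
    attached to the predicate as characteristics of that component."""
    doc = nlp(sentence)
    results = {}
    for token in doc:
        if token.dep_ == "nsubj" and token.text in component_names:
            head = token.head
            feats = [t.text for t in head.subtree
                     if t.pos_ in ("ADJ", "ADV") and t.text != token.text]
            if feats:
                results[token.text] = feats
    return results

print(component_characteristics(
    "LSTM is highly effective for long sequences.", {"LSTM"}))
# e.g. {'LSTM': ['highly', 'effective', 'long']}
```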
(4) Model knowledge graph construction.
First, 65,279 Python open-source code repositories were cloned from GitHub according to the 93,303 open-source repository links provided by PapersWithCode. At the same time, 81 open-source repositories of different versions of common Python deep learning frameworks such as TensorFlow, PyTorch, Keras, MXNet, Chainer, and Theano were cloned. In addition, AI repositories were identified from the Python third-party libraries provided by the Libraries.io dataset (https://libraries.io/) using the text classification model, extending the collection by 4,710 open-source code repositories. These three parts together form the open-source code repository dataset used for knowledge graph construction. Furthermore, 1,647 component text description fragments were obtained from the component dataset provided by PapersWithCode; 4,847 component concept descriptions were obtained from the Wikipedia and Wikidata datasets; and 680,475 sentences containing components were selected from 65,084 AI-related paper abstracts. These four parts of component-related text data form the component corpus used for model knowledge graph construction. Through the knowledge graph construction steps described above, a model knowledge graph with 651,328 entity nodes and 1,810,335 relations was generated. The entity nodes comprise 21,359 open-source repository nodes, 42,577 model implementation nodes, 29,019 component implementation nodes, 24,063 model nodes, 20,189 model high-level concept nodes, 102,969 component nodes, 222,381 component instance nodes, 198,80 component high-level concept nodes, 161,970 component use-code nodes, 350 component type nodes, 4,087 component characteristic nodes, and 2,484 component description nodes. The relations include 119,654 subclass-of relations, 3,838 belong-to relations, 664,423 use relations, 121,363 provide relations, 126,463 image relations, 222,381 instance-of relations, 270,976 has-use-code relations, 118,073 follow relations, and 163,164 open relations.
The invention evaluated the key steps of model knowledge graph construction through several experiments: the accuracy of AI repository identification reached 98.94%, and the accuracy of model implementation class extraction reached 98.00%. In addition, 384 triples sampled from the model knowledge graph were manually labeled, and the final accuracy reached 90.63%. This shows that each step of model knowledge graph construction is effective overall and that the overall quality of the model knowledge graph is high. To verify the effectiveness of the invention for model recommendation, 10 developers were invited to complete model modification tasks. In this experiment, one group of participants was asked to use the invention to find similar models for a given AI model and complete the modification task, while another group completed the same task using GitHub as the comparison method. The experiments show that the invention is effective for model recommendation: with the invention, participant satisfaction with the model modification results was about 36.4% higher than with GitHub, task completion time was reduced by about 41.2%, the number of searches was reduced by about 58.5%, and the ranking of the relevant search result was improved by about 79.5%. The participants considered the invention more useful and easier to use than GitHub.
References
[1] Shervashidze N, Schweitzer P, Leeuwen EJ, Mehlhorn K, Borgwardt KM. Weisfeiler-Lehman Graph Kernels. Journal of Machine Learning Research, 2011, 12.
[2] Papka R, Allan J. On-line new event detection using single pass clustering. University of Massachusetts, Amherst, 1998, 10(290941.290954).

Claims (6)

1. A knowledge-graph-based deep learning model recommendation method, characterized by comprising an offline model knowledge graph construction method and an online model recommendation method, which together realize deep learning model recommendation;
the offline model knowledge graph construction method takes open-source code repositories and component-related text corpora as input and outputs a model knowledge graph fusing multi-source knowledge, the multi-source knowledge comprising knowledge about AI repositories, models, components, and implementations; it comprises the following steps:
first, identifying AI-related open-source repositories by using a trained classifier;
extracting model implementation classes from the repository code, and extracting the dependencies between components from the model implementation code through abstract syntax tree analysis and heuristic rules;
then, mining high-level semantic concepts from the models and components by using a prefix- and suffix-based pattern mining method, and linking the high-level component concepts to the component types provided by PapersWithCode;
in addition, extracting component characteristics and component descriptions from the component text corpus by using dependency parsing and part-of-speech tagging, and extracting open relations between components by using heuristic rules based on open information extraction techniques;
the online model recommendation method takes the implementation code of an existing model that needs to be modified as input, and outputs the several most similar model references and the corresponding interpretation information; the interpretation information comprises the components used by the model, the relations between the components, and descriptive knowledge about the components; it comprises the following steps:
first, identifying the components used by the model and the dependencies between the components from the input implementation code of the model to be modified;
then, mapping the identified components to high-level concepts in the model knowledge graph, and retrieving model implementations sharing the same high-level concepts as candidate model implementations;
finally, computing the similarity between the high-level concept relation graph of each candidate model and that of the input model by a graph kernel method, sorting and clustering the candidates by similarity, and selecting the top k most similar model implementations as reference implementations of the model, together with the corresponding interpretations;
based on the model knowledge graph fusing multi-source knowledge, the method matches model architectures at a high semantic level according to the model implementation code entered by the user, thereby obtaining similar model references; by fusing multi-source knowledge to construct the model knowledge graph, background knowledge in the artificial intelligence field is fully utilized, the reuse of deep learning model implementations is promoted, and the development efficiency of AI application developers is improved.
2. The knowledge-graph-based deep learning model recommendation method according to claim 1, wherein the offline model knowledge graph construction method comprises the following specific steps:
(1) AI repository identification;
training a text classifier on a manually collected ReadMe dataset, and classifying the ReadMe text of open-source code repositories to identify AI-related repositories; the specific flow is as follows:
first, manually constructing a labeled ReadMe dataset using the data on PapersWithCode;
then, training a text classifier model on the dataset, the model being able to predict from the ReadMe text of an open-source repository whether it is an AI repository;
finally, identifying AI repositories from large-scale open-source repositories by using the text classifier model;
(2) Text classifier training and model extraction;
training a text classifier, also called the classifier of model class code, on a manually constructed dataset, for extracting models from open-source code; the specific flow is as follows:
first, identifying code files from the open-source code repository by using heuristic rules, and further segmenting class code blocks from the identified code files by using regular expressions;
then, manually constructing a labeled model dataset using the data on PapersWithCode, the dataset comprising randomly sampled class code blocks and labels indicating whether each block is a model implementation class;
next, training a classifier by a supervised learning method, the classifier being able to predict from a class code block whether it is a model implementation class;
finally, classifying the class code blocks extracted from large-scale open-source code by using the trained text classifier, and extracting models based on the classification results;
(3) Component and dependency extraction;
extracting component instances, components, and the dependencies between components from the code of model implementation classes by using abstract syntax tree analysis and heuristic rules; the specific flow is as follows:
first, converting the model implementation class code into an abstract syntax tree by using a static code analysis tool, traversing the abstract syntax tree, and identifying the class objects involved in assignment statements in the constructor as component instances and their class names as components;
then, obtaining the parameter passing relations between local variables in the code based on the abstract syntax tree analysis, and using them as the dependency relations between the corresponding components;
finally, since a code repository often implements components in the form of classes, string-matching the class names of all class code blocks extracted from the code repository against the identified component names, and if a class name is the same as a component name, taking that class as the implementation code of the component;
(4) High-level concept mining;
extracting high-level semantic concepts from the models and components in order to give the model knowledge graph higher-level concept understanding and deeper semantic reasoning capabilities; specifically, mining high-level semantic concepts from the extracted models and components by a prefix- and suffix-based pattern mining method; the specific flow is as follows:
first, splitting the names of all model or component nodes by camel case to obtain token lists;
then, obtaining N-gram prefix and suffix word sets from the token lists;
next, matching the prefix and suffix word sets against the set of models or components, and adding the successfully matched prefix and suffix words to the set of high-level concepts;
finally, adding generic (is-a) relations between the models or components containing a high-level concept and the corresponding high-level concept;
(5) Component type linking;
the component type information in PapersWithCode comprises the component type, the component type description, and the components belonging to it; the extracted high-level component concepts are linked to the component types in PapersWithCode; the specific flow is as follows:
first, obtaining 350 component types and 1,802 components by preprocessing the component dataset provided by PapersWithCode, and converting them into a component-to-component-type mapping as the input of the algorithm;
then, traversing each component in the high-level component concepts and component type information in turn and performing keyword matching; if the match succeeds, linking the high-level concept directly to the corresponding component type; otherwise, vectorizing the high-level concepts and component types with the pre-trained Wikipedia word vectors provided by Google, and measuring their similarity by computing the cosine similarity between the vectors together with the word-level Jaccard text similarity; if the sum of the two similarities is greater than the custom similarity threshold, linking the high-level concept to the corresponding component type;
(6) Component descriptive knowledge extraction;
first, collecting component-related text corpora from PapersWithCode, Wikipedia, Wikidata, and AI-related paper abstracts;
then, analyzing the text corpora in three different ways to obtain three types of descriptive knowledge for the components, namely component characteristic extraction, component description extraction, and component open relation extraction; wherein:
component characteristic extraction means using a natural language processing tool to perform dependency parsing and part-of-speech tagging on the sentences in the component corpus, identifying the subject of each sentence, and, if the subject is a component, extracting adjectives, adverbs, or noun phrases from the sentence as characteristics of the component;
component description extraction means extracting sentences that begin with a component name, or that contain both a component name and a component characteristic, as descriptions of the component;
component open relation extraction means using an open relation extraction tool to extract scored (head entity, relation, tail entity) triples from the component corpus, and selecting the triples whose score is greater than a threshold and whose head and tail entities are both components, so as to supplement the relations between components in the knowledge graph.
3. The knowledge-graph-based deep learning model recommendation method according to claim 2, wherein the online model reference implementation recommendation method comprises the following specific steps:
(1) Component and dependency extraction
extracting the components used in the user-input model implementation and the dependencies between them from the user-input model implementation code by static code analysis, and generating a component dependency graph; the specific flow is as follows:
first, converting the user-input model implementation class code into an abstract syntax tree by using a static code analysis tool, traversing the abstract syntax tree, and identifying the class objects involved in assignment statements in the constructor as component instances and their class names as components;
then, obtaining the parameter passing relations between local variables in the code based on the abstract syntax tree analysis, and using them as the dependency relations between the corresponding components;
(2) Candidate result generation;
the specific operation is as follows:
first, mapping the components extracted in the component and dependency extraction step directly, by name, to concrete component nodes in the model knowledge graph; if direct text matching fails to map a component to a component node, performing fuzzy matching between the component and the high-level concept nodes in the model knowledge graph; if the component name contains a component concept, mapping the component to that component concept node; if the mapping is still unsuccessful, considering the component to be either an extraction error or so rarely used that it has little impact on the result, and discarding it directly;
then, after a component has obtained its corresponding node in the model knowledge graph through concept mapping, taking that node as a starting point and expanding upward in the knowledge graph, taking a node in a hypernym relation with the starting point as the new starting point; stopping when a component type node or the topmost reachable concept node is reached, and taking that node as the abstract concept node of the component; after concept mapping and concept expansion, obtaining a list of abstract concepts corresponding to the list of components in the user-input model implementation;
to further select similar model implementations as candidate results, applying concept expansion to the components used by each model implementation in the model knowledge graph to obtain a model-implementation-to-abstract-concept-list mapping, matching it against the abstract concept list of the components in the user-input model implementation, and selecting the model implementations with a non-empty intersection as candidate results;
(3) Candidate result ranking
a model is regarded as a subgraph consisting of components and the dependencies between them; on the basis of the selected candidate models, computing the subgraph similarity between the user-input model and each candidate model, and ranking the candidate results by the subgraph similarity, thereby returning similar models that resemble the user-input model at a high semantic level; the specific flow is as follows:
first, converting both the user-input model and the candidate models into subgraphs composed of component abstract concepts by using the model-implementation-to-abstract-concept-list mapping obtained in the previous step; each node in a subgraph is the abstract concept of a component in the model implementation, and each edge corresponds to a component dependency in the model implementation;
then, computing the graph similarity between the user-input model subgraph and each candidate model subgraph by using the Weisfeiler-Lehman graph kernel algorithm, which maps graph data to fixed-length vector representations in the same high-dimensional space, the vector representations of graphs with similar structural information being very close, i.e., having a higher cosine similarity;
next, selecting candidate models whose similarity is greater than a custom threshold, and clustering them by graph similarity with Single-Pass clustering, highly similar candidate models being grouped into one cluster and only the candidate model at the center of each cluster being retained;
finally, returning the model sample code of the top k most similar clusters and their interpretations as the result, the interpretation comprising the components used by the model as obtained from the model knowledge graph, the relations between the components, and descriptive knowledge about the components.
4. The knowledge-graph-based deep learning model recommendation method according to claim 3, characterized in that a user interface for interactive model recommendation is supported;
first, the model implementation code entered by the user is parsed to obtain the components used in the model and the dependencies between them;
then, through the concept understanding and semantic reasoning capabilities of the model knowledge graph, model sample code with high similarity in high-level component concepts and component dependencies is retrieved as a reference;
finally, the model sample code is supplemented with various kinds of interpretability information, namely the components, the component dependencies, descriptive knowledge about the components, and their correspondence to the components in the input;
in addition, the user can also filter the search results and select models that use specific components.
5. The knowledge-graph-based deep learning model recommendation method according to any one of claims 1 to 4, characterized in that the text classifier for AI repository identification adopts a convolutional neural network (CNN) model, which consists of a vector layer, a convolutional layer, a pooling layer, a fully connected layer, and a normalization layer; first, the vector layer encodes the input ReadMe text of the open-source code repository and converts it into a vector representation; then, the convolutional layer extracts different features from the vectors and passes them to the pooling layer; next, the pooling layer selects the important features from the extracted features and feeds them into the fully connected layer; the fully connected layer integrates the features extracted in the previous steps and feeds them into the normalization layer; finally, the normalization layer outputs the result, i.e., whether the input ReadMe text is AI-related, so as to determine whether the corresponding code repository is an AI repository.
6. The knowledge-graph-based deep learning model recommendation method according to any one of claims 1 to 4, characterized in that the classifier of model class code adopts a two-layer bidirectional long short-term memory network (BiLSTM), which consists of a vector layer, a two-layer BiLSTM layer, a Dropout layer, a fully connected layer, and a Softmax layer; first, the vector layer encodes the input class code to be identified and converts it into a vector representation; then, the two-layer BiLSTM extracts contextual information; next, to prevent overfitting, the Dropout layer randomly disables some hidden-layer node weights before passing the result to the fully connected layer; the fully connected layer integrates the information extracted in the previous steps and feeds it into the Softmax layer; finally, the Softmax layer outputs the result, i.e., whether the input class code is a model implementation class.
CN202211416498.0A 2022-11-13 2022-11-13 Deep learning model recommendation method based on knowledge graph Pending CN116108191A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211416498.0A CN116108191A (en) 2022-11-13 2022-11-13 Deep learning model recommendation method based on knowledge graph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211416498.0A CN116108191A (en) 2022-11-13 2022-11-13 Deep learning model recommendation method based on knowledge graph

Publications (1)

Publication Number Publication Date
CN116108191A true CN116108191A (en) 2023-05-12

Family

ID=86255068

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211416498.0A Pending CN116108191A (en) 2022-11-13 2022-11-13 Deep learning model recommendation method based on knowledge graph

Country Status (1)

Country Link
CN (1) CN116108191A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116628228A (en) * 2023-07-19 2023-08-22 安徽思高智能科技有限公司 RPA flow recommendation method and computer readable storage medium
CN116628228B (en) * 2023-07-19 2023-09-19 安徽思高智能科技有限公司 RPA flow recommendation method and computer readable storage medium
CN117540062A (en) * 2024-01-10 2024-02-09 广东省电信规划设计院有限公司 Retrieval model recommendation method and device based on knowledge graph
CN117540062B (en) * 2024-01-10 2024-04-12 广东省电信规划设计院有限公司 Retrieval model recommendation method and device based on knowledge graph

Similar Documents

Publication Publication Date Title
CN109271529B (en) Method for constructing bilingual knowledge graph of Xilier Mongolian and traditional Mongolian
CN112199511A (en) Cross-language multi-source vertical domain knowledge graph construction method
CN110245238B (en) Graph embedding method and system based on rule reasoning and syntax mode
Kaur Incorporating sentimental analysis into development of a hybrid classification model: A comprehensive study
US20220004545A1 (en) Method of searching patent documents
CN116108191A (en) Deep learning model recommendation method based on knowledge graph
US20210350125A1 (en) System for searching natural language documents
CN113806563A (en) Architect knowledge graph construction method for multi-source heterogeneous building humanistic historical material
CN110633365A (en) Word vector-based hierarchical multi-label text classification method and system
US20210397790A1 (en) Method of training a natural language search system, search system and corresponding use
CN115329085A (en) Social robot classification method and system
CN114265935A (en) Science and technology project establishment management auxiliary decision-making method and system based on text mining
Gelman et al. A language-agnostic model for semantic source code labeling
CN112632223B (en) Case and event knowledge graph construction method and related equipment
Xiao et al. Information extraction from the web: System and techniques
CN117574898A (en) Domain knowledge graph updating method and system based on power grid equipment
CN115934936A (en) Intelligent traffic text analysis method based on natural language processing
Saeidi et al. Graph representation learning in document wikification
Moreira et al. Deepex: A robust weak supervision system for knowledge base augmentation
Dziczkowski et al. An autonomous system designed for automatic detection and rating of film reviews
CN113111288A (en) Web service classification method fusing unstructured and structured information
Hu et al. Cssam: Code search via attention matching of code semantics and structures
Yang et al. Construction and analysis of scientific and technological personnel relational graph for group recognition
Zeng et al. TagNN: A Code Tag Generation Technology for Resource Retrieval from Open-Source Big Data
Cuculovic Modeling and optimization of an online publishing application

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination