CN112650833A - API (application program interface) matching model establishing method and cross-city government affair API matching method - Google Patents

API (application program interface) matching model establishing method and cross-city government affair API matching method

Info

Publication number
CN112650833A
CN112650833A
Authority
CN
China
Prior art keywords
api
similarity
apis
matching
short
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011558922.6A
Other languages
Chinese (zh)
Inventor
李旭涛
龙永深
陈武桥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Shenzhen Graduate School Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Graduate School Harbin Institute of Technology filed Critical Shenzhen Graduate School Harbin Institute of Technology
Priority to CN202011558922.6A priority Critical patent/CN112650833A/en
Publication of CN112650833A publication Critical patent/CN112650833A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/24 Querying
    • G06F 16/248 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/3332 Query translation
    • G06F 16/3335 Syntactic pre-processing, e.g. stopword elimination, stemming
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G06N 20/20 Ensemble learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an API matching model establishing method and a cross-city government affair API matching method. The model establishing method comprises the following steps: acquiring training samples, wherein each training sample comprises the description texts of two APIs (application program interfaces) and the description text of each API consists of at least one short text; calculating the semantic similarity between the short texts of the two APIs in each training sample; constructing, from those semantic similarities, a similarity vector corresponding to each training sample; and inputting the similarity vectors into a preset model for training until the loss function of the preset model converges, the converged preset model being taken as the API matching model. By converting the description information of each API into similarities and using the semantic similarity between APIs as the input data for model training, the invention effectively improves matching accuracy and realizes a high-accuracy, high-efficiency, highly automated API matching scheme.

Description

API (application program interface) matching model establishing method and cross-city government affair API matching method
Technical Field
The invention relates to the technical field of machine learning, and in particular to an API matching model establishing method and a cross-city government affair API matching method.
Background
In recent years, local governments across China have launched government data opening platforms, from which many applications based on open government data have been derived. However, since there is no unified standard among the APIs provided by the various governments' open data platforms, it is difficult for developers to find equivalent APIs across cities: the APIs must be searched and screened manually, which is inefficient and prone to omissions. This leads to higher development costs when building cross-city applications and higher migration costs when an application moves from one city to another.
Currently, besides manual search and matching, the existing technology also uses string matching to find equivalent APIs between cities, for example by computing the longest common substring or the edit distance. The string similarity between the description text of the API to be matched and that of every API of the target city is calculated one by one, and the several APIs with the highest similarity are returned to the user. However, owing to the richness of Chinese expression and the inconsistent naming of government APIs, the accuracy of string matching is low. Moreover, this approach does not minimize manual participation and achieves only semi-automatic matching.
Disclosure of Invention
The invention addresses the problem that the matching accuracy of existing API matching methods is too low.
In order to solve this problem, the invention provides an API matching model establishing method and a cross-city government affair API matching method.
The invention provides an API matching model establishing method, which comprises the following steps:
acquiring training samples, wherein each training sample comprises the description texts of two APIs (application program interfaces), and the description text of each API consists of at least one short text;
calculating semantic similarity between short texts of two APIs in each training sample;
according to semantic similarity between short texts of two APIs in each training sample, constructing a similarity vector corresponding to each training sample;
and inputting the similarity vector into a preset model for training until the loss function of the preset model converges, and taking the preset model with the converged loss function as the API matching model.
Optionally, the description text contains a name short text, a keyword short text and at least one return parameter name short text;
the calculating the semantic similarity between short texts of two APIs in each training sample comprises:
in each training sample, the short texts of the return parameter names of the two APIs are arranged and combined to obtain a plurality of short text pairs of the return parameter names;
calculating the similarity of all the returned parameter name short text pairs;
and calculating the similarity of the short texts of the two API names and the similarity of the short texts of the keywords.
Optionally, the constructing a similarity vector corresponding to each training sample according to semantic similarity between short texts of two APIs in each training sample includes:
selecting a preset number of similarities with the maximum similarity from the similarities of all the returned parameter name short text pairs, wherein the preset number is marked as N;
and combining the preset number of similarities with the similarity of the name short text, the similarity of the keyword short text and a training sample label to form a 1 × M-dimensional similarity vector, wherein M = N + 3.
Optionally, the API matching model is an XGBoost model.
Optionally, the calculating semantic similarity between short texts of two APIs in each training sample includes:
performing word segmentation processing on each short text of the two APIs respectively through a Jieba word segmentation algorithm to obtain a word set after word segmentation processing;
mapping each word in the set of words into a vector using a FastText algorithm;
calculating the TextRank value of each word in the word set;
obtaining sentence vectors of short texts of two APIs (application program interfaces) according to the vectors and the TextRank values of the words in the word set;
and calculating semantic similarity between sentence vectors of the short texts of the two APIs, wherein the semantic similarity between the sentence vectors of the short texts of the two APIs is the semantic similarity between the short texts of the two APIs.
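The pipeline above (Jieba segmentation, FastText word vectors, TextRank weights, weighted sentence vectors, then similarity) can be sketched in a few lines. In the sketch below, toy two-dimensional embeddings stand in for FastText vectors, pre-tokenized words stand in for Jieba output, and the per-word weights are made-up stand-ins for TextRank scores; only the weighted-average and cosine-similarity steps are implemented as described.

```python
import math

def sentence_vector(words, embeddings, weights):
    """Weighted average of word vectors; in the patent's pipeline the
    weights would come from TextRank and the embeddings from FastText."""
    dim = len(next(iter(embeddings.values())))
    vec = [0.0] * dim
    total = 0.0
    for w in words:
        if w in embeddings:
            wt = weights.get(w, 1.0)
            total += wt
            for i, x in enumerate(embeddings[w]):
                vec[i] += wt * x
    return [x / total for x in vec] if total else vec

def cosine(a, b):
    """Cosine similarity between two sentence vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Toy embeddings and stand-in TextRank scores (illustrative values only)
emb = {"air": [1.0, 0.0], "quality": [0.5, 0.5], "index": [0.0, 1.0], "data": [0.2, 0.8]}
wts = {"air": 0.9, "quality": 0.8, "index": 0.4, "data": 0.3}

s1 = sentence_vector(["air", "quality", "index"], emb, wts)
s2 = sentence_vector(["air", "quality", "data"], emb, wts)
print(cosine(s1, s2))  # high similarity: the two short texts share most words
```

The generic words ("data", "index") carry low weights, so the shared high-weight words dominate the similarity, which is the effect the TextRank weighting is meant to achieve.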
The invention also provides an API matching method, which comprises the following steps:
obtaining description texts of two APIs to be matched, wherein the description text of each API consists of at least one short text;
calculating semantic similarity between short texts of the two APIs to be matched;
determining similarity vectors corresponding to the two APIs to be matched according to semantic similarity between short texts of the two APIs to be matched;
and inputting the similarity vectors corresponding to the two APIs to be matched into an API matching model, wherein the output result of the API matching model is the matching result of the two APIs to be matched, and the API matching model is generated based on the API matching model establishing method.
Optionally, after obtaining the description texts of the two APIs to be matched and before calculating the semantic similarity between the short texts of the two APIs to be matched, the method further includes:
identifying whether the description text of each API contains short text of the geographic position qualifier;
and when the description text of any API contains the short text of the geographic position limiting word, removing the short text of the geographic position limiting word.
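A sketch of this removal step; the qualifier list is a made-up example, and a real system would hold the full set of city and district names it serves.

```python
# Hypothetical list of geographic limiting words (illustrative only)
GEO_QUALIFIERS = ["深圳", "深圳市", "哈尔滨", "哈尔滨市", "广东省"]

def strip_geo_qualifiers(short_text: str) -> str:
    """Remove geographic limiting words so that two cities' APIs compare
    on content rather than on place names; longest qualifiers are removed
    first so that e.g. "深圳市" is not left as a dangling "市"."""
    for q in sorted(GEO_QUALIFIERS, key=len, reverse=True):
        short_text = short_text.replace(q, "")
    return short_text

print(strip_geo_qualifiers("深圳市空气质量指数"))  # -> "空气质量指数"
```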
The invention also provides a cross-city government affair API matching method, which comprises the following steps:
receiving a cross-city government affair API matching query request, and obtaining a query API and a query range based on the cross-city government affair API matching query request;
traversing a preset government affair API database, respectively forming an API pair to be matched by the query API and each API in the query range in the preset government affair API database, and judging whether the API pair to be matched is matched based on the API matching method;
and outputting all the APIs which are matched with the query API in the query range in the preset government affair API database.
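The traversal described above can be sketched as follows; the database entries and city names are made up, and `is_match` stands in for the trained API matching model's prediction.

```python
def find_matching_apis(query_api, query_cities, api_db, is_match):
    """Pair the query API with every API of the target cities and keep the matches."""
    results = []
    for api in api_db:
        if api["city"] in query_cities and is_match(query_api, api):
            results.append(api)
    return results

# Hypothetical government affair API database (illustrative only)
api_db = [
    {"city": "B", "name": "air quality index"},
    {"city": "B", "name": "parking lot list"},
    {"city": "C", "name": "air quality data"},
]
query = {"city": "A", "name": "air quality index"}

# Toy matcher standing in for the model: names share at least two words
toy_match = lambda q, a: len(set(q["name"].split()) & set(a["name"].split())) >= 2

print([a["name"] for a in find_matching_apis(query, {"B", "C"}, api_db, toy_match)])
# -> ['air quality index', 'air quality data']
```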
The invention provides an electronic device, comprising a memory and a processor; the memory for storing a computer program; the processor, when executing the computer program, is configured to implement an API matching model building method as described above or an API matching method as described above or a cross-city government API matching method as described above.
The present invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements an API matching model building method as described above or an API matching method as described above or a cross-city government API matching method as described above.
The API matching model establishing method and the cross-city government affair API matching method have the beneficial effects that: the method solves the problem of cost of manual matching by using a machine learning method to carry out automatic API matching, and simultaneously, in the aspect of feature extraction, the API description text similarity is adopted as a feature, and the description information of the API is converted into a similarity vector, so that the matching accuracy is effectively improved, and finally, a high-accuracy, high-efficiency and high-automation API matching scheme is realized.
Drawings
FIG. 1 is a schematic flow chart diagram illustrating an embodiment of an API matching model building method according to the present invention;
FIG. 2 is a schematic flow chart diagram illustrating another embodiment of the API matching model building method of the present invention;
FIG. 3 is a schematic flowchart of another embodiment of a method for creating an API matching model according to the present invention;
FIG. 4 is a schematic diagram of the construction process from an API pair to a similarity vector;
FIG. 5 is a flowchart illustrating an embodiment of an API matching method of the present invention;
FIG. 6 is a flowchart illustrating an API matching method according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of an embodiment of the cross-city API matching system;
FIG. 8 is a schematic diagram of another embodiment of the cross-city API matching system.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
According to statistics in the China Local Government Open Data Report, by the first half of 2019 there were 82 government data opening platforms in China, which had opened 62,801 data sets, an increase of nearly sevenfold over 2017. Meanwhile, more and more open platforms provide callable APIs (application programming interfaces) for their data sets, so that developers can integrate these APIs into urban service applications and obtain the latest data in real time. The problem is that the various governments' open data platforms follow no uniform standard when defining APIs, which hinders developers who need to use APIs of the same type. When an urban service application needs to integrate the APIs of multiple cities, or when an application migrates from one city to another, developers must find APIs of the same type on different open platforms; this is called cross-city API matching. The non-normative naming of APIs and the diversity of Chinese expression make cross-city matching challenging, which in turn makes application migration difficult. Yet migrating successful applications from other cities is one of the most economical and efficient ways for a city to extend its urban service system. The problem of cross-city API matching therefore urgently needs to be solved.
The disclosed embodiments provide an API matching model establishing method, which may be executed by a processor in an electronic device; the electronic device may be implemented as, for example, a smart phone, a tablet computer, or a computer device serving as a server. FIG. 1 is a flowchart of the API matching model establishing method according to an embodiment of the present invention. The method comprises the following steps:
step S10, obtaining training samples, wherein one training sample comprises description texts of two APIs, and the description text of each API is composed of at least one short text.
The description text of an API contains at least one short text. For example, the description text may comprise one or more of the API name, the API keywords, and the API return parameter names; since an API may have more than one return parameter, there may be several return-parameter-name short texts. For cross-city API matching, the description text may also contain geographic limiting words such as the city or region to which the API belongs; when constructing training samples, these geographic qualifiers can be automatically identified and removed so as to improve the matching accuracy of the trained model. As open data platforms improve, the number and types of short texts contained in an API's description text may change; such changed description texts can equally be applied to the embodiments of the present disclosure and fall within its scope.
The training samples are labeled samples, i.e., each carries a label identifying whether it is a positive or a negative sample.
And step S20, calculating semantic similarity between short texts of the two APIs in each training sample.
The semantic similarity between the short texts of two APIs refers to the semantic similarity between their corresponding short texts. For example, if each API's description contains three kinds of short text (the API name, the API keywords, and the API return parameter names), then the semantic similarity between the short texts of the two APIs means: the similarity of one API's name to the other's name, of one API's keywords to the other's keywords, and of one API's return parameter names to the other's return parameter names.
It should be noted that the invention is applied to cross-city government affair API matching, and owing to the limitations of government data platforms, only the keywords, the API name, the API return parameter names, and the like can be obtained. These are all short texts containing different words, some of which are highly generic and do not characterize the current short text well. The method therefore uses semantic similarity as the short-text similarity, which weakens meaningless words and reduces the interference caused by Chinese synonyms, so that the similarity of corresponding short texts between two APIs is represented more accurately and the accuracy of API matching is greatly improved.
And step S30, constructing a similarity vector corresponding to each training sample according to semantic similarity between short texts of two APIs in each training sample.
For a training sample, calculating the semantic similarity between the short texts of its two APIs yields several semantic similarities; the number obtained is greater than or equal to the number of short texts contained in one API's description text.
To facilitate subsequent model training and simplify the form of the model input, these semantic similarities are integrated into a single similarity vector: each training sample corresponds to one similarity vector, and during training the similarity vector of each sample is input into the model to represent that sample.
The similarity vector corresponding to each training sample can be constructed from the semantic similarities in several ways. In one embodiment, the semantic similarities between all the short texts of the two APIs are combined into the similarity vector. In another embodiment, several of the semantic similarities are selected and integrated into the vector. In yet another embodiment, a weight is assigned to each short text, the final similarity of each short text is calculated from its similarity and weight, and these final similarities are combined into the vector. Converting the description texts of the APIs into similarity vectors effectively improves the training and prediction speed of the model, as well as the accuracy and efficiency of API matching.
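The weighted variant above (assign each short text a weight, combine the weighted similarities) can be sketched as follows; the field names and weight values are illustrative assumptions, not values given by the invention.

```python
def weighted_similarity_vector(field_sims, field_weights):
    """Scale each short text's similarity by its assumed field weight and
    combine the results into a vector (third construction variant)."""
    return [field_sims[f] * field_weights[f] for f in field_sims]

# Hypothetical per-field similarities and weights (illustrative only)
sims = {"name": 0.9, "keyword": 0.7, "return_params": 0.8}
weights = {"name": 0.5, "keyword": 0.2, "return_params": 0.3}

print(weighted_similarity_vector(sims, weights))
```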
And step S40, inputting the similarity vector into a preset model for training until the loss function of the preset model is converged, and taking the preset model with the converged loss function as an API (application program interface) matching model.
The preset model is an initialized machine learning model; the API matching model trained from it performs automatic binary classification, and its output indicates whether the two APIs match or not. The preset model/API matching model may be a LightGBM model, a gradient boosting decision tree (GBDT) implementation whose principles are well known and are not described here.
Optionally, the preset model/API matching model is an XGBoost model. Compared with traditional machine learning classifiers such as logistic regression and support vector machines, XGBoost can automatically handle missing values. Regarding overfitting, traditional methods generally lack built-in protection and therefore place higher demands on the quantity and purity of the training data, the feature dimensionality, the model complexity, and so on, whereas XGBoost uses regularization to prevent overfitting and thus relaxes these requirements; using an XGBoost model for API matching yields a better matching effect and higher accuracy.
The loss function of the preset model may be the logarithmic loss (logloss) or another user-defined loss function.
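The logarithmic loss has a standard closed form; a plain-Python version, for reference:

```python
import math

def logloss(y_true, y_pred, eps=1e-15):
    """Logarithmic loss: -(1/n) * sum(y*log(p) + (1-y)*log(1-p))."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

print(round(logloss([1, 0, 1], [0.9, 0.1, 0.8]), 4))  # -> 0.1446
```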
The prior art suffers from low API matching accuracy, high labor cost, low efficiency, and similar defects, and cannot be widely applied in production. Addressing these defects, the invention performs automatic API matching with a machine learning method, eliminating the cost of manual matching. For feature extraction, the similarity of each short text of the API description text is adopted as a feature: because the semantic similarities between corresponding short texts correlate strongly with whether two APIs match, the description texts are converted into a similarity vector built from those similarities, which effectively improves matching accuracy and finally realizes a high-accuracy, high-efficiency, highly automated API matching scheme.
The invention uses semantic similarity as the similarity between the short texts of two APIs and trains an automatic API matching model with a machine learning algorithm. This realizes automatic API matching, saves cost, improves efficiency, and overcomes the low accuracy, poor practical effect, and heavy reliance on manual work of existing methods. For cross-city API matching in particular, it can assist the development of applications based on open government data and accelerate the construction of smart cities.
Optionally, the description text contains a name short text, a keyword short text and at least one return parameter name short text; as shown in fig. 2, the step S20 includes:
step S201, permutation and combination are performed on the short texts of the return parameter names of the two APIs in each training sample, so as to obtain a plurality of pairs of short texts of the return parameter names.
The number and order of the return parameter names of different APIs may differ, so the return parameter names of the two APIs are paired by permutation and combination, and the similarity of each resulting pair is calculated separately, so that the return-parameter-name similarities are computed accurately. Since a return-parameter-name short text may contain several return parameter names, and each semantic similarity is computed between one name from each API, the return-parameter-name short texts may yield multiple similarities.
Step S202, calculating the similarity of all the returned parameter name short text pairs.
For two APIs that share identical or highly similar return parameter names, even when the number and order of their return parameter names differ, computing the similarity over all pairs identifies the shared names, and taking the preset number of largest similarities avoids misjudgments caused by differing counts and orderings, further improving matching accuracy.
For two APIs that share no identical or highly similar return parameter names, the preset number of largest similarities still represents the similarity of their return-parameter-name short texts well.
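The pairing and top-N selection of steps S201 to S202 can be sketched with `itertools.product`. The parameter names below are made up, and a character-level Jaccard score stands in for the sentence-vector similarity described earlier.

```python
from itertools import product

def top_param_similarities(params_a, params_b, sim, n=8):
    """Score every return-parameter-name pair from the two APIs and keep
    the n largest similarities (permutation-and-combination step)."""
    sims = sorted((sim(a, b) for a, b in product(params_a, params_b)),
                  reverse=True)
    return sims[:n]

# Toy similarity: Jaccard overlap of characters (stand-in only)
jaccard = lambda a, b: len(set(a) & set(b)) / len(set(a) | set(b))

# Hypothetical return parameter names of two air-quality APIs
la = ["stationName", "pm25", "pm10", "so2", "no2", "time"]
lb = ["station", "pm2_5", "o3", "updateTime"]

print(top_param_similarities(la, lb, jaccard, n=3))
```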
Step S203, calculating the similarity of the short texts of the two API names and the similarity of the short texts of the keywords.
The order in which the similarities of the name short text, the keyword short text, and the return-parameter-name short texts are calculated is not limited here: the name and keyword similarities may be computed before, after, or simultaneously with the return-parameter-name similarities, and the order between the name and keyword similarities is likewise not limited.
When the description text contains a name short text, a keyword short text, and return-parameter-name short texts, the similarity of each is calculated separately to obtain a similarity set; a similarity vector is then generated from this set in the subsequent steps, comprehensively representing the similarity between the two APIs' description texts, ensuring accurate feature construction, and improving both the accuracy of model training and of prediction.
Optionally, the step S30 includes: selecting a preset number of similarities with the maximum similarity from the similarities of all the returned parameter name short text pairs, wherein the preset number is marked as N; and combining the similarity of the preset number with the similarity of the name short text, the similarity of the keyword short text and a training sample label to form a similarity vector of 1 × M dimension, wherein M is N + 3.
The number of return-parameter-name pairs obtained after permutation and combination may be large, while the preset number of largest similarities already represents the similarity of the two APIs' return parameter names accurately. Limiting the number of similarities that finally represent this field therefore prevents the similarity vector fed into the preset model from growing too large, ensuring training and prediction efficiency; the construction of the similarity vector thus balances the accuracy and the efficiency of model training.
When the number of similarities over all return-parameter-name pairs is larger than the preset number, the preset number of largest similarities are taken as the similarities of the two APIs' return-parameter-name short texts and the redundant similarities are discarded; when it is smaller than the preset number, all the pair similarities are taken and one or more similarities filled with 0 are added until the preset number is reached. The preset number may be chosen between 6 and 10.
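A sketch of this truncate-or-pad rule and of the vector assembly; the input similarities are illustrative, and the label is placed at the tail end here (the description notes it may equally go at the head).

```python
def build_similarity_vector(name_sim, keyword_sim, param_sims, label, n=8):
    """Assemble the 1 x (n + 3) feature vector: the n largest
    return-parameter similarities (zero-padded when fewer are available),
    then the name similarity, keyword similarity, and sample label."""
    top = sorted(param_sims, reverse=True)[:n]
    top += [0.0] * (n - len(top))  # pad with zeros up to the preset number
    return top + [name_sim, keyword_sim, float(label)]

# Only 3 pair similarities available, preset number n=6: padding kicks in
vec = build_similarity_vector(0.92, 0.85, [0.9, 0.7, 0.4], label=1, n=6)
print(len(vec), vec)  # 9 [0.9, 0.7, 0.4, 0.0, 0.0, 0.0, 0.92, 0.85, 1.0]
```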
The similarity vector also includes a training sample label. The label identifies whether the training sample is a positive sample or a negative sample: a positive sample means the two APIs are matched APIs of the same type, and a negative sample means the two APIs are unmatched APIs of different types.
The training sample label can be represented in the similarity vector as 1 or 0: the positive sample labeled 1 and the negative sample 0, or vice versa.
The training sample label can be placed at the head or the tail of the similarity vector for convenient model access, so the formed similarity vector comprises the similarities of each short text of the two APIs together with the training sample label.
Fig. 4 is a schematic diagram of constructing a similarity vector from an API pair. In API A shown in fig. 4, the description text includes 1 name A, 1 keyword A, and a return-parameter-name short text LA containing 6 return parameter names; in API B, the description text includes 1 name B, 1 keyword B, and a return-parameter-name short text LB containing 5 return parameter names. The API pair shown in fig. 4 also carries a label indicating whether A matches B.
Calculate the similarity of each short text: the similarity of name A and name B, and the similarity of keyword A and keyword B. Then permute and combine the 6 return parameter names of LA with the 5 return-parameter-name short texts of LB, compute the similarity of each return-parameter-name short-text pair to obtain 30 similarities, sort these 30 similarities, and select the largest preset number of them as the final return-parameter-name short-text similarities, which participate in the construction of the subsequent feature vector (i.e. the similarity vector above).
Combine the name short-text similarity, the keyword short-text similarity and the final return-parameter-name short-text similarities with the label to construct a feature vector, which is used for training a machine learning model (i.e. the preset model or API matching model).
As can be seen from fig. 4, the finally constructed feature vector has dimension 1 × M, composed of: 1 (name similarity) + 1 (keyword similarity) + N (return-parameter-name similarities) + 1 (label).
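The 1 × M layout just described can be sketched as follows (function and argument names are illustrative assumptions, not from the patent):

```python
def build_feature_vector(name_sim, keyword_sim, return_sims, label, preset_n=8):
    """Assemble the 1 x M similarity vector: 1 name similarity + 1 keyword
    similarity + preset_n return-parameter-name similarities (largest first,
    zero-padded) + 1 label, so M = preset_n + 3."""
    top = sorted(return_sims, reverse=True)[:preset_n]
    top += [0.0] * (preset_n - len(top))
    return [name_sim, keyword_sim] + top + [label]
```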
Other schemes that use vectors as model input often take word vectors trained by a word2vec model, and such word vectors typically have a large dimension, usually more than 1 × 200. Here, the similarities of each short text of the two APIs are combined with a training sample label into a similarity vector that serves as model input. Because each short-text similarity is strongly associated with the API, a vector constructed with this scheme represents API features better, so the trained model predicts more accurately. Meanwhile, because an API description text generally contains few short texts, few similarities are produced and the final vector has a low dimension; compared with a word vector generated by a word2vec model, the vector dimension is reduced and both model training speed and prediction speed are improved.
Alternatively, as shown in fig. 3, the step S20 includes:
Step S211, performing word segmentation on each short text of the two APIs through the Jieba word segmentation algorithm to obtain word sets after segmentation.
The Jieba word segmentation algorithm uses a prefix dictionary for efficient word-graph scanning and generates a directed acyclic graph of all possible word segmentations of the Chinese characters in a sentence; it then uses dynamic programming to search for the maximum-probability path and finds the maximum segmentation combination based on word frequency. For unknown words it adopts an HMM model based on the word-forming capability of Chinese characters, using the Viterbi algorithm. Jieba is an existing open-source Chinese word segmentation algorithm and is not detailed here.
Word segmentation is performed on each short text of the two APIs through the Jieba algorithm to obtain the word set corresponding to each short text; for example, segmenting the keyword short text yields the word set corresponding to the keywords, and segmenting the name short text yields the word set corresponding to the name.
In step S212, each word in the set of words is mapped into a vector using the FastText algorithm.
Compared with other text classification models such as support vector machines, logistic regression and neural networks, FastText greatly shortens training time while keeping comparable classification effect; it needs no pre-trained word vectors and can train word vectors itself, which facilitates the training of the XGBoost model.
Step S213, calculating TextRank values of the words in the word set.
Step S213 specifically includes: (1) Perform part-of-speech tagging on the words obtained by segmenting each short text of the API, filter out stop words, and keep only words of specified parts of speech, such as nouns, verbs and adjectives, i.e. $S_i = [t_{i,1}, t_{i,2}, \ldots, t_{i,n}]$, where $S_i$ denotes a short text of the API and $t_{i,n}$ denotes a retained word. (2) Construct a candidate keyword graph G = (V, E), where V is the node set composed of the candidate keywords generated in (1); an edge between any two nodes is built by the co-occurrence relation: an edge exists between two nodes only when the corresponding words co-occur within a window of length K, where K is the window size (at most K words co-occur; typically K = 2). (3) Then, according to the formula
$$WS(V_i) = (1-d) + d \sum_{V_j \in In(V_i)} \frac{\omega_{ji}}{\sum_{V_k \in Out(V_j)} \omega_{jk}} WS(V_j),$$
the weights of the nodes are propagated iteratively until convergence, where $In(V_i)$ is the set of predecessor nodes of node $V_i$, $Out(V_j)$ is the set of successor nodes of node $V_j$, d is the damping coefficient, and $\omega_{ji}$ indicates that the edges between two nodes may have different degrees of importance.
According to the formula, the TextRank value of each word can be obtained.
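The iteration in steps (2)–(3) can be sketched as follows, assuming the co-occurrence graph has already been built as a dict mapping each word to its weighted neighbours (symmetric for co-occurrence, so In and Out coincide); this is a minimal illustration, not the patent's implementation:

```python
def textrank(graph, d=0.85, tol=1e-6, max_iter=100):
    """Iterate WS(Vi) = (1-d) + d * sum_j [w_ji / sum_k w_jk] * WS(Vj) to convergence."""
    ws = {v: 1.0 for v in graph}
    for _ in range(max_iter):
        new_ws = {}
        for vi in graph:
            s = 0.0
            for vj, w_ji in graph[vi].items():      # V_j in In(V_i)
                out_sum = sum(graph[vj].values())   # sum of w_jk over Out(V_j)
                if out_sum:
                    s += w_ji / out_sum * ws[vj]
            new_ws[vi] = (1 - d) + d * s
        converged = max(abs(new_ws[v] - ws[v]) for v in graph) < tol
        ws = new_ws
        if converged:
            break
    return ws
```

On a small star-shaped graph, the hub word ends up with the highest score, matching the intuition that words co-occurring with many others are more central.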
Step S214, obtaining sentence vectors of short texts of the two APIs according to the vectors and the TextRank values of the words in the word set.
The execution order of step S212 and step S213 is not limited, and step S212 may be executed before step S213 or after step S213.
Each short text has a corresponding word set. For the word set of a short text, take the reciprocal of each word's TextRank value as that word's weight, multiply each word's vector by its weight, and sum the weighted vectors to obtain the sentence vector of the short text.
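The weighted sum just described can be sketched as follows (word vectors and TextRank values are assumed to be precomputed, e.g. by FastText and the TextRank step above; all names are illustrative):

```python
def sentence_vector(words, word_vecs, textrank_scores):
    """Sum each word's vector weighted by 1 / its TextRank value."""
    dim = len(next(iter(word_vecs.values())))
    vec = [0.0] * dim
    for w in words:
        weight = 1.0 / textrank_scores[w]   # weight is the reciprocal of TextRank
        for i, x in enumerate(word_vecs[w]):
            vec[i] += weight * x
    return vec
```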
Step S215, calculating semantic similarity between sentence vectors of short texts of the two APIs, wherein the semantic similarity between the sentence vectors of the short texts of the two APIs is the semantic similarity between the short texts of the two APIs.
Semantic similarity between sentence vectors can be calculated by cosine similarity. The cosine formula is as follows:
$$\mathrm{similarity} = \frac{\sum_{i=1}^{k} A_i B_i}{\sqrt{\sum_{i=1}^{k} A_i^2}\,\sqrt{\sum_{i=1}^{k} B_i^2}}$$
where similarity denotes the similarity, $V_A$ and $V_B$ are the sentence vectors, and $A_i$ and $B_i$ are the values of each dimension of $V_A$ and $V_B$ respectively, with k dimensions in total.
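The formula can be implemented directly with the standard library (a minimal sketch; zero vectors are mapped to similarity 0.0 by convention):

```python
import math

def cosine_similarity(va, vb):
    """Cosine of the angle between two sentence vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(va, vb))
    na = math.sqrt(sum(a * a for a in va))
    nb = math.sqrt(sum(b * b for b in vb))
    return dot / (na * nb) if na and nb else 0.0
```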
Because the API name, keyword and return parameter names are short texts composed of several words, each short text contains different words, and some words are so generic that they do not characterize the current short text well. Therefore, in the independently developed feature engineering, each short text of the API is segmented with Jieba, each segmented word is mapped into a vector with the FastText algorithm, each word's weight is computed through the TextRank algorithm (the weight being the reciprocal of the TextRank value), each word's vector is multiplied by its weight, and the weighted vectors are summed to obtain the vector of the short text; the semantic similarity between the vectors of corresponding short texts then gives the semantic similarity between the short texts of the two APIs. The similarity obtained in this way accurately reflects the similarity between the short texts of the two APIs, provides accurate input data for subsequent model training, and helps improve the prediction accuracy of the API matching model.
Based on the semantic similarity between the short texts of the two APIs obtained in the steps above, a specified number of similarity values can be selected. For example, when the return-parameter-name short texts contain several return parameter names, the return parameter names of the two APIs are permuted and combined into return-parameter-name short-text pairs, the similarity of each pair is computed according to steps S211 to S215 to obtain multiple similarities, and a preset number of them are selected as the final return-parameter-name short-text similarities. The similarities of the name short texts and of the keyword short texts are likewise computed with steps S211 to S215. All finally obtained similarities are used to construct a similarity vector, which is input into a machine learning model for training; this greatly reduces the training time of the machine learning model and yields more accurate prediction results.
The embodiment of the disclosure provides an API matching method. The method may be performed by a processor in an electronic device, where the electronic device refers to a computer, a server, and the like. As shown in fig. 5, which is a flowchart of an API matching method according to an embodiment of the present disclosure, the API matching method includes:
step S50, obtaining description texts of two APIs to be matched, where the description text of each API is composed of at least one short text.
The short text may be one or more of an API name, an API keyword, and an API return-parameter name, where the return-parameter-name short text may contain one or more return parameter names, i.e. the names of the API's respective return parameters.
For cross-city API matching, the description text also contains geographic-position qualifiers such as the city or region to which the API belongs. Therefore, before judging whether two APIs to be matched match, the geographic qualifiers in the API description texts are automatically identified and removed, and the description texts with the qualifiers removed undergo the subsequent processing, which improves the accuracy of model prediction. As open data platforms improve, the number and kinds of short texts contained in an API description text may change; description texts containing such changed short texts can also be applied to the embodiments of the present disclosure and fall within its scope.
And step S60, calculating semantic similarity between short texts of the two APIs to be matched.
The semantic similarity between short texts of the two APIs refers to the similarity between similar short texts of the two APIs, for example, if the API includes three short texts such as an API name, an API keyword, and an API return parameter name, the semantic similarity between short texts of the two APIs refers to: similarity of a name of one API to a name of another API, similarity of a keyword of one API to a keyword of another API, similarity of a return argument name of one API to a return argument name of another API.
The specific calculation method of the semantic similarity between the short texts of the two APIs may be the same as the calculation method of the similarity of the training samples during the training of the API matching model.
And step S70, determining similarity vectors corresponding to the two APIs to be matched according to the semantic similarity between the short texts of the two APIs to be matched.
And combining the semantic similarity between the short texts of the two APIs to be matched to form a similarity vector, wherein the similarity vector is the similarity vector corresponding to the two APIs to be matched.
In the embodiment of the present disclosure, the specific manner of determining the similarity vector corresponding to the two APIs based on the semantic similarity between their short texts is the same as that described for step S30 in the API matching model establishing method above and is not repeated here.
Step S80, inputting the similarity vectors corresponding to the two APIs to be matched into an API matching model, where an output result of the API matching model is a matching result of the two APIs to be matched, where the API matching model is generated based on any one of the API matching model establishing methods.
And taking the similarity vectors corresponding to the two APIs to be matched as the feature data of the two APIs, inputting the feature data into the API matching model, and outputting a result of whether the two APIs are matched or not by the API matching model.
The API matching model may be an XGBoost model or a LightGBM model.
By performing automatic API matching with a machine learning method, the method removes the cost of manual matching. In feature extraction, the similarity of API description texts is adopted as the feature, and the description information of an API is converted into a similarity vector, which effectively improves matching accuracy; the result is an API matching scheme with high accuracy, high efficiency and a high degree of automation.
Optionally, after the step S50 and before the step S60, the method further includes: identifying whether the description text of each API contains short text of the geographic position qualifier; and when the description text of any API contains the short text of the geographic position limiting word, removing the short text of the geographic position limiting word.
The geographic-qualifier short text is removed, and the remaining short texts are used for the similarity calculation in step S60. In the specific application scenario of cross-city API matching, the geographic qualifiers of APIs from different cities differ; if they were incorporated into similarity calculation and similarity-vector construction, they would interfere with the API matching model's prediction of the matching result and reduce its accuracy, so removing the geographic-qualifier short texts improves the prediction accuracy of the API matching model. Specifically, a geographic-qualifier word bank can be preset, the description text of each API matched word by word against the bank, and every successfully matched word treated as a geographic qualifier.
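The lexicon-based filtering described above can be sketched as follows (the lexicon contents here are hypothetical examples; a real system would load the preset geographic-qualifier word bank):

```python
# Hypothetical geographic-qualifier lexicon; a real system loads the preset word bank.
GEO_QUALIFIERS = {"Shenzhen", "Guangzhou", "Foshan"}

def strip_geo_qualifiers(words, lexicon=GEO_QUALIFIERS):
    """Drop any segmented word that matches an entry in the qualifier lexicon."""
    return [w for w in words if w not in lexicon]
```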
The embodiment of the disclosure provides a cross-city government affair API matching method. The method may be performed by a processor in an electronic device, where the electronic device refers to a computer, a server, and the like. As shown in fig. 6, which is a flowchart of a cross-city government API matching method according to an embodiment of the present disclosure, the API matching method includes:
step S90, receiving the cross-city government affair API matching query request, and obtaining the query API and the query range based on the cross-city government affair API matching query request.
The cross-city government affair API matching query request contains the relevant information of the query API to be queried and may also contain the query scope, e.g. the region scope of the query.
Step S100, traversing a preset government API database, respectively combining the query API and each API in the query range in the preset government API database to form an API pair to be matched, and determining whether the API pair to be matched is matched based on the API matching method (step S50-step S80).
The preset government affair API database stores known API data; for the specific cross-city matching scenario, the description texts of all APIs can be obtained from the open data platforms of the various local governments.
When the API matching query request contains a query range, the query API is paired with all the APIs in the query range in a preset government affair API database one by one to respectively form an API pair to be matched.
And step S110, outputting all the APIs matched with the query API in the query range in the preset government affair API database.
Whether each API pair to be matched matches is judged in turn based on step S100, and finally all APIs in the preset government affair API database that match the query API are output. When the query scope is included in the API matching query request, all APIs matching the query API refer to those within the query scope.
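Steps S100–S110 can be sketched as a simple traversal, with the trained API matching model passed in as a predicate (the database layout and all names here are illustrative assumptions):

```python
def find_matching_apis(query_api, api_db, query_scope, is_match):
    """Pair the query API with every candidate API in the scope and keep the matches.

    api_db maps a city to its list of API description records; is_match stands in
    for steps S50-S80 (similarity-vector construction + API matching model prediction).
    """
    matches = []
    for city in query_scope:
        for candidate in api_db.get(city, []):
            if is_match(query_api, candidate):
                matches.append(candidate)
    return matches
```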
Through the steps above, automatic API matching is achieved: all matching APIs are returned simply by entering the query conditions (including the query API and the query scope), which greatly improves query efficiency and API matching efficiency, while this API matching approach also ensures matching accuracy.
The embodiment of the present disclosure provides an implementation: the API cross-city matching system shown in fig. 7 and 8, to which the cross-city government affair API matching method above can be applied. The system comprises an API matching model, a matching mining engine, and a government affair API database. The system receives known API information and stores it in the government affair API database classified by city. The API matching model is built by the API matching model establishing method above; it receives the description texts of a pair of APIs, automatically judges whether the two APIs match, and outputs the result. After receiving an API matching query request, the matching mining engine determines the query API and the query scope, pairs the query API one by one with all APIs within the query scope in the government affair API database, feeds the pairs into the API matching model in turn, and, according to the model's judgments, retrieves all matching APIs from the database and returns them to the query interface.
Another embodiment of the present disclosure provides an electronic device comprising a memory and a processor; the memory for storing a computer program; the processor, when executing the computer program, is configured to implement an API matching model building method as described above or an API matching method as described above or a cross-city government API matching method as described above.
Yet another embodiment of the present disclosure provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements an API matching model building method as described above or an API matching method as described above or a cross-city government API matching method as described above.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like. In this application, the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention. In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
Although the present disclosure has been described above, the scope of the present disclosure is not limited thereto. Various changes and modifications may be effected therein by one of ordinary skill in the pertinent art without departing from the spirit and scope of the present disclosure, and these changes and modifications are intended to be within the scope of the present disclosure.

Claims (10)

1. An API matching model building method is characterized by comprising the following steps:
acquiring training samples, wherein one training sample comprises description texts of two APIs (application program interfaces), and the description text of each API consists of at least one short text;
calculating semantic similarity between short texts of two APIs in each training sample;
according to semantic similarity between short texts of two APIs in each training sample, constructing a similarity vector corresponding to each training sample;
and inputting the similarity vector into a preset model for training until a loss function of the preset model is converged, and taking the preset model with the converged loss function as an API (application program interface) matching model.
2. The API matching model building method of claim 1, wherein said descriptive text comprises a name short text, a keyword short text, and at least one return parameter name short text;
the calculating the semantic similarity between short texts of two APIs in each training sample comprises:
in each training sample, the short texts of the return parameter names of the two APIs are arranged and combined to obtain a plurality of short text pairs of the return parameter names;
calculating the similarity of all the returned parameter name short text pairs;
and calculating the similarity of the short texts of the two API names and the similarity of the short texts of the keywords.
3. The method for building the API matching model according to claim 2, wherein the constructing the similarity vector corresponding to each training sample according to the semantic similarity between the short texts of the two APIs in each training sample comprises:
selecting a preset number of similarities with the maximum similarity from the similarities of all the returned parameter name short text pairs, wherein the preset number is marked as N;
and combining the similarity of the preset number with the similarity of the name short text, the similarity of the keyword short text and a training sample label to form a similarity vector of 1 × M dimension, wherein M is N + 3.
4. The API matching model creation method of any one of claims 1 to 3, wherein the API matching model is an XGBoost model.
5. The API matching model building method of any one of claims 1 to 3 wherein the calculating semantic similarity between short texts of two APIs in each of the training samples comprises:
performing word segmentation processing on each short text of the two APIs respectively through a Jieba word segmentation algorithm to obtain a word set after word segmentation processing;
mapping each word in the set of words into a vector using a FastText algorithm;
calculating the TextRank value of each word in the word set;
obtaining sentence vectors of short texts of two APIs (application program interfaces) according to the vectors and the TextRank values of the words in the word set;
and calculating semantic similarity between sentence vectors of the short texts of the two APIs, wherein the semantic similarity between the sentence vectors of the short texts of the two APIs is the semantic similarity between the short texts of the two APIs.
6. An API matching method, comprising:
obtaining description texts of two APIs to be matched, wherein the description text of each API consists of at least one short text;
calculating semantic similarity between short texts of the two APIs to be matched;
determining similarity vectors corresponding to the two APIs to be matched according to semantic similarity between short texts of the two APIs to be matched;
inputting the similarity vectors corresponding to the two APIs to be matched into an API matching model, wherein an output result of the API matching model is a matching result of the two APIs to be matched, and the API matching model is generated based on the API matching model establishing method according to any one of claims 1 to 5.
7. The API matching method according to claim 6, wherein after obtaining the description texts of the two APIs to be matched and before calculating the semantic similarity between the short texts of the two APIs to be matched, further comprising:
identifying whether the description text of each API contains short text of the geographic position qualifier;
and when the description text of any API contains the short text of the geographic position limiting word, removing the short text of the geographic position limiting word.
8. A cross-city government API matching method, comprising:
receiving a cross-city government affair API matching query request, and obtaining a query API and a query range based on the cross-city government affair API matching query request;
traversing a preset government affair API database, respectively forming the query API and each API in the query range in the preset government affair API database into an API pair to be matched, and judging whether the API pair to be matched is matched based on the API matching method according to claim 6 or 7;
and outputting all the APIs which are matched with the query API in the query range in the preset government affair API database.
9. An electronic device comprising a memory and a processor;
the memory for storing a computer program;
the processor, when executing the computer program, for implementing the API matching model building method according to any one of claims 1 to 5 or the API matching method according to claim 6 or 7 or the cross-city government API matching method according to claim 8.
10. A computer-readable storage medium, characterized in that the storage medium has stored thereon a computer program which, when executed by a processor, implements the API matching model building method according to any one of claims 1 to 5 or the API matching method according to claim 6 or 7 or the cross-city government API matching method according to claim 8.
CN202011558922.6A 2020-12-25 2020-12-25 API (application program interface) matching model establishing method and cross-city government affair API matching method Pending CN112650833A (en)

Publication of CN112650833A: 2021-04-13.
CN111859950A (en) Method for automatically generating lecture notes
CN111881264B (en) Method and electronic equipment for searching long text in question-answering task in open field
CN112417170B (en) Relationship linking method for incomplete knowledge graph
CN109472032A (en) A kind of determination method, apparatus, server and the storage medium of entity relationship diagram
CN110969005A (en) Method and device for determining similarity between entity corpora
CN110377753B (en) Relation extraction method and device based on relation trigger word and GRU model
CN116204622A (en) Query expression enhancement method in cross-language dense retrieval
CN116186219A (en) Man-machine dialogue interaction method, system and storage medium
JP7121819B2 (en) Image processing method and apparatus, electronic device, computer-readable storage medium, and computer program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination