WO2022156328A1 - Restful-type Web service clustering method integrating service collaboration relationships - Google Patents

Restful-type Web service clustering method integrating service collaboration relationships

Info

Publication number
WO2022156328A1
Authority
WO
WIPO (PCT)
Prior art keywords
service
word
similarity
topic
collaboration
Prior art date
Application number
PCT/CN2021/130789
Other languages
English (en)
French (fr)
Inventor
胡强
沈嘉吉
荆广辉
杜军威
Original Assignee
青岛科技大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 青岛科技大学 (Qingdao University of Science and Technology)
Publication of WO2022156328A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • The invention relates to a Restful-type Web service clustering method that integrates service collaboration relationships.
  • Service clustering groups services with similar functions into service clusters, which narrows the search space for target services and improves service discovery efficiency during service search, service replacement, and service composition.
  • A Web service is a Web API program encapsulated with a standardized protocol; Web services fall into two types, SOAP and Restful.
  • As of January 2021, the ProgrammableWeb site had registered more than 27,000 services, the vast majority of which are Restful Web API services.
  • SOAP-type services use structured WSDL documents to describe service information.
  • A WSDL document explicitly defines various tags, which makes it easy to extract feature information about the service description.
  • When clustering such services, a small number of keywords that represent the service's functional characteristics are usually extracted, and the semantic similarity of these keywords is calculated per tag category to measure functional similarity, so service clustering is easy to implement.
  • Restful-type Web services usually describe themselves in unstructured natural language. The service description contains no tags, so effective semantic information is hard to extract; the description text is also relatively short, and words about the service's functions, operations, and evaluation are mixed together.
  • To extract service description information effectively, most existing clustering techniques use topic models to generate service representation vectors and cluster Restful-type Web services by calculating the similarity of those vectors.
  • Existing methods cluster services only from the perspective of functional similarity, and the service representation vectors generated by existing topic models are generally of low quality and cannot fully express service feature information. Existing clustering methods also ignore the collaboration relationships between services. How to improve the quality of service representation vector generation is therefore a key issue affecting the effect of service clustering.
  • In view of the above problems, the present invention provides a Restful-type Web service clustering method that integrates service collaboration relationships.
  • The method is characterized in that it comprises the following steps:
  • Step 1: Collect Restful-type Web services; apply word segmentation, function-word removal, and stemming to the service description of each Web service to be clustered; obtain the preprocessed valid service description text for each service; and build a corpus from these texts;
  • Step 2: Using the context-weight-based service description feature word extraction algorithm, extract from the service descriptions in the corpus a certain proportion of the feature words most relevant to the service features, and construct the service feature word set;
  • Step 3: Introduce a correction factor to construct an improved GSDMM model with a topic probability distribution correction factor, and use the model to convert the feature words of each service in the service feature word set into a service representation vector;
  • Step 4: From the service representation vectors, calculate the functional similarity between different services with the Euclidean distance formula;
  • Step 5: Build a service collaboration graph, use it to describe the collaboration relationships between different services, generate service collaboration vectors, and calculate the collaboration similarity between different services;
  • Step 6: From the functional similarity and the collaboration similarity, obtain through parameter tuning the comprehensive service similarity used for clustering, and cluster with the k-means++ algorithm to complete the service clustering.
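The preprocessing in Step 1 can be illustrated with a minimal sketch. The function-word list and the suffix-stripping stemmer below are deliberately tiny stand-ins (assumptions, not the tools the patent used); a real pipeline would use a full stop-word list and a proper stemmer such as Porter's. The names `FUNCTION_WORDS`, `crude_stem`, and `preprocess` are hypothetical.

```python
import re

# Tiny illustrative function-word list (assumption); real pipelines use a full stop-word list.
FUNCTION_WORDS = {"a", "an", "the", "of", "to", "and", "or", "for",
                  "in", "on", "is", "are", "with", "that", "this"}


def crude_stem(token):
    """Strip a few common English suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(description):
    """Word segmentation, function-word removal, then stemming."""
    tokens = re.findall(r"[a-z]+", description.lower())
    return [crude_stem(t) for t in tokens if t not in FUNCTION_WORDS]


# Build a toy corpus keyed by a hypothetical service id.
raw_descriptions = {"s1": "The API returns maps and routing for drivers"}
corpus = {sid: preprocess(text) for sid, text in raw_descriptions.items()}
```

The resulting token lists play the role of the "valid service description text" from which the corpus is built.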
  • The context-weight-based service description feature word extraction algorithm of step 2 comprises the following operation steps:
  • Step 21: Define a service as a four-tuple s = (Id, n, l, d), where Id is the service identifier, n is the service name, l is the service tag set, and d is the service description information;
  • Step 22: Define and initialize as empty the corpus set Corpus_w, which stores all service description texts, and the feature word set FW_S of the services contained in the service set S to be clustered;
  • Step 23: Add the service description text s.d of every service s to the corpus set Corpus_w, and use Word2Vec to train a vector V(w) for each word w in Corpus_w;
  • Step 24: For each word w, calculate its TF-IDF value TF-IDF(w, s) in service s and its service description context similarity Con_SemSim(w, s). The TF-IDF calculation consists of TF and IDF:
  • tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}     (1)
  • idf_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}     (2)
  • TF-IDF = TF * IDF     (3)
  • tf_{i,j} is the term frequency
  • idf_i is the inverse document frequency
  • d_j is the j-th service description text
  • t_i is the i-th word in d_j
  • n_{i,j} is the number of times t_i appears in d_j
  • \sum_k n_{k,j} is the total number of occurrences of all words in d_j
  • |D| is the total number of service description documents in the corpus, and |\{j : t_i \in d_j\}| is the number of documents that contain t_i
  • Con_SemSim(w, s) is the average semantic similarity between the word w in the service description and the other words of that description; the similarity between two words is calculated with the cosine formula of their vectors:
  • \cos(V(w_1), V(w_2)) = \frac{V(w_1) \cdot V(w_2)}{\|V(w_1)\| \, \|V(w_2)\|}
  • Step 25: Multiply TF-IDF(w, s) by the context similarity Con_SemSim(w, s) to obtain the context weight ContextWeight(w, s) of the word w in the description text s.d of service s;
  • Step 26: Sort the words of the service description text s.d by context weight ContextWeight(w, s), add the words in the top α proportion to s.fw, and so generate the service feature word set s.fw.
  • Preferably, the value of α in step 26 is 60%.
  • Establishing the improved GSDMM model with the probability distribution correction factor comprises the following steps:
  • Step 31: Input the service feature word sets s_fw of all services in the service set into the GSDMM model in turn;
  • Step 32: After 10 rounds of training, obtain the topic-word matrix Φ, the service-topic matrix Θ, and the initial service representation vector srv(s) of each service s;
  • Step 33: For each word w in s_fw, look it up in the topic-word matrix Φ and find the topic k with the largest distribution probability for w; multiply the maximum topic probability value argmax(φ_{k,w}) of w by its context weight ContextWeight(w, s) to obtain the correction factor δ(w, s) of the word w for the representation topics of s, namely:
  • δ(w,s) = argmax(φ_{k,w}) * ContextWeight(w,s);
  • Step 34: Determine whether the topic k with the maximum probability for the word w is a secondary topic in the representation vector of service s. If it is, multiply the existing distribution probability of topic k in the representation vector by 1 + δ(w, s), which completes the topic probability distribution correction based on word w;
  • Step 35: Complete the probability correction for all words in the feature word set of service s according to step 24 to obtain the final service representation vector.
  • Step 5 comprises the following operation steps:
  • Step 51: Traverse the service process models sp in the cloud platform and define each service in sp as a service node; if service si is the predecessor of service sj, then the collaboration of sj depends on si, written si → sj;
  • Step 52: Define an undirected graph G = (V, E), where V = {v1, v2, v3, … vn} is the set of service nodes and node vi represents a service; E = {e = (vi, vj) | 1 ≤ i, j ≤ n} is the set of collaboration edges, and e = (vi, vj) indicates that the services si and sj corresponding to nodes vi and vj satisfy si → sj or sj → si; the constructed G is the service collaboration graph;
  • Step 53: From the constructed service collaboration graph G, use node2vec to generate a collaboration vector for each service in G; the collaboration vector of service s is denoted cf(s), and the collaboration similarity between any two services si and sj is calculated with the cosine formula:
  • cs(cf(s_i), cf(s_j)) = \frac{\sum_{k=1}^{m} cf(s_i)_k \, cf(s_j)_k}{\sqrt{\sum_{k=1}^{m} cf(s_i)_k^2} \, \sqrt{\sum_{k=1}^{m} cf(s_j)_k^2}}
  • where m is the dimension of the service collaboration vector.
  • Step 6 comprises the following operation steps:
  • Step 61: Given the service representation vectors array_srv, the service collaboration vectors array_cs, and the hyperparameter that sets the weight ratio between functional similarity and collaboration similarity, initialize the cluster center points, choosing initial centers that are as far apart as possible;
  • Step 62: For the set number of target clusters k, calculate the distance between each service s and the cluster center points, find the nearest center, and assign the service to it;
  • Step 63: Calculate the average service representation vector and the average service collaboration vector of each cluster as its new center point;
  • Step 64: Determine whether the center points have changed; if they have, adopt the new center points and repeat steps 62-63 for a new round of clustering;
  • Step 65: When the center points no longer change or the number of iterations reaches the set maximum max-iter, end the calculation and output the final clustering result.
  • The present invention proposes a context-weight-based algorithm for extracting service description feature words. It fuses the frequency of a word in the description text with its contextual semantic similarity, ranks the words by context weight, and screens out the words in the service description that are only weakly related to the service function. Selecting words by context weight ranking reduces the amount of noise data and improves the quality of the text from which service representation vectors are generated;
  • Building on the traditional GSDMM model, the present invention proposes a GSDMM model with topic distribution probability correction.
  • The improved model introduces a probability distribution correction factor and uses it to correct the distribution probabilities of the non-key topics in the service representation vector. This improves the completeness of the information described by the generated vector and thus the accuracy of service clustering;
  • The present invention holds that, given equal functional similarity, two services with more identical or similar collaboration relationships should preferentially be placed in the same service cluster, and therefore proposes the concept of service collaboration similarity.
  • The collaboration similarity measures the probability that two services appear in the context of similar composite service processes, so it accurately measures similar collaboration relationships between different services and improves the rationality of Web service clustering;
  • The proposed method improves the quality of the generated service representation vectors, which improves the measurement of service functional similarity during clustering.
  • With collaboration similarity as a further reference condition, the optimal weight ratio between the two similarities is obtained through parameter tuning, which improves both the effect and the rationality of Web service clustering.
  • Fig. 1 is a schematic diagram of the traditional GSDMM;
  • Fig. 2 is a schematic diagram of the improved GSDMM with correction factor proposed by the present invention;
  • Fig. 3 is an example (partial) of a service collaboration graph obtained with the service collaboration relationship modeling method proposed by the present invention;
  • Fig. 6 compares the AMI index obtained when different models and methods are used for clustering in the embodiment of the present invention;
  • Fig. 7 compares the NMI index obtained when different models and methods are used for clustering in the embodiment of the present invention.
  • The Restful-type Web service clustering method integrating service collaboration relationships proposed by the present invention comprises the following operation steps:
  • A context-weight-based method for extracting service description feature words is proposed. It extracts a certain proportion of the most relevant feature words from the service description to construct the service feature word set s.fw. Generating service representation vectors from these "denoised" feature words makes the vectors more accurate.
  • An improved GSDMM model with a topic probability distribution correction factor is proposed.
  • The improved model addresses the traditional GSDMM model's tendency, when generating topic vectors, to overemphasize the probability of the key topic and weaken the probabilities of secondary topics. As a result, the generated service representation vector describes service features more completely and with a more balanced topic distribution.
  • The generated service representation vector therefore describes service characteristics more completely and accurately.
  • The functional similarity fs(srv(si), srv(sj)) between different services is calculated with the Euclidean distance formula.
  • The collaboration graph is an undirected graph.
  • The nodes of the graph are services that have taken part in a service collaboration relationship, and the edges represent the collaboration relationship between two services.
  • The collaboration similarity cs(cv(si), cv(sj)) between different services is calculated with the cosine formula.
  • TS is a text corpus composed of texts Ti
  • wij is a word in text Ti
  • the context weight of word wij in text Ti is the product of the TF-IDF value of wij and the mean semantic similarity between wij and the other words.
  • The present invention fuses a word's frequency in the description text with its contextual semantic similarity to construct the context weight of each word in the service description, ranks the words by context weight, and selects a certain number of top-ranked words as the feature words of the service description.
  • Lines 1 and 2 of the algorithm initialize two empty sets: the corpus set Corpus_w, which stores all service description texts, and the feature word set FW_S of the services contained in the service set S to be clustered.
  • Line 3 of the algorithm adds the service description texts of all services in S to Corpus_w and then uses Word2Vec to train a vector V(w) for each word w in the corpus.
  • The TF-IDF calculation consists of two parts: TF and IDF.
  • TF (Term Frequency) is the frequency with which a term occurs in a document.
  • IDF (Inverse Document Frequency) reflects a term's ability to discriminate between documents.
  • The calculation formulas of TF, IDF, and TF-IDF are shown in formulas (1), (2), and (3), respectively.
  • TF-IDF increases with the number of occurrences of a word in a document and decreases with the number of documents in the corpus that contain the word, so it is suited to evaluating how important a word is to a document within the corpus.
  • Service description context similarity is the average semantic similarity between a word in the service description and the other words of that description.
  • The semantic similarity between two words is calculated with the cosine formula of their vectors: \cos(V(w_1), V(w_2)) = \frac{V(w_1) \cdot V(w_2)}{\|V(w_1)\| \, \|V(w_2)\|}
  • Con_SemSim(w,s) is the average of the semantic similarity between word w and all words in s.d - {w} (see Algorithm 1, line 8)
  • Word vectors can be generated with the Word2Vec tools in the Python gensim package. Once Con_SemSim(w,s) and TF-IDF(w,s) have been computed for word w, the two are multiplied to obtain the context weight ContextWeight(w,s) of the word w in the description text s.d of service s. The context weight therefore incorporates both word frequency and the semantic similarity between words.
  • Line 10 of the algorithm generates the service feature word set s.fw.
  • Service tags are the main basis on which platforms categorize stored services and users search for services, and they are a key element to consider in service clustering. Therefore, when constructing s.fw, all words in the service tag set s.l are first added to s.fw; the words in the service description are then sorted by context weight ContextWeight(w, s), and the words in the top α proportion are added to s.fw.
  • When α is set to 60%, the extracted feature words perform best in generating service representation vectors.
  • Finally, the service feature word sets FW_S of the entire service set S to be clustered are obtained.
  • The service feature word set of each service is extracted by Algorithm 1.
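The feature word extraction described above (TF-IDF times average cosine similarity, then the top α proportion plus the service tags) can be sketched as follows. This is a minimal illustration under stated assumptions: the word vectors are hand-written toy values standing in for trained Word2Vec vectors, the logarithm is natural, and the top-α cut is rounded up; none of these choices is confirmed by the source. All function names are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def tf_idf(word, doc, corpus):
    """TF * IDF for a word in one tokenized description (natural log assumed)."""
    tf = Counter(doc)[word] / len(doc)
    df = sum(1 for d in corpus if word in d)
    return tf * math.log(len(corpus) / df)


def con_sem_sim(word, doc, vectors):
    """Average cosine similarity between `word` and the other words of `doc`."""
    others = set(doc) - {word}
    if not others:
        return 0.0
    return sum(cosine(vectors[word], vectors[o]) for o in others) / len(others)


def feature_words(doc, tags, corpus, vectors, alpha=0.6):
    """Tags first, then the top-alpha share of description words by context weight."""
    weights = {w: tf_idf(w, doc, corpus) * con_sem_sim(w, doc, vectors)
               for w in set(doc)}
    ranked = sorted(weights, key=lambda w: (-weights[w], w))
    keep = math.ceil(alpha * len(ranked))  # rounding up is an assumption
    return list(tags) + [w for w in ranked[:keep] if w not in tags]


# Toy data: three tokenized descriptions and hand-made 2-d "word vectors".
toy_corpus = [["map", "route", "fast"], ["map", "photo", "share"], ["pay", "card", "fast"]]
toy_vectors = {"map": [1.0, 0.0], "route": [0.8, 0.6], "fast": [0.0, 1.0]}
s_fw = feature_words(toy_corpus[0], ["mapping"], toy_corpus, toy_vectors)
```

In the toy data, "route" is rare in the corpus and close to both neighbours, so it ranks first; "fast" is dropped by the 60% cut.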
  • A topic model is used to convert the feature words of each service into a service representation vector; the similarity of two services can then be determined from the similarity of their representation vectors.
  • Building on the GSDMM model, a GSDMM model with topic distribution probability correction is proposed.
  • By correcting the distribution probabilities of the non-key topics in the service representation vector, the model improves the completeness of the information described by the generated vector.
  • GSDMM is an unsupervised probabilistic generative model: documents are generated by the Dirichlet Mixture Model (DMM), and the model is then solved approximately with the Gibbs sampling algorithm. Compared with other topic models, GSDMM is better suited to topic feature extraction from short texts.
  • Φ is the word-topic distribution matrix, which gives the probability that the word w belongs to the k-th topic
  • φ_{k,w} represents the probability of the word w under topic k
  • the sum of the topic distributions of all words in the same document is 1
  • Θ is the document-topic distribution matrix, which gives the probability distribution of document d over the topics
  • θ_{k,d} represents the probability of document d under topic k
  • The formula for the probability that a description belongs to a topic in Gibbs sampling uses the following symbols:
  • K represents the initial number of topics
  • D represents the total number of descriptions in the corpus
  • m_z represents the number of documents under topic z
  • n_z represents the number of words under topic z
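For orientation, the collapsed Gibbs sampling update commonly given for GSDMM in the literature is reproduced below. It is offered as a reference form consistent with the symbols K, D, m_z, and n_z defined above, not as a confirmed copy of the formula in the patent drawing. The Dirichlet hyperparameters α and β, the vocabulary size V, the document length N_d, and the per-word counts N_d^w and n_z^w follow the usual GSDMM notation and are assumptions here; note that this α is unrelated to the feature word proportion α used in step 26. The subscript ¬d means that document d is excluded from the count.

```latex
p(z_d = z \mid \vec{z}_{\neg d}, \vec{d}) \;\propto\;
\frac{m_{z,\neg d} + \alpha}{D - 1 + K\alpha}
\cdot
\frac{\prod_{w \in d} \prod_{j=1}^{N_d^w} \left( n_{z,\neg d}^{w} + \beta + j - 1 \right)}
     {\prod_{i=1}^{N_d} \left( n_{z,\neg d} + V\beta + i - 1 \right)}
```

The first factor favours topics that already hold many documents; the second favours topics whose word counts resemble the words of document d.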
  • The topic with the largest distribution probability in the service representation vector generated by GSDMM is called the key topic of the vector; the other topics are secondary topics.
  • A topic distribution probability correction factor δ is introduced into GSDMM, and the distribution probability of each secondary topic in the service representation vector is corrected by δ, which effectively improves the discriminative power of the service representation vector.
  • The correction factor is δ(w,s) = argmax(φ_{k,w}) * ContextWeight(w,s), where argmax(φ_{k,w}) denotes the maximum probability value of the word w over all K topics and k is the topic at which that maximum is attained.
  • ContextWeight(w, s) is the context weight of the word w in the service description of s.
  • Lines 1-3 of Algorithm 2 call Algorithm 1 to calculate the context weight ContextWeight(w, s) of each word w in the service description text of each service s and to select the service feature word set s_fw.
  • Lines 4-8 of the algorithm input the feature word sets of all services in the service set into the GSDMM model in turn and then run 10 rounds of training to obtain the topic-word matrix Φ, the service-topic matrix Θ, and the initial service representation vector srv(s) of each service s.
  • Lines 9-10 of the algorithm take each word w in the feature word set s_fw of service s, find in the topic-word matrix Φ the topic k with the largest distribution probability for w, and multiply the maximum topic probability value argmax(φ_{k,w}) of w by its context weight ContextWeight(w, s) to obtain the correction factor δ(w, s) of the word w for the representation topics of service s.
  • Lines 12-13 of the algorithm determine whether the topic k with the maximum probability for the word w is a secondary topic in the representation vector of service s. If it is, the existing distribution probability of topic k is multiplied by 1 + δ(w, s), which completes the topic probability distribution correction based on word w. Once all words in the feature word set of service s have been processed, the final service representation vector is obtained.
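The correction step can be sketched in a few lines. The sketch assumes the trained topic-word probabilities are available as a mapping from word to a per-topic list (`phi`), that the key topic is fixed from the initial vector before any correction, and that the corrected vector is not renormalized; the source does not state the last two points, so treat them as assumptions. The function name is hypothetical.

```python
def correct_representation(srv, feature_words, phi, context_weight):
    """Boost secondary topics of a service representation vector.

    srv            -- initial topic probabilities of the service (list of floats)
    feature_words  -- feature words of the service
    phi            -- word -> list of per-topic probabilities for that word
    context_weight -- word -> ContextWeight(w, s)
    """
    # Key topic: the topic with the largest probability in the initial vector.
    key_topic = max(range(len(srv)), key=lambda k: srv[k])
    corrected = list(srv)
    for w in feature_words:
        probs = phi[w]
        # Topic where word w has its maximum probability.
        k = max(range(len(probs)), key=lambda t: probs[t])
        if k != key_topic:  # only secondary topics are corrected
            delta = probs[k] * context_weight[w]
            corrected[k] *= 1 + delta
    return corrected


# Toy example with three topics; topic 0 is the key topic.
srv_s = [0.7, 0.2, 0.1]
phi_toy = {"route": [0.1, 0.5, 0.05], "map": [0.6, 0.1, 0.1]}
cw_toy = {"route": 0.4, "map": 0.2}
final_srv = correct_representation(srv_s, ["route", "map"], phi_toy, cw_toy)
```

Here "route" peaks on the secondary topic 1, so that entry is scaled by 1 + 0.5 * 0.4 = 1.2; "map" peaks on the key topic and leaves the vector unchanged.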
  • Given equal functional similarity, two services with more identical or similar collaboration relationships should preferentially be placed in the same service cluster.
  • The concept of service collaboration similarity is therefore proposed; it measures the probability that two services appear in the context of similar composite service processes.
  • The collaboration similarity of two services depends on two factors: service co-occurrence rate and process distance. The more often the two services appear in similar service processes, the greater their collaboration similarity; the closer the process distance between each of the two services and a common service, the greater their collaboration similarity.
  • The service collaboration graph is a weighted undirected graph.
  • The nodes of the graph represent services, and the edges represent the collaboration relationship between the two services at their endpoints.
  • A large number of service process models have accumulated on various cloud platforms. By traversing these models, the services in each process model are abstracted as nodes of the service collaboration graph and the service transfer dependencies are mapped to its edges, which completes the construction of the service collaboration graph.
  • A collaboration vector is generated for each service in the graph with node2vec; the collaboration vector of service s is defined as cf(s).
  • The collaboration similarity between any two services si and sj is the cosine of the angle between cf(si) and cf(sj), where m is the dimension of the service collaboration vector; preferably, m is set to 128.
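The graph construction and the cosine-based collaboration similarity can be sketched as below. The sketch assumes each process model is a simple ordered list of service ids (so consecutive services satisfy si → sj) and uses co-occurrence counts as edge weights; the source says the graph is weighted but does not say how the weights are set, so that choice is an assumption. The collaboration vectors are hand-written stand-ins for node2vec output, and all names are hypothetical.

```python
import math


def build_collaboration_graph(process_models):
    """Return {frozenset({si, sj}): weight} for an undirected collaboration graph.

    Each process model is an ordered list of service ids; the weight counts how
    often two services are directly connected (assumed weighting scheme).
    """
    edges = {}
    for sp in process_models:
        for si, sj in zip(sp, sp[1:]):
            if si != sj:
                edge = frozenset((si, sj))
                edges[edge] = edges.get(edge, 0) + 1
    return edges


def collaboration_similarity(cf_i, cf_j):
    """Cosine similarity of two collaboration vectors of equal dimension m."""
    dot = sum(a * b for a, b in zip(cf_i, cf_j))
    ni = math.sqrt(sum(a * a for a in cf_i))
    nj = math.sqrt(sum(b * b for b in cf_j))
    return dot / (ni * nj) if ni and nj else 0.0


models = [["geo", "map", "route"], ["map", "geo"], ["pay", "invoice"]]
graph = build_collaboration_graph(models)

# Stand-ins for node2vec embeddings (m = 2 here instead of 128).
cf = {"geo": [1.0, 0.0], "map": [1.0, 0.0], "pay": [0.0, 1.0]}
sim_geo_map = collaboration_similarity(cf["geo"], cf["map"])
sim_geo_pay = collaboration_similarity(cf["geo"], cf["pay"])
```

In practice the edge dictionary would be handed to a node2vec implementation to learn the vectors; only the similarity function is then applied to the result.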
  • the method of calculating the distance between services in 1. and 2. is as follows:
  • The GetCenters function uses this distance to select initial center points that are as far apart as possible.
  • The Get_new_centers function calculates the average service representation vector and the average service collaboration vector of each cluster as the new center point.
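The clustering loop of steps 61 to 65 can be sketched as follows. Two points are assumptions and should be read as such: the fused distance `lam * Euclidean(srv) + (1 - lam) * (1 - cosine(cf))` is one plausible way to combine the two similarities (the source's own distance formula is not reproduced in the text), and the seeding is a deterministic farthest-first variant of k-means++ rather than its usual randomized sampling. `lam`, `get_centers`, `get_new_centers`, and `cluster` are hypothetical names modelled on the functions mentioned above.

```python
import math


def euclid(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def distance(a, b, lam):
    """a, b are (srv, cf) pairs. ASSUMED fusion of functional and collaboration terms."""
    return lam * euclid(a[0], b[0]) + (1 - lam) * (1 - cosine(a[1], b[1]))


def get_centers(points, k, lam):
    """Farthest-first seeding: each new center is the point farthest from the chosen ones."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(distance(p, c, lam) for c in centers)))
    return centers


def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def get_new_centers(points, labels, k, old_centers):
    """Average representation and collaboration vectors per cluster."""
    centers = []
    for c in range(k):
        members = [p for p, label in zip(points, labels) if label == c]
        if members:
            centers.append((mean([m[0] for m in members]), mean([m[1] for m in members])))
        else:
            centers.append(old_centers[c])  # keep the old center for an empty cluster
    return centers


def cluster(points, k, lam=0.5, max_iter=100):
    centers = get_centers(points, k, lam)
    labels = []
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda c: distance(p, centers[c], lam)) for p in points]
        new_centers = get_new_centers(points, labels, k, centers)
        if new_centers == centers:  # centers unchanged: converged
            break
        centers = new_centers
    return labels


# Four toy services as (srv, cf) pairs forming two obvious groups.
services = [([0.0, 0.0], [1.0, 0.0]), ([0.1, 0.0], [0.9, 0.1]),
            ([5.0, 5.0], [0.0, 1.0]), ([5.1, 5.0], [0.1, 0.9])]
labels = cluster(services, k=2, lam=0.5)
```

Setting `lam` to 1.0 reduces the sketch to clustering on functional similarity alone, which mirrors the "before integration" baseline described in the experiments.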
  • The present invention uses real Web API services from the ProgrammableWeb website as the clustering objects; a total of 23,000 services were crawled, and 21,307 remained after invalid services were deleted.
  • The evaluation indicators of service clustering are usually divided into two categories: external and internal.
  • External evaluation indicators use sample label information to evaluate whether the clustering is reasonable.
  • Internal evaluation indicators evaluate the clustering effect through parameters that describe clustering quality.
  • NMI: Normalized Mutual Information.
  • ARI (Adjusted Rand Index) reflects the degree of overlap between the true labels and the clustering result; the higher the score, the greater the overlap.
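As a concrete illustration of an external indicator, the Adjusted Rand Index can be computed directly from the label contingency counts. This is a minimal sketch of the standard ARI definition written with the standard library only; it is not the evaluation code used in the experiments, and the function name is hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Standard ARI: pair-counting agreement corrected for chance."""
    n = len(labels_true)
    # Pairs of items that share both the true label and the predicted label.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster on both sides
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


ari_perfect = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
ari_partial = adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2])
```

Because ARI compares pairings rather than label names, swapping the cluster ids in the first example still gives a perfect score.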
  • After step 4, the functional similarity between different services is obtained.
  • Experiment 1 is then performed.
  • Selecting words by context weight ranking effectively reduces noise data.
  • This improves the quality of the text from which service representation vectors are generated, and the revised GSDMM model improves the completeness of the information described by the generated vectors, thereby improving service clustering accuracy.
  • Fig. 4 and Fig. 5 show the comparison indicators obtained when different models and methods are used to cluster the 21,307 services.
  • R_GSDMM in the figures is the method of the present invention. The comparison data show that this method clearly outperforms the other models and methods;
  • The collaboration similarity is calculated in step 5, and the collaboration relationship is integrated in step 6.
  • Experiment 2 compares the clustering indicators obtained before and after integrating the collaboration relationship.
  • The collaboration similarity measures the probability that two services appear in the context of similar composite service processes, so it accurately measures similar collaboration relationships between different services and improves the rationality of Web service clustering.
  • Step 5: The services that participate in mashup services on the ProgrammableWeb website are extracted, and the service collaboration graph is constructed.
  • Figure 3 shows part of the constructed graph.
  • node2vec is used to generate the service collaboration vectors, from which the collaboration similarity is calculated.
  • Step 6: The functional similarity calculated in step 4 and the collaboration similarity generated in step 5 are fused, and k-means++ is used for clustering.
  • The clustering effect is then clearly better than that of traditional clustering methods that ignore the service collaboration relationship.
  • R_GSDMM_K is the clustering indicator without the collaboration relationship integrated
  • R_GSDMM_K_C is the clustering indicator after the collaboration relationship is integrated. The figure shows that integrating the collaboration relationship clearly improves the clustering effect.

Abstract

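A Restful-type Web service clustering method integrating service collaboration relationships. First, the service description information of the services to be clustered is preprocessed, and the preprocessed description texts are used to build a corpus. Second, a context-weight-based service description feature word extraction algorithm extracts from the corpus a certain proportion of the feature words most relevant to the service features to construct the service feature word set. Third, an improved GSDMM model with a topic probability distribution correction factor is constructed to convert the feature words of each service into a service representation vector, and the functional similarity between different services is calculated with the Euclidean distance formula. A service collaboration graph is then constructed, from which the collaboration similarity between different services is obtained. Finally, the functional similarity and the collaboration similarity are combined into the comprehensive service similarity used for clustering, and clustering with the k-means++ algorithm completes the service clustering.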

Description

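Restful-type Web service clustering method integrating service collaboration relationships
Technical Field
The invention relates to a Restful-type Web service clustering method that integrates service collaboration relationships.
Background
Service clustering groups services with similar functions into service clusters, which narrows the search space for target services and improves service discovery efficiency during service search, service replacement, and service composition. As the mainstream way of implementing services under the SOA architecture, a Web service is a Web API program encapsulated with a standardized protocol; Web services fall into two types, SOAP and Restful. As of January 2021, the ProgrammableWeb site had registered more than 27,000 services, the vast majority of which are Restful Web API services. SOAP-type services use structured WSDL documents to describe service information; a WSDL document explicitly defines various tags, which makes it easy to extract feature information about the service description. When clustering such services, a small number of keywords that represent the service's functional characteristics are usually extracted, and the semantic similarity of these keywords is calculated per tag category to measure functional similarity, so service clustering is easy to implement.
Unlike SOAP-type services, which mostly use structured documents such as WSDL to describe service information, Restful-type Web services usually describe themselves in unstructured natural language. The service description contains no tags, so effective semantic information is hard to extract; the description text is also relatively short, and words about the service's functions, operations, and evaluation are mixed together. To extract service description information effectively, most existing clustering techniques use topic models to generate service representation vectors and cluster Restful-type Web services by calculating the similarity of those vectors.
Existing methods cluster services only from the perspective of functional similarity, and the service representation vectors generated by existing topic models are generally of low quality and cannot fully express service feature information. Existing clustering methods also consider only the functional similarity of Web services and ignore the collaboration relationships between services. How to improve the quality of service representation vector generation is therefore a key issue affecting the effect of service clustering.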
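Summary of the Invention
In view of the above problems, the present invention provides a Restful-type Web service clustering method that integrates service collaboration relationships.
The technical solution that achieves the purpose of the present invention is as follows:
A Restful-type Web service clustering method integrating service collaboration relationships, characterized in that it comprises the following steps:
Step 1: Collect Restful-type Web services; apply word segmentation, function-word removal, and stemming to the service description of each Web service to be clustered; obtain the preprocessed valid service description text for each service; and build a corpus from these texts;
Step 2: Using the context-weight-based service description feature word extraction algorithm, extract from the service descriptions in the corpus a certain proportion of the feature words most relevant to the service features, and construct the service feature word set;
Step 3: Introduce a correction factor to construct an improved GSDMM model with a topic probability distribution correction factor, and use the model to convert the feature words of each service in the service feature word set into a service representation vector;
Step 4: From the service representation vectors, calculate the functional similarity between different services with the Euclidean distance formula;
Step 5: Build a service collaboration graph, use it to describe the collaboration relationships between different services, generate service collaboration vectors, and calculate the collaboration similarity between different services;
Step 6: From the functional similarity and the collaboration similarity, obtain through parameter tuning the comprehensive service similarity used for clustering, and cluster with the k-means++ algorithm to complete the service clustering.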
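Further, the context-weight-based feature word extraction algorithm of Step 2 comprises the following steps:
Step 21: Define a service as a four-tuple s=(Id,n,l,d), where Id is the service identifier, n is the service name, l is the set of service tags, and d is the service description information;
Step 22: Define and initialize as empty the corpus set Corpus_w, which stores all service description texts, and FW_S, the collection of feature word sets of the services contained in the service set S to be clustered;
Step 23: Add the service description text s.d of every service s to Corpus_w, and use Word2Vec to train a vector V(w) for each word w in Corpus_w;
Step 24: For each word w, compute its TF-IDF value TF-IDF(w,s) in service s and its context similarity Con_SemSim(w,s) within the service description. TF-IDF consists of TF and IDF, computed as:
$$tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \qquad (1)$$
$$idf_{i} = \log\frac{|D|}{|\{j : t_i \in d_j\}|} \qquad (2)$$
$$\text{TF-IDF} = TF \times IDF \qquad (3),$$
where $tf_{i,j}$ is the term frequency, $idf_i$ is the inverse document frequency, $d_j$ is the j-th service description text, and $t_i$ is the i-th word in $d_j$; $n_{i,j}$ is the number of occurrences of $t_i$ in $d_j$; $\sum_k n_{k,j}$ is the total number of occurrences of all words in $d_j$; $|D|$ is the total number of service description documents in the corpus; and $|\{j : t_i \in d_j\}|$ is the number of documents containing $t_i$;
The context similarity Con_SemSim(w,s) is the average semantic similarity between the word w in the service description and the other words, where the semantic similarity of two words is computed with the cosine formula for their vectors:
$$\cos\big(V(w_1), V(w_2)\big) = \frac{V(w_1)\cdot V(w_2)}{\lVert V(w_1)\rVert\,\lVert V(w_2)\rVert}$$
Step 25: Multiply TF-IDF(w,s) by the context similarity Con_SemSim(w,s) to obtain the context weight ContextWeight(w,s) of word w in the description text s.d of service s;
Step 26: Sort the words of the service description text s.d by their context weight ContextWeight(w,s) and add the top α proportion of words to s.fw, which yields the service feature word set s.fw.
Preferably, α in Step 26 is 60%.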
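Further, establishing the improved GSDMM model with the probability distribution correction factor comprises:
Step 31: Feed the service feature word sets s_fw of all services in the service set into the GSDMM model in turn;
Step 32: After 10 rounds of training, obtain the topic-word matrix Φ, the service-topic matrix Θ, and the initial service representation vector srv(s) of each service s;
Step 33: For each word w in s_fw, look it up in the topic-word matrix Φ and find the topic k with the largest distribution probability for w; multiply the word's maximum topic probability max_k φ_{k,w} by its context weight ContextWeight(w,s) to obtain the correction factor δ(w,s) of word w for the representation topics of s, that is:
δ(w,s) = max_k φ_{k,w} * ContextWeight(w,s);
Step 34: Determine whether topic k, the topic with the largest probability for word w, is a secondary topic in the representation vector of service s; if it is, multiply the existing distribution probability of topic k in the representation vector by the factor 1+δ(w,s), which completes the topic probability correction based on word w;
Step 35: Apply the probability correction of Step 34 to every word in the feature word set of service s to obtain the final service representation vector.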
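Further, Step 5 comprises:
Step 51: Traverse the service process models sp on the cloud platform and define each service in sp as a service node; if service si is a predecessor of service sj, then sj collaboratively depends on si, written si→sj;
Step 52: Define an undirected graph G=(V,E), where V={v1,v2,v3,…,vn} is the set of service nodes and node vi represents a service; E={e=(vi,vj)|1≤i,j≤n} is the set of collaboration edges, and e=(vi,vj) indicates that the services si and sj corresponding to nodes vi and vj satisfy si→sj or sj→si; the resulting G is the service collaboration graph;
Step 53: Based on the service collaboration graph G, use node2vec to generate a collaboration vector for each service in G, with the collaboration vector of service s written cf(s); the collaboration similarity between any two services si and sj is then computed as:
$$cs(s_i, s_j) = \frac{\sum_{t=1}^{m} cf(s_i)_t \, cf(s_j)_t}{\sqrt{\sum_{t=1}^{m} cf(s_i)_t^{2}}\;\sqrt{\sum_{t=1}^{m} cf(s_j)_t^{2}}}$$
where m is the dimension of the service collaboration vector and $cf(s)_t$ is the t-th component of cf(s).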
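Further, Step 6 comprises:
Step 61: Given the service representation vectors array_srv, the service collaboration vectors array_cs, and the hyperparameter λ that sets the weight ratio between functional similarity and collaboration similarity, initialize the cluster centers by selecting initial centers that are as far apart as possible;
Step 62: According to the target number of clusters k, compute the distance between each service s and the cluster centers, find the nearest center, and assign the service to it;
Step 63: Compute the mean service description vector and the mean service collaboration vector of each cluster as the new center;
Step 64: Check whether the centers have changed; if so, adopt the new centers and repeat Steps 62-63 for a new round of clustering;
Step 65: When the centers no longer change or the number of iterations reaches the configured maximum max-iter, stop and output the final clustering result.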
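Compared with the prior art, the method has the following beneficial effects:
First, the invention proposes a context-weight-based algorithm for extracting feature words from service descriptions. It fuses the term frequency of each word in the description text with its contextual semantic similarity, ranks words by context weight, and filters out words that have little relevance to the service's function. Selecting words by context-weight rank effectively reduces noisy data and improves the quality of the text from which service representation vectors are generated;
Second, building on the conventional GSDMM model, the invention proposes a GSDMM model with topic distribution probability correction. The improved model introduces a probability distribution correction factor that corrects the distribution probabilities of the non-key topics in the generated service representation vector, which makes the information carried by the vector more complete and thereby improves clustering accuracy;
Third, the invention holds that, at equal functional similarity, two services that share many identical or similar collaboration relationships should preferentially be placed in the same service cluster. It therefore introduces the concept of service collaboration similarity, which measures how likely two services are to appear in the context of similar composite service processes. This measures whether different services have similar collaboration relationships and makes the Web service clustering more reasonable;
In summary, the proposed method improves the quality of the generated service representation vectors, which significantly improves the functional similarity used in clustering. It also takes the collaboration relationships between services into account, using collaboration similarity as a second criterion for clustering, and obtains the optimal weight ratio λ between the two through parameter tuning, which substantially improves both the clustering result and the reasonableness of the Web service clustering.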
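Brief Description of the Drawings
Fig. 1 is a schematic diagram of the conventional GSDMM;
Fig. 2 is a schematic diagram of the improved GSDMM with the correction factor proposed by the invention;
Fig. 3 is a partial example of the service collaboration graph obtained with the service collaboration modeling method proposed by the invention;
Fig. 4 compares the SC metric when clustering with different models and methods in the embodiment;
Fig. 5 compares the DBI metric when clustering with different models and methods in the embodiment;
Fig. 6 compares the AMI metric when clustering with different models and methods in the embodiment;
Fig. 7 compares the NMI metric when clustering with different models and methods in the embodiment;
Fig. 8 compares the ARI metric when clustering with different models and methods in the embodiment;
Fig. 9 compares the clustering results with and without the collaboration relationship fused.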
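Detailed Description
To help those of ordinary skill in the art better understand the technical solution of the invention, it is described further below with reference to the drawings and an embodiment.
The RESTful Web service clustering method incorporating service collaboration relationships proposed by the invention comprises the following steps:
1. Use the NLTK toolkit to tokenize the words in the service description of each service to be clustered, remove function words, and stem the remaining words. For each service s, obtain the effective service description s.d, and build the corpus set Corpus_w from all the words.
2. A context-weight-based method for extracting feature words from service descriptions is proposed. It extracts from each service description a given proportion of the feature words most relevant to the service's features and builds the service feature word set s.fw. Generating service representation vectors from these "denoised" feature words makes the vectors more accurate.
3. An improved GSDMM model with a topic probability distribution correction factor is proposed. The correction factor remedies a shortcoming of the conventional GSDMM model, which over-emphasizes the probability of the key topic and under-weights the probabilities of secondary topics when generating topic vectors. The resulting service representation vectors describe service features more completely and have a more balanced topic distribution.
4. With the two improvements above, the generated service representation vectors describe service features more completely and accurately. From the vectors srv(s), the functional similarity fs(srv(si),srv(sj)) between services is computed with the Euclidean distance formula.
5. A service collaboration graph is constructed to express the collaboration relationships between services. The graph is undirected: its nodes are services that have taken part in a service collaboration, and its edges represent the collaboration relationship between two services. node2vec computes a service collaboration vector cf(s) for each service node, and the collaboration similarity cs(cf(si),cf(sj)) between services is computed with the cosine formula.
6. Through parameter tuning, the ratio between the functional similarity fs and the collaboration similarity cs is set, giving the comprehensive service similarity zs = fs - λ·cs used for clustering. Clustering is then performed with the k-means++ algorithm.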
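The details are as follows:
1. Service feature word extraction based on context weight
Definition 1: Service
A service is defined as a four-tuple s=(Id,n,l,d), where Id is the service identifier, n is the service name, l is the set of service tags, and d is the service description information.
Definition 2: Context weight
Let TS be a text corpus composed of texts Ti, and let wij be a word in text Ti. The context weight of wij in Ti is the product of the TF-IDF value of wij and the mean semantic similarity between wij and the other words.
The invention fuses the term frequency of each word in the description text with its contextual semantic similarity to build the word's context weight in the service description, ranks the words by context weight, and selects a given number of top-ranked words as the feature words of the service description.
Service descriptions contain words that have little relevance to the service's function. If such words are extracted and used to generate service feature vectors, they introduce noise and lower the quality of the vectors. Selecting words by context-weight rank effectively reduces the noise and improves the quality of the text from which the service representation vectors are generated. Let S={si}, 1≤i≤n, be the set of all services to be clustered, and let s be a service in S; Algorithm 1 gives the context-weight-based method for extracting feature words from service descriptions.
Algorithm 1: FeatureWord_Extract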
Figure PCTCN2021130789-appb-000005
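As the algorithm listing shows, lines 1 and 2 initialize two empty sets: Corpus_w, the corpus set that stores all service description texts, and FW_S, the collection of feature word sets of the services in the service set S to be clustered. Line 3 adds the service description texts of all services in S to Corpus_w, and Word2Vec is then used to train a vector V(w) for each word w in the corpus.
When generating the feature word set of service s, lines 6 to 8 compute, for each word w, its TF-IDF value TI(w,s) in service s and its service description context similarity Con_SemSim(w,s). TF-IDF has two parts, TF and IDF. TF (Term Frequency) measures how often each word occurs in the text. IDF (Inverse Document Frequency) is the logarithm of the total number of documents divided by the number of documents containing the word, and reflects how well a term discriminates between documents. The formulas for TF, IDF, and TF-IDF are given in (1), (2), and (3).
$$tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \qquad (1)$$
$$idf_{i} = \log\frac{|D|}{|\{j : t_i \in d_j\}|} \qquad (2)$$
$$\text{TF-IDF} = TF \times IDF \qquad (3)$$
The TF-IDF value increases with the number of times a word occurs in a document and decreases with the number of times it occurs across the whole corpus, which makes it suitable for assessing the importance of a word within the corpus. The TF-IDF value of word w in the service description of service s is written TI(w,s), that is, TI(w,s)=TF-IDF(s.d,Corpus_w).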
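The service description context similarity is the average semantic similarity between a word in the service description and the other words. In Algorithm 1, the semantic similarity between two words is computed with the cosine formula for their vectors:
$$\cos\big(V(w_1), V(w_2)\big) = \frac{V(w_1)\cdot V(w_2)}{\lVert V(w_1)\rVert\,\lVert V(w_2)\rVert}$$
Con_SemSim(w,s) is the mean semantic similarity between word w and all words in s.d-{w} (see line 8 of Algorithm 1).
Word vectors can be generated with Word2Vec using the tools in the Python gensim package. Once Con_SemSim(w,s) and TF-IDF(w,s) have been computed for word w, multiplying TF-IDF(w,s) by Con_SemSim(w,s) gives the context weight ContextWeight(w,s) of w in the description text s.d of service s. The context weight thus fuses term frequency with the semantic similarity between words.
Line 10 of the algorithm generates the service feature word set s.fw. Service tags are the main category basis for storing services on the platform and for users searching for services, and are a key element to be considered in service clustering. When building s.fw, all words in the service tag set s.l are therefore added to s.fw first; the words in the service description are then sorted by their context weight ContextWeight(w,s), and the top α proportion of them is added to s.fw. Experiments show that the extracted feature words yield the best service representation vectors when α is set to 60%. Looping over all services gives FW_S, the feature word sets of the entire service set S to be clustered.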
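The extraction procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's Algorithm 1: all function names are hypothetical, the hand-written two-dimensional vectors stand in for vectors that would be trained with Word2Vec, similarity is averaged over the distinct other words of the description, and rounding the top-α cut up to a whole word is an assumption.

```python
import math

def tf_idf(word, doc, corpus):
    """TF-IDF of `word` in `doc`, following formulas (1)-(3).

    `doc` is a list of tokens and `corpus` a list of such lists that
    contains `doc`, so at least one document contains `word`.
    """
    tf = doc.count(word) / len(doc)
    containing = sum(1 for d in corpus if word in d)
    return tf * math.log(len(corpus) / containing)

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def con_sem_sim(word, doc, vectors):
    """Mean cosine similarity between `word` and the other distinct words of `doc`."""
    others = [w for w in set(doc) if w != word]
    if not others:
        return 0.0
    return sum(cosine(vectors[word], vectors[w]) for w in others) / len(others)

def extract_feature_words(tags, doc, corpus, vectors, alpha=0.6):
    """Return the tags plus the top-`alpha` share of description words by context weight."""
    vocab = sorted(set(doc))  # sorted for a deterministic order among ties
    weight = {w: tf_idf(w, doc, corpus) * con_sem_sim(w, doc, vectors) for w in vocab}
    ranked = sorted(vocab, key=weight.get, reverse=True)
    top = ranked[:math.ceil(alpha * len(ranked))]  # rounding up is an assumption
    return list(tags) + [w for w in top if w not in tags]

# Toy data: three tokenized descriptions and made-up word vectors.
corpus = [
    ["map", "route", "location", "free"],
    ["payment", "card", "free"],
    ["weather", "forecast", "free"],
]
vectors = {"map": [1.0, 0.0], "route": [0.9, 0.1],
           "location": [0.8, 0.2], "free": [0.0, 1.0]}
feature_words = extract_feature_words(["mapping"], corpus[0], corpus, vectors)
```

In this toy corpus the word "free" occurs in every description, so its IDF and therefore its context weight are zero and it falls outside the retained 60%.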
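2. Service representation vector generation based on GSDMM with a topic distribution probability correction factor
Algorithm 1 yields the feature word set of each service. To compute the similarity between two services, a topic model converts the feature words of each service into a service representation vector, and the similarity between two services is determined from the similarity of their vectors.
To make the service representation vector describe service functions more completely, a GSDMM model with topic distribution probability correction is proposed on the basis of GSDMM. The model introduces a probability distribution correction factor that corrects the distribution probabilities of non-key topics in the generated service representation vector, making the information it carries more complete.
GSDMM is an unsupervised probabilistic generative model. It generates documents with the Dirichlet mixture model (DMM) and then solves the model approximately with the Gibbs sampling algorithm. Compared with other topic models, GSDMM is better suited to extracting topic features from short texts.
The Dirichlet mixture model (DMM) is shown in the figure; the probability of obtaining document d from topic k is: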
Figure PCTCN2021130789-appb-000009
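To obtain the word-topic distribution of a description, assume that each topic follows a multinomial distribution over words, so that:
p(w|z=k) = p(w|z=k,Φ) = φ_{k,w}
where Φ is the word-topic distribution matrix, which describes the probability that word w belongs to the k-th topic, and φ_{k,w} denotes the probability of word w under topic k. The distribution over all words sums to 1, that is,
$$\sum_{w} \varphi_{k,w} = 1$$
Likewise, the probability of each topic follows a multinomial distribution:
p(d|z=k) = p(d|z=k,Θ) = θ_{k,d}
where Θ is the document-topic distribution matrix and θ_{k,d} denotes the probability of document d under topic k. Likewise, for a single document description,
$$\sum_{k} \theta_{k,d} = 1$$
Gibbs sampling repeatedly samples a word over all topics until the topic distribution matrix of the word is obtained, which yields the document-topic matrix Θ of size d×z and the word-topic matrix Φ of size w×z. In Gibbs sampling, the probability that a description belongs to a given topic is computed as follows: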
Figure PCTCN2021130789-appb-000012
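where K is the initial number of topics, D is the total number of descriptions in the corpus, $m_z$ is the number of documents in topic z, $n_z$ is the number of words in topic z, $n_z^{w}$ is the number of occurrences of word w in topic z, and the subscript $\neg d$ indicates that the current document is excluded.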
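In the invention, the topic with the largest distribution probability in a service representation vector generated by GSDMM is called the key topic of the vector, and the other topics are secondary topics. A topic distribution probability correction factor δ is introduced into GSDMM; using δ to correct the distribution probabilities of the secondary topics in the generated vector effectively improves the discriminability of the service representation vector. The correction factor is δ(w,s) = max_k φ_{k,w} * ContextWeight(w,s), where max_k φ_{k,w} is the largest probability of word w over all K topics, k being the topic at which w attains that maximum, and ContextWeight(w,s) is the context weight of word w in the service description of s. The method for computing service representation vectors with the topic-probability-corrected GSDMM is given below.
Algorithm 2: SRV_RGSDMM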
Figure PCTCN2021130789-appb-000015
Figure PCTCN2021130789-appb-000016
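Lines 1-3 of Algorithm 2 first call Algorithm 1 to compute the context weight ContextWeight(w,s) of each word w in the description text of each service s and to select the service feature word set s_fw. Lines 4-8 feed the feature word sets of all services into the GSDMM model in turn and, after 10 rounds of training, obtain the topic-word matrix Φ, the service-topic matrix Θ, and the initial service representation vector srv(s) of each service s.
For each word w in the feature word set s_fw of service s, lines 9-10 find in the word-topic matrix Φ the topic k with the largest distribution probability for w and multiply the word's maximum topic probability max_k φ_{k,w} by its context weight ContextWeight(w,s) to obtain the correction factor δ(w,s) of w for the representation topics of s. Lines 12-13 determine whether topic k is a secondary topic in the representation vector of s; if so, the existing distribution probability of topic k in the vector is multiplied by 1+δ(w,s), which completes the topic probability correction based on w. Once every word in the feature word set of s has been processed, the final service representation vector is obtained.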
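The correction of secondary topics might look like the sketch below. It assumes that srv(s), Φ, and the context weights are already available as plain Python lists and dictionaries (no trained GSDMM model is included), that the key topic is fixed from the initial vector, that the probability of a secondary topic is scaled by 1+δ(w,s), and that no renormalization follows; the function name and the toy values are hypothetical.

```python
def correct_topic_distribution(srv, feature_words, phi, context_weight):
    """Scale secondary-topic probabilities of `srv` by 1 + delta(w, s).

    srv            -- initial topic probabilities of service s (list of floats)
    feature_words  -- feature words of service s
    phi            -- dict: word -> list of per-topic probabilities
    context_weight -- dict: word -> ContextWeight(w, s)
    """
    # The key topic is the one with the largest probability in the initial vector.
    key_topic = max(range(len(srv)), key=srv.__getitem__)
    corrected = list(srv)
    for w in feature_words:
        # Topic under which word w has its largest probability.
        k = max(range(len(phi[w])), key=phi[w].__getitem__)
        if k != key_topic:  # only secondary topics are corrected
            delta = phi[w][k] * context_weight[w]
            corrected[k] *= 1 + delta
    return corrected

# Toy values, invented for illustration.
srv = [0.7, 0.2, 0.1]
phi = {"map": [0.6, 0.3, 0.1], "route": [0.1, 0.8, 0.1]}
context_weight = {"map": 0.5, "route": 0.5}
corrected = correct_topic_distribution(srv, ["map", "route"], phi, context_weight)
```

Here "map" peaks on the key topic and leaves the vector unchanged, while "route" peaks on topic 1 and raises its probability from 0.2 to 0.2 × (1 + 0.8 × 0.5).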
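Assume the generated service representation vector has k dimensions, that is, srv(s)=(v1,v2,…,vk). The functional similarity between any two services si and sj is defined as the Euclidean distance between their vectors, so a smaller value indicates more similar functions:
$$fs(s_i, s_j) = \sqrt{\sum_{t=1}^{k} \left(v_{i,t} - v_{j,t}\right)^{2}}$$
where $v_{i,t}$ is the t-th component of srv(si).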
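As a small worked illustration of the Euclidean measure, the following sketch computes fs for two toy representation vectors. The function name and the vectors are hypothetical.

```python
import math

def functional_similarity(srv_i, srv_j):
    """Euclidean distance between two service representation vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(srv_i, srv_j)))

# Two three-topic vectors that differ in the first and last topic.
fs = functional_similarity([0.7, 0.2, 0.1], [0.1, 0.2, 0.9])
```

For these vectors the squared differences are 0.36, 0, and 0.64, so fs is approximately 1.0.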
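3. Measuring service collaboration relationships
At equal functional similarity, two services that share many identical or similar collaboration relationships should preferentially be placed in the same service cluster. To measure whether services have similar collaboration relationships, the concept of service collaboration similarity is introduced; it measures how likely two services are to appear in the context of similar composite service processes. The collaboration similarity of two services depends on two factors, the service co-occurrence rate and the process distance. The more often two services appear in similar service processes, the greater their collaboration similarity; the closer the process distance of two services to the same service, the greater their collaboration similarity.
To compute the collaboration similarity between two services, a service collaboration graph is constructed. The service collaboration graph is a weighted undirected graph whose nodes represent services and whose edges indicate that the two services they join have a collaboration relationship. Cloud platforms have accumulated large numbers of service process models. The graph is built by traversing these models, abstracting each service in a process model as a node of the graph, and mapping each service transition dependency to an edge.
Definition 3: Collaboration dependency
In a process model sp, if service si is a predecessor of service sj, then sj is said to collaboratively depend on si, written si→sj.
Definition 4: Service collaboration graph
The service collaboration graph is defined as an undirected graph G=(V,E). V={v1,v2,v3,…,vn} is the set of service nodes, where node vi represents a service; E={e=(vi,vj)|1≤i,j≤n} is the set of collaboration edges, where e=(vi,vj) indicates that the services si and sj corresponding to nodes vi and vj satisfy si→sj or sj→si.
For a constructed service collaboration graph, node2vec can generate a collaboration vector for each service in the graph; the collaboration vector of service s is written cf(s). Assume the generated service collaboration vector has m dimensions, that is, cf(s)=(v1,v2,…,vm). The collaboration similarity between any two services si and sj is defined as
$$cs(s_i, s_j) = \frac{\sum_{t=1}^{m} cf(s_i)_t \, cf(s_j)_t}{\sqrt{\sum_{t=1}^{m} cf(s_i)_t^{2}}\;\sqrt{\sum_{t=1}^{m} cf(s_j)_t^{2}}}$$
where m is the dimension of the service collaboration vector and $cf(s)_t$ is the t-th component of cf(s); preferably, m is set to 128.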
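Definitions 3 and 4 and the cosine measure can be sketched as follows. This is an illustration only: it assumes each process model is given as a list of (predecessor, successor) pairs, stores the undirected graph as adjacency sets, and leaves out the node2vec embedding step, so the vectors passed to the similarity function are made-up stand-ins for cf(s). All names are hypothetical.

```python
import math

def build_collaboration_graph(process_models):
    """Build an undirected adjacency map from predecessor/successor pairs.

    process_models -- iterable of process models, each a list of
                      (si, sj) pairs meaning si -> sj.
    """
    graph = {}
    for sp in process_models:
        for si, sj in sp:
            # si -> sj and sj -> si map to the same undirected edge.
            graph.setdefault(si, set()).add(sj)
            graph.setdefault(sj, set()).add(si)
    return graph

def collaboration_similarity(cf_i, cf_j):
    """Cosine similarity of two collaboration vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(cf_i, cf_j))
    norm = math.sqrt(sum(a * a for a in cf_i)) * math.sqrt(sum(b * b for b in cf_j))
    return dot / norm if norm else 0.0

# Two toy process models with invented service names.
models = [
    [("geocode", "map"), ("map", "route")],
    [("geocode", "weather")],
]
graph = build_collaboration_graph(models)
```

In practice the adjacency map would be handed to a graph-embedding tool to obtain the 128-dimensional vectors mentioned above; that step is outside this sketch.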
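4. Service clustering algorithm fusing functional semantics and collaboration similarity
Figure PCTCN2021130789-appb-000019
Figure PCTCN2021130789-appb-000020
The distance between services used in steps 1 and 2 is computed as follows:
Distance(si,sj)=fs(si,sj)-λ*cs(si,sj);
The algorithm proceeds as follows:
1. The GetCenters function uses Distance to select initial centers that are as far apart as possible;
2. When new data is loaded, the GetClusters function uses Distance to find the nearest center and assigns the data to its cluster;
3. The Get_new_centers function computes the mean service description vector and the mean service collaboration vector of each cluster as the new center;
4. Check whether the centers have changed; if so, use the new centers as cluster centers and repeat steps 2-3;
5. When the center vectors no longer change or the iteration limit is reached, the algorithm ends and outputs the clustering result.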
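The five steps above can be sketched in plain Python. This is a simplified illustration, not the patent's algorithm listing: each service is assumed to be a pair (representation vector, collaboration vector), and a deterministic farthest-first seeding that starts from the first service stands in for the randomized k-means++ seeding so that the example is reproducible. Function names and data are hypothetical.

```python
import math

def euclid(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def distance(a, b, lam):
    """Distance(si, sj) = fs(si, sj) - lambda * cs(si, sj)."""
    return euclid(a[0], b[0]) - lam * cosine(a[1], b[1])

def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cluster(services, k, lam=0.5, max_iter=100):
    # Step 1: pick initial centers that are far apart (farthest-first stand-in).
    centers = [services[0]]
    while len(centers) < k:
        centers.append(max(
            services,
            key=lambda s: min(distance(s, c, lam) for c in centers)))
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every service to its nearest center.
        labels = [min(range(k), key=lambda c: distance(s, centers[c], lam))
                  for s in services]
        # Step 3: new center = mean representation and mean collaboration vector.
        new_centers = []
        for c in range(k):
            members = [s for s, l in zip(services, labels) if l == c]
            if members:
                new_centers.append((mean([m[0] for m in members]),
                                    mean([m[1] for m in members])))
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        # Steps 4-5: stop when the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

# Four toy services: (representation vector, collaboration vector).
services = [
    ([0.9, 0.1], [1.0, 0.0]),
    ([0.8, 0.2], [0.9, 0.1]),
    ([0.1, 0.9], [0.0, 1.0]),
    ([0.2, 0.8], [0.1, 0.9]),
]
labels, centers = cluster(services, k=2, lam=0.5)
```

On this toy input the first two services end up in one cluster and the last two in the other. Because Distance subtracts a similarity from a distance it can be negative, which does not affect the nearest-center comparison.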
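Embodiment
1. Experimental data
The invention uses real Web API services from the ProgrammableWeb website as the objects to be clustered. A total of 23,000 services were crawled; after invalid services were removed, 21,307 services remained.
2. Evaluation metrics
Evaluation metrics for service clustering are usually divided into external and internal metrics. External metrics use sample label information to judge whether the clustering is reasonable. Internal metrics evaluate the clustering through parameters that characterize clustering quality.
To evaluate the effectiveness of the proposed clustering method, the commonly used internal metrics SC and DBI and external metrics NMI, AMI, and ARI are adopted. The metrics are briefly described below:
(1) Silhouette Coefficient (SC): a higher score indicates better clustering;
(2) Davies-Bouldin Index (DBI): the smaller the intra-cluster distances and the larger the inter-cluster distances, the smaller the DBI and the better the clustering;
(3) Normalized Mutual Information (NMI): a larger NMI score indicates better clustering;
(4) Adjusted Mutual Information (AMI): a larger AMI indicates that the clustering result agrees more closely with the ground truth;
(5) Adjusted Rand Index (ARI): ARI reflects the degree of overlap between the true labels and the clustering result; a higher score indicates greater overlap.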
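To make one of the external metrics concrete, the sketch below computes the Adjusted Rand Index from its standard pair-counting definition using only the Python standard library. It illustrates the metric itself and is not the evaluation code used in the experiments; the function name and label lists are hypothetical.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand Index computed from the contingency table of two labelings."""
    n = len(true_labels)
    # Pairs of items that share both the true class and the predicted cluster.
    sum_ij = sum(comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values())
    # Pairs that share the true class, and pairs that share the predicted cluster.
    sum_a = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_b = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster on both sides
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

# Cluster ids are arbitrary, so a relabeled perfect clustering still scores 1.
perfect = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
```

A clustering that splits every true class evenly, such as [0, 1, 0, 1] against [0, 0, 1, 1], scores below zero, which indicates agreement worse than chance.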
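3. Simulation experiments
The experiments follow Steps 1-6 of the technical solution. Once Step 4 has been executed, the functional similarity between services is available, and Experiment 1 is carried out at this point. The words selected by context-weight rank effectively reduce noisy data and improve the quality of the text from which service representation vectors are generated, and the corrected GSDMM model makes the information carried by the vectors more complete, which improves clustering accuracy.
Experiment 1:
Figs. 4 and 5 show the comparison metrics obtained when the 21,307 services are clustered with different models and methods. R_GSDMM in the figures is the method of the invention; the comparison shows that it clearly outperforms the other models and methods;
For the external metrics, service categories are used as the cluster labels, and three labeled service subsets were selected for validation. The service information is listed in Table 1, and the averaged results are shown in Figs. 6-8. The bar charts in Figs. 6-8 show that the proposed method outperforms the other models and methods on every external metric.
Table 1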
Figure PCTCN2021130789-appb-000021
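Step 5 is then executed to compute the collaboration similarity, and Step 6 incorporates the collaboration relationship. Experiment 2 is carried out at this point to compare the clustering metrics before and after the collaboration relationship is incorporated. Collaboration similarity measures how likely two services are to appear in the context of similar composite service processes, so it measures whether different services have similar collaboration relationships and makes the Web service clustering more reasonable.
Experiment 2:
The services that take part in mashup services on the ProgrammableWeb website were extracted to build the service collaboration graph; Fig. 3 shows part of the resulting graph. On the basis of this graph, node2vec generates the service collaboration vectors, from which the collaboration similarity is computed (Step 5). Then, in Step 6, the functional similarity computed in Step 4 and the collaboration similarity generated in Step 5 are fused, and the fused similarity is clustered with k-means++. The resulting clustering is clearly better than that of conventional clustering methods that do not consider service collaboration relationships.
In Fig. 9, R_GSDMM_K is the clustering metric without the collaboration relationship fused, and R_GSDMM_K_C is the clustering metric with it fused. The figure shows that fusing the collaboration relationship clearly improves the clustering.
Content not described in detail in this specification belongs to the prior art known to those skilled in the art. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions recorded in the embodiments or make equivalent substitutions for some of their technical features. Any modification, equivalent substitution, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.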

Claims (6)

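  1. A RESTful Web service clustering method incorporating service collaboration relationships, characterized by comprising the following steps:
    Step 1: Collect RESTful Web services; preprocess the service description information of each Web service to be clustered by tokenization, function-word removal, and stemming; for each service, obtain the preprocessed effective service description text; and build a corpus from these texts;
    Step 2: Using a context-weight-based feature word extraction algorithm for service descriptions, extract from the service descriptions in the corpus a given proportion of the feature words most relevant to the service's features, and build the service feature word set;
    Step 3: Introduce a correction factor to construct an improved GSDMM model with a topic probability distribution correction factor, and use this model to convert the feature words of each service in the service feature word set into a service representation vector;
    Step 4: From the service representation vectors, compute the functional similarity between services with the Euclidean distance formula;
    Step 5: Construct a service collaboration graph that describes the collaboration relationships between services, generate service collaboration vectors, and then compute the collaboration similarity between services;
    Step 6: From the functional similarity and the collaboration similarity, obtain through parameter tuning the comprehensive service similarity used for clustering, and cluster with the k-means++ algorithm to complete the service clustering.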
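  2. The RESTful Web service clustering method incorporating service collaboration relationships according to claim 1, characterized in that the context-weight-based feature word extraction algorithm of Step 2 comprises the following steps:
    Step 21: Define a service as a four-tuple s=(Id,n,l,d), where Id is the service identifier, n is the service name, l is the set of service tags, and d is the service description information;
    Step 22: Define and initialize as empty the corpus set Corpus_w, which stores all service description texts, and FW_S, the collection of feature word sets of the services contained in the service set S to be clustered;
    Step 23: Add the service description text s.d of every service s to Corpus_w, and use Word2Vec to train a vector V(w) for each word w in Corpus_w;
    Step 24: For each word w, compute its TF-IDF value TF-IDF(w,s) in service s and its context similarity Con_SemSim(w,s) within the service description. TF-IDF consists of TF and IDF, computed as:
    $$tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \qquad (1)$$
    $$idf_{i} = \log\frac{|D|}{|\{j : t_i \in d_j\}|} \qquad (2)$$
    $$\text{TF-IDF} = TF \times IDF \qquad (3),$$
    where $tf_{i,j}$ is the term frequency, $idf_i$ is the inverse document frequency, $d_j$ is the j-th service description text, and $t_i$ is the i-th word in $d_j$; $n_{i,j}$ is the number of occurrences of $t_i$ in $d_j$; $\sum_k n_{k,j}$ is the total number of occurrences of all words in $d_j$; $|D|$ is the total number of service description documents in the corpus; and $|\{j : t_i \in d_j\}|$ is the number of documents containing $t_i$;
    The context similarity Con_SemSim(w,s) is the average semantic similarity between the word w in the service description and the other words, where the semantic similarity of two words is computed with the cosine formula for their vectors:
    $$\cos\big(V(w_1), V(w_2)\big) = \frac{V(w_1)\cdot V(w_2)}{\lVert V(w_1)\rVert\,\lVert V(w_2)\rVert}$$
    Step 25: Multiply TF-IDF(w,s) by the context similarity Con_SemSim(w,s) to obtain the context weight ContextWeight(w,s) of word w in the description text s.d of service s;
    Step 26: Sort the words of the service description text s.d by their context weight ContextWeight(w,s) and add the top α proportion of words to s.fw, which yields the service feature word set s.fw.
  3. The RESTful Web service clustering method incorporating service collaboration relationships according to claim 2, characterized in that α in Step 26 is 60%.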
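  4. The RESTful Web service clustering method incorporating service collaboration relationships according to claim 1, characterized in that establishing the improved GSDMM model with the probability distribution correction factor of Step 3 comprises:
    Step 31: Feed the service feature word sets s_fw of all services in the service set into the GSDMM model in turn;
    Step 32: After 10 rounds of training, obtain the topic-word matrix Φ, the service-topic matrix Θ, and the initial service representation vector srv(s) of each service s;
    Step 33: For each word w in s_fw, look it up in the topic-word matrix Φ and find the topic k with the largest distribution probability for w; multiply the word's maximum topic probability max_k φ_{k,w} by its context weight ContextWeight(w,s) to obtain the correction factor δ(w,s) of word w for the representation topics of s, that is:
    δ(w,s) = max_k φ_{k,w} * ContextWeight(w,s);
    Step 34: Determine whether topic k, the topic with the largest probability for word w, is a secondary topic in the representation vector of service s; if it is, multiply the existing distribution probability of topic k in the representation vector by the factor 1+δ(w,s), which completes the topic probability correction based on word w;
    Step 35: Apply the probability correction of Step 34 to every word in the feature word set of service s to obtain the final service representation vector.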
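  5. The RESTful Web service clustering method incorporating service collaboration relationships according to claim 1, characterized in that Step 5 comprises:
    Step 51: Traverse the service process models sp on the cloud platform and define each service in sp as a service node; if service si is a predecessor of service sj, then sj collaboratively depends on si, written si→sj;
    Step 52: Define an undirected graph G=(V,E), where V={v1,v2,v3,…,vn} is the set of service nodes and node vi represents a service; E={e=(vi,vj)|1≤i,j≤n} is the set of collaboration edges, and e=(vi,vj) indicates that the services si and sj corresponding to nodes vi and vj satisfy si→sj or sj→si; the resulting G is the service collaboration graph;
    Step 53: Based on the service collaboration graph G, use node2vec to generate a collaboration vector for each service in G, with the collaboration vector of service s written cf(s); the collaboration similarity between any two services si and sj is then computed as:
    $$cs(s_i, s_j) = \frac{\sum_{t=1}^{m} cf(s_i)_t \, cf(s_j)_t}{\sqrt{\sum_{t=1}^{m} cf(s_i)_t^{2}}\;\sqrt{\sum_{t=1}^{m} cf(s_j)_t^{2}}}$$
    where m is the dimension of the service collaboration vector and $cf(s)_t$ is the t-th component of cf(s).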
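  6. The RESTful Web service clustering method incorporating service collaboration relationships according to claim 1, characterized in that Step 6 comprises:
    Step 61: Given the service representation vectors array_srv, the service collaboration vectors array_cs, and the hyperparameter λ that sets the weight ratio between functional similarity and collaboration similarity, initialize the cluster centers by selecting initial centers that are as far apart as possible;
    Step 62: According to the target number of clusters k, compute the distance between each service s and the cluster centers, find the nearest center, and assign the service to it;
    Step 63: Compute the mean service description vector and the mean service collaboration vector of each cluster as the new center;
    Step 64: Check whether the centers have changed; if so, adopt the new centers and repeat Steps 62-63 for a new round of clustering;
    Step 65: When the centers no longer change or the number of iterations reaches the configured maximum max-iter, stop and output the final clustering result.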
PCT/CN2021/130789 2021-01-19 2021-11-16 RESTful Web service clustering method incorporating service collaboration relationships WO2022156328A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110069758.0 2021-01-19
CN202110069758.0A CN112749281B (zh) 2021-01-19 2021-01-19 RESTful Web service clustering method incorporating service collaboration relationships

Publications (1)

Publication Number Publication Date
WO2022156328A1 true WO2022156328A1 (zh) 2022-07-28

Family

ID=75652508

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/130789 WO2022156328A1 (zh) 2021-01-19 2021-11-16 RESTful Web service clustering method incorporating service collaboration relationships

Country Status (2)

Country Link
CN (1) CN112749281B (zh)
WO (1) WO2022156328A1 (zh)

Also Published As

Publication number Publication date
CN112749281B (zh) 2023-04-07
CN112749281A (zh) 2021-05-04

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21920704

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21920704

Country of ref document: EP

Kind code of ref document: A1