CN110659363A

CN110659363A - Web service mixed evolution clustering method based on membrane computing

Info

Publication number: CN110659363A
Application number: CN201910692218.0A
Authority: CN
Inventors: 陆佳炜; 赵伟; 周焕; 马超治; 王小定; 徐俊; 肖刚
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2019-07-30
Filing date: 2019-07-30
Publication date: 2020-01-07
Anticipated expiration: 2039-07-30
Also published as: CN110659363B

Abstract

A Web service mixed evolution clustering method based on membrane computing comprises the following steps: the first step is as follows: formalized definition; the second step is that: calculating the service similarity; the third step: clustering services; fourthly, the data cell updates the global optimal object according to the operation rule; and fifthly, stopping and outputting, wherein each tissue cell in the system is used as an independent execution unit to perform evolutionary operation in a parallel structure, so that the system is distributed in parallel. The invention can better obtain the characteristics of the service field, can calculate the similarity more accurately and obtain a better clustering result.

Description

Web service mixed evolution clustering method based on membrane computing

Technical Field

The invention relates to a clustering problem of web services, in particular to a mashup service clustering and SOAP service clustering method in REST data services.

Technical Field

With the development of Web 2.0 technology, the number of services and their types on the internet are increasing, which provides the possibility of developing applications of the internet of things in an easier and faster manner, so that how to accurately and effectively discover required atomic services or service combinations becomes a problem. Service clustering techniques can effectively facilitate service discovery, and in recent years, many different types of service clustering methods have been proposed to cluster Mashup services, Web APIs, and Web services.

Currently, most existing methods perform a service clustering operation for SOAP services by calculating functional similarity between Web services using service function descriptions (WSDL documents), and Liu et al extracts four characteristics of Web services, content, context, hostname, and Web service name, from WSDL description text of the Web services for Web service clustering. Elgazzar et al analyzed WSDL documents and clustered them according to functional similarity, Yu and Rege also proposed a clustering method for improving service discovery using a service community learning algorithm, and ontology is also commonly used for semantic similarity calculation and matching between Web services to promote service clustering and discovery. For example, Pop et al have devised a metric to evaluate the degree of match between ontological concepts describing two semantic Web services and cluster them using an ant-based approach to achieve efficient service discovery. Nayak et al propose Web service discovery with additional semantics and clustering based on a clustering hierarchy clustering algorithm.

The existing Mashup service clustering method generally performs clustering by analyzing description texts of services, but does not comprehensively consider the mutual influence of word frequency and correlation in the description text information, the description text information of the services is limited, and other information of some services, such as service APIs (application programming interfaces), service labels and the like, has important embodiment in the function description of the services.

Gao et al propose a new graph theory-based service and Mashup recommendation method, which extracts a theme from the functional description of the service and models the relationship among a user, Mashup, the service and the theme into a quadrilateral graph to improve recommendation performance. Cao et al develops a two-layer topic model by inferring the relationship between Mashup services from Web API calls and labels, and merges the topic distribution of Mashup services at the network layer into the topic probability distribution of the original Mashup services at the content layer.

In addition, Pan W et al propose a novel Mashup service clustering method based on structural similarity and genetic algorithm, describe the relationship between Mashups and Web APIs through a dual-mode graph, apply a SimRank algorithm to quantify the structural similarity between each pair of Mashup services, and finally effectively cluster the Mashup services.

Disclosure of Invention

In order to solve the problem of SOAP service clustering in a web environment, the invention calculates term characteristic values by extracting hidden term information from WSDL documents and dividing the service information into two types, namely service self information and service context information, generates a special Bigraph hierarchical model according to the calculated characteristic values, and calculates the similarity of SOAP services through the Bigraph hierarchical model. For Mashup service, the invention adopts a service feature selection method based on domain perception to obtain a processed service description text, designs a multi-data source LDA topic model based on combining the processed description text, a service API and a label, wherein LDA (latent Dirichlet allocation) is a document topic generation model, also called a three-layer Bayesian probability model, comprising a word, a topic and a document three-layer structure, deduces the topic probability distribution of the service through the model, and calculates the similarity of the service. Meanwhile, a data set is preprocessed by combining a density-based k-means algorithm, an organization P system is utilized, and a Web service mixed evolution clustering method based on membrane calculation is provided by combining an Agnes algorithm based on hierarchical division, a Genetic Algorithm (GA) and a weighted Fuzzy Clustering (FCM) algorithm.

In order to solve the technical problems, the invention provides the following technical scheme:

a Web service mixed evolution clustering method based on membrane computing comprises the following steps:

the first step is as follows: formalized definition, the process is as follows:

1.1 mashup service definition

1.1.1 service document vector model: the preprocessed service document vector model is a four-tuple, RSM ═ RD, RT, RA, T >, where:

RD is a domain feature vector, representing service domain information, defining a service with m domains, and then RD ═ RD₁,RD₂,…,RD_m}；

RT is a service description text feature vector, and assuming that there are n service description texts in each domain, the description texts of m domains are represented as RT ═ RT₁₁,RT₁₂,…,RT_1n,…,RT_mn}；

RA is a service API feature vector;

t is a service label feature vector;

each service description text RT_ijThe characteristic word in (1) is expressed as FW_ijkWhere i represents a domain variable, j represents a description text variable, and k represents a feature word variable, the service description text RT_ijMay also be denoted as RT_ij＝{FW_ij1,FW_ij2,…,FW_ijsI is more than or equal to 1 and less than or equal to m, j is more than or equal to 1 and less than or equal to n, and s is the number of the characteristic words;

1.1.2 service document Cross-Domain concentration: the cross-domain concentration is denoted as D_depIt represents the inclusion service domain RD_iCharacteristic word FW in (1)_ijkService description document RT_ijAnd the proportion of the feature words in all domains of the service is calculated according to the following formula:

wherein df (FW)_ijk,RD_i) Representative service domain RD_iIn (1), containing a feature word FW_ijkDescription of (1)This RT_ijA number of

Representing the inclusion of feature words FW in all domains_ijkDescription text RT of_ijThe higher the cross-domain concentration ratio is, the higher the concentration ratio of the service document in the domain is, so that the method has stronger field representation;

1.1.3 feature word frequency cross-domain concentration: the cross-domain concentration is denoted as D_freIt stands for the feature word FW_ijkIn the service domain RD_iThe different frequency ratios occurring in all service domains are neutralized, and the calculation is as follows:

wherein tf (FW)_ijk,RD_i) Representing the service domain RD_iMiddle and characteristic word FW_ijkOf a quantity of

The number of the feature words appearing in all the service domains is represented, and similarly, the higher feature word frequency cross-domain concentration degree means that the feature words are concentrated to a higher degree in the service domains;

1.1.4 Domain representation of feature words: represents a characteristic word FW_ijkRepresenting a service domain RD_iThe degree of the word frequency is comprehensively calculated according to the cross-domain concentration degree of the service document and the cross-domain concentration degree of the feature word frequency, and the calculation is as follows according to a formula

D_final(FW_ijk,RD_i)＝a*D_dep(FW_ijk,RD_i)+β*D_fre(FW_ijk,RD_i)

Alpha and beta are weight coefficients, and alpha + beta is 1, the domain representation degrees of all the characteristic words in different service domains are obtained through the formula, and the higher the domain representation degree of the characteristic word is, the more the characteristic word can represent the service domain information; a series of typical characteristic words appear in a service domain, the domain representation degree of the words is high but the clustering effect of the service is general, a threshold value of the domain representation degree of the characteristic words is set, the characteristic words exceeding the threshold value are filtered, and the representation effect of the characteristic words on the service domain is improved;

1.1.5 field efficient feature word set: selecting proper feature word sets for representing all feature word sets in a service domain, sorting the feature words in descending order according to domain representation degree of the feature words, and selecting the feature words with the top percentage P in the service domain as the domain efficient feature word sets required by the invention, as shown in the following

HQ(RD_i)＝{FW_ij1，FW_ij2，…，FW_ijp，}

Wherein P is L P/100, if a description text RT is in the process of simplifying the characteristic words_ijCharacteristic word FW_ijkNot belonging to HQ (RD)_i) It is filtered and the service description document RT is updated_ij'；

1.2SOAP services terminology definition:

let TL be { T ═ T₁,T₂,…T_nIs a set of terms in the service corpus, n is the number of terms, a ═ a₁,a₂,…a_mIs an atomic vocabulary constituting the term TL, i.e. the vocabulary has not been subdivided, m corresponding to the number of all atomic vocabularies, defining the frequency of the term

Namely the term T_iThe frequency of occurrence is the same as the sum of the frequency of occurrence of all terms in the corpus TL, and the corresponding frequency of atomic vocabulary

And calculating the sum of the occurrence times of all the vocabularies, wherein the calculation formula is as follows:

Num_TLis TL number of all terms, Num_AAll are the sum of the occurrence times of the atomic vocabularies;

1.3, organization P System (P System) definition:

one degree of 3, i.e., 3, formalized by data cell organization P system is defined as the following octave:

ω＝(OB₁,OB₂,OB_3，OR₁,OR₂,OR₃,OR′,OE_o)

wherein:

OB₁、OB₂and OB₃Is a set of objects of each tissue cell, namely a data cell set;

OR₁、OR₂and OR₃Representing the clustering rules based on Agnes and k-means algorithm, weighted FCM algorithm and GA algorithm respectively for the evolution rules of each tissue cell;

OR' represents the transport rule of each tissue cell in the whole P system, and the sharing and exchange of objects can be carried out between cells through the transport rule;

OEo is the output area of the system, representing the environment;

the second step is that: service similarity calculation

Judging whether the service is the SOAP service, if the service is the SOAP service, jumping to the step 2.2, and if the service is the mashup service, performing the step 2.1;

2.1 mashup service similarity calculation, the process is as follows:

2.1.1 service Pre-processing

The method comprises the following steps of preprocessing crawled service information, namely a service domain, a description text, an API (application program interface) and a label, extracting accurate and effective characteristic words in the service information, constructing a service description document with more accurate description, and improving service clustering precision, wherein the preprocessing steps are as follows:

2.1.1.1 constructing an initial feature vector, and segmenting a statement and extracting effective words by using a natural language processing package NLTK;

2.1.1.2 remove invalid words such as symbols (+, -, _ etc.) and prepositions (a, the, of, and etc.) that are not useful in characterizing a service, preserving names, verbs and adjectives that may characterize the service's characteristics;

2.1.1.3 merging and processing word stems, wherein some words with the same word stem often have similar meanings, for example, the characteristics of use, used and using expression are the same, and the root of a word with the same meaning is deleted and reserved;

2.1.2 service feature reduction Process

Different services have unique field characteristics, the importance of characteristic words in the same domain is related to the frequency and the domain correlation, when the service characteristic value weight is calculated, only one factor is considered in the traditional TF-IDF calculation method or mutual information method, and the description text of the service is subjected to characteristic simplification processing by comprehensively considering the word frequency and the correlation factor, and the steps are as follows:

2.1.2.1 traverse each service domain RD_iEach of the description texts RT_ijEach feature word FW_ijkCalculating the feature word FW according to the formula in 1.1.4_ijkRepresenting a service domain RD_iDegree D of_final(FW_ijk，RD_i) If D is_final(FW_ijk，RD_i) If R is less than R, deleting the characteristic value, wherein R is a threshold value of the domain representation degree of the characteristic value;

2.1.2.2 all service domains after completing 2.1.2.1 steps, according to D_final(FW_ijk，RD_i) Value of (D), to RD_iCharacteristic word FW_ijkSorting in descending order, and selecting the characteristic words of the top percentage P as the service domain RD according to the step 1.5_iDomain efficient feature word set HQ (RD)_i)；

2.1.2.3 repeat step 2.1.2.2 until a domain efficient feature word set HQ (RD) is generated for all service domains_i) Each service domain RD_iAccording to HQ (RD)_i) Deleting all absent HQ (RD)_i) The feature words of (1);

2.1.3 topic clustering model construction:

after the characteristics of the RSM are simplified, obtaining a new service document vector model RSM ═ RD ', RT ', RA ', T >, the method comprises the following steps of simplifying a service description text feature vector RT', a Web API feature vector RA and a label feature vector T, wherein an MD-LDA model is based on hidden Dirichlet distribution (LDA), is a topic model (topicmodel), can give the topic of each document in a document set according to the form of probability distribution, and fuses various data source features of the service, in the MD-LDA model, the relevant word selection method in the service API and the label is consistent with that in the service document description document RT, and each service API or service label has unique contribution to the theme distribution of the document;

therefore, there is a topic distribution, the alpha hyper-parameter of dirichlet corresponds to RA and T in RSM', the beta hyper-parameter of dirichlet corresponds to word distribution in each topic, then a topic is extracted from the topic distribution according to the selected RA or T, and a specific word is generated through the selected topic, thereby generating an MD-LDA model fusing the service description text, the service API and the service label, the generation process is as follows:

2.1.3.1 for RA or T in RSM', the variable dat is defined_dWherein dat is_d1,2,3, N being the total number of RA and T in RSM', one polynomial θ is selected_datObeying alpha hyper-parameter distribution of Dirichlet;

2.1.3.2 for topic K in RSM', K is 1,2

Obeying beta hyper-parameter distribution of Dirichlet;

2.1.3.3 setting variable D as the document tag in the reduced service document RSM', D being 1,2_dRepresents RA or T in each RSM', for each word w in d_diI 1,2,3, M is the total number of words in d;

extract a Web API or tag denoted as x_diWhere obedience is uniformly distributed is denoted as Uniform (dat)_d)；

Extracting a subject as z_diIts obedient polynomial distribution is noted

Distributing;

extracting a word and recording the word as w_diSubject to

Distributing;

each topic in the corresponding MD-LDA probability model

The above word distribution is related, the extraction of words is independent of the dirichlet parameter β, x denotes the tag set dat from the API_dSelecting RA or label T related to given single word, each RA or T is distributed with theta on a subject, theta is selected from Dirichlet parameter alpha, and the subject z is formed by combining the subject distribution of RA and T and the word distribution of the subject_di，

And extracting the word w from the selected topic_di；

As can be seen from the above description, the posterior distribution of the model topic depends on RT ', RA and T in RSM', and the parameters of the MD-LDA model are set as follows:

θ_dat|α～Dirichlet(α)

x_di|dat_d～Uniform(dat_d)

2.1.3.4 parameter reasoning is carried out on the MD-LDA model by a Gibbs sampling method, the sampling method provides a simple and effective method for potential variable estimation, which belongs to a Markov chain Monte Carlo algorithm for obtaining a random sample sequence by multivariate probability distribution, and each step of the Gibbs sampling method follows the following formula distribution;

P(z_di＝j，x_di＝k|w_i＝m，z_{_di}，x_{_di}，dat_d)

wherein z is_{_di}Indicating a processed word w_diThen assign, x, to each word topic_{_di}Indicating a processed word w_diThen assign each word API or label, n_zwRepresenting the total number of words w, m, assigned to the topic z_xzRepresenting the total number of words in the Web API and tags assigned to topic z,

as a subject z_diThe alpha parameter of (a) is,

as a word w_diBeta parameter of (a), number of subjects V, alpha_v，β_vFor alpha parameter and beta parameter of the v-th topic, the word distribution of the topic in the sampling process

The theme distribution theta of the API and the labels is required to be calculated through the following formula;

z_diand x_diBy determining z_{_di}And x_{_di}To sample and decide, for eachRSM', the present invention summarizes all θ_xCalculating the topic distribution of the document d, wherein x belongs to dat_dTo obtain the final topic probability distribution of all RSMs';

2.1.4 similarity calculation

In fact, the theme distribution of the mashup service document is mapped to the text vector space, so that two service documents RSM can be calculated through corresponding theme probability distribution₁' and RSM₂' similarity, the topic in this model is the mixed distribution of word vectors, so the relative entropy (KL) distance can be used as the similarity measure, and the following can be calculated:

t stands for all common topics in two service documents, p_jAnd q is_jRespectively representing the distribution of topics in two documents, when p_j＝q_jAt that time, KL distance calculation result D_KL(RSM′₁，RSM′₂) Is 0, since the KL distance is not of a symmetrical nature, i.e. D_KL(RSM′₁，RSM′₂)≠D_KL(RSM′₂，RSM′₁) So a symmetric version thereof is usually used, the calculation formula is as follows:

D_KL(RSM′₁，RSM′₂)＝λD_KL(RSM′₁，λ*RSM′₁+(1-λ)RSM′₂)+(1-λ)D_KL(RSM′₂，λ*RSM′₁+(1-λ)RSM′₂)

therefore, if λ is equal to 0.5, the above formula is converted into a JS distance, which is also called JS divergence (Jensen-Shannon divergence) and is a variation of the KL distance, and the similarity of the text is calculated by using the JS distance as a standard and is used as the similarity of the service, and the final calculation formula is as follows:

2.2 SOAP similarity calculation, the process is as follows:

2.2.1 calculation of self-eigenvalues

Finding a term T in a service corpus_iThe information quantity I (P) can be calculated by an information theory method_i) On this basis, the term T can be used_iCharacteristic value of (Spe) (T)_i) Assigned as follows

Spe(T_i)＝I(P_i)

By computing a joint probability distribution P { P }_i，q_jIs calculated for the term feature value, where p_iE is P and q_j∈Q，p_iIs to select a word from the term set TL and q_jIs to take a word from the atomic vocabulary A, where { p }₁，p₂，...p_nAnd q₁，q₂，...，q_mAre respectively represented by random variables P, Q, P_iAnd q is_jThe mutual information calculation of (a) can be calculated by the following formula:

the term pi has a characteristic value denoted as I (p)_iQ), representing the relationship between pi terms and the vocabulary library Q, the formula for calculating pi feature values in combination with the frequency of terms and vocabularies in the corpus is as follows:

Spe(T_i)≈I(p_i，Q)

according to the Bayes' theorem,

the final self-information characteristic value SelfSpe (Ti) of the SOAP service is calculated as follows

Analyzing conventional WSDL documents generally includes 1 to 2 words, so

The vocabulary in the representative terms is approximately set to be 1 for calculation, theta represents a weighted value, the weighted value is set based on an information theory method, and the value range is 0 to 1;

2.2.2 contextual information eigenvalue calculation

According to the information theory approach, the context information characteristic of a service is based on the entropy of the modified term word probability distribution, for which its entropy value is calculated by the following formula:

wherein NT represents the term T_iModified quantity of (2), (mod)_m,T_i) Represents mod_mModifying the term T_iThe entropy value of (d) is determined by all (mod)_m,T_i) For the calculation of average information amount, in a specific field, the modifiers of the terms are distributed more closely, so that the entropy value of the terms in a specific field is lower, and the term T can be calculated through the entropy value_iContext information characteristic value ContextSpe (T)_i) The following were used:

wherein j is more than or equal to 1 and less than or equal to K, K is the sum of the number of modifiers with the same definition,

represents each modifier;

2.2.3 hybrid eigenvalue calculation

The self characteristic value and the context information characteristic calculated by the steps 2.2.1 and 2.2.2 can cover the characteristic of the descriptive word and the information which cannot be described by the word, and finally the mixed characteristic value is obtained by the following formula:

the value of the mixing coefficient alpha is between 0 and 1, the mixing coefficient alpha is set to be 0.65 according to experiments, and the values of the self characteristic value, the context characteristic value and the mixing characteristic value of the service are all between 0 and 1 through normalization processing;

2.2.4 Domain weight calculation, the process is as follows:

2.2.4.1 Domain weight calculation

In the process of Bigraph structure generation, a weight based on a domain characteristic value is required, the weight is embodied by terms of the same level, the larger the definition structure similarity is, the larger the weight of terms of the same level is, and the specific calculation method is as follows:

wherein,is a new term T_nSet of sibling terms of (1), hybrid spe (T)_s) And hybrid Spe (T)_n) Respectively representing the characteristic value of each sibling term and the new term, and directly defining the weight value of 0.5G when the newly added term has no sibling term_iFor current Bigraph structures, the bipraph (Bigraph) is a bigram, B ═<BP,BL>The method is proposed by Milner of the lottery winning device, BP and BL are respectively a position graph (place graph) and a connection graph (link graph), BP is a triple BP ═<V,E,P>The node set V, the edge set E and the interface P of the graph form a nested node, the nested node is in a parent-child relationship in the position graph, the branch relationship represents the embedding between the nodes, the BL is also a triple formed by the node set V, the edge set E and the interface P of the graph like the BP, and the BL is used for representing the connection relationship between the nodes;

2.2.4.2 term weight value calculation

The similarity of terms is calculated by comparing the word similarity of two terms, as follows:

wherein,

and

respectively represent in the term T_iAnd T_nThe number of constituent words in (a),

representing the number of the same words in the two terms, defining that a new term comprises more related sub-structure similar terms and the weight is higher, and obtaining the term weight value according to the similarity of the terms, wherein the calculation formula is as follows:

where NP is the total set of superior, sibling, and subordinate terms of the term, T_iRepresents one of these terms;

2.2.5 Bigraph hierarchy of generative terms

Constructing a Bigraph hierarchical structure of different terms, similar to a Bigraph position graph, wherein each node of the Bigraph represents a term object, the value of the node represents the characteristic value of the term object, and the Bigraph hierarchical structure is constructed from top to bottom by the following steps:

2.2.5.1, calculating the mixed characteristic value of the terms in the WSDL document and extracted from Google according to the formula in 2.2.3, putting the mixed characteristic value into an array A, and selecting the previous 3 term objects as three nodes of a Bigraph to form an initial Bigraph structure T according to ascending order;

2.2.5.2 for the remaining term T in array A_nAdded to the existing Bigraph hierarchy if T_xSatisfy (hybrid Spec (T)_n)-0.3＜HybridSpe(T_x)＜HybridSpe(T_n) +0.3, then T_xMarked as target node, T_xDetermining T for the existing Bigraph hierarchy terminology through the target nodes_nDetermining the position of the target substructure, thereby determining a candidate Bigraph structure;

2.2.5.3 by considering new terms anddomain weight W of candidate Bigraph structures_DS(G_i) And term weight W_TS(G_i) Calculating to obtain final node weight through the following formula, thereby finding out the optimal Bigraph structure;

W_f(G_i)＝ωW_DS(G_i)+(1-ω)W_TS(G_i)

where ω is a coefficient, ranging from 0 to 1, by iteratively running 2.2.5.2-2.2.5.3 until all terms are added to the Bigraph hierarchy;

2.2.6 constructing a similarity matrix:

the similarity is calculated using the following formula:

where D represents the maximum number of layers of the Bigraph hierarchy constructed by the term, dis (T)₁，T₂) Stands for two terms T₁，T₂Calculating the similarity of each characteristic of the SOAP service according to the shortest distance in the Bigraph hierarchical structure, namely the similarity of the SOAP service on a certain characteristic, taking the sum of the characteristic similarities as the similarity of the service, and constructing the similarity relation between the services into a similarity matrix;

the third step: service clustering

The selection of the cluster center point needs to calculate the value of the integral cluster variance for the points in the data set, but a plurality of non-alternative points exist in the data set, and data noise points and isolated points of edges exist, and the points not only influence the selection of the cluster center, but also can additionally increase the calculation cost, and simultaneously need to manually pre-specify the number of data clusters; in view of the above disadvantages, a density-based K-means algorithm is proposed to be improved, by calculating the density number of each point, extracting a data point with a high density number as a cluster center, and by the improved K-means algorithm, pre-processing clustering is performed on an initial data set S to be clustered, wherein S is composed of M data points with a dimension d, and the point density calculation of the density-based K-means algorithm is as follows:

wherein sensitivity (S)_i) Is represented as S_iThe total number of points in the range of R, and the distance sim (S)_i，S_j) Adopt as service S_iAnd S_jThe similarity of (2);

for this purpose, the clustering process based on the density K-means algorithm is as follows:

3.1 preprocessing the data by adopting a density-based K-means algorithm and calculating different data S_iThe distance between the two clusters is divided into different clusters according to the radius range R, and the Density, namely Density (S), is selected to be the highest_i) The highest K S_iAs a cluster center, clustering the data by similarity, the process is as follows:

3.1.1 calculation of Each data S_iDistance of each data cluster center within the tissue object Q, validation S_iSorting the data sets based on density according to the number of points in each data cluster;

3.1.2 select the first K S points with the highest density, i.e. the largest number of R range points_kAs a new data cluster center C_k；

3.1.3 obtaining each S according to the distance between the divided different clusters_iAnd C_kSimilarity sim (S) of_i，C_k) According to the average similarity Avesim, if sim (C)_k，S_i)>Avesim, then S_iPartitioning into data clusters C_kFinally obtaining N data clusters;

3.2 tissue cell O1 evolutionary rule

O₁An Agnes algorithm is adopted as an evolution rule to guide and complete the evolution of objects in cells, N initial clusters obtained through a density k-means algorithm are combined through the Agnes algorithm according to a set inter-cluster similarity threshold Cs, and the process is as follows:

3.2.1 clusters C from any two data_i，C_jAverage similarity dis (C) of inner data_i，C_j) Constructing a similarity matrix D

Wherein S_XAs a data cluster C_iData point of (1), S_YAs a data cluster C_jThe data points in (1), U and V are C_i，C_jThe number of data points in;

3.2.2 choosing dis (C)_i，C_j) Largest data cluster C_i，C_jAccording to the threshold value of similarity between clusters Cs, if dis (C)_i，C_j) Cs then cluster the data C_iAnd C_jMerging;

3.2.3 repeating step 3.2.2 until all data clusters meet the similarity threshold requirement;

3.3 histiocyte O₂Rules of evolution

O₂The FCM algorithm based on sample weighting is adopted as an evolution rule to guide and complete the evolution of objects in cells, the difference of samples is not considered in the target function and cluster center calculation of the traditional FCM algorithm, all samples are treated in a same view, but the defect that the influence of isolated points or noise data in a data set is easily expanded exists, so that the contribution of some important samples to clustering is reduced, and the clustering precision is reduced; in order to reduce the influence of sample difference on the clustering effect, the invention provides a sample weighting-based FCM clustering algorithm, which improves the clustering effect by reasonably weighting a target function and a clustering center function;

for data set S ═ S₁,s₂,…,s_n}，

3.3.1 calculating FCM membership according to the following formula:

wherein u is_ijRepresents a membership value of the ith data belonging to the jth cluster,i.e. dividing the ith data into the data cluster j, | s with the maximum membership_i-t_j| is data s_iTo the cluster center t_jN is the number of data, it can be found that the sum of all the data membership degrees is 1, namely, the sum satisfies

j＝1,2,…,n；

3.3.2 computing weight and entropy information

The entropy of thermodynamics represents the chaos degree of information, the invention effectively analyzes the membership degree of data based on entropy definition, and carries out sample weighting on an FCM target function, firstly, an entropy variable E is defined_iRepresenting degree of membership u_ijAnd by calculating the weight w_iMeasurement data s_iFor the degree of influence of this secondary clustering, they are calculated as follows:

3.3.3 according to E_i,w_iCalculating a new objective function

Weight coefficient w_iSatisfy the requirement of

The target function F (S, t) of the newly defined FCM is formulated as follows

m is a weighting index and is an integer greater than or equal to 1, and in order to solve the extreme value of the objective function under the constraint condition, a Lagrange multiplier method is used for constructing a new objective function as follows:

the optimization conditions for extremizing the objective function are as follows:

calculating a new cluster center t_jComprises the following steps:

updating membership u_ijDividing the ith data into data clusters Cj with the maximum membership degree

3.3.4 if | F (S, t)_i-1-F(S,t)_iIf | is greater than the set threshold value, repeating the step 3.3.3, otherwise, ending the algorithm and outputting a result F (S, t)_iRepresenting the FCM objective function value obtained by the ith iteration;

3.4 tissue cell O3 evolutionary rule

O₃The method adopts three genetic operations of selection, crossing and variation of a Genetic Algorithm (GA) as an evolution rule to guide and complete the evolution of each object in cells, and the evolution steps are as follows:

3.4.1 O₃combining m objects in the self cells and objects transferred by other two histiocytes into a new object evolutionary pool P;

3.4.2 O₃selecting, crossing and mutating the new object evolution pool P, wherein the selecting operation is carried out by adopting an optimal storage strategy, the crossing and mutating operations adopt integer form crossing and single point mutation,the specific method comprises the following steps:

3.4.2.1 calculates an evaluation value p of each object k_kN is the number of data clusters, t_iIs the center of the ith data cluster, p_mThe smaller the classification method is, the more appropriate the classification method is, the easier the object is to be inherited to the next generation;

3.4.2.2 define the fitness function fitness of each object k_k

fitness_k＝α(1-α)^index-1

Wherein alpha is the value range of the set parameter from 0 to 1, and index is the iteration number;

3.4.2.3 selecting operation according to the ratio of object fitness

Wherein u is the total number of objects in the object pool, and for each object, a random number p is generated cyclically and randomly, if p is<Cif_kThen the object is inherited to the next generation;

3.4.2.4, determining the crossing position in the crossing operation by the crossing probability Pc, randomly selecting two objects from the evolution pool to carry out the crossing operation, traversing each component of the object, if the random number p generated by the cycle is less than Pc, exchanging the components of the two objects after the position at the position, and ending the traversal;

3.4.2.5 define the mutation probability P_mSetting a random probability P for each object, and if the probability P is less than the mutation probability P_mIf z is according to the mutation probability P_mThe determined variance point (i.e., a component) of the object has a variance value of z_θThe mutated object is represented as:

wherein, delta belongs to [0,1], is a random number generated randomly, and the sign of + and-appears according to probability;

3.4.3 repeating steps 3.4.1-3.4.2, O to keep the scale of the objects in the evolutionary pool stable₃Screening the evolved objects, eliminating the objects according to the fitness of the objects, and reserving m objects with the highest fitness to form an object evolution pool P';

the fourth step: the data cell updates the global optimal object according to the operation rule

The invention defines a transport rule in a designed tissue P system to guide the information exchange between tissue cells, wherein the rule is as follows:

(x，T₁，T₂，...T_m，/T′₁，T′₂，...T′_m，y)，x≠y，x，y＝1，2，3.

this transport rule indicates that histiocyte x and histiocyte y can carry out object transport in both directions, where T₁，T₂，...T_mFor the m objects of tissue cell x, like T₁’，T₂’，...T_m' m objects of histiocyte y; the following effects can be achieved by the transfer rule:

4.1) m subjects T in tissue cell x₁，T₂，...T_mIs transported into the tissue cell y;

4.2) m subjects T in histiocyte y₁’，T₂’，...T_m' is transported into tissue cells x;

(x，T_xbest/T_best，OE_o)，x≠y，x，y＝1，2，3.

this transport rule represents the transport of histiocyte x and the systemic environment, where T_xbestFor the current calculation of the locally optimal object in the tissue cell x, T_bestFor the global optimal object in the current environment, the optimal object in the tissue cell x is transported to the environment by the transport rule, and the environment is updated at the same timeA global optimal object;

the fifth step: shutdown and output

The method comprises the steps of defining a series of calculation steps as a calculation, starting from the histiocytes containing an initial data cell object set, in each calculation, meaning that one or more evolution rules are acted on the current data cell object set, automatically stopping the system when a stopping constraint condition of the system is reached, and presenting the calculation result in the external environment of the system.

In order to reduce the complexity of the system, a simple shutdown condition based on maximum execution calculation is adopted, specifically, the system is stopped when the organization P executes to the set maximum calculation number, and a global optimal object set in the current environment is output

The invention has the beneficial effects that: the method obtains the description text of the processed mashup service by adopting a domain-awareness-based service feature selection method, forms a service description document by using high-efficiency information required by a service feature simplification algorithm extraction method, can effectively remove useless information in the description document, and can better obtain the feature of the service field by considering the importance, the frequency and the domain correlation of feature words in the same domain when simplifying the feature words compared with the traditional TF-IDF calculation method or mutual information method.

For SOAP services, a special Bigraph hierarchical model is generated by extracting hidden term information from a WSDL document, and service information is divided into two types, namely service self information and service context information, by means of a composition word based on the special Bigraph hierarchical model, so that a new term feature value calculation method is introduced. Most terms are composite terms with a set of modifiers, self-information is important to represent a set of internal features in a domain corpus. The context information helps to make up for the deficiency of the information of the service itself. The final feature value is calculated by a combination of the self information and the context information. The similarity can be calculated more accurately.

Meanwhile, by using the k-means based on density and the P-based organization system, the advantages of three clustering algorithms can be effectively combined by taking a hierarchical Agnes algorithm, a Genetic Algorithm (GA) algorithm and a weighted Fuzzy Clustering (FCM) algorithm as evolution rules, so that a better clustering result is obtained.

Detailed Description

The present invention is further explained below.

the first step is as follows: formalized definition, the process is as follows:

1.1 mashup service definition

RA is a service API feature vector;

t is a service label feature vector;

wherein df (FW)_ijk，RD_i) Representative service domain RD_iIn (1), containing a feature word FW_ijkDescription text RT of_ijA number of

wherein tf (FW)_ijk，RD_i) Representing the service domain RD_iMiddle and characteristic word FW_ijkOf a quantity of

D_final(FW_ijk，RD_i)＝α*D_dep(FW_ijk，RD_i)+β*D_fre(FW_ijk，RD_i)

α and β are weighting coefficients, and α + β is 1, the domain representation degrees of all feature words in different service domains can be obtained through the above formula, the higher the domain representation degree of a feature word is, the more the feature word can represent the service domain information, it needs to be noted that a series of typical feature words appear in one service domain, the domain representation degrees of these words are very high but the clustering effect of the service is general, a threshold value of the domain representation degree of a feature word is set, the feature words exceeding the threshold value are filtered, and the representation effect of the feature words on the service domain is improved;

HQ(RD_i)＝{FW_ij1，FW_ij2，...，FW_ijp，}

Wherein P is L P/100, if a description text RT is in the process of simplifying the characteristic words_ijCharacteristic word FW_ijkNot belonging to HQ (RD)_i) It is filtered and the service description document RT is updated_ij′；

1.2SOAP service term definition:

let TL be { T ═ T₁，T₂，...T_nIs a set of terms in the service corpus, n is the number of terms, a ═ a₁，a₂，...a_mIs an atomic vocabulary constituting the term TL, i.e. the vocabulary has not been subdivided, m corresponding to the number of all atomic vocabularies, defining the frequency of the term

Num_TLnumber of all terms for TL, Num_AAll are the sum of the occurrence times of the atomic vocabularies;

1.3, organization P System (P System) definition:

ω＝(OB₁，OB₂，OB₃，OR₁，OR₂，OR₃，OR′，OE_o)

wherein:

OEo is the output area of the system, representing the environment;

the second step is that: service similarity calculation

2.1 mashup service similarity calculation

2.1.1 service Pre-processing

2.1.1.2 remove invalid words such as symbols (+, -, etc.) and prepositions (a, the, of, and etc.), which are useless in characterizing the service, keeping names, verbs and adjectives that can characterize the service's characteristics;

2.1.2 service feature reduction Process

2.1.2.3 repeat step 2.1.2.2 until a domain efficient feature word set HQ (RD) is generated for all service domains_i) Each service domain RD_iAccording to HQ (RD)_i) Deleting all absent HQ (RD)_i) Is characterized byA word;

2.1.3 topic clustering model construction:

after the characteristic simplification of the RSM is completed, a new service document vector model RSM ' < RD ', RT ', RA ', T ' >, is obtained, the invention constructs an extended LDA topic model based on a plurality of data sources, which is marked as MD-LDA, the method comprises the following steps of simplifying a service description text feature vector RT', a Web API feature vector RA and a label feature vector T, wherein an MD-LDA model is based on hidden Dirichlet distribution (LDA), is a topic model (topic model), can give the topic of each document in a document set according to the form of probability distribution, and fuses various data source features of the service, in the MD-LDA model, the relevant word selection method in the service API and the label is consistent with that in the service document description document RT, and each service API or service label has unique contribution to the theme distribution of the document;

2.1.3.2 for topic K in RSM', K is 1,2

Obeying beta hyper-parameter distribution of Dirichlet;

Extracting a subject as z_diIts obedient polynomial distribution is notedDistributing;

extracting a word and recording the word as w_diSubject toDistributing;

each topic in the corresponding MD-LDA probability model

And extracting the word w from the selected topic_di；

θ_dat|α～Dirichlet(α)

x_di|dat_d～Uniform(dat_d)

2.1.3.4 parameter reasoning for MD-LDA model by Gibbs sampling method, which provides a simple and effective method for potential variable estimation, belonging to a Markov chain Monte Carlo algorithm for obtaining random sample sequence by multivariate probability distribution, wherein each step of Gibbs sampling method follows the following formula distribution:

P(z_di＝j，x_di＝k|w_i＝m，z_{_di}，x_{_di}，dat_d)

as a subject z_diThe alpha parameter of (a) is,

as a word w_diBeta parameter of (a), number of subjects V, alpha_v，β_vFor alpha parameter and beta parameter of the u-th topic, the word distribution of the topic during sampling

z_diand x_diBy determining z_{_di}And x_{_di}To sample decisions, for each RSM ', the invention summarizes all θ' s_xCalculate the topic distribution of document d, where χ ∈ dar_dTo obtain the final topic probability distribution of all RSMs';

2.1.4 similarity calculation

In fact, the theme distribution of the mashup service document is mapped to the text vector space, so that two service documents RSM can be calculated through corresponding theme probability distribution₁' and RSM₂The similarity of' the topic in the model is the mixed distribution of word vectors, so the relative entropy (KL) distance can be used as the similarity measurement standard, and the specific calculation is as follows;

2.2 SOAP similarity calculation

2.2.1 calculation of self-eigenvalues

Spe(T_i)＝I(P_i)

By computing a joint probability distribution P { P }_i，q_jIs calculated for the term feature value, where p_iE is P and q_j∈Q，p_iIs to select a word from the term set TL and q_jIs to take a word from the atomic vocabulary A, where { p }₁，p₂，...p_nAnd q₁，q₂，...，q_mAre respectively represented by random variables P, Q, P_iAnd q is_jThe mutual information calculation of (a) is calculated by the following formula:

Spe(T_i)≈I(P_i，Q)

according to the Bayes' theorem,

Analyzing conventional WSDL documents generally includes 1 to 2 words, so

2.2.2 contextual information eigenvalue calculation

represents each modifier;

2.2.3 hybrid eigenvalue calculation

2.2.4 Domain weight calculation:

2.2.4.1 Domain weight calculation

In the process of Bigraph structure generation, a weight based on a domain characteristic value is required, the weight is embodied by terms of the same level, the larger the definition structure similarity is, the larger the weight of terms of the same level is, and the calculation method is as follows:

wherein,

is a new term T_nSet of sibling terms of (1), hybrid spe (T)_s) And hybrid Spe (T)_n) Respectively representing the characteristic value of each sibling term and the new term, and directly defining the weight value as 0.5, G, if the newly added term has no sibling term_iFor current Bigraph structures, the bipraph (Bigraph) is a bigram, B ═<BP,BL>The method is proposed by Milner of the lottery winning device, BP and BL are respectively a position graph (place graph) and a connection graph (link graph), BP is a triple BP ═<V,E,P>The node set V, the edge set E and the interface P of the graph form a nested node, the nested node is in a parent-child relationship in the position graph, the branch relationship represents the embedding between the nodes, the BL is also a triple formed by the node set V, the edge set E and the interface P of the graph like the BP, and the BL is used for representing the connection relationship between the nodes;

2.2.4.2 term weight value calculation

wherein,and

respectively represent in the term T_iAnd T_nThe number of constituent words in (a),representing the number of the same words in the two terms, defining that a new term comprises more related sub-structure similar terms and the weight is higher, and obtaining the term weight value according to the similarity of the terms, wherein the calculation formula is as follows:

2.2.5 Bigraph hierarchy of generative terms

A Bigraph hierarchy construction algorithm of terms is provided, a Bigraph hierarchy structure of different terms is constructed, a position graph similar to the Bigraph is constructed, each node of the Bigraph represents a term object, the value of the node represents the characteristic value of the term object, and the Bigraph hierarchy structure is constructed from top to bottom, and the method comprises the following steps:

2.2.5.1, calculating the mixed characteristic value of the terms in the WSDL document and extracted from Google according to the formula in 2.2.3, putting the mixed characteristic value into an array A, and selecting the first 3 term objects as three nodes of a Bigraph to form an initial Bigraph structure T according to ascending order;

2.2.5.2 for the remaining term T in array A_nAdded to the existing Bigraph hierarchy if T_xSatisfy (hybrid Spec (T)_n)-0.3＜HybridSpe(T_x)＜HybridSpe(T_n) +0.3, then T_xMarked as target node, T_xFor the existing Bigraph hierarchy terminology, T is determined by these target nodes_nDetermining the position of the target substructure, thereby determining a candidate Bigraph structure;

2.2.5.3 by comprehensively considering the new term and the domain weight W of the candidate Bigraph structure_DS(G_i) And term weight W_TS(G_i) Calculating to obtain final node weight through the following formula, thereby finding out the optimal Bigraph structure;

W_f(G_i)＝ωW_DS(G_i)+(1-ω)W_TS(G_i)

2.2.6 constructing a similarity matrix:

the similarity is calculated using the following formula:

the third step: service clustering

The selection of the clustering center point needs to calculate the value of the integral cluster variance for the points in the data set, but a plurality of non-alternative points exist in the data set, data noise points and isolated points of edges exist, the points not only can influence the selection of the cluster center, but also can additionally increase the calculation cost, and meanwhile, the number of data clusters needs to be artificially pre-specified, and in view of the defects, a density-based K-means algorithm is provided for improvement, the density number of each point is calculated, data points with high density are extracted as the cluster center, the initial data set S to be clustered is subjected to pre-processing clustering through the improved K-means algorithm, the S is composed of M data points with dimensionality d, and the point density calculation of the density-based K-means algorithm is as follows:

3.1.3 obtaining each S according to the distance between the divided different clusters_iAnd C_kSimilarity sim (S) of_i，C_k) According to the average similarity Avesim, if sim (C)_k，S_i) If Avesim is greater, then S is_iPartitioning into data clusters C_kFinally obtaining N data clusters;

3.2 tissue cell O1 evolutionary rule

O₁An Agnes algorithm is adopted as an evolution rule to guide and complete the evolution of objects in cells, N initial clusters obtained by a density k-means algorithm are combined through the Agnes algorithm according to a set inter-cluster similarity threshold Cs, and the process is as follows:

3.3 histiocyte O₂Rules of evolution

O₂The FCM algorithm based on sample weighting is adopted as an evolution rule to guide and complete the evolution of objects in cells, the difference of samples is not considered in the target function and cluster center calculation of the traditional FCM algorithm, all samples are treated in a same view, but the defect that the influence of isolated points or noise data in a data set is easily expanded exists, so that the contribution of some important samples to clustering is reduced, and the clustering precision is reduced; in order to reduce the influence of sample difference on the clustering effect, an FCM clustering algorithm based on sample weighting is provided, and the clustering effect is improved by reasonably weighting a target function and a clustering center function;

for data set S ═ S₁，S₂，...，s_n}，

3.3.1 calculating FCM membership according to the following formula:

wherein u is_ijRepresenting the membership value of the ith data belonging to the jth cluster, i.e. the ith data is divided into the data cluster j, | s with the maximum membership_i-t_j| is data s_iTo the cluster center t_jN is the number of data, it can be found that the sum of all the data membership degrees is 1, namely, the sum satisfies

3.3.2 computing weight and entropy information

3.3.3 according to E_i，w_iCalculating a new objective function

Weight coefficient w_iSatisfy the requirement of

The target function F (S, t) of the newly defined FCM is formulated as follows

calculating a new cluster center t_jComprises the following steps:

3.3.4 if | F (S, t)_i-1-F(S，t)_iIf | is greater than the set threshold value, repeating the step 3.3.3, otherwise, ending the algorithm and outputting a result F (S, t)_iRepresenting the FCM objective function value obtained by the ith iteration;

3.4 tissue cell O3 evolutionary rule

3.4.1 O₃m objects in self cell and the cellThe objects transferred by the two histiocytes are combined into a new object evolution pool P;

3.4.2 O₃and performing selection, crossing and mutation operations on the new object evolution pool P, wherein the selection operation is performed by adopting an optimal storage strategy, and the crossing and mutation operations adopt integer form crossing and single point mutation, and the method comprises the following steps:

3.4.2.2 define the fitness function fitness of each object k_k

fitness_k＝α(1-α)^index-1

3.4.2.3 selecting operation according to the ratio of object fitness

3.4.2.4 the crossing position in the crossing operation is determined by the crossing probability Pc, two objects are randomly selected from the evolution pool to carry out the crossing operation, each component of the object is traversed, if the random number p is generated in a circulating way, if p is less than Pc, the components of the two objects behind the position are exchanged at the position, and the traversal is finished;

3.4.2.5 define the mutation probability P_mSetting a random probability P for each object, and if the probability P is less than the mutation probability P_mIf z is according to the mutation probability P_mThe determined change point of the object (i.e. a certain component),the value after mutation is z_θThe mutated object is represented as:

Transport channels exist among cell membranes of tissue cells in the system, different objects are shared and exchanged among different tissue cells, transport rules defined by the system are required to support, and transport rules are defined in a designed tissue P system to guide the information exchange among the tissue cells, wherein the rules are as follows:

(x，T₁，T₂，…T_m，/T′₁，T′₂，…T′_m，y)，x≠y，x，y＝1，2，3

this transport rule indicates that histiocyte x and histiocyte y can carry out object transport in both directions, where T₁,T₂,…T_mFor the m objects of tissue cell x, like T₁’,T₂’,…T_m' is the m objects of the tissue cell y, by which the following effects can be achieved:

4.1) m subjects T in tissue cell x₁,T₂,…T_mIs transported into the tissue cell y;

4.2) m subjects T in histiocyte y₁’,T₂’,…T_m' is transported into tissue cells x;

(x，T_xbest/T_best，OE_o)，x≠y，x，y＝1，2，3

the transport rule represents histiocytes x andthe system environment is in transit, wherein T_xbestFor the current calculation of the locally optimal object in the tissue cell x, T_bestFor the global optimal object in the current environment, the optimal object in the tissue cell x is transferred into the environment through the transfer rule, and the global optimal object of the environment is updated at the same time;

the fifth step: shutdown and output

In the system, a series of calculation steps are defined as a calculation, starting from the histiocytes containing an initial data cell object set, in each calculation, one or more evolution rules are acted on the current data cell object set, when the shutdown constraint condition of the system is reached, the system is automatically shut down, and the calculation result is presented in the external environment of the system.

In order to reduce the complexity of the system, a simple shutdown condition based on maximum execution of the calculation is adopted, specifically, the system is stopped when the organization P performs to the set maximum calculation number, and the global optimal object set in the current environment is output.

Claims

1. A Web service mixed evolution clustering method based on membrane computing is characterized by comprising the following steps:

the first step is as follows: formalized definition;

the second step is that: calculating the service similarity;

the third step: service clustering

wherein sensitivity (S)_i) Is represented as S_iThe total number of points in the range of R, and the distance sim (S)_i,S_j) Adopt as service S_iAnd S_jThe similarity of (2);

(x,T₁,T₂,…T_m,/T′₁,T′₂,…T′_m,y),x≠y,x,y＝1,2,3.

this transport rule represents a bi-directional object transport of histiocyte x and histiocyte y, where T₁,T₂,…T_mFor the m objects of tissue cell x, like T₁’,T₂’,…T_m' m objects of histiocyte y; the following effects are achieved by the transfer rule:

(x,T_xbest/T_best,OE_o),x≠y,x,y＝1,2,3.

this transport rule represents the transport of histiocyte x and the systemic environment, where T_xbestFor the current calculation of locally optimal pairs in tissue cells xElephant, T_bestFor the global optimal object in the current environment, the optimal object in the tissue cell x is transferred into the environment through the transfer rule, and the global optimal object of the environment is updated at the same time;

the fifth step: shutdown and output

Each histiocyte in the system is used as an independent execution unit to perform evolutionary operation in a parallel structure, so the system is distributed in parallel, in the system, a series of calculation steps are defined as a calculation, starting from the histiocyte containing an initial data cell object set, in each calculation, one or more evolutionary rules are acted on the current data cell object set, when a shutdown constraint condition of the system is reached, the system is automatically shut down, and the calculation result is presented in the external environment of the system.

2. The membrane computing-based Web services mixed-evolution clustering method according to claim 1, wherein in the first step, formally defined procedure is as follows:

1.1 mashup service definition;

1.2SOAP services terminology definition:

1.3 organization P System (P System) definition:

ω＝(OB₁,OB₂,OB₃,OR₁,OR₂,OR₃,OR′,OE_o)

wherein:

and OEo is the output area of the system, representing the environment.

3. The membrane computing-based Web services mixed evolution clustering method of claim 2, wherein the process of 1.1 is as follows:

RD is domain feature vector, representing service domain information, defining a service having m domains, then

RD＝{RD₁,RD₂,…,RD_m}；

RT is the service description text feature vector,assuming that there are n service description texts in each domain, the description texts of m domains are denoted as RT ═ RT₁₁,RT₁₂,…,RT_1n,…,RT_mn}；

RA is a service API feature vector;

t is a service label feature vector;

wherein df (FW)_ijk,RD_i) Representative service domain RD_iIn (1), containing a feature word FW_ijkDescription text RT of_ijA number of

D_final(FW_ijk,RD_i)＝α*D_dep(FW_ijk,RD_i)+β*D_fre(FW_ijk,RD_i)

Alpha and beta are weight coefficients, and alpha + beta is 1, the domain representation degrees of all the feature words in different service domains are obtained through the formula, the higher the domain representation degree of the feature words is, the more the feature words can represent the service domain information, it needs to be noted that a series of typical feature words appear in one service domain, the domain representation degrees of the words are high but the clustering effect of the service is general, a threshold value of the domain representation degree of the feature words is set, the feature words exceeding the threshold value are filtered, and the representation effect of the feature words on the service domain is improved;

1.1.5 field efficient feature word set: representing all feature word sets in a service domain, selecting a proper feature word set, sorting in descending order according to the domain representation degree of the feature words, and selecting the feature words of the top P percent in the service domain as a required domain efficient feature word set as shown in the following

HQ(RD_i)＝{FW_ij1,FW_ij2,…,FW_ijp,}

Wherein P is L P/100, if a description text RT is in the process of simplifying the characteristic words_ijCharacteristic word FW_ijkNot belonging to HQ (RD)_i) Then it is passed throughFiltering and updating service description document RT_ij'。

4. The membrane computing-based Web service mixed evolution clustering method according to any one of claims 1 to 3, wherein in the second step, whether the Web service is a SOAP service is judged, if the SOAP service is the SOAP service, the step 2.2 is skipped, and if the SOAP service is a mashup service, the step 2.1 is performed;

2.1 mashup service similarity calculation

2.1.1 service Pre-processing

By preprocessing the crawled service information, namely the domain, the description text, the API and the label of the service, accurate and effective characteristic words in the service information are extracted, a more accurate service description document is constructed, and the service clustering precision is improved;

2.1.2 service feature reduction Process

Different services have unique field characteristics, the importance of characteristic words in the same field is related to frequency and field correlation, when calculating the service characteristic value weight, the traditional TF-IDF calculation method or mutual information method only considers one factor, and the description text of the service is subjected to characteristic simplification processing by comprehensively considering the word frequency and the correlation factor

2.1.3 topic clustering model construction:

after the characteristics of the RSM are simplified, a new service document vector model RSM '< RD', RT ', RA', T '>, is obtained, an extended LDA topic model based on a plurality of data sources is constructed and marked as MD-LDA, wherein the extended LDA topic model comprises a simplified service description text characteristic vector RT', a Web API characteristic vector RA and a label characteristic vector T, the MD-LDA model is based on hidden Dirichlet distribution, the hidden Dirichlet distribution LDA is a topic model, topics of each document in a document set are given out according to a probability distribution mode, various data source characteristics of the service are fused, in the MD-LDA model, word selection methods related to the service API and the labels are consistent with the service document description document RT, and each service API or service label has unique contribution to the topic distribution of the document;

therefore, a topic distribution is provided, the alpha hyper-parameter of the Dirichlet corresponds to RA and T in RSM', the beta hyper-parameter of the Dirichlet corresponds to word distribution in each topic, then a topic is extracted from the topic distribution according to the RA or T selected in the distribution, and a specific word is generated through the selected topic, so that an MD-LDA model fusing a service description text, a service API and a service label is generated;

2.1.4 similarity calculation

In fact, the theme distribution of the mashup service document is mapped to the text vector space, so that two service documents RSM are calculated through corresponding theme probability distribution₁' and RSM₂The similarity of the' is that the topic in the model is the mixed distribution of word vectors, so the relative entropy KL distance is used as the similarity measurement standard, and the calculation is shown as follows;

t stands for all common topics in two service documents, p_jAnd q is_jRespectively representing the distribution of topics in two documents, when p_j＝q_jAt that time, KL distance calculation result D_KL(RSM′₁,RSM′₂) Is 0, since the KL distance is not of a symmetrical nature, i.e. D_KL(RSM′₁,RSM′₂)≠D_KL(RSM′₂,RSM′₁) So a symmetric version thereof is usually used, the calculation formula is as follows:

D_KL(RSM′₁,RSM′₂)＝λD_KL(RSM′₁,λ*RSM′₁+(1-λ)RSM′₂)

+(1-λ)D_KL(RSM′₂,λ*RSM′₁+(1-λ)RSM′₂)

2.2 SOAP similarity calculation

2.2.1 calculation of self-eigenvalues

Finding a term T in a service corpus_iThe information quantity I (P) is calculated by an information theory method_i) On this basis, the term T will be used_iCharacteristic value of (Spe) (T)_i) Assigned as follows

Spe(T_i)＝I(P_i)

By computing a joint probability distribution P { P }_i,q_jIs calculated for the term feature value, where p_iE is P and q_j∈Q，p_iIs to select a word from the term set TL and q_jIs to take a word from the atomic vocabulary A, where { p }₁,p₂,…p_nAnd q₁,q₂,…,q_mAre respectively represented by random variables P, Q, P_iAnd q is_jThe mutual information calculation of (a) is calculated by the following formula;

Spe(T_i)≈I(p_i,Q)

according to the Bayes' theorem,

Analyzing conventional WSDL documents generally includes 1 to 2 words, so

2.2.2 contextual information eigenvalue calculation

wherein NT represents the term T_iModified quantity of (2), (mod)_m,T_i) Represents mod_mModifying the term T_iThe entropy value of (d) is determined by all (mod)_m,T_i) For the calculation of average information amount, in a specific field, the modifiers of the terms are distributed more closely, so that the entropy value of the terms in a specific field is lower, and the term T is calculated through the entropy value_iContext information characteristic value ContextSpe (T)_i) The following were used:

wherein j is more than or equal to 1 and less than or equal to K, K is the sum of the number of modifiers with the same definition,represents each modifier;

2.2.3 hybrid eigenvalue calculation

Covering the characteristics of the descriptive words and the information which cannot be described by the words by the self characteristic values and the context information characteristics calculated in the steps 2.2.1 and 2.2.2, and finally obtaining the mixed characteristic values by the following formula as follows:

2.2.4 calculating the domain weight;

2.2.5 Bigraph hierarchy of generative terms

Providing a term Bigraph hierarchical structure algorithm, constructing Bigraph hierarchical structures of different terms, wherein each node of the Bigraph represents a term object, the value of the node represents the characteristic value of the term object, and the Bigraph hierarchical structure is constructed from top to bottom;

2.2.6 constructing a similarity matrix:

the similarity is calculated using the following formula:

where D represents the maximum number of layers of the Bigraph hierarchy constructed by the term, dis (T)₁,T₂) Stands for two terms T₁,T₂And calculating the similarity of each characteristic of the SOAP service according to the shortest distance in the Bigraph hierarchical structure, namely the similarity of the SOAP service on a certain characteristic, taking the sum of the characteristic similarities as the similarity of the service, and constructing the similarity relation between the services into a similarity matrix.

5. The membrane computing-based Web services mixed evolution clustering method of claim 4, wherein in 2.1.1, the preprocessing steps are as follows:

2.1.1.2 remove invalid words such as symbols (+, -, _ etc.) and prepositions (a, the, of, and etc.) that are not useful in characterizing a service, preserving names, verbs and adjectives that characterize the service's characteristics;

2.1.1.3 merging and processing word stems, wherein some words with the same word stem often have similar meanings, for example, the characteristics of use, used and using expressions are the same, and the root of a word with the same meaning is deleted to be reserved.

6. The membrane computing-based Web services mixed evolution clustering method of claim 4, wherein the step of 2.1.2 is as follows:

2.1.2.1 traverse each service domain RD_iEach of the description texts RT_ijEach feature word FW_ijkCalculating the feature word FW according to the formula in 1.1.4_ijkRepresenting a service domain RD_iDegree D of_final(FW_ijk,RD_i) If D is_final(FW_ijk,RD_i)<R, deleting the characteristic value, wherein R is a threshold value of the domain representation degree of the characteristic value;

2.1.2.2 all service domains after completing 2.1.2.1 steps, according to D_final(FW_ijk,RD_i) Value of (D), to RD_iCharacteristic word FW_ijkSorting in descending order, and selecting the characteristic words of the top percentage P as the service domain RD according to the step 1.5_iDomain efficient feature word set HQ (RD)_i)；

2.1.2.3 repeat step 2.1.2.2 until a domain efficient feature word set HQ (RD) is generated for all service domains_i) Each service domain RD_iAccording to HQ (RD)_i) Deleting all absent HQ (RD)_i) The feature words of (1).

7. The membrane computing-based Web services mixed evolution clustering method of claim 4, wherein in the 2.1.3, the generation process is as follows:

2.1.3.1 for RA or T in RSM', the variable dat is defined_dWherein dat is_d1,2,3, …, N being the total number of RA and T in RSM', a polynomial θ is selected_datSubject to the alpha hyper-parametric distribution of dirichlet, 2.1.3.2 for topic K in RSM ', satisfying K1, 2, …, K being the number of topics in RSM', one polynomial is selected

Obeying beta hyper-parameter distribution of Dirichlet;

2.1.3.3 setting variable D as document tag in the simplified service document RSM ', D is 1,2, …, D is the total number of RSM', defining variable dat_dRepresents RA or T in each RSM', for each word w in d_diI ═ 1,2,3, …, M being the total number of words in d;

Extracting a subject as z_diIts obedient polynomial distribution is noted

Distributing;

extracting a word and recording the word as w_diSubject to

Distributing;

each topic in the corresponding MD-LDA probability model

The above word distribution is related, the extraction of words is independent of the dirichlet parameter β, x denotes the tag set dat from the API_dSelecting RA or label T related to given single word, each RA or T is distributed with theta on a subject, theta is selected from Dirichlet parameter alpha, and the subject z is formed by combining the subject distribution of RA and T and the word distribution of the subject_diAnd extracting the word w from the selected subject_di；

θ_dat|α～Dirichlet(α)

x_di|dat_d～Uniform(dat_d)

as a subject z_diThe alpha parameter of (a) is,

as a word w_diBeta parameter of (a), number of subjects V, alpha_v,β_vFor alpha parameter and beta parameter of the v-th topic, the word distribution of the topic in the sampling process

And subject distribution theta of API, label needs to pass throughThe following formula is obtained;

z_diand x_diBy determining z_{_di}And x_{_di}To sample decisions, for each RSM ', sum all θ' s_xCalculating the topic distribution of the document d, wherein x belongs to dat_dThus resulting in a final topic probability distribution for all RSMs'.

8. The membrane computing-based Web services mixed evolution clustering method of claim 4, wherein the process of 2.2.4 is as follows:

2.2.4.1 Domain weight calculation

wherein,

is a new term T_nSet of sibling terms of (1), hybrid spe (T)_s) And hybrid Spe (T)_n) Respectively representing the characteristic value of each sibling term and the new term, and directly defining the weight value as 0.5, G, if the newly added term has no sibling term_iFor current Bigraph structures, the even graph Bigraph is a bituple B ═<BP,BL>Proposed by Milner, Bonus awards, BP, BL are respectively a location graph (place graph) and a link graph (link graph)<V,E,P>The node set V, the edge set E and the interface P of the graphThe nested nodes are in a parent-child relationship in the position diagram, the embedding between the nodes is represented by a branch relationship, BL is a triple composed of a node set V, an edge set E and an interface P of the diagram like BP, and BL is used for representing the connection relationship between the nodes;

2.2.4.2 term weight value calculation

wherein,

andrespectively represent in the term T_iAnd T_nThe number of constituent words in (a),

representing the number of the same words in the two terms, defining that the more related sub-structure similar terms a new term contains, the higher the weight, and obtaining the term weight value according to the similarity of the terms, wherein the calculation formula is as follows:

where NP is the total set of superior, sibling, and subordinate terms of the term, T_iRepresents one of these terms.

9. The membrane computing-based Web services mixed evolution clustering method of claim 4, wherein the step of 2.2.5 is as follows:

2.2.5.2 for the remaining term T in array A_nAdded to the existing Bigraph hierarchy if T_xSatisfy (hybrid Spec (T)_n)-0.3<HybridSpe(T_x)<HybridSpe(T_n) +0.3, then T_xMarked as target node, T_xDetermining T for the existing Bigraph hierarchy terminology through the target nodes_nDetermining the position of the target substructure, thereby determining a candidate Bigraph structure;

W_f(G_i)＝ωW_DS(G_i)+(1-ω)W_TS(G_i)

where ω is a coefficient, ranging from 0 to 1, by iteratively running 2.2.5.2-2.2.5.3 until all terms are added to the Bigraph hierarchy.

10. The membrane computing-based Web service mixed evolution clustering method according to one of the claims 1 to 3, wherein in the third step, the clustering process based on the density K-means algorithm is as follows:

3.1 preprocessing the data by adopting a density-based K-means algorithm and calculating different data S_iThe distance between the two clusters is divided into different clusters according to the radius range R, and the Density, namely Density (S), is selected to be the highest_i) The highest K S_iAs a cluster center, clustering the data through similarity;

3.1.2 select the first K S points with the highest density, i.e. the largest number of R range points_kAs newData cluster center C_k；

3.1.3 obtaining each S according to the distance between the divided different clusters_iAnd C_kSimilarity sim (S) of_i,C_k) According to the average similarity Avesim, if sim (C)_k,S_i)>Avesim, then S_iPartitioning into data clusters C_kFinally obtaining N data clusters;

3.2 tissue cell O1 evolutionary rule

3.2.1 clusters C from any two data_i,C_jAverage similarity dis (C) of inner data_i,C_j) Constructing a similarity matrix D

Wherein S_XAs a data cluster C_iData point of (1), S_YAs a data cluster C_jThe data points in (1), U and V are C_i,C_jThe number of data points in;

3.2.2 choosing dis (C)_i,C_j) Largest data cluster C_i,C_jAccording to the threshold value of similarity between clusters Cs, if dis (C)_i,C_j)>Cs then clusters the data C_iAnd C_jMerging;

3.3 histiocyte O₂Rules of evolution

O₂An FCM algorithm based on sample weighting is adopted as an evolution rule to guide and complete the evolution of objects in cells, and a target function and a cluster center of a traditional FCM algorithmThe calculation does not consider the difference of the samples, and all the samples are treated in a same-view way, but the defect that the influence of isolated points or noise data in a data set is easily expanded exists, so that the contribution of some important samples to clustering is reduced, and the clustering precision is reduced; in order to reduce the influence of sample difference on the clustering effect, an FCM clustering algorithm based on sample weighting is provided, and the clustering effect is improved by reasonably weighting a target function and a clustering center function;

for data set S ═ S₁,s₂,…,s_n}，

3.3.1 calculating FCM membership according to the following formula:

3.3.2 computing weight and entropy information

The entropy of thermodynamics represents the chaos degree of information, the data membership is effectively analyzed based on entropy definition, the FCM target function is subjected to sample weighting, and an entropy variable E is defined firstly_iRepresenting degree of membership u_ijAnd by calculating the weight w_iMeasurement data s_iThe degree of influence on the secondary clusters is shown in the following calculation formula:

3.3.3 according to E_i,w_iCalculating a new objective function

Weight coefficient w_iSatisfy the requirement of

The target function F (S, t) of the newly defined FCM is formulated as follows

m is a weighting index which is an integer greater than or equal to 1, and in order to solve the extreme value of the objective function under the constraint condition, a Lagrange multiplier method is used for constructing a new objective function as follows:

calculating a new cluster center t_jComprises the following steps:

3.4 tissue cell O3 evolutionary rule

3.4.2.2 define the fitness function fitness of each object k_k

fitness_k＝α(1-α)^index-1

3.4.2.3 selecting operation according to the ratio of object fitness

Where u is the total number of objects in the pool of objects, for eachCyclically and randomly generating a random number p if p<Cif_kThen the object is inherited to the next generation;

3.4.2.5 define the mutation probability P_mSetting a random probability P for each object, and if the probability P is less than the mutation probability P_mIf z is according to the mutation probability P_mThe determined variation point of the object, i.e. a certain component, the value after variation is z_θThe mutated object is represented as:

3.4.3 repeating steps 3.4.1-3.4.2, O to keep the scale of the objects in the evolutionary pool stable₃And screening the evolved objects, eliminating the objects according to the fitness of the objects, and reserving m objects with the highest fitness to form an object evolution pool P' again.