CN112836491B - NLP-oriented Mashup service spectral clustering method based on GSDPMM and topic model

NLP-oriented Mashup service spectral clustering method based on GSDPMM and topic model

Info

Publication number
CN112836491B
CN112836491B (granted publication of application CN202110097170.6A)
Authority
CN
China
Prior art keywords
word
document
matrix
mashup
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110097170.6A
Other languages
Chinese (zh)
Other versions
CN112836491A (en)
Inventor
陆佳炜
赵伟
郑嘉弘
马超治
程振波
徐俊
高飞
肖刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT
Priority to CN202110097170.6A
Publication of CN112836491A
Application granted
Publication of CN112836491B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G06F16/35 Clustering; Classification
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

An NLP-oriented Mashup service spectral clustering method based on GSDPMM and a topic model comprises the following steps. First step: calculate the number of topics of the Mashup services by GSDPMM. Second step: calculate the semantic weight of each word from its context information and the service tag information, obtaining a document-word semantic weight matrix D. Third step: count word co-occurrence information and compute the SPPMI matrix M. Fourth step: based on the document-word semantic weight matrix D and the SPPMI matrix M, obtain a word embedding matrix by decomposing M, combine the two kinds of information, and compute the topic information of each service. Fifth step: use the obtained Mashup service topic features as the input of spectral clustering. The invention fuses an optimized word embedding and word semantic weight calculation method to alleviate the sparsity problem caused by short texts and find a better solution set.

Description

NLP-oriented Mashup service spectral clustering method based on GSDPMM and topic model
Technical Field
The invention relates to an NLP-oriented Mashup service spectral clustering method based on GSDPMM and a topic model.
Background
With the development of cloud computing and the service-oriented idea of service computing, more and more companies publish data, resources, or related functionality on the Internet in the form of Web services to improve information utilization and their own competitiveness. However, traditional Web services based on the SOAP protocol suffer from a complex technical stack and poor extensibility, and they struggle to adapt to the complex and changeable application scenarios of real life. To overcome these problems, a lightweight information service composition pattern, the Mashup technique, has developed on the Internet in recent years: various Web APIs can be mixed and combined to build brand-new Web services, addressing the difficulty traditional services have in adapting to complex and changeable application environments.
With the rapid growth of Mashup services, finding high-quality services among the many available ones has become a hotspot problem of wide concern. Natural Language Processing (NLP) is an important research direction in computer science and artificial intelligence that studies how computers process, understand, and apply human language. Mashup service description documents are written in natural language, so NLP methods are needed to let the computer understand what a service describes.
At present, existing methods mainly use Latent Dirichlet Allocation (LDA) or Non-negative Matrix Factorization (NMF) to obtain Mashup service topic features and then perform clustering. However, Mashup service description documents are usually short, with sparse features and little information. LDA handles short texts far less well than long texts, so most current topic models struggle to model short texts that lack training corpus; words in a short text generally occur only once, high-frequency word information is missing, and a term frequency-inverse document frequency (TF-IDF) model can hardly compute meaningful semantic weights for the words. In addition, topic models such as LDA and NMF generally require the number of topics to be specified, yet the number of service topics is difficult to determine in advance. Meanwhile, most current service clustering algorithms apply K-means to the final topic feature values, but the traditional K-means algorithm suffers from the randomness of its initial cluster centers and cannot find non-convex clusters, so the clustering quality may be unsatisfactory.
Disclosure of Invention
To solve the problems that traditional topic models lack modeling capacity for short texts, that the number of topics is hard to determine, and that the low quality of the K-means clustering algorithm leads to low Mashup service clustering quality, the invention provides an NLP-oriented Mashup service spectral clustering method based on GSDPMM (a collapsed Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture model) and a topic model. The method performs topic mining on Mashup services with non-negative matrix factorization (NMF), introduces an improved Gibbs-sampling Dirichlet Process Mixture Model (DPMM) to determine the number of topics automatically, fuses an optimized word embedding and word semantic weight calculation method to alleviate the sparsity problem caused by short texts, and finally clusters the topic features of the Mashup services with a spectral clustering algorithm to find a better solution set.
The technical scheme adopted by the invention to solve these problems is as follows:
An NLP-oriented Mashup service spectral clustering method based on GSDPMM and a topic model comprises the following steps:
First step: calculate the number of topics of the Mashup services by the GSDPMM method, as follows:
1.1 Initialize z, n_z, n_zv, and m_z: all elements of n_z, n_zv, and m_z are 0, all elements of z are 1; set the initial topic number K to 1 and the iteration count Iter. Here z records the topic to which each document belongs, n_z counts the number of words under each topic, n_zv counts the occurrences of each word under each topic, and m_z counts the number of documents under each topic, with z ∈ R^(1×N), n_z ∈ R^(1×K), n_zv ∈ R^(K×V), m_z ∈ R^(1×K), where N is the number of Mashup services and V is the number of distinct words in the corpus;
1.2 Traverse all Mashup services and compute n_z and n_zv;
1.3 Perform the Gibbs sampling operation on all Mashup services;
1.4 Select the topic of document d by roulette wheel selection;
1.5 According to the topic k of the current document d, increase m_z[k] by 1 and n_z[k] by Len, where Len is the length of the document;
1.6 Repeat steps 1.3-1.5 until all Mashup services are processed;
1.7 Repeat steps 1.3-1.6 until the iteration count Iter is reached;
Second step: calculate the semantic weight of each word from its context information and the service tag information, obtaining a document-word semantic weight matrix D, as follows:
2.1 Use the Natural Language Toolkit (NLTK) in Python to part-of-speech tag the words in each Mashup service description document; NLTK is a well-known natural language processing library;
2.2 Count word frequency information and calculate TF-IDF values;
2.3 Extract the Mashup service tag information and recalculate the semantic weight of each word in the Mashup service description document based on the noun set Nset and the TF-IDF values;
Third step: count word co-occurrence information and compute the SPPMI matrix, as follows:
3.1 Count word co-occurrence information. Because Mashup service description documents are short, the whole service description document is used as the sliding-window length so that context co-occurrence information can be captured more completely, and the number of times each word co-occurs with every other word is counted;
3.2 Calculate the pointwise mutual information (PMI). PMI is widely used to measure the similarity between words: the more often two words co-occur in a text, the stronger their correlation. The PMI formula is:
PMI(x, y) = log(P(x, y) / (P(x) · P(y)))
where x and y are two words, P(x, y) is the probability that x and y co-occur, and P(x) is the probability that x occurs in a context. From the actual number of co-occurrences of word w_j and its context word w_c in the corpus, the PMI value between the two can be calculated as:
PMI(w_j, w_c) = log((#(w_j, w_c) · E) / (#(w_j) · #(w_c)))
where #(w_j, w_c) is the actual number of co-occurrences of w_j and w_c in the corpus, E is the total number of co-occurring context word pairs, #(w_j) = Σ_{w∈Voc} #(w_j, w) is the number of co-occurrences of w_j with other words, and Voc is the corpus vocabulary, i.e. the set of distinct words;
3.3 Calculate the shifted positive pointwise mutual information (SPPMI) matrix from the PMI values:
SPPMI(w_j, w_c) = max(PMI(w_j, w_c) - log κ, 0)
where κ is the negative sampling coefficient. The context SPPMI matrix M of the words is obtained through this formula;
Fourth step: with the document-word semantic weight matrix D from the second step and the word context SPPMI matrix M from the third step, obtain a word embedding matrix by decomposing M, then combine the two kinds of information to compute the topic information of each service, as follows:
4.1 Given the global document-word semantic weight matrix D from the second step, decompose it by NMF into the product of a document-topic matrix θ and a topic-word matrix Z. The objective for decomposing D is:
J_1 = min ||D - θZ^T||^2, subject to: θ ≥ 0 and Z ≥ 0, θ ∈ R^(N×K), Z ∈ R^(V×K)
where ||·|| is the L2 norm, N is the number of Mashup documents, K is the number of topics, V is the number of words in the corpus, R is the set of real numbers, and the superscript T denotes matrix transposition. NMF is a matrix decomposition method that expresses one non-negative matrix as the product of two other non-negative matrices under the constraint that all matrix elements are non-negative;
4.2 The word context SPPMI matrix M obtained in the third step is decomposed to introduce word embedding information. The objective for decomposing M is:
J_2 = min ||M - WSW^T||^2
where S is an additional symmetric factor used to approximate M and W is the word embedding matrix of the words;
4.3 Topic information is found through the relation between Mashup service documents and words, while word embedding information is learned from the co-occurrence of word contexts within documents. These two parts are not isolated from each other: semantically related words belong to similar topics and lie close together in the embedding space. Word embeddings are therefore related to their topics, with the relation:
J_3 = min ||Z - WA^T||^2
4.4 In step 4.3, the topic-word matrix Z is decomposed into the product of a topic embedding matrix A and the word embedding matrix W, associating word embeddings with topic information and further improving the accuracy of topic modeling.
Combining steps 4.1, 4.2, and 4.3 gives the objective function of the topic model:
J_4 = min λ_d||D - θZ^T||^2 + λ_w||M - WSW^T||^2 + λ_t||Z - WA^T||^2, subject to: θ ≥ 0 and Z ≥ 0
To solve this objective function, expand it using the matrix trace operation:
J(θ, Z, W, S, A) = λ_d Tr((D - θZ^T)(D - θZ^T)^T) + λ_w Tr((M - WSW^T)(M - WSW^T)^T) + λ_t Tr((Z - WA^T)(Z - WA^T)^T)
where J(θ, Z, W, S, A) is the expanded form of J_4 under the parameters θ, Z, W, S, A. Further expansion gives:
J(θ, Z, W, S, A) = λ_d Tr(DD^T - 2DZθ^T + θZ^TZθ^T) + λ_w Tr(MM^T - 2MWSW^T + WSW^TWSW^T) + λ_t Tr(ZZ^T - 2ZAW^T + WA^TAW^T)
where Tr denotes the matrix trace and λ_d, λ_w, λ_t are weight coefficients of the different parts, used to adjust how much the error of each part influences the result. Introducing the non-negativity constraints through the regularization (Lagrange multiplier) matrices α, β, γ, ω, which also help avoid overfitting, gives the constrained objective:
J = J(θ, Z, W, S, A) + Tr(αθ^T) + Tr(βZ^T) + Tr(γW^T) + Tr(ωA^T)
To minimize the objective function, take its partial derivatives with respect to θ, Z, W, and A. Let α ⊙ θ = 0, β ⊙ Z = 0, γ ⊙ W = 0, ω ⊙ A = 0, where ⊙ denotes the Hadamard product, i.e. the element-wise product of corresponding matrix positions. Setting the derivatives to 0 and applying the Hadamard product yields the following equations:
-2(DZ) ⊙ θ + 2(θZ^TZ) ⊙ θ + α ⊙ θ = 0
-2(λ_d D^Tθ + λ_t WA^T) ⊙ Z + 2(λ_d Zθ^Tθ + λ_t Z) ⊙ Z + β ⊙ Z = 0
-2(λ_w MWS + λ_t ZA) ⊙ W + (λ_t WA^TA + 2λ_w WSW^TWS) ⊙ W + γ ⊙ W = 0
-(Z^TW) ⊙ A + (AW^TW) ⊙ A + ω ⊙ A = 0
These equations lead to the multiplicative parameter updates (division is element-wise):
θ ← θ ⊙ (DZ) / (θZ^TZ)
Z ← Z ⊙ (λ_d D^Tθ + λ_t WA^T) / (λ_d Zθ^Tθ + λ_t Z)
W ← W ⊙ 2(λ_w MWS + λ_t ZA) / (λ_t WA^TA + 2λ_w WSW^TWS)
A ← A ⊙ (Z^TW) / (AW^TW)
Through this parameter updating scheme, the Mashup service document-topic matrix θ, the topic-word matrix Z, the word embedding matrix W, and the topic embedding matrix A can be solved;
Fifth step: use the Mashup service topic features obtained in 4.4 as the input of spectral clustering. Spectral clustering is an algorithm that evolved from graph theory and has since been widely applied to clustering. Its main idea is to regard all data as points in space connected by edges: the edge weight between two distant points is low, while the edge weight between two close points is high. The graph formed by all data points is cut so that the edge weights between different subgraphs are as low as possible while the edge weights within each subgraph are as high as possible, thereby achieving the clustering goal. The steps are as follows:
5.1 Calculate the similarity matrix SI, where the similarity between service topic features is computed with a Gaussian kernel. Let θ_i denote the topic features of Mashup service i, δ a scale parameter, and exp the exponential function with the natural constant e as base; the Gaussian kernel is:
SI_ij = exp(-||θ_i - θ_j||^2 / (2δ^2))
5.2 Sum the elements of each column of SI and place each sum as an element on the diagonal of the degree matrix G:
G_ii = Σ_j SI_ij
5.3 Compute the Laplacian matrix L = G - SI from G;
5.4 Use the eig function in Python to compute the eigenvalues and eigenvectors of L; the service document feature vector matrix F is found by solving:
argmin_F Tr(F^T L F), subject to: F^T F = I
where argmin_F denotes the value of F that minimizes Tr(F^T L F) and Tr denotes the matrix trace;
5.5 Sort the eigenvalues from small to large and take the eigenvectors of the first C eigenvalues, where C is the specified number of clusters, as the initial cluster centers;
5.6 Compute the Euclidean distance dist from each feature vector to each cluster center and assign each Mashup service to the cluster with the minimum distance:
dist = sqrt(Σ_i (f_i - Ce_i)^2)
where f_i is the i-th value in the feature vector f and Ce_i is the i-th value of the cluster center vector Ce;
5.7 Update each cluster center to the mean of the feature vectors in its cluster;
5.8 Compute the Euclidean distance between the new and old cluster centers as the error value;
5.9 Repeat steps 5.6-5.8 until the error is less than a given threshold or the number of iterations reaches the maximum.
Further, the process of 1.2 is as follows:
1.2.1 According to z_d, obtain the topic k of the current document d, increase m_z[k] by 1 and n_z[k] by Len, where Len is the length of the document;
1.2.2 Traverse each word w in document d, increasing n_zv[k][w] by 1;
1.2.3 Repeat 1.2.1-1.2.2 until all Mashup services are processed.
Still further, the process of 1.3 is as follows:
1.3.1 According to z_d, obtain the topic k of the current document d, decrease m_z[k] by 1 and n_z[k] by Len, where Len is the length of the document;
1.3.2 Traverse each word w in document d, decreasing n_zv[k][w] by 1;
1.3.3 Traverse each existing topic z and calculate the probability of document d on that topic:
p(z_d = z | z_¬d, d) ∝ (m_z^¬d / (N - 1 + αN)) · (Π_{w∈d} Π_{j=1}^{N_d^w} (n_z^{w,¬d} + β + j - 1)) / (Π_{i=1}^{N_d} (n_z^¬d + Vβ + i - 1))
1.3.4 Calculate the probability of document d under a new topic:
p(z_d = K+1 | z_¬d, d) ∝ (αN / (N - 1 + αN)) · (Π_{w∈d} Π_{j=1}^{N_d^w} (β + j - 1)) / (Π_{i=1}^{N_d} (Vβ + i - 1))
where α and β are hyperparameters, z_d is the topic of the current document d, the superscript ¬d marks statistics computed without the information of document d, z_¬d is the topic assignment of every document excluding document d, m_z^¬d is the number of documents under topic z after removing document d, n_z^{w,¬d} is the number of occurrences of word w in topic z without document d, n_z^¬d is the number of words in topic z without document d, N_d is the number of words in document d, and N_d^w is the number of occurrences of word w in document d.
Still further, the procedure of 1.4 is as follows:
1.4.1 Accumulate the probability of document d under every topic to obtain the total probability prob;
1.4.2 Randomly generate a number thred in [0, prob];
1.4.3 Accumulate the probability of document d under each topic again; once the accumulated sum at the current topic k is greater than or equal to thred, the topic of document d is k.
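For reference, a minimal Python sketch of this roulette wheel selection (steps 1.4.1-1.4.3); the function name roulette is illustrative:

```python
import random

def roulette(probs):
    """Return index k with probability proportional to probs[k] (steps 1.4.1-1.4.3)."""
    prob = sum(probs)                  # 1.4.1: total probability
    thred = random.uniform(0, prob)    # 1.4.2: random number in [0, prob]
    acc = 0.0
    for k, p in enumerate(probs):      # 1.4.3: accumulate until the sum reaches thred
        acc += p
        if acc >= thred:
            return k
    return len(probs) - 1              # guard against floating-point round-off
```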
The process of 2.1 is as follows:
2.1.1 Traverse each word in the current Mashup service description document and lemmatize it using NLTK;
2.1.2 Use NLTK to extract the word stem and judge whether the word is a noun; if it is, add it to the noun set Nset;
2.1.3 Repeat steps 2.1.1-2.1.2 until all Mashup services are processed.
The process of 2.2 is as follows:
2.2.1 Traverse each word in the Mashup service description document, count how many times each word occurs in the current document, and calculate its TF value:
TF_{i,j} = NUM(j) / LEN(i)
where TF_{i,j} is the word frequency of the j-th word in the i-th Mashup service description document, NUM(j) is the number of times the j-th word appears, and LEN(i) is the length of the i-th Mashup text;
2.2.2 Count the number of Mashup service documents in which each word appears and calculate its IDF value:
IDF(x) = log(N / Doc(x))
where IDF(x) is the IDF value of word x, N is the number of Mashup documents, and Doc(x) is the number of Mashup documents containing word x;
2.2.3 Traverse the words in all Mashup documents and calculate their TF-IDF values:
TF-IDF(x) = TF(x) * IDF(x)
where TF-IDF(x) is the TF-IDF value of word x and TF(x) is the TF value of word x.
The process of 2.3 is as follows:
2.3.1 Traverse each word w_x in the current Mashup service document and calculate its context semantic weight WeightContext(w_x):
WeightContext(w_x) = (Σ_{w_y ∈ d, w_y ≠ w_x} sim(w_x, w_y)) / N_d
where sim(w_x, w_y) is the similarity of words w_x and w_y computed with the WordNet tool, w_y is a context word of w_x, d is the current Mashup service description document, and N_d is its length. WordNet is an English dictionary that organizes words in a network structure, grouping words with similar meanings; similarity is obtained from the shortest path between the words in the network;
2.3.2 Calculate the service tag semantic weight WeightTag(w_x) of each word:
WeightTag(w_x) = (Σ_{t ∈ Tag_d} sim(w_x, t)) / |Tag_d|
where Tag_d is the service tag set of the current Mashup service document and t is a word in a service tag;
2.3.3 Recalculate the semantic weight of each word based on its TF-IDF value combined with the results of 2.3.1 and 2.3.2.
Preferably, the operation of 2.3.3 is as follows:
2.3.3.1 Traverse each word w_x in the current Mashup service description document and judge whether it is in the noun set Nset; if w_x is in the noun set, recalculate its semantic weight by combining its TF-IDF value with WeightContext(w_x) and WeightTag(w_x); if w_x is not in the noun set Nset, jump to step 2.3.3.2;
2.3.3.2 Assign the semantic weight of the word as its TF-IDF value:
SemWeight(w_x) = TF-IDF(w_x)
2.3.3.3 Repeat 2.3.3.1-2.3.3.2 until all Mashup services are processed, obtaining the document-word semantic weight matrix D.
The process of 3.1 is as follows:
3.1.1 For the current Mashup service, compute the length Len of its description document and set the sliding-window length to Len;
3.1.2 Count the co-occurrences of each word with the other words in the Mashup service description document: whenever a context word of the current word (i.e. a word before or after it) lies within the sliding-window distance Len, add 1 to the co-occurrence count of the word and that context word;
3.1.3 Repeat 3.1.2 until all words in the Mashup service are processed;
3.1.4 Repeat 3.1.1-3.1.3 until all Mashup services are processed.
The beneficial effects of the invention are mainly as follows: topic mining is performed on Mashup services with non-negative matrix factorization (NMF); an improved Gibbs-sampling Dirichlet Process Mixture Model (DPMM) is introduced to determine the number of topics automatically; an optimized word embedding and word semantic weight calculation method is fused in to alleviate the sparsity problem caused by short texts; and finally a spectral clustering algorithm clusters the topic features of the Mashup services to find a better solution set.
Detailed Description
The invention is further described below.
An NLP-oriented Mashup service spectral clustering method based on GSDPMM and a topic model comprises the following steps:
First step: calculate the number of topics of the Mashup services by the GSDPMM method, as follows:
1.1 Initialize z, n_z, n_zv, and m_z: all elements of n_z, n_zv, and m_z are 0, all elements of z are 1; set the initial topic number K to 1 and the iteration count Iter. Here z records the topic to which each document belongs, n_z counts the number of words under each topic, n_zv counts the occurrences of each word under each topic, and m_z counts the number of documents under each topic, with z ∈ R^(1×N), n_z ∈ R^(1×K), n_zv ∈ R^(K×V), m_z ∈ R^(1×K), where N is the number of Mashup services and V is the number of distinct words in the corpus;
1.2 Traverse all Mashup services and compute n_z and n_zv, as follows:
1.2.1 According to z_d, obtain the topic k of the current document d, increase m_z[k] by 1 and n_z[k] by Len, where Len is the length of the document;
1.2.2 Traverse each word w in document d, increasing n_zv[k][w] by 1;
1.2.3 Repeat 1.2.1-1.2.2 until all Mashup services are processed;
1.3 Perform the Gibbs sampling operation on all Mashup services, as follows:
1.3.1 According to z_d, obtain the topic k of the current document d, decrease m_z[k] by 1 and n_z[k] by Len, where Len is the length of the document;
1.3.2 Traverse each word w in document d, decreasing n_zv[k][w] by 1;
1.3.3 Traverse each existing topic z and calculate the probability of document d on that topic:
p(z_d = z | z_¬d, d) ∝ (m_z^¬d / (N - 1 + αN)) · (Π_{w∈d} Π_{j=1}^{N_d^w} (n_z^{w,¬d} + β + j - 1)) / (Π_{i=1}^{N_d} (n_z^¬d + Vβ + i - 1))
1.3.4 Calculate the probability of document d under a new topic:
p(z_d = K+1 | z_¬d, d) ∝ (αN / (N - 1 + αN)) · (Π_{w∈d} Π_{j=1}^{N_d^w} (β + j - 1)) / (Π_{i=1}^{N_d} (Vβ + i - 1))
where α and β are hyperparameters, z_d is the topic of the current document d, the superscript ¬d marks statistics computed without the information of document d, z_¬d is the topic assignment of every document excluding document d, m_z^¬d is the number of documents under topic z after removing document d, n_z^{w,¬d} is the number of occurrences of word w in topic z without document d, n_z^¬d is the number of words in topic z without document d, N_d is the number of words in document d, and N_d^w is the number of occurrences of word w in document d;
1.4 Select the topic of document d by roulette wheel selection, as follows:
1.4.1 Accumulate the probability of document d under every topic to obtain the total probability prob;
1.4.2 Randomly generate a number thred in [0, prob];
1.4.3 Accumulate the probability of document d under each topic again; once the accumulated sum at the current topic k is greater than or equal to thred, the topic of document d is k;
1.5 According to the topic k of the current document d, increase m_z[k] by 1 and n_z[k] by Len, where Len is the length of the document;
1.6 Repeat steps 1.3-1.5 until all Mashup services are processed;
1.7 Repeat steps 1.3-1.6 until the iteration count Iter is reached;
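For reference, a minimal Python sketch of this first step is given below. It assumes each Mashup description is already tokenized into a list of words; the function name gsdpmm, the default hyperparameter values, and the use of dictionaries for the count statistics are illustrative choices, not part of the invention:

```python
import random
from collections import defaultdict

def gsdpmm(docs, V, alpha=0.1, beta=0.1, iters=10):
    """Sketch of the first step: docs is a list of token lists, V the vocabulary size."""
    N = len(docs)
    z = [0] * N                                   # 1.1: topic of each document, one initial topic
    m_z = defaultdict(int)                        # documents per topic
    n_z = defaultdict(int)                        # words per topic
    n_zv = defaultdict(lambda: defaultdict(int))  # occurrences of word w under topic k
    for doc in docs:                              # 1.2: initial statistics
        m_z[0] += 1
        n_z[0] += len(doc)
        for w in doc:
            n_zv[0][w] += 1
    for _ in range(iters):                        # 1.7: Iter sampling passes
        for d, doc in enumerate(docs):
            k = z[d]                              # 1.3.1-1.3.2: remove document d from its topic
            m_z[k] -= 1
            n_z[k] -= len(doc)
            for w in doc:
                n_zv[k][w] -= 1
            if m_z[k] == 0:
                del m_z[k]
            new_k = max(list(m_z) + [-1]) + 1     # a fresh id for the candidate new topic
            topics, probs = list(m_z) + [new_k], []
            for t in topics:                      # 1.3.3-1.3.4: probability under each topic
                p = m_z.get(t, alpha * N) / (N - 1 + alpha * N)
                i, seen = 1, defaultdict(int)
                for w in doc:
                    seen[w] += 1
                    p *= (n_zv[t][w] + beta + seen[w] - 1) / (n_z[t] + V * beta + i - 1)
                    i += 1
                probs.append(p)
            thred = random.uniform(0, sum(probs)) # 1.4: roulette wheel selection
            acc, k = 0.0, topics[-1]
            for t, p in zip(topics, probs):
                acc += p
                if acc >= thred:
                    k = t
                    break
            z[d] = k                              # 1.5: add document d back under topic k
            m_z[k] += 1
            n_z[k] += len(doc)
            for w in doc:
                n_zv[k][w] += 1
    return z, len(m_z)                            # assignments and inferred topic number K
```

The number of non-empty topics remaining in m_z after the final pass gives the topic number K used in the following steps.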
Second step: calculate the semantic weight of each word from its context information and the service tag information, obtaining a document-word semantic weight matrix D, as follows:
2.1 Use the Natural Language Toolkit (NLTK) in Python to part-of-speech tag the words in each Mashup service description document; NLTK is a well-known natural language processing library. The process is as follows:
2.1.1 Traverse each word in the current Mashup service description document and lemmatize it using NLTK;
2.1.2 Use NLTK to extract the word stem and judge whether the word is a noun; if it is, add it to the noun set Nset;
2.1.3 Repeat steps 2.1.1-2.1.2 until all Mashup services are processed;
2.2 Count word frequency information and calculate TF-IDF values, as follows:
2.2.1 Traverse each word in the Mashup service description document, count how many times each word occurs in the current document, and calculate its TF value:
TF_{i,j} = NUM(j) / LEN(i)
where TF_{i,j} is the word frequency of the j-th word in the i-th Mashup service description document, NUM(j) is the number of times the j-th word appears, and LEN(i) is the length of the i-th Mashup text;
2.2.2 Count the number of Mashup service documents in which each word appears and calculate its IDF value:
IDF(x) = log(N / Doc(x))
where IDF(x) is the IDF value of word x, N is the number of Mashup documents, and Doc(x) is the number of Mashup documents containing word x;
2.2.3 Traverse the words in all Mashup documents and calculate their TF-IDF values:
TF-IDF(x) = TF(x) * IDF(x)
where TF-IDF(x) is the TF-IDF value of word x and TF(x) is the TF value of word x;
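A minimal sketch of steps 2.2.1-2.2.3 in Python, assuming tokenized documents; the unsmoothed IDF form log(N/Doc(x)) is an assumption, since the text does not state whether smoothing is applied:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Steps 2.2.1-2.2.3: docs is a list of token lists, one per Mashup service."""
    N = len(docs)
    doc_count = Counter()                      # Doc(x): documents containing word x
    for doc in docs:
        doc_count.update(set(doc))
    weights = []
    for doc in docs:
        num = Counter(doc)                     # NUM(j): occurrences of word j in this document
        weights.append({w: (c / len(doc)) * math.log(N / doc_count[w])  # TF(x) * IDF(x)
                        for w, c in num.items()})
    return weights                             # one {word: TF-IDF} dict per document
```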
2.3 Extract the Mashup service tag information and recalculate the semantic weight of each word in the Mashup service description document based on the noun set Nset and the TF-IDF values, as follows:
2.3.1 Traverse each word w_x in the current Mashup service document and calculate its context semantic weight WeightContext(w_x):
WeightContext(w_x) = (Σ_{w_y ∈ d, w_y ≠ w_x} sim(w_x, w_y)) / N_d
where sim(w_x, w_y) is the similarity of words w_x and w_y computed with the WordNet tool, w_y is a context word of w_x, d is the current Mashup service description document, and N_d is its length. WordNet is an English dictionary that organizes words in a network structure, grouping words with similar meanings; similarity is obtained from the shortest path between the words in the network;
2.3.2 Calculate the service tag semantic weight WeightTag(w_x) of each word:
WeightTag(w_x) = (Σ_{t ∈ Tag_d} sim(w_x, t)) / |Tag_d|
where Tag_d is the service tag set of the current Mashup service document and t is a word in a service tag;
2.3.3 Recalculate the semantic weight of each word based on its TF-IDF value combined with the results of 2.3.1 and 2.3.2, as follows:
2.3.3.1 Traverse each word w_x in the current Mashup service description document and judge whether it is in the noun set Nset; if w_x is in the noun set, recalculate its semantic weight by combining its TF-IDF value with WeightContext(w_x) and WeightTag(w_x); if w_x is not in the noun set Nset, jump to step 2.3.3.2;
2.3.3.2 Assign the semantic weight of the word as its TF-IDF value:
SemWeight(w_x) = TF-IDF(w_x)
2.3.3.3 Repeat 2.3.3.1-2.3.3.2 until all Mashup services are processed, obtaining the document-word semantic weight matrix D;
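A sketch of step 2.3 for one service, assuming NLTK's WordNet data is available (nltk.download('wordnet')). Because the exact combination formula in 2.3.3.1 is not spelled out in the text, the multiplicative combination below is an assumption, flagged in the comments:

```python
from nltk.corpus import wordnet as wn

def path_sim(w1, w2):
    """sim(w_x, w_y): WordNet shortest-path similarity, 0 if either word is unknown."""
    s1, s2 = wn.synsets(w1), wn.synsets(w2)
    if not s1 or not s2:
        return 0.0
    return s1[0].path_similarity(s2[0]) or 0.0

def sem_weight(doc, tags, tfidf, nset):
    """Step 2.3 for one service: doc is its token list, tags its tag words,
    tfidf a {word: TF-IDF} dict, nset the noun set from step 2.1."""
    weights = {}
    for wx in set(doc):
        if wx in nset:                # 2.3.3.1: nouns get context- and tag-aware weights
            ctx = sum(path_sim(wx, wy) for wy in doc if wy != wx) / len(doc)
            tag = sum(path_sim(wx, t) for t in tags) / len(tags) if tags else 0.0
            # assumed combination of TF-IDF, WeightContext, and WeightTag
            weights[wx] = tfidf[wx] * (1.0 + ctx + tag)
        else:                         # 2.3.3.2: non-nouns keep their TF-IDF value
            weights[wx] = tfidf[wx]
    return weights
```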
Third step: count word co-occurrence information and compute the SPPMI matrix, as follows:
3.1 Count word co-occurrence information. Because Mashup service description documents are short, the whole service description document is used as the sliding-window length so that context co-occurrence information can be captured more completely, and the number of times each word co-occurs with every other word is counted, as follows:
3.1.1 For the current Mashup service, compute the length Len of its description document and set the sliding-window length to Len;
3.1.2 Count the co-occurrences of each word with the other words in the Mashup service description document: whenever a context word of the current word (i.e. a word before or after it) lies within the sliding-window distance Len, add 1 to the co-occurrence count of the word and that context word;
3.1.3 Repeat 3.1.2 until all words in the Mashup service are processed;
3.1.4 Repeat 3.1.1-3.1.3 until all Mashup services are processed;
3.2 Calculate the pointwise mutual information (PMI). PMI is widely used to measure the similarity between words: the more often two words co-occur in a text, the stronger their correlation. The PMI formula is:
PMI(x, y) = log(P(x, y) / (P(x) · P(y)))
where x and y are two words, P(x, y) is the probability that x and y co-occur, and P(x) is the probability that x occurs in a context. From the actual number of co-occurrences of word w_j and its context word w_c in the corpus, the PMI value between the two can be calculated as:
PMI(w_j, w_c) = log((#(w_j, w_c) · E) / (#(w_j) · #(w_c)))
where #(w_j, w_c) is the actual number of co-occurrences of w_j and w_c in the corpus, E is the total number of co-occurring context word pairs, #(w_j) = Σ_{w∈Voc} #(w_j, w) is the number of co-occurrences of w_j with other words, and Voc is the corpus vocabulary, i.e. the set of distinct words;
3.3 Calculate the shifted positive pointwise mutual information (SPPMI) matrix from the PMI values:
SPPMI(w_j, w_c) = max(PMI(w_j, w_c) - log κ, 0)
where κ is the negative sampling coefficient. The context SPPMI matrix M of the words is obtained through this formula;
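A sketch of the third step in Python/NumPy, assuming a list of tokenized documents and a fixed vocabulary; kappa stands for the negative sampling coefficient κ:

```python
import numpy as np

def sppmi_matrix(docs, vocab, kappa=5):
    """Third step: document-length windows -> co-occurrence counts -> PMI -> SPPMI."""
    idx = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    co = np.zeros((V, V))
    for doc in docs:                      # 3.1: window length = document length
        for a in range(len(doc)):
            for b in range(len(doc)):
                if a != b:                # every other word in the document co-occurs
                    co[idx[doc[a]], idx[doc[b]]] += 1
    E = co.sum()                          # total co-occurrences of context word pairs
    row = co.sum(axis=1)                  # #(w_j): co-occurrences of w_j with any word
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(co * E / np.outer(row, row))    # 3.2: PMI(w_j, w_c)
    pmi[~np.isfinite(pmi)] = 0.0          # zero-count pairs contribute nothing
    return np.maximum(pmi - np.log(kappa), 0.0)      # 3.3: SPPMI = max(PMI - log k, 0)
```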
Fourth step: with the document-word semantic weight matrix D from the second step and the word context SPPMI matrix M from the third step, obtain a word embedding matrix by decomposing M, then combine the two kinds of information to compute the topic information of each service, as follows:
4.1 Given the global document-word semantic weight matrix D from the second step, decompose it by NMF into the product of a document-topic matrix θ and a topic-word matrix Z. The objective for decomposing D is:
J_1 = min ||D - θZ^T||^2, subject to: θ ≥ 0 and Z ≥ 0, θ ∈ R^(N×K), Z ∈ R^(V×K)
where ||·|| is the L2 norm, N is the number of Mashup documents, K is the number of topics, V is the number of words in the corpus, R is the set of real numbers, and the superscript T denotes matrix transposition. NMF is a matrix decomposition method that expresses one non-negative matrix as the product of two other non-negative matrices under the constraint that all matrix elements are non-negative;
4.2 The word context SPPMI matrix M obtained in the third step is decomposed to introduce word embedding information. The objective for decomposing M is:
J_2 = min ||M - WSW^T||^2
where S is an additional symmetric factor used to approximate M and W is the word embedding matrix of the words;
4.3 Topic information is found through the relation between Mashup service documents and words, while word embedding information is learned from the co-occurrence of word contexts within documents. These two parts are not isolated from each other: semantically related words belong to similar topics and lie close together in the embedding space. Word embeddings are therefore related to their topics, with the relation:
J_3 = min ||Z - WA^T||^2
4.4 In step 4.3, the topic-word matrix Z is decomposed into the product of a topic embedding matrix A and the word embedding matrix W, associating word embeddings with topic information and further improving the accuracy of topic modeling.
Combining steps 4.1, 4.2, and 4.3 gives the objective function of the topic model:
J_4 = min λ_d||D - θZ^T||^2 + λ_w||M - WSW^T||^2 + λ_t||Z - WA^T||^2, subject to: θ ≥ 0 and Z ≥ 0
To solve this objective function, expand it using the matrix trace operation:
J(θ, Z, W, S, A) = λ_d Tr((D - θZ^T)(D - θZ^T)^T) + λ_w Tr((M - WSW^T)(M - WSW^T)^T) + λ_t Tr((Z - WA^T)(Z - WA^T)^T)
where J(θ, Z, W, S, A) is the expanded form of J_4 under the parameters θ, Z, W, S, A. Further expansion gives:
J(θ, Z, W, S, A) = λ_d Tr(DD^T - 2DZθ^T + θZ^TZθ^T) + λ_w Tr(MM^T - 2MWSW^T + WSW^TWSW^T) + λ_t Tr(ZZ^T - 2ZAW^T + WA^TAW^T)
where Tr denotes the matrix trace and λ_d, λ_w, λ_t are weight coefficients of the different parts, used to adjust how much the error of each part influences the result. Introducing the non-negativity constraints through the regularization (Lagrange multiplier) matrices α, β, γ, ω, which also help avoid overfitting, gives the constrained objective:
J = J(θ, Z, W, S, A) + Tr(αθ^T) + Tr(βZ^T) + Tr(γW^T) + Tr(ωA^T)
To minimize the objective function, take its partial derivatives with respect to θ, Z, W, and A. Let α ⊙ θ = 0, β ⊙ Z = 0, γ ⊙ W = 0, ω ⊙ A = 0, where ⊙ denotes the Hadamard product, i.e. the element-wise product of corresponding matrix positions. Setting the derivatives to 0 and applying the Hadamard product yields the following equations:
-2(DZ) ⊙ θ + 2(θZ^TZ) ⊙ θ + α ⊙ θ = 0
-2(λ_d D^Tθ + λ_t WA^T) ⊙ Z + 2(λ_d Zθ^Tθ + λ_t Z) ⊙ Z + β ⊙ Z = 0
-2(λ_w MWS + λ_t ZA) ⊙ W + (λ_t WA^TA + 2λ_w WSW^TWS) ⊙ W + γ ⊙ W = 0
-(Z^TW) ⊙ A + (AW^TW) ⊙ A + ω ⊙ A = 0
These equations lead to the multiplicative parameter updates (division is element-wise):
θ ← θ ⊙ (DZ) / (θZ^TZ)
Z ← Z ⊙ (λ_d D^Tθ + λ_t WA^T) / (λ_d Zθ^Tθ + λ_t Z)
W ← W ⊙ 2(λ_w MWS + λ_t ZA) / (λ_t WA^TA + 2λ_w WSW^TWS)
A ← A ⊙ (Z^TW) / (AW^TW)
Through this parameter updating scheme, the Mashup service document-topic matrix θ, the topic-word matrix Z, the word embedding matrix W, and the topic embedding matrix A can be solved;
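A compact NumPy sketch of the multiplicative updates above. Dimensions follow the text (θ ∈ R^(N×K), Z ∈ R^(V×K), W ∈ R^(V×r), A ∈ R^(K×r)); the embedding dimension r, the random initialization, and the fixed symmetric factor S are assumptions, since the text gives no update rule for S:

```python
import numpy as np

def joint_nmf(D, M, K, r=50, lam_d=1.0, lam_w=1.0, lam_t=1.0, iters=200, eps=1e-9):
    """Fourth step: jointly factorize D ~ theta Z^T, M ~ W S W^T, Z ~ W A^T with the
    multiplicative updates above. S is kept fixed at the identity (an assumption)."""
    N, V = D.shape
    rng = np.random.default_rng(0)
    theta = rng.random((N, K))            # document-topic matrix, N x K
    Z = rng.random((V, K))                # topic-word matrix, V x K
    W = rng.random((V, r))                # word embedding matrix, V x r
    A = rng.random((K, r))                # topic embedding matrix, K x r
    S = np.eye(r)                         # fixed symmetric factor
    for _ in range(iters):
        theta *= (D @ Z) / (theta @ (Z.T @ Z) + eps)
        Z *= (lam_d * (D.T @ theta) + lam_t * (W @ A.T)) / \
             (lam_d * (Z @ (theta.T @ theta)) + lam_t * Z + eps)
        W *= (2 * (lam_w * (M @ W @ S) + lam_t * (Z @ A))) / \
             (lam_t * (W @ (A.T @ A)) + 2 * lam_w * (W @ S @ W.T @ W @ S) + eps)
        A *= (Z.T @ W) / (A @ (W.T @ W) + eps)
    return theta, Z, W, A
```

The rows of θ are the per-service topic features handed to the fifth step.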
Fifth step: use the Mashup service topic features obtained in 4.4 as the input of spectral clustering. Spectral clustering is an algorithm that evolved from graph theory and has since been widely applied to clustering. Its main idea is to regard all data as points in space connected by edges: the edge weight between two distant points is low, while the edge weight between two close points is high. The graph formed by all data points is cut so that the edge weights between different subgraphs are as low as possible while the edge weights within each subgraph are as high as possible, thereby achieving the clustering goal. The steps are as follows:
5.1 Calculate the similarity matrix SI, where the similarity between service topic features is computed with a Gaussian kernel. Let θ_i denote the topic features of Mashup service i, δ a scale parameter, and exp the exponential function with the natural constant e as base; the Gaussian kernel is:
SI_ij = exp(-||θ_i - θ_j||^2 / (2δ^2))
5.2 Sum the elements of each column of SI and place each sum as an element on the diagonal of the degree matrix G:
G_ii = Σ_j SI_ij
5.3 Compute the Laplacian matrix L = G - SI from G;
5.4 Use the eig function in Python to compute the eigenvalues and eigenvectors of L; the service document feature vector matrix F is found by solving:
argmin_F Tr(F^T L F), subject to: F^T F = I
where argmin_F denotes the value of F that minimizes Tr(F^T L F) and Tr denotes the matrix trace;
5.5 Sort the eigenvalues from small to large and take the eigenvectors of the first C eigenvalues, where C is the specified number of clusters, as the initial cluster centers;
5.6 Compute the Euclidean distance dist from each feature vector to each cluster center and assign each Mashup service to the cluster with the minimum distance:
dist = sqrt(Σ_i (f_i - Ce_i)^2)
where f_i is the i-th value in the feature vector f and Ce_i is the i-th value of the cluster center vector Ce;
5.7 Update each cluster center to the mean of the feature vectors in its cluster;
5.8 Compute the Euclidean distance between the new and old cluster centers as the error value;
5.9 Repeat steps 5.6-5.8 until the error is less than a given threshold or the number of iterations reaches the maximum.
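A NumPy sketch of the fifth step. np.linalg.eigh is used in place of a generic eig call because L is symmetric, and the cluster centers are initialized from random rows of F, which is an assumption where the text takes the eigenvectors of the first C eigenvalues as initial centers:

```python
import numpy as np

def spectral_cluster(theta, C, delta=1.0, iters=100, tol=1e-6):
    """Fifth step: Gaussian-kernel similarity, unnormalized Laplacian, eigenvectors,
    then a K-means-style assignment loop (steps 5.1-5.9)."""
    n = theta.shape[0]
    diff = theta[:, None, :] - theta[None, :, :]
    SI = np.exp(-(diff ** 2).sum(axis=2) / (2 * delta ** 2))   # 5.1: similarity matrix
    G = np.diag(SI.sum(axis=1))                                # 5.2: degree matrix
    L = G - SI                                                 # 5.3: Laplacian
    vals, vecs = np.linalg.eigh(L)                             # 5.4: eigen-decomposition
    F = vecs[:, np.argsort(vals)[:C]]                          # 5.5: C smallest eigenvalues
    rng = np.random.default_rng(0)
    centers = F[rng.choice(n, size=C, replace=False)]          # assumed initialization
    labels = np.zeros(n, dtype=int)
    for _ in range(iters):
        dist = np.linalg.norm(F[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)                           # 5.6: nearest center
        new = np.array([F[labels == c].mean(axis=0) if (labels == c).any() else centers[c]
                        for c in range(C)])                    # 5.7: mean of each cluster
        if np.linalg.norm(new - centers) < tol:                # 5.8-5.9: stop on small error
            break
        centers = new
    return labels
```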
The embodiments described in this specification are merely illustrative of the inventive concept. The scope of the invention is not limited to the specific forms set forth in the embodiments; it also covers equivalent technical means that a person skilled in the art could conceive based on the inventive concept.

Claims (9)

1. An NLP-oriented Mashup service spectral clustering method based on GSDPMM and a topic model, characterized by comprising the following steps:
First step: calculate the number of topics of the Mashup services by the GSDPMM method, as follows:
1.1 Initialize z, n_z, n_zv, and m_z: all elements of n_z, n_zv, and m_z are 0, all elements of z are 1; set the initial topic number K to 1 and the iteration count Iter; z records the topic to which each document belongs, n_z counts the number of words under each topic, n_zv counts the occurrences of each word under each topic, and m_z counts the number of documents under each topic, with z ∈ R^(1×N), n_z ∈ R^(1×K), n_zv ∈ R^(K×V), m_z ∈ R^(1×K), where N is the number of Mashup services and V is the number of distinct words in the corpus;
1.2 Traverse all Mashup services and compute n_z and n_zv;
1.3 Perform the Gibbs sampling operation on all Mashup services;
1.4 Select the topic of document d by roulette wheel selection;
1.5 According to the topic k of the current document d, increase m_z[k] by 1 and n_z[k] by Len, where Len is the length of the document;
1.6 Repeat steps 1.3-1.5 until all Mashup services are processed;
1.7 Repeat steps 1.3-1.6 until the iteration count Iter is reached;
Second step: calculate the semantic weight of each word from its context information and the service tag information, obtaining a document-word semantic weight matrix D, as follows:
2.1 Use the Natural Language Toolkit (NLTK) in Python to part-of-speech tag the words in each Mashup service description document;
2.2 Count word frequency information and calculate TF-IDF values;
2.3 Extract the Mashup service tag information and recalculate the semantic weight of each word in the Mashup service description document based on the noun set Nset and the TF-IDF values;
Third step: count word co-occurrence information and compute the SPPMI matrix, as follows:
3.1 Count word co-occurrence information: because Mashup service description documents are short, the whole service description document is used as the sliding-window length so that context co-occurrence information can be captured more completely, and the number of times each word co-occurs with every other word is counted;
3.2 Calculate the pointwise mutual information (PMI):
PMI(x, y) = log(P(x, y) / (P(x) · P(y)))
where x and y are two words, P(x, y) is the probability that x and y co-occur, and P(x) is the probability that x occurs in a context; from the actual number of co-occurrences of word w_j and its context word w_c in the corpus, the PMI value between the two is calculated as:
PMI(w_j, w_c) = log((#(w_j, w_c) · E) / (#(w_j) · #(w_c)))
where #(w_j, w_c) is the actual number of co-occurrences of w_j and w_c in the corpus, E is the total number of co-occurring context word pairs, #(w_j) = Σ_{w∈Voc} #(w_j, w) is the number of co-occurrences of w_j with other words, and Voc is the corpus vocabulary, i.e. the set of distinct words;
3.3 Calculate the shifted positive pointwise mutual information (SPPMI) matrix from the PMI values:
SPPMI(w_j, w_c) = max(PMI(w_j, w_c) - log κ, 0)
where κ is the negative sampling coefficient; the context SPPMI matrix M of the words is obtained through this formula;
Fourth step: with the document-word semantic weight matrix D from the second step and the word context SPPMI matrix M from the third step, obtain a word embedding matrix by decomposing M, then combine the two kinds of information to compute the topic information of each service, as follows:
4.1 Given the global document-word semantic weight matrix D from the second step, decompose it by NMF into the product of a document-topic matrix θ and a topic-word matrix Z. The objective for decomposing D is:
J_1 = min ||D - θZ^T||^2, subject to: θ ≥ 0 and Z ≥ 0, θ ∈ R^(N×K), Z ∈ R^(V×K)
where ||·|| is the L2 norm, N is the number of Mashup documents, K is the number of topics, V is the number of words in the corpus, R is the set of real numbers, and the superscript T denotes matrix transposition; NMF is a matrix decomposition method that expresses one non-negative matrix as the product of two other non-negative matrices under the constraint that all matrix elements are non-negative;
4.2 The word context SPPMI matrix M obtained in the third step is decomposed to introduce word embedding information. The objective for decomposing M is:
J_2 = min ||M - WSW^T||^2
where S is an additional symmetric factor used to approximate M and W is the word embedding matrix of the words;
4.3 Topic information is found through the relation between Mashup service documents and words, while word embedding information is learned from the co-occurrence of word contexts within documents; these two parts are not isolated from each other: semantically related words belong to similar topics and lie close together in the embedding space, so word embeddings are related to their topics, with the relation:
J_3 = min ||Z - WA^T||^2
4.4 In step 4.3, decompose the topic-word matrix Z into the product of a topic embedding matrix A and the word embedding matrix W, associating word embeddings with topic information and further improving the accuracy of topic modeling;
combining steps 4.1, 4.2, and 4.3 gives the objective function of the topic model:
J_4 = min λ_d||D - θZ^T||^2 + λ_w||M - WSW^T||^2 + λ_t||Z - WA^T||^2, subject to: θ ≥ 0 and Z ≥ 0
To solve this objective function, expand it using the matrix trace operation:
J(θ, Z, W, S, A) = λ_d Tr((D - θZ^T)(D - θZ^T)^T) + λ_w Tr((M - WSW^T)(M - WSW^T)^T) + λ_t Tr((Z - WA^T)(Z - WA^T)^T)
where J(θ, Z, W, S, A) is the expanded form of J_4 under the parameters θ, Z, W, S, A; further expansion gives:
J(θ, Z, W, S, A) = λ_d Tr(DD^T - 2DZθ^T + θZ^TZθ^T) + λ_w Tr(MM^T - 2MWSW^T + WSW^TWSW^T) + λ_t Tr(ZZ^T - 2ZAW^T + WA^TAW^T)
where Tr denotes the matrix trace and λ_d, λ_w, λ_t are weight coefficients of the different parts, used to adjust how much the error of each part influences the result; introducing the non-negativity constraints through the regularization (Lagrange multiplier) matrices α, β, γ, ω, which also help avoid overfitting, gives the constrained objective:
J = J(θ, Z, W, S, A) + Tr(αθ^T) + Tr(βZ^T) + Tr(γW^T) + Tr(ωA^T)
To minimize the objective function, take its partial derivatives with respect to θ, Z, W, and A; let α ⊙ θ = 0, β ⊙ Z = 0, γ ⊙ W = 0, ω ⊙ A = 0, where ⊙ denotes the Hadamard product, i.e. the element-wise product of corresponding matrix positions; setting the derivatives to 0 and applying the Hadamard product yields the following equations:
-2(DZ) ⊙ θ + 2(θZ^TZ) ⊙ θ + α ⊙ θ = 0
-2(λ_d D^Tθ + λ_t WA^T) ⊙ Z + 2(λ_d Zθ^Tθ + λ_t Z) ⊙ Z + β ⊙ Z = 0
-2(λ_w MWS + λ_t ZA) ⊙ W + (λ_t WA^TA + 2λ_w WSW^TWS) ⊙ W + γ ⊙ W = 0
-(Z^TW) ⊙ A + (AW^TW) ⊙ A + ω ⊙ A = 0
These equations lead to the multiplicative parameter updates (division is element-wise):
θ ← θ ⊙ (DZ) / (θZ^TZ)
Z ← Z ⊙ (λ_d D^Tθ + λ_t WA^T) / (λ_d Zθ^Tθ + λ_t Z)
W ← W ⊙ 2(λ_w MWS + λ_t ZA) / (λ_t WA^TA + 2λ_w WSW^TWS)
A ← A ⊙ (Z^TW) / (AW^TW)
Through this parameter updating scheme, solve the Mashup service document-topic matrix θ, the topic-word matrix Z, the word embedding matrix W, and the topic embedding matrix A;
Fifth step: use the Mashup service topic features obtained in step 4.4 as the input of spectral clustering, as follows:
5.1 Calculate the similarity matrix SI, where the similarity between service topic features is computed with a Gaussian kernel; θ_i denotes the topic features of Mashup service i, δ is a scale parameter, and exp is the exponential function with the natural constant e as base:
SI_ij = exp(-||θ_i - θ_j||^2 / (2δ^2))
5.2 Sum the elements of each column of SI and place each sum as an element on the diagonal of the degree matrix G:
G_ii = Σ_j SI_ij
5.3 Compute the Laplacian matrix L = G - SI from G;
5.4 Use the eig function in Python to compute the eigenvalues and eigenvectors of L; the service document feature vector matrix F is found by solving:
argmin_F Tr(F^T L F), subject to: F^T F = I
where argmin_F denotes the value of F that minimizes Tr(F^T L F) and Tr denotes the matrix trace;
5.5 Sort the eigenvalues from small to large and take the eigenvectors of the first C eigenvalues, where C is the specified number of clusters, as the initial cluster centers;
5.6 Compute the Euclidean distance dist from each feature vector to each cluster center and assign each Mashup service to the cluster with the minimum distance:
dist = sqrt(Σ_i (f_i - Ce_i)^2)
where f_i is the i-th value in the feature vector f and Ce_i is the i-th value of the cluster center vector Ce;
5.7 Update each cluster center to the mean of the feature vectors in its cluster;
5.8 Compute the Euclidean distance between the new and old cluster centers as the error value;
5.9 Repeat steps 5.6-5.8 until the error is less than a given threshold or the number of iterations reaches the maximum.
2. The NLP-oriented Mashup service spectral clustering method based on GSDPMM and a topic model as claimed in claim 1, wherein the process of 1.2 is as follows:
1.2.1 According to z_d, obtain the topic k of the current document d, increase m_z[k] by 1 and n_z[k] by Len, where Len is the length of the document;
1.2.2 Traverse each word w in document d, increasing n_zv[k][w] by 1;
1.2.3 Repeat 1.2.1-1.2.2 until all Mashup services are processed.
3. The NLP-oriented Mashup service spectral clustering method based on GSDPMM and a topic model as claimed in claim 1 or 2, wherein the process of 1.3 is as follows:
1.3.1 According to z_d, obtain the topic k of the current document d, decrease m_z[k] by 1 and n_z[k] by Len, where Len is the length of the document;
1.3.2 Traverse each word w in document d, decreasing n_zv[k][w] by 1;
1.3.3 Traverse each existing topic z and calculate the probability of document d on that topic:
p(z_d = z | z_¬d, d) ∝ (m_z^¬d / (N - 1 + αN)) · (Π_{w∈d} Π_{j=1}^{N_d^w} (n_z^{w,¬d} + β + j - 1)) / (Π_{i=1}^{N_d} (n_z^¬d + Vβ + i - 1))
1.3.4 Calculate the probability of document d under a new topic:
p(z_d = K+1 | z_¬d, d) ∝ (αN / (N - 1 + αN)) · (Π_{w∈d} Π_{j=1}^{N_d^w} (β + j - 1)) / (Π_{i=1}^{N_d} (Vβ + i - 1))
where α and β are hyperparameters, z_d is the topic of the current document d, the superscript ¬d marks statistics computed without the information of document d, z_¬d is the topic assignment of every document excluding document d, m_z^¬d is the number of documents under topic z after removing document d, n_z^{w,¬d} is the number of occurrences of word w in topic z without document d, n_z^¬d is the number of words in topic z without document d, N_d is the number of words in document d, and N_d^w is the number of occurrences of word w in document d.
4. The NLP-oriented Mashup service spectral clustering method based on GSDPMM and a topic model as claimed in claim 1 or 2, wherein the process of 1.4 is as follows:
1.4.1 accumulating the probability of the document d under each theme to obtain a total probability prob;
1.4.2 randomly generating a random number thred in [0, prob ];
1.4.3 accumulating the probability of the document d under each topic, if the accumulated sum of the current topic k is more than or equal to thred, the topic of the document d is k.
5. The NLP-oriented Mashup service spectral clustering method based on GSDPMM and a topic model as claimed in claim 1 or 2, wherein the process of 2.1 is as follows:
2.1.1 Traverse each word in the current Mashup service description document and lemmatize it using NLTK;
2.1.2 Use NLTK to extract the word stem and judge whether the word is a noun; if it is, add it to the noun set Nset;
2.1.3 Repeat steps 2.1.1-2.1.2 until all Mashup services are processed.
6. The NLP-oriented Mashup service spectral clustering method based on GSDPMM and a topic model as claimed in claim 1 or 2, wherein the process of 2.2 is as follows:
2.2.1 Traverse each word in the Mashup service description document, count how many times each word occurs in the current document, and calculate its TF value:
TF_{i,j} = NUM(j) / LEN(i)
where TF_{i,j} is the word frequency of the j-th word in the i-th Mashup service description document, NUM(j) is the number of times the j-th word appears, and LEN(i) is the length of the i-th Mashup text;
2.2.2 Count the number of Mashup service documents in which each word appears and calculate its IDF value:
IDF(x) = log(N / Doc(x))
where IDF(x) is the IDF value of word x, N is the number of Mashup documents, and Doc(x) is the number of Mashup documents containing word x;
2.2.3 Traverse the words in all Mashup documents and calculate their TF-IDF values:
TF-IDF(x) = TF(x) * IDF(x)
where TF-IDF(x) is the TF-IDF value of word x and TF(x) is the TF value of word x.
7. The Mashup service spectrum clustering method for NLP based on GSDPMM and topic model as claimed in claim 1 or 2, wherein the process of 2.3 is as follows:
2.3.1 traversing each word w x in the current Mashup service document to calculate the weight information WeightContext (w x) of the upper and lower Wen Yuyi, wherein the calculation formula is as follows:
Where sim (w x,wy) represents the similarity of words w x and w y, calculated by the WordNet tool, w y is the context word of w x, d represents the current Mashup service description document, and N d represents the length of the current Mashup service description document; wordNet is an English dictionary, words are organized through a net structure, words with similar meanings are divided into a group, and similarity is obtained through returning the shortest paths of the words among networks;
2.3.2 calculating the service tag semantic weight WeightTag(wx) of each word by the following formula:
WeightTag(wx) = Σ(t∈Tagd) sim(wx, t) / |Tagd|
wherein Tagd represents the service tag set of the current Mashup service document, and t represents a word in the service tags;
2.3.3 recalculating the semantic weights of the words based on the TF-IDF values in combination with the calculation results in 2.3.1 and 2.3.2.
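A minimal Python sketch of 2.3.1 and 2.3.2 using NLTK's WordNet interface; taking path similarity between the first synsets, and averaging over the document length and the tag-set size, are assumptions:

    from nltk.corpus import wordnet as wn   # needs the wordnet data package

    def wordnet_sim(w1, w2):
        # shortest-path similarity between the first synsets of the two words
        s1, s2 = wn.synsets(w1), wn.synsets(w2)
        if not s1 or not s2:
            return 0.0
        return s1[0].path_similarity(s2[0]) or 0.0

    def weight_context(wx, doc):
        # 2.3.1 average WordNet similarity of wx to the context words of d
        return sum(wordnet_sim(wx, wy) for wy in doc if wy != wx) / len(doc)

    def weight_tag(wx, tags):
        # 2.3.2 average WordNet similarity of wx to the service tag words
        return sum(wordnet_sim(wx, t) for t in tags) / len(tags) if tags else 0.0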
8. The Mashup service spectrum clustering method for NLP based on GSDPMM and topic model as claimed in claim 7, wherein the operation of 2.3.3 is as follows:
2.3.3.1 traversing each word wx in the current Mashup service description document and judging whether it is in the noun set NSet; if wx is in the noun set, recalculating its semantic weight by combining the TF-IDF value with WeightContext(wx) and WeightTag(wx); if wx is not in the noun set NSet, jumping to step 2.3.3.2;
2.3.3.2 assigning the semantic weight of the word as its TF-IDF value, the calculation formula is as follows:
SemWeight(wx)=TF-IDF(wx)
2.3.3.3 repeating steps 2.3.3.1-2.3.3.2 until all Mashup services are processed, obtaining the document-word semantic weight matrix D.
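A minimal sketch of the branch in 2.3.3; the multiplicative combination for nouns is an assumption, since claim 8 states only that the TF-IDF value is combined with the context and tag weights:

    def sem_weight(wx, tf_idf_x, context_w, tag_w, nset):
        # context_w, tag_w: WeightContext(wx) and WeightTag(wx) from 2.3.1/2.3.2
        if wx in nset:
            # 2.3.3.1 nouns: TF-IDF boosted by context and tag semantics (assumed form)
            return tf_idf_x * (1 + context_w + tag_w)
        return tf_idf_x                      # 2.3.3.2 non-nouns keep their TF-IDF value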
9. The Mashup service spectrum clustering method for NLP based on GSDPMM and topic model as claimed in claim 1 or 2, wherein the process of 3.1 is as follows:
3.1.1 for the current Mashup service, calculating the length Len of the Mashup service description document, and setting the length of a sliding window as Len;
3.1.2 counting the co-occurrence of each word with the other words of the Mashup service description document: if a context word of the current word (a word before or after it) lies within the sliding-window distance Len, adding 1 to the co-occurrence count of the word and that context word;
3.1.3 repeating 3.1.2 until all words in Mashup are processed;
3.1.4 repeat 3.1.1-3.1.3 until all Mashup services are processed.
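A minimal Python sketch of 3.1; with the window set to the document length Len, every word pair in the document co-occurs, but the window bound is kept explicit to mirror the claim:

    from collections import defaultdict

    def cooccurrence_counts(doc):
        # doc: tokenized Mashup service description document
        Len = len(doc)                        # 3.1.1 window length = Len
        counts = defaultdict(int)
        for i, w in enumerate(doc):           # 3.1.2-3.1.3 count word pairs
            for j in range(max(0, i - Len), min(len(doc), i + Len + 1)):
                if j != i:
                    counts[(w, doc[j])] += 1  # +1 per co-occurrence in window
        return counts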
CN202110097170.6A 2021-01-25 2021-01-25 NLP-oriented Mashup service spectrum clustering method based on GSDPMM and topic model Active CN112836491B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110097170.6A CN112836491B (en) 2021-01-25 2021-01-25 NLP-oriented Mashup service spectrum clustering method based on GSDPMM and topic model

Publications (2)

Publication Number Publication Date
CN112836491A CN112836491A (en) 2021-05-25
CN112836491B 2024-05-07

Family

ID=75931371

Country Status (1)

Country Link
CN (1) CN112836491B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117093935B (en) * 2023-10-16 2024-03-19 深圳海云安网络安全技术有限公司 Classification method and system for service system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111695347A * 2019-03-15 2020-09-22 Baidu (USA) LLC System and method for topic discovery and word embedding for mutual learning
CN110390014A * 2019-07-17 2019-10-29 Tencent Technology (Shenzhen) Co., Ltd. Topic crawling method, apparatus and storage medium
CN110717047A * 2019-10-22 2020-01-21 Hunan University of Science and Technology Web service classification method based on graph convolution neural network
CN111475609A * 2020-02-28 2020-07-31 Zhejiang University of Technology Improved K-means service clustering method around topic modeling

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Collaboratively Improving Topic Discovery and Word Embeddings by Coordinating Global and Local Contexts; Guangxu Xun et al.; Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; full text *
Non-negative Matrix Factorization Meets Word Embedding; Melissa Ailem et al.; Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval; full text *
Research on API Recommendation Methods Based on Semantic Representation Clustering of Mashup Services; Zhu Shumiao; China Master's Theses Full-text Database (No. 07); full text *
Domain Tag-Assisted Service Clustering Method; Tian Gang et al.; Acta Electronica Sinica; Vol. 43, No. 7; full text *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant