CN112836491B - NLP-oriented Mashup service spectral clustering method based on GSDPMM and topic model

NLP-oriented Mashup service spectral clustering method based on GSDPMM and topic model

Info

Publication number
CN112836491B
CN112836491B (granted publication of application CN202110097170.6A)
Authority
CN
China
Prior art keywords
word
document
matrix
mashup
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110097170.6A
Other languages
Chinese (zh)
Other versions
CN112836491A (en)
Inventor
陆佳炜
赵伟
郑嘉弘
马超治
程振波
徐俊
高飞
肖刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT
Priority to CN202110097170.6A
Publication of CN112836491A
Application granted
Publication of CN112836491B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G06F16/35 Clustering; Classification
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

An NLP-oriented Mashup service spectral clustering method based on GSDPMM and a topic model comprises the following steps. First step: calculate the number of topics of the Mashup services by GSDPMM. Second step: calculate the semantic weight of each word from its context information and the service tag information, obtaining a document-word semantic weight matrix D. Third step: count word co-occurrence information and compute the SPPMI matrix M. Fourth step: based on the document-word semantic weight matrix D and the SPPMI matrix M, obtain a word embedding matrix by decomposing M, combine the two kinds of information, and compute the topic information of each service. Fifth step: use the obtained Mashup service topic features as the input of spectral clustering. The invention fuses an optimized word embedding and word semantic weight calculation method to alleviate the sparsity problem caused by short texts and find a better solution set.

Description

NLP-oriented Mashup service spectral clustering method based on GSDPMM and topic model
Technical Field
The invention relates to an NLP-oriented Mashup service spectral clustering method based on GSDPMM and a topic model.
Background
With the development of cloud computing and the service-oriented idea of service computing, more and more companies publish data, resources, or related functionality on the Internet in the form of Web services to improve information utilization and their own competitiveness. However, traditional Web services based on the SOAP protocol suffer from a complex technical stack and poor extensibility, and they struggle to adapt to the complex and changeable application scenarios of real life. To overcome these problems, a lightweight information service composition pattern, the Mashup technique, has developed on the Internet in recent years: various Web APIs can be mixed and combined to build brand-new Web services, addressing the difficulty traditional services have in adapting to complex and changeable application environments.
With the rapid growth of Mashup services, finding high-quality services among the many available ones has become a hotspot problem of wide concern. Natural Language Processing (NLP) is an important research direction in computer science and artificial intelligence that studies how computers process, understand, and apply human language. Mashup service description documents are written in natural language, so NLP methods are needed to let the computer understand what a service describes.
At present, existing methods mainly use Latent Dirichlet Allocation (LDA) or Non-negative Matrix Factorization (NMF) to obtain Mashup service topic features and then perform clustering. However, Mashup service description documents are usually short, with sparse features and little information. LDA handles short texts far less well than long texts, so most current topic models struggle to model short texts that lack training corpus; words in a short text generally occur only once, high-frequency word information is missing, and a term frequency-inverse document frequency (TF-IDF) model can hardly compute meaningful semantic weights for the words. In addition, topic models such as LDA and NMF generally require the number of topics to be specified, yet the number of service topics is difficult to determine in advance. Meanwhile, most current service clustering algorithms apply K-means to the final topic feature values, but the traditional K-means algorithm suffers from the randomness of its initial cluster centers and cannot find non-convex clusters, so the clustering quality may be unsatisfactory.
Disclosure of Invention
To solve the problems that traditional topic models lack modeling capacity for short texts, that the number of topics is hard to determine, and that the low quality of the K-means clustering algorithm leads to low Mashup service clustering quality, the invention provides an NLP-oriented Mashup service spectral clustering method based on GSDPMM (a collapsed Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture model) and a topic model. The method performs topic mining on Mashup services with non-negative matrix factorization (NMF), introduces an improved Gibbs-sampling Dirichlet Process Mixture Model (DPMM) to determine the number of topics automatically, fuses an optimized word embedding and word semantic weight calculation method to alleviate the sparsity problem caused by short texts, and finally clusters the topic features of the Mashup services with a spectral clustering algorithm to find a better solution set.
The technical scheme adopted by the invention to solve these problems is as follows:
An NLP-oriented Mashup service spectral clustering method based on GSDPMM and a topic model comprises the following steps:
First step: calculate the number of topics of the Mashup services by the GSDPMM method, as follows:
1.1 Initialize z, n_z, n_zv, and m_z: all elements of n_z, n_zv, and m_z are 0, all elements of z are 1; set the initial topic number K to 1 and the iteration count Iter. Here z records the topic to which each document belongs, n_z counts the number of words under each topic, n_zv counts the occurrences of each word under each topic, and m_z counts the number of documents under each topic, with z ∈ R^(1×N), n_z ∈ R^(1×K), n_zv ∈ R^(K×V), m_z ∈ R^(1×K), where N is the number of Mashup services and V is the number of distinct words in the corpus;
1.2 Traverse all Mashup services and compute n_z and n_zv;
1.3 Perform the Gibbs sampling operation on all Mashup services;
1.4 Select the topic of document d by roulette wheel selection;
1.5 According to the topic k of the current document d, increase m_z[k] by 1 and n_z[k] by Len, where Len is the length of the document;
1.6 Repeat steps 1.3-1.5 until all Mashup services are processed;
1.7 Repeat steps 1.3-1.6 until the iteration count Iter is reached;
Second step: calculate the semantic weight of each word from its context information and the service tag information, obtaining a document-word semantic weight matrix D, as follows:
2.1 Use the Natural Language Toolkit (NLTK) in Python to part-of-speech tag the words in each Mashup service description document; NLTK is a well-known natural language processing library;
2.2 Count word frequency information and calculate TF-IDF values;
2.3 Extract the Mashup service tag information and recalculate the semantic weight of each word in the Mashup service description document based on the noun set Nset and the TF-IDF values;
Third step: count word co-occurrence information and compute the SPPMI matrix, as follows:
3.1 Count word co-occurrence information. Because Mashup service description documents are short, the whole service description document is used as the sliding-window length so that context co-occurrence information can be captured more completely, and the number of times each word co-occurs with every other word is counted;
3.2 Calculate the pointwise mutual information (PMI). PMI is widely used to measure the similarity between words: the more often two words co-occur in a text, the stronger their correlation. The PMI formula is:
PMI(x, y) = log(P(x, y) / (P(x) · P(y)))
where x and y are two words, P(x, y) is the probability that x and y co-occur, and P(x) is the probability that x occurs in a context. From the actual number of co-occurrences of word w_j and its context word w_c in the corpus, the PMI value between the two can be calculated as:
PMI(w_j, w_c) = log((#(w_j, w_c) · E) / (#(w_j) · #(w_c)))
where #(w_j, w_c) is the actual number of co-occurrences of w_j and w_c in the corpus, E is the total number of co-occurring context word pairs, #(w_j) = Σ_{w∈Voc} #(w_j, w) is the number of co-occurrences of w_j with other words, and Voc is the corpus vocabulary, i.e. the set of distinct words;
3.3 Calculate the shifted positive pointwise mutual information (SPPMI) matrix from the PMI values:
SPPMI(w_j, w_c) = max(PMI(w_j, w_c) - log κ, 0)
where κ is the negative sampling coefficient. The context SPPMI matrix M of the words is obtained through this formula;
Fourth step: with the document-word semantic weight matrix D from the second step and the word context SPPMI matrix M from the third step, obtain a word embedding matrix by decomposing M, then combine the two kinds of information to compute the topic information of each service, as follows:
4.1 Given the global document-word semantic weight matrix D from the second step, decompose it by NMF into the product of a document-topic matrix θ and a topic-word matrix Z. The objective for decomposing D is:
J_1 = min ||D - θZ^T||^2, subject to: θ ≥ 0 and Z ≥ 0, θ ∈ R^(N×K), Z ∈ R^(V×K)
where ||·|| is the L2 norm, N is the number of Mashup documents, K is the number of topics, V is the number of words in the corpus, R is the set of real numbers, and the superscript T denotes matrix transposition. NMF is a matrix decomposition method that expresses one non-negative matrix as the product of two other non-negative matrices under the constraint that all matrix elements are non-negative;
4.2 The word context SPPMI matrix M obtained in the third step is decomposed to introduce word embedding information. The objective for decomposing M is:
J_2 = min ||M - WSW^T||^2
where S is an additional symmetric factor used to approximate M and W is the word embedding matrix of the words;
4.3 Topic information is found through the relation between Mashup service documents and words, while word embedding information is learned from the co-occurrence of word contexts within documents. These two parts are not isolated from each other: semantically related words belong to similar topics and lie close together in the embedding space. Word embeddings are therefore related to their topics, with the relation:
J_3 = min ||Z - WA^T||^2
4.4 In step 4.3, the topic-word matrix Z is decomposed into the product of a topic embedding matrix A and the word embedding matrix W, associating word embeddings with topic information and further improving the accuracy of topic modeling.
Combining steps 4.1, 4.2, and 4.3 gives the objective function of the topic model:
J_4 = min λ_d||D - θZ^T||^2 + λ_w||M - WSW^T||^2 + λ_t||Z - WA^T||^2, subject to: θ ≥ 0 and Z ≥ 0
To solve this objective function, expand it using the matrix trace operation:
J(θ, Z, W, S, A) = λ_d Tr((D - θZ^T)(D - θZ^T)^T) + λ_w Tr((M - WSW^T)(M - WSW^T)^T) + λ_t Tr((Z - WA^T)(Z - WA^T)^T)
where J(θ, Z, W, S, A) is the expanded form of J_4 under the parameters θ, Z, W, S, A. Further expansion gives:
J(θ, Z, W, S, A) = λ_d Tr(DD^T - 2DZθ^T + θZ^TZθ^T) + λ_w Tr(MM^T - 2MWSW^T + WSW^TWSW^T) + λ_t Tr(ZZ^T - 2ZAW^T + WA^TAW^T)
where Tr denotes the matrix trace and λ_d, λ_w, λ_t are weight coefficients of the different parts, used to adjust how much the error of each part influences the result. Introducing the non-negativity constraints through the regularization (Lagrange multiplier) matrices α, β, γ, ω, which also help avoid overfitting, gives the constrained objective:
J = J(θ, Z, W, S, A) + Tr(αθ^T) + Tr(βZ^T) + Tr(γW^T) + Tr(ωA^T)
To minimize the objective function, take its partial derivatives with respect to θ, Z, W, and A. Let α ⊙ θ = 0, β ⊙ Z = 0, γ ⊙ W = 0, ω ⊙ A = 0, where ⊙ denotes the Hadamard product, i.e. the element-wise product of corresponding matrix positions. Setting the derivatives to 0 and applying the Hadamard product yields the following equations:
-2(DZ) ⊙ θ + 2(θZ^TZ) ⊙ θ + α ⊙ θ = 0
-2(λ_d D^Tθ + λ_t WA^T) ⊙ Z + 2(λ_d Zθ^Tθ + λ_t Z) ⊙ Z + β ⊙ Z = 0
-2(λ_w MWS + λ_t ZA) ⊙ W + (λ_t WA^TA + 2λ_w WSW^TWS) ⊙ W + γ ⊙ W = 0
-(Z^TW) ⊙ A + (AW^TW) ⊙ A + ω ⊙ A = 0
These equations lead to the multiplicative parameter updates (division is element-wise):
θ ← θ ⊙ (DZ) / (θZ^TZ)
Z ← Z ⊙ (λ_d D^Tθ + λ_t WA^T) / (λ_d Zθ^Tθ + λ_t Z)
W ← W ⊙ 2(λ_w MWS + λ_t ZA) / (λ_t WA^TA + 2λ_w WSW^TWS)
A ← A ⊙ (Z^TW) / (AW^TW)
Through this parameter updating scheme, the Mashup service document-topic matrix θ, the topic-word matrix Z, the word embedding matrix W, and the topic embedding matrix A can be solved;
Fifth step: use the Mashup service topic features obtained in 4.4 as the input of spectral clustering. Spectral clustering is an algorithm that evolved from graph theory and has since been widely applied to clustering. Its main idea is to regard all data as points in space connected by edges: the edge weight between two distant points is low, while the edge weight between two close points is high. The graph formed by all data points is cut so that the edge weights between different subgraphs are as low as possible while the edge weights within each subgraph are as high as possible, thereby achieving the clustering goal. The steps are as follows:
5.1 Calculate the similarity matrix SI, where the similarity between service topic features is computed with a Gaussian kernel. Let θ_i denote the topic features of Mashup service i, δ a scale parameter, and exp the exponential function with the natural constant e as base; the Gaussian kernel is:
SI_ij = exp(-||θ_i - θ_j||^2 / (2δ^2))
5.2 Sum the elements of each column of SI and place each sum as an element on the diagonal of the degree matrix G:
G_ii = Σ_j SI_ij
5.3 Compute the Laplacian matrix L = G - SI from G;
5.4 Use the eig function in Python to compute the eigenvalues and eigenvectors of L; the service document feature vector matrix F is found by solving:
argmin_F Tr(F^T L F), subject to: F^T F = I
where argmin_F denotes the value of F that minimizes Tr(F^T L F) and Tr denotes the matrix trace;
5.5 Sort the eigenvalues from small to large and take the eigenvectors of the first C eigenvalues, where C is the specified number of clusters, as the initial cluster centers;
5.6 Compute the Euclidean distance dist from each feature vector to each cluster center and assign each Mashup service to the cluster with the minimum distance:
dist = sqrt(Σ_i (f_i - Ce_i)^2)
where f_i is the i-th value in the feature vector f and Ce_i is the i-th value of the cluster center vector Ce;
5.7 Update each cluster center to the mean of the feature vectors in its cluster;
5.8 Compute the Euclidean distance between the new and old cluster centers as the error value;
5.9 Repeat steps 5.6-5.8 until the error is less than a given threshold or the number of iterations reaches the maximum.
Further, the process of 1.2 is as follows:
1.2.1 According to z_d, obtain the topic k of the current document d, increase m_z[k] by 1 and n_z[k] by Len, where Len is the length of the document;
1.2.2 Traverse each word w in document d, increasing n_zv[k][w] by 1;
1.2.3 Repeat 1.2.1-1.2.2 until all Mashup services are processed.
Still further, the process of 1.3 is as follows:
1.3.1 According to z_d, obtain the topic k of the current document d, decrease m_z[k] by 1 and n_z[k] by Len, where Len is the length of the document;
1.3.2 Traverse each word w in document d, decreasing n_zv[k][w] by 1;
1.3.3 Traverse each existing topic z and calculate the probability of document d on that topic:
p(z_d = z | z_¬d, d) ∝ (m_z^¬d / (N - 1 + αN)) · (Π_{w∈d} Π_{j=1}^{N_d^w} (n_z^{w,¬d} + β + j - 1)) / (Π_{i=1}^{N_d} (n_z^¬d + Vβ + i - 1))
1.3.4 Calculate the probability of document d under a new topic:
p(z_d = K+1 | z_¬d, d) ∝ (αN / (N - 1 + αN)) · (Π_{w∈d} Π_{j=1}^{N_d^w} (β + j - 1)) / (Π_{i=1}^{N_d} (Vβ + i - 1))
where α and β are hyperparameters, z_d is the topic of the current document d, the superscript ¬d marks statistics computed without the information of document d, z_¬d is the topic assignment of every document excluding document d, m_z^¬d is the number of documents under topic z after removing document d, n_z^{w,¬d} is the number of occurrences of word w in topic z without document d, n_z^¬d is the number of words in topic z without document d, N_d is the number of words in document d, and N_d^w is the number of occurrences of word w in document d.
Still further, the procedure of 1.4 is as follows:
1.4.1 Accumulate the probability of document d under every topic to obtain the total probability prob;
1.4.2 Randomly generate a number thred in [0, prob];
1.4.3 Accumulate the probability of document d under each topic again; once the accumulated sum at the current topic k is greater than or equal to thred, the topic of document d is k.
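For reference, a minimal Python sketch of this roulette wheel selection (steps 1.4.1-1.4.3); the function name roulette is illustrative:

```python
import random

def roulette(probs):
    """Return index k with probability proportional to probs[k] (steps 1.4.1-1.4.3)."""
    prob = sum(probs)                  # 1.4.1: total probability
    thred = random.uniform(0, prob)    # 1.4.2: random number in [0, prob]
    acc = 0.0
    for k, p in enumerate(probs):      # 1.4.3: accumulate until the sum reaches thred
        acc += p
        if acc >= thred:
            return k
    return len(probs) - 1              # guard against floating-point round-off
```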
The process of 2.1 is as follows:
2.1.1 Traverse each word in the current Mashup service description document and lemmatize it using NLTK;
2.1.2 Use NLTK to extract the word stem and judge whether the word is a noun; if it is, add it to the noun set Nset;
2.1.3 Repeat steps 2.1.1-2.1.2 until all Mashup services are processed.
The process of 2.2 is as follows:
2.2.1 Traverse each word in the Mashup service description document, count how many times each word occurs in the current document, and calculate its TF value:
TF_{i,j} = NUM(j) / LEN(i)
where TF_{i,j} is the word frequency of the j-th word in the i-th Mashup service description document, NUM(j) is the number of times the j-th word appears, and LEN(i) is the length of the i-th Mashup text;
2.2.2 Count the number of Mashup service documents in which each word appears and calculate its IDF value:
IDF(x) = log(N / Doc(x))
where IDF(x) is the IDF value of word x, N is the number of Mashup documents, and Doc(x) is the number of Mashup documents containing word x;
2.2.3 Traverse the words in all Mashup documents and calculate their TF-IDF values:
TF-IDF(x) = TF(x) * IDF(x)
where TF-IDF(x) is the TF-IDF value of word x and TF(x) is the TF value of word x.
The process of 2.3 is as follows:
2.3.1 Traverse each word w_x in the current Mashup service document and calculate its context semantic weight WeightContext(w_x):
WeightContext(w_x) = (Σ_{w_y ∈ d, w_y ≠ w_x} sim(w_x, w_y)) / N_d
where sim(w_x, w_y) is the similarity of words w_x and w_y computed with the WordNet tool, w_y is a context word of w_x, d is the current Mashup service description document, and N_d is its length. WordNet is an English dictionary that organizes words in a network structure, grouping words with similar meanings; similarity is obtained from the shortest path between the words in the network;
2.3.2 Calculate the service tag semantic weight WeightTag(w_x) of each word:
WeightTag(w_x) = (Σ_{t ∈ Tag_d} sim(w_x, t)) / |Tag_d|
where Tag_d is the service tag set of the current Mashup service document and t is a word in a service tag;
2.3.3 Recalculate the semantic weight of each word based on its TF-IDF value combined with the results of 2.3.1 and 2.3.2.
Preferably, the operation of 2.3.3 is as follows:
2.3.3.1 Traverse each word w_x in the current Mashup service description document and judge whether it is in the noun set Nset; if w_x is in the noun set, recalculate its semantic weight by combining its TF-IDF value with WeightContext(w_x) and WeightTag(w_x); if w_x is not in the noun set Nset, jump to step 2.3.3.2;
2.3.3.2 Assign the semantic weight of the word as its TF-IDF value:
SemWeight(w_x) = TF-IDF(w_x)
2.3.3.3 Repeat 2.3.3.1-2.3.3.2 until all Mashup services are processed, obtaining the document-word semantic weight matrix D.
The process of 3.1 is as follows:
3.1.1 For the current Mashup service, compute the length Len of its description document and set the sliding-window length to Len;
3.1.2 Count the co-occurrences of each word with the other words in the Mashup service description document: whenever a context word of the current word (i.e. a word before or after it) lies within the sliding-window distance Len, add 1 to the co-occurrence count of the word and that context word;
3.1.3 Repeat 3.1.2 until all words in the Mashup service are processed;
3.1.4 Repeat 3.1.1-3.1.3 until all Mashup services are processed.
The beneficial effects of the invention are mainly as follows: topic mining is performed on Mashup services with non-negative matrix factorization (NMF); an improved Gibbs-sampling Dirichlet Process Mixture Model (DPMM) is introduced to determine the number of topics automatically; an optimized word embedding and word semantic weight calculation method is fused in to alleviate the sparsity problem caused by short texts; and finally a spectral clustering algorithm clusters the topic features of the Mashup services to find a better solution set.
Detailed Description
The invention is further described below.
An NLP-oriented Mashup service spectral clustering method based on GSDPMM and a topic model comprises the following steps:
First step: calculate the number of topics of the Mashup services by the GSDPMM method, as follows:
1.1 Initialize z, n_z, n_zv, and m_z: all elements of n_z, n_zv, and m_z are 0, all elements of z are 1; set the initial topic number K to 1 and the iteration count Iter. Here z records the topic to which each document belongs, n_z counts the number of words under each topic, n_zv counts the occurrences of each word under each topic, and m_z counts the number of documents under each topic, with z ∈ R^(1×N), n_z ∈ R^(1×K), n_zv ∈ R^(K×V), m_z ∈ R^(1×K), where N is the number of Mashup services and V is the number of distinct words in the corpus;
1.2 Traverse all Mashup services and compute n_z and n_zv, as follows:
1.2.1 According to z_d, obtain the topic k of the current document d, increase m_z[k] by 1 and n_z[k] by Len, where Len is the length of the document;
1.2.2 Traverse each word w in document d, increasing n_zv[k][w] by 1;
1.2.3 Repeat 1.2.1-1.2.2 until all Mashup services are processed;
1.3 Perform the Gibbs sampling operation on all Mashup services, as follows:
1.3.1 According to z_d, obtain the topic k of the current document d, decrease m_z[k] by 1 and n_z[k] by Len, where Len is the length of the document;
1.3.2 Traverse each word w in document d, decreasing n_zv[k][w] by 1;
1.3.3 Traverse each existing topic z and calculate the probability of document d on that topic:
p(z_d = z | z_¬d, d) ∝ (m_z^¬d / (N - 1 + αN)) · (Π_{w∈d} Π_{j=1}^{N_d^w} (n_z^{w,¬d} + β + j - 1)) / (Π_{i=1}^{N_d} (n_z^¬d + Vβ + i - 1))
1.3.4 Calculate the probability of document d under a new topic:
p(z_d = K+1 | z_¬d, d) ∝ (αN / (N - 1 + αN)) · (Π_{w∈d} Π_{j=1}^{N_d^w} (β + j - 1)) / (Π_{i=1}^{N_d} (Vβ + i - 1))
where α and β are hyperparameters, z_d is the topic of the current document d, the superscript ¬d marks statistics computed without the information of document d, z_¬d is the topic assignment of every document excluding document d, m_z^¬d is the number of documents under topic z after removing document d, n_z^{w,¬d} is the number of occurrences of word w in topic z without document d, n_z^¬d is the number of words in topic z without document d, N_d is the number of words in document d, and N_d^w is the number of occurrences of word w in document d;
1.4 Select the topic of document d by roulette wheel selection, as follows:
1.4.1 Accumulate the probability of document d under every topic to obtain the total probability prob;
1.4.2 Randomly generate a number thred in [0, prob];
1.4.3 Accumulate the probability of document d under each topic again; once the accumulated sum at the current topic k is greater than or equal to thred, the topic of document d is k;
1.5 According to the topic k of the current document d, increase m_z[k] by 1 and n_z[k] by Len, where Len is the length of the document;
1.6 Repeat steps 1.3-1.5 until all Mashup services are processed;
1.7 Repeat steps 1.3-1.6 until the iteration count Iter is reached;
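For reference, a minimal Python sketch of this first step is given below. It assumes each Mashup description is already tokenized into a list of words; the function name gsdpmm, the default hyperparameter values, and the use of dictionaries for the count statistics are illustrative choices, not part of the invention:

```python
import random
from collections import defaultdict

def gsdpmm(docs, V, alpha=0.1, beta=0.1, iters=10):
    """Sketch of the first step: docs is a list of token lists, V the vocabulary size."""
    N = len(docs)
    z = [0] * N                                   # 1.1: topic of each document, one initial topic
    m_z = defaultdict(int)                        # documents per topic
    n_z = defaultdict(int)                        # words per topic
    n_zv = defaultdict(lambda: defaultdict(int))  # occurrences of word w under topic k
    for doc in docs:                              # 1.2: initial statistics
        m_z[0] += 1
        n_z[0] += len(doc)
        for w in doc:
            n_zv[0][w] += 1
    for _ in range(iters):                        # 1.7: Iter sampling passes
        for d, doc in enumerate(docs):
            k = z[d]                              # 1.3.1-1.3.2: remove document d from its topic
            m_z[k] -= 1
            n_z[k] -= len(doc)
            for w in doc:
                n_zv[k][w] -= 1
            if m_z[k] == 0:
                del m_z[k]
            new_k = max(list(m_z) + [-1]) + 1     # a fresh id for the candidate new topic
            topics, probs = list(m_z) + [new_k], []
            for t in topics:                      # 1.3.3-1.3.4: probability under each topic
                p = m_z.get(t, alpha * N) / (N - 1 + alpha * N)
                i, seen = 1, defaultdict(int)
                for w in doc:
                    seen[w] += 1
                    p *= (n_zv[t][w] + beta + seen[w] - 1) / (n_z[t] + V * beta + i - 1)
                    i += 1
                probs.append(p)
            thred = random.uniform(0, sum(probs)) # 1.4: roulette wheel selection
            acc, k = 0.0, topics[-1]
            for t, p in zip(topics, probs):
                acc += p
                if acc >= thred:
                    k = t
                    break
            z[d] = k                              # 1.5: add document d back under topic k
            m_z[k] += 1
            n_z[k] += len(doc)
            for w in doc:
                n_zv[k][w] += 1
    return z, len(m_z)                            # assignments and inferred topic number K
```

The number of non-empty topics remaining in m_z after the final pass gives the topic number K used in the following steps.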
Second step: calculate the semantic weight of each word from its context information and the service tag information, obtaining a document-word semantic weight matrix D, as follows:
2.1 Use the Natural Language Toolkit (NLTK) in Python to part-of-speech tag the words in each Mashup service description document; NLTK is a well-known natural language processing library. The process is as follows:
2.1.1 Traverse each word in the current Mashup service description document and lemmatize it using NLTK;
2.1.2 Use NLTK to extract the word stem and judge whether the word is a noun; if it is, add it to the noun set Nset;
2.1.3 Repeat steps 2.1.1-2.1.2 until all Mashup services are processed;
2.2 Count word frequency information and calculate TF-IDF values, as follows:
2.2.1 Traverse each word in the Mashup service description document, count how many times each word occurs in the current document, and calculate its TF value:
TF_{i,j} = NUM(j) / LEN(i)
where TF_{i,j} is the word frequency of the j-th word in the i-th Mashup service description document, NUM(j) is the number of times the j-th word appears, and LEN(i) is the length of the i-th Mashup text;
2.2.2 Count the number of Mashup service documents in which each word appears and calculate its IDF value:
IDF(x) = log(N / Doc(x))
where IDF(x) is the IDF value of word x, N is the number of Mashup documents, and Doc(x) is the number of Mashup documents containing word x;
2.2.3 Traverse the words in all Mashup documents and calculate their TF-IDF values:
TF-IDF(x) = TF(x) * IDF(x)
where TF-IDF(x) is the TF-IDF value of word x and TF(x) is the TF value of word x;
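A minimal sketch of steps 2.2.1-2.2.3 in Python, assuming tokenized documents; the unsmoothed IDF form log(N/Doc(x)) is an assumption, since the text does not state whether smoothing is applied:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Steps 2.2.1-2.2.3: docs is a list of token lists, one per Mashup service."""
    N = len(docs)
    doc_count = Counter()                      # Doc(x): documents containing word x
    for doc in docs:
        doc_count.update(set(doc))
    weights = []
    for doc in docs:
        num = Counter(doc)                     # NUM(j): occurrences of word j in this document
        weights.append({w: (c / len(doc)) * math.log(N / doc_count[w])  # TF(x) * IDF(x)
                        for w, c in num.items()})
    return weights                             # one {word: TF-IDF} dict per document
```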
2.3 Extract the Mashup service tag information and recalculate the semantic weight of each word in the Mashup service description document based on the noun set Nset and the TF-IDF values, as follows:
2.3.1 Traverse each word w_x in the current Mashup service document and calculate its context semantic weight WeightContext(w_x):
WeightContext(w_x) = (Σ_{w_y ∈ d, w_y ≠ w_x} sim(w_x, w_y)) / N_d
where sim(w_x, w_y) is the similarity of words w_x and w_y computed with the WordNet tool, w_y is a context word of w_x, d is the current Mashup service description document, and N_d is its length. WordNet is an English dictionary that organizes words in a network structure, grouping words with similar meanings; similarity is obtained from the shortest path between the words in the network;
2.3.2 Calculate the service tag semantic weight WeightTag(w_x) of each word:
WeightTag(w_x) = (Σ_{t ∈ Tag_d} sim(w_x, t)) / |Tag_d|
where Tag_d is the service tag set of the current Mashup service document and t is a word in a service tag;
2.3.3 Recalculate the semantic weight of each word based on its TF-IDF value combined with the results of 2.3.1 and 2.3.2, as follows:
2.3.3.1 Traverse each word w_x in the current Mashup service description document and judge whether it is in the noun set Nset; if w_x is in the noun set, recalculate its semantic weight by combining its TF-IDF value with WeightContext(w_x) and WeightTag(w_x); if w_x is not in the noun set Nset, jump to step 2.3.3.2;
2.3.3.2 Assign the semantic weight of the word as its TF-IDF value:
SemWeight(w_x) = TF-IDF(w_x)
2.3.3.3 Repeat 2.3.3.1-2.3.3.2 until all Mashup services are processed, obtaining the document-word semantic weight matrix D;
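A sketch of step 2.3 for one service, assuming NLTK's WordNet data is available (nltk.download('wordnet')). Because the exact combination formula in 2.3.3.1 is not spelled out in the text, the multiplicative combination below is an assumption, flagged in the comments:

```python
from nltk.corpus import wordnet as wn

def path_sim(w1, w2):
    """sim(w_x, w_y): WordNet shortest-path similarity, 0 if either word is unknown."""
    s1, s2 = wn.synsets(w1), wn.synsets(w2)
    if not s1 or not s2:
        return 0.0
    return s1[0].path_similarity(s2[0]) or 0.0

def sem_weight(doc, tags, tfidf, nset):
    """Step 2.3 for one service: doc is its token list, tags its tag words,
    tfidf a {word: TF-IDF} dict, nset the noun set from step 2.1."""
    weights = {}
    for wx in set(doc):
        if wx in nset:                # 2.3.3.1: nouns get context- and tag-aware weights
            ctx = sum(path_sim(wx, wy) for wy in doc if wy != wx) / len(doc)
            tag = sum(path_sim(wx, t) for t in tags) / len(tags) if tags else 0.0
            # assumed combination of TF-IDF, WeightContext, and WeightTag
            weights[wx] = tfidf[wx] * (1.0 + ctx + tag)
        else:                         # 2.3.3.2: non-nouns keep their TF-IDF value
            weights[wx] = tfidf[wx]
    return weights
```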
Third step: count word co-occurrence information and compute the SPPMI matrix, as follows:
3.1 Count word co-occurrence information. Because Mashup service description documents are short, the whole service description document is used as the sliding-window length so that context co-occurrence information can be captured more completely, and the number of times each word co-occurs with every other word is counted, as follows:
3.1.1 For the current Mashup service, compute the length Len of its description document and set the sliding-window length to Len;
3.1.2 Count the co-occurrences of each word with the other words in the Mashup service description document: whenever a context word of the current word (i.e. a word before or after it) lies within the sliding-window distance Len, add 1 to the co-occurrence count of the word and that context word;
3.1.3 Repeat 3.1.2 until all words in the Mashup service are processed;
3.1.4 Repeat 3.1.1-3.1.3 until all Mashup services are processed;
3.2 Calculate the pointwise mutual information (PMI). PMI is widely used to measure the similarity between words: the more often two words co-occur in a text, the stronger their correlation. The PMI formula is:
PMI(x, y) = log(P(x, y) / (P(x) · P(y)))
where x and y are two words, P(x, y) is the probability that x and y co-occur, and P(x) is the probability that x occurs in a context. From the actual number of co-occurrences of word w_j and its context word w_c in the corpus, the PMI value between the two can be calculated as:
PMI(w_j, w_c) = log((#(w_j, w_c) · E) / (#(w_j) · #(w_c)))
where #(w_j, w_c) is the actual number of co-occurrences of w_j and w_c in the corpus, E is the total number of co-occurring context word pairs, #(w_j) = Σ_{w∈Voc} #(w_j, w) is the number of co-occurrences of w_j with other words, and Voc is the corpus vocabulary, i.e. the set of distinct words;
3.3 Calculate the shifted positive pointwise mutual information (SPPMI) matrix from the PMI values:
SPPMI(w_j, w_c) = max(PMI(w_j, w_c) - log κ, 0)
where κ is the negative sampling coefficient. The context SPPMI matrix M of the words is obtained through this formula;
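A sketch of the third step in Python/NumPy, assuming a list of tokenized documents and a fixed vocabulary; kappa stands for the negative sampling coefficient κ:

```python
import numpy as np

def sppmi_matrix(docs, vocab, kappa=5):
    """Third step: document-length windows -> co-occurrence counts -> PMI -> SPPMI."""
    idx = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    co = np.zeros((V, V))
    for doc in docs:                      # 3.1: window length = document length
        for a in range(len(doc)):
            for b in range(len(doc)):
                if a != b:                # every other word in the document co-occurs
                    co[idx[doc[a]], idx[doc[b]]] += 1
    E = co.sum()                          # total co-occurrences of context word pairs
    row = co.sum(axis=1)                  # #(w_j): co-occurrences of w_j with any word
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(co * E / np.outer(row, row))    # 3.2: PMI(w_j, w_c)
    pmi[~np.isfinite(pmi)] = 0.0          # zero-count pairs contribute nothing
    return np.maximum(pmi - np.log(kappa), 0.0)      # 3.3: SPPMI = max(PMI - log k, 0)
```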
Fourth step: with the document-word semantic weight matrix D from the second step and the word context SPPMI matrix M from the third step, obtain a word embedding matrix by decomposing M, then combine the two kinds of information to compute the topic information of each service, as follows:
4.1 Given the global document-word semantic weight matrix D from the second step, decompose it by NMF into the product of a document-topic matrix θ and a topic-word matrix Z. The objective for decomposing D is:
J_1 = min ||D - θZ^T||^2, subject to: θ ≥ 0 and Z ≥ 0, θ ∈ R^(N×K), Z ∈ R^(V×K)
where ||·|| is the L2 norm, N is the number of Mashup documents, K is the number of topics, V is the number of words in the corpus, R is the set of real numbers, and the superscript T denotes matrix transposition. NMF is a matrix decomposition method that expresses one non-negative matrix as the product of two other non-negative matrices under the constraint that all matrix elements are non-negative;
4.2 The word context SPPMI matrix M obtained in the third step is decomposed to introduce word embedding information. The objective for decomposing M is:
J_2 = min ||M - WSW^T||^2
where S is an additional symmetric factor used to approximate M and W is the word embedding matrix of the words;
4.3 Topic information is found through the relation between Mashup service documents and words, while word embedding information is learned from the co-occurrence of word contexts within documents. These two parts are not isolated from each other: semantically related words belong to similar topics and lie close together in the embedding space. Word embeddings are therefore related to their topics, with the relation:
J_3 = min ||Z - WA^T||^2
4.4 In step 4.3, the topic-word matrix Z is decomposed into the product of a topic embedding matrix A and the word embedding matrix W, associating word embeddings with topic information and further improving the accuracy of topic modeling.
Combining steps 4.1, 4.2, and 4.3 gives the objective function of the topic model:
J_4 = min λ_d||D - θZ^T||^2 + λ_w||M - WSW^T||^2 + λ_t||Z - WA^T||^2, subject to: θ ≥ 0 and Z ≥ 0
To solve this objective function, expand it using the matrix trace operation:
J(θ, Z, W, S, A) = λ_d Tr((D - θZ^T)(D - θZ^T)^T) + λ_w Tr((M - WSW^T)(M - WSW^T)^T) + λ_t Tr((Z - WA^T)(Z - WA^T)^T)
where J(θ, Z, W, S, A) is the expanded form of J_4 under the parameters θ, Z, W, S, A. Further expansion gives:
J(θ, Z, W, S, A) = λ_d Tr(DD^T - 2DZθ^T + θZ^TZθ^T) + λ_w Tr(MM^T - 2MWSW^T + WSW^TWSW^T) + λ_t Tr(ZZ^T - 2ZAW^T + WA^TAW^T)
where Tr denotes the matrix trace and λ_d, λ_w, λ_t are weight coefficients of the different parts, used to adjust how much the error of each part influences the result. Introducing the non-negativity constraints through the regularization (Lagrange multiplier) matrices α, β, γ, ω, which also help avoid overfitting, gives the constrained objective:
J = J(θ, Z, W, S, A) + Tr(αθ^T) + Tr(βZ^T) + Tr(γW^T) + Tr(ωA^T)
To minimize the objective function, take its partial derivatives with respect to θ, Z, W, and A. Let α ⊙ θ = 0, β ⊙ Z = 0, γ ⊙ W = 0, ω ⊙ A = 0, where ⊙ denotes the Hadamard product, i.e. the element-wise product of corresponding matrix positions. Setting the derivatives to 0 and applying the Hadamard product yields the following equations:
-2(DZ) ⊙ θ + 2(θZ^TZ) ⊙ θ + α ⊙ θ = 0
-2(λ_d D^Tθ + λ_t WA^T) ⊙ Z + 2(λ_d Zθ^Tθ + λ_t Z) ⊙ Z + β ⊙ Z = 0
-2(λ_w MWS + λ_t ZA) ⊙ W + (λ_t WA^TA + 2λ_w WSW^TWS) ⊙ W + γ ⊙ W = 0
-(Z^TW) ⊙ A + (AW^TW) ⊙ A + ω ⊙ A = 0
These equations lead to the multiplicative parameter updates (division is element-wise):
θ ← θ ⊙ (DZ) / (θZ^TZ)
Z ← Z ⊙ (λ_d D^Tθ + λ_t WA^T) / (λ_d Zθ^Tθ + λ_t Z)
W ← W ⊙ 2(λ_w MWS + λ_t ZA) / (λ_t WA^TA + 2λ_w WSW^TWS)
A ← A ⊙ (Z^TW) / (AW^TW)
Through this parameter updating scheme, the Mashup service document-topic matrix θ, the topic-word matrix Z, the word embedding matrix W, and the topic embedding matrix A can be solved;
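A compact NumPy sketch of the multiplicative updates above. Dimensions follow the text (θ ∈ R^(N×K), Z ∈ R^(V×K), W ∈ R^(V×r), A ∈ R^(K×r)); the embedding dimension r, the random initialization, and the fixed symmetric factor S are assumptions, since the text gives no update rule for S:

```python
import numpy as np

def joint_nmf(D, M, K, r=50, lam_d=1.0, lam_w=1.0, lam_t=1.0, iters=200, eps=1e-9):
    """Fourth step: jointly factorize D ~ theta Z^T, M ~ W S W^T, Z ~ W A^T with the
    multiplicative updates above. S is kept fixed at the identity (an assumption)."""
    N, V = D.shape
    rng = np.random.default_rng(0)
    theta = rng.random((N, K))            # document-topic matrix, N x K
    Z = rng.random((V, K))                # topic-word matrix, V x K
    W = rng.random((V, r))                # word embedding matrix, V x r
    A = rng.random((K, r))                # topic embedding matrix, K x r
    S = np.eye(r)                         # fixed symmetric factor
    for _ in range(iters):
        theta *= (D @ Z) / (theta @ (Z.T @ Z) + eps)
        Z *= (lam_d * (D.T @ theta) + lam_t * (W @ A.T)) / \
             (lam_d * (Z @ (theta.T @ theta)) + lam_t * Z + eps)
        W *= (2 * (lam_w * (M @ W @ S) + lam_t * (Z @ A))) / \
             (lam_t * (W @ (A.T @ A)) + 2 * lam_w * (W @ S @ W.T @ W @ S) + eps)
        A *= (Z.T @ W) / (A @ (W.T @ W) + eps)
    return theta, Z, W, A
```

The rows of θ are the per-service topic features handed to the fifth step.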
Fifth step: use the Mashup service topic features obtained in 4.4 as the input of spectral clustering. Spectral clustering is an algorithm that evolved from graph theory and has since been widely applied to clustering. Its main idea is to regard all data as points in space connected by edges: the edge weight between two distant points is low, while the edge weight between two close points is high. The graph formed by all data points is cut so that the edge weights between different subgraphs are as low as possible while the edge weights within each subgraph are as high as possible, thereby achieving the clustering goal. The steps are as follows:
5.1 Calculate the similarity matrix SI, where the similarity between service topic features is computed with a Gaussian kernel. Let θ_i denote the topic features of Mashup service i, δ a scale parameter, and exp the exponential function with the natural constant e as base; the Gaussian kernel is:
SI_ij = exp(-||θ_i - θ_j||^2 / (2δ^2))
5.2 Sum the elements of each column of SI and place each sum as an element on the diagonal of the degree matrix G:
G_ii = Σ_j SI_ij
5.3 Compute the Laplacian matrix L = G - SI from G;
5.4 Use the eig function in Python to compute the eigenvalues and eigenvectors of L; the service document feature vector matrix F is found by solving:
argmin_F Tr(F^T L F), subject to: F^T F = I
where argmin_F denotes the value of F that minimizes Tr(F^T L F) and Tr denotes the matrix trace;
5.5 Sort the eigenvalues from small to large and take the eigenvectors of the first C eigenvalues, where C is the specified number of clusters, as the initial cluster centers;
5.6 Compute the Euclidean distance dist from each feature vector to each cluster center and assign each Mashup service to the cluster with the minimum distance:
dist = sqrt(Σ_i (f_i - Ce_i)^2)
where f_i is the i-th value in the feature vector f and Ce_i is the i-th value of the cluster center vector Ce;
5.7 Update each cluster center to the mean of the feature vectors in its cluster;
5.8 Compute the Euclidean distance between the new and old cluster centers as the error value;
5.9 Repeat steps 5.6-5.8 until the error is less than a given threshold or the number of iterations reaches the maximum.
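A NumPy sketch of the fifth step. np.linalg.eigh is used in place of a generic eig call because L is symmetric, and the cluster centers are initialized from random rows of F, which is an assumption where the text takes the eigenvectors of the first C eigenvalues as initial centers:

```python
import numpy as np

def spectral_cluster(theta, C, delta=1.0, iters=100, tol=1e-6):
    """Fifth step: Gaussian-kernel similarity, unnormalized Laplacian, eigenvectors,
    then a K-means-style assignment loop (steps 5.1-5.9)."""
    n = theta.shape[0]
    diff = theta[:, None, :] - theta[None, :, :]
    SI = np.exp(-(diff ** 2).sum(axis=2) / (2 * delta ** 2))   # 5.1: similarity matrix
    G = np.diag(SI.sum(axis=1))                                # 5.2: degree matrix
    L = G - SI                                                 # 5.3: Laplacian
    vals, vecs = np.linalg.eigh(L)                             # 5.4: eigen-decomposition
    F = vecs[:, np.argsort(vals)[:C]]                          # 5.5: C smallest eigenvalues
    rng = np.random.default_rng(0)
    centers = F[rng.choice(n, size=C, replace=False)]          # assumed initialization
    labels = np.zeros(n, dtype=int)
    for _ in range(iters):
        dist = np.linalg.norm(F[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)                           # 5.6: nearest center
        new = np.array([F[labels == c].mean(axis=0) if (labels == c).any() else centers[c]
                        for c in range(C)])                    # 5.7: mean of each cluster
        if np.linalg.norm(new - centers) < tol:                # 5.8-5.9: stop on small error
            break
        centers = new
    return labels
```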
The embodiments described in this specification are merely illustrative of the inventive concept. The scope of the invention is not limited to the specific forms set forth in the embodiments; it also covers equivalent technical means that a person skilled in the art could conceive based on the inventive concept.

Claims (9)

1. An NLP-oriented Mashup service spectral clustering method based on GSDPMM and a topic model, characterized by comprising the following steps:
First step: calculate the number of topics of the Mashup services by the GSDPMM method, as follows:
1.1 Initialize z, n_z, n_zv, and m_z: all elements of n_z, n_zv, and m_z are 0, all elements of z are 1; set the initial topic number K to 1 and the iteration count Iter; z records the topic to which each document belongs, n_z counts the number of words under each topic, n_zv counts the occurrences of each word under each topic, and m_z counts the number of documents under each topic, with z ∈ R^(1×N), n_z ∈ R^(1×K), n_zv ∈ R^(K×V), m_z ∈ R^(1×K), where N is the number of Mashup services and V is the number of distinct words in the corpus;
1.2 Traverse all Mashup services and compute n_z and n_zv;
1.3 Perform the Gibbs sampling operation on all Mashup services;
1.4 Select the topic of document d by roulette wheel selection;
1.5 According to the topic k of the current document d, increase m_z[k] by 1 and n_z[k] by Len, where Len is the length of the document;
1.6 Repeat steps 1.3-1.5 until all Mashup services are processed;
1.7 Repeat steps 1.3-1.6 until the iteration count Iter is reached;
Second step: calculate the semantic weight of each word from its context information and the service tag information, obtaining a document-word semantic weight matrix D, as follows:
2.1 Use the Natural Language Toolkit (NLTK) in Python to part-of-speech tag the words in each Mashup service description document;
2.2 Count word frequency information and calculate TF-IDF values;
2.3 Extract the Mashup service tag information and recalculate the semantic weight of each word in the Mashup service description document based on the noun set Nset and the TF-IDF values;
Third step: count word co-occurrence information and compute the SPPMI matrix, as follows:
3.1 Count word co-occurrence information: because Mashup service description documents are short, the whole service description document is used as the sliding-window length so that context co-occurrence information can be captured more completely, and the number of times each word co-occurs with every other word is counted;
3.2 Calculate the pointwise mutual information (PMI):
PMI(x, y) = log(P(x, y) / (P(x) · P(y)))
where x and y are two words, P(x, y) is the probability that x and y co-occur, and P(x) is the probability that x occurs in a context; from the actual number of co-occurrences of word w_j and its context word w_c in the corpus, the PMI value between the two is calculated as:
PMI(w_j, w_c) = log((#(w_j, w_c) · E) / (#(w_j) · #(w_c)))
where #(w_j, w_c) is the actual number of co-occurrences of w_j and w_c in the corpus, E is the total number of co-occurring context word pairs, #(w_j) = Σ_{w∈Voc} #(w_j, w) is the number of co-occurrences of w_j with other words, and Voc is the corpus vocabulary, i.e. the set of distinct words;
3.3 Calculate the shifted positive pointwise mutual information (SPPMI) matrix from the PMI values:
SPPMI(w_j, w_c) = max(PMI(w_j, w_c) - log κ, 0)
where κ is the negative sampling coefficient; the context SPPMI matrix M of the words is obtained through this formula;
Fourth step: with the document-word semantic weight matrix D from the second step and the word context SPPMI matrix M from the third step, obtain a word embedding matrix by decomposing M, then combine the two kinds of information to compute the topic information of each service, as follows:
4.1 Given the global document-word semantic weight matrix D from the second step, decompose it by NMF into the product of a document-topic matrix θ and a topic-word matrix Z. The objective for decomposing D is:
J_1 = min ||D - θZ^T||^2, subject to: θ ≥ 0 and Z ≥ 0, θ ∈ R^(N×K), Z ∈ R^(V×K)
where ||·|| is the L2 norm, N is the number of Mashup documents, K is the number of topics, V is the number of words in the corpus, R is the set of real numbers, and the superscript T denotes matrix transposition; NMF is a matrix decomposition method that expresses one non-negative matrix as the product of two other non-negative matrices under the constraint that all matrix elements are non-negative;
4.2 The word context SPPMI matrix M obtained in the third step is decomposed to introduce word embedding information. The objective for decomposing M is:
J_2 = min ||M - WSW^T||^2
where S is an additional symmetric factor used to approximate M and W is the word embedding matrix of the words;
4.3 Topic information is found through the relation between Mashup service documents and words, while word embedding information is learned from the co-occurrence of word contexts within documents; these two parts are not isolated from each other: semantically related words belong to similar topics and lie close together in the embedding space, so word embeddings are related to their topics, with the relation:
J_3 = min ||Z - WA^T||^2
4.4 In step 4.3, decompose the topic-word matrix Z into the product of a topic embedding matrix A and the word embedding matrix W, associating word embeddings with topic information and further improving the accuracy of topic modeling;
combining steps 4.1, 4.2, and 4.3 gives the objective function of the topic model:
J_4 = min λ_d||D - θZ^T||^2 + λ_w||M - WSW^T||^2 + λ_t||Z - WA^T||^2, subject to: θ ≥ 0 and Z ≥ 0
To solve this objective function, expand it using the matrix trace operation:
J(θ, Z, W, S, A) = λ_d Tr((D - θZ^T)(D - θZ^T)^T) + λ_w Tr((M - WSW^T)(M - WSW^T)^T) + λ_t Tr((Z - WA^T)(Z - WA^T)^T)
where J(θ, Z, W, S, A) is the expanded form of J_4 under the parameters θ, Z, W, S, A; further expansion gives:
J(θ, Z, W, S, A) = λ_d Tr(DD^T - 2DZθ^T + θZ^TZθ^T) + λ_w Tr(MM^T - 2MWSW^T + WSW^TWSW^T) + λ_t Tr(ZZ^T - 2ZAW^T + WA^TAW^T)
where Tr denotes the matrix trace and λ_d, λ_w, λ_t are weight coefficients of the different parts, used to adjust how much the error of each part influences the result; introducing the non-negativity constraints through the regularization (Lagrange multiplier) matrices α, β, γ, ω, which also help avoid overfitting, gives the constrained objective:
J = J(θ, Z, W, S, A) + Tr(αθ^T) + Tr(βZ^T) + Tr(γW^T) + Tr(ωA^T)
To minimize the objective function, take its partial derivatives with respect to θ, Z, W, and A; let α ⊙ θ = 0, β ⊙ Z = 0, γ ⊙ W = 0, ω ⊙ A = 0, where ⊙ denotes the Hadamard product, i.e. the element-wise product of corresponding matrix positions; setting the derivatives to 0 and applying the Hadamard product yields the following equations:
-2(DZ) ⊙ θ + 2(θZ^TZ) ⊙ θ + α ⊙ θ = 0
-2(λ_d D^Tθ + λ_t WA^T) ⊙ Z + 2(λ_d Zθ^Tθ + λ_t Z) ⊙ Z + β ⊙ Z = 0
-2(λ_w MWS + λ_t ZA) ⊙ W + (λ_t WA^TA + 2λ_w WSW^TWS) ⊙ W + γ ⊙ W = 0
-(Z^TW) ⊙ A + (AW^TW) ⊙ A + ω ⊙ A = 0
These equations lead to the multiplicative parameter updates (division is element-wise):
θ ← θ ⊙ (DZ) / (θZ^TZ)
Z ← Z ⊙ (λ_d D^Tθ + λ_t WA^T) / (λ_d Zθ^Tθ + λ_t Z)
W ← W ⊙ 2(λ_w MWS + λ_t ZA) / (λ_t WA^TA + 2λ_w WSW^TWS)
A ← A ⊙ (Z^TW) / (AW^TW)
Through this parameter updating scheme, solve the Mashup service document-topic matrix θ, the topic-word matrix Z, the word embedding matrix W, and the topic embedding matrix A;
Fifth step: use the Mashup service topic features obtained in step 4.4 as the input of spectral clustering, as follows:
5.1 Calculate the similarity matrix SI, where the similarity between service topic features is computed with a Gaussian kernel; θ_i denotes the topic features of Mashup service i, δ is a scale parameter, and exp is the exponential function with the natural constant e as base:
SI_ij = exp(-||θ_i - θ_j||^2 / (2δ^2))
5.2 Sum the elements of each column of SI and place each sum as an element on the diagonal of the degree matrix G:
G_ii = Σ_j SI_ij
5.3 Compute the Laplacian matrix L = G - SI from G;
5.4 Use the eig function in Python to compute the eigenvalues and eigenvectors of L; the service document feature vector matrix F is found by solving:
argmin_F Tr(F^T L F), subject to: F^T F = I
where argmin_F denotes the value of F that minimizes Tr(F^T L F) and Tr denotes the matrix trace;
5.5 Sort the eigenvalues from small to large and take the eigenvectors of the first C eigenvalues, where C is the specified number of clusters, as the initial cluster centers;
5.6 Compute the Euclidean distance dist from each feature vector to each cluster center and assign each Mashup service to the cluster with the minimum distance:
dist = sqrt(Σ_i (f_i - Ce_i)^2)
where f_i is the i-th value in the feature vector f and Ce_i is the i-th value of the cluster center vector Ce;
5.7 Update each cluster center to the mean of the feature vectors in its cluster;
5.8 Compute the Euclidean distance between the new and old cluster centers as the error value;
5.9 Repeat steps 5.6-5.8 until the error is less than a given threshold or the number of iterations reaches the maximum.
2. The NLP-oriented Mashup service spectral clustering method based on GSDPMM and a topic model as claimed in claim 1, wherein the process of 1.2 is as follows:
1.2.1 According to z_d, obtain the topic k of the current document d, increase m_z[k] by 1 and n_z[k] by Len, where Len is the length of the document;
1.2.2 Traverse each word w in document d, increasing n_zv[k][w] by 1;
1.2.3 Repeat 1.2.1-1.2.2 until all Mashup services are processed.
3. The NLP-oriented Mashup service spectral clustering method based on GSDPMM and a topic model as claimed in claim 1 or 2, wherein the process of 1.3 is as follows:
1.3.1 According to z_d, obtain the topic k of the current document d, decrease m_z[k] by 1 and n_z[k] by Len, where Len is the length of the document;
1.3.2 Traverse each word w in document d, decreasing n_zv[k][w] by 1;
1.3.3 Traverse each existing topic z and calculate the probability of document d on that topic:
p(z_d = z | z_¬d, d) ∝ (m_z^¬d / (N - 1 + αN)) · (Π_{w∈d} Π_{j=1}^{N_d^w} (n_z^{w,¬d} + β + j - 1)) / (Π_{i=1}^{N_d} (n_z^¬d + Vβ + i - 1))
1.3.4 Calculate the probability of document d under a new topic:
p(z_d = K+1 | z_¬d, d) ∝ (αN / (N - 1 + αN)) · (Π_{w∈d} Π_{j=1}^{N_d^w} (β + j - 1)) / (Π_{i=1}^{N_d} (Vβ + i - 1))
where α and β are hyperparameters, z_d is the topic of the current document d, the superscript ¬d marks statistics computed without the information of document d, z_¬d is the topic assignment of every document excluding document d, m_z^¬d is the number of documents under topic z after removing document d, n_z^{w,¬d} is the number of occurrences of word w in topic z without document d, n_z^¬d is the number of words in topic z without document d, N_d is the number of words in document d, and N_d^w is the number of occurrences of word w in document d.
4. The NLP-oriented Mashup service spectral clustering method based on GSDPMM and a topic model as claimed in claim 1 or 2, wherein the process of 1.4 is as follows:
1.4.1 accumulating the probability of the document d under each theme to obtain a total probability prob;
1.4.2 randomly generating a random number thred in [0, prob ];
1.4.3 accumulating the probability of the document d under each topic, if the accumulated sum of the current topic k is more than or equal to thred, the topic of the document d is k.
5. The NLP-oriented Mashup service spectral clustering method based on GSDPMM and a topic model as claimed in claim 1 or 2, wherein the process of 2.1 is as follows:
2.1.1 Traverse each word in the current Mashup service description document and lemmatize it using NLTK;
2.1.2 Use NLTK to extract the word stem and judge whether the word is a noun; if it is, add it to the noun set Nset;
2.1.3 Repeat steps 2.1.1-2.1.2 until all Mashup services are processed.
6. The NLP-oriented Mashup service spectral clustering method based on GSDPMM and a topic model as claimed in claim 1 or 2, wherein the process of 2.2 is as follows:
2.2.1 Traverse each word in the Mashup service description document, count how many times each word occurs in the current document, and calculate its TF value:
TF_{i,j} = NUM(j) / LEN(i)
where TF_{i,j} is the word frequency of the j-th word in the i-th Mashup service description document, NUM(j) is the number of times the j-th word appears, and LEN(i) is the length of the i-th Mashup text;
2.2.2 Count the number of Mashup service documents in which each word appears and calculate its IDF value:
IDF(x) = log(N / Doc(x))
where IDF(x) is the IDF value of word x, N is the number of Mashup documents, and Doc(x) is the number of Mashup documents containing word x;
2.2.3 Traverse the words in all Mashup documents and calculate their TF-IDF values:
TF-IDF(x) = TF(x) * IDF(x)
where TF-IDF(x) is the TF-IDF value of word x and TF(x) is the TF value of word x.
7. The Mashup service spectrum clustering method for NLP based on GSDPMM and topic model as claimed in claim 1 or 2, wherein the process of 2.3 is as follows:
2.3.1 traversing each word w x in the current Mashup service document to calculate the weight information WeightContext (w x) of the upper and lower Wen Yuyi, wherein the calculation formula is as follows:
Where sim (w x,wy) represents the similarity of words w x and w y, calculated by the WordNet tool, w y is the context word of w x, d represents the current Mashup service description document, and N d represents the length of the current Mashup service description document; wordNet is an English dictionary, words are organized through a net structure, words with similar meanings are divided into a group, and similarity is obtained through returning the shortest paths of the words among networks;
2.3.2 calculating the service tag semantic weight WeightTag(wx) of each word by the following formula:
WeightTag(wx) = Σ(t∈Tagd) sim(wx, t) / |Tagd|
wherein Tagd represents the service tag set of the current Mashup service document, and t represents a word in the service tags;
2.3.3 recalculating the semantic weights of the words based on the TF-IDF values in combination with the calculation results in 2.3.1 and 2.3.2.
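A minimal Python sketch of 2.3.1 and 2.3.2 using NLTK's WordNet interface; taking path similarity between the first synsets, and averaging over the document length and the tag-set size, are assumptions:

    from nltk.corpus import wordnet as wn   # needs the wordnet data package

    def wordnet_sim(w1, w2):
        # shortest-path similarity between the first synsets of the two words
        s1, s2 = wn.synsets(w1), wn.synsets(w2)
        if not s1 or not s2:
            return 0.0
        return s1[0].path_similarity(s2[0]) or 0.0

    def weight_context(wx, doc):
        # 2.3.1 average WordNet similarity of wx to the context words of d
        return sum(wordnet_sim(wx, wy) for wy in doc if wy != wx) / len(doc)

    def weight_tag(wx, tags):
        # 2.3.2 average WordNet similarity of wx to the service tag words
        return sum(wordnet_sim(wx, t) for t in tags) / len(tags) if tags else 0.0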
8. The Mashup service spectrum clustering method for NLP based on GSDPMM and topic model as claimed in claim 7, wherein the operation of 2.3.3 is as follows:
2.3.3.1 traversing each word wx in the current Mashup service description document and judging whether it is in the noun set NSet; if wx is in the noun set, recalculating its semantic weight by combining the TF-IDF value with WeightContext(wx) and WeightTag(wx); if wx is not in the noun set NSet, jumping to step 2.3.3.2;
2.3.3.2 assigning the semantic weight of the word as its TF-IDF value, the calculation formula is as follows:
SemWeight(wx)=TF-IDF(wx)
2.3.3.3 repeating steps 2.3.3.1-2.3.3.2 until all Mashup services are processed, obtaining the document-word semantic weight matrix D.
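A minimal sketch of the branch in 2.3.3; the multiplicative combination for nouns is an assumption, since claim 8 states only that the TF-IDF value is combined with the context and tag weights:

    def sem_weight(wx, tf_idf_x, context_w, tag_w, nset):
        # context_w, tag_w: WeightContext(wx) and WeightTag(wx) from 2.3.1/2.3.2
        if wx in nset:
            # 2.3.3.1 nouns: TF-IDF boosted by context and tag semantics (assumed form)
            return tf_idf_x * (1 + context_w + tag_w)
        return tf_idf_x                      # 2.3.3.2 non-nouns keep their TF-IDF value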
9. The Mashup service spectrum clustering method for NLP based on GSDPMM and topic model as claimed in claim 1 or 2, wherein the process of 3.1 is as follows:
3.1.1 for the current Mashup service, calculating the length Len of the Mashup service description document, and setting the length of a sliding window as Len;
3.1.2 counting the co-occurrence of each word with the other words of the Mashup service description document: if a context word of the current word (a word before or after it) lies within the sliding-window distance Len, adding 1 to the co-occurrence count of the word and that context word;
3.1.3 repeating 3.1.2 until all words in Mashup are processed;
3.1.4 repeat 3.1.1-3.1.3 until all Mashup services are processed.
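A minimal Python sketch of 3.1; with the window set to the document length Len, every word pair in the document co-occurs, but the window bound is kept explicit to mirror the claim:

    from collections import defaultdict

    def cooccurrence_counts(doc):
        # doc: tokenized Mashup service description document
        Len = len(doc)                        # 3.1.1 window length = Len
        counts = defaultdict(int)
        for i, w in enumerate(doc):           # 3.1.2-3.1.3 count word pairs
            for j in range(max(0, i - Len), min(len(doc), i + Len + 1)):
                if j != i:
                    counts[(w, doc[j])] += 1  # +1 per co-occurrence in window
        return counts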
CN202110097170.6A 2021-01-25 2021-01-25 NLP-oriented Mashup service spectrum clustering method based on GSDPMM and topic model Active CN112836491B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110097170.6A CN112836491B (en) 2021-01-25 2021-01-25 NLP-oriented Mashup service spectrum clustering method based on GSDPMM and topic model

Publications (2)

Publication Number Publication Date
CN112836491A CN112836491A (en) 2021-05-25
CN112836491B 2024-05-07

Family

ID=75931371

Country Status (1)

Country Link
CN (1) CN112836491B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117093935B (en) * 2023-10-16 2024-03-19 深圳海云安网络安全技术有限公司 Classification method and system for service system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111695347A * 2019-03-15 2020-09-22 Baidu (USA) LLC System and method for topic discovery and word embedding for mutual learning
CN110390014A * 2019-07-17 2019-10-29 Tencent Technology (Shenzhen) Co., Ltd. Topic crawling method, apparatus and storage medium
CN110717047A * 2019-10-22 2020-01-21 Hunan University of Science and Technology Web service classification method based on graph convolution neural network
CN111475609A * 2020-02-28 2020-07-31 Zhejiang University of Technology Improved K-means service clustering method around topic modeling

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Collaboratively Improving Topic Discovery and Word Embeddings by Coordinating Global and Local Contexts; Guangxu Xun et al.; Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; full text *
Non-negative Matrix Factorization Meets Word Embedding; Melissa Ailem et al.; Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval; full text *
Research on API Recommendation Methods Based on Semantic Representation Clustering of Mashup Services; Zhu Shumiao; China Master's Theses Full-text Database (No. 07); full text *
Domain Tag-Assisted Service Clustering Method; Tian Gang et al.; Acta Electronica Sinica; Vol. 43, No. 7; full text *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant