CN113361270B - Short text optimization topic model method for service data clustering - Google Patents
- Publication number
- CN113361270B (application CN202110570274.4A)
- Authority
- CN
- China
- Prior art keywords
- word
- topic
- setting
- word pair
- matrix
- Prior art date: 2021-05-25
- Legal status: Active
Classifications
- G06F40/284 Natural language analysis; recognition of textual entities; lexical analysis, e.g. tokenisation or collocates
- G06F18/22 Pattern recognition; analysing; matching criteria, e.g. proximity measures
- G06F18/23213 Pattern recognition; clustering techniques; non-hierarchical techniques using statistics or function optimisation with a fixed number of clusters, e.g. K-means clustering
Abstract
A short text optimization topic model method for service data clustering. First, a word pair model based on the BTM topic model is designed; the model screens the word pair information with word vectors, overcoming the long training time of the word pair topic model. During training, a probability sampling strategy based on the topic distribution information finds representative word pairs that are highly relevant to the currently sampled topic, and the weight of each word pair in the sampling process is adjusted to reduce the interference caused by noise words. Finally, the document topic distribution obtained by training the model is used as the service feature vector, and an optimized DPC algorithm (sDPC) completes the clustering of the service description documents. The invention improves the service clustering precision and alleviates the sparsity and noise problems caused by service description documents in service clustering.
Description
Technical Field
The invention relates to a short text optimization topic model method for service-oriented data clustering.
Background
Service clustering is an important research topic in the field of service computing. It refers to dividing services into clusters so that services in the same cluster have high functional similarity while services in different clusters have low similarity. Service clustering reduces the search space and time cost of service discovery and greatly improves the efficiency of finding a suitable service.
REST (Representational State Transfer) is a software architecture style whose core idea is to identify resources with URIs and to express operations on those resources with HTTP methods. A REST service is an API in the REST style: the front end sends a request containing the URI of the target resource and uses an HTTP method (POST, GET, PUT, DELETE) to select the operation on it, so the server only needs to define a uniform response interface instead of parsing many kinds of requests. REST services usually return data in JSON or XML form, accompanied by a description document written in natural language. Because it is lightweight, simple in structure and directly resource oriented, REST has gradually become the mainstream API form on the internet. Researchers therefore usually compute the corresponding service features from the service description documents.
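As a concrete illustration of the REST style described above, the following minimal Python sketch (using the common requests library) reads and creates a hypothetical service resource; the base URL, the resource path and the JSON fields are illustrative assumptions, not part of the invention.

```python
import requests

# Hypothetical REST endpoint: the URI identifies the resource,
# and the HTTP method selects the operation on it.
BASE = "https://api.example.com/v1"

# GET /services/42 -> read a service resource; the response is a JSON
# document that typically contains a natural-language description field.
service = requests.get(f"{BASE}/services/42").json()
print(service.get("description", ""))

# POST /services -> create a new service resource from a JSON body.
resp = requests.post(f"{BASE}/services",
                     json={"name": "weather", "description": "Current weather by city."})
print(resp.status_code)
```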
A topic model can automatically learn the latent topic distribution of a corpus through iterative sampling, making full use of the implicit semantic information of the documents, and the document topic distribution obtained by training can serve as the REST service feature information. However, API description documents are short texts: a short text contains only a few words, provides little word co-occurrence information, and is therefore semantically sparse. On short texts, ordinary topic models cannot perform well because of this sparsity. Description documents also face noise interference, i.e. the text contains words that are unrelated to the functional topic; such words, called noise words, can negatively affect the topic decision. Only by addressing these two problems can an effective and reasonable document topic distribution be extracted from the description documents.
Word vectors map words into a low-dimensional dense real vector space in which semantically similar words lie closer together. With this property, the similarity between words can be expressed by the distance between their vectors, so word vectors can be used to simulate semantic background information and to capture the semantic relationships between words.
BTM (Biterm Topic Model) is a word pair topic model proposed in 2013. After word segmentation of the corpus, the model combines the words pairwise to convert the word set into a word pair set, samples over this word pair set, and obtains the corresponding topic distribution by training. Converting the original corpus into word pairs increases the semantic co-occurrence information and alleviates the sparsity problem of short texts.
The DPC (clustering by fast search and find of density peaks) algorithm was proposed by Rodriguez et al. It assumes that cluster center points have a greater density than the surrounding points while also being relatively far from other high-density points. The DPC algorithm computes the local density and the density distance of every point to select the points most likely to be cluster centers, which provides the initial cluster centers and reduces the chance of falling into a local optimum.
Disclosure of Invention
In order to overcome the shortcomings of the prior art and to solve the sparsity and noise problems caused by service description documents in service clustering, the invention provides a short text optimization topic model method for service data clustering. First, a word pair model based on the BTM topic model is designed; the model screens the word pair information with word vectors, overcoming the long training time of the word pair topic model. During training, a probability sampling strategy based on the topic distribution information finds representative word pairs that are highly relevant to the currently sampled topic, and the weight of each word pair in the sampling process is adjusted to reduce the interference caused by noise words. Finally, the document topic distribution obtained by training the model is used as the service feature vector, and an optimized DPC algorithm (sDPC) completes the clustering of the service description documents.
The invention adopts the following technical scheme:
a short text optimization topic model method for service-oriented data clustering, the method comprising the steps of:
the first step: performing word segmentation processing on the document, and performing stop word removal and tense normalization;
and a second step of: converting the word segmentation result into a word pair set;
And a third step of: calculating word pair similarity by utilizing a pre-trained word vector model, and screening a word pair set;
fourth step: calculating representative word pairs in the iteration process of the topic model, and utilizing the representative word pairs to realize a probability sampling algorithm to complete topic model training and output document topic distribution of the service description document;
fifth step: and calling sDPC a clustering algorithm by taking the document theme distribution as a feature vector to finish service clustering.
Further, the procedure of the first step is as follows:
1.1 reading the service description document information, taking the service name as the key and the document content as the value, and converting them into a key-value pair set D;
1.2 traversing the document contents in D, setting the current document content as d and setting an empty set word_list; splitting d into sentences, removing punctuation marks, and then segmenting each sentence into words;
1.3 checking each segmented word during the traversal: if the word does not consist of special symbols, is not a pure number and is not in the stop word list, it is normalized and stored in the word_list set of step 1.2; after every word has been checked, word_list replaces d as the value stored for the corresponding key in D (a preprocessing sketch is given below).
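A minimal preprocessing sketch for steps 1.1 to 1.3 is given below. It assumes the NLTK library (mentioned in the detailed description) with its punkt, stopwords and wordnet resources installed, and uses verb lemmatisation as a rough stand-in for tense normalisation; the sample document is illustrative.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Requires: nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')
STOP = set(stopwords.words("english"))
LEMMA = WordNetLemmatizer()

def preprocess(doc_text):
    """Sentence-split, tokenise, drop punctuation/numbers/stop words, normalise tense."""
    word_list = []
    for sent in nltk.sent_tokenize(doc_text):
        for w in nltk.word_tokenize(sent.lower()):
            if not w.isalpha():        # drops punctuation, special symbols and pure numbers
                continue
            if w in STOP:
                continue
            word_list.append(LEMMA.lemmatize(w, pos="v"))  # rough tense normalisation
    return word_list

# D maps service name (key) to document content (value); after preprocessing
# the token list word_list replaces the raw text as the stored value.
D = {"WeatherAPI": "Returns current weather and forecasts for a given city."}
D = {name: preprocess(text) for name, text in D.items()}
```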
Further, the process of the second step is as follows:
2.1 traversing the word segmentation result obtained in the step 1 to generate a non-repeated vocabulary Voc;
2.2 defining a word pair biterm structure, wherein the structure comprises serial numbers of two different words in Voc, and a smaller serial number is set as word1, and a larger serial number is set as word2;
2.3 setting an empty set Whole_words as the storage set of all word segmentation results, traversing the key-value pairs D, and storing the word_list set corresponding to each key into Whole_words in order;
2.4 traversing all word information in the Whole_words, and converting the word information into corresponding word sequence numbers in the vocabulary Voc;
2.5 generating a word pair set B.
Preferably, the step of 2.5 is as follows:
2.5.1 traversing the Whole_words set, and setting a vocabulary sequence result set of the current corresponding document segmentation as a single_list;
2.5.2 setting a word pair set B for storing word pair information;
2.5.3 traversing single_list, where the current object is single_list(i), single_list(i) denotes the vocabulary sequence number of the i-th word in single_list and 0 ≤ i < single_list.length; combining each single_list(i) with the vocabulary sequence number single_list(j) of the j-th word, where i < j < single_list.length, to generate a word pair b;
2.5.4 storing the generated word pairs into the word pair set B and assigning each word pair b a sequence number in order, recorded as b.index (a biterm generation sketch is given below).
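The sketch below illustrates steps 2.1 to 2.5 on the preprocessed dictionary D from the previous sketch: it builds the vocabulary Voc, converts every document into vocabulary sequence numbers, and combines them pairwise into the word pair set B. The dictionary-based biterm representation (including the "doc" field) is an implementation choice, not prescribed by the method.

```python
from itertools import combinations

def build_biterms(D):
    """Build the vocabulary voc and the word pair (biterm) set B from tokenised documents."""
    voc = sorted({w for words in D.values() for w in words})
    word2id = {w: i for i, w in enumerate(voc)}
    whole_words = {name: [word2id[w] for w in words] for name, words in D.items()}

    B = []  # each biterm: smaller id as word1, larger id as word2, plus its source document
    for name, single_list in whole_words.items():
        for i, j in combinations(range(len(single_list)), 2):
            w1, w2 = sorted((single_list[i], single_list[j]))
            B.append({"index": len(B), "word1": w1, "word2": w2, "doc": name})
    return voc, whole_words, B

voc, whole_words, B = build_biterms(D)
print(len(voc), len(B))
```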
Still further, the process of the third step is as follows:
3.1 reading a pre-trained word vector model that contains word information and the corresponding word vector results. Word vector calculation is realized with the Skip-gram model, which uses the central word to predict the conditional probabilities of several words before and after it; the word vectors finally output are the weight matrix of the hidden layer of the trained neural network. The objective function of Skip-gram can be written as

$$J=\frac{1}{N}\sum_{t=1}^{N}\sum_{-c\le j\le c,\ j\ne 0}\log p(w_{t+j}\mid w_t),\qquad p(w_O\mid w_I)=\frac{\exp(V_{w_O}^{T}V_{w_I})}{\sum_{w=1}^{V}\exp(V_{w}^{T}V_{w_I})}$$

where N is the size of the text, V_w is the word vector of word w, c is the window size, i.e. the number of predicted words before and after the central word, T denotes matrix transposition, exp() is the exponential function with base e, and V is the number of words. After model training, words with similar senses obtain similar weights, so the similarity between words can be measured by the distance between their vectors;
3.2 setting eta as the similarity threshold and B_sim as the filtered word pair set; traversing the word pair set B with the current word pair b, obtaining the word W_1 in the vocabulary Voc corresponding to b.word1 and the word W_2 in Voc corresponding to b.word2; when W_1 equals W_2, the word pair similarity sim is set to 1; when W_1 or W_2 is not present in the word vector model, sim is also set to 1; when W_1 is not equal to W_2 and both are present in the word vector model, the word vector V_1 of W_1 and the word vector V_2 of W_2 are obtained from the model and the cosine similarity between the two vectors is used as the word pair similarity sim:

$$sim=\frac{V_1^{T}V_2}{\lVert V_1\rVert\,\lVert V_2\rVert}$$

where T denotes matrix transposition and ||·|| denotes the modulus of a vector;
3.3 comparing the word pair similarity sim computed for word pair b with eta; when sim is greater than or equal to eta, b is stored in the filtered word pair set B_sim (a filtering sketch is given below).
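A sketch of the similarity screening in steps 3.1 to 3.3 follows. It assumes a pre-trained word2vec (skip-gram) model loaded with gensim from a file named vectors.bin and a threshold eta of 0.35; both the file name and the threshold value are illustrative assumptions.

```python
import numpy as np
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)  # hypothetical model file

def pair_similarity(w1, w2):
    """Cosine similarity of the two words of a biterm; 1.0 when it cannot be decided."""
    if w1 == w2 or w1 not in kv or w2 not in kv:
        return 1.0
    v1, v2 = kv[w1], kv[w2]
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

eta = 0.35  # similarity threshold (assumed value)
B_sim = [b for b in B if pair_similarity(voc[b["word1"]], voc[b["word2"]]) >= eta]
for new_idx, b in enumerate(B_sim):
    b["index"] = new_idx  # re-number the retained pairs so b.index addresses lambda and S
```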
Further, the fourth step is as follows:
4.1 setting a zero matrix nz of size k×1 to store the number of word pairs assigned to each topic, where k is the number of topics, and setting a zero matrix nwz of size k×|Voc| to store the number of times each vocabulary word is assigned to each topic, where |Voc| is the number of words in the vocabulary; a zero matrix is a matrix whose elements are all 0;
4.2 randomly assigning subjects to word pairs, and initializing nz and nwz;
4.3, setting iteration number iteration, and setting the current iteration number as iter;
4.4, starting the first iteration, traversing the screened word pair set B_sim, and carrying out sampling operation on each word pair B;
4.5 calculating a representative word pair matrix S;
4.6 continuing the iteration, adding 1 to the current iteration number iter, traversing the filtered word pair set B_sim, and performing the sampling operation on each word pair b;
4.7 repeating the operation of the step 4.5;
4.8 comparing iter with iteration, and stopping the iteration when iter equals iteration;
4.9 calculating the document topic distribution theta according to the following formula:

$$P(z\mid d)=\frac{nd_z}{|B\_sim|}$$

where P(z|d) is the probability of topic z for document d, nd_z is the number of word pairs in the document assigned to topic z, and |B_sim| is the number of word pairs in the filtered word pair set (a sketch of this computation is given below).
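The following sketch shows one way to compute the document topic distribution theta of step 4.9 once every word pair has a sampled topic. It assumes each biterm records its source document under a "doc" field (as in the earlier biterm sketch); each row is normalised by the number of word pairs of that document so that it forms a probability distribution, which is one reading of the formula above.

```python
import numpy as np

def document_topic_distribution(B_sim, doc_names, k):
    """theta[d, z]: share of the word pairs of document d that are assigned to topic z."""
    theta = np.zeros((len(doc_names), k))
    doc_id = {name: i for i, name in enumerate(doc_names)}
    for b in B_sim:
        theta[doc_id[b["doc"]], b["topic"]] += 1  # count pairs of document d on topic z
    theta /= theta.sum(axis=1, keepdims=True)     # normalise each row to a distribution
    return theta
```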
The step of 4.2 is as follows:
4.2.1 traversing the screened word pair set B_sim, randomly obtaining an integer value t for each word pair B, wherein t is more than or equal to 0 and less than k, and marking t as a topic of the word pair B as b.topic;
4.2.2 traversing the B_sim after randomly assigning the theme, setting the current word pair as B, adding 1 to the value of the nz [ b.topic ] position in the matrix nz, adding 1 to the values of the nwz [ b.topic ] [ b.word1] position and nwz [ b.topic ] [ b.word2] position in the matrix nwz respectively, wherein b.word1 represents the value of word1 in the word pair, and b.word2 represents the value of word2 in the word pair, and completing the initialization of the matrix.
The step of 4.4 is as follows:
4.4.1 subtracting 1 from the values of nz[b.topic], nwz[b.topic][b.word1] and nwz[b.topic][b.word2] respectively, to exclude the influence of the current word pair b;
4.4.2 sampling each topic z with the following formula:

$$P(z\mid \mathbf{z}_{\neg b},B)\propto (n_z+\alpha)\cdot\frac{(n_{w_i\mid z}+\beta)(n_{w_j\mid z}+\beta)}{\left(\sum_{w}n_{w\mid z}+M\beta\right)^{2}}$$

where P(z|z_¬b, B) denotes the probability that word pair b belongs to topic z after the influence of the current word pair b has been removed, α and β are hyper-parameters, n_z denotes the number of word pairs belonging to topic z, i.e. the value of nz[z] in matrix nz, n_wi|z denotes the number of times the word w_i with sequence number b.word1 in the vocabulary is assigned to topic z, i.e. the value of nwz[z][b.word1] in matrix nwz, n_wj|z denotes the number of times the word w_j with sequence number b.word2 in the vocabulary is assigned to topic z, i.e. the value of nwz[z][b.word2] in matrix nwz, and M is the number of words in the vocabulary; the probabilities obtained for all topics are stored in order in a list distribution;
4.4.3 applying a roulette-wheel operation to the probability distribution obtained in the previous step to obtain the new topic of word pair b, which is set as b.topic; the roulette-wheel algorithm, also called the proportional selection algorithm, accumulates the probability distribution piecewise to obtain the cumulative probability of each individual, generates a random number in the interval [0,1], and selects as the roulette output the individual whose cumulative probability is greater than or equal to the random number with the smallest difference;
4.4.4 adding 1 to the value at position nz[b.topic] in matrix nz and adding 1 to the values at positions nwz[b.topic][b.word1] and nwz[b.topic][b.word2] in matrix nwz, so that the matrices receive the sampling result (a sampling sketch is given below).
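The sketch below implements the initialisation of step 4.2 and one Gibbs sampling move of steps 4.4.1 to 4.4.4, continuing from the biterm sketches above; the topic number k and the hyper-parameters alpha and beta are assumed values, and numpy's choice with a probability vector plays the role of the roulette-wheel selection.

```python
import numpy as np

k, M = 20, len(voc)              # number of topics (assumed) and vocabulary size
alpha, beta = 50.0 / k, 0.01     # Dirichlet hyper-parameters (assumed values)
nz = np.zeros(k)                 # word pairs currently assigned to each topic
nwz = np.zeros((k, M))           # times each vocabulary word is assigned to each topic

rng = np.random.default_rng(0)
for b in B_sim:                  # step 4.2: random topic initialisation
    b["topic"] = int(rng.integers(k))
    nz[b["topic"]] += 1
    nwz[b["topic"], b["word1"]] += 1
    nwz[b["topic"], b["word2"]] += 1

def sample_pair(b):
    """One Gibbs move for word pair b (steps 4.4.1-4.4.4)."""
    z = b["topic"]
    nz[z] -= 1; nwz[z, b["word1"]] -= 1; nwz[z, b["word2"]] -= 1   # exclude b itself

    distribution = (nz + alpha) * (nwz[:, b["word1"]] + beta) * (nwz[:, b["word2"]] + beta) \
                   / (nwz.sum(axis=1) + M * beta) ** 2
    distribution /= distribution.sum()
    new_z = int(rng.choice(k, p=distribution))   # roulette-wheel selection

    b["topic"] = new_z
    nz[new_z] += 1; nwz[new_z, b["word1"]] += 1; nwz[new_z, b["word2"]] += 1
```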
The step of 4.5 is as follows:
4.5.1 setting a matrix lambda of size |B_sim|×k as the representative word pair judgement matrix, where |B_sim| is the number of word pairs in the word pair set, and setting a matrix S of size |B_sim|×k as the representative word pair matrix;
4.5.2 traversing the filtered word pair set B_sim with the current word pair b, traversing all topics, and calculating the probability of word pair b for topic z with the following formula:

$$p(z\mid b)\propto (n_z+\alpha)\cdot\frac{(n_{w_i\mid z}+\beta)(n_{w_j\mid z}+\beta)}{\left(\sum_{w}n_{w\mid z}+M\beta\right)^{2}}$$

where the symbols have the same meaning as in step 4.4.2; the maximum of the probabilities p(z|b) of word pair b over the topics is found and set as max(p(z|b)), the ratio p(z|b)/max(p(z|b)) is calculated for each topic z, and the ratio is stored at position lambda[b.index][z] in matrix lambda;
4.5.3 traversing all values in matrix lambda, judging the value of lambda[b.index][z] against a Bernoulli decision with a set probability of 0.5, and storing the resulting 0 or 1 in the representative word pair matrix S; the Bernoulli distribution is a discrete probability distribution, and here 1 is returned when the input probability is greater than the set probability and 0 is returned when it is less than or equal to the set probability.
The step of 4.6 is as follows:
4.6.1 subtracting 1 from the values of nz[b.topic], nwz[b.topic][b.word1] and nwz[b.topic][b.word2] respectively, to exclude the influence of the current word pair b;
4.6.2 traversing each topic with the current topic z and judging: if the value of S[b.index][z] is 0, repeating the operations of steps 4.4.2, 4.4.3 and 4.4.4; if the value of S[b.index][z] is 1, replacing the formula of step 4.4.2 with the following formula:

$$P(z\mid \mathbf{z}_{\neg b},B)\propto \mu\,(n_z+\alpha)\cdot\frac{(n_{w_i\mid z}+\beta)(n_{w_j\mid z}+\beta)}{\left(\sum_{w}n_{w\mid z}+M\beta\right)^{2}}$$

where μ is the representative word pair weight parameter, set before training and adjusted to change the model training effect, and then repeating steps 4.4.3 and 4.4.4 (a representative word pair sketch is given below).
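Continuing from the sampling sketch, the following code sketches steps 4.5 and 4.6: it builds the representative word pair matrix S by thresholding the ratio lambda at 0.5 and then re-weights the sampling distribution of representative topics by a factor mu. The value mu = 2.0 and the multiplicative form of the re-weighting are assumptions consistent with, but not dictated by, the description above.

```python
def representative_matrix(B_sim):
    """Step 4.5: S[b.index][z] == 1 marks word pair b as representative for topic z."""
    S = np.zeros((len(B_sim), k), dtype=int)
    for b in B_sim:
        p = (nz + alpha) * (nwz[:, b["word1"]] + beta) * (nwz[:, b["word2"]] + beta) \
            / (nwz.sum(axis=1) + M * beta) ** 2
        lam = p / p.max()                        # ratio of each topic to the best topic
        S[b["index"]] = (lam > 0.5).astype(int)  # threshold at the set probability 0.5
    return S

def sample_pair_weighted(b, S, mu=2.0):
    """Step 4.6: Gibbs move that boosts topics for which b is a representative pair."""
    z = b["topic"]
    nz[z] -= 1; nwz[z, b["word1"]] -= 1; nwz[z, b["word2"]] -= 1

    distribution = (nz + alpha) * (nwz[:, b["word1"]] + beta) * (nwz[:, b["word2"]] + beta) \
                   / (nwz.sum(axis=1) + M * beta) ** 2
    distribution = np.where(S[b["index"]] == 1, mu * distribution, distribution)
    distribution /= distribution.sum()
    new_z = int(rng.choice(k, p=distribution))

    b["topic"] = new_z
    nz[new_z] += 1; nwz[new_z, b["word1"]] += 1; nwz[new_z, b["word2"]] += 1
```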
Further, the fifth step is as follows:
5.1 calculating a cut-off distance dc;
5.2, taking document theme distribution theta as a feature vector of the service, and calculating local density corresponding to all the feature vectors;
5.3, calculating element distances edistance corresponding to all feature vectors;
5.4 setting cluster as the number of clusters, calculating the product r of the local density of each feature vector and its element distance edistance, and selecting the cluster feature vectors with the largest r values as the cluster center set center;
5.5 using center as the initial cluster centers and completing the clustering operation with the Kmeans clustering algorithm; Kmeans is a common partitional clustering algorithm whose objective function is the residual sum of squares (RSS) between the elements and the cluster centers:

$$RSS=\sum_{i=1}^{cluster}\ \sum_{x\in\omega_i}\lVert x-c_i\rVert^{2}$$

where cluster is the number of clusters, ω_i is the i-th cluster, x is a feature vector assigned to the i-th cluster, c_i is the cluster center vector of the i-th cluster, and ||x−c_i||² is the sum of the squared differences between the components of vector x and vector c_i.
The step of 5.1 is as follows:
5.1.1 setting a zero matrix Cd of size |D|×(|D|−1) to store the distances between the vectors, where |D| is the number of service documents, and setting a list d_backup to store the candidate cut-off distances;
5.1.2 traversing the vectors in theta, setting the currently corresponding document topic distribution vector as Vec_m, the topic distribution vector with sequence number m, and calculating the distance between Vec_m and every other vector in theta using the Euclidean distance:

$$distance(Vec_m,Vec_n)=\sqrt{\sum_{i=1}^{k}\left(Vec_{mi}-Vec_{ni}\right)^{2}}$$

where Vec_mi is the i-th component of the topic distribution vector with sequence number m and k is the number of topics; the distances between Vec_m and the other vectors are sorted from large to small and stored in the m-th row of matrix Cd;
5.1.3 traversing each row in Cd with the current row SD, calculating the differences between all adjacent components in SD, finding the maximum difference, and taking the smaller of the two components that produce this maximum, denoted md:

md = SD_j | max(SD_{j+1} − SD_j)

where SD_j is the j-th component of SD, 0 ≤ j < |D|−1, |D| is the number of service documents, and max() selects the adjacent pair with the largest difference; each md is stored into d_backup in order;
5.1.4 taking the minimum value in d_backup as the cut-off distance dc (a sketch is given below).
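A sketch of the cut-off distance computation of step 5.1 is given below; theta is the documents-by-topics matrix produced in the fourth step. Reading "the maximum difference between adjacent components" as the largest absolute gap of the descending-sorted distances is an assumption.

```python
import numpy as np

def cutoff_distance(theta):
    """Step 5.1: candidate-based cut-off distance dc from the document-topic vectors."""
    n = len(theta)
    Cd = np.zeros((n, n - 1))
    for m in range(n):
        dists = [np.linalg.norm(theta[m] - theta[o]) for o in range(n) if o != m]
        Cd[m] = np.sort(dists)[::-1]            # row m: distances sorted from large to small
    d_backup = []
    for SD in Cd:
        gaps = SD[1:] - SD[:-1]                 # differences of adjacent components
        j = int(np.argmax(np.abs(gaps)))        # position of the largest gap (assumption)
        d_backup.append(min(SD[j], SD[j + 1]))  # the smaller of the two components -> md
    return min(d_backup), Cd
```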
The step of 5.2 is as follows:
5.2.1 setting list density, and storing local density of each vector;
5.2.2 traversing each row in the distance matrix Cd calculated in the step 5.1.2, setting the current row as SD, and setting a count value count as 0;
5.2.3 traversing SD, setting the current distance value as sd; when sd is smaller than the cut-off distance dc, adding 1 to the count value; after the traversal of SD is complete, storing the current count value into density and resetting count to 0;
5.2.4 repeating 5.2.3 until the traversal of Cd is completed, and finally obtaining the density which comprises the local density values corresponding to all the service document objects;
the step of 5.3 is as follows:
5.3.1 obtaining the maximum value in the density, and setting the maximum value as des_max;
5.3.2 setting list edistance to store element distances;
5.3.3 traversing density, setting the current object value as dens; when dens equals des_max, calculating the maximum Euclidean distance d_max between the feature vector corresponding to the current value and the other service feature vectors and storing d_max in edistance; when dens is not equal to des_max, calculating the Euclidean distances between the corresponding feature vector and the feature vectors whose local density is greater than that of the current object, obtaining the minimum value d_min, and storing d_min in edistance (a clustering sketch is given below).
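Finally, the sketch below ties steps 5.2 to 5.5 together: local density within the cut-off distance, element distance to the nearest denser point, center selection by the product r, and a K-means refinement. scikit-learn's KMeans is used here as a stand-in for the Kmeans step, the cluster count is an assumed parameter, and cutoff_distance is the function from the previous sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

def sdpc_cluster(theta, n_clusters):
    """sDPC-style clustering: density-peak center selection followed by K-means."""
    n = len(theta)
    dc, _ = cutoff_distance(theta)
    dist = np.array([[np.linalg.norm(theta[i] - theta[j]) for j in range(n)]
                     for i in range(n)])

    density = (dist < dc).sum(axis=1) - 1        # step 5.2: neighbours within dc (minus self)
    edistance = np.zeros(n)                      # step 5.3: distance to nearest denser point
    for i in range(n):
        higher = np.where(density > density[i])[0]
        edistance[i] = dist[i].max() if higher.size == 0 else dist[i, higher].min()

    r = density * edistance                      # step 5.4: center score
    centers = theta[np.argsort(r)[::-1][:n_clusters]]
    km = KMeans(n_clusters=n_clusters, init=centers, n_init=1)  # step 5.5: refinement
    return km.fit_predict(theta)

# labels = sdpc_cluster(theta, n_clusters=5)   # theta: document-topic matrix from step 4
```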
The invention has the following beneficial effects: (1) the word pair based topic model greatly increases the co-occurrence information and overcomes the high sparsity of API document text; (2) word pair screening with a word vector model overcomes the high time consumption of the word pair topic model; (3) the probability sampling algorithm realized through representative word pair calculation reduces the influence of short text noise words without increasing the time complexity of the algorithm; (4) clustering the feature vectors with the sDPC algorithm takes the overall topic semantic information of the service documents into account and improves the service clustering precision.
Detailed Description
A short text optimization topic model method for service-oriented data clustering, the method comprising the steps of:
the first step: performing word segmentation processing on the document, and performing stop word removal and tense normalization;
the first step is as follows:
1.1 reading the service description document information, taking the service name as the key and the document content as the value, and converting them into a key-value pair set D;
1.2 traversing the document contents in D, setting the current document content as d and setting an empty set word_list; splitting d into sentences with the natural language processing NLTK library, removing punctuation marks, and then segmenting each sentence into words;
1.3 checking each segmented word during the traversal: if the word does not consist of special symbols, is not a pure number and is not in the stop word list, it is normalized and stored in the word_list set of step 1.2; after every word has been checked, word_list replaces d as the value stored for the corresponding key in D.
And a second step of: converting the word segmentation result into a word pair set;
the second step is as follows:
2.1 traversing the word segmentation result obtained in the step 1 to generate a non-repeated vocabulary Voc;
2.2 defining a word pair biterm structure, wherein the structure comprises serial numbers of two different words in Voc, and a smaller serial number is set as word1, and a larger serial number is set as word2;
2.3 setting an empty set Whole_words as the storage set of all word segmentation results, traversing the key-value pairs D, and storing the word_list set corresponding to each key into Whole_words in order;
2.4 traversing all word information in the Whole_words, and converting the word information into corresponding word sequence numbers in the vocabulary Voc;
2.5 generating a word pair set B, wherein the specific steps are as follows:
2.5.1 traversing the Whole_words set, and setting a vocabulary sequence result set of the current corresponding document segmentation as a single_list;
2.5.2 setting a word pair set B for storing word pair information;
2.5.3 traversing single_list, where the current object is single_list(i), single_list(i) denotes the vocabulary sequence number of the i-th word in single_list and 0 ≤ i < single_list.length; combining each single_list(i) with the vocabulary sequence number single_list(j) of the j-th word, where i < j < single_list.length, to generate a word pair b;
2.5.4 storing the generated word pairs into the word pair set B and assigning each word pair b a sequence number in order, recorded as b.index.
And a third step of: calculating word pair similarity by utilizing a pre-trained word vector model, and screening a word pair set;
the third step is as follows:
3.1 reading a pre-trained word vector model that contains word information and the corresponding word vector results; the model is pre-trained with the Word2vec word vector model based on Skip-gram to realize the word vector calculation. The Skip-gram model uses the central word to predict the conditional probabilities of several words before and after it, and the word vectors finally output are the weight matrix of the hidden layer of the trained neural network. The objective function of Skip-gram can be written as

$$J=\frac{1}{N}\sum_{t=1}^{N}\sum_{-c\le j\le c,\ j\ne 0}\log p(w_{t+j}\mid w_t),\qquad p(w_O\mid w_I)=\frac{\exp(V_{w_O}^{T}V_{w_I})}{\sum_{w=1}^{V}\exp(V_{w}^{T}V_{w_I})}$$

where N is the size of the text, V_w is the word vector of word w, c is the window size, i.e. the number of predicted words before and after the central word, T denotes matrix transposition, exp() is the exponential function with base e, and V is the number of words. After model training, words with similar senses obtain similar weights, so the similarity between words can be measured by the distance between their vectors;
3.2 setting eta as the similarity threshold and B_sim as the filtered word pair set; traversing the word pair set B with the current word pair b, obtaining the word W_1 in the vocabulary Voc corresponding to b.word1 and the word W_2 in Voc corresponding to b.word2; when W_1 equals W_2, the word pair similarity sim is set to 1; when W_1 or W_2 is not present in the word vector model, sim is also set to 1; when W_1 is not equal to W_2 and both are present in the word vector model, the word vector V_1 of W_1 and the word vector V_2 of W_2 are obtained from the model and the cosine similarity between the two vectors is used as the word pair similarity sim:

$$sim=\frac{V_1^{T}V_2}{\lVert V_1\rVert\,\lVert V_2\rVert}$$

where T denotes matrix transposition and ||·|| denotes the modulus of a vector;
3.3 comparing the word pair similarity sim computed for word pair b with eta; when sim is greater than or equal to eta, b is stored in the filtered word pair set B_sim.
Fourth step: calculating representative word pairs in the iteration process of the topic model, and utilizing the representative word pairs to realize a probability sampling algorithm to complete topic model training and output document topic distribution of the service description document;
The fourth step is as follows:
4.1 setting a zero matrix nz of size k×1 to store the number of word pairs assigned to each topic, where k is the number of topics, and setting a zero matrix nwz of size k×|Voc| to store the number of times each vocabulary word is assigned to each topic, where |Voc| is the number of words in the vocabulary; a zero matrix is a matrix whose elements are all 0;
4.2 randomly assigning subjects to word pairs, initializing nz and nwz, and the steps are as follows:
4.2.1 traversing the screened word pair set B_sim, randomly obtaining an integer value t for each word pair B, wherein t is more than or equal to 0 and less than k, and marking t as a topic of the word pair B as b.topic;
4.2.2 traversing B_sim after the topics have been randomly assigned, setting the current word pair as b, adding 1 to the value at position nz[b.topic] in matrix nz, and adding 1 to the values at positions nwz[b.topic][b.word1] and nwz[b.topic][b.word2] in matrix nwz respectively, where b.word1 denotes the value of word1 in the word pair and b.word2 denotes the value of word2 in the word pair, completing the matrix initialization;
4.3, setting iteration number iteration, and setting the current iteration number as iter;
4.4, starting the first iteration, traversing the screened word pair set B_sim, and sampling each word pair B, wherein the steps are as follows:
4.4.1 subtracting 1 from the values of nz[b.topic], nwz[b.topic][b.word1] and nwz[b.topic][b.word2] respectively, to exclude the influence of the current word pair b;
4.4.2 sampling each topic z with the following formula:

$$P(z\mid \mathbf{z}_{\neg b},B)\propto (n_z+\alpha)\cdot\frac{(n_{w_i\mid z}+\beta)(n_{w_j\mid z}+\beta)}{\left(\sum_{w}n_{w\mid z}+M\beta\right)^{2}}$$

where P(z|z_¬b, B) denotes the probability that word pair b belongs to topic z after the influence of the current word pair b has been removed, α and β are hyper-parameters, n_z denotes the number of word pairs belonging to topic z, i.e. the value of nz[z] in matrix nz, n_wi|z denotes the number of times the word w_i with sequence number b.word1 in the vocabulary is assigned to topic z, i.e. the value of nwz[z][b.word1] in matrix nwz, n_wj|z denotes the number of times the word w_j with sequence number b.word2 in the vocabulary is assigned to topic z, i.e. the value of nwz[z][b.word2] in matrix nwz, and M is the number of words in the vocabulary; the probabilities obtained for all topics are stored in order in a list distribution;
4.4.3 applying a roulette-wheel operation to the probability distribution obtained in the previous step to obtain the new topic of word pair b, which is set as b.topic; the roulette-wheel algorithm, also called the proportional selection algorithm, accumulates the probability distribution piecewise to obtain the cumulative probability of each individual, generates a random number in the interval [0,1], and selects as the roulette output the individual whose cumulative probability is greater than or equal to the random number with the smallest difference;
4.4.4 adding 1 to the value at position nz[b.topic] in matrix nz and adding 1 to the values at positions nwz[b.topic][b.word1] and nwz[b.topic][b.word2] in matrix nwz, so that the matrices receive the sampling result;
4.5 calculating a representative word pair matrix S, wherein the specific steps are as follows:
4.5.1 setting a matrix lambda of size |B_sim|×k as the representative word pair judgement matrix, where |B_sim| is the number of word pairs in the word pair set, and setting a matrix S of size |B_sim|×k as the representative word pair matrix;
4.5.2 traversing the filtered word pair set B_sim with the current word pair b, traversing all topics, and calculating the probability of word pair b for topic z with the following formula:

$$p(z\mid b)\propto (n_z+\alpha)\cdot\frac{(n_{w_i\mid z}+\beta)(n_{w_j\mid z}+\beta)}{\left(\sum_{w}n_{w\mid z}+M\beta\right)^{2}}$$

where the symbols have the same meaning as in step 4.4.2; the maximum of the probabilities p(z|b) of word pair b over the topics is found and set as max(p(z|b)), the ratio p(z|b)/max(p(z|b)) is calculated for each topic z, and the ratio is stored at position lambda[b.index][z] in matrix lambda;
4.5.3 traversing all values in matrix lambda, judging the value of lambda[b.index][z] against a Bernoulli decision with a set probability of 0.5, and storing the resulting 0 or 1 in the representative word pair matrix S; the Bernoulli distribution is a discrete probability distribution, and here 1 is returned when the input probability is greater than the set probability and 0 is returned when it is less than or equal to the set probability;
4.6, continuing iteration, adding 1 to the current iteration number iter, traversing the word pair set B_sim after screening, and carrying out sampling operation on each word pair B, wherein the specific steps are as follows:
4.6.1 subtracting 1 from the values of nz[b.topic], nwz[b.topic][b.word1] and nwz[b.topic][b.word2] respectively, to exclude the influence of the current word pair b;
4.6.2 traversing each topic with the current topic z and judging: if the value of S[b.index][z] is 0, repeating the operations of steps 4.4.2, 4.4.3 and 4.4.4; if the value of S[b.index][z] is 1, replacing the formula of step 4.4.2 with the following formula:

$$P(z\mid \mathbf{z}_{\neg b},B)\propto \mu\,(n_z+\alpha)\cdot\frac{(n_{w_i\mid z}+\beta)(n_{w_j\mid z}+\beta)}{\left(\sum_{w}n_{w\mid z}+M\beta\right)^{2}}$$

where μ is the representative word pair weight parameter, set before training and adjusted to change the model training effect, and then repeating the operations of steps 4.4.3 and 4.4.4;
4.7 repeating the operation of the step 4.5;
4.8 comparing iter with iteration, and stopping the iteration when iter equals iteration;
4.9 calculating the document topic distribution theta according to the following formula:

$$P(z\mid d)=\frac{nd_z}{|B\_sim|}$$

where P(z|d) is the probability of topic z for document d, nd_z is the number of word pairs in the document assigned to topic z, and |B_sim| is the number of word pairs in the filtered word pair set.
Fifth step: taking document theme distribution as a feature vector, and calling sDPC a clustering algorithm to finish service clustering;
the fifth step is as follows:
5.1 calculating the cut-off distance dc, the steps are as follows:
5.1.1 setting a zero matrix Cd of size |D|×(|D|−1) to store the distances between the vectors, where |D| is the number of service documents, and setting a list d_backup to store the candidate cut-off distances;
5.1.2 traversing the vectors in theta, setting the currently corresponding document topic distribution vector as Vec_m, the topic distribution vector with sequence number m, and calculating the distance between Vec_m and every other vector in theta using the Euclidean distance:

$$distance(Vec_m,Vec_n)=\sqrt{\sum_{i=1}^{k}\left(Vec_{mi}-Vec_{ni}\right)^{2}}$$

where Vec_mi is the i-th component of the topic distribution vector with sequence number m and k is the number of topics; the distances between Vec_m and the other vectors are sorted from large to small and stored in the m-th row of matrix Cd;
5.1.3 traversing each row in Cd with the current row SD, calculating the differences between all adjacent components in SD, finding the maximum difference, and taking the smaller of the two components that produce this maximum, denoted md:

md = SD_j | max(SD_{j+1} − SD_j)

where SD_j is the j-th component of SD, 0 ≤ j < |D|−1, |D| is the number of service documents, and max() selects the adjacent pair with the largest difference; each md is stored into d_backup in order;
5.1.4 obtaining the minimum value in d_backup, namely the cut-off distance dc;
And 5.2, taking document theme distribution theta as a feature vector of a service, and calculating local density corresponding to all the feature vectors, wherein the steps are as follows:
5.2.1 setting list density, and storing local density of each vector;
5.2.2 traversing each row in the distance matrix Cd calculated in the step 5.1.2, setting the current row as SD, and setting a count value count as 0;
5.2.3 traversing SD, setting the current distance value as sd; when sd is smaller than the cut-off distance dc, adding 1 to the count value; after the traversal of SD is complete, storing the current count value into density and resetting count to 0;
5.2.4 repeat 5.2.3 until Cd traversal is completed. The obtained density finally contains the local density values corresponding to all the service document objects;
5.3, calculating element distances edistance corresponding to all feature vectors, wherein the steps are as follows:
5.3.1 obtaining the maximum value in the density, and setting the maximum value as des_max;
5.3.2 setting list edistance to store element distances;
5.3.3 traversing density, setting the current object value as dens; when dens equals des_max, calculating the maximum Euclidean distance d_max between the feature vector corresponding to the current value and the other service feature vectors and storing d_max in edistance; when dens is not equal to des_max, calculating the Euclidean distances between the corresponding feature vector and the feature vectors whose local density is greater than that of the current object, obtaining the minimum value d_min, and storing d_min in edistance;
5.4 setting cluster as the number of clusters, calculating the product r of the local density of each feature vector and its element distance edistance, and selecting the cluster feature vectors with the largest r values as the cluster center point set center;
5.5 using center as the initial cluster centers and completing the clustering operation with the Kmeans clustering algorithm; Kmeans is a common partitional clustering algorithm whose objective function is the residual sum of squares (RSS) between the elements and the cluster centers:

$$RSS=\sum_{i=1}^{cluster}\ \sum_{x\in\omega_i}\lVert x-c_i\rVert^{2}$$

where cluster is the number of clusters, ω_i is the i-th cluster, x is a feature vector assigned to the i-th cluster, c_i is the cluster center vector of the i-th cluster, and ||x−c_i||² is the sum of the squared differences between the components of vector x and vector c_i.
The embodiments described in this specification are merely illustrative of ways in which the inventive concept may be implemented. The scope of the present invention should not be construed as being limited to the specific forms set forth in the embodiments; it also covers equivalent technical means that would occur to those skilled in the art based on the inventive concept.
Claims (8)
1. A short text optimization topic model method for service-oriented data clustering, the method comprising the steps of:
the first step: performing word segmentation processing on the document, and performing stop word removal and tense normalization;
and a second step of: converting the word segmentation result into a word pair set;
And a third step of: calculating word pair similarity by utilizing a pre-trained word vector model, and screening a word pair set;
fourth step: calculating representative word pairs in the iteration process of the topic model, and utilizing the representative word pairs to realize a probability sampling algorithm to complete topic model training and output document topic distribution of the service description document;
The fourth step is as follows:
4.1 setting a zero matrix nz of size k×1 to store the number of word pairs assigned to each topic, where k is the number of topics, and setting a zero matrix nwz of size k×|Voc| to store the number of times each vocabulary word is assigned to each topic, where |Voc| is the number of words in the vocabulary; a zero matrix is a matrix whose elements are all 0;
4.2 randomly assigning subjects to word pairs, and initializing nz and nwz;
4.3, setting iteration number iteration, and setting the current iteration number as iter;
4.4, starting the first iteration, traversing the screened word pair set B_sim, and carrying out sampling operation on each word pair B;
4.5 calculating a representative word pair matrix S;
4.6 continuing the iteration, adding 1 to the current iteration number iter, traversing the filtered word pair set B_sim, and performing the sampling operation on each word pair b;
4.7 repeating the operation of the step 4.5;
4.8 comparing iter with iteration, and stopping the iteration when iter equals iteration;
4.9 calculating the document topic distribution theta according to the following formula:

$$P(z\mid d)=\frac{nd_z}{|B\_sim|}$$

where P(z|d) is the probability of topic z for document d, nd_z is the number of word pairs in the document assigned to topic z, and |B_sim| is the number of word pairs in the filtered word pair set;
Fifth step: taking document theme distribution as a feature vector, and calling sDPC a clustering algorithm to finish service clustering;
the fifth step is as follows:
5.1 calculating a cut-off distance dc;
5.2, taking document theme distribution theta as a feature vector of the service, and calculating local density corresponding to all the feature vectors;
5.3, calculating element distances edistance corresponding to all feature vectors;
5.4 setting cluster as the number of clusters, calculating the product r of the local density of each feature vector and its element distance edistance, and selecting the cluster feature vectors with the largest r values as the cluster center point set center;
5.5 using center as the initial cluster centers and completing the clustering operation with the Kmeans clustering algorithm; Kmeans is a common partitional clustering algorithm whose objective function is the residual sum of squares (RSS) between the elements and the cluster centers:

$$RSS=\sum_{i=1}^{cluster}\ \sum_{x\in\omega_i}\lVert x-c_i\rVert^{2}$$

where cluster is the number of clusters, ω_i is the i-th cluster, x is a feature vector assigned to the i-th cluster, c_i is the cluster center vector of the i-th cluster, and ||x−c_i||² is the sum of the squared differences between the components of vector x and vector c_i.
2. The short text optimization topic model method for service-oriented data clustering as claimed in claim 1, wherein the first step is as follows:
1.1, reading service description document information, taking a service name as a key, taking document content as a value, and converting the document content into a value key pair D;
1.2 traversing the document contents in D, setting the current document content as d and setting an empty set word_list; splitting d into sentences, removing punctuation marks, and then segmenting each sentence into words;
1.3 checking each segmented word during the traversal: if the word does not consist of special symbols, is not a pure number and is not in the stop word list, it is normalized and stored in the word_list set of step 1.2; after every word has been checked, word_list replaces d as the value stored for the corresponding key in D.
3. A short text optimized topic model method for service-oriented data clustering as in claim 1 or 2 wherein the process of said second step is as follows:
2.1 traversing the word segmentation result obtained in the step 1 to generate a non-repeated vocabulary Voc;
2.2 defining a word pair biterm structure, wherein the structure comprises serial numbers of two different words in Voc, and a smaller serial number is set as word1, and a larger serial number is set as word2;
2.3 setting an empty set Whole_words as the storage set of all word segmentation results, traversing the key-value pairs D, and storing the word_list set corresponding to each key into Whole_words in order;
2.4 traversing all word information in the Whole_words, and converting the word information into corresponding word sequence numbers in the vocabulary Voc;
2.5 generating a word pair set B.
4. A short text optimized topic model method for service-oriented data clustering as claimed in claim 3 wherein said step of 2.5 is as follows:
2.5.1 traversing the Whole_words set, and setting a vocabulary sequence result set of the current corresponding document segmentation as a single_list;
2.5.2 setting a word pair set B for storing word pair information;
2.5.3 traversing single_list, where the current object is single_list(i), single_list(i) denotes the vocabulary sequence number of the i-th word in single_list and 0 ≤ i < single_list.length; combining each single_list(i) with the vocabulary sequence number single_list(j) of the j-th word, where i < j < single_list.length, to generate a word pair b;
2.5.4 storing the generated word pairs into the word pair set B and assigning each word pair b a sequence number in order, recorded as b.index.
5. A short text optimized topic model method for service-oriented data clustering as in claim 1 or 2 wherein the process of the third step is as follows:
3.1 reading a pre-trained word vector model that contains word information and the corresponding word vector results. Word vector calculation is realized with the Skip-gram model, which uses the central word to predict the conditional probabilities of several words before and after it; the word vectors finally output are the weight matrix of the hidden layer of the trained neural network. The objective function of Skip-gram can be written as

$$J=\frac{1}{N}\sum_{t=1}^{N}\sum_{-c\le j\le c,\ j\ne 0}\log p(w_{t+j}\mid w_t),\qquad p(w_O\mid w_I)=\frac{\exp(V_{w_O}^{T}V_{w_I})}{\sum_{w=1}^{V}\exp(V_{w}^{T}V_{w_I})}$$

where N is the size of the text, V_w is the word vector of word w, c is the window size, i.e. the number of predicted words before and after the central word, T denotes matrix transposition, exp() is the exponential function with base e, and V is the number of words. After model training, words with similar senses obtain similar weights, so the similarity between words can be measured by the distance between their vectors;
3.2 setting eta as the similarity threshold and B_sim as the filtered word pair set; traversing the word pair set B with the current word pair b, obtaining the word W_1 in the vocabulary Voc corresponding to b.word1 and the word W_2 in Voc corresponding to b.word2; when W_1 equals W_2, the word pair similarity sim is set to 1; when W_1 or W_2 is not present in the word vector model, sim is also set to 1; when W_1 is not equal to W_2 and both are present in the word vector model, the word vector V_1 of W_1 and the word vector V_2 of W_2 are obtained from the model and the cosine similarity between the two vectors is used as the word pair similarity sim:

$$sim=\frac{V_1^{T}V_2}{\lVert V_1\rVert\,\lVert V_2\rVert}$$

where T denotes matrix transposition and ||·|| denotes the modulus of a vector;
3.3 comparing the word pair similarity sim computed for word pair b with eta; when sim is greater than or equal to eta, b is stored in the filtered word pair set B_sim.
6. The short text optimization topic model method for service-oriented data clustering of claim 1 wherein the step of 4.2 is as follows:
4.2.1 traversing the screened word pair set B_sim, randomly obtaining an integer value t for each word pair B, wherein t is more than or equal to 0 and less than k, and marking t as a topic of the word pair B as b.topic;
4.2.2 traversing B_sim after the topics have been randomly assigned, setting the current word pair as b, adding 1 to the value at position nz[b.topic] in matrix nz, and adding 1 to the values at positions nwz[b.topic][b.word1] and nwz[b.topic][b.word2] in matrix nwz respectively, where b.word1 denotes the value of word1 in the word pair and b.word2 denotes the value of word2 in the word pair, completing the matrix initialization;
the step of 4.4 is as follows:
4.4.1 subtracting 1 from the values of nz[b.topic], nwz[b.topic][b.word1] and nwz[b.topic][b.word2] respectively, to exclude the influence of the current word pair b;
4.4.2 sampling each topic z with the following formula:

$$P(z\mid \mathbf{z}_{\neg b},B)\propto (n_z+\alpha)\cdot\frac{(n_{w_i\mid z}+\beta)(n_{w_j\mid z}+\beta)}{\left(\sum_{w}n_{w\mid z}+M\beta\right)^{2}}$$

where P(z|z_¬b, B) denotes the probability that word pair b belongs to topic z after the influence of the current word pair b has been removed, α and β are hyper-parameters, n_z denotes the number of word pairs belonging to topic z, i.e. the value of nz[z] in matrix nz, n_wi|z denotes the number of times the word w_i with sequence number b.word1 in the vocabulary is assigned to topic z, i.e. the value of nwz[z][b.word1] in matrix nwz, n_wj|z denotes the number of times the word w_j with sequence number b.word2 in the vocabulary is assigned to topic z, i.e. the value of nwz[z][b.word2] in matrix nwz, and M is the number of words in the vocabulary; the probabilities obtained for all topics are stored in order in a list distribution;
4.4.3 applying a roulette-wheel operation to the probability distribution obtained in the previous step to obtain the new topic of word pair b, which is set as b.topic; the roulette-wheel algorithm, also called the proportional selection algorithm, accumulates the probability distribution piecewise to obtain the cumulative probability of each individual, generates a random number in the interval [0,1], and selects as the roulette output the individual whose cumulative probability is greater than or equal to the random number with the smallest difference;
4.4.4 adding 1 to the value at position nz[b.topic] in matrix nz and adding 1 to the values at positions nwz[b.topic][b.word1] and nwz[b.topic][b.word2] in matrix nwz, so that the matrices receive the sampling result;
the step of 4.5 is as follows:
4.5.1 setting a matrix lambda of size |B_sim|×k as the representative word pair judgement matrix, where |B_sim| is the number of word pairs in the word pair set, and setting a matrix S of size |B_sim|×k as the representative word pair matrix;
4.5.2 traversing the filtered word pair set B_sim with the current word pair b, traversing all topics, and calculating the probability of word pair b for topic z with the following formula:

$$p(z\mid b)\propto (n_z+\alpha)\cdot\frac{(n_{w_i\mid z}+\beta)(n_{w_j\mid z}+\beta)}{\left(\sum_{w}n_{w\mid z}+M\beta\right)^{2}}$$

where the symbols have the same meaning as in step 4.4.2; the maximum of the probabilities p(z|b) of word pair b over the topics is found and set as max(p(z|b)), the ratio p(z|b)/max(p(z|b)) is calculated for each topic z, and the ratio is stored at position lambda[b.index][z] in matrix lambda;
4.5.3 traversing all values in matrix lambda, judging the value of lambda[b.index][z] against a Bernoulli decision with a set probability of 0.5, and storing the resulting 0 or 1 in the representative word pair matrix S; the Bernoulli distribution is a discrete probability distribution, and here 1 is returned when the input probability is greater than the set probability and 0 is returned when it is less than or equal to the set probability;
the step of 4.6 is as follows:
4.6.1 subtracting 1 from the values of nz [ b.topic ], nwz [ b.topic ] [ b.word1] and nwz [ b.topic ] [ b.word2], respectively, to exclude the effect of the current word on b;
4.6.2 traversing each theme, setting the current theme as z, judging, repeating the operations of step 4.4.2,4.4.3 and 4.4 if the corresponding value of S [ b.index ] [ t ] is 0, and replacing the formula in step 4.4.2 with the following formula if the corresponding value of S [ b.index ] [ t ] is 1:
where μ is the representative word pair weight parameter, which is set before training and can be adjusted to change the training effect of the model; steps 4.4.3 and 4.4.4 are then repeated.
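The control flow of step 4.6 can be sketched as follows. Only the branching on S is taken from the claim; because the weighted formula of step 4.6.2 is not reproduced in this text, the sketch substitutes a hypothetical up-weighting of the standard conditional by μ for representative topics, purely as a placeholder.

```python
import numpy as np


def resample_word_pair(b, b_index, S, nz, nwz, alpha, beta, mu, rng):
    """Control flow of step 4.6 for one word pair; the weighted conditional
    used for representative topics is a placeholder, not the patent's formula."""
    k, M = nwz.shape

    # 4.6.1: exclude the influence of the current word pair b
    nz[b.topic] -= 1
    nwz[b.topic, b.word1] -= 1
    nwz[b.topic, b.word2] -= 1

    distribution = []
    for z in range(k):
        base = (nz[z] + alpha) \
               * (nwz[z, b.word1] + beta) * (nwz[z, b.word2] + beta) \
               / ((nwz[z].sum() + M * beta) ** 2)
        # S[b_index, z] == 0: standard conditional of step 4.4.2
        # S[b_index, z] == 1: weighted variant -- HYPOTHETICAL up-weighting by mu,
        # since the patent's weighted formula is not reproduced in the text
        distribution.append(mu * base if S[b_index, z] == 1 else base)

    # steps 4.4.3 and 4.4.4 repeated: roulette selection, then count update
    cumulative = np.cumsum(distribution)
    r = rng.random() * cumulative[-1]
    b.topic = min(int(np.searchsorted(cumulative, r)), k - 1)

    nz[b.topic] += 1
    nwz[b.topic, b.word1] += 1
    nwz[b.topic, b.word2] += 1
```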
7. The short text optimization topic model method for service-oriented data clustering of claim 1 wherein the step of 5.1 is as follows:
5.1.1 setting a zero matrix Cd of size |D| × (|D|−1) for storing the distances between vectors, where |D| denotes the number of service documents, and setting a list d_backup for storing the candidate cut-off distances;
5.1.2 traversing the vectors in θ, setting the currently corresponding document topic distribution vector as Vec_m, i.e. the topic distribution vector with index m, and calculating the distance between Vec_m and every other vector in θ; the Euclidean distance is used, and the calculation formula is as follows:

d(Vec_m, Vec_n) = sqrt( Σ_{i=1..k} (Vec_mi − Vec_ni)² )

where Vec_mi denotes the ith component of the topic distribution vector with index m and k denotes the number of topics; the distances between Vec_m and the other vectors are sorted from large to small and stored in the mth row of the matrix Cd;
5.1.3 traversing each row in Cd, setting the current row as SD, calculating the differences between all adjacent components in SD, finding the maximum difference, taking the smaller of the two components used to compute that maximum difference, and setting it as md; the calculation process can be expressed as:

md = SD_j | max(SD_(j+1) − SD_j)

where SD_j denotes the jth component of SD, 0 ≤ j < |D|−1, |D| denotes the number of service documents, and max(·) denotes selecting the adjacent pair with the largest difference; each md is then stored in turn in d_backup;
5.1.4 obtaining the minimum value in d_backup, namely the cut-off distance dc;
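A sketch of the cut-off distance selection in steps 5.1.1–5.1.4, assuming theta is a |D| × k NumPy array of document topic distribution vectors; the matrix and list names (Cd, d_backup, dc, SD, md) follow the claim, the absolute value of the adjacent differences is used because each row is sorted in descending order, and the function name is illustrative.

```python
import numpy as np


def cutoff_distance(theta):
    """Steps 5.1.1-5.1.4: derive the cut-off distance dc from the
    document-topic matrix theta of shape (|D|, k)."""
    D = theta.shape[0]
    Cd = np.zeros((D, D - 1))
    d_backup = []

    for m in range(D):
        # 5.1.2: Euclidean distance between Vec_m and every other vector in theta
        others = np.delete(theta, m, axis=0)
        dists = np.sqrt(((others - theta[m]) ** 2).sum(axis=1))
        Cd[m] = np.sort(dists)[::-1]                  # sorted from large to small

        # 5.1.3: largest gap between adjacent components; keep the smaller endpoint
        SD = Cd[m]
        gaps = np.abs(SD[1:] - SD[:-1])
        j = int(np.argmax(gaps))
        d_backup.append(min(SD[j], SD[j + 1]))

    # 5.1.4: the cut-off distance dc is the minimum candidate
    return Cd, min(d_backup)
```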
the step of 5.2 is as follows:
5.2.1 setting a list density for storing the local density of each vector;
5.2.2 traversing each row of the distance matrix Cd calculated in step 5.1.2, setting the current row as SD, and setting a count value count to 0;
5.2.3 traversing SD, setting the current distance value as sd, and adding 1 to the count value whenever sd is smaller than the cut-off distance dc; after the traversal of SD is completed, storing the current count value in density and resetting count to 0;
5.2.4 repeating 5.2.3 until the traversal of Cd is completed; the resulting list density then contains the local density values corresponding to all service document objects.
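The local density computation of steps 5.2.1–5.2.4 reduces to a threshold count over each row of Cd; a minimal sketch, reusing Cd and dc from the previous sketch:

```python
def local_density(Cd, dc):
    """Steps 5.2.1-5.2.4: for each service document, count how many of its
    distances in Cd fall below the cut-off distance dc."""
    density = []
    for SD in Cd:                       # one row of Cd per service document
        count = int((SD < dc).sum())    # neighbours closer than the cut-off
        density.append(count)
    return density
```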
8. The short text optimization topic model method for service-oriented data clustering of claim 1 wherein the step of 5.3 is as follows:
5.3.1 obtaining the maximum value in the density, and setting the maximum value as des_max;
5.3.2 setting list edistance to store element distances;
5.3.3 traversing density, setting the current object value as dens; when dens is equal to des_max, calculating the maximum Euclidean distance d_max between the feature vector corresponding to the current object and all other service feature vectors and storing d_max in edistance; when dens is not equal to des_max, calculating the Euclidean distances between the current feature vector and the feature vectors whose local density is larger than that of the current object, taking the minimum value d_min, and storing d_min in edistance.
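A sketch of the element distance computation in steps 5.3.1–5.3.3, assuming the same theta array and the density list from the previous sketches; it recomputes an unsorted distance matrix because the rows of Cd are sorted and no longer indexed by document, and the function name is illustrative.

```python
import numpy as np


def element_distances(theta, density):
    """Steps 5.3.1-5.3.3: distance to the nearest document of higher local
    density (or the maximum distance, for the densest document)."""
    D = theta.shape[0]
    # full, unsorted Euclidean distance matrix between document topic vectors
    diff = theta[:, None, :] - theta[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))

    des_max = max(density)
    edistance = []
    for i, dens in enumerate(density):
        if dens == des_max:
            # densest object: maximum distance d_max to any other document
            edistance.append(dist[i][np.arange(D) != i].max())
        else:
            # otherwise: minimum distance d_min to any higher-density document
            higher = [j for j in range(D) if density[j] > dens]
            edistance.append(min(dist[i, j] for j in higher))
    return edistance
```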
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110570274.4A CN113361270B (en) | 2021-05-25 | 2021-05-25 | Short text optimization topic model method for service data clustering |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113361270A CN113361270A (en) | 2021-09-07 |
CN113361270B true CN113361270B (en) | 2024-05-10 |
Family
ID=77527460
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110570274.4A Active CN113361270B (en) | 2021-05-25 | 2021-05-25 | Short text optimization topic model method for service data clustering |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113361270B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113705247B (en) * | 2021-10-27 | 2022-02-11 | 腾讯科技(深圳)有限公司 | Theme model effect evaluation method, device, equipment, storage medium and product |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3432167A1 (en) * | 2017-07-21 | 2019-01-23 | Tata Consultancy Services Limited | System and method for theme extraction |
CN110647626A (en) * | 2019-07-30 | 2020-01-03 | 浙江工业大学 | REST data service clustering method based on Internet service domain |
CN111191036A (en) * | 2019-12-30 | 2020-05-22 | 杭州远传新业科技有限公司 | Short text topic clustering method, device, equipment and medium |
CN111475607A (en) * | 2020-02-28 | 2020-07-31 | 浙江工业大学 | Web data clustering method based on Mashup service function characteristic representation and density peak detection |
CN111475609A (en) * | 2020-02-28 | 2020-07-31 | 浙江工业大学 | Improved K-means service clustering method around topic modeling |
CN112632215A (en) * | 2020-12-01 | 2021-04-09 | 重庆邮电大学 | Community discovery method and system based on word-pair semantic topic model |
Non-Patent Citations (3)
Title |
---|
A biterm topic model for short texts; Yan X et al.; Proceedings of the 22nd International Conference on World Wide Web, Rio: IW3C2; full text *
Internet-of-Things service discovery method based on BTM; Wang Shuman et al.; Journal of Computer Applications; vol. 40, no. 02; page 2, right column, paragraph 3 to page 5, left column, second-to-last paragraph *
Research on a BTM topic model based on biterm semantic enhancement; Wang Yunyun et al.; Software Engineering; vol. 23, no. 04; page 2, right column, paragraph 2 to page 3, right column, paragraph 3 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110321925B (en) | Text multi-granularity similarity comparison method based on semantic aggregated fingerprints | |
CN111125334A (en) | Search question-answering system based on pre-training | |
KR20200007713A (en) | Method and Apparatus for determining a topic based on sentiment analysis | |
Rezaei et al. | Multi-document extractive text summarization via deep learning approach | |
CN110858217A (en) | Method and device for detecting microblog sensitive topics and readable storage medium | |
CN111506728B (en) | Hierarchical structure text automatic classification method based on HD-MSCNN | |
CN107357895B (en) | Text representation processing method based on bag-of-words model | |
CN115495555A (en) | Document retrieval method and system based on deep learning | |
CN112989052B (en) | Chinese news long text classification method based on combination-convolution neural network | |
CN112434134B (en) | Search model training method, device, terminal equipment and storage medium | |
CN117273134A (en) | Zero-sample knowledge graph completion method based on pre-training language model | |
CN116756303A (en) | Automatic generation method and system for multi-topic text abstract | |
CN114491062B (en) | Short text classification method integrating knowledge graph and topic model | |
CN113361270B (en) | Short text optimization topic model method for service data clustering | |
Wu et al. | A text category detection and information extraction algorithm with deep learning | |
Al Mostakim et al. | Bangla content categorization using text based supervised learning methods | |
Labbé et al. | Is my automatic audio captioning system so bad? spider-max: a metric to consider several caption candidates | |
Zhang et al. | Text Sentiment Classification Based on Feature Fusion. | |
Yafooz et al. | Enhancing multi-class web video categorization model using machine and deep learning approaches | |
CN114003706A (en) | Keyword combination generation model training method and device | |
CN113486143A (en) | User portrait generation method based on multi-level text representation and model fusion | |
CN111125329B (en) | Text information screening method, device and equipment | |
CN117112811A (en) | Patent retrieval method, retrieval system and storage medium based on similarity | |
Safira et al. | Hoax Detection in Social Media using Bidirectional Long Short-Term Memory (Bi-LSTM) and 1 Dimensional-Convolutional Neural Network (1D-CNN) Methods | |
CN112446206A (en) | Menu title generation method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||