CN109726394A - Short-text topic clustering method based on a fused BTM model - Google Patents

Short-text topic clustering method based on a fused BTM model

Info

Publication number
CN109726394A
CN109726394A
Authority
CN
China
Prior art keywords
text
distance
model
btm
clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811546170.4A
Other languages
Chinese (zh)
Inventor
贾海涛
李泽华
刘小清
任利
贾宇明
赫熙煦
周焕来
罗心
王启杰
李清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN201811546170.4A
Publication of CN109726394A
Legal status: Pending

Links

Abstract

The invention discloses a short-text topic clustering method based on a fused BTM model, belonging to the technical field of data clustering. The method first performs text preprocessing on the short texts to be clustered to obtain a data set D, and then extracts text vectors based on the BTM model and the VSM model respectively. When k-means clustering is applied to the data set D, the number of clusters is obtained by the cluster-number estimation scheme proposed by the invention, and k-clustering is performed with the following clustering criterion: the weighted sum of the distances between any two texts, computed separately from the two kinds of text vectors. The invention combines the BTM model and the VSM model to cluster short texts by topic, thereby improving the clustering effect; at the same time it measures the clustering effect by the intra-class and inter-class distances, automatically adjusts the number of clusters, and compensates for the accuracy drop caused by the BTM model's need to pre-specify the number of topics.

Description

Short-text topic clustering method based on a fused BTM model
Technical field
The invention belongs to the technical field of data clustering, and in particular relates to a short-text topic clustering method based on a fused BTM model.
Background art
At present, the main models concerning short-text topics are the BTM (Biterm Topic Model), the VSM (Vector Space Model) and the LDA (Latent Dirichlet Allocation) model.
Among them, the BTM is a text topic model, but it differs markedly from traditional topic models such as PLSA (Probabilistic Latent Semantic Analysis) or LDA. Traditional topic models are generally only suitable for long texts, because the sparse and missing features of short texts seriously affect model building. Many researchers have tried to extend and optimize such models to enhance their applicability to short texts, for example by introducing external knowledge to expand the short texts, or by concatenating short texts and treating them as pseudo-long texts. Although these approaches improve the models, they cannot overcome the inherent drawbacks of conventional models, whereas the modeling process of the BTM avoids the above drawbacks and achieves better results.
The VSM, i.e. the vector space model, follows a fairly simple principle: a text is represented as a vector in a space, so that vector operations can subsequently be applied to texts. After a text is mapped into the vector space, the similarity between texts can be measured by the distance between their vectors, which is easy to understand.
At present, classical clustering algorithms include k-means and k-medoids, but such algorithms require the number of clusters to be specified in advance and cannot assess the optimal cluster number beforehand.
Summary of the invention
The object of the invention is, in view of the above problems, to realize topic clustering of short texts by combining the BTM model and the VSM model, thereby improving the clustering effect; and at the same time to measure the clustering effect by the intra-class and inter-class distances and automatically adjust the number of clusters, compensating for the accuracy drop caused by the BTM model's need to pre-specify the number of topics.
The short-text topic clustering method based on a fused BTM model of the invention includes the following steps:
Step S1: perform text preprocessing on the short texts to be clustered to obtain a data set D;
Step S2: based on the data set D and a preset number of topics K, build the BTM model and generate the document-topic distribution matrix θ and the topic-word distribution matrix φ;
and, based on the document-topic distribution matrix θ, represent the text vector of any text i in the data set D, denoted d_i_btm;
Step S3: based on the data set D, represent any text i in the data set D as a text vector with the TF-IDF strategy, denoted d_i_vsm;
Step S4: initialize the marker bit k̂ = k_min and set k = k_min, where k_min denotes the lower limit of the cluster number of the k-means algorithm;
Step S5: if k > k_max, execute step S8; otherwise execute step S6; where k_max denotes the upper limit of the cluster number of the k-means algorithm;
Step S6: randomly select k text vectors from the data set D as initial cluster centers, perform k-clustering on the data set D with the k-means algorithm, and compute the clustering quality J(k) of the corresponding k value from the obtained clustering result;
Let J(k̂) denote the clustering quality corresponding to the marker bit k̂; if J(k) < J(k̂), update k̂ = k and then execute step S7; otherwise execute step S7 directly;
Here, the clustering criterion used during clustering is the weighted sum of a first distance and a second distance, where the first distance is the JS (Jensen-Shannon) distance based on the text vectors d_i_btm, and the second distance is the cosine distance based on the text vectors d_i_vsm;
The clustering quality J(k) = I(k)/B(k), where I(k) denotes the intra-class distance and B(k) denotes the inter-class distance;
The intra-class distance is the minimum, over all texts, of the average distance of a text to the other texts;
The inter-class distance is the distance between the two nearest texts belonging to different clusters;
Step S7: update k = k + 1 and continue with step S5;
Step S8: perform k-clustering with the k-means algorithm, where k = k̂.
In conclusion by adopting the above-described technical solution, the beneficial effects of the present invention are: special in conjunction with both BTM and VSM Text feature is put and optimized, improves Clustering Effect, while interior, between class distance measures Clustering Effect based on class, it is automatic to adjust The quantity that clusters is saved, the problem of BTM model need to shift to an earlier date accuracy decline caused by preassignment theme quantity is compensated.
Brief description of the drawings
Fig. 1 is a schematic diagram of the BTM voca.txt input format;
Fig. 2 is a schematic diagram of the BTM doc_wids.txt input format;
Fig. 3 is the clustering flow chart of the fused BTM model;
Fig. 4 is a schematic diagram of the word-segmentation numbering result;
Fig. 5 is a schematic diagram of the BTM biterm (word-pair) numbering result;
Fig. 6 is a schematic diagram of the document space-vector matrix;
Fig. 7 is a schematic diagram of the BTM document-topic distribution matrix;
Fig. 8 is a comparison chart of the F values of BTM and LDA under different topic numbers K;
Fig. 9 shows the curves of the fused model for different values of λ;
Fig. 10 is a schematic diagram of the intra-class to inter-class distance ratio under different cluster numbers;
Fig. 11 is a comparison chart of the clustering accuracy of each model;
Fig. 12 is a comparison chart of the F value of each model.
Specific embodiment
To make the object, technical solution and advantages of the invention clearer, the invention is described in further detail below with reference to the embodiments and the accompanying drawings.
The short-text topic model BTM (Biterm Topic Model) mainly studies the document-topic distribution probability and de-emphasizes the word-frequency features within a single document, whereas the vector space model (Vector Space Model) emphasizes word-frequency features. To overcome the shortcomings of using either model alone, the invention combines the features of both and optimizes the text features, improving the clustering effect; at the same time, the clustering effect is measured by the intra-class and inter-class distances and the number of clusters is adjusted automatically, compensating for the accuracy drop caused by the BTM model's need to pre-specify the number of topics.
The inner principle of the VSM is fairly simple: a text is represented as a vector in a space, so that vector operations can subsequently be applied to texts. After a text is mapped into the vector space, the similarity between texts can be measured by the distance between their vectors, which is easy to understand.
Suppose there are m documents containing n distinct words in total; each of the m documents can then be represented on the basis of the n words w_i (i = 1, ..., n). The vector space model of a document can thus be expressed as d_i = {w_1, w_2, w_3, ..., w_n}, and the text vector matrix D of the m documents can be expressed as the m×n matrix D = (d_ij), i = 1, ..., m, j = 1, ..., n.
Each row of the matrix represents a text and each column represents a distinct word; the element d_ij of the matrix indicates the proportion of the j-th word in the i-th document.
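For illustration only (not part of the patent text), a minimal Python sketch of such a document-word matrix, with d_ij taken as the proportion of word j among the tokens of document i; all names are illustrative:

```python
from collections import Counter

def build_vsm_matrix(tokenized_docs):
    """Build the m x n document-word matrix D; entry D[i][j] is the
    proportion of vocabulary word j among the tokens of document i."""
    vocab = sorted({w for doc in tokenized_docs for w in doc})
    index = {w: j for j, w in enumerate(vocab)}
    matrix = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        row = [0.0] * len(vocab)
        for w, c in counts.items():
            row[index[w]] = c / len(doc)
        matrix.append(row)
    return matrix, vocab

# Example with two tiny pre-segmented "documents".
D, vocab = build_vsm_matrix([["haze", "mask", "haze"], ["phone", "price"]])
```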
BTM is a statistics-based model; it avoids the susceptibility of traditional short-text models to feature sparsity by exploiting the implicit semantics between words and mapping texts into a topic space. Based on this model, a more reasonable document-topic and topic-word distribution can be obtained. In this embodiment, the document features are represented by the document-topic distribution matrix and labeled d_btm. Meanwhile, the short texts are vectorized with the TF-IDF weighting strategy and labeled d_vsm. Finally, a weighted fusion coefficient λ is introduced during document similarity computation to perform the fusion.
The vector expressions d_vsm and d_btm of the i-th text are respectively:
d_i_vsm = {w_1, w_2, w_3, ..., w_n};
d_i_btm = {p(z_1|d), p(z_2|d), p(z_3|d), ..., p(z_K|d)};
where p(z_j|d) denotes the distribution probability of the j-th topic z_j in text d, j = 1, 2, ..., K.
TF-IDF is the most classical and most commonly used method for weighting feature words in a vector space. In this embodiment, the distinct words of the short texts are taken as feature items, and the corresponding weights are assigned through TF and IDF.
TF (Term Frequency) denotes the number of times a feature word t_k appears in a specific text d_i. A higher count indicates that the word carries a larger proportion of the text; the count is labeled TF_(i,k). For short texts, however, the occurrence counts of the contained feature words are generally similar, so the overall characteristics of a text are hard to judge from TF alone.
IDF (Inverse Document Frequency) refers to how often a feature word appears in the texts of the text set other than the current text. If this number is large, the word cannot distinguish texts well; if it is small, the word can be regarded as a feature of certain texts. The IDF is calculated as IDF_(i,k) = log(N / (n + α)), where i is the document (text) index, k is the word index, N denotes the number of all text items in the document library, n denotes the number of text items containing the word t_k, and α is an empirical value, generally taken as α = 0.01.
In summary, the weight of a feature word t_k in any text d_i is w_(i,k) = TF_(i,k) × IDF_(i,k); that is, the weight w_(i,k) characterizes each word in d_i_vsm = {w_1, w_2, w_3, ..., w_n}, giving the text vector of any text i in the data set D based on the TF-IDF strategy.
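A sketch of this weighting, assuming the smoothed IDF form given above (the patent does not spell out the exact smoothing used in its experiments):

```python
import math
from collections import Counter

def tf_idf_vectors(tokenized_docs, alpha=0.01):
    """w_(i,k) = TF_(i,k) * IDF_(i,k) with IDF = log(N / (n + alpha)),
    where N is the corpus size and n the number of texts containing the word."""
    vocab = sorted({w for doc in tokenized_docs for w in doc})
    N = len(tokenized_docs)
    df = Counter(w for doc in tokenized_docs for w in set(doc))  # n per word
    idf = {w: math.log(N / (df[w] + alpha)) for w in vocab}
    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)  # raw term frequencies TF_(i,k)
        vectors.append([tf[w] * idf[w] for w in vocab])
    return vectors, vocab
```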
A text is represented by its topic-distribution vector d_btm, so in the subsequent comparison the invention computes similarity based on the topic features of the texts. Since the topic vector is obtained with a statistical model, it is described in terms of probabilities.
The symmetric version of the KL distance (Kullback-Leibler divergence), i.e. the JS distance, is used as the measure. The first distance (first similarity) between any texts d_i and d_j can thus be expressed as:
D_btm(d_i, d_j) = (1/2) KL(P_i || M) + (1/2) KL(P_j || M), where M = (P_i + P_j)/2,
P_i and P_j denote the topic distributions d_i_btm and d_j_btm, and the function KL(P || Q) = Σ_x P(x) log(P(x)/Q(x)).
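A sketch of this first distance between two topic-distribution vectors; the small eps guarding against zero probabilities is an assumption, since the patent does not discuss smoothing:

```python
import math

def kl(p, q):
    """KL(P || Q) = sum_x P(x) * log(P(x) / Q(x))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_distance(p, q, eps=1e-12):
    """Symmetric JS distance: 0.5*KL(P||M) + 0.5*KL(Q||M), M = (P+Q)/2."""
    p = [pi + eps for pi in p]
    q = [qi + eps for qi in q]
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```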
For the full-text vector d_vsm represented by feature-word weights, the cosine distance of the two vectors is computed directly as the measure; that is, the second distance (second similarity) between any two texts d_i and d_j can be expressed as: D_vsm(d_i, d_j) = 1 − (d_i_vsm · d_j_vsm)/(‖d_i_vsm‖ ‖d_j_vsm‖).
After the above two similarities are computed, a weighting coefficient λ is introduced to perform weighted fusion, forming the weighted text similarity formula based on the JS distance: D(d_i, d_j) = λ D_btm(d_i, d_j) + (1 − λ) D_vsm(d_i, d_j).
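Combining the two measures, a sketch of the fused distance, reusing js_distance from the sketch above; taking the cosine distance as 1 minus the cosine similarity, so that both terms are distances, is an interpretation, and λ = 0.8 follows the experiments below:

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity of two TF-IDF vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 if nu == 0 or nv == 0 else 1.0 - dot / (nu * nv)

def fused_distance(btm_i, btm_j, vsm_i, vsm_j, lam=0.8):
    """D(d_i, d_j) = lambda * D_btm(d_i, d_j) + (1 - lambda) * D_vsm(d_i, d_j)."""
    return lam * js_distance(btm_i, btm_j) + (1 - lam) * cosine_distance(vsm_i, vsm_j)
```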
Referring to Fig. 3, in this embodiment the short-text topic clustering based on the above fused BTM model includes the following steps:
(1) Text preprocessing is performed on the short texts to be clustered to obtain the data set D, where the preprocessing includes word segmentation and stop-word removal;
The short texts to be clustered are typically short-text data such as microblogs, Tieba posts, forum comments, or user sessions with communication behavior obtained after log clustering.
(2) BTM preprocessing, i.e. regularization, which generates two documents: one is a dictionary document, denoted voca.txt, whose purpose is to number each word; the other is denoted doc_wids.txt and is used to replace the words of the entire text set by their numbers. The specific formats of the two files are shown in Figures 1 and 2.
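A sketch of this numbering step under one plausible file layout (the exact layouts shown in Figs. 1 and 2 are not reproduced here; the tab-separated dictionary format is an assumption):

```python
def write_btm_inputs(tokenized_docs, voca_path="voca.txt", doc_path="doc_wids.txt"):
    """Assign an id to each distinct word (voca.txt) and rewrite every
    document as space-separated word ids (doc_wids.txt)."""
    vocab = {}
    for doc in tokenized_docs:
        for w in doc:
            vocab.setdefault(w, len(vocab))
    with open(voca_path, "w", encoding="utf-8") as f:
        for w, wid in sorted(vocab.items(), key=lambda kv: kv[1]):
            f.write(f"{wid}\t{w}\n")
    with open(doc_path, "w", encoding="utf-8") as f:
        for doc in tokenized_docs:
            f.write(" ".join(str(vocab[w]) for w in doc) + "\n")
    return vocab
```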
(3) Modeling is performed with the BTM to generate the document-topic distribution matrix θ and the topic-word distribution matrix φ;
The inputs of the model are the two documents voca.txt and doc_wids.txt, the number of topics K, and the two hyperparameters α and β of the Dirichlet prior, where α = 50/K and β = 0.01. The outputs are the document-topic distribution matrix θ and the topic-word distribution matrix φ. Since the modeling process of the BTM is prior art, it is not described again here.
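Since BTM training itself is treated as prior art, the sketch below only shows the hyperparameter setting described above and how a per-text topic vector d_i_btm would be read off a document-topic matrix θ; the random θ is a stand-in for the output of an actual BTM Gibbs-sampling run:

```python
import numpy as np

K = 6                        # preset number of topics
alpha, beta = 50 / K, 0.01   # Dirichlet hyperparameters per the description

# Stand-in for theta produced by a BTM implementation: one row per document,
# each row a distribution over the K topics.
rng = np.random.default_rng(0)
theta = rng.dirichlet(np.full(K, alpha), size=100)  # 100 documents x K topics

def d_btm(i):
    """Topic vector of text i: {p(z_1|d), ..., p(z_K|d)}."""
    return theta[i]
```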
(4) Text vector representation is performed based on the TF-IDF strategy;
(5) The document-topic vectors generated by the BTM model and the document VSM vectors are clustered with weighting by the k-means algorithm;
(6) The clusters produced by the cluster analysis are described, providing a reference for network management or public-opinion monitoring.
The specific implementation process of the short-text topic clustering with the fused BTM model of the invention is as follows:
Step S1: input the data set D to be clustered, and set the number of topics K of the BTM and the cluster-number range [k_min, k_max] of the k-means algorithm;
Step S2: based on the data set D, build the BTM model and generate the document-topic distribution matrix θ and the topic-word distribution matrix φ;
Step S3: based on the data set D, perform text vector representation with the TF-IDF strategy to obtain d_vsm;
Step S4: initialize the marker bit k̂ = k_min and set k = k_min;
Step S5: if k > k_max, execute step S8; otherwise execute step S6;
Step S6: randomly select k initial cluster centers from the data set D, perform k-clustering with the k-means algorithm, and compute the clustering quality J(k) of the current k from the obtained clustering result;
Let J(k̂) be the clustering quality corresponding to the marker bit k̂; if J(k) < J(k̂), update k̂ = k and then execute step S7; otherwise execute step S7 directly (the marker bit k̂ remains unchanged);
Here, the clustering criterion used during clustering is the weighted sum of the first distance and the second distance;
The clustering quality J(k) = I(k)/B(k), where I(k) denotes the intra-class distance and B(k) denotes the inter-class distance;
The intra-class distance is the minimum, over all texts, of the average distance of a text to the other texts in its cluster, i.e.:
I(k) = min over clusters C_j and texts x_i ∈ C_j of (1/(|C_j| − 1)) · Σ_{x_p ∈ C_j, x_p ≠ x_i} D(x_i, x_p),
where |C_j| denotes the number of texts in cluster C_j, and x_i, x_p denote text objects belonging to cluster C_j.
The inter-class distance is the distance between the two nearest texts belonging to different clusters, i.e.:
B(k) = min_{i ≠ j} min_{x_p ∈ C_i, x_q ∈ C_j} D(x_p, x_q),
where i = 1, 2, ..., k; j = 1, 2, ..., k; i ≠ j; and x_p, x_q denote text objects in clusters C_i and C_j respectively.
Step S7: update k = k + 1 and continue with step S5;
Step S8: perform k-clustering with the k-means algorithm, where k = k̂.
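For illustration, a Python sketch of steps S4-S8 under the definitions above: clustering_quality computes J(k) = I(k)/B(k), and best_k_clustering scans the cluster-number range; run_kmeans is an assumed helper (k-means over the fused distance, which the patent delegates to the standard algorithm) returning clusters as lists of vectors:

```python
def clustering_quality(clusters, dist):
    """J(k) = I(k) / B(k). I(k): minimum over all texts of the average
    distance to the other texts of the same cluster; B(k): distance
    between the two nearest texts lying in different clusters."""
    intra = float("inf")
    for C in clusters:
        for i, x in enumerate(C):
            if len(C) > 1:
                avg = sum(dist(x, y) for j, y in enumerate(C) if j != i) / (len(C) - 1)
                intra = min(intra, avg)
    inter = float("inf")
    for a in range(len(clusters)):
        for b in range(a + 1, len(clusters)):
            for x in clusters[a]:
                for y in clusters[b]:
                    inter = min(inter, dist(x, y))
    return intra / inter

def best_k_clustering(data, k_min, k_max, run_kmeans, dist):
    """Steps S4-S8: try every k in [k_min, k_max], keep the k with the
    smallest J(k), then recluster once with that k."""
    k_hat, best_j = k_min, float("inf")
    for k in range(k_min, k_max + 1):      # S5/S7 loop over k
        clusters = run_kmeans(data, k)     # S6: k-means under the fused distance
        j = clustering_quality(clusters, dist)
        if j < best_j:                     # smaller I(k)/B(k) is better
            k_hat, best_j = k, j
    return run_kmeans(data, k_hat)         # S8: final clustering with k = k_hat
```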
Embodiment
The experimental data come from Sina Weibo, and modeling and cluster analysis were carried out on these data with the algorithm of the invention. The experimental part mainly compares the weighted clustering algorithm fusing the full-text VSM and the BTM against algorithms using BTM, VSM and LDA alone.
The experimental data set (corpus) consists of open-source data from Sina Weibo of May 2014, containing 6 hot topics (housing prices, civil servants, Korean TV dramas, smog, GMO, mobile phones) with 1000 items per category and category labels already provided; vocabulary size: 9372 words.
Before the experiments, the data are first preprocessed: the texts are segmented into numbered words, and the BTM additionally requires the word pairs (biterms) to be numbered. The word-segmentation numbering result is shown in Fig. 4, and the biterm numbering in Fig. 5.
After the data preprocessing is completed, the documents are vectorized based on the TF-IDF algorithm to form the VSM model; the document representation result is shown in Fig. 6.
(1) Comparison of the topic clustering effects of BTM and LDA:
The BTM is an unsupervised model that requires no human intervention during operation, but one drawback is that the overall number of topic clusters K of the document set must be preset manually before modeling; whether the setting of K fits the actual data affects the clustering performance of the model.
In this experiment the document set is known in advance to have 6 categories, i.e. 6 topics, and the optimal number of topics K of the BTM and LDA topic models is verified separately. To determine the optimal K, the number of topics is set to 5, 6, 7, 8, 9, 10, 11, 12, 13 and 14 in turn, with 1000 modeling iterations, α = 50/K and β = 0.01. After model training, clustering is performed with the k-means algorithm with cluster number k = 6. Since this clustering algorithm easily falls into local optima, each clustering experiment is repeated 10 times and the average F value of the clustering results (considering both precision and recall) is used for assessment. The optimal number of topics K determined by the experiment is used in the subsequent experiments. The text-topic matrix of the BTM model is shown in Fig. 7.
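The patent does not give the exact F-value formula; a sketch of one common clustering F-measure (for each true class, the best F over all clusters, weighted by class size) matching the description "considering accuracy rate and recall rate simultaneously":

```python
def clustering_f_value(labels_true, labels_pred):
    """Weighted clustering F-measure against known class labels."""
    n = len(labels_true)
    total = 0.0
    for c in set(labels_true):
        members = {i for i, t in enumerate(labels_true) if t == c}
        best_f = 0.0
        for g in set(labels_pred):
            cluster = {i for i, p in enumerate(labels_pred) if p == g}
            overlap = len(members & cluster)
            if overlap:
                precision = overlap / len(cluster)
                recall = overlap / len(members)
                best_f = max(best_f, 2 * precision * recall / (precision + recall))
        total += len(members) / n * best_f  # weight by class size
    return total
```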
From the experimental results in Table 1 and Fig. 8 it can be seen that, in the verification of the number of clustering topics K under the known categories of the document set, the BTM model achieves its best topic clustering effect at K = 6 and the LDA model at K = 10; overall, BTM outperforms LDA.
As the number of topics K increases, the effect of both models declines to varying degrees. This shows that in the BTM model, with the number of word pairs unchanged, a preset number of topics that deviates too far from the true number causes the original topic probabilities to be over-subdivided, changing the document-topic distribution and ultimately changing the document clustering result derived from that probability distribution. In the LDA model, the modeling effect is poor overall because the model is affected by the sparsity of short-text features, leading to an inaccurate document-topic distribution and thus a poor clustering effect.
Table 1: Comparison of the F values of the BTM and LDA models under different K values
(2) Setting of the fusion coefficient λ:
Based on the above experimental results, this experiment takes BTM topic number K = 6 and tests each k-means cluster number k ∈ [5, 15] separately, with the final overall result expressed as the mean. In the experiment, λ is set to {0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0} and the topic clustering precision is verified for each value. The results show that the overall clustering result is relatively good at λ = 0.8, so λ = 0.8 is used in the subsequent experiments. The experimental results are shown in Fig. 9.
(3) Determination of the optimal number of clusters:
In practical situations the category structure of the documents is unknown, so this experiment evaluates the clustering effect based on the intra-class/inter-class distance without presetting the number of clusters. With the fused BTM topic number K = 6 and λ = 0.8, clustering is performed over the input cluster range [5, 15]. Fig. 10 is the line chart of the intra-class to inter-class distance ratio obtained under different cluster numbers. The results show that when the cluster number is 6, the intra-class to inter-class distance ratio is the smallest and the clustering effect is the best. This result is consistent with the characteristics of the data set, demonstrating the validity of measuring the clustering result by the intra-class/inter-class distance. Therefore a cluster number of 6 is used in the subsequent experiments.
(4) Comparison of the clustering effects of the models:
This experiment compares the clustering effects based on the 6 categories of the document set. The compared models are the fused BTM model, the BTM model, the LDA model and the VSM model. The topic number K of the BTM model is set to 6, the LDA topic number K to 10, and the k-means cluster number to 6. In the experimental results, the precision P and the F value are used to evaluate the clustering effect: the better the clustering effect of an algorithm, the larger the corresponding precision P; the F value considers precision and recall simultaneously, so the clustering effect is likewise proportional to the F value. The clustering effects of the models are shown in the following table:
Table 2: Clustering precision P of each model
Fig. 11 gives the comparison curves of the clustering precision of each model corresponding to Table 2; as can be seen from the figure, the precision performance of the invention is the best.
Table 3: Comparison of the F value of each model
Fig. 12 gives the comparison curves of the clustering F value of each model corresponding to Table 3; as can be seen from the figure, the advantage of the invention is the most obvious.
The above description is merely a specific embodiment. Any feature disclosed in this specification, unless specifically stated, may be replaced by other equivalent or similarly purposed alternative features; and all of the disclosed features, or all of the method or process steps, may be combined in any way, except for mutually exclusive features and/or steps.

Claims (3)

1. A short-text topic clustering method based on a fused BTM model, characterized in that it comprises the following steps:
Step S1: performing text preprocessing on the short texts to be clustered to obtain a data set D;
Step S2: based on the data set D and a preset number of topics K, building the BTM model and generating a document-topic distribution matrix and a topic-word distribution matrix;
and representing, based on the document-topic distribution matrix, the text vector of any text i in the data set D, denoted d_i_btm;
Step S3: based on the data set D, representing any text i in the data set D as a text vector with the TF-IDF strategy, denoted d_i_vsm;
Step S4: initializing a marker bit k̂ = k_min and setting k = k_min, where k_min denotes the lower limit of the cluster number of the k-means algorithm;
Step S5: if k > k_max, executing step S8; otherwise executing step S6; where k_max denotes the upper limit of the cluster number of the k-means algorithm;
Step S6: randomly selecting k text vectors from the data set D as initial cluster centers, performing k-clustering on the data set D with the k-means algorithm, and computing the clustering quality J(k) of the corresponding k value from the obtained clustering result;
letting J(k̂) denote the clustering quality corresponding to the marker bit k̂; if J(k) < J(k̂), updating k̂ = k and then executing step S7; otherwise executing step S7 directly;
wherein the clustering criterion used during clustering is the weighted sum of a first distance and a second distance, the first distance being the JS distance based on the text vectors d_i_btm and the second distance being the cosine distance based on the text vectors d_i_vsm;
the clustering quality J(k) = I(k)/B(k), where I(k) denotes the intra-class distance and B(k) denotes the inter-class distance;
the intra-class distance being the minimum, over all texts, of the average distance of a text to the other texts;
the inter-class distance being the distance between the two nearest texts belonging to different clusters;
Step S7: updating k = k + 1 and continuing with step S5;
Step S8: performing k-clustering with the k-means algorithm, where k = k̂.
2. The method according to claim 1, characterized in that the weight of the first distance is set to 0.8 and the weight of the second distance to 0.2.
3. The method according to claim 1, characterized in that the text preprocessing includes word segmentation and stop-word removal of the short texts.
CN201811546170.4A 2018-12-18 2018-12-18 Short text Subject Clustering method based on fusion BTM model Pending CN109726394A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811546170.4A CN109726394A (en) 2018-12-18 2018-12-18 Short text Subject Clustering method based on fusion BTM model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811546170.4A CN109726394A (en) 2018-12-18 2018-12-18 Short text Subject Clustering method based on fusion BTM model

Publications (1)

Publication Number Publication Date
CN109726394A true CN109726394A (en) 2019-05-07

Family

ID=66296329

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811546170.4A Pending CN109726394A (en) 2018-12-18 2018-12-18 Short text Subject Clustering method based on fusion BTM model

Country Status (1)

Country Link
CN (1) CN109726394A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110263343A (en) * 2019-06-24 2019-09-20 北京理工大学 Keyword extraction method and system based on phrase vectors
CN110941961A (en) * 2019-11-29 2020-03-31 秒针信息技术有限公司 Information clustering method and device, electronic equipment and storage medium
CN111191036A (en) * 2019-12-30 2020-05-22 杭州远传新业科技有限公司 Short text topic clustering method, device, equipment and medium
CN111523594A (en) * 2020-04-23 2020-08-11 湖州师范学院 Improved KNN fault classification method based on LDA-KMEDOIDS
CN111897952A (en) * 2020-06-10 2020-11-06 中国科学院软件研究所 Sensitive data discovery method for social media
CN112132624A (en) * 2020-09-27 2020-12-25 平安医疗健康管理股份有限公司 Medical claims data prediction system
CN113420112A (en) * 2021-06-21 2021-09-21 中国科学院声学研究所 News entity analysis method and device based on unsupervised learning
WO2023159758A1 (en) * 2022-02-22 2023-08-31 平安科技(深圳)有限公司 Data enhancement method and apparatus, electronic device, and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103279556A (en) * 2013-06-09 2013-09-04 南方报业传媒集团 Iterative text clustering method based on adaptive subspace learning
CN106776579A (en) * 2017-01-19 2017-05-31 清华大学 Sampling acceleration method for Biterm topic models
CN108197144A (en) * 2017-11-28 2018-06-22 河海大学 Hot topic discovery method based on BTM and Single-pass
US20180285348A1 (en) * 2016-07-19 2018-10-04 Tencent Technology (Shenzhen) Company Limited Dialog generation method, apparatus, and device, and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103279556A (en) * 2013-06-09 2013-09-04 南方报业传媒集团 Iterative text clustering method based on adaptive subspace learning
US20180285348A1 (en) * 2016-07-19 2018-10-04 Tencent Technology (Shenzhen) Company Limited Dialog generation method, apparatus, and device, and storage medium
CN106776579A (en) * 2017-01-19 2017-05-31 清华大学 Sampling acceleration method for Biterm topic models
CN108197144A (en) * 2017-11-28 2018-06-22 河海大学 Hot topic discovery method based on BTM and Single-pass

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李泽华 (Li Zehua): "Design and Implementation of a Web Log Mining System Based on Short Texts", China Master's Theses Full-text Database, Information Science and Technology Series *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110263343A (en) * 2019-06-24 2019-09-20 北京理工大学 Keyword extraction method and system based on phrase vectors
CN110941961A (en) * 2019-11-29 2020-03-31 秒针信息技术有限公司 Information clustering method and device, electronic equipment and storage medium
CN110941961B (en) * 2019-11-29 2023-08-25 秒针信息技术有限公司 Information clustering method and device, electronic equipment and storage medium
CN111191036A (en) * 2019-12-30 2020-05-22 杭州远传新业科技有限公司 Short text topic clustering method, device, equipment and medium
CN111523594A (en) * 2020-04-23 2020-08-11 湖州师范学院 Improved KNN fault classification method based on LDA-KMEDOIDS
CN111897952A (en) * 2020-06-10 2020-11-06 中国科学院软件研究所 Sensitive data discovery method for social media
CN111897952B (en) * 2020-06-10 2022-10-14 中国科学院软件研究所 Sensitive data discovery method for social media
CN112132624A (en) * 2020-09-27 2020-12-25 平安医疗健康管理股份有限公司 Medical claims data prediction system
CN113420112A (en) * 2021-06-21 2021-09-21 中国科学院声学研究所 News entity analysis method and device based on unsupervised learning
WO2023159758A1 (en) * 2022-02-22 2023-08-31 平安科技(深圳)有限公司 Data enhancement method and apparatus, electronic device, and storage medium

Similar Documents

Publication Publication Date Title
CN109726394A (en) Short text Subject Clustering method based on fusion BTM model
CN107330049B (en) News popularity estimation method and system
CN104899304B (en) Named entity recognition method and device
CN107992542A (en) Similar-article recommendation method based on a topic model
CN103970866B (en) Method and system for discovering microblog user interests based on microblog text
CN105550211A (en) Social network and item content integrated collaborative recommendation system
CN107526819A (en) Big-data public opinion analysis method oriented to a short-text topic model
WO2023029356A1 (en) Sentence embedding generation method and apparatus based on sentence embedding model, and computer device
CN109492678A (en) App classification method integrating shallow and deep learning
CN113887643B (en) New dialogue intention recognition method based on pseudo tag self-training and source domain retraining
Ye et al. A web services classification method based on GCN
CN108664558A (en) Personalized web TV recommendation method oriented to large-scale users
WO2020147259A1 (en) User portrait method and apparatus, readable storage medium, and terminal device
JP2020098592A (en) Method, device and storage medium of extracting web page content
CN117034921B (en) Prompt learning training method, device and medium based on user data
Chen et al. Popular topic detection in Chinese micro-blog based on the modified LDA model
CN113837307A (en) Data similarity calculation method and device, readable medium and electronic equipment
Sun et al. Rumour detection technology based on the BiGRU_capsule network
CN108694176A (en) Method, apparatus, electronic device and readable storage medium for document sentiment analysis
CN116842934A (en) Multi-document fusion deep learning title generation method based on continuous learning
Wang Research on the art value and application of art creation based on the emotion analysis of art
Zhou et al. The improved grey model by fusing exponential buffer operator and its application
CN110413782A (en) Automatic table topic classification method, device, computer equipment and storage medium
CN112434126A (en) Information processing method, device, equipment and storage medium
Lu et al. A novel method for Chinese named entity recognition based on character vector

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190507
