CN109726394A - Short text Subject Clustering method based on fusion BTM model - Google Patents
- Publication number
- CN109726394A (application CN201811546170.4A)
- Authority
- CN
- China
- Prior art keywords
- text
- distance
- model
- btm
- clustering
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses a short text subject clustering method based on a fused BTM model, belonging to the technical field of data clustering. The method first performs text preprocessing on the short texts to be clustered to obtain a data set D, then extracts two text vectors for each text, one based on the BTM model and one based on the VSM model. When k-means clustering is applied to D, the cluster number is set by the estimation procedure of the invention, and the clustering criterion is the weighted sum of two inter-text distances computed respectively from the two kinds of text vectors. By combining the BTM and VSM models, the invention improves the effect of clustering short texts by subject; at the same time it measures clustering quality by the intra-class and inter-class distances and adjusts the cluster number automatically, compensating for the drop in accuracy caused by the BTM model's need to preassign the number of topics.
Description
Technical field
The invention belongs to the technical field of data clustering, and in particular relates to a short text subject clustering method based on a fused BTM model.
Background art
At present, the main models for short text subjects are the BTM (Biterm Topic Model), the VSM (Vector Space Model) and the LDA (Latent Dirichlet Allocation) model.
The BTM is a text subject model, but it differs markedly from traditional topic models such as PLSA (Probabilistic Latent Semantic Analysis) and LDA. Traditional topic models are generally suitable only for long texts, because the sparsity and incompleteness of short-text features seriously harm model estimation. Many researchers have tried to extend and optimize these models to make them applicable to short texts, for example by introducing external knowledge to expand the short texts, or by concatenating short texts and treating them as pseudo long texts. Such workarounds and model improvements, however, cannot overcome the inherent drawbacks of the conventional models, whereas the modeling process of the BTM avoids these drawbacks and achieves better results.
The VSM, i.e. the vector space model, rests on a simple principle: each text is represented as a vector in a vector space, so that vector operations can then be applied to texts. Once a text is mapped into the vector space, the similarity between texts can be measured by the distance between their vectors, which is easy to understand.
At present, classical clustering algorithms include k-means and k-medoids, but such algorithms require the cluster number to be specified in advance and cannot assess the optimal cluster number beforehand.
Summary of the invention
The object of the invention is, in view of the above problems, to combine the BTM and VSM models to cluster short texts by subject and thereby improve the clustering effect, while measuring clustering quality by the intra-class and inter-class distances and adjusting the cluster number automatically, compensating for the drop in accuracy caused by the BTM model's need to preassign the number of topics.
The short text subject clustering method based on a fused BTM model of the invention includes the following steps:
Step S1: perform text preprocessing on the short texts to be clustered to obtain a data set D;
Step S2: based on the data set D and a preset topic number K, build the BTM model, generating the document-topic distribution matrix θ and the topic-word distribution matrix φ; and represent each text i in D as a text vector based on the document-topic distribution matrix θ, denoted di_btm;
Step S3: based on the data set D, represent each text i in D as a text vector using the TF-IDF strategy, denoted di_vsm;
Step S4: initialize the marker k* = kmin and set k = kmin, where kmin denotes the lower bound of the cluster number of the k-means algorithm;
Step S5: if k > kmax, execute step S8; otherwise execute step S6; kmax denotes the upper bound of the cluster number of the k-means algorithm;
Step S6: randomly select k text vectors from the data set D as initial cluster centers and run the k-means algorithm to partition D into k clusters; compute the clustering quality J(k) of the current k from the clustering result;
let J(k*) denote the clustering quality at the marker k*; if J(k) < J(k*), update k* = k and then execute step S7; otherwise execute step S7 directly;
when clustering, the criterion used is the weighted sum of a first distance and a second distance, where the first distance is the JS (Jensen-Shannon) distance between the vectors di_btm and the second distance is the cosine distance between the vectors di_vsm;
the clustering quality is J(k) = I(k)/B(k), where I(k) denotes the intra-class distance and B(k) the inter-class distance;
the intra-class distance is the minimum, over all texts, of a text's average distance to the other texts in its cluster;
the inter-class distance is the distance between the two nearest texts belonging to different clusters;
Step S7: update k = k + 1 and return to step S5;
Step S8: run k-means clustering with k* clusters.
In conclusion, by adopting the above technical solution, the beneficial effects of the invention are: combining the characteristics of both BTM and VSM optimizes the text features and improves the clustering effect; at the same time, clustering quality is measured by the intra-class and inter-class distances and the cluster number is adjusted automatically, compensating for the drop in accuracy caused by the BTM model's need to preassign the number of topics.
Description of the drawings
Fig. 1 is a schematic diagram of the BTM voca.txt input format;
Fig. 2 is a schematic diagram of the BTM doc_wids.txt input format;
Fig. 3 is the clustering flow chart of the fused BTM model;
Fig. 4 is a schematic diagram of the word-segmentation numbering result;
Fig. 5 is a schematic diagram of the BTM word-to-number result;
Fig. 6 is a schematic diagram of the document space vector matrix;
Fig. 7 is a schematic diagram of the BTM document-topic distribution matrix;
Fig. 8 is a comparison of the F values of BTM and LDA at different topic numbers K;
Fig. 9 is the curve of the fused model over different values of λ;
Fig. 10 is a schematic diagram of the intra-class and inter-class distances at different cluster numbers;
Fig. 11 is a comparison of the clustering accuracy of each model;
Fig. 12 is a comparison of the F value of each model.
Specific embodiment
To make the object, technical solution and advantages of the invention clearer, the invention is described in further detail below with reference to the embodiments and the accompanying drawings.
The short text topic model BTM (Biterm Topic Model) focuses on the document-topic distribution probabilities and downplays the word-frequency features within a single document, whereas the vector space model (VSM) emphasizes word-frequency features. To overcome the shortcomings of using either model alone, the invention combines the characteristics of both and optimizes the text features, improving the clustering effect, while measuring clustering quality by the intra-class and inter-class distances and adjusting the cluster number automatically, compensating for the drop in accuracy caused by the BTM model's need to preassign the number of topics.
The inner principle of the VSM is simple: each text is represented as a space vector, so that vector operations can then be applied to texts. After a text is mapped into the vector space, the similarity between texts can be measured by the distance between their vectors, which is easy to understand.
Assume there are m documents containing n distinct words in total, so each of the m documents can be represented over the n words wi (i = 1, ..., n). The vector space model of a document can then be expressed as di = {w1, w2, w3, ..., wn}, and the text vector matrix of the m documents can be written as the m x n matrix D = (dij), where each row of the matrix represents a text, each column represents a distinct word, and the element dij indicates the proportion of the j-th word in the i-th document.
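As a toy illustration of the document-word matrix just described (the three documents and their vocabulary are hypothetical, and raw counts stand in for the proportion dij):

```python
# Toy m-by-n document-word matrix D: one row per document, one column per
# distinct word; entries are raw word counts standing in for d_ij.
docs = [
    ["price", "house", "price"],
    ["phone", "screen"],
    ["house", "phone"],
]
vocab = sorted({w for d in docs for w in d})      # the n distinct words
D = [[d.count(w) for w in vocab] for d in docs]   # m rows, n columns
```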
The BTM is a statistical model; it avoids the susceptibility of traditional short-text models to feature sparsity by exploiting the implicit semantics between words and mapping texts into a topic space, yielding more reasonable document-topic and topic-word distributions. In this embodiment, document features are represented by the document-topic distribution matrix and labeled dbtm. Meanwhile, each short text is also vectorized with the TF-IDF weighting strategy and labeled dvsm. Finally, a weighted fusion coefficient λ is introduced during the document similarity computation to fuse the two.
For the i-th text, the vector expressions of dvsm and dbtm are respectively:
di_vsm = {w1, w2, w3, ..., wn};
di_btm = {p(z1|d), p(z2|d), p(z3|d), ..., p(zK|d)};
where p(zj|d) denotes the distribution probability of the j-th topic zj in text d, j = 1, 2, ..., K.
TF-IDF is the most classical and most common method for weighting feature words in a vector space. In this embodiment, the distinct words in the short texts are taken as feature items, and the corresponding weights are obtained from TF and IDF.
TF (Term Frequency) denotes the number of times a feature word tk appears in a specific text di. The more often it appears, the larger the proportion of that word; the count is labeled TF(i,k). For short texts, however, the occurrence counts of the contained feature words are all close, so TF alone is not enough to judge the overall characteristics of a text.
IDF (Inverse Document Frequency) measures in how many texts of the text set, apart from the current text, a feature word appears. If it appears in many texts, the word cannot distinguish texts well; if it appears in few, the word can be taken as a feature of certain texts. IDF is computed as (the original formula is given only as an image; the smoothed form implied by the variables listed is)
IDF(i,k) = log(N / (n + α))
where i is the document (text) index, k is the word index, N denotes the total number of texts in the document library, n denotes the number of texts containing the word tk, and α is an empirical smoothing value, normally taken as α = 0.01.
In summary, the weight of the feature word tk in any text di is w(i,k) = TF(i,k) × IDF(i,k); that is, the weight w(i,k) characterizes each word in di_vsm = {w1, w2, w3, ..., wn}, giving the TF-IDF text vector of each text i in the data set D.
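A minimal sketch of this TF-IDF weighting follows. The IDF form log(N / (n + α)) with α = 0.01 is an assumption inferred from the variables listed (the original formula is an image), and the toy corpus is hypothetical.

```python
import math

def tfidf_vectors(docs, alpha=0.01):
    """Weight each word of each document by TF(i,k) * IDF(i,k).

    Assumes the smoothed IDF form log(N / (n + alpha)) implied by the
    variables listed in the text; alpha = 0.01 as stated.
    """
    vocab = sorted({w for d in docs for w in d})
    N = len(docs)                                     # total number of texts
    df = {w: sum(w in d for d in docs) for w in vocab}  # n: texts containing w
    vectors = [
        [d.count(w) * math.log(N / (df[w] + alpha)) for w in vocab]
        for d in docs
    ]
    return vocab, vectors
```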
A text is represented by its topic distribution vector dbtm, so in subsequent comparisons the invention computes similarity from the topic features of texts. Since the topic vectors are obtained from a statistical model, they are probability distributions, and a symmetric variant of the KL distance (Kullback-Leibler divergence), namely the JS distance, is used for measurement. For any two texts di and dj, the first distance (first similarity) can be expressed as (reconstructing the formula image in the standard form):
Dbtm(di, dj) = (1/2) KL(pi || M) + (1/2) KL(pj || M), with M = (pi + pj)/2,
where pi and pj are the topic distributions of di and dj, and KL(p || q) = Σs p(s) log(p(s)/q(s)).
For the full-text vectors dvsm expressed by feature-word weights, the cosine distance between the two vectors is computed directly; that is, for any two texts di and dj, the second distance (second similarity) can be expressed as (again reconstructing the formula image in the usual cosine-distance form):
Dvsm(di, dj) = 1 - (di · dj) / (|di| |dj|).
After both measures are computed, a weighting coefficient λ is introduced and a weighted fusion measure is formed, giving the weighted text similarity formula based on the JS distance: D(di, dj) = λ Dbtm(di, dj) + (1 - λ) Dvsm(di, dj).
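The fused distance can be sketched as follows. Treating the "cosine distance" as 1 minus cosine similarity is an assumption (the original formula is an image), and λ = 0.8 follows the value selected in the experiments.

```python
import math

def kl(p, q):
    # Kullback-Leibler divergence; terms with p_i = 0 contribute nothing.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_distance(p, q):
    # Symmetric JS measure used for the topic vectors d_btm.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def cosine_distance(a, b):
    # 1 - cosine similarity for the TF-IDF vectors d_vsm (assumed form).
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def fused_distance(btm_i, btm_j, vsm_i, vsm_j, lam=0.8):
    # D = lambda * D_btm + (1 - lambda) * D_vsm.
    return lam * js_distance(btm_i, btm_j) + (1 - lam) * cosine_distance(vsm_i, vsm_j)
```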
Referring to Fig. 3, in this embodiment the short text subject clustering based on the above fused BTM model includes the following steps:
(1) Perform text preprocessing on the short texts to be clustered to obtain the data set D, where text preprocessing includes word segmentation and stop-word removal.
The short texts to be clustered are typically microblog posts, discussion-board or forum comments, or the short texts of user sessions with communication behavior after log clustering.
(2) BTM preprocessing, i.e. regularization, which generates two files: a dictionary file, denoted voca.txt, whose purpose is to number each word; and a file denoted doc_wids.txt, used to replace the words of the whole text set by their numbers. The formats of the two files are shown in Figs. 1 and 2.
(3) Model with the BTM, generating the document-topic distribution matrix θ and the topic-word distribution matrix φ. The inputs of the model are the two files voca.txt and doc_wids.txt, the topic number K, and the two hyperparameters α and β of the Dirichlet prior, where α = 50/K and β = 0.01. The outputs are the document-topic distribution matrix θ and the topic-word distribution matrix φ. Since the modeling process of the BTM is prior art, it is not described in detail here.
(4) Perform text vectorization based on the TF-IDF strategy.
(5) Run weighted clustering with the k-means algorithm over the document-topic vectors produced by the BTM and the document VSM vectors.
(6) Describe the clusters produced by the clustering, for reference in network management or public opinion monitoring.
The specific implementation of the short text subject clustering of the fused BTM model of the invention is as follows:
Step S1: input the data set D to be clustered, the topic number of the BTM, and the cluster number range [kmin, kmax] of the k-means algorithm;
Step S2: based on the data set D, build the BTM model, generating the document-topic distribution matrix θ and the topic-word distribution matrix φ;
Step S3: based on the data set D, perform text vectorization with the TF-IDF strategy to obtain dvsm;
Step S4: initialize the marker k* = kmin and set k = kmin;
Step S5: if k > kmax, execute step S8; otherwise execute step S6;
Step S6: randomly select k initial cluster centers from the data set D and run the k-means algorithm to partition D into k clusters; compute the clustering quality J(k) of the current k from the clustering result;
let J(k*) denote the clustering quality at the marker k*; if J(k) < J(k*), update k* = k and then execute step S7; otherwise execute step S7 directly (the marker k* remains unchanged);
when clustering, the criterion used is the weighted sum of the first distance and the second distance;
the clustering quality is J(k) = I(k)/B(k), where I(k) denotes the intra-class distance and B(k) the inter-class distance;
the intra-class distance is the minimum, over all texts, of a text's average distance to the other texts in its cluster, i.e. (reconstructing the formula image in the form the listed variables imply)
I(k) = min over clusters Cj and texts xi in Cj of (1/(|Cj| - 1)) Σ d(xi, xp) over xp in Cj, xp ≠ xi,
where |Cj| denotes the number of texts in cluster Cj, and xi, xp denote text objects belonging to cluster Cj;
the inter-class distance is the distance between the two nearest texts belonging to different clusters, i.e.
B(k) = min d(xp, xq) over xp in Ci, xq in Cj, with i = 1, 2, ..., k, j = 1, 2, ..., k and i ≠ j, where xp, xq denote text objects in clusters Ci and Cj;
Step S7: update k = k + 1 and return to step S5;
Step S8: run k-means clustering with k* clusters.
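The loop of steps S4 through S8 can be sketched as follows, with a plain Lloyd k-means and a Euclidean distance standing in for the fused JS + cosine distance (which is pluggable here). The readings of I(k) and B(k) follow the definitions above; since the original formula images are missing, those details are assumptions.

```python
import math
import random

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, dist, iters=30, seed=0):
    # Plain Lloyd k-means with a pluggable distance function.
    rng = random.Random(seed)
    centers = [list(c) for c in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster empties
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def quality(points, labels, dist):
    # J(k) = I(k) / B(k).  I(k): minimum over texts of the average distance
    # to the other texts in the same cluster; B(k): distance between the two
    # closest texts lying in different clusters.
    intra, inter = [], []
    for i, p in enumerate(points):
        peers = [q for j, q in enumerate(points) if j != i and labels[j] == labels[i]]
        if peers:
            intra.append(sum(dist(p, q) for q in peers) / len(peers))
        inter.extend(dist(p, points[j]) for j in range(i + 1, len(points))
                     if labels[j] != labels[i])
    if not intra or not inter:
        return float("inf")
    return min(intra) / min(inter)

def best_k(points, dist, kmin, kmax):
    # Steps S4-S8: scan k over [kmin, kmax], keep the k* with smallest J(k).
    k_star, j_star = kmin, float("inf")
    for k in range(kmin, kmax + 1):
        j = quality(points, kmeans(points, k, dist, seed=k), dist)
        if j < j_star:
            k_star, j_star = k, j
    return k_star
```

On well-separated data the intra/inter ratio J(k) is smallest when k matches the true number of groups, which is exactly the selection behavior the steps describe.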
Embodiment
The experimental data come from Sina Weibo; modeling and clustering are carried out on these data with the algorithm of this invention. The experiments mainly compare the weighted clustering algorithm fusing the full-text VSM and the BTM against algorithms using BTM, VSM and LDA alone.
The experimental data set (the corpus used) is open-source data from Sina Weibo of May 2014, containing 6 hot topics (housing prices, civil servants, Korean TV dramas, smog, GMOs, mobile phones), with 1000 texts per category, category labels already provided, and a vocabulary size of 9372 words.
Before the experiments, the data are first preprocessed: the texts are segmented and numbered, and the BTM requires the words to be numbered. The segmentation numbering result is shown in Fig. 4 and the word-to-number result in Fig. 5.
After data preprocessing, the documents are vectorized based on the TF-IDF algorithm to form the VSM model; the document representation result is shown in Fig. 6.
(1) Comparison of BTM and LDA subject clustering:
The BTM is an unsupervised model and needs no human intervention during operation, but one drawback is that the overall number of topics K of the document set must be preset before modeling, and whether the setting of K fits the actual data affects the clustering performance of the model.
In this experiment the document set is known in advance to have 6 categories, i.e. 6 topics, and the optimal topic numbers K of the BTM and LDA topic models are verified separately. To determine the optimal K, the topic number is set to 5, 6, 7, 8, 9, 10, 11, 12, 13 and 14 in turn, with 1000 modeling iterations, α = 50/K and β = 0.01. After model training, clustering is performed with the k-means algorithm with cluster number k = 6. Since this clustering algorithm easily falls into local optima, each clustering experiment is repeated 10 times and the average F value of the clustering results (which considers accuracy and recall simultaneously) is used for assessment. The optimal topic number K determined by the experiment is used in the subsequent experiments. The text-topic matrix of the BTM model is shown in Fig. 7.
From the experimental results in Table 1 and Fig. 8 it can be seen that, verifying the topic number K with the categories of the document set known, the BTM model obtains its best subject clustering effect at K = 6 and the LDA model at K = 10; overall, the BTM outperforms LDA.
As the topic number K increases, both models degrade to different degrees. This shows that in the BTM, with the number of biterms unchanged, a preset topic number that deviates too far from the true number causes the original topic probabilities to be subdivided, changing the document-topic distribution, so that the document clustering based on this distribution changes accordingly. The LDA model performs worse overall because it is affected by the sparsity of short-text features, which makes the document-topic distribution inaccurate and hence the clustering effect poor.
Table 1: comparison of the F values of the BTM and LDA models under different K values
(2) Setting of the fusion coefficient λ:
Based on the above results, this experiment sets the BTM topic number K = 6 and tests each k-means cluster number k ∈ [5, 15]; the final overall result is the mean. In the experiment, λ is set to {0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0} and the subject clustering precision is verified for each value. The results show that the overall clustering result is best at λ = 0.8, so λ = 0.8 is used in the subsequent experiments. The experimental results are shown in Fig. 9.
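The grid search over λ can be sketched as follows; `evaluate` is a hypothetical stand-in for the clustering-precision measurement used in the experiment, and the grid mirrors the values listed above.

```python
# Pick the fusion coefficient lambda that maximizes a supplied score.
def pick_lambda(evaluate, grid=None):
    if grid is None:
        grid = [round(0.1 * i, 1) for i in range(11)]  # 0, 0.1, ..., 1.0
    return max(grid, key=evaluate)

# Illustrative toy score that happens to peak at 0.8:
best = pick_lambda(lambda lam: -(lam - 0.8) ** 2)
```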
(3) Determination of the optimal cluster number:
In practice the category situation of the documents is unknown, so this experiment evaluates the clustering effect by the intra-class and inter-class distances without presetting the cluster number. With the fused BTM model topic number K = 6 and λ = 0.8, clustering is run over the input cluster range [5, 15]. Fig. 10 is the line chart of the intra-class to inter-class distance ratio obtained at different cluster numbers. The results show that the ratio is smallest, and the clustering effect best, at a cluster number of 6. This is consistent with the characteristics of the data set and illustrates the validity of measuring clustering results by the intra-class and inter-class distances. The cluster number 6 is therefore used in the subsequent experiments.
(4) Comparison of the clustering effect of each model:
This experiment compares the clustering effect based on the 6 categories of the document set. The compared models are the fused BTM model, the BTM model, the LDA model and the VSM model. The topic number K of the BTM is 6, the LDA topic number K is 10, and the k-means cluster number is set to 6. The clustering effect is evaluated by the accuracy P and the F value: the better the clustering effect of an algorithm, the larger its accuracy P; the F value, which considers accuracy and recall simultaneously, is likewise proportional to the clustering effect. The clustering effects of the models are shown in the following tables:
Table 2: accuracy P of each model
Fig. 11 gives the accuracy comparison curves of the models corresponding to Table 2; as seen from the figure, the accuracy of the invention is the best.
Table 3: comparison of the F value of each model
Fig. 12 gives the F-value comparison curves of the models corresponding to Table 3; as seen from the figure, the advantage of the invention is the most obvious.
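The evaluation metrics named above can be sketched as follows; F1, the harmonic mean of precision and recall, is assumed, since the text only says the F value considers accuracy and recall simultaneously.

```python
# Per-class F value from true-positive, false-positive and false-negative counts.
def f_value(tp, fp, fn):
    precision = tp / (tp + fp)   # the accuracy P in the text's terminology
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```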
The above is only a specific embodiment of the invention. Any feature disclosed in this specification, unless specifically stated, can be replaced by an equivalent feature or an alternative feature serving a similar purpose; and all the disclosed features, or all the steps of the methods or processes, can be combined in any way except for mutually exclusive features and/or steps.
Claims (3)
1. A short text subject clustering method based on a fused BTM model, characterized in that it comprises the following steps:
Step S1: perform text preprocessing on the short texts to be clustered to obtain a data set D;
Step S2: based on the data set D and a preset topic number K, build the BTM model, generating the document-topic distribution matrix and the topic-word distribution matrix; and represent each text i in D as a text vector based on the document-topic distribution matrix, denoted di_btm;
Step S3: based on the data set D, represent each text i in D as a text vector using the TF-IDF strategy, denoted di_vsm;
Step S4: initialize the marker k* = kmin and set k = kmin, where kmin denotes the lower bound of the cluster number of the k-means algorithm;
Step S5: if k > kmax, execute step S8; otherwise execute step S6; kmax denotes the upper bound of the cluster number of the k-means algorithm;
Step S6: randomly select k text vectors from the data set D as initial cluster centers and run the k-means algorithm to partition D into k clusters; compute the clustering quality J(k) of the current k from the clustering result;
let J(k*) denote the clustering quality at the marker k*; if J(k) < J(k*), update k* = k and then execute step S7; otherwise execute step S7 directly;
wherein, when clustering, the criterion used is the weighted sum of a first distance and a second distance, the first distance being the JS distance between the vectors di_btm and the second distance being the cosine distance between the vectors di_vsm;
the clustering quality is J(k) = I(k)/B(k), where I(k) denotes the intra-class distance and B(k) the inter-class distance;
the intra-class distance is the minimum, over all texts, of a text's average distance to the other texts in its cluster;
the inter-class distance is the distance between the two nearest texts belonging to different clusters;
Step S7: update k = k + 1 and return to step S5;
Step S8: run k-means clustering with k* clusters.
2. The method of claim 1, characterized in that the weight of the first distance is set to 0.8 and the weight of the second distance to 0.2.
3. The method of claim 1, characterized in that the text preprocessing comprises word segmentation and stop-word removal of the short texts.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811546170.4A CN109726394A (en) | 2018-12-18 | 2018-12-18 | Short text Subject Clustering method based on fusion BTM model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811546170.4A CN109726394A (en) | 2018-12-18 | 2018-12-18 | Short text Subject Clustering method based on fusion BTM model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109726394A true CN109726394A (en) | 2019-05-07 |
Family
ID=66296329
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811546170.4A Pending CN109726394A (en) | 2018-12-18 | 2018-12-18 | Short text Subject Clustering method based on fusion BTM model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109726394A (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110263343A (en) * | 2019-06-24 | 2019-09-20 | 北京理工大学 | The keyword abstraction method and system of phrase-based vector |
CN110941961A (en) * | 2019-11-29 | 2020-03-31 | 秒针信息技术有限公司 | Information clustering method and device, electronic equipment and storage medium |
CN111191036A (en) * | 2019-12-30 | 2020-05-22 | 杭州远传新业科技有限公司 | Short text topic clustering method, device, equipment and medium |
CN111523594A (en) * | 2020-04-23 | 2020-08-11 | 湖州师范学院 | Improved KNN fault classification method based on LDA-KMEDOIDS |
CN111897952A (en) * | 2020-06-10 | 2020-11-06 | 中国科学院软件研究所 | Sensitive data discovery method for social media |
CN112132624A (en) * | 2020-09-27 | 2020-12-25 | 平安医疗健康管理股份有限公司 | Medical claims data prediction system |
CN113420112A (en) * | 2021-06-21 | 2021-09-21 | 中国科学院声学研究所 | News entity analysis method and device based on unsupervised learning |
WO2023159758A1 (en) * | 2022-02-22 | 2023-08-31 | 平安科技(深圳)有限公司 | Data enhancement method and apparatus, electronic device, and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103279556A (en) * | 2013-06-09 | 2013-09-04 | 南方报业传媒集团 | Iteration text clustering method based on self-adaptation subspace study |
CN106776579A (en) * | 2017-01-19 | 2017-05-31 | 清华大学 | The sampling accelerated method of Biterm topic models |
CN108197144A (en) * | 2017-11-28 | 2018-06-22 | 河海大学 | A kind of much-talked-about topic based on BTM and Single-pass finds method |
US20180285348A1 (en) * | 2016-07-19 | 2018-10-04 | Tencent Technology (Shenzhen) Company Limited | Dialog generation method, apparatus, and device, and storage medium |
- 2018
- 2018-12-18 CN CN201811546170.4A patent/CN109726394A/en active Pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103279556A (en) * | 2013-06-09 | 2013-09-04 | 南方报业传媒集团 | Iteration text clustering method based on self-adaptation subspace study |
US20180285348A1 (en) * | 2016-07-19 | 2018-10-04 | Tencent Technology (Shenzhen) Company Limited | Dialog generation method, apparatus, and device, and storage medium |
CN106776579A (en) * | 2017-01-19 | 2017-05-31 | 清华大学 | The sampling accelerated method of Biterm topic models |
CN108197144A (en) * | 2017-11-28 | 2018-06-22 | 河海大学 | A kind of much-talked-about topic based on BTM and Single-pass finds method |
Non-Patent Citations (1)
Title |
---|
Li Zehua (李泽华): "Design and Implementation of a Web Log Mining System Based on Short Texts", China Master's Theses Full-text Database, Information Science and Technology series *
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110263343A (en) * | 2019-06-24 | 2019-09-20 | 北京理工大学 | The keyword abstraction method and system of phrase-based vector |
CN110941961A (en) * | 2019-11-29 | 2020-03-31 | 秒针信息技术有限公司 | Information clustering method and device, electronic equipment and storage medium |
CN110941961B (en) * | 2019-11-29 | 2023-08-25 | 秒针信息技术有限公司 | Information clustering method and device, electronic equipment and storage medium |
CN111191036A (en) * | 2019-12-30 | 2020-05-22 | 杭州远传新业科技有限公司 | Short text topic clustering method, device, equipment and medium |
CN111523594A (en) * | 2020-04-23 | 2020-08-11 | 湖州师范学院 | Improved KNN fault classification method based on LDA-KMEDOIDS |
CN111897952A (en) * | 2020-06-10 | 2020-11-06 | 中国科学院软件研究所 | Sensitive data discovery method for social media |
CN111897952B (en) * | 2020-06-10 | 2022-10-14 | 中国科学院软件研究所 | Sensitive data discovery method for social media |
CN112132624A (en) * | 2020-09-27 | 2020-12-25 | 平安医疗健康管理股份有限公司 | Medical claims data prediction system |
CN113420112A (en) * | 2021-06-21 | 2021-09-21 | 中国科学院声学研究所 | News entity analysis method and device based on unsupervised learning |
WO2023159758A1 (en) * | 2022-02-22 | 2023-08-31 | 平安科技(深圳)有限公司 | Data enhancement method and apparatus, electronic device, and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109726394A (en) | Short text Subject Clustering method based on fusion BTM model | |
CN107330049B (en) | News popularity estimation method and system | |
CN104899304B (en) | Name entity recognition method and device | |
CN107992542A (en) | A kind of similar article based on topic model recommends method | |
CN103970866B (en) | Microblog users interest based on microblogging text finds method and system | |
CN105550211A (en) | Social network and item content integrated collaborative recommendation system | |
CN107526819A (en) | A kind of big data the analysis of public opinion method towards short text topic model | |
WO2023029356A1 (en) | Sentence embedding generation method and apparatus based on sentence embedding model, and computer device | |
CN109492678A (en) | A kind of App classification method of integrated shallow-layer and deep learning | |
CN113887643B (en) | New dialogue intention recognition method based on pseudo tag self-training and source domain retraining | |
Ye et al. | A web services classification method based on GCN | |
CN108664558A (en) | A kind of Web TV personalized ventilation system method towards large-scale consumer | |
WO2020147259A1 (en) | User portait method and apparatus, readable storage medium, and terminal device | |
JP2020098592A (en) | Method, device and storage medium of extracting web page content | |
CN117034921B (en) | Prompt learning training method, device and medium based on user data | |
Chen et al. | Popular topic detection in Chinese micro-blog based on the modified LDA model | |
CN113837307A (en) | Data similarity calculation method and device, readable medium and electronic equipment | |
Sun et al. | Rumour detection technology based on the BiGRU_capsule network | |
CN108694176A (en) | Method, apparatus, electronic equipment and the readable storage medium storing program for executing of document sentiment analysis | |
CN116842934A (en) | Multi-document fusion deep learning title generation method based on continuous learning | |
Wang | Research on the art value and application of art creation based on the emotion analysis of art | |
Zhou et al. | The improved grey model by fusing exponential buffer operator and its application | |
CN110413782A (en) | A kind of table automatic theme classification method, device, computer equipment and storage medium | |
CN112434126A (en) | Information processing method, device, equipment and storage medium | |
Lu et al. | A novel method for Chinese named entity recognition based on character vector |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | | Application publication date: 20190507 |