CN112667806A - Text classification screening method using LDA


Info

Publication number
CN112667806A
Authority
CN
China
Prior art keywords
text, sentences, topic, LDA, model
Prior art date
2020-10-20
Legal status
Pending
Application number
CN202011123125.5A
Other languages
Chinese (zh)
Inventor
赵博
吕建文
周兴晖
陈力
薛柔月
金鑫
蒋尚秀
Current Assignee
Shanghai Golden Bridge Info Tech Co ltd
Original Assignee
Shanghai Golden Bridge Info Tech Co ltd
Priority date
2020-10-20
Filing date
2020-10-20
Publication date
2021-04-16
Application filed by Shanghai Golden Bridge Info Tech Co ltd filed Critical Shanghai Golden Bridge Info Tech Co ltd
Priority to CN202011123125.5A
Publication of CN112667806A
Legal status: Pending


Abstract

The invention provides a text classification screening method using LDA, which comprises the following steps: acquiring a data set whose content comprises a plurality of short sentences; preprocessing the data with natural language processing methods, cleaning and sorting it; determining a topic and manually selecting a plurality of text sentences that match the topic; building the corresponding text vector matrix from the selected sentences with a bag-of-words model; training a first LDA model with the vector matrix; screening the remaining sentences in the text with the first LDA model, computing the correlation between the text set and the topic words produced by the first LDA topic calculation, and using that correlation against a threshold to evaluate whether a sentence matches the selected topic model; adding the texts passed by the topic-relevance screen and training a second LDA model; judging and screening the still-remaining sentences by cosine similarity with the second LDA model; and taking the sentences kept by the three screenings in total as the text data matching the screening target.

Description

Text classification screening method using LDA
Technical Field
The invention relates to the field of natural language processing. It can effectively screen sentences that match a selected topic, prepare data sets for various machine learning algorithms, or classify texts.
Background
Machine learning is being applied in an ever wider range of fields. However, a model that processes natural language often needs to be trained for a preset, specialized topic, and training it requires a manually labeled data set to guarantee model quality. In many cases no labeled data is readily available, so how to supply the model with data of the highest possible quality becomes a pressing concern.
Model training cannot do without data, but often there is not enough of it (the data quality is too low, or the monetary cost of labeling is too high). The industry has therefore proposed so-called unsupervised learning, but it is still rarely used; far more often, additional training samples are simply collected.
Disclosure of Invention
The technical problem solved by the invention is as follows: a text classification screening method using LDA (Latent Dirichlet Allocation) is provided. Faced with text data, a small amount of manually selected or labeled data is used; the features of that data are extracted to train a classification model, and the classification model is used to screen and classify the remaining data, so that text data of different topics can be classified at lower cost and higher speed. The method is characterized in that a small amount of data meeting the topic requirements is selected manually, after which an LDA model extracts the features of that data in order to screen the rest quickly.
The technical scheme of the invention is a text classification screening method using LDA, which comprises the following steps:
(1) acquiring a data set whose content comprises a plurality of short sentences;
(2) preprocessing the data with natural language processing methods, cleaning and sorting it;
(3) determining a topic, and manually selecting a plurality of text sentences that match the topic;
(4) building the corresponding text vector matrix from the selected text sentences with a bag-of-words model;
(5) training a first LDA model with the vector matrix;
(6) screening the remaining sentences in the text with the first LDA model: computing the correlation between the text set and the topic words produced by the first LDA topic calculation, and using that correlation against a threshold to evaluate whether a sentence matches the selected topic model;
(7) adding the texts passed by the topic-relevance screen, and training a second LDA model;
(8) judging and screening the remaining sentences in the text by cosine similarity with the second LDA model;
(9) taking the sentences kept by the three screenings in total (manual screening, topic-similarity screening, and cosine-similarity screening) as the text data matching the screening target.
Further, in step 2, preprocessing the data comprises:
selecting sentences longer than 10 words; removing punctuation marks, garbled characters, and all characters other than English letters and digits; repairing grammatical problems, word misspellings, and colloquial vocabulary; repairing space and indentation problems; and repairing abnormal characters. The cleaning and sorting comprises rough cleaning with a bag-of-words model and selecting the text sentences with high topic weight.
Further, in step 3, manually selecting a plurality of text sentences that match the topic comprises: keeping only one copy of repeated sentences, two sentences describing the same thing being considered repetitive when half of their words are the same;
expanding abbreviations and shorthand content, since spoken-language expression produces abbreviated forms of some representations, which require manual discovery and replacement.
Further, in step 3, the data set to be screened is prepared with one document per line, and 800 to 1000 well-formed sentences that meet the requirements of the selected topic are manually selected from it; a dictionary and an index are then built for every word of the selected text.
Further, in step 4, each document is vectorized with a bag-of-words model. The model treats a document only as a collection of words, each word's occurrence in the document being independent of whether any other word occurs. The vectorized data is then used to compute a word-frequency matrix, i.e., the document-term (DT) matrix.
Further, in step 5, the number of topics into which the documents are to be classified is set, and the DT matrix is used to train the first LDA model: the parameters of the topic distribution are first drawn from a Dirichlet distribution, the topic distribution of a text is randomly generated, and a topic is randomly generated at each position of the text according to that topic distribution; the parameters of the word distribution are then drawn from a Dirichlet distribution to obtain the word distribution of each topic, and a word is randomly generated at each position according to its topic's word distribution, until the last position of the text is reached and the whole text is generated; finally, the process is repeated to generate all texts.
Further, in step 6, the trained first LDA model performs topic judgment on the unselected texts, giving the probability that each text belongs to each topic; if, under the LDA judgment, the highest probability of a sentence belonging to some topic exceeds a set threshold, the text is selected.
Further, in step 7, a new data set is formed from the manually selected texts and the texts selected by the first LDA model, and the second LDA model is trained on it.
Further, in step 8, cosine similarity detection is performed on all the remaining texts with the second LDA model and the previously selected corpus; if the highest similarity value between a text and any selected text exceeds a set threshold, the text is selected.
Further, in step 9, the required text data is thus screened according to the chosen classification standard by three selections in total: manual selection, LDA topic selection, and cosine-similarity selection.
Advantageous effects:
the text data processed by the method can be suitable for being classified and screened quickly when the data volume reaches more than ten million lines. Thousands of sentences which accord with the selected theme direction are manually selected, the LDA is used for theme similarity screening, the LDA theme model is used for selecting a part of comment pairs which highly accord with the theme, and therefore a sample which is large enough is used for training a relatively perfect LDA classification model. And finally, carrying out similarity detection on the rest sentences and the sentences in the trained LDA model, and selecting proper data. Through the three-time screening, the disadvantage of unsupervised machine learning is overcome to a certain extent, and the accuracy of screening and classification is improved under the condition of ensuring the speed. Has the following advantages:
(1) it is suitable for screening large-scale data and saves the cost of manual labeling;
(2) it can effectively distinguish the topic similarity between different texts;
(3) its classification effect on short texts, especially short comments, is excellent;
(4) the texts obtained are suitable for various machine learning algorithms;
(5) the screening process guarantees screening quality while greatly improving screening speed.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below. Obviously, the described embodiments are only some of the embodiments of the present invention rather than all of them; all other embodiments obtained by those skilled in the art without creative effort on the basis of these embodiments fall within the protection scope of the present invention.
The core technical model used by the invention is an LDA topic classification model, and a series of steps and strategies are designed around the model for data screening. The main principle of the LDA model is as follows:
the LDA model is a three-layer Bayes Topic model, Topic information implied in a text is discovered through an unsupervised learning method, and the purpose is to discover implied semantic dimensions, namely 'Topic' or 'Concept', from the text by an unguided learning method. The essence of implicit semantic analysis is to use the co-occurrence features of terms (term) in text to find the Topic structure of text, and this method does not need any background knowledge about text. Implicit semantic representation of text can model linguistic phenomena of "ambiguous words" and "ambiguous words" such that search results obtained by a search engine system match the query of a user at a semantic level, rather than just intersecting the query at a lexical level.
In the two-class case of LDA: given a data set

$$D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$$

of $m$ samples in the vector space $R^n$, where $x_i$ is the $n$-dimensional vector of sample $i$ and $y_i \in \{0, 1\}$, the invention defines $N_j$ ($j = 0, 1$) as the number of samples of class $j$, $X_j$ ($j = 0, 1$) as the set of samples of class $j$, $\mu_j$ ($j = 0, 1$) as the mean vector of class $j$, and $\Sigma_j$ ($j = 0, 1$) as the covariance matrix of class $j$.

$\mu_j$ and $\Sigma_j$ are respectively:

$$\mu_j = \frac{1}{N_j} \sum_{x \in X_j} x \qquad (j = 0, 1)$$

$$\Sigma_j = \sum_{x \in X_j} (x - \mu_j)(x - \mu_j)^T \qquad (j = 0, 1)$$

If the data are projected onto a line $\omega$, the projections of the two class centers onto the line are $\omega^T \mu_0$ and $\omega^T \mu_1$ respectively. The invention wants the projected points of same-class data to lie as close together as possible, i.e., the projected covariances of same-class samples, $\omega^T \Sigma_0 \omega$ and $\omega^T \Sigma_1 \omega$, should be as small as possible, while the projected class centers lie as far apart as possible. The optimization objective of the invention is therefore:

$$J(\omega) = \frac{\lVert \omega^T \mu_0 - \omega^T \mu_1 \rVert_2^2}{\omega^T \Sigma_0 \omega + \omega^T \Sigma_1 \omega}$$

The within-class scatter matrix $S_w$ is generally defined as:

$$S_w = \Sigma_0 + \Sigma_1 = \sum_{x \in X_0} (x - \mu_0)(x - \mu_0)^T + \sum_{x \in X_1} (x - \mu_1)(x - \mu_1)^T$$

and the between-class scatter matrix $S_b$ as:

$$S_b = (\mu_0 - \mu_1)(\mu_0 - \mu_1)^T$$

The optimization objective can then be rewritten as:

$$J(\omega) = \frac{\omega^T S_b \omega}{\omega^T S_w \omega}$$

Applying the Lagrange multiplier method yields the eigenvector problem:

$$S_w^{-1} S_b\, \omega = \lambda \omega, \quad \text{i.e.} \quad \omega \propto S_w^{-1} (\mu_0 - \mu_1)$$

This is a form of the generalized Rayleigh quotient: for two-class samples, the optimal projection direction $\omega$ is determined simply by the means and covariances of the original samples.

In the multi-class case, the projection target is no longer a line but a low-dimensional subspace (a hyperplane). Assuming the projected space has dimension $d$ with basis vectors $(\omega_1, \omega_2, \ldots, \omega_d)$ assembled into an $n \times d$ matrix $W$, the optimization objective of the invention should be written as:

$$J(W) = \frac{\prod_{\mathrm{diag}} W^T S_b W}{\prod_{\mathrm{diag}} W^T S_w W}$$

where $\prod_{\mathrm{diag}}$ denotes the product of the diagonal elements of a matrix, $W$ is the matrix formed by the low-dimensional basis vectors, and $W \in R^{n \times d}$ with $d \le N - 1$, where $N$ is the number of sample classes.
The LDA text classification screening method depends on word-vector theory. In associating words with vectors, each word in an article (or a collection of articles) is generally considered to obey a probability distribution $\vec{p}$ over the vocabulary, and this distribution is itself given a prior distribution $p(\vec{p})$. For example, in the relevant literature the frequency of the word "network" is closely related to the frequency of the word "neural". Given $\vec{p}$, the probability of generating the corpus $W$ is

$$p(W \mid \vec{p}) = \prod_{k=1}^{V} p_k^{n_k},$$

where the word counts $n_k$ of the corpus $W$ follow a multinomial distribution. The overall probability of generating the corpus is obtained by integrating over all $\vec{p}$:

$$p(W) = \int p(W \mid \vec{p})\, p(\vec{p})\, d\vec{p}.$$

When computing this, the prior $p(\vec{p})$ must be specified. Considering that the multinomial distribution and the Dirichlet distribution are conjugate, a Dirichlet prior can be adopted:

$$\mathrm{Dir}(\vec{p} \mid \vec{\alpha}) = \frac{1}{\Delta(\vec{\alpha})} \prod_{k=1}^{V} p_k^{\alpha_k - 1},$$

where $\Delta(\vec{\alpha})$ is just the normalization factor, namely:

$$\Delta(\vec{\alpha}) = \int \prod_{k=1}^{V} p_k^{\alpha_k - 1}\, d\vec{p} = \frac{\prod_{k=1}^{V} \Gamma(\alpha_k)}{\Gamma\!\bigl(\sum_{k=1}^{V} \alpha_k\bigr)}.$$

From the conjugacy of the multinomial and Dirichlet distributions, one obtains the posterior:

$$p(\vec{p} \mid W, \vec{\alpha}) = \mathrm{Dir}(\vec{p} \mid \vec{\alpha} + \vec{n}).$$

With the posterior distribution known, either its maximum point or the posterior mean of the parameter can be used as an estimate of $\vec{p}$; taking the posterior mean gives

$$\hat{p}_k = \frac{n_k + \alpha_k}{\sum_{i=1}^{V} (n_i + \alpha_i)}.$$

For a corpus, the words with the highest estimated $\hat{p}_k$ can be grouped into a "cluster center", i.e., a topic of the text; here $V$ is the number of words and $k$ is the index of one of them.
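For instance (an illustrative calculation, not from the original text), with a three-word vocabulary, observed counts $\vec{n} = (5, 3, 2)$ and a symmetric prior $\vec{\alpha} = (1, 1, 1)$, the posterior is $\mathrm{Dir}(\vec{p} \mid 6, 4, 3)$ and the posterior-mean estimate is

$$\hat{p} = \left(\tfrac{6}{13}, \tfrac{4}{13}, \tfrac{3}{13}\right),$$

since $\sum_i (n_i + \alpha_i) = 10 + 3 = 13$.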
According to one embodiment of the invention, the data needs to be preprocessed before the text is classified and screened by using the method. For general web text data, there is much useless information (such as links and emoticons), and rough cleaning is required. Several steps may be used:
1. Sentences longer than 10 words are selected.
2. Punctuation marks are removed, garbled characters are removed, and all characters other than English letters and digits are removed.
3. Grammatical problems, word misspellings, and colloquial vocabulary are repaired.
4. Space and indentation problems are repaired.
5. Abnormal characters are repaired (e.g., common meaningless tokens such as "quot" and "amp" left over from HTML entities).
6. Rough cleaning is carried out with a bag-of-words model, and text sentences with high topic weight are selected.
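A minimal sketch of the rough cleaning above (illustrative only, not from the original disclosure; the regular expressions and the 10-word limit merely restate steps 1, 2, 4 and 5, the function name is an assumption, and steps 3 and 6 — grammar repair and bag-of-words weighting — are omitted):

    import re

    def rough_clean(sentence: str):
        # Step 2: keep only English letters, digits and whitespace.
        sentence = re.sub(r"[^A-Za-z0-9\s]", " ", sentence)
        # Step 5: drop HTML-entity residue such as "quot" and "amp".
        sentence = re.sub(r"\b(quot|amp|lt|gt)\b", " ", sentence)
        # Step 4: collapse runs of spaces and indentation into single spaces.
        sentence = re.sub(r"\s+", " ", sentence).strip()
        # Step 1: keep only sentences longer than 10 words.
        return sentence if len(sentence.split()) > 10 else None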
In the text screening process, short sentences should be kept in preference to long ones: a long sentence contains more words, so the whole sentence may receive a high weight because of its word count rather than because the desired topic is strongly present, while in fact being unrelated to that topic. At the same time, sentences that are too short match poorly and should be deleted during screening.
According to an embodiment of the invention, the characteristics of web language need extra attention when processing the data, so some manual screening techniques are provided:
1. only one entry should be kept for repeated sentences. There are a large number of repetitive languages in the web language, and for sentences describing the same thing, it is considered repetitive when the half words of the sentence are the same.
2. Abbreviations and shorthand contents should be expanded. Due to the simplicity of web languages, many people when using spoken language to express them give some abbreviated representations, such as B & W (Black and white) and FAV (Favorite), require manual discovery and replacement.
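An illustrative sketch of the two techniques above; reading "half the words are the same" as word-set overlap, and the abbreviation table, are assumptions for demonstration:

    def is_duplicate(a: str, b: str) -> bool:
        # Two sentences describing the same thing count as repetitive
        # when half of the words of either sentence appear in the other.
        words_a, words_b = set(a.lower().split()), set(b.lower().split())
        if not words_a or not words_b:
            return False
        overlap = len(words_a & words_b)
        return overlap * 2 >= len(words_a) or overlap * 2 >= len(words_b)

    # Abbreviations discovered manually are expanded by direct replacement.
    ABBREVIATIONS = {"B&W": "black and white", "FAV": "favorite"}

    def expand_abbreviations(sentence: str) -> str:
        for short, full in ABBREVIATIONS.items():
            sentence = sentence.replace(short, full)
        return sentence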
According to one embodiment of the present invention, screening data with LDA is implemented as follows:
1. The data set to be screened is prepared with one document per line, which makes traversal easy.
2. About 800 to 1000 well-formed sentences that meet the selected topic requirements are manually selected from it.
3. A dictionary and an index are built for every word of the selected text.
4. Each text is vectorized with a bag-of-words model, a document representation commonly used in information retrieval. The model treats a document only as a collection of words, each word's occurrence in the document being independent of the occurrence of any other word; that is, whichever word appears at any position of the document is chosen independently, uninfluenced by the document's semantics. The vectorized data is then used to compute a word-frequency matrix, i.e., the document-term (DT) matrix (a gensim-based sketch of steps 3 to 8 is given after this list).
5. The number of topics into which the documents are to be classified is set, and the DT matrix is used to train the first LDA model: the parameters of the topic distribution are first drawn from a Dirichlet distribution, the topic distribution of a text is randomly generated, and a topic is randomly generated at each position of the text according to that distribution; the parameters of the word distribution are then drawn from a Dirichlet distribution to obtain the word distribution of each topic, and a word is randomly generated at each position according to its topic's word distribution, until the last position of the text is reached and the whole text is generated; finally, the process is repeated to generate all texts.
6. The trained first LDA model performs topic judgment on the unselected texts, giving the probability that each text belongs to each topic. If, under the first LDA's judgment, the highest probability of a sentence belonging to some topic exceeds a set threshold, the text is selected.
7. A new data set is formed from the previously hand-picked texts and the texts selected by the first LDA model, and the second LDA model is retrained on it.
8. For all the remaining texts, cosine similarity detection is performed with the second LDA model against the selected corpus; if the similarity between a text and the most similar selected text exceeds a set threshold, the text is selected.
9. In this way, the required text data is screened according to the chosen classification standard through three selections in total: manual selection, first-LDA topic selection, and cosine-similarity selection.
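The following is a minimal illustrative sketch (not taken from the original disclosure) of steps 3 to 8 above using gensim's public API; the variable names, the placeholder corpora, and the 0.3 threshold are assumptions, since the patent leaves them unspecified:

    from gensim import corpora, models

    # Placeholder corpora: one document per line (steps 1-2 of this list).
    selected_documents = ["an example sentence kept by manual screening ..."]
    remaining_documents = ["an example sentence not yet screened ..."]

    # Step 3: dictionary and index for every word of the selected text.
    texts = [doc.split() for doc in selected_documents]
    dictionary = corpora.Dictionary(texts)

    # Step 4: bag-of-words vectorization -> word-frequency (DT) matrix.
    dt_matrix = [dictionary.doc2bow(text) for text in texts]

    # Step 5: train the first LDA model with a chosen number of topics.
    lda = models.LdaModel(dt_matrix, id2word=dictionary, num_topics=7)

    # Step 6: keep a sentence when its most probable topic passes a threshold.
    TOPIC_THRESHOLD = 0.3  # assumed value; the patent only says "a set threshold"
    screened = []
    for doc in remaining_documents:
        bow = dictionary.doc2bow(doc.split())
        topics = lda.get_document_topics(bow)
        if topics:
            best_topic, best_prob = max(topics, key=lambda tp: tp[1])
            if best_prob > TOPIC_THRESHOLD:
                screened.append(doc)

    # Steps 7-8: retrain a second LDA model on the enlarged corpus.
    enlarged = [dictionary.doc2bow(doc.split())
                for doc in selected_documents + screened]
    lda2 = models.LdaModel(enlarged, id2word=dictionary, num_topics=7)

The cosine-similarity stage of step 8 is sketched after the code-logic list below.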
According to an embodiment of the present invention, the specific code processing logic of the proposed method is as follows:
The programming language used in the invention is Python; the nltk and gensim modules implement the main functions.
(1) The raw json data obtained by the crawler is converted into a list, turning it into a data structure that can be processed.
(2) Data cleaning: several regular-expression functions are first written to remove garbled characters, non-English words, abbreviations and the like from the text sentences; a common stop-word list is then loaded into a list, and every word of every text is traversed and replaced with the replace() function, finally yielding cleaner data.
(3) The topic to be screened is determined.
(4) From this huge data set, 800 pieces of high-quality text meeting the topic requirements are manually selected.
(5) This yields data suitable for screening with the LDA model. First, each line of the data is treated as a document string and loaded into a list, forming one huge document list.
(6) Dictionary () function of the gensim module is used to create a dictionary of words of a corpus, ranging from all the words that appear in a document, each individual word being assigned an index.
(7) Doc2bow () traversal is used to turn the document list into a word vector matrix, otherwise known as a DT matrix.
(8) The LDA model is initialized using genim, models, ldamodel, and assigned as DT matrix, and the number of topic types to be classified, preferably, the number of types is 7 (the experimental results show that the classification of 7 has the most obvious distinguishing effect).
(9) The texts other than the 800 manually screened ones are converted into word vectors in the same way, and the LdaModel is used to judge the probability of each text belonging to each topic; a text is selected if the probability of its most likely topic reaches the threshold.
(10) Through step (9), a data set containing more sentences is obtained. A second LDA model is retrained on this data set.
(11) The similarities.MatrixSimilarity() function is used to convert the query corpus into the LDA vector space and to build an index for each document/sentence in it.
(12) Using:

    sims = index[lda[word_vector]]  # word_vector: BoW vector of an arbitrary text
    result = [(dt_matrix[i[0]], i[1]) for i in enumerate(sims)]

the cosine similarity between the word vector corresponding to any text in the document list and the most similar word vector in the DT matrix can be obtained; if that value is greater than the threshold, the text data is selected.
(13) In summary, there are three screenings: the first is manual, the second uses first-LDA topic screening, and the third uses second-LDA vector-space topic-similarity detection. This layered, progressive screening uses a small amount of manual selection plus the features learned by the LDA models to improve screening quality while maintaining screening speed. A consolidated sketch of the similarity screening of steps (11)-(12) follows.
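Continuing the names from the earlier sketch, the cosine-similarity screening of steps (11)-(12) might look as follows; similarities.MatrixSimilarity is gensim's documented API, while the 0.5 threshold is an assumed value:

    from gensim import similarities

    # Index the selected corpus in the (second) LDA topic space.
    index = similarities.MatrixSimilarity(lda2[enlarged],
                                          num_features=lda2.num_topics)

    SIM_THRESHOLD = 0.5  # assumed value; the patent only requires "a set threshold"
    final_selection = []
    for doc in remaining_documents:
        bow = dictionary.doc2bow(doc.split())
        sims = index[lda2[bow]]  # cosine similarity to every selected text
        result = [(i, sim) for i, sim in enumerate(sims)]
        best_idx, best_sim = max(result, key=lambda pair: pair[1])
        if best_sim > SIM_THRESHOLD:
            final_selection.append(doc)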
Portions of the invention not described in detail are well within the skill of the art.
Although illustrative embodiments of the present invention have been described above to help those skilled in the art understand it, it should be understood that the invention is not limited in scope to those embodiments. To those skilled in the art, various changes will be apparent so long as they remain within the spirit and scope of the invention as defined and determined by the appended claims, and all inventive creations using the inventive concept are protected.

Claims (10)

1. A text classification screening method using LDA is characterized by comprising the following steps:
(1) acquiring a data set whose content comprises a plurality of short sentences;
(2) preprocessing the data with natural language processing methods, cleaning and sorting it;
(3) determining a topic, and manually selecting a plurality of text sentences that match the topic;
(4) building the corresponding text vector matrix from the selected text sentences with a bag-of-words model;
(5) training a first LDA model with the vector matrix;
(6) screening the remaining sentences in the text with the first LDA model: computing the correlation between the text set and the topic words produced by the first LDA topic calculation, and using that correlation against a threshold to evaluate whether a sentence matches the selected topic model;
(7) adding the texts passed by the topic-relevance screen, and training a second LDA model;
(8) judging and screening the remaining sentences in the text by cosine similarity with the second LDA model;
(9) taking the sentences kept by the three screenings in total (manual screening, topic-similarity screening, and cosine-similarity screening) as the text data matching the screening target.
2. The method as claimed in claim 1, wherein in step 2 the preprocessing of the data comprises:
selecting sentences longer than 10 words; removing punctuation marks, garbled characters, and all characters other than English letters and digits; repairing grammatical problems, word misspellings, and colloquial vocabulary; repairing space and indentation problems; and repairing abnormal characters; the cleaning and sorting comprising rough cleaning with a bag-of-words model and selecting the text sentences with high topic weight.
3. The method as claimed in claim 1, wherein in step 3 the manual selection of text sentences matching the topic comprises:
keeping only one copy of repeated sentences, two sentences describing the same thing being considered repetitive when half of their words are the same;
expanding abbreviations and shorthand content, spoken-language expression producing abbreviated forms of some representations that require manual discovery and replacement.
4. The method as claimed in claim 1, wherein in step 3 the data set to be screened is prepared with one document per line, 800 to 1000 sentences meeting the requirements of the selected topic are manually selected from it, and a dictionary and an index are built for every word of the selected text.
5. The method as claimed in claim 1, wherein in step 4 each text is vectorized with a bag-of-words model that treats it only as a collection of words, each word of the text being independent of the occurrence of the other words, and the vectorized data is then used to compute a word-frequency matrix, i.e., the document-term (DT) matrix.
6. The method as claimed in claim 1, wherein in step 5 the number of topics into which the documents are to be classified is set and the DT matrix is used to train the first LDA model: the parameters of the topic distribution are first drawn from a Dirichlet distribution, the topic distribution of a text is randomly generated, and a topic is randomly generated at each position of the text according to that topic distribution; the parameters of the word distribution are then drawn from a Dirichlet distribution to obtain the word distribution of each topic, and a word is randomly generated at each position according to its topic's word distribution, until the last position of the text is reached and the whole text is generated; finally, the process is repeated to generate all texts.
7. The method as claimed in claim 1, wherein in step 6 the trained first LDA model performs topic judgment on the unselected texts, giving the probability that each text belongs to each topic; and if, under the LDA judgment, the highest probability of a sentence belonging to some topic exceeds a set threshold, the text is selected.
8. The method as claimed in claim 1, wherein in step 7 a new data set is formed from the previously manually selected texts and the texts selected by the first LDA model, and the second LDA model is retrained on it.
9. The method as claimed in claim 1, wherein in step 8 cosine similarity detection is performed on all the remaining texts with the second LDA model and the previously selected corpus, and if the highest similarity value between a text and any selected text exceeds a preset threshold, the text is selected.
10. The method as claimed in claim 1, wherein in step 9 the required text data is screened according to the chosen classification standard by three selections in total: manual selection, LDA topic selection, and cosine-similarity selection.
CN202011123125.5A 2020-10-20 2020-10-20 Text classification screening method using LDA Pending CN112667806A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011123125.5A CN112667806A (en) 2020-10-20 2020-10-20 Text classification screening method using LDA


Publications (1)

Publication Number Publication Date
CN112667806A (en) 2021-04-16

Family

ID=75403286

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011123125.5A Pending CN112667806A (en) 2020-10-20 2020-10-20 Text classification screening method using LDA

Country Status (1)

Country Link
CN (1) CN112667806A (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180032600A1 (en) * 2016-08-01 2018-02-01 International Business Machines Corporation Phenomenological semantic distance from latent dirichlet allocations (lda) classification
CN106844424A (en) * 2016-12-09 2017-06-13 宁波大学 A kind of file classification method based on LDA
CN107122349A (en) * 2017-04-24 2017-09-01 无锡中科富农物联科技有限公司 A kind of feature word of text extracting method based on word2vec LDA models
CN107291688A (en) * 2017-05-22 2017-10-24 南京大学 Judgement document's similarity analysis method based on topic model
CN107609121A (en) * 2017-09-14 2018-01-19 深圳市玛腾科技有限公司 Newsletter archive sorting technique based on LDA and word2vec algorithms
CN108280164A (en) * 2018-01-18 2018-07-13 武汉大学 A kind of short text filtering and sorting technique based on classification related words
CN108664633A (en) * 2018-05-15 2018-10-16 南京大学 A method of carrying out text classification using diversified text feature
CN110895656A (en) * 2018-09-13 2020-03-20 武汉斗鱼网络科技有限公司 Text similarity calculation method and device, electronic equipment and storage medium
CN109376347A (en) * 2018-10-16 2019-02-22 北京信息科技大学 A kind of HSK composition generation method based on topic model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MIHA PAVLINEK ET AL.: "Text classification method based on self-training and LDA topic models", Expert Systems with Applications, vol. 80, 8 March 2017 (2017-03-08), pages 83 - 93, XP029974861, DOI: 10.1016/j.eswa.2017.03.020 *
杨瑞欣: "Research on LDA short-text clustering algorithms for Weibo comments" (面向微博评论的LDA短文本聚类算法研究), China Masters' Theses Full-text Database, Information Science and Technology, no. 8, 15 August 2020 (2020-08-15), pages 138 - 843 *
王胜 et al.: "A domain label acquisition method based on SL-LDA" (基于SL-LDA的领域标签获取方法), Computer Science (计算机科学), vol. 47, no. 11, 21 July 2020 (2020-07-21), pages 95 - 100 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113887584A (en) * 2021-09-16 2022-01-04 同济大学 Emergency traffic strategy evaluation method based on social media data
CN113887584B (en) * 2021-09-16 2022-07-05 同济大学 Emergency traffic strategy evaluation method based on social media data
CN116307792A (en) * 2022-10-12 2023-06-23 广州市阿尔法软件信息技术有限公司 Urban physical examination subject scene-oriented evaluation method and device
CN116307792B (en) * 2022-10-12 2024-03-12 广州市阿尔法软件信息技术有限公司 Urban physical examination subject scene-oriented evaluation method and device
CN115658866A (en) * 2022-10-27 2023-01-31 国网山东省电力公司烟台供电公司 Text continuous writing method capable of self-adaptive input, storage medium and device
CN115658866B (en) * 2022-10-27 2024-03-12 国网山东省电力公司烟台供电公司 Text renewing method capable of self-adaptively inputting, storage medium and device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination