CN112667806A - Text classification screening method using LDA - Google Patents
Text classification screening method using LDA
- Publication number: CN112667806A (application CN202011123125.5A)
- Authority: CN (China)
- Prior art keywords: text, sentences, topic, lda, model
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The invention provides a text classification screening method using LDA, which comprises the following steps: acquiring a data set whose content comprises a number of short sentences; preprocessing the data with natural language processing methods, cleaning and sorting it; determining a topic and manually selecting a number of text sentences that conform to the topic; building a text vector matrix from the selected sentences using a bag-of-words model; training a first LDA model with the vector matrix; screening the remaining sentences with the first LDA model, calculating the correlation between each sentence and the topic words obtained from the first LDA model and using a threshold on that correlation to evaluate whether the sentence fits the selected topic; adding the texts that pass the topic-relevance screening and training a second LDA model; judging and screening the remaining sentences by cosine similarity using the second LDA model; and taking the sentences that pass all three screenings as the text data conforming to the screening target.
Description
Technical Field
The invention relates to the field of natural language processing. It can effectively screen sentences that conform to a selected topic, prepare data sets for various machine learning algorithms, or classify texts.
Background
Machine learning is used in an ever wider range of fields. However, a model that processes natural language often must be trained for a specific preset topic, and training it requires a manually labeled data set to ensure the quality of the model. In many cases, however, no labeled data is readily available, and the key question is how to provide the model with data of the highest possible quality.
Training a model requires data, but often there is not enough of it (the data quality is too low, or the monetary cost of labeling is too high). The industry has therefore proposed so-called unsupervised learning, but it is still rarely used; more often, additional training samples are simply added.
Disclosure of Invention
The technical problem solved by the invention: a text classification screening method using LDA (latent Dirichlet allocation) is provided. When facing text data, a small amount of manually selected or labeled data is used; the features of that data are extracted to train a classification model, and the classification model then screens and classifies the data, so that text data on different topics can be classified at lower cost and higher speed. The method is characterized by manually selecting a small amount of data meeting the topic requirements and then extracting its features with an LDA model in order to screen the data quickly.
The technical scheme of the invention is a text classification screening method using LDA, which comprises the following steps:
(1) acquiring a data set, wherein the content comprises a plurality of short sentences;
(2) preprocessing the data by using a natural language processing method, and cleaning and sorting the data;
(3) determining a theme, and manually selecting a plurality of text sentences which accord with the theme;
(4) establishing a corresponding text vector matrix by using a bag-of-words model by using the selected text sentences;
(5) training a first LDA model by using the vector matrix;
(6) screening the remaining sentences in the text by using the first LDA model: calculating the correlation between each sentence and the topic words obtained from the first LDA model, and using a threshold on that correlation to evaluate whether the sentence fits the selected topic;
(7) adding texts screened by topic relevance, and training a second LDA model;
(8) judging and screening the remaining sentences in the text by cosine similarity by using the second LDA model;
(9) taking the sentences that pass all three screenings (manual screening, topic-similarity screening and cosine-similarity screening) as the text data conforming to the screening target.
Further, in step 2, the preprocessing the data includes:
selecting sentences longer than 10 words; removing punctuation marks, mis-encoded characters, and characters other than English letters and digits; repairing grammatical problems, misspelled words and colloquial vocabulary; repairing space and indentation problems; repairing abnormal characters. The cleaning and sorting comprises rough cleaning with a bag-of-words model and selecting text sentences with high topic weight.
Further, in step 3, manually selecting the text sentences that conform to the topic includes: only one copy of a repeated sentence should be kept, and for sentences describing the same thing, two sentences are considered repetitive when half of their words are the same;
abbreviations and shorthand should be expanded; representations abbreviated in spoken-language expression must be found and replaced manually.
Further, in step 3, the data set to be screened is arranged with one document per line, and 800 to 1000 texts that meet the requirements of the selected topic are manually selected from it, keeping words and sentences in order; a dictionary and an index are then built for each word of the selected texts.
Further, in step 4, each document is vectorized with a bag-of-words model. The model treats a document only as a collection of words, with each word's occurrence in the document independent of whether any other word occurs; the vectorized data is then used to compute a word-frequency matrix, the document-topic (DT) matrix.
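A minimal sketch of the vectorization described in this step, using only the standard library rather than gensim (the function names and the toy documents in the usage example are assumptions for illustration):

```python
from collections import Counter

# Bag-of-words vectorization: build a word dictionary over the corpus,
# then a word-frequency (DT) matrix with one row per document.
def build_dictionary(docs):
    vocab = sorted({w for d in docs for w in d.split()})
    return {w: i for i, w in enumerate(vocab)}

def doc_term_matrix(docs, dictionary):
    rows = []
    for d in docs:
        counts = Counter(d.split())
        # each row lists the frequency of every dictionary word in this document
        rows.append([counts.get(w, 0) for w in dictionary])
    return rows
```

For example, `build_dictionary(["a b a", "b c"])` assigns indices `{"a": 0, "b": 1, "c": 2}` and the resulting matrix has rows `[2, 1, 0]` and `[0, 1, 1]`.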
Further, in step 5, the number of topics into which the documents are to be classified is set, and the DT matrix is used to train the first LDA model. First, the parameters of the topic distribution are drawn from a Dirichlet distribution, the topic distribution of a text is generated, and a topic is randomly generated at each position of the text according to that distribution; then the parameters of the word distribution are drawn from a Dirichlet distribution to obtain each topic's word distribution, and a word is randomly generated at each position according to the word distribution of its topic, until the last position of the text is reached and the whole text has been generated; finally, the process is repeated to generate all texts.
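The generative process described in this step can be sketched by direct sampling; the Dirichlet draw uses the standard gamma-variate construction, and the toy topic-word table in the usage example is an illustrative assumption, not a trained parameter:

```python
import random

# Sample a point on the simplex from a Dirichlet distribution:
# normalize independent Gamma(alpha_i, 1) draws.
def sample_dirichlet(alpha):
    g = [random.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

# Draw an index according to a discrete probability vector.
def sample_categorical(probs):
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

# Generate one text: a topic distribution for the text, then at each
# position a topic, then a word from that topic's word distribution.
def generate_text(length, alpha, topic_word):
    theta = sample_dirichlet(alpha)          # topic distribution of this text
    words = []
    for _ in range(length):
        z = sample_categorical(theta)        # topic at this position
        words.append(sample_categorical(topic_word[z]))  # word from topic z
    return words
```

Repeating `generate_text` for every document reproduces the "generate all texts" step.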
Further, in step 6, the trained first LDA model performs topic judgment on each unselected text and gives the probability that the text belongs to each topic; if, for a sentence, the probability of its most likely topic under the LDA judgment exceeds a set threshold, the text is selected.
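A minimal sketch of the thresholding rule in this step (the threshold value and the (topic, probability) pairs in the usage example are illustrative assumptions; in practice the pairs would come from the trained LDA model):

```python
# Keep a text only if the probability of its most likely topic
# reaches the threshold; return that (topic, probability) pair, else None.
def select_by_topic(topic_probs, threshold=0.6):
    topic, prob = max(topic_probs, key=lambda tp: tp[1])
    return (topic, prob) if prob >= threshold else None
```

For example, `select_by_topic([(0, 0.1), (3, 0.7), (5, 0.2)])` returns `(3, 0.7)`, while a text whose best topic probability is only 0.4 is rejected.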
Further, in step 7, a new data set is formed from the manually selected texts and the texts selected by the first LDA model, and the second LDA model is trained on it.
Further, in step 8, cosine similarity detection is performed on all remaining texts using the second LDA model and the previously selected corpus; if the highest similarity value between a text and any selected text exceeds a set threshold, the text is selected.
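The cosine-similarity screening in this step can be sketched as follows (the 0.8 default threshold is an assumed value, and the vectors in the usage example stand in for topic-space vectors produced by the second LDA model):

```python
import math

# Cosine similarity between two equal-length vectors.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Keep a candidate if its best similarity against the already
# selected corpus clears the threshold.
def passes_similarity(candidate, selected_vectors, threshold=0.8):
    return max(cosine(candidate, v) for v in selected_vectors) >= threshold
```

For instance, a candidate vector nearly parallel to a selected one passes, while an orthogonal candidate is rejected.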
Further, in step 9, the required text data is obtained according to the selected classification standard through three selections in total: manual selection, LDA topic selection, and cosine-similarity selection.
Beneficial effects:
Text data processed by the method is suitable for fast classification and screening even when the data volume exceeds ten million lines. A few thousand sentences conforming to the selected topic direction are manually selected; the LDA topic model then screens by topic similarity, selecting a portion of the comments that highly conform to the topic, so that a sufficiently large sample is available to train a relatively complete LDA classification model. Finally, similarity detection is performed between the remaining sentences and the sentences in the trained LDA model, and suitable data are selected. These three screenings overcome, to a certain extent, the disadvantages of unsupervised machine learning and improve the accuracy of screening and classification while guaranteeing speed. The method has the following advantages:
(1) the method is suitable for screening large-scale data, and the cost of manual marking can be saved;
(2) the similarity of the 'theme' between different texts can be effectively distinguished;
(3) the classification effect on short texts, particularly short comments, is excellent;
(4) the obtained text is suitable for various machine learning algorithms;
(5) the screening process can ensure the screening quality and greatly improve the screening speed.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, rather than all embodiments, and all other embodiments obtained by those skilled in the art without creative efforts based on the embodiments of the present invention belong to the protection scope of the present invention.
The core technical model used by the invention is an LDA topic classification model, and a series of steps and strategies are designed around the model for data screening. The main principle of the LDA model is as follows:
the LDA model is a three-layer Bayes Topic model, Topic information implied in a text is discovered through an unsupervised learning method, and the purpose is to discover implied semantic dimensions, namely 'Topic' or 'Concept', from the text by an unguided learning method. The essence of implicit semantic analysis is to use the co-occurrence features of terms (term) in text to find the Topic structure of text, and this method does not need any background knowledge about text. Implicit semantic representation of text can model linguistic phenomena of "ambiguous words" and "ambiguous words" such that search results obtained by a search engine system match the query of a user at a semantic level, rather than just intersecting the query at a lexical level.
In the case of two-class classification using LDA: a data set $D = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ of m samples in the vector space $\mathbb{R}^n$ is given, where $x_i$ is the n-dimensional vector of sample i and $y_i \in \{0, 1\}$. The invention defines $N_j$ ($j = 0, 1$) as the number of class-j samples, $X_j$ ($j = 0, 1$) as the set of class-j samples, $\mu_j$ ($j = 0, 1$) as the mean vector of the class-j samples, and $\Sigma_j$ ($j = 0, 1$) as the covariance matrix of the class-j samples.
$\mu_j$ and $\Sigma_j$ are respectively:

$$\mu_j = \frac{1}{N_j}\sum_{x \in X_j} x, \qquad \Sigma_j = \sum_{x \in X_j}(x - \mu_j)(x - \mu_j)^T \quad (j = 0, 1)$$
If the data are projected onto a line $\omega$, the projections of the two class centers on the line are $\omega^T\mu_0$ and $\omega^T\mu_1$ respectively. The invention hopes that the projected points of same-class data are as close as possible, i.e. that the projected covariances $\omega^T\Sigma_0\,\omega$ and $\omega^T\Sigma_1\,\omega$ of same-class samples are as small as possible, so the optimization objective of the invention is:

$$\max_{\omega}\; J(\omega) = \frac{\|\omega^T\mu_0 - \omega^T\mu_1\|_2^2}{\omega^T\Sigma_0\,\omega + \omega^T\Sigma_1\,\omega}$$
The within-class scatter matrix $S_w$ is generally defined as:

$$S_w = \Sigma_0 + \Sigma_1$$
and the between-class scatter matrix $S_b$ as:

$$S_b = (\mu_0 - \mu_1)(\mu_0 - \mu_1)^T$$
so the optimization objective can be rewritten as:

$$J(\omega) = \frac{\omega^T S_b\,\omega}{\omega^T S_w\,\omega}$$
Thus, by using the Lagrange multiplier method, the eigenvector problem is obtained:

$$S_w^{-1} S_b\,\omega = \lambda\,\omega$$

whose solution in the two-class case is $\omega = S_w^{-1}(\mu_0 - \mu_1)$.
This is a form of the generalized Rayleigh quotient; for the two-class samples, the optimal projection direction $\omega$ can be determined simply by computing the means and covariances of the original samples.
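The closed-form solution above, $\omega = S_w^{-1}(\mu_0 - \mu_1)$, can be checked numerically. This is a pure-Python two-dimensional sketch; the sample points in the usage example are made up for illustration:

```python
# Two-class Fisher projection in 2-D: w = S_w^{-1} (mu0 - mu1).
def mean(xs):
    n = len(xs)
    return [sum(x[i] for x in xs) / n for i in range(2)]

def scatter(xs, mu):
    # per-class scatter contribution: sum (x - mu)(x - mu)^T
    s = [[0.0, 0.0], [0.0, 0.0]]
    for x in xs:
        d = [x[0] - mu[0], x[1] - mu[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_direction(class0, class1):
    mu0, mu1 = mean(class0), mean(class1)
    s0, s1 = scatter(class0, mu0), scatter(class1, mu1)
    sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]
    # invert the 2x2 within-class scatter matrix directly
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[ sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det,  sw[0][0] / det]]
    diff = [mu0[0] - mu1[0], mu0[1] - mu1[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]
```

Projecting both classes onto the returned direction separates their projected points, as the derivation predicts.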
In the multi-class case, the low-dimensional space projected onto is no longer a line but a hyperplane. Assuming the low-dimensional space projected onto by the invention has dimension d, with basis vectors $(\omega_1, \omega_2, \ldots, \omega_d)$ forming an $n \times d$ matrix $W$, the optimization objective of the invention should be written as:

$$J(W) = \frac{\operatorname{tr}(W^T S_b W)}{\operatorname{tr}(W^T S_w W)}$$

where $W$ is the matrix formed by the low-dimensional-space basis vectors, $W \in \mathbb{R}^{n \times d}$ with $d \le N - 1$, and N is the number of sample classes.
The LDA text classification screening method depends on word-vector theory. When words are associated with vectors, each word in an article (or a collection of articles) is generally considered to obey a probability distribution $\vec p$; this distribution is called the word's prior distribution. For example, the frequency of occurrence of the word "network" is closely related to the frequency of occurrence of the word "neural" in the relevant literature. Thus, given the word distribution $\vec p$, the probability of generating the corpus W is $p(W \mid \vec p) = \prod_{k=1}^{v} p_k^{n_k}$, where the word counts $n_k$ of the corpus satisfy a multinomial distribution. The overall probability of generating the corpus is obtained by integrating over $\vec p$:

$$p(W) = \int p(W \mid \vec p)\, p(\vec p)\, d\vec p$$
When calculating the prior probability, note that the multinomial distribution and the Dirichlet distribution are conjugate, so a Dirichlet distribution can be adopted as the prior:

$$p(\vec p) = \mathrm{Dir}(\vec p \mid \vec\alpha)$$

From the fact that the multinomial distribution and the Dirichlet distribution are conjugate, one obtains the posterior:

$$p(\vec p \mid W) = \mathrm{Dir}(\vec p \mid \vec\alpha + \vec n)$$
According to the above formula, knowing the posterior distribution, either its maximum point or the mean of the parameter under the posterior can be used as an estimate of $\vec p$; the posterior mean gives

$$\hat p_k = \frac{n_k + \alpha_k}{\sum_{i=1}^{v}(n_i + \alpha_i)}$$

For a corpus, the words with higher estimated probability $\hat p_k$ can be grouped into a "cluster center", i.e. a topic of the text; here v is the number of words and k is the index of one word.
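A small sketch of the posterior-mean estimate just described; the counts and the symmetric prior value in the usage example are illustrative assumptions:

```python
# Posterior-mean estimate of a word distribution under a symmetric
# Dirichlet prior: p_k = (n_k + alpha) / sum_i (n_i + alpha).
def dirichlet_posterior_mean(counts, alpha=0.1):
    total = sum(counts) + alpha * len(counts)
    return [(n + alpha) / total for n in counts]
```

For counts like `[5, 3, 0, 2]`, the estimates sum to 1, the most frequent word gets the highest probability, and an unseen word still gets a small non-zero probability thanks to the prior.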
According to one embodiment of the invention, the data needs to be preprocessed before the text is classified and screened by using the method. For general web text data, there is much useless information (such as links and emoticons), and rough cleaning is required. Several steps may be used:
1. sentences larger than 10 words are selected.
2. Punctuation marks, mis-encoded characters, and characters other than English letters and digits are removed.
3. Repairing grammatical problems, repairing misspellings of words, and repairing spoken words.
4. The space and indentation problems are repaired.
5. Abnormal characters are repaired (e.g. common meaningless tokens such as quote, amp and similar escape artifacts).
6. And carrying out rough cleaning by using a bag-of-words model, and selecting a text sentence with high topic weight.
In the text screening process, short sentences should be kept in preference to long ones: a long sentence contains more words, so the weight of the whole sentence may be high because of its word count rather than because the desired topic carries high weight, while in fact being unrelated to the desired topic. At the same time, sentences that are too short have a low matching utilization rate and should be deleted during screening.
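The rough-cleaning steps above can be sketched as follows; the exact regular expressions, the strict 10-word minimum and the upper length cap are assumptions consistent with the text's preference for short sentences:

```python
import re

# Rough cleaning: keep only English letters, digits and spaces,
# then collapse whitespace (repairing space/indentation problems).
def clean(sentence):
    s = re.sub(r"[^A-Za-z0-9\s]", " ", sentence)
    s = re.sub(r"\s+", " ", s).strip()
    return s

# Keep sentences longer than min_words but not excessively long,
# since overly long sentences can accumulate topic weight spuriously.
def keep(sentence, min_words=10, max_words=40):
    n = len(clean(sentence).split())
    return min_words < n <= max_words
```

For example, `clean("Hello,   world!! :)")` yields `"Hello world"`, and a three-word sentence is rejected by `keep`.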
According to an embodiment of the invention, when processing data, the characteristics of the network language need to be paid additional attention, and therefore, some manual screening skills are provided:
1. only one entry should be kept for repeated sentences. There are a large number of repetitive languages in the web language, and for sentences describing the same thing, it is considered repetitive when the half words of the sentence are the same.
2. Abbreviations and shorthand content should be expanded. Because of the brevity of web language, many people give abbreviated representations when expressing themselves in spoken language, such as B&W (black and white) and FAV (favorite); these must be found and replaced manually.
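The abbreviation expansion described above can be sketched with a lookup table; the table contents come from the examples in the text, while the matching details (case-insensitive literal substitution) are assumptions:

```python
import re

# Expand common spoken-language abbreviations before screening.
ABBREVIATIONS = {"B&W": "black and white", "FAV": "favorite"}

def expand_abbreviations(text, table=ABBREVIATIONS):
    for abbr, full in table.items():
        # literal, case-insensitive replacement of each abbreviation
        text = re.sub(re.escape(abbr), full, text, flags=re.IGNORECASE)
    return text
```

For example, `expand_abbreviations("my FAV movie is in B&W")` returns `"my favorite movie is in black and white"`.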
According to one embodiment of the present invention, the screening data using LDA is implemented as follows:
1. and preparing a data set to be screened, wherein each document is a single line, and traversal processing is facilitated.
2. About 800 to 1000 texts that meet the selected topic requirements are manually selected from it, keeping words and sentences in order.
3. And establishing a dictionary and an index for each word by using the selected text.
4. Each text is vectorized with a bag-of-words model, a document representation method commonly used in the field of information retrieval. The model treats a document only as a collection of words; each word's occurrence in the document is independent of whether other words occur, i.e. any word at any position is chosen independently, without being influenced by the semantics of the document. The vectorized data is then used to compute a word-frequency matrix, the document-topic (DT) matrix.
5. The number of topics into which the documents are to be classified is set, and the DT matrix is used to train the first LDA model. First, the parameters of the topic distribution are drawn from a Dirichlet distribution, the topic distribution of a text is generated, and a topic is randomly generated at each position of the text according to that distribution; then the parameters of the word distribution are drawn from a Dirichlet distribution to obtain each topic's word distribution, and a word is randomly generated at each position according to the word distribution of its topic, until the last position of the text is reached and the whole text has been generated; finally, the process is repeated to generate all texts.
6. For the unselected texts, the trained first LDA model performs topic judgment and gives the probability that each text belongs to each topic. If, for a sentence, the probability of its most likely topic under the first LDA judgment exceeds the set threshold, the text is selected.
7. A new data set is formed from the previously manually selected texts and the texts selected by the first LDA model, and the second LDA model is retrained on it.
8. For all remaining texts, cosine similarity detection is performed using the second LDA model and the selected corpus; if the highest similarity value between a text and any selected text exceeds the set threshold, the text is selected.
9. In this way, the required text data is screened out according to the selected classification standard through three selections in total: manual selection, first-LDA topic selection, and cosine-similarity selection.
According to an embodiment of the present invention, the specific code processing logic of the method proposed by the present invention is as follows:
The programming language used in the invention is Python, with the nltk and gensim modules implementing the main functions.
(1) And converting json original data obtained by the crawler into a list, and changing the list into a data structure which can be processed.
(2) Data cleaning: several regular-expression functions are first written to remove mis-encoded characters, non-English words, abbreviations and the like from the text sentences. A commonly used stop-word list is then loaded into a list, and each word of each text is traversed and replaced with the replace() function, finally yielding cleaner data.
(3) The topic to be screened is determined.
(4) From this huge data set, 800 pieces of high-quality text meeting the topic requirements are manually selected.
(5) This yields data suitable for screening with the LDA model. Each line of the data is treated as one document string and loaded into a list, forming a large document list.
(6) The corpora.Dictionary class of the gensim module is used to create a dictionary of the corpus's words, drawn from all words that appear in the documents; each unique word is assigned an index.
(7) Doc2bow () traversal is used to turn the document list into a word vector matrix, otherwise known as a DT matrix.
(8) The LDA model is initialized with gensim.models.LdaModel, passing the DT matrix and the number of topic types to be classified; preferably, the number of types is 7 (experimental results show that classifying into 7 topics gives the most obvious distinction).
(9) The texts other than the 800 manually screened ones are converted into word vectors in the same way, and the LdaModel judges the probability that each text belongs to each topic; a text is selected if the probability of its most likely topic reaches the threshold.
(10) Through the steps in (9), a data set containing more sentences can be obtained. A second LDA model is retrained using this data set.
(11) The similarities.MatrixSimilarity() function is used to convert the query corpus into the LDA vector space and build an index for each document/sentence in it.
(12) Using:

sims = index[lda_model[word_vector]]
result = [(documents[i], s) for i, s in enumerate(sims)]

the cosine similarity between the word vector of any text in the document list and the most similar word vector in the DT matrix can be obtained; if the value is greater than the threshold, the text data is selected.
(13) In summary, there are three screenings: first the manual screening, second the first-LDA topic screening, and third the second-LDA vector-space topic-similarity detection screening. This layer-by-layer progressive screening uses a small amount of manual screening plus the features learned by the LDA model to improve screening quality while guaranteeing screening speed.
Portions of the invention not described in detail are well within the skill of the art.
Although illustrative embodiments of the present invention have been described above to facilitate the understanding of the present invention by those skilled in the art, it should be understood that the present invention is not limited to the scope of the embodiments, but various changes may be apparent to those skilled in the art, and it is intended that all inventive concepts utilizing the inventive concepts set forth herein be protected without departing from the spirit and scope of the present invention as defined and limited by the appended claims.
Claims (10)
1. A text classification screening method using LDA is characterized by comprising the following steps:
(1) acquiring a data set, wherein the content comprises a plurality of short sentences;
(2) preprocessing the data by using a natural language processing method, and cleaning and sorting the data;
(3) determining a theme, and manually selecting a plurality of text sentences which accord with the theme;
(4) establishing a corresponding text vector matrix by using a bag-of-words model by using the selected text sentences;
(5) training a first LDA model by using the vector matrix;
(6) screening the remaining sentences in the text by using the first LDA model: calculating the correlation between each sentence and the topic words obtained from the first LDA model, and using a threshold on that correlation to evaluate whether the sentence fits the selected topic;
(7) adding texts screened by topic relevance, and training a second LDA model;
(8) judging and screening the remaining sentences in the text by cosine similarity by using the second LDA model;
(9) taking the sentences that pass all three screenings (manual screening, topic-similarity screening and cosine-similarity screening) as the text data conforming to the screening target.
2. The method as claimed in claim 1, wherein the step 2 of preprocessing the data comprises:
selecting sentences longer than 10 words; removing punctuation marks, mis-encoded characters, and characters other than English letters and digits; repairing grammatical problems, misspelled words and colloquial vocabulary; repairing space and indentation problems; repairing abnormal characters. The cleaning and sorting comprises rough cleaning with a bag-of-words model and selecting text sentences with high topic weight.
3. The method as claimed in claim 1, wherein the step 3 of manually selecting the text sentences conforming to the topic comprises:
only one copy of a repeated sentence should be kept, and for sentences describing the same thing, two sentences are considered repetitive when half of their words are the same;
abbreviations and shorthand should be expanded; representations abbreviated in spoken-language expression must be found and replaced manually.
4. The method as claimed in claim 1, wherein in step 3, the data set to be screened is arranged with one document per line, 800 to 1000 texts conforming to the requirements of the selected topic are manually selected from it, and a dictionary and an index are built for each word of the selected texts.
5. The method as claimed in claim 1, wherein in step 4, each text is vectorized with a bag-of-words model, which treats the text only as a collection of words, each word occurring independently of the others; the vectorized data is then used to compute a word-frequency matrix, the document-topic (DT) matrix.
6. The method as claimed in claim 1, wherein in step 5, the number of topics into which the documents are to be classified is set, and the DT matrix is used to train the first LDA model: first, the parameters of the topic distribution are drawn from a Dirichlet distribution, the topic distribution of a text is generated, and a topic is randomly generated at each position of the text according to that distribution; then the parameters of the word distribution are drawn from a Dirichlet distribution to obtain each topic's word distribution, and a word is randomly generated at each position according to the word distribution of its topic, until the last position of the text is reached and the whole text has been generated; finally, the process is repeated to generate all texts.
7. The method as claimed in claim 1, wherein in step 6, the trained first LDA model performs topic judgment on each unselected text and gives the probability that the text belongs to each topic; if, for a sentence, the probability of its most likely topic under the LDA judgment exceeds a set threshold, the text is selected.
8. The method as claimed in claim 1, wherein in step 7, a new data set is formed from the previously manually selected texts and the texts selected by the first LDA model, and the second LDA model is retrained on it.
9. The method as claimed in claim 1, wherein in step 8, cosine similarity detection is performed on all remaining texts using the second LDA model and the previously selected corpus; if the highest similarity value between a text and any selected text exceeds a preset threshold, the text is selected.
10. The method as claimed in claim 1, wherein in step 9, the required text data is obtained according to the selected classification standard through three selections in total: manual selection, LDA topic selection, and cosine-similarity selection.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011123125.5A CN112667806A (en) | 2020-10-20 | 2020-10-20 | Text classification screening method using LDA |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112667806A true CN112667806A (en) | 2021-04-16 |
Family
ID=75403286
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011123125.5A Pending CN112667806A (en) | 2020-10-20 | 2020-10-20 | Text classification screening method using LDA |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112667806A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113887584A (en) * | 2021-09-16 | 2022-01-04 | 同济大学 | Emergency traffic strategy evaluation method based on social media data |
CN115658866A (en) * | 2022-10-27 | 2023-01-31 | 国网山东省电力公司烟台供电公司 | Text continuous writing method capable of self-adaptive input, storage medium and device |
CN116307792A (en) * | 2022-10-12 | 2023-06-23 | 广州市阿尔法软件信息技术有限公司 | Urban physical examination subject scene-oriented evaluation method and device |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106844424A (en) * | 2016-12-09 | 2017-06-13 | 宁波大学 | Text classification method based on LDA
CN107122349A (en) * | 2017-04-24 | 2017-09-01 | 无锡中科富农物联科技有限公司 | Text feature-word extraction method based on a word2vec-LDA model
CN107291688A (en) * | 2017-05-22 | 2017-10-24 | 南京大学 | Topic-model-based similarity analysis method for judgment documents
CN107609121A (en) * | 2017-09-14 | 2018-01-19 | 深圳市玛腾科技有限公司 | News text classification method based on LDA and word2vec algorithms
US20180032600A1 (en) * | 2016-08-01 | 2018-02-01 | International Business Machines Corporation | Phenomenological semantic distance from latent Dirichlet allocation (LDA) classification
CN108280164A (en) * | 2018-01-18 | 2018-07-13 | 武汉大学 | Short text filtering and classification method based on category-related words
CN108664633A (en) * | 2018-05-15 | 2018-10-16 | 南京大学 | Text classification method using diversified text features
CN109376347A (en) * | 2018-10-16 | 2019-02-22 | 北京信息科技大学 | HSK composition generation method based on a topic model
CN110895656A (en) * | 2018-09-13 | 2020-03-20 | 武汉斗鱼网络科技有限公司 | Text similarity calculation method and device, electronic equipment and storage medium
2020-10-20: Application CN202011123125.5A filed in China (publication CN112667806A); legal status: Pending
Non-Patent Citations (3)
Title |
---|
MIHA PAVLINEK ET AL.: "Text classification method based on self-training and LDA topic models", Expert Systems with Applications, vol. 80, 8 March 2017 (2017-03-08), pages 83 - 93, XP029974861, DOI: 10.1016/j.eswa.2017.03.020 * |
YANG RUIXIN: "Research on LDA-based short-text clustering algorithms for microblog comments", China Master's Theses Full-text Database, Information Science and Technology, no. 8, 15 August 2020 (2020-08-15), pages 138 - 843 * |
WANG SHENG ET AL.: "Domain label acquisition method based on SL-LDA", Computer Science, vol. 47, no. 11, 21 July 2020 (2020-07-21), pages 95 - 100 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113887584A (en) * | 2021-09-16 | 2022-01-04 | 同济大学 | Emergency traffic strategy evaluation method based on social media data |
CN113887584B (en) * | 2021-09-16 | 2022-07-05 | 同济大学 | Emergency traffic strategy evaluation method based on social media data |
CN116307792A (en) * | 2022-10-12 | 2023-06-23 | 广州市阿尔法软件信息技术有限公司 | Urban physical examination subject scene-oriented evaluation method and device |
CN116307792B (en) * | 2022-10-12 | 2024-03-12 | 广州市阿尔法软件信息技术有限公司 | Urban physical examination subject scene-oriented evaluation method and device |
CN115658866A (en) * | 2022-10-27 | 2023-01-31 | 国网山东省电力公司烟台供电公司 | Adaptive-input text continuation method, storage medium and device |
CN115658866B (en) * | 2022-10-27 | 2024-03-12 | 国网山东省电力公司烟台供电公司 | Adaptive-input text continuation method, storage medium and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108304468B (en) | Text classification method and text classification device | |
CN108446271B (en) | Text emotion analysis method of convolutional neural network based on Chinese character component characteristics | |
CN109002473B (en) | Emotion analysis method based on word vectors and parts of speech | |
CN112667806A (en) | Text classification screening method using LDA | |
Jungiewicz et al. | Towards textual data augmentation for neural networks: synonyms and maximum loss | |
Kalaivani et al. | Feature reduction based on genetic algorithm and hybrid model for opinion mining | |
CN113704416B (en) | Word sense disambiguation method and device, electronic equipment and computer-readable storage medium | |
CN112364628B (en) | New word recognition method and device, electronic equipment and storage medium | |
Ashok et al. | A personalized recommender system using Machine Learning based Sentiment Analysis over social data | |
CN111191031A (en) | Entity relation classification method of unstructured text based on WordNet and IDF | |
CN111859961A (en) | Text keyword extraction method based on improved TopicRank algorithm | |
CN112905736A (en) | Unsupervised text emotion analysis method based on quantum theory | |
Fauziah et al. | Lexicon based sentiment analysis in Indonesia languages: A systematic literature review | |
Shang et al. | Improved feature weight algorithm and its application to text classification | |
CN113486143A (en) | User portrait generation method based on multi-level text representation and model fusion | |
Ueno et al. | A spoiler detection method for japanese-written reviews of stories | |
Vīksna et al. | Sentiment analysis in Latvian and Russian: A survey | |
CN112990388B (en) | Text clustering method based on concept words | |
Hidayat et al. | Feature-Rich Classifiers for Recognizing Textual Entailment in Indonesian | |
CN112613318B (en) | Entity name normalization system, method thereof and computer readable medium | |
CN115269833A (en) | Event information extraction method and system based on deep semantics and multitask learning | |
CN115309899A (en) | Method and system for identifying and storing specific content in text | |
CN114579729A (en) | FAQ question-answer matching method and system fusing multi-algorithm model | |
CN114186560A (en) | Chinese word meaning disambiguation method based on graph convolution neural network fusion support vector machine | |
CN113780832A (en) | Public opinion text scoring method and device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||