CN112667806A - Text classification screening method using LDA


Info

Publication number
CN112667806A
Authority
CN
China
Prior art keywords
text, sentences, topic, LDA, model
Prior art date
2020-10-20
Legal status
Pending
Application number
CN202011123125.5A
Other languages
Chinese (zh)
Inventor
赵博
吕建文
周兴晖
陈力
薛柔月
金鑫
蒋尚秀
Current Assignee
Shanghai Golden Bridge Info Tech Co ltd
Original Assignee
Shanghai Golden Bridge Info Tech Co ltd
Priority date
2020-10-20
Filing date
2020-10-20
Publication date
2021-04-16
Application filed by Shanghai Golden Bridge Info Tech Co ltd filed Critical Shanghai Golden Bridge Info Tech Co ltd
Priority to CN202011123125.5A
Publication of CN112667806A
Legal status: Pending


Abstract

The invention provides a text classification screening method using LDA, which comprises the following steps: acquiring a data set whose content comprises a plurality of short sentences; preprocessing the data with natural language processing methods, cleaning and sorting it; determining a topic and manually selecting a plurality of text sentences that match the topic; building the corresponding text vector matrix from the selected sentences with a bag-of-words model; training a first LDA model with the vector matrix; screening the remaining sentences in the text with the first LDA model, computing the correlation between the text set and the topic words produced by the first LDA topic calculation, and using that correlation against a threshold to evaluate whether a sentence matches the selected topic model; adding the texts passed by the topic-relevance screen and training a second LDA model; judging and screening the still-remaining sentences by cosine similarity with the second LDA model; and taking the sentences kept by the three screenings in total as the text data matching the screening target.

Description

Text classification screening method using LDA
Technical Field
The invention relates to the field of natural language processing. It can effectively screen sentences that match a selected topic, prepare data sets for various machine learning algorithms, or classify texts.
Background
Machine learning is being applied in an ever wider range of fields. However, a model that processes natural language often needs to be trained for a preset, specialized topic, and training it requires a manually labeled data set to guarantee model quality. In many cases no labeled data is readily available, so how to supply the model with data of the highest possible quality becomes a pressing concern.
Model training cannot do without data, but often there is not enough of it (the data quality is too low, or the monetary cost of labeling is too high). The industry has therefore proposed so-called unsupervised learning, but it is still rarely used; far more often, additional training samples are simply collected.
Disclosure of Invention
The technical problem solved by the invention is as follows: a text classification screening method using LDA (Latent Dirichlet Allocation) is provided. Faced with text data, a small amount of manually selected or labeled data is used; the features of that data are extracted to train a classification model, and the classification model is used to screen and classify the remaining data, so that text data of different topics can be classified at lower cost and higher speed. The method is characterized in that a small amount of data meeting the topic requirements is selected manually, after which an LDA model extracts the features of that data in order to screen the rest quickly.
The technical scheme of the invention is a text classification screening method using LDA, which comprises the following steps:
(1) acquiring a data set whose content comprises a plurality of short sentences;
(2) preprocessing the data with natural language processing methods, cleaning and sorting it;
(3) determining a topic, and manually selecting a plurality of text sentences that match the topic;
(4) building the corresponding text vector matrix from the selected text sentences with a bag-of-words model;
(5) training a first LDA model with the vector matrix;
(6) screening the remaining sentences in the text with the first LDA model: computing the correlation between the text set and the topic words produced by the first LDA topic calculation, and using that correlation against a threshold to evaluate whether a sentence matches the selected topic model;
(7) adding the texts passed by the topic-relevance screen, and training a second LDA model;
(8) judging and screening the remaining sentences in the text by cosine similarity with the second LDA model;
(9) taking the sentences kept by the three screenings in total (manual screening, topic-similarity screening, and cosine-similarity screening) as the text data matching the screening target.
Further, in step 2, preprocessing the data comprises:
selecting sentences longer than 10 words; removing punctuation marks, garbled characters, and all characters other than English letters and digits; repairing grammatical problems, word misspellings, and colloquial vocabulary; repairing space and indentation problems; and repairing abnormal characters. The cleaning and sorting comprises rough cleaning with a bag-of-words model and selecting the text sentences with high topic weight.
Further, in step 3, manually selecting a plurality of text sentences that match the topic comprises: keeping only one copy of repeated sentences, two sentences describing the same thing being considered repetitive when half of their words are the same;
expanding abbreviations and shorthand content, since spoken-language expression produces abbreviated forms of some representations, which require manual discovery and replacement.
Further, in step 3, the data set to be screened is prepared with one document per line, and 800 to 1000 well-formed sentences that meet the requirements of the selected topic are manually selected from it; a dictionary and an index are then built for every word of the selected text.
Further, in step 4, each document is vectorized with a bag-of-words model. The model treats a document only as a collection of words, each word's occurrence in the document being independent of whether any other word occurs. The vectorized data is then used to compute a word-frequency matrix, i.e., the document-term (DT) matrix.
Further, in step 5, the number of topics into which the documents are to be classified is set, and the DT matrix is used to train the first LDA model: the parameters of the topic distribution are first drawn from a Dirichlet distribution, the topic distribution of a text is randomly generated, and a topic is randomly generated at each position of the text according to that topic distribution; the parameters of the word distribution are then drawn from a Dirichlet distribution to obtain the word distribution of each topic, and a word is randomly generated at each position according to its topic's word distribution, until the last position of the text is reached and the whole text is generated; finally, the process is repeated to generate all texts.
Further, in step 6, the trained first LDA model performs topic judgment on the unselected texts, giving the probability that each text belongs to each topic; if, under the LDA judgment, the highest probability of a sentence belonging to some topic exceeds a set threshold, the text is selected.
Further, in step 7, a new data set is formed from the manually selected texts and the texts selected by the first LDA model, and the second LDA model is trained on it.
Further, in step 8, cosine similarity detection is performed on all the remaining texts with the second LDA model and the previously selected corpus; if the highest similarity value between a text and any selected text exceeds a set threshold, the text is selected.
Further, in step 9, the required text data is thus screened according to the chosen classification standard by three selections in total: manual selection, LDA topic selection, and cosine-similarity selection.
Advantageous effects:
the text data processed by the method can be suitable for being classified and screened quickly when the data volume reaches more than ten million lines. Thousands of sentences which accord with the selected theme direction are manually selected, the LDA is used for theme similarity screening, the LDA theme model is used for selecting a part of comment pairs which highly accord with the theme, and therefore a sample which is large enough is used for training a relatively perfect LDA classification model. And finally, carrying out similarity detection on the rest sentences and the sentences in the trained LDA model, and selecting proper data. Through the three-time screening, the disadvantage of unsupervised machine learning is overcome to a certain extent, and the accuracy of screening and classification is improved under the condition of ensuring the speed. Has the following advantages:
(1) it is suitable for screening large-scale data and saves the cost of manual labeling;
(2) it can effectively distinguish the topic similarity between different texts;
(3) its classification effect on short texts, especially short comments, is excellent;
(4) the texts obtained are suitable for various machine learning algorithms;
(5) the screening process guarantees screening quality while greatly improving screening speed.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below. Obviously, the described embodiments are only some of the embodiments of the present invention rather than all of them; all other embodiments obtained by those skilled in the art without creative effort on the basis of these embodiments fall within the protection scope of the present invention.
The core technical model used by the invention is an LDA topic classification model, and a series of steps and strategies are designed around the model for data screening. The main principle of the LDA model is as follows:
the LDA model is a three-layer Bayes Topic model, Topic information implied in a text is discovered through an unsupervised learning method, and the purpose is to discover implied semantic dimensions, namely 'Topic' or 'Concept', from the text by an unguided learning method. The essence of implicit semantic analysis is to use the co-occurrence features of terms (term) in text to find the Topic structure of text, and this method does not need any background knowledge about text. Implicit semantic representation of text can model linguistic phenomena of "ambiguous words" and "ambiguous words" such that search results obtained by a search engine system match the query of a user at a semantic level, rather than just intersecting the query at a lexical level.
In the two-class case of LDA: given a data set

$$D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$$

of $m$ samples in the vector space $R^n$, where $x_i$ is the $n$-dimensional vector of sample $i$ and $y_i \in \{0, 1\}$, the invention defines $N_j$ ($j = 0, 1$) as the number of samples of class $j$, $X_j$ ($j = 0, 1$) as the set of samples of class $j$, $\mu_j$ ($j = 0, 1$) as the mean vector of class $j$, and $\Sigma_j$ ($j = 0, 1$) as the covariance matrix of class $j$.

$\mu_j$ and $\Sigma_j$ are respectively:

$$\mu_j = \frac{1}{N_j} \sum_{x \in X_j} x \qquad (j = 0, 1)$$

$$\Sigma_j = \sum_{x \in X_j} (x - \mu_j)(x - \mu_j)^T \qquad (j = 0, 1)$$

If the data are projected onto a line $\omega$, the projections of the two class centers onto the line are $\omega^T \mu_0$ and $\omega^T \mu_1$ respectively. The invention wants the projected points of same-class data to lie as close together as possible, i.e., the projected covariances of same-class samples, $\omega^T \Sigma_0 \omega$ and $\omega^T \Sigma_1 \omega$, should be as small as possible, while the projected class centers lie as far apart as possible. The optimization objective of the invention is therefore:

$$J(\omega) = \frac{\lVert \omega^T \mu_0 - \omega^T \mu_1 \rVert_2^2}{\omega^T \Sigma_0 \omega + \omega^T \Sigma_1 \omega}$$

The within-class scatter matrix $S_w$ is generally defined as:

$$S_w = \Sigma_0 + \Sigma_1 = \sum_{x \in X_0} (x - \mu_0)(x - \mu_0)^T + \sum_{x \in X_1} (x - \mu_1)(x - \mu_1)^T$$

and the between-class scatter matrix $S_b$ as:

$$S_b = (\mu_0 - \mu_1)(\mu_0 - \mu_1)^T$$

The optimization objective can then be rewritten as:

$$J(\omega) = \frac{\omega^T S_b \omega}{\omega^T S_w \omega}$$

Applying the Lagrange multiplier method yields the eigenvector problem:

$$S_w^{-1} S_b\, \omega = \lambda \omega, \quad \text{i.e.} \quad \omega \propto S_w^{-1} (\mu_0 - \mu_1)$$

This is a form of the generalized Rayleigh quotient: for two-class samples, the optimal projection direction $\omega$ is determined simply by the means and covariances of the original samples.

In the multi-class case, the projection target is no longer a line but a low-dimensional subspace (a hyperplane). Assuming the projected space has dimension $d$ with basis vectors $(\omega_1, \omega_2, \ldots, \omega_d)$ assembled into an $n \times d$ matrix $W$, the optimization objective of the invention should be written as:

$$J(W) = \frac{\prod_{\mathrm{diag}} W^T S_b W}{\prod_{\mathrm{diag}} W^T S_w W}$$

where $\prod_{\mathrm{diag}}$ denotes the product of the diagonal elements of a matrix, $W$ is the matrix formed by the low-dimensional basis vectors, and $W \in R^{n \times d}$ with $d \le N - 1$, where $N$ is the number of sample classes.
The LDA text classification screening method depends on word-vector theory. In associating words with vectors, each word in an article (or a collection of articles) is generally considered to obey a probability distribution $\vec{p}$ over the vocabulary, and this distribution is itself given a prior distribution $p(\vec{p})$. For example, in the relevant literature the frequency of the word "network" is closely related to the frequency of the word "neural". Given $\vec{p}$, the probability of generating the corpus $W$ is

$$p(W \mid \vec{p}) = \prod_{k=1}^{V} p_k^{n_k},$$

where the word counts $n_k$ of the corpus $W$ follow a multinomial distribution. The overall probability of generating the corpus is obtained by integrating over all $\vec{p}$:

$$p(W) = \int p(W \mid \vec{p})\, p(\vec{p})\, d\vec{p}.$$

When computing this, the prior $p(\vec{p})$ must be specified. Considering that the multinomial distribution and the Dirichlet distribution are conjugate, a Dirichlet prior can be adopted:

$$\mathrm{Dir}(\vec{p} \mid \vec{\alpha}) = \frac{1}{\Delta(\vec{\alpha})} \prod_{k=1}^{V} p_k^{\alpha_k - 1},$$

where $\Delta(\vec{\alpha})$ is just the normalization factor, namely:

$$\Delta(\vec{\alpha}) = \int \prod_{k=1}^{V} p_k^{\alpha_k - 1}\, d\vec{p} = \frac{\prod_{k=1}^{V} \Gamma(\alpha_k)}{\Gamma\!\bigl(\sum_{k=1}^{V} \alpha_k\bigr)}.$$

From the conjugacy of the multinomial and Dirichlet distributions, one obtains the posterior:

$$p(\vec{p} \mid W, \vec{\alpha}) = \mathrm{Dir}(\vec{p} \mid \vec{\alpha} + \vec{n}).$$

With the posterior distribution known, either its maximum point or the posterior mean of the parameter can be used as an estimate of $\vec{p}$; taking the posterior mean gives

$$\hat{p}_k = \frac{n_k + \alpha_k}{\sum_{i=1}^{V} (n_i + \alpha_i)}.$$

For a corpus, the words with the highest estimated $\hat{p}_k$ can be grouped into a "cluster center", i.e., a topic of the text; here $V$ is the number of words and $k$ is the index of one of them.
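For instance (an illustrative calculation, not from the original text), with a three-word vocabulary, observed counts $\vec{n} = (5, 3, 2)$ and a symmetric prior $\vec{\alpha} = (1, 1, 1)$, the posterior is $\mathrm{Dir}(\vec{p} \mid 6, 4, 3)$ and the posterior-mean estimate is

$$\hat{p} = \left(\tfrac{6}{13}, \tfrac{4}{13}, \tfrac{3}{13}\right),$$

since $\sum_i (n_i + \alpha_i) = 10 + 3 = 13$.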
According to one embodiment of the invention, the data needs to be preprocessed before the text is classified and screened by using the method. For general web text data, there is much useless information (such as links and emoticons), and rough cleaning is required. Several steps may be used:
1. Sentences longer than 10 words are selected.
2. Punctuation marks are removed, garbled characters are removed, and all characters other than English letters and digits are removed.
3. Grammatical problems, word misspellings, and colloquial vocabulary are repaired.
4. Space and indentation problems are repaired.
5. Abnormal characters are repaired (e.g., common meaningless tokens such as "quot" and "amp" left over from HTML entities).
6. Rough cleaning is carried out with a bag-of-words model, and text sentences with high topic weight are selected.
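A minimal sketch of the rough cleaning above (illustrative only, not from the original disclosure; the regular expressions and the 10-word limit merely restate steps 1, 2, 4 and 5, the function name is an assumption, and steps 3 and 6 — grammar repair and bag-of-words weighting — are omitted):

    import re

    def rough_clean(sentence: str):
        # Step 2: keep only English letters, digits and whitespace.
        sentence = re.sub(r"[^A-Za-z0-9\s]", " ", sentence)
        # Step 5: drop HTML-entity residue such as "quot" and "amp".
        sentence = re.sub(r"\b(quot|amp|lt|gt)\b", " ", sentence)
        # Step 4: collapse runs of spaces and indentation into single spaces.
        sentence = re.sub(r"\s+", " ", sentence).strip()
        # Step 1: keep only sentences longer than 10 words.
        return sentence if len(sentence.split()) > 10 else None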
In the text screening process, short sentences should be kept in preference to long ones: a long sentence contains more words, so the whole sentence may receive a high weight because of its word count rather than because the desired topic is strongly present, while in fact being unrelated to that topic. At the same time, sentences that are too short match poorly and should be deleted during screening.
According to an embodiment of the invention, the characteristics of web language need extra attention when processing the data, so some manual screening techniques are provided:
1. only one entry should be kept for repeated sentences. There are a large number of repetitive languages in the web language, and for sentences describing the same thing, it is considered repetitive when the half words of the sentence are the same.
2. Abbreviations and shorthand contents should be expanded. Due to the simplicity of web languages, many people when using spoken language to express them give some abbreviated representations, such as B & W (Black and white) and FAV (Favorite), require manual discovery and replacement.
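An illustrative sketch of the two techniques above; reading "half the words are the same" as word-set overlap, and the abbreviation table, are assumptions for demonstration:

    def is_duplicate(a: str, b: str) -> bool:
        # Two sentences describing the same thing count as repetitive
        # when half of the words of either sentence appear in the other.
        words_a, words_b = set(a.lower().split()), set(b.lower().split())
        if not words_a or not words_b:
            return False
        overlap = len(words_a & words_b)
        return overlap * 2 >= len(words_a) or overlap * 2 >= len(words_b)

    # Abbreviations discovered manually are expanded by direct replacement.
    ABBREVIATIONS = {"B&W": "black and white", "FAV": "favorite"}

    def expand_abbreviations(sentence: str) -> str:
        for short, full in ABBREVIATIONS.items():
            sentence = sentence.replace(short, full)
        return sentence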
According to one embodiment of the present invention, screening data with LDA is implemented as follows:
1. The data set to be screened is prepared with one document per line, which makes traversal easy.
2. About 800 to 1000 well-formed sentences that meet the selected topic requirements are manually selected from it.
3. A dictionary and an index are built for every word of the selected text.
4. Each text is vectorized with a bag-of-words model, a document representation commonly used in information retrieval. The model treats a document only as a collection of words, each word's occurrence in the document being independent of the occurrence of any other word; that is, whichever word appears at any position of the document is chosen independently, uninfluenced by the document's semantics. The vectorized data is then used to compute a word-frequency matrix, i.e., the document-term (DT) matrix (a gensim-based sketch of steps 3 to 8 is given after this list).
5. The number of topics into which the documents are to be classified is set, and the DT matrix is used to train the first LDA model: the parameters of the topic distribution are first drawn from a Dirichlet distribution, the topic distribution of a text is randomly generated, and a topic is randomly generated at each position of the text according to that distribution; the parameters of the word distribution are then drawn from a Dirichlet distribution to obtain the word distribution of each topic, and a word is randomly generated at each position according to its topic's word distribution, until the last position of the text is reached and the whole text is generated; finally, the process is repeated to generate all texts.
6. The trained first LDA model performs topic judgment on the unselected texts, giving the probability that each text belongs to each topic. If, under the first LDA's judgment, the highest probability of a sentence belonging to some topic exceeds a set threshold, the text is selected.
7. A new data set is formed from the previously hand-picked texts and the texts selected by the first LDA model, and the second LDA model is retrained on it.
8. For all the remaining texts, cosine similarity detection is performed with the second LDA model against the selected corpus; if the similarity between a text and the most similar selected text exceeds a set threshold, the text is selected.
9. In this way, the required text data is screened according to the chosen classification standard through three selections in total: manual selection, first-LDA topic selection, and cosine-similarity selection.
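The following is a minimal illustrative sketch (not taken from the original disclosure) of steps 3 to 8 above using gensim's public API; the variable names, the placeholder corpora, and the 0.3 threshold are assumptions, since the patent leaves them unspecified:

    from gensim import corpora, models

    # Placeholder corpora: one document per line (steps 1-2 of this list).
    selected_documents = ["an example sentence kept by manual screening ..."]
    remaining_documents = ["an example sentence not yet screened ..."]

    # Step 3: dictionary and index for every word of the selected text.
    texts = [doc.split() for doc in selected_documents]
    dictionary = corpora.Dictionary(texts)

    # Step 4: bag-of-words vectorization -> word-frequency (DT) matrix.
    dt_matrix = [dictionary.doc2bow(text) for text in texts]

    # Step 5: train the first LDA model with a chosen number of topics.
    lda = models.LdaModel(dt_matrix, id2word=dictionary, num_topics=7)

    # Step 6: keep a sentence when its most probable topic passes a threshold.
    TOPIC_THRESHOLD = 0.3  # assumed value; the patent only says "a set threshold"
    screened = []
    for doc in remaining_documents:
        bow = dictionary.doc2bow(doc.split())
        topics = lda.get_document_topics(bow)
        if topics:
            best_topic, best_prob = max(topics, key=lambda tp: tp[1])
            if best_prob > TOPIC_THRESHOLD:
                screened.append(doc)

    # Steps 7-8: retrain a second LDA model on the enlarged corpus.
    enlarged = [dictionary.doc2bow(doc.split())
                for doc in selected_documents + screened]
    lda2 = models.LdaModel(enlarged, id2word=dictionary, num_topics=7)

The cosine-similarity stage of step 8 is sketched after the code-logic list below.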
According to an embodiment of the present invention, the specific code processing logic of the proposed method is as follows:
The programming language used in the invention is Python; the nltk and gensim modules implement the main functions.
(1) The raw json data obtained by the crawler is converted into a list, turning it into a data structure that can be processed.
(2) Data cleaning: several regular-expression functions are first written to remove garbled characters, non-English words, abbreviations and the like from the text sentences; a common stop-word list is then loaded into a list, and every word of every text is traversed and replaced with the replace() function, finally yielding cleaner data.
(3) The topic to be screened is determined.
(4) From this huge data set, 800 pieces of high-quality text meeting the topic requirements are manually selected.
(5) This yields data suitable for screening with the LDA model. First, each line of the data is treated as a document string and loaded into a list, forming one huge document list.
(6) Dictionary () function of the gensim module is used to create a dictionary of words of a corpus, ranging from all the words that appear in a document, each individual word being assigned an index.
(7) Doc2bow () traversal is used to turn the document list into a word vector matrix, otherwise known as a DT matrix.
(8) The LDA model is initialized using genim, models, ldamodel, and assigned as DT matrix, and the number of topic types to be classified, preferably, the number of types is 7 (the experimental results show that the classification of 7 has the most obvious distinguishing effect).
(9) The texts other than the 800 manually screened ones are converted into word vectors in the same way, and the LdaModel is used to judge the probability of each text belonging to each topic; a text is selected if the probability of its most likely topic reaches the threshold.
(10) Through step (9), a data set containing more sentences is obtained. A second LDA model is retrained on this data set.
(11) The similarities.MatrixSimilarity() function is used to convert the query corpus into the LDA vector space and to build an index for each document/sentence in it.
(12) Using:

    sims = index[lda[word_vector]]  # word_vector: BoW vector of an arbitrary text
    result = [(dt_matrix[i[0]], i[1]) for i in enumerate(sims)]

the cosine similarity between the word vector corresponding to any text in the document list and the most similar word vector in the DT matrix can be obtained; if that value is greater than the threshold, the text data is selected.
(13) In summary, there are three screenings: the first is manual, the second uses first-LDA topic screening, and the third uses second-LDA vector-space topic-similarity detection. This layered, progressive screening uses a small amount of manual selection plus the features learned by the LDA models to improve screening quality while maintaining screening speed. A consolidated sketch of the similarity screening of steps (11)-(12) follows.
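Continuing the names from the earlier sketch, the cosine-similarity screening of steps (11)-(12) might look as follows; similarities.MatrixSimilarity is gensim's documented API, while the 0.5 threshold is an assumed value:

    from gensim import similarities

    # Index the selected corpus in the (second) LDA topic space.
    index = similarities.MatrixSimilarity(lda2[enlarged],
                                          num_features=lda2.num_topics)

    SIM_THRESHOLD = 0.5  # assumed value; the patent only requires "a set threshold"
    final_selection = []
    for doc in remaining_documents:
        bow = dictionary.doc2bow(doc.split())
        sims = index[lda2[bow]]  # cosine similarity to every selected text
        result = [(i, sim) for i, sim in enumerate(sims)]
        best_idx, best_sim = max(result, key=lambda pair: pair[1])
        if best_sim > SIM_THRESHOLD:
            final_selection.append(doc)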
Portions of the invention not described in detail are well within the skill of the art.
Although illustrative embodiments of the present invention have been described above to help those skilled in the art understand it, it should be understood that the invention is not limited in scope to those embodiments. To those skilled in the art, various changes will be apparent so long as they remain within the spirit and scope of the invention as defined and determined by the appended claims, and all inventive creations using the inventive concept are protected.

Claims (10)

1. A text classification screening method using LDA is characterized by comprising the following steps:
(1) acquiring a data set whose content comprises a plurality of short sentences;
(2) preprocessing the data with natural language processing methods, cleaning and sorting it;
(3) determining a topic, and manually selecting a plurality of text sentences that match the topic;
(4) building the corresponding text vector matrix from the selected text sentences with a bag-of-words model;
(5) training a first LDA model with the vector matrix;
(6) screening the remaining sentences in the text with the first LDA model: computing the correlation between the text set and the topic words produced by the first LDA topic calculation, and using that correlation against a threshold to evaluate whether a sentence matches the selected topic model;
(7) adding the texts passed by the topic-relevance screen, and training a second LDA model;
(8) judging and screening the remaining sentences in the text by cosine similarity with the second LDA model;
(9) taking the sentences kept by the three screenings in total (manual screening, topic-similarity screening, and cosine-similarity screening) as the text data matching the screening target.
2. The method as claimed in claim 1, wherein in step 2 the preprocessing of the data comprises:
selecting sentences longer than 10 words; removing punctuation marks, garbled characters, and all characters other than English letters and digits; repairing grammatical problems, word misspellings, and colloquial vocabulary; repairing space and indentation problems; and repairing abnormal characters; the cleaning and sorting comprising rough cleaning with a bag-of-words model and selecting the text sentences with high topic weight.
3. The method as claimed in claim 1, wherein in step 3 the manual selection of text sentences matching the topic comprises:
keeping only one copy of repeated sentences, two sentences describing the same thing being considered repetitive when half of their words are the same;
expanding abbreviations and shorthand content, spoken-language expression producing abbreviated forms of some representations that require manual discovery and replacement.
4. The method as claimed in claim 1, wherein in step 3 the data set to be screened is prepared with one document per line, 800 to 1000 sentences meeting the requirements of the selected topic are manually selected from it, and a dictionary and an index are built for every word of the selected text.
5. The method as claimed in claim 1, wherein in step 4 each text is vectorized with a bag-of-words model that treats it only as a collection of words, each word of the text being independent of the occurrence of the other words, and the vectorized data is then used to compute a word-frequency matrix, i.e., the document-term (DT) matrix.
6. The method as claimed in claim 1, wherein in step 5 the number of topics into which the documents are to be classified is set and the DT matrix is used to train the first LDA model: the parameters of the topic distribution are first drawn from a Dirichlet distribution, the topic distribution of a text is randomly generated, and a topic is randomly generated at each position of the text according to that topic distribution; the parameters of the word distribution are then drawn from a Dirichlet distribution to obtain the word distribution of each topic, and a word is randomly generated at each position according to its topic's word distribution, until the last position of the text is reached and the whole text is generated; finally, the process is repeated to generate all texts.
7. The method as claimed in claim 1, wherein in step 6 the trained first LDA model performs topic judgment on the unselected texts, giving the probability that each text belongs to each topic; and if, under the LDA judgment, the highest probability of a sentence belonging to some topic exceeds a set threshold, the text is selected.
8. The method as claimed in claim 1, wherein in step 7 a new data set is formed from the previously manually selected texts and the texts selected by the first LDA model, and the second LDA model is retrained on it.
9. The method as claimed in claim 1, wherein in step 8 cosine similarity detection is performed on all the remaining texts with the second LDA model and the previously selected corpus, and if the highest similarity value between a text and any selected text exceeds a preset threshold, the text is selected.
10. The method as claimed in claim 1, wherein in step 9 the required text data is screened according to the chosen classification standard by three selections in total: manual selection, LDA topic selection, and cosine-similarity selection.
CN202011123125.5A 2020-10-20 2020-10-20 Text classification screening method using LDA Pending CN112667806A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011123125.5A CN112667806A (en) 2020-10-20 2020-10-20 Text classification screening method using LDA


Publications (1)

Publication Number Publication Date
CN112667806A (en) 2021-04-16

Family

ID=75403286

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011123125.5A Pending CN112667806A (en) 2020-10-20 2020-10-20 Text classification screening method using LDA

Country Status (1)

Country Link
CN (1) CN112667806A (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180032600A1 (en) * 2016-08-01 2018-02-01 International Business Machines Corporation Phenomenological semantic distance from latent dirichlet allocations (lda) classification
CN106844424A (en) * 2016-12-09 2017-06-13 宁波大学 A kind of file classification method based on LDA
CN107122349A (en) * 2017-04-24 2017-09-01 无锡中科富农物联科技有限公司 A kind of feature word of text extracting method based on word2vec LDA models
CN107291688A (en) * 2017-05-22 2017-10-24 南京大学 Judgement document's similarity analysis method based on topic model
CN107609121A (en) * 2017-09-14 2018-01-19 深圳市玛腾科技有限公司 Newsletter archive sorting technique based on LDA and word2vec algorithms
CN108280164A (en) * 2018-01-18 2018-07-13 武汉大学 A kind of short text filtering and sorting technique based on classification related words
CN108664633A (en) * 2018-05-15 2018-10-16 南京大学 A method of carrying out text classification using diversified text feature
CN110895656A (en) * 2018-09-13 2020-03-20 武汉斗鱼网络科技有限公司 Text similarity calculation method and device, electronic equipment and storage medium
CN109376347A (en) * 2018-10-16 2019-02-22 北京信息科技大学 A kind of HSK composition generation method based on topic model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MIHA PAVLINEK ET AL.: "Text classification method based on self-training and LDA topic models", Expert Systems with Applications, vol. 80, 8 March 2017 (2017-03-08), pages 83 - 93, XP029974861, DOI: 10.1016/j.eswa.2017.03.020 *
杨瑞欣: "Research on LDA short-text clustering algorithms for Weibo comments" (面向微博评论的LDA短文本聚类算法研究), China Masters' Theses Full-text Database, Information Science and Technology, no. 8, 15 August 2020 (2020-08-15), pages 138 - 843 *
王胜 et al.: "A domain label acquisition method based on SL-LDA" (基于SL-LDA的领域标签获取方法), Computer Science (计算机科学), vol. 47, no. 11, 21 July 2020 (2020-07-21), pages 95 - 100 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113887584A (en) * 2021-09-16 2022-01-04 同济大学 Emergency traffic strategy evaluation method based on social media data
CN113887584B (en) * 2021-09-16 2022-07-05 同济大学 Emergency traffic strategy evaluation method based on social media data
CN116307792A (en) * 2022-10-12 2023-06-23 广州市阿尔法软件信息技术有限公司 Urban physical examination subject scene-oriented evaluation method and device
CN116307792B (en) * 2022-10-12 2024-03-12 广州市阿尔法软件信息技术有限公司 Urban physical examination subject scene-oriented evaluation method and device
CN115658866A (en) * 2022-10-27 2023-01-31 国网山东省电力公司烟台供电公司 Text continuous writing method capable of self-adaptive input, storage medium and device
CN115658866B (en) * 2022-10-27 2024-03-12 国网山东省电力公司烟台供电公司 Text renewing method capable of self-adaptively inputting, storage medium and device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination