CN109299357B - Laos language text subject classification method - Google Patents


Info

Publication number
CN109299357B
CN109299357B (application CN201811017181.3A)
Authority
CN
China
Prior art keywords
words
text
model
list
laos
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811017181.3A
Other languages
Chinese (zh)
Other versions
CN109299357A (en)
Inventor
周兰江
王兴金
张建安
周枫
Current Assignee
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN201811017181.3A priority Critical patent/CN109299357B/en
Publication of CN109299357A publication Critical patent/CN109299357A/en
Application granted granted Critical
Publication of CN109299357B publication Critical patent/CN109299357B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention discloses a method for classifying the topics of Laos language texts, belonging to the technical fields of natural language processing and machine learning. The invention combines an N-gram language feature extraction model with a naive Bayes mathematical model to realize topic recognition for Laos articles, eliminating the limitations of naive Bayes to a certain extent. Although the method retains the conditional-independence assumption and treats the text as a bag-of-words model without considering the order of words, it improves the recognition rate of texts by using unigram and bigram feature models simultaneously.

Description

Laos language text subject classification method
Technical Field
The invention relates to a method for classifying the topics of Laos language texts, and belongs to the technical fields of natural language processing and machine learning.
Background
As networks have grown in popularity, the information on them has grown exponentially. When a user searches for information with a search engine, thousands of related pages are often returned; how can the user locate the desired information quickly and efficiently without viewing the pages one by one? Topic identification plays an important role here: using a classifier trained in advance, it can locate the topic of the content the user wants from the limited information the user inputs and respond effectively. The naive Bayes classification model has a long history and a solid theoretical basis; it is a direct and efficient method for many problems, and many advanced natural-language-processing models have evolved from it. It converts the probability of belonging to a class given certain features into quantities that can be computed from a corpus, i.e., the posterior probability is calculated from the prior probability and the likelihood. However, it has a disadvantage: it treats all feature attributes as conditionally independent, which is equivalent to putting the text features into a bag of words without considering the order in which words appear; this often discards information and can cause the text to be misinterpreted.
Disclosure of Invention
The invention provides a method for classifying the topics of Laos texts, used to identify the topics of Laos articles of unknown category.
The technical scheme of the invention is as follows: a Laos text subject classification method comprises the following specific steps:
Step 1, crawl Laos texts using web-crawler technology, collecting five categories of texts: travel, economy, politics, education, and others; create five folders named after the categories to store the texts of each category, then preprocess the Laos texts to remove interference words unrelated to the categories, thereby constructing a corpus;
Step 2, traverse the corpus using Python: traverse the five folders created above, take the folder names as labels and the texts in the folders as the data to be trained, store the labels and the data in two lists named class_list and data_list respectively, and convert the two lists into the tuple form (class_list, data_list);
Step 3, select the fitting function of the CountVectorizer method in the sklearn module under N-gram mode, generate a bag-of-words CountVectorizer model from the texts in data_list, and convert the texts in data_list into text-vector form (a, b, c) through this bag-of-words model; create a new tuple storing class_list and the processed data_list. Here a represents the number of selected feature words, b is the index corresponding to each feature word, and c is the number of times each feature word appears at the index position given by b;
step4, dividing the tuple of Step3 into a training set and a test set by adopting a train _ test _ split function provided by Sklearn, and inputting the training set into a naive Bayes model for training; and calculating the accuracy of prediction by using a score function in a naive Bayesian model provided by sklern in the test set, and adjusting the number of the selected high-frequency words when the fitting function is adopted in the Step3 according to the result of the accuracy so as to obtain the trained classifier of the Laos language chapter topic identification model.
The interference words comprise emoticons, numbers, spaces, and stop words; the emoticons, numbers, and spaces are removed with a regular expression, and the stop words are removed with a stop-word list.
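A minimal sketch of this preprocessing in Python, assuming a hypothetical stop-word list and space-segmented input (the patent does not publish its actual Lao stop-word list):

```python
import re

# Hypothetical stop-word list; the patent's actual Lao stop-word list is not published.
STOP_WORDS = {"ຂອງ", "ແລະ", "ໃນ"}

def preprocess(text):
    # Drop characters above U+10FF (emoji, symbols), mirroring the embodiment's
    # character-range regex u"[\u0000-\u10ff]+$"; the Lao block (U+0E80-U+0EFF) is kept.
    text = "".join(ch for ch in text if ch <= "\u10ff")
    # Remove numbers, then collapse runs of whitespace.
    text = re.sub(r"[0-9]+", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    # Remove words that appear in the stop-word list.
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

print(preprocess("ຂ່າວ 45 ຂອງ ມື້ນີ້ 😀"))  # ຂ່າວ ມື້ນີ້
```

The sketch assumes the input is already segmented into space-separated words; the patent itself does not describe the segmentation step.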
The N-gram mode adopts a combination mode of unigram and bigram.
The naive Bayes model is implemented using the MultinomialNB algorithm provided by sklearn.
The invention has the beneficial effects that:
1. The invention combines an N-gram language feature extraction model with a naive Bayes mathematical model to realize topic recognition for Laos articles, eliminating the limitations of naive Bayes to a certain extent. Although the method retains the conditional-independence assumption and treats the text as a bag-of-words model without considering the order of words, it improves the recognition rate of texts by using unigram and bigram feature models simultaneously.
2. When selecting the number of feature values, an iterative loop is run and the value that yields the highest classification accuracy, 800, is chosen, maximizing the classification accuracy.
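The iterative selection can be sketched as follows, with a toy stand-in corpus (the real search runs over the crawled Lao texts and, per the patent, settles on 800 features; the candidate values here are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in corpus; two classes, repeated to give enough samples.
texts = ["aa bb cc", "bb cc dd", "aa dd ee", "cc ee ff", "aa bb ff", "dd ee ff"] * 10
labels = ["x", "y", "x", "y", "x", "y"] * 10

best_n, best_acc = None, 0.0
for n_features in (200, 400, 600, 800, 1000):
    vec = CountVectorizer(ngram_range=(1, 2), max_features=n_features)
    X = vec.fit_transform(texts)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, labels, test_size=0.15, random_state=0)
    # Laplace smoothing via alpha=1.0, matching the patent's lambda = 1.
    acc = MultinomialNB(alpha=1.0).fit(X_tr, y_tr).score(X_te, y_te)
    if acc > best_acc:
        best_n, best_acc = n_features, acc
print(best_n, best_acc)
```

On the toy data all candidate sizes exceed the vocabulary, so the loop is only a shape of the search; on the real corpus the cap is binding and the accuracies differ.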
3. The invention selects the multinomial model of naive Bayes, which accounts for repeated words. When the multinomial model probability is calculated as in equation (1), Laplace smoothing is added, i.e., the λ value is set to 1, avoiding unreasonable zero-probability events. If the training sample is large, the change in the estimated probability caused by adding 1 to each count when computing the conditional probability is negligible, but the method conveniently and effectively avoids the zero-probability problem.
4. The method collects Laos stop words and removes words that occur frequently but are irrelevant to judging the text category. This reduces the time spent training the model and classifying, increases speed, and improves the text-classification effect.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is an illustration of a CountVectorizer.
Detailed Description
The invention will be further illustrated below with reference to the figures and an embodiment, without limiting the scope of the invention thereto.
Example 1: as shown in FIGS. 1-2, a method for classifying Laos text topics comprises the following steps. Step 1: crawl Laos texts using web-crawler technology; the crawled texts fall into five categories: economy, politics, education, tourism, and others. Store the documents in five corresponding folders, each named after its category to facilitate later retrieval and processing, then preprocess the crawled articles to remove interference words irrelevant to classification, thereby constructing the corpus. The interference words include emoticons, numbers, spaces, and stop words; the emoticons, numbers, and spaces are removed with a regular expression, and the stop words are removed with a stop-word list (i.e., words appearing in the stop-word list are deleted). The regular expression used when removing interference words is u"[\u0000-\u10ff]+$"; it matches each character of the text and can exclude characters outside the range of Laos text. Table 1 below shows how many Laos texts were crawled and the number in each category.
Table 1:
Category    Number of texts
Economy     430
Politics    731
Education   197
Tourism     145
Others       28
Step 2: traverse the corpus using Python. The five folders created above are traversed; the folder names are the labels and the texts in the folders are the data to be trained. The labels are stored in a list named class_list and the texts in a list named data_list; after processing, the two lists are converted into tuple form to match the input type of the next module. Table 2 below illustrates the storage form of part of data_list and class_list:
table 2:
[Table 2 is reproduced as images in the original patent; it shows sample Lao texts stored in data_list alongside their category labels in class_list.]
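The Step 2 traversal can be sketched as follows, assuming the hypothetical folder layout of Step 1 (one folder per category, one UTF-8 text file per article):

```python
import os

def load_corpus(root):
    # Folder name = label, file content = text to be trained.
    class_list, data_list = [], []
    for category in sorted(os.listdir(root)):
        folder = os.path.join(root, category)
        if not os.path.isdir(folder):
            continue
        for fname in sorted(os.listdir(folder)):
            with open(os.path.join(folder, fname), encoding="utf-8") as f:
                class_list.append(category)
                data_list.append(f.read())
    # Tuple form handed to the next module, as in the patent's Step 2.
    return (class_list, data_list)
```

With a root containing e.g. economy/ and politics/ folders, load_corpus returns the paired label and text lists in matching order.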
Step 3: before the texts are input into the naive Bayes model for probability calculation, they must be converted into vector form so the model can recognize them. The CountVectorizer feature-extraction method in the sklearn module is chosen here. CountVectorizer converts a text into a vector by counting, and it provides an N-gram mode with many settings; here unigram combined with bigram is chosen. Unigram builds the vector by treating each word of the text as occurring singly, while bigram builds word vectors from combinations of two adjacent words. Combining the two ways considers not only the effect of single words on the classifier but also the effect of combined words on text classification.
Before conversion, the fitting function of the CountVectorizer is used. It selects the top high-frequency words according to the word-frequency ordering of the input training set and generates a CountVectorizer model similar to a bag-of-words model; the model uses the indexes of the high-frequency words as the mapping for text-to-vector conversion. The number of selected high-frequency words can be adjusted according to the accuracy computed at Step 5; over multiple runs, accuracy was relatively high when 800 features were selected. The CountVectorizer also generates text vectors of the form (a, b, c): a represents the number of selected feature words, b is the index corresponding to each feature word, and c is the number of times each feature word appears at the index position given by b.
The CountVectorizer also provides a transform method, which converts subsequently input texts into text vectors according to the previously trained bag-of-words CountVectorizer model; these text vectors are the input to the naive Bayes model. The texts in data_list are fed into the CountVectorizer to obtain the bag-of-words model, which can then also convert any text to be predicted later by the naive Bayes model into a word vector. FIG. 2 illustrates the CountVectorizer: Array() on the left of the figure represents the input texts, and the right side represents the word vectors after CountVectorizer processing. Each text in data_list is converted into a corresponding word vector using the bag-of-words model, so that all text data become data the naive Bayes model can recognize. After Step 3, the texts of Table 2 become Table 3 below:
table 3:
[Table 3 is reproduced as images in the original patent; it shows the Table 2 texts converted into sparse text vectors, i.e., (index, count) entries for each feature word.]
Step 4: these data can then be input into the naive Bayes model for training. Before training, the data must be partitioned. Sklearn provides the train_test_split function, which randomly divides the processed data_list and the corresponding class_list labels into a training set (85%) and a test set (15%), stored separately. The training set is used to train the naive Bayes model and the test set to verify its accuracy. The naive Bayes model chosen here is implemented with the MultinomialNB algorithm provided by sklearn. MultinomialNB assumes the features follow a multinomial distribution, i.e., as in equation (1):
$$P(X_j = x_{jl} \mid Y = C_k) = \frac{m_{kjl} + \lambda}{m_k + n\lambda} \qquad (1)$$
where Y represents the class, C_k represents the k-th class, x_{jl} represents the l-th value taken by the word feature X_j, m_{kjl} represents the number of times that value occurs in class C_k, m_k represents the total number of words in class C_k, n represents the number of distinct words in class C_k (repeated words counted once), and λ is usually taken as 1, i.e., Laplace smoothing, which avoids zero probabilities and ensures every n-gram is treated as appearing at least once in the corpus. In the MultinomialNB algorithm repeated words are considered, that is, a repeated word is counted each time it occurs. The formula can be understood through this example, taking the calculation of P(apple | s):
$$P(\text{apple} \mid s) = \frac{\text{count of ``apple'' in class } s + \lambda}{\text{total number of words in class } s + n\lambda}$$
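Equation (1) with λ = 1 can be checked with a small helper (the counts below are toy values, not the patent's data):

```python
def cond_prob(word, class_word_counts, vocab_size, lam=1.0):
    # P(word | class) = (count of word in class + lam) / (total words in class + n*lam),
    # where n is the vocabulary size; lam=1 is Laplace smoothing.
    m_kj = class_word_counts.get(word, 0)
    m_k = sum(class_word_counts.values())
    return (m_kj + lam) / (m_k + vocab_size * lam)

# Toy class s: "apple" occurs twice among 10 words; 5 distinct words in the vocabulary.
counts = {"apple": 2, "pear": 3, "plum": 5}
print(cond_prob("apple", counts, vocab_size=5))   # (2+1)/(10+5) = 0.2
print(cond_prob("unseen", counts, vocab_size=5))  # (0+1)/(10+5): nonzero thanks to smoothing
```

The second call shows the point of the smoothing: a word never seen in the class still gets a small positive probability instead of zero.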
Step 5: the score function of the naive Bayes model provided by sklearn is used to calculate the prediction accuracy. The test set is input to the score function, which predicts the categories of the test set with the MultinomialNB model and then computes the accuracy from the predicted and true categories, yielding the trained classifier of the Laos text topic recognition model.
Step 6: when the Laos text topic recognition classifier is used for prediction, the predict function provided by sklearn's naive Bayes model can judge the type of a text, predicting according to the MultinomialNB model trained in Step 5.
Before a text to be predicted is input into the model, it must also be converted into a text vector: the feature extractor CountVectorizer maps the text into the text-vector space to generate its vector. The text vector is then input into the predict function of naive Bayes to predict which class it belongs to.
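Steps 5-6 can be sketched together: a hypothetical toy classifier is trained, then a new text is mapped with transform() into the same vector space before predict() (toy English data stand in for the Lao corpus):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data standing in for the Lao corpus.
train_docs = ["market price trade", "school teacher exam", "market trade export"]
train_labels = ["economy", "education", "economy"]

vec = CountVectorizer(ngram_range=(1, 2))
clf = MultinomialNB(alpha=1.0).fit(vec.fit_transform(train_docs), train_labels)

# New text: use transform(), not fit_transform(), to reuse the trained vocabulary.
new_vec = vec.transform(["trade price report"])
print(clf.predict(new_vec))  # ['economy']
```

Calling fit_transform() on the new text instead would build a fresh vocabulary and break the correspondence between feature indexes and the trained model.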
When a text vector is input into the classifier, the classifier computes the conditional probability of the vector under each of the five categories: tourism, economy, education, politics, and others, as in equation (2).
$$P(Y_i \mid X) = \frac{P(X \mid Y_i)\,P(Y_i)}{P(X)} \qquad (2)$$
In the example, the formula can be instantiated to calculate the probability that Text1 belongs to the economy category:
$$P(\text{economy} \mid \text{Text1}) = \frac{P(\text{Text1} \mid \text{economy})\,P(\text{economy})}{P(\text{Text1})}$$
Here X is the text vector to be classified and Y_i is a category. The class-conditional probability P(X | Y_i) is a joint probability over all attributes, and since the training samples are limited, naive Bayes adopts the attribute conditional-independence assumption, so equation (2) can be rewritten as equation (3):
$$P(Y_i \mid X) = \frac{P(Y_i)}{P(X)} \prod_{j=1}^{d} P(x_j \mid Y_i) \qquad (3)$$
where d is the number of attributes and x_j is the value of X on the j-th attribute. In the example the formula can be understood as follows (x_j represents the j-th word in Text1):
$$P(\text{economy} \mid \text{Text1}) = \frac{P(\text{economy})}{P(\text{Text1})} \prod_{j=1}^{d} P(x_j \mid \text{economy})$$
while calculating P (x)jIf | c) is carried out using the formula (1). Since the conditional probabilities for each class are calculated as they have the same denominator, it is sufficient to calculate the numerator, i.e., equation (4).
$$P(Y_i)\prod_{j=1}^{d} P(x_j \mid Y_i) \qquad (4)$$
The value of equation (4) is calculated for tourism, economy, education, politics, and others, and the category with the highest value is chosen.
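The comparison across categories can be sketched numerically, in log space to avoid underflow (the priors and likelihoods below are hypothetical):

```python
import math

# Hypothetical smoothed values for a two-word text under two of the five classes.
priors = {"economy": 0.4, "tourism": 0.6}
likelihoods = {
    "economy": [0.05, 0.02],  # P(x_j | economy) for each word, from equation (1)
    "tourism": [0.01, 0.01],
}

# Equation (4): P(Y_i) * prod_j P(x_j | Y_i); compare via log sums.
scores = {c: math.log(priors[c]) + sum(math.log(p) for p in likelihoods[c])
          for c in priors}
print(max(scores, key=scores.get))  # economy: 0.4*0.05*0.02 = 4e-4 beats 0.6*1e-4 = 6e-5
```

Working in log space changes nothing about the ranking, since the logarithm is monotonic; it only prevents the product of many small probabilities from underflowing to zero.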
While the present invention has been described in detail with reference to the embodiments shown in the drawings, it is not limited to these embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from its spirit.

Claims (2)

1. A Laos language text subject classification method is characterized in that: the method comprises the following specific steps:
Step 1, crawl Laos texts using web-crawler technology, collecting five categories of texts: tourism, economy, politics, education, and others; create five folders named after the categories to store the texts of each category, then preprocess the Laos texts to remove interference words irrelevant to classification, thereby constructing a corpus;
Step 2, traverse the corpus using Python: traverse the five folders created above, take the folder names as labels and the texts in the folders as the data to be trained, store the labels and the data in two lists named class_list and data_list respectively, and after processing convert the two lists into the tuple form (class_list, data_list);
Step 3, select the fitting function of the CountVectorizer method in the sklearn module under N-gram mode, generate a bag-of-words CountVectorizer model from the texts in data_list, and convert the texts in data_list into text-vector form (a, b, c) through this bag-of-words model; create a new tuple storing class_list and the processed data_list, where a represents the number of selected feature words, b is the index corresponding to each feature word, and c is the number of times each feature word appears at the index position given by b;
Step 4, divide the tuple of Step 3 into a training set and a test set with the train_test_split function provided by sklearn, and input the training set into the naive Bayes model for training; on the test set, calculate the prediction accuracy using the score function of the naive Bayes model provided by sklearn, and adjust the number of high-frequency words selected by the fitting function in Step 3 according to the accuracy, thereby obtaining the trained classifier of the Laos text topic recognition model;
the interference words comprise emoticons, numbers, spaces and stop words; wherein, the expression symbols, the numbers and the spaces are removed by adopting a regular expression, and the stop words are removed by adopting a stop word list;
the naive bayes model is implemented using the MultinomialNB algorithm provided by sklern:
$$P(X_j = x_{jl} \mid Y = C_k) = \frac{m_{kjl} + \lambda}{m_k + n\lambda}$$
where Y represents the class, C_k represents the k-th class, x_{jl} represents the l-th value taken by the word feature X_j, m_{kjl} represents the number of times that value occurs in class C_k, m_k represents the total number of words in class C_k, n represents the number of distinct (non-repeated) words in class C_k, and λ is taken as 1, i.e., Laplace smoothing.
2. The method for classifying Laos text topics as claimed in claim 1, wherein: the N-gram mode adopts a combination mode of unigram and bigram.
CN201811017181.3A 2018-08-31 2018-08-31 Laos language text subject classification method Active CN109299357B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811017181.3A CN109299357B (en) 2018-08-31 2018-08-31 Laos language text subject classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811017181.3A CN109299357B (en) 2018-08-31 2018-08-31 Laos language text subject classification method

Publications (2)

Publication Number Publication Date
CN109299357A CN109299357A (en) 2019-02-01
CN109299357B true CN109299357B (en) 2022-04-12

Family

ID=65165992

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811017181.3A Active CN109299357B (en) 2018-08-31 2018-08-31 Laos language text subject classification method

Country Status (1)

Country Link
CN (1) CN109299357B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110083824A (en) * 2019-03-18 2019-08-02 昆明理工大学 A kind of Laotian segmenting method based on Multi-Model Combination neural network
CN111177724A (en) * 2019-12-12 2020-05-19 河北师范大学 Automatic detection method for polymorphic worm virus
CN112084308A (en) * 2020-09-16 2020-12-15 中国信息通信研究院 Method, system and storage medium for text type data recognition
CN113869356A (en) * 2021-08-17 2021-12-31 杭州华亭科技有限公司 Method for judging escape tendency of people based on Bayesian classification

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105843868A (en) * 2016-03-17 2016-08-10 浙江大学 Medial case searching method based on language model
CN106886578A (en) * 2017-01-23 2017-06-23 武汉翼海云峰科技有限公司 A kind of data row mapping method and system
CN107423371A (en) * 2017-07-03 2017-12-01 湖北师范大学 A kind of positive and negative class sensibility classification method of text
CN107506472A (en) * 2017-09-05 2017-12-22 淮阴工学院 A kind of student browses Web page classification method
CN108199951A (en) * 2018-01-04 2018-06-22 焦点科技股份有限公司 A kind of rubbish mail filtering method based on more algorithm fusion models

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7260568B2 (en) * 2004-04-15 2007-08-21 Microsoft Corporation Verifying relevance between keywords and web site contents


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
刘伟朋 et al., "Multi-dimensional sentiment classification of Chinese microblogs based on emoticons," Journal of Hefei University of Technology (Natural Science Edition), 2014, vol. 37, no. 7 *
赵建明 et al., "Machine-learning-based style recognition of Song ci poetry," Computer Engineering and Applications, January 2018, vol. 54, no. 1, pp. 186-190 *
刘伟朋 et al., "Multi-dimensional sentiment classification of Chinese microblogs based on emoticons," Journal of Hefei University of Technology (Natural Science Edition), July 2014, vol. 37, no. 7, pp. 803-807 *

Also Published As

Publication number Publication date
CN109299357A (en) 2019-02-01

Similar Documents

Publication Publication Date Title
CN109299357B (en) Laos language text subject classification method
CN109829104B (en) Semantic similarity based pseudo-correlation feedback model information retrieval method and system
CN106897371B (en) Chinese text classification system and method
CN101561805B (en) Document classifier generation method and system
US20150074112A1 (en) Multimedia Question Answering System and Method
CN106599054B (en) Method and system for classifying and pushing questions
US8150822B2 (en) On-line iterative multistage search engine with text categorization and supervised learning
CN101751455B (en) Method for automatically generating title by adopting artificial intelligence technology
CN107608960B (en) Method and device for linking named entities
CN112463971B (en) E-commerce commodity classification method and system based on hierarchical combination model
US11210555B2 (en) High-dimensional image feature matching method and device
CN109359302B (en) Optimization method of domain word vectors and fusion ordering method based on optimization method
CN104199965A (en) Semantic information retrieval method
CN107291895B (en) Quick hierarchical document query method
CN110909116B (en) Entity set expansion method and system for social media
CN108595546B (en) Semi-supervision-based cross-media feature learning retrieval method
CN110866102A (en) Search processing method
CN114357120A (en) Non-supervision type retrieval method, system and medium based on FAQ
Michelson et al. Unsupervised information extraction from unstructured, ungrammatical data sources on the world wide web
CN112711944B (en) Word segmentation method and system, and word segmentation device generation method and system
CN111079011A (en) Deep learning-based information recommendation method
CN111597400A (en) Computer retrieval system and method based on way-finding algorithm
CN109344319B (en) Online content popularity prediction method based on ensemble learning
CN117057346A (en) Domain keyword extraction method based on weighted textRank and K-means
CN114510559B (en) Commodity retrieval method based on deep learning semantic implication and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant